# Modeling Traffic Patterns for Given Dimensions

This notebook summarizes traffic counts along several dimensions to answer questions such as:
* How many cars travel this road on a given day?
* Which days saw the highest traffic?
* Which were the top 10 game days with the highest traffic?
* What is the average traffic on a game day vs. a non-game day?

The sample data contains:
* Dimensions = Date, Game Day, Opponent, Win/Loss...
* Metrics = Traffic count

__Learning Objectives:__

* Summarize data along dimensions using Spark's PairRDD
* Learn the RDD and PairRDD operations used along the way, such as map, reduceByKey, sortBy, leftOuterJoin, and combineByKey

### Load the data and get a quick sense of it

<font color="red">TODO: Configure the file paths</font>

In [2]:
trafficPath = "D:\\Tirthal-LABs\\xLocal-Git-Repo\\Learning-BigData\\gs-spark\\gs-spark-python\\notebooks\\03\\Dodgers.data"
gamesPath = "D:\\Tirthal-LABs\\xLocal-Git-Repo\\Learning-BigData\\gs-spark\\gs-spark-python\\notebooks\\03\\Dodgers.events"

In [3]:
traffic = sc.textFile(trafficPath)
traffic.take(10)

['4/10/2005 0:00,-1',
 '4/10/2005 0:05,-1',
 '4/10/2005 0:10,-1',
 '4/10/2005 0:15,-1',
 '4/10/2005 0:20,-1',
 '4/10/2005 0:25,-1',
 '4/10/2005 0:30,-1',
 '4/10/2005 0:35,-1',
 '4/10/2005 0:40,-1',
 '4/10/2005 0:45,-1']

__Traffic data analysis__ <br/>
Column 1 = 5-minute slice of time <br/>
Column 2 = Number of cars that passed by during that 5-minute slice

In [4]:
games = sc.textFile(gamesPath)
games.take(10)

['04/12/05,13:10:00,16:23:00,55892,San Francisco,W 9-8',
 '04/13/05,19:10:00,21:48:00,46514,San Francisco,W 4-1',
 '04/15/05,19:40:00,21:48:00,51816,San Diego,W 4-0',
 '04/16/05,19:10:00,21:52:00,54704,San Diego,W 8-3',
 '04/17/05,13:10:00,15:31:00,53402,San Diego,W 6-0',
 '04/25/05,19:10:00,21:33:00,36876,Arizona,L 4-2',
 '04/26/05,19:10:00,22:00:00,44486,Arizona,L 3-2',
 '04/27/05,19:10:00,22:17:00,54387,Arizona,L 6-3',
 '04/29/05,19:40:00,22:01:00,40150,Colorado,W 6-3',
 '04/30/05,19:10:00,21:45:00,54123,Colorado,W 6-2']

__Events data analysis__ <br/>
Column 1 = Date of the game <br/>
Columns 2 and 3 = Start and end time of the game <br/>
Column 4 = Audience count <br/>
Column 5 = Opponent <br/>
Column 6 = Win/Loss and score
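
As a rough sketch of how one events row breaks into fields, the snippet below parses a single sample line with Python's `csv` module. The names `GameRow` and `parse_game_line`, and the field names, are made up for illustration; they are not part of the dataset or of the notebook's own code.

```python
import csv
from collections import namedtuple
from datetime import datetime
from io import StringIO

# Hypothetical field names, chosen only for readability
GameRow = namedtuple("GameRow", "date start end attendance opponent result")

def parse_game_line(line):
    # csv.reader does the comma splitting; take the first (only) record
    fields = next(csv.reader(StringIO(line)))
    return GameRow(
        date=datetime.strptime(fields[0], "%m/%d/%y").date(),
        start=fields[1],
        end=fields[2],
        attendance=int(fields[3]),
        opponent=fields[4],
        result=fields[5].strip(),
    )

sample = "04/12/05,13:10:00,16:23:00,55892,San Francisco,W 9-8"
game = parse_game_line(sample)
print(game.opponent, game.attendance, game.date)
```

Zero-based indexing explains why the opponent is read from index 4 even though it is the fifth column.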

### Creating a PairRDD

In [5]:
from datetime import datetime 
import csv
from io import StringIO

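def parseTraffic(row):
    # Input row looks like "4/10/2005 0:00,-1": timestamp, car count
    DATE_FMT = "%m/%d/%Y %H:%M"
    row = row.split(",")
    row[0] = datetime.strptime(row[0],DATE_FMT)
    row[1] = int(row[1])
    # Return a (key, value) tuple so the result can be used as a PairRDD
    return (row[0],row[1])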

In [6]:
trafficParsed = traffic.map(parseTraffic)

In [7]:
trafficParsed.take(10)

[(datetime.datetime(2005, 4, 10, 0, 0), -1),
 (datetime.datetime(2005, 4, 10, 0, 5), -1),
 (datetime.datetime(2005, 4, 10, 0, 10), -1),
 (datetime.datetime(2005, 4, 10, 0, 15), -1),
 (datetime.datetime(2005, 4, 10, 0, 20), -1),
 (datetime.datetime(2005, 4, 10, 0, 25), -1),
 (datetime.datetime(2005, 4, 10, 0, 30), -1),
 (datetime.datetime(2005, 4, 10, 0, 35), -1),
 (datetime.datetime(2005, 4, 10, 0, 40), -1),
 (datetime.datetime(2005, 4, 10, 0, 45), -1)]

### Summarizing a Pair RDD using reduceByKey

reduceByKey = combine the values of records that share a key using a given function, e.g. sum, maximum, or minimum
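
To make the semantics concrete, here is a minimal plain-Python stand-in for what `reduceByKey` computes. It runs on a small hand-made list rather than an RDD, and `reduce_by_key` is a hypothetical helper written for this illustration, not a Spark function.

```python
def reduce_by_key(pairs, func):
    # Fold all values that share a key into one value using func
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result

pairs = [("mon", 10), ("tue", 5), ("mon", 7), ("tue", 1)]

totals = reduce_by_key(pairs, lambda x, y: x + y)  # sum per key
peaks = reduce_by_key(pairs, max)                  # maximum per key

print(totals)
print(peaks)
```

In Spark the values for a key may be merged in any order and across partitions, so the function passed to `reduceByKey` should be associative and commutative, as addition, `max`, and `min` are.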

__Computing daily traffic trends__ 

In [8]:
# Requirement = What is the traffic by date?
# Drop the time part of the timestamp so that the date becomes the key
dailyTrend = trafficParsed.map(lambda x: (x[0].date(),x[1]))\
                        .reduceByKey(lambda x,y:x+y)

In [9]:
dailyTrend.take(10)

[(datetime.date(2005, 4, 10), -288),
 (datetime.date(2005, 4, 11), 5062),
 (datetime.date(2005, 4, 14), 6423),
 (datetime.date(2005, 4, 15), 6459),
 (datetime.date(2005, 4, 16), 6002),
 (datetime.date(2005, 4, 17), 5322),
 (datetime.date(2005, 4, 18), 5600),
 (datetime.date(2005, 4, 19), 6049),
 (datetime.date(2005, 4, 21), 5977),
 (datetime.date(2005, 4, 22), 6038)]
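
The first row sums to -288, which is exactly 288 five-minute slices (24 hours × 12 per hour) of -1 each. That suggests -1 is a placeholder for a missing reading rather than a real count. This is an assumption; the notebook itself does not say so, and its results are computed with the -1 values included. Under that assumption, a plain-Python sketch of skipping such readings before summing might look like this (`daily_totals` and the sample rows are made up for illustration):

```python
from collections import defaultdict
from datetime import datetime

# Small hand-made sample in the same (timestamp, count) shape as trafficParsed
readings = [
    (datetime(2005, 4, 10, 0, 0), -1),
    (datetime(2005, 4, 10, 0, 5), -1),
    (datetime(2005, 4, 11, 0, 0), 12),
    (datetime(2005, 4, 11, 0, 5), -1),
    (datetime(2005, 4, 11, 0, 10), 9),
]

def daily_totals(rows):
    # Assumption: a negative count marks a missing reading, so skip it
    totals = defaultdict(int)
    for ts, count in rows:
        if count >= 0:
            totals[ts.date()] += count
    return dict(totals)

clean_totals = daily_totals(readings)
print(clean_totals)
```

In Spark the same idea would be a `filter(lambda x: x[1] >= 0)` on `trafficParsed` before the `map` and `reduceByKey` steps. Days with no valid readings then drop out of the result entirely.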

In [10]:
# Requirement = On which dates was traffic highest?
# Sort on the second element (-x[1] gives descending order)
dailyTrend.sortBy(lambda x:-x[1]).take(10)

[(datetime.date(2005, 7, 28), 7661),
 (datetime.date(2005, 7, 29), 7499),
 (datetime.date(2005, 8, 12), 7287),
 (datetime.date(2005, 7, 27), 7238),
 (datetime.date(2005, 9, 23), 7175),
 (datetime.date(2005, 7, 26), 7163),
 (datetime.date(2005, 5, 20), 7119),
 (datetime.date(2005, 8, 11), 7110),
 (datetime.date(2005, 9, 8), 7107),
 (datetime.date(2005, 9, 7), 7082)]
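
Sorting the whole RDD just to read off ten rows does more work than strictly needed. PySpark's `takeOrdered(num, key=...)` and `top(num, key=...)` return the smallest or largest `num` elements without a full sort. The plain-Python sketch below shows the same idea with `heapq.nlargest` on a few rows copied from the daily totals.

```python
import heapq
from datetime import date

# A few (date, total) rows copied from the daily totals
daily = [
    (date(2005, 4, 11), 5062),
    (date(2005, 7, 28), 7661),
    (date(2005, 4, 14), 6423),
    (date(2005, 7, 29), 7499),
    (date(2005, 8, 12), 7287),
]

# Keep only the 3 largest totals instead of sorting everything
top3 = heapq.nlargest(3, daily, key=lambda x: x[1])
print(top3)
```

With the RDD from this notebook, the equivalent would be along the lines of `dailyTrend.top(10, key=lambda x: x[1])`.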

### Merging Pair RDDs and Adding a Dimension to RDD using leftOuterJoin

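Two Pair RDDs can be merged based on their keys:
* join (inner join) = only keys present in both RDDs are returned, with the matching values paired together in a new Pair RDD
* leftOuterJoin = all keys from the left RDD are returned; the right value is None where there is no match
* rightOuterJoin = all keys from the right RDD are returned; the left value is None where there is no match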
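
The difference between the join types can be sketched in plain Python with dictionaries standing in for Pair RDDs. The helpers `inner_join` and `left_outer_join` are hypothetical and written only for this illustration; the sample values are taken from the traffic and games data in this notebook, with dates shortened to strings.

```python
def inner_join(left, right):
    # Keep only keys present on both sides
    return {k: (v, right[k]) for k, v in left.items() if k in right}

def left_outer_join(left, right):
    # Keep every left key; a missing right value becomes None
    return {k: (v, right.get(k)) for k, v in left.items()}

traffic_by_day = {"04/11": 5062, "04/15": 6459, "04/19": 6049}
opponent_by_day = {"04/12": "San Francisco", "04/15": "San Diego"}

inner = inner_join(traffic_by_day, opponent_by_day)
left = left_outer_join(traffic_by_day, opponent_by_day)

print(inner)
print(left)
```

One caveat of the dictionary analogy: a dictionary holds each key once, whereas an RDD may hold the same key several times, in which case a join produces every pairing of the matching values.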

__Merging with the games data__

In [17]:
# Requirement = What are the top 10 game days with the highest traffic?

# Both Pair RDDs need the same key, so create a Pair RDD for games with the date as key
def parseGames(row):
    DATE_FMT = "%m/%d/%y"
    row = row.split(",")
    row[0] = datetime.strptime(row[0],DATE_FMT).date()
    return (row[0],row[4])

gamesParsed = games.map(parseGames)
gamesParsed.take(10)

[(datetime.date(2005, 4, 12), 'San Francisco'),
 (datetime.date(2005, 4, 13), 'San Francisco'),
 (datetime.date(2005, 4, 15), 'San Diego'),
 (datetime.date(2005, 4, 16), 'San Diego'),
 (datetime.date(2005, 4, 17), 'San Diego'),
 (datetime.date(2005, 4, 25), 'Arizona'),
 (datetime.date(2005, 4, 26), 'Arizona'),
 (datetime.date(2005, 4, 27), 'Arizona'),
 (datetime.date(2005, 4, 29), 'Colorado'),
 (datetime.date(2005, 4, 30), 'Colorado')]

In [18]:
# Joining with Games 
dailyTrendCombined = dailyTrend.leftOuterJoin(gamesParsed)

In [19]:
dailyTrendCombined.take(10)

[(datetime.date(2005, 4, 11), (5062, None)),
 (datetime.date(2005, 4, 15), (6459, 'San Diego')),
 (datetime.date(2005, 4, 17), (5322, 'San Diego')),
 (datetime.date(2005, 4, 19), (6049, None)),
 (datetime.date(2005, 4, 21), (5977, None)),
 (datetime.date(2005, 4, 22), (6038, None)),
 (datetime.date(2005, 4, 23), (5366, None)),
 (datetime.date(2005, 4, 24), (4319, None)),
 (datetime.date(2005, 4, 25), (6280, 'Arizona')),
 (datetime.date(2005, 4, 30), (6090, 'Colorado'))]

The value from the right RDD is None when there was no game on that day.

In [20]:
# Convert a joined record to a tuple (Date, Opponent, Type of Day, No. of Cars)
def checkGameDay(row):
    if row[1][1] is None:
        return (row[0],row[1][1],"Regular Day",row[1][0])
    else:
        return (row[0],row[1][1],"Game Day",row[1][0])
    
dailyTrendbyGames = dailyTrendCombined.map(checkGameDay)

In [21]:
dailyTrendbyGames.take(10)

[(datetime.date(2005, 4, 11), None, 'Regular Day', 5062),
 (datetime.date(2005, 4, 15), 'San Diego', 'Game Day', 6459),
 (datetime.date(2005, 4, 17), 'San Diego', 'Game Day', 5322),
 (datetime.date(2005, 4, 19), None, 'Regular Day', 6049),
 (datetime.date(2005, 4, 21), None, 'Regular Day', 5977),
 (datetime.date(2005, 4, 22), None, 'Regular Day', 6038),
 (datetime.date(2005, 4, 23), None, 'Regular Day', 5366),
 (datetime.date(2005, 4, 24), None, 'Regular Day', 4319),
 (datetime.date(2005, 4, 25), 'Arizona', 'Game Day', 6280),
 (datetime.date(2005, 4, 30), 'Colorado', 'Game Day', 6090)]

In [22]:
# Sort by number of cars to see the busiest days and whether there was a game on each
dailyTrendbyGames.sortBy(lambda x:-x[3]).take(10)

[(datetime.date(2005, 7, 28), 'Cincinnati', 'Game Day', 7661),
 (datetime.date(2005, 7, 29), 'St. Louis', 'Game Day', 7499),
 (datetime.date(2005, 8, 12), 'NY Mets', 'Game Day', 7287),
 (datetime.date(2005, 7, 27), 'Cincinnati', 'Game Day', 7238),
 (datetime.date(2005, 9, 23), 'Pittsburgh', 'Game Day', 7175),
 (datetime.date(2005, 7, 26), 'Cincinnati', 'Game Day', 7163),
 (datetime.date(2005, 5, 20), 'LA Angels', 'Game Day', 7119),
 (datetime.date(2005, 8, 11), 'Philadelphia', 'Game Day', 7110),
 (datetime.date(2005, 9, 8), None, 'Regular Day', 7107),
 (datetime.date(2005, 9, 7), 'San Francisco', 'Game Day', 7082)]

### Computing Averages with Pair RDDs using combineByKey

combineByKey = combine records with the same key in a specified way; it gives very granular control over how the computation happens
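
`combineByKey` takes three functions, named `createCombiner`, `mergeValue`, and `mergeCombiners` in PySpark. The sketch below emulates them in plain Python over two hand-made "partitions" to show when each one is called. `combine_by_key` is a hypothetical helper and the sample numbers are invented; this is an illustration of the idea, not Spark's actual implementation.

```python
def combine_by_key(partitions, create_combiner, merge_value, merge_combiners):
    # Step 1: build one accumulator per key inside each partition
    per_partition = []
    for part in partitions:
        accs = {}
        for key, value in part:
            if key in accs:
                accs[key] = merge_value(accs[key], value)
            else:
                accs[key] = create_combiner(value)
        per_partition.append(accs)
    # Step 2: merge the per-partition accumulators for each key
    merged = {}
    for accs in per_partition:
        for key, acc in accs.items():
            merged[key] = merge_combiners(merged[key], acc) if key in merged else acc
    return merged

# Two hand-made "partitions" of (day type, car count) pairs
partitions = [
    [("Game Day", 6000), ("Regular Day", 5000), ("Game Day", 7000)],
    [("Regular Day", 4000), ("Game Day", 6500)],
]

sums_counts = combine_by_key(
    partitions,
    lambda value: (value, 1),                         # first value seen: (sum, count)
    lambda acc, value: (acc[0] + value, acc[1] + 1),  # fold in one more value
    lambda a, b: (a[0] + b[0], a[1] + b[1]),          # merge two accumulators
)
averages = {k: total / n for k, (total, n) in sums_counts.items()}
print(averages)
```

Carrying a (sum, count) pair as the accumulator is what makes an average computable here: an average of averages across partitions would be wrong, while sums and counts can be merged safely.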

In [23]:
# Requirement = Compare average traffic on Game Day vs. Non-Game Day

dailyTrendbyGames.map(lambda x:(x[2],x[3]))\
                 .combineByKey(lambda value : (value,1),  # first value for a key: (sum, count)
                      lambda acc,value:(acc[0]+value,acc[1]+1),  # add a value to an accumulator
                      # merge accumulators from different partitions
                      lambda acc1,acc2:(acc1[0]+acc2[0],acc1[1]+acc2[1]))\
                 .mapValues(lambda x:x[0]/x[1])\
                 .collect()

[('Regular Day', 5411.329787234043), ('Game Day', 5948.604938271605)]

### Disclaimer

This is example code from the [Pluralsight course](https://app.pluralsight.com/library/courses/apache-spark-beginning-data-exploration-analysis), which I upgraded to work with Python 3.6 while learning Spark.