## Contents
1. [Introduction](#Introduction)
2. [Data Loading and Overview](#DataLoad)
3. [Data Exploration](#data_exp) <br>
    3.1 [Target Distribution](#data_exp)<br>
    3.2 [Direction](#direction) <br>
    3.3 [Location & Time](#locationtime) <br>
4. [Conclusion](#conclusion)

## Introduction
This kernel is an exploratory data analysis of the BigQuery-Geotab Intersection Congestion competition data.

Geotab provides a wide variety of aggregate datasets gathered from commercial vehicle telematics devices. Harnessing the insights from this data has the power to improve safety, optimize operations, and identify opportunities for infrastructure challenges.

We have a regression problem in which we must predict six target values for each observation in the test set: the 20th, 50th, and 80th percentiles of two metrics, namely the total time a vehicle stopped at an intersection and the distance between the intersection and the first place a vehicle stopped while waiting.  The given data consists of aggregated trip logging metrics from commercial vehicles, such as semi-trucks. The data have been grouped by intersection, month, hour of day, direction driven through the intersection, and whether the day was on a weekend or not.


![](https://www.geotab.com/blog/wp-content/uploads/2018/07/traffic-congestion.jpg)

Image Source: *https://www.geotab.com/blog/traffic-congestion/*

*Importing libraries..*

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python


import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.style as style 
style.use('seaborn-bright')
import matplotlib.pyplot as plt
import seaborn as sns
import tqdm

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))



## Data loading and overview <a id = 'DataLoad'></a>

*Loading data..*

In [None]:
train = pd.read_csv('../input/bigquery-geotab-intersection-congestion/train.csv')
test = pd.read_csv('../input/bigquery-geotab-intersection-congestion/test.csv')
mergedData = train.merge(test, 'outer')

In [None]:
train.head()

In [None]:
test.head()

In [None]:
mergedData.info()

**Data features:** <br>
- RowId: Unique identifier for each row/observation, each of which is an aggregate of some congestion data.
- IntersectionId: Identifier for specific intersection (2839 unique in train and test sets combined) 
- Latitude
- Longitude
- EntryStreetName
- ExitStreetName
- EntryHeading: Direction cars entering intersection are moving in (N, W, E, S, NW, SW, SE, NE)
- ExitHeading: Direction cars exiting intersection are moving in (N, W, E, S, NW, SW, SE, NE)
- Hour: Hour of day (0-23)
- Weekend: No: 0, Yes: 1
- Month: Has unique values (1,5,6,7,8,9,10,11,12) in both train and test sets, which implies that the congestion data ranges from May of one year to January of the next year.
- Path: Concatenation of strings from the columns: EntryStreetName, EntryHeading, ExitStreetName, ExitHeading
- City: Atlanta, Boston, Chicago, or Philadelphia


In [None]:
print(f'Train dataset has {train.shape[0]} rows and {train.shape[1]} columns.')
print(f'Test dataset has {test.shape[0]} rows and {test.shape[1]} columns.')

We have two moderately sized datasets, with the test set containing notably more observations than the train set.  The following columns appear in the train set but not in the test set:

In [None]:
[col for col in train.columns if col not in test.columns]

The test set contains the same columns as the training set, except for the features above. 
The task is to predict the 20th, 50th, and 80th percentiles of the variables TotalTimeStopped and DistanceToFirstStop.  It seems that additional percentile statistics, as well as a TimeFromFirstStop metric, have been provided to assist with model building.
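
To make the six targets concrete, here is a minimal sketch that builds their column names. It assumes the targets follow the `<Metric>_p<percentile>` naming seen in the training data (for example `TotalTimeStopped_p80`); the names `metrics`, `percentiles`, and `target_cols` are illustrative only.

```python
# Sketch: enumerate the six target columns (2 metrics x 3 percentiles).
# Assumes the '<Metric>_p<percentile>' naming used in the train set.
metrics = ['TotalTimeStopped', 'DistanceToFirstStop']
percentiles = [20, 50, 80]

target_cols = [f'{m}_p{p}' for m in metrics for p in percentiles]
print(target_cols)
```

A list like this could then be used to select the targets from the train set, e.g. `train[target_cols]`.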

Examine the columns that have missing data:

In [None]:
print('Print names of columns in train set with null values, along with number of null values:')
trainNull = [col for col in  train.columns if train[col].isnull().any()]
trainNullD = {}
for col in trainNull:
    trainNullD.update({col: ['Number missing: ' + str(train[col].isnull().sum()), 
                             'Proportion missing: ' + str(round(train[col].isnull().sum() / train.shape[0],3))]})
for a,b in trainNullD.items():
    print(a)
    print(b)

In [None]:
print('Print names of columns in test set with null values, along with number of null values:')
testNull = [col for col in  test.columns if test[col].isnull().any()]
testNullD = {}
for col in testNull:
    testNullD.update({col: ['Number missing: ' + str(test[col].isnull().sum()), 
                             'Proportion missing: ' + str(round(test[col].isnull().sum() / test.shape[0],3))]})
for a,b in testNullD.items():
    print(a)
    print(b)

The columns with missing data are the same in both train and test sets: EntryStreetName and ExitStreetName. Most columns have no missing data at all, and the two that do have only a small proportion missing, so it should be safe to simply remove those observations prior to model fitting.
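
As a rough sketch of that removal step, the snippet below drops rows with a missing street name from a toy frame. The toy values and the names `toy`, `street_cols`, and `toy_clean` are placeholders, not part of the competition data. This would presumably apply to the training rows only, since test rows still need a prediction.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the training data; the real frame comes from train.csv.
toy = pd.DataFrame({
    'EntryStreetName': ['Main St', np.nan, 'Oak Ave', 'Elm St'],
    'ExitStreetName': ['Main St', 'Pine St', np.nan, 'Elm St'],
    'TotalTimeStopped_p80': [10.0, 0.0, 35.0, 5.0],
})

# Keep only rows where both street names are present.
street_cols = ['EntryStreetName', 'ExitStreetName']
toy_clean = toy.dropna(subset=street_cols)

print(f'Dropped {len(toy) - len(toy_clean)} of {len(toy)} rows')
```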

## Data Exploration: Target Distribution <a id = 'data_exp'></a>

First, let's see how each of the given statistics is distributed.

In [None]:
#Create lists of columns for each group of statistics
TotalTimeStoppedCols = list(train.loc[:,'TotalTimeStopped_p20':'TotalTimeStopped_p80'].columns)
DistanceToFirstStop = list(train.loc[:,'DistanceToFirstStop_p20':'DistanceToFirstStop_p80'].columns)
TimeFromFirstStopCols = list(train.loc[:,'TimeFromFirstStop_p20':'TimeFromFirstStop_p80'].columns)


f1, axes = plt.subplots(3,5, figsize=(20, 10), sharex=False)

for i, col in enumerate(TotalTimeStoppedCols):
    sns.distplot(train[col], kde=False, ax=axes[0,i], label = col, color = 'r')
    
for i, col in enumerate(DistanceToFirstStop):
    sns.distplot(train[col], kde=False, ax=axes[1,i], label = col,  color = 'g')
    
for i, col in enumerate(TimeFromFirstStopCols):  
    sns.distplot(train[col], kde=False, ax=axes[2,i], label = col, color = 'b')    

    
plt.tight_layout()

For each of the three statistics, the histograms spread toward larger values as they are read from left to right, as expected, since a higher percentile of the same observation can never be smaller than a lower one.  Let's zoom in on the highest available percentile for each, the 80th, which captures the heaviest congestion recorded in each aggregate.

In [None]:
#lastPercentiles = ['TotalTimeStopped_p80', 'DistanceToFirstStop_p80', 'TimeFromFirstStop_p80']
f2, axes2 = plt.subplots(1,3, figsize=(20, 8), sharex=False)

sns.distplot(train['TotalTimeStopped_p80'], kde=False, ax=axes2[0], color = 'r' )
sns.distplot(train['DistanceToFirstStop_p80'], kde=False, ax=axes2[1], color = 'g' )
sns.distplot(train['TimeFromFirstStop_p80'], kde=False, ax=axes2[2], color = 'b' )


All three distributions are right-skewed, with most of their mass at or near zero.  This implies that the most common outcome is for cars to pass through the intersection without stopping, and that longer stopping times are increasingly unlikely, which makes sense. TotalTimeStopped by itself should be enough to provide a good idea of how heavy the congestion is.  For each congestion data aggregate (each row), a low value of TotalTimeStopped_p80 should thus be representative of normal or good traffic, while a high value would represent slow traffic.  As a simple test, the average 80th percentile of total stop time is 24.83 seconds on weekdays, and 18.05 seconds on weekends.  This makes sense, as most people work on weekdays and rest on weekends, leading to busier streets on weekdays due to commuting to and from work.

In [None]:
mergedData.groupby('Weekend')['TotalTimeStopped_p80'].mean()
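
The skew toward zero noted above can also be put into numbers. The sketch below uses a handful of made-up values in place of `train['TotalTimeStopped_p80']` and reports the share of exact zeros and the sample skewness; the names `stop_p80`, `zero_share`, and `skewness` are illustrative.

```python
import pandas as pd

# Toy values standing in for train['TotalTimeStopped_p80'].
stop_p80 = pd.Series([0, 0, 0, 0, 5, 8, 12, 20, 45, 120], dtype=float)

zero_share = (stop_p80 == 0).mean()   # fraction of rows with no stop at all
skewness = stop_p80.skew()            # > 0 indicates a long right tail

print(f'Share of zeros: {zero_share:.2f}, skewness: {skewness:.2f}')
```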

## Data Exploration: Direction <a id='direction' > </a>

From prior driving experience, one can recall that it takes more time on average to make a left/right turn than to drive straight through an intersection.  This makes sense since oncoming cars in the opposite lane have the right of way, and one would have to wait for them to pass before turning, which in turn blocks or slows down the cars behind you as well.  Let's see if the data agrees with us by creating an 'isSameDirection' variable, which is true if the entry direction and exit direction are the same and false otherwise:

In [None]:
#%%time
train['isSameDirection'] = train['EntryHeading'] == train['ExitHeading']

f3, axes3 = plt.subplots(1,2, figsize=(14, 6))

bar = sns.barplot(x='isSameDirection', y='TotalTimeStopped_p80', data = train, palette = 'rocket', 
                  ax = axes3[0] )
axes3[0].set_title('Barplot of average stop time vs. isSameDirection')
axes3[0].set_xlabel('isSameDirection')
axes3[0].set_ylabel('Avg. of TotalTimeStopped_p80')

strip = sns.stripplot(x='isSameDirection', y='TotalTimeStopped_p80', data = train, palette = 'rocket', 
                      jitter = 0.5, alpha = 0.3, ax = axes3[1])
axt = axes3[1].twinx()
vio = sns.violinplot(x='isSameDirection', y='TotalTimeStopped_p80', data = train, palette = 'rocket',
                    ax = axt)

axes3[1].set_title('Stripplot of TotalTimeStopped_p80 vs. isSameDirection')
axes3[1].set_xlabel('isSameDirection')
axes3[1].set_ylabel('TotalTimeStopped_p80')


In [None]:
print(train.groupby('isSameDirection')['TotalTimeStopped_p80'].agg('mean'))
print(train['isSameDirection'].value_counts(normalize=True))

It seems that roughly 0.70 of the observations in the train set correspond to driving straight through the intersection, and roughly 0.30 to exiting in a different direction (this distribution is very nearly the same for the test set as well).  Note that each row is an aggregate, so these are proportions of rows rather than of individual cars. <br> The barplot supports our earlier guess: that cars travelling through intersections without changing direction would, on average, have lower total stop time.   <br> isSameDirection should be a feature worth keeping for model training.  Taking this idea further, it might be worth investigating the same metric for all combinations of entry and exit direction from (N, W, E, S, NW, SW, SE, NE).  Let's take a look at the counts of all the entry-exit combinations:

In [None]:
#column 'Path' has extraneous information (i.e. street names), we only want path direction
train['pathDirectionOnly'] = train['EntryHeading'] + '_' + train['ExitHeading']

pathd = pd.DataFrame({'EntryHeading_ExitHeading': train['pathDirectionOnly'].value_counts().index, 
                      'Count': train['pathDirectionOnly'].value_counts()})

f4, axes4 = plt.subplots( figsize=(13, 14))
sns.barplot(x='Count', y='EntryHeading_ExitHeading', data = pathd, ax = axes4, palette='Blues_r')
axes4.set_title('Counts of all entry-exit combinations')


There are a total of 64 possible entry-exit combinations.  The most frequent ones involve no turning (e.g. East -> East: 0 degrees), the next most frequent are simple right/left turns (e.g. East -> North: 90 degrees), followed by turns at less common angles (e.g. Northwest -> West: 45 degrees).  It may be more informative to convert these 64 entry-exit combinations into the 5 possible angles associated with a turn (0, 45, 90, 135, 180), and check how this feature compares with stop time.

![](https://i.imgur.com/FBQ0Uxn.png)

In [None]:
#Define function to get angle between 'EntryHeading' and 'ExitHeading'
def getAngleOfTurn(df):
    #Compass bearing in degrees for each heading
    compassAngle = {'N':0, 'NE':45, 'E':90, 'SE':135, 'S':180, 
                    'SW':225, 'W':270, 'NW':315 }
    a = df['EntryHeading'] 
    b = df['ExitHeading']
    angle = abs(compassAngle[a] - compassAngle[b])
    #Differences above 180 wrap around the compass (225 -> 135, 270 -> 90, 315 -> 45)
    return min(angle, 360 - angle)
train['angleOfTurn'] = train.apply(getAngleOfTurn, axis=1)
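
Row-wise `apply` can be slow on a frame of this size. A vectorized variant might look like the sketch below, shown on a few toy rows; the names `compass`, `toy`, and `diff` are illustrative, and the folding step relies on the fact that a bearing difference above 180 degrees is the same turn measured the other way round.

```python
import pandas as pd

# Compass bearing in degrees for each heading (same mapping as above).
compass = {'N': 0, 'NE': 45, 'E': 90, 'SE': 135,
           'S': 180, 'SW': 225, 'W': 270, 'NW': 315}

# Toy rows standing in for the train DataFrame.
toy = pd.DataFrame({'EntryHeading': ['N', 'E', 'NW', 'N', 'S'],
                    'ExitHeading':  ['N', 'N', 'N', 'S', 'NE']})

# Absolute bearing difference, folded into the 0-180 range.
diff = (toy['EntryHeading'].map(compass) - toy['ExitHeading'].map(compass)).abs()
toy['angleOfTurn'] = diff.where(diff <= 180, 360 - diff)

print(toy)
```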

In [None]:
anglesDf = pd.DataFrame({'angleOfTurn': train.groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean').index,
                         'Count': train.groupby('angleOfTurn')['angleOfTurn'].agg('count'),
                         'StopTimeMean' : train.groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean')})

f5, axes5 = plt.subplots( figsize=(12, 8))

dircount = sns.barplot(x='angleOfTurn', y='Count' , data = anglesDf, palette = 'rocket', ax = axes5)
dircount.set_title('Counts of turning angles and associated avg. stop times')
dircount.set_ylabel('Count')
dircount.set_xlabel('Degrees')

a = axes5.twinx()
sns.pointplot(x = 'angleOfTurn', y = 'StopTimeMean', color = 'coral',data = anglesDf, ax = a)

In [None]:
anglesDf.iloc[:,1:]

The frequency of each turn angle is consistent with our previous investigation of 'isSameDirection' - turns of 0 degrees account for around 0.70 of all observations, and the frequencies of all other turn angles (45, 90, 135, and 180) sum to roughly 0.30.  Plotting the average 80th percentile of TotalTimeStopped (abbreviated as StopTimeMean) for each of the angle groups on the same axis suggests a roughly linear relationship between the magnitude of the angle turned and the time stopped.  Interestingly, there are even some drivers who made complete U-turns of 180 degrees at the intersection, going back the way they came.  Unsurprisingly, these turned out to have the highest average stop time.  'angleOfTurn' should be a feature worth keeping.
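
One way to check how close to linear the trend is would be to fit a straight line through the per-angle means. The sketch below does this on placeholder numbers that are not taken from the competition data; `toy`, `means`, `slope`, `intercept`, and `r` are illustrative names.

```python
import numpy as np
import pandas as pd

# Placeholder numbers only; these are NOT values from the competition data.
toy = pd.DataFrame({
    'angleOfTurn':          [0, 0, 45, 45, 90, 90, 135, 135, 180, 180],
    'TotalTimeStopped_p80': [10, 14, 18, 22, 30, 26, 35, 41, 50, 44],
})

# Mean stop time per turn angle, as in the plot above.
means = toy.groupby('angleOfTurn')['TotalTimeStopped_p80'].mean()
x = means.index.to_numpy()
y = means.to_numpy()

# Least-squares line through the group means: slope is seconds per degree.
slope, intercept = np.polyfit(x, y, deg=1)
# Pearson correlation between angle and mean stop time.
r = np.corrcoef(x, y)[0, 1]

print(f'slope={slope:.3f} s/deg, intercept={intercept:.1f} s, r={r:.3f}')
```

With only five angle groups, a high correlation is weak evidence at best, so this is better read as a sanity check than as a test of linearity.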

## Data Exploration: Location & Time <a id = 'locationtime' ></a>

![](https://i.imgur.com/Zdohjf3.jpg)

Another spatial component of the data given to us is latitude and longitude.  Latitude measures the angular distance north or south of the equator and ranges from 0° at the equator to 90° at the North or South Pole.  Longitude measures the angular distance east or west of the Prime Meridian, an imaginary line drawn between the North and South Poles that passes through Greenwich, England, and ranges from 0° to 180°. 

The Earth’s axis is tilted 23.5° to the perpendicular, meaning that the amount of sunlight that a particular latitude receives changes with the seasons. From April to September, the Northern Hemisphere is tilted toward the Sun, where it receives more energy; the Southern Hemisphere receives this additional energy between October and March, when it is tilted toward the Sun. *Source: https://enviroliteracy.org/air-climate-weather/climate/latitude-climate-zones/*


With this in mind, it should be worthwhile to investigate the effect of climate on traffic by plotting the relationship between latitude and average stop times, organized by seasons.  For our purposes here, we'll treat values of latitude as a categorical variable and round them, since our data encompass only four cities, for each of which the values of latitude are sharply distributed.

In [None]:
f7, axes7 = plt.subplots(1,2, figsize=(14, 4))
kde1 = sns.kdeplot(data = train['Latitude'], shade = True, color = 'lightskyblue', ax = axes7[0])
kde1.set_xlabel('Latitude')
kde1.set_title('KDE of Latitude')

kde2 = sns.countplot(x = 'City', data = train, ax = axes7[1],  palette = 'Blues',
                     order = ['Atlanta', 'Philadelphia', 'Chicago', 'Boston'] )
kde2.set_title('Countplot of City')
kde2.set_ylabel('Count')

- Atlanta: Latitude ~34° N
- Philadelphia: Latitude ~40° N
- Chicago & Boston: Latitude ~ 42° N


Since Boston and Chicago have nearly the same latitude, and climate differences are unlikely to be noticeable over such a small range of latitude difference, we will categorize these two cities together.

In [None]:
#%%time
train['LatitudeRounded'] = round(train['Latitude'])

trainSpring = train.loc[train['Month'].isin([3,4,5])]
trainSummer = train.loc[train['Month'].isin([6,7,8])]
trainFall = train.loc[train['Month'].isin([9,10,11])]
trainWinter = train.loc[train['Month'].isin([12,1,2])]

f6, axes6 = plt.subplots(2, 2, figsize=(12, 8))

#(title, data, color) for each season, laid out row by row on the 2x2 grid
seasons = [('Spring', trainSpring, 'palegreen'), ('Summer', trainSummer, 'salmon'),
           ('Fall', trainFall, 'orange'), ('Winter', trainWinter, 'lightblue')]

for ax, (title, data, color) in zip(axes6.flatten(), seasons):
    strip = sns.stripplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = data, 
                          color = color, alpha = 0.3, ax = ax)
    #Overlay a violin plot on a twin axis that uses the same y-range
    axt = ax.twinx()
    vio = sns.violinplot(x='LatitudeRounded', y='TotalTimeStopped_p80', data = data,
                         color = color, alpha = 0.3, ax = axt)
    strip.set_title(title)
    strip.set_ylim(0,700)
    strip.set_autoscaley_on(False)
    axt.set_ylabel('')
    axt.set_ylim(0,700)
    axt.set_autoscaley_on(False)

plt.tight_layout()

Unfortunately, our "Spring" dataset contains data only from the month of May, with March and April missing, and our "Winter" dataset is also missing a month, February (data was collected for all months except February, March, and April).  Observing the remaining seasons, Summer and Fall, it seems that there are differences in stop time among the different locations.  However, this cannot simply be attributed to climate differences, since differences in infrastructure etc. among the cities may have been the cause instead.  In Summer, Latitude 34 corresponds to the highest stop times, while in Fall, Latitude 42 corresponds to the highest stop times. However, this too may be due to indirect effects of season on people's behaviors and is not necessarily evidence of a direct effect of climate on traffic. <br>
It seems that we are unable to distinguish between effects of infrastructure and climate.  Let's instead try to investigate the differences in infrastructure among the four cities.

We will revisit our earlier plot of 'Counts of turning angles and associated avg. stop times', but now we will create four different subplots, grouping data by each of the cities.  

In [None]:
trainAtlanta = train.loc[train['City'] == 'Atlanta']
trainBoston = train.loc[train['City'] == 'Boston']
trainChicago = train.loc[train['City'] == 'Chicago']
trainPhiladelphia = train.loc[train['City'] == 'Philadelphia']
t = [trainAtlanta, trainBoston, trainChicago, trainPhiladelphia]
n = ['Atlanta', 'Boston', 'Chicago', 'Philadelphia']
I = 0
f8, axes8 = plt.subplots(2,2, figsize=(12, 8))
splots = [axes8[0,0], axes8[0,1], axes8[1,0], axes8[1,1]]

for item in splots:
    anglesDf = pd.DataFrame({'angleOfTurn': t[I].groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean').index,
                         'Count': t[I].groupby('angleOfTurn')['angleOfTurn'].agg('count'),
                         'StopTimeMean' : t[I].groupby('angleOfTurn')['TotalTimeStopped_p80'].agg('mean')})
    
    dircount = sns.barplot(x='angleOfTurn', y='Count' , data = anglesDf, palette = 'rocket', ax = item)
    dircount.set_title(n[I])
    dircount.set_ylabel('Count')
    dircount.set_xlabel('Degrees')
    
    a = item.twinx()
    sns.pointplot(x = 'angleOfTurn', y = 'StopTimeMean', color = 'coral',data = anglesDf, ax = a)
    
    I = I+1
plt.tight_layout()

Although quite abstract, information on the frequency of turn angles should still provide some insight on differences in infrastructure among the four cities.  For all cities except Boston, the trend remains that turn frequencies have the following decreasing order: 0, 90, 45, 135, and 180 degrees.  But there are very clear differences in the proportions of each when comparisons are made among the different cities.  The exact reasons for these differences may be un-extractable from this abstract view of the data.  It may be due to genuine structural differences among the four cities, or it may simply be a matter of where/when Geotab decided to collect their data, or even a combination of the two.
<br>
Also, it seems to be generally true that increasing turn angle means increasing stop time, even when we group the data by city.  The few exceptions that can be seen above are likely attributable to these large-angle turns being made at non-busy intersections and/or times, removing the need for the driver to stop and wait due to blockage from other cars.  
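
Since the cities contribute different numbers of rows, comparing proportions rather than raw counts may be easier to read. Below is a minimal sketch using `pd.crosstab` with row normalization on toy rows; `toy` and `turn_share` are illustrative names and the values are placeholders.

```python
import pandas as pd

# Toy rows standing in for train[['City', 'angleOfTurn']].
toy = pd.DataFrame({
    'City':        ['Atlanta', 'Atlanta', 'Atlanta', 'Atlanta',
                    'Boston', 'Boston', 'Boston', 'Boston'],
    'angleOfTurn': [0, 0, 0, 90, 0, 45, 90, 90],
})

# Row-normalized table: each row sums to 1, so cities with different
# numbers of observations can be compared directly.
turn_share = pd.crosstab(toy['City'], toy['angleOfTurn'], normalize='index')
print(turn_share)
```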

Now let us explore the times of day and see which times are busiest.  At the same time, we'll also take a look at the relationship between season and traffic from a different perspective.  Since data for the months of Spring are missing except for its last, May, we will consider May to be part of the season that immediately follows: Summer.  And for the purpose of maintaining (approximately) equal amounts of data in each season set, I don't think anyone would mind if we shifted the last month of Summer "down" into the next, and do the same for Fall.  After this process, our new seasons contain the following months... Summer: 5,6,7; Fall: 8,9,10; Winter: 11,12,1.
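
The shifted season definition can also be written as a single lookup, which keeps the month-to-season assignment in one place. This is only a sketch on a toy `Month` column; `month_to_season` and the `Season2` column name are hypothetical.

```python
import pandas as pd

# Shifted season definition from the text; months with no data are omitted.
month_to_season = {5: 'Summer', 6: 'Summer', 7: 'Summer',
                   8: 'Fall', 9: 'Fall', 10: 'Fall',
                   11: 'Winter', 12: 'Winter', 1: 'Winter'}

# Toy Month column standing in for train['Month'].
toy = pd.DataFrame({'Month': [1, 5, 6, 8, 10, 11, 12]})
toy['Season2'] = toy['Month'].map(month_to_season)

print(toy['Season2'].value_counts())
```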

In [None]:
trainSummer2 = train.loc[train['Month'].isin([5,6,7])]
trainFall2 = train.loc[train['Month'].isin([8,9,10])]
trainWinter2 = train.loc[train['Month'].isin([11,12,1])]


f9, axes9 = plt.subplots(3,1, figsize=(12, 9))
c = sns.countplot(x = 'Hour', data = trainSummer2, hue = 'City', palette = 'rocket', ax = axes9[0])
c.set_title('Summer')
c.set_ylabel('Count')
       
axt = axes9[0].twinx()
p = sns.pointplot(x='Hour', y='TotalTimeStopped_p80', data = trainSummer2, color = 'salmon',ax = axt )
p.set_ylabel('StopTimeMean')
#####
c = sns.countplot(x = 'Hour', data = trainFall2, hue = 'City', palette = 'YlOrRd', ax = axes9[1])
c.set_title('Fall')
c.set_ylabel('Count')
       
axt = axes9[1].twinx()
p = sns.pointplot(x='Hour', y='TotalTimeStopped_p80', data = trainFall2, color = 'orange', ax = axt )
p.set_ylabel('StopTimeMean')
#####
c = sns.countplot(x = 'Hour', data = trainWinter2, hue = 'City', palette = 'Blues', ax = axes9[2])
c.set_title('Winter')
c.set_ylabel('Count')
       
axt = axes9[2].twinx()
p = sns.pointplot(x='Hour', y='TotalTimeStopped_p80', data = trainWinter2, color = 'lightblue', ax = axt )
p.set_ylabel('StopTimeMean')

plt.tight_layout()

For all three seasons, there are two notable peaks in average stop time.  The first of these occurs at around 8 AM and the second occurs at around 5 PM.  Sound familiar?  This pattern suggests the commute to-and-from your typical 9-to-5 day job.  It makes sense to observe peaks in traffic activity at these two times.  There also seems to be a minimum in the amount of data available, centered at around 4 AM.  Makes sense, most people are asleep at this time and so one wouldn't expect high levels of traffic.  There doesn't seem to be a significant difference in stop times among the different seasons.  
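
The busiest hours can also be read off numerically instead of by eye. The sketch below averages stop time per hour on placeholder rows and picks the two hours with the highest mean; `toy`, `hourly_mean`, and `peak_hours` are illustrative names. On the real data the two largest values could be adjacent hours of the same rush period, so this is a rough check rather than a proper peak detector.

```python
import pandas as pd

# Placeholder rows; the real input would be train[['Hour', 'TotalTimeStopped_p80']].
toy = pd.DataFrame({
    'Hour':                 [4, 4, 8, 8, 12, 12, 17, 17, 22, 22],
    'TotalTimeStopped_p80': [2, 4, 30, 34, 18, 20, 36, 40, 8, 10],
})

hourly_mean = toy.groupby('Hour')['TotalTimeStopped_p80'].mean()

# The two hours with the highest mean stop time, in chronological order.
peak_hours = sorted(int(h) for h in hourly_mean.nlargest(2).index)
print(peak_hours)
```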

Let's take a closer look at the data contained in different months.

In [None]:
f11, axes11 = plt.subplots(figsize=(14, 5))
m = sns.countplot(y= 'Month', hue = 'City',data = train, palette = 'Blues_r', ax = axes11, 
                  order = [10,9,12,11,8,7,6,1,5], hue_order=['Philadelphia','Boston','Atlanta', 'Chicago'])
#a = axes11.twinx()
#p = sns.pointplot(x='Month', y='TotalTimeStopped_p80', data = train[train['Month'].isin([6,7,8,9,10,11,12])], color = 'lightblue', ax = a )
m.set_title('Countplot of months')

Upon closer inspection, it seems that there are disproportionately few data for the months of January and May compared to the other months.  Earlier, we observed stop times over the time frame of hours and found some sensible trends. Now let's observe stop times over the time frame of months and see if we can find patterns on this larger time scale.  We will exclude January and May since they don't contain much data. 

In [None]:
trainP = train.loc[train['City'] == 'Philadelphia']
trainB = train.loc[train['City'] == 'Boston']
trainA = train.loc[train['City'] == 'Atlanta']
trainC = train.loc[train['City'] == 'Chicago']

f = plt.figure(figsize = (14,8))
s = plt.subplot2grid((3,2),(0,0), 1, 2,f)  #top plot spans both columns of the 3x2 grid
p = plt.subplot2grid((3,2),(1,0), 1, 1,f)
b = plt.subplot2grid((3,2),(1,1), 1, 1,f)
a = plt.subplot2grid((3,2),(2,0), 1, 1,f)
c = plt.subplot2grid((3,2),(2,1), 1, 1,f)

m=sns.countplot(x = 'Month', data = train[train['Month'].isin([6,7,8,9,10,11,12])], 
                palette = 'Blues', ax = s)
at = s.twinx()
point = sns.pointplot(x='Month', y='TotalTimeStopped_p80', data = train[train['Month'].isin([6,7,8,9,10,11,12])], 
                   color = 'lightblue', ax = at )
m.set_title('Stop times across months (all cities)')
m.set_ylabel('Count')
point.set_ylabel('StopTimeMean')
#####
temp = [[trainP,trainB,trainA,trainC],
        [p,b,a,c],
        ['Philadelphia', 'Boston', 'Atlanta', 'Chicago']]
for i in range(0,4):
    
    count = sns.countplot(x = 'Month', data = temp[0][i][temp[0][i]['Month'].isin([6,7,8,9,10,11,12])], 
                   palette = 'PuBu', ax = temp[1][i])
    count.set_title(temp[2][i])
    count.set_ylabel('Count')
    at = temp[1][i].twinx()
    point = sns.pointplot(x='Month', y='TotalTimeStopped_p80', data = temp[0][i][temp[0][i]['Month'].isin([6,7,8,9,10,11,12])],
                      color = 'lightblue', ax = at )
    
    point.set_ylabel('StopTimeMean')


plt.tight_layout()

Viewing the relationship between month and stop times for all cities, a minimum in stop time can be seen in the data for July, after which the average stop time climbs, reaching a maximum around October before decreasing again towards November and December.  My guess is that as summer approaches, a significant portion of people take breaks from work and maybe go on vacation, which would explain the minimum in July.  Similar reasoning should also apply as we approach the holiday months near the end of the year, as evidenced by the second dip in stop times.  The same general trend can be seen if we observe each of the cities individually, with some variance in where the minimum and maximum are for these specific cases.  An indirect effect of season/month on average stop time is thus observed.  Direct effects of climate on traffic seem impossible to isolate from the given data alone.
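
The per-city monthly means behind plots like these can be collected into one table for a side-by-side look. Here is a minimal sketch with `pivot_table` on placeholder rows; `toy`, `subset`, and `monthly` are illustrative names and the numbers are not from the competition data.

```python
import pandas as pd

# Placeholder rows; the real input would be the train DataFrame.
toy = pd.DataFrame({
    'City':  ['Atlanta', 'Atlanta', 'Atlanta', 'Boston', 'Boston', 'Boston', 'Boston'],
    'Month': [1, 7, 10, 5, 7, 7, 10],
    'TotalTimeStopped_p80': [12.0, 15.0, 25.0, 9.0, 18.0, 22.0, 30.0],
})

# Drop the sparsely covered months (January and May) before averaging.
subset = toy[~toy['Month'].isin([1, 5])]

# One row per month, one column per city; cells hold the mean stop time.
monthly = subset.pivot_table(index='Month', columns='City',
                             values='TotalTimeStopped_p80', aggfunc='mean')
print(monthly)
```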

## Conclusions <a id = 'conclusion' ></a>
* The data has only a small proportion of null values, located in EntryStreetName and ExitStreetName.
* There is no traffic data recorded for months February, March, and April, and a very low amount of data is recorded for the months of January and May compared to the other months with data.
* TotalTimeStopped turns out to be a useful metric for estimating traffic activity.
* About 0.70 of the observations correspond to driving straight through an intersection, and about 0.30 to a change of direction.
* Average stop times depend on whether or not a driver has made a turn at an intersection, with turning resulting in higher average stop times than not turning.
* Taking the above idea further, we found that stop time has an approximately linear relationship with the magnitude of the angle of the turn.
* From our observation of differing distributions of turn-angle frequencies for each city, it seems that there are nontrivial infrastructural differences among the 4 cities.
* Due to the above, it is likely impossible for the human eye to distinguish the effect of these infrastructural differences from the effect of climate differences at different latitudes (if any).


Thanks for going through my kernel