# Introduction 
This notebook uses a dataset of traffic congestion at intersections in four major US cities and attempts to build a model that accurately predicts future congestion. The notebook is structured as follows:

* Exploratory Analysis: This stage explores the data, renames labels as appropriate and determines what kind of pre-processing is needed (whether there are empty datapoints, the distribution of the data, the entropy of each feature).
* Preprocessing: This stage puts the data into a form the model can use to make accurate predictions.
* Algorithm selection and implementation
* Commentary 

In [None]:
# Load in packages 
import numpy as np 
import pandas as pd 
import scipy as sp
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import folium
from folium.plugins import HeatMap

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Exploratory Analysis

In [None]:
# Read data
df_train = pd.read_csv('/kaggle/input/bigquery-geotab-intersection-congestion/train.csv')
df_test = pd.read_csv('/kaggle/input/test-1/test (1).csv') #use received file from email on 1 May 2020
df_subm = pd.read_csv('/kaggle/input/bigquery-geotab-intersection-congestion/sample_submission.csv')

First, preview the training data to see its columns and the kind of values each one holds:

In [None]:
# Preview and analyse train data
df_train.head()
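
To check the data type of each column explicitly, `DataFrame.dtypes` can be inspected alongside the preview. The sketch below uses a small hypothetical frame (`toy`, with made-up values) that borrows a few column names from this dataset; it is only an illustration, not the output of the real data.

```python
import pandas as pd

# Hypothetical miniature of the training data (values are made up).
toy = pd.DataFrame({
    "IntersectionId": [0, 1, 2],
    "EntryHeading": ["N", "SE", "W"],
    "Hour": [7, 8, 17],
    "TotalTimeStopped_p80": [12.0, 0.0, 31.5],
})

# One dtype per column: integers, text and floats.
col_dtypes = toy.dtypes
print(col_dtypes)

# Columns holding text need encoding before a regression can use them.
text_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(text_cols)
```

On the real data the same two calls would be run on `df_train` instead of `toy`.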

The test data differs from the training data: it does not contain the measurement columns such as 'TimeFromFirstStop_p80', nor the 'TotalTimeStopped' and 'DistanceToFirstStop' percentiles (p20, p50, p80), which are the variables to be predicted in this study.

In [None]:
# Preview and analyse test data
df_test

In [None]:
# Compare the (rows, columns) shape of the train and test data
# The train data has additional columns (the measurements used as targets)
# Also plot the number of records per city across ALL DATA
train_plot = sns.countplot(x="City", data=df_train)
print(df_train.shape)
print(df_test.shape)

We can clearly see that Philadelphia has the highest record count of all cities. The data is therefore unevenly distributed across cities, which could affect our models. However, a total count per city does not tell us much more than that.

In [None]:
# Preview and analyse sample submission data
df_subm

Comparing the test data with df_subm shows a discrepancy in RowId: the sample submission only covers RowId 0 to 1920334, whereas the test data runs from RowId 0 to 1921356. The test data would therefore need to be trimmed before it could be submitted to Kaggle. However, since we are using a test file received from a previous Kaggle participant, this issue is already resolved: df_test and df_subm have matching RowIds.

In [None]:
# Adjust the test data to match the sample submission data
# As we are using the test file received from a previous Kaggle participant, df_test and df_subm already have matching RowIds.
# Hence, this step can be skipped

# df_test=df_test[df_test['RowId']<1920335]

In [None]:
# Check how many missing values in TRAIN data
# Only 'EntryStreetName' and 'ExitStreetName' columns contain missing values 
df_train.isnull().sum()

In [None]:
# Check how many missing values in TEST data
# Only 'EntryStreetName' and 'ExitStreetName' columns contain missing values 
df_test.isnull().sum()

In [None]:
# Investigate the missing values (NaN) in train data for 'EntryStreetName'
# There are 8148 such rows in total, as per the finding above
df_train[df_train['EntryStreetName'].isnull()]

In [None]:
# Investigate the missing values (NaN) in train data for 'ExitStreetName'
# There are 6287 such rows in total, as per the finding above
df_train[df_train['ExitStreetName'].isnull()]

From the above, NaN values only occur in the 'EntryStreetName' and 'ExitStreetName' columns.
These values don't need to be amended or dropped, as these two columns won't be used in the machine learning process; the 'EntryHeading' and 'ExitHeading' columns will be used instead, since they carry related information.

For the same reason, 'Latitude' and 'Longitude' won't be used in the machine learning process either, as they are tied to the 'IntersectionId' and 'City' columns.
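
The reasoning above (NaNs only in the street-name columns, none in the columns used as features) can be checked programmatically. This is a minimal sketch on a hypothetical four-row frame with made-up street names; on the real data `toy` would be replaced by `df_train` or `df_test`.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: street names are sometimes missing, headings never are.
toy = pd.DataFrame({
    "EntryStreetName": ["Main St", np.nan, "Oak Ave", np.nan],
    "ExitStreetName": ["Main St", "Pine Rd", np.nan, "Oak Ave"],
    "EntryHeading": ["N", "S", "E", "W"],
    "ExitHeading": ["N", "S", "E", "W"],
})

# Count missing values per column and list the columns that have any.
missing = toy.isnull().sum()
cols_with_nan = missing[missing > 0].index.tolist()

# Confirm that the columns intended as features are complete.
feature_cols = ["EntryHeading", "ExitHeading"]
features_complete = not toy[feature_cols].isnull().any().any()

print(cols_with_nan, features_complete)
```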

In [None]:
# Group by city and preview how many unique values there are in each column.
# Result shows that Philadelphia has the most data, but Chicago has the most IntersectionIds.
df_train.groupby(["City"]).nunique()

In [None]:
# Let's see the distribution of records by hour and by month
plt.figure(figsize=(15,12))

plt.subplot(211)
g = sns.countplot(x="Hour", data=df_train, hue='City', dodge=True)
g.set_title("Count Distribution by Hour and City", fontsize=20)
g.set_ylabel("Count",fontsize= 17)
g.set_xlabel("Hours of Day", fontsize=17)
sizes=[]
for p in g.patches:
    height = p.get_height()
    sizes.append(height)

g.set_ylim(0, max(sizes) * 1.15)

plt.subplot(212)
g1 = sns.countplot(x="Month", data=df_train, hue='City', dodge=True)
g1.set_title("Count Distribution by Month and City", fontsize=20)
g1.set_ylabel("Count",fontsize= 17)
g1.set_xlabel("Months", fontsize=17)
sizes=[]
for p in g1.patches:
    height = p.get_height()
    sizes.append(height)

g1.set_ylim(0, max(sizes) * 1.15)

plt.subplots_adjust(hspace = 0.3)

plt.show()

Again, Philadelphia comes out on top in the count of traffic records. However, a higher record count alone does not show that Philadelphia has more traffic; it may simply have more data. We could check this by comparing how much actual stopping there is in Philadelphia traffic versus the other cities' traffic.
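
One way to make that check concrete is to compare, per city, the record count with the mean stopping time and the share of records with any stopping at all. The sketch below runs on a hypothetical six-row frame with made-up values; only the column names `City` and `TotalTimeStopped_p80` come from the dataset.

```python
import pandas as pd

# Hypothetical records: Philadelphia has more rows but less stopping per row.
toy = pd.DataFrame({
    "City": ["Philadelphia"] * 4 + ["Boston"] * 2,
    "TotalTimeStopped_p80": [0.0, 10.0, 0.0, 30.0, 20.0, 40.0],
})

# Per city: number of rows, mean stopping time, share of rows with stopping > 0.
summary = toy.groupby("City")["TotalTimeStopped_p80"].agg(
    rows="count",
    mean_stopped="mean",
    share_nonzero=lambda s: (s > 0).mean(),
)
print(summary)
```

In this toy case the city with the most rows has the lower mean stopping time, which is the situation the paragraph above warns about.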

## Map Analysis

Let's plot out the intersections for each city using the latitude and longitude in the dataset

In [None]:
# 6 variables to be predicted
y1 = "TotalTimeStopped_p20"
y2 = "TotalTimeStopped_p50"
y3 = "TotalTimeStopped_p80"
y4 = "DistanceToFirstStop_p20"
y5 = "DistanceToFirstStop_p50"
y6 = "DistanceToFirstStop_p80"

# Group data by City
df_train_A = df_train[df_train['City']=='Atlanta']
df_train_B = df_train[df_train['City']=='Boston']
df_train_C = df_train[df_train['City']=='Chicago']
df_train_P = df_train[df_train['City']=='Philadelphia']

Atlanta: number of records at each intersection

In [None]:
# Use folium to plot where the intersections are in this city.
# The heatmap below is weighted by the number of records at each (Latitude, Longitude).
# A follow-up could weight by TotalTimeStopped to see where the heaviest traffic is.
traffic_df = (
    df_train_A.groupby(['Latitude', 'Longitude'])['IntersectionId']
    .count()
    .rename('count1')
    .reset_index()
)
lats = traffic_df[['Latitude', 'Longitude', 'count1']].values.tolist()
    
hmap = folium.Map(location=[33.7638493,-84.3801108], zoom_start=12)
hmap.add_child(HeatMap(lats, radius = 6))
hmap

Boston: number of records at each intersection

In [None]:
# Use folium to plot where the intersections are in this city.
# The heatmap below is weighted by the number of records at each (Latitude, Longitude).
# A follow-up could weight by TotalTimeStopped to see where the heaviest traffic is.
traffic_df = (
    df_train_B.groupby(['Latitude', 'Longitude'])['IntersectionId']
    .count()
    .rename('count1')
    .reset_index()
)
lats = traffic_df[['Latitude', 'Longitude', 'count1']].values.tolist()
    
hmap = folium.Map(location=[42.3158246,-71.0787574], zoom_start=12)
hmap.add_child(HeatMap(lats, radius = 6))
hmap

Chicago: number of records at each intersection

In [None]:
# Use folium to plot where the intersections are in this city.
# The heatmap below is weighted by the number of records at each (Latitude, Longitude).
# A follow-up could weight by TotalTimeStopped to see where the heaviest traffic is.
traffic_df = (
    df_train_C.groupby(['Latitude', 'Longitude'])['IntersectionId']
    .count()
    .rename('count1')
    .reset_index()
)
lats = traffic_df[['Latitude', 'Longitude', 'count1']].values.tolist()
    
hmap = folium.Map(location=[41.8420892,-87.7237629], zoom_start=11)
hmap.add_child(HeatMap(lats, radius = 6))
hmap

Philadelphia: number of records at each intersection

In [None]:
# Use folium to plot where the intersections are in this city.
# The heatmap below is weighted by the number of records at each (Latitude, Longitude).
# A follow-up could weight by TotalTimeStopped to see where the heaviest traffic is.
traffic_df = (
    df_train_P.groupby(['Latitude', 'Longitude'])['IntersectionId']
    .count()
    .rename('count1')
    .reset_index()
)
lats = traffic_df[['Latitude', 'Longitude', 'count1']].values.tolist()
    
hmap = folium.Map(location=[39.9484792,-75.1774329], zoom_start=12)
hmap.add_child(HeatMap(lats, radius = 6))
hmap

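The maps suggest that the records are concentrated around each city center. Since the heatmaps are weighted by record count, they show where the data is dense rather than congestion directly.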

The objective of the project is to predict congestion, based on aggregate measures of stopping distance (p20, p50, p80) and waiting time (p20, p50, p80), at intersections in 4 major US cities: Atlanta, Boston, Chicago & Philadelphia.
For a preliminary study, the simplest prediction is the mean value of each target, grouped by intersection within each city.

In [None]:
df_train_A[['IntersectionId',y1,y2,y3,y4,y5,y6]].groupby('IntersectionId').mean().head(6)

In [None]:
df_train_B[['IntersectionId',y1,y2,y3,y4,y5,y6]].groupby('IntersectionId').mean().head(6)

In [None]:
df_train_C[['IntersectionId',y1,y2,y3,y4,y5,y6]].groupby('IntersectionId').mean().head(6)

In [None]:
df_train_P[['IntersectionId',y1,y2,y3,y4,y5,y6]].groupby('IntersectionId').mean().head(6)

The above is just a preliminary study to get a sense of the data.
Next, these other factors need to be accounted for when training the model:
- Intersection ID
- Direction: entry or exit with 8 directions: E, N, NE, NW, S, SE, SW, W
- Hour
- Weekend
- Month

The strategy is to use a linear regression model to predict the congestion. From the above, it is observed that 'IntersectionId' values can be the same across cities even though they refer to different locations. For example, 'IntersectionId'=2 exists in both Atlanta and Chicago, although these are two different intersections. To prevent this from confusing the model, a separate regression model is fitted for each of the 4 cities, and the predictions are combined at the end for submission.
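
The ID collision described above can be detected directly by counting how many cities each 'IntersectionId' appears in. This is a minimal sketch on a hypothetical five-row frame; the IDs are made up apart from the shared ID 2 mentioned in the text.

```python
import pandas as pd

# Hypothetical intersections: ID 2 appears in two different cities.
toy = pd.DataFrame({
    "City": ["Atlanta", "Atlanta", "Chicago", "Chicago", "Boston"],
    "IntersectionId": [2, 5, 2, 7, 9],
})

# IDs that occur in more than one city are ambiguous on their own.
cities_per_id = toy.groupby("IntersectionId")["City"].nunique()
shared_ids = cities_per_id[cities_per_id > 1].index.tolist()

# Splitting by city removes the ambiguity: within one city each ID is unique here.
per_city = {city: part for city, part in toy.groupby("City")}

print(shared_ids)
```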

# Preprocessing

In [None]:
# Encoding the train data with the 8 different directions using .get_dummies to create new columns
# Using array for different cities

ar_train = [df_train_A,df_train_B,df_train_C,df_train_P]
ar_entry = [1,1,1,1]
ar_exit = [1,1,1,1]

for i in range (0,4):
    ar_entry[i] = pd.get_dummies(ar_train[i]["EntryHeading"],prefix = 'n')
    ar_exit[i] = pd.get_dummies(ar_train[i]["ExitHeading"],prefix = 'x')
    ar_train[i] = pd.concat([ar_train[i],ar_entry[i]],axis=1)
    ar_train[i] = pd.concat([ar_train[i],ar_exit[i]],axis=1)

ar_train[0].head()

In [None]:
# Encoding the test data same as the above

df_test_A = df_test[df_test['City']=='Atlanta']
df_test_B = df_test[df_test['City']=='Boston']
df_test_C = df_test[df_test['City']=='Chicago']
df_test_P = df_test[df_test['City']=='Philadelphia']

ar_test = [df_test_A,df_test_B,df_test_C,df_test_P]

for i in range (0,4):
    ar_entry[i] = pd.get_dummies(ar_test[i]["EntryHeading"],prefix = 'n')
    ar_exit[i] = pd.get_dummies(ar_test[i]["ExitHeading"],prefix = 'x')
    ar_test[i] = pd.concat([ar_test[i],ar_entry[i]],axis=1)
    ar_test[i] = pd.concat([ar_test[i],ar_exit[i]],axis=1)

In [None]:
# Creating the x_train, x_test, y_train and y_test variables for each city

x_train = [1,1,1,1] 
x_test = [1,1,1,1]
y1_train = [1,1,1,1]
y1_test =[1,1,1,1]
y2_train = [1,1,1,1]
y2_test =[1,1,1,1]
y3_train = [1,1,1,1]
y3_test =[1,1,1,1]
y4_train = [1,1,1,1]
y4_test =[1,1,1,1]
y5_train = [1,1,1,1]
y5_test =[1,1,1,1]
y6_train = [1,1,1,1]
y6_test =[1,1,1,1]

columns = ["IntersectionId","Hour","Weekend","Month",'n_E','n_N', 'n_NE', 'n_NW', 'n_S', 'n_SE', 'n_SW', 'n_W', 'x_E','x_N', 'x_NE', 'x_NW', 'x_S', 'x_SE', 'x_SW', 'x_W']
for i in range (0,4):
    x_train[i] = ar_train[i][columns]
    x_test[i] = ar_test[i][columns]
    y1_train[i] = ar_train[i][y1]
    y2_train[i] = ar_train[i][y2]
    y3_train[i] = ar_train[i][y3]
    y4_train[i] = ar_train[i][y4]
    y5_train[i] = ar_train[i][y5]
    y6_train[i] = ar_train[i][y6]

In [None]:
# Check the correlation matrix before proceeding to the learning process
# All correlation values are below 0.55, so it is fine to proceed
corr = x_train[0].corr()
# Styler.set_precision was removed in pandas 2.0; format(precision=2) replaces it
corr.style.background_gradient(cmap='coolwarm').format(precision=2)
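
Instead of reading the threshold off the colored matrix, the largest absolute pairwise correlations can be listed numerically. The sketch below uses randomly generated, hypothetical features (including a deliberately redundant `HourCopy` column) rather than the real `x_train`; the 0.55 threshold is the one quoted in the cell above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical features; HourCopy is a near-duplicate of Hour on purpose.
toy = pd.DataFrame({
    "Hour": rng.integers(0, 24, 200),
    "Weekend": rng.integers(0, 2, 200),
    "Month": rng.integers(1, 13, 200),
})
toy["HourCopy"] = toy["Hour"] + rng.normal(0, 0.1, 200)

corr = toy.corr().abs()
# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)

top_pair, top_value = pairs.index[0], pairs.iloc[0]
high = pairs[pairs > 0.55]  # pairs above the threshold
print(high)
```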

# Algorithm selection and implementation

Several regression methods, namely LinearRegression, Ridge and Lasso with different alpha values, were tried in this study. The best Kaggle score was obtained with linear_model.Lasso(alpha=0.15), so this model is used in the project.
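
The selection above was made by Kaggle score. As a complement, candidate models could also be compared offline with cross-validated RMSE. The sketch below is only an illustration under stated assumptions: the data is synthetic, the Ridge alpha of 1.0 is an arbitrary choice, and only `Lasso(alpha=0.15)` is taken from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Synthetic regression problem: two informative features out of five.
X = rng.normal(size=(300, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

candidates = {
    "linear": LinearRegression(),
    "ridge(alpha=1.0)": Ridge(alpha=1.0),
    "lasso(alpha=0.15)": Lasso(alpha=0.15),
}

rmse_by_model = {}
for name, model in candidates.items():
    # scikit-learn reports the negated RMSE so that larger is better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_root_mean_squared_error")
    rmse_by_model[name] = -scores.mean()

best_name = min(rmse_by_model, key=rmse_by_model.get)
print(rmse_by_model, best_name)
```

The ranking on synthetic data says nothing about the ranking on the traffic data; the point is only the comparison pattern.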

In [None]:
# Import the regression package and create the model
from sklearn import linear_model
regression = linear_model.Lasso(alpha=0.15)

In [None]:
# execute the learning process
for i in range (0,4):
    regression.fit(x_train[i],y1_train[i])
    y1_test[i] = regression.predict(x_test[i])

    regression.fit(x_train[i],y2_train[i])
    y2_test[i] = regression.predict(x_test[i])
    
    regression.fit(x_train[i],y3_train[i])
    y3_test[i] = regression.predict(x_test[i])
    
    regression.fit(x_train[i],y4_train[i])
    y4_test[i] = regression.predict(x_test[i])
    
    regression.fit(x_train[i],y5_train[i])
    y5_test[i] = regression.predict(x_test[i])
    
    regression.fit(x_train[i],y6_train[i])
    y6_test[i] = regression.predict(x_test[i])

In [None]:
# Confirm that the total number of predictions equals the number of rows in the sample submission file before combining the y_test lists

6*(len(y1_test[0])+len(y1_test[1])+len(y1_test[2])+len(y1_test[3])) == len(df_subm)

In [None]:
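# Combining all y_test: for each city in turn and each test row, append the six targets consecutively
# This relies on df_subm listing its rows in that same order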

y_test = []
for i in range (0,4):
    for j in range(len(y1_test[i])):
        for k in [y1_test[i],y2_test[i],y3_test[i],y4_test[i],y5_test[i],y6_test[i]]:
            y_test.append(k[j])
            
len(y_test)

In [None]:
# Preview result
y_test

In [None]:
# Combine result with submission file and save to csv
df_subm["Target"] = y_test
df_subm.to_csv("CZ4041.csv",index = False)
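
Assigning the combined list positionally works only if the prediction order matches the row order of the submission file. A more defensive variant builds an explicit key per prediction and merges on it. The sketch below is hypothetical throughout: the key column name `TargetId` and its `RowId_target` format are assumptions not shown in this notebook, and the row IDs and predictions are made up.

```python
import pandas as pd

row_ids = [7, 3]   # hypothetical test RowIds, deliberately unsorted
n_targets = 6
# preds[k][j] is the made-up prediction of target k for the j-th test row.
preds = [[10 * k + j for j in range(len(row_ids))] for k in range(n_targets)]

# One record per (row, target) with an explicit key (assumed format "RowId_k").
records = [
    {"TargetId": f"{row_id}_{k}", "Target": preds[k][j]}
    for j, row_id in enumerate(row_ids)
    for k in range(n_targets)
]
pred_df = pd.DataFrame(records)

# Hypothetical sample submission that lists the keys in its own order.
sample = pd.DataFrame(
    {"TargetId": [f"{r}_{k}" for r in sorted(row_ids) for k in range(n_targets)]}
)

# Merge on the key so the row order of the predictions no longer matters.
submission = sample.merge(pred_df, on="TargetId", how="left",
                          validate="one_to_one")
print(submission)
```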

## Side Study on Cross-Validation (CV) of the Training Dataset

In [None]:
# Use KFold to split the data into 5 folds
from sklearn.model_selection import KFold
X = np.array(x_train[0])
y = np.array(y1_train[0])  # convert to an array so the positional fold indices apply
kf = KFold(n_splits=5)  # define the split into 5 folds
print(kf)

# Enumerate splits (fold variables are named *_fold so they do not overwrite y_test above)
for train_index, test_index in kf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train_fold, X_test_fold = X[train_index], X[test_index]
    y_train_fold, y_test_fold = y[train_index], y[test_index]
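
`KFold.split` yields positional indices. When the target is a pandas Series cut from a larger frame, its index labels need not start at 0, so position-based selection with `.iloc` (or a prior conversion to a NumPy array) is the safe choice. A minimal sketch with a hypothetical six-element Series:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical target whose index labels start at 100, like a slice of a bigger frame.
y_series = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                     index=[100, 101, 102, 103, 104, 105])
X = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3)
fold_sizes = []
for train_idx, test_idx in kf.split(X):
    # .iloc selects by position, matching the indices KFold produces.
    y_tr = y_series.iloc[train_idx]
    y_te = y_series.iloc[test_idx]
    fold_sizes.append((len(y_tr), len(y_te)))

# Without shuffling, the first test fold is the first two positions.
first_test_values = y_series.iloc[next(kf.split(X))[1]].tolist()
print(fold_sizes, first_test_values)
```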

Next, some metrics are reviewed: R2, mean absolute error (MAE), root mean squared error (RMSE) and the cross-validated score (CVS). The CVS is based on 5-fold splitting and, with scikit-learn's default scoring for regressors, is itself an R2 value. Results are grouped by city and by the 6 targets (y1 to y6).
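
For reference, the three error metrics can be written out directly in NumPy. The helper name `regression_metrics` and the four-point example are hypothetical; the formulas are the standard definitions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return R2, MAE and RMSE for two equally long sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    mae = np.mean(np.abs(resid))              # mean absolute error
    rmse = np.sqrt(np.mean(resid ** 2))       # root mean squared error
    ss_res = np.sum(resid ** 2)               # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return {"r2": r2, "mae": mae, "rmse": rmse}

m = regression_metrics([0.0, 10.0, 20.0, 30.0], [5.0, 10.0, 15.0, 30.0])
print(m)
```

Here the residuals are -5, 0, 5 and 0, giving an MAE of 2.5, an RMSE of the square root of 12.5 and an R2 of 0.9.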

In [None]:
from sklearn.model_selection import train_test_split,cross_val_score
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

City=['Atlanta','Boston','Chicago','Philadelphia']
y = [y1,y2,y3,y4,y5,y6]
y_train=[y1_train,y2_train,y3_train,y4_train,y5_train,y6_train]
CVS = []

for j in range (0,6):
    print("\n----- Score for",y[j],"-----")
    CVS.append([])
    for i in range (0,4):

        X_train, X_test, Y_train, Y_test = train_test_split(x_train[i], y_train[j][i], 
                                                        random_state=100, 
                                                        test_size= 0.15)
        regression = linear_model.LinearRegression()
        regression.fit(X_train, Y_train)

        fitted_values = regression.predict(X_test)

        r2 = r2_score(Y_test, fitted_values)
        mae = mean_absolute_error(Y_test, fitted_values)  
        rmse = mean_squared_error(Y_test, fitted_values)**0.5
        cvs = cross_val_score(regression, X_train, Y_train, cv=5)
        CVS[j].append(np.average(cvs))
        
        print(City[i])
        print("  r2", r2)
        print("  MAE", mae)
        print("  RMSE", rmse)
        print("  Cross-Validated Score (5 fold):", cvs)
        print("  Average CVS:", np.average(cvs))
        
        #Scroll output below to preview all

In [None]:
print('CVS of y1 to y6 for each City:')
for i in range (0,4):
    print(" -", City[i], np.transpose(CVS)[i])

print('\nAverage CVS of y1 to y6 for each City:')
for i in range (0,4):
    print(" -", City[i], np.average(np.transpose(CVS)[i]))
    
print('\nAverage CVS of all cities combined')    
print(" -", np.average(CVS))

The average cross-validated score (5-fold) over the training data is about 1.5%, i.e. an R2 of roughly 0.015. Atlanta, Boston and Philadelphia have similar CVS values in the range 1.5-2.0%, whereas Chicago falls behind at 0.7%. This may be because Chicago has the least data, as shown in the Exploratory Analysis graph. The R2 values are quite small, meaning the linear model explains only a small share of the variance with the inputs used (IntersectionId, Hour, Weekend, Month, and the entry and exit directions). The MAE and RMSE values are relatively small in absolute terms, but given the low R2 this mostly reflects the scale of the targets rather than a good fit.
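
To put such a score in context: when R2 and RMSE are computed on the same data, the model's RMSE relative to a baseline that always predicts the mean of that data equals the square root of (1 - R2). The helper name below is hypothetical, and 0.015 is the average score quoted above.

```python
import math

def rmse_relative_to_baseline(r2):
    """Ratio of model RMSE to the RMSE of always predicting the mean, on the same data."""
    return math.sqrt(1.0 - r2)

# An R2 of 0.015 corresponds to an RMSE only slightly below the mean-only baseline.
ratio = rmse_relative_to_baseline(0.015)
print(round(ratio, 4))
```

A ratio this close to 1 means the error is nearly the same as that of the mean-only baseline.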

# Commentary
The dataset leans heavily towards Philadelphia, which can skew the result of our model. From the hour and month plots we can draw some observations:
* There is less data in the early hours of the morning for all cities, increasing throughout the day
* Philadelphia data count peaks between 3pm-7pm
* Boston peaks at 10am and gradually falls
* Chicago closely follows Boston's trend throughout the day
* Atlanta stays constant throughout the day starting around 7-8am

When it comes to months of the year:
* There is significantly less data in the spring
* Data count for Boston increases towards the end of the year
* Philadelphia has the highest data count of any city

Latitude and Longitude can help us pinpoint exactly where the vehicle was at the time the data was captured. 'EntryStreetName' and 'ExitStreetName', together with 'EntryHeading' and 'ExitHeading', provide directional data, which can tell us in *what* direction traffic is flowing. Perhaps traffic is heavier on certain streets in one direction only (think commuters).

'Month', 'Weekend' and 'Hour' are really interesting because they give us a chance to do analysis across time. Seasons change, which can bring more congestion (think of how much traffic slows during snowstorms as opposed to a sunny day). School breaks influence traffic, and so do important holidays and recurring cultural events.

In the end, this study achieved a **Private Score of 77.600 and a Public Score of 79.675** on Kaggle. All the datasets provided were used to predict the congestion.