# BigQuery-Geotab Intersection Congestion
## Can you predict wait times at major city intersections?

Many people prefer to travel by car, and every driver wants to reach their destination while spending as little time as possible stuck in traffic. Being able to predict traffic at intersections lets us plan a route ahead of time and avoid busy streets and intersections. In this project, we train machine learning models to predict how long vehicles wait at an intersection and how congested that intersection is.


### Objective:

The objective is to predict congestion, based on an aggregate measure of stopping distance and waiting times, at intersections in 4 major US cities: Atlanta, Boston, Chicago & Philadelphia.

### Data columns:

#### 1. Independent Variables (Features)
- IntersectionId: A unique identifier for an intersection of roads within a city.
- Latitude: The latitude of the intersection.
- Longitude: The longitude of the intersection.
- EntryStreetName: The name of the street from which the vehicle entered the intersection.
- ExitStreetName: The name of the street onto which the vehicle exited the intersection.
- EntryHeading: The direction the vehicle was heading when it entered the intersection.
- ExitHeading: The direction the vehicle was heading after it passed through the intersection.
- Hour: The hour of the day.
- Weekend: Whether the day is a weekend or not.
- Month: The month of the year.
- Path: A concatenation in the format: EntryStreetName_EntryHeading ExitStreetName_ExitHeading.
- City: Name of the city.

#### 2. Dependent Variables (Targets)
- TotalTimeStopped_p20: 20th percentile of the total time vehicles spent stopped at the intersection.
- TotalTimeStopped_p40: 40th percentile of the total time vehicles spent stopped at the intersection.
- TotalTimeStopped_p50: 50th percentile of the total time vehicles spent stopped at the intersection.
- TotalTimeStopped_p60: 60th percentile of the total time vehicles spent stopped at the intersection.
- TotalTimeStopped_p80: 80th percentile of the total time vehicles spent stopped at the intersection.
- TimeFromFirstStop_p20: 20th percentile of the time from a vehicle's first stop until it passes through the intersection.
- TimeFromFirstStop_p40: 40th percentile of the time from a vehicle's first stop until it passes through the intersection.
- TimeFromFirstStop_p50: 50th percentile of the time from a vehicle's first stop until it passes through the intersection.
- TimeFromFirstStop_p60: 60th percentile of the time from a vehicle's first stop until it passes through the intersection.
- TimeFromFirstStop_p80: 80th percentile of the time from a vehicle's first stop until it passes through the intersection.
- DistanceToFirstStop_p20: 20th percentile of the distance from the intersection to the point where vehicles first stopped.
- DistanceToFirstStop_p40: 40th percentile of the distance from the intersection to the point where vehicles first stopped.
- DistanceToFirstStop_p50: 50th percentile of the distance from the intersection to the point where vehicles first stopped.
- DistanceToFirstStop_p60: 60th percentile of the distance from the intersection to the point where vehicles first stopped.
- DistanceToFirstStop_p80: 80th percentile of the distance from the intersection to the point where vehicles first stopped.

#### 3. Target Output (based on Competition's Rules)

The competition asks for six targets: the 20th, 50th and 80th percentiles of the total time stopped at an intersection, and the 20th, 50th and 80th percentiles of the distance between the intersection and the first place the vehicle stopped and started waiting.

- TotalTimeStopped_p20
- TotalTimeStopped_p50
- TotalTimeStopped_p80
- DistanceToFirstStop_p20
- DistanceToFirstStop_p50
- DistanceToFirstStop_p80
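
To make the percentile targets concrete, here is a minimal sketch of how a percentile summarises the stop times of individual vehicles. The stop times are hypothetical, and the nearest-rank method is an assumption: the dataset does not state which percentile definition was used.

```python
import math

def nearest_rank_percentile(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted values
    return ordered[rank - 1]

# Hypothetical stop times (seconds) of ten vehicles at one intersection
stop_times = [0, 0, 0, 0, 0, 0, 5, 12, 20, 41]

for p in (20, 50, 80):
    print(f"p{p}: {nearest_rank_percentile(stop_times, p)} s")
```

In this sample most vehicles never stop, so the 20th and 50th percentiles are both 0 and only the 80th percentile shows any waiting. The real targets can show the same pattern.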

## 1. Data Analysis

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, KFold, RepeatedKFold
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor,GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.multioutput import MultiOutputRegressor, RegressorChain

In [None]:
import warnings 
warnings.simplefilter(action = "ignore")

In [None]:
# Reading Train & Test datasets
df_train = pd.read_csv('../input/bigquery-geotab-intersection-congestion/train.csv')
df_test = pd.read_csv('../input/bigquery-geotab-intersection-congestion/test.csv')

In [None]:
# display shape of the 2 datasets 
# Train dataset has 15 more columns than Test dataset and Test dataset has twice the number of observations
print ("shape of train dataset :", df_train.shape)
print ("shape of test dataset :", df_test.shape)

In [None]:
# display the first 5 observations of Train dataset 
df_train.head()

In [None]:
# display the first 5 observations of Test dataset 
df_test.head()

In [None]:
# train dataset info - 6 columns with text values and 22 with numeric values
df_train.info()

In [None]:
# test dataset info - 6 columns with text values and 7 with numeric values
df_test.info()

In [None]:
# display the common columns between the 2 datasets 
print ("Common columns between Train & Test datasets :", np.intersect1d(df_train.columns, df_test.columns).tolist())

In [None]:
# display the columns that appear only in the Train dataset 
print ("Columns in Train dataset only :", df_train.columns.difference(df_test.columns).values )

In [None]:
# Same Cities are used in Train & Test datasets
print ("Cities in Train dataset:", df_train['City'].unique().tolist())
print ("Cities in Test dataset:", df_test['City'].unique().tolist())

In [None]:
# Number of observations per "City" in Train & Test dataset - similar distribution of data per city 
train_city = df_train.groupby('City').size().reset_index().rename(columns={0:'train'})
test_city = df_test.groupby('City').size().reset_index().rename(columns={0:'test'})

data = train_city.merge(test_city, on='City').sort_values('test')
display (data)

sns.barplot(x='City',y='value',hue='variable',data=data.melt(id_vars='City', value_vars=['train','test']))

In [None]:
# Number of Intersections per City in Train & Test dataset - similar distribution of data per city 
train_intersection = df_train[['City', 'IntersectionId']].drop_duplicates().groupby('City').size().reset_index().rename(columns={0:'train'})
test_intersection = df_test[['City', 'IntersectionId']].drop_duplicates().groupby('City').size().reset_index().rename(columns={0:'test'})

data = train_intersection.merge(test_intersection, on='City').sort_values('train')
display(data)

sns.barplot(x='City',y='value',hue='variable',data=data.melt(id_vars='City', value_vars=['train','test']))

In [None]:
# Number of observations per Month in Train & Test dataset - Small amount of data for Jan & May and missing data for Feb-Apr
train_months = df_train.groupby('Month').size().reset_index().rename(columns={0:'train'})
test_months = df_test.groupby('Month').size().reset_index().rename(columns={0:'test'})

data = train_months.merge(test_months, on='Month')
display(data)

sns.barplot(x='Month',y='value',hue='variable',data=data.melt(id_vars='Month', value_vars=['train','test']))

In [None]:
# Number of observations per Hour in Train & Test dataset - Similar distribution of data in Train & Test datasets
train_hours = df_train.groupby('Hour').size().reset_index().rename(columns={0:'train'})
test_hours = df_test.groupby('Hour').size().reset_index().rename(columns={0:'test'})

data = train_hours.merge(test_hours, on='Hour')
#display(data)

sns.barplot(x='Hour',y='value',hue='variable',data=data.melt(id_vars='Hour', value_vars=['train','test']))

Both datasets have a similar distribution of data across the 4 cities and cover the same months

Now we will focus our analysis on the Train dataset

### 1.1 Descriptive Statistics

**Train dataset includes :**
- Total time stopped = the amount of time spent at 0 speed
- Time from first stop = time from the first stop until the vehicle passes through the intersection
- Distance to first stop = the distance from the center of the intersection to the first stop, to give an idea of queue length

The data is presented as percentiles

In [None]:
# Descriptive statistics of the "Total time stopped"
df_train[['TotalTimeStopped_p20','TotalTimeStopped_p40','TotalTimeStopped_p50',
          'TotalTimeStopped_p60','TotalTimeStopped_p80']].describe().T

On average, the first 60% of cars didn't stop at the intersection, while the cars at the 80th percentile stopped for 16 seconds

In [None]:
# Descriptive statistics of the "Time from first stop" 
df_train[['TimeFromFirstStop_p20','TimeFromFirstStop_p40', 'TimeFromFirstStop_p50','TimeFromFirstStop_p60',
          'TimeFromFirstStop_p80']].describe().T

On average, the time from the first stop until the vehicle passes through the intersection is 27 seconds at the 80th percentile

In [None]:
# Descriptive statistics of the "Distance to first stop" 
df_train[['DistanceToFirstStop_p20','DistanceToFirstStop_p40', 'DistanceToFirstStop_p50','DistanceToFirstStop_p60',
          'DistanceToFirstStop_p80']].describe().T

On average, the distance from the intersection to the first stop is 60 'meters' at the 80th percentile

### 1.2 Traffic Congestion by City

In [None]:
# Averages per City based on 50 & 80 percentile 
df_train.groupby('City').agg({'TotalTimeStopped_p50':'mean','TimeFromFirstStop_p50':'mean','DistanceToFirstStop_p50':'mean','TotalTimeStopped_p80':'mean','TimeFromFirstStop_p80':'mean','DistanceToFirstStop_p80':'mean'})

In [None]:
# Mean of the median stop time per intersection, plotted at each intersection's coordinates
data = df_train.groupby(['City','IntersectionId','Latitude','Longitude']).agg({'TotalTimeStopped_p50':'mean'}).reset_index()

fig,axes=plt.subplots(nrows=2, ncols=2, figsize=(15,10))
for i,city in enumerate(data['City'].unique().tolist()):   
    # Longitude on x and Latitude on y so that north is at the top of each map
    sns.scatterplot(x='Longitude',y='Latitude',data=data[data['City']==city],hue='TotalTimeStopped_p50',ax=axes[i%2,i//2],legend=False)
    axes[i%2,i//2].set_title(city)
    axes[i%2,i//2].set_xlabel('')
    axes[i%2,i//2].set_ylabel('')

In [None]:
# Hourly Traffic per City on Weekdays using 80 percentile 
data = df_train[df_train['Weekend']==0].groupby(['City','Hour']).agg({'TotalTimeStopped_p80':'mean'}).reset_index()

fig,axes = plt.subplots(nrows=1, ncols=data['City'].nunique(), figsize=(20,4), sharey=True)
for i,city in enumerate(data['City'].unique()):
    sns.barplot(data=data[data['City']==city] ,x='Hour', y='TotalTimeStopped_p80',ax=axes[i], color='C0')
    axes[i].set_ylabel('')
    axes[i].set_title(city)
    axes[i].set_xlabel('')
    axes[i].get_xaxis().set_ticks([])
    axes[i].spines['top'].set_visible(False)
    axes[i].spines['right'].set_visible(False)
plt.subplots_adjust(top=0.8)
fig.suptitle('Hourly Traffic on Weekdays')

In [None]:
# Hourly Traffic per City on Weekends using 80 percentile 
data = df_train[df_train['Weekend']==1].groupby(['City','Hour']).agg({'TotalTimeStopped_p80':'mean'}).reset_index()

fig,axes = plt.subplots(nrows=1, ncols=data['City'].nunique(), figsize=(20,4), sharey=True)
for i,city in enumerate(data['City'].unique()):
    sns.barplot(data=data[data['City']==city] ,x='Hour', y='TotalTimeStopped_p80',ax=axes[i], color='C0')
    axes[i].set_ylabel('')
    axes[i].set_title(city)
    axes[i].set_xlabel('')
    axes[i].get_xaxis().set_ticks([])
    axes[i].spines['top'].set_visible(False)
    axes[i].spines['right'].set_visible(False)
plt.subplots_adjust(top=0.8)
fig.suptitle('Hourly Traffic on Weekends')

Atlanta is more congested than the other cities, with an average waiting time of 9.7 seconds and an average distance to the intersection of 30 'meters', followed by Boston and Chicago

### 1.3 Monthly Traffic Analysis

In [None]:
# Monthly Traffic per Hour using 80 percentile 
data = df_train.groupby(['Month','Hour']).agg({'TotalTimeStopped_p80':'mean'}).reset_index()

fig,axes = plt.subplots(nrows=data['Month'].nunique()//3, ncols=data['Month'].nunique()//3, figsize=(15,8), sharey=True,sharex=True)
for i,month in enumerate(sorted(data['Month'].unique())):
    sns.barplot(data=data[data['Month']==month] ,x='Hour', y='TotalTimeStopped_p80',ax=axes[i%3,i//3], color='C0')
    axes[i%3,i//3].set_title(f'month = {month}')
    axes[i%3,i//3].set_ylabel('')
    axes[i%3,i//3].set_xlabel('')
    axes[i%3,i//3].get_xaxis().set_ticks([])
    axes[i%3,i//3].spines['top'].set_visible(False)
    axes[i%3,i//3].spines['right'].set_visible(False)

In [None]:
# Monthly averages based on 50 & 80 percentile 
df_train.groupby('Month').agg({'Month':'count','TotalTimeStopped_p50':'mean','TimeFromFirstStop_p50':'mean','DistanceToFirstStop_p50':'mean','TotalTimeStopped_p80':'mean','TimeFromFirstStop_p80':'mean','DistanceToFirstStop_p80':'mean'}).rename(columns = {'Month':'Count'})

- Little data is provided for Jan & May, and no data is provided for Feb to April
- Jun to Dec show a similar trend: traffic is lowest from midnight to early morning, then peaks from 7am to 9am and again from 3pm to 5pm, which are the typical commuting times to and from work

### 1.4 Data Correlation

In [None]:
# Train correlation - "Total Time", "Time from First Stop" & "Distance from First Stop" are all positively correlated 
corr = df_train.iloc[:,12:-1].corr()
mask = np.triu(np.ones_like(corr,dtype=bool))
cmap = sns.diverging_palette(250,15,s=75,l=40, n=9, center='light', as_cmap=True)
fig = plt.figure(figsize=(12,12))
sns.heatmap(corr, mask=mask, cmap=cmap, annot=True, fmt='.2f')

## 2. Data Preprocessing

### 2.1 Missing Observation Analysis

In [None]:
# check for Null Values in Train dataset - EntryStreetName & ExitStreetName are the only 2 columns with missing values 
df_train.isna().sum()

In [None]:
# Ratio of missing data in Train dataset
df_train[['EntryStreetName','ExitStreetName']].isna().sum() / df_train.shape[0]

In [None]:
# check for Null Values in Test dataset - EntryStreetName & ExitStreetName are the only 2 columns with missing values 
df_test.isna().sum()

In [None]:
# Ratio of missing data in Test dataset
df_test[['EntryStreetName','ExitStreetName']].isna().sum() / df_test.shape[0]

In [None]:
# inspect some of the rows to see if we can fill the missing data
df_train[df_train['EntryStreetName'].isna()].tail()

In [None]:
# "East" entry of Intersection# 1696 in Philadelphia shows missing EntryStreetName 
# this means we cannot simply fill the data by using Intersection# & City
df_train[(df_train['IntersectionId']==1696) & (df_train['City']=='Philadelphia')].groupby(['EntryStreetName','ExitStreetName','EntryHeading','ExitHeading','Path'], dropna=False).size()

In [None]:
# An intersection can have multiple entries and exits - Intersection# 0 in Boston has 4 Entries (NE,E,W,S) & 5 Exits (NE,NW,E,W,S)
df_train[(df_train['IntersectionId']==0) & (df_train['City']=='Boston')][['EntryStreetName','ExitStreetName','EntryHeading','ExitHeading','Path']].value_counts()

Here we notice a couple of things: 
- there are different types of streets (Avenues / Streets / Highways, etc.)
- each intersection has a different number of entries and exits, which are identified by direction 

In [None]:
# type of streets can be identified from the Street Name
data = df_train[['City','EntryStreetName','IntersectionId']].drop_duplicates()
print("Number of Avenues :", data['EntryStreetName'].str.contains('Avenue').sum())
print("Number of Streets :", data['EntryStreetName'].str.contains('Street').sum())
print("Number of Boulevards :", data['EntryStreetName'].str.contains('Boulevard').sum())
print("Number of Roads:", data['EntryStreetName'].str.contains('Road').sum())
print("Number of Highways :", data['EntryStreetName'].str.contains('Highway').sum())
print("Number of Drives :", data['EntryStreetName'].str.contains('Drive').sum())
print("Number of Parkways :", data['EntryStreetName'].str.contains('Parkway').sum())

In [None]:
# using directions to identify possible EntryHeading & ExitHeading values
print (df_train['EntryHeading'].unique())
print (df_train['ExitHeading'].unique())

Although we didn't fill or drop the missing data for StreetName, we identified 2 aspects of the data that can help in our modeling. These will be further explored in the Feature Engineering section. 

Furthermore, StreetName will not be used in our model so having the missing data will not affect our analysis

### 2.2 Outlier Observation Analysis

In [None]:
# Total time stopped = the amount of time spent at 0 speed
cols = ['TotalTimeStopped_p20','TotalTimeStopped_p40','TotalTimeStopped_p50', 
        'TotalTimeStopped_p60', 'TotalTimeStopped_p80']
sns.boxplot(data=df_train[cols],orient='h')

In [None]:
# Time from first stop = time from the first stop until the vehicle passes through the intersection
cols = ['TimeFromFirstStop_p20', 'TimeFromFirstStop_p40','TimeFromFirstStop_p50', 
        'TimeFromFirstStop_p60', 'TimeFromFirstStop_p80']
sns.boxplot(data=df_train[cols],orient='h')

In [None]:
# Distance to first stop
cols = ['DistanceToFirstStop_p20', 'DistanceToFirstStop_p40', 'DistanceToFirstStop_p50',
       'DistanceToFirstStop_p60', 'DistanceToFirstStop_p80']
sns.boxplot(data=df_train[cols],orient='h')

The data is presented in percentiles up to the 80th, so the most extreme congestion values above the 80th percentile are already excluded from the data. Therefore, we will not drop the outliers, but we will normalize the data before modeling

## 3. Feature Engineering

### 3.1 Street Type

Street Type can affect traffic as smaller roads tend to be busier while wider roads tend to be faster. Street Type can be extracted from "Street Name". 

In [None]:
# We start by creating columns to identify the type of the street, derived from the street name
str_code = ['Avenue','Street','Boulevard','Road','Highway','Drive','Parkway','Square','Way','Ave','St','Pkwy','Lane','Circle','Place','Other']
str_name = ['Avenue','Street','Boulevard','Road','Highway','Drive','Parkway','Square','Way','Avenue','Street','Parkway','Lane','Circle','Place','Other']

for df in (df_train, df_test):
    for side in ('Entry', 'Exit'):
        # default for missing names and for names that match no pattern
        df[f'{side}StreetType'] = 'Other'
        for code, name in zip(str_code, str_name):
            # na=False treats a missing street name as "no match"; later patterns override earlier ones
            mask = df[f'{side}StreetName'].str.contains(code, na=False)
            df.loc[mask, f'{side}StreetType'] = name

In [None]:
print (df_train[(df_train['EntryStreetType']=='Other') & ~(df_train['EntryStreetName'].isna())]['EntryStreetName'].unique())

In [None]:
# Number of intersection per "EntryStreetType" - mostly Streets and Avenues
df_train[['City','EntryStreetType','IntersectionId']].drop_duplicates().groupby('EntryStreetType',dropna=False).size().sort_values()

In [None]:
# Averages per Street Type based on 50 & 80 percentile 
df_train.groupby('EntryStreetType').agg({'RowId':'count','TotalTimeStopped_p50':'mean','TimeFromFirstStop_p50':'mean','DistanceToFirstStop_p50':'mean','TotalTimeStopped_p80':'mean','TimeFromFirstStop_p80':'mean','DistanceToFirstStop_p80':'mean'}).rename(columns = {'RowId':'Count'}).reset_index().sort_values('TotalTimeStopped_p50', ascending=False)
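
The substring matching used above can over-match: the code 'St', for example, also matches a hypothetical name such as 'State Road' and labels it as a Street. The sketch below is one possible alternative, not the method used in this notebook. It matches whole words only and prefers the last recognised word in the name. The `STREET_TYPES` table and the `street_type` helper are illustrative names.

```python
import re

# Illustrative token -> type table; abbreviations map to the full name
STREET_TYPES = {
    "avenue": "Avenue", "ave": "Avenue",
    "street": "Street", "st": "Street",
    "boulevard": "Boulevard", "road": "Road", "highway": "Highway",
    "drive": "Drive", "parkway": "Parkway", "pkwy": "Parkway",
    "square": "Square", "way": "Way", "lane": "Lane",
    "circle": "Circle", "place": "Place",
}

def street_type(name):
    """Classify a street name by its last recognised whole word."""
    if not isinstance(name, str):
        return "Other"  # missing names (NaN) are floats, not strings
    tokens = re.findall(r"[a-z]+", name.lower())
    for token in reversed(tokens):
        if token in STREET_TYPES:
            return STREET_TYPES[token]
    return "Other"

for example in ["Marietta Street Northwest", "State Road", "Peachtree St NE", "Broadway"]:
    print(example, "->", street_type(example))
```

With pandas, such a helper could be applied as `df_train['EntryStreetName'].map(street_type)`. It would produce somewhat different labels from the loop above, so any comparison of results would need to be rerun.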

### 3.2 Number of Entries & Exits

Number of Entries and Exits of the intersection can affect the traffic and waiting time. This data will be extracted from "EntryHeading" & "ExitHeading"

In [None]:
# We can create columns to identify the number of directions for each intersection
entry_data = df_train[['City','IntersectionId','EntryHeading']].drop_duplicates().groupby(['City','IntersectionId']).agg({'EntryHeading':'count'}).reset_index().rename(columns={'EntryHeading':'EntryCount'})
exit_data = df_train[['City','IntersectionId','ExitHeading']].drop_duplicates().groupby(['City','IntersectionId']).agg({'ExitHeading':'count'}).reset_index().rename(columns={'ExitHeading':'ExitCount'})

# Then we add Number of Entries & Exits for each intersection
df_train = df_train.merge(entry_data, on=['City','IntersectionId'], how='left')
df_train = df_train.merge(exit_data, on=['City','IntersectionId'], how='left')
df_train.head()

In [None]:
# Replicate for Test dataset
entry_data = df_test[['City','IntersectionId','EntryHeading']].drop_duplicates().groupby(['City','IntersectionId']).agg({'EntryHeading':'count'}).reset_index().rename(columns={'EntryHeading':'EntryCount'})
exit_data = df_test[['City','IntersectionId','ExitHeading']].drop_duplicates().groupby(['City','IntersectionId']).agg({'ExitHeading':'count'}).reset_index().rename(columns={'ExitHeading':'ExitCount'})

df_test = df_test.merge(entry_data, on=['City','IntersectionId'], how='left')
df_test = df_test.merge(exit_data, on=['City','IntersectionId'], how='left')
df_test.head()

In [None]:
# Averages per 'Entry Count' based on 50 & 80 percentile - intersections with more entries are busier
df_train.groupby('EntryCount').agg({'RowId':'count','TotalTimeStopped_p50':'mean','TimeFromFirstStop_p50':'mean','DistanceToFirstStop_p50':'mean','TotalTimeStopped_p80':'mean','TimeFromFirstStop_p80':'mean','DistanceToFirstStop_p80':'mean'}).rename(columns = {'RowId':'Count'}).reset_index().sort_values('TotalTimeStopped_p50', ascending=False)
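
The counting step can also be illustrated without pandas. This is a small sketch on hypothetical rows; it shows why the key has to include the city, since an IntersectionId is only unique within a city.

```python
from collections import defaultdict

# Hypothetical (city, intersection_id, entry_heading) observations
rows = [
    ("Boston", 0, "NE"), ("Boston", 0, "E"), ("Boston", 0, "E"),
    ("Boston", 0, "W"), ("Boston", 1, "N"), ("Atlanta", 0, "S"),
]

# Collect the distinct entry headings seen for each (city, intersection) pair
headings = defaultdict(set)
for city, intersection_id, heading in rows:
    headings[(city, intersection_id)].add(heading)

entry_count = {key: len(values) for key, values in headings.items()}
print(entry_count)
```

Intersection 0 in Boston and intersection 0 in Atlanta are counted separately, which is what grouping by both 'City' and 'IntersectionId' achieves in the cells above.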

### 3.3 Turn Type

The type of turn (left turn / right turn / same entry and exit direction, etc.) can affect congestion and waiting time. This is extracted from EntryHeading & ExitHeading
 
But to implement that, first we need to map directions while keeping the right sequence: N = 1, NE = 2, E = 3 ...etc 

In [None]:
# EntryHeading & ExitHeading while keeping the sequence in order
heading_map = {'N':1,'NE':2,'E':3,'SE':4,'S':5, 'SW':6, 'W':7, 'NW': 8}

df_train['EntryHeading'] = df_train['EntryHeading'].map(heading_map)
df_train['ExitHeading'] = df_train['ExitHeading'].map(heading_map)
df_test['EntryHeading'] = df_test['EntryHeading'].map(heading_map)
df_test['ExitHeading'] = df_test['ExitHeading'].map(heading_map)

df_train.head()

In [None]:
# Turn Type - difference between Exit & Entry
df_train['TurnType'] = df_train['ExitHeading'] - df_train['EntryHeading']
df_test['TurnType'] = df_test['ExitHeading'] - df_test['EntryHeading']
df_train.head()

In [None]:
# Averages per 'TurnType' based on 50 & 80 percentile 
df_train.groupby('TurnType').agg({'RowId':'count','TotalTimeStopped_p50':'mean','TimeFromFirstStop_p50':'mean','DistanceToFirstStop_p50':'mean','TotalTimeStopped_p80':'mean','TimeFromFirstStop_p80':'mean','DistanceToFirstStop_p80':'mean'}).rename(columns = {'RowId':'Count'}).reset_index().sort_values('TotalTimeStopped_p50', ascending=False)
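
One caveat of the plain difference is that headings wrap around: N to NW gives 7 while NE to N gives -1, although both are the same slight left turn. The sketch below shows one possible way to wrap the difference into a signed number of 45-degree steps. It is an illustration of an alternative encoding, not the feature used in the models in this notebook, and `turn_steps` is a hypothetical helper name.

```python
HEADING_MAP = {'N': 1, 'NE': 2, 'E': 3, 'SE': 4, 'S': 5, 'SW': 6, 'W': 7, 'NW': 8}

def turn_steps(entry, exit_):
    """Signed number of 45-degree steps from entry to exit heading, in [-4, 3].

    0 = straight on, positive = clockwise (right), negative = counter-clockwise (left),
    -4 = U-turn.
    """
    diff = HEADING_MAP[exit_] - HEADING_MAP[entry]
    return (diff + 4) % 8 - 4

for entry, exit_ in [('N', 'NW'), ('NE', 'N'), ('N', 'E'), ('N', 'S'), ('W', 'W')]:
    print(f"{entry} -> {exit_}: {turn_steps(entry, exit_)}")
```

On the numeric columns created above, the same wrap could be written as `(df_train['ExitHeading'] - df_train['EntryHeading'] + 4) % 8 - 4`.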

### 3.4 Distance from City Centre

City centres are busier than the countryside. Distance to the city centre is calculated using Latitude and Longitude

In [None]:
def calc_distance(row):
    # Haversine (great-circle) distance in km between the intersection and its city centre
    R = 6373.0  # approximate Earth radius in km
    lat1, lon1 = np.radians(row['CCLatitude']), np.radians(row['CCLongitude'])
    lat2, lon2 = np.radians(row['Latitude']), np.radians(row['Longitude'])
    a = np.sin((lat2 - lat1) / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    return 2 * R * np.arcsin(np.sqrt(a))

In [None]:
# https://www.latlong.net/country/united-states-236.html
cities = ['Atlanta', 'Boston', 'Chicago', 'Philadelphia']
cc_lat = [33.753746, 42.361145,41.881832,39.952583]
cc_lon = [-84.386330,-71.057083,-87.623177,-75.165222]

# attach the city-centre coordinates to every row of its city
for c in range(len(cities)):
    df_train.loc[df_train['City']==cities[c], 'CCLatitude'] = cc_lat[c]
    df_train.loc[df_train['City']==cities[c], 'CCLongitude'] = cc_lon[c]
    df_test.loc[df_test['City']==cities[c], 'CCLatitude'] = cc_lat[c]
    df_test.loc[df_test['City']==cities[c], 'CCLongitude'] = cc_lon[c]

df_train['CCDist'] = df_train.apply(calc_distance, axis=1)
df_test['CCDist'] = df_test.apply(calc_distance, axis=1)
df_train.head()

### 3.5 Weather Information

Monthly weather including temperature & rain can affect congestion. This can also help generalize the model for the missing months 

In [None]:
# Use Month to identify the weather impact on traffic
# https://www.climatestotravel.com/
temp = [['Chicago',-4.6,-2.4,3.2,9.4,15.1,20.5,23.3,22.4,18.1,11.4,4.6,-2.3],
['Boston',-1.5,0,3,9,14.5,19.5,23,22,18,12,7,1.5],
['Atlanta',6.6,8.7,12.6,16.8,21.4,25.3,26.8,26.3,23.2,17.6,12.5,7.7],
['Philadelphia',0.5,2.1,6.4,12.2,17.7,23,25.6,24.8,20.6,14.2,8.6,3]]

rain = [['Chicago',45,45,65,85,95,90,95,125,80,80,80,55],
['Boston',85,85,110,95,90,95,85,85,85,100,100,95],
['Atlanta',105,120,120,85,95,100,135,100,115,85,105,100],
['Philadelphia',75,65,95,90,95,85,110,90,95,80,75,90]]

columns = ['City'] + np.linspace(1,12,12,dtype=int).tolist()

df_temp = pd.DataFrame(temp, columns = columns).set_index('City').unstack().reset_index()
df_temp.columns = ['Month','City','Temperature']

df_rain = pd.DataFrame(rain, columns = columns).set_index('City').unstack().reset_index()
df_rain.columns = ['Month','City','Rainfall']

df_train = df_train.merge(df_temp, on=['Month','City'])
df_train = df_train.merge(df_rain, on=['Month','City'])

df_test = df_test.merge(df_temp, on=['Month','City'])
df_test = df_test.merge(df_rain, on=['Month','City'])

df_train.head()

## 4. Data Scaling

In [None]:
df_train.columns

In [None]:
# features to be used in modeling 
features = ['Hour', 'Weekend','EntryStreetType', 'ExitStreetType', 'EntryCount', 
            'ExitCount','TurnType', 'CCDist','Temperature','Rainfall','City']

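# copy so that the encoding below does not modify (or warn about) the original dataframes
_df_train = df_train[features].copy()
_df_test = df_test[features].copy()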

Categorical variables in the dataset must be converted into numerical values before modeling. Here this transformation is performed with Label Encoding
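
As a minimal sketch of what label encoding does, the helper below assigns integer codes to the distinct labels in sorted order, which is also how scikit-learn's `LabelEncoder` orders its classes. `fit_label_encoder` is an illustrative name, not part of any library.

```python
def fit_label_encoder(values):
    """Map each distinct label to an integer code, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}

cities = ['Atlanta', 'Boston', 'Chicago', 'Philadelphia']
city_codes = fit_label_encoder(cities)
print(city_codes)

encoded = [city_codes[c] for c in ['Chicago', 'Atlanta', 'Chicago']]
print(encoded)
```

The codes imply an order (Atlanta < Boston < ...) that has no real meaning. Tree-based models usually cope with this, while linear or distance-based models may be affected by it, in which case one-hot encoding is a common alternative.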

In [None]:
# encoding city name
city_encoder = LabelEncoder().fit(cities)
_df_train['City'] = city_encoder.transform(_df_train['City'])
_df_test['City'] = city_encoder.transform(_df_test['City'])

# encoding street type - fit on every street type seen in either dataset (entry or exit)
street_cols = ['EntryStreetType', 'ExitStreetType']
StreetType = np.unique(np.concatenate([_df_train[street_cols].to_numpy().ravel(),
                                       _df_test[street_cols].to_numpy().ravel()]))
street_encoder = LabelEncoder().fit(StreetType)
for col in street_cols:
    _df_train[col] = street_encoder.transform(_df_train[col])
    _df_test[col] = street_encoder.transform(_df_test[col])

_df_train.head()

We can improve the performance of the models by scaling the features. Methods such as "Normalize", "MinMax", "Robust" and "Scale" can be used for this; here we use standardization (StandardScaler)
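
Standardization rescales each feature to zero mean and unit variance, z = (x - mean) / std, with the mean and standard deviation estimated on the training data only and then reused for the test data. The sketch below illustrates this on hypothetical hour values. It uses the population standard deviation, which is what `StandardScaler` uses, and the helper names are illustrative.

```python
import statistics

def fit_standardizer(train_values):
    """Estimate the mean and population standard deviation from training data."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)  # population std (divides by n)
    return mean, std

def standardize(values, mean, std):
    """Apply z = (x - mean) / std with previously fitted statistics."""
    return [(v - mean) / std for v in values]

train_hours = [0, 6, 12, 18]   # hypothetical training values
test_hours = [9, 21]           # hypothetical test values

mean, std = fit_standardizer(train_hours)
print(standardize(train_hours, mean, std))
print(standardize(test_hours, mean, std))  # scaled with the *training* statistics
```

Fitting on the training data and only transforming the test data keeps information from the test set out of the preprocessing step.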

In [None]:
scaler = StandardScaler().fit(_df_train)
df_train_scaled = scaler.transform(_df_train)
df_test_scaled = scaler.transform(_df_test)

## 5. PCA

Principal Component Analysis, or PCA, is a dimensionality-reduction method that is often used to reduce the dimensionality of large data sets, by transforming a large set of variables into a smaller one that still contains most of the information in the large set.

In [None]:
# We create PCA and plot variance explained
pca = PCA()
pca.fit(df_train_scaled)
components = range(pca.n_components_)  # separate name so the `features` list is not overwritten
plt.bar(components, pca.explained_variance_)
plt.ylabel('variance explained')
plt.xlabel('PCA feature')

In [None]:
# We also plot the explained variance ratio of each component.
plt.plot(pca.explained_variance_ratio_)
plt.xlabel('component index')
plt.ylabel('explained variance ratio')

In [None]:
# We also plot the cumulative explained variance ratio.
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative variance')
plt.ylim([0,1.1])

In [None]:
# use the same 0.01 threshold for the count and for the retained variance
n_kept = (pca.explained_variance_ratio_ >= 0.01).sum()
print ("Number of components with explained variance ratio >= 0.01 :", n_kept)
print (f"Total explained variance retained : {pca.explained_variance_ratio_[:n_kept].sum():2.4f}")

In [None]:
# create a pca dataframe based on 90% explained variance retained 
pca = PCA(n_components=.90).fit(df_train_scaled)
pca_train = pca.transform(df_train_scaled)
pca_test = pca.transform(df_test_scaled)
col_lst = []
for i in range(0,pca_train.shape[1]):
    col_lst.append(f'PC{i}')
    
df_pca_train = pd.DataFrame(pca_train,columns=col_lst)
df_pca_test = pd.DataFrame(pca_test,columns=col_lst)
df_pca_train.head()
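
Passing a fraction such as `n_components=.90` asks PCA to keep the smallest number of leading components whose cumulative explained variance ratio exceeds that fraction. The selection rule can be sketched as follows; the ratios are hypothetical, not the values from this dataset, and `components_for_variance` is an illustrative helper.

```python
from itertools import accumulate

def components_for_variance(ratios, threshold):
    """Smallest number of leading components whose cumulative ratio exceeds threshold."""
    for k, total in enumerate(accumulate(ratios), start=1):
        if total > threshold:
            return k
    return len(ratios)

# Hypothetical explained-variance ratios, sorted in decreasing order
ratios = [0.30, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05]
print(components_for_variance(ratios, 0.90))
```

When the ratios decay slowly, as in this example, a 90% target removes only a few components, so the reduction in dimensionality is modest.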

## 6. Base Models

Since this is a supervised regression problem, we will use regression models (LR/KNN/DT/RF/GB) and compare the results.

We will use Root Mean Squared Error (rmse) as the criterion to evaluate the performance of each model.
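
Because we predict several targets at once, the rmse here is taken over all rows and all target columns together, with every target weighted equally. The sketch below works through a tiny example with hypothetical values; it is meant to mirror taking the square root of the default multi-output mean squared error, not to replace the scikit-learn call used in the cells that follow.

```python
import math

def rmse(y_true, y_pred):
    """RMSE over all rows and all target columns, weighting every target equally."""
    squared_errors = [
        (t - p) ** 2
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical values: two rows, two targets (seconds stopped, distance to first stop)
y_true = [[10.0, 50.0], [0.0, 0.0]]
y_pred = [[12.0, 44.0], [1.0, 3.0]]
print(round(rmse(y_true, y_pred), 3))
```

Note that targets on a larger scale (the distances in this example) contribute more to the combined error than targets on a smaller scale (the times).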

In [None]:
# Trying both the scaled & PCA data to see if we can maintain good accuracy level with less features
X = df_train_scaled 
X_pca = df_pca_train
y = df_train[['TotalTimeStopped_p20','TotalTimeStopped_p50','TotalTimeStopped_p80',
     'DistanceToFirstStop_p20','DistanceToFirstStop_p50','DistanceToFirstStop_p80']]

# split the data into train and test 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.3, random_state=42)

In [None]:
# Create function to run different models and return rmse
def modeling(X_train, X_test, y_train, y_test):
    models = []
    models.append(('LR', LinearRegression()))
    models.append(('KNN', KNeighborsRegressor()))
    models.append(('DT', DecisionTreeRegressor(random_state = 1)))
    models.append(('RF', RandomForestRegressor(random_state = 1)))
    models.append(('GB', MultiOutputRegressor(GradientBoostingRegressor(random_state = 1))))
#    models.append(('RG-SVR', RegressorChain(SVR())))

    #Evaluate each model in turn
    names = []
    rmses = []
    
    for name, model in models:
#    cv = RepeatedKFold(n_splits=3, n_repeats=3, random_state=1)
#    cv_scores = cross_val_score(model, X_train, y_train, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
        model.fit (X_train, y_train)
        y_pred = model.predict(X_test)
        mse = mean_squared_error (y_test, y_pred)
        rmse = np.sqrt(mse)
        print (f'{name} : mse {mse} - rmse {rmse}')
        names.append(name)
        rmses.append(rmse)
    return names, rmses

In [None]:
# run modeling function on scaled data
print ("Scaled Data Modeling :")
names, rmses = modeling (X_train, X_test, y_train, y_test)

In [None]:
# run modeling function on pca data
print ("PCA Data Modeling :")
pca_names, pca_rmses = modeling (X_train_pca, X_test_pca, y_train_pca, y_test_pca)

In [None]:
# plot and compare the results of scaled data and pca data
results = pd.DataFrame([range(1,6),names,rmses, pca_rmses]).T
results.columns = ['idx','names', 'scaled data', 'pca data']
print (results.melt(id_vars=['names','idx']))
ax = sns.barplot(x='idx',y='value',hue='variable', data=results.melt(id_vars=['names','idx']))
ax.set_xticklabels(names)
ax.set_xlabel("model")
ax.set_ylabel("rmse")

The best rmse result was obtained when we ran RF on scaled data. 

In general, models trained on the scaled data performed better than those trained on the PCA data, since the scaled data keeps all of the original features. The size of the gap varies by model: for LR the difference in rmse is minimal, while for DT it is quite high



## 7. Model Tuning

### 7.1 Random Forests Tuning

Since RF performed best in our initial testing, we start by trying to tune the RF parameters.

In [None]:
# I tried different parameter grids, but the best results came from the default parameters,
# so the grid is commented out to make the notebook run faster.
rf_params = {}  # "n_estimators": [50, 100],
                # "min_samples_split": [10, 20],
                # "max_depth": [10, 20]}

In [None]:
rf_model = RandomForestRegressor(random_state = 1)

In [None]:
rf_cv = GridSearchCV(rf_model, 
                    rf_params,
                    cv = 3).fit(X_train, y_train)

In [None]:
rf_cv.best_params_
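
With an empty grid, `GridSearchCV` only cross-validates the default configuration, so `best_params_` is an empty dict. For reference, here is a minimal sketch of what a non-empty search could look like. It assumes synthetic data and a deliberately small grid (not the values commented out above), and it sets `scoring` to RMSE explicitly, because the default for a regressor is the estimator's own `score`, i.e. R². `X_demo`, `y_demo`, `demo_params` and `demo_cv` are illustrative names.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic two-target data standing in for X_train / y_train (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=6, n_targets=2,
                                 noise=5.0, random_state=1)

# Small illustrative grid; 2 x 2 combinations, each evaluated with 3-fold CV
demo_params = {"n_estimators": [10, 20],
               "max_depth": [3, 6]}

demo_cv = GridSearchCV(RandomForestRegressor(random_state=1),
                       demo_params,
                       scoring="neg_root_mean_squared_error",
                       cv=3).fit(X_demo, y_demo)

print("best params :", demo_cv.best_params_)
print("best cv rmse :", -demo_cv.best_score_)
```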

In [None]:
rf_tuned = rf_cv.best_estimator_
y_pred = rf_tuned.predict(X_test)
rf_mse = mean_squared_error(y_test, y_pred)
rf_rmse = np.sqrt(rf_mse)
print ('mse :', rf_mse)
print ('rmse :', rf_rmse) 

In [None]:
feature_imp = pd.Series(rf_tuned.feature_importances_,
                        index=_df_train.columns).sort_values(ascending=False)

sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Significance Score Of Variables')
plt.ylabel('Variables')
plt.title("Random Forest Feature Importances")
plt.show()

### 7.2 Gradient Boosting Tuning

Next, we tune the GB parameters.

In [None]:
gb_params = {}  # "n_estimators": [50, 100, 200],
                # "min_samples_split": [5, 10, 15],
                # "max_depth": [5, 10, 20]}

In [None]:
gb_model = MultiOutputRegressor(GradientBoostingRegressor(random_state = 1))

In [None]:
gb_cv = GridSearchCV(gb_model, 
                    gb_params,
                    cv = 3).fit(X_train, y_train)

In [None]:
gb_tuned = gb_cv.best_estimator_
y_pred = gb_tuned.predict(X_test)
gb_mse = mean_squared_error(y_test, y_pred)
gb_rmse = np.sqrt(gb_mse)
print ('mse :', gb_mse)
print ('rmse :', gb_rmse)

In [None]:
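# MultiOutputRegressor fits one GB model per target; estimators_[0] is the model
# for the first target only, so this plot shows importances for that target
feature_imp = pd.Series(gb_tuned.estimators_[0].feature_importances_,
                        index=_df_train.columns).sort_values(ascending=False)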

sns.barplot(x=feature_imp, y=feature_imp.index)
plt.xlabel('Significance Score Of Variables')
plt.ylabel('Variables')
plt.title("Gradient Boosting Feature Importances (first target)")
plt.show()
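
Because `MultiOutputRegressor` keeps a separate fitted model per target in `estimators_`, the plot above reflects only the first target. One way to summarize all targets is to average the per-target importances. This is a sketch on synthetic data; `gb_demo`, `feature_names` and `mean_imp` are illustrative names, and averaging is one reasonable choice rather than the notebook's method.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic three-target data standing in for the notebook's training set (assumption)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, n_targets=3,
                                 noise=5.0, random_state=1)
feature_names = [f"f{i}" for i in range(X_demo.shape[1])]

gb_demo = MultiOutputRegressor(
    GradientBoostingRegressor(n_estimators=30, random_state=1)
).fit(X_demo, y_demo)

# One row of importances per target, one column per feature
per_target = np.vstack([est.feature_importances_ for est in gb_demo.estimators_])

# Average across targets and rank the features
mean_imp = pd.Series(per_target.mean(axis=0),
                     index=feature_names).sort_values(ascending=False)
print(mean_imp)
```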

## 8. Comparison of Final Models

In [None]:
print ('RF model : mse :', rf_mse, '- rmse :', rf_rmse) 
print ('GB model : mse :', gb_mse, '- rmse :', gb_rmse) 

RF shows a better (lower) RMSE than GB.
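
The RMSE values above are computed by averaging the MSE over all target columns and then taking the square root. A per-target breakdown can be obtained with `multioutput="raw_values"`. The small hand-made arrays below are purely illustrative.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Two samples, two targets; the second target has the larger error
y_true_demo = np.array([[0.0, 0.0], [0.0, 0.0]])
y_pred_demo = np.array([[1.0, 3.0], [1.0, 3.0]])

# Same computation as in the notebook: sqrt of the MSE averaged over targets
overall_rmse = np.sqrt(mean_squared_error(y_true_demo, y_pred_demo))

# One RMSE per target column
per_target_rmse = np.sqrt(
    mean_squared_error(y_true_demo, y_pred_demo, multioutput="raw_values")
)

print("overall rmse :", overall_rmse)
print("per-target rmse :", per_target_rmse)
```

Here the per-target MSEs are 1 and 9, so the overall RMSE is the square root of 5 rather than the mean of the per-target RMSEs (2). The difference matters when the targets have different scales.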

## 9. Final Model Installation

### 9.1 Random Forest

In [None]:
# predict test dataset
rf_pred = rf_tuned.predict(df_test_scaled)

In [None]:
# create TargetId & Target as required by the competition 
rf_submission = pd.DataFrame (rf_pred, columns=range(0,6)).reset_index()
rf_submission = rf_submission.melt(id_vars='index', value_vars=range(0,6), value_name='Target')
rf_submission['TargetId'] = rf_submission['index'].astype(str) + '_' + rf_submission['variable'].astype(str)
rf_submission.sort_values(['index','variable'], inplace=True)

In [None]:
rf_submission.shape

In [None]:
sample = pd.read_csv("../input/bigquery-geotab-intersection-congestion/sample_submission.csv")
sample.shape

In [None]:
# The number of rows required in the submission file is slightly less - re-adjust the size of the submission file
rf_results = rf_submission.merge(sample[['TargetId']], on='TargetId', how='inner')
rf_results.shape

In [None]:
rf_results[['TargetId','Target']].to_csv('rf_submission.csv', index=False)
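
To make the reshaping above easier to follow, here is the same wide-to-long step on a toy prediction array with two rows and six target columns. `toy_pred` and `toy_sample` are made-up stand-ins for `rf_pred` and the competition's sample submission.

```python
import numpy as np
import pandas as pd

# Two test rows x six targets; values 0..11 so each cell is easy to trace
toy_pred = np.arange(12, dtype=float).reshape(2, 6)

# Wide frame with the row number exposed as an 'index' column
wide = pd.DataFrame(toy_pred, columns=range(6)).reset_index()

# Long format: one row per (test row, target column) pair
long = wide.melt(id_vars="index", value_vars=list(range(6)), value_name="Target")
long["TargetId"] = long["index"].astype(str) + "_" + long["variable"].astype(str)
long = long.sort_values(["index", "variable"]).reset_index(drop=True)

# Keep only the ids listed in the (toy) sample submission
toy_sample = pd.DataFrame({"TargetId": ["0_0", "0_1", "1_5"]})
toy_results = long.merge(toy_sample[["TargetId"]], on="TargetId", how="inner")

print(long[["TargetId", "Target"]].head(7))
print(toy_results[["TargetId", "Target"]])
```

Each `TargetId` is `<row>_<target column>`, so `1_5` is the sixth target of the second test row.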

### 9.2 Gradient Boosting 

In [None]:
# predict test dataset
gb_pred = gb_tuned.predict(df_test_scaled)

In [None]:
gb_submission = pd.DataFrame (gb_pred, columns=range(0,6)).reset_index()
gb_submission = gb_submission.melt(id_vars='index', value_vars=range(0,6), value_name='Target')
gb_submission['TargetId'] = gb_submission['index'].astype(str) + '_' + gb_submission['variable'].astype(str)
gb_submission.sort_values(['index','variable'], inplace=True)

In [None]:
gb_submission.shape

In [None]:
# The number of rows required in the submission file is slightly less - re-adjust the size of the submission file
gb_results = gb_submission.merge(sample[['TargetId']], on='TargetId', how='inner')
gb_results.shape

In [None]:
gb_results[['TargetId','Target']].to_csv('gb_submission.csv', index=False)

## 10. Reporting

The aim of this study was to create regression models to predict traffic congestion in 4 major cities in the US.
The work done is as follows:

1) The train and test data sets were read.

2) Exploratory Data Analysis: The structure of the data set was checked, the variable types were examined, and its size was determined. There are missing values in the data set, but they do not affect the modeling. Descriptive statistics of the data set were examined.

3) Data Preprocessing: The outliers were identified and the X variables were standardized with the robust method.

4) Model Building: Linear Regression, KNN, Decision Tree, Random Forest & Gradient Boosting machine learning models were trained. The Random Forest and Gradient Boosting hyperparameters were then tuned to try to improve accuracy.

5) Result: The model produced by Random Forest hyperparameter optimization had the lowest RMSE value (63.8).
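
As a reminder of what robust standardization does, here is a small sketch that assumes the "robust method" in step 3 refers to scikit-learn's `RobustScaler`; the preprocessing code itself is outside this section. Each value is centered on the median and divided by the interquartile range, so a single outlier barely changes how the other points are scaled.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature with a clear outlier (100)
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaler = RobustScaler()  # defaults: center on the median, scale by IQR (25th-75th percentile)
x_scaled = scaler.fit_transform(x_demo)

print("median :", scaler.center_)
print("iqr :", scaler.scale_)
print(x_scaled.ravel())
```

For this toy column the median is 3 and the IQR is 2, so the inliers map to -1, -0.5, 0 and 0.5 while the outlier maps to 48.5.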

## 11. Final Remarks

Both RF & GB results were submitted to the Kaggle competition.
- RF submission got a score of 97.790061
- GB submission got a score of 82.170315