**Intersection Congestion in 4 major US cities: Atlanta, Boston, Chicago & Philadelphia**

We’ve all been there: Stuck at a traffic light, only to be given mere seconds to pass through an intersection, behind a parade of other commuters. Imagine if you could help city planners and governments anticipate traffic hot spots ahead of time and reduce the stop-and-go stress of millions of commuters like you.

Geotab provides a wide variety of aggregate datasets gathered from commercial vehicle telematics devices. Harnessing the insights from this data has the power to improve safety, optimize operations, and identify opportunities to address infrastructure challenges.

The dataset for this competition includes aggregate stopped vehicle information and intersection wait times. Your task is to predict congestion, based on an aggregate measure of stopping distance and waiting times, at intersections in 4 major US cities: Atlanta, Boston, Chicago & Philadelphia.

![](https://hotshotwarriors.com/wp-content/uploads/2018/03/I-95-Traffic.jpg)

**Geotab** is advancing security, connecting commercial vehicles to the internet and providing web-based analytics to help customers better manage their fleets. Geotab’s open platform and Marketplace, offering hundreds of third-party solution options, allows both small and large businesses to automate operations by integrating vehicle data with their other data assets. As an IoT hub, the in-vehicle device provides additional functionality through IOX Add-Ons. Processing billions of data points a day, Geotab leverages data analytics and machine learning to help customers improve productivity, optimize fleets through the reduction of fuel consumption, enhance driver safety, and achieve strong compliance with regulatory changes. Geotab’s products are represented and sold worldwide through Authorized Geotab Resellers. To learn more, visit www.geotab.com

This competition is being hosted in partnership with **BigQuery**, a data warehouse for manipulating, joining, and querying large scale tabular datasets. BigQuery also offers BigQuery ML, an easy way for users to create and run machine learning models to generate predictions through a SQL query interface.

Alright, stop waiting and get started!

**What is BigQuery ML and when should you use it?**

BigQuery Machine Learning (BQML) is a toolset that allows you to train and serve machine learning models directly in BigQuery. This has several advantages:

* You don't have to read your data into local memory. One question I get a lot is "how can I train my ML model if my dataset is just too big to fit on my computer?". You can subsample your dataset, of course, but you can also use tools like BQML that train your model directly in your database.
* You don't have to use multiple languages. Particularly if you're working in a team where most of your teammates don't know Python or R or your preferred language for modelling, working in SQL can make it easier for you to collaborate.
* You can serve your model immediately after it's trained. Because your model is already in the same place as your data, you can make predictions directly from your database. This lets you get around the hassle of cleaning up your code and either putting it into production or passing it off to your engineering colleagues.

BQML probably won't replace all your modelling tools, but it's a nice quick way to train and serve a model without spending a lot of time moving code or data around.

**Models supported by BQML**

One limitation of BQML is that only a limited number of model types are supported. As of August 6, 2019, BQML supports the following types of models. More model types are being built out, though, so check the documentation for the most up-to-date list.

* Linear regression (LINEAR_REG). This is the OG modelling technique, used to predict the value of a continuous variable. This is what you'd use for questions like "how many units can we expect a customer to buy?".
* Logistic regression (LOGISTIC_REG). This regression technique lets you classify which category an observation fits into. For example, "will this person buy the blue one or the red one?".
* K-means (KMEANS). This is an unsupervised clustering algorithm. It lets you identify categories. For example, "given all of the customers in our database, how could we identify five distinct groups?".
* TensorFlow (TENSORFLOW). If you've already got a trained TensorFlow model, you can upload it to BQML and serve it directly from there. You can't currently train a TensorFlow model in BQML.
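
As a rough illustration of what training one of these models looks like, the sketch below builds a BQML `CREATE MODEL` statement as a Python string. The dataset, model and table names (`my_dataset.stopped_time_model`, `my_dataset.train`) and the helper `build_create_model_sql` are hypothetical placeholders, and the query is only constructed here, not sent to BigQuery.

```python
def build_create_model_sql(model_path, model_type, label_col, feature_cols, source_table):
    """Build a BQML CREATE MODEL statement as a string (nothing is executed)."""
    columns = ",\n  ".join([label_col] + list(feature_cols))
    return (
        f"CREATE OR REPLACE MODEL `{model_path}`\n"
        f"OPTIONS(model_type='{model_type}', input_label_cols=['{label_col}']) AS\n"
        f"SELECT\n  {columns}\n"
        f"FROM `{source_table}`"
    )


# Hypothetical names: replace with your own dataset, model and table.
sql = build_create_model_sql(
    model_path="my_dataset.stopped_time_model",
    model_type="linear_reg",
    label_col="TotalTimeStopped_p50",
    feature_cols=["Hour", "Weekend", "Month", "Latitude", "Longitude"],
    source_table="my_dataset.train",
)
print(sql)
```

Since BQML trains one label per model, a setup like this would presumably need one statement per target column.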

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:

# Plotting
import seaborn as sns
import matplotlib.pyplot as plt

# Preprocessing, models and validation helpers
from sklearn.preprocessing import LabelEncoder
from sklearn import preprocessing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
# import keras
from bayes_opt import BayesianOptimization
import lightgbm as lgb
import os, sys


In [None]:
!pip install tensorflow --upgrade

In [None]:
import tensorflow as tf

In [None]:
print(tf.__version__)

In [None]:
from tensorflow import feature_column
from tensorflow.keras import layers

# 1. Data loading and Exploration 

In [None]:
# Load data 
train = pd.read_csv('../input/bigquery-geotab-intersection-congestion/train.csv')
test = pd.read_csv('../input/bigquery-geotab-intersection-congestion/test.csv')
submission = pd.read_csv('../input/bigquery-geotab-intersection-congestion/sample_submission.csv')

In [None]:
train

In [None]:
train.columns

In [None]:
test.columns

In [None]:
test

The target columns are not present in the test data, so we don't need to pop them.

In [None]:
train.describe()

In [None]:
train.columns

In [None]:
train.dtypes

In [None]:
train.info()

In [None]:
train.isnull().sum()


In [None]:
train.dropna(axis=0, inplace=True)

We are asked to predict TotalTimeStopped_p20, TotalTimeStopped_p50, TotalTimeStopped_p80, DistanceToFirstStop_p20, DistanceToFirstStop_p50 and DistanceToFirstStop_p80.

We also have a feature called TimeFromFirstStop_px in the training set that can be useful.

Other percentiles of the features mentioned above can also be found in the training set. It may be a good idea to predict all the percentiles and use them in a smart way to improve our results.

> Missing Values


In [None]:
train.isnull().sum()

In [None]:
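def missing_values(train):
    # Count missing values per column and express them as a share of all rows
    df = pd.DataFrame(train.isnull().sum()).reset_index()
    df.columns = ['Feature', 'Frequency']
    df['Percentage'] = (df['Frequency']/train.shape[0])*100
    # Sort on the numeric counts before formatting, so the order is numeric rather than alphabetical
    df.sort_values('Frequency', inplace = True, ascending = False)
    df['Percentage'] = df['Percentage'].astype(str) + '%'
    return df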

missing_values(train).head()

We have 2 features in the train set and test set that have missing values.

Let's check each feature.

In [None]:
# Find the numerical columns (select_dtypes is the public API for this)
num_cols = train.select_dtypes(include='number').columns
print("Numerical Columns")
print(num_cols)

# Get list of categorical variables
s = (train.dtypes == 'object')
object_cols = list(s[s].index)

print("Categorical variables:")
print(object_cols)
# Print the unique values of each categorical column
for object_col in object_cols:
    print("---------------------------")
    print(train[object_col].unique())

In [None]:
# Submission data
# In TargetId, the first number is the RowId and the second is the metric id
submission


# 2. Data Visualization

In [None]:
for i in ['TotalTimeStopped_p20', 'TotalTimeStopped_p50', 'TotalTimeStopped_p80', 'DistanceToFirstStop_p20', 
          'DistanceToFirstStop_p50', 'DistanceToFirstStop_p80']:
    plt.figure(figsize = (12, 8))
    plt.scatter(train.index, train[i])
    plt.title('{} distribution'.format(i))

There are a lot of zeros. Let's calculate the percentage of zeros in each of our target variables.



In [None]:
def tv_ratio(train, column):
    df = train[train[column]==0]
    ratio = df.shape[0] / train.shape[0]
    return ratio

target_variables = ['TotalTimeStopped_p20', 'TotalTimeStopped_p50', 'TotalTimeStopped_p80', 
                    'DistanceToFirstStop_p20', 'DistanceToFirstStop_p50', 'DistanceToFirstStop_p80']

for i in target_variables:
    print('{} have a 0 ratio of: '.format(i), tv_ratio(train, i))
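
The same ratios can also be computed for all columns at once. The sketch below uses a small made-up frame (not the competition data) to show the idea: comparing to 0 gives booleans, and the mean of booleans is the fraction of zeros.

```python
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "TotalTimeStopped_p20": [0, 0, 5, 0],
    "TotalTimeStopped_p50": [0, 10, 12, 0],
})

# Comparing to 0 gives booleans; their mean is the fraction of zeros per column.
zero_ratio = (toy == 0).mean()
print(zero_ratio)
```

On the real data this would be something like `(train[target_variables] == 0).mean()`.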

**Total Time Stopped**

In this section, we analyse the total time stopped at intersections in the different cities.



In [None]:
sns.set_style("whitegrid")
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(18, 15))

# One panel per city: mean 80th-percentile stopped time for each hour of the day
for ax, city in zip(axes.flat, ['Atlanta', 'Boston', 'Chicago', 'Philadelphia']):
    train[train['City'] == city].groupby('Hour')['TotalTimeStopped_p80'].mean().plot(
        ax=ax, title="{}'s Total Stoppage Time by Hour".format(city), color='r')

plt.show()

In [None]:
def plot_dist(train, test, column, type = 'kde', together = True):
    if type == 'kde':
        if together == False:
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12,8))
            sns.kdeplot(train[column], ax = ax1, color = 'blue', shade=True)
            ax1.set_title('{} distribution of the train set'.format(column))
            sns.kdeplot(test[column], ax = ax2, color = 'red', shade=True)
            ax2.set_title('{} distribution of the test set'.format(column))
            plt.show()
        else:
            fig , ax = plt.subplots(1, 1, figsize = (12,8))
            sns.kdeplot(train[column], ax = ax, color = 'blue', shade=True, label = 'Train {}'.format(column))
            sns.kdeplot(test[column], ax = ax, color = 'red', shade=True, label = 'Test {}'.format(column))
            ax.set_title('{} Distribution'.format(column))
            plt.show()
    else:
        if together == False:
            fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12,8))
            sns.distplot(train[column], ax = ax1, color = 'blue', kde = False)
            ax1.set_title('{} distribution of the train set'.format(column))
            sns.distplot(test[column], ax = ax2, color = 'red', kde = False)
            ax2.set_title('{} distribution of the test set'.format(column))
            plt.show()
        else:
            fig , ax = plt.subplots(1, 1, figsize = (12,8))
            sns.distplot(train[column], ax = ax, color = 'blue', kde = False)
            sns.distplot(test[column], ax = ax, color = 'red', kde = False)
            plt.show()
    
plot_dist(train, test, 'Latitude', type = 'kde', together = True)
plot_dist(train, test, 'Latitude', type = 'other', together = False)


**Time Features**

In [None]:
def get_frec(df, column):
    df1 = pd.DataFrame(df[column].value_counts(normalize = True)).reset_index()
    df1.columns = [column, 'Percentage']
    df1.sort_values(column, inplace = True, ascending = True)
    return df1


def plot_frec(train, test, column):
    # Share of rows per value of `column`, side by side for train and test
    df = get_frec(train, column)
    df1 = get_frec(test, column)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (12,8))
    # Pass x and y by keyword: recent seaborn versions no longer accept them positionally
    sns.barplot(x = df[column], y = df['Percentage'], ax = ax1, color = 'blue')
    ax1.set_title('{} percentages for the train set'.format(column))
    sns.barplot(x = df1[column], y = df1['Percentage'], ax = ax2, color = 'red')
    ax2.set_title('{} percentages for the test set'.format(column))
    
plot_frec(train, test, 'Month')

# 3. Pre-Processing 

In [None]:
train["same_street_exact"] = (train["EntryStreetName"] ==  train["ExitStreetName"]).astype(int)
test["same_street_exact"] = (test["EntryStreetName"] ==  test["ExitStreetName"]).astype(int)

**Skip OHE intersections for now - memory issues**

Intersection IDs aren't unique between cities, so we'll make new ones by combining the ID with the city name.

Fitting an encoder on train alone reveals that the test data contains a "novel" city + intersection combination ('3Atlanta'). (We will fix this.)

* This means we need to be careful when one-hot encoding (OHE) the data.
* There are 2,796 intersection IDs, and more (~4K) if we count them as unique per city. That means many, many columns, and it gave me memory issues when one-hot encoding.
* We could try count or target mean encoding instead.
* For now, we use ordinal encoding.
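
The count and target mean encodings mentioned above are not used in this notebook, but they could look roughly like the sketch below. The data is a made-up toy frame, and the new column names (`intersection_count`, `intersection_target_mean`) are hypothetical. Falling back to the global mean for unseen intersections is one possible choice among several.

```python
import pandas as pd

# Toy data; column names mirror the notebook, values are made up.
toy_train = pd.DataFrame({
    "Intersection": ["1Atlanta", "1Atlanta", "2Boston", "2Boston", "2Boston"],
    "TotalTimeStopped_p50": [10.0, 20.0, 0.0, 3.0, 6.0],
})
toy_test = pd.DataFrame({"Intersection": ["1Atlanta", "2Boston", "3Atlanta"]})

# Count encoding: how often each intersection appears in the training set.
counts = toy_train["Intersection"].value_counts()
toy_test["intersection_count"] = toy_test["Intersection"].map(counts).fillna(0).astype(int)

# Target mean encoding: average target per intersection, learned on train only.
means = toy_train.groupby("Intersection")["TotalTimeStopped_p50"].mean()
global_mean = toy_train["TotalTimeStopped_p50"].mean()
# Intersections never seen in train fall back to the global mean.
toy_test["intersection_target_mean"] = toy_test["Intersection"].map(means).fillna(global_mean)

print(toy_test)
```

Both encodings produce a single numeric column regardless of how many intersections there are, which avoids the memory cost of one-hot encoding. When target mean encoding is applied to the training rows themselves, it is usually computed out of fold to limit target leakage.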

In [None]:
le = preprocessing.LabelEncoder()


In [None]:
train["Intersection"] = train["IntersectionId"].astype(str) + train["City"]
test["Intersection"] = test["IntersectionId"].astype(str) + test["City"]

print(train["Intersection"].sample(6).values)

With the ordinal encoder, ideally we'd encode all the "new" values with a single missing-value code, but it doesn't really matter given that they're out of distribution anyway (no such values in train).
So we'll fit on train + test to avoid encoding errors when using the ordinal encoder. (This is less of an issue with OHE.)
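
For reference, the "single missing value" alternative described above could be sketched as follows. This is not what the notebook does; it is a minimal illustration on made-up values, using `pd.Categorical`, which assigns the code -1 to anything outside the categories it is given.

```python
import pandas as pd

toy_train = pd.Series(["1Atlanta", "2Boston", "1Atlanta"])
toy_test = pd.Series(["2Boston", "3Atlanta"])  # "3Atlanta" never appears in train

# Learn the categories from the training set only.
categories = sorted(toy_train.unique())

# pd.Categorical assigns code -1 to any value outside the given categories.
train_codes = pd.Categorical(toy_train, categories=categories).codes
test_codes = pd.Categorical(toy_test, categories=categories).codes

print(train_codes)
print(test_codes)
```

Here every unseen intersection shares the code -1, so the encoder never needs to see the test set.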

In [None]:
pd.concat([train["Intersection"],test["Intersection"]],axis=0).drop_duplicates().values

In [None]:
le.fit(pd.concat([train["Intersection"],test["Intersection"]]).drop_duplicates().values)
train["Intersection"] = le.transform(train["Intersection"])
test["Intersection"] = le.transform(test["Intersection"])

**OneHotEncode**

We could create one-hot encodings for the entry and exit direction fields, but it may make more sense to leave them as continuous.
Intersection ID is only unique within a city.
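
One way to treat the direction fields as continuous is to map compass labels to angles. The sketch below assumes the directions are the eight compass labels (N, NE, ..., NW); the degree mapping and the helper names `turn_angle` and `heading_to_sin_cos` are choices made for this illustration, not part of the notebook.

```python
import math

# Assumed compass labels; the mapping to degrees is a choice made for this sketch.
HEADING_DEGREES = {"N": 0, "NE": 45, "E": 90, "SE": 135,
                   "S": 180, "SW": 225, "W": 270, "NW": 315}


def turn_angle(entry, exit_):
    """Signed turn in degrees, in (-180, 180]: 0 = straight on, positive = clockwise (right)."""
    diff = (HEADING_DEGREES[exit_] - HEADING_DEGREES[entry]) % 360
    return diff - 360 if diff > 180 else diff


def heading_to_sin_cos(heading):
    """Encode a heading on the unit circle so that NW and N end up close together."""
    radians = math.radians(HEADING_DEGREES[heading])
    return math.sin(radians), math.cos(radians)


print(turn_angle("N", "E"))   # right turn
print(turn_angle("N", "W"))   # left turn
print(turn_angle("N", "S"))   # U-turn
```

The sine/cosine pair avoids the artificial jump between 315 and 0 degrees that a plain degree column would have.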

In [None]:
pd.get_dummies(train["City"],dummy_na=False, drop_first=False).head()


In [None]:
train = pd.concat([train,pd.get_dummies(train["City"],dummy_na=False, drop_first=False)],axis=1).drop(["City"],axis=1)
test = pd.concat([test,pd.get_dummies(test["City"],dummy_na=False, drop_first=False)],axis=1).drop(["City"],axis=1)

In [None]:
train.shape,test.shape


In [None]:
test.head()


In [None]:
train.columns


Approach: we will make 6 predictions based on the features we derived: IntersectionId, Hour, Weekend, Month, and entry & exit directions.

The target variables are TotalTimeStopped_p20, TotalTimeStopped_p50, TotalTimeStopped_p80, DistanceToFirstStop_p20, DistanceToFirstStop_p50 and DistanceToFirstStop_p80.

I leave in the original IntersectionId just in case there's meaning accidentally encoded in the numbers.

In [None]:
FEAT_COLS = ["IntersectionId",
             'Intersection',
             'same_street_exact',
             "Hour", "Weekend", "Month",
             'Latitude', 'Longitude',
             'Atlanta', 'Boston', 'Chicago',
             'Philadelphia']

In [None]:
train.head()


In [None]:
train.columns


In [None]:
X = train[FEAT_COLS]
y1 = train["TotalTimeStopped_p20"]
y2 = train["TotalTimeStopped_p50"]
y3 = train["TotalTimeStopped_p80"]
y4 = train["DistanceToFirstStop_p20"]
y5 = train["DistanceToFirstStop_p50"]
y6 = train["DistanceToFirstStop_p80"]

In [None]:
y = train[['TotalTimeStopped_p20', 'TotalTimeStopped_p50', 'TotalTimeStopped_p80',
        'DistanceToFirstStop_p20', 'DistanceToFirstStop_p50', 'DistanceToFirstStop_p80']]

In [None]:
testX = test[FEAT_COLS]


In [None]:
lr = RandomForestRegressor(n_estimators=100,min_samples_split=3)

In [None]:
lr.fit(X,y1)
pred1 = lr.predict(testX)
lr.fit(X,y2)
pred2 = lr.predict(testX)
lr.fit(X,y3)
pred3 = lr.predict(testX)
lr.fit(X,y4)
pred4 = lr.predict(testX)
lr.fit(X,y5)
pred5 = lr.predict(testX)
lr.fit(X,y6)
pred6 = lr.predict(testX)


# Appending all predictions
all_preds = []
for i in range(len(pred1)):
    for j in [pred1,pred2,pred3,pred4,pred5,pred6]:
        all_preds.append(j[i])   
        
sub  = pd.read_csv("../input/bigquery-geotab-intersection-congestion/sample_submission.csv")
sub["Target"] = all_preds
sub.to_csv("benchmark_beat_rfr_multimodels.csv",index = False)

print(len(all_preds))
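
The nested loop above interleaves the predictions so that each row's targets appear together. A minimal sketch of the same ordering with NumPy, on made-up values and only three targets for brevity:

```python
import numpy as np

# Toy predictions for 2 rows and 3 targets (values are made up).
pred_a = np.array([1.0, 2.0])
pred_b = np.array([10.0, 20.0])
pred_c = np.array([100.0, 200.0])

# Stack the targets as columns, then read row by row:
# row 0's targets first, then row 1's, matching the nested loop's order.
flat = np.column_stack([pred_a, pred_b, pred_c]).ravel()
print(flat)
```

With the six real prediction arrays, this would be `np.column_stack([pred1, pred2, pred3, pred4, pred5, pred6]).ravel()`.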

In [None]:
lr.fit(X,y)
print("fitted")

all_preds = lr.predict(testX)

In [None]:
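## convert the 2D array of predictions to the flat format required for submissions
print(all_preds[0])

# Flatten row by row, so each row's six targets stay together, and keep the result as a Series
all_preds = pd.Series(np.asarray(all_preds).ravel())

print(len(all_preds))
print(all_preds[0])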

In [None]:
sub  = pd.read_csv("../input/bigquery-geotab-intersection-congestion/sample_submission.csv")
print(sub.shape)
sub.head()

In [None]:
sub["Target"] = all_preds.values
sub.sample(5)