#### Introduction

As a former Business Intelligence Analyst, now part of an Advanced Analytics team, I have spent the past few months enhancing my technical skills (Python, pandas, scikit-learn, etc.). 
Thanks to the great diversity of e-learning courses and blogs, I covered the basic tools of Data Science, which allowed me to start playing around with predictive models on simple use cases (thanks Kaggle!). 

That being said, my first goal for this work is to contribute to the online library of use cases for beginners like me. I hope it will help you! 

#### Goal of the analysis

Identify the predictors of the daily bike traffic in Seattle and build a simple predictive model.

#### Agenda

In this notebook I will cover the 3 main steps to follow to produce simple regression models and assess their performance: 

    I. Environment setup
    
    II. Exploratory data analysis
    
    III. Building a predictive model 
        A. Multivariate linear regression
        B. Decision tree regression
    

This notebook is based on the great Python Data Science Handbook by Jake VanderPlas. I highly recommend it for your learning journey!
https://jakevdp.github.io/PythonDataScienceHandbook/

# **I. Environment set up**

#### Import relevant libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns

Get the datasets from the following 2 open sources for the city of Seattle: 

Seattle Fremont Bridge bike count: https://data.seattle.gov/api/views/65db-xm6k/rows.csv?accessType=DOWNLOAD

Weather data for Seattle: https://www.ncdc.noaa.gov/cdo-web/search?datasetid=GHCND
            Use the location ID as search term: USW00024233

#### Import datasets, here from csv file with pd.read_csv("directory/file.csv")

In [None]:
traffic_data = pd.read_csv('../input/Fremont_Bridge_Bicycle_Counter.csv', index_col='Date', parse_dates=True)
weather_data = pd.read_csv('../input/Seattle_weather_daily.csv', index_col='DATE', parse_dates=True)

Quickly visualize the 'head' of a pandas dataframe with [.head()], which shows the first 5 observations by default, or use [.tail()] to show the last 5. 

Note that a notebook cell only displays the value of its last expression, so create an extra cell to show the head of the second dataframe.

In [None]:
traffic_data.head()

In [None]:
weather_data.head(2)

> # **II. Exploratory data analysis**

A quick way to get a first grasp of the dataset is to print its shape (number of observations, number of columns).

In [None]:
weather_data.shape

Another useful function is [.describe()], which returns descriptive statistics summarizing the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values.
Source: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html?highlight=describe#pandas.DataFrame.describe

In [None]:
weather_data.describe()

In [None]:
traffic_data.shape

In [None]:
traffic_data.describe()

In [None]:
traffic_data.head()

The traffic data set contains hourly measures, on both sides of the bridge. 
We only need daily totals for the purpose of the analysis, thus we will resample the data to get to the daily traffic (similar to group by day).

In [None]:
daily = (
    traffic_data
    .resample('d')
    .sum()
    .loc[:, ['Fremont Bridge Total']]
    .rename(
        columns={
            'Fremont Bridge Total': 'Total'
        }
    )
)

Note that we used method chaining, in the 'Modern Pandas' style, to increase the readability of our code. 

Check it out here : https://tomaugspurger.github.io/modern-1-intro.html

In [None]:
daily.head()

#### Resampling

To get a better grasp of how traffic varies by month, by hour and by day of the week, we can resample or group the data and plot the trends.  

See full pandas documentation here : https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#resampling

And some documentation regarding the possibilities of the DateOffset objects: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
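To make the idea of resampling concrete, here is a minimal sketch on synthetic data (the `toy` dataframe and its values are made up for illustration, they are not the Fremont Bridge data). It shows that a daily resample gives the same totals as grouping by calendar day, and how a different offset alias changes the bin size.

```python
import numpy as np
import pandas as pd

# Synthetic hourly counts (hypothetical data): one bike per hour for 14 days,
# starting on Monday 2019-01-07.
index = pd.date_range("2019-01-07", periods=14 * 24, freq=pd.Timedelta(hours=1))
toy = pd.DataFrame({"Total": np.ones(len(index), dtype=int)}, index=index)

# Daily totals with resample ...
daily_resampled = toy.resample("D").sum()

# ... and the same totals with an explicit group-by on the calendar day.
daily_grouped = toy.groupby(toy.index.date).sum()

# Weekly totals: the "W" alias builds one bin per week.
weekly = toy.resample("W").sum()

print(daily_resampled.head(3))
print(weekly)
```

The main practical difference is that resample keeps a proper DatetimeIndex, which makes further time-based operations and plots easier.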


##### By month


In [None]:
monthly = (
    traffic_data
    .resample('m')
    .sum()
).plot(figsize = (15,5))

##### By hour

In [None]:
by_time = (
    traffic_data
    .groupby(traffic_data.index.time)
    .mean()
)

hourly_ticks = 4 * 60 * 60 * np.arange(6)

by_time.plot(xticks=hourly_ticks, style=[':', '--', '-']);

We can clearly see 2 traffic peaks during the day: a strong one in the morning on the East side of the bridge (residential area?) and a second one in the evening on the West side (business district?). 

##### By day of the week

In [None]:
by_weekday = (
    traffic_data
    .groupby(traffic_data.index.dayofweek)
    .mean()
)
by_weekday.index = ['Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun']
by_weekday.plot(style=[':', '--', '-']);

We can see on the figure above that weekdays register higher traffic than weekends, which supports our hypothesis of commuting towards a business district. 

In order to take this information into account in our model, we will assign the day of the week to our daily traffic data. 

We are now starting to actually build the data set on which we will train our model to predict the number of bikes per day. This step is called feature engineering, and it is the first step to take to build or improve a predictive model. 
 

> # **III. Building a predictive model**

#### 1. Feature engineering

##### Week day label 
We could simply create an extra column with the label of each day (a nominal variable). However, we are planning to build a linear regression model afterwards, and this model needs numeric inputs; encoding the days as 0 to 6 would also impose an artificial order on them. We will therefore create a binary column for each day. 

For further info on 'Feature engineering': 
https://jakevdp.github.io/PythonDataScienceHandbook/05.04-feature-engineering.html

In [None]:
### let's assign those days of the week labels in the dataset itself 

daily = (
    daily
    .assign(
        day_of_week=lambda _df: _df.index.dayofweek
    )
    .pipe(pd.get_dummies, columns=['day_of_week'])
    .rename(
        columns={
            'day_of_week_0': 'Mon',
            'day_of_week_1': 'Tue',
            'day_of_week_2': 'Wed',
            'day_of_week_3': 'Thu',
            'day_of_week_4': 'Fri',
            'day_of_week_5': 'Sat',
            'day_of_week_6': 'Sun'
        }
    )
)
daily.head()

##### National holidays 

If working days have such an impact on the traffic, we might want to check also for the impact of holidays.

A US holiday calendar is available directly in pandas (pandas.tseries.holiday)!

We will add this information in a new binary column 'holiday'

In [None]:
from pandas.tseries.holiday import USFederalHolidayCalendar
cal = USFederalHolidayCalendar()
holidays = cal.holidays('2012', '2020')
daily = daily.join(pd.Series(1, index=holidays, name='holiday'))
daily['holiday'] = daily['holiday'].fillna(0)  # non-holidays get 0 (avoids in-place fillna on a column selection)

In [None]:
(
    daily
    .loc[daily.holiday == 1]
    .reset_index()
    .sort_values(by= "Date")
    .tail(10)
)

Looking at the 2019 national holidays, we see that the Xmas period only counts 1 official day off, on the 25th, whereas people tend to take longer breaks that will not be accounted for. Let's keep this in mind and see whether a correction is needed to improve the performance of our model. 

##### Weather data

Our daily dataset is now ready. We still need to prepare the weather dataset by: 

- Converting metrics to the right unit of measure (like Celsius for temperature) 
- Merging the weather dataframe with the daily one based on the date
- Dropping missing values before running our linear model trial

In [None]:
weather_data.describe()

In [None]:
weather_data.tail(1)

In [None]:
# The formula below converts Fahrenheit to Celsius (assuming TAVG was downloaded in degrees Fahrenheit)
weather_data['Temp (C)'] = (weather_data['TAVG'] - 32)/(9/5)

# We can create a new binary metric 'dry day': 1 for a day without precipitation, 0 otherwise
weather_data['dry day'] = (weather_data['PRCP'] == 0).astype(int)

# Join the 2 datasets
daily = daily.join(weather_data[['PRCP', 'Temp (C)', 'dry day']])

# Drop any rows with null values
daily.dropna(axis=0, how='any', inplace=True)

In [None]:
daily.describe()

We can add a few quick and easy plots to show the relationships between some of these variables and the target (Total).

##### Dry Days

In [None]:
sns.boxplot(x='dry day', y='Total', data=daily, hue='dry day')

Dry days have a higher median traffic than rainy ones: this feature is positively associated with the Total traffic target. 

##### Temperature

To check the correlation between temperature and traffic we will use the more advanced joint plot from the Seaborn library. With kind='kde', it replaces the scatterplot and histograms with density estimates, giving an indication of both the distributions and the direction of the relationship.

In [None]:
# x and y are passed as keywords: recent seaborn versions no longer accept them positionally
sns.jointplot(x='Temp (C)', y='Total', data=daily, kind='kde')

We see that the relationship between total traffic and temp is positive, meaning the higher the temperature the higher the traffic.

#### 2. Multivariate linear regression 

Let's build our very first model to see if the selected metrics have a linear relationship with the total number of riders. 
To do so we use the scikit-learn LinearRegression model, predicting the number of riders based on the selected independent variables 
and finally comparing with the actual count of riders.

In [None]:
daily.head()

In [None]:
column_names = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday','dry day', 'Temp (C)']
X = daily[column_names] # define the independent variables 
y = daily['Total'] # define the target value, the dependent variable

from sklearn.model_selection  import train_test_split   
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=1)  ## split the dataset in train and test sub-sets

from sklearn.linear_model import LinearRegression # 1. choose model class
model = LinearRegression(fit_intercept=False)     # 2. instantiate model
model.fit(Xtrain, ytrain)                         # 3. fit model to train data
y_model = model.predict(Xtest)                    # 4. predict on new test data

from sklearn.metrics import r2_score
r2_score(ytest, y_model)  ## check score of the model chosen

In [None]:
from sklearn.model_selection  import cross_validate
cv = cross_validate(model, X, y, cv=10, return_train_score=True)
cv_df = pd.DataFrame({"train_score": cv["train_score"], "test_score": cv["test_score"]})
cv_df

In [None]:
print(cv_df["train_score"].mean(),cv_df["test_score"].mean())

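By default, the score reported by scikit-learn regressors (and by cross_validate) is the coefficient of determination R2. 
It is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
Its best possible value is 1. It usually falls between 0 and 1, but it can be negative on unseen data when the model predicts worse than simply using the mean of the target.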
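As a quick check of this definition, the sketch below computes R2 by hand as 1 - SS_res / SS_tot and compares it with scikit-learn's r2_score. The helper name `r2_by_hand` and the numbers are made up for illustration. The second set of predictions is deliberately poor, to show what the score looks like when a model does worse than predicting the mean.

```python
import numpy as np
from sklearn.metrics import r2_score


def r2_by_hand(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error left after the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot


y_true = [100, 200, 300, 400]     # hypothetical rider counts
good_pred = [110, 190, 310, 390]  # close to the actual values
bad_pred = [400, 300, 200, 100]   # reversed trend: worse than predicting the mean

print(r2_by_hand(y_true, good_pred), r2_score(y_true, good_pred))  # both 0.992
print(r2_by_hand(y_true, bad_pred), r2_score(y_true, bad_pred))    # both -3.0
```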

The linear model we have built has a score of 0.77 on unseen (test) data, a fair score for a straightforward linear regression. 
To understand how to enhance the model we can look at the gaps between actual and predicted values (the error):

In [None]:
daily['predicted'] = model.predict(X) # add the predicted number of riders to the original data set
daily[['Total', 'predicted']].plot(figsize =(20,10), legend=True);

We can have a look at the coefficients of all the independent variables of our multivariate linear model.
Our regression model has found the least-squares coefficients for all the attributes. 

Here is how to read them: 
For each increase of 1 degree in temperature (Celsius), all other variables being equal, we have around +100 bikers on the bridge.
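This way of reading a coefficient can be verified on a toy model. The sketch below uses made-up, noise-free data (not the Seattle dataset) in which riders are generated with exactly +100 per degree, then checks that two days differing by one degree differ in prediction by the temperature coefficient.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: riders = 1000 + 100 * temperature + 500 * dry_day
rng = np.random.default_rng(0)
temp = rng.uniform(0, 25, size=200)
dry = rng.integers(0, 2, size=200)
riders = 1000 + 100 * temp + 500 * dry

X_toy = np.column_stack([temp, dry])
toy_model = LinearRegression().fit(X_toy, riders)

# Two dry days that differ only by one degree of temperature
cool_day = np.array([[15.0, 1]])
warm_day = np.array([[16.0, 1]])
diff = toy_model.predict(warm_day)[0] - toy_model.predict(cool_day)[0]

print(toy_model.coef_)  # approximately [100, 500]
print(diff)             # approximately 100, i.e. the temperature coefficient
```

On real data the coefficients are estimates, so this reading only holds to the extent that the linear model fits well.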

In [None]:
coeff_df = pd.DataFrame(model.coef_, X.columns, columns=['Coefficient'])
coeff_df

The above results support the idea that bike traffic is mainly due to work commutes: weekdays have a positive effect on traffic, with no big difference between the days themselves, while weekends and holidays tend to see a decrease in traffic. 

Weather conditions also have an impact: dry days and high temperatures influence traffic in a positive way. 

Let's have a quick look at the errors of our model, i.e. the difference between the predicted traffic and the actual Total. 
As we do not have the same number of data points per year (e.g. in 2020 I only had 31 days of data), we will use the average error instead of the sum. 
(You can check that you have fewer data points for 2020 by running daily.resample('y').count().)

In [None]:
daily["error"]= (daily.predicted - daily.Total).astype(int)
daily["error_abs"]= (daily.predicted - daily.Total).abs().astype(int)

monthly_error = daily.resample('m').mean().reset_index()
monthly_error.plot(x="Date", y=["Total","predicted"], figsize=(15,5))
year_error = daily.resample('y').mean().reset_index()
year_error.plot(x="Date", y=["error_abs"], figsize=(15,5))

Overall our linear model has a fair prediction score (0.77). Looking at the monthly graph, we see gaps year on year that we could compensate for by introducing additional features. Indeed, national holidays may not be accurate enough, so we might want to try the model with holiday 'periods' based on school breaks, for instance the Xmas and summer periods.

In addition, the yearly graph above shows that the model had a steady performance from 2013 to 2018, but that for 2019 the bike traffic has been highly underestimated. This may be due to an increase in bike usage in the population itself. 

Let's do some additional feature engineering to cover the 2 points above and try to improve the score of our linear regression.

#### 3. Model enhancement

##### Holiday period
Resampling by month allows us to see whether the error tends to repeat for specific months of the year.
The plot below shows indeed that traffic in the summer and Xmas periods tends to be overestimated. 

In [None]:
monthly_error = (
    daily
    .resample('m')
    .mean()
    .reset_index()
)
monthly_error.plot(x='Date',y='error', figsize=(15,5))

The holiday calendar imported before only includes national holidays, so it does not cover this notion of holiday period.

Looking at the error per month, we see that December, July and August tend to be overestimated year on year. 

In [None]:
(
    monthly_error
    .sort_values(by ='error')
    .tail(5)
)

Let's create 2 new features : Xmas_period and Summer_period to integrate those notions to our model. 

We will define Xmas (December) and Summer (July & August) periods as full month periods for simplicity. 

In [None]:
daily =(
    daily
    .assign(
        month_num=lambda _df: _df.index.month, # get the month number from date
        Xmas_period=lambda _df: _df['month_num'] == 12,
        Summer_period=lambda _df: _df['month_num'].isin([7, 8])
    )
    .drop(columns=['month_num'])
)

#### 4. Performance comparison 

We will now try the multivariate linear regression model again with the 2 additional holiday period features. 

In [None]:
column_names_fe = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday','dry day', 'Temp (C)','Xmas_period','Summer_period']
X_fe = daily[column_names_fe] # define the independent variables 
y_fe = daily['Total'] # define the target value, the dependent variable

Xtrain, Xtest, ytrain, ytest = train_test_split(X_fe, y_fe, random_state=1)  ## split the dataset in train and test sub-sets

model = LinearRegression(fit_intercept=False)     # 2. instantiate model
model.fit(Xtrain, ytrain)                         # 3. fit model to train data
y_model = model.predict(Xtest)

r2_score(ytest, y_model)  ## check score of the model chosen

In [None]:
cv = cross_validate(model, X_fe, y_fe, cv=10, return_train_score=True)
cv_df = pd.DataFrame({"train_score": cv["train_score"], "test_score": cv["test_score"]})
cv_df

In [None]:
print(cv_df["train_score"].mean(),cv_df["test_score"].mean())

In [None]:
coeff_df = pd.DataFrame(model.coef_, X_fe.columns, columns=['Coefficient'])
coeff_df

We can see that those 2 holiday periods have a negative impact on the number of bikers on the road. 

In [None]:
daily['predicted_fe'] = model.predict(X_fe) # add the predicted number of riders to the original data set

daily["error_fe"]= (daily.predicted_fe - daily.Total).astype(int)
# the absolute error must be based on the new predictions (predicted_fe), not on the first model's
daily["error_fe_abs"]= (daily.predicted_fe - daily.Total).abs().astype(int)

monthly_error = daily.resample('m').mean().reset_index()

monthly_error.plot(x="Date", y=["Total","predicted_fe","predicted"], figsize=(15,5))

We see that the Xmas_period feature improved the prediction for the end-of-year period, while summer is still far from the actuals. 

In [None]:
year_error = daily.resample('y').mean().reset_index()
year_error.plot(x="Date", y=["error_abs","error_fe_abs"], figsize=(15,5))

##### Population data

This increase in bikers in 2019 might be explained by an increase in the city population. We can get this information from the Washington State Office of Financial Management (OFM) website: 
https://www.ofm.wa.gov/washington-data-research/population-demographics/population-estimates/april-1-official-population-estimates

Note that those figures are yearly; for the sake of quickness we will simply spread them over the days of each year.
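One possible way to spread yearly figures over days is sketched below. It is only an illustration under stated assumptions: the population numbers are made up, the yearly values are dated on 1 January for simplicity (the official estimates are published as of April 1), and the layout of Seattle_yearly_pop.csv may differ from this toy series.

```python
import pandas as pd

# Hypothetical yearly figures (made-up numbers), dated on 1 January for simplicity
yearly_pop = pd.Series(
    [700_000, 720_000, 740_000],
    index=pd.to_datetime(["2017-01-01", "2018-01-01", "2019-01-01"]),
    name="population",
)

# Daily index covering the same period
days = pd.date_range("2017-01-01", "2019-12-31", freq="D")

# Carry each yearly value forward to every day until the next value appears
daily_pop = yearly_pop.reindex(days).ffill()

print(daily_pop.loc[pd.Timestamp("2018-06-15")])  # 720000.0
```

A series built this way has one value per day, so joining it to a daily dataframe on the date index leaves no gaps to drop.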

In [None]:
#Dataset import
population_data = pd.read_csv("../input/Seattle_yearly_pop.csv", index_col='Date', parse_dates=True)

# Join the 2 datasets
daily = daily.join(population_data[['population']])

# Drop any rows with null values
daily.dropna(axis=0, how='any', inplace=True)

In [None]:
column_names_fe = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', 'holiday','dry day', 'Temp (C)','Xmas_period','Summer_period','population']
X_fe = daily[column_names_fe] # define the independent variables 
y_fe = daily['Total'] # define the target value, the dependent variable

Xtrain, Xtest, ytrain, ytest = train_test_split(X_fe, y_fe, random_state=1)
model = LinearRegression(fit_intercept=False)     
model.fit(Xtrain, ytrain)                         
y_model = model.predict(Xtest)

r2_score(ytest, y_model)  ## check score of the model chosen

daily['predicted_fe'] = model.predict(X_fe) # refresh the predictions with the model that includes population

In [None]:
coeff_df = pd.DataFrame(model.coef_, X_fe.columns, columns=['Coefficient'])
coeff_df

In [None]:
daily["error_fe"]= (daily.predicted_fe - daily.Total).astype(int)
daily["error_fe_abs"]= (daily.predicted_fe - daily.Total).abs().astype(int)  # based on predicted_fe, not predicted

monthly_error = daily.resample('m').mean().reset_index()

monthly_error.plot(x="Date", y=["Total","predicted_fe","predicted"], figsize=(15,5))

This population feature does not improve the score of the model much, and does not help reduce the error for 2019. We can disregard this metric as it stands. 

## IV. Regression tree

Another widely used type of prediction model is the regression tree. Let's give it a try to predict the traffic.

I would recommend having a look at this other great kernel to get some background and further explanation on decision trees (classification and regression): https://www.kaggle.com/vipulgandhi/a-guide-to-decision-trees-for-beginners

In [None]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection  import cross_val_score

daily_X = daily[['Mon','Tue','Wed','Thu','Fri','Sat','Sun','holiday','Temp (C)','dry day','Xmas_period','Summer_period']]
daily_y = daily['Total']

tree_reg = DecisionTreeRegressor(max_depth=6)
tree_reg.fit(daily_X, daily_y)

In [None]:
cross_val_score(tree_reg, daily_X, daily_y, cv=10)  # score on the same features the tree was built with

In [None]:
scores_tree = cross_val_score(tree_reg, daily_X, daily_y, cv=10).mean()
scores_tree

The R2 score of our regression tree is actually lower than that of our linear regression model (0.8). 


In [None]:
# Visualize the trained decision tree with export_graphviz()

from sklearn import tree
from IPython.display import SVG, display
from graphviz import Source

In [None]:
labels = daily_X.columns

graph = Source(tree.export_graphviz(tree_reg ,feature_names = labels,max_depth=5, filled = True))
display(SVG(graph.pipe(format='svg')))

Note that for visualization purposes the displayed depth is limited to 5 (max_depth=5 in export_graphviz), while the tree itself was trained with max_depth=6. 

To summarize, our linear model performed better than the regression tree. Here are 3 rules of thumb that I gathered during my trainings for choosing between tree-based models and linear models:

- If the relationship between the dependent and independent variables is well approximated by a linear model, linear regression will usually outperform a tree-based model.
- If the relationship between the dependent and independent variables is highly non-linear and complex, a tree model will usually outperform a classical regression method.
- If we need to build a model which is easy to explain to people, a shallow decision tree is often a good choice: its if/else rules can be even simpler to interpret than the coefficients of a linear regression.
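The first two rules can be illustrated with a small experiment on synthetic data. This is only a sketch: the two targets below are made up (one linear in x, one step-shaped), and the tree depth and noise level are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)
x = rng.uniform(-3, 3, size=(500, 1))

# Case 1: the target is linear in x, plus noise
y_linear = 2.0 * x[:, 0] + rng.normal(0, 0.5, size=500)
# Case 2: the target is strongly non-linear (a step shape), plus noise
y_step = np.where(np.abs(x[:, 0]) < 1, 5.0, 0.0) + rng.normal(0, 0.5, size=500)

targets = {"linear target": y_linear, "step target": y_step}
models = {
    "linear": LinearRegression(),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

# Mean cross-validated R2 for each (target, model) pair
scores = {}
for target_name, target in targets.items():
    for model_name, estimator in models.items():
        scores[(target_name, model_name)] = cross_val_score(estimator, x, target, cv=5).mean()
        print(target_name, model_name, round(scores[(target_name, model_name)], 3))
```

On the linear target the linear model should come out ahead, while on the step target the tree should win by a wide margin, since a straight line cannot follow the step.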

### Conclusion and potential improvements

Additional features you might want to research to explain the increase in bikes in 2019: 
- new city cycling incentives 
- new company policies 
- improved cycling infrastructure

In addition, another predictive method would be interesting to try out: time series analysis. This method is good at spotting seasonality and the overall trend, but it makes it harder to understand the actual causes (like enhanced bike lanes, etc.). 

Another, more advanced improvement of our model training would be not to shuffle our data before splitting it, as it has a time component. Indeed, with a random split we are partly predicting traffic on past dates using later ones, whereas it would make more sense to train our model on 2015 to 2018 and predict the 2019/2020 figures, for instance.  
To do this, call train_test_split with shuffle=False on data sorted by date. See the full documentation here: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
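A minimal sketch of such a time-ordered split is shown below, on a made-up daily dataframe (the `toy` data and its column names are hypothetical). It shows three options: train_test_split without shuffling, an explicit date cut-off, and scikit-learn's TimeSeriesSplit for cross-validation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit, train_test_split

# Hypothetical daily data, already sorted by date
dates = pd.date_range("2015-01-01", "2019-12-31", freq="D")
toy = pd.DataFrame(
    {"feature": np.arange(len(dates)), "Total": 2 * np.arange(len(dates))},
    index=dates,
)

# Option 1: no shuffling, so the last 20% of the rows form the test set
X_train, X_test, y_train, y_test = train_test_split(
    toy[["feature"]], toy["Total"], test_size=0.2, shuffle=False
)

# Option 2: explicit cut-off, train on 2015-2018 and test on 2019
train = toy.loc[:"2018-12-31"]
test = toy.loc["2019-01-01":]

# Option 3: cross-validation folds where the test fold always follows the training fold
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(toy):
    print(toy.index[train_idx].max().date(), "->", toy.index[test_idx].min().date())
```

With any of these options the model is only evaluated on dates later than those it was trained on, which is closer to how it would be used in practice.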
