Rosemonde Lareau-Dussault
===
PhD Mathematics, University of Toronto
---
----------

1- Introduction
===

-------
This notebook studies the behavior of cyclists in Montréal in 2015. The goal is to relate weather data, and whether a day is a working day or not, to how many cyclists used the bike lanes.

I am using the data set [Montreal bike lanes](https://www.kaggle.com/pablomonleon/montreal-bike-lanes). From the data set description: the "rows correspond to days of the year (in) 2015 and the columns are the bike lanes in Montreal. The numbers in the cells are the number of bikes that used that lane."

To get weather data, I downloaded a [2015 weather report](http://climat.meteo.gc.ca) and I [uploaded it to kaggle](https://www.kaggle.com/rosemondeld/weather-montreal-2015-en).

Given that I want to predict the number of cyclists in a day, I am modeling count data, for which a Poisson distribution is a natural choice. That is, if $y_i$ is the number of cyclists on day $i$ and the weather conditions are $X_i$, I assume that $P(y_i|X_i)$ follows a Poisson distribution. Because the number of cyclists is large, this Poisson distribution should be well approximated by a normal distribution (at least during the summer time ;) ).
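
As a quick sanity check of the normal approximation, here is a minimal sketch that compares the Poisson probabilities with the density of a normal distribution having the same mean and variance. The mean `lam` is a made-up value chosen for illustration, not a number taken from the data set, and the helper names are mine.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Poisson probability P(Y = k), computed in log space to avoid overflow."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def normal_pdf(x: float, mean: float, var: float) -> float:
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

lam = 10_000  # hypothetical daily mean, NOT a value from the data set

# A Poisson(lam) variable has mean lam and variance lam,
# so we compare it with a normal distribution N(lam, lam).
for k in (9_800, 10_000, 10_200):
    print(k, poisson_pmf(k, lam), normal_pdf(k, lam, lam))
```

For a mean this large the two columns are very close. For days with only a handful of cyclists the approximation would be much rougher.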

This notebook is organized as follows: In Section **2 - Data**, I download, clean and visualize the data. I also try to make the number of cyclists depend linearly on my factors by using transformations, and I scale the factors to have them on a comparable scale. In Section **3- Modeling**, I create a predictive model that outputs an expected number of cyclists using bike lanes in Montréal given certain weather conditions and knowing whether it is a working day, a weekend or a holiday. Such a model could be used by the city to provide adapted support for cyclists (e.g. more police force or less construction when a large number of cyclists is expected) or by an advertising company that wants to target advertising to cyclists.

In [1]:
#I import all the libraries I'll need
from matplotlib import pyplot
import seaborn as sns # for pictures

from mlxtend.preprocessing import minmax_scaling # to scale variable between 0 and 1

import numpy as np # linear algebra
import pandas as pd # for data structure

from sklearn.model_selection import cross_val_score # as I have very few data points, I will use cross validation instead of creating a training and a testing set
from sklearn.metrics import mean_absolute_error # this is a metric I'll use to compare models
from sklearn.model_selection import train_test_split # to separate our data into a training and a testing set

import xgboost as xgb
from xgboost import XGBRegressor # This is the model I'll use (it is a combination of decision trees)

2- Data
===
----------
2.1-Montreal bike data
---
I load the bike data, and set the index to be the date.

In [2]:
bike = pd.read_csv('../input/montreal-bike-lanes/comptagesvelo2015.csv',parse_dates=['Date'], dayfirst=True)
bike=bike.set_index('Date')

Let's have a quick look at the data.

In [3]:
bike.head()

First, we're going to remove the 'Unnamed: 1' column as it contains no information.

In [4]:
bike['Unnamed: 1'].describe()

In [5]:
bike = bike.drop('Unnamed: 1', axis=1)

Now, I find and observe the columns that have missing values.

In [6]:
print(bike.isnull().sum())

In [7]:
bike[['Maisonneuve_1','Parc U-Zelt Test','Pont_Jacques_Cartier','Saint-Laurent U-Zelt Test']].plot(figsize=(20,8))

As we can see, four columns have missing values. From the names of the columns 'Parc U-Zelt Test' and 'Saint-Laurent U-Zelt Test', I deduce that these are probably new bike lanes being tested at the end of 2015. For the two others, information is also missing for a large portion of the year (probably because the data was not collected for a time or because the bike lane was temporarily closed). I remove these four columns from the data set.

In [8]:
bike = bike.drop(['Maisonneuve_1', 'Parc U-Zelt Test', 'Pont_Jacques_Cartier', 'Saint-Laurent U-Zelt Test'], axis=1)

We can see from plotting all the remaining bike lanes together that there seems to be a general trend, i.e. some days where there were more people on all the bike lanes and some where there were fewer. Therefore, I add up all the columns and keep only a total count of cyclists on all bike lanes.

In [9]:
bike.plot(figsize=(20,8))

In [10]:
#I create the total column.
bike['total']=bike['Berri1']+bike['Boyer']+bike['Brébeuf']+bike['CSC (Côte Sainte-Catherine)']+bike['Maisonneuve_2']+bike['Maisonneuve_3']+bike['Notre-Dame']+bike['Parc']+bike['PierDup']+bike['Rachel / Hôtel de Ville']+bike['Rachel / Papineau']+bike['René-Lévesque']+bike['Saint-Antoine']+bike['Saint-Urbain']+bike['Totem_Laurier']+bike['University']+bike['Viger']
#I drop all the other columns.
bike = bike.drop(['Berri1', 'Boyer', 'Brébeuf', 'CSC (Côte Sainte-Catherine)','Maisonneuve_2','Maisonneuve_3','Notre-Dame','Parc','PierDup','Rachel / Hôtel de Ville','Rachel / Papineau','René-Lévesque','Saint-Antoine','Saint-Urbain','Totem_Laurier','University','Viger'], axis=1)
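
A shorter way to build the total, shown here as a sketch on a toy table, is a row-wise sum. The counts below are made up, and only three of the lane names are reused for illustration.

```python
import pandas as pd

# Toy frame with hypothetical counts; the lane names are reused for illustration only.
toy = pd.DataFrame(
    {"Berri1": [10, 20], "Boyer": [1, 2], "Parc": [5, 6]},
    index=pd.to_datetime(["2015-01-01", "2015-01-02"]),
)

# Row-wise sum over all lane columns, kept as a one-column DataFrame named 'total'.
toy_total = toy.sum(axis=1).to_frame("total")
print(toy_total)
```

One difference to keep in mind: `sum(axis=1)` skips missing values by default, whereas adding columns with `+` gives a missing total as soon as one lane is missing. Passing `skipna=False` gives the same behavior as `+`.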

Let's look at the final bike data:

In [11]:
bike.describe()

In [12]:
bike.plot(figsize=(20,8))

In [13]:
bike.boxplot()

----------
2.2-Montreal weather data
---
I load the weather data.

In [14]:
weather = pd.read_csv('../input/weather-montreal-2015-en/eng-daily-01012015-12312015.csv')
weather = weather[:319] #I only keep until November 15th as I have no data after this date (I think Montréal closes its bike lanes after this date).

Let's have a quick look at the data.

In [15]:
weather.sample(5)

I add a column to detect whether it is the weekend or not.

In [16]:
weather['index'] = weather.index
weather['weekend']=np.where((weather['index']%7==2) | (weather['index']%7==3), 1, 0)
weather = weather.drop(['index'], axis=1)
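
The modulo rule relies on the table having exactly one row per day, starting on Thursday, January 1st, 2015. Under that assumption, this small sketch checks the rule against the calendar using only the standard library.

```python
from datetime import date, timedelta

start = date(2015, 1, 1)  # row 0 of the weather table (a Thursday)
n_days = 319              # rows kept in the notebook: January 1 to November 15

# Rule used in the notebook: row index modulo 7 equal to 2 or 3.
by_modulo = [i % 7 in (2, 3) for i in range(n_days)]

# Calendar-based rule: weekday() returns 5 for Saturday and 6 for Sunday.
by_calendar = [(start + timedelta(days=i)).weekday() >= 5 for i in range(n_days)]

# True when both rules flag exactly the same days.
print(by_modulo == by_calendar)
```

If the file had gaps or started on another day, the modulo rule would silently drift. A rule based on the date itself (for example `dt.dayofweek >= 5` on a pandas datetime column) would not have this problem.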

I am going to make the date the index, as I did for the bike data. Note that I first create a column for the month. I will use it later to impute some missing values, as the weather is correlated with the month.

In [17]:
weather['Date']=pd.to_datetime(weather['Date/Time'], format = "%Y-%m-%d")
weather['month']=weather['Date'].dt.month
weather=weather.set_index('Date')
weather = weather.drop(['Date/Time'], axis=1)

I manually add the Statutory Holidays in Québec in 2015 to the weekend column.

In [18]:
weather.at['2015-01-01', 'weekend']=1
weather.at['2015-04-03', 'weekend']=1
weather.at['2015-04-06', 'weekend']=1
weather.at['2015-05-18', 'weekend']=1
weather.at['2015-06-24', 'weekend']=1
weather.at['2015-07-01', 'weekend']=1
weather.at['2015-09-07', 'weekend']=1
weather.at['2015-10-12', 'weekend']=1

Now, I will remove columns that I don't intend to use. I only keep information about temperature and precipitation.
I also rename the columns to have more usable names, i.e. no degree sign, no space and no parenthesis.

In [19]:
weather = weather.drop(['Year','Month','Day','Data Quality','Max Temp Flag','Min Temp Flag','Mean Temp Flag','Heat Deg Days (°C)','Heat Deg Days Flag',
                       'Cool Deg Days (°C)','Cool Deg Days Flag','Total Rain Flag','Total Snow Flag','Total Precip (mm)','Total Precip Flag','Snow on Grnd Flag',
                       'Dir of Max Gust (10s deg)','Dir of Max Gust Flag','Spd of Max Gust (km/h)','Spd of Max Gust Flag'], axis=1)
weather.columns = ['max_temp','min_temp','mean_temp','rain','snow','snow_on_grnd','weekend','month']

In [20]:
weather.head()

I now look if there are any missing values and I consider how I am going to deal with them.

In [21]:
print(weather.isnull().sum())

I start with the rain column. Let's look at the missing values.

In [22]:
rain = weather.drop(['snow','snow_on_grnd'], axis=1)
norain = weather[rain.isnull().any(axis=1)]
norain

As some of the missing values for rain occur when the temperature is below zero all day, I can't impute from the yearly average. Indeed, the quantity of rain depends on the temperature. Therefore, I impute each missing value with the average total rain of the corresponding month. During the cold months, there is no rain, so the missing values are replaced by 0.

In [23]:
weather['rain'] = weather['rain'].fillna(weather.groupby("month")["rain"].transform("mean"))
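
To make the month-wise imputation concrete, here is a minimal sketch on a toy table. The values are invented: a cold month without rain and a warm month with rain.

```python
import numpy as np
import pandas as pd

# Hypothetical data: month 1 has no rain, month 7 has some; one value is missing in each.
toy = pd.DataFrame({
    "month": [1, 1, 1, 7, 7, 7],
    "rain": [0.0, np.nan, 0.0, 4.0, np.nan, 8.0],
})

# For every row, the mean of 'rain' over the rows of the same month (missing values ignored).
monthly_mean = toy.groupby("month")["rain"].transform("mean")

# Missing values are replaced by the mean of their own month; other values are untouched.
toy["rain"] = toy["rain"].fillna(monthly_mean)
print(toy)
```

The missing value of the cold month becomes 0, while the one of the warm month becomes the average of its two neighbors in the same month. A single yearly average would have put rain on a freezing day.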

Now for snow.

In [24]:
snow = weather.drop(['rain','snow_on_grnd'], axis=1)
nosnow = weather[snow.isnull().any(axis=1)]
nosnow

Similarly, the snow probably depends on the current month (i.e. no snow when it's warm).

In [25]:
weather['snow'] = weather['snow'].fillna(weather.groupby("month")["snow"].transform("mean"))

Now for the quantity of snow on the ground.

In [27]:
snowongrnd = weather.drop(['rain','snow'], axis=1)
nosnowongrnd = weather[snowongrnd.isnull().any(axis=1)]
nosnowongrnd

There seem to be full months of missing data, so I can't impute from the monthly average. Most of the missing values are during warmer months and can therefore safely be set to zero. The only two values to consider are 2015-03-23 and 2015-03-31. For those two values, I take the average of the preceding and the following day, because the quantity of snow on the ground doesn't vary a lot from one day to the next.

In [28]:
weather.at['2015-03-23', 'snow_on_grnd']=weather.loc['2015-03-22']['snow_on_grnd']/2+weather.loc['2015-03-24']['snow_on_grnd']/2
weather.at['2015-03-31', 'snow_on_grnd']=weather.loc['2015-03-30']['snow_on_grnd']/2+weather.loc['2015-04-01']['snow_on_grnd']/2
weather['snow_on_grnd'] = weather['snow_on_grnd'].fillna(0)

I remove the month column.

In [29]:
weather = weather.drop(['month'], axis=1)

I am now done cleaning the weather values. Let's look at them!

In [30]:
weather.describe()

In [31]:
weather.plot(figsize=(20,8))

---------
2.3-Bike and Weather data --- Let's look at the data!
---
Now, I combine both data sets.


In [32]:
bike_and_weather = pd.concat([bike, weather], axis = 1)
bike_and_weather.head()

In order to plot the total number of cyclists together with the weather data, I normalize all variables, i.e. I shift and scale them so that their range is between 0 and 1. This puts all the data on the same order of magnitude.

I plot the weekend data separately, because otherwise it "clogs" the image.

In [33]:
scale_data = minmax_scaling(bike_and_weather, columns = ['max_temp','min_temp','mean_temp','rain','snow','snow_on_grnd','weekend','total'])
scale_data[['mean_temp','rain','snow','snow_on_grnd','total']].plot(figsize=(20,8))
scale_data[['weekend','total']].plot(figsize=(20,8))

I observe that there are fewer bikers when it is cold and there is snow on the ground. I also observe that there are fewer bikers every time it rains. There also seem to be fewer bikers on the weekend.


I now plot a correlation matrix.

In [34]:
sns.heatmap(bike_and_weather[['mean_temp','rain','snow','snow_on_grnd','weekend','total']].corr(),annot=True, cmap="YlGnBu")

We can see that the number of cyclists is highly correlated with the temperature. Let's compare the correlation with our three measures of the temperature. The pair plot between the temperature and the total number of cyclists also shows that the relationship is not linear. This is why I introduce the log of the total number of cyclists below: with this transformation, the relationship could become linear.

In [35]:
sns.heatmap(bike_and_weather[['max_temp','mean_temp','min_temp','total']].corr(),annot=True, cmap="YlGnBu")
sns.pairplot(bike_and_weather[['max_temp','mean_temp','min_temp','total']])

The plot between the total number of cyclists and the temperature is more of a curve. To correct that, we take the natural logarithm of the total number of cyclists.

In [36]:
bike_and_weather['log_total']=np.log(bike_and_weather['total'])
sns.heatmap(bike_and_weather[['max_temp','mean_temp','min_temp','log_total']].corr(),annot=True, cmap="YlGnBu")
sns.pairplot(bike_and_weather[['max_temp','mean_temp','min_temp','log_total']])
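
To illustrate why taking the logarithm can help, here is a small sketch with made-up numbers. It assumes a count that grows exponentially with the temperature, which is only an idealized stand-in for the curved relationship seen in the pair plot.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Hypothetical temperatures and counts (NOT the notebook's data):
# the count is an exact exponential function of the temperature.
temps = list(range(-10, 31, 5))
counts = [500 * math.exp(0.15 * t) for t in temps]

r_raw = pearson(temps, counts)                         # curved relationship
r_log = pearson(temps, [math.log(c) for c in counts])  # straight line after the log
print(r_raw, r_log)
```

In this idealized case the correlation with the log of the count is essentially 1, while the correlation with the raw count is lower. On real data the improvement is of course smaller.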

It seems that the max and mean temp are a little more important than the min. This makes sense to me, because people usually bike during the day, when the temperature is closer to the max or the mean of the day. Because there is very little value added to keeping all three features, I only keep the maximum temperature.

In [37]:
bike_and_weather=bike_and_weather.drop(['mean_temp','min_temp'], axis=1);

*Note that now that we used the log_total for one feature, we need to keep using it for the following features.*

------

Now, we consider the other features. A low correlation coefficient does not mean that a feature can't be a good predictor.
First, we consider the features related to precipitation.

In [38]:
sns.pairplot(bike_and_weather[['rain','snow','snow_on_grnd','total','log_total']])

For those three variables, I observe, by looking at the pair plot with the total number of cyclists, that there is a cutoff value (close to 0 for snow and snow_on_grnd, about 15 for rain) above which it is almost impossible to have a lot of cyclists (for the rain variable, being above 15 means the total will almost always be below 37500). Such a variable could be used in a decision tree. One way to deal with these variables in a regression would be to find a good cutoff and transform them into binary variables. Alternatively, we can use transformations to increase the linear dependence (correlation).

This is what we do here, starting with the rain variable. It takes a little trial and error to find a good transformation to increase correlation.

In [39]:
bike_and_weather['log_rain']=np.log(bike_and_weather['rain']+0.001) #I have added a very small value to the rain in order to take the log, because some values are 0.
sns.pairplot(bike_and_weather[['rain','log_rain','log_total']])
print('correlation between rain and log_total', bike_and_weather['rain'].corr(bike_and_weather['log_total']))
print('correlation between log_rain and log_total', bike_and_weather['log_rain'].corr(bike_and_weather['log_total']))
bike_and_weather=bike_and_weather.drop(['rain'], axis=1);

Now, we do the same for the snow variable.

In [40]:
bike_and_weather['snow_inv']=-np.power(bike_and_weather['snow']+0.001,-1)
sns.pairplot(bike_and_weather[['snow','snow_inv','log_total']])
print('correlation between snow and log_total', bike_and_weather['snow'].corr(bike_and_weather['log_total']))
print('correlation between snow_inv and log_total', bike_and_weather['snow_inv'].corr(bike_and_weather['log_total']))
bike_and_weather=bike_and_weather.drop(['snow'], axis=1);

We can see that this transformation basically differentiates between the presence and absence of snow, which, as I discussed earlier, is the most important factor.
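
To see this effect on numbers, here is a minimal sketch that applies the same transformation, followed by a min-max scaling, to a few made-up snow amounts. The function name `neg_inverse` and the list values are assumptions chosen for illustration.

```python
def neg_inverse(x: float, eps: float = 0.001) -> float:
    """Transformation used for the snow variables: -1 / (x + eps)."""
    return -1.0 / (x + eps)

snow_amounts = [0.0, 0.2, 1.0, 5.0, 20.0]  # hypothetical values, not from the data set
transformed = [neg_inverse(s) for s in snow_amounts]

# Min-max scaling to [0, 1], to make the values easier to compare.
lo, hi = min(transformed), max(transformed)
scaled = [(t - lo) / (hi - lo) for t in transformed]

for amount, value in zip(snow_amounts, scaled):
    print(amount, round(value, 4))
```

The day without snow ends up at 0 and every day with snow ends up very close to 1, so the transformed variable behaves almost like a yes/no indicator.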

Finally, we also transform the variable for the quantity of snow on the ground.

In [41]:
bike_and_weather['snow_on_grnd_inv']=-np.power(bike_and_weather['snow_on_grnd']+0.001,-1)
sns.pairplot(bike_and_weather[['snow_on_grnd','snow_on_grnd_inv','log_total']])
print('correlation between snow_on_grnd and log_total', bike_and_weather['snow_on_grnd'].corr(bike_and_weather['log_total']))
print('correlation between snow_on_grnd_inv and log_total', bike_and_weather['snow_on_grnd_inv'].corr(bike_and_weather['log_total']))
bike_and_weather=bike_and_weather.drop(['snow_on_grnd'], axis=1);

Once again, the new variable distinguishes between whether there is snow on the ground or not.

------
Finally, we look at the weekend/holiday factor.

In [42]:
g=sns.FacetGrid(bike_and_weather, col='weekend')
g.map(sns.histplot, "log_total", kde=True, stat="density") # histplot replaces the deprecated distplot

We can see that these distributions are similar (they are both bimodal), but there are fewer cyclists during the weekend and the holidays.

------
Finally, I'll rescale all factors to be between 0 and 1. I could use the minmax_scaling function, but I want to be able to invert this scaling in order to do analysis on the factors. Therefore, I keep track of the min and the max of each factor and scale them myself.

In [43]:
min_max = bike_and_weather.describe()
min_max

In [44]:
scale_data = bike_and_weather.copy()
scale_data['max_temp']=(bike_and_weather['max_temp']-min_max.loc['min','max_temp'])/(min_max.loc['max','max_temp']-min_max.loc['min','max_temp'])
scale_data['log_rain']=(bike_and_weather['log_rain']-min_max.loc['min','log_rain'])/(min_max.loc['max','log_rain']-min_max.loc['min','log_rain'])
scale_data['snow_inv']=(bike_and_weather['snow_inv']-min_max.loc['min','snow_inv'])/(min_max.loc['max','snow_inv']-min_max.loc['min','snow_inv'])
scale_data['snow_on_grnd_inv']=(bike_and_weather['snow_on_grnd_inv']-min_max.loc['min','snow_on_grnd_inv'])/(min_max.loc['max','snow_on_grnd_inv']-min_max.loc['min','snow_on_grnd_inv'])
scale_data_rnd = scale_data.sample(frac=1)
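
The reason for keeping the min and the max is that the scaling can then be undone. Here is a minimal sketch of the pair of operations. The helper names and the bounds are hypothetical, not values read from `min_max`.

```python
def minmax_scale(x: float, lo: float, hi: float) -> float:
    """Map x from the interval [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

def minmax_unscale(z: float, lo: float, hi: float) -> float:
    """Inverse of minmax_scale: map z from [0, 1] back to [lo, hi]."""
    return z * (hi - lo) + lo

# Hypothetical bounds for a temperature column (NOT the notebook's values).
lo, hi = -20.0, 30.0

scaled = minmax_scale(20.0, lo, hi)
restored = minmax_unscale(scaled, lo, hi)
print(scaled, restored)
```

Any new observation has to be scaled with the same `lo` and `hi` as the training data, otherwise the model receives values on a different scale.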


------
3- Modeling
===
As I have a very small data set, I'll use cross validation to evaluate the quality of my model.


First, I'll set the data.

In [45]:
y= scale_data_rnd['total']
y_log = scale_data_rnd['log_total']
X = scale_data_rnd.drop(['total','log_total'], axis=1)

--------
 I'll use XGBoost. XGBoost uses several decision trees and combines them to create predictions. As it uses decision trees, this model doesn't require linearity (I still transformed my variables to increase linearity as an exercise).


In [46]:
# verbosity=0 avoids printing out updates with each cycle (it replaces the older silent=True)
model = XGBRegressor(verbosity=0)
scores = cross_val_score(model, X, y_log, scoring='neg_mean_absolute_error')
model.fit(X, y_log)
print(scores/(min_max.loc['max','log_total']-min_max.loc['min','log_total']))

As the mean absolute error is pretty low and consistent across the folds, I think this is a pretty good model.
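
Since the model is fitted on the log of the total, an absolute error on that scale corresponds to a multiplicative error on the number of cyclists. The sketch below shows the conversion for a made-up error value. It is not a score produced by this notebook.

```python
import math

# Hypothetical mean absolute error on the log scale (NOT a result from the notebook).
# Note that scikit-learn's 'neg_mean_absolute_error' returns the negated value.
log_mae = 0.10

# |log(pred) - log(true)| = d  means  pred / true = exp(d) or exp(-d).
factor = math.exp(log_mae)

print(f"typical over-prediction:  +{factor - 1:.1%}")
print(f"typical under-prediction: -{1 - 1 / factor:.1%}")
```

For small errors, the log-scale error is therefore close to the relative error on the count itself.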

I plot the importance of different variables. This is a count of how many times each feature is used for a split in the many decision trees created by XGB.

In [47]:
xgb.plot_importance(model)

Now, we plot the results.

In [48]:
y_pred = model.predict(X)
scale_data_rnd['pred']=np.exp(y_pred)
scale_data_rnd[['total','pred']].sort_index().plot(figsize=(20,8)) # the rows were shuffled, so I sort them by date before plotting

This is a really good fit! So good that one might actually be concerned about overfitting, especially since this plot shows predictions on the same data the model was fitted on. But as the cross-validation scores above are computed on held-out folds, I believe that the number of cyclists is just really correlated with the weather and whether it's a work day or not! Also, given that I used so few predictors, overfitting is less likely.

------
Finally, to use the model, I need to transform the original data. I created this function which takes in the maximum temperature, the quantity of rain, snow and snow on the ground and whether it's a work day or not, and outputs the predicted number of cyclists on the Montréal bike lanes for a day.

**Note that as the model was created with 2015 values using only some bike lanes, it can only be used to estimate number of cyclists for those bike lanes in Montréal during that year. I believe this work could be extended to other cities or other years.**

In [49]:
def predict_cyclist(model, max_temp, weekend, rain, snow, snow_on_grnd):
    # Min-max scaling with the min and max of the training data.
    def scale(value, col):
        return (value-min_max.loc['min',col])/(min_max.loc['max',col]-min_max.loc['min',col])
    # Apply the same transformations as for the training data, then scale.
    max_temp_model = scale(max_temp, 'max_temp')
    log_rain = scale(np.log(rain+0.001), 'log_rain')
    snow_inv = scale(-np.power(snow+0.001,-1), 'snow_inv')
    snow_on_grnd_inv = scale(-np.power(snow_on_grnd+0.001,-1), 'snow_on_grnd_inv')
    # Build a one-row DataFrame with numeric columns, in the same order as X.
    temp = pd.DataFrame([[max_temp_model,weekend,log_rain,snow_inv,snow_on_grnd_inv]],
                        columns=['max_temp','weekend','log_rain','snow_inv','snow_on_grnd_inv'])
    # The model predicts log_total, so I take the exponential to get a number of cyclists.
    return np.exp(model.predict(temp))

In [50]:
predict_cyclist(model,20,0,0,0,0)