### Example: Predicting Bicycle Traffic

• As an example, let's take a look at whether we can predict the number of bicycle trips across Seattle's Fremont Bridge based on weather, season, and other factors.

• We have already seen this data in "Working with Time Series."

• In this section, we will join the bike data with another dataset, and try to determine the extent to which weather and seasonal factors—temperature, precipitation, and daylight hours—affect the volume of bicycle traffic through this corridor.

• Fortunately, NOAA makes its daily weather station data available (I used station ID USW00024233), and we can easily use Pandas to join the two data sources.

• We will perform a simple linear regression to relate weather and other information to bicycle counts, in order to estimate how a change in any one of these parameters affects the number of riders on a given day.

• In particular, this is an example of how the tools of Scikit-Learn can be used in a statistical modeling framework, in which the parameters of the model are assumed to have interpretable meaning.

• As discussed previously, this is not a standard approach within machine learning, but such interpretation is possible for some models.
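
• To make "interpretable parameters" concrete, here is a minimal sketch on synthetic, noise-free data (not the Fremont Bridge data). The feature names `temp_c` and `dry_day` and the numbers 1000, 50, and 400 are assumptions made up for illustration; the point is only that each fitted coefficient reads as "change in riders per unit change in that feature, with the other held fixed."

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only; none of these values come from
# the bicycle or weather datasets.
rng = np.random.default_rng(0)
temp_c = rng.uniform(0, 30, size=200)        # hypothetical daily temperature
dry_day = rng.integers(0, 2, size=200)       # 1 if no precipitation, else 0
riders = 1000 + 50 * temp_c + 400 * dry_day  # made-up linear ground truth

X = np.column_stack([temp_c, dry_day])
model = LinearRegression().fit(X, riders)

# Each coefficient estimates the change in riders per unit change in one
# feature, holding the other feature fixed.
coef_temp, coef_dry = model.coef_
print(f"riders per degree: {coef_temp:.1f}, riders gained on a dry day: {coef_dry:.1f}")
```

• Because the synthetic target is exactly linear, the fit recovers the made-up coefficients; with real data the estimates carry uncertainty and depend on which other features are included.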

• Let's start by loading the two datasets, indexing by date:

In [1]:
import pandas as pd

In [5]:
counts = pd.read_csv('Bicycle_Counts.csv', index_col='Date', parse_dates=True)
weather = pd.read_csv('Bicycle_Weather.csv', index_col='DATE', parse_dates=True)
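
• As a quick illustration of what `index_col` and `parse_dates=True` do together, the sketch below reads a tiny in-memory CSV. The two rows and the column names `East` and `West` are hypothetical stand-ins, not the real file's contents.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for a counts file; the values and column
# names are illustrative, not the real file's schema.
csv_text = """Date,East,West
2012-10-03 00:00:00,4,9
2012-10-03 01:00:00,4,6
"""

sample = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

# With parse_dates=True the index column is parsed into a DatetimeIndex,
# which time-based operations such as resample() require.
is_datetime_index = isinstance(sample.index, pd.DatetimeIndex)
print(type(sample.index).__name__)
```

• If the index came back as plain strings instead, date-based resampling and date-aligned joins between the two tables would not work as intended.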

• Next we will compute the total daily bicycle traffic and store it in its own `DataFrame`:

In [13]:
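# Sum the counts within each calendar day ('D' is the daily frequency alias)
daily = counts.resample('D').sum()
# Add both sidewalks together to get the total crossings per day
daily['Total'] = daily.sum(axis=1)
daily.head()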

Date,Fremont Bridge East Sidewalk,Fremont Bridge West Sidewalk,Total
2012-10-03,1760.0,1761.0,3521.0
2012-10-04,1708.0,1767.0,3475.0
2012-10-05,1558.0,1590.0,3148.0
2012-10-06,1080.0,926.0,2006.0
2012-10-07,1191.0,951.0,2142.0
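
• The daily aggregation above can be checked on a small synthetic example. This is a sketch under assumed values: 48 hourly rows with a constant 10 riders on one side and 5 on the other, and the hypothetical column names `East` and `West`.

```python
import pandas as pd

# Synthetic hourly counts covering two full days (not the real data).
hours = pd.date_range('2012-10-03', periods=48, freq=pd.Timedelta(hours=1))
hourly = pd.DataFrame({'East': 10, 'West': 5}, index=hours)

# Collapse the hourly rows into one row per calendar day.
daily_demo = hourly.resample('D').sum()

# Row-wise sum across the two sidewalk columns. 'Total' is assigned only
# after the sum is computed, so it is not counted in its own total.
daily_demo['Total'] = daily_demo.sum(axis=1)
print(daily_demo)
```

• Each day here has 24 hourly rows, so the per-day sums are 240 and 120 and the total is 360, which makes the result easy to verify by hand.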


• We saw previously that the patterns of use generally vary from day to day; let's account for this in our data by adding binary columns that indicate the day of the week: