# The Trail Foundation - Forecasting Trail Traffic

The Austin Parks department has time series data on foot and bike traffic on a trail near downtown Austin. This data is influenced by a number of conditions, including weather, time of year, day of week, major and minor events, openings of new businesses and event spaces, and other harder-to-predict events and trends.

Using historical data, participants will attempt to develop models that capture these sources of variability and forecast expected future traffic at various locations along the trails. This should include models, statistics, visualizations, and, potentially, interfacing components to allow models to be retrained and/or queried.

<img src="https://thetrailfoundation.org/wp-content/uploads/2014/01/TTF-logo-horizontal.png"/>

<img src="https://5107083.toastmastersclubs.org/imageuploads/5107083/walmart_technology_logo.png"/>

# 0. Imports

This is where we import the libraries we need for this exercise.

In [1]:
import numpy as np
import pandas as pd
from plotly import tools
import plotly.graph_objs as go
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot, iplot_mpl
import plotly.plotly as py
from scipy.optimize import leastsq

init_notebook_mode(connected=True)

# 1. Data Ingestion

This section contains all the necessary data ingestion code which allows for easier Exploratory Data Analysis later in the file.

## 1.1 Data Importing

The data is contained within the *daily_counts_10-25-19.csv* file. We import it and check that the data was parsed correctly.

In [2]:
trail_df = pd.read_csv('daily_counts_10-25-19.csv')

trail_df = trail_df.rename(index=str, columns={"Time": "Date"})
trail_df.columns = ['Date', 'Butler Trail - Crenshaw Bridge PC Urban Trail', 
                    'Butler Trail - South Lamar PC Urban Trail', 'Butler Trail - North Congress PC Urban Trail',
                   'Butler Trail - Longhorn Dam PC Urban Trail', 'Shoal Creek Solar Trail PC Urban Trail ped/bike']

trail_df

Unnamed: 0,Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike
0,2/17/16 0:00,4242.0,,,,
1,2/18/16 0:00,4979.0,,,,
2,2/19/16 0:00,5002.0,,,,
3,2/20/16 0:00,7697.0,,,,
4,2/21/16 0:00,5958.0,,,,
...,...,...,...,...,...,...
1341,10/20/19 0:00,2510.0,2510.0,,1699.0,329.0
1342,10/21/19 0:00,2839.0,2911.0,,961.0,674.0
1343,10/22/19 0:00,3101.0,2753.0,,918.0,483.0
1344,10/23/19 0:00,2107.0,1995.0,,987.0,329.0


## 1.2 Data Cleaning

There are several validations and cleaning steps we should do before analyzing the data. Let's take a look at some descriptive statistics to make sure the ranges for the column values are reasonable and see how the NaNs are being handled.

In [3]:
trail_df.shape

(1346, 6)

In [4]:
trail_df.describe()

Unnamed: 0,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike
count,1297.0,571.0,567.0,539.0,807.0
mean,4021.16037,2554.957968,2985.592593,948.474954,266.263941
std,2480.562293,1613.786218,1333.612252,470.920084,266.157994
min,224.0,94.0,332.0,1.0,1.0
25%,2721.0,1648.5,2044.5,730.5,124.5
50%,3801.0,2278.0,2792.0,910.0,191.0
75%,4906.0,3277.0,3655.0,1202.0,329.0
max,20530.0,21729.0,10374.0,2569.0,3061.0


This looks right. `trail_df` is 1346 rows long, and the most entries any column has is 1297. Some have far fewer, likely because there wasn't a counter installed on the trail yet. It also appears that Pandas ignores NaNs instead of including them in counts or infilling them. I'm adding a month column since we will want to use that later.
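
To make the NaN behavior concrete, here is a minimal sketch on a hypothetical two-column frame (the names `toy`, `site_a`, and `site_b` are made up for illustration and are not part of the trail data). It shows that `count()` and `mean()` skip NaNs by default, and that `isna().sum()` gives an explicit tally of what is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy counts; NaN marks days before a counter was installed.
toy = pd.DataFrame({
    "site_a": [100.0, 200.0, 300.0, 400.0],
    "site_b": [np.nan, np.nan, 50.0, 150.0],
})

non_null = toy.count()       # NaNs are excluded from the count
means = toy.mean()           # NaNs are skipped when averaging (skipna=True by default)
missing = toy.isna().sum()   # explicit tally of missing values per column

print(non_null.to_dict())
print(means.to_dict())
print(missing.to_dict())
```

The same three calls could be run on `trail_df` to see how many days each counter is missing.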

In [5]:
trail_df['Date'] = pd.to_datetime(trail_df['Date'])

In [6]:
trail_df['Butler Trail - Crenshaw Bridge PC Urban Trail'] = trail_df['Butler Trail - Crenshaw Bridge PC Urban Trail'].replace(0, np.nan)
trail_df['month'] = trail_df['Date'].dt.month
trail_df['Butler Trail - Crenshaw Bridge PC Urban Trail'] = trail_df['Butler Trail - Crenshaw Bridge PC Urban Trail'].astype(float)

# trail_df = trail_df.astype({'Butler Trail - Crenshaw Bridge PC Urban Trail': 'int64', 'Butler Trail - South Lamar PC Urban Trail': 'Int64',
#                 'Butler Trail - North Congress PC Urban Trail': 'Int64', 'Butler Trail - Longhorn Dam PC Urban Trail': 'Int64',
#                 'Shoal Creek Solar Trail PC Urban Trail ped/bike': 'Int64', 'month': 'Int64'})

In [7]:
trail_df.dtypes

Date                                               datetime64[ns]
Butler Trail - Crenshaw Bridge PC Urban Trail             float64
Butler Trail - South Lamar PC Urban Trail                 float64
Butler Trail - North Congress PC Urban Trail              float64
Butler Trail - Longhorn Dam PC Urban Trail                float64
Shoal Creek Solar Trail PC Urban Trail ped/bike           float64
month                                                       int64
dtype: object

# 2. Exploratory Data Analysis (EDA)

This section contains basic EDA of the counts in the data set in order to understand their properties.

First, let's take a look at daily traffic at the five locations where there are people counters.

In [8]:
x = trail_df['Date']
y1 = trail_df['Butler Trail - Crenshaw Bridge PC Urban Trail']
y2 = trail_df['Butler Trail - South Lamar PC Urban Trail']
y3 = trail_df['Butler Trail - North Congress PC Urban Trail']
y4 = trail_df['Butler Trail - Longhorn Dam PC Urban Trail']
y5 = trail_df['Shoal Creek Solar Trail PC Urban Trail ped/bike']

trace1 = go.Scatter(x=x, y=y1, name='Butler - Crenshaw')
trace2 = go.Scatter(x=x, y=y2, name='Butler - S Lamar')
trace3 = go.Scatter(x=x, y=y3, name='Butler - Congress')
trace4 = go.Scatter(x=x, y=y4, name='Butler - Longhorn Dam')
trace5 = go.Scatter(x=x, y=y5, name='Shoal Creek')

data = [trace1, trace2, trace3, trace4, trace5]
layout = go.Layout(title='Daily Total Traffic', legend=dict(x=-.1, y=1.2))

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [9]:
# There are still some dates that are empty for one or more trails. Replacing this with a zero or an average would 
# indicate a trend that isn't there, so I will leave them as is.
trail_df[trail_df['Date'] == '2018-07-25']

Unnamed: 0,Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike,month
889,2018-07-25,,1732.0,1782.0,808.0,,7


It will also be useful to get an idea how traffic on the trail varies by day. Let's look at mean traffic counts for each of those five locations by day of the week.

The output row indices indicate day of the week from Mon - Sun.

In [10]:
week_df = trail_df.groupby(trail_df['Date'].dt.weekday).mean().drop(columns=['month'])
week_df

Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike
0,3415.827027,2425.469136,2639.358025,831.506329,244.922414
1,3316.075676,2044.08642,2472.938272,765.558442,247.478261
2,3364.897849,1929.439024,2379.864198,741.74359,258.052174
3,3355.284946,2129.04878,2648.950617,789.924051,289.808696
4,3602.589189,1997.012195,2667.160494,767.828947,267.905172
5,5675.248649,3631.585366,4122.0,1376.72,291.521739
6,5425.345946,3734.641975,3968.876543,1396.293333,264.330435


In [11]:
x = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

trace1 = go.Bar(x=x, y=week_df['Butler Trail - Crenshaw Bridge PC Urban Trail'], name='Butler - Crenshaw')
trace2 = go.Bar(x=x, y=week_df['Butler Trail - South Lamar PC Urban Trail'], name='Butler - South Lamar')
trace3 = go.Bar(x=x, y=week_df['Butler Trail - North Congress PC Urban Trail'], name='Butler - Congress')
trace4 = go.Bar(x=x, y=week_df['Butler Trail - Longhorn Dam PC Urban Trail'], name='Butler - Longhorn Dam')
trace5 = go.Bar(x=x, y=week_df['Shoal Creek Solar Trail PC Urban Trail ped/bike'], name='Shoal Creek')

data = [trace1, trace2, trace3, trace4, trace5]
layout = go.Layout(xaxis=dict(tickangle=-45), barmode='group', title='Mean Traffic by Weekday')

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Weekends are busier on average for every section of the trail that TTF monitors, which is what we would expect. The gap is large at the Butler Trail counters and small at Shoal Creek.
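
As a rough way to quantify the weekend effect, the sketch below recomputes a weekend-to-weekday ratio from the weekday means printed in the table above. Only two columns are copied over, and the short names `crenshaw` and `shoal_creek` are my own labels. Note that this averages the seven daily means, so each weekday is weighted equally rather than each observation.

```python
import pandas as pd

# Weekday means copied from the table above (index 0 = Monday ... 6 = Sunday).
weekday_means = pd.DataFrame(
    {
        "crenshaw": [3415.827027, 3316.075676, 3364.897849, 3355.284946,
                     3602.589189, 5675.248649, 5425.345946],
        "shoal_creek": [244.922414, 247.478261, 258.052174, 289.808696,
                        267.905172, 291.521739, 264.330435],
    },
    index=range(7),
)

is_weekend = weekday_means.index >= 5  # Saturday and Sunday
weekend_ratio = weekday_means[is_weekend].mean() / weekday_means[~is_weekend].mean()
print(weekend_ratio.round(2))
```

On these numbers the ratio comes out at roughly 1.6 for Crenshaw and only slightly above 1 for Shoal Creek.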

Now let's look to see if there is monthly periodicity too. The output row indices indicate month from Jan - Dec.

In [12]:
month_df = trail_df.groupby(trail_df['Date'].dt.month).mean().drop(columns=['month'])

month_df

Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike
1,4143.634409,2772.903226,2842.516129,946.483871,126.138889
2,4164.886598,2764.285714,3044.642857,725.5625,125.479167
3,4681.104839,2978.03125,4004.864865,755.545455,128.483871
4,4445.216667,3542.083333,4196.383333,1132.728814,397.579545
5,3903.66129,3109.983871,2820.354839,1082.016129,309.72043
6,3336.94898,2755.716667,3221.4,1121.983333,341.5
7,2801.010309,2427.66129,2783.048387,1080.887097,302.369863
8,2800.443548,1824.612903,2199.870968,987.032258,238.725
9,3330.583333,1991.35,2641.333333,1005.85,292.753247
10,5919.316239,1498.763636,2686.204545,741.327273,263.626866


In [13]:
x = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November',
    'December']

trace1 = go.Bar(x=x, y=month_df['Butler Trail - Crenshaw Bridge PC Urban Trail'], name='Butler - Crenshaw')
trace2 = go.Bar(x=x, y=month_df['Butler Trail - South Lamar PC Urban Trail'], name='Butler - South Lamar')
trace3 = go.Bar(x=x, y=month_df['Butler Trail - North Congress PC Urban Trail'], name='Butler - Congress')
trace4 = go.Bar(x=x, y=month_df['Butler Trail - Longhorn Dam PC Urban Trail'], name='Butler - Longhorn Dam')
trace5 = go.Bar(x=x, y=month_df['Shoal Creek Solar Trail PC Urban Trail ped/bike'], name='Shoal Creek')

data = [trace1, trace2, trace3, trace4, trace5]
layout = go.Layout(xaxis=dict(tickangle=-45), barmode='group', title='Mean Traffic by Month')

fig = go.Figure(data=data, layout=layout)
iplot(fig)

Highest traffic for most locations is in March/April, possibly because that's when the nice spring weather starts. Peak traffic at Butler-Crenshaw is in October - this is likely caused by ACL, which is in October and is [right next door](https://goo.gl/maps/Qf9k7HQpVvMApN8z8) to the counter.

Let's see if there is periodicity across weeks of the year. The output row indices indicate the ordinal week of the year.

In [14]:
week_df = trail_df.groupby(trail_df['Date'].dt.week).mean().drop(columns=['month'])
week_df = week_df[:-1]
week_df

Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike
1,3964.047619,3263.142857,3809.0,1096.142857,82.0
2,4336.0,3213.857143,2862.0,1016.571429,143.142857
3,3593.809524,2371.0,2391.714286,803.428571,101.714286
4,4720.761905,2843.285714,2973.714286,1102.857143,175.2
5,4291.190476,2782.0,2540.142857,961.714286,88.857143
6,3713.380952,2037.285714,2719.571429,721.714286,118.785714
7,4327.846154,3075.857143,3551.0,457.6,193.0
8,4351.607143,3115.714286,3241.857143,5.0,90.888889
9,4874.678571,2136.428571,2241.0,,104.818182
10,3812.321429,2548.571429,2868.571429,22.5,130.214286
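
A note for anyone rerunning this on a newer pandas: `Series.dt.week` was deprecated in pandas 1.1 and removed in 2.0. The sketch below shows the same week-of-year grouping using `Series.dt.isocalendar()`, on a few hypothetical rows (the `dates` and `counts` series are toy data built from values that appear in the tables above).

```python
import pandas as pd

# Toy data: a handful of dates and counts, not the full trail dataset.
dates = pd.Series(pd.to_datetime(["2016-02-17", "2016-02-21", "2016-02-22", "2019-10-23"]))
counts = pd.Series([4242.0, 5958.0, 4000.0, 2107.0])

# ISO week number (1-53); cast from UInt32 to a plain int for convenient indexing.
iso_week = dates.dt.isocalendar().week.astype(int)

# Mean count per ISO week, pooled across years.
weekly_mean = counts.groupby(iso_week).mean()
print(weekly_mean)
```

ISO weeks start on Monday, which is why 2016-02-21 (a Sunday) falls in week 7 and 2016-02-22 starts week 8.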


In [15]:
x = [i for i in range(52)]

trace1 = go.Bar(x=x, y=week_df['Butler Trail - Crenshaw Bridge PC Urban Trail'], name='Butler - Crenshaw')
trace2 = go.Bar(x=x, y=week_df['Butler Trail - South Lamar PC Urban Trail'], name='Butler - South Lamar')
trace3 = go.Bar(x=x, y=week_df['Butler Trail - North Congress PC Urban Trail'], name='Butler - Congress')
trace4 = go.Bar(x=x, y=week_df['Butler Trail - Longhorn Dam PC Urban Trail'], name='Butler - Longhorn Dam')
trace5 = go.Bar(x=x, y=week_df['Shoal Creek Solar Trail PC Urban Trail ped/bike'], name='Shoal Creek')

data = [trace1, trace2, trace3, trace4, trace5]

layout = go.Layout(xaxis=dict(tickvals=[(2*k-1)*(52/12)/2 for k in range(1,13)],
                              ticktext=['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 
                                        'September', 'October', 'November', 'December'], 
                              tickangle=-45), 
                   yaxis=dict(hoverformat='.0f'), 
                   barmode='group', 
                   title='Mean Traffic by Ordinal Week',
                   annotations=[dict(x=39, y=9400, 
                                     xref='x', yref='y', 
                                     text='ACL Weekends', 
                                     showarrow=True, arrowhead=6, 
                                     ax=0, ay=-30)]
                  )


fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [16]:
# Box plot of Butler-Crenshaw traffic by month.

x = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November',
    'December']

trace_dict = {}
for i in range(len(x)):
    trace_dict['trace{}'.format(i)] = go.Box(
        {
        'y': trail_df['Butler Trail - Crenshaw Bridge PC Urban Trail'].loc[trail_df['month'] == i + 1], 
        'type':'box', 
        'name': x[i]
        }
    )

data = [data for trace, data in trace_dict.items()]
layout = go.Layout(xaxis=dict(tickangle=-45), title='Monthly Count Box Plots (Butler-Crenshaw)', showlegend=False)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

# 3. Forecasting

I'm now going to test out the Prophet library from Facebook. This will let me model the trends we are seeing, add information about known busy times on the trail, and add weather as a regressor. I'll just be looking at the Crenshaw Bridge counts since they cover the longest time period and have the most traffic. I'll be trying out several models and evaluating their performance against each other:
* Mean
* Moving Average
* ARIMA
* Generalized Additive Model (GAM)

## 3.1 Data Processing
These four models need the data in different formats. I'll go ahead and take care of that here so we don't have to do it later.

In [17]:
from matplotlib import pyplot
from matplotlib import figure

from pandas import read_csv
from pandas import datetime
from pandas.plotting import autocorrelation_plot

from statsmodels.tsa.arima_model import ARIMA

from sklearn.metrics import mean_squared_error, mean_absolute_error


%load_ext autoreload

%autoreload 2

I only want to work with the period that has values for Butler-Crenshaw, since that goes back the farthest. Before I start building and evaluating models on that data, I need to decide how to handle the missing values from 6/9/18 - 7/27/18. I don't want to lose 3 years of data by starting at the end of this gap, so I'm going to fill in average values for the missing dates, matched on week of the year and day of the week.

In [18]:
crenshaw_df = trail_df.copy()
crenshaw_df['week'] = crenshaw_df['Date'].dt.week
crenshaw_df['day'] = crenshaw_df['Date'].dt.dayofweek

In [19]:
crenshaw_df.head()

Unnamed: 0,Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike,month,week,day
0,2016-02-17,4242.0,,,,,2,7,2
1,2016-02-18,4979.0,,,,,2,7,3
2,2016-02-19,5002.0,,,,,2,7,4
3,2016-02-20,7697.0,,,,,2,7,5
4,2016-02-21,5958.0,,,,,2,7,6


In [20]:
nan_idxs = list(np.where(crenshaw_df['Butler Trail - Crenshaw Bridge PC Urban Trail'].isnull())[0])

For every row in nan_idxs:
* Select all *other* (excluding current) instances in the dataset with that same week and day.
* Take the mean count for those days for Crenshaw
* Replace count (currently NaN) for row with mean count
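
The steps above can also be sketched in vectorized form. This is an alternative illustration, not the approach used in the cells that follow: `impute_by_week_and_day` is a hypothetical helper, and the toy frame uses a made-up `count` column. It relies on `groupby(...).transform("mean")` skipping NaNs, so each missing value is filled with the mean of the other rows that share its week and weekday.

```python
import numpy as np
import pandas as pd

def impute_by_week_and_day(df, col):
    """Fill NaNs in `col` with the mean of rows sharing the same week and weekday."""
    out = df.copy()
    # transform("mean") ignores NaNs, so the NaN row itself does not affect the mean.
    group_mean = out.groupby(["week", "day"])[col].transform("mean")
    out[col] = out[col].fillna(group_mean)
    return out

# Toy data: three Saturdays in week 23 (one missing) and one Monday in week 24.
toy = pd.DataFrame({
    "week": [23, 23, 23, 24],
    "day": [5, 5, 5, 0],
    "count": [4000.0, np.nan, 5000.0, 3000.0],
})
filled = impute_by_week_and_day(toy, "count")
print(filled)
```

If every row in a (week, day) group is missing, the group mean is itself NaN and the value stays unfilled, so it is worth checking for leftover NaNs afterwards.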

In [21]:
inverse_df = crenshaw_df.drop(crenshaw_df.index[nan_idxs])

In [22]:
def impute_nans(nan_row):
    week = nan_row['week']
    day = nan_row['day']
    temp_df = inverse_df.query('week=={} and day=={}'.format(week, day))
    new_val = temp_df['Butler Trail - Crenshaw Bridge PC Urban Trail'].mean()
    return new_val

In [23]:
# Take an explicit copy so the assignment below does not raise a SettingWithCopyWarning.
nan_df = crenshaw_df.iloc[nan_idxs].copy()
nan_df['Butler Trail - Crenshaw Bridge PC Urban Trail'] = nan_df.apply(impute_nans, axis=1)



In [24]:
# NaN values are now replaced with the imputed values created above.
crenshaw_df.iloc[nan_idxs] = nan_df

In [25]:
# Inspect the rows that were imputed.
crenshaw_df.iloc[nan_idxs]

Unnamed: 0,Date,Butler Trail - Crenshaw Bridge PC Urban Trail,Butler Trail - South Lamar PC Urban Trail,Butler Trail - North Congress PC Urban Trail,Butler Trail - Longhorn Dam PC Urban Trail,Shoal Creek Solar Trail PC Urban Trail ped/bike,month,week,day
843,2018-06-09,4883.0,4231.0,4363.0,1559.0,495.0,6,23,5
844,2018-06-10,4303.333333,3310.0,3720.0,1629.0,423.0,6,23,6
845,2018-06-11,2832.666667,1931.0,2510.0,1208.0,371.0,6,24,0
846,2018-06-12,3314.0,2001.0,2615.0,986.0,363.0,6,24,1
847,2018-06-13,4594.666667,2232.0,2616.0,984.0,343.0,6,24,2
848,2018-06-14,2913.666667,2214.0,2846.0,1761.0,510.0,6,24,3
849,2018-06-15,2649.666667,2223.0,2945.0,1026.0,302.0,6,24,4
850,2018-06-16,3903.0,4130.0,4457.0,1453.0,406.0,6,24,5
851,2018-06-17,3488.666667,3965.0,3957.0,1466.0,484.0,6,24,6
852,2018-06-18,2901.333333,2260.0,2696.0,717.0,489.0,6,25,0


In [26]:
# Let's take a look and make sure this looks sane.
x = crenshaw_df['Date']
y = crenshaw_df['Butler Trail - Crenshaw Bridge PC Urban Trail']

trace1 = go.Scatter(x=x, y=y, name='Butler - Crenshaw')

data = [trace1]
layout = go.Layout(title='Daily Total Traffic', legend=dict(x=-.1, y=1.2))

fig = go.Figure(data=data, layout=layout)
iplot(fig)

In [27]:
multivariate_df = crenshaw_df.copy()
multivariate_df = multivariate_df.set_index('Date')

weather_df = pd.read_csv('weather_10-22-19.csv')
weather_df['DATE'] = pd.to_datetime(weather_df['DATE'])
weather_df = weather_df.set_index('DATE')

multivariate_df = pd.merge(multivariate_df, weather_df, how='inner', left_index=True, right_index=True)

multivariate_df = multivariate_df.drop(columns=['month', 'week', 'day', 'STATION', 'NAME', 
                                                'Butler Trail - South Lamar PC Urban Trail', 
                                                'Butler Trail - North Congress PC Urban Trail', 
                                                'Butler Trail - Longhorn Dam PC Urban Trail', 
                                                'Shoal Creek Solar Trail PC Urban Trail ped/bike'])

multivariate_df.head()

Unnamed: 0,Butler Trail - Crenshaw Bridge PC Urban Trail,PRCP,TMAX,TMIN
2016-02-17,4242.0,0.0,79,45
2016-02-18,4979.0,0.0,73,46
2016-02-19,5002.0,0.0,77,53
2016-02-20,7697.0,0.0,79,62
2016-02-21,5958.0,0.0,73,63


In [28]:
multivariate_df.tail()

Unnamed: 0,Butler Trail - Crenshaw Bridge PC Urban Trail,PRCP,TMAX,TMIN
2019-10-18,2549.0,0.0,74,47
2019-10-19,2707.0,0.0,82,48
2019-10-20,2510.0,0.0,92,61
2019-10-21,2839.0,0.4,90,59
2019-10-22,3101.0,0.0,76,48


Now we have a daily time series of Crenshaw counts with the missing values imputed, joined to the daily weather features (`PRCP`, `TMAX`, `TMIN`).

In [29]:
# split a dataset into train/test sets
def split_dataset(data):
    # split into standard weeks
    test_split_loc = .2 * len(data)
    test_split_loc = int(7 * (test_split_loc // 7)) - 1
    train, test = data[2:-test_split_loc], data[-test_split_loc:-6]    
    # restructure into windows of weekly data
    train = np.array(np.split(train, len(train)/7))
    test = np.array(np.split(test, len(test)/7))
    return train, test

In [30]:
days = ['+1', '+2', '+3', '+4', '+5', '+6', '+7']

### 3.1.1 Mean Model

In [31]:
split = int(len(multivariate_df) * .8)

train, test = multivariate_df[:split], multivariate_df[split:]

In [32]:
mean = train['Butler Trail - Crenshaw Bridge PC Urban Trail'][-7:].mean()

In [33]:
def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def mean_scoring(mean, train_data, test_data, forecast_len=7):
    predictions_list = []
    predictions = [mean for i in range(forecast_len)]
    predictions_list.append(predictions)
    data = train_data.append(test_data)
    for i in range(len(test_data) - forecast_len):
        mean = data['Butler Trail - Crenshaw Bridge PC Urban Trail'][len(train_data) - 6 + i:len(train_data) - 6 + i + forecast_len].mean()
        predictions = [mean for i in range(forecast_len)]
        predictions_list.append(predictions)
    # calculate MAE for each forecast horizon (use the test_data argument, not the global `test`)
    test_weeks = rolling_window(test_data['Butler Trail - Crenshaw Bridge PC Urban Trail'], 7)
    predictions_array = np.array(predictions_list)
    scores = []
    for i in range(test_weeks.shape[1]):
        mae = mean_absolute_error(test_weeks[:, i],
                                  predictions_array[:, i])
        scores.append(mae)
    # calculate overall MAE
    s = 0
    for row in range(test_weeks.shape[0]):
        for col in range(test_weeks.shape[1]):
            s += np.abs((test_weeks[row, col] - predictions_array[row, col]))
    score = (s / (test_weeks.shape[0] * test_weeks.shape[1]))
    return score, scores, predictions_list

In [34]:
mean_score, mean_scores, mean_pred_list = mean_scoring(mean, train, test)


Series.strides is deprecated and will be removed in a future version



In [35]:
print('Mean Model: [%.3f] %s' % (mean_score, mean_scores))


# forecast of final week of data compared to actual final week
test_trace = go.Scatter(
    x=test.index[-7:],
    y=test['Butler Trail - Crenshaw Bridge PC Urban Trail'][-7:],
    name='Actual Data'
)

forecast_trace = go.Scatter(
    x=test.index[-7:],
    y=mean_pred_list[-1],
    name='Forecasted Data'
)

data = [
    test_trace, 
    forecast_trace
]

layout = go.Layout(
    xaxis=dict(
        tickvals=test.index[-7:],
#         ticktext=days,
    ), 
    yaxis=dict(
        title='Count',
        rangemode='tozero'
    ), 
    title='Naive Mean Model Forecast'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

# error plot
mae_trace = go.Scatter(
    x=days, 
    y=mean_scores, 
    name='Naive Mean'
)

data = [mae_trace]

layout = go.Layout(
    xaxis=dict(
        tickvals=days,
        ticktext=days,
        title='Days Forecast',
    ), 
    yaxis=dict(
        title='Mean Absolute Error',
        rangemode='tozero'
    ), 
    title='Naive Mean Model Performance'
)


fig = go.Figure(data=data, layout=layout)
iplot(fig)

Mean Model: [926.782] [875.5317762085823, 906.9608908202063, 927.3183052688755, 940.5681694731124, 947.5730581205867, 938.8278109722977, 950.6963606735469]
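
To show how the per-horizon scores are put together, here is a small sketch on toy numbers. It is not the scoring function used above: `horizon_mae`, `history`, `actual`, and `lookback` are hypothetical names, and it uses `numpy.lib.stride_tricks.sliding_window_view` (assuming NumPy 1.20 or newer) in place of `as_strided`. Each row of `preds` is a flat forecast equal to the mean of the `lookback` values just before that window starts.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def horizon_mae(actual, preds):
    """Return (per-horizon MAE, overall MAE) for rolling forecast windows."""
    preds = np.asarray(preds, dtype=float)
    # One row per forecast origin, one column per step ahead.
    windows = sliding_window_view(np.asarray(actual, dtype=float), preds.shape[1])
    abs_err = np.abs(windows - preds)
    return abs_err.mean(axis=0), abs_err.mean()

history = np.array([10.0, 10.0])             # toy "training" values
actual = np.array([10.0, 12.0, 14.0, 16.0])  # toy "test" values
horizon, lookback = 2, 2

series = np.concatenate([history, actual])
n_windows = len(actual) - horizon + 1
# Flat forecast per window: mean of the `lookback` values just before the window start.
preds = np.array([
    [series[len(history) + s - lookback:len(history) + s].mean()] * horizon
    for s in range(n_windows)
])

per_horizon, overall = horizon_mae(actual, preds)
print(per_horizon, overall)
```

Because the toy series trends upward, the error grows with the horizon, which is the same pattern the seven per-day scores above show.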


### 3.1.2 Sine Wave Model

In [36]:
t = np.array([i for i in range(len(train))])

guess_mean = np.mean(np.array(train['Butler Trail - Crenshaw Bridge PC Urban Trail']))
guess_std = 3*np.std(np.array(train['Butler Trail - Crenshaw Bridge PC Urban Trail']))/(2**0.5)/(2**0.5)
guess_phase = 0
guess_freq = 1
guess_amp = 1

# we'll use this to plot our first estimate. This might already be good enough for you
data_first_guess = guess_std*np.sin(t+guess_phase) + guess_mean

# Define the function to optimize, in this case, we want to minimize the difference
# between the actual data and our "guessed" parameters
optimize_func = lambda x: x[0]*np.sin(x[1]*t+x[2]) + x[3] - np.array(train['Butler Trail - Crenshaw Bridge PC Urban Trail'])
est_amp, est_freq, est_phase, est_mean = leastsq(optimize_func, [guess_amp, guess_freq, guess_phase, guess_mean])[0]

# recreate the fitted curve using the optimized parameters
data_fit = est_amp*np.sin(est_freq*t+est_phase) + est_mean

In [37]:
def sine_scoring(train_data, test_data, forecast_len=7):
    predictions_list = []
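    # Time indices for the first window. Note: these are the last `forecast_len` training
    # indices, so each window is scored against test values `forecast_len` steps later.
    t = np.array([i for i in range(len(train_data) - forecast_len, len(train_data))])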
    predictions = est_amp * np.sin(est_freq * t + est_phase) + est_mean
    predictions_list.append(predictions)
#     data = train_data.append(test_data)
    for i in range(len(test_data) - forecast_len):
        t += 1
        predictions = est_amp * np.sin(est_freq * t + est_phase) + est_mean
        predictions_list.append(predictions)
    # calculate MAE for each forecast horizon (use the test_data argument, not the global `test`)
    test_weeks = rolling_window(test_data['Butler Trail - Crenshaw Bridge PC Urban Trail'], 7)
    predictions_array = np.array(predictions_list)
    scores = []
    for i in range(test_weeks.shape[1]):
        mae = mean_absolute_error(test_weeks[:, i],
                                  predictions_array[:, i])
        scores.append(mae)
    # calculate overall MAE
    s = 0
    for row in range(test_weeks.shape[0]):
        for col in range(test_weeks.shape[1]):
            s += np.abs((test_weeks[row, col] - predictions_array[row, col]))
    score = (s / (test_weeks.shape[0] * test_weeks.shape[1]))
    return score, scores, predictions_list

In [38]:
sine_score, sine_scores, sine_pred_list = sine_scoring(train, test)


Series.strides is deprecated and will be removed in a future version



In [39]:
print('Sine Wave Model: [%.3f] %s' % (sine_score, sine_scores))


t = np.array([i for i in range(7)])


# forecast of final week of data compared to actual final week
test_trace = go.Scatter(
    x=test.index[-7:],
    y=test['Butler Trail - Crenshaw Bridge PC Urban Trail'][-7:],
    name='Actual Data'
)

forecast_trace = go.Scatter(
    x=test.index[-7:], 
    y=sine_pred_list[-1],
    name='Forecasted Data'
)


data = [
    test_trace, 
    forecast_trace
]

layout = go.Layout(
    xaxis=dict(
        tickvals=test.index[-7:],
#         ticktext=days,
    ), 
    yaxis=dict(
        title='Count',
        rangemode='tozero'
    ), 
    title='Naive Sine Wave Model Forecast'
)

fig = go.Figure(data=data, layout=layout)
iplot(fig)

# error plot
mae_trace = go.Scatter(
    x=days, 
    y=sine_scores, 
    name='Naive Sine Wave'
)

data = [mae_trace]

layout = go.Layout(
    xaxis=dict(
        tickvals=days,
        ticktext=days,
        title='Days Forecast',
    ), 
    yaxis=dict(
        title='Mean Absolute Error',
        rangemode='tozero'
    ), 
    title='Naive Sine Wave Model Performance'
)


fig = go.Figure(data=data, layout=layout)
iplot(fig)

Sine Wave Model: [2614.412] [2618.1863652731195, 2607.1838342825067, 2614.532733990534, 2615.176415223934, 2617.3159293804865, 2615.2148873259703, 2613.2766160373735]
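
Least-squares sine fits tend to be sensitive to the starting frequency, which may be part of why this model scores so much worse than the mean model. The sketch below is an illustration on synthetic data, not a refit of the trail counts: it seeds the frequency with a weekly period (2π/7 radians per day) and uses `scipy.optimize.curve_fit`. The function `sine_model` and all the numbers are assumptions chosen for the example.

```python
import numpy as np
from scipy.optimize import curve_fit

def sine_model(t, amp, freq, phase, offset):
    """Single sinusoid with a constant offset."""
    return amp * np.sin(freq * t + phase) + offset

# Synthetic daily series with a weekly cycle plus noise (hypothetical values).
rng = np.random.default_rng(0)
t = np.arange(140)
weekly_freq = 2 * np.pi / 7
y = sine_model(t, 1000.0, weekly_freq, 0.5, 4000.0) + rng.normal(0, 50, size=t.size)

# Initial guesses: amplitude from the spread, frequency from the assumed weekly period.
p0 = [y.std() * np.sqrt(2), weekly_freq, 0.0, y.mean()]
params, _ = curve_fit(sine_model, t, y, p0=p0)
fit_amp, fit_freq, fit_phase, fit_offset = params

print(fit_amp, fit_freq, fit_phase, fit_offset)
```

Whether a weekly seed helps on the real counts is untested here; a single sinusoid also cannot capture both the weekly and the seasonal pattern at once.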
