# BCycle Austin Full / Empty models

This notebook concludes the BCycle Austin series of blog posts, and looks at how machine learning could be used to help the BCycle team. I'll be using weather data in addition to the station and bike information, and building models which I hope might be useful. Let's get started !

## Imports and data loading

Before getting started, let's import some useful libraries for visualization, and the bcycle utils library.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import folium
import seaborn as sns

import datetime

from bcycle_lib.utils import *

%matplotlib inline
plt.rc('xtick', labelsize=14) 
plt.rc('ytick', labelsize=14) 

# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2

## Station full/empty prediction as classification problem

First of all, treat stations being full or empty as a classification problem. First build a model which can predict if a station will run out of bikes given the time and weather.


In [7]:
bikes_df = load_bikes()
bike_trips_df = load_bike_trips()
stations_df = load_stations()
weather_df = load_weather()

In [23]:
# Create a set of categories for the model to predict
bike_model_df = bikes_df.copy()

# Create empty, full, and ok indicators
bike_model_df['empty'] = bike_model_df['bikes'] == 0
bike_model_df['full'] = bike_model_df['docks'] == 0
bike_model_df['ok'] = ~bike_model_df['full'] & ~bike_model_df['empty']

# Now use the datetime and extract the day of week, hour of day, and minute
bike_model_df['dayofweek'] = bike_model_df['datetime'].dt.dayofweek
bike_model_df['hour'] = bike_model_df['datetime'].dt.hour
bike_model_df['minute'] = bike_model_df['datetime'].dt.minute
bike_model_df = bike_model_df.set_index('datetime')
bike_model_df.head()

Unnamed: 0_level_0,station_id,bikes,docks,empty,full,ok,dayofweek,hour,minute
datetime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2016-04-01,1,6,7,False,False,True,4,0,0
2016-04-01,2,5,8,False,False,True,4,0,0
2016-04-01,3,4,9,False,False,True,4,0,0
2016-04-01,4,7,11,False,False,True,4,0,0
2016-04-01,5,8,11,False,False,True,4,0,0


In [50]:
# Create date ranges
TRAIN_START = '2016-04-01'
TRAIN_END   = '2016-05-15'
VAL_START   = '2016-05-16'
VAL_END     = '2016-05-31'

VALID_STATIONS = range(1,49)

print('\nTraining data first and last row:\n{}\n{}'.format(train_df.iloc[0], train_df.iloc[-1]))
print('\nValidation data first and last row:\n{}\n{}'.format(val_df.iloc[0], val_df.iloc[-1]))


Training data first and last row:
station_id        1
bikes             6
docks             7
empty         False
full          False
ok             True
dayofweek         4
hour              0
minute            0
Name: 2016-04-01 00:00:00, dtype: object
station_id       48
bikes             9
docks             2
empty         False
full          False
ok             True
dayofweek         6
hour              0
minute            0
Name: 2016-05-15 00:00:00, dtype: object

Validation data first and last row:
station_id        1
bikes            11
docks             2
empty         False
full          False
ok             True
dayofweek         0
hour              0
minute            0
Name: 2016-05-16 00:00:00, dtype: object
station_id       49
bikes             7
docks             2
empty         False
full          False
ok             True
dayofweek         1
hour              0
minute            0
Name: 2016-05-31 00:00:00, dtype: object


In [117]:
stations = dict()

for station in VALID_STATIONS:
    station_df = bike_model_df[bike_model_df['station_id'] == station]
#     station_df = station_df[['dayofweek', 'hour', 'minute', 'ok']]    
    stations[station] = station_df
    
print(stations[1].head(1))
print(stations[1].tail(1))

            station_id  bikes  docks  empty   full    ok  dayofweek  hour  \
date                                                                        
2016-04-01           1      6      7  False  False  True          4     0   

            minute  
date                
2016-04-01       0  
                     station_id  bikes  docks  empty   full    ok  dayofweek  \
date                                                                           
2016-05-31 23:55:00           1     11      2  False  False  True          1   

                     hour  minute  
date                               
2016-05-31 23:55:00    23      55  


In [140]:
# Simple baseline - predict most common case


common_scores = dict()

for station_id, station_df in stations.items():
#     print(station_id)

    station_train_df = station_df.loc[TRAIN_START:TRAIN_END,:]
    station_val_df = station_df.loc[VAL_START:VAL_END,:]
    n_train = station_train_df.shape[0]
    n_val = station_val_df.shape[0]
    assert n_train + n_val == station_df.shape[0]

    pred = station_df[['empty', 'full', 'ok']].sum().idxmax()
    train_acc = station_train_df[pred].sum() / float(n_train)
    print('Train: Station {}, predicting {}, accuracy = {:.2f}'.format(station_id, pred, train_acc))
    
    val_acc = station_val_df[pred].sum() / float(n_val)
    print('Validation: Station {}, accuracy = {:.2f}'.format(station_id, val_acc))
    
#     acc = train_array[max_index] / n_rows

#     assert total_empty + total_full + total_ok == n_rows
#     print('Station ID {}, total_empty {}, total full {}, total ok {}'.format(station_id, total_empty, total_full, total_ok))
#     print('Station ID {}, max_index {} - {}'.format(station_id, total_array, max_index))
#     print('Station ID {}, accuracy {}'.format(station_id, acc))

    
    #     lbe = LabelBinarizer()
#     dayofweek_ohe = lbe.fit_transform(station_df['dayofweek'])
#     X = np.hstack((station_df[['dayofweek', 'hour', 'minute']].values, dayofweek_ohe))
#     y = station_df['ok'].astype(np.bool)

#     train_stations[station_id] = (X_train, y_train)
#     val_stations[station_id] = (X_val, y_val)

Train: Station 1, predicting ok, accuracy = 0.89
Validation: Station 1, accuracy = 0.90
Train: Station 2, predicting ok, accuracy = 0.86
Validation: Station 2, accuracy = 0.85
Train: Station 3, predicting ok, accuracy = 0.97
Validation: Station 3, accuracy = 0.94
Train: Station 4, predicting ok, accuracy = 0.98
Validation: Station 4, accuracy = 0.90
Train: Station 5, predicting ok, accuracy = 0.95
Validation: Station 5, accuracy = 0.95
Train: Station 6, predicting ok, accuracy = 0.93
Validation: Station 6, accuracy = 0.94
Train: Station 7, predicting ok, accuracy = 0.81
Validation: Station 7, accuracy = 0.88
Train: Station 8, predicting ok, accuracy = 0.84
Validation: Station 8, accuracy = 0.83
Train: Station 9, predicting ok, accuracy = 0.96
Validation: Station 9, accuracy = 0.92
Train: Station 10, predicting ok, accuracy = 0.89
Validation: Station 10, accuracy = 0.94
Train: Station 11, predicting ok, accuracy = 0.96
Validation: Station 11, accuracy = 0.91
Train: Station 12, predictin

In [142]:
# Now create a logistic regression using the time features
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelBinarizer

# print(stations)
station_scores = dict()

for station_id, (X_train, y_train) in train_stations.items():

    (X_val, y_val) = val_stations[station_id]
    clf = LogisticRegression(class_weight='balanced')
    clf.fit(X_train, y_train)
    # y_train_pred = clf.predict(val_df[['dayofweek', 'hour', 'minute']])
    clf_pred_val_score = clf.score(X_train, y_train)
    clf_pred_train_score = clf.score(X_val, y_val)

    station_scores[station_id] = (clf_pred_val_score, clf_pred_train_score)
    
station_scores
# # y_val_pred = reg.predict(X_val)
# # reg_val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))

# # # Store the evaluation results
# # if 'linreg_time_weather_norm' not in scores_df.index:
# #     scores_df = scores_df.append(pd.DataFrame({'train_rmse' : reg_train_rmse, 'val_rmse' : reg_val_rmse}, 
# #                                               index=['linreg_time_weather_norm']))

# # print('Weather Regression RMSE - Train: {:.2f}, Val: {:.2f}'.format(reg_train_rmse, reg_val_rmse))

# # reg_result_train_df, reg_result_val_df = df_from_results(train_df.index, y_train, y_train_pred,
# #                                                          val_df.index, y_val, y_val_pred)

# # plot_results(reg_result_train_df, reg_result_val_df, 'pred', 'true', title='Linear regression with weather')


{1: (0.61897392246382421, 0.60034980323567988),
 2: (0.53973535556759267, 0.48119807608220377),
 3: (0.74402228584693952, 0.6119370354175776),
 4: (0.65944440145477057, 0.67927415828596416),
 5: (0.59800355954499729, 0.61149978137297767),
 6: (0.55807475044494315, 0.66943594228246617),
 7: (0.51156852124119789, 0.40730214254481856),
 8: (0.61162268823028709, 0.61740271097507649),
 9: (0.5544378240346669, 0.57455181460428506),
 10: (0.62230132322216203, 0.58900701198823791),
 11: (0.63638474038535942, 0.67205946655006554),
 12: (0.58817612009595299, 0.62243113248797555),
 13: (0.59939642497872014, 0.50240489724529946),
 14: (0.64195620212025073, 0.48819414079580237),
 15: (0.74487348138977016, 0.64910362920857023),
 16: (0.619206066702778, 0.52798425885439437),
 17: (0.69952797338079398, 0.72846523830345433),
 18: (0.73249245531223395, 0.68167905553126362),
 19: (0.49330650777683199, 0.48731963270660256),
 20: (0.51536021047744329, 0.50743331875819853),
 21: (0.72862338466300391, 0.6724

In [97]:
stations[1].index.get_loc(VAL_START).start

# stations[1].iloc[12923,:]

12923

In [27]:
clf_train_score

0.9127086007702182

In [96]:
dir(slice)

['__class__',
 '__delattr__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__le__',
 '__lt__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 'indices',
 'start',
 'step',
 'stop']

In [None]:
bike_empty_mask = bike_trips_df['bikes'] == 0
bike_full_mask = bike_trips_df['docks'] == 0
bike_empty_full_mask = bike_empty_mask | bike_full_mask

# bike_trips_df[bike_trips_df['station_id'==1]] # , 'checkouts']

bike_trips_df.loc[bike_trips_df['station_id'] == 37, ['bikes', 'docks']].resample('1H').max().plot.line(figsize=(20,10))
# bike_trips_df = bike_trips_df[bike_empty_full_mask]
# bike_trips_df['empty'] =  bike_trips_df['bikes'] == 0
# bike_trips_df['full'] = bike_trips_df['docks'] == 0
# bike_trips_df.head()

In [None]:
bike_trips_df.loc[bike_trips_df['station_id'] == 13, ['bikes']].plot.hist()

In [None]:
stations_df