## Definition

### Problem Statement  

The goal is to forecast bike demand as a function of weather conditions, such as outside temperature, and calendar information, such as holidays. This information and the demand history are provided in a dataset covering two years of daily records.  
The demand is given as the total daily demand and as a split between registered and casual users. To improve prediction quality, registered and casual demand will be predicted separately in a second step.  
To make predictions, machine learning is used to train regressors. The scikit-learn algorithm map recommends a support vector regressor (SVR) for this kind of problem and this amount of data. In addition, a deep neural network (DNN) regressor is trained for comparison. Grid search and randomized search are used to find the hyperparameters for these regressors. Because the dataset is small, cross-validation is applied.  

> http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html  
> http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html#sklearn.svm.SVR  
> https://github.com/tensorflow/skflow/blob/master/g3doc/api_docs/python/estimators.md  
> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html  
> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html
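
Cross-validation matters here because two years of daily data amount to only about 730 rows. As a rough illustration (not the notebook's own code), the sketch below splits row indices into three folds the way a plain, unshuffled k-fold scheme would. `kfold_indices` is a hypothetical helper name, and 731 is a stand-in for the number of rows.

```python
import numpy as np

def kfold_indices(n_samples, n_folds=3):
    """Yield (train_idx, valid_idx) pairs for an unshuffled k-fold split."""
    indices = np.arange(n_samples)
    # np.array_split tolerates n_samples not being divisible by n_folds
    folds = np.array_split(indices, n_folds)
    for k in range(n_folds):
        valid_idx = folds[k]
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train_idx, valid_idx

# 731 is a stand-in for roughly two years of daily rows
for train_idx, valid_idx in kfold_indices(731, n_folds=3):
    print(len(train_idx), len(valid_idx))
```

Every row is used for validation exactly once, so the score estimate draws on the whole dataset instead of a single small hold-out set.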

In [1]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import calendar

from sklearn.svm import SVR
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV  # formerly sklearn.grid_search
from sklearn.metrics import r2_score, mean_squared_error
from scipy.stats import randint as sp_randint
from scipy.stats import uniform as sp_uniform
from math import sqrt



## Analysis

In [2]:
# Fetching Dataset

bike_data = pd.read_csv("day.csv", header=0)

print("Data read successfully!")

Data read successfully!


In [3]:
bike_data.head()

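|   | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2011-01-01 | 1 | 0 | 1 | 0 | 6 | 0 | 2 | 0.344167 | 0.363625 | 0.805833 | 0.160446 | 331 | 654 | 985 |
| 1 | 2 | 2011-01-02 | 1 | 0 | 1 | 0 | 0 | 0 | 2 | 0.363478 | 0.353739 | 0.696087 | 0.248539 | 131 | 670 | 801 |
| 2 | 3 | 2011-01-03 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0.196364 | 0.189405 | 0.437273 | 0.248309 | 120 | 1229 | 1349 |
| 3 | 4 | 2011-01-04 | 1 | 0 | 1 | 0 | 2 | 1 | 1 | 0.2 | 0.212122 | 0.590435 | 0.160296 | 108 | 1454 | 1562 |
| 4 | 5 | 2011-01-05 | 1 | 0 | 1 | 0 | 3 | 1 | 1 | 0.226957 | 0.22927 | 0.436957 | 0.1869 | 82 | 1518 | 1600 |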


### Data Exploration

In [4]:
# Extracting

feature_cols = bike_data.columns[:-3]  # all columns except the last three (casual, registered, cnt) are features
target_col = bike_data.columns[-1]  # last column is the target

print ("Feature column(s):\n{}\n".format(feature_cols))
print ("Target column:\n{}".format(target_col))

Feature column(s):
Index(['instant', 'dteday', 'season', 'yr', 'mnth', 'holiday', 'weekday',
       'workingday', 'weathersit', 'temp', 'atemp', 'hum', 'windspeed'],
      dtype='object')

Target column:
cnt


#### Function to Calculate Profit

In [6]:
def profit(y, y_cap):
    # 3 per unit of demand actually met, minus 2 per unit provided (y_cap)
    return 3 * np.minimum(y, y_cap) - 2 * y_cap
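
One way to read this function (an interpretation, the notebook does not spell it out) is that `y` is the actual demand and `y_cap` the forecast number of bikes provided: each bike that is actually used earns 3 units, and each bike provided costs 2 units. The sketch below works through one under-forecast, one over-forecast, and one exact forecast with made-up numbers.

```python
import numpy as np

def profit(y, y_cap):
    # 3 per unit of demand actually met, minus 2 per unit provided
    return 3 * np.minimum(y, y_cap) - 2 * y_cap

demand = np.array([100, 50, 80])    # illustrative values, not from day.csv
forecast = np.array([80, 70, 80])   # under-, over-, and exact forecast

per_day = profit(demand, forecast)
print(per_day)        # [80 10 80]
print(per_day.sum())  # 170
```

Under this reading, over-forecasting is penalized more heavily than under-forecasting, because every surplus bike costs 2 units and earns nothing.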
    

#### Function to Convert from percentage to Actual Prediction

In [7]:
def convertToPrediction(data, percentage_predictions):
    return np.around(data + (np.multiply(data, percentage_predictions) / 100))
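
As a quick check of the arithmetic, the sketch below applies the conversion to made-up base values and percentage changes. The numbers are illustrative only.

```python
import numpy as np

def convertToPrediction(data, percentage_predictions):
    # base value plus the predicted percentage change, rounded to whole bikes
    return np.around(data + (np.multiply(data, percentage_predictions) / 100))

base = np.array([1000, 2000, 1500])   # illustrative base demand
pct = np.array([10.0, -5.0, 0.0])     # illustrative predicted change in percent

print(convertToPrediction(base, pct))  # [1100. 1900. 1500.]
```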

### Base Model

#### For the base model, the forecast for each day is the demand observed two days earlier (the cells below prepend rows 363 and 364 and drop the last two).

In [8]:
y_actual = bike_data[target_col][365:731]  # corresponding targets
y_actual = y_actual.reset_index(drop = True)

In [9]:
y_staged = y_actual.copy()

In [10]:
data = []
data.insert(0, bike_data[target_col][364])
data.insert(0, bike_data[target_col][363])

In [11]:
y_predicted_df = pd.concat([pd.DataFrame(data), y_staged], ignore_index=True)

In [12]:
y_predicted_df.drop(y_predicted_df.tail(2).index,inplace=True)

In [13]:
y_predicted = y_predicted_df[0]
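
The cells above prepend the counts of rows 363 and 364 and drop the last two rows, so the forecast paired with row `t` is the count of row `t - 2`. A minimal sketch of the same alignment with plain slices follows; `cnt` is a synthetic stand-in whose value equals its row index, which makes the offset easy to see.

```python
import numpy as np

cnt = np.arange(731)          # synthetic stand-in: value equals row index
y_actual = cnt[365:731]       # second-year targets, as in the notebook
y_lagged = cnt[363:729]       # same length, shifted back by two rows

assert len(y_actual) == len(y_lagged) == 366
print(y_actual[:3], y_lagged[:3])  # [365 366 367] [363 364 365]
```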


##### Calculate Base Model Profit

In [14]:
print(profit(y_actual,y_predicted).sum())

1442972


### Algorithms and Techniques

In [15]:
X_raw_train = pd.read_csv("train.csv", header=0)
X_raw_test  = pd.read_csv("test.csv", header=0)

In [16]:
cols = ["temp","hum", "windspeed" ,"cnt_normal","week_moving_avg_normal","season_1","season_2","season_3","season_4","mnth_1","mnth_2","mnth_3","mnth_4","mnth_5","mnth_6","mnth_7","mnth_8","mnth_9","mnth_10","mnth_11","mnth_12","holiday_1","holiday_2","weekday_1","weekday_2","weekday_3","weekday_4","weekday_5","weekday_6","weekday_7","workingday_1","workingday_2","weathersit_1","weathersit_2","weathersit_3"]

In [17]:
cols = ["atemp","hum", "windspeed" ,"cnt_normal","week_moving_avg_normal","season_1","season_2","season_3","season_4","mnth_1","mnth_2","mnth_3","mnth_4","mnth_5","mnth_6","mnth_7","mnth_8","mnth_9","mnth_10","mnth_11","mnth_12","holiday_1","holiday_2","weekday_1","weekday_2","weekday_3","weekday_4","weekday_5","weekday_6","weekday_7","workingday_1","workingday_2","weathersit_1","weathersit_2","weathersit_3"]

In [18]:
X_train = X_raw_train[cols].values.tolist()
y_train_df = X_raw_train[['target']]
y_train = y_train_df['target'].tolist()

In [19]:
X_test = X_raw_test[cols].values.tolist()
y_test_df = X_raw_test[['target']]
y_test = y_test_df['target'].tolist()

#### Alternate dataset with percentage change

In [20]:
data = pd.read_csv("processed_Data.csv", header=0)
X_raw_train = data[0:359]
X_raw_test  = data[359:]

In [21]:
cols =["temp","atemp","hum", "windspeed" ,
       "atemp__1","atemp__2","atemp__3","atemp__4","atemp__5",
       "cnt__1","cnt__2","cnt__3","cnt__4","cnt__5",
       "holiday__1","holiday__2","holiday__3","holiday__4","holiday__5",
       "hum__1","hum__2","hum__3","hum__4","hum__5",
       "season__1","season__2","season__3","season__4","season__5",
       "temp__1","temp__2","temp__3","temp__4","temp__5",
       "weathersit__1","weathersit__2","weathersit__3","weathersit__4","weathersit__5",
       "weekday__1","weekday__2","weekday__3","weekday__4","weekday__5",
       "windspeed__1","windspeed__2","windspeed__3","windspeed__4","windspeed__5",
       "workingday__1","workingday__2","workingday__3","workingday__4","workingday__5",
       "moving_avg_weekly_cnt"]

In [128]:
cols =["temp","atemp","hum", "windspeed" ,
       "holiday__1","holiday__2","holiday__3","holiday__4","holiday__5",
       "season__1","season__2","season__3","season__4","season__5",
       "temp__1","temp__2","temp__3","temp__4","temp__5",
       "weathersit__1","weathersit__2","weathersit__3","weathersit__4","weathersit__5",
       "weekday__1","weekday__2","weekday__3","weekday__4","weekday__5",
       "workingday__1","workingday__2","workingday__3","workingday__4","workingday__5",
       "moving_avg_weekly_cnt"]
       
       

In [34]:
#Nandana's Variables
#Remove same day variables like "temp","atemp","hum", "windspeed"
#Also remove all variables except lag1 variables
cols =[
       "atemp__1",
       "cnt__1",
       "holiday__1",
       "hum__1",
       "season__1",
       "temp__1",
       "weathersit__1",
       "weekday__1",
       "windspeed__1",
       "workingday__1",
       "moving_avg_weekly_cnt"]

In [23]:
X_train = X_raw_train[cols].values.tolist()
y_train_df = X_raw_train[['demand_pc_inc']]
y_train = y_train_df['demand_pc_inc'].tolist()

In [24]:
X_test = X_raw_test[cols].values.tolist()
y_test_df = X_raw_test[['demand_pc_inc']]
y_test = y_test_df['demand_pc_inc'].tolist()
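
How `processed_Data.csv` was built is not shown in this notebook. Assuming that a column named `<name>__<k>` holds the value of `<name>` from `k` rows earlier (the "lag1" comment above suggests this), lagged columns of that shape could be produced roughly as follows. `add_lags` is a hypothetical helper, and the `temp` values are rounded for brevity.

```python
import pandas as pd

def add_lags(df, columns, n_lags):
    """Return a copy of df with `<col>__<k>` columns holding values k rows earlier."""
    out = df.copy()
    for col in columns:
        for k in range(1, n_lags + 1):
            out["{}__{}".format(col, k)] = out[col].shift(k)
    # the first n_lags rows have no complete history, so they are dropped
    return out.dropna().reset_index(drop=True)

toy = pd.DataFrame({"cnt": [985, 801, 1349, 1562, 1600],
                    "temp": [0.34, 0.36, 0.20, 0.20, 0.23]})
lagged = add_lags(toy, ["cnt", "temp"], n_lags=2)
print(lagged[["cnt", "cnt__1", "cnt__2"]])
```

Dropping the rows without a complete history would also explain why the processed file has fewer rows than `day.csv`, although that is an inference and not something the notebook states.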

In [25]:
#Nandana - changed data[['cnt']] to data['cnt'] 
data_cnt = data['cnt']

In [26]:
actual_predictions = data_cnt[359:].values

In [27]:
y_for_calculations = data_cnt[357:723].values
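
The slices `data_cnt[359:]` and `data_cnt[357:723]` have the same length and are offset by two rows. This suggests (an inference, not stated in the notebook) that `demand_pc_inc` is the percentage change relative to the demand two rows earlier. Under that assumption, the sketch below checks that converting a percentage change back with the same arithmetic as `convertToPrediction` recovers the original counts. It uses the first five `cnt` values shown in the `head()` output.

```python
import numpy as np

cnt = np.array([985, 801, 1349, 1562, 1600], dtype=float)  # first counts from day.csv
lag = 2  # assumed offset, inferred from the slice bounds

base = cnt[:-lag]     # plays the role of y_for_calculations
actual = cnt[lag:]    # plays the role of actual_predictions
pct_change = (actual - base) / base * 100

# same arithmetic as convertToPrediction
reconstructed = np.around(base + base * pct_change / 100)
assert np.array_equal(reconstructed, actual)
```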

### Benchmark

In [28]:
# Training SVR
svr = SVR()
svr.fit(X_train, y_train)

SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False)

In [29]:
# Validation SVR

svr_pred = svr.predict(X_test)
score_svr = r2_score(y_test, svr_pred)
rmse_svr = sqrt(mean_squared_error(y_test, svr_pred))

print("Score SVR: %f" % score_svr)
print("RMSE SVR: %f" % rmse_svr)

Score SVR: -0.003305
RMSE SVR: 1318.397273


In [30]:
model_predictions = convertToPrediction(y_for_calculations,svr_pred)

In [33]:
print(profit(actual_predictions,model_predictions).sum())

1438252.0


## Methodology

### Implementation

Randomized search with cross-validation is used to locate the promising region of the parameter space. Grid search is then used to fine-tune the regressors' parameter values within that region.

> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html  
> http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.RandomizedSearchCV.html
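
The two-stage idea can be sketched as follows. This is a minimal example on synthetic data, not the notebook's tuning run: the data, the range for `C`, and the grid around the best coarse value are all illustrative assumptions. It imports from `sklearn.model_selection`, which replaced the `sklearn.grid_search` module referenced above in newer scikit-learn releases.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(-1, 1, size=(90, 3))                      # synthetic features
y = 50 * X[:, 0] - 20 * X[:, 1] + rng.normal(0, 1, 90)    # synthetic target

# Stage 1: coarse randomized search; uniform(loc, scale) covers [loc, loc + scale]
coarse = RandomizedSearchCV(SVR(kernel="rbf"),
                            param_distributions={"C": uniform(1, 1000)},
                            n_iter=8, cv=3, scoring="r2", random_state=0)
coarse.fit(X, y)
c_best = coarse.best_params_["C"]

# Stage 2: fine grid around the best coarse value
candidates = [c_best / 2, c_best, c_best * 2]
fine = GridSearchCV(SVR(kernel="rbf"), param_grid={"C": candidates},
                    cv=3, scoring="r2")
fine.fit(X, y)
print(fine.best_params_, fine.best_score_)
```

Passing `cv=3` explicitly keeps the number of folds fixed regardless of the library version's default.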

In [35]:
# Tuning SVR with GridSearch

tuned_parameters = [{'C': [1000, 3000, 10000], 
                     'kernel': ['linear', 'rbf']}
                   ]

# MSE-optimized variant (newer scikit-learn releases name this scorer 'neg_mean_squared_error'):
#svr_tuned = GridSearchCV(SVR(C=1), param_grid=tuned_parameters, scoring='mean_squared_error')
# R^2-optimized; default 3-fold cross-validation in this scikit-learn version (newer releases default to 5-fold)
svr_tuned_GS = GridSearchCV(SVR(C=1), param_grid=tuned_parameters, scoring='r2', n_jobs=-1)

svr_tuned_GS.fit(X_train, y_train)

print (svr_tuned_GS)
print ('\n' "Best parameter from grid search: " + str(svr_tuned_GS.best_params_) +'\n')

GridSearchCV(cv=None, error_score='raise',
       estimator=SVR(C=1, cache_size=200, coef0=0.0, degree=3, epsilon=0.1, gamma='auto',
  kernel='rbf', max_iter=-1, shrinking=True, tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid=[{'C': [1000, 3000, 10000], 'kernel': ['linear', 'rbf']}],
       pre_dispatch='2*n_jobs', refit=True, scoring='r2', verbose=0)

Best parameter from grid search: {'C': 1000, 'kernel': 'rbf'}



In [36]:
# Validation - SVR tuned 

svr_tuned_pred_GS = svr_tuned_GS.predict(X_test)

score_svr_tuned_GS = r2_score(y_test, svr_tuned_pred_GS)
rmse_svr_tuned_GS = sqrt(mean_squared_error(y_test, svr_tuned_pred_GS))

print("SVR Results\n")

#print("Score SVR: %f" % score_svr)
print("Score SVR tuned GS: %f" % score_svr_tuned_GS)

#print("\nRMSE SVR: %f" % rmse_svr)
print("RMSE SVR tuned GS: %f" % rmse_svr_tuned_GS)

SVR Results

Score SVR tuned GS: -0.002542
RMSE SVR tuned GS: 1317.895455


In [37]:
# Change - Nandana
# Profit calculation for the percentage-change approach
model_predictions = convertToPrediction(y_for_calculations,svr_tuned_pred_GS)
print(profit(actual_predictions,model_predictions).sum())

# Profit is only about 1.27 million

1269631.0


In [58]:
print(profit(y_test,svr_tuned_pred_GS).sum())

1348989.3817235236


In [65]:
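# SVR tuned with RandomizedSearchCV
# may take a while!

# Parameters
param_dist = {  'C': sp_uniform (1000, 10000),  # uniform(loc, scale): samples C from [1000, 11000]
                'kernel': ['rbf']
             }

n_iter_search = 1  # only one candidate is sampled; increase for an actual search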

# MSE optimized
#SVR_tuned_RS = RandomizedSearchCV(SVR (C=1), param_distributions = param_dist, scoring = 'mean_squared_error', n_iter=n_iter_search)

# R^2 optimized
SVR_tuned_RS = RandomizedSearchCV(SVR (C=1), param_distributions = param_dist, scoring = 'r2', n_iter=n_iter_search)

# Fit
SVR_tuned_RS.fit(X_train, y_train)

# Best score and corresponding parameters.
print('best CV score from randomized search: {0:f}'.format(SVR_tuned_RS.best_score_))
print('corresponding parameters: {}'.format(SVR_tuned_RS.best_params_))

# Predict and score
predict = SVR_tuned_RS.predict(X_test)

score_svr_tuned_RS = r2_score(y_test, predict)
rmse_svr_tuned_RS = sqrt(mean_squared_error(y_test, predict))

best CV score from randomized search: -0.837227
corresponding parameters: {'C': 5806.563127646195, 'kernel': 'rbf'}


In [66]:
print('SVR Results\n')

print("Score SVR: %f" % score_svr)
print("Score SVR tuned GS: %f" % score_svr_tuned_GS)
print("Score SVR tuned RS: %f" % score_svr_tuned_RS)

print("\nRMSE SVR: %f" % rmse_svr)
print("RMSE SVR tuned GS: %f" % rmse_svr_tuned_GS)
print("RMSE SVR tuned RS: %f" % rmse_svr_tuned_RS)

SVR Results

Score SVR: -1.057082
Score SVR tuned GS: -0.298199
Score SVR tuned RS: -0.220097

RMSE SVR: 2561.895499
RMSE SVR tuned GS: 2035.196203
RMSE SVR tuned RS: 1973.025704


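Tuning improves the SVR: R^2 rises from -1.06 (default) to -0.30 (grid search) and -0.22 (randomized search), and the RMSE falls from about 2562 to 2035 and 1973. All R^2 values are still negative, however, so each model fits the test data worse than a constant prediction of the test mean would.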
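
To make the negative scores printed above easier to interpret, the sketch below computes R^2 and RMSE by hand on made-up values. Predicting the mean gives R^2 = 0, and anything worse than that goes negative. The helper names `r2` and `rmse` are illustrative; the notebook itself uses `r2_score` and `mean_squared_error`.

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1 - ss_res / ss_tot

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

y_true = np.array([1.0, 2.0, 3.0, 4.0])            # illustrative values
mean_pred = np.full_like(y_true, y_true.mean())    # constant prediction of the mean
bad_pred = y_true[::-1]                            # deliberately poor prediction

print(r2(y_true, mean_pred))   # 0.0
print(r2(y_true, bad_pred))    # -3.0
print(rmse(y_true, bad_pred))
```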

In [67]:
print(profit(y_test,predict).sum())

1366504.424865197


### DNN Regressor

In [58]:
from sklearn.neural_network import MLPRegressor

In [59]:
import logging
from concurrent.futures import ThreadPoolExecutor, wait
from time import time
from typing import List

In [60]:
bike_model = MLPRegressor(hidden_layer_sizes=(10,20),
                                       activation='relu',
                                       solver='adam',
                                       learning_rate='adaptive',
                                       max_iter=10000,
                                       learning_rate_init=0.01,
                                       alpha=0.01)

In [61]:
start_time = int(time() * 1000)
bike_model.fit(X_train, y_train)
end_time = int(time() * 1000)
logging.debug('Finished training universal model')
logging.debug('Training took {} ms'.format(end_time - start_time)) 

In [62]:
predict = bike_model.predict(X_test)

In [63]:
predict

array([19.925289  , 22.00379558, 23.60938533, 23.22590156, 23.66675084,
       22.31659909, 23.75618579, 27.17276292, 31.90670065, 33.68316357,
       33.95392255, 35.19934529, 31.45097422, 30.59433766, 30.13415323,
       30.19660644, 27.16725898, 27.66936709, 26.14922341, 26.31709331,
       27.87693451, 29.32337225, 27.16925896, 24.7749419 , 23.3316477 ,
       26.07770522, 28.1026324 , 33.41624504, 36.07090639, 38.9842279 ,
       36.41829123, 35.50879304, 36.95006685, 38.93333522, 38.38345738,
       39.98427919, 38.35590939, 34.78239058, 33.67852122, 35.5024423 ,
       32.76172496, 34.61109878, 36.20540654, 32.83124582, 26.87785469,
       28.52593341, 29.30330569, 29.79384779, 31.32446197, 36.2594723 ,
       37.93538385, 34.93187274, 33.19646797, 35.52526001, 36.60442614,
       37.99753948, 39.28210918, 38.25839189, 37.11455999, 36.55559288,
       35.83843007, 32.43721168, 36.77307369, 36.33811625, 35.62399074,
       33.3871925 , 36.63942894, 35.24376458, 38.44643101, 40.93

In [120]:
print(profit(y_test,predict).sum())

1331736.7111305392


### GLM

In [3]:
%matplotlib inline

from __future__ import print_function
import numpy as np
import statsmodels.api as sm
from scipy import stats
from matplotlib import pyplot as plt

#### Percentage change over the previous day

In [4]:
X_new_train = pd.read_csv("processed_Data.csv", header=0)

In [6]:
X_new_train.shape

(725, 73)

In [8]:
X_new_train.head(1)

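|   | instant | dteday | season | yr | mnth | holiday | weekday | workingday | weathersit | temp | ... | windspeed__3 | windspeed__4 | windspeed__5 | workingday__1 | workingday__2 | workingday__3 | workingday__4 | workingday__5 | moving_avg_weekly_cnt | demand_pc_inc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | 2011-01-07 | 1 | 0 | 1 | 0 | 5 | 1 | 2 | 0.196522 | ... | 0.248309 | 0.248539 | 0.160446 | 1 | 1 | 1 | 0 | 0 | 1259 | -5.625 |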
