**About Me**

This notebook outlines my learning journey. I am fairly new to Data Science and am actively looking for a way to switch careers into the industry. Luckily, my country has an apprenticeship programme that helps people do exactly that. However, during my last application, I was unable to complete the technical assessment in time. It took me a couple of weeks to figure out the solution to the case study, and by then applications had already closed.

During my research, I came across the "Bike Sharing Demand" Kaggle problem. It is almost identical to the case study from my application, so I have chosen to publish it as my next Learning Journey. You can find more details on the original competition page: https://www.kaggle.com/c/bike-sharing-demand/data


**#STEP 1: IMPORTING LIBRARIES AND DATASET**

We start off as usual by importing the relevant libraries and the dataset, then take a brief look at the data.


In [None]:
#STEP 1: IMPORTING LIBRARIES AND DATASET

# Importing the libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.metrics import mean_squared_log_error



In [None]:
# Importing the dataset from Kaggle
traindf = pd.read_csv('../input/bike-sharing-demand/train.csv')
testdf = pd.read_csv('../input/bike-sharing-demand/test.csv')

In [None]:
# Print first 5 rows of traindf 
traindf.head()

In [None]:
# Print first 5 rows of testdf 
testdf.head()

In [None]:
print("Brief look at train set")
traindf.info()

**#STEP 2: EDA AND DATA PRE-PROCESSING**

We can see that the first column, 'datetime', has the 'object' datatype, so we should convert it to a datetime datatype. We can then set it as our dataframe index and perform time series analysis.

In [None]:
traindf['datetime'] = pd.to_datetime(traindf['datetime'])
traindf = traindf.set_index('datetime')

# Creating relevant datetime columns
traindf['year'] = traindf.index.year
traindf['month'] = traindf.index.month
traindf['hour'] = traindf.index.hour

traindf.info()
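
As a side note, the conversion and indexing can also be done while reading the file. The snippet below is a minimal sketch on a tiny made-up CSV (the rows and the `csv_text` name are my own illustration, not the real train.csv), assuming pandas' `parse_dates` and `index_col` arguments to `read_csv`.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv; the values are made up for illustration.
csv_text = """datetime,temp,count
2011-01-01 00:00:00,9.84,16
2011-01-01 01:00:00,9.02,40
2011-07-15 17:00:00,30.34,512
"""

# parse_dates converts the column while reading; index_col makes it the index.
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["datetime"], index_col="datetime")

# A DatetimeIndex exposes the calendar fields directly.
df["year"] = df.index.year
df["month"] = df.index.month
df["hour"] = df.index.hour
print(df)
```

Both routes end with the same 'year', 'month' and 'hour' columns; this one just saves a step when the raw 'datetime' strings are not needed.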

Based on the data description and info provided, we can also see that 'season' and 'weather' are categorical features. This means that we should encode them before moving on to further analysis.

In [None]:
# Encoding categorical data 

traindf['spring'] = (traindf['season']==1)*1
traindf['summer'] = (traindf['season']==2)*1
traindf['fall'] = (traindf['season']==3)*1
traindf['winter'] = (traindf['season']==4)*1

traindf['clear'] = (traindf['weather']==1)*1
traindf['cloudy'] = (traindf['weather']==2)*1
traindf['light_snow'] = (traindf['weather']==3)*1
traindf['heavy_snow'] = (traindf['weather']==4)*1

traindf = traindf.drop(['season'],axis=1)
traindf = traindf.drop(['weather'],axis=1)

traindf.head()
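
For comparison, here is a minimal sketch of the same one-hot encoding done with `pd.get_dummies` instead of one comparison per column. The `toy` dataframe and the two lookup dictionaries are hypothetical stand-ins that I made up to mirror the column names used above.

```python
import pandas as pd

# Hypothetical toy data using the same integer codes as the dataset.
toy = pd.DataFrame({"season": [1, 2, 3, 4, 1], "weather": [1, 2, 3, 1, 4]})

# Lookup tables mirroring the column names chosen in this notebook.
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}
weather_names = {1: "clear", 2: "cloudy", 3: "light_snow", 4: "heavy_snow"}

# Map the codes to labels, then let pandas create one 0/1 column per label.
season_dummies = pd.get_dummies(toy["season"].map(season_names), dtype=int)
weather_dummies = pd.get_dummies(toy["weather"].map(weather_names), dtype=int)

encoded = pd.concat([season_dummies, weather_dummies], axis=1)
print(encoded)
```

The manual approach in the cell above is easier to read for a handful of categories; `get_dummies` scales better when a feature has many levels.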

There will be further data processing that needs to be done, but let's plot a correlation matrix to find out more about the relationships between these features. We can do this by using the Seaborn heatmap.

In [None]:
# Checking for Correlation
cor = traindf.corr()
sns.set(font_scale=1.25)
f, ax = plt.subplots(figsize=(15, 15))
sns.heatmap(cor, cmap="YlGnBu", annot=True, fmt='.2f', square =True, cbar=False);


Based on the above heatmap as well as the feature details provided in the case study, we can make the following assumptions:
* 'holiday' is redundant, as it does not provide any information that 'workingday' does not already give us
* 'atemp' is redundant, as it does not provide any information that 'temp' does not already give us
* 'humidity' has an inverse relationship with 'count' - we infer that the higher the humidity, the less comfortable it is for people to ride bikes
* 'casual' and 'registered' add up to 'count' - as our target variable is 'count', we can drop the former 2 columns
* 'summer', 'winter', 'cloudy' and 'heavy_snow' do not appear to have a strong influence on the 'count' variable
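
Assumptions like these can also be checked numerically rather than read off the heatmap. The following is a minimal sketch on synthetic data (the `toy` dataframe, the 0.9 `threshold` and the random values are all my own assumptions, not taken from the competition data): it confirms that two parts sum to the target and lists feature pairs that are almost copies of each other.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the real dataset is not used here.
rng = np.random.default_rng(0)
n = 500
temp = rng.uniform(0, 40, n)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp + rng.normal(0, 1, n),  # nearly a copy of 'temp'
    "humidity": rng.uniform(20, 100, n),
    "casual": rng.integers(0, 50, n),
    "registered": rng.integers(0, 200, n),
})
toy["count"] = toy["casual"] + toy["registered"]

# Check 1: do the two parts reproduce the target exactly?
parts_sum_to_count = bool((toy["casual"] + toy["registered"] == toy["count"]).all())

# Check 2: list feature pairs whose absolute correlation exceeds a threshold.
features = toy.drop(columns=["count", "casual", "registered"])
corr = features.corr().abs()
threshold = 0.9  # arbitrary cut-off chosen for this illustration
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > threshold
]
print(parts_sum_to_count, pairs)
```

On the real dataframe, the same two checks would be run before the columns are dropped.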

So let's proceed to drop the irrelevant columns, to make our model more robust with fewer features. Let's also take the chance to check for missing data among our remaining features.

In [None]:
# Dropping redundant columns
cols_to_drop = ['holiday', 'atemp', 'summer', 'winter','cloudy','heavy_snow','casual','registered']
traindf = traindf.drop(cols_to_drop, axis=1)

# Check for missing data
total = traindf.isnull().sum().sort_values(ascending=False)
percent = (traindf.isnull().sum()/traindf.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
print(missing_data)

**#STEP 3: TRAIN TEST SPLIT FOR TRAIN DATASET**

Now that we are done preparing the train dataset, we can begin with the next step of modelling. To start off, we will split the train set so that we can validate our models on held-out data.

In [None]:
# Train_test_split using traindf
traindf_X = traindf.drop(['count'],axis=1)
traindf_y = traindf[['count']]
X_train, X_test, y_train, y_test = train_test_split(traindf_X,traindf_y,test_size=.2, random_state=8)
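
The split above gives a single hold-out set. If k-fold cross validation is wanted instead, a minimal sketch could look like the one below. It runs on synthetic count data (`X`, `y` and the small forest are my own assumptions) and uses the same 'neg_mean_squared_log_error' scoring string that this notebook passes to GridSearchCV in Step 4.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic, non-negative count data standing in for the bike dataset.
rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(300, 3))
y = rng.poisson(lam=10 + 50 * X[:, 0])

# Five shuffled folds; every row is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=8)
neg_msle = cross_val_score(
    RandomForestRegressor(n_estimators=20, random_state=8),
    X,
    y,
    scoring="neg_mean_squared_log_error",
    cv=cv,
)

# scikit-learn negates error metrics, so flip the sign before the square root.
fold_rmsle = np.sqrt(-neg_msle)
print(fold_rmsle, fold_rmsle.mean())
```

The spread of the per-fold scores gives a feel for how much a single hold-out score might vary.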


**#STEP 4: MODEL SELECTION AND HYPERPARAMETER TUNING**

Let's perform initial testing based on the 4 models below. This technique comes from Raj Mehrotra's notebook: https://www.kaggle.com/rajmehra03/bike-sharing-demand-rmsle-0-3194. I found it helpful for quickly selecting an appropriate model for the problem at hand, before doing a deep-dive into hyperparameter tuning.

As the aim of this case study is to forecast bike sharing demand, it is important to note that under-forecasting has more severe consequences than over-forecasting. This is because under-forecasting results in lost business opportunities and stunted business growth. With this in mind, we have chosen RMSLE as our evaluation metric instead of the more common RMSE, because RMSLE incurs a larger penalty for underestimation - you can read more about it here: https://medium.com/analytics-vidhya/root-mean-square-log-error-rmse-vs-rmlse-935c6cc1802a
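
To make the asymmetry concrete, here is a small worked sketch. It assumes the usual definition, RMSLE = sqrt(mean((log(1 + prediction) - log(1 + actual))^2)), and uses made-up numbers: an actual demand of 100 bikes that is missed by 50 in either direction. The helper names `rmsle_score` and `rmse_score` are my own.

```python
import numpy as np

def rmsle_score(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def rmse_score(y_true, y_pred):
    """Plain root mean squared error, for comparison."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

actual = [100]
under = [50]   # under-forecast by 50 bikes
over = [150]   # over-forecast by 50 bikes

rmsle_under = rmsle_score(actual, under)
rmsle_over = rmsle_score(actual, over)
print("RMSE :", rmse_score(actual, under), rmse_score(actual, over))
print("RMSLE:", rmsle_under, rmsle_over)
```

RMSE treats both misses identically, while RMSLE scores the under-forecast at roughly 0.68 against roughly 0.40 for the over-forecast.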


In [None]:
# Initial testing based on several models
models=[RandomForestRegressor(),AdaBoostRegressor(),SVR(),KNeighborsRegressor()]
model_names=['RandomForestRegressor','AdaBoostRegressor','SVR','KNeighborsRegressor']

# Compiling initial base results using RMSLE
rmsle = []
for clf in models:
    clf.fit(X_train, y_train.values.ravel())
    test_pred = clf.predict(X_test)
    # scikit-learn metrics expect (y_true, y_pred), in that order
    rmsle.append(np.sqrt(mean_squared_log_error(y_test, test_pred)))
model_result = {'Modelling Algo': model_names, 'RMSLE': rmsle}
rmsle_frame = pd.DataFrame(model_result)
rmsle_frame

We can see that the RandomForestRegressor easily outperforms the rest of the models with a low RMSLE score of 0.344. So let's move on and commit to the RandomForestRegressor model, and perform hyperparameter tuning. For efficiency, I will not run the full hyperparameter search here, as it takes significant runtime.

I have performed GridSearchCV using the parameters below:
* 'n_estimators':[300,500,700]
* 'bootstrap':[True,False]
* 'max_depth':[None, 25, 50, 75, 100]
* 'min_samples_leaf':[1,2,4]
* 'min_samples_split':[2,5,10]
* 'max_features':["auto",'sqrt','log2']

After an hour of runtime on my own machine, the best parameters appear to be:
* 'n_estimators':[300]
* 'bootstrap':[True]
* 'max_depth':[50]
* 'min_samples_leaf':[2]
* 'min_samples_split':[2]
* 'max_features':['auto']

With these parameters, we achieve a slightly improved RMSLE score of 0.339, as the cell below shows.
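
Since the full search is not rerun in this notebook, here is a minimal sketch of what such a search looks like. It uses a deliberately tiny grid on synthetic count data so that it finishes in seconds; the data, the grid values and the `search` name are my own assumptions and are not the settings behind the scores quoted above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic, non-negative count data standing in for the bike dataset.
rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(200, 4))
y = rng.poisson(lam=20 + 100 * X[:, 0])

# Reduced grid for illustration; the real search covered far more values.
param_grid = {
    "n_estimators": [10, 30],
    "min_samples_leaf": [1, 2, 4],
}

search = GridSearchCV(
    estimator=RandomForestRegressor(random_state=8),
    param_grid=param_grid,
    scoring="neg_mean_squared_log_error",
    cv=3,
)
search.fit(X, y)

# best_score_ is the negated mean MSLE across folds; convert it for readability.
best_rmsle = np.sqrt(-search.best_score_)
print(search.best_params_, best_rmsle)
```

Every extra value in the grid multiplies the number of fits, which is why the full search took about an hour.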

In [None]:
# Fitting the best parameters to traindf
# Note: max_features='auto' was removed in scikit-learn 1.3. For regressors it meant
# "use all features", which current versions express as max_features=1.0.
params_dict={'n_estimators':[300],'bootstrap':[True],'max_depth':[50],'min_samples_leaf':[2],'min_samples_split':[2],'n_jobs':[-1],'max_features':[1.0]}
clf_rf=GridSearchCV(estimator=RandomForestRegressor(),param_grid=params_dict,scoring='neg_mean_squared_log_error',cv=5)
clf_rf.fit(X_train,y_train.values.ravel())
pred=clf_rf.predict(X_test)
# scikit-learn metrics expect (y_true, y_pred), in that order
print(np.sqrt(mean_squared_log_error(y_test,pred)))

**#STEP 5: BUILDING DATA PIPELINE AND PREDICTING RESULT FOR TEST DATASET**

Now that we have the model, we need to feed in testdf. However, we first have to repeat all the data preprocessing steps that we performed on traindf. As part of my learning journey, I have also learnt to build a data Pipeline. I have read that Pipelines make our code reusable for future datasets. So let's get started!

Before building any data Pipeline, it is important to list the steps that need to be performed:
* Convert the 'datetime' column to the datetime datatype, set it as the index, and create 'year', 'month', 'hour'
* Create binary columns for 'spring', 'fall', 'clear', 'light_snow'
* Drop the irrelevant columns 'holiday', 'atemp', 'weather', 'season' ('datetime' has already become the index by this point)
* Fit a RandomForestRegressor with the best parameters

In [None]:
# Converting 'datetime' datatype and set as index
class DatetimeConverter(BaseEstimator):
    def __init__(self):
        pass
    def fit(self, documents, y=None):
        return self
    def transform(self, x_dataset):
        # Work on a copy so the caller's dataframe is not modified in place
        x_dataset = x_dataset.copy()
        x_dataset['datetime'] = pd.to_datetime(x_dataset['datetime'])
        x_dataset = x_dataset.set_index('datetime')
        x_dataset['year'] = x_dataset.index.year
        x_dataset['month'] = x_dataset.index.month
        x_dataset['hour'] = x_dataset.index.hour        
        
        return x_dataset

# Creating custom class for binary encoding
class BinaryEncoder(BaseEstimator):
    def __init__(self):
        pass
    def fit(self, documents, y=None):
        return self
    def transform(self, x_dataset):
        x_dataset['spring'] = (x_dataset['season'] == 1)*1
        x_dataset['fall'] = (x_dataset['season'] == 3)*1
        x_dataset['clear'] = (x_dataset['weather'] == 1)*1
        x_dataset['light_snow'] = (x_dataset['weather'] == 3)*1
        
        return x_dataset

# Create transformer to drop irrelevant columns
drop_col = ColumnTransformer(remainder='passthrough',
                                transformers=[('drop_columns', 'drop', ['holiday', 'atemp', 'weather', 'season'])])

model_pipeline = Pipeline(steps=[('converting_datetime', DatetimeConverter()),
                                 ('create_binary_columns', BinaryEncoder()),
                                 ('drop_columns', drop_col),
                                 ('random_forest_regressor', RandomForestRegressor(n_estimators=300,
                                                                                   bootstrap=True,
                                                                                   max_depth=50,
                                                                                   min_samples_leaf=2,
                                                                                   min_samples_split=2,
                                                                                   # 'auto' was removed in scikit-learn 1.3; 1.0 (all features) is its equivalent for regressors
                                                                                   max_features=1.0))])
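
To see how the pieces of a Pipeline interact, here is a minimal, self-contained sketch on made-up data. `HourExtractor`, `toy_X`, `toy_y` and `toy_pipeline` are hypothetical names for this illustration only. It assumes the common scikit-learn pattern of inheriting from both `BaseEstimator` and `TransformerMixin`, and shows that slicing off the final step lets us inspect the features the model actually receives.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

class HourExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replaces 'datetime' with an 'hour' column."""

    def fit(self, X, y=None):
        return self  # nothing to learn from the data

    def transform(self, X):
        X = X.copy()  # leave the caller's dataframe untouched
        X["hour"] = pd.to_datetime(X["datetime"]).dt.hour
        return X.drop(columns=["datetime"])

# Made-up rows for illustration.
toy_X = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 08:00:00",
                 "2011-01-01 17:00:00", "2011-01-02 08:00:00"],
    "temp": [9.8, 12.3, 15.6, 11.0],
})
toy_y = [16, 210, 350, 190]

toy_pipeline = Pipeline(steps=[
    ("extract_hour", HourExtractor()),
    ("model", RandomForestRegressor(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(toy_X, toy_y)

# New data goes through exactly the same preprocessing before prediction.
new_rows = pd.DataFrame({"datetime": ["2011-01-03 08:00:00"], "temp": [10.5]})
prediction = toy_pipeline.predict(new_rows)

# Slicing off the last step exposes the preprocessed features for inspection.
features = toy_pipeline[:-1].transform(new_rows)
print(features)
print(prediction)
```

Inspecting `features` this way is a quick check that the preprocessing produces the columns we expect before trusting the predictions.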



In [None]:
# Re-importing the dataset from Kaggle
traindf = pd.read_csv('../input/bike-sharing-demand/train.csv')
testdf = pd.read_csv('../input/bike-sharing-demand/test.csv')
traindf_X = traindf.drop(['count','casual','registered'],axis=1)
traindf_y = traindf[['count']]

Now that we have built our model pipeline and re-imported our dataset, we are ready to feed in the data and generate the output file. The final score when submitted to the Kaggle challenge is 0.479. This is definitely not the best result on the leaderboard, but it did indeed satisfy my objective for publishing this notebook. Thanks for reading.

In [None]:
model_pipeline.fit(traindf_X,traindf_y.values.ravel())
submission=pd.DataFrame(model_pipeline.predict(testdf), index=testdf['datetime'])
submission.rename(columns={0:'count'}, inplace=True)
submission.to_csv('submission.csv', index=True)
print(submission)