## Problem Statement

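In this notebook we predict the hourly bike rental count. The column definitions for the dataset are listed below. Note that the Kaggle files loaded later name some columns differently (`datetime`, `weather`, `humidity` and `count` rather than `dteday`/`hr`, `weathersit`, `hum` and `cnt`).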

1. instant: record index
2. dteday : date
3. season : season (1:spring, 2:summer, 3:fall, 4:winter)
4. yr : year (0: 2011, 1:2012)
5. mnth : month ( 1 to 12)
6. hr : hour (0 to 23)
7. holiday : whether the day is a holiday or not
8. weekday : day of the week
9. workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
10. weathersit : 
		- 1: Clear, Few clouds, Partly cloudy
		- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
		- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
		- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
11. temp : Normalized temperature in Celsius. The values are divided by 41 (max)
12. atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
13. hum: Normalized humidity. The values are divided by 100 (max)
14. windspeed: Normalized wind speed. The values are divided by 67 (max)
15. casual: count of casual users
16. registered: count of registered users
17. cnt: count of total rental bikes including both casual and registered
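
The normalization described for `temp`, `atemp`, `hum` and `windspeed` can be undone by multiplying by the stated maximum. The snippet below is a minimal sketch of that arithmetic; the `NORMALIZATION_MAX` table and the `denormalize` helper are hypothetical names introduced here, and it assumes the divisors in the list above are exact.

```python
# Divisors quoted in the column definitions above (assumed to be exact).
NORMALIZATION_MAX = {"temp": 41.0, "atemp": 50.0, "hum": 100.0, "windspeed": 67.0}


def denormalize(column, value):
    """Map a normalized reading back to its original scale (hypothetical helper)."""
    return value * NORMALIZATION_MAX[column]


# A normalized temperature of 0.5 corresponds to 20.5 degrees Celsius.
temp_celsius = denormalize("temp", 0.5)
# A normalized humidity of 0.5 corresponds to 50 percent.
humidity_pct = denormalize("hum", 0.5)
```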

In [None]:
import numpy as np 
import pandas as pd
import calendar
import seaborn as sns
from matplotlib import pyplot as plt
%matplotlib inline
plt.style.use('bmh')

from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import statsmodels.api as sm
from sklearn.preprocessing import MinMaxScaler
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import GridSearchCV

In [None]:
sampleSubmission = pd.read_csv("../input/bike-sharing-demand/sampleSubmission.csv")
test_df = pd.read_csv("../input/bike-sharing-demand/test.csv")
df = pd.read_csv("../input/bike-sharing-demand/train.csv")

In [None]:
df.head()

## Feature Engineering

In [None]:
def addfeatures(data):
    # Parse the timestamp once and derive the calendar features from it
    dt = pd.to_datetime(data['datetime'])
    data['hour'] = dt.dt.hour
    data['day'] = dt.dt.dayofweek
    data['month'] = dt.dt.month
    # Encode the year as 0 (2011) or 1 (2012)
    data['year'] = dt.dt.year.map({2011: 0, 2012: 1})
    data['date'] = dt.dt.date
    data['weekday'] = dt.dt.dayofweek
    data.drop('datetime', axis=1, inplace=True)

In [None]:
addfeatures(df)

In [None]:
addfeatures(test_df)

In [None]:
df.info()

Based on `df.info()`, there are no NaN or null values in the dataset.
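
A null check like this can also be made explicit with `isna()`. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) and also counts zeros, since a zero placeholder would not be reported as null.

```python
import numpy as np
import pandas as pd

# Toy data with one true NaN and two zero placeholders (illustrative values).
toy = pd.DataFrame({
    "temp": [9.84, 9.02, np.nan],
    "humidity": [81, 80, 75],
    "windspeed": [0.0, 0.0, 6.0],
})

# Count of NaN/null entries per column.
missing_per_column = toy.isna().sum()
has_missing = bool(missing_per_column.any())

# Zeros are valid numbers to pandas, so they have to be counted separately.
zeros_per_column = (toy == 0).sum()
```

Here `windspeed` has no nulls but two zeros, which is why a clean `df.info()` alone does not rule out placeholder values.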

## Visualization 


1. Understand the effect of weekday and hour of day on the count. Are there days and times when the count is clearly lower, or negligible, compared with other days and hours?
2. Verify the effect of season on the count. Which season tends to have more rentals?
3. Explore the relationship between count, season and working day. Is there any relationship between rental counts, season and working day?

In [None]:
fig, (ax1) = plt.subplots(ncols=1, nrows=1, sharex=True, sharey=True, figsize = (14, 10))
df.groupby('date')['count'].mean().plot(ax =ax1, title='Bike Rent Count per Date')
plt.xlabel('Date')
plt.ylabel('Count');

In [None]:
fig, (ax1, ax2) = plt.subplots(ncols=1, nrows=2, sharex=True, sharey=True, figsize = (14, 10))
# rent per day per hour
# select 'count' before averaging so the non-numeric `date` column is not aggregated
df.groupby(['weekday','hour'])['count'].mean().unstack('weekday').plot( ax=ax1, title='Bike Rent Count per day per hour' )

# rent per season per hour
df.groupby(['season', 'hour'])['count'].mean().unstack('season').rename(columns={1:'spring', 2:'summer', 3:'fall', 4:'winter'}).plot( ax=ax2, title = 'Bike Rent Count per season per hour')
# Set common labels
fig.text(0.5, 0.04, 'Hour of the Day', ha='center', va='center', fontsize = 14)
fig.text(0.06, 0.5, 'Rental Counts', ha='center', va='center', rotation='vertical', fontsize = 14);

Rentals clearly spike in the morning (6-9 AM) and in the evening (4-7 PM) on almost all weekdays, when most people are traveling to and from work or school. On weekends, rentals are higher in the afternoon, which is leisure time.
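
The same reading of the plot can be checked numerically by locating the busiest hour for each day type. This is a minimal sketch on made-up numbers (not values from the dataset), grouping by `workingday` rather than by individual weekday for brevity.

```python
import pandas as pd

# Toy hourly records; counts are illustrative only.
toy = pd.DataFrame({
    "workingday": [1, 1, 1, 1, 0, 0, 0, 0],
    "hour":       [8, 8, 13, 17, 8, 13, 13, 17],
    "count":      [400, 420, 150, 450, 80, 300, 320, 200],
})

# Mean count per hour, one column per day type (0 = weekend/holiday, 1 = working day).
hourly = toy.groupby(["workingday", "hour"])["count"].mean().unstack("workingday")

# Hour with the highest mean count for each day type.
peak_hour = hourly.idxmax()
```

In this toy frame the working-day peak falls in the evening commute and the non-working-day peak in the early afternoon, mirroring the pattern described above.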

In [None]:
fig, (ax1, ax2, ax3, ax4) = plt.subplots(ncols=4, sharey=True, figsize = (14, 6))
sns.regplot(x="atemp", y="count", data=df, ax=ax1)
sns.regplot(x="temp", y="count", data=df, ax=ax2)
sns.regplot(x="windspeed", y="count", data=df, ax=ax3)
sns.regplot(x="humidity", y="count", data=df, ax=ax4);

`atemp` and `temp` have a positive relationship with count, while humidity has a negative one. One thing to note is that windspeed contains many zeros and has a positive relationship with count. We could either drop windspeed or impute these zeros. There are also zeros in humidity; let's check these values and try to impute them.
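
The two options for the zero readings can be sketched side by side. The example below uses a toy frame with made-up values, and the median fill is only one possible imputation, shown as an alternative for illustration rather than the approach used in the cells that follow.

```python
import pandas as pd

# Toy frame; values are illustrative only.
toy = pd.DataFrame({
    "windspeed": [0.0, 6.0, 0.0, 8.0, 12.0],
    "count": [16, 40, 32, 13, 1],
})

# Share of rows where windspeed is recorded as zero.
zero_share = (toy["windspeed"] == 0).mean()

# Option 1: drop the column entirely.
dropped = toy.drop(columns="windspeed")

# Option 2: replace zeros with the median of the non-zero readings.
nonzero_median = toy.loc[toy["windspeed"] > 0, "windspeed"].median()
imputed = toy["windspeed"].mask(toy["windspeed"] == 0, nonzero_median)
```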

## Imputing Missing data

In [None]:
# Windspeed 
print('Number of rows with missing Windspeed: ', df[df.windspeed ==0].shape[0])

In [None]:
# Treat zero windspeed as missing: forward-fill from the last non-zero reading,
# then back-fill any zeros at the very start of the series.
df.windspeed = df.windspeed.replace(0, np.nan).ffill().bfill()

In [None]:
# Humidity
print('Number of rows with missing Humidity: ', df[df.humidity ==0].shape[0])

In [None]:
# all of 0 humidity in the data comes from the month of march 2011
df[df.humidity ==0]

In [None]:
march_mean = df[(df.year == 1) & (df.month ==3)]['humidity'].mean()

In [None]:
# replace 0 hum with march mean 2012
df.humidity = df.humidity.map( lambda x : march_mean if x == 0 else x)

In [None]:
plt.figure(figsize=(10,6))

# plot count per working day season wise 
labels=['spring', 'summer', 'fall', 'winter']

ax = sns.barplot(data=df, x='workingday',y='count', hue='season' )

h, l = ax.get_legend_handles_labels()
ax.legend(h, labels, title="Season", loc='upper left');

In [None]:
plt.figure(figsize=(8,6))

year = [2011, 2012]
ax = sns.boxplot(x="workingday", y="count", hue="year", data=df, palette="Set1", );
h, l = ax.get_legend_handles_labels()
ax.legend(h, year, title="Year", loc='upper left');

As seen in the boxplot, the outliers lie on working days and in the year 2012.
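
The points a boxplot draws as outliers are those beyond 1.5 times the interquartile range from the quartiles (seaborn's default whisker length). Counting them per group makes the visual impression concrete. The sketch below uses a hypothetical `iqr_outlier_mask` helper and made-up counts.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (hypothetical helper)."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Toy data: the working-day group contains one extreme count.
toy = pd.DataFrame({
    "workingday": [0] * 5 + [1] * 5,
    "count": [10, 12, 11, 13, 12, 10, 12, 11, 13, 90],
})

# Number of flagged points in each group.
outliers_per_group = {
    group: int(iqr_outlier_mask(values).sum())
    for group, values in toy.groupby("workingday")["count"]
}
```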

In [None]:
# Scale features 
scaler = MinMaxScaler()
col2scale = ['humidity', 'temp', 'windspeed']
for i in col2scale:
    df[i] = scaler.fit_transform(df[i].values.reshape(-1,1))

In [None]:
noncat = ['temp', 'atemp', 'humidity', 'windspeed', 'casual', 'registered', 'count']

In [None]:
# Correlation plot 
cor = df.corr(numeric_only=True)  # skip the non-numeric `date` column
plt.figure(figsize=(14,10))

mask = np.zeros_like(cor, dtype=bool)  # the builtin bool works across NumPy versions
mask[np.triu_indices_from(mask)] = True

ax = sns.heatmap(cor, mask = mask, annot=True, cmap="YlGnBu")

To avoid multicollinearity, we need to drop one of `temp` (temperature) or `atemp` (feeling temperature). Windspeed also has many missing (zero) values, as seen earlier, and its correlation with count is very low, so it is a candidate for dropping as well; the cells below drop only `atemp`.
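
Reading strongly correlated pairs off a heatmap can be automated by scanning the upper triangle of the correlation matrix. The following is a rough sketch on synthetic data; the `correlated_pairs` helper and the 0.9 threshold are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(frame, threshold=0.9):
    """Return feature pairs whose absolute correlation exceeds threshold (hypothetical helper)."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is listed once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold]


# Synthetic data: atemp is built to track temp closely, windspeed is independent.
rng = np.random.default_rng(0)
temp = rng.uniform(0, 40, size=200)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 0.5, size=200),
    "windspeed": rng.uniform(0, 30, size=200),
})

high = correlated_pairs(toy)
```

Any pair returned here is a candidate for keeping only one of its two members.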

In [None]:
# Drop atemp
df.drop(['atemp'], axis=1, inplace = True)

In [None]:
test_df.drop(['atemp'], axis=1, inplace = True)

In [None]:
df['checkh'] = df.casual + df.registered

In [None]:
print('Number of rows where sum of casual and registered is equal to rental count:', sum(df['count'] == df.checkh))
print('Total Rows:', df.shape[0])

Casual and registered always add up to count, so we will drop these two features for now.

In [None]:
df = df.drop(['casual', 'registered'], axis=1)

In [None]:
df['count'].plot(kind = 'kde');

The log transformation can be used to make highly skewed distributions less skewed.
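
The effect of the transformation can be quantified with the sample skewness before and after. The sketch below uses synthetic, strictly positive counts drawn from a log-normal distribution (an assumption made purely for illustration) and shows that `np.exp` inverts the transform.

```python
import numpy as np
import pandas as pd

# Synthetic counts with a long right tail; clipped at 1 so the log is defined.
rng = np.random.default_rng(42)
counts = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000)).round().clip(lower=1)

skew_raw = counts.skew()          # strongly positive for a long right tail
log_counts = np.log(counts)
skew_log = log_counts.skew()      # much closer to zero after the transform

# exp undoes the log, which is needed to turn log-scale predictions back into counts.
recovered = np.exp(log_counts)
```

If zeros can occur in the target, `np.log1p` and `np.expm1` are the usual substitutes, since `np.log(0)` is undefined.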

In [None]:
df['cnt_log'] = np.log(df['count'])

In [None]:
# check the skewness after log transformation 
df.cnt_log.plot(kind = 'kde');

In [None]:
# remove outlier on count column 
#df = df[df.cnt.between(df.cnt.quantile(.05), df.cnt.quantile(.95))]

In [None]:
fig, (ax1, ax2) = plt.subplots(1, 2)
fig.suptitle('Count vs Log Transformed Count')
ax1.set_xlabel('Count')
ax1.hist(df['count'])

ax2.set_xlabel('Log of Count')
ax2.hist(df.cnt_log);

In [None]:
# Dropping the date and checkh variables (checkh was created to verify registered and casual)
df.drop(['date','checkh'], axis=1, inplace=True)
df.head()

In [None]:
df.columns

In [None]:
cat = ['season', 'year', 'month', 'hour', 'holiday', 'weekday', 'workingday', 'weather']
for i in cat:
    df[i] = df[i].astype("category")

In [None]:
dfdummy = pd.get_dummies(df, columns=cat, drop_first=True)

## Model and Evaluation

In [None]:
features = dfdummy[[i for i in list(dfdummy.columns) if i not in ['count', 'cnt_log']]].columns

In [None]:
#Creating Test and train dataset
X = dfdummy.drop(['count', 'cnt_log'], axis=1).values
y = dfdummy['count'].values
yl = dfdummy.cnt_log.values

In [None]:
metric = pd.DataFrame(columns = ['r2', 'rmse'])
r2 = []
rms = []
def split_train_test(x,y):
    # get train test split (`seed` is a global set in the next cell)
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=seed)
    # linear regression 
    lModel = LinearRegression()
    print('\nTraining Linear Regressor on train data....')
    result = lModel.fit(X_train, y_train)
    # score() returns R2 on the held-out split
    r2.append(round((result.score(X_test,y_test)),2))
    rms.append(round((sqrt(mean_squared_error(y_test, result.predict(X_test)))),2))
    print('Done!!!')
    print('\nTraining Random Forest Regressor on train data....')
    # Random forest 
    regr = RandomForestRegressor(n_estimators=300)
    regr.fit(X_train, y_train)
    r2.append(round((regr.score(X_test, y_test)),2))
    rms.append(round((sqrt(mean_squared_error(y_test, regr.predict(X_test)))),2))
    print('Done!!!')

In [None]:
seed = 123
target = [y, yl]


for i in target:
    if i is y:
        print('\nTarget is Count')
    else:
        print('\nTarget is Log of Count')
    split_train_test(X,i)

In [None]:
metric.r2 = r2
metric.rmse= rms
metric['Target-Model'] = ['LM_count', 'RF_Count', 'LM_Count_log', 'RF_Count_log']

In [None]:
g = sns.barplot(x = 'Target-Model', y = 'r2', data = metric)
ax=g
#annotate axis = seaborn axis
for p in ax.patches:
             ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=11, color='Blue', xytext=(0, 8),
                 textcoords='offset points')
_ = g.set_ylim(0,1) 

In [None]:
g = sns.barplot(x = 'Target-Model', y = 'rmse', data = metric)
ax=g
#annotate axis = seaborn axis
for p in ax.patches:
             ax.annotate("%.2f" % p.get_height(), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha='center', va='center', fontsize=11, color='Blue', xytext=(0, 8),
                 textcoords='offset points')
_ = g.set_ylim(0,109) 

In [None]:
# Cross validation

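# random_state only takes effect with shuffle=True (recent scikit-learn raises an error otherwise)
kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=100)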
model_kfold = LinearRegression()
results_kfold = model_selection.cross_val_score(model_kfold, X, yl, cv=kfold, scoring = 'r2')
print("R2 score: ",round(results_kfold.mean(),2))

In [None]:
from sklearn.metrics import mean_squared_log_error

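# random_state only takes effect with shuffle=True (recent scikit-learn raises an error otherwise)
kfold = model_selection.KFold(n_splits=5, shuffle=True, random_state=100)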
model_kfold = RandomForestRegressor(n_estimators=50)
results_kfold = model_selection.cross_val_score(model_kfold, X, yl, cv=kfold, scoring = 'r2')
print("R2 score: ",round(results_kfold.mean(),2))

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, yl, test_size=0.3, random_state=123)
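# Random forest 
regr = RandomForestRegressor(n_estimators=300)
regr.fit(X_train, y_train)
# y_test and the predictions are on the log scale, so convert back to counts before computing RMSLE
print("Root Mean Squared Logarithmic Error: ",sqrt(mean_squared_log_error(np.exp(y_test), np.exp(regr.predict(X_test)))))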

In [None]:
importances = regr.feature_importances_
std = np.std([tree.feature_importances_ for tree in regr.estimators_], axis=0)
indices = np.argsort(importances)[::-1]

In [None]:
# Plot the feature importances of the forest
plt.figure(figsize=(25,5))
plt.title("Feature importances")
plt.bar(range(X_test.shape[1]), importances[indices], yerr=std[indices], align="center")
plt.xticks(range(X_test.shape[1]), [features[i] for i in indices], rotation=90)
plt.xlim([-1, X_test.shape[1]])
plt.show()

## Key Findings

1. Removing outliers from the count reduces the prediction accuracy.
2. Count is right-skewed, so using the log-transformed count as the target gives a better R2 and a lower error.
3. Hour and temp, as seen earlier, are highly correlated with count.
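
One caveat when comparing errors across the two targets: an RMSE computed on the log of count is on a different scale from an RMSE computed on count itself. A common way to put both on equal footing is to convert log-scale predictions back to counts and score them with a single metric such as RMSLE. The sketch below uses a hypothetical `rmsle` helper and made-up numbers.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error on the count scale (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))


# Illustrative values: actual counts and predictions made on the log scale.
actual_counts = np.array([16.0, 40.0, 32.0, 13.0])
log_predictions = np.log(np.array([18.0, 35.0, 30.0, 15.0]))

# Undo the log transform before scoring against the actual counts.
predicted_counts = np.exp(log_predictions)
score = rmsle(actual_counts, predicted_counts)
```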

In [None]:
sns.scatterplot(x=regr.predict(X_test), y=(y_test-regr.predict(X_test)))

In [None]:
# visualize a subset of the test counts (log scale) and the predicted test counts (red)
plt.figure(figsize=(16,5))
plt.plot(regr.predict(X_test)[200:400],'r')
plt.plot(y_test[200:400])