Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different one on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.

The data generated by these systems makes them attractive to researchers because the duration of travel, departure location, arrival location, and time elapsed are explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this notebook, we try to combine historical usage patterns with season data in order to forecast bike rental demand in the Bikeshare program in Washington, D.C.


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats, integrate
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
%matplotlib inline
pd.options.display.float_format = '{:.2f}'.format
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

In [None]:
bikes=pd.read_csv("../input/bikeshare.csv", index_col='datetime', parse_dates=True)
bikes.head()

In [None]:
# "count" is a function, so to avoid  confusion we change the column name to total
bikes.rename(columns={'count':'total'}, inplace=True)

In [None]:
bikes_data=bikes.copy()

In [None]:
print(bikes_data.shape)

In [None]:
bikes_data.describe()


The average number of bikes rented is 191.5. The large gap between the minimum and maximum rental counts leads to a high standard deviation.

In [None]:
# To check Multicollinearity 

bikes_data.corr()


temp and atemp are highly correlated, so having both of them in a regression model leads to a multicollinearity issue. Therefore, we will drop one of the two variables.
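
One way to put a number on this, sketched here on synthetic data rather than the bike data, is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R^2). The `demo` frame, its values, and the helper `vif` are assumptions made up for illustration; only the column names echo the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the real data (assumption): atemp closely tracks temp.
rng = np.random.default_rng(0)
temp_demo = rng.uniform(0, 40, size=500)
demo = pd.DataFrame({
    'temp': temp_demo,
    'atemp': temp_demo * 1.1 + rng.normal(0, 1, size=500),
    'humidity': rng.uniform(20, 100, size=500),
})

def vif(frame, column):
    """Hypothetical helper: VIF = 1 / (1 - R^2) from regressing `column` on the rest."""
    others = frame.drop(columns=column)
    r2 = LinearRegression().fit(others, frame[column]).score(others, frame[column])
    return 1.0 / (1.0 - r2)

# VIF for every feature when temp and atemp are both present
vifs = {col: vif(demo, col) for col in demo.columns}
print(vifs)

# VIF again after dropping one of the correlated pair
reduced = demo.drop(columns='atemp')
vifs_reduced = {col: vif(reduced, col) for col in reduced.columns}
print(vifs_reduced)
```

In this sketch the two temperature columns get very large VIFs while both are present, and the remaining VIFs fall back close to 1 once one of them is dropped.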

The categorical variables (season, holiday, workingday, weather) would need a chi-square test rather than a correlation coefficient, so we do not interpret their rows here and read only the numerical variables.
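
For reference, a chi-square test of independence between two categorical variables might look like the sketch below. The data is invented, and the `busy` flag is a hypothetical column that does not exist in the bike dataset; the calls used are `pd.crosstab` and `scipy.stats.chi2_contingency`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Invented data (assumption): 60 working days and 40 non-working days,
# each flagged with a hypothetical 0/1 "busy" label.
demo = pd.DataFrame({
    'workingday': [1] * 60 + [0] * 40,
    'busy': [1] * 45 + [0] * 15 + [1] * 10 + [0] * 30,
})

# Contingency table of counts for each (workingday, busy) combination
table = pd.crosstab(demo['workingday'], demo['busy'])
print(table)

# Chi-square test of independence on the table
chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value, dof)
```

A small p-value suggests the two categorical variables are not independent in this toy table.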

In addition, total is simply the sum of casual and registered. Because total is our dependent variable (the output), we can ignore casual and registered.
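
A quick way to confirm that kind of relationship before dropping the columns is sketched below on a tiny hand-made frame (the numbers are invented; on the real data the same check would run against `bikes_data`).

```python
import pandas as pd

# Tiny invented frame with the same three column names as the dataset
demo = pd.DataFrame({
    'casual': [3, 8, 5],
    'registered': [13, 32, 27],
    'total': [16, 40, 32],
})

# True when every row satisfies casual + registered == total
is_sum = bool(((demo['casual'] + demo['registered']) == demo['total']).all())
print(is_sum)

# Keep only the target; the two components would leak it into the features
target_only = demo.drop(columns=['casual', 'registered'])
print(target_only.columns.tolist())
```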




In [None]:
# scatter plot
a=sns.lmplot(x='temp', y='total', fit_reg=True, data=bikes_data, aspect=1.5, scatter_kws={'alpha':0.2})


The graph shows that the number of rentals increases as temp increases. The correlation matrix confirms that temp and total are positively correlated.

In [None]:
# exploring more features
feature_cols = ['temp', 'season', 'weather', 'humidity']

In [None]:
# multiple scatter plots in Seaborn
sns.pairplot(bikes_data, x_vars=feature_cols, y_vars='total', kind='reg')

The plots show an unusual trend for season: more bikes are rented in winter than in spring. That is hard to accept for Washington, where winter brings adverse weather (heavy snow). A box plot gives more insight.

In [None]:
# box plot of rentals, grouped by season
bikes.boxplot(column='total', by='season')

Rentals peak in seasons 2 and 3 (summer and fall), but it is strange that winter has more rentals than spring. One reason for this ambiguity is that the dataset does not define the transition between seasons, that is, exactly when each season starts and ends, so some rentals that belong to spring may be counted under winter. The plot therefore shows more rentals in winter than in spring, but only because the system is experiencing overall growth and the winter months happen to come after the spring months.

**Handling categorical features**

scikit-learn expects all features to be numeric. So how do we include a categorical feature in our model?

Ordered categories: transform them to sensible numeric values (example: small=1, medium=2, large=3)

Unordered categories: use dummy encoding (0/1)
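
The two strategies can be sketched on a toy frame as follows. The `size` and `color` columns are invented for illustration and are not part of the bike data.

```python
import pandas as pd

demo = pd.DataFrame({
    'size': ['small', 'large', 'medium', 'small'],   # ordered category
    'color': ['red', 'green', 'blue', 'green'],      # unordered category
})

# Ordered: map each level to a number that preserves the ordering
size_order = {'small': 1, 'medium': 2, 'large': 3}
demo['size_num'] = demo['size'].map(size_order)

# Unordered: one 0/1 column per level
color_dummies = pd.get_dummies(demo['color'], prefix='color', dtype=int)

encoded = pd.concat([demo[['size_num']], color_dummies], axis=1)
print(encoded)
```

Each row of the dummy columns contains exactly one 1, marking the level that row belongs to.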

What are the categorical features in our dataset?

Ordered categories: weather (already encoded with sensible numeric values)

Unordered categories: season (needs dummy encoding), holiday (already dummy encoded), workingday (already dummy encoded)

For season, we can't simply leave the encoding as 1 = spring, 2 = summer, 3 = fall, and 4 = winter, because that would imply an ordered relationship. Instead, we create multiple dummy variables:

In [None]:
# create dummy variables
season_dummies = pd.get_dummies(bikes_data.season, prefix='season')

# print 5 random rows
season_dummies.sample(n=5, random_state=12)

However, we actually only need **three dummy variables (not four)**, and thus we'll drop the first dummy variable.

Why? Because three dummies capture all of the "information" about the season feature, and implicitly define spring (season 1) as the **baseline level:**

In [None]:
# drop the first column
season_dummies.drop(season_dummies.columns[0], axis=1, inplace=True)

# print 5 random rows
season_dummies.sample(n=5, random_state=12)
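
To see why nothing is lost, the sketch below (on a short invented `season` series) rebuilds the dropped column from the remaining three. It uses the `drop_first=True` option of `pd.get_dummies`, which has the same effect as dropping the first column by hand.

```python
import pandas as pd

# Short invented series using the same 1-4 coding as the dataset
season = pd.Series([1, 2, 3, 4, 2, 1], name='season')

all_four = pd.get_dummies(season, prefix='season', dtype=int)
three = pd.get_dummies(season, prefix='season', drop_first=True, dtype=int)

# The dropped column is 1 exactly when the other three are all 0
recovered = 1 - three.sum(axis=1)
print((recovered == all_four['season_1']).all())
```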

In [None]:
# concatenate the original DataFrame and the dummy DataFrame (axis=0 means rows, axis=1 means columns)
bikes_data = pd.concat([bikes_data, season_dummies], axis=1)

# print 5 random rows
bikes_data.sample(n=5, random_state=12)

In [None]:
# include dummy variables for season in the model
feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity']
X = bikes_data[feature_cols]  # input or independent variable 
y = bikes_data.total          # Output or Dependent variable
linreg = LinearRegression()
linreg.fit(X, y)
list(zip(feature_cols, linreg.coef_))


How do we interpret the season coefficients? They are measured against the baseline (spring):

Holding all other features fixed, summer is associated with a rental decrease of 3.39 bikes compared to the spring.

Holding all other features fixed, fall is associated with a rental decrease of 41.7 bikes compared to the spring.

Holding all other features fixed, winter is associated with a rental increase of 64.4 bikes compared to the spring.

Would it matter if we changed which season was defined as the baseline?

No, it would simply change our interpretation of the coefficients.
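
A small check of this claim, on synthetic data rather than the bike data (the values and the generating formula below are made up, and `fit_with_baseline` is a hypothetical helper): fitting with spring or with winter as the baseline gives different season coefficients but the same predictions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): season coded 1-4, plus a temperature column
rng = np.random.default_rng(1)
demo = pd.DataFrame({
    'season': rng.integers(1, 5, size=200),
    'temp': rng.uniform(0, 40, size=200),
})
demo['total'] = 5 * demo['temp'] + 20 * demo['season'] + rng.normal(0, 10, size=200)

dummies = pd.get_dummies(demo['season'], prefix='season', dtype=int)

def fit_with_baseline(baseline):
    """Fit on temp plus all season dummies except `baseline`."""
    X = pd.concat([demo[['temp']], dummies.drop(columns=baseline)], axis=1)
    model = LinearRegression().fit(X, demo['total'])
    return model, model.predict(X)

model_spring, pred_spring = fit_with_baseline('season_1')
model_winter, pred_winter = fit_with_baseline('season_4')

print(np.allclose(pred_spring, pred_winter))   # same fitted values
print(model_spring.coef_)                      # season effects relative to spring
print(model_winter.coef_)                      # season effects relative to winter
```

The temp coefficient is the same in both fits; only the season coefficients and the intercept shift, because they are expressed relative to a different reference level.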

**Important: Dummy encoding is relevant for all machine learning models, not just linear regression models.**

In [None]:
# splitting the data into training and test data.

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=12)

In [None]:
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
# Building the linear model on the training data
lin_reg=LinearRegression()
model=lin_reg.fit(X_train,y_train)

In [None]:
# feature_cols = ['temp', 'season_2', 'season_3', 'season_4', 'humidity'] #Input or independent variable
print(model.intercept_)
print (model.coef_)

In [None]:
# Predicting on X_test with the fitted model
predicted=model.predict(X_test)

In [None]:
print ('MAE:', metrics.mean_absolute_error(y_test, predicted))
print ('MSE:', metrics.mean_squared_error(y_test, predicted))
print ('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predicted)))
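
For readers who want to see what these three metrics compute, here is a hand calculation on four invented values, compared against the scikit-learn functions used above.

```python
import numpy as np
from sklearn import metrics

# Invented true and predicted values
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 230.0, 250.0])

errors = y_true - y_pred            # -10, 10, -30, 0
mae = np.mean(np.abs(errors))       # (10 + 10 + 30 + 0) / 4 = 12.5
mse = np.mean(errors ** 2)          # (100 + 100 + 900 + 0) / 4 = 275.0
rmse = np.sqrt(mse)                 # square root of the MSE, in the units of y

print(mae, metrics.mean_absolute_error(y_true, y_pred))
print(mse, metrics.mean_squared_error(y_true, y_pred))
print(rmse)
```

Because the errors are squared before averaging, the single error of 30 dominates the MSE and RMSE much more than it does the MAE.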

In [None]:
# For the model to be useful, its RMSE has to be lower than the null RMSE

#Compute null RMSE
# split X and y into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3, random_state=12)

# create a NumPy array with the same shape as y_test
y_null = np.zeros_like(y_test, dtype=float)

# fill the array with the mean value of y_test
y_null.fill(y_test.mean())
y_null

In [None]:
print(y_test.shape)
print(y_null.shape)

In [None]:
# compute null RMSE
np.sqrt(metrics.mean_squared_error(y_test, y_null))
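
A side note, sketched on five invented values: when the constant prediction is the mean of the very values being scored, as it is here, the null RMSE is the same number as the population standard deviation of those values (`np.std` with its default `ddof=0`).

```python
import numpy as np

# Invented target values
y_demo = np.array([10.0, 50.0, 120.0, 300.0, 20.0])

# Constant "null model" prediction: the mean of y_demo for every row
y_null_demo = np.full_like(y_demo, y_demo.mean())

null_rmse = np.sqrt(np.mean((y_demo - y_null_demo) ** 2))
print(null_rmse, np.std(y_demo))
```

Any model worth keeping should therefore have an RMSE below the spread of the target itself.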

In [None]:
# define a function that accepts a list of features and returns testing RMSE
def train_test_rmse(feature_cols):
    X = bikes_data[feature_cols]
    y = bikes_data.total
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=123)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)
    return np.sqrt(metrics.mean_squared_error(y_test, y_pred))

In [None]:
# compare different sets of features
print (train_test_rmse(['temp', 'season', 'weather', 'humidity']))
print (train_test_rmse(['temp', 'season', 'weather']))
print (train_test_rmse(['temp', 'season', 'humidity']))
print (train_test_rmse(['temp', 'humidity']))
print (train_test_rmse(['temp', 'season_2', 'season_3', 'season_4','weather', 'humidity']))
print (train_test_rmse(['temp', 'season_2', 'season_3', 'season_4','weather']))
print (train_test_rmse(['temp', 'season_2', 'season_3', 'season_4', 'humidity']))

Comparing models 1 and 3, dropping weather changes the RMSE only slightly, if at all. This suggests that weather is correlated with the other features (temp, season), so one of these variables can be dropped without hurting the RMSE much.
The RMSE falls further when season is replaced by its dummy variables. Model 7 has the lowest RMSE, so it is the best of the models compared and the others can be set aside.

In [None]:
bikes_data['hour']=bikes_data.index.hour

In [None]:
bikes_data.head()

In [None]:
# hour as a categorical feature
hour_dummies = pd.get_dummies(bikes_data.hour, prefix='hour')
hour_dummies.drop(hour_dummies.columns[0], axis=1, inplace=True)
bikes_data = pd.concat([bikes_data, hour_dummies], axis=1)
#hour_dummies
bikes_data.head()

In [None]:
# mean rentals by hour (factorplot was renamed to catplot, and size to height, in newer seaborn versions)
sns.catplot(x="hour", y="total", data=bikes_data, kind='bar', height=5, aspect=1.5)

Bike rentals are high in the morning between 7 and 9 am and again in the evening between 5 and 6 pm. The main reason is likely office hours: professionals try to beat the traffic with affordable transportation. In addition, weather conditions tend to be more moderate in the morning and evening than at midday or at night.

In [None]:
# daytime as a categorical feature
bikes_data['daytime'] = ((bikes_data.hour > 6) & (bikes_data.hour < 21)).astype(int)
bikes_data.tail()

In [None]:
print (train_test_rmse(['hour']))
print (train_test_rmse(bikes_data.columns[bikes_data.columns.str.startswith('hour_')]))
print (train_test_rmse(['daytime']))

Looking at the RMSE values, model 2 (the hour dummies) has the lowest RMSE. The 'daytime' feature comes second, possibly because our definition of daytime (hour > 6 and hour < 21) does not match how riders actually divide up the day. Riders may be using a different classification of time periods.
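
One way to try a different classification, offered only as a sketch: bin the hour into several named periods with `pd.cut` and dummy-encode the result. The bin edges and labels below are hypothetical choices for illustration, not values taken from the data.

```python
import pandas as pd

hours = pd.Series(range(24), name='hour')

# Hypothetical bin edges (right-inclusive): 0-6, 7-9, 10-15, 16-19, 20-23
bins = [-1, 6, 9, 15, 19, 23]
labels = ['night', 'am_rush', 'midday', 'pm_rush', 'evening']
period = pd.cut(hours, bins=bins, labels=labels)

# Dummy-encode the periods, dropping the first level ('night') as the baseline
period_dummies = pd.get_dummies(period, prefix='period', drop_first=True, dtype=int)
print(period_dummies.head(10))
```

The resulting `period_*` columns could then be passed to `train_test_rmse` in the same way as the hour dummies, once they are concatenated onto `bikes_data`.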

In [None]:
print('END')