# Forecasting Hourly Bike Rental Demand

__________________

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
from datetime import datetime
from datetime import date
import calendar

## Loading the Data

In [None]:
train = pd.read_csv('../input/bike-sharing-demand/train.csv')
test = pd.read_csv('../input/bike-sharing-demand/test.csv')

In [None]:
train.head()

Since the 'casual' and 'registered' variables together make up the 'count' feature, we can drop both columns before further processing.
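The claim that the two columns add up to 'count' can be checked before they are dropped. The following is a minimal sketch on a small hand-made frame; the values are made up for illustration, and on the real data the same comparison would be run on `train` before the drop.

```python
import pandas as pd

# Toy stand-in for the training data; the numbers are illustrative only.
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# True only if every row satisfies casual + registered == count.
components_sum_to_count = bool(
    (toy["casual"] + toy["registered"] == toy["count"]).all()
)
print(components_sum_to_count)
```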

In [None]:
train.drop(['casual','registered'],axis=1,inplace=True)

In [None]:
train.head()

In [None]:
test.head()

## Data Exploration

In [None]:
#shape of the train and test data
train.shape, test.shape

In [None]:
train.columns

In [None]:
test.columns

We can see that there are 10 columns in the training dataset, whereas there are 9 columns in the test dataset. The missing column, 'count', is our target variable, and we will train a linear regression model to predict it.

In [None]:
# Information about the dataset
train.info()

From the above information we can infer that:

- The 'datetime' feature is stored as a string (object dtype), so we will have to convert it to a datetime data type.
- 'season', 'holiday', 'workingday' and 'weather' are shown as integers but are actually categorical variables, so we will have to treat them as categories rather than as numbers.
- Apart from these, all the other features are numeric in nature.
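One possible way to make these types explicit is sketched below on two hand-made rows. This is an illustration under assumed values rather than a step the notebook itself performs; the `toy` frame and `categorical_cols` list are names introduced only for this example.

```python
import pandas as pd

# Toy rows shaped like the raw data; values are illustrative only.
toy = pd.DataFrame({
    "datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"],
    "season": [1, 1],
    "holiday": [0, 0],
    "workingday": [0, 0],
    "weather": [1, 2],
    "temp": [9.84, 9.02],
})

# Parse the timestamp strings into datetime64 values.
toy["datetime"] = pd.to_datetime(toy["datetime"])

# Mark the integer-coded columns as categorical.
categorical_cols = ["season", "holiday", "workingday", "weather"]
toy = toy.astype({col: "category" for col in categorical_cols})

print(toy.dtypes)
```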

In [None]:
#Checking for null values in the train dataset
train.isnull().sum()

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis');

Since there are no missing values in the train dataset, we don't have to impute any data.

In [None]:
# Checking for missing values in the test dataset
test.isnull().sum()

In [None]:
plt.figure(figsize=(6,4))
sns.heatmap(test.isnull(), yticklabels=False, cbar=False, cmap='viridis');

Similarly, there are no missing values in the test dataset, so no imputation is needed.

In [None]:
plt.style.use('ggplot')

In [None]:
# Descriptive stats of the train dataset
train.drop('count',axis=1).describe()

In [None]:
sns.pairplot(train);

## Exploratory Data Analysis

___________________________

## Univariate Analysis

#### Distribution of target variable 'count'

In [None]:
# histplot with a KDE overlay replaces the deprecated distplot
sns.histplot(train['count'], kde=True);

From the above distribution plot of the 'count' variable we can infer that our target variable is right-skewed, so we take the log of the variable to check whether the distribution becomes closer to normal.

In [None]:
# same plot on the log scale (histplot replaces the deprecated distplot)
sns.histplot(np.log(train['count']), kde=True);
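The effect of the log transform can also be checked numerically through the sample skewness. The sketch below uses synthetic right-skewed counts as a stand-in for `train['count']`, so the numbers it prints say nothing about the real data. Note that `np.log` needs strictly positive values; if a target can be zero, `np.log1p` is the usual alternative.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed counts (an assumption standing in for train['count']).
rng = np.random.default_rng(0)
counts = pd.Series(rng.lognormal(mean=4.0, sigma=1.0, size=5000)).round().clip(lower=1)

skew_raw = counts.skew()          # strongly positive for right-skewed data
skew_log = np.log(counts).skew()  # much closer to 0 after the transform

print(f"skewness before log: {skew_raw:.2f}, after log: {skew_log:.2f}")
```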

## Bivariate Analysis

____________

#### Correlation matrix

Since 'season', 'holiday', 'workingday' and 'weather' are categorical variables and 'datetime' is a string variable, we leave these columns out when computing the correlation matrix.

In [None]:
corrdata = train[["temp","atemp","humidity","windspeed","count"]]
corrmat = corrdata.corr()
plt.figure(figsize=(10,8))
sns.heatmap(corrmat, annot = True, cmap= 'YlGnBu');

From the above heatmap we can infer that:

1. 'temp' has a positive correlation with 'count' and 'humidity' has a negative one. Even though neither correlation is strong, the demand still depends to some extent on the 'temp' and 'humidity' variables.

2. 'windspeed' has a very low correlation with the demand ('count'), so it is unlikely to be a useful feature and we will drop it.

3. 'temp' and 'atemp' are very strongly correlated with each other, so one of them has to be dropped during model building; otherwise there will be multicollinearity in the data.
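Pairs of strongly correlated features can also be listed programmatically instead of being read off the heatmap. The sketch below does this on synthetic data in which 'atemp' is built to track 'temp'; the 0.9 threshold and the `high_pairs` name are assumptions made for the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic features: 'atemp' is constructed to follow 'temp' closely.
rng = np.random.default_rng(1)
temp = rng.uniform(0, 40, size=500)
toy = pd.DataFrame({
    "temp": temp,
    "atemp": temp * 1.1 + rng.normal(0, 1, size=500),
    "humidity": rng.uniform(20, 100, size=500),
    "windspeed": rng.uniform(0, 50, size=500),
})

corr = toy.corr().abs()
threshold = 0.9  # assumed cut-off for "very strong" correlation
cols = corr.columns

# Walk the upper triangle so each pair is reported once.
high_pairs = [
    (cols[i], cols[j])
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if corr.iloc[i, j] > threshold
]
print(high_pairs)
```

For each pair reported this way, one of the two columns would be a candidate to drop.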

### Analyzing the datetime column

Since the values in the 'datetime' column are strings, we have to convert them to datetime format.

In [None]:
train['datetime'] = pd.to_datetime(train['datetime'])

In [None]:
# convert the test set's own datetime column (not the train column)
test['datetime'] = pd.to_datetime(test['datetime'])

In [None]:
#checking the data type of the datetime column
type(train['datetime'][0]), type(test['datetime'][0])

#### Creating Year, Date, Month, Hour and Day of the week columns for the train dataset

In [None]:
train['year'] = train['datetime'].dt.year

In [None]:
train['month'] = train['datetime'].dt.month

In [None]:
train['date'] = train['datetime'].dt.date

In [None]:
train['hour'] = train['datetime'].dt.hour

In [None]:
train['day of the week'] = train['datetime'].dt.dayofweek

#### Creating Year, Date, Month, Hour and Day of the week columns for the test dataset

In [None]:
test['year'] = test['datetime'].dt.year

In [None]:
test['month'] = test['datetime'].dt.month

In [None]:
test['date'] = test['datetime'].dt.date

In [None]:
test['hour'] = test['datetime'].dt.hour

In [None]:
test['day of the week'] = test['datetime'].dt.dayofweek
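The same five columns are created for both datasets, so the steps could be wrapped in one function and applied to each frame. The helper below, `add_datetime_features`, is a hypothetical name introduced for this sketch and is not part of the notebook.

```python
import pandas as pd

def add_datetime_features(df: pd.DataFrame, column: str = "datetime") -> pd.DataFrame:
    """Return a copy of df with year, month, date, hour and day-of-week columns."""
    out = df.copy()
    stamps = pd.to_datetime(out[column])
    out["year"] = stamps.dt.year
    out["month"] = stamps.dt.month
    out["date"] = stamps.dt.date
    out["hour"] = stamps.dt.hour
    out["day of the week"] = stamps.dt.dayofweek  # Monday=0 ... Sunday=6
    return out

# Two illustrative timestamps: a Saturday in 2011 and a Monday in 2012.
toy = pd.DataFrame({"datetime": ["2011-01-01 05:00:00", "2012-07-16 18:00:00"]})
features = add_datetime_features(toy)
print(features[["year", "month", "hour", "day of the week"]])
```

Applying one function to both `train` and `test` keeps the two feature sets in step and avoids copy-paste slips between the cells.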

In [None]:
#converting day of the weeks to name of the day
dmap = {0:'Mon', 1:'Tue', 2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
train['day of the week'] = train['day of the week'].map(dmap)
test['day of the week'] = test['day of the week'].map(dmap)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
# dropping the datetime column
train.drop('datetime',axis=1,inplace=True)
test.drop('datetime',axis=1,inplace=True)

In [None]:
train.info()

In [None]:
ymap = {2011:'0',2012:'1'}
train['year'] = train['year'].map(ymap)
test['year'] = test['year'].map(ymap)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.info(), test.info()

### Demand per day of the week

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=train, x = 'day of the week', y = 'count', palette='rainbow');

We can infer that the demand for bike rentals was almost the same on each day of the week. This feature is therefore unlikely to help in predicting the demand, so we will drop it.

### Demand per month

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(data=train, x='month', y = 'count', palette = 'rainbow');

We can see that demand for bike rentals was high during the summer months and dropped during the winter months.

### Demand per hour

In [None]:
plt.figure(figsize=(12,6))
sns.barplot(data = train, x = 'hour', y = 'count', palette = 'rainbow');

From the above graph we can see that the demand for bike rentals was high during office hours, i.e., from 7 am to 6 pm, and low during non-working hours. A likely explanation is that most people use the bike rental service to commute to and from their offices during these hours.

### Demand per season

In [None]:
plt.figure(figsize = (8,4))
sns.barplot(data = train, x = 'season', y = 'count', palette = 'rainbow');

We can infer that demand was high during the summer and fall seasons and dropped during the winter and spring seasons, possibly because the weather in those seasons is less suitable for cycling.

### Demand as per holidays

In [None]:
plt.figure(figsize=(6,6))
sns.barplot(data = train, x = 'holiday', y = 'count', palette = 'rainbow');

We can clearly see that demand was high during working days and low during holidays.

### Demand on working days

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(data = train, x = 'workingday', y = 'count');

We can infer that whether or not it is a working day does not affect the demand much, as it remains almost the same in both cases. We will therefore drop this column before fitting our linear regression model.

### Demand according to the weather

In [None]:
plt.figure(figsize=(8,4))
sns.barplot(data = train, x = 'weather', y = 'count', palette = 'rainbow');

From the above graph we can see that demand was high when the weather was clear or misty and dropped when there was rain or snow.

In [None]:
plt.figure(figsize=(6,4))
sns.barplot(data = train, x = 'year', y = 'count');

Since we only have two years of data, we cannot infer much from this feature, and hence we will drop this column.

### Temperature vs Demand

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data = train, x = 'temp', y = 'count');

We can say that as the temperature increases the demand also increases.

### aTemp vs Demand

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data = train, x = 'atemp', y = 'count');

We can see that this plot is almost identical to the previous plot of temperature and demand, which reflects the high correlation between the 'temp' and 'atemp' features. This is quite understandable, since 'atemp' is the "feels like" temperature.

### Humidity vs Demand

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data = train, x = 'humidity', y='count');

We can infer that humidity is not strongly correlated with the demand.

### Windspeed vs Demand

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(data = train, x = 'windspeed', y='count');

Since windspeed has a very low correlation with the demand, it does not affect the demand much.

_______________

In [None]:
train.head()

## Model Building

- We will drop the 'date' variable, as we have already extracted features from it, along with the 'day of the week' and 'year' variables, which we decided to drop above.
- We will also drop the 'atemp' variable, as we saw that it is highly correlated with the 'temp' variable.
- We will also drop the 'workingday' and 'windspeed' variables, as they do not affect the demand much.

In [None]:
train.drop(['atemp','date','day of the week','year','windspeed','workingday'], axis=1, inplace=True)

In [None]:
test.drop(['atemp','date','day of the week','year','windspeed','workingday'], axis=1, inplace=True)

In [None]:
train.head()

In [None]:
train.tail()

In [None]:
# creating dummy variables of the train dataset
season = pd.get_dummies(train['season'],prefix='season',drop_first=True)
weather = pd.get_dummies(train['weather'],prefix='weather',drop_first=True)
holiday = pd.get_dummies(train['holiday'],prefix='holiday',drop_first=True)
month = pd.get_dummies(train['month'],prefix='month',drop_first=True)
hour = pd.get_dummies(train['hour'],prefix='hour',drop_first=True)
train = pd.concat([train,season,weather,holiday,month,hour],axis=1)
train.drop(['season','weather','holiday','month','hour'], axis=1,inplace=True)
train.head()

In [None]:
# creating dummy variables of the test dataset
season = pd.get_dummies(test['season'],prefix='season',drop_first=True)
weather = pd.get_dummies(test['weather'],prefix='weather',drop_first=True)
holiday = pd.get_dummies(test['holiday'],prefix='holiday',drop_first=True)
month = pd.get_dummies(test['month'],prefix='month',drop_first=True)
hour = pd.get_dummies(test['hour'],prefix='hour',drop_first=True)
test = pd.concat([test,season,weather,holiday,month,hour],axis=1)
test.drop(['season','weather','holiday','month','hour'], axis=1,inplace=True)
test.head()
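Encoding the two datasets separately only yields matching columns if every category level appears in both. If a level is missing from one of them, the model would see a different set of features at prediction time. A minimal sketch of one way to guard against this, on toy frames where a 'weather' level is deliberately missing from the training-like data, is to reindex the encoded test frame to the training columns:

```python
import pandas as pd

# Toy frames: 'weather' level 4 appears only in the test-like frame (illustrative).
toy_train = pd.DataFrame({"temp": [10.0, 12.0, 15.0], "weather": [1, 2, 3]})
toy_test = pd.DataFrame({"temp": [11.0, 14.0], "weather": [1, 4]})

train_enc = pd.get_dummies(toy_train, columns=["weather"], prefix="weather", drop_first=True)
test_enc = pd.get_dummies(toy_test, columns=["weather"], prefix="weather", drop_first=True)

# Reindex to the training columns: dummies missing from the test frame are
# filled with 0, and dummies unseen during training are dropped.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

In the notebook, the equivalent step would reindex `test` to the columns of the feature matrix used for fitting.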

In [None]:
X = train.drop('count',axis=1)
y = np.log(train['count'])

In [None]:
X.head()

### Splitting our data

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train,X_val,y_train,y_val = train_test_split(X,y, test_size = 0.3,random_state=101)

### Training the Linear Regression Model

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
lm = LinearRegression()

In [None]:
#fitting the training data
lm.fit(X_train,y_train)

In [None]:
#printing the intercept
print(lm.intercept_)

In [None]:
#printing the coefficient
lm.coef_

In [None]:
cdf = pd.DataFrame(lm.coef_, X.columns, columns = ['Coefficients'])

In [None]:
cdf

We now have a trained linear regression model. We will make predictions on the X_val set and check the performance of our model. Since the evaluation metric for this problem is RMSLE, we will compute the RMSLE score along with the other error metrics.
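For reference, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + actual), which is also what scikit-learn's `mean_squared_log_error` is based on. A small NumPy sketch of the metric follows; `rmsle` is a helper name introduced here, and the example arrays are made up.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) = log(1 + x), which keeps zero counts well defined.
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

actual = np.array([16, 40, 32, 13])     # illustrative counts
predicted = np.array([20, 35, 30, 10])  # illustrative predictions
print(rmsle(actual, predicted))
```

Because the metric works on log values, it penalises relative errors rather than absolute ones, which suits a target that ranges from a handful of rentals to several hundred.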

In [None]:
# Predictions
predictions = abs(lm.predict(X_val))

In [None]:
predictions

### Evaluating the regression

In [None]:
from sklearn import metrics

In [None]:
# Mean Absolute Error
metrics.mean_absolute_error(y_val, predictions)

In [None]:
# Mean Squared error
metrics.mean_squared_error(y_val,predictions)

In [None]:
# Root Mean Squared Log Error (computed on the log-transformed targets)
np.sqrt(metrics.mean_squared_log_error(y_val,predictions))

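We can see that our RMSLE score is low, which indicates that the model's predictions are close to the actual values in the validation set. Keep in mind that both the targets and the predictions are already log-transformed here, so this score is not the same as an RMSLE computed on the raw counts.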

Let's now make predictions for the test dataset which was our main goal.

In [None]:
test_prediction = abs(lm.predict(test))

In [None]:
final_prediction = np.exp(test_prediction)

In [None]:
final_prediction = np.round(final_prediction)

In [None]:
final_prediction
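Because the model was fitted on `np.log(count)`, its outputs are on the log scale and `np.exp` maps them back to counts. Taking `abs` of a negative log-scale prediction flips its sign, which turns a predicted count below 1 into one above 1. One alternative, shown here as a sketch and not as what the notebook does, is to clip negative log predictions at 0 before exponentiating, so that the smallest possible count is 1. The `log_preds` values are made up.

```python
import numpy as np

log_preds = np.array([-0.7, 0.0, 2.3, 5.1])  # illustrative log-scale outputs

# abs() mirrors the negative value: exp(0.7) rounds to 2.
via_abs = np.round(np.exp(np.abs(log_preds)))

# Clipping at 0 maps the negative value to exp(0) = 1 instead.
via_clip = np.round(np.exp(np.clip(log_preds, 0, None)))

print(via_abs)
print(via_clip)
```

The two approaches only differ where the log-scale prediction is negative.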

________________________

## Decision Tree

In [None]:
from sklearn.tree import DecisionTreeRegressor

In [None]:
dt_reg = DecisionTreeRegressor(max_depth=5)
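The depth of 5 is a fixed choice here. A hedged sketch of how a few depths could be compared on a hold-out set is given below. It uses synthetic log-demand data with an hourly pattern, so the scores it prints are not results for the bike data, and the candidate depths (2, 5, 8) are arbitrary.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic log-demand with an hourly pattern plus noise (illustrative only).
rng = np.random.default_rng(42)
hours = rng.integers(0, 24, size=2000)
temp = rng.uniform(0, 40, size=2000)
y = (3.0 + 1.5 * np.sin((hours - 6) / 24 * 2 * np.pi)
     + 0.03 * temp + rng.normal(0, 0.3, size=2000))
X = np.column_stack([hours, temp])

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3, random_state=101)

# Fit one tree per candidate depth and record the validation RMSE.
val_rmse = {}
for depth in (2, 5, 8):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    residuals = y_va - tree.predict(X_va)
    val_rmse[depth] = float(np.sqrt(np.mean(residuals ** 2)))

print(val_rmse)
```

A very shallow tree tends to underfit, while a very deep one can start fitting noise, so the validation error is the number to watch when picking the depth.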

Fitting the tree model

In [None]:
dt_reg.fit(X_train,y_train)

Predicting the validation set

In [None]:
predict = dt_reg.predict(X_val)

In [None]:
predict

In [None]:
# calculating rmsle of the predicted values
np.sqrt(metrics.mean_squared_log_error(y_val,predict))

The RMSLE value decreased to 0.226. This is a decent score. Let's make predictions for the test dataset.

In [None]:
test_predict = dt_reg.predict(test)

In [None]:
final_predict = np.exp(test_predict)

In [None]:
final_predict = np.round(final_predict)

In [None]:
final_predict

In [None]:
submission = pd.DataFrame()

In [None]:
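# re-read the raw test file, since the 'datetime' column was dropped from test above
# (named test_raw so it does not shadow the imported datetime.date)
test_raw = pd.read_csv('../input/bike-sharing-demand/test.csv')

In [None]:
submission['datetime'] = test_raw['datetime']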

In [None]:
submission['count'] = final_predict

In [None]:
submission.to_csv('sample submission.csv',header=True,index=False)
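Before uploading, the file can be given a few quick sanity checks: the expected two columns, one row per test timestamp, no missing values and no negative counts. The sketch below runs these checks on a tiny made-up submission written to an in-memory buffer rather than to disk; the timestamps and counts are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Tiny made-up submission (illustrative timestamps and counts).
timestamps = pd.Series([
    "2011-01-20 00:00:00",
    "2011-01-20 01:00:00",
    "2011-01-20 02:00:00",
])
preds = np.array([11.0, 5.0, 3.0])
submission = pd.DataFrame({"datetime": timestamps, "count": preds.astype(int)})

# Write to an in-memory buffer and read it back, as a stand-in for the CSV file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))

checks_pass = (
    list(round_trip.columns) == ["datetime", "count"]
    and len(round_trip) == len(timestamps)
    and bool((round_trip["count"] >= 0).all())
    and not round_trip.isnull().values.any()
)
print(checks_pass)
```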