<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Reading-and-Understanding-the-Data" data-toc-modified-id="Reading-and-Understanding-the-Data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Reading and Understanding the Data</a></span></li><li><span><a href="#Data-Preparation" data-toc-modified-id="Data-Preparation-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Preparation</a></span></li><li><span><a href="#Visualising-the-Data" data-toc-modified-id="Visualising-the-Data-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Visualising the Data</a></span><ul class="toc-item"><li><span><a href="#Visualising-Numeric-Variables" data-toc-modified-id="Visualising-Numeric-Variables-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Visualising Numeric Variables</a></span></li><li><span><a href="#Visualising-Categorical-Variables" data-toc-modified-id="Visualising-Categorical-Variables-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Visualising Categorical Variables</a></span></li></ul></li><li><span><a href="#Dummy-Variables" data-toc-modified-id="Dummy-Variables-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Dummy Variables</a></span><ul class="toc-item"><li><span><a href="#Season" data-toc-modified-id="Season-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Season</a></span></li><li><span><a href="#Weather" data-toc-modified-id="Weather-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Weather</a></span></li><li><span><a href="#Weekday" data-toc-modified-id="Weekday-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Weekday</a></span></li><li><span><a href="#Month" data-toc-modified-id="Month-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Month</a></span></li></ul></li><li><span><a href="#Splitting-the-Data-into-Training-and-Testing-Sets" data-toc-modified-id="Splitting-the-Data-into-Training-and-Testing-Sets-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Splitting the Data into Training and Testing Sets</a></span><ul 
class="toc-item"><li><span><a href="#Dividing-into-X-and-Y-sets-for-the-model-building" data-toc-modified-id="Dividing-into-X-and-Y-sets-for-the-model-building-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Dividing into X and Y sets for the model building</a></span></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Train Test Split</a></span></li><li><span><a href="#Rescaling-the-Features" data-toc-modified-id="Rescaling-the-Features-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Rescaling the Features</a></span></li></ul></li><li><span><a href="#RFE" data-toc-modified-id="RFE-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>RFE</a></span></li><li><span><a href="#Building-a-linear-model" data-toc-modified-id="Building-a-linear-model-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Building a linear model</a></span><ul class="toc-item"><li><span><a href="#Build-model-with-features-selected-from-FRE" data-toc-modified-id="Build-model-with-features-selected-from-FRE-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Build model with features selected from FRE</a></span></li><li><span><a href="#Dropping-the-variable-and-Updating-the-model" data-toc-modified-id="Dropping-the-variable-and-Updating-the-model-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Dropping the variable and Updating the model</a></span></li></ul></li><li><span><a href="#Residual-Analysis-of-the-train-data" data-toc-modified-id="Residual-Analysis-of-the-train-data-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Residual Analysis of the train data</a></span></li><li><span><a href="#Making-Predictions-Using-the-Final-Model" data-toc-modified-id="Making-Predictions-Using-the-Final-Model-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Making Predictions Using the Final Model</a></span><ul class="toc-item"><li><span><a href="#Applying-the-scaling-on-the-test-sets" 
data-toc-modified-id="Applying-the-scaling-on-the-test-sets-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Applying the scaling on the test sets</a></span></li><li><span><a href="#Make-prediction" data-toc-modified-id="Make-prediction-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Make prediction</a></span></li></ul></li><li><span><a href="#Model-Evaluation" data-toc-modified-id="Model-Evaluation-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Model Evaluation</a></span></li></ul></div>

# Bike Sharing

In [None]:
# suppress warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import calendar
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

In [None]:
# Display all columns
pd.set_option('display.max_columns',200)

## Reading and Understanding the Data

In [None]:
# Read the csv file using 'read_csv'
day = pd.read_csv("../input/boombikes-bike-sharing/day.csv")

Inspect the various aspects of the dataframe

In [None]:
# Check the head of the dataset
day.head()

In [None]:
# Check the number of rows and columns in the dataframe
day.shape

In [None]:
# Types of all columns
day.info(verbose=True)

In [None]:
# Check the summary for the numeric columns 
day.describe()

In [None]:
# Count the number of null values in each column
day.isnull().sum()

No column contains null values.

## Data Preparation

In [None]:
# Function to convert number to month
def month_mapping(month_number):
    return calendar.month_abbr[month_number]

In [None]:
# Function to convert number to weekday
def weekday_mapping(weekday_number):
    return calendar.day_abbr[weekday_number]
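
The two helpers above depend on how the standard `calendar` module indexes its abbreviations. The sketch below only checks that convention in the default English locale. Whether the dataset's `weekday` codes also start at Monday is an assumption that should be verified against the original dates before relying on the labels.

```python
import calendar

# month_abbr is 1-based: index 0 is an empty string, 1..12 are Jan..Dec
first_month = calendar.month_abbr[1]
placeholder = calendar.month_abbr[0]

# day_abbr is 0-based and starts on Monday: 0 = Mon, ..., 6 = Sun
weekday_labels = [calendar.day_abbr[i] for i in range(7)]

print(first_month, weekday_labels)
```

If the dataset encodes Sunday as 0 instead, an explicit dictionary passed to `map` would be a safer choice than `calendar.day_abbr`.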

In [None]:
# Convert numeric values into categorical string values

day['season'] = day['season'].map({1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'})
day['weathersit'] = day['weathersit'].map({1: 'Clear', 2: 'Cloudy', 3: 'Light Rain', 4: 'Heavy Rain'})
day['holiday'] = day['holiday'].map({1: 'Yes', 0: 'No'})
day['workingday'] = day['workingday'].map({1: 'Yes', 0: 'No'})
day['yr'] = day['yr'].map({0: '2018', 1:'2019'})
day['mnth'] = day['mnth'].apply(month_mapping)
day['weekday'] = day['weekday'].apply(weekday_mapping)

# Drop unnecessary columns: the row index and the raw date
# (year, month and weekday are already available as separate columns)
day.drop(['instant', 'dteday'], axis = 1, inplace = True)

In [None]:
# Check the head of the dataset
day.head()

In [None]:
# Types of all columns
day.info(verbose=True)

## Visualising the Data

### Visualising Numeric Variables

In [None]:
# describe gives all numerical cols summary
day.describe()

In [None]:
# Show all numerical columns
day.describe().columns

In [None]:
day[['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']].info()

In [None]:
# Let's make a pairplot of all the numeric variables
sns.pairplot(day)
plt.show()

In [None]:
# Correlation between numeric variables
cor = day[['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered', 'cnt']].corr()
cor

In [None]:
# Heatmap of the lower triangle (the upper triangle mirrors it, so it is masked)
mask = np.triu(np.ones_like(cor, dtype=bool), k=1)
sns.heatmap(cor, mask=mask, vmax=.8, square=True, annot=True)
plt.show()

Insights:
- `atemp` and `temp` are very highly correlated, so using both as predictors would introduce multicollinearity
- `registered` and `cnt` are very highly correlated
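
Reading strongly correlated pairs off a heatmap gets harder as the number of columns grows. The sketch below lists them programmatically. The helper name `high_corr_pairs`, the 0.9 threshold, and the synthetic columns are all illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd

def high_corr_pairs(frame, threshold=0.9):
    """Return column pairs whose absolute correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Synthetic data: 'a_noisy' is 'a' plus a little noise, 'b' is independent
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_noisy': a + rng.normal(scale=0.05, size=200),
    'b': rng.normal(size=200),
})
result = high_corr_pairs(toy)
print(result)
```

On the notebook's data the same helper could be called on the numeric columns used for `cor`.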

In [None]:
# Distribution of temp, atemp, hum, windspeed, casual and registered
# (sns.distplot is deprecated; histplot with kde=True gives a comparable histogram + KDE)
numeric_cols = ['temp', 'atemp', 'hum', 'windspeed', 'casual', 'registered']
plt.figure(figsize=(20, 12))
for i, col in enumerate(numeric_cols, start=1):
    plt.subplot(2, 3, i)
    sns.histplot(day[col], kde=True, stat='density')
plt.show()

In [None]:
# Distribution of the target variable
sns.histplot(day['cnt'], kde=True, stat='density')
plt.show()

Insights:
- Most days had a temperature between 10 and 30 degrees Celsius and a feels-like temperature between 10 and 35 degrees Celsius. Very few days had a temperature or feels-like temperature below 2 or above 40 degrees Celsius.
- Most days had humidity between 50 and 80. Very few days had humidity below 20 or close to 100.
- Most days had windspeed between 8 and 17. Very few days had windspeed greater than 30.
- On most days, casual customers rented around 1,000 bikes or fewer. Very few days had more than 3,000 rentals by casual customers.
- Most rentals came from registered users, who rented around 2,000 to 5,000 bikes on most days.
- Taking registered and casual customers together, most days had around 2,000 to 7,000 rentals.
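
Statements such as "most days fall between two values" can be backed with quantiles instead of being read off the plots. The sketch below computes a central range on synthetic data. The helper `central_range` and the 80% coverage are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

def central_range(series, coverage=0.8):
    """Return the interval that holds the central `coverage` share of the values."""
    tail = (1 - coverage) / 2
    return series.quantile(tail), series.quantile(1 - tail)

# Synthetic stand-in for a numeric column such as temp
rng = np.random.default_rng(1)
toy_temp = pd.Series(rng.uniform(2, 35, size=500), name='temp')

low, high = central_range(toy_temp)
share_inside = toy_temp.between(low, high).mean()
print(round(low, 1), round(high, 1), share_inside)
```

Applied to a real column, for example `central_range(day['temp'])`, this gives a reproducible version of the ranges quoted above.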

### Visualising Categorical Variables

In [None]:
# Types of all columns
day.info()

In [None]:
# Find all the categorical variables in the dataset
df_categorical = day.select_dtypes(exclude=['float64','int64'])
df_categorical.columns

In [None]:
# make a boxplot for categorical variables.
plt.figure(figsize=(20, 20))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'cnt', data = day)
plt.subplot(3,3,2)
sns.boxplot(x = 'weathersit', y = 'cnt', data = day)
plt.subplot(3,3,3)
sns.boxplot(x = 'mnth', y = 'cnt', data = day)
plt.subplot(3,3,4)
sns.boxplot(x = 'workingday', y = 'cnt', data = day)
plt.subplot(3,3,5)
sns.boxplot(x = 'holiday', y = 'cnt', data = day)
plt.subplot(3,3,6)
sns.boxplot(x = 'weekday', y = 'cnt', data = day)
plt.subplot(3,3,7)
sns.boxplot(x = 'yr', y = 'cnt', data = day)
plt.show()

In [None]:
# Visualise pairs of categorical features together, using hue for the second feature
plt.figure(figsize=(20, 20))
plt.subplot(3,3,1)
sns.boxplot(x = 'season', y = 'cnt', hue = 'weathersit', data = day)
plt.subplot(3,3,2)
sns.boxplot(x = 'season', y = 'cnt', hue = 'yr', data = day)
plt.subplot(3,3,3)
sns.boxplot(x = 'season', y = 'cnt', hue = 'workingday', data = day)
plt.subplot(3,3,4)
sns.boxplot(x = 'season', y = 'cnt', hue = 'holiday', data = day)
plt.subplot(3,3,5)
sns.boxplot(x = 'weathersit', y = 'cnt', hue = 'yr', data = day)
plt.subplot(3,3,6)
sns.boxplot(x = 'weekday', y = 'cnt', hue = 'season', data = day)
plt.subplot(3,3,7)
sns.boxplot(x = 'weekday', y = 'cnt', hue = 'weathersit', data = day)
plt.subplot(3,3,8)
sns.boxplot(x = 'weekday', y = 'cnt', hue = 'yr', data = day)
plt.show()

Insights:
- Rentals were highest in fall and in clear weather
- 2019 had more rentals than 2018
- The number of rentals looked similar on working days and non-working days
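
The boxplot comparisons above can also be summarised numerically with a grouped aggregate. The sketch below uses a tiny made-up frame, so the figures are not from the dataset. Only the column names `season` and `cnt` mirror the notebook.

```python
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    'season': ['spring', 'spring', 'fall', 'fall', 'winter', 'winter'],
    'cnt': [1200, 1800, 5200, 6100, 3900, 4300],
})

# Median, mean and number of days per category, highest median first
summary = (
    toy.groupby('season')['cnt']
       .agg(['median', 'mean', 'count'])
       .sort_values('median', ascending=False)
)
print(summary)
```

The same pattern applies to `weathersit`, `yr`, or `workingday` on the real dataframe.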

## Dummy Variables

In [None]:
# Check the head of the dataset
day.head()

In [None]:
# Types of all columns
day.info(verbose=True)

In [None]:
# Convert yes/no into numeric values
day['holiday'] = day['holiday'].map({'Yes': 1, 'No': 0})
day['workingday'] = day['workingday'].map({'Yes': 1, 'No': 0})
day['yr'] = day['yr'].map({'2018': 0, '2019': 1})

In [None]:
# Function to create dummy variables
def dummy_variables(df, feature):

    # Create the dummy columns for the feature; drop_first=True drops the first level,
    # which becomes the baseline. dtype=int keeps the dummies numeric (0/1) rather than boolean.
    dummy = pd.get_dummies(df[feature], drop_first=True, dtype=int)

    # Add prefix to column names
    prefix = feature + '_'
    dummy = dummy.add_prefix(prefix)

    # Add the dummy columns to the original dataframe
    df = pd.concat([df, dummy], axis=1)

    # Drop feature as we have created the dummies for it
    df.drop([feature], axis=1, inplace=True)

    return df
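
To see what `drop_first = True` does, the sketch below encodes a toy `season` column with and without it. The toy values are an assumption. `pd.get_dummies` orders the levels alphabetically here, so `fall` is the level that is dropped and becomes the baseline.

```python
import pandas as pd

toy = pd.DataFrame({'season': ['spring', 'summer', 'fall', 'winter', 'fall']})

# One column per level: the columns always sum to 1 in every row
full = pd.get_dummies(toy['season'], prefix='season', dtype=int)

# drop_first=True removes the first level; a row of all zeros means 'fall'
reduced = pd.get_dummies(toy['season'], prefix='season', drop_first=True, dtype=int)

print(full.columns.tolist())
print(reduced.columns.tolist())
```

Dropping one level avoids perfect collinearity between the dummy columns and the intercept in the regression that follows.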

### Season

In [None]:
# Get the dummy variables
day = dummy_variables(day, 'season')

# Now let's see the head of our dataframe.
day.head()

### Weather

In [None]:
# Get the dummy variables
day = dummy_variables(day, 'weathersit')

# Now let's see the head of our dataframe.
day.head()

### Weekday

In [None]:
# Get the dummy variables
day = dummy_variables(day, 'weekday')

# Now let's see the head of our dataframe.
day.head()

### Month

In [None]:
# Get the dummy variables
day = dummy_variables(day, 'mnth')

# Now let's see the head of our dataframe.
day.head()

In [None]:
day.shape

## Splitting the Data into Training and Testing Sets

In [None]:
# Drop casual & registered: they add up to cnt, so keeping them would leak the target
day.drop(['casual', 'registered'], axis = 1, inplace = True)

# Drop atemp as it is highly correlated with temp
day.drop(['atemp'], axis = 1, inplace = True)

### Dividing into X and Y sets for the model building

In [None]:
# Dividing into X and Y sets for the model building
X = day.drop('cnt', axis=1)
y = day.cnt

In [None]:
# Check top values of X
X.head()

In [None]:
# Check top values of y
y.head()

### Train Test Split

In [None]:
# Split Train & Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

In [None]:
# List the numeric feature columns of X_train (these will be scaled)
num_feat = list(X_train.describe().columns)
num_feat

In [None]:
X_train.info()

### Rescaling the Features 

In [None]:
# Standardise the features: fit the scaler on the training set only, then apply it to the test set
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[num_feat] = sc.fit_transform(X_train[num_feat])
X_test[num_feat] = sc.transform(X_test[num_feat])
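
The scaler is fitted on the training set and only applied to the test set, so no information from the test set influences the transformation. A minimal sketch of that pattern on toy arrays (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

# Learn mean and standard deviation from the training data only
scaler = StandardScaler().fit(train)

train_scaled = scaler.transform(train)
# The test data reuses the training mean and standard deviation
test_scaled = scaler.transform(test)

print(scaler.mean_, scaler.scale_)
print(test_scaled.ravel())
```

The test value equal to the training mean maps to 0, while a value outside the training range maps well beyond the scaled training values.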

In [None]:
# Check X_train
X_train.info()

In [None]:
# Check top values of X_train
X_train.head()

## RFE

In [None]:
# RFE -> Recursive Feature Elimination
from sklearn.feature_selection import RFE
lm = LinearRegression()
# Keep the 20 best-ranked features; recent scikit-learn versions require the keyword form
rfe = RFE(lm, n_features_to_select=20)
rfe = rfe.fit(X_train, y_train)

In [None]:
# See all of the columns & their ranking
list(zip(X_train.columns, rfe.support_, rfe.ranking_))

In [None]:
# Columns selected by RFE (ranking 1)
X_train.columns[rfe.support_]

In [None]:
# Get X_train with those columns
X_train_rfe = X_train[X_train.columns[rfe.support_]]

In [None]:
# See top values of X_train_rfe
X_train_rfe.head()
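
RFE repeatedly fits the estimator and removes the feature with the weakest coefficient until the requested number remains. The sketch below shows how `support_` and `ranking_` behave, using synthetic regression data as an assumption in place of the bike-sharing features.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 6 features, of which 2 carry signal
X_toy, y_toy = make_regression(
    n_samples=200, n_features=6, n_informative=2, noise=1.0, random_state=0
)

selector = RFE(LinearRegression(), n_features_to_select=2).fit(X_toy, y_toy)

# support_ is True for kept features; ranking_ is 1 for kept features and
# increases in reverse order of elimination (the largest rank was dropped first)
print(selector.support_)
print(selector.ranking_)
```

In the notebook, `rfe.support_` plays the same role when selecting the columns of `X_train`.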

## Building a linear model

<!-- Fit a regression line through the training data using `statsmodels`. Remember that in `statsmodels`, you need to explicitly fit a constant using `sm.add_constant(X)` because if we don't perform this step, `statsmodels` fits a regression line passing through the origin, by default. -->

In [None]:
def build_model(X, y):
    """Fit an OLS model with an intercept and print its summary."""
    X = sm.add_constant(X)    # statsmodels does not add an intercept by default
    lm = sm.OLS(y, X).fit()   # fit the model
    print(lm.summary())       # model summary
    return X, lm

def checkVIF(X):
    """Return the variance inflation factor of every column in X, highest first."""
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by="VIF", ascending=False)
    return vif

### Build model with features selected from RFE

In [None]:
X_train_new,lm = build_model(X_train_rfe, y_train)

In [None]:
checkVIF(X_train_new)

### Dropping the variable and Updating the model

In [None]:
def feature_reduction(X_train, y_train, variables):

    # Drop highly correlated and insignificant variables
    # (columns= works across pandas versions, unlike a positional axis argument)
    X_train = X_train.drop(columns=variables)

    # Build model
    X_train_lm, lm = build_model(X_train, y_train)

    # Calculate the VIFs again for the new model
    print(checkVIF(X_train_lm))

    return X_train, lm, X_train_lm

In [None]:
# Dropping highly correlated variables and insignificant variables, build model and calculate VIF again
X_train, lm, X_train_lm = feature_reduction(X_train_rfe, y_train, ["workingday","weekday_Mon","weekday_Sun", "mnth_Sep"])

In [None]:
X_train.head()

In [None]:
X_train.columns

## Residual Analysis of the train data

To check whether the error terms are normally distributed (one of the major assumptions of linear regression), let us plot a histogram of the error terms and see what it looks like.

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
coeff = pd.DataFrame(regressor.coef_, X_train.columns, columns=['Coefficients'])
coeff

In [None]:
regressor.intercept_

In [None]:
y_pred = regressor.predict(X_train)

In [None]:
df = pd.DataFrame({
    'Actual':y_train,
    'Predicted': y_pred
})
df

In [None]:
r2_score(y_train, y_pred)

In [None]:
y_train_count = lm.predict(X_train_lm)

In [None]:
# Plot the histogram of the error terms
fig = plt.figure()
sns.histplot((y_train - y_train_count), bins=20, kde=True, stat='density')
fig.suptitle('Error Terms', fontsize=20)   # Plot heading
plt.xlabel('Errors', fontsize=18)          # X-label
plt.show()
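
A histogram gives a visual impression of normality. A probability plot gives a numeric companion. The sketch below uses `scipy.stats.probplot` on synthetic residuals, which are an assumption standing in for `y_train - y_train_count`. The returned correlation `r` is close to 1 when the sample lines up with a normal distribution.

```python
import numpy as np
from scipy import stats

# Synthetic residuals drawn from a normal distribution
rng = np.random.default_rng(0)
toy_residuals = rng.normal(loc=0.0, scale=1.0, size=500)

# probplot returns the quantile pairs and a least-squares fit through them
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    toy_residuals, dist="norm"
)
print(round(r, 4))
```

A clearly lower `r`, or residuals that bend away from the fitted line at the tails, would suggest the normality assumption deserves a closer look.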

## Making Predictions Using the Final Model

Now that we have fitted the model and checked the normality of the error terms, it's time to make predictions using the final model.

### Applying the scaling on the test sets

In [None]:
# X_test was already scaled in "Rescaling the Features" with the scaler fitted on the
# training data, so no refit or second transform is needed here.
# Only record the columns used by the final model.
num_vars = X_train.columns

### Make prediction

In [None]:
X_test = X_test[num_vars]

In [None]:
X_test.head()

In [None]:
# Adding constant variable to test dataframe
X_test_m = sm.add_constant(X_test)

In [None]:
# Making predictions using the model
y_pred_m = lm.predict(X_test_m)

## Model Evaluation

Let's now plot the graph for actual versus predicted values.

In [None]:
# Plotting y_test and y_pred to understand the spread
fig = plt.figure()
plt.scatter(y_test, y_pred_m)
fig.suptitle('y_test vs y_pred', fontsize=20)   # Plot heading
plt.xlabel('y_test', fontsize=18)               # X-label
plt.ylabel('y_pred', fontsize=16)               # Y-label
plt.show()

In [None]:
# R-squared on the test set (r2_score is imported at the top of the notebook)
r2_score(y_test, y_pred_m)
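
R-squared never decreases when features are added, so adjusted R-squared is a common companion metric that penalises the number of predictors. A minimal sketch follows. The helper `adjusted_r2` and the example numbers are assumptions, not results from this model.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Hypothetical values: R^2 of 0.80 on 101 samples with 10 features
example = adjusted_r2(0.80, n_samples=101, n_features=10)
print(round(example, 4))
```

With the notebook's objects, this could be called as `adjusted_r2(r2_score(y_test, y_pred_m), X_test.shape[0], X_test.shape[1])`.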