<h2 style = "color : Brown"> BIKE SHARING ASSIGNMENT </h2>

#### PROBLEM STATEMENT :

A US bike-sharing provider BoomBikes has recently suffered considerable dips in their revenues due to the ongoing Corona pandemic. The company is finding it very difficult to sustain in the current market scenario. So, it has decided to come up with a mindful business plan to be able to accelerate its revenue as soon as the ongoing lockdown comes to an end, and the economy restores to a healthy state.

Specifically, they want to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

1. Which variables are significant in predicting the demand for shared bikes.
2. How well those variables describe the bike demand.

Based on various meteorological surveys and people's styles, the service provider has gathered a large dataset on daily bike demand across the American market.

##### Goal:
Model the demand for shared bikes with the available independent variables.

1. It will be used by the management to understand how exactly the demands vary with different features.
2. They can accordingly manipulate the business strategy to meet the demand levels and meet the customer's expectations. 

> The model will be a good way for management to understand the demand dynamics of a new market.

### _STEP 1. Reading And Understanding The Data_

In [None]:
# Importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import sklearn
from sklearn import linear_model
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import RFE
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error


import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Displaying all columns
pd.set_option('display.max_columns',200)

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing the dataset

bike = pd.read_csv('../input/bikesharing/day.csv')
bike.head()

In [None]:
# Checking the number of rows and columns in the dataframe

bike.shape

In [None]:
# Checking the data types and the column-wise info in the data frame

bike.info()

In [None]:
# Checking the numeric columns in the dataframe

bike.describe()

In [None]:
# Listing of all the columns

list(bike.describe().columns)

**_Checking the Null values in the dataframe_**

In [None]:
bike.isnull().sum()

> There are no Null values in the dataset

In [None]:
# Converting to Date Time

bike['dteday'] = pd.to_datetime(bike['dteday'])
bike['dteday'].dtypes

In [None]:
# Mapping the Season column
def map_season(x):
    return x.map({1:'Spring',2:'Summer',3:'Fall',4:'Winter'})
bike[['season']] = bike[['season']].apply(map_season)

# Mapping the Month column
def map_month(x):
    return x.map({1:'Jan',2:'Feb',3:'Mar',4:'Apr',5:'May',6:'June',7:'July',8:'Aug',9:'Sep',10:'Oct',11:'Nov',12:'Dec'})
bike[['mnth']] = bike[['mnth']].apply(map_month)

# Mapping the Weathersit column
def map_weathersit(x):
    return x.map({1:'Clear',2:'Mist + Cloudy',3:'Light Snow',4:'Heavy Rain + Snow'})
bike[['weathersit']] = bike[['weathersit']].apply(map_weathersit)

# Mapping the Weekday column
def map_weekday(x):
    return x.map({0:'Sunday',1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday'})
bike[['weekday']] = bike[['weekday']].apply(map_weekday)

bike.head()

In [None]:
# Categorical Columns

bike_categorical = bike.select_dtypes(exclude=['float64', 'int64', 'datetime64'])
bike_categorical.columns

In [None]:
# Numerical Columns

bike_numerical = bike.select_dtypes(exclude=['object','datetime64'])
bike_numerical.columns

### _STEP 2. Data Visualisation_

### Numerical Variables


In [None]:
bike_numerical.columns

In [None]:
# Plotting pair plot for all of the numerical variables

sns.pairplot(data = bike, vars = bike_numerical)
plt.show()

### Analysing Numerical variables

`1. Temperature`

In [None]:
bike['temp'].describe()

> The mean value for the `Temperature / temp` is 20.3193, and the median is 20.4658 while the max temp is 35.328.

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(bike['temp'], kde=True)   # distplot is deprecated in recent seaborn releases

plt.xlabel('Temp')
plt.ylabel('Count')
plt.title('Distribution of Temp')

plt.show()

**Inferences:**
> There is a positive relationship between `Temperature` and the target variable. However, demand starts to drop once the temperature is in the 10-15 and 25-30 ranges.
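
A histogram of `temp` on its own shows how often each temperature occurs, not how demand changes with it. One way to check the relationship described above is to bin the temperature and compare the average count per bin. The sketch below uses a small made-up frame (`toy`) that only borrows the column names `temp` and `cnt`; the values and bin edges are illustrative assumptions, not figures from the dataset.

```python
import pandas as pd

# Toy stand-in for the bike data (illustrative values, not taken from day.csv)
toy = pd.DataFrame({
    "temp": [4.0, 8.0, 12.0, 18.0, 22.0, 27.0, 31.0, 34.0],
    "cnt":  [1000, 1400, 3000, 4500, 5200, 6000, 5600, 4800],
})

# Bin temperature into fixed-width intervals (edges chosen only for illustration)
bins = [0, 10, 20, 30, 40]
toy["temp_bin"] = pd.cut(toy["temp"], bins=bins)

# Average demand within each temperature bin
mean_cnt_by_bin = toy.groupby("temp_bin", observed=True)["cnt"].mean()
print(mean_cnt_by_bin)
```

Applied to the real data, the same pattern would be `bike.groupby(pd.cut(bike['temp'], bins))['cnt'].mean()`, with bin edges chosen to suit the observed temperature range.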

`2. Atemp/ Adjusted Temperature`

In [None]:
bike['atemp'].describe()

> For the `Adjusted Temperature/ Atemp`, the mean is 23.726, the median is 24.368, and the max is 42.044.

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(bike['atemp'], kde=True)   # distplot is deprecated in recent seaborn releases

plt.xlabel('ATemp')
plt.ylabel('Count')
plt.title('Distribution of Adjusted Temp')
plt.show()

**Inferences:**
> There is a direct relationship between `Adjusted temperature` and the target variable. However, once the adjusted temperature reaches 30, the count starts to drop.

`3. Humidity/Hum`

In [None]:
bike['hum'].describe()

> For the `humidity/ hum`, the mean is 62.765, median is at 62.625 and max humidity is at 97.25.

In [None]:
plt.figure(figsize=(12,8))
sns.histplot(bike['hum'], kde=True)   # distplot is deprecated in recent seaborn releases

plt.xlabel('Humidity')
plt.ylabel('Count')
plt.title('Distribution of Humidity')

plt.show()

**Inferences:**
> Demand is high within the `Humidity/ hum` range of 40-80.

`4. Wind Speed`

In [None]:
bike['windspeed'].describe()

> The mean for the variable `windspeed` is 12.7636, median is at 12.1253 and max is at 34.00


In [None]:
plt.figure(figsize=(12,8))
sns.histplot(bike['windspeed'], kde=True)   # distplot is deprecated in recent seaborn releases

plt.xlabel('Windspeed')
plt.ylabel('Count')
plt.title('Distribution of Windspeed')

plt.show()

**Inferences:**
> The target variable (cnt) is more concentrated within the `windspeed` range of 5-20.

### Checking the Outliers for the Numerical Column

In [None]:
num = ['temp', 'atemp', 'hum', 'windspeed', 'cnt']

plt.figure(figsize=(14,6))

for i in enumerate(num):
    plt.subplot(3,2, i[0]+1)
    sns.boxplot(x = i[1], data = bike, palette='mako')

**Inferences:**
> There are no outliers for cnt, temp and atemp. However, there are a few outliers in the humidity (hum) variable and many outliers in the windspeed column.
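
Box plots flag points that fall more than 1.5 times the interquartile range beyond the quartiles, so the visual check above can also be expressed as a count. The helper below, `count_iqr_outliers`, is a hypothetical name introduced for this sketch, and the `toy` values are made up for illustration.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Toy columns (illustrative values, not taken from day.csv)
toy = pd.DataFrame({
    "windspeed": [5.0, 8.0, 10.0, 12.0, 13.0, 15.0, 17.0, 34.0],
    "hum": [40.0, 50.0, 55.0, 60.0, 65.0, 70.0, 75.0, 80.0],
})

# Number of flagged points per column
outlier_counts = {col: count_iqr_outliers(toy[col]) for col in toy.columns}
print(outlier_counts)
```

On the real frame the same helper could be applied to each name in the `num` list to put numbers behind the box plots.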

### Categorical Variables

In [None]:
bike_categorical.columns

In [None]:
# Visualisation of the categorical variables

plt.figure(figsize=(20,20))
plt.subplot(3,3,1)
sns.countplot(x = 'yr', data = bike)
plt.subplot(3,3,2)
sns.countplot(x='mnth', data = bike)
plt.subplot(3,3,3)
sns.countplot(x='season', data = bike)
plt.subplot(3,3,4)
sns.countplot(x='weathersit', data = bike)
plt.subplot(3,3,5)
sns.countplot(x='workingday', data = bike)
plt.subplot(3,3,6)
sns.countplot(x='weekday', data = bike)
plt.subplot(3,3,7)
sns.countplot(x='holiday', data = bike)

plt.show()

### Analysing Categorical variables

`1. Year`

In [None]:
# Checking the value for the 'Year' column

bike['yr'].value_counts()

In [None]:
sns.boxplot(x='yr', y='cnt', data = bike, palette='viridis')   # recent seaborn requires keyword x/y
plt.xlabel('Year')
plt.ylabel('Count')
plt.title('Distribution of the Target Variable with the Year 2018 and 2019')


plt.show()

**Inferences:**
> The demand for the year 2019 is far greater than the demand for the year 2018. There are outliers for the year 2019.

`2. Month`

In [None]:
# Checking the value for the Month column

bike['mnth'].value_counts()

In [None]:
plt.figure(figsize=(12,6))
sns.boxplot(x='mnth', y='cnt', hue='yr', data = bike, palette='viridis')   # recent seaborn requires keyword x/y

plt.xlabel('Month')
plt.ylabel('Count')
plt.title('Distribution of the Target Variable with the Year and Month')


plt.show()

**Inferences:**
> As we can see, demand increases in the middle of the year, from Mar to Oct, and slowly decreases in Nov and Dec. We can also see that there are outliers in almost every month in the dataset.

`3. Season`

In [None]:
# Checking the value for the Season column

bike['season'].value_counts()

In [None]:
sns.boxplot(x='season', y='cnt', data = bike, palette='viridis')   # recent seaborn requires keyword x/y
plt.xlabel('Seasons')
plt.ylabel('Count')
plt.title('Distribution of the Target Variable with the Seasons')



plt.show()

**Inferences:**
> From the above graph, demand is highest during the Summer and Fall seasons, followed by Winter and Spring. There are outliers for the Spring and Winter seasons.

`4. Weekday`

In [None]:
# Checking the value for the weekday column

bike['weekday'].value_counts()

In [None]:
plt.figure(figsize=[10,9])
sns.boxplot(x='weekday', y='cnt', data = bike, palette='viridis')   # recent seaborn requires keyword x/y
plt.xlabel('Weekday')
plt.ylabel('Count')
plt.title('Distribution of the Target Variable with Weekday')


plt.show()

**Inferences:**
> The demand for bike rentals is almost equally distributed across the days of the week, though it is slightly higher on Thursday and Friday than on the other days.

`5. WorkingDay`

In [None]:
# Checking the value for the Working Day column

bike['workingday'].value_counts()

In [None]:
sns.boxplot(x='workingday', y='cnt', data = bike, palette='viridis')   # recent seaborn requires keyword x/y
plt.xlabel('Working Day')
plt.ylabel('Count')
plt.title('Distribution of Target Variable with Working Day')


plt.show()

**Inferences:**
> There is not much difference in bike rental demand between working days and non-working days.

`6. Weathersit`

In [None]:
# Checking the value for the Weathersit column

bike['weathersit'].value_counts()

In [None]:
sns.boxplot(x='weathersit', y='cnt', data = bike, palette='viridis')   # recent seaborn requires keyword x/y
plt.xlabel('Weathersit')
plt.ylabel('Count')
plt.title('Distribution of the Target Variable with the weathersit')


plt.show()

**Inferences:**
> In Clear weather we can see significantly higher demand, followed by Mist + Cloudy weather, and very low demand during Light Snow.

`7. Holiday`

In [None]:
# Checking the value for the Holiday column

bike['holiday'].value_counts()

In [None]:
sns.boxplot(x='holiday', y='cnt', data = bike, palette='viridis')   # recent seaborn requires keyword x/y
plt.xlabel('Holiday')
plt.ylabel('Count')
plt.title('Distribution of the Target Variable with Holiday')


plt.show()

### Dropping the unnecessary and redundant columns

In [None]:
# Drop the instant column as it is an index column

bike.drop(['instant'], axis=1, inplace=True)
bike.head()

### Checking the Correlations

In [None]:
# Checking the correlation between variables

plt.figure(figsize=(20,10))
corr = bike.corr(numeric_only=True)   # skip the text and date columns
mask = np.triu(np.ones_like(corr, dtype=bool), k=1)   # hide the upper triangle
sns.heatmap(corr, mask = mask, vmax = .8, square =  True, annot=True, cmap='YlGnBu')

sns.set_style('whitegrid')
sns.set_context('talk')
plt.title('Correlations Between Variables in the Dataset')
plt.show()

**Inferences:**

From the above heatmap, we can see that there is a high correlation between the Temp and Atemp columns. In our next step, we shall therefore drop one of these two columns before proceeding further with our analysis.

Also, there are positive correlations between our **`Target variable/ Cnt `** and the columns below:
* Registered
* Temp
* Atemp
* Casual
* Yr

The negative correlations with our **`Target variable/ Cnt `** can be seen with the columns:
* Holiday
* Workingday
* Hum
* Windspeed
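
The same lists can be read off programmatically instead of from the heatmap. The sketch below ranks the numeric columns by their correlation with the target on a made-up frame; `toy` and its values are assumptions for illustration only.

```python
import pandas as pd

# Toy frame (illustrative values, not taken from day.csv)
toy = pd.DataFrame({
    "temp":      [5.0, 10.0, 15.0, 20.0, 25.0, 30.0],
    "windspeed": [20.0, 18.0, 15.0, 12.0, 10.0, 6.0],
    "cnt":       [1000, 2000, 3100, 3900, 5200, 6000],
})

# Correlation of every numeric column with the target, most positive first
corr_with_target = (
    toy.corr(numeric_only=True)["cnt"]
    .drop("cnt")
    .sort_values(ascending=False)
)
print(corr_with_target)
```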

### Dropping the unnecessary and redundant columns

In [None]:
# Dropping the atemp column based on the high correlation of 0.99 with temp 

bike.drop(['atemp'], axis=1, inplace=True)

# Dropping the casual and registered column as the total is the value for our target variable

bike.drop(['casual'], axis=1, inplace=True)
bike.drop(['registered'], axis=1, inplace=True)

#Dropping the dteday column as the yr and mnth columns are available

bike.drop(['dteday'], axis=1, inplace=True)

bike.head()

In [None]:
# Checking the number of rows and columns

bike.shape

### _Step 3: Data Preparation_

### Dummy Variables

In [None]:
# Checking the data type for season, month, weekday and weathersit columns

bike[['season','mnth','weekday','weathersit']].dtypes

> It is important to ensure that the data types of these 4 variables are 'object' data types, before we create our dummy variable.
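
As a small illustration of what dummy encoding with `drop_first=True` produces, the sketch below encodes a made-up `season` series. A variable with k levels yields k-1 columns, and the dropped level becomes the baseline. The sample values are assumptions for illustration.

```python
import pandas as pd

# Toy categorical column with four levels
season = pd.Series(["Spring", "Summer", "Fall", "Winter", "Fall"], name="season")

# drop_first=True removes the first level in sorted order ("Fall" here),
# which then acts as the baseline: a row of all zeros means "Fall"
dummies = pd.get_dummies(season, drop_first=True, dtype=int)
print(dummies)
```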

In [None]:
# Creating the dummy variables for season, mnth, weekday and weathersit columns
# dtype=int keeps the dummies as 0/1 integers (recent pandas versions default to bool)

seasons = pd.get_dummies(bike['season'], drop_first=True, dtype=int)
months = pd.get_dummies(bike['mnth'], drop_first=True, dtype=int)
weekdays = pd.get_dummies(bike['weekday'], drop_first=True, dtype=int)
weathersit = pd.get_dummies(bike['weathersit'], drop_first=True, dtype=int)

In [None]:
# Concatenate the dummy variables to the original dataframe

bike = pd.concat([bike, seasons, months, weekdays, weathersit], axis=1)

bike.head()

In [None]:
# Dropping the original columns for which dummies were created

bike.drop(['season','mnth','weekday','weathersit'], axis=1, inplace=True)
bike.head()

In [None]:
# Checking the rows and columns

bike.shape

> There are **`730 rows and 29 columns`** in our data set now.

### _Step 4: Splitting the Data into Training and Test Sets_

In [None]:
# Splitting the dataframe

train, test = train_test_split(bike, train_size = 0.7, test_size = 0.3, random_state = 100)


### Rescaling the Features

2 ways of Rescaling:

1. **Min-Max Scaling (normalisation)** 
    > Data is compressed into the range 0 to 1
2. **Standardisation (mean-0, sigma-1)**


For our Modelling, we will choose `Min-Max Scaling.`
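
Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the data the scaler was fitted on. The NumPy sketch below spells this out with two hypothetical helpers, `min_max_fit` and `min_max_transform`, and made-up temperatures. It is meant to mirror the idea behind `MinMaxScaler`, not to reproduce its implementation.

```python
import numpy as np

def min_max_fit(values):
    """Learn the minimum and maximum from the training values."""
    return float(values.min()), float(values.max())

def min_max_transform(values, lo, hi):
    """Apply (x - min) / (max - min) using previously learned bounds."""
    return (values - lo) / (hi - lo)

# Toy temperatures (illustrative values)
train_temp = np.array([5.0, 10.0, 20.0, 30.0])
test_temp = np.array([2.0, 15.0, 35.0])

lo, hi = min_max_fit(train_temp)                         # fit on train only
train_scaled = min_max_transform(train_temp, lo, hi)
test_scaled = min_max_transform(test_temp, lo, hi)       # reuse the train bounds

print(train_scaled)
print(test_scaled)
```

Because the bounds come from the training data, scaled test values can fall slightly below 0 or above 1 when the test set contains values outside the training range.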

In [None]:
# First we instantiate a scaler object
# Min-Max scaling maps the data to the range 0 to 1

scaler = MinMaxScaler()



In [None]:
# We will select our numerical variables as follows

num = ['temp','hum','windspeed','cnt']

In [None]:
# Applying Scaler() to all of the numerical columns

train[num] = scaler.fit_transform(train[num])


In [None]:
# Checking the variables

train.head()

In [None]:
# Checking whether the max values are equal to 1

train.describe()

In [None]:
# Plotting heatmap to check on the correlation coefficients

plt.figure(figsize=(30,24))
mask = np.array(train.corr())
mask[np.tril_indices_from(mask)] = False
sns.heatmap(train.corr(), mask = mask, vmax = .8, square =  True, annot=True, cmap='YlGnBu', fmt=".2f")

sns.set_style('whitegrid')
sns.set_context('talk')
plt.title('Correlations Between Variables on the Training Set')
plt.show()

**Inferences:**
The Correlations for the Train data set are as follows:

`Top 5 Positive Correlations:`
1. Temp - cnt : 0.64
2. Yr - cnt : 0.59
3. Jan - Spring : 0.55
4. Oct - Winter : 0.55
5. Nov - Winter : 0.53

> A positive correlation indicates that when one variable increases, the other variable tends to increase as well (they move in the same direction).

`Top 5 Negative Correlations:`
1. Sunday - workingday : -0.63
2. Saturday - workingday : -0.61
3. Spring - temp : -0.61
4. Spring - cnt : -0.55
5. Jan - temp : -0.45

> A negative correlation indicates an inverse relationship between the two variables: when one variable increases, the other tends to decrease.
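
Rankings like the ones above can be pulled out of a correlation matrix directly rather than read off a large heatmap. The sketch below keeps the strict upper triangle so that each pair appears once, then sorts the pairs. The frame `toy` and its columns are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (illustrative values)
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.1, 5.9, 8.2, 9.9],   # rises with a
    "c": [5.0, 3.9, 3.1, 2.0, 1.1],   # falls as a rises
})

corr = toy.corr()

# Keep only the strict upper triangle so each pair is listed once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().dropna().sort_values(ascending=False)

print(pairs.head(5))   # most positive pairs
print(pairs.tail(5))   # most negative pairs
```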

### Dividing Data set into X and Y

In [None]:
# Separating the target variable (y) from the predictors (x) for the training set

y_train = train.cnt
x_train = train.drop('cnt', axis = 1)

### _Step 5: Building the Linear Model_

### Recursive Feature Elimination (RFE)

For our model building, we decided to go forward with the mixed approach: we first apply **Recursive Feature Elimination / RFE** to select 15 variables from our data. 

Then we manually remove variables one by one based on their p-values and VIF until we find our best model.

In [None]:
# Applying LM

lm = LinearRegression()
lm.fit(x_train, y_train)

In [None]:
# Rows and columns for the train data

x_train.shape

In [None]:
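# Selecting a total of 15 variables with RFE
# (n_features_to_select must be passed as a keyword in recent scikit-learn versions)

rfe = RFE(lm, n_features_to_select=15)
rfe = rfe.fit(x_train, y_train)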

In [None]:
# The list of all of the variables selected and their ranking

list(zip(x_train.columns, rfe.support_, rfe.ranking_))


> **RFE Support** is True for the variables that are selected, and **RFE Ranking** ranks the variables by importance (selected variables have rank 1).
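
To make `support_` and `ranking_` concrete, the sketch below runs RFE on synthetic data where only two of five features drive the target. The data, the column names `x0` to `x4` and the random seed are assumptions made for illustration and are unrelated to the bike dataset.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=[f"x{i}" for i in range(5)])

# Only x0 and x1 drive the target; x2..x4 are pure noise (synthetic assumption)
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.1, size=200)

# Repeatedly fit the model and drop the weakest feature until 2 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=2)
selector.fit(X, y)

selected = list(X.columns[selector.support_])
ranking = dict(zip(X.columns, selector.ranking_))
print(selected)
print(ranking)
```

Features kept by RFE all share rank 1; higher ranks indicate features that were eliminated earlier.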

In [None]:
# Determining the columns with RFE selected variables

col = x_train.columns[rfe.support_]
col

In [None]:
# The total number of selected variables

rfe.support_.sum()

In [None]:
# Determining the columns with RFE NOT selected variables

x_train.columns[~rfe.support_]

In [None]:
# The total number of NON selected variables

(~rfe.support_).sum()

### Building our Model using statsmodels

### **`MODEL 1`**

In [None]:
# Creating x_train_rfe with RFE selected variables

x_train_rfe = x_train[col]

In [None]:
# Add a constant variable

x_train_rfe = sm.add_constant(x_train_rfe)

# Running the Linear Model

lm = sm.OLS(y_train, x_train_rfe).fit()

In [None]:
# Summary of the 1st Linear Model

print(lm.summary())

>> Our R-Squared for this model is 84.5%. A couple of variables have fairly high p-values; we will decide what to do with them after looking at the VIF of the variables.

#### Checking VIF

* VIF (Variance Inflation Factor) measures how strongly one independent variable is explained by the other independent variables. A good VIF for a variable is <= 5. The closer the R-Squared of that regression is to 1, the higher the VIF and the stronger the multicollinearity.
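
The link between R-squared and VIF is `VIF_i = 1 / (1 - R_i^2)`, where `R_i^2` comes from regressing feature `i` on all the other features. The NumPy sketch below computes this by hand on synthetic data. The function name `vif`, the added intercept in the auxiliary regression and the data are assumptions for illustration, so the numbers need not match `variance_inflation_factor` when that function is called on a matrix without a constant column.

```python
import numpy as np

def vif(X: np.ndarray, j: int) -> float:
    """VIF of column j: regress it on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residuals = target - design @ coef
    r_squared = 1.0 - residuals.var() / target.var()
    return 1.0 / (1.0 - r_squared)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                  # independent of a
c = a + 0.1 * rng.normal(size=500)        # nearly a copy of a

X = np.column_stack([a, b, c])
vifs = [vif(X, j) for j in range(X.shape[1])]
print([round(v, 2) for v in vifs])
```

In this synthetic case the two nearly identical columns get very large VIFs while the independent column stays close to 1, which is the pattern that motivates dropping one variable from a collinear pair.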

In [None]:
# Calculating the VIF for the model

vif = pd.DataFrame()
vif['Features'] = x_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe.values, i) for i in range(x_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
print(vif)

In [None]:
x_train_rfe = x_train_rfe.drop(['const'], axis=1)

In [None]:
# Checking VIF after dropping const

vif = pd.DataFrame()
vif['Features'] = x_train_rfe.columns
vif['VIF'] = [variance_inflation_factor(x_train_rfe.values, i) for i in range(x_train_rfe.shape[1])]
vif['VIF'] = round(vif['VIF'],2)
vif = vif.sort_values(by = 'VIF', ascending = False)
print(vif)

>> In our 1st model, all of our variables have acceptable p-values (< 0.05). Checking the VIF for multicollinearity, we will drop the **hum** variable due to its significantly high VIF value of 30.94.

### **`MODEL 2`**

In [None]:
# Drop the 'hum' variable

x_train_rfe1 = x_train_rfe.drop(['hum'], axis = 1)


In [None]:
# Rebuilding the 2nd model

# Add a constant variable

x_train_lm = sm.add_constant(x_train_rfe1)

# Running the Linear Model

lm = sm.OLS(y_train, x_train_lm).fit()

In [None]:
# Summary of the 2nd Linear Model

print(lm.summary())

##### Checking VIF

In [None]:
# VIF for the 2nd Model

vif = pd.DataFrame()
X = x_train_rfe1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

>> * Our R-Squared for our 2nd model is 0.840.
>> For the next model, we will be dropping the **'Summer'** variable. Since there are still VIF values > 5.0, we will keep removing variables until we find an acceptable model.

### **`MODEL 3`**

In [None]:
# Drop the Summer variable

x_train_rfe1 = x_train_rfe1.drop(['Summer'], axis = 1)


In [None]:
# Rebuilding the 3rd model

# Add a constant variable

x_train_lm = sm.add_constant(x_train_rfe1)

# Running the Linear Model

lm3 = sm.OLS(y_train, x_train_lm).fit()

In [None]:
# Summary of the 3rd Linear Model

print(lm3.summary())

##### Checking VIF

In [None]:
# VIF for the 3rd Model

vif = pd.DataFrame()
X = x_train_rfe1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

>> * The R-squared value for our 3rd model is 0.838.
>> We will remove **'Nov'** next, to find the best model.

### **`MODEL 4`**


In [None]:
# Dropping Nov

x_train_rfe1 = x_train_rfe1.drop(['Nov'], axis = 1)

In [None]:
# Rebuilding the 4th model

# Add a constant variable

x_train_lm = sm.add_constant(x_train_rfe1)

# Running the Linear Model

lm4 = sm.OLS(y_train, x_train_lm).fit()

In [None]:
# Summary of the 4th Linear Model

print(lm4.summary())

In [None]:
# VIF for the 4th Model

vif = pd.DataFrame()
X = x_train_rfe1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

>> For the 4th Model, our R-Squared is 0.836. We will be removing **Dec**, which has a p-value of 0.037. We will keep removing variables one by one until we are satisfied with our model.

### **`MODEL 5`**

In [None]:
# Dropping 'Dec'

x_train_rfe1 = x_train_rfe1.drop(['Dec'], axis = 1)

In [None]:
# Rebuilding the 5th model

# Add a constant variable

x_train_lm = sm.add_constant(x_train_rfe1)

# Running the Linear Model

lm5 = sm.OLS(y_train, x_train_lm).fit()

In [None]:
# Summary of the 5th Linear Model

print(lm5.summary())


In [None]:
# VIF for the 5th Model

vif = pd.DataFrame()
X = x_train_rfe1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

>> Our R-Squared decreases slightly from 0.836 to 0.835. However, we will continue removing variables. Next is **Jan**.

### **`MODEL 6`**

In [None]:
# Drop the 'Jan' column

x_train_rfe1 = x_train_rfe1.drop(['Jan'], axis = 1)

In [None]:
# Rebuilding the 6th model

# Add a constant variable

x_train_lm = sm.add_constant(x_train_rfe1)

# Running the Linear Model

lm6 = sm.OLS(y_train, x_train_lm).fit()

In [None]:
# Summary of the 6th Linear Model

print(lm6.summary())

In [None]:
# VIF for the 6th Model

vif = pd.DataFrame()
X = x_train_rfe1
vif['Features'] = X.columns
vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif['VIF'] = round(vif['VIF'], 2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

**From our above model, we can see that the above model is acceptable due to the following reasons:**

* R-Squared is **0.833 (83.3%)**

> *The model explains 83.3% of the variance of the target variable **`(cnt)`**.*
    
    
* Adj R-Squared is **0.830 (83.0%)**


* P-values for all of the 10 variables are approximately **0.** 

* VIF values are **less than 5.** 

* F-statistic is **248.7** (greater than 1) and Prob(F-statistic) is **1.16e-186** (very low).

> *A large F-statistic together with a low Prob(F-statistic) indicates that the model is significant.*

* Finally, for this model, all of the p-values are approximately 0 and the VIF of every variable is less than 5.0. Thus we can say that this is our best model to proceed further.

In [None]:
# The coefficient values for our 6th Model with 10 variables are as follows:

lm6.params

### Residual Analysis

Residual analysis assesses the appropriateness of the linear regression model by computing the residuals and examining their plots.

In [None]:
y_train_pred = lm6.predict(x_train_lm)

In [None]:
# Checking the Residuals

res = y_train - y_train_pred

In [None]:
# Plotting the Histogram for the Residual

plt.figure(figsize=(14,7))
sns.histplot(res, bins = 20, kde=True)   # distplot is deprecated in recent seaborn releases
sns.set_style("whitegrid", {'axes.grid' : False})
plt.title('Error Terms')
plt.xlabel('Error')
plt.show()

>**Inferences:**
    
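    - From the above graph, we can see that the residuals are approximately normally distributed and centred around 0, which supports the normality assumption for the error terms. Constant variance (homoscedasticity) cannot be judged from a histogram alone; it is usually checked by looking at the residuals against the predicted values.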

In [None]:
# Print R squared for train

r2_score(y_train, y_train_pred)

### _`STEP 6: Making Predictions`_

### Applying Scaler on the Test sets

In [None]:
# Applying the scaler fitted on the training data to the numeric variables (transform only, no re-fitting)

num_test = ['temp','hum','windspeed','cnt']
test[num_test] = scaler.transform(test[num_test])

In [None]:
# Checking whether the values are compressed to roughly 0-1
# The scaler was fitted on the training set, so the test max values need not be exactly 1

test.describe()

> The values are mapped to approximately 0 - 1. Because the scaler was fitted on the training data, the min and max of the scaled test columns need not be exactly 0 and 1.

In [None]:
# Plotting heatmap to check on the correlation coefficients

plt.figure(figsize=(30,26))
mask = np.array(test.corr())
mask[np.tril_indices_from(mask)] = False
sns.heatmap(test.corr(), mask = mask, vmax = .8, square =  True, annot=True, cmap='YlGnBu', fmt=".2f")

sns.set_style('whitegrid')
sns.set_context('talk')
plt.title('Correlations Between Variables on the Test Set')
plt.show()

**Inferences:**
The Correlations for the Test data set are as follows:

**`Top 5 Positive Correlations:`**
1. Temp - cnt : 0.59
2. Feb - Spring : 0.57
3. Oct - Winter : 0.54
4. Nov - Winter : 0.51
5. Yr - cnt : 0.49

> A positive correlation indicates that when one variable increases, the other variable tends to increase as well (they move in the same direction).

**`Top 5 Negative Correlations:`**
1. Spring - temp : -0.65
2. Saturday - workingday : -0.61
3. Spring - cnt : -0.59
4. Sunday - workingday : -0.57
5. Jan - temp : -0.40

> A negative correlation indicates an inverse relationship between the two variables: when one variable increases, the other tends to decrease.

### Dividing into X-Test and Y-test

In [None]:
# Test columns

test.columns

In [None]:
# Creating X and Y for the Test

y_test = test.pop('cnt')
x_test = test

In [None]:
# Keep only the columns used by the final model, then add a constant variable

test_col = x_train_lm.columns
x_test = x_test[test_col[1:]]   # test_col[0] is 'const'

x_test = sm.add_constant(x_test)

x_test.info()

In [None]:
# Making Predictions for the Test

y_pred = lm6.predict(x_test)


### Finding the R2, Adjusted R2 and Mean Squared Error

In [None]:
# R2 for our 6th Model

r2 = r2_score(y_test, y_pred)
round(r2, 4)

In [None]:
n = x_test.shape[0]    # No. of rows of test data
p = x_test.shape[1]    # No. of columns of test data (this count includes 'const')

adj_r2 = (1-(1-r2)*(n-1)/(n-p-1))
adj_r2

In [None]:
# Adjusted R2 for our 6th Model (using r2, n and p computed above)

Adj_r2 = 1 - (1 - r2)*(n-1)/(n-p-1)
round(Adj_r2,4)

In [None]:
# Mean Squared Error

mse = mean_squared_error(y_test, y_pred)
round(mse,4)

### RESULT COMPARISON:

* R-SQUARED (TRAIN) - **0.833 (83.3%)**
* R-SQUARED (TEST) - **0.807 (80.7%)**
* ADJ R-SQUARED (TRAIN) - **0.827 (82.7%)**
* ADJ R-SQUARED (TEST) - **0.7967 (79.67%)**

> The difference in R-Squared between the train and test data is only **2.6%**, which is less than 5%. Hence we can say that our model performs well even on unseen data (test data).
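
Adjusted R-squared penalises R-squared for the number of predictors: `1 - (1 - R^2) * (n - 1) / (n - p - 1)`, with `n` rows and `p` predictors (not counting the constant). The helper `adjusted_r2` below is a hypothetical name for this sketch, and the inputs are made-up values chosen only to show the penalty growing with `p`.

```python
def adjusted_r2(r2: float, n_samples: int, n_predictors: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_predictors - 1)

# Hypothetical inputs: same R^2 and sample size, different numbers of predictors
few_predictors = adjusted_r2(0.80, 100, 5)
many_predictors = adjusted_r2(0.80, 100, 20)

print(round(few_predictors, 4))
print(round(many_predictors, 4))
```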


### _`STEP 7: Model Evaluation`_

In [None]:
# Plotting scatter plot against Actual and Predicted Values

fig = plt.figure(figsize=(15,8))
sns.regplot(x=y_test, y=y_pred, ci=68, fit_reg=True,scatter_kws={"color": "green"}, line_kws={"color": "red"})
sns.set_style("whitegrid", {'axes.grid' : False})
plt.title('Actual Test Points vs Predicted Test Points')
plt.xlabel('y_test')
plt.ylabel('y_pred')
plt.show()

In [None]:
plt.figure(figsize=(11,5))

plt.subplot(1, 2, 1)
plt.scatter(x=y_train, y=y_train_pred, c="#7CAE00", alpha=0.3)

plt.tick_params(axis='x', which='both', bottom=False,
                top=False, labelbottom=False)

z = np.polyfit(y_train, y_train_pred, 1)
p = np.poly1d(z)
plt.plot(y_train,p(y_train),"#F8766D")   # trend line for the train panel uses the train values

plt.ylabel('Predicted')
plt.xlabel('Train - Actual')

plt.subplot(1, 2, 2)
plt.scatter(x=y_test, y=y_pred, c="#619CFF", alpha=0.3)

z = np.polyfit(y_test, y_pred, 1)
p = np.poly1d(z)
plt.plot(y_test,p(y_test),"#F8766D")

plt.xlabel('Test - Actual')

plt.show()

### _**EQUATION OF BEST FITTED LINE:**_
    
>**cnt** = 0.2519 +  0.2340(**`yr`**) + (-0.0986)(**`holiday`**) + 0.4515(**`temp`**) + (-0.1398)(**`windspeed`**) + (-0.1108)(**`Spring`**) + 0.0427(**`Winter`**) + 0.0577(**`Sep`**) + (-0.2864)(**`Light Snow`**) + (-0.0811)(**`Mist + Cloudy`**)

* Positive coefficients indicate that as the value of the independent variable increases, the dependent variable also tends to increase.

* For negative coefficients, as the value of the independent variable increases, the dependent variable tends to decrease.

* If we start with the null hypothesis that all coefficients are zero, the summary of our 6th model shows that none of the coefficients is zero. 
>> Thus, we **reject the null hypothesis in favour of the alternate hypothesis that our model is statistically significant.**
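
To see how the equation turns inputs into a prediction, the sketch below evaluates it for one made-up day. The coefficients are copied from the equation as written above; `predict_scaled_cnt` and `example_day` are hypothetical names and values for illustration. The result is on the min-max scaled `cnt` scale, not in raw rental counts.

```python
# Coefficients copied from the fitted equation stated in the text
coefficients = {
    "const": 0.2519, "yr": 0.2340, "holiday": -0.0986, "temp": 0.4515,
    "windspeed": -0.1398, "Spring": -0.1108, "Winter": 0.0427, "Sep": 0.0577,
    "Light Snow": -0.2864, "Mist + Cloudy": -0.0811,
}

def predict_scaled_cnt(features: dict) -> float:
    """Evaluate the linear equation; features missing from the dict count as 0."""
    total = coefficients["const"]
    for name, coef in coefficients.items():
        if name != "const":
            total += coef * features.get(name, 0.0)
    return total

# Hypothetical day: second year, clear September day, scaled temp and windspeed
example_day = {"yr": 1, "temp": 0.5, "windspeed": 0.2, "Sep": 1}
prediction = predict_scaled_cnt(example_day)
print(round(prediction, 4))
```

Switching `Light Snow` to 1 for the same day lowers the prediction by 0.2864, which is how the sign and size of each coefficient can be read.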

#### Significant Variables to predict the demand for shared bikes are:

- [X] Temperature / Temp
- [X] Yr
- [X] Sep
- [X] Winter
- [X] July
- [X] Mist + Cloudy
- [X] Holiday
- [X] Spring
- [X] Light Snow
- [X] Windspeed

### SUMMARY:

The 4 most significant variables affecting the demand for bike sharing are:

- [X] Temperature - 0.4515
- [X] Year/ Yr - 0.2340
- [X] Sep - 0.0577
- [X] Winter - 0.0427

As all 4 of these variables have positive coefficients, they should be given the most weight when making business decisions aimed at increasing demand.

Hence, once the COVID situation is back to normal, the company may focus on these 4 factors in order to increase sales and volume.