## Bike Sharing Assignment (BoomBikes) Linear Regression Model

### Business Goal:
You are required to model the demand for shared bikes using the available independent variables. Management will use the model to understand how demand varies with different features, so that they can adjust the business strategy to meet demand levels and customer expectations. The model will also help management understand the demand dynamics of a new market.

Specifically, the company wants to know:

- Which variables are significant in predicting the demand for shared bikes
- How well those variables describe the bike demand

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

### Importing and Understanding Data

In [1]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [1]:
bikeSharing = pd.read_csv('/kaggle/input/boombikedata/day.csv')
bikeSharing.head()

#### Inspect the various aspects of the bikeSharing dataframe

In [1]:
bikeSharing.shape

In [1]:
bikeSharing.info()

In [1]:
# percentage of missing values in each column
round(100*(bikeSharing.isnull().sum()/len(bikeSharing)), 2).sort_values(ascending=False)

#### No null/NA values identified 

In [1]:
bikeSharing.describe()

## Data Preparation and Visualising

Let's now spend some time doing what is arguably the most important step - **understanding the data**.
- If there is some obvious multicollinearity going on, this is the first place to catch it
- Here's where you'll also identify if some predictors directly have a strong association with the outcome variable

We'll visualise our data using `matplotlib` and `seaborn`.

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
# setting the style for seaborn plots
%matplotlib inline

### Drop unnecessary variables
   - instant: a row index with no predictive meaning for the model, so we drop it
   - dteday: its information is already available in separate year, month and weekday columns, so we drop it
   - casual, registered: casual + registered = cnt, so we drop both and treat "cnt" as the target variable
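
The claim that `casual` and `registered` add up to `cnt` can be checked before dropping them. The following is a minimal sketch on a toy frame with made-up numbers; the same comparison could be run on `bikeSharing` itself.

```python
import pandas as pd

# Toy rows with made-up numbers that mimic the three count columns.
toy = pd.DataFrame({
    "casual": [331, 131, 120],
    "registered": [654, 670, 1229],
    "cnt": [985, 801, 1349],
})

# cnt should equal the row-wise sum of casual and registered.
is_sum = bool(((toy["casual"] + toy["registered"]) == toy["cnt"]).all())
print(is_sum)
```

If the check holds on the real data, keeping `casual` or `registered` as predictors would leak the target into the features.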

In [1]:
# Dropping instant and dteday (no predictive value) and casual and registered (they sum to cnt)
bikeSharing.drop(['instant','dteday','casual','registered'],inplace=True,axis=1)
bikeSharing.head()

### Converting categorical variables

In [1]:
bikeSharing['season'] = bikeSharing['season'].map({1:'spring',2:'summer', 3:'fall', 4:'winter'})
bikeSharing['mnth'] = bikeSharing['mnth'].map({1:'Jan',2:'Feb', 3:'Mar', 4:'Apr',5:'May',6:'Jun',7:'Jul',8:'Aug',9:'Sept',10:'Oct',11:'Nov',12:'Dec'})
bikeSharing['weekday'] = bikeSharing['weekday'].map({0:'Sunday',1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday'})
bikeSharing['weathersit'] = bikeSharing['weathersit'].map({1:'Clear-Partlycloudy',2:'Mist-Cloudy',3:'LightSnow-lightRain-Thunderstorm',4:'HeavyRain-IcePallets-Thunderstorm'})

bikeSharing.head()



In [1]:
bikeSharing.info()

#### Visualising Categorical Variables

As we can see, there are a few categorical variables as well. Let's make boxplots of these variables against the target variable `cnt` to see which of them look like useful predictors.

In [1]:
def boxplot_cat_var(cat_var,target):
    plt.figure(figsize=(20, 12))
    for i in range(0,len(cat_var)):
        plt.subplot(2,3,i+1)
        sns.boxplot(x = cat_var[i], y = target, data = bikeSharing)
    plt.show()

cat_var =['season','yr','holiday','weekday','workingday','weathersit']
boxplot_cat_var(cat_var,'cnt')

In [1]:
sns.boxplot(x = 'mnth', y = 'cnt', data = bikeSharing)

### Observations:
#### season:
- Almost 32% of the bike bookings happened in fall, with a median of over 5000 bookings (over the 2-year period). Summer and winter followed with 27% and 25% of total bookings. This indicates season can be a good predictor for the dependent variable.
#### yr:
- Bookings increased substantially in the second year compared with the first (over the 2-year period). This indicates yr can be a good predictor for the dependent variable.
#### weathersit:
- Almost 67% of the bike bookings happened during 'Clear-Partlycloudy' weather, with a median of close to 5000 bookings (over the 2-year period). This was followed by 'Mist-Cloudy' with 30% of total bookings. This indicates weathersit shows a trend in bike bookings and can be a good predictor for the dependent variable.
#### holiday:
- Almost 97.6% of the bike bookings happened on non-holidays, so the data is heavily imbalanced for this variable. This suggests holiday is unlikely to be a good predictor for the dependent variable.
#### weekday:
- The weekday variable shows a very flat trend (between 13.5% and 14.8% of total bookings on each day of the week), with medians between 4000 and 5000 bookings. This variable may have little or no influence on the target. I will let the model decide whether it needs to be included.
#### workingday:
- Almost 69% of the bike bookings happened on working days, with a median of close to 5000 bookings (over the 2-year period). This indicates workingday can be a good predictor for the dependent variable.
#### mnth:
- Almost 10% of the bike bookings happened in each of the months May, Jun, Jul, Aug and Sept, with a median of over 4000 bookings per month. This indicates mnth shows a trend in bookings and can be a good predictor for the dependent variable.
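
The shares and medians quoted in these observations can be computed with a group-by. Below is a minimal sketch on made-up toy data; `share_and_median` is a hypothetical helper name, not part of the notebook.

```python
import pandas as pd

# Made-up toy data standing in for one categorical column and the target.
toy = pd.DataFrame({
    "season": ["spring", "spring", "summer", "fall", "fall", "winter"],
    "cnt": [100, 200, 300, 500, 400, 500],
})

def share_and_median(df, cat_col, target="cnt"):
    """Percentage of total bookings and median bookings per category."""
    grouped = df.groupby(cat_col)[target]
    summary = pd.DataFrame({
        "share_pct": (100 * grouped.sum() / df[target].sum()).round(1),
        "median": grouped.median(),
    })
    return summary.sort_values("share_pct", ascending=False)

summary = share_and_median(toy, "season")
print(summary)
```

Calling the helper once per categorical column gives the numbers behind each boxplot.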

### Dummy Variables

In [1]:
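# Defining a helper that replaces a categorical column with its dummy variables
def dummies(x,df):
    # drop_first=True drops one level to avoid the dummy variable trap;
    # dtype=int keeps the dummies numeric (0/1) for the later model fits
    temp = pd.get_dummies(df[x], drop_first = True, dtype = int)
    df = pd.concat([df, temp], axis = 1)
    df.drop([x], axis = 1, inplace = True)
    return df
# Applying the function to the categorical columns of bikeSharing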

bikeSharing = dummies('season',bikeSharing)
bikeSharing = dummies('mnth',bikeSharing)
bikeSharing = dummies('weekday',bikeSharing)
bikeSharing = dummies('weathersit',bikeSharing)
bikeSharing.head()
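
To see what `drop_first = True` does, here is a minimal sketch on a made-up `season` column. The first category in sorted order (`fall` here) is dropped and becomes the baseline that the remaining dummies are compared against.

```python
import pandas as pd

# Made-up toy column with the four season labels used above.
toy = pd.DataFrame({"season": ["spring", "summer", "fall", "winter", "fall"]})

# Categories are sorted alphabetically, so "fall" is the one that gets dropped.
dummies_df = pd.get_dummies(toy["season"], drop_first=True, dtype=int)
print(list(dummies_df.columns))

# A "fall" row is encoded as all zeros across the remaining dummy columns.
fall_row_total = int(dummies_df.iloc[2].sum())
print(fall_row_total)
```

This matters when reading the coefficients later: each season coefficient is an effect relative to the dropped level.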

In [1]:
bikeSharing.shape

In [1]:
bikeSharing.describe()


### Model Building
#### Assumptions
- Linear relationship
- Multivariate normality
- No or little multicollinearity
- No auto-correlation
- Homoscedasticity
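
The "no auto-correlation" assumption is commonly checked with the Durbin-Watson statistic, which the statsmodels OLS summary also reports. The sketch below computes it from its definition on made-up residuals; values near 2 suggest little autocorrelation, values near 0 positive autocorrelation, and values near 4 negative autocorrelation.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squares."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Made-up residual patterns for illustration only.
smooth = np.sin(np.linspace(0, 2 * np.pi, 100))   # neighbours are very similar
alternating = np.array([1.0, -1.0] * 50)          # neighbours flip sign

dw_smooth = durbin_watson_stat(smooth)
dw_alternating = durbin_watson_stat(alternating)
print(dw_smooth, dw_alternating)
```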

## Splitting the Data into Training and Testing Sets

In [1]:
from sklearn.model_selection import train_test_split

# We specify this so that the train and test data set always have the same rows, respectively
np.random.seed(0)

df_train, df_test = train_test_split(bikeSharing, train_size = 0.7, test_size = 0.3, random_state = 100)

In [1]:
df_train.shape

In [1]:
df_test.shape

#### Visualising Numeric Variables

Let's make a pairplot of all the numeric variables

In [1]:
# we can see patterns between variables 
sns.pairplot(df_train[[ 'temp','atemp', 'hum', 'windspeed','cnt']],diag_kind='kde')
plt.show()

#### The pair plot above shows a LINEAR RELATION between 'temp', 'atemp' and 'cnt'. 'temp' and 'atemp' take very similar values, so we will let the model decide which of the two to remove

In [1]:
#Correlation using heatmap
plt.figure(figsize = (30, 25))
sns.heatmap(df_train.corr(), annot = True, cmap="YlGnBu")
plt.show()

### Observations:
- The heatmap shows which variables are collinear with each other and which variables are strongly correlated with the target variable.
- We will refer back to this map while building the linear model, using the correlations together with VIF and p-values to decide which variables to keep in or eliminate from the model.

### Rescaling the Features 

We will use MinMax scaling.

In [1]:
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

In [1]:
# Apply scaler() to all the columns except 'dummy' variables
num_vars = ['temp','atemp', 'hum', 'windspeed', 'cnt']

df_train[num_vars] = scaler.fit_transform(df_train[num_vars])

df_train.head()

In [1]:
df_train.describe()
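
MinMax scaling maps each value to (x - min) / (max - min), where min and max are learned from the training data only. The sketch below uses made-up numbers and a hypothetical helper `min_max` to show that the same training min and max are then reused for any new data, which is why scaled test values can fall outside [0, 1].

```python
import numpy as np

# Made-up values for one numeric column.
train = np.array([10.0, 20.0, 30.0, 40.0])
test = np.array([15.0, 45.0])

# "Fit": learn the range from the training data only.
lo, hi = train.min(), train.max()

def min_max(x, lo, hi):
    """Apply min-max scaling with a previously learned range."""
    return (x - lo) / (hi - lo)

train_scaled = min_max(train, lo, hi)
test_scaled = min_max(test, lo, hi)   # reuses the training range
print(train_scaled, test_scaled)
```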

### Dividing into X and Y sets for the model building

In [1]:
y_train = df_train.pop('cnt')
X_train = df_train

### RFE
Recursive feature elimination

In [1]:
#importing libs for RFE
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm 
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [1]:
# Running RFE with the number of selected variables equal to 15
lm = LinearRegression()
# n_features_to_select is passed by keyword, as recent scikit-learn versions require
rfe = RFE(lm, n_features_to_select=15)
rfe = rfe.fit(X_train, y_train)

In [1]:
list(zip(X_train.columns,rfe.support_,rfe.ranking_))

In [1]:
X_train.columns[rfe.support_]

In [1]:
X_train.columns[~rfe.support_]
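
To make the idea behind recursive feature elimination concrete, here is a simplified sketch in plain NumPy. It is not scikit-learn's implementation: it assumes features on comparable scales, fits ordinary least squares, and drops the feature with the smallest absolute coefficient until the requested number remains. The data and the name `rfe_sketch` are made up for illustration.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Repeatedly fit OLS and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Design matrix: intercept column plus the features still in play.
        A = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(A, y, rcond=None)
        weakest = int(np.argmin(np.abs(coefs[1:])))  # skip the intercept
        remaining.pop(weakest)
    return remaining

# Made-up data: only columns 0 and 3 actually drive y.
rng = np.random.default_rng(42)
X = rng.uniform(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.05, size=200)

selected = rfe_sketch(X, y, n_keep=2)
print(selected)
```

RFE only ranks features by model weight; statistical significance and multicollinearity still need the p-value and VIF checks that follow.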

#### Building the model using statsmodels, for the detailed statistics

In [1]:
X_train_rfe = X_train[X_train.columns[rfe.support_]]
X_train_rfe.head()

In [1]:
def build_model(X,y):
    X = sm.add_constant(X) #Adding the constant
    lm = sm.OLS(y,X).fit() # fitting the model
    print(lm.summary()) # model summary
    return X
    
def checkVIF(X):
    vif = pd.DataFrame()
    vif['Features'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    vif['VIF'] = round(vif['VIF'], 2)
    vif = vif.sort_values(by = "VIF", ascending = False)
    return(vif)
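
For intuition, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes this directly with NumPy on made-up data. One assumption differs from the statsmodels helper: this sketch always adds an intercept to each auxiliary regression, so the two can disagree when the input has no constant column.

```python
import numpy as np

def vif_from_definition(X):
    """VIF_i = 1 / (1 - R_i^2), with R_i^2 from regressing column i on the rest."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        coefs, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coefs
        r2 = 1.0 - resid.var() / target.var()
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

# Made-up data: b is nearly a copy of a, c is unrelated to both.
rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.1, size=300)
c = rng.normal(size=300)

vifs = vif_from_definition(np.column_stack([a, b, c]))
print(vifs.round(2))
```

The two near-duplicate columns get large VIFs while the unrelated column stays close to 1, which is the pattern to look for in the tables below.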



#### MODEL 1

In [1]:
X_train_new = build_model(X_train_rfe,y_train)

The p-value of `Jan` is higher than the significance level of 0.05, so we drop it as insignificant in the presence of the other variables.

In [1]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

In [1]:
X_train_new=X_train_new.drop(["Jan"], axis = 1)

#### MODEL 2

In [1]:
X_train_new = build_model(X_train_new,y_train)

In [1]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

- holiday has a high p-value, so we drop it

In [1]:
X_train_new=X_train_new.drop(["holiday"], axis = 1)

#### MODEL 3

In [1]:
X_train_new = build_model(X_train_new,y_train)

In [1]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

- spring has a high VIF and a high p-value, so we drop it

In [1]:
X_train_new=X_train_new.drop(["spring"], axis = 1)

#### MODEL 4

In [1]:
X_train_new = build_model(X_train_new,y_train)

In [1]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

- Jul has a comparatively high p-value, so we drop it

In [1]:
X_train_new=X_train_new.drop(["Jul"], axis = 1)

#### MODEL 5

In [1]:
X_train_new = build_model(X_train_new,y_train)

In [1]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

#### All p-values are approximately 0 and all VIF values are below 2; next we check the correlation between the remaining variables

In [1]:
#Correlation using heatmap
plt.figure(figsize = (30, 25))
sns.heatmap(X_train_new.corr(), annot = True, cmap="YlGnBu")
plt.show()

#### workingday and Saturday show a strong negative correlation, so we drop workingday first

In [1]:
X_train_new=X_train_new.drop(["workingday"], axis = 1)

#### MODEL 6

In [1]:
X_train_new = build_model(X_train_new,y_train)

In [1]:
X_train_new=X_train_new.drop(["Saturday"], axis = 1)

#### MODEL 7

In [1]:
X_train_new = build_model(X_train_new,y_train)

In [1]:
#Calculating the Variance Inflation Factor
checkVIF(X_train_new)

### Observations 
- The model looks good with 9 variables: R-squared = 0.833, Adj. R-squared = 0.830
- All VIF values are below 2
- All p-values are approximately 0
- Prob (F-statistic) is almost equal to 0

### Residual Analysis of Model

In [1]:
lm = sm.OLS(y_train,X_train_new).fit()
y_train_cnt= lm.predict(X_train_new)

In [1]:
# Plot the histogram of the error terms
fig = plt.figure()
# histplot with kde=True draws the histogram together with a density curve
sns.histplot((y_train - y_train_cnt), bins = 20, kde = True)
fig.suptitle('Error Terms', fontsize = 20)                  # Plot heading 
plt.xlabel('Errors', fontsize = 18)                         # X-label

Error terms seem to be approximately normally distributed, so the assumption on the linear modeling seems to be fulfilled.

### Prediction and Evaluation

In [1]:
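# Scaling the test set with the scaler already fitted on the training data
# (transform only: the test set must reuse the training min and max)
num_vars = ['temp','atemp', 'hum', 'windspeed', 'cnt']
df_test[num_vars] = scaler.transform(df_test[num_vars])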

### Dividing the test set into X and y sets for evaluation

In [1]:
#Dividing into X and y
y_test = df_test.pop('cnt')
X_test = df_test

In [1]:
# Now let's use our model to make predictions.
X_train_new = X_train_new.drop('const',axis=1)
# Creating X_test_new dataframe by dropping variables from X_test
X_test_new = X_test[X_train_new.columns]

# Adding a constant variable 
X_test_new = sm.add_constant(X_test_new)

In [1]:
# Making predictions
y_pred = lm.predict(X_test_new)

#### Evaluation of test via comparison of y_pred and y_test

In [1]:
from sklearn.metrics import r2_score 
r2=r2_score(y_test, y_pred)
print(r2)
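
For reference, R² is 1 minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes it from that definition on made-up numbers; `r2_from_definition` is a hypothetical name used only here.

```python
import numpy as np

def r2_from_definition(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Made-up values: SS_res = 0.5 and SS_tot = 20.
y_true_toy = [3.0, 5.0, 7.0, 9.0]
y_pred_toy = [2.5, 5.5, 7.0, 9.0]
r2_value = r2_from_definition(y_true_toy, y_pred_toy)
print(r2_value)
```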

### Adjusted R^2 Value for TEST

In [1]:
X_test_new.shape

In [1]:
# We already have the value of R^2 (calculated in above step)
# n is number of rows in X
n = X_test_new.shape[0]

# Number of features (predictors, p) is the shape along axis 1,
# minus 1 because X_test_new also contains the 'const' column
p = X_test_new.shape[1] - 1

# We find the Adjusted R-squared using the formula

adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2
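
As a worked example of the adjusted R² formula, the sketch below wraps it in a small function and evaluates it for hypothetical inputs (R² = 0.80, 200 rows, 9 predictors). Here the number of predictors is counted without the intercept; the inputs are illustrative, not figures from this notebook.

```python
def adjusted_r2_score(r2, n_rows, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_predictors - 1)

# Hypothetical inputs: 1 - 0.2 * 199 / 190
example = adjusted_r2_score(r2=0.80, n_rows=200, n_predictors=9)
print(round(example, 4))
```

The penalty grows with the number of predictors, so adjusted R² is always at or below R² for a model with at least one predictor.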

In [1]:
#EVALUATION OF THE MODEL
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=20)              # Plot heading 
plt.xlabel('y_test', fontsize=18)                          # X-label
plt.ylabel('y_pred', fontsize=16)   

### Final Results 
- Train R^2: 0.833
- Train Adjusted R^2: 0.830
- Test R^2: 0.800
- Test Adjusted R^2: 0.791
#### The train and test scores are close, so the model appears to generalize well to unseen data.

In [1]:
print(lm.summary())

### We can see that the equation of our best fitted line is:
- cnt = 0.2215 + 0.2292 * `yr` + 0.5754 * `temp` - 0.1755 * `hum` - 0.1890 * `windspeed` + 0.0909 * `summer` + 0.1391 * `winter` + 0.1034 * `Sept` - 0.2320 * `LightSnow-lightRain-Thunderstorm` - 0.0499 * `Mist-Cloudy`
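
The fitted equation can be evaluated directly. The sketch below hard-codes the coefficients listed above and scores one made-up day; `predict_scaled_cnt` is a hypothetical helper. Both the numeric inputs and the output are on the MinMax-scaled 0 to 1 range used during training, so the result is a scaled demand, not a raw booking count.

```python
INTERCEPT = 0.2215
COEFFICIENTS = {
    "yr": 0.2292,
    "temp": 0.5754,
    "hum": -0.1755,
    "windspeed": -0.1890,
    "summer": 0.0909,
    "winter": 0.1391,
    "Sept": 0.1034,
    "LightSnow-lightRain-Thunderstorm": -0.2320,
    "Mist-Cloudy": -0.0499,
}

def predict_scaled_cnt(features):
    """Evaluate the fitted line; features missing from the dict count as 0."""
    return INTERCEPT + sum(
        coef * features.get(name, 0.0) for name, coef in COEFFICIENTS.items()
    )

# Made-up example: a clear summer day in the second year, scaled inputs.
day = {"yr": 1, "temp": 0.5, "hum": 0.5, "windspeed": 0.2, "summer": 1}
scaled_prediction = predict_scaled_cnt(day)
print(round(scaled_prediction, 5))
```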

### Hypothesis Testing
##### The overall significance test states:

- H0: B1 = B2 = ... = Bn = 0
- H1: at least one Bi != 0

- From the final model summary, Prob (F-statistic) is almost 0 and every coefficient has a p-value of approximately 0, so the coefficients are not all equal to zero and we REJECT the NULL HYPOTHESIS
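
The test statistic behind this hypothesis is the overall F-statistic, which can be written in terms of R². The sketch below computes it for the train R² of 0.833 with 9 predictors; the row count of 510 is a hypothetical training-set size chosen for illustration, not a figure reported in this notebook.

```python
def f_statistic(r2, n_rows, n_predictors):
    """Overall F-statistic of an OLS fit with an intercept, computed from R^2."""
    explained = r2 / n_predictors
    unexplained = (1.0 - r2) / (n_rows - n_predictors - 1)
    return explained / unexplained

# n_rows = 510 is an assumed value; R^2 and the predictor count are from the model.
f_value = f_statistic(r2=0.833, n_rows=510, n_predictors=9)
print(round(f_value, 1))
```

A large F value corresponds to a Prob (F-statistic) close to 0, which is the evidence used to reject H0.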



### Based on the above model, the company should focus on the following features:

- year: The company should see an increase in the number of users, compared with 2019, when the situation comes back to normal.
- season: The company should focus on expanding its business in the summer and fall seasons.
- weather: Users prefer to rent a bike when the weather is pleasant, i.e. clear or partly cloudy.
- temp: Users prefer to ride or rent a bike at moderate temperatures.

### Hence, when the situation comes back to normal, the company should see an increase in business compared with 2019, and it should expand its business with new offers or schemes in summer and fall, when the weather is pleasant with clear skies and moderate temperatures.