# Problem Statement: 
BoomBikes, a US bike-sharing provider, has recently seen a considerable dip in revenue due to the ongoing COVID-19 pandemic. It has therefore decided to prepare a business plan that lets it grow revenue quickly once the lockdown ends and the economy recovers. To do that, BoomBikes wants to understand the demand for shared bikes after the nationwide quarantine ends.
Specifically, it wants to understand the factors affecting the demand for these shared bikes in the American market. The company wants to know:

- Which variables are significant in predicting the demand for shared bikes
- How well those variables describe bike demand

### Objective
- Model the demand for shared bikes with the available independent variables.
- Identify the variables that are significant in predicting the demand for shared bikes.

### Steps:
- Reading and Understanding the Data
- Visualising the Data
- Data Preparation
- Splitting the Data into Training and Testing Sets
- Building a linear model
- Residual Analysis of the train data
- Making Predictions Using the Final Model

### Step 1: Reading and Understanding the Data

In [33]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [34]:
# import libs
import pandas as pd
import numpy as np
import matplotlib.pyplot as mlt
import seaborn as sns


In [35]:
# import dataset
df = pd.read_csv("day.csv" , low_memory=False)

In [None]:
# Inspect head
df.head()

In [None]:
df.info()

No missing values in the dataset. Types are also correct.
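
The missing-value claim can also be checked explicitly instead of being read off `df.info()`. A minimal sketch, using a small made-up frame in place of `day.csv` (the values below are illustrative assumptions, not rows from the dataset):

```python
import pandas as pd

# Hypothetical stand-in for the real data; values are made up for illustration.
sample = pd.DataFrame({
    "temp": [14.1, 14.9, 8.1],
    "hum": [80.6, 69.6, 43.7],
    "cnt": [985, 801, 1349],
})

# Count missing values per column; all zeros means there is nothing to impute.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Single boolean summary for the whole frame.
has_missing = bool(sample.isnull().values.any())
print("Any missing values:", has_missing)
```

Running the same two checks on `df` should confirm what `df.info()` suggests.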

In [None]:
df.shape

In [None]:
df.value_counts()

In [None]:
#list the columns
df.columns

Remove columns that are not needed for EDA and model building:
- `instant` - just an index column
- `dteday` - month, year and weekday are already present in separate columns
- `casual` and `registered` - their sum is `cnt`, the total count, which is the target column

In [None]:
cols_to_drop = ['instant', 'dteday', 'casual', 'registered']
df = df.drop(columns=cols_to_drop)
df.info()

Handling outliers

In [None]:
# Draw boxplots for numeric continuous variables
num_cols = ['temp', 'atemp', 'hum', 'windspeed']
for col in num_cols:
    sns.boxplot(data=df, y=col)
    mlt.show()


In [None]:
# check min, max and IQR
for col in num_cols:
    print("---------------------------------" + col.upper() + "-------------------------------")
    print(df[col].describe())

No large jumps in the values are observed, so no outlier handling is needed.
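
The visual check can be backed by a count. The sketch below applies the common 1.5 × IQR rule, which is also the default whisker rule in seaborn boxplots. The helper name `count_iqr_outliers` and the demo values are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(series: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((series < lower) | (series > upper)).sum())

# Made-up values for illustration: one obvious extreme at the end.
demo = pd.Series([10.0, 11.0, 12.0, 12.5, 13.0, 14.0, 15.0, 60.0])
print(count_iqr_outliers(demo))  # only 60.0 falls outside the fences
```

Applied to the notebook's data, this would be `{col: count_iqr_outliers(df[col]) for col in num_cols}`. A handful of flagged points is not by itself a reason to drop rows.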

### Step 2: Visualizing the data

Convert the integer-coded columns to categorical string values

In [None]:
df.head()

In [None]:
df['weathersit_cat'] = df['weathersit'].map({1: "sunny", 2: "cloudy", 3: "rainy", 4: "stormy"}).astype('object')
df.head()

In [None]:
# convert months
df['mnth_cat'] = df['mnth'].map({1: 'Jan',2: 'Feb',3: 'Mar',4: 'Apr',5: 'May',6: 'Jun',
                  7: 'Jul',8: 'Aug',9: 'Sept',10: 'Oct',11: 'Nov',12: 'Dec'}).astype('object')
df.head()

In [None]:
# convert yr, Holiday, Weekday, Workingday, season
df['yr_cat'] = df['yr'].map({0: '2018', 1:'2019'}).astype('object')
df['weekday_cat'] = df['weekday'].map({0: 'Sun',1: 'Mon',2: 'Tue',3: 'Wed',4: 'Thu',5: 'Fri',6: 'Sat'}).astype('object')
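# holiday: 1 = holiday, 0 = not a holiday
df['holiday_cat'] = df['holiday'].map({0: 'No',1: 'Yes'}).astype('object')
# workingday: 1 = neither weekend nor holiday, 0 = otherwise
df['workingday_cat'] = df['workingday'].map({0: 'NotWorking',1: 'Working'}).astype('object')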
df['season_cat'] = df['season'].map({1: 'spring', 2: 'summer', 3: 'fall', 4: 'winter'}).astype('object')

df.head()
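
`Series.map` with a dictionary returns `NaN` for any value that is not a key in the dictionary, so a typo in a mapping can silently create missing labels. A minimal sketch of a check, using made-up codes (the value 9 is deliberately not in the mapping):

```python
import pandas as pd

# Made-up codes for illustration; 9 is deliberately absent from the mapping.
codes = pd.Series([1, 2, 4, 9])
season_names = {1: "spring", 2: "summer", 3: "fall", 4: "winter"}

labels = codes.map(season_names)

# Any NaN here marks a code that the mapping did not cover.
unmapped_codes = codes[labels.isna()].unique().tolist()
print(unmapped_codes)
```

On the notebook's frame, `df[cat_col].isna().sum()` for each new `*_cat` column would serve the same purpose.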




#### EDA

Plot `cnt` against categorical columns

In [None]:
cat_columns = ['season_cat', 'weathersit_cat', 'mnth_cat', 'yr_cat', 'weekday_cat', 'holiday_cat', 'workingday_cat']
for col in cat_columns:
    print("--------------------------- " + col + " -----------------------------------")
    print(df[col].value_counts())

In [None]:
for col in cat_columns:
    sns.boxplot(data=df, x=col, y='cnt')
    mlt.show()

- Bike demand is highest in fall (season 3).
- Bike demand is highest in sunny weather (weathersit 1: Clear, Few clouds, Partly cloudy). Stormy weather (weathersit 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog) shows no demand in the plots, most likely because the data contains no such days rather than because demand is zero.
- Bike demand rises from Feb to June.
- Bike demand grew from 2018 to 2019.
- Demand is lower on holidays.
- Demand is higher on working days.
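
These observations are read off the boxplots; a grouped median gives the same comparison as numbers. A minimal sketch on made-up rows (the `demo` frame is an assumption for illustration, not data from `day.csv`):

```python
import pandas as pd

# Made-up rows for illustration only; not taken from day.csv.
demo = pd.DataFrame({
    "season_cat": ["spring", "spring", "fall", "fall", "summer", "summer"],
    "cnt": [1500, 2500, 5500, 6500, 4000, 5000],
})

# Median demand per category, highest first.
median_by_season = (
    demo.groupby("season_cat")["cnt"].median().sort_values(ascending=False)
)
print(median_by_season)
```

The same call on `df`, repeated for each column in `cat_columns`, would put a number behind each bullet.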

Correlation among numerical continuous columns

In [None]:
df.dtypes

In [None]:
corr_cols = ['temp', 'atemp', 'hum', 'windspeed', 'cnt']
sns.heatmap(data=df[corr_cols].corr(), cmap="YlGnBu", annot = True)
mlt.show()

In [None]:
sns.pairplot(df, vars=corr_cols)
mlt.show()

- `temp` and `atemp` have a strong positive correlation with `cnt`
- `temp` and `atemp` are highly correlated with each other (multicollinearity)
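
Multicollinearity among predictors is often quantified with the variance inflation factor (VIF). One way to compute it, sketched below, uses the fact that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. The helper name `vif_from_correlation` and the synthetic columns `a`, `b`, `c` are assumptions for illustration; the notebook itself does not show this step.

```python
import numpy as np
import pandas as pd

def vif_from_correlation(frame: pd.DataFrame) -> pd.Series:
    """VIF of each column: the diagonal of the inverse correlation matrix."""
    corr = frame.corr().to_numpy()
    vif = np.diag(np.linalg.inv(corr))
    return pd.Series(vif, index=frame.columns)

# Synthetic data for illustration: `b` is almost a copy of `a`, `c` is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
demo = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.05, size=200),
    "c": rng.normal(size=200),
})

print(vif_from_correlation(demo).round(1))
```

For the notebook's data this would be applied to the predictors only, for example `df[['temp', 'atemp', 'hum', 'windspeed']]`, leaving out `cnt`. Values well above roughly 5 to 10 are a common rule-of-thumb warning sign, not a hard threshold.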

### Step 3: Data preparation

Create dummy variables for categorical columns with more than two levels

In [None]:
level_cols = ['season_cat', 'weekday_cat', 'weathersit_cat', 'mnth_cat']
df = pd.get_dummies(data=df, columns=level_cols, drop_first=True, dtype=int)
df.head()
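
With `drop_first=True`, a column with k levels produces k - 1 dummy columns; the dropped level (the first in sorted order) becomes the baseline that is encoded as all zeros. A small sketch on a made-up four-row frame:

```python
import pandas as pd

# One row per season, made up for illustration.
demo = pd.DataFrame({"season_cat": ["spring", "summer", "fall", "winter"]})

dummies = pd.get_dummies(demo, columns=["season_cat"], drop_first=True, dtype=int)

# "fall" sorts first, so it is dropped and becomes the baseline level.
print(dummies.columns.tolist())
print(dummies)
```

Coefficients of the remaining dummies are then read relative to that baseline level.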

### Step 4: Splitting the Data into Training and Testing Sets

In [None]:
# import relevant libs

import sklearn
from sklearn.model_selection import train_test_split


In [None]:
df_train, df_test = train_test_split(df, train_size=0.7, test_size=0.3, random_state=100)
print(df_train.head())
print(df_test.head())
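
Fixing `random_state` makes the split reproducible, and the two parts never share rows. A minimal sketch that checks both properties on a made-up 100-row frame (the `demo` frame is an assumption for illustration):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame for illustration: 100 rows with a single column.
demo = pd.DataFrame({"cnt": range(100)})

train_a, test_a = train_test_split(demo, train_size=0.7, test_size=0.3, random_state=100)
train_b, test_b = train_test_split(demo, train_size=0.7, test_size=0.3, random_state=100)

# 70/30 split of the 100 rows.
print(train_a.shape, test_a.shape)

# Same seed, same rows: the split is reproducible.
same_split = train_a.index.equals(train_b.index)
print(same_split)

# No row appears in both parts.
overlap = set(train_a.index) & set(test_a.index)
print(overlap)
```

The same checks on `df_train` and `df_test` would be `df_train.shape`, `df_test.shape`, and the index intersection.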