#### About Dataset
Rental bikes have been introduced in many cities to make urban mobility more convenient.
The bikes must be available and accessible to the public at the right time, because this reduces waiting time.
Providing the city with a stable supply of rental bikes therefore becomes a major concern.
The crucial part is predicting the number of bikes required in each hour so that the supply stays stable.
The data include weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour, and date information.

#### Expected Outcome
Because this is a regression problem, the expected outcome is a continuous value: the hourly rented bike count, which varies from day to day.
It depends on features such as Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall and Rainfall.

#### Objective
The objective is to predict the number of bikes required each hour in order to provide a smooth and stable supply of rental bikes.

#### Starting with Importing all the Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
%matplotlib inline

from sklearn.metrics import r2_score

import warnings
warnings.filterwarnings("ignore")

#### Load Dataset

In [None]:
df = pd.read_csv("SeoulBikeData.csv", encoding= 'unicode_escape')

#### Previewing the first rows of the data

In [None]:
df.head()

#### Getting the shape of the data 

In [None]:
df.shape

#### Columns of the dataset

In [None]:
df.columns

In [None]:
# Rename the columns to identifier-friendly names (no spaces, units or symbols)
df = df.rename(columns={
    "Date": "Date",
    "Rented Bike Count": "Rented_Bike_Count",
    "Hour": "Hour",
    "Temperature(°C)": "Temperature",
    "Humidity(%)": "Humidity",
    "Wind speed (m/s)": "Wind_speed",
    "Visibility (10m)": "Visibility",
    "Dew point temperature(°C)": "Dew_point_temperature",
    "Solar Radiation (MJ/m2)": "Solar_Radiation",
    "Rainfall(mm)": "Rainfall",
    "Snowfall (cm)": "Snowfall",
    "Seasons": "Seasons",
    "Functioning Day": "Functioning_Day",
    "Holiday": "Holiday",
}, errors="raise")

In [None]:
df.head()

In [None]:
df.info()

#### Data types of all the features

In [None]:
df.dtypes

#### Summary statistics of the numeric features

In [None]:
df.describe().transpose()

In [None]:
df.head(2)

In [None]:
df.isnull().sum()

#### Collecting the day, month and year details from the Date column

In [None]:
df["date"] = df["Date"].str.split("/").str[0]
df["month"] = df["Date"].str.split("/").str[1]
df["year"] = df["Date"].str.split("/").str[2]
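
The string split above leaves the three parts as text, which is why they are cast to `int` later. As a hedged alternative, the same fields can be read from a parsed datetime. This is a minimal sketch on a hypothetical three-row sample; the `%d/%m/%Y` format is an assumption inferred from the order in which the split assigns day, month and year.

```python
import pandas as pd

# Hypothetical sample in the assumed day/month/year layout of the Date column
sample = pd.DataFrame({"Date": ["01/12/2017", "15/06/2018", "30/11/2018"]})

# Parse once with an explicit format so day and month cannot be swapped
parsed = pd.to_datetime(sample["Date"], format="%d/%m/%Y")

# The .dt accessor returns integers directly, so no astype('int') step is needed
sample["date"] = parsed.dt.day
sample["month"] = parsed.dt.month
sample["year"] = parsed.dt.year

print(sample)
```

Parsing with an explicit format also fails loudly on malformed dates, whereas the string split would silently produce wrong or missing pieces.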

#### Dropping the Date Column

In [None]:
df = df.drop(["Date"],axis =1)

In [None]:
df.head(2)

#### Unique values of each feature in the Dataset

In [None]:
for item in df.columns:
    print(item, ": " )
    print(df[item].unique())
    print(df[item].value_counts())
    print("************************************************")

#### Converting the data types of "date", "month" and "year" columns to "int" datatype

In [None]:
df['date'] = df['date'].astype('int')
df['month'] = df['month'].astype('int')
df['year'] = df['year'].astype('int')

In [None]:
## Let's analyze the temporal variables
## We will check whether there is a relation between the month and the median rented bike count

df.groupby('month')['Rented_Bike_Count'].median().plot()
plt.xlabel('month')
plt.ylabel('Rented_Bike_Count')
plt.title("Rented_Bike_Count vs month")

In [None]:
df = pd.get_dummies(df, drop_first = True)

In [None]:
df = df.rename(columns={"Holiday_No Holiday":"Holiday_No_Holiday"}, errors="raise")
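
To make the effect of `pd.get_dummies(df, drop_first=True)` concrete, here is a small sketch on a hypothetical toy frame. The category labels mirror the `Seasons` and `Holiday` columns; the counts are made up for illustration.

```python
import pandas as pd

# Hypothetical toy frame with the same kind of categorical columns
toy = pd.DataFrame({
    "Rented_Bike_Count": [254, 204, 173, 107],
    "Seasons": ["Winter", "Spring", "Summer", "Autumn"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday"],
})

encoded = pd.get_dummies(toy, drop_first=True)

# The alphabetically first level of each column ("Autumn", "Holiday") is dropped
# and becomes the implicit baseline: a row where all of that column's dummies are 0
print(encoded.columns.tolist())
```

Dropping one level per column avoids perfectly collinear dummy columns. It also explains where the `Holiday_No Holiday` name renamed above comes from.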

In [None]:
## Here we plot every feature against Rented_Bike_Count to inspect their relationship

for feature in df.columns:
    if feature!='Rented_Bike_Count':
        data=df.copy()
        plt.scatter(data[feature],data['Rented_Bike_Count'])
        plt.xlabel(feature)
        plt.ylabel('Rented_Bike_Count')
        plt.show()

In [None]:
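## We will be using logarithmic transformation
## np.log is undefined for values <= 0, so features containing such values are skipped;
## the target can contain zeros, so it is transformed with np.log1p, i.e. log(1 + x)

for feature in df.columns:
    # Skip the target itself and any feature with zero or negative values
    if feature == 'Rented_Bike_Count' or (df[feature] <= 0).any():
        continue
    data=df.copy()
    data[feature]=np.log(data[feature])
    data['Rented_Bike_Count']=np.log1p(data['Rented_Bike_Count'])
    plt.scatter(data[feature],data['Rented_Bike_Count'])
    plt.xlabel(feature)
    plt.ylabel('Rented_Bike_Count')
    plt.title(feature)
    plt.show()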

In [None]:
plt.figure(figsize = (15,25))
count = 1
for col in df:
    plt.subplot(6,3,count)
    plt.boxplot(df[col])
    plt.title(col)
    count += 1

plt.show()

In [None]:
df.drop(['date', 'month', 'year'], axis =1, inplace = True)

#### Removing the Outliers

In [None]:
df = df[(df["Wind_speed"]<4.2)]
df = df[(df["Solar_Radiation"]<2.2)]
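
The cutoffs 4.2 and 2.2 above are fixed numbers, presumably read off the boxplots. One common way to derive such a cutoff from the data is the 1.5 × IQR rule. The sketch below is illustrative only: `iqr_upper_fence` is a hypothetical helper, the wind speeds are made up, and nothing here claims this is how the notebook's thresholds were chosen.

```python
import pandas as pd

def iqr_upper_fence(series: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper fence of the k x IQR outlier rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical wind speeds (m/s); the last value is an obvious outlier
wind = pd.Series([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 7.4], name="Wind_speed")

fence = iqr_upper_fence(wind)
kept = wind[wind < fence]  # same filtering pattern as df[df["Wind_speed"] < 4.2]

print(f"upper fence = {fence:.2f}, kept {len(kept)} of {len(wind)} rows")
```

A data-driven fence adapts if the dataset changes, while a hard-coded number keeps the filter stable and easy to read; either choice is defensible.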

#### Correlation Matrix

In [None]:
correlation = df.corr()
plt.figure(figsize = (15,15))
cmap= sns.diverging_palette(100, 10)
sns.heatmap(correlation, annot = True, cmap =cmap, center = 0)
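
With many columns, the heatmap can be hard to read. A complementary view is to take only the target's column of the correlation matrix and sort it by absolute value. The sketch below uses synthetic data with made-up coefficients, so the numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real frame (coefficients chosen only for illustration)
rng = np.random.default_rng(0)
n = 200
temperature = rng.normal(15, 10, n)
humidity = rng.uniform(20, 90, n)
toy = pd.DataFrame({
    "Rented_Bike_Count": 30 * temperature - 5 * humidity + rng.normal(0, 50, n),
    "Temperature": temperature,
    "Humidity": humidity,
})

# Correlation of every feature with the target, strongest first
target_corr = toy.corr()["Rented_Bike_Count"].drop("Rented_Bike_Count")
order = target_corr.abs().sort_values(ascending=False).index
target_corr = target_corr[order]

print(target_corr)
```

On the notebook's data, the same three lines applied to `df` would rank the real features by their linear association with `Rented_Bike_Count`.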

#### Independent Features and Dependent Features

In [None]:
X = df.iloc[:, 1: ]
y = df.iloc[:, 0]

In [None]:
X.head()

In [None]:
y.head(9)

#### Splitting the records into training and test sets, with the test set being 20% of the whole dataset

In [None]:
# Splitting into train and test datasets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 0)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

#### Model Building

In [None]:
from sklearn.ensemble import RandomForestRegressor
regressor = RandomForestRegressor()

In [None]:
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
print(n_estimators)

In [None]:
from sklearn.model_selection import RandomizedSearchCV

In [None]:
# Randomized Search CV

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1200, num = 12)]
# Number of features to consider at every split
# (1.0 uses all features, which is what 'auto' meant for regressors; recent scikit-learn versions reject 'auto')
max_features = [1.0, 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(5,30,num=6)]
# Minimum number of samples required to split a node
min_samples_split = [2,5,10,15,100]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,5,10]

In [None]:
# create the random grid
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf}

print(random_grid)

In [None]:
# Use the Random grid to search for best hyperparameters
# First Create the base model to tune
rf = RandomForestRegressor()

In [None]:
# Random search of parameters, using 5-fold cross validation,
# trying 10 different combinations
model = RandomizedSearchCV(estimator = rf, param_distributions = random_grid,
                           scoring='neg_mean_squared_error', n_iter = 10, cv = 5,
                           verbose=2, random_state=42, n_jobs = 1)

In [None]:
model.fit(X_train, y_train)

In [None]:
model.best_params_

In [None]:
model.best_score_
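
Because the search uses `scoring='neg_mean_squared_error'`, `best_score_` is the negated mean squared error averaged over the folds, so it is zero or negative and expressed in squared bikes. Taking the square root of its negation gives an RMSE in bikes per hour. The sketch below shows this on synthetic data with a deliberately tiny search space; `X_demo`, `y_demo` and the grid are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression data standing in for X_train / y_train
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(120, 3))
y_demo = 50 * X_demo[:, 0] + rng.normal(0, 5, size=120)

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={"n_estimators": [10, 20], "max_depth": [3, 5]},
    scoring="neg_mean_squared_error",
    n_iter=2,
    cv=3,
    random_state=0,
)
search.fit(X_demo, y_demo)

# best_score_ is -MSE, so flip the sign before taking the square root
cv_rmse = np.sqrt(-search.best_score_)
print("best_score_ (negative MSE):", search.best_score_)
print("cross-validated RMSE:", cv_rmse)
```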

In [None]:
y_pred = model.predict(X_test)

In [None]:
# Distribution of the residuals (actual minus predicted counts)
sns.histplot(y_test - y_pred, kde=True)

In [None]:
plt.scatter(y_test,y_pred)

In [None]:
print("R2 Score: ", r2_score(y_test, y_pred))
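
R2 alone does not tell how many bikes a prediction is typically off by. As a hedged addition, the mean absolute error (MAE) and root mean squared error (RMSE) report the error in the target's own units. The arrays below are hypothetical values used only to show the calls.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted hourly counts
y_true = np.array([100, 250, 400, 50])
y_hat = np.array([110, 240, 380, 70])

mae = mean_absolute_error(y_true, y_hat)
rmse = np.sqrt(mean_squared_error(y_true, y_hat))
r2 = r2_score(y_true, y_hat)

print("MAE :", mae)
print("RMSE:", rmse)
print("R2  :", r2)
```

In the notebook, the same three calls would take `y_test` and `y_pred` in place of the toy arrays.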

In [None]:
import pickle

# store the fitted RandomizedSearchCV object (model) on disk
with open("Seoul_Bike_Sharing_Demand.pkl", "wb") as file:
    # dump the model to that file; the with block closes the file afterwards
    pickle.dump(model, file)
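
A saved model is only useful if it can be loaded back and still produces the same predictions. The sketch below shows the round trip with a small forest trained on synthetic data and written to a temporary directory; the file name `demo_model.pkl` and the data are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Small synthetic training set standing in for the real features
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(60, 3))
y_demo = 20 * X_demo[:, 0] + rng.normal(0, 2, size=60)

fitted = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    # Write the fitted model to disk
    with open(path, "wb") as f:
        pickle.dump(fitted, f)
    # Read it back, as a serving script would do
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions
same = np.allclose(fitted.predict(X_demo), restored.predict(X_demo))
print("predictions identical after reload:", same)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that created them.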