<div class="alert alert-block ">
    <h1 align="center">Six Algorithm in Machine Learning</h1>
    <h1 align="center">Multiple Linear Regression</h1>
    <h4 align="center"><a href="https://www.kaggle.com/competitions/bike-sharing-demand/overview">Original Competition</a> | <a href="https://www.kaggle.com/competitions/bike-sharing-demand/data">Dataset</a> | <a href="https://rfebrians.github.io">Writer</a></h4>
</div>

> This notebook is inspired by the [Bike Sharing Demand Competition](https://www.kaggle.com/competitions/bike-sharing-demand/code) and uses the same dataset.

> In this notebook we'll explore how multiple linear regression works and compare it with several other models.

## Subject
You are the owner of a bike shop that rents bicycles to customers.

Now you want to predict the number of rental requests from customers based on the information you have.

So we'll treat this as a linear regression problem with multiple features.

![alt text](https://www.sefiles.net/merchant/5432/images/site/rental-ebike-2-slimC.jpg?t=1572206062455 "Title")



# The Algorithm

> An algorithm is just a workflow: it contains a procedure that a human can read and follow.

![img](https://raw.githubusercontent.com/RFebrians/object-detection-playground/main/Six-Algorithm-ML.png)

1. Import All Libraries that we'll need
2. Load Dataset and Begin EDA
3. Recognize Missing Values
4. Visualize Data
5. Train Various Models
6. Result

# Models that will be tested

* Linear Regression
* Multiple Linear Regression
* K Nearest Neighbour
* Decision Tree
* Random Forest

## Object

Let's get started with some information about the columns in the dataset.

##### datetime:
hourly date + timestamp  
##### season:
1 = spring, 2 = summer, 3 = fall, 4 = winter 
##### holiday:
whether the day is considered a holiday
##### workingday:
whether the day is neither a weekend nor holiday
##### weather:
* 1 = Clear, Few clouds, Partly cloudy
* 2 = Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
* 3 = Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
* 4 = Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog

##### temp:
temperature in Celsius
##### atemp:
"feels like" temperature in Celsius
##### humidity:
relative humidity
##### windspeed:
wind speed
##### casual:
number of non-registered user rentals initiated
##### registered:
number of registered user rentals initiated
##### count:
number of total rentals



<div class="alert alert-block">
    <h1 align="center">Let's get started</h1>
    <h2 align="center">Step1: Import libraries</h2>
</div> 

In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
import time

<div class="alert alert-block">
    <h2 align="center">Step2: Load dataset and Begin Exploratory Data Analysis (EDA) </h2>
</div> 

In [None]:
# Load dataset 
data_unseen =  pd.read_csv('../input/bike-sharing-demand/test.csv')
data = pd.read_csv('../input/bike-sharing-demand/train.csv')
data.head()

In [None]:
data.info()

In [None]:
# As we can see, there are no null values in our data. Great!
# Now we change the column types to more compact ones to use less memory.
# The datetime column is read as strings (object dtype), so we first parse it into datetime64 values
data['datetime'] = pd.to_datetime(data['datetime'])
# Machine learning models need numerical values, so we split this column into numerical parts
data['datetime_year'] = data['datetime'].dt.year
data['datetime_month'] = data['datetime'].dt.month
data['datetime_day'] = data['datetime'].dt.day
data['datetime_hour'] = data['datetime'].dt.hour

data.drop(['datetime'], axis=1, inplace=True)
for i in data.columns:
    if i in ['temp', 'atemp', 'windspeed']:
        data[i] = data[i].astype('float16')
    else:
        data[i] = data[i].astype('int16')
        

# Change the order of columns in the table
data = data[['datetime_year', 'datetime_month', 'datetime_day', 'datetime_hour',
            'season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
            'humidity', 'windspeed', 'casual', 'registered', 'count']]

# Similarly, we do these actions on data_unseen.
data_unseen['datetime'] = pd.to_datetime(data_unseen['datetime'])
data_unseen['datetime_year'] = data_unseen['datetime'].dt.year
data_unseen['datetime_month'] = data_unseen['datetime'].dt.month
data_unseen['datetime_day'] = data_unseen['datetime'].dt.day
data_unseen['datetime_hour'] = data_unseen['datetime'].dt.hour
# Change the order of columns in the table
data_unseen = data_unseen[['datetime_year', 'datetime_month', 'datetime_day', 'datetime_hour',
            'season', 'holiday', 'workingday', 'weather', 'temp', 'atemp',
            'humidity', 'windspeed']]
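
The datetime handling above can be tried in isolation. The following is a minimal sketch on a hypothetical two-row frame (`demo` is an illustrative name, not part of the notebook), assuming the timestamps use the same `YYYY-MM-DD HH:MM:SS` text format as the competition files:

```python
import pandas as pd

# Hypothetical two-row frame; the values are made up for illustration
demo = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2012-07-15 17:00:00"]})

# pd.to_datetime parses the strings into datetime64 values
demo["datetime"] = pd.to_datetime(demo["datetime"])

# The .dt accessor exposes the calendar components as integer columns
demo["datetime_year"] = demo["datetime"].dt.year
demo["datetime_month"] = demo["datetime"].dt.month
demo["datetime_day"] = demo["datetime"].dt.day
demo["datetime_hour"] = demo["datetime"].dt.hour

print(demo.drop(columns="datetime"))
```

Splitting the timestamp this way lets a model use the hour of the day or the month as ordinary numeric features.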

In [None]:
data.info()

In [None]:
# We reduced the memory usage. Let's take a look at the data again
data.head(3)

In [None]:
# Summary Statistics
data.describe()

<div class="alert alert-block">
    <h2 align="center">Step3: Recognize missing values</h2>
</div> 

In [None]:
# To check the number of missing values
data.isnull().sum()

<div class="alert alert-block">
    <h2 align="center">Step4: Visualize data for better understanding</h2>
</div> 

In [None]:
# See some useful graphs
fig , axes = plt.subplots(2,2,figsize=(20,15))

# Axes[0,0]: Compare the number of total rentals for each season
number_of_total_rentals = []
for i in sorted(data.season.unique()):  # sorted so the sums line up with the labels below
    number_of_total_rentals.append(data[data.season==i]['count'].sum())
    
sns.barplot(ax=axes[0,0],x=['spring','summer','fall','winter'], y=number_of_total_rentals, label='number of total rentals based on each season',
            )
axes[0,0].legend(loc=2, fontsize=15)
axes[0,0].tick_params(axis='both', which='major', labelsize=15)
# Axes[0,1]: Compare the number of total rentals on holidays and non-holidays
number_of_total_rentals = []
for i in sorted(data.holiday.unique()):  # sorted so the sums line up with the labels below
    number_of_total_rentals.append(data[data.holiday==i]['count'].sum())

sns.barplot(ax= axes[0,1],x=['Not holiday(0)','holiday(1)'], y=number_of_total_rentals, label='number of total rentals based on holidays')
axes[0,1].legend(loc=1, fontsize=15)
axes[0,1].tick_params(axis='both', which='major', labelsize=15)
# Axes[1,0]: Compare the number of total rentals on working days and non-working days
number_of_total_rentals = []
for i in sorted(data.workingday.unique()):  # sorted so the sums line up with the labels below
    number_of_total_rentals.append(data[data.workingday==i]['count'].sum())
    
sns.barplot(ax=axes[1,0],x=['Not workingday(0)','workingday(1)'], y=number_of_total_rentals, label='number of total rentals based on workingdays')
axes[1,0].legend(loc=2, fontsize=15)
axes[1,0].tick_params(axis='both', which='major', labelsize=15)
# Axes[1,1]: Compare the number of total rentals for each weather category
number_of_total_rentals = []
for i in sorted(data.weather.unique()):  # sorted so the sums line up with the labels below
    number_of_total_rentals.append(data[data.weather==i]['count'].sum())

sns.barplot(ax=axes[1,1],x=['1','2','3','4'], y=number_of_total_rentals, label='the number of total rentals based on weather')
axes[1,1].legend(loc=1, fontsize=15)
axes[1,1].tick_params(axis='both', which='major', labelsize=15)

In [None]:
# Find the correlation between parameters
my_correlation = data.corr()
plt.figure(figsize=(15,15))
sns.heatmap(my_correlation,cbar=True, square= True, fmt='.2f', annot=True,annot_kws={'size':15}, cmap='Greens')

In [None]:
my_correlation

<div class="alert alert-block">
    <h2 align="center">Step5: Train different models</h2>
</div> 

## 1st model - Linear Regression

### What is Linear Regression ?

> Linear regression analysis is used to predict the value of a variable based on the value of another variable.

* The variable you want to predict is called the dependent variable. 
* The variable you are using to predict the other variable's value is called the independent variable.
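
As a rough illustration of the idea, the sketch below fits a line to synthetic data (an assumption made for this example, not the bike data) in two ways: with the closed-form least-squares formulas for a single feature, and with scikit-learn's `LinearRegression`. The names `x_demo`, `y_demo`, `slope` and `intercept` are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption for illustration): y = 3x + 5 plus small noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 40, size=200)
y_demo = 3.0 * x_demo + 5.0 + rng.normal(0, 1.0, size=200)

# Closed-form least-squares estimates for one feature:
# slope = cov(x, y) / var(x), intercept = mean(y) - slope * mean(x)
slope = np.cov(x_demo, y_demo, bias=True)[0, 1] / np.var(x_demo)
intercept = y_demo.mean() - slope * x_demo.mean()

# The same fit with scikit-learn, which expects a 2-D feature array
model = LinearRegression().fit(x_demo.reshape(-1, 1), y_demo)

print("closed form :", slope, intercept)
print("scikit-learn:", model.coef_[0], model.intercept_)
```

Both approaches minimise the same squared error, so they should agree up to floating-point precision.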

In [None]:
# Define feature and target: According to the determined correlation, I chose the column temp as our feature and count as our target
x= data[['temp']]
y= data[['count']]

# Splitting data to training and testing data
x_train, x_test , y_train , y_test = train_test_split(x,y,test_size=.2,random_state=4)

# Set the model and find coefficient and intercept
regressor = LinearRegression()
regressor.fit(x_train, y_train)
coefficient = regressor.coef_
intercept = regressor.intercept_
print('coefficient = ',coefficient)
print('intercept = ',intercept)

In [None]:
# Predict with our model on x_test
y_predicted = regressor.predict(x_test)

In [None]:
# Plot data with the fitted line
fig, axs = plt.subplots(1, 2, figsize=(20, 10))
fig.suptitle('Our Graphs', fontweight='bold', fontsize=20)
axs[0].scatter(x=x, y=y, c='blue', marker='*', linewidths=1, label='Data')
axs[0].grid()
axs[0].set(xlabel='temp', ylabel='count')
axs[0].legend()
axs[1].scatter(x=x_test, y=y_test, c='red', marker='*', linewidths=1, label='data_test')
axs[1].scatter(x=x_train, y=y_train, c='blue', marker='*', linewidths=1, label='data_train')
# Draw the regression line from the predictions on x_test
axs[1].plot(x_test['temp'].values, y_predicted.ravel(), c='black', linewidth=2, label='fit line')
axs[1].set(xlabel='temp', ylabel='count')
axs[1].legend()

In [None]:
# Create a score table and a function to store the scores obtained from the different models
form = {'Model':[],'MAE':[],'MSE':[],'sqrt MSE':[]}
score_table = pd.DataFrame(data=form)
def store_scores(name_of_model, position_in_table, y_test, y_predicted):
    # Each metric is truncated to an integer before it is stored
    score_table.loc[position_in_table, ['Model','MAE','MSE','sqrt MSE']] = [
        name_of_model,
        int(metrics.mean_absolute_error(y_test, y_predicted)),
        int(metrics.mean_squared_error(y_test, y_predicted)),
        int(np.sqrt(metrics.mean_squared_error(y_test, y_predicted)))]
    return score_table

store_scores('Linear Regression',0,y_test,y_predicted)
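
The three scores in the table are simple averages over the prediction errors. As a small sketch with made-up numbers (`y_true` and `y_pred` are hypothetical values, not results from this notebook), they can be computed by hand and compared with `sklearn.metrics`:

```python
import numpy as np
from sklearn import metrics

# Hypothetical true and predicted rental counts
y_true = np.array([10, 20, 30, 40])
y_pred = np.array([12, 18, 33, 36])

errors = y_true - y_pred          # [-2, 2, -3, 4]
mae = np.mean(np.abs(errors))     # mean absolute error: (2 + 2 + 3 + 4) / 4
mse = np.mean(errors ** 2)        # mean squared error: (4 + 4 + 9 + 16) / 4
rmse = np.sqrt(mse)               # root of the MSE, back in the units of the target

print(mae, mse, rmse)
print(metrics.mean_absolute_error(y_true, y_pred),
      metrics.mean_squared_error(y_true, y_pred))
```

MAE and the square root of MSE are both in units of rentals, which makes them easier to interpret than MSE itself. Note that `store_scores` wraps each metric in `int()`, so the table shows truncated values.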

## 2nd model: Linear Regression with multiple features

### What is Multiple Linear Regression ?

> Multiple linear regression is a regression model that estimates the relationship between a quantitative dependent variable and two or more independent variables by fitting a linear function (a plane or hyperplane, rather than a single straight line).
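
To make this concrete, here is a minimal sketch on synthetic data (an assumption for illustration; `X_demo`, `true_coef` and `beta` are hypothetical names). It solves the least-squares problem directly with `np.linalg.lstsq` on a design matrix that has a column of ones for the intercept, then checks the result against `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): three features, known coefficients, intercept 4
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(300, 3))
true_coef = np.array([2.0, -1.0, 0.5])
y_demo = X_demo @ true_coef + 4.0 + rng.normal(0, 0.1, size=300)

# Prepend a column of ones so the first fitted weight acts as the intercept
X_design = np.column_stack([np.ones(len(X_demo)), X_demo])

# Least-squares solution of X_design @ beta ~ y_demo
beta, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)

# The same model with scikit-learn
model = LinearRegression().fit(X_demo, y_demo)

print("lstsq       :", beta[0], beta[1:])
print("scikit-learn:", model.intercept_, model.coef_)
```

Each coefficient describes how much the prediction changes when that feature increases by one unit while the others stay fixed.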

In [None]:
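# Define features and target:
# Features are all columns except the last three (casual, registered, count).
# casual + registered equals count, so keeping them as features would leak the target.
x = data.iloc[:,:-3].values
y = data.iloc[:,-1].values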

#Splitting the dataset into the Training set and Test set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.2, random_state=2)

# Feature scaling (standardized)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

# Set the model
regressor1 = LinearRegression()
regressor1.fit(X=x_train, y=y_train)
coefficient = regressor1.coef_
intercept = regressor1.intercept_
print('coefficient = ',coefficient)
print('intercept = ',intercept)
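
The feature scaling step in the cell above can be reproduced by hand. This is a small sketch with hypothetical arrays (`train` and `test` are made-up values), showing that `StandardScaler` subtracts the training mean and divides by the training standard deviation, and that the test data reuses the training statistics:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrices with two columns
train = np.array([[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]])
test = np.array([[40.0, 7.0]])

# Fit on the training data only, then transform both sets
scaler = StandardScaler().fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

# Manual version: z = (x - mean_train) / std_train, column by column
manual_train = (train - train.mean(axis=0)) / train.std(axis=0)
manual_test = (test - train.mean(axis=0)) / train.std(axis=0)

print(scaled_train)
print(scaled_test)
```

Any data passed to a model trained on scaled features needs the same `transform` call, otherwise the model sees values on a different scale than it was trained on.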

In [None]:
# Predict with our model on x_test
y_predicted = regressor1.predict(x_test)

In [None]:
# Store scores of this model in table
store_scores('Multiple Linear Regression',1,y_test,y_predicted)

## 3rd model: K nearest neighbours (KNN)

### What is KNN ?
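> KNN makes a prediction for a data point from the data points nearest to it. It is best known as a classification method, where it estimates which group a point belongs to based on the groups of its nearest neighbours. For regression, as used here, the prediction is the average target value of the k nearest neighbours.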
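
The regression variant is short enough to write out. Below is a minimal sketch on a hypothetical one-feature dataset (`knn_predict`, `X_train_demo` and `y_train_demo` are illustrative names), using Euclidean distance and an unweighted mean, compared against `KNeighborsRegressor`:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one feature, one target
X_train_demo = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train_demo = np.array([0.0, 10.0, 20.0, 30.0, 100.0])

def knn_predict(query, X, y, k):
    """Average the targets of the k training points closest to query."""
    distances = np.linalg.norm(X - query, axis=1)  # Euclidean distance to each row
    nearest = np.argsort(distances)[:k]            # indices of the k smallest distances
    return y[nearest].mean()

query = np.array([1.4])
manual = knn_predict(query, X_train_demo, y_train_demo, k=2)

# The same prediction with scikit-learn
knn_demo = KNeighborsRegressor(n_neighbors=2).fit(X_train_demo, y_train_demo)
library = knn_demo.predict(query.reshape(1, -1))[0]

print(manual, library)
```

For the query 1.4 the two nearest points are 1.0 and 2.0, so the prediction is the mean of their targets, 15.0. Because the method relies on distances, features on very different scales should be standardized first.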

In [None]:
# Define features and target:
x = data.iloc[:,:-3].values
y = data.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.2, random_state=2)

# Feature scaling (standardized)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In [None]:
# Now we must find the optimal value for k by plotting the MAE for each k, so we have:
MAE_list = []
for i in range(1,10):
    
    knn = KNeighborsRegressor(n_neighbors=i)
    knn.fit(x_train,y_train)
    y_predicted = knn.predict(x_test)
    MAE_list.append(metrics.mean_absolute_error(y_test,y_predicted))

#Plot the values in MAE_list
plt.figure(figsize=(10,6))
plt.plot(range(1,10),MAE_list,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('MAE vs. K Value')
plt.xlabel('K')
plt.ylabel('MAE')
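
The loop above picks k by looking at the MAE on the test set. A common alternative, shown here only as a sketch and not as what this notebook does, is to choose k with cross-validation on the training data so the test set stays untouched. The example uses synthetic data (an assumption) and scikit-learn's `GridSearchCV`:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data (assumption for illustration)
rng = np.random.default_rng(2)
X_demo = rng.uniform(0, 10, size=(200, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * X_demo[:, 1] + rng.normal(0, 0.1, size=200)

# Try k = 1..9 with 5-fold cross-validation, scored by (negated) MAE
search = GridSearchCV(
    KNeighborsRegressor(),
    param_grid={"n_neighbors": list(range(1, 10))},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_demo, y_demo)

best_k = search.best_params_["n_neighbors"]
best_mae = -search.best_score_  # scikit-learn maximises scores, so MAE is negated
print(best_k, best_mae)
```

Selecting hyperparameters this way keeps the final test score a fairer estimate of performance on new data.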

In [None]:
# So the best k is equal to 2 for us
# Set the model and train it
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(x_train,y_train)
y_predicted = knn.predict(x_test)

In [None]:
# Store scores of this model in score_table
store_scores('K_Nearest neighbour',2,y_test,y_predicted)

## 4th model: Decision tree

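> A decision tree is a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. In machine learning, a decision tree regressor repeatedly splits the data on feature thresholds and predicts the mean target value of the training samples in each leaf.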
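
For regression, a tree is grown by repeatedly choosing the split that best separates the target values. The sketch below is a simplified illustration of a single split on one feature (`best_split`, `x_demo` and `y_demo` are hypothetical names and values), choosing the threshold that minimises the summed squared error around the mean of each side:

```python
import numpy as np

# Hypothetical data with two clearly separated groups
x_demo = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y_demo = np.array([5.0, 6.0, 7.0, 50.0, 52.0, 51.0])

def best_split(x, y):
    """Return the threshold on x that minimises the total squared error of both sides."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no threshold can separate identical feature values
        threshold = (x[i - 1] + x[i]) / 2  # midpoint between neighbouring values
        left, right = y[:i], y[i:]
        # Squared error if each side predicts its own mean
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_threshold, best_sse = threshold, sse
    return best_threshold, best_sse

threshold, sse = best_split(x_demo, y_demo)
print(threshold, sse)
```

Here the best threshold is 6.5, which separates the low and high groups and leaves a squared error of 4.0. A full tree applies the same search recursively to each side and across all features.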

In [None]:
# Define features and target:
x = data.iloc[:,:-3].values
y = data.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.2, random_state=2)

# Feature scaling (standardized)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

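# Set the model and train it
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(x_train, y_train)
# Predict on x_test so the score table uses this model's predictions
y_predicted = regressor.predict(x_test)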

In [None]:
# Store scores of this model in score_table
store_scores('Decision tree',3,y_test,y_predicted)

## 5th model: Random forest

> A random forest (or random decision forest) is an ensemble learning method for classification, regression and other tasks that operates by constructing a multitude of decision trees at training time. For regression, the forest's prediction is the average of the individual trees' predictions.
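
The core idea can be sketched with plain decision trees. The example below is a simplified illustration on synthetic data (an assumption), not the internals of `RandomForestRegressor`: each tree is trained on a bootstrap sample (rows drawn with replacement) and the predictions are averaged. Forests can additionally consider only a random subset of features at each split, which is left out here because the example has a single feature.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption for illustration): noisy sine curve
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 10, size=(300, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=300)

trees = []
for seed in range(25):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=seed).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Average the predictions of all trees for two new points
X_new = np.array([[2.0], [5.0]])
all_preds = np.stack([t.predict(X_new) for t in trees])  # shape (25, 2)
forest_pred = all_preds.mean(axis=0)

print(forest_pred)
```

Averaging many trees trained on slightly different samples reduces the variance of a single deep tree, which is why adding more trees tends to stabilise the error.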

In [None]:
# Define features and target:
x = data.iloc[:,:-3].values
y = data.iloc[:,-1].values

#Splitting the dataset into the Training set and Test set
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=.2, random_state=2)

# Feature scaling (standardized)
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In [None]:
# Now we must find the optimal number of estimators (trees) by plotting the MAE, so we have:
MAE_list =[]
print('The program is finding the best number for trees in the model. Please wait.')
for i in range(50,550,50):
    regressor = RandomForestRegressor(n_estimators = i, random_state = 0)
    regressor.fit(x_train,y_train)
    y_predicted = regressor.predict(x_test)
    MAE_list.append(metrics.mean_absolute_error(y_test,y_predicted))
    print("({}% of the program completed)".format(100*i//500))

In [None]:
#Plot the values in MAE_list
plt.figure(figsize=(10,6))
plt.plot(range(50,550,50),MAE_list,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('MAE vs. number of estimators(trees)')
plt.xlabel('Trees')
plt.ylabel('MAE')

In [None]:
# So the best value for trees is equal to 250 for us
# Set the model and train it
regressor = RandomForestRegressor(n_estimators = 250, random_state = 0)
regressor.fit(x_train,y_train)
y_predicted = regressor.predict(x_test)

In [None]:
# Store scores of this model in score_table
store_scores('Random Forest',4,y_test,y_predicted)

<div class="alert alert-block ">
    <h2 align="center">Step 6: Find the Results</h2>
</div> 

In [None]:
# Predict on the real unseen data from the test.csv file
# Now it's time to run the best model, random forest, on test.csv (the unseen data) to predict count values
x_unseen = data_unseen.values  # Features of test.csv (called data_unseen) to run predictions on
x_unseen = sc.transform(x_unseen)  # Apply the same scaling that was fitted on the training features
y_unseen = regressor.predict(x_unseen)

# Turn y_unseen into a data frame
y_unseen = pd.DataFrame(data=y_unseen, columns=['Count'])

# Concatenate y_unseen to data_unseen and name it sampleSubmission, as in the competition notebook
sampleSubmission = pd.concat([data_unseen, y_unseen], axis=1)
# Save the results as result.csv
sampleSubmission.to_csv(path_or_buf='result.csv', index=False)

# Congratulations 

- Congratulations. We have done it together. Take a look at the scores and at the results on the real data!
- You can also find the results in the output section of the right sidebar, saved as result.csv

In [None]:
score_table

## Reference


* [Competition Playground](https://www.kaggle.com/competitions/bike-sharing-demand/overview)
* Unknown author from GitHub/TDS on the Linear / Multiple Linear Regression section (?)
* [Bike Sharing Demand](https://www.kaggle.com/code/werooring/bike-sharing-demand-top-6-6-solution)
* [EDA , Linear Regression , KNN , Decision Tree](https://www.kaggle.com/code/ramasalahat/eda-linear-regression-ridge-k-nn-decision-tree)