<h1 style="color:tomato;">In this notebook I will cover:</h1>

---
- 1.<a href="#1">Importing the libraries</a>


- 2.<a href="#2">Load the data</a>


- 3.<a href="#3">EDA</a>


- 4.<a href="#4">Preprocessing</a>
- - 4-1.<a href="#4_1">Missing values</a>
- - 4-2.<a href="#4_2">Extract month and hour from the datetime</a>
- - 4-3.<a href="#4_3">Train test split</a>
- - 4-4.<a href="#4_4">Scaling</a>



- 5.<a href="#5">Storytelling - Visualization</a>
- - 5-1.<a href="#5_1">Linear correlations</a>
- - 5-2.<a href="#5_2">Seasons comparison</a>
- - 5-3.<a href="#5_3">Work day VS free day</a>
- - 5-4.<a href="#5_4">Mean rentals per month</a>
- - 5-5.<a href="#5_5">Mean rentals per hour</a>



- 6.<a href="#6">Models :</a>
- - 6-1.<a href="#6_1">Simple linear regression</a>
- - 6-2.<a href="#6_2">Multiple linear regression</a>
- - 6-3.<a href="#6_3">Polynomial regression (degree = 2)</a>
- - 6-4.<a href="#6_4">Polynomial regression (degree = 3)</a>
- - 6-5.<a href="#6_5">Polynomial regression (degree = 4)</a>
- - 6-6.<a href="#6_6">K-Nearest Neighbors</a>
- - 6-7.<a href="#6_7">Support Vector Regression</a>
- - 6-8.<a href="#6_8">Regression Tree</a>
- - 6-9.<a href="#6_9">Random forest</a>

- 7.<a href="#7">Models comparison</a>

<a id = "1"></a>
## 1. Importing the libraries

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
from matplotlib.lines import Line2D



from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn import metrics



import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

<a id = "2"></a>
## 2. Load and Prepare Data

In [None]:
train_set = pd.read_csv('../input/bike-sharing-demand/train.csv')
final_test = pd.read_csv('../input/bike-sharing-demand/test.csv')

<a id = "3"></a>
## 3. EDA

In [None]:
print(train_set.columns)
print(final_test.columns)

<h1 style="color:tomato;">Note :</h1>

-  As you can see, the casual and registered columns are missing from test.csv.
-  Why? I will answer this question in the preprocessing phase.

In [None]:
train_set.head()

In [None]:
train_set.info()

<a id = "4"></a>
## 4. Data Preprocessing

In [None]:
train_set.isnull().sum()

<a id = "4_1"></a>
<h3 style="color:#005b96">No missing values, so there is no need for dropna or imputation</h3>

<a id = "4_2"></a>
<h1 style="color:tomato;">Some useful features can be extracted from the datetime column!</h1>
<h2 style="color:#005b96">But :</h2>

- <h3> In train.csv the dates run from the 1st to the 19th of each month </h3>
- <h3> and in test.csv they run from the 20th to the end of the month</h3>
- <h3> Because the two files cover different day ranges, I will not extract the day from datetime</h3>
- <h3 style="color:#7bc043"> I will extract <u>month</u> and <u>hour</u> from the datetime column </h3>
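
As an alternative to splitting the datetime strings by hand, the same two features can be read through pandas' datetime accessor. This is only a sketch: the helper name `add_month_hour` and the two-row toy frame are assumptions for illustration, not part of the notebook.

```python
import pandas as pd


def add_month_hour(df, column="datetime"):
    """Return a copy of df with integer 'month' and 'hour' columns prepended.

    Hypothetical helper: it parses the column with pd.to_datetime instead of
    splitting the string on '-' and ':'.
    """
    parsed = pd.to_datetime(df[column])
    parts = pd.DataFrame(
        {"month": parsed.dt.month, "hour": parsed.dt.hour}, index=df.index
    )
    return pd.concat([parts, df], axis=1)


# Toy rows in the same 'YYYY-MM-DD HH:MM:SS' layout as the dataset (illustrative values)
toy = pd.DataFrame(
    {"datetime": ["2011-01-01 00:00:00", "2012-12-19 23:00:00"], "count": [16, 88]}
)
toy_with_parts = add_month_hour(toy)
print(toy_with_parts)
```

Parsing the column once also makes it easy to add other parts later (for example the year or the weekday) without writing another string split.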

In [None]:
# extracting month and hour 
temp = pd.DataFrame()
temp["month"] = train_set.datetime.apply(lambda x : x.split()[0].split('-')[1]).astype('int')
temp["hour"] = train_set.datetime.apply(lambda x : x.split()[1].split(':')[0]).astype('int')
train_set = pd.concat([temp , train_set] , axis = 1)
train_set.sample(10)

In [None]:
train_set.drop(labels = 'datetime' , axis = 1 , inplace = True)
train_set.info()

<h1 style="color:tomato;">Note :</h1>

- [x] It's obvious that casual + registered = count
- [x] test.csv has no casual and registered columns <b>because if it had them there would be nothing to predict: we could simply add them together to get the total count!</b>

<h3 style="color:#7bc043">So we should drop the casual and registered columns</h3>
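
Before dropping the two columns, the identity casual + registered = count can be checked directly instead of taken for granted. A minimal sketch, assuming a frame with these three column names; the helper `count_is_sum_of_parts` and the toy values are made up for illustration.

```python
import pandas as pd


def count_is_sum_of_parts(df):
    """True if casual + registered equals count on every row (hypothetical helper)."""
    return bool((df["casual"] + df["registered"] == df["count"]).all())


# Illustrative rows only; on the real data one would pass train_set instead
toy = pd.DataFrame(
    {"casual": [3, 8, 5], "registered": [13, 32, 27], "count": [16, 40, 32]}
)
identity_holds = count_is_sum_of_parts(toy)
print("casual + registered == count on every row:", identity_holds)
```

If the check returned False on the real data, keeping one of the two columns as a feature would still leak most of the target, so dropping both remains the safe choice.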

In [None]:
train_set.drop(labels = ['casual','registered'] , axis = 1 , inplace = True)
train_set.head()

<h3 style="color:#7bc043">I will do the scaling at the end of the visualization phase</h3>

<a id = "5"></a>
## 5. Storytelling - Visualization

<a id = "5_1"></a>
<h2 style="color:tomato;">Correlation between attributes :</h2>

In [None]:
corr = train_set.corr()
plt.figure(figsize = (9,5))
sns.heatmap(corr , cmap = 'coolwarm' , annot=True , fmt = '.2f')

<h2 style="color:#005b96">Heat map results :</h2>

- <h3>Among the input features, 'hour' has the strongest correlation with 'count'</h3>
- <h3>So we will use 'hour' for the simple linear regression</h3>
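
Reading the strongest correlation off a heat map by eye is error-prone, so the ranking can also be computed. The sketch below assumes an all-numeric frame with a `count` column; `rank_features_by_correlation` is a hypothetical helper and the toy numbers are invented purely to make the example runnable.

```python
import pandas as pd


def rank_features_by_correlation(df, target="count"):
    """Absolute Pearson correlation of every feature with the target, strongest first."""
    corr_with_target = df.corr()[target].drop(target)
    return corr_with_target.abs().sort_values(ascending=False)


# Invented values: 'count' grows almost linearly with 'hour', 'humidity' is noisy
toy = pd.DataFrame(
    {
        "hour": [0, 6, 12, 18, 23],
        "humidity": [80, 60, 75, 50, 70],
        "count": [10, 60, 120, 185, 230],
    }
)
ranking = rank_features_by_correlation(toy)
print(ranking)
```

Keep in mind that Pearson correlation only measures linear association, so a feature with a strongly non-linear effect on the target can still rank low here.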

<a id = "5_2"></a>

In [None]:
sns.catplot(data = train_set.replace({'season':{1:'spring' , 2:'summer' , 3:'fall' , 4:'winter'}} , inplace = False)\
           , x = 'season' , y = 'count' , kind = 'bar' , ci = None ,  alpha = 0.8)
plt.title('Comparison between seasons' , fontdict = {'family':'serif' , 'size':16})
print( "mean rentals in fall :", train_set[train_set['season'] == 3]['count'].mean())

<h2 style="color:#005b96">Bar plot results :</h2>

- <h3>Among the seasons, fall has the highest mean rentals.</h3>
- <h3>Spring has the lowest mean rentals.</h3>
- <h3>So we can say that people rent bicycles most in fall!</h3>

<a id = "5_3"></a>

In [None]:
# mrw = mean rentals on working day
# mrf = mean rentals on free day

mrw = train_set[train_set['workingday'] == 1]['count'].mean()
mrf = train_set[train_set['workingday'] == 0]['count'].mean()

plt.pie([ mrw , mrf ] , labels = ['working day' , 'free day'] , explode = [0.1,0] , startangle = 10 , shadow = True)
plt.text(-0.4,0.5,'%f'%(mrw))
plt.text(-0.3,-0.5,'%f'%(mrf))
plt.show()

<h2 style="color:#005b96">Pie plot results :</h2>

- <h3>There is no big difference in mean rentals between working days and free days</h3>
- <h3>Still, people rent bicycles slightly more on working days than on free days!</h3>
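
The two means compared above can also be obtained with a single `groupby`, which generalizes to the season and month comparisons as well. A small sketch with invented numbers; `mean_rentals_by` is a hypothetical helper name.

```python
import pandas as pd


def mean_rentals_by(df, key, target="count"):
    """Mean of the target column for each distinct value of key."""
    return df.groupby(key)[target].mean()


# Invented rows; on the real data one would pass train_set and 'workingday'
toy = pd.DataFrame({"workingday": [1, 1, 0, 0], "count": [200, 180, 190, 170]})
means = mean_rentals_by(toy, "workingday")
print(means)
```

Calling the same helper with `"season"` or `"month"` as the key would give the per-season and per-month means without an explicit loop.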

<a id = "5_4"></a>

In [None]:
mean_per_month = pd.DataFrame()
means = []
mean_per_month['month'] = ['January','February','March','April','May','June','July','August','September','October'\
                                   ,'November','December']
for i in range(1,13):
    means.append(train_set[train_set['month'] == i]['count'].mean())
    
mean_per_month['mean'] = means
mean_per_month.sort_values('mean',ascending = False , inplace = True)
print("mean rentals per month in descending order :")
mean_per_month

In [None]:
ex = [0.2,0,0,0,0,0,0,0,0,0,0,0]
plt.pie(mean_per_month['mean'] , labels = mean_per_month['month'] , shadow = True , explode = ex )
plt.title('Mean per month')
plt.show()

<h2 style="color:#005b96">Pie plot results :</h2>

- <h3>Starting at January and moving clockwise around the pie, the mean rentals increase</h3>
- <h3>People rent bicycles in June more than in any other month</h3>
- <h3>January has the lowest usage of rental bicycles!</h3>

<a id = "5_5"></a>

In [None]:
legend_dict = {'spring':'tab:green' , 'summer':'tab:red' , 'fall':'gold' , 'winter':'tab:blue'}
line_color = ['tab:green','tab:red','gold','tab:blue']


for i in range(1,5): # we have 4 seasons from 1 to 4
    y = []
    for j in range(24): # also 24 hours from 0 to 23
        y.append(train_set[(train_set['hour'] == j)&(train_set['season'] == i)]['count'].mean())
    plt.plot(np.arange(24) , y , alpha = 0.9 , linestyle = '--' , linewidth = 2.2 , color = line_color[i-1])

    
handles = [Line2D([0], [0], marker='o', color='w', markerfacecolor=c, label=l, markersize=8) for l,c in legend_dict.items()]
plt.legend(title='Seasons', handles=handles, bbox_to_anchor=(1.05, 1), loc='upper left')
plt.plot([8,8] , [0,600] , linestyle = ':' , color = "k")
plt.plot([17,17] , [0,600] , linestyle = ':', color = "k")

plt_font = {'family':'serif' , 'size':14}
plt.xlabel('Hour' , fontdict = plt_font)
plt.ylabel('Mean rentals' , fontdict = plt_font)
plt.text(8.5,580,'x = 8')
plt.text(17.5,580,'x = 17')

<h2 style="color:#005b96">Line plot results :</h2>

- <h4>Each colored line shows the mean number of rentals per hour of the day for one season.</h4>
- <h4>The seasons behave in approximately the same way (their maximums and minimums fall at the same hours).</h4>
- <h4>x = 17 is the global maximum and there is a local maximum at x = 8</h4>
- <h4>So rentals peak at 8:00 and 17:00 (people are going to work and returning from work)</h4>
- <h4>The global minimum occurs around 4:00 </h4>
- <h4>Rentals rise from 4:00 to 8:00 and again from 15:00 to 17:00</h4>
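
The peak and quiet hours read from the plot can be confirmed numerically by pivoting the data into an hour-by-season table. This is a sketch under assumptions: `hourly_profile` is a hypothetical helper, and the toy frame contains only three hours with invented counts.

```python
import pandas as pd


def hourly_profile(df):
    """Mean count for each (hour, season) pair: rows are hours, columns are seasons."""
    return df.pivot_table(index="hour", columns="season", values="count", aggfunc="mean")


# Invented values for two seasons and three hours
toy = pd.DataFrame(
    {
        "hour": [4, 8, 17, 4, 8, 17],
        "season": [1, 1, 1, 3, 3, 3],
        "count": [5, 250, 330, 8, 400, 520],
    }
)
profile = hourly_profile(toy)
peak_hours = profile.idxmax()   # hour with the highest mean, per season
quiet_hours = profile.idxmin()  # hour with the lowest mean, per season
print(peak_hours)
print(quiet_hours)
```

The same table can be drawn directly with `profile.plot()`, which gives one line per season without the nested loops.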

<a id = "4_3"></a>
- <h2 style = color:tomato>Train test split:</h2>

In [None]:
x_set = train_set.iloc[:,:-1].values
y_set = train_set.iloc[:,-1]

x_train , x_test , y_train , y_test = train_test_split(x_set , y_set , test_size = 0.2 
                                                       , random_state = 1)

print("x_train ->" , x_train.shape)
print("y_train ->" , y_train.shape)
print("x_test ->" , x_test.shape)
print("y_test ->" , y_test.shape)

<a id = "4_4"></a>
- <h2 style = color:tomato>Scaling the data:</h2>

In [None]:
sc = StandardScaler()

x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

<a id = "6"></a>
## 6. Train the model

<a id = "6_1"></a>
- <h3 style = color:tomato>Simple linear regression</h3>

In [None]:
# "simple" means that only one input feature is used
# remember that 'hour' has the highest correlation with 'count'

slr = LinearRegression()
slr.fit(np.reshape(x_train[:,1] , (-1 , 1)) , y_train)

slr_pred_train = slr.predict(np.reshape(x_train[:,1] , (-1 , 1)))
slr_pred_test = slr.predict(np.reshape(x_test[:,1] , (-1 , 1)))

In [None]:
slr_r2 = metrics.r2_score(y_test , slr_pred_test)
print("r2_score =" , slr_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , slr_pred_test))

<a id = "6_2"></a>
- <h3 style = color:tomato>Multiple linear regression</h3>

In [None]:
# let's use all the input features!
mlr = LinearRegression()
mlr.fit(x_train , y_train)

mlr_pred_train = mlr.predict(x_train)
mlr_pred_test = mlr.predict(x_test)

In [None]:
mlr_r2 = metrics.r2_score(y_test , mlr_pred_test)
print("r2 score =" , mlr_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , mlr_pred_test))

<a id = "6_3"></a>
- <h3 style = color:tomato>Polynomial regression(degree = 2)</h3>

In [None]:
features_deg2 = PolynomialFeatures(degree = 2)

x_train_deg2 = features_deg2.fit_transform(x_train)
x_test_deg2 = features_deg2.transform(x_test)  # reuse the transformer fitted on the training data

In [None]:
# pr2 = polynomial regression degree 2
pr2 = LinearRegression()
pr2.fit(x_train_deg2 , y_train)

pr2_pred_train = pr2.predict(x_train_deg2) 
pr2_pred_test = pr2.predict(x_test_deg2)

In [None]:
pr2_r2 = metrics.r2_score(y_test , pr2_pred_test)
print("r2 score =" , pr2_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , pr2_pred_test))

<a id = "6_4"></a>
- <h3 style = color:tomato>Polynomial regression(degree = 3)</h3>

In [None]:
features_deg3 = PolynomialFeatures(degree = 3)

x_train_deg3 = features_deg3.fit_transform(x_train)
x_test_deg3 = features_deg3.transform(x_test)  # reuse the transformer fitted on the training data

In [None]:
pr3 = LinearRegression()
pr3.fit(x_train_deg3 , y_train)

pr3_pred_train = pr3.predict(x_train_deg3) 
pr3_pred_test = pr3.predict(x_test_deg3)

In [None]:
pr3_r2 = metrics.r2_score(y_test , pr3_pred_test)
print("r2 score =" , pr3_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , pr3_pred_test))

<a id = "6_5"></a>
- <h3 style = color:tomato>Polynomial regression(degree = 4)</h3>

In [None]:
features_deg4 = PolynomialFeatures(degree = 4)

x_train_deg4 = features_deg4.fit_transform(x_train) 
x_test_deg4 = features_deg4.transform(x_test)  # reuse the transformer fitted on the training data

In [None]:
pr4 = LinearRegression()
pr4.fit(x_train_deg4 , y_train)

pr4_pred_train = pr4.predict(x_train_deg4)
pr4_pred_test = pr4.predict(x_test_deg4)

In [None]:
pr4_r2 = metrics.r2_score(y_test , pr4_pred_test)
print("r2 score =" , pr4_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , pr4_pred_test))

- <h3 style = color:#005b96>I stop increasing the degree here because higher degrees become too costly in training time</h3>
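
The cost grows quickly because the number of polynomial terms is combinatorial in the degree. For n input features and degree d (bias column included), the expansion has C(n + d, d) columns. The sketch below assumes 10 input features, which is what remains after the preprocessing above; `poly_feature_count` is a hypothetical helper.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def poly_feature_count(n_features, degree):
    """Number of output columns of a full polynomial expansion, bias term included."""
    return comb(n_features + degree, degree)


# Assumption: 10 input features, as in the scaled training matrix
counts = {degree: poly_feature_count(10, degree) for degree in range(1, 6)}
for degree, n_columns in counts.items():
    print("degree", degree, "->", n_columns, "columns")

# Cross-check one value against scikit-learn itself
check = PolynomialFeatures(degree=3).fit(np.zeros((1, 10))).n_output_features_
print("PolynomialFeatures(degree=3) reports", check, "columns")
```

Going from degree 4 to degree 5 roughly triples the number of columns, and the linear regression fit has to handle all of them.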

<a id = "6_6"></a>
- <h3 style = color:tomato>K_Nearest Neighbors(KNN)</h3>

In [None]:
# let's find the best k by trial and error
acc = [] # r2_score per k (used here as the "accuracy" measure)
max_r2 = 0
best_k = 0
for k in range(1,31):
    knr = KNeighborsRegressor(n_neighbors = k)
    knr.fit(x_train , y_train)
    knr_pred = knr.predict(x_test)
    r2 = metrics.r2_score(y_test , knr_pred)
    acc.append(r2)
    if r2 > max_r2:
        max_r2 = r2
        best_k = k

print("best k:",best_k , " with r2_score = " , max_r2)
plt.plot(np.arange(1,31) , acc)

In [None]:
# best k is 4
knr = KNeighborsRegressor(n_neighbors = 4)
knr.fit(x_train , y_train)

knr_pred_train = knr.predict(x_train)
knr_pred_test = knr.predict(x_test)

# overfitting check :
knr_r2_train = metrics.r2_score(y_train , knr_pred_train)
knr_r2_test = metrics.r2_score(y_test , knr_pred_test)
print("%f is not much bigger than %f, so the model is not heavily overfitted!"%(knr_r2_train , knr_r2_test))

In [None]:
knr_r2 = knr_r2_test
print("r2 score =" , knr_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , knr_pred_test))

<a id = "6_7"></a>
- <h3 style = color:tomato>Support Vector Regression(SVR)</h3>

In [None]:
svr = SVR(kernel = 'rbf')
svr.fit(x_train , y_train)

svr_pred_train = svr.predict(x_train)
svr_pred_test = svr.predict(x_test)

In [None]:
svr_r2 = metrics.r2_score(y_test , svr_pred_test)
print("r2 score =" , svr_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , svr_pred_test))

<a id = "6_8"></a>
- <h3 style = color:tomato>Regression Tree</h3>

In [None]:
# let's find the best min_samples_leaf by trial and error
acc_test = [] # test r2_score per min_samples_leaf value
acc_train = []
max_r2_test = 0
best_k_test = 0
for k in range(1,201):
    rt = DecisionTreeRegressor(min_samples_leaf = k , random_state = 1)
    rt.fit(x_train , y_train)
    rt_pred_train = rt.predict(x_train)
    rt_pred_test = rt.predict(x_test)
    r2_train = metrics.r2_score(y_train , rt_pred_train)
    r2_test = metrics.r2_score(y_test , rt_pred_test)
    acc_test.append(r2_test)
    acc_train.append(r2_train)
    if r2_test > max_r2_test:
        max_r2_test = r2_test
        best_k_test = k

print("best min_samples_leaf:",best_k_test , " with r2_score = " , max_r2_test)
plt.plot(np.arange(1,201) , acc_train , color = 'tab:blue')
plt.plot(np.arange(1,201) , acc_test , color = 'tab:red')

- <h4>The plot above shows that with min_samples_leaf = 1 the model is not learning, it just memorizes the data: r2 on the training set is 1, but r2 on the test set is poor!</h4>
- <h4>As min_samples_leaf increases, the training r2 decreases and the test r2 increases at first, because the model is generalizing instead of memorizing</h4>
- <h4>The best result occurs at min_samples_leaf = 8</h4>
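
One caveat: picking min_samples_leaf by its score on the test set makes the reported test r2 slightly optimistic. A common alternative is to choose it by cross-validation on the training data only. The sketch below is not what this notebook does; it uses synthetic data from `make_regression` as a stand-in and an arbitrary grid of candidate values.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the training split (an assumption, not the bike data)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

# Candidate values are arbitrary; a finer grid works the same way
leaf_grid = {"min_samples_leaf": [1, 2, 4, 8, 16, 32]}

search = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    leaf_grid,
    cv=5,          # 5-fold cross-validation on the training data
    scoring="r2",  # same metric as the loop above
)
search.fit(X, y)

best_leaf = search.best_params_["min_samples_leaf"]
print("best min_samples_leaf:", best_leaf)
print("mean cross-validated r2:", search.best_score_)
```

With this approach the test split is touched only once, at the very end, so its r2 stays an honest estimate. The same pattern would apply to choosing k for the K-Nearest Neighbors model.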

In [None]:
rt = DecisionTreeRegressor(min_samples_leaf = 8 , random_state = 1)
rt.fit(x_train , y_train)

rt_pred_train = rt.predict(x_train)
rt_pred_test = rt.predict(x_test)

In [None]:
rt_r2 = metrics.r2_score(y_test , rt_pred_test)
print("r2 score =" , rt_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , rt_pred_test))

<a id = "6_9"></a>
- <h3 style = color:tomato>Random Forest Regressor</h3>

In [None]:
rfr = RandomForestRegressor(random_state = 1)
rfr.fit(x_train , y_train)

rfr_pred_train = rfr.predict(x_train) 
rfr_pred_test = rfr.predict(x_test)

In [None]:
rfr_r2 = metrics.r2_score(y_test , rfr_pred_test)
print("r2 score =" , rfr_r2)
print("mean squared error =" , metrics.mean_squared_error(y_test , rfr_pred_test))

<a id = "7"></a>
- <h2 style = color:tomato>Models comparison</h2>

In [None]:
model_names = ['SimpleLinearRegression' , 'MultipleLinearRegression' , 'PolyRegressionD2' , 'PolyRegressionD3',
              'PolyRegressionD4' , '4NearestNeighbors' , 'SupportVectorRegressor' , 'RegressionTree',
              'RandomForestRegressor']
model_r2_scores = [slr_r2 , mlr_r2 , pr2_r2 , pr3_r2 , pr4_r2 , knr_r2 , svr_r2 , rt_r2 , rfr_r2]

plt.figure(figsize = (20,7))
sns.barplot(x = model_names , y = model_r2_scores)
plt.title("Model comparison" , fontdict = {'family':'serif' , 'size':14})
plt.ylabel("R2 score" , fontdict = {'family':'serif' , 'size':14})

In [None]:
temp2 = pd.DataFrame()
temp2["month"] = final_test.datetime.apply(lambda x : x.split()[0].split('-')[1]).astype('int')
temp2["hour"] = final_test.datetime.apply(lambda x : x.split()[1].split(':')[0]).astype('int')
final_test = pd.concat([temp2 , final_test] , axis = 1)

input_data = final_test.drop(labels = 'datetime' , axis = 1 , inplace = False)

In [None]:
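# reuse the scaler fitted on the training data (sc) instead of fitting a new one,
# so the test features are scaled exactly like the features the model was trained on
input_data = sc.transform(input_data.values)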

final_predictions = rfr.predict(input_data)
final_test['count'] = final_predictions
final_test = final_test[['datetime' , 'count']]
final_test.to_csv('./submission.csv' , index = False)

<h2 style = color:tomato>Thanks for reading</h2>
<h2 style = color:tomato>I hope this was helpful</h2>