# Bike Sharing Demand: Basic EDA and Model Selection
### Goal
To forecast bike rental demand in the Capital Bikeshare program in Washington, D.C. by combining historical usage patterns with weather data.
### Data items
1. **Numerical type**: (use directly)
    - temp: actual temperature in Celsius
    - atemp: "feels like" temperature in Celsius
    - humidity: relative humidity
    - windspeed: wind speed
    - casual: number of bikes rented by unregistered users
    - registered: number of bikes rented by registered users
    - count: total number of bikes rented (the prediction target)
    
2. **Time series**:
    - datetime: split into separate year, month, day, hour, and week components

3. **Categorical data**: (create dummies)
    * season: 1: Spring; 2: Summer; 3: Autumn; 4: Winter
    * holiday: whether the day is a holiday. 0: No; 1: Yes
    * workingday: whether the day is a working day. 0: No; 1: Yes
    * weather: 1: clear; 2: mist or cloudy; 3: light rain or snow; 4: severe weather
    

In [None]:
import numpy as np
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 

In [None]:
# import data 
df = pd.read_csv('../input/bike-sharing-demand/train.csv')

In [None]:
df.head()

In [None]:
df.info()

### Check for null values

In [None]:

df.isnull().sum()

In [None]:
# plot the hourly rental count for the first 120 rows (roughly the first five days)
df[:120].plot(x='datetime', y='count',figsize=(10,5))
plt.xticks(rotation=45)
plt.grid()

In [None]:
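# distribution of the number of bikes rented
# (distplot is deprecated in recent seaborn releases; histplot with kde=True draws a similar plot)
sns.histplot(df['count'], kde=True)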

In [None]:
# the total number of bikes rented is
# count = registered + casual
# so these two columns would leak the target; drop them
df = df.drop(['casual','registered'],axis=1)
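The identity `count = registered + casual` is easy to confirm before the columns are dropped. The sketch below runs the check on a small made-up frame (the `toy` name and its values are illustrative, not rows from `train.csv`); the same expression could be applied to the real `df` one cell earlier.

```python
import pandas as pd

# Toy rows using the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# True only if every row satisfies count == casual + registered
is_sum = (toy["casual"] + toy["registered"]).eq(toy["count"]).all()
print(is_sum)
```

If the check holds on the full data, keeping either column as a feature would hand the model part of the answer, which is why both are removed.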

# Feature engineering

1. Extract information from the date and time
    * year
    * month
    * hour
    * day
2. Normalize the data
3. Create dummies


In [None]:
df['datetime'] = pd.to_datetime(df['datetime'])  # parse the datetime strings into pandas datetime values

In [None]:
# extract information from the date and time
df['year']= df.datetime.dt.year
df['month']=df.datetime.dt.month
df['day']=df.datetime.dt.day
df['hour']=df.datetime.dt.hour
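
The data description mentions a week component, but the cell above does not extract one. A minimal sketch of how it could be added, assuming a day-of-week feature is what is meant (the `toy` frame and the `dayofweek` and `is_weekend` column names are illustrative, not part of the original notebook):

```python
import pandas as pd

# Two made-up timestamps: 2011-01-01 was a Saturday, 2011-01-03 a Monday
toy = pd.DataFrame({
    "datetime": pd.to_datetime(["2011-01-01 00:00:00", "2011-01-03 17:00:00"])
})

# pandas encodes Monday as 0 and Sunday as 6
toy["dayofweek"] = toy["datetime"].dt.dayofweek
toy["is_weekend"] = toy["dayofweek"] >= 5

print(toy)
```

On the real data this would have to run before the `datetime` column is dropped.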

In [None]:
# drop the datetime column
df= df.drop("datetime",axis=1)

In [None]:
sns.barplot(x="month",y="count",data=df)
plt.title('count vs month')

In [None]:
plt.figure(figsize=(10,5))
sns.barplot(x='hour',y='count',data=df)
plt.title('count vs hour')

In [None]:
# box plots of count against each categorical feature:
# season, weather, workingday, holiday
figure, axes = plt.subplots(2, 2, figsize=(10, 10))
sns.boxplot(x='season', y='count', data=df, ax=axes[0, 0])
sns.boxplot(x='weather', y='count', data=df, ax=axes[0, 1])
sns.boxplot(x='workingday', y='count', data=df, ax=axes[1, 0])
sns.boxplot(x='holiday', y='count', data=df, ax=axes[1, 1])

***Normalization*** here means standardizing each numeric feature: subtract the column mean and divide by the column standard deviation, so that every feature ends up with zero mean and unit variance. This keeps features measured on different scales (temperature, humidity, wind speed) comparable, which matters for distance-based models such as k-nearest neighbors. The mean and standard deviation of each column are stored so the original values can be recovered later.

In [None]:
numeric_features = ['temp', 'humidity', 'atemp', 'windspeed']
# store the mean and std in a dictionary so that the original values can be retrieved later
scaled_features = {}
for i in numeric_features:
    mean, std = df[i].mean(), df[i].std()
    scaled_features[i] = [mean, std]
    # broadcasting standardizes the whole column; plain assignment also lets
    # the integer humidity column become float without a dtype warning
    df[i] = (df[i] - mean) / std
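
The stored `[mean, std]` pairs are what make the scaling reversible. A minimal sketch of the round trip on a made-up column (the `toy` frame and the `restored` name are illustrative; `scaled_features` mirrors the dictionary built above):

```python
import pandas as pd

# Made-up temperatures, chosen so the mean is 20 and the sample std is 10
toy = pd.DataFrame({"temp": [10.0, 20.0, 30.0]})

scaled_features = {}
mean, std = toy["temp"].mean(), toy["temp"].std()  # pandas std uses ddof=1
scaled_features["temp"] = [mean, std]

# forward transform: z = (x - mean) / std
scaled = (toy["temp"] - mean) / std

# inverse transform using the stored statistics: x = z * std + mean
stored_mean, stored_std = scaled_features["temp"]
restored = scaled * stored_std + stored_mean

print(scaled.tolist(), restored.tolist())
```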

**Creating dummies** for the categorical variables and concatenating them with the main dataframe

In [None]:
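# create dummies for the categorical variables
dummy_fields = ['season', 'weather', 'hour', 'month']
for i in dummy_fields:
    dummies = pd.get_dummies(df[i], prefix=i, drop_first=False)
    df = pd.concat([df, dummies], axis=1)

In [None]:
# drop the original categorical columns now that the dummies exist
# (the result must be assigned back to df, otherwise the columns stay in place)
df = df.drop(dummy_fields, axis=1)
df.head()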
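
To make the effect of the dummy encoding concrete, here is a small sketch on a made-up frame (the `toy` and `encoded` names are illustrative). It shows that `pd.get_dummies` creates one indicator column per category value that actually occurs, named with the given prefix.

```python
import pandas as pd

# Made-up rows; season 3 never occurs, so no season_3 column is produced
toy = pd.DataFrame({"season": [1, 2, 4], "count": [16, 40, 32]})

dummies = pd.get_dummies(toy["season"], prefix="season")

# replace the original column with its indicator columns
encoded = pd.concat([toy.drop("season", axis=1), dummies], axis=1)

print(list(encoded.columns))
```

With `drop_first=False` every category keeps its own column, so the indicator columns of one feature always sum to 1. That redundancy is harmless for tree and nearest-neighbor models but makes linear-regression coefficients harder to interpret.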

# Train Test Split
Splitting the data into training and test sets


In [None]:
from sklearn.model_selection import train_test_split
x = df.drop("count", axis=1)
y = df["count"]
# 80% train, 20% test; a fixed seed makes the shuffled split reproducible
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=42)

In [None]:
print(x_train.shape)
print(x_test.shape)

# Model Selection
Select the model with the best score on the test set.
Models used:
- linear regression
- random forest regressor
- KNN regressor
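
A note on the score used for comparison: for scikit-learn regressors, `.score()` returns the coefficient of determination R², not a classification accuracy. A minimal sketch of that quantity computed by hand (the helper name `r2_by_hand` is made up for illustration):

```python
import numpy as np

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# perfect predictions give 1.0; always predicting the mean gives 0.0
print(r2_by_hand([1, 2, 3], [1, 2, 3]))
print(r2_by_hand([1, 2, 3], [2, 2, 2]))
```

A value of 1.0 is a perfect fit, 0.0 is no better than predicting the mean, and negative values are possible for models that do worse than that.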

In [None]:
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_log_error,mean_squared_error, r2_score,mean_absolute_error

In [None]:
models = [LinearRegression(), RandomForestRegressor(), KNeighborsRegressor()]
model_names = ['Linear Regression', 'Random Forest Regressor', 'KNeighbors Regressor']
rmse = []
r2_scores = []
for clf in models:
    clf.fit(x_train, y_train)
    test_pred = clf.predict(x_test)
    # root mean squared error on the held-out test set
    rmse.append(np.sqrt(mean_squared_error(y_test, test_pred)))
    # for regressors, score() returns R^2, not classification accuracy
    r2_scores.append(clf.score(x_test, y_test))
d = {'Modelling Algo': model_names, 'RMSE': rmse, 'R2': r2_scores}

In [None]:
data = pd.DataFrame(d)


In [None]:
data
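
The notebook imports `mean_squared_log_error` but never calls it. Since the Kaggle competition behind this dataset is scored by root mean squared logarithmic error (RMSLE), that metric may be worth adding to the comparison. Below is a minimal NumPy sketch; the `rmsle` helper is hypothetical, and clipping negative predictions to zero is an assumption made here because the logarithm is undefined for them (linear regression can predict negative counts).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error, using log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    # assumption: treat negative predictions as zero rentals
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

# identical predictions give an error of zero
print(rmsle([16, 40, 32], [16, 40, 32]))
```

Because it works on logarithms, RMSLE penalizes relative rather than absolute errors, so being off by 10 bikes matters more at 3 a.m. than at rush hour.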

# Make predictions using random forest
Random forest is the best-performing model

In [None]:

clf=RandomForestRegressor()
clf.fit(x_train,y_train)
y_pred = clf.predict(x_test)

In [None]:
plt.figure(figsize=(12,8))
y_test=y_test.reset_index(drop=True)
plt.plot(y_test[0:24*5],label='Data')
plt.plot(y_pred[0:24*5],label='Prediction')
plt.xticks([0,24,48,72,96,120],size=15)
# the split was shuffled, so these are test samples in random order, not consecutive hours
plt.xlabel("test sample index", size=15)
plt.ylabel("count",size=15)
plt.legend()
