# <center>Bike Sharing</center>

## Data

In [1]:
# import libraries
import numpy as np
import pandas as pd
import sklearn.tree as tree
import sklearn.metrics as eval_m
import sklearn.preprocessing as pre_pro
import sklearn.feature_selection as f_selec
from sklearn.model_selection import train_test_split

In [2]:
# load data
data = pd.read_csv("data/SeoulBikeData.csv",encoding = 'unicode_escape')

In [3]:
data.head()

| | Date | Rented Bike Count | Hour | Temperature(°C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature(°C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) | Seasons | Holiday | Functioning Day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 01/12/2017 | 254 | 0 | -5.2 | 37 | 2.2 | 2000 | -17.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 1 | 01/12/2017 | 204 | 1 | -5.5 | 38 | 0.8 | 2000 | -17.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 2 | 01/12/2017 | 173 | 2 | -6.0 | 39 | 1.0 | 2000 | -17.7 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 3 | 01/12/2017 | 107 | 3 | -6.2 | 40 | 0.9 | 2000 | -17.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |
| 4 | 01/12/2017 | 78 | 4 | -6.0 | 36 | 2.3 | 2000 | -18.6 | 0.0 | 0.0 | 0.0 | Winter | No Holiday | Yes |


### [Data Description](https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand)
Rental bikes have been introduced in many cities to improve mobility. Making them available and accessible to the public at the right time reduces waiting time, so keeping the city supplied with a stable number of rental bikes is a major concern. The key task is predicting the number of bikes required at each hour.

The dataset contains weather information (Temperature, Humidity, Windspeed, Visibility, Dewpoint, Solar radiation, Snowfall, Rainfall), the number of bikes rented per hour and date information.

1) Date - day/month/year (e.g. `01/12/2017`)
2) Rented Bike Count - count of bikes rented at each hour: `Target`
3) Hour - hour of the day
4) Temperature - Celsius
5) Humidity - %
6) Wind speed - m/s
7) Visibility - 10 m
8) Dew point temperature - Celsius
9) Solar Radiation - MJ/m2
10) Rainfall - mm
11) Snowfall - cm
12) Seasons - Winter: 3, Spring: 1, Summer: 0, Autumn: 2
13) Holiday - Holiday: 1, No Holiday: 0
14) Functioning Day - Yes (functional hours): 1, No (non-functional hours): 0

## Task
`Regression Problem`

## Data Preparation

In [4]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Rented Bike Count | 8760.0 | 704.602055 | 644.997468 | 0.0 | 191.0 | 504.5 | 1065.25 | 3556.0 |
| Hour | 8760.0 | 11.5 | 6.922582 | 0.0 | 5.75 | 11.5 | 17.25 | 23.0 |
| Temperature(°C) | 8760.0 | 12.882922 | 11.944825 | -17.8 | 3.5 | 13.7 | 22.5 | 39.4 |
| Humidity(%) | 8760.0 | 58.226256 | 20.362413 | 0.0 | 42.0 | 57.0 | 74.0 | 98.0 |
| Wind speed (m/s) | 8760.0 | 1.724909 | 1.0363 | 0.0 | 0.9 | 1.5 | 2.3 | 7.4 |
| Visibility (10m) | 8760.0 | 1436.825799 | 608.298712 | 27.0 | 940.0 | 1698.0 | 2000.0 | 2000.0 |
| Dew point temperature(°C) | 8760.0 | 4.073813 | 13.060369 | -30.6 | -4.7 | 5.1 | 14.8 | 27.2 |
| Solar Radiation (MJ/m2) | 8760.0 | 0.569111 | 0.868746 | 0.0 | 0.0 | 0.01 | 0.93 | 3.52 |
| Rainfall(mm) | 8760.0 | 0.148687 | 1.128193 | 0.0 | 0.0 | 0.0 | 0.0 | 35.0 |
| Snowfall (cm) | 8760.0 | 0.075068 | 0.436746 | 0.0 | 0.0 | 0.0 | 0.0 | 8.8 |


### Feature Engineering

In [5]:
data["Date"] = pd.to_datetime(data["Date"],format='%d/%m/%Y')

In [6]:
data["Day"] = data["Date"].dt.day
data["Month"] = data["Date"].dt.month
data["Year"] = data["Date"].dt.year

In [7]:
# drop "date" feature
data.drop("Date", axis=1, inplace=True)

In [8]:
# map Seasons
def seasons_to_num(row):
    if row == "Summer":
        return 0
    elif row == "Spring":
        return 1
    elif row == "Autumn":
        return 2
    elif row == "Winter":
        return 3
    else:
        return np.nan
    
# map Holiday
def map_holiday(row):
    if row == "Holiday":
        return 1
    elif row == "No Holiday":
        return 0
    else:
        return np.nan


# map Functioning Day
def map_functioning_day(row):
    if row == "Yes":
        return 1
    elif row == "No":
        return 0
    else:
        return np.nan

Map the Seasons, Holiday, and Functioning Day features to numeric codes.

In [9]:
# map Seasons
data["Seasons"] = data["Seasons"].apply(seasons_to_num)

# map Holiday
data["Holiday"] = data["Holiday"].apply(map_holiday)

# map Functioning Day
data["Functioning Day"] = data["Functioning Day"].apply(map_functioning_day)
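The three helper functions above can also be written as dictionary lookups. The sketch below is one possible alternative, not the notebook's method; it uses a small hand-made frame (`demo`, hypothetical, with a made-up `"Monsoon"` value) instead of the real CSV. `Series.map` returns `NaN` for values missing from the dictionary, which matches the `else: return np.nan` branches.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Seoul bike data.
demo = pd.DataFrame({
    "Seasons": ["Winter", "Spring", "Summer", "Autumn", "Monsoon"],
    "Holiday": ["No Holiday", "Holiday", "No Holiday", "No Holiday", "Holiday"],
    "Functioning Day": ["Yes", "No", "Yes", "Yes", "Yes"],
})

# Same numeric codes as the helper functions in the notebook.
mappings = {
    "Seasons": {"Summer": 0, "Spring": 1, "Autumn": 2, "Winter": 3},
    "Holiday": {"Holiday": 1, "No Holiday": 0},
    "Functioning Day": {"Yes": 1, "No": 0},
}

for column, codes in mappings.items():
    # Values not present in the dict (here "Monsoon") become NaN.
    demo[column] = demo[column].map(codes)

print(demo)
```

Keeping the codes in one dictionary makes it easier to check them against the data description in a single place.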

In [10]:
data.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Rented Bike Count | 8760.0 | 704.602055 | 644.997468 | 0.0 | 191.0 | 504.5 | 1065.25 | 3556.0 |
| Hour | 8760.0 | 11.5 | 6.922582 | 0.0 | 5.75 | 11.5 | 17.25 | 23.0 |
| Temperature(°C) | 8760.0 | 12.882922 | 11.944825 | -17.8 | 3.5 | 13.7 | 22.5 | 39.4 |
| Humidity(%) | 8760.0 | 58.226256 | 20.362413 | 0.0 | 42.0 | 57.0 | 74.0 | 98.0 |
| Wind speed (m/s) | 8760.0 | 1.724909 | 1.0363 | 0.0 | 0.9 | 1.5 | 2.3 | 7.4 |
| Visibility (10m) | 8760.0 | 1436.825799 | 608.298712 | 27.0 | 940.0 | 1698.0 | 2000.0 | 2000.0 |
| Dew point temperature(°C) | 8760.0 | 4.073813 | 13.060369 | -30.6 | -4.7 | 5.1 | 14.8 | 27.2 |
| Solar Radiation (MJ/m2) | 8760.0 | 0.569111 | 0.868746 | 0.0 | 0.0 | 0.01 | 0.93 | 3.52 |
| Rainfall(mm) | 8760.0 | 0.148687 | 1.128193 | 0.0 | 0.0 | 0.0 | 0.0 | 35.0 |
| Snowfall (cm) | 8760.0 | 0.075068 | 0.436746 | 0.0 | 0.0 | 0.0 | 0.0 | 8.8 |


### Feature Scaling
Scaling brings all features into the same range so that no feature dominates simply because of its units. We will use a min-max scaler.

`MinMaxScaler()`: transforms features by scaling each feature to a given range.
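For a target range $[a, b]$, min-max scaling maps each value $x$ to $a + \frac{x - x_{\min}}{x_{\max} - x_{\min}} (b - a)$. A minimal sketch in plain Python, using the minimum (0) and maximum (3556) of `Rented Bike Count` from the `describe()` output and the range (0, 3) chosen in this notebook. The helper name `min_max_scale` is made up for illustration and is not part of scikit-learn.

```python
def min_max_scale(x, x_min, x_max, feature_range=(0.0, 1.0)):
    """Scale a single value from [x_min, x_max] into feature_range."""
    low, high = feature_range
    return low + (x - x_min) / (x_max - x_min) * (high - low)

# First row of the dataset: 254 rented bikes; column min is 0, max is 3556.
scaled = min_max_scale(254, 0, 3556, feature_range=(0, 3))
print(round(scaled, 6))  # 0.214286
```

This agrees with the first scaled value of `Rented Bike Count` shown in the notebook output.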

In [11]:
features = list(data.columns)
features

['Rented Bike Count',
 'Hour',
 'Temperature(°C)',
 'Humidity(%)',
 'Wind speed (m/s)',
 'Visibility (10m)',
 'Dew point temperature(°C)',
 'Solar Radiation (MJ/m2)',
 'Rainfall(mm)',
 'Snowfall (cm)',
 'Seasons',
 'Holiday',
 'Functioning Day',
 'Day',
 'Month',
 'Year']

In [12]:
min_max_scaler = pre_pro.MinMaxScaler(feature_range=(0,3))

In [13]:
# Fit the scaler on every column (the target "Rented Bike Count" included)
# and overwrite the original values with the scaled ones.
data[features] = min_max_scaler.fit_transform(data)

In [14]:
data.head()

| | Rented Bike Count | Hour | Temperature(°C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature(°C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) | Seasons | Holiday | Functioning Day | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.214286 | 0.0 | 0.660839 | 1.132653 | 0.891892 | 3.0 | 0.67474 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 1 | 0.172103 | 0.130435 | 0.645105 | 1.163265 | 0.324324 | 3.0 | 0.67474 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 2 | 0.145951 | 0.26087 | 0.618881 | 1.193878 | 0.405405 | 3.0 | 0.66955 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 3 | 0.09027 | 0.391304 | 0.608392 | 1.22449 | 0.364865 | 3.0 | 0.67474 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 4 | 0.065804 | 0.521739 | 0.618881 | 1.102041 | 0.932432 | 3.0 | 0.622837 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |


### Splitting Data

In [15]:
# independent features
X = data.drop("Rented Bike Count",axis=1)

# dependent feature
y = data["Rented Bike Count"]

In [16]:
X.head()

| | Hour | Temperature(°C) | Humidity(%) | Wind speed (m/s) | Visibility (10m) | Dew point temperature(°C) | Solar Radiation (MJ/m2) | Rainfall(mm) | Snowfall (cm) | Seasons | Holiday | Functioning Day | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.660839 | 1.132653 | 0.891892 | 3.0 | 0.67474 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 1 | 0.130435 | 0.645105 | 1.163265 | 0.324324 | 3.0 | 0.67474 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 2 | 0.26087 | 0.618881 | 1.193878 | 0.405405 | 3.0 | 0.66955 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 3 | 0.391304 | 0.608392 | 1.22449 | 0.364865 | 3.0 | 0.67474 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |
| 4 | 0.521739 | 0.618881 | 1.102041 | 0.932432 | 3.0 | 0.622837 | 0.0 | 0.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 | 3.0 | 0.0 |


In [17]:
y.head()

0    0.214286
1    0.172103
2    0.145951
3    0.090270
4    0.065804
Name: Rented Bike Count, dtype: float64

In [18]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=11)

In [19]:
X.shape, y.shape

((8760, 15), (8760,))

In [20]:
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((7008, 15), (1752, 15), (7008,), (1752,))

## Modelling

In [21]:
# model
model = tree.DecisionTreeRegressor(criterion="squared_error",max_depth=4)

# fit
model.fit(X_train,y_train)

# predict
y_pred = model.predict(X_test)

## Evaluation

In [22]:
eval_m.mean_absolute_error(y_test,y_pred)

0.20658160521361252
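Because the target was scaled to the range (0, 3) together with the features, this MAE is in scaled units rather than in bikes. Assuming the scaler used the column minimum 0 and maximum 3556 shown in the `describe()` output, the error can be converted back by undoing the scale factor. A rough sketch of the arithmetic:

```python
# MAE reported above, in scaled units (target range 0..3).
mae_scaled = 0.20658160521361252

# Min and max of "Rented Bike Count" from the describe() output.
y_min, y_max = 0.0, 3556.0
range_low, range_high = 0.0, 3.0

# Min-max scaling multiplies differences by (range_high - range_low) / (y_max - y_min),
# so dividing by that factor recovers the error in bikes per hour.
scale = (range_high - range_low) / (y_max - y_min)
mae_bikes = mae_scaled / scale
print(f"MAE is roughly {mae_bikes:.1f} bikes per hour")
```

That is about 245 bikes per hour, against a mean hourly count of about 705.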

## Feature Selection | Filter Method - Variance

In [31]:
# variance
variance = X.var()
variance.sort_values()

Humidity(%)                  0.388552
Temperature(°C)              0.392473
Holiday                      0.421996
Dew point temperature(°C)    0.459513
Solar Radiation (MJ/m2)      0.548205
Year                         0.699543
Day                          0.773828
Hour                         0.815310
Visibility (10m)             0.855504
Month                        0.884308
Seasons                      1.247311
dtype: float64

In [32]:
variance1 = variance > 0.3
variance1

Hour                         True
Temperature(°C)              True
Humidity(%)                  True
Visibility (10m)             True
Dew point temperature(°C)    True
Solar Radiation (MJ/m2)      True
Seasons                      True
Holiday                      True
Day                          True
Month                        True
Year                         True
dtype: bool

In [33]:
X.columns[variance1]

Index(['Hour', 'Temperature(°C)', 'Humidity(%)', 'Visibility (10m)',
       'Dew point temperature(°C)', 'Solar Radiation (MJ/m2)', 'Seasons',
       'Holiday', 'Day', 'Month', 'Year'],
      dtype='object')

In [34]:
X = X[X.columns[variance1]]
X.head()

| | Hour | Temperature(°C) | Humidity(%) | Visibility (10m) | Dew point temperature(°C) | Solar Radiation (MJ/m2) | Seasons | Holiday | Day | Month | Year |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.660839 | 1.132653 | 3.0 | 0.67474 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 |
| 1 | 0.130435 | 0.645105 | 1.163265 | 3.0 | 0.67474 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 |
| 2 | 0.26087 | 0.618881 | 1.193878 | 3.0 | 0.66955 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 |
| 3 | 0.391304 | 0.608392 | 1.22449 | 3.0 | 0.67474 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 |
| 4 | 0.521739 | 0.618881 | 1.102041 | 3.0 | 0.622837 | 0.0 | 3.0 | 0.0 | 0.0 | 3.0 | 0.0 |


In [35]:
# data df also contains target feature
print(f"Features before Feature Selection: {data.shape[1]-1}")
print(f"Features after Feature Selection: {X.shape[1]}")
print(f"Features eliminated after Feature Selection: {data.shape[1]-1 - (X.shape[1])}")

Features before Feature Selection: 15
Features after Feature Selection: 11
Features eliminated after Feature Selection: 4


### Modelling

In [36]:
# model
model = tree.DecisionTreeRegressor(criterion="squared_error",max_depth=4)

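# fit
# NOTE: X_train / X_test come from the split made before feature selection,
# so this model is still trained on all 15 features, not the 11 selected ones.
model.fit(X_train,y_train)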

# predict
y_pred = model.predict(X_test)

### Evaluation

In [37]:
eval_m.mean_absolute_error(y_test,y_pred)

0.20658160521361255

### Conclusion

* Proper feature selection can improve model accuracy. In this notebook the MAE is unchanged (about 0.2066) because the model was refit on the original `X_train`, which still contains all 15 features.
* After feature selection there are fewer features to work with, which helps models train faster.
* Variance-based feature selection can be implemented with pandas' `DataFrame.var()` or with `sklearn.feature_selection.VarianceThreshold`.
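
As a sketch of the scikit-learn option, `VarianceThreshold` can apply the same 0.3 cutoff. It uses the population variance (`ddof=0`), while `DataFrame.var()` defaults to the sample variance (`ddof=1`), so the two can disagree for features close to the threshold. The data below is synthetic and the column names are hypothetical:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Synthetic frame: "wide" varies a lot, "narrow" barely varies, "constant" not at all.
demo = pd.DataFrame({
    "wide": [0.0, 3.0, 0.0, 3.0, 1.5, 1.5],
    "narrow": [1.0, 1.1, 0.9, 1.0, 1.1, 0.9],
    "constant": [2.0] * 6,
})

selector = VarianceThreshold(threshold=0.3)
selector.fit(demo)

# get_support() returns a boolean mask of the columns whose variance exceeds the threshold.
kept = demo.columns[selector.get_support()]
print(list(kept))  # ['wide']

# The pandas equivalent, with ddof=0 to match VarianceThreshold.
kept_pandas = demo.columns[demo.var(ddof=0) > 0.3]
print(list(kept_pandas))  # ['wide']
```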

---