# Bike Sharing Data

Shared transport is a game changer in urban mobility (or at least it was in the pre-pandemic era). The term refers to demand-driven vehicle sharing. Nowadays, many systems let users rent a vehicle at one location and return it at another, making urban travel easier without relying on public transport.

Bikes, electric bikes, scooters, cars, vans: almost any type of vehicle can be rented through a simple smartphone application that gives access to the service and finds the nearest vehicle. These vehicles are usually connected and carry a large number of sensors, so the data generated by the fleet can be used to monitor mobility across the city.

In this example we will train a model to predict the rental count from environmental data and fleet status data. Weather conditions, day of the week, season, hour of the day and similar factors may affect rental behavior. Let's find out.

This is the data available (note that the code and text below refer to some of these columns by different names, for example `datetime`, `weather`, `humidity` and `count`):
- instant: record index
- dteday : date
- season : season (1:Spring, 2:Summer, 3:Autumn, 4:Winter)
- yr : year (2011, 2012)
- mnth : month (1 to 12)
- hr : hour (0 to 23)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0.
- weathersit : 
	- 1: Clear, Few clouds, Partly cloudy
	- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
	- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
	- 4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are divided by 41 (max)
- atemp: Normalized feeling temperature in Celsius. The values are divided by 50 (max)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.model_selection import train_test_split
import seaborn as sns

In [None]:
data = pd.read_csv("./bike_data.csv")

## Quick overview

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.season.value_counts()

In [None]:
data.weather.value_counts()

In [None]:
# A quick way to understand the available data is to plot a histogram of each numerical attribute.
# A histogram shows the number of instances (vertical axis) that fall in a given value range (horizontal axis).

data.hist(bins=50, figsize=(20, 15))

The histograms show that the features have different scales and that some features have a very long tail. Both factors can affect the performance of a machine learning model.
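
A log transform is one common way to shorten a long right tail. The sketch below is not part of the original analysis: it uses synthetic, right-skewed values (a hypothetical stand-in for a count column) and compares the skewness before and after `np.log1p`.

```python
import numpy as np
import pandas as pd

# Synthetic, right-skewed "rental count" values (hypothetical, not the real data)
rng = np.random.default_rng(0)
counts = pd.Series(rng.exponential(scale=150.0, size=5000)).round()

# log1p(x) = log(1 + x) is defined at 0 and compresses large values
log_counts = np.log1p(counts)

skew_before = counts.skew()
skew_after = log_counts.skew()
print(f"skewness before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# expm1 is the inverse of log1p, so predictions can be mapped back to counts
restored = np.expm1(log_counts)
```

Whether such a transform actually helps depends on the model, so it is worth checking with validation scores rather than assuming it.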

## Test Train split
“Your brain is an amazing pattern detection system, which means that it is highly prone to overfitting: if you look at the test set, you may stumble upon some seemingly interesting pattern in the test data that leads you to select a particular kind of Machine Learning model. When you estimate the generalization error using the test set, your estimate will be too optimistic, and you will launch a system that will not perform as well as expected. This is called data snooping bias.”

Excerpt from: Aurélien Géron. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”. Apple Books.



In [None]:
train_set, test_set = train_test_split(data, test_size=0.2, random_state=50)
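
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on a toy frame (the values are hypothetical): the split keeps 20% of the rows aside, the same seed reproduces the same split, and no row ends up in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the bike data (hypothetical values)
toy = pd.DataFrame({"hour": np.arange(100) % 24, "count": np.arange(100)})

train_a, test_a = train_test_split(toy, test_size=0.2, random_state=50)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=50)

print(len(train_a), len(test_a))                 # 80 20
print(test_a.index.equals(test_b.index))         # True: same seed, same split
print(set(train_a.index) & set(test_a.index))    # set(): no row is in both
```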

## Feature engineering

“First, make sure you have put the test set aside and you are only exploring the training set. Also, if the training set is very large, you may want to sample an exploration set, to make manipulations easy and fast. In our case, the set is quite small, so you can just work directly on the full set. Let’s create a copy so that you can play with it without harming the training set:”

Excerpt from: Aurélien Géron. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”. Apple Books.

In [None]:
bike_data = train_set.copy()

In [None]:
# It's a good idea to engineer time features, because the quick look shows that the dataset is composed of hourly measurements

tmp = bike_data.datetime.astype('datetime64[ns]')

bike_data['year'] = tmp.dt.year
bike_data['month'] = tmp.dt.month
bike_data['day'] = tmp.dt.day_name()
bike_data['hour'] = tmp.dt.hour

In [None]:
# Flag weekend days: 1 for Saturday/Sunday, 0 otherwise
# (a single vectorized assignment avoids pandas chained-assignment pitfalls)
bike_data['weekend'] = bike_data['day'].isin(['Saturday', 'Sunday']).astype(int)

In [None]:
bike_data.head()
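
The time features can be checked on a few hand-picked timestamps. This is only a small sketch with hypothetical dates (a Friday, a Saturday and a Sunday), using the same `.dt` accessors as the cells above.

```python
import pandas as pd

# Three hypothetical timestamps: a Friday, a Saturday and a Sunday
toy = pd.DataFrame({"datetime": ["2011-01-07 08:00:00",
                                 "2011-01-08 17:00:00",
                                 "2011-01-09 23:00:00"]})
ts = pd.to_datetime(toy["datetime"])

toy["day"] = ts.dt.day_name()
toy["hour"] = ts.dt.hour
# 1 for Saturday/Sunday, 0 otherwise
toy["weekend"] = toy["day"].isin(["Saturday", "Sunday"]).astype(int)
print(toy[["day", "hour", "weekend"]])
```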

## Data visualization

In [None]:
# let's check how the rental count is related to the day of the week
plt.figure(figsize=(10, 5))
sns.barplot(x="day", y="count", data=bike_data)

In [None]:
# how it's related to month
plt.figure(figsize=(10, 5))
sns.barplot(x="month", y="count", data=bike_data)

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x="season", y="count", data=bike_data)

In [None]:
plt.figure(figsize=(10, 5))
sns.barplot(x="weather", y="count", data=bike_data)

## Check for correlation

In [None]:
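# Since the dataset is not too large, it's possible to compute the standard correlation coefficient between every pair of features
# numeric_only=True skips text columns such as 'datetime' and 'day' (recent pandas versions raise an error otherwise)
corr_matrix = bike_data.corr(numeric_only=True)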

In [None]:
# check the correlation with the count attribute
corr_matrix['count'].sort_values(ascending=False)

The number of registered users is highly correlated with the rental count, and so is the number of casual users. This is expected, because the total count is the sum of these two numbers.
It is better to remove these two columns: the kind of user cannot be known a priori, so the model should rely only on climate and time data.
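
The leakage can be illustrated with synthetic numbers. In this sketch the columns are random and hypothetical (only the names mirror the dataset): because the target is built as the sum of the two user counts, those two columns determine it exactly, while an unrelated feature shows almost no correlation.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "casual": rng.integers(0, 50, size=1000),
    "registered": rng.integers(0, 300, size=1000),
    "temp": rng.uniform(0, 40, size=1000),   # unrelated to the counts here
})
toy["count"] = toy["casual"] + toy["registered"]

# The identity holds row by row, so the two columns fully determine the target
assert (toy["count"] == toy["casual"] + toy["registered"]).all()

corr_with_count = toy.corr()["count"].sort_values(ascending=False)
print(corr_with_count)
```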

In [None]:
bike_data.drop(['registered', 'casual'], axis=1, inplace=True)

In [None]:
# let's plot the correlation with few promising attributes
from pandas.plotting import scatter_matrix

attributes = ["count", "temp", "atemp", 'hour', 'month', 'year']

scatter_matrix(bike_data[attributes], figsize=(30, 15))

## Data cleaning

In [None]:
bike_data.info()

There are null values in the columns temp, humidity and windspeed. There are three options: 1. drop the columns, 2. drop the rows, 3. fill in the missing values (for example with the median).
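
As a rough sketch, the three options look like this in plain pandas on a tiny hypothetical frame (the notebook itself uses `SimpleImputer` for the third one):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "temp": [10.0, np.nan, 14.0, 30.0],
    "hour": [0, 1, 2, 3],
})

dropped_cols = toy.dropna(axis=1)          # option 1: drop columns with nulls
dropped_rows = toy.dropna(axis=0)          # option 2: drop rows with nulls
filled = toy.fillna(toy.median())          # option 3: fill with the column median

print(list(dropped_cols.columns))   # ['hour']
print(len(dropped_rows))            # 3
print(filled["temp"].tolist())      # [10.0, 14.0, 14.0, 30.0]
```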

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

In [None]:
bike_num = bike_data.drop(['datetime', 'season', 'weather', 'day'], axis=1)

In [None]:
imputer.fit(bike_num)

In [None]:
imputer.statistics_

In [None]:
x = imputer.transform(bike_num)

In [None]:
# it's a numpy array
x
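
Since the imputer returns a bare NumPy array, the column names and the row index are lost. One possible way to get them back, sketched here on a small hypothetical frame, is to wrap the array in a new DataFrame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

toy = pd.DataFrame({"temp": [10.0, np.nan, 14.0], "windspeed": [0.0, 6.0, np.nan]},
                   index=[7, 3, 9])

imputer = SimpleImputer(strategy="median")
arr = imputer.fit_transform(toy)            # plain NumPy array, labels are lost

# Re-attach the original column names and row index
toy_tr = pd.DataFrame(arr, columns=toy.columns, index=toy.index)
print(imputer.statistics_)   # medians learned per column (12 for temp, 3 for windspeed)
print(toy_tr)
```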

In [None]:
# handling categorical attributes
bike_cat = bike_data[['season', 'weather', 'day']]

In [None]:
from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
bike_1_hot = cat_encoder.fit_transform(bike_cat)

In [None]:
bike_1_hot
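
By default `OneHotEncoder` returns a SciPy sparse matrix, which is why the output above is not directly readable. A minimal sketch on a hypothetical `day` column shows how to inspect it:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy_cat = pd.DataFrame({"day": ["Monday", "Sunday", "Monday"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy_cat)     # SciPy sparse matrix by default

dense = encoded.toarray()                    # densify only for inspection
print(encoder.categories_)   # one array of learned categories per input column
print(dense)
```

Each row has a single 1 in the column of its category; the column order follows `encoder.categories_`.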

In [None]:
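def feature_engineering(data):
    """Add year, month, day name, hour and weekend flag columns to `data` in place."""
    tmp = pd.to_datetime(data['datetime'])

    data['year'] = tmp.dt.year
    data['month'] = tmp.dt.month
    data['day'] = tmp.dt.day_name()
    data['hour'] = tmp.dt.hour

    # 1 for Saturday/Sunday, 0 otherwise (one vectorized assignment, no chained indexing)
    data['weekend'] = data['day'].isin(['Saturday', 'Sunday']).astype(int)

    # The raw 'datetime' column stays in the frame; drop it downstream if it is not needed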



In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    
    def __init__(self):
        pass
    
    def fit(self, X, y=None):
        return self  # nothing else to do
    
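    def transform(self, X):

        # the datetime column is expected at index 0
        tmp = pd.DatetimeIndex(X[:, 0])

        # dayofweek is 0 (Monday) to 6 (Sunday), so `// 5 == 1` flags Saturday and Sunday
        weekend = (tmp.dayofweek // 5 == 1).astype(int)

        return np.c_[X[:, 1:], tmp.year, tmp.month, tmp.day_name(), tmp.hour, weekend]

attr_adder = CombinedAttributesAdder()
# pass the underlying NumPy array (`.values` is an attribute, not a method)
a = attr_adder.transform(train_set.copy().values)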

In [None]:
a[0]
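
The weekend flag inside the transformer relies on integer division of the day-of-week number. As a quick check, this sketch runs the same expression over one hypothetical week starting on a Monday:

```python
import pandas as pd

# One full week starting on Monday 2011-01-03 (hypothetical dates)
idx = pd.date_range("2011-01-03", periods=7, freq="D")

# dayofweek: Monday=0 ... Sunday=6, so integer division by 5 is 1 only for 5 and 6
weekend = (idx.dayofweek // 5 == 1).astype(int)

for name, flag in zip(idx.day_name(), weekend):
    print(f"{name:<9} {flag}")
```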

## Feature Scaling



“One of the most important transformations you need to apply to your data is feature scaling. With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales.”

Excerpt from: Aurélien Géron. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”. Apple Books.
    
    
“There are two common ways to get all attributes to have the same scale: min-max scaling and standardization.”

Excerpt from: Aurélien Géron. “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow”. Apple Books.
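
The two approaches can be compared side by side. The sketch below applies scikit-learn's `MinMaxScaler` and `StandardScaler` to a single hypothetical feature; the formulas in the comments are the standard definitions.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One hypothetical feature with very different magnitudes
x = np.array([[1.0], [2.0], [3.0], [100.0]])

x_minmax = MinMaxScaler().fit_transform(x)   # (x - min) / (max - min), in [0, 1]
x_std = StandardScaler().fit_transform(x)    # (x - mean) / std, mean 0 and std 1

print(x_minmax.ravel())
print(x_std.ravel())
```

Min-max scaling bounds the values but lets a single outlier squash the rest towards 0, while standardization has no fixed range and is less affected by outliers.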

## Pipelines

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median")),
        ('std_scaler', StandardScaler()),
    ])

bike_num_tr = num_pipeline.fit_transform(bike_num)

In [None]:
train_set.copy().head().values

In [None]:
cat_pipeline = Pipeline([
        ('attribs_adder', CombinedAttributesAdder()),
        ('cat', OneHotEncoder())
    ])

test = cat_pipeline.fit_transform(train_set.copy()[['datetime', 'season', 'weather']].values)

In [None]:
pd.DataFrame(test.toarray())  # densify the sparse one-hot matrix for display

In [None]:
from sklearn.compose import ColumnTransformer

# numerical inputs: every numeric column except the target `count`
num_attribs = [col for col in bike_num if col != 'count']
cat_attribs = ['season', 'weather', 'day']

# feature_engineering() already adds year, month, day, hour and weekend,
# so no separate CombinedAttributesAdder step is needed here
full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", OneHotEncoder(), cat_attribs),
    ])

feature_engineering(bike_data)
bike_prepared = full_pipeline.fit_transform(bike_data)

In [None]:
bike_prepared.shape
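
To see where the number of output columns comes from, here is a small self-contained sketch of the same idea on a hypothetical three-column frame: the numeric columns pass through imputation and scaling, and the categorical column is expanded into one column per category.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

toy = pd.DataFrame({
    "temp": [10.0, np.nan, 14.0, 30.0],
    "hour": [0, 6, 12, 18],
    "day": ["Monday", "Sunday", "Monday", "Friday"],
})

num_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])

prep = ColumnTransformer([
    ("num", num_pipe, ["temp", "hour"]),
    ("cat", OneHotEncoder(), ["day"]),
])

toy_prepared = prep.fit_transform(toy)
print(toy_prepared.shape)   # (4, 5): 2 numeric + 3 one-hot columns
```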