# Bike-sharing

![](https://images.unsplash.com/photo-1533641568252-76ce0951d5b4?ixlib=rb-1.2.1&q=80&fm=jpg&crop=entropy&cs=tinysrgb&w=1080&fit=max&ixid=eyJhcHBfaWQiOjEyMDd9)
Picture by [Fernando Meloni](https://unsplash.com/@f_meloni)

I'm new to machine learning and am training on this dataset. This kernel focuses on feature engineering!

# I. Imports and loading the data

In [None]:
import os 

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

In [None]:
# Create a path to the data
path = os.path.join('..', 'input')
df_train = pd.read_csv(os.path.join(path, 'train.csv'))
df_train.shape

In [None]:
# Checking the data
df_train.head()

In [None]:
# Display information about the data
df_train.info()

In [None]:
# Check some summary statistics of the data
df_train.describe()

# II. EDA

## II.1. Analysis of the given columns

Let's visualize the data to understand it better. 

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(y='count', x='season', data=df_train, palette='hls')
plt.title('Distribution of rentals by season');

_The season affects the use of shared bikes. One thing is puzzling: season 1 is labelled 'spring' in the documentation but corresponds to January in the dataset. Looking at the graph, it seems intuitive that season 1 would be winter, the least attractive season for cycling. This point needs further investigation to be sure of what 1 represents._
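
One quick way to investigate is to tabulate which months fall under each season code. The sketch below uses a small made-up frame in place of `df_train`, and its season coding (1 for January to March, and so on) is an assumption chosen to mirror the observation above. On the real data, the same `groupby` could be run on `df_train` once the month has been extracted from `datetime`.

```python
import pandas as pd

# Toy stand-in for df_train: one timestamp per month (hypothetical data)
toy = pd.DataFrame({'datetime': pd.date_range('2011-01-01', periods=12, freq='MS')})
toy['month'] = toy['datetime'].dt.month

# ASSUMPTION: season codes follow calendar quarters (1 = Jan-Mar, ..., 4 = Oct-Dec)
toy['season'] = (toy['month'] - 1) // 3 + 1

# First and last month observed for each season code
season_months = toy.groupby('season')['month'].agg(['min', 'max'])
print(season_months)
```

If season 1 only spans the first months of the year on the real data, the code is a calendar quarter rather than a meteorological season.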

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(y='count', x='holiday', data=df_train, palette='hls')
plt.title('Distribution of rentals by holiday');

_Holidays don't seem to affect the use of shared bikes. By crossing this column with an hour column, we could check whether the hours of use are the same on holidays and non-holidays._

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(y='count', x='weather', data=df_train, palette='hls')
plt.title('Distribution of rentals by weather type');

_1: Clear, Few clouds, Partly cloudy
2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
4: Heavy Rain + Ice Pellets + Thunderstorm + Mist, Snow + Fog_

_Usage is higher when the weather is good, which is quite intuitive. Category 4 is strange, however: it is the worst weather, so we would expect fewer riders than in slightly bad weather (category 3)._
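
A bar for a category can be misleading when it rests on very few rows, so it may be worth looking at the number of observations next to the mean. This is a minimal sketch on made-up numbers (the `toy` frame is hypothetical). On the real data, the same aggregation would be applied to `df_train`.

```python
import pandas as pd

# Hypothetical data: weather category 4 appears only once
toy = pd.DataFrame({
    'weather': [1, 1, 1, 2, 2, 3, 4],
    'count':   [120, 80, 100, 60, 70, 30, 164],
})

# Mean rentals and number of rows behind each weather category
summary = toy.groupby('weather')['count'].agg(['mean', 'size'])
print(summary)
```

A category whose `size` is tiny gives a mean that reflects a handful of hours rather than a trend.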

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(y='count', x='workingday', data=df_train, palette='hls')
plt.title('Distribution of rentals by working day');

_As with the holiday graph, the number of rentals stays quite similar whether or not it is a working day. It could be interesting to cross it with the hour to see whether usage fluctuates during the day._
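
Crossing the working-day flag with the hour can be sketched with a pivot table. The frame below is synthetic (random counts over two weeks of hourly timestamps), so only the shape of the result is meaningful. The column names follow the dataset, and the working-day flag is derived from the day of the week, which is a simplification.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Two weeks of hourly timestamps starting on a Monday (synthetic data)
toy = pd.DataFrame({
    'datetime': pd.date_range('2011-01-03', periods=24 * 14, freq=pd.Timedelta(hours=1))
})
toy['hour'] = toy['datetime'].dt.hour
# ASSUMPTION: Monday to Friday are working days (holidays are ignored here)
toy['workingday'] = (toy['datetime'].dt.dayofweek < 5).astype(int)
toy['count'] = rng.integers(1, 200, size=len(toy))

# Mean rentals per hour, with one column per working-day value
hourly = toy.pivot_table(index='hour', columns='workingday',
                         values='count', aggfunc='mean')
print(hourly.head())
```

Each row is an hour of the day, so comparing the two columns shows whether the daily profile differs even when the overall averages are close.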

## II.2. Feature engineering to create new columns

In [None]:
# Convert the datetime column to a datetime dtype
df_train['datetime'] = pd.to_datetime(df_train['datetime'])
df_train.head()

In [None]:
# Create a day of week column: dow
df_train['dow'] = df_train['datetime'].dt.dayofweek
df_train.head()

In [None]:
# Create a month column: month
df_train['month'] = df_train['datetime'].dt.month
df_train.head()

In [None]:
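# Create a column with the ISO week number: week
# (Series.dt.week was removed in pandas 2.0; isocalendar() replaces it)
df_train['week'] = df_train['datetime'].dt.isocalendar().week.astype(int)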
df_train.sample(10)

In [None]:
# Create a column with the hour: hour
df_train['hour'] = df_train['datetime'].dt.hour
df_train.head()
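
Since the same columns are needed later for the Kaggle test file, the steps above could be gathered into one helper. This is only a sketch: `add_datetime_features` is a hypothetical name, not part of the original kernel, and it uses `dt.isocalendar()` for the week number, which requires pandas 1.1 or later.

```python
import pandas as pd

def add_datetime_features(df):
    """Return a copy of df with dow, month, week and hour columns (hypothetical helper)."""
    out = df.copy()
    out['datetime'] = pd.to_datetime(out['datetime'])
    out['dow'] = out['datetime'].dt.dayofweek      # Monday = 0, Sunday = 6
    out['month'] = out['datetime'].dt.month
    out['week'] = out['datetime'].dt.isocalendar().week.astype(int)  # ISO week number
    out['hour'] = out['datetime'].dt.hour
    return out

# Two made-up timestamps in the dataset's datetime format
toy = pd.DataFrame({'datetime': ['2011-01-01 00:00:00', '2012-07-19 17:00:00']})
features = add_datetime_features(toy)
print(features)
```

Note that ISO numbering assigns 1 January 2011 (a Saturday) to week 52 of the previous ISO year, so `week` and `month` do not always line up at year boundaries.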

## II.3. Visualize the new data

### II.3-A. Univariable analysis

In [None]:
fig, ax = plt.subplots(figsize=(10, 7))
sns.barplot(y='count', x='month', data=df_train, palette='hls')
plt.title('Distribution of rentals by month');

_The distribution of rentals by month roughly follows the progression of good weather. There is a surprising and drastic drop in January and February, even compared to December, which should have fairly similar weather. **Why?**_

_This drop is quite similar to the gap between season 1 and the other seasons. It seems to indicate that season 1 corresponds to winter (as suggested by the fact that the January data are labelled '1') and that the documentation is wrong._

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.barplot(y='count', x='dow', data=df_train, palette='hls')
plt.title('Distribution of rentals by day of week');

_We see small fluctuations in bike use during the week. Although use is somewhat lower on Sunday, Saturday is almost equal to the working days. So, as we saw previously, the working day parameter doesn't have an important influence on use._

### II.3-B. Multivariable analysis

In [None]:
fig, ax = plt.subplots(figsize=(12, 7))
sns.pointplot(y='count', x='hour', hue='dow', data=df_train, palette='hls')
plt.title('Distribution of rentals by hour and day of the week');

_Plotting use by hour for each day of the week shows that, although the total count is similar for working days and weekends, the hours of use are nearly reversed._

_Bikes are used differently on working days and weekends, but the total count stays stable._

In [None]:
fig, ax = plt.subplots(figsize=(15, 10))
sns.pointplot(y='count', x='dow', hue='month', data=df_train, palette='hls')
plt.title('Distribution of rentals by day of the week and month');

_Bike use during the week changes according to the month. We can see the two lines far below the others for January and February, showing the same drop as on the month graph._

_But the lines differ quite a lot from one month to another._

_Sadly, the huge confidence intervals suggest these results are not very reliable._

# III. Machine learning

## III.1. Choosing and dividing your data

In [None]:
# Define a y and X to work with by selecting the columns you're interested in. 
y_raw = df_train.loc[:, 'count']
X_raw = df_train.drop(['count', 'datetime', 'registered', 'casual'], axis=1)
y_raw.shape, X_raw.shape

In [None]:
# Import train_test_split to split your data.
from sklearn.model_selection import train_test_split

In [None]:
# Split the data and verify its shape.
X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw)
X_train.shape, y_train.shape, X_test.shape, y_test.shape
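
Without a seed, `train_test_split` draws a different split on every run, so scores are not comparable between runs. A minimal sketch of a reproducible split on made-up arrays follows. The `random_state=42` value is arbitrary, and `test_size=0.25` simply spells out the default.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up features and target: 100 rows, 2 columns
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# Same seed, same split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X, y, test_size=0.25, random_state=42)

same_split = np.array_equal(X_te, X_te2) and np.array_equal(y_te, y_te2)
print(X_tr.shape, X_te.shape, same_split)
```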

In [None]:
# Display X_train
X_train.head()

In [None]:
# Verify the absence of null values
X_train.isna().sum()

In [None]:
# Display y_train
y_train[:5]

In [None]:
# Verify the absence of null values. 
y_train.isna().sum()

## III.2. Create a model for machine learning

### III.2-A. Linear regression (on the data we split)

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

In [None]:
# Create a linear regression model and fit it on X_train, y_train.
lr = LinearRegression()
lr.fit(X_train, y_train)

In [None]:
# Predict on X_test.
y_pred = lr.predict(X_test)
y_pred

In [None]:
# Compute the mean squared error.
mean_squared_error(y_test, y_pred)
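
A single split gives one error value that depends on which rows ended up in the test set. `KFold`, imported above, can be combined with `cross_val_score` to average the error over several splits. This sketch runs on synthetic data with a known linear relationship, so the numbers are illustrative only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Synthetic data: y is a linear function of X plus a little noise
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 5 shuffled folds; scikit-learn reports the negated MSE, so flip the sign
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv,
                         scoring='neg_mean_squared_error')
mse_per_fold = -scores
print(mse_per_fold, mse_per_fold.mean())
```

On the notebook's data, `X_raw` and `y_raw` would take the place of `X` and `y`.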

In [None]:
X_train.max()

In [None]:
y_train.max()

### III.2-B. Random Forest (on the Kaggle test sample)

We need to transform the test sample supplied by Kaggle in order to predict on it.

In [None]:
# Import of the data.
df_test = pd.read_csv(os.path.join('..', 'input', 'test.csv'))
df_test.shape

In [None]:
# Checking the data.
df_test.head()

In [None]:
# Apply the same processing as on the train dataframe.
df_test['datetime'] = pd.to_datetime(df_test['datetime'])
df_test['month'] = df_test['datetime'].dt.month
df_test['week'] = df_test['datetime'].dt.isocalendar().week.astype(int)
df_test['dow'] = df_test['datetime'].dt.dayofweek
df_test['hour'] = df_test['datetime'].dt.hour
df_test_clear = df_test.drop(['datetime'], axis=1)
df_test_clear.sample(10)

In [None]:
# Drop datetime and put the columns in the same order as X_train
X_test_2 = df_test.drop(['datetime'], axis=1)[X_train.columns]
X_test_2.head()
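
The engineered columns were not created in the same order on the train and test frames, and recent versions of scikit-learn check that the feature names seen at predict time match those seen during fit. A small sketch of the alignment, using two made-up one-row frames (`train_X` and `test_X` are hypothetical names):

```python
import pandas as pd

# Same columns, different order (made-up values)
train_X = pd.DataFrame({'dow': [5], 'month': [1], 'week': [52], 'hour': [0]})
test_X = pd.DataFrame({'month': [1], 'week': [3], 'dow': [3], 'hour': [0]})

# Selecting with the training columns reorders the test frame,
# and raises a KeyError if a training column is missing from it
aligned = test_X[train_X.columns]
print(list(aligned.columns))
```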

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Create a RandomForestRegressor model (all other parameters keep their defaults)
rf = RandomForestRegressor(n_estimators=150, max_depth=10, n_jobs=1)

In [None]:
# Fit the model on X_train and the log-transformed target, log1p(y_train)
rf.fit(X_train, np.log1p(y_train))

In [None]:
# Predict the log counts on the Kaggle test set, then convert them back to counts
log_pred = rf.predict(X_test_2)
y_pred = np.expm1(log_pred)
y_pred[:5]
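
Because the competition metric works on logarithms, a common approach is to train on `log1p(count)` and invert with `expm1` at predict time. scikit-learn's `TransformedTargetRegressor` can wrap both steps so they cannot get out of sync. The sketch below is an alternative formulation on synthetic data, not the kernel's own code, and the single feature and the forest's parameters are arbitrary.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic data: one 'hour'-like feature and a positive count-like target
X = rng.uniform(0, 23, size=(300, 1))
y = np.round(50 + 40 * np.sin(X[:, 0] / 23 * np.pi) + rng.uniform(0, 10, 300))

# The wrapper fits the forest on log1p(y) and applies expm1 to its predictions
model = TransformedTargetRegressor(
    regressor=RandomForestRegressor(n_estimators=20, max_depth=5, random_state=0),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
pred = model.predict(X[:5])
print(pred)
```

Predictions come back on the original count scale and cannot be negative, since `expm1` of a non-negative log prediction is non-negative.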

_Submitted to Kaggle, this model scores 0.59418 (evaluated with Root Mean Squared Logarithmic Error)._
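
To estimate this metric locally before submitting, RMSLE can be computed on a held-out split. A minimal sketch follows. The `rmsle` helper is a hypothetical name, and clipping negative predictions to zero is an assumption added here so that the logarithm stays defined (a linear model can predict negative counts).

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    # ASSUMPTION: negative predictions are clipped to 0 before taking the log
    y_pred = np.clip(np.asarray(y_pred, dtype=float), 0, None)
    return float(np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)))

perfect = rmsle([10, 20, 30], [10, 20, 30])   # identical values
off = rmsle([0.0], [np.e - 1])                # log1p(e - 1) - log1p(0) = 1
print(perfect, off)
```

Calling `rmsle(y_test, predictions)` on the local test split would give a figure comparable in kind to the leaderboard score, though not identical, since the rows differ.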