Bike sharing polynomial features
---

Exercise - Load and split the data, set the baseline
---

> **Exercise**: Load the data set. Encode categorical variables with one-hot encoding. Split the data into train/test sets with the `train_test_split()` function from Scikit-learn (50-50 split, `random_state=0`). Fit a linear regression and compare its performance to the median baseline using the mean absolute error (MAE) measure.
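
The median is the natural constant baseline for MAE because it is the single value that minimizes the mean absolute error on the data it is computed from. A minimal sketch on made-up counts (the arrays `y_train` and `y_test` are hypothetical stand-ins for the `casual` column, not values from the data set):

```python
import numpy as np

def mae(y, y_pred):
    """Mean absolute error between targets and predictions."""
    return np.mean(np.abs(y - y_pred))

# Hypothetical, right-skewed counts standing in for the `casual` column
y_train = np.array([80, 100, 120, 130, 150, 300, 900])
y_test = np.array([90, 110, 140, 200, 700])

# A baseline is one constant prediction, estimated on the training set only
median_pred = np.median(y_train)
mean_pred = np.mean(y_train)

print('MAE median baseline: {:.3f}'.format(mae(y_test, median_pred)))
print('MAE mean baseline: {:.3f}'.format(mae(y_test, mean_pred)))
```

Estimating the constant on the training targets keeps the baseline comparable to the fitted model, which also never sees the test targets.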

In [1]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
import os

In [2]:
# Load data
data_df = pd.read_csv(os.path.join('data','bike-sharing.csv'))
data_df.head()


Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,weekday,season,weathersit,casual
0,0.344,0.806,0.16,2011,no,no,6,spring,cloudy,331
1,0.363,0.696,0.249,2011,no,no,0,spring,cloudy,131
2,0.196,0.437,0.248,2011,yes,no,1,spring,clear,120
3,0.2,0.59,0.16,2011,yes,no,2,spring,clear,108
4,0.227,0.437,0.187,2011,yes,no,3,spring,clear,82


In [15]:
# Encode categorical variables
df_encoded = pd.get_dummies(data_df, columns=['season','weathersit','weekday'],drop_first=True)
df_encoded.head(5)

Unnamed: 0,temp,hum,windspeed,yr,workingday,holiday,casual,season_spring,season_summer,season_winter,weathersit_cloudy,weathersit_rainy,weekday_1,weekday_2,weekday_3,weekday_4,weekday_5,weekday_6
0,0.344,0.806,0.16,2011,no,no,331,1,0,0,1,0,0,0,0,0,0,1
1,0.363,0.696,0.249,2011,no,no,131,1,0,0,1,0,0,0,0,0,0,0
2,0.196,0.437,0.248,2011,yes,no,120,1,0,0,0,0,1,0,0,0,0,0
3,0.2,0.59,0.16,2011,yes,no,108,1,0,0,0,0,0,1,0,0,0,0
4,0.227,0.437,0.187,2011,yes,no,82,1,0,0,0,0,0,0,1,0,0,0


In [47]:
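# Split into train/test sets
# using the train_test_split function from Scikit-learn
from sklearn.model_selection import train_test_split

X = df_encoded.drop(['casual','yr','workingday','holiday'],axis = 1).values
y = df_encoded['casual'].values

# Note: this reassignment overrides the one-hot encoded X defined above
X = data_df.drop(['casual','yr','workingday','holiday','season','weathersit'],axis = 1).values

# 50-50 split with a fixed seed, as the exercise specifies
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
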

In [49]:
import numpy as np

# Mean absolute error (MAE)
def MAE(y, y_pred):
    return np.mean(np.abs(y - y_pred))

def RMSE(y, y_pred):
    return(np.sqrt(np.mean(np.square(y-y_pred))))

In [50]:
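# Baseline: predicts the mean of the test targets
# (the exercise asks for the median; np.median(y_tr) would give that baseline)
mae_baseline = MAE(y_te,np.mean(y_te))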

# Linear regression
from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(X_tr,y_tr)

mae_lr = MAE(y_te,lr.predict(X_te))

print('MAE baseline: {:.3f}'.format(mae_baseline))
print('MAE linear regression: {:.3f}'.format(mae_lr))

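rmse_baseline = RMSE(y_te,y_te.mean())
rmse_lr = RMSE(y_te,lr.predict(X_te))
# This is 1 - RMSE ratio; the R^2 score would square the ratio:
# 1 - (rmse_lr/rmse_baseline)**2
print("Relative RMSE reduction:",1-(rmse_lr/rmse_baseline))

MAE baseline: 513.843
MAE linear regression: 403.413
Relative RMSE reduction: 0.16644068923406508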


Exercise - Add polynomial features
---

> **Exercise**: Add the `temp^2` and `temp^3` polynomial features. Then fit and evaluate a linear regression. Plot your model with a scatter plot of temperatures vs. number of users. Feel free to add other features.
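
One possible way to build the polynomial features is to stack powers of the temperature column with `np.c_`. The sketch below runs on synthetic data (the `temp` and `users` arrays and the cubic relationship between them are assumptions made for illustration, not the bike-sharing data), and compares a linear fit on `temp` alone with a fit on `temp`, `temp^2` and `temp^3`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bike data (hypothetical values)
rng = np.random.default_rng(0)
temp = rng.uniform(0.1, 0.9, size=200)
users = 3000 * temp**2 - 2500 * temp**3 + rng.normal(0, 30, size=200)

# Stack temp, temp^2 and temp^3 as three feature columns
X_poly = np.c_[temp, temp**2, temp**3]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_poly, users, test_size=0.5, random_state=0)

# Linear regression on the extended feature set
lr_poly = LinearRegression().fit(X_tr, y_tr)
mae_poly = np.mean(np.abs(y_te - lr_poly.predict(X_te)))

# Same model restricted to the first column (temp only), for comparison
lr_lin = LinearRegression().fit(X_tr[:, :1], y_tr)
mae_lin = np.mean(np.abs(y_te - lr_lin.predict(X_te[:, :1])))

print('MAE temp only: {:.3f}'.format(mae_lin))
print('MAE with temp^2, temp^3: {:.3f}'.format(mae_poly))
```

The model stays linear in its coefficients; only the inputs are transformed. Scikit-learn's `PolynomialFeatures(degree=3, include_bias=False)` produces the same three columns from a single `temp` column.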

In [None]:
# Add polynomial features
???

# Fit a linear regression
mae_lr2 = ???
print('MAE lr with new features: {:.3f}'.format(mae_lr2))

In [None]:
# Plot predictions
???

Exercise - Separate sources
---

In the last exercise, we saw that we can identify two sources in the data.

1. Data points collected during working days
1. Data points collected during non-working days

The goal of this exercise is to create a model for each source using your extended set of features, that is, the original features plus the `temp^2` and `temp^3` polynomial features.

> **Exercise**: Create a model for each source with the extended set of features, and evaluate the overall performance on the test set using MAE. Plot the two models with a scatter plot of temperatures vs. number of users. Create a final comparison using a bar chart.
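
The two-model approach can be sketched as follows: split the rows with a boolean mask, fit one regression per group, then write each group's predictions back into a single array so that one overall MAE can be computed. The data here is synthetic, and the `workingday` flag, the slopes, and the noise level are all assumptions for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bike data (hypothetical values)
rng = np.random.default_rng(0)
n = 400
temp = rng.uniform(0.1, 0.9, size=n)
workingday = rng.integers(0, 2, size=n).astype(bool)

# Assumed: non-working days draw more users at the same temperature
users = np.where(workingday, 600 * temp, 2000 * temp) + rng.normal(0, 30, size=n)

# Extended feature set: temp, temp^2, temp^3
X = np.c_[temp, temp**2, temp**3]

# Split the mask together with X and y so the rows stay aligned
X_tr, X_te, y_tr, y_te, wd_tr, wd_te = train_test_split(
    X, users, workingday, test_size=0.5, random_state=0)

# One model per source
lr_wd = LinearRegression().fit(X_tr[wd_tr], y_tr[wd_tr])
lr_nwd = LinearRegression().fit(X_tr[~wd_tr], y_tr[~wd_tr])

# Route each test row to the model of its source
y_pred = np.empty_like(y_te)
y_pred[wd_te] = lr_wd.predict(X_te[wd_te])
y_pred[~wd_te] = lr_nwd.predict(X_te[~wd_te])
mae_wdnwd = np.mean(np.abs(y_te - y_pred))

# Single model on all rows, for comparison
lr_single = LinearRegression().fit(X_tr, y_tr)
mae_single = np.mean(np.abs(y_te - lr_single.predict(X_te)))

print('MAE single model: {:.3f}'.format(mae_single))
print('MAE two sources: {:.3f}'.format(mae_wdnwd))
```

With the notebook's data, where `workingday` holds the strings `yes` and `no`, the mask would be built with a comparison such as `data_df['workingday'] == 'yes'`.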

In [None]:
# Separate data points
???

In [None]:
# Fit a linear regression for working days (wd)
# and one for non-working days (nwd)
???

# Compute overall performance with MAE
mae_wdnwd = ???
print('MAE two sources: {:.3f}'.format(mae_wdnwd))

In [None]:
# Plot predictions
???

In [None]:
# Final comparison
???