# Chose 4 different regression techniques

## Read data
Only the `train` file is relevant for training:
`data/bike_sharing/bikeSharing.shuf.train.csv`

The `test` file doesn't contain labels:
`data/bike_sharing/bikeSharing.shuf.test.csv`

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error as MSE

In [3]:
bike = pd.read_csv('data/bike_sharing/bikeSharing.shuf.train.csv')

### Check data
The `info()` method shows whether there are missing values.
Here it reports none: every feature has 8690 non-null entries.

In [4]:
bike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8690 entries, 0 to 8689
Data columns (total 15 columns):
id            8690 non-null int64
cnt           8690 non-null int64
dteday        8690 non-null object
season        8690 non-null int64
yr            8690 non-null int64
mnth          8690 non-null int64
hr            8690 non-null int64
holiday       8690 non-null int64
weekday       8690 non-null int64
workingday    8690 non-null int64
weathersit    8690 non-null int64
temp          8690 non-null float64
atemp         8690 non-null float64
hum           8690 non-null float64
windspeed     8690 non-null float64
dtypes: float64(4), int64(10), object(1)
memory usage: 1018.5+ KB
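As a cross-check to reading the `info()` output by eye, missing values can also be counted directly. This is a minimal sketch on a made-up toy frame (the `toy` frame and its values are assumptions, not the bike data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `bike` (values are made up for illustration)
toy = pd.DataFrame({
    "cnt": [16, 40, 32, 13],
    "temp": [0.24, np.nan, 0.22, 0.24],
    "hum": [0.81, 0.80, np.nan, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# True only if the frame has no missing values at all
has_no_missing = not toy.isna().any().any()
print(has_no_missing)
```

Applied to `bike`, every count should be 0 if the `info()` output above is right.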


## Preprocessing (without scaling)
Preprocessing is needed because the data contains a date stored as a string (`dteday`).
I tried converting the date to an ordinal value (a running day number), but converting it to day of week (`weekday`) and
`weekofyear` performed better.
The dataset already contains time variables (`yr`, `mnth`, ...). Since strongly correlated features can be a problem
for some models, there is little additional information to extract from `dteday`.

In [5]:
# bike['dteday'] = pd.to_datetime(bike['dteday'], format="%Y/%m/%d").apply(lambda x: x.toordinal())
bike['weekday'] = pd.to_datetime(bike['dteday'], format="%Y/%m/%d").apply(lambda x: x.dayofweek)
bike['weekofyear'] = pd.to_datetime(bike['dteday'], format="%Y/%m/%d").apply(lambda x: x.weekofyear)

bike.drop('dteday', axis= 1, inplace=True)
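The same conversion can be sketched with the vectorized `.dt` accessor, parsing the date column only once. The toy dates and the `"%Y-%m-%d"` format are assumptions for this sketch, and `dt.isocalendar()` needs pandas 1.1 or newer:

```python
import pandas as pd

# Toy dates; the "%Y-%m-%d" format is an assumption for this sketch
toy = pd.DataFrame({"dteday": ["2011-01-01", "2011-01-03", "2012-07-04"]})

# Parse the strings once and reuse the result
dates = pd.to_datetime(toy["dteday"], format="%Y-%m-%d")

# Day of week: Monday=0 ... Sunday=6
toy["weekday"] = dates.dt.dayofweek
# ISO week number (pandas >= 1.1), cast to a plain integer column
toy["weekofyear"] = dates.dt.isocalendar().week.astype("int64")

# The raw string column is no longer needed
toy = toy.drop(columns="dteday")
print(toy)
```

Note that 2011-01-01 falls into ISO week 52 of the previous year, so week numbers around New Year can look surprising.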

In [6]:
bike.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8690 entries, 0 to 8689
Data columns (total 15 columns):
id            8690 non-null int64
cnt           8690 non-null int64
season        8690 non-null int64
yr            8690 non-null int64
mnth          8690 non-null int64
hr            8690 non-null int64
holiday       8690 non-null int64
weekday       8690 non-null int64
workingday    8690 non-null int64
weathersit    8690 non-null int64
temp          8690 non-null float64
atemp         8690 non-null float64
hum           8690 non-null float64
windspeed     8690 non-null float64
weekofyear    8690 non-null int64
dtypes: float64(4), int64(11)
memory usage: 1018.5 KB


## Creating feature and target arrays

In [7]:
X = bike.drop('cnt', axis = 1).values
y = bike['cnt'].values

## Creating train and test data 


In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
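To see what `test_size` and `random_state` do, here is a small sketch on toy arrays (the `X_toy` and `y_toy` arrays are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with 2 features; row i is [2i, 2i + 1] and its target is i
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100)

# 30% of the rows go to the test set, the rest to the training set
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
print(X_tr.shape, X_te.shape)

# The same random_state reproduces exactly the same split
_, X_te_again, _, _ = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
same_split = np.array_equal(X_te, X_te_again)
print(same_split)
```

Features and targets are shuffled together, so each row of `X_tr` still belongs to the matching entry of `y_tr`.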

## Starting with a simple model
Reasons to start simple:
+ it gives a sense of how challenging the problem is
+ many more things can go wrong with complex models
+ it shows how much signal basic models can pull out of the data
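An even simpler reference point than a linear model is a baseline that always predicts the training mean. This is not part of the original experiments; it is a minimal sketch on synthetic data (all values are made up) using scikit-learn's `DummyRegressor`:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data (made up): 200 samples, count-like target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = rng.poisson(lam=50, size=200).astype(float)

# Baseline: ignore the features and always predict the training mean
baseline = DummyRegressor(strategy="mean").fit(X_toy[:150], y_toy[:150])
y_pred = baseline.predict(X_toy[150:])

# RMSE of the baseline on the held-out rows
rmse_baseline = np.sqrt(mean_squared_error(y_toy[150:], y_pred))
print("Baseline RMSE: {:.2f}".format(rmse_baseline))
```

Any real model should beat this number; if it doesn't, something is probably wrong with the features or the pipeline.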

## Ridge regression
__Score history:__
- Test set RMSE of ridge: 137.47 (ordinal date)
- Test set RMSE of ridge: 137.44 (dayofweek, ...)
- Test set RMSE of ridge: 136.08 (parameter tuning)

### Ridge parameters
- __alpha:__
    - regularization strength: the higher the value, the stronger the regularization
- __fit_intercept:__
    - whether to calculate an intercept for this model (e.g. not needed if the data is already centered)
- __normalize:__
    - if `fit_intercept` and `normalize` are both `True`, the regressors X are normalized before fitting; ignored when `fit_intercept` is `False`
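As far as I know, the `normalize` option was deprecated in scikit-learn 1.0 and removed in 1.2, so on newer versions scaling has to happen in a separate step. A minimal sketch on synthetic data (the data and the reduced grid are assumptions) that puts a `StandardScaler` in front of `Ridge`; note that standardizing is not identical to what `normalize=True` did:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (made up) with features on very different scales
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(300, 4)) * [1.0, 10.0, 100.0, 0.1]
y_toy = X_toy @ np.array([2.0, 0.3, 0.01, 5.0]) + rng.normal(scale=0.5, size=300)

# Scaling is a pipeline step, so it is re-fitted inside every CV fold
pipe_ridge = make_pipeline(StandardScaler(), Ridge())

# Parameters of a pipeline step are addressed as <step name>__<parameter>
params = {
    "ridge__alpha": [0.3, 0.6, 0.9],
    "ridge__fit_intercept": [True, False],
}

grid = GridSearchCV(pipe_ridge, params, cv=3, scoring="neg_mean_squared_error")
grid.fit(X_toy, y_toy)
print(grid.best_params_)
```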

In [23]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

In [54]:
# create parameter list for ridge regression
params_ridge = {
    'alpha': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'normalize': [True, False],
    'fit_intercept': [True, False]
}

ridge = Ridge()

In [55]:
# setup cross validation parameter grid search
grid_ridge = GridSearchCV(estimator=ridge,
    param_grid=params_ridge,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

In [56]:
# fitting model on training data
grid_ridge.fit(X_train, y_train)

Fitting 3 folds for each of 28 candidates, totalling 84 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  84 out of  84 | elapsed:    0.2s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Ridge(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=None, normalize=False, random_state=None,
                             solver='auto', tol=0.001),
             iid='warn', n_jobs=-1,
             param_grid={'alpha': [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
                         'fit_intercept': [True, False],
                         'normalize': [True, False]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=1)

In [57]:
# extracting best parameters (can be used for finer hyperparameter tuning)
grid_ridge.best_params_

{'alpha': 0.9, 'fit_intercept': True, 'normalize': False}

In [58]:
# predicting values and calculating the RMSE score
y_pred_ridge = grid_ridge.predict(X_test)
rmse_test_ridge = MSE(y_test, y_pred_ridge)**(1/2)
print('Test set RMSE of ridge: {:.2f}'.format(rmse_test_ridge))

Test set RMSE of ridge: 136.08
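The RMSE is just the square root of the mean squared error. Since the same computation is repeated for every model, it could be wrapped in a small helper. A sketch (the `rmse` function name is my own, not part of scikit-learn's API used here):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the MSE."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Every prediction is off by exactly 1, so the RMSE is 1
print(rmse([1, 2, 3, 4], [2, 3, 4, 5]))

# Errors of 3 and 4: sqrt((9 + 16) / 2) = sqrt(12.5)
print(rmse([0, 0], [3, 4]))
```

Because the errors are squared before averaging, a few large misses weigh more heavily than many small ones.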


## Lasso regression
__Score history:__
- Test set RMSE of lasso: 136.00

### Lasso Parameters:
- __alpha:__
    - constant that multiplies the L1-Norm
- __fit_intercept:__
    - whether to calculate an intercept for this model (e.g. not needed if the data is already centered)
- __normalize:__
    - if `fit_intercept` and `normalize` are both `True`, the regressors X are normalized before fitting; ignored when `fit_intercept` is `False`
- __positive:__
    - when `True`, forces the coefficients to be positive
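What `alpha` does to the L1 penalty is easiest to see on synthetic data where only one feature matters. A minimal sketch (the data and the two `alpha` values are made up for illustration):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (made up): only the first of 5 features influences the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = 3.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=1000)

# Weak penalty: coefficient of the informative feature stays close to 3
lasso_weak = Lasso(alpha=0.001).fit(X_toy, y_toy)
# Strong penalty: irrelevant coefficients become exactly 0, the relevant one shrinks
lasso_strong = Lasso(alpha=0.5).fit(X_toy, y_toy)

print("alpha=0.001:", np.round(lasso_weak.coef_, 3))
print("alpha=0.5:  ", np.round(lasso_strong.coef_, 3))
```

A larger `alpha` both removes uninformative features and shrinks the remaining coefficients, which is why very small values can win the grid search when most features carry signal.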

In [98]:
from sklearn.linear_model import Lasso

In [118]:
params_lasso = {
    'alpha': [0.001, 0.005, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2],
    'normalize': [True, False],
    'fit_intercept': [True, False],
    'positive': [True, False] 
}

lasso = Lasso()

In [119]:
grid_lasso = GridSearchCV(estimator=lasso,
    param_grid=params_lasso,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

In [120]:
grid_lasso.fit(X_train, y_train)

Fitting 3 folds for each of 112 candidates, totalling 336 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    2.1s
[Parallel(n_jobs=-1)]: Done 336 out of 336 | elapsed:    3.6s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=Lasso(alpha=1.0, copy_X=True, fit_intercept=True,
                             max_iter=1000, normalize=False, positive=False,
                             precompute=False, random_state=None,
                             selection='cyclic', tol=0.0001, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'alpha': [0.001, 0.005, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6,
                                   0.7, 0.8, 0.9, 1, 1.1, 1.2],
                         'fit_intercept': [True, False],
                         'normalize': [True, False],
                         'positive': [True, False]},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring='neg_mean_squared_error', verbose=1)

In [121]:
grid_lasso.best_params_

{'alpha': 0.005, 'fit_intercept': True, 'normalize': True, 'positive': False}

In [122]:
y_pred_lasso = grid_lasso.predict(X_test)
rmse_test_lasso = MSE(y_test, y_pred_lasso)**(1/2)
print('Test set RMSE of lasso: {:.2f}'.format(rmse_test_lasso))

Test set RMSE of lasso: 136.00


## Random Forest
__Score history:__
- Test set RMSE of rf: 112.39
- Test set RMSE of rf: 107.32 (start parameter tuning)
- Test set RMSE of rf: 106.62 (parameter tuning 2)
- Test set RMSE of rf: 71.72 (pt 3)
- Test set RMSE of rf: 70.43 (new date, higher estimators)
- Test set RMSE of rf: 41.65 (pt 4)
- Test set RMSE of rf: 41.53 (pt 5)

### Random Forest parameters
- __n_estimators:__
    - number of trees in the forest
- __max_depth:__
    - maximum depth of the tree
- __min_samples_split:__
    - the min. number of samples required to split an internal node
- __min_samples_leaf:__
    - the minimum number of samples required to be at a leaf node
- __min_weight_fraction_leaf:__
    - the minimum weighted fraction of the sum total of weights (of all input samples) required to be at a leaf node
- __max_features:__
    - the number of features to consider when looking for the best split
- __min_impurity_decrease:__
    - A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
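Two of these parameters can be checked directly on a fitted forest. A minimal sketch on synthetic data (the data and the small parameter values are assumptions chosen to keep it fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (made up): a daily cycle over an hour-like feature
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 24, size=(500, 1))
y_toy = 100 * np.sin(2 * np.pi * X_toy[:, 0] / 24) + rng.normal(scale=5, size=500)

rf_small = RandomForestRegressor(n_estimators=20, max_depth=3, random_state=42)
rf_small.fit(X_toy, y_toy)

# n_estimators: one fitted decision tree per estimator
n_trees = len(rf_small.estimators_)
# max_depth: no individual tree may grow deeper than this
depths = [tree.get_depth() for tree in rf_small.estimators_]

print(n_trees, max(depths))
```

With `max_depth=None` (the default), trees keep growing until the leaves are pure or hit the `min_samples_*` limits.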

In [128]:
from sklearn.ensemble import RandomForestRegressor

In [137]:
rf = RandomForestRegressor(random_state=42)

In [138]:
# take a look at the rf's parameters
print(rf.get_params())

{'bootstrap': True, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 'warn', 'n_jobs': None, 'oob_score': False, 'random_state': 42, 'verbose': 0, 'warm_start': False}


In [143]:
params_rf = {
    'n_estimators': [2000],
    'max_depth': [None, 20, 21, 22],
    'min_samples_split': [2, 3, 4]
}

grid_rf = GridSearchCV(estimator=rf,
    param_grid=params_rf,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

In [144]:
grid_rf.fit(X_train, y_train)


Fitting 3 folds for each of 12 candidates, totalling 36 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed:  8.2min finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=RandomForestRegressor(bootstrap=True, criterion='mse',
                                             max_depth=None,
                                             max_features='auto',
                                             max_leaf_nodes=None,
                                             min_impurity_decrease=0.0,
                                             min_impurity_split=None,
                                             min_samples_leaf=1,
                                             min_samples_split=2,
                                             min_weight_fraction_leaf=0.0,
                                             n_estimators='warn', n_jobs=None,
                                             oob_score=False, random_state=42,
                                             verbose=0, warm_start=False),
             iid='warn', n_jobs=-1,
             param_grid={'max_depth': [None, 20, 21, 22],
  

In [141]:
grid_rf.best_params_

Best hyperparameters:
 {'max_depth': None, 'min_samples_split': 2, 'n_estimators': 1000}


In [142]:
y_pred = grid_rf.predict(X_test)
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE of rf: {:.2f}'.format(rmse_test))


Test set RMSE of rf: 41.53


## Creating SVM
__Score history:__
- Test set RMSE of svr: 143.46


### SVM parameters
- __C:__
    - penalty parameter of the error term (higher values mean weaker regularization and a higher risk of overfitting)
- __shrinking:__
    - whether to use the shrinking heuristic
- __kernel:__
    - specifies the kernel type

In [24]:
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

In [25]:
scaler = RobustScaler()
svr = SVR()

In [26]:
svr.get_params()

{'C': 1.0,
 'cache_size': 200,
 'coef0': 0.0,
 'degree': 3,
 'epsilon': 0.1,
 'gamma': 'auto_deprecated',
 'kernel': 'rbf',
 'max_iter': -1,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}

In [27]:
pipe = make_pipeline(scaler, svr)

In [29]:
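# parameters of a pipeline step need the step name as prefix (`svr__`)
params_svr = {'svr__C': [1.0, 1.1, 1.2, 1.3, 1.4],
 'svr__kernel': ['rbf', 'linear', 'poly', 'sigmoid'],
 'svr__shrinking': [True, False]}

# note: this grid search is only set up here; the next cell fits `pipe` with default parameters
grid_svr = GridSearchCV(estimator=pipe,
    param_grid=params_svr,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)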
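When a grid search runs over a pipeline, parameter names follow the pattern `<step name>__<parameter>`, and `make_pipeline` names each step after its lowercased class name. A quick sketch to list the valid names (the `pipe_demo` variable is just for illustration):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.svm import SVR

pipe_demo = make_pipeline(RobustScaler(), SVR())

# Step names generated by make_pipeline
step_names = [name for name, _ in pipe_demo.steps]
print(step_names)

# Valid keys for a param_grid: 'svr__C' exists, a bare 'C' does not
param_names = pipe_demo.get_params().keys()
print("svr__C" in param_names, "C" in param_names)
```

Passing a key that is not in this list to `GridSearchCV` raises an error when fitting.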

In [30]:
pipe.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('robustscaler',
                 RobustScaler(copy=True, quantile_range=(25.0, 75.0),
                              with_centering=True, with_scaling=True)),
                ('svr',
                 SVR(C=1.0, cache_size=200, coef0=0.0, degree=3, epsilon=0.1,
                     gamma='auto_deprecated', kernel='rbf', max_iter=-1,
                     shrinking=True, tol=0.001, verbose=False))],
         verbose=False)

In [31]:
y_pred_svr = pipe.predict(X_test)
rmse_test_svr = MSE(y_test, y_pred_svr)**(1/2)
print('Test set RMSE of svr: {:.2f}'.format(rmse_test_svr))


Test set RMSE of svr: 143.46


## Creating a GradientBoostingRegressor
__Score history:__
- Test set RMSE of gbt: 65.32
- Test set RMSE of gbt: 44.17 (parameter tuning)

In [163]:
from sklearn.ensemble import GradientBoostingRegressor

In [170]:
gbt = GradientBoostingRegressor(random_state=42)


In [177]:
params_gbt = {
    'n_estimators': [200, 300, 400],
    'max_depth': [1, 2, 3, 4, 5, 6, 7],
    'max_features': ['log2'],
    'learning_rate': [0.05, 0.1, 0.15],
    'subsample': [0.8]
}
# Instantiate 'grid_gbt'
grid_gbt = GridSearchCV(estimator=gbt,
    param_grid=params_gbt,
    cv=3,
    scoring='neg_mean_squared_error',
    verbose=1,
    n_jobs=-1)

In [178]:
grid_gbt.fit(X_train, y_train)

Fitting 3 folds for each of 63 candidates, totalling 189 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    4.3s
[Parallel(n_jobs=-1)]: Done 189 out of 189 | elapsed:   29.0s finished


GridSearchCV(cv=3, error_score='raise-deprecating',
             estimator=GradientBoostingRegressor(alpha=0.9,
                                                 criterion='friedman_mse',
                                                 init=None, learning_rate=0.1,
                                                 loss='ls', max_depth=3,
                                                 max_features=None,
                                                 max_leaf_nodes=None,
                                                 min_impurity_decrease=0.0,
                                                 min_impurity_split=None,
                                                 min_samples_leaf=1,
                                                 min_samples_split=2,
                                                 min_weight_fraction_leaf=0.0,
                                                 n_estimators=100,
                                                 n_iter...
                             

In [179]:
grid_gbt.best_params_


Best hyperparameters:
 {'learning_rate': 0.15, 'max_depth': 5, 'max_features': 'log2', 'n_estimators': 400, 'subsample': 0.8}
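Since the grid search is scored with `neg_mean_squared_error`, `best_score_` is a negated MSE averaged over the CV folds. It can be turned into a cross-validated RMSE to compare with the test set RMSE. A minimal sketch on synthetic data with a much smaller grid (both are assumptions to keep it fast):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data (made up): nonlinear in the first feature, linear in the second
rng = np.random.default_rng(0)
X_toy = rng.uniform(-3, 3, size=(300, 2))
y_toy = X_toy[:, 0] ** 2 + X_toy[:, 1] + rng.normal(scale=0.2, size=300)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    {"n_estimators": [50, 100], "max_depth": [1, 3]},
    cv=3,
    scoring="neg_mean_squared_error",
)
grid.fit(X_toy, y_toy)

# best_score_ is negative (negated MSE); flip the sign before taking the root
cv_rmse = np.sqrt(-grid.best_score_)
print(grid.best_params_)
print("CV RMSE of best candidate: {:.3f}".format(cv_rmse))
```

If the CV RMSE and the test set RMSE differ a lot, that can hint at an unlucky split or at overfitting to the validation folds.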


In [182]:
best_model = grid_gbt.best_estimator_
y_pred = best_model.predict(X_test)
rmse_test = MSE(y_test, y_pred)**(1/2)
print('Test set RMSE of gbt: {:.2f}'.format(rmse_test))

Test set RMSE of gbt: 44.17
