<center>
    <h3>University of Toronto</h3>
    <h3>Department of Mechanical and Industrial Engineering</h3>
    <h3>MIE368 Analytics in Action </h3>
    <h3>(Fall 2020)</h3>
    <hr>
    <h1>Quiz 5: Model Engineering</h1>
    <h3>October 29, 2020</h3>
</center>

__Instructions__

* Please use this Colab notebook to solve the questions in the coding section of Quiz Five.
* Run the first code block to import the necessary quiz packages, import the quiz data, and split the data into training and validation datasets.
* Any additional code you need to answer the questions can be added to this notebook. Please remember to copy and paste the code after each coding question.

In [1]:
# Import packages
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LassoCV
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Import the data
df = pd.read_csv("https://docs.google.com/uc?export=download&id=1cVN2rmrUtLYcnT7Pbxp05R4O3YmjsI_4")

'''We run some light preprocessing to clean the data for the quiz. You don't 
need to understand each line of code here'''

# Convert full date timestamp to hour of day
df.timestamp = df.timestamp.apply(lambda x: pd.Timestamp(x).hour)

# One-hot encode season
df = pd.get_dummies(df, columns=['season'])
# Rename the one-hot season columns from season numbers to season names
df.rename(columns={
    'season_0.0': 'winter',
    'season_1.0': 'spring', 
    'season_2.0': 'summer', 
    'season_3.0': 'fall'
    },
    inplace=True
    )

# Split training and validation sets
X_train, X_val, y_train, y_val = train_test_split(df.drop(columns=['cnt']),
                                                    df.cnt,
                                                    test_size = 0.25,
                                                    shuffle = False
                                                    )
# print out data frame snapshot
df.head()

   timestamp  cnt   t1   t2    hum  wind_speed  weather_code  is_holiday  is_weekend  winter  spring  summer  fall
0          0  182  3.0  2.0   93.0         6.0           3.0         0.0         1.0       0       0       0     1
1          1  138  3.0  2.5   93.0         5.0           1.0         0.0         1.0       0       0       0     1
2          2  134  2.5  2.5   96.5         0.0           1.0         0.0         1.0       0       0       0     1
3          3   72  2.0  2.0  100.0         0.0           1.0         0.0         1.0       0       0       0     1
4          4   47  2.0  0.0   93.0         6.5           1.0         0.0         1.0       0       0       0     1


In this quiz, we will use the London Bike Share dataset, which contains usage statistics for London's bike share service. The service lets members borrow bikes for short periods of time for a small fee. We will build a model to predict how often the service is used (i.e., how many bikes are rented in a given hour).

The columns in the dataset are as follows:


* **timestamp**: (integer) Start hour of the one-hour block (e.g., 0 is midnight to 1 am)

* **cnt**: (integer) The number of bikes rented in the hour block

* **t1**: (continuous) Actual temperature in ˚C

* **t2**: (continuous) "Feels like" temperature in ˚C

* **hum**: (continuous) Humidity in %

* **wind_speed**: (continuous) Wind speed in km/hr

* **weather_code**: (integer) Severity level of the weather (higher is more severe)

* **is_holiday**: (binary) 1 if hour is during a holiday, 0 otherwise

* **is_weekend**: (binary) 1 if hour is during a weekend, 0 otherwise

* **winter**: (binary) 1 if hour is in winter, 0 otherwise

* **spring**: (binary) 1 if hour is in spring, 0 otherwise

* **summer**: (binary) 1 if hour is in summer, 0 otherwise

* **fall**: (binary) 1 if hour is in fall, 0 otherwise



## Question 1 (1 mark)

> Q: Is the `f_classif` function (from `sklearn.feature_selection`) appropriate to use for selecting features in this problem? Why? (25 words max)

>> A: No, it is not appropriate. The target in this problem is continuous, and `f_classif` is only appropriate when the target is categorical.
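
As a rough illustration of the alternative, `sklearn.feature_selection` also provides `f_regression`, which scores features against a continuous target. The sketch below uses synthetic data (the names `X_demo` and `y_demo` are made up for this example and are not the quiz data) in which only the first feature drives the target.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in data (an assumption for illustration, not the quiz data):
# 200 rows, 4 features, and a continuous target driven only by feature 0
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=200)

# f_regression runs a univariate F-test of each feature against a continuous target
selector = SelectKBest(score_func=f_regression, k=1)
selector.fit(X_demo, y_demo)

# Column indices of the features that were kept
selected = selector.get_support(indices=True)
print(selected)
```

On the quiz data, the same pattern would presumably be applied to `X_train` and `y_train`.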

## Question 2 (1 mark)

> Q: Which of the following cannot be used for feature engineering?

> * `wind_speed` * `cnt`
> * `winter` / `is_holiday`
> * `t1` * (`wind_speed`)^2
> * `hum` + `wind_speed`
> * (`t2` - `t1`) * `fall`
> * `t2`^(`t1`)

>> A: `wind_speed` * `cnt` (the target should not be included in feature engineering)
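
A minimal sketch of the idea, assuming a tiny made-up data frame that borrows the quiz's column names: the target `cnt` is dropped first, and every engineered feature is then built from predictors only. The new column names (`hum_plus_wind`, `feels_gap_in_fall`, `t1_wind_sq`) are hypothetical.

```python
import pandas as pd

# Hypothetical two-row frame using the quiz's column names (values are made up)
demo = pd.DataFrame({
    't1': [3.0, 10.0],
    't2': [2.0, 9.0],
    'hum': [93.0, 60.0],
    'wind_speed': [6.0, 12.0],
    'fall': [1, 0],
    'cnt': [182, 950],
})

# Separate the target first so it cannot leak into any engineered feature
features = demo.drop(columns=['cnt'])

# Engineered features built from predictors only
features['hum_plus_wind'] = features['hum'] + features['wind_speed']
features['feels_gap_in_fall'] = (features['t2'] - features['t1']) * features['fall']
features['t1_wind_sq'] = features['t1'] * features['wind_speed'] ** 2

print(features)
```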

## Question 3 (2 marks)

> Q_a: What is stacking? How is model stacking different from bagging? (30 words max)

>> Stacking is when the predictions from several models are input to another prediction model, whereas in bagging they are input to a voting rule.

> Q_b: What is bagging? Why do we say bagging is an ensemble model? (30 words max)

>> Bagging is when predictions from several models are input to a voting rule; it is an ensemble because it combines knowledge from multiple models.
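
The contrast can be sketched with scikit-learn's `BaggingRegressor` and `StackingRegressor`. This is only an illustration on synthetic data (`X_demo` and `y_demo` are made up here), and the choice of base models is an assumption, not something the quiz prescribes. For regression, the "voting rule" in bagging is an average of the base predictions.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(300, 2))
y_demo = np.sin(X_demo[:, 0]) + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=300)

# Bagging: many copies of one model, each fit to a bootstrap sample,
# combined by a fixed rule (averaging for regression)
bagging = BaggingRegressor(DecisionTreeRegressor(max_depth=5),
                           n_estimators=20, random_state=0)
bagging.fit(X_demo, y_demo)

# Stacking: base-model predictions become the inputs of a final model
# that is itself trained
stacking = StackingRegressor(
    estimators=[('tree', DecisionTreeRegressor(max_depth=5, random_state=0)),
                ('linear', LinearRegression())],
    final_estimator=LinearRegression(),
)
stacking.fit(X_demo, y_demo)

# The final model learns one weight per base model
print(stacking.final_estimator_.coef_)
```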

## Question 4 (3 marks)
Do a grid search (**with default settings**) to find suitable parameters for a `DecisionTreeRegressor` with `random_state = 1`. Search over the following set of parameters:

* `max_depth`: (5, 10)
* `max_features`: ('auto', 'sqrt', 'log2')

According to your grid search, what parameters (i.e., `max_depth` number and `max_features` string) should you use to train your model on this data?

In [2]:
# Initialize a model
cart_model = DecisionTreeRegressor(random_state=1)

# Write your code here.
# -------------------

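# Note: 'auto' (use all features) was deprecated in scikit-learn 1.1 and removed
# in 1.3; on newer versions, None should be the equivalent setting.
params_to_search = {
    'max_depth': [5, 10],
    'max_features': ['auto', 'sqrt', 'log2']
}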

# Initialize the grid search
optimized_dt = GridSearchCV(cart_model, params_to_search)
# Run the grid search
optimized_dt.fit(X_train, y_train)

# Easiest way (not covered in lab): read the best parameters directly
optimized_dt.best_params_

# Based on lab material
cv_result_df = pd.DataFrame(optimized_dt.cv_results_)
best_model_params_index = cv_result_df.mean_test_score.idxmax()
best_model_params = cv_result_df.params[best_model_params_index]

print(f'The best model parameters are {best_model_params}')

The best model parameters are {'max_depth': 10, 'max_features': 'auto'}


## Question 5 (3 marks)
Train a $k$-means clustering model on `X_train` using one centroid seed initialization, 10 clusters, and `random_state = 0`. Apply this model to the validation data, and report which cluster contains the most holiday hours.

In [3]:
# Define clustering model 
mdk_k_means = KMeans(n_init = 1, n_clusters=10, random_state = 0) 

# Fit the model to the training set
mdk_k_means.fit(X_train)

# Make a series of cluster predictions
val_clusters = pd.Series(mdk_k_means.predict(X_val), name = 'cluster', index=X_val.index)

# Get cluster with most holidays
X_val.join(val_clusters).groupby('cluster').sum().is_holiday.idxmax()

6
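
To make the last step concrete, here is a small sketch of the group-sum-then-`idxmax` pattern on a made-up frame (the cluster labels and holiday flags below are hypothetical, not taken from the quiz data).

```python
import pandas as pd

# Hypothetical validation rows with already-predicted cluster labels
demo = pd.DataFrame({
    'cluster': [0, 0, 1, 1, 1, 2],
    'is_holiday': [0, 1, 1, 1, 0, 0],
})

# Summing a 0/1 flag per cluster counts the holiday rows in each cluster
holidays_per_cluster = demo.groupby('cluster')['is_holiday'].sum()

# idxmax returns the index label (here, the cluster id) of the largest count
top_cluster = holidays_per_cluster.idxmax()
print(top_cluster)  # 1
```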

## Question 6 (3 marks)

Standardize the training and validation features (i.e., `X_train` and `X_val`), and report the mean of `'t1'` (i.e., the second column, which is indexed by 1) of the standardized validation features. 

In [4]:
# Initialize standard scaler
scaler = StandardScaler()
# Fit scaler
scaler.fit(X_train)
# Transform feature data
X_train_standardized = scaler.transform(X_train)
X_val_standardized = scaler.transform(X_val)
# Evaluate validation mean
X_val_standardized[:,1].mean()

0.4541703067604414
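
The validation mean is not zero because the scaler's mean and standard deviation come from the training set only. A minimal sketch with made-up one-column arrays (`train` and `val` are hypothetical) shows the effect:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data: the validation values sit above the training values
train = np.array([[0.0], [2.0], [4.0]])  # mean 2
val = np.array([[4.0], [6.0]])           # mean 5

# Fit on the training data only, then reuse the same mean and scale for both sets
scaler = StandardScaler().fit(train)
train_std = scaler.transform(train)
val_std = scaler.transform(val)

print(train_std.mean())  # zero up to floating-point error
print(val_std.mean())    # positive, because val is centred with the training mean
```

Fitting the scaler on the training set alone keeps information about the validation set out of the preprocessing step.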

## Question 7 (2 marks)

Use the `describe()` method on `X_val`. Based on the output, why might you be worried about data imbalance in the validation set? (20 words max)

In [5]:
season_columns = ['winter', 'spring', 'summer', 'fall']
print(X_val[season_columns].describe())

       winter       spring       summer         fall
count  4354.0  4354.000000  4354.000000  4354.000000
mean      0.0     0.319936     0.492650     0.187414
std       0.0     0.466505     0.500003     0.390288
min       0.0     0.000000     0.000000     0.000000
25%       0.0     0.000000     0.000000     0.000000
50%       0.0     0.000000     0.000000     0.000000
75%       0.0     1.000000     1.000000     0.000000
max       0.0     1.000000     1.000000     1.000000


>> A: The validation set contains no winter hours, so winter predictions are never validated; the other seasons are also unevenly represented.
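
One plausible cause, given that the split at the top of the notebook uses `shuffle = False`, is that the validation set is simply the last 25% of rows in time order. The sketch below reproduces that effect on a made-up, time-ordered frame (the `season` column and its values are hypothetical).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical year of data in time order: eight rows, two per season
demo = pd.DataFrame({'season': ['winter', 'winter', 'spring', 'spring',
                                'summer', 'summer', 'fall', 'fall']})

# With shuffle=False, the validation set is the final 25% of rows
train_part, val_part = train_test_split(demo, test_size=0.25, shuffle=False)

# Share of each season in the validation set
val_share = val_part['season'].value_counts(normalize=True)
print(val_share)
```

In this toy case the validation set contains only one season, which is the same kind of imbalance that `describe()` reveals above.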