## Class 2 - Defining evaluation metrics and fitting basic regression models
In our second lecture, we discussed a number of algorithms and evaluation metrics for regression problems. Today, we will go back to the datasets we looked at last week, and implement some of these algorithms and evaluation metrics on the predictive modeling problems we have defined. 

We will do all of this using `scikit-learn`. A couple of pointers to useful documentation, before we start:
- In general, the scikit-learn documentation is your friend: https://scikit-learn.org/stable/
- Here is a list of linear models implemented as estimators/predictors in sklearn: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.linear_model
- Here are different forms of neighbor-based models (we talked about KNN regression, `KNeighborsRegressor` in sklearn, yesterday): https://scikit-learn.org/stable/modules/classes.html#module-sklearn.neighbors
- Here are evaluation metrics implemented in sklearn: https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics 
- Here are utilities for preprocessing steps: https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing 

**Note**: Under `nbs/class_02` you will find a notebook called `example.ipynb`, where I provide an example of how to run today's exercise on simulated data.

### Today's exercise
Gather in the same or similar groups as last week. Under `class/class_02.md` you will find two predictive modeling questions, one for each dataset. There are different variants of the same questions, which differ in which outcome you want to predict.

What I would like you to do today is the following:
1. Create a folder called `group-x` within `nbs/class_02`, `cd` into it and work within that today
2. Choose an outcome variable for a regression problem. On the basis of this, define **which of the evaluation metrics** could be suitable. Evaluation metrics can be computed using scikit-learn: https://scikit-learn.org/stable/modules/model_evaluation.html#regression-metrics 
3. (a) If you are in the bike sharing group, split your dataset into a training/validation/test set using later time points as validation/test set. Validation and test set should be 15% of your data each. (b) If you are in the personality group, using sklearn's `train_test_split` function, create a 70/15/15 random split of your data.
    - Remember to set a seed (`random_state`) when you do so. Let's all use the same (the classic `random_state=42`)
    - Save these datasets as separate csv files in a subfolder called `data`
4. Look at your outcome and predictors: do you want to transform them in any way?
5. Estimate the performance of a dummy baseline (i.e., the mean model) on all splits
6. Now look at your predictors: do they need any preprocessing? Any transformations? Removal of "bad" data points?
7. Fit the other models using KNN (sklearn's `KNeighborsRegressor`: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html) and linear models (`LinearRegression`: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). Save each fitted model object (with a meaningful name) using `pickle` (https://scikit-learn.org/stable/model_persistence.html) in a subfolder called `models`.
8. Once you are done, evaluate all models on both the training and the validation set and visualize the scores
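
For the random split in step 3(b), `train_test_split` only produces two parts per call, so one common approach is to call it twice. Below is a minimal sketch on simulated data; the column names (`x1`, `x2`, `y`) and the `demo_*` variable names are made up for illustration and are not taken from the personality dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Simulated stand-in for the real data (hypothetical columns)
rng = np.random.default_rng(42)
df_demo = pd.DataFrame({
    "x1": rng.normal(size=200),
    "x2": rng.normal(size=200),
    "y": rng.normal(size=200),
})

# First call: hold out 30% of the rows, keeping 70% for training
demo_train, demo_holdout = train_test_split(
    df_demo, test_size=0.30, random_state=42
)

# Second call: split the held-out 30% in half (15% validation, 15% test)
demo_val, demo_test = train_test_split(
    demo_holdout, test_size=0.50, random_state=42
)

print(len(demo_train), len(demo_val), len(demo_test))
```

Using the same `random_state` in both calls keeps the split reproducible across groups. This sketch does not apply to the bike data, where the split should follow time order instead.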


### Once you have done this
Please submit a pull request to my repository where, within `nbs/class_02/group-x`, you have:
- the notebook on which you have worked
- a subfolder called `data` containing your splits
- a subfolder called `models` containing your models

In next week's class, I will ask each group to briefly present their results.

# Starting exercise

In [38]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

### Loading bike data

In [20]:
df = pd.read_csv("../../../../Project Files/data/class_01/bikes.csv")

### Creating outcome variable
Proportion of casual users

In [21]:
df["propoertion_cas_reg"] = df["casual"]/df["cnt"]

In [5]:
df.describe()

| | instant | season | yr | mnth | hr | holiday | weekday | workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | cnt | propoertion_cas_reg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 | 17379.0 |
| mean | 8690.0 | 2.50164 | 0.502561 | 6.537775 | 11.546752 | 0.02877 | 3.003683 | 0.682721 | 1.425283 | 0.496987 | 0.475775 | 0.627229 | 0.190098 | 35.676218 | 153.786869 | 189.463088 | 0.172143 |
| std | 5017.0295 | 1.106918 | 0.500008 | 3.438776 | 6.914405 | 0.167165 | 2.005771 | 0.465431 | 0.639357 | 0.192556 | 0.17185 | 0.19293 | 0.12234 | 49.30503 | 151.357286 | 181.387599 | 0.136557 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.02 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 25% | 4345.5 | 2.0 | 0.0 | 4.0 | 6.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.34 | 0.3333 | 0.48 | 0.1045 | 4.0 | 34.0 | 40.0 | 0.063492 |
| 50% | 8690.0 | 3.0 | 1.0 | 7.0 | 12.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.5 | 0.4848 | 0.63 | 0.194 | 17.0 | 115.0 | 142.0 | 0.146893 |
| 75% | 13034.5 | 3.0 | 1.0 | 10.0 | 18.0 | 0.0 | 5.0 | 1.0 | 2.0 | 0.66 | 0.6212 | 0.78 | 0.2537 | 48.0 | 220.0 | 281.0 | 0.253731 |
| max | 17379.0 | 4.0 | 1.0 | 12.0 | 23.0 | 1.0 | 6.0 | 1.0 | 4.0 | 1.0 | 1.0 | 1.0 | 0.8507 | 367.0 | 886.0 | 977.0 | 1.0 |


### Choosing evaluation metric

We use $R^2$ because it gives an idea of the model's overall performance. It does not depend on the scale of the outcome, and it summarizes performance over a whole split rather than the error of individual predictions.
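
To make the metric concrete, here is a minimal sketch of how $R^2$ is computed, using simulated outcomes rather than the bike data. The helper `r2_manual` and the arrays `y_old` and `y_new` are hypothetical names introduced only for this illustration. It also shows why a model that always predicts the training mean scores 0 on the data the mean came from, and below 0 on data whose mean differs.

```python
import numpy as np
from sklearn.metrics import r2_score

# Simulated outcomes: two samples with slightly different means (assumption)
rng = np.random.default_rng(0)
y_old = rng.normal(loc=0.20, scale=0.1, size=500)  # plays the role of training data
y_new = rng.normal(loc=0.15, scale=0.1, size=500)  # plays the role of later data


def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Mean model: predict the mean of y_old for every observation
pred_old = np.full(y_old.shape[0], y_old.mean())
pred_new = np.full(y_new.shape[0], y_old.mean())

r2_old = r2_manual(y_old, pred_old)  # zero up to floating-point error
r2_new = r2_manual(y_new, pred_new)  # negative, since the two means differ

print(r2_old, r2_new, r2_score(y_new, pred_new))
```

The manual value and `r2_score` should agree. A negative $R^2$ simply means the predictions are worse than predicting the mean of the split being evaluated.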

### Splitting data
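The split is 70% training, 15% validation and 15% test data. The rows are ordered in time, so the test set is the latest 15% of the rows, the validation set is the 15% just before it, and the training set is the oldest 70%.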

In [22]:
# number of rows amounting to 15% of the data
split_len = int(0.15 * len(df))

# test data - the last 15% of the rows
df_test = df.iloc[-split_len:,:]

# remaining data for train and validation
df_train_val = df.iloc[:-split_len,:]
# validation data - the last 15% of the rows
df_val = df_train_val.iloc[-split_len:,:]

# training data - the remaining data of the train and val data
df_train = df_train_val.iloc[:-split_len,:]

In [6]:
print(df_val)

       instant      dteday  season  yr  mnth  hr  holiday  weekday  \
12167    12168  2012-05-27       2   1     5   4        0        0   
12168    12169  2012-05-27       2   1     5   5        0        0   
12169    12170  2012-05-27       2   1     5   6        0        0   
12170    12171  2012-05-27       2   1     5   7        0        0   
12171    12172  2012-05-27       2   1     5   8        0        0   
...        ...         ...     ...  ..   ...  ..      ...      ...   
14768    14769  2012-09-12       3   1     9  13        0        3   
14769    14770  2012-09-12       3   1     9  14        0        3   
14770    14771  2012-09-12       3   1     9  15        0        3   
14771    14772  2012-09-12       3   1     9  16        0        3   
14772    14773  2012-09-12       3   1     9  17        0        3   

       workingday  weathersit  temp   atemp   hum  windspeed  casual  \
12167           0           1  0.62  0.5758  0.83     0.1940       4   
12168          

In [7]:
print(df_test)

       instant      dteday  season  yr  mnth  hr  holiday  weekday  \
14773    14774  2012-09-12       3   1     9  18        0        3   
14774    14775  2012-09-12       3   1     9  19        0        3   
14775    14776  2012-09-12       3   1     9  20        0        3   
14776    14777  2012-09-12       3   1     9  21        0        3   
14777    14778  2012-09-12       3   1     9  22        0        3   
...        ...         ...     ...  ..   ...  ..      ...      ...   
17374    17375  2012-12-31       1   1    12  19        0        1   
17375    17376  2012-12-31       1   1    12  20        0        1   
17376    17377  2012-12-31       1   1    12  21        0        1   
17377    17378  2012-12-31       1   1    12  22        0        1   
17378    17379  2012-12-31       1   1    12  23        0        1   

       workingday  weathersit  temp   atemp   hum  windspeed  casual  \
14773           1           1  0.66  0.6212  0.44     0.2537      91   
14774          

In [8]:
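# save the three splits as csv files in a local `data` subfolder (created if missing)
import os
df_path = "data/"
os.makedirs(df_path, exist_ok=True)

# index=False keeps the row index from being written as an extra unnamed column
df_test.to_csv(f"{df_path}bikes_test.csv", index=False)
df_val.to_csv(f"{df_path}bikes_validation.csv", index=False)
df_train.to_csv(f"{df_path}bikes_train.csv", index=False)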

### Reload data and split into X and y

In [9]:
# reload the saved splits from the `data` subfolder (path relative to the notebook)
bike_train = pd.read_csv("data/bikes_train.csv")
bike_val = pd.read_csv("data/bikes_validation.csv")
bike_test = pd.read_csv("data/bikes_test.csv")

In [23]:
# train
y_train = df_train["propoertion_cas_reg"].values
X_train = df_train.drop(["propoertion_cas_reg", "registered", "casual", "cnt","dteday"], axis=1) # axis=1 drops columns: the outcome, the counts it is derived from, and the date string

In [25]:
# validation
y_val = df_val["propoertion_cas_reg"].values
X_val = df_val.drop(["propoertion_cas_reg", "registered", "casual", "cnt","dteday"], axis=1)

In [27]:
# test
y_test = df_test["propoertion_cas_reg"].values
X_test = df_test.drop(["propoertion_cas_reg", "registered", "casual", "cnt","dteday"], axis=1)
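
The predictors kept here live on very different scales: per the summary statistics above, `instant` runs into the thousands while `temp` lies between 0 and 1. Since KNN relies on distances between rows, one option worth considering is to standardize the predictors, learning the scaling from the training split only. The following is a minimal sketch on simulated arrays (`X_tr_demo` and `X_va_demo` are hypothetical stand-ins, not the bike data).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Simulated predictors: one large-scale column, one column in [0, 1] (assumption)
rng = np.random.default_rng(1)
X_tr_demo = np.column_stack([
    rng.uniform(1, 12000, size=100),
    rng.uniform(0, 1, size=100),
])
X_va_demo = np.column_stack([
    rng.uniform(12000, 15000, size=30),
    rng.uniform(0, 1, size=30),
])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr_demo)  # learn mean and std on training rows only
X_va_scaled = scaler.transform(X_va_demo)      # reuse the training statistics

print(X_tr_scaled.mean(axis=0), X_tr_scaled.std(axis=0))
```

Fitting the scaler on the training split alone avoids leaking information from the validation and test sets into preprocessing.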

### Modeling


In [39]:
performances = []

mean_value = y_train.mean()  # the dummy model predicts the training mean for every observation
model_name = 'dummy'

for y, nsplit in zip([y_train, y_val, y_test],  # pair each outcome array with its split name
                     ['train', 'val', 'test']):
    y_pred = np.full(y.shape[0], mean_value)  # constant prediction, one value per row
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    r2 = r2_score(y, y_pred)
    performances.append({'model': model_name,  # collect one record per split
                         'split': nsplit,
                         'rmse': round(float(rmse), 4),
                         'r2': round(float(r2), 4)})

In [41]:
performances

[{'model': 'dummy', 'split': 'train', 'rmse': 0.1426, 'r2': 0.0},
 {'model': 'dummy', 'split': 'val', 'rmse': 0.1224, 'r2': -0.0753},
 {'model': 'dummy', 'split': 'test', 'rmse': 0.12, 'r2': -0.0734}]
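
The remaining steps of the exercise (fitting KNN and linear regression, storing the fitted models with `pickle`, and scoring them on the training and validation sets) could be sketched roughly as follows. This runs on simulated data, not the bike splits, and the names `X_tr`, `X_va`, `scores`, the `models` output folder and the choice `n_neighbors=5` are all assumptions made for illustration.

```python
import pickle
from pathlib import Path

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.neighbors import KNeighborsRegressor

# Simulated data standing in for the real splits (assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = X @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.1, size=300)
X_tr, y_tr = X[:210], y[:210]   # first 70% as training data
X_va, y_va = X[210:], y[210:]   # remaining rows as validation data

model_dir = Path("models")  # hypothetical output folder
model_dir.mkdir(exist_ok=True)

models = {
    "linear": LinearRegression(),
    "knn_5": KNeighborsRegressor(n_neighbors=5),
}

scores = []
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # store the fitted model under a meaningful name
    with open(model_dir / f"{name}.pkl", "wb") as f:
        pickle.dump(model, f)
    # score the model on the training and validation splits
    for X_s, y_s, split in [(X_tr, y_tr, "train"), (X_va, y_va, "val")]:
        y_hat = model.predict(X_s)
        scores.append({
            "model": name,
            "split": split,
            "rmse": round(float(np.sqrt(mean_squared_error(y_s, y_hat))), 4),
            "r2": round(float(r2_score(y_s, y_hat)), 4),
        })

for row in scores:
    print(row)
```

The records have the same keys as the `performances` list above, so in the real notebook they could be appended to it and plotted together with the dummy baseline.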