# Preprocessing Workflow


This exercise will guide you through the preprocessing workflow. Step by step, feature by feature, you will investigate the dataset and make preprocessing decisions accordingly.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# Loading the dataset
selected_features = ['GrLivArea',
                     'BedroomAbvGr',
                     'KitchenAbvGr', 
                     'OverallCond',
                     'RoofSurface',
                     'GarageFinish',
                     'CentralAir',
                     'ChimneyStyle',
                     'MoSold',
                     'SalePrice']

df = pd.read_csv('data/iowa_housing.csv', usecols=selected_features)
df.head()

Remember, you have a file named "iowa_housing.txt" in the "data" folder with information about the dataset.

## Duplicates

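***Duplicates in datasets can cause data leakage.*** 

If identical rows end up in both the training set and the test set, the model is evaluated on data it has already seen. It is therefore important to locate and remove duplicates.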
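
As a minimal sketch on a toy DataFrame (hypothetical values, not the Iowa data), duplicated rows can be counted with `duplicated()` and removed with `drop_duplicates()`:

```python
import pandas as pd

# Toy data (hypothetical values): the second and third rows are identical.
toy = pd.DataFrame({
    "GrLivArea": [1500, 1200, 1200, 900],
    "SalePrice": [200000, 150000, 150000, 120000],
})

# duplicated() flags each repeat of an earlier row; the first occurrence is not flagged.
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row by default.
deduplicated = toy.drop_duplicates()

print(n_duplicates, len(deduplicated))
```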

❓ **>>>** How many duplicated rows are there in the dataset ?

In [None]:
# Code here!


❓ **>>>** Remove the duplicates from the dataset. Overwrite the dataframe `df`.

In [None]:
# Code here!



## Missing data

❓ **>>>** Print the percentage of missing values for every column of the dataframe.

In [None]:
# Code here!



### `GarageFinish`

❓ **>>>** Investigate the missing values in `GarageFinish`. Then, choose one of the following solutions:

1. Drop the column entirely.
2. Impute the column median using `SimpleImputer` from Scikit-Learn.
3. Preserve the NaNs and replace them with meaningful values.

Modify the Series `GarageFinish` accordingly. (Do not replace the entire DataFrame with your new Series!).

**HINT**: According to the dataset description, the missing values in `GarageFinish` represent a house having no garage. They need to be encoded as such.
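
A minimal sketch of option 3 on a toy column. The category values and the `"NoGarage"` label are hypothetical placeholders, so check the dataset description for the actual codes:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) in which NaN means "no garage".
toy = pd.DataFrame({"GarageFinish": ["Fin", np.nan, "Unf", np.nan, "RFn"]})

# Replace NaN with an explicit category instead of dropping or imputing it.
# Assigning back to the column modifies only that Series, not the whole DataFrame.
toy["GarageFinish"] = toy["GarageFinish"].fillna("NoGarage")

print(toy["GarageFinish"].value_counts())
```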

In [None]:
# Code here!



### `RoofSurface`

❓ **>>>** Investigate the missing values in `RoofSurface`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median using sklearn's `SimpleImputer`
3. Preserve the NaNs and replace them with meaningful values

**HINT**: `RoofSurface` has a few missing values that can be imputed by the median value.
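
For reference, median imputation with `SimpleImputer` could look like the sketch below. It runs on a toy column with made-up values, so the numbers say nothing about the real `RoofSurface` data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy column (hypothetical values) with two missing entries.
toy = pd.DataFrame({"RoofSurface": [1000.0, np.nan, 3000.0, 2000.0, np.nan]})

imputer = SimpleImputer(strategy="median")

# fit_transform expects 2D input (hence the double brackets) and returns a 2D array.
toy["RoofSurface"] = imputer.fit_transform(toy[["RoofSurface"]]).ravel()

# The learned median is stored in statistics_.
print(imputer.statistics_)
```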

In [None]:
# Code here!


### `ChimneyStyle`

❓ **>>>** Investigate the missing values in `ChimneyStyle`. Then, choose one of the following solutions:

1. Drop the column entirely
2. Impute the column median
3. Preserve the NaNs and replace them with meaningful values

**HINT**: Be careful: not all missing values are stored as `np.nan`, and pandas' `isnull()` only detects `NaN`, `None` and `NaT`, not placeholder strings...
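
The sketch below illustrates the problem on a toy column. The `"?"` placeholder and the category names are assumptions made for the example; inspect the real column to find out how missing values are actually written:

```python
import numpy as np
import pandas as pd

# Toy column (hypothetical values) in which "?" stands for a missing value.
toy = pd.DataFrame({"ChimneyStyle": ["bricks", "?", "castiron", "?", "?"]})

# isnull() finds nothing, because "?" is an ordinary string.
n_detected = toy["ChimneyStyle"].isnull().sum()

# value_counts() exposes the suspicious placeholder.
counts = toy["ChimneyStyle"].value_counts()

# Converting the placeholder to NaN makes the missing share measurable.
cleaned = toy["ChimneyStyle"].replace("?", np.nan)
missing_share = cleaned.isnull().mean()

print(n_detected, missing_share)
```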

In [None]:
# Code here!


❓ **>>>** When you are done handling missing values, print the percentage of missing values for the entire dataframe.

You should no longer have any missing values!

In [None]:
# Code here!


## Scaling

**First of all, before scaling...**

To understand the effects of scaling and encoding on model performance, let's use a linear regression and get a **base score without any data transformation**.

❓ **>>>** Cross-validate a linear regression model that predicts `SalePrice` using the other features.

⚠️ Note that a linear regression model can only handle numeric features. [DataFrame.select_dtypes](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.select_dtypes.html) can help.
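
A minimal sketch of this baseline on synthetic data (the column names and the data-generating formula are invented for the example). For regressors, `cross_val_score` reports R² by default:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: two numeric features, one text feature, one target.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "area": rng.normal(1500, 300, n),
    "rooms": rng.integers(1, 6, n),
    "air": rng.choice(["Y", "N"], n),  # non-numeric, must be excluded
})
toy["price"] = 100 * toy["area"] + 5000 * toy["rooms"] + rng.normal(0, 10000, n)

# Keep only the numeric features, then separate them from the target.
X = toy.drop(columns="price").select_dtypes(include="number")
y = toy["price"]

# One R² score per fold; the mean is the base score.
scores = cross_val_score(LinearRegression(), X, y, cv=5)
mean_score = scores.mean()
print(mean_score)
```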

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Code here!



✅ **Expected results** : About ```0.577```.

###  `RoofSurface` 

❓ **>>>** **Question** about `RoofSurface`

Investigate `RoofSurface` for distribution and outliers. You can use a ```.boxplot()``` and/or a ```.plot.hist()```. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original column with the transformed values.
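
To build intuition for the choice, the sketch below applies the three scalers to a small made-up array containing one outlier. The values are not taken from the dataset:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Made-up values: four regular points and one extreme outlier.
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# StandardScaler: subtract the mean, divide by the standard deviation.
standard = StandardScaler().fit_transform(values)

# RobustScaler: subtract the median, divide by the interquartile range.
robust = RobustScaler().fit_transform(values)

# MinMaxScaler: map the minimum to 0 and the maximum to 1.
minmax = MinMaxScaler().fit_transform(values)

print(standard.ravel())
print(robust.ravel())
print(minmax.ravel())
```

With min-max scaling the outlier defines the range, so the four regular points are squeezed close to 0. The robust scaler relies on the median and the interquartile range, which a single extreme value barely affects.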

In [None]:
# Code here!


### `GrLivArea`

❓ **>>>** Investigate `GrLivArea` for distribution and outliers. Then, choose the most appropriate scaling technique. Either:

1. Standard Scaler
2. Robust Scaler
3. MinMax Scaler

Replace the original column with the transformed values.

In [None]:
# Code here!


### `BedroomAbvGr` ,  `OverallCond` & `KitchenAbvGr`

❓ **>>>** Investigate `BedroomAbvGr`, `OverallCond` & `KitchenAbvGr`. Then, choose one of the following scaling techniques:

1. MinMax Scaler
2. Standard Scaler
3. Robust Scaler

Replace the original columns with the transformed values.

In [None]:
# Code here!


## Feature Encoding

### `GarageFinish`

❓ **>>>** Investigate `GarageFinish` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Add the encoding to the dataframe as new column(s), and remove the original column.

In [None]:
# Code here!



### `CentralAir`

❓ **>>>** Investigate `CentralAir` and choose one of the following encoding techniques accordingly:
- Ordinal encoding
- One-Hot encoding

Replace the original column with the newly generated encoded columns.

In [None]:
# Code here!


## Feature Engineering

### `MoSold` - Cyclical engineering 

A feature can be numerical (continuous or discrete), categorical or ordinal. But a feature can also be temporal (e.g. quarters, months, days, minutes, ...). 

Cyclical features like time need some specific preprocessing. Indeed, if you want any Machine Learning algorithm to capture this cyclicity, your cyclical features must be preprocessed in a certain way.

Let's take a look at the feature `MoSold`, the month in which the house was sold.

In [None]:
df["MoSold"].value_counts()

* Many houses were sold in June (6), July (7) and May (5) (Spring/Summer)
* Only a few houses were sold in December (12), January (1) and February (2) (~ Fall/Winter)
    * But to a Machine Learning model, the raw values 12 and 1 are far apart, so nothing tells it that December and January are adjacent months...

***How to deal with cyclical features?***

The short answer is: using $\cos$ and $\sin$!

You can read more about this in this [article](https://www.sefidian.com/2021/03/26/handling-cyclical-features-such-as-hours-in-a-day-for-machine-learning-pipelines-with-python-example/).



**MoSold**
- Let's create two new features `sin_MoSold` and `cos_MoSold`, the sine and cosine of the angle $2\pi \cdot \text{MoSold} / 12$, so that each month becomes a point on the unit circle.
- Drop the original column `MoSold`

In [None]:
months_in_a_year = 12

df['sin_MoSold'] = np.sin(2*np.pi*(df['MoSold']) / months_in_a_year)
df['cos_MoSold'] = np.cos(2*np.pi*(df['MoSold']) / months_in_a_year)
df = df.drop(columns='MoSold')

df.head()
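
To check that this encoding captures the cyclicity, the sketch below (a standalone illustration using the same formula as the cell above) compares distances between months in the (sin, cos) plane. The `distance` helper is introduced only for this example:

```python
import numpy as np

months_in_a_year = 12
months = np.arange(1, 13)

# Each month becomes a point (sin, cos) on the unit circle.
angles = 2 * np.pi * months / months_in_a_year
points = np.column_stack([np.sin(angles), np.cos(angles)])

def distance(month_a, month_b):
    """Euclidean distance between two months in the (sin, cos) plane."""
    return np.linalg.norm(points[month_a - 1] - points[month_b - 1])

dec_jan = distance(12, 1)  # adjacent months across the year boundary
jun_jul = distance(6, 7)   # adjacent months in mid-year
dec_jun = distance(12, 6)  # opposite months

print(dec_jan, jun_jul, dec_jun)
```

December and January end up exactly as close as June and July, while December and June sit on opposite sides of the circle.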

## Export the preprocessed dataset (optional)
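
One possible way to export, sketched on a toy DataFrame and a temporary path (both are assumptions for illustration; substitute your own `df` and a path of your choice):

```python
import os
import tempfile

import pandas as pd

# Toy stand-in for the preprocessed dataframe (hypothetical values).
toy = pd.DataFrame({"GrLivArea": [0.5, -0.2], "SalePrice": [200000, 150000]})

# Temporary location used only so that this example is self-contained.
export_path = os.path.join(tempfile.mkdtemp(), "preprocessed.csv")

# index=False avoids writing the row index as an extra column.
toy.to_csv(export_path, index=False)

# Reload to check that the export round-trips.
reloaded = pd.read_csv(export_path)
print(reloaded.head())
```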

# Feature Selection

In [None]:
# Let's all start with the same dataset!
df = pd.read_csv("data/iowa_preprocessed.csv") # already created file
df.head()

## Correlation investigation

❓ **>>>** Plot a heatmap of the ***Pearson Correlation*** between the columns of the dataset.

In [None]:
import seaborn as sns
# Code here!


❓ **>>>** Visualize the correlation between column pairs in a dataframe. You can use ```.stack()``` and then filter them.
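
A sketch of the `.stack()` approach on synthetic data. The columns `a`, `b`, `c` and the names `feature_1`, `feature_2`, `correlation` are invented for the example:

```python
import numpy as np
import pandas as pd

# Synthetic data: "b" is almost a multiple of "a", "c" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=100)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.01, size=100),
    "c": rng.normal(size=100),
})

# corr() computes the Pearson correlation by default.
corr = toy.corr()

# stack() turns the square matrix into one row per (feature, feature) pair.
pairs = corr.stack().reset_index()
pairs.columns = ["feature_1", "feature_2", "correlation"]

# Keep each unordered pair once and drop the self-correlations.
pairs = pairs[pairs["feature_1"] < pairs["feature_2"]]

# Sort by absolute correlation, strongest first.
pairs = pairs.sort_values("correlation", key=lambda s: s.abs(), ascending=False)

# Pairs whose correlation is above 0.9 or below -0.9.
strong = pairs[pairs["correlation"].abs() > 0.9]
print(strong)
```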

In [None]:
# Code here!


❓ **>>>** How many pairs of features have a correlation above 0.9 or below -0.9?

In [None]:
# Code here!



## Base Modelling

❓ **>>>** Prepare the feature set `X` and target `y`.

*Remember that we want to model the `SalePrice` with the preprocessed features.*

In [None]:
# Code here!


❓ **>>>** Cross-validate a Linear Regression model.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Code here!


✅ **Expected results** : About ```0.645``` for a ```cv=10```.

## Feature Permutation

❓ **>>>** Perform a feature permutation and rank the features by order of importance. To display the importance scores next to the feature names, you can use functions such as ```np.vstack``` or ```np.ndarray.reshape(1, -1)``` or ```.T``` (transposed).
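
A sketch of the technique on synthetic data (the feature names and the data-generating formula are made up). `permutation_importance` shuffles one feature at a time and measures how much the model's score drops; the mean drop per feature is available as `importances_mean`:

```python
import numpy as np
import pandas as pd
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

# Synthetic data: the target depends strongly on one feature,
# weakly on another, and not at all on the third.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = 10 * X["strong"] + 2 * X["weak"] + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Pair each feature with its mean score decrease and rank them.
importance = (
    pd.DataFrame({"feature": X.columns, "importance": result.importances_mean})
    .sort_values("importance", ascending=False)
    .reset_index(drop=True)
)
print(importance)
```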

In [None]:
from sklearn.inspection import permutation_importance
# Code here!


❓ **>>>** Which feature is the most important ?

In [None]:
# Write your answer (or code here) !



## Modelling with less complexity

❓ **>>>** Drop the weak features and cross-validate a new model.
You should aim at maintaining a score close to the previous one. 

**Hint**: You can try dropping features one by one, starting from the ones with the lowest importance, until your model score starts dropping significantly.
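
The loop described in the hint could be sketched as follows. The data is synthetic and the ranking in `weakest_first` is simply assumed here; in the exercise it would come from your feature permutation results:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: two informative features and two pure-noise features.
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "medium": rng.normal(size=n),
    "noise_1": rng.normal(size=n),
    "noise_2": rng.normal(size=n),
})
y = 3 * X["strong"] + 2 * X["medium"] + rng.normal(size=n)

# Assumed ranking from a previous permutation step, weakest feature first.
weakest_first = ["noise_2", "noise_1", "medium"]

# Reference score with all features.
baseline = cross_val_score(LinearRegression(), X, y, cv=10).mean()

# Drop one more feature at each step and record the cross-validated score.
kept = list(X.columns)
scores_after_drop = {}
for feature in weakest_first:
    kept.remove(feature)
    scores_after_drop[feature] = cross_val_score(
        LinearRegression(), X[kept], y, cv=10
    ).mean()

print(baseline, scores_after_drop)
```

In this toy setup the score barely moves while the noise features are removed and falls clearly once an informative feature is dropped, which is the signal to stop.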

In [None]:
# Code here!


✅ **Expected results** : About ```0.637``` for a ```cv=10```