# <a id='toc1_'></a>[Feature engineering - I](#toc0_)

**Table of contents**<a id='toc0_'></a>    
- [Feature engineering - I](#toc1_)    
  - [Formal transformations](#toc1_1_)    
    - [One-hot encoding](#toc1_1_1_)    
    - [Label Encoding](#toc1_1_2_)    
    - [Binning / Discretization](#toc1_1_3_)    
      - [`cut`: range-based, different sized bins](#toc1_1_3_1_)    
      - [`qcut`: quantile-based, equally-sized bins](#toc1_1_3_2_)    
- [Modelling](#toc2_)    
  - [Train-test split](#toc2_1_)    
  - [Training & evaluation](#toc2_2_)    
- [Feature engineering - II](#toc3_)    
  - [Formal transformations](#toc3_1_)    
    - [Normalization](#toc3_1_1_)    
    - [Correlation Thresholds](#toc3_1_2_)    
  - [Semantic transformations](#toc3_2_)    
- [Acknowledgements](#toc4_)    

<!-- vscode-jupyter-toc-config
	numbering=false
	anchor=true
	flat=false
	minLevel=1
	maxLevel=6
	/vscode-jupyter-toc-config -->
<!-- THIS CELL WILL BE REPLACED ON TOC UPDATE. DO NOT WRITE YOUR TEXT IN THIS CELL -->

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

> The following dataset emulates the combined information from a company's HR files and medical exams (it is not real data!). Our goal is to approximate salaries from this information. We have chosen `KNN regression`, a distance-based model.

In [None]:
salary = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/salaries.csv')
salary.head()

## <a id='toc1_1_'></a>[Formal transformations](#toc0_)

In [None]:
salary['Daltonic'].value_counts(dropna=False)

In [None]:
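# assign the result back: calling fillna(inplace=True) on a selected column may not update the DataFrame in recent pandas versions
salary['Daltonic'] = salary['Daltonic'].fillna('No Daltonism')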

### <a id='toc1_1_1_'></a>[One-hot encoding](#toc0_)

Besides `pd.get_dummies`, you can use `sklearn.preprocessing.OneHotEncoder`.

In [None]:
dalt_transformed = pd.get_dummies(salary['Daltonic'], drop_first=True)
dalt_transformed.head()

In [None]:
# dalt_transformed[np.NaN].value_counts()

In [None]:
salary_transformed = pd.merge(left = salary,
                              right = pd.get_dummies(salary['Daltonic'],prefix='Daltonic',drop_first=True),
                              left_index=True,
                              right_index=True)
salary_transformed

### <a id='toc1_1_2_'></a>[Label Encoding](#toc0_)

You can do this by mapping the values directly or by using `sklearn.preprocessing.LabelEncoder`:

In [None]:
salary_transformed['Experience_label'] = salary_transformed['Experience'].replace({'Junior':0,'Senior':1})
salary_transformed['Gender_label'] = salary_transformed['Gender'].replace({'Male':0,'Female':1})
salary_transformed.head()
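
As a rough sketch of the `LabelEncoder` alternative mentioned above, using a made-up toy series rather than the real column: the encoder assigns integers following the sorted order of the classes, so you do not choose the mapping yourself. Note that scikit-learn intends `LabelEncoder` for target values; for input features, `sklearn.preprocessing.OrdinalEncoder` plays the same role.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy column, for illustration only
toy_experience = pd.Series(['Junior', 'Senior', 'Senior', 'Junior'])

label_encoder = LabelEncoder()
toy_encoded = label_encoder.fit_transform(toy_experience)

# classes_ holds the sorted categories; a category's position is its integer code
print(list(label_encoder.classes_))
print(list(toy_encoded))
```

With explicit mapping you control which category gets which number, which matters when the categories have a natural order.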

### <a id='toc1_1_3_'></a>[Binning / Discretization](#toc0_)

Binning is used to turn numeric features into categorical ones. In this case we're not going to use categorical features, but for the record:

#### <a id='toc1_1_3_1_'></a>[`cut`: range-based, different sized bins](#toc0_)

In [None]:
# Binning: 
series = pd.cut(salary['Height'], 5, labels=['very short','short','average','tall','very tall'])
display(series.value_counts())

In [None]:
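# Check the first bin by hand: `cut` splits the range into equal-width intervals
height_diff = salary['Height'].max() - salary['Height'].min()
no_bins = 5
first_bin_edge = salary['Height'].min() + height_diff / no_bins
# bins are right-inclusive, so count the heights up to and including the first edge
(salary['Height'] <= first_bin_edge).sum()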

#### <a id='toc1_1_3_2_'></a>[`qcut`: quantile-based, equally-sized bins](#toc0_)

In [None]:
# Binning: 
pd.qcut(salary['Height'], 5, labels=['very short','short','average','tall','very tall']).value_counts()

In [None]:
salary_transformed['Height_classes'] = pd.cut(salary['Height'],5,labels=['very short','short','average','tall','very tall'])
salary_transformed.head()
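
To make the difference between the two functions concrete, here is a small sketch on invented heights (not the dataset's values): `cut` splits the value range into intervals of equal width, so the counts per bin can differ, while `qcut` picks the edges from quantiles, so each bin holds roughly the same number of rows.

```python
import pandas as pd

# Hypothetical heights, skewed towards the lower end
toy_heights = pd.Series([150, 152, 155, 158, 160, 165, 170, 180, 190, 200])

# Equal-width bins: the range 150-200 is split at 175
cut_counts = pd.cut(toy_heights, 2).value_counts().sort_index().tolist()

# Equal-frequency bins: the split falls at the median (162.5)
qcut_counts = pd.qcut(toy_heights, 2).value_counts().sort_index().tolist()

print(cut_counts)
print(qcut_counts)
```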

# <a id='toc2_'></a>[Modelling](#toc0_)

> We can now drop the non-numeric columns and keep only the numeric ones:

In [None]:
salary_transformed = salary_transformed.drop(columns=['Experience','Gender','Daltonic','Height_classes', 'Company'])
salary_transformed.head()

## <a id='toc2_1_'></a>[Train-test split](#toc0_)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(salary_transformed.drop(columns=['Salary']), salary_transformed['Salary'], random_state=0)
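
The call above does not set a split size, so `train_test_split` falls back to its default of holding out 25% of the rows for testing. A quick sketch on invented data (100 rows, two arbitrary features) shows the resulting sizes:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 samples with 2 features each
toy_X = np.arange(200).reshape(100, 2)
toy_y = np.arange(100)

# random_state makes the shuffle reproducible, as in the cell above
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, random_state=0)

print(len(toy_X_train), len(toy_X_test))
```

Pass `test_size=` explicitly if you want a different proportion.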

## <a id='toc2_2_'></a>[Training & evaluation](#toc0_)

We evaluate our model using the `RMSE` (root mean squared error), the square root of the `MSE` (mean squared error): $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(Salary_{real,i} - Salary_{predicted,i})^2}$
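
As a worked example with invented numbers, the sketch below computes the mean squared error by hand and with scikit-learn, then takes the square root so the error is expressed in the same units as the salary.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical real and predicted salaries
toy_real = np.array([3000.0, 4000.0, 5000.0])
toy_predicted = np.array([2500.0, 4000.0, 5600.0])

# By hand: square each error, then average
manual_mse = np.mean((toy_real - toy_predicted) ** 2)

# Same quantity from scikit-learn
sklearn_mse = mean_squared_error(toy_real, toy_predicted)

# The square root brings the error back to salary units
toy_rmse = np.sqrt(sklearn_mse)
print(manual_mse, sklearn_mse, toy_rmse)
```

The errors here are 500, 0 and -600, so the MSE is (250000 + 0 + 360000) / 3.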


In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# create the KNN regressor; n_neighbors is a hyperparameter we have to choose
knn = KNeighborsRegressor(n_neighbors=3)

Fit our model and predict on the test set:

In [None]:
knn.fit(X_train, y_train)
pred = knn.predict(X_test)
pred
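
To see what `predict` does under the hood, here is a small sketch on invented one-dimensional data: with the default uniform weights, the prediction for a point is simply the average target of its `n_neighbors` closest training points.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: one feature, one target
toy_X = np.array([[1.0], [2.0], [3.0], [10.0]])
toy_y = np.array([10.0, 20.0, 30.0, 100.0])

toy_knn = KNeighborsRegressor(n_neighbors=3).fit(toy_X, toy_y)
query = np.array([[2.0]])
toy_pred = toy_knn.predict(query)

# Reproduce the prediction manually from the three nearest neighbours
distances, indices = toy_knn.kneighbors(query)
manual_pred = toy_y[indices[0]].mean()

print(toy_pred[0], manual_pred)
```

The three nearest neighbours of 2.0 are 1.0, 2.0 and 3.0, so the far-away point at 10.0 has no influence on this prediction.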

In [None]:
np.array(y_test)

In [None]:
np.sqrt(mean_squared_error(y_test,pred))

# <a id='toc3_'></a>[Feature engineering - II](#toc0_)

## <a id='toc3_1_'></a>[Formal transformations](#toc0_)

### <a id='toc3_1_1_'></a>[Normalization](#toc0_)

> `KNN` is distance-based, so features with a larger numeric range dominate the distance. We normalize the features since `flexibility` seems to count 200 times more than `Daltonic_None`:

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create normalization object from scikit learn package, and "fit" it to the features in hand
normalizer = MinMaxScaler()
normalizer = normalizer.fit(X_train)

> 💡 **Notice how we only use the `X_train` data to fit?**

> We want to **use only the training data to normalize** (that is, to establish the maximum and minimum values) to avoid `data leakage` from the test dataset. If the scaler were fitted on the test data as well, information from the test set would leak into training and the test results would be biased.

Note: We can use other scalers too, such as `sklearn.preprocessing.StandardScaler` or `sklearn.preprocessing.RobustScaler`.
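
A minimal sketch of the fit-on-train, transform-both pattern, using invented numbers: `MinMaxScaler` computes `(x - min) / (max - min)` per column with the minimum and maximum learned from the training data, so test values outside the training range can land outside `[0, 1]`.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: two features on very different scales
toy_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
toy_test = np.array([[4.0, 200.0]])

# Learn min and max from the training data only
toy_scaler = MinMaxScaler().fit(toy_train)

toy_train_scaled = toy_scaler.transform(toy_train)
toy_test_scaled = toy_scaler.transform(toy_test)

print(toy_train_scaled)
print(toy_test_scaled)
```

The first test feature scales to (4 - 1) / (3 - 1) = 1.5, which is expected: the scaler never saw a value that large during fitting.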

In [None]:
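# now that we have our normalizer we use it for both training and testing (and in the future for unseen data as well!)
X_train_normalized = normalizer.transform(X_train)
X_train_normalized = pd.DataFrame(X_train_normalized,columns=X_train.columns)
# the test set is transformed with the min/max learned from the training set
X_test_normalized = normalizer.transform(X_test)
X_test_normalized = pd.DataFrame(X_test_normalized,columns=X_test.columns)
X_train_normalized.head()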

In [None]:
# let's see if this normalization improves our model
# creating model
knn = KNeighborsRegressor(n_neighbors=3)
# training the model on normalized data
knn.fit(X_train_normalized, y_train)
# testing algorithm on normalized test
pred = knn.predict(X_test_normalized)

np.sqrt(mean_squared_error(y_test,pred))
#much better!

### <a id='toc3_1_2_'></a>[Correlation Thresholds](#toc0_)

> Let's see if our variables are too dependent:

In [None]:
X_train_normalized.corr()

> As you've seen before, a very common way to visualize these correlations is a heatmap of the correlation matrix. Only the lower triangle is shown, because the matrix is symmetric and the upper triangle would repeat the same values:

In [None]:
import matplotlib.pyplot as plt
import seaborn as sn

corr=np.abs(X_train_normalized.corr())

#Set up mask for triangle representation
mask = np.zeros_like(corr, dtype=bool)  # np.bool was removed from NumPy; the builtin bool works
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(14, 14))
# Generate a custom diverging colormap
cmap = sn.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask, the colormap and the correct aspect ratio
sn.heatmap(corr, mask=mask, cmap=cmap, vmax=1, square=True, linewidths=.5, cbar_kws={"shrink": .5}, annot=True)

plt.show()

In [None]:
X_train_normalized.head()

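> It is very clear that all variables are essentially the same, except for experience! What is the effect of this? Below we keep `Gender_label` as a representative of the correlated group, together with `Experience_label`: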

In [None]:
X_train_reduced = X_train_normalized[['Gender_label','Experience_label']]
X_test_reduced = X_test_normalized[['Gender_label','Experience_label']]
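
The manual selection above can also be automated with a correlation threshold. The sketch below is one possible approach, not the method used in this lesson: the helper `drop_correlated` and the `0.9` threshold are assumptions for illustration. It scans the upper triangle of the absolute correlation matrix and drops every column that is too strongly correlated with an earlier one.

```python
import numpy as np
import pandas as pd

def drop_correlated(df, threshold=0.9):
    """Hypothetical helper: drop columns correlated above `threshold` with an earlier column."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop

# Invented toy data: 'b' is almost a multiple of 'a', 'c' is unrelated
toy = pd.DataFrame({'a': [1, 2, 3, 4, 5],
                    'b': [2, 4, 6, 8, 10.5],
                    'c': [5, 1, 4, 2, 3]})

toy_reduced, toy_dropped = drop_correlated(toy)
print(toy_dropped)
print(list(toy_reduced.columns))
```

As with the scaler, the columns to drop should be decided on the training data and then applied unchanged to the test data.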

In [None]:
# creating our knn model
knn = KNeighborsRegressor(n_neighbors=3)
# training the model on reduced, normalized data
knn.fit(X_train_reduced, y_train)
# testing algorithm on reduced, normalized test
pred = knn.predict(X_test_reduced)

np.sqrt(mean_squared_error(y_test,pred))

In [None]:
# error relative to the median salary, in percent (2945 is assumed to be the RMSE obtained above)
2945 * 100 / salary['Salary'].median()

## <a id='toc3_2_'></a>[Semantic transformations](#toc0_)

In [None]:
# we want to understand what drives the loss of energy in our wind farms
energy = pd.read_csv('https://raw.githubusercontent.com/sabinagio/data-analytics/main/data/energy_loss.csv')
energy.head()

In [None]:
# let's try to predict it "raw"
X = energy[['Voltage','Rotation','Stability']]
y = energy['Loss']

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X, y)
reg.score(X, y)  # R² on the same data used for fitting

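> However, an engineer has told us the optimal values of `Voltage`, `Rotation` and `Stability` (100, 150 and 90, as used below). Assuming the loss grows as each variable moves away from its optimum in either direction, we replace each feature with its squared distance from the optimum: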

In [None]:
energy_transformed = energy.copy()
energy_transformed['Voltage'] = np.square(energy_transformed['Voltage']-100)
energy_transformed['Rotation'] = np.square(energy_transformed['Rotation']-150)
energy_transformed['Stability'] = np.square(energy_transformed['Stability']-90)
X = energy_transformed[['Voltage','Rotation','Stability']]
y = energy_transformed['Loss']

In [None]:
X

In [None]:
# the model improves dramatically
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X, y)
reg.score(X, y)
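
Why the transformation helps can be sketched on synthetic data. Everything below is invented for illustration (the quadratic relationship, the noise level and the sample size are assumptions), and only the optimum of 100 mirrors the `Voltage` value used above. When the loss depends on the distance from an optimum, a straight line through the raw feature explains almost nothing, while the squared deviation is linearly related to the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic voltage readings spread around an assumed optimum of 100
toy_voltage = rng.uniform(80, 120, size=200)
# Assumed relationship: loss grows with the squared distance from the optimum, plus noise
toy_loss = 0.5 * (toy_voltage - 100) ** 2 + rng.normal(0, 5, size=200)

# Linear regression on the raw feature
raw_features = toy_voltage.reshape(-1, 1)
r2_raw = LinearRegression().fit(raw_features, toy_loss).score(raw_features, toy_loss)

# Linear regression on the squared deviation from the optimum
squared_dev = np.square(toy_voltage - 100).reshape(-1, 1)
r2_transformed = LinearRegression().fit(squared_dev, toy_loss).score(squared_dev, toy_loss)

print(r2_raw, r2_transformed)
```

The model is the same in both cases; only the representation of the feature changes, which is the point of a semantic transformation.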

# <a id='toc4_'></a>[Acknowledgements](#toc0_)

Thank you, David Henriques, for your awesome lesson structure and content.