# Feature engineering

Feature engineering is done to:
- improve a model's performance
- reduce computational or data needs
- improve interpretability of the results 

## Mutual information

Mutual information is a lot like correlation in that it measures the relationship between two quantities. The advantage of mutual information is that it can detect any kind of relationship, while correlation only detects linear relationships. 
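As a rough illustration of that difference, the sketch below compares the Pearson correlation with an MI estimate on synthetic data where `y` depends on `x` only through `x ** 2`. The data, sample size, and seeds are assumptions made for this example, and scikit-learn's `mutual_info_regression` returns a nearest-neighbor estimate, so the exact numbers will vary.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Synthetic data (an assumption for illustration): y is a deterministic,
# non-linear function of x that is symmetric around zero.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=2000)
y = x ** 2

# Pearson correlation only picks up the linear part of the relationship.
pearson_r = np.corrcoef(x, y)[0, 1]

# mutual_info_regression expects a 2-D feature matrix and returns one
# estimated score (in nats) per feature column.
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

print(f"Pearson correlation: {pearson_r:.3f}")
print(f"Estimated MI:        {mi:.3f}")
```

The correlation should come out close to zero while the MI estimate is clearly positive, because knowing `x` removes all uncertainty about `y` even though the relationship is not linear.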

Mutual information is a great general-purpose metric and especially useful at the start of feature development when you might not know what model you'd like to use yet. It is:

- easy to interpret,
- computationally efficient,
- theoretically well-founded,
- resistant to overfitting, and
- able to detect any kind of relationship.

### What it measures

Mutual information (MI) describes relationships in terms of **uncertainty**. The MI between two quantities is a measure of the extent to which knowledge of one quantity reduces uncertainty about the other. If you knew the values of a feature, how much more confident would you be about the target? In this context, **uncertainty** is measured using **entropy**. 

Let $(X, Y)$ be a pair of random variables with range $R_{XY} = R_X \times R_Y$. If their joint distribution is $P_{X,Y}$ and their marginal distributions are $P_X$ and $P_Y$, the mutual information is defined as
\begin{align*}
    I(X; Y) = D_{KL}(P_{X, Y} || P_X \otimes P_Y)
\end{align*}
where $D_{KL}$ is the Kullback-Leibler divergence and $P_X \otimes P_Y$ is the outer product distribution, which assigns probability $P_X(x) \cdot P_Y(y)$ to each $(x, y)$. If $X$ and $Y$ are independent, then $P_{X, Y}(x, y) = P_X(x) \cdot P_Y(y)$ and $I(X;Y) = 0$. At the other extreme, if $X$ is a deterministic function of $Y$ and $Y$ is a deterministic function of $X$, then all of the information conveyed by $X$ is shared with $Y$: knowing $X$ determines the value of $Y$ and vice versa. As a result, the mutual information is the same as the uncertainty in $Y$ (or $X$) alone, namely the **entropy** of $Y$ (or $X$). Mutual information is:

- non-negative, i.e., $I(X; Y) \geq 0$, and
- symmetric, i.e., $I(X;Y) = I(Y;X)$.
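For discrete variables, the definition above can be evaluated directly from a joint probability table. The following is a minimal sketch, assuming natural logarithms (so results are in nats) and the usual convention that terms with zero probability contribute nothing; the helper names `mutual_information` and `entropy` and the example tables are made up for this illustration.

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats for a joint probability table (rows: x, columns: y)."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1, keepdims=True)   # marginal of X, shape (n_x, 1)
    p_y = joint.sum(axis=0, keepdims=True)   # marginal of Y, shape (1, n_y)
    product = p_x * p_y                      # P_X(x) * P_Y(y) for every cell
    mask = joint > 0                         # skip cells with zero probability
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

def entropy(p):
    """Shannon entropy in nats of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Independent variables: the joint table equals the product of the marginals.
independent = np.outer([0.5, 0.5], [0.25, 0.75])

# X and Y determine each other: all probability mass sits on the diagonal.
bijection = np.diag([0.2, 0.3, 0.5])

# A generic dependent pair, used to check symmetry and non-negativity.
generic = np.array([[0.3, 0.1],
                    [0.1, 0.5]])

print(mutual_information(independent))                      # independence case
print(mutual_information(bijection), entropy([0.2, 0.3, 0.5]))
print(mutual_information(generic), mutual_information(generic.T))
```

The first value should be zero up to floating-point error, the second pair should agree with each other (MI equals the entropy), and the last pair should be equal and non-negative.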

### Interpreting MI scores

As already mentioned, the least possible MI score is 0.0, for the case of independent random variables (i.e., knowing one tells you nothing about the other). In theory, there is no upper bound to what an MI score can be. In practice, values above 2 or so are uncommon. Some things to remember:

- MI can help with understanding the relative potential of a feature as a predictor for a target, considered by itself. 
- It's possible for a feature to be very informative when interacting with other features, but not so informative all alone. MI can't detect interactions between features. It is a univariate metric. 
- The actual usefulness of a feature depends on the model you use it with. **A feature is only useful to the extent that its relationship with the target is one that the model can learn**. Just because the feature has a high MI score, it doesn't mean that the model will be able to do anything with that information. You may need to transform the feature first to expose the association. 
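The point about interactions can be seen on a small synthetic example. In this sketch, two independent binary features `a` and `b` and a target equal to their exclusive-or are all assumptions chosen for illustration; `mutual_info_score` from scikit-learn computes MI between two discrete labelings in nats.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

# Synthetic data (an assumption for illustration): two independent coin flips
# and a target that depends only on their combination (exclusive-or).
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=5000)
b = rng.integers(0, 2, size=5000)
target = a ^ b

# Scored one at a time, neither feature says anything about the target.
mi_a = mutual_info_score(a, target)
mi_b = mutual_info_score(b, target)

# Encoding the pair (a, b) as a single feature exposes the interaction.
combined = 2 * a + b
mi_ab = mutual_info_score(combined, target)

print(f"MI(a; target)      = {mi_a:.4f}")
print(f"MI(b; target)      = {mi_b:.4f}")
print(f"MI((a, b); target) = {mi_ab:.4f}")
```

Each feature scores close to zero on its own, while the combined feature scores close to the entropy of the target (about $\ln 2$ nats for a balanced binary target). A univariate MI ranking would therefore discard both features even though together they determine the target.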

## Inventing new features

## Target encoding for high cardinality features

## K-means clustering for segmentation features

## PCA for decomposition of dataset's variation

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import mutual_info_regression
import matplotlib.pyplot as plt
```

```python
# Load the training data, using the Id column as the index
X_full = pd.read_csv('./data/housing_prices_competition/train.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Hold out 20% of the rows for validation
X_train, X_valid, y_train, y_valid = train_test_split(X_full, y,
                                                      train_size=0.8,
                                                      test_size=0.2,
                                                      random_state=0)
```

```python
X_train.head()
```

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 619 | 20 | RL | 90.0 | 11694 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 260 | 0 | NaN | NaN | NaN | 0 | 7 | 2007 | New | Partial |
| 871 | 20 | RL | 60.0 | 6600 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2009 | WD | Normal |
| 93 | 30 | RL | 80.0 | 13360 | Pave | Grvl | IR1 | HLS | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 8 | 2009 | WD | Normal |
| 818 | 20 | RL | NaN | 13265 | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | ... | 0 | 0 | NaN | NaN | NaN | 0 | 7 | 2008 | WD | Normal |
| 303 | 20 | RL | 118.0 | 13704 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 1 | 2006 | WD | Normal |
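One possible way to rank the columns of a mixed-type table like this one by MI with the target is sketched below. The helper `make_mi_scores` is hypothetical, and its preprocessing choices are assumptions rather than a recommended recipe: non-numeric columns are integer-coded with `factorize`, missing numeric values are filled with the column median, and integer-typed columns are treated as discrete. The small `demo` frame is synthetic and only there to make the sketch runnable on its own.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

def make_mi_scores(X, y, random_state=0):
    """Hypothetical helper: estimated MI of each column of X with y, sorted."""
    X = X.copy()
    # Integer-code non-numeric columns; factorize maps missing values to -1.
    for col in X.select_dtypes(exclude="number").columns:
        X[col], _ = X[col].factorize()
    # Assumption: median imputation is acceptable for the remaining gaps.
    X = X.fillna(X.median())
    # Heuristic: treat integer-typed columns as discrete features.
    discrete = np.array([pd.api.types.is_integer_dtype(dtype) for dtype in X.dtypes])
    scores = mutual_info_regression(
        X, y, discrete_features=discrete, random_state=random_state
    )
    return pd.Series(scores, index=X.columns, name="MI Scores").sort_values(
        ascending=False
    )

# Synthetic stand-in data (not the housing data): only "area" drives the price.
rng = np.random.default_rng(0)
n = 500
demo = pd.DataFrame({
    "area": rng.uniform(50, 250, n),
    "quality": rng.choice(["Fa", "TA", "Gd"], n),
    "noise": rng.normal(size=n),
})
price = 1000 * demo["area"] + rng.normal(0, 5000, n)

scores = make_mi_scores(demo, price)
print(scores)
```

With the housing data loaded as above, the same helper could be called as `make_mi_scores(X_train, y_train)`. Treating every integer column as discrete is a heuristic, and the resulting scores are estimates that depend on the encoding and imputation choices.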
