# About this Notebook
Hey all,
my goal is to write a **compact guide** on feature engineering.
**I will add new sections to this notebook whenever I find enough time to work on it**, which might take a while since I am currently attending many courses at university.


<div class="alert alert-danger" role="alert">
    <h3>Feel free to <span style="color:red">comment</span> if you have any suggestions   |   motivate me with an <span style="color:red">upvote</span> if you like this project.</h3>
</div>


<h1 style="background-color:DodgerBlue; color:white" >-> Topics:</h1>

## 1. [Motivation and General Advice](#sec1)
#### 1.1. [Feature-Target Relations and Monotony](#sec11)
#### 1.2. [Pearson Correlation and Collinearity](#sec12)

## 2. [Univariate Transformations on Numerical Data](#sec2)
#### 2.1. [Scaling, Centering and Standardization](#sec21)
#### 2.2. [Log Transformation](#sec22)
#### 2.3. [Box-Cox Power Transformation](#sec23)
#### 2.4. [Logit Transformation](#sec24)
#### 2.5. [Binning with Decision Trees](#sec25) 

## 3. [Encode Categorical Data](#sec3)
#### 3.1. [Label Encoding](#sec31)
#### 3.2. [One-Hot Encoding](#sec32)
#### 3.3. [Target-Mean Encoding](#sec33)


## 4. [Combine interacting Features](#sec-2)
#### 4.1. [Combine Features using Equations](#sec-21)
#### 4.2. [Combine Features using Groupby](#sec-22)

## 5. [Statistics Vocabulary and Plots](#sec-1)
#### 5.1. [Distribution Tails](#sec51)
#### 5.2. [The Quantile-Quantile Plot (qqplot)](#sec52)

## 6. [Further Readings & Helpful Videos](#sec6)

Some initial imports..

In [None]:
import numpy as np 
import pandas as pd 
from scipy import stats
import pylab 
import matplotlib.pyplot as plt

df_heart = pd.read_csv('/kaggle/input/heart-disease-uci/heart.csv')
df_heart.columns
trestbps = df_heart['trestbps']
chol = df_heart['chol']
target_heart = df_heart['target']
df_health = pd.read_csv('/kaggle/input/health-insurance-cross-sell-prediction/train.csv')
age = df_health['Age']
target_health = df_health['Response']


<a id="sec1"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 1. Motivation and General Advice</h1>

Reworking features to uncover **key relationships** between features and outcome is called **Feature Engineering**. Some domain knowledge helps a lot in understanding the data.

Feature Engineering relies on the resulting insights of [EDA](https://en.wikipedia.org/wiki/Exploratory_data_analysis).
The combination of Feature Engineering and EDA occurs in different phases of the whole modeling process, e.g. during post-modeling, based on Residual Analysis. **Residual Analysis** is the process of analysing which feature values lead to false predictions.


### Key relationships may be between the outcome and
* a transformation of a feature
* a product or ratio of multiple features 
* a functional relationship between features
* a different representation of a feature


### It helps us to obtain a good trade-off between:
* accuracy
* simplicity
* robustness

### A good Mindset for Feature Engineering leads to:
* Simplifying relationships with the target to either **binary flags** or **monotonic functions, linear if possible**.
* Treating each transformation as one model in an Ensemble (just like in [Pipelining](https://www.kaggle.com/milankalkenings/no-pipelines-you-are-probably-doing-it-wrong))

<a id="sec11"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 1.1. Feature-Target Relations and Monotony</h1>

As mentioned above, it is best practice to create features which have a **monotonic relationship** with the target. This is due to:

* The relationship is easy to interpret for Data Analysts
* Machine Learning algorithms might converge faster
* Machine Learning algorithms in most cases provide better predictions with features like these




But what exactly are these monotonic relationships? Let me give you an example:
Imagine having data about some employees of your department, and you want to find out the relationship between years of employment (*years*) and the *salary* of the employees. You want to predict your income over the next years (which means that your target variable is the salary).

In [None]:
years = [1, 3, 8]
salary = [8, 11, 14]
values = list(zip(years, salary))
names = ['years', 'salary']
df = pd.DataFrame(values, columns=names)
df


We can now plot the data to take a look at the relationship between the feature *years* and the target variable *salary*:

In [None]:
df.plot.line('years', 'salary')
plt.show()

This relationship is called **(strictly) monotonic**, because the higher the value of *years*, the higher the value of *salary*. Note: this function of the input variable *years* and the output *salary* is said to be (strictly) monotonically increasing, but (strictly) monotonically decreasing functions are also considered **(strictly) monotonic**.

So far so good. Let's assume you collected some more data and the resulting relationship looks like this:

In [None]:
years = [1, 3, 8, 10, 11]
salary = [8, 11, 14, 14, 15]
values = list(zip(years, salary))
names = ['years', 'salary']
df = pd.DataFrame(values, columns=names)
df.plot.line('years', 'salary')
plt.show()

This relationship is still called **monotonic**, despite the fact that the *salary* is the same for *years* = 8 **and** *years* = 10. The relationship is just not called **strictly monotonic** anymore. The same holds for monotonically decreasing functions.

Let's consider you collected even more data and this is the resulting relationship:

In [None]:
years = [1, 3, 4, 8, 10, 11]
salary = [8, 11, 9, 14, 14, 15]
values = list(zip(years, salary))
names = ['years', 'salary']
df = pd.DataFrame(values, columns=names)
df.plot.line('years', 'salary')
plt.show()

The result is a **non-monotonic** function, since the *salary* at *years* = 4 is lower than the *salary* at *years* = 3, **and** lower than the *salary* at *years* = 8. 
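
The three cases above can also be told apart mechanically by looking at the differences between consecutive target values once the data is sorted by the feature. The following is a minimal sketch; the helper name `classify_monotony` is made up for this illustration and is not part of any library.

```python
import numpy as np

def classify_monotony(feature, target):
    """Classify a feature-target relationship like the three plots above."""
    # sort the target values by the feature values
    order = np.argsort(feature, kind="stable")
    steps = np.diff(np.asarray(target)[order])
    # every step goes strictly up (or strictly down)
    if np.all(steps > 0) or np.all(steps < 0):
        return "strictly monotonic"
    # steps never change direction, but plateaus are allowed
    if np.all(steps >= 0) or np.all(steps <= 0):
        return "monotonic"
    return "non-monotonic"

print(classify_monotony([1, 3, 8], [8, 11, 14]))
print(classify_monotony([1, 3, 8, 10, 11], [8, 11, 14, 14, 15]))
print(classify_monotony([1, 3, 4, 8, 10, 11], [8, 11, 9, 14, 14, 15]))
```

The sketch assumes unique feature values; ties in the feature would need an extra rule.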

In reality, most relationships between your features and your target will not be monotonic, and we will most likely not achieve perfectly monotonic relationships by performing the feature transformations which we will take a look at in the next sections. 

However, we should still make them **as monotonic as possible**, and therefore, I suggest using a simple metric for **monotony**, in order to compare the monotony of the original features and our transformed features. 

My very simple approach counts all monotony violations as seen in the last graphical example, and it returns $monotony = 1 - \frac{|\text{monotony violations}|}{|\text{samples}|}$.

In [None]:
def monotony(feature, target):
    '''
    A simple function for determining the monotony of the feature-target relationship.
    Returns [score_inc, score_dec, violations_inc, violations_dec].
    '''
    num_samples = len(target)
    df = pd.concat([feature, target], axis=1)
    # sorts with priority 1: feature, priority 2: target
    df_sorted = df.sort_values([feature.name, target.name], ascending=[True, True])
    # differences between consecutive target values after sorting
    # (positional, so the result does not depend on the index labels)
    steps = df_sorted[target.name].diff().dropna()

    # monotonically increasing? every drop of the target is a violation
    violations_inc = int((steps < 0).sum())

    # monotonically decreasing? every rise of the target is a violation
    violations_dec = int((steps > 0).sum())

    # scores:
    score_inc = round(1 - violations_inc / num_samples, 2)
    score_dec = round(1 - violations_dec / num_samples, 2)
    return [score_inc, score_dec, violations_inc, violations_dec]
        
    
    
    
monotony_metrics = monotony(df['years'], df['salary'])
print(f'monotonically increasing? violations: {monotony_metrics[2]}, monotony score: {monotony_metrics[0]}\n' +
     f'monotonically decreasing? violations: {monotony_metrics[3]}, monotony score: {monotony_metrics[1]}')

As we can see, the relation is almost a **monotonically increasing function**. However, this approach is pretty naive, since it makes too many assumptions about the data. These are some of the reasons why simply checking for monotony this way might be **problematic**:

* Feature values might be **non-unique**, which leaves several options open (e.g. should we use the mean or the median target value in these cases?)
* Functions might have many local minima and maxima, but they could still follow a monotonic trend when smoothed. In that case, we would get very bad scores for both the decreasing and the increasing case. 
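
One possible answer to the first point is to aggregate the target to a single value per unique feature value before checking for monotony. The sketch below uses the mean and made-up data in which several employees share the same number of years; using the mean rather than the median is an assumption of this example, not a recommendation.

```python
import pandas as pd

# hypothetical data: several employees share the same number of years
df_dup = pd.DataFrame({
    "years":  [1, 1, 3, 3, 8, 8],
    "salary": [7, 9, 12, 10, 13, 15],
})

# raw target values, sorted by the feature: the tie at years = 3 causes a drop
raw_sorted = df_dup.sort_values("years", kind="stable")["salary"]
print(raw_sorted.is_monotonic_increasing)

# one mean target value per unique feature value
mean_per_year = df_dup.groupby("years")["salary"].mean()
print(mean_per_year)
print(mean_per_year.is_monotonic_increasing)
```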

A better approach might include checking the **integral** values for varying areas of the feature space. 


However, we can also use another, conceptually pretty similar indicator for approximate (linear) monotony, the so-called **Pearson Correlation Coefficient**.

<a id="sec12"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 1.2. Pearson Correlation and Collinearity</h1>


The Pearson Correlation Coefficient indicates the degree to which two variables have a linear relationship. A perfect linear relationship is by definition a [monotonic relationship](#sec11).
Our goal is to reach correlations that are as strong as possible between every single feature and the target, and as weak as possible between the features themselves. Whenever two features are correlated with each other, and one of them is much more strongly correlated with the target, one should consider dropping the feature which is less correlated with the target. Correlation between features is also known as **(multi) collinearity**.
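
As a small illustration of both kinds of correlation, the sketch below reuses the *years* and *salary* values from the previous section and adds a made-up feature *months* (simply `years * 12`) that is perfectly collinear with *years*. The column `months` is an assumption of this example only.

```python
import pandas as pd

df_corr = pd.DataFrame({
    "years":  [1, 3, 4, 8, 10, 11],
    "months": [12, 36, 48, 96, 120, 132],  # hypothetical feature: years * 12
    "salary": [8, 11, 9, 14, 14, 15],
})

# Pearson correlation of every column with every other column
corr = df_corr.corr(method="pearson")
print(corr.round(2))

# feature-target correlations: the stronger, the better
print(corr["salary"].drop("salary").round(2))

# feature-feature correlation: a value close to 1 or -1 indicates collinearity
print(round(corr.loc["years", "months"], 2))
```

Since *years* and *months* carry exactly the same information, keeping only one of them loses nothing.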


# Not finished yet

make sure to motivate me by upvoting this notebook 

and feel free to suggest any improvements in the comments, since all of us are using kaggle for studying =) 

<a id="sec2"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 2. Univariate Transformations on Numerical Data </h1>


Numerical Data may..
* be on different scales
* follow a [long-tailed distribution](#tail). Long tails might dominate the underlying calculations in models which rely on polynomial calculations on the features (most linear models, SVMs and neural networks)
* have a complex relationship with the outcome
* be represented inefficiently; sometimes simply using a **normally distributed representation may already improve the performance**

### One often wants data to be **normally distributed**, but why?
* The whole distribution is defined by the mean (= mode = median) and the variance, which might be of importance
* the normal distribution is [symmetric](#tail), which has some significant impact on the performance of many models 
* many machine learning models are [parametric methods](#stats) that rely on normality assumptions, which are often justified by the [central limit theorem](https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_probability/BS704_Probability12.html). LDA, QDA and Gaussian Naive Bayes [assume the feature values to be drawn from a normally distributed population](https://stackoverflow.com/questions/54071893/a-feature-distribution-is-nearly-normal-what-does-that-imply-for-my-ml-model) within each class, while linear regression assumes normally distributed residuals rather than normally distributed features

<a id="sec21"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 2.1. Scaling, Centering and Standardization </h1>

When talking about this topic, people tend to mix the following terms:
* [Centering](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) refers to subtracting the mean of a column
* [Standardization](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) refers to dividing a centered feature by the standard deviation and leads to a standard deviation of one
* [Range Scaling](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) refers to using the Minimum and the Maximum value of a feature to rescale the data on a different scale (e.g. between 0 and 1)

<a id="sec211"></a>
## Centering and Standardization
The transformed numerical feature will...
* have a *mean* of $0$
* have a standard deviation of $1$ 

Many gradient-based methods, including most Deep Learning models, train more reliably on features with these properties.

Therefore, the standard scaler simply subtracts the mean of the feature from each value and divides the result by the standard deviation of the feature.

Let's take a look at some data containing the age of some *pets* and whether they are *house trained* or not:

In [None]:
age = [3, 4, 2, 7, 8]
house_trained = [1, 1, 0, 1, 0]
values = list(zip(age, house_trained))
names = ['age', 'house_trained']
df = pd.DataFrame(values, columns=names)
df

In [None]:
from sklearn.preprocessing import StandardScaler
# centering = subtract the mean
center = StandardScaler(with_std=False)
df['centered'] = center.fit_transform(df['age'].values.reshape((-1,1)))

# standardization = divide a centered feature by its std
std = StandardScaler()
df['standardized'] = std.fit_transform(df['age'].values.reshape((-1,1)))
df
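
To make the two steps explicit, here is the same computation written out with plain NumPy for the *age* values from above. It is only a sketch for checking the numbers by hand. Note that `StandardScaler` divides by the population standard deviation, which corresponds to `ddof=0` in NumPy.

```python
import numpy as np

age = np.array([3, 4, 2, 7, 8], dtype=float)

# centering: subtract the mean
centered = age - age.mean()

# standardization: divide the centered values by the population std (ddof=0)
standardized = centered / age.std(ddof=0)

print(centered)
print(standardized.round(3))
# sanity check: mean close to 0, standard deviation close to 1
print(abs(standardized.mean()) < 1e-12, abs(standardized.std(ddof=0) - 1) < 1e-12)
```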

<a id="sec211"></a>
## Range Scaling
The transformed numerical feature will...
* have a similarly formed distribution as the original feature
* still contain outliers
* contain values between $0$ and $1$ by default, which enables further transformations like the [Logit Transformation](#sec24)
* be especially beneficial for models, which assume the data to be on the same scale (distance based methods like KNN), if applied to all numerical features

In [None]:
from sklearn.preprocessing import MinMaxScaler
# self defined interval after transformation: [-1,1]
scaler = MinMaxScaler(feature_range=(-1, 1)) 
df['minmax']  = scaler.fit_transform(df['age'].values.reshape((-1,1)))
df
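
Range scaling boils down to one formula: shift the feature so that its minimum becomes 0, divide by its range, and then stretch the result to the desired interval. The function below is a hand-written sketch of that formula; `range_scale` is a name chosen for this example.

```python
import numpy as np

def range_scale(x, low=0.0, high=1.0):
    """Rescale x linearly so that min(x) maps to low and max(x) maps to high."""
    x = np.asarray(x, dtype=float)
    # values between 0 and 1
    unit = (x - x.min()) / (x.max() - x.min())
    # stretch and shift to the interval [low, high]
    return unit * (high - low) + low

age = [3, 4, 2, 7, 8]
print(range_scale(age).round(3))          # default interval [0, 1]
print(range_scale(age, -1, 1).round(3))   # same interval as in the cell above
```

The sketch does not handle constant features, for which the range is zero.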

<a id="sec22"></a>
<a id="log"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 2.2. Log Transformation </h1>

$\Large
     x_{transformed}=ln(x)
$
* commonly used
* suitable for data which approximately follows a [log-normal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution)
* is a special case of the [Box-Cox Transformation](#box-cox), so take a look at that section if you are interested in these kinds of transformations


In [None]:
fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(8, 10))

skewness = stats.skew(chol)
title = f'original, skewness = {round(skewness, 2)}'
chol.plot(kind='hist', ax=ax1, color='red', alpha=0.5, title=title)
chol_t = chol.apply(np.log)
chol_t = pd.Series(chol_t)
skewness_t = stats.skew(chol_t)
title_t = f'transformed, skewness = {round(skewness_t, 2)}'
chol_t.plot(kind='hist', ax=ax2, color='cyan', alpha=0.8, title=title_t)

plt.tight_layout()
plt.show()
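
The logarithm is only defined for strictly positive values, which is fine for *chol* but not for features that contain zeros (e.g. counts). A common workaround is to use $ln(1 + x)$ instead. The data below is made up for this sketch.

```python
import numpy as np

# hypothetical right-skewed feature that contains a zero
x = np.array([0.0, 1.0, 10.0, 100.0, 1000.0])

x_log1p = np.log1p(x)       # ln(1 + x), finite at x = 0
x_back = np.expm1(x_log1p)  # inverse transformation: exp(.) - 1

print(x_log1p.round(3))
print(np.allclose(x, x_back))
```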

<a id="sec23"></a>
<a id="box-cox"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 2.3. Box-Cox Power Transformation </h1>

$\Large
     x_{transformed}=\left\{\begin{array}{ll} \frac{x^\lambda - 1}{\lambda}, & \lambda\neq 0 \\
         \ln(x), & \lambda = 0\end{array}\right. 
$
  
  
  
* aims to bring the distribution of the feature closer to a normal shape
* the parameter $\lambda$ might be set explicitly or might be estimated in order to obtain data that is **as normally distributed as possible**
* different $\lambda$ cover (up to shifting and scaling) the Identity Transformation ($\lambda = 1$), the [Log Transformation](#log) ($\lambda = 0$), the Square Root Transformation ($\lambda = 0.5$), the Inverse Transformation ($\lambda = -1$), and no-name transformations in between
* requires the data to be strictly positive
* is a [variance stabilizing transformation](https://en.wikipedia.org/wiki/Variance-stabilizing_transformation) 
* improves the **validity** of the Pearson **Correlation**, and thus of checks for **multicollinearity** between features
* the [scipy implementation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html) returns the best lambda, so we can store it. We can then apply a Box-Cox transformation with that lambda value when predicting outcomes for our test data / validation data

Another Power Transformation that might be interesting is the [Yeo Johnson transformation](https://www.stat.umn.edu/arc/yjpower.pdf). It allows the feature to contain negative values.
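
To see how a single $\lambda$ covers several familiar transformations, the sketch below implements the usual Box-Cox definition, $(x^\lambda - 1) / \lambda$ for $\lambda \neq 0$ and $ln(x)$ for $\lambda = 0$, directly in NumPy. The function `box_cox` is written for this illustration; in practice `scipy.stats.boxcox` does the same and additionally estimates $\lambda$.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transformation with a fixed lambda (no estimation)."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return np.log(x)
    return (x ** lam - 1.0) / lam

x = np.array([1.0, 2.0, 4.0, 8.0])

print(box_cox(x, 1))     # identity, shifted by -1
print(box_cox(x, 0.5))   # square root, rescaled: 2 * (sqrt(x) - 1)
print(box_cox(x, 0))     # natural logarithm
print(box_cox(x, -1))    # inverse, rescaled: 1 - 1 / x
```

Applying the function with a stored $\lambda$ to new data corresponds to passing the `lmbda` argument to `scipy.stats.boxcox`.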

In [None]:
from statsmodels.graphics.gofplots import qqplot
fig, (ax1, ax2, ax3, ax4) = plt.subplots(nrows=4, ncols=1, figsize=(5, 20))

# original
skewness = stats.skew(trestbps)
title = f'original, skewness = {round(skewness, 2)}'
trestbps.plot(kind='hist', ax=ax1, color='red', alpha=0.5, title=title)

##qqplot
qqplot(data=trestbps, dist="norm", ax=ax2, line='s')

# transformation
trestbps_t, lmbda_best = stats.boxcox(trestbps)
trestbps_t = pd.Series(trestbps_t)
skewness_t = stats.skew(trestbps_t)
title_t = f'transformed, skewness = {round(skewness_t, 2)}, lambda = {round(lmbda_best, 2)}'
trestbps_t.plot(kind='hist', ax=ax3, color='cyan', alpha=0.8, title=title_t)

##qqplot
qqplot(data=trestbps_t, dist="norm", ax=ax4, line='s')


plt.tight_layout()
plt.savefig('.png')
plt.show()

As we can see in the [qqplots](#sec52), the original data is [right skewed](#tail), whereas the transformed data matches the normal distribution much better.

https://en.wikipedia.org/wiki/Power_transform

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.boxcox.html

https://www.statisticshowto.com/box-cox-transformation/

https://en.wikipedia.org/wiki/Variance-stabilizing_transformation

<a id="sec24"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 2.4. Logit Transformation </h1>


$\Large
     x_{transformed}=ln(\frac{x}{1-x})
$
* useful on continuous data between 0 and 1, e.g. proportions, with a **sigmoid distribution** (many values are either very high or very low)
* transformed data provides better distinction between the data points with either very high or very low values
* provides the log odds
* maps the data to continuous values between **-inf** and **inf**
* the ends of the scale have a larger difference on the logit-transformed scale
* is a [variance stabilizing transformation](https://en.wikipedia.org/wiki/Variance-stabilizing_transformation) 
* is available as `scipy.special.logit`, see the [scipy implementation](https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.special.logit.html)


The [Arcsine Transformation](http://strata.uga.edu/8370/rtips/proportions.html) works pretty similarly and might be better in some cases, but in general, the Logit Transformation is the better choice.

In [None]:
from sklearn.preprocessing import MinMaxScaler
from scipy.special import logit
mms = MinMaxScaler()
trestbps_mms = pd.Series(mms.fit_transform(trestbps.values.reshape(-1, 1)).flatten())


fig, (ax1, ax2) = plt.subplots(nrows=2, ncols=1, figsize=(8, 10))
title = f'original'
trestbps_mms.plot(kind='hist', ax=ax1, color='red', alpha=0.5, title=title)

trestbps_t = pd.Series(logit(trestbps_mms))
# logit(0) = -inf and logit(1) = inf, so both are capped for the plot
# (np.Inf and np.NINF were removed in NumPy 2.0, np.inf works everywhere)
trestbps_t = trestbps_t.replace(np.inf, 4)
trestbps_t = trestbps_t.replace(-np.inf, -4)
title_t = f'transformed'
trestbps_t.plot(kind='hist', ax=ax2, color='cyan', alpha=0.8, title=title_t)
plt.tight_layout()
plt.show()
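
Range-scaled data always contains the exact values 0 and 1, for which the logit is infinite. Instead of capping the result afterwards, one option is to clip the input slightly away from 0 and 1 before transforming. This is a sketch of that idea; the function names and the choice of `eps` are assumptions of this example.

```python
import numpy as np

def safe_logit(p, eps=1e-6):
    """Logit transformation that stays finite at p = 0 and p = 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def sigmoid(z):
    """Inverse of the logit transformation."""
    return 1.0 / (1.0 + np.exp(-z))

p = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
z = safe_logit(p)

print(z.round(3))            # log odds, finite everywhere
print(sigmoid(z).round(3))   # back on the original scale
```

The smaller `eps` is chosen, the further the two clipped end points move away from the rest of the data.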

https://docs.scipy.org/doc/scipy-0.13.0/reference/generated/scipy.special.logit.html

http://strata.uga.edu/8370/rtips/proportions.html

https://www.statsdirect.com/help/data_preparation/transform_logit.htm

<a id="sec25"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 2.5. Binning with Decision Trees </h1>

Binning transforms numerical features into categorical features, which we can treat like any other categorical feature. There are several approaches, like taking quantiles or any arbitrary numbers as bin limits. For example, if your job is to find out whether patients who are older than 60 have a higher chance of having a specific illness, it might be interesting to bin the numerical age feature using the intervals $(0, 60)$ and $[60,\infty)$. One of the less self-explanatory methods of binning is **Binning with Decision Trees:**
* The bins will not necessarily contain equal numbers of cases, because the split points are chosen to separate the target classes and not to balance the bin sizes
* Each predicted probability will form one category
* Since predictions are made in the leaf nodes, and multiple leaves could make the same predictions, we end up having as many categories as leaf nodes or fewer
* Usually improves the **correlation with the target**, due to having a [monotonic relation with the target](https://www.statisticshowto.com/monotonic-relationship/)
* Handles outliers, since they are assigned to one of the bins
* Since Deep Decision Trees have a High [Variance](https://machinelearningmastery.com/gentle-introduction-to-the-bias-variance-trade-off-in-machine-learning/), this procedure might lead to overfitting
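
Before turning to the tree-based variant, here is a short sketch of the two simpler approaches mentioned above, fixed bin limits and quantile-based bin limits, using `pd.cut` and `pd.qcut`. The ages are made up for this example.

```python
import numpy as np
import pandas as pd

# hypothetical ages of eight patients
age_years = pd.Series([23, 35, 47, 59, 60, 61, 72, 88], name="age")

# fixed limits: [0, 60) and [60, inf); right=False makes the left edge inclusive
fixed_bins = pd.cut(age_years, bins=[0, 60, np.inf], right=False,
                    labels=["under_60", "60_plus"])

# quantile limits: four bins with (roughly) equal numbers of cases
quantile_bins = pd.qcut(age_years, q=4, labels=False)

print(pd.concat([age_years,
                 fixed_bins.rename("fixed"),
                 quantile_bins.rename("quartile")], axis=1))
```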

In [None]:
df_heart_failure = pd.read_csv('/kaggle/input/heart-failure-clinical-data/heart_failure_clinical_records_dataset.csv')
c_p = df_heart_failure['creatinine_phosphokinase']
target_heart_failure = df_heart_failure['DEATH_EVENT']

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
x = c_p.values.reshape(-1,1)

hyperparameter = {'max_depth' : [1,2,4, 6, 8]}
val = GridSearchCV(DecisionTreeClassifier(), 
                         hyperparameter, cv=5, 
                         scoring='roc_auc')

val.fit(x, target_heart_failure)
disc_tree = val.best_estimator_
# do this on both the train and the test set:
x_binned = pd.Series(disc_tree.predict_proba(x)[:,1], name='x_binned')

### Let's take a look at the Resulting categories, the tree and the correlation improvement.

In [None]:
x_binned.value_counts().plot.bar(title='resulting feature categories')
plt.show()

In [None]:
from sklearn.tree import export_graphviz
import cv2
export_graphviz(disc_tree, 'tree.dot', feature_names = ['c_p'])
! dot -Tpng tree.dot -o tree.png
img = cv2.imread('tree.png')
plt.figure(figsize = (18, 18))
plt.imshow(img)
plt.axis('off')
plt.show()

In [None]:
cor = np.corrcoef(target_heart_failure, x.flatten())[0][1]
cor_transformed = np.corrcoef(target_heart_failure, x_binned)[0][1]
print(f'The Pearson Correlation between the Target and the numerical feature: {round(cor, 2)}')
print(f'The Pearson Correlation between the Target and the binned feature: {round(cor_transformed, 2)}')

Note: I didn't finetune the other parameters of the decision tree, and I used the *roc_auc* score. One should always use an appropriate score, and feel free to finetune the other hyperparameters in your models, to obtain the best possible features.


further sources: 

https://www.youtube.com/watch?v=vsKNxbP8R_8?t=1388

https://towardsdatascience.com/discretisation-using-decision-trees-21910483fa4b



<a id="sec3"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 3. Encode Categorical Data </h1>

Categorical features contain discrete, oftentimes even string values. The number of unique values a categorical feature contains is called **cardinality**. 

Most common machine learning models can't handle this kind of data, since they assume data to be numerical. Thus we have to use encoding methods to transform the categorical features into a suitable representation in order to utilize their *predictive abilities*.

<a id="sec31"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 3.1. Label Encoding </h1>

Label Encoders are probably the simplest way to encode a categorical feature. The resulting encoding has the following properties:
* Encodes the feature into one column (so we don't struggle with having too many features)
* Uses consecutive integers, starting at 0. 
* All observations of a category share the same integer
* Implies numerical *hierarchies* and *distances* between the categories ($1 < 2$ and $1 = 0.5 \cdot 2$), which are usually not meaningful
* Some models like *Linear Regression* might assign more meaning to categories with a higher integer representation.
* In most cases, it **violates** the **key idea** of forcing features to have a monotonic relationship with the target.


Let's apply our encodings to some data about *pets*, the *houses* in which they live and whether they are *house trained* or not.

In [None]:
house_nr = [1, 3, 3, 2, 1, 1, 3, 2, 2, 2, 1, 1]
pet = ['dog', 'cat', 'dog', 'dog', 'rabbit', 'mouse', 'cat', 'rabbit', 'dog', 'cat', 'rat', 'rat']
house_trained = [1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1]
values = list(zip(house_nr, pet, house_trained))
names = ['house_nr', 'pet', 'house_trained']
df = pd.DataFrame(values, columns=names)
df

In [None]:
# import a label encoder from sklearn
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

df_label = df.copy()

# fit -> create parameters for the encoding (which category will be encoded as which integer?)
# transform -> encode the feature using the parameters
# fit_transform -> performs both, fit and transform
# fit_transform on training data, transform on test data
df_label['pet'] = le.fit_transform(df_label['pet'])
df_label['house_nr'] = le.fit_transform(df_label['house_nr'])
df_label

As we can see, each *pet* has a new integer representation. Note that *house_nr* now contains values from a consecutive sequence of integers starting at 0.

Let's focus on the feature *pet* and the target *house_trained*. The relationship between these features is not monotonic, as we can see in the following plot:

In [None]:
grouped_by_pet = df_label.groupby('pet')['house_trained'].mean()

fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(11,7))
plt.bar(x=grouped_by_pet.index, height=grouped_by_pet.values)
plt.xlabel('encoded pet')
plt.ylabel('chance of being house trained')
plt.show()

As we can see, higher feature values don't necessarily relate to higher chances of having a higher target value, so we don't have a *monotonic relationship* with the target. This example covers a *binary classification* task. Thus, our target variable can either have the value $1$ or $0$, where $1$ indicates that the case belongs to the so-called *positive class*. 

We could simply define a custom Label Encoding which assigns *higher values* to categories that lead to a *higher chance* of belonging to the *positive class*. Unfortunately, we would have to create one encoding for each class in a *multiclass classification task*, and it might be very complex for regression tasks.

However, we would still struggle with the other drawbacks listed above, and I don't recommend relying on this encoding method.
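
For completeness, such a custom encoding could look like the following sketch. It orders the categories by their share of positive cases and assigns the integers in that order. The variable names are chosen for this example, and the pet data from above is repeated so that the cell runs on its own.

```python
import pandas as pd

df_pets = pd.DataFrame({
    "pet": ['dog', 'cat', 'dog', 'dog', 'rabbit', 'mouse',
            'cat', 'rabbit', 'dog', 'cat', 'rat', 'rat'],
    "house_trained": [1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1],
})

# share of the positive class per category, sorted from lowest to highest
positive_rate = (df_pets.groupby("pet")["house_trained"]
                 .mean()
                 .sort_values(kind="stable"))

# categories with a higher positive rate get a higher integer
mapping = {category: code for code, category in enumerate(positive_rate.index)}
df_pets["pet_ordered"] = df_pets["pet"].map(mapping)

print(mapping)
print(df_pets.groupby("pet_ordered")["house_trained"].mean())
```

The relationship with the target is now monotonic by construction, but the integers still suggest distances between the categories that don't exist.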

<a id="sec32"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 3.2. One-Hot Encoding </h1>

As I already mentioned, Label Encodings have some huge drawbacks. Probably the biggest drawback is that numerical relationships like *distances* and *hierarchies* will be assumed between the categories, because these are solely encoded as discrete numbers inside the same column. Categories, on the other hand, don't have any meaningful numerical relationships.

One-Hot Encoders evade this problem and the encoded feature will have the following properties:
* Each *category* is stored in a separate, new column.
* Each of these new columns contains solely zeroes and ones.
* Ones indicate that the case is of the respective category.
* Each of these new columns has a *monotonic relationship with the target*.
* There will be no numerical relationships assumed between the categories

Nevertheless, there are still some downsides of using this method:
* Huge drawback: multiple new columns, which might lead to a worse model due to the [Curse of Dimensionality](https://en.wikipedia.org/wiki/Curse_of_dimensionality)
* In many cases, it makes sense to merge very uncommon categories into one category called *'other'* in order to avoid having many columns containing very few ones. For example, you could merge all categories which occur in less than 5% of your *observations*.
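
Merging rare categories takes only a few lines in pandas. The sketch below uses made-up data and the 5% threshold from the example above; both are assumptions for illustration.

```python
import pandas as pd

# hypothetical feature with two very rare categories
pets = pd.Series(['dog'] * 12 + ['cat'] * 10 + ['rabbit'] * 6
                 + ['ferret'] + ['gecko'], name='pet')

# relative frequency of every category
share = pets.value_counts(normalize=True)

# categories which occur in less than 5% of the observations
rare = share[share < 0.05].index

# keep frequent categories, replace rare ones with 'other'
pets_merged = pets.where(~pets.isin(rare), 'other')

print(pets_merged.value_counts())
```

The set of rare categories should be determined on the training data only and then reused for the test data.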

In [None]:
# pandas offers one-hot encoding via get_dummies
# (sklearn.preprocessing.OneHotEncoder is an alternative for use inside pipelines)

# we will focus on the feature 'pet'
df_oh = df.copy().drop(['house_nr'], axis=1)

# creates the new features; dtype=int yields zeroes and ones instead of booleans
dummies = pd.get_dummies(df_oh['pet'], dtype=int)

# adds the new features to our dataframe
df_oh = pd.concat([df_oh, dummies], axis=1)
df_oh

As you can see, each row contains just a single $1$ within the new columns, since each case still belongs solely to one of the categories. 

Moreover, the common machine learning models can handle this representation of the feature very well, since a category is simply recognized as either present or absent. Last but not least, every new column has either a positive or a negative *monotonic and linear* relationship with the target (for obvious reasons, since there are only two discrete values per column).

Let's for example take a look at the relationship between the column *rabbit* and the target:

In [None]:
grouped_by_rabbit = df_oh.groupby('rabbit')['house_trained'].mean()

fig, ax1 = plt.subplots(nrows=1, ncols=1, figsize=(11,7))
plt.bar(x=grouped_by_rabbit.index, height=grouped_by_rabbit.values)
plt.ylabel('chance of being house trained')

# 0 means the pet is not a rabbit, 1 means it is a rabbit
plt.xticks([0, 1], ['is not a rabbit', 'is a rabbit'])
plt.show()

As you can see, being a rabbit comes with a lower chance of being house trained than not being a rabbit in this small dataset (0.5 compared to 0.7). This relationship is very easy to interpret for the model and thus can be very beneficial. In a regression task, binary columns like this could indicate either higher or lower target values.

<a id="sec33"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 3.3. Target-Mean Encoding </h1>

$\Large
     x_{transformed}=\frac{|\{i : X_i = x,\ y_i = 1\}|}{|\{i : X_i = x\}|} 
$

This advanced encoding method for classification problems with **binary targets** (i.e. two classes) is an elaborate alternative to the commonly used ones. One should always take a look at this representation of the categorical feature and its *predictive abilities*. 

* **Encoding:**  $
    \frac{\text{observations of the  positive class with the respective feature value}}{\text{observations with the respective feature value}}
$
* **Result:** Probability of the target value given each feature value
* Provides a monotonic relationship between the feature and the target
* Encodes the feature within **one column** and thus doesn't lead to huge amounts of new columns in contrast to [One-Hot Encoding](#sec32), which might be beneficial for models that can't handle huge amounts of features
* Might decrease the [cardinality](#sec3) of the categorical feature (e.g. 2 values might both be encoded as 0.5 and thus would merge into one category)
* **Alternative for non-binary classification tasks:** create one Target-Mean encoded column for each target value and treat the respective target value as positive, and all other target values as negative
* **Regularization:** Instead of using the whole training data at once to determine the encoding, split it into K folds, determine one encoding per fold, and use the average encoding in the final feature representation

Assume we have a dataset containing several pets of your friends and whether they are house trained or not. We want to Target-Mean Encode the categorical feature *pet* with respect to the target *house_trained*.

In [None]:
pet = ['dog', 'cat', 'dog', 'dog', 'rabbit', 'mouse', 'cat', 'rabbit', 'dog', 'cat', 'rat', 'rat']
house_trained = [1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1]
values = list(zip(pet, house_trained))
names = ['pet', 'house_trained']
df = pd.DataFrame(values, columns=names)
df

In [None]:
def target_mean_encode(feature, target):
    # mean of the binary target per category = share of positive observations
    category_means = target.groupby(feature).mean()
    # replace every observation by the mean of its category
    return feature.map(category_means)





pet_encoded = target_mean_encode(feature=df['pet'], target=df['house_trained'])
df_with_encoding = pd.concat([df, pet_encoded.rename('pet_encoded')], axis=1)
df_with_encoding

As we can see, rats and rabbits end up having the same encoding. Thus, the encoding has a *cardinality* of 4, whereas the original feature had a *cardinality* of 5.

TODO: regularization
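A minimal sketch of how the K-fold regularization mentioned above could look. I am assuming the common *out-of-fold* variant here: every row is encoded with target means that were computed on the other folds only, so a row never sees its own target value. The helper `target_mean_encode_kfold`, its parameters `n_splits` and `seed`, and the fallback to the global target mean are my own choices, not a fixed recipe.

```python
import numpy as np
import pandas as pd

def target_mean_encode_kfold(feature, target, n_splits=3, seed=0):
    """Out-of-fold target-mean encoding (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # assign every row to one of n_splits folds at random
    fold_ids = rng.permutation(len(feature)) % n_splits
    encoded = pd.Series(np.nan, index=feature.index, dtype=float)
    for k in range(n_splits):
        held_out = fold_ids == k
        # target means computed only on the rows outside the current fold
        means = target[~held_out].groupby(feature[~held_out]).mean()
        encoded.loc[held_out] = feature[held_out].map(means).to_numpy()
    # values that never occur in the other folds fall back to the global mean
    return encoded.fillna(target.mean())

pets = pd.DataFrame({
    'pet': ['dog', 'cat', 'dog', 'dog', 'rabbit', 'mouse',
            'cat', 'rabbit', 'dog', 'cat', 'rat', 'rat'],
    'house_trained': [1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1],
})
pet_encoded_oof = target_mean_encode_kfold(pets['pet'], pets['house_trained'])
pets.assign(pet_encoded_oof=pet_encoded_oof)
```

With such a tiny dataset the encodings become quite noisy, so this only illustrates the mechanics.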

<a id="sec-2"></a>
***

<h1 style="background-color:DodgerBlue; color:white" >-> 4. Combine Features </h1>

Many machine learning algorithms utilize feature interactions and combinations implicitly. However, experience has shown that it still might be a good idea to combine features manually, because we can't rely on our model 'doing all the work'.


Finding and combining features could be important for standing out in Kaggle competitions, but finding useful combinations might be a non-trivial problem. Suggestions from **domain experts** are oftentimes the best entry point to detecting valuable combinations.

Besides relying on domain experts, we could try every possible combination of features to identify *predictive* (i.e. model improving) ones. Trying everything would take too much time, but we can still find many of the most important combinations if we follow these guidelines: 
* **effect sparsity:** the fewer features are part of the combination, the higher the chance for the combination to be predictive (including singletons, i.e. uncombined features). We should focus on combinations between 2 or 3 features.
* **heredity:** the combination $(feat_1, feat_2)$ should only be considered to be predictive, if at least one of the features, $feat_1$ or $feat_2$, is already known to be predictive.[$^{1}$](#note)
* **priority:** in most cases, the interpretability and the predictivity of a combination is better, when the original features aren't transformed ([scaled](#sec21), encoded, [log-transformed](#log)...). Thus we should create the combinations *prior* to any transformations.

$^{1}$ Note: this is *weak* heredity, which I will focus on. *Strong* heredity demands both features to be predictive.
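To illustrate how effect sparsity and heredity shrink the search space, here is a small sketch that only enumerates pairs and filters them by weak or strong heredity. The helper `candidate_pairs` and the feature names are purely illustrative, and which features count as predictive is assumed to be known from an earlier analysis.

```python
from itertools import combinations

def candidate_pairs(features, predictive, strong=False):
    """Feature pairs allowed by the heredity guideline (hypothetical helper)."""
    predictive = set(predictive)
    # strong heredity: both features predictive, weak heredity: at least one
    needed = 2 if strong else 1
    pairs = []
    for a, b in combinations(features, 2):
        n_predictive = (a in predictive) + (b in predictive)
        if n_predictive >= needed:
            pairs.append((a, b))
    return pairs

features = ['age', 'chol', 'trestbps', 'sex']   # illustrative feature names
predictive = ['age', 'chol']                    # assumed to be known already

weak_pairs = candidate_pairs(features, predictive)
strong_pairs = candidate_pairs(features, predictive, strong=True)
print(len(weak_pairs), 'pairs under weak heredity:', weak_pairs)
print(len(strong_pairs), 'pairs under strong heredity:', strong_pairs)
```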

As suggested by [@anashamoutni](https://www.kaggle.com/anashamoutni), [PCA](https://en.wikipedia.org/wiki/Principal_component_analysis) and other [Dimensionality Reduction](https://en.wikipedia.org/wiki/Dimensionality_reduction) methods can be seen as methods for combining features as well, since they merge multiple features together in a more or less meaningful way. I will probably make a separate notebook about that topic as well..

<a id="sec-21"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 4.1. Combine Features using Equations</h1>


There are multiple ways of combining features as introduced in [this video from Jeff Heaton](https://www.youtube.com/watch?v=X4pWmkxEikM). Combining features demands you to wrap your mind around the given data and to think of new features you could create by combining the given ones. A combination can be seen as a formula/equation for a new feature.

Probably the most common, and most interpretable **building blocks** of these equations are:

* Products of numerical features (E.g. daily cigarettes ⋅ days or area = width ⋅ length)
* Ratios of numerical features (E.g. $\frac{price}{gram}$)
* Sums of numerical features (E.g. weight of passengers + weight of the transported goods)
* Differences of numerical values (E.g. workdays - sick days). One often uses differences to subtract means
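In pandas, each of these building blocks is a one-liner on the respective columns. The following sketch uses a made-up toy frame whose column names simply follow the examples from the list above.

```python
import pandas as pd

# made-up values, column names follow the examples above
toy = pd.DataFrame({
    'width': [2.0, 3.0], 'length': [4.0, 5.0],
    'price': [10.0, 24.0], 'gram': [200.0, 300.0],
    'passenger_weight': [150.0, 80.0], 'goods_weight': [40.0, 500.0],
    'workdays': [220, 230], 'sick_days': [5, 12],
})

toy['area'] = toy['width'] * toy['length']                           # product
toy['price_per_gram'] = toy['price'] / toy['gram']                   # ratio
toy['total_weight'] = toy['passenger_weight'] + toy['goods_weight']  # sum
toy['days_present'] = toy['workdays'] - toy['sick_days']             # difference
toy['price_centered'] = toy['price'] - toy['price'].mean()           # subtract the mean
toy[['area', 'price_per_gram', 'total_weight', 'days_present', 'price_centered']]
```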

The whole equation for a new feature *parent* that describes the likelihood of an employee going on parental leave, given the numerical features *age* and years of employment (*empl*), as well as the binary categorical features sex (*female*) and marital status (*married*), could look like this:

$\large
     \text{parent} = C \left(1+0.5 \cdot married\right)\left(\frac{empl}{age^5}+0.5 \cdot female \cdot \frac{empl}{age^5}\right)
$

Take a moment to think about the way the binary categorical features affect this equation.

Let's apply this equation to some data to see if it works:

In [None]:
married = [0, 1, 0, 1, 0, 1, 0]
empl = [2, 5, 4, 15, 2, 6, 1]
age = [25, 27, 41, 43, 28, 29, 22]
female = [0, 1,  1, 1, 1, 0, 1]
values = list(zip(married, empl, age, female))
names = ['married', 'empl', 'age', 'female']
df = pd.DataFrame(values, columns=names)
df

In [None]:
c = 10_000_000 # a constant for better readability

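def parent_score(row):
    # base rate empl / age^5; the female flag adds 50% of it, the married flag scales the total by 1.5
    return  c * (1 + 0.5*row['married'])*(row['empl']/(row['age']**5) + 0.5*row['female']*row['empl']/(row['age']**5))

parent = df.apply(parent_score, axis=1) # axis=1 for row-wise operation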
df_with_parent = pd.concat([df, parent.rename('parent')], axis=1)
df_with_parent

As we can see, young people who have been employed for several years are very likely to go on parental leave. Moreover, the likelihood increases if the employee is female, and increases further if the employee is also married. Note that this approach benefits from some implicit constraints: since we are looking at employee data, there will be no kids in this dataset, although according to this equation kids would be extremely likely to become parents within the next year. This approach also needs some domain knowledge. We can't train any (supervised) predictor to obtain the best possible equation/function for this new feature. We have to wrap our mind around the topic ourselves, and the resulting feature might turn out not to be predictive at all, but time-consuming creative approaches like this can provide valuable new features.
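To make the effect of the two binary flags explicit, we can factor the base rate $\frac{empl}{age^5}$ out of the code cell above, which leaves the multiplier $(1 + 0.5 \cdot married)(1 + 0.5 \cdot female)$. The small helper `flag_multiplier` below is only for illustration and uses the coefficients from that code cell.

```python
from itertools import product

def flag_multiplier(married, female):
    # factor that multiplies the base rate c * empl / age^5 in the code cell above
    return (1 + 0.5 * married) * (1 + 0.5 * female)

multipliers = {(m, f): flag_multiplier(m, f) for m, f in product([0, 1], repeat=2)}
for (m, f), mult in multipliers.items():
    print(f'married={m}, female={f}: base rate x {mult}')
```

A binary flag therefore switches a whole factor on or off, instead of adding a fixed amount to the result.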

<a id="sec-22"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 4.2. Combine Features using Groupby</h1>

In some cases, it might be helpful to create a new feature based on one feature grouped by another one. Why should this be helpful?
Imagine having data about your employees' salaries and their departments. The salary on its own might already be an important feature. However, when you try to find out why some employees seem to be less motivated than others, even though they already have high salaries in comparison with employees from other departments, it might help to compare each salary with the salaries of the other employees in the same department. 

In [None]:
salary = [40, 42, 30, 32, 45, 44, 31, 44, 29, 33, 46, 50, 33, 39]
dep =[1, 1, 0, 0, 2, 2, 1, 0, 1, 0, 2, 1, 2, 0]
values = list(zip(salary, dep))
names = ['salary', 'dep']
df = pd.DataFrame(values, columns=names)
df

In [None]:
mean_per_dep = df.groupby('dep')['salary'].mean()
# broadcast each department's mean salary back to the rows of that department
mean_per_dep_rows = df['dep'].map(mean_per_dep)
salary_per_dep = df['salary'] - mean_per_dep_rows
df_with_salary_per_dep = pd.concat([df, salary_per_dep.rename('salary_per_dep')], axis=1)
df_with_salary_per_dep

And here we go: we now have the salary of each employee relative to the mean salary of their coworkers in the same department. 
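The difference to the group mean is only one possible groupby feature. The following sketch adds a few other variants on the same toy data with `groupby(...).transform`; the new column names and the choice of statistics (ratio, z-score and rank within the department) are my own suggestions.

```python
import pandas as pd

df_dep = pd.DataFrame({
    'salary': [40, 42, 30, 32, 45, 44, 31, 44, 29, 33, 46, 50, 33, 39],
    'dep':    [1, 1, 0, 0, 2, 2, 1, 0, 1, 0, 2, 1, 2, 0],
})

grouped = df_dep.groupby('dep')['salary']
dep_mean = grouped.transform('mean')   # department mean, aligned with the rows
dep_std = grouped.transform('std')     # department standard deviation
dep_rank = grouped.rank(ascending=False, method='min')  # 1 = highest salary

df_dep['salary_ratio_to_dep_mean'] = df_dep['salary'] / dep_mean
df_dep['salary_zscore_in_dep'] = (df_dep['salary'] - dep_mean) / dep_std
df_dep['salary_rank_in_dep'] = dep_rank
df_dep
```

`transform` returns one value per row, so the result can be combined directly with the original column without any manual mapping.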

<a id="sec-23"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 4.3. Combine Features using Conditions</h1>

One more very common method of creating a new feature by combining the original features is using conditions. This method is especially important if your project demands you to focus on a particular subgroup of your observations, or when you have already found frequent itemsets in your dataset.

* Use this method for commonly fulfilled conditions 
* The conditions should contain multiple features, to evade *collinearity*
* Construct the condition based on your project goals or frequent patterns in your data
* You can find frequent patterns in your data using the [Apriori Algorithm](https://www.youtube.com/watch?v=guVvtZ7ZClw)


Imagine your Data Exploration reveals the fact that a particular combination of features occurs frequently with a particular outcome. In such a case, it might be interesting to create a *binary flag*, indicating the particular combination. Of course, the algorithm could find out this relationship automatically, but we can never be sure about it. 

For example, we could have found out that all young customers who have already bought multiple products from our company are very interested in our new product. Let's call these young people *young fans*. Creating such a feature could look like this:

In [None]:
age = [23, 24, 51, 41, 72, 35, 21, 64, 29, 27]
products_bought =[1, 4, 3, 2, 1, 5, 1, 2, 7, 4]
values = list(zip(age, products_bought))
names = ['age', 'products_bought']
df = pd.DataFrame(values, columns=names)
df

In [None]:
def is_young_fan(row):
    # flag customers younger than 30 who bought more than one product
    if row['age'] < 30 and row['products_bought'] > 1:
        return 1
    else:
        return 0

df['young_fan'] = df.apply(is_young_fan, axis=1)
df

As mentioned in the section about [Encodings](#sec3), these binary columns have some huge benefits. Nevertheless, we should only use this method for very common relationships, in order to avoid columns that mostly contain zeros.
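For larger datasets, the same flag can also be built without a row-wise `apply` by combining boolean columns. The sketch below reuses the toy data from above (the frame name `customers` is just for illustration) and additionally checks how often the flag is set, which helps to judge whether the condition is common enough.

```python
import pandas as pd

customers = pd.DataFrame({
    'age': [23, 24, 51, 41, 72, 35, 21, 64, 29, 27],
    'products_bought': [1, 4, 3, 2, 1, 5, 1, 2, 7, 4],
})

is_young = customers['age'] < 30
is_fan = customers['products_bought'] > 1
# element-wise AND of both conditions, stored as 0/1
customers['young_fan'] = (is_young & is_fan).astype(int)

# share of rows with the flag set
young_fan_share = customers['young_fan'].mean()
print(f'share of young fans: {young_fan_share:.0%}')
customers
```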

<a id="sec-1"></a>
<a id="stats"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 5. Statistics Vocabulary and Plots</h1>

* **population:** 
the true data one could obtain with immense effort
* **sample:** 
the part of the data which is available for the modeling process / the training data 
* **population parameter:** 
an aspect of a population (e.g. the ground truth mean of a feature)
* **statistic:** 
an aspect of a sample (e.g. the mean of a feature in our training data) 
* **parametric statistical test:** 
makes an assumption about the population parameters (e.g. Student's t-test, ANOVA)
* **nonparametric statistical test:** 
doesn't assume anything about the population parameters (e.g. chi-square)
* **parametric models:**
machine learning models which make strong assumptions/have a high bias about the sample on which they are applied (e.g. they assume the data to follow a specific distribution).
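The difference between a population parameter and a statistic can be illustrated with synthetic data. In this sketch, the "population" is simply a large array of made-up values (an assumption for the sake of the example, since a real population is usually not available), and the sample is a small random subset of it.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic "population" of 100,000 made-up measurements
population = rng.normal(loc=130, scale=15, size=100_000)
# the sample: the small part of the data we actually get to work with
sample = rng.choice(population, size=300, replace=False)

population_mean = population.mean()   # population parameter
sample_mean = sample.mean()           # statistic
# rough measure of how far the statistic typically is from the parameter
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))

print(f'population mean: {population_mean:.2f}')
print(f'sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})')
```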

<a id="sec51"></a>
<a id="tail"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 5.1. Distribution Tails</h1>

* **tail:** The part on the left side of the modes of the distribution is called the left tail and vice versa.
* **heavy-tailed distribution:** A Distribution with a bigger area under the curve in the tails than a normal distribution
* **long-tailed distribution:** A distribution with a long tail has some values which are far away from the mean of the distribution on the respective side of the mean (most long tails are also **"thin"** for obvious reasons). Long-tailed distributions contain many outliers. The opposite is a tail that is **short and fat**.
* **skewness:** describes the asymmetry of a distribution
* **negative skew:** distribution tends to have a long tail on the left side
* **positive skew:** distribution tends to have a long tail on the right side
* **zero skewness:** both sides of the modes balance out overall (e.g. the distribution is symmetric, or one tail is long and thin and the other is short but fat)
* [**kurtosis:**](https://corporatefinanceinstitute.com/resources/knowledge/other/kurtosis/) measures the conformity of a distribution's tails with the tails of a normal distribution
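Skewness and kurtosis can be computed with `scipy.stats.skew` and `scipy.stats.kurtosis`. The sketch below uses synthetic samples that I picked to match the terms above. Note that `stats.kurtosis` returns the *excess* kurtosis by default, i.e. a normal distribution scores roughly 0.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
samples = {
    'symmetric (normal)':     rng.normal(size=10_000),
    'positive skew (exp.)':   rng.exponential(size=10_000),
    'heavy tails (t, df=3)':  rng.standard_t(df=3, size=10_000),
}
# mirror the right-skewed sample to obtain a left-skewed one
samples['negative skew'] = -samples['positive skew (exp.)']

summary = {
    name: (stats.skew(values), stats.kurtosis(values))
    for name, values in samples.items()
}
for name, (skew, kurt) in summary.items():
    print(f'{name:<24} skewness={skew:6.2f}  excess kurtosis={kurt:6.2f}')
```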

<a id="sec52"></a>
<a id="qqplot"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 5.2. The Quantile-Quantile Plot (qqplot)</h1>

* plots the [quantiles](https://en.wikipedia.org/wiki/Quantile) (basically just the data sorted in ascending order) of two variables against each other
* each axis represents one of these variables
* the more similar the distributions of the variables are, the more the plot looks like the line formed by $x=y$
* points below the line belong to quantiles where the $y$-variable has lower values than the $x$-variable, and vice versa
* is oftentimes used to determine graphically, whether the data follows any known distribution like the normal distribution (by plotting these known distributions against the data)
* take a look at these [typical qqplot results](https://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot) and the respective interpretations regarding [skewness and kurtosis](#tail).
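A qqplot against the normal distribution can be created with `scipy.stats.probplot`. The sketch below compares a synthetic normal sample with a synthetic right-skewed sample (both made up for this example). Besides the plot, `probplot` also returns the correlation coefficient `r` of the fitted line, which is close to 1 if the points follow the line.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(1)
normal_data = rng.normal(loc=0, scale=1, size=500)
skewed_data = rng.exponential(scale=1, size=500)

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
# probplot returns (theoretical quantiles, ordered data), (slope, intercept, r)
_, (_, _, r_normal) = stats.probplot(normal_data, dist='norm', plot=axes[0])
_, (_, _, r_skewed) = stats.probplot(skewed_data, dist='norm', plot=axes[1])
axes[0].set_title(f'normal sample (r={r_normal:.3f})')
axes[1].set_title(f'right-skewed sample (r={r_skewed:.3f})')
plt.tight_layout()
```

The points of the normal sample should stay close to the line, whereas the right-skewed sample bends away from it at the ends.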

<a id="sec6"></a>
***
<h1 style="background-color:DodgerBlue; color:white" >-> 6. Further Readings & Helpful Videos</h1>

I hope you noticed that I tried to add some resources for further reading. I may have already linked some of them above, but I want to emphasize the importance of these sources for this notebook:

https://www.youtube.com/watch?v=lUg0dRrlsoA

https://www.youtube.com/watch?v=vsKNxbP8R_8

https://www.youtube.com/watch?v=X4pWmkxEikM

https://www.goodreads.com/book/show/45832399-feature-engineering-and-selection