# Feature Engineering

Authors: [Alexandre Gramfort](http://alexandre.gramfort.net), [Thomas Moreau](https://tommoral.github.io/about.html), and [Pedro L. C. Rodrigues](https://plcrodrigues.github.io)
         
with some code snippets from [Olivier Grisel](http://ogrisel.com/) (leaf encoder)

Feature engineering is the most creative aspect of data science!

We will use the Titanic dataset here.

![title](titanic.jpg)

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
import seaborn as sns
df = sns.load_dataset("titanic")

In [None]:
df.head()

Let's look at the dtypes of the different columns. Note that the dataframe contains columns that
are explicitly marked as `category`.

In [None]:
df.info()

This allows you to do things like:

In [None]:
from sklearn.compose import make_column_selector
make_column_selector(dtype_include='category')(df)

to quickly get the names of the columns to treat as categorical.
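
For intuition, the same kind of selection can be sketched with plain pandas. The toy dataframe below is made up for illustration (it is not the Titanic data), and `select_dtypes` is used as a stand-in: this is not a claim about how `make_column_selector` is implemented.

```python
import pandas as pd

# Hypothetical toy dataframe with one categorical and two numeric columns
toy = pd.DataFrame({
    "deck": pd.Categorical(["A", "B", "A"]),
    "age": [22.0, 38.0, 26.0],
    "fare": [7.25, 71.28, 7.92],
})

# Names of the columns whose dtype is `category`
categorical_cols = toy.select_dtypes(include="category").columns.tolist()
print(categorical_cols)  # ['deck']
```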

As you can see, the data contains both quantitative and categorical variables. These categorical variables have some predictive power:

In [None]:
sns.catplot(data=df, x='pclass', y='survived', hue='sex', kind='bar')

The question is: how do we feed these non-quantitative features to a supervised learning model?

## 1) Categorical encoding

 - Categorical variables nearly always need some treatment
 - High cardinality can create very sparse data
 - Missing values are difficult to impute

### 1.1) One-Hot encoding

**Idea:** Each category is coded as a 0 or 1 in a dedicated column.

 - It is the most basic method and is used with most linear algorithms
 - Drop the first column to avoid collinearity
 - It uses a sparse format, which is memory-friendly
 - Most current implementations don’t gracefully handle missing or unseen categories

Example with the `embarked` column. We have here 3 categories:

In [None]:
df['embarked'].value_counts()

In [None]:
df1 = df[['embarked']]

In [None]:
df1.head(10)

Let's use a [scikit-learn OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html)

In [None]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df1.head(10)).toarray()

To know which column corresponds to what you can look at:

In [None]:
ohe.categories_

The first column is 1 if the category is 'C', and so on.

Now if we have missing values:

In [None]:
ohe = OneHotEncoder()
ohe.fit_transform(df1).toarray()

We now have 4 columns, one of which corresponds to NaNs:

In [None]:
ohe.categories_

As the columns are linearly dependent after one-hot encoding, you can drop one column with:

In [None]:
OneHotEncoder(drop='first').fit_transform(df1.head(10)).toarray()

This avoids collinearity, which can, for example, slow down optimization solvers.
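
On the point about unseen categories: scikit-learn's `OneHotEncoder` raises an error by default when `transform` meets a category that was not present during `fit`. A minimal sketch of one way around this, using the `handle_unknown="ignore"` option on toy data (the value `'X'` is a made-up port):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on the three ports seen during training
train = np.array([["C"], ["Q"], ["S"]])
encoder = OneHotEncoder(handle_unknown="ignore").fit(train)

# 'X' was never seen during fit: it is encoded as a row of zeros
encoded = encoder.transform(np.array([["S"], ["X"]])).toarray()
print(encoded)
```

Whether an all-zero row is an acceptable encoding for unknown categories depends on the downstream model, so treat this as one option rather than a default.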

### 1.2) Ordinal encoding

**Idea:** Each category is coded with a different integer, the order being **arbitrary**.

 - Give every category a unique numerical ID
 - Useful for non-linear tree-based algorithms (forests, gradient-boosting)
 - Does not increase dimensionality

In [None]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder()
oe.fit_transform(df1.head(10))

In [None]:
oe.categories_

This means that 'C' will be coded as 0, 'Q' as a 1 and 'S' as a 2.
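
When a variable does have a natural order, the arbitrary (sorted) order can be replaced by an explicit one through the `categories` parameter. The sketch below uses a made-up ticket-class column; the labels and their ordering are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical ordered variable: the list order defines the integer codes
order = [["Third", "Second", "First"]]
X_class = np.array([["First"], ["Third"], ["Second"], ["Third"]])

codes = OrdinalEncoder(categories=order).fit_transform(X_class)
print(codes.ravel())  # [2. 0. 1. 0.]
```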

### 1.3) Count encoding

**Idea:** Replace categorical variables with their count in the train set

- Useful for both linear and non-linear algorithms
- Can be sensitive to outliers
- A log-transform can be added; it works well with counts
- Replace unseen categories with `1`
- May give collisions: same encoding, different categories

You'll need to install the `category_encoders` package with:

    pip install category_encoders

In [None]:
!pip install category_encoders

In [None]:
import category_encoders as ce

In [None]:
ce.__version__

In [None]:
df1.head(10)

In [None]:
ce.CountEncoder().fit_transform(df1.head(10)).values

'S' is replaced by 7 as it appears 7 times in the fitted data, etc.
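
To make the mechanics explicit, here is a minimal pandas sketch of count encoding, including the fallback value `1` for unseen categories and the optional log-transform mentioned in the list above. The toy series is chosen so that 'S' appears 7 times, as in the example above; the unseen value `'X'` is made up.

```python
import numpy as np
import pandas as pd

# Toy training values: 'S' appears 7 times, 'C' twice, 'Q' once
train = pd.Series(["S", "C", "S", "S", "S", "Q", "S", "S", "S", "C"])
counts = train.value_counts()

# Replace each category by its count in the training data
train_encoded = train.map(counts)

# Categories unseen at fit time ('X' here) get the fallback value 1
new = pd.Series(["S", "X", "Q"])
new_encoded = new.map(counts).fillna(1).astype(int)

# Optional log-transform to compress large counts
new_log = np.log(new_encoded)
print(new_encoded.tolist())  # [7, 1, 1]
```

Note that the unseen `'X'` and the rare `'Q'` end up with the same code, which is an instance of the collision problem listed above.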

### 1.4) Label / Ordinal count encoding

**Idea:** Rank categorical variables by count and use this rank as the encoding value. It is an ordinal encoding where the order is taken from the frequency of each category.

- Useful for both linear and non-linear algorithms
- Not sensitive to outliers
- Won’t give same encoding to different variables
- Best of both worlds

As it is not available in any package we will implement this ourselves:

In [None]:
from sklearn.preprocessing import OrdinalEncoder

class CountOrdinalEncoder(OrdinalEncoder):
    """Encode categorical features as an integer array
    using count information.
    """
    def __init__(self, categories='auto', dtype=np.float64):
        self.categories = categories
        self.dtype = dtype

    def fit(self, X, y=None):
        """Fit the OrdinalEncoder to X.

        Parameters
        ----------
        X : array-like, shape [n_samples, n_features]
            The data to determine the categories of each feature.

        Returns
        -------
        self
        """
        self.handle_unknown = 'use_encoded_value'
        self.unknown_value = np.nan
        super().fit(X)
        X_list, _, _ = self._check_X(X)
        # now we'll reorder by counts
        for k, cat in enumerate(self.categories_):
            counts = []
            for c in cat:
                counts.append(np.sum(X_list[k] == c))
            order = np.argsort(counts)
            self.categories_[k] = cat[order]
        return self

coe = CountOrdinalEncoder()
coe.fit_transform(pd.DataFrame(df1.head(10)))

'S' is replaced by 2 as it's the most frequent, then 'C' is 1 and 'Q' is 0.

This encoding is robust to collisions, which can happen with the CountEncoder when several categories occur the same number of times. Example:

In [None]:
coe.fit_transform(pd.DataFrame(['es', 'fr', 'fr', 'en', 'en', 'es']))

vs.

In [None]:
ce.CountEncoder().fit_transform(pd.DataFrame(['es', 'fr', 'fr', 'en', 'en', 'es']))
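
The same idea can also be sketched with pandas alone, without relying on scikit-learn internals. The helper name `count_rank_encode` is hypothetical, and breaking ties by label is an assumption made here to keep the result deterministic; it is not necessarily the tie-breaking rule of the class above.

```python
import pandas as pd

def count_rank_encode(train, values):
    """Map each category to its rank by training count (0 = rarest)."""
    counts = train.value_counts()
    # Sort by count, then by label, so that ties still get distinct ranks
    ordered = sorted(counts.index, key=lambda c: (counts[c], c))
    ranks = {category: rank for rank, category in enumerate(ordered)}
    # Categories absent from `train` are mapped to NaN
    return values.map(ranks)

langs = pd.Series(["es", "fr", "fr", "en", "en", "es"])
print(count_rank_encode(langs, langs).tolist())  # [1, 2, 2, 0, 0, 1]
```

All three languages occur twice, yet each one receives its own code, whereas a plain count encoding would map all of them to 2.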

### 1.5) Hash encoding

**Idea:** Perform a one-hot-like encoding with arrays of a fixed length.

- Avoids extremely sparse data
- May introduce collisions
- Can be repeated with different hash functions, bagging the results for a small bump in accuracy
- Collisions usually degrade results, but may improve them.
- Gracefully deals with new categories (e.g. new user-agents)

In [None]:
df1.head(10)

In [None]:
ce.hashing.HashingEncoder(n_components=4).fit_transform(df1.head(10).values)
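
The hashing trick itself fits in a few lines. The sketch below is a simplified illustration with a hypothetical helper `hash_encode`; it is not meant to reproduce the exact columns returned by `HashingEncoder`, only the principle: hash the category, then take the result modulo the number of output columns.

```python
import hashlib
import numpy as np

def hash_encode(values, n_components=4):
    """One-hot-like encoding into a fixed number of columns via hashing."""
    out = np.zeros((len(values), n_components), dtype=int)
    for row, value in enumerate(values):
        # md5 gives a hash that is stable across Python sessions
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        out[row, int(digest, 16) % n_components] = 1
    return out

encoded = hash_encode(["S", "C", "S", "Q"], n_components=4)
print(encoded)
```

No fitting step is needed, so a category never seen before is encoded like any other; the price is that two different categories may land in the same column.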

### 1.6) Target encoding

Encode each category by the average target value observed for that category (binary classification or regression).

The formula reads:

$$
    \text{TE}(X) = \alpha\big(n(X)\big) E[ y | x=X ] +  \Big(1 - \alpha\big(n(X)\big)\Big) E[y]
$$

where $n(X)$ is the count of category $X$ and $\alpha$ is a monotonically increasing function bounded between 0 and 1 [1].

- Add smoothing to avoid setting variable encodings to 0.

[1] Micci-Barreca, 2001: A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems.
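
As a worked example of the formula, the sketch below uses $\alpha(n) = n / (n + m)$, which is one common choice and an assumption here, not necessarily the function used by any particular library. The smoothing strength `m = 1` is likewise arbitrary.

```python
import pandas as pd

data = pd.DataFrame({
    "x": ["A", "B", "C", "A", "B", "B"],
    "y": [1, 1, 1, 0, 0, 1],
})
m = 1.0                    # hypothetical smoothing strength
prior = data["y"].mean()   # E[y] = 4/6

# Per-category mean E[y | x] and count n(x)
stats = data.groupby("x")["y"].agg(["mean", "count"])

# alpha(n) = n / (n + m): increasing in n and bounded between 0 and 1
alpha = stats["count"] / (stats["count"] + m)
te = alpha * stats["mean"] + (1 - alpha) * prior

encoded = data["x"].map(te)
print(te.round(3))
```

With these numbers, 'A' is encoded as 5/9, 'B' as 2/3 and 'C' as 5/6: the single-sample category 'C' is pulled from its raw mean of 1 towards the global mean of 2/3.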

You will need the [dirty cat](https://pypi.org/project/dirty-cat/) package (the project has since been renamed `skrub`, which the cell below installs for section 1.10). You can install it with:

    pip install dirty_cat

In [None]:
!pip install git+https://github.com/skrub-data/skrub.git -U

In [None]:
import dirty_cat as dc  # install with: pip install dirty_cat

X = np.array(['A', 'B', 'C', 'A', 'B', 'B'])[:, np.newaxis]
y = np.array([1  , 1  , 1  , 0  , 0  , 1])

dc.TargetEncoder(clf_type='binary-clf').fit_transform(X, y)
# If \alpha was 1 you would get: [0.5, 0.66, 1, 0.5, 0.66, 0.66]

In [None]:
from sklearn.preprocessing import TargetEncoder
te = TargetEncoder(target_type='binary', cv=2, smooth=1)
te.fit(X, y)
te.transform(X)

### 1.7) NaN encoding

It is quite frequent in real life that **the fact that a variable is missing
has some predictive power**. For example, in the Titanic dataset the 'deck'
variable is very often missing, typically for passengers who did not have a
proper cabin and who were therefore more likely to die.

To inform your supervised model, you can explicitly encode the missingness
with a dedicated column.

You can do this with a [SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)

In [None]:
from sklearn.impute import SimpleImputer

X = np.array([0, 1., np.nan, 2., 0.])[:, None]
SimpleImputer(strategy='median', add_indicator=True).fit_transform(X)

or [MissingIndicator](https://scikit-learn.org/stable/modules/generated/sklearn.impute.MissingIndicator.html)

In [None]:
from sklearn.impute import MissingIndicator

X = np.array([0, 1., np.nan, 2., 0.])[:, None]
MissingIndicator().fit_transform(X)
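
For intuition, the two steps (impute, then append a missingness flag) can be written by hand with NumPy. This is a sketch of the idea on the same toy array, not a description of scikit-learn's implementation.

```python
import numpy as np

X = np.array([0, 1., np.nan, 2., 0.])[:, None]

# True where the value is absent
missing = np.isnan(X)

# Replace NaNs by the median of the observed values (0.5 here)
filled = np.where(missing, np.nanmedian(X), X)

# Append the indicator as an extra column
X_with_indicator = np.hstack([filled, missing.astype(float)])
print(X_with_indicator)
```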

### 1.8) Polynomial encoding

**Idea:** Encode interactions between categorical variables

- Linear algorithms without interactions cannot solve the XOR problem
- A polynomial kernel *can* solve XOR

![title](xor.jpg)

In [None]:
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
X

In [None]:
from sklearn.preprocessing import PolynomialFeatures
PolynomialFeatures(include_bias=False, interaction_only=True).fit_transform(X)
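
To see why the interaction term helps, note that XOR can be written exactly as $x_1 + x_2 - 2 x_1 x_2$, which is linear once $x_1 x_2$ is available as a feature. A small NumPy check (the weight vector `w` is picked by hand for illustration, not learned):

```python
import numpy as np

X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
xor = np.logical_xor(X[:, 0], X[:, 1]).astype(int)

# Append the interaction term x1 * x2 as a third feature
X_inter = np.c_[X, X[:, 0] * X[:, 1]]

# With the interaction, XOR is an exact linear function of the features
w = np.array([1, 1, -2])
print(X_inter @ w)  # [1 0 1 0]
```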

### 1.9) To go beyond

You can also use some form of embedding, e.g. using a neural network to create dense embeddings from categorical variables.

- Map categorical variables in a function approximation problem into Euclidean spaces
- Faster model training.
- Less memory overhead.
- Can give better accuracy than one-hot encoding.
- See for example https://arxiv.org/abs/1604.06737

### 1.10) Playing with `skrub`

The `TableVectorizer` class from `skrub` **automatically** chooses the encoding that seems most appropriate for each feature in the dataframe. But beware: it is always important to understand which encodings are actually applied (and what they mean), to make sure the package has not made any bizarre choices.

In [None]:
from skrub import TableVectorizer
tv = TableVectorizer()
df_tv = tv.fit_transform(df)
for ti in tv.transformers_:
    print(ti)

## 2) Binning

See https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization_classification.html

[KBinsDiscretizer](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.KBinsDiscretizer.html) allows you to fit a model that is non-linear in the original feature space while only using a linear logistic regression.

See this [example in regression](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_discretization.html).

What it does:

In [None]:
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.RandomState(42)
X = rng.randn(10, 2)
X

In [None]:
KBinsDiscretizer(n_bins=2).fit_transform(X).toarray()
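
Roughly speaking, quantile binning places the bin edges at equally spaced quantiles of each feature and then one-hot encodes the bin index. The NumPy sketch below illustrates this on a single made-up feature; it is a simplified picture and may differ from `KBinsDiscretizer` in edge cases.

```python
import numpy as np

x = np.array([3.0, 1.0, 4.0, 1.5, 9.0, 2.6])
n_bins = 2

# Bin edges at equally spaced quantiles (here: min, median, max)
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))

# Index of the bin each value falls into, using the inner edges only
bin_idx = np.digitize(x, edges[1:-1])

# One-hot encode the bin index
one_hot = np.eye(n_bins)[bin_idx]
print(bin_idx)  # [1 0 1 0 1 0]
```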

In [None]:
colors = ['C' + str(i) for i in range(4)]
A = np.array([[1, 0], [0, 0.5]])
X = rng.randn(2048, 2) @ A + np.c_[np.zeros(2048), np.ones(2048)]
Xbins = KBinsDiscretizer(n_bins=4, encode='ordinal').fit_transform(X)
fig, ax = plt.subplots(figsize=(12,5), ncols=2, sharey=True)
for i, axi in enumerate(ax):
    bins, edges = np.histogram(X[:,i], bins=32)
    edges = (edges[1:] + edges[:-1])/2
    dx = edges[1] - edges[0]
    pdf = bins / (np.sum(bins) * dx)
    axi.plot(edges, pdf)
    axi.set_xlim(-4, +4)
    for j in range(4):
        xj = X[Xbins[:, i] == j, i]
        # one marker per sample in bin j, without assuming equal bin sizes
        axi.scatter(xj, np.ones_like(xj), c=colors[j])

In [None]:
X[Xbins[:,i] == j,i].shape

## 3) Scaling


Scale numerical variables into a certain range

- Standard (Z) Scaling
- MinMax Scaling
- Root scaling
- Log scaling

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

rng = np.random.RandomState(42)
X = 10 + rng.randn(10, 1)
print(X)
print(X.mean(), X.var())

In [None]:
Xsc = StandardScaler().fit_transform(X)
print(Xsc)
print(Xsc.mean(), Xsc.var())

In [None]:
MinMaxScaler().fit_transform(X)

In [None]:
from sklearn.preprocessing import FunctionTransformer

X = np.arange(1, 10)[:, np.newaxis]
FunctionTransformer(func=np.log).fit_transform(X)
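
The four scalings listed at the start of this section reduce to one-line formulas. A NumPy sketch on a made-up array:

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0, 16.0, 100.0])

z = (x - x.mean()) / x.std()               # standard scaling: mean 0, variance 1
mm = (x - x.min()) / (x.max() - x.min())   # min-max scaling: range [0, 1]
root = np.sqrt(x)                          # root scaling (requires x >= 0)
log = np.log(x)                            # log scaling (requires x > 0)

print(z.mean(), z.var())
print(mm.min(), mm.max())
```

The root and log transforms compress large values such as the outlier 100 far more than the two linear rescalings do.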

In [None]:
import pandas as pd
df_earnings = pd.read_csv('./earnings.csv')
df_earnings = df_earnings.dropna()
y = df_earnings['earn']/1000
y = y[y > 0]
ylog = np.log(y)
fig, ax = plt.subplots(figsize=(10,4), ncols=2)
for axi, yplot in zip(ax, [y, ylog]):
    sns.kdeplot(yplot, ax=axi, bw_adjust=1.5)

## 4) Leaf coding

The following is an implementation of a trick found in:

Practical Lessons from Predicting Clicks on Ads at Facebook
Junfeng Pan, He Xinran, Ou Jin, Tianbing XU, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, Joaquin Quiñonero Candela
International Workshop on Data Mining for Online Advertising (ADKDD)

https://research.fb.com/wp-content/uploads/2016/11/practical-lessons-from-predicting-clicks-on-ads-at-facebook.pdf

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelBinarizer
from scipy.sparse import hstack


class TreeTransform(BaseEstimator, TransformerMixin):
    """One-hot encode samples with an ensemble of trees

    This transformer first fits an ensemble of trees (e.g. gradient
    boosted trees or a random forest) on the training set.

    Then each leaf of each tree in the ensemble is assigned a fixed
    arbitrary feature index in a new feature space. If you have 100
    trees in the ensemble and 2**3 leaves per tree, the new feature
    space has 100 * 2**3 == 800 dimensions.

    Each sample of the training set goes through the decisions of each tree
    of the ensemble and ends up in one leaf per tree. The sample is encoded
    by setting the features for those leaves to 1 and leaving the other
    feature values at 0.

    The resulting transformer learns a supervised, sparse, high-dimensional
    categorical embedding of the data.

    This transformer is typically meant to be pipelined with a linear model
    such as logistic regression, linear support vector machines or
    elastic net regression.
    """
    def __init__(self, estimator):
        self.estimator = estimator
        
    def fit(self, X, y):
        self.fit_transform(X, y)
        return self
        
    def fit_transform(self, X, y):
        self.estimator_ = clone(self.estimator)
        self.estimator_.fit(X, y)
        self.binarizers_ = []
        sparse_applications = []
        estimators = np.asarray(self.estimator_.estimators_).ravel()
        for t in estimators:
            lb = LabelBinarizer(sparse_output=True)
            X_leafs = t.tree_.apply(X.astype(np.float32))
            sparse_applications.append(lb.fit_transform(X_leafs))
            self.binarizers_.append(lb)
        return hstack(sparse_applications)
        
    def transform(self, X, y=None):
        sparse_applications = []
        estimators = np.asarray(self.estimator_.estimators_).ravel()
        for t, lb in zip(estimators, self.binarizers_):
            X_leafs = t.tree_.apply(X.astype(np.float32))
            sparse_applications.append(lb.transform(X_leafs))
        return hstack(sparse_applications)


boosted_trees = GradientBoostingClassifier(
    max_leaf_nodes=5, learning_rate=0.1,
    n_estimators=10, random_state=0,
)

from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

TreeTransform(boosted_trees).fit_transform(X, y)
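
A similar encoding can be sketched with public scikit-learn APIs only, using the ensemble's `apply` method followed by a `OneHotEncoder`, and then a linear model on top as the docstring suggests. This is an alternative illustration, not the implementation above; the variable names are made up and the score is computed on the training data, so it says nothing about generalization.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

X_iris, y_iris = load_iris(return_X_y=True)

gbc = GradientBoostingClassifier(
    max_leaf_nodes=5, n_estimators=10, random_state=0
).fit(X_iris, y_iris)

# Leaf index reached in each tree: shape (n_samples, n_estimators, n_classes)
leaves = gbc.apply(X_iris)
leaves_2d = leaves.reshape(len(X_iris), -1)

# One-hot encode the leaf indices, one group of columns per tree
X_leaves = OneHotEncoder(handle_unknown="ignore").fit_transform(leaves_2d)

# Linear model on top of the sparse leaf features
leaf_lr = LogisticRegression(max_iter=1000).fit(X_leaves, y_iris)
train_accuracy = leaf_lr.score(X_leaves, y_iris)
print(X_leaves.shape, train_accuracy)
```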

<div class="alert alert-success">
    <b>EXERCISE</b>:
     <ul>
      <li>
      Limiting yourself to LogisticRegression, propose features to predict survival.
      </li>
    </ul>
</div>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

df = sns.load_dataset("titanic")
y = df.survived.values
X = df.drop(['survived', 'alive'], axis=1)

In [None]:
X.head()

In [None]:
lr = LogisticRegression(solver='lbfgs')
ct = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ['age', 'pclass', 'fare']),
)
clf = make_pipeline(ct, lr)
np.mean(cross_val_score(clf, X, y, cv=10))

What if we consider a one-hot encoding for the sex of the passenger too?

In [None]:
from sklearn.preprocessing import OneHotEncoder
lr = LogisticRegression(solver='lbfgs')
ct = make_column_transformer(
    (make_pipeline(SimpleImputer(), StandardScaler()), ['age', 'pclass', 'fare']),
    (OneHotEncoder(),['sex'])
)
clf = make_pipeline(ct, lr)
np.mean(cross_val_score(clf, X, y, cv=10))

### Now do better!

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
import seaborn as sns

# load Titanic dataset
df = sns.load_dataset("titanic")

# use a simple imputer
si = SimpleImputer(strategy='mean')

# instantiate the TableVectorizer with imputer
tv = TableVectorizer(numerical_transformer=si)

# choose a scaler so logistic regression is happy
sc = StandardScaler()

# choose logreg as classifier
lr = LogisticRegression(solver='lbfgs')

# make pipeline
clf = make_pipeline(tv, sc, lr)

# print cross val score for classification
np.mean(cross_val_score(clf, X, y, cv=10))

In [None]:
Xtv = tv.fit_transform(X, y)
print(X.shape)
print(Xtv.shape)
tv.transformers_