# TOC

+ EDA
+ Missing Values Imputation
+ Features Preprocessing
+ Features Generation
+ Dimensionality Reduction
+ Competition-specific Steps


# I. EDA

This process helps understand the data, build intuition about it, generate hypotheses and find insights. It can also be used to check whether the train and test sets have similar populations (men vs women, etc.). It is also useful to compare the distribution of each feature between the training and test sets; if they are very different, we need to find ways to make them match or exclude the feature entirely.
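
As a rough sketch of the train/test comparison described above: the DataFrames `train` and `test`, the helper `compare_distributions` and the choice of a two-sample Kolmogorov-Smirnov test are all assumptions made for illustration, and the data is synthetic.

```python
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# synthetic data: 'age' is drawn from the same distribution, 'income' is shifted in test
train = pd.DataFrame({'age': rng.normal(40, 10, 1000), 'income': rng.normal(50, 5, 1000)})
test = pd.DataFrame({'age': rng.normal(40, 10, 1000), 'income': rng.normal(60, 5, 1000)})

def compare_distributions(train, test):
    """Return the KS statistic and p-value of each column shared by both sets."""
    rows = []
    for col in train.columns.intersection(test.columns):
        stat, pvalue = ks_2samp(train[col].dropna(), test[col].dropna())
        rows.append({'feature': col, 'ks_stat': stat, 'p_value': pvalue})
    return pd.DataFrame(rows).set_index('feature')

report = compare_distributions(train, test)
print(report.sort_values('ks_stat', ascending=False))
```

Features with a large KS statistic are the first candidates for a closer look with histograms.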

**individual features**
+ numerical summary / statistics.
+ histogram. 
+ plotting row index vs value.

**pairs of features**
+ scatter plots. A few interesting tweaks:
    + add a color for each class (classification) or match point size with outcome value (regression).
    + overlapping test set with train set to see if values match.
+ scatter matrix.

**groups of features**
+ correlation matrix. Running [K-means clustering](https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html) can help group related features together.
+ plotting column index vs statistics value (like mean) of each feature.

**impact on target variable**
+ scatterplot.
+ binning.

It can be helpful to generate new features based on each of these groups of features.

In [None]:
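import pandas as pd
import matplotlib.pyplot as plt

# df values (df is the dataset, x one of its columns, y the target)
df.describe()
x.value_counts()
x.isnull()

# plot values vs index
plt.plot(x, '.')
plt.scatter(range(len(x)), x, c=y)

# correlation (scatter plots, correlation matrix)
pd.plotting.scatter_matrix(df)  # pd.scatter_matrix no longer exists in recent pandas versions
df.corr()
plt.matshow(...)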

# II. Missing Values Imputation

The scikit-learn documentation has a section dedicated to [missing values imputation](https://scikit-learn.org/stable/modules/impute.html).

Missing values are sometimes not loaded as NaN: they might have been replaced by a single value that is completely out of the range taken by the rest of the values. These cases can be found by plotting a histogram.

Once identified, missing values can be imputed in a few ways:
+ inferred. This method should be handled with caution, especially when using inferred values to generate a new feature.
+ use a single value outside the feature's value range (-1, -999, etc.). This can be used as a separate category but will penalize non tree-based models.
+ use the mean or the median. This works well for non-tree based methods but tree-based models won't be able to easily create a split for missing values.

An option is to add a new binary feature to flag rows that had missing values, then use either mean or median. The downside is that this method will double the number of features in the dataset.
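
A minimal sketch of the flag-then-impute option above, on a toy column. The sentinel value `-999` and the column names are assumptions; in practice the median would be computed on the training set and reused on the test set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'age': [25.0, -999.0, 40.0, 31.0, -999.0]})

# replace the out-of-range sentinel (assumed here to be -999) by NaN
df['age'] = df['age'].replace(-999.0, np.nan)

# flag rows that had a missing value, then impute with the median
df['age_was_missing'] = df['age'].isnull().astype(int)
df['age'] = df['age'].fillna(df['age'].median())
```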

For categorical data, we can use frequency encoding to highlight categories that are in the test set but not the training set.

# III. Features Preprocessing

Features preprocessing & generation pipelines depend on the model type. A few examples:

+ A categorical feature that happens to be stored as numerical will not perform well with a linear model if the relation with the outcome is non-linear. In this case, one-hot encoding will perform better. But this preprocessing step is not required to fit a random forest.
+ Forecasting a linear trend will work well with a linear model, but a tree-based approach will not create splits for unseen dates and might perform poorly.

The scikit-learn documentation has a section dedicated to [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html).

## III.1. Numeric Features

Tree-based models are not impacted by feature scales nor outliers.

**Feature Scale (non-tree based)**

Non tree-based models (KNN, linear models & NN) are strongly impacted by the scale of each feature:
+ KNN: predictions are based on distances, so they will vary significantly depending on the scale of each feature.
+ Linear models & NN: 
    + regularization impact is proportional to feature scale. It will work best if we can apply it to each coefficient in equal amounts.
    + gradient descent methods don't work well without features scaling.

The easiest way to deal with this issue is to rescale all features to the same scale:
+ `sklearn.preprocessing.MinMaxScaler` scale to \[0, 1\]: $X = (X - X.min()) / (X.max() - X.min())$. The distribution of values doesn't change.
+ `sklearn.preprocessing.StandardScaler` scale to mean=0 and std=1: $X = (X - X.mean()) / X.std()$. 
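
The two formulas can be sketched with plain NumPy as follows. The arrays are toy data; the point of the example is that the scaling statistics come from the training set and are reused on the test set. NumPy's default `std` is the population standard deviation, which is also what `StandardScaler` uses.

```python
import numpy as np

X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 200.0]])

# min-max scaling: statistics are computed on the training set only
x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
X_train_mm = (X_train - x_min) / (x_max - x_min)
X_test_mm = (X_test - x_min) / (x_max - x_min)

# standardization: same principle with the mean and standard deviation
mean, std = X_train.mean(axis=0), X_train.std(axis=0)
X_train_std = (X_train - mean) / std
X_test_std = (X_test - mean) / std
```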

_Note: Different scalings result in different model quality: it is another hyperparameter you need to optimize._

_Note: when using KNN, we can optimize the scaling parameters for certain features in order to boost their impact on predictions._

An analysis of when to use min-max vs standardization can be found [here](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html).

**Outliers [Winsorizing](https://en.wikipedia.org/wiki/Winsorizing) (linear models)**

Outliers (for both features and target values) can impact linear models significantly. Clipping feature values between some lower and upper bounds (like 1st and 99th percentiles) can mitigate this issue. This method is frequently used with financial data and is called winsorization.
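
A short sketch of this clipping with NumPy, on synthetic data. The 1st and 99th percentiles follow the text; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = np.append(rng.normal(0, 1, 1000), [50.0, -40.0])  # two extreme outliers
x_test = np.array([-100.0, 0.0, 100.0])

# bounds are estimated on the training set, then reused on the test set
lower, upper = np.percentile(x_train, [1, 99])
x_train_clipped = np.clip(x_train, lower, upper)
x_test_clipped = np.clip(x_test, lower, upper)
```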


**Rank Transformation (non-tree based)**

Rank transformation sets the space between values to be equal. A quick way of handling outliers is to use the values indices instead of their values (see `scipy.stats.rankdata`). The transformation used in the training set needs to be applied to test and validation sets.
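
As an illustration, `scipy.stats.rankdata` gives the ranks of the training values. Reusing the mapping on new data by interpolating between sorted training values is one possible approach, assumed here for the sake of the example.

```python
import numpy as np
from scipy.stats import rankdata

x_train = np.array([1.0, 2.0, 3.0, 1000.0])   # 1000 is an outlier
x_test = np.array([2.5, 5000.0])

# ranks of the training values: the outlier ends up one step away from its neighbour
train_ranks = rankdata(x_train)

# reuse the training mapping on new values by interpolating between sorted training values
# (values beyond the training range are clipped to the lowest/highest rank)
sorted_train = np.sort(x_train)
test_ranks = np.interp(x_test, sorted_train, np.arange(1, len(sorted_train) + 1))
```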

**Log/Sqrt Transforms (non-tree based)**

Applying `np.log(1+x)` or `np.sqrt(x + 2/3)` to a feature can benefit all non tree-based models, especially NN, as they:

+ bring extreme values of the feature closer to the average.
+ make values close to zero more easily distinguishable.
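
A quick way to see the first effect on a toy array is shown below. `np.log1p(x)` computes the same quantity as `np.log(1 + x)`.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 10000.0])

x_log = np.log1p(x)            # same as np.log(1 + x)
x_sqrt = np.sqrt(x + 2 / 3)

# ratio between the largest value and the median, before and after each transform
ratio_raw = x.max() / np.median(x)
ratio_log = x_log.max() / np.median(x_log)
ratio_sqrt = x_sqrt.max() / np.median(x_sqrt)
print(ratio_raw, ratio_log, ratio_sqrt)
```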

**Grouping Preprocessings**

Training a model on concatenated dataframes, each having gone through different preprocessings, can sometimes yield great results. Another option is to mix models trained on differently preprocessed data.



## III.2. Categorical & Ordinal Data

_Reminder: "ordinal" means "ordered categorical" (star ratings, level of education, etc.). We cannot be sure that the difference between values is always constant (the difference between four and five stars might be smaller than between two and three stars)._

**Label Encoding (tree based)**

Label encoding maps categories into numbers. This method works well with tree-based models: they can split the feature and extract the most useful values.

+ `sklearn.preprocessing.LabelEncoder`: apply the encoding in sorted order.
+ `pandas.factorize`: apply the encoding by order of appearance.
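
The difference between the two orderings can be seen with `pandas.factorize` alone. Passing `sort=True` is used here to mimic the sorted order that `LabelEncoder` produces; the series is toy data.

```python
import pandas as pd

s = pd.Series(['medium', 'low', 'high', 'low', 'medium'])

# order of appearance: 'medium' -> 0, 'low' -> 1, 'high' -> 2
codes_appearance, uniques_appearance = pd.factorize(s)

# sorted order: 'high' -> 0, 'low' -> 1, 'medium' -> 2
codes_sorted, uniques_sorted = pd.factorize(s, sort=True)
```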

**Frequency Encoding (tree based)**

Frequency encoding (see below) uses the frequency of each value as key.

_Note: if two categories have the same frequency, they won't be distinguishable after frequency encoding alone. In this case, a ranking operation will help._


In [None]:
# frequency encoding
encoding = df.groupby('feature').size() # number of occurrences by value
encoding = encoding / len(df)           # frequency of each value
df['enc_feature'] = df['feature'].map(encoding)


**One-Hot Encoding**

One-Hot Encoding creates a new boolean column for each value (it is by definition already min-max-scaled):
+ `sklearn.preprocessing.OneHotEncoder`
+ `pandas.get_dummies`

_Note: tree-based methods will struggle to use numeric features efficiently if there are too many binary columns created via one-hot encoding (they will not be selected in enough random splits)._

_Note: one-hot encoding features with a large number of categories will create many binary features that have few non-zero values. In these cases, sparse matrices will have much better performance because they only store non-zero values in memory._
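
A small sketch with `pandas.get_dummies`, in dense and sparse form. The `city` column is toy data.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Paris', 'Lyon', 'Paris', 'Nice']})

# dense one-hot encoding: one column per category
dense = pd.get_dummies(df['city'], prefix='city')

# sparse variant: only non-zero entries are stored in memory
sparse = pd.get_dummies(df['city'], prefix='city', sparse=True)
```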


## III.3. Target Mean Encoding

Mean encoding takes the mean of the target for each category of the feature in the training set. It ranks categories by target mean value and makes it easier for algorithms to use the feature. By contrast, label encoding orders categories at random, which makes it harder for algorithms to use the feature.

More concretely, it allows tree-based algorithms to get the same level of performance with shorter trees, especially with high-cardinality categorical features that are typically hard to handle for tree-based algorithms (because many decision boundaries are required).

_Note: this method also works with regression problems._

When increasing the maximum depth of trees, we can expect our models to overfit the training set. If the performance also increases on the validation set, it means that our model needs a huge number of splits to extract information from some variables. In this case, mean encoding is likely to provide significant benefits to our model.

Mean encoding can be performed in several ways (for a binary target, P and N are the number of positive and negative rows in each category):
+ likelihood (target mean): P / (N + P)
+ weight of evidence: ln(P/N) * 100
+ count: P
+ diff: P - N
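
The four variants can be computed in one pass with a `groupby`. This is only a sketch on toy data with a binary target; the weight of evidence is undefined for categories where P or N is zero, a case not handled here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'city':   ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C'],
    'target': [1,   1,   1,   0,   1,   0,   0,   1,   1,   0],
})

# P = number of positive rows, N = number of negative rows, per category
stats = df.groupby('city')['target'].agg(P='sum', total='count')
stats['N'] = stats['total'] - stats['P']

encodings = pd.DataFrame({
    'likelihood': stats['P'] / (stats['N'] + stats['P']),
    'woe': np.log(stats['P'] / stats['N']) * 100,
    'count': stats['P'],
    'diff': stats['P'] - stats['N'],
})
```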


In [None]:
# mean encoding must be performed on the training set only
means = df_train.groupby(feature_col)['target'].mean()

# it is then applied to both training and validation sets
df_train[feature_col + '_mean_target'] = df_train[feature_col].map(means)
df_val[feature_col + '_mean_target'] = df_val[feature_col].map(means)


Naive mean encoding might lead to overfitting when, for instance, all observations of a given category have the same target value in the training set but not in the validation set. Regularization can help.

+ CV loop: split the data into folds and encode the validation part of each fold with means computed on the training part of that fold, so that the entire dataset ends up encoded. This technique leaks information from the validation set to the training set, but it doesn't have any major impact if the data is large enough compared to the number of folds. This is why it is recommended to only use a small CV of 4-5 folds.
+ smoothing: use mean encoding only for categories with a lot of rows (see formula below). Must be combined with another method like CV loop to prevent overfitting.
+ expanding mean: sort the data and only consider the [0..n-1] rows to calculate the mean encoding of row n. The feature quality is not always excellent, but we can compensate by averaging the predictions of models fitted on encodings calculated from different data permutations.

$$\mathrm{smoothing} = \frac{\mathrm{mean}(\mathrm{target}) * \mathrm{nrows} + \mathrm{global\_mean} * \alpha}{\mathrm{nrows} + \alpha}$$
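
The formula translates into a few lines of pandas. The value of `alpha` and the toy data are assumptions, and this sketch leaves out the CV loop that the text recommends combining it with.

```python
import pandas as pd

df_train = pd.DataFrame({
    'city':   ['A'] * 8 + ['B'] * 2,
    'target': [1, 1, 1, 1, 1, 1, 0, 0, 1, 1],
})
alpha = 5  # assumed value; a hyperparameter to tune

global_mean = df_train['target'].mean()
stats = df_train.groupby('city')['target'].agg(['mean', 'count'])

# categories with few rows are pulled towards the global mean
smoothed = (stats['mean'] * stats['count'] + global_mean * alpha) / (stats['count'] + alpha)
df_train['city_smoothed'] = df_train['city'].map(smoothed)
```

Category `B` has only two rows, so its encoding moves from its raw mean of 1.0 towards the global mean of 0.8.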

_Note: the expanding mean is built into the CatBoost library, which works great for categorical datasets._


## III.4. Feature Encoding

The mean is the only meaningful statistic we can extract from a classification variable. 

For multiclassification tasks, we will create n features: one mean encoding for each class. This has the advantage of giving tree-based algorithms information about other classes where typically they don't: they usually solve multiclassification as n problems of one versus all so every class has a different model.

For regression tasks, we have many more options: percentiles, standard deviation, etc. We can also use distribution bins: create x features that count the number of times the target variable is in the x-th distribution bin.
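
For a regression target, several statistics can be extracted per category at once. The column names and the 90th percentile are illustrative choices; the same overfitting precautions as for the plain mean encoding apply.

```python
import pandas as pd

df_train = pd.DataFrame({
    'city':  ['A', 'A', 'A', 'B', 'B', 'B'],
    'price': [100.0, 120.0, 500.0, 80.0, 90.0, 100.0],
})

# one new feature per statistic of the target within each category
stats = df_train.groupby('city')['price'].agg(
    price_mean='mean',
    price_median='median',
    price_std='std',
    price_p90=lambda s: s.quantile(0.9),
)
df_encoded = df_train.join(stats, on='city')
```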

For many-to-many combinations (like classification problems from apps installed on a user's phone), we can mean-encode based on each user-app combination (long-form representation), then merge the results into a vector for each user. We can apply various statistics to these vectors, like mean or standard deviation.

For time series, calculating some statistic of previous values can also help significantly.

It can also be useful to investigate numerical features that have many splits: try to mean-encode binned values where bins come from the splits. Another option is to mean-encode feature interactions between features that are frequently in neighboring nodes.




In [None]:
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# we assume that df is the available dataset (unique index, 'target' column),
# that feature_cols lists the columns to encode,
# and that df_new will include all our new features
df_new = pd.DataFrame(index=df.index)

# CV loop
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for tr_idx, val_idx in skf.split(df, df['target']):
    # split fold in train/val
    df_tr, df_val = df.iloc[tr_idx], df.iloc[val_idx]
    # loop over all feature columns to encode
    for feature_col in feature_cols:
        means = df_tr.groupby(feature_col)['target'].mean()
        # save the encoded validation rows to df_new
        df_new.loc[df_val.index, feature_col + '_mean_target'] = df_val[feature_col].map(means)

# fill missing values with the global target mean
prior = df['target'].mean()
df_new.fillna(prior, inplace=True)

# expanding mean (alternative to the CV loop; the first row of each category gets NaN)
cumsum = df.groupby(feature_col)['target'].cumsum() - df['target'] # cumsum of [0..n-1]
cumcnt = df.groupby(feature_col)['target'].cumcount()              # cumcnt of [0..n-1]
df_new[feature_col + '_mean_target'] = cumsum / cumcnt


# IV. Features Generation
## Common Methods


A good explanation of the benefits of features engineering, in addition to examples, can be found [here](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) and [here](https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering).

_Note: we can create a new feature that flags incorrect values if they appear to follow some pattern (as opposed to typos for instance)._

**Numerical**
+ combining several features into one via division or multiplication (area from width and length, price per square meter).
+ extracting the fractional part of prices (i.e. cents) to capture the psychology of how people perceive these prices.
+ flagging non-human behaviors (bots) by looking at patterns (non-rounded prices at auctions, constant intervals in messages sent on social media).
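
The first two ideas can be sketched in pandas as follows. The column names are hypothetical and the data is made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'price': [199.99, 250.00, 99.50],
    'width': [10.0, 8.0, 5.0],
    'length': [12.0, 10.0, 6.0],
})

# combine features via multiplication and division
df['area'] = df['width'] * df['length']
df['price_per_sqm'] = df['price'] / df['area']

# fractional part of the price, rounded to avoid floating-point noise
df['price_cents'] = np.round(df['price'] % 1, 2)
```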

**Categorical & Ordinal**
+ combining several categorical features into one (features interaction).

**Datetimes**
+ periodicity (day of week, day of year, day, month, season, year). Helps capture repetitive patterns in the data.
+ time passed since major event (days since last holidays, is_holiday boolean).
+ difference between dates.
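
A few of these datetime features, sketched with the pandas `.dt` accessor. The `holidays` calendar is a hypothetical two-date list used only for the example.

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2021-01-01', '2021-03-15', '2021-12-24'])})
holidays = pd.to_datetime(['2021-01-01', '2021-12-25'])  # hypothetical holiday calendar

# periodicity
df['day_of_week'] = df['date'].dt.dayofweek   # Monday=0
df['month'] = df['date'].dt.month
df['day_of_year'] = df['date'].dt.dayofyear

# time passed since major event
df['is_holiday'] = df['date'].isin(holidays)
df['days_since_holiday'] = df['date'].apply(
    lambda d: (d - holidays[holidays <= d].max()).days  # most recent holiday on or before d
)
```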

**Coordinates**
+ distance to major landmarks (shops, subway station).
+ distance to most remarkable data point of the dataset split into clusters (most expensive flat).
+ distance to the center of the dataset split into clusters.
+ aggregated statistics for the area around each point (number of flats in a given radius, mean price per square meter).
+ rotate coordinates so decision trees have decision boundaries closer to horizontal/vertical.
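
Two of these coordinate features are sketched below with NumPy. The example assumes planar (x, y) coordinates and a made-up landmark; latitude/longitude pairs would need a geographic distance instead of the Euclidean one.

```python
import numpy as np

coords = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 4.0]])   # (x, y) of each data point
landmark = np.array([0.0, 0.0])                           # hypothetical landmark

# Euclidean distance to the landmark
dist_to_landmark = np.linalg.norm(coords - landmark, axis=1)

# rotate coordinates by 45 degrees (counter-clockwise)
theta = np.deg2rad(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
coords_rot45 = coords @ rotation.T
```

The rotated columns can be added next to the original ones, so that a tree can split along either set of axes.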

**Text Feature Extraction**

Text (see [NLTK](https://www.nltk.org/)):
+ Lemmatization (use root word for all its variants and related words), stemming (truncate the end of words), stopwords removal, lowercase.
+ Bag of Words: count occurrences of each word of a given dictionary for each sentence (`sklearn.feature_extraction.text.CountVectorizer`).
+ Bag of Words TF-IDF (`sklearn.feature_extraction.text.TfidfVectorizer` - see code below).
+ N-grams (`sklearn.feature_extraction.text.CountVectorizer(ngram_range, analyzer)`).
+ Word2Vec Embedding: convert words into a vector. Operations are possible to extract additional meaning (king - man + woman = queen).


In [None]:
import numpy as np

# x is a dense bag-of-words count matrix (rows = texts, columns = words)

# bag of words - terms frequency (texts of different sizes become more comparable)
tf = 1 / x.sum(axis=1)[:, None] # inverse of the number of words per row
x = x * tf

# bag of words - inverse document frequency (boost more important features/words - that is, less frequent words)
idf = np.log(x.shape[0] / (x > 0).sum(axis=0)) # log of the inverse of the fraction of rows with the word
x = x * idf


## Statistics and Distance-Based Features

We can add new features based on relations between data points. For instance, we can calculate the number of web pages a specific user has visited during a specific session, or the minimum and maximum price for articles displayed on a specific page. In other words, we group some features by value then calculate summary statistics of other features for each of these values.
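
The two examples above map directly to `groupby(...).transform(...)`. The session log below is made up and its column names are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': [1, 1, 1, 2, 2],
    'page_id': [10, 11, 12, 10, 10],
    'price':   [5.0, 7.0, 20.0, 9.0, 5.0],
})

# number of distinct pages visited by each user
df['user_n_pages'] = df.groupby('user_id')['page_id'].transform('nunique')

# minimum and maximum price of the articles displayed on each page
df['page_min_price'] = df.groupby('page_id')['price'].transform('min')
df['page_max_price'] = df.groupby('page_id')['price'].transform('max')
```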

Another method, more flexible but harder to implement, is to calculate summary statistics of neighboring values. For instance, we can estimate the average price per square meter of a specific neighborhood by looking at the coordinates of each house in the dataset.


## Feature Interaction

This method combines two features into one to make interactions more explicit. This is especially useful for tree-based methods that would struggle to capture these interactions otherwise. 

+ For categorical variables, the method concatenates the values into one. 
+ For numerical variables, any operation that takes two arguments would work: sum, multiplication, division, etc.

Not all interactions are relevant. We can fit a tree-based method to a dataset with all possible interactions and use the ranking of features importance to keep the most useful ones only.

Lastly, tree-based models can use the index of each tree leaf to identify high-order interactions (i.e. branches with few leaves vs many leaves). The implementation is simple:
+ sklearn: `tree_model.apply()`.
+ xgboost: `booster.predict(pred_leaf=True)`.

An implementation example can be found [here](https://scikit-learn.org/stable/auto_examples/ensemble/plot_feature_transformation.html).
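
A compact sketch of the leaf-index idea with scikit-learn. The synthetic data, the random forest and the one-hot step are illustrative choices rather than the only way to use `apply()`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] * X[:, 1] > 0).astype(int)   # target driven by an interaction

forest = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)

# leaves[i, j] is the index of the leaf reached by sample i in tree j
leaves = forest.apply(X)

# each (tree, leaf) pair becomes a binary feature, usable by a linear model for instance
leaf_features = OneHotEncoder(handle_unknown='ignore').fit_transform(leaves)
```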


# V. Dimensionality Reduction

## Matrix Factorization

Matrix Factorization is a general approach to dimensionality reduction and feature extraction.

The most common methods of matrix factorization are [Singular Value Decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition) (SVD) and [Principal Component Analysis](https://en.wikipedia.org/wiki/Principal_component_analysis) (PCA). Other methods include TruncatedSVD for sparse matrices (like text classification) and Non-Negative Matrix Factorization (NMF) for counts-like data (like Bag-of-Words matrices).

More details about the different methods of matrix factorization can be found [here](https://scikit-learn.org/stable/modules/decomposition.html).


In [None]:
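import numpy as np
from sklearn.decomposition import PCA

# pca parameters
pca = PCA(n_components=5)

# pca done the wrong way: the second fit_transform refits PCA on the test set,
# so train and test end up projected on different components
X_train_pca = pca.fit_transform(X_train)
X_test_pca = pca.fit_transform(X_test)

# pca done the right way: fit once on all the data, then apply the same projection
X_all = np.concatenate([X_train, X_test])
pca.fit(X_all)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)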


## t-SNE

Matrix Factorization is a linear method of dimensionality reduction; t-SNE is a non-linear method that can be very powerful for visualization purposes. A few points to keep in mind: 
+ results depend heavily on the perplexity value.
+ due to its stochastic nature, t-SNE provides different projections every time it is run, which is why train and test sets should be projected together.
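
Projecting train and test together could look like the sketch below. The data is random and the perplexity value is arbitrary; scikit-learn's `TSNE` only offers `fit_transform`, so the two sets are concatenated before the projection and split afterwards.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
X_test = rng.normal(size=(40, 5))

# project train and test in a single run
X_all = np.concatenate([X_train, X_test])
embedding = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X_all)

# split the projection back into train and test parts
X_train_tsne = embedding[:len(X_train)]
X_test_tsne = embedding[len(X_train):]
```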

Additional resources:
+ More details about the different manifold learning methods can be found [here](https://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html).
+ An implementation example, with its code, can be found [here](https://scikit-learn.org/stable/auto_examples/manifold/plot_t_sne_perplexity.html#sphx-glr-auto-examples-manifold-plot-t-sne-perplexity-py). 
+ This [page](https://distill.pub/2016/misread-tsne/) illustrates the effects of various parameters and provides interactive plots to explore those effects.


# VI. Competition-specific Steps

A few extra steps can be taken during competitions:

+ We can remove all features that are either constant in the training set or a duplicate of another column (`df.T.drop_duplicates()`). For categorical features, we'll need to label encode them all by order of appearance.
+ If categorical values only exist in the test set, we need to test if these new values bring any useful information. One possibility is to compare the model performance on a validation set, for values in the training set vs previously unseen values. We might want to use a different model for new values if performance is low for them. 
+ Duplicated rows can either be mistakes or the result of having a key feature omitted from the dataset. Having identical rows in both train and test sets can help us understand how the data was generated.
+ When plotting rolling mean of values by index, we can check if there are some patterns that will indicate that the data was not properly shuffled.
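
The first step of the list (constant and duplicated columns) can be sketched as follows on a toy DataFrame. Non-numeric columns are label encoded by order of appearance first, so that two columns describing the same partition with different labels are detected as duplicates.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [1, 2, 3],          # duplicate of 'a'
    'c': [7, 7, 7],          # constant
    'd': ['x', 'y', 'x'],
    'e': ['u', 'v', 'u'],    # same partition as 'd' with different labels
})

# drop constant columns
df = df.loc[:, df.nunique(dropna=False) > 1].copy()

# label encode non-numeric columns by order of appearance
for col in df.columns:
    if not is_numeric_dtype(df[col]):
        df[col] = pd.factorize(df[col])[0]

# drop duplicated columns (the first occurrence is kept)
df = df.T.drop_duplicates().T
```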
