## Main ML Models

**Linear Models**

Use hyperplanes to separate classes (each model uses a different loss function). They work very well on sparse, high-dimensional data, but poorly when the decision boundary is non-linear.
- Logistic Regression.
- Support Vector Machines.

Implementations include scikit-learn and Vowpal Wabbit (designed to handle very large datasets).


**Tree-based**

Very powerful for tabular data, but not well suited to linear dependencies: approximating a linear relationship requires many splits, and predictions can be inaccurate near decision boundaries.
- Decision Trees: use divide-and-conquer to recursively split the space into subspaces.
- [Random Forests](https://www.datasciencecentral.com/profiles/blogs/random-forests-explained-intuitively).
- [Gradient Boosting over Decision Trees](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html).


**Others**

- [k-Nearest Neighbors](https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/).
- Neural Networks


**Overview**

- Linear models split the space into two subspaces separated by a hyperplane.
- Tree-based methods split the space into boxes and use constant predictions inside each box.
- KNN assumes that objects close to one another are likely to have the same label, so it relies heavily on how the distance between points is measured.
- Neural networks produce smooth non-linear decision boundaries.
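
The "constant prediction inside each box" behavior of trees can be checked directly. The sketch below is only an illustration on made-up data (a noisy sine wave); the depth limit and the variable names are assumptions chosen for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (made-up data): y is a noisy sine wave.
rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# A depth-3 tree has at most 2**3 = 8 leaves, so it can output at most
# 8 distinct values: one constant per box.
grid = np.linspace(0, 6, 1000).reshape(-1, 1)
n_distinct = len(np.unique(tree.predict(grid)))
n_leaves = tree.get_n_leaves()
print(n_distinct, n_leaves)
```

Even on a dense grid of 1000 points, the number of distinct predictions cannot exceed the number of leaves.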



## Features Preprocessing

Features preprocessing & generation pipelines depend on the model type. A few examples:

+ A categorical feature that happens to be stored as a number will not perform well with a linear model if its relation with the outcome is non-linear. In this case, one-hot encoding will perform better. This preprocessing step is not required to fit a random forest.
+ Forecasting a linear trend will work well with a linear model, but a tree-based approach will not create splits for unseen dates and might perform poorly.
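
The trend example can be reproduced with a small sketch. The data is made up (a noise-free trend `y = 2 * t + 1`), and the two estimators are used with default settings purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Made-up linear trend: y = 2 * t + 1, observed for t = 0..99.
t_train = np.arange(100).reshape(-1, 1)
y_train = 2.0 * t_train[:, 0] + 1.0

# "Future" time steps that were never seen during training.
t_future = np.array([[150], [200]])

linear = LinearRegression().fit(t_train, y_train)
tree = DecisionTreeRegressor(random_state=0).fit(t_train, y_train)

linear_pred = linear.predict(t_future)  # keeps following the trend
tree_pred = tree.predict(t_future)      # falls into the last leaf seen in training
print(linear_pred, tree_pred)
```

The tree can never predict a value above the largest target it saw during training, whereas the linear model extrapolates the slope.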

The scikit-learn documentation has a section dedicated to [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html).


### Numeric Features

Tree-based models are largely insensitive to feature scale and to outliers in feature values.

**Feature Scale (non-tree based)**

Non tree-based models (KNN, linear models & NN) are strongly impacted by the scale of each feature:
+ KNN: predictions are based on distances, so they will vary significantly depending on the scale of each feature.
+ Linear models & NN:
    + regularization penalizes coefficients, and the magnitude of each coefficient depends on the scale of its feature. Regularization works best when it applies to each coefficient in equal amounts.
    + gradient descent methods converge poorly when features are on very different scales.
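
A minimal sketch of the KNN case, using two made-up training points and a hypothetical change of unit on the second feature (division by 100). The nearest neighbor, and therefore the prediction, flips.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Two made-up training points and one query point.
X = np.array([[1.0, 0.0],    # class 0: differs from the query on feature 0
              [0.0, 50.0]])  # class 1: differs from the query on feature 1
y = np.array([0, 1])
query = np.array([[0.0, 0.0]])

# Raw features: distances are 1 and 50, so class 0 is the nearest neighbor.
raw_pred = KNeighborsClassifier(n_neighbors=1).fit(X, y).predict(query)

# Divide feature 1 by 100 (hypothetical change of unit): distances become
# 1 and 0.5, so class 1 is now the nearest neighbor.
scale = np.array([1.0, 100.0])
scaled_pred = KNeighborsClassifier(n_neighbors=1).fit(X / scale, y).predict(query / scale)
print(raw_pred, scaled_pred)
```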

The easiest way to deal with this issue is to rescale all features to the same scale:
+ `sklearn.preprocessing.MinMaxScaler` scales to \[0, 1\]: $X = (X - X.min()) / (X.max() - X.min())$. The shape of the distribution doesn't change.
+ `sklearn.preprocessing.StandardScaler` scales to mean=0 and std=1: $X = (X - X.mean()) / X.std()$.
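
Both scalers follow the usual fit/transform pattern. The sketch below uses a made-up train/test split; the point is that the statistics are learned on the training set and reused as-is on new data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up train / test split with two features on very different scales.
X_train = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
X_test = np.array([[2.5, 700.0]])

minmax = MinMaxScaler().fit(X_train)      # learns per-column min and max
standard = StandardScaler().fit(X_train)  # learns per-column mean and std

X_train_mm = minmax.transform(X_train)
X_train_std = standard.transform(X_train)

# Reuse the statistics learned on the training set: test values may fall
# outside [0, 1] after min-max scaling.
X_test_mm = minmax.transform(X_test)
print(X_test_mm)
```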

_Note: Different scalings result in different model quality: it is another hyperparameter you need to optimize._

_Note: when using KNN, we can optimize the scaling parameters for certain features in order to boost their impact on predictions._

An analysis of when to use min-max vs standardization can be found [here](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html).

**Outliers [Winsorizing](https://en.wikipedia.org/wiki/Winsorizing) (linear models)**

Outliers (for both features and target values) can impact linear models significantly. Clipping feature values between some lower and upper bounds (like 1st and 99th percentiles) can mitigate this issue. This method is frequently used with financial data and is called winsorization.
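
A minimal sketch of the clipping step with NumPy, on made-up data with a few injected extreme values. The 1st/99th percentile bounds are the ones mentioned above; other bounds may suit other datasets better.

```python
import numpy as np

rng = np.random.default_rng(0)
x_train = rng.normal(size=1000)
x_train[:5] = [50.0, -40.0, 30.0, 25.0, -60.0]  # inject a few extreme outliers

# Bounds are computed on the training data only, then reused on new data.
lower, upper = np.percentile(x_train, [1, 99])
x_train_clipped = np.clip(x_train, lower, upper)
print(lower, upper)
```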


**Rank Transformation (non-tree based)**

Rank transformation sets the spacing between sorted values to be equal. It is a quick way of handling outliers: use the rank of each value instead of the value itself (see `scipy.stats.rankdata`). The transformation fitted on the training set needs to be applied to the test and validation sets.
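
A small sketch with `scipy.stats.rankdata` on made-up values. `rankdata` has no built-in way to transform new data, so the interpolation used for the test values is one possible approach, stated here as an assumption rather than a standard recipe.

```python
import numpy as np
from scipy.stats import rankdata

x_train = np.array([1.0, 2.0, 3.0, 1000.0])  # 1000 is an outlier
train_ranks = rankdata(x_train)              # the outlier simply becomes rank 4

# Assumption: map new values by linear interpolation between the sorted
# training values and their ranks.
order = np.argsort(x_train)
x_test = np.array([2.5, 500.0])
test_ranks = np.interp(x_test, x_train[order], train_ranks[order])
print(train_ranks, test_ranks)
```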

**Log/Sqrt Transforms (non-tree based)**

Applying `np.log(1+x)` or `np.sqrt(x + 2/3)` to a feature can benefit all non tree-based models, especially NN, as they:

+ bring extreme values of the feature closer to the average.
+ make values close to zero more easily distinguishable.
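
As a quick illustration on made-up, heavily right-skewed values (`np.log1p(x)` computes the same quantity as `np.log(1 + x)`):

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 1000.0, 100000.0])  # made-up, heavy right tail

x_log = np.log1p(x)          # same as np.log(1 + x), more accurate near zero
x_sqrt = np.sqrt(x + 2 / 3)

print(x_log.round(2))   # the largest value is pulled much closer to the rest
print(x_sqrt.round(2))
```

Both transforms preserve the ordering of the values, so they only change the spacing between them.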

**Grouping Preprocessings**

Training a model on concatenated dataframes, each having gone through different preprocessings, can sometimes yield great results. Another option is to mix models trained on differently preprocessed data.



### Categorical & Ordinal Data

_Reminder: "ordinal" means "ordered categorical" (star ratings, level of education, etc.). We cannot be sure that the difference between values is always constant (the difference between four and five stars might be smaller than between two and three stars)._

**Label Encoding (tree based)**

Label encoding maps categories into numbers. This method works well with tree-based models: they can split the feature and extract the most useful values.

+ `sklearn.preprocessing.LabelEncoder`: apply the encoding in sorted order.
+ `pandas.factorize`: apply the encoding by order of appearance.
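
The difference between the two orderings is easiest to see on a tiny made-up column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = pd.Series(["red", "blue", "green", "red"])

# Sorted order: blue=0, green=1, red=2
sorted_codes = LabelEncoder().fit_transform(colors)

# Order of appearance: red=0, blue=1, green=2
appearance_codes, uniques = pd.factorize(colors)

print(list(sorted_codes), list(appearance_codes), list(uniques))
```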

**Frequency Encoding (tree based)**

Frequency encoding (see below) maps each category to its frequency in the dataset.

_Note: if two categories have the same frequency, they won't be distinguishable after frequency encoding alone. In this case, a ranking operation will help._

**One-Hot Encoding**

One-Hot Encoding creates a new boolean column for each value (it is by definition already min-max-scaled):
+ `sklearn.preprocessing.OneHotEncoder`
+ `pandas.get_dummies`

_Note: tree-based methods will struggle to use numeric features efficiently if there are too many binary columns created via one-hot encoding (they will not be selected in enough random splits)._

_Note: one-hot encoding features with a large number of categories will create many binary features that have few non-zero values. In these cases, sparse matrices will have much better performance because they only store non-zero values in memory._
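
A short sketch of both options on a made-up `city` column. With default settings, `OneHotEncoder` returns a SciPy sparse matrix, while `pandas.get_dummies` returns a dense DataFrame; `handle_unknown="ignore"` is an optional choice made here so that unseen categories encode as all zeros.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"city": ["paris", "rome", "paris", "oslo"]})

# Dense DataFrame with one column per category.
dense_dummies = pd.get_dummies(df["city"])

# Sparse matrix: only the non-zero entries are stored.
encoder = OneHotEncoder(handle_unknown="ignore")
sparse_onehot = encoder.fit_transform(df[["city"]])

print(dense_dummies.shape, sparse_onehot.shape, sparse_onehot.nnz)
```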


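```python
# frequency encoding
import pandas as pd

df = pd.DataFrame({'feature': ['a', 'b', 'a', 'c', 'a', 'b']})  # toy data

encoding = df.groupby('feature').size()  # number of occurrences of each value
encoding = encoding / len(df)            # frequency of each value
df['enc_feature'] = df['feature'].map(encoding)
```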


## Common Features Generation

A good explanation of the benefits of features engineering, in addition to examples, can be found [here](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) and [here](https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering).

**Numerical**
+ combining several features into one via division or multiplication (area from width and length, price per square meter).
+ extracting the fractional part of prices (i.e. cents) to capture the psychology of how people perceive these prices.
+ flagging non-human behaviors (bots) by looking at patterns (non-rounded prices at auctions, constant intervals in messages sent on social media).
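
The first two ideas can be sketched with pandas. The column names and values below are made up for the example.

```python
import pandas as pd

flats = pd.DataFrame({
    "width_m": [5.0, 8.0],
    "length_m": [10.0, 12.5],
    "price": [149999.99, 250000.00],
})

# Combine features via multiplication and division.
flats["area_m2"] = flats["width_m"] * flats["length_m"]
flats["price_per_m2"] = flats["price"] / flats["area_m2"]

# Fractional part of the price, in cents (rounded first to avoid
# floating-point noise such as 98.99999).
flats["price_cents"] = ((flats["price"] * 100).round() % 100).astype(int)
print(flats)
```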

**Categorical & Ordinal**
+ combining several categorical features into one (features interaction).
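
One simple way to build such an interaction is to concatenate the values as strings, then encode the result like any other categorical feature. The columns below are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"pclass": [1, 3, 3], "sex": ["female", "male", "female"]})

# Each distinct (pclass, sex) pair becomes its own category.
df["pclass_sex"] = df["pclass"].astype(str) + "_" + df["sex"]
print(df["pclass_sex"].tolist())
```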

**Datetimes**
+ periodicity (day of week, day of year, day, month, season, year). Helps capture repetitive patterns in the data.
+ time passed since major event (days since last holidays, is_holiday boolean).
+ difference between dates.
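
These features map directly onto the pandas `.dt` accessor. The sketch below uses made-up order and delivery dates, with 2021-12-25 as a hypothetical "major event".

```python
import pandas as pd

df = pd.DataFrame({
    "ordered_at": pd.to_datetime(["2021-12-24", "2022-01-03"]),
    "delivered_at": pd.to_datetime(["2021-12-27", "2022-01-04"]),
})

# Periodicity
df["day_of_week"] = df["ordered_at"].dt.dayofweek  # Monday=0
df["month"] = df["ordered_at"].dt.month
df["day_of_year"] = df["ordered_at"].dt.dayofyear

# Time passed since an event (negative means the event is still ahead)
event = pd.Timestamp("2021-12-25")
df["days_since_event"] = (df["ordered_at"] - event).dt.days

# Difference between dates
df["delivery_days"] = (df["delivered_at"] - df["ordered_at"]).dt.days
print(df)
```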

**Coordinates**
+ distance to major landmarks (shops, subway station).
+ distance to the most remarkable data point of each cluster, after splitting the dataset into clusters (e.g. the most expensive flat).
+ distance to the center of each cluster.
+ aggregated statistics for the area around each point (number of flats in a given radius, mean price per square meter).
+ rotate coordinates so decision trees have decision boundaries closer to horizontal/vertical.
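
A minimal sketch of two of these features with NumPy. It assumes planar coordinates (for example, already projected to meters); the landmark position and the 45-degree angle are arbitrary choices for the example.

```python
import numpy as np

# Made-up planar coordinates.
points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
landmark = np.array([0.0, 0.0])  # hypothetical landmark position

# Euclidean distance from each point to the landmark.
dist_to_landmark = np.linalg.norm(points - landmark, axis=1)

# Rotate all points by 45 degrees (counter-clockwise) around the origin.
theta = np.deg2rad(45)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta),  np.cos(theta)]])
rotated = points @ rotation.T
print(dist_to_landmark, rotated.round(3))
```

In practice the rotation angle is another parameter to tune, since the useful boundary direction is usually not known in advance.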


## Features Extractions

Text (see [NLTK](https://www.nltk.org/)):
+ Lemmatization (use the root word for all its variants and related words), stemming (truncate the end of words), stopword removal, lowercasing.
+ Bag of Words: count occurrences of each word of a given dictionary for each sentence (`sklearn.feature_extraction.text.CountVectorizer`).
+ Bag of Words TF-IDF (`sklearn.feature_extraction.text.TfidfVectorizer` - see code below).
+ N-grams (`sklearn.feature_extraction.text.CountVectorizer` with the `ngram_range` and `analyzer` parameters).
+ Word2Vec Embedding: convert words into a vector. Operations are possible to extract additional meaning (king - man + woman = queen).
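
The scikit-learn vectorizers mentioned above can be tried on a few made-up sentences. Default settings are used except for `ngram_range`; real text usually needs the cleaning steps from the first bullet as well.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat", "the cat sat on the mat", "dogs bark"]

# Bag of words: sparse matrix with one row per document, one column per word.
bow = CountVectorizer()
counts = bow.fit_transform(docs)

# Unigrams and bigrams.
bigrams = CountVectorizer(ngram_range=(1, 2))
bigram_counts = bigrams.fit_transform(docs)

# TF-IDF weighted bag of words.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(docs)

print(sorted(bow.vocabulary_))
print(counts.shape, bigram_counts.shape, weights.shape)
```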


```python
import numpy as np

# x: dense document-term count matrix (rows = documents, columns = words); toy data
x = np.array([[2., 1., 0.], [1., 0., 3.]])

# bag of words - term frequency (texts of different sizes become more comparable)
tf = 1 / x.sum(axis=1)[:, None]  # inverse of the number of words per row
x = x * tf

# bag of words - inverse document frequency (boosts less frequent, more informative words)
idf = np.log(len(x) / (x > 0).sum(axis=0))  # log of the inverse fraction of rows containing the word
x = x * idf
```


## Missing Values

Missing values are sometimes not loaded as NaN: they might have been replaced by a single value that is completely out of the range taken by the rest of the values. These cases can be found by plotting a histogram.
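
As a sketch, the same check can be done numerically with `np.histogram`. The column and the sentinel value `-999` are made up for the example.

```python
import numpy as np
import pandas as pd

age = pd.Series([23, 35, -999, 41, 29, -999, 52])  # -999 is a hypothetical sentinel

# An isolated bin far away from the rest of the values is suspicious.
counts, bin_edges = np.histogram(age, bins=5)
print(counts)

# Once identified, the sentinel can be turned back into a proper NaN.
age_clean = age.replace(-999, np.nan)
```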

Once identified, missing values can be imputed in a few ways:
+ inferred. This method should be handled with caution, especially when using inferred values to generate a new feature.
+ use a single value outside the feature's value range (-1, -999, etc.). This can be used as a separate category but will penalize non tree-based models.
+ use the mean or the median. This works well for non-tree based methods but tree-based models won't be able to easily create a split for missing values.

An option is to add a new binary feature to flag rows that had missing values, then use either mean or median. The downside is that this method can double the number of features in the dataset.
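
In scikit-learn, this combination can be sketched with `SimpleImputer` and its `add_indicator` option. The data below is made up; with default settings, indicator columns are only added for features that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, 30.0],
              [5.0, 40.0]])

# Median imputation plus one binary "was missing" column per affected feature.
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_imputed = imputer.fit_transform(X)
print(X_imputed)
```

Here only the first feature has missing values, so the output has three columns: the two imputed features followed by one indicator.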

For categorical data, we can use frequency encoding to highlight categories that are in the test set but not the training set.

The scikit-learn documentation has a section dedicated to [missing values imputation](https://scikit-learn.org/stable/modules/impute.html).
