# I. Machine Learning
## I.1. Main ML Models

There are two main categories: tree-based and non tree-based. Features preprocessing & features generation methods, covered in the following sections, depend on the model type.

**Tree-based**

Very powerful for tabular data, but not well-suited to linear dependencies: approximating them requires a lot of splits, and predictions can be inaccurate near decision boundaries.
- Decision Trees: use divide-and-conquer to recursively split the space into subspaces.
- [Random Forests](https://www.datasciencecentral.com/profiles/blogs/random-forests-explained-intuitively).
- [Gradient Boosting over Decision Trees](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html).


**Linear Models**

Use hyperplanes to separate classes (each model uses a different loss function). Very good for sparse, high-dimensional data, but they don't work well in non-linear situations.
- Logistic Regression.
- Support Vector Machines.

Implementations include scikit-learn and Vowpal Wabbit (designed to handle very large datasets).


**Others**

- [k-Nearest Neighbors](https://www.analyticsvidhya.com/blog/2018/03/introduction-k-neighbours-algorithm-clustering/).
- Neural Networks


**Overview**

- Tree-based methods split the space into boxes and use constant predictions inside each box.
- Linear models split the space into two subspaces separated by a hyperplane.
- KNN makes the assumption that objects close to one another are likely to have the same label, and relies heavily on how the distance between points is measured.
- Neural networks produce smooth non-linear decision boundaries.

## I.2. Other ML Problems

### Ranking

+ [Overview of the topic](https://wellecks.wordpress.com/2015/01/15/learning-to-rank-overview/)
+ [RankNet introduction](https://icml.cc/2015/wp-content/uploads/2015/06/icml_ranking.pdf) - pairwise method for AUC optimization
+ [RankNet improvements](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/MSR-TR-2010-82.pdf)
+ [Library of LTR algorithms](https://sourceforge.net/p/lemur/wiki/RankLib/)


### Clustering

+ [Evaluation metrics for clustering](http://nlp.uned.es/docs/amigo2007a.pdf)


# II. Features Preprocessing

Features preprocessing & generation pipelines depend on the model type. A few examples:

+ A categorical feature that happens to be stored as numerical will not perform well with a linear model if its relation with the outcome is not linear. In this case, one-hot encoding will perform better. But this preprocessing step is not required to fit a random forest.
+ Forecasting a linear trend will work well with a linear model, but a tree-based approach will not create splits for unseen dates and might perform poorly.

The scikit-learn documentation has a section dedicated to [preprocessing](https://scikit-learn.org/stable/modules/preprocessing.html).

## II.1. Numeric Features

Tree-based models are not impacted by feature scales nor outliers.

**Feature Scale (non-tree based)**

Non tree-based models (KNN, linear models & NN) are strongly impacted by the scale of each feature:
+ KNN: predictions are based on distances, so they will vary significantly depending on the scale of each feature.
+ Linear models & NN: 
    + regularization impact is proportional to feature scale. It will work best if we can apply it to each coefficient in equal amounts.
    + gradient descent methods don't work well without features scaling.

The easiest way to deal with this issue is to rescale all features to the same scale:
+ `sklearn.preprocessing.MinMaxScaler` scale to \[0, 1\]: $X = (X - X.min()) / (X.max() - X.min())$. The distribution of values doesn't change.
+ `sklearn.preprocessing.StandardScaler` scale to mean=0 and std=1: $X = (X - X.mean()) / X.std()$. 

_Note: Different scalings result in different model quality: it is another hyperparameter you need to optimize._

_Note: when using KNN, we can optimize the scaling parameters for certain features in order to boost their impact on predictions._

An analysis of when to use min-max vs standardization can be found [here](https://sebastianraschka.com/Articles/2014_about_feature_scaling.html).
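
As a minimal sketch, the two formulas above can be written directly with NumPy. The toy matrix `X` is made up for illustration; in practice the statistics should be computed on the training set and reused on validation and test sets.

```python
import numpy as np

# Toy feature matrix: two columns on very different scales (made-up values).
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max scaling: each column is mapped to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: each column gets mean 0 and standard deviation 1.
# np.std defaults to the population formula (ddof=0).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, both columns carry the same weight in a distance computation, which is what KNN needs.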

**Outliers [Winsorizing](https://en.wikipedia.org/wiki/Winsorizing) (linear models)**

Outliers (for both features and target values) can impact linear models significantly. Clipping feature values between some lower and upper bounds (like 1st and 99th percentiles) can mitigate this issue. This method is frequently used with financial data and is called winsorization.
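
A minimal sketch of clipping with NumPy. The data and the 10th/90th percentile bounds are illustrative assumptions, chosen only because the sample is tiny; the text above suggests the 1st and 99th percentiles for real data.

```python
import numpy as np

# Made-up feature with one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 1000.0])

# Bounds should be computed on the training data only.
lower, upper = np.percentile(x, [10, 90])

# Clip every value into [lower, upper].
x_clipped = np.clip(x, lower, upper)
```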


**Rank Transformation (non-tree based)**

Rank transformation sets the space between values to be equal. A quick way of handling outliers is to use the values indices instead of their values (see `scipy.stats.rankdata`). The transformation used in the training set needs to be applied to test and validation sets.
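
A minimal sketch of a rank transformation. It uses `pandas.Series.rank` instead of `scipy.stats.rankdata` (an assumption made to keep the example dependency-light); both average the ranks of tied values by default.

```python
import pandas as pd

# Made-up feature with a tie and one outlier.
x = pd.Series([10.0, 20.0, 20.0, 5000.0])

# Ties share the average of the ranks they span.
ranks = x.rank(method="average")
```

The outlier ends up one step away from its neighbor, like every other value.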

**Log/Sqrt Transforms (non-tree based)**

Applying `np.log(1+x)` or `np.sqrt(x + 2/3)` to a feature can benefit all non tree-based models, especially NN, as they:

+ bring extreme values of the feature closer to the average.
+ make values close to zero more easily distinguishable.
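
A minimal sketch of both transforms on a made-up, heavy-tailed feature. `np.log1p(x)` computes `np.log(1 + x)` with better precision near zero.

```python
import numpy as np

x = np.array([0.0, 1.0, 10.0, 100.0, 10000.0])

x_log = np.log1p(x)
x_sqrt = np.sqrt(x + 2 / 3)

# Ratio between the two largest values, before and after the log transform.
ratio_raw = x[-1] / x[-2]
ratio_log = x_log[-1] / x_log[-2]
```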

**Grouping Preprocessings**

Training a model on concatenated dataframes, each having gone through different preprocessings, can sometimes yield great results. Another option is to mix models trained on differently preprocessed data.



## II.2. Categorical & Ordinal Data

_Reminder: "ordinal" means "ordered categorical" (star ratings, level of education, etc.). We cannot be sure that the difference between values is always constant (the difference between four and five stars might be smaller than between two and three stars)._

**Label Encoding (tree based)**

Label encoding maps categories into numbers. This method works well with tree-based models: they can split the feature and extract the most useful values.

+ `sklearn.preprocessing.LabelEncoder`: apply the encoding in sorted order.
+ `pandas.factorize`: apply the encoding by order of appearance.
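
A minimal sketch of the two orderings using `pandas.factorize` only. The `sort=True` call is used here as a stand-in for the sorted-order behavior described above; the `colors` data is made up.

```python
import pandas as pd

colors = pd.Series(["red", "blue", "green", "blue", "red"])

# Order of appearance: red -> 0, blue -> 1, green -> 2.
codes_appearance, uniques_appearance = pd.factorize(colors)

# Sorted order: blue -> 0, green -> 1, red -> 2.
codes_sorted, uniques_sorted = pd.factorize(colors, sort=True)
```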

**Frequency Encoding (tree based)**

Frequency encoding (see below) uses the frequency of each value as key.

_Note: if two categories have the same frequency, they won't be distinguishable after frequency encoding alone. In this case, a ranking operation will help._


In [None]:
# frequency encoding
encoding = df.groupby(feature).size() # number of occurrences by value
encoding = encoding / len(df)         # frequency of each value
df['enc_feature'] = df[feature].map(encoding) # `feature` holds the column name


**One-Hot Encoding**

One-Hot Encoding creates a new boolean column for each value (it is by definition already min-max-scaled):
+ `sklearn.preprocessing.OneHotEncoder`
+ `pandas.get_dummies`

_Note: tree-based methods will struggle to use numeric features efficiently if there are too many binary columns created via one-hot encoding (they will not be selected in enough random splits)._

_Note: one-hot encoding features with a large number of categories will create many binary features that have few non-zero values. In these cases, sparse matrices will have much better performance because they only store non-zero values in memory._
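
A minimal sketch of one-hot encoding with `pandas.get_dummies` on a made-up `city` column:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Paris", "Lyon", "Paris", "Nice"]})

# One indicator column per distinct value; columns are sorted by category.
one_hot = pd.get_dummies(df["city"], prefix="city")
```

Each row has exactly one active column, so the encoded block is already in the [0, 1] range.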


## II.3. Target Mean Encoding

Mean encoding takes the mean of the target for each category of the feature in the training set. It ranks categories by target mean value and makes it easier for algorithms to use the feature. By contrast, label encoding orders categories at random, which makes it harder for algorithms to use the feature.

More concretely, it allows tree-based algorithms to get the same level of performance with shorter trees, especially with high-cardinality categorical features that are typically hard to handle for tree-based algorithms (because many decision boundaries are required).

_Note: this method also works with regression problems._

When increasing the maximum depth of trees, we can expect our models to overfit the training set. If the performance also keeps increasing on the validation set, it means that our model needs a huge number of splits to extract information from some variables. In this case, mean encoding is likely to provide significant benefits to our model.

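Mean encoding can be performed in several ways (for binary classification, P and N are the numbers of positive and negative targets within a category):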
+ likelihood (target mean): P / (N + P)
+ weight of evidence: ln(P/N) * 100
+ count: P
+ diff: P - N
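
A minimal sketch of these four variants with pandas, assuming a binary target where P and N are the counts of positives and negatives in each category. The `city` and `target` columns are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical binary-classification data.
df = pd.DataFrame({
    "city":   ["A", "A", "A", "B", "B", "B", "B", "B"],
    "target": [1,   1,   0,   1,   0,   0,   0,   1],
})

grouped = df.groupby("city")["target"]
P = grouped.sum()            # positives per category
N = grouped.count() - P      # negatives per category

encodings = pd.DataFrame({
    "likelihood": P / (N + P),
    "woe": np.log(P / N) * 100,
    "count": P,
    "diff": P - N,
})
```

Note that the weight of evidence is infinite for a category where P or N is zero, so it usually needs some smoothing.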


In [None]:
# mean encoding must be computed on the training set only
means = df_train.groupby(feature_col)['target'].mean()

# it is then applied to both training and validation sets
df_train[feature_col + '_mean_target'] = df_train[feature_col].map(means)
df_val[feature_col + '_mean_target'] = df_val[feature_col].map(means)


Naive mean encoding might lead to overfitting when, for instance, all observations of a given category have the same target value in the training set but not in the validation set. Regularization can help.

+ CV loop: split the training data into folds; for each fold, compute the means on the other folds and apply them to the held-out fold. This technique still leaks some information from the validation set to the training set, but the impact is minor if the data is large enough compared to the number of folds. This is why it is recommended to only use a small CV of 4-5 folds.
+ smoothing: use mean encoding only for categories with a lot of rows (see formula below). Must be combined with another method like CV loop to prevent overfitting.
+ expanding mean: sort the data and only consider the [0..n-1] rows to calculate the mean encoding of row n. The feature quality is not always excellent, but we can compensate by averaging the predictions of models fitted on encodings calculated from different data permutations.

$$\mathrm{smoothing} = \frac{\mathrm{mean}(\mathrm{target}) * \mathrm{nrows} + \mathrm{global\_mean} * \alpha}{\mathrm{nrows} + \alpha}$$
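
A minimal sketch of the smoothing formula above with pandas. The data and the value of `alpha` are hypothetical; `alpha` is a hyperparameter to tune.

```python
import pandas as pd

df_train = pd.DataFrame({
    "city":   ["A", "A", "A", "A", "B"],
    "target": [1,   1,   0,   1,   0],
})
alpha = 2.0  # hypothetical smoothing strength

global_mean = df_train["target"].mean()
stats = df_train.groupby("city")["target"].agg(["mean", "count"])

# Categories with few rows are pulled towards the global mean.
smoothed = (stats["mean"] * stats["count"] + global_mean * alpha) / (stats["count"] + alpha)
df_train["city_smoothed"] = df_train["city"].map(smoothed)
```

Category `B` has a single row with target 0, so its naive encoding would be 0; smoothing moves it towards the global mean instead.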

_Note: the expanding mean is built-in in the library CatBoost that works great for categorical datasets._


## II.4. Feature Encoding

The mean is the only meaningful statistic we can extract from a classification variable. 

For multiclassification tasks, we will create n features: one mean encoding for each class. This has the advantage of giving tree-based algorithms information about other classes where typically they don't: they usually solve multiclassification as n problems of one versus all so every class has a different model.

For regression tasks, we have many more options: percentiles, standard deviation, etc. We can also use distribution bins: create x features that count the number of times the target variable is in the x-th distribution bin.

For many-to-many combinations (like classification problems from apps installed on a user's phone), we can mean-encode based on each user-app combination (long-form representation), then merge the results into a vector for each user. We can apply various statistics to these vectors, like mean or standard deviation.

For time series, calculating some statistic of previous values can also help significantly.

It can also be useful to investigate numerical features that have many splits: try to mean-encode binned values where bins come from the splits. Another option is to mean-encode feature interactions between features that are frequently in neighboring nodes.




In [None]:
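import pandas as pd
from sklearn.model_selection import StratifiedKFold

# we assume that df is the available dataset (with a 'target' column and a unique index),
# that feature_cols lists the columns to encode,
# and that df_new is an empty dataframe that will include all our new features
df_new = pd.DataFrame(index=df.index)

# CV loop
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for tr_idx, val_idx in skf.split(df, df['target']):
    # split fold in train/val
    df_tr, df_val = df.iloc[tr_idx], df.iloc[val_idx]
    # loop over all feature columns to encode
    for feature_col in feature_cols:
        # means are computed on the training part of the fold only
        means = df_tr.groupby(feature_col)['target'].mean()
        # save to df_new, for the rows of the validation part only
        df_new.loc[df_val.index, feature_col + '_mean_target'] = df_val[feature_col].map(means)

# fill missing values (categories unseen in a training fold) with the global target mean
prior = df['target'].mean()
df_new.fillna(prior, inplace=True)

# expanding mean (alternative to the CV loop, shown for a single feature_col)
cumsum = df.groupby(feature_col)['target'].cumsum() - df['target'] # cumsum of [0..n-1]
cumcnt = df.groupby(feature_col)['target'].cumcount()              # cumcnt of [0..n-1]
# the first row of each category gives 0/0 = NaN, filled with the prior
df_new[feature_col + '_mean_target'] = (cumsum / cumcnt).fillna(prior)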


# III. Common Features Generation

A good explanation of the benefits of features engineering, in addition to examples, can be found [here](https://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/) and [here](https://www.quora.com/What-are-some-best-practices-in-Feature-Engineering).

_Note: we can create a new feature that flags incorrect values if they appear to follow some pattern (as opposed to typos for instance)._


**Numerical**
+ combining several features into one via division or multiplication (area from width and length, price per square meter).
+ extracting the fractional part of prices (i.e. cents) to capture the psychology of how people perceive these prices.
+ flagging non-human behaviors (bots) by looking at patterns (non-rounded prices at auctions, constant intervals in messages sent on social media).

**Categorical & Ordinal**
+ combining several categorical features into one (features interaction).

**Datetimes**
+ periodicity (day of week, day of year, day, month, season, year). Helps capture repetitive patterns in the data.
+ time passed since major event (days since last holidays, is_holiday boolean).
+ difference between dates.

**Coordinates**
+ distance to major landmarks (shops, subway station).
+ distance to most remarkable data point of the dataset split into clusters (most expensive flat).
+ distance to the center of the dataset split into clusters.
+ aggregated statistics for the area around each point (number of flats in a given radius, mean price per square meter).
+ rotate coordinates so decision trees have decision boundaries closer to horizontal/vertical.

**Text Feature Extraction**

Text (see [NLTK](https://www.nltk.org/)):
+ Lemmatization (use the root word for all its variants and related words), stemming (truncate the end of words), stopwords removal, lowercase.
+ Bag of Words: count occurrences of each word of a given dictionary for each sentence (`sklearn.feature_extraction.text.CountVectorizer`).
+ Bag of Words TF-IDF (`sklearn.feature_extraction.text.TfidfVectorizer` - see code below).
+ N-grams (`sklearn.feature_extraction.text.CountVectorizer(ngram_range, analyzer)`).
+ Word2Vec Embedding: convert words into a vector. Operations are possible to extract additional meaning (king - man + woman = queen).


In [None]:
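# x is assumed to be a dense NumPy array of word counts (rows = texts, columns = words)
# bag of words - terms frequency (texts of different sizes become more comparable)
tf = 1 / x.sum(axis=1)[:, None] # inverse of the number of words per row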
x = x * tf

# bag of words - inverse document frequency (boost more important features/words - that is, less frequent words)
idf = np.log(len(x) / (x > 0).sum(axis=0)) # inverse of the fraction of rows with the word
x = x * idf


# IV. Missing Values Imputation

The scikit-learn documentation has a section dedicated to [missing values imputation](https://scikit-learn.org/stable/modules/impute.html).

Missing values are sometimes not loaded as NaN: they might have been replaced by a single value that is completely out of the range taken by the rest of the values. These cases can be found by plotting a histogram.

Once identified, missing values can be imputed in a few ways:
+ inferred. This method should be handled with caution, especially when using inferred values to generate a new feature.
+ use a single value outside the feature's value range (-1, -999, etc.). This can be used as a separate category but will penalize non tree-based models.
+ use the mean or the median. This works well for non tree-based methods but tree-based models won't be able to easily create a split for missing values.

An option is to add a new binary feature to flag rows that had missing values, then use either mean or median. The downside is that this method will double the number of features in the dataset.
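
A minimal sketch of the flag-then-impute approach with pandas. The `age` column is hypothetical; on real data the median should come from the training set.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [20.0, np.nan, 40.0, 60.0, np.nan]})

# Flag rows that had a missing value before filling them.
df["age_was_missing"] = df["age"].isna().astype(int)

# The median ignores NaN values by default.
median = df["age"].median()
df["age"] = df["age"].fillna(median)
```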

For categorical data, we can use frequency encoding to highlight categories that are in the test set but not the training set.

# V. EDA

This process helps understand the data, build intuition about it, generate hypotheses and find insights. It can also be used to check if train and test sets have similar populations (men vs women, etc.).

**individual features**
+ numerical summary / statistics.
+ histogram. 
+ plotting row index vs value.

**pairs of features**
+ scatter plots. A few interesting tweaks:
    + add a color for each class (classification) or match point size with outcome value (regression).
    + overlapping test set with train set to see if values match.
+ scatter matrix.

**groups of features**
+ correlation matrix. Running [K-means clustering](https://scikit-learn.org/stable/auto_examples/bicluster/plot_spectral_biclustering.html) can help group related features together.
+ plotting column index vs statistics value (like mean) of each feature.

It can be helpful to generate new features based on each of these groups of features.

In [None]:
# df values
df.describe()
x.value_counts()
x.isnull()

# plot values vs index
plt.plot(x, '.')
plt.scatter(range(len(x)), x, c=y)

# correlation (scatter plots, correlation matrix)
pd.plotting.scatter_matrix(df) # pd.scatter_matrix was removed from recent pandas versions
df.corr()
plt.matshow(...)

# VI. Validation

Using a validation set will limit the risk of overfitting our model on the training set. Once we are satisfied with a given model, we can retrain it on the entire dataset with the same hyperparameters.

Of particular interest will be how the sample data was extracted from the database: was it random, or were some classes over-sampled to produce a more balanced dataset? This will be crucial to set up a proper validation scheme: when splitting the training set to create the validation set, we want the resulting split to be as close to the split between training and test set as possible. 

The scikit-learn documentation has a section dedicated to [validation](https://scikit-learn.org/stable/modules/cross_validation.html).

## VI.1. Validation Methods

How to split / what ratios to use:
+ holdout (`sklearn.model_selection.ShuffleSplit`): split the dataset in two and train only on one part. This is a good choice when the dataset is large or when the model score is likely to be consistent across splits.
+ K-fold (`sklearn.model_selection.KFold`): split the dataset into K subsets; train K times on K-1 subsets. This method ensures that a given sample is used for validation only once. The end score is the average over these K folds. This is a good choice when the dataset is of medium size.
+ Leave-one-out (`sklearn.model_selection.LeaveOneOut`): K-fold where K is equal to the number of samples in the training set. This is a good choice when the dataset is small and the models are relatively fast to train. 
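
To make the K-fold mechanics concrete, here is a minimal NumPy-only sketch of how the index splits are built. It is an illustration of the idea, not the scikit-learn implementation; `n_samples` and `k` are arbitrary.

```python
import numpy as np

n_samples, k = 10, 5

# Shuffle the row indices once, then cut them into k folds.
rng = np.random.default_rng(42)
indices = rng.permutation(n_samples)
folds = np.array_split(indices, k)

splits = []
for i in range(k):
    val_idx = folds[i]                       # fold i is held out
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    splits.append((train_idx, val_idx))
```

Setting `k = n_samples` gives leave-one-out: every validation fold contains a single row.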

_Note: it is important to ensure that there is no overlap between training and validation sets, which could occur if there are duplicate samples. In this case, the accuracy of the validation set risks being incorrectly high._

_Note: with K-fold and LOO, you can also estimate the mean and variance of the loss. This can be used to measure if improvements are statistically significant._

For classification datasets that are either small or have a large number of categories, it is always a good practice to use stratification: this method ensures that the distribution of classes will be similar over different training folds. 

## VI.2. Inconsistent Scores 

There are two main reasons for observing vastly different scores for our different folds:
+ the data has clear patterns but too few samples. The model will not be able to generalize them well: each fold trains on slightly different patterns, which can lead to vastly different scores.
+ the data is inconsistent. In this case where the variance is high, the model will struggle to generalize.

Running several K-folds with different seeds can help get a better estimate of the model's performance. It can also be helpful to adjust hyperparameters with one seed and estimate performance with another one.


# VII. Competition-specific Steps

Some specific concepts apply mainly during competitions.

**Checking data**

+ We can remove all features that are either constant in the training set or a duplicate of another column (`df.T.drop_duplicates()`). For categorical features, we'll need to label encode them all by order of appearance.
+ If categorical values only exist in the test set, we need to test if these new values bring any useful information. One possibility is to compare the model performance on a validation set, for values in the training set vs previously unseen values. We might want to use a different model for new values if performance is low for them. 
+ Duplicated rows can either be mistakes or the result of having a key feature omitted from the dataset. Having identical rows in both train and test sets can help us understand how the data was generated.
+ When plotting rolling mean of values by index, we can check if there are some patterns that will indicate that the data was not properly shuffled.


**Validation scores vs leaderboard scores**

Sometimes, the validation score is very different from the leaderboard one. It usually comes from using a validation set that is not representative of the test set. If the test set distribution is different from the validation set, doing some leaderboard probing to get mean values can help improve the score significantly.

A good practice for final submissions is to submit one model that performs well on the validation set (to cover cases where test set and validation set have similar distributions) and one that performs well on the public leaderboard (to cover cases where the distribution of the test set is very different from the one of the validation set).

# VIII. Performance Metrics - Regression

Performance metrics are used to assess the quality of an algorithm; you must choose the most appropriate one for the task at hand. Let's take the example of an online shop that tries to maximize the effectiveness of their website. The company must decide the measure they want to use to quantify effectiveness, which is the measure they will try to optimize. It can be the number of visits, the ratio of visits that led to orders, etc.




## VIII.1. Mean Squared Error, Root Mean Squared Error and R²

$MSE = \frac{1}{N} \sum\limits_{i=1}^N (y_i - \hat y _i)^2$

$RMSE = \sqrt{MSE}$

MSE vs RMSE:
+ the RMSE has the same unit as the values to predict, so it's easier to compare.
+ any minimizer of MSE is a minimizer of RMSE and vice-versa: $MSE(a) > MSE(b) \Leftrightarrow RMSE(a) > RMSE(b)$. This means that optimizing for MSE is the same as optimizing for RMSE.

_Note: the MSE is easier to use than RMSE for gradient descent: the RMSE gradient is a function of the MSE, which means its learning rate is dynamic._

_Note: the MSE can be linked to the L2-loss. See this [SO question](https://datascience.stackexchange.com/questions/26180/l2-loss-vs-mean-squared-loss) for more details._

**MSE Baseline (optimal constant)**

When predicting a constant value, the MSE and RMSE are the smallest for $\bar y = \frac{1}{N} \sum\limits_{i=1}^N y_i$. In this case, the MSE is equal to the variance.
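
A quick numerical check of this claim, as a sketch on made-up targets: a brute-force search over candidate constants lands on the mean, and the MSE at the mean equals the variance.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 10.0])

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

# Try every constant on a fine grid between 0 and 10.
candidates = np.linspace(0, 10, 1001)
scores = np.array([mse(y, c) for c in candidates])
best_constant = candidates[scores.argmin()]
```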


**R squared**

The value of the MSE itself doesn't tell us how good the model really is: it only allows us to compare the performance of various models. R Squared can be used for that purpose: it compares a model performance with the performance of the constant value that has the smallest MSE (predicting a constant value is the most naive approach and is used as the baseline performance). 

R² equals 1 when the model has perfect predictions and 0 when the model performs no better than using the constant $\bar y$ for all predictions; it becomes negative when the model performs worse than this baseline:

$$R^2 = 1 - \frac{\frac{1}{N} \sum\limits_{i=1}^N (y_i - \hat y _i)^2}{\frac{1}{N} \sum\limits_{i=1}^N (y_i - \bar y)^2} = 1 - \frac{MSE}{Var(y)}$$

_Note: the formula of R² shows that optimizing for R² is the same as optimizing for MSE._
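
A minimal sketch of the R² formula with NumPy, on made-up values. It also shows that R² can drop below 0 when a model does worse than the constant baseline.

```python
import numpy as np

def r2(y_true, y_pred):
    mse = np.mean((y_true - y_pred) ** 2)
    return 1 - mse / np.var(y_true)

y = np.array([1.0, 2.0, 3.0, 4.0])

r2_perfect = r2(y, y)                            # perfect predictions
r2_constant = r2(y, np.full_like(y, y.mean()))   # constant baseline
r2_bad = r2(y, y[::-1])                          # predictions in reverse order
```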


## VIII.2. Mean Absolute Error

$MAE = \frac{1}{N} \sum\limits_{i=1}^N |y_i - \hat y _i|$

The MAE is close to the MSE but does not penalize large errors as badly, so it's less sensitive to outliers. It's better suited if the impact of large errors is proportionally the same as small errors (i.e. if an error of 10 costs twice as much as an error of 5, not more). It is widely used in finance.

_Note: the MAE is a specific type of quantile loss._

_Note: the MAE can be linked to the L1-loss._

_Note: the absolute function is not differentiable at x = 0, so the MAE has no gradient if the predictions perfectly match the values to predict._

**MAE Baseline (optimal constant)**

When predicting a constant value, the MAE is the smallest for the median of $y$.
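
A small sketch comparing the median and the mean as constant predictions under MAE, on made-up targets with one outlier:

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])   # one outlier

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

mae_at_median = mae(y, np.median(y))   # constant = 3.0
mae_at_mean = mae(y, y.mean())         # constant = 22.0
```

The median ignores how far away the outlier is, while the mean is dragged towards it.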



## VIII.3. MSE vs MAE

MAE is robust to outliers, which means they will not influence predictions significantly. If outliers are mistakes, then MAE will be a better choice. Otherwise, MSE will be more suited to the task.


## VIII.4. MSPE & MAPE

MSE and MAE work with absolute errors: predicting 10 for a correct value of 9 and predicting 1000 for a correct value of 999 lead to the same SE and AE of 1. We can use the Mean Squared Percentage Error and Mean Absolute Percentage Error instead:

$$MSPE = \frac{100\%}{N} \sum\limits_{i=1}^N \left( \frac{y_i - \hat y _i}{y_i} \right)^2$$

$$MAPE = \frac{100\%}{N} \sum\limits_{i=1}^N \left\lvert \frac{y_i - \hat y _i}{y_i} \right\rvert$$

The cost for a fixed absolute error decreases as target values get larger: MSPE and MAPE are weighted versions of MSE and MAE. Their optimal constants are the weighted mean and weighted median, respectively, with weights equal to:
    
$$\mathrm{MSPE}: w_i = \frac{1/y_i^2} {\sum\limits_{j=1}^N 1/y_j^2}$$

$$\mathrm{MAPE}: w_i = \frac{1/y_i} {\sum\limits_{j=1}^N 1/y_j}$$
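
A numerical sketch of the MSPE case on made-up targets: the weighted mean with weights proportional to $1/y_i^2$ matches the constant found by brute force, and it is smaller than the plain mean.

```python
import numpy as np

y = np.array([1.0, 2.0, 4.0, 8.0])

def mspe(y_true, y_pred):
    return 100 * np.mean(((y_true - y_pred) / y_true) ** 2)

# Weighted mean with weights proportional to 1 / y_i^2.
w = (1 / y**2) / np.sum(1 / y**2)
best_constant = np.sum(w * y)

# Brute-force check over a grid of constants (step of 0.001).
grid = np.linspace(0.5, 8.0, 7501)
grid_best = grid[np.argmin([mspe(y, c) for c in grid])]
```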


## VIII.5. RMSLE

The Root Mean Squared Logarithmic Error is the RMSE calculated at the logarithmic scale:

$$RMSLE = \sqrt{\frac{1}{N} \sum\limits_{i=1}^N ( \ln(y_i + 1) - \ln(\hat y _i + 1))^2}$$

It is similar to the MSPE in the sense that it penalizes errors on small values more, but it penalizes underpredictions more than overpredictions. Its optimal constant is found in the log space, as the mean of $\ln(y_i + 1)$; we then need to exponentiate (and subtract 1) to get the actual value.
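
A minimal sketch of the RMSLE with NumPy, on made-up values. It illustrates the asymmetry between under- and overpredictions, and computes the best constant as the mean of $\ln(y_i + 1)$ mapped back to the original scale.

```python
import numpy as np

def rmsle(y_true, y_pred):
    return np.sqrt(np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2))

# Same absolute error of 50, in both directions.
y = np.array([100.0])
under = rmsle(y, np.array([50.0]))
over = rmsle(y, np.array([150.0]))

# Best constant: mean in log space, then back to the original scale.
targets = np.array([1.0, 10.0, 100.0])
best_constant = np.expm1(np.mean(np.log1p(targets)))
```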



# IX. Performance Metrics - Classification

_Note: a cheat sheet of metrics used for classification can be found [here](https://queirozf.com/entries/evaluation-metrics-for-classification-quick-examples-references)._

## IX.1. Definitions

The terms "positive" and "negative" do not refer to benefit, but to the presence or absence of a condition: sick vs healthy, correctly classified vs not, etc.


|     TOTAL         | \| |           POS              |              NEG                  | \| |   prevalence = POS / TOTAL       |
|:-----------------:|:--:|:--------------------------:|:---------------------------------:|:--:|:--------------------------------:|
| ----------------- | \| | -------------------------- | --------------------------------- | \| | -------------------------------- |
| **PP (Pred Pos)** | \| |        **TP**              |        **FP** (error I)           | \| |  **precision** = TP / PP         |
| **PN (Pred Neg)** | \| |   **FN** (error II)        |             **TN**                | \| |                                  |
| ----------------- | \| | -------------------------- | --------------------------------- | \| | -------------------------------- |
|        .          | \| |  **recall** = TP / POS     | *false alarm = 1 - specificity*   | \| | **accuracy** = (TP + TN) / TOTAL |
|        .          | \| | *miss rate = 1 - recall*   |     **specificity** = TN / NEG    | \| |      **F-score** (see below)     |



Note: many terms exist to describe one measure.

|     TOTAL         | \| |           POS                           |              NEG                    | \| |   prevalence                              |                      |
|:-----------------:|:--:|:---------------------------------------:|:-----------------------------------:|:--:|:-----------------------------------------:|:--------------------:|
| ----------------- | \| | --------------------------------------- | ----------------------------------- | \| | ----------------------------------------- | -------------------- |
| **PP (Pred Pos)** | \| |        **TP**                           |        **FP** (error I)             | \| | **precision** / positive predictive value | False Discovery Rate |
| **PN (Pred Neg)** | \| |   **FN** (error II)                     |             **TN**                  | \| | False Omission Rate                       | NPV                  |
| ----------------- | \| | --------------------------------------- | ----------------------------------- | \| | ----------------------------------------- | -------------------- |
|        .          | \| |  **recall** / sensitivity / power / TPR | *false alarm / fall-out / FPR*      | \| | **accuracy**                              |                      |
|        .          | \| | *miss rate / FNR*                       | **specificity** / selectivity / TNR | \| | **F-score**                               |                      |

### Sensitivity vs Specificity

These two terms are widely used in medicine:

+ sensitivity measures the proportion of positives that are correctly identified. The smaller the error II, the higher the sensitivity (i.e. the recall).
+ specificity measures the proportion of negatives that are correctly identified. The smaller the error I, the higher the specificity.

They are prevalence-independent test characteristics, as their values are intrinsic to the test and do not depend on the disease prevalence in the population of interest.


### Recall vs Precision

These two terms are widely used in information retrieval:

+ recall is the probability that a relevant document is retrieved by the query.
+ precision is the probability that a retrieved document is relevant to the query.


### F-Score

The $F_{\beta}$-score combines recall & precision to measure the accuracy of a given test. The value of $\beta$ adds weight to the recall to adjust its comparative importance in the resulting score. In other terms, $\beta$ adds more weight either to false negatives or to false positives.


$$F_{\beta} = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}} = \frac {(1 + \beta^2) \cdot \mathrm{true\ positive} }{(1 + \beta^2) \cdot \mathrm{true\ positive} + \beta^2 \cdot \mathrm{false\ negative} + \mathrm{false\ positive}}$$

Commonly used values of $\beta$ are 0.5, 1 and 2. Note that $F_1$ is the harmonic mean of precision & recall:

$$F_1 = \frac{2}{\mathrm{recall}^{-1} + \mathrm{precision}^{-1}}$$
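
A minimal sketch of the two equivalent forms of $F_{\beta}$ in plain Python. The confusion counts are made up for illustration.

```python
def f_beta_from_rates(precision, recall, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def f_beta_from_counts(tp, fp, fn, beta=1.0):
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

tp, fp, fn = 8, 2, 4              # made-up confusion counts
precision = tp / (tp + fp)        # 0.8
recall = tp / (tp + fn)           # 2/3

f1 = f_beta_from_rates(precision, recall, beta=1.0)
f2 = f_beta_from_rates(precision, recall, beta=2.0)
```

Here the recall is lower than the precision, so giving it more weight with $\beta = 2$ lowers the score.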


### Soft Predictions vs Hard-Labels

+ soft predictions are the classifier's score for a given observation:
    + for multiclass problems, it's a vector of size L where L is the number of classes. Each value is the probability that the observation belongs to a given class. The sum of the probabilities in the vector is exactly one.
    + for binary classification, it is a single value: the probability that the observation belongs to the positive class.
+ hard labels are a function of soft predictions: it's the $\mathrm{argmax}_i(probs)$ of the scores vector for multi-classes and a threshold $prob > b$ for binary classification. The default threshold is 0.5 but can be tweaked to get better results - see AUC below.

## IX.2. Accuracy vs Balanced Accuracy

The accuracy is the probability that a value is correctly classified; it uses hard labels. It can be a misleading metric for imbalanced data sets: for a sample with 95 negative and 5 positive values, predicting all values as negative yields an accuracy of (0 + 95) / 100 = 95%.

The balanced accuracy is the mean of the TPR and TNR, i.e. the average true predictions across classes. In our example, the balanced accuracy is 0.5 x (95/95 + 0/5) = 0.5.
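
The 95/5 example above can be reproduced with a few lines of NumPy (a sketch that computes both metrics by hand):

```python
import numpy as np

y_true = np.array([0] * 95 + [1] * 5)
y_pred = np.zeros_like(y_true)            # always predict the negative class

accuracy = np.mean(y_true == y_pred)

tpr = np.mean(y_pred[y_true == 1] == 1)   # share of positives found
tnr = np.mean(y_pred[y_true == 0] == 0)   # share of negatives found
balanced_accuracy = (tpr + tnr) / 2
```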

_Note: for accuracy, the best constant prediction is the most frequent class._
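
The 95 / 5 example above can be reproduced with a few lines of plain Python. This is a minimal sketch; the helper names are chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of observations whose hard label is correct."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall (TPR and TNR in the binary case)."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)


y_true = [0] * 95 + [1] * 5   # 95 negatives, 5 positives
y_pred = [0] * 100            # constant prediction: everything is negative

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```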


## IX.3. Cross-Entropy / Log-Loss

The logloss greatly penalizes confident but wrong scores: it prefers many small mistakes to a few large ones. A detailed explanation of the binary cross-entropy / logloss can be found [here](https://towardsdatascience.com/understanding-binary-cross-entropy-log-loss-a-visual-explanation-a3ac6025181a).

_Note: the best constant value is a vector that lists the frequencies of each class in the dataset._
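
Both claims can be checked on a toy binary example. This is a minimal sketch: the labels and probabilities are made up, and the clipping constant `eps` is an assumed safeguard against `log(0)`.

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean binary cross-entropy; p_pred holds positive-class probabilities."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)


y_true = [1, 1, 1, 0]

# One confident mistake (third value) vs four mild ones.
few_large = log_loss(y_true, [0.99, 0.99, 0.01, 0.01])
many_small = log_loss(y_true, [0.6, 0.6, 0.6, 0.4])

# Best constant: the frequency of the positive class (3 / 4 here).
best_const = log_loss(y_true, [0.75] * 4)
other_const = log_loss(y_true, [0.5] * 4)
```

Here `few_large` is more than twice `many_small`, even though three of its four predictions are nearly perfect, and the constant 0.75 beats the constant 0.5.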


## IX.4. AUC ROC

The [ROC Curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) plots the recall as a function of the fall-out, i.e. TPR = f(FPR), when applying various discrimination thresholds (values above which a soft score is considered to belong to the positive class). It illustrates the trade-off between sensitivity and specificity, i.e. between type I and type II errors. Because it takes into account all possible thresholds, it removes the need to carefully choose one.

The Area Under the ROC Curve gives the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one (assuming 'positive' ranks higher than 'negative'). If we consider all possible \[negative, positive\] pairs and compare their soft scores, the AUC is the fraction of pairs where the positive score is higher than the negative score.

The AUC reflects how much the score distributions of the positive and negative classes overlap for our model: the less they overlap, the higher the AUC (see this [visual representation](http://www.navan.name/roc/)).

_Note: AUC only measures how well predictions are ranked, rather than their absolute values. It means that it is scale-invariant (i.e. multiplying all predictions by the same positive constant does not change it). It follows that all constant predictions lead to the same AUC._
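
The pairwise interpretation translates directly into code. The sketch below is a brute-force illustration on made-up scores, not an efficient implementation; counting ties as half a correctly ordered pair is an assumed (though common) convention.

```python
from itertools import product


def pairwise_auc(y_true, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Assumed convention: a tie counts as half a correctly ordered pair.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # made-up soft scores

auc = pairwise_auc(y_true, scores)                          # 3 of 4 pairs ordered correctly
auc_scaled = pairwise_auc(y_true, [10 * s for s in scores]) # same ranking, same AUC
auc_constant = pairwise_auc(y_true, [0.3] * 4)              # every pair is tied
```

With this tie convention, any constant prediction scores 0.5, which matches the diagonal of the ROC plot.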


## IX.5. Cohen's Kappa

Cohen's Kappa rescales the accuracy so that a random guess scores zero, and compares each model to this baseline; it's similar to how R squared works for regression. See the [wikipedia article](https://en.wikipedia.org/wiki/Cohen%27s_kappa) and this [paper](https://arxiv.org/abs/1509.07107) for more details.
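
As an illustration, the sketch below uses the usual definition $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed accuracy and $p_e$ is the accuracy expected by chance from the class frequencies of the labels and of the predictions. The data reuses the 95 / 5 example from the accuracy section; the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(y_true, y_pred):
    """Cohen's Kappa: accuracy rescaled so that chance agreement scores 0."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    # Chance agreement: product of the marginal frequencies, summed over classes.
    p_e = sum(true_freq[c] * pred_freq[c] for c in true_freq) / n ** 2
    return (p_o - p_e) / (1 - p_e)


y_true = [0] * 95 + [1] * 5
all_negative = [0] * 100

kappa_constant = cohen_kappa(y_true, all_negative)  # 95% accuracy, but no skill
kappa_perfect = cohen_kappa(y_true, y_true)
```

The all-negative prediction reaches 95% accuracy but gets a Kappa of zero, while a perfect prediction gets one. Note that this sketch does not handle the degenerate case $p_e = 1$.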


_Note: a cheat sheet of metrics used for classification can be found [here](https://queirozf.com/entries/evaluation-metrics-for-classification-quick-examples-references)._

# X. Loss vs Metrics

+ the metric is what we want to optimize: what we'll use to measure the quality of our model.
+ the optimization loss function is what the model will optimize.

Sometimes the metric can be used as a loss function (MSE, logloss, etc.), but it's not always possible. It might be possible to preprocess the data & optimize another metric, which amounts to directly optimizing the metric of interest (MSPE, MAPE). Another option is to optimize another metric and postprocess the predictions (accuracy, Cohen's Kappa).

Another solution that works in all circumstances is to use early stopping: train a model using any loss function and measure the metric of interest on a validation set. Training stops when the model starts to overfit the training set (i.e. the metric keeps improving on the training set but gets worse on the validation set).
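
The stopping rule can be sketched independently of any library. In the snippet below, the function name, the `patience` parameter and the validation scores are all hypothetical; real libraries expose their own early-stopping options.

```python
def best_iteration(val_metric_per_round, patience=3, higher_is_better=True):
    """Return the round with the best validation metric before stopping.

    Training is considered stopped once the metric has not improved
    for `patience` consecutive rounds.
    """
    best_round, best_value = 0, None
    for rnd, value in enumerate(val_metric_per_round):
        if best_value is None:
            improved = True
        elif higher_is_better:
            improved = value > best_value
        else:
            improved = value < best_value

        if improved:
            best_round, best_value = rnd, value
        elif rnd - best_round >= patience:
            break  # no improvement for `patience` rounds: stop training
    return best_round


# Made-up validation AUC measured after each boosting round / epoch.
val_auc = [0.70, 0.74, 0.77, 0.78, 0.775, 0.77, 0.76, 0.79]
stop_at = best_iteration(val_auc, patience=3)
```

The choice of `patience` is a trade-off: with `patience=3` the late improvement at the last round is never seen, whereas a larger value would train longer to find it.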




## X.1. Regression Metrics as Optimizers

|        Metric        | XGBoost | LightGBM | SKL.RandomForestRegressor | SKL.<prefix>Regression | SKL.SGDRegressor | NN Libraries |
|:--------------------:|:-------:|:--------:|:-------------------------:|:----------------------:|:----------------:|:------------:|
| MSE / MSPE* / MSLE** |    x    |     x    |             x             |            x           |         x        |       x      |
|      MAE / MAPE      |    -    |     x    |             x             |            -           |         -        |       x      |

*For MSPE and MAPE, there are two options: either use the `sample_weight` parameter when the library allows it or simply resample the train set: `df.sample(weights=sample_weights)`. Note that results will be better and more stable when resampling several times and averaging the predictions.
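
The reason weighting works is that MSPE is a weighted MSE with weights $1 / y_i^2$, and MAPE is a weighted MAE with weights $1 / |y_i|$. The sketch below checks this identity on made-up values; the metrics are written as fractions (without the 100% factor), which is an assumption about the exact definition used.

```python
def mspe(y_true, y_pred):
    """Mean squared percentage error, as a fraction."""
    return sum(((y - p) / y) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, as a fraction."""
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


def weighted_mse(y_true, y_pred, w):
    return sum(wi * (y - p) ** 2 for y, p, wi in zip(y_true, y_pred, w)) / len(y_true)


def weighted_mae(y_true, y_pred, w):
    return sum(wi * abs(y - p) for y, p, wi in zip(y_true, y_pred, w)) / len(y_true)


y_true = [10.0, 200.0, 3000.0]   # made-up targets
y_pred = [12.0, 180.0, 3300.0]   # made-up predictions

mspe_weights = [1 / y ** 2 for y in y_true]   # pair with an MSE loss
mape_weights = [1 / abs(y) for y in y_true]   # pair with an MAE loss
```

These weights can then be passed as sample weights or used as resampling weights, as described above; normalizing them to sum to one does not change the optimum.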

**For MSLE, you can optimize MSE when training the model on the logarithmic scale using $z_i = \ln(y_i + 1)$, then convert the predictions on the test set back to the original scale: $\hat{y}_i = \exp(\hat{z}_i) - 1$.
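
The transform and its inverse can be sketched with `math.log1p` and `math.expm1`. There is no real model here: the "predictions" on the log scale are simulated by shifting the transformed targets, purely to show that the MSE on the log scale equals the MSLE on the original scale.

```python
import math


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def msle(y_true, y_pred):
    """Mean squared logarithmic error on the original scale."""
    return sum((math.log1p(y) - math.log1p(p)) ** 2
               for y, p in zip(y_true, y_pred)) / len(y_true)


y = [10.0, 100.0, 1000.0]                # made-up targets
z = [math.log1p(v) for v in y]           # z_i = ln(y_i + 1): train on this scale

z_hat = [v + 0.1 for v in z]             # stand-in for model predictions on the log scale
y_hat = [math.expm1(v) for v in z_hat]   # y_hat_i = exp(z_hat_i) - 1: back to original scale

mse_log_scale = mse(z, z_hat)
msle_original_scale = msle(y, y_hat)
```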

The [Huber loss](https://en.wikipedia.org/wiki/Huber_loss) combines the best properties of both L1 and L2 losses. It is quadratic (like the L2 loss) for small residuals around zero, which means it is differentiable at zero, and linear (like the L1 loss) for larger residuals, making it less sensitive to outliers.
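
A minimal sketch of the Huber loss for a single residual, following the standard definition with a threshold `delta` (the default of 1.0 is an arbitrary choice for illustration):

```python
def huber(residual, delta=1.0):
    """Huber loss: quadratic within [-delta, delta], linear outside."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual ** 2       # L2-like region
    return delta * (a - 0.5 * delta)     # L1-like region


small = huber(0.5)     # quadratic branch
large = huber(3.0)     # linear branch
outlier = huber(30.0)  # grows linearly, not quadratically
```

The two branches meet at `|residual| == delta`, so the loss is continuous there, and an outlier with a residual of 30 costs 29.5 instead of the 450 that the squared loss $0.5 \cdot 30^2$ would charge.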


## X.3. Classification Metrics as Optimizers

|  Metric    | XGBoost | LightGBM | SKL.RandomForestClassifier | SKL.<prefix>Regression | SKL.SGDClassifier | NN Libraries |
|:----------:|:-------:|:--------:|:-------------------------:|:----------------------:|:----------------:|:------------:|
| Logloss    |    x    |     x    |             -             |            x           |         x        |       x      |
| Accuracy   |    -    |     -    |             -             |            -           |         -        |       -      |
| AUC        |    x    |     x    |             -             |            -           |         -        |       o      |

If a model cannot directly optimize logloss, we can calibrate its results by fitting another model to its predictions, which will itself be optimized with logloss:

+ Platt scaling: fit a Logistic Regression to the predictions.
+ Isotonic scaling: fit an Isotonic Regression to the predictions.
+ Stacking: fit XGBoost or NN to the predictions.
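
To make the first option concrete, here is a minimal sketch of Platt scaling in plain Python: a one-feature logistic regression $p = \sigma(a \cdot s + b)$ fitted on the raw scores $s$ by gradient descent on the logloss. The plain gradient descent, the learning rate, the number of iterations and the data are all assumptions made for illustration; in practice one would fit a library Logistic Regression on held-out predictions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_loss(labels, probs):
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)


def fit_platt(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = sigmoid(a * score + b) by gradient descent on the logloss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        # Gradient of the logloss with respect to the logit is (p - y).
        errors = [sigmoid(a * s + b) - y for s, y in zip(scores, labels)]
        a -= lr * sum(e * s for e, s in zip(errors, scores)) / n
        b -= lr * sum(errors) / n
    return a, b


# Made-up uncalibrated scores (e.g. margins) on a validation set.
scores = [-2.0, -1.0, -0.5, 0.3, 0.5, 1.0, 1.5, 2.5]
labels = [0, 0, 1, 0, 1, 1, 0, 1]

a, b = fit_platt(scores, labels)
raw_probs = [sigmoid(s) for s in scores]            # naive mapping of scores to probabilities
calibrated = [sigmoid(a * s + b) for s in scores]   # Platt-scaled probabilities
```

On this toy data the calibrated probabilities have a lower logloss than the naive ones. The calibration model should be fitted on data the original model was not trained on, otherwise it inherits the original model's overfitting.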

Accuracy cannot be optimized directly because each observation contributes either one (the class is correctly predicted) or zero (the class is incorrectly predicted), so the metric is piecewise constant in the model parameters. It means that its gradient is zero almost everywhere, so gradient descent cannot work. Training a model with any loss function and then tuning the threshold is the best approach.
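
Threshold tuning can be sketched as a simple search over candidate thresholds. The data below is made up to mimic an under-confident model, and the function names are hypothetical; in practice the search should run on a validation set, not on the training data.

```python
def accuracy_at(threshold, y_true, probs):
    """Accuracy of the hard labels obtained with `prob > threshold`."""
    correct = sum((p > threshold) == bool(y) for y, p in zip(y_true, probs))
    return correct / len(y_true)


def best_threshold(y_true, probs):
    """Try 0.0 and every distinct score as a threshold; keep the most accurate."""
    candidates = sorted(set(probs) | {0.0})
    return max(candidates, key=lambda t: accuracy_at(t, y_true, probs))


# Made-up validation data: the model ranks well but is under-confident.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
probs = [0.05, 0.10, 0.20, 0.25, 0.30, 0.35, 0.40, 0.45]

default_acc = accuracy_at(0.5, y_true, probs)   # everything predicted negative
tuned = best_threshold(y_true, probs)
tuned_acc = accuracy_at(tuned, y_true, probs)
```

With the default threshold no observation is labeled positive, so the accuracy is 0.5; the tuned threshold raises it to 0.875 without retraining the model.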

The AUC can be optimized with a pairwise loss, since it counts correctly ordered pairs as discussed in the AUC section above. Implementations for NN libraries can be found online.
