
Machine learning and AI 6

  1. Gradient Boost
  2. Regularization
  3. A unique regression tree (XGBoost Tree)
  • In the beginning, all sample residuals are in the same leaf
  • Calculate a quality score (similarity score) for the residuals: similarity score = [(sum of residuals)^2]/[number of residuals + lambda]
  • Tree complexity parameter: gamma
    • Tree pruning: compute Gain - gamma for the lowest branch; if the difference is positive, keep the branch (and therefore all of its parent nodes); otherwise, remove the branch and test its parent next. Setting gamma to zero does not turn off pruning entirely: lambda shrinks the similarity scores and can make Gain itself negative, so branches can still be pruned due to regularization.
  • Regularization parameter lambda: default value 1, intended to reduce the prediction's sensitivity to individual observations. A larger lambda results in more pruning (combining more individual observations into the same leaf) because it shrinks the similarity scores and produces smaller Gain values, thus helping prevent over-fitting the training data.
  • Tree output = sum of residuals / (number of residuals + lambda)
  • Learning rate (Eta): default value is 0.3
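
A minimal numeric sketch of these formulas on a toy set of residuals (the lambda and gamma values are just illustrative, not library defaults, and this is not the real XGBoost implementation):

```python
def similarity(residuals, lam):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

def leaf_output(residuals, lam):
    # sum of residuals / (number of residuals + lambda)
    return sum(residuals) / (len(residuals) + lam)

lam, gamma = 1.0, 2.0                  # hypothetical lambda and gamma
left, right = [-10.5, 6.5], [7.5]      # residuals in the two leaves of a candidate split
root = left + right

gain = similarity(left, lam) + similarity(right, lam) - similarity(root, lam)
print(f"Gain = {gain:.2f}")
# Keep the branch only when Gain - gamma is positive:
print("keep branch" if gain - gamma > 0 else "prune branch")
print(f"left leaf output = {leaf_output(left, lam):.2f}")
```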
  1. XGBoost for regression: the algorithm
  • Start from an initial prediction of 0.5 for all samples in the training set
  • Calculate the residuals
  • Calculate similarity scores and Gains to determine how to split the data: Gain = LeftSplit_similarity + RightSplit_similarity - Root_similarity; the goal is to maximize Gain over the candidate splits
  • Prune the tree recursively, starting from the lowest branch, by comparing Gain against the user-defined tree complexity parameter gamma: if Gain - gamma is positive, keep the branch (and therefore its parent nodes); otherwise, remove the branch and recursively test its parent. Setting gamma to zero does not turn off pruning, since lambda can still make Gain negative
  • Calculate output values for the tree
  • Make predictions for all samples with the most recently learned model
  • Re-calculate the residuals
  • Repeat until the maximum number of trees is reached or a pre-defined error threshold is met (a toy sketch of this loop follows)
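
A compact sketch of this loop on a toy one-dimensional dataset, using depth-1 trees for brevity (the real library grows deeper trees and adds the refinements discussed below; all names and values here are illustrative):

```python
import numpy as np

def similarity(res, lam):
    return res.sum() ** 2 / (len(res) + lam)

def fit_stump(x, res, lam, gamma):
    """Fit a depth-1 tree on the residuals by maximizing Gain over thresholds."""
    root_sim = similarity(res, lam)
    best_gain, best_t = -np.inf, None
    for t in np.unique(x)[1:]:                    # candidate thresholds
        left, right = res[x < t], res[x >= t]
        g = similarity(left, lam) + similarity(right, lam) - root_sim
        if g > best_gain:
            best_gain, best_t = g, t
    if best_gain - gamma <= 0:                    # pruned back to a single leaf
        out = res.sum() / (len(res) + lam)
        return lambda q: np.full(len(q), out)
    left_out = res[x < best_t].sum() / ((x < best_t).sum() + lam)
    right_out = res[x >= best_t].sum() / ((x >= best_t).sum() + lam)
    return lambda q: np.where(q < best_t, left_out, right_out)

x = np.array([1.0, 2.0, 3.0, 4.0])                # toy feature
y = np.array([-10.0, 7.0, 8.0, -7.5])             # toy target

eta, lam, gamma, n_trees = 0.3, 1.0, 0.0, 30
pred = np.full_like(y, 0.5)                       # initial prediction 0.5
for _ in range(n_trees):
    res = y - pred                                # residuals vs. current model
    tree = fit_stump(x, res, lam, gamma)
    pred = pred + eta * tree(x)                   # scale each new tree by eta
print(pred.round(2))                              # boosted predictions after n_trees rounds
```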
  1. XGBoost for classification
  • similarity score = (sum of residuals)^2 / {sum over i [Previous Probability_i * (1 - Previous Probability_i)] + lambda}
  • XGBoost also enforces a minimum number of residuals in each leaf, determined by calculating a quantity called Cover
    • for a regression task, cover = number of residuals in a leaf
    • for a classification task, cover = {sum over i [Previous Probability_i * (1 - Previous Probability_i)]}, i.e., the similarity-score denominator without lambda
    • the threshold on Cover is min_child_weight, which defaults to 1, i.e., at least one residual per leaf. With this default, Cover has no effect on how a regression tree is grown; for classification, the constraint can matter because Cover = sum over i [Previous Probability_i * (1 - Previous Probability_i)] can fall below 1
  • leaf output value for classification = sum of residuals / (Cover + lambda) = sum of residuals / {sum over i [Previous Probability_i * (1 - Previous Probability_i)] + lambda}

(figure: similarity score and leaf output value formulas)
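
A toy illustration of the classification formulas above (one leaf with three observations; all values are made up):

```python
p = [0.5, 0.5, 0.5]                       # previous predicted probabilities
y = [1, 0, 1]                             # observed labels
r = [yi - pi for yi, pi in zip(y, p)]     # residuals
lam = 0.0

hess = sum(pi * (1 - pi) for pi in p)     # sum over i of p_i * (1 - p_i)
similarity = sum(r) ** 2 / (hess + lam)   # classification similarity score
cover = hess                              # Cover excludes lambda
output = sum(r) / (hess + lam)            # leaf output (a log-odds contribution)
print(similarity, cover, output)          # 0.333..., 0.75, 0.666...
```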

  1. Approximate Greedy Algorithm
  • XGBoost uses a greedy algorithm: it makes each split decision without looking ahead to check whether it is the absolute best choice in the long run
  • Instead of testing every single threshold, we could divide the data into Quantiles and only use the quantiles as candidate thresholds to split the observations
  • Using more quantiles makes the predictions more accurate, since each threshold then represents a smaller cluster of observations; but the more quantiles we have, the more thresholds there are to test and the longer it takes to build a tree
  • Approximate Greedy Algorithm: instead of testing all possible thresholds, we only test the quantiles.
    • By default, XGBoost uses ~33 quantiles
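
A small sketch of the idea in plain NumPy (the 33 matches the rough default mentioned above; data is synthetic):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10_000)   # one feature column
n_quantiles = 33
# interior quantile boundaries become the only candidate split thresholds
thresholds = np.quantile(x, np.linspace(0, 1, n_quantiles + 1)[1:-1])
print(f"{len(thresholds)} candidate thresholds instead of {len(np.unique(x)) - 1}")
```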
  1. Weighted Quantile Sketch
  • Sketch algorithm: a quantile sketch algorithm combines values from each node (when the data is spread over several machines) to build an approximate histogram
  • The approximate histogram is then used to calculate approximate quantiles
  • The Approximate Greedy Algorithm uses these approximate quantiles
  • Weighted quantiles: the sum of the weights in each quantile is the same
    • For regression, the weights are all equal to 1, so weighted quantiles are the same as ordinary quantiles
    • For classification, the weights are Previous probability_i * (1-Previous probability_i)
    • In practice, weights are calculated after building each tree
    • the advantage of the weighted quantile sketch is that we get smaller quantiles (containing fewer observations) exactly where we need them, i.e., where the previous predictions are least confident; a toy example follows
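
A sketch of weighted quantile boundaries (synthetic values; not the actual sketch data structure): sort by feature value, then cut where the cumulative weight crosses equal fractions of the total weight. For classification the weight of observation i is p_i * (1 - p_i), so boundaries crowd together where the previous probabilities are least confident.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)                     # feature values
p = rng.uniform(0.01, 0.99, size=1_000)        # previous predicted probabilities
w = p * (1 - p)                                # per-observation weights

order = np.argsort(x)
cum_frac = np.cumsum(w[order]) / w.sum()       # cumulative weight fraction
n_quantiles = 10
# first position where each target fraction of total weight is reached
idx = np.searchsorted(cum_frac, np.linspace(0, 1, n_quantiles + 1)[1:-1])
boundaries = x[order][idx]
print(boundaries.round(2))
```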
  1. Parallel learning: split finding is parallelized (e.g., across features and threads) to speed up building trees
  2. Sparsity-Aware Split Finding
  • how to build trees with missing values: each split learns a default direction by trying to send the observations with missing values left and then right, keeping whichever direction yields the larger Gain (sketched below)
  • how to predict new observations with missing values: they simply follow the learned default direction
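
A sketch of the split-finding idea under those assumptions (toy data; `similarity` as defined earlier; not the library's actual routine):

```python
import numpy as np

def similarity(res, lam=1.0):
    return res.sum() ** 2 / (len(res) + lam) if len(res) else 0.0

def gain_with_default_direction(x, res, t):
    """Try routing missing values left, then right; keep the better Gain."""
    present = ~np.isnan(x)
    left = present & (x < t)
    right = present & (x >= t)
    root_sim = similarity(res)
    best = None
    for direction in ("left", "right"):
        lm, rm = left.copy(), right.copy()
        if direction == "left":
            lm[~present] = True      # missing values go left
        else:
            rm[~present] = True      # missing values go right
        g = similarity(res[lm]) + similarity(res[rm]) - root_sim
        if best is None or g > best[0]:
            best = (g, direction)
    return best  # new observations with missing values follow this direction

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
res = np.array([-8.0, 9.0, 7.5, 8.5, -7.0])
print(gain_with_default_direction(x, res, t=4.0))
```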
  1. Cache-aware Access
  • Put the gradients and Hessians in the CPU cache so that similarity scores and output values can be calculated rapidly
  1. Blocks for out-of-core computation
  • compress the data stored on disk; read it back and decompress it on the fly, trading cheap CPU time for slow disk reads
  • Sharding for fast access to large datasets stored on multiple disks
  1. XGBoost can also speed things up by building each tree with only a random subset of the training data (row subsampling)
  2. XGBoost can speed up building trees by looking at only a random subset of features when deciding how to split the data (column subsampling); both knobs appear in the parameter sketch below
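
For reference, here is how the knobs discussed above map onto the xgboost Python package's scikit-learn-style interface (parameter names are from the XGBoost docs; the values are purely illustrative):

```python
import numpy as np
import xgboost as xgb   # assumes the xgboost package is installed

model = xgb.XGBRegressor(
    n_estimators=100,       # maximum number of trees
    learning_rate=0.3,      # eta
    gamma=0.0,              # tree complexity penalty used in pruning
    reg_lambda=1.0,         # lambda, the L2 regularization term
    min_child_weight=1.0,   # minimum Cover allowed in a leaf
    subsample=0.8,          # random subset of rows per tree
    colsample_bytree=0.8,   # random subset of features per tree
)

X = np.random.rand(200, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
model.fit(X, y)
print(model.predict(X[:3]))
```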

  1. Model selection: estimating the performance of different models in order to choose the best one.

  • train-validate-test split
  • resampling approach: nested k-fold CV
  • statistical/probabilistic approach: provides an analytical technique for scoring and choosing among candidate models.
  • Comparison: For the first two approaches, only model performance is assessed, regardless of model complexity; the third one evaluates both. The first two are empirical while the last one is analytical (statistical in essence)
  • Model complexity may be evaluated as the number of degrees of freedom or parameters in the model.
  • Model performance may be evaluated using a probabilistic framework, such as log-likelihood under the framework of maximum likelihood estimation.
  • Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.
  • A benefit of probabilistic model selection methods is that a test dataset is not required, meaning that all of the data can be used to fit the model, and the final model that will be used for prediction in the domain can be scored directly.
  • A limitation of probabilistic model selection methods is that the same general statistic cannot be calculated across a range of different types of models. Instead, the metric must be carefully derived for each model.
    • It should be noted that the AIC statistic is designed for preplanned comparisons between models (as opposed to comparisons of many models during automated searches)
    • Such criteria do not take account of the uncertainty in the model parameters, however, and in practice they tend to favor overly simple models.
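
As a concrete example of the resampling approach, a nested k-fold CV sketch with scikit-learn (the inner loop tunes hyperparameters, the outer loop estimates performance; the model and grid are arbitrary choices for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)
inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)  # inner 5-fold CV
scores = cross_val_score(inner, X, y, cv=5)                       # outer 5-fold CV
print(scores.mean())   # performance estimate for the tuned model
```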
  1. Probabilistic Model Selection
  • Akaike Information Criterion
  • Bayesian Information Criterion
  • Minimum Description Length
  1. An alternative to the empirical approaches to model selection above (train/validate/test split and k-fold CV) involves using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model.
  2. The benefit of these information criterion statistics is that they do not require a hold-out test set, although a limitation is that they do not take the uncertainty of the models into account and may end up selecting models that are too simple.
  3. The Akaike and Bayesian Information Criteria are two ways of scoring a model based on its log-likelihood and its complexity.
  4. Minimum Description Length provides another scoring method, from information theory, that can be shown to be equivalent to BIC in some situations.
  5. AIC
  • AIC = -2/N * LL + 2 * k/N, where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model.
  • To use AIC for model selection, we simply choose the model giving smallest AIC over the set of models considered.
  • Compared to the BIC method (below), the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.
  1. BIC: Bayesian Information Criterion
  • BIC = -2 * LL + log(N) * k
  • the model with the lowest BIC is selected.
  • Unlike the AIC, the BIC penalizes the model more for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected.
  • the derivation of BIC under the Bayesian probability framework means that, given a family of candidate models that includes the true model for the dataset, the probability that BIC selects the true model approaches one as the training-set size N -> infinity. The same cannot be said for the AIC score.
  • A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple.
  1. MDL: Minimum Description Length
  • MDL = L(h) + L(D | h), where h is the model, D is the predictions made by the model, L(h) is the number of bits required to represent the model, and L(D | h) is the number of bits required to represent the predictions from the model on the training dataset.
  • the model with the lowest MDL is selected.
  • The MDL principle takes the stance that the best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory
  • The MDL calculation is very similar to BIC and can be shown to be equivalent in some situations.
  1. For OLS (ordinary least squares linear regression), AIC = n * log(MSE) + 2 * k, where the log-likelihood term is estimated via the log of the mean squared error
  • The likelihood function for a linear regression model can be shown to be equivalent to the least squares objective; therefore, we can estimate the maximum likelihood of the model via the mean squared error metric.
  • Likewise, BIC = n * log(MSE) + k * log(n)
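
A short sketch that scores an OLS fit with both criteria via log(MSE) (synthetic data; scikit-learn is used only for the fit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
mse = np.mean((y - model.predict(X)) ** 2)
n, k = len(y), X.shape[1] + 1          # parameters = coefficients + intercept
aic = n * np.log(mse) + 2 * k
bic = n * np.log(mse) + k * np.log(n)
print(aic, bic)                        # lower is better for both
```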