
Machine learning and AI 6

  1. Gradient Boost
  2. Regularization
  3. A unique regression tree (XGBoost Tree)
  • In the beginning, all sample residuals are in the same leaf
  • Calculate a quality score (similarity score) for the residuals: similarity score = [(sum of residuals)^2]/[number of residuals + lambda]
  • Tree complexity parameter: gamma
    • Tree pruning: compute Gain - gamma for the lowest branch; if the difference is positive, keep the branch (and therefore all of its parent nodes); otherwise, remove the branch and test its parent next. Setting gamma to zero does not turn off pruning entirely: lambda shrinks the similarity scores and can make Gain itself negative, so branches can still be pruned due to regularization.
  • Regularization parameter lambda: default value 1, intended to reduce the prediction's sensitivity to individual observations. A larger lambda results in more pruning (combining more individual observations into the same leaf) because it shrinks the similarity scores and produces smaller Gain values, thus helping prevent over-fitting the training data.
  • Tree output = sum of residuals / (number of residuals + lambda)
  • Learning rate (Eta): default value is 0.3
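
A minimal numeric sketch of these formulas on a toy set of residuals (the lambda and gamma values are just illustrative, not library defaults, and this is not the real XGBoost implementation):

```python
def similarity(residuals, lam):
    # (sum of residuals)^2 / (number of residuals + lambda)
    return sum(residuals) ** 2 / (len(residuals) + lam)

def leaf_output(residuals, lam):
    # sum of residuals / (number of residuals + lambda)
    return sum(residuals) / (len(residuals) + lam)

lam, gamma = 1.0, 2.0                  # hypothetical lambda and gamma
left, right = [-10.5, 6.5], [7.5]      # residuals in the two leaves of a candidate split
root = left + right

gain = similarity(left, lam) + similarity(right, lam) - similarity(root, lam)
print(f"Gain = {gain:.2f}")
# Keep the branch only when Gain - gamma is positive:
print("keep branch" if gain - gamma > 0 else "prune branch")
print(f"left leaf output = {leaf_output(left, lam):.2f}")
```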
  1. XGBoost for regression: the algorithm
  • Start from an initial prediction of 0.5 for all samples in the training set
  • Calculate the residuals
  • Calculate similarity scores and Gains to determine how to split the data: Gain = LeftSplit_similarity + RightSplit_similarity - Root_similarity; the goal is to maximize Gain over the candidate splits
  • Prune the tree recursively, starting from the lowest branch, by comparing Gain against the user-defined tree complexity parameter gamma: if Gain - gamma is positive, keep the branch (and therefore its parent nodes); otherwise, remove the branch and recursively test its parent. Setting gamma to zero does not turn off pruning, since lambda can still make Gain negative
  • Calculate output values for the tree
  • Make predictions for all samples with the most recently learned model
  • Re-calculate the residuals
  • Repeat until the maximum number of trees is reached or a pre-defined error threshold is met (a toy sketch of this loop follows)
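
A compact sketch of this loop on a toy one-dimensional dataset, using depth-1 trees for brevity (the real library grows deeper trees and adds the refinements discussed below; all names and values here are illustrative):

```python
import numpy as np

def similarity(res, lam):
    return res.sum() ** 2 / (len(res) + lam)

def fit_stump(x, res, lam, gamma):
    """Fit a depth-1 tree on the residuals by maximizing Gain over thresholds."""
    root_sim = similarity(res, lam)
    best_gain, best_t = -np.inf, None
    for t in np.unique(x)[1:]:                    # candidate thresholds
        left, right = res[x < t], res[x >= t]
        g = similarity(left, lam) + similarity(right, lam) - root_sim
        if g > best_gain:
            best_gain, best_t = g, t
    if best_gain - gamma <= 0:                    # pruned back to a single leaf
        out = res.sum() / (len(res) + lam)
        return lambda q: np.full(len(q), out)
    left_out = res[x < best_t].sum() / ((x < best_t).sum() + lam)
    right_out = res[x >= best_t].sum() / ((x >= best_t).sum() + lam)
    return lambda q: np.where(q < best_t, left_out, right_out)

x = np.array([1.0, 2.0, 3.0, 4.0])                # toy feature
y = np.array([-10.0, 7.0, 8.0, -7.5])             # toy target

eta, lam, gamma, n_trees = 0.3, 1.0, 0.0, 30
pred = np.full_like(y, 0.5)                       # initial prediction 0.5
for _ in range(n_trees):
    res = y - pred                                # residuals vs. current model
    tree = fit_stump(x, res, lam, gamma)
    pred = pred + eta * tree(x)                   # scale each new tree by eta
print(pred.round(2))                              # boosted predictions after n_trees rounds
```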
  1. XGBoost for classification
  • similarity score = (sum of residuals)^2 / {sum over i [Previous Probability_i * (1 - Previous Probability_i)] + lambda}
  • XGBoost also enforces a minimum number of residuals in each leaf, determined by calculating a quantity called Cover
    • for a regression task, cover = number of residuals in a leaf
    • for a classification task, cover = {sum over i [Previous Probability_i * (1 - Previous Probability_i)]}, i.e., the similarity-score denominator without lambda
    • the threshold on Cover is min_child_weight, which defaults to 1, i.e., at least one residual per leaf. With this default, Cover has no effect on how a regression tree is grown; for classification, the constraint can matter because Cover = sum over i [Previous Probability_i * (1 - Previous Probability_i)] can fall below 1
  • leaf output value for classification = sum of residuals / (Cover + lambda) = sum of residuals / {sum over i [Previous Probability_i * (1 - Previous Probability_i)] + lambda}

(figure: similarity score and leaf output value formulas)
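
A toy illustration of the classification formulas above (one leaf with three observations; all values are made up):

```python
p = [0.5, 0.5, 0.5]                       # previous predicted probabilities
y = [1, 0, 1]                             # observed labels
r = [yi - pi for yi, pi in zip(y, p)]     # residuals
lam = 0.0

hess = sum(pi * (1 - pi) for pi in p)     # sum over i of p_i * (1 - p_i)
similarity = sum(r) ** 2 / (hess + lam)   # classification similarity score
cover = hess                              # Cover excludes lambda
output = sum(r) / (hess + lam)            # leaf output (a log-odds contribution)
print(similarity, cover, output)          # 0.333..., 0.75, 0.666...
```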

  1. Approximate Greedy Algorithm
  • XGBoost uses a greedy algorithm: it makes each split decision without looking ahead to check whether it is the absolute best choice in the long run
  • Instead of testing every single threshold, we could divide the data into Quantiles and only use the quantiles as candidate thresholds to split the observations
  • Using more quantiles makes the predictions more accurate, since each threshold then represents a smaller cluster of observations; but the more quantiles we have, the more thresholds there are to test and the longer it takes to build a tree
  • Approximate Greedy Algorithm: instead of testing all possible thresholds, we only test the quantiles.
    • By default, XGBoost uses ~33 quantiles
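
A small sketch of the idea in plain NumPy (the 33 matches the rough default mentioned above; data is synthetic):

```python
import numpy as np

x = np.random.default_rng(0).normal(size=10_000)   # one feature column
n_quantiles = 33
# interior quantile boundaries become the only candidate split thresholds
thresholds = np.quantile(x, np.linspace(0, 1, n_quantiles + 1)[1:-1])
print(f"{len(thresholds)} candidate thresholds instead of {len(np.unique(x)) - 1}")
```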
  1. Weighted Quantile Sketch
  • Sketch algorithm: a quantile sketch algorithm combines values from each node (when the data is spread over several machines) to build an approximate histogram
  • The approximate histogram is then used to calculate approximate quantiles
  • The Approximate Greedy Algorithm uses these approximate quantiles
  • Weighted quantiles: the sum of the weights in each quantile is the same
    • For regression, the weights are all equal to 1, so weighted quantiles are the same as ordinary quantiles
    • For classification, the weights are Previous probability_i * (1-Previous probability_i)
    • In practice, weights are calculated after building each tree
    • the advantage of the weighted quantile sketch is that we get smaller quantiles (containing fewer observations) exactly where we need them, i.e., where the previous predictions are least confident; a toy example follows
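
A sketch of weighted quantile boundaries (synthetic values; not the actual sketch data structure): sort by feature value, then cut where the cumulative weight crosses equal fractions of the total weight. For classification the weight of observation i is p_i * (1 - p_i), so boundaries crowd together where the previous probabilities are least confident.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000)                     # feature values
p = rng.uniform(0.01, 0.99, size=1_000)        # previous predicted probabilities
w = p * (1 - p)                                # per-observation weights

order = np.argsort(x)
cum_frac = np.cumsum(w[order]) / w.sum()       # cumulative weight fraction
n_quantiles = 10
# first position where each target fraction of total weight is reached
idx = np.searchsorted(cum_frac, np.linspace(0, 1, n_quantiles + 1)[1:-1])
boundaries = x[order][idx]
print(boundaries.round(2))
```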
  1. Parallel learning: split finding is parallelized (e.g., across features and threads) to speed up building trees
  2. Sparsity-Aware Split Finding
  • how to build trees with missing values: each split learns a default direction by trying to send the observations with missing values left and then right, keeping whichever direction yields the larger Gain (sketched below)
  • how to predict new observations with missing values: they simply follow the learned default direction
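
A sketch of the split-finding idea under those assumptions (toy data; `similarity` as defined earlier; not the library's actual routine):

```python
import numpy as np

def similarity(res, lam=1.0):
    return res.sum() ** 2 / (len(res) + lam) if len(res) else 0.0

def gain_with_default_direction(x, res, t):
    """Try routing missing values left, then right; keep the better Gain."""
    present = ~np.isnan(x)
    left = present & (x < t)
    right = present & (x >= t)
    root_sim = similarity(res)
    best = None
    for direction in ("left", "right"):
        lm, rm = left.copy(), right.copy()
        if direction == "left":
            lm[~present] = True      # missing values go left
        else:
            rm[~present] = True      # missing values go right
        g = similarity(res[lm]) + similarity(res[rm]) - root_sim
        if best is None or g > best[0]:
            best = (g, direction)
    return best  # new observations with missing values follow this direction

x = np.array([1.0, np.nan, 3.0, np.nan, 5.0])
res = np.array([-8.0, 9.0, 7.5, 8.5, -7.0])
print(gain_with_default_direction(x, res, t=4.0))
```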
  1. Cache-aware Access
  • Put the gradients and Hessians in the CPU cache so that similarity scores and output values can be calculated rapidly
  1. Blocks for out-of-core computation
  • compress the data stored on disk; read it back and decompress it on the fly, trading cheap CPU time for slow disk reads
  • Sharding for fast access to large datasets stored on multiple disks
  1. XGBoost can also speed things up by building each tree with only a random subset of the training data (row subsampling)
  2. XGBoost can speed up building trees by looking at only a random subset of features when deciding how to split the data (column subsampling); both knobs appear in the parameter sketch below
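
For reference, here is how the knobs discussed above map onto the xgboost Python package's scikit-learn-style interface (parameter names are from the XGBoost docs; the values are purely illustrative):

```python
import numpy as np
import xgboost as xgb   # assumes the xgboost package is installed

model = xgb.XGBRegressor(
    n_estimators=100,       # maximum number of trees
    learning_rate=0.3,      # eta
    gamma=0.0,              # tree complexity penalty used in pruning
    reg_lambda=1.0,         # lambda, the L2 regularization term
    min_child_weight=1.0,   # minimum Cover allowed in a leaf
    subsample=0.8,          # random subset of rows per tree
    colsample_bytree=0.8,   # random subset of features per tree
)

X = np.random.rand(200, 4)
y = X @ np.array([1.0, -2.0, 0.5, 3.0])
model.fit(X, y)
print(model.predict(X[:3]))
```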

  1. Model selection: estimating the performance of different models in order to choose the best one.

  • train-validate-test split
  • resampling approach: nested k-fold CV
  • statistical/probabilistic approach: provides an analytical technique for scoring and choosing among candidate models.
  • Comparison: For the first two approaches, only model performance is assessed, regardless of model complexity; the third one evaluates both. The first two are empirical while the last one is analytical (statistical in essence)
  • Model complexity may be evaluated as the number of degrees of freedom or parameters in the model.
  • Model performance may be evaluated using a probabilistic framework, such as log-likelihood under the framework of maximum likelihood estimation.
  • Historically various ‘information criteria’ have been proposed that attempt to correct for the bias of maximum likelihood by the addition of a penalty term to compensate for the over-fitting of more complex models.
  • A benefit of probabilistic model selection methods is that a test dataset is not required, meaning that all of the data can be used to fit the model, and the final model that will be used for prediction in the domain can be scored directly.
  • A limitation of probabilistic model selection methods is that the same general statistic cannot be calculated across a range of different types of models. Instead, the metric must be carefully derived for each model.
    • It should be noted that the AIC statistic is designed for preplanned comparisons between models (as opposed to comparisons of many models during automated searches)
    • Such criteria do not take account of the uncertainty in the model parameters, however, and in practice they tend to favor overly simple models.
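
As a concrete example of the resampling approach, a nested k-fold CV sketch with scikit-learn (the inner loop tunes hyperparameters, the outer loop estimates performance; the model and grid are arbitrary choices for illustration):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_diabetes(return_X_y=True)
inner = GridSearchCV(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, cv=5)  # inner 5-fold CV
scores = cross_val_score(inner, X, y, cv=5)                       # outer 5-fold CV
print(scores.mean())   # performance estimate for the tuned model
```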
  1. Probabilistic Model Selection
  • Akaike Information Criterion
  • Bayesian Information Criterion
  • Minimum Description Length
  1. An alternative to the empirical approaches to model selection above (train/validate/test split and k-fold CV) involves using probabilistic statistical measures that attempt to quantify both the model performance on the training dataset and the complexity of the model.
  2. The benefit of these information criterion statistics is that they do not require a hold-out test set, although a limitation is that they do not take the uncertainty of the models into account and may end up selecting models that are too simple.
  3. The Akaike and Bayesian Information Criteria are two ways of scoring a model based on its log-likelihood and its complexity.
  4. Minimum Description Length provides another scoring method, from information theory, that can be shown to be equivalent to BIC in some situations.
  5. AIC
  • AIC = -2/N * LL + 2 * k/N, where N is the number of examples in the training dataset, LL is the log-likelihood of the model on the training dataset, and k is the number of parameters in the model.
  • To use AIC for model selection, we simply choose the model giving smallest AIC over the set of models considered.
  • Compared to the BIC method (below), the AIC statistic penalizes complex models less, meaning that it may put more emphasis on model performance on the training dataset, and, in turn, select more complex models.
  1. BIC: Bayesian Information Criterion
  • BIC = -2 * LL + log(N) * k
  • the model with the lowest BIC is selected.
  • Unlike the AIC, the BIC penalizes the model more for its complexity, meaning that more complex models will have a worse (larger) score and will, in turn, be less likely to be selected.
  • the derivation of BIC under the Bayesian probability framework means that, given a family of candidate models that includes the true model for the dataset, the probability that BIC selects the true model approaches one as the training-set size N -> infinity. The same cannot be said for the AIC score.
  • A downside of BIC is that for smaller, less representative training datasets, it is more likely to choose models that are too simple.
  1. MDL: Minimum Description Length
  • MDL = L(h) + L(D | h), where h is the model, D is the predictions made by the model, L(h) is the number of bits required to represent the model, and L(D | h) is the number of bits required to represent the predictions from the model on the training dataset.
  • the model with the lowest MDL is selected.
  • The MDL principle takes the stance that the best theory for a body of data is one that minimizes the size of the theory plus the amount of information necessary to specify the exceptions relative to the theory
  • The MDL calculation is very similar to BIC and can be shown to be equivalent in some situations.
  1. For OLS (ordinary least squares linear regression), AIC = n * log(MSE) + 2 * k, where the log-likelihood term is estimated via the log of the mean squared error
  • The likelihood function for a linear regression model can be shown to be equivalent to the least squares objective; therefore, we can estimate the maximum likelihood of the model via the mean squared error metric.
  • Likewise, BIC = n * log(MSE) + k * log(n)
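
A short sketch that scores an OLS fit with both criteria via log(MSE) (synthetic data; scikit-learn is used only for the fit):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=200)

model = LinearRegression().fit(X, y)
mse = np.mean((y - model.predict(X)) ** 2)
n, k = len(y), X.shape[1] + 1          # parameters = coefficients + intercept
aic = n * np.log(mse) + 2 * k
bic = n * np.log(mse) + k * np.log(n)
print(aic, bic)                        # lower is better for both
```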