<br>
<br>
<br>
<br>

# DAV 6150 Module 12: Gradient Descent & Gradient Boosting
<br>
<br>
<br>

## Module 11 Assignment Review

### Does the reg_pct_level attribute exhibit low entropy or high entropy?

- The __reg_pct__ attribute from which reg_pct_level is derived exhibits an approximately Gaussian distribution. When we convert that attribute to the required categorical indicator, the result is an __imbalanced__ categorical variable since the majority of the observations are assigned to the __medium__ category.

Results of applying the required categorical mapping to __reg_pct__:

- __medium__:    63278


- __low__:        5131


- __high__:       4743


- More than __86.5%__ of the values contained within the reg_pct_level attribute are indicative of a "medium" categorization. Therefore, the attribute has very __low entropy__.


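- Furthermore, the __null accuracy__ for our modeling work (the accuracy achieved by always predicting the majority class) is __equivalent to the percentage of "medium" reg-pct-level indicators__ present within the new indicator attribute: 86.502%. The corresponding __null error rate__ is 100% - 86.502% = 13.498%.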


- Effective decision trees (and therefore random forests) are built by splitting on the attributes that offer the __highest information gain__, i.e., the largest reduction in the __entropy__ of the response variable.
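
As a rough check of the "low entropy" claim, the sketch below computes the Shannon entropy of the three class counts listed above and compares it with the maximum possible entropy for a three-class variable. It also reports the share of the majority class. The variable names are illustrative only.

```python
import math

# Class counts reported above for reg_pct_level
counts = {"medium": 63278, "low": 5131, "high": 4743}
total = sum(counts.values())

# Proportion of observations in each class
proportions = {label: n / total for label, n in counts.items()}

# Shannon entropy in bits: H = -sum(p * log2(p))
entropy_bits = -sum(p * math.log2(p) for p in proportions.values())

# Upper bound for three classes (a perfectly balanced variable)
max_entropy_bits = math.log2(len(counts))

# Share of the majority class, i.e., the accuracy of always predicting "medium"
majority_share = max(proportions.values())

print(f"entropy: {entropy_bits:.3f} bits (max possible: {max_entropy_bits:.3f})")
print(f"majority class share: {majority_share:.5f}")
```

An entropy well below the three-class maximum is consistent with the imbalance described above.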


#### So how to determine whether explanatory attributes offer high information gain + low entropy?

In addition to identifying numeric attributes that are relatively strongly correlated with a response variable while being relatively independent from one another, we can also determine which attributes (whether they are categorical or numeric) __have relatively high chi^2 values relative to the response variable__. We can use chi^2 as a proxy for information gain. 

- Low chi^2 values are indicative of a __weak statistical association__ between an explanatory attribute and a response variable. Therefore, chi^2 metrics can often be quite indicative of the amount of information gain / entropy inherent within an attribute relative to a response variable.
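
One possible way to compute chi^2 scores is scikit-learn's `chi2` function from `sklearn.feature_selection`, which expects non-negative feature values. The sketch below uses purely synthetic data (not the assignment data set): one feature that depends on the class label and one that is pure noise. All variable names are hypothetical.

```python
import numpy as np
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)
n = 1000

# Synthetic three-class response (hypothetical stand-in for reg_pct_level)
y = rng.integers(0, 3, size=n)

# One non-negative feature that depends on the class, one that is pure noise
informative = 2 * y + rng.integers(0, 2, size=n)
noise = rng.integers(0, 5, size=n)
X = np.column_stack([informative, noise])

# chi2 returns one test statistic and one p-value per feature
scores, p_values = chi2(X, y)
for name, score, p in zip(["informative", "noise"], scores, p_values):
    print(f"{name}: chi2 = {score:.1f}, p-value = {p:.4f}")
```

The feature tied to the class label should receive a much larger chi^2 score than the noise feature, which is the behavior we rely on when using chi^2 as a screening metric.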


### There are many collinear pairs of attributes within the data set. How to treat them?

The data set contains multiple pairs of 'count' and 'percentage' attributes that are highly collinear with each other due to the fact that the percentage values are simply a 'normalized' representation of the count values. Therefore, we should discard one of each of these pairs before we construct our models.

__But which attributes should be discarded__? 

- For this data set we can start by examining the correlation coefficients for each of the 'cnt' and 'pct' attributes relative to the __reg_pct__ attribute. The correlation coefficients for the 'pct' attributes are significantly higher than are those for the 'cnt' attributes, thereby indicating that the 'pct' attributes have lower entropy / higher information gain relative to the attribute from which we will be deriving the new categorical __reg-pct-level__ indicator.


- Also, the percentages represent a "normalized" presentation of the related information, and the response variable itself was required to be derived from a percentage rather than a count. 


- Therefore, __discard the counts and retain the percentages__.


### What (if anything) do we do to address the imbalance of the response?

- As was discussed in the Module 11 Assigned readings, __decision trees__ (and therefore random forests) __do not necessarily require any re-balancing of the observations prior to model training__.


- However, imbalanced response variable classes can result in the decision tree criterion used to select a split point ignoring examples from the minority classes. This can result in an ineffective decision tree (or random forest) model that misclassifies observations that should be assigned to a minority classification.


- To address this concern, in Python we can use the __class_weight__ parameter of the __DecisionTreeClassifier()__ (as well as the RandomForestClassifier()) to implement a __weighted decision tree__. In a weighted decision tree we assign a weight to each possible response variable value, typically using the inverse of the class distribution present within the training data set, so that errors on underrepresented classes count more heavily when split points are selected. The overall effect of this approach is to __reduce the likelihood of minority-class observations being misclassified__.


- For example, within the __reg-pct-level__ attribute we know that the "medium" label appears in (86.502 / 100) observations, while the "low" label appears in approximately (7 / 100) observations and the "high" label appears in approximately (6.5 / 100) observations. Knowing this, we can set the __class_weight__ parameter as follows to automatically balance the weightings of the various classifications within the decision tree:

class_weight = "balanced"


- The “balanced” mode uses the values of the response variable to __automatically adjust weights inversely proportional to class frequencies__ in the input data as n_samples / (n_classes * np.bincount(y))

(see https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

For an alternative approach using "hand calculated" weightings see this example: https://machinelearningmastery.com/cost-sensitive-decision-trees-for-imbalanced-classification/
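
To make the "balanced" formula concrete, the sketch below applies n_samples / (n_classes * np.bincount(y)) to the class counts reported earlier and compares the result with scikit-learn's `compute_class_weight` helper. The integer coding of the labels (0 = medium, 1 = low, 2 = high) is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils.class_weight import compute_class_weight

# Integer-coded response built from the counts above (assumed coding):
# 0 = medium, 1 = low, 2 = high
y = np.repeat([0, 1, 2], [63278, 5131, 4743])

# The "balanced" heuristic: n_samples / (n_classes * np.bincount(y))
n_samples = len(y)
n_classes = len(np.unique(y))
manual_weights = n_samples / (n_classes * np.bincount(y))

# scikit-learn's helper applies the same formula
sklearn_weights = compute_class_weight(
    class_weight="balanced", classes=np.unique(y), y=y
)

for label, w in zip(["medium", "low", "high"], manual_weights):
    print(f"{label}: {w:.3f}")

# Passing class_weight="balanced" asks the tree to derive these weights itself
tree = DecisionTreeClassifier(class_weight="balanced", random_state=0)
```

The majority class receives a weight below 1 while the two minority classes receive weights well above 1, so each class contributes roughly equally to the split criterion.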


### Low Entropy: What if we fail to sample all classes for our training subset?

- In the reg_pct_level variable we find only 4743 "high" indicators out of a total of 73,000+ observations. What would happen if the random sampling process we use for purposes of creating training + testing subsets __results in a training data set devoid of "high" values__?  Our model will __have no way of classifying observations that should be labeled as having a "high" reg-pct-level__.


- So what can we do? To ensure that all response classification labels are proportionally represented in our training and testing subsets we can use the __stratify__ parameter of the __train_test_split()__ function:

In [None]:
# use stratified sampling to ensure all response variable classifications are
# proportionally represented in both the training + testing subsets
# In this example "Data" holds our explanatory variables and "Target"
# represents our response variable
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

## Gradient Descent

__Gradient descent__ is an __optimization algorithm__ that seeks to __minimize the value of a given function__. Within the context of machine learning, gradient descent is __typically applied to a given loss function__. 


Since gradient descent is an optimization algorithm and __virtually every machine learning algorithm can be defined in terms of an appropriate loss function__, the algorithm __can be extended to virtually any type of regression or classification problem__. 


__The specified loss function should be relevant to the type of machine learning algorithm being analyzed__. For example, for a regression problem an appropriate loss function might be the Mean Squared Error; for a binary classification problem the loss function might be cross entropy; etc. 




### How it Works

Starting with a provided __set of function parameters__, gradient descent __iteratively applies differentiation (specifically, partial derivatives) to those parameters__ until it has identified a set of model parameters that __minimize the given function__.  

In effect, the differentiation __allows the algorithm to identify the direction of steepest descent__, which is the direction opposite to the function's gradient. The algorithm iteratively continues moving in the direction of steepest descent (referred to as taking "steps") until __it reaches a point at which the gradient is approximately zero__, i.e., a minimum of the function. The __model parameters that correspond to that point__ are then __assumed to be the parameters that minimize the value of the loss function.__

The size of the "steps" is specified via the __learning rate parameter__. 

See this article for further explanation: https://medium.com/@arshren/gradient-descent-5a13f385d403
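
The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: it assumes a simple linear regression model, a Mean Squared Error loss function, synthetic data, and hand-picked values for the learning rate and the number of iterations.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): y = 4 + 3x + Gaussian noise
m = 200
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + rng.normal(0, 0.5, size=m)

# Add a column of ones so the intercept is learned as a parameter
X_b = np.column_stack([np.ones(m), x])

learning_rate = 0.1   # size of each "step"
n_iterations = 1000
theta = np.zeros(2)   # starting parameter values

for _ in range(n_iterations):
    # Partial derivatives of the MSE with respect to each parameter
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)
    # Step in the direction of steepest descent (the negative gradient)
    theta = theta - learning_rate * gradients

print("estimated intercept and slope:", theta)
```

Because the MSE of a linear model is convex, the estimates should land close to the intercept and slope used to generate the data. Rerunning with a much larger or much smaller `learning_rate` is a quick way to see the trade-offs discussed in the Disadvantages section.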


### Advantages

- Flexibility: Virtually any differentiable function can be specified as the loss function to be used


- Can be used in conjunction with nearly any type of machine learning algorithm for purposes of improving model performance


### Disadvantages

- Knowing how to select __an appropriate learning rate__ can be challenging: if the learning rate is __too large__, the algorithm __may fail to find the true minimum of the function__; if the learning rate is __too small__, the algorithm __may require a significant amount of time + computing resources to find the minimum of the function__.


- __All features should be standardized__ to ensure that all of them have a similar scale. Failure to do so will likely lead to a significant increase in the amount of time required for model convergence.


- As the dimensionality of the data increases, the number of partial derivatives that need to be computed to determine the direction of "steepest descent" necessarily increases, thereby increasing model training time.


- Not guaranteed to find the minimum of a loss function that is not convex in shape, i.e., the algorithm can become "trapped" in a local minimum and never find the true global minimum of a function


## Stochastic Gradient Descent

__Stochastic gradient descent__ uses random sampling to select __a single random observation__ from a data set during each iteration to compute the gradients instead of calculating the gradients based on the entire training data set.

### Advantages

- Utilizes far fewer computing resources per iteration due to use of random sampling of training data

- Is much faster than standard gradient descent since the entire data set need not be used: parameter values can converge much more quickly 

- Is less likely to become trapped in a local minimum than is non-stochastic gradient descent
 
 
### Disadvantages

- The true minimum of the loss function may prove to be elusive due to the use of random sampling during model training. In fact, it is possible that the model will never converge.


- If the training data is not sufficiently randomized, the algorithm will likely perform poorly


### How to Implement Stochastic Gradient Descent in Python

The __scikit-learn__ library includes a pre-built stochastic gradient descent __classifier__: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html#sklearn.linear_model.SGDClassifier


An example of how to implement a stochastic gradient descent __regressor__ from the assigned readings (see input cell 21): 
- https://github.com/ageron/handson-ml2/blob/master/04_training_linear_models.ipynb
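
As a minimal sketch of the regressor (similar in spirit to, but not copied from, the assigned reading), the code below fits scikit-learn's `SGDRegressor` to synthetic data. The data, the hyperparameter values, and the use of a `StandardScaler` pipeline are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): y = 4 + 3x + Gaussian noise
X = rng.uniform(0, 2, size=(500, 1))
y = 4 + 3 * X[:, 0] + rng.normal(0, 0.5, size=500)

# Standardize the feature, then fit a linear model via stochastic gradient descent
sgd_reg = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.01, random_state=42),
)
sgd_reg.fit(X, y)

# The true (noise-free) value at x = 1.0 is 4 + 3 * 1.0 = 7
prediction = sgd_reg.predict([[1.0]])
print("prediction at x = 1.0:", prediction)
```

The scaler is included because, as noted earlier, gradient descent methods tend to converge more slowly when features are not on a similar scale.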


## Mini-batch Gradient Descent

__Mini-batch__ gradient descent improves upon stochastic gradient descent by using __a randomly sampled "batch" of observations__ to compute the gradients during each iteration. The use of random batches of samples tends to reduce the tendency toward divergence that can result from using stochastic gradient descent while also increasing the likelihood of the model achieving convergence on the true minimum of the loss function.

### Advantages

- Less likely than basic gradient descent to become trapped in a local minimum


- Uses less computing resources than does basic gradient descent


- Is faster than basic gradient descent since the entire training data set need not be processed during each iteration

### Disadvantages

- Not as fast as stochastic gradient descent


- More resource-intensive than stochastic gradient descent


- No guarantee that the true minimum of the loss function will be found due to use of random sampling


### How to Implement Mini-Batch Gradient Descent in Python

An example from the assigned readings (see input cell 23): 
- https://github.com/ageron/handson-ml2/blob/master/04_training_linear_models.ipynb
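
For comparison with the reading, here is a minimal NumPy sketch of mini-batch gradient descent for a linear regression model with a Mean Squared Error loss. The synthetic data, the batch size, the learning rate, and the number of epochs are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): y = 4 + 3x + Gaussian noise
m = 200
x = rng.uniform(0, 2, size=m)
y = 4 + 3 * x + rng.normal(0, 0.5, size=m)
X_b = np.column_stack([np.ones(m), x])  # column of ones for the intercept

learning_rate = 0.02
n_epochs = 200
batch_size = 20
theta = np.zeros(2)

for epoch in range(n_epochs):
    # Reshuffle each epoch so every batch is a fresh random sample
    shuffled = rng.permutation(m)
    for start in range(0, m, batch_size):
        batch = shuffled[start:start + batch_size]
        X_batch, y_batch = X_b[batch], y[batch]
        # Gradient of the MSE computed on this batch only
        gradients = (2 / len(batch)) * X_batch.T @ (X_batch @ theta - y_batch)
        theta = theta - learning_rate * gradients

print("estimated intercept and slope:", theta)
```

Setting `batch_size = 1` turns this loop into stochastic gradient descent, while setting `batch_size = m` recovers basic (full batch) gradient descent.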

## Gradient Boosting

__Gradient Boosting__ is an __ensemble-based optimization algorithm__ that, like gradient descent, __seeks to minimize the value of a differentiable loss function__. 

However, while gradient descent computes the gradient of the loss function with respect to the parameter vector for the equation we are trying to minimize, __gradient boosting computes the gradient of the loss function with respect to the predicted values of a model’s response variable__. 

In effect, gradient boosting works by __sequentially adding new models to an ensemble, with each new model being fit to the residual errors produced by the previous model__ (Remember: The difference between the observed value of the dependent variable (y) and the predicted value (ŷ) is called the __residual__). In this way, each “newer” model improves upon the performance of the ensemble built so far, and the final prediction combines the output of all of the models in the ensemble. 

Gradient boosting can be used for both classification and regression problems.


### How it Works

- Start with a loss function to be minimized. __Any differentiable loss function can be used__, but pre-built software functions generally include pre-defined, robust loss functions that you can select via a parameter. Or you can define your own based on your own domain knowledge.


- Select a "weak learner" to be used as the basis for the ensemble model (e.g., a decision tree)


- Fit a model using the weak learner


- Compute the gradient of the loss function __with respect to the predicted values for the model's response variable__


- Fit the next weak learner to the error residuals of the previous weak learner model's output


We either add a fixed number of weak learners to the model or we continue to add weak learners to the model until either the observed loss reaches an acceptable level or it no longer shows improvement when an additional weak learner is added.
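
The steps above can be sketched by hand for a regression problem. This is an illustration under stated assumptions rather than a reference implementation: it uses a squared-error loss (for which the negative gradient with respect to the predictions is proportional to the residuals), shallow `DecisionTreeRegressor` models as the weak learners, a constant initial prediction, a fixed number of trees, and synthetic data. The `ensemble_predict` helper is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic regression data (hypothetical): a noisy quadratic
X = rng.uniform(-0.5, 0.5, size=(200, 1))
y = 3 * X[:, 0] ** 2 + rng.normal(0, 0.05, size=200)

learning_rate = 0.5   # shrinks each tree's contribution
n_trees = 20          # fixed number of weak learners

# Start from a constant prediction: the mean of the response
initial_prediction = y.mean()
prediction = np.full_like(y, initial_prediction)
trees = []
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(n_trees):
    # For squared-error loss, the negative gradient with respect to the
    # predictions is proportional to the residuals y - prediction
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residuals)
    # Add the new weak learner's (shrunken) output to the ensemble
    prediction = prediction + learning_rate * tree.predict(X)
    trees.append(tree)
    mse_history.append(np.mean((y - prediction) ** 2))


def ensemble_predict(X_new):
    """Sum the initial constant and every tree's shrunken prediction."""
    return initial_prediction + learning_rate * sum(t.predict(X_new) for t in trees)


print(f"training MSE: {mse_history[0]:.4f} -> {mse_history[-1]:.4f}")
```

Tracking `mse_history` shows how the training loss falls as weak learners are added, which is the quantity we would monitor when deciding whether to stop adding learners.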

### Advantages

- Often outperforms models that do not incorporate gradient boosting


- Limited data pre-processing required


- Flexibility: Can be directed to minimize a wide variety of loss functions


- Can simplify the modeling process by reducing the need for imputation of missing data values


- Relatively resistant to overfitting when regularized (e.g., via a small learning rate, subsampling, or early stopping)


### Disadvantages


- Computationally expensive (e.g., many gradient boosted tree models require more than 1000 decision trees be created)


- Less interpretable than other types of classification and regression models, i.e., how do we explain models that have been fitted to the residuals of earlier models?


### How to Implement Gradient Boosted Trees in Python

The __sklearn__ library includes a pre-built gradient boosting classifier: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html#sklearn.ensemble.GradientBoostingClassifier


An example of using the sklearn gradient boosting classifier: https://stackabuse.com/gradient-boosting-classifiers-in-python-with-scikit-learn/
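
A minimal usage sketch of the sklearn classifier follows. The data are synthetic (generated with `make_classification` to mimic an imbalanced three-class response), and the hyperparameter values are illustrative assumptions rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class data (hypothetical stand-in for a real data set)
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, n_classes=3,
    weights=[0.8, 0.1, 0.1], random_state=0,
)

# Stratify so that all three classes appear in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

gb_clf = GradientBoostingClassifier(
    n_estimators=100,     # number of boosting stages added sequentially
    learning_rate=0.1,    # shrinks the contribution of each tree
    max_depth=3,          # keeps each individual tree "weak"
    random_state=0,
)
gb_clf.fit(X_train, y_train)

test_accuracy = gb_clf.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

With an imbalanced response, the test accuracy should be compared against the share of the majority class rather than judged in isolation.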

## Extreme Gradient Boosting (XG Boost)

__XG Boost__ improves upon standard gradient boosting algorithms through the incorporation of a collection of system optimization and algorithmic enhancements:

- Automated tree pruning (when applied to tree-based algorithms)


- Parallelization


- Memory caching


- Regularization via L1 (Lasso) and L2 (Ridge) penalties (when specified by the user)


- Sparse data handling


- Built-in cross validation support that can be evaluated at each boosting iteration


XG Boosting can be applied to both classification and regression problems and __is most frequently applied to tree-based models__ (e.g., decision trees, random forests).

For a thorough explanation: https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d

### Advantages
 
- Built-in features can improve speed of execution + model performance relative to non-XG boosted algorithms


- Can simplify modeling process by reducing the need for feature scaling


- Can simplify the modeling process by reducing the need for imputation of missing data values


- Effective when applied to high-dimensional data


### Disadvantages

- XG Boost sometimes __underperforms__ other types of models; empirical approach is required, i.e., construct different types of models and then compare their performance


- Less interpretable than other types of classification and regression models due to complexity + built-in optimizations whose effects are relatively "opaque".


### How to Implement XG Boost in Python

The __xgboost__ package provides a pre-built XG boost function: https://xgboost.readthedocs.io/en/latest/python/python_intro.html


An example from the assigned readings (go to input cell 34): 
- https://github.com/mattharrison/ml_pocket_reference/blob/master/ch10.ipynb


Another example: 
- https://stackabuse.com/gradient-boosting-classifiers-in-python-with-scikit-learn/

## Project 3 Guidelines / Requirements