# Chapter 02 - Bagging and Boosting

## Summary of LightGBM optimizations

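1. Uses histogram-based split finding: continuous features are bucketed into discrete bins, so candidate splits are scanned over bins, O(bins), rather than over rows, O(rows)
2. Uses exclusive feature bundling (EFB) to merge mutually exclusive sparse features, reducing the effective number of features
3. Applies GOSS (gradient-based one-side sampling) to downsample the training data with little loss of accuracy
4. Builds trees leaf-wise (best-first) rather than level-wise, which tends to improve accuracy
5. Allows L1 and L2 regularization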
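
The histogram idea in item 1 can be sketched in a few lines of NumPy. This is a simplified illustration for a single feature, not LightGBM's actual implementation: `best_histogram_split` is a hypothetical helper, the quantile binning is an assumption, and the gain formula is the common second-order (XGBoost-style) gain with an L2 term `lam`.

```python
import numpy as np

def best_histogram_split(x, grad, hess, max_bin=16, lam=1.0):
    """Find the best split threshold for one feature from a gradient histogram."""
    # Bucket the continuous feature into at most max_bin quantile bins.
    edges = np.unique(np.quantile(x, np.linspace(0, 1, max_bin + 1)[1:-1]))
    bins = np.searchsorted(edges, x, side="right")  # bin index per row
    n_bins = len(edges) + 1

    # One pass over the rows builds per-bin sums of gradients and Hessians.
    g_hist = np.bincount(bins, weights=grad, minlength=n_bins)
    h_hist = np.bincount(bins, weights=hess, minlength=n_bins)

    # Candidate splits are then scanned over bins only, not over rows.
    g_left = np.cumsum(g_hist)[:-1]
    h_left = np.cumsum(h_hist)[:-1]
    g_total, h_total = g_hist.sum(), h_hist.sum()
    g_right, h_right = g_total - g_left, h_total - h_left

    gain = (g_left**2 / (h_left + lam)
            + g_right**2 / (h_right + lam)
            - g_total**2 / (h_total + lam))
    k = int(np.argmax(gain))
    # Rows with x < edges[k] go to the left child.
    return float(edges[k]), float(gain[k])

# Toy data: the target jumps from 0 to 1 at x = 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=1000)
y = (x > 0.5).astype(float)
grad = 0.0 - y            # squared-error gradient at an initial prediction of 0
hess = np.ones_like(y)    # squared-error Hessian is 1 for every row
threshold, gain = best_histogram_split(x, grad, hess)
print(threshold, gain)
```

The chosen threshold should land close to 0.5, and the scan touches only 15 candidate boundaries regardless of how many rows there are.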

Together, these optimizations make LightGBM substantially faster to train than a standard GBDT implementation (the LightGBM paper reports speedups of over 20x on some datasets). Additionally, it's implemented in C++ with a Python interface (vs. scikit-learn's gradient boosting, which is implemented in Python and Cython).

LightGBM also offers improved data-parallel and feature-parallel distributed training.
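
GOSS (item 3 in the list above) can also be sketched briefly. The version below follows the sampling scheme as described in the LightGBM paper: keep the rows with the largest absolute gradients, randomly sample from the rest, and up-weight the sampled rows by `(1 - top_rate) / other_rate`. `goss_sample` is a hypothetical helper written for illustration, not a LightGBM function; the names `top_rate` and `other_rate` mirror the corresponding LightGBM parameters.

```python
import numpy as np

def goss_sample(grad, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (row indices, row weights) chosen by a GOSS-style sampler."""
    rng = np.random.default_rng(seed)
    n = len(grad)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)

    order = np.argsort(-np.abs(grad))   # largest |gradient| first
    top_idx = order[:n_top]             # large-gradient rows are always kept
    rest = order[n_top:]
    other_idx = rng.choice(rest, size=n_other, replace=False)

    idx = np.concatenate([top_idx, other_idx])
    weights = np.ones(len(idx))
    # Up-weight the sampled small-gradient rows to compensate for the rows dropped.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return idx, weights

grad = np.random.default_rng(1).normal(size=1000)
idx, weights = goss_sample(grad)
print(len(idx), np.unique(weights))
```

With the default rates, 30% of the rows are used for the next tree: the top 20% by gradient magnitude plus a random 10% of all rows drawn from the remainder.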

## Hyperparameters

### A. Core Framework Parameters

1. `objective`: the learning task, e.g. `binary`, `multiclass`, `cross_entropy`, `lambdarank` (for ranking)
2. `boosting`: defaults to `gbdt`; can be changed to `dart` or `rf`
    - should use `dart`
3. `num_iterations` or `n_estimators`: the number of boosting iterations
4. `num_leaves`: the max number of leaves in a single tree
    - should tune this
5. `learning_rate`: controls the contribution of each tree to the overall prediction
    - have to tune this!

### B. Accuracy Parameters

6. `max_bin`: the maximum number of bins into which feature values are bucketed

### C. Learning Control Parameters for Overfitting

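7. `bagging_fraction` and `bagging_freq`: together these enable bagging, i.e. row subsampling (feature subsampling is controlled separately by `feature_fraction`). `bagging_fraction` is the fraction of rows sampled and `bagging_freq` is how often, in iterations, the sample is redrawn. Bagging reduces overfitting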
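
As a rough illustration, the parameters listed above could be collected into a single dictionary like the one below. The values are hypothetical starting points chosen for this example (only `max_bin=255` matches the LightGBM default), so treat them as assumptions to tune rather than recommendations.

```python
# Hypothetical starting values for illustration only; tune them for your data.
params = {
    # A. Core framework parameters
    "objective": "multiclass",
    "num_class": 7,            # required for the multiclass objective
    "boosting": "gbdt",
    "num_iterations": 150,
    "num_leaves": 63,
    "learning_rate": 0.05,
    # B. Accuracy parameters
    "max_bin": 255,
    # C. Learning control parameters
    "bagging_fraction": 0.8,   # sample 80% of the rows ...
    "bagging_freq": 5,         # ... and redraw the sample every 5 iterations
}

# Bagging only takes effect when bagging_freq is non-zero and bagging_fraction < 1.
bagging_enabled = params["bagging_freq"] > 0 and 0 < params["bagging_fraction"] < 1
print(bagging_enabled)
```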


In [44]:
import lightgbm as lgb
from sklearn import datasets, model_selection, metrics
import pandas as pd
from typing import Callable # for functions!
import xgboost as xgb
import numpy as np

In [2]:
df = datasets.fetch_covtype() # data on the 7 forest cover types with covariates. It's a multiclass classification dataset

In [5]:
type(df)
  # a Bunch is a dictionary-like object; for fetch_covtype it has these attributes:
  # 1. data - ndarray with the X part of the data
  # 2. target - the y part of the data
  # 3. frame - X and y together as a DataFrame (only when as_frame=True)
  # 4. DESCR - a description
  # 5. feature_names
  # 6. target_names

sklearn.utils._bunch.Bunch

In [8]:
print(df.DESCR)

.. _covtype_dataset:

Forest covertypes
-----------------

The samples in this dataset correspond to 30×30m patches of forest in the US,
collected for the task of predicting each patch's cover type,
i.e. the dominant species of tree.
There are seven covertypes, making this a multiclass classification problem.
Each sample has 54 features, described on the
`dataset's homepage <https://archive.ics.uci.edu/ml/datasets/Covertype>`__.
Some of the features are boolean indicators,
while others are discrete or continuous measurements.

**Data Set Characteristics:**

Classes                        7
Samples total             581012
Dimensionality                54
Features                     int

:func:`sklearn.datasets.fetch_covtype` will load the covertype dataset;
it returns a dictionary-like 'Bunch' object
with the feature matrix in the ``data`` member
and the target values in ``target``. If optional argument 'as_frame' is
set to 'True', it will return ``data`` and ``target`` as pandas
data

In [6]:
help(datasets.fetch_covtype) # loads the covertype dataset (classification), downloading it if necessary
    # by default, all scikit-learn data is stored in '~/scikit_learn_data' subfolders
    # signature: fetch_covtype(*, data_home=None, download_if_missing=True, random_state=None,
    #                          shuffle=False, return_X_y=False, as_frame=False)
    # return_X_y=True returns (data, target) as a tuple instead of a Bunch
    # as_frame=True returns X and y as a pd.DataFrame and pd.Series, and adds a combined `frame` to the Bunch

Help on function fetch_covtype in module sklearn.datasets._covtype:

fetch_covtype(*, data_home=None, download_if_missing=True, random_state=None, shuffle=False, return_X_y=False, as_frame=False)
    Load the covertype dataset (classification).
    
    Download it if necessary.
    
    Classes                        7
    Samples total             581012
    Dimensionality                54
    Features                     int
    
    Read more in the :ref:`User Guide <covtype_dataset>`.
    
    Parameters
    ----------
    data_home : str or path-like, default=None
        Specify another download and cache folder for the datasets. By default
        all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
    
    download_if_missing : bool, default=True
        If False, raise an OSError if the data is not locally available
        instead of trying to download the data from the source site.
    
    random_state : int, RandomState instance or None, default=None
   

In [9]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(df.data, df.target, random_state=179)

In [25]:
pd.DataFrame(y_train).value_counts().sort_index() # values are 1,2,3,4,5,6,7

0
1    159517
2    211856
3     26808
4      2031
5      7194
6     12987
7     15366
Name: count, dtype: int64

In [10]:
training_set = lgb.Dataset(X_train, y_train - 1) # have to do y - 1 because lgb wants 0 as first class

In [11]:
test_set = lgb.Dataset(X_test, y_test - 1) # have to do y - 1 because lgb wants 0 as first class

In [28]:
print(training_set)

<lightgbm.basic.Dataset object at 0x74031909c190>


In [29]:
params = {
    'boosting_type': 'gbdt',
    'objective': 'multiclass',
    'num_class': 7,  # number of classes, as an integer
    'metric': {'auc_mu'},
    'num_leaves': 120,
    'learning_rate': 0.09,
    'force_row_wise': True,
    'verbose': 0
}

In [32]:
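def learning_rate_decay(initial_lr: float, decay_rate: float) -> Callable[[int], float]:
    """Build an exponential decay schedule for lgb.reset_parameter."""
    def _decay(iteration: int) -> float:
        # iteration is the 0-based boosting round passed in by LightGBM
        return initial_lr * (decay_rate ** iteration)
    return _decay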
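
To get a feel for the schedule, it helps to evaluate it at a few rounds. The snippet below repeats the function so that it runs on its own; the chosen rounds (0, 50, 149) are just sample points for a 150-round run.

```python
from typing import Callable

def learning_rate_decay(initial_lr: float, decay_rate: float) -> Callable[[int], float]:
    def _decay(iteration: int) -> float:
        return initial_lr * (decay_rate ** iteration)
    return _decay

schedule = learning_rate_decay(0.09, 0.999)
for it in (0, 50, 149):
    # Print the learning rate that would be used at this 0-based round.
    print(it, round(schedule(it), 5))
```

With a decay rate of 0.999 the learning rate shrinks slowly: by the last of 150 rounds it has dropped by only about 14%, from 0.09 to roughly 0.0775.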

In [33]:
metrics = {}
callbacks = [
    lgb.log_evaluation(period=15),
    lgb.record_evaluation(metrics),
    lgb.early_stopping(15),
    lgb.reset_parameter(learning_rate=learning_rate_decay(.09, .999))
]

In [47]:
gbm = lgb.train(params, training_set, num_boost_round=150, valid_sets=[test_set], callbacks=callbacks)

[15]	valid_0's auc_mu: 0.991161
[30]	valid_0's auc_mu: 0.994448
[45]	valid_0's auc_mu: 0.99573
[60]	valid_0's auc_mu: 0.996388
[75]	valid_0's auc_mu: 0.996781
[90]	valid_0's auc_mu: 0.997185
[105]	valid_0's auc_mu: 0.997565
[120]	valid_0's auc_mu: 0.997818
[135]	valid_0's auc_mu: 0.998015
[150]	valid_0's auc_mu: 0.994935
Did not meet early stopping. Best iteration is:
[141]	valid_0's auc_mu: 0.998084
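
The "Did not meet early stopping" message follows from the patience rule: training stops only once the validation metric has failed to improve for `stopping_rounds` consecutive rounds. Here the best round was 141 and training ended at round 150, only 9 rounds later, which is fewer than the 15 configured. A simplified sketch of that rule (`early_stopping_round` is a hypothetical helper, not LightGBM's code):

```python
def early_stopping_round(scores, patience, higher_is_better=True):
    """Return (best_iteration, stopped_early) for 1-based validation scores."""
    best_score, best_iter = None, 0
    for i, s in enumerate(scores, start=1):
        improved = best_score is None or (
            s > best_score if higher_is_better else s < best_score
        )
        if improved:
            best_score, best_iter = s, i
        elif i - best_iter >= patience:
            # No improvement for `patience` rounds: stop and report the best round.
            return best_iter, True
    # Ran out of rounds before the patience was exhausted.
    return best_iter, False

toy_scores = [0.90, 0.92, 0.95, 0.94, 0.93, 0.96]
print(early_stopping_round(toy_scores, patience=2))
print(early_stopping_round(toy_scores, patience=5))
```

With a patience of 2 the toy run stops early and reports round 3 as best; with a patience of 5 it runs to the end and reports round 6.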


In [48]:
from sklearn.metrics import f1_score  # import directly: the name `metrics` was rebound to a dict above
y_pred = np.argmax(gbm.predict(X_test, num_iteration=gbm.best_iteration), axis=1)
f1_score(y_test - 1, y_pred, average="macro")


0.9165993281538187

In [None]:
lgb.plot_metric(metrics, 'auc_mu')

## Comparing Quickly with XGBoost

In [37]:
xgb_def = xgb.XGBClassifier()

In [39]:
xgb_def.fit(X_train, y_train-1)

In [42]:
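xgb_def.score(X_test, y_test - 1)  # score() returns mean accuracy, not macro F1, so it is not directly comparable to the LightGBM F1 above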

0.8717548002450896
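
Note that `score` on a scikit-learn style classifier returns accuracy, while the LightGBM cell earlier reports macro F1, so the two numbers measure different things. On imbalanced data such as this one the gap can be large, because macro F1 weights every class equally. The toy example below computes both by hand; `macro_f1` is a hypothetical helper and the label arrays are made up for illustration.

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))

y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0])
accuracy = float(np.mean(y_true == y_pred))
f1_macro = macro_f1(y_true, y_pred, n_classes=3)
print(accuracy, f1_macro)
```

Here accuracy is about 0.71 while macro F1 is about 0.52, because the single example of the rare class 2 is misclassified and contributes an F1 of 0. A like-for-like comparison of the two models would use the same metric for both.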

## LightGBM scikit-learn API

Provides four classes:

1. `LGBMModel`
2. `LGBMClassifier`
3. `LGBMRegressor`
4. `LGBMRanker`