# Introduction
Boosting is a powerful ensemble Machine Learning technique used in both classification and regression tasks. It combines the predictions from multiple weak learners (oftentimes Decision Trees) to create a strong learner with improved performance.

### Core idea
- Boosting iteratively trains weak learners, where each learner focuses on correcting the errors of the previous ones.
- Imagine a group of average students (weak learners) working together to solve a problem. Each student learns from the mistakes of the others, ultimately leading to a better understanding of the problem.

### Boosting algorithm
1. Initialize weights: Each data point in the training set is assigned an equal weight.
2. Train weak learner: A weak learner (e.g., Decision Tree) is trained on the weighted data.
3. Calculate error: The error of the weak learner is calculated based on the assigned weights, so mistakes on heavily weighted points count for more.
4. Adjust weights: Weights of the data points are adjusted based on the errors. More weight is given to points that the previous learners got wrong.
5. Repeat: Steps 2 to 4 are repeated for multiple iterations, with each new learner focusing on the most difficult cases from the previous learner.
6. Final prediction: The final prediction is made by combining the predictions from all the weak learners in the ensemble, often using a weighted voting (for classification) or averaging (for regression) approach.
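
The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation. It assumes the AdaBoost formulas for the learner weight (`alpha`) and the weight update, labels encoded as +1/-1, and a one-feature decision stump as the weak learner. The toy data and all function names are hypothetical.

```python
import math

def stump_predict(x, threshold, polarity):
    """Decision stump on one feature: `polarity` up to the threshold, `-polarity` beyond it."""
    return polarity if x <= threshold else -polarity

def best_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the stump with the lowest weighted error."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, n_rounds):
    weights = [1.0 / len(xs)] * len(xs)              # step 1: equal weights
    ensemble = []
    for _ in range(n_rounds):                        # step 5: repeat
        error, threshold, polarity = best_stump(xs, ys, weights)  # steps 2 and 3
        error = min(max(error, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - error) / error)  # AdaBoost learner weight (assumed formula)
        ensemble.append((alpha, threshold, polarity))
        # step 4: raise the weights of misclassified points, then renormalize
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Step 6: weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single stump can separate.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

def n_correct(ensemble):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys))

print(n_correct(adaboost(xs, ys, n_rounds=1)))  # 7 of 10 correct
print(n_correct(adaboost(xs, ys, n_rounds=3)))  # 10 of 10 correct
```

On this toy data a single stump gets 7 of the 10 points right, while three reweighted stumps together classify all 10 correctly.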

### Benefits of Boosting
- Improved accuracy: By combining weaker models, boosting can achieve higher accuracy compared to individual learners.
- Can handle complex problems: Boosting can learn complex relationships in the data that might be challenging for a single model.
- Handles imbalanced data: Some boosting algorithms can effectively handle imbalanced datasets where certain classes have fewer data points.

### Common Boosting algorithms
- AdaBoost (Adaptive Boosting): A popular boosting algorithm that focuses on improving the weights of misclassified examples.
- Gradient Boosting: A more general framework where the focus is on minimizing a loss function (e.g., squared error for regression) in each iteration. Common examples include,
    - XGBoost: A powerful and scalable gradient boosting algorithm known for its performance.
    - LightGBM: Another efficient gradient boosting algorithm with good performance and speed.

### Considerations
- Overfitting: Boosting algorithms can be prone to overfitting if not carefully tuned. Techniques like regularization can be used to mitigate this risk.
- Computational cost: Training a boosted ensemble can be computationally expensive compared to a single model, especially with many iterations.

# Bagging V. Boosting
### Boosting for high bias and low variance models
- Boosting is often used when the base learners (e.g., Decision Trees) tend to have high bias (underfitting) and low variance (low model complexity).
- Bagging would not be ideal in this scenario because it focuses on averaging predictions from diverse models (high variance). Averaging underfitting models won't significantly improve performance.

### Additive combining in Boosting
Boosting addresses the high bias by sequentially training models (like Decision Trees) in an "additive" fashion. Each subsequent model "boosts" the overall performance by focusing on the errors of the previous model. Consider the following example,
- Imagine a regression dataset in which every prediction can be compared with the actual target value.
- The first model (weak learner) might underfit the data, leading to significant errors (differences between predicted and actual values).
- The second model is trained specifically on these errors, trying to learn from the mistakes of the first model. It essentially adds its corrective predictions to the first model's predictions.
- This process continues iteratively, with each subsequent model focusing on the remaining errors from the previous ensemble.

### Comparison with Bagging
- Boosting is a sequential process. Each model builds upon the previous one.
- Bagging, on the other hand, trains models in parallel on different data subsets (bootstrap samples).

### Addressing bias in Boosting
- Boosting achieves bias reduction by focusing on the errors of previous models. Each iteration aims to learn from the shortcomings of the ensemble so far, gradually reducing the overall bias.
- Additionally, Boosting algorithms often adjust the weights of data points during training. Points that were misclassified by previous models receive higher weights, forcing the next model to pay more attention to those challenging examples.

### Boosting algorithms and considerations
- Common Boosting algorithms like AdaBoost and Gradient Boosting implement these principles in different ways.
- While Boosting offers advantages, it can be computationally expensive due to the sequential nature of training. However, the performance gains can often outweigh the increased training time.

# Additive Combining In Boosting
### Step-by-step breakdown
1. Simple model and residuals:
    - The average or the mean model is the simple starting point ($M_0$).
    - It predicts the average target value ($y_{cap}$), and the residuals ($error_{i0}$) are calculated for each data point ($y_i$) as the difference between the actual value and the average prediction.
2. Model on residuals:
    - A second model ($M_1$) is trained on the residuals ($error_{i0}$). $M_1$ aims to fit these errors.
    - In Boosting terminology, $M_1$ is called a weak learner. It typically has high bias (underfitting) and low variance (low complexity), focusing on specific patterns in the errors.
3. Additive prediction: The final prediction for each data point is achieved by adding the predictions from $M_0$ (the average model) and $M_1$ (error model).
    - $h_0(x_i) + h_1(x_i)$.
    - This is where the additive aspect comes in. The corrections from each model are progressively added.
4. Iterative process (optional):
    - Boosting algorithms can continue this process by training additional models ($M_2$, $M_3$, etc) on the remaining errors from the previous ensemble.
    - Each subsequent model focuses on the errors the ensemble has not yet captured effectively.
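
The breakdown above can be traced on a tiny regression problem. The sketch below assumes a one-feature, depth-1 regression tree as the weak learner. The toy data and the helper names (`fit_stump`, `mse`) are made up for illustration. It starts from the mean model $M_0$ and adds three residual models, recording the training MSE after each step.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree: choose the split with the lowest squared error."""
    best = None
    values = sorted(set(xs))
    for left_max, right_min in zip(values, values[1:]):
        threshold = (left_max + right_min) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0, 4.0, 5.0, 9.0, 10.0, 11.0, 20.0, 21.0]

mean_y = sum(ys) / len(ys)
models = [lambda x: mean_y]            # M_0: the mean model
preds = [mean_y] * len(xs)
history = [mse(ys, preds)]

for _ in range(3):                     # M_1, M_2, M_3
    residuals = [y - p for y, p in zip(ys, preds)]
    h = fit_stump(xs, residuals)       # new model is trained on the residuals
    models.append(h)
    preds = [p + h(x) for p, x in zip(preds, xs)]   # additive combining
    history.append(mse(ys, preds))

print(history)  # training MSE after M_0, M_1, M_2, M_3
```

The prediction for any point is simply the sum of what every model in `models` returns for it, and on this data the training MSE in `history` shrinks with each added model.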

### Example
Say that out of 100 data points, 80 have been predicted correctly by $M_0$. $M_1$ then tries to learn from the remaining 20 data points. Now say that $M_1$ predicts 16 of those 20 correctly. Combining $M_0$'s predictions with $M_1$'s corrections then gets 80 + 16 = 96 of the 100 points right, a potentially better overall prediction.

### Classification v. regression
- In regression, residuals represent the difference between the actual target value and the predicted value.
- In classification, boosting algorithms might use probability differences instead of residuals for error calculations.

### Addressing bias
By iteratively focusing on the errors of previous models, boosting gradually reduces the overall bias. Each model in the ensemble adds its corrective power to improve the final prediction.

### High bias v. high variance
- High bias: High training error and high testing error (model underfits the data).
- High variance: Low training error but high testing error (model overfits the training data).
- Boosting is particularly effective for models with high bias because it sequentially refines the predictions to reduce bias.

# Steps In Boosting
1. Mean model and error:
    - Say that the mean weight of 67 has been calculated as the initial prediction ($M_0$ or $h_0(x)$) for all data points.
    - The residuals ($error_0$) are computed as the difference between the actual weight and the mean weight for each data point.
2. Model on residuals:
    - The next model ($M_1$) uses the residuals ($error_0$) as its target variable. This is a crucial aspect of boosting.
    - Building a simple Decision Tree (depth 1) based on gender to predict these residuals is a valid approach. $M_1$ acts as a weak learner focusing on patterns in the errors.
3. Additive predictions and residuals
    - $M_1$'s predictions for the residuals are added to the initial predictions ($M_0$) to create new predictions. This is the "additive combining" concept.
    - Calculating new residuals based on these updated predictions is also correct.
4. Overfitting and stopping (optional): Repeatedly adding models can lead to overfitting. Boosting algorithms typically use techniques like cross-validation or a stopping criterion (e.g., maximum number of models) to prevent this.
5. Gamma ($\gamma$) for model weights:
    - Gamma ($\gamma$) controls the influence of each model in the ensemble.
    - The final prediction is therefore the weighted sum, $f_m(x) = h_0(x) + \gamma_1 h_1(x) + \gamma_2 h_2(x) + ... + \gamma_m h_m(x)$.
    - This weighting factor is crucial. Boosting algorithms determine these weights from how well each model reduces the error (AdaBoost uses the model's weighted training error, while gradient boosting picks the $\gamma$ that minimizes the loss). Models with lower errors get higher weights in the final prediction.
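
The walk-through above can be reproduced with a handful of numbers. The six (gender, weight) pairs below are invented so that their mean is exactly 67. The depth-1 tree on gender reduces to the mean residual per gender, and `gamma_1 = 1.0` is an assumed value chosen only to keep the arithmetic easy to follow.

```python
# Hypothetical toy data: (gender, weight); chosen so that the mean weight is 67.
data = [("M", 80), ("M", 75), ("M", 70), ("F", 60), ("F", 55), ("F", 62)]
weights = [w for _, w in data]

def mean(values):
    return sum(values) / len(values)

# Step 1: M_0 predicts the mean for everyone; error_0 is what is left over.
h0 = mean(weights)
residuals_0 = [w - h0 for w in weights]

# Step 2: M_1 is a depth-1 tree on gender, i.e. the mean residual of each gender.
h1 = {g: mean([r for (gender, _), r in zip(data, residuals_0) if gender == g])
      for g in ("M", "F")}

# Steps 3 and 5: additive combining, with gamma_1 scaling M_1's contribution.
gamma_1 = 1.0  # assumed value for illustration
predictions_1 = [h0 + gamma_1 * h1[gender] for gender, _ in data]
residuals_1 = [w - p for w, p in zip(weights, predictions_1)]

print(h0)             # 67.0
print(residuals_0)    # [13.0, 8.0, 3.0, -7.0, -12.0, -5.0]
print(h1)             # {'M': 8.0, 'F': -8.0}
print(predictions_1)  # [75.0, 75.0, 75.0, 59.0, 59.0, 59.0]
print(residuals_1)    # [5.0, 0.0, -5.0, 1.0, -4.0, 3.0]
```

The new residuals are much smaller than the original ones. A value of `gamma_1` below 1 would apply only part of $M_1$'s correction, leaving more for later models to fix.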

### Differences from Linear Regression
- In Linear Regression, all weights are determined in one step using the entire dataset.
- In Boosting, weights (gamma) are assigned to each model sequentially based on their performance in correcting errors of previous models.

### Challenges of Boosting
- Boosting models are generally not parallelized due to the sequential nature of training each model based on the previous ones. This can be slower compared to some parallel Machine Learning algorithms (Bagging).
- Improving underfitting models can be more challenging than dealing with overfitting. Boosting works best when the base learners have some level of bias (underfitting) that can be progressively reduced through additive corrections.

# Gradient Boosting From Regression Perspective
### Loss function and predictions
Mean Squared Error (MSE) is used as the loss function for regression problems. It measures the squared difference between actual values ($y_i$) and predicted values ($y_{cap}$).

### Gradient and pseudo-residuals
- The key concept is the connection between gradients and pseudo-residuals.
- Taking the partial derivative of the loss function (MSE) with respect to the predicted value ($y_{cap}$) gives the gradient. Flipping its sign gives the negative gradient.
- This negative gradient is proportional to the actual residual ($y_i - y_{cap}$). It indicates the direction and magnitude for improvement in terms of reducing the loss.
- For MSE the two therefore coincide up to a constant factor. For other loss functions the plain residual is no longer the negative gradient, so the negative gradient is used in its place.
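
As a worked equation for the bullets above, assume the per-point squared-error loss is written with a factor of $\frac{1}{2}$ (a common convention, assumed here), and write $\hat{y}_i$ for the prediction that the text calls $y_{cap}$:

$$L(y_i, \hat{y}_i) = \frac{1}{2}(y_i - \hat{y}_i)^2$$

$$\frac{\partial L}{\partial \hat{y}_i} = -(y_i - \hat{y}_i) \quad \Rightarrow \quad -\frac{\partial L}{\partial \hat{y}_i} = y_i - \hat{y}_i$$

Without the factor of $\frac{1}{2}$ the negative gradient becomes $2(y_i - \hat{y}_i)$, which is why it is described as proportional to the residual rather than equal to it.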

### Pseudo-residuals for generality
- A pseudo-residual is the negative gradient of the loss function with respect to the predicted value ($y_{cap}$). It directly relates to the direction for loss reduction.
- By fitting subsequent models to pseudo-residuals instead of plain residuals, gradient boosting can use the same procedure for any differentiable loss function, not only MSE.

### Optimizing for pseudo-residuals
- Optimizing for pseudo-residuals translates to optimizing for the overall loss function.
- Since the pseudo-residual is the negative gradient, fitting each new learner to the pseudo-residuals moves the predictions in the direction that reduces the overall MSE.

### Summary of Gradient Boosting algorithm
1. Initialize: Start with an initial prediction (often the mean of the target variable). Assign equal weights to all data points.
2. Iteratively train models: In each iteration,
    - Fit a weak learner (e.g., Decision Tree) on the current data using the pseudo-residuals as the target variable.
    - Calculate the model's weight based on its performance in reducing the loss (e.g., MSE).
3. Update predictions: Update the overall prediction for each data point by adding the weighted prediction from the new learner to the previous ensemble prediction.

### Overall benefit
By iteratively fitting models on pseudo-residuals and adding their predictions, gradient boosting progressively reduces the overall loss function, leading to improved performance compared to a single model.

# Loss Function
### Loss functions and gradients
The common loss functions are MSE/RMSE for regression and Log-Loss for classification.

### Pseudo-residuals for generality
- Plain residuals are the natural training target only for squared-error losses.
- Pseudo-residuals, derived from the negative gradient of the loss function with respect to the prediction at the previous stage ($k - 1$), offer an alternative that works for any differentiable loss.

### Gradient boosting and pseudo-residuals
- GBDT leverages pseudo-residuals to guide the training process.
- At each stage ($k$), the negative gradient of the loss function is calculated with respect to the output at the previous stage ($k - 1$). This negative gradient serves as the pseudo-residual for the current stage.
- Focusing on minimizing these pseudo-residuals during model fitting ensures the overall loss function (e.g., MSE or Log-Loss) is reduced iteratively.
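
To make the link between a loss function and its pseudo-residuals concrete, the sketch below writes out the negative gradient for two losses and checks each against a numerical derivative. It assumes squared error with a factor of 1/2, and Log-Loss for labels in {0, 1} where the model output `f` is a raw score (log-odds). All function names are illustrative.

```python
import math

def squared_loss(y, f):
    return 0.5 * (y - f) ** 2

def log_loss(y, f):
    """Log-Loss for a label y in {0, 1}; f is the raw score (log-odds)."""
    p = 1.0 / (1.0 + math.exp(-f))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def pseudo_residual_squared(y, f):
    """Negative gradient of the squared loss: the ordinary residual."""
    return y - f

def pseudo_residual_log_loss(y, f):
    """Negative gradient of the Log-Loss: label minus predicted probability."""
    return y - 1.0 / (1.0 + math.exp(-f))

def numeric_negative_gradient(loss, y, f, eps=1e-6):
    """Central finite difference, used only to sanity-check the formulas above."""
    return -(loss(y, f + eps) - loss(y, f - eps)) / (2 * eps)

print(pseudo_residual_squared(3.0, 1.0))   # 2.0
print(pseudo_residual_log_loss(1, 0.0))    # 0.5
```

In both cases the pseudo-residual has the form "actual minus predicted", which is why the same fitting loop serves regression and classification.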

### Why Gradient Boosting?
- By iteratively fitting models on pseudo-residuals and adding their predictions, GBDT progressively improves the model's performance compared to a single model.
- This approach helps address the bias of weak learners (e.g., Decision Trees) by sequentially correcting their errors.

### Additional notes
- Different GBDT algorithms might use different techniques to calculate the optimal step size (learning rate) when applying the pseudo-residuals to update the predictions.
- While GBDT is commonly used with Decision Trees as weak learners, the concept of pseudo-residuals can be applied with other weak learning algorithms as well.

# Gradient Boosting Decision Tree (GBDT) Algorithm
1. Data preparation: Prepare the training data by handling missing values, scaling features if necessary, and splitting it into training and validation sets (optional).
2. Initialization:
    - Initialize the model prediction for each data point (often the mean of the target variable in regression or a simple prediction rule in classification).
    - Assign equal weights to all data points in the training set.
3. For each iteration:
    - Calculate pseudo-residuals:
        - Based on the current ensemble prediction (at stage ($t - 1$)), calculate the loss function's gradient (e.g., negative gradient of MSE for regression or negative gradient of Log-Loss for classification) with respect to the prediction at the previous stage ($t - 1$).
        - These gradients are the pseudo-residuals for the current iteration.
    - Fit a weak learner:
        - Train a weak learner (e.g., a shallow Decision Tree) on the training data using the calculated pseudo-residuals as the target variable.
        - The goal of a weak learner is to improve upon the current ensemble prediction by focusing on the areas with high pseudo-residuals (large errors from the previous stage).
    - Model weighting:
        - Based on the weak learner's performance in reducing the loss function on the training set (e.g., using a learning rate), determine a weight for this learner.
        - This weight reflects how much influence the learner's predictions will have in the final ensemble.
4. Update ensemble prediction:
    - For each data point, add the weighted prediction from the newly trained weak learner to the current ensemble prediction (additive combining).
    - This creates an updated ensemble prediction that incorporates the improvements from the new weak learner.
5. Stopping: Continue iteratively training models until a stopping criterion is met. Common criteria include,
    - Reaching a predefined number of iterations.
    - The validation error (error on a held-out set) starts to increase, indicating overfitting.
6. Prediction for new data: To make a prediction for a new data point, apply the final ensemble model. This typically involves,
    - Calculating the predictions from each weak learner in the ensemble.
    - Adding the weighted predictions from all weak learners (similar to the update step during training).
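
Steps 2 to 6 fit into a short, dependency-free sketch for the regression case. It assumes squared-error loss (so the pseudo-residuals are plain residuals), a single feature, depth-1 trees as weak learners, and one fixed learning rate as the weight of every learner. The names `fit_gbdt` and `predict_gbdt` and the toy data are hypothetical. Libraries such as scikit-learn implement the same idea with many refinements.

```python
def fit_stump(xs, targets):
    """Depth-1 regression tree on one feature: (threshold, left_value, right_value)."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_value, right_value = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((t - left_value) ** 2 for t in left)
               + sum((t - right_value) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def stump_predict(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value

def fit_gbdt(xs, ys, n_stages=20, learning_rate=0.5):
    f0 = sum(ys) / len(ys)                      # step 2: initial prediction (mean)
    preds = [f0] * len(xs)
    stumps = []
    for _ in range(n_stages):                   # steps 3 and 5: fixed number of stages
        # pseudo-residuals: negative gradient of (y - F)^2 / 2 at the current predictions
        pseudo_residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, pseudo_residuals)
        stumps.append(stump)
        # step 4: additive update, scaled by the learning rate
        preds = [p + learning_rate * stump_predict(stump, x) for p, x in zip(preds, xs)]
    return f0, stumps, learning_rate

def predict_gbdt(model, x):
    f0, stumps, learning_rate = model           # step 6: sum the weighted learners
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Hypothetical toy data.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [3.0, 4.0, 5.0, 9.0, 10.0, 11.0, 20.0, 21.0]

model = fit_gbdt(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict_gbdt(model, x) for x in xs])
print(baseline_mse, boosted_mse)
```

Predicting for an unseen point uses exactly the same sum as the training update. For example, `predict_gbdt(model, 100)` falls on the right-hand side of every split and therefore gets the same value as `x = 8`.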

### Key points
- GBDT iteratively improves the model by focusing on the errors (pseudo-residuals) from the previous models.
- Weak learners (e.g., Decision Trees) are used as building blocks for the ensemble.
- Pseudo-residuals, derived from the loss function gradient, guide the training process efficiently.
- Model weights control the influence of each weak learner in the final prediction.
- GBDT excels at addressing bias in weak learners by sequentially correcting their errors.

### Additional considerations
- GBDT models can be computationally expensive due to the sequential training process.
- Tuning hyperparameters like the number of iterations, learning rate, and weak learner complexity is crucial for optimal performance.

# Algorithm For Pseudo-Residuals In GBDT
### Initialization
- The argmin is used to find the initial prediction that minimizes the loss function ($L$) for all data points ($n$) with respect to a parameter ($\mu$). However, this parameter typically represents the initial prediction itself (often the mean for regression).
- So the minimization is to find the best constant value for the initial prediction ($F_0$) that minimizes the overall loss.
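
A brute-force check makes the argmin tangible. The sketch below tries every constant $\mu$ on a grid and keeps the one with the lowest total loss. The targets and the grid are made up. Squared and absolute error are used as two example losses to show that the best constant depends on the loss.

```python
import statistics

ys = [2.0, 3.0, 4.0, 5.0, 16.0]   # hypothetical targets, one of them large

def total_loss(loss, mu):
    """Overall loss when every data point receives the same prediction mu."""
    return sum(loss(y, mu) for y in ys)

def squared(y, mu):
    return (y - mu) ** 2

def absolute(y, mu):
    return abs(y - mu)

# Brute-force argmin over candidate constants 0.00, 0.01, ..., 20.00
grid = [i / 100 for i in range(0, 2001)]
best_mu_squared = min(grid, key=lambda mu: total_loss(squared, mu))
best_mu_absolute = min(grid, key=lambda mu: total_loss(absolute, mu))

print(best_mu_squared, statistics.mean(ys))     # 6.0 6.0
print(best_mu_absolute, statistics.median(ys))  # 4.0 4.0
```

For squared error the best constant is the mean, which matches the usual regression initialization. For absolute error it is the median, which is less affected by the large value.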

### Core steps
1. Pseudo-residual calculation:
    - The derivative of the loss function is not calculated directly with respect to the entire model prediction ($f(x_i)$).
    - In GBDT, the pseudo-residual for iteration m is calculated as the negative gradient of the loss function with respect to the prediction at the previous stage ($f_{(m - 1)}(x_i)$) for each data point $i$.
    - This reflects the direction for improvement in reducing the loss based on the previous ensemble prediction.
2. Weak learner training:
    - A weak learner (e.g., a shallow Decision Tree) is trained on these pseudo-residuals.
    - The goal of this weak learner is to learn patterns in the errors (residuals) from the previous ensemble prediction.
3. Model weighting: While some algorithms might use a line search or optimization techniques, the core idea is to assign a weight to this weak learner based on its performance in reducing the loss on the training data. This weight reflects the learner's importance in the final ensemble.
4. Ensemble update:
    - The update of the ensemble prediction ($F_m(x)$) is done by adding the weighted prediction from the newly trained weak learner ($h_m(x)$) to the previous ensemble prediction ($F_{(m - 1)}(x)$).
    - This additive combining is a key aspect of GBDT.
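
Written as equations, the initialization and the four core steps take the following standard form. The symbol $r_{im}$ for the pseudo-residual of point $i$ at iteration $m$ is notation introduced here for convenience. The line search shown for $\gamma_m$ is one common way to choose the weight, not the only one.

$$F_0(x) = \arg\min_{\mu} \sum_{i=1}^{n} L(y_i, \mu)$$

$$r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{(m - 1)}}$$

$$h_m = \text{weak learner fitted to the pairs } (x_i, r_{im})$$

$$\gamma_m = \arg\min_{\gamma} \sum_{i=1}^{n} L\left(y_i, F_{(m - 1)}(x_i) + \gamma h_m(x_i)\right)$$

$$F_m(x) = F_{(m - 1)}(x) + \gamma_m h_m(x)$$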

### Overall process
The core concept is to iteratively improve the ensemble prediction. In each iteration,
- Calculate pseudo-residuals based on the previous ensemble prediction's errors.
- Train a weak learner on these pseudo-residuals.
- Assign a weight to the weak learner based on its performance.
- Update the ensemble prediction by adding the weighted prediction from the weak learner.

### Number of iterations:
- The process is repeated for a predefined number of iterations ($m$).
- The stopping criterion typically involves reaching a maximum number of iterations or observing signs of overfitting on a validation set.

### Optimization:
GBDT optimizes for 2 things,
- The weak learners ($h_m(x)$) themselves during training to fit the pseudo-residuals.
- The weights ($\gamma_m$) assigned to each weak learner to determine their influence in the final ensemble prediction. These weights are often calculated based on the learner's performance in reducing the loss function.
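
For squared-error loss the second optimization has a closed form: the $\gamma$ that minimizes the loss is the sum of residual times learner prediction, divided by the sum of squared learner predictions. The sketch below illustrates this under that assumption. The numbers are invented, and the weak learner's predictions are deliberately too timid so that the best $\gamma$ is not 1.

```python
def best_gamma_squared_loss(ys, prev_preds, learner_preds):
    """Closed-form line search for squared loss: gamma = sum(r * h) / sum(h * h)."""
    residuals = [y - f for y, f in zip(ys, prev_preds)]
    numerator = sum(r * h for r, h in zip(residuals, learner_preds))
    denominator = sum(h * h for h in learner_preds)
    return numerator / denominator

def total_squared_loss(ys, prev_preds, learner_preds, gamma):
    """Loss of the updated ensemble F_(m-1)(x) + gamma * h_m(x)."""
    return sum((y - (f + gamma * h)) ** 2
               for y, f, h in zip(ys, prev_preds, learner_preds))

# Hypothetical values.
ys = [10.0, 12.0, 20.0, 22.0]
prev_preds = [16.0, 16.0, 16.0, 16.0]     # F_(m-1)(x_i)
learner_preds = [-2.0, -2.0, 2.0, 2.0]    # h_m(x_i): right direction, too small

gamma = best_gamma_squared_loss(ys, prev_preds, learner_preds)
print(gamma)                                                     # 2.5
print(total_squared_loss(ys, prev_preds, learner_preds, 1.0))    # 40.0
print(total_squared_loss(ys, prev_preds, learner_preds, gamma))  # 4.0
```

When the weak learner is a regression tree fitted by least squares, its leaf values are already the mean residuals, so this line search returns 1. In practice the step is then scaled down by a learning rate.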

# Bias-Variance Trade-Off In GBDT
### Bias-variance in GBDT
- GBDT leverages weak learners (often shallow Decision Trees) with high bias (underfitting) and low variance (low model complexity).
- The goal is to iteratively reduce the bias of the ensemble model by combining the predictions from multiple weak learners that focus on the errors (residuals) from previous stages.

### Hyperparameter tuning
There are 2 crucial hyperparameters in GBDT that affect the bias-variance trade-off,
1. Number of boosting stages ($m$): This controls the number of weak learners used in the ensemble.
    - Increasing $m$ allows the model to capture more complex patterns, potentially reducing bias.
    - However, a very high $m$ can lead to overfitting as the model becomes sensitive to training data noise (high variance).
2. Base learner complexity (depth): This determines the complexity of individual Decision Trees (weak learners).
    - Shallower trees have lower variance but may not capture complex relationships (higher bias).
    - Deeper Trees have the potential to reduce bias but can also lead to overfitting if they become too specific to the training data (high variance).

### Finding the right balance:
- The challenge lies in finding the optimal combination of $m$ and the depth that achieves a good balance between bias and variance.
- Lower bias and lower variance generally lead to better generalization performance (performance on unseen data).

### Additional considerations
- Other hyperparameters like learning rate can also influence the bias-variance trade-off.
- Techniques like cross-validation can be used to evaluate different hyperparameter configurations and identify the one that results in the best performance on a held-out validation set.

### Strategies to tune hyperparameters in GBDT
- Grid search: Evaluate a predefined grid of hyperparameter values and choose the combination with the best validation performance.
- Random search: Sample hyperparameter values randomly from a defined range and select the best performing combination.
- Early stopping: Stop training the model if the validation error starts to increase, preventing overfitting.
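
Early stopping is usually implemented with a patience window rather than stopping at the very first increase. The sketch below is one possible version of that rule. The function name, the `patience` parameter, and the list of validation errors are all hypothetical.

```python
def early_stopping_round(validation_errors, patience=3):
    """Return the index of the best boosting stage seen before training is stopped.

    Training stops once the validation error has failed to improve for
    `patience` consecutive stages.
    """
    best_round, best_error = 0, float("inf")
    for round_index, error in enumerate(validation_errors):
        if error < best_error:
            best_round, best_error = round_index, error
        elif round_index - best_round >= patience:
            break
    return best_round

# Hypothetical validation error after each boosting stage.
validation_errors = [0.40, 0.31, 0.25, 0.22, 0.21, 0.23, 0.24, 0.26, 0.20]

print(early_stopping_round(validation_errors, patience=3))   # 4
print(early_stopping_round(validation_errors, patience=10))  # 8
```

A small patience stops at stage 4 and never sees the later improvement at stage 8, so patience itself is a setting worth tuning. In scikit-learn, `GradientBoostingClassifier` offers comparable behavior through its `n_iter_no_change` and `validation_fraction` parameters.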

# Effect Of Outliers In GBDT
### Potential benefits
- Compared to some simpler models (e.g., Linear Regression), GBDT exhibits some level of robustness to outliers.
- During each iteration, GBDT focuses on the pseudo-residuals, which are the negative gradients of the loss function with respect to the predictions from the previous stage.
- Outliers with significant deviations from the overall trend will have larger pseudo-residuals.
- The weak learner in that iteration can potentially capture this pattern and adjust the model's predictions to account for those outliers to some extent.

### Potential drawbacks
- GBDT's focus on pseudo-residuals can also be a downside in the presence of outliers.
- If the weak learner in an iteration overfits to a few extreme outliers, it can introduce unnecessary complexity into the model.
- This can lead to a decrease in the model's generalizability (performance on unseen data) as it focuses too much on fitting the outliers in the training data.

### Overall effect
The net effect of outliers on GBDT depends on several factors,
- Number of outliers: A few outliers might be handled by the model, but a large number of outliers can significantly impact performance.
- Distribution of outliers: Outliers far from the main cluster are more likely to cause problems.
- Model complexity (hyperparameters):
    - A model with a high number of boosting stages ($m$) or deep Decision Trees might be more susceptible to overfitting to outliers.
    - Conversely, a simpler model with fewer stages or shallow Trees might not capture the underlying patterns even in the presence of some outliers (higher bias).

### Mitigation strategies
- Outlier detection and handling: Consider identifying and handling outliers before training the GBDT model (e.g., capping extreme values, removing outliers if justifiable).
- Hyperparameter tuning: Carefully tune hyperparameters like $m$ and depth to find a balance between reducing bias and avoiding overfitting to outliers. Techniques like cross-validation can be helpful for this purpose.
- Alternative loss functions: Explore loss functions less sensitive to outliers, such as Huber Loss or Trimmed Mean Squared Error, especially for regression tasks.
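
The effect of a robust loss shows up directly in the pseudo-residuals. The sketch below compares the negative gradient of squared error with that of the Huber loss, which equals the residual clipped to the range $[-\delta, \delta]$. The data, the current prediction, and the value of `delta` are invented for illustration.

```python
def squared_pseudo_residual(y, f):
    """Negative gradient of (y - f)^2 / 2: grows without bound for outliers."""
    return y - f

def huber_pseudo_residual(y, f, delta=1.0):
    """Negative gradient of the Huber loss: the residual clipped to [-delta, delta]."""
    residual = y - f
    return max(-delta, min(delta, residual))

# Hypothetical targets with one extreme outlier, and a constant current prediction.
ys = [10.0, 11.0, 9.0, 100.0]
current_prediction = 10.0

squared_targets = [squared_pseudo_residual(y, current_prediction) for y in ys]
huber_targets = [huber_pseudo_residual(y, current_prediction, delta=2.0) for y in ys]

print(squared_targets)  # [0.0, 1.0, -1.0, 90.0]
print(huber_targets)    # [0.0, 1.0, -1.0, 2.0]
```

With squared error the next weak learner is asked to fit a target of 90 for the outlier. With the Huber loss that target is capped at 2, so a single extreme point has far less pull on the ensemble.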

# Advantages And Disadvantages Of GBDT
### Advantages
- High accuracy and flexibility: GBDT can achieve excellent accuracy on various regression and classification tasks due to its ability to learn complex patterns by combining multiple weak learners.
- Handles diverse data types: GBDT can work effectively with different data types, including numerical and categorical features.
- Robust to outliers: GBDT is relatively insensitive to outliers in the data compared to some simpler models.
- Interpretability: While not as interpretable as linear models, GBDT models can be partially interpreted by analyzing the features used in the Decision Trees at each stage. Feature importance scores can be helpful for understanding which features contribute most to the model's predictions.
- Handles missing data: GBDT can handle missing data inherently by splitting data points based on existing features in the Decision Trees.
- Reduces bias: By iteratively focusing on errors from previous models, GBDT progressively reduces the overall bias of the ensemble.

### Disadvantages
- Computationally expensive: Training GBDT models can be computationally expensive due to the sequential training of multiple weak learners.
- Prone to overfitting: If not tuned properly (especially with high $m$ or depth), GBDT models can overfit the training data and perform poorly on unseen data. Careful hyperparameter tuning and techniques like early stopping are crucial to mitigate this risk.
- Black box nature: While partially interpretable, GBDT models can be complex, making it challenging to fully understand the internal logic behind their predictions.
- Prone to high variance: With a large number of weak learners or high-depth Trees, GBDT models can become sensitive to training data noise, leading to high variance.

# Code Implementation Of GBDT

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

In [2]:
pd.set_option("display.max_columns", None)
sns.set_theme(style = "whitegrid")
warnings.filterwarnings("ignore")
plt.rcParams['figure.figsize'] = (20, 10)

In [3]:
# loading employee_attrition_dataset
import pickle

with open("employee_attrition_dataset/x_sm.pkl", "rb") as file:
    x_train = pickle.load(file)

with open("employee_attrition_dataset/y_sm.pkl", "rb") as file:
    y_train = pickle.load(file)

with open("employee_attrition_dataset/x_test.pkl", "rb") as file:
    x_test = pickle.load(file)
    
with open("employee_attrition_dataset/y_test.pkl", "rb") as file:
    y_test = pickle.load(file)

In [4]:
from sklearn.ensemble import GradientBoostingClassifier

gbdt_classifier = GradientBoostingClassifier(n_estimators = 150, max_depth = 2, loss = "log_loss")
gbdt_classifier.fit(x_train, y_train)

In [5]:
gbdt_classifier.score(x_train, y_train)

0.9419496166484118

In [6]:
gbdt_classifier.score(x_test, y_test)

0.8722826086956522

# EMG Data

In [7]:
import pickle

with open("emg_data/x_train.pkl", "rb") as file:
    x_train = pickle.load(file)

with open("emg_data/x_test.pkl", "rb") as file:
    x_test = pickle.load(file)

with open("emg_data/y_train.pkl", "rb") as file:
    y_train = pickle.load(file)

with open("emg_data/y_test.pkl", "rb") as file:
    y_test = pickle.load(file)

In [8]:
# building a Decision Tree Classifier to begin with
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

params = {
    "max_depth": [3, 5, 7, 9, 11, 13, 15],
    "max_leaf_nodes": [20, 40, 60]
}

dt_classifier = DecisionTreeClassifier()
dt_classifier_grid_search = GridSearchCV(dt_classifier, params, scoring = "accuracy", cv = 5)
dt_classifier_grid_search.fit(x_train, y_train)

In [9]:
# results of grid search
res = dt_classifier_grid_search.cv_results_

for i in range(len(res["params"])):
    print(f"Parameters: {res['params'][i]} Mean Score: {res['mean_test_score'][i]} Rank: {res['rank_test_score'][i]}")

Parameters: {'max_depth': 3, 'max_leaf_nodes': 20} Mean Score: 0.3417685364121617 Rank: 19
Parameters: {'max_depth': 3, 'max_leaf_nodes': 40} Mean Score: 0.3417685364121617 Rank: 19
Parameters: {'max_depth': 3, 'max_leaf_nodes': 60} Mean Score: 0.3417685364121617 Rank: 19
Parameters: {'max_depth': 5, 'max_leaf_nodes': 20} Mean Score: 0.5631025479050761 Rank: 16
Parameters: {'max_depth': 5, 'max_leaf_nodes': 40} Mean Score: 0.5630391363641757 Rank: 18
Parameters: {'max_depth': 5, 'max_leaf_nodes': 60} Mean Score: 0.5631025479050761 Rank: 16
Parameters: {'max_depth': 7, 'max_leaf_nodes': 20} Mean Score: 0.6959034696550738 Rank: 15
Parameters: {'max_depth': 7, 'max_leaf_nodes': 40} Mean Score: 0.714612307711491 Rank: 10
Parameters: {'max_depth': 7, 'max_leaf_nodes': 60} Mean Score: 0.714866014209575 Rank: 9
Parameters: {'max_depth': 9, 'max_leaf_nodes': 20} Mean Score: 0.6997722775522944 Rank: 11
Parameters: {'max_depth': 9, 'max_leaf_nodes': 40} Mean Score: 0.767757554329693 Rank: 8
Para

In [10]:
# best estimator
dt_classifier_grid_search.best_estimator_

In [11]:
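# fitting a model using the best parameters
# (GridSearchCV already refits best_estimator_ on the full training set by default, so this extra fit is optional)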
dt_classifier = dt_classifier_grid_search.best_estimator_
dt_classifier.fit(x_train, y_train)

In [12]:
# training accuracy score
dt_classifier.score(x_train, y_train)

0.8192541856925418

In [13]:
# testing accuracy score
dt_classifier.score(x_test, y_test)

0.7973624144052752

In [14]:
# building a Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

params = {
    "n_estimators": [10, 25, 50, 100, 200],
    "max_depth": [3, 5, 10, 15, 20],
    "max_leaf_nodes": [20, 40, 80]
}

rf_classifier = RandomForestClassifier()
rf_classifier_grid_search = GridSearchCV(rf_classifier, params, scoring = "accuracy", cv = 3, n_jobs = -1, verbose = 1)
rf_classifier_grid_search.fit(x_train, y_train)

Fitting 3 folds for each of 75 candidates, totalling 225 fits


In [15]:
# results of grid search
res = rf_classifier_grid_search.cv_results_

for i in range(len(res["params"])):
    print(f"Parameters: {res['params'][i]} Mean Score: {res['mean_test_score'][i]} Rank: {res['rank_test_score'][i]}")

Parameters: {'max_depth': 3, 'max_leaf_nodes': 20, 'n_estimators': 10} Mean Score: 0.5287924911212583 Rank: 75
Parameters: {'max_depth': 3, 'max_leaf_nodes': 20, 'n_estimators': 25} Mean Score: 0.6485286656519533 Rank: 70
Parameters: {'max_depth': 3, 'max_leaf_nodes': 20, 'n_estimators': 50} Mean Score: 0.6390157280568239 Rank: 71
Parameters: {'max_depth': 3, 'max_leaf_nodes': 20, 'n_estimators': 100} Mean Score: 0.6645104008117707 Rank: 66
Parameters: {'max_depth': 3, 'max_leaf_nodes': 20, 'n_estimators': 200} Mean Score: 0.6825849822425165 Rank: 61
Parameters: {'max_depth': 3, 'max_leaf_nodes': 40, 'n_estimators': 10} Mean Score: 0.5573313039066464 Rank: 74
Parameters: {'max_depth': 3, 'max_leaf_nodes': 40, 'n_estimators': 25} Mean Score: 0.6246829020801623 Rank: 72
Parameters: {'max_depth': 3, 'max_leaf_nodes': 40, 'n_estimators': 50} Mean Score: 0.651382546930492 Rank: 69
Parameters: {'max_depth': 3, 'max_leaf_nodes': 40, 'n_estimators': 100} Mean Score: 0.6817605276509386 Rank: 62

In [16]:
# best estimator
rf_classifier_grid_search.best_estimator_

In [17]:
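# fitting a model using the best parameters
# (GridSearchCV already refits best_estimator_ on the full training set by default, so this extra fit is optional)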
rf_classifier = rf_classifier_grid_search.best_estimator_
rf_classifier.fit(x_train, y_train)

In [18]:
# training accuracy score
rf_classifier.score(x_train, y_train)

0.9082952815829528

In [19]:
# testing accuracy score
rf_classifier.score(x_test, y_test)

0.8889170682221659

In [20]:
# now building a GBDT classifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {
    "n_estimators": [10, 25, 50, 100, 200],
    "max_depth": [3, 5, 10, 15, 20],
    "max_leaf_nodes": [20, 40, 80],
    "learning_rate": [0.1, 0.2, 0.3]
}

gbdt_classifier = GradientBoostingClassifier()
gbdt_classifier_random_search = RandomizedSearchCV(gbdt_classifier, params, scoring = "accuracy", cv = 3, n_jobs = -1, verbose = 1)
gbdt_classifier_random_search.fit(x_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits


In [21]:
# results of random search
res = gbdt_classifier_random_search.cv_results_

for i in range(len(res["params"])):
    print(f"Parameters: {res['params'][i]} Mean Score: {res['mean_test_score'][i]} Rank: {res['rank_test_score'][i]}")

Parameters: {'n_estimators': 25, 'max_leaf_nodes': 80, 'max_depth': 15, 'learning_rate': 0.3} Mean Score: 0.9459665144596651 Rank: 7
Parameters: {'n_estimators': 50, 'max_leaf_nodes': 80, 'max_depth': 20, 'learning_rate': 0.1} Mean Score: 0.9544647387113141 Rank: 3
Parameters: {'n_estimators': 100, 'max_leaf_nodes': 80, 'max_depth': 3, 'learning_rate': 0.1} Mean Score: 0.927004058853374 Rank: 10
Parameters: {'n_estimators': 200, 'max_leaf_nodes': 80, 'max_depth': 20, 'learning_rate': 0.2} Mean Score: 0.9639142567224758 Rank: 1
Parameters: {'n_estimators': 100, 'max_leaf_nodes': 40, 'max_depth': 15, 'learning_rate': 0.3} Mean Score: 0.958523592085236 Rank: 2
Parameters: {'n_estimators': 100, 'max_leaf_nodes': 20, 'max_depth': 5, 'learning_rate': 0.1} Mean Score: 0.9450152207001522 Rank: 8
Parameters: {'n_estimators': 50, 'max_leaf_nodes': 40, 'max_depth': 20, 'learning_rate': 0.2} Mean Score: 0.9529426686960933 Rank: 5
Parameters: {'n_estimators': 50, 'max_leaf_nodes': 40, 'max_depth': 

In [22]:
# best estimator
gbdt_classifier_random_search.best_estimator_

In [23]:
# fitting a model using the best parameters
gbdt_classifier = gbdt_classifier_random_search.best_estimator_
gbdt_classifier.fit(x_train, y_train)

In [24]:
# training accuracy score
gbdt_classifier.score(x_train, y_train)

1.0

In [25]:
# testing accuracy score
gbdt_classifier.score(x_test, y_test)

0.9781891960436216

# XGBoost
XGBoost, which stands for eXtreme Gradient Boosting, is an optimized implementation of the Gradient Boosting Decision Tree (GBDT) algorithm. It's known for its efficiency, accuracy, and scalability, making it a popular choice for various Machine Learning tasks, especially regression and classification.

### Core principles
- XGBoost follows the core principles of GBDT, using multiple weak learners (often Decision Trees) to build a powerful ensemble model.
- It iteratively trains models by focusing on the errors (pseudo-residuals) from previous stages to progressively improve the ensemble's performance.
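
The iterative idea above can be sketched in a few lines of plain Python. This is a minimal illustration, not XGBoost's implementation: it assumes a single feature, squared-error loss (so the pseudo-residuals are simply `y - prediction`), and depth-1 Trees (stumps). The helper names `fit_stump`, `boost`, and `predict` are made up for this example.

```python
def fit_stump(xs, residuals):
    """Find the threshold and two leaf values that minimize squared error."""
    best = None
    # every distinct value except the largest is a valid "x <= threshold" split
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_value = sum(left) / len(left)
        right_value = sum(right) / len(right)
        sse = (sum((r - left_value) ** 2 for r in left)
               + sum((r - right_value) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return left_value if x <= threshold else right_value


def boost(xs, ys, n_rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial constant prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        # pseudo-residuals of squared error: what the ensemble still gets wrong
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # add a shrunken version of the new learner to the ensemble
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps


def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [x * x for x in xs]
model = boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
print(f"constant model MSE: {baseline_mse:.3f}, boosted MSE: {boosted_mse:.3f}")
```

Each round only has to explain what is left over from the previous rounds, which is why many weak stumps together can fit a curve that no single stump can.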

### Optimizations and improvements
- Regularization: XGBoost incorporates various regularization techniques to prevent overfitting. These techniques penalize overly complex models, encouraging simpler Decision Trees in the ensemble.
- Sparsity awareness: XGBoost handles sparse inputs (missing values, zero entries, one-hot encoded features) natively by learning a default direction at each split for instances whose feature value is missing. This avoids a separate imputation step and reduces computational cost.
- Parallelization: XGBoost is designed for efficient parallel and distributed computing, making it suitable for handling large datasets.
- Improved objective functions: XGBoost offers built-in support for various objective functions (loss functions) tailored for different tasks (e.g., regression, classification, ranking).

### Advantages of XGBoost
- High performance: XGBoost often achieves state-of-the-art performance on various Machine Learning benchmarks.
- Scalability: It can handle large datasets efficiently due to its optimized implementation and support for parallelization.
- Flexibility: It supports various objective functions and offers a wide range of hyperparameters to tune the model for specific tasks.
- Regularization: Built-in regularization techniques help prevent overfitting and improve model generalizability.

### Disadvantages of XGBoost
- Complexity: Tuning XGBoost can be more complex compared to simpler algorithms due to its numerous hyperparameters.
- Black box nature: While partially interpretable through feature importance scores, XGBoost models can be difficult to fully understand due to their ensemble nature.

# Hyperparameters In XGBoost
### General parameters
- `n_estimators`: Controls the number of weak learners (Decision Trees) used in the ensemble. Increasing this value can reduce bias but also increase variance and risk of overfitting.
- `learning_rate`: Determines the step size for updating the model with each iteration. A lower learning rate can help prevent overfitting but might require more training iterations.
- `max_depth`: Limits the maximum depth of the individual Decision Trees. Deeper Trees can capture more complex patterns but are more prone to overfitting.
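
The interplay between `learning_rate` and `n_estimators` can be illustrated with a deliberately idealized sketch. It assumes that every weak learner fits the current residual perfectly, so each boosting round shrinks the remaining residual by a factor of `1 - learning_rate`. Real Trees are not perfect fits, so the absolute numbers are only indicative; `rounds_needed` is a hypothetical helper.

```python
def rounds_needed(learning_rate, tolerance=0.01):
    """Rounds until the residual drops to `tolerance` of its initial size,
    assuming each learner fits the current residual exactly."""
    residual, rounds = 1.0, 0
    while residual > tolerance:
        residual *= 1.0 - learning_rate  # only a fraction of the fit is applied
        rounds += 1
    return rounds


for lr in (0.1, 0.2, 0.3):
    print(f"learning_rate={lr}: {rounds_needed(lr)} rounds")
```

Under this assumption a learning rate of 0.1 needs roughly three times as many rounds as 0.3 to reach the same residual, which is why the two parameters are usually tuned together.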

### Regularization parameters
- `reg_lambda` (L2 regularization): Penalizes model complexity by adding a penalty term based on the square of the leaf weights. Higher values shrink the leaf weights, encouraging simpler models and reducing overfitting.
- `reg_alpha` (L1 regularization): Penalizes model complexity by adding a penalty term based on the absolute value of the leaf weights. Higher values can shrink some leaf weights to exactly zero.
- `gamma` (minimum loss reduction for a split): Controls the minimum improvement in the loss function required to make a split in the Decision Trees. Higher values can help prevent overfitting by avoiding unnecessary splits.
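
A small sketch of how these three parameters enter the split decision, based on the regularized objective described in the XGBoost documentation. Here `G` and `H` are the sums of first and second derivatives of the loss over the instances in a node; the function names are illustrative and not part of the XGBoost API.

```python
def soft_threshold(g, reg_alpha):
    """L1 shrinkage applied to the gradient sum of a leaf."""
    if g > reg_alpha:
        return g - reg_alpha
    if g < -reg_alpha:
        return g + reg_alpha
    return 0.0


def leaf_weight(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """Optimal leaf weight: -G / (H + lambda), with L1 shrinkage on G."""
    return -soft_threshold(G, reg_alpha) / (H + reg_lambda)


def leaf_score(G, H, reg_lambda=1.0, reg_alpha=0.0):
    """How much loss a leaf removes (larger is better)."""
    return soft_threshold(G, reg_alpha) ** 2 / (H + reg_lambda)


def split_gain(GL, HL, GR, HR, reg_lambda=1.0, reg_alpha=0.0, gamma=0.0):
    """Loss reduction from splitting a node into left and right children.
    The split is only worth making when the result is positive."""
    children = (leaf_score(GL, HL, reg_lambda, reg_alpha)
                + leaf_score(GR, HR, reg_lambda, reg_alpha))
    parent = leaf_score(GL + GR, HL + HR, reg_lambda, reg_alpha)
    return 0.5 * (children - parent) - gamma


# the same candidate split under increasingly strong regularization
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0))              # no regularization
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=2.0))              # L2 lowers the gain
print(split_gain(-4.0, 2.0, 4.0, 2.0, reg_lambda=0.0, gamma=10.0))  # gamma rejects the split
```

Larger `reg_lambda` and `reg_alpha` shrink the leaf weights and the gain, while `gamma` acts as a fixed price that every additional split has to pay.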

### Tree-specific parameters
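- `min_child_weight` (minimum sum of weights of instances in a child node): Sets the minimum sum of instance weights (Hessians) required in a child node. For squared-error regression this is simply the minimum number of samples in the node. Higher values can prevent overfitting by avoiding overly specific splits.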
- `subsample` (sampling ratio): Randomly samples a proportion of training data for each boosting iteration. Can help improve model generalizability by reducing variance.
- `colsample_bytree` (feature sampling ratio): Randomly samples a subset of features for each Tree in the ensemble. Promotes model diversity and reduces overfitting.
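
To make the two sampling ratios concrete, the sketch below draws a fresh subset of rows and columns for each Tree. It is only meant to show the idea; the exact sampling procedure inside XGBoost may differ, and `sample_rows_and_columns` is a hypothetical helper.

```python
import random


def sample_rows_and_columns(n_rows, n_cols, subsample, colsample_bytree, rng):
    """Pick the row and column indices that one Tree is allowed to see."""
    n_sampled_rows = max(1, round(n_rows * subsample))
    n_sampled_cols = max(1, round(n_cols * colsample_bytree))
    rows = sorted(rng.sample(range(n_rows), n_sampled_rows))
    cols = sorted(rng.sample(range(n_cols), n_sampled_cols))
    return rows, cols


rng = random.Random(42)
for tree_index in range(3):
    rows, cols = sample_rows_and_columns(
        n_rows=10, n_cols=8, subsample=0.6, colsample_bytree=0.5, rng=rng
    )
    print(f"tree {tree_index}: rows={rows} cols={cols}")
```

Because every Tree sees a different slice of the data, the Trees are less correlated with each other, which is where the variance reduction comes from.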

### Task-specific parameters
- `objective`: Specifies the objective function (loss function) to be optimized for the task. XGBoost provides options for regression (e.g., squared error via `reg:squarederror`), classification (e.g., logistic loss via `binary:logistic`, or softmax via `multi:softmax`), and ranking tasks.
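
What an objective contributes to training is, in essence, a first and second derivative of the loss for every instance. The sketch below writes these out by hand for two common losses. It assumes the squared error is defined as `0.5 * (prediction - target) ** 2` and that the logistic loss works on a raw margin with labels 0 or 1; the function names are illustrative only.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def squared_error_grad_hess(prediction, target):
    """Derivatives of 0.5 * (prediction - target)^2 w.r.t. the prediction."""
    return prediction - target, 1.0


def logistic_grad_hess(margin, label):
    """Derivatives of the logistic loss w.r.t. the raw margin."""
    p = sigmoid(margin)
    return p - label, p * (1.0 - p)


print(squared_error_grad_hess(3.0, 1.0))
print(logistic_grad_hess(0.0, 1))
```

The boosting machinery stays the same for every task; switching `objective` only changes which pair of derivatives is fed into the Tree construction.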

### Choosing hyperparameters
- There's no single best set of hyperparameters for all XGBoost models.
- It's essential to experiment with different configurations and evaluate their performance on a validation set to find the optimal values for the specific data and task.

### Techniques for hyperparameter tuning
- Grid search: Evaluate a predefined grid of hyperparameter values and choose the combination with the best validation performance.
- Random search: Sample hyperparameter values randomly from a defined range and select the best performing combination. This can be more efficient for exploring a large hyperparameter space.
- Bayesian optimization: Uses a probabilistic model to guide the search for optimal hyperparameters, focusing on configurations with higher predicted performance.
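
The difference in cost between grid search and random search is easy to see by counting candidates. The sketch below uses only the standard library and a small grid with the same shape as the ones tuned in this notebook; `grid_candidates` and `random_candidates` are hypothetical helpers, and no model is trained here.

```python
import itertools
import random

param_grid = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.2, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
}


def grid_candidates(grid):
    """Every combination of values: what grid search would evaluate."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in itertools.product(*(grid[k] for k in keys))]


def random_candidates(grid, n_iter, seed=0):
    """A random subset of the grid, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(grid_candidates(grid), n_iter)


all_candidates = grid_candidates(param_grid)
sampled = random_candidates(param_grid, n_iter=10)
print(f"grid search: {len(all_candidates)} candidates")
print(f"random search: {len(sampled)} candidates")
```

With 5-fold cross-validation the full grid would mean several thousand fits, whereas ten sampled candidates need only fifty, at the price of possibly missing the best combination.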

# Code Implementation Of XGBoost

In [26]:
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
# on macOS, XGBoost needs the OpenMP runtime: brew install libomp

params = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [3, 4, 5, 7],
    "learning_rate": [0.1, 0.2, 0.3],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0]
}

# the number of classes is inferred from y_train; verbosity = 0 silences XGBoost's own messages
xgb_classifier = XGBClassifier(objective = "multi:softmax", verbosity = 0)
xgb_classifier_randomized_search = RandomizedSearchCV(xgb_classifier, param_distributions = params, n_iter = 10, scoring = "accuracy", n_jobs = -1, verbose = 2)
xgb_classifier_randomized_search.fit(x_train, y_train)

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END colsample_bytree=1.0, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=1.0, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=0.8; total time=   1.5s
[CV] END colsample_bytree=1.0, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=0.8; total time=   1.5s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.8; total time=   2.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.8; total time=   2.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.8; total time=   2.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.8; total time=   2.3s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.8; total time=   2.4s
[CV] END colsample_bytree=1.0, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.8, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=1.0; total time=   1.4s
[CV] END colsample_bytree=1.0, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=0.8; total time=   1.6s
[CV] END colsample_bytree=0.8, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=1.0; total time=   1.5s
[CV] END colsample_bytree=0.8, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=1.0; total time=   1.5s
[CV] END colsample_bytree=0.8, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=1.0; total time=   1.6s
[CV] END colsample_bytree=0.8, learning_rate=0.3, max_depth=7, n_estimators=50, subsample=1.0; total time=   1.6s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=1.0; total time=   3.0s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=1.0; total time=   3.3s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=1.0; total time=   3.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=100, subsample=1.0; total time=   3.1s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=100, subsample=1.0; total time=   3.0s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=100, subsample=1.0; total time=   3.1s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=1.0; total time=   3.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=100, subsample=1.0; total time=   3.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=100, subsample=1.0; total time=   2.9s
[CV] END colsample_bytree=1.0, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.6; total time=   3.3s
[CV] END colsample_bytree=1.0, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.6; total time=   3.5s
[CV] END colsample_bytree=1.0, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.6; total time=   3.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=200, subsample=0.8; total time=   6.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=200, subsample=0.8; total time=   6.3s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=200, subsample=0.8; total time=   6.2s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=200, subsample=0.8; total time=   6.1s
[CV] END colsample_bytree=1.0, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.6; total time=   2.8s
[CV] END colsample_bytree=0.8, learning_rate=0.1, max_depth=5, n_estimators=200, subsample=0.8; total time=   6.2s
[CV] END colsample_bytree=0.6, learning_rate=0.3, max_depth=5, n_estimators=50, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.6, learning_rate=0.3, max_depth=5, n_estimators=50, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=1.0, learning_rate=0.2, max_depth=4, n_estimators=100, subsample=0.6; total time=   2.7s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=0.6; total time=   3.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=0.6; total time=   3.1s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=0.6; total time=   3.1s
[CV] END colsample_bytree=0.6, learning_rate=0.3, max_depth=5, n_estimators=50, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.6, learning_rate=0.3, max_depth=5, n_estimators=50, subsample=0.8; total time=   1.4s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=0.6; total time=   3.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=100, subsample=0.6; total time=   3.2s
[CV] END colsample_bytree=0.6, learning_rate=0.3, max_depth=5, n_estimators=50, subsample=0.8; total time=   1.3s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=150, subsample=0.8; total time=   3.4s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=150, subsample=0.8; total time=   3.2s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=150, subsample=0.8; total time=   3.1s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=150, subsample=0.8; total time=   3.1s
[CV] END colsample_bytree=0.8, learning_rate=0.2, max_depth=7, n_estimators=150, subsample=0.8; total time=   3.1s

In [27]:
# results of random search
res = xgb_classifier_randomized_search.cv_results_

for i in range(len(res["params"])):
    print(f"Parameters: {res['params'][i]} Mean Score: {res['mean_test_score'][i]} Rank: {res['rank_test_score'][i]}")

Parameters: {'subsample': 0.8, 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.2, 'colsample_bytree': 0.8} Mean Score: 0.960172343437549 Rank: 7
Parameters: {'subsample': 0.8, 'n_estimators': 50, 'max_depth': 7, 'learning_rate': 0.3, 'colsample_bytree': 1.0} Mean Score: 0.9687341684832373 Rank: 4
Parameters: {'subsample': 1.0, 'n_estimators': 50, 'max_depth': 7, 'learning_rate': 0.3, 'colsample_bytree': 0.8} Mean Score: 0.9675290073107293 Rank: 6
Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 7, 'learning_rate': 0.2, 'colsample_bytree': 0.8} Mean Score: 0.9718416155482764 Rank: 2
Parameters: {'subsample': 1.0, 'n_estimators': 100, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 0.8} Mean Score: 0.949010524545978 Rank: 10
Parameters: {'subsample': 0.8, 'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.1, 'colsample_bytree': 0.8} Mean Score: 0.9682266549295987 Rank: 5
Parameters: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 4, 'learning_r

In [28]:
# best estimator
xgb_classifier_randomized_search.best_estimator_

In [29]:
# fitting a model using the best parameters
xgb_classifier = xgb_classifier_randomized_search.best_estimator_
xgb_classifier.fit(x_train, y_train)

In [30]:
# training accuracy score
xgb_classifier.score(x_train, y_train)

1.0

In [31]:
# testing accuracy score
xgb_classifier.score(x_test, y_test)

0.9761602840476794

# LightGBM
LightGBM (Light Gradient Boosting Machine) is an efficient implementation of the Gradient Boosting Decision Tree (GBDT) algorithm with several optimizations for speed and performance.

### Gradient Boosting framework
- LightGBM follows the core principle of GBDT, which involves building an ensemble of weak learners (typically shallow Decision Trees) in a sequential manner.
- Each weak learner is trained to improve upon the predictions of the previous ensemble by focusing on the residuals (errors) from the prior stage.
- This iterative process progressively reduces the overall error of the ensemble, leading to a more accurate model.

### Key optimizations in LightGBM
- Histogram-based algorithm:
    - LightGBM utilizes a histogram-based approach to find optimal split points in the Decision Trees.
    - Instead of sorting the data for each feature at every split, it builds histograms to efficiently count data points falling into different value bins.
    - This significantly reduces the computational cost compared to traditional GBDT which relies on feature sorting.
- Gradient based one-sided sampling (GOSS):
    - LightGBM employs a sampling technique called GOSS to focus training on data points with larger gradients (errors).
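    - It keeps all instances with large gradients and randomly samples from those with small gradients, so training concentrates on the instances that contribute most to the information gain while the training time drops.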
- Exclusive feature bundling (EFB):
    - LightGBM can group mutually exclusive features (features that rarely take nonzero values at the same time, such as one-hot encoded columns) into bundles.
    - This reduces the number of features considered at each split, leading to faster Decision Tree construction and potentially reducing overfitting.
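- Efficient Tree learning algorithms: LightGBM grows Trees leaf-wise (best-first), always splitting the leaf with the largest loss reduction, rather than level-wise. This often reaches a lower loss with the same number of leaves, but can overfit small datasets unless `num_leaves` or `max_depth` is constrained.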
- Parallelization and early stopping:
    - LightGBM supports efficient parallelization across multiple cores or machines to handle large datasets.
    - It incorporates early stopping techniques to prevent overfitting by stopping training when the validation performance starts to deteriorate.
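
The histogram-based split search from the list above can be sketched in plain Python. This is a simplified illustration, not LightGBM's implementation: it assumes equal-width bins on a single feature and squared-error loss (so every Hessian is 1 and the instance count stands in for the Hessian sum). The function names are made up for this example.

```python
def build_histogram(values, gradients, n_bins, lo, hi):
    """Accumulate the gradient sum and instance count of each value bin."""
    width = (hi - lo) / n_bins
    grad_sum = [0.0] * n_bins
    count = [0] * n_bins
    for v, g in zip(values, gradients):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp v == hi into the last bin
        grad_sum[b] += g
        count[b] += 1
    return grad_sum, count


def best_split_from_histogram(grad_sum, count):
    """Scan the bin boundaries once and return (best_bin, best_gain).
    Splitting after best_bin sends bins 0..best_bin to the left child."""
    total_g, total_n = sum(grad_sum), sum(count)
    best_bin, best_gain = None, 0.0
    left_g, left_n = 0.0, 0
    for b in range(len(count) - 1):
        left_g += grad_sum[b]
        left_n += count[b]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue  # an empty child is not a valid split
        gain = (left_g ** 2 / left_n + right_g ** 2 / right_n
                - total_g ** 2 / total_n)
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


values = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
gradients = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
grad_sum, count = build_histogram(values, gradients, n_bins=4, lo=0.0, hi=1.0)
best_bin, best_gain = best_split_from_histogram(grad_sum, count)
print(grad_sum, count, best_bin, best_gain)
```

Once the histogram is built, finding the split costs one pass over the bins instead of one pass over the sorted instances, which is where the speed-up comes from when the number of bins is much smaller than the number of rows.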

### Overall algorithm steps
1. Initialize: Start with an initial prediction model (often a simple constant value).
2. Iteration loop:
    - Calculate the residuals (errors) for each data point based on the current ensemble predictions.
    - Optionally use GOSS to keep the data points with larger residuals and randomly sample from the rest.
    - Build a new shallow Decision Tree using the histogram-based algorithm and EFB (optional) to improve predictions on the residuals.
    - Update the ensemble model by adding the newly trained Decision Tree.
3. Repeat step 2: Continue iteratively building new trees until a stopping criterion (e.g., maximum number of iterations, early stopping) is met.
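
The GOSS sampling step in the loop above can be sketched as follows. The sketch follows the published description of the technique: keep the `top_rate` fraction of instances with the largest absolute gradients, sample an `other_rate` fraction of all instances from the remainder, and up-weight the sampled ones by `(1 - top_rate) / other_rate` so that the gradient statistics stay roughly unbiased. The parameter names mirror LightGBM's `top_rate` and `other_rate`; `goss_sample` itself is a hypothetical helper, not a LightGBM function.

```python
import random


def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return {instance_index: weight} for the instances kept by GOSS."""
    n = len(gradients)
    # indices ordered from largest to smallest absolute gradient
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    top, rest = order[:n_top], order[n_top:]
    sampled = random.Random(seed).sample(rest, n_other)
    # small-gradient instances are amplified to compensate for the sampling
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights


gradients = [(-1) ** i * float(i) for i in range(20)]
weights = goss_sample(gradients)
print(sorted(weights.items()))
```

With 20 instances and the default rates, only 6 instances are used to build the next Tree: the 4 with the largest gradients at weight 1 and 2 randomly chosen others at a weight of about 8.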

LightGBM's optimizations make it a powerful and efficient alternative to traditional GBDT algorithms, particularly for large datasets and computationally intensive tasks.

# Code Implementation Of LightGBM

In [32]:
from lightgbm import LGBMClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {
    "learning_rate": [0.1, 0.3, 0.5],
    "boosting_type": ["gbdt"],
    "objective": ["multiclass"],
    "max_depth": [5, 6, 7, 8],
    "colsample_bytree": [0.5, 0.7],
    "subsample": [0.5, 0.7],  # note: LightGBM ignores subsample unless subsample_freq > 0 (default 0)
    "metric": ["multi_error"]
}
lgbm_classifier = LGBMClassifier(num_classes = 20)
lgbm_classifier_random_search = RandomizedSearchCV(lgbm_classifier, params, verbose = 2, cv = 3, n_jobs = -1, n_iter = 10)
lgbm_classifier_random_search.fit(x_train, y_train)

Fitting 3 folds for each of 10 candidates, totalling 30 fits
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005523 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2040
[LightGBM] [Info] Number of data points in the train set: 10512, number of used features: 8
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002422 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2040
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002658 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2040
[LightGBM] [Info] Number of data points in the train set: 10512, number of used features: 8
[LightGBM] [Info] Number of data points in the train set: 10512, number of used features: 8
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of t

In [33]:
# results of random search
res = lgbm_classifier_random_search.cv_results_

for i in range(len(res["params"])):
    print(f"Parameters: {res['params'][i]} Mean Score: {res['mean_test_score'][i]} Rank: {res['rank_test_score'][i]}")

Parameters: {'subsample': 0.7, 'objective': 'multiclass', 'metric': 'multi_error', 'max_depth': 7, 'learning_rate': 0.1, 'colsample_bytree': 0.5, 'boosting_type': 'gbdt'} Mean Score: 0.9627727042110603 Rank: 4
Parameters: {'subsample': 0.5, 'objective': 'multiclass', 'metric': 'multi_error', 'max_depth': 8, 'learning_rate': 0.1, 'colsample_bytree': 0.7, 'boosting_type': 'gbdt'} Mean Score: 0.968607305936073 Rank: 3
Parameters: {'subsample': 0.7, 'objective': 'multiclass', 'metric': 'multi_error', 'max_depth': 6, 'learning_rate': 0.5, 'colsample_bytree': 0.7, 'boosting_type': 'gbdt'} Mean Score: 0.1499873160832065 Rank: 9
Parameters: {'subsample': 0.5, 'objective': 'multiclass', 'metric': 'multi_error', 'max_depth': 7, 'learning_rate': 0.5, 'colsample_bytree': 0.5, 'boosting_type': 'gbdt'} Mean Score: 0.11409183155758497 Rank: 10
Parameters: {'subsample': 0.5, 'objective': 'multiclass', 'metric': 'multi_error', 'max_depth': 5, 'learning_rate': 0.3, 'colsample_bytree': 0.5, 'boosting_typ

In [34]:
# best estimator
lgbm_classifier_random_search.best_estimator_

In [35]:
# fitting a model using the best estimator
lgbm_classifier = lgbm_classifier_random_search.best_estimator_
lgbm_classifier.fit(x_train, y_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000340 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2040
[LightGBM] [Info] Number of data points in the train set: 15768, number of used features: 8
[LightGBM] [Info] Start training from score -2.993705
[LightGBM] [Info] Start training from score -3.010297
[LightGBM] [Info] Start training from score -2.992440
[LightGBM] [Info] Start training from score -3.028480
[LightGBM] [Info] Start training from score -2.989915
[LightGBM] [Info] Start training from score -3.007727
[LightGBM] [Info] Start training from score -2.981126
[LightGBM] [Info] Start training from score -3.027170
[LightGBM] [Info] Start training from score -3.014166
[LightGBM] [Info] Start training from score -2.991176
[LightGBM] [Info] Start training from score -3.007727
[LightGBM] [Info] Start training from score -2.967470
[LightGBM] [Info] Start training from score -2.996240
[LightGBM]

In [36]:
# training accuracy score
lgbm_classifier.score(x_train, y_train)

1.0

In [37]:
# testing accuracy score
lgbm_classifier.score(x_test, y_test)

0.9789500380420999