## 🔷 **Step 1: What is LightGBM?**

### ✅ **Definition:**

**LightGBM (Light Gradient Boosting Machine)** is a fast, memory-efficient **gradient boosting framework** developed by Microsoft. It is designed to train faster and use less memory than conventional gradient boosting implementations, mainly through histogram-based split finding and leaf-wise tree growth. In practice it is often faster than XGBoost on large datasets.

---

### 🧠 **Core Idea:**

LightGBM builds an ensemble of **decision trees**, just like XGBoost, but the way it grows trees is what makes it unique.
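To make "an ensemble of decision trees" concrete, here is a toy sketch of gradient boosting with squared loss, where each new tree (a one-split "stump") is fitted to the residuals of the current ensemble. This is not LightGBM's implementation: the data, the helper names (`fit_stump`, `predict_stump`, `boost`), and the settings are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on a 1-D feature with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # skip the max so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]  # (threshold, left_mean, right_mean)


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def boost(xs, ys, n_estimators=20, learning_rate=0.3):
    """Sequentially add stumps, each fitted to the current residuals."""
    base = sum(ys) / len(ys)          # start from the mean prediction
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for p, x in zip(preds, xs)]
    return base, stumps, preds


def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Hypothetical toy data, made up for illustration only
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.8, 5.2, 5.0]

base, stumps, preds = boost(xs, ys)
baseline_mse = mse(ys, [base] * len(ys))
boosted_mse = mse(ys, preds)
print(f"MSE with mean only: {baseline_mse:.3f}")
print(f"MSE after boosting: {boosted_mse:.3f}")
```

Real libraries replace the stump with a full tree and use gradients of an arbitrary loss, but the loop has the same shape: predict, compute what is still wrong, fit the next tree to that.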

---

### 🌳 **How LightGBM Builds Trees (Key Difference)**

#### 🔸 XGBoost → **Level-wise Tree Growth**:

* It grows all leaves at the same level before moving to the next level.

* Slower, but more balanced.

#### 🔹 LightGBM → **Leaf-wise Tree Growth**:

* It splits the **leaf** whose best split gives the **largest loss reduction** (gain), wherever that leaf sits in the tree.

* Grows **unevenly**, focusing on **more informative splits**.

* For the same number of leaves, this often reaches a **lower loss**, which can translate into **better accuracy**.

* But it can **overfit** if not tuned properly!
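The two growth strategies can be sketched with a small simulation. The gain rule in `child_gains` is a made-up assumption (real gains come from the data), and the function names are hypothetical; the point is only to show that leaf-wise growth with a leaf budget produces leaves at different depths, while level-wise growth keeps them all at the same depth.

```python
import heapq


def child_gains(gain):
    # Hypothetical rule: the left child keeps 60% of the parent's gain, the right child 30%.
    return gain * 0.6, gain * 0.3


def grow_leaf_wise(root_gain, num_leaves):
    """Always split the leaf with the largest gain until the leaf budget is used."""
    heap = [(-root_gain, 0)]  # (negated gain, depth): heapq is a min-heap
    while len(heap) < num_leaves:
        neg_gain, depth = heapq.heappop(heap)
        left, right = child_gains(-neg_gain)
        heapq.heappush(heap, (-left, depth + 1))
        heapq.heappush(heap, (-right, depth + 1))
    return sorted(depth for _, depth in heap)


def grow_level_wise(root_gain, max_depth):
    """Split every leaf on the current level before moving to the next level."""
    leaves = [(root_gain, 0)]
    for _ in range(max_depth):
        next_level = []
        for gain, depth in leaves:
            left, right = child_gains(gain)
            next_level += [(left, depth + 1), (right, depth + 1)]
        leaves = next_level
    return sorted(depth for _, depth in leaves)


leaf_wise_depths = grow_leaf_wise(1.0, num_leaves=8)
level_wise_depths = grow_level_wise(1.0, max_depth=3)
print("leaf-wise depths :", leaf_wise_depths)
print("level-wise depths:", level_wise_depths)
```

Both trees end with 8 leaves, but the leaf-wise tree spends its budget where the (assumed) gain is highest, so some branches go deeper than others.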

---

### ⚙️ **Advantages of LightGBM:**

| Feature                             | Description                                                          |
| ----------------------------------- | -------------------------------------------------------------------- |
| 🚀 **Speed**                        | Often trains faster than XGBoost, especially on large datasets       |
| 💾 **Memory Efficient**             | Histogram-based binning of feature values keeps RAM usage low        |
| 📈 **High Accuracy**                | Leaf-wise growth often helps improve performance                     |
| 🔣 **Handles Categorical Features** | You don’t need to one-hot encode; it supports them natively          |
| 🧱 **Scalable**                     | Can handle **millions of rows** and **thousands of features**        |

---

### ⚠️ **Disadvantages / When to Be Careful:**

* **Leaf-wise tree growth** can easily overfit if:

  * Dataset is small

  * `num_leaves` is too large

  * Learning rate is high

* Needs **careful hyperparameter tuning** for generalization
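One tuning detail worth checking is how `num_leaves` and `max_depth` interact: a binary tree of depth `d` has at most `2**d` leaves, so a large `num_leaves` has no effect once `max_depth` is small. The helper below is a hypothetical illustration (not part of LightGBM); it relies on LightGBM's convention that `max_depth <= 0` means "no depth limit".

```python
def effective_max_leaves(num_leaves, max_depth):
    """Upper bound on leaves per tree when both limits are set.

    A binary tree of depth d has at most 2**d leaves.
    max_depth <= 0 follows LightGBM's "no limit" convention.
    """
    if max_depth <= 0:
        return num_leaves
    return min(num_leaves, 2 ** max_depth)


for num_leaves, max_depth in [(31, -1), (60, 3), (31, 7)]:
    bound = effective_max_leaves(num_leaves, max_depth)
    print(f"num_leaves={num_leaves:>3}, max_depth={max_depth:>2} -> at most {bound} leaves")
```

When both parameters are tuned together, keeping `num_leaves` below `2**max_depth` avoids searching over values that cannot change the model.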

---

### 📝 Summary:

> LightGBM is a **fast and accurate** boosting algorithm that grows trees **leaf-wise** and supports **categorical data** natively. It’s ideal for **large datasets**, but needs **careful tuning** to avoid overfitting.

---

## 🔷 **Step 2: Difference Between XGBoost and LightGBM**

Understanding the key differences helps you choose the right model based on **dataset size**, **speed needs**, and **risk of overfitting**.

---

### ✅ **Comparison Table: XGBoost vs LightGBM**

| Feature                        | XGBoost                                                                         | LightGBM                                    |
| ------------------------------ | ------------------------------------------------------------------------------- | ------------------------------------------- |
| 🌲 Tree Growth Strategy        | **Level-wise** by default (balanced tree)                                       | **Leaf-wise** (splits best leaf first)      |
| ⚡ Speed                        | Usually slower (but optimized)                                                  | **Often much faster** on large datasets     |
| 🧠 Accuracy                    | Very accurate                                                                   | Comparable or better, but can overfit       |
| 💾 Memory Usage                | Moderate                                                                        | **More memory-efficient**                   |
| 🧱 Large Dataset Support       | Scales well                                                                     | **Scales extremely well**                   |
| 🔣 Categorical Feature Support | ⚠️ Traditionally one-hot encoded; newer versions offer `enable_categorical`     | ✅ **Built-in support** for categorical data |
| 🔄 Overfitting Risk            | Low to moderate                                                                 | **Higher**, if not tuned properly           |
| 🛠️ Parallelization Support    | Good                                                                            | Excellent                                   |
| 🔍 Default Interpretability    | Medium (trees can be visualized)                                                | Medium                                      |

---

### 💡 **Key Takeaways (Easy Memory Hints)**

* **XGBoost**: Safe & stable → Slower, less prone to overfitting

* **LightGBM**: Bold & fast → Faster, but **more sensitive** (tune it!)

---

### 🧪 When to Use What:

| Situation                    | Recommendation         |
| ---------------------------- | ---------------------- |
| ✅ Small to Medium Dataset    | Try **XGBoost** first  |
| ✅ Large Dataset (100k+ rows) | Try **LightGBM**       |
| ✅ Training Time is Critical  | **LightGBM** is faster |
| ✅ High Risk of Overfitting   | **XGBoost** is safer   |
| ✅ Many Categorical Features  | Use **LightGBM**       |

---

## 🔷 **Step 3: Load Dataset (Breast Cancer Dataset)**

We’ll use the `load_breast_cancer()` dataset from `sklearn.datasets`, which is:

✅ Clean (no missing values)

✅ Only mildly imbalanced (212 malignant vs. 357 benign samples; in `data.target`, 0 = malignant and 1 = benign)

✅ Perfect for binary classification

---

In [1]:
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset
data = load_breast_cancer()

# Features (X) and target (y)
x = data.data
y = data.target

# Split into 80% train and 20% test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)
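As an optional check, the class counts can be inspected directly, and the split can be stratified so that train and test keep the same class proportions. This is a sketch of an alternative split, not the one used for the outputs shown in this tutorial; passing `stratify=y` changes which rows land in the test set, so the accuracies would differ slightly.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
x, y = data.data, data.target

# Count samples per class: index 0 = malignant, index 1 = benign
class_counts = np.bincount(y)
for name, count in zip(data.target_names, class_counts):
    print(f"{name}: {count}")

# stratify=y keeps the class ratio the same in both parts of the split
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42, stratify=y
)
print(f"benign share (all)  : {y.mean():.3f}")
print(f"benign share (train): {y_train.mean():.3f}")
print(f"benign share (test) : {y_test.mean():.3f}")
```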

---

## 🔷 **Step 4: Train a Basic LightGBM Model**

We'll use the **`LGBMClassifier`** from the `lightgbm` library.

---

In [4]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score

# Initialize the LightGBM classifier
model = LGBMClassifier()

# Train the model on the training data
model.fit(x_train, y_train)

# Predict on test data
y_pred = model.predict(x_test)

# Evaluate accuracy
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy : {acc:.2f}")

[LightGBM] [Info] Number of positive: 286, number of negative: 169
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000386 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 4548
[LightGBM] [Info] Number of data points in the train set: 455, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.628571 -> initscore=0.526093
[LightGBM] [Info] Start training from score 0.526093
Accuracy : 0.96




---

## 🔷 **Step 5: Hyperparameter Tuning with RandomizedSearchCV**

---

### ✅ **What is RandomizedSearchCV?**

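`RandomizedSearchCV` **tunes hyperparameters** like `max_depth`, `n_estimators`, etc. by sampling a fixed number of combinations from the search space and scoring each one with cross-validation, instead of trying every combination.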

| `GridSearchCV`                           | `RandomizedSearchCV`                                      |
| ---------------------------------------- | --------------------------------------------------------- |
| Tests **all possible combinations**      | Tests a **random selection** of combinations (`n_iter`)   |
| **Very slow** with large parameter grids | **Much faster**; often finds a comparably good setting    |
| Best when parameter space is **small**   | Best when parameter space is **large**                    |

---

### ✅ **When to Use It?**

Use `RandomizedSearchCV` when:

* You have **many parameters**.

* You want **faster results**.

* You can **afford a little randomness** for speed.
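The cost difference is easy to quantify. The sketch below counts model fits for an exhaustive grid versus a random search; the per-parameter counts mirror the search space used in the tuning code of this step, and the dictionary name is just an illustration.

```python
from math import prod

# Number of candidate values per hyperparameter in the search space
candidates_per_param = {
    "num_leaves": len(range(20, 150, 10)),  # 20, 30, ..., 140
    "max_depth": len(range(3, 15)),         # 3, 4, ..., 14
    "learning_rate": 4,
    "n_estimators": 4,
    "min_child_samples": 3,
    "subsample": 3,
}
cv_folds = 3
n_iter = 20  # combinations sampled by the random search

grid_combinations = prod(candidates_per_param.values())
grid_fits = grid_combinations * cv_folds
random_fits = n_iter * cv_folds

print(f"Grid search  : {grid_combinations} combinations, {grid_fits} fits")
print(f"Random search: {n_iter} combinations, {random_fits} fits")
```

The trade-off is coverage: the random search only looks at a small fraction of the grid, so a different `random_state` can select different "best" parameters.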

---

### 🧪 **Code Example: Tuning LightGBM on Breast Cancer Dataset**

Let’s tune:

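* `num_leaves`: maximum number of leaves per tree

* `max_depth`: maximum tree depth

* `learning_rate`: how much each new tree contributes (shrinkage)

* `n_estimators`: number of boosting rounds (trees)

* `min_child_samples`: minimum number of samples in one leaf

* `subsample`: fraction of rows sampled per tree (only takes effect when `subsample_freq` > 0)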

In [11]:
from sklearn.model_selection import RandomizedSearchCV
from lightgbm import LGBMClassifier
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import numpy as np

# 1. Load dataset
data = load_breast_cancer()
x, y = data.data, data.target

# 2. Split into train/test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

# 3. Define model
model = LGBMClassifier()

# 4. Define hyperparameter search space
param_dist = {
    "num_leaves" : np.arange(20,150,10), # Max leaves per tree. More leaves = more complexity (capped at 2**max_depth).
    "max_depth" : np.arange(3,15),  # How deep each tree can go. Controls overfitting.
    "learning_rate" : [0.001, 0.01, 0.05, 0.1], # Shrinkage per tree. Smaller values usually need more trees.
    "n_estimators" : [50, 100, 200, 300], # How many trees to build in total.
    "min_child_samples" : [5, 10, 20],  # Minimum number of samples in a leaf. Helps control overfitting.
    "subsample" : [0.6, 0.8, 1.0]   # Fraction of rows sampled per tree. Ignored unless subsample_freq > 0 (default is 0).
}

# 5. Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator = model,
    param_distributions = param_dist,
    n_iter = 20,  # Try 20 random combinations
    scoring = "accuracy", # Use accuracy to select best model
    cv=3, # 3-fold cross-validation
    verbose = 1,
    random_state = 42, #Running the same code again will give the same random choices.
    n_jobs = 1  # Run the fits one at a time on a single core (set to -1 to use all CPU cores)
)

# 6. Fit the model
random_search.fit(x_train, y_train)

# 7. Best Parameters
print("Best Parameters:", random_search.best_params_)

# 8. Evaluate on test data
best_model = random_search.best_estimator_
y_pred = best_model.predict(x_test)
print("Test Accuracy:", accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 20 candidates, totalling 60 fits
[LightGBM] [Info] Number of positive: 190, number of negative: 113
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000347 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 3054
[LightGBM] [Info] Number of data points in the train set: 303, number of used features: 30
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.627063 -> initscore=0.519636
[LightGBM] [Info] Start training from score 0.519636
...

---

### ✅ Final Output Breakdown:

#### 🏆 **Best Parameters Selected by RandomizedSearchCV:**

```python
{
  'subsample': 0.8,
  'num_leaves': 60,
  'n_estimators': 100,
  'min_child_samples': 5,
  'max_depth': 3,
  'learning_rate': 0.05
}
```

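✅ These had the **best mean cross-validation accuracy** among the 20 sampled combinations. Two caveats: with `max_depth = 3` a tree can have at most 2³ = 8 leaves, so `num_leaves = 60` is not the binding limit, and `subsample = 0.8` has no effect while `subsample_freq` stays at its default of 0.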

---

#### 🎯 **Test Accuracy:**

```python
0.956140350877193
```

That’s **95.6% accuracy** on the test set (109 of 114 samples correct), a strong result on unseen data.
It is also roughly on par with the untuned baseline from Step 4 (0.96), so on this small dataset tuning brought no clear gain.
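With a test set this small, it helps to translate the accuracy back into a count of correct predictions. The sketch below assumes the 80/20 split used above, which leaves 114 of the 569 samples for testing.

```python
n_test = 114  # 20% of the 569 samples, as produced by the split above
reported_accuracy = 0.956140350877193

# Number of correctly classified test samples behind the reported accuracy
correct = round(reported_accuracy * n_test)
# Smallest possible change in accuracy: one more (or one fewer) correct sample
step = 1 / n_test

print(f"{correct} of {n_test} correct, {n_test - correct} misclassified")
print(f"one sample changes accuracy by {step:.4f}")
```

Each test sample is worth almost 0.9 percentage points, so differences of that size between two models come down to a single prediction.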

---

Let’s begin **Step 6: Compare with XGBoost** 🔍

> 🔁 **Repeat similar steps** we used for LightGBM — but with **XGBoost**

> 📊 Then we’ll **compare performance, time, and tuned parameters**

---

## 🔷 **Step 6: Compare LightGBM with XGBoost**

---

### ✅ **1. Import Libraries**

---

In [13]:
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
from xgboost import XGBClassifier
import numpy as np

---

### ✅ **2. Load and Split Dataset**

---

In [14]:
# Load Breast Cancer Dataset
data = load_breast_cancer()
x, y = data.data, data.target

# Train-Test Split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 42)

---

### ✅ **3. Define Model and Parameter Grid for XGBoost**

---

In [17]:
# Define the model
model = XGBClassifier()

# Define hyperparameter search space
param_dist = {
    "max_depth" : [3, 4, 5], # Maximum depth of each tree. More depth → more complex trees.
    "learning_rate" : [0.01, 0.05, 0.1],  # Shrinkage per tree. Lower values usually need more trees.
    "n_estimators" : [50, 100, 200],  # Number of trees (boosting rounds). More trees help until the model starts to overfit.
    "subsample" : [0.6, 0.8, 1.0], # Fraction of the training rows sampled per tree. Helps reduce overfitting.
    "colsample_bytree" : [0.6, 0.8, 1.0]  # Fraction of features (columns) sampled per tree. Adds randomness and can help generalization.
}

---

### ✅ **4. Use RandomizedSearchCV for Tuning**

---

In [18]:
random_search = RandomizedSearchCV(
    estimator = model,
    param_distributions = param_dist,
    n_iter = 20,
    scoring = "accuracy",
    cv = 3,
    verbose = 1, # Shows progress output in the console
    random_state = 42,
    n_jobs = -1 #Use all CPU cores for faster performance 
)

---

### ✅ **5. Fit the Model**

---

In [19]:
random_search.fit(x_train, y_train)

# Best parameters
print("Best Parameters:", random_search.best_params_)

# Test Accuracy
best_model = random_search.best_estimator_
y_pred = best_model.predict(x_test)
print("Test Accuracy : ", accuracy_score(y_test, y_pred))

Fitting 3 folds for each of 20 candidates, totalling 60 fits
Best Parameters: {'subsample': 0.6, 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.1, 'colsample_bytree': 0.6}
Test Accuracy :  0.9649122807017544


---

## ✅ **🔍 LightGBM vs XGBoost — Final Comparison**

| Metric              | **LightGBM**                                                                                                             | **XGBoost**                                                                                            |
| ------------------- | ------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------ |
| **Best Parameters** | `'subsample': 0.8, 'num_leaves': 60, 'n_estimators': 100, 'min_child_samples': 5, 'max_depth': 3, 'learning_rate': 0.05` | `'subsample': 0.6, 'n_estimators': 100, 'max_depth': 4, 'learning_rate': 0.1, 'colsample_bytree': 0.6` |
| **Test Accuracy**   | **0.9561** (95.61%, 109 of 114 correct)                                                                                  | **0.9649** (96.49%, 110 of 114 correct)                                                                |
| **Training Time**   | ⚡ Not measured here; usually fast, especially on large datasets                                                          | 🕒 Not measured here; often somewhat slower on large datasets                                          |
| **Ease of Use**     | Simple + fast + good defaults                                                                                            | Powerful + stable + highly tunable                                                                     |
---

## ✅ Conclusion:

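🔸 Both models **performed very well**, with above 95% test accuracy

🔸 **XGBoost** gave slightly better accuracy here (96.49% vs. 95.61%), but that is a difference of **one test sample** out of 114, and the two search spaces were not identical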

🔸 **LightGBM** is often preferred for large datasets due to **speed**

🔸 **XGBoost** might be better for **fine-grained control** and **stability**
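To judge whether the accuracy gap is meaningful, a rough binomial standard error can be computed for a test set of 114 samples. This is a back-of-the-envelope sketch (it treats test predictions as independent coin flips and ignores that both models were scored on the same samples), with hypothetical helper and variable names.

```python
import math


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimated on n_samples."""
    return math.sqrt(accuracy * (1 - accuracy) / n_samples)


n_test = 114
lightgbm_acc = 109 / n_test  # 0.9561
xgboost_acc = 110 / n_test   # 0.9649

gap = xgboost_acc - lightgbm_acc
se = accuracy_standard_error(lightgbm_acc, n_test)

print(f"accuracy gap   : {gap:.4f}")
print(f"standard error : {se:.4f}")
print("gap is within one standard error" if gap < se else "gap exceeds one standard error")
```

Under these assumptions the gap is smaller than the sampling noise of a single accuracy estimate, so this experiment does not show that either library is more accurate on this dataset.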

---