# Lesson 3: Boosting Text Classification Power with Gradient Boosting Classifier

Greetings, learners! In this lesson we explore an advanced ensemble method for text classification: the **Gradient Boosting Classifier**. By the end, you will understand how it works and have practical experience applying it with **Python** and **Scikit-learn**.

---

## Quick Recap on Dataset Preparation

First, let's review a few steps that should already be familiar: loading required libraries and preparing the dataset, which is the **Reuters-21578 Text Categorization Collection**.

### Python Code:

```python
# Import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using CountVectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
```

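This code prepares the dataset: **CountVectorizer** extracts word-count features (capped at the 500 most frequent terms), **LabelEncoder** converts category names into integer labels, and `train_test_split` holds out 30% of the documents as a test set. Because a Reuters document can carry several categories, the code keeps only the first one listed for each document.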

---

## Inside the Gradient Boosting Classifier

**Gradient Boosting Classifier** is an ensemble learning technique that builds its model iteratively: each new model is trained to correct the errors of the ensemble built so far, with shallow **decision trees** as the weak learners. The process unfolds in several stages:

1. **Initial Prediction**:  
   - It starts with a simple model that predicts a constant value (the mean of the target for regression, or the class prior probabilities for classification), setting the stage for improvement.

2. **Iterative Correction**:  
   - The essence of **Gradient Boosting** is its ability to learn from the mistakes of previous iterations.  
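   - It focuses on the **residuals**: the differences between the actual and predicted values. For classification, these are *pseudo-residuals*, the negative gradient of the loss with respect to the current predictions.  
   - Each new tree in the ensemble is fit to these residuals, so that adding it **reduces the loss function** that measures the errors.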

3. **Learning Rate**:  
   - This parameter moderates the contribution of each new tree.  
   - A **smaller learning rate** demands more trees to achieve high accuracy but fosters a model that's less prone to overfitting.  
   - Conversely, a **larger learning rate** can hasten learning but increase the risk of overfitting by overly adjusting to the training data.

4. **Controlling Complexity**:  
   - To prevent overfitting, **Gradient Boosting** limits each tree's complexity, primarily using the **max_depth** parameter.  
   - This control ensures that individual trees do not grow too complex and start modeling the **noise** within the training data.

5. **Optimal Number of Trees**:  
   - The algorithm adds trees until it reaches the specified number (**n_estimators**). In scikit-learn, it stops earlier only if early stopping is enabled through `n_iter_no_change`, in which case training ends once a held-out validation score stops improving.  
   - This balance is crucial as **too few trees** might not capture all the data patterns, while **too many** could lead to overfitting.
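
The stages above can be sketched from scratch in a few lines. This is a simplified illustration for squared-error regression on made-up data, not scikit-learn's actual implementation; the names `X_demo`, `y_demo`, and `mse_history` are hypothetical. Regression is used here because its residuals are simply actual minus predicted values.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up 1-D regression problem: a noisy sine wave
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_trees = 50

# Stage 1: start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
mse_history = [np.mean((y_demo - prediction) ** 2)]
trees = []

for _ in range(n_trees):
    # Stage 2: residuals are the negative gradient of the squared-error loss
    residuals = y_demo - prediction
    # Stage 4: a shallow tree keeps each individual learner weak
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X_demo, residuals)
    # Stage 3: the learning rate shrinks the new tree's contribution
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)
    mse_history.append(np.mean((y_demo - prediction) ** 2))

print(f"Training MSE before boosting: {mse_history[0]:.4f}")
print(f"Training MSE after {n_trees} trees: {mse_history[-1]:.4f}")
```

The training error shrinks with every added tree, which is also why the number of trees needs a limit: past some point the ensemble starts fitting noise.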

📌 **In summary**, Gradient Boosting sequentially builds upon previous trees to correct errors, with careful adjustments of parameters like the **learning rate** and **max depth** to ensure a robust model. Its adaptive nature makes it exceptionally powerful for tasks including text classification, albeit requiring thoughtful parameter tuning to balance complexity with generalization.
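
Choosing the number of trees can be partly automated with scikit-learn's early stopping. The sketch below uses synthetic data from `make_classification` (an assumption made so the example runs on its own) and treats `n_estimators` as an upper bound; the specific values for `validation_fraction` and `n_iter_no_change` are illustrative, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data so the example is self-contained
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)

early_clf = GradientBoostingClassifier(
    n_estimators=500,         # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,  # share of training data held out for monitoring
    n_iter_no_change=5,       # stop after 5 stages without improvement
    tol=1e-4,
    random_state=0,
)
early_clf.fit(X_demo, y_demo)

print("Trees requested:", early_clf.n_estimators)
print("Trees actually fitted:", early_clf.n_estimators_)
```

The fitted attribute `n_estimators_` reports how many stages were kept, which may be well below the requested maximum.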

---

## Implementing Gradient Boosting Classifier for Text Classification

The main attraction is the **Gradient Boosting Classifier**. Let's set up and implement it now.

### Python Code:

```python
# Instantiate the GradientBoostingClassifier (these values match scikit-learn's defaults)
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)

# Train the classifier
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred = gb_clf.predict(X_test)
```

Here, we create an instance of the **GradientBoostingClassifier** with the following parameters:

- **n_estimators** (number of boosting stages) = `100`
- **learning_rate** (shrinkage applied to each tree's contribution) = `0.1`
- **max_depth** (maximum depth of each tree) = `3`

After this setup, the model is trained using **fit()**, and predictions are made on the test data.
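
It can help to look inside a fitted model. The sketch below is an illustration on synthetic three-class data (an assumption, chosen so it runs without the Reuters corpus) and uses only 20 stages to stay fast. For multi-class problems, scikit-learn fits one regression tree per class at every boosting stage, which shows up in the shape of `estimators_`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic three-class data as a stand-in for the vectorized documents
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4, n_classes=3, random_state=0
)

demo_clf = GradientBoostingClassifier(
    n_estimators=20, learning_rate=0.1, max_depth=3, random_state=0
)
demo_clf.fit(X_demo, y_demo)

# Rows are boosting stages, columns are the trees fitted per stage (one per class)
print(demo_clf.estimators_.shape)

# predict_proba returns one probability per class for each sample
proba = demo_clf.predict_proba(X_demo[:2])
print(proba)
```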

---

## Performance Evaluation

With the model trained and predictions made, let's assess its performance.

### Python Code:

```python
# Evaluate the performance of the classifier
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

🔹 **Output:**
```
Accuracy:  0.9852150537634409
```

📌 The `accuracy_score` function compares predicted values (`y_pred`) to actual test categories (`y_test`) and returns the fraction that match.  
📊 **Result**: Our Gradient Boosting model predicts **98.5%** of the test instances correctly!
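
As a small worked example of what accuracy measures, the sketch below computes it by hand on six made-up labels and compares the result with `accuracy_score`. The arrays `y_true` and `y_hat` are illustrative only. A confusion matrix is included because it shows which classes get mixed up, something a single accuracy number hides.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true labels and predictions for six samples
y_true = np.array([0, 1, 2, 2, 1, 0])
y_hat = np.array([0, 1, 2, 1, 1, 0])

# Accuracy is the share of positions where prediction and truth agree
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)
print(manual_accuracy, library_accuracy)  # both are 5/6

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)
```

Here one sample of class 2 was predicted as class 1, so five of six predictions are correct.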

---

## Conclusion

Today, you learned about the **Gradient Boosting Classifier**, its workings, and how to implement and evaluate it. Advanced ensemble methods like this give you a significant edge in **NLP tasks**.

🚀 **Key Takeaways:**
- Gradient Boosting **iteratively improves predictions** by minimizing residual errors.
- **Learning rate** and **tree depth** must be carefully tuned to balance accuracy and overfitting.
- Our **Gradient Boosting Classifier achieved 98.5% accuracy** on the held-out test split of a three-category Reuters subset.

🔹 **Next Steps:**  
"**Theory without practice is empty.**" Sharpen your skills by experimenting with different **parameters**.  
For instance, try setting `learning_rate=0.2` and `n_estimators=150` to see the impact!  
Keep exploring, and **onwards to mastery!** 🚀



## Tuning Learning Rate for Gradient Boosting Classifier

Galactic Analyst, your next mission takes you into the depths of NLP to practice parameter tuning on a Gradient Boosting Classifier that classifies Reuters news articles. Set `learning_rate` to `0.2` and increase `n_estimators` to `150`. The balance matters: a higher learning rate speeds up learning, while more trees add model complexity. After the adjustments, retrain the model and check how its accuracy on the test set changes.

```python
import nltk
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters

categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gb_clf.fit(X_train, y_train)
y_pred = gb_clf.predict(X_test)

print("Accuracy: ", accuracy_score(y_test, y_pred))
```

The exercise asks you to tune the **learning rate** and **n_estimators** of the **Gradient Boosting Classifier**. Below is the updated Python code with the requested changes:

### 🔧 **Tuned Gradient Boosting Classifier**  
- **Increased `n_estimators` to 150** (more boosting iterations for refined learning).  
- **Increased `learning_rate` to 0.2** (faster learning, but requires monitoring for overfitting).  

```python
import nltk
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters

nltk.download('reuters', quiet=True)

# Load categories and documents
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Prepare text and labels
text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Feature extraction with CountVectorizer
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Tune and train the Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=150, learning_rate=0.2, max_depth=3)
gb_clf.fit(X_train, y_train)

# Predict and evaluate accuracy
y_pred = gb_clf.predict(X_test)
print("Tuned Accuracy: ", accuracy_score(y_test, y_pred))
```

### 📊 **Expected Outcome:**
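- Faster learning with `learning_rate=0.2`, at a higher risk of overfitting.  
- More iterations (`n_estimators=150`) give the ensemble more chances to reduce the remaining error.  
- Accuracy may improve, stay the same, or drop slightly; compare it against the baseline run and watch for a widening gap between training and test accuracy.  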

🚀 **Next Steps**: Try tuning **max_depth** and **subsample** to further optimize your model!
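
One way to explore `max_depth` and `subsample` systematically is a small cross-validated grid search. This is a sketch on synthetic data (an assumption so that it runs quickly and on its own); the grid values are illustrative, and on the Reuters features you would pass `X_train` and `y_train` instead.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)

# Candidate values are illustrative, not recommendations
param_grid = {"max_depth": [2, 3], "subsample": [0.7, 1.0]}

search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=50, learning_rate=0.2, random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the data passed to fit
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

Setting `subsample` below 1.0 trains each tree on a random fraction of the rows, which is why a `random_state` is fixed here.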

## Implementing and Training a Gradient Boosting Classifier

Stellar Navigator, are you ready for the next data exploration mission? Let's fill in the blank spots (____) in the code. You need to instantiate the GradientBoostingClassifier and fit the model to the training data. Good journey!

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using CountVectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Instantiate the GradientBoostingClassifier
gb_clf = ____(____=100, learning_rate=0.1, max_depth=3) 

# Train the classifier
gb_clf.____(____, ____)

# Make predictions
y_pred = gb_clf.predict(X_test)

# Calculate accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

Here’s your **completed Python code** with the missing parts filled in:

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:3]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using CountVectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Instantiate the GradientBoostingClassifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3) 

# Train the classifier
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred = gb_clf.predict(X_test)

# Calculate accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### 🔥 **Fixes & Explanation**:
1. **`GradientBoostingClassifier` instantiation:**
   ```python
   gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
   ```
   - **`n_estimators=100`** → Number of boosting stages.
   - **`learning_rate=0.1`** → Scales the contribution of each new tree.
   - **`max_depth=3`** → Depth of each decision tree.

2. **Fitting the model:**
   ```python
   gb_clf.fit(X_train, y_train)
   ```
   - `.fit()` trains the model using training data.

### 🚀 **Mission Accomplished!**
Now, your **Gradient Boosting Classifier** is fully functional! 🔥 Run it and analyze the accuracy. You can experiment with tuning `learning_rate` or `n_estimators` to optimize results.

## Setting Learning Rate and Making Predictions with GradientBoostingClassifier

Greetings, Space Explorer! Understanding how to control the learning speed while building a gradient-boosting model is vital. How do you achieve this? By tuning the learning rate, of course! Let's continue practicing. Insert code to set `learning_rate` to `0.1`, and then predict results using the trained model. Your journey awaits!

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

categories = reuters.categories()[:3]
documents = reuters.fileids(categories)
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Instantiate the GradientBoostingClassifier and set the learning rate
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=____, max_depth=5)

# Train the classifier
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred = ____.____(X_test)

# Print accuracy
print("Accuracy: ", ____(y_test, y_pred))
```

Here’s your **completed Python code** with the missing parts filled in:

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

categories = reuters.categories()[:3]
documents = reuters.fileids(categories)
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Instantiate the GradientBoostingClassifier and set the learning rate
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5)

# Train the classifier
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred = gb_clf.predict(X_test)

# Print accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### ✅ **Fixes & Explanation**:
1. **Setting the learning rate** → `learning_rate=0.1`
   ```python
   gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=5)
   ```
   - **`learning_rate=0.1`** → Controls the contribution of each tree in boosting.

2. **Making predictions**:
   ```python
   y_pred = gb_clf.predict(X_test)
   ```
   - Uses the trained model to predict on test data.

3. **Printing accuracy**:
   ```python
   print("Accuracy: ", accuracy_score(y_test, y_pred))
   ```
   - `accuracy_score(y_test, y_pred)` → Compares predicted vs. actual labels.

### 🚀 **Mission Accomplished!**
You’ve successfully set the **learning rate** and **predicted results** using the **Gradient Boosting Classifier**! 🎯 Try adjusting `learning_rate` (e.g., `0.05` or `0.2`) and observe how accuracy changes. 🚀
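
A simple way to run that experiment is a loop over candidate learning rates that records both training and test accuracy. The sketch below uses synthetic data (an assumption so that it runs on its own); with the Reuters features you would reuse the existing `X_train`, `X_test`, `y_train`, and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, split the same way as in the lesson
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

results = {}
for lr in (0.05, 0.1, 0.2):
    clf = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, max_depth=5, random_state=0
    )
    clf.fit(X_tr, y_tr)
    train_acc = accuracy_score(y_tr, clf.predict(X_tr))
    test_acc = accuracy_score(y_te, clf.predict(X_te))
    results[lr] = (train_acc, test_acc)
    print(f"learning_rate={lr}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A training accuracy far above the test accuracy is a sign that the setting overfits.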

## Building a Gradient Boosting Classifier Model

You're doing great, Celestial Traveler! Can you now construct the Gradient Boosting Classifier by setting `n_estimators` to `100`, `learning_rate` to `0.1`, and `max_depth` to `3`? Implement it in the code! Don't forget to test your model's performance with the prepared data.

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

categories = reuters.categories()[:3]
documents = reuters.fileids(categories)
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# TODO: Instantiate the GradientBoostingClassifier and set max_depth, learning rate and number of estimators

# Train the classifier
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred = gb_clf.predict(X_test)

# Print accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

Here’s your **completed code** with the necessary parameters set for the **Gradient Boosting Classifier**:

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

categories = reuters.categories()[:3]
documents = reuters.fileids(categories)
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

# Instantiate the GradientBoostingClassifier with specified parameters
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)

# Train the classifier
gb_clf.fit(X_train, y_train)

# Make predictions
y_pred = gb_clf.predict(X_test)

# Print accuracy
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### 🔍 **Breakdown of Updates:**
1. **Instantiating Gradient Boosting Classifier**:
   ```python
   gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
   ```
   - **`n_estimators=100`** → Number of boosting stages (trees).
   - **`learning_rate=0.1`** → Controls contribution of each tree.
   - **`max_depth=3`** → Limits complexity of individual trees.

2. **Training the Classifier**:
   ```python
   gb_clf.fit(X_train, y_train)
   ```
   - Trains the model on the training split of the **Reuters news dataset**.

3. **Making Predictions**:
   ```python
   y_pred = gb_clf.predict(X_test)
   ```
   - Uses trained model to classify test data.

4. **Testing Performance**:
   ```python
   print("Accuracy: ", accuracy_score(y_test, y_pred))
   ```
   - Computes **accuracy** to measure model performance.

### 🎯 **Mission Accomplished!** 🚀
Your **Gradient Boosting Classifier** is now set with `n_estimators=100`, `learning_rate=0.1`, and `max_depth=3`. 🎉 Try tweaking these values (e.g., increasing `n_estimators` or decreasing `learning_rate`) to observe how performance changes! 🚀
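
To see how performance evolves as trees are added without retraining for every value of `n_estimators`, scikit-learn offers `staged_predict`, which yields the ensemble's predictions after each boosting stage. The sketch below runs on synthetic data (an assumption so that it is self-contained); the variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

staged_clf = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0
)
staged_clf.fit(X_tr, y_tr)

# One accuracy value per boosting stage, from 1 tree up to all 100
staged_acc = [
    accuracy_score(y_te, stage_pred)
    for stage_pred in staged_clf.staged_predict(X_te)
]

for n_trees in (1, 10, 50, 100):
    print(f"{n_trees:>3} trees -> test accuracy {staged_acc[n_trees - 1]:.3f}")
```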

## Implementation of Gradient Boosting Classifier

Greetings, Cosmos Conqueror! We've tuned the parameters, trained, made predictions, and evaluated our Gradient Boosting Classifier. Now, let's consolidate all we've learned in this crucial task. Your mission, should you choose to accept it, is to craft a Gradient Boosting Classifier, train it, predict labels, and finally measure its accuracy on Reuters' news categories. All the best!

```python
# import required libraries
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from nltk.corpus import reuters
import nltk

categories = reuters.categories()[:3]
documents = reuters.fileids(categories)
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Preprocessing and feature extraction
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# TODO: Split your data into training and testing sets

# TODO: Instantiate the GradientBoostingClassifier with the parameters - n_estimators=150, learning_rate=0.15, max_depth=3 - and fit your model to the training set

# TODO: Make predictions

# Evaluating the model's performance
print("Accuracy: ", accuracy_score(y_test, y_pred))
```
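
One possible way to complete the three TODOs is sketched below. To keep the example runnable without downloading the corpus, it replaces the Reuters loading step with a tiny made-up corpus (`text_data` and `categories_data` here are stand-ins); the split, model parameters, and evaluation follow the exercise instructions. It is a sketch under those assumptions rather than an official solution.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in corpus; in the exercise these lists come from Reuters
text_data = [
    "wheat harvest exports rise",
    "grain wheat shipment delayed",
    "oil barrel prices fall",
    "crude oil output cut",
] * 10
categories_data = ["grain", "grain", "crude", "crude"] * 10

# Preprocessing and feature extraction (as in the exercise)
count_vectorizer = CountVectorizer(max_features=500)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# TODO 1: split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# TODO 2: instantiate the classifier with the requested parameters and fit it
gb_clf = GradientBoostingClassifier(n_estimators=150, learning_rate=0.15, max_depth=3)
gb_clf.fit(X_train, y_train)

# TODO 3: make predictions on the test set
y_pred = gb_clf.predict(X_test)

# Evaluating the model's performance
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)
```

The `test_size=0.3` and `random_state=1` values mirror the earlier examples in this lesson; the exercise itself does not prescribe them.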