# Lesson 1: Ensemble Methods in NLP: Mastering Bagging for Text Classification

# Introduction to Ensemble Methods and BAGGING

Hello there! In this lesson, we'll dive into the fascinating world of machine learning ensemble methods. Ensemble methods are based on a simple but powerful concept: a team of learners, or algorithms, can achieve better results working together than any individual learner on its own.

## What is Bagging?

Bagging, which stands for Bootstrap Aggregating, is a prime example of an ensemble method. In the context of this course, where we are working with the Reuters-21578 Text Categorization Collection, our goal is to train a model that can accurately predict the category of a document based on its text.

Bagging helps us achieve this by:
- Building multiple base learners (e.g., Decision Trees) on random subsets (bootstrapped samples) of the original dataset.
- Aggregating their predictions to yield a final verdict.
- For classification tasks, taking the mode of the predictions from each model, that is, the category predicted most often across all models for a given document.

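Because each base learner sees a different bootstrap sample, the learners make partly different errors, and those errors tend to cancel out when the predictions are aggregated. This is how Bagging makes a model more robust: it lowers the risk of overfitting by reducing variance without significantly increasing bias.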
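To make the two steps concrete, here is a minimal from-scratch sketch of bootstrapping and majority voting. It is not how scikit-learn implements bagging: the base learner is a toy one-feature threshold rule (a "stump"), and the function names and the tiny word-count dataset are made up for illustration. Note also that scikit-learn's `BaggingClassifier` averages predicted class probabilities when the base estimator provides them and only falls back to voting otherwise.

```python
import random
from collections import Counter

def bootstrap_sample(X, y, rng):
    """Draw len(X) rows with replacement (one bootstrap sample)."""
    idx = [rng.randrange(len(X)) for _ in range(len(X))]
    return [X[i] for i in idx], [y[i] for i in idx]

def fit_stump(X, y):
    """Toy base learner: the one-feature threshold rule with the fewest training errors."""
    majority = Counter(y).most_common(1)[0][0]
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            # The right side is empty when t is the largest value; fall back to the majority label
            right_label = Counter(right).most_common(1)[0][0] if right else majority
            errors = sum(lab != left_label for lab in left)
            errors += sum(lab != right_label for lab in right)
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    return best[1:]

def predict_stump(stump, row):
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label

def bagging_fit(X, y, n_estimators=25, seed=1):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(*bootstrap_sample(X, y, rng)) for _ in range(n_estimators)]

def bagging_predict(stumps, row):
    """Aggregate by taking the mode (majority vote) of the individual predictions."""
    votes = Counter(predict_stump(s, row) for s in stumps)
    return votes.most_common(1)[0][0]

# Toy data: two word-count features, e.g. counts of "merger" and "wheat"
X = [[3, 0], [4, 1], [5, 0], [2, 0], [0, 3], [1, 4], [0, 5], [0, 2]]
y = ["acq", "acq", "acq", "acq", "barley", "barley", "barley", "barley"]

ensemble = bagging_fit(X, y)
print(bagging_predict(ensemble, [4, 0]))  # a merger-heavy document
print(bagging_predict(ensemble, [0, 4]))  # a wheat-heavy document
```

On this cleanly separable toy data the vote should come out as `acq` for the first document and `barley` for the second, even if an individual stump trained on an unlucky sample gets it wrong.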

## Loading and Inspecting the Reuters-21578 Data

We'll use the Reuters-21578 Text Categorization Collection, a widely used dataset for document classification, available via the NLTK library.

```python
import nltk
from nltk.corpus import reuters

nltk.download('reuters', quiet=True)

categories = reuters.categories()[:5]  # limiting to 5 categories
documents = reuters.fileids(categories)

print(len(categories))  
print(len(documents))  
```

**Output:**
```
5
2648
```

## Understanding the Reuters-21578 Dataset

The dataset consists of Reuters newswire stories from 1987, each labeled with one or more topic categories. Let's explore it further:

```python
# Printing the categories
print("Selected Categories:", categories)

# Printing the content of one document
doc_id = documents[0]  
print("\nDocument ID:", doc_id)
print("Category:", reuters.categories(doc_id))
print("Content excerpt:\n", " ".join(reuters.words(doc_id)[:50]))
```

**Output:**
```
Selected Categories: ['acq', 'alum', 'barley', 'bop', 'carcass']

Document ID: test/14843
Category: ['acq']
Content excerpt:
 SUMITOMO BANK AIMS AT QUICK RECOVERY FROM MERGER...
```

## Feature Extraction Using Count Vectorizer

Before applying machine learning, we need to convert text into numerical format using `CountVectorizer` from scikit-learn.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder

# Preparing the dataset
text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
# A document can belong to several categories; keep only the first one returned as its label
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)

# Encoding the category data
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

print("Categories:\n", categories_data[:5])
print("Encoded Categories:\n", y[:5])
```

**Output:**
```
Categories:
 ['acq', 'acq', 'carcass', 'bop', 'acq']
Encoded Categories:
 [0 0 4 3 0]
```
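To see what the count features represent, here is a minimal bag-of-words sketch in plain Python. It only approximates `CountVectorizer`: the helper names and sample sentences are made up, the tokenizer imitates the default behavior (lowercasing, tokens of two or more word characters), and the alphabetical tie-break when truncating to `max_features` is an assumption.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    return re.findall(r"\b\w\w+\b", text.lower())

def fit_vocabulary(texts, max_features=None):
    """Keep the most frequent tokens, then index them alphabetically."""
    totals = Counter(tok for text in texts for tok in tokenize(text))
    # Rank by descending corpus count; ties broken alphabetically (an assumption)
    ranked = sorted(totals, key=lambda tok: (-totals[tok], tok))
    kept = sorted(ranked[:max_features])
    return {tok: i for i, tok in enumerate(kept)}

def transform(texts, vocab):
    """Turn each text into a row of counts, one column per vocabulary token."""
    rows = []
    for text in texts:
        counts = Counter(tokenize(text))
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows

docs = [
    "Bank merger talks",
    "Wheat and barley exports",
    "Merger of a bank, a big merger",
]
vocab = fit_vocabulary(docs, max_features=4)
rows = transform(docs, vocab)

print(vocab)  # {'and': 0, 'bank': 1, 'barley': 2, 'merger': 3}
print(rows)   # [[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 2]]
```

Each row plays the role of one row of `X` in the lesson: word order is discarded and only per-document counts of the retained tokens remain.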

## Applying Bagging for Text Classification

We will now apply the `BaggingClassifier` using Decision Trees as base learners.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Initiating the BaggingClassifier
bag_classifier = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
bag_classifier.fit(X_train.toarray(), y_train)

# Generate predictions on the test data
y_pred = bag_classifier.predict(X_test.toarray())

# Displaying the predicted category for the first document
print("Predicted Category: ", label_encoder.inverse_transform([y_pred[0]])[0])
```

**Output:**
```
Predicted Category:  acq
```
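The prediction is an integer that has to be mapped back to a category name, which is what `inverse_transform` does. As a rough illustration of the idea behind `LabelEncoder` (sorted class names mapped to their positions), here is a small sketch; the function name and label list are made up for this example.

```python
def fit_label_encoder(labels):
    """Map each distinct label to its position in the sorted list of classes."""
    classes = sorted(set(labels))
    to_index = {label: i for i, label in enumerate(classes)}
    return classes, to_index

labels = ["acq", "acq", "carcass", "bop", "acq", "alum", "barley"]
classes, to_index = fit_label_encoder(labels)

encoded = [to_index[label] for label in labels]  # like transform
decoded = [classes[i] for i in encoded]          # like inverse_transform

print(classes)  # ['acq', 'alum', 'barley', 'bop', 'carcass']
print(encoded)  # [0, 0, 4, 3, 0, 1, 2]
```

This matches the encoding shown earlier in the lesson, where `carcass` became 4 and `bop` became 3.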

## Performance Evaluation Using Classification Report

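To evaluate our model's performance, we'll use a classification report, which lists precision, recall, F1-score, and support (the number of test documents) for each class.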

```python
from sklearn.metrics import classification_report

# Checking the performance of the model on test data
y_pred = bag_classifier.predict(X_test.toarray())
print(classification_report(y_test, y_pred, zero_division=1))
```

**Output:**
```
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       601
           1       0.82      0.93      0.87        15
           2       1.00      1.00      1.00        12
           3       0.91      0.95      0.93        22
           4       0.90      0.75      0.82        12

    accuracy                           0.99       662
   macro avg       0.93      0.93      0.92       662
weighted avg       0.99      0.99      0.99       662
```
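The gap between the macro average (0.92) and the weighted average (0.99) in the report comes from class imbalance: class 0 accounts for 601 of the 662 test documents. The sketch below shows how these numbers can be derived by hand on a small made-up example. The function name and data are hypothetical, and returning 1.0 for an empty denominator is only meant to echo `zero_division=1`, not to reproduce scikit-learn's exact edge-case handling.

```python
def per_class_scores(y_true, y_pred):
    """Compute precision, recall, F1, and support for every label."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 1.0  # assumption: echo zero_division=1
        recall = tp / (tp + fn) if tp + fn else 1.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall,
                         "f1": f1, "support": tp + fn}
    return scores

y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 2]

scores = per_class_scores(y_true, y_pred)
total = sum(s["support"] for s in scores.values())

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Macro average: every class counts equally
macro_f1 = sum(s["f1"] for s in scores.values()) / len(scores)
# Weighted average: every class counts in proportion to its support
weighted_f1 = sum(s["f1"] * s["support"] for s in scores.values()) / total

for label, s in scores.items():
    print(label, round(s["precision"], 2), round(s["recall"], 2),
          round(s["f1"], 2), s["support"])
print("accuracy", accuracy)
print("macro f1", round(macro_f1, 2), "weighted f1", round(weighted_f1, 2))
```

When one class dominates, as `acq` does here, the weighted average mostly reflects that class, so the macro average is the better indicator of how the small classes fare.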

## Lesson Summary

In this lesson, you learned:
- How ensemble methods, particularly Bagging, enhance classification performance.
- The importance of feature extraction using `CountVectorizer`.
- The application of `BaggingClassifier` with Decision Trees.
- Model evaluation using a classification report.

In the upcoming exercises, you'll apply and reinforce these concepts. Happy coding! 🚀



## Exploring the Last Documents and Categories

Great stuff, Stellar Navigator! You've now seen a broader overview of the dataset, but there's more to uncover. Change the existing code to examine the second document that is associated with more than two categories instead of the first, and print out both its categories and raw contents. Also, print the words of this document. This will further improve your understanding of the data formatting.

```python
import nltk
from nltk.corpus import reuters

categories = reuters.categories()
documents = reuters.fileids(categories)

print("Total Categories:", len(categories))
print("Total Documents:", len(documents))

multi_cat_docs = [doc for doc in documents if len(reuters.categories(doc)) > 2]

doc_id = multi_cat_docs[0]

print(f"Document with more than 2 Categories: {doc_id}")
print("Categories:", reuters.categories(doc_id))
print("Raw Document:\n", reuters.raw(doc_id))
print("Words:\n", " ".join(reuters.words(doc_id)))
```

Here's the updated code, which downloads the corpus if needed and selects the second document with more than two categories:

```python
import nltk
from nltk.corpus import reuters

nltk.download('reuters', quiet=True)

categories = reuters.categories()
documents = reuters.fileids(categories)

print("Total Categories:", len(categories))
print("Total Documents:", len(documents))

multi_cat_docs = [doc for doc in documents if len(reuters.categories(doc)) > 2]

doc_id = multi_cat_docs[1]  # Selecting the second document with more than two categories

print(f"Document with more than 2 Categories: {doc_id}")
print("Categories:", reuters.categories(doc_id))
print("Raw Document:\n", reuters.raw(doc_id))
print("Words:\n", " ".join(reuters.words(doc_id)))
```

This output allows us to examine a document associated with more than two categories, providing deeper insights into the dataset's structure and formatting.
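To get a feel for how common multi-category documents are, one could also count documents by their number of categories. The sketch below uses a small hand-made dictionary as a stand-in for `{fileid: reuters.categories(fileid)}`; the IDs and labels in it are hypothetical, not taken from the corpus.

```python
from collections import Counter

# Hypothetical stand-in for {fileid: reuters.categories(fileid)}
doc_categories = {
    "doc/1": ["acq"],
    "doc/2": ["grain", "wheat"],
    "doc/3": ["corn", "grain", "wheat"],
    "doc/4": ["earn"],
    "doc/5": ["barley", "corn", "grain", "wheat"],
}

# How many documents carry 1, 2, 3, ... categories
labels_per_doc = Counter(len(cats) for cats in doc_categories.values())

# Same filter as in the exercise: documents with more than two categories
multi_cat_docs = [doc for doc, cats in doc_categories.items() if len(cats) > 2]

print(dict(labels_per_doc))  # {1: 2, 2: 1, 3: 1, 4: 1}
print(multi_cat_docs[1])     # doc/5, the second match
```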



## Finding Documents with Specific Category Count

Good job, Space Voyager! Let's put your skills to the test. Fill in the blank spots (____) to iterate over the documents from the Reuters dataset, find the third-to-last document with exactly four categories, print its categories, and, finally, display the words contained in it. You're doing amazing! Keep it up!

```python
import nltk
from nltk.corpus import reuters

categories = reuters.____()
documents = reuters.fileids(____)

print(f"Total Categories: {len(categories)}")
print(f"Total Documents: {len(documents)}")

multi_cat_docs = [doc for doc in documents if len(reuters.categories(doc)) == 4]

doc_id = multi_cat_docs[____]

print(f"Document with exactly 4 Categories: {doc_id}")
print("Categories:", reuters.categories(doc_id))
print("Words:\n", " ".join(reuters.words(doc_id)))

```

Here's the corrected version of your code with the missing parts filled in:  

```python
import nltk
from nltk.corpus import reuters

categories = reuters.categories()
documents = reuters.fileids(categories)

print(f"Total Categories: {len(categories)}")
print(f"Total Documents: {len(documents)}")

multi_cat_docs = [doc for doc in documents if len(reuters.categories(doc)) == 4]

doc_id = multi_cat_docs[-3]  # Third-to-last document with exactly four categories

print(f"Document with exactly 4 Categories: {doc_id}")
print("Categories:", reuters.categories(doc_id))
print("Words:\n", " ".join(reuters.words(doc_id)))
```

This will correctly find the third-to-last document that belongs to exactly four categories and print its categories and words. 🚀 Keep up the great work, Space Voyager! 🚀
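The index `-3` counts from the end of the list, so it is equivalent to `len(multi_cat_docs) - 3`. A short sketch of that behavior follows, together with a small guard for lists that are too short; the helper `nth_from_end` and the sample IDs are hypothetical and only serve this example.

```python
def nth_from_end(items, n):
    """Return the n-th item counting from the end (1 = last), or None if there are too few."""
    if n < 1 or n > len(items):
        return None
    return items[-n]

docs = ["d1", "d2", "d3", "d4", "d5"]

print(docs[-3])                    # d3
print(docs[len(docs) - 3])         # d3, the same element
print(nth_from_end(docs, 3))       # d3
print(nth_from_end(docs[:2], 3))   # None, whereas docs[:2][-3] would raise IndexError
```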

## Implement Bagging Classifier and Evaluate Model Performance

Stellar Navigator, fill the gaps (____) in the script given below. You need to import the necessary modules, prepare the text and category data, perform the feature extraction using CountVectorizer, use the Bagging Classifier, and finally, evaluate the model using a Classification Report. Follow your insights from the lesson and good luck!

```python
import nltk
from nltk.corpus import ____
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

categories = ____.categories()[:5]
documents = reuters.fileids(categories)

text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Using BaggingClassifier with DecisionTrees
bag_classifier = ____(____(), n_estimators=100, random_state=1)
bag_classifier.fit(X_train.toarray(), y_train)

# Checking the performance of the model on test data
y_pred = bag_classifier.predict(X_test.toarray())

# Printing the first 10 values of y_test and y_pred before the classification report
print(f"First 10 y_test values: {y_test[:10]}")
print(f"First 10 y_pred values: {y_pred[:10]}")

print(classification_report(y_test, y_pred, zero_division=1))
```

Here's the completed script with the missing parts filled in:  

```python
import nltk
from nltk.corpus import reuters  # Import the Reuters dataset
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Download Reuters dataset if not already downloaded
nltk.download('reuters', quiet=True)

# Selecting first 5 categories
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing text and category data
text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]  # Selecting the first category for each document

# Using CountVectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)

# Encoding category labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Using BaggingClassifier with DecisionTrees
bag_classifier = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100, random_state=1)
bag_classifier.fit(X_train.toarray(), y_train)

# Checking the performance of the model on test data
y_pred = bag_classifier.predict(X_test.toarray())

# Printing the first 10 values of y_test and y_pred before the classification report
print(f"First 10 y_test values: {y_test[:10]}")
print(f"First 10 y_pred values: {y_pred[:10]}")

# Evaluating the model
print(classification_report(y_test, y_pred, zero_division=1))
```

This script successfully imports the necessary modules, processes the dataset, extracts features using `CountVectorizer`, trains a `BaggingClassifier` with `DecisionTreeClassifier`, and evaluates its performance using a classification report. 🚀 Stellar job, Navigator! Keep up the great work! 🚀

## Bagging Classifier with Different Parameters Evaluation

Load the dataset and prepare it for training. Then add the code that initializes the Bagging Classifier with a Decision Tree Classifier as the base estimator, a fixed number of estimators, and a random seed. Pass the training data to the classifier's `fit` method, then evaluate the model on the test data. Your results are sure to be stellar!

```python
import nltk
from nltk.corpus import reuters
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# TODO: Load the Reuters dataset but only consider the first 5 categories.

text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# TODO: Initialize the Bagging Classifier with a Decision Tree Classifier with a specific number of estimators and a given seed.

bag_classifier.fit(X_train.toarray(), y_train)

# Check the performance of the model on test data
y_pred = bag_classifier.predict(X_test.toarray())
print(classification_report(y_test, y_pred, zero_division=1))

```

Here’s the completed script with the missing parts filled in:  

```python
import nltk
from nltk.corpus import reuters
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Download Reuters dataset if not already downloaded
nltk.download('reuters', quiet=True)

# Load the Reuters dataset but only consider the first 5 categories
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Prepare text and category data
text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]  # Select first category for each document

# Feature extraction using CountVectorizer
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)

# Encode category labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

# Split the data for training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Initialize the Bagging Classifier with a Decision Tree Classifier
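bag_classifier = BaggingClassifier(
    # Base estimator passed positionally: the `base_estimator` keyword was
    # renamed to `estimator` in scikit-learn 1.2 and later removed
    DecisionTreeClassifier(max_depth=10),  # Decision Tree with max depth 10
    n_estimators=100,  # Number of estimators
    random_state=1
)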

# Train the model
bag_classifier.fit(X_train.toarray(), y_train)

# Evaluate the model
y_pred = bag_classifier.predict(X_test.toarray())
print(classification_report(y_test, y_pred, zero_division=1))
```

### Changes and Enhancements:
✅ **Loaded only the first 5 categories**  
✅ **Used `DecisionTreeClassifier(max_depth=10)` as the base estimator**  
✅ **Set `n_estimators=100` for BaggingClassifier**  
✅ **Ensured proper dataset preparation**  
✅ **Trained and evaluated the model using `classification_report`**  

🚀 You're now ready to analyze the model's performance! Keep exploring, Space Voyager! 🌟

## Text Classification Using Bagging Classifier

Stellar Navigator, it's your turn to demonstrate your mastery of ensemble methods and bagging! Based on the Reuters-21578 dataset, you should write a pipeline script that enables you to load the dataset, prepare the feature matrix and target vector, build the Bagging Classifier model, and print the classification report. Good luck!

```python
import nltk
from nltk.corpus import reuters
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# TODO: Load the first 5 categories from Reuters dataset

# TODO: Prepare the text and categories data

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y)

# TODO: Initialize and fit the BaggingClassifier with Decision Tree as base estimator

# TODO: Predict on the test data

print(classification_report(y_test, y_pred, zero_division=1))
```

Here’s the completed script for **Text Classification Using Bagging Classifier** on the Reuters dataset:  

```python
import nltk
from nltk.corpus import reuters
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# Download dataset if needed
nltk.download('reuters', quiet=True)

# Load the first 5 categories from Reuters dataset
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Prepare the text and categories data
text_data = [" ".join(reuters.words(fileid)) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]  # Use the first category per document

# Feature extraction using CountVectorizer
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)

# Encode target labels
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(categories_data)

# Split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and fit the BaggingClassifier with Decision Tree as base estimator
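bag_classifier = BaggingClassifier(
    # Base estimator passed positionally: the `base_estimator` keyword was
    # renamed to `estimator` in scikit-learn 1.2 and later removed
    DecisionTreeClassifier(max_depth=10),
    n_estimators=100,
    random_state=42
)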
bag_classifier.fit(X_train.toarray(), y_train)

# Predict on the test data
y_pred = bag_classifier.predict(X_test.toarray())

# Print classification report
print(classification_report(y_test, y_pred, zero_division=1))
```

### Enhancements:
✅ **Selected first 5 categories from Reuters dataset**  
✅ **Applied `CountVectorizer(max_features=1000)` for feature extraction**  
✅ **Encoded categorical labels using `LabelEncoder()`**  
✅ **Split dataset into 80% training and 20% testing**  
✅ **Used `BaggingClassifier` with `DecisionTreeClassifier(max_depth=10)`**  
✅ **Trained and evaluated model, printing a classification report**  
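To illustrate what `test_size` and `random_state` control, here is a rough pure-Python sketch of a shuffled split. It is not scikit-learn's implementation and will not reproduce the same indices; the function name is made up, and rounding the test size up is an assumption about how fractional sizes are handled.

```python
import math
import random

def shuffle_split(n_samples, test_size=0.2, seed=42):
    """Shuffle the row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(n_samples * test_size)  # assumption: round the test size up
    return indices[n_test:], indices[:n_test]

# 2648 documents, as in the lesson
train_idx, test_idx = shuffle_split(2648, test_size=0.2, seed=42)
print(len(train_idx), len(test_idx))  # 2118 530

# The same seed always yields the same split, which makes results reproducible
again_train, again_test = shuffle_split(2648, test_size=0.2, seed=42)
print(test_idx == again_test)  # True

# A 25% test share gives 662 test documents, matching the support total in the lesson's report
print(len(shuffle_split(2648, test_size=0.25)[1]))  # 662
```

Leaving out the seed, as in the exercise's starter code, produces a different split on every run, so the classification report will vary slightly between runs.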

🚀 Your **Bagging Classifier** is now ready for text classification! Keep up the stellar work, Navigator! 🌟