# Lesson 2: Ensemble Methods in NLP: Mastering the Voting Classifier

Hello, and welcome back! In this lesson, we will dive into Ensemble Modeling with a focus on the Voting Classifier. The Voting Classifier is a powerful concept that takes advantage of the strengths of multiple classifiers to yield more robust and accurate predictions. If you're ready to take your understanding of Machine Learning (ML) modeling to new heights, this lesson is definitely for you.

## Data Preparation Revisited

Before we start exploring ensemble models, let's revisit the process of preparing our data for machine learning. We start by obtaining our dataset, then move through feature extraction and label encoding, and finally partition the data into training and test sets.

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

nltk.download('reuters', quiet=True)

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
```

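This block handles all of the preprocessing we need before modeling. It loads the documents from the first five Reuters categories, uses each document's first listed category as its label, converts the text into word-count vectors limited to the 1,000 most frequent terms, encodes the category names as integers, and splits the result into training and test sets (75% and 25% by default).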
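To make the feature extraction and label encoding steps more concrete, here is a minimal pure-Python sketch of the same idea on a toy corpus. The helper names (`build_vocabulary`, `count_vectorize`, `encode_labels`) and the toy documents are illustrative assumptions, not part of the lesson's code, and the whitespace tokenization is a simplification of what `CountVectorizer` actually does.

```python
from collections import Counter

def build_vocabulary(docs, max_features=None):
    """Keep the most frequent lowercase tokens and index them alphabetically."""
    counts = Counter(token for doc in docs for token in doc.lower().split())
    kept = [token for token, _ in counts.most_common(max_features)]
    return {token: idx for idx, token in enumerate(sorted(kept))}

def count_vectorize(docs, vocabulary):
    """Turn each document into a list of token counts, one slot per vocabulary entry."""
    rows = []
    for doc in docs:
        row = [0] * len(vocabulary)
        for token in doc.lower().split():
            if token in vocabulary:
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows

def encode_labels(labels):
    """Map each distinct label to an integer, in sorted order."""
    classes = sorted(set(labels))
    lookup = {label: idx for idx, label in enumerate(classes)}
    return [lookup[label] for label in labels], classes

# Hypothetical toy data, standing in for the Reuters documents
toy_docs = ["wheat prices rise", "corn prices fall", "wheat wheat harvest"]
toy_labels = ["wheat", "corn", "wheat"]

vocab = build_vocabulary(toy_docs)
X_toy = count_vectorize(toy_docs, vocab)
y_toy, classes = encode_labels(toy_labels)

print(vocab)
print(X_toy)
print(y_toy, classes)
```

Each row of `X_toy` plays the role of one row of `X` in the lesson, and `y_toy` plays the role of `y`.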

## Constructing the Voting Classifier

In our case, we employ the Voting Classifier for ensemble modeling. The `VotingClassifier` in `sklearn` is a meta-estimator: it fits several base machine learning models on the same dataset and combines their outputs to predict class labels. With hard voting, which we use here, it follows the majority vote principle, meaning the predicted class label for a given sample is the label that collects the most votes from the individual classifiers.

```python
# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1)
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting='hard')
```

Here, we first create three separate classifiers: Logistic Regression, Support Vector Machine, and Decision Tree. We then combine them in a single Voting Classifier.
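The majority vote principle can be sketched in a few lines of plain Python. This is an illustration of the idea rather than scikit-learn's implementation. The `hard_vote` helper and the vote lists are hypothetical, and ties are broken in favor of the smallest label, which mirrors the ascending-order tie-break described in the scikit-learn documentation.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the label with the most votes; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)

# Each inner list holds the labels predicted by three hypothetical models
# (for example lr, svc, dt) for a single sample.
per_sample_votes = [
    [0, 0, 2],  # two of three models agree on class 0
    [1, 2, 2],  # two of three models agree on class 2
    [0, 1, 2],  # three-way tie, resolved to the smallest label
]

ensemble_predictions = [hard_vote(votes) for votes in per_sample_votes]
print(ensemble_predictions)
```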

## Further Insight into the Classifier

Exploring the parameters of the Voting Classifier can significantly enhance your modeling strategy. Here's a focused rundown:

- **estimators**: A list of `(name, estimator)` tuples defining the base classifiers in the ensemble. Combining different kinds of models helps capture a broader range of patterns in the data.
- **voting**: Determines how the final prediction is made. `'hard'` voting uses a majority vote over the predicted labels, while `'soft'` voting averages the predicted class probabilities and picks the class with the highest average. Soft voting tends to work best when the classifiers provide well-calibrated probabilities.
- **weights**: Optional per-classifier weights. They scale each model's vote (hard voting) or predicted probabilities (soft voting), which is especially useful when some models are more trustworthy than others.

Mastering these parameters allows for tailored model adjustments, leading to more accurate and robust ensemble classifiers for your text classification tasks.
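As a rough illustration of how `voting='soft'` and `weights` interact, the sketch below computes a weighted average of class probabilities by hand. The `soft_vote` helper and the probability values are made up for this example. The weights `[1, 1, 2]` double the influence of the third model.

```python
def soft_vote(probabilities, weights=None):
    """Average per-model class probabilities (optionally weighted) and pick the best class.

    probabilities: one list of class probabilities per model, all the same length.
    """
    if weights is None:
        weights = [1] * len(probabilities)
    total = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged

# Hypothetical probabilities from three models for one sample (classes 0 and 1)
model_probs = [[0.7, 0.3], [0.6, 0.4], [0.3, 0.7]]

equal_label, equal_avg = soft_vote(model_probs)
weighted_label, weighted_avg = soft_vote(model_probs, weights=[1, 1, 2])

print(equal_label, equal_avg)        # unweighted average favors class 0
print(weighted_label, weighted_avg)  # doubling the third model's weight favors class 1
```

With equal weights the averaged probabilities favor class 0, while giving the third model twice the weight tips the decision to class 1.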

## Model Training and Prediction

Training our model involves fitting it to the data using the `.fit()` method. It helps our model learn the underlying relationships in our dataset. We then test our model's learning effectiveness by making predictions using the `.predict()` method.

```python
# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())
```

In this code snippet, `voting_clf.fit()` trains a clone of each base model on the training data. `voting_clf.predict()` then collects every model's prediction for each test sample and returns the label chosen by the majority. The `.toarray()` calls convert the sparse count matrix into a dense array before it is passed to the models.

## Model Evaluation

After fitting the model to the data and making predictions with it, it's essential to determine how well it has performed. For classification tasks, accuracy is a common and important metric. It quantifies the ratio of correct predictions to total predictions.

```python
# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

The output will be something like:

```
Accuracy:  0.9803625377643505
```

A score of roughly 0.98 means the Voting Classifier labeled about 98% of the test documents correctly. Keep in mind that accuracy alone can look flattering when one class dominates the dataset, so it is worth checking the class distribution before drawing strong conclusions.
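For reference, accuracy is simple enough to compute by hand. The sketch below uses hypothetical helper names and toy labels. It also computes the accuracy of always predicting the most common class, which gives a baseline to compare any reported score against.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Hypothetical labels for five test samples
y_true_toy = [0, 1, 2, 2, 1]
y_pred_toy = [0, 1, 2, 1, 1]

score = accuracy(y_true_toy, y_pred_toy)
baseline = majority_baseline(y_true_toy)
print("Accuracy:", score)
print("Majority-class baseline:", baseline)
```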

## Lesson Summary

Congratulations on completing this lesson! You've mastered the concept of Ensemble Modeling with a focus on the Voting Classifier in Python. You've also learned how to:

- Prepare text data
- Split it into training and testing sets
- Combine multiple classifiers into a Voting Classifier
- Train the ensemble model
- Evaluate its performance

As always, practice is a vital part of learning. Don't miss the upcoming activities, which are designed to reinforce your understanding of ensemble methods in Python through hands-on practice.


## Switch to Soft Voting in Classifier Ensemble

Great work, Space Explorer! Now, I want you to make a change in the Python script. In the existing code, you are using the hard voting strategy. I would like you to change it to soft voting. Don't forget to set `probability=True` in the `SVC` classifier, as soft voting requires `predict_proba` support. Run the script and compare the accuracy with the previous result.

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1)
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting='hard')

# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))

```

I've updated the script to use soft voting and set `probability=True` for the SVC classifier. Here’s the corrected version:

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)  # Set probability=True for soft voting
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with soft voting
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting='soft')  # Changed to soft voting

# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### Changes Made:
1. **Switched from "hard" to "soft" voting in `VotingClassifier`**.
2. **Set `probability=True` in the `SVC` classifier**, as required for soft voting.

Run the script and compare the accuracy with the previous result. Let me know if you need any further modifications! 🚀
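To see why the two strategies can give different answers, here is a small standalone sketch with made-up probabilities for a single sample. It is only an illustration of the idea, not scikit-learn's internal code. Two models lean slightly toward class 0 while the third is very confident in class 1.

```python
from collections import Counter

def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=values.__getitem__)

# Hypothetical class probabilities from three models for one sample
model_probs = [
    [0.55, 0.45],  # model 1: slight preference for class 0
    [0.55, 0.45],  # model 2: slight preference for class 0
    [0.05, 0.95],  # model 3: strong preference for class 1
]

# Hard voting: each model casts one vote for its most likely class
hard_votes = [argmax(p) for p in model_probs]
hard_label = Counter(hard_votes).most_common(1)[0][0]

# Soft voting: average the probabilities per class, then pick the largest
mean_probs = [sum(column) / len(model_probs) for column in zip(*model_probs)]
soft_label = argmax(mean_probs)

print("Hard voting picks class", hard_label)
print("Soft voting picks class", soft_label, "with averages", mean_probs)
```

Hard voting counts two votes for class 0, while soft voting lets the third model's confidence outweigh the other two. This is why the accuracy may change when you switch strategies.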


## Implementing and Training a Voting Classifier

Stellar effort, Space Voyager! Now it's time to strengthen your ensemble modeling skills. In the code provided, there are some parts related to the Voting Classifier that are missing. Fill in the blanks (____) with the appropriate code to instantiate the Voting Classifier correctly and train it on the preprocessed dataset.

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = VotingClassifier(
    estimators=[('lr', ____), ('svc', ____), ('dt', ____)],
    voting=____)

# Training the voting classifier on the training data
voting_clf.____(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = ____.___(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))

```

Here’s the corrected version of your script with the missing parts filled in:

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting='soft')  # Soft voting: average the predicted class probabilities

# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### **Missing Parts Filled:**
1. **`('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)`** → Filled in the classifier instances.
2. **`voting='soft'`** → Soft voting works here because `SVC` was created with `probability=True`, which enables `predict_proba()`.
3. **`voting_clf.fit(X_train.toarray(), y_train)`** → Training the classifier.
4. **`y_pred = voting_clf.predict(X_test.toarray())`** → Making predictions.

Now you can run the script and evaluate its performance. Let me know if you need further improvements! 🚀

## Incorporating Soft Voting in Ensemble Classifier Model

Impressive work so far, Galactic Pioneer!

In our journey of working with Ensemble Classifiers, this exercise will focus on the 'soft' voting strategy. The provided code has a VotingClassifier instance set up with several base models, but there are some missing pieces. Fill in the blanks (____) to correctly set the VotingClassifier's voting strategy to 'soft' and implement the model training using the training datasets. Run the script to train your VotingClassifier to make correct predictions.

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = ____(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting=____,
    weights=[1, 1, 2])

# Training the voting classifier on the training data
____.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

Here’s the completed script with the missing parts correctly filled in:

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)  # probability=True is required for soft voting
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting='soft',  # Using soft voting
    weights=[1, 1, 2])  # Giving higher weight to Decision Tree

# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### **Filled Missing Parts:**
1. **`VotingClassifier`** → The correct class for ensemble learning.
2. **`voting='soft'`** → Soft voting averages the predicted class probabilities (here weighted `[1, 1, 2]`) instead of counting votes.
3. **`voting_clf.fit(X_train.toarray(), y_train)`** → Training the ensemble classifier.

Now, you can run the script and compare the accuracy with the previous hard voting approach. 🚀 Let me know if you need any modifications!

## Creating the Voting Classifier Model

Impressive work, Stellar Navigator!

This part will directly test your understanding of creating the Voting Classifier model. Fill in the TODO section to create the Voting Classifier model with 'hard' voting, and run the code to make your model ready for training! You've got this!

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)
dt_clf = DecisionTreeClassifier(random_state=1)

# TODO: Create a voting classifier with these models

# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))

```

Here’s the completed script with the missing part correctly filled in:

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# Using count vectorizer for feature extraction
count_vectorizer = CountVectorizer(max_features=1000)
X = count_vectorizer.fit_transform(text_data)
y = LabelEncoder().fit_transform(categories_data)

# Split the data for train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Building multiple classification models
log_clf = LogisticRegression(solver="liblinear")
svm_clf = SVC(gamma="scale", random_state=1, probability=True)
dt_clf = DecisionTreeClassifier(random_state=1)

# Creating a voting classifier with these models
voting_clf = VotingClassifier(
    estimators=[('lr', log_clf), ('svc', svm_clf), ('dt', dt_clf)],
    voting='hard'  # Using hard voting strategy
)

# Training the voting classifier on the training data
voting_clf.fit(X_train.toarray(), y_train)

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))
```

### **Filled Missing Parts:**
1. **Created `VotingClassifier`** → `voting_clf = VotingClassifier(...)`
2. **Set `voting='hard'`** → Hard voting strategy for ensemble learning.

Now, run the script and test the accuracy! 🚀 Let me know if you need further modifications. 🎯

## Building a Soft Voting Classifier from Scratch

Fantastic work, Space Voyager! Now, let's wrap up what we have learned about Ensemble Modeling with one final coding task. You will write a voting classifier using the 'soft' voting strategy from scratch. Make sure to apply everything you have learned from our previous exercises. You can do it!

```python
# Import required libraries
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from nltk.corpus import reuters
import nltk

# Limiting the data for quick execution
categories = reuters.categories()[:5]
documents = reuters.fileids(categories)

# Preparing the dataset
text_data = [" ".join([word for word in reuters.words(fileid)]) for fileid in documents]
categories_data = [reuters.categories(fileid)[0] for fileid in documents]

# TODO: Use count vectorizer for feature extraction
# TODO: Split the data for train and test
# TODO: Build multiple classification models
# TODO: Create a voting classifier with these models
# TODO: Train the voting classifier on the training data

# Predicting the labels of the test set
y_pred = voting_clf.predict(X_test.toarray())

# Checking the performance of the model on test data
print("Accuracy: ", accuracy_score(y_test, y_pred))

```