# Unit 5: Deciphering Model Accuracy with the Confusion Matrix in NLP

# Understanding and Using the Confusion Matrix in Machine Learning

## Introduction

When building machine learning models, especially classifiers, evaluating their performance is vital. Take our SMS spam filter as an example: the model classifies each message as spam or non-spam.

Accuracy may look like the natural measure of a model's performance, but it can mislead, especially when the classes are heavily unbalanced: a model that favors the majority class can score high accuracy while rarely detecting the minority class. This brings us to the **Confusion Matrix**, a detailed performance assessment tool for classification models.
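
To see why accuracy alone can mislead on unbalanced data, here is a minimal sketch with made-up numbers. The 95/5 class split and the "always predict non-spam" baseline are illustrative assumptions, not results from our dataset:

```python
# Hypothetical, made-up test set: 95 non-spam (0) and 5 spam (1) messages
y_true = [0] * 95 + [1] * 5

# A useless baseline "model" that labels every message as non-spam
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Number of spam messages the baseline actually caught
spam_caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

print(f"Accuracy: {accuracy:.2f}")    # Accuracy: 0.95
print(f"Spam caught: {spam_caught}")  # Spam caught: 0
```

The baseline scores 95% accuracy without catching a single spam message, which is exactly the kind of failure a confusion matrix makes visible.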

## The Confusion Matrix

As the name suggests, a confusion matrix helps us see where the model is "confused" when classifying instances. It's a simple 2x2 matrix (for binary classification problems like ours) that contrasts the predicted values with the actual values and displays the results in four quadrants:

* **True Positives (TP)**: Instances correctly identified as positive (spam in our case)
* **True Negatives (TN)**: Instances correctly identified as negative (non-spam)
* **False Positives (FP)**: Negative instances incorrectly identified as positive (non-spam messages classified wrongly as spam)
* **False Negatives (FN)**: Positive instances incorrectly identified as negative (spam messages that got through the filter)
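
The four definitions above can be sketched by counting them by hand. This is a minimal illustration on hypothetical toy labels (1 = spam, 0 = non-spam), not on the SMS dataset:

```python
# Hypothetical toy labels: 1 = spam (positive), 0 = non-spam (negative)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(t == 1 and p == 1 for t, p in pairs)  # spam, predicted spam
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-spam, predicted non-spam
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-spam, predicted spam
fn = sum(t == 1 and p == 0 for t, p in pairs)  # spam, predicted non-spam

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")  # TP=3 TN=3 FP=1 FN=1
```

Every instance falls into exactly one of the four groups, so the counts always sum to the number of instances.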

To better understand our classifier's performance, let's jump in and generate a confusion matrix.

## Using the Confusion Matrix in Python

Thankfully, Python's popular machine learning library, Scikit-learn, provides a ready-to-use function, `confusion_matrix`, to generate this matrix. It requires two arguments, in this order: the true class labels and the predicted class labels.
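
As a small sketch of the call on its own, the example below uses hypothetical toy labels; the label names `"ham"` and `"spam"` are assumptions for illustration and may differ from the labels in your dataset. The optional `labels` argument fixes the order of rows and columns:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical toy labels; "ham" (non-spam) and "spam" are assumed names
y_true = ["ham", "ham", "spam", "spam", "ham", "spam"]
y_pred = ["ham", "spam", "spam", "ham", "ham", "spam"]

# Rows are actual classes, columns are predicted classes,
# both in the order given by `labels`
toy_mat = confusion_matrix(y_true, y_pred, labels=["ham", "spam"])

print(toy_mat)
# [[2 1]
#  [1 2]]
```

With this label order, the first row holds the actual non-spam messages (2 true negatives, 1 false positive) and the second row the actual spam messages (1 false negative, 2 true positives).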

As in the previous lessons, our model is a Multinomial Naive Bayes classifier already trained to separate spam from non-spam messages. It's time to put the classifier to the test and generate predictions. We'll then compare these predictions with the actual labels to create the confusion matrix:

```python
from sklearn.metrics import confusion_matrix

# Predictions made by the classifier
y_pred = clf.predict(X_test)

# Generation of the Confusion Matrix
conf_mat = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(conf_mat)
```

The output of the above code will be:

```
Confusion Matrix:
[[1207    0]
 [  48  138]]
```

Scikit-learn places actual classes in rows and predicted classes in columns, so this output shows 1207 true negatives, 0 false positives, 48 false negatives, and 138 true positives. The classifier handled every non-spam message correctly but missed 48 of the 186 spam messages in the test set.

## Interpreting the Confusion Matrix

Now that we have the confusion matrix, let's decode what each quadrant stands for and evaluate the performance of our classifier:

* **The top-left quadrant represents True Negatives**: Our classifier successfully identified 1207 messages as non-spam, showing high effectiveness in recognizing legitimate messages.
* **The top-right quadrant stands for False Positives**: With 0 instances here, our classifier perfectly avoided misclassifying non-spam messages as spam, ensuring no legitimate messages were incorrectly blocked.
* **The bottom-left quadrant represents False Negatives**: The classifier failed to identify 48 spam messages, which slipped through as non-spam. This is the model's main weakness.
* **The bottom-right quadrant shows True Positives**: It correctly classified 138 messages as spam, demonstrating a solid capability to identify spam messages, though there's room for improvement in its spam detection rate.
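
The quadrant counts above can be turned into summary rates. The sketch below hard-codes the matrix from the output shown earlier instead of retraining the model, and uses the standard definitions of accuracy, precision, and recall; treat it as an illustration, not as part of the lesson's pipeline:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: actual, columns: predicted)
conf_mat = np.array([[1207, 0],
                     [48, 138]])

# For a 2x2 matrix, ravel() flattens row by row: TN, FP, FN, TP
tn, fp, fn, tp = conf_mat.ravel()

accuracy = (tp + tn) / conf_mat.sum()  # share of all messages classified correctly
precision = tp / (tp + fp)             # share of predicted spam that really is spam
recall = tp / (tp + fn)                # share of actual spam that was caught

print(f"Accuracy:  {accuracy:.3f}")   # Accuracy:  0.966
print(f"Precision: {precision:.3f}")  # Precision: 1.000
print(f"Recall:    {recall:.3f}")     # Recall:    0.742
```

Accuracy looks strong at about 96.6%, yet recall shows that roughly a quarter of the spam messages were missed, a detail the single accuracy figure hides.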

## Evaluating Classifier Performance

In this task, you'll apply a Multinomial Naive Bayes classifier to the SMS Spam Collection dataset and evaluate its performance with a confusion matrix. Simply run the provided code to see how a confusion matrix is built in Python.


```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict and evaluate using confusion matrix
y_pred = clf.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(conf_mat)

```



## Filling in the Confusion Matrix

Dive deeper into model evaluation by completing the code below. Fill in the blanks so that the script imports the appropriate function from Scikit-learn and uses it to generate the confusion matrix.



```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import __________

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# TODO: Predict and evaluate using confusion matrix
y_pred = clf.predict(X_test)
conf_mat = __________(y_test, y_pred)

print("Confusion Matrix:")
print(conf_mat)
```

To complete the code and generate the confusion matrix, you need to import `confusion_matrix` from `sklearn.metrics` and then use it to calculate `conf_mat`.

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix  # Filled-in import

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Predict and evaluate using confusion matrix
y_pred = clf.predict(X_test)
conf_mat = confusion_matrix(y_test, y_pred)  # Filled-in function call

print("Confusion Matrix:")
print(conf_mat)
```

Demonstrate your command of the confusion matrix by writing the rest of the code yourself. Your task is to make predictions on the test set with the trained classifier, then build and print the confusion matrix for those predictions.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# TODO: Make predictions on the test set

# TODO: Evaluate the model using the confusion matrix and print it


```

One possible solution:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Evaluate the model using the confusion matrix and print it
conf_mat = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(conf_mat)

```