# Unit 1: Training a Naive Bayes Classifier for Text Categorization

## Topic Overview and Objective
Hello again! Now that you are familiar with loading and preprocessing the dataset, today we'll learn how to train a **Naive Bayes Classifier** for text classification using Python. We'll use the **SMS Spam Collection dataset** in this example. By the end of this lesson, you will have a clear understanding of the principles behind the Naive Bayes algorithm and of how to implement it with the `sklearn` package in Python.

---

## Train-Test Split Details
You have previously learned how to load our dataset and vectorize its messages. Now it's time to split the data into two sets: the **train set**, whose known labels the model learns from, and the **test set**, which we hold back to check the model's predictions on unseen data. We can use the `sklearn` package for that, as shown in the following code:

```python
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)
```
The `random_state` argument of `train_test_split` ensures reproducibility: with a fixed value, we get the same split each time we run the code. We use a **test size of 0.25**, which means 75% of the data is used for training and 25% for testing our model.
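
To make the effect of `test_size` and `random_state` concrete, here is a minimal sketch on toy data. The arrays `X` and `y` are hypothetical stand-ins for the vectorized messages and their labels, not the real dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples with 2 features each, alternating labels
X = np.arange(40).reshape(20, 2)
y = np.array(["ham", "spam"] * 10)

# Two calls with the same random_state should produce the same split
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.25, random_state=42
)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.25, random_state=42
)

print("Train size:", len(X_train_a))  # 75% of 20 samples
print("Test size:", len(X_test_a))    # 25% of 20 samples
print("Identical splits:", np.array_equal(X_test_a, X_test_b))
```

Changing `random_state` to another integer (or omitting it) would generally give a different, but equally valid, split.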

---

## Introduction to Naive Bayes Model
The **Naive Bayes classifier** is a simple yet effective and commonly used probabilistic classifier. It applies **Bayes' theorem** with a strong (naive) assumption that the features are independent of one another given the class. Naive Bayes classifiers cope well with high-dimensional data and have worked quite well for text classification problems.
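
In symbols, the idea can be sketched as follows. The notation is an assumption introduced here for illustration: $c$ is a class (spam or non-spam), $x_1, \dots, x_n$ are the features of a message, and $\hat{y}$ is the predicted class. The independence assumption is what lets the likelihood factor into a product of per-feature terms.

```latex
P(c \mid x_1, \dots, x_n) \;\propto\; P(c) \prod_{i=1}^{n} P(x_i \mid c)
\qquad\Longrightarrow\qquad
\hat{y} = \arg\max_{c} \; P(c) \prod_{i=1}^{n} P(x_i \mid c)
```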

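In our case, we'll use the **Multinomial Naive Bayes** implementation available in `sklearn`, which is suitable for classification with discrete features (like word counts for text classification). Although the multinomial model formally expects integer counts, fractional values such as the TF-IDF scores we computed earlier also work in practice.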

Let's go ahead and create a Multinomial Naive Bayes model. Then, we can train it (or "fit" it, as we say in Machine Learning) with `fit(X_train, y_train)`. Here `X_train` contains the vectorized training data, and `y_train` contains the corresponding labels.

```python
from sklearn.naive_bayes import MultinomialNB

# Train a Multinomial Naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train, y_train)
```
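
To get a feel for what `fit` learns, the following sketch trains the same kind of model on a tiny corpus and inspects a few fitted attributes. The four messages and their labels are hypothetical and only stand in for the SMS dataset.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the SMS dataset
messages = [
    "win a free prize now",
    "free cash claim now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(messages)

clf = MultinomialNB()
clf.fit(X, labels)

# Class labels seen during fit, in sorted order
print("Classes:", clf.classes_)
# Prior probability of each class, estimated from label frequencies
print("Class priors:", np.exp(clf.class_log_prior_))
# One row of per-word log-probabilities for each class
print("Feature log-prob shape:", clf.feature_log_prob_.shape)
```

The priors reflect how often each label appears in the training data, and the per-word log-probabilities are what the classifier combines when it scores a new message.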

---

## Predicting with a New Message
Once our model is trained, it can classify new, unseen messages. To do this, we vectorize the new message with the same fitted `vectorizer` we used for the training data (calling `transform`, not `fit_transform`). Then, we use our trained classifier to predict the message category.

Here's an example:

```python
new_message = "Congratulations! Text 'WIN' to claim a prize."
new_message_vectorized = vectorizer.transform([new_message])

prediction = clf.predict(new_message_vectorized)
print("The message is:", prediction)
```
Running this code prints the predicted label, showing how our trained Naive Bayes model can be used in practice to classify new messages:

```
The message is: ['spam']
```
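
Besides a hard label, `MultinomialNB` can also report how confident it is through `predict_proba`. The sketch below uses a small hypothetical corpus (not the real dataset) so that it runs on its own.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for the SMS dataset
messages = [
    "win a free prize now",
    "free cash claim now",
    "are we meeting for lunch",
    "see you at home tonight",
]
labels = ["spam", "spam", "ham", "ham"]

vectorizer = TfidfVectorizer()
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(messages), labels)

# Reuse the fitted vectorizer: transform only, never refit on new messages
new_message = "claim your free prize"
new_vec = vectorizer.transform([new_message])

prediction = clf.predict(new_vec)
probabilities = clf.predict_proba(new_vec)[0]

# Probabilities are ordered like clf.classes_
for label, p in zip(clf.classes_, probabilities):
    print(f"P({label}) = {p:.3f}")
print("Predicted label:", prediction[0])
```

Words the vectorizer never saw during fitting (here, "your") are simply ignored when the new message is transformed.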

---

## Measuring Accuracy with Test Set
With training complete, we can measure how well our Naive Bayes classifier performs. To predict the categories of unseen data, we call the `predict` method on the vectorized test data, `X_test`. We then assess the model's performance with the accuracy metric: the proportion of correct predictions out of all predictions. Accuracy is calculated using `accuracy_score()` from `sklearn.metrics`.

```python
from sklearn.metrics import accuracy_score

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate the accuracy of the predictions
print("Accuracy:", accuracy_score(y_test, y_pred))
```
This tells us how well our model classifies SMS messages in the test set, which gives a sense of its reliability for practical applications. The output below shows an accuracy of approximately 96.6%, meaning the classifier assigned the correct category to nearly 97 out of every 100 test messages. This indicates that the Naive Bayes classifier is effective at distinguishing between spam and non-spam messages in our dataset.

```
Accuracy: 0.9655419956927495
```
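
Since accuracy is just the fraction of predictions that match the true labels, it can be checked by hand. The sketch below uses two short hypothetical label arrays (not the model's real predictions) and compares a manual calculation with `accuracy_score`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions for 8 messages
y_true = np.array(["ham", "ham", "spam", "ham", "spam", "ham", "ham", "spam"])
y_hat = np.array(["ham", "ham", "spam", "ham", "ham", "ham", "spam", "spam"])

# Manual accuracy: number of matches divided by number of predictions
manual_accuracy = np.mean(y_true == y_hat)

# The same quantity computed by sklearn
sklearn_accuracy = accuracy_score(y_true, y_hat)

print("Manual accuracy:", manual_accuracy)    # 6 of 8 correct -> 0.75
print("sklearn accuracy:", sklearn_accuracy)  # 0.75
```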

---

## Lesson Summary and Practice
Great job! You've just learned to train a Naive Bayes Classifier, make predictions with the model, and assess its performance using accuracy. These are key steps in building text classification models, and you've done a fantastic job grasping the basics.

Now, it's time for real hands-on practice. Concepts stick best when we solve real problems with them, so don't hold back on experimenting.

Keep exploring, keep learning! Happy coding!

## Running the Naive Bayes Classifier

Now you'll put everything you've learned about the Naive Bayes classifier into action. The code below loads the dataset, vectorizes the messages, trains the model, and evaluates its accuracy. No modifications are needed: run the code as is to see the classifier's performance firsthand.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Initialize a Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

```

## Adjusting Classifier Test Size

Explore how changing the test size of the data split from 25% to 40% affects the Naive Bayes classifier's accuracy. Modify the `test_size` parameter in `train_test_split` and observe how the result changes. This gives insight into the balance between training and testing data in machine learning projects.


```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
# TODO: Change the test size to 40%
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Initialize a Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))


```

To adjust the classifier test size from 25% to 40%, you need to change the `test_size` parameter in the `train_test_split` function from `0.25` to `0.40`.

Here's the modified code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
# Test size changed from 25% to 40%
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.40, random_state=42)

# Initialize a Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```
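
To compare the two settings side by side, one option is to loop over the test sizes. The sketch below does this on a small hypothetical corpus (the messages are invented and repeated, so the accuracy values it prints say little about the real dataset); it mainly illustrates how the train and test set sizes shift.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Hypothetical toy corpus: 10 spam and 10 ham messages (each template repeated twice)
spam = [
    "win a free prize now",
    "claim your cash reward today",
    "free entry text to win",
    "urgent call now to claim prize",
    "you won cash text now",
]
ham = [
    "are we meeting for lunch",
    "see you at home tonight",
    "can you call me later",
    "thanks for the help today",
    "running late be there soon",
]
messages = spam * 2 + ham * 2
labels = ["spam"] * 10 + ["ham"] * 10

X = TfidfVectorizer().fit_transform(messages)

# Map each test size to (train samples, test samples, accuracy)
results = {}
for test_size in (0.25, 0.40):
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=test_size, random_state=42
    )
    clf = MultinomialNB().fit(X_train, y_train)
    acc = accuracy_score(y_test, clf.predict(X_test))
    results[test_size] = (X_train.shape[0], X_test.shape[0], acc)
    print(f"test_size={test_size}: train={X_train.shape[0]}, "
          f"test={X_test.shape[0]}, accuracy={acc:.3f}")
```

A larger test set leaves fewer examples for training but gives a less noisy accuracy estimate, which is the trade-off this exercise is meant to surface.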

## Mastering the Naive Bayes Classifier

Building on your progress, this task emphasizes the practical aspects of implementing a Naive Bayes classifier for text categorization. Here, we focus on adding crucial parts of the code related to importing necessary modules and initializing the classifier, ensuring our environment is correctly set up for the classification task. Your mission is to complete the missing parts of the code!


```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# TODO: Import the Multinomial Naive Bayes classifier
from sklearn.naive_bayes import __________

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# TODO: Initialize a Multinomial Naive Bayes classifier
clf = ______________

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))

```

Here's the completed code, with the missing import and the classifier initialization filled in:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Import the Multinomial Naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Initialize a Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Train the classifier
clf.fit(X_train, y_train)

# Make predictions
y_pred = clf.predict(X_test)

# Calculate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
```

## Crafting a Naive Bayes Classifier

Having practiced with various aspects of the Naive Bayes Classifier for text classification, it's time to bring everything together. In this exercise, you will implement the classification process step by step. This exercise will challenge you to apply everything you've learned about training the Naive Bayes model, making predictions, and assessing model performance.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# TODO: Initialize a Multinomial Naive Bayes classifier

# TODO: Train the classifier

# TODO: Make predictions

# TODO: Calculate and print the accuracy of the predictions



```

Here's the completed code, with each of the four steps implemented:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_tfidf, df['label'], test_size=0.25, random_state=42)

# Initialize a Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Make predictions on the test set
y_pred = clf.predict(X_test)

# Calculate and print the accuracy of the predictions
print("Accuracy:", accuracy_score(y_test, y_pred))
```
