# Unit 3 Mastering Cross-Validation for Text Classification in Python

## Topic Overview

Hello! Our mission for this lesson is to master the **cross-validation** process — a critical part of building machine learning models. We will explore the following:

* What cross-validation is and why it's beneficial
* How to implement it in Python using the `sklearn` library
* How to apply these concepts to our text **classification** task

Ready? Let's start our journey!

---

## Understanding Cross-validation

First things first: what does **cross-validation** mean in machine learning terms? As you know, in supervised learning we need a way to measure how good our model is. A common, simple approach is to divide our data into two sets: one for training and one for testing. However, this approach has a weakness: the score we get depends on which examples happened to land in the test set, so a different split can give a noticeably different result. This is where cross-validation comes in handy.

The core idea behind cross-validation is to divide the dataset into several subsets; the model is then trained on some of these subsets and tested on the remaining ones. We repeat this process several times, changing which subsets are used for training and testing, and in the end we average the model's performance over the different divisions of the dataset. It's named cross-validation because we're "crossing" over our subsets for training and validation.

Cross-validation gives us a more reliable measure of performance than just one train-test split. The most common type of cross-validation is **K-fold cross-validation**, where we divide the data into K subsets and train the model K times, each time using a different subset as the validation set.
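To make the K-fold idea concrete, here is a minimal sketch using scikit-learn's `KFold` splitter. The ten placeholder samples and the choice of K=5 are assumptions made for illustration; they are not the lesson's dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder samples (hypothetical data, used only to show the index splits)
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5)  # K = 5, no shuffling

fold_sizes = []
for fold, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    # Each sample lands in the validation set exactly once across the K folds
    print(f"Fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    fold_sizes.append((len(train_idx), len(val_idx)))
```

Every fold trains on eight samples and validates on the remaining two, and the validation sets never overlap.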

---

## Cross-validation in scikit-learn

In Python, we can easily perform cross-validation using the `sklearn` library, specifically the `cross_val_score()` function. Here's a quick overview of its parameters:

* **estimator**: the machine learning model we want to evaluate.
* **X**: the input data.
* **y**: the target data.
* **cv**: the number of subsets (folds) to divide the data into (the "K" in K-fold cross-validation). It also accepts a cross-validation splitter object.

The example code in the next section uses 5-fold cross-validation. Five is a common choice because it balances the reliability of the performance estimate against computational cost.
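As a rough sketch of how the `cv` parameter behaves, the snippet below passes it once as an integer and once as an explicit splitter. The random count matrix and the `_demo` names are hypothetical stand-ins, not the lesson's data. For a classifier, scikit-learn turns an integer `cv` into stratified folds, so the two calls should agree.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data: 40 samples of non-negative counts, two balanced classes
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(40, 6))
y_demo = np.array([0, 1] * 20)

clf_demo = MultinomialNB()

# cv as an integer: the number of folds
scores_int = cross_val_score(clf_demo, X_demo, y_demo, cv=5)

# cv as a splitter object: an explicit, unshuffled stratified 5-fold split
scores_splitter = cross_val_score(
    clf_demo, X_demo, y_demo, cv=StratifiedKFold(n_splits=5)
)

print("Integer cv:  ", scores_int)
print("Splitter cv: ", scores_splitter)
```

Passing a splitter object is mainly useful when you want control over shuffling or a fixed random seed; for this lesson the integer form is enough.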

---

## Conducting Cross-validation for Text Classification

To see cross-validation in action, let's proceed with an example using a Naive Bayes classifier.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5)
```

In this example, we first initialize a Naive Bayes classifier and then perform 5-fold cross-validation on our text classification task. Here `X_tfidf` is the TF-IDF feature matrix and `df['label']` holds the target labels. The variable `scores` stores one accuracy value per fold (accuracy is the default score for classifiers), providing insight into the model's performance across different subsets of the data.

---

## Interpreting Cross-validation Results

When we perform cross-validation, we obtain a series of performance scores, one for each fold. Here's how we can print these scores and their average to evaluate our model's performance.

```python
# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

The output will look like this:

```
Cross-validated scores: [0.96502242 0.95426009 0.95780969 0.9551167  0.96229803]
Average cross-validated score: 0.9589013855455635
```

Each number in the list of **Cross-validated scores** represents our model's accuracy in a single fold, illustrating the model's consistency across different parts of the dataset. Observing similar performance across folds, as seen in our output, indicates that our model performs consistently.

By computing the mean of these scores, we get an **Average cross-validated score** of approximately 95.89%. This average score is a robust metric representing how our model is likely to perform on unseen data, thus providing a more reliable estimate than a single train/test split.
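Alongside the mean, the spread of the fold scores is a handy way to quantify consistency. The sketch below reuses the five scores from the output shown above; the variable names are illustrative.

```python
import numpy as np

# Fold scores copied from the output shown above
scores_shown = np.array([0.96502242, 0.95426009, 0.95780969, 0.9551167, 0.96229803])

mean_score = scores_shown.mean()
std_score = scores_shown.std()  # spread across folds; smaller means more consistent

print(f"Mean accuracy: {mean_score:.4f}")
print(f"Std across folds: {std_score:.4f}")
print(f"Range: {scores_shown.min():.4f} to {scores_shown.max():.4f}")
```

A small standard deviation relative to the mean supports the reading that the model behaves similarly on every fold.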

---

## Lesson Summary and Practice

Congratulations! Today, you have learned what cross-validation is, how to perform it using the Python `sklearn` library, and how to interpret the results. The road to mastery is through continuous practice. Don't forget to use this cross-validation technique in your future machine learning projects, particularly for text classification. Through this process, you will gain more familiarity and confidence in using the method. Happy learning!

## Running Cross-Validation on Text Data

In this task, you'll solidify your understanding of cross-validation through a practical text classification example with the Naive Bayes classifier. We'll use TF-IDF vectorization on the SMS Spam Collection dataset, and observe the model's performance across different data subsets. Run the provided code without modifications to analyze cross-validation scores and the model's consistency.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5)

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

## Running Cross-Validation on Text Data

Let's run the provided Python code to solidify your understanding of cross-validation in a practical text classification example. This script will use a Naive Bayes classifier with TF-IDF vectorization on the SMS Spam Collection dataset, and then display the model's performance across different data subsets.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5)

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

This code snippet will:
1.  **Load the dataset**: It starts by loading the `sms-spam-collection` dataset.
2.  **Vectorize text**: It then uses `TfidfVectorizer` to convert the text messages into numerical features, which is essential for machine learning models.
3.  **Initialize model**: A `MultinomialNB` (Naive Bayes) classifier is initialized.
4.  **Perform cross-validation**: The `cross_val_score` function will apply 5-fold cross-validation, training and testing the Naive Bayes classifier five times on different subsets of the data.
5.  **Display results**: Finally, it will print the accuracy scores for each of the five folds and their average, giving you insight into the model's consistency and overall performance.

Running this code will help you observe firsthand how cross-validation provides a more robust evaluation of your model's performance on unseen data.
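One refinement worth knowing about, offered here as a sketch rather than as the lesson's method: the script above fits `TfidfVectorizer` on all messages before cross-validation, so vocabulary and IDF statistics from each validation fold are visible during training. Wrapping the vectorizer and classifier in a scikit-learn pipeline refits both inside every fold. The tiny corpus below is hypothetical and stands in for the SMS dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical mini-corpus standing in for df['message'] and df['label']
messages = [
    "win a free prize now", "free cash offer click now",
    "claim your free reward today", "urgent prize waiting for you",
    "win cash now call free", "are we still meeting for lunch",
    "see you at home tonight", "can you call me later",
    "thanks for the lunch today", "meeting moved to tomorrow morning",
]
labels = ["spam"] * 5 + ["ham"] * 5

# The vectorizer is refit on the training part of each fold only
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores_pipe = cross_val_score(pipeline, messages, labels, cv=5)

print("Cross-validated scores:", scores_pipe)
print("Average cross-validated score:", scores_pipe.mean())
```

On a corpus this small the scores mean little; the point is only the structure, where raw text goes into `cross_val_score` and the pipeline handles vectorization per fold.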

## Elevating Cross-Validation to 10-Folds

Continuing from your successful journey through the basics of cross-validation, we delve deeper into refining your skills. We have discovered how 5-fold cross-validation provides a dependable metric of our model's performance. Now, we will tweak this setting to deepen our understanding.

Change the `cv` parameter in `cross_val_score()` from 5 to 10 to switch from 5-fold to 10-fold cross-validation. This lets you observe how increasing the number of folds influences the model evaluation, laying the groundwork for informed decisions about training and validation techniques.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5) # Change this line to use 10-fold cross-validation

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

## Elevating Cross-Validation to 10-Folds

Excellent! Building on your understanding of 5-fold cross-validation, let's now elevate our approach to 10-fold cross-validation. This change will allow us to observe how increasing the number of folds impacts the robustness and reliability of our model evaluation.

The only modification needed is to change the `cv` parameter in the `cross_val_score()` function from `5` to `10`. This simple adjustment will give us a more granular look at our model's performance across more subsets of the data.

Here's the modified code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=10) # Changed cv from 5 to 10

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

By running this updated code, you'll see a list of 10 scores, one accuracy value per fold. With 10 folds, each model trains on 90% of the data instead of 80%, but each validation fold is half as large, so individual fold scores tend to vary more, and the run requires twice as many model fits. Comparing the two averages is a good way to understand the trade-offs of choosing different `k` values in K-fold cross-validation.
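The effect of K on fold sizes can be checked directly. This is a small sketch on 100 placeholder samples (a hypothetical size chosen for round numbers), not on the SMS dataset.

```python
import numpy as np
from sklearn.model_selection import KFold

X_placeholder = np.zeros((100, 1))  # 100 hypothetical samples

split_sizes = {}
for k in (5, 10):
    # Look at the first split only; all folds have the same sizes here
    train_idx, val_idx = next(KFold(n_splits=k).split(X_placeholder))
    split_sizes[k] = (len(train_idx), len(val_idx))
    print(f"K={k}: {k} model fits, "
          f"{len(train_idx)} training / {len(val_idx)} validation samples per fold")
```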

## Fixing Cross-Validation in Naive Bayes

Often, the devil is in the details, and it's easy to slip up on seemingly small things. This challenge presents you with a snippet of code designed to run cross-validation on a classifier. There's a bug that's throwing off the entire cross-validation process. Your task is to identify and fix the bug, making sure the process runs smoothly.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=-5)

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

## Fixing Cross-Validation in Naive Bayes

You've spotted a common pitfall! The bug lies in the `cv` parameter for the `cross_val_score()` function.

The `cv` parameter expects a positive integer representing the number of folds, or a cross-validation splitter object. A negative value for `cv` is invalid: an integer `cv` must be at least 2, so scikit-learn raises a `ValueError` instead of running the evaluation.

To fix this, we simply need to change `cv=-5` to a valid positive integer, such as `cv=5` (for 5-fold cross-validation) or `cv=10` (for 10-fold cross-validation), as explored in previous exercises.

Here's the corrected code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5) # Bug fixed: Changed cv=-5 to cv=5

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

By changing `cv=-5` to `cv=5`, the cross-validation process will now run smoothly, providing meaningful accuracy scores for your Naive Bayes classifier on the text data.
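To see the failure without loading the dataset, the following sketch triggers the same error on a tiny hypothetical count matrix. The exact error message depends on the scikit-learn version, so the code only checks that a `ValueError` is raised.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-in data: 10 samples, two balanced classes
X_small = np.array([[1, 0], [0, 1], [2, 0], [0, 2], [3, 1],
                    [1, 3], [2, 1], [1, 2], [4, 0], [0, 4]])
y_small = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

error_message = None
try:
    # Invalid: the number of folds cannot be negative
    cross_val_score(MultinomialNB(), X_small, y_small, cv=-5)
except ValueError as exc:
    error_message = str(exc)
    print("Invalid cv rejected:", error_message)

# Valid: a positive number of folds, at least 2
valid_scores = cross_val_score(MultinomialNB(), X_small, y_small, cv=5)
print("Scores with cv=5:", valid_scores)
```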

## Implementing Cross-Validation in Python

Building on the solid foundation you've developed from understanding cross-validation, we're diving a bit deeper. You'll fill in the gaps in a Python script designed to perform 5-fold cross-validation on a Naive Bayes classifier, targeting the detection of spam messages.

This task hones your ability to apply the `cross_val_score` function from the `sklearn` library so that you can accurately assess a model's performance across different data splits. Remember, correctly implementing cross-validation is key to validating the stability and reliability of your machine learning model.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier with 5 folds
scores = ____________________

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

## Implementing Cross-Validation in Python

Great! Let's fill in the missing piece to correctly implement 5-fold cross-validation on our Naive Bayes classifier for spam detection.

The `cross_val_score` function from `sklearn.model_selection` is precisely what we need here. Its essential parameters are:
* `estimator`: The machine learning model (our `clf`).
* `X`: The feature matrix (our `X_tfidf`).
* `y`: The target variable (our `df['label']`).
* `cv`: The number of folds for cross-validation (which is `5` for this task).

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply cross-validation on the classifier with 5 folds
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5)

# Print the scores for each cross-validation fold
print("Cross-validated scores:", scores)

# Calculate and print the mean of the cross-validation scores
print("Average cross-validated score:", scores.mean())
```

By completing the `scores = cross_val_score(clf, X_tfidf, df['label'], cv=5)` line, you are instructing the script to perform the 5-fold cross-validation, providing a robust evaluation of your Naive Bayes model's ability to classify spam messages. This is a fundamental step in ensuring your model is stable and generalizes well to unseen data.

## Mastering Text Classification with Naive Bayes

Building on our journey through cross-validation, it's time to apply your understanding in a comprehensive task. Your goal is to complete the Python script to use a Naive Bayes classifier, conduct a 5-fold cross-validation, and finally display the scores and compute the average cross-validation score. This exercise offers a hands-on opportunity to integrate the concepts you've learned into a real-world dataset.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# TODO: Initialize a Naive Bayes classifier

# TODO: Apply 5-fold cross-validation on the classifier and store the result in a variable

# TODO: Print the cross-validated scores for each fold

# TODO: Calculate and print the average of the cross-validated scores


```

## Mastering Text Classification with Naive Bayes

Excellent! Let's complete this comprehensive task by integrating all the steps to perform text classification with a Naive Bayes classifier and evaluate it using 5-fold cross-validation.

You'll need to add the following lines to complete the script:

1.  **Initialize a Naive Bayes classifier**: Create an instance of `MultinomialNB()`.
2.  **Apply 5-fold cross-validation**: Use `cross_val_score()` with `cv=5`.
3.  **Print scores**: Display the `scores` variable.
4.  **Calculate and print average score**: Compute and display the mean of the `scores`.

Here's the completed Python script:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Initialize a Naive Bayes classifier
clf = MultinomialNB()

# Apply 5-fold cross-validation on the classifier and store the result in a variable
scores = cross_val_score(clf, X_tfidf, df['label'], cv=5)

# Print the cross-validated scores for each fold
print("Cross-validated scores:", scores)

# Calculate and print the average of the cross-validated scores
print("Average cross-validated score:", scores.mean())
```

By running this script, you will perform the entire process: from loading the data and transforming it with TF-IDF, to training a Naive Bayes model, and finally evaluating its performance rigorously using 5-fold cross-validation. This comprehensive exercise demonstrates your ability to apply machine learning concepts to a practical problem.