# Unit 4 Fine-Tuning Text Classification Models with Grid Search in Python

## Introduction and Overview

Hello and welcome to another lesson! We are going to learn how to fine-tune hyperparameters with Grid Search, an essential part of optimizing machine learning models. Through this lesson, you will gain the practical skills needed to apply Grid Search for hyperparameter tuning in a text classification task using Python. We'll use the Scikit-Learn library and continue working with our SMS Spam Collection dataset.

## Understanding Grid Search and Hyperparameter Tuning

To begin, we need to clarify what **hyperparameters** are. These are the customizable settings of a machine learning algorithm that must be specified and fine-tuned by the programmer before training. For example, in a Naive Bayes classifier, the smoothing parameter `alpha` is a hyperparameter that we can adjust. The term distinguishes them from model parameters, which are learned from the data during training.
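To make the distinction concrete, here is a minimal sketch that fits a `MultinomialNB` on a tiny, made-up count matrix (the data and the names `X_toy` and `y_toy` are illustrative assumptions, not part of the lesson's dataset). The value of `alpha` is something we choose, while `feature_log_prob_` is estimated by the model during `fit`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny made-up count matrix: 4 documents, 3 vocabulary terms (illustrative only)
X_toy = np.array([[2, 0, 1],
                  [1, 0, 0],
                  [0, 3, 1],
                  [0, 2, 0]])
y_toy = np.array(["ham", "ham", "spam", "spam"])

# alpha is a hyperparameter: we pick its value before training
model = MultinomialNB(alpha=0.5)
model.fit(X_toy, y_toy)

# get_params() returns the hyperparameters we set
print("Hyperparameter alpha:", model.get_params()["alpha"])

# feature_log_prob_ is a model parameter: it is learned from the data during fit
print("Learned log-probabilities shape:", model.feature_log_prob_.shape)
```

Changing `alpha` and refitting changes the learned values, which is why the choice of hyperparameter can affect model performance.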

**Grid Search** is a handy tool for the fine-tuning of the hyperparameters. It exhaustively searches through a predefined set of hyperparameters to determine the optimal values that maximize the performance of the model. The process involves defining a grid of hyperparameters and then evaluating model performance for each point in the grid. While using cross-validation to evaluate this performance is not mandatory, it is highly recommended. Cross-validation ensures that the evaluation is performed on multiple splits of the data, providing a more reliable performance estimate. We can visualize this process as searching over a multi-dimensional space, where each dimension is a different hyperparameter and we want to find the point (or points) in this space that have the highest accuracy or the lowest loss.

So how does Grid Search work with text classification?

* We first define a set of possible values for each hyperparameter; this set of values forms a grid.
* We then use cross-validation to evaluate the model performance for each combination of hyperparameters on the grid.
* The combination of values that provides the best model performance is chosen as the optimal set of hyperparameters.
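The steps above can be sketched by hand with scikit-learn's `ParameterGrid` and `cross_val_score`. This is a minimal illustration of the idea, not a description of how `GridSearchCV` is implemented internally. The toy messages and labels are made up, and `cv=2` is used only because the toy corpus is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus (illustrative only, not the SMS Spam Collection)
messages = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "urgent prize waiting call",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
]
labels = ["spam"] * 4 + ["ham"] * 4

X = TfidfVectorizer().fit_transform(messages)

# Step 1: define the grid of candidate values
param_grid = {"alpha": [0.1, 0.5, 1.0]}

# Step 2: evaluate every combination with cross-validation
results = {}
for params in ParameterGrid(param_grid):
    scores = cross_val_score(MultinomialNB(**params), X, labels, cv=2)
    results[params["alpha"]] = scores.mean()

# Step 3: keep the combination with the best mean score
best_alpha = max(results, key=results.get)
print("Mean accuracy per alpha:", results)
print("Best alpha:", best_alpha)
```

`GridSearchCV` wraps this loop in a single object and also stores the per-candidate results for later inspection.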

## Fine-Tuning our Naive Bayes Classifier using Grid Search

Now, let's dive into some code. Recall that in prior lessons, we already loaded our dataset and vectorized our messages, which left us with a representation of our text data that our machine learning model can work with.

Here is the code that adjusts the hyperparameters of the Naive Bayes classifier through Grid Search:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Define the grid of hyperparameters to test
param_grid = {'alpha': [0.1, 0.5, 1.0]}

# Initialize Grid Search with Naive Bayes and 5-fold cross-validation
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)

# Fit the grid search to the training data
search.fit(X_tfidf, df['label'])
```

In this code, we start by setting up a parameter grid as a Python dictionary that maps each hyperparameter name to the list of values to be tested. We then pass our model, `MultinomialNB()`, together with this grid to `GridSearchCV`, using `cv=5` for 5-fold cross-validation. Finally, calling `fit` on the grid search object named `search` trains and evaluates the model on our vectorized data for every value in the grid and records the best-performing parameters.

By adjusting these hyperparameters, we can often improve the accuracy of our model and make it more effective at identifying the spam messages in the SMS Spam Collection dataset.

## Interpreting the Outcome of Grid Search

Once the grid search has finished, the fitted `search` object stores the best parameter combination it found in its `best_params_` attribute. It is crucial to understand and interpret this output. The following code prints the best parameters found:

```python
print("Best parameters:", search.best_params_)
```

The output will be:

```
Best parameters: {'alpha': 0.1}
```

This output indicates that, among the values tested, an `alpha` value of 0.1 achieved the best mean score across the cross-validation folds. Because 0.1 is also the smallest value in our grid, it can be worth testing even smaller values to check whether the score keeps improving.
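Beyond `best_params_`, a fitted `GridSearchCV` object also exposes `best_score_` and a `cv_results_` dictionary with the score of every candidate. The sketch below shows how these could be inspected. It runs on a made-up toy corpus with `cv=2` so that it is self-contained, and the names `toy_messages`, `toy_labels`, and `toy_search` are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Made-up toy corpus (illustrative only, not the SMS Spam Collection)
toy_messages = [
    "win a free prize now", "free cash offer call now",
    "claim your free reward today", "urgent prize waiting call",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
]
toy_labels = ["spam"] * 4 + ["ham"] * 4

X_toy = TfidfVectorizer().fit_transform(toy_messages)

toy_search = GridSearchCV(MultinomialNB(), {"alpha": [0.1, 0.5, 1.0]}, cv=2)
toy_search.fit(X_toy, toy_labels)

# Mean cross-validated score for every candidate in the grid
mean_scores = toy_search.cv_results_["mean_test_score"]
for params, mean_score in zip(toy_search.cv_results_["params"], mean_scores):
    print(params, round(mean_score, 3))

# Score of the winning candidate (accuracy by default for classifiers)
print("Best cross-validated score:", toy_search.best_score_)
```

Looking at all candidate scores, not only the winner, helps judge whether the best value is clearly better or only marginally ahead of the others.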

By using grid search in your text classification tasks, you can tune hyperparameters systematically rather than by trial and error, and ultimately deliver a model with better predictive accuracy.

## Summary and Next Steps

In this lesson, we have learned how to fine-tune the hyperparameters of a Multinomial Naive Bayes classifier using Grid Search, allowing us to optimize the text classification model. This tool gives us the ability to systematically work through multiple combinations of parameters to find the best one, leading to the potential of creating more precise and effective classification models.

To reinforce this knowledge, go ahead and get hands-on with the code. Take our existing data, experiment with different parameter grids, and observe how they influence the outcome of your model. This practice will not only solidify your understanding of today's lesson but also strengthen your overall skills in Natural Language Processing.

Happy Coding!

## Optimizing Naive Bayes with Grid Search

Dive into optimizing the Naive Bayes classifier for text classification by using Grid Search to find the best alpha parameter. You'll work with the SMS Spam Collection dataset, applying TF-IDF vectorization before running the Grid Search. Simply execute the provided code to see which alpha value gives our model the best accuracy. No modifications are required; just press Run to observe the results.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Define parameter grid and perform grid search
param_grid = {'alpha': [0.1, 0.5, 1.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)

```

## Expanding Alpha Range in Grid Search

Following our journey through Grid Search's ability to optimize model parameters, let's explore further how slight adjustments can lead to different outcomes. Your current objective involves refining the parameter range of alpha for the Naive Bayes classifier. By introducing new values, 0.05 and 2.0, into the grid, you'll understand the impact of these alterations on the classifier's effectiveness. This will also allow you a glimpse into how precise or broad adjustments can sway the model's predictive accuracy. Let's proceed to enhance the model based on your newfound insights from earlier lessons.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Define parameter grid and perform grid search
param_grid = {'alpha': [0.1, 0.5, 1.0]} # TODO: Expand this range to include 0.05 and 2.0
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)

```

To expand the alpha range in the Grid Search, you need to modify the `param_grid` dictionary to include the new values `0.05` and `2.0`. Here's the updated code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Define parameter grid and perform grid search
# Expanded range: now also includes 0.05 and 2.0
param_grid = {'alpha': [0.05, 0.1, 0.5, 1.0, 2.0]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)
```
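When widening a grid, one common pattern is to space the candidate values logarithmically so that both small and large values are covered with only a few points. The sketch below builds such a grid with `np.logspace`. The range from 0.01 to 1.0 and the number of points are assumptions chosen for illustration, not values recommended by the lesson.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Five log-spaced candidates between 10**-2 and 10**0 (range chosen for illustration)
alphas = np.logspace(-2, 0, num=5)
param_grid = {"alpha": alphas.tolist()}

print("Alpha candidates:", [round(a, 4) for a in param_grid["alpha"]])

# ParameterGrid reports how many combinations a grid search would evaluate
print("Candidates to evaluate:", len(ParameterGrid(param_grid)))
```

A dictionary built this way can be passed to `GridSearchCV` in place of a hand-written list of values.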

## Debugging Grid Search Implementation

In this task, you'll debug a Python script meant for fine-tuning a Naive Bayes classifier with Grid Search on the SMS Spam Collection dataset. A small bug in the code prevents the Grid Search from executing correctly. Your task is to identify and fix this error so that the optimal alpha parameter for the classifier can be determined.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Define parameter grid and perform grid search using the parameters
param_grid = {'alpha': [0.1, 0.5, 1.0]}
search = GridSearchCV(MultinomialNB(), [], cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)

```

The bug in the provided code lies in the `GridSearchCV` initialization. The second argument, `param_grid`, is incorrectly passed as an empty list `[]` instead of the defined dictionary `param_grid`.

Here's the corrected code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Define parameter grid and perform grid search using the parameters
param_grid = {'alpha': [0.1, 0.5, 1.0]}
# Bug fix: Changed [] to param_grid
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)
```

## Tuning Naive Bayes with Grid Search

In this task, you're stepping into an essential phase of machine learning model fine-tuning. Your focus will be on completing a piece of code that defines the parameter grid for a Naive Bayes classifier. Specifically, the missing piece involves the alpha hyperparameter, which controls how much additive smoothing is applied to the estimated feature probabilities. It's a chance to apply what you've learned about Grid Search directly and see its impact on the performance of text classification on the SMS Spam Collection dataset.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# TODO: Define parameter grid and perform grid search
param_grid = {'alpha': [______, _____, _____]}
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)

```

To complete the `param_grid`, you need to provide a list of values for the `alpha` hyperparameter that `GridSearchCV` will explore. Commonly tried values for `alpha` in `MultinomialNB` include 0.1, 0.5, and 1.0, which represent different levels of additive (Laplace/Lidstone) smoothing.

Here's the completed code:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Define parameter grid and perform grid search
param_grid = {'alpha': [0.1, 0.5, 1.0]}  # Completed the alpha values
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

print("Best parameters:", search.best_params_)
```
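To see what the `alpha` values in the grid actually change, the additive smoothing estimate used by multinomial Naive Bayes can be worked through by hand: the probability of a term given a class is `(count + alpha) / (total + alpha * vocab_size)`. The sketch below applies that formula to hypothetical counts. The helper `smoothed_probability` and the numbers are illustrative assumptions, not part of the lesson's code.

```python
def smoothed_probability(term_count, total_count, vocab_size, alpha):
    """Additive-smoothing estimate of P(term | class)."""
    return (term_count + alpha) / (total_count + alpha * vocab_size)

# Hypothetical class: 100 term occurrences in total, vocabulary of 50 terms,
# and a term that never appeared in this class during training
for alpha in [0.1, 0.5, 1.0]:
    p = smoothed_probability(term_count=0, total_count=100, vocab_size=50, alpha=alpha)
    print(f"alpha={alpha}: P(unseen term | class) = {p:.4f}")
```

Larger `alpha` values assign more probability to unseen terms and pull all estimates toward a uniform distribution, while smaller values stay closer to the raw counts.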

## Mastering Grid Search in Text Classification

This challenge offers you the opportunity to implement all the concepts, from tinkering with parameter values to analyzing model performance. Your task involves setting up the Grid Search for a Naive Bayes classifier, and identifying the optimal value for the alpha parameter. This exercise not only assesses your understanding but also strengthens your ability to apply these techniques to real-world datasets.

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# TODO: Set up the parameter grid for 'alpha' values in Grid Search

# TODO: Initialize Grid Search with Naive Bayes classifier and fit it to the data

# TODO: Print out the best 'alpha' parameter found by Grid Search

```

Here's one possible solution:

```python
import pandas as pd
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Load the SMS Spam Collection dataset
sms_spam = load_dataset('codesignal/sms-spam-collection')
# Convert to pandas DataFrame for convenient handling
df = pd.DataFrame(sms_spam['train'])

# Initialize a TF-IDF Vectorizer and transform the messages
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(df['message'])

# Set up the parameter grid for 'alpha' values in Grid Search
param_grid = {'alpha': [0.1, 0.5, 1.0, 2.0]}  # Common alpha values to explore

# Initialize Grid Search with Naive Bayes classifier and fit it to the data
# We use MultinomialNB for text classification and 5-fold cross-validation
search = GridSearchCV(MultinomialNB(), param_grid, cv=5)
search.fit(X_tfidf, df['label'])

# Print out the best 'alpha' parameter found by Grid Search
print("Best 'alpha' parameter:", search.best_params_['alpha'])
```