
### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [53]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [54]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)

In [55]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.1)
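
To see what `alpha` controls, here is a minimal sketch on a toy count matrix (the data and the names `X_toy`, `y_toy`, and `probs` are made up for illustration and are not part of the lab). `alpha` is the additive smoothing constant: it is added to every term count before the per-class term probabilities are normalized, so a smaller `alpha` keeps the estimates closer to the raw frequencies.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy data (illustrative): two documents, two vocabulary terms.
# Class 0 only ever uses term 0; class 1 only ever uses term 1.
X_toy = np.array([[2, 0],
                  [0, 2]])
y_toy = np.array([0, 1])

probs = {}
for alpha in (1.0, 0.1):
    model = MultinomialNB(alpha=alpha).fit(X_toy, y_toy)
    # feature_log_prob_ stores log P(term | class); exponentiate to get probabilities.
    probs[alpha] = np.exp(model.feature_log_prob_)
    print(f"alpha={alpha}:\n{probs[alpha].round(4)}")
```

With `alpha=1.0` the term a class never used still gets probability 0.25 in that class; with `alpha=0.1` it drops to about 0.045.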

### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `Scipy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
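
In `scipy.stats.uniform`, `loc` is the lower bound and `scale` is the width of the interval, so `uniform(loc=0.1, scale=10)` covers [0.1, 10.1] rather than [0.1, 10]. A quick check, as a sketch (the sample size and seed are arbitrary choices):

```python
from scipy.stats import uniform

alpha_dist = uniform(loc=0.1, scale=10)

# The support is [loc, loc + scale]; ppf(0) and ppf(1) return its endpoints.
low, high = alpha_dist.ppf(0.0), alpha_dist.ppf(1.0)
print("support:", low, high)

# Draw a few candidate values, the way a random search samples via .rvs().
samples = alpha_dist.rvs(size=5, random_state=0)
print("samples:", samples)
```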

In [56]:
from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline([('count', CountVectorizer(stop_words='english')), ('nb', MultinomialNB())])

parameters = {'nb__alpha': uniform(loc=0.1, scale=10)}

clf = RandomizedSearchCV(pipeline, parameters, n_iter=10, cv=3)

clf.fit(Xtrain, ytrain)

In [57]:
ypred = clf.predict(Xtest)

print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.64      0.67      0.65        21
           1       0.75      0.43      0.55        21
           2       1.00      0.04      0.07        26
           3       0.48      0.74      0.58        34
           4       0.84      0.79      0.82        34
           5       0.70      0.81      0.75        26
           6       0.89      0.77      0.83        22
           7       0.78      1.00      0.88        28
           8       0.96      0.82      0.89        33
           9       0.88      0.84      0.86        25
          10       0.87      1.00      0.93        27
          11       0.72      0.90      0.80        20
          12       0.64      0.67      0.65        24
          13       0.87      0.87      0.87        23
          14       0.81      0.93      0.87        28
          15       0.77      0.93      0.84        29
          16       0.53      0.95      0.68        21
          17       0.89    

#### Exercise

1. For the Naive Bayes model, use grid search 5-fold cross-validation across different values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation across different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
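
As a reminder of what `f1_macro` measures, the sketch below computes it two ways on a made-up set of labels (`y_true` and `y_pred` are illustrative, not lab data): directly with `f1_score(..., average='macro')`, and by reading the `macro avg` row of `classification_report`. Both give the unweighted mean of the per-class F1 scores.

```python
from sklearn.metrics import classification_report, f1_score

# Illustrative labels for a 3-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Option 1: ask for the macro average directly.
macro_direct = f1_score(y_true, y_pred, average='macro')

# Option 2: pull the same number out of the classification report.
report = classification_report(y_true, y_pred, output_dict=True)
macro_from_report = report['macro avg']['f1-score']

# Per-class F1 scores, for comparison with their plain mean.
per_class = f1_score(y_true, y_pred, average=None)
print(per_class, macro_direct, macro_from_report)
```

Here the per-class F1 scores are 0.5, 0.8, and 2/3, so the macro average is their mean, about 0.656.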

**Grid Search**

In [58]:
from sklearn.feature_extraction.text import CountVectorizer

pipeline = Pipeline([('count', CountVectorizer(stop_words='english')), ('nb', MultinomialNB())])

parameters = {'nb__alpha': [0.01, 0.1, 0.5, 1.0, 5.0, 10.0]}

# GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(pipeline, parameters, cv=5)
grid_search.fit(Xtrain, ytrain)

# best alpha and the corresponding mean cross-validation score
# (no `scoring` was passed, so GridSearchCV uses the classifier's default metric: accuracy)
print("Best alpha from grid search:", grid_search.best_params_['nb__alpha'])
print("Best cross-validation score (accuracy):", grid_search.best_score_)

Best alpha from grid search: 0.01
Best cross-validation score (accuracy): 0.8156666666666667
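
When no `scoring` argument is given, `GridSearchCV` ranks candidates with the estimator's own `score` method, which is accuracy for classifiers. If the goal is to select `alpha` by macro-averaged F1 during cross-validation, `scoring='f1_macro'` can be passed explicitly. The following is a minimal sketch on a tiny made-up corpus (`docs`, `labels`, and `toy_search` are hypothetical names so that it runs in seconds); it is not the lab's cell, and its scores say nothing about the 20 Newsgroups results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus: 0 = space, 1 = cars (each document appears twice).
space_docs = ["rocket launch orbit", "satellite orbit moon", "nasa rocket moon",
              "shuttle launch nasa", "moon landing satellite"]
car_docs = ["engine brake tire", "sedan engine fuel", "tire wheel brake",
            "fuel mileage sedan", "wheel dealer mileage"]
docs = (space_docs + car_docs) * 2
labels = ([0] * 5 + [1] * 5) * 2

toy_pipeline = Pipeline([('count', CountVectorizer(stop_words='english')),
                         ('nb', MultinomialNB())])
alphas = [0.01, 0.1, 1.0, 10.0]

# scoring='f1_macro' makes best_score_ a mean macro-F1 instead of mean accuracy.
toy_search = GridSearchCV(toy_pipeline, {'nb__alpha': alphas},
                          scoring='f1_macro', cv=5)
toy_search.fit(docs, labels)

print("Best alpha:", toy_search.best_params_['nb__alpha'])
print("Best mean CV f1_macro:", toy_search.best_score_)
```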


In [59]:
from sklearn.metrics import classification_report

# best model from the grid search
best_model = grid_search.best_estimator_

ypred_test = best_model.predict(Xtest)

# Compute the classification report and extract the macro-averaged F1 score
report = classification_report(ytest, ypred_test, output_dict=True)
f1_macro_score = report['macro avg']['f1-score']

print("Model's f1_macro score on the test set from Grid Search:", f1_macro_score)

Model's f1_macro score on the test set from Grid Search: 0.7077459540233744


**Random Search**

In [60]:
pipeline = Pipeline([('count', CountVectorizer(stop_words='english')), ('nb', MultinomialNB())])

parameters_random = {'nb__alpha': uniform(loc=0.01, scale=10)}

# RandomizedSearchCV with 5-fold cross-validation
random_search = RandomizedSearchCV(pipeline, parameters_random, n_iter=10, cv=5)

random_search.fit(Xtrain, ytrain)

In [61]:
# no `scoring` was passed, so best_score_ is mean cross-validation accuracy
print("Best alpha from random search:", random_search.best_params_['nb__alpha'])
print("Best cross-validation score (accuracy) from random search:", random_search.best_score_)

Best alpha from random search: 0.30807329003684325
Best cross-validation score (accuracy) from random search: 0.8046666666666666
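
A uniform distribution over [0.01, 10.01] rarely proposes small values of `alpha`, because most of the interval lies above 1. One possible alternative, which is not what this lab used, is to sample `alpha` on a log scale with `scipy.stats.loguniform` (assuming SciPy 1.4 or later). The sketch below only compares how often each distribution proposes a value below 1; the sample size and seed are arbitrary.

```python
from scipy.stats import loguniform, uniform

n = 1000

# Same range as the random search above, sampled two different ways.
uniform_draws = uniform(loc=0.01, scale=10).rvs(size=n, random_state=0)
loguniform_draws = loguniform(0.01, 10).rvs(size=n, random_state=0)

# Fraction of proposed alphas that fall below 1.0 under each distribution.
share_uniform = (uniform_draws < 1.0).mean()
share_loguniform = (loguniform_draws < 1.0).mean()

print(f"uniform:    {share_uniform:.2f} of draws below 1.0")
print(f"loguniform: {share_loguniform:.2f} of draws below 1.0")
```

Under the uniform distribution roughly one draw in ten is below 1, while the log-uniform distribution puts about two thirds of its draws there. Independently of the distribution, passing `random_state` to `RandomizedSearchCV` makes the sampled candidates repeatable across runs.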


In [62]:
# best model from the random search
best_model_rand = random_search.best_estimator_

ypred_test_rand = best_model_rand.predict(Xtest)

# Compute the classification report and extract the macro-averaged F1 score
report_rand = classification_report(ytest, ypred_test_rand, output_dict=True)
f1_macro_score_rand = report_rand['macro avg']['f1-score']

print("Model's f1_macro score on the test set from Random Search:", f1_macro_score_rand)

Model's f1_macro score on the test set from Random Search: 0.7136131453502167


In [64]:
# Compare the test-set f1_macro scores from the two searches
print(f'f1_macro score from Grid Search: {f1_macro_score:.4f}')
print(f'f1_macro score from Random Search: {f1_macro_score_rand:.4f}')

print()

if f1_macro_score_rand > f1_macro_score:
    print("Random search achieved a better f1_macro score compared to Grid search.")
elif f1_macro_score_rand < f1_macro_score:
    print("Grid search achieved a better f1_macro score compared to Random search.")
else:
    print("Grid search and random search achieved the same f1_macro score.")

f1_macro score from Grid Search: 0.7077
f1_macro score from Random Search: 0.7136

Random search achieved a better f1_macro score compared to Grid search.
