<a href="https://colab.research.google.com/github/yeoauqt/229352/blob/main/Lab04_Naive_Bayes_Grid_and_Random_Search.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### Statistical Learning for Data Science 2 (229352)
#### Instructor: Donlapark Ponnoprat

#### [Course website](https://donlapark.pages.dev/229352/)

## Lab #4

In [1]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

from scipy.stats import uniform

In [2]:
train = fetch_20newsgroups(subset='train')
test = fetch_20newsgroups(subset='test')

# Work with the first 3000 training and first 500 test documents.
Xtrain = train.data[:3000]
ytrain = train.target[:3000]
Xtest = test.data[:500]
ytest = test.target[:500]

print("X:", len(Xtest))
print("y:", len(ytest))

X: 500
y: 500


### Naive Bayes [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html)
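
To make the role of `alpha` concrete before tuning it, here is a minimal sketch on a hypothetical word-count matrix (the data and the helper `term_probabilities` are made up for illustration and are not part of the lab). It assumes the additive smoothing described in the `MultinomialNB` documentation: `alpha` is added to every term count within a class, so a term that never occurs in a class still gets a small non-zero probability.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word counts: 4 documents (rows) over a 3-term vocabulary (columns).
X = np.array([[3, 0, 1],
              [2, 0, 0],
              [0, 4, 1],
              [0, 3, 0]])
y = np.array([0, 0, 1, 1])


def term_probabilities(alpha):
    """Fit MultinomialNB with the given alpha and return P(term | class)."""
    model = MultinomialNB(alpha=alpha).fit(X, y)
    return np.exp(model.feature_log_prob_)


probs_small = term_probabilities(0.01)
probs_large = term_probabilities(1.0)

# Term 1 never appears in class 0, so its probability there comes only from smoothing.
print("alpha=0.01:", probs_small[0].round(4))
print("alpha=1.0: ", probs_large[0].round(4))
```

A smaller `alpha` keeps the estimates closer to the raw frequencies, while a larger one pulls them toward a uniform distribution over the vocabulary. The lab's pipeline feeds TF-IDF weights rather than raw counts, but the smoothing term enters in the same way.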

In [3]:
# Step names ('tfidf', 'nb') become parameter prefixes such as 'nb__alpha'.
pipeline = Pipeline([('tfidf', TfidfVectorizer(stop_words="english")),
                     ('nb', MultinomialNB())])

pipeline.fit(Xtrain, ytrain)
y_pred = pipeline.predict(Xtest)

print(classification_report(ytest, y_pred))

              precision    recall  f1-score   support

           0       0.67      0.38      0.48        21
           1       0.79      0.52      0.63        21
           2       0.58      0.69      0.63        26
           3       0.74      0.68      0.71        34
           4       0.72      0.85      0.78        34
           5       0.88      0.81      0.84        26
           6       1.00      0.73      0.84        22
           7       0.70      1.00      0.82        28
           8       0.90      0.82      0.86        33
           9       0.88      0.84      0.86        25
          10       0.82      1.00      0.90        27
          11       0.79      0.95      0.86        20
          12       0.59      0.54      0.57        24
          13       0.75      0.78      0.77        23
          14       0.87      0.71      0.78        28
          15       0.53      0.90      0.67        29
          16       0.50      0.95      0.66        21
          17       0.94    



### Random Search Cross-Validation [(Documentation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html)

### Uniform distribution in `Scipy` [(Documentation)](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.uniform.html)
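
As a small illustration of what `uniform(loc, scale)` samples, the sketch below draws from `uniform(loc=0, scale=1)` and, purely for comparison, from `scipy.stats.loguniform`. The log-uniform distribution is an assumed alternative that this lab does not use; it is shown only because smoothing parameters are often searched on a log scale.

```python
from scipy.stats import loguniform, uniform

# uniform(loc, scale) is supported on [loc, loc + scale]; here that is [0, 1].
flat_draws = uniform(loc=0, scale=1).rvs(size=1000, random_state=0)

# Assumed alternative (not used in this lab): loguniform(a, b) spreads draws
# evenly across orders of magnitude between a and b.
log_draws = loguniform(1e-3, 1).rvs(size=1000, random_state=0)

# Share of draws that fall below 0.1 under each distribution.
flat_share = (flat_draws < 0.1).mean()
log_share = (log_draws < 0.1).mean()
print(f"uniform:    {flat_share:.2f} of draws below 0.1")
print(f"loguniform: {log_share:.2f} of draws below 0.1")
```

With a flat distribution on [0, 1], only about one draw in ten lands below 0.1, so small values of `alpha` are sampled rarely unless `n_iter` is large.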

In [4]:
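# Sample alpha uniformly from [0, 1].
parameters = {'nb__alpha': uniform(loc=0, scale=1)}

# Try 7 sampled values; scoring and cv are left at their defaults
# (the pipeline's accuracy and 5-fold cross-validation).
clf = RandomizedSearchCV(pipeline, parameters, n_iter=7)
clf.fit(Xtrain, ytrain)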

In [5]:
ypred = clf.predict(Xtest)
print(classification_report(ytest, ypred))

              precision    recall  f1-score   support

           0       0.73      0.52      0.61        21
           1       0.70      0.67      0.68        21
           2       0.60      0.58      0.59        26
           3       0.71      0.71      0.71        34
           4       0.88      0.85      0.87        34
           5       0.89      0.62      0.73        26
           6       0.94      0.77      0.85        22
           7       0.77      0.96      0.86        28
           8       0.97      0.85      0.90        33
           9       0.92      0.88      0.90        25
          10       0.84      1.00      0.92        27
          11       0.77      1.00      0.87        20
          12       0.59      0.71      0.64        24
          13       0.83      0.83      0.83        23
          14       0.86      0.86      0.86        28
          15       0.59      0.93      0.72        29
          16       0.51      0.95      0.67        21
          17       0.94    



#### Exercise

1. For the Naive Bayes model, use grid search with 5-fold cross-validation over several values of `alpha` to find the best model.

2. For the best value of `alpha`, compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* What is the model's `f1_macro` score?

3. Repeat Exercises 1 and 2 using **random search** with 5-fold cross-validation over different values of `alpha`. Compute the `f1_macro` score on the test set.
* What value of `alpha` did you obtain?
* Did you get a better `f1_macro` score compared to grid search in Exercise 2?
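
The workflow the exercise asks for can be sketched on a toy problem. Everything below (the two-topic corpus, the `toy_*` names and the particular grid) is hypothetical and only shows the shape of the calls; it is not the exercise solution on the newsgroup data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus standing in for the newsgroup posts: 12 documents per class.
sports = [f"team wins match goal player {i}" for i in range(12)]
space = [f"rocket orbit launch moon probe {i}" for i in range(12)]

# Keep the last two documents of each class as a tiny held-out test set.
toy_train = sports[:10] + space[:10]
toy_ytrain = [0] * 10 + [1] * 10
toy_test = sports[10:] + space[10:]
toy_ytest = [0] * 2 + [1] * 2

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# The "nb__" prefix routes alpha to the pipeline step named "nb".
alpha_grid = {"nb__alpha": [0.01, 0.1, 0.5, 1.0]}
toy_search = GridSearchCV(toy_pipeline, alpha_grid, scoring="f1_macro", cv=5)
toy_search.fit(toy_train, toy_ytrain)

# After fitting, the search object predicts with the best pipeline,
# refit on the full training set.
toy_pred = toy_search.predict(toy_test)
toy_f1 = f1_score(toy_ytest, toy_pred, average="macro")
print("best alpha:", toy_search.best_params_["nb__alpha"])
print("test f1_macro:", toy_f1)
```

Swapping `GridSearchCV` for `RandomizedSearchCV` with a distribution in place of the list gives the random-search variant asked for in Exercise 3.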

In [6]:
# For the Naive Bayes model, use grid search 5-fold cross-validation across different values of alpha to find the best model
params = {'nb__alpha': [0.01, 0.1, 0.2, 0.5, 1]}

gridcv = GridSearchCV(pipeline, params, scoring='f1_macro', cv=5)
gridcv.fit(Xtrain, ytrain)

In [7]:
# For the best value of alpha, compute the f1_macro score on the test set

best_alpha_gridcv = gridcv.best_params_['nb__alpha']
y_pred = gridcv.predict(Xtest)

In [8]:
print(best_alpha_gridcv)

0.01


In [9]:
print(classification_report(ytest, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.57      0.67        21
           1       0.57      0.62      0.59        21
           2       0.62      0.58      0.60        26
           3       0.75      0.71      0.73        34
           4       0.84      0.79      0.82        34
           5       0.89      0.62      0.73        26
           6       0.78      0.82      0.80        22
           7       0.88      1.00      0.93        28
           8       0.97      0.88      0.92        33
           9       0.88      0.92      0.90        25
          10       0.87      0.96      0.91        27
          11       0.83      0.95      0.88        20
          12       0.67      0.75      0.71        24
          13       0.83      0.87      0.85        23
          14       0.81      0.93      0.87        28
          15       0.75      0.93      0.83        29
          16       0.53      0.90      0.67        21
          17       0.94    
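
Since the exercise asks for a single `f1_macro` number, it may be easier to compute it directly than to read it off the printed report. The sketch below uses hypothetical labels and predictions (`y_true`, `y_hat`) to show two equivalent ways of getting it.

```python
from sklearn.metrics import classification_report, f1_score

# Hypothetical labels and predictions for a three-class problem.
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 0, 1, 2, 2, 2]

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = f1_score(y_true, y_hat, average="macro")

# The same number is available from the report when requested as a dict.
report = classification_report(y_true, y_hat, output_dict=True)
macro_f1_from_report = report["macro avg"]["f1-score"]

print(f"f1_macro = {macro_f1:.4f}")
```

In the lab itself the analogous call would be `f1_score(ytest, y_pred, average='macro')`.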

In [10]:
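# Same search space as before, now scored by macro F1 over 50 sampled alphas.
# No random_state is set, so the sampled values differ between runs.
parameters = {'nb__alpha': uniform(loc=0, scale=1)}

clf = RandomizedSearchCV(pipeline, parameters,
                         scoring='f1_macro', n_iter=50, cv=5)
clf.fit(Xtrain, ytrain)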

In [11]:
best_alpha_clf = clf.best_params_['nb__alpha']
y_pred = clf.predict(Xtest)

In [12]:
print(best_alpha_clf)

0.0027496907600084164


In [13]:
print(classification_report(ytest, y_pred))

              precision    recall  f1-score   support

           0       0.80      0.57      0.67        21
           1       0.52      0.52      0.52        21
           2       0.65      0.58      0.61        26
           3       0.66      0.68      0.67        34
           4       0.83      0.74      0.78        34
           5       0.79      0.58      0.67        26
           6       0.78      0.82      0.80        22
           7       0.84      0.96      0.90        28
           8       0.97      0.85      0.90        33
           9       0.85      0.92      0.88        25
          10       0.86      0.93      0.89        27
          11       0.83      0.95      0.88        20
          12       0.62      0.62      0.62        24
          13       0.87      0.87      0.87        23
          14       0.72      0.93      0.81        28
          15       0.75      0.93      0.83        29
          16       0.53      0.90      0.67        21
          17       0.94    

- Grid search: the best `alpha` is 0.01, with `f1_macro` = 0.77.

- Random search: the best `alpha` is about 0.0027, with `f1_macro` = 0.75.

Comparing the two methods, grid search gives a higher `f1_macro` than random search.
In this case, the systematic search over a fixed grid selected an `alpha` that suits the data better than the one random search found, possibly because of the limited number of random draws.
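
Because random search depends on which values happen to be drawn, a comparison like this one is easier to repeat when the search is seeded. The sketch below uses hypothetical Poisson count data and a made-up helper, `seeded_random_search`, to show `random_state` together with `best_score_`, the mean cross-validated score each search uses to pick `alpha`. That score is separate from the test-set `f1_macro` reported above.

```python
import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count data: 3 classes, 20 samples each, 6 features.
rng = np.random.default_rng(0)
class_rates = np.array([[6, 2, 1, 1, 2, 1],
                        [1, 5, 3, 1, 1, 2],
                        [2, 1, 1, 6, 3, 1]])
y_toy = np.repeat([0, 1, 2], 20)
X_toy = rng.poisson(class_rates[y_toy])

grid = GridSearchCV(MultinomialNB(), {"alpha": [0.01, 0.1, 0.5, 1.0]},
                    scoring="f1_macro", cv=5).fit(X_toy, y_toy)


def seeded_random_search(seed):
    """Random search over alpha in [0, 1] with a fixed seed for repeatable draws."""
    search = RandomizedSearchCV(MultinomialNB(), {"alpha": uniform(loc=0, scale=1)},
                                scoring="f1_macro", n_iter=10, cv=5,
                                random_state=seed)
    return search.fit(X_toy, y_toy)


random_a = seeded_random_search(0)
random_b = seeded_random_search(0)

# best_score_ is the mean cross-validated f1_macro that each search maximized.
print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", random_a.best_params_, round(random_a.best_score_, 3))
```

With the same seed, the two random searches sample the same candidate values and therefore agree on the best `alpha`.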