<a href="https://colab.research.google.com/github/quangHieu2109/MachineLearing/blob/main/Lab_7_21130356_NgoQuangHieu.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This lab uses **GridSearchCV** to tune the hyperparameters of an estimator and applies vectorization techniques to the **movie reviews dataset** for a classification task.

*   **Deadline: 23:59, 22/4/2024 (Tuesday lab class) || 29/4/2024 (Thursday lab class)**



# Import libraries

In [1]:
# code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from prettytable import PrettyTable
from sklearn import svm
from sklearn.linear_model import LinearRegression,LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn import datasets
from sklearn.model_selection import GridSearchCV

from google.colab import drive
drive.mount('/content/gdrive')
%cd '/content/gdrive/MyDrive/ML/dataset'


# Task 1. With the **iris** dataset
*  1.1. Apply **GridSearchCV** for **SVM** to find the best hyperparameters using the following param_grid.

```
param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}
```
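
To get a sense of the search size, the grid above can be enumerated with `ParameterGrid`. The sketch below also builds an alternative grid, `split_grid` (a name chosen here for illustration, not part of the lab), that lists `gamma` only for the `rbf` kernel, since `SVC` does not use `gamma` with the `linear` kernel.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'linear']}

# Every combination of C, gamma and kernel is one candidate.
n_full = len(ParameterGrid(param_grid))

# Illustrative alternative: a list of dicts is searched as a union of grids,
# so gamma can be listed for rbf only.
split_grid = [
    {'kernel': ['rbf'], 'C': param_grid['C'], 'gamma': param_grid['gamma']},
    {'kernel': ['linear'], 'C': param_grid['C']},
]
n_split = len(ParameterGrid(split_grid))

print(n_full, n_split)
```

`GridSearchCV` accepts either form as `param_grid`; each candidate is fitted once per cross-validation fold, so fewer candidates means proportionally fewer fits.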




In [2]:
#grid_params
SMVgrid_params = {'C': [0.1, 1, 10, 100, 1000],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf','linear']}
kNNgrid_params = { 'n_neighbors' : [5,7,9,11,13,15],
               'weights' : ['uniform','distance'],
               'metric' : ['minkowski','euclidean','manhattan']}
RFgrid_params = {
    'n_estimators': [25, 50, 100, 150],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9],
    'max_leaf_nodes': [3, 6, 9],
}
LRLRgrid_params= {
    'C': [ 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'max_iter': list(range(100,800,100)),
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

In [3]:
# #code
dataSet = datasets.load_iris()

dataX = dataSet.data
dataY = dataSet.target

# Results table shared by the helpers below
t = PrettyTable(['Algorithms', 'Accurance', 'Precision', 'Recall','F1 measure'])

def _insertTable(t, nameAlg, y_test, y_pred):
    """Append one row of test-set metrics (macro-averaged over classes) to table t."""
    t.add_row([nameAlg,
               metrics.accuracy_score(y_test, y_pred),
               metrics.precision_score(y_test, y_pred, average='macro'),
               metrics.recall_score(y_test, y_pred, average='macro'),
               metrics.f1_score(y_test, y_pred, average='macro')])

def _findBestHeperParam(param_grid, algo):
    """Grid-search algo with 10-fold CV on the global X_train/y_train,
    print the best parameters, and add the test-set scores to the global table t."""
    grid_search = GridSearchCV(
        estimator=algo,
        param_grid=param_grid,
        scoring='accuracy',
        n_jobs=4,
        cv=10,
        refit=True,  # refit the best candidate on the whole training set
        return_train_score=True)
    grid_search.fit(X_train, y_train)
    y_pred = grid_search.predict(X_test)  # uses the refitted best estimator
    print(grid_search.best_params_)
    _insertTable(t, "GridSearchCV - " + algo.__class__.__name__, y_test, y_pred)

# Hold out 30% of the data for the final comparison
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.3, shuffle=True)
svm_l = svm.SVC()
_findBestHeperParam(SMVgrid_params, svm_l)
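
`_findBestHeperParam` sets `return_train_score=True` but only prints `best_params_`. As a minimal sketch of how the per-candidate scores could be inspected, the example below runs a deliberately small grid (chosen here for speed, not the lab's grid) on iris and loads `cv_results_` into a DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

X, y = datasets.load_iris(return_X_y=True)

# Small illustrative grid: 3 values of C x 2 kernels = 6 candidates.
grid = GridSearchCV(svm.SVC(),
                    {'C': [0.1, 1, 10], 'kernel': ['rbf', 'linear']},
                    scoring='accuracy', cv=5, return_train_score=True)
grid.fit(X, y)

# One row per candidate; mean_train_score exists because return_train_score=True.
results = pd.DataFrame(grid.cv_results_)
cols = ['params', 'mean_train_score', 'mean_test_score', 'rank_test_score']
print(results[cols].sort_values('rank_test_score'))

# best_score_ is the mean cross-validated score of the top-ranked candidate.
best_matches = np.isclose(grid.best_score_, results['mean_test_score'].max())
```

A large gap between `mean_train_score` and `mean_test_score` for a candidate is a common sign of overfitting.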


*  1.2. Apply **GridSearchCV** for **kNN** to find the best hyperparameters using the following param_grid.

```
grid_params = { 'n_neighbors' : [5,7,9,11,13,15],
               'weights' : ['uniform','distance'],
               'metric' : ['minkowski','euclidean','manhattan']}
```
where

*  **n_neighbors**: Decide the best k based on the values we have computed earlier.
*  **weights**: Check whether weighting the data points benefits the model. 'uniform' gives every neighbor the same weight, while 'distance' weights neighbors by the inverse of their distance, so nearer points count more than farther ones.
*  **metric**: The distance metric used when calculating similarity.
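
The effect of `weights` can be seen on a toy example. The sketch below uses made-up 1-D data (not from the lab) in which one class-1 point lies very close to the query while two class-0 points lie farther away. It also checks that `'minkowski'` with the default `p=2` gives the same distances as `'euclidean'`, so those two grid values describe the same model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Toy data: distances from the query 1.0 are 0.1, 2.0, 2.5, 9.0 and 10.0.
X = np.array([[1.1], [3.0], [-1.5], [10.0], [11.0]])
y = np.array([1, 0, 0, 1, 1])
query = np.array([[1.0]])

uniform = KNeighborsClassifier(n_neighbors=3, weights='uniform').fit(X, y)
distance = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# uniform: the 3 nearest labels are 1, 0, 0, so the majority vote is class 0.
pred_uniform = uniform.predict(query)[0]
# distance: class 1 gets weight 1/0.1 = 10, class 0 gets 1/2.0 + 1/2.5 = 0.9.
pred_distance = distance.predict(query)[0]

# 'minkowski' with the default p=2 is the Euclidean distance.
d_minkowski, _ = KNeighborsClassifier(n_neighbors=3, metric='minkowski').fit(X, y).kneighbors(query)
d_euclidean, _ = KNeighborsClassifier(n_neighbors=3, metric='euclidean').fit(X, y).kneighbors(query)
same_distances = np.allclose(d_minkowski, d_euclidean)

print(pred_uniform, pred_distance, same_distances)
```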


In [None]:
#code

knn = KNeighborsClassifier()
_findBestHeperParam(kNNgrid_params, knn)


*  1.3. Apply **GridSearchCV** for **Random Forest** to find the best hyperparameters using the following param_grid.

```
param_grid = {
    'n_estimators': [25, 50, 100, 150],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9],
    'max_leaf_nodes': [3, 6, 9],
}
```
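
Two of these parameters interact: a binary tree with at most 3 leaves cannot be deeper than 2 levels, so when `max_leaf_nodes=3` none of the listed `max_depth` values (3, 6, 9) can be the binding limit. The sketch below counts the grid's candidates and checks this on iris; the `random_state` and the reduced `n_estimators` are assumptions made only to keep the example small and repeatable.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [25, 50, 100, 150],
    'max_features': ['sqrt', 'log2', None],
    'max_depth': [3, 6, 9],
    'max_leaf_nodes': [3, 6, 9],
}
n_candidates = len(ParameterGrid(param_grid))  # 4 * 3 * 3 * 3

X, y = datasets.load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=25, max_depth=9, max_leaf_nodes=3,
                            random_state=0).fit(X, y)

# Inspect the fitted trees: the leaf limit also caps the depth.
largest_leaf_count = max(tree.get_n_leaves() for tree in rf.estimators_)
largest_depth = max(tree.get_depth() for tree in rf.estimators_)

print(n_candidates, largest_leaf_count, largest_depth)
```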

In [None]:
#code

rf = RandomForestClassifier()
_findBestHeperParam(RFgrid_params, rf)


*   1.4. Compare the best results obtained in 1.1 to 1.3 (use PrettyTable to display the results)

In [None]:
print(t)

+---------------------------------------+--------------------+--------------------+--------------------+-------------------+
|               Algorithms              |     Accurance      |     Precision      |       Recall       |     F1 measure    |
+---------------------------------------+--------------------+--------------------+--------------------+-------------------+
|           GridSearchCV - SVC          |        1.0         |        1.0         |        1.0         |        1.0        |
|  GridSearchCV - KNeighborsClassifier  |        1.0         |        1.0         |        1.0         |        1.0        |
| GridSearchCV - RandomForestClassifier | 0.9777777777777777 | 0.9722222222222222 | 0.9814814814814815 | 0.975983436853002 |
+---------------------------------------+--------------------+--------------------+--------------------+-------------------+
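
The precision, recall, and F1 columns are macro averages, because `_insertTable` passes `average='macro'`: each metric is computed per class and the per-class values are averaged with equal weight. A small sketch with made-up labels (not the lab's predictions):

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels for a 3-class problem; one class-1 sample is predicted as class 2.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

# average=None returns one precision value per class.
per_class = metrics.precision_score(y_true, y_pred, average=None)
# average='macro' is the unweighted mean of those values.
macro = metrics.precision_score(y_true, y_pred, average='macro')

print(per_class, macro)
```

Here class 2 has precision 2/3 because one of its three predictions is wrong, and that pulls the macro average below 1 even though classes 0 and 1 have perfect precision.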


# Task 2.
Use the breast cancer dataset (https://tinyurl.com/3vme8hr3), which can be loaded from `sklearn.datasets` as follows:

```
#Import scikit-learn dataset library
from sklearn import datasets

#Load dataset
cancer = datasets.load_breast_cancer()
```

*   Apply **GridSearchCV** to different classification algorithms such as **SVM, kNN, LogisticRegression, RandomForest**.
*   Compare the results obtained by the best hyperparameters among classification algorithms.

*   2.1. Apply **GridSearchCV** to **SVM**


In [None]:
# code
cancer = datasets.load_breast_cancer()
t = PrettyTable(['Algorithms', 'Accurance', 'Precision', 'Recall','F1 measure'])
dataX = cancer.data
dataY = cancer.target
X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.3, shuffle=True)
svm_l = svm.SVC()
_findBestHeperParam(SMVgrid_params, svm_l)


{'C': 100, 'gamma': 1, 'kernel': 'linear'}
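
The search above runs on the raw features. SVM and kNN depend on distances between samples, and the breast cancer features have very different ranges, so one common option (an assumption here, not a requirement of the lab) is to standardize the features inside a `Pipeline`, which fits the scaler only on the training part of each fold. The grid below is a reduced, illustrative one; the `svc__` prefix is the step name that `make_pipeline` derives from the class name.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# The scaler is refit inside every CV fold, so validation folds never leak into it.
pipe = make_pipeline(StandardScaler(), SVC())
small_grid = {'svc__C': [0.1, 1, 10],
              'svc__gamma': [0.1, 0.01, 0.001],
              'svc__kernel': ['rbf']}

search = GridSearchCV(pipe, small_grid, scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print(search.best_params_, test_accuracy)
```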


*   2.2. Apply **GridSearchCV** to **kNN**

In [None]:
#code
knn = KNeighborsClassifier()
_findBestHeperParam(kNNgrid_params, knn)

{'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'uniform'}


*   2.3. Apply **GridSearchCV** to **LogisticRegression**

In [None]:
#code

logistic_regression = LogisticRegression()
_findBestHeperParam(LRLRgrid_params, logistic_regression)

1470 fits failed out of a total of 4900.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
490 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

-----------------------------

{'C': 1000, 'max_iter': 200, 'penalty': 'l1', 'solver': 'liblinear'}


*   2.4. Apply **GridSearchCV** to **RandomForest**

In [None]:
#code
rf = RandomForestClassifier()
_findBestHeperParam(RFgrid_params, rf)


{'max_depth': 9, 'max_features': 'sqrt', 'max_leaf_nodes': 9, 'n_estimators': 100}


*   2.5. Compare the best results obtained among the classification algorithms (use PrettyTable to display the results)

In [None]:
#code
print(t)

+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
|               Algorithms              |     Accurance      |     Precision      |       Recall       |     F1 measure     |
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
|           GridSearchCV - SVC          | 0.9590643274853801 | 0.963691159586682  | 0.9532828282828283 | 0.9575787645745473 |
|  GridSearchCV - KNeighborsClassifier  | 0.935672514619883  | 0.939150401836969  | 0.9292929292929293 | 0.9333380586171456 |
|   GridSearchCV - LogisticRegression   | 0.9473684210526315 | 0.9453452613922281 | 0.946969696969697  | 0.9461228776474707 |
| GridSearchCV - RandomForestClassifier | 0.9590643274853801 | 0.963691159586682  | 0.9532828282828283 | 0.9575787645745473 |
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+

# Task 3. With the **mobile price classification** dataset
* 3.1. Apply **GridSearchCV** to the **SVM, kNN, RandomForest** algorithms to find the best hyperparameters for each classification algorithm.
* 3.2. Compare the best results obtained among the classification algorithms (use PrettyTable to display the results)

In [None]:
#code
datasetTrain = pd.read_csv("mobile_train.csv")
datasetTest = pd.read_csv("mobile_test.csv")
dataX = datasetTrain.drop("price_range",axis=1)
dataY = datasetTrain.get("price_range")

X_train, X_test, y_train, y_test = train_test_split(dataX, dataY, test_size=0.3, shuffle=True)

t = PrettyTable(['Algorithms', 'Accurance', 'Precision', 'Recall','F1 measure'])

svm_l = svm.SVC()
_findBestHeperParam(SMVgrid_params, svm_l)

knn = KNeighborsClassifier()
_findBestHeperParam(kNNgrid_params, knn)

rf = RandomForestClassifier()
_findBestHeperParam(RFgrid_params, rf)

print(t)

{'C': 1, 'gamma': 1, 'kernel': 'linear'}
{'metric': 'manhattan', 'n_neighbors': 13, 'weights': 'distance'}
{'max_depth': 9, 'max_features': 'log2', 'max_leaf_nodes': 9, 'n_estimators': 150}
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
|               Algorithms              |     Accurance      |     Precision      |       Recall       |     F1 measure     |
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
|           GridSearchCV - SVC          | 0.9683333333333334 | 0.9693287169291009 | 0.968074931277976  | 0.9685993189012225 |
|  GridSearchCV - KNeighborsClassifier  | 0.9116666666666666 | 0.9113564456456332 | 0.9127788373463922 | 0.9116923908220079 |
| GridSearchCV - RandomForestClassifier |        0.81        | 0.8130478927203064 | 0.8171176135890262 | 0.8084362495118457 |
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+

# Task 4.
The dataset consists of **2000 user-created movie reviews** archived on IMDb (the Internet Movie Database). The reviews are split evenly into a positive set and a negative set (1000 + 1000). Each review is a plain-text file (.txt) with a class label representing the overall user opinion.
The class attribute has only two values: **pos** (positive) or **neg** (negative).


*   4.1 Importing additional libraries

In [6]:
import nltk, random
nltk.download('movie_reviews')  # download the movie reviews dataset
from nltk.corpus import movie_reviews
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import cross_val_score
from collections import Counter
from sklearn.model_selection import train_test_split

[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


*   4.2. Movie reviews information

In [None]:
#code
print(len(movie_reviews.fileids()))
print(movie_reviews.categories())
print(movie_reviews.words()[:100])
print(movie_reviews.fileids()[:10])

*   4.3. Create dataset from movie reviews

In [7]:
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.seed(123)
random.shuffle(documents)

In [None]:
print('Number of Reviews/Documents: {}'.format(len(documents)))
print('Corpus Size (words): {}'.format(np.sum([len(d) for (d,l) in documents])))
print('Sample Text of Doc 1:')
print('-'*30)
print(' '.join(documents[0][0][:50])) # first 50 words of the first document

In [8]:
sentiment_distr = Counter([label for (words, label) in documents])
print(sentiment_distr)

Counter({'pos': 1000, 'neg': 1000})


*   4.4. Train test split

In [9]:
train, test = train_test_split(documents, test_size = 0.33, random_state=42)

In [None]:
## Sentiment Distribution for Train and Test
print(Counter([label for (words, label) in train]))
print(Counter([label for (words, label) in test]))

In [10]:
X_train = [' '.join(words) for (words, label) in train]
X_test = [' '.join(words) for (words, label) in test]
y_train = [label for (words, label) in train]
y_test = [label for (words, label) in test]

In [None]:
X_train


array([[0.11906061, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.14474233, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.16450792, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.1560616 , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.14605465, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.16944372, 0.        , 0.        , ..., 0.        , 0.1546671 ,
        0.        ]])

*   4.5. Text Vectorization

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tfidf_vec = TfidfVectorizer(min_df = 10, token_pattern = r'[a-zA-Z]+')
X_train = tfidf_vec.fit_transform(X_train).toarray() # fit train
X_test = tfidf_vec.transform(X_test).toarray() # transform test





*   4.6. Apply **SVM** with **GridSearchCV**

In [13]:
#code
t = PrettyTable(['Algorithms', 'Accurance', 'Precision', 'Recall','F1 measure'])

svm_l = svm.SVC()
_findBestHeperParam(SMVgrid_params, svm_l)




{'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}


*   4.7. Apply **RandomForest** with **GridSearchCV**

In [None]:
#code
rf = RandomForestClassifier()
_findBestHeperParam(RFgrid_params, rf)

{'max_depth': 9, 'max_features': 'sqrt', 'max_leaf_nodes': 6, 'n_estimators': 150}


*   4.8. Apply **kNN** with **GridSearchCV**

In [None]:
#code
knn = KNeighborsClassifier()
_findBestHeperParam(kNNgrid_params, knn)



{'metric': 'minkowski', 'n_neighbors': 15, 'weights': 'distance'}


*   4.9. Apply **LogisticRegression** with **GridSearchCV**

In [12]:
#code
logistic_regression = LogisticRegression()
_findBestHeperParam(LRLRgrid_params, logistic_regression)

840 fits failed out of a total of 2800.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
280 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.10/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver newton-cg supports only 'l2' or 'none' penalties, got l1 penalty.

------------------------------

{'C': 10, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'}
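
The 840 failed fits reported above correspond to the `'l1'` candidates for `newton-cg`, `lbfgs`, and `sag`, which accept only `'l2'` (or no penalty): 84 candidates times 10 folds. One way to avoid them, sketched here under the assumption that the same `C` and `max_iter` values are wanted, is to pass a list of dicts that pairs each solver only with penalties it supports. `valid_grid` is a name introduced for this example.

```python
from sklearn.model_selection import ParameterGrid

C_values = [0.1, 1, 10, 100]
iters = list(range(100, 800, 100))

# The grid used in the notebook: every penalty crossed with every solver.
full_grid = {'C': C_values, 'penalty': ['l1', 'l2'], 'max_iter': iters,
             'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

# Illustrative alternative: only solver/penalty pairs that LogisticRegression accepts.
valid_grid = [
    {'C': C_values, 'max_iter': iters, 'penalty': ['l2'],
     'solver': ['newton-cg', 'lbfgs', 'sag']},
    {'C': C_values, 'max_iter': iters, 'penalty': ['l1', 'l2'],
     'solver': ['liblinear', 'saga']},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
n_invalid = n_full - n_valid

print(n_full, n_valid, n_invalid)
```

Passing `valid_grid` to `GridSearchCV` in place of the single dict would search the same valid combinations without spending fits on candidates that can only raise a `ValueError`.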


*   4.10. Compare the best results obtained among the classification algorithms (use PrettyTable to display the results)

In [18]:
t = PrettyTable(['Algorithms', 'Accurance', 'Precision', 'Recall','F1 measure'])

svm_l = svm.SVC(C= 10, gamma= 0.1, kernel= 'rbf')
rf = RandomForestClassifier(max_depth= 9, max_features= 'sqrt', max_leaf_nodes= 6, n_estimators= 150)
knn = KNeighborsClassifier(metric='minkowski', n_neighbors=15, weights='distance')
logistic_regression = LogisticRegression(C= 10, max_iter= 100, penalty='l2', solver='liblinear')
algos=[svm_l, rf, knn, logistic_regression]
for algo in algos:
  algo.fit(X_train, y_train)
  y_pred = algo.predict(X_test)
  _insertTable(t, "GridSearchCV - "+algo.__class__.__name__, y_test, y_pred )
print(t)


+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
|               Algorithms              |     Accurance      |     Precision      |       Recall       |     F1 measure     |
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
|           GridSearchCV - SVC          | 0.8121212121212121 | 0.8122242647058824 | 0.8119833951728446 | 0.8120366372380594 |
| GridSearchCV - RandomForestClassifier | 0.7848484848484848 | 0.7848689771766695 | 0.7847433966422982 | 0.7847773368606701 |
|  GridSearchCV - KNeighborsClassifier  | 0.6348484848484849 | 0.677148782264137  | 0.6318008155468204 | 0.6085727152591675 |
|   GridSearchCV - LogisticRegression   | 0.8242424242424242 | 0.8245694730644686 | 0.8240329157635649 | 0.8241115981584098 |
+---------------------------------------+--------------------+--------------------+--------------------+--------------------+
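
The cell above rebuilds each model from hand-copied best parameters. With `refit=True`, `GridSearchCV` already keeps the best model refitted on the full training set as `best_estimator_`, so copying the parameters is optional. A minimal sketch on iris with a small illustrative grid (not the lab's grid):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = datasets.load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

search = GridSearchCV(svm.SVC(), {'C': [1, 10], 'kernel': ['rbf', 'linear']},
                      cv=5, refit=True)
search.fit(X_tr, y_tr)

# best_estimator_ is an SVC already fitted with best_params_ on all of X_tr.
best_model = search.best_estimator_
same_predictions = np.array_equal(search.predict(X_te), best_model.predict(X_te))
same_C = best_model.get_params()['C'] == search.best_params_['C']

print(search.best_params_, same_predictions)
```

Keeping each fitted search object (or its `best_estimator_`) in a list would let the final comparison reuse the tuned models directly.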

# Finally,
Save a copy to your GitHub. Remember to rename the notebook.