<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: <i>News or "News"</i> - Distinguishing facts from misinformation on Reddit

### Notebook #5 for Hyperparameter Tuning

Hyperparameter tuning is the process of finding the combination of hyperparameter values that maximizes the model's performance on a given task.

Different hyperparameter values can significantly impact the model's behavior, leading to underfitting (high bias, oversimplified model) or overfitting (high variance, model too complex for the data). Appropriate hyperparameter settings can help strike the right balance between bias and variance, resulting in better generalization performance on unseen data.
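
As a rough illustration of this trade-off, the sketch below varies one hyperparameter (`C`, the inverse regularisation strength of logistic regression) and compares mean training and validation accuracy. The synthetic data from `make_classification` and the chosen `C` values are assumptions for illustration only; they are not the Reddit data used in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Synthetic stand-in data (an assumption; not the project's comments)
X_demo, y_demo = make_classification(n_samples=400, n_features=20,
                                     n_informative=5, random_state=42)

# Small C = strong regularisation (simpler model); large C = weak regularisation
c_values = [0.001, 0.1, 10.0, 1000.0]
train_scores, val_scores = validation_curve(
    LogisticRegression(solver='liblinear'),
    X_demo, y_demo,
    param_name='C', param_range=c_values,
    cv=5, scoring='accuracy')

# Each row holds the scores of one C value across the 5 folds
for c, tr, va in zip(c_values, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"C={c:<8} train={tr:.3f}  validation={va:.3f}")
```

A large gap between training and validation scores would suggest overfitting, while low scores on both would suggest underfitting.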


---
## Import libraries and data
The cleaned CSV of comments scraped from the two subreddits (r/news and r/TheOnion) is read into a data frame.

In [1]:
import pandas as pd
import numpy as np
import re

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix

from datetime import datetime

In [2]:
# read cleaned csv file into a data frame
comments_df = pd.read_csv('../data/03_data_post_EDA.csv')

---
## Prepare data for pipeline model selection 
A pipeline is a way to chain multiple estimators (preprocessing steps and the final estimator) into a sequence of steps. This allows the entire sequence to be treated as a single unit, making it easier to apply the same preprocessing steps consistently for both training and prediction.
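
A minimal sketch of this idea, using a made-up four-sentence corpus rather than the project data: the vectorizer and the classifier are fitted with a single call, and the same fitted vocabulary is reused automatically at prediction time.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration): 1 = satire-like, 0 = news-like
toy_text = ["area man wins argument with mirror",
            "senate passes budget bill after long debate",
            "local cat elected mayor of small town again",
            "court rules on appeal in fraud case"]
toy_label = [1, 0, 1, 0]

toy_pipe = Pipeline([
    ('cvec', CountVectorizer()),                     # transformer
    ('lr', LogisticRegression(solver='liblinear')),  # final estimator
])

# fit() learns the vocabulary and the model in one call
toy_pipe.fit(toy_text, toy_label)

# predict() reuses the fitted vocabulary; raw strings go straight in
pred = toy_pipe.predict(["man argues with cat"])
print(pred)
```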

#### Comments from r/TheOnion (misinformation) are labelled `1` while comments from r/news (fact) are labelled `0`.

In [3]:
comments_df = comments_df.dropna()

In [4]:
comments_df['body_cleaned_lemmatized'].isna().sum()

0

In [5]:
comments_df

Unnamed: 0,comment_id,parent_id,post_id,body,score,post_title,subreddit,body_cleaned,body_cleaned_lemmatized,comment_length,sentiment_score,emotional_tone
0,gbgsbi4,t3_jptqj9,t3_jptqj9,"As you all celebrate or commiserate, please he...",1,Joe Biden elected president of the United States,news,celebrate commiserate please help us reporting...,celebrate commiserate please help reporting co...,138,0.150000,Neutral
1,gbhfdv2,t3_jptqj9,t3_jptqj9,"Congratulations USA! From Brazil, I hope Bolso...",172,Joe Biden elected president of the United States,news,Congratulations USA Brazil hope Bolsonaro next...,congratulation usa brazil hope bolsonaro next ...,12,0.000000,Neutral
2,gbgt3me,t3_jptqj9,t3_jptqj9,"Fox News just called it a couple minutes ago, ...",8749,Joe Biden elected president of the United States,news,Fox News called couple minutes ago know real,fox news call couple minute ago know real,14,0.200000,Neutral
3,gbgvvs7,t3_jptqj9,t3_jptqj9,"""You were expecting Nevada to decide the elect...",16552,Joe Biden elected president of the United States,news,You expecting Nevada decide election ME PENNSY...,you expect nevada decide election me pennsylvania,13,0.000000,Neutral
4,gbgr8hw,t3_jptqj9,t3_jptqj9,"Is it 100% confirmed, as in nothing can take t...",3176,Joe Biden elected president of the United States,news,100 confirmed nothing take away,100 confirm nothing take away,11,0.400000,Positive
...,...,...,...,...,...,...,...,...,...,...,...,...
37774,fsm23z4,t1_fslramj,t3_gus9vi,What episode is this?,2,"‘Let Them Have Eric,’ Screams Trump While Push...",TheOnion,episode this,episode this,4,0.000000,Neutral
37775,fsml44n,t1_fslci54,t3_gus9vi,Hahaha!,3,"‘Let Them Have Eric,’ Screams Trump While Push...",TheOnion,Hahaha,hahaha,1,0.250000,Neutral
37776,fsmp2et,t1_fsmfqvz,t3_gus9vi,[Skip to 32:00](https://youtu.be/Wm0EaYEj9AU?t...,6,"‘Let Them Have Eric,’ Screams Trump While Push...",TheOnion,Skip 3200 want get straight it miss tear gassi...,skip 3200 want get straight miss tear gas expl...,168,0.084375,Neutral
37777,fsm7crj,t1_fsm23z4,t3_gus9vi,"[""The fire""](https://youtu.be/_u1cbZTwBx4)",6,"‘Let Them Have Eric,’ Screams Trump While Push...",TheOnion,The fire,the fire,2,0.000000,Neutral


In [6]:
## binary label:  1 - misinformation / TheOnion, 0 - fact / news
comments_df['label'] = comments_df['subreddit'].map({'TheOnion':1, 'news':0})
comments_df[['label','subreddit']]

Unnamed: 0,label,subreddit
0,0,news
1,0,news
2,0,news
3,0,news
4,0,news
...,...,...
37774,1,TheOnion
37775,1,TheOnion
37776,1,TheOnion
37777,1,TheOnion


#### Create the training and testing data

In [7]:
# set up X and y
X = comments_df['body_cleaned_lemmatized']
y = comments_df['label']

y.value_counts(normalize=True)

label
0    0.528881
1    0.471119
Name: proportion, dtype: float64

`stratify=y` preserves the class distribution of `y` across the train/test split. The splitting algorithm keeps the class proportions in both the training and test sets as close as possible to those of the full dataset, so that each set is representative of it.

In [8]:
# split our data into training and testing sets.
# set a custom test size of 20% for model evaluation. 
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)
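
One way to check the effect of `stratify` is to compare class proportions before and after the split. The sketch below uses hypothetical labels with roughly the 53/47 balance reported above, not the actual comments.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the 53/47 class balance reported above
y_demo = pd.Series([0] * 53 + [1] * 47, name='label')
X_demo = pd.Series([f"comment {i}" for i in range(len(y_demo))])

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42)

# Side-by-side class proportions for the full, training and test sets
balance = pd.DataFrame({
    'full': y_demo.value_counts(normalize=True),
    'train': y_tr.value_counts(normalize=True),
    'test': y_te.value_counts(normalize=True),
})
print(balance.round(3))
```

With real data the same comparison can be run on `y`, `y_train` and `y_test`.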

---
## Find the best fit
`GridSearchCV` automates hyperparameter tuning for a machine learning model. Given a grid of hyperparameter values, it trains the model on every combination using cross-validation and evaluates each one with the specified metric (e.g., accuracy, sensitivity, F1-score).

A pipeline chains the transformers or pre-processing steps (e.g. `StandardScaler`, `CountVectorizer`) together with the estimator or model (e.g. kNN, `LogisticRegression`, Naive Bayes).
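
The sketch below runs a very small grid search over a pipeline to show the moving parts. The eight-sentence corpus, the two-value grids and `cv=2` are assumptions chosen to keep the example fast; the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus (made up for illustration): 1 = satire-like, 0 = news-like
toy_text = ["area man wins argument with mirror",
            "local cat elected mayor of small town",
            "nation agrees to nap until further notice",
            "man declares couch sovereign nation",
            "senate passes budget bill after debate",
            "court rules on appeal in fraud case",
            "city council approves new transit budget",
            "police report details of downtown arrest"]
toy_label = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([('cvec', CountVectorizer()), ('nb', MultinomialNB())])

# 2 x 2 = 4 combinations, each evaluated on 2 cross-validation folds
toy_grid = {
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'nb__alpha': [0.1, 1.0],
}

toy_gs = GridSearchCV(toy_pipe, param_grid=toy_grid, scoring='recall', cv=2)
toy_gs.fit(toy_text, toy_label)

print(toy_gs.best_params_)                   # best combination found
print(round(toy_gs.best_score_, 3))          # its mean cross-validated recall
print(len(toy_gs.cv_results_['params']))     # one entry per combination
```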

### Step 1a: Instantiate pipeline and `GridSearchCV`
1. Define the pipeline by instantiating the `Pipeline()` class. Its argument is the ordered list of named "tasks" (e.g. vectorizing, scaling, modelling) to be carried out: the transformers first, followed by the estimator.
2. Define the hyperparameters in a dictionary. These are the candidate values that `GridSearchCV` will try.
   - For each hyperparameter, use the format `"<task_name>__<parameter_name>" : [ <param1>, <param2>, <param3>,...]`, with a double underscore between the task name and the parameter name.
3. Instantiate a `GridSearchCV` for the pipeline (defined in #1).
   - First argument is the pipeline object
   - `param_grid` is the hyperparameter dictionary defined in #2
   - `scoring='recall'` since we are trying to optimise sensitivity
   - `cv=5` for 5-fold cross-validation
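
To check which keys are valid for step 2, the parameter names of a pipeline can be listed with `get_params()`. This is a small sketch on a stand-in pipeline (`demo_pipe` is a name introduced here, not one of the pipelines defined below).

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

demo_pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('lr', LogisticRegression(solver='liblinear')),
])

# Every tunable name is '<step name>__<parameter name>' (two underscores)
tunable = sorted(k for k in demo_pipe.get_params() if '__' in k)
print(tunable[:5])

# A key with a single underscore is not a valid pipeline parameter
print('cvec__min_df' in tunable, 'cvec_min_df' in tunable)  # True False
```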

In [9]:
from sklearn.feature_selection import chi2, SelectKBest

In [10]:
# Pipeline #1:
# 1. CountVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)
pipe_cv_mnb = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

# define dictionary of hyperparameters spanning both transformers and estimators
pipe_cv_mnb_param = {
    'cvec__max_features': [1000,],
    'cvec__stop_words': [None,'english'],
    'cvec__min_df': [2, 5, 10],
    'cvec__max_df': [0.5, 0.7, 0.9],
    'cvec__ngram_range': [(1,1),(1,2)],
    'nb__alpha': [0.1, 0.5, 1.0]
}

# instantiate our GridSearchCV object
gs_cv_mnb = GridSearchCV(pipe_cv_mnb, 
                        param_grid=pipe_cv_mnb_param,
                        scoring='recall', 
                        cv=5)

# Pipeline #2:
# 1. CountVectorizer (transformer)
# 2. SelectKBest with chi2, keeping the top 1000 features (transformer)
# 3. Logistic Regression (estimator)
pipe_cv_lr = Pipeline([
    ('cvec', CountVectorizer()),
    ('feature_selection', SelectKBest(chi2, k=1000)),
    ('lr', LogisticRegression(solver='liblinear', max_iter=2000, random_state=42))
])

# define dictionary of hyperparameters spanning both transformers and estimators
pipe_cv_lr_param = {
    'cvec__max_features': [2000,7500],
    'cvec__stop_words': [None,'english'],
    'cvec__min_df': [1,10],
    'cvec__max_df': [0.1,0.9],
    'cvec__ngram_range': [(1,1),(1,2)],
    'lr__C': [0.01, 1.0, 10],
    'lr__penalty' : ['l1', 'l2'],
    'lr__class_weight': ['balanced']
}

# instantiate our GridSearchCV object
gs_cv_lr = GridSearchCV(pipe_cv_lr, 
                        param_grid=pipe_cv_lr_param, 
                        scoring='recall',
                        cv=5)

# Pipeline #3:
# 1. TF-IDF Vectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)
pipe_tv_mnb = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

# define dictionary of hyperparameters spanning both transformers and estimators
pipe_tv_mnb_param = {
    'tvec__max_features': [1000, 5000, 10000],
    'tvec__stop_words': [None,'english'],
    'tvec__min_df': [2, 5, 10],
    'tvec__max_df': [0.1, 0.9],
    'tvec__ngram_range': [(1,1),(1,2)],
    'nb__alpha': [0.1, 0.5, 1.0]
}

# instantiate our GridSearchCV object
gs_tv_mnb = GridSearchCV(pipe_tv_mnb, 
                        param_grid=pipe_tv_mnb_param, 
                        scoring='recall',
                        cv=5)

# Pipeline #4:
# 1. TF-IDF Vectorizer (transformer)
# 2. Logistic Regression (estimator)
pipe_tv_lr = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('lr', LogisticRegression(solver='liblinear', max_iter=2000, random_state=42))
])

# define dictionary of hyperparameters spanning both transformers and estimators
pipe_tv_lr_param = {
    'tvec__max_features': [1000, 5000, 10000],
    'tvec__stop_words': [None,'english'],
    'tvec__min_df': [2, 5, 10],
    'tvec__max_df': [0.5, 0.7, 0.9],
    'tvec__ngram_range': [(1,1), (1,2)],
    'lr__C': [0.01, 1.0, 10],
    'lr__penalty' : ['l1', 'l2']
}

# instantiate our GridSearchCV object
gs_tv_lr = GridSearchCV(pipe_tv_lr, 
                        param_grid=pipe_tv_lr_param, 
                        scoring='recall',
                        cv=5)
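
Before executing a grid search it can help to estimate its cost. The sketch below counts the combinations in a copy of the Pipeline #2 grid with `ParameterGrid` and multiplies by the number of folds; the variable names are introduced here for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the grid used for Pipeline #2 above
lr_grid = {
    'cvec__max_features': [2000, 7500],
    'cvec__stop_words': [None, 'english'],
    'cvec__min_df': [1, 10],
    'cvec__max_df': [0.1, 0.9],
    'cvec__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [0.01, 1.0, 10],
    'lr__penalty': ['l1', 'l2'],
    'lr__class_weight': ['balanced'],
}

n_combinations = len(ParameterGrid(lr_grid))
n_fits = n_combinations * 5  # cv=5 folds per combination
print(n_combinations, n_fits)  # 192 960
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best combination on the full training set.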

### Step 1b: Add the newly configured `GridSearchCV` into the `grids` dictionary
- For each `GridSearchCV` object, assign an `id` (running number), which is the key in the dictionary.
- For each `id` key, the corresponding value is a dictionary.
    - `grid_search_obj` holds the `GridSearchCV` object.
    - `grid_search_desc` is a description of the grid search, for user-friendly display in the report.

In [11]:
grids = {
    1 : {
        "grid_search_obj": gs_cv_mnb,
        "grid_search_desc" : "MultinomialNB with CountVectorizer",
    },

    2 : {
        "grid_search_obj": gs_cv_lr,
        "grid_search_desc" : "LogisticRegression with CountVectorizer",
    },

    3 : {
        "grid_search_obj": gs_tv_mnb,
        "grid_search_desc" : "MultinomialNB with TfidfVectorizer",
    },

    4 : {
        "grid_search_obj": gs_tv_lr,
        "grid_search_desc" : "LogisticRegression with TfidfVectorizer",
    },
}


### Step 2: Execute the GridSearch
In the `grid_execute` list, specify which GridSearch `id`s are to be executed.
- Each `id` in the list must be defined in the `grids` dictionary; the corresponding grid searches are executed in order.
- If the keyword "**ALL**" is provided as the only element, every `id` in the `grids` dictionary is pulled, which means all the defined grid searches are executed in one go.
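
The selection rule can be sketched as a small function. `resolve_grid_ids` and `demo_grids` are hypothetical names introduced for illustration; the notebook itself performs this step inline in the execution cell.

```python
def resolve_grid_ids(grid_execute, grids):
    """Return the grid configurations selected by `grid_execute`.

    Hypothetical helper, not part of the notebook: ["ALL"] selects every
    entry in `grids`; otherwise each element must be a key of `grids`.
    """
    if grid_execute == ["ALL"]:
        return list(grids.values())
    unknown = [g for g in grid_execute if g not in grids]
    if unknown:
        raise KeyError(f"Unknown grid id(s): {unknown}")
    return [grids[g] for g in grid_execute]


# Stand-in dictionary with descriptions only (no GridSearchCV objects)
demo_grids = {
    1: {"grid_search_desc": "MultinomialNB with CountVectorizer"},
    2: {"grid_search_desc": "LogisticRegression with CountVectorizer"},
}

selected = resolve_grid_ids([2], demo_grids)
print([cfg["grid_search_desc"] for cfg in selected])
print(len(resolve_grid_ids(["ALL"], demo_grids)))
```

Raising an error for an unknown `id` is a design choice of this sketch; it surfaces a typo in `grid_execute` before any long-running fit starts.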

In the interest of time, and with reference to the `04_Data_Modelling` notebook, since `LogisticRegression` with `CountVectorizer` provided the best balance of accuracy, sensitivity and F1 score, we will execute `GridSearchCV` only for `LogisticRegression with CountVectorizer`.

| Model | Accuracy | Precision | Sensitivity | F1 Score |
| --- | --- | --- | --- | --- |
| `LogisticRegression with CountVectorizer` | 0.768 | 0.742 | 0.780 | 0.760 |

Note: `gs_result` is the dictionary that holds the grid search outcome for every execution.

In [12]:
### Specify the GridSearch `id`s to be executed, in the `grid_execute` list 
#   ALWAYS IN A LIST, even if only a single `id` is given.             
#   Keyword "ALL" will auto-populate all the `id`s from the `grids` dictionary into the list                 
#################################
grid_execute = [2, ] # only execute GridSearch for LogisticRegression with CVec
# grid_execute = ["ALL",]
#################################

# initialise variables, do not modify # 
best_acc = 0.0
best_idx = 0
best_gs = ''
gs_result = {}
gs_execute = []

# pull out list of grid `id` for execution 
for ge in grid_execute:
	if len(grid_execute) == 1 and ge == "ALL":
		for i_, grid_cfg in grids.items():
			gs_execute.append(grid_cfg)
	else:
		gs_execute.append(grids[ge])

# GridSearch execution based on the defined list of `id`
for idx, ge in enumerate(gs_execute):
	gs = ge['grid_search_obj']
	gs_desc = ge['grid_search_desc']
	print('\n>>>>Estimator: %s' % gs_desc)	
	# Fit grid search	
	gs.fit(X_train, y_train)
	# Best params
	print('Best params: %s' % gs.best_params_)
	# Best mean cross-validated score on the training data (recall, since scoring='recall')
	print('Best mean CV recall: %.3f' % gs.best_score_)

	gs_result[gs_desc] = {}
	gs_result[gs_desc]['best_score'] = gs.best_score_
	gs_result[gs_desc]['best_param'] = gs.best_params_
	gs_result[gs_desc]['train_score'] = gs.score(X_train, y_train)
	gs_result[gs_desc]['test_score'] = gs.score(X_test, y_test)
	
	# Predict on test data with best params
	y_pred = gs.predict(X_test)
	gs_result[gs_desc]['accuracy_score'] = accuracy_score(y_test, y_pred)

	tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
	gs_result[gs_desc]['m_tn'] = tn
	gs_result[gs_desc]['m_fp'] = fp
	gs_result[gs_desc]['m_fn'] = fn
	gs_result[gs_desc]['m_tp'] = tp
	gs_result[gs_desc]['m_sensitivity'] = tp / (tp+fn)
	gs_result[gs_desc]['m_precision'] = tp / (tp+fp)
	gs_result[gs_desc]['m_specificity'] = tn / (tn +fp)
	gs_result[gs_desc]['m_negative_predictive_value'] = tn / (tn +fn)
	gs_result[gs_desc]['m_accuracy'] = (tp +tn) / (tp +tn+fp+fn)
	gs_result[gs_desc]['m_f1_score'] = 2*tp / (2*tp + fp + fn)

	if gs.best_score_ > best_acc:
		best_acc = gs.best_score_
		best_gs = gs_desc
	

	print(gs_result[gs_desc])

print(f"\n\n ****** RESULT ********* \n\n Best performing model - {best_gs}, with score {best_acc}")
	


>>>>Estimator: %s LogisticRegression with CountVectorizer
Best params: {'cvec__max_df': 0.1, 'cvec__max_features': 7500, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'lr__C': 1.0, 'lr__class_weight': 'balanced', 'lr__penalty': 'l1'}
Best training accuracy: 0.837
{'best_score': 0.8371867811687217, 'best_param': {'cvec__max_df': 0.1, 'cvec__max_features': 7500, 'cvec__min_df': 1, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': 'english', 'lr__C': 1.0, 'lr__class_weight': 'balanced', 'lr__penalty': 'l1'}, 'train_score': 0.8565807040966903, 'test_score': 0.8451377178189995, 'accuracy_score': 0.795948093220339, 'm_tn': 3004, 'm_fp': 990, 'm_fn': 551, 'm_tp': 3007, 'm_sensitivity': 0.8451377178189995, 'm_precision': 0.7523142356767576, 'm_specificity': 0.7521281922884326, 'm_negative_predictive_value': 0.8450070323488045, 'm_accuracy': 0.795948093220339, 'm_f1_score': 0.7960291197882198}


 ****** RESULT ********* 

 Best performing model - LogisticRegress

### Step 3: Examine the results of GridSearch
- GridSearch results are captured in the `gs_result` data frame for easier reading. 
- `gs_result` can be exported to CSV for record keeping.

Once the grid search has been fitted, the recorded properties are interpreted as follows:

| Property | Description |
| --- | ---|
| **`best_param`** | The hyperparameter combination that achieved the best mean cross-validated score. |
| **`best_score`** | Best mean cross-validated score achieved (recall here, since `scoring='recall'`). |
| **`m_accuracy`** | Measures the overall correctness of the model's predictions, regardless of class. |
| **`m_f1_score`** | Balanced measure of precision and recall, considering both false positives and false negatives. |
| **`m_negative_predictive_value`** | Measures the proportion of correctly identified true negative cases (correctly identifying facts as not misinformation) among all instances that are predicted as negative. |
| **`m_precision`** | Measures the proportion of correctly identified misinformation among all instances predicted as misinformation. |
| **`m_sensitivity`** | Measures the proportion of actual misinformation that is correctly identified by the model. |
| **`m_specificity`** | Measures the proportion of actual facts that are correctly identified by the model. |
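
As a worked check of these definitions, the sketch below recomputes each metric from the confusion-matrix counts reported in the execution output above (`m_tn`, `m_fp`, `m_fn`, `m_tp`). It uses plain arithmetic only.

```python
# Confusion-matrix counts reported for the tuned model on the test set
tn, fp, fn, tp = 3004, 990, 551, 3007

metrics = {
    'sensitivity': tp / (tp + fn),                # actual misinformation caught
    'precision': tp / (tp + fp),                  # flagged items that are misinformation
    'specificity': tn / (tn + fp),                # actual facts kept as facts
    'negative_predictive_value': tn / (tn + fn),  # predicted facts that are facts
    'accuracy': (tp + tn) / (tp + tn + fp + fn),
    'f1_score': 2 * tp / (2 * tp + fp + fn),
}

for name, value in metrics.items():
    print(f"{name:<26} {value:.3f}")
```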

In our study, we would like to achieve high sensitivity together with a strong F1 score: identify as much misinformation as possible while limiting the number of facts misclassified as misinformation. These two goals pull against each other, because as sensitivity increases, precision tends to decrease.

Since the F1 score is the harmonic mean of sensitivity and precision, gains in sensitivity that come at a large cost in precision can lower the F1 score. Hence, we look for a balance between the two, placing greater weight on sensitivity than on precision.
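
One common way to express "more weight on sensitivity than on precision" in a single number is the F-beta score with beta greater than 1. This is not the metric used in this notebook, which scores on recall; the sketch below is an assumption about how such a weighting could look, applied to the counts reported above.

```python
from sklearn.metrics import fbeta_score, make_scorer

# Counts reported for the tuned model on the test set
tn, fp, fn, tp = 3004, 990, 551, 3007

def f_beta(tp, fp, fn, beta):
    """F-beta from confusion counts; beta > 1 weights recall above precision."""
    b2 = beta ** 2
    return (1 + b2) * tp / ((1 + b2) * tp + b2 * fn + fp)

print(f"F1 = {f_beta(tp, fp, fn, beta=1):.3f}")
print(f"F2 = {f_beta(tp, fp, fn, beta=2):.3f}")

# Assumption: a scorer like this could be passed as GridSearchCV(scoring=...)
f2_scorer = make_scorer(fbeta_score, beta=2)
```

Because sensitivity exceeds precision for this model, the F2 value comes out higher than the F1 value.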

In [15]:
gs_result= pd.DataFrame(gs_result).T
#gs_df = gs_df.reset_index().rename(columns={"index":"model"})
gs_result

Unnamed: 0,LogisticRegression with CountVectorizer
accuracy_score,0.795948
best_param,"{'cvec__max_df': 0.1, 'cvec__max_features': 75..."
best_score,0.837187
m_accuracy,0.795948
m_f1_score,0.796029
m_fn,551
m_fp,990
m_negative_predictive_value,0.845007
m_precision,0.752314
m_sensitivity,0.845138


In [14]:
# uncomment to write the gs_result data frame to a csv file
# note: `model_ready_csv` is not defined in this notebook and must be set before uncommenting
# grid_search_result_csv = model_ready_csv[:-4]+'_RESULT_'+datetime.now().strftime("%Y%m%d%H%M")+'.csv'
# gs_result.to_csv('../data/'+grid_search_result_csv)
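
A self-contained sketch of the timestamped export follows. `demo_result`, `base_name` and the temporary output directory are assumptions introduced here; in the notebook the table would be `gs_result` and the target folder `../data`.

```python
from datetime import datetime
from pathlib import Path
import tempfile

import pandas as pd

# Stand-in results table; in the notebook this would be `gs_result`
demo_result = pd.DataFrame(
    {'best_score': [0.837], 'm_sensitivity': [0.845]},
    index=['LogisticRegression with CountVectorizer'])

# Hypothetical base name; the notebook's `model_ready_csv` is not defined here
base_name = 'grid_search'
stamp = datetime.now().strftime('%Y%m%d%H%M')

# A temporary directory keeps the sketch side-effect free; swap in '../data'
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / f"{base_name}_RESULT_{stamp}.csv"
demo_result.to_csv(out_path)
print(out_path.name)
```

The minute-level timestamp keeps results from separate runs in separate files rather than overwriting one another.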