## 2021F AML 3104 2 Neural Networks and Deep Learning

### FakeNews Detection using LSTM Neural Network

*Done by:*

* **Swathi Gurijila (C0790294)**
* **Varadharajan Kalyanaraman (C0793756)**
* **Vignesh Kumar Murugananthan (C0793760)**

#### Modeling Cross Validation

In [1]:
import pandas as pd
import numpy as np

# Vectorization imports
from sklearn.feature_extraction.text import CountVectorizer

# Training imports
from sklearn.model_selection import train_test_split, GridSearchCV

# Model imports
from sklearn.svm import SVC
from sklearn.linear_model import PassiveAggressiveClassifier
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier

# Validation imports
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Plotting imports
import matplotlib.pyplot as plt

# Other required library imports
import itertools
import pickle

# For clean Visualization
import warnings
warnings.filterwarnings('ignore')

In [2]:
vectorizer = CountVectorizer()

## Importing clean dataset

In [5]:
df = pd.read_csv("../Dataset/clean_data.csv")
df = df.dropna()
df.head()

|   | text | label |
|---|------|-------|
| 0 | unit state budget fight loom republican flip f... | 1 |
| 1 | unit state militari accept transgend recruit m... | 1 |
| 2 | senior unit state republican senat let mr muel... | 1 |
| 3 | fbi russia probe help australian diplomat tip ... | 1 |
| 4 | trump want postal servic charg much amazon shi... | 1 |


## Vectorizing the text

In [6]:
X = vectorizer.fit_transform(df['text'].values)
y = df['label']
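
The cell above fits the vocabulary on the full dataset before splitting, so words that only occur in the validation or test rows still become features. A common alternative, shown here only as a minimal sketch on hypothetical toy sentences (`train_docs`, `test_docs` and `toy_vectorizer` are illustrative names, not part of the notebook), is to learn the vocabulary from the training text alone and reuse it for the other subsets:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for df['text'] (hypothetical examples)
train_docs = ["unit state budget fight", "fbi russia probe", "budget fight loom"]
test_docs = ["russia budget probe unseen"]

toy_vectorizer = CountVectorizer()

# Learn the vocabulary from the training documents only
X_train_counts = toy_vectorizer.fit_transform(train_docs)

# Reuse that vocabulary; the word "unseen" never appeared in training,
# so it is silently dropped instead of becoming a new column
X_test_counts = toy_vectorizer.transform(test_docs)

# Terms ordered by their column index
vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(vocab)
print(X_train_counts.toarray())
print(X_test_counts.toarray())
```

Both matrices share the same columns, which is what lets a model trained on one be applied to the other.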

## Train-test-validation split

* In the first step, we split the whole dataset into two subsets:
	* 60% training data
	* 40% validation and testing data

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.4,
                                                    random_state=0)

* In the second step, we split the combined validation and testing subset into two equal halves:
	* 50% validation data
	* 50% testing data

In [8]:
X_val, X_test, y_val, y_test = train_test_split(X_test,
                                                y_test,
                                                test_size=0.5,
                                                random_state=0)

*After these two splits, the data is distributed across three subsets as follows:*

* 60% Training data
* 20% Testing data
* 20% Validation data

In [9]:
print("Checking the shape of datasets.")
print(f"Train data:      {X_train.shape}")
print(f"Test data:       {X_test.shape}")
print(f"Validation data: {X_val.shape}")

Checking the shape of datasets.
Train data:      (27243, 106803)
Test data:       (9081, 106803)
Validation data: (9081, 106803)
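
The two-step split can be checked on a small toy array. The sketch below assumes 100 balanced samples and adds `stratify`, which the notebook does not use, to keep the class ratio the same in every subset. All `*_toy`, `*_tr`, `*_va`, `*_te` and `*_rest` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 toy samples with perfectly balanced labels (hypothetical data)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0, 1] * 50)

# Step 1: 60% training, 40% held out for validation + testing
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_toy, y_toy, test_size=0.4, random_state=0, stratify=y_toy)

# Step 2: split the held-out 40% in half -> 20% validation, 20% testing
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=0, stratify=y_rest)

sizes = (len(X_tr), len(X_va), len(X_te))
print(sizes)
```

The same arithmetic matches the shapes printed above: 27243 + 9081 + 9081 = 45405 rows, of which 27243 is exactly 60%.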


## Training results

In [10]:
def print_results(results):
    print(f"Best params : {results.best_params_}")  # Printing the best params

    means = results.cv_results_['mean_test_score']
    stds = results.cv_results_['std_test_score']

    for mean, std, params in zip(means, stds, results.cv_results_['params']):
        # Print the mean CV accuracy (+/- two standard deviations) for each parameter combination
        print(f"{round(mean, 3)} (+/-{round(std*2, 3)}) for {params}")
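
The same information can also be viewed as a sorted table, which is easier to scan for large grids. This is a minimal sketch on random toy counts; `X_toy`, `y_toy`, `toy_search` and `summary` are illustrative names, and the small `alpha` grid is an assumption chosen only to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

# Random non-negative counts standing in for the vectorized text (hypothetical data)
rng = np.random.default_rng(0)
X_toy = rng.integers(0, 5, size=(60, 10))
y_toy = np.array([0, 1] * 30)

toy_search = GridSearchCV(MultinomialNB(), {'alpha': [0.1, 1.0]}, cv=3)
toy_search.fit(X_toy, y_toy)

# cv_results_ is a dict of equal-length arrays, so it converts directly to a DataFrame
summary = (
    pd.DataFrame(toy_search.cv_results_)
    [['params', 'mean_test_score', 'std_test_score', 'rank_test_score']]
    .sort_values('rank_test_score', kind='stable')
)
print(summary)
```

Rank 1 in `rank_test_score` corresponds to the combination reported by `best_params_`.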

## Model Training

### SVM

In [11]:
svc = SVC()

# Training parameters
parameters = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}

In [None]:
cv = GridSearchCV(svc, parameters, cv=5)

# Training the model
cv.fit(X_train, y_train) 

In [None]:
print_results(cv)

In [None]:
cv.best_estimator_

In [12]:
# Saving the model with best parameters
with open('models/svm_best_params', 'wb') as file:
    pickle.dump(cv.best_estimator_, file)
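
Opening a file for writing fails with `FileNotFoundError` if the target directory does not exist, so the `models/` folder has to be created before these cells run. A minimal round-trip sketch follows, assuming a temporary directory and a tiny toy classifier (`toy_model`, `model_dir` and `path` are illustrative names):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Tiny fitted model standing in for cv.best_estimator_ (hypothetical data)
X_toy = np.array([[2, 0], [0, 3], [1, 0], [0, 1]])
y_toy = np.array([0, 1, 0, 1])
toy_model = MultinomialNB().fit(X_toy, y_toy)

# Create the output directory if it is missing
model_dir = os.path.join(tempfile.mkdtemp(), 'models')
os.makedirs(model_dir, exist_ok=True)
path = os.path.join(model_dir, 'toy_best_params')

# Save, then reload, using context managers so the files are always closed
with open(path, 'wb') as fh:
    pickle.dump(toy_model, fh)
with open(path, 'rb') as fh:
    restored = pickle.load(fh)

# The restored model should predict exactly like the original
same = bool((restored.predict(X_toy) == toy_model.predict(X_toy)).all())
print(same)
```

Pickled estimators are generally only safe to reload with the same library versions that produced them, and pickle files from untrusted sources should not be loaded.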

### Random Forest

In [14]:
rf = RandomForestClassifier()

# Training parameters
params = {'n_estimators': [5, 50, 250], 'max_depth': [5, 10, 15, 20, 30, None]} 

In [15]:
rf_cv = GridSearchCV(rf, params, cv=5)

# Training the model
rf_cv.fit(X_train, y_train) 

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [5, 10, 15, 20, 30, None],
                         'n_estimators': [5, 50, 250]})

In [16]:
print_results(rf_cv)

Best params : {'max_depth': None, 'n_estimators': 250}
0.749 (+/-0.079) for {'max_depth': 5, 'n_estimators': 5}
0.786 (+/-0.018) for {'max_depth': 5, 'n_estimators': 50}
0.801 (+/-0.012) for {'max_depth': 5, 'n_estimators': 250}
0.801 (+/-0.014) for {'max_depth': 10, 'n_estimators': 5}
0.867 (+/-0.014) for {'max_depth': 10, 'n_estimators': 50}
0.875 (+/-0.01) for {'max_depth': 10, 'n_estimators': 250}
0.829 (+/-0.036) for {'max_depth': 15, 'n_estimators': 5}
0.89 (+/-0.017) for {'max_depth': 15, 'n_estimators': 50}
0.905 (+/-0.008) for {'max_depth': 15, 'n_estimators': 250}
0.833 (+/-0.015) for {'max_depth': 20, 'n_estimators': 5}
0.91 (+/-0.011) for {'max_depth': 20, 'n_estimators': 50}
0.913 (+/-0.011) for {'max_depth': 20, 'n_estimators': 250}
0.849 (+/-0.008) for {'max_depth': 30, 'n_estimators': 5}
0.921 (+/-0.008) for {'max_depth': 30, 'n_estimators': 50}
0.928 (+/-0.007) for {'max_depth': 30, 'n_estimators': 250}
0.861 (+/-0.019) for {'max_depth': None, 'n_estimators': 5}
0.931 

In [17]:
rf_cv.best_estimator_

RandomForestClassifier(n_estimators=250)

In [18]:
# Saving the model with best parameters
with open('models/rf_best_params', 'wb') as file:
    pickle.dump(rf_cv.best_estimator_, file)

### Passive Aggressive Classifier

In [19]:
pa_clf = PassiveAggressiveClassifier()

# Training parameters
params = {'C': [0.1, 1, 10]} 

In [21]:
pa_clf_cv = GridSearchCV(pa_clf, params, cv=5)

# Training the model
pa_clf_cv.fit(X_train, y_train) 

GridSearchCV(cv=5, estimator=PassiveAggressiveClassifier(),
             param_grid={'C': [0.1, 1, 10]})

In [24]:
print_results(pa_clf_cv)

Best params : {'C': 1}
0.948 (+/-0.005) for {'C': 0.1}
0.949 (+/-0.005) for {'C': 1}
0.948 (+/-0.006) for {'C': 10}


In [25]:
pa_clf_cv.best_estimator_

PassiveAggressiveClassifier(C=1)

In [26]:
# Saving the model with best parameters
with open('models/pa_clf_best_params', 'wb') as file:
    pickle.dump(pa_clf_cv.best_estimator_, file)

### XGBoost

In [13]:
XGB = XGBClassifier()

# Training parameters
params = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}
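
Before launching an exhaustive search it helps to count how many models the grid implies. The sketch below copies the grid above into a separate variable (`xgb_grid`, an illustrative name) and uses scikit-learn's `ParameterGrid` to count the combinations:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, copied under a separate name
xgb_grid = {
    'min_child_weight': [1, 5, 10],
    'gamma': [0.5, 1, 1.5, 2, 5],
    'subsample': [0.6, 0.8, 1.0],
    'colsample_bytree': [0.6, 0.8, 1.0],
    'max_depth': [3, 4, 5]
}

n_candidates = len(ParameterGrid(xgb_grid))  # 3 * 5 * 3 * 3 * 3
n_fits = n_candidates * 5                    # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)  # 405 2025
```

With this many fits, a randomized search such as `RandomizedSearchCV` may be a cheaper option, at the cost of not covering every combination.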

In [14]:
xgb_clf_cv = GridSearchCV(XGB, params, cv=5)

# Training the model
xgb_clf_cv.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None, gamma=None,
                                     gpu_id=None, importance_type='gain',
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, random_state=None,
                                     reg_alpha=None, reg_lambda=None,
                                     scale_pos_weight=None, subsample=None,
                                     tree_method=None,

In [15]:
print_results(xgb_clf_cv)

Best params : {'colsample_bytree': 0.8, 'gamma': 5, 'max_depth': 5, 'min_child_weight': 1, 'subsample': 1.0}
0.954 (+/-0.006) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 1, 'subsample': 0.6}
0.955 (+/-0.004) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 1, 'subsample': 0.8}
0.954 (+/-0.005) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 1, 'subsample': 1.0}
0.953 (+/-0.005) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 5, 'subsample': 0.6}
0.954 (+/-0.003) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 5, 'subsample': 0.8}
0.954 (+/-0.005) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 5, 'subsample': 1.0}
0.952 (+/-0.005) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'max_depth': 3, 'min_child_weight': 10, 'subsample': 0.6}
0.952 (+/-0.005) for {'colsample_bytree': 0.6, 'gamma': 0.5, 'ma

In [16]:
xgb_clf_cv.best_estimator_

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8, gamma=5, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
              reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1.0,
              tree_method='exact', validate_parameters=1, verbosity=None)

In [17]:
# Saving the model with best parameters
with open('models/xgb_clf_best_params', 'wb') as file:
    pickle.dump(xgb_clf_cv.best_estimator_, file)

### Naive Bayes

In [22]:
nvb = MultinomialNB()

# Training parameters
params = {
    'alpha': [0.01, 0.1, 0.5, 1.0, 10.0],
}

In [23]:
nvb_clf_cv = GridSearchCV(nvb, params, cv=5)

# Training the model
nvb_clf_cv.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=MultinomialNB(),
             param_grid={'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]})

In [24]:
print_results(nvb_clf_cv)

Best params : {'alpha': 0.01}
0.919 (+/-0.007) for {'alpha': 0.01}
0.917 (+/-0.005) for {'alpha': 0.1}
0.916 (+/-0.005) for {'alpha': 0.5}
0.916 (+/-0.005) for {'alpha': 1.0}
0.91 (+/-0.004) for {'alpha': 10.0}


In [27]:
nvb_clf_cv.best_estimator_

MultinomialNB(alpha=0.01)

In [26]:
# Saving the model with best parameters
with open('models/nvb_clf_best_params', 'wb') as file:
    pickle.dump(nvb_clf_cv.best_estimator_, file)

## Validation

In [31]:
models = {}
for mdl in ['nvb_clf', 'pa_clf', 'rf', 'svm', 'xgb_clf']:  # Model names
    filename = f'models/{mdl}_best_params'
    # Use a context manager so each file handle is closed after loading
    with open(filename, 'rb') as file:
        models[mdl] = pickle.load(file)

models

In [48]:
validation_results = {
    'Models': [],
    'Accuracy': [],
    'Precision': [],
    'Recall': []
}

### Evaluation function

In [49]:
def evaluate_model(name, pred, labels, save_rec=False):
    accuracy = round(accuracy_score(labels, pred), 3)    # Calculating the accuracy of the model
    precision = round(precision_score(labels, pred), 3)  # Calculating the precision of the model
    recall = round(recall_score(labels, pred), 3)        # Calculating the recall of the model

    # Appending evaluation details to the validation_results dictionary
    if save_rec:
        validation_results['Models'].append(name)
        validation_results['Accuracy'].append(accuracy)
        validation_results['Precision'].append(precision)
        validation_results['Recall'].append(recall)

    # Printing the evaluation scores
    print(
        f"{name} -- Accuracy:{accuracy}, Precision:{precision}, Recall:{recall}"
    )
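
To make the three scores concrete, the sketch below derives them by hand from a confusion matrix and compares the result with the scikit-learn functions used above. The `toy_labels` and `toy_pred` lists are made-up values for illustration only.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Hypothetical ground truth and predictions (1 = positive class)
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(toy_labels, toy_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
manual_precision = tp / (tp + fp)                  # how many predicted positives are right
manual_recall = tp / (tp + fn)                     # how many actual positives are found

print(manual_accuracy, accuracy_score(toy_labels, toy_pred))
print(manual_precision, precision_score(toy_labels, toy_pred))
print(manual_recall, recall_score(toy_labels, toy_pred))
```

Here precision (2/3) and recall (1/2) differ even though both are computed from the same predictions, which is why the notebook reports them alongside accuracy.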

In [50]:
# Looping through each model to calculate the evaluation score 
for name, model in models.items():
    pred = model.predict(X_val)
    evaluate_model(name, pred, y_val, True)

nvb_clf -- Accuracy:0.919, Precision:0.934, Recall:0.914
pa_clf -- Accuracy:0.951, Precision:0.947, Recall:0.963
rf -- Accuracy:0.939, Precision:0.932, Recall:0.956
svm -- Accuracy:0.959, Precision:0.962, Recall:0.962
xgb_clf -- Accuracy:0.964, Precision:0.961, Recall:0.973


In [51]:
# Evaluation scores
pd.DataFrame(validation_results)

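|   | Models | Accuracy | Precision | Recall |
|---|--------|----------|-----------|--------|
| 0 | nvb_clf | 0.919 | 0.934 | 0.914 |
| 1 | pa_clf | 0.951 | 0.947 | 0.963 |
| 2 | rf | 0.939 | 0.932 | 0.956 |
| 3 | svm | 0.959 | 0.962 | 0.962 |
| 4 | xgb_clf | 0.964 | 0.961 | 0.973 |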


## Testing the best models

Looking at the validation scores above, *XGBoost* (accuracy 0.964) and *SVM* (accuracy 0.959) are the two best performers.

We therefore evaluated these two models on the held-out testing data.
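
The selection can also be made programmatically rather than by eye. The sketch below rebuilds the validation scores reported above as a DataFrame and picks the two most accurate models. Ranking by accuracy alone is an assumption; `results_df` and `top_two` are illustrative names.

```python
import pandas as pd

# Validation scores as reported in the notebook output
validation_results = {
    'Models': ['nvb_clf', 'pa_clf', 'rf', 'svm', 'xgb_clf'],
    'Accuracy': [0.919, 0.951, 0.939, 0.959, 0.964],
    'Precision': [0.934, 0.947, 0.932, 0.962, 0.961],
    'Recall': [0.914, 0.963, 0.956, 0.962, 0.973]
}
results_df = pd.DataFrame(validation_results)

# Keep the two rows with the highest validation accuracy
top_two = results_df.nlargest(2, 'Accuracy')['Models'].tolist()
print(top_two)  # ['xgb_clf', 'svm']
```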

In [52]:
xgb_pred = models['xgb_clf'].predict(X_test)
evaluate_model('xgb_clf', xgb_pred, y_test)

xgb_clf -- Accuracy:0.963, Precision:0.96, Recall:0.97


In [53]:
svm_pred = models['svm'].predict(X_test)
evaluate_model('svm', svm_pred, y_test)

svm -- Accuracy:0.961, Precision:0.965, Recall:0.959
