# Title:

#### Group Member Names :

Anselm Che Fon

Thomas Britnell



### INTRODUCTION:
*********************************************************************************************************************
#### AIM :

Detecting "fake news" or deceptive statements using various NLP techniques. Traditional, neural network, and transformer techniques were all used in the original paper. Our aim was to see whether we could improve upon its claimed 93% accuracy for traditional models (Naive Bayes) by adding Random Forest. 

*********************************************************************************************************************
#### Github Repo:

Original:


https://github.com/JunaedYounusKhan51/FakeNewsDetection/tree/master/Codes

Ours:

https://github.com/thomasbritnell/mlpfinal




*********************************************************************************************************************
#### DESCRIPTION OF PAPER:

"A benchmark study of machine learning models for online fake news detection" by Junaed Younus Khan et al., published in Machine Learning with Applications (2021)

This paper evaluates various traditional, deep learning, and advanced pre-trained language models for detecting "fake news" across three datasets: Liar, Fake or Real News, and Combined Corpus. The aim of the paper is to outline the options that one has when pursuing an AI model for this purpose that will align with their needs.



*********************************************************************************************************************
#### PROBLEM STATEMENT :

In addition to replicating the results of the traditional ML models used by the paper, we aimed to use Random Forest on the datasets, along with several feature engineering techniques that weren't attempted in the original. The aim is not to beat the heavy-hitting pre-trained models, which clearly performed best in the study. Our goal was to outperform or match Naive Bayes on the datasets used, since the paper claims Naive Bayes was the best of the traditional ML models. This could add another option for those with resource constraints who want to integrate fact-checking AI into their applications. 

*********************************************************************************************************************
#### CONTEXT OF THE PROBLEM:

The issue at hand is that online statements, especially from "alternative" news sources, are not held to the fact-checking standards of more reputable publications. With distrust in the mainstream media, more and more people are seeking out such alternative media sources and aren't thinking critically about the claims they read. Artificial Intelligence, specifically Sentiment Analysis, has been proposed as a way to distinguish false claims from true ones. There are more or less two distinct categories of AI capable of accomplishing this task: traditional ML models, and newer neural network and transformer models. While the newer models perform better than the traditional ones, the article argues that traditional ML models are still important because they cater to limited hardware. Things like personal blogs might benefit from fact checking, but don't have the resources to run heavy neural networks or transformer models.  

*********************************************************************************************************************
#### SOLUTION:
We propose the implementation of a Random Forest classifier enhanced by extensive feature engineering, including text vectorization and statistical linguistic features. We use SMOTE to handle class imbalance and apply GridSearchCV for hyperparameter tuning to achieve optimal performance. 
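
The grid search itself is not reproduced in this notebook because of its run-time. As a rough illustration of the tuning step, here is a minimal sketch on synthetic data. The tiny parameter grid, the synthetic dataset, and the use of F1 as the selection metric are assumptions made for this example, not the settings used in the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; the real features come from the project's pipeline.
X, y = make_classification(n_samples=200, n_features=20, random_state=22)

# Hypothetical, deliberately tiny grid so the sketch runs in seconds.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [10, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=22),
    param_grid,
    scoring="f1",  # assumption: F1 as the selection metric
    cv=3,          # 3-fold cross-validation on the training data only
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best F1 score: {search.best_score_:.4f}")
```

A real search over more values and the full feature matrix follows the same pattern; only the grid and the input data change.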


# Background
*********************************************************************************************************************


|Reference|Explanation|Dataset/Input|Weakness|
|------|------|------|------|

Reference:
https://www.sciencedirect.com/science/article/pii/S266682702100013X

Explanation:


The Datasets:




Weaknesses : 

The paper fundamentally assumes that fake news or false statements can be detected by sentiment analysis methods. This assumes that the truth of a statement is inherent in its structure or the words it uses, so that falsehood can be detected without cross-verifying the facts involved. In other words, the models can at best detect whether text is written in a way that sounds true, not whether it is actually true. Similarly, models like this could be reverse engineered by those looking to disseminate false information, to help them form sentences that appear more true. It is presumably much more reliable to cross-reference statements of fact in order to see if they are true. 
*********************************************************************************************************************






# Implement paper code :
*********************************************************************************************************************





In [None]:
# This is just a subset of the original code, showing how Naive Bayes was implemented for the "Liar" dataset.
# In the full repository they run Naive Bayes with unigrams and bigrams on all three datasets.
# We also ran Random Forest with our feature engineering methods on the other datasets.

# We renamed get_feature_names to get_feature_names_out to match the newer scikit-learn method name.

# NOTE: this creates nb.sav and nb_pickle.pickle locally

#Original credit to "Junaed Younus Khan" : 



from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
import csv
import pickle

texts = []
labels = []

with open('datasets/train.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)

    # words = []
    # c = len(csv_reader)
    for line in csv_reader:
        texts.append(line[0])
        if line[1] == 'FALSE':
            labels.append(1)
        elif line[1] == 'TRUE':
            labels.append(0)

with open('datasets/test.csv', 'r') as csv_file:
    csv_reader = csv.reader(csv_file)
    next(csv_reader)

    # words = []
    # c = len(csv_reader)
    for line in csv_reader:
        texts.append(line[0])
        if line[1] == 'FALSE':
            labels.append(1)
        elif line[1] == 'TRUE':
            labels.append(0)

# print(texts)
'''
texts = [
        "good movies", "not a good movie", "did not like",
        "i like it", "good one"

]

print(texts)
labels = [
        "1","0","0","1","1"


]
'''
tfidf = TfidfVectorizer(min_df=2, max_df=0.5,
                        ngram_range=(1, 1), stop_words='english')
features = tfidf.fit_transform(texts)
pd.DataFrame(
    features.todense(),
    columns=tfidf.get_feature_names_out()
)


features = features.toarray()


# print(tfidf.get_feature_names())

# x_train, x_test, y_train, y_test = tts(features, labels, test_size=0.2)

x_train = features[0:10240]
y_train = labels[0:10240]

x_test = features[10240:]
y_test = labels[10240:]


# classifiers
clf_nb = MultinomialNB()

# clf_svm = svm.SVC(kernel='linear')
# clf_lr = LogisticRegression()

#########


# model save
print("training start.........")
print(".")
print("nb start")
clf_nb.fit(x_train, y_train)
filename = 'nb.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(clf_nb, model_file)
print("nb done")


######################


##########################
filename = 'nb.sav'
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
# result = loaded_model.score(X, y)
# print(result)


pred = loaded_model.predict(x_test)

print("###################")
print(".")
print("test results: ")
print("---------nb---------------------")
print("test_accuracy: ")
print(f"{accuracy_score(y_test, pred):.4f}")

print("test_precision: ")
print(f'{precision_score(y_test, pred, average="weighted"):.4f}')

print("test_recall: ")
print(f'{recall_score(y_test, pred, average="weighted"):.4f}')

print("test_f1 ")
print(f'{f1_score(y_test, pred, average="weighted"):.4f}')


filename = 'nb_pickle.pickle'
with open(filename, 'wb') as results_file:
    pickle.dump((y_test, pred), results_file)



training start.........
.
nb start
nb done
###################
.
test results: 
---------nb---------------------
test_accuracy: 
0.5966
test_precision: 
0.5973
test_recall: 
0.5966
test_f1 
0.5758


#### Discussion of the above code: 

These are the results from one run:

test_accuracy: 0.5966

test_precision: 0.5973

test_recall: 0.5966

test_f1: 0.5758


These results are weak: roughly 60% accuracy on a binary task is only modestly better than guessing. On the smaller "Liar" dataset (in their code it is just called test.csv and train.csv), Naive Bayes with very simple TF-IDF does not perform well. We feel that this code sample is representative of the other Naive Bayes implementations in the paper because they used almost exactly the same code for the three datasets: Liar, Fake or Real News, and the Combined Corpus. They used TF-IDF with stop-word removal and document-frequency cutoffs but no further feature engineering or preprocessing, which leaves room for improvement. 
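
To make concrete what the TF-IDF step produces, here is a minimal pure-Python sketch of the weighting on three made-up statements. The helper name `tfidf_vectors` and the toy sentences are assumptions for illustration; the IDF formula is the smoothed one that scikit-learn's `TfidfVectorizer` applies by default, but stop-word removal and the `min_df`/`max_df` cutoffs from the code above are left out for brevity.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# Toy statements, not taken from the Liar dataset.
docs = ["taxes went up", "taxes went down", "crime went up sharply"]
vectors = tfidf_vectors(docs)
for doc, vec in zip(docs, vectors):
    print(doc, {t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in every statement (here "went") get the lowest weight, while terms unique to one statement get the highest, which is the only signal the Naive Bayes model above receives.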

*********************************************************************************************************************
### Contribution  Code :

This is random_forest_liar.py, which can be found under code/random_forest_liar.py in this repo. It isn't the only Python file in our implementation, but it is the one that can be directly compared to the code sample from the original paper included above. This code takes the same dataset, the Liar dataset, and trains Random Forest on it. In addition, it applies feature engineering techniques beyond vectorization that were not present in the original paper's Naive Bayes code.

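Specifically, SMOTE (Synthetic Minority Over-sampling Technique) is used to address the imbalance between false and true samples in the Liar training split. With the label encoding used in the script below (False -> 0, True -> 1), the printed class distribution shows 5,752 "True" samples against 4,488 "False" samples, i.e. about 28% more "True" labelled values than "False". 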
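
SMOTE does not duplicate rows; it creates new minority-class points by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that idea on toy 2-D points. The function name `smote_like_oversample` and the data are illustrative assumptions; the project itself uses `SMOTE` from imblearn.

```python
import random


def smote_like_oversample(minority, n_new, k=2, seed=22):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` within the minority class, excluding itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic


minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy minority class
majority_size = 7  # toy majority-class count
new_points = smote_like_oversample(minority, majority_size - len(minority))
print(len(minority) + len(new_points), "minority samples after oversampling")
print(new_points)
```

On the Liar training split, the same principle means the smaller class (4,488 samples) is topped up with 1,264 synthetic samples to reach 5,752.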

Other features were extracted from the raw text data, as was done for other ML models (but not Naive Bayes) in the paper, such as avg_word_length, unique_word_count, etc. 

In addition to vectorization (count vectorization rather than TF-IDF), these extracted features give Random Forest, a relatively simple ML model, some extra signal to work with. Random Forest thus offers an alternative to the models originally used in the paper.
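
As a small worked example of these statistical features, the sketch below computes a few of them for a single made-up statement, using the same formulas as the script (including the +1 in the ratio's denominator). The helper name `text_stats` and the example sentence are assumptions for illustration only.

```python
def text_stats(statement):
    """Compute a few of the statistical features for one statement."""
    words = statement.split()
    word_count = len(words)
    unique_word_count = len({w.lower() for w in words})
    return {
        "char_count": len(statement),
        "word_count": word_count,
        "unique_word_count": unique_word_count,
        # The +1 in the denominator mirrors the script and avoids division by zero.
        "unique_word_ratio": unique_word_count / (word_count + 1),
        "avg_word_length": sum(len(w) for w in words) / word_count if words else 0.0,
        "exclamation_count": statement.count("!"),
    }


example = "The tax cut is the biggest tax cut ever!"  # made-up statement
stats = text_stats(example)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Each statement becomes one such row of numbers, which is then stacked next to its bag-of-words vector before training.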

In [None]:
#Note : this saves "fake_news_detector_rf_liar.pickle" locally


import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.metrics import classification_report
from imblearn.over_sampling import SMOTE
from scipy.sparse import hstack, csr_matrix
import warnings
warnings.filterwarnings('ignore')

# Load and preprocess data
def load_data():
    
    train_df = pd.read_csv('datasets/train.csv')
    test_df = pd.read_csv('datasets/test.csv')
    
    #encode the target 
    label_map = {False: 0, True: 1}
    train_df['Label'] = train_df['Label'].map(label_map)
    test_df['Label'] = test_df['Label'].map(label_map)
    print(f"Data Sample:\n{train_df.head()}\n")
    return train_df, test_df

# Feature engineering 
def engineer_features(df):
    
    
    features = pd.DataFrame()
    
    # Counts of characters, words, unique words, etc
    features['char_count'] = df['Statement'].apply(len).astype(np.float64)
    features['word_count'] = df['Statement'].apply(lambda x: len(str(x).split())).astype(np.float64)
    features['unique_word_count'] = df['Statement'].apply(lambda x: len(set(str(x).lower().split()))).astype(np.float64)
    # this is a calculation of how many words are unique in the sample 
    features['unique_word_ratio'] = (features['unique_word_count'] / (features['word_count'] + 1)).astype(np.float64)
    
    #average word length
    features['avg_word_length'] = df['Statement'].apply(
        lambda x: np.mean([len(word) for word in str(x).split()]) if len(str(x).split()) > 0 else 0
    ).astype(np.float64)
    
    #  Sentence features using regular expressions
    import re
    features['sentence_count'] = df['Statement'].apply(lambda x: len(re.split(r'[.!?]+', str(x)))).astype(np.float64)
    features['avg_sentence_length'] = (features['word_count'] / (features['sentence_count'] + 1)).astype(np.float64)
    
    # Special characters like exclamation (might indicate a sensational article title for example)
    features['exclamation_count'] = df['Statement'].apply(lambda x: str(x).count('!')).astype(np.float64)
    features['question_count'] = df['Statement'].apply(lambda x: str(x).count('?')).astype(np.float64)
    features['capital_count'] = df['Statement'].apply(lambda x: sum(1 for c in str(x) if c.isupper())).astype(np.float64)
    features['capital_ratio'] = (features['capital_count'] / (features['char_count'] + 1)).astype(np.float64)
    
    # flags for different things like the inclusion of quotes, or if there is a source cited 
    features['has_number'] = df['Statement'].apply(lambda x: float(bool(re.search(r'\d', str(x)))))
    features['has_quote'] = df['Statement'].apply(lambda x: float(bool(re.search(r'\"|\"|\'|\'', str(x)))))
    features['has_source'] = df['Statement'].apply(lambda x: float(bool(re.search(r'\bsource\b|\baccording\b', str(x).lower()))))

    
    return features

# Text vectorization function with sklearn CountVectorizer
def vectorize_text(train_df, test_df, max_features=2000):
   
    #use bigram and unigram
    vectorizer = CountVectorizer(
        max_features=max_features, 
        ngram_range=(1, 2),
        stop_words='english',
        min_df=5
    )
    
    # Fit and transform
    X_text_train = vectorizer.fit_transform(train_df['Statement'])
    X_text_test = vectorizer.transform(test_df['Statement'])
    
    return X_text_train, X_text_test, vectorizer

# Combine the vectorized text with the added derived features
def combine_features(X_text_train, X_text_test, X_stats_train, X_stats_test):
   
    # convert the dense statistical features to CSR so they can be stacked with the sparse matrix from vectorization
    X_stats_train_csr = csr_matrix(X_stats_train.astype(np.float64).values)
    X_stats_test_csr = csr_matrix(X_stats_test.astype(np.float64).values)
    
    # combine all the engineered features and the vectorized 
    X_train_combined = hstack([X_text_train, X_stats_train_csr])
    X_test_combined = hstack([X_text_test, X_stats_test_csr])

    
    return X_train_combined, X_test_combined

# smote for handling class imbalance
def apply_smote(X_train, y_train):

    smote = SMOTE(random_state=22)
    X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
    
    print(f"Original class distribution: {np.bincount(y_train)}")
    print(f"Class distribution after SMOTE: {np.bincount(y_train_resampled)}")
    return X_train_resampled, y_train_resampled

# Random Forest classifier
def train_random_forest(X_train, y_train):
    
    # params were chosen with GridSearchCV; the search is not included because of its long run-time
    rf_model = RandomForestClassifier(
        n_estimators=300,
        min_samples_split=5,
        min_samples_leaf=1,
        max_depth=None,
        class_weight='balanced',
        random_state=22,
        n_jobs=-1 
    )
    
    rf_model.fit(X_train, y_train)
    
    return rf_model

# Evaluation
def evaluate_model(model, X_test, y_test):

    # Get predictions
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:, 1]  # probability of the positive class (label 1), needed for ROC AUC
    
    # Calculate metrics
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_proba)
    
    # Print results
    print("\nModel Performance:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
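    print(f"ROC AUC: {roc_auc:.4f}\n")

    # Return the metrics so the caller can inspect or log them
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1, "roc_auc": roc_auc}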
    
    

def main():  
    
    # Load and preprocess data
    train_df, test_df = load_data()
    
    # Engineer features
    train_features = engineer_features(train_df)
    test_features = engineer_features(test_df)
    
    # Vectorize text
    X_text_train, X_text_test, vectorizer = vectorize_text(train_df, test_df, max_features=2000)
    
    # Combine features
    X_train, X_test = combine_features(X_text_train, X_text_test, train_features, test_features)
    y_train = train_df['Label'].values
    y_test = test_df['Label'].values
    
    # Apply SMOTE
    X_train_resampled, y_train_resampled = apply_smote(X_train, y_train)
    
    # Train Random Forest
    rf_model = train_random_forest(X_train_resampled, y_train_resampled)
    
    # Evaluate the model
    results = evaluate_model(rf_model, X_test, y_test)
    
    
    # Save the model
    import pickle
    with open('fake_news_detector_rf_liar.pickle', 'wb') as model_file:
        pickle.dump({'model': rf_model, 'vectorizer': vectorizer}, model_file)
    
    print("\nModel saved as 'fake_news_detector_rf_liar.pickle'")

if __name__ == "__main__":
    main()

Data Sample:
                                           Statement  Label
0  Says the Annies List political group supports ...      0
1  When did the decline of coal start? It started...      1
2  Hillary Clinton agrees with John McCain "by vo...      1
3  Health care reform legislation is likely to ma...      0
4  The economic turnaround started at the end of ...      1

Original class distribution: [4488 5752]
Class distribution after SMOTE: [5752 5752]

Model Performance:
Accuracy: 0.6103
Precision: 0.6151
Recall: 0.7504
F1 Score: 0.6760
ROC AUC: 0.6446


Model saved as 'fake_news_detector_rf_liar.pickle'


### Results :
*******************************************************************************************************************************

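As seen above, the results are modest. On the "Liar" dataset, Random Forest is marginally better than Naive Bayes (Naive Bayes -> Random Forest):

Accuracy: 0.5966 -> 0.6103

Precision: 0.5973 -> 0.6151

Recall: 0.5966 -> 0.7504

F1 Score: 0.5758 -> 0.6760

ROC AUC: (not measured) -> 0.6446 

Note that precision, recall, and F1 are not computed the same way in the two scripts. The Naive Bayes code passes average="weighted", while the Random Forest code uses scikit-learn's default, which scores only the positive class (label 1, "True"). The accuracy figures are directly comparable; the other three are not strictly like for like.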
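
The Naive Bayes script calls precision_score with average="weighted", whereas the Random Forest script relies on the default binary averaging. The following pure-Python sketch shows how those two modes can give different numbers for the same predictions. The labels are toy values and the helper names `precision_for` and `weighted_precision` are assumptions for illustration, not part of either script.

```python
def precision_for(label, y_true, y_pred):
    """Precision when `label` is treated as the positive class."""
    predicted = [t for t, p in zip(y_true, y_pred) if p == label]
    return sum(t == label for t in predicted) / len(predicted) if predicted else 0.0


def weighted_precision(y_true, y_pred):
    """Per-class precision, averaged with weights equal to each class's support."""
    labels = sorted(set(y_true))
    return sum(
        precision_for(lab, y_true, y_pred) * y_true.count(lab) for lab in labels
    ) / len(y_true)


# Toy labels, not taken from either dataset.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1, 0, 0]

binary = precision_for(1, y_true, y_pred)      # like average="binary" with pos_label=1
weighted = weighted_precision(y_true, y_pred)  # like average="weighted"
print(f"binary precision (class 1): {binary:.4f}")
print(f"weighted precision:         {weighted:.4f}")
```

Re-running both models with the same `average` setting would put the precision, recall, and F1 comparison on equal footing.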

However, this is not the only dataset from the paper that we tested Random Forest on. 

In the file code/random_forest_fake_or_real_and_combined_corupus.py in this repo, we train Random Forest using the same feature engineering as shown above, but on much larger datasets: the Fake or Real News dataset and the Combined Corpus dataset. These differ from the Liar dataset used previously because they are not raw data; they were preprocessed by the study's authors and contain metadata columns such as average word length. The Combined Corpus is the dataset on which the study claims Naive Bayes reached 93% accuracy. It has around 86,000 rows, each representing a news headline or article labelled true or false.

These are the results from running the same Random Forest methodology on the other two, larger datasets:

##### Best Parameters for fake or real Dataset:

class_weight: None

max_depth: 10

min_samples_leaf: 1

min_samples_split: 10

n_estimators: 100

Best F1 Score: 0.7187



##### Model Performance on fake or real Dataset:

Accuracy: 0.7524

Precision: 0.7290

Recall: 0.7945

F1 Score: 0.7604

ROC AUC: 0.8277

##### Best Parameters for Combined Corpus Dataset:

class_weight: None

max_depth: 20

min_samples_leaf: 2

min_samples_split: 10

n_estimators: 200

Best F1 Score: 0.7933

##### Model Performance on Combined Corpus Dataset:

Accuracy: 0.7765

Precision: 0.8421

Recall: 0.8264

F1 Score: 0.8342

ROC AUC: 0.8282


#### Observations :
*******************************************************************************************************************************
*


### Conclusion and Future Direction :
*******************************************************************************************************************************
#### Learnings :

*******************************************************************************************************************************
#### Results Discussion :


*******************************************************************************************************************************
#### Limitations :



*******************************************************************************************************************************
#### Future Extension :


# References:

[1]:  