# Fake News Detection: an application of classic NLP techniques
**Universidade de Brasília**<br>
Faculdade de Tecnologia<br>
Programa de Pós-graduação em Engenharia Elétrica (PPGEE)

## Author: Stefano M P C Souza (stefanomozart@ieee.org)<br>Advisor: Daniel G Silva<br>Advisor: Anderson C A Nascimento


## 1. Experiment design

We want to study the impact of various NLP preprocessing techniques on the task of text classification for fake news detection. We use the pipeline from [[1](#bot)] for model training, tuning (hyper-parameter search) and comparison. The following ML algorithms are used:
1. Naive Bayes
2. Decision Trees
3. K-Nearest Neighbours
4. Logistic Regression
5. Support Vector Machines
6. Random Forest
7. XGBoost

All models are trained and tested on a binary (*fake*/*real*) classification task. The *pipeline*, written by the author, extends the `sklearn.pipeline.Pipeline` class from scikit-learn and consists of the following steps:
1. **Training and tuning**: uses a random search algorithm to select the best hyper-parameters for each ML model;
2. **Selection**: for each dataset, selects the models with the best performance on the validation set, according to the selected metric. Each selected model is then trained once more on the concatenation of the training and validation sets;
3. **Test**: the models selected in the previous step, trained on the training and validation sets, are used to classify the texts in the test set. The final score on the selected metric is recorded for comparison.
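
The three steps above can be sketched with a toy example. This is a minimal, standard-library illustration and not the author's pipeline: the one-feature data, the `fit_threshold` model and its `margin` hyper-parameter are hypothetical stand-ins chosen only to keep the sketch self-contained.

```python
import random

rng = random.Random(42)

# Hypothetical toy data: one numeric feature, boolean label (True = fake).
labels = [rng.random() < 0.5 for _ in range(200)]
features = [rng.gauss(2.0 if y else 0.0, 1.0) for y in labels]
data = list(zip(features, labels))
train, valid, test = data[:128], data[128:160], data[160:]

def fit_threshold(samples, margin):
    """Toy model: predict True when x exceeds (mean of positives - margin)."""
    positives = [x for x, y in samples if y]
    cut = sum(positives) / len(positives) - margin
    return lambda x: x > cut

def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(model(x) == y for x, y in samples) / len(samples)

# 1. Training and tuning: random search over the hyper-parameter.
candidates = [rng.uniform(0.0, 3.0) for _ in range(20)]
scores = {m: accuracy(fit_threshold(train, m), valid) for m in candidates}

# 2. Selection: keep the best margin on the validation set, then refit
#    on the concatenation of the training and validation sets.
best_margin = max(scores, key=scores.get)
final_model = fit_threshold(train + valid, best_margin)

# 3. Test: score the refitted model on the held-out test set.
test_score = accuracy(final_model, test)
print(f"best margin={best_margin:.2f}, test accuracy={test_score:.2f}")
```

The test set is touched only once, after the hyper-parameter has been fixed, so the reported score is not biased by the search.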

## 2. Natural Language Processing

### 2.1. Selected techniques

1. **Tokenization**: the text, a sequence of characters, is transformed into an ordered collection of tokens (words, punctuation marks, emojis, etc.);
2. **Stopword removal (SwR)**: removes words that do not add information, in the statistical learning sense, to any specific class in the sample. Most algorithms rely on expert-curated dictionaries or on statistical measures such as *Mutual Information*;
3. **Stemming**: the reduction of variant forms of a word to a common representation, the root or stem, by eliminating inflectional morphemes such as verbal tense or plural suffixes. The intuition is to perform a dimensionality reduction on the dataset, removing rare morphological word variants, and to reduce the risk of bias in the word statistics measured on the documents;
4. **Lemmatization**: the reduction of each token to a linguistically valid root, or lemma. From the statistical perspective, the goal is exactly the same as in stemming: to reduce variance in term frequency. It is sometimes compared to a normalization of the word sample and aims to provide more accurate transformations than stemming, from the linguistic perspective;
5. **Bag-of-Words (BoW)**: the BoW algorithm used in most NLP libraries is based on the *Vector Space Model* (VSM) and associates each token with its term frequency: the number of occurrences of that token in that document. This algorithm produces an unordered set that does not retain any information on word order or proximity in the document;
6. **Term Frequency/Inverse Document Frequency (TF-IDF)**: similar to the Vector Space Model Bag-of-Words, the TF-IDF (sometimes expressed as TF*IDF) document representation associates each token in a document with a normalized or smoothed term frequency, weighted by the inverse of the frequency at which the term occurs in $D$, the corpus or list of documents under processing. That is, $f_{t_i, d_j}$, the number of occurrences of token $t_i$ in document $d_j$, is replaced by $\mathrm{tf}(t_i,d_j)\cdot\mathrm{idf}(t_i,D)$, where:
   
\begin{equation}
\begin{split}
  \mathrm{tf}(t_i,d_j) &=1 + \log \frac{f_{t_i,d_j}}{\sum_{t\in d_j}{f_{t,d_j}}} \\
  \mathrm{idf}(t_i, D) &=  1 + \log \frac{|D|+1}{|\{d \in D : t_i \in d\}|+1}
\end{split}
\end{equation}
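
As a small worked example of the term counts and of the smoothed $\mathrm{idf}$ term defined above, the sketch below uses only the Python standard library. The three-document corpus and the whitespace tokenizer are hypothetical, and the natural logarithm is assumed, since the formula does not state a base.

```python
import math
from collections import Counter

# Hypothetical toy corpus, tokenized on whitespace for simplicity.
corpus = [
    "the senator said the bill passed",
    "the bill failed",
    "officials deny the report",
]
docs = [doc.split() for doc in corpus]

# Bag-of-Words: one term-frequency table per document (word order is lost).
bow = [Counter(tokens) for tokens in docs]

def idf(term, docs):
    """Smoothed idf: 1 + log((|D| + 1) / (document frequency + 1))."""
    df = sum(1 for tokens in docs if term in tokens)
    return 1 + math.log((len(docs) + 1) / (df + 1))

print(bow[0]["the"])                   # "the" occurs twice in the first document
for term in ("the", "bill", "report"):
    print(term, round(idf(term, docs), 4))
```

A term that occurs in every document, such as "the", gets the minimum weight of 1, while a term found in a single document, such as "report", is weighted by $1 + \log 2$.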

### 2.2. Datasets

We selected 2 datasets in English and 2 in Portuguese. Each pair has a dataset with full-length news
articles and a dataset comprised of short statements, or sentences. The purpose of experimenting
with different languages and text sizes was to observe how these variables may impact preprocessing
and training cost, and, ultimately, model performance.

The selected datasets are:
  - **Liar Dataset (liar):** curated by the UC Santa Barbara NLP Group, contains 12791 claims
  by North-American politicians and celebrities, classified as `true`, `mostly-true`, `half-true`, 
  `barely-true`, `false` and `pants-on-fire` [[2](#liar)];

  - **Source Based Fake News Classification (sbnc):** 2020 full-length news articles manually labeled
  as `Real` or `Fake` [[3](#sbnc)];
  
  - **FactCk.br:** 1313 claims by Brazilian politicians, manually annotated by fact checking agencies (https://piaui.folha.uol.com.br/lupa, https://www.aosfatos.org and https://apublica.org) as `true`, `false`, `imprecise` and `others` [[4](#factckbr)];

  - **Fake.br:** 7200 full-length news articles, with text and metadata, manually flagged as `real` or `fake` news [[5](#fakebr)].

The classification experiments were preceded by a dataset preparation step, so that every dataset has the same structure:
1. **label**: (boolean) indicates whether the text was labeled as *fake news*;
2. **text**: (string) a concatenation of the title (when available) and the news body.
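
The target structure can be illustrated with a short sketch. The field names `title` and `body`, the `prepare` helper and the set of labels treated as fake are all hypothetical: the preparation code is not part of this notebook, and the mapping of multi-class labels to a boolean is not specified here.

```python
def prepare(record, fake_labels):
    """Map a raw record to the common {label, text} structure.

    `fake_labels` is the (assumed) set of original labels counted as fake.
    """
    title = record.get("title") or ""   # the title may be missing
    body = record.get("body") or ""
    text = f"{title} {body}".strip()    # title (when available) + news body
    return {"label": record["label"] in fake_labels, "text": text}

raw = [
    {"title": "Breaking", "body": "Aliens endorse candidate.", "label": "fake"},
    {"body": "Parliament approves the budget.", "label": "real"},
]
prepared = [prepare(r, fake_labels={"fake"}) for r in raw]
for row in prepared:
    print(row)
```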

## 3. Processing
#### Dataset preparation

In [1]:
# import general-purpose libraries used to manipulate the datasets
import pandas as pd
import numpy as np
import joblib

import os, sys, inspect, time
sys.path.insert(0, os.path.dirname(os.path.abspath(inspect.getfile(inspect.currentframe()))))

In [2]:
# The datasets are stored in a list of dictionaries, which makes it easier
# for each experiment to iterate over all datasets
datasets = [
    # Dataset 1: Liar 
    {'name':  'liar', 'lang': 'en', 'df': pd.read_csv('datasets/liar/liar.csv')},
    
    # Dataset 2: Source Based Fake News Classification
    {'name': 'sbnc', 'lang': 'en', 'df': pd.read_csv('datasets/sbnc/sbnc.csv')},

    # Dataset 3: Fake.br
    {'name': 'fake.br', 'lang': 'pt', 'df': pd.read_csv('datasets/fake.br/fake.br.csv')},

    # Dataset 4: FactCk.br
    {'name': 'factck.br', 'lang': 'pt', 'df': pd.read_csv("datasets/factck.br/factck.br.csv")}
]

experiments = {
   "E01": {'preprocessing_time': {}, 'name': 'bow'},
   "E02": {'preprocessing_time': {}, 'name': 'bow.swr'},
   "E03": {'preprocessing_time': {}, 'name': 'bow.stem'},
   "E04": {'preprocessing_time': {}, 'name': 'bow.lemm'},
   "E05": {'preprocessing_time': {}, 'name': 'bow.lemm.swr'},
   "E06": {'preprocessing_time': {}, 'name': 'tfidf'},
   "E07": {'preprocessing_time': {}, 'name': 'tfidf.swr'},
   "E08": {'preprocessing_time': {}, 'name': 'tfidf.stem'},
   "E09": {'preprocessing_time': {}, 'name': 'tfidf.lemm'},
   "E10": {'preprocessing_time': {}, 'name': 'tfidf.lemm.swr'},
}
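
The `preprocessing_time` entries are filled with `time.process_time()` in the cells that follow. According to the Python documentation, that clock measures CPU time of the current process and excludes time spent sleeping, so it can differ from elapsed wall-clock time (for example, when part of the work waits on I/O or, presumably, on another device). The sketch below, with a hypothetical `timed` helper, shows the difference.

```python
import time

def timed(fn):
    """Run fn() and return (CPU seconds, wall-clock seconds)."""
    cpu0, wall0 = time.process_time(), time.perf_counter()
    fn()
    return time.process_time() - cpu0, time.perf_counter() - wall0

# Sleeping consumes wall-clock time but almost no CPU time.
cpu, wall = timed(lambda: time.sleep(0.2))
print(f"CPU time: {cpu:.3f}s, wall-clock time: {wall:.3f}s")
```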

#### Data split

In [3]:
from sklearn.model_selection import train_test_split

for d in datasets:
    train_valid, test = train_test_split(d['df'], stratify=d['df'].label, test_size=0.2, random_state=42)
    train, valid = train_test_split(train_valid, stratify=train_valid.label, test_size=0.2, random_state=42)
    
    train_valid.to_csv(f"datasets/{d['name']}/train.valid.csv", index=False)
    train.to_csv(f"datasets/{d['name']}/train.csv", index=False)
    valid.to_csv(f"datasets/{d['name']}/valid.csv", index=False)
    test.to_csv(f"datasets/{d['name']}/test.csv", index=False)
    
    d['train.valid'] = train_valid
    d['train'] = train
    d['valid'] = valid
    d['test'] = test
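
Applying a 20% split twice yields roughly 64% training, 16% validation and 20% test data, with the class proportions preserved by `stratify`. The sketch below reproduces the idea with the standard library only. The `stratified_split` helper is a simplified stand-in, not scikit-learn's algorithm, and the 30%/70% label distribution is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=42):
    """Return (train_idx, test_idx), keeping class proportions in both parts."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_size)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Hypothetical dataset: 1000 texts, 30% labeled as fake.
labels = [True] * 300 + [False] * 700

# First split: 80% train+valid, 20% test.
train_valid_idx, test_idx = stratified_split(labels, 0.2)

# Second split: 20% of train+valid becomes the validation set.
tv_labels = [labels[i] for i in train_valid_idx]
train_pos, valid_pos = stratified_split(tv_labels, 0.2)

print(len(train_pos), len(valid_pos), len(test_idx))
```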

### 3.1. Bag of Words (BoW)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import save_npz

for d in datasets:
    t = time.process_time()
    
    cv = CountVectorizer()    
    train = cv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.bow.npz", train)
    valid = cv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.bow.npz", valid)
    
    cv = CountVectorizer()
    train = cv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.bow.npz", train)
    test = cv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.bow.npz", test)
    
    experiments["E01"]['preprocessing_time'][d['name']] = time.process_time() - t
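
In the cell above, the vocabulary is learned from the training texts only and then reused to encode the validation texts, so tokens unseen during fitting are ignored. The standard-library sketch below mimics that behaviour. `fit_vocabulary` and `transform` are hypothetical helpers, not scikit-learn functions; the regular expression is the default `token_pattern` of `CountVectorizer`, which keeps tokens of two or more word characters.

```python
import re
from collections import Counter

TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(doc):
    """Lowercase the text and extract tokens of 2+ word characters."""
    return TOKEN.findall(doc.lower())

def fit_vocabulary(docs):
    """Map each distinct token in docs to a column index."""
    terms = sorted({t for d in docs for t in tokenize(d)})
    return {t: i for i, t in enumerate(terms)}

def transform(docs, vocab):
    """Encode docs as dense count rows; out-of-vocabulary tokens are dropped."""
    rows = []
    for d in docs:
        counts = Counter(t for t in tokenize(d) if t in vocab)
        row = [0] * len(vocab)
        for term, count in counts.items():
            row[vocab[term]] = count
        rows.append(row)
    return rows

train_docs = ["The bill passed", "The bill failed"]
valid_docs = ["A new bill passed quickly"]

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(valid_docs, vocab))   # "new" and "quickly" are not in the vocabulary
```

Fitting a second vectorizer on the training and validation sets together, as done before encoding the test set, simply enlarges this vocabulary.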


### 3.2. BoW and Stopword Removal (BoW + SwR)

In [5]:
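import nltk

# NLTK stop-word lists for each language; the 'stopwords' corpus must be
# available locally, e.g. after running nltk.download('stopwords') once
swr = {
    'en': nltk.corpus.stopwords.words("english"), 
    'pt': nltk.corpus.stopwords.words("portuguese")
}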

for d in datasets:
    t = time.process_time()
    
    cv = CountVectorizer(stop_words=swr[d['lang']])
    train = cv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.bow.swr.npz", train)
    valid = cv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.bow.swr.npz", valid)
    
    cv = CountVectorizer(stop_words=swr[d['lang']])
    train = cv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.bow.swr.npz", train)
    test = cv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.bow.swr.npz", test)
    
    experiments["E02"]['preprocessing_time'][d['name']] = time.process_time() - t
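
The effect of the `stop_words` argument can be pictured as a filter applied to the token list before counting. The sketch below is a standard-library illustration with a tiny hypothetical stop list, not the NLTK lists used above.

```python
# Hypothetical stop list, for illustration only.
STOP_WORDS = {"the", "of", "is", "a"}

def remove_stopwords(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop list."""
    return [t for t in tokens if t not in stop_words]

tokens = "the claim of the senator is false".split()
filtered = remove_stopwords(tokens)
print(filtered)  # ['claim', 'senator', 'false']
```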

### 3.3. BoW and Stemming

In [6]:
cv_analyzer = CountVectorizer().build_analyzer()

snowball = {
    'en': nltk.stem.SnowballStemmer('english'),
    'pt': nltk.stem.SnowballStemmer('portuguese')
}

def en_stemmer(doc):
    return (snowball['en'].stem(w) for w in cv_analyzer(doc))

def pt_stemmer(doc):
    return (snowball['pt'].stem(w) for w in cv_analyzer(doc))

cv_stemmer = {
    'en': en_stemmer,
    'pt': pt_stemmer
}

In [7]:
for d in datasets:
    t = time.process_time()
    
    # Vocabulary fitted on the training set, then applied to the validation set
    cv = CountVectorizer(analyzer=cv_stemmer[d['lang']])
    train = cv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.bow.stem.npz", train)
    valid = cv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.bow.stem.npz", valid)
    
    # Fresh vectorizer fitted on training + validation, then applied to the test set
    cv = CountVectorizer(analyzer=cv_stemmer[d['lang']])
    train = cv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.bow.stem.npz", train)
    test = cv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.bow.stem.npz", test)
    
    experiments["E03"]['preprocessing_time'][d['name']] = time.process_time() - t
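
To see why stemming reduces dimensionality, the sketch below collapses inflected variants into one feature. The `toy_stem` function is a deliberately crude, hypothetical suffix stripper used only for illustration. It is not the Snowball algorithm applied in the cells above.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "s")  # hypothetical rules, far simpler than Snowball

def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least `min_stem` characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

tokens = ["claim", "claims", "claimed", "claiming", "report"]
raw_counts = Counter(tokens)
stem_counts = Counter(toy_stem(t) for t in tokens)

print(len(raw_counts), "features before stemming")
print(len(stem_counts), "features after stemming:", dict(stem_counts))
```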

### 3.4. BoW and Lemmatization

In [8]:
import stanza
stanza_pt = stanza.Pipeline(lang='pt', processors='tokenize,mwt,pos,lemma')

wordnet = nltk.stem.WordNetLemmatizer()

def en_lemma(doc):
    return [wordnet.lemmatize(token) for token in nltk.word_tokenize(doc)]
    
def pt_lemma(doc):
    d = stanza_pt(doc).sentences
    return [w.lemma for s in d for w in s.words]

lemmatizer = {
    'en': en_lemma,
    'pt': pt_lemma
}


2021-06-08 13:57:05 INFO: Loading these models for language: pt (Portuguese):
| Processor | Package |
-----------------------
| tokenize  | bosque  |
| mwt       | bosque  |
| pos       | bosque  |
| lemma     | bosque  |

2021-06-08 13:57:05 INFO: Use device: gpu
2021-06-08 13:57:05 INFO: Loading: tokenize
2021-06-08 13:57:09 INFO: Loading: mwt
2021-06-08 13:57:09 INFO: Loading: pos
2021-06-08 13:57:11 INFO: Loading: lemma
2021-06-08 13:57:11 INFO: Done loading processors!


In [9]:
for d in datasets:
    t = time.process_time()
    
    cv = CountVectorizer(tokenizer=lemmatizer[d['lang']])
    train = cv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.bow.lemm.npz", train)
    valid = cv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.bow.lemm.npz", valid)
    
    cv = CountVectorizer(tokenizer=lemmatizer[d['lang']])
    train = cv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.bow.lemm.npz", train)
    test = cv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.bow.lemm.npz", test)
    
    experiments["E04"]['preprocessing_time'][d['name']] = time.process_time() - t

### 3.5. BoW, Lemmatization and SwR

In [10]:
for d in datasets:
    t = time.process_time()
    
    cv = CountVectorizer(tokenizer=lemmatizer[d['lang']], stop_words=swr[d['lang']])    
    train = cv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.bow.lemm.swr.npz", train)
    valid = cv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.bow.lemm.swr.npz", valid)
    
    cv = CountVectorizer(tokenizer=lemmatizer[d['lang']], stop_words=swr[d['lang']])    
    train = cv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.bow.lemm.swr.npz", train)
    test = cv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.bow.lemm.swr.npz", test)
    
    experiments["E05"]['preprocessing_time'][d['name']] = time.process_time() - t



### 3.6. Term-Frequency/Inverse Document Frequency (TF-IDF)

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

for d in datasets:
    t = time.process_time()
    
    tv = TfidfVectorizer()    
    train = tv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.tfidf.npz", train)    
    valid = tv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.tfidf.npz", valid)
    
    tv = TfidfVectorizer()
    train = tv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.tfidf.npz", train)
    test = tv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.tfidf.npz", test)    
    
    experiments["E06"]['preprocessing_time'][d['name']] = time.process_time() - t
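
With its default arguments, `TfidfVectorizer` uses the raw term count as term frequency, multiplies it by the smoothed idf, and then scales each document vector to unit Euclidean (L2) norm. The standard-library sketch below follows those steps on a hypothetical two-document corpus. The `tfidf_rows` helper and the whitespace tokenizer are simplifications, not the library's implementation.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with L2-normalized count * smoothed-idf weights."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n = len(tokenized)
    idf = {
        t: 1 + math.log((n + 1) / (sum(t in toks for toks in tokenized) + 1))
        for t in vocab
    }
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0  # guard against empty docs
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(["the bill passed", "the bill failed"])
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

In each row, the term that appears in only one document ("passed" or "failed") receives a larger weight than the terms shared by both documents.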
    

### 3.7. TF-IDF and SwR

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

for d in datasets:
    t = time.process_time()
    
    tv = TfidfVectorizer(stop_words=swr[d['lang']])
    train = tv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.tfidf.swr.npz", train)
    valid = tv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.tfidf.swr.npz", valid)
    
    tv = TfidfVectorizer(stop_words=swr[d['lang']])
    train = tv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.tfidf.swr.npz", train)
    test = tv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.tfidf.swr.npz", test)
    
    experiments["E07"]['preprocessing_time'][d['name']] = time.process_time() - t


### 3.8. TF-IDF and Stemming

In [13]:
tf_analyzer = TfidfVectorizer().build_analyzer()

snowball = {
    'en': nltk.stem.SnowballStemmer('english'),
    'pt': nltk.stem.SnowballStemmer('portuguese')
}

def en_stemmer(doc):
    return (snowball['en'].stem(w) for w in tf_analyzer(doc))

def pt_stemmer(doc):
    return (snowball['pt'].stem(w) for w in tf_analyzer(doc))

tf_stemmer = {
    'en': en_stemmer,
    'pt': pt_stemmer
}

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer

for d in datasets:
    t = time.process_time()
    
    tv = TfidfVectorizer(tokenizer=tf_stemmer[d['lang']])
    train = tv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.tfidf.stem.npz", train)
    valid = tv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.tfidf.stem.npz", valid)
    
    tv = TfidfVectorizer(tokenizer=tf_stemmer[d['lang']])
    train = tv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.tfidf.stem.npz", train)
    test = tv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.tfidf.stem.npz", test)
    
    experiments["E08"]['preprocessing_time'][d['name']] = time.process_time() - t


### 3.9. TF-IDF and Lemmatization

In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer

for d in datasets:
    t = time.process_time()
    
    tv = TfidfVectorizer(tokenizer=lemmatizer[d['lang']])
    train = tv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.tfidf.lemm.npz", train)    
    valid = tv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.tfidf.lemm.npz", valid)
    
    tv = TfidfVectorizer(tokenizer=lemmatizer[d['lang']])
    train = tv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.tfidf.lemm.npz", train)    
    test = tv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.tfidf.lemm.npz", test)
    
    experiments["E09"]['preprocessing_time'][d['name']] = time.process_time() - t


### 3.10. TF-IDF, Lemmatization and SwR

In [16]:
from sklearn.feature_extraction.text import TfidfVectorizer

for d in datasets:
    t = time.process_time()
    
    tv = TfidfVectorizer(tokenizer=lemmatizer[d['lang']], stop_words=swr[d['lang']])
    train = tv.fit_transform(d['train'].text)
    save_npz(f"datasets/{d['name']}/train.tfidf.lemm.swr.npz", train)
    valid = tv.transform(d['valid'].text)
    save_npz(f"datasets/{d['name']}/valid.tfidf.lemm.swr.npz", valid)
    
    tv = TfidfVectorizer(tokenizer=lemmatizer[d['lang']], stop_words=swr[d['lang']])
    train = tv.fit_transform(d['train.valid'].text)
    save_npz(f"datasets/{d['name']}/train.valid.tfidf.lemm.swr.npz", train)
    test = tv.transform(d['test'].text)
    save_npz(f"datasets/{d['name']}/test.tfidf.lemm.swr.npz", test)
    
    experiments["E10"]['preprocessing_time'][d['name']] = time.process_time() - t


## 4. Saving the pre-processed datasets

In [17]:
import joblib

joblib.dump(datasets, 'datasets.pyd')
joblib.dump(experiments, 'experiments.pyd')

['experiments.pyd']

## References
<a name="bot"></a>
[1] Souza, S.M.P. et al. *Tuning machine learning models to detect bots on Twitter*. 2020 Workshop on Communication Networks and Power Systems (WCNPS). Brasília, 2020.

<a name="liar"></a>
[2] William Yang Wang. *"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection*. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper. Vancouver, BC, Canada, July 30-August 4, 2017.

<a name="sbnc"></a>
[3] A. Bharadwaj, B. Ashar, P. Barbhaya, R. Bhatia, Z. Shaikh. *Source based fake news classification using machine learning*. Aug 2020. URL: https://kaggle.com/ruchi798/source-based-news-classification

<a name="factckbr"></a>
[4] J. Moreno, G. Bressan. *Factck.br: A new dataset to study fake news*. In: Proceedings of the 25th Brazilian Symposium on Multimedia and the Web, WebMedia '19. Association for Computing Machinery, New York, NY, USA, 2019, p. 525–527. doi:10.1145/3323503.3361698.

<a name="fakebr"></a>
[5] Monteiro R.A., Santos R.L.S., Pardo T.A.S., de Almeida T.A., Ruiz E.E.S., Vale O.A. (2018) *Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results*. In: Villavicencio A. et al. (eds) Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science, vol 11122. Springer, Cham.
