# Fake News Detection: an application of classic NLP techniques
**Universidade de Brasília**<br>
School of Technology<br>
Graduate Program in Electrical Engineering (PPGEE)

## Author: Stefano M P C Souza (stefanomozart@ieee.org)<br> Advisor: Daniel G Silva<br>Advisor: Anderson C A Nascimento


## 1. Experiment design

We want to study the impact of various NLP preprocessing techniques on the task of text classification for fake news detection. We use the pipeline from [1] for model training, tuning (hyper-parameter search) and comparison. The following ML algorithms are used:
1. Naive Bayes
2. Decision Trees
3. K-Nearest Neighbour
4. Logistic Regression
5. Support Vector Machines
6. Random Forest
7. XGBoost

All models are trained and tested on a binary (*fake*/real) classification task. The *pipeline*, written by the author, extends the `sklearn.pipeline.Pipeline` class, from scikit-learn, and consists of the following steps:
1. **Training and tuning**: uses a random search algorithm to select the best hyper-parameters for each ML model;
2. **Selection**: for each dataset, selects the model with the best performance on the validation set, according to the selected metric. The selected model is then trained one more time on the concatenation of the training and validation sets;
3. **Test**: the models selected in the previous step, and trained on the training+validation sets, are used to classify the texts in the test set. The final score on the selected metric is recorded so that the experiments can be compared.
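
The three steps above can be sketched with plain scikit-learn. This is a minimal illustration under stated assumptions, not the `ModelSelection` class from [1]: the data are synthetic, only two candidate models are used, and the search spaces are invented for the example. It mirrors the choice, made later in this notebook, of F1 for tuning and ROC AUC for selection.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: the notebook uses the four news datasets)
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_tv, X_test, y_tv, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_tv, y_tv, test_size=0.25, random_state=0
)

# Candidate models with illustrative search spaces (not the author's)
candidates = {
    "Logistic Regression": (
        LogisticRegression(max_iter=1000),
        {"C": [0.01, 0.1, 1.0, 10.0, 100.0]},
    ),
    "Decision Tree": (
        DecisionTreeClassifier(random_state=0),
        {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 10]},
    ),
}

# 1. Training and tuning: random search over each model's hyper-parameters
tuned = {}
for name, (model, space) in candidates.items():
    search = RandomizedSearchCV(
        model, space, n_iter=4, scoring="f1", cv=3, random_state=0
    )
    search.fit(X_train, y_train)
    tuned[name] = search.best_estimator_

# 2. Selection: best model on the validation set, then refit on train+validation
valid_auc = {
    name: roc_auc_score(y_valid, model.predict_proba(X_valid)[:, 1])
    for name, model in tuned.items()
}
best_name = max(valid_auc, key=valid_auc.get)
final_model = clone(tuned[best_name]).fit(X_tv, y_tv)

# 3. Test: score the refitted model on the held-out test set
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
print(best_name, round(test_auc, 3))
```

`clone` copies the tuned hyper-parameters without the fitted state, so the final fit on the training+validation data starts from scratch.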

### 1.1. Datasets

We selected 2 datasets in English and 2 in Portuguese. Each pair has a dataset with full-length news
articles and a dataset comprised of short statements, or sentences. The purpose of experimenting
with different languages and text sizes was to observe how these variables may impact preprocessing
and training cost, and, ultimately, model performance.

The selected datasets are:
  - **Liar Dataset (liar):** curated by the UC Santa Barbara NLP Group, contains 12791 claims
  by North-American politicians and celebrities, classified as `true`, `mostly-true`, `half-true`, 
  `barely-true`, `false` and `pants-on-fire` [[2](#liar)];

  - **Source Based Fake News Classification (sbnc):** 2020 full-length news articles manually labeled
  as `Real` or `Fake` [[3](#sbnc)];
  
  - **FactCk.br:** 1313 claims by Brazilian politicians, manually annotated by fact-checking agencies (<https://piaui.folha.uol.com.br/lupa>, <https://www.aosfatos.org> and <https://apublica.org>) as `true`, `false`, `imprecise` and `others` [[4](#factckbr)];

  - **Fake.br:** 7200 full-length news articles, with text and metadata, manually flagged as `real` or `fake` news [[5](#fakebr)].

The classification experiments were preceded by a dataset preparation so that each dataset would have the same structure: 
1. **label**: (boolean) indicating if that text was labeled as *fake news*;
2. **text**: (string) a concatenation of title (when available) and news body.  
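
As an illustration of this common structure, the sketch below maps one raw record to the `label`/`text` pair. The function name `prepare_record` and the set of labels counted as fake are hypothetical: the text above does not state how the multi-class labels of the Liar and FactCk.br datasets were collapsed into a boolean.

```python
# Hypothetical set of raw labels counted as fake news (an assumption made for
# this example, not the author's mapping).
FAKE_LABELS = {"false", "pants-on-fire", "fake"}


def prepare_record(body, label, title=None, fake_labels=FAKE_LABELS):
    """Return a record with the common structure: boolean label and text."""
    # Concatenate the title (when available) and the news body
    text = f"{title} {body}" if title else body
    return {"label": str(label).lower() in fake_labels, "text": text}


claim = prepare_record("The tax was cut by half.", "pants-on-fire")
article = prepare_record("Full article body.", "Real", title="Some headline")
print(claim)
print(article)
```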

### 1.2. Pre-processing

See the [Classic NLP](./classic_nlp.ipynb) notebook for details.

## 2. Classic ML Experiments

The experiments differ in the preprocessing step, which applies the following NLP techniques:

E1. Bag-of-Words (BoW);<br>
E2. Stop-word Removal (SwR) and BoW;<br>
E3. *Stemming* and BoW;<br>
E4. Lemmatization and BoW;<br>
E5. SwR, lemmatization and BoW;<br>
E6. TF-IDF;<br>
E7. SwR and TF-IDF;<br>
E8. *Stemming* and TF-IDF;<br>
E9. Lemmatization and TF-IDF;<br>
E10. SwR, lemmatization and TF-IDF;<br>
E11. DistilBERT word embeddings;<br>
E12. Sentence-BERT sentence embeddings.
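
To make the first ten experiments more concrete, here is a minimal pure-Python sketch of stop-word removal, Bag-of-Words and TF-IDF. It is not the preprocessing code used in this notebook (see the Classic NLP notebook for that): the stop-word list is a tiny hypothetical one, tokenization is a plain whitespace split, and the TF-IDF weight is the unsmoothed variant `tf * log(N / df)`, whereas libraries often apply smoothing and normalization.

```python
import math
from collections import Counter

# Tiny hypothetical stop-word list, for illustration only
STOP_WORDS = {"the", "a", "is", "of"}


def tokenize(text, remove_stop_words=False):
    """Lowercase, split on whitespace and optionally drop stop words."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


def bag_of_words(docs, remove_stop_words=False):
    """Return the sorted vocabulary and the document-term count matrix."""
    tokenized = [tokenize(d, remove_stop_words) for d in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    counts = [Counter(doc) for doc in tokenized]
    return vocab, [[c[t] for t in vocab] for c in counts]


def tf_idf(matrix):
    """Reweight raw counts by log(N / document frequency)."""
    n_docs = len(matrix)
    n_terms = len(matrix[0])
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    return [
        [tf * math.log(n_docs / df[j]) for j, tf in enumerate(row)]
        for row in matrix
    ]


docs = ["the senator said the tax is a lie", "the tax cut is real"]
vocab, bow = bag_of_words(docs, remove_stop_words=True)
weights = tf_idf(bow)
print(vocab)
print(bow)
```

In this toy corpus the term `tax` occurs in both documents, so its TF-IDF weight is zero, while terms unique to one document keep a positive weight.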

**Loading the preprocessed datasets**<BR>


In [1]:
import pandas as pd
import numpy as np
import joblib

datasets = joblib.load('datasets.pyd')
experiments = joblib.load('experiments.pyd')

**Loading the model selection pipeline from [[1](#bot)]**

In [2]:
from types import SimpleNamespace

from scipy.sparse import load_npz
from bc_pipeline import ModelSelection

# Container for the result tables accumulated across experiments
results = SimpleNamespace(info=pd.DataFrame(), best_model=pd.DataFrame())

### 2.1. Running the model selection pipeline

In [3]:
for e in experiments.values():
    for d in datasets:
        print('Runnning ', e['name'], d['name'])
        train = load_npz(f"datasets/{d['name']}/train.{e['name']}.npz")
        train_valid = load_npz(f"datasets/{d['name']}/train.valid.{e['name']}.npz")
        valid = load_npz(f"datasets/{d['name']}/valid.{e['name']}.npz")
        test = load_npz(f"datasets/{d['name']}/test.{e['name']}.npz")
                
        # Instantiate the model selection class
        mdl = ModelSelection(
            n_iter=20,
            cv_scoring='f1',
            selection_scoring='roc_auc',
            save_models=True,
            output_path=f"{e['name']}/{d['name']}",
            name=f"{e['name']}_{d['name']}"
        )

        # tune the internal models
        mdl.tune(train, d['train'].label)

        # select the best model for the experiment
        mdl.select(valid, d['valid'].label)

        # Fit the selected model with train+validation sets
        mdl.fit(train_valid, d['train.valid'].label)

        # test the selected model
        mdl.score(test, d['test'].label)
        display(mdl._score)
        
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        results.info = pd.concat([results.info, mdl._info])
        results.best_model = pd.concat([results.best_model, mdl._score])

Runnning  bow liar


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow_liar | none | Naive Bayes | `{'norm': False, 'alpha': 0.8}` | f1 | 0.551215 | roc_auc | 0.595368 | 0.613912 | 0.61337 | 0.607433 | 0.297642 |


Runnning  bow sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow_sbnc | none | XGBoost | `{'learning_rate': 0.037999397928169254, 'max_d...` | f1 | 0.830805 | roc_auc | 0.796636 | 0.769802 | 0.763858 | 0.74165 | 53.888175 |


Runnning  bow fake.br






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow_fake.br | none | XGBoost | `{'learning_rate': 0.13740346640696569, 'max_de...` | f1 | 0.965594 | roc_auc | 0.967882 | 0.975694 | 0.975693 | 0.975694 | 63.006642 |


Runnning  bow factck.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow_factck.br | none | Logistic Regression | `{'penalty': 'l2', 'max_iter': 80, 'fit_interce...` | f1 | 0.879698 | roc_auc | 0.603393 | 0.798479 | 0.784813 | 0.649293 | 0.12338 |


Runnning  bow.swr liar


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.swr_liar | none | Naive Bayes | `{'norm': False, 'alpha': 0.8}` | f1 | 0.54828 | roc_auc | 0.591772 | 0.606487 | 0.605598 | 0.599224 | 0.136738 |


Runnning  bow.swr sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.swr_sbnc | none | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | f1 | 0.835986 | roc_auc | 0.784758 | 0.79703 | 0.79022 | 0.766342 | 31.075107 |


Runnning  bow.swr fake.br




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.swr_fake.br | none | XGBoost | `{'learning_rate': 0.1036111842654619, 'max_dep...` | f1 | 0.963559 | roc_auc | 0.96441 | 0.966667 | 0.966667 | 0.966667 | 46.308035 |


Runnning  bow.swr factck.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.swr_factck.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.876953 | roc_auc | 0.687301 | 0.794677 | 0.779725 | 0.640521 | 0.09413 |


Runnning  bow.stem liar


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.stem_liar | none | Naive Bayes | `{'norm': False, 'alpha': 0.8}` | f1 | 0.54844 | roc_auc | 0.600101 | 0.615866 | 0.616177 | 0.611559 | 0.123718 |


Runnning  bow.stem sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.stem_sbnc | none | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | f1 | 0.825404 | roc_auc | 0.79544 | 0.767327 | 0.758402 | 0.733145 | 30.527657 |


Runnning  bow.stem fake.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.stem_fake.br | none | Logistic Regression | `{'penalty': 'l2', 'max_iter': 60, 'fit_interce...` | f1 | 0.965779 | roc_auc | 0.97309 | 0.967361 | 0.96736 | 0.967361 | 1.516396 |


Runnning  bow.stem factck.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.stem_factck.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.877919 | roc_auc | 0.620758 | 0.768061 | 0.736121 | 0.566428 | 0.098414 |


Runnning  bow.lemm liar


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm_liar | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.530278 | roc_auc | 0.59344 | 0.605315 | 0.604157 | 0.597533 | 0.301757 |


Runnning  bow.lemm sbnc


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm_sbnc | none | Random Forest | `{'n_estimators': 566, 'min_samples_split': 2, ...` | f1 | 0.816055 | roc_auc | 0.791374 | 0.762376 | 0.754404 | 0.730123 | 140.0891 |


Runnning  bow.lemm fake.br




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm_fake.br | none | XGBoost | `{'learning_rate': 0.014239783131830445, 'max_d...` | f1 | 0.981164 | roc_auc | 0.983507 | 0.982639 | 0.982639 | 0.982639 | 56.411969 |


Runnning  bow.lemm factck.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm_factck.br | none | Logistic Regression | `{'penalty': 'l2', 'max_iter': 30, 'fit_interce...` | f1 | 0.886541 | roc_auc | 0.656018 | 0.813688 | 0.776602 | 0.608244 | 0.115272 |


Runnning  bow.lemm.swr liar


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm.swr_liar | none | Support Vector Machines | `{'kernel': 'rbf', 'gamma': 'scale', 'C': 1.189}` | f1 | 0.505561 | roc_auc | 0.595265 | 0.611958 | 0.604726 | 0.596459 | 9.644128 |


Runnning  bow.lemm.swr sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm.swr_sbnc | none | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | f1 | 0.826994 | roc_auc | 0.777105 | 0.774752 | 0.764665 | 0.738217 | 29.853997 |


Runnning  bow.lemm.swr fake.br




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm.swr_fake.br | none | XGBoost | `{'learning_rate': 0.01957975133212852, 'max_de...` | f1 | 0.981219 | roc_auc | 0.980035 | 0.98125 | 0.98125 | 0.98125 | 46.364801 |


Runnning  bow.lemm.swr factck.br




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| bow.lemm.swr_factck.br | none | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | f1 | 0.883352 | roc_auc | 0.637328 | 0.809886 | 0.770384 | 0.599472 | 4.3695 |


Runnning  tfidf liar


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf_liar | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.598886 | roc_auc | 0.610295 | 0.605705 | 0.604521 | 0.61642 | 0.289573 |


Runnning  tfidf sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf_sbnc | none | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | f1 | 0.819077 | roc_auc | 0.807159 | 0.787129 | 0.782698 | 0.762449 | 54.344105 |


Runnning  tfidf fake.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf_fake.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 80, 'fit_inter...` | f1 | 0.959072 | roc_auc | 0.967014 | 0.963889 | 0.963889 | 0.963889 | 1.38094 |


Runnning  tfidf factck.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf_factck.br | none | Support Vector Machines | `{'kernel': 'linear', 'gamma': 'auto', 'C': 1.733}` | f1 | 0.880547 | roc_auc | 0.657741 | 0.775665 | 0.760452 | 0.615696 | 0.185114 |


Runnning  tfidf.swr liar


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.swr_liar | none | Naive Bayes | `{'norm': False, 'alpha': 0.8}` | f1 | 0.515193 | roc_auc | 0.593617 | 0.607659 | 0.602999 | 0.594888 | 0.136423 |


Runnning  tfidf.swr sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.swr_sbnc | none | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | f1 | 0.831304 | roc_auc | 0.783562 | 0.774752 | 0.76639 | 0.741445 | 24.98398 |


Runnning  tfidf.swr fake.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.swr_fake.br | none | Support Vector Machines | `{'kernel': 'linear', 'gamma': 'auto', 'C': 1.733}` | f1 | 0.962057 | roc_auc | 0.965278 | 0.9625 | 0.962499 | 0.9625 | 14.233476 |


Runnning  tfidf.swr factck.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.swr_factck.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 80, 'fit_inter...` | f1 | 0.880549 | roc_auc | 0.665562 | 0.787072 | 0.781065 | 0.661046 | 0.112301 |


Runnning  tfidf.stem liar


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.stem_liar | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.570034 | roc_auc | 0.599631 | 0.611567 | 0.612752 | 0.61501 | 0.28631 |


Runnning  tfidf.stem sbnc




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.stem_sbnc | none | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | f1 | 0.817647 | roc_auc | 0.777423 | 0.762376 | 0.754949 | 0.731199 | 27.634219 |


Runnning  tfidf.stem fake.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.stem_fake.br | none | Support Vector Machines | `{'kernel': 'linear', 'gamma': 'auto', 'C': 1.733}` | f1 | 0.956902 | roc_auc | 0.967882 | 0.9625 | 0.962499 | 0.9625 | 11.721201 |


Runnning  tfidf.stem factck.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.stem_factck.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.883343 | roc_auc | 0.687301 | 0.764259 | 0.744637 | 0.58938 | 0.094805 |


Runnning  tfidf.lemm liar


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm_liar | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.59677 | roc_auc | 0.615851 | 0.600625 | 0.600221 | 0.609673 | 0.310219 |


Runnning  tfidf.lemm sbnc




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm_sbnc | none | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | f1 | 0.822544 | roc_auc | 0.790179 | 0.789604 | 0.782789 | 0.759119 | 28.977821 |


Runnning  tfidf.lemm fake.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm_fake.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 80, 'fit_inter...` | f1 | 0.971087 | roc_auc | 0.982639 | 0.971528 | 0.971527 | 0.971528 | 1.349841 |


Runnning  tfidf.lemm factck.br


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm_factck.br | none | Support Vector Machines | `{'kernel': 'sigmoid', 'gamma': 'scale', 'C': 2...` | f1 | 0.882171 | roc_auc | 0.612937 | 0.806084 | 0.788892 | 0.647803 | 0.167091 |


Runnning  tfidf.lemm.swr liar


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm.swr_liar | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | f1 | 0.595495 | roc_auc | 0.60036 | 0.602579 | 0.602432 | 0.61106 | 0.27098 |


Runnning  tfidf.lemm.swr sbnc






| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm.swr_sbnc | none | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | f1 | 0.828468 | roc_auc | 0.787149 | 0.764851 | 0.751666 | 0.723566 | 24.235038 |


Runnning  tfidf.lemm.swr fake.br


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm.swr_fake.br | none | Logistic Regression | `{'penalty': 'none', 'max_iter': 20, 'fit_inter...` | f1 | 0.968867 | roc_auc | 0.980035 | 0.967361 | 0.967361 | 0.967361 | 1.141418 |


Runnning  tfidf.lemm.swr factck.br




| pipeline | scaling | model_name | params | training_scoring | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tfidf.lemm.swr_factck.br | none | XGBoost | `{'learning_rate': 0.017958976695919917, 'max_d...` | f1 | 0.876921 | roc_auc | 0.612142 | 0.771863 | 0.71117 | 0.524442 | 6.71351 |


## 3. Results

In [4]:
results.best_model.sort_values(by=['dataset', 'accuracy'], ascending=False)

| pipeline | model_name | params | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime | dataset |
|---|---|---|---|---|---|---|---|---|---|---|
| bow.swr | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | 0.835986 | roc_auc | 0.784758 | 0.79703 | 0.79022 | 0.766342 | 31.075107 | sbnc |
| tfidf.lemm | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | 0.822544 | roc_auc | 0.790179 | 0.789604 | 0.782789 | 0.759119 | 28.977821 | sbnc |
| tfidf | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | 0.819077 | roc_auc | 0.807159 | 0.787129 | 0.782698 | 0.762449 | 54.344105 | sbnc |
| tfidf.swr | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | 0.831304 | roc_auc | 0.783562 | 0.774752 | 0.76639 | 0.741445 | 24.98398 | sbnc |
| bow.lemm.swr | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | 0.826994 | roc_auc | 0.777105 | 0.774752 | 0.764665 | 0.738217 | 29.853997 | sbnc |
| bow | XGBoost | `{'learning_rate': 0.037999397928169254, 'max_d...` | 0.830805 | roc_auc | 0.796636 | 0.769802 | 0.763858 | 0.74165 | 53.888175 | sbnc |
| paraphrase-multilingual-mpnet-base-v2 | Support Vector Machines | `{'kernel': 'poly', 'gamma': 'scale', 'C': 2.278}` | 0.792585 | roc_auc | 0.756537 | 0.769802 | 0.762865 | 0.739498 | 1.884847 | sbnc |
| bow.stem | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | 0.825404 | roc_auc | 0.79544 | 0.767327 | 0.758402 | 0.733145 | 30.527657 | sbnc |
| tfidf.lemm.swr | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | 0.828468 | roc_auc | 0.787149 | 0.764851 | 0.751666 | 0.723566 | 24.235038 | sbnc |
| tfidf.stem | XGBoost | `{'learning_rate': 0.044850167300919674, 'max_d...` | 0.817647 | roc_auc | 0.777423 | 0.762376 | 0.754949 | 0.731199 | 27.634219 | sbnc |


The best model for each dataset, as measured by accuracy, is listed below:

In [5]:
# `best` is not defined in this notebook; assumed here to be `results.best_model`
results.best_model.sort_values(by=['accuracy'], ascending=False).groupby(by='dataset').first()

| dataset | pipeline | model_name | params | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|
| factck.br | bow.lemm | Logistic Regression | `{'penalty': 'l2', 'max_iter': 30, 'fit_interce...` | 0.886541 | roc_auc | 0.656018 | 0.813688 | 0.776602 | 0.608244 | 0.115272 |
| fake.br | bow.lemm | XGBoost | `{'learning_rate': 0.014239783131830445, 'max_d...` | 0.981164 | roc_auc | 0.983507 | 0.982639 | 0.982639 | 0.982639 | 56.411969 |
| liar | paraphrase-multilingual-mpnet-base-v2 | Random Forest | `{'n_estimators': 722, 'min_samples_split': 2, ...` | 0.511372 | roc_auc | 0.62068 | 0.630715 | 0.622811 | 0.614465 | 2529.194057 |
| sbnc | bow.swr | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | 0.835986 | roc_auc | 0.784758 | 0.79703 | 0.79022 | 0.766342 | 31.075107 |


The best model for each dataset, as measured by the ROC AUC metric, is listed below:

In [6]:
# `best` is not defined in this notebook; assumed here to be `results.best_model`
results.best_model.sort_values(by=['roc_auc'], ascending=False).groupby(by='dataset').first()

| dataset | pipeline | model_name | params | training_score | validation_scoring | validation_score | accuracy | f1 | roc_auc | runtime |
|---|---|---|---|---|---|---|---|---|---|---|
| factck.br | tfidf.swr | Logistic Regression | `{'penalty': 'none', 'max_iter': 80, 'fit_inter...` | 0.880549 | roc_auc | 0.665562 | 0.787072 | 0.781065 | 0.661046 | 0.112301 |
| fake.br | bow.lemm | XGBoost | `{'learning_rate': 0.014239783131830445, 'max_d...` | 0.981164 | roc_auc | 0.983507 | 0.982639 | 0.982639 | 0.982639 | 56.411969 |
| liar | tfidf | Logistic Regression | `{'penalty': 'none', 'max_iter': 10, 'fit_inter...` | 0.598886 | roc_auc | 0.610295 | 0.605705 | 0.604521 | 0.61642 | 0.289573 |
| sbnc | bow.swr | XGBoost | `{'learning_rate': 0.043846947416260156, 'max_d...` | 0.835986 | roc_auc | 0.784758 | 0.79703 | 0.79022 | 0.766342 | 31.075107 |
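
Ranking by accuracy and ranking by ROC AUC do not pick the same model for every dataset. The sketch below, on hypothetical toy labels rather than data from these experiments, shows one way the two metrics can diverge: on an imbalanced set, a classifier that always predicts the majority class reaches high accuracy while its ROC AUC stays at chance level. ROC AUC is computed here from its pairwise definition, the probability that a random positive is scored above a random negative, with ties counted as one half.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Pairwise ROC AUC: P(positive score > negative score), ties count 1/2."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical imbalanced labels: 8 positives, 2 negatives
y_true = [1] * 8 + [0] * 2
# A degenerate classifier that always predicts the majority class
always_positive = [1] * 10

acc = accuracy(y_true, always_positive)
auc = roc_auc(y_true, always_positive)
print(acc, auc)
```

Here `acc` is 0.8 and `auc` is 0.5, which is why a metric less sensitive to class balance is a reasonable selection criterion when the classes are uneven.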


## References
<a name="bot"></a>
[1]. Souza, S.M.P. et al. *Tuning machine learning models to detect bots on Twitter*. 2020 Workshop on Communication Networks and Power Systems (WCNPS). Brasilia, 2020.

<a name="liar"></a>
[2]. William Yang Wang, "Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection, in: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), short paper, Vancouver, BC, Canada, July 30-August 4, ACL.

<a name="sbnc"></a>
[3]. A. Bharadwaj, B. Ashar, P. Barbhaya, R. Bhatia, Z. Shaikh, Source based fake news classification using machine learning (Aug 2020). URL https://kaggle.com/ruchi798/source-based-news-classification

<a name="factckbr"></a>
[4]. J. Moreno, G. Bressan, Factck.br: A new dataset to study fake news, in: Proceedings of the 25th Brazilian Symposium on Multimedia and the Web, WebMedia ’19, Association for Computing Machinery, New York, NY, USA, 2019, p. 525–527. doi:10.1145/3323503.3361698.

<a name="fakebr"></a>
[5]. Monteiro R.A., Santos R.L.S., Pardo T.A.S., de Almeida T.A., Ruiz E.E.S., Vale O.A. (2018) Contributions to the Study of Fake News in Portuguese: New Corpus and Automatic Detection Results. In: Villavicencio A. et al. (eds) Computational Processing of the Portuguese Language. PROPOR 2018. Lecture Notes in Computer Science, vol 11122. Springer, Cham.