## Privacy-preserving Fake News Detection
**Universidade de Brasília**<br>
School of Technology<br>
Graduate Program in Electrical Engineering (PPGEE)

### Author: Stefano M P C Souza (stefanomozart@ieee.org)<br> Author: Daniel G Silva<br>Author: Anderson C A Nascimento

# Clear-text Neural Network Benchmark

The goal of this research is to demonstrate how secure Multi-party Computation (MPC) protocols can provide privacy-preserving fake news detection. We use neural network models to classify texts; the MPC protocols can be applied during both the training and inference phases.

In this notebook we train and test these neural networks in the clear-text setting, that is, without any attempt to encrypt or otherwise protect the confidentiality of the datasets or models. These results serve as a benchmark for studying how running the same algorithms over MPC protocols affects the performance of the predictive models.

**Note:** you need to run the [Embeddings](./embeddings.ipynb) notebook beforehand, in order to encode the datasets with the chosen BERT embeddings.

**Concerning reproducibility:**

We make our best effort to present reproducible experiments. However, reproducibility cannot be guaranteed, because some PyTorch and CrypTen algorithms are non-deterministic. The cuDNN library, used by CUDA convolution operations, can also be a source of nondeterminism across multiple executions.
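One source of run-to-run variation that can be controlled is seeding. The training loop below shuffles the batch order with Python's `random` module, which `torch.manual_seed` does not seed. The following is a minimal sketch of a seeding helper; the name `seed_everything` is hypothetical and not part of this repository, and the cuDNN flags only reduce nondeterminism, they do not eliminate it.

```python
import random

try:
    import torch
except ImportError:  # lets the sketch run where PyTorch is not installed
    torch = None


def seed_everything(seed=42):
    """Seed the random number generators used in this notebook (hypothetical helper)."""
    random.seed(seed)  # drives random.shuffle over the batch order
    if torch is not None:
        torch.manual_seed(seed)  # seeds PyTorch's generators
        # Ask cuDNN for deterministic kernels and disable its autotuner
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False


# Two shuffles after the same seed produce the same order
seed_everything(42)
first = list(range(10))
random.shuffle(first)

seed_everything(42)
second = list(range(10))
random.shuffle(second)

print(first == second)  # True
```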

In [1]:
# Utilities
import os, sys, time, random
import pandas as pd
import numpy as np
import joblib
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

In [2]:
# PyTorch
import torch

torch.cuda.empty_cache()
torch.manual_seed(42)

<torch._C.Generator at 0x7fedfe4f4a50>

In [3]:
# Path to our models
sys.path.insert(0, os.path.abspath('../'))

# Our Convolutional Neural Networks
from models.cnn import CNN, CNN2, CNN3, CNN4, CNN5

# Our Deep Feed-Forward Neural Network
from models.fnn import FNN, FNN2, FNN3, FNN4, FNN5

# The CNN from [1]
from models.adams import BenchmarkCNN

models = [FNN, FNN2, FNN3, FNN4, FNN5, CNN, CNN2, CNN3, CNN4, CNN5, BenchmarkCNN]

In [4]:
# Experiment setup

NUM_EPOCHS = 10
LEARNING_RATE = 0.01
BATCH_SIZE = 50
OUTPUT_PATH = './output'
DATASET_HOME="/home/ppml/datasets"
DATASETS = ["liar", "sbnc", "fake.br", "factck.br"]

# See the Embeddings notebook (embeddings.ipynb) for how the datasets were encoded as embeddings
EMBEDDINGS = ["stsb-distilbert-base", "paraphrase-multilingual-mpnet-base-v2"]

## Training and testing

The following functions are used to load the datasets and train the neural networks.

In [5]:
# Model training
def train(model, model_name, optimizer, dataset, num_epochs, output_dir="", save_model=True):
    # Create output dir, if it does not exist
    if save_model:
        os.makedirs(output_dir or '.', exist_ok=True)

    best_loss = float('inf')  # any first validation loss counts as an improvement
    
    # Training
    batch_index = list(range(len(dataset['train'])))
    t = time.process_time()
    for epoch in range(num_epochs):
        random.shuffle(batch_index)
    
        model.train()
        losses = []
        criterion = torch.nn.CrossEntropyLoss()
        
        # We are going to train the model with batches of BATCH_SIZE elements
        for i in batch_index:
            optimizer.zero_grad()
            probs = model(dataset['train'][i])
            loss = criterion(probs, dataset['train_label'][i])
            losses.append(loss.item() / len(dataset['train'][i]))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            optimizer.step()
        
        epoch_loss = np.mean(losses)
        
        info = test(model, dataset['valid'], dataset['valid_label'])

        if save_model and info['loss'] < best_loss:
            torch.save(model, f"{output_dir}{model_name}.best.model")
            torch.save(info, f"{output_dir}{model_name}.best.info")
            best_loss = info['loss']

    train_runtime = time.process_time() - t

    if save_model:
        torch.save(model, f"{output_dir}{model_name}.last.model")

    # Model test
    t = time.process_time()
    info = test(model, dataset['test'], dataset['test_label'])    
    info['train_runtime'] = train_runtime 
    info['test_runtime'] = time.process_time() - t
    print(f" -- Training results: accuracy={info['accuracy']}, runtime={train_runtime}")

    return info  
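
The checkpointing rule inside `train` (keep the model whenever the validation loss strictly improves) can be isolated in a few lines. This is only an illustrative sketch: the function `best_epoch` and the loss values are made up, and the running best starts at `+inf` so that the first epoch always qualifies.

```python
import math


def best_epoch(valid_losses):
    """Return (epoch, loss) of the checkpoint kept by a save-on-improvement rule."""
    best_loss = math.inf  # assumption: the first epoch always counts as an improvement
    best = None
    for epoch, loss in enumerate(valid_losses):
        if loss < best_loss:  # strict improvement, as in train()
            best_loss, best = loss, epoch
    return best, best_loss


# Hypothetical validation losses for five epochs
print(best_epoch([0.9, 0.4, 0.5, 0.3, 0.35]))  # (3, 0.3)
```

Note that `train` evaluates the last-epoch model on the test set, not the best checkpoint, so the two can differ when the validation loss rises again in later epochs.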

In [6]:
# Model evaluation: cross-entropy loss, accuracy, F1-score and ROC AUC score
def test(model, X, y):
    model.eval()
    losses = []
    y_predicted = []
    y_true = []
    
    for bX, by in zip(X, y):
        #embeddings, labels = tuple(i.to(device) for i in batch)

        with torch.no_grad():
            probs = model(bX)

        criterion = torch.nn.CrossEntropyLoss()
        loss = criterion(probs, by)
        
        losses.append(loss.item()/len(bX))
        
        predicted = probs.max(1).indices
        
        y_predicted.extend(predicted.tolist())
        y_true.extend(by.tolist())

    return {
        'loss': np.mean(losses), 
        'accuracy': accuracy_score(y_true, y_predicted), 
        'f1_score': f1_score(y_true, y_predicted),
        'roc_auc': roc_auc_score(y_true, y_predicted)
    }

In [7]:
# Load a NumPy array, convert it to a torch tensor and split it into batches
def load_torch_split(path, dtype=None, batch_size=BATCH_SIZE):
    arr = np.load(path, allow_pickle=True)
    ten = torch.tensor(arr, dtype=dtype)
    return torch.split(ten, batch_size)
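
`load_torch_split` relies on `torch.split`, which cuts a tensor into consecutive chunks of `batch_size` rows and leaves a shorter final chunk when the length is not a multiple of the batch size. The sketch below mirrors that chunking rule on a plain Python list, so it runs without PyTorch; `split_like_torch` is a hypothetical name used only for illustration.

```python
def split_like_torch(items, batch_size):
    """Consecutive chunks of batch_size items; the last chunk may be shorter."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# 120 samples with a batch size of 50 give two full batches and one partial batch
batches = split_like_torch(list(range(120)), 50)
print([len(b) for b in batches])  # [50, 50, 20]
```

Because embeddings and labels are split with the same batch size, batch `i` of one lines up with batch `i` of the other. The training loop shuffles the order of these fixed batches, not the samples inside them.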

## Experiments 

In [8]:
# Load dataset labels
datasets = {}
for d in DATASETS:
    dtpath = f'{DATASET_HOME}/{d}'
    
    datasets[d] = {
        'name': d,
        'train_label': load_torch_split(f"{dtpath}/train.labels.npy", dtype=torch.long),
        'valid_label': load_torch_split(f"{dtpath}/valid.labels.npy", dtype=torch.long),
        'test_label': load_torch_split(f"{dtpath}/test.labels.npy", dtype=torch.long)
    }    


In [9]:
from torch.optim import AdamW

# load previous results, to add new experimental results
try:
    results = joblib.load('cleartext_benchmark.pyd')
except FileNotFoundError:
    results = pd.DataFrame()
    joblib.dump(results, 'cleartext_benchmark.pyd')

for d in datasets.keys():
    dtpath = f'{DATASET_HOME}/{d}'
    
    
    for e in EMBEDDINGS:
        datasets[d]['train'] = load_torch_split(f"{dtpath}/train.{e}.npy")
        datasets[d]['valid'] = load_torch_split(f"{dtpath}/valid.{e}.npy")
        datasets[d]['test'] = load_torch_split(f"{dtpath}/test.{e}.npy")

        output_path = f'{OUTPUT_PATH}/{d}/{e}/'
        
        for m in models:
            print("\n---------------------------------------------------------------------------------------")
            print(d, e, m.__name__)

            mdl = m()
            mdl_info = train(
                mdl,                                       # Model instance
                m.__name__,                                # Model name (to save)
                AdamW(mdl.parameters(), lr=LEARNING_RATE), # Optimizer
                datasets[d],                               # Dataset with train, valid and test sets
                NUM_EPOCHS,                                # number of epochs for training
                output_path                                # path to save trained models
            )

            mdl_info['dataset'] = d
            mdl_info['embedding'] = e
            mdl_info['model'] = m.__name__
            
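            # DataFrame.append was removed in pandas 2.0; concatenate a one-row frame instead
            results = pd.concat([results, pd.DataFrame([mdl_info])], ignore_index=True)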


---------------------------------------------------------------------------------------
liar stsb-distilbert-base FNN
 -- Training results: accuracy=0.5576397030089879, runtime=47.723126609999994

---------------------------------------------------------------------------------------
liar stsb-distilbert-base FNN2
 -- Training results: accuracy=0.5576397030089879, runtime=29.848463197

---------------------------------------------------------------------------------------
liar stsb-distilbert-base FNN3
 -- Training results: accuracy=0.5576397030089879, runtime=24.561286620999994

---------------------------------------------------------------------------------------
liar stsb-distilbert-base FNN4
 -- Training results: accuracy=0.5576397030089879, runtime=21.834679205

---------------------------------------------------------------------------------------
liar stsb-distilbert-base FNN5
 -- Training results: accuracy=0.5849941383352872, runtime=8.771627180999985

-----------------------

 -- Training results: accuracy=0.6757425742574258, runtime=13.970093046000102

---------------------------------------------------------------------------------------
sbnc paraphrase-multilingual-mpnet-base-v2 CNN4
 -- Training results: accuracy=0.6039603960396039, runtime=15.638607274000151

---------------------------------------------------------------------------------------
sbnc paraphrase-multilingual-mpnet-base-v2 CNN5
 -- Training results: accuracy=0.6039603960396039, runtime=15.419316557000002

---------------------------------------------------------------------------------------
sbnc paraphrase-multilingual-mpnet-base-v2 BenchmarkCNN
 -- Training results: accuracy=0.6039603960396039, runtime=12.223547996999969

---------------------------------------------------------------------------------------
fake.br stsb-distilbert-base FNN
 -- Training results: accuracy=0.5, runtime=27.754292634999956

-----------------------------------------------------------------------------------

 -- Training results: accuracy=0.7908745247148289, runtime=1.088542069000141

---------------------------------------------------------------------------------------
factck.br paraphrase-multilingual-mpnet-base-v2 CNN
 -- Training results: accuracy=0.779467680608365, runtime=5.414315801000157

---------------------------------------------------------------------------------------
factck.br paraphrase-multilingual-mpnet-base-v2 CNN2
 -- Training results: accuracy=0.779467680608365, runtime=9.906651195999984

---------------------------------------------------------------------------------------
factck.br paraphrase-multilingual-mpnet-base-v2 CNN3
 -- Training results: accuracy=0.779467680608365, runtime=8.780549404999874

---------------------------------------------------------------------------------------
factck.br paraphrase-multilingual-mpnet-base-v2 CNN4
 -- Training results: accuracy=0.7680608365019012, runtime=10.547003724000206

-------------------------------------------------

### Best accuracy on the test set, for each dataset

In [10]:
results.sort_values(by=['accuracy'], ascending=False).groupby(by='dataset').first()

| dataset | accuracy | embedding | f1_score | loss | model | roc_auc | test_runtime | train_runtime |
|---|---|---|---|---|---|---|---|---|
| factck.br | 0.821293 | paraphrase-multilingual-mpnet-base-v2 | 0.891455 | 0.01882 | CNN4 | 0.670201 | 0.093772 | 9.921069 |
| fake.br | 0.826389 | paraphrase-multilingual-mpnet-base-v2 | 0.828297 | 0.008419 | CNN | 0.826389 | 0.209264 | 27.105803 |
| liar | 0.613521 | paraphrase-multilingual-mpnet-base-v2 | 0.586711 | 0.014161 | CNN | 0.614205 | 0.431463 | 52.808952 |
| sbnc | 0.737624 | paraphrase-multilingual-mpnet-base-v2 | 0.804428 | 0.021291 | CNN2 | 0.696721 | 0.128258 | 14.871974 |


### Best ROC AUC

In [11]:
results.sort_values(by=['roc_auc'], ascending=False).groupby(by='dataset').first()

| dataset | accuracy | embedding | f1_score | loss | model | roc_auc | test_runtime | train_runtime |
|---|---|---|---|---|---|---|---|---|
| factck.br | 0.775665 | paraphrase-multilingual-mpnet-base-v2 | 0.852868 | 0.033022 | CNN | 0.704522 | 0.061667 | 5.629209 |
| fake.br | 0.826389 | paraphrase-multilingual-mpnet-base-v2 | 0.828297 | 0.008419 | CNN | 0.826389 | 0.209264 | 27.105803 |
| liar | 0.613521 | paraphrase-multilingual-mpnet-base-v2 | 0.586711 | 0.014161 | CNN | 0.614205 | 0.431463 | 52.808952 |
| sbnc | 0.727723 | paraphrase-multilingual-mpnet-base-v2 | 0.773663 | 0.099958 | FNN5 | 0.716496 | 0.017222 | 1.630381 |


### Best F1-score

In [12]:
results.sort_values(by=['f1_score'], ascending=False).groupby(by='dataset').first()

| dataset | accuracy | embedding | f1_score | loss | model | roc_auc | test_runtime | train_runtime |
|---|---|---|---|---|---|---|---|---|
| factck.br | 0.821293 | paraphrase-multilingual-mpnet-base-v2 | 0.891455 | 0.01882 | CNN4 | 0.670201 | 0.093772 | 9.921069 |
| fake.br | 0.826389 | paraphrase-multilingual-mpnet-base-v2 | 0.828297 | 0.008419 | CNN | 0.826389 | 0.209264 | 27.105803 |
| liar | 0.575615 | paraphrase-multilingual-mpnet-base-v2 | 0.61131 | 0.014434 | CNN | 0.594097 | 0.411986 | 49.048951 |
| sbnc | 0.737624 | paraphrase-multilingual-mpnet-base-v2 | 0.804428 | 0.021291 | CNN2 | 0.696721 | 0.128258 | 14.871974 |


### Average metrics

In [13]:
results['avg'] = (results.accuracy+results.roc_auc+results.f1_score)/3
results.sort_values(by=['avg'], ascending=False).groupby(by='dataset').first()

| dataset | accuracy | embedding | f1_score | loss | model | roc_auc | test_runtime | train_runtime | avg |
|---|---|---|---|---|---|---|---|---|---|
| factck.br | 0.821293 | paraphrase-multilingual-mpnet-base-v2 | 0.891455 | 0.01882 | CNN4 | 0.670201 | 0.093772 | 9.921069 | 0.794316 |
| fake.br | 0.826389 | paraphrase-multilingual-mpnet-base-v2 | 0.828297 | 0.008419 | CNN | 0.826389 | 0.209264 | 27.105803 | 0.827025 |
| liar | 0.613521 | paraphrase-multilingual-mpnet-base-v2 | 0.586711 | 0.014161 | CNN | 0.614205 | 0.431463 | 52.808952 | 0.604812 |
| sbnc | 0.737624 | paraphrase-multilingual-mpnet-base-v2 | 0.804428 | 0.021291 | CNN2 | 0.696721 | 0.128258 | 14.871974 | 0.746258 |
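
The `avg` column is the unweighted mean of accuracy, ROC AUC and F1-score. As a quick sanity check, the sketch below recomputes it from the rounded values shown in the table above; the helper name `avg_metric` is hypothetical, and a small tolerance is needed because the displayed inputs are rounded to six decimals.

```python
def avg_metric(accuracy, roc_auc, f1_score):
    """Unweighted mean of the three metrics, as in the `avg` column."""
    return (accuracy + roc_auc + f1_score) / 3


# (accuracy, roc_auc, f1_score, reported avg), copied from the table above
rows = {
    "factck.br": (0.821293, 0.670201, 0.891455, 0.794316),
    "fake.br": (0.826389, 0.826389, 0.828297, 0.827025),
    "liar": (0.613521, 0.614205, 0.586711, 0.604812),
    "sbnc": (0.737624, 0.696721, 0.804428, 0.746258),
}

for name, (acc, auc, f1, reported) in rows.items():
    # Inputs are rounded to six decimals, so allow a small tolerance
    assert abs(avg_metric(acc, auc, f1) - reported) < 2e-6, name
```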


### Average metrics, by dataset and model

In [14]:
results.groupby(by=['dataset', 'model']).mean(numeric_only=True).sort_values(by=['dataset', 'avg'], ascending=False)

| dataset | model | accuracy | f1_score | loss | roc_auc | test_runtime | train_runtime | avg |
|---|---|---|---|---|---|---|---|---|
| sbnc | FNN5 | 0.692915 | 0.74866 | 0.06699 | 0.675173 | 0.019405 | 1.723053 | 0.705582 |
| sbnc | CNN | 0.665687 | 0.710086 | 0.036213 | 0.657137 | 0.079116 | 8.565199 | 0.677637 |
| sbnc | CNN3 | 0.650062 | 0.718303 | 0.031806 | 0.625845 | 0.142188 | 13.383544 | 0.664737 |
| sbnc | CNN2 | 0.653929 | 0.718829 | 0.038481 | 0.615263 | 0.130994 | 14.816163 | 0.662674 |
| sbnc | CNN5 | 0.638769 | 0.721105 | 0.03203 | 0.601098 | 0.162941 | 15.710287 | 0.653657 |
| sbnc | FNN4 | 0.624536 | 0.712879 | 0.030326 | 0.569278 | 0.029512 | 3.994367 | 0.635564 |
| sbnc | CNN4 | 0.618193 | 0.737488 | 0.038223 | 0.541099 | 0.132619 | 15.553385 | 0.63226 |
| sbnc | FNN3 | 0.615254 | 0.730508 | 0.031378 | 0.542364 | 0.028126 | 4.089505 | 0.629375 |
| sbnc | FNN | 0.610613 | 0.753654 | 0.031611 | 0.512164 | 0.040277 | 7.392143 | 0.625477 |
| sbnc | BenchmarkCNN | 0.604785 | 0.753475 | 0.031862 | 0.501042 | 0.094214 | 12.177314 | 0.619767 |


### Average metrics, by model

In [15]:
results.groupby(by=['model']).mean(numeric_only=True).sort_values(by=['avg'], ascending=False)

| model | accuracy | f1_score | loss | roc_auc | test_runtime | train_runtime | avg |
|---|---|---|---|---|---|---|---|
| FNN5 | 0.707661 | 0.719713 | 0.033067 | 0.666273 | 0.04189 | 4.602997 | 0.697882 |
| CNN | 0.692008 | 0.648779 | 0.021003 | 0.649159 | 0.196338 | 23.440761 | 0.663315 |
| CNN2 | 0.672975 | 0.608704 | 0.020919 | 0.611389 | 0.347546 | 40.825454 | 0.631023 |
| CNN3 | 0.646258 | 0.585373 | 0.019018 | 0.588447 | 0.373752 | 35.670399 | 0.606693 |
| CNN5 | 0.647352 | 0.551262 | 0.019235 | 0.58293 | 0.433282 | 42.396833 | 0.593848 |
| FNN4 | 0.650299 | 0.545611 | 0.018849 | 0.571337 | 0.061992 | 11.198415 | 0.589082 |
| FNN3 | 0.64196 | 0.500404 | 0.020173 | 0.562855 | 0.059613 | 11.706963 | 0.568406 |
| CNN4 | 0.637834 | 0.483058 | 0.024613 | 0.557698 | 0.349181 | 43.531 | 0.55953 |
| FNN2 | 0.611218 | 0.512054 | 0.018794 | 0.5 | 0.071906 | 16.523024 | 0.541091 |
| FNN | 0.612881 | 0.491363 | 0.018775 | 0.503041 | 0.092157 | 22.973005 | 0.535761 |


## Models selected for PPML experiments

The results above show that the CNN and FNN5 models achieve the top average metrics for most datasets, so we are going to use these two models in the next step of our research.

## Comparison with the benchmark CNN from the literature

Below are the results for the CNN found in the literature [1]. Note that it does not rank first on any of the metrics displayed above.

In [16]:
results[results.model=='BenchmarkCNN'].groupby(by=['dataset', 'embedding']).mean(numeric_only=True).sort_values(by=['accuracy'], ascending=False)

| dataset | embedding | accuracy | f1_score | loss | roc_auc | test_runtime | train_runtime | avg |
|---|---|---|---|---|---|---|---|---|
| factck.br | paraphrase-multilingual-mpnet-base-v2 | 0.78327 | 0.878465 | 0.014426 | 0.5 | 0.064148 | 7.67271 | 0.720578 |
| factck.br | stsb-distilbert-base | 0.78327 | 0.878465 | 0.014147 | 0.5 | 0.069881 | 7.794583 | 0.720578 |
| sbnc | stsb-distilbert-base | 0.605611 | 0.753864 | 0.031897 | 0.502083 | 0.092576 | 12.166 | 0.620519 |
| sbnc | paraphrase-multilingual-mpnet-base-v2 | 0.60396 | 0.753086 | 0.031827 | 0.5 | 0.095853 | 12.188628 | 0.619016 |
| liar | paraphrase-multilingual-mpnet-base-v2 | 0.55764 | 0.0 | 0.014993 | 0.5 | 0.551043 | 71.85415 | 0.352547 |
| liar | stsb-distilbert-base | 0.55764 | 0.0 | 0.014985 | 0.5 | 0.520885 | 71.356444 | 0.352547 |
| fake.br | paraphrase-multilingual-mpnet-base-v2 | 0.5 | 0.222222 | 0.013986 | 0.5 | 0.294154 | 40.256484 | 0.407407 |
| fake.br | stsb-distilbert-base | 0.5 | 0.444444 | 0.01399 | 0.5 | 0.302599 | 40.171994 | 0.481481 |
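
Several rows above combine a ROC AUC of exactly 0.5 with an F1-score of either 0 or roughly `2a / (1 + a)`, where `a` is the accuracy. That pattern is consistent with a model that predicts a single class for every sample, although this notebook does not verify that this is what happened. The sketch below shows the effect on toy labels (not taken from any of our datasets). It uses plain Python instead of scikit-learn and relies on the fact that, for hard 0/1 predictions, the area under the ROC curve reduces to the mean of sensitivity and specificity. The helper `binary_metrics` is hypothetical and assumes both classes occur in `y_true`.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, F1 and hard-label ROC AUC for binary labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # F1 is taken as 0 when there are no true positives
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    # With hard predictions the ROC curve has one interior point, so its
    # area is the mean of the true-positive and true-negative rates.
    auc = (tp / (tp + fn) + tn / (tn + fp)) / 2
    return accuracy, f1, auc


y_true = [1] * 8 + [0] * 2  # toy labels, 80% positive

always_pos = binary_metrics(y_true, [1] * 10)
always_neg = binary_metrics(y_true, [0] * 10)

print(always_pos)  # accuracy 0.8, F1 = 2 * 0.8 / 1.8, AUC 0.5
print(always_neg)  # accuracy 0.2, F1 0.0, AUC 0.5

# The factck.br rows fit the always-positive pattern: 2a / (1 + a) with a = 0.78327
assert abs(2 * 0.78327 / (1 + 0.78327) - 0.878465) < 1e-5
```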


In [17]:
joblib.dump(results, 'cleartext_benchmark.pyd')

['cleartext_benchmark.pyd']

## References

[1] S. Adams, D. Melanson, M. De Cock, Private text classification with convolutional neural networks, in: Proceedings of the Third Workshop on Privacy in Natural Language Processing, Association for Computational Linguistics, Online, 2021, pp. 53–58. doi:10.18653/v1/2021.privatenlp-1.7.
