## Privacy-preserving Fake News Detection
**Universidade de Brasília**<br>
School of Technology<br>
Graduate Program in Electrical Engineering (PPGEE)

### Author: Stefano M P C Souza (stefanomozart@ieee.org)<br> Author: Daniel G Silva<br>Author: Anderson C A Nascimento

# Clear-text Neural Network Benchmark

The goal of this research is to demonstrate how secure Multi-party Computation (MPC) protocols can provide privacy-preserving fake news detection. We use neural network models to classify texts. MPC protocols can be applied during both the training and the inference phases.

In this notebook we train and test these neural networks in the clear-text setting, that is, without any attempt to encrypt the datasets or models or otherwise keep them confidential. We use these results as a benchmark to study how running the same algorithms over MPC protocols affects the performance of the predictive models.

In [1]:
# Utilities
import os, sys, time, random
import pandas as pd
import numpy as np
import joblib

In [2]:
# PyTorch
import torch
from torch.optim import AdamW

In [3]:
# Metrics
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

In [4]:
# Path to our models
# Add the parent directory to the import path (a nested insert would also add None to sys.path)
sys.path.insert(0, os.path.abspath('../'))

# Our Convolution-LSTM Neural Network: consists of a convolution followed by an LSTM
from models.clstm import CLSTM

# Our Recurrent Neural Network: consists of an LSTM followed by two dense layers
from models.rnn import RNN

# The CNN from [1]
from models.adams import CNN

models = [
    {
        'name': 'clstm',
        'model': CLSTM
    },
    {
        'name': 'rnn',
        'model': RNN
    },
    {
        'name': 'addams',
        'model': CNN
    }
]
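
The model classes are imported from the project's `models` package, and their definitions are not shown in this notebook. As a rough illustration of the "convolution followed by an LSTM" idea, the sketch below treats each sentence embedding as a one-channel sequence. The class name `SketchCLSTM` and every layer size (768-dimensional input, 8 channels, kernel size 5, hidden size 32) are assumptions made for this example; it is not the authors' `CLSTM`.

```python
import torch
from torch import nn


class SketchCLSTM(nn.Module):
    """Hypothetical conv + LSTM classifier over fixed-size sentence embeddings."""

    def __init__(self, channels=8, kernel_size=5, hidden_size=32, num_classes=2):
        super().__init__()
        self.conv = nn.Conv1d(1, channels, kernel_size)   # 1 input channel
        self.pool = nn.MaxPool1d(4)                        # shorten the sequence
        self.lstm = nn.LSTM(channels, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = x.unsqueeze(1)               # (batch, 1, embedding_dim)
        x = torch.relu(self.conv(x))     # (batch, channels, length)
        x = self.pool(x)                 # (batch, channels, length // 4)
        x = x.transpose(1, 2)            # (batch, steps, channels) for the LSTM
        _, (h_n, _) = self.lstm(x)       # h_n: (1, batch, hidden_size)
        return self.out(h_n[-1])         # unnormalized class scores (logits)


# A batch of 4 random 768-dimensional "embeddings" (assumed size)
sketch_logits = SketchCLSTM()(torch.randn(4, 768))
print(sketch_logits.shape)
```

The output is one pair of logits per input row, which is the shape `torch.nn.CrossEntropyLoss` expects in the training loop.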

## Training and testing

The following functions are used to load the datasets and train the neural networks.

In [5]:
# Split datasets in batches of this size  
BATCH_SIZE = 50

# Use the CPU, since we are going to use only CPUs while running the MPC protocols 
# (A limitation we faced, because we could not get 4 GPU instances in any major cloud provider)
DEVICE = torch.device("cpu")

In [6]:
# Model training and testing
def train_test(model, model_name, optimizer, dataset, num_epochs, device, output_dir="", save_model=True):
    # Create output dir, if it does not exist
    if save_model:
        # Models are saved directly under output_dir, so create that directory itself
        os.makedirs(output_dir or '.', exist_ok=True)

    # Start from +inf so the first validation loss is always recorded as the best
    best_loss = float('inf')
    
    # Training
    batch_index = [i for i in range(len(dataset['train']))]
    t = time.process_time()
    for epoch in range(num_epochs):
        random.shuffle(batch_index)
    
        model.train()
        losses = []
        criterion = torch.nn.CrossEntropyLoss()
        
        # We are going to train the model with batches of BATCH_SIZE elements
        for i in batch_index:
            optimizer.zero_grad()
            probs = model(dataset['train'][i])
            loss = criterion(probs, dataset['train_label'][i])
            losses.append(loss.item() / len(dataset['train'][i]))
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1)
            optimizer.step()
        
        epoch_loss = np.mean(losses)
        print(f"Training loss on epoch {epoch}: {epoch_loss}")
        
        info = evaluate(model, dataset['valid'], dataset['valid_label'])

        if save_model and info['loss'] < best_loss:
            torch.save(model, f"{output_dir}{model_name}.best.model")
            torch.save(info, f"{output_dir}{model_name}.best.info")
            
            best_loss = info['loss']

    train_runtime = time.process_time() - t
    
    if save_model:
        torch.save(model, f"{output_dir}{model_name}.last.model")
    
    # Testing
    t = time.process_time()
    info = evaluate(model, dataset['test'], dataset['test_label'])
    # Note: this replaces the test loss with the per-batch training losses of the last epoch
    info['loss'] = losses
    info['train_runtime'] = train_runtime 
    info['test_runtime'] = time.process_time() - t
    
    return info  
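
A note on the scale of the losses printed by `train_test`: `torch.nn.CrossEntropyLoss` already averages over the batch by default (`reduction='mean'`), and the loop divides that average by the batch length once more. The reported values are therefore roughly the mean cross-entropy divided by `BATCH_SIZE`. The sketch below works through the arithmetic for a chance-level classifier, assuming two classes (consistent with the binary `f1_score` call) and full batches of 50; the helper `to_mean_cross_entropy` is a hypothetical name introduced here.

```python
import math

BATCH_SIZE = 50  # same value as in the notebook

# Mean cross-entropy of a binary classifier that assigns probability 0.5 to each class
chance_level_ce = -math.log(0.5)  # ln 2, about 0.693

# train_test divides the batch-averaged loss by the batch length a second time
reported_chance_loss = chance_level_ce / BATCH_SIZE


def to_mean_cross_entropy(reported_loss, batch_size=BATCH_SIZE):
    """Undo the extra division to recover the usual per-sample cross-entropy."""
    return reported_loss * batch_size


print(round(reported_chance_loss, 6))                          # about 0.0139
print(round(to_mean_cross_entropy(0.013427250717575126), 4))   # about 0.67
```

Under these assumptions, a first-epoch training loss near 0.0134 corresponds to a mean cross-entropy of about 0.67, just below the chance level of ln 2. The rescaling is only approximate when the last batch holds fewer than 50 samples.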

In [7]:
# Model evaluation: cross-entropy loss, accuracy, F1-score and ROC AUC score
def evaluate(model, X, y):
    model.eval()
    losses = []
    y_predicted = []
    y_true = []
    
    for bX, by in zip(X, y):
        #embeddings, labels = tuple(i.to(device) for i in batch)

        with torch.no_grad():
            probs = model(bX)

        criterion = torch.nn.CrossEntropyLoss()
        loss = criterion(probs, by)
        
        losses.append(loss.item()/len(bX))
        
        predicted = probs.max(1).indices
        
        y_predicted.extend(predicted.tolist())
        y_true.extend(by.tolist())
    
    return {
        'loss': np.mean(losses), 
        'accuracy': accuracy_score(y_true, y_predicted), 
        'f1_score': f1_score(y_true, y_predicted),
        'roc_auc': roc_auc_score(y_true, y_predicted)
    }
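
`evaluate` passes the hard class predictions (`probs.max(1).indices`) to `roc_auc_score`. With 0/1 predictions the ROC curve has a single operating point, so the score reduces to the mean of the true positive and true negative rates. A common alternative is to rank samples by the predicted probability of class 1. The pure-Python sketch below contrasts the two on made-up logits; `softmax_pos` and `pairwise_auc` are hypothetical helpers written for this example, and the numbers are not taken from the experiments.

```python
import math


def softmax_pos(logits):
    """Probability of class 1 from a (class 0, class 1) pair of logits."""
    m = max(logits)
    e0, e1 = (math.exp(v - m) for v in logits)
    return e1 / (e0 + e1)


def pairwise_auc(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up logits and labels, for illustration only
logits = [(2.0, -1.0), (0.2, 0.1), (-0.5, 1.5), (0.3, 0.4), (1.0, 0.0)]
y_true = [0, 1, 1, 0, 0]

hard = [int(l1 > l0) for l0, l1 in logits]   # what evaluate() feeds to the metric
soft = [softmax_pos(pair) for pair in logits]  # class-1 probabilities

auc_hard = pairwise_auc(y_true, hard)
auc_soft = pairwise_auc(y_true, soft)
print(auc_hard, auc_soft)
```

On this toy data the hard-label score equals (TPR + TNR) / 2, while the probability-based score also credits correct rankings among samples that fall on the same side of the decision threshold.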

In [8]:
# Load a numpy array, convert to torch tensor and split in batches
def load_torch_split(path, dtype=None, batch_size=BATCH_SIZE):
    arr = np.load(path, allow_pickle=True)
    ten = torch.tensor(arr, dtype=dtype)
    return torch.split(ten, batch_size)
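
As a small illustration of what `load_torch_split` returns, `torch.split` with an integer argument cuts the tensor along its first dimension into chunks of that size, and the last chunk is smaller when the number of rows is not a multiple of it. The toy tensor below (6 rows, chunk size 4) is an assumption for the example, not one of the datasets.

```python
import torch

# Toy data: 6 samples with 2 features each
samples = torch.arange(12, dtype=torch.float32).reshape(6, 2)

# Same call as in load_torch_split, with a chunk size of 4 instead of BATCH_SIZE
batches = torch.split(samples, 4)

batch_sizes = [len(b) for b in batches]
print(batch_sizes)  # a full batch of 4, then the remaining 2 rows
```

Because labels and embeddings are split with the same batch size, batch `i` of the labels lines up with batch `i` of the embeddings, which the training loop relies on.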

## Experiments 

Check the [Embeddings](./embeddings.ipynb) notebook to see how the datasets were encoded to embeddings.

In [9]:
# Experiments 
# - (check the Embeddings (embeddings.ipynb) notebook to see how the datasets were encoded to embeddings)
embeddings = ["stsb-distilbert-base", "paraphrase-multilingual-mpnet-base-v2"]

# Load dataset labels
datasets = {}
DATASET_HOME="/home/dev/datasets"
for d in ["liar", "sbnc", "fake.br", "factck.br"]:
    dtpath = f'{DATASET_HOME}/{d}'
    
    datasets[d] = {
        'name': d,
        'train_label': load_torch_split(f"{dtpath}/train.labels.npy", dtype=torch.long),
        'valid_label': load_torch_split(f"{dtpath}/valid.labels.npy", dtype=torch.long),
        'test_label': load_torch_split(f"{dtpath}/test.labels.npy", dtype=torch.long)
    }    


In [10]:
info = pd.DataFrame()
for d in datasets.keys():
    dtpath = f'{DATASET_HOME}/{d}'
    for e in embeddings:
        datasets[d]['train'] = load_torch_split(f"{dtpath}/train.{e}.npy")
        datasets[d]['valid'] = load_torch_split(f"{dtpath}/valid.{e}.npy")
        datasets[d]['test'] = load_torch_split(f"{dtpath}/test.{e}.npy")

        output_path = f'./out/{d}/{e}/'
        
        for m in models:
            print("\n---------------------------------------------------------------------------------------")
            print(d, e, m['name'])

            mdl = m['model']()
            mdl_info = pd.DataFrame([
                train_test(
                    mdl,                               # Model instance
                    m['name'],                         # Model name (to save)
                    AdamW(mdl.parameters(), lr=0.002), # Optimizer
                    datasets[d],                       # Dataset with train, valid and test sets
                    10,                                # number of epochs for training
                    DEVICE,                            # torch device (CPU, see DEVICE above)
                    output_path                        # path to save trained models
                )
            ])
            mdl_info['dataset'] = d
            mdl_info['embedding'] = e
            mdl_info['model'] = m['name']
            
            # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
            info = pd.concat([info, mdl_info], ignore_index=True)


---------------------------------------------------------------------------------------
liar stsb-distilbert-base clstm
Training loss on epoch 0: 0.013427250717575126
Training loss on epoch 1: 0.013086420212473188
Training loss on epoch 2: 0.012881850386330477
Training loss on epoch 3: 0.012653365045370542
Training loss on epoch 4: 0.012330577087423113
Training loss on epoch 5: 0.011763631327642382
Training loss on epoch 6: 0.01101804677306152
Training loss on epoch 7: 0.010106496402818568
Training loss on epoch 8: 0.009203898626856687
Training loss on epoch 9: 0.008292815872964545

---------------------------------------------------------------------------------------
liar stsb-distilbert-base rnn
Training loss on epoch 0: 0.013524681331803989
Training loss on epoch 1: 0.013077417391933216
Training loss on epoch 2: 0.012856592937836664
Training loss on epoch 3: 0.012583929917122844
Training loss on epoch 4: 0.012129001305909107
Training loss on epoch 5: 0.011426794965611931
Training 

Training loss on epoch 7: 0.009026928334146417
Training loss on epoch 8: 0.008619134557503527
Training loss on epoch 9: 0.008731904877449879

---------------------------------------------------------------------------------------
fake.br stsb-distilbert-base addams
Training loss on epoch 0: 0.014701225222438894
Training loss on epoch 1: 0.014642981845204548
Training loss on epoch 2: 0.014596125137421392
Training loss on epoch 3: 0.014023371922072543
Training loss on epoch 4: 0.013369279942845782
Training loss on epoch 5: 0.013150638393176501
Training loss on epoch 6: 0.012639831035367904
Training loss on epoch 7: 0.012571871774170988
Training loss on epoch 8: 0.012412279584715443
Training loss on epoch 9: 0.012463278401923437

---------------------------------------------------------------------------------------
fake.br paraphrase-multilingual-mpnet-base-v2 clstm
Training loss on epoch 0: 0.010605422014831214
Training loss on epoch 1: 0.009041427636659273
Training loss on epoch 2: 0.0

In [11]:
info.sort_values(by=['dataset', 'accuracy'], ascending=False)

|  | loss | accuracy | f1_score | roc_auc | train_runtime | test_runtime | dataset | embedding | model |
|---|---|---|---|---|---|---|---|---|---|
| 9 | [0.0076143044233322145, 0.0070717322826385496,... | 0.712871 | 0.743363 | 0.719262 | 21.672421 | 0.111861 | sbnc | paraphrase-multilingual-mpnet-base-v2 | clstm |
| 10 | [0.004112387001514435, 0.007928777933120728, 0... | 0.705446 | 0.745182 | 0.703432 | 17.487736 | 0.081008 | sbnc | paraphrase-multilingual-mpnet-base-v2 | rnn |
| 7 | [0.0028282147645950317, 0.005094188451766968, ... | 0.688119 | 0.723684 | 0.69124 | 17.522868 | 0.081122 | sbnc | stsb-distilbert-base | rnn |
| 8 | [0.012314567565917969, 0.014219032526016235, 0... | 0.65099 | 0.738404 | 0.607787 | 10.147545 | 0.080209 | sbnc | stsb-distilbert-base | addams |
| 6 | [0.007219417095184326, 0.0030452868342399596, ... | 0.631188 | 0.618926 | 0.666701 | 23.612051 | 0.112323 | sbnc | stsb-distilbert-base | clstm |
| 11 | [0.013074352741241455, 0.013525532484054565, 0... | 0.60396 | 0.753086 | 0.5 | 10.196623 | 0.079061 | sbnc | paraphrase-multilingual-mpnet-base-v2 | addams |
| 3 | [0.012171179056167603, 0.01265548586845398, 0.... | 0.61821 | 0.491411 | 0.597408 | 128.535008 | 1.048806 | liar | paraphrase-multilingual-mpnet-base-v2 | clstm |
| 4 | [0.010766549110412598, 0.012709614038467407, 0... | 0.614693 | 0.468177 | 0.590785 | 109.987746 | 0.513143 | liar | paraphrase-multilingual-mpnet-base-v2 | rnn |
| 0 | [0.0112392520904541, 0.006361594200134277, 0.0... | 0.599453 | 0.533455 | 0.590999 | 121.725314 | 0.660007 | liar | stsb-distilbert-base | clstm |
| 1 | [0.008791337609291077, 0.007702667713165284, 0... | 0.598671 | 0.51488 | 0.586555 | 109.002325 | 0.479136 | liar | stsb-distilbert-base | rnn |
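
One row above is worth a closer look: row 11 (sbnc, paraphrase-multilingual-mpnet-base-v2, addams) reports a ROC AUC of exactly 0.5 together with an F1-score well above its accuracy. That pattern is what a classifier that predicts the positive class for every sample would produce. The sketch below reproduces the three figures in pure Python; the class counts (244 positive, 160 negative) are an assumption chosen to be consistent with the reported numbers, since the notebook does not state the test set composition.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, F1 and hard-label ROC AUC (mean of TPR and TNR) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    f1 = 2 * tp / (2 * tp + fp + fn)
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return accuracy, f1, (tpr + tnr) / 2


# Assumed class counts, chosen to match the reported accuracy
y_true = [1] * 244 + [0] * 160
y_pred = [1] * len(y_true)  # constant predictor: always the positive class

const_acc, const_f1, const_auc = binary_metrics(y_true, y_pred)
print(round(const_acc, 5), round(const_f1, 6), const_auc)
```

Under these assumed counts the constant predictor yields an accuracy of about 0.60396, an F1-score of about 0.753086 and a ROC AUC of 0.5, matching the row. This suggests, without proving it, that this model collapsed to a single class on that dataset and embedding.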


Best model, as measured by accuracy on the test set, for each dataset

In [12]:
info.sort_values(by=['accuracy'], ascending=False).groupby(by='dataset').first()

| dataset | loss | accuracy | f1_score | roc_auc | train_runtime | test_runtime | embedding | model |
|---|---|---|---|---|---|---|---|---|
| factck.br | [0.005180470943450928, 0.004003062844276428, 0... | 0.809886 | 0.882075 | 0.681954 | 17.075449 | 0.06621 | paraphrase-multilingual-mpnet-base-v2 | clstm |
| fake.br | [0.0041114529967308045, 0.006204996705055237, ... | 0.825 | 0.82973 | 0.825 | 62.071632 | 0.27282 | paraphrase-multilingual-mpnet-base-v2 | rnn |
| liar | [0.012171179056167603, 0.01265548586845398, 0.... | 0.61821 | 0.491411 | 0.597408 | 128.535008 | 1.048806 | paraphrase-multilingual-mpnet-base-v2 | clstm |
| sbnc | [0.0076143044233322145, 0.0070717322826385496,... | 0.712871 | 0.743363 | 0.719262 | 21.672421 | 0.111861 | paraphrase-multilingual-mpnet-base-v2 | clstm |
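
The selection above relies on `sort_values(...).groupby(...).first()`, which returns, for each dataset, the first non-null value of every column after sorting. When no values are missing this is the top-ranked row. An alternative that selects whole rows explicitly uses `idxmax`. The sketch below applies it to four rows copied from the results (sbnc and liar only); the variable names are hypothetical.

```python
import pandas as pd

# Four rows taken from the results table above
rows = pd.DataFrame({
    "dataset": ["sbnc", "sbnc", "liar", "liar"],
    "model": ["clstm", "rnn", "clstm", "rnn"],
    "accuracy": [0.712871, 0.705446, 0.61821, 0.614693],
})

# Index label of the highest-accuracy row within each dataset
best_idx = rows.groupby("dataset")["accuracy"].idxmax()

# Pick those rows and index the result by dataset
best = rows.loc[best_idx].set_index("dataset")
print(best)
```

Swapping `"accuracy"` for another metric column, such as `"roc_auc"`, gives the corresponding ranking.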


Best model, as measured by the ROC AUC metric on the test set, for each dataset

In [13]:
info.sort_values(by=['roc_auc'], ascending=False).groupby(by='dataset').first()

| dataset | loss | accuracy | f1_score | roc_auc | train_runtime | test_runtime | embedding | model |
|---|---|---|---|---|---|---|---|---|
| factck.br | [0.005180470943450928, 0.004003062844276428, 0... | 0.809886 | 0.882075 | 0.681954 | 17.075449 | 0.06621 | paraphrase-multilingual-mpnet-base-v2 | clstm |
| fake.br | [0.0041114529967308045, 0.006204996705055237, ... | 0.825 | 0.82973 | 0.825 | 62.071632 | 0.27282 | paraphrase-multilingual-mpnet-base-v2 | rnn |
| liar | [0.012171179056167603, 0.01265548586845398, 0.... | 0.61821 | 0.491411 | 0.597408 | 128.535008 | 1.048806 | paraphrase-multilingual-mpnet-base-v2 | clstm |
| sbnc | [0.0076143044233322145, 0.0070717322826385496,... | 0.712871 | 0.743363 | 0.719262 | 21.672421 | 0.111861 | paraphrase-multilingual-mpnet-base-v2 | clstm |
