## Privacy-preserving Fake News Detection
**Universidade de Brasília**<br>
School of Technology<br>
Graduate Program in Electrical Engineering (PPGEE)

### Author: Stefano M P C Souza (stefanomozart@ieee.org)<br> Author: Daniel G Silva<br>Author: Anderson C A Nascimento

# Privacy-preserving Model Training and Inference Setup

Our general goal in this research work is to demonstrate how secure Multi-party Computation (MPC) protocols can enable privacy-preserving fake news detection. We use neural network models to classify news texts, and the MPC protocols can be applied during both the training and inference phases.

In this notebook we set up the files that will be used on each computing node. The generated folders and files are used for both the PPML training and PPML inference experiments.

**Notice:**
In order to run this setup you first need to generate the training, validation and test subsets for each dataset, as well as the embeddings used for text encoding. Refer to the [Classic NLP](./classic_nlp.ipynb) and the [BERT Based Embeddings](./embeddings.ipynb) notebooks for more details.
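The setup cell further down reads files named `train.labels.npy`, `train.<embedding>.npy` (and the `valid.` and `test.` equivalents) from `args.dataset_home/<dataset>`. A minimal sketch of a pre-flight check for those inputs could look like the following; the helper name `missing_inputs` is hypothetical and not part of the original notebook.

```python
from pathlib import Path

def missing_inputs(dataset_home, datasets, embeddings):
    """Return the expected .npy input files that are not present on disk.

    Hypothetical helper: the file naming mirrors what the setup cell loads.
    """
    missing = []
    for d in datasets:
        for split in ("train", "valid", "test"):
            # One label file plus one embedding file per encoder, for each split
            names = [f"{split}.labels.npy"]
            names += [f"{split}.{emb}.npy" for emb in embeddings]
            for name in names:
                path = Path(dataset_home) / d / name
                if not path.is_file():
                    missing.append(path)
    return missing
```

Calling `missing_inputs(args.dataset_home, args.datasets, args.embeddings)` before the setup cell and checking that the result is empty would surface a missing subset or embedding before any MPC process is spawned.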

In [1]:
# Utilities
import os, sys, time, types, joblib
import numpy as np

In [2]:
# PyTorch
import torch

In [3]:
# Path to your copy of CrypTen
sys.path.insert(0, os.path.abspath('/home/ppml/CrypTen/'))
import crypten

In [4]:
# - Computing parties
ALICE = 0   # Will train the model
BOB = 1     # Has the training set (and half of the test set for inference)
CHARLIE = 2 # Has the validation set (and the other half of the test set)

In [5]:
# - Experiment globals
args = types.SimpleNamespace()

# List of datasets used in the experiments
args.datasets = ["liar", "sbnc", "fake.br", "factck.br"]
# Sentence_Transformer embeddings used to encode the texts
args.embeddings = ["stsb-distilbert-base", "paraphrase-multilingual-mpnet-base-v2"]
# Path to the NLP preprocessed datasets. Refer to the `classic_nlp.ipynb` 
# notebook for more details
args.dataset_home = "/home/ppml/datasets"

# Where to save the temp files
args.output_path = './output'

In [6]:
#- The Deep Neural Network models that will be trained

# Path to our models
sys.path.insert(0, os.path.abspath('../'))

# Our Convolutional Neural Network
from models.cnn import CNN, CNN2, CNN3, CNN4, CNN5

# Our Deep Feed-Forward Neural Network
from models.fnn import FNN, FNN2, FNN3, FNN4, FNN5

args.models = [CNN, CNN2, CNN3, CNN4, CNN5, FNN, FNN2, FNN3, FNN4, FNN5]

In [7]:
# Now that we have the environment set up, we init crypten
crypten.init()
torch.set_num_threads(1)

In [8]:
#- Register dependencies
crypten.common.serial.register_safe_class(types.SimpleNamespace)
crypten.common.serial.register_safe_class(torch.nn.modules.activation.Tanh)
crypten.common.serial.register_safe_class(torch.nn.modules.pooling.MaxPool1d)
crypten.common.serial.register_safe_class(torch.nn.modules.dropout.Dropout)
crypten.common.serial.register_safe_class(torch.nn.modules.container.Sequential)

In [9]:
# Load a numpy array, convert to torch tensor and one-hot-encode labels
eye = torch.eye(2)
def load_torch(path, dtype=None):
    arr = np.load(path, allow_pickle=True)
    arr.setflags(write=True)
    ten = torch.tensor(arr, dtype=dtype)
    return eye[ten] if dtype==torch.long else ten
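The `eye[ten]` expression in `load_torch` one-hot encodes the labels by using each integer label as a row index into a 2x2 identity matrix. The sketch below shows the same indexing idea with NumPy so it runs without PyTorch; the label values are illustrative only.

```python
import numpy as np

# Row 0 of the identity matrix is [1, 0], row 1 is [0, 1]
eye = np.eye(2)

# Illustrative binary labels (not taken from any of the datasets)
labels = np.array([0, 1, 1, 0])

# Indexing with an integer array picks one row per label
one_hot = eye[labels]
print(one_hot.shape)
```

This only works when every label is a valid row index, which is why `load_torch` applies it solely to arrays loaded with `dtype=torch.long`.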

In [16]:
import crypten.mpc as mpc
import crypten.communicator as comm

# Run the preprocessing using CrypTen Multiprocess
@mpc.run_multiprocess(world_size=3)
def save_files(args):
    # Identify the party running this code
    rank = comm.get().get_rank()

    # Alice will save a dummy/untrained copy of the models with her key
    if args.experiment == 'ppml_training':
        os.makedirs(f"{args.output_path}/ppml_training/alice/plain", exist_ok=True)

        for model in args.models:
            crypten.save_from_party(
                model(),
                f"{args.output_path}/ppml_training/alice/plain/{model.__name__}.model", 
                src=ALICE
            )

    # Bob and Charlie will save the datasets
    for d in args.datasets:
        dtpath = f"{args.dataset_home}/{d}"

        bobpath = f"{args.output_path}/{args.experiment}/bob/{d}"
        os.makedirs(bobpath, exist_ok=True)
        charliepath = f"{args.output_path}/{args.experiment}/charlie/{d}"
        os.makedirs(charliepath, exist_ok=True)

        # PPML Training setup
        if args.experiment == 'ppml_training':
            # Train labels
            train_labels = load_torch(f"{dtpath}/train.labels.npy", dtype=torch.long)
            crypten.save_from_party(train_labels, f"{bobpath}/train.labels.ct", src=BOB)

            # Training embeddings
            for emb in args.embeddings:
                train_embeddings = load_torch(f"{dtpath}/train.{emb}.npy")
                crypten.save_from_party(train_embeddings, f"{bobpath}/train.{emb}.ct", src=BOB)

            # Validation labels
            valid_labels = load_torch(f"{dtpath}/valid.labels.npy", dtype=torch.long)
            crypten.save_from_party(valid_labels, f"{charliepath}/valid.labels.ct", src=CHARLIE)

            # Validation embeddings
            for emb in args.embeddings:
                valid_embeddings = load_torch(f"{dtpath}/valid.{emb}.npy")
                crypten.save_from_party(valid_embeddings, f"{charliepath}/valid.{emb}.ct", src=CHARLIE)
        
        # PPML Inference setup
        if args.experiment == 'ppml_inference':
            
            # Test labels
            test_labels = load_torch(f"{dtpath}/test.labels.npy", dtype=torch.long)
            half = len(test_labels)//2
            crypten.save_from_party(test_labels[:half,:], f"{bobpath}/test.labels.ct", src=BOB)
            crypten.save_from_party(test_labels[half:,:], f"{charliepath}/test.labels.ct", src=CHARLIE)

            # Test embeddings
            for emb in args.embeddings:
                test_embeddings = load_torch(f"{dtpath}/test.{emb}.npy")
                crypten.save_from_party(
                    test_embeddings[:half,:], 
                    f"{bobpath}/test.{emb}.ct",
                    src=BOB
                )
                crypten.save_from_party(
                    test_embeddings[half:,:], 
                    f"{charliepath}/test.{emb}.ct",
                    src=CHARLIE
                )


## PPML Training setup

In this scenario, we distribute the datasets and models according to our experiment design:
- Alice will have the untrained models;
- Bob will have the training set; and
- Charlie the validation set.

Run the cell below to prepare the data for the privacy-preserving model training experiments. After running this code locally, or on Alice's cloud node, send the folders under `OUTPUT_PATH/ppml_training` to the corresponding computing nodes.
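To know which folders to send to each node, it can help to list the files that `save_files` is expected to write. The following is a sketch under the assumption that the path templates in the setup cell are unchanged; `expected_outputs` is a hypothetical helper, and it only builds path strings without touching CrypTen or the disk.

```python
def expected_outputs(experiment, datasets, embeddings, model_names,
                     output_path="./output"):
    """List, per party, the files the setup cell is expected to produce."""
    files = {"alice": [], "bob": [], "charlie": []}
    base = f"{output_path}/{experiment}"

    if experiment == "ppml_training":
        # Alice keeps one untrained copy of each model
        files["alice"] = [f"{base}/alice/plain/{m}.model" for m in model_names]

    for d in datasets:
        if experiment == "ppml_training":
            # Bob holds the training split, Charlie the validation split
            files["bob"].append(f"{base}/bob/{d}/train.labels.ct")
            files["bob"] += [f"{base}/bob/{d}/train.{e}.ct" for e in embeddings]
            files["charlie"].append(f"{base}/charlie/{d}/valid.labels.ct")
            files["charlie"] += [f"{base}/charlie/{d}/valid.{e}.ct" for e in embeddings]
        elif experiment == "ppml_inference":
            # Bob and Charlie each hold one half of the test split
            for party in ("bob", "charlie"):
                files[party].append(f"{base}/{party}/{d}/test.labels.ct")
                files[party] += [f"{base}/{party}/{d}/test.{e}.ct" for e in embeddings]
    return files
```

For example, `expected_outputs("ppml_training", args.datasets, args.embeddings, [m.__name__ for m in args.models])` gives a per-party checklist that can be compared against the contents of `args.output_path` after the cell has run.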

In [11]:
args.experiment = 'ppml_training'

save_files(args)

[None, None, None]

## PPML Inference setup

In this scenario, we distribute the datasets and models according to our experiment design:
- Alice will have the encrypted trained models;
- Bob will have half of the test set; and
- Charlie the other half.
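The setup cell splits the test set with `half = len(test_labels)//2`, giving Bob the rows before `half` and Charlie the rows from `half` onward. A small sketch of that slicing on a plain list (the helper name `split_halves` and the sample data are illustrative only) shows what happens when the number of rows is odd:

```python
def split_halves(rows):
    """Split rows the same way the setup cell does: floor division, two slices."""
    half = len(rows) // 2
    return rows[:half], rows[half:]

# Seven illustrative rows: the first slice gets 3, the second gets 4
bob_part, charlie_part = split_halves(list(range(7)))
print(len(bob_part), len(charlie_part))
```

With an odd number of test examples, Charlie therefore holds one more example than Bob, and the two halves together always cover the whole test set.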

Run the cell below to prepare the data for the privacy-preserving inference experiments. After running this code on Alice's cloud node, send the folders under `OUTPUT_PATH/ppml_inference` to the corresponding computing nodes.

**Notice**: 
1. You need to run the PPML Training experiments before running the PPML Inference experiments, in order to generate Alice's trained, encrypted models.
2. Alternatively, you can encrypt models trained in the clear and place them under the `OUTPUT_PATH/ppml_training/alice/encrypted` folder on Alice's machine.

In [17]:
args.experiment = 'ppml_inference'

save_files(args)

[None, None, None]