# Building a Tri-gram Biomedical Language Model on Multiple Private Datasets



Let `Bob` and `Alice` be the owners of two hospitals. Each hospital holds a lot of data that could be used to train deep learning models that help doctors and medical staff save more lives. But the hospitals can't share their data with outside parties, because doing so would violate the privacy of their patients and is illegal in their countries.
They have approached you, a leading company in biomedical deep learning, to find a solution to their problem. You are experienced in using **PyTorch**, which is an **amazing deep learning framework** and is great for training DL models locally. But since you are dealing with datasets that you can't see and that live on another machine, you need something more than just PyTorch.

- You find out about **PySyft**, another amazing library, which extends PyTorch with tons of features, including sending tensors and models across machines and performing computation on them remotely.
    
    
    - But since you are dealing with text data, and can't directly feed text into models, you need something more.


- Then you find out about **SyferText**, which leverages PySyft and lets you define text processing pipelines that perform Natural Language Processing on remote private datasets while preserving their privacy.
    
    
    - And ... That's it! Now, you have all the tools.

Before reading this tutorial, if you are unfamiliar with the theory behind `Word Embeddings` and `N-gram language models`, I suggest you go through this awesome PyTorch tutorial: https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html

Awesome tutorials are one of the main reasons behind PyTorch's success as a DL framework, and we have tried our best to follow the same approach here.

You can also refer to the following tutorials to learn more about `Federated Learning`.

- 1. [Intro to federated learning](https://github.com/OpenMined/PySyft/blob/master/examples/tutorials/Part%2002%20-%20Intro%20to%20Federated%20Learning.ipynb)
- 2. [Federated learning via Trusted Aggregator](https://github.com/OpenMined/PySyft/blob/master/examples/tutorials/Part%2004%20-%20Federated%20Learning%20via%20Trusted%20Aggregator.ipynb)

### Import Dependencies

In [1]:
# SyferText imports
import syfertext
from syfertext.pipeline import SimpleTagger
from syfertext.workers.virtual import VirtualWorker
from syfertext.encdec import shared_prime, shared_base

# Import useful utility functions for this tutorial
from utils import download_dataset

# PySyft imports
import syft as sy
from syft.generic.string import String
from syft.generic.pointers.string_pointer import StringPointer

# PyTorch imports
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, Dataset
from torch.utils.tensorboard import SummaryWriter

# NLTK imports
import nltk
from nltk.corpus import stopwords

# Useful imports
import os
import tqdm
import sys
import re
import string
import random
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split




In [2]:
# Use the GPU for training when one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

### Download and load the dataset

We will train our Language Model on the [Medical Transcriptions dataset available on Kaggle](https://www.kaggle.com/tboyle10/medicaltranscriptions).

Thanks to [socd06](https://github.com/socd06/medical-nlp) for processing the data and extracting the stop words and vocab from it, although we will not be using the vocab.txt file.

In [3]:
# The dataset's root folder
root_path = os.path.join('.', 'medical-nlp')

# Clone the dataset repository if it is not already present
if not os.path.exists(root_path):
    !git clone https://github.com/socd06/medical-nlp/

data_path = os.path.join(root_path, 'data')
assert os.path.exists(data_path)

### Prepare work environment

We will simulate a work environment with the two hospitals and the awesome data science expert, me.

We will also have the secure_worker which will act as a **neutral trusted third-party** and will help us to generate a combined vocabulary for all the hospitals and also act as a gradient aggregator when training in a federated fashion.

In practice it can be a regular **AWS Instance**, connected to the private grid.

In [4]:
# Create a torch hook for PySyft
hook = sy.TorchHook(torch)

me = sy.local_worker
bob = VirtualWorker(hook, id='bob_hospital')
alice = VirtualWorker(hook, id='alice_hospital')

secure_worker = VirtualWorker(hook, id='secure_worker')



### Simulate Private Datasets

Now we will simulate two private datasets owned by two different hospitals, Bob_hospital and Alice_hospital.

1. Load the data from mtsamples.csv locally.
2. Split the dataset into training and validation datasets for each worker, thus creating four splits, namely `bob_train`, `alice_train`, `bob_val` and `alice_val`.
3. Send each element of the four datasets to the appropriate worker. We will then hold pointers to the data on the remote machines.

`NOTE:  In a real-world scenario, the datasets will be already present on the remote machines, and will be tagged by the grid, which makes them easy to search and load.`

In [5]:
# Load the public dataset
csv_path = os.path.join(data_path, 'mtsamples.csv')
dataset = pd.read_csv(csv_path, usecols=['transcription'], index_col=False).dropna()

print("Size of dataset", len(dataset))

dataset.head()

Size of dataset 4966


| | transcription |
|---|---|
| 0 | SUBJECTIVE:, This 23-year-old white female pr... |
| 1 | PAST MEDICAL HISTORY:, He has difficulty climb... |
| 2 | HISTORY OF PRESENT ILLNESS: , I have seen ABC ... |
| 3 | 2-D M-MODE: , ,1. Left atrial enlargement wit... |
| 4 | 1. The left ventricular cavity size and wall ... |


In [6]:
# TODO: Do it using SimpleTaggers ?
def preprocess(text):
    
    #remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

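    # remove list markers like "1." or "2."
    # NOTE: punctuation is already stripped above, so this pattern no longer
    # matches; run this step first if the markers should actually be removed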
    text = re.sub(r'\d+\.', '', text)

    # convert to lowercase
    text = text.lower()

    #remove whitespaces from beginning and end
    text = text.strip()
    
    return text
    
dataset['transcription'] = dataset['transcription'].apply(preprocess)
dataset.head()

| | transcription |
|---|---|
| 0 | subjective this 23yearold white female presen... |
| 1 | past medical history he has difficulty climbin... |
| 2 | history of present illness i have seen abc to... |
| 3 | 2d mmode 1 left atrial enlargement with left... |
| 4 | 1 the left ventricular cavity size and wall t... |


In [7]:
# Distribute the dataset to Bob and Alice
bob_dataset, alice_dataset = train_test_split(dataset, train_size = 0.5)

bob_train,   bob_val   = train_test_split(bob_dataset,   train_size=0.9)
alice_train, alice_val = train_test_split(alice_dataset, train_size=0.9)

print("Size of Bob's dataset    : Train: ", len(bob_train),  "\tValidation: ", len(bob_val))
print("Size of Alice's dataset  : Train: ", len(alice_train), "\tValidation: ", len(alice_val))

Size of Bob's dataset    : Train:  2234 	Validation:  249
Size of Alice's dataset  : Train:  2234 	Validation:  249


Combining their datasets gives us a much bigger dataset to train on.

In [8]:
print("Size of combined datasets : Train: {} \tValidation: {}".format(
                        len(bob_train) + len(alice_train),
                        len(bob_val) + len(alice_val))
     )

Size of combined datasets : Train: 4468 	Validation: 498


Now let's make the datasets remote.

In [9]:
def make_dataset_remote(dataset, worker):
    """A handy function to send the dataset to a remote worker.
    Returns a list of dictionaries containing pointers to the dataset
    examples on the remote worker.
    """
    remote_dataset = list()
    
    for index, example in dataset.iterrows():
        
        # Get the value for the current example
        trans = example['transcription']
        
        # Send it to the remote worker
        trans_pointer = String(trans).send(worker)
        
        # Store the pointer in dict
        remote_dataset.append({'transcription': trans_pointer})
    
    return remote_dataset

In [10]:
# Bob's remote datasets
bob_train = make_dataset_remote(bob_train, bob)
bob_val = make_dataset_remote(bob_val, bob)

# Alice's remote datasets
alice_train = make_dataset_remote(alice_train, alice)
alice_val = make_dataset_remote(alice_val, alice)

Now let's see what an element of Bob's dataset looks like.

In [11]:
example = bob_train[10]

trans = example['transcription']
print("Trans type:", type(trans))
print("Trans location:", trans.location)

Trans type: <class 'syft.generic.pointers.string_pointer.StringPointer'>
Trans location: <VirtualWorker id:bob_hospital #objects:2483>


As you can see, it is of type `StringPointer`, pointing to the real `String` object on Bob's machine.

## Create the Pipeline

To process the remote strings, we will initialize a `SyferText` Language object. At initialization, the Language object contains a `Pipeline` with only the `Tokenizer`. We can then add other components to the pipeline, for example a `SimpleTagger`, which can tag `Tokens`.

Refer to [this tutorial]() for detailed understanding of `SimpleTagger`.

In [12]:
nlp = syfertext.load('en_core_web_lg', owner=me)

# Check that the tokenizer is present in the pipeline
nlp.pipeline_template

[{'remote': True, 'name': 'tokenizer'}]

Now let's add a tagger for tagging stop words.

In [13]:
# Read the clinical stop words file
with open(os.path.join(data_path, 'clinical-stopwords.txt'), 'r') as f:
    stop_words = f.read().splitlines()

In [14]:
stopwords_tagger = SimpleTagger(attribute='is_stop', 
                                lookups=stop_words, 
                                tag=True)

# Add tagger with remote = True
# This allows the tagger to be sent to remote machines
nlp.add_pipe(stopwords_tagger, name='stopwords_tagger', remote=True)

# Check pipeline
nlp.pipeline_template

[{'remote': True, 'name': 'tokenizer'},
 {'remote': True, 'name': 'stopwords_tagger'}]

Great! The `StopWord Tagger` will allow us to tag stop-word tokens, which helps us leave them out while training our word embeddings.

## Process remote Strings using `SyferText` Pipeline

Now that we have our Pipeline ready, we can feed our remote Strings into the pipeline. Based on the components in Pipeline, the following processing steps take place:

1. `Tokenizer` breaks the strings into multiple tokens.
2. `Stop-word Tagger` tags the stop-word tokens based on the `lookup` that we have provided to it.

After each string is processed by the pipeline, we get back a pointer to the remote `Doc` object. This remote Doc holds all the tokens and exposes, via its pointer, methods that we can call on the Doc without violating the privacy of its contents.
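To make the two processing steps concrete, here is a minimal local sketch in plain Python. It is not SyferText code: `tokenize`, `tag_stop_words`, the toy stop-word list and the sample sentence are all hypothetical stand-ins, and the real pipeline runs these stages on the remote worker rather than locally.

```python
# Illustrative stand-in for the two pipeline stages; this is NOT SyferText code.

def tokenize(text):
    """Split an already preprocessed string into whitespace-separated tokens."""
    return text.split()


def tag_stop_words(tokens, lookups):
    """Pair every token with an `is_stop` flag based on a lookup set."""
    lookups = set(lookups)
    return [{"text": tok, "is_stop": tok in lookups} for tok in tokens]


# Hypothetical stop-word list and sentence, for illustration only
toy_stop_words = ["the", "of", "and", "with"]
toy_doc = tag_stop_words(tokenize("enlargement of the left atrium"), toy_stop_words)

# Downstream steps can now skip the tagged tokens
kept = [tok["text"] for tok in toy_doc if not tok["is_stop"]]
print(kept)
```

Tagging instead of deleting keeps the original token sequence intact, so later steps can decide for themselves which tags to exclude.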

In [15]:
def process_dataset(dataset, dataset_name, print_interval):
        
    for i, example in enumerate(dataset):
        
        # Process the current example
        example['transcription'] = nlp(example['transcription'])
        
        if (i+1)%print_interval == 0:
            print(dataset_name, '\t', f'{i+1}/{len(dataset)}', end='\r', flush=True)
    
    print(dataset_name, "Processed")

In [16]:
# Process bob_train dataset
# NOTE: THIS WILL TAKE TIME
process_dataset(bob_train, "bob_train", print_interval=100)

bob_train Processed


In [17]:
# Process alice_train dataset
# NOTE: THIS WILL TAKE TIME
process_dataset(alice_train, "alice_train", print_interval=100)

alice_train Processed


Now that we have processed our data and tagged our tokens, we can go ahead and create our vocabulary by combining the vocabularies of Bob's dataset and Alice's dataset.

## Create Combined Vocabulary

Now we have to create a `Combined Vocabulary` by combining the vocabularies of Bob and Alice, and assign each unique word in the combined vocabulary a unique index.

However, building a combined vocabulary requires us to know the set of words present in each dataset. This information may reveal certain aspects of the data: in our case, for example, it may reveal the names of the medicines and drugs used by each hospital, the machines they have, and other information, which violates the privacy of the hospitals involved.

To solve this problem we have a `Secure Worker`, a trusted third-party server that performs the `Private Set Intersection` of each worker's vocabulary for us. It also assigns each word in the combined vocabulary a unique index and returns the vocabulary, mapped to indices, back to the workers.
As mentioned before, `Secure Worker` can be a cloud server, such as an **AWS Instance**.
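The index assignment itself can be pictured with a small plain-Python sketch. It only shows the merge-and-index step on toy data: there is no encryption here, `build_combined_vocab` and the two toy vocabularies are hypothetical, and the way the secure worker actually protects the words in transit is not covered by this snippet.

```python
def build_combined_vocab(*vocabularies):
    """Merge several word sets and give every distinct word a unique index."""
    combined = sorted(set().union(*vocabularies))
    return {word: index for index, word in enumerate(combined)}


# Toy vocabularies, for illustration only
bob_vocab = {"aspirin", "fracture", "mri"}
alice_vocab = {"aspirin", "insulin", "mri"}

word_to_index = build_combined_vocab(bob_vocab, alice_vocab)
print(len(word_to_index))
print(word_to_index)
```

Words shared by both hospitals receive a single index, so both workers encode the same word identically, which is what lets one model train on both datasets.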

For more details, refer to this tutorial:

And thanks to this Stack Exchange answer: 

In [18]:
# Execute a Diffie-Hellman key exchange to encrypt the
# vocabularies of each worker
workers = [bob, alice]
secure_worker.execute_dh_key_exchange(shared_prime, shared_base, workers)
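For intuition, the textbook Diffie-Hellman exchange can be sketched with toy numbers. This is a generic illustration, not the internals of `execute_dh_key_exchange`: the helper names and the tiny prime and base below are assumptions chosen for readability and are far too small to be secure.

```python
import secrets


def dh_keypair(prime, base):
    """Pick a random private exponent and derive the matching public value."""
    private = secrets.randbelow(prime - 2) + 1  # in the range 1 .. prime - 2
    public = pow(base, private, prime)
    return private, public


def dh_shared_secret(their_public, my_private, prime):
    """Combine the other party's public value with our own private exponent."""
    return pow(their_public, my_private, prime)


# Toy public parameters, for illustration only
TOY_PRIME, TOY_BASE = 23, 5

bob_private, bob_public = dh_keypair(TOY_PRIME, TOY_BASE)
alice_private, alice_public = dh_keypair(TOY_PRIME, TOY_BASE)

# Only the public values are exchanged; each side derives the secret locally
bob_secret = dh_shared_secret(alice_public, bob_private, TOY_PRIME)
alice_secret = dh_shared_secret(bob_public, alice_private, TOY_PRIME)
print(bob_secret == alice_secret)
```

Both sides arrive at the same secret because `(base ** a) ** b` and `(base ** b) ** a` are equal modulo the prime, while an observer only ever sees the public values.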

Now we can gather together the vocabulary from each worker in a secure manner, and combine them to create a unified vocabulary.

We can also **exclude certain tokens from the combined vocabulary** if we are not going to use them for training the model.

In [19]:
# Exclude stop words from the combined vocabulary
excluded_tokens = {"is_stop": {True}}

# NOTE: THIS WILL ALSO TAKE SOME TIME

vocab_size = secure_worker.create_vocabulary(bob_train + alice_train, 'transcription', excluded_tokens)

print("vocab_size", vocab_size)

vocab_size 43912


## Why Federated Training?

Now that we have assigned a unique index to each word in the combined vocabulary, we could bring back the contents of the remote documents as sequences of their corresponding indices. Since we don't know the mapping from indices to words, we would not be able to read the contents of the documents.

This idea may seem right to you, but actually it is not. :)

`There are multiple ways to violate PRIVACY, there is only one way to maintain it.`

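Even when we bring back words mapped to their indices, we can perform a frequency attack to establish a relation between the indices and the corresponding words that they represent: word frequencies in natural language are highly skewed, so ranking the indices by how often they occur hints at which words they stand for. To avoid this, we will train our model in a `Federated` fashion.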
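A toy version of such a frequency attack, sketched in plain Python, shows why index-encoded text is not private. Everything here is hypothetical: the secret mapping, the encoded corpus and the public reference text are made up, and a real attack would use a large public medical corpus as its reference.

```python
from collections import Counter


def frequency_attack(encoded_corpus, reference_text):
    """Guess index -> word by matching frequency ranks against public text."""
    index_ranking = [idx for idx, _ in Counter(encoded_corpus).most_common()]
    word_ranking = [w for w, _ in Counter(reference_text.split()).most_common()]
    return dict(zip(index_ranking, word_ranking))


# The attacker sees only these indices; the true mapping
# {"patient": 7, "pain": 3, "fever": 9} is hidden from them.
encoded_corpus = [7, 3, 7, 9, 7, 3]

# Public text from a similar domain, for illustration only
reference_text = "patient pain patient fever pain patient patient pain fever patient"

guessed = frequency_attack(encoded_corpus, reference_text)
print(guessed)
```

On this toy data the guess recovers the hidden mapping exactly; on real data it would be noisier, but the most frequent words are usually the easiest to recover.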



In federated training, instead of collecting all the data in a single place for the model to train on, we send the model to the machine where the data resides. This keeps `ownership` of the data with the data provider and also maintains its privacy. If you have gone through the tutorials mentioned above, you will know that bringing the gradients back directly from a worker may leak information about that worker.

So, in order to avoid this, we perform `gradient averaging on a Secure Aggregator`. The Secure Aggregator collects the gradients from all workers and averages them before returning the result to us. This prevents us from learning a lot about any single data provider, thus preserving their individual privacy.
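The averaging step can be sketched locally with plain Python lists standing in for tensors. This is a simplified illustration under assumptions: `average_state_dicts` and the toy parameter values are hypothetical, and in this tutorial the averaging is applied to the model parameters (weights and biases) while they sit on the secure worker.

```python
def average_state_dicts(state_dicts):
    """Average each named parameter element-wise across all workers."""
    n_workers = len(state_dicts)
    averaged = {}
    for name in state_dicts[0]:
        # Group the i-th value of this parameter from every worker together
        columns = zip(*(sd[name] for sd in state_dicts))
        averaged[name] = [sum(values) / n_workers for values in columns]
    return averaged


# Toy parameters held by each hospital's copy of the model
bob_params = {"linear.weight": [0.25, 0.5], "linear.bias": [1.0]}
alice_params = {"linear.weight": [0.75, 0.0], "linear.bias": [3.0]}

averaged = average_state_dicts([bob_params, alice_params])
print(averaged)
```

Only the averaged values leave the aggregator, so with more participants each individual contribution becomes harder to single out.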

There are other, much more secure ways of averaging gradients, such as `Encrypted Gradient Aggregation`, but they aren't supported in PySyft yet. Here is the [issue link](https://github.com/OpenMined/PySyft/blob/master/examples/tutorials/Part%2010%20-%20Federated%20Learning%20with%20Secure%20Aggregation.ipynb) if you want to contribute.

Moreover, we can reuse the `Secure Worker` that helped us build the `combined vocabulary` as the `Secure Aggregator`.

## Create a Dataset class

Now that we have our `combined vocabulary` with a unique index assigned to each word, let's create a Dataset class that returns pointers to the `Context` and `Target` tensors for a given remote `Document`.
These context and target pairs can be used to train our `N-gram language model`.
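To see what context and target pairs look like, here is a local sketch on a toy list of token indices. It assumes that excluded tokens have already been dropped and that all relative positions are negative; `context_target_pairs` is a hypothetical helper and is not the remote method used by the Dataset class.

```python
def context_target_pairs(token_indices, relative_context_pos):
    """Build (context, target) pairs from a sequence of token indices."""
    pairs = []
    # First position that has a full context to its left
    offset = -min(relative_context_pos)
    for i in range(offset, len(token_indices)):
        context = [token_indices[i + rel] for rel in relative_context_pos]
        pairs.append((context, token_indices[i]))
    return pairs


n_gram = 3
# For a tri-gram model the context is the two preceding tokens: [-2, -1]
relative_context_pos = list(range(-(n_gram - 1), 0))

# Toy document, already mapped to vocabulary indices
pairs = context_target_pairs([11, 42, 7, 19], relative_context_pos)
print(pairs)
```

A document with `k` usable tokens therefore yields `k - (n_gram - 1)` training pairs, and very short documents yield none.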

In [20]:
class NGramDataset(Dataset):
    """Creates an n-gram dataset combining remote datasets."""
    
    def __init__(self, dataset, n_gram, excluded_tokens, nlp = None):
        """Initialize the Dataset object
        
        Args:
            dataset (list): List containing examples as dicts.
            n_gram (int): Size of n-grams to be created for training.
            excluded_tokens (Dict): Tokens to be excluded while making
                context and target pairs.
            nlp (Language): The syfertext language object containing
                the pipeline. Should be passed in when you pass a dataset
                which has not been already processed by the pipeline.
                Example, during validation.
        """
        
        self.dataset = dataset
        self.n_gram = n_gram
        self.excluded_tokens = excluded_tokens
        self.nlp = nlp
        
        self._create_relative_context_pos()
        
    def __getitem__(self, index):
        """Return pointers to the context and target tensors
        built from one remote document.
        
        Args:
            index (int): index of the example to be fetched.
                This actually indexes one example in `self.dataset`
                which pools over examples of all the remote datasets.
        """
        
        # get the example
        example = self.dataset[index]
        doc_ptr = example['transcription']
        
        if isinstance(doc_ptr, StringPointer):
            doc_ptr = self.nlp(doc_ptr)
        
        # Get a pointer to the context and target tensors
        # of dtype = torch.LongTensor residing on the worker
        # to whom this example belongs.
        
        context, target = doc_ptr.get_context_target_tensors(self.relative_context_pos,
                                                             self.excluded_tokens)
        
        return context, target
            
    def __len__(self):
        """Returns the combined size of all of the 
        remote training/validation sets.
        """
        
        # The size of the combined datasets
        return len(self.dataset)
    
    def _create_relative_context_pos(self):
        """`doc.get_context_target_tensors()` requires us to pass
        the relative positions of the context tokens with respect to the
        target token.

        Example: If we are training a tri-gram language model,
            then for a given target token position `i`, the context positions
            will be [i-2, i-1]. So the corresponding relative context
            positions will be [-2, -1].
        """
        
        context_size = self.n_gram - 1
        self.relative_context_pos = [-(context_size - i) 
                                     for i in range(context_size)]

Let's now create two such NGramDataset objects, one for training and the other for validation.

In [21]:
# Let's also instantiate an n-gram variable
N_GRAM = 3

In [22]:
# Exclude stop words from being a part of 
# context, or from being a target
excluded_tokens = {"is_stop": {True}}


trainset = NGramDataset(dataset = list(bob_train + alice_train),
                      n_gram = N_GRAM, 
                      excluded_tokens = excluded_tokens)

## Create a Model Class

We will create a Model with the following layers.
1. Embedding layer:
    - Input: A torch.LongTensor holding the indices that form the context, e.g. [2, 3, 4]
    - Output: Embedding vectors of the input indices, which are stacked together and fed to the next layer.
2. Linear layer: Takes the embedding vectors as input and outputs a 128-dimensional vector, followed by a ReLU activation.
3. Linear layer: Takes the 128-dimensional vector and outputs a vector whose dimension equals the vocabulary size. A LogSoftmax activation is then applied to the output of this last layer.

Finally, since we use a LogSoftmax activation in our final layer, we use nn.NLLLoss as our loss function.
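The pairing of LogSoftmax with a negative log-likelihood loss can be checked by hand on a tiny example. The sketch below uses only the standard library; `log_softmax`, `nll_loss` and the toy logits are hypothetical stand-ins for the PyTorch operations, written for a single example rather than a batch.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def nll_loss(log_probs, target):
    """Negative log-likelihood of the target class for one example."""
    return -log_probs[target]


# Toy scores over a three-word vocabulary
logits = [2.0, 0.5, -1.0]
log_probs = log_softmax(logits)

# The loss is small when the target is the highest-scoring word
loss = nll_loss(log_probs, target=0)
print(loss, nll_loss(log_probs, target=2))
```

Because the model already outputs log-probabilities, the loss only has to pick out the entry for the target word and negate it.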

In [23]:
class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        """
        Args:
            vocab_size: Size of combined vocabulary
            embedding_dim: Embedding dimension
            context_size: (n_gram - 1) is context size
        """
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        """
        Args:
            inputs: A torch.LongTensor having indices for the context
                dimensions = (batch_size, context_size)
        Returns:
            log_probs: dimension = (batch_size, vocab_size)
        """
        embeds = self.embeddings(inputs)
        out = F.relu(self.linear1(embeds.view(embeds.shape[0], -1)))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs

Let's initialize our model, which will later be copied and sent to Bob's and Alice's machines.

In [24]:
EMBEDDING_DIM = 100

model = NGramLanguageModeler(vocab_size, 
                             embedding_dim = EMBEDDING_DIM, 
                             context_size = N_GRAM - 1).to(device)

## Let's train our model!

Finally, it's time to train our model.

In [29]:
def train(epochs, loss_function, dataset):
    
    model.train()
    
    for epoch in range(epochs):
        
        # Reset total loss
        total_loss = 0
        
        # Send updated copy of model to bob and alice
        bobs_model = model.copy().send(bob)
        alices_model = model.copy().send(alice)
        
        # Initialize optimizers for the remote models
        optimizers = {'bob_hospital'  : optim.SGD(bobs_model.parameters(), lr=0.01),
                      'alice_hospital': optim.SGD(alices_model.parameters(), lr=0.01)}
        
        # Map models to their workers
        models = {'bob_hospital': bobs_model, 'alice_hospital': alices_model}
               
        for contexts, targets in dataset:
            
            contexts, targets = contexts.to(device), targets.to(device)
            
            if(contexts.shape[0] > 0):
                # get the location where this data resides
                location = contexts.location.id

                # Zero out the gradients
                optimizers[location].zero_grad()

                # Perform a prediction            
                probs = models[location](contexts)

                # Calculate by how much we missed
                loss = loss_function(probs, targets)

                # Backpropagation
                loss.backward()

                # Update the weights
                optimizers[location].step()

                # update total loss
                total_loss += loss.get().item()

        # Move Bob's and Alice's models to the secure worker,
        # which acts as the secure aggregator here
        bobs_model.move(secure_worker)
        alices_model.move(secure_worker)
        
        with torch.no_grad():

            # Average the params (weights and biases) on the secure aggregator
            # And update our local model with the new params
            updated_state_dict = {}
            
            for param in model.state_dict().keys():
                updated_state_dict[param] = ( (bobs_model.state_dict()[param] + 
                                               alices_model.state_dict()[param]) / 2
                                            ).get()
                
            model.load_state_dict(updated_state_dict)
            
        # Print loss
        total_loss /= len(dataset)
        print("Epoch: ", epoch, "\t Loss:", total_loss)

In [30]:
epochs = 5
loss_function = nn.NLLLoss().to(device)

train(epochs, loss_function, trainset)

Epoch:  0 	 Loss: 9.14086214185508
Epoch:  1 	 Loss: 8.798042418386856
Epoch:  2 	 Loss: 8.503424681823171


RuntimeError: CUDA out of memory. Tried to allocate 236.00 MiB (GPU 0; 3.82 GiB total capacity; 2.32 GiB already allocated; 203.69 MiB free; 2.89 GiB reserved in total by PyTorch)
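As a rough sanity check on the losses printed above, they can be compared with the loss of a model that guesses uniformly over the vocabulary, which is `ln(vocab_size)`. The sketch below is only an approximation: it treats the reported per-document average as if it were a per-token loss, and the perplexity figures it derives inherit that assumption.

```python
import math

# Values copied from the outputs shown in this tutorial
vocab_size = 43912
epoch_losses = [9.14086214185508, 8.798042418386856, 8.503424681823171]

# A model that assigns equal probability to every word has this loss
uniform_loss = math.log(vocab_size)

# Perplexity is the exponential of the negative log-likelihood
perplexities = [math.exp(loss) for loss in epoch_losses]

for epoch, (loss, ppl) in enumerate(zip(epoch_losses, perplexities)):
    print(f"epoch {epoch}: loss={loss:.3f}  perplexity={ppl:.0f}")
print(f"uniform baseline: loss={uniform_loss:.3f}  perplexity={vocab_size}")
```

All three epochs sit below the uniform baseline and keep decreasing, which suggests the federated updates are moving the model in the right direction even though training stopped early.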

## Testing our model

In [31]:
# Initialize Dataset class for bob's and alice's validation datasets

bob_valset = NGramDataset(dataset = bob_val,
                          n_gram = N_GRAM, 
                          excluded_tokens = excluded_tokens,
                          nlp = nlp)

alice_valset = NGramDataset(dataset = alice_val,
                            n_gram = N_GRAM, 
                            excluded_tokens = excluded_tokens,
                            nlp = nlp)
                            

In [44]:
# Temporarily move the model to the CPU (the GPU ran out of memory during training)
device = torch.device("cpu")
model.to(device)

NGramLanguageModeler(
  (embeddings): Embedding(43912, 100)
  (linear1): Linear(in_features=200, out_features=128, bias=True)
  (linear2): Linear(in_features=128, out_features=43912, bias=True)
)

In [47]:
def test(worker, val_dataset):
    
    model.eval()
    
    # Send copy of model to remote worker
    remote_model = model.copy().send(worker)
    
    total_accuracy = 0
    for i in range(len(val_dataset)):
        
        try:
            contexts, targets = val_dataset[i]
            
            contexts, targets = contexts.to(device), targets.to(device)
            
            if(contexts.shape[0] > 0):
                
                
                # Perform a prediction
                probs = remote_model(contexts)

                # Get the predicted labels
                preds = probs.argmax(dim=1)

                # Compute the prediction accuracy
                accuracy = (preds == targets).sum()
                accuracy = accuracy.get().item()
                accuracy = 100 * (accuracy / contexts.shape[0])

                # Add to total accuracy
                total_accuracy += accuracy
                
        
        except KeyError: # OOV token
            pass
    
    total_accuracy = total_accuracy/len(val_dataset)
    return total_accuracy

In [48]:
# Test on bob's validation dataset
test(bob, bob_valset)

7.606402360514078

In [49]:
# Test on alice's validation dataset
test(alice, alice_valset)

6.722117200860472

### TODO:

0. Think about how to handle batch of context, target pairs.
0. Add OOV to combined_vocab.
1. ~Move to GPU.~
2. Make the vocabulary creation more secure.
3. Make the vocab set serializable and sendable.
4. How to avoid .get() on PointerTensors ?
5. Add tensorboard ?

## Discussion: Ownership of the Model