# Encrypted Training Demo on Medical Text Data using SyferText

**Author:**
- Carlos Salgado - [email](mailto:csalgado@uwo.ca) | [GitHub](https://github.com/socd06) | [LinkedIn](https://www.linkedin.com/in/eng-socd)

## Problem Statement
Bob <sup>MD</sup> and Alice <sup>MD</sup> are physicians who each run a medical practice and each own a database of private medical transcriptions. You own a Natural Language Processing (NLP) company. Both physicians have heard of the high quality of your Machine Learning as a Service (MLaaS) solutions and have asked you to create a text classifier that automatically assigns a medical specialty to each new patient transcription.

## Limitations

Healthcare data is highly regulated and should be, for most intents and purposes, private. In a medical setting, therefore, whoever trains the Machine Learning model should never see the raw data.

Combining Bob's and Alice's datasets would give you a bigger, better dataset and a more accurate model. However, you cannot simply pool the data, because all of it is sensitive and private. This is why you will need [PySyft](https://github.com/OpenMined/pysyft/) and [SyferText](https://github.com/OpenMined/SyferText/) to complete the job at hand.

## Importing libraries

Make sure to first install [PySyft](https://github.com/OpenMined/PySyft) and [SyferText](https://github.com/OpenMined/SyferText) before you run this tutorial. 
Using virtual environments is highly recommended for any PySyft experiment.


Verify the requirements and install any that are missing:

In [1]:
!pip install -r ../requirements.txt



In [1]:
import sys
sys.path.append('../scripts')

from util import download_dataset

In [1]:
# SyferText imports
import syfertext
from syfertext.pipeline import SimpleTagger

# PySyft and PyTorch import
import syft as sy
from syft.generic.string import String
import torch
import torch.nn.functional as F
from torch.utils.tensorboard import SummaryWriter
from torch.utils.data import DataLoader
from torch.utils.data import Dataset
import torch.optim as optim

# Useful imports
import numpy as np
from tqdm import tqdm
import csv
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import os
from pprint import pprint

The `download_dataset` helper function imported above comes from the `scripts` folder.

## Downloading the Dataset

<div>
    <p style='color:red;'> (IGNORE THIS STEP IF YOU CLONED THE REPO OR ALREADY DOWNLOADED THE DATASET) </p>
</div>

The dataset will be downloaded in a folder called `data` in the root directory. Four files will be downloaded using the `download_dataset` helper function:

- `mtsamples.csv`: This is the dataset file containing almost 5K sample medical transcriptions. It is a csv file composed of five columns: `description`,`medical_specialty`,`sample_name`,`transcription`, and `keywords`. The `transcription` column holds the sample text, and the `medical_specialty` contains the labels we will use to train the classifier. 

- `clinical-stopwords.txt`: Clinical stop words compiled by [Dr. Kavita Ganesan](https://github.com/kavgan) from the [clinical-concepts](https://github.com/kavgan/clinical-concepts) repository. 

- `vocab.txt`: Vocabulary text file generated using the Systematized Nomenclature of Medicine International (SNMI) data.

- `mt.csv`: Reduced dataset containing only the `medical_specialty` and `transcription` features.

Please run the cell below in order to download the dataset. 

In [None]:
# The URL template to all dataset files
url_template = 'https://raw.githubusercontent.com/socd06/medical-nlp/master/data/%s'

# File names to be downloaded using the URL template above
files = ['mtsamples.csv', 'clinical-stopwords.txt', 'vocab.txt', 'mt.csv']

# Construct the list of urls
urls = [url_template % file for file in files]

# The dataset name and its root folder
dataset_name = 'data'
root_path = '../data'

# Create the dataset folder if it is not already there
os.makedirs(root_path, exist_ok = True)

# Start downloading
download_dataset(dataset_name = dataset_name, 
                 urls = urls, 
                 root_path = root_path
                )
print("Successfully downloaded:",files)

## Preparing the work environment

The work environment is simulated with four actors: a company (us), two clients who each own a private dataset (Bob and Alice), and a crypto provider that supplies the primitives for Secure Multi-Party Computation (SMPC).

In [2]:
# Create a torch hook for PySyft
hook = sy.TorchHook(torch)

# Create some PySyft workers
me = hook.local_worker # This is the worker representing the deep learning company
bob = sy.VirtualWorker(hook, id = 'bob') # Bob owns the first dataset
alice = sy.VirtualWorker(hook, id = 'alice') # Alice owns the second dataset

crypto_provider = sy.VirtualWorker(hook, id = 'crypto_provider') # provides encryption primitive for SMPC

# Create a summary writer for logging performance with Tensorboard
writer = SummaryWriter()
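
To build some intuition for what these workers do in SMPC, here is a toy sketch of additive secret sharing in plain Python. It is not PySyft's implementation: the modulus `Q`, the helper names `share` and `reconstruct`, and the two-party setup are assumptions made for illustration only.

```python
import random

# Hypothetical modulus for this toy example; real protocols choose their own ring size
Q = 2**31 - 1

def share(secret, n_parties=2):
    """Split an integer into n additive shares that sum to the secret modulo Q."""
    shares = [random.randrange(Q) for _ in range(n_parties - 1)]
    last = (secret - sum(shares)) % Q
    return shares + [last]

def reconstruct(shares):
    """Recombine the shares; only their sum reveals the secret."""
    return sum(shares) % Q

# Each party (think of Bob and Alice) holds one share of each value
x_shares = share(25)
y_shares = share(17)

# Addition is done share by share, without any party seeing x or y
z_shares = [(a + b) % Q for a, b in zip(x_shares, y_shares)]

print(reconstruct(z_shares))  # 42
```

A single share looks like a random number, so neither party learns anything on its own. Multiplying shared values needs extra helper randomness, which is the kind of primitive a crypto provider supplies.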



### Simulating Private Datasets

We simulate two private datasets owned by two clients (Bob and Alice). The exploratory analysis was done separately, so we go straight to the following steps:

1. Load the whole dataset in `mt.csv` locally (the `me` worker). This dataset will be loaded as a list of dictionaries that has the following format: `[ {'transcription': <transcription text>, 'label': <0 or 1>}, {...}, {...}]`


2. Split the dataset into two parts, one for Bob and the other for Alice. Each part will also be split into a training set and a validation set. This will create four lists: `train_bob`, `val_bob`, `train_alice`, `val_alice`. Each list has the same format mentioned above.


3. Each element in the four lists will be sent to the corresponding worker. This will change the content of the lists as depicted in **Figure(1)**. Each list will hold PySyft pointers to the texts and labels instead of the objects themselves.

<div>
<br>
<img alt = 'imdb review remote datasets' src ='./img/imdb_review_remote.png' style='width:700px'>
<div>
<div style='width:600px;margin:30px auto 10px auto;text-align:center;'>
<strong> Figure(1): </strong> The transcriptions and their labels are located on Bob's and Alice's remote machines; only pointers to them are kept by the local worker (the company's machine).
</div>
</div>
<br>
</div>
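
The splitting step can be sketched in plain Python as follows. This is only an illustration on a made-up toy dataset: the `split` helper and its `seed` parameter are hypothetical, and its rounding is not guaranteed to match scikit-learn's `train_test_split`, which the notebook actually uses.

```python
import random

def split(items, train_fraction, seed=0):
    """Shuffle a copy of `items` and cut it into two parts."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy dataset in the same list-of-dictionaries format
dataset = [{'transcription': f'text {i}', 'label': i % 2} for i in range(20)]

# One half per client, then a 70/30 train/validation split for each client
part_bob, part_alice = split(dataset, 0.5)
train_bob, val_bob = split(part_bob, 0.7)
train_alice, val_alice = split(part_alice, 0.7)

print(len(train_bob), len(val_bob), len(train_alice), len(val_alice))  # 7 3 7 3
```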

## Loading dataset locally

In [3]:
# Set the path to the dataset file
dataset_path = '../data/mt.csv'

# store the dataset as a list of dictionaries
# each dictionary has two keys, 'transcription' and 'label'
# the 'transcription' element is a PySyft String
# the 'label' element is an integer with 1 for each surgical specialty and a 0 otherwise
dataset_local = []

In [4]:
with open(dataset_path, 'r') as dataset_file:
    
    # Create a csv reader object
    reader = csv.DictReader(dataset_file)
    
    for elem in reader:
        
        # Create one entry
        # Check if the medical specialty contains "urgery" 
        # meaning the transcription could be labeled "Surgery","Cosmetic / Plastic Surgery" or "Neurosurgery"
        example = dict(transcription = String(elem['transcription']),                       
                       label = 1 if 'urgery' in elem['medical_specialty'] else 0 
                      )
        
        # add to the local dataset
        dataset_local.append(example)
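
The labeling rule used in the loop above can be checked in isolation. In this small sketch the helper name `surgery_label` is hypothetical, and `'Radiology'` is only an example of a non-surgical specialty name.

```python
def surgery_label(medical_specialty):
    """Return 1 if the specialty name contains 'urgery', else 0."""
    return 1 if 'urgery' in medical_specialty else 0

# The first three names come from the comment in the loop; the last is an example
for specialty in ['Surgery', 'Cosmetic / Plastic Surgery', 'Neurosurgery', 'Radiology']:
    print(specialty, '->', surgery_label(specialty))
```

Matching on `'urgery'` rather than `'Surgery'` catches both the capitalized `Surgery` and the lowercase `surgery` inside `Neurosurgery`.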

Here is what an element in the list looks like:

In [5]:
# import needed to generate random integer numbers
from random import randint

In [6]:
# Get a random index to verify entry examples
# (randint includes both endpoints, so the upper bound is len - 1)
random_index = randint(0, len(dataset_local) - 1)
print("Entry #",random_index)
example = dataset_local[random_index]
pprint(example)

Entry # 4618
{'label': 0,
 'transcription': "OPERATION,1.  Insertion of a left subclavian Tesio hemodialysis catheter.,2.  Surgeon-interpreted fluoroscopy.,OPERATIVE PROCEDURE IN DETAIL: , After obtaining informed consent from the patient, including a thorough explanation of the risks and benefits of the aforementioned procedure, patient was taken to the operating room and MAC anesthesia was administered.  Next, the patient's chest and neck were prepped and draped in the standard surgical fashion.  Lidocaine 1% was used to infiltrate the skin in the region of the procedure.  Next a #18-gauge finder needle was used to locate the left subclavian vein.  After aspiration of venous blood, Seldinger technique was used to thread a J wire through the needle.  This process was repeated.  The 2 J wires and their distal tips were confirmed to be in adequate position with surgeon-interpreted fluoroscopy.  Next, the subcutaneous tunnel was created.  The distal tips of the individual Tesio hemodialy

Now that we have checked an example, we can look at its data types:

In [7]:
print(type(example['transcription']))
print(type(example['label']))

<class 'syft.generic.string.String'>
<class 'int'>


The transcription text is a PySyft `String` object. The label is an integer.

Let's split the dataset into two equal parts and send each part to a different worker, simulating the two remote datasets mentioned before:

In [8]:
# Create two datasets, one for Bob and another one for Alice
dataset_bob, dataset_alice = train_test_split(dataset_local, train_size = 0.5)

# Now create a validation set for Bob and another one for Alice
train_bob, val_bob = train_test_split(dataset_bob, train_size = 0.7)
train_alice, val_alice = train_test_split(dataset_alice, train_size = 0.7)

Making the datasets remote:

In [9]:
# A function that sends the content of each split to a remote worker
def make_remote_dataset(dataset, worker):

    # Go through each example in the dataset
    for example in dataset:
        
        # Send each transcription text
        example['transcription'] = example['transcription'].send(worker)
                       
        # Send each label as a one-hot-encoded vector
        one_hot_label = torch.zeros(2).scatter(0, torch.Tensor([example['label']]).long(), 1)
        
        # print for debugging purposes
        # print("mapping",example['label']," to ",one_hot_label)
        
        # Send the transcription label
        example['label'] = one_hot_label.send(worker)

The above function transforms the label to a one-hot-encoded format before sending it to a remote worker. Every label corresponds to a 2-element tensor of binary values (`[1,0]` or `[0,1]`).
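
As a plain-Python illustration of the same encoding (without PyTorch or PySyft), a one-hot vector can be built like this. The helper name `one_hot` is hypothetical.

```python
def one_hot(label, num_classes=2):
    """Return a list with 1.0 at position `label` and 0.0 elsewhere."""
    if not 0 <= label < num_classes:
        raise ValueError(f'label must be in [0, {num_classes})')
    vector = [0.0] * num_classes
    vector[label] = 1.0
    return vector

print(one_hot(0))  # [1.0, 0.0]
print(one_hot(1))  # [0.0, 1.0]
```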

Now we can convert the dataset into a remote dataset.

In [10]:
# Bob's remote dataset
make_remote_dataset(train_bob, bob)
make_remote_dataset(val_bob, bob)

# Alice's remote dataset
make_remote_dataset(train_alice, alice)
make_remote_dataset(val_alice, alice)

Now Bob's dataset looks like:

In [11]:
# Take an element from the dataset
example = train_bob[10]

print(type(example['transcription']))
print(example['label'])

<class 'syft.generic.pointers.string_pointer.StringPointer'>
(Wrapper)>[PointerTensor | me:29420145204 -> bob:88329213666]


The text type is now a PySyft `StringPointer` that points to the real `String` object located on Bob's machine. The label type is a PySyft `PointerTensor`.

We can also check where the text and the label are located:

In [12]:
print(example['transcription'].location)
print(example['label'].location)

<VirtualWorker id:bob #objects:4972>
<VirtualWorker id:bob #objects:4972>


This confirms that the dataset is now remote, as depicted in **Figure(1)**.

Now the environment and the data are ready for the next step.

## Creating a `SyferText` Language object and a pipeline

In [13]:
# Create a Language object with SyferText
nlp = syfertext.load('en_core_web_lg', owner = me)

Whenever you create a Language object a pipeline will be created. At initialization, a pipeline only contains a tokenizer. You can see this for yourself using the `pipeline_template` property:

In [14]:
nlp.pipeline_template

[{'remote': True, 'name': 'tokenizer'}]

In [15]:
type(nlp)

syfertext.language.Language

Notice that the tokenizer entry has a property called `remote` set to `True`. This means we allowed the tokenizer to be sent to a remote worker for the string to be tokenized there.

We can add more components to the pipeline by using the `add_pipe` method of the Language class. One component we can add is a `SimpleTagger` object. This is a SyferText object that we can use to set custom attributes on individual tokens. In this tutorial, we will create one tagger that marks the tokens that are stop words.

By tagging we mean setting a custom attribute on a token and assigning it a given value (e.g. an attribute called `is_stop` with the values `True` and `False` when evaluating stop words).

You can refer to **Figure(2)** to see how a pipeline is distributed on multiple workers.

### Creating a tagger for stop words

We will start by creating the stop word tagger. First, we load the stop word file into a set of words:

In [16]:
# Load the list of stop words
with open('../data/clinical-stopwords.txt', 'r') as f:
    stop_words = set(f.read().splitlines())

Now we create the tagger which is an object of the `SimpleTagger` class:

In [17]:
# Create a simple tagger object to tag stop words
stop_tagger = SimpleTagger(attribute = 'is_stop',
                           lookups = stop_words,
                           tag = True,
                           default_tag = False,
                           case_sensitive = False
                          )

Note that the `lookups` argument is the set of stop words loaded above.

Every token in the `Doc` object will be given a custom attribute called `is_stop`. Every time a stop word is found, this attribute will be given the value `True` specified by the `tag` argument of the `SimpleTagger` class initializer; otherwise, the `default_tag` will be used (here, `False`).
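
The behavior described above can be mimicked in a few lines of plain Python. This is a sketch of the idea only, not SyferText's `SimpleTagger` implementation; the function `tag_tokens` and the sample stop words and tokens are hypothetical.

```python
def tag_tokens(tokens, lookups, tag=True, default_tag=False, case_sensitive=False):
    """Return (token, value) pairs: `tag` if the token is in `lookups`, else `default_tag`."""
    if not case_sensitive:
        lookups = {word.lower() for word in lookups}
    tagged = []
    for token in tokens:
        key = token if case_sensitive else token.lower()
        tagged.append((token, tag if key in lookups else default_tag))
    return tagged

# Hypothetical stop words and tokens
stop_words = {'the', 'was', 'of'}
tokens = ['The', 'patient', 'was', 'prepped']

print(tag_tokens(tokens, stop_words))
# [('The', True), ('patient', False), ('was', True), ('prepped', False)]
```

With `case_sensitive = False`, as in the tagger created above, `The` matches the stop word `the`.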

### Adding the tagger to the pipeline

We can add the tagger we created above by using the `add_pipe()` method of the `Language` class. To repeat the experiment without it, set the boolean variable `use_stop_tagger` to `False` in the following cell.

In [18]:
use_stop_tagger = True

# Tokens with these custom tags
# will be excluded when creating
# the Doc vector
excluded_tokens = {}

The `excluded_tokens` dictionary will be used further down, when we create embedding vectors for the transcriptions. This dictionary will enable us to exclude some tokens when we create a document embedding. Such exclusion will be based on the value of the custom attributes we set with the taggers.

In [19]:
if use_stop_tagger:

    # Add the stop word tagger to the pipeline
    nlp.add_pipe(name = 'stop tagger',
                 component = stop_tagger,
                 remote = True
                )

    # Tokens with 'is_stop' = True are
    # not going to be used when creating the 
    # Doc vector
    excluded_tokens['is_stop'] = {True}

Let's check out what pipe components are included in the pipeline:

In [20]:
nlp.pipeline_template

[{'remote': True, 'name': 'tokenizer'},
 {'remote': True, 'name': 'stop tagger'}]

## Creating a Dataset class

Now that the datasets are remote and ready along with the `Language` object and its pipeline we can create PyTorch loaders to make data batches for training and validation.

The batches will be composed of training examples coming from both Bob's and Alice's datasets as if it were only one big dataset.

Each example in the batch contains an encrypted version of one transcription's embedding vector and its encrypted label. For this tutorial, the vector will be computed as the average of the transcription's individual token vectors taken from the `en_core_web_lg` language model. Also, tokens with custom tags indicated in `excluded_tokens` won't be taken into account in computing a transcription's vector.
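
The averaging with exclusions can be sketched in plain Python. This is not SyferText code: the token dictionaries, the tiny 2-dimensional vectors, and the function `doc_vector` are made up for illustration, while the `excluded_tokens` format (`{attribute: set_of_values}`) follows the dictionary defined earlier.

```python
def doc_vector(tokens, excluded_tokens):
    """Average the vectors of all tokens that are not excluded by their custom tags."""
    kept = [
        tok['vector'] for tok in tokens
        if not any(tok['attributes'].get(attr) in values
                   for attr, values in excluded_tokens.items())
    ]
    if not kept:
        raise ValueError('no tokens left to average')
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical tokens with made-up 2-dimensional vectors
tokens = [
    {'text': 'the',     'vector': [9.0, 9.0], 'attributes': {'is_stop': True}},
    {'text': 'patient', 'vector': [1.0, 2.0], 'attributes': {'is_stop': False}},
    {'text': 'chest',   'vector': [3.0, 4.0], 'attributes': {'is_stop': False}},
]

print(doc_vector(tokens, {'is_stop': {True}}))  # [2.0, 3.0]
```

Only `patient` and `chest` contribute to the average here, because `the` carries `is_stop = True`.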

From **Figure(2)** we can see how the transcription text is remotely preprocessed by SyferText: 

1. First, the `Language` object `nlp` is used to preprocess one transcription on Bob's or Alice's machine.
2. The object `nlp` determines that the real transcription text is actually remote, so it sends a subpipeline containing the required pipeline components we defined to the corresponding worker.
3. The subpipeline is run and a `Doc` object is created on the remote worker containing the transcription's individual tokens appropriately tokenized and tagged.
4. On the local worker, a `DocPointer` object is created pointing to that `Doc` object.
5. By calling `get_encrypted_vector()` on the `DocPointer`, the call is forwarded to `Doc`, which, in turn, computes the `Doc` vector, encrypts it with Secure Multi-Party Computation (SMPC) using PySyft and returns it to the caller at the local worker.
6. The PyTorch dataloader takes this encrypted vector and appends it to the training or validation batch.

Note that at no point in the process is the plaintext data of the remote datasets revealed to the local worker. *Privacy is preserved thanks to SyferText and PySyft!*

<div>
<br>
<img alt =  'SyferText pipeline' src ='./img/imdb_pipeline_2.png' style='width:700px;'>
<div>
<p style='width:600px;margin:30px auto 10px auto;text-align:center;'>
<strong> Figure(2): </strong> A pipeline on the local worker only contains pointers to subpipelines carrying out the actual preprocessing on remote workers.
</p>
</div>
<br>
</div>

Take a few minutes to review the `__getitem__()` method of the custom PyTorch `Dataset` object defined below and see how it triggers the preprocessing and encryption steps described above:

In [21]:
class DatasetMTS(Dataset):
    
    def __init__(self, sets, share_workers, crypto_provider, nlp):
        """Initialize the Dataset object
        
        Args:
            sets (list): A list containing all training OR 
                all validation sets to be used.
            share_workers (list): A list of workers that will
                be used to hold the SMPC shares.
            crypto_provider (worker): A worker that will 
                provide SMPC primitives for encryption.
            nlp: This is SyferText's Language object containing
                the preprocessing pipeline.
        """
        self.sets = sets
        self.crypto_provider = crypto_provider
        self.workers = share_workers
    
        # Create a single dataset unifying all datasets.
        # A property called `self.dataset` is created 
        # as a result of this call.
        self._create_dataset()
        
        # The language model
        self.nlp = nlp
        
    def __getitem__(self, index):
        """In this function, preprocessing with SyferText 
        of one transcription will be triggered. Encryption will also
        be performed and the encrypted vector will be obtained.
        The encrypted label will be computed too.
        
        Args:
            index (int): This is an integer received by the 
                PyTorch DataLoader. It specifies the index of
                the example to be fetched. This actually indexes
                one example in `self.dataset` which pools over
                examples of all the remote datasets.
        """
        
        # get the example
        example = self.dataset[index]
        
        # Run the preprocessing pipeline on 
        # the transcription text and get a DocPointer object
        doc_ptr = self.nlp(example['transcription'])
        
        # Get the encrypted vector embedding for the document,
        # shared among the workers passed to the constructor
        vector_enc = doc_ptr.get_encrypted_vector(*self.workers,
                                                  crypto_provider = self.crypto_provider,
                                                  requires_grad = True,
                                                  excluded_tokens = excluded_tokens
                                                 )
        

        # Encrypt the target label with the same workers
        label_enc = example['label'].fix_precision().share(*self.workers,
                                                           crypto_provider = self.crypto_provider,
                                                           requires_grad = True
                                                          ).get()


        return vector_enc, label_enc

    
    def __len__(self):
        """Returns the combined size of all of the 
        remote training/validation sets.
        """
        
        # The size of the combined datasets
        return len(self.dataset)

    def _create_dataset(self):
        """Create a single list unifying examples from all remote datasets
        """
        
        # Initialize the dataset
        self.dataset = []
      
        # populate the dataset list
        for dataset in self.sets:
            for example in dataset:
                self.dataset.append(example)
                
    @staticmethod
    def collate_fn(batch):
        """The collate_fn method to be used by the
        PyTorch data loader.
        """
        
        # Unzip the batch
        vectors, targets = list(zip(*batch))        
            
        # Stack the vectors into one batch tensor
        vectors = torch.stack(vectors)
        
        # Stack the labels into one batch tensor
        targets = torch.stack(targets)        
        
        return vectors, targets
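
The pooling done by `_create_dataset` and the unzipping done by `collate_fn` can be tried out with ordinary Python lists. In this sketch the helper names `pool` and `collate` and the sample data are hypothetical, and plain lists stand in for the encrypted tensors, so nothing is stacked with `torch.stack`.

```python
def pool(sets):
    """Flatten several datasets into one list, as `_create_dataset` does."""
    return [example for dataset in sets for example in dataset]

def collate(batch):
    """Turn a list of (vector, target) pairs into a pair of tuples."""
    vectors, targets = zip(*batch)
    return vectors, targets

# Hypothetical stand-ins for the encrypted (vector, label) pairs
bob_examples = [([0.1, 0.2], [1, 0]), ([0.3, 0.4], [0, 1])]
alice_examples = [([0.5, 0.6], [1, 0])]

dataset = pool([bob_examples, alice_examples])
vectors, targets = collate(dataset[:2])

print(len(dataset))  # 3
print(targets)       # ([1, 0], [0, 1])
```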

Let's now create two such `DatasetMTS` objects, one for training and the other for validation:

In [22]:
# Instantiate a training Dataset object
trainset = DatasetMTS(sets = [train_bob,
                               train_alice],
                       share_workers = [bob, alice],
                       crypto_provider = crypto_provider,
                       nlp = nlp
                      )

# Instantiate a validation Dataset object
valset = DatasetMTS(sets = [val_bob,
                             val_alice],
                     share_workers = [bob, alice],
                     crypto_provider = crypto_provider,
                     nlp = nlp
                    )

In [55]:
# Fetch one encrypted example to check the size of its vector
vec_enc, label_enc = valset[1]

In [56]:
vec_enc.shape[0]

300

In [None]:
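# Preprocess one remote transcription to get a DocPointer
doc_ptr = nlp(val_bob[0]['transcription'])

doc_vector_enc = doc_ptr.get_encrypted_vector(bob, alice, crypto_provider = crypto_provider)

print(f' Vector size is {doc_vector_enc.shape[0]}')
print(doc_vector_enc)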

## Creating a DataLoader

Let's now choose some hyperparameters for training and validation, and create the PyTorch data loaders:

In [81]:
# Set some hyper parameters
learning_rate = 0.001
batch_size = 32
epochs = 1

In [82]:
# Instantiate the DataLoader object for the training set
trainloader = DataLoader(trainset, shuffle = True,
                         batch_size = batch_size, num_workers = 0, 
                         collate_fn = trainset.collate_fn)


# Instantiate the DataLoader object for the validation set
valloader = DataLoader(valset, shuffle = True,
                       batch_size = batch_size, num_workers = 0, 
                       collate_fn = valset.collate_fn)

## Create an Encrypted Classifier

The classifier used here is a simple fully connected network with `300` input features, which is the size of the embedding vectors computed by SyferText. The network has two outputs, one for each class (surgical or not), matching the one-hot labels created earlier.

In [83]:
class Classifier(torch.nn.Module):
    
    def __init__(self, in_features, out_features):
        super(Classifier, self).__init__()
        
        self.fc = torch.nn.Linear(in_features, out_features)
                
    def forward(self, x):
       
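        logits = self.fc(x)
        
        # ReLU only clips negative scores to zero, so `probs` holds
        # non-negative scores rather than normalized probabilities
        probs = F.relu(logits)
        
        return probs, logits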

I should now initialize and encrypt the classifier. Encryption here should of course use the same workers to hold the shares and the same primitives used to encrypt the document vectors.

In [84]:
# Create the classifier
classifier = Classifier(in_features = 300, out_features = 2)

# Apply SMPC encryption
classifier = classifier.fix_precision().share(bob, alice, 
                                              crypto_provider = crypto_provider,
                                              requires_grad = True
                                              )
print(classifier)

Classifier(
  (fc): Linear(in_features=300, out_features=2, bias=True)
)


And finally I create an optimizer. Notice that the optimizer does not need to be encrypted, since it operates separately within each worker holding the classifier's and embeddings' shares. We just need to make it operate on fixed precision numbers that are used to encode shares.
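
To see what operating on fixed precision numbers means, here is a plain-Python sketch of an SGD step on integer-encoded values. The encoding parameters (base 10 with 3 fractional digits) are assumed for this illustration, and the helpers `encode` and `decode` are hypothetical; this is not PySyft's code.

```python
# Assumed encoding parameters, for illustration only
BASE = 10
PRECISION_FRACTIONAL = 3
SCALE = BASE ** PRECISION_FRACTIONAL

def encode(value):
    """Encode a float as an integer by scaling and rounding."""
    return round(value * SCALE)

def decode(encoded):
    """Map the integer back to a float."""
    return encoded / SCALE

learning_rate = 0.001
weight = 0.25
gradient = 2.0

# One SGD step carried out on encoded integers
enc_weight = encode(weight)                          # 250
enc_step = encode(learning_rate) * encode(gradient)  # 1 * 2000, scaled by SCALE ** 2
enc_weight -= enc_step // SCALE                      # rescale the product back to SCALE

print(decode(enc_weight))  # 0.248
```

Under these assumed parameters, a learning rate of `0.001` encodes to the integer `1`, the smallest non-zero value; a smaller learning rate would round to `0`.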

In [85]:
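# Use `torch.optim` explicitly: the name `optim` is rebound to the optimizer
# below, so calling `optim.SGD` again when re-running this cell would fail
optim = torch.optim.SGD(params = classifier.parameters(),
                        lr = learning_rate)

optim = optim.fix_precision()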

## Start training

You are now ready to run the training cell below.

`NLLLoss()` is not yet implemented in PySyft for SMPC mode, so we will use MSE as the training loss even though it is not the best choice for a classification task.
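
The loss and accuracy computed in the training loop can be reproduced on plain numbers. The sketch below uses ordinary Python lists instead of encrypted tensors; the helper names and the sample batch are hypothetical.

```python
def squared_error_loss(outputs, targets):
    """Sum of squared differences over the whole batch."""
    return sum((o - t) ** 2
               for out_row, tgt_row in zip(outputs, targets)
               for o, t in zip(out_row, tgt_row))

def argmax(row):
    """Index of the largest value in a row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(outputs, targets):
    """Percentage of rows whose largest output matches the one-hot target."""
    correct = sum(argmax(o) == argmax(t) for o, t in zip(outputs, targets))
    return 100 * correct / len(outputs)

# Hypothetical batch of two examples, both with target class 0
outputs = [[0.5, 0.0], [0.25, 0.75]]
targets = [[1.0, 0.0], [1.0, 0.0]]

print(squared_error_loss(outputs, targets))  # 1.375
print(accuracy(outputs, targets))            # 50.0
```

This sketch divides by the number of rows in the batch rather than by a fixed `batch_size`, so a smaller final batch is not under-reported.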

Once training is finished, the cells at the end of this notebook show that, as explained in **Figure(2)**, both Bob and Alice have `SubPipeline` objects on their machines, sent by SyferText, that contain the pipeline components defined above.

In [86]:
for epoch in range(epochs):
    
    for iter, (vectors, targets) in enumerate(trainloader):
        
        # Set train mode
        classifier.train()

        # 1). Zero out previous gradients
        optim.zero_grad()

        # 2). Predict the class scores
        probs, logits = classifier(vectors)

        # 3). Compute loss and accuracy
        
        loss = ((probs -  targets)**2).sum()
        

        # Get the predicted labels
        preds = probs.argmax(dim=1)
        targets = targets.argmax(dim=1)
        
        # Compute the prediction accuracy
        accuracy = preds.eq(targets).sum()
        accuracy = accuracy.get().float_precision()
        accuracy = 100 * (accuracy / batch_size)
        
        # 4). Backpropagate the loss
        loss.backward()

        # 5). Update weights
        optim.step()

        # Decrypt the loss for logging
        loss = loss.get().float_precision()

        
        # Log to tensorboard
        writer.add_scalar('train/loss', loss, epoch * len(trainloader) + iter )
        writer.add_scalar('train/acc', accuracy, epoch * len(trainloader) + iter )

        
        """ Perform validation on exactly one batch """
        
        # Set validation mode
        classifier.eval()

        for vectors, targets in valloader:
            
            
            probs, logits = classifier(vectors)

            loss = ((probs -  targets)**2).sum()

            preds = probs.argmax(dim=1)
            targets = targets.argmax(dim=1)

            accuracy = preds.eq(targets).sum()
            accuracy = accuracy.get().float_precision()
            accuracy = 100 * (accuracy / batch_size)

            loss = loss.get().float_precision()

            # Log to tensorboard
            writer.add_scalar('val/loss', loss, epoch * len(trainloader) + iter )
            writer.add_scalar('val/acc', accuracy, epoch * len(trainloader) + iter )
            
            break
            
writer.close()

## Results

To view the training and validation curves for loss and accuracy, you need to run TensorBoard. Open a terminal, navigate to the folder containing this notebook, and run:

```
$ tensorboard --logdir runs/
```

Then open your favorite web browser and go to `localhost:6006`.

You should now be able to see performance curves.

In [87]:
# On bob's machine
[bob._objects[id] for id in bob._objects if  isinstance(bob._objects[id], syfertext.SubPipeline)]

[SubPipeline[tokenizer > stop tagger]]

In [88]:
# On Alice's machine
[alice._objects[id] for id in alice._objects if  isinstance(alice._objects[id], syfertext.SubPipeline)]

[SubPipeline[tokenizer > stop tagger]]