<a href="https://colab.research.google.com/github/turbosheep/deep_learning_for_cNLP_AMIA2020/blob/master/deep_learning_for_cNLP_AMIA2020.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Python Deep Learning for Clinical NLP

### Presented by Olga Patterson and Hannah Eyre

## Disclaimer
The contents of this presentation do not represent the views of the Department of Veterans Affairs or the United States Government.

## Acknowledgement
Some work described in this presentation was supported using resources and facilities at the VA Salt Lake City Health Care System with funding from VA Informatics and Computing Infrastructure (VINCI), VA HSR RES 13-457.

## About this Tutorial

This colab notebook is available at: https://tinyurl.com/cNLP-AMIA2020

### Goals


1.   Understand when Python might be the right choice for deep learning on clinical text.
2.   Understand how deep learning for NLP is different from other deep learning problems.
3.   Understand the steps to set up a deep learning model for an NLP task.
4.   Understand how to examine results of a deep learning system.
5.   Understand how to use and improve a deep learning system.



### Task

The goal of this tutorial is to identify three common entities in text:

*   Diseases/Syndromes (COVID-19, respiratory failure, etc.)
*   Anatomical Site (lung, lower extremity, etc.)
*   Pharmacologic Substances (vaccine, gene names, proteins, etc.)

### Data Source

The data used in this demo is from the [COVID-19 Open Research Dataset (CORD)](https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge). This dataset contains over 200,000 scholarly articles related to COVID-19 and other coronaviruses. This is *NOT* clinical text, but it shares a similar scientific vocabulary, does not contain PHI, and is freely available to download.

For this demo, only the abstracts of the articles are used. Additionally, the data has been annotated algorithmically. This means there is no quality assurance for the annotations, so the data should only be used for learning and demonstration purposes. The methods shown here can apply to any data.

### Computing Environment
This demo will run in the cloud using Google Colab, which requires no setup by participants. It is compatible with tablets and smartphones. More information about the computing environment can be found at the [Colab FAQ](https://research.google.com/colaboratory/faq.html).

This code requires an internet connection to run, but will not download or run anything onto your computer.

***NOTE:*** This environment is not secure and data containing PHI should not be uploaded or used in a Colab environment.

# Why Python?

When SHOULD you choose Python for your deep learning project?

1.   Robust open source community. You benefit from the work of many experienced developers and scientists, not just what you can program by yourself.
2.   There are a variety of packages to choose from when starting a project depending on your preferences and needs.
3.   Lots of help. Major packages have millions of active users. Experienced users and developers are always available to help troubleshoot via email, social media, Slack, and Stack Overflow.
4.   You need to program something from scratch.


When SHOULDN'T you choose Python for your deep learning project?

1.   Your project needs robust package management. While Anaconda is advancing Python package management, it does not have as many features as Maven for Java.
2.   You need to prioritize energy efficiency or speed. C or C++ will typically give the fastest, most energy-efficient systems. Also consider non-deep learning solutions for this.
3.   You do not have experience programming. Python is a beginner-friendly language, but there are tools available that allow you to train models without any programming.





# Setup

Deep learning has a small amount of randomness to each training iteration. While this should not affect the results of a properly built system, this tutorial will keep everything consistent by freezing the random number generators.

In [None]:
import numpy as np
np.random.seed(123)

import random
random.seed(123)

import tensorflow
tensorflow.random.set_seed(123)

## Getting the Data

This data is being downloaded from a public github repository set up for the demo [here](https://github.com/turbosheep/deep_learning_for_cNLP_AMIA2020).

This demo will only use a fraction of the abstracts available, but the full dataset is available there. The preprocessing scripts that split the data into this downloaded format are also available.

In [None]:
!wget https://raw.githubusercontent.com/turbosheep/deep_learning_for_cNLP_AMIA2020/master/test_train_split/test_data_pretokenized.tsv
!wget https://raw.githubusercontent.com/turbosheep/deep_learning_for_cNLP_AMIA2020/master/test_train_split/test_labels_bio.tsv

!wget https://raw.githubusercontent.com/turbosheep/deep_learning_for_cNLP_AMIA2020/master/test_train_split/train_data_pretokenized.tsv
!wget https://raw.githubusercontent.com/turbosheep/deep_learning_for_cNLP_AMIA2020/master/test_train_split/train_labels_bio.tsv

!wget https://raw.githubusercontent.com/turbosheep/deep_learning_for_cNLP_AMIA2020/master/test_train_split/validation_data_pretokenized.tsv

## Opening data

The data has been broken into a training set, a test set, and a validation set. The training and test sets come with annotations, but the validation set does not.

In [None]:
train_data_raw = []
with open("train_data_pretokenized.tsv") as f:
  for line in f.readlines():
    train_data_raw.append(line.strip().split('\t'))

train_labels_raw = []
with open("train_labels_bio.tsv") as f:
  for line in f.readlines():
    train_labels_raw.append(line.strip().split('\t'))

test_data_raw = []
with open("test_data_pretokenized.tsv") as f:
  for line in f.readlines():
    test_data_raw.append(line.strip().split('\t'))

test_labels_raw = []
with open("test_labels_bio.tsv") as f:
  for line in f.readlines():
    test_labels_raw.append(line.strip().split('\t'))

validation_data = []
with open("validation_data_pretokenized.tsv") as f:
  for line in f.readlines():
    validation_data.append(line.strip().split('\t'))

## Data Format

If we examine our training data, we can see that there is a long hexadecimal ID followed by some text. This ID is used by the original CORD data, but we can also use it to pair our data to our labels.

Each row in the data represents a sentence with at least one annotation.

In [None]:
train_data_raw[7]

Our labels are in the BILUO format, a variant of the common BIO format.

*   ***B***egin: the first token of a 2 or more token sequence
*   ***I***nside: a token inside a 3 or more token sequence
*   ***L***ast: the last token of a 2 or more token sequence
*   ***U***nit: a one token length sequence
*   ***O***utside: a token not in any labeled sequence

Each label will be one of the BILUO schema followed by the specific type: B-DISEASE, L-PHARM, etc.

Each row will match with the same row in the training data. Each BILUO label will correspond in-order with the token in the text.
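
To make the scheme concrete, here is a small sketch that turns entity spans into BILUO tags. The helper `spans_to_biluo`, the example sentence, and its annotations are made up for illustration and are not part of the tutorial data or the preprocessing scripts.

```python
def spans_to_biluo(n_tokens, spans):
    """Convert (start, end, label) token spans (end exclusive) into BILUO tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        if end - start == 1:
            # a single-token entity gets a Unit tag
            tags[start] = "U-" + label
        else:
            tags[start] = "B-" + label
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + label
            tags[end - 1] = "L-" + label
    return tags


example_tokens = "acute respiratory distress syndrome affects the lung".split()
# hypothetical annotations: tokens 0-3 form a DISEASE, token 6 is an ANATOMY
example_tags = spans_to_biluo(len(example_tokens),
                              [(0, 4, "DISEASE"), (6, 7, "ANATOMY")])

for token, tag in zip(example_tokens, example_tags):
    print(token, tag)
```

The four-token disease gets one `B-`, two `I-`, and one `L-` tag, while the single-token anatomical site gets a `U-` tag.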




In [None]:
train_labels_raw[7]

# Preprocessing

## Challenges

Deep learning for NLP tasks has a variety of (often unspoken) challenges that need to be handled in the model design and preprocessing phase.

### Sequence Length

Neural networks of any kind operate on matrices of fixed size. Language, however, can generate sequences of arbitrary length, so text has to be made to fit a fixed shape before a network can use it.

#### Solution
Depending on your task, there may be several solutions. One option involves padding your sequences to be all the same length. A deep learning model will "learn" to ignore the filler words during training. If sequences are very long (generally >500), the "vanishing gradient" problem appears: the gradients used to update the model shrink toward zero as they are passed back through the sequence, until the numbers are too small for a computer to use and nothing can be learned.

If you have long sequences, consider breaking the sequences down into shorter sequences. Abstracts were broken down into sentences for this task.
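
As a rough sketch of breaking a long sequence down, the snippet below cuts a token list into fixed-size windows. This is only one option and is not what was done for this tutorial, where abstracts were split on sentence boundaries. The helper name `chunk_sequence` is hypothetical.

```python
def chunk_sequence(tokens, max_len):
    """Split a token list into consecutive chunks of at most max_len tokens."""
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]


# stand-in for a long tokenized document
long_sequence = list(range(12))
chunks = chunk_sequence(long_sequence, 5)
print(chunks)
```

The last chunk is shorter than the others, which is where padding comes back in.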

### Vocabulary Size
There are an infinite number of possible words, but we do not have infinite memory or computational power to store and learn them. Additionally, words not seen at training time will appear in future data, either in production systems or during testing.

#### Solution

Remove rare and unknown words from the data. Words can be removed from sequences or they can be replaced with an "OOV" or some other representation of an "unknown" value. If replacement is used, the model will learn information about when/where these words appear and may still be able to make correct predictions.
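
A minimal sketch of the replacement approach on a toy corpus, assuming that "rare" means a word seen fewer than two times. The documents and the `toy_` names are invented for illustration.

```python
from collections import Counter

# two tiny, made-up "documents" that are already tokenized
toy_train = [["the", "dog", "chased", "the", "cat"],
             ["the", "cat", "slept"]]

# count every word across all documents
toy_counts = Counter(word for doc in toy_train for word in doc)

# keep only words seen at least twice
toy_kept = {word for word, count in toy_counts.items() if count >= 2}

# replace everything else with the OOV placeholder
toy_replaced = [[word if word in toy_kept else "OOV" for word in doc]
                for doc in toy_train]
print(toy_replaced)
```

Because the placeholder stays in the sequence, the positions of the surrounding words are unchanged and the labels still line up token by token.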

### Words are not Numbers
The mathematical operations inside a network cannot be performed on words. Sending in the bytes does not make sense for words in the same way it does for other tasks where data might exist in a continuous spectrum, such as the color of a pixel.

#### Solution

There are many ways to encode text data. The most common for deep learning is with word embeddings, which are a mathematical representation of a word that has been learned by some deep learning model. These can be learned while training your task or they can be taken from other models by downloading them off the internet. Several common embedding sets exist, such as [GloVe](https://nlp.stanford.edu/projects/glove/). To use embeddings, turn each word into a single integer ID that references a row in your embedding matrix.
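
The lookup itself can be sketched in a few lines. Everything here is a toy assumption: the vocabulary, the embedding size of 3, and the random values, which stand in for numbers a real model would learn or load from a pre-trained set.

```python
import random

random.seed(0)

toy_vocab = ["the", "dog", "chased", "cat"]  # hypothetical vocabulary
toy_word_to_idx = {word: i for i, word in enumerate(toy_vocab)}

embedding_dim = 3
# one row per word; random here, learned during training in a real model
embedding_matrix = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                    for _ in toy_vocab]

sentence = "the dog chased the cat".split()
ids = [toy_word_to_idx[word] for word in sentence]   # words -> integer ids
vectors = [embedding_matrix[i] for i in ids]          # ids -> rows of the matrix

print(ids)
print(len(vectors), "vectors of size", len(vectors[0]))
```

Both occurrences of "the" map to the same ID and therefore to the same row of the matrix.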

## Define Preprocessing Methods

### Training Data Preprocessing

Let's define a method that handles the "vocabulary size" and "words are not numbers" problems.

This method makes indexes for each word and transforms the document into a list of the ids. It also removes "rare" words from the vocabulary. In this case a "rare" word is defined as something that does not occur more than once in the training set.

After processing, instead of sending in `the dog chased the cat`, a network might see `[1,2,3,1,4]`.

In [None]:
from collections import Counter

In [None]:
def get_idx_train(document_list):

  # count how often each word appears in the training set
  vocab = Counter()
  for doc in document_list:
    vocab.update(doc)
  
  # check the preprocessed vocab size
  print('    Number of tokens in data (unfiltered): {0}'.format(
      len(vocab.keys())))

  # remove all words that only occur once
  vocab_reduced = vocab
  for (word,count) in list(vocab_reduced.items()):
    if count < 2:
      del vocab_reduced[word]
  
  # check the reduced vocab size
  print('    Number of tokens in data (filtered): {0}'.format(
      len(vocab_reduced.keys())))
  
  vocab_final = set(vocab_reduced.keys())
  
  # build word indexes (sorted so that the ids are the same on every run)
  idx_to_word = sorted(vocab_final)
  word_to_idx = {u:i for i, u in enumerate(idx_to_word)}
  
  # add PAD word to index manually
  word_to_idx['PAD'] = len(idx_to_word)
  idx_to_word.append('PAD')

  # add OOV word to index manually
  word_to_idx['OOV'] = len(idx_to_word)
  idx_to_word.append('OOV')
  
  # replace document with IDs
  id_documents =[]
  for doc in document_list:
    id_doc = []
    for word in doc:
      # if the word is in our vocab, get the ID
      if word in vocab_final:
        id_doc.append(word_to_idx[word])
      # otherwise get the OOV ID
      else:
        id_doc.append(word_to_idx["OOV"])
    id_documents.append(id_doc)
  
  return word_to_idx,idx_to_word,id_documents,vocab_final

### Test Data Preprocessing

For the test set, treat all words not found in the training vocabulary as unknown. Removing words that only occur once in the test set is not necessary, because it is possible that the word was already learned from the training set.

Replace all words not in the training vocab with "OOV".

In [None]:
def get_idx_test(document_list,word_to_idx,vocab):
  id_documents = []
  
  # process document using known words/indexes
  for doc in document_list:
    id_doc = []
    for word in doc:
      if word in vocab:
        id_doc.append(word_to_idx[word])
      else:
        id_doc.append(word_to_idx['OOV'])
    id_documents.append(id_doc)
      
  return id_documents  

### Label Transformation

We also need to transform our labels into IDs.

In [None]:
def get_idx_label(train_label_list,test_label_list, label_set):
  idx_to_label = ['O']

  for label in label_set:
    idx_to_label.append("B-"+label)
    idx_to_label.append("I-"+label)
    idx_to_label.append("L-"+label)
    idx_to_label.append("U-"+label)

  label_to_idx = {u:i for i, u in enumerate(idx_to_label)}

  train_id_labels = [[label_to_idx[label] for label in doc] for doc in train_label_list]
  test_id_labels = [[label_to_idx[label] for label in doc] for doc in test_label_list]

  return idx_to_label,label_to_idx,train_id_labels,test_id_labels


### Data Padding

To solve the sequence length problem, a simple solution is to pad the ends of shorter sequences until they are the same length as the longest ones. We already added the word "PAD" to our vocabulary, so we simply need to fill in space using it. The network will learn to ignore this padding and focus on the real text on its own.

In [None]:
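def pad_seqs(doc_list,label_list, pad_id, o_id, max_len):
  # pad the text with the id for PAD; anything longer than max_len is trimmed
  # so that every sequence ends up exactly max_len long
  for doc in doc_list:
    if len(doc) < max_len:
      doc += [pad_id for i in range(max_len-len(doc))]
    else:
      del doc[max_len:]

  # pad (or trim) the labels the same way with the id for O
  for doc in label_list:
    if len(doc) < max_len:
      doc += [o_id for i in range(max_len-len(doc))]
    else:
      del doc[max_len:]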

## Perform Preprocessing

Let's split the data into a sequence of words and labels rather than a string.

In [None]:
train_data = [row[1].split() for row in train_data_raw]
train_labels = [row[1].split() for row in train_labels_raw]

test_data = [row[1].split() for row in test_data_raw]
test_labels = [row[1].split() for row in test_labels_raw]

### Transform into IDs

Let's convert the data into the preprocessed format. The data is sorted ahead of time, so the row index of the texts and labels can be used in place of the IDs for now. The words and labels were already broken up into lists instead of strings in the previous step.

This step would normally consist of tokenizing documents and aligning labels to your text, but since this is a step not unique to deep learning, it has been done ahead of time.

In [None]:
word_to_idx,idx_to_word,train_idx,vocab = get_idx_train(train_data)

To check that this method worked, look up a word in the word-to-index dictionary and see what index it outputs.

In [None]:
word_id = word_to_idx['viral']
print(word_id)

Then take that index and put it in the index-to-word list and ensure that they reference each other.

In [None]:
idx_to_word[word_id]

Then preprocess the testing data and the labels.

In [None]:
test_idx = get_idx_test(test_data,word_to_idx,vocab)

In [None]:
idx_to_label,label_to_idx,train_labels_idx,test_labels_idx = get_idx_label(
    train_labels,test_labels,["ANATOMY","PHARM","DISEASE"])

### Pad Data

Now we need to pad our data to make sure all of our sequences are the same length. We can start by finding the maximum length of a sequence in our training set.

In [None]:
max_length = 0
for doc in train_idx:
  if len(doc) > max_length:
    max_length = len(doc)

print(max_length)

In [None]:
seq_length = max_length

Then pad all of our sequences to that length.

In [None]:
pad_seqs(train_idx,train_labels_idx,word_to_idx['PAD'],label_to_idx['O'],seq_length)
pad_seqs(test_idx,test_labels_idx,word_to_idx['PAD'],label_to_idx['O'],seq_length)

# Building a Sequence Tagger

To build our deep learning model, we are using Tensorflow and Keras, two of the most common deep learning libraries in Python. The most recent version of Tensorflow directly incorporates Keras and simplifies using both.

These libraries have extensive, well documented features beyond what is shown here. Full documentation is available on the [tensorflow website](https://www.tensorflow.org/api_docs).

In [None]:
from tensorflow import keras

seed = 111 # deep learning has a random start to most algorithms, this
           # seed ensures that everything stays the same between runs

To start, we want to define our model. We want a sequential model for this task. Sequential in this case does not mean inputting sequences, it means that each layer in the model is sequential. Layer A provides the input for layer B, layer B provides input for layer C and so on.

In [None]:
model = keras.models.Sequential()

The first layer is an input layer. This simply defines the shape of the data the model will take in. In our case we will take some number of sequences of length `seq_length`.

In [None]:
model.add(keras.layers.Input(shape=(seq_length,)))

The next layer is the embedding layer. We already transformed our data to handle this and we will be learning the embeddings at training time. The only parameters we need to give are `input_dim`, which is the size of our vocabulary (how many embeddings the model needs to make), and `output_dim`, which is the size of each embedding. Imagine a matrix: `input_dim` is the number of rows, one per word in our vocabulary, and `output_dim` is the number of columns in each row.

In [None]:
model.add(keras.layers.Embedding(input_dim=len(idx_to_word),output_dim=100))

Dropout is the next layer type added. This means information learned in the embedding layer is randomly thrown away or "dropped out" of the model. The theory behind this is that it forces other information to compensate, resulting in stronger learning. This usually helps prevent the model from memorizing the training data.
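
A minimal sketch of what a dropout layer does to its inputs during training, assuming the common "inverted dropout" formulation where the surviving values are scaled up by 1/(1 - rate) so that the expected total stays the same. The function name and the input values are made up for illustration.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate` and scale the survivors."""
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


rng = random.Random(0)          # seeded so the example is repeatable
activations = [1.0] * 10        # stand-in for values coming out of a layer
dropped = dropout(activations, 0.2, rng)
print(dropped)
```

Each value is either dropped to 0 or scaled up to 1.25. Dropout is only applied during training. At prediction time the values pass through unchanged.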

In [None]:
model.add(keras.layers.Dropout(0.2))

The next layer is the primary layer that learns our task. We want to use a layer called a "Bidirectional LSTM".

In [None]:
model.add(keras.layers.Bidirectional(
    keras.layers.LSTM(units=50,return_sequences=True)
))

Dropout is usually added between all layers other than the input layer.

In [None]:
model.add(keras.layers.Dropout(0.2))

The final layer is what allows us to predict results over the entire sequence. It treats a sequence of words as a time series problem, like a weather forecast or stock market data, where a value is produced for every time step. Here, that value is a probability for each possible label at each token.
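
The idea of applying one dense softmax layer at every time step can be sketched in plain Python. The weights, inputs, and helper names below are invented for illustration. In the real model the inputs are the BiLSTM outputs and the weights are learned.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dense(vector, weights, bias):
    """A dense layer: one weighted sum per output unit (one row per unit)."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


# hypothetical values: 3 time steps with 2 features each, 3 possible labels
timesteps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_weights = [[2.0, 0.0], [0.0, 2.0], [1.5, 1.5]]
toy_bias = [0.0, 0.0, 0.0]

# the SAME weights are applied independently at every time step
step_probs = [softmax(dense(step, toy_weights, toy_bias)) for step in timesteps]

# the predicted label at each step is the most probable one
step_labels = [max(range(len(p)), key=p.__getitem__) for p in step_probs]
print(step_labels)
```

The output has one probability distribution per token, which is what lets the model assign a label to every word in the sequence.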

In [None]:
model.add(keras.layers.TimeDistributed(
    keras.layers.Dense(len(idx_to_label),activation='softmax')
))

Compiling the model after it has been made tells your computer how to use the layers we just added. The `optimizer` explains how to move around the high dimensional space that your data exists in. `loss` explains how to penalize incorrect predictions.

There are many [optimizers](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers), [losses](https://www.tensorflow.org/api_docs/python/tf/keras/losses), and [metrics](https://www.tensorflow.org/api_docs/python/tf/keras/metrics) available.
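
To give a feel for how a loss penalizes incorrect predictions, here is a hand-rolled sketch of sparse categorical cross-entropy: the average negative log of the probability the model gave to the correct label ID. The function is a simplified stand-in written for this example, not the Keras implementation, and the probabilities are made up.

```python
import math


def sparse_categorical_crossentropy(true_ids, predicted_probs):
    """Mean negative log-probability assigned to the correct label id."""
    losses = [-math.log(probs[true_id])
              for true_id, probs in zip(true_ids, predicted_probs)]
    return sum(losses) / len(losses)


toy_true = [0, 2]  # correct label ids for two tokens

# high probability on the correct labels
confident_right = [[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
# high probability on the wrong labels
confident_wrong = [[0.1, 0.8, 0.1], [0.8, 0.1, 0.1]]

low_loss = sparse_categorical_crossentropy(toy_true, confident_right)
high_loss = sparse_categorical_crossentropy(toy_true, confident_wrong)
print(low_loss, high_loss)
```

"Sparse" refers to the labels being stored as integer IDs rather than one-hot vectors, which matches the label format produced during preprocessing.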

In [None]:
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

Afterwards, we can print out a summary of the model to make sure everything looks right.

In [None]:
model.summary()

## Training the Sequence Tagger

Training involves finding the best "fit" of the data to the model we defined above, given the training data, a batch size, and how many iterations of the data to go through. As training progresses, `loss` should trend down and `accuracy` should trend up.
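
As a quick bit of arithmetic on how batch size and epochs interact, the sketch below counts weight updates, assuming one update per batch. The training set size of 1,020 is a made-up number, not the size of the tutorial data, and `updates_per_run` is a hypothetical helper.

```python
import math


def updates_per_run(n_examples, batch_size, epochs):
    """Return (batches per epoch, total weight updates over all epochs)."""
    steps_per_epoch = math.ceil(n_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * epochs


steps_per_epoch, total_updates = updates_per_run(1020, batch_size=50, epochs=10)
print(steps_per_epoch, total_updates)
```

The final batch of each epoch is smaller than the rest whenever the number of examples is not a multiple of the batch size.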

In [None]:
history = model.fit(x=train_idx,y=train_labels_idx,batch_size=50,epochs=10,verbose=1)

## Evaluating the Sequence Tagger

### Measuring Accuracy

We can evaluate the model based on the test data and get a simple output.

In [None]:
model.evaluate(x=test_idx,y=test_labels_idx)

The first thing to notice in these results is the exceedingly high accuracy, which should be suspicious for most NLP tasks, especially on complex data like clinical text or these abstracts. Additionally, the loss on the training set is lower than on the test set, indicating the model might be overfit.

In [None]:
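# predict label probabilities, then take the most probable label for each token
preds = np.argmax(model.predict(x=np.array(test_idx),batch_size=50),axis=-1)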

So let's look at some of the output in the test data.

In [None]:
preds[24]

This particular example has one label that isn't an "O" in the entire sequence.

### Other Metrics

NLP usually uses different metrics than simple accuracy because of the example shown above. A model can be highly accurate if all it does is predict the most common label in the dataset. In our case, "O" makes up almost all of it.

In [None]:
from sklearn.metrics import classification_report

Instead, precision, recall, and f1 are used. These provide an accuracy-like result, but are more informative on rare labels.

In [None]:
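# pass every label id explicitly so the report still lines up with
# target_names when a label never appears in the test set or predictions
print(classification_report(np.array(test_labels_idx).flatten(),preds.flatten(),
                            labels=list(range(len(idx_to_label))),
                            target_names=idx_to_label))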

In this report, the "O" label is very accurate, but makes up 98% of the data. This is a common problem in NLP, where the sheer number of words in a document or sentence drowns out the rare, relevant information.

Some of the more frequent labels such as "U-DISEASE" are accurate, but the "I-" labels are not.
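
To make the gap between accuracy and per-label metrics concrete, the sketch below scores a "model" that always predicts "O" on a toy label set that is 98% "O". The labels and the helper `precision_recall_f1` are invented for this example. The helper follows the standard definitions rather than calling scikit-learn.

```python
def precision_recall_f1(true_labels, predicted_labels, target):
    """Precision, recall, and f1 for one label, from the standard definitions."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


toy_truth = ["O"] * 98 + ["U-DISEASE"] * 2   # 98% of tokens are "O"
all_o = ["O"] * 100                           # always predict "O"

toy_accuracy = sum(t == p for t, p in zip(toy_truth, all_o)) / len(toy_truth)
disease_scores = precision_recall_f1(toy_truth, all_o, "U-DISEASE")
o_scores = precision_recall_f1(toy_truth, all_o, "O")

print("accuracy:", toy_accuracy)
print("U-DISEASE precision/recall/f1:", disease_scores)
print("O precision/recall/f1:", o_scores)
```

Accuracy is 0.98 even though not a single disease was found, while the precision, recall, and f1 for "U-DISEASE" are all 0.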

# Using the Output

Next, let's briefly go over some ways we can use this model we've trained or visualize the output on text.

## Spacy Component

Spacy is a natural language processing library that is as common in NLP as tensorflow/keras are in deep learning. It can use out-of-the-box models trained on their data, add custom models like we will do below, and support rule-based NLP development.

In [None]:
import spacy

from spacy import displacy
from spacy.tokens import Span
from spacy.lang.en import English

A spacy "component" is a class that defines the `__call__` method and does something to a piece of text sent in when that method is used.

For this tutorial, it breaks the text into chunks of the `seq_length` that we defined earlier, predicts the BILUO tags for each chunk with the model we just trained, and then attaches the resulting entities to the text.

In [None]:
class BiLSTMNER:

  def __init__(self,model,idx_to_label,word_to_idx,seq_length):
    ''' Initializing the spacy component. We want to pass in our trained
    model, our label dictionary, and our word dictionary.
    '''
    self.model = model
    self.idx_to_label = idx_to_label
    self.word_to_idx = word_to_idx
    self.seq_length = seq_length

  def __call__(self,doc):
    ''' This is the method called when you run: doc = nlp("Your text here")

    We want to predict the labels for documents by converting the spacy doc
    into an object usable by Tensorflow, i.e. a list of IDs instead of tokens.

    Then, we want to save the predictions from Tensorflow into the doc.
    '''

    subsequence = []
    offset = 0 # position in the doc of the first token of the current chunk

    # break document into seq_length chunks and predict
    for token in doc:
      if token.lower_ in self.word_to_idx:
        subsequence.append(self.word_to_idx[token.lower_])
      else:
        subsequence.append(self.word_to_idx['OOV'])
      if len(subsequence) == self.seq_length:
        self.make_preds(doc,subsequence,offset)
        offset += self.seq_length
        subsequence = []
    
    # if there are leftover words, pad the data
    if len(subsequence) > 0:
      subsequence += [self.word_to_idx['PAD'] 
                      for i in range(self.seq_length-len(subsequence))]
      self.make_preds(doc,subsequence,offset)

    return doc

  def make_preds(self,doc,sequence,offset):
    '''This is a helper method to predict BILUO labels and attach them to 
    documents. `offset` is where this chunk starts in the doc.

    Although BILUO labels have a specific expected structure, for the purposes
    of having more entities shown in visualizations, ANY consecutive sequence of
    BIL tags with the same label are considered an entity.

    U and O tags are treated as expected.
    '''
    # take the most probable label at each position
    probs = self.model.predict(np.array([sequence]))[0]
    preds = np.argmax(probs,axis=-1)

    # ignore predictions made on padding past the end of the document
    n_tokens = min(len(sequence),len(doc)-offset)
    preds_biluo = [self.idx_to_label[pred] for pred in preds[:n_tokens]]

    entities = []
    start = None
    label = None
    for i, tag in enumerate(preds_biluo):
      # close an open entity on O, U, or a change of label
      # (Span end indexes are exclusive, so the entity covers start..i-1)
      if start is not None and (tag == "O" or tag.startswith("U")
                                or tag[2:] != label):
        entities.append(Span(doc,offset+start,offset+i,label=label))
        start = None
      if tag.startswith("U"):
        entities.append(Span(doc,offset+i,offset+i+1,label=tag[2:]))
      elif tag != "O" and start is None:
        start = i
        label = tag[2:]

    # close an entity that runs to the end of the chunk
    if start is not None:
      entities.append(Span(doc,offset+start,offset+n_tokens,label=label))

    doc.ents = list(doc.ents) + entities



To start using spacy, we need to start with an English model, since our text is (mostly) English.

In [None]:
nlp = English()

Then add the ability to detect words and sentences.

In [None]:
nlp.add_pipe(nlp.create_pipe('sentencizer'))

And finally add our custom component that will predict the entities.

In [None]:
bilstm = BiLSTMNER(model,idx_to_label,word_to_idx,seq_length)

nlp.add_pipe(bilstm, name='bilstm')

## Visualizing Output on New Data

Spacy was covered quickly, but the purpose of using spacy is to be able to use a library within spacy called displacy. It is a visualization tool for our prediction results. This makes the results much more human readable than a sequence of IDs or BILUO tags.

In [None]:
doc = nlp(validation_data[0][1])
spacy.displacy.render(doc,style='ent',jupyter=True)

In [None]:
doc = nlp(validation_data[1903][1])
spacy.displacy.render(doc,style='ent',jupyter=True)

We notice even more now that the >95% accuracy is misleading. Sequences often do not have all of their entities labeled, and when an entity is labeled, the model might label only part of the correct phrase.

There are two reasons why this could be the case:

1.   The model did not learn what was in the data, due to the shape/type of the model, training time, or other factors.
2.   The information did not exist in the training data for the model to learn from.

Pinpointing which of the two is difficult and solving either one can be equally difficult.

If after further error analysis, you find part of the model to be satisfactory but wish to improve another aspect, it is easy (and relatively fast) to blend post-processing and rule-based NLP into the results.



## Adding Rule-Based Components

In one of the above examples, we noticed that the phrase "COVID-19" was missing from the labels. Considering this is a COVID-19 related dataset, that is a large oversight.

Luckily, Spacy allows us to quickly add new entities to documents with a few rules using its built-in "EntityRuler" feature.

In [None]:
from spacy.pipeline import EntityRuler

We create a new EntityRuler.

In [None]:
ruler = EntityRuler(nlp)

Add some rules in a JSON format. If there is a match on `pattern`, `label` will be added to the document. We can even add entirely new labels, such as symptoms.
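
To illustrate what a `LOWER` pattern means, here is a simplified pure-Python sketch that slides each pattern over a list of tokens and compares lowercased text. It is not how Spacy implements matching. The helper `match_patterns`, the toy rules, and the whitespace tokenization are assumptions made for this example.

```python
def match_patterns(tokens, rules):
    """Return (start, end, label) for every place a rule's pattern matches.

    Only the "LOWER" attribute is supported in this sketch.
    """
    lowered = [token.lower() for token in tokens]
    matches = []
    for rule in rules:
        pattern = [item["LOWER"] for item in rule["pattern"]]
        width = len(pattern)
        for i in range(len(lowered) - width + 1):
            if lowered[i:i + width] == pattern:
                matches.append((i, i + width, rule["label"]))
    return sorted(matches)


toy_rules = [{"label": "DISEASE", "pattern": [{"LOWER": "covid-19"}]},
             {"label": "SYMPTOM", "pattern": [{"LOWER": "cough"}]},
             {"label": "SYMPTOM", "pattern": [{"LOWER": "abdominal"},
                                              {"LOWER": "pain"}]}]

toy_tokens = "Patients with COVID-19 reported cough and abdominal pain".split()
toy_matches = match_patterns(toy_tokens, toy_rules)
print(toy_matches)
```

Each dictionary in a `pattern` describes one token, so a two-token phrase like "abdominal pain" needs two entries.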

In [None]:
rules = [{"label": "DISEASE", "pattern": [{"LOWER": "covid-19"}]},
         {"label": "SYMPTOM", "pattern": [{"LOWER": "cough"}]},
         {"label": "SYMPTOM", "pattern": [{"LOWER": "dyspnea"}]},
         {"label": "SYMPTOM", "pattern": [{"LOWER": "abdominal"},{"LOWER":"pain"}]}]

ruler.add_patterns(rules)

Add the rules to our nlp system after our trained model.

In [None]:
nlp.add_pipe(ruler,name='entity_ruler')

And visualize the sentence again.

In [None]:
doc = nlp(validation_data[1903][1])
spacy.displacy.render(doc,style='ent',jupyter=True)

#### Other Spacy Packages

For clinical text, further information is often needed, such as the context surrounding any one entity.

These can be added into our system as well.

Medspacy is one library in development by clinical NLP researchers at the VA and University of Utah that augments spacy's effectiveness on clinical text. 

In [None]:
!pip install medspacy

We will quickly look at the context around our entities in this sentence and visualize them again.

In [None]:
from medspacy.context import ConTextComponent
from medspacy.visualization import visualize_dep

Make the spacy component with default configurations.

In [None]:
context = ConTextComponent(nlp)

nlp.add_pipe(context,name='context')

Visualize the result again.

In [None]:
doc = nlp(validation_data[1903][1])
visualize_dep(doc,jupyter=True)

# What Next?

The system shown in this tutorial was simple and not very accurate due to complex, sparse data. What else can be done?

Unfortunately, iterative development and machine learning generally do not mix. It is hard to guarantee any action improves the quality of the model. However, there are some common next steps that you can try:

1.   Tune the model to your data. This example was only run on one specific set of dimensions and epochs. Try a different loss function, smaller embeddings, a bigger LSTM size, more dropout, less dropout, etc.
2.   Change the shape of the model. Instead of LSTM cells, try GRU. Instead of an RNN, try a CNN or Transformer. Try adding a CRF. Try two or more layers of LSTMs.
3.   Try pre-trained word embeddings or contextual embeddings.
4.   Try a traditional machine learning (not deep learning) approach.
5.   Augment the system with rules or other pre and post-processing steps.
6.   Change project scope or concept definitions.


# Conclusion

This tutorial focuses on some of the most common practical challenges posed by deep learning for NLP tasks.

Much of what is shown here is now built into the libraries used. However, example code and documentation often leave it unclear what each method is doing and why.

The model shown was simple. Deep learning models can be made as complicated as you can imagine (and pay for the computers to train on). Current state-of-the-art systems are an order of magnitude larger and often take weeks or months to train on computers far larger than those available at the VA or University of Utah's Center for High Performance Computing. However, that does not mean that bigger models are always better. Smaller models are faster, cheaper, and often still perform better on your specific task. Determining what will work for any one task is far more of an art than a science and will never be fully covered in one tutorial.

However, the steps shown surrounding the creation and training of a model are applicable to any NLP deep learning task, not just the one used as an example. That is why improving the performance of this particular model was not the focus.

Deep learning and NLP are becoming increasingly mixed and dominating much of the latest research from all angles of NLP (computer science NLP, computational linguistics, clinical NLP, etc.). Understanding how these models are created and what they can and cannot do is a key component of understanding the accomplishments of many researchers in the field, potentially applying it to your own work, and cutting through disinformation and hype surrounding this technology. 



# Resources

This colab notebook will be publicly available indefinitely, as will the [git repository](https://github.com/turbosheep/deep_learning_for_cNLP_AMIA2020) storing the training and testing data.

Some additional resources are listed below, in no particular order:

* Python
  * [SpaCy](https://spacy.io/)
    * [MedSpaCy](https://github.com/medspacy) (developed and used by VINCI+University of Utah)
  * [NLTK](http://www.nltk.org/)
  * [CoreNLP](https://stanfordnlp.github.io/CoreNLP/)
  * [AllenNLP](https://allennlp.org/)
  * [HuggingFace](https://huggingface.co/)
* Java
  * [OpenNLP](https://opennlp.apache.org/)
  * [Stanford NLP](https://nlp.stanford.edu/software/index.shtml)
  * [Leo](http://department-of-veterans-affairs.github.io/Leo/) (developed and used by VINCI)

Other software packages that may be useful for clinical NLP users and developers:
* [CLAMP](https://clamp.uth.edu/)
* [cTAKES](https://ctakes.apache.org/)
* [Metamap](https://metamap.nlm.nih.gov/)

Other tools that are useful for development but are not NLP-specific:
* Data Analysis
  * [Numpy](https://numpy.org/)
  * [Scipy](https://www.scipy.org/)
  * [Matplotlib](https://matplotlib.org/)
  * [Plotly](https://plotly.com/)
* Machine Learning
  * [Scikit-Learn](https://scikit-learn.org/stable/)
  * [CRF++](https://taku910.github.io/crfpp/)
* Deep Learning
  * [Tensorflow](https://www.tensorflow.org/)
  * [PyTorch](https://pytorch.org/)
  * [Keras](https://keras.io/)

General Resources:
* [StackOverflow](https://stackoverflow.com/), specifically the tensorflow, keras, spacy, etc. tags
* [TowardsDataScience](https://towardsdatascience.com/), helpful articles with tutorials and examples of many kinds of deep learning tasks
* [MachineLearningMastery](https://machinelearningmastery.com/), another source of help articles and tutorials
* [Deep Learning for NLP Best Practices](https://ruder.io/deep-learning-nlp-best-practices/), an article covering some of what was covered here plus a lot of other helpful tricks and good habits
* [Neural Network Methods for Natural Language Processing](https://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=1056), a textbook written by a prominent NLP researcher and professor covering many deep learning NLP tasks in detail
* The Issues page on Github for any of the linked libraries can also be useful for troubleshooting.

# Questions?

If there are further questions about the contents of this demonstration, feel free to reach out to Olga (olga.patterson@utah.edu) or Hannah (hannah.eyre@utah.edu).

If you have questions about NLP development in VINCI environments or questions about what NLP-developed datasets are available to VA researchers, please send an email to vinci@va.gov.