# Gene NER using PySysrev and Human Review (Part II)
<span style="color:gray">James Borden, Nole Lin</span>

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2000 abstracts reviewed in the Sysrev [Gene Hunter project](https://sysrev.com/p/3144). This second part of the series describes how to use the spaCy library (spacy.io) to train a model that detects gene names in text.

In this notebook we:

1. **Train Model** using annotations from the Gene Hunter project
2. **Test Model** on example text to check its performance

We start by getting the training annotations from the Gene Hunter project ([sysrev.com/p/3144](https://sysrev.com/p/3144)) below. This process is described in [Part I](https://s3.amazonaws.com/sysrev-blog/NERGenes_Processing.html).

In [1]:
import PySysrev
TRAIN_DATA = PySysrev.processAnnotations(project_id=3144, label='GENE')
print("text:\t{}...\nentities:\t{}".format(TRAIN_DATA[0][0][0:100],TRAIN_DATA[0][1]))
print("num paragraphs: {}".format(len(TRAIN_DATA)))

text:	BACKGROUND: Olaparib is an oral poly(adenosine diphosphate-ribose) polymerase inhibitor that has pro...
entities:	{u'entities': [(183, 187, 'GENE'), (1726, 1730, 'GENE'), (354, 358, 'GENE')]}
num paragraphs: 1230
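
Each training item pairs a text with a dictionary of `(start, end, label)` character offsets, as the output above shows. Before training, it can help to confirm that the offsets really point at gene names. The following is a minimal sketch of such a check. The sample sentence and the helper names `entity_strings` and `invalid_spans` are hypothetical and are not part of PySysrev.

```python
# Hypothetical sample in the same (text, {'entities': [...]}) shape as TRAIN_DATA.
sample = [
    ("BRCA1 and TP53 are tumor suppressor genes.",
     {"entities": [(0, 5, "GENE"), (10, 14, "GENE")]}),
]

def entity_strings(example):
    """Return the substring covered by each (start, end, label) span."""
    text, annotation = example
    return [text[start:end] for start, end, _label in annotation["entities"]]

def invalid_spans(example):
    """Return spans that are empty, reversed, or fall outside the text."""
    text, annotation = example
    return [(start, end, label)
            for start, end, label in annotation["entities"]
            if not (0 <= start < end <= len(text))]

print(entity_strings(sample[0]))  # ['BRCA1', 'TP53']
print(invalid_spans(sample[0]))   # []
```

Running the same two helpers over every item in `TRAIN_DATA` would surface annotations whose offsets do not line up with the text.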


## Training a gene annotation model

After formatting our annotations and text, we can train an NER model for genes. To train the model we:
1. Initialize a blank English spaCy model.
2. Train the model on the Gene Hunter training set.
3. Save our model.

**(1) Initialize an English spaCy model**  
Creating a blank spaCy model is simple. Below we create an empty English pipeline, give it a name, and add an NER component to it. We register `GENE` as the only entity label and call `nlp.begin_training()` to get the optimizer used during training.

In [2]:
from __future__ import unicode_literals, print_function
import spacy

nlp = spacy.blank('en')
nlp.meta['name'] = 'gene'

ner = nlp.create_pipe('ner')
ner.add_label('GENE')

nlp.add_pipe(ner)
optimizer = nlp.begin_training()

**(2) Train the model**  
To train the model we repeatedly call `nlp.update` on the training corpus `TRAIN_DATA`. Each pass over the corpus is referred to as an 'epoch', and the training loss should generally fall from one epoch to the next. Internally spaCy is fitting a complex model to the 1230 training paragraphs provided by Sysrev. The [spaCy documentation](https://spacy.io/usage/linguistic-features#section-named-entities) helps explain this process.

<span style="color:#AC3434">Warning: running this code may take a long time. Consider reducing the training size or using fewer epochs.</span>

In [65]:
import sys

epochs = 30
trace_losses = []  # per-epoch NER loss (we use this for graphing later)

# The texts and annotations do not change between epochs, so build them once.
texts = [item[0] for item in TRAIN_DATA]        # training text items
annotations = [item[1] for item in TRAIN_DATA]  # training annotations

for itn in range(epochs):
    sys.stdout.write("{} ".format(itn))  # progress indicator
    losses = {}
    # One update over the whole corpus as a single batch, with 60% dropout.
    nlp.update(texts, annotations, sgd=optimizer, drop=0.6, losses=losses)
    trace_losses.append(losses['ner'])  # track the per-epoch loss

print(" done")

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29  done
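
The loop above passes the whole corpus to `nlp.update` as one batch in every epoch. A common variation, not used in this notebook, is to shuffle the data each epoch and update on small batches instead. The sketch below shows only the shuffling and batching logic in plain Python. The helper `minibatches` and the `toy_data` list are hypothetical stand-ins, and spaCy also ships its own batching helper, `spacy.util.minibatch`.

```python
import random

def minibatches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Hypothetical stand-in for TRAIN_DATA so the sketch runs on its own.
toy_data = [("text {}".format(i), {"entities": []}) for i in range(10)]

random.seed(0)             # fixed seed only to make the shuffle repeatable
shuffled = list(toy_data)  # copy, so the original order is left untouched
random.shuffle(shuffled)

batch_sizes = []
for batch in minibatches(shuffled, size=4):
    texts = [text for text, _ in batch]
    annotations = [ann for _, ann in batch]
    # In the real loop, nlp.update(texts, annotations, ...) would be called here.
    batch_sizes.append(len(texts))

print(batch_sizes)  # [4, 4, 2]
```

Smaller batches mean more parameter updates per epoch, so the number of epochs and the dropout rate would likely need retuning if this variation were adopted.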


## Visualize Training
It is important to see how the model learns over time. The graph below shows that the training loss is still falling after all 30 epochs, but more epochs might result in overfitting. We may already have overfit! Training loss alone cannot tell us, because it is measured on the same data the model is fitted to. The 60% [dropout](https://en.wikipedia.org/wiki/Dropout_(neural_networks)) we used is one method to combat overfitting.

In [95]:
import plotly as py
import plotly.graph_objs as go

trace0 = go.Scatter(x=range(epochs),y=trace_losses)
py.plotly.iplot([trace0], auto_open=False)
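
One way to check for overfitting, which this notebook does not do, is to hold out part of the annotated data and compare the model's behavior on it with its behavior on the training portion. The following is a minimal sketch of such a split. The function `train_dev_split`, the 20% hold-out fraction, and the `toy_data` list are assumptions chosen for illustration.

```python
import random

def train_dev_split(data, dev_fraction=0.2, seed=0):
    """Shuffle a copy of `data` and split it into (train, dev) lists."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]

# Hypothetical stand-in for TRAIN_DATA so the sketch runs on its own.
toy_data = [("paragraph {}".format(i), {"entities": []}) for i in range(10)]

train, dev = train_dev_split(toy_data, dev_fraction=0.2)
print(len(train), len(dev))  # 8 2
```

With a split like this, the model would be trained on `train` only, and `dev` would be kept aside for measuring how well it generalizes.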

## Test Model

Below we visualize the model's predictions on a paragraph from the training corpus. We use the spaCy `displacy` visualizer ([documentation](https://spacy.io/usage/visualizers)) and some HTML formatting to help readability.

The results look great! The model seems to capture most of the genes, though it does miss `TNFα`, maybe because of that pesky alpha. The model also manages to avoid incorrectly labelling non-genes as genes. Keep in mind that this paragraph was part of the training data, so it is an easy test.

In [74]:
test_text = TRAIN_DATA[3][0][0:659].replace("\n","  ")

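from spacy import displacy

ner_prediction = nlp(test_text)
# Render the prediction from the trained `nlp` model (`nlp2` was never defined).
# jupyter=False makes render() return the HTML string instead of displaying it.
html_ner_prediction = displacy.render(ner_prediction, style='ent', jupyter=False)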

from IPython.core.display import display, HTML
display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))
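
The two observations above, a missed gene and no false positives, correspond to recall and precision. As a rough sketch of how they could be quantified, the function below compares gold and predicted spans by exact match. The name `span_scores` and the example spans are hypothetical. In practice the predicted spans could be read from a spaCy document as `(ent.start_char, ent.end_char, ent.label_)` for each `ent` in `doc.ents`.

```python
def span_scores(gold, predicted):
    """Exact-match precision, recall and F1 over (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_pos = len(gold & predicted)
    precision = true_pos / float(len(predicted)) if predicted else 0.0
    recall = true_pos / float(len(gold)) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical spans: three annotated genes, of which the model finds two.
gold = [(0, 5, "GENE"), (10, 14, "GENE"), (30, 34, "GENE")]
predicted = [(0, 5, "GENE"), (10, 14, "GENE")]

precision, recall, f1 = span_scores(gold, predicted)
print("precision={:.2f} recall={:.2f} f1={:.2f}".format(precision, recall, f1))
# precision=1.00 recall=0.67 f1=0.80
```

Exact matching is strict: a prediction that overlaps a gene but has slightly different boundaries counts as both a false positive and a miss.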

## Evaluation and Production
The results above look pretty good, but they come from a paragraph the model saw during training, so we need to do a better job of testing. In Part III we will evaluate the model more thoroughly. In Part IV the model will become available in a web application and an API.