# Gene NER using PySysrev and Human Review (Part III)
<span style="color:gray">James Borden, Nole Lin</span>

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2000 abstracts reviewed in the Sysrev [Gene Hunter project](https://sysrev.com/p/3144). This third part of the series details how we evaluate our model.

In this notebook we:

1. **Evaluate Model** on Gene Hunter text to test performance
2. **Demonstrate** our model in action on example sentences

We start by training on our processed data, holding out 20% of the articles as a test set. We train for 20 epochs with a dropout rate of 0.2.

In [1]:
from __future__ import unicode_literals, print_function
import spacy
import PySysrev
import random, sys

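# Each item is (text, {'entities': [...]}) in spaCy's training format
TRAIN_DATA = PySysrev.processAnnotations(project_id=3144, label='GENE')
# Hold out 20% of the unique article texts as a test set
uniq_articles = list(set([x[0] for x in TRAIN_DATA]))
test_size = int(0.2 * len(uniq_articles))
test_articles = set(uniq_articles[0:test_size])  # set for fast membership tests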

nlp = spacy.blank('en')
nlp.meta['name'] = 'gene'

ner = nlp.create_pipe('ner')
ner.add_label('GENE')

nlp.add_pipe(ner)
optimizer = nlp.begin_training()

epochs = 20

for itn in range(epochs):
    random.shuffle(TRAIN_DATA)
    losses = {}
    # Keep only the examples whose article is not in the held-out test set
    train_items = [item for item in TRAIN_DATA if item[0] not in test_articles]
    text = [item[0] for item in train_items]
    annotations = [item[1] for item in train_items]

    # One update per epoch over the full training set, with 20% dropout
    nlp.update(text, annotations, sgd=optimizer, drop=0.2, losses=losses)
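
The split above relies on the iteration order of a Python set, which may differ between interpreter sessions, so the held-out articles are not guaranteed to be the same on every run. As a minimal sketch of a reproducible alternative, the helper below sorts the unique texts and shuffles them with a fixed seed. The name `split_articles`, its parameters, and the toy data are assumptions for illustration and are not part of PySysrev or the original notebook.

```python
import random

def split_articles(texts, test_fraction=0.2, seed=0):
    """Return (train_texts, test_texts) as two disjoint sets of unique texts."""
    # Sort first so the result does not depend on set iteration order
    unique = sorted(set(t for t in texts if t is not None))
    # A seeded generator gives the same shuffle on every run
    random.Random(seed).shuffle(unique)
    test_size = int(test_fraction * len(unique))
    test_texts = set(unique[:test_size])
    train_texts = set(unique[test_size:])
    return train_texts, test_texts

# Toy stand-in for the article texts in TRAIN_DATA (hypothetical data)
toy_texts = ["abstract %d" % i for i in range(10)] + ["abstract 0", None]
train_texts, test_texts = split_articles(toy_texts)
print(len(train_texts), len(test_texts))
```

With the real data, the input would be `[x[0] for x in TRAIN_DATA]`, and the returned test set could stand in for `test_articles`.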



Now that we have our model, let's evaluate its performance. The function below computes the sensitivity and specificity of our model on either the test set or the training set.

In [2]:
from __future__ import division

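def get_metrics(test_or_train, model):
    # Select the held-out articles or the articles used for training
    if test_or_train == 'test':
        section = [x for x in TRAIN_DATA if x[0] in test_articles]
    elif test_or_train == 'train':
        section = [x for x in TRAIN_DATA if x[0] not in test_articles]
    else:
        raise ValueError("test_or_train must be 'test' or 'train'")
    true_genes = 0      # annotated gene mentions
    pred_genes = 0      # predicted entities whose text matches an annotated gene
    true_non_genes = 0  # tokens that are not annotated genes
    pred_non_genes = 0  # of those tokens, the ones the model also left unlabeled
    nlp2 = spacy.load('en_core_web_sm')  # used only to tokenize the text
    for txt in section:
        if txt[0] is not None:
            doc = model(txt[0])
            predict_annotations = [str(x) for x in list(doc.ents)]
            entities = txt[1]['entities']
            # Recover the annotated gene strings from their character offsets
            true_annotations = [txt[0][x[0]:x[1]] for x in entities]
            pred_genes += len([value for value in predict_annotations if value in true_annotations])
            true_genes += len(true_annotations)
            doc2 = nlp2(txt[0])
            for token in doc2:
                if str(token) not in true_annotations:
                    true_non_genes += 1
                    if str(token) not in predict_annotations:
                        pred_non_genes += 1
    # (sensitivity, specificity)
    return pred_genes / true_genes, pred_non_genes / true_non_genes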

In [4]:
test_sensitivity, test_specificity = get_metrics('test', nlp)
train_sensitivity, train_specificity = get_metrics('train', nlp)

Below we see the values for our model's metrics. Sensitivity is the proportion of annotated genes that the model correctly identified as genes. Specificity is the proportion of non-gene tokens that the model correctly left unlabeled. Because `get_metrics` compares entity strings rather than character offsets, both values are approximate. The bar chart shows a respectable performance by our trained model.

In [13]:
import plotly as py
import plotly.graph_objs as go

data = [go.Bar(
            x=['test_sensitivity', 'test_specificity', 'train_sensitivity', 'train_specificity'],
            y=[test_sensitivity, test_specificity, train_sensitivity, train_specificity]
    )]

py.plotly.iplot(data)
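
To make the two definitions concrete, here is a small worked example that computes sensitivity and specificity from confusion-matrix counts. The function name `sensitivity_specificity` and the counts are hypothetical and chosen only for illustration; they are not results from the Gene Hunter model. In terms of `get_metrics`, `pred_genes` plays the role of the true positives, `true_genes` is the number of actual genes, `pred_non_genes` plays the role of the true negatives, and `true_non_genes` is the number of actual non-genes.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Compute (sensitivity, specificity) from confusion-matrix counts."""
    # Sensitivity: genes the model found, out of all true genes
    sensitivity = float(tp) / (tp + fn)
    # Specificity: non-genes the model left unlabeled, out of all true non-genes
    specificity = float(tn) / (tn + fp)
    return sensitivity, specificity

# Hypothetical counts: 10 true genes (8 found), 100 true non-genes (90 left alone)
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=90, fp=10)
print(sens, spec)
```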

Now, we look at specific sentences to see how our model performs in detecting gene terms. Here, it's able to extract "HMOX" and "UGT1A1" correctly and exclude the rest of the words.

In [28]:
from spacy import displacy
from IPython.core.display import display, HTML

doc = nlp("The aim of our study was to assess the possible relationships among heme oxygenase (HMOX), bilirubin UDP-glucuronosyl transferase (UGT1A1) promoter gene variations, serum bilirubin levels, and Fabry disease (FD).")
html_ner_prediction = spacy.displacy.render(doc, style='ent')

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

Again, the model correctly detects an unconventional gene name that contains a hyphen.

In [29]:
doc = nlp("Differential Requirement of Human Cytomegalovirus UL112-113 Protein Isoforms for Viral Replication.")
html_ner_prediction = spacy.displacy.render(doc, style='ent')

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

However, we now see some flaws in our model. The sentence below contains two gene names, "MDM2" and "p53", but because they are separated by a slash instead of a space, the model is unable to identify them.

In [30]:
doc = nlp("Furthermore, our results demonstrate that miR-365 functions as an upstream regulator of MDM2/p53 expression, cell cycle progression and apoptosis in trophoblasts")
html_ner_prediction = spacy.displacy.render(doc, style='ent')

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))
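
If the slash is indeed the cause, one possible workaround is to pad slashes with spaces before passing text to the model, so that each gene name is delimited by whitespace. This is only a sketch and is untested against the trained model; the helper name `pad_slashes` is hypothetical.

```python
import re

def pad_slashes(text):
    """Put single spaces around every slash, e.g. 'MDM2/p53' -> 'MDM2 / p53'."""
    # Any existing whitespace around the slash is collapsed to one space per side
    return re.sub(r"\s*/\s*", " / ", text)

print(pad_slashes("upstream regulator of MDM2/p53 expression"))
```

Padding changes character offsets, so it would suit prediction time only unless the training annotations were shifted to match. It also affects every slash in the text, including ones in ordinary words such as "and/or".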

At other times, the model finds only one of the genes in the sentence. "SP1" is also a gene, but only "malat1" is highlighted.

In [31]:
doc = nlp("showed that malat1\xa0M5 interacts with the C-terminal domain of SP1 by RNA immunoprecipitation (RIP) assay coupled with UV cross-linking")
html_ner_prediction = spacy.displacy.render(doc, style='ent')

display(HTML("<div style='color:red;padding-left:50px'>{}</div>".format(html_ner_prediction)))

Overall, our trained model shows promising results on the test and train metrics, as well as on specific identification tasks. To improve performance, we could tune hyperparameters such as the number of epochs and the dropout rate. With our current working model, our next step is to turn it into a web application with an API, documented in the next post.
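
A search over the number of epochs and the dropout rate could be organized as a small grid search, sketched below. Here `train_and_score` is a hypothetical callable that would retrain the model with the given settings and return a score on held-out data; the dummy scorer and the candidate values are placeholders for illustration, not recommendations.

```python
import itertools

def grid_search(train_and_score, epoch_options, dropout_options):
    """Try every (epochs, dropout) pair and return (best_score, epochs, dropout)."""
    best = None
    for epochs, dropout in itertools.product(epoch_options, dropout_options):
        score = train_and_score(epochs, dropout)
        # Keep the first setting that achieves the highest score
        if best is None or score > best[0]:
            best = (score, epochs, dropout)
    return best

def dummy_score(epochs, dropout):
    # Placeholder scorer that peaks at 20 epochs and 0.3 dropout.
    # A real scorer would train the NER model and evaluate it on held-out data.
    return -abs(epochs - 20) - abs(dropout - 0.3)

best_score, best_epochs, best_dropout = grid_search(
    dummy_score, epoch_options=[10, 20, 30], dropout_options=[0.2, 0.3, 0.5])
print(best_epochs, best_dropout)
```

Scoring each setting on a validation split kept separate from the test set would avoid tuning the hyperparameters to the test data.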