# Gene NER using PySysrev and Human Review (Part III)
<span style="color:gray">James Borden, Nole Lin</span>

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2000 abstracts reviewed in the Sysrev [Gene Hunter project](https://sysrev.com/p/3144). This third part of the series details how we can evaluate our model.

In this notebook we:

1. **Perform k-fold cross validation** on our model
2. **Evaluate Model** on Gene Hunter text to test performance

We start by getting the training annotations from the Gene Hunter project ([sysrev.com/p/3144](https://sysrev.com/p/3144)) below. This process is described in [part I](https://s3.amazonaws.com/sysrev-blog/NERGenes_Processing.html).

## K-fold Cross Validation

A parameter we can tune when training our model in spaCy is the dropout rate. The dropout rate is the proportion of hidden-layer nodes that are randomly "switched off" during an update, so that their connections pass along no information while the weights are adjusted. To choose a dropout rate, we can split our training data into k parts (folds) and, for each candidate rate, average the loss over the k runs that each leave one fold out. We demonstrate this practice below. Here, we set the number of folds to 5, the dropout rates to range from 0.1 to 0.5, and the number of epochs to 3 to save time.
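
To make the "switch off" idea concrete, here is a minimal sketch of dropout applied to a plain list of activations. It is a generic illustration (the "inverted" variant, where surviving values are rescaled), not spaCy's internal implementation. The function name `apply_dropout` and the toy activation values are assumptions made for this example.

```python
import random

def apply_dropout(activations, rate, rng):
    """Zero each activation with probability `rate` and rescale the survivors."""
    keep_prob = 1.0 - rate  # assumes rate < 1.0
    dropped = []
    for value in activations:
        if rng.random() < rate:
            dropped.append(0.0)                # node is switched off for this update
        else:
            dropped.append(value / keep_prob)  # survivors are scaled up to compensate
    return dropped

rng = random.Random(0)          # fixed seed so the run is repeatable
hidden = [0.5, 1.0, 1.5, 2.0]   # hypothetical hidden-layer activations
print(apply_dropout(hidden, rate=0.2, rng=rng))
```

A higher `rate` zeroes more nodes per update, which regularizes more strongly but leaves less signal to learn from in each step.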

In [None]:
from __future__ import unicode_literals, print_function
import spacy
import PySysrev
import random, sys

TRAIN_DATA = PySysrev.processAnnotations(project_id=3144, label='GENE')

epochs = 3
num_folds = 5
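fold_size = len(TRAIN_DATA) // num_folds  # integer division so range() below receives ints
k_fold_results = {}
trace_losses = []  # per-epoch training losses, appended to inside the loop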

for dor in [0.1, 0.2, 0.3, 0.4, 0.5]:
    for k in range(num_folds):
        
        nlp = spacy.blank('en')
        nlp.meta['name'] = 'gene'

        ner = nlp.create_pipe('ner')
        ner.add_label('GENE')

        nlp.add_pipe(ner)
        optimizer = nlp.begin_training()
        
        for itn in range(epochs):
            losses = {}
            holdout_range = range(k * fold_size, (k + 1) * fold_size)
            text = [item[0] for count, item in enumerate(TRAIN_DATA) if count not in holdout_range] #get training text items
            annotations = [item[1] for count, item in enumerate(TRAIN_DATA) if count not in holdout_range] #get training annotations

            nlp.update(text, annotations, sgd=optimizer, drop=dor,losses=losses)
            trace_losses.append(losses['ner']) #track the per epoch losses
        k_fold_results[(dor, k)] = losses['ner']
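
The loop above stores the final-epoch training loss for each fold and never scores the model on the held-out slice. As a minimal sketch of how the two slices could be separated, so that a fold can also be evaluated on data it was not trained on, consider the generator below. The name `k_fold_splits` and the toy data are assumptions for illustration. Unlike the loop above, this version puts any leftover items into the last fold.

```python
def k_fold_splits(data, num_folds):
    """Yield (train, holdout) pairs; the last fold absorbs any remainder."""
    fold_size = len(data) // num_folds
    for k in range(num_folds):
        start = k * fold_size
        # The final fold runs to the end so no item is left out of every holdout.
        stop = (k + 1) * fold_size if k < num_folds - 1 else len(data)
        holdout = data[start:stop]
        train = data[:start] + data[stop:]
        yield train, holdout

toy_data = list(range(11))  # stand-in for TRAIN_DATA
for train, holdout in k_fold_splits(toy_data, 5):
    print(len(train), holdout)
```

Each `holdout` list could then be passed to whatever scoring step is preferred, while `train` is used for `nlp.update` as in the cell above.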

Now we want to find the dropout rate that had the lowest average loss across the 5 folds. In the output below, a rate of 0.2 performed the best.

In [33]:
import numpy as np

for dor in [0.1, 0.2, 0.3, 0.4, 0.5]:
    dor_results = [x for x in k_fold_results if x[0] == dor]  # keys are (dropout rate, fold) tuples
    print ("Dropout Rate: ", dor, "Average Loss: ", np.mean([k_fold_results[x] for x in dor_results]))

Dropout Rate:  0.1 Average Loss:  0.00804237237898633
Dropout Rate:  0.2 Average Loss:  0.007454585621599108
Dropout Rate:  0.3 Average Loss:  0.01417305856011808
Dropout Rate:  0.4 Average Loss:  0.02278483025729656
Dropout Rate:  0.5 Average Loss:  0.06490198126994073
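
Picking the winner can also be done programmatically rather than by eye. The sketch below copies the five averages printed above into a dictionary (the names `avg_loss` and `best_rate` are introduced here for illustration) and selects the rate with the smallest value.

```python
# Average losses per dropout rate, copied from the output above.
avg_loss = {
    0.1: 0.00804237237898633,
    0.2: 0.007454585621599108,
    0.3: 0.01417305856011808,
    0.4: 0.02278483025729656,
    0.5: 0.06490198126994073,
}

# min() over the keys, ranked by their associated loss.
best_rate = min(avg_loss, key=avg_loss.get)
print(best_rate)  # 0.2
```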


Next we can move on to training our model with the optimal dropout rate. We will run it for more epochs to get better results.

In [None]:
nlp = spacy.blank('en')
nlp.meta['name'] = 'gene'

ner = nlp.create_pipe('ner')
ner.add_label('GENE')

nlp.add_pipe(ner)
optimizer = nlp.begin_training()

epochs = 20

for itn in range(epochs):
    losses = {}
    text = [item[0] for item in TRAIN_DATA] #get training text items
    annotations = [item[1] for item in TRAIN_DATA] #get training annotations
    
    nlp.update(text, annotations, sgd=optimizer, drop=0.2,losses=losses)

Extracting the DataFrame for the Gene Hunter project and adding a predicted entities column, we can see some examples of what the model detects.

In [96]:
df = PySysrev.getAnnotations(3144)
txt_list = []
for txt in list(df['text']):
    if txt is None:
        txt_list.append(None)
    else:
        doc = nlp(txt)
        txt_list.append(doc.ents)
df['entities'] = txt_list
df.head(5)

| | annotation | datasource | end | external_id | selection | semantic_class | start | sysrev_id | text | entities |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | α-KGDH | pubmed | 286.0 | 29211711 | α-KGDH | gene | 280.0 | 1524023 | Histone modifications, such as the frequently ... | ((succinyl, -, CoA), (succinyl, -, CoA), (succ... |
| 1 | KAT2A | pubmed | 391.0 | 29211711 | KAT2A | gene | 386.0 | 1524023 | Histone modifications, such as the frequently ... | ((succinyl, -, CoA), (succinyl, -, CoA), (succ... |
| 2 | GCN5 | pubmed | 411.0 | 29211711 | GCN5 | gene | 407.0 | 1524023 | Histone modifications, such as the frequently ... | ((succinyl, -, CoA), (succinyl, -, CoA), (succ... |
| 3 | succinyl-CoA | pubmed | 493.0 | 29211711 | succinyl-CoA | gene | 481.0 | 1524023 | Histone modifications, such as the frequently ... | ((succinyl, -, CoA), (succinyl, -, CoA), (succ... |
| 4 | KAT2A | pubmed | 509.0 | 29211711 | KAT2A | gene | 504.0 | 1524023 | Histone modifications, such as the frequently ... | ((succinyl, -, CoA), (succinyl, -, CoA), (succ... |


We would also like to evaluate our model's performance with some metrics. To do so, we will find the sensitivity and specificity of the model, where sensitivity refers to the proportion of annotated genes our model was able to correctly identify and specificity refers to the proportion of non-gene tokens our model correctly left unannotated.
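
As a small worked example of these two definitions, the sketch below computes both metrics from confusion-matrix counts. The function name `sensitivity_specificity` and the counts passed to it are made-up toy values for illustration only, not results from the Gene Hunter data.

```python
from __future__ import division  # true division on Python 2; no effect on Python 3

def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # share of real genes that were found
    specificity = tn / (tn + fp)  # share of non-genes that were left alone
    return sensitivity, specificity

# Toy counts: 10 real genes (8 found), 100 non-gene tokens (90 left alone).
sens, spec = sensitivity_specificity(tp=8, fn=2, tn=90, fp=10)
print(sens, spec)  # 0.8 0.9
```

In the cell that follows, `pred_genes / true_genes` plays the role of sensitivity and `pred_non_genes / true_non_genes` the role of specificity.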

In [None]:
true_genes = 0
pred_genes = 0
true_non_genes = 0
pred_non_genes = 0
nlp2 = spacy.load('en_core_web_sm')
for text_id in list(set(list(df['external_id']))):
    article_df = df.loc[df['external_id'] == text_id]
    true_annotations = list(article_df['annotation'])
    if list(article_df['entities'])[0] is not None:
        predict_annotations = [str(x) for x in list(list(article_df['entities'])[0])]
    else:
        continue
    article_text = list(article_df['text'])[0]
    pred_genes += len([value for value in predict_annotations if value in true_annotations])
    true_genes += len(true_annotations)
    doc2 = nlp2(article_text)
    for token in doc2:
        if str(token) not in true_annotations:
            true_non_genes += 1
            if str(token) not in predict_annotations:
                pred_non_genes += 1

In [95]:
from __future__ import division
print ("Sensitivity: ", pred_genes / true_genes)
print ("Specificity: ", pred_non_genes / true_non_genes)

Sensitivity:  0.170817906429
Specificity:  0.993726255113
