# Gene Recognition

In this notebook we:

1. Download annotations from the sysrev.com gene hunter project (sysrev.com/p/3144)
2. Use [spaCy](https://spacy.io/) to build a model that automatically recognizes genes in text.

The gene hunter project was an open online review of 2000 PubMed abstracts in which 15 reviewers highlighted genes in the text. Sysrev data is accessible using the Sysrev Python client, [PySysrev](https://github.com/sysrev/PySysrev).

## Text Processing
Sysrev provides an API call to download the project annotations, which PySysrev then converts into a shape spaCy can handle:

In [2]:
# Retrieve project annotations
! curl -X GET -d project-id=3144 -G https://sysrev.com/web-api/project-annotations > sysrev_output.json

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.3M  100 10.3M    0     0  3589k      0  0:00:02  0:00:02 --:--:-- 3588k
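
The same request can also be made from Python rather than through curl. The following is a minimal sketch using only the Python 3 standard library. The helper names `build_annotations_url` and `download_annotations` are illustrative and are not part of PySysrev, and the sketch assumes, as the curl call does, that the endpoint takes `project-id` as a query parameter.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Endpoint taken from the curl command in this notebook
API_URL = "https://sysrev.com/web-api/project-annotations"


def build_annotations_url(project_id):
    """Return the GET URL that `curl -G -d project-id=...` would request."""
    return API_URL + "?" + urlencode({"project-id": project_id})


def download_annotations(project_id, out_path):
    """Hypothetical helper: save the raw annotation JSON to out_path."""
    urlretrieve(build_annotations_url(project_id), out_path)
    return out_path
```

Calling `download_annotations(3144, 'sysrev_output.json')` should then produce the same file as the curl cell, provided the endpoint is reachable.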


In [2]:
# Call Python client to format annotations for Spacy, and save processed output
# Our label of interest is genes
import PySysrev
input_path = 'sysrev_output.json'
label = 'GENE'
output_path = 'processed_output.json'
PySysrev.processAnnotations(input_path, label, output_path)
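
The contents of `processed_output.json` are not shown in this notebook. For orientation, spaCy 2.x NER training examples are conventionally written as `(text, {"entities": [(start, end, label)]})` tuples with character offsets. The sketch below builds one such example by hand. `to_spacy_example` is a hypothetical helper, and whether PySysrev writes exactly this layout is an assumption.

```python
def to_spacy_example(text, mentions, label="GENE"):
    """Build a spaCy 2.x style training tuple from a text and its gene mentions.

    Assumes `mentions` are listed in their order of appearance in `text`.
    """
    entities = []
    cursor = 0
    for mention in mentions:
        start = text.find(mention, cursor)
        if start == -1:
            continue  # mention not found after the cursor; skip it
        end = start + len(mention)
        entities.append((start, end, label))
        cursor = end
    return (text, {"entities": entities})


text = "Depletion of Nup98 or Wdr82 abolishes Set1A recruitment to chromatin."
example = to_spacy_example(text, ["Nup98", "Wdr82"])
print(example[1]["entities"])
```

The Sysrev annotations already carry `start` and `end` offsets, so a real conversion would likely use those directly instead of searching for each mention.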

In [None]:
# Load the processed (spaCy-formatted) annotations to inspect the training data
import json
with open('processed_output.json') as f:
    TRAIN_DATA = json.load(f)

TRAIN_DATA

## Training the Model


In [3]:
processed_path = 'processed_output.json'
model_path = 'sysrev_gene'
PySysrev.trainAnnotations(processed_path, model_path)

current loss: 8377.95615092 | min_loss: 5000.0 | Steps since last min: 1
current loss: 5071.10204883 | min_loss: 5000.0 | Steps since last min: 2
current loss: 4337.21254523 | min_loss: 4337.21254523 | Steps since last min: 0
current loss: 3815.78632474 | min_loss: 3815.78632474 | Steps since last min: 0
current loss: 3485.6487403 | min_loss: 3485.6487403 | Steps since last min: 0
current loss: 3168.37126738 | min_loss: 3168.37126738 | Steps since last min: 0
current loss: 2970.61318064 | min_loss: 2970.61318064 | Steps since last min: 0
current loss: 2682.66895263 | min_loss: 2682.66895263 | Steps since last min: 0
current loss: 2670.88276027 | min_loss: 2670.88276027 | Steps since last min: 0
current loss: 2393.32225931 | min_loss: 2393.32225931 | Steps since last min: 0
current loss: 2173.69761541 | min_loss: 2173.69761541 | Steps since last min: 0
current loss: 2137.0622847 | min_loss: 2137.0622847 | Steps since last min: 0
current loss: 1945.48313314 | min_loss: 1945.48313314 | St
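
The log lines above track a running minimum of the loss and the number of steps since that minimum last improved, which is the bookkeeping typically used for early stopping. The sketch below reproduces that bookkeeping in plain Python. It is not PySysrev's implementation: the function name and the `patience` cutoff are hypothetical, and the starting minimum of 5000.0 is inferred from the first two log lines.

```python
def track_losses(losses, initial_min=5000.0, patience=None):
    """Return (loss, min_loss, steps_since_min) for each training step.

    Stops early once `patience` steps pass without a new minimum
    (patience=None disables early stopping).
    """
    min_loss = initial_min
    steps_since_min = 0
    history = []
    for loss in losses:
        if loss < min_loss:
            min_loss = loss
            steps_since_min = 0
        else:
            steps_since_min += 1
        history.append((loss, min_loss, steps_since_min))
        if patience is not None and steps_since_min >= patience:
            break
    return history


# First four loss values copied from the training log
log_losses = [8377.95615092, 5071.10204883, 4337.21254523, 3815.78632474]
for loss, min_loss, steps in track_losses(log_losses):
    print("current loss: {} | min_loss: {} | Steps since last min: {}".format(
        loss, min_loss, steps))
```

Keeping the bookkeeping separate from the model update makes it easy to check the stopping rule against a recorded log like the one above.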

In [8]:
# Test our trained model on a test text
from __future__ import unicode_literals
import spacy

test_text = "Depletion of Nup98 or Wdr82 abolishes Set1A recruitment to chromatin and subsequently ablates H3K4me3 at adjacent promoters."

nlp2 = spacy.load('sysrev_gene')
doc2 = nlp2(test_text)
for ent in doc2.ents:
    print(ent.label_, ent.text)

(u'GENE', u'Nup98')
(u'GENE', u'Wdr82')


In [9]:
# Use Python client to get data for the project
df = PySysrev.getAnnotations(3144)

In [10]:
# Our initial text
df.head(5)

| Unnamed: 0 | annotation | datasource | end | external_id | selection | semantic_class | start | sysrev_id | text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | α-KGDH | pubmed | 286.0 | 29211711 | α-KGDH | gene | 280.0 | 1524023 | Histone modifications, such as the frequently ... |
| 1 | KAT2A | pubmed | 391.0 | 29211711 | KAT2A | gene | 386.0 | 1524023 | Histone modifications, such as the frequently ... |
| 2 | GCN5 | pubmed | 411.0 | 29211711 | GCN5 | gene | 407.0 | 1524023 | Histone modifications, such as the frequently ... |
| 3 | succinyl-CoA | pubmed | 493.0 | 29211711 | succinyl-CoA | gene | 481.0 | 1524023 | Histone modifications, such as the frequently ... |
| 4 | KAT2A | pubmed | 509.0 | 29211711 | KAT2A | gene | 504.0 | 1524023 | Histone modifications, such as the frequently ... |


In [11]:
# Run our texts through our trained model to get entities
nlp2 = spacy.load('sysrev_gene')
txt_list = []
for txt in list(df['text']):
    if txt is None:
        txt_list.append(None)
    else:
        doc2 = nlp2(txt)
        txt_list.append(doc2.ents)
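
Because each row of `df` is one annotation, the same abstract can repeat across rows (the first five rows shown above appear to share one abstract), so the loop runs the model on identical text several times. One possible refinement, sketched below, caches the result per distinct text. `annotate_unique` and its `extract` argument are hypothetical names; in this notebook `extract` would be something like `lambda t: nlp2(t).ents`. A stand-in extractor is used here so the sketch runs without the trained model.

```python
def annotate_unique(texts, extract):
    """Apply `extract` once per distinct text, preserving input order and None rows."""
    cache = {}
    results = []
    for txt in texts:
        if txt is None:
            results.append(None)
            continue
        if txt not in cache:
            cache[txt] = extract(txt)
        results.append(cache[txt])
    return results


calls = []


def fake_extract(txt):
    """Stand-in for the model: treats any word containing a digit as a gene."""
    calls.append(txt)
    return tuple(w for w in txt.split() if any(ch.isdigit() for ch in w))


texts = ["Nup98 binds Wdr82", "Nup98 binds Wdr82", None, "no genes here"]
entities = annotate_unique(texts, fake_extract)
print(entities)
print(len(calls))  # number of times the extractor actually ran
```

With the real model, `df['entities'] = annotate_unique(list(df['text']), lambda t: nlp2(t).ents)` would fill the same column as the loop above while calling the model only once per abstract.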

In [12]:
# Add entities to DataFrame
df['entities'] = txt_list

In [13]:
# Identified genes shown as new column
df.head(5)

| Unnamed: 0 | annotation | datasource | end | external_id | selection | semantic_class | start | sysrev_id | text | entities |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | α-KGDH | pubmed | 286.0 | 29211711 | α-KGDH | gene | 280.0 | 1524023 | Histone modifications, such as the frequently ... | ((α, -, KGDH), (KAT2A), (GCN5), (succinyl, -, ... |
| 1 | KAT2A | pubmed | 391.0 | 29211711 | KAT2A | gene | 386.0 | 1524023 | Histone modifications, such as the frequently ... | ((α, -, KGDH), (KAT2A), (GCN5), (succinyl, -, ... |
| 2 | GCN5 | pubmed | 411.0 | 29211711 | GCN5 | gene | 407.0 | 1524023 | Histone modifications, such as the frequently ... | ((α, -, KGDH), (KAT2A), (GCN5), (succinyl, -, ... |
| 3 | succinyl-CoA | pubmed | 493.0 | 29211711 | succinyl-CoA | gene | 481.0 | 1524023 | Histone modifications, such as the frequently ... | ((α, -, KGDH), (KAT2A), (GCN5), (succinyl, -, ... |
| 4 | KAT2A | pubmed | 509.0 | 29211711 | KAT2A | gene | 504.0 | 1524023 | Histone modifications, such as the frequently ... | ((α, -, KGDH), (KAT2A), (GCN5), (succinyl, -, ... |
