# Gene Recognition

In this series of posts on the Sysrev tool, we demonstrate how users can perform Named Entity Recognition (NER) with the annotated texts from the Gene Hunter project. This first part describes how to load and process the project texts for training with Python's spaCy library.

In this notebook we:

1. Download annotations from the sysrev.com Gene Hunter project (sysrev.com/p/3144)
2. Format the annotations to feed into spaCy (https://spacy.io/)

The Gene Hunter project was an open online review of 2,000 PubMed abstracts in which 15 reviewers highlighted genes in the text. Sysrev data is accessible through the Sysrev Python client, [PySysrev](https://github.com/sysrev/PySysrev).

## Text Processing
Sysrev provides an API call to download data in a shape spaCy can handle. Before using it, let's look at the raw annotation data in the Gene Hunter project:

In [2]:
# Basic view of the data we're working with
import PySysrev
df = PySysrev.getAnnotations(3144)
df.head(5)

| Unnamed: 0 | annotation | datasource | end | external_id | selection | semantic_class | start | sysrev_id | text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | α-KGDH | pubmed | 286.0 | 29211711 | α-KGDH | gene | 280.0 | 1524023 | Histone modifications, such as the frequently ... |
| 1 | KAT2A | pubmed | 391.0 | 29211711 | KAT2A | gene | 386.0 | 1524023 | Histone modifications, such as the frequently ... |
| 2 | GCN5 | pubmed | 411.0 | 29211711 | GCN5 | gene | 407.0 | 1524023 | Histone modifications, such as the frequently ... |
| 3 | succinyl-CoA | pubmed | 493.0 | 29211711 | succinyl-CoA | gene | 481.0 | 1524023 | Histone modifications, such as the frequently ... |
| 4 | KAT2A | pubmed | 509.0 | 29211711 | KAT2A | gene | 504.0 | 1524023 | Histone modifications, such as the frequently ... |


In the DataFrame above, the "selection" column holds each gene name that a reviewer highlighted, and the "text" column holds the abstract it came from. The "start" and "end" indices indicate where in the text the gene name can be found; for example, α-KGDH spans characters 280 to 286 of its abstract. Next, we call the processAnnotations function to fetch the Gene Hunter project data and format it for spaCy. The project id is 3144, the entity label is 'GENE', and the output JSON is saved as the file "processed_output.json".

In [3]:
project_id = 3144
label = 'GENE'
output_path = 'processed_output.json'
PySysrev.processAnnotations(project_id, label, output_path)
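
To make the target shape concrete, here is a minimal sketch of the kind of conversion such a call might perform. It is not PySysrev's actual implementation: the helper `to_spacy_format`, the toy rows, and the `"entities"` key (borrowed from spaCy's usual training format) are all assumptions for illustration. The rows mirror the DataFrame columns shown earlier; with a real DataFrame, `df.to_dict("records")` would produce dictionaries of this kind.

```python
import json
from collections import defaultdict

# Toy rows mirroring the DataFrame columns (values are made up for illustration).
text = "KAT2A binds GCN5 in vitro."
rows = [
    {"text": text, "selection": "GCN5", "start": 12.0, "end": 16.0},
    {"text": text, "selection": "KAT2A", "start": 0.0, "end": 5.0},
]

def to_spacy_format(rows, label):
    """Hypothetical helper: group spans by text into [text, {"entities": [...]}]."""
    spans_by_text = defaultdict(list)
    for row in rows:
        # The DataFrame stores indices as floats; string slicing needs ints.
        span = [int(row["start"]), int(row["end"]), label]
        spans_by_text[row["text"]].append(span)
    # One record per distinct text, with spans ordered by start index.
    return [[doc_text, {"entities": sorted(spans)}]
            for doc_text, spans in spans_by_text.items()]

records = to_spacy_format(rows, "GENE")
print(json.dumps(records, indent=2))
```

Grouping by text matters because the raw annotations carry one row per highlighted gene, while a training record needs one entry per abstract with all of its spans together.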

Let's take a look at the processed JSON. Read into Python, the file becomes a list of lists. Each inner list holds the text as its first element (a string) and the named entities of that text as its second element (a dictionary). The dictionary has a single key-value pair whose value is another list of lists, each giving the start and end indices of one gene term in the text.

In [4]:
import json
with open('processed_output.json') as f:
    data = json.load(f)
    
data[0]

[u"BACKGROUND: Olaparib is an oral poly(adenosine diphosphate-ribose) polymerase inhibitor that has promising antitumor activity in patients with metastatic breast cancer and a germline BRCA mutation.\n\nMETHODS: We conducted a randomized, open-label, phase 3 trial in which olaparib monotherapy was compared with standard therapy in patients with a germline BRCA mutation and human epidermal growth factor receptor type 2 (HER2)-negative metastatic breast cancer who had received no more than two previous chemotherapy regimens for metastatic disease. Patients were randomly assigned, in a 2:1 ratio, to receive olaparib tablets (300 mg twice daily) or standard therapy with single-agent chemotherapy of the physician's choice (capecitabine, eribulin, or vinorelbine in 21-day cycles). The primary end point was progression-free survival, which was assessed by blinded independent central review and was analyzed on an intention-to-treat basis.\n\nRESULTS: Of the 302 patients who underwent randomiz
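
One way to sanity-check a record of this shape is to slice the text with each pair of indices and confirm that gene names come back. The sketch below uses a made-up record rather than the real file; the `"entities"` key name and the trailing label in each span are assumptions, and the hypothetical helper `extract_mentions` reads only the first two elements of each span.

```python
# Toy record in the shape described above (text and offsets are made up).
record = [
    "Mutations in BRCA1 and HER2 were assessed.",
    {"entities": [[13, 18, "GENE"], [23, 27, "GENE"]]},
]

def extract_mentions(record):
    """Hypothetical helper: return the substring covered by each annotated span."""
    text, annotations = record
    mentions = []
    # The dictionary holds a single key-value pair, so iterate over its values
    # instead of hard-coding the key name.
    for spans in annotations.values():
        for span in spans:
            start, end = span[0], span[1]
            mentions.append(text[start:end])
    return mentions

print(extract_mentions(record))  # ['BRCA1', 'HER2']
```

With the real file, the same function could be applied to each element of `data` to spot spans that are off by a character or that cover surrounding punctuation.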

Now that we have our "processed_output.json" file, we are ready to input it into spaCy for training. This step will be detailed in the next post.