# Gene NER using PySysrev and Human Review (Part IV)
<span style="color:gray">James Borden, Nole Lin</span>

In this series on the Sysrev tool, we build a Named Entity Recognition (NER) model for genes. We use data from 2000 abstracts reviewed in the Sysrev [Gene Hunter project](https://sysrev.com/p/3144). This fourth part of the series shows how to apply the model to PubMed articles.

In this notebook we:

1. **Apply Model** on PubMed text to extract gene names

We start by loading the model we trained in the previous notebook.

In [None]:
import spacy

nlp = spacy.load('/path/to/gene_model')

We read PubMed articles from our Cassandra database and store them in a pandas DataFrame. The first five rows are shown below.

In [2]:
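from __future__ import unicode_literals  # Python 2: treat string literals as unicode
from cassandra.cluster import Cluster
import pandas as pd

# Cluster() with no arguments connects to a node on localhost.
cluster = Cluster()
session = cluster.connect('biosource')  # 'biosource' is the keyspace

# Load the whole pubmed table into a DataFrame.
df = pd.DataFrame(list(session.execute('SELECT * FROM pubmed')))
df.head(5)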

Unnamed: 0,pmid,abstractText,author_json,chemicals,date,geneSymbol,grants,journal_json,keywords,mesh,mesh_id,pubType,title,xml
0,1535,The syntheses of trans- and cis-1-benzyl-3-dim...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Ahmed"",""...","[Dimethylamines, Histamine H1 Antagonists, Pip...",,,,"{""MedlineJournalInfo"":{""Country"":""United State...",,"[Acetylcholine, Animals, Barium, Dimethylamine...",,"[Journal Article, Research Support, U.S. Gov't...",Conformationally restricted analogs of histami...,"<PubmedArticle><MedlineCitation Owner=""NLM"" St..."
1,82694,Mathematical smoothing of data for the Framing...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Anderson...",,,,,"{""MedlineJournalInfo"":{""Country"":""England"",""Me...",,"[Blood Pressure, Blood Pressure Determination,...",,[Journal Article],Re-examination of some of the Framingham blood...,"<PubmedArticle><MedlineCitation Owner=""NLM"" St..."
2,57379,,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Visner"",...","[Phenytoin, Phenobarbital]",,,,"{""MedlineJournalInfo"":{""Country"":""England"",""Me...",,"[Abnormalities, Drug-Induced, Epilepsy, Female...",,[Journal Article],Letter: Anticonvulsants and fetal malformations.,"<PubmedArticle><MedlineCitation Owner=""NLM"" St..."
3,12775,A new technique is described for the measureme...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Wakeham""...","[Lipoproteins, Serum Albumin, Bovine]",,,,"{""MedlineJournalInfo"":{""Country"":""Ireland"",""Me...",,"[Animals, Blood, Cattle, Diffusion, Hydrogen-I...",,[Journal Article],Diffusion coefficients for protein molecules i...,"<PubmedArticle><MedlineCitation Owner=""NLM"" St..."
4,76367,The sera of Heterakis-infected birds influence...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Stomenov...",[gamma-Globulins],,,,"{""MedlineJournalInfo"":{""Country"":""Bulgaria"",""M...",,"[Age Factors, Animals, Chickens, Immunization,...",,"[English Abstract, Journal Article]",[Passive immunization in heterakidosis].,"<PubmedArticle><MedlineCitation Owner=""NLM"" St..."


Now we apply our trained model to the abstract text to find potential gene names in each article. The five sample rows shown here have no gene names in the `NER_Genes` column (each entry is an empty tuple), but rows in the full DataFrame do.

In [4]:
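# Run the model on each abstract and keep the recognized entity spans.
# Missing abstracts are skipped rather than passed to the model as the string 'None'.
ner_prediction = [nlp(unicode(x)).ents if pd.notnull(x) else () for x in df['abstractText']]
df['NER_Genes'] = ner_prediction
df.head(5)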

Unnamed: 0,pmid,abstractText,author_json,chemicals,date,geneSymbol,grants,journal_json,keywords,mesh,mesh_id,pubType,title,xml,NER_Genes
0,1535,The syntheses of trans- and cis-1-benzyl-3-dim...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Ahmed"",""...","[Dimethylamines, Histamine H1 Antagonists, Pip...",,,,"{""MedlineJournalInfo"":{""Country"":""United State...",,"[Acetylcholine, Animals, Barium, Dimethylamine...",,"[Journal Article, Research Support, U.S. Gov't...",Conformationally restricted analogs of histami...,"<PubmedArticle><MedlineCitation Owner=""NLM"" St...",()
1,82694,Mathematical smoothing of data for the Framing...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Anderson...",,,,,"{""MedlineJournalInfo"":{""Country"":""England"",""Me...",,"[Blood Pressure, Blood Pressure Determination,...",,[Journal Article],Re-examination of some of the Framingham blood...,"<PubmedArticle><MedlineCitation Owner=""NLM"" St...",()
2,57379,,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Visner"",...","[Phenytoin, Phenobarbital]",,,,"{""MedlineJournalInfo"":{""Country"":""England"",""Me...",,"[Abnormalities, Drug-Induced, Epilepsy, Female...",,[Journal Article],Letter: Anticonvulsants and fetal malformations.,"<PubmedArticle><MedlineCitation Owner=""NLM"" St...",()
3,12775,A new technique is described for the measureme...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Wakeham""...","[Lipoproteins, Serum Albumin, Bovine]",,,,"{""MedlineJournalInfo"":{""Country"":""Ireland"",""Me...",,"[Animals, Blood, Cattle, Diffusion, Hydrogen-I...",,[Journal Article],Diffusion coefficients for protein molecules i...,"<PubmedArticle><MedlineCitation Owner=""NLM"" St...",()
4,76367,The sera of Heterakis-infected birds influence...,"[{""Author"":{""ValidYN"":""Y"",""LastName"":""Stomenov...",[gamma-Globulins],,,,"{""MedlineJournalInfo"":{""Country"":""Bulgaria"",""M...",,"[Age Factors, Animals, Chickens, Immunization,...",,"[English Abstract, Journal Article]",[Passive immunization in heterakidosis].,"<PubmedArticle><MedlineCitation Owner=""NLM"" St...",()
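
The cell above stores spaCy `Span` objects in `NER_Genes`. If plain strings are more convenient (for example, to write the DataFrame to CSV), the step could be sketched roughly as follows. This is a minimal sketch written for Python 3, not the notebook's own code: `extract_gene_names`, `ToyEnt`, `ToyDoc`, and `toy_model` are hypothetical names, and `toy_model` only stands in for the trained `nlp` pipeline so the example runs without the model.

```python
from collections import namedtuple


def extract_gene_names(texts, model):
    """Return one list of entity strings per input text.

    `model` is assumed to be a callable that returns an object with an
    `.ents` iterable whose items expose `.text`, as a spaCy pipeline does.
    Anything that is not a non-empty string (None, NaN, '') yields an
    empty list instead of being passed to the model.
    """
    results = []
    for text in texts:
        if not isinstance(text, str) or not text.strip():
            results.append([])
            continue
        doc = model(text)
        results.append([ent.text for ent in doc.ents])
    return results


# Hypothetical stand-in for the trained pipeline: tags a fixed vocabulary.
ToyEnt = namedtuple("ToyEnt", "text")
ToyDoc = namedtuple("ToyDoc", "ents")


def toy_model(text):
    tokens = [tok.strip(".,;") for tok in text.split()]
    return ToyDoc(ents=[ToyEnt(tok) for tok in tokens if tok in {"BRCA1", "TP53"}])


sample = ["Mutations in BRCA1 and TP53.", None, ""]
print(extract_gene_names(sample, toy_model))
# → [['BRCA1', 'TP53'], [], []]
```

With the real pipeline, the call would presumably look like `extract_gene_names(df['abstractText'], nlp)`.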


With the `NER_Genes` column in place, we can find the genes mentioned most often in the PubMed text. The code below plots a bar chart of the 20 most frequently identified genes in our database of PubMed articles.

In [51]:
from itertools import chain
from collections import Counter
import plotly.plotly as py  # moved to chart_studio.plotly in plotly 4+
import plotly.graph_objs as go

# Flatten the per-article entity tuples into one list of gene-name strings.
ner_genes = [str(y) for y in chain.from_iterable(ner_prediction)]
c = Counter(ner_genes)

# Compute the top 20 once and split it into x (gene) and y (count) values.
top20 = c.most_common(20)
data = [go.Bar(
    x=[gene for gene, count in top20],
    y=[count for gene, count in top20]
)]

py.iplot(data, filename='top-genes')
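
The counts above are per mention, so a gene named five times in one abstract contributes five to its total. A minimal sketch of the same counting step, with an optional switch to count each gene at most once per article, might look like this. It assumes Python 3 and that the predictions have already been converted to plain strings; `top_genes` and its `per_article` flag are hypothetical names introduced for illustration.

```python
from collections import Counter
from itertools import chain


def top_genes(predictions, n=20, per_article=False):
    """Return the n most common gene names as (gene, count) pairs.

    `predictions` holds one iterable of gene-name strings per article.
    With per_article=True, each gene is counted at most once per article.
    """
    if per_article:
        predictions = (set(genes) for genes in predictions)
    counts = Counter(chain.from_iterable(predictions))
    return counts.most_common(n)


predictions = [
    ["TP53", "TP53", "BRCA1"],
    ["TP53"],
    [],
    ["BRCA1", "EGFR"],
    ["TP53"],
]
print(top_genes(predictions, n=3))
# → [('TP53', 4), ('BRCA1', 2), ('EGFR', 1)]
print(top_genes(predictions, n=3, per_article=True))
# → [('TP53', 3), ('BRCA1', 2), ('EGFR', 1)]
```

Which of the two counts is more useful depends on the question: mention counts favor long abstracts that repeat a gene name, while per-article counts measure how many articles discuss it.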