# Important Genes - Gene Hunter Part IV
<span style="color:gray">James Borden, Nole Lin</span>

What if you could instantly find the important genes for any medical query?  Do you know what genes matter for prostate cancer?  Cystic fibrosis?  What about longevity?  In this post we show you how to use the gene hunter NER model and `Bio.Entrez` to answer these questions.

Counting how often publications reference each gene in the context of a disease or topic is one way to answer this question.  We'll put our gene hunter model to use for this.  You can learn more about the Gene NER model from the original sysrev [sysrev.com/p/3144](https://sysrev.com/p/3144) or from our other blog posts [gene hunter 1-3](https://blog.sysrev.com/).

In [3]:
import PySysrev, spacy, urllib2, pickle, IPython.display, urllib

nlp = PySysrev.getModel('gene_ner')

display(IPython.display
        .HTML("<div style='background-color: lightblue; padding:10px; text-align:center;'>{}</div>"
        .format(spacy.displacy.render(nlp(unicode("My favorite genes are p53 and MDM2. CO2 is not a gene.")), style='ent'))))

The above shows how to load the gene hunter `gene_ner` model.  Soon sysrev will release a **model store** where you can easily get all the models we build.  The gene_ner model isn't perfect, but it's smart enough to ignore some obvious non-gene acronyms like CO2 in the above text.  Next we'll find some longevity publications.

# Search for Longevity Publications
NCBI publishes some extremely useful tools for querying their enormous PubMed literature database.  Below we use the Python `Bio.Entrez` client to grab the identifiers of all 40916 publications matching the query "longevity".

In [4]:
import Bio.Entrez
Bio.Entrez.email = "info@insilica.co"
entrez_search    = Bio.Entrez.esearch(db="pubmed",retmax=100000,term="longevity",idtype="pmid")
pmids            = Bio.Entrez.read(entrez_search)["IdList"]
print("example pmid: {} number of pmids: {}".format(pmids[1],len(pmids)))

example pmid: 30358721 number of pmids: 40916


Accessing medical abstracts from pubmed identifiers is pretty easy.  We could use `Bio.Entrez` for this as well, but PySysrev actually supports a clone of pubmed with slightly faster load times.  Next we'll get some abstracts and start annotating them with the `gene_ner` model.
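
For readers without PySysrev access, the `Bio.Entrez` route mentioned above might look roughly like the sketch below. `Bio.Entrez.efetch` is the real Biopython call; the helper names `batched` and `fetch_abstract_text`, the batch size of 200, and the placeholder ids in the demo are assumptions made here for illustration, not part of the original notebook.

```python
def batched(ids, size):
    """Split a list of ids into consecutive chunks of at most `size` items."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]

def fetch_abstract_text(pmids, email, batch_size=200):
    """Yield one plain-text blob of abstracts per batch of pmids (needs network access)."""
    import Bio.Entrez  # imported here so the batching helper works without Biopython
    Bio.Entrez.email = email
    for batch in batched(pmids, batch_size):
        # efetch accepts a comma-separated id list; batch_size is an assumed, tunable value
        handle = Bio.Entrez.efetch(db="pubmed", id=",".join(batch),
                                   rettype="abstract", retmode="text")
        try:
            yield handle.read()
        finally:
            handle.close()

# Batching demo only; apart from the first id, these are placeholders, not real pmids.
example_batches = batched(["30358721", "id_b", "id_c", "id_d", "id_e"], 2)
print(example_batches)
```

Only the batching helper runs here; calling `fetch_abstract_text(pmids, "you@example.org")` would issue one request per batch, and the returned text would still need to be split per abstract before tagging.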

# Tag Longevity Publications
PySysrev will soon release a **data store** where users can access supported datasets.  The call below, `PySysrev.getEntity(resource,id)`, is a sneak peek of this.  `getEntity` will let subscribed users have fast access to entities from pubmed, arxiv, sysrev reviews, and other datasets.

We use `PySysrev.getEntity` together with `nlp(unicode_value).ents` to find all the genes annotated in each medical abstract.

In [11]:
import pandas as pd, sys, collections

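def getGenes(pmid, i):
    """Return the gene entities found in the abstract for one pmid."""
    if i % 1000 == 0:
        sys.stdout.write(".")  # progress marker every 1000 abstracts
    abstract = PySysrev.getEntity('pubmed', pmid)
    if "abstractText" in abstract:
        return nlp(unicode(abstract["abstractText"][0])).ents
    else:
        return ()  # no abstract text available for this pmid

# tag every abstract, then count upper-cased gene mentions across all of them
ents        = [getGenes(pmid, i) for (i, pmid) in enumerate(pmids)]
geneCounter = collections.Counter([str(ent).upper() for ent in list(sum(ents, ()))])
print("\n{}".format(geneCounter.most_common(20)))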


[('SIRT1', 1504), ('IGF-1', 991), ('P53', 564), ('DAF-16', 547), ('APOE', 476), ('FOXO', 293), ('FOXO3', 246), ('IGF-I', 218), ('SOD1', 213), ('FOXO3A', 204), ('CO2', 175), ('SIR2', 158), ('IL-6', 157), ('DAF-2', 153), ('BCL-2', 143), ('SIRT3', 122), ('FOXO1', 114), ('SIRT6', 112), ('HSP70', 111), ('AMPK', 99)]
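
The output above lists IGF-1 and IGF-I, and FOXO3 and FOXO3A, as separate entries even though each pair names the same gene. Below is a minimal sketch of the counting step that folds such spellings together. The `ALIASES` table and the `count_genes` helper are hypothetical names introduced here for illustration; neither is part of PySysrev or the `gene_ner` model. It runs on plain strings, so you can try it without loading the model.

```python
import collections
import itertools

# Hypothetical alias table for illustration only; not shipped with gene_ner or PySysrev.
ALIASES = {"IGF-I": "IGF-1", "FOXO3A": "FOXO3"}

def count_genes(ents_per_abstract, aliases=None):
    """Count upper-cased gene mentions across abstracts, folding aliases together."""
    aliases = aliases or {}
    # chain.from_iterable flattens the per-abstract tuples in a single pass
    mentions = itertools.chain.from_iterable(ents_per_abstract)
    names = (str(m).upper() for m in mentions)
    return collections.Counter(aliases.get(n, n) for n in names)

# Toy stand-in for the tagged abstracts: one tuple of mentions per abstract.
toy_ents = [("Sirt1", "IGF-1"), (), ("IGF-I", "FOXO3a", "SIRT1"), ("FOXO3",)]
toy_counts = count_genes(toy_ents, ALIASES)
print(toy_counts.most_common(3))
```

With the real data you would pass the list of `.ents` tuples instead of `toy_ents`. A curated alias source would be a better choice than a hand-written table, and the table would need to grow well beyond these two entries.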


Finally we can plot the results.  Below you can see that 'SIRT1' is a frequently referenced longevity gene.  It's interesting to note the long tail associated with these genes as well.  Perhaps there are no obvious 'driver' genes for longevity.

In [13]:
import plotly
plotly.offline.init_notebook_mode(connected=True) # required for plotly graphs w/out accounts

layout = plotly.graph_objs.Layout(
    title="Top 20 Genes in Longevity Literature",
    xaxis=dict(title='Gene Hunter Detected Gene'),
    yaxis=dict(title="Count")
)

bargraph = plotly.graph_objs.Bar(
            x=[gene for gene, count in geneCounter.most_common(20)],
            y=[count for gene, count in geneCounter.most_common(20)]
        )

fig = plotly.graph_objs.Figure(data=[bargraph], layout=layout)

plotly.offline.iplot(fig, config={'showLink': False})
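
One rough way to put a number on the long tail noted above is to ask what fraction of all gene mentions the top few genes account for. The sketch below is an assumption about how you might measure that, not something the original analysis did; `cumulative_share` is a hypothetical helper, and the demo uses only the five largest counts printed earlier, so its result overstates the share you would get from the full `geneCounter`.

```python
import collections

def cumulative_share(counter, k):
    """Fraction of all counted mentions covered by the k most common keys."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    top = sum(count for _, count in counter.most_common(k))
    return float(top) / total

# The five most common genes from the printed output; the full geneCounter has many more.
top_five = collections.Counter({"SIRT1": 1504, "IGF-1": 991, "P53": 564,
                                "DAF-16": 547, "APOE": 476})
print(round(cumulative_share(top_five, 1), 3))
```

Running `cumulative_share(geneCounter, 20)` on the real counter would show how much of the literature's gene mentions fall outside the plotted top 20; a low value would support the idea that no single gene dominates.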