# Important Genes - Gene Hunter Part IV
<span style="color:gray">James Borden, Nole Lin</span>

What if you could instantly find the important genes for any medical query? Do you know which genes matter for prostate cancer? Cystic fibrosis? What about heart attacks?

To answer these questions, we need to identify genes in text. We'll use our gene hunter model for this (see [gene hunter 1-3](https://blog.sysrev.com/)).

In [3]:
from __future__ import unicode_literals, print_function
import PySysrev, spacy, urllib2, pickle, urllib
from IPython.display import HTML, display
from spacy import displacy

# load the model from S3
nlp = pickle.loads(urllib.urlopen("https://s3.amazonaws.com/sysrev-model/gene_model.pickle").read())

# run the model on a test sentence and render the detected entities;
# jupyter=False makes displacy return the markup instead of displaying it directly
doc  = nlp("My favorite genes are p53 and MDM2. CO2 is not a gene.")
html = displacy.render(doc, style='ent', jupyter=False)
display(HTML("<div style='background-color: lightblue; padding:10px; text-align:center;'>{}</div>".format(html)))


# Search for Prostate Cancer Publications

In [None]:
import Bio.Entrez
Bio.Entrez.email = "info@insilica.co"
entrez_search    = Bio.Entrez.esearch(db="pubmed",retmax=10,term="prostate cancer",idtype="pmid")
pmids            = Bio.Entrez.read(entrez_search)["IdList"]
print(len(pmids))
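
For readers without Biopython, the search above corresponds roughly to a single request against NCBI's E-utilities `esearch` endpoint. The sketch below only builds that URL and does not send it. It is written for Python 3, the helper name `build_esearch_url` is hypothetical, and Biopython adds parameters of its own, so treat this as an approximation of the request rather than an exact copy.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, retmax=10, email=None):
    """Build (but do not send) a PubMed esearch URL. Hypothetical helper."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if email:
        params["email"] = email  # NCBI asks callers to identify themselves
    return ESEARCH_URL + "?" + urlencode(params)

url = build_esearch_url("prostate cancer", retmax=10, email="info@insilica.co")
print(url)
```

Raising `retmax` is what controls how many PMIDs come back, which matters for the later steps that loop over the result list.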

We can access PubMed articles through our Cassandra database and load each one into a DataFrame. The function below looks up the abstract for a PMID and runs the gene hunter model on it.

In [None]:
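import sys
from cassandra.cluster import Cluster
import pandas as pd
cluster = Cluster()
session = cluster.connect('biosource')

def getPmidEnts(pmid, i):
    """Return the entities the model finds in the abstract for one PMID."""
    if i % 1000 == 0: sys.stdout.write(".")  # progress marker every 1000 articles
    query = 'SELECT pmid, "abstractText" FROM pubmed WHERE pmid = {}'.format(pmid)
    df = pd.DataFrame(list(session.execute(query)))
    if len(df) != 0:
        return nlp(unicode(df["abstractText"][0])).ents
    else:
        return ()  # no abstract stored for this PMID

# cap the run at 10,000 articles, starting from the first search result
ents = [getPmidEnts(pmid, i) for (i, pmid) in enumerate(pmids[:10000])]
flatEnts = list(sum(ents, ()))  # flatten the per-article tuples into one list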
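
The flattening step above relies on `sum` with an empty tuple as the start value. That works, but each addition builds a new tuple, so the cost grows quadratically with the number of articles. A minimal sketch of a linear-time alternative using `itertools.chain.from_iterable` follows. Plain strings stand in for spaCy entity spans so the example runs without the model, and the sample values are placeholders rather than real results.

```python
from itertools import chain

# Stand-in for the per-abstract entity tuples returned by getPmidEnts.
# Plain strings replace spaCy spans; the values are illustrative only.
ents = [("p53", "MDM2"), (), ("p53",)]

# The approach used in the post: repeated tuple concatenation.
flat_with_sum = list(sum(ents, ()))

# Alternative: walk each tuple once without building intermediate tuples.
flat_with_chain = list(chain.from_iterable(ents))

print(flat_with_chain)  # ['p53', 'MDM2', 'p53']
```

Both versions give the same list, so the choice only matters once the number of abstracts gets large.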

Applying our trained model to the text gives us the potential gene names in each article, which we can now count. Note that the sample DataFrame shown doesn't have any gene names in the NER_Genes column, but the expanded DataFrame will.

In [None]:
import collections
# count how often each detected entity text appears across all abstracts
c = collections.Counter([str(ent) for ent in flatEnts])
print(c.most_common(20))
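
To make the counting step concrete, here is a small sketch of how `collections.Counter` behaves on a list of entity strings. The mention list is made up for illustration and is not output from the model.

```python
import collections

# Made-up entity strings standing in for [str(ent) for ent in flatEnts].
mentions = ["p53", "MDM2", "p53", "AR", "p53", "MDM2"]

counts = collections.Counter(mentions)

# most_common(n) returns (item, count) pairs sorted by descending count.
top_two = counts.most_common(2)
print(top_two)  # [('p53', 3), ('MDM2', 2)]
```

Note that the counter treats strings as distinct whenever their text differs, so `p53` and `P53` would be tallied separately unless the names are normalized first.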

With the genes identified, we can find out which ones appear most often in the PubMed text. Below is a bar chart of the 20 most frequently mentioned genes in our database of PubMed articles.

In [None]:
import plotly.offline
import plotly.graph_objs as go
plotly.offline.init_notebook_mode(connected=True) # required for plotly graphs w/out accounts

layout = go.Layout(
    title="Top 20 Genes in Prostate Cancer Literature",
    xaxis=dict(title='Gene Hunter Detected Gene'),
    yaxis=dict(title="Count")
)

# gene names on the x axis, mention counts on the y axis
bargraph = go.Bar(
    x=[gene for gene, count in c.most_common(20)],
    y=[count for gene, count in c.most_common(20)]
)

fig = go.Figure(data=[bargraph], layout=layout)

plotly.offline.iplot(fig, config={'showLink': False})
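
The bar chart takes its x and y values from two list comprehensions that each call `c.most_common(20)`. One alternative, sketched here under the assumption that the pairs have already been computed once, is to split them into two axis lists with `zip`. The `top` list is a hand-written stand-in for the real `most_common` output.

```python
# Hand-written stand-in for c.most_common(20); values are illustrative only.
top = [("p53", 3), ("MDM2", 2)]

# zip(*pairs) transposes [(gene, count), ...] into (genes...), (counts...).
# An empty list cannot be unpacked this way, so guard that case.
if top:
    genes, counts = (list(column) for column in zip(*top))
else:
    genes, counts = [], []

print(genes)   # ['p53', 'MDM2']
print(counts)  # [3, 2]
```

The resulting `genes` and `counts` lists can be passed directly as the `x` and `y` arguments of the bar trace.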