# 2018-11-12 HIV and significant genes
Here I want to explore the PubMed database and look for literature linking HIV to the genes identified as significantly associated with HIV in the previous steps of the analysis. To do this, I'll use the Biopython package, whose `Bio.Entrez` module provides functions for querying the PubMed database.

In [None]:
import pandas as pd
from Bio import Entrez
Entrez.email = "ruggero.cortini@crg.eu"
import os, time
import matplotlib.pyplot as plt
import numpy as np

First, let's load the list of interesting genes in a convenient data structure.

In [None]:
# directory with the data
schiv_rootdir = os.path.join(os.path.expanduser("~"), "work", "CRG", "projects", "sc_hiv")
matrices_dir = os.path.join(schiv_rootdir, "data", "matrices")

# information on the data we want to load
sample_name = "P2449"
module_colors = ["darkgreen", "darkturquoise"]

In [None]:
# load one table per module color into a list of DataFrames
tables = []
for module_color in module_colors:
    csv_fname = "%s/%s-%s.csv" % (matrices_dir, sample_name, module_color)
    table = pd.read_csv(csv_fname, index_col=0)
    table["color"] = module_color  # remember which module each gene came from
    tables.append(table)

# Concatenate the tables for more convenient access
alltables = pd.concat(tables)

In [None]:
# get all the names of the interesting genes into one convenient list
# (missing symbols are read as NaN floats, so keep only the strings)
symbols = [s for s in alltables['hgnc_symbol'] if isinstance(s, str)]
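
As an aside, the same filtering can be expressed with pandas' own missing-value handling. The sketch below uses a small made-up table (the `toy` name and its gene symbols are illustrative, not taken from the data) and also drops duplicate symbols, on the assumption that each gene should be queried only once.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `alltables`; values are made up for illustration
toy = pd.DataFrame({
    "hgnc_symbol": ["CD4", np.nan, "CCR5", "CD4"],
    "color": ["darkgreen", "darkgreen", "darkturquoise", "darkturquoise"],
})

# dropna() removes missing symbols; unique() keeps first occurrences in order
toy_symbols = toy["hgnc_symbol"].dropna().unique().tolist()
print(toy_symbols)
```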

Now that all the genes are loaded, we can query PubMed for each gene symbol together with "HIV" and record how many articles match.

In [None]:
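# For each gene, ask PubMed how many articles match "HIV AND <symbol>"
associations = {}
for symbol in symbols:
    handle = Entrez.esearch(db='pubmed', term='HIV AND %s' % symbol, retmax=200)
    record = Entrez.read(handle)
    handle.close()
    n_found = int(record['Count'])  # total number of matching articles
    if n_found > 0:
        print("%s: found %d associations" % (symbol, n_found))
        associations[symbol] = record
    time.sleep(2)  # pause between requests to avoid overloading the NCBI servers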
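
Short gene symbols can also match unrelated words anywhere in an article's record, so one possible refinement is to restrict both terms to the title and abstract with PubMed's `[Title/Abstract]` field tag. The helper below, `build_term`, is a hypothetical name and not part of Biopython; it only builds the string that would be passed as `term` to `Entrez.esearch`.

```python
def build_term(symbol, topic="HIV"):
    """Build a PubMed query restricted to titles and abstracts.

    Hypothetical helper for illustration; the double quotes ask PubMed
    to treat the symbol as a single phrase.
    """
    return '%s[Title/Abstract] AND "%s"[Title/Abstract]' % (topic, symbol)

print(build_term("CCR5"))
```

A stricter query like this would be expected to return fewer hits than the plain `HIV AND <symbol>` search, so counts from the two are not directly comparable.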

This gave us a dictionary that maps each gene symbol with at least one hit to its PubMed search record. Let's look at how the number of associations relates to the p-values of the gene-trait significance.

In [None]:
# pull together the information
# (.copy() avoids modifying a view of alltables)
associations_table = alltables.loc[alltables['hgnc_symbol'].isin(list(associations))].copy()
associations_table = associations_table.set_index('hgnc_symbol')
associations_table['NAssociations'] = 0
for s in associations:
    # Entrez returns the count as a string, so convert it to int
    associations_table.loc[s, "NAssociations"] = int(associations[s]["Count"])
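
The same table can also be built without an explicit loop. The following is a minimal sketch on made-up data (the `toy_` names, symbols, p-values and counts are all illustrative): the counts go into a named `Series`, which is then joined to the gene table on the symbol.

```python
import pandas as pd

# Toy stand-ins; symbols, p-values and counts are made up for illustration
toy_table = pd.DataFrame({
    "hgnc_symbol": ["CD4", "CCR5", "GAPDH"],
    "GSP": [0.001, 0.02, 0.5],
})
toy_associations = {"CD4": {"Count": "120"}, "CCR5": {"Count": "45"}}

# The count arrives as a string, so convert it to int up front
counts = pd.Series(
    {s: int(rec["Count"]) for s, rec in toy_associations.items()},
    name="NAssociations",
)

# An inner join keeps only the genes that have a PubMed count
toy_result = toy_table.set_index("hgnc_symbol").join(counts, how="inner")
print(toy_result)
```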

In [None]:
associations_table

In [None]:
plt.scatter(associations_table['NAssociations'].astype(int),
            -np.log10(associations_table['GSP'].astype(float)))
plt.xlabel("Number of associations")
plt.ylabel("Gene Trait Significance [-log10]")
plt.show()
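
To put a number on what the scatter plot shows, one option is a rank correlation between the two plotted quantities. The sketch below uses made-up values (not results from this analysis) and computes Spearman's rho as the Pearson correlation of the ranks, which needs only pandas and NumPy.

```python
import numpy as np
import pandas as pd

# Toy values, made up for illustration
toy = pd.DataFrame({
    "NAssociations": [3, 10, 45, 120],
    "GSP": [0.2, 0.05, 0.01, 0.001],
})
neg_log_p = -np.log10(toy["GSP"])

# Spearman's rho is the Pearson correlation of the ranks
rho = toy["NAssociations"].rank().corr(neg_log_p.rank())
print(rho)
```

In this toy table the two columns are perfectly monotonically related, so rho comes out as 1 up to floating-point error; real data would give something in between -1 and 1.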