# Project Crunchgrant

This project links large-scale cancer genomics data sets (https://www.cbioportal.org) with information from research grants (at this stage, NSF grant proposals). Cancer genomics data sets can be downloaded by cellular pathway and report, for each protein of interest in that pathway, the number of alterations identified in patients. The goal is to let scientists in industry or academia find other research groups working on a gene or protein of interest to them. This could foster new collaborations and make it easier to search for knowledge in a scientific world overwhelmed by the number of publications.

To illustrate this idea, this notebook uses the cell cycle pathway as an example, and in particular 34 genes identified in cBioPortal across 12,192 samples. For the grant side, I used the NSF grant information from 2018, downloaded as one XML file per grant.




In [4]:
import numpy as np 
import pandas as pd 
import matplotlib as mpl 
import matplotlib.pyplot as plt
from collections import OrderedDict
from bokeh.io import output_notebook
from bokeh.plotting import figure, show, output_file
from bokeh.models import LabelSet, Label, HoverTool, ColumnDataSource
import lxml.etree as ET
from glob import glob

# Get mutations data
mut_df = pd.read_csv("mutations.txt", sep='\t')

# Get copy-number alterations (CNA)
cna_df = pd.read_csv("cna.txt", sep='\t')

# Number of samples (rows), used as the denominator for every ratio below
n_samples = len(mut_df.index)

# Per-gene counts: samples with a recorded mutation, and samples with a non-zero CNA value
genes = list(mut_df.columns[2:])
gene_mutations = [mut_df[col].count() for col in mut_df.columns[2:]]
gene_cna = [(cna_df[col].dropna() != 0).sum() for col in cna_df.columns[2:]]

# Combined alteration ratio per gene: (CNA count + mutation count) / number of samples
gene_modifs = [(cna + mut) / n_samples for cna, mut in zip(gene_cna, gene_mutations)]

# Percentages are computed here, separately, so the raw counts stay available if needed later
gene_cna_vals = [100 * v / n_samples for v in gene_cna]
gene_mutations_vals = [100 * v / n_samples for v in gene_mutations]

# Create color palette viridis for bokeh
colors = [
     "#%02x%02x%02x" % (int(r), int(g), int(b)) for r, g, b, _ in 255*mpl.cm.viridis(mpl.colors.Normalize()(gene_modifs))
]

In [5]:
output_notebook()

In [6]:
# Create source for plotting, useful for Labeling afterwards
source = ColumnDataSource(data=dict(cna=gene_cna_vals, 
                                    mut=gene_mutations_vals,
                                    gene_names=genes,
                                    gene_modif=gene_modifs,
                                    col=colors))

# Formatting title and labels
p = figure(title='Gene alterations in the cell cycle pathway')
p.xaxis[0].axis_label = '% copy-number alterations'
p.yaxis[0].axis_label = '% mutations'

p.scatter('cna', 'mut', source=source, alpha=0.6, radius='gene_modif', fill_color='col', line_color=None)
labels = LabelSet(x='cna', y='mut', text='gene_names', source=source)
p.add_layout(labels)


In [7]:
show(p)

The first plot in this notebook is a 2D representation of the percentages of mutations and copy-number alterations observed for each gene, where the color and radius of each circle correspond to the combined ratio. Interestingly, although the overall percentage of samples with a genetic alteration is similar for RB1 and CDKN1B, the alterations come from different types of genetic modification. For instance, CDKN1B has a low percentage of point mutations and a high percentage of copy-number alterations.
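
To make the two percentages concrete, here is a minimal, dependency-free sketch of the same counting rule on a toy table. The file layout (two ID columns followed by one column per gene, with `NaN` marking missing values), the gene names `GENE_A` and `GENE_B`, and the helper functions are all assumptions made for illustration; they are not taken from the real cBioPortal files.

```python
import csv
import io

# Hypothetical toy tables: two ID columns, then one column per gene (assumed layout).
MUTATIONS = """STUDY_ID\tSAMPLE_ID\tGENE_A\tGENE_B
s1\tp1\tR175H\tNaN
s1\tp2\tNaN\tNaN
s1\tp3\tG12D\tNaN
s1\tp4\tNaN\tNaN
"""
CNA = """STUDY_ID\tSAMPLE_ID\tGENE_A\tGENE_B
s1\tp1\t0\t-2
s1\tp2\t0\t2
s1\tp3\tNaN\t-1
s1\tp4\t0\t0
"""


def read_rows(text):
    """Parse a tab-separated table into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def is_missing(cell):
    """Treat empty cells and common NA markers as missing."""
    return cell.strip() in ("", "NaN", "NA")


def percent_mutated(rows, gene):
    """Percentage of samples with any recorded mutation for `gene`."""
    hits = sum(1 for row in rows if not is_missing(row[gene]))
    return 100 * hits / len(rows)


def percent_cna(rows, gene):
    """Percentage of samples with a non-missing, non-zero copy-number value."""
    hits = sum(
        1 for row in rows
        if not is_missing(row[gene]) and float(row[gene]) != 0
    )
    return 100 * hits / len(rows)


mut_rows = read_rows(MUTATIONS)
cna_rows = read_rows(CNA)
for gene in ("GENE_A", "GENE_B"):
    print(gene, percent_mutated(mut_rows, gene), percent_cna(cna_rows, gene))
# GENE_A 50.0 0.0
# GENE_B 0.0 75.0
```

In this toy case the two genes are altered in a comparable share of samples, but `GENE_A` is driven entirely by mutations and `GENE_B` entirely by copy-number changes, which is the kind of contrast the plot makes visible.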




In [8]:

# Parsing NSF data
filenames = glob("2018/18*.xml")

parser = ET.XMLParser(recover=True)

counter = 0
citing_grants = {}
for file in filenames:
    with open(file) as f:
        tree = ET.parse(f, parser=parser)
        root = tree.getroot()
        # Skip conference awards; guard against AwardTitle elements with no text
        check_if_conference = [elm.text for elm in root.iter('AwardTitle') if elm.text and 'Conference' in elm.text]
        if len(check_if_conference) == 0:
            awardid = [elm.text for elm in root.iter('AwardID')][0]
            for gene in genes:
                for resume in root.iter('AbstractNarration'):
                    if resume is not None and resume.text is not None:
                        resume_splitted = resume.text.split()
                        # Match the gene name, or the gene name prefixed with 'p'
                        if gene in resume_splitted or 'p' + str(gene) in resume_splitted:
                            # Keys are gene names; values are comma-separated award IDs
                            if gene not in citing_grants:
                                citing_grants[gene] = awardid
                            elif awardid not in citing_grants[gene].split(', '):
                                citing_grants[gene] += ', ' + awardid
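
One limitation of matching on whitespace-split words is that a gene name followed by punctuation (for example `RB1,` or `(CDKN1B)`) is not found. A possible refinement, sketched below under stated assumptions, is to tokenize the abstract with a regular expression first. The helper name `find_cited_genes`, the sample abstract, and the gene list are hypothetical and only serve as an illustration; gene symbols containing hyphens would need a different pattern.

```python
import re


def find_cited_genes(abstract, genes):
    """Return the genes mentioned in `abstract`, either by name or with a 'p' prefix.

    Tokens are runs of letters and digits, so surrounding punctuation is ignored.
    """
    tokens = set(re.findall(r"[A-Za-z0-9]+", abstract))
    return [g for g in genes if g in tokens or "p" + g in tokens]


# Hypothetical abstract text, not taken from a real NSF award.
abstract = "We study loss of RB1, and the role of (CDKN1B) in arrest. pCCND1 levels rise."
gene_list = ["RB1", "CDKN1B", "CCND1", "E2F1"]

print(find_cited_genes(abstract, gene_list))
# ['RB1', 'CDKN1B', 'CCND1']

# A plain whitespace split misses 'RB1' here because the token is 'RB1,'
print("RB1" in abstract.split())
# False
```

Matching stays case-sensitive, as in the loop above, so ordinary lowercase words are not mistaken for gene names.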



In [15]:
liste_nsf = []
liste_colors = []

for gene in genes:
    if gene not in citing_grants:
        liste_nsf.append('0')
        liste_colors.append('blue')
    else:
        liste_nsf.append(citing_grants[gene])
        liste_colors.append('red')

p2 = figure(title='NSF grants on genes altered in cell cycle pathway')

source2 = ColumnDataSource(data=dict(cna=gene_cna_vals, 
                            mut=gene_mutations_vals,
                            gene_names=genes,
                            gene_modif=gene_modifs,
                            liste_nsf=liste_nsf,
                            col=liste_colors))

tooltips = [('NSF Award ID', '@liste_nsf')]

p2.scatter('cna', 'mut', source=source2, alpha=0.6, fill_color='col', line_color=None)
p2.xaxis[0].axis_label = '% copy-number alterations'
p2.yaxis[0].axis_label = '% mutations'

labels = LabelSet(x='cna', y='mut', text='gene_names', source=source2, text_font_size="8pt")

p2.add_tools(HoverTool(tooltips=tooltips))
p2.add_layout(labels)


In [16]:
show(p2)