# Upstream Regulator Analysis Package
## HUVEC and Breast Cancer Case Study

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 11th January, 2018

----------------------

<a id='toc'></a>
## Table of Contents
1. [Background](#background)
2. [Import packages](#import)
3. [Transcription Factors](#tf)
4. [Background Network](#bn)
5. [HUVEC](#huvec)
    1. [Load DEG's](#loadhuvec)
    2. [Calculate p-values and z-scores](#pzhuvec)
    3. [Compare the Ingenuity Article's results to Ours](#comphuvec)
    4. [Display Our results](#displayhuvec)
6. [Breast Cancer](#brca)
    1. [Load DEG's](#loadbrca)
    2. [Calculate p-values and z-scores](#pzbrca)
    3. [Compare the Ingenuity Article's results to Ours](#compbrca)
    4. [Display Our results](#displaybrca)

## Background
<a id='background'></a>

This notebook attempts to validate our Upstream Regulator Analysis (URA) modules. Our modules are inspired by Ingenuity Systems' [Ingenuity Upstream Regulator Analysis in IPA®](http://pages.ingenuity.com/rs/ingenuity/images/0812%20upstream_regulator_analysis_whitepaper.pdf); this test case is inspired by Ingenuity Systems' corresponding paper, [Causal analysis approaches in Ingenuity Pathway Analysis](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3928520/). That paper analyzes two sets of Differentially Expressed Genes (DEG's): one from a breast cancer tumor and the other from Human Umbilical Vein Endothelial Cells (HUVEC). We run our version of URA on the same breast cancer and HUVEC DEG's, but use the [STRING database](https://string-db.org/) human protein interaction network as our background network.

## Import packages
<a id='import'></a>

In [2]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import upstream regulator modules
import sys
code_path = '../ura'
sys.path.append(code_path)
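import create_graph
import stat_analysis

# reload picks up local edits to the modules; it is a builtin on Python 2
# and has to be imported from importlib on Python 3
try:
    from importlib import reload
except ImportError:
    pass
reload(create_graph)
reload(stat_analysis)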

# network visualization package
# (pip install visJS2jupyter)
import visJS2jupyter.visJS_module as visJS_module

## Transcription Factors
<a id='tf'></a>

Load the list of Transcription Factors (TF's) we want to analyze. Gene symbols are in all caps.

In [5]:
TF_list = create_graph.create_TF_list(slowkow_bool=True,
                    slowkow_files=['../TF_databases/slowkow_databases/TRED_TF.txt',
                                   '../TF_databases/slowkow_databases/ITFP_TF.txt',
                                   '../TF_databases/slowkow_databases/ENCODE_TF.txt',
                                   '../TF_databases/slowkow_databases/Neph2012_TF.txt',
                                   '../TF_databases/slowkow_databases/TRRUST_TF.txt',
                                   '../TF_databases/slowkow_databases/Marbach2016_TF.txt'],
                    slowkow_sep = '\n',
                    jaspar_bool=True,
                    jaspar_file="../TF_databases/jaspar_genereg_matrix.txt")


TF_list = TF_list + ['TNF', 'IFNG', 'LBP'] # known regulators of interest missing from our TF databases
len(TF_list)

3986

## Background Network
<a id='bn'></a>

Load our background network, available on the [STRING website](https://string-db.org/cgi/download.pl?UserId=9BGA8WkVMRl6&sessionId=HWUK6Dum9xC6&species_text=Homo+sapiens), and keep only the information about our TF's.

The function load_STRING_to_digraph can load any species' "protein actions" database from STRING. Just ensure that your TF list and DEG list use the same naming convention as your background network (Homo sapiens symbols are all caps, Mus musculus symbols capitalize only the first letter, etc.). Inconsistent naming can affect your results, because a symbol that does not match a node name exactly is not found in the graph.
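
The naming caveat above can be checked before any graph is built. The following is a minimal sketch, not part of the `create_graph` module: `normalize_symbol` and `find_unmatched` are hypothetical helpers, and the two capitalization rules are the only ones it knows about. It targets Python 3.

```python
def normalize_symbol(symbol, species="human"):
    """Return `symbol` in the capitalization convention assumed for `species`."""
    cleaned = symbol.strip()
    if species == "human":
        # human symbols are written in all caps, e.g. "Tnf" -> "TNF"
        return cleaned.upper()
    if species == "mouse":
        # mouse symbols capitalize only the first letter, e.g. "TNF" -> "Tnf"
        return cleaned[:1].upper() + cleaned[1:].lower()
    raise ValueError("unsupported species: %r" % species)


def find_unmatched(symbols, network_nodes, species="human"):
    """List the input symbols that are absent from the network after normalization."""
    nodes = {normalize_symbol(node, species) for node in network_nodes}
    return [s for s in symbols if normalize_symbol(s, species) not in nodes]


# Hypothetical usage: "XYZ1" is a made-up symbol that the toy network lacks.
missing = find_unmatched(["Tnf", "LBP", "XYZ1"], ["TNF", "LBP"])
```

Running a check like this on the TF list and the DEG list shows how many symbols would silently drop out of the analysis.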

In [4]:
filename = "../background_networks/9606.protein.actions.v10.5.txt"
confidence_filter=400
DG_TF, DG_universe = create_graph.load_STRING_to_digraph(filename, confidence_filter, TF_list)

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-7374...done.
Finished.
56 input query terms found dup hits:
	[(u'ENSP00000359550', 3), (u'ENSP00000447879', 2), (u'ENSP00000364076', 2), (u'ENSP00000348986', 2),
312 input query terms found no hit:
	[u'ENSP00000376684', u'ENSP00000289352', u'ENSP00000202788', u'ENSP00000373637', u'ENSP00000367802',
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
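
To illustrate what a confidence filter of 400 means, here is a minimal sketch that keeps only the interactions whose score reaches a threshold. It is not the implementation of `load_STRING_to_digraph`. The column names are assumed to follow the header of the STRING "protein actions" file, the protein identifiers are made up, and treating the threshold as inclusive (`>=`) is also an assumption. It targets Python 3.

```python
import csv
import io


def filter_actions(handle, min_score=400):
    """Return (source, target, action, score) for rows with score >= min_score."""
    kept = []
    for row in csv.DictReader(handle, delimiter="\t"):
        score = int(row["score"])
        if score >= min_score:
            kept.append((row["item_id_a"], row["item_id_b"], row["action"], score))
    return kept


# Made-up rows in the assumed column layout; these are not real STRING records.
SAMPLE_ACTIONS = "\n".join([
    "\t".join(["item_id_a", "item_id_b", "mode", "action",
               "is_directional", "a_is_acting", "score"]),
    "\t".join(["9606.P_A", "9606.P_B", "activation", "activation", "t", "t", "900"]),
    "\t".join(["9606.P_A", "9606.P_C", "inhibition", "inhibition", "t", "t", "400"]),
    "\t".join(["9606.P_B", "9606.P_C", "binding", "", "f", "f", "150"]),
]) + "\n"

edges = filter_actions(io.StringIO(SAMPLE_ACTIONS), min_score=400)
```

Raising the threshold trades coverage for reliability: fewer edges survive, but each surviving edge has stronger support in STRING.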


## HUVEC
<a id='huvec'></a>

### Load HUVEC DEG's
<a id='loadhuvec'></a>

In [6]:
# DEG's
filename_huvec = '../DEG_databases/geo2r_GSE2639_huvec.txt'

# TF's found to be statistically significant in Ingenuity Systems URA article
huvec_genes = ['TNF','IFNG','LBP', 'NFKB1', 'NFKB2', 'REL', 'RELA', 'RELB', 'PCBP3', 'PCBP2', 'PCBP1', 'PCBP4', 'NFKBIA']

# add DEG information to STRING background network
DEG_list_huvec, DEG_to_pvalue_huvec, DEG_to_updown_huvec = create_graph.create_DEG_list(filename_huvec, p_value_filter = 0.3)
DG_huvec = create_graph.add_updown_from_DEG(DG_TF, DEG_to_updown_huvec) # will not overwrite the original graph DG_TF
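
The DEG loading step can be pictured with the following sketch, which filters a tab-separated differential expression table by p-value and records each gene's direction of change. It is not the implementation of `create_DEG_list`. The column names (`P.Value`, `logFC`, `Gene.symbol`) are assumed to match a typical GEO2R export, the strict `<` comparison and the keep-the-most-significant-probe rule are assumptions, and every row of sample data is made up. It targets Python 3.

```python
import csv
import io


def load_degs(handle, p_value_filter=0.05,
              p_col="P.Value", fc_col="logFC", symbol_col="Gene.symbol"):
    """Return (DEG list, gene -> p-value, gene -> log fold change)."""
    deg_to_pvalue = {}
    deg_to_updown = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        symbol = row[symbol_col].strip().upper()
        if not symbol:
            continue  # probe without a gene annotation
        p_value = float(row[p_col])
        if p_value >= p_value_filter:
            continue  # not significant enough to count as a DEG
        if symbol in deg_to_pvalue and deg_to_pvalue[symbol] <= p_value:
            continue  # a more significant probe for this gene was already kept
        deg_to_pvalue[symbol] = p_value
        deg_to_updown[symbol] = float(row[fc_col])  # > 0 is up, < 0 is down
    return sorted(deg_to_pvalue), deg_to_pvalue, deg_to_updown


# Made-up table in the assumed GEO2R column layout.
SAMPLE_DEGS = "\n".join([
    "\t".join(["ID", "adj.P.Val", "P.Value", "t", "B", "logFC", "Gene.symbol"]),
    "\t".join(["p1", "0.01", "0.001", "5.0", "2.0", "1.5", "GENEA"]),
    "\t".join(["p2", "0.5", "0.2", "1.0", "-3.0", "-0.8", "GENEB"]),
    "\t".join(["p3", "0.9", "0.6", "0.2", "-5.0", "0.1", "GENEC"]),
    "\t".join(["p4", "0.4", "0.1", "-1.5", "-2.5", "-1.1", "GENEA"]),
]) + "\n"

deg_list, deg_to_pvalue, deg_to_updown = load_degs(io.StringIO(SAMPLE_DEGS), p_value_filter=0.3)
```

With the loose filter of 0.3 used for HUVEC above, the made-up GENEB (p = 0.2) is kept; with the stricter 0.05 used for breast cancer below, it would be dropped.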

### Calculate p-values and z-scores
<a id='pzhuvec'></a>

For a detailed explanation of our p-value and z-score calculation functions, see our URA_Basic_Example notebook.

In [7]:
# calculate p-values
p_values_huvec = stat_analysis.tf_pvalues(DG_huvec, DG_universe, DEG_list_huvec)

# calculate z-scores
z_scores_huvec = stat_analysis.tf_zscore(DG_huvec, DEG_list_huvec, bias_filter = 0.25) # recommended bias filter is 0.25
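
For orientation, the Ingenuity paper linked in the Background describes an overlap p-value based on Fisher's exact test and an activation z-score. The sketch below shows the unweighted form of both statistics using only the standard library. It is an illustration under stated assumptions, not the code in `stat_analysis`: the function names and argument layout are hypothetical, the bias filter is not modeled, and returning 0.0 when no target carries information is a choice made here. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb, sqrt


def overlap_pvalue(n_universe, n_deg, n_targets, n_overlap):
    """Right-tailed hypergeometric probability of at least n_overlap DEGs
    among a regulator's n_targets targets."""
    total = comb(n_universe, n_targets)
    upper = min(n_deg, n_targets)
    tail = sum(comb(n_deg, k) * comb(n_universe - n_deg, n_targets - k)
               for k in range(n_overlap, upper + 1))
    return tail / total


def activation_zscore(edge_signs, deg_directions):
    """edge_signs: target -> +1 (activates) or -1 (inhibits).
    deg_directions: target -> signed expression change (> 0 up, < 0 down)."""
    votes = []
    for target, sign in edge_signs.items():
        direction = deg_directions.get(target, 0)
        if direction == 0:
            continue  # target is not a DEG, so it casts no vote
        # +1 when the observed change agrees with an activated regulator
        votes.append(sign if direction > 0 else -sign)
    if not votes:
        return 0.0  # no informative targets
    return sum(votes) / sqrt(len(votes))


# Toy numbers: 10 genes, 4 DEGs, a regulator with 3 targets, 2 of them DEGs.
p = overlap_pvalue(n_universe=10, n_deg=4, n_targets=3, n_overlap=2)
z = activation_zscore({"A": 1, "B": 1, "C": -1, "D": 1},
                      {"A": 2.0, "B": 0.5, "C": -1.2, "D": -0.3})
```

In the toy example three targets vote for activation and one against, so the z-score is (3 - 1) / sqrt(4) = 1.0, and the overlap p-value is 40/120.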

### Compare the Ingenuity Article's results to Ours
<a id='comphuvec'></a>

These are the TF's found to be most relevant according to the Ingenuity Pathway Analysis paper. Optimally, these genes would rank very high. A rank of 0 is always best, while the rank associated with a z-score of zero is the worst. 

In this case, IFNG and TNF rank high, showing agreement between our results and the Ingenuity article's results. LBP, PCBP1, PCBP2, PCBP4, and RELB have z-scores of zero, meaning not enough information exists about these genes to calculate a true z-score for them. PCBP3 has a NaN z-score because it is not in the graph.
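
The ranking described above, where 0 is best, tied scores share a rank, and a z-score of zero lands on the worst rank, can be sketched as a dense ranking on absolute z-score. This is an assumption about how such a ranking could be computed, not the code of `rank_and_score_df`. The helper name is hypothetical, and rounding to six decimals to merge floating-point ties is a choice made here. It targets Python 3.

```python
def dense_rank_by_abs(z_scores, decimals=6):
    """Map each regulator to a 0-based dense rank of |z|, largest first.
    Regulators with equal |z| (after rounding) share a rank."""
    keyed = {name: round(abs(z), decimals) for name, z in z_scores.items()}
    distinct = sorted(set(keyed.values()), reverse=True)
    rank_of = {value: rank for rank, value in enumerate(distinct)}
    return {name: rank_of[value] for name, value in keyed.items()}


# Made-up scores: B and C tie on |z| = 2.0, and D has no signal at all.
ranks = dense_rank_by_abs({"A": 3.0, "B": -2.0, "C": 2.0, "D": 0.0})
```

Here `ranks` is `{"A": 0, "B": 1, "C": 1, "D": 2}`: the tied pair shares rank 1, and the zero z-score receives the last rank.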

In [8]:
stat_analysis.rank_and_score_df(z_scores_huvec, huvec_genes, remove_dups=True)

| gene | rank | z-score |
|---|---|---|
| IFNG | 0.0 | 3.549648 |
| TNF | 5.0 | 2.342606 |
| NFKB2 | 12.0 | 1.732051 |
| RELA | 13.0 | 1.732051 |
| NFKBIA | 19.0 | 1.341641 |
| REL | 19.0 | -1.341641 |
| NFKB1 | 23.0 | 1.091089 |
| LBP | 34.0 | 0.0 |
| PCBP1 | 34.0 | 0.0 |
| PCBP2 | 34.0 | 0.0 |


### Display Our results
<a id='displayhuvec'></a>

These are the most relevant TF's according to our analysis. Ideally, these would match the genes in `huvec_genes` defined above. However, using a different background network produces discrepancies.

IFNG, TNF, and TNFAIP3 show agreement between our results and the Ingenuity article's results.

In [9]:
top_act_huvec = stat_analysis.top_values(z_scores_huvec, act = True, abs_value = False, top = 10)
top_inh_huvec = stat_analysis.top_values(z_scores_huvec, act = False, abs_value = False, top = 10)
display(top_act_huvec.to_frame(name = 'activating z-score'))
display(top_inh_huvec.to_frame(name = 'inhibiting z-score'))

| gene | activating z-score |
|---|---|
| IFNG | 3.549648 |
| CTF1 | 3.0 |
| STAT4 | 2.645751 |
| STAT5A | 2.529822 |
| TNF | 2.342606 |
| HIF1A | 2.236068 |
| GATA3 | 2.236068 |
| IRF3 | 2.236068 |
| FOS | 2.236068 |
| SPG7 | 2.236068 |


| gene | inhibiting z-score |
|---|---|
| PPARA | -2.44949 |
| TNFAIP3 | -2.0 |
| DDX41 | -2.0 |
| CACNA2D4 | -2.0 |
| CACNA2D2 | -2.0 |
| GPSM1 | -1.941451 |
| SIRT1 | -1.897367 |
| STRAP | -1.732051 |
| IRF8 | -1.732051 |
| NKX2-5 | -1.732051 |


## Breast Cancer
<a id='brca'></a>

### Load Breast Cancer DEG's
<a id='loadbrca'></a>

In [10]:
filename_brca = '../DEG_databases/geo2r_GSE11352_brca_48hours.txt'
brca_genes = ['ESR1', 'FSH', 'MEK', 'BRD4', 'MYC', 'MARK1', 'IL1B','NCOA3','PGR', 
              'EGR1', 'HIF1A', 'NR3C1','CTNNB1','TP53','SMARCE1','STAT5B']
DEG_list_brca, DEG_to_pvalue_brca, DEG_to_updown_brca = create_graph.create_DEG_list(filename_brca, p_value_filter = 0.05)
DG_brca = create_graph.add_updown_from_DEG(DG_TF, DEG_to_updown_brca)

### Calculate p-values and z-scores
<a id='pzbrca'></a>

In [11]:
# calculate p-values
p_values_brca = stat_analysis.tf_pvalues(DG_brca, DG_universe, DEG_list_brca)

# calculate z-scores
z_scores_brca = stat_analysis.tf_zscore(DG_brca, DEG_list_brca, bias_filter = 0.25)

### Compare the Ingenuity Article's results to Ours
<a id='compbrca'></a>
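TP53, MYC, and ESR1 are found to rank high, showing agreement between our results and the Ingenuity article's results. Genes are ranked here by absolute z-score (`abs_value = True`), so a strongly inhibited regulator such as TP53 can take rank 0.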

In [12]:
stat_analysis.rank_and_score_df(z_scores_brca, brca_genes, act = True, abs_value = True, remove_dups=True)

| gene | rank | z-score |
|---|---|---|
| TP53 | 0.0 | -1.889822 |
| MYC | 2.0 | 1.666667 |
| ESR1 | 3.0 | 1.414214 |
| STAT5B | 4.0 | 1.341641 |
| NR3C1 | 5.0 | 1.0 |
| HIF1A | 7.0 | -0.57735 |
| CTNNB1 | 8.0 | 0.447214 |
| EGR1 | 8.0 | -0.447214 |
| BRD4 | 9.0 | 0.0 |
| NCOA3 | 9.0 | 0.0 |


### Display Our results
<a id='displaybrca'></a>
MYCN, MYC, ESR1, and TP53 show agreement between our results and the Ingenuity article's results.

Four out of five of our top activating genes, in particular MYCN, are associated with breast cancer according to the literature. Our top three inhibiting genes, in particular FOXO3, are also associated with breast cancer according to the literature.

In [13]:
top_act_brca = stat_analysis.top_values(z_scores_brca, act = True, abs_value = False, top = 10)
top_inh_brca = stat_analysis.top_values(z_scores_brca, act = False, abs_value = False, top = 10)
display(top_act_brca.to_frame(name = 'activating z-score'))
display(top_inh_brca.to_frame(name = 'inhibiting z-score'))

| gene | activating z-score |
|---|---|
| STAT1 | 1.732051 |
| MYCN | 1.732051 |
| STAT3 | 1.732051 |
| MYC | 1.666667 |
| SREBF1 | 1.414214 |
| PHKB | 1.414214 |
| CCNA2 | 1.414214 |
| ESR1 | 1.414214 |
| SREBF2 | 1.414214 |
| IQGAP1 | 1.414214 |


| gene | inhibiting z-score |
|---|---|
| TP53 | -1.889822 |
| FOXO3 | -1.414214 |
| EGR2 | -1.414214 |
| NR1H3 | -1.414214 |
| FOXO4 | -1.414214 |
| AKT1 | -1.414214 |
| NR1H2 | -1.0 |
| NFE2L2 | -1.0 |
| TWIST1 | -1.0 |
| FOXP3 | -1.0 |


In [15]:
stat_analysis.vis_tf_network(DG_brca, 'MYCN', '../DEG_databases/geo2r_GSE11352_brca_48hours.txt', DEG_list_brca,
              directed_edges = False,
              node_spacing = 1200,
              graph_id = 0) 

In [16]:
stat_analysis.vis_tf_network(DG_brca, 'FOXO3', '../DEG_databases/geo2r_GSE11352_brca_48hours.txt', DEG_list_brca,
              directed_edges = True,
              node_spacing = 2300,
              graph_id = 1) 