# Upstream Regulator Analysis Package Basic Example

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 8th November, 2017

----------------------

<a id='toc'></a>
## Table of Contents
1. [Import packages](#import)
2. [Background](#background)
3. [Create Graph](#graph)
    1. [Transcription Factors](#tf)
    2. [Background Network](#bn)
    3. [Differentially Expressed Genes](#deg)
4. [Stat Analysis](#stat)
    1. [P-value](#pvalue)
    2. [Z-score](#zscore)
    3. [Display Results](#display)
        1. [Top Values](#top)
        2. [Rank with z-score](#rank)
        3. [Network Visualization using visJS2jupyter](#visJS)

## Import packages
<a id='import'></a>

In [1]:
import sys
import networkx as nx
code_path = '../ura'
sys.path.append(code_path)
import create_graph
import stat_analysis
reload(create_graph)
reload(stat_analysis)

<module 'stat_analysis' from '../ura\stat_analysis.pyc'>

## Background
<a id='background'></a>

The inspiration for these modules comes from Ingenuity Systems' [Ingenuity Upstream Regulator Analysis in IPA®](http://pages.ingenuity.com/rs/ingenuity/images/0812%20upstream_regulator_analysis_whitepaper.pdf).

This notebook is meant to explain how to use the functions from our URA modules. If you need further help on how to use a function or what that function's purpose is, see the comments associated with that function in the source code (create_graph.py or stat_analysis.py).

## Create Graph ##
<a id='graph'></a>

### Transcription Factors
<a id='tf'></a>

Our create_graph module prepares three databases for our analysis: a transcription factor interaction network, a background network, and a differentially expressed genes database. Our transcription factor data comes from two sources: [slowkow](https://github.com/slowkow/tftargets) and [jaspar](http://jaspar.genereg.net/). Slowkow is a compilation of six smaller databases: TRED, ITFP, ENCODE, Neph2012, TRRUST, and Marbach2016. Information about these specific databases, including links to their sources, is available on the slowkow GitHub page. Some basic information on these databases is as follows:
- Size of Jaspar Database: 2049
- Size of Slowkow Database: 2705
- Size of TRED: 133
- Size of ITFP: 1974
- Size of ENCODE: 157
- Size of Neph2012: 536
- Size of TRRUST: 748
- Size of Marbach2016: 643 

Our transcription factor list contains 3983 unique entries.

In [2]:
TF_list = create_graph.create_TF_list(slowkow_bool=True,
                    slowkow_files=['../slowkow_databases/TRED_TF.txt',
                                   '../slowkow_databases/ITFP_TF.txt',
                                   '../slowkow_databases/ENCODE_TF.txt',
                                   '../slowkow_databases/Neph2012_TF.txt',
                                   '../slowkow_databases/TRRUST_TF.txt',
                                   '../slowkow_databases/Marbach2016_TF.txt'],
                    slowkow_sep = '\n',
                    jaspar_bool=True,
                    jaspar_file="../jaspar_genereg_matrix.txt")
len(TF_list)

3983
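
The 3983 unique entries are fewer than the sum of the source sizes because the sources overlap. A minimal sketch of that de-duplication step, assuming each source is plain text with one symbol per line (as `slowkow_sep = '\n'` suggests). The helper `merge_tf_sources` and the sample symbols are hypothetical and are not part of `create_graph`:

```python
def merge_tf_sources(sources, sep='\n'):
    """Return the sorted unique symbols found across several text blobs.

    `sources` is a list of strings, each holding symbols separated by `sep`.
    Hypothetical helper for illustration; not the create_graph implementation.
    """
    unique = set()
    for text in sources:
        for symbol in text.split(sep):
            symbol = symbol.strip()
            if symbol:  # skip blank lines
                unique.add(symbol)
    return sorted(unique)


# Made-up sample data standing in for the contents of two source files.
source_a = "TP53\nMYC\nSTAT1\n"
source_b = "MYC\nJUN\n\nTP53\n"
merged = merge_tf_sources([source_a, source_b])
print(merged)       # symbols shared by both sources appear once
print(len(merged))
```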

### Background Network
<a id='bn'></a>

Our background network comes from the [STRING](https://string-db.org/) database. Our module can use our list of transcription factors (TF_list) to extract only relevant information from the STRING database. Namely, we pull out edge information from STRING only for our transcription factors (the entries in TF_list) and use those edges to create a graph.

In [3]:
filename = "../STRING_network.xlsx"
DG_TF, DG_universe = create_graph.load_small_STRING_to_digraph(filename, TF_list) # to filter, specify TF_list

In [4]:
print('Number of unique nodes without filtering: ' + str(len(list(set(DG_universe.nodes())))))
print('Number of edges without filtering: ' + str(len(DG_universe.edges())))
print('\n')
print('Number of unique nodes with filtering: ' + str(len(list(set(DG_TF.nodes())))))
print('Number of edges with filtering: ' + str(len(DG_TF.edges())))

Number of unique nodes without filtering: 3580
Number of edges without filtering: 40106


Number of unique nodes with filtering: 687
Number of edges with filtering: 1505
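
The drop in node and edge counts comes from keeping only edges tied to a transcription factor. A rough sketch of that filter on plain tuples, assuming the TF must be the source of the edge (the text does not say whether `load_small_STRING_to_digraph` also keeps edges that point into a TF). The function names and the edge data are made up for illustration:

```python
def filter_edges_by_source(edges, tf_list):
    """Keep only edges whose source node is a listed transcription factor.

    `edges` is a list of (source, target, sign) tuples. Treating the source
    as the regulator is an assumption made for this sketch.
    """
    tf_set = set(tf_list)  # set membership is O(1) per lookup
    return [edge for edge in edges if edge[0] in tf_set]


def count_nodes(edges):
    """Count the distinct nodes that appear as a source or a target."""
    nodes = set()
    for source, target, _sign in edges:
        nodes.add(source)
        nodes.add(target)
    return len(nodes)


# Made-up edges: +1 marks an activating interaction, -1 an inhibiting one.
all_edges = [
    ('TF_A', 'GENE_1', 1),
    ('TF_A', 'GENE_2', -1),
    ('GENE_1', 'GENE_3', 1),
    ('TF_B', 'GENE_3', 1),
    ('GENE_4', 'GENE_5', -1),
]
tf_edges = filter_edges_by_source(all_edges, ['TF_A', 'TF_B'])
print('unfiltered: %d nodes, %d edges' % (count_nodes(all_edges), len(all_edges)))
print('filtered:   %d nodes, %d edges' % (count_nodes(tf_edges), len(tf_edges)))
```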


### Differentially Expressed Genes
<a id='deg'></a>

Our last database comes from a list of differentially expressed genes (DEG) found in an experiment. This is the user input: the genes we wish to analyze using our stat_analysis module. The TF graph created previously serves as our point of reference for these DEG's.

Our DEG file must be a tab-separated file in which each row represents a gene. It must contain three columns (other columns are allowed) titled "adj_p_value" (adjusted p-value), "gene_symbol", and "fold_change" (fold change or log fold change).

In [5]:
filename = '../differentially_expressed_genes.txt'
p_value_filter = 0.3
DEG_list, DEG_to_pvalue, DEG_to_updown = create_graph.create_DEG_list(filename, p_value_filter) # load DEG list from file
DG_TF = create_graph.add_updown_from_DEG(DG_TF, DEG_to_updown) # add DEG's information to our graph 
len(DEG_list)

12906
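
To make the expected file layout concrete, here is a small sketch that parses a tab-separated table with the three required columns and applies a p-value cutoff. `read_deg_table` and the sample rows are hypothetical; in particular, whether `create_DEG_list` uses a strict `<` comparison is an assumption:

```python
import csv


def read_deg_table(lines, p_value_filter):
    """Parse tab-separated DEG rows and keep those below the p-value cutoff.

    `lines` is any iterable of text lines, header first.
    Returns (gene list, gene -> adjusted p-value, gene -> fold change).
    Hypothetical helper; the strict '<' comparison is an assumption.
    """
    genes, to_pvalue, to_fold = [], {}, {}
    for row in csv.DictReader(lines, delimiter='\t'):
        p = float(row['adj_p_value'])
        if p < p_value_filter:
            gene = row['gene_symbol']
            genes.append(gene)
            to_pvalue[gene] = p
            to_fold[gene] = float(row['fold_change'])
    return genes, to_pvalue, to_fold


# Made-up rows; the extra "note" column shows that other columns are ignored.
sample = (
    "gene_symbol\tadj_p_value\tfold_change\tnote\n"
    "GENE_1\t0.01\t1.8\tkept\n"
    "GENE_2\t0.45\t-0.3\tdropped by the cutoff\n"
    "GENE_3\t0.20\t-2.1\tkept\n"
)
genes, to_pvalue, to_fold = read_deg_table(sample.splitlines(), 0.3)
print(genes)
```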

## Stat Analysis
<a id='stat'></a>

Our stat_analysis package can help identify TF's that are statistically significant to our set of DEG's.

### P-value
<a id='pvalue'></a>

Our p-value function calculates the log of the p-value for every TF in the graph using [scipy.stats.hypergeom.logsf](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.hypergeom.html). These values help us determine which TF's are actually associated with our DEG's. 
- high value = significant connection between this TF and its DEG targets
- low value = TF is randomly associated with its DEG targets
- zero = None of this TF's targets were DEG's
- inf = original p-value was so small that its log is inf. Very high significance.

In [6]:
pvalues = stat_analysis.tf_pvalues(DG_TF, DG_universe, DEG_list)
display(pvalues.to_frame(name = 'p-values logs'))

Unnamed: 0,p-values logs
MRPL44,inf
RBBP5,inf
TOPORS,inf
HBN,inf
RPS24,74.214855
RPL23A,73.436222
RPL8,69.506025
RPS3,68.456336
RPS23,63.336969
DL,26.761561
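
For readers who want to see what a log survival function computes, the sketch below evaluates the log of a hypergeometric upper tail with the standard library only. It is not the `tf_pvalues` implementation: the mapping of counts to parameters is an assumption, and the sign flip at the end is inferred from the positive values in the table above. With scipy installed, the same quantity would be `scipy.stats.hypergeom.logsf(k - 1, M, n, N)`.

```python
import math


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_log_upper_tail(k, M, n, N):
    """log P(X >= k) for X ~ Hypergeometric(M, n, N).

    M: population size, n: number of "success" items in the population,
    N: number of draws (same parameter order as scipy.stats.hypergeom).
    """
    upper = min(n, N)
    lower = max(0, N - (M - n))
    log_terms = [
        log_choose(n, i) + log_choose(M - n, N - i) - log_choose(M, N)
        for i in range(max(k, lower), upper + 1)
    ]
    if not log_terms:
        return float('-inf')  # impossible event
    peak = max(log_terms)
    # log-sum-exp keeps very small probabilities from underflowing to zero
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Made-up counts (assumed mapping): 10 genes in the universe, 5 of them DEGs,
# and a TF with 5 targets that are all DEGs.
log_p = hypergeom_log_upper_tail(k=5, M=10, n=5, N=5)
score = -log_p  # larger score = smaller p-value
print(score)
```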


### Z-score
<a id='zscore'></a>

The goal of our z-score function is to predict the activation states of the TF's. We observe how a TF relates to each of its targets to make our prediction. We compare each target's observed gene regulation (either up or down) and each TF-target interaction (whether it is activating or inhibiting) to conclude whether a TF is activating or inhibiting. A positive value indicates activating while a negative value indicates inhibiting. A value of zero means that we did not have enough information about the target or TF-target interaction to make the prediction.

Our z-score calculator has two calculation methods: a not-biased formula and a bias formula. The bias formula calculates how "biased" the background network is towards either activating or inhibiting interactions, then adjusts its calculations to that non-zero average. Our bias_filter parameter is a number between 0 and 1 that indicates the threshold at which the graph is considered "too biased", and therefore the bias z-score formula must be used.
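
As a rough illustration of the simple case without bias correction, the sketch below follows the basic unweighted z-score described in the IPA white paper linked in the Background section: each informative target casts a vote of +1 or -1, and the sum is divided by the square root of the number of votes. `unbiased_zscore` is a hypothetical stand-in, and `tf_zscore` may differ in its details:

```python
import math


def unbiased_zscore(targets):
    """z = (sum of votes) / sqrt(number of votes) over informative targets.

    `targets` is a list of (edge_sign, regulation) pairs:
    edge_sign is +1 (activating) or -1 (inhibiting),
    regulation is +1 (up) or -1 (down),
    and 0 in either slot means the information is missing.
    """
    votes = [edge * reg for edge, reg in targets if edge != 0 and reg != 0]
    if not votes:
        return 0.0  # not enough information to make a prediction
    return sum(votes) / math.sqrt(len(votes))


# Nine activating edges onto nine up-regulated targets: 9 / sqrt(9)
consistent = unbiased_zscore([(1, 1)] * 9)

# Mixed evidence; the last target has no edge sign and is skipped
mixed = unbiased_zscore([(1, 1), (-1, -1), (1, -1), (0, 1)])

print(consistent)
print(mixed)
```

An inhibiting edge onto a down-regulated target counts as evidence for activation, because the product of the two signs is positive.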

In [7]:
unbiased_zscores = stat_analysis.tf_zscore(DG_TF, DEG_list, bias_filter = 1) # To explicitly use not-biased formula, set to 1
display(unbiased_zscores.to_frame(name = 'unbias function z-scores'))

Unnamed: 0,unbias function z-scores
RPS24,7.229569
RPS3,7.069980
RPL23A,7.020021
RPS23,7.020021
RPL8,6.077702
MED15,3.741657
MED1,3.000000
TAF1,3.000000
TAF5,3.000000
DL,2.840188


In [8]:
bias_zscore = stat_analysis.tf_zscore(DG_TF, DEG_list, bias_filter = 0) # To explicitly use bias formula, set to 0
display(bias_zscore.to_frame(name = 'bias function z-scores'))

Graph has bias of 0.637209302326. Adjusting z-score calculation accordingly.


Unnamed: 0,bias function z-scores
RPS24,61.537275
RPL23A,58.234485
RPS23,54.468443
RPL8,49.379552
RPS3,46.947623
MED15,8.105357
TAF1,6.499651
TAF4,6.461631
TAF5,6.118121
TAF2,5.462191


### Display Results
<a id='display'></a>

Our stat_analysis module also includes methods to help you find the most statistically relevant TF's to your input DEG data set.
- top_values: will display the highest activating, inhibiting, or overall values
- rank_and_score_df: will display where an input set of genes rank among all others in terms of their z-score
- vis_tf_network: will display an interactive network of a specified regulator and its downstream targets

**Top Values**
<a id='top'></a>

In [9]:
top_act = stat_analysis.top_values(unbiased_zscores, act = True, abs_value = False, top = 5) # top 5 activating
top_inh = stat_analysis.top_values(unbiased_zscores, act = False, abs_value = False, top = 5) # top 5 inhibiting
top_overall = stat_analysis.top_values(unbiased_zscores, act = False, abs_value = True, top = 5) # top 5 overall 
                                                                                                 # (happen to all be activating)
display(top_act.to_frame(name = 'activating z-score'))
display(top_inh.to_frame(name = 'inhibiting z-score'))
display(top_overall.to_frame(name = 'top z-scores'))

Unnamed: 0,activating z-score
RPS24,7.229569
RPS3,7.06998
RPL23A,7.020021
RPS23,7.020021
RPL8,6.077702


Unnamed: 0,inhibiting z-score
SLBO,-2.44949
KLHL18,-1.414214
PAX,-1.414214
Z,-1.414214
STAT92E,-1.0


Unnamed: 0,top z-scores
RPS24,7.229569
RPS3,7.06998
RPL23A,7.020021
RPS23,7.020021
RPL8,6.077702
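
The three selections above amount to three different sort orders. A minimal sketch on a plain dictionary, using a handful of z-scores copied from the tables above; `top_values_sketch` is hypothetical and only mirrors the `act`, `abs_value`, and `top` parameters of `top_values` as they are described here:

```python
def top_values_sketch(scores, act=True, abs_value=False, top=5):
    """Return the `top` (gene, z-score) pairs from a gene -> z-score dict.

    act=True: most positive first; act=False: most negative first;
    abs_value=True: largest magnitude first, regardless of sign
    (in this sketch `act` is ignored when abs_value is True).
    """
    items = list(scores.items())
    if abs_value:
        items.sort(key=lambda pair: abs(pair[1]), reverse=True)
    elif act:
        items.sort(key=lambda pair: pair[1], reverse=True)
    else:
        items.sort(key=lambda pair: pair[1])
    return items[:top]


# A few values taken from the output tables in this notebook.
scores = {
    'RPS24': 7.229569,
    'RPS3': 7.06998,
    'MED1': 3.0,
    'SLBO': -2.44949,
    'PAX': -1.414214,
    'STAT92E': -1.0,
}
top_activating = top_values_sketch(scores, act=True, top=2)
top_inhibiting = top_values_sketch(scores, act=False, top=2)
top_magnitude = top_values_sketch(scores, abs_value=True, top=2)
print(top_activating)
print(top_inhibiting)
print(top_magnitude)
```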


**Rank with z-score**
<a id='rank'></a>

Ranks range from 0 (best, aka strongest z-score) to whichever number is associated with a z-score of zero, in this case 47. If the remove_dups flag is set to True, any genes with the same z-score will have the same rank. 

In [10]:
genes_to_rank = ['TAF1', 'RPL8', 'RPS24','Z', 'ZEN2']
stat_analysis.rank_and_score_df(bias_zscore, genes_to_rank, act = True, abs_value = True, remove_dups=True)

Unnamed: 0,rank,z-score
RPS24,0,61.537275
RPL8,3,49.379552
TAF1,9,6.499651
Z,15,-4.630727
ZEN2,47,0.0
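
Sharing a rank between equal z-scores is what is usually called dense ranking. The sketch below shows one way to do it by magnitude, which is what `abs_value = True` appears to request. `dense_rank_by_magnitude` and the sample scores are made up, and ties are detected by exact equality:

```python
def dense_rank_by_magnitude(scores):
    """Map gene -> rank, where rank 0 is the largest |z-score|.

    Genes with the same magnitude share a rank, and ranks have no gaps.
    """
    distinct = sorted(set(abs(z) for z in scores.values()), reverse=True)
    rank_of = dict((value, rank) for rank, value in enumerate(distinct))
    return dict((gene, rank_of[abs(z)]) for gene, z in scores.items())


# Made-up scores: two ties, and a negative value that ranks by its magnitude.
sample_scores = {
    'GENE_A': 6.0,
    'GENE_B': -6.0,
    'GENE_C': 3.0,
    'GENE_D': 0.0,
    'GENE_E': 0.0,
}
ranks = dense_rank_by_magnitude(sample_scores)
for gene in sorted(ranks):
    print('%s rank=%d z=%s' % (gene, ranks[gene], sample_scores[gene]))
```

With this scheme the largest rank always belongs to the genes whose z-score is zero, which matches the description of rank 47 above.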


**Network Visualization using visJS2jupyter**
<a id='visJS'></a>

If you do not already have the package visJS2jupyter, type "pip install visJS2jupyter" into your command prompt.

**Node Color**:
- yellow: regulator/TF
- red: up-regulated targets
- blue: down-regulated targets
- the stronger the gene's fold change (specified in DEG_filename), the stronger the shade of red or blue
- white: no fold change information for this gene

**Node Border**:
- DEG's (specified by DEG_list) are outlined in black

**Node Size**:
- small: insignificant/large adjusted p-value (taken from DEG_filename)
- large: significant/small adjusted p-value

**Edge Color**:
- red: activating
- blue: inhibiting

**Tips**:
- nodes look too clustered: increase node spacing to ~3000 or ~4000
- nodes look too sparse: decrease node spacing to ~2000 or ~1000
- directed_edges = False will center the TF
- directed_edges = True will pull the TF off to the side
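
The node color legend above can be pictured as a function from fold change to a shade of red or blue. The sketch below is only an illustration of that idea: `fold_change_to_hex`, the linear fade, and the saturation point `max_abs` are all assumptions, and the actual shading is chosen inside `vis_tf_network`:

```python
def fold_change_to_hex(fold_change, max_abs=3.0):
    """Map a fold change to a hex color.

    None  -> white (no fold change information)
    >= 0  -> red, more saturated as the value grows
    < 0   -> blue, more saturated as the value shrinks
    `max_abs` is a hypothetical magnitude at which the color saturates.
    """
    if fold_change is None:
        return '#ffffff'
    strength = min(abs(fold_change) / float(max_abs), 1.0)
    fade = int(round(255 * (1.0 - strength)))  # 255 = white, 0 = saturated
    if fold_change >= 0:
        return '#ff%02x%02x' % (fade, fade)
    return '#%02x%02xff' % (fade, fade)


print(fold_change_to_hex(None))   # no information
print(fold_change_to_hex(3.0))    # strongly up-regulated
print(fold_change_to_hex(-6.0))   # strongly down-regulated, clipped at max_abs
```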

In [12]:
stat_analysis.vis_tf_network(DG_TF, tf = 'RPS24', DEG_filename = '../geo2r_GSE11352_brca_48hours.txt', DEG_list = DEG_list,
              directed_edges = False,
              node_spacing = 3200,
              graph_id = 0) 

In [13]:
stat_analysis.vis_tf_network(DG_TF, 'RPL8', '../geo2r_GSE11352_brca_48hours.txt', DEG_list = DEG_list,
              directed_edges = True,
              node_spacing = 3200,
              graph_id = 1) 