# Upstream Regulator Analysis Package Case Study

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 8th November, 2017

----------------------

<a id='toc'></a>
## Table of Contents
1. [Import packages](#import)
2. [Background](#background)
3. [Create Graph](#graph)
    1. [Transcription Factors](#tf)
    2. [Background Network](#bn)
    3. [Differentially Expressed Genes](#deg)
4. [Stat Analysis](#stat)
    1. [P-value](#pvalue)
    2. [Z-score](#zscore)

## Import packages
<a id='import'></a>

In [22]:
import sys
import time
import networkx as nx
code_path = '../ura'
sys.path.append(code_path)
import create_graph
import stat_analysis
reload(create_graph)
reload(stat_analysis)

<module 'stat_analysis' from '../ura\stat_analysis.py'>

## Background
<a id='background'></a>

The inspiration for these modules comes from [Ingenuity](http://pages.ingenuity.com/rs/ingenuity/images/0812%20upstream_regulator_analysis_whitepaper.pdf).

## Create Graph
<a id='graph'></a>

### Transcription Factors
<a id='tf'></a>

Our create_graph module prepares three data sets for our analysis: a transcription factor interaction network, a background network, and a list of differentially expressed genes. Our transcription factor data comes from two sources: [slowkow](https://github.com/slowkow/tftargets) and [JASPAR](http://jaspar.genereg.net/). Slowkow is a compilation of six smaller databases: TRED, ITFP, ENCODE, Neph2012, TRRUST, and Marbach2016. Information about these databases, including links to their sources, is available on the slowkow GitHub page. The sizes of these databases are as follows:
- Size of Jaspar Database: 2049
- Size of Slowkow Database: 2705
- Size of TRED: 133
- Size of ITFP: 1974
- Size of ENCODE: 157
- Size of Neph2012: 536
- Size of TRRUST: 748
- Size of Marbach2016: 643 

Our transcription factor list contains 3983 unique entries.
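The combined list is smaller than the sum of the individual database sizes because the sources overlap. As a minimal sketch of that de-duplication step (not the create_graph implementation; the helper name `merge_tf_lists`, the upper-casing of symbols, and the toy data are all assumptions made for illustration, and the code is written for Python 3):

```python
def merge_tf_lists(*sources):
    """Return the sorted, unique gene symbols found across all sources.

    Assumption: symbols are compared case-insensitively by upper-casing them.
    """
    unique = set()
    for source in sources:
        for name in source:
            name = name.strip().upper()
            if name:  # skip blank lines left over from newline-separated files
                unique.add(name)
    return sorted(unique)


# Toy stand-ins for two TF sources (hypothetical values, not real database content).
toy_source_a = ["MYC", "MAX", "TP53"]
toy_source_b = ["max", "REL", ""]

merged = merge_tf_lists(toy_source_a, toy_source_b)
print(merged)  # ['MAX', 'MYC', 'REL', 'TP53']
```

Here "MAX" appears in both toy sources but is counted once, which is the same effect that brings the real list down to its unique entries.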

In [2]:
TF_list = create_graph.create_TF_list(slowkow_bool=True,
                    slowkow_files=['../slowkow_databases/TRED_TF.txt',
                                   '../slowkow_databases/ITFP_TF.txt',
                                   '../slowkow_databases/ENCODE_TF.txt',
                                   '../slowkow_databases/Neph2012_TF.txt',
                                   '../slowkow_databases/TRRUST_TF.txt',
                                   '../slowkow_databases/Marbach2016_TF.txt'],
                    slowkow_sep = '\n',
                    jaspar_bool=True,
                    jaspar_file="../jaspar_genereg_matrix.txt")
len(TF_list)

3983

### Background Network
<a id='bn'></a>

Our background network comes from the [STRING](https://string-db.org/) database. Our module can use our list of transcription factors (TF_list) to extract only relevant information from the STRING database. Namely, we pull out edge information from STRING only for our transcription factors (the entries in TF_list) and use those edges to create a graph.
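The filtering idea can be sketched with plain Python tuples. This is only an illustration under stated assumptions: the helper `filter_edges_by_tf` is hypothetical, edges are kept when their source node is a TF (the notebook does not say whether the target is also checked), and the toy edges are invented. The real module builds a networkx directed graph instead of a list.

```python
def filter_edges_by_tf(edges, tf_list):
    """Keep only edges whose source node is a known transcription factor.

    Assumption: an edge is (source, target, attributes) and only the source
    has to be in tf_list.
    """
    tf_set = set(tf_list)  # set lookup avoids rescanning the list per edge
    return [(src, dst, attrs) for (src, dst, attrs) in edges if src in tf_set]


# Toy data, not taken from STRING.
toy_edges = [
    ("REL", "GENE_A", {"sign": 1}),
    ("GENE_A", "GENE_B", {"sign": -1}),
    ("PTEN", "GENE_B", {"sign": -1}),
]
toy_tf_list = ["REL", "PTEN", "MYB"]

filtered = filter_edges_by_tf(toy_edges, toy_tf_list)
print(len(toy_edges), "edges before,", len(filtered), "after")
```

Only the edges that start at REL and PTEN survive, which mirrors how the filtered graph ends up much smaller than the unfiltered one.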

In [16]:
filename = "../STRING_network.xlsx"
unfiltered_DG = create_graph.load_small_STRING_to_digraph(filename) # by default does not filter graph
DG = create_graph.load_small_STRING_to_digraph(filename, TF_list) # to filter, specify TF_list

In [17]:
print 'Number of unique nodes without filtering: ' + str(len(list(set(unfiltered_DG.nodes()))))
print 'Number of edges without filtering: ' + str(len(unfiltered_DG.edges()))
print '\n'
print 'Number of unique nodes with filtering: ' + str(len(list(set(DG.nodes()))))
print 'Number of edges with filtering: ' + str(len(DG.edges()))

Number of unique nodes without filtering: 3580
Number of edges without filtering: 40106


Number of unique nodes with filtering: 687
Number of edges with filtering: 1505


### Differentially Expressed Genes
<a id='deg'></a>

Our last database comes from a list of differentially expressed genes (DEG) found in an experiment. This is the user input: the genes we wish to analyze using our stat_analysis module. The TF graph created previously serves as our point of reference for these DEG's.

Our DEG file must be a tab-separated file in which each row represents a gene. It must contain three columns titled "adj_p_value" (adjusted p-value), "gene_symbol", and "fold_change" (fold change or log fold change). Additional columns are allowed.
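To make the expected file layout concrete, here is a small sketch that parses such a table with the standard library. It is not the create_graph.create_DEG_list implementation: the helper `load_deg_table`, the rule that rows are kept when the adjusted p-value is strictly below the filter, and the toy rows are assumptions. The code is written for Python 3.

```python
import csv
import io


def load_deg_table(handle, p_value_filter):
    """Read a tab-separated DEG table and keep rows that pass the p-value filter.

    Assumption: a row is kept when adj_p_value < p_value_filter.
    Returns (gene list, gene -> adjusted p-value, gene -> fold change).
    """
    deg_to_pvalue = {}
    deg_to_updown = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        p_value = float(row["adj_p_value"])
        if p_value < p_value_filter:
            gene = row["gene_symbol"]
            deg_to_pvalue[gene] = p_value
            deg_to_updown[gene] = float(row["fold_change"])
    return sorted(deg_to_pvalue), deg_to_pvalue, deg_to_updown


# Toy file contents (invented values); the extra "note" column is ignored.
toy_file = io.StringIO(
    "gene_symbol\tadj_p_value\tfold_change\tnote\n"
    "RPS24\t0.01\t1.8\tx\n"
    "PTEN\t0.20\t-0.9\ty\n"
    "MYB\t0.75\t0.1\tz\n"
)

deg_list, deg_to_pvalue, deg_to_updown = load_deg_table(toy_file, 0.3)
print(deg_list)  # ['PTEN', 'RPS24']
```

With a filter of 0.3, the MYB row is dropped and the two remaining genes keep their fold-change values for later use.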

In [18]:
filename = '../differentially_expressed_genes.txt'
p_value_filter = 0.3
DEG_list, DEG_to_pvalue, DEG_to_updown = create_graph.create_DEG_list(filename, p_value_filter) # load DEG list from file
DG = create_graph.add_updown_from_DEG(DG, DEG_to_updown) # add DEG's information to our graph 
len(DEG_list)

12906

## Stat Analysis
<a id='stat'></a>

Our stat_analysis package can help identify TF's that are statistically significant to our set of DEG's.

### P-value
<a id='pvalue'></a>

Our p-value function calculates the log of the p-value for every TF in the graph using [scipy.stats.hypergeom.logsf](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.hypergeom.html). These values help us determine which TF's are actually associated with our DEG's. A high value (we are working with logs, not raw p-values) suggests that the TF is correlated with its DEG targets, and therefore that the TF is likely responsible for some of the observed gene expression. A value of zero means that none of the TF's targets were DEG's.
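For readers unfamiliar with the hypergeometric test, the following sketch computes the log of the upper-tail probability directly from binomial coefficients. It is an illustration only, not the stat_analysis code: the function name `hypergeom_log_tail` and the mapping of the arguments onto the network (background size, number of DEG's, number of targets, number of DEG targets) are assumptions. The same quantity is what `scipy.stats.hypergeom.logsf(k - 1, N, K, n)` returns, since `logsf(x)` is the log of P(X > x). Requires Python 3.8 or later for `math.comb`.

```python
from math import comb, exp, log


def hypergeom_log_tail(k, N, K, n):
    """Return log P(X >= k) for a hypergeometric variable X.

    Assumed mapping (hypothetical, for illustration):
      N: genes in the background network
      K: genes in the background network that are DEG's
      n: targets of the TF
      k: targets of the TF that are DEG's
    """
    upper = min(n, K)
    # Count the draws of n genes that contain at least k DEG's.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return log(tail) - log(comb(N, n))


# Toy numbers: 10 genes, 4 of them DEG's, a TF with 3 targets, 2 of which are DEG's.
log_p = hypergeom_log_tail(k=2, N=10, K=4, n=3)
print(exp(log_p))  # probability of seeing 2 or more DEG targets by chance

# With k = 0 the tail covers every outcome, so the log p-value is 0.
print(hypergeom_log_tail(k=0, N=10, K=4, n=3))
```

Negating the result turns small p-values into large positive scores. Whether the module applies exactly this sign convention is an assumption here.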

In [23]:
stat_analysis.tf_pvalues(DG, unfiltered_DG, DEG_list)

MRPL44          inf
RBBP5           inf
TOPORS          inf
HBN             inf
RPS24     74.214855
RPL23A    73.436222
RPL8      69.506025
RPS3      68.456336
RPS23     63.336969
DL        26.761561
MED15     20.992895
MRPL24    18.756252
TAF2      14.757389
TAF7      14.299399
MED1      14.072495
REL       12.883480
INTS6     11.623171
TAF11     10.987837
SLBO      10.056217
INTS4      9.915756
HDAC3      9.915756
INTS8      9.293373
Z          6.966166
RNPS1      6.857867
TAF1       6.789783
TAF5       6.789783
TAF4       6.642762
MCM2       6.306885
PTEN       5.990586
SVP        5.367390
            ...    
MAX        0.000000
PXN        0.000000
SU(HW)     0.000000
ILK        0.000000
MYB        0.000000
SPEN       0.000000
SF1        0.000000
ABD-B      0.000000
REPO       0.000000
ZEN2       0.000000
PNR        0.000000
GSTO1      0.000000
ANTP       0.000000
UTX        0.000000
MAD        0.000000
HKB        0.000000
CNOT4      0.000000
SU(H)      0.000000
PAN        0.000000


### Z-score
<a id='zscore'></a>

The goal of our z-score function is to predict the activation states of the TF's. We observe how a TF relates to each of its targets to make our prediction. We compare each target's observed gene regulation (either up or down) with each TF-target interaction (whether it is activating or inhibiting) to conclude whether a TF is activating or inhibiting. A positive value indicates activating, while a negative value indicates inhibiting. A value of zero means that we did not have enough information about the target or the TF-target interaction to make the prediction.

Our z-score calculator has two calculation methods: an unbiased formula and a bias-corrected formula. The bias-corrected method calculates how "biased" the background network is towards either activating or inhibiting interactions, then adjusts its calculations to that non-zero average. Our bias_filter parameter is a number between 0 and 1 that sets the threshold at which the graph is considered "too biased", in which case the bias z-score formula must be used.
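The unbiased case can be sketched with the z-score from the Ingenuity whitepaper linked in the Background section: each target with known regulation and a known interaction sign contributes +1 or -1, and the sum is divided by the square root of the number of contributing targets. This is a sketch under assumptions, not the stat_analysis.tf_zscore source: the function name `activation_zscore`, the dictionary layout, and the toy data are invented, and the bias-corrected variant is not shown.

```python
from math import sqrt


def activation_zscore(target_directions, edge_signs):
    """Unbiased activation z-score for one TF.

    target_directions: target -> +1 (up-regulated) or -1 (down-regulated)
    edge_signs:        target -> +1 (TF activates) or -1 (TF inhibits)
    Targets missing either piece of information are skipped.
    """
    votes = []
    for target, direction in target_directions.items():
        sign = edge_signs.get(target, 0)
        if direction != 0 and sign != 0:
            # +1 when the observation is consistent with an active TF, else -1.
            votes.append(direction * sign)
    if not votes:
        return 0.0  # not enough information to make a prediction
    return sum(votes) / sqrt(len(votes))


# Toy TF with four targets (invented values).
toy_directions = {"a": 1, "b": -1, "c": 1, "d": 1}
toy_signs = {"a": 1, "b": -1, "c": 1, "d": -1}

z = activation_zscore(toy_directions, toy_signs)
print(z)  # 1.0: three consistent targets, one inconsistent, over sqrt(4)
```

Under this formula, a TF whose nine informative targets all agree would score exactly 3.0, and one with no informative targets scores 0.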

In [27]:
unbiased_zscores = stat_analysis.tf_zscore(DG, DEG_list, bias_filter = 1) # To explicitly use not-biased formula, set to 1
unbiased_zscores

RPS24      7.229569
RPS3       7.069980
RPL23A     7.020021
RPS23      7.020021
RPL8       6.077702
MED15      3.741657
MED1       3.000000
TAF1       3.000000
TAF5       3.000000
DL         2.840188
TAF11      2.828427
MRPL24     2.828427
TAF4       2.828427
TAF2       2.828427
TAF12      2.828427
TAF7       2.449490
REL        2.449490
TAF6       2.449490
MCM2       2.333333
CDC27      2.236068
MCM3       2.236068
INTS4      2.000000
NUP133     2.000000
HDAC3      2.000000
RAE1       2.000000
INTS6      2.000000
SVP        1.732051
INTS8      1.732051
BUB3       1.732051
CDC16      1.732051
             ...   
DFD        0.000000
VHL        0.000000
BAP        0.000000
MAD        0.000000
MYB        0.000000
TLL        0.000000
SCR        0.000000
PAN        0.000000
GCM        0.000000
TBP        0.000000
KR         0.000000
H          0.000000
ANTP       0.000000
ZEN2       0.000000
SU(HW)     0.000000
RNPS1     -0.377964
SIN3A     -1.000000
PTEN      -1.000000
HBN       -1.000000


In [26]:
stat_analysis.tf_zscore(DG, DEG_list, bias_filter = 0) # To explicitly use bias formula, set to 0

Graph has bias of 0.637209302326. Adjusting z-score calculation accordingly.


RPS24      61.537275
RPL23A     58.234485
RPS23      54.468443
RPL8       49.379552
RPS3       46.947623
MED15       8.105357
TAF1        6.499651
TAF4        6.461631
TAF5        6.118121
TAF2        5.462191
TAF12       5.371641
MED1        4.795125
TAF11       4.565742
CDC27       4.263537
MRPL24      4.217432
TAF6        3.984691
REL         3.424306
TAF7        3.335306
DL          3.152367
MCM3        2.680626
NUP133      2.398179
INTS6       2.317814
RAE1        2.242446
CDC16       2.127297
INTS4       1.999597
HDAC3       1.867017
MCM2        1.811704
BUB3        1.654499
INTS8       1.654499
SVP         1.388918
             ...    
TRL         0.000000
GCM         0.000000
DFD         0.000000
ACHI        0.000000
OPTIX       0.000000
TAZ         0.000000
ZEN2        0.000000
SPEN        0.000000
GSTO1       0.000000
KR          0.000000
RFC4       -0.150186
MCM7       -0.611082
AKT1       -1.439509
VHL        -1.802300
MCM6       -1.950101
HBN        -2.315364
STAT92E    -2

Our stat_analysis module also includes methods to help you find the TF's most statistically relevant to your input DEG data set.
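The selection logic can be sketched on a plain dictionary of scores. The parameter names `act`, `abs_value`, and `top` are taken from the calls in this notebook, but the function `top_values_sketch`, its default for `top`, and its behavior (in particular that `abs_value=True` ignores `act`) are assumptions rather than the module's actual code. The real function operates on a pandas Series.

```python
def top_values_sketch(scores, act=True, abs_value=False, top=10):
    """Return the `top` (name, score) pairs from a name -> z-score mapping.

    abs_value=True : rank by magnitude, regardless of sign (assumed to ignore act)
    act=True       : rank from most positive (activating) downwards
    act=False      : rank from most negative (inhibiting) upwards
    """
    items = list(scores.items())
    if abs_value:
        items.sort(key=lambda item: abs(item[1]), reverse=True)
    else:
        items.sort(key=lambda item: item[1], reverse=act)
    return items[:top]


# Toy scores (invented values).
toy_scores = {"A": 7.2, "B": -2.4, "C": 0.0, "D": 3.0, "E": -1.0}

print(top_values_sketch(toy_scores, act=True, top=2))    # [('A', 7.2), ('D', 3.0)]
print(top_values_sketch(toy_scores, act=False, top=2))   # [('B', -2.4), ('E', -1.0)]
print(top_values_sketch(toy_scores, abs_value=True, top=2))  # [('A', 7.2), ('D', 3.0)]
```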

In [28]:
stat_analysis.top_values(unbiased_zscores, act = True, abs_value = False, top = 5) # top 5 activating

RPS24     7.229569
RPS3      7.069980
RPL23A    7.020021
RPS23     7.020021
RPL8      6.077702
dtype: float64

In [29]:
stat_analysis.top_values(unbiased_zscores, act = False, abs_value = False, top = 5) # top 5 inhibiting

SLBO      -2.449490
KLHL18    -1.414214
PAX       -1.414214
Z         -1.414214
STAT92E   -1.000000
dtype: float64

In [32]:
stat_analysis.top_values(unbiased_zscores, act = False, abs_value = True, top = 5) # top 5 overall (happen to all be activating)

RPS24     7.229569
RPS3      7.069980
RPL23A    7.020021
RPS23     7.020021
RPL8      6.077702
dtype: float64