# Upstream Regulator Analysis Package Case Study

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 8th November, 2017

----------------------

<a id='toc'></a>
## Table of Contents
1. [Import packages](#import)
2. [Background](#background)
3. [Create Graph](#graph)
    1. [Transcription Factors](#tf)
    2. [Background Network](#bn)
    3. [Differentially Expressed Genes](#deg)
4. [Stat Analysis](#stat)
    1. [P-value](#pvalue)
    2. [Z-score](#zscore)

## Import packages
<a id='import'></a>

In [1]:
import sys
import time
import networkx as nx
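# make the local ura modules importable
code_path = '../ura'
sys.path.append(code_path)
import create_graph
import stat_analysis
# pick up local edits to the modules without restarting the kernel (Python 2 builtin)
reload(create_graph)
reload(stat_analysis)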

<module 'stat_analysis' from '../ura\stat_analysis.pyc'>

## Background
<a id='background'></a>

The inspiration for these modules comes from [Ingenuity](http://pages.ingenuity.com/rs/ingenuity/images/0812%20upstream_regulator_analysis_whitepaper.pdf).

## Create Graph
<a id='graph'></a>

### Transcription Factors
<a id='tf'></a>

Our create_graph module prepares three data sets for our analysis: a transcription factor (TF) interaction network, a background network, and a list of differentially expressed genes. Our transcription factor data comes from two sources: [slowkow](https://github.com/slowkow/tftargets) and [jaspar](http://jaspar.genereg.net/). Slowkow is a compilation of six smaller databases: TRED, ITFP, ENCODE, Neph2012, TRRUST, and Marbach2016. Information about these databases, including links to their sources, is available on the slowkow GitHub page. The size of each database is as follows:
- Size of Jaspar Database: 2049
- Size of Slowkow Database: 2705
- Size of TRED: 133
- Size of ITFP: 1974
- Size of ENCODE: 157
- Size of Neph2012: 536
- Size of TRRUST: 748
- Size of Marbach2016: 643 

Our transcription factor list contains 3984 unique entries.

In [2]:
TF_list = create_graph.create_TF_list(slowkow_bool=True,
                    slowkow_files=['../slowkow_databases/TRED_TF.txt',
                                   '../slowkow_databases/ITFP_TF.txt',
                                   '../slowkow_databases/ENCODE_TF.txt',
                                   '../slowkow_databases/Neph2012_TF.txt',
                                   '../slowkow_databases/TRRUST_TF.txt',
                                   '../slowkow_databases/Marbach2016_TF.txt'],
                    slowkow_sep = '\n',
                    jaspar_bool=True,
                    jaspar_file="../jaspar_genereg_matrix.txt")
len(TF_list)

3984
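
The merged list is smaller than the sum of the individual database sizes (2049 + 2705 = 4754), so the sources clearly overlap. A minimal sketch of that de-duplication step follows. It is not the code behind `create_graph.create_TF_list`: the helper name `merge_tf_sources` and the toy gene lists are hypothetical, and it works on in-memory lists rather than the files used above.

```python
def merge_tf_sources(*sources):
    """Return the sorted, de-duplicated union of several iterables of gene symbols."""
    merged = set()
    for source in sources:
        for name in source:
            name = name.strip()
            if name:  # skip blank entries, e.g. from a trailing separator
                merged.add(name)
    return sorted(merged)


# Toy stand-ins for two database files (hypothetical contents).
tred_like = ["SP1", "GATA1", "SP1"]
jaspar_like = ["GATA1", "LHX2", ""]

tf_names = merge_tf_sources(tred_like, jaspar_like)
```

Here `tf_names` holds three symbols even though the two inputs contain six entries in total, because duplicates and the blank entry are dropped.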

### Background Network
<a id='bn'></a>

Our background network comes from the [STRING](https://string-db.org/) database. We use our list of transcription factors (TF_list) to extract only the relevant information from STRING: we pull out the edges that belong to our transcription factors (the entries in TF_list) and use those edges to create a graph. After this filtering process, our graph (labeled DG) has 4615 nodes and 28946 edges.

In [3]:
#STRING_DF, db_edges, db_sign_att = create_graph.load_and_process_small_STRING(filename="../STRING_network.xlsx")
unfiltered_DG = create_graph.load_STRING_to_digraph(filename = "../9606.protein.actions.v10.5.txt", confidence_filter=400)
#unfiltered_DG = create_graph.load_small_STRING_to_digraph(filename="../STRING_network.xlsx")
print '\n'
print 'Total number of interactions in STRING before filtering: ' + str(len(unfiltered_DG.edges()))

DG = create_graph.filter_digraph(unfiltered_DG, TF_list)
print 'Number of unique nodes after filtering: ' + str(len(list(set(DG.nodes()))))
print 'Number of edges after filtering: ' + str(len(DG.edges()))

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-7374...done.
Finished.
312 input query terms found no hit:
	[u'ENSP00000376684', u'ENSP00000289352', u'ENSP00000202788', u'ENSP00000373637', u'ENSP00000367802',
Pass "returnall=True" to return complete lists of duplicate or missing query terms.


Total number of interactions in STRING before filtering: 140418
Number of unique nodes after filtering: 4615
Number of edges after filtering: 28946
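
The filtering idea can be sketched on a plain edge list. This is an illustration under assumptions, not the implementation of `create_graph.filter_digraph` or `load_STRING_to_digraph`: the helper `filter_edges_by_source` is hypothetical, it assumes that an edge is kept when its source node is a TF, and it assumes that the confidence cutoff is a minimum score. It uses tuples instead of a networkx graph so that it stays self-contained.

```python
def filter_edges_by_source(edges, tf_names, min_score=0):
    """Keep (source, target, score) edges whose source is a TF and whose score passes the cutoff."""
    tf_set = set(tf_names)  # set membership keeps the lookup fast for thousands of TFs
    return [(source, target, score)
            for source, target, score in edges
            if source in tf_set and score >= min_score]


# Toy edges (hypothetical): STRING-style scores range from 0 to 1000.
toy_edges = [
    ("SP1", "GENE_A", 900),   # TF source, high confidence: kept
    ("SP1", "GENE_B", 300),   # TF source, below the cutoff: dropped
    ("GENE_C", "SP1", 950),   # source is not a TF: dropped
]

kept = filter_edges_by_source(toy_edges, ["SP1"], min_score=400)
```

Only the first toy edge survives both conditions.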


### Differentially Expressed Genes
<a id='deg'></a>

Our last data set is a list of differentially expressed genes (DEGs) found in an experiment. These are the genes we wish to analyze using our stat_analysis module. The TF graph created previously serves as our point of reference for these DEGs. Our DEG list (DEG_list) contains 2782 genes.

In [4]:
DEG_list = create_graph.add_updown_from_DEG(DG, DEG_filename="../differencially_expressed_genes.txt", DEG_filter_value=0.3)
len(DEG_list)

2782

## Stat Analysis

Our stat_analysis package helps identify TFs that are statistically associated with our set of DEGs.

### P-value

Our p-value function calculates a log-scale p-value for every TF in the graph using [scipy.stats.hypergeom.logsf](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.hypergeom.html). These values help us determine which TFs are actually associated with our DEGs. Because we work with logs rather than raw p-values, a high value signals a small p-value; the values in the output are non-negative, which suggests the function reports the negated log. A TF with a high value has more DEGs among its targets than chance would predict, so it is likely that this TF is responsible for some of our observed gene expression. Note that a value of zero means that none of the TF's targets were DEGs.
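
To make the scoring concrete, here is a minimal sketch of a hypergeometric enrichment score written with the standard library only. It is not the body of `stat_analysis.tr_pvalues`. The function name `enrichment_score`, the choice of the natural log, and the mapping of the four parameters to background genes, DEGs, and TF targets are all assumptions. With scipy installed, the same quantity should equal `-scipy.stats.hypergeom.logsf(k - 1, M, n, N)`.

```python
from math import exp, lgamma, log


def log_choose(n, k):
    """Natural log of the binomial coefficient C(n, k), for 0 <= k <= n."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def enrichment_score(k, M, n, N):
    """Return -ln P(X >= k) for a hypergeometric X.

    Assumed meaning of the parameters:
    M: genes in the background, n: DEGs among them,
    N: targets of the TF, k: targets of the TF that are DEGs.
    """
    if k <= 0:
        return 0.0  # P(X >= 0) is 1, so the score is zero
    upper = min(n, N)
    if k > upper:
        return float("inf")  # impossible overlap: the p-value is zero
    lower = max(k, N - (M - n))  # smallest overlap that is actually possible
    log_total = log_choose(M, N)
    p = sum(exp(log_choose(n, i) + log_choose(M - n, N - i) - log_total)
            for i in range(lower, upper + 1))
    if p <= 0.0:
        return float("inf")  # the tail probability underflowed
    return max(0.0, -log(p))


# Toy numbers (hypothetical): 10 background genes, 4 DEGs, a TF with 3 targets, 2 of them DEGs.
score = enrichment_score(2, 10, 4, 3)
```

In the toy case the tail probability is (36 + 4) / 120 = 1/3, so `score` is ln 3. A TF with no DEG targets scores exactly zero, which matches the zero entries described above.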

In [5]:
stat_analysis.tr_pvalues(DG, list(unfiltered_DG.edges()), DEG_list)

{'HIF3A': 0.92191939429333325,
 u'SMARCC2': 0.92191939429333325,
 u'SMARCC1': 3.0873425435201187,
 'NCBP2': 0.53045611443476992,
 'RNF13': 0,
 'NELFB': 1.2689545284379755,
 'MNT': 0,
 'PSPC1': 5.7100881255537388,
 u'MEN1': 3.0873425435201187,
 'FKBP4': 0,
 u'HIPK2': 2.6862153016864703,
 'PDCD11': 0,
 'FKBP8': 0,
 'LHX2': 5.7100881255537388,
 u'LHX3': 2.8118790049350224,
 'LHX4': 0,
 'LHX6': 0,
 'EN1': 0,
 u'LHX8': 0,
 'NR0B2': 0.89585899417497972,
 'NR0B1': 2.2658571064060014,
 'MIB1': inf,
 u'BTAF1': 2.8118790049350224,
 'DRAP1': 0,
 'IRX5': inf,
 'IRX4': 0,
 u'OFD1': 4.4374427658750397,
 'IRX6': 0,
 'IRX3': 0,
 'FOXQ1': 1.9732911774035751,
 'DHX9': 2.2218044258809373,
 'ARHGEF12': 2.6218186350600003,
 'FBXO11': 0,
 'TCOF1': 3.6431717202005465,
 u'NR5A1': 0.58885080779158561,
 'XPC': 0.66047090658086016,
 'SP1': 0.60102993218211931,
 'SP2': 0,
 'SP3': 2.6218186350600003,
 u'PPP2R2A': 5.9867901524227207,
 'PPP2R2B': 6.5766464004770642,
 u'EHF': 0,
 u'DLX4': 0,
 'OSBPL8': 0,
 u'NUP50': 

### Z-score

The goal of our z-score function is to predict the activation state of each TF. We observe how a TF relates to each of its targets to make our prediction: we compare each target's observed gene regulation (either up or down) with the corresponding TF-target interaction (either activating or inhibiting) to conclude whether the TF is activated or inhibited. A positive value indicates an activated TF, while a negative value indicates an inhibited TF. A value of zero means that we did not have enough information about the target or the TF-target interaction to make the prediction.
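
The comparison described above can be sketched as a vote count. The sample output is consistent with scores of the form (sum of votes) / sqrt(number of votes), for example -0.7071 = -1/sqrt(2) and 1.7321 = 3/sqrt(3), which is the form used in the Ingenuity whitepaper. Whether `stat_analysis.tr_zscore` computes exactly this is an assumption, and the helper `activation_zscore` and its input encoding are hypothetical.

```python
from math import sqrt


def activation_zscore(targets):
    """Score one TF from its targets.

    targets: list of (edge_sign, regulation) pairs, where edge_sign is +1 for an
    activating edge and -1 for an inhibiting edge, and regulation is +1 for an
    up-regulated target and -1 for a down-regulated one. Use 0 for unknown values.
    """
    # A target votes +1 when its regulation agrees with an activated TF, -1 otherwise.
    votes = [sign * regulation
             for sign, regulation in targets
             if sign != 0 and regulation != 0]
    if not votes:
        return 0  # nothing informative to vote with
    return sum(votes) / sqrt(len(votes))


# Toy TF (hypothetical): two activating edges to up-regulated targets.
z_up = activation_zscore([(1, 1), (1, 1)])
# Toy TF (hypothetical): inhibiting edge to an up-regulated target, activating
# edge to a down-regulated target, activating edge to an up-regulated target.
z_mixed = activation_zscore([(-1, 1), (1, -1), (1, 1)])
```

Here `z_up` is 2/sqrt(2) and `z_mixed` is -1/sqrt(3). Dividing by the square root of the vote count keeps TFs with many targets from dominating purely because of their size.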

In [6]:
not_biased_zcsores = stat_analysis.tr_zscore(DG, DEG_list)
not_biased_zcsores

{'HIF3A': 1.0,
 u'SMARCC2': 1.0,
 u'SMARCC1': 0.0,
 'NCBP2': -0.7071067811865475,
 'RNF13': 0,
 'NELFB': -1.0,
 'MNT': 0,
 'PSPC1': -1.414213562373095,
 u'MEN1': 0.0,
 'FKBP4': 0,
 u'HIPK2': -1.7320508075688774,
 'PDCD11': 0,
 'FKBP8': 0,
 'LHX2': 0.0,
 u'LHX3': 1.0,
 'LHX4': 0,
 'LHX6': 0,
 'EN1': 0,
 u'LHX8': 0,
 'NR0B2': 1.0,
 'NR0B1': 1.414213562373095,
 'MIB1': 1.414213562373095,
 u'BTAF1': -1.0,
 'DRAP1': 0,
 'IRX5': 1.0,
 'IRX4': 0,
 u'OFD1': 1.7320508075688774,
 'IRX6': 0,
 'IRX3': 0,
 'FOXQ1': 0.0,
 'DHX9': 1.0,
 'ARHGEF12': -0.5773502691896258,
 'FBXO11': 0,
 'TCOF1': -1.414213562373095,
 u'NR5A1': 0.0,
 'XPC': 1.414213562373095,
 'SP1': 0.0,
 'SP2': 0,
 'SP3': 1.414213562373095,
 u'PPP2R2A': 1.4596008983995234,
 'PPP2R2B': 1.2247448713915892,
 u'EHF': 0,
 u'DLX4': 0,
 'OSBPL8': 0,
 u'NUP50': 0.3333333333333333,
 u'SMARCD3': 0.0,
 'SMARCD2': 0,
 u'SMARCD1': 1.0,
 'ABLIM2': -1.0,
 u'GATA6': 0,
 u'GATA4': -0.3333333333333333,
 u'GATA5': 0,
 u'GATA2': -1.3416407864998738,
 'IFT1

In [7]:
TR_to_bias = stat_analysis.calculate_bias(DG, DEG_list)
bias_zscores = stat_analysis.bias_corrected_tr_zscore(DG, DEG_list, TR_to_bias)
bias_zscores

{'HIF3A': 0.0,
 u'SMARCC2': 0.0,
 u'SMARCC1': 0.0,
 'NCBP2': -3.0,
 'RNF13': 0,
 'NELFB': 0.0,
 'MNT': 0,
 'PSPC1': 0.0,
 u'MEN1': 0.0,
 'FKBP4': 0,
 u'HIPK2': -2.6666666666666665,
 'PDCD11': 0,
 'FKBP8': 0,
 'LHX2': 0.0,
 u'LHX3': 0.0,
 'LHX4': 0,
 'LHX6': 0,
 'EN1': 0,
 u'LHX8': 0,
 'NR0B2': 2.0,
 'NR0B1': 2.0,
 'MIB1': 0.0,
 u'BTAF1': 0.0,
 'DRAP1': 0,
 'IRX5': 0.0,
 'IRX4': 0,
 u'OFD1': 0.0,
 'IRX6': 0,
 'IRX3': 0,
 'FOXQ1': 0.0,
 'DHX9': 0.0,
 'ARHGEF12': -2.2204460492503131e-16,
 'FBXO11': 0,
 'TCOF1': 0.0,
 u'NR5A1': 0.0,
 'XPC': 0.0,
 'SP1': 0.0,
 'SP2': 0,
 'SP3': 0.0,
 u'PPP2R2A': 4.1739130434782616,
 'PPP2R2B': 3.0,
 u'EHF': 0,
 u'DLX4': 0,
 'OSBPL8': 0,
 u'NUP50': -4.4408920985006262e-16,
 u'SMARCD3': 0.0,
 'SMARCD2': 0,
 u'SMARCD1': 0.0,
 'ABLIM2': 0.0,
 u'GATA6': 0,
 u'GATA4': -2.6666666666666661,
 u'GATA5': 0,
 u'GATA2': -3.2000000000000002,
 'IFT122': 0.0,
 u'GATA1': -0.5714285714285714,
 u'GTF2F1': 2.4000000000000004,
 'AKAP1': 0,
 'WDTC1': 0,
 u'GTF2F2': -1.7142857142

In [8]:
# compare the uncorrected z-score with the bias-corrected z-score
# each entry has the form (bias, uncorrected z-score, bias-corrected z-score)
# note: this relies on the three dicts iterating over their keys in the same order
zip(TR_to_bias.values(), not_biased_zcsores.values(), bias_zscores.values())

[(1.0, 1.0, 0.0),
 (1.0, 1.0, 0.0),
 (0.0, 0.0, 0.0),
 (0.125, -0.7071067811865475, -3.0),
 (0, 0, 0),
 (-1.0, -1.0, 0.0),
 (0, 0, 0),
 (-1.0, -1.414213562373095, 0.0),
 (0.0, 0.0, 0.0),
 (0, 0, 0),
 (-0.1111111111111111, -1.7320508075688774, -2.6666666666666665),
 (0, 0, 0),
 (0, 0, 0),
 (0.0, 0.0, 0.0),
 (1.0, 1.0, 0.0),
 (0, 0, 0),
 (0, 0, 0),
 (0, 0, 0),
 (0, 0, 0),
 (0.0, 1.0, 2.0),
 (0.0, 1.414213562373095, 2.0),
 (1.0, 1.414213562373095, 0.0),
 (-1.0, -1.0, 0.0),
 (0, 0, 0),
 (1.0, 1.0, 0.0),
 (0, 0, 0),
 (1.0, 1.7320508075688774, 0.0),
 (0, 0, 0),
 (0, 0, 0),
 (-0.0, 0.0, 0.0),
 (1.0, 1.0, 0.0),
 (-0.3333333333333333, -0.5773502691896258, -2.2204460492503131e-16),
 (0, 0, 0),
 (-1.0, -1.414213562373095, 0.0),
 (0.0, 0.0, 0.0),
 (1.0, 1.414213562373095, 0.0),
 (0.0, 0.0, 0.0),
 (0, 0, 0),
 (1.0, 1.414213562373095, 0.0),
 (0.12287334593572778, 1.4596008983995234, 4.1739130434782616),
 (0.125, 1.2247448713915892, 3.0),
 (0, 0, 0),
 (0, 0, 0),
 (0, 0, 0),
 (0.1111111111111111, 0.33