# Upstream Regulator Analysis Package
## Arthritis Case Study

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 19th January, 2018

----------------------

<a id='toc'></a>
## Table of Contents
1. [Background](#background)
2. [Import packages](#import)
3. [Load Networks](#load)
4. [P-value and Z-score Calculation](#pz)
5. [Compare the Ingenuity Article's Results to Ours](#comp)
6. [Display Our Results](#display)

## Background
<a id='background'></a>

TODO: Add background on where these genes come from. Feedback from Katie and Adam on the motivation would also fit well here.

## Import packages
<a id='import'></a>

In [10]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# import upstream regulator modules
import sys
code_path = '../ura'
sys.path.append(code_path)
import create_graph
import stat_analysis
reload(create_graph)
reload(stat_analysis)

# network visualization package
# (pip install visJS2jupyter)
import visJS2jupyter.visJS_module as visJS_module

## Load Networks
<a id='load'></a>

1. List of all **Transcription Factors** (TF's) or regulators of interest to us
    <br>
    - Our sources are [slowkow](https://github.com/slowkow/tftargets) and [jaspar](http://jaspar.genereg.net/) TF databases
    <br><br>
2. **Background Network**: [STRING human protein interactions network](https://string-db.org/cgi/download.pl?UserId=9BGA8WkVMRl6&sessionId=HWUK6Dum9xC6&species_text=Homo+sapiens)  
    - Filter our background network down to just the subnetwork of TF's and their targets
    <br><br>
3. User-supplied list of **Differentially Expressed Genes** (DEG's)
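The filtering step in item 2 can be illustrated with a small, self-contained sketch. This is not the `create_graph` implementation; it assumes that "filtering" means keeping only edges whose source node is a TF, and the gene names, the `sign` edge attribute, and the helper `filter_to_tf_subnetwork` are all hypothetical.

```python
import networkx as nx

# Toy background network: directed edges point from regulator to target.
# Gene names and the "sign" attribute are made up for illustration only.
background = nx.DiGraph()
background.add_edges_from([
    ("TF_A", "gene1", {"sign": 1}),
    ("TF_A", "gene2", {"sign": -1}),
    ("gene1", "gene3", {"sign": 1}),  # source is not a TF, so this edge is dropped
    ("TF_B", "gene3", {"sign": 1}),
])

tf_list = ["TF_A", "TF_B", "TF_C"]  # TF_C has no edges in the background


def filter_to_tf_subnetwork(graph, tfs):
    """Keep only the edges whose source node is a known TF (hypothetical helper)."""
    tf_set = set(tfs)
    kept = [(u, v) for u, v in graph.edges() if u in tf_set]
    # edge_subgraph returns a view; copy() makes an independent graph
    return graph.edge_subgraph(kept).copy()


tf_network = filter_to_tf_subnetwork(background, tf_list)
print(sorted(tf_network.edges()))
```

TFs with no outgoing edges in the background network (here `TF_C`) simply do not appear in the filtered graph.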

In [2]:
# transcription factors
TF_list = create_graph.create_TF_list(slowkow_bool=True,
                    slowkow_files=['../slowkow_databases/TRED_TF.txt',
                                   '../slowkow_databases/ITFP_TF.txt',
                                   '../slowkow_databases/ENCODE_TF.txt',
                                   '../slowkow_databases/Neph2012_TF.txt',
                                   '../slowkow_databases/TRRUST_TF.txt',
                                   '../slowkow_databases/Marbach2016_TF.txt'],
                    slowkow_sep = '\n',
                    jaspar_bool=True,
                    jaspar_file="../jaspar_genereg_matrix.txt")
print("Number of TF's: " + str(len(TF_list)))

Number of TF's: 3983


In [3]:
# background network
filename = "../9606.protein.actions.v10.5.txt"
confidence_filter=400
DG_TF, DG_universe = create_graph.load_STRING_to_digraph(filename, confidence_filter, TF_list) # supplying TF_list filters network
print("Number of interactions: " + str(len(list(DG_TF.edges()))))

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-7374...done.
Finished.
56 input query terms found dup hits:
	[(u'ENSP00000359550', 3), (u'ENSP00000447879', 2), (u'ENSP00000364076', 2), (u'ENSP00000348986', 2),
312 input query terms found no hit:
	[u'ENSP00000376684', u'ENSP00000289352', u'ENSP00000202788', u'ENSP00000373637', u'ENSP00000367802',
Pass "returnall=True" to return complete lists of duplicate or missing query terms.
Number of interactions: 28931


In [4]:
# differentially expressed genes
DEG_filename = "DE_Coeff_OAvsNormal_OAvsNormal_20171215.csv" # must meet requirements, see Basic Example notebook
DEG_list, DEG_to_adj_pvalue, DEG_to_updown = create_graph.create_DEG_list(DEG_filename, p_value_filter = 0.3)
DG_TF = create_graph.add_updown_from_DEG(DG_TF, DEG_to_updown)
print("Number of DEG's: " + str(len(DEG_list)))

Number of DEG's: 7035
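
As a rough illustration of what a p-value filter on a DEG table might look like, here is a minimal pandas sketch. It is not the `create_graph.create_DEG_list` implementation: the column names (`gene_symbol`, `adj_p_value`, `fold_change`), the toy rows, the strict `<` comparison, and the helper `select_degs` are all assumptions made for this example.

```python
import io

import pandas as pd

# Toy DEG table. Column names and values are hypothetical.
csv_text = u"""gene_symbol,adj_p_value,fold_change
GENE1,0.01,1.8
GENE2,0.45,-0.3
GENE3,0.20,-2.1
GENE4,0.30,0.9
"""
deg_table = pd.read_csv(io.StringIO(csv_text))


def select_degs(table, p_value_filter):
    """Keep genes whose adjusted p-value is strictly below the filter (assumed rule)."""
    kept = table[table["adj_p_value"] < p_value_filter]
    deg_list = kept["gene_symbol"].tolist()
    deg_to_pvalue = dict(zip(kept["gene_symbol"], kept["adj_p_value"]))
    # The sign of the fold change encodes up- (positive) or down-regulation (negative)
    deg_to_updown = dict(zip(kept["gene_symbol"], kept["fold_change"]))
    return deg_list, deg_to_pvalue, deg_to_updown


deg_list, deg_to_pvalue, deg_to_updown = select_degs(deg_table, p_value_filter=0.3)
print(deg_list)
```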


## P-value and Z-score Calculation
<a id='pz'></a>

1. **P-values**: How relevant is a TF to its DEG targets? Are they connected by chance, or is their connection statistically significant?
    <br>
    1. Compute -log(p-value) for each TF using [scipy.stats.hypergeom.logsf](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.hypergeom.html).
        1. high value = significant connection between this TF and its DEG targets
        2. low value = the TF's association with its DEG targets is consistent with chance
        3. zero = none of this TF's targets were DEG's
        4. inf = the original p-value was so small that -log(p-value) evaluates to inf, meaning very high significance
        <br><br>
2. **Z-scores**: Goal is to predict the activation states of the TF's

    - activation states: interaction type/regulation direction = predicted state
        - activating/up = activating
        - activating/down = inhibiting
        - inhibiting/up = inhibiting
        - inhibiting/down = activating
        <br><br>
    - unbiased vs biased calculations:
        - **unbiased calculation**: Assumes a 50-50 split between up/down-regulated targets and between activating/inhibiting interactions, so the score is approximately normally distributed.
        - **biased calculation**: Used when you cannot assume a 50-50 split between up/down-regulated targets and activating/inhibiting interactions. The formula is modified so that the score still approximates a normal distribution.
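
The hypergeometric test behind the -log(p-value) can be sketched as follows. This is only an illustration, not the `stat_analysis.tf_pvalues` implementation: the helper `neg_log_pvalue`, its argument names, and the toy counts are assumptions. `scipy.stats.hypergeom.logsf` returns a natural log, so the result is -ln(p).

```python
from scipy.stats import hypergeom


def neg_log_pvalue(universe_size, num_degs, num_targets, num_deg_targets):
    """Return -ln(p) for seeing at least num_deg_targets DEGs among a TF's targets.

    universe_size   : number of genes in the background network
    num_degs        : number of DEGs in the background network
    num_targets     : number of targets of this TF
    num_deg_targets : number of this TF's targets that are DEGs
    """
    if num_deg_targets == 0:
        return 0.0  # none of the TF's targets are DEGs
    # sf(k - 1) = P(X >= k), so shift by one to include the observed count
    log_p = hypergeom.logsf(num_deg_targets - 1, universe_size, num_degs, num_targets)
    return float(-log_p)


# Toy numbers: 1000 genes, 100 DEGs, a TF with 20 targets (2 DEG targets expected by chance)
enriched = neg_log_pvalue(1000, 100, 20, 10)   # far more DEG targets than expected
by_chance = neg_log_pvalue(1000, 100, 20, 2)   # about what chance predicts
no_overlap = neg_log_pvalue(1000, 100, 20, 0)  # no DEG targets at all
print(enriched, by_chance, no_overlap)
```

Working in log space with `logsf` avoids underflow for very small p-values, which is the usual reason to prefer it over taking the log of `sf` directly.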

In [5]:
p_values = stat_analysis.tf_pvalues(DG_TF, DG_universe, DEG_list)
z_scores = stat_analysis.tf_zscore(DG_TF, DEG_list, bias_filter = 0.25) # recommended bias filter is 0.25
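
For intuition about the unbiased z-score, here is a minimal sketch. It assumes the score has the form (activating predictions minus inhibiting predictions) divided by the square root of the number of DEG targets. Whether `stat_analysis.tf_zscore` uses exactly this formula is an assumption, and the helpers `predicted_state` and `unbiased_zscore` are hypothetical. The biased variant is not shown.

```python
import math


def predicted_state(interaction, direction):
    """Return +1 (activating) or -1 (inhibiting) for one TF-target edge.

    Follows the table above: activating/up and inhibiting/down predict
    activation; activating/down and inhibiting/up predict inhibition.
    """
    sign = {"activating": 1, "inhibiting": -1}[interaction]
    change = {"up": 1, "down": -1}[direction]
    return sign * change


def unbiased_zscore(edges):
    """edges: list of (interaction type, regulation direction) pairs for a TF's DEG targets."""
    states = [predicted_state(interaction, direction) for interaction, direction in edges]
    if not states:
        return 0.0  # no DEG targets, so no evidence either way
    return sum(states) / math.sqrt(len(states))


consistent = unbiased_zscore([("activating", "up")] * 9)
mixed = unbiased_zscore([("activating", "up")] * 3 + [("inhibiting", "up")])
print(consistent, mixed)
```

A positive score suggests the TF is activated, a negative score suggests it is inhibited, and the magnitude grows with the number of targets that agree.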

## Display Our Results
<a id='display'></a>
- Display TF's with top z-scores
- Display where certain TF's rank among others and overall according to z-score
- Display subnetwork of a particular TF and its targets
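
Ranking TFs by the magnitude of their z-score can be sketched with plain pandas. The helpers `top_by_abs` and `rank_by_abs` are hypothetical stand-ins written for this example, not the `stat_analysis` functions, and the series holds only a handful of z-scores.

```python
import pandas as pd

# A few z-scores for illustration; 'Z' is deliberately absent from the series
z = pd.Series({"WASL": 3.411211, "RBL2": -3.356586, "NCF2": 3.207135, "HOXA4": 0.0})


def top_by_abs(scores, top):
    """Return the `top` entries with the largest absolute z-score, keeping their sign."""
    order = scores.abs().sort_values(ascending=False).index
    return scores.reindex(order).head(top)


def rank_by_abs(scores, genes):
    """Return the zero-based rank and z-score of each requested gene; NaN if unknown."""
    ordered = top_by_abs(scores, len(scores))
    ranks = pd.Series(range(len(ordered)), index=ordered.index)
    return pd.DataFrame({"rank": ranks.reindex(genes), "z-score": ordered.reindex(genes)})


top_two = top_by_abs(z, 2)
ranked = rank_by_abs(z, ["RBL2", "HOXA4", "Z"])
print(top_two)
print(ranked)
```

Genes that are missing from the score series (here `Z`) come back with NaN in both columns instead of raising an error.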

In [11]:
# Top TF's
top_overall = stat_analysis.top_values(z_scores, act = False, abs_value = True, top = 20)
display(top_overall.to_frame(name = 'top z-scores'))

| TF | top z-scores |
|--------|-----------|
| WASL | 3.411211 |
| RBL2 | -3.356586 |
| NCF2 | 3.207135 |
| FOXO3 | -3.170376 |
| NCOR1 | 3.0 |
| PIAS2 | 3.0 |
| TTC8 | 2.645751 |
| RORA | -2.645751 |
| TLE3 | 2.529822 |
| NFATC2 | -2.501851 |


In [7]:
# Ranking (in this case from 0 to 168)
genes_to_rank = ['HOXA4', 'CBX8', 'WASL','Z', 'TLE3', 'RBL2','TP53', 'STAT1', 'STAT5B']
stat_analysis.rank_and_score_df(z_scores, genes_to_rank, act = True, abs_value = True, remove_dups=True)

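| TF | rank | z-score |
|--------|-------|-----------|
| WASL | 0.0 | 3.411211 |
| RBL2 | 1.0 | -3.356586 |
| TLE3 | 6.0 | 2.529822 |
| STAT5B | 31.0 | 1.941451 |
| CBX8 | 35.0 | 1.889822 |
| STAT1 | 38.0 | 1.807392 |
| TP53 | 75.0 | 1.204076 |
| HOXA4 | 168.0 | 0.0 |
| Z | | |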


In [8]:
# display subnetworks using visJS2jupyter
stat_analysis.vis_tf_network(DG_TF, 'WASL', '../geo2r_GSE11352_brca_48hours.txt', DEG_list,
              directed_edges = False,
              node_spacing = 1000,
              graph_id = 1) 

In [9]:
stat_analysis.vis_tf_network(DG_TF, 'RBL2', '../geo2r_GSE11352_brca_48hours.txt', DEG_list,
              directed_edges = False,
              node_spacing = 1500,
              graph_id = 2) 