# Upstream Regulator Analysis Package
## Arthritis Case Study

----------------------

Author: Mikayla Webster (m1webste@ucsd.edu)

Date: 19th January, 2018

----------------------

<a id='toc'></a>
## Table of Contents
1. [Background](#background)
2. [Import packages](#import)
3. [Load Networks](#load)
4. [P-value and Z-score Calculation](#pz)
5. [Compare the Ingenuity Article's Results to Ours](#comp)
6. [Display Our Results](#display)

## Background
<a id='background'></a>

TODO: Describe where these genes come from. Feedback from Katie and Adam on the motivation would also fit well here.

## Import packages
<a id='import'></a>

In [49]:
import networkx as nx
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import mygene

# import upstream regulator modules
import sys
code_path = '../ura'
sys.path.append(code_path)
import create_graph
import stat_analysis
reload(create_graph)
reload(stat_analysis)

# network visualization package
import visJS2jupyter.visJS_module as visJS_module # >> pip install visJS2jupyter

In [3]:
# User preferences
symbol = 'symbol'
entrez = 'entrez'

gene_type = symbol

## Load Networks
<a id='load'></a>

1. List of all **Transcription Factors** (TF's) or regulators of interest to us
    <br>
    - Our sources are [slowkow](https://github.com/slowkow/tftargets) and [jaspar](http://jaspar.genereg.net/) TF databases
    <br><br>
2. **Background Network**: [STRING human protein interactions network](https://string-db.org/cgi/download.pl?UserId=9BGA8WkVMRl6&sessionId=HWUK6Dum9xC6&species_text=Homo+sapiens)  
    - We filter the background network down to the subnetwork of TF's and their targets
    <br><br>
3. User-supplied list of **Differentially Expressed Genes** (DEG's)
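
The filtering step in item 2 can be sketched with `networkx`. This is a minimal illustration on a toy graph, not the implementation inside `create_graph.load_STRING_to_digraph`; the helper `filter_to_tf_subnetwork`, the gene names, and the `sign` edge attribute are all hypothetical.

```python
import networkx as nx

def filter_to_tf_subnetwork(background, tf_list):
    """Keep only the edges whose source node is a transcription factor."""
    tf_set = set(tf_list)
    sub = nx.DiGraph()
    for source, target, data in background.edges(data=True):
        if source in tf_set:
            sub.add_edge(source, target, **data)
    return sub

# Toy background network (hypothetical genes; sign = +1 activating, -1 inhibiting)
background = nx.DiGraph()
background.add_edge('TF_A', 'GENE_1', sign=1)
background.add_edge('TF_A', 'GENE_2', sign=-1)
background.add_edge('GENE_1', 'GENE_3', sign=1)  # source is not a TF, so this edge is dropped
background.add_edge('TF_B', 'TF_A', sign=1)      # a TF may itself be a target

tf_subnetwork = filter_to_tf_subnetwork(background, ['TF_A', 'TF_B'])
print(sorted(tf_subnetwork.edges()))
```

Keeping the unfiltered graph alongside the filtered one mirrors the two return values `DG_TF` and `DG_universe` used later in this notebook.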

In [4]:
# transcription factors
TF_list = create_graph.easy_load_TF_list(slowkow_bool=True, jaspar_bool=True, gene_type = gene_type)
print "Number of TF's: " + str(len(TF_list))

Number of TF's: 3983


In [5]:
# background network
filename = "../background_networks/9606.protein.actions.v10.5.txt"
confidence_filter = 400
DG_TF, DG_universe = create_graph.load_STRING_to_digraph(filename, TF_list, confidence_filter, gene_type)

print "\nNumber of interactions: " + str(len(list(DG_TF.edges())))

querying 1-1000...done.
querying 1001-2000...done.
querying 2001-3000...done.
querying 3001-4000...done.
querying 4001-5000...done.
querying 5001-6000...done.
querying 6001-7000...done.
querying 7001-7374...done.
Finished.
56 input query terms found dup hits:
	[(u'ENSP00000359550', 3), (u'ENSP00000447879', 2), (u'ENSP00000364076', 2), (u'ENSP00000348986', 2),
312 input query terms found no hit:
	[u'ENSP00000376684', u'ENSP00000289352', u'ENSP00000202788', u'ENSP00000373637', u'ENSP00000367802',
Pass "returnall=True" to return complete lists of duplicate or missing query terms.

Number of interactions: 28939


In [73]:
# differentially expressed genes
DEG_filename = "../DEG_databases/DE_Coeff_OAvsNormal_OAvsNormal_20171215.csv" 
DEG_full_graph = create_graph.create_DEG_full_graph(DEG_filename) # graph of the full DEG file, needed later by tf_enrichment
reload(create_graph)
DEG_list, DG_TF = create_graph.create_DEG_list(DEG_filename, 
                                        DG_TF, # adding DEG up-down information to our graph
                                        p_value_filter = 0.05, # p < 0.05
                                        fold_change_filter = 1) # |fld| > 1

if type(DEG_list) != int:
    print "Number of DEG's: " + str(len(DEG_list))
else:
    print "Please modify create_DEG_list function call.\n"

Number of DEG's: 1456
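
The two filters passed to `create_DEG_list` (p < 0.05 and |fold change| > 1) can be illustrated with `pandas`. This is a sketch on made-up data; the column names `gene`, `adj_p_value`, and `fold_change` and the helper `filter_degs` are assumptions, not the actual layout of the DEG file or the package's code.

```python
import pandas as pd

# Toy DEG table; all values and column names are hypothetical
deg_table = pd.DataFrame({
    'gene': ['GENE_1', 'GENE_2', 'GENE_3', 'GENE_4'],
    'adj_p_value': [0.001, 0.20, 0.03, 0.04],
    'fold_change': [2.5, 3.0, -1.8, 0.4],
})

def filter_degs(table, p_value_filter=0.05, fold_change_filter=1):
    """Keep rows that are both significant and strongly changed."""
    significant = table['adj_p_value'] < p_value_filter
    strong = table['fold_change'].abs() > fold_change_filter
    return table[significant & strong]

degs = filter_degs(deg_table)
deg_list = list(degs['gene'])

# Direction of regulation for each DEG: +1 = up-regulated, -1 = down-regulated
up_down = {gene: (1 if fc > 0 else -1)
           for gene, fc in zip(degs['gene'], degs['fold_change'])}
print(deg_list)
```

The `up_down` dictionary stands in for the up/down information that the notebook attaches to `DG_TF`.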


## P-value and Z-score Calculation
<a id='pz'></a>

1. **P-values**: How relevant is a TF to its DEG targets? Is their connection due to chance, or is it statistically significant?
    <br>
    1. We compute -log(p-value) for each TF using [scipy.stats.hypergeom.logsf](https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.hypergeom.html).
        1. high value = significant connection between this TF and its DEG targets
        2. low value = the TF's association with its DEG targets is consistent with chance
        3. zero = none of this TF's targets were DEG's
        4. inf = the original p-value was so small that its negative log evaluates to inf; very high significance
        <br><br>
2. **Z-scores**: The goal is to predict the activation state of each TF.

    - activation states: interaction type / regulation direction of the target = predicted state of the TF
        - activating/up = activating
        - activating/down = inhibiting
        - inhibiting/up = inhibiting
        - inhibiting/down = activating
        <br><br>
    - unbiased vs. biased calculations:
        - **unbiased calculation**: Assume a 50-50 split between up/down-regulated targets and between activating/inhibiting interactions, so that the z-score approximately follows a normal distribution.
        - **biased calculation**: For the case when you cannot assume a 50-50 split between up/down-regulated targets and activating/inhibiting interactions. The formula is modified so that the score still approximates a normal distribution.
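
The p-value step can be sketched directly with `scipy.stats.hypergeom.logsf`. The helper `neg_log_p_value` and the counts below are hypothetical, and the sketch assumes a natural log and a one-sided test for over-representation; the exact setup inside `stat_analysis.tf_target_enrichment` may differ.

```python
from scipy.stats import hypergeom

def neg_log_p_value(n_deg_targets, n_universe, n_degs, n_targets):
    """-log P(X >= n_deg_targets) when n_targets genes are drawn from a
    universe of n_universe genes that contains n_degs DEG's."""
    # sf(k - 1) equals P(X >= k). scipy's argument order is (k, M, n, N):
    # M = population size, n = successes in the population, N = number of draws.
    log_p = hypergeom.logsf(n_deg_targets - 1, n_universe, n_degs, n_targets)
    return -float(log_p)

# Hypothetical TF: 20 targets, 8 of which are DEG's,
# in a universe of 1000 genes containing 100 DEG's
score = neg_log_p_value(8, 1000, 100, 20)

# A TF with no DEG targets has p-value 1, so its score is zero
no_overlap = neg_log_p_value(0, 1000, 100, 20)
```

Working with `logsf` rather than taking the log of `sf` keeps precision when the p-value is very small.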

In [19]:
# Enrichment of every TF with respect to its targets
p_values = stat_analysis.tf_target_enrichment(DG_TF, DG_universe, DEG_list)

# Enrichment of TF's themselves
stat_analysis.tf_enrichment(TF_list, DEG_full_graph, DEG_list)    

TF_ENRICHMENT    4.114487e-12
dtype: float64

In [None]:
z_scores = stat_analysis.tf_zscore(DG_TF, DEG_list, bias_filter = 0.25) # recommended bias filter is 0.25
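
The activation-state table above amounts to multiplying two signs: the interaction type (+1 activating, -1 inhibiting) and the target's regulation direction (+1 up, -1 down). One common form of the unbiased z-score sums these products over a TF's DEG targets and divides by the square root of their count. The sketch below assumes that form; the helper names are hypothetical, and it does not reproduce the biased correction or the `bias_filter` logic of `stat_analysis.tf_zscore`.

```python
import math

def predicted_state(interaction_sign, regulation_direction):
    """+1 = evidence the TF is activated, -1 = evidence it is inhibited."""
    return interaction_sign * regulation_direction

def unbiased_zscore(targets):
    """targets: list of (interaction_sign, regulation_direction) pairs,
    one per DEG target of the TF."""
    states = [predicted_state(sign, direction) for sign, direction in targets]
    if not states:
        return 0.0
    return sum(states) / math.sqrt(len(states))

# Hypothetical TF with four DEG targets
targets = [
    (1, 1),    # activating / up     -> activating
    (1, 1),    # activating / up     -> activating
    (-1, -1),  # inhibiting / down   -> activating
    (1, -1),   # activating / down   -> inhibiting
]
z = unbiased_zscore(targets)
print(z)  # 1.0
```

A positive score suggests the TF is activated and a negative score suggests it is inhibited.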

## Display Our Results
<a id='display'></a>

- Display TF's with top z-scores
- Display where certain TF's rank among others and overall according to z-score
- Display subnetwork of a particular TF and its targets

In [None]:
def top_values(series, act = True, abs_value = False, top = 10):

    # strongest values in either direction, ordered by |z-score| or |log(p-value)|, signs preserved
    if abs_value:
        top_genes = series.abs().sort_values(ascending=False).head(top).index
        return series.loc[top_genes]

    # top activating
    if act:
        return series.sort_values(ascending=False).head(top)

    # top inhibiting
    return series.sort_values(ascending=True).head(top)

top_values(z_scores, act = False, abs_value = True, top = 20)

In [None]:
series = z_scores.sort_values(ascending=False).head(10)
top_to_adjp = pd.Series({ k: TF_to_adjp[k] for k in TF_to_adjp if k in series })
top_to_foldchange = pd.Series({ k: TF_to_foldchange[k] for k in TF_to_foldchange if k in series })

In [None]:
# keys (not names) sets the column labels when concatenating along axis=1
pd.concat([series, top_to_adjp, top_to_foldchange], axis=1, keys = ["z-scores", "adj p-value", "fold change"])

In [None]:
# Top TF's
# ~~~~~~~~~~~~~~~~~~~~~~ KATIE'S REQUEST ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
TF_to_adjp, TF_to_foldchange = map_TF_to_DEG_attributes(DEG_filename, TF_list)
top_overall = stat_analysis.top_values(z_scores, act = False, abs_value = True, top = 20)
display(top_overall.to_frame(name = 'top z-scores'))

In [None]:
# Ranking (in this case from 0 to 168)
genes_to_rank = ['HOXA4', 'CEBPB', 'HIF1A','KLF4', 'TLE3', 'RBL2','TP53', 'STAT1', 'STAT5B']
stat_analysis.rank_and_score_df(z_scores, genes_to_rank, act = True, abs_value = True, remove_dups=True)
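
The idea behind the ranking cell can be sketched with `pandas`: order all TF's by absolute z-score, then look up the zero-based position of each gene of interest. The helper `rank_by_abs` and the toy scores are hypothetical, and the sketch assumes every requested gene is present in the series; `stat_analysis.rank_and_score_df` may handle duplicates and missing genes differently.

```python
import pandas as pd

def rank_by_abs(series, genes):
    """Zero-based rank of each gene when all scores are sorted by |score|."""
    ordered = series.abs().sort_values(ascending=False)
    position = {gene: i for i, gene in enumerate(ordered.index)}
    return pd.DataFrame({
        'rank': [position[gene] for gene in genes],
        'z-score': [series[gene] for gene in genes],
    }, index=genes)

# Toy z-scores for four hypothetical TF's
toy_scores = pd.Series({'TF_A': 2.1, 'TF_B': -3.4, 'TF_C': 0.5, 'TF_D': -1.0})
ranking = rank_by_abs(toy_scores, ['TF_A', 'TF_D'])
print(ranking)
```

Rank 0 is the TF with the largest absolute z-score, which matches the zero-based range mentioned in the cell's comment.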

In [None]:
stat_analysis.vis_tf_network(DG_TF, 'KLF5', '../DEG_databases/geo2r_GSE11352_brca_48hours.txt', DEG_list,
              directed_edges = True,
              node_spacing = 1500,
              graph_id = 2) 

In [None]:
# display subnetworks using visJS2jupyter
stat_analysis.vis_tf_network(DG_TF, 'STAT1', '../DEG_databases/geo2r_GSE11352_brca_48hours.txt', DEG_list,
              directed_edges = True,
              node_spacing = 2000,
              graph_id = 1) 