# Tutorial 

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/shubhvjain/coregtor/blob/release/docs/tutorial1.ipynb)

In this tutorial, we demonstrate how to use the `CoRegTor` tool to find transcriptional co-regulators of a gene from gene expression data.

## Objective

The aim of this tutorial is to find potential co-regulators of the gene [GFAP](https://www.ncbi.nlm.nih.gov/gene/2670) by analyzing tissue gene expression data for the frontal cortex of the adult brain.

## Step 1 : Get data

Before we begin, let's gather all the data we require:
- Gene expression data, `brain_ge.gct`: tissue gene expression data for the Frontal Cortex (BA9) of the adult brain. The data was downloaded from the [GTEx portal](https://www.gtexportal.org/home/downloads/adult-gtex/bulk_tissue_expression).
- List of transcription factors, `human_tf.txt`: this file was downloaded from [aertslab.org](https://resources.aertslab.org/cistarget/tf_lists/).



In [1]:
from pathlib import Path 
base_path = Path("docs/temp") # UPDATE THIS
data_file_path = Path(base_path/"brain_ge.gct") # UPDATE THIS 
tf_file_path = Path(base_path/"human_tf.txt") # UPDATE THIS
target_gene_name = "GFAP"
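
Before running the later cells, it can help to confirm that both input files are where the paths say they are. The helper below is a small sketch; `missing_inputs` and `example_dir` are hypothetical names used only for illustration and are not part of `CoRegTor`.

```python
from pathlib import Path

def missing_inputs(*paths):
    """Return the subset of `paths` that do not point to an existing file."""
    return [Path(p) for p in paths if not Path(p).is_file()]

# Example with the tutorial's default locations (adjust to your setup).
example_dir = Path("docs/temp")
missing = missing_inputs(example_dir / "brain_ge.gct", example_dir / "human_tf.txt")
if missing:
    print("Missing input files:", ", ".join(str(p) for p in missing))
```

An empty list means both files were found.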


## Step 2 : Install and import the `CoRegTor` package

Using pip: `pip install coregtor`

Or `poetry add coregtor` to add the package as a dependency in your project.

In [2]:
# Install coregtor if not already installed, then import it
try:
    import coregtor
except ImportError:
    %pip install coregtor
    import coregtor

# Additional imports
from pathlib import Path


## Step 3 : Load gene expression data and transcription factors

Let's begin by loading the data with the `read` method, which accepts the path to the gene expression file and, optionally, further options. It returns a pandas DataFrame with genes as columns and samples as rows. The transcription factor list is read directly with pandas.

In [3]:
import pandas as pd
ge_data = coregtor.read(file_path=data_file_path)
tf_data = pd.read_csv(tf_file_path, names=["gene_name"], header=None)
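
To make the shape of the result more concrete, here is a minimal sketch of how a GCT v1.2 file can be turned into a samples-by-genes table with plain pandas. It is not the `CoRegTor` implementation of `read`; `read_gct_sketch` is a hypothetical helper, and it assumes the standard GCT layout (a version line, a dimensions line, then a tab-separated table whose first two columns are `Name` and `Description`).

```python
import io
import pandas as pd

def read_gct_sketch(source):
    """Parse a GCT v1.2 table into a samples-by-genes DataFrame."""
    # Skip the version line ("#1.2") and the dimensions line.
    raw = pd.read_csv(source, sep="\t", skiprows=2)
    # "Name" holds the gene ID, "Description" the gene symbol.
    expr = raw.drop(columns=["Name"]).set_index("Description")
    expr.index.name = "gene_name"
    # Transpose so that samples are rows and genes are columns.
    samples_by_genes = expr.T
    samples_by_genes.index.name = "sample_name"
    return samples_by_genes

# Tiny made-up GCT file with two genes and three samples.
toy_gct = (
    "#1.2\n"
    "2\t3\n"
    "Name\tDescription\tS1\tS2\tS3\n"
    "ENSG1\tGFAP\t1.0\t2.0\t3.0\n"
    "ENSG2\tYBX1\t4.0\t5.0\t6.0\n"
)
toy = read_gct_sketch(io.StringIO(toy_gct))
print(toy)
```

The toy table has three rows (samples) and two columns (genes), mirroring the orientation of `ge_data`.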

In [4]:
ge_data

sample_name,DDX11L1,WASH7P,MIR6859-1,MIR1302-2HG,FAM138A,OR4G4P,OR4G11P,OR4F5,ENSG00000238009,CICP27,...,MT-ND4,MT-TH,MT-TS2,MT-TL2,MT-ND5,MT-ND6,MT-TE,MT-CYB,MT-TT,MT-TP
GTEX-1117F-0011-R10b-SM-GI4VE,0.000000,3.57928,0.0,0.093825,0.000000,0.000000,0.028731,0.046554,0.039501,0.058675,...,49762.2,1.177570,2.754330,0.000000,7311.39,4788.56,6.47666,28676.5,3.077750,1.19489
GTEX-111FC-0011-R10a-SM-GIN8G,0.000000,2.32926,0.0,0.025333,0.000000,0.052233,0.031030,0.016759,0.000000,0.031684,...,44692.0,0.953824,0.000000,1.544930,6831.00,5164.36,6.67677,26950.9,1.661970,3.54879
GTEX-117XS-0011-R10b-SM-GIN8Z,0.000000,4.79425,0.0,0.000000,0.046843,0.067977,0.020191,0.043622,0.013880,0.032987,...,39249.9,0.827551,0.967814,1.206360,5603.53,3585.51,6.20663,20794.9,0.432584,2.93902
GTEX-1192W-0011-R10b-SM-GHWOF,0.000000,3.83774,0.0,0.032159,0.045693,0.000000,0.039392,0.053189,0.013539,0.000000,...,50750.5,1.614480,2.832190,1.176750,9433.33,7697.90,12.51220,23405.4,1.265900,3.68601
GTEX-1192X-0011-R10a-SM-DO941,0.040388,1.47233,0.0,0.040318,0.000000,0.000000,0.049385,0.040010,0.050922,0.000000,...,31566.9,2.024070,0.591784,0.983528,4424.64,3568.41,4.55416,14051.5,0.529019,1.54038
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GTEX-ZVZQ-0011-R10b-SM-51MRT,0.017553,1.91964,0.0,0.070089,0.000000,0.180647,0.064389,0.092738,0.044262,0.043831,...,44939.6,3.078850,1.543150,2.137230,7019.94,6874.29,16.71380,24296.2,1.379490,2.23152
GTEX-ZXG5-0011-R10a-SM-57WDD,0.000000,1.07536,0.0,0.036646,0.000000,0.000000,0.044887,0.084853,0.000000,0.000000,...,62226.7,2.759570,1.613650,4.469730,11407.90,11061.80,15.17770,38732.2,1.442500,1.86677
GTEX-ZYFD-0011-R10a-SM-GPI91,0.000000,2.71020,0.0,0.000000,0.000000,0.000000,0.037432,0.000000,0.000000,0.030577,...,43740.3,0.000000,2.691290,0.745473,6574.39,5241.85,9.20498,24934.3,0.000000,1.55672
GTEX-ZYY3-0011-R10a-SM-GNTAZ,0.015919,3.29538,0.0,0.000000,0.000000,0.065533,0.058395,0.031540,0.066902,0.079502,...,40835.8,1.196680,0.933006,3.101260,6228.45,5626.94,10.37120,20992.5,5.004310,2.83332


In [5]:
tf_data

,gene_name
0,ZNF354C
1,KLF12
2,ZNF143
3,ZIC2
4,ZNF274
...,...
1887,ZNF826P
1888,ZNF827
1889,ZNF831
1890,ZRSR2
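
Only transcription factors that appear as columns of the expression matrix can serve as features later on, and the target gene has to be present as well. The sketch below shows one way to check both with pandas on toy data. The frames `toy_ge` and `toy_tf` stand in for `ge_data` and `tf_data`, and `ZNF999` is a made-up placeholder for a transcription factor without an expression column.

```python
import pandas as pd

# Toy stand-ins for `ge_data` (samples x genes) and `tf_data` (one TF per row).
toy_ge = pd.DataFrame(
    {"GFAP": [5.0, 7.0], "YBX1": [1.0, 2.0], "SP110": [0.5, 0.1], "ACTB": [9.0, 8.0]},
    index=["S1", "S2"],
)
toy_tf = pd.DataFrame({"gene_name": ["YBX1", "SP110", "ZNF999"]})

target = "GFAP"
assert target in toy_ge.columns, f"{target} is missing from the expression data"

# Transcription factors that are actually measured in the expression matrix.
tf_measured = toy_tf.loc[toy_tf["gene_name"].isin(toy_ge.columns), "gene_name"].tolist()
# Transcription factors from the list that have no expression column.
tf_absent = sorted(set(toy_tf["gene_name"]) - set(toy_ge.columns))

print(tf_measured)  # ['YBX1', 'SP110']
print(tf_absent)    # ['ZNF999']
```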


## Step 4 : Create Ensemble model

Next, we fit a random forest ensemble on this gene expression data to predict the expression value of the target gene, GFAP, from the expression of other genes.

We do this in two steps. The `create_model_input` method builds the training input from the gene expression data, the target gene name, and the list of transcription factors. The `create_model` method then fits a model of the requested type (`"rf"` here) with the supplied model options (`{}` here).
Since we are interested in finding co-regulators, only transcription factors are used as features to predict the target gene.

In [6]:
# first generate the training input for the model
X,Y = coregtor.create_model_input(ge_data,target_gene_name,tf_data)
# use the training data to create a model
model = coregtor.create_model(X,Y,"rf",{})

In [7]:
model

parameter,value
n_estimators,100
criterion,'squared_error'
max_depth,
min_samples_split,2
min_samples_leaf,1
min_weight_fraction_leaf,0.0
max_features,1.0
max_leaf_nodes,
min_impurity_decrease,0.0
bootstrap,True
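
The parameter names printed above match those of scikit-learn's `RandomForestRegressor`, so the following sketch assumes that `"rf"` corresponds to that estimator; this is an assumption, not something confirmed by the `CoRegTor` documentation. It fits a forest on synthetic data, where `TF_A` to `TF_D` are made-up feature names, and shows that the features driving the target receive the highest importance.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic feature matrix: 60 samples x 4 made-up transcription factors.
X_toy = pd.DataFrame(rng.normal(size=(60, 4)), columns=["TF_A", "TF_B", "TF_C", "TF_D"])
# Synthetic target that depends on TF_A and TF_B plus a little noise.
y_toy = 2.0 * X_toy["TF_A"] - 1.5 * X_toy["TF_B"] + rng.normal(scale=0.1, size=60)

# Default settings apart from the seed, matching the parameter table above.
rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_toy, y_toy)

# Importance of each feature, averaged over the 100 trees.
importances = pd.Series(rf.feature_importances_, index=X_toy.columns)
print(importances.sort_values(ascending=False))
```

Each fitted tree is available in `rf.estimators_`, which is what makes the per-tree analysis in the next step possible.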


## Step 5 : Generating tree paths 

Forest-based ensemble methods contain multiple decision trees, and we want to analyze the structure of the trees in the model.

Each tree has multiple root-to-leaf paths. We first enumerate all such paths across all the trees in the model.


In [9]:
all_paths = coregtor.tree_paths(model,X,Y)

In [10]:
all_paths

,tree,source,target,path_length,node1,node2,node3,node4,node5,node6,...,node9,node10,node11,node12,node13,node14,node15,node16,node17,node18
0,0,HMBOX1,GFAP,8,YBX1,SP110,SCRT1,E2F2,TCFL5,SHOX2,...,,,,,,,,,,
1,0,HMBOX1,GFAP,14,YBX1,SP110,SCRT1,STAT5A,VAMP3,HOXC4,...,NKX3-2,CEBPD,ARID3C,DLX1,TRIM69,,,,,
2,0,HMBOX1,GFAP,8,YBX1,RBM8A,NR1D1,IRX3,RXRG,YY2,...,,,,,,,,,,
3,0,HMBOX1,GFAP,8,YBX1,SP110,ZNF639,LHX3,EGR4,TBX15,...,,,,,,,,,,
4,0,HMBOX1,GFAP,9,YBX1,SP110,ZNF639,ZNF710,ZFHX2,HES6,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10968,99,YBX1,GFAP,7,FOXD4,CDX2,HOXA3,ETV2,BANP,ZBED5,...,,,,,,,,,,
10969,99,YBX1,GFAP,13,EN1,TSC22D4,ZNF114,IRF9,ZNF683,NR2E1,...,ZNF433,HMGB2,HEY2,ZNF662,,,,,,
10970,99,YBX1,GFAP,8,EN1,TSC22D4,ZNF114,ZNF587,EMX1,IKZF3,...,,,,,,,,,,
10971,99,YBX1,GFAP,8,EN1,TSC22D4,ZNF114,IRF9,ING3,ARNT,...,,,,,,,,,,
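
To illustrate what enumerating root-to-leaf paths means, here is a minimal sketch for a single tree. It is not the `tree_paths` implementation, and its output format differs from the table above. The function `root_to_leaf_paths` is a hypothetical helper that assumes the array layout scikit-learn uses for fitted trees: node 0 is the root, and a child index of -1 marks a leaf.

```python
def root_to_leaf_paths(children_left, children_right, feature, feature_names):
    """List the split features along every root-to-leaf path of one tree.

    The arrays follow scikit-learn's `tree_` layout: node 0 is the root and
    a child index of -1 marks a leaf.
    """
    paths = []
    stack = [(0, [])]  # (node id, features seen so far)
    while stack:
        node, seen = stack.pop()
        if children_left[node] == -1:  # leaf: the path is complete
            paths.append(seen)
            continue
        seen = seen + [feature_names[feature[node]]]
        # Push right first so that the left branch is explored first.
        stack.append((children_right[node], seen))
        stack.append((children_left[node], seen))
    return paths

# Hand-built tree: the root splits on YBX1, its left child splits on SP110.
children_left = [1, 2, -1, -1, -1]
children_right = [4, 3, -1, -1, -1]
feature = [0, 1, -2, -2, -2]
names = ["YBX1", "SP110"]

toy_paths = root_to_leaf_paths(children_left, children_right, feature, names)
for path in toy_paths:
    print(" -> ".join(path))
```

For a fitted scikit-learn forest, the same arrays are exposed per tree as `est.tree_.children_left`, `est.tree_.children_right`, and `est.tree_.feature` for each `est` in `estimators_`.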
