In [1]:
import pandas as pd
import numpy as np
import dcona

pd.set_option('display.max_rows', 6)

## Reading data using Pandas

**Reading expression matrix**  
Columns = samples  
Rows = genes

Gene names can be stored either in the index (`index_col=0`) or in the first column of the DataFrame.

In [2]:
data = pd.read_csv(
    "data/data.csv", 
    #index_col=0
)
data

Unnamed: 0.1,Unnamed: 0,TCGA-CH-5761-11A,TCGA-CH-5767-11B,TCGA-CH-5768-11A,TCGA-CH-5769-11A,TCGA-EJ-7115-11A,TCGA-EJ-7123-11A,TCGA-EJ-7125-11A,TCGA-EJ-7314-11A,TCGA-EJ-7315-11A,...,TCGA-ZG-A9LN-01A,TCGA-ZG-A9LS-01A,TCGA-ZG-A9LU-01A,TCGA-ZG-A9LY-01A,TCGA-ZG-A9LZ-01A,TCGA-ZG-A9M4-01A,TCGA-ZG-A9MC-01A,TCGA-ZG-A9N3-01A,TCGA-ZG-A9ND-01A,TCGA-ZG-A9NI-01A
0,hsa-let-7a-3p|0,6.011075,6.006577,6.220264,5.537189,4.748540,5.502099,3.682509,3.953135,4.923677,...,5.654900,5.250263,5.761209,5.759811,5.672676,6.498865,5.755263,5.712863,5.646821,6.754639
1,hsa-let-7a-5p|+1,3.348551,4.030655,3.567832,3.250863,2.721590,3.979438,1.535444,3.168917,3.204502,...,3.727959,3.436101,2.866248,2.761484,2.242451,3.436749,3.695778,2.942695,2.473159,3.291749
2,hsa-let-7a-5p|+3,5.560759,5.668651,5.860315,5.610896,1.925239,4.597866,2.045232,4.490770,3.798516,...,3.236051,3.604441,4.240851,3.184805,4.241353,4.808803,4.048365,2.942695,3.873591,3.583909
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19075,ZZZ3,2.392628,2.468200,2.750706,1.901729,2.757027,2.721597,2.952434,2.641519,2.964434,...,2.465726,2.521459,2.272377,2.597408,2.213477,2.045952,2.237342,2.417844,2.262224,2.216108
19076,chr22-38_28785274-29006793.1,0.037108,0.065637,0.041679,0.034789,0.033236,0.027458,0.035765,0.032788,0.041537,...,0.017529,0.026160,0.047253,0.019139,0.042004,0.019252,0.028186,0.036193,0.022542,0.030088
19077,pk,2.917252,3.058705,2.971728,2.397213,3.576395,5.812027,3.545540,4.133061,4.072306,...,2.960870,3.718095,1.992211,2.742619,2.458514,2.069989,2.892236,2.869674,0.930494,2.583106
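In the table above the gene names sit in the first column (shown as `Unnamed: 0`) because `index_col=0` is commented out. The two layouts can be compared on a tiny in-memory CSV; the gene and sample names below are made up for illustration and are not part of the data set.

```python
import io

import pandas as pd

# Hypothetical two-gene, two-sample matrix with an unnamed first column
csv_text = ",S1,S2\ngeneA,1.0,2.0\ngeneB,3.0,4.0\n"

# Gene names kept as the first column (pandas names it "Unnamed: 0")
as_column = pd.read_csv(io.StringIO(csv_text))

# Gene names used as the index
as_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(as_column.columns))
print(list(as_index.index))
```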


**Reading a description of samples**  

In [3]:
description = pd.read_csv("data/description.csv")
description

Unnamed: 0,Sample,Group
0,TCGA-CH-5761-11A,Normal
1,TCGA-CH-5767-11B,Normal
2,TCGA-CH-5768-11A,Normal
...,...,...
484,TCGA-ZG-A9N3-01A,Tumor
485,TCGA-ZG-A9ND-01A,Tumor
486,TCGA-ZG-A9NI-01A,Tumor
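Before running an analysis it can help to confirm that every sample listed in the description has a matching column in the expression matrix, and to look at the group sizes. The sketch below uses small made-up frames with the same `Sample` and `Group` columns as above; the sample names and the `Gene` column are placeholders.

```python
import pandas as pd

# Hypothetical stand-ins for the expression matrix and the sample description
data = pd.DataFrame({
    "Gene": ["geneA", "geneB"],
    "S1": [1.0, 2.0],
    "S2": [1.5, 2.5],
    "S3": [0.5, 3.0],
})
description = pd.DataFrame({
    "Sample": ["S1", "S2", "S3"],
    "Group": ["Normal", "Normal", "Tumor"],
})

# Samples named in the description that have no column in the matrix
missing = set(description["Sample"]) - set(data.columns)

# Number of samples per group
group_sizes = description["Group"].value_counts()

print(missing)
print(group_sizes)
```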


**Reading a list of interactions**  (optional)  

In [4]:
interactions = pd.read_csv("data/interactions.csv")
interactions

Unnamed: 0,Source,Target
0,hsa-miR-30a-5p|0,RARRES1
1,hsa-miR-10a-5p|0,PLA1A
2,hsa-miR-30e-3p|0,FAM20B
...,...,...
11552,hsa-miR-183-5p|+1,AVIL
11553,hsa-miR-30e-5p|0,ACTR1A
11554,hsa-let-7f-5p|0,CLDN12


## Running DCoNA analysis

### ztest  
**Tests the hypothesis of correlation equivalence between pairs of genes.**  
If the `interaction` DataFrame is not provided, all possible gene pairs are tested.  
If `output_dir` is provided, the result is saved as a .csv file instead of being returned. This can reduce RAM consumption: for example, when no `interaction` DataFrame is used, a result of millions of rows is written to disk more efficiently.
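To get a feel for the size of an all-pairs run, the number of pairs can be counted directly. The expression matrix loaded above has 19,078 rows; the sketch assumes that "all possible gene pairs" means unordered pairs of distinct rows, which is an assumption about the library rather than documented behaviour.

```python
from math import comb

# Rows 0..19077 in the expression matrix shown earlier
n_genes = 19078

# Unordered pairs of distinct genes (assumed definition of "all pairs")
n_pairs = comb(n_genes, 2)

print(f"{n_pairs:,}")  # 181,975,503
```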

``` python
dcona.ztest(data_df, description_df, reference_group, experimental_group, correlation='spearman', alternative='two-sided', interaction=None, repeats_number=None, output_dir=None, process_number=None)
```
Argument|Data type|Description
-|-|-
`data_df`|pandas.DataFrame|expression matrix
`description_df`|pandas.DataFrame|description of samples 
`reference_group`|str|name of reference samples group (it must be listed in `description_df`)  
`experimental_group`|str|name of experimental samples group (it must be listed in `description_df`)  
`correlation` (optional)|str: `"spearman"` or `"pearson"`|metric of correlation
`alternative` (optional)|str: `"two-sided"`, `"less"` or `"greater"`|alternative hypothesis of the test
`interaction` (optional)|pandas.DataFrame|list of interactions. If not provided, all pairs are tested
`repeats_number` (optional)|int|number of permutation test iterations
`output_dir` (optional)|str|path for result saving. If not provided, returns DataFrame
`process_number` (optional)|int|number of threads used for computation (default: all CPUs)

In [None]:
ztest = dcona.ztest(
    # Required arguments
    data_df=data, 
    description_df=description, 
    reference_group="Normal", 
    experimental_group="Tumor",
    # Optional arguments
    interaction=interactions,
    correlation="spearman", 
    alternative="two-sided",
    process_number=None,
    output_dir=None,
    repeats_number=1000
)

In [6]:
ztest

Unnamed: 0,Source,Target,RefCorr,RefPvalue,ExpCorr,ExpPvalue,Statistic,Pvalue,AdjPvalue,PermutePvalue
0,hsa-let-7a-5p|0,FER,-0.548427,8.209132e-05,0.383399,4.513049e-16,-1.020167,5.816868e-08,0.000224,0.000
1,hsa-let-7b-5p|0,PLP1,-0.672029,4.581802e-07,0.193310,5.324435e-05,-1.010207,7.813477e-08,0.000226,0.000
2,hsa-let-7c-5p|0,PACRGL,-0.748956,4.097571e-09,0.056805,2.365255e-01,-1.027438,4.681513e-08,0.000270,0.000
...,...,...,...,...,...,...,...,...,...,...
11552,hsa-let-7c-5p|0,LYSMD2,-0.209892,1.485060e-01,-0.210113,1.106933e-05,0.000232,9.990174e-01,0.999450,0.997
11553,hsa-let-7a-5p|0,LIMD2,0.255222,7.826895e-02,0.255410,8.570364e-08,-0.000201,9.991492e-01,0.999495,0.999
11554,hsa-let-7a-5p|0,TACC1,-0.283265,5.026774e-02,-0.283328,2.677377e-09,0.000068,9.997115e-01,0.999711,0.998
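In the rows displayed above, the `Statistic` column agrees with the difference of the Fisher z-transformed correlations, `arctanh(RefCorr) - arctanh(ExpCorr)`. This is an observation from the printed numbers, not a statement about how DCoNA computes the value internally. A minimal check on the first three rows:

```python
import numpy as np
import pandas as pd

# Values copied from the first three rows of the result table above
rows = pd.DataFrame({
    "RefCorr": [-0.548427, -0.672029, -0.748956],
    "ExpCorr": [0.383399, 0.193310, 0.056805],
    "Statistic": [-1.020167, -1.010207, -1.027438],
})

# Fisher z-transformation is arctanh; take the difference between groups
fisher_diff = np.arctanh(rows["RefCorr"]) - np.arctanh(rows["ExpCorr"])

# Compare with the reported statistic, allowing for rounding in the display
matches = np.allclose(fisher_diff, rows["Statistic"], atol=1e-4)
print(matches)
```

A negative value therefore means the correlation is lower in the reference group than in the experimental group for these rows.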


### zscore
**Aggregates correlation changes of source molecule with all its targets.**  
If the `interaction` DataFrame is not provided, all possible gene pairs are included in the aggregated score.  
If `output_dir` is provided, the result is saved as a .csv file.  

``` python
dcona.zscore(data_df, description_df, reference_group, experimental_group, correlation='spearman', score='mean', alternative='two-sided', interaction=None, repeats_number=None, output_dir=None, process_number=None)
```
Argument|Data type|Description
-|-|-
`data_df`|pandas.DataFrame|expression matrix
`description_df`|pandas.DataFrame|description of samples 
`reference_group`|str|name of reference samples group (it must be listed in `description_df`)  
`experimental_group`|str|name of experimental samples group (it must be listed in `description_df`)  
`correlation` (optional)|str: `"spearman"` or `"pearson"`|metric of correlation
`score` (optional)|str: `"mean"`, `"median"`, |method used to aggregate the correlation changes into a score
`alternative` (optional)|str: `"two-sided"`, `"less"` or `"greater"`|alternative hypothesis of the test
`interaction` (optional)|pandas.DataFrame|list of interactions. If not provided, all pairs are included
`repeats_number` (optional)|int|number of permutation test iterations
`output_dir` (optional)|str|path for result saving. If not provided, returns DataFrame
`process_number` (optional)|int|number of threads used for computation (default: all CPUs)

In [None]:
zscore = dcona.zscore(
    # Required arguments
    data_df=data, 
    description_df=description, 
    reference_group="Normal", 
    experimental_group="Tumor",
    # Optional arguments
    interaction=interactions,
    correlation="spearman", 
    score="mean",
    alternative="two-sided",
    process_number=8,
    output_dir=None,
    repeats_number=1000
)

In [8]:
zscore.head()

Unnamed: 0,Source,Score,Pvalue,AdjPvalue
0,hsa-let-7a-5p|0,0.261334,0.0,0.0
1,hsa-let-7b-5p|0,0.295821,0.0,0.0
5,hsa-miR-101-3p|-1,0.276187,0.0,0.0
8,hsa-miR-10a-5p|0,0.294553,0.0,0.0
10,hsa-miR-10b-5p|0,0.25992,0.0,0.0


### hypergeom
**Groups pairs with changed correlations by the source molecules and finds overrepresented groups using the hypergeometric test.**  
If `output_dir` is provided, the result is saved as a .csv file.  

``` python
dcona.hypergeom(ztest_df, alternative='two-sided', oriented=True, output_dir=None)
```
Argument|Data type|Description
-|-|-
`ztest_df`|pandas.DataFrame|output of `dcona.ztest` function
`alternative` (optional)|str: `"two-sided"`, `"less"` or `"greater"`|alternative hypothesis of the test
`oriented` (optional)|bool|if `True`, only "Source" genes are taken into account
`output_dir` (optional)|str|path for result saving. If not provided, returns DataFrame

In [9]:
hypergeom = dcona.hypergeom(
    # Required arguments
    ztest_df=ztest,
    # Optional arguments
    alternative="two-sided",
    oriented=True,
    output_dir=None
)
hypergeom

Unnamed: 0,Molecule,Diff,Total,Proportion,Pvalue,AdjPvalue
0,hsa-let-7a-5p|0,316,4469,0.070709,0.035078,0.105235
1,hsa-let-7b-5p|0,279,4468,0.062444,0.850588,1.0
2,hsa-let-7c-5p|0,160,2618,0.061115,0.850867,0.850867
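The `Proportion` column above equals `Diff / Total`. The sketch below recomputes it and then illustrates a generic hypergeometric over-representation test on the same counts. The background used here (population = sum of `Total`, marked pairs = sum of `Diff` over the three rows shown) is an assumption made for illustration, and `hypergeom_upper_tail` is a hypothetical helper, not a DCoNA function; the result is not claimed to reproduce the `Pvalue` column.

```python
from fractions import Fraction
from math import comb

import pandas as pd

# Counts copied from the table above
table = pd.DataFrame({
    "Molecule": ["hsa-let-7a-5p|0", "hsa-let-7b-5p|0", "hsa-let-7c-5p|0"],
    "Diff": [316, 279, 160],
    "Total": [4469, 4468, 2618],
})
table["Proportion"] = table["Diff"] / table["Total"]


def hypergeom_upper_tail(k, population, marked, drawn):
    """P(X >= k) when `drawn` items are sampled without replacement
    from `population` items, of which `marked` are of interest."""
    total = comb(population, drawn)
    tail = sum(
        comb(marked, i) * comb(population - marked, drawn - i)
        for i in range(k, min(marked, drawn) + 1)
    )
    return float(Fraction(tail, total))


# Assumed background: all pairs and all changed pairs in the rows shown
population = int(table["Total"].sum())
marked = int(table["Diff"].sum())

# Upper-tail probability for the first molecule
p_first = hypergeom_upper_tail(316, population, marked, 4469)
print(table)
print(p_first)
```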
