# PyGNA Workflow
#### 1) Generate GMT files from CSV files if no GMT file is available
#### 2) Generate matrices
#### 3) Perform the analysis for single or multiple genesets and save the results as PDF or PNG


## Data Loading

### Generating GMT files from a table

In [None]:
$ pygna geneset-from-table <filename>.csv <setname> <filename>.gmt --name-column <gene_names_column> --filter-column <filter-col> <'less'> --threshold <th> --descriptor <descriptor string>
$ pygna geneset-from-table <deseq>.csv diff_exp <deseq>.gmt --descriptor deseq  # for a table from DESeq
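
To make the GMT layout concrete, here is a minimal Python sketch of what such a conversion amounts to. It does not call PyGNA: the function name `table_to_gmt_line`, the column names `gene` and `padj`, and the toy values are all hypothetical. It assumes the usual GMT convention of one set per line, with the set name, a descriptor, and then the gene identifiers, all tab-separated.

```python
import csv
import io

def table_to_gmt_line(rows, setname, name_col, filter_col, threshold,
                      descriptor="-"):
    """Build one GMT line from the rows whose filter value is below threshold."""
    genes = [row[name_col] for row in rows
             if float(row[filter_col]) < threshold]
    return "\t".join([setname, descriptor] + genes)

# Toy differential-expression table (hypothetical columns and values).
toy_csv = io.StringIO(
    "gene,padj\n"
    "TP53,0.001\n"
    "BRCA1,0.2\n"
    "MYC,0.03\n"
)
rows = list(csv.DictReader(toy_csv))

gmt_line = table_to_gmt_line(rows, "diff_exp", "gene", "padj", 0.05,
                             descriptor="deseq")
print(gmt_line)
```

Only the genes passing the filter (`TP53` and `MYC` in this toy table) end up in the set.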


### Merging different Genesets

It is also possible to merge different setnames into a single GMT file with the function *generate-group-gmt*, which generates a GMT file containing multiple setnames. From the table file, it groups the rows by the values in group_col (the column you want to group by) and writes the genes listed in name_col for each group. You can override the default parameters to match the columns in your table, and set the descriptor according to your needs. Alternatively, you can simply concatenate the existing GMT files.
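
The grouping step can be pictured with a short Python sketch. This is an illustration of the idea only, not PyGNA's code: the helper `group_to_gmt`, the column names `cancer` and `gene`, and the gene symbols are made up, while the set names are borrowed from the use case later in this document.

```python
from collections import defaultdict

def group_to_gmt(rows, group_col, name_col, descriptor="-"):
    """Return one GMT line per distinct value found in group_col."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[name_col])
    # Dictionaries keep insertion order, so sets appear in first-seen order.
    return ["\t".join([setname, descriptor] + genes)
            for setname, genes in groups.items()]

# Toy table (hypothetical): one row per gene, with the set it belongs to.
rows = [
    {"cancer": "tcga_brca", "gene": "ESR1"},
    {"cancer": "tcga_prad", "gene": "AR"},
    {"cancer": "tcga_brca", "gene": "GATA3"},
]

gmt_lines = group_to_gmt(rows, "cancer", "gene", descriptor="toy")
print("\n".join(gmt_lines))
```

Concatenating existing GMT files gives the same kind of result, since each line of a GMT file is an independent set.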

## Computing RWR and SP matrices

In [None]:
$ pygna build-distance-matrix <network> <network_sp>.hdf5
$ pygna build-rwr-diffusion <network> --output-file <network_rwr>.hdf5
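
As an illustration of what a shortest-path (SP) matrix stores, the sketch below computes all-pairs hop distances on a toy unweighted graph with breadth-first search. It is not PyGNA's implementation, it keeps the result in a plain dictionary rather than an HDF5 file, and the graph and names are invented. The RWR matrix is not covered here.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Hop distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

# Toy path graph A - B - C - D (hypothetical).
adjacency = {
    "A": ["B"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C"],
}

# One row of distances per source node.
sp_matrix = {node: bfs_distances(adjacency, node) for node in adjacency}
print(sp_matrix["A"]["D"])
```

Precomputing this matrix once is what makes repeated permutation tests affordable, because every test then reduces to lookups.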

## Topology Tests

In [None]:
$ pygna test-topology-module <network> <geneset> <table_results_test>_topology_module.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-rwr <network> <geneset> <network_rwr>.hdf5 <table_results_test>_topology_rwr.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-internal-degree <network> <geneset> <table_results_test>_topology_internal_degree.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-sp <network> <geneset> <network_sp>.hdf5 <table_results_test>_topology_sp.csv --number-of-permutations 100 --cores 4
$ pygna test-topology-total-degree <network> <geneset> <table_results_test>_topology_total_degree.csv --number-of-permutations 100 --cores 4
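
The logic behind these permutation tests can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the statistic (mean degree of the genes in the set), the one-sided comparison, and the `(hits + 1) / (n + 1)` p-value are not taken from PyGNA's source. The `+1` form is at least consistent with the p-values logged in the use case of this document, which are multiples of 1/101 with 100 permutations (for example 0.019802) and of 1/11 with 10 permutations (for example 0.0909091).

```python
import random

def mean_degree(adjacency, genes):
    """Toy statistic: average number of neighbours of the genes in the set."""
    return sum(len(adjacency[g]) for g in genes) / len(genes)

def empirical_pvalue(observed, null_values):
    """Share of null values at least as large as observed, with +1 correction."""
    hits = sum(1 for value in null_values if value >= observed)
    return (hits + 1) / (len(null_values) + 1)

def permutation_test(adjacency, geneset, n_permutations, seed=0):
    """Compare the observed statistic with random sets of the same size."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    observed = mean_degree(adjacency, geneset)
    null = [mean_degree(adjacency, rng.sample(nodes, len(geneset)))
            for _ in range(n_permutations)]
    return observed, empirical_pvalue(observed, null)

# Toy star network (hypothetical): one hub connected to five leaves.
adjacency = {"hub": ["a", "b", "c", "d", "e"]}
adjacency.update({leaf: ["hub"] for leaf in "abcde"})

observed, pvalue = permutation_test(adjacency, ["hub", "a"], 100)
print(observed, pvalue)
```

With this correction the smallest reportable p-value is `1 / (n + 1)`, which is why the number of permutations limits the resolution of the test.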

## Association tests

If only A_geneset_file is passed, the analysis is run on all pairs of sets in that file. If both A_geneset_file and B_geneset_file are passed, you can specify the setnames for both; when a file contains only one geneset, the corresponding setname_X can be omitted. If both sets are in the same file, B_geneset_file can be left unspecified, but the setnames are then required.
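
A rough way to picture which comparisons are run is the following Python sketch. The helper `pairs_to_test` is hypothetical and assumes that the within-file case uses unordered pairs of distinct sets; whether PyGNA also tests a set against itself is not stated here.

```python
from itertools import combinations, product

def pairs_to_test(sets_a, sets_b=None):
    """List the (A, B) set pairs that would be compared."""
    if sets_b is None:
        # Only file A given: every unordered pair within the file.
        return list(combinations(sets_a, 2))
    # Files A and B given: every set in A against every set in B.
    return list(product(sets_a, sets_b))

sets_a = ["tcga_brca", "tcga_prad", "tcga_lusc",
          "tcga_laml", "tcga_dlbc", "tcga_blca"]

within = pairs_to_test(sets_a)
versus = pairs_to_test(sets_a, ["DNA binding", "cell cycle"])
print(len(within), len(versus))
```

The number of comparisons grows quickly with the number of sets, which is worth keeping in mind when choosing the number of permutations and cores.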

In [None]:
pygna test-association-rwr [-h] [--setname-a SETNAME_A] [--file-geneset-b FILE_GENESET_B] [--setname-b SETNAME_B] [--size-cut SIZE_CUT] [-k] [-c CORES] [-i]
                                [--number-of-permutations NUMBER_OF_PERMUTATIONS] [--n-bins N_BINS] [--results-figure RESULTS_FIGURE]
                                network-file file-geneset-a rwr-matrix-filename output-table

    Performs comparison of network location analysis.

    It computes a p-value for the shortest path distance
    between two genesets being smaller than expected by chance.

    If only A_geneset_file is passed the analysis is run on all the pair of sets in the file, if both
    A_geneset_file and B_geneset_file are passed, one can specify the setnames for both, if there is only one
    geneset in the file, setname_X can be omitted, if both sets are in the same file, B_geneset_file can be not
    specified, but setnames are needed.


positional arguments:
network-file          network file
file-geneset-a        GMT geneset file
rwr-matrix-filename   .hdf5 file with the RWR matrix obtained by pygna
output-table          output results table, use .csv extension

optional arguments:
-h, --help            show this help message and exit
--setname-a SETNAME_A
                        Geneset A to analyse (default: -)
--file-geneset-b FILE_GENESET_B
                        GMT geneset file (default: -)
--setname-b SETNAME_B
                        Geneset B to analyse (default: -)
--size-cut SIZE_CUT   removes all genesets with a mapped length < size_cut (default: 20)
-k, --keep            if true, keeps the geneset B unpermuted (default: False)
-c CORES, --cores CORES
                        Number of cores for the multiprocessing (default: 1)
-i, --in-memory       set if you want the large matrix to be read in memory (default: False)
--number-of-permutations NUMBER_OF_PERMUTATIONS
                        number of permutations for computing the empirical pvalue (default: 500)
--n-bins N_BINS       if >1 applies degree correction by binning the node degrees and sampling according to geneset distribution (default: 1)
--results-figure RESULTS_FIGURE
                        heatmap of results (default: -)
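
The `--n-bins` option describes degree-corrected sampling. One plausible reading, sketched below in Python, is that nodes are grouped into bins of similar degree and each random set draws from every bin as many nodes as the real geneset has in it. This is an interpretation for illustration only: the equal-sized rank bins and the helper names are assumptions, not PyGNA's actual procedure.

```python
import random

def degree_bins(degrees, n_bins):
    """Split nodes into n_bins groups of similar degree (equal-sized by rank)."""
    ranked = sorted(degrees, key=lambda node: (degrees[node], node))
    size = -(-len(ranked) // n_bins)  # ceiling division
    return [ranked[i:i + size] for i in range(0, len(ranked), size)]

def sample_matching_degree(degrees, geneset, n_bins, rng):
    """Draw a random set with the same number of nodes per degree bin."""
    bins = degree_bins(degrees, n_bins)
    bin_of = {node: i for i, members in enumerate(bins) for node in members}
    counts = {}
    for gene in geneset:
        counts[bin_of[gene]] = counts.get(bin_of[gene], 0) + 1
    sampled = []
    for index, count in counts.items():
        sampled.extend(rng.sample(bins[index], count))
    return sampled

# Toy degrees (hypothetical): three low-degree and three high-degree nodes.
degrees = {"n1": 1, "n2": 1, "n3": 2, "n4": 8, "n5": 9, "n6": 10}
bins = degree_bins(degrees, 2)
sampled = sample_matching_degree(degrees, ["n1", "n6"], 2, random.Random(0))
print(bins, sampled)
```

Sampling this way keeps hubs from being compared against sets made mostly of poorly connected genes.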

In [None]:
$ pygna test-association-sp <network> <geneset> <network_sp>.hdf5 <table_results_test>_association_sp.csv --file-geneset-b <geneset_pathways> --keep --number-of-permutations 100 --cores 4
$ pygna test-association-rwr <network> <geneset> <network_rwr>.hdf5 <table_results_test>_association_rwr.csv --file-geneset-b <geneset_pathways> --keep --number-of-permutations 100 --cores 4


### Visualisation

In [None]:
Usage: pygna paint-datasets-stats [-h] [-a ALTERNATIVE] table-filename output-file  # GNT barplot
Usage: pygna paint-summary-gnt [-h] [-s SETNAME] [-t THRESHOLD] [-c COLUMN_FILTER] [--larger] [--less-tests LESS_TESTS] output-figure [input_tables [input_tables ...]]  # GNT summary
Usage: pygna paint-comparison-matrix [-h] [-r] [-s] [-a] table-filename output-file  # heatmap
Usage: pygna paint-volcano-plot [-h] [-r] [-i ID_COL] [--threshold-x THRESHOLD_X] [--threshold-y THRESHOLD_Y] [-a] table-filename output-file  # volcano plot



### Snakemake Workflow

1) Install Snakemake
2) Edit the config file and the rules files as needed (paths, parameters, etc.)
3) Run the analysis

All the steps described above then reduce to one or two commands.


In [None]:
snakemake --use-conda -n  # dry run


In [None]:
snakemake --snakefile Snakefile_paper --configfile config_paper --use-conda --cores $N  # to replicate the results of the paper

To obtain all the results for the single-geneset analysis (skip the first command if you want all files to be fully regenerated):

In [None]:
snakemake --snakefile Snakefile_paper single_all --configfile config_paper_single.yaml -t

snakemake --snakefile Snakefile_paper single_all --configfile config_paper_single.yaml --use-conda

To obtain the results for the multi-geneset analysis:

In [None]:
snakemake --snakefile Snakefile_paper multi_all --configfile config_paper_multi.yaml -t

snakemake --snakefile Snakefile_paper multi_all --configfile config_paper_multi.yaml

## Paper Use Case

### Using the Command Line

Since the distance matrices are already built and the merged geneset (GMT) is already available, the topology and association analyses can be carried out directly.



#### **Topology Analysis**

In [1]:
#file names: biogrid_3.168_filtered.tsv merged.gmt goslim.gmt interactome_RWR.hdf5 interactome_SP.hdf5

In [3]:
cd /home/gee3/Documents/PyGNA/data_tcga_workflow/external/

/home/gee3/Documents/PyGNA/data_tcga_workflow/external


In [1]:
! pygna test-topology-module biogrid_3.168_filtered.tsv merged.gmt table_topology_module3.csv --number-of-permutations 100 --cores 2





INFO:root:Results file = table_topology_module3.csv
INFO:root:Mapped 759 genes out of 986.
INFO:root:Setname:tcga_brca
INFO:root:Observed: 203 p-value: 1
INFO:root:Mapped 172 genes out of 259.
INFO:root:Setname:tcga_prad
INFO:root:Observed: 5 p-value: 0.960396
INFO:root:Mapped 1212 genes out of 1640.
INFO:root:Setname:tcga_lusc
INFO:root:Observed: 784 p-value: 0.019802
INFO:root:Mapped 2295 genes out of 3247.
INFO:root:Setname:tcga_laml
INFO:root:Observed: 1629 p-value: 0.267327
INFO:root:Mapped 1326 genes out of 1796.
INFO:root:Setname:tcga_dlbc
INFO:root:Observed: 870 p-value: 0.019802
INFO:root:Mapped 568 genes out of 769.
INFO:root:Setname:tcga_blca
INFO:root:Observed: 63 p-value: 1


In [1]:
! pygna test-topology-rwr biogrid_3.168_filtered.tsv merged.gmt interactome_RWR.hdf5 tableresults_topology_rwr.csv --number-of-permutations 10 --cores 3

INFO:root:Results file = tableresults_topology_rwr.csv
INFO:root:Mapped 758 genes out of 986.
INFO:root:Setname:tcga_brca
INFO:root:Observed: 4.6334 p-value: 0.574257
INFO:root:Mapped 172 genes out of 259.
INFO:root:Setname:tcga_prad
INFO:root:Observed: 0.119103 p-value: 0.970297
INFO:root:Mapped 1208 genes out of 1640.
INFO:root:Setname:tcga_lusc
INFO:root:Observed: 16.5947 p-value: 0.0792079
INFO:root:Mapped 2292 genes out of 3247.
INFO:root:Setname:tcga_laml
INFO:root:Observed: 46.3724 p-value: 0.346535
INFO:root:Mapped 1322 genes out of 1796.
INFO:root:Setname:tcga_dlbc
INFO:root:Observed: 18.1138 p-value: 0.049505
INFO:root:Mapped 568 genes out of 769.
INFO:root:Setname:tcga_blca
INFO:root:Observed: 3.37367 p-value: 0.128713
Closing remaining open files:interactome_RWR.hdf5...doneinteractome_RWR.hdf5...done


In [1]:
! pygna test-topology-internal-degree biogrid_3.168_filtered.tsv merged.gmt table_topology_internal_degree.csv --number-of-permutations 10 --cores 3

INFO:root:Results file = table_topology_internal_degree.csv
INFO:root:Mapped 759 genes out of 986.
INFO:root:Setname:tcga_brca
INFO:root:Observed: 0.0417359 p-value: 0.524752
INFO:root:Mapped 172 genes out of 259.
INFO:root:Setname:tcga_prad
INFO:root:Observed: 0.00404209 p-value: 0.940594
INFO:root:Mapped 1212 genes out of 1640.
INFO:root:Setname:tcga_lusc
INFO:root:Observed: 0.0939875 p-value: 0.039604
INFO:root:Mapped 2295 genes out of 3247.
INFO:root:Setname:tcga_laml
INFO:root:Observed: 0.135056 p-value: 0.366337
INFO:root:Mapped 1326 genes out of 1796.
INFO:root:Setname:tcga_dlbc
INFO:root:Observed: 0.0924852 p-value: 0.0990099
INFO:root:Mapped 568 genes out of 769.
INFO:root:Setname:tcga_blca
INFO:root:Observed: 0.0421927 p-value: 0.0891089


In [1]:
! pygna test-topology-sp biogrid_3.168_filtered.tsv merged.gmt interactome_SP.hdf5 table_topology_sp.csv --number-of-permutations 10 --cores 2


INFO:root:Results file = table_topology_sp.csv
INFO:root:Setname:tcga_brca
INFO:root:Mapped 758 genes out of 986.
INFO:root:Observed: 1.69789 p-value: 1
INFO:root:Setname:tcga_prad
INFO:root:Mapped 172 genes out of 259.
INFO:root:Observed: 2.06395 p-value: 1
INFO:root:Setname:tcga_lusc
INFO:root:Mapped 1208 genes out of 1640.
INFO:root:Observed: 1.34189 p-value: 0.0909091
INFO:root:Setname:tcga_laml
INFO:root:Mapped 2292 genes out of 3247.
INFO:root:Observed: 1.28316 p-value: 0.272727
INFO:root:Setname:tcga_dlbc
INFO:root:Mapped 1322 genes out of 1796.
INFO:root:Observed: 1.32148 p-value: 0.0909091
INFO:root:Setname:tcga_blca
INFO:root:Mapped 568 genes out of 769.
INFO:root:Observed: 1.71479 p-value: 1
Closing remaining open files:interactome_SP.hdf5...doneinteractome_SP.hdf5...done


In [1]:
! pygna test-topology-total-degree biogrid_3.168_filtered.tsv merged.gmt table_topology_total_degree.csv --number-of-permutations 100 --cores 4

INFO:root:Evaluating the test topology total degree, please wait
INFO:root:Results file = table_topology_total_degree.csv
INFO:root:Mapped 759 genes out of 986.
INFO:root:Setname:tcga_brca
INFO:root:Observed: 19.2793 p-value: 1
INFO:root:Null mean: %g null variance: %g
INFO:root:Mapped 172 genes out of 259.
INFO:root:Setname:tcga_prad
INFO:root:Observed: 13.9826 p-value: 1
INFO:root:Null mean: %g null variance: %g
INFO:root:Mapped 1212 genes out of 1640.
INFO:root:Setname:tcga_lusc
INFO:root:Observed: 37.4142 p-value: 0.019802
INFO:root:Null mean: %g null variance: %g
INFO:root:Mapped 2295 genes out of 3247.
INFO:root:Setname:tcga_laml
INFO:root:Observed: 33.7834 p-value: 0.227723
INFO:root:Null mean: %g null variance: %g
INFO:root:Mapped 1326 genes out of 1796.
INFO:root:Setname:tcga_dlbc
INFO:root:Observed: 36.494 p-value: 0.029703
INFO:root:Null mean: %g null variance: %g
INFO:root:Mapped 568 genes out of 769.
INFO:root:Setname:tcga_blca
INFO:root:Observed: 18.632 p-value: 1

#### **Association Tests**

In a GNA, two genesets are tested for their association. When testing a single geneset against many pathways, using the --keep flag is recommended. With it, only geneset A is randomly permuted during resampling, while geneset B is kept as it is. This strategy is more conservative and helps test whether the geneset is more strongly connected to the pathway (or any other geneset of interest) than expected by chance.
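
The effect of keeping geneset B fixed can be sketched in Python as follows. The statistic used here (number of edges running from A to B) and the helper names are stand-ins chosen for brevity. PyGNA's association statistics are based on shortest paths or RWR, so this only illustrates the resampling scheme.

```python
import random

def cross_edges(adjacency, set_a, set_b):
    """Toy association statistic: number of edges running from set_a to set_b."""
    members_b = set(set_b)
    return sum(1 for a in set_a for nb in adjacency[a] if nb in members_b)

def association_null(adjacency, set_a, set_b, n_permutations, keep, seed=0):
    """Null distribution; with keep=True only geneset A is resampled."""
    rng = random.Random(seed)
    nodes = sorted(adjacency)
    null = []
    for _ in range(n_permutations):
        perm_a = rng.sample(nodes, len(set_a))
        perm_b = list(set_b) if keep else rng.sample(nodes, len(set_b))
        null.append(cross_edges(adjacency, perm_a, perm_b))
    return null

# Toy network (hypothetical).
adjacency = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f"], "f": ["e"],
}
set_a, set_b = ["a", "b"], ["c", "d"]

observed = cross_edges(adjacency, set_a, set_b)
null_keep = association_null(adjacency, set_a, set_b, 50, keep=True)
null_both = association_null(adjacency, set_a, set_b, 50, keep=False)
print(observed, max(null_keep), max(null_both))
```

With `keep=True`, the null distribution reflects how random genesets relate to this specific pathway, rather than how random sets relate to each other.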

In [4]:
! pygna test-association-rwr biogrid_3.168_filtered.tsv merged.gmt interactome_RWR.hdf5 table_association_rwr.csv --file-geneset-b goslim_entrez.gmt --keep --number-of-permutations 100 --cores 4

INFO:root:geneset_a contains 6 sets
INFO:root:Setnames in A: ['tcga_brca', 'tcga_prad', 'tcga_lusc', 'tcga_laml', 'tcga_dlbc', 'tcga_blca']
INFO:root:geneset_b contains 139 sets
INFO:root:Setnames in B: ['ATPase activity', 'DNA binding', 'DNA binding transcription factor activity', 'DNA metabolic process', 'GTPase activity', 'Golgi apparatus', 'RNA binding', 'aging', 'anatomical structure development', 'anatomical structure formation involved in morphogenesis', 'autophagy', 'biological_process', 'biosynthetic process', 'carbohydrate metabolic process', 'catabolic process', 'cell', 'cell adhesion', 'cell cycle', 'cell death', 'cell differentiation', 'cell division', 'cell junction organization', 'cell morphogenesis', 'cell motility', 'cell proliferation', 'cell wall', 'cell wall organization or biogenesis', 'cell-cell signaling', 'cellular amino acid metabolic process', 'cellular component assembly', 'cellular nitrogen compound metabolic process', 'cellular protein modification process'

If you did not pass the --results-figure flag at the comparison step, plot the matrix as follows:

In [5]:
! pygna paint-comparison-matrix table_association_rwr.csv heatmap_association_rwr.png --rwr --annotate

Traceback (most recent call last):
  File "/home/gee3/.local/bin/pygna", line 8, in <module>
    sys.exit(main())
  File "/home/gee3/.local/lib/python3.8/site-packages/pygna/cli.py", line 17, in main
    argh.dispatch_commands([
  File "/home/gee3/.local/lib/python3.8/site-packages/argh/dispatching.py", line 328, in dispatch_commands
    dispatch(parser, *args, **kwargs)
  File "/home/gee3/.local/lib/python3.8/site-packages/argh/dispatching.py", line 174, in dispatch
    for line in lines:
  File "/home/gee3/.local/lib/python3.8/site-packages/argh/dispatching.py", line 277, in _execute_command
    for line in result:
  File "/home/gee3/.local/lib/python3.8/site-packages/argh/dispatching.py", line 260, in _call
    result = function(*positional, **keywords)
  File "/home/gee3/.local/lib/python3.8/site-packages/pygna/painter.py", line 201, in paint_comparison_matrix
    with open(table_filename, "r") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'table_association_rwr.csv

If setname B is not passed, the analysis is run between each pair of setnames in the geneset file, as shown below. This is the only difference between the single-geneset and multi-geneset analyses: there is no within-comparison in the multi-geneset case.

In [None]:
! pygna test-association-rwr biogrid_3.168_filtered.tsv merged.gmt interactome_RWR.hdf5 table_within_comparison_rwr.csv --number-of-permutations 100 --cores 2

! pygna paint-comparison-matrix table_within_comparison_rwr.csv heatmap_within_comparison_rwr.png --rwr --single-geneset

In [None]:
! pygna test-association-sp biogrid_3.168_filtered.tsv merged.gmt interactome_SP.hdf5 table_association_SP.csv --file-geneset-b goslim_entrez.gmt --keep --number-of-permutations 2 --cores 3
! pygna paint-comparison-matrix table_association_SP.csv heatmap_association_sp.png --annotate  # default heatmap

INFO:root:geneset_a contains 6 sets
INFO:root:geneset_b contains 139 sets
INFO:root:Results file = table_association_SP.csv
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 504 genes out of 541 from genesetB
INFO:root:n_proc = 3, each computing 0 permutations 
INFO:root:Observed: 0.0722265 p-value: 1
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(
  ret = ret.dtype.type(ret / rcount)
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 2379 genes out of 2659 from genesetB
INFO:root:n_proc = 3, each computing 0 permutations 
INFO:root:Observed: 0.0167353 p-value: 1
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 2031 genes out of 2275 from genesetB
INFO:root:n_proc = 3, each computing 0 permutations 
INFO:root:Observed: 0.0571095 p-value: 1
INFO:root:Mapped 758 genes out of 986 from genesetA

In [None]:
! pygna test-association-sp biogrid_3.168_filtered.tsv merged.gmt interactome_SP.hdf5 table_within_comparison_sp.csv --number-of-permutations 2 --cores 2
! pygna paint-comparison-matrix table_within_comparison_sp.csv heatmap_within_comparison_sp.png --single-geneset

INFO:root:Analysing all the sets in merged.gmt
INFO:root:Results file = table_within_comparison_sp.csv
INFO:root:Analysing tcga_brca and tcga_prad
INFO:root:Mapped 758 genes out of 986 from genesetA
INFO:root:Mapped 172 genes out of 259 from genesetB
INFO:root:n_proc = 2, each computing 1 permutations 
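
The `n_proc = ..., each computing ... permutations` lines in the logs above suggest that the requested permutations are split evenly among the worker processes using integer division. The sketch below reproduces the two logged cases under that assumption; the function name is hypothetical and PyGNA's actual scheduling code is not shown here.

```python
def permutations_per_process(n_permutations, n_proc):
    """Permutations handled by each worker, assuming floor division."""
    return n_permutations // n_proc

# The two configurations that appear in the logs above.
per_proc_3_cores = permutations_per_process(2, 3)
per_proc_2_cores = permutations_per_process(2, 2)
print(per_proc_3_cores, per_proc_2_cores)
```

If this reading is right, asking for fewer permutations than cores leaves the null distribution empty, which would explain the p-values of 1 and the NumPy warning fragments in the run with 2 permutations on 3 cores. Using at least as many permutations as cores, and realistically several hundred, avoids the problem.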


#### **Diagnostic**

Distribution plot

When running a statistical test, you may want to visually assess the null distribution. Passing -d \<diagnostic_folder/> on the command line saves a distribution plot of the empirical null for each test in that folder.

In [3]:
! pygna test-topology-total-degree biogrid_3.168_filtered.tsv merged.gmt diagnstic_total_degree.csv -d "diagnostic/" --number-of-permutations 2 --cores 2

INFO:root:Evaluating the test topology total degree, please wait
INFO:root:Results file = diagnstic_total_degree.csv
INFO:root:Mapped 759 genes out of 986.
INFO:root:Setname:tcga_brca
INFO:root:Observed: 19.2793 p-value: 1
INFO:root:Null mean: %g null variance: %g
  ndim = x[:, None].ndim
  x = x[:, np.newaxis]
  y = y[:, np.newaxis]
xmax 45.160879
  g4 = axes.stem([observed], [ymax / 2], "r", "r--")
INFO:root:Output for diagnostic null distribution: diagnostic/tcga_brca_total_degree_null_distribution.pdf
INFO:root:Mapped 172 genes out of 259.
INFO:root:Setname:tcga_prad
INFO:root:Observed: 13.9826 p-value: 1
INFO:root:Null mean: %g null variance: %g
  ndim = x[:, None].ndim
  x = x[:, np.newaxis]
  y = y[:, np.newaxis]
xmax 29.277120
  g4 = axes.stem([observed], [ymax / 2], "r", "r--")
INFO:root:Output for diagnostic null distribution: diagnostic/tcga_prad_total_degree_null_distribution.pdf
INFO:root:Mapped 1212 genes out of 1640.
INFO:root:Setname:tcga_lusc
INFO:root:Observed: 37.414

### **Visualisation**

There are four main types of figures currently implemented in PyGNA to visualize GNT and GNA results: bar plots, point plots, heatmaps and volcano plots.

Bar plots are used to plot the GNT results for a single statistic. For each geneset, a red bar represents the observed statistic, whereas a blue one represents the average of the empirical null distribution. Conversely, a dot plot can be used to summarize multiple tests for the same geneset. In order to show all the results in the same figure, the observed values are transformed into absolute normalized z-scores, such that all significant tests have a z-score > 0 and are marked with a red dot.
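
For reference, a plain z-score of an observed statistic against its empirical null can be computed as in the Python sketch below. It uses the population standard deviation as an assumption, and it does not reproduce the additional normalization that PyGNA applies for the dot plot. The numbers are invented.

```python
import statistics

def z_score(observed, null_values):
    """Distance of observed from the null mean, in null standard deviations."""
    null_mean = statistics.mean(null_values)
    null_std = statistics.pstdev(null_values)  # population standard deviation
    return (observed - null_mean) / null_std

# Toy null distribution (hypothetical) with mean 5 and standard deviation 2.
null_values = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(10, null_values)
print(z)
```

Expressing every test on this common scale is what allows statistics with very different units, such as module size and mean degree, to share one axis.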

GNA results can instead be visualised on heatmaps, with the color gradients used to report the strength of association between two genesets. When an all-vs-all test is conducted, a lower triangular matrix is shown, with stars denoting significance. If an M-vs-N test is conducted instead, the complete heatmap is shown.

Alternatively, volcano plots can be used to visualize one-vs-many GNA results, for testing a geneset against a large number of datasets (e.g. gene ontologies). The plot shows the normalized z-score on the x-axis and the −log10 of the p-value adjusted to control the False Discovery Rate (FDR) on the y-axis. Significant results are shown with red crosses, whereas non-significant associations are represented by blue dots. The plot can be annotated to highlight the top 5 terms.
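
The y-axis of the volcano plot can be illustrated with a small Python sketch. The Benjamini-Hochberg procedure is used here as an assumption, since it is a common way to control the FDR; the text above does not say which procedure PyGNA applies. The p-values are invented.

```python
import math

def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, pvalues[index] * m / rank)
        adjusted[index] = running_min
    return adjusted

# Toy p-values (hypothetical), one per tested pathway.
pvalues = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(pvalues)
y_values = [-math.log10(p) for p in adjusted]
print(adjusted, y_values)
```

Taking the negative logarithm places the most significant associations at the top of the plot.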

In [4]:
! pygna paint-datasets-stats table_topology_module.csv gnt_tm.png #GNT barplot
! pygna paint-summary-gnt dotplt.png #GNT Summary
! pygna paint-comparison-matrix table_association_sp.csv withncomp_sp.pdf #heatmap
! pygna paint-volcano-plot table_association_sp.csv volcno_sp.png #volcanoplot

Traceback (most recent call last):
  File "/home/gee3/.local/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 3361, in get_loc
    return self._engine.get_loc(casted_key)
  File "pandas/_libs/index.pyx", line 76, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 108, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 5198, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 5206, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'observed'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/gee3/.local/bin/pygna", line 8, in <module>
    sys.exit(main())
  File "/home/gee3/.local/lib/python3.8/site-packages/pygna/cli.py", line 17, in main
    argh.dispatch_commands([
  File "/home/gee3/.local/lib/python3.8/site-packages/argh/dispatching.py", line 328, in dispatch