# GORi: an efficient algorithm to annotate pools of genesets simultaneously, with multiple knowledge bases.

In this notebook, we present GORi: an algorithm developed to conduct enrichment analyses on a pool of genesets, with multiple knowledge bases.
GORi is computationally efficient, and it can leverage hierarchical relationships between genesets (*e.g.* marker genes of a cell population and marker genes of its sub-populations) to improve the quality of the enrichment analysis.

### Installing GORi

GORi can be installed as a Python3 package using pip3 and its GitHub repository with the following command: \
`pip3 install git+https://github.com/yanisaspic/GORi.git`

Following this installation, you will be able to load it as a package in Python3 scripts:

In [None]:
import gori

### Overview of GORi

GORi loads a pool of **priors** and leverages the gene annotations of each prior to identify strong associations between pairs of concepts. A prior is a collection of three Python3 dictionaries:
    
- **annotations**: a dict associating gene symbols or UniProt IDs (keys) to a set of semantic ids (values).
- **hierarchy**: a dict associating semantic ids (keys) to a set of parents in the hierarchy (values).
- **translations**: a dict associating semantic ids (keys) to their human-readable label (values).
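To make this structure concrete, here is a minimal sketch of a toy prior. The semantic ids (`POPU:1`, ...) and their labels are hypothetical. Grouping the three dicts under the keys `annotations`, `hierarchy` and `translations` is inferred from how priors are accessed later in this notebook (`data["FEVE"]["annotations"]`). Giving the root term an empty set of parents is an assumption.

```python
# A hypothetical toy prior: three dicts grouped under one dict.
toy_prior = {
    "annotations": {  # gene symbol -> set of semantic ids
        "GFAP": {"POPU:2"},
        "OLIG2": {"POPU:3"},
        "SOX2": {"POPU:2", "POPU:3"},
    },
    "hierarchy": {  # semantic id -> set of parent ids
        "POPU:1": set(),  # assumed convention: the root has no parent
        "POPU:2": {"POPU:1"},
        "POPU:3": {"POPU:1"},
    },
    "translations": {  # semantic id -> human-readable label
        "POPU:1": "cell population",
        "POPU:2": "astrocyte-like population",
        "POPU:3": "oligodendrocyte-like population",
    },
}

# Sanity check: every annotated id should also appear in the hierarchy.
annotated_ids = set().union(*toy_prior["annotations"].values())
missing = annotated_ids - toy_prior["hierarchy"].keys()
print(missing)
```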

To demonstrate how to use GORi, we will apply it on the results of a scRNA-seq clustering analysis conducted with the fEVE framework (cf. https://github.com/yanisaspic/fEVE) downloaded from an online repository.

*This analysis was conducted on a human glioblastoma dataset, and 10 populations of cells were predicted: C, C.1, C.2, C.3, C.L, C.L.1, C.L.L, C.L.L.1, C.L.L.2 and C.L.L.L. These populations were predicted across multiple resolutions, as indicated by their labels (e.g. C.L.1 and C.L.L are two sub-populations of C.L). For each population, a set of marker genes was predicted, and all of this information is stored in the data file `Darmanis_HumGBM.xlsx` of the GORi package.*
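The labels above encode the hierarchy between populations. The sketch below derives a child-to-parents dict from them, assuming that the parent of a population is its label without the last dot-separated suffix. The helper `parent_label()` is hypothetical, and this is not necessarily how `load_feve()` builds its hierarchy.

```python
from typing import Optional


def parent_label(label: str) -> Optional[str]:
    """Return the parent population of a label, or None for the root."""
    if "." not in label:
        return None
    return label.rsplit(".", 1)[0]


populations = [
    "C", "C.1", "C.2", "C.3", "C.L",
    "C.L.1", "C.L.L", "C.L.L.1", "C.L.L.2", "C.L.L.L",
]

# Build a hierarchy dict: population -> set of parent populations.
hierarchy = {}
for label in populations:
    parent = parent_label(label)
    hierarchy[label] = {parent} if parent is not None else set()

print(hierarchy["C.L.1"], hierarchy["C.L.L"])
```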

##### Setting-up GORi

Using the function `load_feve()`, we will load the results of this analysis as a prior for GORi:

In [None]:
import importlib.resources  # import the clustering analysis results
from gori.loaders import load_feve

with importlib.resources.path(gori, "Darmanis_HumGBM.xlsx") as path:
    feve_prior = load_feve(path=path)
print(feve_prior)

For this analysis, our other priors will correspond to curated knowledge bases associating genes to specific concepts. Eight different priors are readily implemented in GORi:

- **BIOP: biological processes**, from the Gene Ontology (GO).
- **CELC: cellular components**, from the Gene Ontology (GO).
- **CTYP: cell types**, from the Cell Ontology (CL) and CellMarker 2.0.
- **DISE: diseases**, from the Comparative Toxicogenomics Database (CTD) and the Medical Subject Headings (MeSH).
- **GENG: gene groups**, from the HUGO Gene Nomenclature Committee (HGNC).
- **MOLF: molecular functions**, from the Gene Ontology (GO).
- **PATH: pathways**, from Reactome.
- **PHEN: phenotypes**, from the Human Phenotype Ontology (HPO).

In order to load them, the function `load_priors()` must be called.

**Note 1:** the priors CTYP, DISE, GENG and PATH must be initialized with `download_priors()` and `setup_priors()` before loading.

**Note 2:** only human genes are handled by the default priors of GORi.

In [None]:
from gori.init import download_priors, setup_priors
from gori.loaders import load_priors

priors = {"BIOP", "CELC", "CTYP", "DISE", "GENG", "MOLF", "PATH", "PHEN"}
priors_requiring_init = {"CTYP", "DISE", "GENG", "PATH"}

download_priors(priors_requiring_init)
setup_priors(priors_requiring_init)
data = load_priors(priors)

**Note 1:** downloading and setting up CTYP, DISE, GENG and PATH can be time-consuming. You only need to do it once, as local files will be stored on your machine afterwards (by default, in the folders `./.priors` and `./priors`).

**Note 2:** the first time you load the priors BIOP, CELC, MOLF and PHEN, cache files are automatically saved on your machine (by the `pypath` package). This process can also be time-consuming. If errors are raised after loading one of these priors, they are likely caused by cache issues, which can be solved by clearing the cache and regenerating it with `load_priors()`.

Finally, it is also possible to load a prior from a collection of three local .json files with the function `load_local()`. For instance, a prior labeled POPU can be loaded if three .json files are saved locally: `POPU_a.json`, `POPU_h.json` and `POPU_t.json` (for annotations, hierarchy and translations, respectively).

We demonstrate how to use this function by loading the DISE prior from the local collection of .json files generated after its initialization:

In [None]:
from gori.loaders import load_local

dise_prior = load_local(prior="DISE", path="./priors")
print(dise_prior)
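
To prepare your own collection of files for `load_local()`, the three dicts of a prior can be written with the standard `json` module. This is only a sketch: the `POPU_a.json`, `POPU_h.json` and `POPU_t.json` names follow the convention described above, but the toy content is hypothetical, and storing each set as a list is an assumption (JSON has no set type). Check the files generated for a default prior in `./priors` to confirm the expected layout.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical toy content for a prior labeled POPU.
annotations = {"GFAP": {"POPU:2"}, "SOX2": {"POPU:2", "POPU:3"}}
hierarchy = {"POPU:1": set(), "POPU:2": {"POPU:1"}, "POPU:3": {"POPU:1"}}
translations = {"POPU:1": "cell population", "POPU:2": "type A", "POPU:3": "type B"}


def to_jsonable(mapping):
    """Convert set values to sorted lists so they can be serialized."""
    return {
        key: sorted(value) if isinstance(value, set) else value
        for key, value in mapping.items()
    }


out_dir = Path(tempfile.mkdtemp())  # replace with your own folder
for suffix, content in (("a", annotations), ("h", hierarchy), ("t", translations)):
    with open(out_dir / f"POPU_{suffix}.json", "w") as file:
        json.dump(to_jsonable(content), file, indent=2)

written = sorted(path.name for path in out_dir.iterdir())
print(written)
```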

**Note:** you should make sure that any prior loaded with GORi is a directed acyclic graph (DAG), rooted to a unique concept (*e.g.* `biological_process`). Otherwise, the analysis might not run smoothly.
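
A hierarchy can be checked against this requirement before running an analysis. The sketch below is not part of GORi: `check_hierarchy()` is a hypothetical helper that assumes a child-to-parents dict. It returns the parentless terms and a boolean indicating whether the graph is acyclic, using Kahn's algorithm.

```python
from collections import deque


def check_hierarchy(hierarchy):
    """Return (roots, is_acyclic) for a child -> parents dict."""
    terms = set(hierarchy)
    for parents in hierarchy.values():
        terms |= set(parents)

    children = {term: set() for term in terms}
    n_parents = {term: 0 for term in terms}
    for child, parents in hierarchy.items():
        n_parents[child] = len(set(parents))
        for parent in parents:
            children[parent].add(child)

    # Kahn's algorithm: repeatedly remove terms whose parents were all removed.
    roots = {term for term in terms if n_parents[term] == 0}
    queue = deque(roots)
    n_visited = 0
    while queue:
        term = queue.popleft()
        n_visited += 1
        for child in children[term]:
            n_parents[child] -= 1
            if n_parents[child] == 0:
                queue.append(child)
    return roots, n_visited == len(terms)


valid = {"root": set(), "a": {"root"}, "b": {"root", "a"}}
cyclic = {"a": {"b"}, "b": {"a"}}
print(check_hierarchy(valid))
print(check_hierarchy(cyclic))
```

A prior satisfying the note above should yield exactly one root and an acyclic graph.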

##### Running GORi

After loading all of our priors of interest, we will merge them into a single variable, and start the GORi analysis with the function `gori()`. This function expects four arguments:
- **geneset** is a set of gene symbols or UniProt IDs used to conduct the enrichment analysis. We'll use the marker genes predicted with the fEVE analysis.
- **antecedent_prior** is a label indicating which prior should be annotated. We'll annotate the clusters predicted with the fEVE analysis.
- **consequent_priors** is a set of labels indicating which priors should be used to annotate. We'll use the eight default priors available with GORi.
- **data** is a collection of priors. We have loaded them earlier.

In [None]:
%%time
from gori.run import gori

data["FEVE"] = feve_prior
geneset = data["FEVE"]["annotations"].keys()
results = gori(geneset=geneset, antecedent_prior="FEVE", consequent_priors=priors, data=data, save=False)
print(results)

**Note:** the results of a fEVE clustering analysis can also be analyzed directly with the function `gorilon()`. In this case, the fEVE prior is automatically generated and annotated with 6 other priors: BIOP, CELC, CTYP, GENG, MOLF and PATH.

In [None]:
from gori.run import gorilon

with importlib.resources.path(gori, "Darmanis_HumGBM.xlsx") as path:
    results = gorilon(path, save=False)
print(results)

By default, the results of the GORi analysis are stored in a collection of spreadsheets `GORi.xlsx`, and a Jupyter Notebook `GORi.ipynb` is generated to help investigate them. The resulting notebook documents the number of annotations found for each gene in all priors, the number of associations found during the GORi analysis, their identities, and their most informative words. A more extensive presentation of these elements is directly available in the notebook.

However, because we set `save=False` when we ran `gori()` and `gorilon()` earlier, neither document was generated.

##### Parameterizing GORi

A set of default parameters is used to run a GORi analysis. These parameters are generated with the function `get_parameters()`, and they include:

- **n_genes_threshold:** the minimum number of co-annotating genes required to associate two concepts. Defaults to `5`.
- **pvalue_threshold:** the p-value threshold used to identify strong associations (using Fisher's exact test). Defaults to `0.05`.
- **use_heuristic:** a boolean indicating if the heuristic strategy should be employed to identify strong associations. Defaults to `True`.
- **use_gene_symbol:** a boolean indicating if gene symbols (True) or UniProt IDs (False) should be used. Defaults to `True`.
- **sheets_path:** a path to store the spreadsheets of the GORi analysis. Defaults to `./GORi.xlsx`.
- **notebook_path:** a path to store the interactive notebook of the GORi analysis. Defaults to `./GORi.ipynb`.
- **stopwords:** a set of words that should be filtered out from the words overview.
- **wrappers:** wrapper functions necessary to run GORi; see ***Developing new priors in GORi***.
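
To illustrate how `n_genes_threshold` and `pvalue_threshold` interact, the sketch below tests a single pair of concepts on toy data. It assumes a one-sided Fisher's exact test (over-representation) computed from the hypergeometric distribution, and a gene universe of 20 hypothetical genes. The exact contingency table, universe and sidedness used by GORi are not described here, and the heuristic strategy is not covered.

```python
from math import comb


def fisher_right_tail(n_both, n_a, n_b, n_total):
    """One-sided Fisher's exact test: P(X >= n_both) under the hypergeometric null."""
    n_tables = comb(n_total, n_b)
    favourable = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return favourable / n_tables


# Hypothetical toy data: 20 genes, two concepts from two different priors.
universe = {f"G{i}" for i in range(20)}
genes_a = {f"G{i}" for i in range(8)}      # genes annotated by concept A
genes_b = {f"G{i}" for i in range(2, 9)}   # genes annotated by concept B
co_annotating = genes_a & genes_b          # genes annotated by both concepts

n_genes_threshold, pvalue_threshold = 5, 0.05  # default values listed above
pvalue = fisher_right_tail(
    len(co_annotating), len(genes_a), len(genes_b), len(universe)
)
is_strong = len(co_annotating) >= n_genes_threshold and pvalue <= pvalue_threshold
print(len(co_annotating), pvalue, is_strong)
```

In this toy example, six genes are co-annotated and the association passes both default thresholds. With the stricter values `n_genes_threshold = 10` and `pvalue_threshold = 0.001`, it would be discarded.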

These settings can be changed prior to the GORi analysis, and must be passed to the `params` argument of `gori()` or `gorilon()`:

In [None]:
from gori.params import get_parameters

params = get_parameters()
params["n_genes_threshold"] = 10
params["pvalue_threshold"] = 0.001
params["use_heuristic"] = False
params["use_gene_symbol"] = False
params["stopwords"] = params["stopwords"] | {"bind", "binding"}  # two separate stopwords
results = gori(geneset=geneset, antecedent_prior="FEVE", consequent_priors={"CTYP", "BIOP", "PATH"}, data=data, params=params, save=False)
print(results)

### Developing with GORi

**Note:** this section is aimed at developers interested in adding complex priors (*i.e.* priors that cannot be split into three .json files) to their GORi analyses. For a majority of users, the functionalities presented in the above section should be sufficient to carry out insightful GORi analyses, and new priors can likely be generated using `load_local()`.

GORi is a framework structured around a collection of wrapper functions associating prior labels to prior-specific functions. This modular structure facilitates the integration of multiple knowledge bases in the analysis. If we want to add a new prior to GORi (again, **a new prior that cannot be split into three .json files**), we need to develop new prior-specific functions, and add them to the wrapper functions of GORi.

In [None]:
from gori.params import get_parameters

params = get_parameters()
print(params["wrappers"].keys())

Each wrapper function corresponds to a dict associating a prior label to its prior-specific function, *e.g.* `BIOP` and `_get_biop_ancestors()`. Each wrapper function fulfills a different role, and expects different arguments. Below, we list the wrapper functions, their role, their inputs and their output. 
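
The dispatch pattern described above can be sketched as follows. Everything in this block is hypothetical (the `TOYA` and `TOYB` priors, their terms and the `get_root()` helper). It only shows how a dict of prior labels can route a call to the matching prior-specific function, here with parameterless functions similar to those described for `roots_wrapper()` below.

```python
def _get_toya_root():
    """Prior-specific function for the hypothetical TOYA prior."""
    return "TOYA:0"


def _get_toyb_root():
    """Prior-specific function for the hypothetical TOYB prior."""
    return "TOYB:0"


# A wrapper: prior label -> prior-specific function.
toy_roots_wrapper = {"TOYA": _get_toya_root, "TOYB": _get_toyb_root}


def get_root(prior_label, wrapper=toy_roots_wrapper):
    """Look up the function registered for a prior label, then call it."""
    if prior_label not in wrapper:
        raise KeyError(f"No function registered for prior '{prior_label}'.")
    return wrapper[prior_label]()


print(get_root("TOYA"), get_root("TOYB"))
```

With this layout, supporting a new prior only requires registering one more entry in each wrapper dict, without changing the calling code.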

##### ancestors_wrapper()

A wrapper of `_get_ancestors()` functions, which return the ancestor terms of a given subset of terms.

They expect two inputs:

- `terms` is a set of terms from a specific knowledge base.
- `prior` is a data structure storing the annotations, the hierarchy and the translations of a specific knowledge base.

They return a set of terms.

##### annotations_wrapper()

A wrapper of `_get_annotations()` functions, which return the terms annotating a given UniProt ID.

They expect four inputs:

- `uid` is a UniProt ID.
- `prior` is a prior label (*e.g.* BIOP).
- `data` is a collection of knowledge bases; see `load_priors()`.
- `params` is a collection of GORi parameters.

They return a set of terms.

##### descendants_wrapper()

A wrapper of `_get_descendants()` functions, which return the descendant terms of a given subset of terms.

They expect two inputs:

- `terms` is a set of terms from a specific knowledge base.
- `prior` is a data structure storing the annotations, the hierarchy and the translations of a specific knowledge base.

They return a set of terms.

##### download_wrapper()

A wrapper of `_download()` functions, which download the source files required to set up priors.

They expect two inputs:

- `path` is a path where downloaded files will be stored.
- `params` is a collection of GORi parameters.

They return nothing.

##### headers_wrapper()

A wrapper that stores url headers leading to the description of a term (*e.g.* `www.ebi.ac.uk/QuickGO/term/` for `BIOP`).

**This wrapper does not expect a function, but a string directly.**

##### inverse_translate_wrapper()

A wrapper of `_get_inverse_translation()` functions, which translate a human-readable label to a term (*e.g.* `biological_process` to `GO:0008150`).

They expect two inputs: 
- `label` is a human-readable label from a specific knowledge base.
- `prior` is a data structure storing the annotations, the hierarchy and the translations of a specific knowledge base.

They return a term.

##### load_wrapper()

A wrapper of `_load()` functions, which load priors from local files.

They expect one input:
- `path` is a path leading to local files that have been set up.

They return a data structure storing the annotations, the hierarchy and the translations of a specific knowledge base.

##### resources_wrapper()

A wrapper that stores source files to download.

**This wrapper does not expect a function, but a dict associating file names to download urls directly.**

##### roots_wrapper()

A wrapper of `get_roots()` functions, which return the root term of a prior.

They expect no inputs.

They return a term.

##### setup_wrapper()

A wrapper of `setup()` functions, which set up local files from downloaded files.

They expect two inputs:
- `dl_path` is a path leading to downloaded files.
- `su_path` is a path leading to local files that have been set up.

They return nothing.

##### terms_wrapper()

A wrapper of `get_terms()` functions, which list every term (*e.g.* `GO:0008150`) in a prior.

They expect one input:
- `prior` is a data structure storing the annotations, the hierarchy and the translations of a specific knowledge base.

They return a set of terms.

##### translate_wrapper()

A wrapper of `get_translation()` functions, which translate a term to a human-readable label (*e.g.* `GO:0008150` to `biological_process`).

They expect two inputs:
- `term` is a term from a specific knowledge base.
- `prior` is a data structure storing the annotations, the hierarchy and the translations of a specific knowledge base.

They return a human-readable label.

##### urls_wrapper()

A wrapper of `get_url()` functions, which return the url leading to the description of a term.

They expect three inputs:
- `term` is a term from a specific knowledge base.
- `header` is the generic section of a url; see `headers_wrapper()`.
- `data` is a collection of knowledge bases; see `load_priors()`.

They return a term-specific url.