# GORi: an efficient algorithm to annotate pools of genesets simultaneously, with multiple knowledge bases.

In this notebook, we present GORi: an algorithm developed to conduct enrichment analyses on a pool of genesets, with multiple knowledge bases.
GORi is computationally efficient, and is able to leverage hierarchical relationships between genesets (e.g. marker genes of a cell population and marker genes of its sub-populations) to improve the quality of the enrichment analysis. 

### Installing GORi

GORi can be installed as a Python3 package using pip3 and its GitHub repository:

In [None]:
!pip3 install git+https://github.com/yanisaspic/GORi.git

Following this installation, you will be able to load it as a package in Python3 scripts:

In [None]:
import gori

### Overview of GORi

GORi loads a pool of **priors** and leverages the gene annotations of each prior to identify strong associations between pairs of concepts. A prior is a collection of three Python3 dictionaries:
    
- **annotations**: a dict associating gene symbols or UniProt ids (keys) to a set of semantic ids (values).
- **hierarchy**: a dict associating semantic ids (keys) to a set of parents in the hierarchy (values).
- **translations**: a dict associating semantic ids (keys) to their human-readable label (values).
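
As an illustration of this structure, here is a hand-written toy prior. The semantic ids (`TOY:...`) and labels are made up for the example and do not come from any knowledge base shipped with GORi; the three top-level keys follow the description above.

```python
# A toy prior written by hand (hypothetical ids and labels, for illustration only).
toy_prior = {
    # gene symbol -> set of semantic ids annotating the gene
    "annotations": {
        "GFAP": {"TOY:0002"},
        "AQP4": {"TOY:0002"},
        "PTPRC": {"TOY:0003"},
    },
    # semantic id -> set of parent ids (the root has no parent)
    "hierarchy": {
        "TOY:0001": set(),
        "TOY:0002": {"TOY:0001"},
        "TOY:0003": {"TOY:0001"},
    },
    # semantic id -> human-readable label
    "translations": {
        "TOY:0001": "cell",
        "TOY:0002": "astrocyte-like cell",
        "TOY:0003": "immune-like cell",
    },
}

# Sanity check: every id used in the annotations should be described
# in both the hierarchy and the translations.
annotated_ids = set().union(*toy_prior["annotations"].values())
missing_parents = annotated_ids - set(toy_prior["hierarchy"])
missing_labels = annotated_ids - set(toy_prior["translations"])
print(missing_parents, missing_labels)
```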

To demonstrate how to use GORi, we will apply it to the results of a scRNA-seq clustering analysis conducted with the fEVE framework (cf. https://github.com/yanisaspic/fEVE), which we download from an online repository.

*This analysis was conducted on a human glioblastoma dataset, and 10 populations of cells were predicted: C, C.1, C.2, C.3, C.L, C.L.1, C.L.L, C.L.L.1, C.L.L.2 and C.L.L.L. These populations were predicted across multiple resolutions, as indicated by their labels (e.g. C.L.1 and C.L.L are two sub-populations of C.L). For each population, a set of marker genes was predicted, and all of this information is stored in the downloaded file `Darmanis_HumGBM.xlsx`.*
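
The way these labels encode a hierarchy can be sketched in a few lines. The rule used here (the parent of a population is its label without the last dot-separated component) is inferred from the example above; it is not taken from the code of `load_feve()`, and the helper `get_parent()` is hypothetical.

```python
# The 10 populations listed above.
populations = ["C", "C.1", "C.2", "C.3", "C.L", "C.L.1",
               "C.L.L", "C.L.L.1", "C.L.L.2", "C.L.L.L"]


def get_parent(label):
    """Return the label without its last component, or None for the root."""
    return label.rpartition(".")[0] or None


# population -> set of parents, in the same shape as the hierarchy of a prior
population_hierarchy = {}
for label in populations:
    parent = get_parent(label)
    population_hierarchy[label] = {parent} if parent else set()

for label, parents in population_hierarchy.items():
    print(label, "<-", parents)
```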

In [None]:
from urllib.request import urlretrieve  # download the example data files

url = "https://github.com/yanisaspic/GORi/raw/refs/heads/main/data/Darmanis_HumGBM.xlsx"
urlretrieve(url, "./Darmanis_HumGBM.xlsx")  # same file name as the one loaded below

##### Setting-up GORi

Using the function `load_feve()`, we will load the results of this analysis as a prior for GORi:

In [None]:
from gori.loaders import load_feve

feve_prior = load_feve(path="./Darmanis_HumGBM.xlsx")
print(feve_prior)

For this analysis, our other priors will correspond to curated knowledge bases associating genes to specific concepts. Eight different priors are readily implemented in GORi:

- **BIOP: biological processes**, from the Gene Ontology (GO).
- **CELC: cellular components**, from the Gene Ontology (GO).
- **CTYP: cell types**, from the Cell Ontology (CL) and CellMarker 2.0.
- **DISE: diseases**, from the Comparative Toxicogenomics Database (CTD) and the Medical Subject Headings (MeSH).
- **GENG: gene groups**, from the HUGO Gene Nomenclature Committee (HGNC).
- **MOLF: molecular functions**, from the Gene Ontology (GO).
- **PATH: pathways**, from Reactome.
- **PHEN: phenotypes**, from the Human Phenotype Ontology (HPO).

In order to load them, the function `load_priors()` must be called.

**Note:** the priors CTYP, DISE, GENG and PATH must be initialized with `download_priors()` and `setup_priors()` before loading.

In [None]:
from gori.init import download_priors, setup_priors
from gori.loaders import load_priors

priors = {"BIOP", "CELC", "CTYP", "DISE", "GENG", "MOLF", "PATH", "PHEN"}

download_priors()
setup_priors()
data = load_priors(priors=priors) # only priors in the set will be loaded

**Note 1:** downloading and setting-up CTYP, DISE, GENG and PATH can be time-consuming. You only need to do it once, as local files will be stored on your machine afterwards (by default, in the folders `./.priors` and `./priors`).

**Note 2:** the first time you load the priors BIOP, CELC, MOLF and PHEN, cache files are automatically saved on your machine (by the `pypath` package). This process can also be time-consuming. If errors are raised after loading one of these priors, they are likely caused by cache issues. They can be solved by clearing the cache and regenerating it with `load_priors()`.

Finally, it is also possible to load a prior from a collection of three local .json files with the function `load_local()`. For instance, a prior labeled POPU can be loaded if three .json files are saved locally: `POPU_a.json`, `POPU_h.json` and `POPU_t.json` (for annotations, hierarchy and translations, respectively).
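
A minimal sketch of how such a collection of files could be written is shown below. The file names follow the convention described above, but the content of each file is an assumption: JSON has no set type, so lists are used here, and the exact schema expected by `load_local()` should be checked against the files generated in `./priors`. The ids and labels are made up.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical content for a prior labeled POPU (lists instead of sets,
# because JSON cannot store sets).
popu_annotations = {"GENE1": ["POPU:2"], "GENE2": ["POPU:2", "POPU:3"]}
popu_hierarchy = {"POPU:1": [], "POPU:2": ["POPU:1"], "POPU:3": ["POPU:1"]}
popu_translations = {"POPU:1": "all cells", "POPU:2": "population 2", "POPU:3": "population 3"}

# The files are written to a temporary folder to keep the example harmless.
popu_folder = Path(tempfile.mkdtemp())
contents = {"a": popu_annotations, "h": popu_hierarchy, "t": popu_translations}
for suffix, content in contents.items():
    with open(popu_folder / f"POPU_{suffix}.json", "w") as file:
        json.dump(content, file, indent=2)

popu_files = sorted(path.name for path in popu_folder.glob("POPU_*.json"))
print(popu_files)
```

If the schema matches what GORi expects, the prior could then be loaded with `load_local(prior="POPU", path=str(popu_folder))`.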

We demonstrate how to use this function by loading the DISE prior from the local collection of .json files generated after its initialization:

In [None]:
from gori.loaders import load_local

dise_prior = load_local(prior="DISE", path="./priors")
print(dise_prior)

**Note:** priors loaded with GORi should be rooted to a unique concept (*e.g.* `biological_process`) to ensure that the analysis runs smoothly. 
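
One way to check this before running an analysis is to list the terms of a hierarchy that have no parent. The sketch below assumes that the hierarchy is a dict associating each term to a set of parents, as described in the overview; `find_roots()` and `add_unique_root()` are hypothetical helpers and not part of GORi.

```python
def find_roots(hierarchy):
    """Return the terms that have no parent in the hierarchy."""
    terms = set(hierarchy)
    for parents in hierarchy.values():
        terms |= set(parents)  # parents that are not keys are terms too
    return {term for term in terms if not hierarchy.get(term)}


def add_unique_root(hierarchy, root="toy_root"):
    """Return a copy of the hierarchy where all the roots share one parent."""
    rooted = {term: set(parents) for term, parents in hierarchy.items()}
    roots = find_roots(hierarchy)
    if len(roots) <= 1:
        return rooted  # already rooted to a unique concept
    for term in roots:
        rooted[term] = {root}
    rooted[root] = set()
    return rooted


# A toy hierarchy with two disconnected roots, r1 and r2.
forest = {"a": {"r1"}, "b": {"r2"}, "c": {"a"}}
print(find_roots(forest))
print(find_roots(add_unique_root(forest)))
```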

##### Running GORi

After loading all of our priors of interest, we will merge them into a single variable, and start the GORi analysis with the function `gori()`. This function expects four arguments:
- **geneset** is a set of gene symbols or UniProt ids used to conduct the enrichment analysis. We'll use the marker genes predicted with the fEVE analysis.
- **antecedent_prior** is a label indicating which prior should be annotated. We'll annotate the clusters predicted with the fEVE analysis.
- **consequent_priors** is a set of labels indicating which priors should be used to annotate. We'll use the eight default priors available with GORi.
- **data** is a collection of priors. We have loaded them earlier.

In [None]:
%%time
from gori.run import gori

data["FEVE"] = feve_prior
geneset = set(data["FEVE"]["annotations"].keys())  # marker genes, as an explicit set
results = gori(geneset=geneset, antecedent_prior="FEVE", consequent_priors=priors, data=data)
print(results)

**Note:** the results of a fEVE clustering analysis can also be analyzed directly with the function `gorilon()`. By default, 6 priors are used: BIOP, CELC, CTYP, GENG, MOLF and PATH.

In [None]:
from gori.run import gorilon

results = gorilon("./Darmanis_HumGBM.xlsx")
print(results)

A set of default parameters is used to run a GORi analysis. These parameters are generated with the function `get_parameters()`, and they include:

- **n_genes_threshold:** the minimum number of co-annotating genes required to associate two concepts.
- **pvalue_threshold:** the p-value threshold used to identify strong associations (using Fisher's exact test). 
- **use_heuristic:** a boolean indicating if the heuristic strategy should be employed to identify strong associations.
- **use_gene_symbol:** a boolean indicating if gene symbols (True) or UniProt ids (False) should be used.
- **sheets_path:** a path to store the results of the GORi analysis.
- **wrappers:** wrapper functions necessary to run GORi (cf. the section ***Developing new priors in GORi*** below).
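
To make the role of the two thresholds concrete, the sketch below tests the association between two toy concepts from the genes they annotate. It is not GORi's implementation: the one-sided form of Fisher's exact test, the inclusive comparisons and the threshold values are all assumptions chosen for the example, and the gene names are made up.

```python
from math import comb


def fisher_greater(n_both, n_a, n_b, n_total):
    """One-sided Fisher's exact test: probability of observing at least
    n_both shared genes between two concepts annotating n_a and n_b genes
    out of n_total (hypergeometric upper tail)."""
    tail = sum(
        comb(n_a, k) * comb(n_total - n_a, n_b - k)
        for k in range(n_both, min(n_a, n_b) + 1)
    )
    return tail / comb(n_total, n_b)


# Toy universe of 20 genes and two concepts annotating 5 genes each.
universe = {f"G{i}" for i in range(20)}
genes_a = {"G0", "G1", "G2", "G3", "G4"}
genes_b = {"G1", "G2", "G3", "G4", "G5"}
shared = genes_a & genes_b  # co-annotating genes

toy_n_genes_threshold = 3    # hypothetical value, not the GORi default
toy_pvalue_threshold = 0.05  # hypothetical value, not the GORi default

pvalue = fisher_greater(len(shared), len(genes_a), len(genes_b), len(universe))
is_strong = len(shared) >= toy_n_genes_threshold and pvalue <= toy_pvalue_threshold
print(len(shared), pvalue, is_strong)
```

With these toy sets, four genes are shared, so the pair passes the gene-count filter before the p-value is compared to its threshold.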

These parameters can be changed before the GORi analysis, and must be passed to `gori()` or `gorilon()` through the `params` argument:

In [None]:
from gori.params import get_parameters

params = get_parameters()
params["pvalue_threshold"] = 0.001
results = gori(geneset=geneset, antecedent_prior="FEVE", consequent_priors={"CTYP", "BIOP", "PATH"}, data=data, params=params)
print(results)

### Developing new priors in GORi

**Note:** this section is aimed at developers interested in adding complex priors (*i.e.* priors that cannot be split into three .json files) to their GORi analyses. For a majority of users, the functionalities presented in the above section - notably `load_local()` - should be sufficient to carry out insightful GORi analyses.

GORi is a framework structured around a collection of wrapper functions that associate prior labels to prior-specific functions. This modular structure facilitates the integration of multiple knowledge bases in the analysis. In this section, we list each wrapper function with its role, its inputs and its output.
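
The general pattern of associating labels to prior-specific functions can be sketched as a simple dispatch table. Every name below (`TOY_LOADERS`, `toy_load_wrapper()`, the `TOYA` and `TOYB` labels) is hypothetical; the actual signatures of the GORi wrappers may differ and should be read from the package itself.

```python
# Hypothetical prior-specific functions (not part of GORi).
def load_toy_a():
    return {
        "annotations": {"GENE1": {"A:1"}},
        "hierarchy": {"A:1": set()},
        "translations": {"A:1": "toy term"},
    }


def load_toy_b():
    return {
        "annotations": {"GENE1": {"B:1"}},
        "hierarchy": {"B:1": set()},
        "translations": {"B:1": "other toy term"},
    }


# The registry associates each prior label to its prior-specific function.
TOY_LOADERS = {"TOYA": load_toy_a, "TOYB": load_toy_b}


def toy_load_wrapper(prior, loaders=TOY_LOADERS):
    """Dispatch the call to the function registered for the prior label."""
    if prior not in loaders:
        raise KeyError(f"no loader registered for prior '{prior}'")
    return loaders[prior]()


# Supporting a new prior only requires registering one more function.
toy_data = {label: toy_load_wrapper(label) for label in ("TOYA", "TOYB")}
print(sorted(toy_data))
```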

In [None]:
from gori.params import get_parameters

params = get_parameters()
print(params["wrappers"])

##### ancestors_wrapper()

##### annotations_wrapper()

##### descendants_wrapper()

##### download_wrapper()

##### headers_wrapper()

##### inverse_translate_wrapper()

##### load_wrapper()

##### resources_wrapper()

##### roots_wrapper()

##### setup_wrapper()

##### terms_wrapper()

##### translate_wrapper()

##### urls_wrapper()