# MIPath Vignette

## Package installation

Install MIPath with

`pip install mipathway`

## Usage example

This example shows how to run a standard pathway analysis with MIPath.

First, we load the libraries. We also use **scanpy** to download and process a single-cell dataset. You can install **scanpy** with `pip install scanpy`.


In [1]:
import mipath
import scanpy as sc
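
If either import fails, it can help to check which packages are actually installed. The following is a minimal sketch using only the standard library; `missing_packages` and `REQUIRED` are hypothetical names introduced here, not part of MIPath. Note that the package is installed as `mipathway` but imported as `mipath`.

```python
import importlib.util

# Import name -> pip package name, as used in this vignette.
# MIPath is installed as `mipathway` but imported as `mipath`.
REQUIRED = {"mipath": "mipathway", "scanpy": "scanpy"}


def missing_packages(required):
    """Return the pip names of packages whose import name cannot be found."""
    return [
        pip_name
        for import_name, pip_name in required.items()
        if importlib.util.find_spec(import_name) is None
    ]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; install with: pip install " + " ".join(missing))
else:
    print("All required packages were found.")
```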

We will use a dataset of a *hematopoietic cell lineage*, taken from the paper:

Ranzoni AM, Tangherloni A, Berest I, Riva SG, Myers B et al. (2021) Integrative Single-Cell RNA-Seq and ATAC-Seq Analysis of Human Developmental Hematopoiesis

The dataset is available in the EBI Single Cell Expression Atlas under the accession E-MTAB-9067. We use **scanpy** to download and preprocess it.

In [2]:
dataset_id = 'E-MTAB-9067'

adata = sc.datasets.ebi_expression_atlas(dataset_id)
     
# Basic filtering
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Library size correction
sc.pp.normalize_total(adata, target_sum=1e4)

# Logarithmize the data
sc.pp.log1p(adata)
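
To make the two normalization steps concrete, here is a small sketch of the arithmetic for a single cell, written in plain Python. It assumes the default behaviour of `normalize_total` (each cell is scaled so that its counts sum to `target_sum`) followed by a natural-log `log1p`. The helper `normalize_and_log` and the toy counts are illustrative only and are not how scanpy is implemented.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # An empty cell stays all-zero (such cells are filtered out earlier anyway).
        return [0.0 for _ in counts]
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]


# Toy raw counts for four genes in one cell (library size = 100).
cell = [0, 5, 15, 80]
normalized = normalize_and_log(cell)

# Undoing the log transform recovers values that sum to target_sum.
recovered_total = sum(math.expm1(v) for v in normalized)
print(normalized)
print(recovered_total)
```

After this transformation, cells with different sequencing depths are on a comparable scale, and zero counts remain exactly zero.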

Next, we extract the two dataframes needed for the analysis: one with the gene expression values, and one with the phenotype information for each cell.

In [3]:
metadata = adata.obs
data = adata.to_df()

The gene expression dataframe looks like this:

In [4]:
data.head(10)

Unnamed: 0,ENSG00000000003,ENSG00000000419,ENSG00000000457,ENSG00000000460,ENSG00000000938,ENSG00000000971,ENSG00000001036,ENSG00000001084,ENSG00000001167,ENSG00000001460,...,ENSG00000289685,ENSG00000289690,ENSG00000289692,ENSG00000289694,ENSG00000289695,ENSG00000289697,ENSG00000289700,ENSG00000289701,ENSG00000289716,ENSG00000289718
ERR4147602,0.0,0.12353,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.12353,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.204019,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147604,0.0,1.075944,0.0,0.0,0.0,0.0,0.0,0.096876,0.0,0.185191,...,1.763158,0.0,0.0,0.618255,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147605,0.0,0.435912,0.09471,0.048476,0.0,0.138901,0.334586,0.048476,0.0,0.181222,...,0.278324,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147606,0.0,0.693898,0.0,0.0,0.0,2.029455,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147607,0.0,0.0,0.0,0.0,0.0,0.031404,0.0,0.027872,0.0,0.081388,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147608,0.104638,0.03611,0.0,0.043839,0.0,0.0,0.0,0.03611,0.0,0.0,...,0.03611,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147609,0.0,0.0,0.0,0.0,0.0,0.476586,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147610,0.0,0.023072,0.0,0.0,0.0,0.0,0.0,0.398879,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
ERR4147611,0.0,0.101359,0.101359,0.913659,0.0,0.0,0.0,0.0,0.0,0.0,...,0.776297,0.0,0.0,0.642046,0.0,0.0,0.0,0.0,0.0,0.0


The phenotype dataframe looks like this:

In [5]:
metadata.head(10)

Unnamed: 0,Sample Characteristic[organism],Sample Characteristic Ontology Term[organism],Sample Characteristic[individual],Sample Characteristic Ontology Term[individual],Sample Characteristic[developmental stage],Sample Characteristic Ontology Term[developmental stage],Sample Characteristic[age],Sample Characteristic Ontology Term[age],Sample Characteristic[sex],Sample Characteristic Ontology Term[sex],...,Factor Value Ontology Term[single cell identifier],Factor Value[organism part],Factor Value Ontology Term[organism part],Factor Value[sampling site],Factor Value Ontology Term[sampling site],Factor Value[inferred cell type - authors labels],Factor Value Ontology Term[inferred cell type - authors labels],Factor Value[inferred cell type - ontology labels],Factor Value Ontology Term[inferred cell type - ontology labels],n_genes
ERR4147602,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,2191
ERR4147603,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,1810
ERR4147604,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,erythroid progenitor cell,http://purl.obolibrary.org/obo/CL_0000038,erythroid progenitor cell,http://purl.obolibrary.org/obo/CL_0000038,3929
ERR4147605,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,5115
ERR4147606,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,2137
ERR4147607,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,2561
ERR4147608,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,2890
ERR4147609,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,2271
ERR4147610,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,,,,,3252
ERR4147611,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Sample2,,late embryonic stage,http://purl.obolibrary.org/obo/UBERON_0007220,17 week,,female,http://purl.obolibrary.org/obo/PATO_0000383,...,,bone marrow,http://purl.obolibrary.org/obo/UBERON_0002371,hip,http://purl.obolibrary.org/obo/UBERON_0001464,cycling committed progenitor,,progenitor cell,http://purl.obolibrary.org/obo/CL_0011026,5067


Next, we need to load pathway annotation information. 

In the next cell we use the MIPath `parse_gmt` function to parse a `.gmt` file downloaded from **MSigDB**, and translate it from HUGO gene names to Ensembl gene IDs with a mapping file created with [BioMart](https://www.ensembl.org/info/data/biomart/index.html).

For Reactome, MIPath includes a function `get_reactome` that automatically downloads the latest version of the database and parses it. Reactome is already annotated using Ensembl gene IDs.

In [6]:
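# KEGG gene sets from MSigDB, translated from gene names to Ensembl gene IDs
kegg = mipath.parse_gmt(gmt_path='./data/c2.cp.kegg.v7.5.1.symbols.gmt',
                        gene_id_dict_path='./data/name2ensembl_human.txt')

# Latest Reactome release, already annotated with Ensembl gene IDs
reactome = mipath.get_reactome(organism='HSA', gene_anot='Ensembl')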
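
For readers unfamiliar with the `.gmt` format: each line is tab-separated and holds a gene set name, a description (often a URL), and then the member genes. The sketch below parses such lines and translates gene names with a lookup table. It is an illustration of the file format only, not MIPath's `parse_gmt`; `parse_gmt_lines`, the toy pathways, and the toy IDs are all hypothetical, and genes without a mapping are simply dropped here.

```python
def parse_gmt_lines(lines, symbol_to_ensembl):
    """Parse GMT-formatted lines into {gene_set_name: [ensembl_id, ...]}.

    Gene names that have no entry in symbol_to_ensembl are dropped.
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *symbols = fields
        gene_sets[name] = [
            symbol_to_ensembl[s] for s in symbols if s in symbol_to_ensembl
        ]
    return gene_sets


# Toy input: two made-up gene sets and a made-up name-to-ID mapping.
toy_gmt = [
    "TOY_PATHWAY_A\thttp://example.org/a\tGENE1\tGENE2\tGENE3\n",
    "TOY_PATHWAY_B\thttp://example.org/b\tGENE2\tGENE4\n",
]
toy_mapping = {"GENE1": "ENSG_TOY_1", "GENE2": "ENSG_TOY_2", "GENE4": "ENSG_TOY_4"}

gene_sets = parse_gmt_lines(toy_gmt, toy_mapping)
for name, genes in gene_sets.items():
    print(name, genes)
```

With a real file, the lines would come from `open(path)`, and the mapping from the BioMart export. A real mapping may assign several Ensembl IDs to one gene name, which this sketch does not handle.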

We now have everything we need to run the analysis. The MIPath pipeline has two steps.

The first step decomposes the data into modules for each pathway in the annotation. We use the `decompose_pathways` function, passing it both the gene expression data and the pathway annotation. This is the most computationally expensive part of the analysis.

In [7]:
decomposed_df = mipath.decompose_pathways(data = data, gene_sets_df = kegg)

Performing pathway decomposition
(1/186) --- KEGG_N_GLYCAN_BIOSYNTHESIS --- 46 genes
Computed 25 NN in 7.1 s
SNN scores in 0.1 s
Found 23 partitions in 0.3 s

(2/186) --- KEGG_OTHER_GLYCAN_DEGRADATION --- 24 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 28 partitions in 0.2 s

(3/186) --- KEGG_O_GLYCAN_BIOSYNTHESIS --- 30 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 29 partitions in 0.1 s

(4/186) --- KEGG_GLYCOSAMINOGLYCAN_DEGRADATION --- 20 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 30 partitions in 0.1 s

(5/186) --- KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_KERATAN_SULFATE --- 14 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 28 partitions in 0.1 s

(6/186) --- KEGG_GLYCEROLIPID_METABOLISM --- 57 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 28 partitions in 0.3 s

(7/186) --- KEGG_GLYCOSYLPHOSPHATIDYLINOSITOL_GPI_ANCHOR_BIOSYNTHESIS --- 25 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 25 partitions in 0.3 s

[...]

(62/186) --- KEGG_TAURINE_AND_HYPOTAURINE_METABOLISM --- 10 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 54 partitions in 0.1 s

(63/186) --- KEGG_SELENOAMINO_ACID_METABOLISM --- 25 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 28 partitions in 0.2 s

(64/186) --- KEGG_GLUTATHIONE_METABOLISM --- 52 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 17 partitions in 0.4 s

(65/186) --- KEGG_STARCH_AND_SUCROSE_METABOLISM --- 56 genes
Computed 25 NN in 0.2 s
SNN scores in 0.1 s
Found 30 partitions in 0.2 s

(66/186) --- KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM --- 45 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 30 partitions in 0.3 s

(67/186) --- KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_CHONDROITIN_SULFATE --- 22 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 25 partitions in 0.2 s

(68/186) --- KEGG_GLYCOSAMINOGLYCAN_BIOSYNTHESIS_HEPARAN_SULFATE --- 26 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
[...]

(125/186) --- KEGG_CELL_ADHESION_MOLECULES_CAMS --- 250 genes
Computed 25 NN in 0.2 s
SNN scores in 0.1 s
Found 19 partitions in 0.2 s

(126/186) --- KEGG_ADHERENS_JUNCTION --- 80 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 9 partitions in 0.3 s

(127/186) --- KEGG_TIGHT_JUNCTION --- 149 genes
Computed 25 NN in 0.1 s
SNN scores in 0.2 s
Found 9 partitions in 0.2 s

(128/186) --- KEGG_GAP_JUNCTION --- 103 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 15 partitions in 0.2 s

(129/186) --- KEGG_COMPLEMENT_AND_COAGULATION_CASCADES --- 91 genes
Computed 25 NN in 0.1 s
SNN scores in 0.0 s
Found 28 partitions in 0.2 s

(130/186) --- KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION --- 521 genes
Computed 25 NN in 0.2 s
SNN scores in 0.1 s
Found 16 partitions in 0.2 s

(131/186) --- KEGG_TOLL_LIKE_RECEPTOR_SIGNALING_PATHWAY --- 115 genes
Computed 25 NN in 0.1 s
SNN scores in 0.1 s
Found 12 partitions in 0.3 s

[...]
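
The log above reports three stages per pathway: nearest neighbours ("NN"), "SNN scores", and partitions. As a rough illustration of what a shared-nearest-neighbour (SNN) similarity is, the sketch below computes k nearest neighbours on toy 2D points and scores pairs of points by the Jaccard overlap of their neighbour sets. This is an assumption about the general technique the log names, not a description of MIPath's internals: the distance metric, the SNN definition, and the partitioning algorithm MIPath uses are not stated here, and `knn_indices` and `snn_jaccard` are hypothetical helpers.

```python
import math


def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbours."""
    neighbours = []
    for i, p in enumerate(points):
        # Euclidean distance to every other point, closest first.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbours.append({j for _, j in dists[:k]})
    return neighbours


def snn_jaccard(neighbours, i, j):
    """SNN similarity of points i and j: Jaccard overlap of their neighbour sets."""
    union = neighbours[i] | neighbours[j]
    return len(neighbours[i] & neighbours[j]) / len(union) if union else 0.0


# Two well-separated toy groups of three points each.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
nn = knn_indices(points, k=2)

within = snn_jaccard(nn, 0, 1)   # same group: neighbour sets overlap
between = snn_jaccard(nn, 0, 3)  # different groups: no shared neighbours
print(within, between)
```

Points in the same group share neighbours and get a positive score, while points in different groups score zero. A graph weighted this way can then be split into partitions by a community detection method.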



The second step scores how well these modules capture phenotype information, using the `score_factors` function.

We pass the dataframe obtained in the previous step, the phenotype dataframe, and a list of factors to score. Each entry in the list must exactly match a column name in the phenotype dataframe.

In [8]:
to_score = ['Factor Value[inferred cell type - authors labels]', 'Factor Value[sampling site]']

result = mipath.score_factors(decomposed_df, metadata, to_score)

Scoring Factor Value[inferred cell type - authors labels]
Scoring Factor Value[sampling site]
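
Because the factor names must match the phenotype column names exactly (including case, spaces, and brackets), it can be worth validating them before scoring. The following is a minimal sketch; `check_factors` is a hypothetical helper, not part of MIPath, and the column list is a shortened stand-in for `metadata.columns`.

```python
def check_factors(to_score, columns):
    """Return to_score unchanged, or raise if any entry is not an exact column name."""
    available = set(columns)
    unknown = [name for name in to_score if name not in available]
    if unknown:
        raise ValueError(f"Not found in phenotype columns: {unknown}")
    return list(to_score)


# Shortened stand-in for metadata.columns.
columns = [
    "Factor Value[sampling site]",
    "Factor Value[inferred cell type - authors labels]",
    "n_genes",
]

checked = check_factors(["Factor Value[sampling site]"], columns)
print(checked)

# A name that differs only in case is rejected.
try:
    check_factors(["factor value[sampling site]"], columns)
except ValueError as err:
    print(err)
```

With the real data, the call would be `check_factors(to_score, metadata.columns)`.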


We can now display the final results, sorted by the cell type score in descending order:

In [9]:
result.sort_values('Factor Value[inferred cell type - authors labels]', ascending=False)

| Pathway | Factor Value[inferred cell type - authors labels] | Factor Value[sampling site] |
|---|---|---|
| KEGG_HEMATOPOIETIC_CELL_LINEAGE | 0.404919 | 0.027009 |
| KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION | 0.330273 | 0.011363 |
| KEGG_LYSOSOME | 0.313702 | 0.014283 |
| KEGG_CELL_ADHESION_MOLECULES_CAMS | 0.303012 | 0.062967 |
| KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION | 0.300463 | 0.074172 |
| ... | ... | ... |
| KEGG_LINOLEIC_ACID_METABOLISM | 0.054820 | 0.002627 |
| KEGG_OTHER_GLYCAN_DEGRADATION | 0.053092 | 0.005065 |
| KEGG_REGULATION_OF_AUTOPHAGY | 0.050851 | 0.003167 |
| KEGG_CIRCADIAN_RHYTHM_MAMMAL | 0.044396 | 0.004115 |
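
The same table can be ranked by any scored factor. As a dependency-free sketch, the snippet below copies the five top rows displayed above into a plain dictionary and ranks them by the sampling site score instead. `top_pathways` is a hypothetical helper, and the ranking only covers the rows shown here, not the full result table.

```python
CELL_TYPE = "Factor Value[inferred cell type - authors labels]"
SAMPLING_SITE = "Factor Value[sampling site]"

# The five top rows displayed above, copied verbatim.
scores = {
    "KEGG_HEMATOPOIETIC_CELL_LINEAGE": {CELL_TYPE: 0.404919, SAMPLING_SITE: 0.027009},
    "KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION": {CELL_TYPE: 0.330273, SAMPLING_SITE: 0.011363},
    "KEGG_LYSOSOME": {CELL_TYPE: 0.313702, SAMPLING_SITE: 0.014283},
    "KEGG_CELL_ADHESION_MOLECULES_CAMS": {CELL_TYPE: 0.303012, SAMPLING_SITE: 0.062967},
    "KEGG_ANTIGEN_PROCESSING_AND_PRESENTATION": {CELL_TYPE: 0.300463, SAMPLING_SITE: 0.074172},
}


def top_pathways(scores, factor, n=3):
    """Return the n pathway names with the highest score for the given factor."""
    return sorted(scores, key=lambda pathway: scores[pathway][factor], reverse=True)[:n]


top_by_site = top_pathways(scores, SAMPLING_SITE, n=2)
top_by_cell_type = top_pathways(scores, CELL_TYPE, n=1)
print(top_by_site)
print(top_by_cell_type)
```

With the actual `result` dataframe, `result.sort_values('Factor Value[sampling site]', ascending=False).head(2)` gives the equivalent ranking over all pathways.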
