# **Pipeline to obtain biologically grounded gene clusters**

<a target="_blank" href="https://colab.research.google.com/github/raphaelrubrice/MylliaESG/blob/main/gene_clusters/gene_clusters.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

**Goal:** Obtain gene clusters based on six biological views:
- Reactome (pathway co-occurrence)
- GO:BP (biological process co-occurrence)
- GO:CC (subcellular localization co-occurrence)
- GO:MF (molecular function co-occurrence)
- ESM-2 (protein embedding similarity)
- Co-expression (Spearman correlation across cells in the data)
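
To illustrate the co-expression view, the Spearman correlation between two genes can be computed as the Pearson correlation of their ranks across cells. The block below is a minimal pure-Python sketch on made-up values. The names `average_ranks` and `spearman` are hypothetical and are not taken from the pipeline, which may well compute this differently (for example, vectorized over all gene pairs).

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation, i.e. the Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

# Toy expression values for two genes across five cells (illustrative only).
gene_a = [0.0, 1.2, 3.4, 2.0, 5.1]
gene_b = [0.1, 0.9, 9.0, 1.5, 20.0]
rho = spearman(gene_a, gene_b)
print(f"Spearman rho = {rho:.3f}")
```

Because only ranks matter, the two toy genes are perfectly correlated here even though their raw values differ by an order of magnitude.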

## **Setup**

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Work from /content so that the repo and data are not written to your Drive
%cd /content

Clone the repo

In [None]:
!git clone https://github.com/raphaelrubrice/MylliaESG.git
%cd MylliaESG

Install dependencies

In [None]:
!pip install -r requirements.txt

Copy the challenge data from Google Drive and unzip it

In [None]:
!cp -r /content/drive/MyDrive/MylliaESG/data .
%cd data
!unzip echoes-of-silenced-genes.zip
!rm echoes-of-silenced-genes.zip
%cd ..

Make gene list

In [None]:
!python data_scripts/make_gene_list.py

## **Running the pipeline**

In [None]:
import os
# os.environ is a mapping: assign with square brackets, not a function call.
os.environ["PATH_GENE_LIST"] = "/content/MylliaESG/data/gene_list.txt"
os.environ["PATH_DATA_H5AD"] = "/content/MylliaESG/data/training_cells.h5ad"
os.environ["PATH_CACHE"] = "/content/MylliaESG/gene_clusters/gene_clusters_results/cache"
os.environ["PATH_OUTPUT"] = "/content/MylliaESG/gene_clusters/gene_clusters_results/"
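
Before launching the pipeline, it can help to check that the four variables are set and that the two input files exist. The block below is a minimal sketch. The helper `missing_inputs` is hypothetical and not part of the repository, and it assumes that the cache and output directories may be created by the pipeline, so for those it only checks that the variable is set.

```python
import os

# Variables that must point at existing files before the run.
INPUT_FILE_VARS = ("PATH_GENE_LIST", "PATH_DATA_H5AD")
# Directories: assumed to be creatable later, so only "is set" is checked.
DIR_VARS = ("PATH_CACHE", "PATH_OUTPUT")

def missing_inputs(env=os.environ):
    """Return a list of human-readable problems; empty if everything looks fine."""
    problems = []
    for name in INPUT_FILE_VARS + DIR_VARS:
        if not env.get(name):
            problems.append(f"{name} is not set")
    for name in INPUT_FILE_VARS:
        path = env.get(name)
        if path and not os.path.isfile(path):
            problems.append(f"{name} points to a missing file: {path}")
    return problems

# Report any problem found in the current environment.
for problem in missing_inputs():
    print("WARNING:", problem)
```

An empty result means the shell expansions such as `$PATH_GENE_LIST` in the next cell will resolve to existing files.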

In [None]:
!python multiview_gene_clusters.py \
    --genes $PATH_GENE_LIST \
    --adata $PATH_DATA_H5AD \
    --esm-device cuda \
    --cache $PATH_CACHE \
    --output $PATH_OUTPUT
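
The same invocation can also be assembled from Python, which makes an unset variable fail with a `KeyError` instead of silently expanding to an empty string in the shell. The block below is a sketch only: the flags are copied from the command above, while `build_command` and the `demo_env` values are hypothetical and used purely for illustration.

```python
import os
import shlex

def build_command(env=os.environ, esm_device="cuda"):
    """Assemble the CLI call shown above; raises KeyError if a variable is unset."""
    return [
        "python", "multiview_gene_clusters.py",
        "--genes", env["PATH_GENE_LIST"],
        "--adata", env["PATH_DATA_H5AD"],
        "--esm-device", esm_device,
        "--cache", env["PATH_CACHE"],
        "--output", env["PATH_OUTPUT"],
    ]

# Placeholder paths so the example runs anywhere (not the real data locations).
demo_env = {
    "PATH_GENE_LIST": "data/gene_list.txt",
    "PATH_DATA_H5AD": "data/training_cells.h5ad",
    "PATH_CACHE": "results/cache",
    "PATH_OUTPUT": "results/",
}
print(shlex.join(build_command(demo_env)))

# To actually launch the pipeline with the real environment:
#   import subprocess
#   subprocess.run(build_command(), check=True)
```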