# **Download, Preprocess and Harmonize data**

<a target="_blank" href="https://colab.research.google.com/github/raphaelrubrice/MylliaESG/blob/main/data_scripts/prepare_data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>


## **Setup**

Mount Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')
# Work from /content so that the repo and intermediate data are not written to your Drive
%cd /content

Clone the repo

In [None]:
!git clone https://github.com/raphaelrubrice/MylliaESG.git
%cd MylliaESG

Install dependencies

In [None]:
!pip install -r requirements.txt

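Copy the challenge data from Google Drive and unzip it (this assumes `echoes-of-silenced-genes.zip` is stored in `MyDrive/MylliaESG/data`)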

In [None]:
!cp -r /content/drive/MyDrive/MylliaESG/data .
%cd data
!unzip echoes-of-silenced-genes.zip
!rm echoes-of-silenced-genes.zip
%cd ..

Make gene list

In [None]:
!python data_scripts/make_gene_list.py

## **Download & Preprocess**

First, we download the public datasets used for training and evaluation (in the table, "cells/pert" means cells per perturbation):

| Cell Type           | Role          | Rationale                                                                             |
| ------------------- | ------------- | ------------------------------------------------------------------------------------- |
| **K562** (Replogle) | **Training**  | Best cells/pert, strongest signal, anchor dataset                                     |
| **Jurkat** (Nadig)  | **Training**  | Second hematopoietic line, decent cells/pert (~83)                                    |
| **CD4+ T** (Biohub) | **Training**  | Primary cells, adds biological diversity beyond cell lines                            |
| **RPE1** (Replogle) | **Eval only** | Epithelial, p53-WT - maximally different from training                                |
| **HepG2** (Nadig)   | **Eval only** | Hepatic/epithelial, low cells/pert (~45) makes it weak for training but fine for eval |
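
As an illustration, the split in the table can be written down as a small lookup. This is only a sketch: the keys reuse the dataset names passed to `--dataset` later in this notebook, while `DATASET_ROLES`, `datasets_with_role`, and the `"train"`/`"eval"` labels are hypothetical and not taken from the repo scripts.

```python
# Hypothetical role lookup mirroring the table above (not part of the repo).
DATASET_ROLES = {
    "k562": "train",    # Replogle, anchor dataset
    "jurkat": "train",  # Nadig, second hematopoietic line
    "cd4t": "train",    # Biohub, primary cells
    "rpe1": "eval",     # Replogle, held out
    "hepg2": "eval",    # Nadig, held out
}


def datasets_with_role(role):
    """Return the dataset names assigned to the given role, in table order."""
    return [name for name, assigned in DATASET_ROLES.items() if assigned == role]


train_sets = datasets_with_role("train")
eval_sets = datasets_with_role("eval")
print("train:", train_sets)
print("eval: ", eval_sets)
```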


We preprocess each dataset with the following protocol:
1) Per-cell UMI normalization (divide each count by the cell's total UMI count)
2) Scale each cell to 10,000 total counts
3) Apply a log2(x+1) transform
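
The three steps above can be sketched in NumPy as follows. This is a minimal illustration on a dense toy matrix, not the code in `data_scripts/download_data.py`: the function name `normalize_log2`, the `target_sum` parameter, and the choice to leave all-zero cells at zero are assumptions made for the example.

```python
import numpy as np


def normalize_log2(counts, target_sum=1e4):
    """Per-cell UMI normalization, scaling to target_sum, then log2(x + 1).

    counts: 2D array-like of shape (n_cells, n_genes) holding raw UMI counts.
    """
    counts = np.asarray(counts, dtype=np.float64)
    # Step 1: total UMI count of each cell (one value per row).
    totals = counts.sum(axis=1, keepdims=True)
    # Assumption: cells with zero total counts stay all-zero instead of
    # producing a division by zero.
    safe_totals = np.where(totals == 0, 1.0, totals)
    # Step 2: rescale so that every non-empty cell sums to target_sum.
    scaled = counts / safe_totals * target_sum
    # Step 3: log2(x + 1) transform.
    return np.log2(scaled + 1.0)


# Toy example: 3 cells x 2 genes, including one empty cell.
toy = np.array([[1, 3], [0, 0], [5, 5]])
normalized = normalize_log2(toy)
print(normalized)
```

Real Perturb-seq matrices are usually stored sparse in `.h5ad` files, so a production version would operate on sparse rows rather than densifying the whole matrix.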

In [None]:
import os
os.environ["PATH_GENE_LIST"] = "/content/MylliaESG/data/gene_list.txt"
os.environ["PATH_MYLLIA_H5AD"] = "/content/MylliaESG/data/training_cells.h5ad"

Download and preprocess public datasets

In [None]:
# Log in to the Biohub Virtual Cell Platform (replace the username with your own); you will be prompted for your password
!vcp login --username raphael.rubrice@ens-paris-saclay.fr

In [None]:
!python data_scripts/download_data.py --dataset k562 rpe1 jurkat hepg2 cd4t --gene-list $PATH_GENE_LIST

Preprocess the raw training data from the MylliaESG challenge

In [None]:
!python data_scripts/download_data.py --preprocess-only $PATH_MYLLIA_H5AD --gene-list $PATH_GENE_LIST

In [None]:
# Delete the raw downloads once preprocessing is done
!rm -rf data/raw/

In [None]:
# Copy the preprocessed data back to Google Drive
!cp -r /content/MylliaESG/data/ /content/drive/MyDrive/MylliaESG/.

## **Harmonize data and create train/val sets**

In [None]:
!python data_scripts/uniformize_data.py --gene-list $PATH_GENE_LIST

In [None]:
# Copy the harmonized data and train/val sets back to Google Drive
!cp -r /content/MylliaESG/data/ /content/drive/MyDrive/MylliaESG/.