Welcome to the first step of the cross-species analysis with the SATURN package. Before starting this step, make sure your AnnData files are ready and saved in the folder called "data". If you haven't converted your Seurat object to AnnData yet, use the R script to do so first.

#### All combined


In [1]:
!ls

cross_species_cell_type_map.csv  S1_Mutlipleseeds.ipynb
Mouse_SC.h5ad			 zebrafish_mouse_run.csv
multiple_seed_results		 Zebrafish_SC.h5ad


### Making the CSV file for running SATURN

This CSV file lists the path to each dataset we want to integrate, along with the path to the pretrained protein embeddings (PyTorch `.torch` files) for the respective species. You can either edit the CSV file directly or create it with the following code.

In [2]:
import pandas as pd

In [3]:
df = pd.DataFrame(columns=["path", "species", "embedding_path"])
df["species"]=["zebrafish", "mouse"]
df["path"] = ["/DATAFILES/Cross_species/Fulldatasets/Allcombined/Zebrafish_SC.h5ad", "/DATAFILES/Cross_species/Fulldatasets/Allcombined/Mouse_SC.h5ad"]

zebrafish_embedding_path = "/DATAFILES/Cross_species/protein_embeddings/ESM2/zebrafish_embedding.torch"
mouse_embedding_path = "/DATAFILES/Cross_species/protein_embeddings/ESM2/mouse_embedding.torch"

df["embedding_path"] = [zebrafish_embedding_path, mouse_embedding_path]
df.to_csv("zebrafish_mouse_run.csv", index=False)
df

Unnamed: 0,path,species,embedding_path
0,/DATAFILES/Cross_species/Fulldatasets/Allcombi...,zebrafish,/DATAFILES/Cross_species/protein_embeddings/ES...
1,/DATAFILES/Cross_species/Fulldatasets/Allcombi...,mouse,/DATAFILES/Cross_species/protein_embeddings/ES...
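
Because a wrong path in this file only shows up once SATURN starts, it can help to check the CSV first. The following is a minimal sketch that assumes the three column names used above; `missing_paths` is a hypothetical helper and not part of SATURN.

```python
import csv
import os


def missing_paths(run_csv, columns=("path", "embedding_path")):
    """Return (species, column, path) for every listed file that does not exist."""
    missing = []
    with open(run_csv, newline="") as handle:
        for row in csv.DictReader(handle):
            for column in columns:
                if not os.path.exists(row[column]):
                    missing.append((row["species"], column, row[column]))
    return missing


# Usage (run in the notebook's working directory):
# for species, column, path in missing_paths("zebrafish_mouse_run.csv"):
#     print(f"{species}: {column} not found -> {path}")
```

An empty list means every dataset and embedding file listed in the CSV was found on disk.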


In [4]:
import scanpy as sc

In [5]:
zebra=sc.read_h5ad("/DATAFILES/Cross_species/Fulldatasets/Allcombined/Zebrafish_SC.h5ad")

In [6]:
zebra.obs["cell_type"].unique()

array(['MHM-Microglia', 'MHM-Neurons', 'MHM-Div-Myeloid',
       'MHM-Macrophage', 'MHM-Ependymal', 'MHM-iNeurons', 'MHM-T Cells',
       'MHM-Astrocytes', 'MHM-Oligodendrocytes', 'MHM-OPCs',
       'MHM-Pericytes', 'MHM-Neutrophils'], dtype=object)

#### Make sure there are no underscores (_) in the cell type annotations. They will throw an error later.

In [10]:
mouse=sc.read_h5ad("/DATAFILES/Cross_species/Fulldatasets/Allcombined/Mouse_SC.h5ad")

In [11]:
mouse.obs["cell_type"].unique()

array(['Anderson-Oligodendrocytes', 'Anderson-OPC', 'Anderson-Astrocytes',
       'Anderson-Vascular', 'Anderson-Microglia', 'Anderson-Neurons',
       'JaeLee-Microglia', 'JaeLee-Ependymal', 'JaeLee-Endothelial',
       'JaeLee-Immune', 'JaeLee-Astrocytes', 'JaeLee-Oligodendrocytes',
       'JaeLee-Pericytes', 'JaeLee-OPCs', 'JaeLee-Div-Myeloid',
       'ArielLevine-Oligodendrocytes', 'ArielLevine-Schwann',
       'ArielLevine-Oligodendrocyte Progenitors/Precursors',
       'ArielLevine-Neurons', 'ArielLevine-Microglia/Hematopoietic',
       'ArielLevine-Endothelial', 'ArielLevine-Leptomeninges',
       'ArielLevine-Ependymal', 'ArielLevine-Astrocytes',
       'ArielLevine-Pericytes'], dtype=object)
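
The underscore check on the cell type labels can be sketched as below. Both helpers are hypothetical, and replacing underscores with a hyphen is only an assumption; any renamed label would also have to match the entries in `cross_species_cell_type_map.csv`.

```python
def labels_with_underscores(labels):
    """Return the sorted labels that contain an underscore."""
    return sorted(str(label) for label in set(labels) if "_" in str(label))


def replace_underscores(label, replacement="-"):
    """Swap underscores for another character (a hyphen by default)."""
    return str(label).replace("_", replacement)


# Usage with the objects loaded in the cells above:
# for adata in (zebra, mouse):
#     bad = labels_with_underscores(adata.obs["cell_type"].unique())
#     if bad:
#         print("rename these labels:", bad)
#         adata.obs["cell_type"] = adata.obs["cell_type"].astype(str).map(replace_underscores)
```

For the labels printed above, `labels_with_underscores` returns an empty list, since none of them contains an underscore.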

## Running multiple seeds of SATURN

The seed initializes the random number generator used for centroid creation and the dataloader.
You don't need to run multiple seeds, but doing so gives a sense of the variance in the integration results. You can then choose the seed with the highest integration score if you want.
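
To illustrate what a seed does, here is a small sketch using Python's standard `random` module. It is only an illustration of seeded random number generation in general; SATURN's own seeding code is not shown here, and `draw` is a hypothetical helper.

```python
import random


def draw(seed, n=3):
    """Draw n pseudo-random numbers from a generator initialised with `seed`."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


same_seed_repeats = draw(0) == draw(0)       # identical seed, identical numbers
different_seeds_differ = draw(0) != draw(1)  # different seed, different numbers
print(same_seed_repeats, different_seeds_differ)
```

The same seed reproduces the same run, while different seeds start from different random states, which is why the integration results vary between seeds.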

### Running multiple seeds to identify the best integration

In [4]:
!python3 /DATAFILES/Cross_species/SATURN/saturn_multiple_seeds_singlegpu.py \
--run=zebrafish_mouse_run.csv --embedding_model=ESM2 \
--seeds=25

['0']
  0%|                                                    | 0/25 [00:00<?, ?it/s]RUNNING SEED: 0 ON GPU:0
Epoch 200: L1 Loss 0.0 Rank Loss 10.856510162353516, Avg Loss mouse: 1624, Avg L
100%|█████████████████████████████████████████| 224/224 [00:17<00:00, 12.72it/s]
100%|█████████████████████████████████████████| 224/224 [00:08<00:00, 27.08it/s]
  4%|█▍                                   | 1/25 [1:35:33<38:13:22, 5733.45s/it]RUNNING SEED: 1 ON GPU:0
Epoch 200: L1 Loss 0.0 Rank Loss 11.11199951171875, Avg Loss mouse: 1624, Avg Lo
100%|█████████████████████████████████████████| 224/224 [00:17<00:00, 12.64it/s]
100%|█████████████████████████████████████████| 224/224 [00:07<00:00, 28.13it/s]
  8%|██▉                                  | 2/25 [3:11:08<36:38:10, 5734.38s/it]RUNNING SEED: 2 ON GPU:0
Epoch 200: L1 Loss 0.0 Rank Loss 10.910243034362793, Avg Loss mouse: 1624, Avg L
100%|█████████████████████████████████████████| 224/224 [00:17<00:00, 12.68it/s]
100%|██████████████████████████
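
A rough runtime estimate can be read off the progress bar above. This sketch simply multiplies the reported time per seed by the number of seeds; it assumes every seed takes about as long as the first ones on a single GPU.

```python
seconds_per_seed = 5733.45  # "s/it" value reported by the progress bar above
n_seeds = 25                # value passed to --seeds

total_hours = seconds_per_seed * n_seeds / 3600
print(f"estimated total runtime: {total_hours:.1f} h")
```

With these numbers the full run takes roughly 40 hours, so it is worth starting it in a session that will not be interrupted.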

### Scoring Multiple Seeds


Running multiple seeds lets us select the best integration. To do that, we now need to score each seed.


In [5]:
!ls

cross_species_cell_type_map.csv  S1_Mutlipleseeds.ipynb
Mouse_SC.h5ad			 zebrafish_mouse_run.csv
multiple_seed_results		 Zebrafish_SC.h5ad


In [6]:
from glob import glob
fz_adatas = glob("multiple_seed_results/saturn_results/*ESM2*5000*8000*default*.h5ad")
fz_adatas

['multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_0_pretrain.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_0.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_1_pretrain.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_1.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_2_pretrain.h5ad',
 'multiple_seed_results/saturn_results

In [7]:
fz_adatas = [path for path in fz_adatas if "pretrain" not in path]  # drop the pretrain files, keep the final outputs
fz_adatas

['multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_0.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_1.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_2.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_3.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebrafish_SC_org_zebrafish_mouse_run_l1_0_pe_1.0_ESM2_macrogenes_5000_hv_genes_8000_centroid_score_func_default_seed_4.h5ad',
 'multiple_seed_results/saturn_results/test256_data_Mouse_SC_Zebr

In [8]:
# the seed is the last "_"-separated token of the file name ("..._seed_0.h5ad" -> "0")
seeds = [path.split("_")[-1].replace(".h5ad", "") for path in fz_adatas]

# create a CSV file that pairs each seed with its output file path
import pandas as pd
score_df = pd.DataFrame()
score_df["seed"] = seeds
score_df["path"] = fz_adatas
display(score_df.head())
print(len(score_df))
score_df.to_csv("./multi_seeds.csv", index=False)

Unnamed: 0,seed,path
0,0,multiple_seed_results/saturn_results/test256_d...
1,1,multiple_seed_results/saturn_results/test256_d...
2,2,multiple_seed_results/saturn_results/test256_d...
3,3,multiple_seed_results/saturn_results/test256_d...
4,4,multiple_seed_results/saturn_results/test256_d...


25
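
Splitting on "_" works for the file names above because the seed is the last token. A regular expression is a slightly stricter alternative; the sketch below assumes the `..._seed_<N>.h5ad` naming seen in the listing, and `seed_from_path` is a hypothetical helper.

```python
import re

# matches the trailing "_seed_<N>.h5ad" of a final (non-pretrain) output file
SEED_PATTERN = re.compile(r"_seed_(\d+)\.h5ad$")


def seed_from_path(path):
    """Return the seed encoded in a '..._seed_<N>.h5ad' file name, or None."""
    match = SEED_PATTERN.search(path)
    return int(match.group(1)) if match else None


# Usage with the list built above:
# seeds = [seed_from_path(path) for path in fz_adatas]
```

Because the pattern is anchored at the end of the name, pretrain files (`..._seed_<N>_pretrain.h5ad`) return `None` instead of a wrong seed. Note that this returns integers, whereas the cell above stores the seeds as strings.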


In [9]:
!python3 /DATAFILES/Cross_species/SATURN/score_adata.py --adata=multi_seeds.csv --scores=1 \
                                 --multiple_files --species1=zebrafish --species2=mouse --label=labels2 \
                                 --ct_map=cross_species_cell_type_map.csv


  if species_1 or species_2 is "human":
  elif species_1 or species_2 is "zebrafish":
0
100%|███████████████████████████████████████████| 25/25 [01:41<00:00,  4.05s/it]
100%|███████████████████████████████████████████| 25/25 [02:24<00:00,  5.78s/it]
multi_seeds_scores.csv
    seed  ...               Label
0      0  ...  zebrafish to mouse
1      1  ...  zebrafish to mouse
2      2  ...  zebrafish to mouse
3      3  ...  zebrafish to mouse
4      4  ...  zebrafish to mouse
5      5  ...  zebrafish to mouse
6      6  ...  zebrafish to mouse
7      7  ...  zebrafish to mouse
8      8  ...  zebrafish to mouse
9      9  ...  zebrafish to mouse
10    10  ...  zebrafish to mouse
11    11  ...  zebrafish to mouse
12    12  ...  zebrafish to mouse
13    13  ...  zebrafish to mouse
14    14  ...  zebrafish to mouse
15    15  ...  zebrafish to mouse
16    16  ...  zebrafish to mouse
17    17  ...  zebrafish to mouse
18    18  ...  zebrafish to mouse
19    19  ...  zebrafish to mouse
20    20  ...

In [10]:
df = pd.read_csv("multi_seeds_scores.csv")
df = df.sort_values("Logistic Regression", ascending=False)
df


Unnamed: 0,seed,path,Logistic Regression,Balanced Regression,Reannotation,Label
15,15,multiple_seed_results/saturn_results/test256_d...,0.392334,0.205905,,zebrafish to mouse
39,14,multiple_seed_results/saturn_results/test256_d...,0.384001,0.33283,,mouse to zebrafish
35,10,multiple_seed_results/saturn_results/test256_d...,0.378225,0.328388,,mouse to zebrafish
18,18,multiple_seed_results/saturn_results/test256_d...,0.330539,0.176424,,zebrafish to mouse
36,11,multiple_seed_results/saturn_results/test256_d...,0.270255,0.249713,,mouse to zebrafish
46,21,multiple_seed_results/saturn_results/test256_d...,0.269339,0.24912,,mouse to zebrafish
25,0,multiple_seed_results/saturn_results/test256_d...,0.268961,0.248644,,mouse to zebrafish
37,12,multiple_seed_results/saturn_results/test256_d...,0.267874,0.249013,,mouse to zebrafish
48,23,multiple_seed_results/saturn_results/test256_d...,0.267618,0.249251,,mouse to zebrafish
29,4,multiple_seed_results/saturn_results/test256_d...,0.267593,0.248629,,mouse to zebrafish


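Seed 15 has the highest logistic regression score (0.392, zebrafish to mouse), but its balanced regression score is low (0.206). Seed 14 is a close second in logistic regression (0.384, mouse to zebrafish) and has the highest balanced regression score (0.333) of the rows shown. So, I will select the AnnData output from seed 14 for downstream analysis.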
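
Since the score table mixes both transfer directions, it can help to look at the best seed per direction and per metric. The following is a minimal sketch that assumes the column names shown in the table above; `best_per_label` and `load_scores` are hypothetical helpers, not part of SATURN.

```python
import csv


def best_per_label(rows, metric):
    """For each transfer direction (Label), return the row with the highest `metric`."""
    best = {}
    for row in rows:
        if row[metric] in ("", None):  # skip rows without a score
            continue
        label = row["Label"]
        if label not in best or float(row[metric]) > float(best[label][metric]):
            best[label] = row
    return best


def load_scores(path="multi_seeds_scores.csv"):
    """Read the score file written by the scoring step as a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


# Usage:
# scores = load_scores()
# for metric in ("Logistic Regression", "Balanced Regression"):
#     for label, row in best_per_label(scores, metric).items():
#         print(metric, "|", label, "| seed", row["seed"], "|", row[metric])
```

Comparing the two metrics side by side makes it easier to see whether one seed wins on both or whether the choice involves a trade-off.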

Now, please use the scanpy_analysis.py file for downstream analysis of the output files for seed 14.

In [11]:
df.to_csv("Allcombined_seed_Scores.csv", index=False)