# Replicate Figures 1, 2 and 3

To replicate the SATURN results for frog and zebrafish embryogenesis, you need to run SATURN 30 times, each time with a different seed.

To make this easier, we provide a Python script, `saturn_multiple_seeds.py`, that runs SATURN once per seed for a given number of seeds.


**NOTE: run the Train SATURN vignette first, `Vignettes/frog_zebrafish_embryogenesis/Train SATURN.ipynb`**

In [39]:
# Make a copy of the vignette's run file whose paths are relative to the repository root,
# because the multi-seed script below is launched from there (`cd ../../`).
import pandas as pd

run_df = pd.read_csv("data/frog_zebrafish_run.csv")
run_df["path"] = "Vignettes/frog_zebrafish_embryogenesis/" + run_df["path"]
run_df.to_csv("data/frog_zebrafish_run_multi.csv", index=False)
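
Because the multi-seed run is long, it can be worth checking that the rewritten paths resolve before launching it. The helper below is a minimal sketch and not part of SATURN: `missing_paths` is a hypothetical name, and the demo uses made-up file names in a temporary directory.

```python
from pathlib import Path
import tempfile

import pandas as pd


def missing_paths(run_df: pd.DataFrame, repo_root: str) -> list:
    """Return the entries of run_df["path"] that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in run_df["path"] if not (root / p).exists()]


# Demo on a throwaway directory; both file names are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "present.h5ad").touch()
    demo_df = pd.DataFrame({"path": ["present.h5ad", "absent.h5ad"]})
    demo_missing = missing_paths(demo_df, tmp)

print(demo_missing)  # ['absent.h5ad']
```

In this notebook, `missing_paths(run_df, "../../")` should return an empty list if every file in the run CSV is in place.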

# Run the 30 seeds

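*This will take a while: in the run logged below, the first three seeds (one per GPU) took about 2 h 25 min, and the progress bar estimated roughly 22 more hours for the remaining 27.*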

In [1]:
!cd ../../ ; python3 saturn_multiple_seeds.py \
                --run=Vignettes/frog_zebrafish_embryogenesis/data/frog_zebrafish_run_multi.csv --embedding_model=ESM2 \
                --gpus 7 8 9 \
                --seeds=30

['7', '8', '9']
  0%|                                                    | 0/30 [00:00<?, ?it/s]RUNNING SEED: 0 ON GPU:7
RUNNING SEED: 1 ON GPU:8
RUNNING SEED: 2 ON GPU:9
Global seed set to 0
Global seed set to 0
Global seed set to 0
Epoch 200: L1 Loss 0.0 Rank Loss 12.267735481262207, Avg Loss frog: 1861, Avg Lo
Epoch 200: L1 Loss 0.0 Rank Loss 12.50074577331543, Avg Loss frog: 1861, Avg Los
100%|█████████████████████████████████████████| 157/157 [00:22<00:00,  7.08it/s]
100%|█████████████████████████████████████████| 157/157 [00:21<00:00,  7.19it/s]
Epoch 200: L1 Loss 0.0 Rank Loss 12.381772994995117, Avg Loss frog: 1862, Avg Lo
100%|█████████████████████████████████████████| 157/157 [00:21<00:00,  7.19it/s]
100%|█████████████████████████████████████████| 157/157 [00:10<00:00, 15.70it/s]
 10%|███▋                                 | 3/30 [2:25:12<21:46:51, 2904.13s/it]RUNNING SEED: 3 ON GPU:7
Global seed set to 0
100%|█████████████████████████████████████████| 157/157 [00:09<00:00, 15.
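
The log shows seeds 0, 1 and 2 starting on GPUs 7, 8 and 9, and seed 3 reusing GPU 7. The sketch below reproduces that pairing with a simple round-robin assignment. It is an assumption about the scheduling based only on the log; how `saturn_multiple_seeds.py` actually dispatches jobs is not shown here, and `assign_seeds` is a hypothetical helper.

```python
from itertools import cycle


def assign_seeds(num_seeds, gpus):
    """Pair seeds 0..num_seeds-1 with GPU ids in round-robin order."""
    return list(zip(range(num_seeds), cycle(gpus)))


# Same arguments as the command above: --gpus 7 8 9 --seeds=30
schedule = assign_seeds(30, ["7", "8", "9"])
for seed, gpu in schedule[:4]:
    print(f"RUNNING SEED: {seed} ON GPU:{gpu}")
```

With three GPUs, 30 seeds need ten such rounds, which is why adding GPUs to `--gpus` should shorten the total run time roughly proportionally.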

# Score the 30 seeds

We now need to score each SATURN run. First, we create a CSV file that maps each seed to the path of its output `.h5ad` file.

In [24]:
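from glob import glob

import pandas as pd

# Collect the SATURN outputs of the multi-seed run (paths relative to this vignette directory)
fz_adatas = glob("../multiple_seeds_results/saturn_results/*ESM2*2000*8000*default*.h5ad")
# Drop pretraining outputs, keep the frog/zebrafish runs, and make the paths relative to the repository root
fz_adatas = [path.replace("..", "Vignettes", 1) for path in fz_adatas if "pretrain" not in path and "frog" in path]
# The seed is the last "_"-separated token of the file name
seeds = [path.split("_")[-1].replace(".h5ad", "") for path in fz_adatas]

score_df = pd.DataFrame()
score_df["seed"] = seeds
score_df["path"] = fz_adatas
display(score_df.head())
print(len(score_df))  # expect 30, one row per seed
score_df.to_csv("./data/fz_multi_seeds.csv", index=False)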

| | seed | path |
|---|---|---|
| 0 | 16 | Vignettes/multiple_seeds_results/saturn_result... |
| 1 | 3 | Vignettes/multiple_seeds_results/saturn_result... |
| 2 | 27 | Vignettes/multiple_seeds_results/saturn_result... |
| 3 | 23 | Vignettes/multiple_seeds_results/saturn_result... |
| 4 | 7 | Vignettes/multiple_seeds_results/saturn_result... |


30
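
The cell above takes the seed from the last `_`-separated token of each file name, and `glob` returns files in arbitrary order. A small check like the following can confirm that no seed is missing before scoring. It is a sketch under assumptions: `seed_from_path` and `missing_seeds` are hypothetical helpers, and the demo file names are invented.

```python
from pathlib import Path


def seed_from_path(path: str) -> int:
    """Parse the trailing `_<seed>.h5ad` component of a result file name."""
    return int(Path(path).stem.split("_")[-1])


def missing_seeds(paths, expected=30):
    """Return the seeds in 0..expected-1 that have no result file."""
    found = {seed_from_path(p) for p in paths}
    return sorted(set(range(expected)) - found)


# Hypothetical file names; the real ones come from the glob above.
demo_paths = [f"results/frog_zebrafish_run_seed_{s}.h5ad" for s in (0, 1, 3)]
print(missing_seeds(demo_paths, expected=4))  # [2]
```

In this notebook, `missing_seeds(fz_adatas)` should return an empty list when all 30 runs have finished.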


In [27]:
!cd ../../ ; python3 score_adata.py --adata=Vignettes/frog_zebrafish_embryogenesis/data/fz_multi_seeds.csv --scores=1 \
                                 --multiple_files --species1=zebrafish --species2=frog --label=labels2 \
                                 --ct_map=Vignettes/frog_zebrafish_embryogenesis/data/frog_zebrafish_cell_type_map.csv

  if species_1 or species_2 is "human":
  elif species_1 or species_2 is "zebrafish":
0
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
100%|███████████████████████████████████████████| 30/30 [11:31<00:00, 23.06s/it]
100%|███████████████████████████████████████████| 30/30 [08:26<00:00, 16.88s/it]
Vignettes/frog_zebrafish_embryogenesis/data/fz_multi_seeds_default_scores.csv
    seed  ...              Label
0     16  ...  zebrafish to frog
1      3  ...  zebrafish to frog
2     27  ...  zebrafish to frog
3     23  ...  zebrafish to frog
4      7  ...  zebrafish to frog
5     12  ...  zebrafish to frog
6      2  ...  zebrafish to frog
7     17  ...  zebrafish to frog


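The script saves a copy of the input CSV with the scores added. The cells below read it from `./data/fz_multi_seeds_scores.csv`; note that the log above reports the output file as `fz_multi_seeds_default_scores.csv`, so adjust the filename if your run writes a different one.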

In [40]:
pd.read_csv("./data/fz_multi_seeds_scores.csv")

| | seed | path | Logistic Regression | Balanced Regression | Reannotation | Label |
|---|---|---|---|---|---|---|
| 0 | 16 | Vignettes/multiple_seeds_results/saturn_result... | 0.829958 | 0.460089 | | zebrafish to frog |
| 1 | 3 | Vignettes/multiple_seeds_results/saturn_result... | 0.855068 | 0.524969 | | zebrafish to frog |
| 2 | 27 | Vignettes/multiple_seeds_results/saturn_result... | 0.833744 | 0.484727 | | zebrafish to frog |
| 3 | 23 | Vignettes/multiple_seeds_results/saturn_result... | 0.857358 | 0.535022 | | zebrafish to frog |
| 4 | 7 | Vignettes/multiple_seeds_results/saturn_result... | 0.856708 | 0.511298 | | zebrafish to frog |
| 5 | 12 | Vignettes/multiple_seeds_results/saturn_result... | 0.752092 | 0.507344 | | zebrafish to frog |
| 6 | 2 | Vignettes/multiple_seeds_results/saturn_result... | 0.850343 | 0.518007 | | zebrafish to frog |
| 7 | 17 | Vignettes/multiple_seeds_results/saturn_result... | 0.810347 | 0.479374 | | zebrafish to frog |
| 8 | 26 | Vignettes/multiple_seeds_results/saturn_result... | 0.860752 | 0.532518 | | zebrafish to frog |
| 9 | 22 | Vignettes/multiple_seeds_results/saturn_result... | 0.812214 | 0.502011 | | zebrafish to frog |


In [41]:
pd.read_csv("./data/fz_multi_seeds_scores.csv")["Logistic Regression"].describe()

count    60.000000
mean      0.817800
std       0.037250
min       0.710072
25%       0.795786
50%       0.817852
75%       0.854596
max       0.870831
Name: Logistic Regression, dtype: float64
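
The summary above has a count of 60 rather than 30. This presumably means each seed contributes more than one row (the scoring log shows two 30-iteration progress bars, and the table has a `Label` column), so it may be more informative to summarize each label separately. A minimal sketch, assuming the column names shown above; the demo values and the second label name are invented for illustration.

```python
import pandas as pd


def summarize_by_label(scores: pd.DataFrame, metric: str = "Logistic Regression") -> pd.DataFrame:
    """Count, mean and standard deviation of `metric` for each value of `Label`."""
    return scores.groupby("Label")[metric].agg(["count", "mean", "std"])


# Invented demo data; "frog to zebrafish" is a hypothetical second label.
demo = pd.DataFrame({
    "seed": [0, 1, 0, 1],
    "Logistic Regression": [0.80, 0.90, 0.70, 0.80],
    "Label": ["zebrafish to frog"] * 2 + ["frog to zebrafish"] * 2,
})
summary = summarize_by_label(demo)
print(summary)
```

On the real file, `summarize_by_label(pd.read_csv("./data/fz_multi_seeds_scores.csv"))` would show how many rows each label has and whether the scores differ between them.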