### Simulated data generation

We adapted the data simulation code from https://github.com/AltschulerWu-Lab/MUSE/tree/master/simulation_tool and added a spatial modality by sorting the spots in spatial space. The adapted version is available [here](https://github.com/ratschlab/simulate_spatial_transcriptomics_tool).
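As a rough illustration of what "sorting the spots in spatial space" can mean, the sketch below orders spots by cluster label and then places them on a square grid in row-major order, so that spots from the same cluster end up in contiguous regions. This is only one plausible reading; the grid layout, the sizes, and all variable names are assumptions and are not taken from the adapted tool.

```python
import numpy as np

rng = np.random.default_rng(0)

n_spots, n_clusters = 100, 4      # illustrative sizes (assumed)
side = int(np.sqrt(n_spots))      # spots are laid out on a side x side grid

# Hypothetical ground-truth cluster label for each simulated spot.
labels = rng.integers(0, n_clusters, size=n_spots)

# Sort spots by label so that each cluster occupies a contiguous run.
order = np.argsort(labels, kind="stable")
sorted_labels = labels[order]

# Assign grid coordinates in row-major order to the sorted spots.
rows, cols = np.divmod(np.arange(n_spots), side)
coords = np.column_stack([rows, cols])

print(coords[:3], sorted_labels[:3])
```

With this layout, neighboring grid positions mostly share a label, which gives the simulated data a spatial structure.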

In [None]:
from simulate_spatial_transcriptomics_tool.src.simulate_st_data import generate_samples
import numpy as np
import yaml

In [None]:
import os

def create_yaml_file(main_folder, yaml_text):
    os.makedirs(main_folder, exist_ok=True)  # make sure the output folder exists
    with open(f"{main_folder}/info.yaml", "w") as f:
        f.write(yaml_text)

In [None]:
np.random.seed(2023)

In [None]:
YAML_TEXT = """
SAMPLE:
   - "A"
   - "B" 
   - "C" 
   - "D" 
   - "E"
   
DATASET: "simulated_data"
OUT_FOLDER: "out_ablation"

CROSS_VALIDATION_SPLIT:    
    - [["A"], ["B", "C", "D", "E"]]
    - [["B"], ["A", "C", "D", "E"]]
    - [["C"], ["B", "A", "D", "E"]]
    - [["D"], ["B", "C", "A", "E"]]
    - [["E"], ["B", "C", "D", "A"]]

MODEL_R:  

MODEL_PYTHON:

MODEL_FINE_TUNE:
    - AESTETIK
    - AESTETIK_window_size_1
    - AESTETIK_window_size_3
    - AESTETIK_window_size_5
    - AESTETIK_window_size_7
    - AESTETIK_window_size_9
    - AESTETIK_window_size_11
    - AESTETIK_morphology_weight_0
    - AESTETIK_morphology_weight_1
    - AESTETIK_morphology_weight_1.5
    - AESTETIK_morphology_weight_2
    - AESTETIK_morphology_weight_3
    - AESTETIK_triplet_loss_off
    - AESTETIK_triplet_loss_single
    - AESTETIK_triplet_loss_multi
    - AESTETIK_clustering_method_kmeans
    - AESTETIK_clustering_method_bgm
    - AESTETIK_refine_cluster_off
    - AESTETIK_refine_cluster_on
    - AESTETIK_train_size_0.01
    - AESTETIK_train_size_0.1
    - AESTETIK_train_size_0.25
    - AESTETIK_train_size_0.5
    - AESTETIK_train_size_0.75
    - AESTETIK_train_size_all
    - AESTETIK_rec_loss_off
    - AESTETIK_rec_loss_on


IMAGE_FEATURES:
    - inception

DOWNSAMPLE_FACTOR: 1
IMAGE_FORMAT: "png"
"""


YAML_TEXT_TEST = """
SAMPLE:
   - "A"
   - "B"
   
DATASET: "test_data"
OUT_FOLDER: "out_benchmark"

CROSS_VALIDATION_SPLIT:    
    - [["A"], ["B"]]
    - [["B"], ["A"]]

MODEL_R:

MODEL_PYTHON: 
   

MODEL_FINE_TUNE:
    - AESTETIK
    - GraphST

IMAGE_FEATURES:
    - inception

DOWNSAMPLE_FACTOR: 1
IMAGE_FORMAT: "png"
"""

In [None]:
"""
Parameters:
      n_clusters:       number of ground truth clusters.
      n:                number of cells to simulate.
      d_1:              dimension of features for transcript modality.
      d_2:              dimension of features for morphological modality.
      k:                dimension of the latent code used to generate the simulated data (for both modalities).
      sigma_1:          variance of Gaussian noise for transcript modality.
      sigma_2:          variance of Gaussian noise for morphological modality.
      decay_coef_1:     decay coefficient of dropout rate for transcript modality.
      decay_coef_2:     decay coefficient of dropout rate for morphological modality.
      merge_prob:       probability of merging neighboring clusters when generating modality-specific
                        clusters (same for both modalities).
"""

To ensure that everything is installed correctly, we will start by generating the spatial transcriptomics test data.

In [None]:
main_folder = "test_data/out_benchmark"
sample_names = ["A", "B"]
create_yaml_file(main_folder, YAML_TEXT_TEST)
generate_samples(main_folder, sample_names, n_clusters=5, n_points=2500)
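To make the simulation parameters documented above more concrete, here is a minimal sketch of how a latent code of dimension `k`, Gaussian noise with variance `sigma_1`, and a dropout step controlled by `decay_coef_1` could interact for one modality. The functional forms (a random linear projection and a dropout probability of `exp(-decay_coef_1 * x**2)`) are assumptions made for illustration; the simulation tool itself may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)

n, n_clusters, k, d_1 = 200, 5, 10, 50   # small illustrative sizes (assumed)
sigma_1, decay_coef_1 = 0.5, 0.5         # hypothetical values

# Each cell gets a cluster label and a latent code near its cluster center.
labels = rng.integers(0, n_clusters, size=n)
centers = rng.normal(size=(n_clusters, k))
latent = centers[labels] + 0.1 * rng.normal(size=(n, k))

# Project the latent code into the d_1-dimensional feature space.
projection = rng.normal(size=(k, d_1))
features = latent @ projection

# Add Gaussian noise; sigma_1 is a variance, so scale by its square root.
features = features + np.sqrt(sigma_1) * rng.normal(size=features.shape)

# Dropout: entries with small magnitude are more likely to be set to zero.
drop_prob = np.exp(-decay_coef_1 * features ** 2)
keep = rng.uniform(size=features.shape) > drop_prob
features = features * keep

print(features.shape, float((features == 0).mean()))
```

In this sketch, a larger `decay_coef_1` lowers the dropout probability for a given value, and a larger `sigma_1` blurs the cluster structure.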

Next, we will generate three datasets, each containing five samples with 2500 spots per sample. The datasets will vary by the number of clusters: 5, 10, and 15.

In [None]:
main_folder = "simulated_data_5_clusters/out_ablation"
sample_names = ["A", "B", "C", "D", "E"]
create_yaml_file(main_folder, YAML_TEXT)
generate_samples(main_folder, sample_names, n_clusters=5, n_points=2500)

In [None]:
main_folder = "simulated_data_10_clusters/out_ablation"
sample_names = ["A", "B", "C", "D", "E"]
create_yaml_file(main_folder, YAML_TEXT)
generate_samples(main_folder, sample_names, n_clusters=10, n_points=2500)

In [None]:
main_folder = "simulated_data_15_clusters/out_ablation"
sample_names = ["A", "B", "C", "D", "E"]
create_yaml_file(main_folder, YAML_TEXT)
generate_samples(main_folder, sample_names, n_clusters=15, n_points=2500)
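The `CROSS_VALIDATION_SPLIT` entries in the configuration for these five-sample datasets pair each sample with the remaining four. Rather than typing them by hand, such leave-one-out pairs could be generated as in the sketch below. The helper name `leave_one_out_splits` is hypothetical, and the order of the remaining samples is kept as given, which differs slightly from the hand-written lists.

```python
def leave_one_out_splits(samples):
    """Return one [[single_sample], [remaining_samples]] pair per sample."""
    return [[[s], [t for t in samples if t != s]] for s in samples]


splits = leave_one_out_splits(["A", "B", "C", "D", "E"])
for single, rest in splits:
    print(single, rest)
```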

Lastly, we generate two samples, one with 1 million spots and one with 10 million spots, to test the runtime of our algorithm.

In [None]:
main_folder = "simulated_data_run_time/out_ablation"
sample_names = ["A"]
generate_samples(main_folder, sample_names, n_clusters=5, n_points=1_000_000)

In [None]:
main_folder = "simulated_data_run_time/out_ablation"
sample_names = ["B"]
generate_samples(main_folder, sample_names, n_clusters=5, n_points=10_000_000)
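For the runtime test itself, a small timing wrapper is often sufficient. The sketch below uses `time.perf_counter` from the standard library; `time_call` is a hypothetical helper and is not part of the simulation tool, and the sorted list merely stands in for the actual workload.

```python
import time


def time_call(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload: sort 100,000 integers given in descending order.
result, seconds = time_call(sorted, range(100_000, 0, -1))
print(f"elapsed: {seconds:.4f} s")
```

The same wrapper could be applied to the model's training or clustering call on the large samples generated above.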