# DeepSpot2Cell Data Preprocessing

This notebook shows how to prepare data for DeepSpot2Cell training by:
- Loading HEST datasets
- Extracting morphology features from histological images
- Processing spot-level expression data

In [None]:
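import os
# Work from two directories above this notebook, so that relative paths
# such as data_folder resolve from there.
os.chdir('../../')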

from deepspot.deepspot2cell.dataset_creation import process_sample
from deepspot.utils.utils_image import get_morphology_model_and_preprocess
from hest import iter_hest
from tqdm import tqdm
import torch

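dataset_variant = 22  # tag appended to the output folder names (embeddings22, expressions22)
data_folder = "hest_data"  # root folder of the HEST data, read by iter_hest
model_name = "phikonv2"  # morphology model used for feature extraction
# Local snapshot of the model weights; replace with your own path.
model_path = "/path/to/your/pathology_models/phikon-v2/snapshots/93276a02a78253429f566882de5302867f747394"
samples_to_process = ["NCBI864", "NCBI873", "NCBI856", "NCBI860", "NCBI858"]  # HEST sample IDs
batch_size = 32  # passed to process_sample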

config = {
    'data': {
        'data_folder': data_folder,
        'dataset_variant': dataset_variant,
        'spot_diameter_px': 160,
        'patch_size_px': 224,
        'qv_threshold': 20,
        'min_cells_per_gene': 20
    }
}

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model, preprocess, _ = get_morphology_model_and_preprocess(model_name, device, model_path=model_path)

os.makedirs(f'{data_folder}/embeddings{dataset_variant}/{model_name}', exist_ok=True)
os.makedirs(f'{data_folder}/expressions{dataset_variant}', exist_ok=True)
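
The two `os.makedirs` calls above fix the output layout: embeddings go under a folder named after the dataset variant and the model, expressions under a folder named after the variant alone. A minimal sketch of that path logic follows. `output_dirs` is a hypothetical helper, not part of DeepSpot, and the sketch makes no assumption about the file names `process_sample` writes inside these folders.

```python
from pathlib import Path


def output_dirs(data_folder, dataset_variant, model_name):
    """Return (embeddings_dir, expressions_dir) matching the makedirs calls above.

    Hypothetical helper for illustration; it is not part of DeepSpot.
    """
    root = Path(data_folder)
    embeddings_dir = root / f"embeddings{dataset_variant}" / model_name
    expressions_dir = root / f"expressions{dataset_variant}"
    return embeddings_dir, expressions_dir


emb_dir, expr_dir = output_dirs("hest_data", 22, "phikonv2")
print(emb_dir.as_posix())   # hest_data/embeddings22/phikonv2
print(expr_dir.as_posix())  # hest_data/expressions22
```

Changing `dataset_variant` therefore writes to a fresh pair of folders and leaves earlier outputs untouched.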

In [None]:
# iter_hest is assumed to yield samples in the same order as id_list,
# so the i-th sample is paired with samples_to_process[i].
for i, st in enumerate(tqdm(iter_hest(data_folder, id_list=samples_to_process, load_transcripts=True),
                            desc="Processing samples", total=len(samples_to_process))):
    sample_id = samples_to_process[i]
    process_sample(sample_id, st, model, preprocess, device, batch_size, config)

## Running the same preprocessing with a script and config file

For large datasets, use the preprocessing script:

```bash
cd /path/to/DeepSpot
python -m deepspot.deepspot2cell.dataset_creation --config config.yaml
```

> Note: this script runs only on a GPU.
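
Because the script needs a GPU, a quick pre-flight check can save a failed run. A minimal sketch, assuming PyTorch is the framework in use (as in the notebook code above); `cuda_available` is a hypothetical helper, not part of DeepSpot:

```python
def cuda_available():
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        # Without PyTorch there is nothing to query, so report no GPU.
        return False
    return torch.cuda.is_available()


if cuda_available():
    print("CUDA device found; the preprocessing script can be launched.")
else:
    print("No CUDA device found; the GPU-only script is not expected to run here.")
```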

Example config file:

```yaml
data:
  dataset_variant: 22
  spot_diameter_px: 160
  patch_size_px: 224
  data_folder: hest_data
  model_name: phikonv2
  model_path: /path/to/pathology_models/phikon-v2/snapshots/93276a02a78253429f566882de5302867f747394
  val_fraction: 0.2
  qv_thr: 20
  min_cells: 20

  lung:
    short_name: Lung Cancer
    train:
      dataset_ids:
        - NCBI864
        - NCBI873
        - NCBI856
        - NCBI860
        - NCBI858
        - NCBI881
        - NCBI866
        - NCBI859
        - NCBI884
        - NCBI867
        - NCBI875
        - NCBI879
        - NCBI883
        - NCBI870
        - NCBI861
        - NCBI876
        - NCBI880
        - NCBI882
        - NCBI865
        - NCBI857
    ood:
      dataset_ids:
        - TENX118
        - TENX141
    ordered_genes_file: filtered_ordered_genes_lung.json
```
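
To illustrate how the nesting in this config reads once parsed, here is a minimal sketch that collects the sample IDs per tissue and split. It assumes the YAML has already been loaded into a plain dictionary (for example with PyYAML's `yaml.safe_load`), uses a shortened copy of the example above, and treats every mapping under `data` that has a `train` key as a tissue entry. `collect_dataset_ids` is a hypothetical helper; how the DeepSpot script itself reads the file is not shown here.

```python
# Shortened stand-in for the parsed config above (only a few sample IDs kept).
config = {
    "data": {
        "dataset_variant": 22,
        "data_folder": "hest_data",
        "model_name": "phikonv2",
        "lung": {
            "short_name": "Lung Cancer",
            "train": {"dataset_ids": ["NCBI864", "NCBI873", "NCBI856"]},
            "ood": {"dataset_ids": ["TENX118", "TENX141"]},
            "ordered_genes_file": "filtered_ordered_genes_lung.json",
        },
    }
}


def collect_dataset_ids(config):
    """Map each tissue key to its "train" and "ood" sample IDs."""
    ids = {}
    for key, value in config["data"].items():
        # Scalar settings (paths, thresholds) are skipped; tissue entries are
        # assumed to be mappings that contain a "train" section.
        if isinstance(value, dict) and "train" in value:
            ids[key] = {
                "train": list(value["train"]["dataset_ids"]),
                "ood": list(value.get("ood", {}).get("dataset_ids", [])),
            }
    return ids


dataset_ids = collect_dataset_ids(config)
print(dataset_ids["lung"]["train"])  # ['NCBI864', 'NCBI873', 'NCBI856']
print(dataset_ids["lung"]["ood"])    # ['TENX118', 'TENX141']
```

Under this reading, adding a second tissue means adding another block next to `lung` with its own `train` and `ood` lists.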

## Cell segmentation

The HEST-1k dataset ships with cell segmentations produced by CellViT. If your dataset has no cell segmentations, see the [CellViT documentation](https://tio-ikim.github.io/CellViT-Inference/usage.html) for instructions on generating them.