# DeepSpot2Cell Data Preprocessing

This notebook shows how to prepare data for DeepSpot2Cell training by:
- Loading HEST datasets
- Extracting morphology features from histological images
- Processing spot-level expression data

In [None]:
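import os
# Move up one directory (assumed to be the repository root) so that
# the deepspot2cell package can be imported.
os.chdir('../')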

from deepspot2cell.dataset_creation import process_sample
from deepspot2cell.utils.utils_image import get_morphology_model_and_preprocess
from hest import iter_hest
from tqdm import tqdm
import torch

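dataset_variant = "_v1"  # suffix appended to the output folder names below
data_folder = "hest_data"  # folder holding the HEST data, passed to iter_hest
model_name = "phikonv2"
model_path = "/path/to/phikonv2"  # placeholder: replace with the local model path
samples_to_process = ["NCBI864", "NCBI873", "NCBI856", "NCBI860", "NCBI858"]
batch_size = 32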

config = {
    'data': {
        'data_folder': data_folder,
        'dataset_variant': dataset_variant,
        'spot_diameter_px': 160,
        'patch_size_px': 224,
        'qv_thr': 20
    },
    'dataset': 'lung'
}

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model, preprocess, _ = get_morphology_model_and_preprocess(model_name, device.type, model_path=model_path)

os.makedirs(f'{data_folder}/embeddings{dataset_variant}/{model_name}', exist_ok=True)
os.makedirs(f'{data_folder}/expressions{dataset_variant}', exist_ok=True)

In [None]:
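# Pair each sample ID with its loaded HEST object; this assumes iter_hest
# yields samples in the same order as id_list.
hest_samples = iter_hest(data_folder, id_list=samples_to_process, load_transcripts=True)
for sample_id, st in tqdm(zip(samples_to_process, hest_samples),
                          desc="Processing samples", total=len(samples_to_process)):
    process_sample(sample_id, st, model, preprocess, device, batch_size, config)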
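
Once the loop above has finished, it can be useful to confirm that every sample produced some output. The following is a minimal sketch and not part of DeepSpot2Cell: it rebuilds the two output folders created earlier in this notebook and counts, per sample, the files in each folder whose name contains the sample ID. The helper names `output_dirs` and `count_outputs` are hypothetical, and the assumption that output file names contain the sample ID has not been checked against `process_sample`.

```python
from pathlib import Path


def output_dirs(data_folder, dataset_variant, model_name):
    """Return the embeddings and expressions folders used in this notebook."""
    root = Path(data_folder)
    return {
        "embeddings": root / f"embeddings{dataset_variant}" / model_name,
        "expressions": root / f"expressions{dataset_variant}",
    }


def count_matching(folder, sample_id):
    """Count files directly inside `folder` whose name contains `sample_id`."""
    if not folder.is_dir():
        return 0
    return sum(1 for p in folder.iterdir() if p.is_file() and sample_id in p.name)


def count_outputs(sample_ids, dirs):
    """Map each sample ID to its file count per output folder.

    Assumption: output file names include the sample ID.
    """
    return {
        sample_id: {name: count_matching(folder, sample_id)
                    for name, folder in dirs.items()}
        for sample_id in sample_ids
    }


# Usage with the values defined at the top of this notebook
dirs = output_dirs("hest_data", "_v1", "phikonv2")
for sample_id, found in count_outputs(["NCBI864", "NCBI873"], dirs).items():
    print(sample_id, found)
```

A sample that reports zero files in either folder is worth re-running or inspecting before training.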

## Using the Preprocessing Script

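For larger datasets, run the preprocessing module from the command line with a YAML config file. The `--dataset` argument names the dataset block to process (here `lung`):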

```bash
python -m deepspot2cell.dataset_creation --config configs/dataset.example.yaml --dataset lung
```

Example config:

```yaml
data:
  data_folder: /path/to/hest_data
  dataset_variant: _v1
  model_name: phikonv2
  model_path: /path/to/phikonv2
  spot_diameter_px: 160
  patch_size_px: 224
  qv_thr: 20

  lung:
    ordered_genes_file: lung_ordered_genes.json
    train:
      dataset_ids:
        - NCBI864
        - NCBI873
        - NCBI856
        - NCBI860
        - NCBI858
    ood:
      dataset_ids: []

processing:
  gpu_batch_size: 32
  cpu_batch_size: 8
```
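
To reuse the sample list from such a config inside the notebook, the IDs can be read back from the parsed file. The sketch below assumes the layout of the example config, with the dataset block (`lung`) nested under `data`, and relies on PyYAML's `yaml.safe_load` for parsing. The helper names `load_config` and `collect_sample_ids` are hypothetical and not part of DeepSpot2Cell.

```python
def load_config(path):
    """Parse a YAML config file into a dict (requires PyYAML)."""
    import yaml  # imported lazily so collect_sample_ids works without PyYAML

    with open(path, encoding="utf-8") as fh:
        return yaml.safe_load(fh)


def collect_sample_ids(config, dataset, splits=("train", "ood")):
    """Return the dataset_ids of the given splits, in config order.

    Assumption: the dataset block sits under config["data"][dataset],
    as in the example config.
    """
    block = config["data"][dataset]
    sample_ids = []
    for split in splits:
        # A split may be missing, empty, or hold an empty list;
        # all three cases contribute no samples.
        split_block = block.get(split) or {}
        sample_ids.extend(split_block.get("dataset_ids") or [])
    return sample_ids


# Usage with an in-memory config that mirrors part of the example
example = {
    "data": {
        "lung": {
            "train": {"dataset_ids": ["NCBI864", "NCBI873"]},
            "ood": {"dataset_ids": []},
        }
    }
}
print(collect_sample_ids(example, "lung"))  # ['NCBI864', 'NCBI873']
```

With a config file on disk, `collect_sample_ids(load_config("configs/dataset.example.yaml"), "lung")` would yield a list suitable for `samples_to_process`, provided the file follows the layout above.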

## Cell Segmentation

HEST-1k includes cell segmentations produced by CellViT. For other datasets, segmentations have to be generated separately; see the [CellViT documentation](https://tio-ikim.github.io/CellViT-Inference/usage.html).