# Data Preprocessing for Fusion Data

Wenrui Wu, 2024-12-28

In [1]:
import os
from pathlib import Path

import numpy as np
from pyqupath.geojson import crop_dict_by_geojson
from pyqupath.ometiff import export_ometiff_pyramid_from_dict, load_tiff_to_dict



## 01. Data Structure

The Fusion platform produces the following output structure:

```
Scan1
├── [name].qptiff
└── .temp
    └── MarkerList.txt
```

## 02. Organize Data

CODEX downstream analysis is usually performed at the core/region level, so the whole-slide image must first be cropped into multiple regions.

- Annotate the regions in QuPath with its polygon tool, then export the annotations as a GeoJSON file.

- Put the following files into one folder (`dir_root`):
    - `.qptiff`
    - `MarkerList.txt`
    - `cropping_regions.geojson`

```
/path/dir_root
├── [name].qptiff
├── cropping_regions.geojson
└── MarkerList.txt
```
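Before cropping, it can help to check which regions the GeoJSON file actually contains. The following is a minimal sketch using only the standard library. It assumes that the export is either a `FeatureCollection` or a bare list of features, and that each annotation's name is stored under `properties.name`; both are assumptions about the export rather than something this notebook verifies. `list_region_names` is a hypothetical helper, not part of `pyqupath`.

```python
import json
from pathlib import Path


def list_region_names(path_geojson):
    """Return the annotation names found in a GeoJSON export (None if unnamed)."""
    with open(Path(path_geojson)) as f:
        data = json.load(f)

    # Assumption: the file holds a FeatureCollection, a single Feature,
    # or a bare list of Feature objects.
    if isinstance(data, dict) and data.get("type") == "FeatureCollection":
        features = data.get("features", [])
    elif isinstance(data, dict):
        features = [data]
    else:
        features = data

    names = []
    for feature in features:
        # Assumption: the annotation name, if any, lives in properties["name"].
        properties = feature.get("properties") or {}
        names.append(properties.get("name"))
    return names
```

Calling `list_region_names(dir_root / "cropping_regions.geojson")` would then list one entry per annotation, with `None` marking annotations that were left unnamed in QuPath.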

In [2]:
################################################################################
dir_root = "/mnt/nfs/storage/wenruiwu_temp/pipeline/fusion/00_raw_data/"
################################################################################

dir_root = Path(dir_root)

# review all the files in the root directory
!tree $dir_root

/mnt/nfs/storage/wenruiwu_temp/pipeline/fusion/00_raw_data
├── cropping_regions.geojson
├── MarkerList.txt
└── Periodontal_CODEX-S8_Scan1.er.qptiff

0 directories, 3 files


In [3]:
# parse the dir_root
path_markerlist = dir_root / "MarkerList.txt"
path_geojson = dir_root / "cropping_regions.geojson"
paths_qptiff = list(dir_root.glob("*.qptiff"))
if len(paths_qptiff) == 1:
    path_qptiff = paths_qptiff[0]
else:
    raise ValueError(
        f"Expected exactly one .qptiff file in {dir_root}, found {len(paths_qptiff)}"
    )

In [4]:
# review the channels in the qptiff file
channels_name = np.loadtxt(path_markerlist, dtype=str).tolist()
channels_name

['DAPI',
 'CD56',
 'CD3e',
 'CD8',
 'CD15',
 'CD138',
 'HLA-E',
 'CD45',
 'CD31',
 'CD68',
 'Pax5',
 'CD11b',
 'CD11c',
 'CD4',
 'MUC5AC',
 'MUC5B',
 'HLA-DR',
 'CD44',
 'ICOS',
 'E-cadherin',
 'COLA1',
 'KRT14',
 'a-SMA',
 'HLA-1',
 'Ki67',
 'Vimentin',
 'Blank-75',
 'Blank-75']

## 03. Order and Rename Markers

`channels_order`: the markers to keep, selected from `MarkerList.txt` and listed in the desired output order.

`channels_rename`: the new names for the markers in `channels_order`, given as a list of the same length and in the same order. Set it to `None` to keep the original names.
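The two lists are easy to get out of sync, so a quick consistency check before loading the image may save a failed run. The sketch below is a hypothetical helper (`check_channel_selection` is not part of `pyqupath`). It reports markers that are missing from the marker list, markers that occur more than once there (as `Blank-75` does in the list above, which would make a name-based selection ambiguous), duplicates in `channels_order`, and a length mismatch with `channels_rename`.

```python
from collections import Counter


def check_channel_selection(channels_name, channels_order, channels_rename=None):
    """Return a list of problems found; an empty list means none were detected."""
    problems = []

    # Every selected marker should occur exactly once in the marker list.
    counts = Counter(channels_name)
    for name in channels_order:
        if counts[name] == 0:
            problems.append(f"'{name}' is not in the marker list")
        elif counts[name] > 1:
            problems.append(
                f"'{name}' appears {counts[name]} times in the marker list"
            )

    # A marker should not be selected twice.
    for name, count in Counter(channels_order).items():
        if count > 1:
            problems.append(f"'{name}' is selected {count} times")

    # The rename list, if given, must align one-to-one with the selection.
    if channels_rename is not None and len(channels_rename) != len(channels_order):
        problems.append(
            f"channels_rename has {len(channels_rename)} entries, "
            f"but channels_order has {len(channels_order)}"
        )
    return problems
```

In this notebook it would be called as `check_channel_selection(channels_name, channels_order, channels_rename)` once both lists are defined.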

In [5]:
################################################################################
# select the channels that are needed (e.g., exclude the Blank channels)
channels_order = [
    "DAPI",
    "CD45",
    "CD3e",
    "CD4",
    "CD8",
    "CD56",
    "CD11b",
    "CD11c",
    "CD138",
    "Pax5",
    "CD68",
    "CD15",
    "CD31",
    "HLA-E",
    "HLA-DR",
    "E-cadherin",
    "MUC5AC",
    "MUC5B",
    "COLA1",
    "KRT14",
    "a-SMA",
    "Vimentin",
    "ICOS",
    "CD44",
    "Ki67",
    "HLA-1",
]
channels_rename = None  # If None, the channels will not be renamed
################################################################################

## 04. Crop QPTIFF into Multiple OME-TIFF

In [6]:
################################################################################  
dir_output = "/mnt/nfs/storage/wenruiwu_temp/pipeline/fusion/01_preprocess/"
################################################################################

dir_output = Path(dir_output)

In [7]:
# Load QPTIFF file
im_dict = load_tiff_to_dict(
    path_qptiff,
    filetype="qptiff",
    channels_order=channels_order,
    channels_rename=channels_rename,
    path_markerlist=path_markerlist,
)

Loading images: 100%|██████████| 26/26 [00:31<00:00,  1.23s/it]


In [8]:
# Crop QPTIFF file into multiple OME-TIFF files
for name, crop_im_dict in crop_dict_by_geojson(im_dict, path_geojson):
    print(f"Cropping OME-TIFF for: {name}")
    path_ometiff = dir_output / name / f"{name}.ome.tiff"
    path_ometiff.parent.mkdir(parents=True, exist_ok=True)
    if path_ometiff.exists():
        os.remove(path_ometiff)
    export_ometiff_pyramid_from_dict(crop_im_dict, str(path_ometiff))

Cropping OME-TIFF for: reg001


Writing images: 100%|██████████| 7/7 [00:25<00:00,  3.62s/it]
Cropping regions:  17%|█▋        | 1/6 [00:28<02:24, 28.89s/it]


Cropping OME-TIFF for: reg002


Writing images: 100%|██████████| 7/7 [00:24<00:00,  3.45s/it]
Cropping regions:  33%|███▎      | 2/6 [00:56<01:52, 28.18s/it]


Cropping OME-TIFF for: reg003


Writing images: 100%|██████████| 8/8 [01:14<00:00,  9.35s/it]
Cropping regions:  50%|█████     | 3/6 [02:19<02:39, 53.16s/it]


Cropping OME-TIFF for: reg004


Writing images: 100%|██████████| 7/7 [01:11<00:00, 10.25s/it]
Cropping regions:  67%|██████▋   | 4/6 [03:40<02:08, 64.27s/it]


Cropping OME-TIFF for: reg005


Writing images: 100%|██████████| 7/7 [00:54<00:00,  7.77s/it]
Cropping regions:  83%|████████▎ | 5/6 [04:42<01:03, 63.30s/it]


Cropping OME-TIFF for: reg006


Writing images: 100%|██████████| 7/7 [00:45<00:00,  6.47s/it]
Cropping regions: 100%|██████████| 6/6 [05:34<00:00, 55.67s/it]

## 05. Review Output

One OME-TIFF file is exported per region, each in a subdirectory named after that region.

In [9]:
!tree $dir_output

/mnt/nfs/storage/wenruiwu_temp/pipeline/fusion/01_preprocess
├── reg001
│   └── reg001.ome.tiff
├── reg002
│   └── reg002.ome.tiff
├── reg003
│   └── reg003.ome.tiff
├── reg004
│   └── reg004.ome.tiff
├── reg005
│   └── reg005.ome.tiff
└── reg006
    └── reg006.ome.tiff

6 directories, 6 files
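The same review can be done programmatically, which is convenient when there are many regions. The sketch below assumes the `dir_output / name / f"{name}.ome.tiff"` layout used in the cropping cell above. `find_missing_outputs` is a hypothetical helper, and it only checks that each file exists and is non-empty, not that it is a valid OME-TIFF.

```python
from pathlib import Path


def find_missing_outputs(dir_output, region_names):
    """Return the expected OME-TIFF paths that are absent or empty."""
    dir_output = Path(dir_output)
    missing = []
    for name in region_names:
        # Same layout as the cropping loop: <dir_output>/<name>/<name>.ome.tiff
        path = dir_output / name / f"{name}.ome.tiff"
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(path)
    return missing
```

For the run shown here, `find_missing_outputs(dir_output, [f"reg{i:03d}" for i in range(1, 7)])` would be expected to return an empty list if all six regions were written.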
