# Integration with slide-level labels

In this tutorial we will demonstrate how to integrate whole slide images (WSIs) with slide-level labels and derive quantitative scores for each slide via top-K scoring.

We will also demonstrate how to run tasks in a distributed fashion using Dask.

For this, we will use a pre-processed dataset of artery tissue from GTEx, which contains healthy and calcified samples.

In [1]:
from huggingface_hub import hf_hub_download
import pandas as pd

table = hf_hub_download(
    "rendeirolab/lazyslide-data", 
    "GTEx_artery_dataset.csv.gz",
    repo_type="dataset"
)

dataset = pd.read_csv(table)
dataset.head()

| Unnamed: 0 | Tissue Sample Id | Sex | Age Bracket | Pathology Categories |
|---|---|---|---|---|
| 0 | GTEX-111YS-2226 | male | 60-69 | calcification |
| 1 | GTEX-11GSP-2926 | female | 60-69 | calcification |
| 2 | GTEX-11LCK-1426 | male | 30-39 | clean_specimens |
| 3 | GTEX-11ONC-2726 | male | 60-69 | calcification |
| 4 | GTEX-12126-0726 | male | 20-29 | clean_specimens |


Since we need to process many slides, let's first define a reusable function that processes a single slide.

In [2]:
terms = [
    'BMP-2', 'Monckeberg sclerosis', 'Runx2', 'adventitia', 'apoptosis',
    'arterial hardening', 'arterial narrowing', 'arterial remodeling', 'arterial stiffness', 'arteriole',
    'artery', 'atherosclerosis', 'basement membrane', 'blood flow', 'bone morphogenetic protein',
    'calcification', 'calcified nodule', 'calcium deposition', 'calcium phosphate', 'chronic kidney disease',
    'collagen', 'compliance', 'connective tissue', 'elastic fibers', 'elasticity',
    'endothelial dysfunction', 'endothelium', 'epithelium', 'external elastic lamina', 'extracellular matrix',
    'fibroblast', 'fibrosis', 'fibrous cap', 'gap junction', 'hemodynamics',
    'hydroxyapatite', 'hyperphosphatemia', 'inflammation', 'internal elastic lamina', 'interstitial space',
    'intima', 'intimal calcification', 'intimal thickening', 'ischemia', 'lamina propria',
    'lumen', 'macrocalcification', 'macrophage', 'matrix vesicle', 'mechanotransduction',
    'media', 'medial calcification', 'microcalcification', 'mineralization', 'myofibroblast',
    'necrotic core', 'osteoblast-like cell', 'osteocalcin', 'osteogenic', 'osteopontin',
    'oxidative stress', 'pericyte', 'phosphate transporter', 'plaque', 'shear stress',
    'smooth muscle', 'tight junction', 'tunica', 'vasa vasorum', 'vascular basement membrane',
    'vascular compliance', 'vascular integrity', 'vascular niche', 'vascular ossification', 'vascular remodeling',
    'vascular smooth muscle cell', 'vascular stiffness', 'vascular tone', 'vascular wall'
]

In [9]:
def wsi_feature_extraction(slide):

    from wsidata import open_wsi
    import lazyslide as zs

    s = hf_hub_download(
        "rendeirolab/lazyslide-data", 
        f"gtex_artery_data/{slide}.svs",
        repo_type="dataset"
    )
    wsi = open_wsi(s)
    zs.pp.find_tissues(wsi)
    zs.pp.tile_tissues(wsi, 256, mpp=0.5, background_fraction=0.5)

    # conch feature
    zs.tl.feature_extraction(wsi, "conch", pbar=False)
    embed = zs.tl.text_embedding(terms, "conch")
    zs.tl.text_image_similarity(wsi, embed, "conch")
    wsi.write()

    # score the slide with Top-K max pooling
    scores = zs.metrics.topk_scores(wsi["conch_tiles_text_similarity"], k=100)

    return scores
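To make the final scoring step concrete, top-K pooling can be sketched in NumPy. This is a minimal illustration, not LazySlide's implementation: it assumes the similarity matrix has one row per tile and one column per term, and that the slide score for a term is the mean of its K highest tile similarities. `topk_mean_scores` is a hypothetical helper name, and the library may aggregate differently.

```python
import numpy as np


def topk_mean_scores(similarity, k=100):
    """Pool a (n_tiles, n_terms) similarity matrix into one score per term.

    Assumption: the slide score for a term is the mean of its k highest
    tile similarities. If the slide has fewer than k tiles, all are used.
    """
    similarity = np.asarray(similarity, dtype=float)
    n_tiles = similarity.shape[0]
    k = min(k, n_tiles)
    # np.partition moves the k largest values of each column into the
    # last k rows (in arbitrary order), which is cheaper than a full sort.
    top_k = np.partition(similarity, n_tiles - k, axis=0)[-k:]
    return top_k.mean(axis=0)


# Toy example: 4 tiles scored against 2 terms.
toy_similarity = np.array(
    [
        [0.1, 0.9],
        [0.5, 0.2],
        [0.3, 0.4],
        [0.7, 0.6],
    ]
)
toy_scores = topk_mean_scores(toy_similarity, k=2)
```

Pooling only the highest-scoring tiles lets a small region that matches a term strongly drive the slide score, instead of being averaged away by the rest of the tissue.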

## Run for every slide

The easiest way is to run a for-loop:

```python
for slide in dataset["Tissue Sample Id"]:
    wsi_feature_extraction(slide)
```

However, this takes a long time because the slides are processed one after another, without any parallelism.

## Distributed processing with dask

Dask is a good option for parallelism on a local machine or across multiple machines.

Depending on the available hardware, the alternatives are:
1. [dask-jobqueue](https://jobqueue.dask.org/en/latest/): For PBS, Slurm, MOAB, SGE, LSF, and HTCondor.
2. [coiled](https://docs.coiled.io/user_guide/index.html): AWS, GCP, Azure etc.
3. [dask-cuda](https://docs.rapids.ai/api/dask-cuda/nightly/quickstart/): If you have multiple GPU cards locally.

Here, we showcase how to parallelize the jobs with Dask on a SLURM cluster.
The configuration may not work on your SLURM system; please adjust it accordingly.

When running GPU-intensive work such as feature extraction on multiple WSIs,
we recommend running one task per GPU at a time.
To speed up processing, distribute the tasks across multiple GPU cards or multiple machines.

Here are code snippets for running on different architectures.

Run locally with CPUs:

```python
from dask.distributed import LocalCluster
cluster = LocalCluster()
```

Run locally with multiple GPUs:

```python
from dask_cuda import LocalCUDACluster
cluster = LocalCUDACluster()
```

Run on a SLURM cluster with GPUs (example script; it may not work on your cluster):

```python
from dask_jobqueue import SLURMCluster

cluster = SLURMCluster(
    queue="gpu",
    cores=8,
    processes=1,
    memory="20 GB",
    # For SLURM, use --gres flag to get GPU
    job_extra_directives=["--gres=gpu:h100pcie:1"],
    # Each worker advertises exactly one GPU
    worker_extra_args=["--resources GPU=1"],
)
```
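Advertising a `GPU` resource on the workers only limits concurrency if the tasks also request that resource when they are submitted. The sketch below shows the idea with an in-process `LocalCluster`, so it runs without a GPU. `fake_gpu_task` and `run_with_gpu_limit` are hypothetical names, with `fake_gpu_task` standing in for `wsi_feature_extraction`. Dask resources are abstract labels counted by the scheduler; they do not bind a process to a physical device.

```python
from dask.distributed import Client, LocalCluster


def fake_gpu_task(slide_id):
    # Hypothetical stand-in for wsi_feature_extraction: returns its input.
    return slide_id


def run_with_gpu_limit(slide_ids, n_workers=2):
    # Every worker advertises one abstract "GPU" resource. A task that
    # requests {"GPU": 1} holds it until it finishes, so each worker runs
    # at most one such task at a time even though it has several threads.
    with LocalCluster(
        n_workers=n_workers,
        threads_per_worker=4,
        processes=False,         # in-process workers keep the sketch light
        resources={"GPU": 1},
        dashboard_address=None,  # no dashboard needed for this sketch
    ) as cluster, Client(cluster) as client:
        futures = [
            client.submit(fake_gpu_task, slide_id, resources={"GPU": 1})
            for slide_id in slide_ids
        ]
        return client.gather(futures)


gathered = run_with_gpu_limit(["slide-a", "slide-b", "slide-c"])
```

With the SLURM configuration above, the same `resources={"GPU": 1}` argument to `client.submit` should have the equivalent effect, since each worker is started with `--resources GPU=1`.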

In [4]:
from dask.distributed import LocalCluster
cluster = LocalCluster(n_workers=10)

In [5]:
from dask.distributed import Client

client = Client(cluster)

In [6]:
client

Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: http://127.0.0.1:8787/status
Workers: 10
Total threads: 70
Total memory: 251.50 GiB
Status: running
Using processes: True
Scheduler comm: tcp://127.0.0.1:38339
Started: Just now
Per worker: 7 threads, 25.15 GiB memory


Let's run the jobs in parallel:

In [None]:
futures = [
    client.submit(wsi_feature_extraction, slide)
    for slide in dataset["Tissue Sample Id"]
]

2025-07-30 13:28:27,048 - distributed.worker - ERROR - Compute Failed
Key:       wsi_feature_extraction-4c4ea34f0360e6582af9d84ae8ffa42a
State:     executing
Task:  <Task 'wsi_feature_extraction-4c4ea34f0360e6582af9d84ae8ffa42a' wsi_feature_extraction(...)>
Exception: "OutOfMemoryError('CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 47.41 GiB of which 2.62 MiB is free. Process 3856894 has 3.77 GiB memory in use. Process 3859438 has 6.11 GiB memory in use. Process 3859450 has 6.67 GiB memory in use. Including non-PyTorch memory, this process has 6.15 GiB memory in use. Process 3859443 has 6.21 GiB memory in use. Process 3859447 has 5.69 GiB memory in use. Process 3859463 has 3.05 GiB memory in use. Process 3859459 has 4.35 GiB memory in use. Process 3859473 has 1.02 GiB memory in use. Process 3859468 has 4.33 GiB memory in use. Of the allocated memory 5.37 GiB is allocated by PyTorch, and 546.09 MiB is reserved by PyTorch but unallocated. If reserved but

INFO     The Zarr backing store has been changed from None the new file path:
         /home/yzheng/.cache/huggingface/hub/datasets--rendeirolab--lazyslide-data/snapshots/e02b4fb1d09edde7263479990690f224b761f54c/gtex_artery_data/GTEX-PW2O-1926.zarr

2025-07-30 13:28:34,926 - distributed.worker - ERROR - Compute Failed
Key:       wsi_feature_extraction-407b767e14d725ae1fe739e990d0e647
State:     long-running
Task:  <Task 'wsi_feature_extraction-407b767e14d725ae1fe739e990d0e647' wsi_feature_extraction(...)>
Exception: 'AttributeError("module \'lazyslide.metrics\' has no attribute \'topk_scores\'")'
Traceback: '  File "/tmp/ipykernel_3859322/1619889921.py", line 22, in wsi_feature_extraction\n'

INFO     The Zarr backing store has been changed from None the new file path:
         /home/yzheng/.cache/huggingface/hub/datasets--rendeirolab--lazyslide-data/snapshots/e02b4fb1d09edde7263479990690f224b761f54c/gtex_artery_data/GTEX-17MF6-0526.zarr

2025-07-30 13:28:38,679 - distributed.worker - ERROR - Compute Failed
Key:       wsi_feature_extraction-9041847908cb69025a4dd5e18e13ef5f
State:     long-running
Task:  <Task 'wsi_feature_extraction-9041847908cb69025a4dd5e18e13ef5f' wsi_feature_extraction(...)>
Exception: 'AttributeError("module \'lazyslide.metrics\' has no attribute \'topk_scores\'")'
Traceback: '  File "/tmp/ipykernel_3859322/1619889921.py", line 22, in wsi_feature_extraction\n'

INFO     The Zarr backing store has been changed from None the new file path:
         /home/yzheng/.cache/huggingface/hub/datasets--rendeirolab--lazyslide-data/snapshots/e02b4fb1d09edde7263479990690f224b761f54c/gtex_artery_data/GTEX-XPT6-2226.zarr

2025-07-30 13:28:39,562 - distributed.worker - ERROR - Compute Failed
Key:       wsi_feature_extraction-d39e0da7a2e5f140405b3f5ee82a94b3
State:     long-running
Task:  <Task 'wsi_feature_extraction-d39e0da7a2e5f140405b3f5ee82a94b3' wsi_feature_extraction(...)>
Exception: 'AttributeError("module \'lazyslide.metrics\' has no attribute \'topk_scores\'")'
Traceback: '  File "/tmp/ipykernel_3859322/1619889921.py", line 22, in wsi_feature_extraction\n'

INFO     The Zarr backing store has been changed from None the new file path:
         /home/yzheng/.cache/huggingface/hub/datasets--rendeirolab--lazyslide-data/snapshots/e02b4fb1d09edde7263479990690f224b761f54c/gtex_artery_data/GTEX-11ONC-2726.zarr

2025-07-30 13:29:13,079 - distributed.worker - ERROR - Compute Failed
Key:       wsi_feature_extraction-7e1506345184dd06cedbe4a7b0ea456a
State:     long-running
Task:  <Task 'wsi_feature_extraction-7e1506345184dd06cedbe4a7b0ea456a' wsi_feature_extraction(...)>
Exception: 'AttributeError("module \'lazyslide.metrics\' has no attribute \'topk_scores\'")'
Traceback: '  File "/tmp/ipykernel_3859322/1619889921.py", line 22, in wsi_feature_extraction\n'

To monitor progress, you can either open the Dask dashboard or use a simple progress bar:

In [12]:
from dask.distributed import as_completed
from tqdm.auto import tqdm

for _ in tqdm(as_completed(futures), total=len(futures)):
    pass

  0%|          | 0/45 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [16]:
client

Connection method: Cluster object
Cluster type: dask_cuda.LocalCUDACluster
Dashboard: http://127.0.0.1:8787/status
Workers: 0
Total threads: 0
Total memory: 0 B
Status: running
Using processes: True
Scheduler comm: tcp://127.0.0.1:35175
Started: 2 minutes ago


Our function returns one score for each term that we defined.
We can collect the scores from all slides and save them for further analysis.

In [41]:
import numpy as np

slide_scores = pd.DataFrame(
    np.vstack([f.result() for f in futures]),
    columns=terms,
    index=dataset["Tissue Sample Id"],
)
slide_scores.to_csv("slide_scores.csv")
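Note that `f.result()` re-raises the exception of a failed task, so a single failure such as the ones in the worker log above aborts the whole collection. One way to handle this is to split the futures into results and errors first. The sketch below is a hedged illustration that uses the standard library's `ThreadPoolExecutor` as a stand-in for the Dask client so it runs anywhere; it relies only on the `exception()` and `result()` methods, which Dask futures also provide. `split_results` and `demo_task` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def split_results(futures_by_id):
    """Separate futures into successful results and raised exceptions."""
    results, errors = {}, {}
    for slide_id, future in futures_by_id.items():
        # exception() waits for the task and returns None on success.
        error = future.exception()
        if error is None:
            results[slide_id] = future.result()
        else:
            errors[slide_id] = error
    return results, errors


def demo_task(slide_id):
    # Hypothetical stand-in for wsi_feature_extraction.
    if slide_id == "bad-slide":
        raise ValueError("simulated failure")
    return [0.1, 0.2]


with ThreadPoolExecutor(max_workers=2) as pool:
    demo_futures = {
        slide_id: pool.submit(demo_task, slide_id)
        for slide_id in ["slide-a", "bad-slide", "slide-b"]
    }
    demo_results, demo_errors = split_results(demo_futures)
```

Keeping the futures keyed by slide ID means the score table can be built from `demo_results` alone, and the IDs in `demo_errors` can be resubmitted once the cause is fixed.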

In [87]:
client.shutdown()

## Slide aggregation

After the slides are processed to have slide-level features and scores, we will aggregate them into an AnnData object.

In [43]:
from wsidata import agg_wsi

dataset["slide"] = [f"gtex_artery_slides/{s}.svs" for s in dataset["Tissue Sample Id"]]
agg_data = agg_wsi(dataset, "uni2", wsi_col="slide", agg_key="agg_slide")
agg_data.obs = agg_data.obs.merge(slide_scores, on="Tissue Sample Id").set_index(
    "Tissue Sample Id"
)
agg_data

AnnData object with n_obs × n_vars = 45 × 1536
    obs: 'Sex', 'Age Bracket', 'Pathology Categories', 'slide', 'calcification', 'atherosclerosis'

In [44]:
agg_data.write_h5ad("agg_uni2_features.h5ad")