# 🔬 End-to-end tutorial for WSI processing and benchmarking

Follow along with this tutorial to download WSIs, process them with Trident, and run benchmarking studies using Patho-Bench.

## What will this tutorial cover?
1. Downloading WSIs for CPTAC Clear Cell Renal Cell Carcinoma (CCRCC)
2. Processing the WSIs with [Trident](https://github.com/mahmoodlab/trident), our one-stop package for WSI preprocessing
3. Running benchmarking studies. The following examples are included:  
    a. Linear probe for BAP1 mutation prediction  
    b. Cox Proportional Hazards (CoxPH) model for survival prediction    
    c. Attention-based multiple instance learning model for BAP1 mutation prediction  

# ⬇️ Download CPTAC CCRCC WSIs 

You can download the CPTAC CCRCC WSIs from [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/collection/cptac-ccrcc/). If you have access to an SSD, download your WSIs there for faster I/O.

**Tip**: Keep an eye on this webpage for updates to the dataset, as newer versions are often released.
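
After the download finishes, a quick sanity check is to count the slide files on disk. The sketch below assumes the slides are `.svs` files under `./CPTAC-CCRCC_v1/CCRCC` (the directory passed to Trident later in this tutorial). Both the file extension and the helper name `list_wsis` are assumptions for illustration, so adjust them to match your download.

```python
from pathlib import Path


def list_wsis(wsi_dir, extensions=(".svs",)):
    """Return sorted paths of slide files found anywhere under wsi_dir."""
    root = Path(wsi_dir)
    if not root.is_dir():
        return []  # nothing downloaded yet, or wrong path
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


if __name__ == "__main__":
    # Assumed download location; change to wherever you saved the slides.
    slides = list_wsis("./CPTAC-CCRCC_v1/CCRCC")
    print(f"Found {len(slides)} slide files")
```

If the count is zero or lower than expected, check the download path before launching preprocessing.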

# 🧑‍🔬 🧬 Installing Patho-Bench

From the root of the cloned Patho-Bench repository, run:

1. `conda create -n pathobench python=3.10`
2. `conda activate pathobench`
3. `pip install -e .`
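
To confirm that the environment is set up, you can check that the packages are importable. This is a minimal sketch: `patho_bench` is the import name used later in this tutorial, while `trident` as the import name of Trident and the helper `is_installed` are assumptions for illustration.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the top-level package can be found by the import system."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # "trident" is an assumed import name; adjust if your install differs.
    for name in ("patho_bench", "trident"):
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```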

# 🤖 Preprocess WSIs: segmentation, patching and patch feature extraction

We will use [Trident](https://github.com/mahmoodlab/trident), our package for WSI processing and feature extraction. Trident is already installed as part of the Patho-Bench installation.

Run the following cell:

In [None]:
import sys 
from run_batch_of_slides import main # import Trident launch script. 

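# Emulate the command-line arguments that the launch script expects.
sys.argv = [
    "run_batch_of_slides",
    '--task', 'all',
    '--job_dir', './cptac_ccrcc',            # where Trident writes its outputs
    '--wsi_dir', './CPTAC-CCRCC_v1/CCRCC',   # where the downloaded WSIs live
    '--patch_encoder', 'conch_v15',
    '--mag', '20',                           # magnification (20x)
    '--patch_size', '512',                   # patch size in pixels
    '--skip_errors',
]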

main()


You will see the following directory structure as the output. In this example tree the `wsis` folder sits inside the job directory, but the slides can live anywhere (the cell above reads them from `./CPTAC-CCRCC_v1/CCRCC`).
```bash
|-->/path/to/job_dir
    |-->20x_512px_0px_overlap
        |-->features_conch_v15 --> these are the patch features
        |-->patches
        |-->visualizations
    |-->contours
    |-->contours_geojson
    |-->thumbnails
    |-->wsis
```
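
Before moving on to benchmarking, it can help to verify that preprocessing produced the folders shown in the tree above. The following is a minimal sketch using only the standard library; the helper name `missing_outputs` is hypothetical, and the folder names are copied from the tree, so they change if you use a different magnification, patch size, or patch encoder.

```python
from pathlib import Path

# Sub-directories taken from the tree above. "wsis" is omitted because the
# slides may live outside the job directory.
EXPECTED_SUBDIRS = (
    "20x_512px_0px_overlap/features_conch_v15",
    "20x_512px_0px_overlap/patches",
    "20x_512px_0px_overlap/visualizations",
    "contours",
    "contours_geojson",
    "thumbnails",
)


def missing_outputs(job_dir, expected=EXPECTED_SUBDIRS):
    """Return the expected sub-directories that do not exist under job_dir."""
    root = Path(job_dir)
    return [rel for rel in expected if not (root / rel).is_dir()]


if __name__ == "__main__":
    missing = missing_outputs("./cptac_ccrcc")
    print("All outputs present" if not missing else f"Missing: {missing}")
```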

# Running a linear probe experiment for BAP1 mutation prediction  

To predict BAP1 mutation in CCRCC using Titan slide embeddings, we train a logistic regression model (linear probe). Below is a modular function to run a linear probe experiment.  

### Argument Descriptions  

| Argument | Description | Possible Values | Additional Notes |
|----------|-------------|----------------|------------------|
| `model_name` | Slide encoder model to use | `titan`, `prism`, `chief`, etc. | - |
| `train_source` | Train dataset | Unless using custom splits, must match datasource name from [Hugging Face](https://huggingface.co/datasets/MahmoodLab/Patho-Bench) | - |
| `test_source` | Test dataset | Unless using custom splits, must match datasource name from [Hugging Face](https://huggingface.co/datasets/MahmoodLab/Patho-Bench) | Defaults to None. If not None, will test generalizability by training on all of train_source and testing on all of test_source. Note that if test_source is used the task must exist in both datasets. |
| `task_name` | Task to be performed | Example: `BAP1_mutation` | See [Hugging Face](https://huggingface.co/datasets/MahmoodLab/Patho-Bench) for available tasks. |
| `patch_embeddings_dirs` | Location of patch embeddings | Example if using Trident for patch extraction: `'/path/to/job_dir/20x_512px_0px_overlap/features_conch_v15'` | Used by Patho-Bench to construct slide- or patient-level embeddings. Can be a list if your patch embeddings are split across multiple directories. |
| `pooled_embeddings_root` | Storage location for pooled slide features | User-defined path | Pooled embeddings are saved when first run; subsequent runs use stored features instead of re-pooling. |
| `splits_root` | Location to download dataset splits | User-defined path (optional) | Splits are automatically downloaded from [Hugging Face](https://huggingface.co/datasets/MahmoodLab/Patho-Bench). If not provided, must provide local paths `path_to_split` and `path_to_task_config` instead. |
| `combine_slides_per_patient` | Method for pooling multiple WSIs per patient | `True` (early fusion) or `False` (late fusion) | Early fusion concatenates all patch embeddings and processes them as a single pseudo-slide; late fusion processes slides separately and averages the slide embeddings. |
| `cost` | Regularization cost for the linear probe | Positive float | Inverse of regularization strength, so smaller values regularize more strongly. |
| `balanced` | Whether to apply balanced loss for the linear probe | `True` or `False` | - |
| `saveto` | Directory to save experiment results | User-defined path | Choose an appropriate location. |
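
To make the `combine_slides_per_patient` option concrete, here is a toy illustration of early versus late fusion. It is a sketch, not Patho-Bench's implementation: plain mean pooling stands in for the slide encoder (an assumption made only to keep the example small), and all function names are hypothetical.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors into one vector.

    Stand-in for a slide encoder; a real encoder such as Titan is far richer.
    """
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / n for d in range(dim)]


def early_fusion(slides):
    """Concatenate all patches from all slides, then encode once."""
    all_patches = [patch for slide in slides for patch in slide]
    return mean_pool(all_patches)


def late_fusion(slides):
    """Encode each slide separately, then average the slide embeddings."""
    slide_embeddings = [mean_pool(slide) for slide in slides]
    return mean_pool(slide_embeddings)


# One patient with two slides: slide A has 3 patches, slide B has 1 patch.
slide_a = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
slide_b = [[0.0, 1.0]]
patient = [slide_a, slide_b]

print(early_fusion(patient))  # [0.75, 0.25]: every patch counts equally
print(late_fusion(patient))   # [0.5, 0.5]: every slide counts equally
```

With this stand-in encoder, early fusion lets large slides dominate the patient embedding, while late fusion gives each slide the same weight regardless of its size.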


In [None]:
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'titan'
train_source = 'cptac_ccrcc' 
task_name = 'BAP1_mutation'

# For this task, the split and task config are downloaded automatically from Hugging Face.
experiment = ExperimentFactory.linprobe(
                    model_name = model_name,
                    train_source = train_source,
                    task_name = task_name,
                    patch_embeddings_dirs = './cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15',
                    pooled_embeddings_root = './_tutorial_pooled_features',
                    splits_root = './_tutorial_splits', # This is where the downloaded splits are saved
                    combine_slides_per_patient = False,
                    cost = 1,
                    balanced = False,
                    saveto = f'./_tutorial_linprobe/{train_source}/{task_name}/{model_name}',
                )
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'macro-ovr-auc')

Instead of automatically downloading the split and task config from Hugging Face, you can provide your own. To do this, pass `path_to_split` and `path_to_task_config` instead of `splits_root`.

If you wish to develop custom splits and tasks, please follow the format of the splits and task configs downloaded from Hugging Face. At minimum, your task config should contain the following fields.

```yaml
task_col: BAP1_mutation # Column containing labels for the task
extra_cols: [] # Any extra columns needed to perform the task (e.g. survival tasks)

metrics: # List of one or more performance metrics to report (this is used for automated result compilation when using Patho-Bench in advanced mode)
  - macro-ovr-auc

label_dict: # Dictionary of integer labels to string labels
  0: wildtype
  1: mutant

sample_col: case_id # Column containing the unit of analysis. Use 'case_id' for patient-level tasks and 'slide_id' for slide-level tasks.
```
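
When writing a custom task config, a small check for the minimum fields can catch mistakes early. The sketch below operates on the dictionary you would get after loading the YAML file (for example with PyYAML's `yaml.safe_load`). The function name `validate_task_config` and the specific checks are illustrative assumptions, not part of Patho-Bench.

```python
REQUIRED_FIELDS = ("task_col", "extra_cols", "metrics", "label_dict", "sample_col")


def validate_task_config(config):
    """Return a list of problems found in a task config dict (empty if none)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in config]
    if "metrics" in config and not config["metrics"]:
        problems.append("metrics must list at least one metric")
    label_dict = config.get("label_dict") or {}
    if any(not isinstance(key, int) for key in label_dict):
        problems.append("label_dict keys must be integers")
    return problems


# The example config from above, as the dict a YAML loader would return.
example = {
    "task_col": "BAP1_mutation",
    "extra_cols": [],
    "metrics": ["macro-ovr-auc"],
    "label_dict": {0: "wildtype", 1: "mutant"},
    "sample_col": "case_id",
}
print(validate_task_config(example))  # []
```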

Here we rerun the previous experiment, but this time we provide local paths to the previously downloaded split and task config files:

In [None]:
experiment = ExperimentFactory.linprobe(
                    model_name = model_name,
                    train_source = train_source,
                    task_name = task_name,
                    patch_embeddings_dirs = './cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15',
                    pooled_embeddings_root = './_tutorial_pooled_features',
                    path_to_split = f'./_tutorial_splits/{train_source}/{task_name}/k=all.tsv', # Local path to the split
                    path_to_task_config = f'./_tutorial_splits/{train_source}/{task_name}/config.yaml', # Local path to the task config
                    combine_slides_per_patient = False,
                    cost = 1,
                    balanced = False,
                    saveto = f'./_tutorial_linprobe/{train_source}/{task_name}/{model_name}_fromlocal',
                )
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'macro-ovr-auc')

# 🐺 Running a survival prediction experiment for CPTAC CCRCC

Let's see how we can train a CoxPH model to predict survival using Titan slide embeddings. Most of the arguments carry over from the linear probe example, but there are two CoxPH-specific hyperparameters:

| Argument | Description | Possible Values | Additional Notes |
|----------|-------------|----------------|------------------|
| `alpha` | Regularization strength of the elastic-net-penalized Cox model, controlling the sparsity of the learned coefficients | float | A c-index of exactly 0.5 means the CoxPH model has not converged. Try changing `alpha`. |
| `l1_ratio` | Controls the balance between L1 (lasso) and L2 (ridge) regularization: `l1_ratio=1` is pure L1 regularization (lasso), `l1_ratio=0` is pure L2 regularization (ridge), and values in between apply an elastic net penalty. | 0.0 to 1.0 | We use 0.5 |
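
To see how `alpha` and `l1_ratio` interact, the sketch below computes an elastic net penalty for a given coefficient vector. It uses a widely used form of the penalty, `alpha * (l1_ratio * ||b||_1 + 0.5 * (1 - l1_ratio) * ||b||_2^2)`. The factor of 0.5 on the L2 term is a common convention and an assumption here; the solver behind Patho-Bench may scale the terms differently.

```python
def elastic_net_penalty(coefficients, alpha, l1_ratio):
    """Elastic net penalty under the assumed convention described above."""
    l1_norm = sum(abs(b) for b in coefficients)
    l2_norm_squared = sum(b * b for b in coefficients)
    return alpha * (l1_ratio * l1_norm + 0.5 * (1.0 - l1_ratio) * l2_norm_squared)


beta = [1.0, -2.0, 0.0]  # toy coefficients, not from a fitted model

for ratio in (0.0, 0.5, 1.0):
    penalty = elastic_net_penalty(beta, alpha=0.07, l1_ratio=ratio)
    print(f"l1_ratio={ratio}: penalty={penalty:.4f}")
```

Raising `alpha` scales the whole penalty, which pushes more coefficients toward zero whenever `l1_ratio` is greater than 0.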

In [None]:
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'titan'
train_source = 'cptac_ccrcc' 
task_name = 'OS'

patch_embeddings_dirs = './cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15' 

experiment = ExperimentFactory.coxnet(
                    model_name = model_name,
                    train_source = train_source,
                    task_name = task_name,
                    patch_embeddings_dirs = patch_embeddings_dirs,
                    pooled_embeddings_root = './_tutorial_pooled_features',
                    splits_root = './_tutorial_splits',
                    combine_slides_per_patient = False,
                    alpha = 0.07,
                    l1_ratio = 0.5,
                    saveto = f'./_tutorial_coxnet/{train_source}/{task_name}/{model_name}',
                )
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'cindex')
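
For intuition about the `cindex` metric, here is a minimal pure-Python sketch of Harrell's concordance index. It is not the implementation Patho-Bench uses, and it ignores tied survival times for simplicity. It also shows why a model that assigns every patient the same risk lands at exactly 0.5.

```python
def concordance_index(times, events, risks):
    """Harrell's c-index for right-censored data (tied times are ignored).

    times:  observed follow-up times
    events: 1 if the event occurred, 0 if the subject was censored
    risks:  predicted risk scores (higher means shorter expected survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied predictions get half credit
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


times = [5, 10, 15, 20]
events = [1, 1, 0, 1]

print(concordance_index(times, events, [4, 3, 2, 1]))  # 1.0 (perfect ranking)
print(concordance_index(times, events, [0, 0, 0, 0]))  # 0.5 (constant risk)
```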

# 🖌️ Training an ABMIL from scratch

In many scenarios, a simple linear probe is not sufficient and you need a deep learning model. Patho-Bench makes it easy to train attention-based multiple instance learning (ABMIL) models for this purpose. Let's reuse the BAP1 mutation example to train an ABMIL model. Note that running the cell below may take some time, as this task has 50 folds.


In [None]:
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'abmil'
train_source = 'cptac_ccrcc'
task_name = 'BAP1_mutation'

# Hyperparameters
bag_size = 2048
base_learning_rate = 0.0003
layer_decay = None
gradient_accumulation = 1
weight_decay = 0.00001
num_epochs = 20
scheduler_type = 'cosine'
optimizer_type = 'AdamW'

patch_embeddings_dirs = './cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15' 

experiment = ExperimentFactory.finetune(
                    model_name = model_name,
                    train_source = train_source,
                    task_name = task_name,
                    task_type = 'classification',
                    patch_embeddings_dirs = patch_embeddings_dirs,
                    combine_slides_per_patient = False,
                    splits_root = './_tutorial_splits',
                    bag_size = bag_size,
                    base_learning_rate = base_learning_rate,
                    layer_decay = layer_decay, 
                    gradient_accumulation = gradient_accumulation,
                    weight_decay = weight_decay,
                    num_epochs = num_epochs,
                    scheduler_type = scheduler_type,
                    optimizer_type = optimizer_type,
                    balanced = True, 
                    save_which_checkpoints = 'last-1',
                    saveto = f'./_tutorial_finetune/{train_source}/{task_name}/{model_name}')
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'macro-ovr-auc')
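
For intuition, the attention pooling at the heart of ABMIL can be sketched in a few lines of plain Python. This is an illustrative, non-gated version with hand-picked weights, not Patho-Bench's model: in a real ABMIL the projection `V` and scoring vector `w` are learned during training, and the names used here are hypothetical.

```python
import math


def attention_pool(patches, V, w):
    """Attention-weighted average of patch embeddings.

    patches: list of patch vectors, each of length D
    V:       hidden projection, a list of H rows of length D
    w:       scoring vector of length H
    Returns (bag_embedding, attention_weights).
    """
    scores = []
    for h in patches:
        hidden = [math.tanh(sum(v_d * h_d for v_d, h_d in zip(row, h))) for row in V]
        scores.append(sum(w_k * z_k for w_k, z_k in zip(w, hidden)))

    # Softmax over patches (shifted by the max for numerical stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(patches[0])
    bag = [sum(a * h[d] for a, h in zip(weights, patches)) for d in range(dim)]
    return bag, weights


# Toy bag of three 2-D patch embeddings; the scorer attends to dimension 0.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bag, weights = attention_pool(patches, V=[[1.0, 0.0]], w=[2.0])
print(weights)  # patches 0 and 2 receive more attention than patch 1
print(bag)
```

The bag embedding is then passed to a classifier head, so the model learns which patches matter for the label instead of treating all patches equally.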

Note that the `ExperimentFactory.finetune()` method can also be used for finetuning pretrained slide encoders instead of training an ABMIL from scratch. You are encouraged to read the code and explore the additional capabilities of Patho-Bench.