# 🔬 End-to-end tutorial for WSI processing and benchmarking

Follow along with this tutorial to download WSIs, process them with Trident, and run benchmarking studies using Patho-Bench.

## What will this tutorial cover?
1. Downloading WSIs for CPTAC Clear Cell Renal Cell Carcinoma (CCRCC)
2. Processing the WSIs with [Trident](https://github.com/mahmoodlab/trident), our one-stop package for WSI preprocessing
3. Running benchmarking studies. The following examples are included:  
    a. Linear probe for BAP1 mutation prediction  
    b. Retrieval for BAP1 mutation prediction  
    c. Cox Proportional Hazards (CoxPH) model for survival prediction  
    d. Attention-based multiple instance learning model for BAP1 mutation prediction  

# ⬇️ Download CPTAC CCRCC WSIs 

You can download the CPTAC CCRCC WSIs from [The Cancer Imaging Archive (TCIA)](https://www.cancerimagingarchive.net/collection/cptac-ccrcc/). If you have access to an SSD, download the WSIs there for faster I/O.

**Tip**: Keep an eye on this webpage for updates to the dataset, as newer versions are often released.
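
Before moving on, it can help to confirm that the slides landed where you expect. The snippet below is a minimal sketch, assuming the slides are stored as `.svs` files somewhere under the download folder; the helper name `list_slides` is hypothetical and not part of Trident or Patho-Bench.

```python
from pathlib import Path


def list_slides(wsi_dir, extensions=(".svs",)):
    """Return sorted slide paths found recursively under wsi_dir.

    The default extension is an assumption; pass other extensions
    if your download uses a different slide format.
    """
    root = Path(wsi_dir)
    if not root.is_dir():
        return []
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )


# Path taken from the Trident call later in this tutorial; adjust to your setup.
slides = list_slides("./CPTAC-CCRCC_v1/CCRCC")
print(f"Found {len(slides)} slide(s)")
```

If the count is zero, check the download path before launching preprocessing.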

# 🧑‍🔬 🧬 Installing Patho-Bench

From the root of your cloned Patho-Bench repository, run:

1. `conda create -n pathobench python=3.10`
2. `conda activate pathobench`
3. `pip install -e .`

# 🤖 Preprocess WSIs: segmentation, patching and patch feature extraction

We will use [Trident](https://github.com/mahmoodlab/trident), our package for WSI processing and feature extraction. Trident is already installed as part of the Patho-Bench installation.

Run the following cell:

In [None]:
import sys 
from run_batch_of_slides import main # import Trident launch script. 

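# Emulate the command-line call by populating sys.argv before calling main().
sys.argv = [
    "run_batch_of_slides",
    "--task", "all",                        # run every preprocessing step
    "--job_dir", "./cptac_ccrcc",           # where outputs are written
    "--wsi_dir", "./CPTAC-CCRCC_v1/CCRCC",  # where the downloaded WSIs live
    "--patch_encoder", "conch_v15",         # patch feature extractor
    "--mag", "20",                          # magnification for patching
    "--patch_size", "512",                  # patch size in pixels
    "--skip_errors",
]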

main()


The output has the following directory structure. Note that we placed the `wsis` folder inside the job directory, but you can put it anywhere.
```bash
|-->/path/to/job_dir
    |-->20x_512px_0px_overlap
        |-->features_conch_v15 --> these are the patch features
        |-->patches
        |-->visualizations
    |-->contours
    |-->contours_geojson
    |-->thumbnails
    |-->wsis
```
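
To check that preprocessing produced files in each of these folders, you can count the files per subdirectory. This is a minimal sketch using only the standard library; `summarize_job_dir` is a hypothetical helper, and the job directory path assumes the `--job_dir` value used above.

```python
from pathlib import Path


def summarize_job_dir(job_dir):
    """Map each subdirectory (relative to job_dir) to the number of files directly inside it."""
    root = Path(job_dir)
    if not root.is_dir():
        return {}
    summary = {}
    for folder in sorted(p for p in root.rglob("*") if p.is_dir()):
        n_files = sum(1 for f in folder.iterdir() if f.is_file())
        summary[folder.relative_to(root).as_posix()] = n_files
    return summary


# Assumes the job directory from the Trident call above.
for folder, n_files in summarize_job_dir("./cptac_ccrcc").items():
    print(f"{folder}: {n_files} file(s)")
```

A folder with far fewer files than the number of slides may point to slides that were skipped because of `--skip_errors`.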

# Running a linear probe experiment for BAP1 mutation prediction  

To predict BAP1 mutation in CCRCC from Titan slide embeddings, we train a logistic regression model (a linear probe) on top of the embeddings. Below is a modular function to run a linear probe experiment. Please see the docstrings for details on specific arguments.


In [None]:
import os
from patho_bench.SplitFactory import SplitFactory
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'titan'
train_source = 'cptac_ccrcc' 
task_name = 'BAP1_mutation'

# For this task, we will automatically download the split and task config from Hugging Face.
path_to_split, path_to_task_config = SplitFactory.from_hf('./_tutorial_splits', train_source, task_name)

# Now we can run the experiment
experiment = ExperimentFactory.linprobe(
                    split = path_to_split,
                    task_config = path_to_task_config,
                    pooled_embeddings_dir = os.path.join('./_tutorial_pooled_features', model_name, train_source, 'by_case_id'), # This task uses case-level pooling
                    saveto = f'./_tutorial_linprobe/{train_source}/{task_name}/{model_name}',
                    combine_slides_per_patient = False,
                    cost = 1,
                    balanced = False,
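                    patch_embeddings_dirs = '/media/ssd1/cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15', # Replace with your own path; the Trident run above writes to ./cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15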
                    model_name = model_name,                    
                )
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'macro-ovr-auc')
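
For a binary task such as BAP1 mutation, the macro one-vs-rest AUC reduces to the ordinary ROC AUC, because both one-vs-rest problems rank the samples identically when the class scores are complementary. The sketch below illustrates that quantity as the probability that a randomly chosen mutant case receives a higher score than a randomly chosen wildtype case. It is an illustration of the metric only, not Patho-Bench's implementation, and `binary_auc` is a hypothetical name.

```python
def binary_auc(labels, scores):
    """ROC AUC via pairwise comparison of positive and negative scores (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 0 = wildtype, 1 = mutant; scores are predicted mutant probabilities.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(binary_auc(labels, scores))  # 0.75 (3 of 4 positive/negative pairs ranked correctly)
```

The pairwise loop is quadratic in the number of samples, which is fine for illustration but not how a library would compute it at scale.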

Instead of downloading the split and task config from Hugging Face, you can provide your own. If you wish to develop custom splits and tasks, follow the format of the splits and task configs downloaded from Hugging Face. At minimum, your task config should contain the following fields.

```yaml
task_col: BAP1_mutation # Column containing labels for the task
extra_cols: [] # Any extra columns needed to perform the task (e.g. survival tasks)

metrics: # List of one or more performance metrics to report (this is used for automated result compilation when using Patho-Bench in advanced mode)
  - macro-ovr-auc

label_dict: # Dictionary of integer labels to string labels
  0: wildtype
  1: mutant

sample_col: case_id # Column containing the unit of analysis. Use 'case_id' for patient-level tasks and 'slide_id' for slide-level tasks.
```
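
When writing a custom task config, a quick sanity check can catch missing fields early. The sketch below assumes the YAML file has already been loaded into a Python dict (for example with PyYAML's `yaml.safe_load`). The checks mirror the field descriptions above; `check_task_config` is a hypothetical helper, not part of Patho-Bench, and Patho-Bench may accept or require more than this.

```python
REQUIRED_FIELDS = ("task_col", "extra_cols", "metrics", "label_dict", "sample_col")


def check_task_config(config):
    """Return a list of problems found in a task config dict (empty list = nothing flagged)."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in config]
    # Assumption based on the comments in the example config above.
    if "sample_col" in config and config["sample_col"] not in ("case_id", "slide_id"):
        problems.append("sample_col should be 'case_id' or 'slide_id'")
    if "metrics" in config and not config["metrics"]:
        problems.append("metrics should list at least one metric")
    if "label_dict" in config and not all(isinstance(k, int) for k in config["label_dict"]):
        problems.append("label_dict keys should be integers")
    return problems


# Same content as the YAML example above, written as a dict.
config = {
    "task_col": "BAP1_mutation",
    "extra_cols": [],
    "metrics": ["macro-ovr-auc"],
    "label_dict": {0: "wildtype", 1: "mutant"},
    "sample_col": "case_id",
}
print(check_task_config(config))  # [] because nothing is flagged
```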

# Running a retrieval experiment for BAP1 mutation prediction  

This is the same task as the linear probe experiment above, but now we use patient-level retrieval instead of a linear probe. We use the same splits that were downloaded in the cell above.

If you ran the cell above, the case-level embeddings are already saved in `pooled_embeddings_dir`, so you don't need to provide the `patch_embeddings_dirs` and `model_name` arguments.

In [None]:
import os
from patho_bench.SplitFactory import SplitFactory
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'titan'
train_source = 'cptac_ccrcc' 
task_name = 'BAP1_mutation'

# Now we can run the experiment
experiment = ExperimentFactory.retrieval(
                    split = f'./_tutorial_splits/cptac_ccrcc/{task_name}/k=all.tsv',
                    task_config = f'./_tutorial_splits/cptac_ccrcc/{task_name}/config.yaml',
                    pooled_embeddings_dir = os.path.join('./_tutorial_pooled_features', model_name, train_source, 'by_case_id'), # This task uses case-level pooling
                    saveto = f'./_tutorial_retrieval/{train_source}/{task_name}/{model_name}',
                    combine_slides_per_patient = False,
                    similarity = 'l2',
                    centering = False,
                    
                    # Don't need to provide the following args because pooled embeddings are already computed and saved to pooled_embeddings_dir
                    # patch_embeddings_dirs = '/media/ssd1/cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15',
                    # model_name = model_name,                    
                )

experiment.train()
experiment.test()
result = experiment.report_results(metric = 'mAP@1')
result = experiment.report_results(metric = 'mAP@5')
result = experiment.report_results(metric = 'mAP@10')
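
To make the retrieval setup concrete, the sketch below ranks training embeddings by L2 distance to each test embedding (matching `similarity = 'l2'` above) and scores the ranking with mean average precision at k. Definitions of mAP@k vary; this sketch uses one common variant, averaging precision over the relevant ranks within the top k, and may differ from what Patho-Bench computes. All function names are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_precision_at_k(query_label, ranked_labels, k):
    """Mean of precision@rank over ranks <= k where the retrieved label matches the query."""
    hits, precision_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == query_label:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def map_at_k(train_embs, train_labels, test_embs, test_labels, k):
    """Average AP@k over all test queries, retrieving from the training set."""
    aps = []
    for query, query_label in zip(test_embs, test_labels):
        order = sorted(range(len(train_embs)),
                       key=lambda i: l2_distance(query, train_embs[i]))
        ranked = [train_labels[i] for i in order]
        aps.append(average_precision_at_k(query_label, ranked, k))
    return sum(aps) / len(aps)


# Toy 2-D embeddings standing in for pooled case-level embeddings.
train_embs = [[0, 0], [1, 0], [5, 5], [6, 5]]
train_labels = [0, 0, 1, 1]
test_embs = [[0.2, 0], [5.5, 5], [0.9, 0.1]]
test_labels = [0, 1, 1]
print(map_at_k(train_embs, train_labels, test_embs, test_labels, k=1))
```

In this toy data the third query sits among wildtype neighbours despite being labelled mutant, so it lowers the score at small k.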

# Running a survival prediction experiment for CPTAC CCRCC

Let's see how we can train a CoxPH model to predict overall survival (`OS`) using Titan slide embeddings. For this task, we need to download a new split from Hugging Face.

In [None]:
import os
from patho_bench.SplitFactory import SplitFactory
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'titan'
train_source = 'cptac_ccrcc' 
task_name = 'OS'

# For this task, we will automatically download the split and task config from Hugging Face.
path_to_split, path_to_task_config = SplitFactory.from_hf('./_tutorial_splits', train_source, task_name)

# Now we can run the experiment
experiment = ExperimentFactory.coxnet(
                    split = path_to_split,
                    task_config = path_to_task_config,
                    pooled_embeddings_dir = os.path.join('./_tutorial_pooled_features', model_name, train_source, 'by_case_id'), # This task uses case-level pooling
                    saveto = f'./_tutorial_coxnet/{train_source}/{task_name}/{model_name}',
                    combine_slides_per_patient = False,
                    alpha = 0.07,
                    l1_ratio = 0.5,
                    
                    # Don't need to provide the following args because pooled embeddings are already computed and saved to pooled_embeddings_dir
                    # patch_embeddings_dirs = '/media/ssd1/cptac_ccrcc/20x_512px_0px_overlap/features_conch_v15',
                    # model_name = model_name,                    
                )
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'cindex')
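
The concordance index (c-index) reported above measures how often the model ranks two patients' risks in the right order. The sketch below is a plain-Python version of Harrell's c-index under simplifying assumptions: pairs with tied survival times are skipped, and tied risk scores count as half-concordant. It is for intuition only and is not Patho-Bench's implementation; `concordance_index` is a hypothetical name.

```python
def concordance_index(times, events, risks):
    """Fraction of comparable pairs in which the earlier-event patient has the higher risk."""
    concordant, comparable = 0.0, 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot be the earlier member of a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: survival times, event indicators (1 = death observed, 0 = censored), predicted risks.
times = [5, 10, 15, 20]
events = [1, 0, 1, 1]
risks = [0.9, 0.4, 0.1, 0.6]
print(concordance_index(times, events, risks))  # 0.75 (3 of 4 comparable pairs are concordant)
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ordering of risks.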


# 🖌️ Training an ABMIL from scratch

In many scenarios, a simple linear probe may not be sufficient and you need a deep learning model. Patho-Bench makes it easy to train attention-based multiple instance learning (ABMIL) models for this purpose. Let's use our BAP1 mutation example to train an ABMIL model. Note that running the cell below may take some time, as this task has 50 folds.


In [None]:
from patho_bench.ExperimentFactory import ExperimentFactory

model_name = 'abmil'
train_source = 'cptac_ccrcc'
task_name = 'BAP1_mutation'

experiment = ExperimentFactory.finetune(
                    split = f'./_tutorial_splits/cptac_ccrcc/{task_name}/k=all.tsv',
                    task_config = f'./_tutorial_splits/cptac_ccrcc/{task_name}/config.yaml',
                    patch_embeddings_dirs = '/media/ssd1/downstream_conch/cptac_ccrcc', # Replace with the directory containing your patch features
                    saveto = f'./_tutorial_finetune/{train_source}/{task_name}/{model_name}',
                    combine_slides_per_patient = False,
                    model_name = model_name,
                    bag_size = 2048,
                    base_learning_rate = 0.0003,
                    layer_decay = None, 
                    gradient_accumulation = 1,
                    weight_decay = 0.00001,
                    num_epochs = 20,
                    scheduler_type = 'cosine',
                    optimizer_type = 'AdamW',
                    balanced = True, 
                    save_which_checkpoints = 'last-1',
                    model_kwargs = {                    # ABMIL requires extra kwargs. Other models do not.
                        'input_feature_dim': 768,
                        'n_heads': 1,
                        'head_dim': 512,
                        'dropout': 0.25,
                        'gated': False
                    }
                    )
experiment.train()
experiment.test()
result = experiment.report_results(metric = 'macro-ovr-auc')

Note that the `ExperimentFactory.finetune()` method can also be used for finetuning pretrained slide encoders instead of training an ABMIL from scratch. You are encouraged to read the code and explore the additional capabilities of Patho-Bench.
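
For intuition about what an ABMIL model does with a bag of patch features, the sketch below implements the non-gated attention pooling step in plain Python: each patch gets a score `w · tanh(V h)`, the scores are normalised with a softmax, and the slide representation is the attention-weighted sum of patch features. This is a generic illustration with random toy weights and tiny dimensions, not Patho-Bench's implementation; all names are hypothetical.

```python
import math
import random


def attention_pool(bag, V, w):
    """Non-gated attention pooling: returns (pooled feature vector, attention weights)."""
    scores = []
    for h in bag:
        hidden = [math.tanh(sum(v_k * h_k for v_k, h_k in zip(row, h))) for row in V]
        scores.append(sum(w_j * u_j for w_j, u_j in zip(w, hidden)))
    # Numerically stable softmax over the patches in the bag.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(bag[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, bag)) for d in range(dim)]
    return pooled, weights


# Toy sizes for readability; the cell above uses 768-dimensional patch features.
random.seed(0)
n_patches, feat_dim, hidden_dim = 6, 4, 3
bag = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(n_patches)]
V = [[random.gauss(0, 1) for _ in range(feat_dim)] for _ in range(hidden_dim)]
w = [random.gauss(0, 1) for _ in range(hidden_dim)]

pooled, weights = attention_pool(bag, V, w)
print("attention weights:", [round(a, 3) for a in weights])
print("pooled feature length:", len(pooled))
```

In a trained model, `V` and `w` are learned together with a classifier applied to the pooled vector, so the attention weights indicate which patches contribute most to the prediction.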