# Prepare MIDOG++ Dataset

**Citation:**
Aubreville, Marc; Wilm, Frauke; Stathonikos, Nikolas; Breininger, Katharina; Donovan, Taryn; Jabari, Samir; et al. (2023). MIDOG++: A Comprehensive Multi-Domain Dataset for Mitotic Figure Detection. figshare. Collection. https://doi.org/10.6084/m9.figshare.c.6615571.v1

**Download and dataset preparation**
As the images in the MIDOG dataset are rather large, we first pre-extract the cell graphs by applying CellViT and then train the classification module solely on the graphs (`*.pt` files). Because the dataset is released under the CC0 license, we are pleased to share the annotations alongside this repository.

If you want to perform all steps, you need at least 300 GB of free disk space.
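
Before downloading, it can help to confirm that the target drive actually has that much room. The following is a minimal sketch using only the Python standard library; the function name `has_enough_space` and the default path `"."` are illustrative assumptions and not part of this repository.

```python
import shutil

REQUIRED_GB = 300  # figure quoted above for running all steps


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` GB free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= required_gb


if __name__ == "__main__":
    target = "."  # hypothetical: replace with the folder that will hold the dataset
    free_gb = shutil.disk_usage(target).free / 1e9
    status = "OK" if has_enough_space(target) else "NOT ENOUGH"
    print(f"{free_gb:.0f} GB free at {target!r}: {status}")
```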

1. Please download the dataset here: https://springernature.figshare.com/collections/MIDOG_A_Comprehensive_Multi-Domain_Dataset_for_Mitotic_Figure_Detection/6615571
   Download only the images (`*.tiff`) and place them in a folder called `images_raw`. We provide the following files, with our changes noted:
   - `datasets_xvalidation.csv`: added the slide mpp (microns per pixel) in x and y direction
   - `midog.json`: same file as `MIDOG++.json`, but renamed
   - `midog_indent.json`: same file as `MIDOG++.json`, but indented for better human readability
   - `gt_dataset.json`: JSON file with mitotic figures (only mitotic figures!) for every image in the dataset
   - `gt_test.json`: JSON file with mitotic figures (only mitotic figures!), just for the images in the test set
   - `label_map.yaml`: label map
   - `filelist.csv`: file list with all WSI paths of MIDOG++ for CellViT extraction

   If you need help with the download, check out this script from the authors of the dataset: https://github.com/DeepMicroscopy/MIDOGpp/blob/main/Setup.ipynb. We only needed to replace the download link in the notebook with https://springernature.figshare.com/ndownloader/files/{file}. The dataset size is around 65 GB. We tried to replicate the train-val split by following the training script at https://github.com/DeepMicroscopy/MIDOGpp/blob/8b15a5af4508953ca67f8cbe6a4c19e724cf4431/training.py. The test split is defined externally by the dataset authors, and we adopted it unchanged.
2. The dataset structure should be:
   
        ```bash
        ./
        ├── images_raw                  # images downloaded -> No pyramid (single plane tiff file)
        │   ├── 001.tiff
        │   ├── 002.tiff
        │   ...
        │   └── 553.tiff
        ├── split                       # corresponding split files
        │   ├── leave-one-out
        │   ├── single-domain
        │   └── test_files
        ├── datasets_xvalidation.csv    # please use our file!
        ├── gt_dataset.json             # please use our file!
        ├── gt_test.json                # please use our file!
        ├── midog.json                  # please use our file!
        ├── midog_indent.json           # please use our file!
        ├── label_map.yaml              # please use our file!
        └── filelist.csv                # please use our file!
        ```
3. Convert the TIFF files to pyramidal TIFF images using vips and store them in the `images` folder. Change the paths in the `convert.sh` script to your local paths and run the script.
   Be careful: this takes up an additional 100 GB of local disk space.
4. Extract the graphs. Replace the checkpoint path with your CellViT model checkpoint, set the output directory, and adapt the paths in `filelist.csv` before running:

         ```bash
         # --model:    replace, e.g., ./checkpoints/CellViT-SAM-H-x40-AMP.pth
         # --outdir:   replace, e.g., ./logs/Datasets/MIDOG++/graphs/SAM-H
         # --filelist: replace, e.g., ./logs/Datasets/MIDOG++/filelist.csv
         python3 ./cellvit/detect_cells.py \
            --model ./cellvit/checkpoints/SELECT_YOUR_CHECKPOINT \
            --outdir /path/to/MIDOG++/Dataset/graphs/model \
            --batch_size 4 \
            --graph \
            process_dataset \
            --filelist /path/to/MIDOG++/filelist.csv \
            --wsi_extension tiff
         ```
5. Define your configs and train. An example is given in the folder.
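
A quick way to catch a misplaced file before the long-running steps is to compare the dataset root against the structure listed in step 2. The sketch below assumes exactly the folder and file names shown there; the helper names `missing_entries` and `count_raw_images` are hypothetical and not part of this repository.

```python
from pathlib import Path

# Names taken from the dataset structure in step 2.
EXPECTED_DIRS = [
    "images_raw",
    "split/leave-one-out",
    "split/single-domain",
    "split/test_files",
]
EXPECTED_FILES = [
    "datasets_xvalidation.csv",
    "gt_dataset.json",
    "gt_test.json",
    "midog.json",
    "midog_indent.json",
    "label_map.yaml",
    "filelist.csv",
]


def missing_entries(root):
    """Return the expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


def count_raw_images(root):
    """Count the *.tiff files placed in images_raw."""
    return len(list((Path(root) / "images_raw").glob("*.tiff")))


if __name__ == "__main__":
    dataset_root = "."  # hypothetical: replace with your MIDOG++ dataset root
    missing = missing_entries(dataset_root)
    if missing:
        print("Missing entries:")
        for name in missing:
            print(f"  {name}")
    else:
        print(f"Layout OK, {count_raw_images(dataset_root)} raw images found.")
```

Running the script from the dataset root prints either the missing entries or the number of raw images found.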

# Train Classifier

In [None]:
# python3 ./cellvit/train_cell_classifier_head.py --config /path/to/your/config.yaml

# Evaluation

### General overview of the CLI

In [None]:
# evaluate with MIDOG-evaluation metrics
# python3 ./cellvit/training/evaluate/inference_cellvit_experiment_midog.py --help
#
# usage: inference_cellvit_experiment_midog.py [-h] [--logdir LOGDIR] [--graph_path GRAPH_PATH] [--test_filelist TEST_FILELIST] [--gt_json GT_JSON]
#                                              [--x_valid_path X_VALID_PATH] [--threshold THRESHOLD] [--bbox_radius BBOX_RADIUS] [--comment COMMENT] [--gpu GPU]
#                                              [--image_path IMAGE_PATH] [--validation VALIDATION]
                                             
# Perform CellViT-Classifier inference for MIDOG dataset. Differing to all other datasets, as the MIDOG dataset contains WSI-like sections the preextracted graphs (from
# cellvit/detect_cells.py) are used for inference and passed to the CellViT-Classifier-Head. Be careful to use the correct graph folder!

# options:
#   -h, --help            show this help message and exit
#   --logdir LOGDIR       Path to the log directory with the trained head. (default: None)
#   --graph_path GRAPH_PATH
#                         Path to the MIDOG dataset with the preextracted graphs for this CellViT-Architecture. Be careful about choosing the correct CellViT-
#                         Architecture for the graph folder. Possible models are: CellViT-256, CellViT-SAM-H, CellViT-UNI (default: None)
#   --test_filelist TEST_FILELIST
#                         Path to the test filelist for the MIDOG dataset. (default: None)
#   --gt_json GT_JSON     Path to the ground truth json test file for the MIDOG dataset. (default: None)
#   --x_valid_path X_VALID_PATH
#                         Path to the x_valid.csv file for the MIDOG dataset. (default: None)
#   --threshold THRESHOLD
#                         Threshold for classification. Default is 0.85. (default: 0.85)
#   --bbox_radius BBOX_RADIUS
#                         Radius for merging cells. Default is 0.01125. (default: 0.01125)
#   --comment COMMENT     Comment for the inference run. (default: None)
#   --gpu GPU             Number of CUDA GPU to use (default: 0)
#   --image_path IMAGE_PATH
#                         Path to the image folder for the MIDOG dataset. Just use if you want to store plots. (default: None)
#   --validation VALIDATION
#                         If set, the validation set is used for inference and optimal thresholds calculated. (default: None)
#

### Step 1: Validation to find classification threshold

In [None]:
# python3 ./cellvit/training/evaluate/inference_cellvit_experiment_midog.py \
#     --logdir /Path/to/your/run/log \
#     --graph_path /Path/to/the/graphs \
#     --test_filelist /Path/to/the/validation/split \
#     --gt_json /Path/to/gt_dataset.json \
#     --x_valid_path /Path/to/datasets_xvalidation.csv \
#     --validation True \
#     --comment validation
#
# --test_filelist example: ./logs/Datasets/MIDOG++/split/single-domain/all/valid_0.csv
# IMPORTANT: use the correct validation file at this stage, not a test file.
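
How the script selects the optimal threshold is not shown in this document. As a rough illustration of the idea only, the sketch below assumes the threshold is chosen by maximizing the F1 score over a fixed grid, and that each cell already has a predicted mitosis probability and a binary ground-truth label. That is a simplification: the real MIDOG evaluation first has to match detections to annotated mitotic figures. The function names and the toy data are hypothetical.

```python
def f1_at_threshold(probs, labels, threshold):
    """F1 score when every cell with probability >= threshold is called mitotic."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, labels, candidates=None):
    """Return the candidate threshold with the highest F1 (lowest one on ties)."""
    if candidates is None:
        candidates = [i / 100 for i in range(5, 100, 5)]  # 0.05, 0.10, ..., 0.95
    return max(candidates, key=lambda t: f1_at_threshold(probs, labels, t))


if __name__ == "__main__":
    # Toy validation data, made up purely for illustration.
    probs = [0.95, 0.90, 0.60, 0.40, 0.20]
    labels = [1, 1, 0, 1, 0]
    t = best_threshold(probs, labels)
    print(f"best threshold: {t}, F1: {f1_at_threshold(probs, labels, t):.3f}")
```

Whatever selection rule is used, the threshold must be fixed on the validation split and only then applied to the test split.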

### Step 2: Inference on test dataset

In [None]:
# python3 ./cellvit/training/evaluate/inference_cellvit_experiment_midog.py \
#     --logdir /Path/to/your/run/log \
#     --graph_path /Path/to/the/graphs \
#     --test_filelist /Path/to/the/test/split.csv \
#     --gt_json /Path/to/gt_test.json \
#     --x_valid_path /Path/to/datasets_xvalidation.csv \
#     --threshold 0.85
#
# --test_filelist example: ./logs/Datasets/MIDOG++/split/test_files/test_all.csv
# --threshold: 0.85 is only an example; use the best threshold from the validation folder of Step 1.
# IMPORTANT: use the correct test file at this stage, not a validation file.