## Training the CELTIC Model

This notebook demonstrates how to train the `CELTIC` model.

Using the single cell images and the context data, we initialize the experiment, configure the model, and run the training process. The trained model is saved in a local folder and can be used later for predictions (see `predict.ipynb`).

**`IMPORTANT:`**

To run the full-dataset training described in our [paper](https://www.biorxiv.org/content/10.1101/2024.11.10.622841v1.full), you must download the 20 GB microtubules single-cell dataset from our [BioImage Archive dataset](https://doi.org/10.6019/S-BIAD2156) and store the images locally.
The data can be downloaded with any FTP client (e.g., FileZilla) from ftp://ftp.ebi.ac.uk/pub/databases/biostudies/S-BIAD/156/S-BIAD2156/Files, under the path `microtubules/cell_images`.
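
For readers who prefer a scripted download over a GUI client, the following is a minimal sketch using Python's standard `ftplib`. It assumes that the server accepts anonymous logins and that `microtubules/cell_images` is a flat directory of files; neither is stated in this notebook. The helper names `split_ftp_url` and `download_ftp_dir` are hypothetical and are not part of CELTIC.

```python
from ftplib import FTP
from pathlib import Path
from urllib.parse import urlparse

FTP_URL = "ftp://ftp.ebi.ac.uk/pub/databases/biostudies/S-BIAD/156/S-BIAD2156/Files"
REMOTE_SUBDIR = "microtubules/cell_images"


def split_ftp_url(url, subdir=""):
    """Return (host, remote_dir) for an ftp:// URL, optionally extended by subdir."""
    parsed = urlparse(url)
    remote_dir = parsed.path.rstrip("/")
    if subdir:
        remote_dir = f"{remote_dir}/{subdir.strip('/')}"
    return parsed.hostname, remote_dir


def download_ftp_dir(url, subdir, local_dir):
    """Download every file of one remote directory (non-recursive) to local_dir."""
    host, remote_dir = split_ftp_url(url, subdir)
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    with FTP(host) as ftp:
        ftp.login()  # assumption: anonymous access is allowed
        ftp.cwd(remote_dir)
        for name in ftp.nlst():  # assumption: the directory holds files only
            target = local_dir / name
            if target.exists():
                continue  # already downloaded in an earlier run
            partial = local_dir / (name + ".part")
            with open(partial, "wb") as fh:
                ftp.retrbinary(f"RETR {name}", fh.write)
            partial.rename(target)  # rename only after a complete transfer


# Example call (transfers about 20 GB, so it is left commented out):
# download_ftp_dir(FTP_URL, REMOTE_SUBDIR, "path/to/your/local/copy/of/the/dataset")
```

Writing to a `.part` file and renaming it afterwards means an interrupted transfer is retried on the next run instead of being mistaken for a finished file.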

To run a <u>simple test</u> of this notebook on a very small dataset, small enough to download to Google Colab, follow the instructions below.


### CELTIC installation

In [None]:
# package installation (e.g., for Colab users)
!git clone https://github.com/zaritskylab/CELTIC
%cd CELTIC
!pip install .

### Initializations

In [3]:
# set the absolute path to the CELTIC repo
REPO_ROOT = "/content/CELTIC"

# set the organelle (keep microtubules for this example)
organelle = 'microtubules' 

In [4]:
from celtic.utils.functions import initialize_experiment, download_example_files
from celtic.train import train
import os
from pathlib import Path
import pandas as pd
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

assert os.path.exists(REPO_ROOT) and REPO_ROOT.endswith("CELTIC"), "REPO_ROOT misconfiguration"

abs_path_resources_dir = Path(f'{REPO_ROOT}/resources/{organelle}') # location of the samples to be downloaded


### Download resources

Download the training metadata.

In [None]:
download_example_files(abs_path_resources_dir, "train")
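
Before moving on, it can help to confirm that the download produced the metadata files that the later cells read. The sketch below checks for the four CSV names used further down in this notebook; the helper `missing_metadata` is hypothetical and not part of CELTIC.

```python
from pathlib import Path

# File names taken from the cells of this notebook that read the metadata.
EXPECTED_METADATA = [
    "train_images.csv",
    "train_context.csv",
    "valid_images.csv",
    "valid_context.csv",
]


def missing_metadata(resources_dir, expected=EXPECTED_METADATA):
    """Return the expected files that are absent from <resources_dir>/metadata."""
    metadata_dir = Path(resources_dir) / "metadata"
    return [name for name in expected if not (metadata_dir / name).is_file()]


# Usage in this notebook:
# print(missing_metadata(abs_path_resources_dir) or "all metadata files are present")
```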

### Define the training set size

In [6]:
# To run a tiny training example, set this variable to True.
# (see explanation in the header of this notebook)
# To run the full training, set it to False
tiny_training = True

In [None]:
if tiny_training:
    
    # keep only a tiny portion of the training and validation data
    train_size = 15
    valid_size = 7
    metadata_dir = f'{abs_path_resources_dir}/metadata'
    train_images = pd.read_csv(f'{metadata_dir}/train_images.csv').head(train_size)
    train_context = pd.read_csv(f'{metadata_dir}/train_context.csv').head(train_size)
    valid_images = pd.read_csv(f'{metadata_dir}/valid_images.csv').head(valid_size)
    valid_context = pd.read_csv(f'{metadata_dir}/valid_context.csv').head(valid_size)
    
    train_images.to_csv(f'{metadata_dir}/train_images_tiny.csv', index=False)
    train_context.to_csv(f'{metadata_dir}/train_context_tiny.csv', index=False)
    valid_images.to_csv(f'{metadata_dir}/valid_images_tiny.csv', index=False)
    valid_context.to_csv(f'{metadata_dir}/valid_context_tiny.csv', index=False)

    # extract the file names
    images_names = []
    for file_type in ['signal_file', 'target_file', 'mask_file']:
        images_names.extend(train_images[file_type].tolist())
        images_names.extend(valid_images[file_type].tolist())
    print(f"Total files to download: {len(images_names)}")

    # download the images
    images_paths = [f"{organelle}/cell_images/{item}" for item in images_names]
    download_example_files(abs_path_resources_dir, "", from_dict={"cell_images": images_paths})

    path_single_cells = f'{abs_path_resources_dir}/cell_images'

else:
    # This is the local path to the training images you downloaded from our BioImage Archive dataset.
    path_single_cells = r"path/to/your/local/copy/of/the/dataset"
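
Whichever branch is used, a quick check that every image listed in the metadata exists under `path_single_cells` can catch a wrong path before training starts. The following is a minimal sketch, assuming the `signal_file`, `target_file` and `mask_file` columns hold file names relative to the image folder. The helper `find_missing_images` is hypothetical and not part of CELTIC.

```python
import csv
from pathlib import Path

# Column names as used in the tiny-training cell of this notebook.
FILE_COLUMNS = ["signal_file", "target_file", "mask_file"]


def find_missing_images(images_csv, images_dir, columns=FILE_COLUMNS):
    """Return the file names listed in images_csv that are not found in images_dir."""
    images_dir = Path(images_dir)
    missing = []
    with open(images_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            for column in columns:
                name = row[column]
                # assumption: entries are paths relative to images_dir
                if name and not (images_dir / name).is_file():
                    missing.append(name)
    return missing


# Usage in this notebook (tiny-training file name shown):
# csv_path = f"{abs_path_resources_dir}/metadata/train_images_tiny.csv"
# print(find_missing_images(csv_path, path_single_cells))
```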


### Initialize the Experiment

This step initializes the experiment by creating a local folder to store the training files. It also defines the paths to the CSV files that list the images and, if contexts are used, to the CSV files that hold the context data. In this example, we provide the microtubules context files. The process of context creation is explained in the `context_creation.ipynb` notebook.


In [None]:
path_run_dir, context_model_config = initialize_experiment(organelle, 
                                                           'train', 
                                                           models_dir=f'{abs_path_resources_dir}/models')
print("the experiment will be saved in:", path_run_dir)

tiny_training_postfix = '_tiny' if tiny_training else ''
path_images_csv = [f'{abs_path_resources_dir}/metadata/{item}_images{tiny_training_postfix}.csv' for item in ['train', 'valid']]
path_context_csv = [f'{abs_path_resources_dir}/metadata/{item}_context{tiny_training_postfix}.csv' for item in ['train', 'valid']]

### Run Training

This step starts the training process using the specified parameters, including image paths, context data, and model configuration. The results are saved in the local folder of the experiment.
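
To get a feel for how much training the chosen parameters imply, the short calculation below relates the iteration count to the dataset size. It assumes that each iteration processes one batch of `batch_size` patches, which this notebook does not state explicitly; the function name is hypothetical.

```python
def patches_per_image(iterations, batch_size, n_images):
    """Average number of training patches drawn per image over the whole run."""
    return iterations * batch_size / n_images


# Values from this notebook: 60,000 iterations, batch size 24,
# and 15 training images in the tiny-training setting.
print(patches_per_image(60_000, 24, 15))  # → 96000.0
```

Under that assumption, the tiny-training setting revisits each of its 15 images many thousands of times, so a smaller `iterations` value may be enough for a quick test run.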


In [None]:
train.run_training(path_run_dir,
                    path_images_csv, 
                    path_context_csv,
                    path_single_cells, 
                    masked = True,
                    transforms = context_model_config['transforms'],
                    patch_size = context_model_config['train_patch_size'],
                    iterations = 60_000,
                    batch_size = 24,
                    learning_rate = 0.001,
                    context_features = context_model_config['context_features'], 
                    daft_embedding_factor = context_model_config['daft_embedding_factor'], 
                    daft_scale_activation = context_model_config['daft_scale_activation'])
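
Once training has finished, the contents of the experiment folder can be inspected with a few lines of standard-library Python. This is a minimal sketch that makes no assumption about which files CELTIC writes; the helper `list_run_outputs` is hypothetical.

```python
from pathlib import Path


def list_run_outputs(run_dir):
    """Return (relative path, size in bytes) for every file under run_dir."""
    run_dir = Path(run_dir)
    files = sorted(p for p in run_dir.rglob("*") if p.is_file())
    return [(str(p.relative_to(run_dir)), p.stat().st_size) for p in files]


# Usage in this notebook (path_run_dir comes from initialize_experiment):
# for rel_path, size in list_run_outputs(path_run_dir):
#     print(f"{size:>12,d}  {rel_path}")
```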