# ***DESI Human Glioma Section Spectra Dimension Reduction***

This notebook demonstrates dimension reduction of the section spectra in the preprocessed DESI Human Glioma (DHG) dataset.

### ***Import packages***

Before we begin, let's import all the necessary packages for this notebook.
First, we add the directory that contains our Python files:

In [13]:
import sys
sys.path.insert(0, "../..")

Next we import all the necessary packages for this notebook:

In [14]:
import gc
import os
import numpy as np
import pandas as pd
import tensorflow as tf
from typing import Tuple
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import callbacks
from tensorflow.keras import backend as K
from sklearn.model_selection import train_test_split
from skimage import filters
from tqdm import tqdm
from pyimzml.ImzMLParser import ImzMLParser, getionimage
from nnbiopsy.bn_vae import BNVAE

### ***Constants definitions***

Next, let's define some constant variables for this notebook:

In [15]:
# Define folder that contains the dhg dataset and files
DHG_PATH = "C:/Users/Leor/Desktop/Thesis/DHG"
# Define folder that contains the preprocessed dhg dataset
DHG_IN_PATH = f"{DHG_PATH}/Preprocessed"
# Define file that contains dhg clinical state annotations
CLINICAL_STATE_ANNOTATIONS_PATH = f"{DHG_PATH}/Clinical_state_annotations.csv"
# Define folder to save VAE models for later use
VAE_MODELS_PATH = "C:/Users/Leor/Desktop/Thesis/section_vae_models"
# VAE model number of epochs
VAE_EPHOCS = 50
# VAE model batch size
VAE_BATCH_SIZE = 256
# VAE model intermediate layer size
VAE_INTERMIDATE_LAYER_SIZE = 512
# VAE model latent layer size
VAE_LATENT_LAYER_SIZE = 10
# VAE model learning rate
VAE_LEARNING_RATE = 1e-3
# MSI Spectra dimension
SPECTRA_DIM = 92000
# The MSI sample type for filtering
SAMPLE_TYPE = "s"
# Center m/z value of the window used to threshold for tissue
TRESH_MZ = 750
# m/z tolerance (half-width) of the window used to threshold for tissue
TRESH_MZ_TOL = 150
# Threshold standard deviation for Gaussian kernel
TRESH_GAUSSIAN_SIGMA = 1.5
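
`TRESH_MZ` and `TRESH_MZ_TOL` together describe the m/z window [600, 900] that is later reduced to one value per pixel to build the local TIC image. The following is a minimal sketch of that per-spectrum sum, assuming inclusive window bounds. The helper `local_tic` and the example spectrum are hypothetical and are not part of this notebook or of pyimzml:

```python
from typing import Sequence

TRESH_MZ = 750
TRESH_MZ_TOL = 150


def local_tic(mzs: Sequence[float], intensities: Sequence[float],
              center: float, tol: float) -> float:
  """Sum the intensities whose m/z lies in [center - tol, center + tol]."""
  low, high = center - tol, center + tol
  return sum(i for mz, i in zip(mzs, intensities) if low <= mz <= high)


# Made-up spectrum: only the peaks at 600, 750 and 900 fall in the window
example_mzs = [500.0, 600.0, 750.0, 900.0, 950.0]
example_intensities = [1.0, 2.0, 3.0, 4.0, 5.0]
window_sum = local_tic(example_mzs, example_intensities, TRESH_MZ,
                       TRESH_MZ_TOL)
```

A wide window such as this one makes the resulting image behave like a total ion current restricted to the chosen mass range, which is what the tissue threshold operates on.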

### ***Reading MSI clinical state annotations***

Next, let's read the clinical state annotations for each MSI and keep only the rows of the selected sample type:

In [16]:
# Read clinical state annotations csv
clinical_state_df = pd.read_csv(CLINICAL_STATE_ANNOTATIONS_PATH)

# Filter by sample_type
clinical_state_df = clinical_state_df[clinical_state_df["sample_type"] ==
                                      SAMPLE_TYPE]

### ***Get all tissue spectra from all MSI:***

Next, let's collect the metadata of every tissue spectrum from all MSI files. Intensities are not loaded at this stage because they need a lot of memory:

In [17]:
# Create lists to store each spectra's info
file_names = []
sample_numbers = []
histologies = []
who_grades = []
x_coordinates = []
y_coordinates = []
idxs = []

# Loop over each MSI
for index, msi_row in tqdm(clinical_state_df.iterrows(),
                           total=clinical_state_df.shape[0],
                           desc="MSI Loop"):
  # Parse the MSI file
  with ImzMLParser(os.path.join(DHG_IN_PATH,
                                f"{msi_row.file_name}.imzML")) as reader:
    # Get local TIC image of msi in mz region [600, 900]
    local_tic_img = getionimage(reader, TRESH_MZ, tol=TRESH_MZ_TOL)

    # Threshold image to separate tissue spectra from background
    smooth = filters.gaussian(local_tic_img, sigma=TRESH_GAUSSIAN_SIGMA)
    thresh_mean = filters.threshold_mean(smooth)
    # Apply the threshold to the smoothed image it was computed from
    thresh_img = smooth > thresh_mean

    # Loop over each spectra
    for idx, (x, y, z) in tqdm(enumerate(reader.coordinates),
                               total=len(reader.coordinates),
                               desc="Spectra Loop"):
      # Check if spectra is tissue
      if thresh_img[y - 1, x - 1]:
        # Keep sample file name of spectra
        file_names.append(msi_row.file_name)
        # Keep sample number of spectra
        sample_numbers.append(msi_row.sample_number)
        # Keep sample histology of spectra
        histologies.append(msi_row.histology)
        # Keep sample who grade of spectra
        who_grades.append(msi_row.who_grade)
        # Keep x coordinate of spectra
        x_coordinates.append(x)
        # Keep y coordinate of spectra
        y_coordinates.append(y)
        # Keep index of spectra within its MSI file
        idxs.append(idx)

# Convert to numpy array
file_names = np.array(file_names)
sample_numbers = np.array(sample_numbers)
histologies = np.array(histologies)
who_grades = np.array(who_grades)
x_coordinates = np.array(x_coordinates)
y_coordinates = np.array(y_coordinates)
idxs = np.array(idxs)

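
The tissue check above relies on two details: the mean threshold is simply the mean of the image, and imzML coordinates are 1-based, so pixel `(x, y)` lives at array position `[y - 1, x - 1]`. The following sketch shows both on a tiny made-up image. It skips the Gaussian smoothing step to stay NumPy-only, and all values are hypothetical:

```python
import numpy as np

# Made-up 3x4 ion image (rows are y, columns are x)
ion_image = np.array([
    [0.0, 0.1, 0.0, 0.0],
    [0.1, 5.0, 6.0, 0.2],
    [0.0, 4.0, 7.0, 0.1],
])

# Mean threshold: pixels brighter than the image mean count as tissue
threshold = ion_image.mean()
tissue_mask = ion_image > threshold

# 1-based (x, y, z) coordinates in row-major order, as an imzML file
# with one spectrum per pixel might list them
coordinates = [(x, y, 1) for y in range(1, 4) for x in range(1, 5)]

# Keep the running index of every spectrum that falls on tissue
tissue_idxs = [
    idx for idx, (x, y, _z) in enumerate(coordinates)
    if tissue_mask[y - 1, x - 1]
]
```

Storing only the running index, rather than the intensities, is what keeps this stage cheap: the index is enough to fetch the spectrum from its parser later.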

### ***MSI parsers opening:***

Next, let's create a parser for each MSI so that spectra can be read on demand for the model:

In [None]:
# Opening parsers
parsers = {
    file_name: ImzMLParser(os.path.join(DHG_IN_PATH, f"{file_name}.imzML"))
    for file_name in clinical_state_df.file_name.unique()
}

### ***Dataset generator:***

Next, let's create a dataset generator for the model:

In [None]:
def map_index(index: tf.Tensor) -> Tuple[np.ndarray, np.ndarray]:
  """Function to map index to model input (spectra) and output (spectra).

  Args:
      index (tf.Tensor): index to map to corresponding values.

  Returns:
      Tuple[np.ndarray, np.ndarray]: input (spectra) and output (spectra).
  
  """
  # Decoding index from the EagerTensor object
  index = index.numpy()
  # Reading spectra from parser
  file_name = file_names[index]
  idx = idxs[index]
  _, spectra = parsers[file_name].getspectrum(idx)
  # Return spectra twice as input and reconstruction
  return (spectra, spectra)


def _fixup_shape(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
  """Function to set the static shapes of the tensors returned by
  tf.py_function, which cannot be inferred automatically.

  Args:
      x (tf.Tensor): input (spectra)
      y (tf.Tensor): output (spectra)

  Returns:
      Tuple[tf.Tensor, tf.Tensor]: input (spectra) and output (spectra) with
        correct shape.
  
  """
  x.set_shape([SPECTRA_DIM])
  y.set_shape([SPECTRA_DIM])
  return x, y


def create_ds(indexes: np.ndarray, batch_size: int) -> tf.data.Dataset:
  """Function to create a dataset for model

  Args:
      indexes (np.ndarray): indexes of the dataset
      batch_size (int): batch size

  Returns:
      tf.data.Dataset: dataset
  """
  # Create dataset from the array of indexes
  ds = tf.data.Dataset.from_tensor_slices(indexes)
  # Shuffle the data
  ds = ds.shuffle(len(indexes))
  # Repeats this data
  ds = ds.repeat()
  # Map index to spectra
  ds = ds.map(lambda i: tf.py_function(
      func=map_index, inp=[i], Tout=[tf.float32, tf.float32]))
  # Fix the implicit inferring of the shapes of the
  # output Tensors
  ds = ds.map(_fixup_shape)
  # Batch the spectra
  ds = ds.batch(batch_size)
  # Prefetch batches to make sure that a batch is ready to
  # be served at all times
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds
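
The order of the steps in `create_ds` (shuffle, repeat, map, batch) can be mimicked in plain Python to see what the model receives. This is only an illustrative sketch and not how `tf.data` is implemented; `index_stream`, `lazy_batches` and `fake_load` are hypothetical names:

```python
import itertools
import random
from typing import Callable, Iterator, List, Sequence


def index_stream(indexes: Sequence[int], seed: int) -> Iterator[int]:
  """Yield the indexes forever, reshuffled on every pass (shuffle + repeat)."""
  rng = random.Random(seed)
  while True:
    order = list(indexes)
    rng.shuffle(order)
    yield from order


def lazy_batches(indexes: Sequence[int], batch_size: int,
                 load: Callable[[int], List[float]],
                 seed: int = 0) -> Iterator[List[List[float]]]:
  """Load items only when their batch is requested (map + batch)."""
  stream = index_stream(indexes, seed)
  while True:
    yield [load(i) for i in itertools.islice(stream, batch_size)]


loaded = []


def fake_load(i: int) -> List[float]:
  """Stand-in for reading one spectrum; records which index was read."""
  loaded.append(i)
  return [float(i)]


batches = lazy_batches(range(5), batch_size=2, load=fake_load)
first_three = list(itertools.islice(batches, 3))
```

Because the stream never ends, the consumer has to state how many batches make up one epoch, which is why the training call passes an explicit number of steps. Only the spectra of the requested batches are ever read, so memory use is bounded by the batch size rather than by the dataset size.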

### ***Variational autoencoder:***

Next, let's create a variational autoencoder model:

In [None]:
# Add different implementation of VAE
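
The cell above is a placeholder, and the `BNVAE` class used later is imported from `nnbiopsy.bn_vae`, whose implementation is not shown here. For orientation only, the following sketch shows two pieces that a generic VAE typically contains: the reparameterization trick and the KL divergence between a diagonal Gaussian and the standard normal. It is not the `BNVAE` implementation, and the function names are hypothetical:

```python
import math
from typing import List, Sequence


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   eps: Sequence[float]) -> List[float]:
  """Compute z = mu + sigma * eps, with sigma = exp(0.5 * log_var)."""
  return [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, log_var, eps)]


def kl_to_standard_normal(mu: Sequence[float],
                          log_var: Sequence[float]) -> float:
  """KL(N(mu, sigma^2) || N(0, 1)) summed over the latent dimensions."""
  return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                    for m, lv in zip(mu, log_var))


# With zero noise the sample equals the mean
z = reparameterize([1.0, 2.0], [0.0, 0.0], [0.0, 0.0])
# A latent code that already matches the prior has zero divergence
kl_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
# Shifting one mean by 1 costs 0.5
kl_shifted = kl_to_standard_normal([1.0, 0.0], [0.0, 0.0])
```

In a full model, `eps` would be drawn from a standard normal distribution and the KL term would be added to the reconstruction loss.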

### ***LOOCV Dimension reduction:***

Next, let's apply dimension reduction using leave-one-out cross-validation (LOOCV), holding out all spectra of one sample at a time:

In [None]:
# Loop over each sample number (the [:1] slice below limits this run
# to the first sample)
for exclude_sample in tqdm(np.unique(sample_numbers)[:1]):
  # Clear graph
  K.clear_session()
  gc.collect()

  # Create filter for training data
  train_filter = (sample_numbers != exclude_sample)

  # Get indexes of all data
  indexes = np.arange(len(sample_numbers))

  # Get indexes of training data
  train_indexes = indexes[train_filter]

  # Get binary labels
  labels = who_grades > 2

  # Get indexes of training and validation data
  train_indexes, val_indexes = train_test_split(train_indexes,
                                                test_size=0.2,
                                                random_state=0,
                                                stratify=labels[train_filter])

  # Create data generators
  training_generator = create_ds(train_indexes, VAE_BATCH_SIZE)
  validation_generator = create_ds(val_indexes, VAE_BATCH_SIZE)
  test_generator = create_ds(indexes[~train_filter], VAE_BATCH_SIZE)

  # Create Callback to save the best model
  checkpoint_filepath = os.path.join(VAE_MODELS_PATH,
                                     f"excluded_{exclude_sample}/")
  model_checkpoint_callback = callbacks.ModelCheckpoint(
      filepath=checkpoint_filepath,
      save_weights_only=True,
      monitor="val_loss",
      mode="min",
      save_best_only=True)

  # Create VAE model
  vae_model = BNVAE(SPECTRA_DIM, VAE_INTERMIDATE_LAYER_SIZE,
                    VAE_LATENT_LAYER_SIZE)

  # Compile the VAE model
  optimizer = optimizers.Adam(learning_rate=VAE_LEARNING_RATE)
  vae_model.compile(optimizer, loss=losses.CategoricalCrossentropy())

  # Train the VAE model
  history = vae_model.fit(
      x=training_generator,
      validation_data=validation_generator,
      epochs=VAE_EPHOCS,
      steps_per_epoch=int(np.ceil(len(train_indexes) / VAE_BATCH_SIZE)),
      validation_steps=int(np.ceil(len(val_indexes) / VAE_BATCH_SIZE)),
      callbacks=[model_checkpoint_callback])

  # Load the saved weights into the model
  vae_model.load_weights(checkpoint_filepath)

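  # Evaluate the VAE on the held-out test sample; the number of steps is
  # derived from the test set size, not the validation set size
  test_steps = int(np.ceil(np.sum(~train_filter) / VAE_BATCH_SIZE))
  test_eval = vae_model.evaluate(x=test_generator, steps=test_steps)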

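
The fold construction in the loop above can be checked in isolation. This sketch uses a made-up `sample_numbers` array and a hypothetical `BATCH_SIZE` to show which spectra end up in the training and test parts of each fold, and how many evaluation steps are needed to cover the test part once:

```python
import math

import numpy as np

# Hypothetical values for illustration only
BATCH_SIZE = 2
sample_numbers = np.array([1, 1, 2, 2, 2, 3])
indexes = np.arange(len(sample_numbers))

folds = {}
for held_out in np.unique(sample_numbers):
  # All spectra of the held-out sample form the test set
  train_filter = sample_numbers != held_out
  test_indexes = indexes[~train_filter]
  folds[int(held_out)] = {
      "train": indexes[train_filter].tolist(),
      "test": test_indexes.tolist(),
      # Steps needed to see every test spectrum once
      "test_steps": math.ceil(len(test_indexes) / BATCH_SIZE),
  }
```

Splitting by sample rather than by spectrum keeps spectra from the same tissue section out of both sides of the split, so the evaluation is not flattered by near-duplicate neighbours.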

### ***MSI parsers closing:***

Next, let's close the MSI parsers:

In [None]:
# Closing parsers
for reader in parsers.values():
  if reader.m:
    reader.m.close()
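
An explicit closing cell only runs if execution reaches it, which an interrupted run may not. Since `ImzMLParser` is already used in a `with` statement earlier in this notebook, one possible alternative is to manage several parsers with `contextlib.ExitStack`. The sketch below uses a hypothetical `DummyParser` stand-in so that it runs without any data files:

```python
from contextlib import ExitStack


class DummyParser:
  """Hypothetical stand-in for a parser that works as a context manager."""

  def __init__(self, name: str):
    self.name = name
    self.closed = False

  def __enter__(self):
    return self

  def __exit__(self, exc_type, exc_value, traceback):
    # Release the resource when the surrounding block is left
    self.closed = True
    return False


with ExitStack() as stack:
  # Every parser entered here is closed when the block exits,
  # even if an exception or interrupt occurs inside it
  dummy_parsers = {
      name: stack.enter_context(DummyParser(name))
      for name in ["sample_a", "sample_b"]
  }
  open_inside = [parser.closed for parser in dummy_parsers.values()]
```

The trade-off is that all work using the parsers has to happen inside the `with` block, which fits a script better than a notebook split across many cells.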