# ***Spectra Dimension Reduction***

This notebook walks through reducing the dimensionality of mass spectrometry imaging (MSI) spectra with a batch-normalized variational autoencoder (VAE).

### ***Import packages***

Before we begin, let's import all the necessary packages for this notebook:

In [1]:
import gc
import os
import pickle
import random
import numpy as np
import pandas as pd
import tensorflow as tf
import seaborn as sns
from typing import Tuple
from tqdm import tqdm
from pathlib import Path
from pyimzml.ImzMLParser import ImzMLParser
from tensorflow.keras import Model
from tensorflow.keras import layers
from tensorflow.keras import optimizers
from tensorflow.keras import losses
from tensorflow.keras import callbacks
from tensorflow.keras import metrics as k_metrics
from tensorflow.keras import backend as K
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

### ***Consistency***
Next, let's fix the random seeds so that the notebook's results are reproducible:

In [2]:
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

### ***Constants definitions***

Next, let's define some constant variables for this notebook:

In [3]:
# Define folder that contains the dhg dataset
DHG_PATH = "./DHG"
# Define folder that contains the preprocessed dataset
LEVEL_2_PATH = f"{DHG_PATH}/level_2"
# Define folder that contains the lower dimension dataset
LEVEL_3_PATH = f"{DHG_PATH}/level_3"
# Define file that contains dhg metadata
METADATA_PATH = f"{DHG_PATH}/metadata.csv"
# Define folder to save models for later use
MODELS_PATH = "./models/dimension_reduction"
# Define path to save plots
FIGURES_PATH = "./figures/dimension_reduction"
# Define spectra dimension
SPECTRA_DIM = 24000
# Define intermediate dimension
INTERMEDIATE_DIM = 512
# Define latent dimension
LATENT_DIM = 5
# Define number of epochs
EPOCHS = 100
# Define batch size
BATCH_SIZE = 128
# Define learning rate
LEARNING_RATE = 1e-3

### ***Creating output folders***

Next, let's create the output folders:

In [4]:
# Create output folders if they don't exist
Path(MODELS_PATH).mkdir(parents=True, exist_ok=True)
Path(LEVEL_3_PATH).mkdir(parents=True, exist_ok=True)
Path(FIGURES_PATH).mkdir(parents=True, exist_ok=True)

### ***Reading MSI metadata file***

Next, let's read the metadata file:

In [5]:
# Read metadata csv
metadata_df = pd.read_csv(METADATA_PATH)

# Separate section and replica
s_metadata_df = metadata_df[metadata_df.sample_type == "section"]
r_metadata_df = metadata_df[metadata_df.sample_type == "replica"]

### ***Get single spectra information from all MSI:***

Next, let's collect the information for every spectrum in every image, except the intensities (which need a lot of memory):

In [6]:
# Create a list to store each spectrum's info
spectras_info = []

# Loop over each MSI
for index, row in tqdm(
    metadata_df.iterrows(), total=metadata_df.shape[0], desc="MSI Loop"
):
  # Parse the MSI file
  with ImzMLParser(
      os.path.join(LEVEL_2_PATH, f"{row.sample_file_name}.imzML")
  ) as reader:
    # Threshold image
    thresh_img = np.load(
        os.path.join(LEVEL_2_PATH, f"{row.sample_file_name}.npy")
    )

    # Loop over each spectrum
    for idx, (x, y, z) in tqdm(
        enumerate(reader.coordinates), total=len(reader.coordinates),
        desc="Spectra Loop"
    ):
      # Append spectrum info; the threshold image is indexed with 0-based
      # [row, column] positions, hence the -1 on both coordinates
      spectras_info.append(
          [
              row.sample_file_name, row.sample_type, row.sample_number,
              row.histology, row.who_grade, x, y, idx,
              bool(thresh_img[y - 1, x - 1])
          ]
      )

# Convert to data frame
spectras_info = pd.DataFrame(
    spectras_info, columns=[
        "file_name", "sample_type", "sample_number", "histology", "who_grade",
        "x_coordinate", "y_coordinate", "idx", "is_tissue"
    ]
)

# Separate section and replica
s_spectras_info = spectras_info[spectras_info.sample_type == "section"]
r_spectras_info = spectras_info[spectras_info.sample_type == "replica"]

Spectra Loop: 100%|██████████| 4275/4275 [00:00<00:00, 44073.41it/s]
Spectra Loop: 100%|██████████| 4845/4845 [00:00<00:00, 43651.72it/s]
Spectra Loop: 100%|██████████| 5016/5016 [00:00<00:00, 42499.96it/s]
Spectra Loop: 100%|██████████| 4429/4429 [00:00<00:00, 41010.33it/s]
Spectra Loop: 100%|██████████| 3096/3096 [00:00<00:00, 43000.41it/s]
Spectra Loop: 100%|██████████| 6240/6240 [00:00<00:00, 41879.01it/s]
Spectra Loop: 100%|██████████| 8034/8034 [00:00<00:00, 43901.56it/s]
Spectra Loop: 100%|██████████| 4536/4536 [00:00<00:00, 42792.47it/s]
Spectra Loop: 100%|██████████| 3456/3456 [00:00<00:00, 41627.22it/s]
Spectra Loop: 100%|██████████| 5928/5928 [00:00<00:00, 39785.19it/s]
Spectra Loop: 100%|██████████| 7068/7068 [00:00<00:00, 40847.36it/s]
Spectra Loop: 100%|██████████| 4550/4550 [00:00<00:00, 37916.70it/s]
Spectra Loop: 100%|██████████| 5740/5740 [00:00<00:00, 39311.11it/s]
Spectra Loop: 100%|██████████| 7826/7826 [00:00<00:00, 44466.96it/s]
Spectra Loop: 100%|██████████| 755
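
The `is_tissue` flag relies on shifting each coordinate by one before indexing the threshold image. Here is a minimal sketch of that lookup, assuming 1-based `(x, y, z)` coordinates and a mask stored as `[row, column]`. The toy mask and coordinate list are made up for illustration and do not come from the dataset:

```python
import numpy as np

# Hypothetical 2x3 threshold mask (rows = y, columns = x); True marks tissue
thresh_img = np.array([[True, False, True],
                       [False, False, True]])

# Hypothetical 1-based (x, y, z) coordinates, in the style of reader.coordinates
coordinates = [(1, 1, 1), (2, 1, 1), (3, 2, 1)]

# Shift to 0-based indices and index as [y, x], mirroring the loop above
is_tissue = [bool(thresh_img[y - 1, x - 1]) for x, y, _ in coordinates]
print(is_tissue)
```

Indexing as `[y - 1, x - 1]` rather than `[x - 1, y - 1]` matters whenever the image is not square, since the first axis of the mask is the row (y) axis.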

### ***MSI parsers opening:***

Next, let's open a parser for each MSI so that the dataset can read spectra on demand:

In [7]:
# Opening parsers
parsers = {
    file_name: ImzMLParser(os.path.join(LEVEL_2_PATH, f"{file_name}.imzML"))
    for file_name in metadata_df.sample_file_name.unique()
}

### ***Dataset generator:***

Next, let's create a dataset generator for the model:

In [8]:
def map_record(
    file_name: tf.Tensor, idx: tf.Tensor
) -> Tuple[np.ndarray, np.ndarray]:
  """Function to map a record to model input (spectra) and output (spectra).

  Args:
      file_name (tf.Tensor): Record file name to get spectra.
      idx (tf.Tensor): Record index to get spectra.

  Returns:
      Tuple[np.ndarray, np.ndarray]: Input (spectra) and output (spectra).
  
  """
  # Decoding from the EagerTensor object
  file_name, idx = (file_name.numpy(), idx.numpy())

  # Decode bytes to str
  file_name = file_name.decode('utf-8')

  # Reading spectra from parser
  mzs, spectra = parsers[file_name].getspectrum(idx)

  # Get spectra in range 600-900
  sub_spectra = spectra[((mzs >= 600) & (mzs <= 900))]

  # Return spectra and spectra
  return (sub_spectra, sub_spectra)


def scale_spectra(
    x: tf.Tensor, y: tf.Tensor, min_spectra: np.ndarray, max_spectra: np.ndarray
) -> Tuple[np.ndarray, np.ndarray]:
  """Function to scale spectra.

  Args:
      x (tf.Tensor): Input (spectra)
      y (tf.Tensor): Output (spectra)
      min_spectra (np.ndarray): Min spectra to scale.
      max_spectra (np.ndarray): Max spectra to scale.

  Returns:
      Tuple[np.ndarray, np.ndarray]: Input (spectra) and output (spectra) after
        scaling.
  
  """
  # Scale spectras
  x_scaled = (x - min_spectra) / (max_spectra - min_spectra)
  y_scaled = (y - min_spectra) / (max_spectra - min_spectra)

  # Return the scaled input and output, clipped so that they stay between
  # 0 and 1
  return np.clip(x_scaled, 0, 1), np.clip(y_scaled, 0, 1)


def _fixup_shape(x: tf.Tensor, y: tf.Tensor) -> Tuple[tf.Tensor, tf.Tensor]:
  """Function to set the static shapes of the output tensors, which are
  unknown after mapping records through tf.py_function.

  Args:
      x (tf.Tensor): Input (spectra)
      y (tf.Tensor): Output (spectra)

  Returns:
      Tuple[tf.Tensor, tf.Tensor]: Input (spectra) and output (spectra) with
        correct shape.
  
  """
  x.set_shape([SPECTRA_DIM])
  y.set_shape([SPECTRA_DIM])
  return x, y


def create_ds(
    file_names: np.ndarray, indexes: np.ndarray, batch_size: int, shuffle: bool,
    min_max_spectra: Tuple[np.ndarray, np.ndarray] = None
) -> tf.data.Dataset:
  """Function to create a dataset for model

  Args:
      file_names (np.ndarray): File names of the dataset.
      indexes (np.ndarray): Indexes of the dataset.
      batch_size (int): Batch size.
      shuffle (bool): Flag to indicate if to shuffle or not.
      min_max_spectra (Tuple[np.ndarray,np.ndarray]): Min spectra and max
          spectra to apply scaling. Defaults to None (no scaling).

  Returns:
      tf.data.Dataset: Dataset
  
  """
  # Create dataset
  ds = tf.data.Dataset.from_tensor_slices((file_names, indexes))
  # Shuffle the data
  if shuffle:
    ds = ds.shuffle(len(file_names), seed=SEED)
  # Map record to model input
  ds = ds.map(
      lambda i, j: tf.
      py_function(func=map_record, inp=[i, j], Tout=[tf.float32, tf.float32])
  )
  # Scale record
  if min_max_spectra is not None:
    min_spectra = min_max_spectra[0]
    max_spectra = min_max_spectra[1]
    ds = ds.map(
        lambda i, j: tf.py_function(
            func=scale_spectra, inp=[i, j, min_spectra, max_spectra], Tout=
            [tf.float32, tf.float32]
        )
    )
  # Fix the implicit inferring of the shapes of the
  # output Tensors
  ds = ds.map(_fixup_shape)
  # Batch the spectra
  ds = ds.batch(batch_size)
  # Prefetch batches to make sure that a batch is ready to
  # be served at all times
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds
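
To make the scaling step concrete, here is a minimal NumPy sketch of the per-bin min-max formula used in `scale_spectra`, followed by clipping to `[0, 1]`. The helper name `min_max_scale` and the three-bin toy values are hypothetical and chosen only to show one in-range value, one below the training minimum and one above the training maximum:

```python
import numpy as np


def min_max_scale(x, min_spectra, max_spectra):
    """Scale each bin to [0, 1] using per-bin minima and maxima."""
    scaled = (x - min_spectra) / (max_spectra - min_spectra)
    # Values outside the fitted range are clipped to the interval edges
    return np.clip(scaled, 0, 1)


# Hypothetical per-bin training minima and maxima
min_spectra = np.array([0.0, 1.0, 2.0])
max_spectra = np.array([10.0, 5.0, 4.0])

# Hypothetical spectrum: in range, below the minimum, above the maximum
spectrum = np.array([5.0, 0.0, 6.0])

scaled = min_max_scale(spectrum, min_spectra, max_spectra)
print(scaled)
```

Clipping matters for validation and test spectra, whose intensities can fall outside the range seen during training. Note that this formula divides by zero for any bin whose training minimum equals its maximum; the sketch does not handle that case.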

### ***Dimension reduction model:***

Next, let's create a dimension reduction model:

In [9]:
class Sampling(layers.Layer):
  """Sampling layer for VAE, Uses (z_mean, z_log_var) to sample z
  (vector encoding).
  
  """

  def call(self, inputs: tf.Tensor) -> tf.Tensor:
    """Override of call method. Calls the model on new inputs and returns
    the outputs as tensors.
    
    Args:
        inputs (tf.Tensor): Model inputs.
    
    Returns:
        tf.Tensor: Model outputs.
    
    """
    # Unpack z_mean, z_log_var
    z_mean, z_log_var = inputs
    # Get batch size
    batch = tf.shape(z_mean)[0]
    # Get layer dimensions
    dim = tf.shape(z_mean)[1]
    # Sample noise from normal distribution
    epsilon = tf.random.normal(shape=(batch, dim))
    # Return re-parameterization
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon


class Encoder(Model):
  """Encoder for VAE.
  
  """

  def __init__(
      self, latent_dim: int, intermediate_dim: int, name: str = "encoder",
      **kwargs
  ) -> None:
    """Initialization method.
    
    Args:
        latent_dim (int): Encoder latent dimension size.
        intermediate_dim (int): Encoder intermediate dimension size.
        name (str, optional): Encoder name. Defaults to "encoder".
    
    """
    super(Encoder, self).__init__(name=name, **kwargs)
    self.dense_proj = layers.Dense(intermediate_dim)
    self.batch_norm_proj = layers.BatchNormalization()
    self.relu_proj = layers.ReLU()
    self.dense_mean = layers.Dense(latent_dim)
    self.batch_norm_mean = layers.BatchNormalization()
    self.dense_log_var = layers.Dense(latent_dim)
    self.batch_norm_log_var = layers.BatchNormalization()
    self.sampling = Sampling()

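  def call(
      self, inputs: tf.Tensor
  ) -> Tuple[tf.Tensor, tf.Tensor, tf.Tensor]:
    """Override of call method. Encodes the inputs and samples a latent
    vector.
    
    Args:
        inputs (tf.Tensor): Model inputs.
    
    Returns:
        Tuple[tf.Tensor, tf.Tensor, tf.Tensor]: Latent mean (z_mean), latent
          log variance (z_log_var) and sampled latent vector (z).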
    
    """
    # Intermediate layer
    h = self.dense_proj(inputs)
    h = self.batch_norm_proj(h)
    h = self.relu_proj(h)

    # Mean layer
    z_mean = self.dense_mean(h)
    z_mean = self.batch_norm_mean(z_mean)

    # Log var layer
    z_log_var = self.dense_log_var(h)
    z_log_var = self.batch_norm_log_var(z_log_var)

    # Sampling layer
    z = self.sampling((z_mean, z_log_var))
    return z_mean, z_log_var, z


class Decoder(Model):
  """Decoder for VAE.
  
  """

  def __init__(
      self, original_dim: int, intermediate_dim: int, name: str = "decoder",
      **kwargs
  ) -> None:
    """Initialization method.
    
    Args:
        original_dim (int): Decoder original dimension size.
        intermediate_dim (int): Decoder intermediate dimension size.
        name (str, optional): Decoder name. Defaults to "decoder".
    
    """
    super(Decoder, self).__init__(name=name, **kwargs)
    self.dense_proj = layers.Dense(intermediate_dim)
    self.batch_norm_proj = layers.BatchNormalization()
    self.relu_proj = layers.ReLU()
    self.dense_output = layers.Dense(original_dim, activation="sigmoid")

  def call(self, inputs: tf.Tensor) -> tf.Tensor:
    """Override of call method. Calls the model on new inputs and returns
    the outputs as tensors.
    
    Args:
        inputs (tf.Tensor): Model inputs.
    
    Returns:
        tf.Tensor: Model outputs.
    
    """
    # Intermediate layer
    h = self.dense_proj(inputs)
    h = self.batch_norm_proj(h)
    h = self.relu_proj(h)

    # Reconstruction layer
    outputs = self.dense_output(h)
    return outputs


class BNVAE(Model):
  """Batch Normalization VAE class.
  
  """

  def __init__(
      self, original_dim: int, intermediate_dim: int, latent_dim: int,
      name="autoencoder", **kwargs
  ) -> None:
    """Initialization method.
    
    Args:
        original_dim (int): AutoEncoder original dimension size.
        intermediate_dim (int): AutoEncoder intermediate dimension size.
        latent_dim (int): AutoEncoder latent dimension size.
        name (str, optional): AutoEncoder name. Defaults to "autoencoder".
    
    """
    super(BNVAE, self).__init__(name=name, **kwargs)
    self.encoder = Encoder(
        latent_dim=latent_dim, intermediate_dim=intermediate_dim
    )
    self.decoder = Decoder(original_dim, intermediate_dim=intermediate_dim)

  def call(self, inputs: tf.Tensor) -> tf.Tensor:
    """Override of call method. Calls the model on new inputs and returns
    the outputs as tensors.
    
    Args:
        inputs (tf.Tensor): Model inputs.
    
    Returns:
        tf.Tensor: Model outputs.
    
    """
    # Unpack z_mean, z_log_var, z
    z_mean, z_log_var, z = self.encoder(inputs)

    # Get decoder reconstruction
    reconstructed = self.decoder(z)

    # Add KL divergence regularization loss
    kl_loss = -0.5 * tf.reduce_mean(
        z_log_var - tf.square(z_mean) - tf.exp(z_log_var) + 1
    )
    self.add_loss(kl_loss)

    # Return decoder reconstructed output for reconstruction loss
    return reconstructed
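
The sampling step and the KL regularization term can be checked outside TensorFlow. The following NumPy sketch mirrors the arithmetic in `Sampling.call` and `BNVAE.call` under stated assumptions: the batch of encoder outputs is random toy data, and `kl_term` is a hypothetical helper name, not part of the notebook:

```python
import numpy as np

rng = np.random.default_rng(42)


def kl_term(z_mean, z_log_var):
    """KL term averaged over batch and latent dimensions."""
    return -0.5 * np.mean(
        z_log_var - np.square(z_mean) - np.exp(z_log_var) + 1
    )


# Hypothetical encoder outputs: 4 spectra, 5 latent variables each
z_mean = rng.normal(size=(4, 5))
z_log_var = rng.normal(size=(4, 5))

# Re-parameterization: z = mean + std * epsilon, std = exp(0.5 * log_var)
epsilon = rng.normal(size=z_mean.shape)
z = z_mean + np.exp(0.5 * z_log_var) * epsilon

# KL term for the toy batch, and for a posterior equal to the N(0, 1) prior
kl_loss = kl_term(z_mean, z_log_var)
kl_at_prior = kl_term(np.zeros((4, 5)), np.zeros((4, 5)))
print(z.shape, kl_loss)
```

The term is never negative and is zero exactly when `z_mean` is 0 and `z_log_var` is 0, which is what pulls the latent distribution toward a standard normal. Because the notebook averages with `reduce_mean` rather than summing over the latent dimensions, the term is scaled down relative to the textbook per-sample sum.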

### ***Section dimension reduction:***
Next, let's train a dimension reduction model on the section samples, leaving out one image (and every sample from its patients) per iteration:

In [10]:
# Define dict's to store train, validation and test metrics
train_metrics = {}
validation_metrics = {}
test_metrics = {}

# Flag for first iteration
first_iteration = True

# Loop over each image
for exclude_image, group in s_metadata_df.groupby("file_name"):
  # Clear graph
  K.clear_session()
  gc.collect()

  # Get all spectra in the exclude_image to exclude them - Leave one image out
  exclude_image_spectras = s_spectras_info["file_name"].isin(
      group.sample_file_name.to_list()
  )

  # Get all spectra that come from the same patients as those in
  # exclude_image - Leave one patient out
  exclude_patient_spectras = s_spectras_info["sample_number"].isin(
      group.sample_number.to_list()
  )

  # Create filter for training data - excludes the excluded image and any
  # samples from the same patients as the excluded image, and keeps only
  # tissue spectra
  train_filter = (
      (~(exclude_image_spectras | exclude_patient_spectras)) &
      s_spectras_info.is_tissue
  )

  # Create filter for test data - includes the excluded image
  # and keeps only tissue spectra
  test_filter = (exclude_image_spectras & s_spectras_info.is_tissue)

  # Filter training data
  s_spectras_info_train = s_spectras_info.loc[train_filter]

  # filter test data
  s_spectras_info_test = s_spectras_info.loc[test_filter]

  # Get train and validation set
  X_train, X_val = train_test_split(
      s_spectras_info_train[["file_name", "idx"]].to_numpy(), test_size=0.2,
      random_state=SEED
  )

  # Get test set
  X_test = s_spectras_info_test[["file_name", "idx"]].to_numpy()

  # Create train generator
  train_generator = create_ds(
      X_train[:, 0], X_train[:, 1].astype("int"), BATCH_SIZE, True
  )

  # Create min max scaler object and fit it on the training data
  scaler = MinMaxScaler()
  for batch in train_generator:
    batch = batch[0].numpy()
    scaler.partial_fit(batch)

  # Update train generator
  train_generator = create_ds(
      X_train[:, 0], X_train[:, 1].astype("int"), BATCH_SIZE, True,
      (scaler.data_min_, scaler.data_max_)
  )

  # Create validation generator
  validation_generator = create_ds(
      X_val[:, 0], X_val[:, 1].astype("int"), BATCH_SIZE, True,
      (scaler.data_min_, scaler.data_max_)
  )

  # Create test generator
  test_generator = create_ds(
      X_test[:, 0], X_test[:, 1].astype("int"), BATCH_SIZE, False,
      (scaler.data_min_, scaler.data_max_)
  )

  # Create Callback to save the best model
  checkpoint_filepath = os.path.join(
      MODELS_PATH, f"section_excluded_{exclude_image}"
  )
  model_checkpoint_callback = callbacks.ModelCheckpoint(
      filepath=checkpoint_filepath, save_weights_only=False, monitor="val_loss",
      mode="min", save_best_only=True
  )

  # Create Callback for model early stopping
  model_es_callback = callbacks.EarlyStopping(
      monitor='val_loss', mode='min', verbose=1, patience=5
  )

  # Create dimension reduction model
  dr_model = BNVAE(SPECTRA_DIM, INTERMEDIATE_DIM, LATENT_DIM)

  # Compile the model
  dr_model.compile(
      optimizers.Adam(learning_rate=LEARNING_RATE),
      loss=losses.MeanSquaredError()
  )

  # Train the model
  history = dr_model.fit(
      x=train_generator, validation_data=validation_generator, epochs=EPOCHS,
      callbacks=[model_checkpoint_callback, model_es_callback]
  )

  # Load the best saved model
  dr_model = tf.keras.models.load_model(checkpoint_filepath)

  # Evaluate on train, validation and test
  train_metrics[exclude_image] = dr_model.evaluate(x=train_generator)
  validation_metrics[exclude_image] = dr_model.evaluate(x=validation_generator)
  test_metrics[exclude_image] = dr_model.evaluate(x=test_generator)

  # Get embeddings for test set
  pred = dr_model.encoder.predict(test_generator)[2]
  embeddings = pd.DataFrame(
      pred, columns=[f"Var {i}" for i in range(LATENT_DIM)]
  )
  embeddings_info = s_spectras_info.loc[test_filter, :].copy()
  for col in embeddings.columns:
    embeddings_info[col] = embeddings[col].to_numpy()

  # Save embeddings
  if first_iteration:
    embeddings_info.to_csv(f"{LEVEL_3_PATH}/section.csv", index=False)
    first_iteration = False
  else:
    embeddings_info.to_csv(
        f"{LEVEL_3_PATH}/section.csv", mode='a', header=False, index=False
    )

  # Save scaler
  with open(
      os.path.join(MODELS_PATH, f"section_excluded_{exclude_image}_scaler.pkl"),
      'wb'
  ) as f:
    pickle.dump(scaler, f)

  # Clean model for next iteration
  dr_model = None

  # Separate training
  print("#" * 30)

Epoch 1/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 2/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 3/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 4/100
Epoch 5/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 6/100
Epoch 7/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 8/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 9/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 10/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 11/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\section_excluded_HG 1-s\assets
Epoch 12/10
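
The training and test filters in the loop above combine leave-one-image-out with leave-one-patient-out. Here is a minimal NumPy sketch of that logic on made-up data. The image names, patient numbers and tissue flags are hypothetical, and the held-out patients are derived from the spectra themselves rather than from the metadata file, which is a simplification:

```python
import numpy as np

# Hypothetical per-spectrum info: image name, patient number, tissue flag
file_name = np.array(["img_a", "img_a", "img_b", "img_b", "img_c", "img_c"])
sample_number = np.array([1, 1, 1, 1, 2, 2])
is_tissue = np.array([True, False, True, True, True, False])

# Hold out img_a; patient 1 appears in it, so img_b must leave training too
held_out_images = ["img_a"]
exclude_image = np.isin(file_name, held_out_images)
held_out_patients = np.unique(sample_number[exclude_image])
exclude_patient = np.isin(sample_number, held_out_patients)

# Train on tissue spectra from neither the held-out image nor its patients
train_filter = ~(exclude_image | exclude_patient) & is_tissue

# Test on tissue spectra from the held-out image only
test_filter = exclude_image & is_tissue
print(train_filter, test_filter)
```

Dropping the other images of a held-out patient keeps spectra from the same patient out of both sides of the split, so the test loss is measured on patients the model never saw.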

Next, let's view the loss metrics:

In [11]:
# Combine train, validation and test metrics for each excluded image
metrics = {
    key: (train_metrics[key], validation_metrics[key], test_metrics[key])
    for key in train_metrics
}

# Create data frame of the combined metrics
metrics_df = pd.DataFrame.from_dict(
    metrics, orient='index',
    columns=["train_loss", "validation_loss", "test_loss"]
)

# Save
metrics_df.to_csv(f"{FIGURES_PATH}/section_metrics.csv")

# Print metrics
metrics_df

| Excluded image | train_loss | validation_loss | test_loss |
|---|---|---|---|
| HG 1-s | 0.005906 | 0.0059 | 0.005154 |
| HG 11-11-12-s | 0.006258 | 0.006288 | 0.005155 |
| HG 14-13-s | 0.006085 | 0.006113 | 0.00393 |
| HG 16-15-s | 0.006054 | 0.006041 | 0.008031 |
| HG 19-18-s | 0.005917 | 0.005946 | 0.005085 |
| HG 29-25-23-21-20-s | 0.006148 | 0.006156 | 0.007338 |
| HG 6-7-s | 0.005911 | 0.005894 | 0.005846 |
| HG 8-12-5-4-3-2-s | 0.006931 | 0.006896 | 0.008832 |
| HG 9-10-s | 0.005888 | 0.005878 | 0.006234 |


### ***Replica dimension reduction:***
Next, let's train a dimension reduction model on the replica samples, using the same leave-one-image-out procedure:

In [12]:
# Define dict's to store train, validation and test metrics
train_metrics = {}
validation_metrics = {}
test_metrics = {}

# Flag for first iteration
first_iteration = True

# Loop over each image
for exclude_image, group in r_metadata_df.groupby("file_name"):
  # Clear graph
  K.clear_session()
  gc.collect()

  # Get all spectra in the exclude_image to exclude them - Leave one image out
  exclude_image_spectras = r_spectras_info["file_name"].isin(
      group.sample_file_name.to_list()
  )

  # Get all spectra that come from the same patients as those in
  # exclude_image - Leave one patient out
  exclude_patient_spectras = r_spectras_info["sample_number"].isin(
      group.sample_number.to_list()
  )

  # Create filter for training data - excludes the excluded image and any
  # samples from the same patients as the excluded image, and keeps only
  # tissue spectra
  train_filter = (
      (~(exclude_image_spectras | exclude_patient_spectras)) &
      r_spectras_info.is_tissue
  )

  # Create filter for test data - includes the excluded image
  # and keeps only tissue spectra
  test_filter = (exclude_image_spectras & r_spectras_info.is_tissue)

  # Filter training data
  r_spectras_info_train = r_spectras_info.loc[train_filter]

  # filter test data
  r_spectras_info_test = r_spectras_info.loc[test_filter]

  # Get train and validation set
  X_train, X_val = train_test_split(
      r_spectras_info_train[["file_name", "idx"]].to_numpy(), test_size=0.2,
      random_state=SEED
  )

  # Get test set
  X_test = r_spectras_info_test[["file_name", "idx"]].to_numpy()

  # Create train generator
  train_generator = create_ds(
      X_train[:, 0], X_train[:, 1].astype("int"), BATCH_SIZE, True
  )

  # Create min max scaler object and fit it on the training data
  scaler = MinMaxScaler()
  for batch in train_generator:
    batch = batch[0].numpy()
    scaler.partial_fit(batch)

  # Update train generator
  train_generator = create_ds(
      X_train[:, 0], X_train[:, 1].astype("int"), BATCH_SIZE, True,
      (scaler.data_min_, scaler.data_max_)
  )

  # Create validation generator
  validation_generator = create_ds(
      X_val[:, 0], X_val[:, 1].astype("int"), BATCH_SIZE, True,
      (scaler.data_min_, scaler.data_max_)
  )

  # Create test generator
  test_generator = create_ds(
      X_test[:, 0], X_test[:, 1].astype("int"), BATCH_SIZE, False,
      (scaler.data_min_, scaler.data_max_)
  )

  # Create Callback to save the best model
  checkpoint_filepath = os.path.join(
      MODELS_PATH, f"replica_excluded_{exclude_image}"
  )
  model_checkpoint_callback = callbacks.ModelCheckpoint(
      filepath=checkpoint_filepath, save_weights_only=False, monitor="val_loss",
      mode="min", save_best_only=True
  )

  # Create Callback for model early stopping
  model_es_callback = callbacks.EarlyStopping(
      monitor='val_loss', mode='min', verbose=1, patience=5
  )

  # Create dimension reduction model
  dr_model = BNVAE(SPECTRA_DIM, INTERMEDIATE_DIM, LATENT_DIM)

  # Compile the model
  dr_model.compile(
      optimizers.Adam(learning_rate=LEARNING_RATE),
      loss=losses.MeanSquaredError()
  )

  # Train the model
  history = dr_model.fit(
      x=train_generator, validation_data=validation_generator, epochs=EPOCHS,
      callbacks=[model_checkpoint_callback, model_es_callback]
  )

  # Load the best saved model
  dr_model = tf.keras.models.load_model(checkpoint_filepath)

  # Evaluate on train, validation and test
  train_metrics[exclude_image] = dr_model.evaluate(x=train_generator)
  validation_metrics[exclude_image] = dr_model.evaluate(x=validation_generator)
  test_metrics[exclude_image] = dr_model.evaluate(x=test_generator)

  # Get embeddings for test set
  pred = dr_model.encoder.predict(test_generator)[2]
  embeddings = pd.DataFrame(
      pred, columns=[f"Var {i}" for i in range(LATENT_DIM)]
  )
  embeddings_info = r_spectras_info.loc[test_filter, :].copy()
  for col in embeddings.columns:
    embeddings_info[col] = embeddings[col].to_numpy()

  # Save embeddings
  if first_iteration:
    embeddings_info.to_csv(f"{LEVEL_3_PATH}/replica.csv", index=False)
    first_iteration = False
  else:
    embeddings_info.to_csv(
        f"{LEVEL_3_PATH}/replica.csv", mode='a', header=False, index=False
    )

  # Save scaler
  with open(
      os.path.join(MODELS_PATH, f"replica_excluded_{exclude_image}_scaler.pkl"),
      'wb'
  ) as f:
    pickle.dump(scaler, f)

  # Clean model for next iteration
  dr_model = None

  # Separate training
  print("#" * 30)

Epoch 1/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 2/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 3/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 4/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 5/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 6/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 7/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 8/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 9/100
INFO:tensorflow:Assets written to: ./models/dimension_reduction\replica_excluded_HG 1-r\assets
Epoch 10/100
INFO:tensorflow:Assets w
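
Both training loops fit the scaler one batch at a time, so the full set of training spectra never has to sit in memory. The idea behind that incremental fit can be sketched in plain NumPy as a running per-bin minimum and maximum. This is a stand-in for illustration, not scikit-learn's implementation, and the toy array and batch size are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 spectra with 3 bins each
data = rng.random((10, 3))

# Running per-bin min and max, updated one batch at a time
data_min = np.full(3, np.inf)
data_max = np.full(3, -np.inf)
for start in range(0, len(data), 4):
    batch = data[start:start + 4]
    data_min = np.minimum(data_min, batch.min(axis=0))
    data_max = np.maximum(data_max, batch.max(axis=0))

# The batch-wise result matches a single pass over the full array
matches = (
    np.array_equal(data_min, data.min(axis=0)) and
    np.array_equal(data_max, data.max(axis=0))
)
print(matches)
```

Because minimum and maximum do not depend on the order of the data, shuffling the first training generator has no effect on the fitted range.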

Next, let's view the loss metrics:

In [13]:
# Combine train, validation and test metrics for each excluded image
metrics = {
    key: (train_metrics[key], validation_metrics[key], test_metrics[key])
    for key in train_metrics
}

# Create data frame of the combined metrics
metrics_df = pd.DataFrame.from_dict(
    metrics, orient='index',
    columns=["train_loss", "validation_loss", "test_loss"]
)

# Save
metrics_df.to_csv(f"{FIGURES_PATH}/replica_metrics.csv")

# Print metrics
metrics_df

| Excluded image | train_loss | validation_loss | test_loss |
|---|---|---|---|
| HG 1-r | 0.001821 | 0.001841 | 0.0015 |
| HG 12-11-r | 0.001745 | 0.001745 | 0.00305 |
| HG 14-13-r | 0.001928 | 0.001927 | 0.00119 |
| HG 16-15-r | 0.002014 | 0.002013 | 0.001506 |
| HG 18-19-18-r | 0.002551 | 0.002566 | 0.00115 |
| HG 29-25-23-21-20-r | 0.002077 | 0.002104 | 0.00328 |
| HG 6-6-7-r | 0.001785 | 0.001801 | 0.002905 |
| HG 8-5-4-3-2-r | 0.002104 | 0.00212 | 0.00255 |
| HG 9-10-r | 0.001829 | 0.001829 | 0.001915 |
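
One quick way to read the replica table is to compare each held-out image's test loss with its training loss. The sketch below copies the `train_loss` and `test_loss` columns from the table into plain dictionaries; the variable names are hypothetical, and in the notebook the same comparison could be made directly on `metrics_df`:

```python
from statistics import mean

# train_loss and test_loss per excluded image, copied from the replica table
train_loss = {
    "HG 1-r": 0.001821, "HG 12-11-r": 0.001745, "HG 14-13-r": 0.001928,
    "HG 16-15-r": 0.002014, "HG 18-19-18-r": 0.002551,
    "HG 29-25-23-21-20-r": 0.002077, "HG 6-6-7-r": 0.001785,
    "HG 8-5-4-3-2-r": 0.002104, "HG 9-10-r": 0.001829,
}
test_loss = {
    "HG 1-r": 0.0015, "HG 12-11-r": 0.00305, "HG 14-13-r": 0.00119,
    "HG 16-15-r": 0.001506, "HG 18-19-18-r": 0.00115,
    "HG 29-25-23-21-20-r": 0.00328, "HG 6-6-7-r": 0.002905,
    "HG 8-5-4-3-2-r": 0.00255, "HG 9-10-r": 0.001915,
}

# Gap between held-out and training loss for each excluded image
gap = {key: test_loss[key] - train_loss[key] for key in train_loss}

# Image with the largest gap, and the mean test loss across all images
largest_gap = max(gap, key=gap.get)
mean_test_loss = mean(test_loss.values())
print(largest_gap, mean_test_loss)
```

A large positive gap flags a held-out image that the model reconstructs noticeably worse than its training data, which may be worth inspecting before using its embeddings downstream.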


### ***MSI parsers closing:***

Next, let's close the MSI parsers:

In [None]:
# Closing parsers
for reader in parsers.values():
  if reader.m:
    reader.m.close()