<font size="4">
Use this notebook to train UNet_lite, defined in: <br> 
<code><font size="3.5">
Models/model_for_hls.py <br>
</font></code> The notebook is written for that model's different architecture and its use of the functional API. <br><br>
Change as required for your setup:
<ul>
  <li><code>gpu = gpus[#GPU]</code> Set #GPU to the index of the GPU to train on. Copies of this notebook can train on different GPUs simultaneously</li>
  <li><code>num_events = </code> Number of events to load into the .npy files and/or train on</li>
  <li><code>dataset = Dataset(..., save = True/False,...</code> Use True on the first run to create the .npy files, then False afterwards</li>
  <li><code>output_dir = </code> Directory where the trained model is saved</li>
  <li><code>modtype = </code> Choose from either 'UNet_lite' or 'UNet2d'</li>
  <li><code>strip_size = </code> Choose from either 'strip' or 'full_image'</li>
</ul>
</font>

In [1]:
from tqdm.notebook import tqdm # Library used to display progress bars for loops, making it easy to track the progress of an iteration
from data_processing import Dataset
from noise import NoiseScheduler
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import CosineDecay
from tensorflow.keras.losses import MeanSquaredError
from tensorflow.keras.utils import Sequence
from tensorflow.keras.callbacks import LearningRateScheduler, ModelCheckpoint
from pathlib import Path
import os
import random

# Set seed for reproducibility
seed = 22
np.random.seed(seed)
random.seed(seed)
tf.random.set_seed(seed)

# Select the GPU to train on
gpus = tf.config.experimental.list_physical_devices('GPU')
# Specify which GPU to use by changing the index. Copies of the notebook can run on different GPUs.
# Fall back to None when no GPU is found, so the CPU branch below is reachable.
gpu = gpus[0] if gpus else None
if gpu is not None:
    try:
        tf.config.experimental.set_memory_growth(gpu, True)
        tf.config.experimental.set_visible_devices(gpu, 'GPU')
        print("CUDA is available!")
        print("Number of available GPUs:", len(gpus))
        print("Current GPU:", gpu)
    except RuntimeError as e:
        print(e)
else:
    print("CUDA is not available. Running on CPU.")

2024-07-29 16:38:14.157501: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2024-07-29 16:38:14.215746: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.


CUDA is available!
Number of available GPUs: 1
Current GPU: PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')


2024-07-29 16:38:16.611111: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:967] could not open file to read NUMA node: /sys/bus/pci/devices/0000:01:00.0/numa_node
Your kernel may have been built without NUMA support.


In [2]:
# Set to directory where data is stored
work_home = True
data_dir = "Datasets" if work_home else "/cephfs/dice/projects/L1T/diffusion/datasets/"

num_events = 256 # Adjust number of events to train model here
start_idx = 0
end_idx = num_events
dataset = Dataset(
    num_events,
    (120, 72),
    signal_file=f"{data_dir}/CaloImages_signal.root",
    pile_up_file=f"{data_dir}/CaloImages_bkg.root",
    save=False,
    start_idx=start_idx,
    end_idx=end_idx,
)
# num_events: number of samples in the dataset (can be set to 10000)
# (120, 72): shape of each data sample (e.g. an image with dimensions 120x72)
# signal_file: file containing the signal data for the dataset
# pile_up_file: file containing the background/pile-up data for the dataset
# save=False: do not save the dataset to disk after creation (use save=True on the first run to create the .npy files)


In [3]:
dataset() # once this is cached, you don't have to re-load

INFO:root:Loading .npy files from /home/themrluke/projects/stablediffusion/keras_version/signal.npy and /home/themrluke/projects/stablediffusion/keras_version/pile_up.npy
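
The `save=True` on the first run, `save=False` afterwards rule can be decided automatically by checking whether the cached files already exist. The following is a minimal sketch only: the file names `signal.npy` and `pile_up.npy` are taken from the log message above, while the helper `needs_save` and the assumption that the cache sits in the current working directory are hypothetical and not part of `data_processing`.

```python
from pathlib import Path


def needs_save(cache_dir="."):
    """Return True when at least one of the cached .npy files is missing."""
    cache_dir = Path(cache_dir)
    # File names as they appear in the loader's log message; the directory is an assumption.
    expected = [cache_dir / "signal.npy", cache_dir / "pile_up.npy"]
    return not all(path.is_file() for path in expected)


save = needs_save()  # could be passed on as Dataset(..., save=save, ...)
print("save =", save)
```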


In [4]:
new_dim = (64, 64)  # Resize each data sample image to 64x64 resolution

In [5]:
# Change the saturation energy here.
# Pixels with an energy greater than this value (e.g. 16 or 64 etc.) are clipped to it when the tensors are built
saturation_value = 512
dataset.preprocess(new_dim)

INFO:root:Re-sizing tensors...


In [6]:
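# Change how much of the image to train the model on here:
#   'full_image' uses all 64 rows
#   'strip' extracts the horizontal strip from y=26 to y=38 (12 pixels tall)
strip_size = 'full_image'

if strip_size == 'full_image':
    y_start = 0
    y_end = 64

elif strip_size == 'strip':
    y_start = 26
    y_end = 38

else:
    raise ValueError(f"strip_size must be 'full_image' or 'strip', got {strip_size!r}")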

In [7]:
# Convert data to TensorFlow tensors
clean_frames = tf.convert_to_tensor(dataset.signal, dtype=tf.float32)[:, y_start:y_end, :]
pile_up = tf.convert_to_tensor(dataset.pile_up, dtype=tf.float32)[:, y_start:y_end, :]

# Clip energies to the range [0, saturation_value]
clean_frames = tf.clip_by_value(clean_frames, 0, saturation_value)
pile_up = tf.clip_by_value(pile_up, 0, saturation_value)

# Add a trailing channel dimension
clean_frames = tf.expand_dims(clean_frames, axis=-1)
pile_up = tf.expand_dims(pile_up, axis=-1)

print(clean_frames.shape)
print(clean_frames.dtype)
# The resulting order is (B, H, W, C)
# This matches the common image representation format where the last dimension is the number of channels (e.g., RGB)

(256, 64, 64, 1)
<dtype: 'float32'>
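
The clipping and channel-dimension steps can be checked on a toy array without loading the dataset. This is a small sketch using NumPy in place of TensorFlow (`np.clip` and `np.expand_dims` mirror `tf.clip_by_value` and `tf.expand_dims`); the toy batch and its pixel values are made up for illustration.

```python
import numpy as np

saturation_value = 512

# Toy batch: 2 events of 64 x 64 pixels, with one negative and one over-saturated pixel
frames = np.zeros((2, 64, 64), dtype=np.float32)
frames[0, 0, 0] = -3.0
frames[1, 5, 7] = 900.0

clipped = np.clip(frames, 0, saturation_value)   # energies now lie in [0, saturation_value]
with_channel = np.expand_dims(clipped, axis=-1)  # (B, H, W) -> (B, H, W, 1)

print(with_channel.shape)
print(clipped.min(), clipped.max())
```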



In [8]:
# Create a tf.data pipeline over the clean_frames tensor
# batch_size determines how many samples are processed together in each training iteration.
batch_size = 16
dataloader = (
    tf.data.Dataset.from_tensor_slices(clean_frames)
    .shuffle(buffer_size=len(clean_frames))
    .batch(batch_size)
    .prefetch(buffer_size=tf.data.AUTOTUNE)
)
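
The number of batches per epoch follows directly from `num_events` and `batch_size`, and it is the total shown by the training progress bar (`0/16`). A quick arithmetic sketch; `num_epochs = 10` is a hypothetical value, since the real one comes from `config.num_epochs`.

```python
import math

num_events = 256
batch_size = 16

# Without drop_remainder, a final partial batch still counts as one step
steps_per_epoch = math.ceil(num_events / batch_size)

num_epochs = 10  # hypothetical; the notebook reads this from config.num_epochs
total_steps = steps_per_epoch * num_epochs

print(steps_per_epoch, total_steps)
```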


In [9]:
# Import the UNet model (this Model replaces the tensorflow.keras Model imported in the first cell)
from Models.model_for_hls import Model, TrainingConfig

modtype = 'UNet_lite'  # Change Model type here
model = Model(modtype, new_dim)

config = TrainingConfig(output_dir='trained_models_lite/temp')
model = model.__getitem__(batch_size=batch_size)


model.summary()  # summary() prints the table itself and returns None

Model: "UNetLite_hls"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
 pos_encoding_main (InputLayer)  [(None, 64, 64, 4)]  0          []                               
                                                                                                  
 input_images (InputLayer)      [(None, 64, 64, 1)]  0           []                               
                                                                                                  
 emb1 (Dense)                   (None, 64, 64, 1)    5           ['pos_encoding_main[0][0]']      
                                                                                                  
 add (Add)                      (None, 64, 64, 1)    0           ['input_images[0][0]',           
                                                                  'emb1[0][0]']        

In [10]:
# Define learning rate schedule
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=config.learning_rate,
    decay_steps=len(dataloader) * config.num_epochs,
    alpha=0.0
)

# Define optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)

loss_fn = tf.keras.losses.MeanSquaredError()

# Compile the model (this step is to set up the internal state correctly)
model.compile(optimizer=optimizer, loss=loss_fn)
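
For reference, the cosine decay rule (without warmup) can be written out in a few lines to see how the learning rate evolves over training. This is a plain-Python sketch of the formula, not the Keras implementation itself; `initial_lr = 1e-4` and the 10 epochs are hypothetical stand-ins for `config.learning_rate` and `config.num_epochs`.

```python
import math


def cosine_decay(step, initial_lr, decay_steps, alpha=0.0):
    """Learning rate after `step` optimizer steps under cosine decay."""
    step = min(step, decay_steps)  # the rate stays at its floor after decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / decay_steps))
    return initial_lr * ((1.0 - alpha) * cosine + alpha)


initial_lr = 1e-4      # hypothetical stand-in for config.learning_rate
decay_steps = 16 * 10  # 16 batches per epoch times a hypothetical 10 epochs

for step in (0, decay_steps // 2, decay_steps):
    print(step, cosine_decay(step, initial_lr, decay_steps))
```

With `alpha=0.0` the rate starts at `initial_lr`, passes through half of it at the midpoint, and reaches zero at `decay_steps`.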

In [11]:
from Models.model_for_hls import positional_encoding

@tf.function(reduce_retracing=True)
def train_step(model, optimizer, noisy_images, noise_added, timestep, pos_encoding, pos_encoding_bottleneck, loss_fn, saturation_value, modtype):
    # Clip the noisy images to the range [0, saturation_value]
    noisy_images = tf.clip_by_value(noisy_images, 0, saturation_value)

    with tf.GradientTape() as tape:
        # Reshape timestep to a column of shape (N, 1)
        timestep = tf.reshape(timestep, [-1, 1])

        # Predict the noise residual
        if modtype == 'UNet2d':
            noise_pred = model([noisy_images, timestep, pos_encoding, pos_encoding_bottleneck], training=True)[0]
        elif modtype == 'UNet_lite':
            noise_pred = model([noisy_images, timestep, pos_encoding, pos_encoding_bottleneck], training=True)
        else:
            raise ValueError(f"modtype must be 'UNet2d' or 'UNet_lite', got {modtype!r}")

        # Compute the loss between the noise that was added and the predicted noise
        loss = loss_fn(noise_added, noise_pred)
    
    # Compute gradients
    grads = tape.gradient(loss, model.trainable_weights)
    
    # Apply gradients
    optimizer.apply_gradients(zip(grads, model.trainable_weights))
    
    return loss


def train_loop(config, model, noise_sample, optimizer, train_dataloader, noise_scheduler, n_events, loss_fn, saturation_value, modtype):

    global_step = 0  # Counter to keep track of the number of steps taken during training
    
    # Loop over epochs
    for epoch in range(config.num_epochs):
        progress_bar = tqdm(total=len(train_dataloader))
        progress_bar.set_description(f"Epoch {epoch}")

        # Iterate over each batch in the training DataLoader
        for step, batch in enumerate(train_dataloader):
            batch_size = batch.shape[0]  # Determine batch size dynamically
            timestep = tf.random.uniform((), minval=0, maxval=config.num_train_timesteps, dtype=tf.int32)
            
            random_seed = np.random.randint(0, n_events)
            
            noisy_images, noise_added = noise_scheduler.add_noise(
                clean_frame=batch, 
                noise_sample=noise_sample, 
                timestep=timestep, 
                random_seed=random_seed, 
                n_events=n_events
            )
            
            # Compute positional encodings
            pos_encoding = positional_encoding(timestep, batch_size, (64, 64), 4, 5000)
            pos_encoding_bottleneck = positional_encoding(timestep, batch_size, (32, 32), 4, 5000)
            
            # Perform the training step
            loss = train_step(model, optimizer, noisy_images, noise_added, timestep, pos_encoding, pos_encoding_bottleneck, loss_fn, saturation_value, modtype)
            
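            # Update progress bar
            progress_bar.update(1)
            logs = {"loss": loss.numpy(), "lr": optimizer.learning_rate.numpy(), "step": global_step}
            progress_bar.set_postfix(**logs)
            global_step += 1

        # Close this epoch's progress bar before the next one is created
        progress_bar.close()

        # Save the model after each epoch, creating the output directory if it does not exist
        os.makedirs(config.output_dir, exist_ok=True)
        model.save(os.path.join(config.output_dir, f"model_epoch_{epoch}.h5"))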
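
The quantity minimised in `train_step` is the mean squared error between the noise that was added and the noise the model predicts. The NumPy sketch below illustrates that loss on toy data only: the Gaussian random values stand in for the pile-up noise, and `mse` is a hypothetical helper that averages over every element, which gives the same number as the default Keras reduction when all batches have equal size.

```python
import numpy as np


def mse(target, prediction):
    """Mean squared error averaged over every element of the batch."""
    return float(np.mean((target - prediction) ** 2))


rng = np.random.default_rng(22)
# Stand-in for noise_added, with the (B, H, W, C) shape used in this notebook
noise_added = rng.normal(size=(16, 64, 64, 1)).astype(np.float32)

perfect_pred = noise_added.copy()       # a model that recovers the noise exactly
zero_pred = np.zeros_like(noise_added)  # a model that predicts no noise at all

print(mse(noise_added, perfect_pred))
print(mse(noise_added, zero_pred))
```

A perfect prediction gives a loss of zero, while predicting no noise gives the mean squared magnitude of the added noise.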

In [12]:
# Running the training loop
train_loop(config, model, pile_up, optimizer, dataloader, NoiseScheduler('pile-up'), num_events, loss_fn, saturation_value, modtype)

  0%|          | 0/16 [00:00<?, ?it/s]

2024-07-29 16:38:18.936570: I tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:630] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
2024-07-29 16:38:18.946174: I tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:428] Loaded cuDNN version 8902
2024-07-29 16:38:19.002944: I tensorflow/tsl/platform/default/subprocess.cc:304] Start cannot spawn child process: Permission denied
2024-07-29 16:38:19.272483: I tensorflow/compiler/xla/service/service.cc:173] XLA service 0x7fd9280d1650 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2024-07-29 16:38:19.272510: I tensorflow/compiler/xla/service/service.cc:181]   StreamExecutor device (0): NVIDIA GeForce RTX 4070 Ti, Compute Capability 8.9
2024-07-29 16:38:19.275168: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-07-29 16:38:19.321088: I 
