# Visualizing run logs, metrics and cost

This notebook explores the logs and checkpoints generated during training.

Some of the logging is optional, especially during full-scale training runs. All logs follow the same version naming convention, so the files for a specific run are easy to retrieve.

In [13]:
import json
import os
import pandas as pd
import plotly.graph_objects as go

## Available logs

In [14]:
ckpts = os.listdir("../checkpoints")
lightning_logs = os.listdir("../logs/lightning_logs")
perf_logs = os.listdir("../logs/perf")

Checkpoints are .ckpt files that hold the model weights. To use these files, we can load them directly into our custom LightningModules with [.load_from_checkpoint](https://lightning.ai/docs/pytorch/stable/common/checkpointing_basic.html#lightningmodule-from-checkpoint).

Here are our available checkpoints:

In [15]:
ckpts

[]

PyTorch Lightning integrates with several experiment managers to log training metrics. My personal favorite, and perhaps the simplest to use to enable custom visualizations, is [CSVLogger](https://lightning.ai/docs/pytorch/stable/extensions/generated/lightning.pytorch.loggers.CSVLogger.html).
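As a rough sketch of why CSV output is convenient: CSVLogger writes a `metrics.csv` file per run, with one row per logging call and blank cells for any metric that was not logged at that step. The helper below reads one metric out of such a file using only the standard library. `read_metric` is a hypothetical name, and the metric column names depend on what your LightningModule logs, so treat this as an illustration rather than the project's own loader.

```python
import csv


def read_metric(csv_path, metric, step_column="step"):
    """Return (step, value) pairs for one metric, skipping rows where it is blank."""
    points = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            raw = row.get(metric)
            # Rows written for other metrics leave this cell empty.
            if raw is None or raw == "":
                continue
            points.append((int(row[step_column]), float(raw)))
    return points
```

The resulting pairs can be unzipped into `x` and `y` lists and passed to a `go.Scatter` trace for plotting.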

The logs saved during training runs are shown below:

In [16]:
lightning_logs

['google_bert_uncased_L-8_H-512_A-8_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T14_28_33.351315',
 'google_bert_uncased_L-4_H-512_A-8_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T14_03_54.835687',
 'google_bert_uncased_L-12_H-768_A-12_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T15_13_11.277592']
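Because every run name follows the same pattern, its fields can be recovered with a regular expression. The sketch below is inferred only from the names listed above: it assumes `LR` is the learning rate, `BS` the batch size, and `ML` a maximum sequence length, and it leaves the model and device names joined because the name alone does not say where one ends and the other begins. `parse_run_name` is a hypothetical helper, not part of the project.

```python
import re
from datetime import datetime

# Pattern inferred from the run names in this notebook (an assumption, not a spec).
RUN_NAME_PATTERN = re.compile(
    r"^(?P<model_and_device>.+)"
    r"_LR(?P<lr>[^_]+)"
    r"_BS(?P<batch_size>\d+)"
    r"_ML(?P<max_length>\d+)"
    r"_(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}_\d{2}_\d{2}\.\d+)"
    r"(?:\.json)?$"  # perf logs carry a .json suffix
)


def parse_run_name(name):
    """Split a run name into its hyperparameters and start timestamp."""
    match = RUN_NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"Unrecognized run name: {name!r}")
    return {
        "model_and_device": match["model_and_device"],
        "lr": float(match["lr"]),
        "batch_size": int(match["batch_size"]),
        "max_length": int(match["max_length"]),
        "timestamp": datetime.strptime(match["timestamp"], "%Y-%m-%dT%H_%M_%S.%f"),
    }
```

Calling `parse_run_name` on each entry of `lightning_logs` gives one dictionary per run, which can be passed to `pd.DataFrame` to filter runs by hyperparameter.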

To get a basic sense of training times, `utils.py` provides a `log_perf` function. It saves a simple JSON file with some basic metrics on the training run, notably the CPU or GPU type and the time it took to complete a minimum number of epochs.

These files were created by passing the `--perf` flag to `trainer.py`:

```sh
python trainer.py --perf=True
```

This runs the Trainer and times the run, which helps to decide which machine to use before starting a longer run.
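The actual `log_perf` implementation lives in `utils.py` and is not shown in this notebook. As a minimal sketch of the idea only, the hypothetical helper below times an arbitrary callable and writes the result under a top-level `perf` key, which is the layout the loading code later in this notebook reads. The function name, its arguments, and the metadata fields are all assumptions.

```python
import json
import time
from pathlib import Path


def timed_run(run_name, train_fn, out_dir, **metadata):
    """Time train_fn() and save the runtime plus metadata as <run_name>.json."""
    start = time.perf_counter()
    train_fn()  # stand-in for the real training call
    runtime_min = (time.perf_counter() - start) / 60

    # Nest everything under "perf" (assumed to match the project's log format).
    record = {"perf": {**metadata, "runtime_min": runtime_min}}

    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    log_file = out_path / f"{run_name}.json"
    log_file.write_text(json.dumps(record, indent=2))
    return log_file
```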

Logs for the example runs are:

In [17]:
perf_logs

['google_bert_uncased_L-8_H-512_A-8_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T14_28_33.351315.json',
 'google_bert_uncased_L-4_H-512_A-8_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T14_03_54.835687.json']

## Examining training times

First, let's read in the files:

In [23]:
all_perfs = []

for run in perf_logs:
    with open(f"../logs/perf/{run}", 'r') as f:
        data = json.load(f)
        data['perf']['run'] = run
        all_perfs.append(data['perf'])
    
perf_df = pd.DataFrame(all_perfs)
perf_df.set_index("run", inplace=True)
perf_df.sort_values("runtime_min", inplace=True)
perf_df

| run | device_name | num_node | num_devices: | strategy | precision | epochs | global_step | max_epochs | min_epochs | batch_size | runtime_min |
|---|---|---|---|---|---|---|---|---|---|---|---|
| google_bert_uncased_L-4_H-512_A-8_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T14_03_54.835687.json | Tesla T4 | 1 | 1 | SingleDeviceStrategy | 16-mixed | 5 | 10600 | 5 | 0 | 64 | 24.438696 |
| google_bert_uncased_L-8_H-512_A-8_Tesla T4_LR3e-05_BS64_ML256_2025-03-31T14_28_33.351315.json | Tesla T4 | 1 | 1 | SingleDeviceStrategy | 16-mixed | 5 | 10600 | 5 | 0 | 64 | 44.514956 |
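Total runtime is easier to compare across machines and models once it is normalized. The sketch below derives seconds per optimizer step and the relative slowdown between two runs from the `runtime_min` and `global_step` fields. The two helper functions are hypothetical, and the numbers are copied from the two perf logs shown above.

```python
def seconds_per_step(perf):
    """Average wall-clock seconds per global step for one perf record."""
    return perf["runtime_min"] * 60 / perf["global_step"]


def slowdown(perf, baseline):
    """How many times longer perf took than baseline."""
    return perf["runtime_min"] / baseline["runtime_min"]


# Values copied from the perf logs for the L-4 and L-8 BERT runs.
bert_l4 = {"runtime_min": 24.438696, "global_step": 10600}
bert_l8 = {"runtime_min": 44.514956, "global_step": 10600}

print(f"L-4: {seconds_per_step(bert_l4):.3f} s/step")
print(f"L-8: {seconds_per_step(bert_l8):.3f} s/step")
print(f"L-8 took {slowdown(bert_l8, bert_l4):.2f}x as long as L-4")
```

The same functions work on each dictionary in `all_perfs`, since those records carry the same keys.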


Then we can find the run with the shortest training time:

In [None]:
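# perf_df is sorted by runtime_min in ascending order, so the first row is the fastest run
best_run_perf = perf_df.iloc[0]
display(best_run_perf)

# perf_df has no 'model' column; the run name (the index label) begins with the
# model name, followed by "_<device_name>_", so split on the device name to recover it
model_name = best_run_perf.name.split(f"_{best_run_perf['device_name']}_")[0]
print(f"The run with the fastest training time used the {model_name} model and a {best_run_perf['device_name']} device.")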
