# STT Evaluation Metrics Visualization Notebook

| **Questions**| **Answers**|
|---------------------------------------------------|-|
| What is the difference between Ubuntu and UBI, and is it worth switching? | |
| How big are the decompressed images with embedded models? | |
| What is the security posture (CVE report) of these model images (packages vs. model)? | |
| What is the role of ground truth and overall model evaluation (simple vs. complex data)? | |
| What are the performance gains of a warm start vs. a cold start? | |
| What can the model autodetect (FP, Language, Task)? | |
| What are the performance gains from basic inference flags? | |
| The impact of advanced decoding arguments on speed and quality. | |
| Do advanced arguments / hyperparameters improve accuracy? | |
| How do you measure accuracy for audio models (experiment vs. production)? | |
| Container placement on CPU and GPU.                | |
| The importance of scheduling batch jobs, parallel processing and saturation. | |
| Making data driven decisions from experiments.     | |
| Which combinations of model size, processor, argument flags, and start type performed best on different audio samples? | |
| Should you target CPU, GPU, or both? | |
| Should you always have models warm?                | |
| What was the fastest transcription?                | |
| What was the slowest transcription?                | |
| What was the most accurate on complex audio files? | |
| What are useful metrics?                           | |

## Setup

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Set the Seaborn theme (set_theme is the current name for sns.set)
sns.set_theme(style="whitegrid")


## Load Data

In [None]:
aiml_df = pd.read_csv('data/metrics/g6-12xlarge/ubuntu/aiml_functional_metrics.csv')
system_df = pd.read_csv('data/metrics/g6-12xlarge/ubuntu/system_non_functional_metrics.csv')
#image_df = pd.read_csv('image_sizes.csv')


## Explore Functional Metrics

In [None]:
# Display the first few rows
aiml_df.head()

## Visualize Functional Metrics

In [None]:
# Plot tokens per second by model, with one bar per mode
plt.figure(figsize=(10, 6))
sns.barplot(data=aiml_df, x='model', y='tokens_per_second', hue='mode')
plt.title('Tokens per Second by Model and Mode')
plt.ylabel('Tokens per Second')
plt.xlabel('Model')
plt.legend(title='Mode')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()


What is the difference between Ubuntu and UBI, and is it worth switching?

How many experiments were run, with how many different models, on how many different audio samples?
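
One way to answer this from the functional metrics is to count rows and distinct values. The sketch below assumes one row per experiment run; the `audio_file` column name and the toy rows are hypothetical stand-ins, since only the `model`, `mode`, and `tokens_per_second` columns are confirmed by the plot above.

```python
import pandas as pd

# Toy stand-in for aiml_df. The rows and the `audio_file` column are
# illustrative assumptions, not values from the real metrics CSV.
toy_df = pd.DataFrame(
    {
        "model": ["tiny", "tiny", "base", "base", "base"],
        "audio_file": ["a.wav", "b.wav", "a.wav", "b.wav", "c.wav"],
    }
)


def summarize_experiments(df, model_col="model", audio_col="audio_file"):
    """Return the run count plus the number of distinct models and audio samples."""
    return {
        "experiments": len(df),  # assumes one row per experiment run
        "models": df[model_col].nunique(),
        "audio_samples": df[audio_col].nunique(),
    }


summary = summarize_experiments(toy_df)
print(summary)
```

On the real data this would be called as `summarize_experiments(aiml_df, audio_col=...)`, with whichever column actually identifies the audio sample.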

How big are the decompressed images with embedded models?

What is the role of ground truth and overall model evaluation (simple vs. complex data)?

What are the performance gains of a warm start vs. a cold start?
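
A simple way to quantify this is the ratio of mean cold-start time to mean warm-start time per model. The following is a minimal sketch on toy data; the `start_type` and `wall_time_s` column names, and the `cold`/`warm` labels, are assumptions and may not match the real CSV (the start type may, for instance, be encoded in the `mode` column).

```python
import pandas as pd

# Toy data only. Column names and timings are hypothetical.
toy_df = pd.DataFrame(
    {
        "model": ["tiny", "tiny", "base", "base"],
        "start_type": ["cold", "warm", "cold", "warm"],
        "wall_time_s": [8.0, 2.0, 12.0, 4.0],
    }
)


def warm_start_speedup(df, group_col="model", start_col="start_type",
                       time_col="wall_time_s"):
    """Mean time per start type for each group, plus the cold/warm ratio."""
    mean_times = df.pivot_table(
        index=group_col, columns=start_col, values=time_col, aggfunc="mean"
    )
    # A value above 1 means the warm start was faster on average.
    mean_times["speedup"] = mean_times["cold"] / mean_times["warm"]
    return mean_times


speedup = warm_start_speedup(toy_df)
print(speedup)
```

Averaging before dividing keeps a single outlier run from dominating the ratio; a median (`aggfunc="median"`) would be a reasonable alternative for noisy timings.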

What can the model autodetect (FP, Language, Task)?

What are the performance gains from basic inference flags?

The impact of advanced decoding arguments on speed and quality.

Do advanced arguments / hyperparameters improve accuracy?

How do you measure accuracy for audio models (experiment vs. production)?
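
In an experiment, a ground-truth transcript is available, so a reference-based metric can be computed. Word error rate (WER) is a common choice for speech-to-text; this notebook does not state which metric the experiments used, so the sketch below is an illustration rather than the method behind the CSVs. In production there is usually no reference transcript, so a metric like this cannot be computed directly. The normalization here (lowercasing and whitespace splitting) is a simplifying assumption.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


wer_example = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
print(wer_example)
```

WER can exceed 1.0 when the hypothesis contains many inserted words, so it should be read as an error ratio rather than a percentage bounded at 100.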

Which combinations of model size, processor, argument flags, and start type performed best on different audio samples?

Should you target CPU, GPU, or both?

Should you always have models warm?

What was the fastest transcription?

What was the slowest transcription?
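
Both the fastest and the slowest run can be read off with `idxmin` and `idxmax` on a timing column. This is a sketch on toy data; `wall_time_s` is a hypothetical column name and the rows are made up for illustration.

```python
import pandas as pd

# Toy data only. The timing column name and values are hypothetical.
toy_df = pd.DataFrame(
    {
        "model": ["tiny", "base", "large"],
        "mode": ["gpu", "gpu", "cpu"],
        "wall_time_s": [1.5, 4.0, 12.0],
    }
)


def fastest_and_slowest(df, time_col="wall_time_s"):
    """Return the full rows with the smallest and largest transcription time."""
    fastest = df.loc[df[time_col].idxmin()]
    slowest = df.loc[df[time_col].idxmax()]
    return fastest, slowest


fastest, slowest = fastest_and_slowest(toy_df)
print(fastest["model"], fastest["wall_time_s"])
print(slowest["model"], slowest["wall_time_s"])
```

Returning the whole row, rather than just the time, keeps the model, mode, and any flag columns attached to the result, which is what the question is really asking about.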

What was the most accurate on complex audio files?