# Assignment 1: Metrics & Experiment Tracking

## Overview 
In this assignment, you will train your first neural network for traffic sign classification and evaluate its performance using `Batch 1`.
Unzip the corresponding batch folder before you proceed. The password is "Origin".

Complete the tasks below. 

**Note:** Make sure you hand in all Python files that you changed or created.

Learning goals:

* Get familiar with the existing code base and demonstrate the programming skills necessary for this course.
* Understand the scope and importance of performance evaluation, metrics and experiment tracking when building a machine learning model.
* Discuss different metrics for model evaluation, their scope and limitations, and justify your choices.

In [None]:
# imports
import sys

sys.path.append("..")

from pathlib import Path
import mlflow
import torch.nn as nn

from eval import evaluate, load_and_transform_data, get_sign_names

ROOT_DIR = Path().cwd().parent
BATCH_DIR = ROOT_DIR / "safetyBatches" / "Batch_1"

# Use a local SQLite database file as the MLflow tracking store
mlflow.set_tracking_uri(uri=f"sqlite:///{ROOT_DIR / 'mlruns.db'}")

In [None]:
# TODO: replace the placeholder with the MLflow run ID of your trained model
RUN_ID = "your_run_id"

## Task 1: Track validation results
Below is a starting point for validating your model's performance with the `evaluate` function.
So far, only the training metrics are tracked with MLflow. Extend `evaluate` so that it also logs the evaluation results to the existing MLflow run.
You can find the MLflow tracking documentation here: https://mlflow.org/docs/latest/tracking.html
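
One possible way to attach evaluation results to an already existing run is to reopen it by its ID and log into it. The following is a minimal sketch, not the reference solution: `prefix_metrics` and `log_eval_metrics` are hypothetical helper names, and the `val_` prefix is an assumed naming convention for separating evaluation metrics from training metrics. It relies on `mlflow.start_run(run_id=...)` to resume a run and `mlflow.log_metrics` to record a dictionary of scalar values.

```python
def prefix_metrics(metrics: dict[str, float], prefix: str = "val_") -> dict[str, float]:
    """Return a copy of `metrics` with `prefix` prepended to every key.

    Values are converted to float, since MLflow metrics are numeric scalars.
    """
    return {f"{prefix}{name}": float(value) for name, value in metrics.items()}


def log_eval_metrics(run_id: str, metrics: dict[str, float]) -> None:
    """Reopen the existing MLflow run `run_id` and log evaluation metrics to it."""
    # Imported here so that prefix_metrics stays usable without MLflow installed.
    import mlflow

    # Passing run_id resumes the existing run instead of creating a new one.
    with mlflow.start_run(run_id=run_id):
        mlflow.log_metrics(prefix_metrics(metrics))
```

Inside `evaluate`, such a helper could be called as `log_eval_metrics(RUN_ID, {"loss": avg_loss, "accuracy": acc})`, where `avg_loss` and `acc` are placeholders for whatever values your implementation computes.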

In [None]:
model_uri = f"runs:/{RUN_ID}/model"
loaded_model = mlflow.pytorch.load_model(model_uri)

criterion = nn.CrossEntropyLoss()

batch_loader = load_and_transform_data(data_directory_path=str(BATCH_DIR))

# TODO: track evaluation results
predictions = evaluate(loaded_model, criterion, batch_loader)

# Output incorrect classifications
ground_truth = []
for _, target in batch_loader:
    ground_truth.extend(target.tolist())
sign_names = get_sign_names()
wrong_predictions_idx = [
    idx for idx, (y_pred, y) in enumerate(zip(predictions, ground_truth)) if y_pred != y
]
for idx in wrong_predictions_idx:
    print(
        f"Traffic sign {sign_names[ground_truth[idx]]} incorrectly classified as {sign_names[predictions[idx]]}"
    )

## Task 2: Accuracy is all you need?
2.1 At the moment, accuracy is the only performance measure that is tracked.
Is accuracy a suitable measure for evaluating machine learning algorithms in safety-critical applications?

==> Please answer the question (yes/no) here and briefly explain the reasons for your decision. 

2.2 Think of how you can further improve your validation pipeline by tracking additional measures **and** actually track them (implementation).

==> Please briefly explain the reasons for your decision on which additional metrics to track. 
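
As one illustration of an additional measure, per-class precision, recall, and F1 can be derived directly from two label lists such as `predictions` and `ground_truth` from Task 1. The sketch below is only one option among many and is not meant as the expected answer; `per_class_metrics` is a hypothetical helper name, and the toy labels are made up for demonstration. Libraries such as scikit-learn provide comparable functionality (for example `sklearn.metrics.precision_recall_fscore_support`).

```python
from collections import Counter


def per_class_metrics(y_true: list[int], y_pred: list[int]) -> dict[int, dict[str, float]]:
    """Compute precision, recall, and F1 for every class in the two label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    actual = Counter(y_true)     # how often each class really occurs
    predicted = Counter(y_pred)  # how often each class was predicted
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    results = {}
    for cls in sorted(set(actual) | set(predicted)):
        # A metric with a zero denominator is reported as 0.0 here (a convention).
        precision = true_pos[cls] / predicted[cls] if predicted[cls] else 0.0
        recall = true_pos[cls] / actual[cls] if actual[cls] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return results


# Toy example: 9 of 10 predictions are correct, yet class 1 is never detected.
y_true = [0] * 9 + [1]
y_pred = [0] * 10
metrics = per_class_metrics(y_true, y_pred)
print(metrics[1]["recall"])  # 0.0
```

The resulting values are plain floats, so they could be logged to MLflow under per-class metric names of your choosing.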

## Task 3: Evaluate the performance of your model
Given the results of the two previous tasks, how would you assess the performance of your model so far?
The following questions may serve as inspiration:

- How does your model really perform? 
- How confident are you about the performance of your model at this point?
- Are there any issues with the given batch that you have not taken into consideration yet? If so, how can you tackle them?

==> Please answer the question here.
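
When looking for issues with a batch, one simple check is how many samples each class contributes. The sketch below counts the samples per class and lists classes that do not occur at all. `class_support` and `missing_classes` are hypothetical helper names, and the toy labels and the class count of 4 are made up; in this notebook, `ground_truth` could serve as the label list and `len(sign_names)` as the number of classes, assuming `get_sign_names()` returns a sized collection.

```python
from collections import Counter


def class_support(labels: list[int], num_classes: int) -> dict[int, int]:
    """Count the samples per class ID in range(num_classes), including zero counts."""
    counts = Counter(labels)
    return {cls: counts.get(cls, 0) for cls in range(num_classes)}


def missing_classes(labels: list[int], num_classes: int) -> list[int]:
    """Return the class IDs that never appear in `labels`."""
    return [cls for cls, n in class_support(labels, num_classes).items() if n == 0]


# Toy labels with 4 assumed classes; class 2 never occurs.
labels = [0, 0, 0, 1, 3, 3]
print(class_support(labels, 4))    # {0: 3, 1: 1, 2: 0, 3: 2}
print(missing_classes(labels, 4))  # [2]
```

Metrics computed on a batch say nothing about classes with zero samples, and very little about classes with only a handful.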