# Evaluating Traffic Sign Recognition Pipelines

---

This notebook is part of https://github.com/risc-mi/atsd.

This notebook demonstrates how detection and classification models trained on ATSD-Scenes and ATSD-Signs can be evaluated by calculating class-wise average precision (AP) and mean average precision (mAP).

## Package Imports

In [1]:
from pathlib import Path
import pandas as pd
from util import evaluator

## Paths

Set `ROOT` to the directory where ATSD-Scenes is located, i.e., the directory containing the folders `"/train"` and `"/test"`.

In [2]:
ROOT = Path('path/to/atsd-scenes')

## Load Ground Truth and Recognition Results

Load ground truth:

In [3]:
annotations = pd.read_csv(ROOT / 'test/meta_test.csv', index_col=0)

Drop categories not used for training the detection models:

In [4]:
annotations = annotations[~annotations['class_id'].str[:2].isin(('09', 'xx'))]

In [5]:
len(annotations)

4889

In [6]:
annotations.head()

Unnamed: 0,image_id,annotation_id,xtl,ytl,xbr,ybr,group_id,type,not_normal_to_roadway,unusual_sign,...,weather,lighting,fog,tunnel,damaged,trimmed,covered,multiple_signs_visible,caption,class_id
2,32,122,761.08,0.0,950.42,104.87,-1,prismatic,False,False,...,normal,normal,False,False,False,False,False,False,False,08_02
3,32,123,1045.2,20.19,1101.48,76.73,-1,led,False,False,...,normal,normal,False,False,False,False,False,False,False,01_12
4,32,124,614.71,39.99,672.86,96.53,-1,led,False,False,...,normal,normal,False,False,False,False,False,False,False,01_12
5,33,125,952.73,297.87,979.66,328.02,299,plate,False,False,...,normal,normal,False,False,False,False,False,False,False,01_12
6,33,126,949.77,326.92,980.46,347.68,299,plate,False,False,...,normal,normal,False,False,False,False,False,False,False,07_xx


Load recognition results. These can be either detection results, where only the traffic sign category is predicted, or results from the entire recognition pipeline, where the exact class is predicted.

In this case, we load results from a detection+classification pipeline trained on the public training set and evaluated on the public test set. The classifier was trained with geometric+LED augmentation enabled.

In [7]:
recognitions = pd.read_csv('results/1_7.csv', index_col=0)

The traffic sign category predicted by the detector is stored in column `"cat_id"` and the detector's confidence in column `"conf"`. The traffic sign class predicted by the classifier is stored in column `"pred"` and the classifier's softmax score in column `"pred_score"`. Furthermore, every row is assigned a unique `"detection_id"`, analogous to `"annotation_id"` in the ground-truth annotations:

In [8]:
recognitions.head()

Unnamed: 0,conf,image_id,cat_id,xtl,ytl,xbr,ybr,detection_id,pred,pred_score
0,0.998773,32,8,761.0,-0.5,949.0,102.5,6505,08_02,0.999986
1,0.999831,32,1,615.5,38.5,670.5,95.5,6506,01_12,1.0
2,0.998073,32,1,1045.5,20.0,1098.5,76.0,6507,01_12,0.99999
3,0.978714,33,7,949.0,327.0,979.0,347.0,6508,07_04,1.0
4,0.976064,33,7,1341.0,355.0,1419.0,383.0,6509,07_09,0.999196


In this case, bounding boxes are specified in columns `"xtl"`, `"ytl"`, `"xbr"` and `"ybr"`. Alternatively, it is also possible to have a single column `"bbox"` containing string-representations of the bounding boxes in the YOLO-native `(x_center, y_center, width, height)` format.
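
To make the relation between the two box formats concrete, here is a minimal sketch that expands a `"bbox"` column into corner columns. The helper `expand_bbox` is hypothetical and not part of `util.evaluator`. It assumes that each string can be parsed as a Python tuple such as `"(855.0, 51.0, 188.0, 103.0)"` and that the values are absolute pixel coordinates; the string format actually expected by the evaluator may differ.

```python
import ast

import pandas as pd


def expand_bbox(df, column="bbox"):
    """Replace a (x_center, y_center, width, height) string column by corner columns.

    Hypothetical helper: assumes tuple-like strings in absolute pixel units.
    """
    parsed = pd.DataFrame(
        [ast.literal_eval(s) for s in df[column]],
        index=df.index,
        columns=["xc", "yc", "w", "h"],
        dtype=float,
    )
    out = df.drop(columns=column).copy()
    out["xtl"] = parsed["xc"] - parsed["w"] / 2  # left edge
    out["ytl"] = parsed["yc"] - parsed["h"] / 2  # top edge
    out["xbr"] = parsed["xc"] + parsed["w"] / 2  # right edge
    out["ybr"] = parsed["yc"] + parsed["h"] / 2  # bottom edge
    return out


# Illustrative row whose box equals that of detection 6505 shown above.
example = pd.DataFrame(
    {"detection_id": [6505], "bbox": ["(855.0, 51.0, 188.0, 103.0)"]}
)
corners = expand_bbox(example)
```

Since the evaluator is described as accepting either representation, this conversion should only be needed for inspecting or plotting the boxes.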

## Evaluate Detection Performance

Evaluate detection performance, i.e., ignore class predictions and only consider categories. Note that the detection IDs and annotation IDs must form the row index of the respective data frame:

In [9]:
det_matches, det_metrics = evaluator.evaluate(
    recognitions.set_index('detection_id'),
    annotations.set_index('annotation_id'),
    conf='conf',
    pred='cat_id',
    iou_threshold=0.5,
    conf_threshold=0.25,
    discard_disagreements=False,
    area_range=None
)

Per-class performance metrics:

In [10]:
det_metrics

Unnamed: 0,TP,FP,FN,Precision,Recall,AP
1,2614,163,100,0.941304,0.963154,0.977219
2,186,9,17,0.953846,0.916256,0.953874
3,60,14,16,0.810811,0.789474,0.839418
4,69,7,15,0.907895,0.821429,0.897446
5,253,49,47,0.837748,0.843333,0.877979
6,10,1,13,0.909091,0.434783,0.683517
7,713,77,107,0.902532,0.869512,0.912558
8,529,56,140,0.904274,0.790732,0.870103
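
The `Precision` and `Recall` columns appear to follow the usual definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch below uses a hypothetical helper `precision_recall`, which is not part of `util.evaluator`, to recompute both values for category 1 from the counts in the table.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall


# Counts of category 1 in the table above.
precision_1, recall_1 = precision_recall(tp=2614, fp=163, fn=100)
```

Note that these two values depend on `conf_threshold`, whereas AP summarizes the whole precision-recall curve.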


Matched detections and ground truth annotations:

In [11]:
det_matches.head()

Unnamed: 0,recognition_id,annotation_id,iou
0,6505,122,0.965103
1,6507,123,0.926629
2,6506,124,0.906016
3,6513,125,0.744512
4,6508,126,0.895905
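
The `iou` column and the `iou_threshold=0.5` argument refer to the intersection over union of a detection box and an annotation box. As a minimal sketch, the standard definition for boxes given as `(xtl, ytl, xbr, ybr)` can be written as follows. The helper `box_iou` is hypothetical and not part of `util.evaluator`, whose implementation may differ in detail. Applied to detection 6505 and annotation 122 from the tables above, it yields roughly 0.9651, in agreement with the first row.

```python
def box_iou(box_a, box_b):
    """Intersection over union of two boxes in (xtl, ytl, xbr, ybr) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero if the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


annotation_122 = (761.08, 0.0, 950.42, 104.87)
detection_6505 = (761.0, -0.5, 949.0, 102.5)
iou_first_match = box_iou(detection_6505, annotation_122)
```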


## Evaluate Performance of Detection+Classification Pipeline

Evaluate detection+classification performance. There are only two differences from the invocation of `evaluator.evaluate()` in the section above:
* `conf` is set to `"pred_score"`, to use the softmax score returned by the classifier as the detection confidence. It could be set to `"conf"` or any combination (product, minimum, maximum, etc.) of the two, but in our experiments `"pred_score"` worked best.
* `pred` is set to `"pred"`, which is the predicted traffic sign class.

Note that all ground-truth annotations of traffic sign classes not included among the 60 classes in ATSD-Signs are automatically ignored!
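
The score combinations mentioned in the list above can be sketched as an extra column computed from `"conf"` and `"pred_score"`. The helper `combine_confidence` and the column name `"combined"` are hypothetical. Passing the new column name via `conf='combined'` is an assumption based on `conf` being given as a column name in this notebook.

```python
import pandas as pd


def combine_confidence(df, how="product"):
    """Combine detector confidence and classifier score into a single Series."""
    if how == "product":
        return df["conf"] * df["pred_score"]
    if how == "minimum":
        return df[["conf", "pred_score"]].min(axis=1)
    if how == "maximum":
        return df[["conf", "pred_score"]].max(axis=1)
    raise ValueError(f"unknown combination: {how!r}")


# First two rows of the recognition results shown earlier.
sample = pd.DataFrame(
    {"conf": [0.998773, 0.999831], "pred_score": [0.999986, 1.0]},
    index=pd.Index([6505, 6506], name="detection_id"),
)
sample = sample.assign(combined=combine_confidence(sample, how="minimum"))
```

With the full data frame, the same pattern would be `recognitions.assign(combined=combine_confidence(recognitions))`, followed by evaluating with the combined column as the confidence.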

In [12]:
pip_matches, pip_metrics = evaluator.evaluate(
    recognitions.set_index('detection_id'),
    annotations.set_index('annotation_id'),
    conf='pred_score',
    pred='pred',
    iou_threshold=0.5,
    conf_threshold=0.25,
    discard_disagreements=False,
    area_range=None
)

Per-class performance metrics, sorted descending by average precision:

In [13]:
pip_metrics.sort_values('AP', ascending=False)

Unnamed: 0,TP,FP,FN,Precision,Recall,AP
01_22,13,0,0,1.0,1.0,0.995
01_10,3,0,0,1.0,1.0,0.995
05_04,10,0,0,1.0,1.0,0.995
02_01,9,3,0,0.75,1.0,0.995
01_18,3,2,0,0.6,1.0,0.995
02_05,8,1,0,0.888889,1.0,0.995
07_04,89,1,0,0.988889,1.0,0.995
01_12,399,23,4,0.945498,0.990074,0.993883
02_03,43,5,0,0.895833,1.0,0.993636
01_06,407,37,4,0.916667,0.990268,0.993347


Mean average precision (mAP):

In [14]:
pip_metrics['AP'].mean()

0.8965673844027315
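
For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute the average precision of a single class, namely all-point interpolation of the precision-recall curve. This is not necessarily the variant implemented in `util.evaluator` (other evaluators sample the curve at fixed recall levels instead), and the function name `average_precision` and its arguments are hypothetical. mAP is then the mean of the per-class values, as in the cell above.

```python
import numpy as np


def average_precision(scores, is_tp, n_gt):
    """All-point interpolated AP for one class.

    scores: confidence of every detection of the class
    is_tp:  whether each detection was matched to a ground-truth annotation
    n_gt:   number of ground-truth annotations of the class (assumed > 0)
    """
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    tp = np.asarray(is_tp, dtype=bool)[order]
    cum_tp = np.cumsum(tp)
    cum_fp = np.cumsum(~tp)
    recall = np.concatenate(([0.0], cum_tp / n_gt, [1.0]))
    precision = np.concatenate(([1.0], cum_tp / (cum_tp + cum_fp), [0.0]))
    # Make precision monotonically non-increasing from left to right.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Sum rectangle areas at the points where recall increases.
    idx = np.nonzero(recall[1:] != recall[:-1])[0]
    return float(np.sum((recall[idx + 1] - recall[idx]) * precision[idx + 1]))


# Three detections ranked TP, FP, TP with two ground-truth annotations.
ap_example = average_precision([0.9, 0.8, 0.7], [True, False, True], n_gt=2)
```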