![title](pics/cmiyc.jpg)

# Re-Developing *Crash Me If You Can*

**Suppose you had to re-develop the software for CMIYC. How would you proceed?**

**Given**:
* All hardware, in particular,
  * a car-mounted camera that constantly streams images from the race track (which you can easily access), and
  * a way to control the speed of the car through simple software instructions.
    
**Objective**:
* An AI system that automatically adjusts the car's speed based on traffic signs (speed limits, stop signs) next to the race track.
  * The overall system thus consists of a DL model to recognize traffic signs and a (simple) rule-based AI to set the car speed based on the signs.
  * The focus should be on the DL model for traffic sign recognition.

**Examples**:

<img src="pics/frame1.jpg" alt="drawing" width="350"/><img src="pics/frame2.jpg" alt="drawing" width="350"/><img src="pics/frame3.jpg" alt="drawing" width="350"/>

Left: car speed should be 50 km/h (default speed)<br>
Middle: car speed should be temporarily set to 70 km/h<br>
Right: car speed should be temporarily set to 30 km/h
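
The rule-based part of the system can be very small. The following is a minimal sketch, not the actual CMIYC controller: the sign labels (`speed_limit_30`, `speed_limit_70`, `stop`), the mapping of a stop sign to speed 0, and the idea of holding a temporary speed for a fixed number of frames (`hold_frames`) are all assumptions made for illustration.

```python
DEFAULT_SPEED = 50  # km/h, the default speed from the examples above

# Hypothetical sign labels; the real class names are an assumption here
SPEED_BY_SIGN = {
    "speed_limit_30": 30,
    "speed_limit_70": 70,
    "stop": 0,
}


class SpeedController:
    """Rule-based controller: a recognized sign sets the speed temporarily."""

    def __init__(self, default_speed=DEFAULT_SPEED, hold_frames=3):
        self.default_speed = default_speed
        self.hold_frames = hold_frames  # frames a temporary speed stays active
        self.speed = default_speed
        self._frames_left = 0

    def update(self, sign):
        """Process one frame; `sign` is a label, or None if nothing was recognized."""
        if sign in SPEED_BY_SIGN:
            # A known sign (re)starts the temporary speed
            self.speed = SPEED_BY_SIGN[sign]
            self._frames_left = self.hold_frames
        elif self._frames_left > 0:
            # No sign in this frame: count down, then fall back to the default
            self._frames_left -= 1
            if self._frames_left == 0:
                self.speed = self.default_speed
        return self.speed


controller = SpeedController(hold_frames=2)
speeds = [controller.update(s) for s in [None, "speed_limit_70", None, None, None]]
```

Keeping the rules in a plain dictionary makes it easy to add further signs without touching the control logic.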

Try to solve the task on a high level of abstraction. The following list of questions may serve as an inspiration, but is not exhaustive:
* What data is needed? How can it be *efficiently* obtained/collected/labeled?
* What must be changed in the MNIST handwritten digits example to accommodate the new problem setting, with RGB images and a different number of classes?
* Is DL only needed for image classification, or are there other CV-related tasks that can/must be solved with DL in this case?
* What could be the overall workflow / individual stages of the AI system?
* Which data augmentation strategies might be useful?
* How can traffic sign recognition be made more robust? In particular, we're not dealing with single, isolated images, but with video frames ...
* How can the quality of the traffic sign recognition system be evaluated?
* What are potential corner cases one must consider?
* In an actual traffic sign recognition system for real cars, what might be additional challenges?

Which step do you think takes the most time?

# Our Solution

In [None]:
import numpy as np
from PIL import Image, ImageDraw
from matplotlib import pyplot as plt
from pathlib import Path
import torch

In [None]:
from cmiyc.classifier import Classifier
from cmiyc.app import load_detector, CMIYCApp

## Overall Traffic Sign Recognition Pipeline

![pipeline](pics/cmiyc_pipeline.png)

1. **Input**: RGB images of traffic scenes
2. *Detector* locates traffic signs in images (bounding boxes) and assigns each to a coarse category (warning, prohibitory, etc.)
   * Neural network
3. *Classifier* determines the exact class of each detected traffic sign
   * Neural network
4. **Output**: Detailed information about each detected sign: location within image, class
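
The four steps above can be sketched as a single function. This is a simplified illustration, not the code of the `cmiyc` package: images are plain lists of pixel rows, and `detect` and `classify` are hypothetical callables standing in for the two neural networks. Bounding boxes are assumed to be `(left, top, width, height)`.

```python
def recognize_signs(image, detect, classify):
    """Run the two-stage pipeline on one image.

    `image` is a list of pixel rows, `detect` returns bounding boxes as
    (left, top, width, height), and `classify` maps a cropped image to a
    class index.
    """
    results = []
    for left, top, width, height in detect(image):
        # Cut the detected sign out of the full image
        crop = [row[left:left + width] for row in image[top:top + height]]
        results.append({"box": (left, top, width, height), "class": classify(crop)})
    return results


# Toy stand-ins for the two networks (hypothetical, for illustration only)
def toy_detect(image):
    return [(1, 1, 2, 2)]


def toy_classify(crop):
    return 7 if crop == [[1, 1], [1, 1]] else 0


toy_image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
results = recognize_signs(toy_image, toy_detect, toy_classify)
```

Separating detection from classification means that either model can be retrained or replaced without changing the other.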

## Traffic Sign Detection

* Popular neural network architecture for general object detection in images: [YOLO family](https://doi.org/10.1109/CVPR.2016.91)
  * **Y**ou **O**nly **L**ook **O**nce
  * We used YOLOv4 (released in 2020); the most recent version at the time of writing was YOLOv9
  * Convolutional neural network
  * **Input**: $1600\times 1600$ pixel RGB images
  * **Output**: coordinates of bounding boxes of detected objects
  * 5,883,356 trainable parameters

* Training data: [ATSD-Scenes](https://github.com/risc-mi/atsd)
  * 7,454 high-res images from Austrian highways in 2014
  * 28,000 detailed traffic sign annotations
  * Created by RISC-SW and ASFINAG in research project *SafeSign*
    ![atsd](pics/atsd.jpg)

* Training and evaluation:
  * Training takes about 10h on GPU
  * Final performance on test set of ATSD-Scenes: $94.82\%$ [mAP](https://www.v7labs.com/blog/mean-average-precision)

* Application in CMIYC:
  * **The detector also works exceptionally well on CMIYC video frames** => the model generalizes from highways to Carrera race tracks
  * Restrict detections to at most one traffic sign per image: the one with the largest bounding box
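
Two of the points above lend themselves to a short sketch. mAP is built on the intersection over union (IoU) between predicted and ground-truth boxes, and the restriction to one sign per image amounts to picking the box with the largest area. Both helpers below are illustrations and not part of the `cmiyc` package; they assume boxes in `(left, top, width, height)` format.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (left, top, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping region (0 if the boxes are disjoint)
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def largest_box(boxes):
    """Return the box with the largest area, or None if `boxes` is empty."""
    return max(boxes, key=lambda box: box[2] * box[3], default=None)
```

When computing mAP, a predicted box typically counts as a correct detection only if its IoU with a ground-truth box of the same class exceeds a chosen threshold; the exact threshold used for the figure above is not stated here.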

Let's load the detector and apply it to some test images:

In [None]:
detector = load_detector()

In [None]:
img = Image.open('pics/frame1.jpg')
img

In [None]:
detection_result = detector.detect(np.asarray(img))

In [None]:
detection_result

In [None]:
# Visualize the bounding boxes of all detected signs (draws directly onto `img`)
draw = ImageDraw.Draw(img)
for bb in detection_result[2]:
    # Convert bounding box coordinates from (left, top, width, height)
    # to (left, top, right, bottom), as expected by `draw.rectangle`
    draw.rectangle(
        tuple(bb[:2]) + tuple(bb[:2] + bb[2:]),
        width=6,
        outline=(0, 255, 0)
    )

In [None]:
img

## Traffic Sign Classification

* Neural network architecture: [Li & Wang, 2019](https://doi.org/10.1109/TITS.2018.2843815)
  * Convolutional neural network
  * Architecture similar to the MNIST classifier, but with more layers (and more parameters)
  * **Input**: $48\times 48$ pixel RGB images
  * **Output**: index of one of 19 traffic sign classes
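
To see what changes relative to an MNIST classifier, it helps to follow the tensor sizes through a small network. The sketch below assumes a hypothetical CNN with two blocks of 3x3 convolution (padding 1) plus 2x2 max pooling; this is not the Li & Wang architecture, only an illustration of which numbers depend on the input size, the number of channels, and the number of classes.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(in_channels, out_channels, kernel):
    """Trainable parameters of a convolution layer (weights plus biases)."""
    return (kernel * kernel * in_channels + 1) * out_channels


def flattened_features(input_size, conv_channels):
    """Number of features entering the first fully connected layer.

    Each block is a 3x3 convolution (padding 1) followed by 2x2 max pooling.
    """
    size = input_size
    for _ in conv_channels:
        size = conv_output_size(size, kernel=3, padding=1)  # keeps the size
        size = conv_output_size(size, kernel=2, stride=2)   # halves the size
    return conv_channels[-1] * size * size


channels = [32, 64]  # hypothetical number of filters per block
mnist = {
    "first_conv_params": conv_params(1, channels[0], 3),  # 1 gray-scale channel
    "features": flattened_features(28, channels),         # 28x28 input
    "classes": 10,
}
signs = {
    "first_conv_params": conv_params(3, channels[0], 3),  # 3 RGB channels
    "features": flattened_features(48, channels),         # 48x48 input
    "classes": 19,
}
```

Under these assumptions, only three places change: the input channels of the first convolution, the input size of the first fully connected layer, and the size of the output layer.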

In [None]:
classifier = Classifier(19).eval()

In [None]:
sum(p.numel() for p in classifier.parameters())

* The network architecture can be summarized as follows:

![network architecture](pics/cmiyc_classifier.png)

* Training:
  * Data: 18,402 images acquired with the car on the race track
    * Unlike the detector, a classifier trained on ATSD did not work very well on these images
  * Augmentation: rotation, zoom, noise, ...
  * Training takes about 1 hour on GPU
  * Final test-set performance: $99.55\%$ accuracy
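
As an illustration of the augmentation step, here is a minimal NumPy sketch of two of the listed transformations, random zoom and additive noise. It is not the augmentation code used for training; the function name and the parameters `max_zoom` and `noise_std` are assumptions, and rotation is omitted for brevity. In practice, libraries such as torchvision provide ready-made transforms for this.

```python
import numpy as np


def augment(image, rng, max_zoom=1.2, noise_std=10.0):
    """Randomly zoom into and add Gaussian noise to an (H, W, 3) uint8 image."""
    height, width = image.shape[:2]

    # Random zoom: crop a centered window and resize it back to the
    # original size with nearest-neighbor sampling
    zoom = rng.uniform(1.0, max_zoom)
    crop_h, crop_w = int(round(height / zoom)), int(round(width / zoom))
    top, left = (height - crop_h) // 2, (width - crop_w) // 2
    rows = top + np.arange(height) * crop_h // height
    cols = left + np.arange(width) * crop_w // width
    zoomed = image[rows[:, None], cols[None, :]]

    # Additive Gaussian noise, clipped back to the valid pixel range
    noisy = zoomed.astype(np.float32) + rng.normal(0.0, noise_std, zoomed.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)


rng = np.random.default_rng(0)
sample = np.full((48, 48, 3), 128, dtype=np.uint8)
augmented = augment(sample, rng)
```

Because the augmentation is random, applying it anew in every epoch shows the classifier slightly different versions of each training image.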

Let's load the trained parameters and classify some test images:

In [None]:
classifier.load_state_dict(torch.load('cmiyc/classifier.pt'))

In [None]:
img = Image.open('pics/speed_limit.png')
img

In [None]:
classification_result = classifier.classify(np.asarray(img))
classification_result

To see which traffic sign class this index corresponds to, we can visualize the template class image:

In [None]:
Image.open('cmiyc/class_imgs/{}.png'.format(classification_result))

## Webcam App

In [None]:
app = CMIYCApp(classifier=classifier, detector=detector, detect=True)

In [None]:
app.run()

**Note**: In the live-view window, press "q" to shut down the app.
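
Since the app processes a stream of video frames rather than isolated images, single-frame misclassifications can be suppressed by combining the predictions of consecutive frames. The following sketch shows one simple option, a sliding-window majority vote. It is an illustration only: the class `MajorityVoteSmoother` and its parameters are hypothetical and not part of `CMIYCApp`.

```python
from collections import Counter, deque


class MajorityVoteSmoother:
    """Stabilize per-frame predictions with a sliding-window majority vote."""

    def __init__(self, window=5, min_votes=3):
        self.history = deque(maxlen=window)  # most recent per-frame predictions
        self.min_votes = min_votes

    def update(self, prediction):
        """Add one per-frame prediction (class index or None).

        Returns the most frequent class in the window if it occurs at least
        `min_votes` times, otherwise None.
        """
        self.history.append(prediction)
        votes = Counter(p for p in self.history if p is not None)
        if not votes:
            return None
        label, count = votes.most_common(1)[0]
        return label if count >= self.min_votes else None


smoother = MajorityVoteSmoother(window=5, min_votes=3)
# Class 3 is seen repeatedly; the single prediction of class 7 is an outlier
outputs = [smoother.update(p) for p in [3, 3, None, 7, 3, 3]]
```

A larger window makes the output more stable but delays the reaction to a new sign, so the window size has to be balanced against the frame rate and the car's speed.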

# Concluding Remarks

* We've encountered two important CV tasks: *image classification* and *object detection*
  * Many other tasks of varying complexity exist: semantic segmentation, instance segmentation, captioning, visual Q&A, image generation, ...
* **Trustworthiness** is an important aspect of DL (and AI in general) and actively researched:
  * Explainable AI (XAI): How does a DL model arrive at its predictions? What are the most important input features? How do these features contribute to the output?
  * Uncertainty quantification: How (un)certain is the model about its predictions? Can we trust it?
  * Out-of-distribution detection: Is the input drawn from the same distribution as the training data? Are there systematic differences? Maybe even adversarial attacks?
* **Neurosymbolic AI** can bridge the gap between DL and symbolic AI:
  * For instance, use DL to extract high-level semantic content from unstructured, fuzzy input data (e.g., images, natural language text), and process this content using symbolic AI and automated reasoning
    * Example: In autonomous driving, DL can be used to process images and other sensor data, whereas symbolic AI controls the car based on the extracted information
  * Different approach: tackle difficult problems from symbolic AI, e.g., symbolic integration, with DL
  * Great potential for more robust, trustworthy, understandable AI!
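
As a deliberately simple illustration of the uncertainty question raised above, a classifier's raw outputs can be turned into probabilities with a softmax, and predictions below a confidence threshold can be rejected. This is only a baseline sketch: the function names and the threshold are assumptions, and softmax confidence is known to be an imperfect measure of uncertainty.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, threshold=0.9):
    """Return the predicted class index, or None if the confidence is too low."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best if probs[best] >= threshold else None


confident = predict_or_reject([0.1, 8.0, 0.2])  # one score clearly dominates
uncertain = predict_or_reject([1.0, 1.1, 0.9])  # all scores are similar
```

In a system like CMIYC, a rejected prediction could simply be treated as "no sign recognized", so that the car keeps its current speed.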

# Additional Material

## Public CV Datasets

* MNIST: small gray-scale images of handwritten digits
* [Fashion-MNIST](https://github.com/zalandoresearch/fashion-mnist): 70,000 small gray-scale images of fashion items (60,000 for training, 10,000 for testing)
  * drop-in replacement for MNIST
  * more challenging
* [CIFAR](https://www.cs.toronto.edu/~kriz/cifar.html): 60,000 small RGB images (photographs), either 10 or 100 different classes
* [ImageNet](https://image-net.org/index.php): 1.4M RGB images (photographs), 1000 different classes
* [COCO](https://cocodataset.org/#home): 123,000 RGB images (photographs), detailed object-level annotations for detection and segmentation
* ... and many, many more, from all kinds of domains and for all sorts of tasks

**Note: Many of these datasets can easily be accessed (downloaded) with [torchvision](https://pytorch.org/vision/stable/datasets.html)!**

## Useful Technologies

* [torchvision](https://pytorch.org/vision/stable/index.html): Add-on library for PyTorch, very useful for computer vision:
  * [public datasets](https://pytorch.org/vision/stable/datasets.html)
  * [neural network architectures and pre-trained models](https://pytorch.org/vision/stable/models.html)
  * [image augmentations](https://pytorch.org/vision/stable/transforms.html)
* [PyTorch Lightning](https://lightning.ai/docs/pytorch/stable/), [fastai](https://docs.fast.ai/): High-level interfaces to PyTorch that simplify model training in particular
* [TensorBoard](https://www.tensorflow.org/tensorboard), [Weights & Biases](https://wandb.ai/site): Browser-based real-time monitoring of training progress (loss, accuracy, etc.) and tools for organizing/summarizing results
  * seamlessly integrated into PyTorch Lightning
  * TensorBoard does not require a TensorFlow installation
* [Hugging Face](https://huggingface.co/): Huge collection of trained ML models and curated datasets for all kinds of tasks in computer vision, natural language processing, etc.
* [Kaggle](https://www.kaggle.com/): Huge collection of datasets, models, notebooks, competitions
  * particularly nice for learning ML, DL, data science, etc.
* [Docker](https://www.docker.com/): Lots of existing images for ML and DL, with Python, PyTorch and many other useful packages pre-installed
* [Visual Studio Code](https://code.visualstudio.com/): Excellent Python IDE, with integrated Jupyter notebook support