This repository implements the paper "Vehicle Pose Estimation: Exploring Angular Representations" by Orlov, I.; Buzzelli, M.; and Schettini, R. (2024), in Proceedings of the 19th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 2: VISAPP, pages 853-860. SciTePress. ISBN 978-989-758-679-8; ISSN 2184-4321.
Our research utilizes the PASCAL3D+ dataset, which offers a diverse range of object categories, including cars, with an annotated azimuth for each photograph. We introduce two architectures that approach azimuth estimation as a regression problem, each employing a deep convolutional neural network (DCNN) backbone but diverging in their output definition strategies.
The azimuth φ is defined as the angle in the range [-π, π] that represents the orientation of a vehicle with respect to the viewer. Originating from the front of the car, this angle describes how much the vehicle has rotated from this frontal viewpoint. For instance, φ = 0 indicates a car directly facing the viewer, while φ = π/2 signifies the car turned 90° to the right. This definition is depicted below.
We use the PASCAL3D+ dataset v1.1 to train the model, taking only the ImageNet images subset.
We tried two different approaches to formalize the output of the model, since we cannot return a single scalar angle due to its circular nature (predictions of π and -π should be 0 apart, not 2π).
- We output o1 = sin(φ) and o2 = cos(φ) of the predicted azimuth using a tanh activation;
- Recover φ_pred = atan2(o1, o2);
- Loss function: MSE
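The encoding and recovery steps above can be sketched in plain Python (no TensorFlow, just to illustrate the output representation; the `recover_azimuth` helper name is ours):

```python
import math

def recover_azimuth(o1: float, o2: float) -> float:
    """Recover φ_pred from the two tanh outputs.

    o1 approximates sin(φ) and o2 approximates cos(φ), both in
    [-1, 1]; atan2 maps the pair back to an angle in [-π, π].
    """
    return math.atan2(o1, o2)

# A car rotated 90° to the right (φ = π/2):
phi = math.pi / 2
o1, o2 = math.sin(phi), math.cos(phi)
phi_pred = recover_azimuth(o1, o2)

# The encoding is continuous at the ±π seam: angles just above -π and
# just below π map to nearly identical (sin, cos) pairs, so the MSE
# on (o1, o2) does not penalize a prediction on the "other side" of π.
```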
We split the angle prediction into two discriminators:
- Front/Rear “classifier”
- Left/Right “classifier”
So we define two angles α & β as shown below
NB. The paper contains a few mistakes in the β values of the corresponding figure, so we corrected them in the image above.
The model returns two outputs: the absolute values of α & β.
- Final activation: two sigmoids, which correspond to two normalized angles α & β;
- Loss function: sum of two binary cross-entropy losses, one each for the α & β predictions;
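The loss above can be sketched in plain Python. Note that binary cross-entropy is applied here to continuous targets in [0, 1] (the normalized angles), not to hard class labels; the helper names and the use of scalar inputs (rather than batched tensors) are our simplifications:

```python
import math

def bce(y_true: float, y_pred: float, eps: float = 1e-7) -> float:
    """Binary cross-entropy for one continuous target in [0, 1]."""
    y_pred = min(max(y_pred, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(y_pred) + (1.0 - y_true) * math.log(1.0 - y_pred))

def approach2_loss(alpha_true: float, beta_true: float,
                   alpha_pred: float, beta_pred: float) -> float:
    """Sum of the two BCE terms, one per normalized angle."""
    return bce(alpha_true, alpha_pred) + bce(beta_true, beta_pred)
```

With continuous targets the minimum of each BCE term sits at y_pred = y_true, which is what makes it usable as a regression loss on the sigmoid outputs.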
To evaluate the performance of the models:
- We recover the predicted φ angle
- Realign φ_pred to φ_true if needed and calculate the minimum angle between them
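The realignment amounts to wrapping the raw difference back into [-π, π] before taking the absolute value; a minimal sketch (the function name is ours):

```python
import math

def angular_error(phi_pred: float, phi_true: float) -> float:
    """Minimum absolute angle between two azimuths, in radians.

    Wrapping the difference into [-π, π] handles the seam, so
    predictions of π and -π are 0 apart rather than 2π.
    """
    diff = (phi_pred - phi_true + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff)
```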
The metrics we use:
- Mean Absolute Error (in degrees)
- Root Mean Squared Error (in degrees)
- Median Absolute Error (in degrees)
- R² score
- Accuracy π/6: The fraction of instances whose predicted viewpoint is within a π/6 threshold of the target viewpoint
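Given the per-image absolute angular errors (in degrees) from the step above, most of these metrics can be computed as below; the R² score additionally needs the raw predictions and targets, so it is omitted from this sketch, and the function and key names are ours:

```python
import math
from statistics import median

def summarize(errors_deg: list[float]) -> dict[str, float]:
    """Summary metrics over per-image absolute angular errors (degrees)."""
    n = len(errors_deg)
    return {
        "MAE": sum(errors_deg) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors_deg) / n),
        "MedAE": median(errors_deg),
        # π/6 rad = 30°: fraction of predictions within the threshold.
        "Acc@pi/6": sum(1 for e in errors_deg if e < 30.0) / n,
    }
```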
Requirements:
- Python 3.8
- TensorFlow 2.8
For local setup, run:
conda create -n car-pose-estimation python=3.8 poetry -y
conda activate car-pose-estimation
poetry install --with dev
To generate the dataset from the original PASCAL3D+ dataset zip file, run:
poetry run python scripts/prepare_dataset.py
To train the model, run:
poetry run python scripts/train_model.py
To validate the model, run:
poetry run python scripts/validate_model.py
To run inference on a folder of images, run:
poetry run python scripts/inference.py --model_path=model_file.h5 --approach=2 --images_path=tests/sample_images --output_path=./samples_results.json
To create the visualization of the results, add the --visualizations_path=./samples_visualization argument.
Check the source of scripts/inference.py for more details on the arguments.