Aerial Object Detection in Earth Vision: An accurate and reliable model for monitoring harbour ports and airbases using satellite imagery.
Explore the docs »
Architecture · Features · Local Setup
- Overview
- Dataset
- Models Implemented
- Algorithms Used
- Model Architectures
- Performance
- Results
- Ensemble Method
- Installation
- Project Structure
- Contributors
- License
The objective of this project is to accurately detect and classify objects such as ships, harbours, planes, jets, and other surveillance-relevant entities in aerial and satellite imagery. Aerial object detection presents unique challenges, including small object sizes, high object density, diverse orientations, and complex backgrounds. To tackle these issues, the project explores and evaluates multiple object detection architectures and employs an ensemble strategy to enhance detection accuracy and robustness across varying scenarios.
For this project, we utilized DOTA 1.5 (Dataset for Object Detection in Aerial Images), a large-scale and richly annotated dataset tailored for aerial object detection tasks. DOTA provides high-resolution satellite and aerial imagery that closely mirrors real-world surveillance scenarios.
Key Features:
- Total Images: 1,418 training images and 412 testing images
- Source Diversity: Images captured from a variety of sensors and platforms
- Annotation Format: Oriented bounding boxes to accurately capture object orientation
- Object Categories: 15 real-world classes including ships, harbours, airplanes, helicopters, and other surveillance-critical objects.
- Challenges Addressed: Small object sizes, dense object distributions, cluttered backgrounds, and wide-ranging orientations
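DOTA annotations are plain-text files with one object per line: eight corner coordinates of the oriented bounding box, followed by the category name and a difficulty flag. A minimal parser sketch (the helper name and dict keys are illustrative, not part of the dataset toolkit) that also derives the axis-aligned box some detectors expect:

```python
def parse_dota_line(line):
    """Parse one DOTA annotation line:
    x1 y1 x2 y2 x3 y3 x4 y4 category difficult"""
    parts = line.split()
    coords = [float(v) for v in parts[:8]]
    corners = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    xs = [p[0] for p in corners]
    ys = [p[1] for p in corners]
    return {
        "corners": corners,                           # oriented bounding box
        "hbb": (min(xs), min(ys), max(xs), max(ys)),  # horizontal (axis-aligned) box
        "category": parts[8],
        "difficult": int(parts[9]),
    }

# Example annotation line for a ship
ann = parse_dota_line("10 10 60 10 60 40 10 40 ship 0")
print(ann["category"], ann["hbb"])
```

The horizontal box is a convenient fallback when training detectors (such as SSD or Faster R-CNN) that do not natively handle oriented boxes.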
More about DOTA: https://captain-whu.github.io/DOTA/
To effectively detect objects such as ships, harbours, aircraft, and other surveillance targets in aerial imagery, we implemented three deep learning-based object detection models (SSD, YOLO, and Faster R-CNN), each trained and evaluated independently on the DOTA 1.5 dataset.
- SSD (Single Shot Multibox Detector): SSD offers a balance between speed and accuracy through a single-stage detection framework that predicts bounding boxes and classes simultaneously. However, a major drawback of SSD is its relatively poor performance in detecting small objects, which are common in aerial imagery. Its fixed-scale anchor boxes and reliance on lower-resolution feature maps often result in missed detections or inaccurate localization. Due to these limitations, other models were explored to improve detection of fine-grained and densely packed targets.
- YOLO (You Only Look Once): YOLO is known for its exceptional speed and real-time processing capabilities. It treats detection as a regression problem and predicts bounding boxes and class probabilities directly from the entire image. However, YOLO also struggles with detecting small or overlapping objects, especially in cluttered scenes, due to its coarse grid-based prediction mechanism. Despite this, its fast inference time made it a valuable benchmark and a suitable candidate for scenarios requiring quick surveillance scans.
- Faster R-CNN (FRCNN): Faster R-CNN uses a two-stage detection process: a Region Proposal Network (RPN) to generate candidate object regions, followed by classification and bounding box refinement. This architecture achieves high accuracy and better localization, especially for small, overlapping, and rotated objects, making it highly suitable for aerial surveillance. The main drawback, however, is its higher computational cost and slower inference speed, which may not be ideal for real-time applications.
Comparative Insight: While YOLO significantly outperformed the other models in terms of speed and efficiency, it lagged behind in detecting small and overlapping objects. On the other hand, Faster R-CNN, although slower, consistently delivered higher accuracy and better localization, especially in complex aerial scenes. This trade-off between speed and precision is a critical consideration when choosing the appropriate model for real-world aerial object detection tasks.
SSD is a fast one-stage detector that predicts object classes and bounding boxes directly from multiple feature maps. It offers a good balance between speed and accuracy.
YOLOv2 performs real-time object detection by predicting bounding boxes and classes in a single pass, making it highly efficient for fast inference.
Faster R-CNN is a two-stage detector that uses a Region Proposal Network (RPN) to generate object candidates and then classifies them. It's known for its high accuracy.
Visual representations of the object detection model architectures used in this project:
Model | mAP (%) | Recall (%) |
---|---|---|
SSD | 26 | 47.61 |
YOLO | 34 | 31.51 |
Faster R-CNN | 51 | 57.32 |
Ensemble (WBF) | 57 | 63.41 |
To improve detection performance, we combined the outputs of multiple object detection models using Weighted Boxes Fusion (WBF).
Each model has its strengths:
- SSD is fast but may miss small objects.
- YOLO is the fastest but lags on small, overlapping objects.
- Faster R-CNN is highly accurate but slower.
By merging predictions from all three, we maximize both precision and recall.
WBF takes overlapping predictions from different models and produces a more accurate bounding box based on confidence scores and spatial alignment.
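A minimal sketch of the WBF idea in Python: greedily cluster the pooled predictions by IoU, then average each cluster's coordinates weighted by confidence. This is a simplification for illustration (the full algorithm also handles class labels, per-model weights, and score rescaling), and the function names are ours:

```python
import numpy as np

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def weighted_boxes_fusion(boxes, scores, iou_thr=0.55):
    """Fuse overlapping predictions pooled from several models.

    boxes  -- (N, 4) array of [x1, y1, x2, y2]
    scores -- (N,) confidence scores
    """
    order = np.argsort(scores)[::-1]        # visit high-confidence boxes first
    clusters = []                           # each cluster: indices of matching boxes
    for idx in order:
        for c in clusters:
            if iou(boxes[idx], boxes[c[0]]) > iou_thr:
                c.append(idx)
                break
        else:
            clusters.append([idx])
    fused_boxes, fused_scores = [], []
    for c in clusters:
        w = scores[c]
        # Coordinates are averaged weighted by confidence, unlike NMS
        # which keeps only the single highest-scoring box.
        fused_boxes.append(np.average(boxes[c], axis=0, weights=w))
        fused_scores.append(w.mean())
    return np.array(fused_boxes), np.array(fused_scores)

# Two overlapping detections of the same object plus one distinct object
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
fused, conf = weighted_boxes_fusion(boxes, scores)
print(fused, conf)  # the first two boxes merge into one fused box
```

The key contrast with non-maximum suppression is that WBF uses every overlapping box to refine the fused coordinates, rather than discarding all but the winner.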
The following diagram illustrates how we ensemble the predictions from SSD, YOLO, and FRCNN to generate the final result:
```shell
git clone https://github.com/Hamzawp/AerialTect.git
pip install -r requirements.txt
```

Key pinned dependencies (see requirements.txt for the full list):

```
torch==2.5.1+cu124
torchvision==0.20.1+cu124
torchaudio==2.5.1+cu124
```
The project consists of the following files:
- `SSD_Implementation.ipynb`: Implementation of the SSD model.
- `Yolo_Implementation.ipynb`: Implementation of the YOLO model.
- `FRCNN_Implementation.ipynb`: Implementation of the Faster R-CNN model.
- `Ensemble_Implementation.ipynb`: Combining predictions from multiple models using ensemble methods.
- `Testing.ipynb`: For testing and evaluating the models.
- `Playground.ipynb`: For experimenting with the models.