<a href="https://colab.research.google.com/github/safaabbes/safaabbes/blob/main/Safa_Abbes_ML_Use_Case_Computer_Vision.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#ML Use Case: Computer Vision
Task: Implement a proof of concept where you detect cars/trucks/motorcycles in images. \
Deadline: Monday 29 May 2023 \
Notes: For this task, you can use pre-trained models. No need to train your model.

#Description of the given problem:

This is an object detection problem in computer vision, which combines classification and localization. We're dealing with multi-class classification involving 3 classes: cars, trucks and motorcycles. Localization involves identifying the precise position of each object in the image.

To accomplish this task, we'll need a suitable dataset and model. No training will be done, as we will use a pre-trained model. We'll mainly be using the FiftyOne toolkit and torchvision.


In [36]:
!pip install fiftyone

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [37]:
import fiftyone as fo

#Dataset

Since we are free to choose the dataset, we'll explore the datasets available in the FiftyOne Dataset Zoo that are made for object detection. These datasets include:

*   COCO-2014
*   COCO-2017
*   KITTI
*   KITTI Multiview
*   Open Images V6
*   Open Images V7
*   VOC-2007
*   VOC-2012

After inspecting these datasets, I chose COCO-2017, as I find it the most suitable for this task: it provides bounding-box annotations for all three target classes.



In [38]:
import fiftyone.zoo as foz

## COCO-2017 Dataset Details:

COCO is a large-scale object detection, segmentation, and captioning dataset.
This version contains images, bounding boxes, and segmentations for the 2017 version of the dataset.

*  COCO defines 91 classes but the data only uses 80 classes.
*  Some images from the train and validation sets don’t have annotations.
*  The test set does not have annotations.

In [39]:
dataset = foz.load_zoo_dataset(
    "coco-2017",
    split="validation",  # "train", "test" or "validation"; the test split has no annotations, so it cannot be used for evaluation
    label_types=["detections"],  # "detections" or "segmentations"
    classes=["car", "truck", "motorcycle"],  # only load samples containing at least one instance of these classes
    only_matching=True,  # only load labels that match the `classes` requirement
    # max_samples=100,  # uncomment to load only 100 samples for faster loading
    shuffle=True,
    seed=1234,  # random seed, used when shuffle=True
)

Downloading split 'validation' to '/root/fiftyone/coco-2017/validation' if necessary

Found annotations at '/root/fiftyone/coco-2017/raw/instances_val2017.json'

Sufficient images already downloaded

Existing download of split 'validation' is sufficient

Loading existing dataset 'coco-2017-validation'. To reload from disk, either delete the existing dataset or provide a custom `dataset_name` to use


In [40]:
# Print some information about the dataset
print(dataset)

Name:        coco-2017-validation
Media type:  image
Num samples: 725
Persistent:  False
Tags:        []
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    faster_rcnn:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    eval_tp:      fiftyone.core.fields.IntField
    eval_fp:      fiftyone.core.fields.IntField
    eval_fn:      fiftyone.core.fields.IntField


In [41]:
# Print a ground truth detection
sample = dataset.first()
print(sample.ground_truth.detections[0])

<Detection: {
    'id': '6470825fa61f8abe35b5930e',
    'attributes': {},
    'tags': [],
    'label': 'car',
    'bounding_box': [
        0.169625,
        0.26844594594594595,
        0.0410625,
        0.05677927927927928,
    ],
    'mask': None,
    'confidence': None,
    'index': None,
    'supercategory': 'vehicle',
    'iscrowd': 0,
    'eval': 'tp',
    'eval_id': '647083cfa61f8abe35b5c9b1',
    'eval_iou': 0.6889941412651497,
}>


In [42]:
session = fo.launch_app(dataset)

#Model Selection
PyTorch's torchvision library provides several popular pre-trained object detection models, including:

*   Faster R-CNN
*   FCOS
*   RetinaNet
*   SSD
*   SSDlite

I chose to work with Faster R-CNN pre-trained on the COCO dataset.

In [43]:
!pip install torch torchvision

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Torchvision's Faster R-CNN combines a ResNet-50 backbone network with a Feature Pyramid Network (FPN) to detect objects in images. The ResNet-50 backbone extracts features from the input image at different scales, while the FPN generates a set of feature maps with varying resolutions to handle objects of different sizes. The model then applies a Region Proposal Network (RPN) to generate candidate bounding boxes, and performs classification and bounding box regression on these proposals.

In [44]:
import torch
import torchvision

# Run the model on GPU if it is available
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Load a Faster R-CNN model pre-trained on COCO.
# `weights=` replaces the `pretrained=True` argument, which is deprecated since torchvision 0.13.
weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.COCO_V1
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights)
model.to(device)
model.eval()



FasterRCNN(
  (transform): GeneralizedRCNNTransform(
      Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
      Resize(min_size=(800,), max_size=1333, mode='bilinear')
  )
  (backbone): BackboneWithFPN(
    (body): IntermediateLayerGetter(
      (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
      (bn1): FrozenBatchNorm2d(64, eps=0.0)
      (relu): ReLU(inplace=True)
      (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
      (layer1): Sequential(
        (0): Bottleneck(
          (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn1): FrozenBatchNorm2d(64, eps=0.0)
          (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
          (bn2): FrozenBatchNorm2d(64, eps=0.0)
          (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
          (bn3): FrozenBatchNorm2d(256, eps=0.0)
          (relu): ReLU(

We can break down our model into: 

1. **ResNet-50 Backbone**: The ResNet-50 backbone is the core component of the model. It is responsible for feature extraction from the input image. ResNet-50 is a deep convolutional neural network that consists of multiple residual blocks. These blocks enable the model to learn complex and hierarchical features from the image at different scales. The backbone network processes the image and generates a set of feature maps with varying resolutions.

2. **Feature Pyramid Network (FPN)**: FPN is used to address the challenge of detecting objects at different scales. It takes the feature maps produced by the ResNet-50 backbone and constructs a feature pyramid with multiple levels. The FPN combines low-resolution, semantically strong features with high-resolution, semantically weak features to create a set of feature maps that capture rich information at different scales. This pyramid structure allows the model to handle objects of different sizes effectively.

3. **Region Proposal Network (RPN)**: The RPN is responsible for generating potential object bounding box proposals. It operates on the feature maps produced by the FPN and applies a set of predefined anchor boxes (or anchors) to identify regions of interest. The RPN predicts the likelihood of each anchor being an object or background and refines their bounding box coordinates. The highly confident proposals are passed on for further processing.

4. **RoI (Region of Interest) Pooling**: After the RPN generates the object proposals, the RoI pooling block is applied (torchvision's implementation uses the RoIAlign variant, `MultiScaleRoIAlign`). This block extracts a fixed-size feature map from each proposed region of interest. These features are then fed into subsequent layers for classification and bounding box regression.

5. **Classification and Bounding Box Regression Heads:** The final stages of the model consist of classification and bounding box regression heads. The classification head performs object classification, predicting the probability of each proposed region belonging to different object classes. The bounding box regression head refines the coordinates of the proposed bounding boxes, aiming to improve the localization accuracy of the detected objects.
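
To make the "predefined anchor boxes" of step 3 more concrete, here is a minimal pure-Python sketch of how a set of base anchors could be built from sizes and aspect ratios. The helper name `make_base_anchors` and the example values are assumptions for illustration, not torchvision's API; torchvision's own anchor generator follows a similar idea but works on tensors and tiles the anchors over every feature map location.

```python
import math


def make_base_anchors(sizes, aspect_ratios):
    """Return anchors centered at (0, 0) as (x1, y1, x2, y2) tuples.

    Each anchor has area size**2 and height / width equal to the aspect ratio.
    Hypothetical helper, for illustration only.
    """
    anchors = []
    for size in sizes:
        for ratio in aspect_ratios:
            h = size * math.sqrt(ratio)  # taller when ratio > 1
            w = size / math.sqrt(ratio)  # wider when ratio < 1
            anchors.append((-w / 2, -h / 2, w / 2, h / 2))
    return anchors


anchors = make_base_anchors(sizes=[32], aspect_ratios=[0.5, 1.0, 2.0])
for x1, y1, x2, y2 in anchors:
    print(f"w={x2 - x1:.1f}, h={y2 - y1:.1f}, area={(x2 - x1) * (y2 - y1):.0f}")
```

The RPN scores each such anchor (shifted to every position of a feature map) as object or background and regresses offsets that turn it into a proposal.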

#Add predictions to dataset

In [45]:
from PIL import Image
from torchvision.transforms import functional as func
import fiftyone as fo

# Create a mapping from label index to class name for the desired classes,
# using the index each class is assigned in the dataset's default classes.
all_classes = dataset.default_classes
desired_classes = ['car', 'motorcycle', 'truck']
mapping = {i: name for i, name in enumerate(all_classes) if name in desired_classes}

# Add predictions to samples
predictions_view = dataset.view()
with fo.ProgressBar() as pb:
    for sample in pb(predictions_view):
        # Load image (force 3 channels, in case an image is grayscale)
        image = Image.open(sample.filepath).convert("RGB")
        image = func.to_tensor(image).to(device)
        c, h, w = image.shape

        # Perform inference without tracking gradients
        with torch.no_grad():
            preds = model([image])[0]

        # Keep only the predictions whose label is one of the desired classes.
        # The mask is computed once and reused for labels, scores and boxes.
        wanted = torch.tensor(list(mapping), device=preds["labels"].device)
        keep = torch.isin(preds["labels"], wanted)
        labels = preds["labels"][keep].cpu().numpy()
        scores = preds["scores"][keep].cpu().numpy()
        boxes = preds["boxes"][keep].cpu().numpy()

        # Convert detections to FiftyOne format
        detections = []
        for label, score, box in zip(labels, scores, boxes):
            # Convert to [top-left-x, top-left-y, width, height]
            # in relative coordinates in [0, 1] x [0, 1]
            x1, y1, x2, y2 = box
            rel_box = [x1 / w, y1 / h, (x2 - x1) / w, (y2 - y1) / h]

            detections.append(
                fo.Detection(
                    label=mapping[label],
                    bounding_box=rel_box,
                    confidence=score
                )
            )
        
        # Save predictions to dataset
        sample["faster_rcnn"] = fo.Detections(detections=detections)
        sample.save()

 100% |█████████████████| 725/725 [2.0m elapsed, 0s remaining, 5.9 samples/s]      
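
The box conversion in the loop above is easy to get wrong, so here is a small standalone sketch of it together with its inverse. The function names are illustrative assumptions; only the arithmetic mirrors the cell above, which turns the model's absolute `(x1, y1, x2, y2)` boxes into relative `[top-left-x, top-left-y, width, height]` boxes.

```python
def xyxy_abs_to_xywh_rel(box, width, height):
    """Convert an absolute (x1, y1, x2, y2) box to relative [x, y, w, h]."""
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, (x2 - x1) / width, (y2 - y1) / height]


def xywh_rel_to_xyxy_abs(rel_box, width, height):
    """Inverse conversion, e.g. for drawing a box back onto the image."""
    x, y, w, h = rel_box
    return [x * width, y * height, (x + w) * width, (y + h) * height]


# A 200x100 pixel box inside a 400x200 pixel image
rel = xyxy_abs_to_xywh_rel((100, 50, 300, 150), width=400, height=200)
back = xywh_rel_to_xyxy_abs(rel, width=400, height=200)
print(rel)   # [0.25, 0.25, 0.5, 0.5]
print(back)  # [100.0, 50.0, 300.0, 150.0]
```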




In [46]:
session.view = predictions_view

By looking at the samples, we can notice many low-score detection errors and duplicates made by the model. To overcome this issue, we can simply apply a confidence threshold and keep only the detections the model is most confident about.

In [47]:
from fiftyone import ViewField as F

# Only contains detections with confidence above threshold
threshold = 0.75
high_conf_view = predictions_view.filter_labels("faster_rcnn", F("confidence") > threshold, only_matches=False)
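
Conceptually, the filter above is a per-detection comparison against the threshold. A minimal pure-Python sketch of the same idea, using made-up detections stored as plain dictionaries (an assumption for illustration, not FiftyOne's data model):

```python
def filter_by_confidence(detections, threshold):
    """Keep detections whose confidence is strictly above the threshold."""
    return [d for d in detections if d["confidence"] > threshold]


# Made-up detections for a single image
detections = [
    {"label": "car", "confidence": 0.98},
    {"label": "car", "confidence": 0.41},
    {"label": "truck", "confidence": 0.77},
    {"label": "motorcycle", "confidence": 0.75},
]

kept = filter_by_confidence(detections, threshold=0.75)
print([d["label"] for d in kept])  # ['car', 'truck']
```

Note that the comparison is strict, so a detection at exactly 0.75 is dropped. With `only_matches=False`, samples whose detections are all filtered out remain in the view with an empty list of detections.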

In [48]:
# Print some information about the view
print(high_conf_view)

Dataset:     coco-2017-validation
Media type:  image
Num samples: 725
Sample fields:
    id:           fiftyone.core.fields.ObjectIdField
    filepath:     fiftyone.core.fields.StringField
    tags:         fiftyone.core.fields.ListField(fiftyone.core.fields.StringField)
    metadata:     fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.metadata.ImageMetadata)
    ground_truth: fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    faster_rcnn:  fiftyone.core.fields.EmbeddedDocumentField(fiftyone.core.labels.Detections)
    eval_tp:      fiftyone.core.fields.IntField
    eval_fp:      fiftyone.core.fields.IntField
    eval_fn:      fiftyone.core.fields.IntField
View stages:
    1. FilterLabels(field='faster_rcnn', filter={'$gt': ['$$this.confidence', 0.75]}, only_matches=False, trajectories=False)


In [49]:
# Print a prediction from the view to verify that its confidence is > 0.75
sample = high_conf_view.first()
print(sample.faster_rcnn.detections[0])

<Detection: {
    'id': '64708c5ba61f8abe35b62bb8',
    'attributes': {},
    'tags': [],
    'label': 'motorcycle',
    'bounding_box': [
        0.8074630737304688,
        0.301586288589615,
        0.19253692626953126,
        0.26718273678341425,
    ],
    'mask': None,
    'confidence': 0.9974988102912903,
    'index': None,
}>


In [50]:
# Load high confidence view in the App
session.view = high_conf_view

We can clearly see that the detections are cleaner this way. Changing the threshold value leads to different results, because there is a trade-off between precision and recall: a high threshold yields fewer but more confident detections, while a low threshold yields more detections, some of them with low confidence. If high precision is crucial, we may choose a higher threshold. If maximizing recall is more important, we may opt for a lower threshold that captures more true objects at the cost of more false positives.

For example, in scenarios where false positives are highly undesirable, we may opt for a higher threshold to be more conservative. In my opinion, if we were dealing with autonomous vehicles, detecting objects that are not present can lead to unnecessary braking or evasive maneuvers, which can be dangerous or disruptive to traffic flow. Emphasizing precision reduces the likelihood of such false positives and minimizes unnecessary interventions. Furthermore, autonomous cars typically have limited computational resources, and fewer false positives mean fewer detections for downstream components to process.

However, a very high threshold that maximizes precision may lead to missed detections (lower recall) of certain objects, which could be problematic for safety.
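
To illustrate the trade-off, here is a small sketch that computes precision and recall at several thresholds. The detections and the number of ground-truth objects are made up, and each detection is assumed to be already marked as a true or false positive.

```python
def precision_recall(detections, num_ground_truth, threshold):
    """Precision and recall over the detections scoring above `threshold`.

    `detections` is a list of (confidence, is_true_positive) pairs.
    """
    kept = [is_tp for conf, is_tp in detections if conf > threshold]
    tp = sum(kept)
    precision = tp / len(kept) if kept else 1.0  # convention when nothing is predicted
    recall = tp / num_ground_truth
    return precision, recall


# Made-up detections: (confidence, matched a ground-truth object?)
detections = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
              (0.60, False), (0.40, True), (0.30, False), (0.20, False)]
num_ground_truth = 5

for threshold in (0.25, 0.5, 0.75, 0.85):
    p, r = precision_recall(detections, num_ground_truth, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, precision rises and recall falls as the threshold increases, which is the behaviour described above.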

#Detection Evaluation 

The evaluate_detections function in FiftyOne is specifically designed to evaluate object detection results. It takes as input ground truth annotations and predicted detections, and then computes various evaluation metrics to assess the performance of the model. These metrics commonly include precision, recall, average precision (AP), and mean average precision (mAP).
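
Whether a prediction counts as a true positive depends on its Intersection over Union (IoU) with a ground-truth box of the same class; the `eval_iou` attribute in the ground-truth detection printed earlier is such a value. Below is a minimal sketch of the IoU computation for boxes in the `[x, y, width, height]` format used by FiftyOne. The function name is an assumption for illustration, and, to my knowledge, the default IoU threshold for a match in FiftyOne's COCO-style evaluation is 0.5.

```python
def iou_xywh(box_a, box_b):
    """IoU of two boxes given as [x, y, width, height]."""
    ax1, ay1, aw, ah = box_a
    bx1, by1, bw, bh = box_b
    ax2, ay2 = ax1 + aw, ay1 + ah
    bx2, by2 = bx1 + bw, by1 + bh
    # Width and height of the overlap, clamped at zero when the boxes are disjoint
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


ground_truth = [0.0, 0.0, 0.5, 0.5]
prediction = [0.25, 0.0, 0.5, 0.5]  # same size, shifted right by half its width
print(iou_xywh(ground_truth, prediction))
```

Here the overlap covers half of each box, so the IoU is one third, which would not count as a match at a 0.5 threshold.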

In [51]:
# Evaluate the predictions in the `faster_rcnn` field of our `high_conf_view`
# with respect to the objects in the `ground_truth` field
results = high_conf_view.evaluate_detections(
    "faster_rcnn",
    gt_field="ground_truth",
    eval_key="eval",
    compute_mAP=True,
)

Evaluating detections...

 100% |█████████████████| 725/725 [16.3s elapsed, 0s remaining, 43.8 samples/s]

Performing IoU sweep...

 100% |█████████████████| 725/725 [10.2s elapsed, 0s remaining, 90.5 samples/s]


In [52]:
# Print a classification report 
results.print_report(classes=['car','motorcycle','truck'])

              precision    recall  f1-score   support

         car       0.79      0.61      0.68      1986
  motorcycle       0.82      0.62      0.71       372
       truck       0.71      0.41      0.52       415

   micro avg       0.78      0.58      0.67      2773
   macro avg       0.77      0.55      0.64      2773
weighted avg       0.78      0.58      0.66      2773
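
As a sanity check on how to read the report, the macro and weighted averages can be recomputed from the per-class rows. This is only a sketch based on the rounded values printed above, so small rounding differences would be possible in general.

```python
# Per-class rows copied from the report above: (precision, recall, support)
report = {
    "car": (0.79, 0.61, 1986),
    "motorcycle": (0.82, 0.62, 372),
    "truck": (0.71, 0.41, 415),
}


def macro_avg(values):
    """Unweighted mean over classes."""
    return sum(values) / len(values)


def weighted_avg(values, supports):
    """Mean over classes, weighted by the number of ground-truth objects."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)


precisions = [p for p, _, _ in report.values()]
recalls = [r for _, r, _ in report.values()]
supports = [s for _, _, s in report.values()]

print(f"macro avg:    precision={macro_avg(precisions):.2f}  recall={macro_avg(recalls):.2f}")
print(f"weighted avg: precision={weighted_avg(precisions, supports):.2f}  "
      f"recall={weighted_avg(recalls, supports):.2f}")
```

The macro average treats the three classes equally, whereas the weighted average is dominated by `car`, which accounts for 1986 of the 2773 ground-truth objects.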



In [53]:
print(results.mAP())

0.30566323550504065


In [54]:
plot = results.plot_pr_curves(classes=['car','motorcycle','truck'])
plot.show()

The confidence threshold used for filtering has a large impact on the evaluation metrics, as do the IoU thresholds used in the sweep. For example:

*   confidence threshold = 0.85: mAP = 0.27
*   confidence threshold = 0.75: mAP = 0.30
*   confidence threshold = 0.6: mAP = 0.33
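
One likely reason why mAP goes up as the confidence threshold goes down: average precision summarizes the whole precision-recall curve, and filtering out low-confidence detections cuts off the high-recall end of that curve. The sketch below shows this on made-up data with a 101-point interpolated AP in the style of COCO evaluation, for a single class and a single IoU threshold. It is a simplified illustration, not FiftyOne's implementation, which also averages over classes and over several IoU thresholds.

```python
def average_precision(detections, num_ground_truth):
    """101-point interpolated AP for one class at one IoU threshold.

    `detections` is a list of (confidence, is_true_positive) pairs.
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    precisions, recalls, tp = [], [], 0
    for rank, (_, is_tp) in enumerate(ranked, start=1):
        tp += is_tp
        precisions.append(tp / rank)
        recalls.append(tp / num_ground_truth)
    # Interpolation step: make precision non-increasing as recall grows
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sample the curve at recall = 0.00, 0.01, ..., 1.00
    total = 0.0
    for k in range(101):
        level = k / 100
        total += next((p for p, r in zip(precisions, recalls) if r >= level), 0.0)
    return total / 101


# Made-up detections: (confidence, matched a ground-truth object?)
detections = [(0.95, True), (0.90, True), (0.80, False), (0.70, True),
              (0.60, False), (0.40, True), (0.30, False), (0.20, False)]
num_ground_truth = 5

ap_by_threshold = {}
for threshold in (0.0, 0.5, 0.75):
    kept = [d for d in detections if d[0] > threshold]
    ap_by_threshold[threshold] = average_precision(kept, num_ground_truth)
    print(f"threshold={threshold:.2f}  AP={ap_by_threshold[threshold]:.3f}")
```

On this toy data, AP drops as the threshold rises, in line with the trend in the numbers reported above.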


How can we improve our results?



*   **Model Fine-Tuning**: Re-initialize the last layers (FastRCNNPredictor) and re-train for a few epochs. This involves hyperparameter tuning, learning rate scheduling, regularization techniques, and so on.
*   **Data Augmentation**: Techniques such as random crops, rotations, scaling, and flipping can help the model generalize better and handle different variations and orientations of the objects.
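
One detail of data augmentation for detection is that geometric transforms must be applied to the bounding boxes as well as to the pixels. As a minimal sketch, assuming boxes in the relative `[x, y, width, height]` format used earlier in this notebook, a horizontal flip only changes the x coordinate of each box (the helper name is hypothetical):

```python
def hflip_boxes(rel_boxes):
    """Horizontally flip relative [x, y, w, h] boxes.

    The new left edge is the mirror image of the old right edge: 1 - (x + w).
    """
    return [[1.0 - x - w, y, w, h] for x, y, w, h in rel_boxes]


boxes = [[0.125, 0.25, 0.25, 0.5]]
flipped = hflip_boxes(boxes)
print(flipped)               # [[0.625, 0.25, 0.25, 0.5]]
print(hflip_boxes(flipped))  # flipping twice restores the original boxes
```

The image itself has to be flipped with the same transform so that boxes and pixels stay aligned.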




