# Project 1 Section 4: Object Detection and Image Segmentation

In this section we learn how to detect objects in images and how to segment an image according to the objects it contains.

## Tutorial 1: Object detection

Object detection identifies which objects appear in an image and where they are located. Below, you will perform object detection with the pretrained DETR model.

DETR (DEtection TRansformer) combines a CNN backbone with a transformer encoder-decoder. It takes an image as input and outputs, for each detected object, a class label and a bounding box, which can then be drawn on the image. It was trained on COCO 2017, a dataset of 118k annotated images.




For more information about the DETR model, please see [End-to-End Object Detection with Transformers](https://arxiv.org/abs/2005.12872).

From the transformers module import the classes `DetrImageProcessor` and `DetrForObjectDetection`.

In [None]:
from transformers import DetrImageProcessor, DetrForObjectDetection

* The `DetrImageProcessor` class pre-processes images (for example, resizing and normalization) into the tensors that the DETR model expects as input.
* The `DetrForObjectDetection` class provides access to the pre-trained DETR model.

Load the image processor that belongs to the `facebook/detr-resnet-50` checkpoint.


In [None]:
image_processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")

Load the object detector model `facebook/detr-resnet-50`.

In [None]:
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")

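Inspect the object categories that the model has been trained to recognize with the `.config.id2label` attribute. COCO defines 80 object categories; the mapping may also contain placeholder entries such as `N/A` for unused indices.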

In [None]:
model.config.id2label

For example, it can recognize objects such as a bird, a cat, a dog, a car, a tie, or a bicycle.
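To check whether a particular category is supported, the mapping can be searched by name. The following is a minimal sketch that uses a small hand-written excerpt of the mapping (`ID2LABEL_EXCERPT` and `find_label_id` are illustrative names, not part of the library); in the notebook you would pass `model.config.id2label` instead.

```python
# Illustrative excerpt only; the real mapping is model.config.id2label.
ID2LABEL_EXCERPT = {
    1: "person",
    2: "bicycle",
    3: "car",
    16: "bird",
    17: "cat",
    18: "dog",
    32: "tie",
}


def find_label_id(id2label, name):
    """Return the index whose label equals `name`, or None if it is absent."""
    for index, label in id2label.items():
        if label == name:
            return index
    return None


print(find_label_id(ID2LABEL_EXCERPT, "cat"))
print(find_label_id(ID2LABEL_EXCERPT, "unicorn"))
```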

We now load the following image from the Internet.

In [None]:
from PIL import Image, ImageDraw
import requests
import torch

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image

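Preprocess the image, run the model on it, and record the original image size as (height, width), which is needed later to rescale the predicted boxes. Note that `image.size` returns (width, height), hence the reversal.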

In [None]:
inputs = image_processor(images = image,
                         return_tensors = "pt")
outputs = model(**inputs)
target_sizes = torch.tensor([image.size[::-1]])

With the `post_process_object_detection()` method, return a dictionary which will contain the objects detected in the image. The three key-value pairs in the dictionary are:

* `scores`: the confidence of each detected object
* `labels`: the index of each detected object's class in `model.config.id2label`
* `boxes`: the bounding box of each detected object, given as (x_min, y_min, x_max, y_max) in pixels

The `post_process_object_detection()` method takes in the following arguments:

* the output of the model (`outputs`)
* the target size of the image (`target_sizes`)
* the threshold value (0.9) for filtering predictions, which means that only predictions with a confidence score above 0.9 (90%) are returned.
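To make the role of the threshold and the target size concrete, here is a minimal pure-Python sketch of the filtering and rescaling step. It is not the library's implementation: it assumes the raw boxes are normalized (center_x, center_y, width, height) values, and the function names and example numbers are hypothetical.

```python
def rescale_box(box, image_height, image_width):
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return [
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    ]


def filter_detections(scores, labels, boxes, target_size, threshold=0.9):
    """Keep detections whose score exceeds the threshold and rescale their boxes."""
    image_height, image_width = target_size  # (height, width), like target_sizes
    kept = {"scores": [], "labels": [], "boxes": []}
    for score, label, box in zip(scores, labels, boxes):
        if score > threshold:
            kept["scores"].append(score)
            kept["labels"].append(label)
            kept["boxes"].append(rescale_box(box, image_height, image_width))
    return kept


# Hypothetical raw predictions for an image that is 480 pixels high and 640 wide.
example = filter_detections(
    scores=[0.98, 0.42],
    labels=[17, 63],
    boxes=[[0.5, 0.5, 0.25, 0.5], [0.2, 0.2, 0.1, 0.1]],
    target_size=(480, 640),
)
print(example)
```

Only the first prediction survives the 0.9 threshold, and its box is expressed in pixel coordinates of the original image.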


In [None]:
results = image_processor.post_process_object_detection(outputs,
                                                        target_sizes = target_sizes,
                                                        threshold = 0.9)[0]
results

Visualise the detected objects: print each object's label, confidence score and location, then draw its bounding box and label on the image. The drawing calls sit inside the loop so that every detected object is annotated, not only the last one.

In [None]:
draw = ImageDraw.Draw(image)

for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    box = [round(i, 2) for i in box.tolist()]
    name = model.config.id2label[label.item()]
    print(f"Detected {name} with confidence {round(score.item(), 3)} at location {box}")

    # Draw the bounding box around the object
    draw.rectangle(box, outline="yellow", width=2)

    # Display the label just above the box
    draw.text((box[0], box[1] - 10), name, fill="white")

image

## Tutorial 2: Semantic image segmentation

We start by importing the necessary tools.
 - torch: PyTorch, a library for building and running deep learning models.
 - torchvision: Offers tools and models specifically for image tasks.
 - PIL (Python Imaging Library): Helps us work with images (loading, displaying).
 - requests: Enables us to fetch data from the internet.
 - matplotlib: A plotting library, for showing images and results.

In [None]:
import torch
import torchvision.transforms as T
from torchvision.models.segmentation import deeplabv3_resnet101
from PIL import Image
import requests
from matplotlib import pyplot as plt

### Loading the model

We now load a pre-trained CNN, DeepLabV3 with a ResNet-101 backbone, which has been trained to assign an object class to every pixel of an image. We then set the model to evaluation mode, which disables training-only behavior such as dropout and batch-normalization updates.

In [None]:
# Note: newer torchvision versions prefer the `weights=` argument over `pretrained=True`.
model = deeplabv3_resnet101(pretrained=True)
model.eval()  # Switch to evaluation (inference) mode.

### Loading the image

We retrieve an image from the internet to analyze. The image is opened and ready to be processed, similar to selecting a photograph to examine.

In [None]:
from io import BytesIO

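image_url = "https://cdn.pixabay.com/photo/2018/10/01/09/21/pets-3715733_960_720.jpg"
r = requests.get(image_url, timeout=20)  # Use image_url, not the `url` left over from Tutorial 1.
r.raise_for_status()

image = Image.open(BytesIO(r.content)).convert("RGB")  # Ensure three channels for normalization.
plt.imshow(image)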

### Preparing the image

We now transform the image into a format suitable for the model. It's akin to translating a document into a language the expert (our CNN) understands. This step ensures the image is correctly interpreted by the CNN.

In [None]:
# Before analyzing, we need to adjust the image to the format the model expects.
# This involves converting the image to a tensor (a multi-dimensional array used in AI models) and normalizing it.
preprocess = T.Compose([
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(image)
input_batch = input_tensor.unsqueeze(0)  # The model expects a batch of images, even if there's just one.
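The arithmetic behind this preprocessing can be illustrated on a single pixel. The sketch below assumes an 8-bit RGB input: `T.ToTensor()` scales each channel to the range [0, 1], and `T.Normalize` then subtracts the per-channel mean and divides by the per-channel standard deviation. The helper name `normalize_pixel` is hypothetical.

```python
# Channel statistics used in the preprocessing cell above.
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb):
    """Map one 8-bit RGB pixel to the values the model receives."""
    scaled = [value / 255 for value in rgb]  # the scaling done by T.ToTensor()
    # The per-channel shift and scale done by T.Normalize()
    return [(s - m) / sd for s, m, sd in zip(scaled, MEAN, STD)]


print(normalize_pixel((255, 128, 0)))
```

A black pixel therefore maps to negative values and a white pixel to values a little above 2, which is the input range the pre-trained weights were fitted to.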

### Analyzing the image

Here, the model examines the image and determines which parts of the image belong to different objects. This process is similar to asking an expert to identify and categorize different elements in a photograph.


In [None]:
# Now, we ask our pre-trained model to analyze the image and give us the segmentation result.
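with torch.no_grad():  # Disable gradient tracking; we are only running inference.
    output = model(input_batch)['out'][0]  # Per-class scores, shape (num_classes, height, width).
output_predictions = output.argmax(0)  # For each pixel, keep the index of the highest-scoring class.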
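The `argmax(0)` step can be pictured on a tiny example. The sketch below uses plain Python lists instead of tensors and a made-up 3-class, 2x2 score map; `argmax_per_pixel` is a hypothetical helper that mirrors what `argmax` over the class dimension computes.

```python
def argmax_per_pixel(scores):
    """Given scores[c][y][x], return the best class index for each pixel."""
    num_classes = len(scores)
    height = len(scores[0])
    width = len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Made-up scores for 3 classes over a 2x2 image.
toy_scores = [
    [[2.0, 0.1], [0.3, 0.2]],  # class 0
    [[0.5, 1.5], [0.1, 0.9]],  # class 1
    [[0.1, 0.2], [1.2, 0.4]],  # class 2
]
toy_predictions = argmax_per_pixel(toy_scores)
print(toy_predictions)
```

Each entry of the result is a class index, so the segmentation map has the same height and width as the image but a single channel.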

### Visualizing the results

This final step displays the original image alongside the segmentation result produced by our model. The segmentation map uses different colors to represent various parts of the image identified by the model, offering a visual representation of the model's "understanding" of the scene.

In [None]:
# Finally, we visualize the results. We'll show the original image and the model's understanding of it side by side.
plt.figure(figsize=(15, 5))
plt.subplot(1, 2, 1)
plt.imshow(image)
plt.title('Original Image')
plt.axis('off')  # Hide the axis for a cleaner look
plt.subplot(1, 2, 2)
# We use a color map to differentiate the segments identified by the model.
plt.imshow(output_predictions.byte().cpu().numpy(), cmap='nipy_spectral', interpolation='nearest')
plt.title('Segmentation Result')
plt.axis('off')
plt.show()

Observe how the CNN recognizes and distinguishes between cats and dogs. The cats are colored in green and the dogs in gray. Note, however, that the model incorrectly predicts a cat in front of the two dogs.
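Colors alone do not say which class each segment belongs to. A minimal sketch for summarizing a prediction map by class name follows; it assumes the class indices follow the 21-entry Pascal VOC ordering commonly used with torchvision's pre-trained segmentation models (treat the list as an assumption and check it against your torchvision version), and the small map is made up for illustration.

```python
from collections import Counter

# Assumed class ordering (Pascal VOC categories plus background).
VOC_CLASSES = [
    "__background__", "aeroplane", "bicycle", "bird", "boat", "bottle", "bus",
    "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike",
    "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor",
]


def pixels_per_class(prediction_map):
    """Count how many pixels were assigned to each class name."""
    counts = Counter(index for row in prediction_map for index in row)
    return {VOC_CLASSES[index]: n for index, n in counts.items()}


# Made-up 3x3 prediction map: 0 = background, 8 = cat, 12 = dog.
tiny_map = [
    [0, 8, 8],
    [0, 12, 12],
    [0, 0, 12],
]
summary = pixels_per_class(tiny_map)
print(summary)
```

In the notebook, `output_predictions.tolist()` yields nested lists of class indices that could be passed to this helper.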

## Assessment

### Task 1

Perform object detection on this image [cat and dog](https://www.companionanimalclinicvirginia.com/wp-content/uploads/2018/12/white_cat_and_dog.jpg) by using the DETR pre-trained model. Here is the URL `https://www.companionanimalclinicvirginia.com/wp-content/uploads/2018/12/white_cat_and_dog.jpg`.

### Task 2

Perform semantic image segmentation on this [image](https://cdn.pixabay.com/photo/2017/12/27/14/02/friends-3042751_960_720.jpg). Here is the URL `https://cdn.pixabay.com/photo/2017/12/27/14/02/friends-3042751_960_720.jpg`. Use the following models:

* FCN: You can load this model with the command `model = fcn_resnet101(pretrained=True)`. Do not forget to import it from `torchvision.models.segmentation`.
* LRASPP MobileNetV3: You can load this model with the command `model = lraspp_mobilenet_v3_large(pretrained=True)`. It is imported from the same module.

Which model works best for this image? Repeat the comparison on the image of the cats and dogs from Tutorial 2. Which model works best for that image?