<a href="https://colab.research.google.com/github/thad75/TP-ENSEA-ELEVE/blob/main/3A/SIA/TP%25202023%25202024/TDmODTransformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# I - Object Detection and Segmentation

Let's apply what we have seen in class so far by training an instance segmentation model: Mask R-CNN. It builds on Faster R-CNN, a model that predicts both bounding boxes and class scores for potential objects in the image.
Mask R-CNN adds an extra branch to Faster R-CNN, which also predicts a segmentation mask for each instance.

As we only have 2 hours, let's fine-tune a pretrained model instead of training it from scratch.

In [None]:
!wget https://www.cis.upenn.edu/~jshi/ped_html/PennFudanPed.zip
!unzip PennFudanPed.zip

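import os  # required by the os.system calls below

# Download the torchvision detection reference helpers
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/engine.py")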
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/utils.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_utils.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/coco_eval.py")
os.system("wget https://raw.githubusercontent.com/pytorch/vision/main/references/detection/transforms.py")


In [None]:
%matplotlib inline
import os
import torch

from torchvision.io import read_image
from torchvision.ops.boxes import masks_to_boxes
from torchvision import tv_tensors
from torchvision.transforms.v2 import functional as F
from torchvision.transforms import v2 as T
import matplotlib.pyplot as plt
import numpy as np
from engine import train_one_epoch, evaluate
import utils

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')

def get_transform(train):
    transforms = []
    if train:
        transforms.append(T.RandomHorizontalFlip(0.5))
    transforms.append(T.ToDtype(torch.float, scale=True))
    transforms.append(T.ToPureTensor())
    return T.Compose(transforms)
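
Because `T.RandomHorizontalFlip` from `transforms.v2` receives the image and its target together, boxes and masks have to be mirrored along with the pixels. Here is a minimal pure-Python sketch of the box part, assuming XYXY boxes. `hflip_boxes` is a hypothetical helper written for illustration, not a torchvision function, and the box values are made up.

```python
def hflip_boxes(boxes, image_width):
    """Mirror XYXY boxes around the vertical axis of an image of the given width."""
    flipped = []
    for x1, y1, x2, y2 in boxes:
        # The old right edge becomes the new left edge, and vice versa;
        # the vertical coordinates are unchanged.
        flipped.append((image_width - x2, y1, image_width - x1, y2))
    return flipped


boxes = [(10, 20, 50, 80)]  # hypothetical box in a 100-pixel-wide image
flipped = hflip_boxes(boxes, image_width=100)
print(flipped)  # [(50, 20, 90, 80)]
```

Flipping twice gives back the original boxes, which is a quick sanity check for this kind of transform.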

## a - Dataset PennFudanPed

We have a dataset, and we provide a dataset class for it. Let's look at a few samples.

In [None]:
import os
import torch

from torchvision.io import read_image
from torchvision.ops.boxes import masks_to_boxes
from torchvision import tv_tensors
from torchvision.transforms.v2 import functional as F


class PennFudanDataset(torch.utils.data.Dataset):
    def __init__(self, root, transforms):
        self.root = root
        self.transforms = transforms
        # load all image files, sorting them to
        # ensure that they are aligned
        self.imgs = list(sorted(os.listdir(os.path.join(root, "PNGImages"))))
        self.masks = list(sorted(os.listdir(os.path.join(root, "PedMasks"))))

    def __getitem__(self, idx):
        # load images and masks
        img_path = os.path.join(self.root, "PNGImages", self.imgs[idx])
        mask_path = os.path.join(self.root, "PedMasks", self.masks[idx])
        img = read_image(img_path)
        mask = read_image(mask_path)
        # instances are encoded as different colors
        obj_ids = torch.unique(mask)
        # first id is the background, so remove it
        obj_ids = obj_ids[1:]
        num_objs = len(obj_ids)

        # split the color-encoded mask into a set
        # of binary masks
        masks = (mask == obj_ids[:, None, None]).to(dtype=torch.uint8)

        # get bounding box coordinates for each mask
        boxes = masks_to_boxes(masks)

        # there is only one class
        labels = torch.ones((num_objs,), dtype=torch.int64)

        image_id = idx
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2] - boxes[:, 0])
        # suppose all instances are not crowd
        iscrowd = torch.zeros((num_objs,), dtype=torch.int64)

        # Wrap sample and targets into torchvision tv_tensors:
        img = tv_tensors.Image(img)

        target = {}
        target["boxes"] = tv_tensors.BoundingBoxes(boxes, format="XYXY", canvas_size=F.get_size(img))
        target["masks"] = tv_tensors.Mask(masks)
        target["labels"] = labels
        target["image_id"] = image_id
        target["area"] = area
        target["iscrowd"] = iscrowd

        if self.transforms is not None:
            img, target = self.transforms(img, target)

        return img, target

    def __len__(self):
        return len(self.imgs)

dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))


TODO : Observe a sample of the dataset. What is inside ?
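
To see what `__getitem__` does with the color-encoded mask, here is a small sketch on a made-up 5x6 mask. It uses NumPy instead of torch so it runs anywhere. `mask_to_box` is a hypothetical stand-in for `masks_to_boxes`, assuming the same min/max pixel convention.

```python
import numpy as np

# Toy instance mask: 0 = background, 1 and 2 = two pedestrians (made-up values)
mask = np.array([
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 2, 2],
])

# Drop the background id, then compare the mask against each remaining id.
obj_ids = np.unique(mask)[1:]
binary_masks = mask[None, :, :] == obj_ids[:, None, None]  # (num_objs, H, W)


def mask_to_box(binary_mask):
    """Return [x_min, y_min, x_max, y_max] of the True pixels."""
    ys, xs = np.nonzero(binary_mask)
    return [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())]


boxes = [mask_to_box(m) for m in binary_masks]
print(binary_masks.shape)  # (2, 5, 6)
print(boxes)               # [[1, 1, 2, 2], [4, 2, 5, 4]]
```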

## TODO : Plot a sample.

In [None]:

# Provided sample item
image_tensor, target = ...

# Convert tensor to numpy array and transpose the dimensions for visualization
image_np = ...

# Create a figure and plot the image
plt.figure(figsize=(8, 8))
plt.imshow(image_np)
plt.axis('off')

# Get bounding boxes and masks from the target dictionary
boxes = ...
masks = ...
labels = ...

# Plot bounding boxes on the image
for i, box in enumerate(boxes):
    x1, y1, x2, y2 = box
    plt.plot([x1, x2, x2, x1, x1], [y1, y1, y2, y2, y1], linewidth=2, label=f'Label: {labels[i]}')

# Plot masks on the image
for i in range(masks.shape[0]):
    mask = masks[i]
    plt.imshow(mask, alpha=0.3, cmap='viridis', interpolation='none')

plt.show()


## Model

We will start from a version pretrained on COCO for a head start. Obviously, we won't code everything ourselves: we will take a well-chosen pretrained model and replace its box and mask prediction heads to create our final model.

In [None]:
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor


def get_model_instance_segmentation(num_classes):
    # load an instance segmentation model pre-trained on COCO
    # TODO : Have a look at the following page and pick the right model : https://pytorch.org/vision/stable/models.html
    model = ...

    # Modify the model's box predictor for classification
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

    # Modify the model's mask predictor for segmentation
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    hidden_layer = 256
    model.roi_heads.mask_predictor = MaskRCNNPredictor(
        in_features_mask,
        hidden_layer,
        num_classes
    )

    return model

## Let's go training


Let's put everything together and train the model

Dataset Preparation

In [None]:
dataset = PennFudanDataset('PennFudanPed', get_transform(train=True))
dataset_test = PennFudanDataset('PennFudanPed', get_transform(train=False))

indices = torch.randperm(len(dataset)).tolist()
train_indices = indices[:-50]
test_indices = indices[-50:]

train_dataset = torch.utils.data.Subset(dataset, train_indices)
test_dataset = torch.utils.data.Subset(dataset_test, test_indices)

# TODO : Define your dataloaders
train_loader = torch.utils.data.DataLoader(
    ...,
    batch_size=2,
    shuffle=True,
    num_workers=4,
    collate_fn=utils.collate_fn
)

test_loader = torch.utils.data.DataLoader(
    ...,
    batch_size=1,
    shuffle=False,
    num_workers=4,
    collate_fn=utils.collate_fn
)
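
Both loaders pass `collate_fn=utils.collate_fn`. Detection images have different sizes and different numbers of objects, so a batch cannot simply be stacked into one tensor. The sketch below assumes the reference helper is equivalent to `tuple(zip(*batch))`, which should be checked against the downloaded `utils.py`. The samples are plain Python stand-ins, not real tensors.

```python
def collate_fn(batch):
    # Regroup [(img, target), ...] into ((img, ...), (target, ...)) without stacking.
    return tuple(zip(*batch))


# Stand-ins for two samples of different sizes (strings instead of tensors)
batch = [
    ("image_0", {"labels": [1, 1]}),
    ("image_1", {"labels": [1]}),
]
images, targets = collate_fn(batch)
print(images)        # ('image_0', 'image_1')
print(len(targets))  # 2
```

The model then receives a list of images and a list of target dictionaries rather than a single batched tensor.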

Model Preparation

In [None]:
num_classes = ... # What is the number of classes
model = ... # TODO : Create your Model
model.to(device)

params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params,
                            lr=0.005,
                            momentum=0.9,
                            weight_decay=0.0005)
lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
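
With `step_size=3` and `gamma=0.1`, `StepLR` multiplies the learning rate by 0.1 every 3 calls to `lr_scheduler.step()`. The following pure-Python sketch reproduces that schedule by hand. `step_lr` is a hypothetical helper, and the 7 epochs are an arbitrary choice for illustration.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Learning rate used during each of the first 7 epochs, starting from lr=0.005
schedule = [step_lr(0.005, epoch) for epoch in range(7)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch}: lr = {lr:.6f}")
```

Epochs 0 to 2 run at the base rate, epochs 3 to 5 at a tenth of it, and so on, which matters when choosing the number of epochs.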


Train

In [None]:
from engine import train_one_epoch, evaluate  # Assuming functions for training and evaluation are in a separate file

num_epochs = ...  # TODO : Define the number of epochs


for epoch in range(num_epochs):
    train_one_epoch(model, optimizer, train_loader, device, epoch, print_freq=10)
    lr_scheduler.step()
    evaluate(model, test_loader, device=device)

print("Training completed!")


## Testing the model

Let's test the model on a random image taken from the internet. Do not hesitate to pick another image for fun.

In [None]:
!wget https://rspcb.safety.fhwa.dot.gov/RSF/images/unit2_crosswalk.png

In [None]:
import matplotlib.pyplot as plt

from torchvision.utils import draw_bounding_boxes, draw_segmentation_masks


image = read_image("/content/unit2_crosswalk.png")
eval_transform = ... # TODO : Do we add something ?

model.eval()
with torch.no_grad():
    x = eval_transform(image)
    # convert RGBA -> RGB and move to device
    x = x[:3, ...].to(device)
    predictions = model([x, ])
    pred = ... # TODO : Get the predictions


image = (255.0 * (image - image.min()) / (image.max() - image.min())).to(torch.uint8)
image = image[:3, ...]
pred_labels = [f"pedestrian: {score:.3f}" for label, score in zip(pred["labels"], pred["scores"])]
pred_boxes = pred["boxes"].long()
output_image = draw_bounding_boxes(image, pred_boxes, pred_labels, colors="red")

masks = (pred["masks"] > ...).squeeze(1) # TODO : Select a good value for mask visualization
output_image = draw_segmentation_masks(output_image, masks, alpha=0.5, colors="blue")


plt.figure(figsize=(12, 12))
plt.imshow(output_image.permute(1, 2, 0))
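
The predicted masks are soft: each pixel holds a value between 0 and 1, and the threshold decides which pixels are drawn. The sketch below shows how the threshold changes the size of the binary mask. It uses a made-up 4x4 soft mask and three arbitrary thresholds, and is not meant to suggest a particular value.

```python
import numpy as np

# Hypothetical soft mask for one instance, values in [0, 1]
soft_mask = np.array([
    [0.05, 0.20, 0.30, 0.10],
    [0.25, 0.65, 0.80, 0.40],
    [0.30, 0.90, 0.95, 0.60],
    [0.10, 0.45, 0.55, 0.20],
])

# Count how many pixels survive each threshold
pixel_counts = {}
for threshold in (0.2, 0.5, 0.8):
    binary = soft_mask > threshold
    pixel_counts[threshold] = int(binary.sum())

print(pixel_counts)  # {0.2: 11, 0.5: 6, 0.8: 2}
```

A low threshold gives large, blurry-edged masks, while a high one keeps only the most confident pixels.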

# II - Attention Visualization

Go to this website and try to understand what each token sees: https://epfml.github.io/attention-cnn/

# III - Transformers using HuggingFace

We will classify sentences. In this case, we leverage BERT.


In [None]:
!pip install datasets transformers
!pip install accelerate -U

The goal of this lab is to fine-tune a transformer model that can accurately determine whether two given sentences are paraphrases (semantically equivalent) of each other using the Microsoft Research Paraphrase Corpus (MRPC) dataset from the [GLUE Benchmark](https://gluebenchmark.com/).

The Hugging Face Transformers library is a popular open-source library that provides pre-trained models and various utilities for working with transformer-based models.

Make sure your version of Transformers is at least 4.11.0 since the functionality was introduced in that version:

In [None]:
import transformers

print(transformers.__version__)

## Loading the dataset

Let's load the GLUE dataset and do some data exploration. Beware: the format is not the same as usual.

In [None]:
import numpy as np
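# Note: `load_metric` was removed in datasets 3.0; on recent versions, use `evaluate.load` from the separate `evaluate` package instead.
from datasets import load_dataset, load_metric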

In [None]:
dataset = load_dataset("glue", 'mrpc')
metric = load_metric('glue', 'mrpc')

In [None]:
# TODO : Explore your data


TODO : Analyse the dataset

## Preprocessing the data

Can we feed the input sentences to the model as is? No: we first need to tokenize them. HuggingFace has already implemented tokenizers for us, so let's use them.

Documentation :
https://huggingface.co/docs/transformers/main_classes/tokenizer
https://huggingface.co/transformers/v3.0.2/model_doc/auto.html

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-uncased" # TODO : We choose DistilBERT as it is a smaller, distilled model, but feel free to choose something else.
tokenizer = AutoTokenizer.from_pretrained(...,
                                          use_fast=True) # Load the tokenizer

Now let's test the tokenization.

TODO : Test different types of sentences:
- Sentences in different languages
- Sentences that have no related meaning
- Sentences with wrong words

Question : What is the attention mask ?

In [None]:
tokenizer("Hello, this one sentence!", "And this sentence goes with it to Disneyland.")
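
One way to picture the output for a sentence pair is the toy sketch below. It is not the Hugging Face implementation: it splits on whitespace, adds BERT-style `[CLS]` and `[SEP]` markers, pads to a fixed length, and builds a mask that is 1 for real tokens and 0 for padding. `toy_encode_pair` and its `max_length` argument are hypothetical names used only for illustration.

```python
def toy_encode_pair(sentence_a, sentence_b, max_length):
    """Toy sentence-pair encoding: whitespace tokens, BERT-style markers, right padding."""
    tokens = ["[CLS]"] + sentence_a.split() + ["[SEP]"] + sentence_b.split() + ["[SEP]"]
    attention_mask = [1] * len(tokens)   # 1 = real token
    n_pad = max_length - len(tokens)
    tokens += ["[PAD]"] * n_pad
    attention_mask += [0] * n_pad        # 0 = padding, to be ignored
    return tokens, attention_mask


tokens, attention_mask = toy_encode_pair("hello there", "general kenobi", max_length=10)
print(tokens)
print(attention_mask)  # [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
```

Comparing this with the real tokenizer output on padded inputs is a good way to check your answer to the question above.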

Now apply the tokenizer to the sentences of our dataset.


In [None]:
# TODO : Print one element of the dataset. How do we access them ?
print(f"Sentence 1: {dataset['train'][0][...]}")
print(f"Sentence 2: {dataset['train'][0][...]}")

In [None]:
# TODO : Define a function that applies tokenization to the input. Please keep the format of the input sentences.

def preprocess_function(examples):
  return tokenizer(examples[...], examples[...], truncation=True)

Finally we are going to reencode the whole dataset using our tokenizer.

In [None]:
encoded_dataset = dataset.map(preprocess_function, batched=True)
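
With `batched=True`, `preprocess_function` does not receive one example at a time but a dictionary of columns, each holding a list of values, and the columns it returns are added to the dataset. The sketch below mimics that behavior in plain Python. `toy_map`, the column names `text_a` / `text_b` and the `n_words` feature are all hypothetical and unrelated to the real MRPC columns.

```python
def toy_map(columns, fn, batch_size=2):
    """Call fn on slices of a column dict and add the columns it returns."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of lists, one list per column
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            new_columns.setdefault(name, []).extend(values)
    return {**columns, **new_columns}


def count_words(batch):
    # Works on whole lists at once, like a batched preprocessing function
    return {"n_words": [len(a.split()) + len(b.split())
                        for a, b in zip(batch["text_a"], batch["text_b"])]}


data = {"text_a": ["a b", "c", "d e f"], "text_b": ["x", "y z", "w"]}
encoded = toy_map(data, count_words)
print(encoded["n_words"])  # [3, 3, 4]
```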

## Fine-tuning the model

We are going to use the Auto classes from HF, which wrap many things (model, processing, training, ...) within one class. This is similar to creating a Trainer, model and module when using PyTorch Lightning.

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

num_labels = ... # TODO : Define your number of labels
model = AutoModelForSequenceClassification.from_pretrained(..., # TODO : Initialize your Pipeline with the model. Which one ?
                                                           num_labels=num_labels)

Now let's use the HF Trainer. As with the trainer in Lightning, there are lots of arguments we can use. Let's keep it simple.

In [None]:
model_name = model_checkpoint.split("/")[-1]
batch_size = 16
metric_name = ... # TODO : What Metric would you define here ?

args = TrainingArguments(
    f"{model_name}-finetuned-mrpc",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    push_to_hub=False,
)

In [None]:
# TODO : Define the metric

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = ... # TODO : What goes here ?
    return metric.compute(predictions=predictions, references=labels)
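
The GLUE metric for MRPC reports accuracy and F1. To make those numbers less of a black box, here is a hand-written sketch that computes both from discrete predictions. `accuracy_and_f1` is a hypothetical helper, the label lists are made up, and label 1 is assumed to be the positive (paraphrase) class.

```python
def accuracy_and_f1(predictions, references):
    """Accuracy and binary F1, treating label 1 as the positive class."""
    pairs = list(zip(predictions, references))
    tp = sum(p == 1 and r == 1 for p, r in pairs)  # true positives
    fp = sum(p == 1 and r == 0 for p, r in pairs)  # false positives
    fn = sum(p == 0 and r == 1 for p, r in pairs)  # false negatives
    accuracy = sum(p == r for p, r in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "f1": f1}


scores = accuracy_and_f1([1, 0, 1, 1, 0, 1], [1, 0, 0, 1, 1, 1])
print(scores)
```

On an imbalanced dataset the two numbers can differ noticeably, which is worth keeping in mind when picking the metric used to select the best checkpoint.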

Now let's init the Trainer and train and evaluate the model. It is as easy as in Lightning.

In [None]:
trainer = Trainer(
    model,
    args,
    train_dataset= ..., # TODO : What do we put here ?
    eval_dataset= ... , # TODO : What do we put here ?
    tokenizer= ..., # TODO : What do we put here ?
    compute_metrics=compute_metrics
)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

How can we improve the results ?