# Task: `Vision-Language Model`

Given an image and a caption describing a target in that image, return a bounding box corresponding to the target’s location within the image.

Note that targets within a given image are not uniquely identified by their object class (e.g. ”airplane”, “helicopter”); multiple targets within an image may be members of the same object class. Instead, targets provided will correspond to a particular target description (e.g. “black and white drone”).

Not all possible target descriptions will be represented in the training dataset provided to participants. There will also be unseen targets and novel descriptions in the test data used in the hidden test cases of the Virtual Qualifiers, Semi-Finals / Finals. As such, Guardians will have to develop vision models capable of understanding **natural language** to identify the correct target from the scene.

For the **image datasets** provided to both Novice and Advanced Guardians, there will be no noise present. However, it is worth noting that your models will have to be adequately robust as the hidden test cases for the Virtual Qualifiers and the Semi-Finals/Finals will have increasing amounts of noise introduced. This is especially crucial for **Advanced Guardians**, due to the degradation of their robot sensors.

In [32]:
##import all the libraries

import albumentations
from PIL import Image
import IPython.display as display
import torch
import requests
from transformers import CLIPProcessor, CLIPModel
import matplotlib.pyplot as plt
import torch
import json
from sklearn.model_selection import train_test_split


import torchvision
from torchvision.transforms import functional as F
from torchvision import transforms
from torchinfo import summary
import urllib
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [26]:
## random directories that will be needed

cur_dir = os.getcwd()
vlm_dir = os.path.dirname(cur_dir)
til_dir = os.path.dirname(vlm_dir)
home_dir = os.path.dirname(til_dir)
test_dir = os.path.join(home_dir, 'novice')
img_dir = os.path.join(test_dir, 'images')

##training data to be added to tune the models
metadata_path = os.path.join(test_dir, 'vlm.jsonl')

img_dir

'/home/jupyter/novice/images'

In [35]:
import os
import json
from PIL import Image
import torch
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor
from torch.utils.data import DataLoader
from sklearn.model_selection import train_test_split

# Define your image preprocessing transformations
image_transform = Compose([
    Resize((224, 224)),
    ToTensor(),
])

# Define function to preprocess captions (you may need to customize this based on your data)
def preprocess_caption(caption):
    # Tokenize the caption, handle punctuation, lowercase, etc.
    # You may use tokenization libraries like nltk or spaCy for this task
    # For simplicity, let's assume the caption is already preprocessed
    return caption

# Load captions from JSON file with preprocessing
def load_captions_from_json(json_file):
    captions_data = []
    with open(json_file, 'r') as f:
        for line in f:
            data = json.loads(line)
            annotations = data.get('annotations', [])
            for annotation in annotations:
                caption = preprocess_caption(annotation['caption'])
                captions_data.append((data['image'], caption))
    return captions_data

# Load images from a folder with preprocessing
def load_image(image_path):
    return image_transform(Image.open(image_path))


# Load captions from JSON with preprocessing
captions_data = load_captions_from_json(metadata_path)

# Split the combined data into training and test sets
train_data, test_data = train_test_split(captions_data, test_size=0.2, random_state=42)

# Define your custom datasets and data loaders for training and test sets
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image_filename, caption = self.data[idx]
        image_path = os.path.join(img_dir, image_filename)
        
        # Load and preprocess image
        image = load_image(image_path)
        
        return image, caption

# Define your custom datasets and data loaders for training and test sets
train_dataset = CustomDataset(data=train_data)
test_dataset = CustomDataset(data=test_data)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)


In [15]:
##load the model
## use 2 model to take into account both pictures and text

import torch
from transformers import DetrForObjectDetection, DetrImageProcessor
from transformers import CLIPProcessor, CLIPModel

device = "cuda" if torch.cuda.is_available() else "cpu"

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

detr_model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50').to(device)
detr_processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')

INFO:timm.models._builder:Loading pretrained weights from Hugging Face hub (timm/resnet50.a1_in1k)
INFO:timm.models._hub:[timm/resnet50.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
INFO:timm.models._builder:Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. i

In [18]:
##preprocess the data

## Transfer Learning to make new model adapt to current army data set
Feature extraction

1. **Instantiating a Pre-Trained Model with Weights**
   - Initialize a pre-trained model with its pre-existing weights.

2. **Replacing Classifier Heads**
   - Replace the output layer with a new one that corresponds to the number of categories in our target dataset.

3. **Task-Specific Training**
   - Freeze all the layers from the pre-trained model, leaving only the outer layer (classifier head) to be trained.

since smaller dataset

In [20]:
### placeholder


import torch
from torch.utils.data import DataLoader
from torchvision.transforms import Compose, Resize, CenterCrop, ToTensor
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from transformers import DetrForObjectDetection, DetrImageProcessor

# Define image preprocessing transformations
image_transform = Compose([
    Resize((224, 224)),  # Resize image to a fixed size
    ToTensor(),          # Convert image to tensor
])

# Define your dataset class with feature selection and preprocessing
class CustomDataset(torch.utils.data.Dataset):
    def __init__(self, data, image_transform):
        self.data = data
        self.image_transform = image_transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        image_path, caption, selected_features, bounding_box = self.data[idx]
        
        # Load and preprocess image
        image = Image.open(image_path)
        image = self.image_transform(image)
        
        return image, caption, selected_features, bounding_box

# Define your custom data loader
def collate_fn(batch):
    images, captions, selected_features, bounding_boxes = zip(*batch)
    return images, captions, selected_features, bounding_boxes

# Initialize CLIP and DETR models
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

detr_model = DetrForObjectDetection.from_pretrained('facebook/detr-resnet-50')
detr_processor = DetrImageProcessor.from_pretrained('facebook/detr-resnet-50')

# Define loss functions
clip_loss_fn = ...
detr_loss_fn = ...

# Define optimizer
clip_optimizer = torch.optim.Adam(clip_model.parameters(), lr=clip_lr)
detr_optimizer = torch.optim.Adam(detr_model.parameters(), lr=detr_lr)

# Define your dataset and data loader
train_dataset = CustomDataset(data=your_data, image_transform=image_transform)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, collate_fn=collate_fn)

# Training loop
for epoch in range(num_epochs):
    for batch in train_loader:
        images, captions, selected_features, bounding_boxes = batch
        
        # CLIP forward pass
        clip_inputs = clip_processor(text=captions, images=images, return_tensors="pt", padding=True)
        clip_outputs = clip_model(**clip_inputs)
        
        # Combine embeddings
        image_embedding = clip_outputs.last_hidden_state[:, 0, :]
        caption_embedding = clip_outputs.last_hidden_state[:, 1, :]
        combined_embedding = torch.cat((image_embedding, caption_embedding, selected_features), dim=1)
        
        # DETR forward pass
        detr_outputs = detr_model.forward(features=combined_embedding)
        
        # Compute losses
        clip_loss = clip_loss_fn(...)
        detr_loss = detr_loss_fn(...)
        
        # Backpropagation and optimization
        clip_optimizer.zero_grad()
        detr_optimizer.zero_grad()
        clip_loss.backward()
        detr_loss.backward()
        clip_optimizer.step()
        detr_optimizer.step()

# Save trained models
torch.save(clip_model.state_dict(), "clip_model.pth")
torch.save(detr_model.state_dict(), "detr_model.pth")

INFO:timm.models._builder:Loading pretrained weights from Hugging Face hub (timm/resnet50.a1_in1k)
INFO:timm.models._hub:[timm/resnet50.a1_in1k] Safe alternative available for 'pytorch_model.bin' (as 'model.safetensors'). Loading weights using safetensors.
INFO:timm.models._builder:Missing keys (fc.weight, fc.bias) discovered while loading pretrained weights. This is expected if model is being adapted.
Some weights of the model checkpoint at facebook/detr-resnet-50 were not used when initializing DetrForObjectDetection: ['model.backbone.conv_encoder.model.layer1.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer2.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer3.0.downsample.1.num_batches_tracked', 'model.backbone.conv_encoder.model.layer4.0.downsample.1.num_batches_tracked']
- This IS expected if you are initializing DetrForObjectDetection from the checkpoint of a model trained on another task or with another architecture (e.g. i

NameError: name 'clip_lr' is not defined