## 0. Overview

This benchmark was created with the help of an LLM. A quick look at the data shows that the ``prompt`` column holds a single value (the same prompt for every row) and that every column other than ``image`` and ``cadquery`` is metadata.

This suggested solution uses Google's pre-trained pix2struct-base model. The model takes an image as input and generates the corresponding CadQuery code as output.

**Note:** After the train/test split, the train data contains no ``hundred_subset=True`` samples, because the data is not shuffled before the split. This *could* affect the results if the ``True`` samples differ systematically from the rest.
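
The ordering effect described in the note can be illustrated with a toy example. This is only a sketch, not the dataset's actual split logic: the record layout, the `split_records` and `count_flagged` helpers, and the assumption that the flagged rows sit at the end of the data are all hypothetical.

```python
import random


def split_records(records, train_fraction=0.8, shuffle=False, seed=0):
    """Split a list of records into (train, test), optionally shuffling first."""
    records = list(records)
    if shuffle:
        # A fixed seed keeps the split reproducible.
        random.Random(seed).shuffle(records)
    cut = int(len(records) * train_fraction)
    return records[:cut], records[cut:]


def count_flagged(records):
    """Count records whose hundred_subset flag is True."""
    return sum(1 for r in records if r["hundred_subset"])


# Hypothetical toy data: 1000 rows, with the 100 flagged rows placed last.
toy = [{"id": i, "hundred_subset": i >= 900} for i in range(1000)]

train_plain, test_plain = split_records(toy)
train_shuf, test_shuf = split_records(toy, shuffle=True)

print("flagged in train without shuffling:", count_flagged(train_plain))
print("flagged in train with shuffling:   ", count_flagged(train_shuf))
```

With a Hugging Face `Dataset`, the analogous step would be calling `shuffle(seed=...)` before `select(...)`, so that the selected quarter is not just the first rows of the split.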

In [3]:
import torch
from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
from torch.utils.data import Dataset, DataLoader
from datasets import Dataset as HFDataset, load_dataset  # Hugging Face Dataset
from tqdm import tqdm
from metrics.best_iou import get_iou_best
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple

## 1. On to the Real Thing

In [4]:
class CADDataset(Dataset):
    def __init__(self, hf_dataset, processor):
        self.data = hf_dataset  # Directly use HF Dataset (no pandas needed)
        self.processor = processor

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]  # Use direct indexing for HF Dataset
        
        # Process image
        image_encoding = self.processor(
            images=item['image'], 
            return_tensors="pt"
        )
        
        # Process text (cadquery) for labels
        text_encoding = self.processor(
            text=item['cadquery'],
            return_tensors="pt",
            padding="max_length",
            truncation=True,
            max_length=512
        )
        
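        # Replace padding token ids with -100 so the loss ignores padded positions
        labels = text_encoding['input_ids']
        labels[labels == self.processor.tokenizer.pad_token_id] = -100

        # Combine the encodings
        encoding = {
            'flattened_patches': image_encoding['flattened_patches'],
            'attention_mask': image_encoding['attention_mask'],
            'labels': labels
        }

        # Drop the leading batch dimension; the DataLoader adds it back
        return {k: v.squeeze(0) for k, v in encoding.items()}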

### Model Sweet Model

In [8]:
device = "cpu"
processor = Pix2StructProcessor.from_pretrained("google/pix2struct-base")
model = Pix2StructForConditionalGeneration.from_pretrained("google/pix2struct-base").to(device)

# 3. Load Data
ds = load_dataset("CADCODER/GenCAD-Code", num_proc=16, split=["train", "test"], 
                 cache_dir="/Volumes/BIG-DATA/HUGGINGFACE_CACHE")

# 4. Create Datasets
train_dataset = ds[0]
test_dataset = ds[1]

train_data = train_dataset.select(range(len(train_dataset) // 4 ))
test_data = test_dataset.select(range(len(test_dataset) // 4))

# Wrap only the training split; evaluation reads raw samples from test_data
train_data = CADDataset(train_data, processor)

# 5. Training Setup
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
train_loader = DataLoader(train_data, batch_size=4, shuffle=True)


### Training 

In [None]:
step = 0
for epoch in range(5):
    model.train()
    for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}"):
        batch = {k: v.to(device) for k, v in batch.items()}

        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()

        print(f"Loss: {loss.item():.4f}\n")
        step += 1

        # Checkpoint after every batch, because the machine kept crashing mid-run
        model.save_pretrained("pix2struct-cad-model-light")
        processor.save_pretrained("pix2struct-cad-model-light")

        # Stop after two batches; a full epoch is not feasible on this CPU
        if step == 2:
            break
    break


Epoch 1:   0%|          | 0/9206 [00:00<?, ?it/s]

Loss: 19.7514



Epoch 1:   0%|          | 1/9206 [02:06<323:50:24, 126.65s/it]

Loss: 15.6811



Epoch 1:   0%|          | 1/9206 [04:11<642:57:54, 251.46s/it]


### Evaluation

In [None]:
# import torch
# from transformers import Pix2StructForConditionalGeneration, Pix2StructProcessor
# from datasets import Dataset as HFDataset
# from tqdm import tqdm
import numpy as np

# 1. Load saved model and processor
model_path = "pix2struct-cad-model-light"

processor = Pix2StructProcessor.from_pretrained(model_path)
model = Pix2StructForConditionalGeneration.from_pretrained(model_path).to(device)
model.eval()

# 2. Define generation function
def generate_code(model, processor, image, max_length=512):
    inputs = processor(images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(**inputs, max_new_tokens=max_length)
    return processor.decode(generated_ids[0], skip_special_tokens=True)

# 3. Evaluation function
def evaluate_model(model, test_data, processor, num_samples=None):
    if num_samples is None:
        num_samples = len(test_data)
    
    generated_codes = []
    target_codes = []
    
    for i in tqdm(range(num_samples), desc="Evaluating"):
        sample = test_data[i]
        image = sample['image']
        target_code = sample['cadquery']
        
        # Generate prediction
        pred_code = generate_code(model, processor, image)
        
        generated_codes.append(pred_code)
        target_codes.append(target_code)
    
    # Calculate metrics
    vsr = evaluate_syntax_rate_simple(generated_codes)
    iou_scores = [get_iou_best(target, pred) for target, pred in zip(target_codes, generated_codes)]
    avg_iou = np.mean(iou_scores)
    
    return {
        "valid_syntax_rate": vsr,
        "average_iou": avg_iou,
        "generated_examples": list(zip(target_codes, generated_codes))[:5]  # First 5 samples
    }


# 4. Run evaluation
results = evaluate_model(model, test_data, processor, num_samples=100)  # Evaluate on 100 samples

# 5. Print results
print("\nEvaluation Results:")
print(f"Valid Syntax Rate: {results['valid_syntax_rate']:.4f}")
print(f"Average IOU Score: {results['average_iou']:.4f}")

Evaluating:  22%|██▏       | 22/100 [48:19<5:01:46, 232.14s/it]
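
The valid syntax rate above comes from ``evaluate_syntax_rate_simple``, whose implementation is not shown in this notebook. As a rough idea of what such a metric could look like, here is a minimal stand-in that assumes "valid" simply means the generated string parses as Python. The names `is_valid_python` and `syntax_rate` are hypothetical and are not the repo's API.

```python
import ast


def is_valid_python(code):
    """Return True if the string parses as Python source."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        # ValueError covers inputs such as strings containing null bytes.
        return False


def syntax_rate(codes):
    """Fraction of generated snippets that parse; 0.0 for an empty list."""
    if not codes:
        return 0.0
    return sum(is_valid_python(c) for c in codes) / len(codes)


samples = [
    "import cadquery as cq\nresult = cq.Workplane('XY').box(1, 2, 3)",
    "result = cq.Workplane('XY'.box(1, 2, 3",  # unbalanced parentheses
]
print(f"Syntax rate: {syntax_rate(samples):.2f}")
```

Parsing only checks syntax; it does not run the code or verify that the CadQuery calls produce a solid, so it is a looser check than the IoU metric.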

## Comments


As you can see, the model was saved to ``pix2struct-cad-model-light`` during the first epoch, after the second batch. Why? Because my PC crashed every time I tried to train for longer. I reduced the dataset to 1/4 of its size, but that still didn't help. After all, we're running images through a transformer on a CPU.
The evaluation was also very slow (about 232 s per sample in the log above) and had only reached 22% at the time this repo was pushed. We might push another commit once it finishes.
To recap, we couldn't analyse any results, so the baseline was not improved, but the idea is there.
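
To put the slowdown in numbers, the progress-bar figures from the logs can be turned into a rough time budget. This is a back-of-the-envelope sketch: the step counts and seconds-per-step values are copied from the output above, the helper name `estimated_hours` is made up for illustration, and the per-step time is assumed to stay constant.

```python
def estimated_hours(num_steps, seconds_per_step):
    """Total wall-clock time in hours, assuming a constant time per step."""
    return num_steps * seconds_per_step / 3600


# Values taken from the tqdm output: 9206 batches at ~126.65 s each,
# and 100 evaluation samples at ~232.14 s each.
epoch_hours = estimated_hours(9206, 126.65)
eval_hours = estimated_hours(100, 232.14)

print(f"One training epoch: about {epoch_hours:.0f} hours")
print(f"Five epochs:        about {5 * epoch_hours / 24:.0f} days")
print(f"Evaluation (100):   about {eval_hours:.1f} hours")
```

At roughly 324 hours per epoch, the five-epoch loop was never going to finish on this machine, which is why training was cut off after two batches.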

**Enhancements:**
- Use a **VM** for training. I actually laughed when I remembered "Enhance by any manner the baseline model and evaluate it again" from the instructions. I was lucky enough to see the **blue screen** twice today :'')
- Fine-tune hyperparameters: learning rate, batch size, and number of epochs.
- Add another metric, such as BLEU.
- Use pix2struct-large for better performance.
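
For the BLEU idea in the list above, here is a simplified sentence-level sketch using only the standard library. It is an assumption-laden illustration rather than a reference implementation: it tokenizes on whitespace, uses a single reference, applies no smoothing, and the function names `ngram_counts` and `simple_bleu` are hypothetical. A maintained library would be the better choice for reported numbers.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(reference, candidate, max_n=4):
    """Unsmoothed single-reference BLEU on whitespace tokens."""
    ref, cand = reference.split(), candidate.split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing: any empty precision gives zero
        log_precisions.append(math.log(clipped / total))

    # Brevity penalty for candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


target = "import cadquery as cq result = cq.Workplane('XY').box(1, 2, 3)"
pred = "import cadquery as cq result = cq.Workplane('XY').box(1, 2, 4)"
print(f"BLEU-like score: {simple_bleu(target, pred):.3f}")
```

A text-overlap score like this only measures how close the generated code is to the target string, so it would complement the IoU metric, which compares the resulting geometry, rather than replace it.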