## ENV SETUP

1. Install uv (or set up the environment your own way).
2. Run `uv sync`
3. Run `source .venv/bin/activate`

You're good to go.

# Instructions

The task: create the best CadQuery code generator model.

1. Load the dataset (147K pairs of Images/CadQuery code).
2. Create a baseline model and evaluate it with the given metrics.
3. Enhance the baseline model by any means and evaluate it again.
4. Explain your choices and possible bottlenecks.
5. Show what enhancements you would have made if you had more time.

You can do *WHATEVER* you want. Be creative: the result is not what matters most.
You can create new model architectures, reuse ones you have used in the past, fine-tune, etc.

If you are GPU poor, there are solutions. The absolute value is not what matters; the relative value between the baseline and the enhanced model is.
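
As a small illustration of comparing the two models by relative rather than absolute value, here is a minimal sketch. The function name `relative_improvement` and the scores passed to it are made up for the example; they are not results from this notebook.

```python
def relative_improvement(baseline: float, enhanced: float) -> float:
    """Return the relative change from baseline to enhanced (0.10 means +10%)."""
    if baseline == 0:
        # With a zero baseline the ratio is undefined; report the absolute gain instead.
        raise ValueError("baseline score is zero; relative improvement is undefined")
    return (enhanced - baseline) / baseline


# Hypothetical scores, for illustration only
print(f"{relative_improvement(0.40, 0.50):+.1%}")
```

The same helper can be applied to either metric (Valid Syntax Rate or mean IOU) once both models have been evaluated on the same test subset.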

In [1]:
import torch
import os
import numpy as np  # needed later for np.mean over the IOU scores
from transformers import (
    VisionEncoderDecoderModel,
    TrOCRProcessor,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
    default_data_collator,
)
from datasets import load_dataset


cache_path = os.path.expanduser("~/huggingface_cache")

ds = load_dataset("CADCODER/GenCAD-Code", split=["train", "test"])
train_ds = ds[0].shuffle(seed=42).select(range(int(len(ds[0]) * 0.001))) # Using 0.1% of the training data to speed up training
test_ds = ds[1].shuffle(seed=42).select(range(int(len(ds[1]) * 0.02)))   # Using 2% of the test data



## Evaluation Metrics

1. The Valid Syntax Rate metric assesses the validity of the generated code by executing it and checking whether any errors are raised.
2. The Best IOU metric assesses the similarity between the meshes produced by two pieces of code.
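
To make the two ideas concrete, here is a minimal sketch of what they compute. It is not the implementation in the `metrics` package: `valid_syntax_rate` and `voxel_iou` are hypothetical names, the toy snippets are plain Python rather than CadQuery, and the solids are represented as sets of occupied voxels instead of meshes. Whatever search the real metric performs to earn the "best" in Best IOU is not reproduced here.

```python
def valid_syntax_rate(codes: dict[str, str]) -> float:
    """Fraction of snippets that execute without raising an exception."""
    if not codes:
        return 0.0
    ok = 0
    for source in codes.values():
        try:
            exec(source, {})  # fresh namespace per snippet; only run trusted code
            ok += 1
        except Exception:
            pass  # any error (including SyntaxError) counts as invalid
    return ok / len(codes)


def voxel_iou(a: set[tuple[int, int, int]], b: set[tuple[int, int, int]]) -> float:
    """Intersection over Union of two solids given as sets of occupied voxels."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Toy usage: one valid snippet, one with a syntax error
snippets = {"good": "x = 1 + 1", "bad": "x = 1 +"}
print(valid_syntax_rate(snippets))

# A 2x2x2 cube versus its bottom 2x2x1 slab
cube = {(x, y, z) for x in range(2) for y in range(2) for z in range(2)}
slab = {(x, y, 0) for x in range(2) for y in range(2)}
print(voxel_iou(cube, slab))
```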

In [2]:
from metrics.valid_syntax_rate import evaluate_syntax_rate_simple
from metrics.best_iou import get_iou_best

In [3]:
## Example usage of the metrics
sample_code = """
height = 60.0
width = 80.0
thickness = 10.0
diameter = 22.0

# make the base
result = (
    cq.Workplane("XY")
    .box(height, width, thickness)
)
"""

sample_code_2 = """
 height = 60.0
 width = 80.0
 thickness = 10.0
 diameter = 22.0
 padding = 12.0

 # make the base
 result = (
     cq.Workplane("XY")
     .box(height, width, thickness)
     .faces(">Z")
     .workplane()
     .hole(diameter)
     .faces(">Z")
     .workplane()
     .rect(height - padding, width - padding, forConstruction=True)
     .vertices()
     .cboreHole(2.4, 4.4, 2.1)
 )
"""
codes = {
    "sample_code": sample_code,
    "sample_code_2": sample_code_2,
}
vsr = evaluate_syntax_rate_simple(codes)
print("Valid Syntax Rate:", vsr)
iou = get_iou_best(sample_code, sample_code_2)
print("IOU:", iou)

Valid Syntax Rate: 1.0
IOU: 0.5834943417057687


## Have Fun

In [4]:
!pip install transformers



In [5]:
model_name = "microsoft/trocr-base-stage1"
processor = TrOCRProcessor.from_pretrained(model_name)
model = VisionEncoderDecoderModel.from_pretrained(model_name)

# Set model configuration for generation
model.config.decoder_start_token_id = processor.tokenizer.cls_token_id
model.config.pad_token_id = processor.tokenizer.pad_token_id
model.config.vocab_size = model.config.decoder.vocab_size
model.config.eos_token_id = processor.tokenizer.sep_token_id
model.config.max_length = 256
model.config.early_stopping = True
model.config.no_repeat_ngram_size = 3
model.config.length_penalty = 2.0
model.config.num_beams = 4

# Define a custom dataset class for preprocessing
class ImageCodeDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]
        image = item["image"].convert("RGB")
        # The dataset stores the CadQuery script in the 'cadquery' column (not 'code')
        code = item["cadquery"]
        pixel_values = self.processor(images=image, return_tensors="pt").pixel_values.squeeze()
        labels = self.processor.tokenizer(
            code, padding="max_length", max_length=model.config.max_length, truncation=True
        ).input_ids
        labels = [label if label != self.processor.tokenizer.pad_token_id else -100 for label in labels]
        return {"pixel_values": pixel_values, "labels": torch.tensor(labels)}

train_dataset = ImageCodeDataset(train_ds, processor)
eval_dataset = ImageCodeDataset(test_ds, processor)

Using a slow image processor as `use_fast` is unset and a slow processor was saved with this model. `use_fast=True` will be the default behavior in v4.52, even if the model was saved with a slow processor. This will result in minor differences in outputs. You'll still be able to use a slow processor with `use_fast=False`.
Some weights of VisionEncoderDecoderModel were not initialized from the model checkpoint at microsoft/trocr-base-stage1 and are newly initialized: ['encoder.pooler.dense.bias', 'encoder.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [6]:
"""
BASELINE MODEL: TRAINING AND EVALUATION

 We'll first create a baseline by fine-tuning the model for a single epoch.
 This gives us a starting point to measure any improvements against.
"""
print("\n--- Starting Baseline Model Training ---")

# Define training arguments
training_args = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    eval_strategy="epoch",
    per_device_train_batch_size=2,
    per_device_eval_batch_size=4,
    output_dir="./baseline_model",
    logging_steps=100,
    save_steps=500,
    eval_steps=500,
    num_train_epochs=1,
)

# Initialize the Trainer
trainer = Seq2SeqTrainer(
    model=model,
    tokenizer=processor.image_processor,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)

# Train the model
trainer.train()


--- Starting Baseline Model Training ---


`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Epoch,Training Loss,Validation Loss
1,No log,3.764474




TrainOutput(global_step=74, training_loss=4.396768518396326, metrics={'train_runtime': 792.6953, 'train_samples_per_second': 0.185, 'train_steps_per_second': 0.093, 'total_flos': 1.30079631247147e+17, 'train_loss': 4.396768518396326, 'epoch': 1.0})

In [8]:
print("\n--- Evaluating Baseline Model ---")

# Generate predictions on the test set
print("Generating predictions for the test set...")
predictions = trainer.predict(eval_dataset)
pred_code_str = processor.batch_decode(predictions.predictions, skip_special_tokens=True)

# Prepare data for metric functions
gt_codes = {f"sample_{i}": test_ds[i]["cadquery"] for i in range(len(test_ds))}  # same column as in ImageCodeDataset
pred_codes_baseline = {f"sample_{i}": pred_code_str[i] for i in range(len(pred_code_str))}

# Calculate Valid Syntax Rate
vsr_baseline = evaluate_syntax_rate_simple(pred_codes_baseline)
print(f"\nBaseline Valid Syntax Rate: {vsr_baseline:.2%}")

# Calculate Mean Best IOU (on syntactically valid pairs)
print("Calculating Baseline IOU (this may take a moment)...")
iou_scores_baseline = []
for i in range(len(test_ds)):
    gt_code = gt_codes[f"sample_{i}"]
    pred_code = pred_codes_baseline[f"sample_{i}"]
    try:
        # Check syntax of both codes before calculating IOU
        evaluate_syntax_rate_simple({"gt": gt_code, "pred": pred_code})
        iou = get_iou_best(gt_code, pred_code)
        iou_scores_baseline.append(iou)
    except Exception:
        # If either code has a syntax error, IOU is considered 0 for that pair
        iou_scores_baseline.append(0.0)
mean_iou_baseline = np.mean(iou_scores_baseline) if iou_scores_baseline else 0.0
print(f"Baseline Mean Best IOU: {mean_iou_baseline:.4f}")



--- Evaluating Baseline Model ---
Generating predictions for the test set...


KeyboardInterrupt: 

In [None]:

"""
ENHANCED MODEL: TRAINING AND EVALUATION

 For enhancement, we'll simply train the model for two more epochs.
This allows the model to learn more from the data and should improve performance.
"""
print("\n--- Starting Enhanced Model Training ---")

# Update training arguments to train for more epochs
training_args_enhanced = Seq2SeqTrainingArguments(
    predict_with_generate=True,
    eval_strategy="epoch",  # same argument name as in the baseline cell
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    fp16=True,
    output_dir="./enhanced_model",
    logging_steps=100,
    save_steps=500,
    eval_steps=500,
    num_train_epochs=2, # A new Trainer restarts the epoch count: 2 more epochs on top of the baseline's 1 gives 3 in total
)

# Re-initialize trainer with new args, starting from the baseline model
trainer_enhanced = Seq2SeqTrainer(
    model=model, # Continue with the model we've just trained
    tokenizer=processor.image_processor,
    args=training_args_enhanced,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
)

# Train for more epochs
trainer_enhanced.train()

# Enhanced Model Evaluation
print("\n--- Evaluating Enhanced Model ---")
print("Generating predictions for the test set...")
predictions_enhanced = trainer_enhanced.predict(eval_dataset)
pred_code_str_enhanced = processor.batch_decode(predictions_enhanced.predictions, skip_special_tokens=True)

# Prepare data for metric functions
pred_codes_enhanced = {f"sample_{i}": pred_code_str_enhanced[i] for i in range(len(pred_code_str_enhanced))}

# Calculate Valid Syntax Rate
vsr_enhanced = evaluate_syntax_rate_simple(pred_codes_enhanced)
print(f"\nEnhanced Valid Syntax Rate: {vsr_enhanced:.2%}")

# Calculate Mean Best IOU
print("Calculating Enhanced IOU (this may take a moment)...")
iou_scores_enhanced = []
for i in range(len(test_ds)):
    gt_code = gt_codes[f"sample_{i}"]
    pred_code = pred_codes_enhanced[f"sample_{i}"]
    try:
        evaluate_syntax_rate_simple({"gt": gt_code, "pred": pred_code})
        iou = get_iou_best(gt_code, pred_code)
        iou_scores_enhanced.append(iou)
    except Exception:
        iou_scores_enhanced.append(0.0)
mean_iou_enhanced = np.mean(iou_scores_enhanced) if iou_scores_enhanced else 0.0
print(f"Enhanced Mean Best IOU: {mean_iou_enhanced:.4f}")


### Explanation of Choices and Bottlenecks

1. Model Choice:
I chose the `microsoft/trocr-base-stage1` model, a pre-trained `VisionEncoderDecoderModel`. This is ideal for image-to-text tasks because it leverages a powerful vision model (the encoder) and a text-generation model (the decoder), which have been pre-trained on vast amounts of data.

2. Baseline vs. Enhancement:
- My baseline was created by fine-tuning the model for a single epoch. This establishes a performance benchmark.
- The enhancement involves training for two more epochs. Continued training should let the model learn more of the CadQuery syntax and the corresponding 3D shapes, which is expected to improve both the syntax and IOU metrics.

3. Potential Bottlenecks:
- Computational Power: I couldn't run the experiments to completion.
- Data Quality & Size: While the dataset is large, it may not cover all possible CadQuery constructs, potentially limiting the model's ability to generate highly complex or rare shapes.
- IOU Calculation Speed: Generating 3D models from code and computing their Intersection over Union (IOU) is computationally slow.
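
One way to keep a slow metric within a fixed budget is to score only a capped number of pairs and to count any failing pair as 0. The sketch below shows that idea with an injected scoring function; `mean_pairwise_score`, `max_samples`, and `toy_score` are hypothetical names introduced for this example. In this notebook the scoring function would presumably be `get_iou_best`, assuming it raises when a piece of code cannot be executed.

```python
from statistics import mean
from typing import Callable, Optional


def mean_pairwise_score(
    gt_codes: dict[str, str],
    pred_codes: dict[str, str],
    score_fn: Callable[[str, str], float],
    max_samples: Optional[int] = None,
) -> float:
    """Average score_fn over keys present in both dicts; a pair that raises scores 0.0."""
    keys = [k for k in gt_codes if k in pred_codes]
    if max_samples is not None:
        keys = keys[:max_samples]  # cap the number of (slow) metric calls
    scores = []
    for key in keys:
        try:
            scores.append(float(score_fn(gt_codes[key], pred_codes[key])))
        except Exception:
            scores.append(0.0)  # failed pairs count as zero
    return mean(scores) if scores else 0.0


def toy_score(gt: str, pred: str) -> float:
    """Stand-in for a real metric: exact match, raising on an empty prediction."""
    if not pred:
        raise ValueError("empty prediction")
    return 1.0 if gt == pred else 0.0


gt = {"sample_0": "a", "sample_1": "b", "sample_2": "c"}
pred = {"sample_0": "a", "sample_1": "", "sample_2": "x"}
print(mean_pairwise_score(gt, pred, toy_score))
print(mean_pairwise_score(gt, pred, toy_score, max_samples=1))
```

Using the same helper for the baseline and the enhanced model would also remove the duplicated evaluation loop and guarantee that both are scored on identical samples.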


### Future Enhancements (If I Had More Time)

1.  Full Dataset Training: I would train the model on the entire dataset to maximize its learning potential and accuracy.
2.  Hyperparameter Optimization: I would systematically search for the best hyperparameters (e.g., learning rate, batch size, beam search parameters).
3.  Advanced Models: I would experiment with newer, more advanced multimodal architectures, which are specifically designed for visual document understanding and might yield better results.
4.  Data Augmentation: Applying random transformations to the input images (like rotation, zoom, or color jitter) would help the model generalize better and reduce overfitting.
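
As a rough sketch of the hyperparameter search in item 2, the candidate configurations could be enumerated as a plain grid before launching one training run per configuration. The helper name `expand_grid` and the candidate values are illustrative assumptions, not tuned settings; the training and evaluation of each configuration is left out.

```python
from itertools import product


def expand_grid(search_space: dict[str, list]) -> list[dict]:
    """Return every combination of the candidate values as a list of config dicts."""
    names = list(search_space)
    value_lists = [search_space[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


# Illustrative candidate values only
search_space = {
    "learning_rate": [5e-5, 1e-5],
    "per_device_train_batch_size": [2, 4],
    "num_beams": [1, 4],
}

configs = expand_grid(search_space)
print(len(configs))
for config in configs[:2]:
    print(config)
```

A full grid grows multiplicatively with each added parameter, so on limited hardware it may be more practical to sample a few configurations at random from this list.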


Thanks for the challenge, hope you'll enjoy it :)