## Ablation study for multi-modal integration

The tutorial compares three approaches:

- **Multi-modal integration:** Concatenates the text and image embeddings into a single vector per item.
- **Multi-modal with reduced images:** Zeroes out the image embeddings for a fraction of the items.
- **Image-to-text conversion with text-only methods:** Converts images to textual descriptions and uses text embeddings only.

#### Setup

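Dataset: Use a dataset that pairs text with images (e.g., MS-COCO or a custom dataset). Each record should include the following:

- Text: A description of the item.
- Image: The image associated with that description.
- Label: A class label, used to decide whether a retrieved item is relevant.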
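
Before computing embeddings, it can help to confirm that every record really has both modalities. The following is a minimal sketch, assuming records are dictionaries with `text` and `image_path` keys; the helper name `find_incomplete_records` is hypothetical and not part of any library.

```python
from pathlib import Path


def find_incomplete_records(records):
    """Return the indices of records with empty text or a missing image file.

    `records` is assumed to be a list of dicts with 'text' and 'image_path' keys.
    """
    incomplete = []
    for index, record in enumerate(records):
        text = str(record.get("text") or "").strip()
        image_path = record.get("image_path")
        has_image = bool(image_path) and Path(image_path).is_file()
        if not text or not has_image:
            incomplete.append(index)
    return incomplete
```

With a pandas DataFrame, the same check could be run as `find_incomplete_records(data.to_dict("records"))`.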

In [None]:
pip install numpy pandas scikit-learn transformers sentence-transformers torch torchvision Pillow

#### Step 1: Load models and dataset

In [None]:
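# Imports
import numpy as np
import pandas as pd
import torch
from PIL import Image
from sentence_transformers import SentenceTransformer
from sklearn.metrics import precision_score, recall_score
from sklearn.metrics.pairwise import cosine_similarity
from transformers import CLIPModel, CLIPProcessor

# Load models
text_model = SentenceTransformer('all-MiniLM-L6-v2')  # Text embeddings
image_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")  # Image embeddings
image_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load dataset
# Assume a dataset with 'text', 'image_path', and 'label' columns
# The image paths are placeholders; point them at real files before running
data = pd.DataFrame({
    'text': ["A dog in a park", "A sunny beach", "A plate of food"],
    'image_path': ["dog.jpg", "beach.jpg", "food.jpg"],
    'label': [1, 2, 3]  # Dummy labels
})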

#### Step 2: Define functions for ablation and embeddings
Define helper functions that compute the embeddings, perform the ablation by zeroing out image embeddings, and evaluate retrieval performance.
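
To make the two retrieval metrics concrete, here is a small worked sketch for a single query. It assumes the candidates are already ranked from most to least similar and that a candidate is relevant when its label equals the query's label; the function name `precision_recall_at_k` is illustrative only.

```python
def precision_recall_at_k(ranked_labels, query_label, k):
    """Compute precision@k and recall@k for one query.

    ranked_labels: candidate labels ordered from most to least similar.
    """
    top_k = ranked_labels[:k]
    hits = sum(1 for label in top_k if label == query_label)
    total_relevant = sum(1 for label in ranked_labels if label == query_label)
    precision = hits / k if k else 0.0
    recall = hits / total_relevant if total_relevant else 0.0
    return precision, recall


# Toy ranking for a query whose label is 1.
ranked = [1, 2, 1, 3, 1]
precision, recall = precision_recall_at_k(ranked, query_label=1, k=3)
# 2 of the top 3 candidates match, and 2 of the 3 relevant candidates are found.
print(precision, recall)
```

Precision@k divides the hits by `k`, while recall@k divides them by the number of relevant candidates, so the two only coincide when those counts happen to be equal, as in this toy ranking.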

In [None]:
# Define Utility Functions
def compute_text_embedding(texts):
    """Compute text embeddings with SentenceTransformer and return a NumPy array."""
    # NumPy output can be passed directly to np.concatenate and scikit-learn
    return text_model.encode(texts, convert_to_numpy=True)

def compute_image_embedding(image_paths):
    """Compute image embeddings with CLIP and return a NumPy array."""
    # Load every image as RGB so grayscale and RGBA files are handled consistently
    images = [Image.open(img_path).convert("RGB") for img_path in image_paths]
    inputs = image_processor(images=images, return_tensors="pt")
    with torch.no_grad():
        return image_model.get_image_features(**inputs).cpu().numpy()

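def evaluate_retrieval(query_embeddings, candidate_embeddings, true_labels, k=5):
    """Evaluate retrieval metrics: precision@k and recall@k.

    Assumes queries and candidates are the same items in the same order, so
    true_labels[i] is the label of both query i and candidate i.
    """
    labels = np.asarray(true_labels)
    similarities = cosine_similarity(np.asarray(query_embeddings), np.asarray(candidate_embeddings))
    k = min(k, similarities.shape[1])  # Cannot retrieve more items than exist
    top_k_indices = np.argsort(similarities, axis=1)[:, -k:]
    precisions, recalls = [], []
    for i, label in enumerate(labels):
        relevant = (labels == label).astype(int)  # Candidates sharing the query's label
        retrieved = np.zeros(len(labels), dtype=int)
        retrieved[top_k_indices[i]] = 1  # Candidates returned in the top k
        precisions.append(precision_score(relevant, retrieved, zero_division=0))
        recalls.append(recall_score(relevant, retrieved, zero_division=0))
    return np.mean(precisions), np.mean(recalls)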

# Compute Embeddings
text_embeddings = compute_text_embedding(data['text'].tolist())
image_embeddings = compute_image_embedding(data['image_path'].tolist())

# Combine text and image embeddings for multi-modal integration
multi_modal_embeddings = np.concatenate([text_embeddings, image_embeddings], axis=1)

# Ablation Functions
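def remove_images(embeddings, fraction):
    """Remove a fraction of image embeddings by setting them to zeros.

    Returns a new array; the input is left unchanged.
    """
    ablated = np.array(embeddings, copy=True)
    num_remove = int(fraction * ablated.shape[0])
    text_dim = text_embeddings.shape[1]  # Columns from this index on hold the image embedding
    ablated[:num_remove, text_dim:] = 0
    return ablated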

def convert_images_to_text(image_paths):
    """Convert images to text using a dummy function or pre-trained model."""
    # Dummy conversion (real case would use image captioning)
    return ["Generated text for image " + str(i) for i, _ in enumerate(image_paths)]
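
The concatenated vectors are compared with cosine similarity, so a modality whose vectors have larger norms can dominate the score. Whether that happens with the two models used here is worth checking rather than assuming. One possible remedy, sketched below with random stand-in arrays instead of real model outputs, is to L2-normalize each modality before concatenating; the helper names `l2_normalize` and `concat_normalized` are hypothetical.

```python
import numpy as np


def l2_normalize(embeddings, eps=1e-12):
    """Scale each row to unit L2 norm (all-zero rows stay zero)."""
    embeddings = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    return embeddings / np.maximum(norms, eps)


def concat_normalized(text_emb, image_emb):
    """Concatenate two modalities after normalizing each one separately."""
    return np.concatenate([l2_normalize(text_emb), l2_normalize(image_emb)], axis=1)


# Stand-in embeddings with very different scales (random, not model outputs).
rng = np.random.default_rng(0)
text_demo = rng.normal(size=(4, 8))
image_demo = 10.0 * rng.normal(size=(4, 6))
combined_demo = concat_normalized(text_demo, image_demo)
print(combined_demo.shape)  # (4, 14)
```

After this step both halves of every row have unit norm, so neither modality outweighs the other purely because of scale.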

#### Step 3: Run the ablation study

Now, you can run your ablation study by varying the fraction of images removed and evaluating the impact on retrieval metrics.
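
A sweep over several fractions could look like the sketch below. It is self-contained and uses synthetic embeddings (uninformative "text" columns, class-dependent "image" columns) rather than the tutorial's models, so the numbers only illustrate the procedure. Unlike `remove_images`, the hypothetical helper `zero_image_part` picks the rows to ablate at random, and `precision_at_k` excludes each query from its own results; both are assumptions of this sketch.

```python
import numpy as np


def precision_at_k(embeddings, labels, k):
    """Mean precision@k with cosine similarity, excluding each item's match with itself."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)
    similarities = unit @ unit.T
    np.fill_diagonal(similarities, -np.inf)            # Never retrieve the query itself
    top_k = np.argsort(-similarities, axis=1)[:, :k]   # Indices of the k most similar items
    return float(np.mean(labels[top_k] == labels[:, None]))


def zero_image_part(embeddings, text_dim, fraction, rng):
    """Return a copy with the image columns zeroed for a random `fraction` of rows."""
    ablated = embeddings.copy()
    num_remove = int(fraction * len(ablated))
    rows = rng.choice(len(ablated), size=num_remove, replace=False)
    ablated[rows, text_dim:] = 0
    return ablated


# Synthetic stand-in data: 3 classes with 20 items each (not real model embeddings).
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 20)
text_part = rng.normal(size=(60, 16))                            # Uninformative "text"
class_centers = 3.0 * rng.normal(size=(3, 16))
image_part = class_centers[labels] + rng.normal(size=(60, 16))   # Informative "images"
combined = np.concatenate([text_part, image_part], axis=1)

for fraction in (0.0, 0.25, 0.5, 0.75, 1.0):
    ablated = zero_image_part(combined, text_dim=16, fraction=fraction, rng=rng)
    score = precision_at_k(ablated, labels, k=5)
    print(f"fraction removed={fraction:.2f}  precision@5={score:.3f}")
```

Because the class signal in this synthetic data lives only in the image columns, precision should fall as more image embeddings are zeroed; with real data the trend depends on how much each modality contributes.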

In [None]:
# Case 1: Multi-modal integration
precision_mm, recall_mm = evaluate_retrieval(multi_modal_embeddings, multi_modal_embeddings, data['label'].values)

# Case 2: Multi-modal with reduced images
reduced_embeddings = remove_images(multi_modal_embeddings.copy(), fraction=0.5)
precision_reduced, recall_reduced = evaluate_retrieval(reduced_embeddings, reduced_embeddings, data['label'].values)

# Case 3: Convert images to text
image_texts = convert_images_to_text(data['image_path'].tolist())
image_text_embeddings = compute_text_embedding(image_texts)
text_only_embeddings = np.concatenate([text_embeddings, image_text_embeddings], axis=1)
precision_text_only, recall_text_only = evaluate_retrieval(text_only_embeddings, text_only_embeddings, data['label'].values)


#### Step 4: Present the results

In [None]:
# Results
print("Multi-modal Integration: Precision@k:", precision_mm, "Recall@k:", recall_mm)
print("Reduced Images: Precision@k:", precision_reduced, "Recall@k:", recall_reduced)
print("Text-Only (Images to Text): Precision@k:", precision_text_only, "Recall@k:", recall_text_only)