## CLIP
CLIP (Contrastive Language-Image Pre-Training) is a neural network trained on a variety of (image, text) pairs. It can be instructed in natural language to predict the most relevant text snippet, given an image, without directly optimizing for the task, similarly to the zero-shot capabilities of GPT-2 and 3. We found CLIP matches the performance of the original ResNet50 on ImageNet “zero-shot” without using any of the original 1.28M labeled examples, overcoming several major challenges in computer vision.

In [45]:
import torch
import clip
from PIL import Image
import os
import sys

import torch.nn as nn
import torch.optim as optim

from sklearn.metrics import f1_score, accuracy_score
import time

from tqdm import tqdm



# Add modules to path
sys.path.append('modules')

# Import our custom utilities
from data_utils import load_processed_data_v2, create_data_loaders
from model_utils import CNNBaseline, train_model, evaluate_model, EfficientNetBaseline


device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(f'Using device: {device}')

Using device: cuda


In [34]:
from torch.utils.data import DataLoader

In [21]:
## See available models
clip.available_models()

['RN50',
 'RN101',
 'RN50x4',
 'RN50x16',
 'RN50x64',
 'ViT-B/32',
 'ViT-B/16',
 'ViT-L/14',
 'ViT-L/14@336px']

In [50]:
clip_model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
clip_model = clip_model.float().to(device)

In [23]:
print(preprocess)

Compose(
    Resize(size=224, interpolation=bicubic, max_size=None, antialias=True)
    CenterCrop(size=(224, 224))
    <function _convert_image_to_rgb at 0x0000017535C4C900>
    ToTensor()
    Normalize(mean=(0.48145466, 0.4578275, 0.40821073), std=(0.26862954, 0.26130258, 0.27577711))
)


In [39]:
class CLIPDataset(torch.utils.data.Dataset):
    def __init__(self, df, image_dirs, preprocess,label_encoder, device='cuda'):
        self.df = df.reset_index(drop=True)
        self.image_dirs = image_dirs
        self.preprocess = preprocess
        self.device = device
        self.label_encoder = label_encoder

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        img_id = row['img_id']

        # Load image from any of the given directories
        img = None
        for part, dir_path in self.image_dirs.items():
            full_path = os.path.join('archive', dir_path, img_id)
            if os.path.exists(full_path):
                try:
                    img = Image.open(full_path).convert("RGB")
                    break
                except Exception:
                    continue

        # Fallback: create black image if missing
        if img is None:
            img = Image.new("RGB", (224, 224), color="black")

        # Preprocess for CLIP
        image_tensor = self.preprocess(img)

        # Text input (raw string)
        text_input = row['text_description']

        # Target label → encode to int, then tensor
        label_str = row['diagnostic']
        label_idx = self.label_encoder.transform([label_str])[0]  # integer class
        target = torch.tensor(label_idx, dtype=torch.long)

        return image_tensor, text_input, target



In [25]:
train_df, val_df, test_df, label_encoder, config= load_processed_data_v2()

In [51]:
IMAGE_DIRS = {
    'part1': 'imgs_part_1/imgs_part_1',
    'part2': 'imgs_part_2/imgs_part_2', 
    'part3': 'imgs_part_3/imgs_part_3'
}

train_dataset = CLIPDataset(train_df, IMAGE_DIRS, preprocess,label_encoder, device=device) 
val_dataset = CLIPDataset(val_df, IMAGE_DIRS, preprocess,label_encoder, device=device) 
test_dataset = CLIPDataset(test_df, IMAGE_DIRS, preprocess,label_encoder, device=device)


In [41]:
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader   = DataLoader(val_dataset, batch_size=32)
test_loader  = DataLoader(test_dataset, batch_size=32)

In [42]:
for images, texts, targets in train_loader:
    print(images.shape)  # Should be [batch_size, 3, 224, 224]
    print(texts)        # List of text descriptions
    print(targets)      # Tensor of target labels
    break

torch.Size([32, 3, 224, 224])
('66-year-old male lesion on neck risk factors: smoke, pesticide', '53-year-old lesion on chest', '73-year-old female lesion on face risk factors: skin cancer history', '62-year-old female lesion on nose risk factors: drink, skin cancer history, cancer history', '39-year-old lesion on face', '71-year-old female lesion on chest risk factors: cancer history', '91-year-old female lesion on face', '69-year-old male lesion on forearm risk factors: cancer history', '85-year-old female lesion on ear risk factors: pesticide', '78-year-old female lesion on face risk factors: skin cancer history', '55-year-old male lesion on forearm risk factors: drink, pesticide, skin cancer history, cancer history', '64-year-old male lesion on face risk factors: drink, skin cancer history, cancer history', '77-year-old female lesion on face', '55-year-old male lesion on neck risk factors: drink, pesticide, skin cancer history', '54-year-old lesion on forearm', '73-year-old lesion 

In [53]:
class ProjectionHead(nn.Module):
    def __init__(self, in_dim, proj_dim=512, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, proj_dim),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(proj_dim, proj_dim),
            nn.LayerNorm(proj_dim)
        )
    def forward(self, x):
        return self.net(x)

class GatedFusion(nn.Module):
    """Learnable gate between two modality vectors (x and y)"""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(dim * 2, dim),
            nn.Sigmoid()
        )
    def forward(self, x, y):
        # x,y shape: (B, dim)
        z = torch.cat([x, y], dim=1)
        g = self.gate(z)  # (B, dim)
        return g * x + (1 - g) * y

class MultimodalClassifier(nn.Module):
    def __init__(
        self,
        clip_model,
        image_dim=512,
        text_dim=512,
        proj_dim=512,
        fusion_method="concat",   # "concat", "mean", "gated"
        num_classes=2,
        dropout=0.3,
        freeze_clip=True
    ):
        """
        clip_model: loaded clip model (OpenAI), used optionally for fine-tuning or just to encode
        image_dim, text_dim: output dims from CLIP encoders (usually 512)
        proj_dim: projection dimension (if using projection heads)
        fusion_method: 'concat'|'mean'|'gated'
        """
        super().__init__()
        self.clip = clip_model
        self.freeze_clip = freeze_clip

        # Optionally freeze CLIP parameters
        if freeze_clip:
            for p in self.clip.parameters():
                p.requires_grad = False

        # Projection heads (bring both to proj_dim)
        self.image_proj = ProjectionHead(image_dim, proj_dim=proj_dim)
        self.text_proj = ProjectionHead(text_dim, proj_dim=proj_dim)

        self.fusion_method = fusion_method
        if fusion_method == "gated":
            self.fuser = GatedFusion(proj_dim)

        # Classifier input dim depends on fusion method
        if fusion_method == "concat":
            clf_in = proj_dim * 2
        elif fusion_method in ("mean", "gated"):
            clf_in = proj_dim
        else:
            raise ValueError("Unknown fusion method")

        # Classifier head
        self.classifier = nn.Sequential(
            nn.Linear(clf_in, clf_in // 2),
            nn.ReLU(inplace=True),
            nn.Dropout(dropout),
            nn.Linear(clf_in // 2, num_classes)
        )

    def encode_image(self, images):
        # images: Tensor [B, 3, H, W]
        return self.clip.encode_image(images)

    def encode_text(self, tokenized_text):
        # tokenized_text: Tensor [B, 77]
        return self.clip.encode_text(tokenized_text)

    def forward(self, images, tokenized_text):
        """
        Forward pass: encode image & text via CLIP, fuse their embeddings, classify.
        """
        # Encode using CLIP
        image_feats = self.encode_image(images)        # (B, image_dim)
        text_feats  = self.encode_text(tokenized_text) # (B, text_dim)

        # Normalize embeddings (stabilizes training)
        image_feats = image_feats / image_feats.norm(dim=1, keepdim=True).clamp(min=1e-6)
        text_feats  = text_feats / text_feats.norm(dim=1, keepdim=True).clamp(min=1e-6)

        # Ensure consistent dtype between CLIP outputs and our projection/classifier
        # (prevents errors like: "mat1 and mat2 must have the same dtype, but got Half and Float")
        image_feats = image_feats.float()
        text_feats = text_feats.float()

        # Project both to a common dimension
        img_p = self.image_proj(image_feats)
        txt_p = self.text_proj(text_feats)

        # Fuse (simple concatenation)
        fused = torch.cat([img_p, txt_p], dim=1)

        # Classify
        logits = self.classifier(fused)
        return logits


In [54]:
num_classes = len(label_encoder.classes_) 
num_classes

2

In [55]:

optimizer = torch.optim.AdamW(filter(lambda p: p.requires_grad, model.parameters()),
                              lr=1e-4, weight_decay=1e-4)
criterion = nn.CrossEntropyLoss()
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", factor=0.5, patience=2)


In [56]:
model = MultimodalClassifier(
    clip_model=clip_model).to(device)

In [57]:
best_val_f1 = 0.0
num_epochs = 10

for epoch in range(num_epochs):
    t0 = time.time()

    #train mode
    model.train()
    train_losses, y_true_train, y_pred_train = [], [], []

    train_bar = tqdm(train_loader, desc=f"Epoch {epoch}/{num_epochs} [Train]", leave=False)
    for images, texts, targets in train_bar:
        images = images.to(device)
        targets = targets.to(device)

        # model expects tokenized text
        tokenized = clip.tokenize(list(texts), truncate=True).to(device)

        optimizer.zero_grad()
        logits = model(images=images, tokenized_text=tokenized)
        loss = criterion(logits, targets)

        loss.backward()
        optimizer.step()

        train_losses.append(loss.item())

        # prediction ( detach to avoid tracking in autograd )
        preds = logits.argmax(dim=1).detach().cpu().numpy()
        y_pred_train.extend(preds.tolist())
        y_true_train.extend(targets.detach().cpu().numpy().tolist())

        # update tqdm bar
        train_bar.set_postfix(loss=f"{loss.item():.4f}")

    train_loss = sum(train_losses) / len(train_losses)
    train_acc = accuracy_score(y_true_train, y_pred_train)
    train_f1 = f1_score(y_true_train, y_pred_train, average="weighted")

    # val mode
    model.eval()
    val_losses, y_true_val, y_pred_val = [], [], []

    val_bar = tqdm(val_loader, desc=f"Epoch {epoch}/{num_epochs} [Val]", leave=False)
    with torch.no_grad():
        for images, texts, targets in val_bar:
            images = images.to(device)
            targets = targets.to(device)
            tokenized = clip.tokenize(list(texts), truncate=True).to(device)

            logits = model(images=images, tokenized_text=tokenized)
            loss = criterion(logits, targets)

            val_losses.append(loss.item())
            preds = logits.argmax(dim=1).cpu().numpy()
            y_pred_val.extend(preds.tolist())
            y_true_val.extend(targets.cpu().numpy().tolist())

            # update tqdm bar
            val_bar.set_postfix(loss=f"{loss.item():.4f}")

    val_loss = sum(val_losses) / len(val_losses)
    val_acc = accuracy_score(y_true_val, y_pred_val)
    val_f1 = f1_score(y_true_val, y_pred_val, average="weighted")

    scheduler.step(val_f1)

    print(f"\nEpoch {epoch:02d} | "
          f"Train Loss: {train_loss:.4f}, Acc: {train_acc:.4f}, F1: {train_f1:.4f} | "
          f"Val Loss: {val_loss:.4f}, Acc: {val_acc:.4f}, F1: {val_f1:.4f} | "
          f"Time: {(time.time()-t0):.1f}s")

    # saving best model
    if val_f1 > best_val_f1:
        best_val_f1 = val_f1
        torch.save(model.state_dict(), "best_multimodal_clip.pth")
        print("[NOTE]Saved best model.")

print("!!Training complete.!!")


                                                                                


Epoch 00 | Train Loss: 0.7535, Acc: 0.4624, F1: 0.3040 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 69.8s
[NOTE]Saved best model.
[NOTE]Saved best model.


                                                                                


Epoch 01 | Train Loss: 0.7618, Acc: 0.4603, F1: 0.3013 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 67.9s


                                                                                


Epoch 02 | Train Loss: 0.7535, Acc: 0.4624, F1: 0.3022 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 69.8s


                                                                                


Epoch 03 | Train Loss: 0.7560, Acc: 0.4624, F1: 0.3004 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 76.8s


                                                                                


Epoch 04 | Train Loss: 0.7601, Acc: 0.4561, F1: 0.2958 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 75.8s


                                                                                


Epoch 05 | Train Loss: 0.7567, Acc: 0.4550, F1: 0.2988 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 76.6s


                                                                                


Epoch 06 | Train Loss: 0.7545, Acc: 0.4656, F1: 0.3037 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 78.8s


                                                                                


Epoch 07 | Train Loss: 0.7571, Acc: 0.4614, F1: 0.3000 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 80.6s


                                                                                


Epoch 08 | Train Loss: 0.7557, Acc: 0.4582, F1: 0.2967 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 83.1s


                                                                                


Epoch 09 | Train Loss: 0.7555, Acc: 0.4603, F1: 0.2977 | Val Loss: 0.7516, Acc: 0.4635, F1: 0.2936 | Time: 84.6s
!!Training complete.!!




The model doesn't seem to be learning well.  Many reasons may have been come to play.
The model methodology is correct. we use pretrained CLIP encoders for image and text, project to a common latent space, and fuse them for classification. The observed low metrics can be attributed to frozen CLIP embeddings, limited dataset size, shallow classifier, and structured clinical text. Despite the metrics, the approach demonstrates a valid multimodal pipeline and can be justified for future work with larger datasets or fine-tuning.