# Pretraining CNNs with Contrastive Loss

![](https://i.ibb.co/BNJrb1C/CL.png)

The images in the figure above are from this competition's dataset :)

## Introduction

Welcome to this notebook! In this competition, each row of the dataset is a product in a shop, and for each product we get (among other things) its **image** and a **title** describing what it is. The goal is to find which other products are **similar** to a given product.

As the competition page puts it:

> Finding near-duplicates in large datasets is an important problem for many online businesses. In Shopee's case, everyday users can upload their own images and write their own product descriptions, adding an extra layer of challenge. Your task is to identify which products have been posted repeatedly. The differences between related products may be subtle while photos of identical products may be wildly different!

Since we can't meaningfully compare the raw pixels of images to each other, we first need to build a representation of each image that is understandable to the computer. Convolutional Neural Nets are good for this purpose; we give them an image and they return a 1D array describing that image. Once we have this array for each image, tons of machine learning models and techniques can be used to find similar images, which is the topic of my next notebook!
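
To make "comparing 1D arrays" concrete, here is a minimal sketch in plain Python. The three vectors and their names are made up for illustration (they are not CNN outputs); the only point is that two representations of the same product should end up closer together than representations of different products.

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length 1D representations."""
    if len(a) != len(b):
        raise ValueError("representations must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical 4-dimensional representations, for illustration only
# (a real resnet18 representation has 512 values).
shirt_photo_1 = [0.9, 0.1, 0.3, 0.0]
shirt_photo_2 = [0.8, 0.2, 0.3, 0.1]
phone_case = [0.0, 0.9, 0.1, 0.7]

d_same = euclidean_distance(shirt_photo_1, shirt_photo_2)
d_diff = euclidean_distance(shirt_photo_1, phone_case)
print(f"same product: {d_same:.3f}, different product: {d_diff:.3f}")
```

The same Euclidean distance is what the loss function later in this notebook works with.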

## What we are going to do

So, we need good representations (those 1D arrays). We could use a CNN pretrained on ImageNet to obtain them, but this approach is sub-optimal because the images in ImageNet can differ wildly from those in this dataset. Therefore, in this notebook, we are going to pre-train a CNN on the images of this dataset using **Contrastive Loss**.

Despite its complicated-sounding name, the idea is really simple! We choose two images randomly from the dataset; if their labels (label_group in this dataset) are the same, we push the model to make their representations similar, and if the labels are different, we push it to make the representations different. This way, we end up with better representations when training is done.
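
As a rough illustration of the pairing idea, here is a toy sketch in plain Python. The group names, file names and the `sample_pair` helper are all hypothetical; the actual dataset class later in this notebook does the same thing with the real `label_group` column.

```python
import random

def sample_pair(groups, anchor_label, rng):
    """Return (item1, item2, same) from a toy {label: [items]} mapping.

    With probability 0.5 both items come from `anchor_label` (same = 1.0);
    otherwise the second item comes from a different label (same = 0.0).
    Note: in the "same" case this toy version may pick the same file twice.
    """
    item1 = rng.choice(groups[anchor_label])
    if rng.random() > 0.5:
        return item1, rng.choice(groups[anchor_label]), 1.0
    other_labels = [label for label in groups if label != anchor_label]
    other = rng.choice(other_labels)
    return item1, rng.choice(groups[other]), 0.0

# Hypothetical label groups and file names, for illustration only.
toy_groups = {
    "group_a": ["a1.jpg", "a2.jpg"],
    "group_b": ["b1.jpg"],
    "group_c": ["c1.jpg", "c2.jpg", "c3.jpg"],
}

rng = random.Random(0)
for _ in range(4):
    print(sample_pair(toy_groups, "group_a", rng))
```

The `same` flag is the only supervision the contrastive loss needs: it never sees the label groups themselves, only whether the two images share one.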

Stay Tuned :)

## Installing and Importing needed libraries

In [None]:
!pip install timm

In [None]:
import os
import gc
import cv2
import sys
import copy
import math
import random
import argparse
import itertools
import pandas as pd
import numpy as np

import torch
from torch import nn, optim
from torch.utils.data import Dataset
from torchvision import models
import torch.nn.functional as F
import timm

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import albumentations as A
from tqdm.autonotebook import tqdm

import matplotlib.pyplot as plt

## Configuration Class

In [None]:
class CFG:
    size = 224
    batch_size = 16
    num_workers = 2
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    model_name = 'resnet18'
    pretrained = True 
    dropout = None # set to a probability (e.g. 0.2) to add a dropout layer to the model
    linear = None # set to an int to add an nn.Linear layer that changes the length of the representation array
    margin = 5
    
    scheduler = "ReduceLROnPlateau"
    step = "epoch" # whether to step the scheduler after each epoch or after each batch
    
    learning_rate = 1e-3
    factor = 0.5
    patience = 2
    epochs = 5
    model_path = "."
    model_save_name = "best.pt"

## Loading the dataframe

In [None]:
dataframe = pd.read_csv("../input/shopee-product-matching/train.csv")
print(dataframe.shape)
dataframe.head()

Let's see how many unique values there are in each column.

In [None]:
print(f"Number of rows: {dataframe.shape[0]}")
for col in dataframe.columns:
    print(f"{col}: \n"
          f"Number of unique elements {dataframe[col].nunique()}")

## Building the Dataset

Below is a short visual description of what the dataset code does, to make it easier to follow. There may well be better ways to implement this, but this is what came to mind. The code itself is in the next cell.

![pres](https://i.ibb.co/d5PKJG5/pres2.png)

In [None]:
class ContrastiveDataset(Dataset):
    def __init__(self, df, transforms, base="../input/shopee-product-matching/train_images"):
        self.base = base
        self.transforms = transforms
        # getting all unique label_groups
        self.labels = list(df['label_group'].unique())
        # we put the image names of each label_group in front of it in a big dictionary
        self.labels_to_imgs = {label: df[df['label_group'] == label].image.values
                               for label in self.labels}
    
    def __getitem__(self, idx):
        label = self.labels[idx]
        
        if random.random() > 0.5:
            same = True
            same_label_images = self.labels_to_imgs[label]
            img1, img2 = np.random.choice(same_label_images, 
                                          size=2, 
                                          replace=False if len(same_label_images) > 1 else True)
        else:
            same = False
            img1 = np.random.choice(self.labels_to_imgs[label], size=1)[0]
            while True:
                different_label = np.random.choice(self.labels, size=1)[0]
                if different_label != label:
                    break
            img2 = np.random.choice(self.labels_to_imgs[different_label], size=1)[0]
        
        img1_tensor, img2_tensor = self.process_imgs(img1, img2)
        
        # returning everything :)
        return {'images1': img1_tensor,
                'images2': img2_tensor,
                'same': torch.tensor(same).float(),
                'label1': label,
                'label2': label if same else different_label, 
                'image1_name': img1,
                'image2_name': img2}
    
    def read_transform_one(self, img):
        img = cv2.imread(f"{self.base}/{img}")[..., ::-1]
        if self.transforms is not None:
            img = self.transforms(image=img)['image']
        return torch.tensor(img).float()
    
    def process_imgs(self, img1, img2):
        img1 = self.read_transform_one(img1).permute(2, 0, 1)
        img2 = self.read_transform_one(img2).permute(2, 0, 1)
        return img1, img2

    
    def __len__(self):
        return len(self.labels)


def remove_normalization(image, mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)):
    """
    Function to undo the normalization done in the dataset.
    Useful for visualization purposes
    
    :param image: tensor with shape -> (channel, height, width)
    """
    mean, std = torch.tensor(mean), torch.tensor(std)
    mean = mean.unsqueeze(1).unsqueeze(2)
    std = std.unsqueeze(1).unsqueeze(2)
    return image * std + mean
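
To see why `remove_normalization` gives back values that `plt.imshow` can display, here is a small round-trip sketch on a single RGB pixel in plain Python. It assumes the normalization is `(value / max_pixel_value - mean) / std` per channel with the ImageNet statistics; the helper names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD, max_pixel_value=255.0):
    """Per-channel (value / max_pixel_value - mean) / std."""
    return tuple((v / max_pixel_value - m) / s for v, m, s in zip(rgb, mean, std))

def denormalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Inverse of the step above: value * std + mean, back in the [0, 1] range."""
    return tuple(v * s + m for v, m, s in zip(rgb, mean, std))

pixel = (255, 128, 0)  # raw 8-bit RGB values
normalized = normalize_pixel(pixel)
restored = denormalize_pixel(normalized)
print(normalized)
print(restored)  # approximately (1.0, 128 / 255, 0.0)
```

Note that the restored values are in [0, 1] rather than [0, 255], which is the range matplotlib expects for float images.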

In [None]:
transforms = A.Compose([
    A.Resize(CFG.size, CFG.size),
    A.Normalize(max_pixel_value=255.) # Normalizes with ImageNet stats 
])

dataset = ContrastiveDataset(dataframe, transforms)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=CFG.batch_size, shuffle=True)

## Visualization

Let's look at some image pairs from the dataset. The title above each image is its file name. The title above each row shows the label groups of the two images and whether they belong to the same label_group.

In [None]:
batch = next(iter(dataloader))
print(batch['images1'].shape, batch['images2'].shape, batch['same'].shape)

Rows = 5
for r in range(Rows):
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
    img1, img2, same = batch['images1'][r], batch['images2'][r], batch['same'][r]
    img1, img2 = remove_normalization(img1), remove_normalization(img2)
    ax1.imshow(img1.permute(1, 2, 0))
    ax1.axis("off")
    ax1.set_title(batch['image1_name'][r])
    ax2.imshow(img2.permute(1, 2, 0))
    ax2.axis("off")
    ax2.set_title(batch['image2_name'][r])
    same = "Same" if same == 1 else "Different"
    fig.suptitle(f"{same} ({batch['label1'][r]}---{batch['label2'][r]})")
    plt.show()

## Model

In [None]:
class Model(nn.Module):
    def __init__(self, 
                 model_name="resnet18",
                 pretrained=True,
                 dropout=0.2,
                 linear=128):
        """
        :param model_name: many models are available from the awesome timm library (ResNets, EfficientNets, DenseNets, ...)
        :param pretrained: whether to initialize the model with the pre-trained weights (pre-trained on ImageNet)
        :param linear: out_dim of nn.Linear. If None or 0, no linear layer will be added
        :param dropout: dropout ratio. If None or 0, no dropout layer will be added

        """
        super().__init__()
        model = timm.create_model(model_name, 
                                  pretrained=pretrained, 
                                  num_classes=0)
        
        # num of final features after adaptive (global) average pooling
        self.num_features = model.num_features
        self.linear = None
        if linear is not None and linear > 0:
            self.linear = nn.Linear(self.num_features, linear)
        
        # nn.Identity does nothing! just returns the input tensor
        self.backbone = nn.Sequential(model, 
                                      self.linear if self.linear is not None else nn.Identity(),
                                      nn.ReLU() if self.linear is not None else nn.Identity(),
                                      nn.Dropout(dropout) if dropout else nn.Identity())
    
    def forward(self, batch):
        images_1 = self.backbone(batch['images1'].to(CFG.device))
        images_2 = self.backbone(batch['images2'].to(CFG.device))

        return images_1, images_2

## Contrastive Loss Function

Here's the cool part! This loss is from the paper [Dimensionality Reduction by Learning an Invariant Mapping](https://ieeexplore.ieee.org/document/1640964) (Yann LeCun is one of the co-authors).

In the ***Dataset*** section, you saw that when we draw a pair of images from the dataset, with 50% probability they come from the same label and with 50% probability from different labels.

We feed each of those to the model separately (you can see this in the forward function of the model above) and then we get representations out of the model for each of the two images.

Now, we want to guide the model to produce similar representations for similar images and different representations for different ones. This is done with the loss function below.

![](https://miro.medium.com/max/778/1*g2I-W-iIQuCNsczuGPN0ZQ.png)

Image from [HERE](https://towardsdatascience.com/how-to-choose-your-loss-when-designing-a-siamese-neural-net-contrastive-triplet-or-quadruplet-ecba11944ec).
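
In case the image does not load: written out for a single pair, and matching the code cell below (which includes the factor of 1/2, so constant factors may differ from the figure), the loss is

$$
L = \tfrac{1}{2}\, y\, d^{2} + \tfrac{1}{2}\,(1 - y)\,\max(0,\; m - d)^{2}
$$

where $y$ is 1 for a pair from the same label_group and 0 otherwise, $d$ is the Euclidean distance between the two representations, and $m$ is the margin.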

Don't panic! It is really simple when converted to code. 

1. Just think about what happens when the label is 1, meaning that the images are **SIMILAR** -> only the first term remains, and the model needs to reduce the **distance** (d) to lower the loss (which it has to!).
2. Now think about what happens when the label is 0, meaning that the images are **DIFFERENT** -> only the second term remains, and the model should push the distance (d) above the margin (m) to lower the loss.
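
The two cases can be checked by hand with a few numbers. This is a plain-Python sketch of the loss for a single pair (the helper name is hypothetical, and it includes the factor of 1/2 used in the code cell below), evaluated with the same margin as `CFG.margin`.

```python
def contrastive_loss_single(d, same, margin):
    """Loss for one pair: d is the distance, same is 1.0 (similar) or 0.0 (different)."""
    pull = 0.5 * same * d ** 2                             # active only for similar pairs
    push = 0.5 * (1.0 - same) * max(margin - d, 0.0) ** 2  # active only for different pairs
    return pull + push

margin = 5.0  # same value as CFG.margin

print(contrastive_loss_single(d=1.0, same=1.0, margin=margin))  # 0.5 -> similar and close: small loss
print(contrastive_loss_single(d=4.0, same=1.0, margin=margin))  # 8.0 -> similar but far: large loss
print(contrastive_loss_single(d=1.0, same=0.0, margin=margin))  # 8.0 -> different but close: large loss
print(contrastive_loss_single(d=6.0, same=0.0, margin=margin))  # 0.0 -> different and beyond the margin
```

Notice that once a dissimilar pair is farther apart than the margin, it contributes nothing to the loss, so the model is not rewarded for pushing it even farther.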

In [None]:
class ContrastiveLoss(nn.Module):
    def __init__(self, margin):
        super().__init__()
        self.margin = margin

    def forward(self, output1, output2, targets):
        d = (output1 - output2).pow(2).sum(1).sqrt() # distance
        loss = torch.mean(0.5 * targets.float() * d.pow(2) + \
                          0.5 * (1 - targets.float()) * F.relu(self.margin - d).pow(2))
        return loss, d # we also return distance; it is needed to evaluate the model

## Train and Evaluation Functions

These are the helper functions we will need to train and evaluate our model.

In [None]:
class AvgMeter:
    def __init__(self, name="Metric"):
        self.name = name
        self.reset()
    
    def reset(self):
        self.avg, self.sum, self.count = [0]*3
    
    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count
    
    def __repr__(self):
        text = f"{self.name}: {self.avg:.4f}"
        return text
    
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group["lr"]
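
A quick note on why `update` takes a `count`: the last batch of an epoch is often smaller than the others, so averaging the per-batch mean losses directly would give it too much weight. Here is a minimal sketch of the same weighted running average in plain Python, with made-up loss values and a hypothetical helper name.

```python
def running_average(batch_losses):
    """Weighted mean over (mean_loss, batch_size) pairs, like AvgMeter.update does."""
    total, seen = 0.0, 0
    for mean_loss, batch_size in batch_losses:
        total += mean_loss * batch_size
        seen += batch_size
    return total / seen

# Hypothetical (mean loss, batch size) values; the last batch is smaller.
batches = [(2.0, 16), (1.0, 16), (4.0, 4)]

weighted = running_average(batches)
naive = sum(loss for loss, _ in batches) / len(batches)
print(f"weighted: {weighted:.4f}, naive: {naive:.4f}")
```

The weighted value is the true per-sample mean loss over the epoch; the naive one overstates the influence of the small final batch.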

In [None]:
def one_epoch(model, 
              criterion, 
              loader, 
              optimizer=None, 
              lr_scheduler=None,
              mode="train", 
              step="batch"):
    
    loss_meter = AvgMeter()
    
    # these two are needed for the after-epoch evaluation. You're gonna see what they are used for
    distances = None
    labels = None
    
    tqdm_object = tqdm(loader, total=len(loader))
    for batch in tqdm_object:
        images1_f, images2_f = model(batch)
        loss, d = criterion(images1_f, images2_f, batch['same'].to(CFG.device))
        if mode == "train":
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step == "batch" and lr_scheduler is not None:
                lr_scheduler.step()
                
        count = batch['same'].size(0)
        loss_meter.update(loss.item(), count)
          
        # collecting all the labels and distances
        if distances is None:
            distances = d.detach().cpu()
            labels = batch['same']
        else:
            distances = torch.cat([distances, d.detach().cpu()], dim=0)
            labels = torch.cat([labels, batch['same']], dim=0)


        if mode == "train":
            tqdm_object.set_postfix(train_loss=loss_meter.avg, lr=get_lr(optimizer))
        else:
            tqdm_object.set_postfix(valid_loss=loss_meter.avg)
    
    return loss_meter, distances, labels

### How can we evaluate the model during training?

That's a good question!
The metric you are about to see is probably a bit different from others you've seen, because a separate small model gets trained every time we call the metric function!

Remember the margin and distance from the loss function section. The model needed to decrease the distance for similar images and increase it above an arbitrary margin for dissimilar images. So, we can use the distances to find out if the images are similar or not. This was the reason we were collecting all the distances during a training or eval epoch.

Then, after each epoch, we fit a simple Logistic Regression model from the scikit-learn library on the distances and labels of the training image pairs. After that, we check how well this simple model predicts, from the distances of the validation image pairs, whether each pair is similar or dissimilar.

If the CNN cannot produce good representations, the distances between them carry no useful information; the logistic regression model then cannot classify the pairs and should score around 50 percent accuracy (chance level, since about half of the pairs are similar). But if this simple model can classify the pairs with good accuracy, chances are the CNN is producing useful representations that we can use for the main task of this competition.
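
With a single input feature (the distance), a logistic regression classifier ends up comparing that distance to one learned cut-off. The sketch below uses a plain threshold search as a simplified stand-in for that idea. It is not the metric used in this notebook, and the distances and helper names are made up for illustration.

```python
def accuracy(distances, labels, threshold):
    """Share of pairs classified correctly when 'distance <= threshold' means similar."""
    correct = sum((d <= threshold) == (y == 1) for d, y in zip(distances, labels))
    return correct / len(labels)

def fit_threshold(distances, labels):
    """Pick the cut-off (among the observed distances) with the best training accuracy."""
    best_t, best_acc = None, -1.0
    for t in sorted(set(distances)):
        acc = accuracy(distances, labels, t)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# Hypothetical distances: similar pairs (label 1) are close, different pairs (label 0) are far.
train_d = [0.4, 0.9, 1.3, 4.2, 5.1, 4.8]
train_y = [1, 1, 1, 0, 0, 0]
valid_d = [0.7, 1.1, 3.9, 5.3]
valid_y = [1, 1, 0, 0]

threshold = fit_threshold(train_d, train_y)
print(f"threshold: {threshold}, valid accuracy: {accuracy(valid_d, valid_y, threshold)}")
```

When the distances separate the two kinds of pairs this cleanly, the cut-off learned on the training pairs also works on the validation pairs. If the distances were uninformative, no cut-off would do much better than chance.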

In [None]:
def get_score(train_d, train_y, valid_d, valid_y):
    """
    suffix _d means distances
    suffix _y means labels
    """
    log_reg = LogisticRegression()
    log_reg.fit(train_d.numpy().reshape((-1, 1)), train_y.numpy())
    train_preds = log_reg.predict(train_d.numpy().reshape((-1, 1)))
    valid_preds = log_reg.predict(valid_d.numpy().reshape((-1, 1)))
    train_acc = accuracy_score(train_y, train_preds)
    valid_acc = accuracy_score(valid_y, valid_preds)
    return train_acc, valid_acc

In [None]:
def train_eval(epochs, model, train_loader, valid_loader, 
               criterion, optimizer, lr_scheduler=None):
    
    best_loss = float('inf')
    
    for epoch in range(epochs):
        print("*" * 30)
        print(f"Epoch {epoch + 1}")
        current_lr = get_lr(optimizer)
        
        model.train()
        train_loss, train_d, train_y = one_epoch(model, 
                                                  criterion, 
                                                  train_loader, 
                                                  optimizer=optimizer,
                                                  lr_scheduler=lr_scheduler,
                                                  mode="train",
                                                  step=CFG.step)                     
        model.eval()
        with torch.no_grad():
            valid_loss, valid_d, valid_y = one_epoch(model, 
                                                      criterion, 
                                                      valid_loader, 
                                                      optimizer=None,
                                                      lr_scheduler=None,
                                                      mode="valid")
        
        train_acc, valid_acc = get_score(train_d, train_y, valid_d, valid_y)
        print(f"Train Accuracy: {train_acc:.3f}")
        print(f"Valid Accuracy: {valid_acc:.3f}")
        
        if valid_loss.avg < best_loss:
            best_loss = valid_loss.avg
            torch.save(model.state_dict(), f'{CFG.model_path}/{CFG.model_save_name}')
            print("Saved best model!")
        
        # or you could do: if step == "epoch":
        if isinstance(lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            lr_scheduler.step(train_loss.avg)

## Train and Valid loaders

In [None]:
labels = dataframe['label_group'].unique()
train_labels, valid_labels = train_test_split(labels, test_size=0.2, shuffle=True, random_state=42)
train_df = dataframe[dataframe['label_group'].isin(train_labels)].reset_index(drop=True)
valid_df = dataframe[dataframe['label_group'].isin(valid_labels)].reset_index(drop=True)

In [None]:
train_dataset = ContrastiveDataset(train_df, transforms)
train_loader = torch.utils.data.DataLoader(train_dataset, 
                                           batch_size=CFG.batch_size,
                                           num_workers=CFG.num_workers,
                                           shuffle=True)

valid_dataset = ContrastiveDataset(valid_df, transforms)
valid_loader = torch.utils.data.DataLoader(valid_dataset, 
                                           batch_size=CFG.batch_size,
                                           num_workers=CFG.num_workers,
                                           shuffle=False)

In [None]:
model = Model(CFG.model_name, CFG.pretrained, CFG.dropout, CFG.linear)
model.to(CFG.device)
optimizer = torch.optim.Adam(model.parameters(), lr=CFG.learning_rate)

if CFG.scheduler == "ReduceLROnPlateau":
    lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 
                                                              mode="min", 
                                                              factor=CFG.factor, 
                                                              patience=CFG.patience)

    # when to step the scheduler: after an epoch or after a batch
    CFG.step = "epoch"
    
criterion = ContrastiveLoss(margin=CFG.margin)
train_eval(CFG.epochs, 
           model, 
           train_loader, 
           valid_loader, 
           criterion, 
           optimizer, 
           lr_scheduler)

You can see that the train and validation accuracies are above 50 percent after training and increase further after each epoch. So it seems that we are producing good features for each image.

In the next notebook, I'll use these representations to find similar and dissimilar images and focus on the main task of the competition.