## Pretraining the CNN by regressing the number of atoms of each element in the structure

![](https://i.ibb.co/WPDk5d7/Presentation1.jpg)

As discussed [here](https://www.kaggle.com/c/bms-molecular-translation/discussion/224257#1258149) and suggested by [@hengck23](https://www.kaggle.com/hengck23), I'm going to first pretrain the CNN part of the final model on a task **similar** to this competition's. We can then freeze the CNN and train only the RNN part on the actual competition task. In the final step, both the CNN and the RNN are trained end-to-end on the main task with a lower learning rate. So, here is the plan:

1. Pretrain the CNN on a task similar to the competition's (this notebook)
2. Freeze the CNN and train the RNN part on the main task
3. Train both the CNN and the RNN on the main task with a lower learning rate
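
As a rough sketch of how steps 2 and 3 could look in PyTorch (not the code of the later notebooks): the two tiny modules below are hypothetical stand-ins for the real encoder and decoder, the helper names `freeze` and `unfreeze` are made up for this example, and the learning rates are only illustrative.

```python
import torch
from torch import nn

# Hypothetical stand-ins for the pretrained CNN encoder and the RNN decoder.
cnn = nn.Sequential(nn.Conv2d(3, 8, kernel_size=3), nn.ReLU())
rnn = nn.GRU(input_size=8, hidden_size=16, batch_first=True)


def freeze(module):
    """Stop gradient updates for every parameter of the module."""
    for param in module.parameters():
        param.requires_grad_(False)
    module.eval()  # also keeps BatchNorm statistics fixed, if the module has any


def unfreeze(module):
    """Make every parameter of the module trainable again."""
    for param in module.parameters():
        param.requires_grad_(True)
    module.train()


# Step 2: the CNN is frozen, only the RNN is optimized.
freeze(cnn)
stage2_optimizer = torch.optim.Adam(rnn.parameters(), lr=3e-4)

# Step 3: everything is trainable, with a lower (illustrative) learning rate.
unfreeze(cnn)
stage3_optimizer = torch.optim.Adam(
    list(cnn.parameters()) + list(rnn.parameters()), lr=3e-5
)
```

Creating a fresh optimizer for step 3 is one simple option; keeping one optimizer with separate parameter groups would work as well.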

In this notebook, **we are going to regress the number of atoms of each element** present in the biochemical structure. To do this, the model has to learn many useful image features, which I hope will help the CNN in the final task of this competition: predicting the whole InChI string of the molecule.




The **preprocessing** notebook that goes with this one is [here](https://www.kaggle.com/moeinshariatnia/cnn-rnn-cnn-pretraining-w-regression-preprocess).
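
The preprocessing itself is not repeated here. Just to give an idea of where the regression targets come from, this is a minimal sketch of how per-element atom counts could be read off the formula layer of an InChI string. The function name `count_atoms` is made up for this example, and it assumes a single-component formula (multipliers such as `2ClH` in multi-component formulas are ignored), so the linked notebook may well do it differently.

```python
import re
from collections import Counter

ELEMENTS = ["C", "Cl", "F", "H", "N", "O", "S"]


def count_atoms(inchi, elements=ELEMENTS):
    """Return the atom count of each element, read from the InChI formula layer."""
    formula = inchi.split("/")[1]  # e.g. "C13H20OS"
    counts = Counter()
    # An element symbol is one capital letter plus an optional lowercase letter,
    # followed by an optional number (a missing number means 1).
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(number) if number else 1
    return [counts[element] for element in elements]


inchi = "InChI=1S/C13H20OS/c1-9(2)8-15-13-6-5-10(3)7-12(13)11(4)14/h5-7,9,11,14H,8H2,1-4H3"
print(count_atoms(inchi))  # [13, 0, 0, 20, 0, 1, 1]
```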


I want to thank [Y.Nakama](https://www.kaggle.com/yasufuminakama), who shared really great starter notebooks.

## Importing libraries

In [None]:
!pip install timm

In [None]:
import os
import gc
import cv2
import numpy as np
import pandas as pd
from tqdm.autonotebook import tqdm

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

import torch
from torch import nn
import torch.nn.functional as F

import timm
import albumentations as A

np.set_printoptions(suppress=True)

## Config

Set **debug** to **False** to train the model on the whole dataset.

In [None]:
class CFG:
    debug = True # set to False to run on the whole data
    epochs = 4
    batch_size = 128
    num_workers = 4
    size = 224
    classes = ["C", "Cl", "F", "H", "N", "O", "S"]
    num_classes = len(classes)  # one regression output per element
    model_name = 'resnet34'
    fc_hidden_size = 128
    dropout = 0.2 # set to 0 or None to remove it
    seed = 42
    n_fold = 5
    learning_rate = 3e-4
    scheduler = "ReduceLROnPlateau"
    patience = 2
    factor = 0.5
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

## Taking a look at the dataframe

In [None]:
train = pd.read_pickle("../input/cnn-rnn-cnn-pretraining-w-regression-preprocess/cnn_pretrain.pkl")
train.head(3)

In [None]:
elements = CFG.classes
weights = []
for element in elements:
    # Fraction of molecules that contain at least one atom of this element
    weight = (train[element] != 0).mean()
    print(f"Element: {element} | Percent non-zero: {weight * 100:.2f} | weight: {weight:.2f}")
    weights.append(weight)

We will ignore the elements B, Br, and Si because they are really rare (less than 0.01 percent non-zero).

## Calculating the weights

The non-zero ratio of each element is used as its weight in the final loss function, so elements that appear in more molecules contribute more to the loss.

In [None]:
weights = np.array(weights)
weights_normalized = (weights / sum(weights)).round(3)
weights_normalized

In [None]:
CFG.weights = torch.tensor(weights_normalized).float().to(CFG.device)
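
To make the weighting concrete, here is a toy example with made-up counts for four molecules (the numbers are invented purely for illustration). It computes the non-zero ratio per element and normalizes the ratios so they sum to 1, which is the same arithmetic as in the cells above.

```python
import numpy as np

# Made-up atom counts; columns are C, Cl, F, H and each row is one molecule.
counts = np.array([
    [10, 0, 0, 12],
    [8, 1, 0, 9],
    [15, 0, 0, 20],
    [6, 0, 2, 7],
])

# Fraction of molecules in which each element occurs at least once.
nonzero_ratio = (counts != 0).mean(axis=0)

# Normalize so that the weights sum to 1.
toy_weights = nonzero_ratio / nonzero_ratio.sum()

print(nonzero_ratio)  # [1.   0.25 0.25 1.  ]
print(toy_weights)    # [0.4 0.1 0.1 0.4]
```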

In debug mode, only a random sample of 10,000 rows is used. Set `CFG.debug` to False to train on the whole dataset.

In [None]:
if CFG.debug:
    train = train.sample(10_000, random_state=CFG.seed).reset_index(drop=True)  # seeded for reproducibility

## Dataset

We will scale the atom counts with **StandardScaler** from scikit-learn. It helps the model learn if all the outputs are on roughly similar scales (molecules usually contain many more carbon atoms than, say, oxygen atoms, so scaling helps).

Keep in mind that the validation set must be scaled with the training statistics, so the scaler is fit only on the training dataset and merely applied (`transform`) to the validation data.
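
As a small numeric illustration of this point, the sketch below does the standardization by hand in NumPy on made-up counts. It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` computes; the variable names are only for this example.

```python
import numpy as np

# Made-up counts; columns are C and O.
train_counts = np.array([[20.0, 2.0], [30.0, 0.0], [25.0, 4.0]])
valid_counts = np.array([[22.0, 1.0]])

# "fit": the statistics come from the training data only.
mean = train_counts.mean(axis=0)
std = train_counts.std(axis=0)  # ddof=0, like StandardScaler

# "transform": the same training statistics are applied to both splits.
train_scaled = (train_counts - mean) / std
valid_scaled = (valid_counts - mean) / std

# "inverse_transform": maps scaled values back to atom counts.
valid_restored = valid_scaled * std + mean

print(valid_scaled)
print(valid_restored)  # [[22.  1.]]
```

After scaling, every training column has mean 0 and standard deviation 1, so the carbon and oxygen targets end up on comparable scales.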

In [None]:
class BMSDataset(torch.utils.data.Dataset):
    def __init__(self, img_paths, counts, scaler=None, transforms=None):
        self.img_paths = img_paths
        if scaler is None:
            self.scaler = StandardScaler()
            self.counts = self.scaler.fit_transform(counts)
        else:
            self.scaler = scaler
            self.counts = scaler.transform(counts)
        
        self.transforms = transforms
    
    def __getitem__(self, idx):
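        img = cv2.imread(self.img_paths[idx])
        if img is None:
            raise FileNotFoundError(f"Could not read image: {self.img_paths[idx]}")
        # OpenCV loads BGR; cvtColor returns a contiguous RGB array (unlike [..., ::-1])
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)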
        if self.transforms is not None:
            img = self.transforms(image=img)['image']
        counts = self.counts[idx]
        img = torch.tensor(img).permute(2, 0, 1).float()
        counts = torch.tensor(counts).float()
        return img, counts
    
    def inverse_transform(self, x):
        """
        x type is np array
        """
        return self.scaler.inverse_transform(x)
        
    def __len__(self):
        return len(self.img_paths)
    

def get_transforms(mode="train"):
    if mode == "train":
        return A.Compose([
            A.Resize(CFG.size, CFG.size),
            A.Normalize()
        ])
    else:
        return A.Compose([
            A.Resize(CFG.size, CFG.size),
            A.Normalize()
        ])
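
`A.Normalize()` is called with its defaults here. As far as I know, these are the ImageNet channel means and standard deviations with `max_pixel_value=255`. The NumPy sketch below reproduces that arithmetic under this assumption, just to show what value a white background pixel ends up with. It is not a replacement for the library call.

```python
import numpy as np

# Assumed defaults of A.Normalize (ImageNet statistics, RGB order).
mean = np.array([0.485, 0.456, 0.406])
std = np.array([0.229, 0.224, 0.225])


def normalize(img_uint8):
    """Scale pixels to [0, 1], then standardize each channel."""
    img = img_uint8.astype(np.float32) / 255.0
    return (img - mean) / std


white = np.full((2, 2, 3), 255, dtype=np.uint8)  # molecule images are mostly white
white_pixel = normalize(white)[0, 0]
print(white_pixel)  # roughly [2.249 2.429 2.64]
```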

## Model

In [None]:
class Model(nn.Module):
    def __init__(self,
                 model_name=CFG.model_name, 
                 num_classes=CFG.num_classes, 
                 dropout=CFG.dropout,
                 fc_hidden_size=CFG.fc_hidden_size,
                 pretrained=True):
        
        super().__init__()
        self.cnn = timm.create_model(
            model_name, pretrained=pretrained, num_classes=0, global_pool=""
        )
        num_features = self.cnn.num_features
        self.dropout = nn.Dropout(dropout) if dropout is not None else nn.Identity()
        self.fc = nn.Linear(num_features, fc_hidden_size)
        self.output = nn.Linear(fc_hidden_size, num_classes)
        
    def get_features(self, x):
        return self.cnn(x)
    
    def forward(self, x):
        batch_size = x.size(0)
        x = self.get_features(x)
        channels = x.size(1)
        x = F.adaptive_avg_pool2d(x, 1).reshape(batch_size, channels)
        x = self.dropout(x)
        x = F.relu(self.fc(x))
        x = self.dropout(x)
        out = self.output(x)

        return out
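
To follow the tensor shapes through `forward`, here is a NumPy sketch with random stand-in data. It assumes that `resnet34` with `global_pool=""` returns a `(batch, 512, 7, 7)` feature map for 224x224 inputs. Biases and dropout are left out, and the weight matrices are random, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, channels, hidden, num_elements = 4, 512, 128, 7

# Stand-in for the unpooled CNN output.
features = rng.normal(size=(batch_size, channels, 7, 7))

# Global average pooling: average over the two spatial axes,
# which is what F.adaptive_avg_pool2d(x, 1) followed by the reshape gives.
pooled = features.mean(axis=(2, 3))

# Two linear layers with a ReLU in between (random weights, no biases).
w_fc = rng.normal(size=(channels, hidden)) * 0.01
w_out = rng.normal(size=(hidden, num_elements)) * 0.01
hidden_act = np.maximum(pooled @ w_fc, 0.0)
out = hidden_act @ w_out

print(features.shape, pooled.shape, out.shape)  # (4, 512, 7, 7) (4, 512) (4, 7)
```

Keeping the feature map unpooled inside the CNN means the same encoder can later hand the full spatial grid to the RNN part.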

## Utils

In [None]:
class AvgMeter:
    def __init__(self, name="Metric"):
        self.name = name
        self.reset()
    
    def reset(self):
        self.avg, self.sum, self.count = [0]*3
    
    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count
    
    def __repr__(self):
        text = f"{self.name}: {self.avg:.4f}"
        return text
    
def get_lr(optimizer):
    for param_group in optimizer.param_groups:
        return param_group["lr"]
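
The `count` argument of `update` matters when the last batch is smaller than the others. The snippet below repeats a compact copy of the meter so it runs on its own, with made-up loss values, and compares the weighted average with a naive mean of the batch values.

```python
class AvgMeter:
    """Compact copy of the meter above, repeated so the example is self-contained."""

    def __init__(self):
        self.avg, self.sum, self.count = 0, 0, 0

    def update(self, val, count=1):
        self.count += count
        self.sum += val * count
        self.avg = self.sum / self.count


meter = AvgMeter()
meter.update(1.0, count=128)  # full batch with mean loss 1.0
meter.update(3.0, count=64)   # smaller last batch with mean loss 3.0

naive_mean = (1.0 + 3.0) / 2
print(meter.avg)   # (128 * 1.0 + 64 * 3.0) / 192, about 1.667
print(naive_mean)  # 2.0
```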

In [None]:
def one_epoch(model, 
              criterion, 
              loader, 
              optimizer=None, 
              lr_scheduler=None,
              mode="train", 
              step="batch"):
    
    loss_meter = AvgMeter()
    mae_orig_meter = AvgMeter()
    
    tqdm_object = tqdm(loader, total=len(loader))
    for images, targets in tqdm_object:
        images, targets = images.to(CFG.device), targets.to(CFG.device)
        preds = model(images)
        loss = criterion(preds, targets)  # per-element squared errors, shape (batch, num_classes)
        loss = (loss * CFG.weights.unsqueeze(0)).mean()  # weight each element, then average
        
        if mode == "train":
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            if step == "batch" and lr_scheduler is not None:
                lr_scheduler.step()
                
        count = images.size(0)
        loss_meter.update(loss.item(), count)
        
        targets_orig = loader.dataset.inverse_transform(targets.cpu().numpy())
        preds_orig = loader.dataset.inverse_transform(preds.detach().cpu().numpy())
        mae_orig = mean_absolute_error(targets_orig, preds_orig)
        mae_orig_meter.update(mae_orig, count)

        if mode == "train":
            tqdm_object.set_postfix(train_loss=loss_meter.avg, train_mae=mae_orig_meter.avg, lr=get_lr(optimizer))
        else:
            tqdm_object.set_postfix(valid_loss=loss_meter.avg, valid_mae=mae_orig_meter.avg)
    if mode != "train":
        print(f"Targets: \n"
              f"{np.round(targets_orig[:5], 2)}")
        
        print(f"Preds: \n"
              f"{np.round(preds_orig[:5], 2)}")
        
    return loss_meter, mae_orig_meter
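
The weighted loss in `one_epoch` can be checked by hand on a tiny example. The weights, predictions and targets below are made up. The only point is to show how `reduction="none"` keeps one squared error per element, so each column can be weighted before averaging.

```python
import torch
from torch import nn

criterion = nn.MSELoss(reduction="none")

toy_weights = torch.tensor([0.5, 0.25, 0.25])  # made-up per-element weights
preds = torch.tensor([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
targets = torch.zeros(2, 3)

per_element = criterion(preds, targets)          # [[1, 0, 4], [0, 1, 0]]
weighted = per_element * toy_weights.unsqueeze(0)  # [[0.5, 0, 1], [0, 0.25, 0]]
loss = weighted.mean()                            # 1.75 / 6

print(per_element.shape, loss.item())
```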

In [None]:
def train_eval(model, 
               train_loader, 
               valid_loader, 
               criterion, 
               optimizer, 
               lr_scheduler=None, 
               step=None,
               fold=0):
    
    best_loss = float('inf')
    
    for epoch in range(CFG.epochs):
        print("*" * 30)
        print(f"Epoch {epoch + 1} | lr: {get_lr(optimizer):.2e}")
        
        model.train()
        train_loss, train_mae = one_epoch(model, 
                                          criterion, 
                                          train_loader, 
                                          optimizer=optimizer,
                                          lr_scheduler=lr_scheduler,
                                          mode="train",
                                          step=step)                     
        model.eval()
        with torch.no_grad():
            valid_loss, valid_mae = one_epoch(model, 
                                              criterion, 
                                              valid_loader, 
                                              optimizer=None,
                                              lr_scheduler=None,
                                              mode="valid")
        
        if valid_loss.avg < best_loss:
            best_loss = valid_loss.avg
            torch.save(model.state_dict(), f'best_fold_{fold}.pt')
            print("Saved best model!")
        
        # or you could do: if step == "epoch":
        if isinstance(lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            lr_scheduler.step(valid_loss.avg)
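
To see when `ReduceLROnPlateau` actually lowers the learning rate with `patience=2` and `factor=0.5`, the sketch below feeds it a made-up validation loss that never improves. The single dummy parameter only exists so that an optimizer can be created.

```python
import torch

dummy_param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([dummy_param], lr=3e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=2
)

lrs = []
for valid_loss in [1.0, 1.0, 1.0, 1.0, 1.0]:  # flat, made-up validation loss
    scheduler.step(valid_loss)
    lrs.append(optimizer.param_groups[0]["lr"])

print(lrs)
```

The first epoch always counts as an improvement, and the rate is halved only after more than `patience` epochs without one. If I read these semantics correctly, with `CFG.epochs = 4` the first reduction can happen no earlier than after the last epoch, so the scheduler only starts to matter once the number of epochs is increased.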

In [None]:
def make_folds(dataframe):
    folds = dataframe.reset_index(drop=True)  # split() yields positions, so .loc needs a 0..n-1 index
    Fold = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
    for n, (train_index, val_index) in enumerate(Fold.split(folds, folds['InChI_length'])):
        folds.loc[val_index, 'fold'] = int(n)
    folds['fold'] = folds['fold'].astype(int)
    print(folds.groupby(['fold']).size())
    return folds
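
Here is a toy version of the fold assignment, with made-up InChI lengths, to show what stratification does: every row lands in exactly one validation fold, and each fold gets the same mix of length values. The array names are only for this example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up InChI lengths: three distinct values, ten rows each.
lengths = np.repeat([10, 20, 30], 10)
dummy_features = np.zeros((len(lengths), 1))  # split() only needs the number of rows
fold_of_row = np.full(len(lengths), -1)

splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (_, val_index) in enumerate(splitter.split(dummy_features, lengths)):
    fold_of_row[val_index] = fold

for fold in range(5):
    values, counts = np.unique(lengths[fold_of_row == fold], return_counts=True)
    print(fold, dict(zip(values.tolist(), counts.tolist())))
```

On the real data, very rare length values (fewer rows than folds) cannot be spread over all folds, and scikit-learn warns about that.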

In [None]:
def make_loader(dataframe, mode="train", scaler=None):
    transforms = get_transforms(mode=mode)
    dataset = BMSDataset(dataframe['file_path'].values, 
                         dataframe.loc[:, CFG.classes].values, 
                         scaler=scaler, 
                         transforms=transforms)
    
    dataloader = torch.utils.data.DataLoader(dataset, 
                                             batch_size=CFG.batch_size, 
                                             shuffle=(mode == "train"),
                                             num_workers=CFG.num_workers)
    return dataloader

In [None]:
def one_fold(folds, fold):  
    print(f"Training Fold: {fold}")
    
    
    train_dataframe = folds[folds['fold'] != fold].reset_index(drop=True)
    valid_dataframe = folds[folds['fold'] == fold].reset_index(drop=True)

    train_loader = make_loader(train_dataframe, "train", None) # Setting scaler to None, fits a new scaler on train data
    valid_loader = make_loader(valid_dataframe, "valid", train_loader.dataset.scaler) # Using train scaler for valid data

    model = Model().to(CFG.device)
    optimizer = torch.optim.Adam(model.parameters(), lr=CFG.learning_rate)
    lr_scheduler, step = None, None  # defaults if no known scheduler is configured
    if CFG.scheduler == "ReduceLROnPlateau":
        lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, 
                                                                  mode="min", 
                                                                  factor=CFG.factor, 
                                                                  patience=CFG.patience)
        step = "epoch"
    
    criterion = nn.MSELoss(reduction='none')
    train_eval(model, 
               train_loader, 
               valid_loader,
               criterion, 
               optimizer, 
               lr_scheduler=lr_scheduler,
               step=step,
               fold=fold)

In [None]:
folds = make_folds(train)

If `CFG.debug` is True, only the first fold is trained; otherwise all folds are trained.

In [None]:
if CFG.debug:
    one_fold(folds, 0)
else:
    for i in range(CFG.n_fold):
        one_fold(folds, i)