# Pretrained SWIN Transformer

**Description:** Millions of stray animals suffer on the streets or are euthanized in shelters every day around the world. A good picture of a homeless animal might increase its chance of being adopted. But what makes a good picture? Our mission is to build an ML model that can accurately determine a pet photo's appeal and even suggest improvements, to give these rescue animals a better chance of finding loving homes.

This competition is organized by PetFinder.my, Malaysia's leading animal welfare platform, featuring 180,000 animals with 54,000 happily adopted. A good, accurate model might have a chance to be adapted into AI tools that guide shelters and rescuers around the world in improving the photo quality of their sheltered pets.

**Data:** 9912 images of pets, each labeled with a "Pawpularity" score from 1 to 100. Photo metadata = (Focus, Eyes, Face, Near, Action, Accessory, Group, Collage, Human, Occlusion, Info, Blur)

![](https://pbs.twimg.com/media/CvhLlXxXgAA5TDJ.jpg)


# 1. Introduction

In my previous attempt, I built a CNN model myself ([Link](https://www.kaggle.com/gohweizheng/petfinder-wz-first-cnn)) to solve this problem, but the result wasn't great. So, in this notebook I will try to get a better result by using pretrained models available on the internet.

Tested pretrained models: tf_efficientnet_b0_ns, swin_large_patch4_window12_384

The model structure was adapted from this tutorial: https://albumentations.ai/docs/examples/pytorch_classification/
and from this notebook: https://www.kaggle.com/manabendrarout/transformers-classifier-method-starter-train

Please give an upvote to this notebook.


In [None]:
import sys
sys.path.append('../input/timm-pytorch-image-models/pytorch-image-models-master')
sys.path.append('../input/earlystoppingpytorch/early-stopping-pytorch')

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import os
import path
import random
import cv2
import timm
import gc
import albumentations
from tqdm import tqdm
from collections import defaultdict

import matplotlib.pyplot as plt
%matplotlib inline

# Import PyTorch Libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts
from pytorchtools import EarlyStopping
from torch.utils.data import DataLoader


# Import SKlearn Libraries
from sklearn import datasets
from sklearn import model_selection
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import mean_squared_error

# Deciding the device used for calculation. CUDA = GPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 2. Data Loading

**2.1 Data loading from CSV file**


In [None]:
data_df = pd.read_csv("../input/petfinder-pawpularity-score/train.csv")
test_df = pd.read_csv("../input/petfinder-pawpularity-score/test.csv")

**2.2 Parameters setting**

Bundling all the parameters that will be used in this notebook here for easy editing.

In [None]:
target = ['Pawpularity']
not_features = ['Id', 'kfold', 'image_path', 'Pawpularity']
cols = list(data_df.columns)
features = [feat for feat in cols if feat not in not_features]
print(features)

In [None]:
params = {
    'folder_dir': '../input/petfinder-pawpularity-score/',
    'model':'swin_large_patch4_window12_384',
    'image_dir': '../input/petfinder2-cropped-dataset/crop/',
    'test_img_dir': '../input/petfinder-pawpularity-score/test/',
    'features': features,
    'img_size' : 384,
    'dropout':0.4,
    'num_workers':2,
    'fold' : 10,
    'batch_size' : 8,
    'lr' : 1e-5,
    'scheduler_name': 'CosineAnnealingWarmRestarts',
    'T_0':10,
    'min_lr':1e-7,
    'pretrained':True,
    'weight_decay':1e-6
}

# Setting a manual seed for everything,
# so that we get the same results every time we run the notebook.
SEED = 42

def seed_everything(seed=SEED):
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    # benchmark=True lets cuDNN pick algorithms non-deterministically, so keep it off for reproducibility.
    torch.backends.cudnn.benchmark = False
    
seed_everything()

**2.3 Data Processing**

In this section, I will build a data pipeline by bundling the data and the processing code in a class, using object-oriented programming (OOP). A class is like a container that holds the data and the functions that operate on it, which keeps the code tidy. It also supports inheritance, so later code can reuse the data and functions defined inside it.

In [None]:
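# Inherit from PyTorch's Dataset so the class follows the interface expected by DataLoader.
class PawDataSet(torch.utils.data.Dataset):
    def __init__(self, dataset, params, features, transform=None):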
        self.dataset = dataset
        self.image_path = dataset['Id'].apply(lambda x: os.path.join(params['image_dir'],f'{x}.jpg'))
        self.target_label = dataset['Pawpularity']
        self.features = dataset[features].values
        self.class_label = self.target_label/100
        self.transform = transform
        self.params = params
    
    # Return the length of the data.
    def __len__(self):
        return len(self.image_path)
    
    # Load images and target score according to index number (idx)
    def __getitem__(self, idx):
        image_filepath = self.image_path[idx]
        image = cv2.imread(image_filepath)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        
        if self.transform is not None:
            image = self.transform(image=image)['image']
        
        image = np.transpose(image,(2, 0, 1)).astype(np.float32)
        image = torch.tensor(image)
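        # Metadata as a float tensor, so it can be concatenated with the image embedding in the model.
        features = torch.tensor(self.features[idx, :], dtype=torch.float32)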
        targets = torch.tensor(self.class_label[idx]).float()
        
        return image, features, targets

# 3. Image augmentation

The transforms below resize each image to the model's input size and normalize it. Augmentation can also create multiple variants of a single input image by flipping, rotating or mixing up a few different images. The transforms are applied on the fly, each time an image is loaded during cross-validation, so the model sees a slightly different version of every training image in each epoch.


In [None]:
# Augmentation function for the training data.
def Transform_train(DIM = params['img_size']):
    return albumentations.Compose(
        [
            albumentations.Resize(DIM,DIM),
            albumentations.HorizontalFlip(p=0.5),
            albumentations.VerticalFlip(p=0.5),
            albumentations.Rotate(limit=45, p=0.4),
            albumentations.ShiftScaleRotate(
                shift_limit = 0.1, scale_limit=0.1, rotate_limit=45, p=0.5
            ),
            albumentations.HueSaturationValue(
                hue_shift_limit=0.2, sat_shift_limit=0.2,
                val_shift_limit=0.2, p=0.5
            ),
            albumentations.RandomBrightnessContrast(
                brightness_limit=(-0.1, 0.1),
                contrast_limit=(-0.1, 0.1), p=0.5
            ),
            # Normalize last, so that the color transforms above work on raw pixel values.
            albumentations.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
            ),
        ],
        p=1.0
    )

In [None]:
# Augmentation function for the validation data.
def Transform_val(DIM = params['img_size']):
    return albumentations.Compose(
        [
            albumentations.Resize(DIM, DIM),
            albumentations.Normalize(
                mean = [0.485, 0.456, 0.406],
                std = [0.229, 0.224, 0.225],
                max_pixel_value=255.0,
                p = 1.0
            ),
        ],
        p=1.0
    )
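
As a rough illustration of what the `Normalize` step in both transforms does to a single pixel, here is a sketch of the arithmetic in plain Python. It is not the library's implementation, and the helper name `normalize_pixel` is made up for this example.

```python
# ImageNet channel statistics, as passed to albumentations.Normalize above.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)
MAX_PIXEL_VALUE = 255.0


def normalize_pixel(rgb):
    """Scale one 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return tuple(
        (value / MAX_PIXEL_VALUE - mean) / std
        for value, mean, std in zip(rgb, MEAN, STD)
    )


black = normalize_pixel((0, 0, 0))
white = normalize_pixel((255, 255, 255))
print([round(v, 3) for v in black])  # all channels negative
print([round(v, 3) for v in white])  # all channels positive
```

After this step the pixel values are roughly centered around zero, which is the input range the pretrained weights were trained on.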

# 4. Splitting the data into K folds

In this section, I will split the data into K folds (subsets) to perform cross-validation. In each round, one fold is left out as validation data and the other K-1 folds are used to train the model. Because every model is scored on data it has not seen during training, overfitting becomes visible and the score is a more reliable estimate of how well the model predicts general data.

More about cross validation here : https://machinelearningmastery.com/repeated-k-fold-cross-validation-with-python/
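
The rotation of training and validation folds can be sketched in plain Python as follows. This is a simplified illustration only: it does no shuffling and no stratification, unlike the `StratifiedKFold` used in this notebook, and the function name `kfold_indices` is made up for this example.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_splits):
        # Every n_splits-th sample, starting at `fold`, goes to validation.
        val_idx = indices[fold::n_splits]
        val_set = set(val_idx)
        train_idx = [i for i in indices if i not in val_set]
        yield train_idx, val_idx


splits = list(kfold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(splits):
    print(f"fold {fold}: val={val_idx} train={train_idx}")
```

Each sample lands in the validation set of exactly one fold and in the training set of all the others.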

**4.1 K fold function**

In [None]:
# Using Sturges' rule to determine the best number of bins for our data.
num_bins = int(np.floor(1+np.log2(len(data_df))))
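
As a quick check of the arithmetic, here is the same rule in plain Python for the 9912 training images mentioned in the Data description. The bin width assumes equal-width bins over the score range 1 to 100, and the name `sturges_bins` is made up for this example.

```python
import math


def sturges_bins(n_samples):
    """Sturges' rule: number of bins = floor(1 + log2(n))."""
    return int(math.floor(1 + math.log2(n_samples)))


n_bins = sturges_bins(9912)
# Approximate width of each bin if scores 1..100 are cut into equal-width bins.
bin_width = (100 - 1) / n_bins
print(n_bins, round(bin_width, 2))
```

These bins are only used to stratify the folds, so that each fold gets a similar distribution of Pawpularity scores.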

In [None]:
data_df['bins'] = pd.cut(data_df['Pawpularity'], bins=num_bins, labels=False)
data_df['fold'] = -1

# Function to create folds. It fills the 'fold' column of `data` in place
# (overwriting any earlier assignment) and returns the same DataFrame.
def create_folds(data, num_splits):
    strat_kfold = StratifiedKFold(n_splits=num_splits, random_state=SEED, shuffle=True)
    fold_col = data.columns.get_loc('fold')
    for i, (_, idx) in enumerate(strat_kfold.split(data.index, data['bins'])):
        data.iloc[idx, fold_col] = i
    
    data['fold'] = data['fold'].astype('int')
    data.fold.value_counts().plot.bar(xlabel="Fold", ylabel="Number of data")
    return data
    

**4.2 Splitting the data into K fold**

Using the k-fold function above, the data are split into 5 folds and then into 10 folds. Each call overwrites the `fold` column of `data_df`, so only the 10-fold assignment from the last call is used in the rest of the notebook.


In [None]:
# 5 Folds
df_5 = create_folds(data_df, num_splits=5)

In [None]:
# 10 Folds
df_10 = create_folds(data_df, num_splits=10)

# 5. Model Building

The functions needed to train the model are written in this section.

**5.1 Metrics function**

A metric function monitors the model's performance, so that it can be tracked after each epoch. Unlike the loss function, the metric is not used to update the model during training.

In [None]:
def usr_rmse_score(output, target):
    y_pred = torch.sigmoid(output).cpu()
    y_pred = y_pred.detach().numpy()*100
    target = target.cpu()*100
    
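    # Take the square root of the MSE ourselves; the `squared=False` argument is deprecated in newer scikit-learn releases.
    return np.sqrt(mean_squared_error(target, y_pred))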
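
To make the rescaling in `usr_rmse_score` concrete, here is a small worked example in plain Python. The model output is treated as a logit, squashed to 0..1 with a sigmoid and multiplied by 100 to get back to the Pawpularity scale. The logits and targets below are made-up values for illustration, and `rmse_from_logits` is a hypothetical helper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def rmse_from_logits(logits, targets_01):
    """RMSE on the 0..100 scale, given raw logits and targets scaled to 0..1."""
    preds = [sigmoid(z) * 100 for z in logits]
    targets = [t * 100 for t in targets_01]
    mse = sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)
    return math.sqrt(mse)


# A logit of 0 maps to a predicted score of 50.
score = rmse_from_logits([0.0, 0.0], [0.38, 0.62])
print(round(score, 3))  # both predictions are off by 12 points
```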

In [None]:
class MetricMonitor:
    def __init__(self, float_precision=3):
        self.float_precision = float_precision
        self.reset()
        
    def reset(self):
        self.metrics = defaultdict(lambda: {'val':0, 'count':0, 'avg':0})
    
    def update(self, metric_name, val):
        metric = self.metrics[metric_name]
        
        metric['val'] += val
        metric['count'] += 1
        metric['avg'] = metric['val'] / metric['count']
        
    def __str__(self):
        return "|".join(
            [
                "{metric_name}: {avg:.{float_precision}f}".format(
                    metric_name=metric_name, avg=metric['avg'],
                    float_precision=self.float_precision
                )
                for (metric_name, metric) in self.metrics.items()
            ]
        )
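
One thing to keep in mind: `MetricMonitor` averages the values it receives per batch, so the RMSE shown in the progress bar is the mean of the per-batch RMSE values. That is generally not the same number as the RMSE computed over all samples at once. A small sketch with made-up error values:

```python
import math


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Prediction errors of two batches (illustrative numbers only).
batch_errors = [[0.0, 0.0], [10.0, 10.0]]

mean_of_batch_rmse = sum(rmse(b) for b in batch_errors) / len(batch_errors)
overall_rmse = rmse([e for b in batch_errors for e in b])

print(mean_of_batch_rmse)      # what a running average of batch RMSE reports
print(round(overall_rmse, 3))  # RMSE over all samples together
```

This is why it is safer to compute the final validation score from all predictions together rather than from the running average.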
    

**5.2 Scheduler function**

A scheduler changes the optimizer's learning rate automatically during training, so the initial learning rate does not have to be right for the whole run.

A learning rate that is too small makes training take very long, while one that is too large might prevent the model from ever settling into a minimum. A scheduler lets me start with a larger learning rate and lower it automatically as training proceeds.

In this notebook I used CosineAnnealingWarmRestarts as my scheduler, but there are many other schedulers to choose from. You can find them here: https://pytorch.org/docs/stable/optim.html.
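
The shape of this schedule can be sketched in plain Python. The learning rate follows half a cosine wave from the base learning rate down towards `min_lr`, then jumps back up every `T_0` steps. The sketch below assumes a constant cycle length (no `T_mult`) and uses the `lr`, `min_lr` and `T_0` values from `params`. It is an illustration of the formula, not PyTorch's implementation, and the function name is made up.

```python
import math


def cosine_warm_restart_lr(step, base_lr=1e-5, min_lr=1e-7, t_0=10):
    """Learning rate after `step` scheduler steps, restarting every t_0 steps."""
    t_cur = step % t_0  # position inside the current cycle
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_cur / t_0))


schedule = [cosine_warm_restart_lr(s) for s in range(21)]
for step, lr in enumerate(schedule):
    print(f"step {step:2d}: lr = {lr:.2e}")
```

The learning rate starts at `base_lr`, decays within each cycle, and is reset to `base_lr` at steps 10 and 20.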

In [None]:
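def get_scheduler(optimizer, scheduler_params=params):
    if scheduler_params['scheduler_name'] == 'CosineAnnealingWarmRestarts':
        scheduler = CosineAnnealingWarmRestarts(
            optimizer,
            T_0 = scheduler_params['T_0'],
            eta_min = scheduler_params['min_lr'],
            last_epoch = -1
        )
    else:
        # Fail early instead of returning an undefined variable.
        raise ValueError(f"Unsupported scheduler: {scheduler_params['scheduler_name']}")
    return scheduler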

**5.3 Pretrained Model**

In this section, the model class is built around a pretrained model downloaded from the internet. This technique is known as transfer learning.

Using a pretrained model brings a state-of-the-art network architecture, designed by leading researchers, into this ML solution. Examples are ResNet, VGG16 and EfficientNet. The weights of each layer, as trained by the model's developers, can also be loaded by setting pretrained = True.

In this notebook, the Swin Transformer was chosen as the pretrained model. Only the final head layer is replaced: it now outputs a 128-dimensional embedding, which is concatenated with the photo metadata and passed through a small fully connected network that returns a single output, the Pawpularity score of the image. The architecture of the model is printed in the next cell.

In [None]:
# Load and print out the architecture of the pretrained model.
# We will change only the last layer of the model (head) in the next cell.
SWIN_model = timm.create_model(model_name = params['model'])
print(SWIN_model)

In [None]:
class PetNet(nn.Module):
    def __init__(self, model_name=params['model'], pretrained=params['pretrained'], features=len(params['features']) ):
        super().__init__()
        self.model = timm.create_model(model_name=model_name, pretrained=pretrained, in_chans=3)
        # Replace the final head layers in model with our own Linear layer
        num_features = self.model.head.in_features
        self.model.head = nn.Linear(num_features, 128)
        self.fully_connect = nn.Sequential(nn.Linear(128 + features, 64),
                                           nn.ReLU(),
                                           nn.Linear(64, 1)
                                          )
        self.dropout = nn.Dropout(p=0.5)
    
    def forward(self, image, features):
        x = self.model(image)
        # Using dropout functions to randomly shutdown some of the nodes in hidden layers to prevent overfitting.
        x = self.dropout(x)
        # Concatenate the metadata into the results.
        x = torch.cat([x, features], dim=1)
        output = self.fully_connect(x)
        return output
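
To see how the sizes fit together in `PetNet`, here is a small back-of-the-envelope calculation in plain Python. It assumes the 12 metadata columns listed in the Data description are used as features. The helper `linear_param_count` is made up for this example.

```python
def linear_param_count(in_features, out_features):
    """Number of weights plus biases in a fully connected layer."""
    return in_features * out_features + out_features


IMAGE_EMBEDDING = 128  # output size of the replaced head layer
N_METADATA = 12        # assumption: the 12 metadata columns from the Data description

# The image embedding and the metadata are concatenated along the feature axis.
concat_width = IMAGE_EMBEDDING + N_METADATA

head_params = linear_param_count(concat_width, 64) + linear_param_count(64, 1)
print(concat_width, head_params)
```

The extra fully connected layers are tiny compared to the pretrained backbone, so almost all of the model's capacity comes from the pretrained weights.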

**5.4 Training function**

This function feeds images into the model and generates predictions. The difference between the predictions and the target values is then measured with the loss function. Lastly, the gradient of the loss is calculated and used by the optimizer to update the model's weights.


In [None]:
def train_fn(train_loader, model, criterion, optimizer ,epoch, params, scheduler=None):
    metric_monitor = MetricMonitor()
    # Set the model into train mode. There are train mode and eval mode.
    model.train()
    
    # Load the data using tqdm to visualize the training process.
    stream = tqdm(train_loader)
    
    for i, (images,features, target) in enumerate(stream):
        images = images.to(device)
        target = target.to(device).view(-1, 1)
        features = features.to(device)
        
        # Generate predictions by passing images through the model.
        preds = model(images, features)
        
        # Calculate the difference between prediction value and target value ('Pawpularity' label). 
        loss = criterion(preds, target)
        
        # Generate Root Mean Square Error score
        rmse_score = usr_rmse_score(preds, target)
        metric_monitor.update('Loss', loss.item())
        metric_monitor.update('RMSE', rmse_score)
        
        # Generate loss gradient. Gradients are accumulated over 4 batches before each optimizer step.
        loss.backward()
        if (i+1)%4==0:
            optimizer.step()
        
            # Use scheduler to change the learning rate.
            if scheduler is not None:
                scheduler.step()
            
            # Reset the gradient only after the optimizer step, so that the
            # gradients of the 4 batches are added up before each update.
            optimizer.zero_grad()
        
        # Set description to the progress bar when we run this training function
        stream.set_description(f"Epoch: {epoch:02}. Train. {metric_monitor}")
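
The `(i+1) % 4` check in `train_fn` suggests gradient accumulation: the optimizer steps once every four batches, which imitates a batch size four times larger. The idea can be sketched in plain Python with a one-parameter model. Dividing each micro-batch gradient by the number of accumulation steps is an assumption of this sketch (the notebook does not rescale the loss), and all names are made up for illustration.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w*x - y)^2) with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, micro_batches):
    """Add up micro-batch gradients, each scaled by 1 / number of steps."""
    accum_steps = len(micro_batches)
    grad = 0.0  # must NOT be reset between micro-batches
    for xs, ys in micro_batches:
        grad += mse_grad(w, xs, ys) / accum_steps
    return grad


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]
micro_batches = [(xs[i:i + 2], ys[i:i + 2]) for i in range(0, 8, 2)]

full = mse_grad(0.0, xs, ys)
accum = accumulated_grad(0.0, micro_batches)
print(full, accum)  # the two gradients agree
```

The gradients only add up if they are kept between micro-batches, so they should be cleared once per optimizer step rather than after every batch.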

**5.5 Validation Function**

This function validates the trained model on the unseen validation data set. The result is used to determine whether the model is improving or not.

Validation is very important when training a model, because a model tends to overfit by memorizing the training data. An overfitted model seems to generate very accurate results, but often fails to reproduce them on unseen data. Thus, it's very important to reserve part of the data and have a separate function that evaluates the model on unseen data, in order to get an accurate picture of its performance.

In [None]:
def validate_fn(val_loader, model, criterion, epoch, params):
    metric_monitor = MetricMonitor()
    
    # Set the model into evaluation mode. Dropout is disabled and BatchNorm layers use their running statistics.
    model.eval()
    stream = tqdm(val_loader)
    valid_targets = []
    predictions = []
    
    # Turn off the gradient tracking for faster processing.
    with torch.no_grad():
        for i, (images,features, target) in enumerate(stream, start=1):
            images = images.to(device)
            target = target.float().view(-1, 1)
            target = target.to(device)
            features = features.to(device)
           
            preds = model(images, features)
            loss = criterion(preds, target)
           
            rmse_score = usr_rmse_score(preds, target)
            metric_monitor.update('Loss', loss.item())
            metric_monitor.update('RMSE', rmse_score)
            stream.set_description(f'Epoch: {epoch:02}. Valid. {metric_monitor}')
            
            targets = (target.detach().cpu().numpy()*100).tolist()
            outputs = (torch.sigmoid(preds).detach().cpu().numpy()*100).tolist()
            
            valid_targets.extend(targets)
            predictions.extend(outputs)
            
    return valid_targets, predictions

# 6. Run function


In [None]:
best_models_of_each_fold = []
rmse_tracker = []

In [None]:
# Only folds 2 and 3 are run here. Change to range(params['fold']) to run all 10 folds.
for fold in range(2,4):
    # Split the data into training data and validation data for cross validation
    # The data that have same label as the fold will be used as Validation data, the rest as Training data.
    train = data_df[data_df['fold']!=fold].reset_index(drop=True)
    val = data_df[data_df['fold']==fold].reset_index(drop=True)
    
    # Making training and validating dataset.
    train_dataset = PawDataSet(
        dataset = train,
        params = params,
        features = params['features'],
        transform = Transform_train()
    )
    val_dataset = PawDataSet(
        dataset = val,
        params = params,
        features = params['features'],
        transform = Transform_val()
    )
    
    # Making data loader using PyTorch DataLoader function. This allow us to separate data into small batches to train the model.
    train_loader  = DataLoader(
        train_dataset, batch_size=params['batch_size'], shuffle=True, 
        num_workers=params['num_workers']
    )
    
    val_loader = DataLoader(
        val_dataset, batch_size=params['batch_size'], shuffle=False,
        num_workers=params['num_workers']
    )
    
    # Loading model into GPU.
    model = PetNet()
    model = model.to(device)
    
    # Setting criterion to calculate loss, optimizer and scheduler.
    criterion = nn.BCEWithLogitsLoss()
    optimizer = torch.optim.AdamW(model.parameters(),
                                 lr=params['lr'],
                                 weight_decay=params['weight_decay'],
                                 amsgrad=False)
    
    # Use the scheduler functions that we defined at section 5.2 to update the learning rate in optimizer.
    scheduler = get_scheduler(optimizer)
    
    # Early stopping functions to stop the training process if the model is not improving after each epoch.
    early_stopping = EarlyStopping(patience=2, verbose=True)
    
    # Training and validation loop
    best_rmse = np.inf
    best_epoch = np.inf
    best_model_name = None
    
    # Epoch = how many times to repeat the training loop.
    for epoch in range(40):
        train_fn(train_loader, model, criterion, optimizer, epoch, params, scheduler)
        # validate_fn returns (valid_targets, predictions), in that order.
        valid_targets, predictions = validate_fn(val_loader, model, criterion, epoch, params)
        rmse = round(float(np.sqrt(mean_squared_error(valid_targets, predictions))), 3)
        
        # Condition loop to save the model with best score.
        if rmse < best_rmse:
            best_rmse = rmse
            best_epoch = epoch
            if best_model_name is not None:
                os.remove(best_model_name)
                
            # Saving state_dict of the best model to rerun it later for inference.
            torch.save(model.state_dict(),
                       f"{params['model']}_epoch_f{fold}.pth")
            best_model_name = f"{params['model']}_epoch_f{fold}.pth"
        
        # Evaluate the output rmse of the model to decide whether to stop the loop or not.
        early_stopping(rmse, model)
        
        # Stop the training loop if the score doesn't improve after each epoch.
        if early_stopping.early_stop:
            print("Early stopping")
            break
            
    # Print summary
    print('')
    print(f'The best RMSE: {best_rmse} for fold {fold+1} was achieved on epoch: {best_epoch}')
    print(f'The best saved model is: {best_model_name}')
    best_models_of_each_fold.append(best_model_name)
    rmse_tracker.append(best_rmse)
    print(''.join(['#']*50))
    del model
    gc.collect()
    torch.cuda.empty_cache()
    
print('')
print(f'Average RMSE of all folds: {round(np.mean(rmse_tracker), 4)}')
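
The early-stopping logic used in the loop above can be sketched in a few lines of plain Python. This is a simplified stand-in and not the `pytorchtools` implementation: the class name `PatienceStopper` is made up, it saves no checkpoint, and the RMSE history is made-up data for illustration only.

```python
class PatienceStopper:
    """Signal a stop when the score has not improved for `patience` epochs."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, score):
        if score < self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


stopper = PatienceStopper(patience=2)
history = [18.9, 18.4, 18.5, 18.6, 18.2]  # illustrative validation RMSE per epoch
stopped_at = None
for epoch, rmse in enumerate(history):
    if stopper.step(rmse):
        stopped_at = epoch
        break

print(stopped_at, stopper.best)
```

With a patience of 2, training stops after two epochs in a row without improvement, so the later, better score in this made-up history is never reached. A small patience saves GPU time at the risk of stopping too early.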
    
    

# 7. Conclusion

By using pretrained models, I was able to improve my competition score from 21.15 to the scores listed below.

1. EfficientNet pretrained model: 18.51
2. Swin Transformer pretrained model: 18.15

Through this notebook I have learned that, instead of spending hours building my own model, using a publicly available pretrained model is the way to go for image tasks, because each pretrained model comes with a sophisticated, state-of-the-art architecture and pretrained weights.

The trained model is run on the test data in this inference notebook: https://www.kaggle.com/gohweizheng/swin-transformer-inference?scriptVersionId=84293799

**End notes**

Both the EfficientNet and the Swin Transformer models took a very long time to train. EfficientNet took 10 GPU hours and the Swin Transformer took 20 GPU hours to finish training all the folds. EfficientNet was quick per epoch but required many epochs to reach its best validation score, so early stopping was necessary to get the best model. The Swin Transformer, on the other hand, was slow per epoch because its architecture is much more complex than EfficientNet's, but it reached its best score in far fewer epochs. It may be faster to train it for a fixed 4 epochs instead of using early stopping.