<a href="https://colab.research.google.com/github/sysu17363098/CS231n/blob/master/swin%2Broberta%2Battn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **[PyTorch] Wikipedia Image Caption With Attention**

## **Install Required Libraries & Mount Google Drive**

In [None]:
!pip install git+https://github.com/rwightman/pytorch-image-models


In [None]:
!pip install --upgrade wandb


In [None]:
!pip install transformers



In [None]:
!pip install colorama



In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## **Import Required Libraries**

In [None]:
import os
import gc
import cv2
import copy
import time
import random
from PIL import Image
from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True

import base64
import pickle

# For downloading images
from io import BytesIO

# For data manipulation
import numpy as np
import pandas as pd

# PyTorch Imports
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
from torch.utils.data import Dataset, DataLoader
from torch.cuda import amp

# Utils
import joblib
from tqdm import tqdm
from collections import defaultdict

# Sklearn Imports
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import StratifiedKFold, KFold

# For Image Models
import timm

# For Transformer Models
from transformers import AutoTokenizer, AutoModel
from transformers import AutoFeatureExtractor, SwinModel

# Albumentations for augmentations
import albumentations as A
from albumentations.pytorch import ToTensorV2

# For colored terminal text
from colorama import Fore, Back, Style
b_ = Fore.BLUE
sr_ = Style.RESET_ALL

# Import Weights & Biases
import wandb

import warnings
warnings.filterwarnings("ignore")

# For descriptive error messages
os.environ['CUDA_LAUNCH_BLOCKING'] = "1"

## **Prepare Weights & Biases Account [*Optional*]**

In [None]:
import wandb

wandb.login()

ERROR:wandb.jupyter:Failed to detect the name of this notebook, you can set it manually with the WANDB_NOTEBOOK_NAME environment variable to enable code saving.

wandb: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

## **Training Configuration**

In [None]:
CONFIG = {"seed": 2022,
      "epochs": 50,
      "img_size": 256,
      "image_model_name": "swin", # "tf_efficientnet_b0"
      "text_model_name": "xlm-roberta-base",
      "embedding_size": 768,
      "train_batch_size": 32,
      "valid_batch_size": 64,
      "learning_rate": 1e-3,
      "scheduler": 'CosineAnnealingLR',
      "min_lr": 1e-5,
      "T_max": 500,
      "weight_decay": 1e-6,
      "max_length": 32,
      "n_accumulate": 1,
      "device": torch.device("cuda:0" if torch.cuda.is_available() else "cpu"),
      "root": "/content/drive/MyDrive/BDT SEM 1/ImageCaptionMatching/training/"
      }

CONFIG["tokenizer"] = AutoTokenizer.from_pretrained(CONFIG['text_model_name'])


In [None]:
CONFIG['device']

device(type='cuda', index=0)

## **Set Seed for Reproducibility**

In [None]:
def set_seed(seed=88):
    '''Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.'''
    # Python's own RNG drives the random.shuffle / random.choice calls used below
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    # When running on the CuDNN backend, two further options must be set
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)

set_seed(CONFIG['seed'])
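
The dataset class further down draws captions with `random.choice`, so Python's built-in `random` module also affects run-to-run repeatability. The following is a minimal standard-library sketch (the caption list and the `draw` helper are made up for illustration) showing that reseeding `random` replays the same picks:

```python
import random

# Hypothetical captions standing in for one record's caption list
captions = ["caption a", "caption b", "caption c", "caption d"]

def draw(seed, n=5):
    """Reseed Python's RNG, then draw n captions with random.choice."""
    random.seed(seed)
    return [random.choice(captions) for _ in range(n)]

first = draw(2022)
second = draw(2022)
print(first == second)  # identical seeds replay identical picks
```

With `num_workers > 0` each DataLoader worker runs in its own process, so this sketch only illustrates the single-process case.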

## **Dataset**

### default dataset

#### *Load & Split the Dataset*

In [None]:
import pickle
filepath = "/content/drive/MyDrive/BDT SEM 1/ImageCaptionMatching/dataset/dataset.pkl"
with open(filepath, "rb") as fp:
  data = pickle.load(fp)

In [None]:
print("size of the whole data:",len(data))

size of the whole data: 97052


In [None]:
data_transforms = {
    "train": A.Compose([
        A.Resize(256, 256),
        A.HorizontalFlip(p=0.5),
        A.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
                max_pixel_value=255.0,
                p=1.0
            ),
        ToTensorV2()
        ], p=1.),

    "valid": A.Compose([
        A.Resize(256, 256),
        A.Normalize(
                mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225],
                max_pixel_value=255.0,
                p=1.0
            ),
        ToTensorV2()
        ], p=1.)
}
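
To make the `A.Normalize` step concrete, here is a plain-Python sketch of the per-pixel arithmetic as I understand it: scale the 8-bit value by `max_pixel_value`, subtract the channel mean, divide by the channel standard deviation. The helper name `normalize_pixel` is hypothetical; only the statistics come from the cell above.

```python
# ImageNet statistics used by the transforms above
MEAN = [0.485, 0.456, 0.406]
STD = [0.229, 0.224, 0.225]
MAX_PIXEL_VALUE = 255.0

def normalize_pixel(rgb):
    """Scale an 8-bit RGB pixel to [0, 1], then standardize each channel."""
    return [(value / MAX_PIXEL_VALUE - m) / s for value, m, s in zip(rgb, MEAN, STD)]

white = normalize_pixel([255, 255, 255])
black = normalize_pixel([0, 0, 0])
print(white)  # all channels positive
print(black)  # all channels negative
```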

Shuffle the data, then split it into training and validation sets

In [None]:
import random
random.shuffle(data)

train_data = data[:45000]
valid_data = data[45000:60000]
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(valid_data)}")

Number of training samples: 45000
Number of validation samples: 15000


#### **Dataset Structure**

`data` is a list of dictionaries, each with the following keys:

*   `b64_bytes`: base64-encoded bytes of the image file at a 300px resolution
*   `caption_title_and_reference_description`: list of captions
*   `filename`: filename of the image
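
As a rough illustration of this structure, the sketch below builds one hypothetical record and decodes its `b64_bytes` field with the standard library. The payload, captions, and filename are placeholders, and whether the real dataset stores the field as `str` or `bytes` is an assumption (`base64.b64decode` accepts both).

```python
import base64

# Hypothetical record following the structure described above; the bytes are
# a placeholder payload, not a real image file.
raw_bytes = b"placeholder image bytes"
record = {
    "b64_bytes": base64.b64encode(raw_bytes).decode("ascii"),
    "caption_title_and_reference_description": ["First caption", "Second caption"],
    "filename": "Example_file.jpg",
}

decoded = base64.b64decode(record["b64_bytes"])
print(decoded == raw_bytes)
print(len(record["caption_title_and_reference_description"]))
```

In the dataset class, the decoded bytes are wrapped in `BytesIO` and opened with PIL instead of being compared to a known payload.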





#### *Prepare the DataLoader*

##### Custom dataset

In [None]:
class WikipediaDataset(Dataset):
    def __init__(self, data, tokenizer, max_length, transforms=None):
        self.data = data
        self.max_len = max_length
        self.tokenizer = tokenizer
        self.transforms = transforms


    def __len__(self):
        return len(self.data)

    def __getitem__(self, index):
        # image
        image_bytes = base64.b64decode(self.data[index]['b64_bytes'])
        img = np.asarray(Image.open(BytesIO(image_bytes)).convert('RGB'))

        if self.transforms:
            img = self.transforms(image=img)['image']
        """
        inputs = feature_extractor(img, return_tensors="pt").to(CONFIG['device'])

        with torch.no_grad():
            outputs = swin_model(**inputs)

        img = outputs.last_hidden_state  # (batch_size, sq_len=49, emb_size=768)
        """
        # caption
        caption = random.choice(self.data[index]['caption_title_and_reference_description'])
        caption = caption.replace('[SEP]', '</s>') # sep token for xlm-roberta
        inputs = self.tokenizer.encode_plus(
                caption,
                truncation=True,
                add_special_tokens=True,
                max_length=self.max_len,
                padding='max_length'
            )

        ids = inputs['input_ids']
        mask = inputs['attention_mask']

        # filename
        filename = self.data[index]['filename']
        filename = filename.replace('[SEP]', '</s>') # sep token for xlm-roberta
        encode_name = self.tokenizer.encode_plus(
                filename,
                truncation=True,
                add_special_tokens=True,
                max_length=self.max_len,
                padding='max_length'
            )
        name_ids = encode_name['input_ids']
        name_mask = encode_name['attention_mask']


        return {
            'ids': torch.tensor(ids, dtype=torch.long),
            'mask': torch.tensor(mask, dtype=torch.long),
            'image': img,
            'name_ids': torch.tensor(name_ids, dtype=torch.long),
            'name_mask': torch.tensor(name_mask, dtype=torch.long)
        }
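
The `padding='max_length'` and `truncation=True` options give every caption and filename the same length. The sketch below mimics that bookkeeping in plain Python; `pad_or_truncate` is a hypothetical helper, the token ids are invented, and `pad_id=1` is an assumption for illustration. A real tokenizer also keeps the closing special token when truncating, which this simplification does not.

```python
def pad_or_truncate(token_ids, max_length, pad_id):
    """Return (ids, mask), each with exactly max_length entries.

    mask is 1 for real tokens and 0 for padding, matching the
    `attention_mask` convention used above.
    """
    ids = list(token_ids[:max_length])
    mask = [1] * len(ids)
    padding = max_length - len(ids)
    return ids + [pad_id] * padding, mask + [0] * padding

# Hypothetical token ids; pad_id=1 is an assumption for illustration
ids, mask = pad_or_truncate([0, 581, 7515, 2], max_length=8, pad_id=1)
print(ids)   # [0, 581, 7515, 2, 1, 1, 1, 1]
print(mask)  # [1, 1, 1, 1, 0, 0, 0, 0]
```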

Build the DataLoaders for the training and validation sets

In [None]:
train_dataset = WikipediaDataset(
    train_data,
    CONFIG["tokenizer"],
    CONFIG["max_length"],
    transforms=data_transforms["train"]
    )
train_loader = DataLoader(
    train_dataset,
    batch_size=CONFIG['train_batch_size'],
    num_workers=4, shuffle=True, pin_memory=True, drop_last=True)

valid_dataset = WikipediaDataset(
    valid_data,
    CONFIG["tokenizer"],
    CONFIG["max_length"],
    transforms=data_transforms["valid"]
    )
valid_loader = DataLoader(valid_dataset, batch_size=CONFIG['valid_batch_size'],
                          num_workers=4, shuffle=False, pin_memory=True)

## **Model Architecture**

### EfficientNet + XLM-RoBERTa

In [None]:
class WikipediaModel(nn.Module):
    def __init__(self, image_model, text_model, embedding_size):
        super(WikipediaModel, self).__init__()
        self.image_model = timm.create_model(image_model, pretrained=True)
        self.n_features = self.image_model.classifier.in_features
        self.image_model.reset_classifier(0)
        self.image_drop = nn.Dropout(p=0.2)
        self.image_fc = nn.Linear(self.n_features, embedding_size)

        self.text_model = AutoModel.from_pretrained(text_model)
        self.text_drop = nn.Dropout(p=0.2)
        self.text_fc = nn.Linear(768, embedding_size)

        self.freeze_backbone()

    def forward(self, images, ids, mask):
        image_features = self.image_model(images)
        image_embeddings = self.image_fc(self.image_drop(image_features))

        out = self.text_model(input_ids=ids,attention_mask=mask,
                              output_hidden_states=False)
        out = self.text_drop(out[1])
        text_embeddings = self.text_fc(out)

        return image_embeddings, text_embeddings

    def freeze_backbone(self):
        for params in self.image_model.parameters():
            params.requires_grad = False
        # Only finetune final layer
        self.image_fc.weight.requires_grad = True
        self.image_fc.bias.requires_grad = True

        for params in self.text_model.parameters():
            params.requires_grad = False
        # Only finetune final layer
        self.text_fc.weight.requires_grad = True
        self.text_fc.bias.requires_grad = True

### Swin + XLM-RoBERTa + Attention

In [None]:
class Attention(nn.Module):
    """Multi-head attention and a feed-forward network, each with a residual connection and LayerNorm."""
    def __init__(self, feat_dim, attention=None, ffn=None, last_norm=True):
        super(Attention, self).__init__()

        self.att = attention if attention else nn.MultiheadAttention(embed_dim=feat_dim, num_heads=8)
        self.norm = nn.LayerNorm(feat_dim)
        self.ffn = ffn if ffn else nn.Sequential(nn.Linear(feat_dim, 1024), nn.ReLU(inplace=True), nn.Linear(1024, feat_dim))
        self.last_norm = nn.LayerNorm(feat_dim) if last_norm else nn.Identity()

    def forward(self, q, k, v, attention_mask=None):
        # attention_mask is a key padding mask of shape (batch, key_len); True marks positions to ignore
        feat = self.att(q, k, v, key_padding_mask=attention_mask)[0]
        feat = self.norm(feat + q)
        feat = self.last_norm(feat + self.ffn(feat))
        return feat
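
The core operation inside this block is scaled dot-product attention with a padding mask. The sketch below shows it for a single head and a single query vector in plain Python. It is not how `nn.MultiheadAttention` is implemented (that module adds learned projections and several heads), and `masked_attention` and its toy inputs are hypothetical.

```python
import math

def masked_attention(query, keys, values, key_padding_mask):
    """Single-head scaled dot-product attention for one query vector.

    key_padding_mask[i] is True when key i is padding and must be ignored.
    """
    scale = math.sqrt(len(query))
    scores = [
        float("-inf") if padded else sum(q * k for q, k in zip(query, key)) / scale
        for key, padded in zip(keys, key_padding_mask)
    ]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # exp(-inf) evaluates to 0.0
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
values = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
output, weights = masked_attention(query, keys, values, [False, False, True])
print(weights)  # the masked third key receives zero weight
print(output)
```

The third key has the largest dot product with the query, yet it contributes nothing because it is marked as padding.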

In [None]:
class WikipediaModel_Attention(nn.Module):
    def __init__(self, image_model, text_model, embedding_size, device):
        super(WikipediaModel_Attention, self).__init__()
        self.device = device

        # image feature extractor
        self.feature_extractor = AutoFeatureExtractor.from_pretrained("microsoft/swin-tiny-patch4-window7-224")
        self.image_model = SwinModel.from_pretrained("microsoft/swin-tiny-patch4-window7-224").to(self.device)

        # text(caption/filename) feature extractor
        self.text_model = AutoModel.from_pretrained(text_model)

        # Attention
        self.att1 = Attention(
            embedding_size,
            nn.MultiheadAttention(
                embed_dim=embedding_size,
                num_heads=8,
                kdim=768,
                vdim=768))
        self.att2 = Attention(embedding_size)
        self.att3 = Attention(embedding_size, last_norm=False)

        # FFN for the caption embedding
        self.norm = nn.LayerNorm(768)
        self.ffn = nn.Sequential(nn.Linear(768, 1024), nn.ReLU(inplace=True), nn.Linear(1024, 768))

        self.freeze_backbone()

    def forward(self, images, ids, mask, f_ids, f_mask):
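        # image embedding
        # Swin output is (batch, seq_len, emb_size=768); seq_len is 49 for 224x224 inputs.
        # It is permuted to (seq_len, batch, emb_size) before the attention blocks below.
        image_embedding = self.image_model(images).last_hidden_state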
        """
        image_embedding = []
        for img in images.tolist(): # len = batch size
          #inputs = self.feature_extractor(torch.Tensor(img).cpu().numpy(), return_tensors="pt").to(self.device)

          with torch.no_grad():
              outputs = self.image_model(img)  # **inputs
          image_embedding.append(outputs.last_hidden_state)
        image_embedding = torch.stack(image_embedding)  # (batch_size, sq_len=49, emb_size=768)
        """
        # caption embedding
        out = self.text_model(input_ids=ids,attention_mask=mask,
                              output_hidden_states=False)
        caption_embedding = out['pooler_output']
        caption_embedding = caption_embedding + self.ffn(caption_embedding)

        # filename embedding
        output = self.text_model(input_ids=f_ids, attention_mask=f_mask, output_hidden_states=False)
        seq_embedding = output['last_hidden_state'] # (batch_size, sequence_length, hidden_size)
        name_embedding = output['pooler_output'] # (batch_size, hidden_size)
        attention_mask = (1 - f_mask).bool()

        # attn
        # nn.MultiheadAttention expects (seq_len, batch, emb_size) by default
        image_embedding = image_embedding.permute(1, 0, 2)
        seq_embedding = seq_embedding.permute(1, 0, 2) # (sq_len,batch,emb_size)
        embedding = self.att1(seq_embedding, image_embedding, image_embedding)
        embedding = self.att2(embedding, embedding, embedding, attention_mask)

        name_embedding = name_embedding.unsqueeze(0)
        embedding = self.att3(name_embedding, embedding, embedding, attention_mask)

        return embedding.squeeze(0), caption_embedding

    def freeze_backbone(self):
        for params in self.image_model.parameters():
            params.requires_grad = False

        for params in self.text_model.parameters():
            params.requires_grad = False
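
One detail in `forward` that is easy to get backwards is the mask polarity: the tokenizer's `attention_mask` uses 1 for real tokens, while a key padding mask uses `True` for positions to ignore, hence `(1 - f_mask).bool()`. A plain-Python sketch of that inversion, with a hypothetical helper name and a made-up batch:

```python
def to_key_padding_mask(attention_mask):
    """Invert a 1/0 tokenizer mask into a True-means-ignore padding mask."""
    return [[token == 0 for token in row] for row in attention_mask]

# Hypothetical batch of two filename masks (1 = real token, 0 = padding)
f_mask = [[1, 1, 1, 0, 0],
          [1, 1, 0, 0, 0]]
padding_mask = to_key_padding_mask(f_mask)
print(padding_mask)
```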

NameError: ignored

## **Create Model Based on Architecture**

Create a new model

In [None]:
model = WikipediaModel_Attention(
    CONFIG['image_model_name'],
    CONFIG['text_model_name'],
    CONFIG['embedding_size'],
    CONFIG['device'])
model.to(CONFIG['device']);


Some weights of the model checkpoint at microsoft/swin-tiny-patch4-window7-224 were not used when initializing SwinModel: ['classifier.weight', 'classifier.bias']
- This IS expected if you are initializing SwinModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing SwinModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Some weights of the model checkpoint at xlm-roberta-base were not used when initializing XLMRobertaModel: ['lm_head.decoder.weight', 'lm_head.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.dense.bias']
- This IS expected if you are initializing XLMRobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Resume from a previously saved checkpoint (optional)

In [None]:
#model = WikipediaModel_Attention(CONFIG['image_model_name'], CONFIG['text_model_name'], CONFIG['embedding_size'], CONFIG['device'])
#model.load_state_dict(torch.load("/content/drive/MyDrive/BDT SEM 1/ImageCaptionMatching/training/inceptionv3_xlm-roberta-base_Loss0.4338_epoch21.bin", map_location=CONFIG['device']))
#model.to(CONFIG['device']);

## **Loss Function**

In [None]:
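def criterion(outputs1, outputs2, targets=1):
    # Every (image, caption) pair in the batch is treated as a positive pair (target = +1),
    # so the per-pair loss reduces to 1 - cos(outputs1, outputs2). `targets` is unused.
    target = torch.ones(outputs1.size(0), device=outputs1.device)
    return nn.CosineEmbeddingLoss()(outputs1, outputs2, target)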
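
With every target fixed at +1, the cosine embedding loss is the mean of `1 - cos(x1, x2)` over the batch. The standard-library sketch below reproduces that arithmetic on invented vectors. It also shows that a degenerate encoder emitting one constant vector scores a loss of essentially zero, since no negative pairs push embeddings apart. That may be worth keeping in mind when reading very small loss values. The helper names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def positive_only_loss(batch1, batch2):
    """Mean of 1 - cos(x1, x2): cosine embedding loss when every target is +1."""
    return sum(1.0 - cosine(u, v) for u, v in zip(batch1, batch2)) / len(batch1)

# Distinct embeddings for matched pairs
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[0.9, 0.1], [0.2, 0.8]]
matched_loss = positive_only_loss(images, captions)

# A degenerate encoder that maps every input to the same vector
constant = [[1.0, 1.0], [1.0, 1.0]]
collapsed_loss = positive_only_loss(constant, constant)

print(matched_loss)
print(collapsed_loss)
```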

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcInfoNCE(nn.Module):
    def __init__(self, margin=0.5, scale=64):
        super(ArcInfoNCE, self).__init__()
        self.margin = margin
        self.scale = scale
        self.loss = nn.CrossEntropyLoss(reduction="mean")

    def create_mask(self, target, num_classes):
        batch_size = target.size(0)
        mask = torch.zeros(batch_size, num_classes, device=target.device)
        mask.scatter_(1, target.view(batch_size, 1), 1)
        return mask.bool()

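    def forward(self, feat1, feat2, y=None):
        # Pairwise cosine similarity between every feat1 row and every feat2 row: (N, M)
        logits = F.cosine_similarity(feat1.unsqueeze(1), feat2.unsqueeze(0), dim=2)
        if y is None:
            if feat1.size(0) != feat2.size(0):
                raise ValueError("y is required when feat1 and feat2 differ in batch size")
            # Matching pairs sit on the diagonal
            y = torch.arange(0, feat1.size(0), device=feat1.device, dtype=torch.long)
        mask = self.create_mask(y, num_classes=feat2.size(0))
        # Subtract the margin from the positive logits without an in-place write
        logits = logits - self.margin * mask.to(logits.dtype)
        return self.loss(logits * self.scale, y)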

In [None]:
def criterion1(outputs1, outputs2):
    return ArcInfoNCE()(outputs1, outputs2)
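
To see what this loss rewards, here is a plain-Python sketch of the same computation on toy 2-D vectors: scaled cosine similarities, a margin subtracted from each positive (diagonal) entry, then cross-entropy per row. The function names and inputs are hypothetical, and the sketch only covers the default case where row `i` matches column `i`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def margin_info_nce(feats1, feats2, margin=0.5, scale=64):
    """Cross-entropy over scaled cosine similarities; row i's positive is column i."""
    losses = []
    for i, u in enumerate(feats1):
        logits = []
        for j, v in enumerate(feats2):
            sim = cosine(u, v)
            if i == j:
                sim -= margin  # make the positive harder, as in the class above
            logits.append(sim * scale)
        peak = max(logits)
        log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum - logits[i])
    return sum(losses) / len(losses)

images = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
collapsed = [[1.0, 1.0], [1.0, 1.0]]

aligned_loss = margin_info_nce(images, aligned)
swapped_loss = margin_info_nce(images, swapped)
collapsed_loss = margin_info_nce(collapsed, collapsed)
print(aligned_loss, swapped_loss, collapsed_loss)
```

Unlike a positive-only cosine loss, this objective stays large when all embeddings are identical, because every off-diagonal pair acts as a negative.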

## **Training Function**

In [None]:
def train_one_epoch(model, optimizer, scheduler, dataloader, device, epoch):
    model.train()

    dataset_size = 0
    running_loss = 0.0

    bar = tqdm(enumerate(dataloader), total=len(dataloader))
    for step, data in bar:
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)

        images = data['image'].to(device, dtype = torch.float)

        name_ids = data['name_ids'].to(device, dtype=torch.long)
        name_mask = data['name_mask'].to(device, dtype=torch.long)

        batch_size = ids.size(0)

        image_outputs, caption_outputs = model(images, ids, mask, name_ids, name_mask)
        loss = criterion(image_outputs, caption_outputs)
        loss = loss / CONFIG['n_accumulate']
        loss.backward()

        if (step + 1) % CONFIG['n_accumulate'] == 0:
            optimizer.step()

            # zero the parameter gradients
            optimizer.zero_grad()

            if scheduler is not None:
                scheduler.step()

        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size

        epoch_loss = running_loss / dataset_size

        bar.set_postfix(Epoch=epoch, Train_Loss=epoch_loss,
                        LR=optimizer.param_groups[0]['lr'])
    gc.collect()

    return epoch_loss
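
Two pieces of bookkeeping in the loop above are easy to check in isolation: which batches trigger `optimizer.step()` under gradient accumulation, and how the running loss is averaged per sample. The helpers below are hypothetical stand-ins written in plain Python, not part of the training code.

```python
def optimizer_step_indices(num_batches, n_accumulate):
    """Batch indices after which the loop above would call optimizer.step()."""
    return [step for step in range(num_batches) if (step + 1) % n_accumulate == 0]

def running_mean_loss(batch_losses, batch_sizes):
    """Sample-weighted mean, the same bookkeeping as running_loss / dataset_size."""
    total = sum(loss * size for loss, size in zip(batch_losses, batch_sizes))
    return total / sum(batch_sizes)

print(optimizer_step_indices(num_batches=10, n_accumulate=4))  # [3, 7]
print(running_mean_loss([0.5, 0.1], [32, 32]))
```

With 10 batches and `n_accumulate=4`, the last two batches do not reach a step within that pass. With the configured `n_accumulate` of 1, every batch steps.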

## **Validation Function**

In [None]:
@torch.no_grad()
def valid_one_epoch(model, dataloader, device, epoch):
    model.eval()

    dataset_size = 0
    running_loss = 0.0

    bar = tqdm(enumerate(dataloader), total=len(dataloader))
    for step, data in bar:
        ids = data['ids'].to(device, dtype = torch.long)
        mask = data['mask'].to(device, dtype = torch.long)
        images = data['image'].to(device, dtype = torch.float)
        name_ids = data['name_ids'].to(device, dtype=torch.long)
        name_mask = data['name_mask'].to(device, dtype=torch.long)

        batch_size = ids.size(0)

        image_outputs, text_outputs = model(images, ids, mask, name_ids, name_mask)
        loss = criterion(image_outputs, text_outputs)

        running_loss += (loss.item() * batch_size)
        dataset_size += batch_size

        epoch_loss = running_loss / dataset_size

        # NOTE: `optimizer` is not a parameter here; this reads the notebook-level global
        bar.set_postfix(Epoch=epoch, Valid_Loss=epoch_loss,
                        LR=optimizer.param_groups[0]['lr'])

    gc.collect()

    return epoch_loss

## **Run Training**

In [None]:
def run_training(model, optimizer, scheduler, device, num_epochs):
    # To automatically log gradients
    wandb.watch(model, log_freq=100)

    if torch.cuda.is_available():
        print("[INFO] Using GPU: {}\n".format(torch.cuda.get_device_name()))

    start = time.time()
    best_model_wts = copy.deepcopy(model.state_dict())
    best_epoch_loss = np.inf
    history = defaultdict(list)

    for epoch in range(1, num_epochs + 1):
        gc.collect()
        train_epoch_loss = train_one_epoch(model, optimizer, scheduler,
                                           dataloader=train_loader,
                                           device=CONFIG['device'], epoch=epoch)

        val_epoch_loss = valid_one_epoch(model, valid_loader, device=CONFIG['device'],
                                         epoch=epoch)

        history['Train Loss'].append(train_epoch_loss)
        history['Valid Loss'].append(val_epoch_loss)

        # Log the metrics (one call so both values share the same W&B step)
        wandb.log({"Train Loss": train_epoch_loss, "Valid Loss": val_epoch_loss})

        # deep copy the model
        if val_epoch_loss <= best_epoch_loss:
            print(f"{b_}Validation Loss Improved ({best_epoch_loss} ---> {val_epoch_loss})")
            best_epoch_loss = val_epoch_loss
            # `run` is the notebook-level W&B run created by wandb.init() below
            run.summary["Best Loss"] = best_epoch_loss
            best_model_wts = copy.deepcopy(model.state_dict())
            PATH = "Loss{:.4f}_epoch{:.0f}.bin".format(best_epoch_loss, epoch)
            # Save the best weights under CONFIG["root"]
            torch.save(model.state_dict(), CONFIG["root"]+CONFIG['image_model_name']+"_"+CONFIG['text_model_name']+"_attn_"+PATH)
            print(f"Model Saved{sr_}")

        print()

    end = time.time()
    time_elapsed = end - start
    print('Training complete in {:.0f}h {:.0f}m {:.0f}s'.format(
        time_elapsed // 3600, (time_elapsed % 3600) // 60, (time_elapsed % 3600) % 60))
    print("Best Loss: {:.4f}".format(best_epoch_loss))

    # load best model weights
    model.load_state_dict(best_model_wts)

    return model, history
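
The checkpoint file name is assembled from the model names, the best loss, and the epoch. The sketch below rebuilds that pattern in plain Python (`checkpoint_name` is a hypothetical helper) to show how the `{:.4f}` format behaves: losses smaller than 5e-5 all render as `0.0000`, so such checkpoints are told apart only by the epoch number.

```python
def checkpoint_name(image_model, text_model, loss, epoch):
    """Rebuild the file name pattern used when saving the best weights."""
    suffix = "Loss{:.4f}_epoch{:.0f}.bin".format(loss, epoch)
    return image_model + "_" + text_model + "_attn_" + suffix

typical = checkpoint_name("swin", "xlm-roberta-base", 0.4338, 21)
tiny = checkpoint_name("swin", "xlm-roberta-base", 4.321002960690142e-06, 1)
print(typical)  # swin_xlm-roberta-base_attn_Loss0.4338_epoch21.bin
print(tiny)     # swin_xlm-roberta-base_attn_Loss0.0000_epoch1.bin
```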

In [None]:
def fetch_scheduler(optimizer):
    if CONFIG['scheduler'] == 'CosineAnnealingLR':
        scheduler = lr_scheduler.CosineAnnealingLR(optimizer,T_max=CONFIG['T_max'],
                                                   eta_min=CONFIG['min_lr'])
    elif CONFIG['scheduler'] == 'CosineAnnealingWarmRestarts':
        # Requires CONFIG['T_0'], which the configuration above does not define
        scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer,T_0=CONFIG['T_0'],
                                                             eta_min=CONFIG['min_lr'])
    elif CONFIG['scheduler'] is None:
        return None
    else:
        raise ValueError(f"Unknown scheduler: {CONFIG['scheduler']}")

    return scheduler
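
Because the training loop calls `scheduler.step()` once per batch and `T_max` is 500, the learning rate completes a half cosine every 500 batches instead of decaying once over the whole run. The sketch below evaluates the closed-form cosine schedule in plain Python. Treating the closed form as equivalent to PyTorch's `CosineAnnealingLR` (which updates recursively) is an assumption, although the values agree with the LR shown in the training progress bars. `cosine_annealing_lr` is a hypothetical helper.

```python
import math

def cosine_annealing_lr(step, base_lr=1e-3, min_lr=1e-5, t_max=500):
    """Closed-form cosine schedule evaluated after `step` scheduler steps."""
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * step / t_max))

steps_per_epoch = 45000 // 32  # 1406 batches per epoch with drop_last=True
lr_epoch_1 = cosine_annealing_lr(1 * steps_per_epoch)
lr_epoch_2 = cosine_annealing_lr(2 * steps_per_epoch)
print(lr_epoch_1)  # close to 9.39e-5
print(lr_epoch_2)  # close to 6.93e-4
```

If a single decay over the full run is wanted, `T_max` would need to be on the order of `epochs * steps_per_epoch`.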

Define Optimizer and Scheduler

In [None]:
optimizer = optim.Adam(model.parameters(), lr=CONFIG['learning_rate'],
                       weight_decay=CONFIG['weight_decay'])
scheduler = fetch_scheduler(optimizer)

Start Training

In [None]:
run = wandb.init(project="Wikipedia",
                 entity="wzhangcz",
                 name="swim_xlm_attn_debug_1",
                 config=CONFIG,
                 job_type='Train')

[34m[1mwandb[0m: Currently logged in as: [33mwzhangcz[0m. Use [1m`wandb login --relogin`[0m to force relogin


In [None]:
model, history = run_training(model, optimizer, scheduler,
                              device=CONFIG['device'],
                              num_epochs=CONFIG['epochs'])

[INFO] Using GPU: Tesla T4

100%|██████████| 1406/1406 [10:49<00:00,  2.16it/s, Epoch=1, LR=9.39e-5, Train_Loss=0.000891]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=1, LR=9.39e-5, Valid_Loss=4.32e-6]
Validation Loss Improved (inf ---> 4.321002960690142e-06)
Model Saved

100%|██████████| 1406/1406 [10:38<00:00,  2.20it/s, Epoch=2, LR=0.000693, Train_Loss=2.32e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=2, LR=0.000693, Valid_Loss=1.04e-5]

100%|██████████| 1406/1406 [10:33<00:00,  2.22it/s, Epoch=3, LR=0.000604, Train_Loss=3.25e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=3, LR=0.000604, Valid_Loss=5.65e-6]

100%|██████████| 1406/1406 [10:34<00:00,  2.22it/s, Epoch=4, LR=0.000153, Train_Loss=4.33e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=4, LR=0.000153, Valid_Loss=2.63e-6]
Validation Loss Improved (4.321002960690142e-06 ---> 2.634398142618011e-06)
Model Saved

100%|██████████| 1406/1406 [10:36<00:00,  2.21it/s, Epoch=5, LR=0.000991, Train_Loss=2.26e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=5, LR=0.000991, Valid_Loss=1.8e-5]

100%|██████████| 1406/1406 [10:32<00:00,  2.22it/s, Epoch=6, LR=4.95e-5, Train_Loss=2.62e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=6, LR=4.95e-5, Valid_Loss=2.27e-6]
Validation Loss Improved (2.634398142618011e-06 ---> 2.2723952931604194e-06)
Model Saved

100%|██████████| 1406/1406 [10:38<00:00,  2.20it/s, Epoch=7, LR=0.000775, Train_Loss=2.05e-5]
100%|██████████| 235/235 [02:54<00:00,  1.35it/s, Epoch=7, LR=0.000775, Valid_Loss=1.34e-5]

100%|██████████| 1406/1406 [10:32<00:00,  2.22it/s, Epoch=8, LR=0.000511, Train_Loss=4.33e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=8, LR=0.000511, Valid_Loss=4.44e-6]

100%|██████████| 1406/1406 [10:31<00:00,  2.23it/s, Epoch=9, LR=0.000224, Train_Loss=1.9e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=9, LR=0.000224, Valid_Loss=2.93e-6]

100%|██████████| 1406/1406 [10:30<00:00,  2.23it/s, Epoch=10, LR=0.000965, Train_Loss=2.12e-5]
100%|██████████| 235/235 [02:55<00:00,  1.34it/s, Epoch=10, LR=0.000965, Valid_Loss=1.48e-5]

  1%|▏         | 18/1406 [00:09<10:15,  2.26it/s, Epoch=11, LR=0.000942, Train_Loss=2.29e-5]

## **Visualizations**

In [None]:
# Code taken from https://www.kaggle.com/ayuraj/interactive-eda-using-w-b-tables

# This is just to display the W&B run page in this interactive session.
from IPython import display

# we create an IFrame and set the width and height
iF = display.IFrame(run.url, width=1080, height=720)
iF