# HW2P2: Face Classification and Verification


Congrats on reaching the second homework in 11785: Introduction to Deep Learning. This homework is significantly longer and tougher than the previous one. It has 2 sub-parts, as outlined below. Please start early!


*   Face Recognition: You will write your own CNN model to tackle a classification problem over 7000 identities
*   Face Verification: You will use the model trained for classification and evaluate the quality of its feature embeddings by comparing the similarity of known and unknown identities

Common errors you may face in this homework (because of the size of the model):


*   CUDA Out of Memory (OOM): You can tackle this problem by (1) Reducing the batch size (2) Calling `torch.cuda.empty_cache()` and `gc.collect()` (3) Finally restarting the runtime



# Instructions to Run the Code

To run the final model corresponding to the highest Kaggle submission, first make sure the **Global Variables** (next section) are set as intended, then go to **Runtime** and click **Restart and run all**. This will:
1. pip install, import, and download all required packages and data
2. run all functions and classes for loading data and creating the model/optimizer/scheduler
3. train the model for 42 epochs based on the parameters saved in the `config` variable defined under **Parameter Configuration**
4. sequentially finetune the model, resetting the learning rates accordingly, for the verification task

**Note** that by default, the notebook is expected to finish running in one click. If you pause the run and want to reload the model from the saved path for finetuning, set **FINETUNE_FROM_LOAD** to True in the **Global Variables** section.

**Note** that to train on the train data using the SphereNet36 model, the notebook requires a GCP environment with at least 4 vCPUs.

By default, the notebook runs the trained model on the test dataset and saves the predicted results in a csv file, but it does not make the submission to Kaggle. To run the notebook with Kaggle submission, set **SUBMIT_KAGGLE** to True.

# Global Variables

In [None]:
CONNECT_DRIVE = False
SELECT_MODEL = 'spherenet' # resnet, mobilenet
USE_CELOSS = True
FINETUNE_FROM_LOAD = False # set to True if the run was paused and the model must be reloaded before finetuning
SUBMIT_KAGGLE = False

In [None]:
import torch
torch.__version__

'2.0.0+cu118'

# README

## Best Score Hyperparameters:
* **model** = SphereNet36
* **basic convolution block** = 2d Convolutional Layer > 2d Batchnorm > PReLU activation
* **residual model block** = 2 basic convolution blocks with kernel size = 3 and stride = 1; the original input is added first (rather than last) to the output of the convolution blocks
* **model structure** = 4 chunks of layers (each chunk is composed of one basic convolution block followed by a series of residual model blocks), a 2d Dropout layer, a Linear layer that reduces the flattened 512 x 14 x 14 output to 512 dimensions, a 1d Batchnorm and a PReLU layer, and finally the output Linear layer that maps the 512 dimensions to the 7000 classes.
> For all chunks, the initial basic convolution block has stride 2 and kernel size 3. The 1st chunk has hidden size 64, the 2nd 128, the 3rd 256 and the 4th 512. The 1st chunk has 2 residual model blocks, the 2nd has 4, the 3rd has 8 and the 4th has 2. These numbers are taken from https://www.cvlibs.net/publications/Coors2018ECCV.pdf and https://github.com/wy1iu/SphereNet.  
* **activation** = PReLU
* **dropout** = 0.25
* **weight decay** = 0.005
* **batch size** = 64
* **training epoch** = 42
* **finetuning epoch** = 40 (30 for 1st level finetune and 10 for 2nd level finetune)
* **weight init**: kaiming normal on convolution layer, constant on batchnorm layer, normal for weights and constant for biases on linear layer 
* **init model learning rate** = 0.15
* **finetuning init model learning rate** = 0.01 for first level finetuning and 0.005 for second level finetuning
* **training scheduler**: CosineAnnealingWarmRestarts + StepLR
    * CosineAnnealingWarmRestarts for 20 epochs x 2
    * At the 38th epoch, after stepping, switch to StepLR starting with 0.009 LR for another 3 epochs
    * StepLR for 3 epochs using gamma = 0.6 and step_size = 1
* **finetuning scheduler**: CosineAnnealingWarmRestarts
    * CosineAnnealingWarmRestarts for 15 epochs x 2 (first level)
    * CosineAnnealingWarmRestarts for 10 epochs x 1 (second level)
* **model base loss**: CrossEntropyLoss 
* **model base loss smoothing** = 0.1
* **center loss weight**: 0.001 for training and 0.003 for finetuning
* **center loss learning rate**:
    * fixed at 0.1 for training
    * flexible (follows CosineAnnealingWarmRestarts) starting at 0.5 for first level finetuning
    * flexible (follows CosineAnnealingWarmRestarts) starting at 0.25 for second level finetuning
* **optimizer**: SGD with 0.9 momentum
* **verification threshold**: 0.435



## Data Loading and Transformation Scheme:
The highest Kaggle score was obtained by loading the train dataset and applying **RandomHorizontalFlip**, **RandomRotation(10)**, **ToTensor**, and **Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])** in sequence. This combination was chosen after several ablation studies, which included adding **RandomVerticalFlip**, using **RandomRotation(45)**, and using **0.5** as the normalization value. The mean and std values used for normalization are the PyTorch ImageNet values.

For test and validation data, no augmentation is applied. However, both are converted to tensors and normalized using the above mean and std.
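
As a rough illustration of what the **Normalize** step does, the sketch below applies the per-channel formula `(x - mean) / std` to a single RGB pixel that has already been scaled to [0, 1]. The helper `normalize_pixel` is a hypothetical name written in plain Python for illustration only; it is not the torchvision implementation.

```python
# ImageNet channel statistics, as used in the transforms of this notebook.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb, mean=IMAGENET_MEAN, std=IMAGENET_STD):
    """Apply (x - mean) / std per channel to one RGB pixel with values in [0, 1]."""
    return [(x - m) / s for x, m, s in zip(rgb, mean, std)]


# A pixel equal to the channel means maps to zero in every channel.
print(normalize_pixel([0.485, 0.456, 0.406]))
# A white pixel (1.0 in every channel) maps to a positive value per channel.
print(normalize_pixel([1.0, 1.0, 1.0]))
```

Because the same statistics are applied to train, validation, and test data, the model sees inputs on the same scale in all three cases.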


## Architectures:
The highest Kaggle score is reached using the **SphereNet36** architecture. It has four chunks of SphereNet residual blocks, followed by a dropout layer, a linear layer, a batchnorm layer, a PReLU layer, and the final classification linear layer.  

1. 1st Chunk: 2 SphereNet residual blocks, hidden size is 64.
2. 2nd Chunk: 4 SphereNet residual blocks, hidden size is 128.
3. 3rd Chunk: 8 SphereNet residual blocks, hidden size is 256.
4. 4th Chunk: 2 SphereNet residual blocks, hidden size is 512.
5. Dropout Layer with 0.25 rate
6. Linear layer (from 512 x 14 x 14 to 512) + BatchNorm + PReLU
7. Classification Linear Layer (from 512 to 7000)

Note that in the 6th layer, 512 is multiplied by 14 x 14 because each of the four chunks halves the image on each side, so the 224 x 224 input is reduced by a factor of 16 per side (224/16 = 14). The last chunk therefore outputs 512 channels of size 14 x 14, which are flattened before the linear layer.
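
The arithmetic above can be checked with the standard convolution output-size formula. This is a minimal sketch, assuming kernel size 3, stride 2, and padding 1 for the first convolution of each chunk (the stride-1 residual blocks keep the size unchanged); `conv_out_size` is a helper introduced here for illustration.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output side length of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


side = 224
sizes = []
for _ in range(4):  # one stride-2 convolution at the start of each chunk
    side = conv_out_size(side)
    sizes.append(side)

flattened = 512 * side * side  # channels x height x width after the 4th chunk
print(sizes)      # [112, 56, 28, 14]
print(flattened)  # 100352
```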

Other architectures were tested as well, but with less ideal performance:
1. ResNet 34 with 4 chunks, with hidden size 64, 128, 256, 512, and block number 3, 4, 6, 3. 
2. SphereNet64 with 4 chunks, with the same hidden sizes as above but block number 3, 8, 16, 3. This requires much more computation power for only a marginal improvement. 
3. MobileNetV2 with:
    * initial layer with 32 hidden size and stride 2
    * 7 chunks, with expansion [1, 6, 6, 6, 6, 6, 6], channels [16, 24, 32, 64, 96, 160, 320], number of layers [1, 2, 3, 4, 3, 3, 1] and stride [1, 2, 2, 2, 1, 2, 1]
    * last layer with 1280 hidden size and stride 1
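
To make the MobileNetV2 configuration easier to read, the sketch below derives two summary numbers from it: the total downsampling factor and the number of bottleneck layers. It assumes each chunk's stride is applied once (in its first layer) and that the convolutions are padded so every stride-2 layer halves the size exactly; neither detail is shown in this section.

```python
# (t = expansion, c = channels, n = number of layers, s = stride), as listed above
bottleneck = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
    (6, 64, 4, 2),
    (6, 96, 3, 1),
    (6, 160, 3, 2),
    (6, 320, 1, 1),
]
init_stride, last_stride = 2, 1  # strides of the initial and last layers

total_stride = init_stride * last_stride
for _, _, _, s in bottleneck:
    total_stride *= s  # assumption: the stride is applied once per chunk

num_bottleneck_layers = sum(n for _, _, n, _ in bottleneck)
print(total_stride)           # 32
print(224 // total_stride)    # 7
print(num_bottleneck_layers)  # 17
```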


## Epochs
I trained the model for 42 epochs and finetuned it for another 40 epochs to get the highest Kaggle submission for both the classification and verification tasks. 

During training, the model learning rate is scheduled to follow CosineAnnealingWarmRestarts with a 20-epoch period for two cycles, until the 38th epoch. After the scheduler step in the 38th epoch, the scheduler is switched to StepLR for another 3 epochs, so that the learning rate approaches its smaller values more slowly. Note that the center loss learning rate is fixed at 0.1.  

During finetuning, the model learning rate and the center loss learning rate follow CosineAnnealingWarmRestarts with a 15-epoch period for two cycles; then, after halving the initial learning rates, the model is finetuned for one more 10-epoch cycle. 
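
The shape of these schedules can be sketched in plain Python. The block below is only an approximation under stated assumptions: the scheduler steps once per epoch and the minimum learning rate `eta_min` is 0. The notebook's actual scheduler arguments are not shown in this section, so the exact values during training may differ. The function names are hypothetical.

```python
import math


def cosine_warm_restart_lr(epoch, base_lr, period, eta_min=0.0):
    """Cosine-annealed learning rate that restarts every `period` epochs."""
    t_cur = epoch % period  # position inside the current cycle
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / period))


def step_lr(step, start_lr, gamma=0.6, step_size=1):
    """Learning rate after `step` scheduler steps of a StepLR-style decay."""
    return start_lr * gamma ** (step // step_size)


# Training phase: cosine cycles of 20 epochs starting from 0.15.
for epoch in (0, 10, 20):
    print(epoch, round(cosine_warm_restart_lr(epoch, base_lr=0.15, period=20), 5))

# Final phase: StepLR starting from 0.009 with gamma = 0.6 for 3 epochs.
print([round(step_lr(k, 0.009), 5) for k in range(3)])
```

The restart at epoch 20 brings the rate back to its initial value, which is why the finetuning stages use a smaller starting rate rather than reusing 0.15.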


## Hyperparameters and Experiments
* **Init Model Learning Rate (training)**: 0.15. I started with 0.1 and found during the ablation study that a slightly larger learning rate works better with CosineAnnealingWarmRestarts. 

* **Batch Size**: 64. Since SphereNet36 already uses 200M parameters, the batch size has to be limited to 64 once center loss is added. I used 96 for SphereNet36 + CrossEntropyLoss alone, but had to reduce the batch size when introducing center loss. 

* **Weight Decay**: 0.005. I did not try other weight decay values, but this value works well with the overall model architecture. 

* **Dropout**: 0.25. Though CNNs tend not to include dropout, I found that my model starts to overfit at epoch 8 without it. With dropout, the model does not overfit until epoch 11-12. Hence I keep this dropout layer with a relatively small dropout rate. 

* **Init model learning rate (1st finetuning)**: 0.01. During finetuning, it is not necessary to start with a large learning rate, as that would significantly drop the val accuracy and the model would need more epochs to surpass its previous performance. Hence, I rounded the 0.009 learning rate (at the 19th or 39th epoch) to 0.01 and used it as the initial learning rate for finetuning. I tried 0.05 as well, but the val accuracy was not as good as with 0.01. 

* **Init model learning rate (2nd finetuning)**: 0.005. By the 2nd level of finetuning, the model has almost reached saturation, hence I halve the starting learning rate in order to reach an even lower learning rate at the end of the CosineAnnealingWarmRestarts cycle. 

* **CrossEntropy Loss Label Smoothing**: 0.1. Adding label smoothing clearly improves model performance, and based on my search online, 0.1 seems to be a widely used value. I did not do an ablation study on this parameter.

* **Center Loss Weight (training)**: 0.001. The center loss value is very large, so I used 0.001 to keep the loss from exploding. 
* **Center Loss Weight (finetuning)**: 0.003. For finetuning, a new center loss with a slightly larger weight is used. This value is inspired by the center loss paper, where 0.003 is the chosen value. 
* **Center Loss learning rate (training)**: 0.1. For training, the learning rate is fixed at 0.1, as suggested by TAs. I avoid a larger learning rate because the model is still at an early learning stage, and I avoid a smaller one because the rate does not change during training, so starting too small could make the training less efficient. 
* **Center Loss learning rate (1st finetuning)**: 0.5. For finetuning, I used the learning rate recommended in the center loss paper, 0.5, and let it decrease following the CosineAnnealingWarmRestarts schedule for two 15-epoch cycles. 
* **Center Loss learning rate (2nd finetuning)**: 0.25. For the last 10 epochs, I halve the initial learning rate and let it decrease following a CosineAnnealing schedule for 10 epochs. 
* **Verification Threshold**: 0.435. This threshold is derived by testing threshold values on the validation dataset and adjusting accordingly. 0.445 is the best threshold for the validation data, but the mean value for the test data is about 0.02 smaller than for the validation dataset, hence I subtract 0.01 from the best threshold and use 0.435. 
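
The threshold search described in the last bullet can be sketched as a simple sweep over candidate values. Everything in this block is illustrative: the similarity measure is assumed to be cosine similarity (the verification code is not shown in this section), the scores and labels are made up, and `cosine_similarity` and `best_threshold` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_threshold(scores, is_known, candidates):
    """Return the (threshold, accuracy) pair with the highest accuracy.

    A sample is predicted "known" when its best similarity score is >= threshold.
    Ties are resolved in favour of the first candidate.
    """
    best = (None, -1.0)
    for t in candidates:
        correct = sum((s >= t) == k for s, k in zip(scores, is_known))
        acc = correct / len(scores)
        if acc > best[1]:
            best = (t, acc)
    return best


# Made-up validation similarities and labels (True = known identity).
scores = [0.81, 0.62, 0.47, 0.39, 0.33, 0.30, 0.12]
is_known = [True, True, True, False, False, False, False]
candidates = [round(0.30 + 0.005 * i, 3) for i in range(61)]  # 0.300 ... 0.600

threshold, accuracy = best_threshold(scores, is_known, candidates)
print(threshold, accuracy)
```

A threshold found this way is tuned to the validation distribution, which is why the bullet above shifts it slightly when the test scores are lower on average.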

## Other Experiments

### Scheduler 
I used ReduceLROnPlateau at first, but the model learned very slowly during training, as the learning rate was quite large and did not decrease during the early epochs. I then switched to CosineAnnealingWarmRestarts with a 20-epoch period, which trains faster and performs much better. The 20 comes from observing the ResNet34 chart, where the validation accuracy starts to fluctuate more frequently at around 20 epochs. 

For finetuning, I reduced the learning rate and hence the CosineAnnealingWarmRestarts period as well. The first two cycles are 15 epochs each, and the last cycle is 10 epochs in total. For the last 10 epochs, the learning rate is further reduced by half. Note that CosineAnnealingWarmRestarts and CosineAnnealing are equivalent over the last 10 epochs. 




# Preliminaries

In [None]:
!nvidia-smi # to see what GPU you have

Thu Apr  6 20:37:26 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   58C    P8    10W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
!pip install wandb --quiet

  Preparing metadata (setup.py) ... done
  Building wheel for pathtools (setup.py) ... done


In [None]:
import torch
import torch.nn.functional as F
from torchsummary import summary
import torchvision #This library is used for image-based operations (Augmentations)
import os
import gc
from tqdm import tqdm
from PIL import Image
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score
import glob
import wandb
import matplotlib.pyplot as plt
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
print("Device: ", DEVICE)

Device:  cuda


In [None]:
if CONNECT_DRIVE:
    from google.colab import drive # Link your drive if you are a colab user
    drive.mount('/content/drive') # Models in this HW take a long time to train, so make sure to save them here

# Download Data from Kaggle

In [None]:
# TODO: Use the same Kaggle code from HW1P2
!pip install --upgrade --force-reinstall --no-deps kaggle==1.5.8
!mkdir /root/.kaggle

with open("/root/.kaggle/kaggle.json", "w+") as f:
    f.write('{"username":"YOUR_USERNAME","key":"YOUR_KEY"}') 
    # Put your kaggle username & key here; never commit real credentials

!chmod 600 /root/.kaggle/kaggle.json

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting kaggle==1.5.8
  Downloading kaggle-1.5.8.tar.gz (59 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: kaggle
  Building wheel for kaggle (setup.py) ... done
  Created wheel for kaggle: filename=kaggle-1.5.8-py3-none-any.whl size=73272 sha256=49cae3d25be394fda287ec40bc22414d83d8066fd1265531e9e08e92ffe71d38
  Stored in directory: /root/.cache/pip/wheels/d4/02/ef/3f8c8d86b8d5388a1d3155876837f1a1a3143ab3fc2ff1ffad
Successfully built kaggle
Installing collected packages: kaggle
  Attempting uninstall: kaggle
    Found existing installation: kaggle 1.5.13
    Uninstalling kaggle-1.5.13:
      Successfully uninstalled kaggle-1.5.13
Successfully installed kaggle-1.5.8


In [None]:
!mkdir '/content/data'

!kaggle competitions download -c 11-785-s23-hw2p2-classification
!unzip -qo '11-785-s23-hw2p2-classification.zip' -d '/content/data'

!kaggle competitions download -c 11-785-s23-hw2p2-verification
!unzip -qo '11-785-s23-hw2p2-verification.zip' -d '/content/data'

Downloading 11-785-s23-hw2p2-classification.zip to /content
 99% 1.71G/1.72G [00:20<00:00, 106MB/s]
100% 1.72G/1.72G [00:20<00:00, 91.7MB/s]
Downloading 11-785-s23-hw2p2-verification.zip to /content
 53% 9.00M/16.8M [00:00<00:00, 92.1MB/s]
100% 16.8M/16.8M [00:00<00:00, 89.3MB/s]


# Parameter Configuration

In [None]:
config = {
    'batch_size': 64, # Increase this if your GPU can handle it - 96 for spherenet only, have to reduce to 64 if using celoss
    'lr': 0.15,
    'celoss_lr': 0.1, 
    'celoss_weight': 0.001,
    # use different learning rates and weights for finetuning
    'finetune': {
        'model_lr': 0.01,
        'celoss_lr': 0.5,
        'celoss_weight': 0.003,
        'epochs_1':30,
        'epochs_2':10,
        '2ft_lr_factor':0.1
    },

    'resnet': {
        'epochs': 88,
        'dropout': 0.4
    },
# reference: https://sahiltinky94.medium.com/know-about-mobilenet-v2-implementation-from-scratch-using-pytorch-8e589b55599
    'mobilenet':  {
        'bottleneck': [[1, 16, 1, 1], # t = expansion, c = channels, n = num layers, s = stride
                       [6, 24, 2, 2],
                       [6, 32, 3, 2],
                       [6, 64, 4, 2],
                       [6, 96, 3, 1],
                       [6, 160, 3, 2],
                       [6, 320, 1, 1]],
        'init_layer': [0, 32, 1, 2],
        'last_layer': [0, 1280, 1, 1],
        'epochs': 100,
        'dropout': 0.5
        },
        
# reference:
# https://www.cvlibs.net/publications/Coors2018ECCV.pdf
# https://github.com/wy1iu/SphereNet  
    'spherenet': {
        'epochs': 42, 
        'linear_scale_factor': 1,
        'dropout': 0.25
    },

    'num_classes': 7000,
}

# Classification Dataset

In [None]:
DATA_DIR    = '/content/data/11-785-s23-hw2p2-classification/'# TODO: Path where you have downloaded the data
TRAIN_DIR   = os.path.join(DATA_DIR, "train") 
VAL_DIR     = os.path.join(DATA_DIR, "dev")
TEST_DIR    = os.path.join(DATA_DIR, "test")

# Transforms using torchvision - Refer https://pytorch.org/vision/stable/transforms.html

train_transforms = torchvision.transforms.Compose([
    torchvision.transforms.RandomHorizontalFlip(), 
    #torchvision.transforms.RandomVerticalFlip(),
    torchvision.transforms.RandomRotation(10),
    torchvision.transforms.ToTensor(),
    # referring pytorch imagenet values: https://pytorch.org/vision/stable/models.html
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]) 
    #torchvision.transforms.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5])
])# Implementing the right train transforms/augmentation methods is key to improving performance.

# Most torchvision transforms are done on PIL images. So you convert it into a tensor at the end with ToTensor()
# But there are some transforms which are performed after ToTensor() : e.g - Normalization
# Normalization Tip - Do not blindly use normalization that is not suitable for this dataset

valid_transforms = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])


train_dataset   = torchvision.datasets.ImageFolder(TRAIN_DIR, transform= train_transforms)
valid_dataset   = torchvision.datasets.ImageFolder(VAL_DIR, transform= valid_transforms)
# You should NOT have data augmentation on the validation set. Why?


# Create data loaders
train_loader = torch.utils.data.DataLoader(
    dataset     = train_dataset, 
    batch_size  = config['batch_size'], 
    shuffle     = True,
    num_workers = 4, 
    pin_memory  = True
)

valid_loader = torch.utils.data.DataLoader(
    dataset     = valid_dataset, 
    batch_size  = config['batch_size'],
    shuffle     = False,
    num_workers = 2
)



In [None]:
# You can do this with ImageFolder as well, but it requires some tweaking
class ClassificationTestDataset(torch.utils.data.Dataset):

    def __init__(self, data_dir, transforms):
        self.data_dir   = data_dir
        self.transforms = transforms

        # This one-liner basically generates a sorted list of full paths to each image in the test directory
        self.img_paths  = list(map(lambda fname: os.path.join(self.data_dir, fname), sorted(os.listdir(self.data_dir))))

    def __len__(self):
        return len(self.img_paths)
    
    def __getitem__(self, idx):
        return self.transforms(Image.open(self.img_paths[idx]))

In [None]:
test_dataset = ClassificationTestDataset(TEST_DIR, transforms = valid_transforms) #Why are we using val_transforms for Test Data?
test_loader = torch.utils.data.DataLoader(test_dataset, batch_size = config['batch_size'], shuffle = False,
                         drop_last = False, num_workers = 2)

In [None]:
print("Number of classes    : ", len(train_dataset.classes))
print("No. of train images  : ", len(train_dataset))
print("Shape of image       : ", train_dataset[0][0].shape)
print("Batch size           : ", config['batch_size'])
print("Train batches        : ", len(train_loader))
print("Val batches          : ", len(valid_loader))

Number of classes    :  7000
No. of train images  :  140000
Shape of image       :  torch.Size([3, 224, 224])
Batch size           :  64
Train batches        :  2188
Val batches          :  547
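
The batch counts printed above follow from the dataset size and the batch size. A minimal sketch, using a hypothetical helper `num_batches`, reproduces the train count and shows which validation set sizes are consistent with 547 batches (the validation size itself is not printed in this notebook):

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of `num_samples` items."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


print(num_batches(140000, 64))  # 2188

# 547 validation batches of size 64 imply a validation set in this size range.
smallest, largest = 546 * 64 + 1, 547 * 64
print(smallest, largest)  # 34945 35008
```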


## Data visualization

In [None]:
VISUALIZE = False

if VISUALIZE:
    # Visualize a few images in the dataset
    # You can write your own code, and you don't need to understand the code
    # It is highly recommended that you visualize your data augmentation as sanity check

    r, c    = [5, 5]
    fig, ax = plt.subplots(r, c, figsize= (15, 15))

    k       = 0
    dtl     = torch.utils.data.DataLoader(
        dataset     = torchvision.datasets.ImageFolder(TRAIN_DIR, transform= train_transforms), # train_transforms are applied, so the augmented (and normalized) images are shown
        batch_size  = config['batch_size'], 
        shuffle     = True,
    )

    for data in dtl:
        x, y = data
        
        for i in range(r):
            for j in range(c):
                img = x[k].numpy().transpose(1, 2, 0)
                ax[i, j].imshow(img)
                ax[i, j].axis('off')
                k+=1
        break

    del dtl

# Set Up Functions

In [None]:
def init_weights(m):
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight.data, mode='fan_out', nonlinearity='relu')
#       torch.nn.init.kaiming_uniform_(m.weight.data)
    elif isinstance(m, torch.nn.BatchNorm2d):
        torch.nn.init.constant_(m.weight.data, 1)
        torch.nn.init.constant_(m.bias.data, 0)
    elif isinstance(m, torch.nn.Linear):
        torch.nn.init.normal_(m.weight.data, 0, 0.01)
        torch.nn.init.constant_(m.bias.data, 0)
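
To give a feel for the scale of the convolution weights produced by `kaiming_normal_` with `mode='fan_out'` and `nonlinearity='relu'`, the sketch below computes the corresponding standard deviation, `sqrt(2 / fan_out)`, for 3x3 convolutions. It assumes square kernels and `groups=1`; `kaiming_normal_std` is a hypothetical helper, not a PyTorch function.

```python
import math


def kaiming_normal_std(out_channels, kernel_size, gain=math.sqrt(2.0)):
    """Std of Kaiming-normal init in fan_out mode for a square 2d convolution.

    fan_out = out_channels * kernel_size**2 and std = gain / sqrt(fan_out);
    gain = sqrt(2) corresponds to nonlinearity='relu'.
    """
    fan_out = out_channels * kernel_size * kernel_size
    return gain / math.sqrt(fan_out)


for channels in (64, 128, 256, 512):
    print(channels, round(kaiming_normal_std(channels, 3), 4))
```

Wider layers receive smaller initial weights, which keeps the variance of the backward signal roughly constant across chunks.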

In [None]:
class CenterLoss(torch.nn.Module):
    """
    Reference:
    Wen et al. A Discriminative Feature Learning Approach for Deep Face Recognition. ECCV 2016.
    
    """
    def __init__(self, num_classes, feature_dim=512, loss_weight=config['celoss_weight']):
        super(CenterLoss, self).__init__()
        self.num_classes = num_classes
        self.feature_dim = feature_dim
        self.loss_w = loss_weight
        self.centers = torch.nn.Parameter(torch.randn(self.num_classes, self.feature_dim).cuda())
        
    def forward(self, x, labels):
        """
        Args:
            x: feature matrix with shape (batch_size, feat_dim).
            labels: ground truth labels with shape (batch_size).
        """
        batch_size = x.size(0)
        # x^2 + c^2 - 2xc -- consult TA for the below code 
        distmat = torch.pow(x, 2).sum(dim=1, keepdim=True).expand(batch_size, self.num_classes) + \
                  torch.pow(self.centers, 2).sum(dim=1, keepdim=True).expand(self.num_classes, batch_size).t()
        distmat.addmm_(x.type(torch.cuda.FloatTensor), self.centers.t().type(torch.cuda.FloatTensor), beta=1.0, alpha=-2.0)
        
        classes = torch.arange(self.num_classes).long()
        classes = classes.cuda()
        labels = labels.unsqueeze(1).expand(batch_size, self.num_classes)
        mask = labels.eq(classes.expand(batch_size, self.num_classes))

        dist = []
        for i in range(batch_size):
            value = distmat[i][mask[i]]
            value = value.clamp(min=1e-12, max=1e+12)  # for numerical stability
            dist.append(value)

        dist = torch.cat(dist)
        loss = dist.mean()*self.loss_w

        return loss
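
For reference, the quantity computed by `forward` can be written compactly. This is a reading of the code above rather than a formula quoted from the paper: $m$ is the batch size, $x_i$ the feature of sample $i$, $c_{y_i}$ the center of its class, and $\lambda$ the `loss_weight`. The clamp used for numerical stability is omitted.

```latex
\mathcal{L}_{c} = \frac{\lambda}{m} \sum_{i=1}^{m} \lVert x_i - c_{y_i} \rVert_2^2,
\qquad
\lVert x_i - c_j \rVert_2^2 = \lVert x_i \rVert_2^2 + \lVert c_j \rVert_2^2 - 2\, x_i^{\top} c_j
```

The second identity is what `distmat` evaluates for every sample/center pair before the mask selects the entry belonging to the true class of each sample.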

# Very Simple Network (for Mandatory Early Submission)

In [None]:
class BasicNetwork(torch.nn.Module):
    """
    The Very Low early deadline architecture is a 4-layer CNN.

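    The described architecture has a first Conv layer with 64 channels, kernel size 7, and stride 4.
    The next three have 128, 256, and 512 channels. Each has kernel size 3 and stride 2.
    Note: the code below deviates slightly. Its first (64-channel) layer uses kernel size 3 and
    stride 2, and its second (128-channel) layer uses kernel size 7 and stride 4.
    
    Think about strided convolutions from the lecture, as convolution with stride = 1 followed by downsampling.
    For stride 1 convolution, what padding do you need for preserving the spatial resolution? 
    (Hint => padding = kernel_size // 2 - Why?)

    Each Conv layer is accompanied by a Batchnorm and ReLU layer (ReLU6 in the code below).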
    Finally, you want to average pool over the spatial dimensions to reduce them to 1 x 1. Use AdaptiveAvgPool2d.
    Then, remove (Flatten?) these trivial 1x1 dimensions away.
    Look through https://pytorch.org/docs/stable/nn.html 

    Why does a very simple network have 4 convolutions?
    Input images are 224x224. Note that each of these convolutions downsample.
    Downsampling 2x effectively doubles the receptive field, increasing the spatial
    region each pixel extracts features from. Downsampling 32x is standard
    for most image models.

    Why does a very simple network have high channel sizes?
    Every time you downsample 2x, you do 4x less computation (at same channel size).
    To maintain the same level of computation, you 2x increase # of channels, which 
    increases computation by 4x. So, balances out to same computation.
    Another intuition is - as you downsample, you lose spatial information. We want
    to preserve some of it in the channel dimension.
    """

    def __init__(self, num_classes=7000):
        super().__init__()

        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, stride=2),
            torch.nn.BatchNorm2d(64),
            torch.nn.ReLU6(inplace=True),
            torch.nn.Conv2d(in_channels=64, out_channels=128, kernel_size=7, stride=4),
            torch.nn.BatchNorm2d(128),
            torch.nn.ReLU6(inplace=True),
            torch.nn.Conv2d(in_channels=128, out_channels=256, kernel_size=3, stride=2),  
            torch.nn.BatchNorm2d(256),
            torch.nn.ReLU6(inplace=True),
            torch.nn.Conv2d(in_channels=256, out_channels=512, kernel_size=3, stride=2),
            torch.nn.BatchNorm2d(512),
            torch.nn.ReLU6(inplace=True),
            torch.nn.AdaptiveAvgPool2d(output_size=(1, 1)), 
            torch.nn.Flatten()            
            ) 
        
        self.cls_layer = torch.nn.Linear(512, num_classes)
    
    def forward(self, x, return_feats=False):
        """
        What is return_feats? It essentially returns the second-to-last-layer
        features of a given image. It's a "feature encoding" of the input image,
        and you can use it for the verification task. You would use the outputs
        of the final classification layer for the classification task.

        You might also find that the classification outputs are sometimes better
        for verification too - try both.
        """
        feats = self.backbone(x)
        out = self.cls_layer(feats)

        if return_feats:
            return feats
        else:
            return out
            
model = BasicNetwork().to(DEVICE)
summary(model, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 111, 111]           1,792
       BatchNorm2d-2         [-1, 64, 111, 111]             128
             ReLU6-3         [-1, 64, 111, 111]               0
            Conv2d-4          [-1, 128, 27, 27]         401,536
       BatchNorm2d-5          [-1, 128, 27, 27]             256
             ReLU6-6          [-1, 128, 27, 27]               0
            Conv2d-7          [-1, 256, 13, 13]         295,168
       BatchNorm2d-8          [-1, 256, 13, 13]             512
             ReLU6-9          [-1, 256, 13, 13]               0
           Conv2d-10            [-1, 512, 6, 6]       1,180,160
      BatchNorm2d-11            [-1, 512, 6, 6]           1,024
            ReLU6-12            [-1, 512, 6, 6]               0
AdaptiveAvgPool2d-13            [-1, 512, 1, 1]               0
          Flatten-14                  [
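
The output shapes and parameter counts in the summary above can be reproduced by hand. The sketch below assumes unpadded convolutions with bias, as in the `BasicNetwork` code; both helpers are introduced here for illustration.

```python
def conv2d_params(in_channels, out_channels, kernel_size, bias=True):
    """Parameter count of a square 2d convolution (weights plus optional bias)."""
    weights = in_channels * out_channels * kernel_size * kernel_size
    return weights + (out_channels if bias else 0)


def conv_out_size(size, kernel_size, stride, padding=0):
    """Output side length: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel_size) // stride + 1


# (in_channels, out_channels, kernel_size, stride) for the four BasicNetwork convolutions
layers = [(3, 64, 3, 2), (64, 128, 7, 4), (128, 256, 3, 2), (256, 512, 3, 2)]

side = 224
rows = []
for in_c, out_c, k, s in layers:
    side = conv_out_size(side, k, s)
    rows.append((out_c, side, conv2d_params(in_c, out_c, k)))
    print(rows[-1])
```

Each printed tuple is (channels, spatial side, parameters) and should line up with the corresponding `Conv2d` row of the summary.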

# SphereNet

In [None]:
class ConvBlockPReLU(torch.nn.Sequential):
    def __init__(self, in_chan, out_chan, kernel, stride, padding=-1, groups=1, bias=False):
        if padding < 0:
            padding = (kernel-1)//2
        super(ConvBlockPReLU, self).__init__(
            torch.nn.Conv2d(in_channels=in_chan, out_channels=out_chan, 
                            kernel_size=kernel, stride=stride, padding=padding,
                            groups=groups, bias=bias),
            torch.nn.BatchNorm2d(out_chan),
            torch.nn.PReLU(out_chan)
        )

class SphereNetBlock(torch.nn.Module):
    def __init__(self, in_chan):
        super(SphereNetBlock, self).__init__()
        layer1 = ConvBlockPReLU(in_chan, in_chan, 3, 1)
        layer2 = ConvBlockPReLU(in_chan, in_chan, 3, 1)

        self.net = torch.nn.Sequential(*[layer1, layer2])

    def forward(self, x):
        return x + self.net(x)


class SphereNet(torch.nn.Module):
    # [3, 8, 16, 3] for 64, [2, 4, 8, 2] for 36
    def __init__(self, in_chan, layer_list, num_classes, block):
        super(SphereNet, self).__init__()
        self.first_chan = in_chan
        self.layer1 = self.make_layer(block, layer_list[0], out_chan=64, stride=2)
        self.layer2 = self.make_layer(block, layer_list[1], out_chan=128, stride=2)
        self.layer3 = self.make_layer(block, layer_list[2], out_chan=256, stride=2)
        self.layer4 = self.make_layer(block, layer_list[3], out_chan=512, stride=2)
        
        #self.avgpool_layer = torch.nn.AdaptiveAvgPool2d((1,1))
        self.dropout_layer = torch.nn.Dropout2d(0.25)
        # image size /16 = 14 or /32 = 7
        self.final_factor = config['spherenet']['linear_scale_factor']
        scale_out = int(512*self.final_factor)
        self.final_linear = torch.nn.Linear(512*14*14, scale_out)
        self.final_batchnorm = torch.nn.BatchNorm1d(scale_out)
        self.final_prelu = torch.nn.PReLU(scale_out)
        self.cls_layer = torch.nn.Linear(scale_out, num_classes)

      
    def make_layer(self, block, block_num, out_chan, stride):
        init_layer = ConvBlockPReLU(self.first_chan, out_chan, 3, stride)
        layers = [init_layer]

        self.first_chan = out_chan
        
        for i in range(block_num):
            layers.append(block(out_chan))

        return torch.nn.Sequential(*layers)

    def forward(self, x, return_feats=False):
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        # x = self.avgpool_layer(x)
        x = self.dropout_layer(x)
        # flatten
        x = x.view(x.size(0), -1)
        x = self.final_linear(x)
        feats = self.final_prelu(self.final_batchnorm(x))
        out = self.cls_layer(feats)
        

        if return_feats:
            return feats, out
        else:
            return out

def SphereNet36(num_classes):
    # in channels and layers are fixed for this model
    return SphereNet(3, [2, 4, 8, 2], num_classes, SphereNetBlock)

def SphereNet64(num_classes):
    # in channels and layers are fixed for this model
    return SphereNet(3, [3, 8, 16, 3], num_classes, SphereNetBlock)
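
One way to read the `36` and `64` in these function names (an interpretation, not something stated in the notebook) is as the number of convolutional layers: each chunk contributes one stride-2 convolution plus two convolutions per residual block. The helper `count_conv_layers` below is hypothetical.

```python
def count_conv_layers(blocks_per_chunk, convs_per_block=2):
    """Count convolutions: one stride-2 conv per chunk plus two per residual block."""
    return sum(1 + convs_per_block * n for n in blocks_per_chunk)


print(count_conv_layers([2, 4, 8, 2]))   # 36
print(count_conv_layers([3, 8, 16, 3]))  # 64
```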

    


In [None]:
if SELECT_MODEL == 'spherenet':
    model = SphereNet36(config['num_classes']).to(DEVICE)
    model.apply(init_weights)
    summary(model, (3, 224, 224))

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           1,728
       BatchNorm2d-2         [-1, 64, 112, 112]             128
             PReLU-3         [-1, 64, 112, 112]              64
            Conv2d-4         [-1, 64, 112, 112]          36,864
       BatchNorm2d-5         [-1, 64, 112, 112]             128
             PReLU-6         [-1, 64, 112, 112]              64
            Conv2d-7         [-1, 64, 112, 112]          36,864
       BatchNorm2d-8         [-1, 64, 112, 112]             128
             PReLU-9         [-1, 64, 112, 112]              64
   SphereNetBlock-10         [-1, 64, 112, 112]               0
           Conv2d-11         [-1, 64, 112, 112]          36,864
      BatchNorm2d-12         [-1, 64, 112, 112]             128
            PReLU-13         [-1, 64, 112, 112]              64
           Conv2d-14         [-1, 64, 1

# ResNet

In [None]:
class ConvBlock(torch.nn.Sequential):
    def __init__(self, in_chan, out_chan, kernel, stride, padding=-1, groups=1, bias=False):
        if padding < 0:
          padding = (kernel-1)//2
        super(ConvBlock, self).__init__(
            torch.nn.Conv2d(in_channels=in_chan, out_channels=out_chan, 
                            kernel_size=kernel, stride=stride, padding=padding,
                            groups=groups, bias=bias),
            torch.nn.BatchNorm2d(out_chan),
            torch.nn.ReLU6(inplace=True)
        )

class ResNetBlock(torch.nn.Module):
    expansion = 1
    def __init__(self, in_chan, out_chan, stride=1):
        super(ResNetBlock, self).__init__()
        self.stride = stride
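        # NOTE: despite its name, this flag is True when the shortcut can be the identity
        # (stride 1 and no channel change), i.e. when no projection is needed
        self.downsample = self.stride == 1 and in_chan == (out_chan*self.expansion)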
        layer1 = ConvBlock(in_chan, out_chan, 3, self.stride)
        layer2 = torch.nn.Sequential(
            torch.nn.Conv2d(out_chan, out_chan, kernel_size=3, stride=1, padding=1, bias=False),
            torch.nn.BatchNorm2d(out_chan)
        )

        self.basicnet = torch.nn.Sequential(*[layer1, layer2])

        if not self.downsample:
            self.ds_layer = torch.nn.Sequential(
                torch.nn.Conv2d(in_chan, out_chan*self.expansion, 1, self.stride, bias=False),
                torch.nn.BatchNorm2d(out_chan*self.expansion)
            )
        else:
            self.ds_layer = torch.nn.Sequential()

        self.activate = torch.nn.ReLU6()

    def forward(self, x):
        out = self.basicnet(x)
        # add the shortcut: identity when shapes match, otherwise the 1x1 conv projection
        out += self.ds_layer(x)
        out = self.activate(out)
        return out

        

In [None]:

class ResNet(torch.nn.Module):
    def __init__(self, in_chan, layer_list, num_classes, block):
        super(ResNet, self).__init__()
        self.first_chan = 64
        self.init_layer = ConvBlock(in_chan, self.first_chan, 7, 2)
        self.maxpool_layer = torch.nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # 64, 128, 256, 512
        self.layer1 = self.make_layer(block, layer_list[0], out_chan=64, stride=1)
        self.layer2 = self.make_layer(block, layer_list[1], out_chan=128, stride=2)
        self.layer3 = self.make_layer(block, layer_list[2], out_chan=256, stride=2)
        self.layer4 = self.make_layer(block, layer_list[3], out_chan=512, stride=2)
        
        self.avgpool_layer = torch.nn.AdaptiveAvgPool2d((1,1))
        #self.dropout_layer = torch.nn.Dropout2d(config['resnet']['dropout'])
        self.cls_layer = torch.nn.Linear(512*block.expansion, num_classes)
        
    def make_layer(self, block, block_num, out_chan, stride):
        init_layer = block(self.first_chan, out_chan, stride)
        # update input channel
        self.first_chan = out_chan*block.expansion

        layers = [init_layer]
        
        for i in range(block_num-1):
            layers.append(block(self.first_chan, out_chan, 1))
            self.first_chan = out_chan*block.expansion

        return torch.nn.Sequential(*layers)

    def forward(self, x, return_feats=False):
        x = self.init_layer(x)
        x = self.maxpool_layer(x)

        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        
        x = self.avgpool_layer(x)
        #x = self.dropout_layer(x)
        # flatten
        feats = x.view(x.size(0), -1)
        out = self.cls_layer(feats)

        if return_feats:
            return feats, out
        else:
            return out

def ResNet34(num_classes):
    # in channels and layers are fixed for this model
    return ResNet(3, [3,4,6,3], num_classes, ResNetBlock)


In [None]:
if SELECT_MODEL == 'resnet':
  model = ResNet34(config['num_classes']).to(DEVICE)
  model.apply(init_weights)
  summary(model, (3, 224, 224))

# MobileNet Network

In [None]:
class ConvBlock(torch.nn.Sequential):
    def __init__(self, in_chan, out_chan, kernel, stride, groups=1, bias=False):
        super(ConvBlock, self).__init__(
            torch.nn.Conv2d(in_channels=in_chan, out_channels=out_chan, 
                            kernel_size=kernel, stride=stride, padding=(kernel-1)//2,
                            groups=groups, bias=bias),
            torch.nn.BatchNorm2d(out_chan),
            torch.nn.ReLU6(inplace=True)
        )


class BottleNeck(torch.nn.Module):
    """
    Bottleneck block for MobileNetV2
    """
    def __init__(self, in_chan, out_chan, stride, expand=1):
        super(BottleNeck, self).__init__()
        self.stride = stride
        self.skip = False
        if self.stride == 1 and in_chan == out_chan:
          self.skip = True
        # expand dimension by expansion factor 
        h_dim = int(expand*in_chan)
        # pointwise with kernel = 1 and stride = 1 and no padding
        layer1 = ConvBlock(in_chan, h_dim, 1, 1)
        # depthwise with kernel = 3, stride set by the bottleneck, and padding = 1
        layer2 = ConvBlock(h_dim, h_dim, 3, self.stride, groups=h_dim)
        # projection with kernel = 1 and stride = 1 and no padding
        layer3 = torch.nn.Sequential(
            torch.nn.Conv2d(h_dim, out_chan, 1, 1),
            torch.nn.BatchNorm2d(out_chan)
        )

        self.bottlenet = torch.nn.Sequential(layer1, layer2, layer3)
    
    def forward(self, x):
        # skip if in and out channel is the same and stride is 1
        if self.skip:
            return x + self.bottlenet(x)
        else:
            return self.bottlenet(x)

In [None]:
class MobileNetV2(torch.nn.Module):
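    """
    MobileNetV2 assembled from the `bottles` config: an initial conv layer,
    a stack of bottleneck blocks, and a final 1x1 conv layer.
    """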
    def __init__(self, in_chan, num_classes, bottles, blocktype=BottleNeck):
        super(MobileNetV2, self).__init__()
        bottle_in_chan = bottles['init_layer'][1]
        bottle_in_stride = bottles['init_layer'][3]
        bottle_out_chan = bottles['last_layer'][1]
        bottle_out_stride = bottles['last_layer'][3]
        # first layer
        init_layer = ConvBlock(in_chan, bottle_in_chan, 3, bottle_in_stride)

        blocks = [init_layer]
        # adding bottleneck layers
        for expand, chan, num, stride in bottles["bottleneck"]:
          print(expand, chan, num, stride)
          blocks.append(BottleNeck(bottle_in_chan, chan, stride, expand))
          bottle_in_chan = chan
          if num > 1:
              for l in range(1, num):
                blocks.append(BottleNeck(bottle_in_chan, chan, 1, expand))
        # last layer
        blocks.append(ConvBlock(bottle_in_chan, bottle_out_chan, 1, bottle_out_stride))

        # pooling layer and flatten
        blocks.append(torch.nn.AdaptiveAvgPool2d(output_size=(1, 1)))
        #blocks.append(torch.nn.Flatten())

        blocks.append(torch.nn.Dropout2d(config['mobilenet']['dropout']))
        blocks.append(torch.nn.Flatten())

        self.mobilenet = torch.nn.Sequential(*blocks)

        # classifier
        
        self.cls_layer = torch.nn.Linear(bottle_out_chan, num_classes)
    
    def forward(self, x, return_feats=False):
        """
        What is return_feats? It essentially returns the second-to-last-layer
        features of a given image. It's a "feature encoding" of the input image,
        and you can use it for the verification task. You would use the outputs
        of the final classification layer for the classification task.

        You might also find that the classification outputs are sometimes better
        for verification too - try both.
        """
        feats = self.mobilenet(x)
        out = self.cls_layer(feats)

        if return_feats:
            return feats, out
        else:
            return out
            

In [None]:
if SELECT_MODEL == 'mobilenet':
  model = MobileNetV2(3, config['num_classes'], config['mobilenet_setting']).to(DEVICE)
  model.apply(init_weights)
  summary(model, (3, 224, 224))

# Other Setup - Loss, Optimizer, Scheduler

In [None]:
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1) # cross-entropy with label smoothing for the multi-class classification problem
optimizer = torch.optim.SGD(model.parameters(), lr=config['lr'], momentum=0.9, weight_decay=5e-4)

if USE_CELOSS:
  finetune_loss = CenterLoss(7000, 512, config['celoss_weight'])
  finetune_optimizer = torch.optim.SGD(finetune_loss.parameters(), lr = config['celoss_lr']) 

scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, 20, 1)
#loss_scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(finetune_optimizer, 24, 1)
scaler = torch.cuda.amp.GradScaler() # Good news. We have FP16 (Mixed precision training) implemented for you
# It is useful only in the case of compatible GPUs such as T4/V100


# Training Functions

In [None]:
def train_celoss(model, dataloader, optimizer, finetune_optimizer, criterion, 
          finetune_loss, loss_weight, scheduler=None):
    
    model.train()

    # Progress Bar 
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train', ncols=5) 

    num_correct = 0
    total_loss  = 0

    for i, (images, labels) in enumerate(dataloader):
        optimizer.zero_grad()
        finetune_optimizer.zero_grad()

        images, labels = images.to(DEVICE), labels.to(DEVICE)

        with torch.cuda.amp.autocast():
            feats, outputs = model(images, return_feats=True)
            loss0 = criterion(outputs, labels) # TODO: calculate cross entropy loss from outputs and labels
            loss1 = finetune_loss(feats, labels)# TODO: calculate weighted finetune_loss (center loss) from feats and labels
       
        # Update no. of correct predictions & loss as we iterate
        num_correct     += int((torch.argmax(outputs, axis=1) == labels).sum())
        total_loss      += float(loss0.item())+float(loss1.item())

        # tqdm lets you add some details so you can monitor training as you train.
        batch_bar.set_postfix(
            acc         = "{:.04f}%".format(100 * num_correct / (config['batch_size']*(i + 1))),
            loss        = "{:.04f}".format(float(total_loss / (i + 1))),
            num_correct = num_correct,
            lr          = "{:.04f}".format(float(optimizer.param_groups[0]['lr'])),
            losslr          = "{:.04f}".format(float(finetune_optimizer.param_groups[0]['lr'])),
        )       
        # backward loss0 to calculate gradients for the model parameters;
        # retain_graph=True keeps the shared graph alive for the second backward call
        scaler.scale(loss0).backward(retain_graph=True)
        # backward loss1 to calculate gradients for the finetune_loss (center loss) parameters;
        # these gradients also flow back into the model through feats
        scaler.scale(loss1).backward()

        # update the fine-tuning loss's parameters:
        # their gradients are rescaled by 1/loss_weight so the chosen loss_weight does not shrink their update
        for parameter in finetune_loss.parameters():
            parameter.grad.data *= (1.0 / loss_weight)

        scaler.step(optimizer)
        scaler.step(finetune_optimizer)
        scaler.update()

        if scheduler is not None:
            scheduler.step()
        # if you use a scheduler to schedule your learning rate for Center Loss
        # scheduler_finetune_loss.step()
        
        del images, labels, outputs, loss0, loss1
        torch.cuda.empty_cache()

        batch_bar.update() # Update tqdm bar

    batch_bar.close() # You need this to close the tqdm bar

    acc         = 100 * num_correct / (config['batch_size']* len(dataloader))
    total_loss  = float(total_loss / len(dataloader))

    return acc, total_loss

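The `parameter.grad.data *= (1.0 / loss_weight)` step in `train_celoss` can be checked on a one-dimensional toy case. This is a minimal sketch, assuming the standard center loss form `weight * 0.5 * (x - c)**2` with the weight applied inside the loss; the notebook's `CenterLoss` class is defined elsewhere, so its exact form is an assumption here, and `weighted_center_loss_grad` is a hypothetical helper.

```python
def weighted_center_loss_grad(x, c, weight):
    """Gradient of weight * 0.5 * (x - c)**2 with respect to the center c."""
    return -weight * (x - c)

x, c = 2.0, 0.5        # one feature value and its class center (toy numbers)
loss_weight = 0.003    # lambda, the value mentioned in the fine-tuning notes

grad_c = weighted_center_loss_grad(x, c, loss_weight)
rescaled = grad_c * (1.0 / loss_weight)   # same operation as in train_celoss

print(grad_c)      # scaled down by lambda
print(rescaled)    # approximately -(x - c)
```

Under that assumption, the model still sees the center loss weighted by `loss_weight`, while the centers themselves move at a rate controlled only by the center loss learning rate.
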
In [None]:
def train(model, dataloader, optimizer, criterion):
    
    model.train()

    # Progress Bar 
    batch_bar   = tqdm(total=len(dataloader), dynamic_ncols=True, leave=False, position=0, desc='Train', ncols=5) 

    num_correct = 0
    total_loss  = 0

    for i, (images, labels) in enumerate(dataloader):
        
        optimizer.zero_grad() # Zero gradients

        images, labels = images.to(DEVICE), labels.to(DEVICE)
        
        with torch.cuda.amp.autocast(): # This implements mixed precision. That's it!
            outputs = model(images)
            loss    = criterion(outputs, labels)

        # Update no. of correct predictions & loss as we iterate
        num_correct     += int((torch.argmax(outputs, axis=1) == labels).sum())
        total_loss      += float(loss.item())

        # tqdm lets you add some details so you can monitor training as you train.
        batch_bar.set_postfix(
            acc         = "{:.04f}%".format(100 * num_correct / (config['batch_size']*(i + 1))),
            loss        = "{:.04f}".format(float(total_loss / (i + 1))),
            num_correct = num_correct,
            lr          = "{:.04f}".format(float(optimizer.param_groups[0]['lr']))
        )
        
        scaler.scale(loss).backward() # This is a replacement for loss.backward()
        scaler.step(optimizer) # This is a replacement for optimizer.step()
        scaler.update() 

        # TODO? Depending on your choice of scheduler,
        # you may want to call some schedulers inside the train function. What are these?
      
        batch_bar.update() # Update tqdm bar

    batch_bar.close() # You need this to close the tqdm bar

    acc         = 100 * num_correct / (config['batch_size']* len(dataloader))
    total_loss  = float(total_loss / len(dataloader))

    return acc, total_loss

In [None]:
def validate(model, dataloader, criterion):
  
    model.eval()
    batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc='Val', ncols=5)

    num_correct = 0.0
    total_loss = 0.0

    for i, (images, labels) in enumerate(dataloader):
        
        # Move images to device
        images, labels = images.to(DEVICE), labels.to(DEVICE)
        
        # Get model outputs
        with torch.inference_mode():
            outputs = model(images)
            loss = criterion(outputs, labels)

        num_correct += int((torch.argmax(outputs, axis=1) == labels).sum())
        total_loss += float(loss.item())

        batch_bar.set_postfix(
            acc="{:.04f}%".format(100 * num_correct / (config['batch_size']*(i + 1))),
            loss="{:.04f}".format(float(total_loss / (i + 1))),
            num_correct=num_correct)

        batch_bar.update()
        
    batch_bar.close()
    acc = 100 * num_correct / (config['batch_size']* len(dataloader))
    total_loss = float(total_loss / len(dataloader))
    return acc, total_loss

In [None]:
gc.collect() # These commands help you when you face CUDA OOM error
torch.cuda.empty_cache()

# Wandb

In [None]:
wandb.login(key="<YOUR_WANDB_API_KEY>") # placeholder: supply your own key and never commit a real one

[34m[1mwandb[0m: W&B API key is configured. Use [1m`wandb login --relogin`[0m to force relogin
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


True

In [None]:
# Create your wandb run
run = wandb.init(
    name = "spherenet-run", ## Wandb creates random run names if you skip this field
    reinit = True, ### Allows reinitializing runs when you re-run this cell
    # run_id = ### Insert specific run id here if you want to resume a previous run
    # resume = "must" ### You need this to resume previous runs, but comment out reinit = True when using this
    project = "hw2p2-ablations", ### Project should be created in your wandb account 
    config = config ### Wandb Config for your run
)

[34m[1mwandb[0m: Currently logged in as: [33mwenxinz3[0m ([33msharonxin1207[0m). Use [1m`wandb login --relogin`[0m to force relogin


# Experiments

In [None]:
# set the threshold epoch for scheduler switch
if SELECT_MODEL in ['resnet', 'mobilenet']:
  switch_epoch = 77
elif SELECT_MODEL in ['spherenet']:
  switch_epoch = 37 # leave 5 epochs for smoothing the learning rate towards the end, rather than pause at sharp end of cosine scheduler
else:
  switch_epoch = 100

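The choice of `switch_epoch = 37` for SphereNet is easier to see by tracing the learning rate that cosine annealing with warm restarts would produce with a 20-epoch cycle. The sketch below uses the closed-form schedule in plain Python rather than the PyTorch scheduler itself. It assumes one scheduler step per epoch, `eta_min = 0`, and a base learning rate of 0.15 (the value printed in the reload output); `cosine_warm_restart_lr` is a hypothetical helper.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr=0.15, t_0=20, eta_min=0.0):
    """Closed-form LR of cosine annealing with warm restarts (T_mult = 1)."""
    t_cur = epoch % t_0   # position inside the current cycle
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_0))

for epoch in (0, 10, 19, 20, 37, 40):
    print(epoch, round(cosine_warm_restart_lr(epoch), 5))
```

Under these assumptions the rate is already small at epoch 37 and would jump back to the base value at epoch 40, so handing over to a decaying scheduler before that point avoids a restart in the last few of the 42 epochs.
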
In [None]:
best_valacc = 0

for epoch in range(config[SELECT_MODEL]['epochs']):
    curr_lr = float(optimizer.param_groups[0]['lr'])
    if not USE_CELOSS:
      train_acc, train_loss = train(model, train_loader, optimizer, criterion)
    else:
      # during training, we fix the learning rate for Celoss so not using any scheduler
      train_acc, train_loss = train_celoss(
          model, train_loader, optimizer, finetune_optimizer, 
          criterion, finetune_loss, config['celoss_weight'])
    
    print("\nEpoch {}/{}: \nTrain Acc {:.04f}%\t Train Loss {:.04f}\t Learning Rate {:.04f}".format(
        epoch + 1,
        config[SELECT_MODEL]['epochs'],
        train_acc,
        train_loss,
        curr_lr))
    
    val_acc, val_loss = validate(model, valid_loader, criterion)
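    # step once per epoch; ReduceLROnPlateau (used after the switch for resnet/mobilenet) requires the monitored metric
    if isinstance(scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
      scheduler.step(val_loss)
    else:
      scheduler.step()
    # check if scheduler needs to be switched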
    if epoch == switch_epoch:
      scheduler.step()
      print("switching scheduler")
      if SELECT_MODEL == 'spherenet':
        # for spherenet, only train for a few epoch, so use stepLR
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, gamma=0.6, step_size=1)
      else:
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.6, patience=1, verbose=True)

    print("Val Acc {:.04f}%\t Val Loss {:.04f}".format(val_acc, val_loss))

    wandb.log({"train_loss":train_loss, 'train_Acc': train_acc, 'validation_Acc':val_acc, 
               'validation_loss': val_loss, "learning_Rate": curr_lr})
    
    # If you are using a scheduler in your train function within your iteration loop, you may want to log
    # your learning rate differently 

    # #Save model in drive location if val_acc is better than best recorded val_acc
    if val_acc >= best_valacc:
      #path = os.path.join(root, model_directory, 'checkpoint' + '.pth')
      print("Saving model at epoch "+str(epoch+1))
      save_vals = {'model_state_dict':model.state_dict(),
                  'optimizer_state_dict':optimizer.state_dict(),
                  'scheduler_state_dict':scheduler.state_dict(),
                  'val_acc': val_acc, 
                  'val_loss': val_loss,
                  'learning_rate': curr_lr,
                  'epoch': epoch}
      if USE_CELOSS:
        save_vals['celoss_state'] = finetune_loss.state_dict()
        save_vals['celoss_optimizer'] = finetune_optimizer.state_dict()
      

      torch.save(save_vals, './checkpoint_spherenet_w_celoss.pth')
      
      best_valacc = val_acc
      wandb.save('checkpoint_spherenet_w_celoss.pth')

    gc.collect() # These commands help you when you face CUDA OOM error
    torch.cuda.empty_cache()
      # You may find it interesting to explore Wandb Artifacts to version your models

# spherenet + cross loss > 96 batch
# + celoss > 64 batch

In [None]:
run.finish()

# FineTuning

In [None]:
# add your finetune/retrain code here
# use celoss_lr = 0.5, lambda = 0.003
def reload_prev_model(modeltype, modelpath, scheduler, use_celoss=USE_CELOSS):
  if modeltype == 'spherenet':
    model = SphereNet36(config['num_classes']).to(DEVICE)
    model.apply(init_weights)
    #scheduler = torch.optim.lr_scheduler.StepLR(optimizer, gamma=0.6, step_size=1)
  elif modeltype == 'resnet':
    model = ResNet34(config['num_classes']).to(DEVICE)
    model.apply(init_weights)

  optimizer = torch.optim.SGD(model.parameters(), lr=config['lr'], momentum=0.9, weight_decay=5e-4)

  checkpoint = torch.load(modelpath)
  model.load_state_dict(checkpoint['model_state_dict'])
  optimizer.load_state_dict(checkpoint['optimizer_state_dict'])
  scheduler.load_state_dict(checkpoint['scheduler_state_dict'])
  print("Current Val ACC, LOSS, LR and EPOCH from reload is:\n")
  print(checkpoint['val_acc'], checkpoint['learning_rate'], checkpoint['epoch'])

  if use_celoss:
      finetune_loss = CenterLoss(config['num_classes'], 512)
      finetune_optimizer = torch.optim.SGD(finetune_loss.parameters(), lr = config['celoss_lr']) 
      finetune_loss.load_state_dict(checkpoint['celoss_state'])
      finetune_optimizer.load_state_dict(checkpoint['celoss_optimizer'])
      return model, optimizer, scheduler, finetune_loss, finetune_optimizer
  else:
    return model, optimizer, scheduler



### First Level Finetune - 15 epochs, run 2 times

In [None]:
modelpath = './checkpoint_spherenet_w_celoss.pth'
model, optimizer, scheduler = reload_prev_model('spherenet', modelpath, scheduler, False)
# resetting learning rate for finetuning 
optimizer.param_groups[0]['lr'] = config['finetune']['model_lr']
# for finetuning, load a new loss with smaller loss weight
finetune_loss = CenterLoss(config['num_classes'], 512, config['finetune']['celoss_weight'])
finetune_optimizer = torch.optim.SGD(finetune_loss.parameters(), lr = config['finetune']['celoss_lr']) 
# create schedulers for both model and loss with a 15-epoch cosine cycle
scheduler_1ft = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 15)
finetune_scheduler_1ft = torch.optim.lr_scheduler.CosineAnnealingLR(finetune_optimizer, 15)

print(optimizer.param_groups[0]['lr'])
print(finetune_optimizer.param_groups[0]['lr'])

Current Val ACC, LOSS, LR and EPOCH from reload is:

0.6798446069469836 0.15 0
0.01
0.5


In [None]:
best_valacc = 90

# start finetuning: 15 epochs per run, run 2 times
for epoch in range(config['finetune']['epochs_1']):
    curr_lr = float(optimizer.param_groups[0]['lr'])
    curr_losslr = float(finetune_optimizer.param_groups[0]['lr'])
    train_acc, train_loss = train_celoss(
          model, train_loader, optimizer, finetune_optimizer, 
          criterion, finetune_loss, config['finetune']['celoss_weight'])
    
    print("\nFinetune Epoch {}/{}: \nTrain Acc {:.04f}%\t Train Loss {:.04f}\t Learning Rate {:.04f}, Loss LR {:.04f}".format(
        epoch + 1,
        config['finetune']['epochs_1'],
        train_acc,
        train_loss,
        curr_lr, 
        curr_losslr))
    
    val_acc, val_loss = validate(model, valid_loader, criterion)
    scheduler_1ft.step()
    finetune_scheduler_1ft.step()
    
    print("Val Acc {:.04f}%\t Val Loss {:.04f}".format(val_acc, val_loss))
    
    wandb.log({"train_loss":train_loss, 'train_Acc': train_acc, 'validation_Acc':val_acc, 
               'validation_loss': val_loss, "learning_Rate": curr_lr})
    
    # If you are using a scheduler in your train function within your iteration loop, you may want to log
    # your learning rate differently 

    # #Save model in drive location if val_acc is better than best recorded val_acc
    if val_acc >= best_valacc:
      #path = os.path.join(root, model_directory, 'checkpoint' + '.pth')
      print("Saving model at epoch "+str(epoch+1))
      save_vals = {'model_state_dict':model.state_dict(),
                  'optimizer_state_dict':optimizer.state_dict(),
                  'scheduler_state_dict':scheduler_1ft.state_dict(),
                  'val_acc': val_acc, 
                  'val_loss': val_loss,
                  'learning_rate': curr_lr,
                  'ft_learning_rate': curr_losslr, 
                  'epoch': epoch}
      if USE_CELOSS:
        save_vals['celoss_state'] = finetune_loss.state_dict()
        save_vals['celoss_optimizer'] = finetune_optimizer.state_dict()
      torch.save(save_vals, './checkpoint_spherenet_finetune.pth')
      
      best_valacc = val_acc
      wandb.save('checkpoint_spherenet_finetune.pth')

    gc.collect() # These commands help you when you face CUDA OOM error
    torch.cuda.empty_cache()
      # You may find it interesting to explore Wandb Artifacts to version your models
#run.finish()

### Second Level Finetune - 10 epochs, run 1 time

In [None]:
if FINETUNE_FROM_LOAD:
  modelpath = './checkpoint_spherenet_finetune.pth'  
  model, optimizer, scheduler, finetune_loss, finetune_optimizer = reload_prev_model('spherenet', modelpath, scheduler, True)

# resetting learning rate for finetuning 
optimizer.param_groups[0]['lr'] = config['finetune']['model_lr']*config['finetune']['2ft_lr_factor']
finetune_optimizer.param_groups[0]['lr'] = config['finetune']['celoss_lr']*config['finetune']['2ft_lr_factor']
# create schedulers for both model and loss with a 10-epoch cosine cycle
scheduler_2ft = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, 10)
finetune_scheduler_2ft = torch.optim.lr_scheduler.CosineAnnealingLR(finetune_optimizer, 10)

print(optimizer.param_groups[0]['lr'])
print(finetune_optimizer.param_groups[0]['lr'])

0.001
0.05


In [None]:
best_valacc = 90

# start finetuning: 10 epochs per run, run 1 time
for epoch in range(config['finetune']['epochs_2']):
    curr_lr = float(optimizer.param_groups[0]['lr'])
    curr_losslr = float(finetune_optimizer.param_groups[0]['lr'])
    train_acc, train_loss = train_celoss(
          model, train_loader, optimizer, finetune_optimizer, 
          criterion, finetune_loss, config['finetune']['celoss_weight'])
    
    print("\nFinetune Epoch {}/{}: \nTrain Acc {:.04f}%\t Train Loss {:.04f}\t Learning Rate {:.04f}, Loss LR {:.04f}".format(
        epoch + 1,
        config['finetune']['epochs_2'],
        train_acc,
        train_loss,
        curr_lr, 
        curr_losslr))
    
    val_acc, val_loss = validate(model, valid_loader, criterion)
    scheduler_2ft.step()
    finetune_scheduler_2ft.step()
    
    print("Val Acc {:.04f}%\t Val Loss {:.04f}".format(val_acc, val_loss))
    
    wandb.log({"train_loss":train_loss, 'train_Acc': train_acc, 'validation_Acc':val_acc, 
               'validation_loss': val_loss, "learning_Rate": curr_lr})
    
    # If you are using a scheduler in your train function within your iteration loop, you may want to log
    # your learning rate differently 

    # #Save model in drive location if val_acc is better than best recorded val_acc
    if val_acc >= best_valacc:
      #path = os.path.join(root, model_directory, 'checkpoint' + '.pth')
      print("Saving model at epoch "+str(epoch+1))
      save_vals = {'model_state_dict':model.state_dict(),
                  'optimizer_state_dict':optimizer.state_dict(),
                  'scheduler_state_dict':scheduler_2ft.state_dict(),
                  'val_acc': val_acc, 
                  'val_loss': val_loss,
                  'learning_rate': curr_lr,
                  'ft_learning_rate': curr_losslr, 
                  'epoch': epoch}
      if USE_CELOSS:
        save_vals['celoss_state'] = finetune_loss.state_dict()
        save_vals['celoss_optimizer'] = finetune_optimizer.state_dict()
      torch.save(save_vals, './checkpoint_spherenet_finetune_2.pth')
      
      best_valacc = val_acc
      wandb.save('checkpoint_spherenet_finetune_2.pth')

    gc.collect() # These commands help you when you face CUDA OOM error
    torch.cuda.empty_cache()
      # You may find it interesting to explore Wandb Artifacts to version your models
run.finish()

# Classification Task: Testing

In [None]:
def test(model,dataloader):

  model.eval()
  batch_bar = tqdm(total=len(dataloader), dynamic_ncols=True, position=0, leave=False, desc='Test')
  test_results = []
  
  for i, (images) in enumerate(dataloader):
      # TODO: Finish predicting on the test set.
      images = images.to(DEVICE)

      with torch.inference_mode():
        outputs = model(images)

      outputs = torch.argmax(outputs, axis=1).detach().cpu().numpy().tolist()
      test_results.extend(outputs)
      
      batch_bar.update()
      
  batch_bar.close()
  return test_results

In [None]:
test_results = test(model, test_loader)

## Generate csv to submit to Kaggle

In [None]:
with open("classification_spherenet_submission_sub.csv", "w+") as f:
    f.write("id,label\n")
    for i in range(len(test_dataset)):
        f.write("{},{}\n".format(str(i).zfill(5) + ".jpg", test_results[i]))

In [None]:
if SUBMIT_KAGGLE:
  !kaggle competitions submit -c 11-785-s23-hw2p2-classification -f classification_spherenet_submission_sub.csv -m "submission"

# Verification Task: Validation

The verification task consists of the following generalized scenario:
- You are given X unknown identities
- You are given Y known identities
- Your goal is to match X unknown identities to Y known identities.

We have given you a verification dataset that consists of 960 known identities and 1080 unknown identities. The 1080 unknown identities are split into dev (360) and test (720). Your goal is to compare the unknown identities to the 960 known identities and assign an identity to each image from the set of unknown identities. Some unknown identities have no correspondence among the known identities; you also need to identify these and label them with the special label n000000.

You will use/finetune your model trained for classification to compare images between known and unknown identities using a similarity metric and assign labels to the unknown identities. 

This will judge your model's performance in terms of the quality of embeddings/features it generates on images/faces it has never seen during training for classification.

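The matching rule described above (take the most similar known identity, and fall back to `n000000` when even the best match is weak) can be sketched on toy vectors. This is a minimal illustration in plain Python so it runs without a GPU. The feature vectors, the identity strings other than `n000000`, the threshold of 0.5, and the helper names `cosine_similarity` and `assign_identities` are all assumptions made for the example, not values from the dataset.

```python
import math

def cosine_similarity(a, b, eps=1e-6):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / max(norm_a * norm_b, eps)

def assign_identities(unknown_feats, known_feats, known_ids, threshold, no_match="n000000"):
    """Label each unknown feature with its best known identity, or no_match if below threshold."""
    preds = []
    for feat in unknown_feats:
        sims = [cosine_similarity(feat, k) for k in known_feats]
        best = max(range(len(sims)), key=sims.__getitem__)   # index of the max similarity
        preds.append(known_ids[best] if sims[best] >= threshold else no_match)
    return preds

known_feats = [[1.0, 0.0], [0.0, 1.0]]        # toy embeddings of two known identities
known_ids = ["n000001", "n000002"]            # hypothetical identity strings
unknown_feats = [[0.9, 0.1], [0.1, 0.8], [-1.0, -1.0]]

preds = assign_identities(unknown_feats, known_feats, known_ids, threshold=0.5)
print(preds)
```

In practice the threshold would be tuned on the dev split, since a value that is too high labels real matches as `n000000` and a value that is too low does the opposite.
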
In [None]:
# This obtains the list of known identities from the known folder
known_regex = "/content/data/11-785-s23-hw2p2-verification/known/*/*"
known_paths = [i.split('/')[-2] for i in sorted(glob.glob(known_regex))]

# Obtain a list of images from unknown folders
unknown_dev_regex = "/content/data/11-785-s23-hw2p2-verification/unknown_dev/*"
unknown_test_regex = "/content/data/11-785-s23-hw2p2-verification/unknown_test/*"

# We load the images from known and unknown folders
unknown_dev_images = [Image.open(p) for p in tqdm(sorted(glob.glob(unknown_dev_regex)))]
unknown_test_images = [Image.open(p) for p in tqdm(sorted(glob.glob(unknown_test_regex)))]
known_images = [Image.open(p) for p in tqdm(sorted(glob.glob(known_regex)))]

# Why do you need only ToTensor() and Normalize() here, with no augmentation?
transforms = torchvision.transforms.Compose([
    torchvision.transforms.ToTensor(),
    torchvision.transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])])

unknown_dev_images = torch.stack([transforms(x) for x in unknown_dev_images])
unknown_test_images = torch.stack([transforms(x) for x in unknown_test_images])
known_images  = torch.stack([transforms(y) for y in known_images ])
#Print your shapes here to understand what we have done

# You can use other similarity metrics like Euclidean Distance if you wish
similarity_metric = torch.nn.CosineSimilarity(dim= 1, eps= 1e-6) 

In [None]:
def eval_verification(unknown_images, known_images, model, similarity, batch_size= config['batch_size'], mode='val'): 

    unknown_feats, known_feats = [], []

    batch_bar = tqdm(total=len(unknown_images)//batch_size, dynamic_ncols=True, position=0, leave=False, desc=mode)
    model.eval()

    # We load the images as batches for memory optimization and avoiding CUDA OOM errors
    for i in range(0, unknown_images.shape[0], batch_size):
        unknown_batch = unknown_images[i:i+batch_size] # Slice a given portion up to batch_size
        
        with torch.no_grad():
            #unknown_feat = model(unknown_batch.float().to(DEVICE), return_feats=True) #Get features from model         
            unknown_feat, unknown_out = model(unknown_batch.float().to(DEVICE), return_feats=True) #Get features from model         
        
        unknown_feats.append(unknown_feat)
        batch_bar.update()
    
    batch_bar.close()
    
    # Ceil division so that a final partial batch is counted in the progress bar
    batch_bar = tqdm(total=(len(known_images) + batch_size - 1) // batch_size, dynamic_ncols=True, position=0, leave=False, desc=mode)
    
    for i in range(0, known_images.shape[0], batch_size):
        known_batch = known_images[i:i+batch_size] 
        with torch.no_grad():
            #known_feat = model(known_batch.float().to(DEVICE), return_feats=True)
            known_feat, known_out = model(known_batch.float().to(DEVICE), return_feats=True) #Get features from model
          
        known_feats.append(known_feat)
        batch_bar.update()

    batch_bar.close()

    # Concatenate all the batches
    unknown_feats = torch.cat(unknown_feats, dim=0)
    known_feats = torch.cat(known_feats, dim=0)

    similarity_values = torch.stack([similarity(unknown_feats, known_feature) for known_feature in known_feats])
    # Print the inner list comprehension in a separate cell - what is really happening?
    #print(similarity_values.shape)
    #print(similarity_values)
    max_similarity_values, predictions = similarity_values.max(0) # Why are we taking a max here, and what are the return values?
    max_similarity_values, predictions = max_similarity_values.cpu().numpy(), predictions.cpu().numpy()
    print(max_similarity_values.mean())

    # Note that some unknown identities have no correspondence among the known identities.
    # These identities should not be similar to any known identity, i.e. their max similarity
    # will fall below a certain threshold, unlike identities that do have a correspondence.

    # For the early submission, you can ignore identities without correspondence and simply take the identity with the max similarity value:
    # pred_id_strings = [known_paths[i] for i in predictions] # Map argmax indices to identity strings

    # After the early submission, the thresholded version below replaces the previous line

    NO_CORRESPONDENCE_LABEL = 'n000000'
    pred_id_strings = []
    for idx, prediction in enumerate(predictions):
        if max_similarity_values[idx] < 0.435: # 0.388 why < ? Think about what your similarity metric is
            pred_id_strings.append(NO_CORRESPONDENCE_LABEL)
        else:
            pred_id_strings.append(known_paths[prediction])
    
    if mode == 'val':
        true_ids = pd.read_csv('/content/data/11-785-s23-hw2p2-verification/verification_dev.csv')['label'].tolist()
        accuracy = accuracy_score(true_ids, pred_id_strings) # accuracy_score expects (y_true, y_pred)
        print("Verification Accuracy = {}".format(accuracy))
    
    return pred_id_strings
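
To see what the stacking and `max(0)` steps inside `eval_verification` produce, here is a minimal sketch on made-up 2-D embeddings. The tensors are toy values, not model outputs, and `unsqueeze(0)` is added only to make the broadcast explicit; the threshold is the 0.435 used in the function above.

```python
import torch

similarity = torch.nn.CosineSimilarity(dim=1, eps=1e-6)

# Toy embeddings (illustrative only): 3 known identities, 2 unknown images
known_feats = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
unknown_feats = torch.tensor([[2.0, 0.1], [-1.0, -2.0]])

# One row per known image, one column per unknown image -> shape (3, 2)
similarity_values = torch.stack(
    [similarity(unknown_feats, k.unsqueeze(0)) for k in known_feats]
)

# Reducing over dim 0 gives, for each unknown image, the best score
# and the index of the known image that achieved it
max_similarity_values, predictions = similarity_values.max(0)

THRESHOLD = 0.435  # same value as in eval_verification
has_match = max_similarity_values >= THRESHOLD

print(similarity_values.shape, predictions.tolist(), has_match.tolist())
```

The first unknown image points almost exactly along the first known embedding, so it is matched; the second points away from all three, so its best score stays below the threshold and it would receive the no-correspondence label.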

In [None]:
# verification eval: dev accuracy at each similarity threshold
# finetune 2: 0.6694444 at 0.42, 0.675 at 0.45, 0.6805556 at 0.475, 0.683333 at 0.5,
#             0.6861111 at 0.525, 0.535 and 0.512, 0.68888889 at 0.515
# finetune 3: 0.6666 at 0.505, 0.697 at 0.475, 0.702777 at 0.465, 0.71111 at 0.45,
#             0.708333 at 0.435, 0.7138888 at 0.445, 0.71111 at 0.442
pred_id_strings = eval_verification(unknown_dev_images, known_images, model, similarity_metric, config['batch_size'], mode='val')
# verification test
pred_id_strings = eval_verification(unknown_test_images, known_images, model, similarity_metric, config['batch_size'], mode='test')
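
The threshold notes above were collected by editing the constant and re-running the evaluation. A rough sketch of how that sweep could be automated is shown below. It assumes you have the per-image maximum similarities and best-match identities available as plain lists, which `eval_verification` as written does not return; the function name `sweep_thresholds` and all toy values are hypothetical.

```python
def sweep_thresholds(max_sims, best_ids, true_ids, thresholds, no_match="n000000"):
    """Return a list of (threshold, accuracy) pairs, one per candidate threshold."""
    results = []
    for t in thresholds:
        # Keep the best match only when its similarity reaches the threshold
        preds = [pid if s >= t else no_match for s, pid in zip(max_sims, best_ids)]
        accuracy = sum(p == y for p, y in zip(preds, true_ids)) / len(true_ids)
        results.append((t, accuracy))
    return results

# Made-up values for illustration only
max_sims = [0.90, 0.40, 0.50, 0.30]
best_ids = ["n000001", "n000002", "n000003", "n000004"]
true_ids = ["n000001", "n000000", "n000003", "n000000"]

results = sweep_thresholds(max_sims, best_ids, true_ids, [0.2, 0.45, 0.6])
best_threshold, best_accuracy = max(results, key=lambda r: r[1])
print(results, best_threshold)
```

Since the features do not change with the threshold, a sweep like this only needs one forward pass over the dev set. The threshold should be chosen on the dev set and then applied unchanged to the test set.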

## Generate csv to submit to Kaggle

In [None]:
with open("verification_spherenet_submission_adjthreshold.csv", "w") as f:
    f.write("id,label\n")
    # One row per test image: running index and predicted identity string
    for i, pred_id in enumerate(pred_id_strings):
        f.write("{},{}\n".format(i, pred_id))

In [None]:
if SUBMIT_KAGGLE:
  !kaggle competitions submit -c 11-785-s23-hw2p2-verification -f verification_spherenet_submission_adjthreshold.csv -m "early submission"