<a href="https://colab.research.google.com/github/sainithinkatta/deep_learning_class/blob/main/HW9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Homework 9 - Network Compression

## **Intro**

HW9 is about network compression.

There are many approaches to network/model compression. Here we introduce two:
* Knowledge Distillation
* Architecture Design


This notebook proceeds as follows: <br/>
1. Introduce depthwise, pointwise, and group convolution as used in MobileNet.
2. Design the model for this Colab.
3. Introduce knowledge distillation.
4. Set up TeacherNet, which will help with training.


## **About the Dataset**

The dataset used here is food-11, a collection of food images in 11 classes.

To meet the homework requirements, the dataset is slightly different from the original version, so please DO NOT access the original fully-labeled training data or testing labels.


In [1]:
# Download food-11.zip from the shared files and put it in your own Google Drive.
# The following code allows Colab to access your Google Drive.
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
# Unzip the dataset.
# This may take some time.
# Put the pretrained ResNet_food11.pth in your OWN Google Drive and set PATH to its location.
PATH = './drive/My Drive/Colab Models/ResNet_food11.pth'
!unzip -q "/content/drive/MyDrive/Colab Models/food-11.zip"

## **Import Packages**

First, we need to import packages that will be used later.

In this homework, we rely heavily on **torchvision**, PyTorch's computer vision library.

In [21]:
# Import necessary packages.
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
import torchvision.models as models

from PIL import Image
# "ConcatDataset" and "Subset" are possibly useful when doing semi-supervised learning.
from torch.utils.data import ConcatDataset, DataLoader, Subset
from torchvision.datasets import DatasetFolder

# This is for the progress bar.
# from tqdm.auto import tqdm
from tqdm import tqdm

## **Dataset, Data Loader, and Transforms** (2 pts)

Torchvision provides lots of useful utilities for image preprocessing, data wrapping as well as data augmentation.

Here, since our data are stored in folders by class labels, we can directly apply **torchvision.datasets.DatasetFolder** for wrapping data without much effort. You can refer to [PyTorch official website](https://pytorch.org/vision/stable/transforms.html) for details about different transforms.


In [22]:
# It is important to do data augmentation in training.
# However, not every augmentation is useful.
# Please think about what kind of augmentation is helpful for food recognition.

train_tfm = transforms.Compose([
    # Resize the image into a fixed shape (height = width = 142)
    transforms.Resize((142, 142)),
    # (2 pts) TODO: Apply in order: RandomHorizontalFlip, RandomRotation(15), RandomCrop to size 128, and convert to Tensor
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.RandomCrop(128),
    transforms.ToTensor(),
])

# We don't need augmentations in testing and validation.
# All we need here is to resize the PIL image and transform it into Tensor.
test_tfm = transforms.Compose([
    # Resize the image into a fixed shape (height = width = 142)
    transforms.Resize((142, 142)),
    transforms.CenterCrop(128),
    transforms.ToTensor(),
])


In [23]:
# Batch size for training, validation, and testing.
# A greater batch size usually gives a more stable gradient.
# But the GPU memory is limited, so please adjust it carefully.
batch_size = 64

# Construct datasets.
# The argument "loader" tells how torchvision reads the data.
train_set = DatasetFolder("food-11/training/labeled", loader=lambda x: Image.open(x), extensions="jpg", transform=train_tfm)
valid_set = DatasetFolder("food-11/validation", loader=lambda x: Image.open(x), extensions="jpg", transform=test_tfm)
unlabeled_set = DatasetFolder("food-11/training/unlabeled", loader=lambda x: Image.open(x), extensions="jpg", transform=train_tfm)
test_set = DatasetFolder("food-11/testing", loader=lambda x: Image.open(x), extensions="jpg", transform=test_tfm)

# Construct data loaders.
train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True)
valid_loader = DataLoader(valid_set, batch_size=batch_size, shuffle=True, num_workers=0, pin_memory=True)
test_loader = DataLoader(test_set, batch_size=batch_size, shuffle=False)

# **Architecture / Model Design**
The following convolution layer designs use fewer parameters than a regular convolution.

## **Depthwise & Pointwise Convolution**
![](https://i.imgur.com/FBgcA0s.png)
> Blue: the connection between layers \
> Green: the expansion of **receptive field** \
> (reference: arxiv:1810.04231)

(a) Normal convolution layer: Every output channel is connected to every input channel. It differs from a fully connected layer only in the operation applied on each connection (multiplication --> convolution).

(b) Depthwise convolution layer (DW): Each feature map passes through its own filter. The result then passes through a pointwise convolution layer (PW), which combines information across the feature maps at each pixel.


(c) Group convolution layer (GC): The feature maps are split into groups. Each group passes through its own filters, and the outputs are concatenated. If the number of groups equals the number of input feature maps, GC becomes DW (channels are independent). If the number of groups is 1, GC becomes a normal, fully connected convolution.

<img src="https://i.imgur.com/Hqhg0Q9.png" width="500px">


## **Implementation details**
```python
# Regular Convolution, # of params = in_chs * out_chs * kernel_size^2
nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding)

# Group Convolution, "groups" controls the connections between inputs and
# outputs. in_chs and out_chs must both be divisible by groups.
nn.Conv2d(in_chs, out_chs, kernel_size, stride, padding, groups=groups)

# Depthwise Convolution, out_chs=in_chs=groups, # of params = in_chs * kernel_size^2
nn.Conv2d(in_chs, in_chs, kernel_size, stride, padding, groups=in_chs)

# Pointwise Convolution, a.k.a 1 by 1 convolution, # of params = in_chs * out_chs
nn.Conv2d(in_chs, out_chs, 1)

# Merge Depthwise and Pointwise Convolution (without )
def dwpw_conv(in_chs, out_chs, kernel_size, stride, padding):
    return nn.Sequential(
        nn.Conv2d(in_chs, in_chs, kernel_size, stride, padding, groups=in_chs),
        nn.Conv2d(in_chs, out_chs, 1),
    )
```
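
As a rough check on the parameter formulas above, the sketch below counts weights for each layer type in plain Python. The helper `conv2d_params` is a hypothetical name introduced here, not a PyTorch function. It assumes square kernels and includes the bias term that `nn.Conv2d` adds by default, so its totals are slightly larger than the bias-free formulas in the comments above. The channel sizes (64 in, 100 out) and `groups=4` are chosen only for illustration.

```python
def conv2d_params(in_chs, out_chs, kernel_size, groups=1, bias=True):
    """Number of learnable parameters in a 2-D convolution with square kernels."""
    if in_chs % groups or out_chs % groups:
        raise ValueError("in_chs and out_chs must both be divisible by groups")
    # Each output channel only sees in_chs / groups input channels.
    weights = (in_chs // groups) * out_chs * kernel_size ** 2
    return weights + (out_chs if bias else 0)

in_chs, out_chs, k = 64, 100, 3

regular = conv2d_params(in_chs, out_chs, k)
grouped = conv2d_params(in_chs, out_chs, k, groups=4)
depthwise = conv2d_params(in_chs, in_chs, k, groups=in_chs)
pointwise = conv2d_params(in_chs, out_chs, 1)

print("regular:              ", regular)
print("group (groups=4):     ", grouped)
print("depthwise + pointwise:", depthwise + pointwise)
```

Under these assumptions the depthwise + pointwise pair needs roughly an eighth of the parameters of the regular 3x3 layer, which is what makes a deeper model fit in the same budget.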

## **Model**

The basic model here is simply a stack of convolutional layers followed by some fully-connected layers. You can take advantage of depthwise & pointwise convolution to make your model deeper while still meeting the size constraint.

In [24]:
class StudentNet(nn.Module):
    def __init__(self):
      super(StudentNet, self).__init__()

      self.cnn = nn.Sequential(
        nn.Conv2d(3, 32, 3),
        nn.BatchNorm2d(32),
        nn.ReLU(),
        nn.Conv2d(32, 32, 3),
        nn.BatchNorm2d(32),
        nn.ReLU(),
        nn.MaxPool2d(2, 2, 0),

        nn.Conv2d(32, 64, 3),
        nn.BatchNorm2d(64),
        nn.ReLU(),
        nn.MaxPool2d(2, 2, 0),

        nn.Conv2d(64, 100, 3),
        nn.BatchNorm2d(100),
        nn.ReLU(),
        nn.MaxPool2d(2, 2, 0),

        # Here we adopt Global Average Pooling for various input size.
        nn.AdaptiveAvgPool2d((1, 1)),
      )

      # (2 pts) TODO: Apply fully connected layer with correct input and output channels
      self.fc = nn.Sequential(
        nn.Linear(100, 11)
      )

    def forward(self, x):
      out = self.cnn(x)
      out = out.view(out.size()[0], -1)
      return self.fc(out)


## **Model Analysis**

Use `torchsummary` to get your model architecture (a screenshot or pasted text is allowed) and the number of
parameters. Both must be submitted.
Note that the number of parameters **must not exceed 100,000**.
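
To see where the shapes and parameter counts in such a summary come from, the following plain-Python sketch traces the `StudentNet` defined above by hand. It assumes the layer settings shown in that class (3x3 convolutions with no padding, each followed by `BatchNorm2d`, and 2x2 max pooling with stride 2) and a 128x128 input. The `conv_out` helper and the `layers` list are illustrative names introduced here, not part of any library.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output height/width of a convolution or pooling layer (floor mode)."""
    return (size + 2 * padding - kernel) // stride + 1

# (kind, in_chs, out_chs) for each conv/pool step in StudentNet.cnn.
layers = [
    ("conv", 3, 32), ("conv", 32, 32), ("pool", 32, 32),
    ("conv", 32, 64), ("pool", 64, 64),
    ("conv", 64, 100), ("pool", 100, 100),
]

size, total, sizes = 128, 0, []
for kind, cin, cout in layers:
    if kind == "conv":
        size = conv_out(size, kernel=3)
        total += cin * cout * 3 * 3 + cout  # conv weights + bias
        total += 2 * cout                   # BatchNorm2d scale + shift
    else:
        size = conv_out(size, kernel=2, stride=2)
    sizes.append(size)

total += 100 * 11 + 11  # final nn.Linear(100, 11): weights + bias

print("spatial sizes:", sizes)
print("total parameters:", total)
print("within budget:", total <= 100_000)
```

Global average pooling adds no parameters, so the budget is dominated by the last 3x3 convolution; that layer is the natural candidate for a depthwise + pointwise replacement.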


In [25]:
from torchsummary import summary

student_net = StudentNet()
summary(student_net, (3, 128, 128), device="cpu")

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 32, 126, 126]             896
       BatchNorm2d-2         [-1, 32, 126, 126]              64
              ReLU-3         [-1, 32, 126, 126]               0
            Conv2d-4         [-1, 32, 124, 124]           9,248
       BatchNorm2d-5         [-1, 32, 124, 124]              64
              ReLU-6         [-1, 32, 124, 124]               0
         MaxPool2d-7           [-1, 32, 62, 62]               0
            Conv2d-8           [-1, 64, 60, 60]          18,496
       BatchNorm2d-9           [-1, 64, 60, 60]             128
             ReLU-10           [-1, 64, 60, 60]               0
        MaxPool2d-11           [-1, 64, 30, 30]               0
           Conv2d-12          [-1, 100, 28, 28]          57,700
      BatchNorm2d-13          [-1, 100, 28, 28]             200
             ReLU-14          [-1, 100,

## **Knowledge Distillation**

<img src="https://i.imgur.com/H2aF7Rv.png=100x" width="500px">

Since we have a trained big model, we let it teach a small model. In the implementation, the student is trained to match the big model's predictions rather than only the ground truth.

## **Why does it work?**
* If the data is noisy, the big model's predictions can smooth over samples that were labeled incorrectly.
* The labels may be related to one another. For example, the digit 8 looks more similar to 6, 9, and 0 than to 1 or 7.


## **How to implement?**
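* $Loss = \alpha T^2 \times KL(\text{softmax}(\frac{\text{Teacher's Logits}}{T}) \,||\, \text{softmax}(\frac{\text{Student's Logits}}{T})) + (1-\alpha)(\text{Original Loss})$
* Note that the logits are divided by the temperature $T$ and passed through softmax before the KL divergence is computed.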
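
To make the formula above concrete, here is a minimal plain-Python sketch for a single sample. The logits are made-up numbers for a 3-class toy problem, and `softmax`, `kl_div`, and `kd_loss` are helper names introduced only for this illustration. It assumes the "original loss" is cross-entropy on the hard label.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def kl_div(p, q):
    """KL(p || q) for two discrete distributions (q must be > 0 everywhere)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, T=4.0):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    hard = -math.log(softmax(student_logits)[label])
    soft = kl_div(softmax(teacher_logits, T), softmax(student_logits, T))
    return alpha * T * T * soft + (1 - alpha) * hard

teacher = [8.0, 2.0, 1.0]  # made-up teacher logits
student = [3.0, 1.0, 0.5]  # made-up student logits

sharp = softmax(teacher, T=1.0)
soft = softmax(teacher, T=4.0)
print("teacher probs, T=1:", [round(p, 3) for p in sharp])
print("teacher probs, T=4:", [round(p, 3) for p in soft])
print("KD loss:", round(kd_loss(student, teacher, label=0), 4))
```

Raising the temperature flattens the teacher's distribution, so the non-target classes carry more weight in the soft loss. The $T^2$ factor is commonly included to keep the soft term's gradient scale comparable across temperatures.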

In [26]:
def loss_fn_kd(outputs, labels, teacher_outputs, alpha=0.5):
    hard_loss = F.cross_entropy(outputs, labels) * (1. - alpha)
    # ---------- TODO ----------
    # Complete soft loss in knowledge distillation
    T = 4.0
    soft_loss = nn.KLDivLoss(reduction='batchmean')(F.log_softmax(outputs/T, dim=1),
                                                    F.softmax(teacher_outputs/T, dim=1)) * (T * T * alpha)
    return hard_loss + soft_loss

## **Teacher Model Setting**
We provide a well-trained teacher model to help you distill knowledge into the student model.
Note that if you want to change the transform function, you should consider whether it is suitable for this well-trained teacher model.
* If `gdown` fails, you can switch to another link. (A backup link is provided at the bottom of this Colab tutorial.)


In [27]:
import torchvision.models as models
# Load teacherNet: a ResNet-18 whose final layer is replaced to output 11 classes.
teacher_net = models.resnet18(weights=None)
teacher_net.fc = nn.Linear(in_features=512, out_features=11, bias=True)
# map_location lets the checkpoint load even when no GPU is available.
checkpoint = torch.load(PATH, map_location="cpu")
teacher_net.load_state_dict(checkpoint)
teacher_net.eval()

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

## **Generate Pseudo Labels in Unlabeled Data**

Since we have a well-trained model, we can use it to predict pseudo-labels for the unlabeled data and help the student network train well. Note that you
**CANNOT** use the well-trained model to pseudo-label the test data.


---

**AGAIN, DO NOT USE TEST DATA FOR ANY PURPOSE OTHER THAN INFERENCE**

* If you use the teacher network to predict pseudo-labels for the test data, the student network can simply overfit those pseudo-labels without using the training or unlabeled data. Your Kaggle accuracy will then be as high as the teacher network's, but you have only overfit the test data, and your true testing accuracy will be very low.
* This contradicts the purpose of this assignment (network compression); therefore, you should not misuse the test data.
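
The core of pseudo-labeling is small: run the teacher on each unlabeled image and keep the class with the largest logit as that image's label. The toy sketch below shows just that step in plain Python. The file names, placeholder labels, and logits are made up for illustration, and `argmax` is a local helper rather than a library call.

```python
def argmax(values):
    """Index of the largest value in a list."""
    return max(range(len(values)), key=lambda i: values[i])

# Toy stand-ins: (file path, placeholder label) pairs and teacher logits.
samples = [("img_0.jpg", 0), ("img_1.jpg", 0), ("img_2.jpg", 0)]
teacher_logits = [
    [0.1, 2.3, -1.0],
    [1.7, 0.2, 0.4],
    [-0.5, 0.0, 3.1],
]

# Replace each placeholder label with the teacher's most likely class.
pseudo = [(path, argmax(logits))
          for (path, _), logits in zip(samples, teacher_logits)]
print(pseudo)
```

The pairing only works if the logits are produced in the same order as the samples, which is why the unlabeled data must be read without shuffling.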


In [28]:
# "cuda" only when GPUs are available.
device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Initialize a model, and put it on the device specified.
student_net = student_net.to(device)
teacher_net = teacher_net.to(device)

# Whether to do pseudo label.
do_semi = True

def get_pseudo_labels(dataset, model):
    loader = DataLoader(dataset, batch_size=batch_size*3, shuffle=False, pin_memory=True)
    pseudo_labels = []
    for batch in tqdm(loader):
        # A batch consists of image data and corresponding (placeholder) labels.
        img, _ = batch

        # Forward the data without tracking gradients.
        with torch.no_grad():
            logits = model(img.to(device))
            # Take the most likely class for each image as its pseudo-label.
            pseudo_labels.append(logits.argmax(dim=-1).detach().cpu())
    pseudo_labels = torch.cat(pseudo_labels)
    # Update the labels by replacing them with pseudo-labels.
    # dataset.samples holds (file path, label) pairs in loader order (shuffle=False).
    for idx, ((path, _), pseudo_label) in enumerate(zip(dataset.samples, pseudo_labels)):
        dataset.samples[idx] = (path, pseudo_label.item())
    return dataset

if do_semi:
    # Generate new trainloader with unlabeled set.
    unlabeled_set = get_pseudo_labels(unlabeled_set, teacher_net)
    print(unlabeled_set)
    concat_dataset = ConcatDataset([train_set, unlabeled_set])
    train_loader = DataLoader(concat_dataset, batch_size=batch_size, shuffle=True, pin_memory=True, drop_last=True, num_workers=0)




cuda


100%|██████████| 36/36 [00:30<00:00,  1.20it/s]

Dataset DatasetFolder
    Number of datapoints: 6786
    Root location: food-11/training/unlabeled
    StandardTransform
Transform: Compose(
               Resize(size=(142, 142), interpolation=bilinear, max_size=None, antialias=True)
               RandomHorizontalFlip(p=0.5)
               RandomRotation(degrees=[-15.0, 15.0], interpolation=nearest, expand=False, fill=0)
               RandomCrop(size=(128, 128), padding=None)
               ToTensor()
           )





## **Training** (*14* pts)

You can finish supervised learning by simply running the provided code without any modification.

The function "get_pseudo_labels" is used for semi-supervised learning.
Using unlabeled data for semi-supervised learning is expected to give better performance.
However, you have to implement the function on your own and need to adjust several hyperparameters manually.
Again, please note that using external data (or a pre-trained model) for training is **prohibited**.

---
**You should use the knowledge distillation loss.**
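
The training loop below also clips gradients by their global norm for stability. As a rough illustration of what that clipping does, this plain-Python sketch follows the usual rule: compute the overall L2 norm of all gradients and, if it exceeds `max_norm`, rescale every gradient by the same factor. The function `clip_by_global_norm`, the flat list of floats, and the small `eps` constant are simplifications assumed here. In the notebook itself the work is done by `nn.utils.clip_grad_norm_`.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# A large gradient (norm 50) is scaled down to norm ~10.
clipped, norm = clip_by_global_norm([30.0, 40.0], max_norm=10.0)
print(norm, [round(g, 3) for g in clipped])

# A small gradient (norm 0.5) is left untouched.
small, small_norm = clip_by_global_norm([0.3, 0.4], max_norm=10.0)
print(small_norm, small)
```

Because every gradient is multiplied by the same factor, clipping changes the step size but not the update direction.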




In [29]:
# For the classification task, we use cross-entropy as the measurement of performance.
# (2 pts) TODO: Apply crossentropy loss
criterion = nn.CrossEntropyLoss()

# (3 pts) TODO: Set your own optimizer (Hint: Apply Adam from torch.optim to student_net.parameters with lr=0.0003, weight_decay=1e-5)
optimizer = torch.optim.Adam(student_net.parameters(), lr=0.0003, weight_decay=1e-5)

# The number of training epochs.
n_epochs = 80

for epoch in range(n_epochs):
    # ---------- Training ----------

    # (1pt) TODO: Set the model into training mode
    student_net.train()

    # These are used to record information in training.
    train_loss = []
    train_accs = []

    # Iterate the training set by batches.
    for batch in tqdm(train_loader):

        # A batch consists of image data and corresponding labels.
        imgs, labels = batch

        # Forward the data. (Make sure data and model are on the same device.)
        logits = student_net(imgs.to(device))
        # The teacher net will not be updated. We use torch.no_grad
        # to tell torch not to retain the intermediate values
        # (which are needed only for backpropagation), which saves memory.

        with torch.no_grad():
            # (4 pts) TODO: Pass imgs through teacher_net to create soft labels (remember to put imgs onto the GPU).
            soft_labels = teacher_net(imgs.to(device))

        # Calculate the loss in knowledge distillation method.
        loss = loss_fn_kd(logits, labels.to(device), soft_labels)

        # Gradients stored in the parameters in the previous step should be cleared out first.
        optimizer.zero_grad()

        # Compute the gradients for parameters.
        loss.backward()

        # (2 pts) TODO: Apply nn.utils.clip_grad_norm_ for stable training (max_norm=10)
        grad_norm = nn.utils.clip_grad_norm_(student_net.parameters(), max_norm=10)

        # Update the parameters with computed gradients.
        optimizer.step()

        # Compute the accuracy for current batch.
        acc = (logits.argmax(dim=-1) == labels.to(device)).float().mean()

        # Record the loss and accuracy.
        train_loss.append(loss.item())
        train_accs.append(acc)

    # The average loss and accuracy of the training set is the average of the recorded values.
    train_loss = sum(train_loss) / len(train_loss)
    train_acc = sum(train_accs) / len(train_accs)

    # Print the information.
    print(f"[ Train | {epoch + 1:03d}/{n_epochs:03d} ] loss = {train_loss:.5f}, acc = {train_acc:.5f}")


    # ---------- Validation ----------
    # (1 pt) TODO: Set the model into eval mode so that modules such as dropout and batch normalization behave correctly for evaluation.
    student_net.eval()

    # These are used to record information in validation.
    valid_loss = []
    valid_accs = []

    # Iterate the validation set by batches.
    for batch in tqdm(valid_loader):

        # A batch consists of image data and corresponding labels.
        imgs, labels = batch

        # We don't need gradients in validation.
        # (1 pt) TODO: Using torch.no_grad() accelerates the forward process.

        with torch.no_grad():
            logits = student_net(imgs.to(device))
            soft_labels = teacher_net(imgs.to(device))
        # We can still compute the loss (but not the gradient).
        loss = loss_fn_kd(logits, labels.to(device), soft_labels)

        # Compute the accuracy for current batch.
        acc = (logits.argmax(dim=-1) == labels.to(device)).float().detach().cpu().view(-1).numpy()

        # Record the loss and accuracy.
        valid_loss.append(loss.item())
        valid_accs += list(acc)

    # The average loss and accuracy for entire validation set is the average of the recorded values.
    valid_loss = sum(valid_loss) / len(valid_loss)
    valid_acc = sum(valid_accs) / len(valid_accs)

    # Print the information.
    print(f"[ Valid | {epoch + 1:03d}/{n_epochs:03d} ] loss = {valid_loss:.5f}, acc = {valid_acc:.5f}")

100%|██████████| 154/154 [00:52<00:00,  2.93it/s]


[ Train | 001/080 ] loss = 7.41463, acc = 0.30134


100%|██████████| 11/11 [00:03<00:00,  3.02it/s]


[ Valid | 001/080 ] loss = 6.71444, acc = 0.25455


[ Train | 002/080 ] loss = 6.76381, acc = 0.35948
[ Valid | 002/080 ] loss = 6.04152, acc = 0.33182
[ Train | 003/080 ] loss = 6.37591, acc = 0.40321
[ Valid | 003/080 ] loss = 5.70211, acc = 0.34697
[ Train | 004/080 ] loss = 6.01048, acc = 0.43222
[ Valid | 004/080 ] loss = 5.85445, acc = 0.32576
[ Train | 005/080 ] loss = 5.86487, acc = 0.45120
[ Valid | 005/080 ] loss = 5.52807, acc = 0.36970
[ Train | 006/080 ] loss = 5.62457, acc = 0.46845
[ Valid | 006/080 ] loss = 5.68649, acc = 0.35000
[ Train | 007/080 ] loss = 5.49419, acc = 0.47859
[ Valid | 007/080 ] loss = 5.65971, acc = 0.33788
[ Train | 008/080 ] loss = 5.32758, acc = 0.49280
[ Valid | 008/080 ] loss = 5.30305, acc = 0.40303
[ Train | 009/080 ] loss = 5.21921, acc = 0.50893
[ Valid | 009/080 ] loss = 5.02642, acc = 0.45455
[ Train | 010/080 ] loss = 5.09586, acc = 0.51329
[ Valid | 010/080 ] loss = 4.99694, acc = 0.41515
[ Train | 011/080 ] loss = 5.00188, acc = 0.52050
[ Valid | 011/080 ] loss = 4.65534, acc = 0.46515
[ Train | 012/080 ] loss = 4.89958, acc = 0.53216
[ Valid | 012/080 ] loss = 4.68230, acc = 0.44242
[ Train | 013/080 ] loss = 4.84851, acc = 0.54332
[ Valid | 013/080 ] loss = 5.43620, acc = 0.38788
[ Train | 014/080 ] loss = 4.75869, acc = 0.55317
[ Valid | 014/080 ] loss = 4.89402, acc = 0.41212
[ Train | 015/080 ] loss = 4.72640, acc = 0.54860
[ Valid | 015/080 ] loss = 4.93661, acc = 0.45758
[ Train | 016/080 ] loss = 4.59632, acc = 0.55905
[ Valid | 016/080 ] loss = 4.92618, acc = 0.42273
[ Train | 017/080 ] loss = 4.53098, acc = 0.56575
[ Valid | 017/080 ] loss = 4.60704, acc = 0.45152
[ Train | 018/080 ] loss = 4.48290, acc = 0.57254
[ Valid | 018/080 ] loss = 4.83703, acc = 0.45303
[ Train | 019/080 ] loss = 4.47016, acc = 0.57194
[ Valid | 019/080 ] loss = 4.64301, acc = 0.45000
[ Train | 020/080 ] loss = 4.38841, acc = 0.58228
[ Valid | 020/080 ] loss = 4.52811, acc = 0.47576
[ Train | 021/080 ] loss = 4.33964, acc = 0.58665
[ Valid | 021/080 ] loss = 4.67984, acc = 0.46667
[ Train | 022/080 ] loss = 4.26156, acc = 0.59426
[ Valid | 022/080 ] loss = 4.71818, acc = 0.46515
[ Train | 023/080 ] loss = 4.24639, acc = 0.59517
[ Valid | 023/080 ] loss = 4.57361, acc = 0.47273
[ Train | 024/080 ] loss = 4.19953, acc = 0.60095
[ Valid | 024/080 ] loss = 4.68130, acc = 0.45606
[ Train | 025/080 ] loss = 4.12922, acc = 0.60501
[ Valid | 025/080 ] loss = 4.38845, acc = 0.53788
[ Train | 026/080 ] loss = 4.10842, acc = 0.61009
[ Valid | 026/080 ] loss = 4.21463, acc = 0.54242
[ Train | 027/080 ] loss = 4.10403, acc = 0.60988
[ Valid | 027/080 ] loss = 4.21883, acc = 0.51667
[ Train | 028/080 ] loss = 4.03998, acc = 0.61577
[ Valid | 028/080 ] loss = 4.54443, acc = 0.51515
[ Train | 029/080 ] loss = 4.05609, acc = 0.61536
[ Valid | 029/080 ] loss = 4.44019, acc = 0.49242
[ Train | 030/080 ] loss = 4.01161, acc = 0.61171
[ Valid | 030/080 ] loss = 4.44148, acc = 0.51212
[ Train | 031/080 ] loss = 3.94249, acc = 0.62409
[ Valid | 031/080 ] loss = 4.36283, acc = 0.51212
[ Train | 032/080 ] loss = 3.90475, acc = 0.62419
[ Valid | 032/080 ] loss = 4.16009, acc = 0.53485
[ Train | 033/080 ] loss = 3.87373, acc = 0.62723
[ Valid | 033/080 ] loss = 3.98779, acc = 0.54091
[ Train | 034/080 ] loss = 3.86699, acc = 0.62804
[ Valid | 034/080 ] loss = 4.93692, acc = 0.44848
[ Train | 035/080 ] loss = 3.83592, acc = 0.63362
[ Valid | 035/080 ] loss = 3.95340, acc = 0.54545
[ Train | 036/080 ] loss = 3.82225, acc = 0.63606
[ Valid | 036/080 ] loss = 4.65736, acc = 0.49091
[ Train | 037/080 ] loss = 3.80952, acc = 0.63393
[ Valid | 037/080 ] loss = 4.68424, acc = 0.47727
[ Train | 038/080 ] loss = 3.76114, acc = 0.63890
[ Valid | 038/080 ] loss = 4.23781, acc = 0.55152
[ Train | 039/080 ] loss = 3.75255, acc = 0.64225
[ Valid | 039/080 ] loss = 3.95078, acc = 0.56818
[ Train | 040/080 ] loss = 3.69294, acc = 0.64478
[ Valid | 040/080 ] loss = 3.84932, acc = 0.55909
[ Train | 041/080 ] loss = 3.69247, acc = 0.64813
[ Valid | 041/080 ] loss = 4.42291, acc = 0.55606
[ Train | 042/080 ] loss = 3.65635, acc = 0.65037
[ Valid | 042/080 ] loss = 4.22497, acc = 0.55152
[ Train | 043/080 ] loss = 3.65271, acc = 0.65392
[ Valid | 043/080 ] loss = 4.27832, acc = 0.51212
[ Train | 044/080 ] loss = 3.63600, acc = 0.65331
[ Valid | 044/080 ] loss = 4.33505, acc = 0.52273
[ Train | 045/080 ] loss = 3.56811, acc = 0.65280
[ Valid | 045/080 ] loss = 3.59802, acc = 0.60758
[ Train | 046/080 ] loss = 3.57536, acc = 0.65919
[ Valid | 046/080 ] loss = 4.34194, acc = 0.53182
[ Train | 047/080 ] loss = 3.55989, acc = 0.66203
[ Valid | 047/080 ] loss = 4.21064, acc = 0.50606
[ Train | 048/080 ] loss = 3.51518, acc = 0.66477
[ Valid | 048/080 ] loss = 4.43757, acc = 0.51818
[ Train | 049/080 ] loss = 3.54701, acc = 0.66315
[ Valid | 049/080 ] loss = 4.48246, acc = 0.52879
[ Train | 050/080 ] loss = 3.52267, acc = 0.66213
[ Valid | 050/080 ] loss = 3.99385, acc = 0.56061
[ Train | 051/080 ] loss = 3.48366, acc = 0.66345
[ Valid | 051/080 ] loss = 3.66783, acc = 0.60909
[ Train | 052/080 ] loss = 3.47908, acc = 0.66579
[ Valid | 052/080 ] loss = 4.32373, acc = 0.54848
[ Train | 053/080 ] loss = 3.44656, acc = 0.66751
[ Valid | 053/080 ] loss = 3.64626, acc = 0.58485
[ Train | 054/080 ] loss = 3.42974, acc = 0.67198
[ Valid | 054/080 ] loss = 3.64226, acc = 0.57727
[ Train | 055/080 ] loss = 3.41000, acc = 0.67188
[ Valid | 055/080 ] loss = 3.89554, acc = 0.59242
[ Train | 056/080 ] loss = 3.41332, acc = 0.67137
[ Valid | 056/080 ] loss = 4.93960, acc = 0.48182
[ Train | 057/080 ] loss = 3.37267, acc = 0.68222
[ Valid | 057/080 ] loss = 3.84376, acc = 0.57576
[ Train | 058/080 ] loss = 3.36370, acc = 0.67938
[ Valid | 058/080 ] loss = 4.53459, acc = 0.52727
[ Train | 059/080 ] loss = 3.36242, acc = 0.68446
[ Valid | 059/080 ] loss = 4.03638, acc = 0.53333
[ Train | 060/080 ] loss = 3.33037, acc = 0.68111
[ Valid | 060/080 ] loss = 4.14644, acc = 0.53485
[ Train | 061/080 ] loss = 3.30112, acc = 0.68821
[ Valid | 061/080 ] loss = 3.59312, acc = 0.57727
[ Train | 062/080 ] loss = 3.29853, acc = 0.68872
[ Valid | 062/080 ] loss = 4.09194, acc = 0.55909
[ Train | 063/080 ] loss = 3.27628, acc = 0.68912
[ Valid | 063/080 ] loss = 4.67656, acc = 0.49848
[ Train | 064/080 ] loss = 3.33739, acc = 0.68537
[ Valid | 064/080 ] loss = 3.61293, acc = 0.61970
[ Train | 065/080 ] loss = 3.27206, acc = 0.69004
[ Valid | 065/080 ] loss = 3.50470, acc = 0.62273
[ Train | 066/080 ] loss = 3.27764, acc = 0.68791
[ Valid | 066/080 ] loss = 3.68262, acc = 0.58636
[ Train | 067/080 ] loss = 3.26590, acc = 0.69227
[ Valid | 067/080 ] loss = 4.31151, acc = 0.51818
[ Train | 068/080 ] loss = 3.23175, acc = 0.69338
[ Valid | 068/080 ] loss = 3.74185, acc = 0.61970
[ Train | 069/080 ] loss = 3.23504, acc = 0.69257
[ Valid | 069/080 ] loss = 3.85803, acc = 0.56515
[ Train | 070/080 ] loss = 3.21595, acc = 0.69278
[ Valid | 070/080 ] loss = 3.53939, acc = 0.62727
[ Train | 071/080 ] loss = 3.19862, acc = 0.69582
[ Valid | 071/080 ] loss = 3.91582, acc = 0.57424
[ Train | 072/080 ] loss = 3.18638, acc = 0.69694


[ Valid | 072/080 ] loss = 4.37977, acc = 0.52424


100%|██████████| 154/154 [00:51<00:00,  3.01it/s]


[ Train | 073/080 ] loss = 3.16629, acc = 0.69988


100%|██████████| 11/11 [00:03<00:00,  3.09it/s]


[ Valid | 073/080 ] loss = 3.55357, acc = 0.61212


100%|██████████| 154/154 [00:51<00:00,  2.98it/s]


[ Train | 074/080 ] loss = 3.15077, acc = 0.69795


100%|██████████| 11/11 [00:03<00:00,  2.92it/s]


[ Valid | 074/080 ] loss = 3.44913, acc = 0.60152


100%|██████████| 154/154 [00:51<00:00,  3.01it/s]


[ Train | 075/080 ] loss = 3.14884, acc = 0.69947


100%|██████████| 11/11 [00:03<00:00,  3.05it/s]


[ Valid | 075/080 ] loss = 3.98313, acc = 0.56970


100%|██████████| 154/154 [00:51<00:00,  2.98it/s]


[ Train | 076/080 ] loss = 3.16501, acc = 0.69917


100%|██████████| 11/11 [00:03<00:00,  3.11it/s]


[ Valid | 076/080 ] loss = 4.73738, acc = 0.48939


100%|██████████| 154/154 [00:51<00:00,  2.99it/s]


[ Train | 077/080 ] loss = 3.14115, acc = 0.70302


100%|██████████| 11/11 [00:03<00:00,  3.01it/s]


[ Valid | 077/080 ] loss = 3.75064, acc = 0.62424


100%|██████████| 154/154 [00:51<00:00,  3.01it/s]


[ Train | 078/080 ] loss = 3.11770, acc = 0.70495


100%|██████████| 11/11 [00:03<00:00,  3.08it/s]


[ Valid | 078/080 ] loss = 3.64149, acc = 0.60000


100%|██████████| 154/154 [00:51<00:00,  3.00it/s]


[ Train | 079/080 ] loss = 3.11563, acc = 0.70333


100%|██████████| 11/11 [00:03<00:00,  3.03it/s]


[ Valid | 079/080 ] loss = 3.36437, acc = 0.62879


100%|██████████| 154/154 [00:51<00:00,  2.99it/s]


[ Train | 080/080 ] loss = 3.10034, acc = 0.70079


100%|██████████| 11/11 [00:03<00:00,  3.07it/s]

[ Valid | 080/080 ] loss = 3.91361, acc = 0.54848





## **Testing**

For inference, the model must be in eval mode and the dataset must not be shuffled ("shuffle=False" in test_loader), because each prediction is later written out with its position in the dataset as the image id.
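To see why eval mode matters, here is a minimal plain-Python sketch of inverted dropout. It is a toy stand-in for illustration only, not PyTorch's implementation; the function name `toy_dropout` and the input values are made up. In training mode some values are zeroed and the rest are rescaled, while in eval mode the input passes through unchanged, so predictions are deterministic.

```python
import random


def toy_dropout(values, p=0.5, training=True, rng=None):
    """Toy inverted dropout on a list of floats (illustration only)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Eval mode: dropout is the identity function.
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    # Training mode: drop each value with probability p, rescale the survivors.
    return [v * keep_scale if rng.random() >= p else 0.0 for v in values]


features = [1.0, 2.0, 3.0, 4.0]  # made-up activations
train_out = toy_dropout(features, p=0.5, training=True, rng=random.Random(0))
eval_out = toy_dropout(features, p=0.5, training=False)

print("train:", train_out)
print("eval: ", eval_out)
```

BatchNorm changes in a similar spirit: in eval mode it uses the running statistics collected during training instead of the statistics of the current batch.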


In [30]:
# Make sure the model is in eval mode.
# Some modules, such as Dropout and BatchNorm, behave differently in training mode.
student_net.eval()

# Initialize a list to store the predictions.
predictions = []

# Iterate over the test set in batches.
for batch in tqdm(test_loader):
    # A batch consists of image data and corresponding labels.
    # Here the variable "labels" is unused because the test set has no ground truth.
    # If you print the labels, you will find that they are always 0.
    # The wrapper (DatasetFolder) returns both images and labels for each batch,
    # so placeholder labels are needed to make it work normally.
    imgs, labels = batch

    # We don't need gradients in testing, and we don't have labels to compute a loss anyway.
    # torch.no_grad() disables gradient tracking, which speeds up the forward pass and saves memory.
    with torch.no_grad():
        logits = student_net(imgs.to(device))

    # Take the class with the greatest logit as the prediction and record it.
    predictions.extend(logits.argmax(dim=-1).cpu().numpy().tolist())

100%|██████████| 53/53 [00:15<00:00,  3.49it/s]
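
The cell above takes the argmax of the raw logits without applying softmax first. That is sufficient because softmax is monotonic, so the largest logit always maps to the largest probability. A small plain-Python sketch with made-up logits (the helpers `softmax` and `argmax` are written here for illustration and are not part of the notebook):

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest element."""
    return max(range(len(values)), key=lambda i: values[i])


example_logits = [0.3, -1.2, 2.5, 0.9]  # made-up values for one image
probs = softmax(example_logits)

predicted_from_logits = argmax(example_logits)
predicted_from_probs = argmax(probs)

print(predicted_from_logits, predicted_from_probs)
```

Softmax is only needed when the probabilities themselves are used, for example when computing a loss or soft targets.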


In [31]:
# Save predictions into the file.
with open("predict.csv", "w") as f:

    # The first row must be "Id,Category" (no space after the comma).
    f.write("Id,Category\n")

    # For the rest of the rows, each image id corresponds to a predicted class.
    for i, pred in enumerate(predictions):
        f.write(f"{i},{pred}\n")
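
Before submitting, it can help to read the file back and check its shape. The sketch below is one possible sanity check, assuming the same `Id,Category` layout as above; the helper names, the temporary path, and the example predictions are made up for illustration. Because each id is simply the position in the list, the check only makes sense if the test loader was not shuffled.

```python
import csv
import os
import tempfile


def write_predictions(path, predictions):
    """Write one 'Id,Category' row per prediction, with the list index as the id."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, lineterminator="\n")
        writer.writerow(["Id", "Category"])
        for i, pred in enumerate(predictions):
            writer.writerow([i, pred])


def read_predictions(path):
    """Read the file back as a list of (id, category) integer pairs."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        return [(int(row["Id"]), int(row["Category"])) for row in reader]


example_predictions = [3, 0, 10, 7]  # made-up class indices

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "predict.csv")
    write_predictions(csv_path, example_predictions)
    rows = read_predictions(csv_path)
    with open(csv_path, newline="") as f:
        header = f.readline().strip()

print(header)
print(rows)
```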

## **Statistics**

|Baseline|Accuracy|Training Time|
|-|-|-|
|Simple Baseline |0.59856|2 Hours|
|Medium Baseline |0.65412|2 Hours|
|Strong Baseline |0.72819|4 Hours|
|Boss Baseline |0.81003|Unmeasurable|

## **Learning Curve**

![img](https://lh5.googleusercontent.com/amMLGa7dkqvXGmsJlrVN49VfSjClk5d-n7nCi_Y3ROK4himsBSHhB7SpdWe7Zm06ctRO77VdDkD9u_aKfAh1tMW-KcyYX7vF7LPlKqOo2fVtt3SyfsLv0KTYDB0YbAk6ZhyOIKT8Zfg)