# Homework 13 - Network Compression

Author: Liang-Hsuan Tseng (b07502072@ntu.edu.tw), modified from ML2021-HW13

If you have any questions, feel free to ask: ntu-ml-2022spring-ta@googlegroups.com

[**Link to HW13 Slides**](https://docs.google.com/presentation/d/1nCT9XrInF21B4qQAWuODy5sonKDnpGhjtcAwqa75mVU/edit#slide=id.p)

## Outline

* [Packages](#Packages) - install some required packages.
* [Dataset](#Dataset) - something you need to know about the dataset.
* [Configs](#Configs) - the configs of the experiments, you can change some hyperparameters here.
* [Architecture_Design](#Architecture_Design) - depthwise and pointwise convolution examples and some useful links.
* [Knowledge_Distillation](#Knowledge_Distillation) - KL divergence loss for knowledge distillation and some useful links.
* [Training](#Training) - training loop implementation modified from HW3.
* [Inference](#Inference) - create submission.csv by using the student_best.ckpt from the previous experiment.



In [None]:
from google.colab import drive
import os
# Mount Google Drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [None]:
!pip install torchsummary



### Packages
First, we need to import some useful packages. If the torchsummary package is not installed, please install it via `pip install torchsummary`.

In [None]:
# Import some useful packages for this homework
import numpy as np
import pandas as pd
import torch
import os
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Subset, Dataset # "ConcatDataset" and "Subset" are possibly useful
from torchvision.datasets import DatasetFolder, VisionDataset
from torchsummary import summary
from tqdm.auto import tqdm
import random

# !nvidia-smi # list your current GPU

### Configs
In this part, you can specify some variables and hyperparameters as your configs.

In [None]:
cfg = {
    'dataset_root': '/content/drive/MyDrive/food11-hw13',
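    'save_dir': '/content/drive/MyDrive/outputs',           # output directory
    'exp_name': "strong_baseline",     # experiment name
    'batch_size': 64,                 # batch size (original note: 128)
    'lr': 1.4e-3,                       # learning rate (original note: 3e-3)
    'seed': 20220013,                 # random seed
    'loss_fn_type': 'KD',             # loss function type: KD stands for knowledge distillation
    'weight_decay': 5e-5,             # weight decay (L2 regularization)
    'grad_norm_max': 10,              # maximum norm for gradient clipping
    'n_epochs': 150,                  # number of training epochs
    'patience': 20,                   # patience for early stopping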
}

In [None]:
myseed = cfg['seed']  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(myseed)
torch.manual_seed(myseed)
random.seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)

save_path = os.path.join(cfg['save_dir'], cfg['exp_name']) # create saving directory
os.makedirs(save_path, exist_ok=True)

# define simple logging functionality
log_fw = open(f"{save_path}/log.txt", 'w') # open log file to save log outputs
def log(text):     # define a logging function to trace the training process
    print(text)
    log_fw.write(str(text)+'\n')
    log_fw.flush()

log(cfg)  # log your configs to the log file

{'dataset_root': '/content/drive/MyDrive/food11-hw13', 'save_dir': './outputs', 'exp_name': 'strong_baseline', 'batch_size': 64, 'lr': 0.0014, 'seed': 20220013, 'loss_fn_type': 'KD', 'weight_decay': 5e-05, 'grad_norm_max': 10, 'n_epochs': 150, 'patience': 20}


### Dataset
We use the Food11 dataset for this homework, which is similar to the one used in Homework 3. However, please DO NOT use the HW3 dataset. We have modified the dataset, so you should only access it by loading it in this Kaggle notebook or through the links provided in the HW13 Colab notebooks.

In [None]:
# fetch and download the dataset from github (about 1.12G)
# !wget https://github.com/virginiakm1988/ML2022-Spring/raw/main/HW13/food11-hw13.tar.gz
## backup links:

#!wget https://github.com/andybi7676/ml2022spring-hw13/raw/main/food11-hw13.tar.gz -O food11-hw13.tar.gz
# !gdown '1ijKoNmpike_yjUw8SWRVVWVoMOXXqycj' --output food11-hw13.tar.gz

In [None]:
# extract the data
#!tar -xzf ./food11-hw13.tar.gz # Could take some time
# !tar -xzvf ./food11-hw13.tar.gz # use this command if you want to checkout the whole process.

In [None]:
for dirname, _, filenames in os.walk('/content/drive/MyDrive/food11-hw13'):
    if len(filenames) > 0:
        print(f"{dirname}: {len(filenames)} files.") # Show the number of files in each split.

/content/drive/MyDrive/food11-hw13: 1 files.
/content/drive/MyDrive/food11-hw13/training: 9866 files.
/content/drive/MyDrive/food11-hw13/validation: 3431 files.
/content/drive/MyDrive/food11-hw13/evaluation: 3347 files.


Next, specify train/test transform for image data augmentation.
Torchvision provides lots of useful utilities for image preprocessing, data wrapping as well as data augmentation.

Please refer to [PyTorch official website](https://pytorch.org/vision/stable/transforms.html) for details about different transforms. You can also apply the knowledge or experience you learned in HW3.

In [None]:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# define training/testing transforms
test_tfm = transforms.Compose([
    # It is not encouraged to modify this part if you are using the provided teacher model. This transform is standard and good enough for testing.
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

train_tfm = transforms.Compose([
    transforms.RandomResizedCrop((224, 224), scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.1, hue=0.05),
    transforms.RandomPerspective(distortion_scale=0.3, p=0.4),
    transforms.RandomAffine(
        degrees=(-20, 20),
        translate=(0, 0.2),
        scale=(0.9, 1.2)
    ),
    transforms.ToTensor(),
    normalize,
])

In [None]:
class FoodDataset(Dataset):
    def __init__(self, path, tfm=test_tfm, files = None):
        super().__init__()
        self.path = path
        self.files = sorted([os.path.join(path,x) for x in os.listdir(path) if x.endswith(".jpg")])
        if files is not None:
            self.files = files
        print(f"One {path} sample",self.files[0])
        self.transform = tfm

    def __len__(self):
        return len(self.files)

    def __getitem__(self,idx):
        fname = self.files[idx]
        im = Image.open(fname)
        im = self.transform(im)
        try:
            label = int(fname.split("/")[-1].split("_")[0])
        except ValueError:
            label = -1 # test has no label, so the filename prefix is not an integer
        return im,label

In [None]:
# Form train/valid dataloaders
train_set = FoodDataset(os.path.join(cfg['dataset_root'],"training"), tfm=train_tfm)
train_loader = DataLoader(train_set, batch_size=cfg['batch_size'], shuffle=True, num_workers=0, pin_memory=True)

valid_set = FoodDataset(os.path.join(cfg['dataset_root'], "validation"), tfm=test_tfm)
valid_loader = DataLoader(valid_set, batch_size=cfg['batch_size'], shuffle=False, num_workers=0, pin_memory=True)

One /content/drive/MyDrive/food11-hw13/training sample /content/drive/MyDrive/food11-hw13/training/0_0.jpg
One /content/drive/MyDrive/food11-hw13/validation sample /content/drive/MyDrive/food11-hw13/validation/0_0.jpg


### Architecture_Design

In this homework, you have to design a smaller network and make it perform well. A well-designed architecture is crucial for this task. Here, we introduce depthwise and pointwise convolutions, two variants of convolution that are commonly used in architecture design for network compression.

<img src="https://i.imgur.com/LFDKHOp.png" width=400px>

* explanation of depthwise and pointwise convolutions:
    * [prof. Hung-yi Lee's slides(p.24~p.30, especially p.28)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/tiny_v7.pdf)

In [None]:
# Example implementation of Depthwise and Pointwise Convolution
# def dwpw_conv(in_channels, out_channels, kernel_size, stride=1, padding=0):
#     return nn.Sequential(
#         nn.Conv2d(in_channels, in_channels, kernel_size, stride=stride, padding=padding, groups=in_channels), #depthwise convolution
#         nn.Conv2d(in_channels, out_channels, 1), # pointwise convolution
#     )
# Redefine the depthwise separable convolution block (with batch normalization and activation)
def dwpw_conv(ic, oc, kernel_size=3, stride=2, padding=1):
    return nn.Sequential(
        # depthwise convolution
        nn.Conv2d(ic, ic, kernel_size, stride=stride, padding=padding, groups=ic),
        nn.BatchNorm2d(ic),                    # batch normalization
        nn.LeakyReLU(0.01, inplace=True),      # LeakyReLU activation
        # pointwise convolution
        nn.Conv2d(ic, oc, 1),
        nn.BatchNorm2d(oc),                    # batch normalization
        nn.LeakyReLU(0.01, inplace=True)       # LeakyReLU activation
    )
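
To make the saving concrete, the following sketch counts the weights of a regular convolution against a depthwise plus pointwise pair. The helper names `conv_params` and `dwpw_params` are hypothetical and not part of the homework code; bias terms are included and BatchNorm parameters are ignored.

```python
def conv_params(in_c, out_c, k, groups=1, bias=True):
    """Number of learnable parameters in a 2D convolution with a k x k kernel."""
    weights = (in_c // groups) * out_c * k * k
    return weights + (out_c if bias else 0)

def dwpw_params(in_c, out_c, k):
    """Depthwise convolution (groups=in_c) followed by a 1x1 pointwise convolution."""
    depthwise = conv_params(in_c, in_c, k, groups=in_c)
    pointwise = conv_params(in_c, out_c, 1)
    return depthwise + pointwise

regular = conv_params(64, 128, 3)      # 64*128*9 + 128
separable = dwpw_params(64, 128, 3)    # (64*9 + 64) + (64*128 + 128)
print(regular, separable)
```

For a 64 → 128 layer with a 3×3 kernel this gives 73,856 parameters for the regular convolution versus 8,960 for the depthwise plus pointwise pair, which is why the separable form leaves room for a deeper or wider network under a fixed parameter budget.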


* other useful techniques
    * [group convolution](https://www.researchgate.net/figure/The-transformations-within-a-layer-in-DenseNets-left-and-CondenseNets-at-training-time_fig2_321325862) (Actually, depthwise convolution is a specific type of group convolution)
    * [SqueezeNet](https://arxiv.org/abs/1602.07360)
    * [MobileNet](https://arxiv.org/abs/1704.04861)
    * [ShuffleNet](https://arxiv.org/abs/1707.01083)
    * [Xception](https://arxiv.org/abs/1610.02357)
    * [GhostNet](https://arxiv.org/abs/1911.11907)
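
As a small illustration of the remark that depthwise convolution is a special case of group convolution, the sketch below varies the `groups` argument of `nn.Conv2d` for a 64-channel layer. The helper `n_params` is a hypothetical name introduced only for this example.

```python
import torch.nn as nn

def n_params(module):
    """Count every parameter tensor element in a module."""
    return sum(p.numel() for p in module.parameters())

for groups in (1, 4, 64):
    # groups=1 is a regular convolution; groups=64 (= in_channels) is depthwise
    conv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=groups)
    print(f"groups={groups:2d}: {n_params(conv)} parameters")
```

With `groups=1` the layer has 36,928 parameters, and with `groups=64` it has 640, because each output channel then only looks at a single input channel.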


After introducing depthwise and pointwise convolutions, let's define the **student network architecture**. Here, we have a very simple network formed by some regular convolution layers and pooling layers. You can replace the regular convolution layers with the depthwise and pointwise convolutions. In this way, you can further increase the depth or the width of your network architecture.

In [None]:
# Define the student network (a lightweight model)
class StudentNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Initial convolution layer (similar to the stem of ResNet)
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # Layers built from depthwise separable convolutions
        self.layer1 = dwpw_conv(64, 64, stride=1)
        self.layer2 = dwpw_conv(64, 128)
        self.layer3 = dwpw_conv(128, 256)
        self.layer4 = dwpw_conv(256, 140)

        # Global average pooling and classification layer
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.dropout = nn.Dropout(0.5)
        self.fc = nn.Linear(140, 11)                 # 11 food classes

    def forward(self, x):
        # Forward pass
        out = self.relu(self.bn1(self.conv1(x)))    # initial conv + BN + ReLU
        out = self.maxpool(out)                     # max pooling
        out = self.layer1(out)                      # depthwise separable conv layer 1
        out = self.layer2(out)                      # depthwise separable conv layer 2
        out = self.layer3(out)                      # depthwise separable conv layer 3
        out = self.layer4(out)                      # depthwise separable conv layer 4
        out = self.avgpool(out)                     # global average pooling
        out = out.flatten(1)                        # flatten
        out = self.dropout(out)                     # reduce overfitting
        out = self.fc(out)                          # fully connected layer
        return out

# Function that creates the student model
def get_student_model():
    return StudentNet()

After specifying the student network architecture, use the `torchsummary` package to inspect the network and verify the total number of parameters. Note that the total parameter count of your student network must not exceed the limit (`Total params` in `torchsummary` ≤ 100,000).
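
As a quick sanity check while iterating on an architecture, the parameter count can also be computed directly from a model. The sketch below uses a hypothetical stand-in model (`toy_model`) and a hypothetical helper (`count_params`); for standard layers the number should agree with `Total params` in `torchsummary`, but the `torchsummary` output remains the one to submit.

```python
import torch.nn as nn

PARAM_LIMIT = 100_000  # limit stated in the homework text

def count_params(model):
    """Total number of parameters, whether trainable or frozen."""
    return sum(p.numel() for p in model.parameters())

# Hypothetical stand-in model; replace it with your own student network.
toy_model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),  # 3*8*9 weights + 8 biases = 224
    nn.BatchNorm2d(8),                          # 8 scales + 8 shifts = 16
)

total = count_params(toy_model)
print(f"Total params: {total}")
assert total <= PARAM_LIMIT, "model exceeds the parameter limit"
```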

In [None]:
# DO NOT modify this block and please make sure that this block can run successfully.
student_model = get_student_model()
summary(student_model, (3, 224, 224), device='cpu')
# You have to copy&paste the results of this block to HW13 GradeScope.

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]             640
       BatchNorm2d-6           [-1, 64, 56, 56]             128
         LeakyReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]           4,160
       BatchNorm2d-9           [-1, 64, 56, 56]             128
        LeakyReLU-10           [-1, 64, 56, 56]               0
           Conv2d-11           [-1, 64, 28, 28]             640
      BatchNorm2d-12           [-1, 64, 28, 28]             128
        LeakyReLU-13           [-1, 64, 28, 28]               0
           Conv2d-14          [-1, 128,

In [None]:
# Load provided teacher model (model architecture: resnet18, num_classes=11, test-acc ~= 89.9%)
teacher_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False, num_classes=11)
# load state dict
teacher_ckpt_path = os.path.join(cfg['dataset_root'], "resnet18_teacher.ckpt")
teacher_model.load_state_dict(torch.load(teacher_ckpt_path, map_location='cpu'))
# Show a summary of the teacher model's architecture
summary(teacher_model, (3, 224, 224), device='cpu')
# Now you already know the teacher model's architecture. You can take advantage of it if you want to pass the strong or boss baseline.
# Source code of resnet in pytorch: (https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py)
# You can also see the summary of the teacher model. There are 11,182,155 parameters in total in the teacher model.
# summary(teacher_model, (3, 224, 224), device='cpu')

Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]          36,864
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
       BasicBlock-11           [-1, 64, 56, 56]               0
           Conv2d-12           [-1, 64, 56, 56]          36,864
      BatchNorm2d-13           [-1, 64, 56, 56]             128
             ReLU-14           [-1, 64,

### Knowledge_Distillation

<img src="https://i.imgur.com/H2aF7Rv.png=100x" width="400px">

Since we have a trained large model, we let it teach a smaller model. In practice, the training target becomes the prediction of the large model instead of the ground truth.

**Why does it work?**
* If the data is noisy, the large model's predictions can smooth over samples that are mislabeled.
* Classes may be related to each other, so soft labels from the teacher model carry useful information. For example, the digit 8 looks more similar to 6, 9, and 0 than to 1 or 7.
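
To illustrate what "soft labels" look like, here is a minimal sketch in plain Python of a softmax with a temperature. The logits and the function name `softmax_with_temperature` are hypothetical values chosen for illustration; they do not come from the teacher model.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher = softer)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0]             # hypothetical logits for three classes
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=20.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

At the higher temperature the distribution is flatter, so the non-maximal classes receive more probability mass and their relative ordering becomes visible to the student.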


**How to implement?**
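* $Loss = \alpha T^2 \times KL(p || q) + (1-\alpha)(\text{Original Cross Entropy Loss}), \text{where } p=softmax(\frac{\text{teacher's logits}}{T}), \text{and } q=softmax(\frac{\text{student's logits}}{T})$. Here $KL(p || q) = \sum_i p_i \log \frac{p_i}{q_i}$, so the teacher's distribution is the target that the student's distribution is pulled toward.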
* very useful link: [pytorch docs of KLDivLoss with examples](https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html)
* original paper: [Distilling the Knowledge in a Neural Network](https://arxiv.org/abs/1503.02531)
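
One possible way to compute the KL term is PyTorch's built-in `F.kl_div`, which takes log-probabilities as its input and probabilities as its target. The sketch below, using random logits and illustrative values for `alpha` and `T` (all assumptions, not values from the homework), compares it against the same quantity written out by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
student_logits = torch.randn(4, 11)   # hypothetical batch: 4 samples, 11 classes
teacher_logits = torch.randn(4, 11)
labels = torch.randint(0, 11, (4,))
alpha, T = 0.5, 2.0                   # illustrative values only

# Built-in: input = student log-probabilities, target = teacher probabilities.
kl_builtin = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",            # sum over classes, mean over the batch
)

# The same quantity written out by hand.
p_teacher = F.softmax(teacher_logits / T, dim=-1)
p_student = F.softmax(student_logits / T, dim=-1)
kl_manual = (p_teacher * (p_teacher.log() - p_student.log())).sum(dim=1).mean()

loss = alpha * T**2 * kl_builtin + (1 - alpha) * F.cross_entropy(student_logits, labels)
print(kl_builtin.item(), kl_manual.item(), loss.item())
```

Using `log_softmax` for the student side avoids taking the logarithm of very small probabilities, which can matter numerically when the logits are large.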

In [None]:
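# Conventional knowledge distillation only looks at the final answer (the logits).
# Hook-based distillation also looks at the "thinking process": the student has to learn to see the same features (edges, textures, etc.) at the corresponding layers.

# Initialize some layers of the student model with the teacher model's pretrained weights
def use_pretrain():
    # Copy the values rather than sharing tensor objects between the two models,
    # so the teacher can never be modified through the student.
    with torch.no_grad():
        student_model.conv1.weight.copy_(teacher_model.conv1.weight)
        student_model.bn1.weight.copy_(teacher_model.bn1.weight)
        student_model.bn1.bias.copy_(teacher_model.bn1.bias)
        student_model.bn1.running_mean.copy_(teacher_model.bn1.running_mean)
        student_model.bn1.running_var.copy_(teacher_model.bn1.running_var)

    # Freeze these parameters (they do not take part in training)
    student_model.conv1.weight.requires_grad = False
    student_model.bn1.weight.requires_grad = False
    student_model.bn1.bias.requires_grad = False

use_pretrain()  # transfer the pretrained weights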

# Hook helper class used to extract intermediate-layer features
class HookTool:
    def __init__(self):
        self.fea = None

    def hook_fun(self, module, fea_in, fea_out):
        self.fea = fea_out  # save the output features

# Register forward hooks on the specified layers (like installing "wiretaps")
def get_feas_by_hook(model, names=['layer1', 'layer2', 'layer3']):
    fea_hooks = []
    for name, module in model.named_modules():
        if name in names:
            cur_hook = HookTool()
            module.register_forward_hook(cur_hook.hook_fun)  # register the hook
            fea_hooks.append(cur_hook)
    return fea_hooks

# Register hooks for both the teacher and the student model
fea_hooks_teacher = get_feas_by_hook(teacher_model)
fea_hooks_student = get_feas_by_hook(student_model)

# Compute the loss between feature layers (compare the intermediate "thinking process")
def loss_fea_layers(student, teacher):
    loss = 0
    for i in range(len(student)):
        # Weighted loss: with the factor (len(student) - i), earlier layers get larger weights.
        # Each term measures how similar the student's and teacher's features are at layer i.
        loss += (len(student) - i) * F.smooth_l1_loss(student[i].fea, teacher[i].fea)
    return loss
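
For readers new to forward hooks, the following self-contained sketch shows the mechanism on a hypothetical two-layer model (`TinyNet`, not part of the homework): the hook function is called automatically after the hooked layer's forward pass and receives its output.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):  # hypothetical two-layer model for illustration
    def __init__(self):
        super().__init__()
        self.layer1 = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        self.layer2 = nn.Conv2d(4, 8, kernel_size=3, padding=1)

    def forward(self, x):
        return self.layer2(self.layer1(x))

captured = {}

def save_output(module, inputs, output):
    captured["layer1"] = output  # runs right after layer1's forward pass

model = TinyNet()
handle = model.layer1.register_forward_hook(save_output)
_ = model(torch.randn(2, 3, 8, 8))
handle.remove()  # detach the hook once it is no longer needed
print(captured["layer1"].shape)
```

`register_forward_hook` returns a handle; keeping it makes it possible to remove the hook later, for example before exporting or benchmarking the model.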

In [None]:
# Implement the loss function with KL divergence loss for knowledge distillation.
# You also have to copy-paste this whole block to HW13 GradeScope.
# def loss_fn_kd(student_logits, labels, teacher_logits, alpha=0.5, temperature=1.0):
#     pass

CE = nn.CrossEntropyLoss()  # cross-entropy loss

# Knowledge distillation loss function
def loss_fn_kd(student_logits, labels, teacher_logits, alpha=0.5, temperature=20.0):
    # Soft labels: soften the logits with the temperature parameter
    student_T = (student_logits/temperature).softmax(dim=-1)
    teacher_T = (teacher_logits/temperature).softmax(dim=-1)

    # KL divergence loss (knowledge distillation loss)
    kl_loss = (teacher_T*(teacher_T.log() - student_T.log())).sum(1).mean()

    # Cross-entropy loss (hard-label loss)
    ce_loss = CE(student_logits, labels)

    # Combined loss: alpha controls the weighting of the two losses; temperature^2 rebalances the gradients
    return alpha*(temperature**2)*kl_loss + (1 - alpha)*ce_loss

In [None]:
# choose the loss function by the config
if cfg['loss_fn_type'] == 'CE':
    # For the classification task, we use cross-entropy as the default loss function.
    loss_fn = nn.CrossEntropyLoss() # loss function for simple baseline.

if cfg['loss_fn_type'] == 'KD': # KD stands for knowledge distillation
    loss_fn = loss_fn_kd # implement loss_fn_kd for the report question and the medium baseline.

# You can also adopt other types of knowledge distillation techniques for strong and boss baseline, but use function name other than `loss_fn_kd`
# For example:
# def loss_fn_custom_kd():
#     pass
# if cfg['loss_fn_type'] == 'custom_kd':
#     loss_fn = loss_fn_custom_kd

# "cuda" only when GPUs are available.
device = "cuda" if torch.cuda.is_available() else "cpu"
log(f"device: {device}")

# The number of training epochs and patience.
n_epochs = cfg['n_epochs']
patience = cfg['patience'] # If no improvement in 'patience' epochs, early stop

device: cuda


### Training
Implement the training loop for the simple baseline; feel free to modify it.

In [None]:
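# Move the models to the compute device
student_model.to(device)
teacher_model.to(device)

# Optimizer: only optimize the parameters that require gradients
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, student_model.parameters()),
                            lr=cfg['lr'], weight_decay=cfg['weight_decay'])

# Learning rate scheduler: cosine annealing with warm restarts
# Note: scheduler.step() is never called in the loop below, so the learning rate stays at cfg['lr'].
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=9, T_mult=2, eta_min=1e-5)

# Early-stopping variables
stale = 0      # number of consecutive epochs without improvement
best_acc = 0.0 # best validation accuracy

teacher_model.eval()  # put the teacher model in evaluation mode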

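# Training loop
for epoch in range(n_epochs):
    student_model.train()  # put the student model in training mode

    # Training statistics
    train_loss = []
    train_loss_fea = []
    train_accs = []
    train_lens = []

    # Fraction of the total training progress reached in this epoch
    percent = (1+epoch)/n_epochs

    # Train for one epoch
    for batch in tqdm(train_loader):
        imgs, labels = batch
        imgs = imgs.to(device)
        labels = labels.to(device)

        # Get the teacher model's output (without computing gradients)
        with torch.no_grad():
            teacher_logits = teacher_model(imgs)

        # Forward pass of the student model
        logits = student_model(imgs)

        # Compute the logits loss (knowledge distillation loss)
        # alpha changes with training progress: rely more on the teacher at the start and more on the ground-truth labels later
        loss_logits = loss_fn(logits, labels, teacher_logits, alpha=1 - percent*percent)

        # Compute the feature loss
        loss_fea = loss_fea_layers(fea_hooks_student, fea_hooks_teacher)

        # Total loss: the weight of the logits loss grows with training progress, while the feature loss weight stays constant
        loss = (10*percent*percent) * loss_logits + loss_fea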

        # Backpropagation and optimization
        optimizer.zero_grad()
        loss.backward()
        grad_norm = nn.utils.clip_grad_norm_(student_model.parameters(), max_norm=cfg['grad_norm_max'])
        optimizer.step()

        # Count the correct predictions in this batch
        acc = (logits.argmax(dim=-1) == labels).float().sum()
        train_batch_len = len(imgs)

        # Accumulate statistics
        train_loss.append(loss_logits.item() * train_batch_len)
        train_loss_fea.append(loss_fea.item() * train_batch_len)
        train_accs.append(acc)
        train_lens.append(train_batch_len)

    # Compute training metrics
    train_loss = sum(train_loss) / sum(train_lens)
    train_loss_fea = sum(train_loss_fea) / sum(train_lens)
    train_acc = sum(train_accs) / sum(train_lens)

    log(f"[ Train | {epoch + 1:03d}/{n_epochs:03d} ] loss = {train_loss:.5f}, loss_fea = {train_loss_fea:.5f}, acc = {train_acc:.5f}")

    # Validation phase
    student_model.eval()
    valid_loss = []
    valid_accs = []
    valid_lens = []

    for batch in tqdm(valid_loader):
        imgs, labels = batch
        imgs = imgs.to(device)
        labels = labels.to(device)

        # No gradients are computed during validation
        with torch.no_grad():
            logits = student_model(imgs)
            teacher_logits = teacher_model(imgs)

        loss = loss_fn(logits, labels, teacher_logits)
        acc = (logits.argmax(dim=-1) == labels).float().sum()
        batch_len = len(imgs)

        valid_loss.append(loss.item() * batch_len)
        valid_accs.append(acc)
        valid_lens.append(batch_len)

    # Compute validation metrics
    valid_loss = sum(valid_loss) / sum(valid_lens)
    valid_acc = sum(valid_accs) / sum(valid_lens)

    # Log the validation results
    if valid_acc > best_acc:
        log(f"[ Valid | {epoch + 1:03d}/{n_epochs:03d} ] loss = {valid_loss:.5f}, acc = {valid_acc:.5f} ---------------------> best")
    else:
        log(f"[ Valid | {epoch + 1:03d}/{n_epochs:03d} ] loss = {valid_loss:.5f}, acc = {valid_acc:.5f}")

    # Save the best model and check for early stopping
    if valid_acc > best_acc:
        log(f"Best model found at epoch {epoch+1}, saving model")
        torch.save(student_model.state_dict(), f"{save_path}/student_best.ckpt")
        best_acc = valid_acc
        stale = 0
    else:
        stale += 1
        if stale > patience:
            log(f"No improvement in {patience} consecutive epochs, early stopping")
            break

log("Finish training")
log_fw.close()
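
The code above constructs a `CosineAnnealingWarmRestarts` scheduler. In PyTorch, a scheduler only changes the learning rate when its `step()` method is called. As a sketch of what per-epoch stepping would look like with the same `T_0`, `T_mult`, and `eta_min`, the example below uses a dummy parameter and an empty training step, both of which are assumptions made purely for illustration.

```python
import torch

param = torch.nn.Parameter(torch.zeros(1))           # dummy parameter
optimizer = torch.optim.Adam([param], lr=1.4e-3)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
    optimizer, T_0=9, T_mult=2, eta_min=1e-5
)

lrs = []
for epoch in range(12):
    lrs.append(optimizer.param_groups[0]["lr"])  # learning rate used in this epoch
    optimizer.step()      # stands in for one epoch of training
    scheduler.step()      # advance the schedule once per epoch

print([f"{lr:.2e}" for lr in lrs])
```

With `T_0=9`, the learning rate decays along a cosine curve for nine epochs and then jumps back to its initial value, after which the next cycle is twice as long because `T_mult=2`.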

[ Train | 001/150 ] loss = 18.65147, loss_fea = 3.00103, acc = 0.23870


[ Valid | 001/150 ] loss = 16.69830, acc = 0.31157 ---------------------> best
Best model found at epoch 1, saving model


[ Train | 002/150 ] loss = 16.60045, loss_fea = 2.15093, acc = 0.32465


[ Valid | 002/150 ] loss = 15.25823, acc = 0.35034 ---------------------> best
Best model found at epoch 2, saving model


[ Train | 003/150 ] loss = 15.28997, loss_fea = 1.79048, acc = 0.38060


[ Valid | 003/150 ] loss = 13.21152, acc = 0.43894 ---------------------> best
Best model found at epoch 3, saving model


[ Train | 004/150 ] loss = 14.42387, loss_fea = 1.60934, acc = 0.41658


[ Valid | 004/150 ] loss = 12.10238, acc = 0.48674 ---------------------> best
Best model found at epoch 4, saving model


[ Train | 005/150 ] loss = 13.60791, loss_fea = 1.52079, acc = 0.44810


[ Valid | 005/150 ] loss = 11.36349, acc = 0.51938 ---------------------> best
Best model found at epoch 5, saving model


[ Train | 006/150 ] loss = 12.93769, loss_fea = 1.47308, acc = 0.46777


[ Valid | 006/150 ] loss = 11.13420, acc = 0.50102


[ Train | 007/150 ] loss = 12.86520, loss_fea = 1.45538, acc = 0.48054


[ Valid | 007/150 ] loss = 10.84421, acc = 0.51734


[ Train | 008/150 ] loss = 12.23738, loss_fea = 1.44402, acc = 0.49747


[ Valid | 008/150 ] loss = 11.38368, acc = 0.49694


[ Train | 009/150 ] loss = 12.00169, loss_fea = 1.44489, acc = 0.51237


[ Valid | 009/150 ] loss = 10.97811, acc = 0.54765 ---------------------> best
Best model found at epoch 9, saving model


[ Train | 010/150 ] loss = 11.70693, loss_fea = 1.44492, acc = 0.52635


[ Valid | 010/150 ] loss = 9.88881, acc = 0.59545 ---------------------> best
Best model found at epoch 10, saving model


[ Train | 011/150 ] loss = 11.17522, loss_fea = 1.44582, acc = 0.54379


[ Valid | 011/150 ] loss = 8.80595, acc = 0.62431 ---------------------> best
Best model found at epoch 11, saving model


[ Train | 012/150 ] loss = 10.96852, loss_fea = 1.44781, acc = 0.54906


[ Valid | 012/150 ] loss = 9.35935, acc = 0.58758


[ Train | 013/150 ] loss = 10.84362, loss_fea = 1.45283, acc = 0.55625


[ Valid | 013/150 ] loss = 9.03561, acc = 0.59341


[ Train | 014/150 ] loss = 10.54602, loss_fea = 1.45591, acc = 0.57723


[ Valid | 014/150 ] loss = 8.96016, acc = 0.62139


[ Train | 015/150 ] loss = 10.39374, loss_fea = 1.46337, acc = 0.58068


[ Valid | 015/150 ] loss = 9.35167, acc = 0.59604


[ Train | 016/150 ] loss = 10.17598, loss_fea = 1.46715, acc = 0.58899


[ Valid | 016/150 ] loss = 7.67772, acc = 0.65870 ---------------------> best
Best model found at epoch 16, saving model


[ Train | 017/150 ] loss = 9.96403, loss_fea = 1.47067, acc = 0.59477


[ Valid | 017/150 ] loss = 8.56852, acc = 0.60187


[ Train | 018/150 ] loss = 9.96184, loss_fea = 1.47735, acc = 0.59599


[ Valid | 018/150 ] loss = 8.58638, acc = 0.63684


[ Train | 019/150 ] loss = 9.88544, loss_fea = 1.48541, acc = 0.60237


[ Valid | 019/150 ] loss = 7.73791, acc = 0.64384


[ Train | 020/150 ] loss = 9.70923, loss_fea = 1.48873, acc = 0.60622


[ Valid | 020/150 ] loss = 8.71273, acc = 0.60536


[ Train | 021/150 ] loss = 9.58178, loss_fea = 1.49668, acc = 0.61149


[ Valid | 021/150 ] loss = 7.48387, acc = 0.65258


[ Train | 022/150 ] loss = 9.38900, loss_fea = 1.50092, acc = 0.62680


[ Valid | 022/150 ] loss = 7.36759, acc = 0.67298 ---------------------> best
Best model found at epoch 22, saving model


[ Train | 023/150 ] loss = 9.38738, loss_fea = 1.50812, acc = 0.63035


[ Valid | 023/150 ] loss = 7.52750, acc = 0.64267


[ Train | 024/150 ] loss = 9.13627, loss_fea = 1.50982, acc = 0.62903


[ Valid | 024/150 ] loss = 7.44341, acc = 0.66365


[ Train | 025/150 ] loss = 8.99546, loss_fea = 1.51519, acc = 0.63258


[ Valid | 025/150 ] loss = 7.65722, acc = 0.65579


[ Train | 026/150 ] loss = 9.02882, loss_fea = 1.52222, acc = 0.63369


[ Valid | 026/150 ] loss = 7.10819, acc = 0.67881 ---------------------> best
Best model found at epoch 26, saving model


[ Train | 027/150 ] loss = 8.92514, loss_fea = 1.52868, acc = 0.63633


[ Valid | 027/150 ] loss = 7.45525, acc = 0.64850


[ Train | 028/150 ] loss = 8.84283, loss_fea = 1.53602, acc = 0.64018


[ Valid | 028/150 ] loss = 7.13304, acc = 0.69455 ---------------------> best
Best model found at epoch 28, saving model


[ Train | 029/150 ] loss = 8.76570, loss_fea = 1.53862, acc = 0.64778


[ Valid | 029/150 ] loss = 6.89590, acc = 0.68464


[ Train | 030/150 ] loss = 8.50918, loss_fea = 1.54300, acc = 0.65113


[ Valid | 030/150 ] loss = 6.68157, acc = 0.69572 ---------------------> best
Best model found at epoch 30, saving model


[ Train | 031/150 ] loss = 8.60213, loss_fea = 1.54815, acc = 0.65741


[ Valid | 031/150 ] loss = 6.73685, acc = 0.68231


[ Train | 032/150 ] loss = 8.46529, loss_fea = 1.55041, acc = 0.65021


[ Valid | 032/150 ] loss = 6.83509, acc = 0.68493


[ Train | 033/150 ] loss = 8.35384, loss_fea = 1.55620, acc = 0.65173


[ Valid | 033/150 ] loss = 6.75567, acc = 0.69076


[ Train | 034/150 ] loss = 8.39496, loss_fea = 1.56478, acc = 0.65934


[ Valid | 034/150 ] loss = 6.71308, acc = 0.69164


[ Train | 035/150 ] loss = 8.27300, loss_fea = 1.56876, acc = 0.65903


[ Valid | 035/150 ] loss = 6.88659, acc = 0.69863 ---------------------> best
Best model found at epoch 35, saving model


[ Train | 036/150 ] loss = 8.09762, loss_fea = 1.57544, acc = 0.66866


[ Valid | 036/150 ] loss = 7.46204, acc = 0.68930


[ Train | 037/150 ] loss = 8.13282, loss_fea = 1.57643, acc = 0.66673


[ Valid | 037/150 ] loss = 6.50381, acc = 0.73506 ---------------------> best
Best model found at epoch 37, saving model


[ Train | 038/150 ] loss = 7.91713, loss_fea = 1.58588, acc = 0.67221


[ Valid | 038/150 ] loss = 7.09805, acc = 0.69076


[ Train | 039/150 ] loss = 7.90199, loss_fea = 1.58969, acc = 0.68174


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 039/150 ] loss = 5.95203, acc = 0.73244


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 040/150 ] loss = 7.91308, loss_fea = 1.59489, acc = 0.67748


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 040/150 ] loss = 6.26353, acc = 0.74410 ---------------------> best
Best model found at epoch 40, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 041/150 ] loss = 7.87305, loss_fea = 1.59881, acc = 0.68174


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 041/150 ] loss = 6.64557, acc = 0.71379


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 042/150 ] loss = 7.77766, loss_fea = 1.60118, acc = 0.67839


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 042/150 ] loss = 6.23206, acc = 0.72078


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 043/150 ] loss = 7.58649, loss_fea = 1.60520, acc = 0.69278


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 043/150 ] loss = 6.53369, acc = 0.69309


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 044/150 ] loss = 7.55481, loss_fea = 1.61053, acc = 0.68599


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 044/150 ] loss = 6.45299, acc = 0.73506


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 045/150 ] loss = 7.55991, loss_fea = 1.61240, acc = 0.69349


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 045/150 ] loss = 6.02477, acc = 0.72982


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 046/150 ] loss = 7.51191, loss_fea = 1.61521, acc = 0.68457


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 046/150 ] loss = 6.12347, acc = 0.72078


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 047/150 ] loss = 7.31057, loss_fea = 1.62083, acc = 0.68265


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 047/150 ] loss = 5.99407, acc = 0.72574


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 048/150 ] loss = 7.33958, loss_fea = 1.62365, acc = 0.69643


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 048/150 ] loss = 6.60106, acc = 0.73856


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 049/150 ] loss = 7.28090, loss_fea = 1.62969, acc = 0.69968


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 049/150 ] loss = 5.83628, acc = 0.73273


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 050/150 ] loss = 7.13057, loss_fea = 1.63329, acc = 0.69724


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 050/150 ] loss = 6.12928, acc = 0.71962


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 051/150 ] loss = 7.16695, loss_fea = 1.63594, acc = 0.69461


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 051/150 ] loss = 6.11172, acc = 0.70329


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 052/150 ] loss = 7.02866, loss_fea = 1.63702, acc = 0.70180


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 052/150 ] loss = 5.84776, acc = 0.75138 ---------------------> best
Best model found at epoch 52, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 053/150 ] loss = 6.98960, loss_fea = 1.64218, acc = 0.70069


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 053/150 ] loss = 5.80827, acc = 0.75372 ---------------------> best
Best model found at epoch 53, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 054/150 ] loss = 7.05194, loss_fea = 1.64607, acc = 0.69501


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 054/150 ] loss = 6.79512, acc = 0.68755


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 055/150 ] loss = 6.90901, loss_fea = 1.65175, acc = 0.70150


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 055/150 ] loss = 5.90306, acc = 0.71787


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 056/150 ] loss = 6.79875, loss_fea = 1.65578, acc = 0.70484


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 056/150 ] loss = 5.55849, acc = 0.76159 ---------------------> best
Best model found at epoch 56, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 057/150 ] loss = 6.75060, loss_fea = 1.65678, acc = 0.71133


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 057/150 ] loss = 5.68445, acc = 0.75925


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 058/150 ] loss = 6.70925, loss_fea = 1.66206, acc = 0.70768


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 058/150 ] loss = 5.87270, acc = 0.72078


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 059/150 ] loss = 6.66289, loss_fea = 1.66719, acc = 0.71711


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 059/150 ] loss = 5.67580, acc = 0.75663


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 060/150 ] loss = 6.60790, loss_fea = 1.66777, acc = 0.71934


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 060/150 ] loss = 5.93614, acc = 0.73215


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 061/150 ] loss = 6.56456, loss_fea = 1.67369, acc = 0.70829


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 061/150 ] loss = 5.87913, acc = 0.75226


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 062/150 ] loss = 6.45531, loss_fea = 1.67691, acc = 0.70991


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 062/150 ] loss = 5.70086, acc = 0.74614


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 063/150 ] loss = 6.46929, loss_fea = 1.68185, acc = 0.71164


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 063/150 ] loss = 5.78788, acc = 0.72632


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 064/150 ] loss = 6.49723, loss_fea = 1.68742, acc = 0.71366


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 064/150 ] loss = 5.68420, acc = 0.74643


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 065/150 ] loss = 6.45982, loss_fea = 1.69153, acc = 0.71224


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 065/150 ] loss = 5.99588, acc = 0.73477


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 066/150 ] loss = 6.31561, loss_fea = 1.69258, acc = 0.71904


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 066/150 ] loss = 5.49701, acc = 0.75955


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 067/150 ] loss = 6.28463, loss_fea = 1.69433, acc = 0.72491


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 067/150 ] loss = 5.37167, acc = 0.75138


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 068/150 ] loss = 6.17941, loss_fea = 1.69878, acc = 0.71944


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 068/150 ] loss = 5.27915, acc = 0.76625 ---------------------> best
Best model found at epoch 68, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 069/150 ] loss = 6.14382, loss_fea = 1.70188, acc = 0.72420


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 069/150 ] loss = 5.67744, acc = 0.76537


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 070/150 ] loss = 6.10433, loss_fea = 1.70753, acc = 0.71883


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 070/150 ] loss = 5.45796, acc = 0.75168


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 071/150 ] loss = 6.07711, loss_fea = 1.71420, acc = 0.72491


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 071/150 ] loss = 5.43387, acc = 0.76363


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 072/150 ] loss = 5.96412, loss_fea = 1.71449, acc = 0.72400


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 072/150 ] loss = 5.55859, acc = 0.76013


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 073/150 ] loss = 5.88162, loss_fea = 1.71610, acc = 0.73130


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 073/150 ] loss = 5.40874, acc = 0.77120 ---------------------> best
Best model found at epoch 73, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 074/150 ] loss = 5.81803, loss_fea = 1.71934, acc = 0.73414


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 074/150 ] loss = 5.70907, acc = 0.74614


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 075/150 ] loss = 5.80819, loss_fea = 1.72288, acc = 0.72451


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 075/150 ] loss = 5.24488, acc = 0.76654


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 076/150 ] loss = 5.71145, loss_fea = 1.72842, acc = 0.73718


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 076/150 ] loss = 5.32751, acc = 0.77470 ---------------------> best
Best model found at epoch 76, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 077/150 ] loss = 5.67335, loss_fea = 1.73384, acc = 0.73191


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 077/150 ] loss = 5.51969, acc = 0.76887


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 078/150 ] loss = 5.61379, loss_fea = 1.73530, acc = 0.73535


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 078/150 ] loss = 5.51492, acc = 0.75925


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 079/150 ] loss = 5.64685, loss_fea = 1.73344, acc = 0.73110


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 079/150 ] loss = 5.36092, acc = 0.76596


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 080/150 ] loss = 5.49773, loss_fea = 1.73751, acc = 0.73282


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 080/150 ] loss = 5.46663, acc = 0.76450


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 081/150 ] loss = 5.47905, loss_fea = 1.74352, acc = 0.73505


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 081/150 ] loss = 5.39784, acc = 0.76392


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 082/150 ] loss = 5.41420, loss_fea = 1.74758, acc = 0.73464


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 082/150 ] loss = 5.31229, acc = 0.75925


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 083/150 ] loss = 5.35801, loss_fea = 1.75206, acc = 0.73779


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 083/150 ] loss = 5.23065, acc = 0.77762 ---------------------> best
Best model found at epoch 83, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 084/150 ] loss = 5.38612, loss_fea = 1.75544, acc = 0.74448


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 084/150 ] loss = 5.50854, acc = 0.73477


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 085/150 ] loss = 5.24486, loss_fea = 1.75657, acc = 0.74144


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 085/150 ] loss = 5.41232, acc = 0.76333


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 086/150 ] loss = 5.29024, loss_fea = 1.76384, acc = 0.73850


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 086/150 ] loss = 5.36527, acc = 0.77062


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 087/150 ] loss = 5.12540, loss_fea = 1.76153, acc = 0.74062


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 087/150 ] loss = 5.20120, acc = 0.78286 ---------------------> best
Best model found at epoch 87, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 088/150 ] loss = 5.12113, loss_fea = 1.76668, acc = 0.74367


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 088/150 ] loss = 5.29957, acc = 0.77295


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 089/150 ] loss = 4.99167, loss_fea = 1.76792, acc = 0.74681


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 089/150 ] loss = 5.49634, acc = 0.75459


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 090/150 ] loss = 4.98565, loss_fea = 1.77569, acc = 0.74752


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 090/150 ] loss = 5.27415, acc = 0.76246


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 091/150 ] loss = 4.93031, loss_fea = 1.77320, acc = 0.74437


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 091/150 ] loss = 5.57616, acc = 0.76129


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 092/150 ] loss = 4.88497, loss_fea = 1.77712, acc = 0.74204


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 092/150 ] loss = 5.17893, acc = 0.76129


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 093/150 ] loss = 4.81027, loss_fea = 1.78164, acc = 0.74579


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 093/150 ] loss = 5.31950, acc = 0.78053


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 094/150 ] loss = 4.68315, loss_fea = 1.78348, acc = 0.74812


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 094/150 ] loss = 5.36709, acc = 0.77791


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 095/150 ] loss = 4.72778, loss_fea = 1.78995, acc = 0.74519


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 095/150 ] loss = 5.23233, acc = 0.77266


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 096/150 ] loss = 4.61548, loss_fea = 1.79380, acc = 0.74873


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 096/150 ] loss = 5.21622, acc = 0.77762


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 097/150 ] loss = 4.50992, loss_fea = 1.79153, acc = 0.74762


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 097/150 ] loss = 5.24060, acc = 0.78082


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 098/150 ] loss = 4.45678, loss_fea = 1.79604, acc = 0.75502


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 098/150 ] loss = 5.24535, acc = 0.77878


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 099/150 ] loss = 4.46427, loss_fea = 1.80113, acc = 0.75623


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 099/150 ] loss = 5.23961, acc = 0.77004


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 100/150 ] loss = 4.33658, loss_fea = 1.80393, acc = 0.75411


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 100/150 ] loss = 5.25766, acc = 0.77936


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 101/150 ] loss = 4.29078, loss_fea = 1.79884, acc = 0.75177


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 101/150 ] loss = 5.23257, acc = 0.76654


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 102/150 ] loss = 4.18870, loss_fea = 1.80533, acc = 0.75755


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 102/150 ] loss = 5.54530, acc = 0.74643


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 103/150 ] loss = 4.14633, loss_fea = 1.80428, acc = 0.75644


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 103/150 ] loss = 5.19741, acc = 0.77936


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 104/150 ] loss = 4.10628, loss_fea = 1.81287, acc = 0.76100


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 104/150 ] loss = 5.30911, acc = 0.76829


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 105/150 ] loss = 4.02787, loss_fea = 1.81661, acc = 0.76272


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 105/150 ] loss = 5.40589, acc = 0.75401


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 106/150 ] loss = 3.93084, loss_fea = 1.81586, acc = 0.76424


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 106/150 ] loss = 5.16841, acc = 0.78578 ---------------------> best
Best model found at epoch 106, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 107/150 ] loss = 3.91256, loss_fea = 1.81442, acc = 0.76282


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 107/150 ] loss = 5.35523, acc = 0.77499


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 108/150 ] loss = 3.85499, loss_fea = 1.81918, acc = 0.75603


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 108/150 ] loss = 5.15522, acc = 0.77354


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 109/150 ] loss = 3.82583, loss_fea = 1.82487, acc = 0.76515


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 109/150 ] loss = 5.23259, acc = 0.78694 ---------------------> best
Best model found at epoch 109, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 110/150 ] loss = 3.73045, loss_fea = 1.83059, acc = 0.76191


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 110/150 ] loss = 5.52685, acc = 0.76916


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 111/150 ] loss = 3.63296, loss_fea = 1.82805, acc = 0.76860


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 111/150 ] loss = 5.05468, acc = 0.78927 ---------------------> best
Best model found at epoch 111, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 112/150 ] loss = 3.55553, loss_fea = 1.83008, acc = 0.76292


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 112/150 ] loss = 5.12425, acc = 0.77966


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 113/150 ] loss = 3.49938, loss_fea = 1.83568, acc = 0.77407


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 113/150 ] loss = 5.25185, acc = 0.78607


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 114/150 ] loss = 3.42688, loss_fea = 1.83359, acc = 0.76931


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 114/150 ] loss = 5.24066, acc = 0.77150


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 115/150 ] loss = 3.35658, loss_fea = 1.83292, acc = 0.77468


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 115/150 ] loss = 5.07835, acc = 0.78665


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 116/150 ] loss = 3.33068, loss_fea = 1.83946, acc = 0.77661


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 116/150 ] loss = 5.08469, acc = 0.78723


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 117/150 ] loss = 3.24528, loss_fea = 1.84191, acc = 0.77569


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 117/150 ] loss = 5.07117, acc = 0.78986 ---------------------> best
Best model found at epoch 117, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 118/150 ] loss = 3.19052, loss_fea = 1.84570, acc = 0.77539


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 118/150 ] loss = 5.06182, acc = 0.79802 ---------------------> best
Best model found at epoch 118, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 119/150 ] loss = 3.09846, loss_fea = 1.84845, acc = 0.77529


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 119/150 ] loss = 5.09115, acc = 0.78199


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 120/150 ] loss = 3.04020, loss_fea = 1.84854, acc = 0.77965


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 120/150 ] loss = 5.25002, acc = 0.77645


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 121/150 ] loss = 2.98117, loss_fea = 1.85136, acc = 0.77498


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 121/150 ] loss = 4.99041, acc = 0.78257


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 122/150 ] loss = 2.90051, loss_fea = 1.85620, acc = 0.78249


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 122/150 ] loss = 5.15294, acc = 0.78986


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 123/150 ] loss = 2.81534, loss_fea = 1.85732, acc = 0.78188


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 123/150 ] loss = 5.11008, acc = 0.80035 ---------------------> best
Best model found at epoch 123, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 124/150 ] loss = 2.72719, loss_fea = 1.85791, acc = 0.78532


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 124/150 ] loss = 4.98669, acc = 0.80793 ---------------------> best
Best model found at epoch 124, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 125/150 ] loss = 2.63635, loss_fea = 1.85775, acc = 0.78532


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 125/150 ] loss = 5.15458, acc = 0.78694


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 126/150 ] loss = 2.61925, loss_fea = 1.85668, acc = 0.78117


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 126/150 ] loss = 5.01586, acc = 0.80210


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 127/150 ] loss = 2.52517, loss_fea = 1.86302, acc = 0.78695


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 127/150 ] loss = 5.03470, acc = 0.80152


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 128/150 ] loss = 2.44157, loss_fea = 1.86059, acc = 0.78512


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 128/150 ] loss = 5.12739, acc = 0.79714


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 129/150 ] loss = 2.39322, loss_fea = 1.86177, acc = 0.78674


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 129/150 ] loss = 5.05154, acc = 0.79102


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 130/150 ] loss = 2.27075, loss_fea = 1.86238, acc = 0.79586


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 130/150 ] loss = 5.43288, acc = 0.77470


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 131/150 ] loss = 2.19824, loss_fea = 1.86339, acc = 0.79617


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 131/150 ] loss = 5.05098, acc = 0.79569


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 132/150 ] loss = 2.17488, loss_fea = 1.86547, acc = 0.78917


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 132/150 ] loss = 5.20778, acc = 0.80414


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 133/150 ] loss = 2.09428, loss_fea = 1.86717, acc = 0.78816


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 133/150 ] loss = 5.19602, acc = 0.80764


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 134/150 ] loss = 1.96846, loss_fea = 1.87356, acc = 0.79333


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 134/150 ] loss = 5.26317, acc = 0.78927


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 135/150 ] loss = 1.90639, loss_fea = 1.87472, acc = 0.79678


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 135/150 ] loss = 5.43629, acc = 0.78315


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 136/150 ] loss = 1.80091, loss_fea = 1.87349, acc = 0.79738


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 136/150 ] loss = 5.26538, acc = 0.79394


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 137/150 ] loss = 1.73472, loss_fea = 1.87378, acc = 0.80103


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 137/150 ] loss = 5.17303, acc = 0.78665


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 138/150 ] loss = 1.64816, loss_fea = 1.87568, acc = 0.79718


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 138/150 ] loss = 5.26078, acc = 0.80268


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 139/150 ] loss = 1.57124, loss_fea = 1.87747, acc = 0.80276


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 139/150 ] loss = 5.38769, acc = 0.80676


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 140/150 ] loss = 1.48121, loss_fea = 1.88034, acc = 0.80144


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 140/150 ] loss = 5.17628, acc = 0.80705


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 141/150 ] loss = 1.38439, loss_fea = 1.87927, acc = 0.80935


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 141/150 ] loss = 5.46239, acc = 0.78957


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 142/150 ] loss = 1.31273, loss_fea = 1.88118, acc = 0.80762


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 142/150 ] loss = 5.39609, acc = 0.79539


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 143/150 ] loss = 1.21092, loss_fea = 1.88227, acc = 0.80945


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 143/150 ] loss = 5.58957, acc = 0.80968 ---------------------> best
Best model found at epoch 143, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 144/150 ] loss = 1.13581, loss_fea = 1.88283, acc = 0.80458


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 144/150 ] loss = 5.55082, acc = 0.80705


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 145/150 ] loss = 1.05212, loss_fea = 1.88712, acc = 0.80975


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 145/150 ] loss = 5.36683, acc = 0.80443


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 146/150 ] loss = 0.94476, loss_fea = 1.88691, acc = 0.81674


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 146/150 ] loss = 5.69661, acc = 0.77791


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 147/150 ] loss = 0.86811, loss_fea = 1.88670, acc = 0.81076


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 147/150 ] loss = 5.76151, acc = 0.80181


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 148/150 ] loss = 0.75906, loss_fea = 1.88512, acc = 0.81624


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 148/150 ] loss = 5.95554, acc = 0.80064


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 149/150 ] loss = 0.66925, loss_fea = 1.89075, acc = 0.81411


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 149/150 ] loss = 6.39277, acc = 0.79365


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 150/150 ] loss = 0.55793, loss_fea = 1.89073, acc = 0.82029


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 150/150 ] loss = 6.60750, acc = 0.78898
Finish training
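
The validation numbers in the log above are easier to compare once they are pulled out of the text. The sketch below is one possible way to do that, assuming the log has been saved as plain text in the `[ Train | ... ]` / `[ Valid | ... ]` format shown. The names `parse_log`, `best_valid`, and `sample_log` are hypothetical and not part of the homework code.

```python
import re

# Matches lines such as
# "[ Valid | 037/150 ] loss = 6.50381, acc = 0.73506 ---------------------> best"
LINE_RE = re.compile(
    r"\[ (Train|Valid) \| (\d+)/(\d+) \] loss = ([\d.]+),.*?acc = ([\d.]+)"
)

def parse_log(text):
    """Return a list of (split, epoch, loss, acc) tuples found in `text`."""
    records = []
    for m in LINE_RE.finditer(text):
        split, epoch, _total, loss, acc = m.groups()
        records.append((split, int(epoch), float(loss), float(acc)))
    return records

def best_valid(records):
    """Return the validation record with the highest accuracy."""
    valid = [r for r in records if r[0] == "Valid"]
    return max(valid, key=lambda r: r[3])

# A few lines copied from the log above, used here as a small sample.
sample_log = """
[ Train | 143/150 ] loss = 1.21092, loss_fea = 1.88227, acc = 0.80945
[ Valid | 143/150 ] loss = 5.58957, acc = 0.80968 ---------------------> best
[ Train | 150/150 ] loss = 0.55793, loss_fea = 1.89073, acc = 0.82029
[ Valid | 150/150 ] loss = 6.60750, acc = 0.78898
"""

records = parse_log(sample_log)
split, epoch, loss, acc = best_valid(records)
print(f"best valid acc = {acc:.5f} at epoch {epoch}")
# prints: best valid acc = 0.80968 at epoch 143
```

In the log above, epoch 143 (valid acc 0.80968) is the last epoch marked `best`, so it is the checkpoint saved last. The log also shows the training loss still falling to 0.55793 at epoch 150 while the validation loss rises from about 5.0 to 6.60750, so the final epochs are probably overfitting.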


### Inference
Load the best model from the experiment and generate `submission.csv`.

In [None]:
# create dataloader for evaluation
eval_set = FoodDataset(os.path.join(cfg['dataset_root'], "evaluation"), tfm=test_tfm)
eval_loader = DataLoader(eval_set, batch_size=cfg['batch_size'], shuffle=False, num_workers=0, pin_memory=True)

One /content/drive/MyDrive/food11-hw13/evaluation sample /content/drive/MyDrive/food11-hw13/evaluation/0000.jpg


In [None]:
# Load model from {exp_name}/student_best.ckpt
student_model_best = get_student_model() # get a new student model to avoid reference before assignment.
ckpt_path = f"{save_path}/student_best.ckpt" # the ckpt path of the best student model.
student_model_best.load_state_dict(torch.load(ckpt_path, map_location='cpu')) # load the state dict and set it to the student model
student_model_best.to(device) # set the student model to device

# Start evaluation
student_model_best.eval()
eval_preds = [] # storing predictions of the evaluation dataset

# Iterate over the evaluation set in batches.
for batch in tqdm(eval_loader):
    # A batch consists of image data and labels; the labels are unused here.
    imgs, _ = batch
    # We don't need gradients during evaluation.
    # Using torch.no_grad() skips gradient tracking, which saves memory and speeds up the forward pass.
    with torch.no_grad():
        logits = student_model_best(imgs.to(device))
        # Take the argmax over the class dimension. No .squeeze() here: with a final
        # batch of size 1 it would yield a 0-d array that list() cannot iterate.
        preds = logits.argmax(dim=-1).cpu().tolist()
    # Loss and accuracy cannot be calculated because we do not have the true labels of the evaluation set.
    eval_preds += preds

def pad4(i):
    # Left-pad the index with zeros to four characters, e.g. 7 -> "0007".
    return str(i).zfill(4)

# Save prediction results
ids = [pad4(i) for i in range(len(eval_set))]
categories = eval_preds

df = pd.DataFrame()
df['Id'] = ids
df['Category'] = categories
df.to_csv(f"{save_path}/submission.csv", index=False) # Now you can download submission.csv and upload it to the Kaggle competition.

  0%|          | 0/53 [00:00<?, ?it/s]
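
Before uploading, it can help to confirm that the file has the expected shape. The sketch below is a minimal check using only the standard library. It assumes the `Id,Category` header and four-digit ids produced by the cell above, and it assumes 11 classes (inferred from the `food11` dataset name). `check_submission` is a hypothetical helper, not part of the homework code, and the demo writes a small temporary file rather than reading the real `submission.csv`.

```python
import csv
import os
import tempfile

def check_submission(csv_path, n_expected, n_classes=11):
    """Return a list of problems found in the submission file (empty if none)."""
    problems = []
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # The header must be exactly the two columns written by the notebook.
        if reader.fieldnames != ["Id", "Category"]:
            problems.append(f"unexpected header: {reader.fieldnames}")
            return problems
        rows = list(reader)
    if len(rows) != n_expected:
        problems.append(f"expected {n_expected} rows, found {len(rows)}")
    ids = [r["Id"] for r in rows]
    if len(set(ids)) != len(ids):
        problems.append("duplicate ids")
    for r in rows:
        # Ids are zero-padded to four digits, e.g. "0007".
        if not (len(r["Id"]) == 4 and r["Id"].isdigit()):
            problems.append(f"bad id: {r['Id']!r}")
        # Categories must be integers in [0, n_classes).
        if not (r["Category"].isdigit() and int(r["Category"]) < n_classes):
            problems.append(f"bad category for id {r['Id']}: {r['Category']!r}")
    return problems

# Demo on a tiny hand-written file with the same layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "submission.csv")
    with open(path, "w", newline="") as f:
        f.write("Id,Category\n0000,3\n0001,0\n0002,10\n")
    print(check_submission(path, n_expected=3))  # prints []
```

In the notebook, a call along the lines of `check_submission(f"{save_path}/submission.csv", n_expected=len(eval_set))` should return an empty list if the file is well formed under these assumptions.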

> Don't forget to answer the report questions on Gradescope!