In [1]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Homework 13 - Network Compression

Author: Liang-Hsuan Tseng (b07502072@ntu.edu.tw), modified from ML2021-HW13

If you have any questions, feel free to ask: ntu-ml-2022spring-ta@googlegroups.com

[**Link to HW13 Slides**](https://docs.google.com/presentation/d/1nCT9XrInF21B4qQAWuODy5sonKDnpGhjtcAwqa75mVU/edit#slide=id.p)

## Outline

* [Packages](#Packages) - intall some required packages.
* [Dataset](#Dataset) - something you need to know about the dataset.
* [Configs](#Configs) - the configs of the experiments, you can change some hyperparameters here.
* [Architecture_Design](#Architecture_Design) - depthwise and pointwise convolution examples and some useful links.
* [Knowledge_Distillation](#Knowledge_Distillation) - KL divergence loss for knowledge distillation and some useful links.
* [Training](#Training) - training loop implementation modified from HW3.
* [Inference](#Inference) - create submission.csv by using the student_best.ckpt from the previous experiment.



### Packages
First, we need to import some useful packages. If the torchsummary package are not intalled, please install it via `pip install torchsummary`

In [2]:
# Import some useful packages for this homework
import numpy as np
import pandas as pd
import torch
import os
import torch.nn as nn
import torch.nn.functional as F
import torchvision.transforms as transforms
from PIL import Image
from torch.utils.data import ConcatDataset, DataLoader, Subset, Dataset # "ConcatDataset" and "Subset" are possibly useful
from torchvision.datasets import DatasetFolder, VisionDataset
from torchsummary import summary
from tqdm.auto import tqdm
import random
import math

# !nvidia-smi # list your current GPU

### Configs
In this part, you can specify some variables and hyperparameters as your configs.

In [3]:
cfg = {
    'dataset_root': './food11-hw13',
    'save_dir': '/content/gdrive/MyDrive/Colab Notebooks/Li_homework/hw13',
    'exp_name': "strong_baseline",
    'batch_size': 64,
    'lr': 3e-4,
    'seed': 20220013,
    'loss_fn_type': 'KD', # simple baseline: CE, medium baseline: KD. See the Knowledge_Distillation part for more information.
    'weight_decay': 1e-5,
    'grad_norm_max': 10,
    'n_epochs': 300, # train more steps to pass the medium baseline.
    'patience': 60,
}

In [4]:
myseed = cfg['seed']  # set a random seed for reproducibility
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(myseed)
torch.manual_seed(myseed)
random.seed(myseed)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(myseed)

save_path = os.path.join(cfg['save_dir'], cfg['exp_name']) # create saving directory
os.makedirs(save_path, exist_ok=True)

# define simple logging functionality
log_fw = open(f"{save_path}/log.txt", 'w') # open log file to save log outputs
def log(text):     # define a logging function to trace the training process
    print(text)
    log_fw.write(str(text)+'\n')
    log_fw.flush()

log(cfg)  # log your configs to the log file

{'dataset_root': './food11-hw13', 'save_dir': '/content/gdrive/MyDrive/Colab Notebooks/Li_homework/hw13', 'exp_name': 'strong_baseline', 'batch_size': 64, 'lr': 0.0003, 'seed': 20220013, 'loss_fn_type': 'KD', 'weight_decay': 1e-05, 'grad_norm_max': 10, 'n_epochs': 300, 'patience': 60}


### Dataset
We use Food11 dataset for this homework, which is similar to homework3. But remember, Please DO NOT utilize the dataset of HW3. We've modified the dataset, so you should only access the dataset by loading it in this kaggle notebook or through the links provided in the HW13 colab notebooks.

In [5]:
# fetch and download the dataset from github (about 1.12G)
# !wget https://github.com/virginiakm1988/ML2022-Spring/raw/main/HW13/food11-hw13.tar.gz 
## backup links:

!wget https://github.com/andybi7676/ml2022spring-hw13/raw/main/food11-hw13.tar.gz -O food11-hw13.tar.gz
# !gdown '1ijKoNmpike_yjUw8SWRVVWVoMOXXqycj' --output food11-hw13.tar.gz

--2022-09-07 01:57:22--  https://github.com/andybi7676/ml2022spring-hw13/raw/main/food11-hw13.tar.gz
Resolving github.com (github.com)... 20.27.177.113
Connecting to github.com (github.com)|20.27.177.113|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://media.githubusercontent.com/media/andybi7676/ml2022spring-hw13/main/food11-hw13.tar.gz [following]
--2022-09-07 01:57:23--  https://media.githubusercontent.com/media/andybi7676/ml2022spring-hw13/main/food11-hw13.tar.gz
Resolving media.githubusercontent.com (media.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to media.githubusercontent.com (media.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1203320552 (1.1G) [application/octet-stream]
Saving to: ‘food11-hw13.tar.gz’


2022-09-07 01:59:35 (212 MB/s) - ‘food11-hw13.tar.gz’ saved [1203320552/1203320552]



In [6]:
# extract the data
!tar -xzf ./food11-hw13.tar.gz # Could take some time
# !tar -xzvf ./food11-hw13.tar.gz # use this command if you want to checkout the whole process.

In [7]:
for dirname, _, filenames in os.walk('./food11-hw13'):
    if len(filenames) > 0:
        print(f"{dirname}: {len(filenames)} files.") # Show the file amounts in each split.

./food11-hw13: 1 files.
./food11-hw13/training: 9866 files.
./food11-hw13/validation: 3430 files.
./food11-hw13/evaluation: 3347 files.


Next, specify train/test transform for image data augmentation.
Torchvision provides lots of useful utilities for image preprocessing, data wrapping as well as data augmentation.

Please refer to [PyTorch official website](https://pytorch.org/vision/stable/transforms.html) for details about different transforms. You can also apply the knowledge or experience you learned in HW3.

In [8]:
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
# define training/testing transforms
test_tfm = transforms.Compose([
    # It is not encouraged to modify this part if you are using the provided teacher model. This transform is stardard and good enough for testing.
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    normalize,
])

train_tfm = transforms.Compose([
    # add some useful transform or augmentation here, according to your experience in HW3.
    transforms.Resize(256),  # You can change this
    transforms.RandomCrop(224), # You can change this, but be aware of that the given teacher model's input size is 224.
    # The training input size of the provided teacher model is (3, 224, 224).
    # Thus, Input size other then 224 might hurt the performance. please be careful.
    transforms.RandomHorizontalFlip(), # You can change this.
    transforms.ToTensor(),
    normalize,
])

In [9]:
path=os.path.join(cfg['dataset_root'],"training")
print(os.listdir(path))
l = sorted([os.path.join(path,x) for x in os.listdir(path) if x.endswith(".jpg")])
print(l)

['2_607.jpg', '4_66.jpg', '9_1377.jpg', '0_989.jpg', '2_523.jpg', '10_516.jpg', '4_318.jpg', '3_257.jpg', '0_909.jpg', '0_513.jpg', '5_210.jpg', '10_13.jpg', '9_87.jpg', '4_757.jpg', '10_472.jpg', '5_110.jpg', '0_774.jpg', '3_557.jpg', '2_153.jpg', '2_870.jpg', '6_147.jpg', '1_168.jpg', '8_720.jpg', '3_556.jpg', '8_848.jpg', '0_538.jpg', '9_503.jpg', '9_349.jpg', '5_1055.jpg', '2_1220.jpg', '6_65.jpg', '5_352.jpg', '5_1191.jpg', '2_988.jpg', '9_81.jpg', '9_420.jpg', '5_1069.jpg', '3_323.jpg', '8_533.jpg', '2_1472.jpg', '9_542.jpg', '5_737.jpg', '8_184.jpg', '0_354.jpg', '0_879.jpg', '2_866.jpg', '9_681.jpg', '9_844.jpg', '9_1429.jpg', '2_1035.jpg', '9_1434.jpg', '9_833.jpg', '6_22.jpg', '10_338.jpg', '5_234.jpg', '8_767.jpg', '0_134.jpg', '4_651.jpg', '6_317.jpg', '4_156.jpg', '9_284.jpg', '1_292.jpg', '9_1113.jpg', '8_199.jpg', '0_369.jpg', '10_427.jpg', '5_998.jpg', '0_748.jpg', '5_518.jpg', '3_300.jpg', '4_506.jpg', '10_605.jpg', '9_297.jpg', '3_259.jpg', '6_320.jpg', '0_939.jpg', '

In [10]:
class FoodDataset(Dataset):
    def __init__(self, path, tfm=test_tfm, files = None):
        super().__init__()
        self.path = path
        self.files = sorted([os.path.join(path,x) for x in os.listdir(path) if x.endswith(".jpg")])
        if files != None:
            self.files = files
        print(f"One {path} sample",self.files[0])
        self.transform = tfm
  
    def __len__(self):
        return len(self.files)
  
    def __getitem__(self,idx):
        fname = self.files[idx]
        im = Image.open(fname)
        im = self.transform(im)
        try:
            label = int(fname.split("/")[-1].split("_")[0])
        except:
            label = -1 # test has no label
        return im,label

In [11]:
# Form train/valid dataloaders
train_set = FoodDataset(os.path.join(cfg['dataset_root'],"training"), tfm=train_tfm)
train_loader = DataLoader(train_set, batch_size=cfg['batch_size'], shuffle=True, num_workers=0, pin_memory=True)

valid_set = FoodDataset(os.path.join(cfg['dataset_root'], "validation"), tfm=test_tfm)
valid_loader = DataLoader(valid_set, batch_size=cfg['batch_size'], shuffle=False, num_workers=0, pin_memory=True)

One ./food11-hw13/training sample ./food11-hw13/training/0_0.jpg
One ./food11-hw13/validation sample ./food11-hw13/validation/0_0.jpg


### Architecture_Design

In this homework, you have to design a smaller network and make it perform well. Apparently, a well-designed architecture is crucial for such task. Here, we introduce the depthwise and pointwise convolution. These variants of convolution are some common techniques for architecture design when it comes to network compression.

<img src="https://i.imgur.com/LFDKHOp.png" width=400px>

* explanation of depthwise and pointwise convolutions:
    * [prof. Hung-yi Lee's slides(p.24~p.30, especially p.28)](https://speech.ee.ntu.edu.tw/~hylee/ml/ml2021-course-data/tiny_v7.pdf)

* other useful techniques
    * [group convolution](https://www.researchgate.net/figure/The-transformations-within-a-layer-in-DenseNets-left-and-CondenseNets-at-training-time_fig2_321325862) (Actually, depthwise convolution is a specific type of group convolution)
    * [SqueezeNet](!https://arxiv.org/abs/1602.07360)
    * [MobileNet](!https://arxiv.org/abs/1704.04861)
    * [ShuffleNet](!https://arxiv.org/abs/1707.01083)
    * [Xception](!https://arxiv.org/abs/1610.02357)
    * [GhostNet](!https://arxiv.org/abs/1911.11907)


After introducing depthwise and pointwise convolutions, let's define the **student network architecture**. Here, we have a very simple network formed by some regular convolution layers and pooling layers. You can replace the regular convolution layers with the depthwise and pointwise convolutions. In this way, you can further increase the depth or the width of your network architecture.

After specifying the student network architecture, please use `torchsummary` package to get information about the network and verify the total number of parameters. Note that the total params of your student network should not exceed the limit (`Total params` in `torchsummary` ≤ 100,000). 

In [12]:
def dwpw_conv(in_channels, out_channels, kernel_size, stride=1, padding=1,bias=False):
    return nn.Sequential(
        nn.Conv2d(in_channels, in_channels, kernel_size, stride=stride, padding=padding,bias=bias, groups=in_channels), #depthwise convolution
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, out_channels, 1,  bias= bias,), # pointwise convolution
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True)
    )

class StudentNet(nn.Module):
    def __init__(self, inplanes = 64):
        super().__init__()
        self.inplanes = inplanes
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        self.layer1 = dwpw_conv(inplanes, inplanes, kernel_size=3)
        self.layer2 = dwpw_conv(inplanes, 128, kernel_size=3, stride=2)
        self.layer3 = dwpw_conv(128, 256, kernel_size=3, stride=2)
        self.layer4 = dwpw_conv(256, 141, kernel_size=3, stride=2)

        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(141, 11)

    def forward(self, x):
        x=self.conv1(x)
        x=self.bn1(x)
        x=self.relu(x)
        x=self.maxpool(x)

        x=self.layer1(x)
        x=self.layer2(x)
        x=self.layer3(x)
        x=self.layer4(x)

        x=self.avgpool(x)
        x = torch.flatten(x, 1)
        x=self.fc(x)

        return x

In [13]:
# DO NOT modify this block and please make sure that this block can run sucessfully. 
student_model = StudentNet()
summary(student_model, (3, 224, 224), device='cpu')
# You have to copy&paste the results of this block to HW13 GradeScope. 

----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]             576
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]           4,096
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
           Conv2d-11           [-1, 64, 28, 28]             576
      BatchNorm2d-12           [-1, 64, 28, 28]             128
             ReLU-13           [-1, 64, 28, 28]               0
           Conv2d-14          [-1, 128,

In [14]:
# Load provided teacher model (model architecture: resnet18, num_classes=11, test-acc ~= 89.9%)
teacher_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False, num_classes=11)
# load state dict
teacher_ckpt_path = os.path.join(cfg['dataset_root'], "resnet18_teacher.ckpt")
teacher_model.load_state_dict(torch.load(teacher_ckpt_path, map_location='cpu'))
# Now you already know the teacher model's architecture. You can take advantage of it if you want to pass the strong or boss baseline. 
# Source code of resnet in pytorch: (https://github.com/pytorch/vision/blob/main/torchvision/models/resnet.py)
# You can also see the summary of teacher model. There are 11,182,155 parameters totally in the teacher model
summary(teacher_model, (3, 224, 224), device='cpu')

Downloading: "https://github.com/pytorch/vision/zipball/v0.10.0" to /root/.cache/torch/hub/v0.10.0.zip
  f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "


----------------------------------------------------------------
        Layer (type)               Output Shape         Param #
            Conv2d-1         [-1, 64, 112, 112]           9,408
       BatchNorm2d-2         [-1, 64, 112, 112]             128
              ReLU-3         [-1, 64, 112, 112]               0
         MaxPool2d-4           [-1, 64, 56, 56]               0
            Conv2d-5           [-1, 64, 56, 56]          36,864
       BatchNorm2d-6           [-1, 64, 56, 56]             128
              ReLU-7           [-1, 64, 56, 56]               0
            Conv2d-8           [-1, 64, 56, 56]          36,864
       BatchNorm2d-9           [-1, 64, 56, 56]             128
             ReLU-10           [-1, 64, 56, 56]               0
       BasicBlock-11           [-1, 64, 56, 56]               0
           Conv2d-12           [-1, 64, 56, 56]          36,864
      BatchNorm2d-13           [-1, 64, 56, 56]             128
             ReLU-14           [-1, 64,

In [15]:
Slayer1out, Slayer2out, Slayer3out, Tlayer1out, Tlayer2out, Tlayer3out = [], [], [], [], [], []

def hookS1(module, input, output):
  Slayer1out.append(output)
  return None

def hookS2(module, input, output):
  Slayer2out.append(output)
  return None

def hookS3(module, input, output):
  Slayer3out.append(output)
  return None

def hookT1(module, input, output):
  Tlayer1out.append(output)
  return None

def hookT2(module, input, output):
  Tlayer2out.append(output)
  return None

def hookT3(module, input, output):
  Tlayer3out.append(output)
  return None

student_model = StudentNet()
student_model.layer1.register_forward_hook(hookS1)
student_model.layer2.register_forward_hook(hookS2)
student_model.layer3.register_forward_hook(hookS3)

teacher_model = torch.hub.load('pytorch/vision:v0.10.0', 'resnet18', pretrained=False, num_classes=11)
teacher_ckpt_path = os.path.join(cfg['dataset_root'], "resnet18_teacher.ckpt")
teacher_model.load_state_dict(torch.load(teacher_ckpt_path, map_location='cpu'))

teacher_model.layer1.register_forward_hook(hookT1)
teacher_model.layer2.register_forward_hook(hookT2)
teacher_model.layer3.register_forward_hook(hookT3)

Using cache found in /root/.cache/torch/hub/pytorch_vision_v0.10.0


<torch.utils.hooks.RemovableHandle at 0x7fa0723a06d0>

### Knowledge_Distillation

<img src="https://i.imgur.com/H2aF7Rv.png=100x" width="400px">

Since we have a learned big model, let it teach the other small model. In implementation, let the training target be the prediction of big model instead of the ground truth.

**Why it works?**
* If the data is not clean, then the prediction of big model could ignore the noise of the data with wrong labeled.
* There might have some relations between classes, so soft labels from teacher model might be useful. For example, Number 8 is more similar to 6, 9, 0 than 1, 7.


**How to implement?**
* $Loss = \alpha T^2 \times KL(p || q) + (1-\alpha)(\text{Original Cross Entropy Loss}), \text{where } p=softmax(\frac{\text{student's logits}}{T}), \text{and } q=softmax(\frac{\text{teacher's logits}}{T})$
* very useful link: [pytorch docs of KLDivLoss with examples](!https://pytorch.org/docs/stable/generated/torch.nn.KLDivLoss.html)
* original paper: [Distilling the Knowledge in a Neural Network](!https://arxiv.org/abs/1503.02531)

In [16]:
# Implement the loss function with KL divergence loss for knowledge distillation.
# You also have to copy-paste this whole block to HW13 GradeScope. 
def loss_fn_kd(student_logits, labels, teacher_logits, alpha=0.99, temperature=25.0):
  kl_loss = nn.KLDivLoss(reduction="batchmean", log_target=True)
  CE_loss = nn.CrossEntropyLoss()
  p = F.log_softmax(student_logits / temperature, dim=1)
  q = F.log_softmax(teacher_logits / temperature, dim=1)
  loss=alpha * (temperature**2) * kl_loss(p, q) + (1-alpha) * CE_loss(student_logits,labels)

  return loss
    

In [17]:
# choose the loss function by the config
if cfg['loss_fn_type'] == 'CE':
    # For the classification task, we use cross-entropy as the default loss function.
    loss_fn = nn.CrossEntropyLoss() # loss function for simple baseline.

if cfg['loss_fn_type'] == 'KD': # KD stands for knowledge distillation
    loss_fn = loss_fn_kd # implement loss_fn_kd for the report question and the medium baseline.

# You can also adopt other types of knowledge distillation techniques for strong and boss baseline, but use function name other than `loss_fn_kd`
# For example:
# def loss_fn_custom_kd():
#     pass
# if cfg['loss_fn_type'] == 'custom_kd':
#     loss_fn = loss_fn_custom_kd

# "cuda" only when GPUs are available.
device = "cuda" if torch.cuda.is_available() else "cpu"
log(f"device: {device}")

# The number of training epochs and patience.
n_epochs = cfg['n_epochs']
patience = cfg['patience'] # If no improvement in 'patience' epochs, early stop

device: cuda


In [18]:
def use_pretrain():
    student_model.conv1.weight = teacher_model.conv1.weight
    student_model.bn1.weight = teacher_model.bn1.weight
    student_model.bn1.bias = teacher_model.bn1.bias
    student_model.bn1.running_mean = teacher_model.bn1.running_mean
    student_model.bn1.running_var = teacher_model.bn1.running_var
    student_model.conv1.weight.requires_grad = False
    student_model.bn1.weight.requires_grad = False
    student_model.bn1.bias.requires_grad = False
use_pretrain()

### Training
implement training loop for simple baseline, feel free to modify it.

In [19]:
# Initialize a model, and put it on the device specified.
student_model.to(device)
teacher_model.to(device) # MEDIUM BASELINE

# Initialize optimizer, you may fine-tune some hyperparameters such as learning rate on your own.
optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, student_model.parameters()), lr=cfg['lr'], weight_decay=cfg['weight_decay']) 
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=9, T_mult=2, eta_min=1e-5)

# Initialize trackers, these are not parameters and should not be changed
stale = 0
best_acc = 0.0

teacher_model.eval()  # MEDIUM BASELINE
for epoch in range(n_epochs):

    # ---------- Training ----------
    # Make sure the model is in train mode before training.
    student_model.train()

    # These are used to record information in training.
    train_loss = []
    train_loss_hidden = []
    train_accs = []
    train_lens = []
    p=epoch/(n_epochs-1)
    lamb= p * p # 0-1
    for batch in tqdm(train_loader):
        Slayer1out, Slayer2out, Slayer3out, Tlayer1out, Tlayer2out, Tlayer3out = [], [], [], [], [], []
        # A batch consists of image data and corresponding labels.
        imgs, labels = batch
        imgs = imgs.to(device)
        labels = labels.to(device)
        #imgs = imgs.half()
        #print(imgs.shape,labels.shape)

        # Forward the data. (Make sure data and model are on the same device.)
        with torch.no_grad():  # MEDIUM BASELINE
            teacher_logits = teacher_model(imgs)  # MEDIUM BASELINE

        
        logits = student_model(imgs)
        slayer1out, slayer2out, slayer3out, tlayer1out, tlayer2out, tlayer3out = \
          Slayer1out[0], Slayer2out[0], Slayer3out[0], Tlayer1out[0], Tlayer2out[0], Tlayer3out[0]
        # Calculate the cross-entropy loss.
        # We don't need to apply softmax before computing cross-entropy as it is done automatically.
        loss_output = loss_fn(logits, labels, teacher_logits) # MEDIUM BASELINE
        loss_hidden = F.smooth_l1_loss(slayer1out, tlayer1out) + F.smooth_l1_loss(slayer2out, tlayer2out) + F.smooth_l1_loss(slayer3out, tlayer3out)
        
        loss =  loss_hidden + 10 * lamb * loss_output
        # Gradients stored in the parameters in the previous step should be cleared out first.
        optimizer.zero_grad()

        # Compute the gradients for parameters.
        loss.backward()

        # Clip the gradient norms for stable training.
        grad_norm = nn.utils.clip_grad_norm_(student_model.parameters(), max_norm=cfg['grad_norm_max'])

        # Update the parameters with computed gradients.
        optimizer.step()

        # Compute the accuracy for current batch.
        acc = (logits.argmax(dim=-1) == labels).float().sum()

        # Record the loss and accuracy.
        train_batch_len = len(imgs)
        train_loss.append(loss.item() * train_batch_len)
        train_loss_hidden.append(loss_hidden.item() * train_batch_len)
        train_accs.append(acc)
        train_lens.append(train_batch_len)
        
    train_loss = sum(train_loss) / sum(train_lens)
    train_acc = sum(train_accs) / sum(train_lens)
    train_hidden_loss = sum(train_loss_hidden) / sum(train_lens)

    # Print the information.
    log(f"[ Train | {epoch + 1:03d}/{n_epochs:03d} ] loss = {train_loss:.5f}, hidden_loss = {train_hidden_loss:.5f}, acc = {train_acc:.5f}")

  
# ---------- Validation ----------
    # Make sure the model is in eval mode so that some modules like dropout are disabled and work normally.
    student_model.eval()

    # These are used to record information in validation.
    valid_loss = []
    valid_accs = []
    valid_lens = []
    # Iterate the validation set by batches.
    for batch in tqdm(valid_loader):

        # A batch consists of image data and corresponding labels.
        imgs, labels = batch
        imgs = imgs.to(device)
        labels = labels.to(device)

        # We don't need gradient in validation.
        # Using torch.no_grad() accelerates the forward process.
        with torch.no_grad():
            logits = student_model(imgs)
            teacher_logits = teacher_model(imgs) # MEDIUM BASELINE

        # We can still compute the loss (but not the gradient).
        loss = loss_fn(logits, labels, teacher_logits) # MEDIUM BASELINE
        # loss = loss_fn(logits, labels) # SIMPLE BASELINE

        # Compute the accuracy for current batch.
        acc = (logits.argmax(dim=-1) == labels).float().sum()

        # Record the loss and accuracy.
        batch_len = len(imgs)
        valid_loss.append(loss.item() * batch_len)
        valid_accs.append(acc)
        valid_lens.append(batch_len)
        #break

    # The average loss and accuracy for entire validation set is the average of the recorded values.
    valid_loss = sum(valid_loss) / sum(valid_lens)
    valid_acc = sum(valid_accs) / sum(valid_lens)

    # update logs
    
    if valid_acc > best_acc:
        log(f"[ Valid | {epoch + 1:03d}/{n_epochs:03d} ] loss = {valid_loss:.5f}, acc = {valid_acc:.5f} -> best")
    else:
        log(f"[ Valid | {epoch + 1:03d}/{n_epochs:03d} ] loss = {valid_loss:.5f}, acc = {valid_acc:.5f}")


    # save models
    if valid_acc > best_acc:
        log(f"Best model found at epoch {epoch}, saving model")
        torch.save(student_model.state_dict(), f"{save_path}/student_best.ckpt") # only save best to prevent output memory exceed error
        best_acc = valid_acc
        stale = 0
    else:
        stale += 1
        if stale > patience:
            log(f"No improvment {patience} consecutive epochs, early stopping")
            break
log("Finish training")
log_fw.close()

  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 001/300 ] loss = 1.56156, hidden_loss = 1.56156, acc = 0.08697


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 001/300 ] loss = 30.02279, acc = 0.07172 -> best
Best model found at epoch 0, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 002/300 ] loss = 1.33882, hidden_loss = 1.33528, acc = 0.25573


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 002/300 ] loss = 25.92514, acc = 0.25364 -> best
Best model found at epoch 1, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 003/300 ] loss = 1.22810, hidden_loss = 1.21502, acc = 0.29151


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 003/300 ] loss = 23.33335, acc = 0.33294 -> best
Best model found at epoch 2, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 004/300 ] loss = 1.15315, hidden_loss = 1.12621, acc = 0.36094


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 004/300 ] loss = 21.40481, acc = 0.40875 -> best
Best model found at epoch 3, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 005/300 ] loss = 1.10181, hidden_loss = 1.05716, acc = 0.41506


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 005/300 ] loss = 19.50606, acc = 0.44344 -> best
Best model found at epoch 4, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 006/300 ] loss = 1.06866, hidden_loss = 1.00387, acc = 0.44111


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 006/300 ] loss = 18.11941, acc = 0.47230 -> best
Best model found at epoch 5, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 007/300 ] loss = 1.04949, hidden_loss = 0.96276, acc = 0.48439


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 007/300 ] loss = 16.72948, acc = 0.51633 -> best
Best model found at epoch 6, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 008/300 ] loss = 1.04407, hidden_loss = 0.93289, acc = 0.51632


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 008/300 ] loss = 16.48719, acc = 0.48076


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 009/300 ] loss = 1.04622, hidden_loss = 0.90971, acc = 0.53760


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 009/300 ] loss = 14.84612, acc = 0.56006 -> best
Best model found at epoch 8, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 010/300 ] loss = 1.05466, hidden_loss = 0.89083, acc = 0.57278


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 010/300 ] loss = 13.93058, acc = 0.57347 -> best
Best model found at epoch 9, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 011/300 ] loss = 1.06549, hidden_loss = 0.87595, acc = 0.58899


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 011/300 ] loss = 13.78499, acc = 0.57697 -> best
Best model found at epoch 10, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 012/300 ] loss = 1.08534, hidden_loss = 0.86466, acc = 0.60774


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 012/300 ] loss = 12.66919, acc = 0.62157 -> best
Best model found at epoch 11, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 013/300 ] loss = 1.10810, hidden_loss = 0.85531, acc = 0.62122


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 013/300 ] loss = 12.48677, acc = 0.61137


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 014/300 ] loss = 1.13122, hidden_loss = 0.84756, acc = 0.63795


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 014/300 ] loss = 12.01985, acc = 0.65773 -> best
Best model found at epoch 13, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 015/300 ] loss = 1.16095, hidden_loss = 0.84173, acc = 0.65832


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 015/300 ] loss = 12.33761, acc = 0.61195


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 016/300 ] loss = 1.19439, hidden_loss = 0.83717, acc = 0.66572


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 016/300 ] loss = 11.51333, acc = 0.66297 -> best
Best model found at epoch 15, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 017/300 ] loss = 1.22739, hidden_loss = 0.83350, acc = 0.67981


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 017/300 ] loss = 11.14961, acc = 0.66793 -> best
Best model found at epoch 16, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 018/300 ] loss = 1.26609, hidden_loss = 0.83086, acc = 0.68609


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 018/300 ] loss = 10.80454, acc = 0.67114 -> best
Best model found at epoch 17, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 019/300 ] loss = 1.30509, hidden_loss = 0.82830, acc = 0.69653


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 019/300 ] loss = 10.96757, acc = 0.70845 -> best
Best model found at epoch 18, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 020/300 ] loss = 1.34800, hidden_loss = 0.82700, acc = 0.69816


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 020/300 ] loss = 10.55460, acc = 0.68776


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 021/300 ] loss = 1.39646, hidden_loss = 0.82669, acc = 0.70667


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 021/300 ] loss = 10.25124, acc = 0.69708


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 022/300 ] loss = 1.43974, hidden_loss = 0.82623, acc = 0.71914


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 022/300 ] loss = 10.85144, acc = 0.67347


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 023/300 ] loss = 1.48928, hidden_loss = 0.82591, acc = 0.72562


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 023/300 ] loss = 9.94532, acc = 0.71574 -> best
Best model found at epoch 22, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 024/300 ] loss = 1.54055, hidden_loss = 0.82585, acc = 0.72228


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 024/300 ] loss = 10.08396, acc = 0.69708


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 025/300 ] loss = 1.59120, hidden_loss = 0.82604, acc = 0.73029


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 025/300 ] loss = 10.47029, acc = 0.69213


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 026/300 ] loss = 1.63904, hidden_loss = 0.82653, acc = 0.73981


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 026/300 ] loss = 9.62304, acc = 0.73703 -> best
Best model found at epoch 25, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 027/300 ] loss = 1.69015, hidden_loss = 0.82752, acc = 0.74640


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 027/300 ] loss = 9.54420, acc = 0.73032


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 028/300 ] loss = 1.75138, hidden_loss = 0.82830, acc = 0.74904


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 028/300 ] loss = 10.72280, acc = 0.68863


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 029/300 ] loss = 1.81563, hidden_loss = 0.82978, acc = 0.75411


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 029/300 ] loss = 9.32336, acc = 0.74315 -> best
Best model found at epoch 28, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 030/300 ] loss = 1.86386, hidden_loss = 0.83031, acc = 0.75704


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 030/300 ] loss = 9.61368, acc = 0.73848


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 031/300 ] loss = 1.92221, hidden_loss = 0.83134, acc = 0.76444


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 031/300 ] loss = 10.21682, acc = 0.69359


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 032/300 ] loss = 1.99346, hidden_loss = 0.83249, acc = 0.75958


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 032/300 ] loss = 9.20324, acc = 0.72974


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 033/300 ] loss = 2.05054, hidden_loss = 0.83434, acc = 0.76556


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 033/300 ] loss = 8.82996, acc = 0.74257


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 034/300 ] loss = 2.11311, hidden_loss = 0.83503, acc = 0.77235


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 034/300 ] loss = 9.42307, acc = 0.76152 -> best
Best model found at epoch 33, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 035/300 ] loss = 2.18997, hidden_loss = 0.83656, acc = 0.77772


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 035/300 ] loss = 8.90355, acc = 0.75685


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 036/300 ] loss = 2.24648, hidden_loss = 0.83772, acc = 0.77346


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 036/300 ] loss = 9.54816, acc = 0.73032


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 037/300 ] loss = 2.32995, hidden_loss = 0.83921, acc = 0.77924


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 037/300 ] loss = 8.94144, acc = 0.74227


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 038/300 ] loss = 2.38151, hidden_loss = 0.84131, acc = 0.78330


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 038/300 ] loss = 9.02937, acc = 0.75248


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 039/300 ] loss = 2.46635, hidden_loss = 0.84303, acc = 0.78765


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 039/300 ] loss = 8.91956, acc = 0.75394


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 040/300 ] loss = 2.51540, hidden_loss = 0.84446, acc = 0.79201


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 040/300 ] loss = 8.83533, acc = 0.75569


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 041/300 ] loss = 2.61518, hidden_loss = 0.84586, acc = 0.78968


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 041/300 ] loss = 8.99575, acc = 0.75073


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 042/300 ] loss = 2.68662, hidden_loss = 0.84736, acc = 0.78948


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 042/300 ] loss = 8.81429, acc = 0.75073


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 043/300 ] loss = 2.75410, hidden_loss = 0.84880, acc = 0.78978


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 043/300 ] loss = 8.51518, acc = 0.75335


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 044/300 ] loss = 2.85390, hidden_loss = 0.85127, acc = 0.79526


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 044/300 ] loss = 8.97015, acc = 0.73790


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 045/300 ] loss = 2.91687, hidden_loss = 0.85247, acc = 0.80387


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 045/300 ] loss = 8.60054, acc = 0.75452


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 046/300 ] loss = 2.99317, hidden_loss = 0.85403, acc = 0.80195


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 046/300 ] loss = 8.81775, acc = 0.77143 -> best
Best model found at epoch 45, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 047/300 ] loss = 3.07859, hidden_loss = 0.85601, acc = 0.80407


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 047/300 ] loss = 8.57757, acc = 0.76239


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 048/300 ] loss = 3.16094, hidden_loss = 0.85687, acc = 0.80418


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 048/300 ] loss = 8.71056, acc = 0.75714


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 049/300 ] loss = 3.23361, hidden_loss = 0.85798, acc = 0.80519


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 049/300 ] loss = 8.75745, acc = 0.74227


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 050/300 ] loss = 3.33785, hidden_loss = 0.86016, acc = 0.81066


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 050/300 ] loss = 8.44910, acc = 0.76822


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 051/300 ] loss = 3.38710, hidden_loss = 0.86134, acc = 0.81482


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 051/300 ] loss = 8.23237, acc = 0.76501


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 052/300 ] loss = 3.49011, hidden_loss = 0.86240, acc = 0.81259


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 052/300 ] loss = 8.15161, acc = 0.76968


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 053/300 ] loss = 3.57316, hidden_loss = 0.86407, acc = 0.81239


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 053/300 ] loss = 8.67695, acc = 0.76764


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 054/300 ] loss = 3.68639, hidden_loss = 0.86565, acc = 0.81533


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 054/300 ] loss = 8.25602, acc = 0.77959 -> best
Best model found at epoch 53, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 055/300 ] loss = 3.73943, hidden_loss = 0.86678, acc = 0.82090


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 055/300 ] loss = 8.30279, acc = 0.77143


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 056/300 ] loss = 3.84093, hidden_loss = 0.86804, acc = 0.81634


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 056/300 ] loss = 8.50033, acc = 0.76443


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 057/300 ] loss = 3.94608, hidden_loss = 0.86979, acc = 0.82343


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 057/300 ] loss = 7.99003, acc = 0.76939


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 058/300 ] loss = 4.03892, hidden_loss = 0.87080, acc = 0.82516


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 058/300 ] loss = 8.46803, acc = 0.75131


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 059/300 ] loss = 4.14369, hidden_loss = 0.87186, acc = 0.82161


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 059/300 ] loss = 8.04242, acc = 0.77434


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 060/300 ] loss = 4.24930, hidden_loss = 0.87397, acc = 0.82972


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 060/300 ] loss = 8.01437, acc = 0.78192 -> best
Best model found at epoch 59, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 061/300 ] loss = 4.32966, hidden_loss = 0.87523, acc = 0.82637


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 061/300 ] loss = 8.03879, acc = 0.77143


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 062/300 ] loss = 4.44278, hidden_loss = 0.87621, acc = 0.82860


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 062/300 ] loss = 8.12798, acc = 0.76501


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 063/300 ] loss = 4.53910, hidden_loss = 0.87775, acc = 0.82992


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 063/300 ] loss = 7.94641, acc = 0.77813


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 064/300 ] loss = 4.64186, hidden_loss = 0.87901, acc = 0.83539


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 064/300 ] loss = 8.12293, acc = 0.76676


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 065/300 ] loss = 4.73536, hidden_loss = 0.88030, acc = 0.83225


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 065/300 ] loss = 7.88195, acc = 0.77551


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 066/300 ] loss = 4.87260, hidden_loss = 0.88131, acc = 0.83590


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 066/300 ] loss = 8.08656, acc = 0.77872


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 067/300 ] loss = 4.96704, hidden_loss = 0.88169, acc = 0.83387


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 067/300 ] loss = 8.89018, acc = 0.74869


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 068/300 ] loss = 5.05571, hidden_loss = 0.88298, acc = 0.83712


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 068/300 ] loss = 7.88042, acc = 0.79271 -> best
Best model found at epoch 67, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 069/300 ] loss = 5.15478, hidden_loss = 0.88395, acc = 0.84097


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 069/300 ] loss = 8.18079, acc = 0.77026


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 070/300 ] loss = 5.34738, hidden_loss = 0.88611, acc = 0.83803


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 070/300 ] loss = 7.83759, acc = 0.78630


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 071/300 ] loss = 5.40677, hidden_loss = 0.88689, acc = 0.83884


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 071/300 ] loss = 7.76849, acc = 0.78630


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 072/300 ] loss = 5.49025, hidden_loss = 0.88819, acc = 0.84310


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 072/300 ] loss = 7.86215, acc = 0.77697


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 073/300 ] loss = 5.63842, hidden_loss = 0.88984, acc = 0.83864


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 073/300 ] loss = 7.90976, acc = 0.78192


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 074/300 ] loss = 5.71720, hidden_loss = 0.89059, acc = 0.84208


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 074/300 ] loss = 8.01751, acc = 0.77259


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 075/300 ] loss = 5.86501, hidden_loss = 0.89170, acc = 0.84381


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 075/300 ] loss = 7.96221, acc = 0.79767 -> best
Best model found at epoch 74, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 076/300 ] loss = 5.98554, hidden_loss = 0.89296, acc = 0.84360


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 076/300 ] loss = 8.06153, acc = 0.76939


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 077/300 ] loss = 6.10168, hidden_loss = 0.89448, acc = 0.84634


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 077/300 ] loss = 7.95013, acc = 0.78950


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 078/300 ] loss = 6.22308, hidden_loss = 0.89564, acc = 0.84786


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 078/300 ] loss = 8.31971, acc = 0.78571


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 079/300 ] loss = 6.31725, hidden_loss = 0.89631, acc = 0.84908


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 079/300 ] loss = 7.80886, acc = 0.79563


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 080/300 ] loss = 6.47497, hidden_loss = 0.89783, acc = 0.84847


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 080/300 ] loss = 7.71089, acc = 0.78863


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 081/300 ] loss = 6.65041, hidden_loss = 0.89895, acc = 0.84391


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 081/300 ] loss = 7.65482, acc = 0.78513


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 082/300 ] loss = 6.74858, hidden_loss = 0.90002, acc = 0.84695


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 082/300 ] loss = 8.31413, acc = 0.78017


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 083/300 ] loss = 6.81078, hidden_loss = 0.90112, acc = 0.85060


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 083/300 ] loss = 7.67119, acc = 0.79825 -> best
Best model found at epoch 82, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 084/300 ] loss = 6.97048, hidden_loss = 0.90240, acc = 0.85090


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 084/300 ] loss = 7.74565, acc = 0.79359


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 085/300 ] loss = 7.08811, hidden_loss = 0.90401, acc = 0.85455


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 085/300 ] loss = 7.79607, acc = 0.77930


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 086/300 ] loss = 7.26689, hidden_loss = 0.90409, acc = 0.84908


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 086/300 ] loss = 7.79551, acc = 0.78659


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 087/300 ] loss = 7.39491, hidden_loss = 0.90494, acc = 0.85536


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 087/300 ] loss = 8.18017, acc = 0.78601


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 088/300 ] loss = 7.49125, hidden_loss = 0.90659, acc = 0.85303


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 088/300 ] loss = 7.91032, acc = 0.79009


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 089/300 ] loss = 7.70274, hidden_loss = 0.90767, acc = 0.85486


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 089/300 ] loss = 7.91113, acc = 0.79009


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 090/300 ] loss = 7.75307, hidden_loss = 0.90868, acc = 0.85638


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 090/300 ] loss = 7.70489, acc = 0.79446


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 091/300 ] loss = 7.91966, hidden_loss = 0.90953, acc = 0.85617


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 091/300 ] loss = 7.81475, acc = 0.78513


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 092/300 ] loss = 8.05782, hidden_loss = 0.91080, acc = 0.85830


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 092/300 ] loss = 7.93797, acc = 0.79446


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 093/300 ] loss = 8.15793, hidden_loss = 0.91199, acc = 0.85790


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 093/300 ] loss = 7.69677, acc = 0.78688


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 094/300 ] loss = 8.29061, hidden_loss = 0.91222, acc = 0.86175


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 094/300 ] loss = 7.61308, acc = 0.80058 -> best
Best model found at epoch 93, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 095/300 ] loss = 8.51774, hidden_loss = 0.91389, acc = 0.85536


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 095/300 ] loss = 8.17788, acc = 0.78571


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 096/300 ] loss = 8.56981, hidden_loss = 0.91503, acc = 0.86236


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 096/300 ] loss = 7.42014, acc = 0.80175 -> best
Best model found at epoch 95, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 097/300 ] loss = 8.72215, hidden_loss = 0.91597, acc = 0.85952


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 097/300 ] loss = 7.75480, acc = 0.77843


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 098/300 ] loss = 8.83995, hidden_loss = 0.91663, acc = 0.85587


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 098/300 ] loss = 7.66502, acc = 0.80525 -> best
Best model found at epoch 97, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 099/300 ] loss = 9.02397, hidden_loss = 0.91713, acc = 0.86185


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 099/300 ] loss = 7.90674, acc = 0.77843


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 100/300 ] loss = 9.18142, hidden_loss = 0.91854, acc = 0.86063


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 100/300 ] loss = 7.81212, acc = 0.79271


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 101/300 ] loss = 9.39402, hidden_loss = 0.91952, acc = 0.86084


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 101/300 ] loss = 7.78739, acc = 0.79825


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 102/300 ] loss = 9.36419, hidden_loss = 0.92130, acc = 0.86489


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 102/300 ] loss = 7.56844, acc = 0.80233


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 103/300 ] loss = 9.57390, hidden_loss = 0.92173, acc = 0.86286


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 103/300 ] loss = 7.69743, acc = 0.78426


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 104/300 ] loss = 9.79200, hidden_loss = 0.92297, acc = 0.86154


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 104/300 ] loss = 7.54052, acc = 0.78863


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 105/300 ] loss = 9.92641, hidden_loss = 0.92386, acc = 0.86813


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 105/300 ] loss = 7.61671, acc = 0.79184


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 106/300 ] loss = 10.01198, hidden_loss = 0.92532, acc = 0.86682


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 106/300 ] loss = 7.83697, acc = 0.76735


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 107/300 ] loss = 10.22602, hidden_loss = 0.92705, acc = 0.86925


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 107/300 ] loss = 7.85273, acc = 0.79359


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 108/300 ] loss = 10.37809, hidden_loss = 0.92736, acc = 0.86965


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 108/300 ] loss = 7.60839, acc = 0.79650


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 109/300 ] loss = 10.54565, hidden_loss = 0.92857, acc = 0.86732


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 109/300 ] loss = 7.88398, acc = 0.77813


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 110/300 ] loss = 10.69398, hidden_loss = 0.92973, acc = 0.86803


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 110/300 ] loss = 7.42224, acc = 0.79738


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 111/300 ] loss = 10.94524, hidden_loss = 0.93071, acc = 0.86580


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 111/300 ] loss = 7.42070, acc = 0.79796


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 112/300 ] loss = 11.03106, hidden_loss = 0.93173, acc = 0.86823


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 112/300 ] loss = 7.38930, acc = 0.79679


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 113/300 ] loss = 11.08635, hidden_loss = 0.93241, acc = 0.86864


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 113/300 ] loss = 7.75550, acc = 0.78746


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 114/300 ] loss = 11.26832, hidden_loss = 0.93282, acc = 0.87077


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 114/300 ] loss = 7.56595, acc = 0.79009


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 115/300 ] loss = 11.50499, hidden_loss = 0.93351, acc = 0.86955


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 115/300 ] loss = 7.78405, acc = 0.79038


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 116/300 ] loss = 11.67612, hidden_loss = 0.93426, acc = 0.87067


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 116/300 ] loss = 7.37101, acc = 0.79708


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 117/300 ] loss = 11.85861, hidden_loss = 0.93532, acc = 0.86975


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 117/300 ] loss = 7.63199, acc = 0.79767


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 118/300 ] loss = 12.00650, hidden_loss = 0.93616, acc = 0.86986


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 118/300 ] loss = 7.48018, acc = 0.79650


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 119/300 ] loss = 12.17899, hidden_loss = 0.93742, acc = 0.87290


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 119/300 ] loss = 7.67746, acc = 0.79184


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 120/300 ] loss = 12.27956, hidden_loss = 0.93802, acc = 0.87421


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 120/300 ] loss = 7.72862, acc = 0.80437


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 121/300 ] loss = 12.53373, hidden_loss = 0.93906, acc = 0.87543


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 121/300 ] loss = 7.54528, acc = 0.79767


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 122/300 ] loss = 12.62521, hidden_loss = 0.93969, acc = 0.87371


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 122/300 ] loss = 7.59034, acc = 0.79475


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 123/300 ] loss = 12.86021, hidden_loss = 0.94068, acc = 0.87340


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 123/300 ] loss = 7.59259, acc = 0.79329


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 124/300 ] loss = 13.06239, hidden_loss = 0.94204, acc = 0.87310


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 124/300 ] loss = 7.35050, acc = 0.80000


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 125/300 ] loss = 13.20808, hidden_loss = 0.94208, acc = 0.87259


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 125/300 ] loss = 7.48759, acc = 0.80058


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 126/300 ] loss = 13.34243, hidden_loss = 0.94324, acc = 0.87715


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 126/300 ] loss = 7.44822, acc = 0.79009


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 127/300 ] loss = 13.57897, hidden_loss = 0.94332, acc = 0.87077


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 127/300 ] loss = 7.36250, acc = 0.79359


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 128/300 ] loss = 13.64377, hidden_loss = 0.94457, acc = 0.87675


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 128/300 ] loss = 7.49747, acc = 0.78980


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 129/300 ] loss = 13.96403, hidden_loss = 0.94478, acc = 0.87685


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 129/300 ] loss = 7.60534, acc = 0.79854


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 130/300 ] loss = 14.06539, hidden_loss = 0.94600, acc = 0.87229


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 130/300 ] loss = 7.58564, acc = 0.79155


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 131/300 ] loss = 14.11588, hidden_loss = 0.94607, acc = 0.87695


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 131/300 ] loss = 7.74194, acc = 0.79125


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 132/300 ] loss = 14.38508, hidden_loss = 0.94733, acc = 0.87573


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 132/300 ] loss = 7.49948, acc = 0.79213


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 133/300 ] loss = 14.64457, hidden_loss = 0.94769, acc = 0.87503


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 133/300 ] loss = 7.34576, acc = 0.78688


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 134/300 ] loss = 14.74822, hidden_loss = 0.94874, acc = 0.87796


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 134/300 ] loss = 7.56625, acc = 0.79300


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 135/300 ] loss = 14.91351, hidden_loss = 0.94922, acc = 0.87655


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 135/300 ] loss = 7.52132, acc = 0.79708


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 136/300 ] loss = 15.25211, hidden_loss = 0.94966, acc = 0.87573


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 136/300 ] loss = 7.34472, acc = 0.78484


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 137/300 ] loss = 15.35802, hidden_loss = 0.95101, acc = 0.88141


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 137/300 ] loss = 7.42179, acc = 0.79796


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 138/300 ] loss = 15.36724, hidden_loss = 0.95188, acc = 0.88293


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 138/300 ] loss = 7.39345, acc = 0.80321


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 139/300 ] loss = 15.62056, hidden_loss = 0.95237, acc = 0.88212


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 139/300 ] loss = 7.58347, acc = 0.79271


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 140/300 ] loss = 15.86442, hidden_loss = 0.95318, acc = 0.88030


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 140/300 ] loss = 7.31456, acc = 0.80583 -> best
Best model found at epoch 139, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 141/300 ] loss = 16.21442, hidden_loss = 0.95350, acc = 0.87695


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 141/300 ] loss = 7.57063, acc = 0.79067


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 142/300 ] loss = 16.32672, hidden_loss = 0.95423, acc = 0.88151


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 142/300 ] loss = 7.37397, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 143/300 ] loss = 16.50162, hidden_loss = 0.95461, acc = 0.88192


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 143/300 ] loss = 7.54385, acc = 0.80175


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 144/300 ] loss = 16.73904, hidden_loss = 0.95578, acc = 0.88019


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 144/300 ] loss = 7.57554, acc = 0.78222


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 145/300 ] loss = 16.88099, hidden_loss = 0.95651, acc = 0.88182


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 145/300 ] loss = 7.57024, acc = 0.80029


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 146/300 ] loss = 16.95289, hidden_loss = 0.95679, acc = 0.88263


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 146/300 ] loss = 7.35126, acc = 0.80671 -> best
Best model found at epoch 145, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 147/300 ] loss = 17.32177, hidden_loss = 0.95735, acc = 0.88030


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 147/300 ] loss = 7.53431, acc = 0.79767


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 148/300 ] loss = 17.58728, hidden_loss = 0.95862, acc = 0.88070


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 148/300 ] loss = 7.54522, acc = 0.79883


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 149/300 ] loss = 17.68863, hidden_loss = 0.95937, acc = 0.88405


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 149/300 ] loss = 7.52884, acc = 0.79329


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 150/300 ] loss = 17.96226, hidden_loss = 0.96059, acc = 0.88435


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 150/300 ] loss = 7.48899, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 151/300 ] loss = 18.11218, hidden_loss = 0.96172, acc = 0.88770


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 151/300 ] loss = 7.46376, acc = 0.80292


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 152/300 ] loss = 18.29601, hidden_loss = 0.96254, acc = 0.88820


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 152/300 ] loss = 7.33539, acc = 0.80612


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 153/300 ] loss = 18.45481, hidden_loss = 0.96366, acc = 0.88840


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 153/300 ] loss = 7.69330, acc = 0.79592


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 154/300 ] loss = 18.42630, hidden_loss = 0.96457, acc = 0.88820


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 154/300 ] loss = 7.64856, acc = 0.79300


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 155/300 ] loss = 18.99879, hidden_loss = 0.96561, acc = 0.88496


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 155/300 ] loss = 7.72847, acc = 0.79184


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 156/300 ] loss = 19.04941, hidden_loss = 0.96584, acc = 0.88273


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 156/300 ] loss = 7.51314, acc = 0.79971


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 157/300 ] loss = 19.19881, hidden_loss = 0.96549, acc = 0.88688


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 157/300 ] loss = 7.56216, acc = 0.80146


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 158/300 ] loss = 19.45732, hidden_loss = 0.96622, acc = 0.88739


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 158/300 ] loss = 7.30024, acc = 0.80612


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 159/300 ] loss = 19.74233, hidden_loss = 0.96687, acc = 0.89043


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 159/300 ] loss = 7.86212, acc = 0.78251


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 160/300 ] loss = 19.73049, hidden_loss = 0.96757, acc = 0.89033


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 160/300 ] loss = 7.59429, acc = 0.80292


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 161/300 ] loss = 19.96863, hidden_loss = 0.96850, acc = 0.88526


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 161/300 ] loss = 7.32448, acc = 0.78513


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 162/300 ] loss = 20.23983, hidden_loss = 0.96968, acc = 0.89063


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 162/300 ] loss = 7.95716, acc = 0.79417


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 163/300 ] loss = 20.55023, hidden_loss = 0.97031, acc = 0.88699


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 163/300 ] loss = 7.84253, acc = 0.79038


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 164/300 ] loss = 20.51289, hidden_loss = 0.97053, acc = 0.89236


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 164/300 ] loss = 7.38437, acc = 0.79913


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 165/300 ] loss = 21.02696, hidden_loss = 0.97089, acc = 0.88942


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 165/300 ] loss = 7.38372, acc = 0.79563


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 166/300 ] loss = 21.23555, hidden_loss = 0.97164, acc = 0.88942


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 166/300 ] loss = 7.43290, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 167/300 ] loss = 21.46369, hidden_loss = 0.97228, acc = 0.89145


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 167/300 ] loss = 7.30157, acc = 0.80087


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 168/300 ] loss = 21.53792, hidden_loss = 0.97278, acc = 0.89327


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 168/300 ] loss = 7.30132, acc = 0.79125


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 169/300 ] loss = 21.72032, hidden_loss = 0.97306, acc = 0.89530


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 169/300 ] loss = 7.28463, acc = 0.80292


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 170/300 ] loss = 21.96625, hidden_loss = 0.97431, acc = 0.89215


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 170/300 ] loss = 7.39788, acc = 0.79825


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 171/300 ] loss = 22.40136, hidden_loss = 0.97469, acc = 0.89205


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 171/300 ] loss = 7.37814, acc = 0.80787 -> best
Best model found at epoch 170, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 172/300 ] loss = 22.73415, hidden_loss = 0.97592, acc = 0.89155


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 172/300 ] loss = 7.23971, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 173/300 ] loss = 22.65081, hidden_loss = 0.97560, acc = 0.88972


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 173/300 ] loss = 7.25197, acc = 0.79708


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 174/300 ] loss = 22.87865, hidden_loss = 0.97663, acc = 0.89591


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 174/300 ] loss = 7.68930, acc = 0.79854


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 175/300 ] loss = 23.32404, hidden_loss = 0.97745, acc = 0.88709


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 175/300 ] loss = 7.28287, acc = 0.80904 -> best
Best model found at epoch 174, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 176/300 ] loss = 23.26151, hidden_loss = 0.97777, acc = 0.89266


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 176/300 ] loss = 7.92191, acc = 0.76968


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 177/300 ] loss = 23.69229, hidden_loss = 0.97924, acc = 0.89530


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 177/300 ] loss = 7.35732, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 178/300 ] loss = 23.69228, hidden_loss = 0.97949, acc = 0.88982


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 178/300 ] loss = 7.31722, acc = 0.80000


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 179/300 ] loss = 24.04335, hidden_loss = 0.97949, acc = 0.89013


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 179/300 ] loss = 7.43383, acc = 0.80146


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 180/300 ] loss = 24.20530, hidden_loss = 0.98041, acc = 0.89256


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 180/300 ] loss = 7.26099, acc = 0.79563


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 181/300 ] loss = 24.53523, hidden_loss = 0.98114, acc = 0.89165


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 181/300 ] loss = 7.22851, acc = 0.80117


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 182/300 ] loss = 25.05713, hidden_loss = 0.98264, acc = 0.89033


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 182/300 ] loss = 7.77734, acc = 0.79563


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 183/300 ] loss = 24.97534, hidden_loss = 0.98345, acc = 0.89580


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 183/300 ] loss = 7.41946, acc = 0.80729


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 184/300 ] loss = 25.07222, hidden_loss = 0.98329, acc = 0.89003


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 184/300 ] loss = 7.47213, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 185/300 ] loss = 25.23218, hidden_loss = 0.98337, acc = 0.89368


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 185/300 ] loss = 7.64490, acc = 0.79329


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 186/300 ] loss = 25.78072, hidden_loss = 0.98417, acc = 0.90067


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 186/300 ] loss = 7.52765, acc = 0.79184


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 187/300 ] loss = 26.01678, hidden_loss = 0.98508, acc = 0.89418


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 187/300 ] loss = 7.52860, acc = 0.80146


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 188/300 ] loss = 26.10524, hidden_loss = 0.98518, acc = 0.89601


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 188/300 ] loss = 7.33854, acc = 0.79796


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 189/300 ] loss = 26.40470, hidden_loss = 0.98545, acc = 0.89570


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 189/300 ] loss = 7.72040, acc = 0.80612


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 190/300 ] loss = 26.75445, hidden_loss = 0.98531, acc = 0.89793


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 190/300 ] loss = 7.24059, acc = 0.79534


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 191/300 ] loss = 26.86907, hidden_loss = 0.98666, acc = 0.89530


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 191/300 ] loss = 7.25922, acc = 0.80729


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 192/300 ] loss = 27.21479, hidden_loss = 0.98650, acc = 0.89935


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 192/300 ] loss = 7.61516, acc = 0.81603 -> best
Best model found at epoch 191, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 193/300 ] loss = 27.61864, hidden_loss = 0.98756, acc = 0.89266


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 193/300 ] loss = 7.51548, acc = 0.80612


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 194/300 ] loss = 27.79868, hidden_loss = 0.98867, acc = 0.89489


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 194/300 ] loss = 7.37260, acc = 0.80962


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 195/300 ] loss = 27.61442, hidden_loss = 0.98920, acc = 0.90036


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 195/300 ] loss = 7.47237, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 196/300 ] loss = 28.23030, hidden_loss = 0.98940, acc = 0.89489


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 196/300 ] loss = 7.28665, acc = 0.79883


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 197/300 ] loss = 28.37782, hidden_loss = 0.99014, acc = 0.89925


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 197/300 ] loss = 7.52309, acc = 0.78863


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 198/300 ] loss = 28.56655, hidden_loss = 0.99048, acc = 0.89672


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 198/300 ] loss = 7.48405, acc = 0.79679


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 199/300 ] loss = 28.88348, hidden_loss = 0.99087, acc = 0.89438


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 199/300 ] loss = 7.39164, acc = 0.80350


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 200/300 ] loss = 29.09291, hidden_loss = 0.99147, acc = 0.89499


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 200/300 ] loss = 7.36097, acc = 0.80175


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 201/300 ] loss = 29.37375, hidden_loss = 0.99296, acc = 0.90229


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 201/300 ] loss = 7.08807, acc = 0.80671


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 202/300 ] loss = 29.95106, hidden_loss = 0.99334, acc = 0.89641


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 202/300 ] loss = 7.42746, acc = 0.80991


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 203/300 ] loss = 29.85949, hidden_loss = 0.99273, acc = 0.90128


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 203/300 ] loss = 7.25366, acc = 0.81137


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 204/300 ] loss = 30.21857, hidden_loss = 0.99364, acc = 0.89996


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 204/300 ] loss = 7.67145, acc = 0.80437


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 205/300 ] loss = 30.54131, hidden_loss = 0.99391, acc = 0.89591


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 205/300 ] loss = 7.58986, acc = 0.80233


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 206/300 ] loss = 30.49352, hidden_loss = 0.99430, acc = 0.90006


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 206/300 ] loss = 7.19489, acc = 0.80204


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 207/300 ] loss = 30.81470, hidden_loss = 0.99517, acc = 0.90675


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 207/300 ] loss = 7.34165, acc = 0.81195


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 208/300 ] loss = 31.41516, hidden_loss = 0.99550, acc = 0.89925


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 208/300 ] loss = 7.09457, acc = 0.81283


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 209/300 ] loss = 31.20066, hidden_loss = 0.99628, acc = 0.90128


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 209/300 ] loss = 7.34656, acc = 0.79883


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 210/300 ] loss = 31.79059, hidden_loss = 0.99708, acc = 0.89915


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 210/300 ] loss = 7.29590, acc = 0.80525


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 211/300 ] loss = 32.02919, hidden_loss = 0.99679, acc = 0.89955


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 211/300 ] loss = 7.35202, acc = 0.80933


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 212/300 ] loss = 32.01435, hidden_loss = 0.99748, acc = 0.90016


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 212/300 ] loss = 7.57689, acc = 0.81283


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 213/300 ] loss = 32.44538, hidden_loss = 0.99800, acc = 0.90036


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 213/300 ] loss = 7.21421, acc = 0.79854


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 214/300 ] loss = 32.68566, hidden_loss = 0.99860, acc = 0.90209


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 214/300 ] loss = 7.29076, acc = 0.81108


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 215/300 ] loss = 32.97980, hidden_loss = 0.99924, acc = 0.90168


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 215/300 ] loss = 7.26489, acc = 0.79825


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 216/300 ] loss = 33.38391, hidden_loss = 0.99959, acc = 0.90016


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 216/300 ] loss = 7.23924, acc = 0.80962


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 217/300 ] loss = 33.62251, hidden_loss = 0.99999, acc = 0.90300


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 217/300 ] loss = 7.19485, acc = 0.80175


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 218/300 ] loss = 33.75287, hidden_loss = 1.00045, acc = 0.89925


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 218/300 ] loss = 7.53153, acc = 0.80466


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 219/300 ] loss = 33.95671, hidden_loss = 1.00033, acc = 0.90016


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 219/300 ] loss = 7.25095, acc = 0.81166


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 220/300 ] loss = 34.21413, hidden_loss = 1.00056, acc = 0.90189


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 220/300 ] loss = 7.10726, acc = 0.80758


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 221/300 ] loss = 34.60912, hidden_loss = 1.00180, acc = 0.90087


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 221/300 ] loss = 7.12510, acc = 0.80787


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 222/300 ] loss = 35.00095, hidden_loss = 1.00215, acc = 0.89854


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 222/300 ] loss = 7.22935, acc = 0.79679


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 223/300 ] loss = 35.24785, hidden_loss = 1.00294, acc = 0.90097


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 223/300 ] loss = 7.28221, acc = 0.80612


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 224/300 ] loss = 35.13937, hidden_loss = 1.00334, acc = 0.90726


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 224/300 ] loss = 7.62922, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 225/300 ] loss = 35.74048, hidden_loss = 1.00387, acc = 0.90178


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 225/300 ] loss = 7.19153, acc = 0.81166


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 226/300 ] loss = 35.56849, hidden_loss = 1.00407, acc = 0.90756


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 226/300 ] loss = 7.22056, acc = 0.80700


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 227/300 ] loss = 35.94052, hidden_loss = 1.00535, acc = 0.90685


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 227/300 ] loss = 7.07941, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 228/300 ] loss = 36.38913, hidden_loss = 1.00496, acc = 0.90199


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 228/300 ] loss = 7.35344, acc = 0.80117


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 229/300 ] loss = 36.61095, hidden_loss = 1.00616, acc = 0.90918


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 229/300 ] loss = 7.20887, acc = 0.81691 -> best
Best model found at epoch 228, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 230/300 ] loss = 36.96455, hidden_loss = 1.00633, acc = 0.90756


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 230/300 ] loss = 7.48382, acc = 0.79767


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 231/300 ] loss = 37.38481, hidden_loss = 1.00675, acc = 0.90310


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 231/300 ] loss = 7.11634, acc = 0.79650


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 232/300 ] loss = 37.49210, hidden_loss = 1.00703, acc = 0.90300


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 232/300 ] loss = 7.31110, acc = 0.79796


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 233/300 ] loss = 37.72093, hidden_loss = 1.00785, acc = 0.90239


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 233/300 ] loss = 7.50367, acc = 0.80991


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 234/300 ] loss = 38.40509, hidden_loss = 1.00841, acc = 0.90442


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 234/300 ] loss = 7.35441, acc = 0.80758


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 235/300 ] loss = 38.47440, hidden_loss = 1.00940, acc = 0.90685


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 235/300 ] loss = 7.34935, acc = 0.81050


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 236/300 ] loss = 38.85099, hidden_loss = 1.00992, acc = 0.90513


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 236/300 ] loss = 7.51814, acc = 0.80379


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 237/300 ] loss = 39.01877, hidden_loss = 1.00955, acc = 0.90827


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 237/300 ] loss = 7.38468, acc = 0.80816


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 238/300 ] loss = 38.87769, hidden_loss = 1.00979, acc = 0.90685


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 238/300 ] loss = 7.36656, acc = 0.80641


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 239/300 ] loss = 39.67136, hidden_loss = 1.01059, acc = 0.90330


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 239/300 ] loss = 7.36633, acc = 0.80058


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 240/300 ] loss = 40.18771, hidden_loss = 1.01044, acc = 0.90939


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 240/300 ] loss = 7.30934, acc = 0.79534


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 241/300 ] loss = 40.28483, hidden_loss = 1.01070, acc = 0.90412


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 241/300 ] loss = 7.27071, acc = 0.80496


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 242/300 ] loss = 40.62323, hidden_loss = 1.01181, acc = 0.90787


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 242/300 ] loss = 7.18754, acc = 0.81254


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 243/300 ] loss = 40.95213, hidden_loss = 1.01196, acc = 0.90290


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 243/300 ] loss = 7.51299, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 244/300 ] loss = 41.08140, hidden_loss = 1.01251, acc = 0.90939


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 244/300 ] loss = 7.19940, acc = 0.80496


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 245/300 ] loss = 41.28763, hidden_loss = 1.01337, acc = 0.90776


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 245/300 ] loss = 7.30351, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 246/300 ] loss = 41.57330, hidden_loss = 1.01308, acc = 0.90412


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 246/300 ] loss = 7.45050, acc = 0.81837 -> best
Best model found at epoch 245, saving model


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 247/300 ] loss = 42.01371, hidden_loss = 1.01347, acc = 0.90939


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 247/300 ] loss = 7.28726, acc = 0.81079


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 248/300 ] loss = 42.03952, hidden_loss = 1.01414, acc = 0.91010


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 248/300 ] loss = 7.25516, acc = 0.80117


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 249/300 ] loss = 42.47506, hidden_loss = 1.01400, acc = 0.90351


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 249/300 ] loss = 7.20928, acc = 0.79971


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 250/300 ] loss = 42.69334, hidden_loss = 1.01422, acc = 0.90878


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 250/300 ] loss = 7.18966, acc = 0.80729


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 251/300 ] loss = 42.81270, hidden_loss = 1.01584, acc = 0.91131


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 251/300 ] loss = 7.24449, acc = 0.80933


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 252/300 ] loss = 43.09409, hidden_loss = 1.01623, acc = 0.91121


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 252/300 ] loss = 7.13478, acc = 0.80525


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 253/300 ] loss = 43.87658, hidden_loss = 1.01614, acc = 0.91091


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 253/300 ] loss = 7.20828, acc = 0.80233


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 254/300 ] loss = 43.90469, hidden_loss = 1.01628, acc = 0.90989


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 254/300 ] loss = 7.10570, acc = 0.80700


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 255/300 ] loss = 44.55970, hidden_loss = 1.01696, acc = 0.90797


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 255/300 ] loss = 7.39546, acc = 0.79971


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 256/300 ] loss = 44.08884, hidden_loss = 1.01680, acc = 0.91162


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 256/300 ] loss = 7.19296, acc = 0.80087


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 257/300 ] loss = 44.69483, hidden_loss = 1.01764, acc = 0.90746


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 257/300 ] loss = 7.25842, acc = 0.80525


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 258/300 ] loss = 45.32654, hidden_loss = 1.01778, acc = 0.90705


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 258/300 ] loss = 7.34897, acc = 0.79359


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 259/300 ] loss = 45.34335, hidden_loss = 1.01831, acc = 0.91060


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 259/300 ] loss = 7.34774, acc = 0.79971


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 260/300 ] loss = 46.09919, hidden_loss = 1.01876, acc = 0.91040


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 260/300 ] loss = 7.12839, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 261/300 ] loss = 46.12599, hidden_loss = 1.01885, acc = 0.91172


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 261/300 ] loss = 7.06027, acc = 0.80816


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 262/300 ] loss = 46.00140, hidden_loss = 1.01935, acc = 0.90939


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 262/300 ] loss = 7.47602, acc = 0.80466


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 263/300 ] loss = 46.43452, hidden_loss = 1.01985, acc = 0.91121


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 263/300 ] loss = 7.35068, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 264/300 ] loss = 46.98690, hidden_loss = 1.02127, acc = 0.91162


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 264/300 ] loss = 7.33558, acc = 0.79329


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 265/300 ] loss = 46.87311, hidden_loss = 1.02145, acc = 0.91131


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 265/300 ] loss = 7.23461, acc = 0.80671


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 266/300 ] loss = 47.56785, hidden_loss = 1.02207, acc = 0.91080


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 266/300 ] loss = 7.14663, acc = 0.80845


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 267/300 ] loss = 48.21200, hidden_loss = 1.02307, acc = 0.90999


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 267/300 ] loss = 7.29015, acc = 0.79359


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 268/300 ] loss = 48.76737, hidden_loss = 1.02293, acc = 0.91141


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 268/300 ] loss = 7.55043, acc = 0.79067


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 269/300 ] loss = 48.27026, hidden_loss = 1.02313, acc = 0.91202


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 269/300 ] loss = 7.12801, acc = 0.81079


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 270/300 ] loss = 48.83389, hidden_loss = 1.02333, acc = 0.91202


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 270/300 ] loss = 7.33941, acc = 0.81283


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 271/300 ] loss = 49.01806, hidden_loss = 1.02459, acc = 0.91080


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 271/300 ] loss = 7.11469, acc = 0.80292


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 272/300 ] loss = 49.08171, hidden_loss = 1.02547, acc = 0.91374


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 272/300 ] loss = 7.27633, acc = 0.79971


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 273/300 ] loss = 48.97878, hidden_loss = 1.02520, acc = 0.91374


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 273/300 ] loss = 7.17535, acc = 0.80233


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 274/300 ] loss = 50.45090, hidden_loss = 1.02549, acc = 0.91030


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 274/300 ] loss = 7.08411, acc = 0.80408


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 275/300 ] loss = 50.41003, hidden_loss = 1.02572, acc = 0.91263


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 275/300 ] loss = 7.06445, acc = 0.80875


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 276/300 ] loss = 50.56013, hidden_loss = 1.02625, acc = 0.91496


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 276/300 ] loss = 7.28345, acc = 0.80058


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 277/300 ] loss = 50.83286, hidden_loss = 1.02633, acc = 0.91547


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 277/300 ] loss = 7.01700, acc = 0.81254


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 278/300 ] loss = 51.08934, hidden_loss = 1.02676, acc = 0.91476


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 278/300 ] loss = 7.50136, acc = 0.79679


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 279/300 ] loss = 51.93920, hidden_loss = 1.02708, acc = 0.91253


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 279/300 ] loss = 7.22604, acc = 0.80991


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 280/300 ] loss = 52.04162, hidden_loss = 1.02695, acc = 0.90878


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 280/300 ] loss = 7.29095, acc = 0.80554


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 281/300 ] loss = 52.65586, hidden_loss = 1.02693, acc = 0.91314


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 281/300 ] loss = 7.12847, acc = 0.80729


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 282/300 ] loss = 52.81874, hidden_loss = 1.02778, acc = 0.91030


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 282/300 ] loss = 7.27202, acc = 0.81166


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 283/300 ] loss = 53.19356, hidden_loss = 1.02784, acc = 0.91020


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 283/300 ] loss = 7.26133, acc = 0.79738


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 284/300 ] loss = 53.42523, hidden_loss = 1.02803, acc = 0.91425


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 284/300 ] loss = 7.23929, acc = 0.80262


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 285/300 ] loss = 53.66654, hidden_loss = 1.02831, acc = 0.91202


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 285/300 ] loss = 7.14315, acc = 0.80496


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 286/300 ] loss = 54.56198, hidden_loss = 1.02890, acc = 0.91273


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 286/300 ] loss = 7.31884, acc = 0.81108


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 287/300 ] loss = 54.26021, hidden_loss = 1.02960, acc = 0.91537


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 287/300 ] loss = 7.35667, acc = 0.80496


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 288/300 ] loss = 54.98894, hidden_loss = 1.03030, acc = 0.91324


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 288/300 ] loss = 7.29579, acc = 0.81224


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 289/300 ] loss = 55.11796, hidden_loss = 1.03081, acc = 0.91192


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 289/300 ] loss = 7.42216, acc = 0.80729


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 290/300 ] loss = 55.38381, hidden_loss = 1.03100, acc = 0.91709


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 290/300 ] loss = 7.30302, acc = 0.80729


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 291/300 ] loss = 55.54900, hidden_loss = 1.03173, acc = 0.91435


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 291/300 ] loss = 7.20006, acc = 0.80875


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 292/300 ] loss = 56.35880, hidden_loss = 1.03212, acc = 0.91364


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 292/300 ] loss = 7.09825, acc = 0.81749


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 293/300 ] loss = 56.02729, hidden_loss = 1.03190, acc = 0.91608


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 293/300 ] loss = 7.66605, acc = 0.78571


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 294/300 ] loss = 57.16982, hidden_loss = 1.03246, acc = 0.91547


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 294/300 ] loss = 7.13954, acc = 0.81603


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 295/300 ] loss = 57.38483, hidden_loss = 1.03287, acc = 0.91749


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 295/300 ] loss = 7.07024, acc = 0.81079


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 296/300 ] loss = 57.29315, hidden_loss = 1.03356, acc = 0.91506


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 296/300 ] loss = 7.15565, acc = 0.80029


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 297/300 ] loss = 57.53802, hidden_loss = 1.03447, acc = 0.91972


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 297/300 ] loss = 7.17655, acc = 0.80816


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 298/300 ] loss = 58.25125, hidden_loss = 1.03457, acc = 0.91810


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 298/300 ] loss = 7.26920, acc = 0.81020


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 299/300 ] loss = 58.45443, hidden_loss = 1.03454, acc = 0.91496


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 299/300 ] loss = 7.38520, acc = 0.80321


  0%|          | 0/155 [00:00<?, ?it/s]

[ Train | 300/300 ] loss = 58.89815, hidden_loss = 1.03569, acc = 0.91486


  0%|          | 0/54 [00:00<?, ?it/s]

[ Valid | 300/300 ] loss = 7.27788, acc = 0.80204
Finish training


### Inference
load the best model of the experiment and generate submission.csv

In [20]:
# create dataloader for evaluation
eval_set = FoodDataset(os.path.join(cfg['dataset_root'], "evaluation"), tfm=test_tfm)
eval_loader = DataLoader(eval_set, batch_size=cfg['batch_size'], shuffle=False, num_workers=0, pin_memory=True)

One ./food11-hw13/evaluation sample ./food11-hw13/evaluation/0000.jpg


In [21]:
student_model.eval()
eval_preds = [] # storing predictions of the evaluation dataset

# Iterate the validation set by batches.
for batch in tqdm(eval_loader):
    # A batch consists of image data and corresponding labels.
    imgs, _ = batch
    # We don't need gradient in evaluation.
    # Using torch.no_grad() accelerates the forward process.
    with torch.no_grad():
        logits = student_model(imgs.to(device))
        preds = list(logits.argmax(dim=-1).squeeze().cpu().numpy())
    # loss and acc can not be calculated because we do not have the true labels of the evaluation set.
    eval_preds += preds

def pad4(i):
    return "0"*(4-len(str(i))) + str(i)

# Save prediction results
ids = [pad4(i) for i in range(0,len(eval_set))]
categories = eval_preds

df = pd.DataFrame()
df['Id'] = ids
df['Category'] = categories
df.to_csv(f"{save_path}/submission_final.csv", index=False) # now you can download the submission.csv and upload it to the kaggle competition.

  0%|          | 0/53 [00:00<?, ?it/s]

In [22]:
# Load model from {exp_name}/student_best.ckpt
student_model_best = StudentNet()
ckpt_path = f"{save_path}/student_best.ckpt" # the ckpt path of the best student model.
student_model_best.load_state_dict(torch.load(ckpt_path, map_location='cpu')) # load the state dict and set it to the student model
student_model_best.to(device) # set the student model to device

# Start evaluate
student_model_best.eval()
eval_preds = [] # storing predictions of the evaluation dataset

# Iterate the validation set by batches.
for batch in tqdm(eval_loader):
    # A batch consists of image data and corresponding labels.
    imgs, _ = batch
    # We don't need gradient in evaluation.
    # Using torch.no_grad() accelerates the forward process.
    with torch.no_grad():
        logits = student_model_best(imgs.to(device))
        preds = list(logits.argmax(dim=-1).squeeze().cpu().numpy())
    # loss and acc can not be calculated because we do not have the true labels of the evaluation set.
    eval_preds += preds

def pad4(i):
    return "0"*(4-len(str(i))) + str(i)

# Save prediction results
ids = [pad4(i) for i in range(0,len(eval_set))]
categories = eval_preds

df = pd.DataFrame()
df['Id'] = ids
df['Category'] = categories
df.to_csv(f"{save_path}/submission_best.csv", index=False) # now you can download the submission.csv and upload it to the kaggle competition.

  0%|          | 0/53 [00:00<?, ?it/s]

> Don't forget to answer the report questions on GradeScope! 