# deberta-v3-small Experiments

We use fast experiments with 1/5 of the data and 3 epochs to quickly determine which adaptations are helpful and report the cv-score.  
For adaptation we want to explore further, we retrain on the whole dataset and evaluate on the leader board.  

</br> </br>  


## Fast Experiments 
*1/5 of original training data, 3 epochs, 4 fold cv*

---

### Prompt Engineering
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline           | `fast_baseline_cfg`   | 0.7430   | - | |
|  CPC Context Text   | `fast_context_cfg`    | 0.7523    | - | |  
|  Custom Tokens      | `fast_customtok_cfg`  | 0.7464  | - | |  



**Conclusion**: Adding Context Text seems to work best here, we will hence continue with that.
</br> </br>  

### Model Type
    
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Regression       | `fast_reg_cfg` |  *see CPC Context Text*  | - | |
|  Classification   | `fast_class_cfg` | 0.7378 | - | |  
|  Ordinal          | `fast_ord_cfg` | 0.7262 | - | |  


**Conclusion**: Regular regression seems to work best, we will hence continue with that.
</br> </br>  

### Classical NLP Preprocssing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Stemming              |    |   | - | |
| Lemmatizing           |    |   | - | |
| Special Characters    |    |   | - | Removing special characters from the phrases |


### Postprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Clipping              |    |   | - | Range `[0,1]`|
| MinMax           |    |   | - | |
| Chemical Lookup    |    |   | - |  |


### Identity Mapping Augmentation

|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Identities              |   `fast_ident_cfg` |  0.7461 | - | |
| All Mirrored            |   `fast_mirr_cfg` |  0.7481 | - | |
| Identity Paths          |  `fast_identpaths_cfg` |   | - | |
| Mirrored Identity Paths |  `fast_mirridentpaths_cfg` |   | - | |
| Neighbors               |  `fast_neighbors_cfg` |  | - | |


</br>  </br>  </br>  

## Full Experiments 
*full training data, 5 epochs, 4 fold cv*

---

    
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Base Regression       | `full_base_reg_cfg` |   |  | |



# Directory settings

In [1]:
# ====================================================
# Directory settings
# ====================================================
import os

INPUT_DIR = '../input/us-patent-phrase-to-phrase-matching/'
IDENTITY_MAPPINGS_DIR = '../input/uspppm-identity-mappings/'
OUTPUT_DIR = './'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# Model Configuration

In [2]:
from dataclasses import dataclass, field
from typing import Set, Optional, Tuple, Dict


@dataclass
class ModelConfig:
    # when true uses classification else a regression model
    classification: bool = False
    # when true uses ordinal regression (only active when `classification = False`)
    ordinal: bool = False
    # How to augment the data for training using the graph based indentity mappings
    # available options ['neighbors'. paths_mirrored', 'paths', 'mirrored', 'identities', None]
    # None indicates that no augmentation should take place
    augment_identity_graph_data: Optional[str] = None 
    validate_on_original: bool = True
        
    # Prompt engineering config
    # - ctxt_txt (Default): Add context text at the end of the prompt
    # - custom_tok: Add custom tokens for contexts and a custom separator token
    # - None: Append the context appreviation to the end of prompt
    # When true uses custom separator token and context tokens
    prompt_engineering: Optional[str] = 'ctx_txt' # ['custom_tok', None]
    custom_sep_token: str = '[S]'
    
    # Use averaging of chemical component synonyms for 
    chem_comp_pred_avg: bool  = False
    # Use chemical component synonyms to creat samples for augmenting the training set
    chem_comp_train_aug: bool = False
    
    # Traditional NLP preprocessing
    stemming: bool = False
    lemmatizing: bool = False
    special_chr_rem: bool = False
    
    
    debug: bool = False
    apex: bool = True
    print_freq: int= 200
    num_workers: int = 4
    model: str = "microsoft/deberta-v3-small"
    scheduler: str = 'cosine' # ['linear', 'cosine']
    batch_scheduler: bool = True
    num_cycles: float = 0.5
    num_warmup_steps: int = 0
    encoder_lr: float = 2e-5
    decoder_lr: float = 2e-5
    min_lr: float = 1e-6
    eps: float = 1e-6
    betas: Set[float] = (0.9, 0.999)
    batch_size: int = 16
    fc_dropout: float = 0.2
    max_len: int = 512
    weight_decay: float = 0.01
    gradient_accumulation_steps: int = 1
    max_grad_norm: int = 1000
    seed: int = 42
    
    epochs: int = 5
    train_frac: Optional[float] = None
    n_fold: int = 4
    trn_fold: Set[int] = (0, 1, 2, 3)
    map_score: Dict[float, int] = field(default_factory = lambda: ({0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}))
    map_labels: Dict[int, float] = field(default_factory = lambda: ({0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}))
        
    target_size: int=1
    def __post_init__(self):
        if self.classification or self.ordinal:
            self.target_size = 5
        else:
            self.target_size = 1  


############################################
# Experiment Configurations
############################################

# Fast Experiments
FAST_BASE_TRAIN_FRAC = 1/5
FAST_BASE_TRAIN_EPOCHS = 3

def fastModelConfg(**kwargs):
    return ModelConfig(**kwargs, train_frac=FAST_BASE_TRAIN_FRAC, epochs=FAST_BASE_TRAIN_EPOCHS)


fast_baseline_cfg = fastModelConfg(prompt_engineering=None)
fast_context_ctx = fastModelConfg()
fast_customtok_cfg = fastModelConfg(prompt_engineering='custom_tok')

fast_reg_cfg = fastModelConfg()
fast_class_cfg = fastModelConfg(classification=True)
fast_ord_cfg = fastModelConfg(ordinal=True)

fast_ident_cfg = fastModelConfg(augment_identity_graph_data='identities')
fast_mirr_cfg = fastModelConfg(augment_identity_graph_data='mirrored')
fast_identpaths_cfg = fastModelConfg(augment_identity_graph_data='paths')
fast_mirridentpaths_cfg = fastModelConfg(augment_identity_graph_data='paths_mirrored')
fast_neighbors_cfg = fastModelConfg(augment_identity_graph_data='neighbors')

# Full Experiments



### Config used throughout this notebook

In [3]:
# CFG = fast_baseline_cfg
# CFG = fast_context_ctx
# CFG = fast_customtok_cfg

# CFG = fast_reg_cfg
# CFG = fast_class_cfg
# CFG = fast_ord_cfg

CFG = fast_ident_cfg
# CFG = fast_mirr_cfg
# CFG = fast_identpaths_cfg
# CFG = fast_mirridentpaths_cfg
# CFG =fast_neighbors_cfg

# Library

In [4]:
# ====================================================
# Library
# ====================================================
import os
import gc
import re
import ast
import sys
import copy
import json
import time
import math
import shutil
import string
import pickle
import random
import joblib
import itertools
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold

import torch
print(f"torch.__version__: {torch.__version__}")
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset

os.system('pip uninstall -y transformers')
os.system('pip uninstall -y tokenizers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels transformers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels tokenizers')
import tokenizers
import transformers
print(f"tokenizers.__version__: {tokenizers.__version__}")
print(f"transformers.__version__: {transformers.__version__}")
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
%env TOKENIZERS_PARALLELISM=true

import nltk
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


nltk.data.path.append('../input/wordnet')

# common chemical formulae lookup utility script
from uspppm_common_chemical_compound_lookup import USPPPMChemCompLookup

torch.__version__: 1.9.1
Found existing installation: transformers 4.16.2
Uninstalling transformers-4.16.2:
  Successfully uninstalled transformers-4.16.2




Found existing installation: tokenizers 0.11.6
Uninstalling tokenizers-0.11.6:
  Successfully uninstalled tokenizers-0.11.6




Looking in links: ../input/pppm-pip-wheels
Processing /kaggle/input/pppm-pip-wheels/transformers-4.18.0-py3-none-any.whl
Processing /kaggle/input/pppm-pip-wheels/tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Installing collected packages: tokenizers, transformers


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.9.1 requires transformers<4.17,>=4.1, but you have transformers 4.18.0 which is incompatible.


Successfully installed tokenizers-0.12.1 transformers-4.18.0
Looking in links: ../input/pppm-pip-wheels




tokenizers.__version__: 0.12.1
transformers.__version__: 4.18.0
env: TOKENIZERS_PARALLELISM=true


# Utils

### Chemical Compound Lookup

In [5]:
chem_lookup = USPPPMChemCompLookup(chem_comp_path='../input/chemical-compounds-lookup/compounds.csv')

**Chemical Compound Lookup Tests**

In [6]:
print('Testing basic formula lookup...')
display(chem_lookup.lookup_df)
print(chem_lookup.phrase_chem_formula_synonym('agbr test'))
print(chem_lookup.phrase_chem_formula_synonym('agbr dna test agonc ag2cl2'))
print(chem_lookup.phrase_chem_formula_synonym('dna test d2o'))
print(chem_lookup.chem_formula_synonyms('c3h6'))
print('Done!')
print()
print('Testing dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),  
                          'anchor': pd.Series(['agbr dna test agonc ag2cl2', 'agbr test', 'agbr', 'last']),
                          'target': pd.Series(['agonc ag2cl2', 'test thingy', 'c4h7no4', 'last']),
                          'context': pd.Series(['G02', 'G02', 'C12', 'C12'])})
print("Before")
display(chem_test)
print("After")
display(chem_lookup.augment_chem_formulae(chem_test, True))
print('Done!')

Testing basic formula lookup...


Unnamed: 0,Name,Formula
0,actiniumiiioxide,ac2o3
1,silvertetrachloroaluminate,agalcl4
2,silverbromide,agbr
3,silverbromate,agbro3
4,silvercyanide,agcn
...,...,...
4067,zirconateion,zro32
4068,zirconiumphosphide,zrp2
4069,zirconiumsulfide,zrs2
4070,zirconiumsilicide,zrsi2


['silverbromide test']
['silverbromide dna test silverfulminate disilverdichloride', 'silverbromide dna test silvercyanate disilverdichloride', 'silverbromide dna test silverfulminate silveriidichloride', 'silverbromide dna test silvercyanate silveriidichloride']
['dna test deuteriumoxide', 'dna test heavywater']
['cyclopropane', 'propylene']
Done!

Testing dataset augmentation...
Before


Unnamed: 0,id,anchor,target,context
0,t1,agbr dna test agonc ag2cl2,agonc ag2cl2,G02
1,t2,agbr test,test thingy,G02
2,t3,agbr,c4h7no4,C12
3,t4,last,last,C12


After


Unnamed: 0,id,anchor,target
0,t1,silverbromide dna test silverfulminate disilve...,silverfulminate disilverdichloride
0,t1,silverbromide dna test silverfulminate disilve...,silvercyanate disilverdichloride
0,t1,silverbromide dna test silverfulminate disilve...,silverfulminate silveriidichloride
0,t1,silverbromide dna test silverfulminate disilve...,silvercyanate silveriidichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silverfulminate disilverdichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silvercyanate disilverdichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silverfulminate silveriidichloride
0,t1,silverbromide dna test silvercyanate disilverd...,silvercyanate silveriidichloride
0,t1,silverbromide dna test silverfulminate silveri...,silverfulminate disilverdichloride
0,t1,silverbromide dna test silverfulminate silveri...,silvercyanate disilverdichloride


Done!


In [7]:
# ====================================================
# Utils
# ====================================================
def get_score(y_true, y_pred):
    score = sp.stats.pearsonr(y_true, y_pred)[0]
    return score


def get_logger(filename=OUTPUT_DIR+'train'):
    from logging import getLogger, INFO, StreamHandler, FileHandler, Formatter
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=f"{filename}.log")
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    logger.propagate = False
    return logger

LOGGER = get_logger()

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

# Data Loading

In [8]:
# ====================================================
# Data Loading
# ====================================================
orig_data = pd.read_csv(INPUT_DIR+'train.csv')

if CFG.augment_identity_graph_data == 'neighbors':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_path_neighbors.csv')
elif CFG.augment_identity_graph_data == 'paths_mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_paths.csv')
elif CFG.augment_identity_graph_data == 'paths':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_paths_mirrored.csv')
elif CFG.augment_identity_graph_data == 'mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored.csv')
elif CFG.augment_identity_graph_data == 'identities':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_mirrored.csv')
elif CFG.augment_identity_graph_data == None:
    train = orig_data
else:
    raise(ValueError('CFG.augment_identity_graph_data = {} not recognized!'.format(CFG.augment_identity_graph_data)))
    
#if CFG.chem_comp_train_aug:
#    train = 
    
if CFG.train_frac:
    # to get a fair estimate we always use a fraction of the original data
    n = int(CFG.train_frac * len(orig_data))
    train = train.sample(n=n, replace=False, ignore_index=True)
    
test = pd.read_csv(INPUT_DIR+'test.csv')
submission = pd.read_csv(INPUT_DIR+'sample_submission.csv')
print(f"train.shape: {train.shape}")
print(f"test.shape: {test.shape}")
print(f"submission.shape: {submission.shape}")
# display(train.head())
# display(test.head())
# display(submission.head())

train.shape: (7294, 5)
test.shape: (36, 4)
submission.shape: (36, 2)


# Pre-processing

In [9]:
# Add augmented indicator
# I'm sorry for this dirty hack
train['augmented'] = train['id'].str.contains('_')

In [10]:
# ====================================================
# CPC Data
# ====================================================
def get_cpc_texts():
    contexts = []
    pattern = '[A-Z]\d+'
    for file_name in os.listdir('../input/cpc-data/CPCSchemeXML202105'):
        result = re.findall(pattern, file_name)
        if result:
            contexts.append(result)
    contexts = sorted(set(sum(contexts, [])))
    results = {}
    for cpc in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
        with open(f'../input/cpc-data/CPCTitleList202202/cpc-section-{cpc}_20220201.txt') as f:
            s = f.read()
        pattern = f'{cpc}\t\t.+'
        result = re.findall(pattern, s)
        cpc_result = result[0].lstrip(pattern)
        for context in [c for c in contexts if c[0] == cpc]:
            pattern = f'{context}\t\t.+'
            result = re.findall(pattern, s)
            results[context] = cpc_result + ". " + result[0].lstrip(pattern)
    return results


cpc_texts = get_cpc_texts()
torch.save(cpc_texts, OUTPUT_DIR+"cpc_texts.pth")
train['context_text'] = train['context'].map(cpc_texts)
test['context_text'] = test['context'].map(cpc_texts)
# display(train.head())
# display(test.head())

# CV split

In [11]:
# ====================================================
# CV split
# ====================================================

train['score_map'] = train['score'].map({0.00: 0, 0.25: 1, 0.50: 2, 0.75: 3, 1.00: 4})
Fold = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
for n, (train_index, val_index) in enumerate(Fold.split(train, train['score_map'])):
    train.loc[val_index, 'fold'] = int(n)
train['fold'] = train['fold'].astype(int)
# display(train.groupby('fold').size())

In [12]:
def get_sec_toks(df):
    return '[' + df['context'].str[0] + ']'

if CFG.prompt_engineering == 'custom_tok':
    train['text'] = get_sec_toks(train) + train['context_text'] + '[SEP]'+ train['anchor'] + CFG.custom_sep_token + train['target']
    test['text'] = get_sec_toks(test) + test['context_text'] + '[SEP]' + test['anchor'] + CFG.custom_sep_token + test['target']
elif CFG.prompt_engineering == 'ctx_txt':
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]'  + train['context_text']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]'  + test['context_text']
elif CFG.prompt_engineering == None:
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]'  + train['context']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]'  + test['context']
else:
    raise(ValueError('CFG.prompt_engineering = {} not recognized!'.format(CFG.prompt_engineering)))
    

print(train['text'][0])
display(train['text'].head())
display(test['text'].head())

adjustable multiple[SEP]flexible multiple[SEP]PERFORMING OPERATIONS; TRANSPORTING. HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS


0    adjustable multiple[SEP]flexible multiple[SEP]...
1    disperse in plastic material[SEP]elasticity[SE...
2    substantially axial[SEP]tangential[SEP]HEMISTR...
3    different conductivity[SEP]conductivity channe...
4    reflection type liquid crystal display[SEP]ref...
Name: text, dtype: object

0    opc drum[SEP]inorganic photoconductor drum[SEP...
1    adjust gas flow[SEP]altering gas flow[SEP]MECH...
2    lower trunnion[SEP]lower locating[SEP]PERFORMI...
3    cap component[SEP]upper portion[SEP]TEXTILES; ...
4    neural stimulation[SEP]artificial neural netwo...
Name: text, dtype: object

# tokenizer

In [13]:
# ====================================================
# tokenizer
# ====================================================
tokenizer = AutoTokenizer.from_pretrained(CFG.model)


# add special tokens for sections
cpc_sections = [
    'A', # Human Necessities
    'B', # Operations and Transport
    'C', # Chemistry and Metallurgy
    'D', # Textiles
    'E', # Fixed Constructions
    'F', # Mechanical Engineering
    'G', # Physics
    'H', # Electricity
    'Y' # Emerging Cross-Sectional Technologies
]
if CFG.prompt_engineering == 'custom_tok':
    tokenizer.add_special_tokens({'additional_special_tokens': ['['+  s + ']' for s in cpc_sections]})
    print(tokenizer.all_special_tokens)
    
tokenizer.save_pretrained(OUTPUT_DIR+'tokenizer/')
CFG.tokenizer = tokenizer

Downloading:   0%|          | 0.00/52.0 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/578 [00:00<?, ?B/s]

Downloading:   0%|          | 0.00/2.35M [00:00<?, ?B/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Dataset

In [14]:
# ====================================================
# Define max_len
# ====================================================
lengths_dict = {}

lengths = []
tk0 = tqdm(cpc_texts.values(), total=len(cpc_texts))
for text in tk0:
    length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
    lengths.append(length)
lengths_dict['context_text'] = lengths

for text_col in ['anchor', 'target']:
    lengths = []
    tk0 = tqdm(train[text_col].fillna("").values, total=len(train))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
        lengths.append(length)
    lengths_dict[text_col] = lengths
    
CFG.max_len = max(lengths_dict['anchor']) + max(lengths_dict['target'])\
                + max(lengths_dict['context_text']) + 4 # CLS + SEP + SEP + SEP
LOGGER.info(f"max_len: {CFG.max_len}")

  0%|          | 0/136 [00:00<?, ?it/s]

  0%|          | 0/7294 [00:00<?, ?it/s]

  0%|          | 0/7294 [00:00<?, ?it/s]

max_len: 115


In [15]:
# ====================================================
# Dataset
# ====================================================
def prepare_input(cfg, text):
    inputs = cfg.tokenizer(text,
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs

def prepare_labels(cfg, label):
    if cfg.classification:
        label_onehot = [0 for _ in range(cfg.target_size)]
        label_onehot[cfg.map_score[label]] = 1 
        return torch.tensor(label_onehot, dtype=torch.float)
    elif cfg.ordinal:
        label_ordinal = [1 if i <= cfg.map_score[label] else 0 for i in range(cfg.target_size)]
        return torch.tensor(label_ordinal, dtype=torch.float)
    else:
        return torch.tensor(label, dtype=torch.float)

class TrainDataset(Dataset):
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.texts = df['text'].values
        self.labels = df['score'].values

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, self.texts[item])
        label = prepare_labels(self.cfg, self.labels[item])
        return inputs, label

# Model

In [16]:
# ====================================================
# Model
# ====================================================
class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, self.cfg.target_size)
        self._init_weights(self.fc)
        self.attention = nn.Sequential(
            nn.Linear(self.config.hidden_size, 512),
            nn.Tanh(),
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )
        self._init_weights(self.attention)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        # feature = torch.mean(last_hidden_states, 1)
        weights = self.attention(last_hidden_states)
        feature = torch.sum(weights * last_hidden_states, dim=1)
        return feature

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        return output

# Helper functions

In [17]:
# ====================================================
# Helper functions
# ====================================================
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))

def ordinal_regression(predictions, targets):
    """Ordinal regression with encoding as in https://arxiv.org/pdf/0704.1028.pdf"""
    return nn.MSELoss(reduction='mean')(predictions, targets)


def train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device):
    model.train()
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex)
    losses = AverageMeter()
    start = end = time.time()
    global_step = 0
    for step, (inputs, labels) in enumerate(train_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
#         needed to disable amp, error: half expected got float (maybe some bug)
        if CFG.ordinal:
            y_preds = model(inputs)
        else:
            with torch.cuda.amp.autocast(enabled=CFG.apex):
                y_preds = model(inputs)
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels) 
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        scaler.scale(loss).backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm)
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            global_step += 1
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'Grad: {grad_norm:.4f}  '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          grad_norm=grad_norm,
                          lr=scheduler.get_lr()[0]))
    return losses.avg


def valid_fn(valid_loader, model, criterion, device):
    losses = AverageMeter()
    model.eval()
    preds = []
    start = end = time.time()
    for step, (inputs, labels) in enumerate(valid_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.no_grad():
            y_preds = model(inputs)
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels)
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        if CFG.classification or CFG.ordinal:
            preds.append(y_preds.to('cpu').numpy())
        else:
            preds.append(y_preds.sigmoid().to('cpu').numpy())
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    predictions = np.concatenate(preds)
    if CFG.classification:
        predictions = np.argmax(predictions, axis=1)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    elif CFG.ordinal:
        predictions = (predictions > 0.5).cumprod(axis=1).sum(axis=1) - 1
        predictions = np.clip(predictions, 0, None)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    else:
        predictions = np.concatenate(predictions)
    return losses.avg, predictions


def inference_fn(test_loader, model, device):
    preds = []
    model.eval()
    model.to(device)
    tk0 = tqdm(test_loader, total=len(test_loader))
    for inputs in tk0:
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        with torch.no_grad():
            y_preds = model(inputs)
        if CFG.classification or CFG.ordinal:
            preds.append(y_preds.to('cpu').numpy())
        else:
            preds.append(y_preds.sigmoid().to('cpu').numpy())
    predictions = np.concatenate(preds)
    if CFG.classification:
        predictions = np.argmax(predictions, axis=1)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    elif CFG.ordinal:
        predictions = (predictions > 0.5).cumprod(axis=1).sum(axis=1) - 1
        predictions = np.clip(predictions, 0, None)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    return predictions

In [18]:
# ====================================================
# train loop
# ====================================================
def train_loop(folds, fold):
    
    LOGGER.info(f"========== fold: {fold} training ==========")

    # ====================================================
    # loader
    # ====================================================
    train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
    valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
    if CFG.augment_identity_graph_data != None and CFG.validate_on_original:
        valid_folds = valid_folds[valid_folds['augmented'] == False].reset_index(drop=True)

    valid_labels = valid_folds['score'].values
    
    train_dataset = TrainDataset(CFG, train_folds)
    valid_dataset = TrainDataset(CFG, valid_folds)

    train_loader = DataLoader(train_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=True,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=True)
    valid_loader = DataLoader(valid_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=False,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

    # ====================================================
    # model & optimizer
    # ====================================================
    model = CustomModel(CFG, config_path=None, pretrained=True)
    torch.save(model.config, OUTPUT_DIR+'config.pth')
    model.to(device)
    
    def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
        param_optimizer = list(model.named_parameters())
        no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
        optimizer_parameters = [
            {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': weight_decay},
            {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': 0.0},
            {'params': [p for n, p in model.named_parameters() if "model" not in n],
             'lr': decoder_lr, 'weight_decay': 0.0}
        ]
        return optimizer_parameters

    optimizer_parameters = get_optimizer_params(model,
                                                encoder_lr=CFG.encoder_lr, 
                                                decoder_lr=CFG.decoder_lr,
                                                weight_decay=CFG.weight_decay)
    optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr, eps=CFG.eps, betas=CFG.betas)
    
    # ====================================================
    # scheduler
    # ====================================================
    def get_scheduler(cfg, optimizer, num_train_steps):
        if cfg.scheduler == 'linear':
            scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
            )
        elif cfg.scheduler == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
            )
        return scheduler
    
    num_train_steps = int(len(train_folds) / CFG.batch_size * CFG.epochs)
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)

    # ====================================================
    # loop
    # ====================================================
    if CFG.classification:
        criterion = nn.CrossEntropyLoss()
    elif CFG.ordinal:
        criterion = ordinal_regression
    else:
        criterion = nn.BCEWithLogitsLoss(reduction="mean")
    
    best_score = 0.

    for epoch in range(CFG.epochs):

        start_time = time.time()

        # train
        avg_loss = train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device)

        # eval
        avg_val_loss, predictions = valid_fn(valid_loader, model, criterion, device)
        
        # scoring
        score = get_score(valid_labels, predictions)

        elapsed = time.time() - start_time

        LOGGER.info(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
        LOGGER.info(f'Epoch {epoch+1} - Score: {score:.4f}')       
        if best_score < score:
            best_score = score
            LOGGER.info(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
            torch.save({'model': model.state_dict(),
                        'predictions': predictions},
                        OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth")

    predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth", 
                             map_location=torch.device('cpu'))['predictions']
    valid_folds['pred'] = predictions

    torch.cuda.empty_cache()
    gc.collect()
    
    return valid_folds

In [19]:
if __name__ == '__main__':
    
    def get_result(oof_df):
        labels = oof_df['score'].values
        preds = oof_df['pred'].values
        score = get_score(labels, preds)
        LOGGER.info(f'Score: {score:<.4f}')
    
    oof_df = pd.DataFrame()
    for fold in range(CFG.n_fold):
        if fold in CFG.trn_fold:
            _oof_df = train_loop(train, fold)
            oof_df = pd.concat([oof_df, _oof_df])
            LOGGER.info(f"========== fold: {fold} result ==========")
            get_result(_oof_df)
    oof_df = oof_df.reset_index(drop=True)
    LOGGER.info(f"========== CV ==========")
    get_result(oof_df)
    oof_df.to_csv(OUTPUT_DIR+'oof_df.csv')
    



Downloading:   0%|          | 0.00/273M [00:00<?, ?B/s]

Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/341] Elapsed 0m 1s (remain 8m 14s) Loss: 0.7860(0.7860) Grad: inf  LR: 0.00002000  
Epoch: [1][200/341] Elapsed 0m 34s (remain 0m 23s) Loss: 0.5878(0.6235) Grad: 110814.4141  LR: 0.00001816  
Epoch: [1][340/341] Elapsed 0m 56s (remain 0m 0s) Loss: 0.5990(0.6100) Grad: 68200.6016  LR: 0.00001502  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 29s) Loss: 0.5944(0.5944) 


Epoch 1 - avg_train_loss: 0.6100  avg_val_loss: 0.5871  time: 62s
Epoch 1 - Score: 0.6873
Epoch 1 - Save Best Score: 0.6873 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 1.0010(0.5871) 
Epoch: [2][0/341] Elapsed 0m 0s (remain 1m 54s) Loss: 0.6073(0.6073) Grad: 125924.9844  LR: 0.00001499  
Epoch: [2][200/341] Elapsed 0m 32s (remain 0m 22s) Loss: 0.5161(0.5581) Grad: 121988.3672  LR: 0.00000910  
Epoch: [2][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.5915(0.5583) Grad: 104011.9688  LR: 0.00000504  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 22s) Loss: 0.5813(0.5813) 


Epoch 2 - avg_train_loss: 0.5583  avg_val_loss: 0.5732  time: 61s
Epoch 2 - Score: 0.7226
Epoch 2 - Save Best Score: 0.7226 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.9857(0.5732) 
Epoch: [3][0/341] Elapsed 0m 0s (remain 1m 57s) Loss: 0.5324(0.5324) Grad: 110151.0703  LR: 0.00000501  
Epoch: [3][200/341] Elapsed 0m 32s (remain 0m 22s) Loss: 0.5396(0.5447) Grad: 90049.2734  LR: 0.00000093  
Epoch: [3][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.4360(0.5434) Grad: 124033.2891  LR: 0.00000000  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 24s) Loss: 0.5751(0.5751) 


Epoch 3 - avg_train_loss: 0.5434  avg_val_loss: 0.5829  time: 61s
Epoch 3 - Score: 0.7276
Epoch 3 - Save Best Score: 0.7276 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 1.0609(0.5829) 


Score: 0.7276
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/341] Elapsed 0m 0s (remain 1m 54s) Loss: 0.7213(0.7213) Grad: inf  LR: 0.00002000  
Epoch: [1][200/341] Elapsed 0m 32s (remain 0m 22s) Loss: 0.6085(0.6379) Grad: 63272.9961  LR: 0.00001816  
Epoch: [1][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.5757(0.6129) Grad: 44786.4883  LR: 0.00001502  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 23s) Loss: 0.6132(0.6132) 


Epoch 1 - avg_train_loss: 0.6129  avg_val_loss: 0.5699  time: 61s
Epoch 1 - Score: 0.7377
Epoch 1 - Save Best Score: 0.7377 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.4951(0.5699) 
Epoch: [2][0/341] Elapsed 0m 0s (remain 1m 53s) Loss: 0.5265(0.5265) Grad: 105111.6328  LR: 0.00001499  
Epoch: [2][200/341] Elapsed 0m 33s (remain 0m 23s) Loss: 0.6309(0.5551) Grad: 112667.7656  LR: 0.00000910  
Epoch: [2][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.5619(0.5550) Grad: 130620.6797  LR: 0.00000504  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 22s) Loss: 0.5878(0.5878) 


Epoch 2 - avg_train_loss: 0.5550  avg_val_loss: 0.5575  time: 61s
Epoch 2 - Score: 0.7694
Epoch 2 - Save Best Score: 0.7694 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.4762(0.5575) 
Epoch: [3][0/341] Elapsed 0m 0s (remain 1m 51s) Loss: 0.5641(0.5641) Grad: 166729.6719  LR: 0.00000501  
Epoch: [3][200/341] Elapsed 0m 33s (remain 0m 23s) Loss: 0.6360(0.5432) Grad: 98836.7422  LR: 0.00000093  
Epoch: [3][340/341] Elapsed 0m 56s (remain 0m 0s) Loss: 0.4562(0.5413) Grad: 170194.2344  LR: 0.00000000  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 23s) Loss: 0.5929(0.5929) 


Epoch 3 - avg_train_loss: 0.5413  avg_val_loss: 0.5598  time: 61s
Epoch 3 - Score: 0.7671


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.4684(0.5598) 


Score: 0.7694
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/341] Elapsed 0m 0s (remain 1m 57s) Loss: 0.6863(0.6863) Grad: inf  LR: 0.00002000  
Epoch: [1][200/341] Elapsed 0m 33s (remain 0m 23s) Loss: 0.6551(0.6304) Grad: 39359.2500  LR: 0.00001816  
Epoch: [1][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.5936(0.6088) Grad: 25628.5684  LR: 0.00001502  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 25s) Loss: 0.6141(0.6141) 


Epoch 1 - avg_train_loss: 0.6088  avg_val_loss: 0.5730  time: 61s
Epoch 1 - Score: 0.7344
Epoch 1 - Save Best Score: 0.7344 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.4535(0.5730) 
Epoch: [2][0/341] Elapsed 0m 0s (remain 2m 1s) Loss: 0.5374(0.5374) Grad: 172462.2969  LR: 0.00001499  
Epoch: [2][200/341] Elapsed 0m 33s (remain 0m 23s) Loss: 0.6060(0.5534) Grad: 69152.9219  LR: 0.00000910  
Epoch: [2][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.5001(0.5537) Grad: 98288.8359  LR: 0.00000504  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 22s) Loss: 0.6338(0.6338) 


Epoch 2 - avg_train_loss: 0.5537  avg_val_loss: 0.5689  time: 61s
Epoch 2 - Score: 0.7600
Epoch 2 - Save Best Score: 0.7600 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.4710(0.5689) 
Epoch: [3][0/341] Elapsed 0m 0s (remain 1m 57s) Loss: 0.5605(0.5605) Grad: 172782.5000  LR: 0.00000501  
Epoch: [3][200/341] Elapsed 0m 32s (remain 0m 22s) Loss: 0.5402(0.5344) Grad: 82191.8828  LR: 0.00000093  
Epoch: [3][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.4562(0.5342) Grad: 99481.4453  LR: 0.00000000  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 23s) Loss: 0.6332(0.6332) 


Epoch 3 - avg_train_loss: 0.5342  avg_val_loss: 0.5633  time: 61s
Epoch 3 - Score: 0.7629
Epoch 3 - Save Best Score: 0.7629 Model


EVAL: [111/112] Elapsed 0m 5s (remain 0m 0s) Loss: 0.4523(0.5633) 


Score: 0.7629
Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.LayerNorm.bias', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/341] Elapsed 0m 0s (remain 2m 5s) Loss: 0.5756(0.5756) Grad: 231866.7031  LR: 0.00002000  
Epoch: [1][200/341] Elapsed 0m 33s (remain 0m 22s) Loss: 0.6470(0.6222) Grad: 66065.5703  LR: 0.00001816  
Epoch: [1][340/341] Elapsed 0m 55s (remain 0m 0s) Loss: 0.4747(0.6086) Grad: 61641.9688  LR: 0.00001502  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 24s) Loss: 0.6478(0.6478) 


Epoch 1 - avg_train_loss: 0.6086  avg_val_loss: 0.5984  time: 61s
Epoch 1 - Score: 0.6987
Epoch 1 - Save Best Score: 0.6987 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.6144(0.5984) 
Epoch: [2][0/341] Elapsed 0m 0s (remain 1m 53s) Loss: 0.6133(0.6133) Grad: 121922.0078  LR: 0.00001499  
Epoch: [2][200/341] Elapsed 0m 33s (remain 0m 23s) Loss: 0.5750(0.5645) Grad: 89089.9531  LR: 0.00000910  
Epoch: [2][340/341] Elapsed 0m 56s (remain 0m 0s) Loss: 0.4748(0.5608) Grad: 128803.9531  LR: 0.00000504  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 24s) Loss: 0.6427(0.6427) 


Epoch 2 - avg_train_loss: 0.5608  avg_val_loss: 0.5732  time: 61s
Epoch 2 - Score: 0.7343
Epoch 2 - Save Best Score: 0.7343 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.5793(0.5732) 
Epoch: [3][0/341] Elapsed 0m 0s (remain 2m 0s) Loss: 0.5060(0.5060) Grad: 99340.0703  LR: 0.00000501  
Epoch: [3][200/341] Elapsed 0m 33s (remain 0m 23s) Loss: 0.4329(0.5459) Grad: 151833.1719  LR: 0.00000093  
Epoch: [3][340/341] Elapsed 0m 56s (remain 0m 0s) Loss: 0.5438(0.5480) Grad: 127567.8438  LR: 0.00000000  
EVAL: [0/112] Elapsed 0m 0s (remain 0m 22s) Loss: 0.6450(0.6450) 


Epoch 3 - avg_train_loss: 0.5480  avg_val_loss: 0.5783  time: 61s
Epoch 3 - Score: 0.7384
Epoch 3 - Save Best Score: 0.7384 Model


EVAL: [111/112] Elapsed 0m 4s (remain 0m 0s) Loss: 0.5817(0.5783) 


Score: 0.7384
Score: 0.7456
