# deberta-v3-small Experiments

Fast experiments use 1/5 of the training data and 3 epochs to quickly determine which adaptations help; we report the CV score for each.  
Adaptations we want to explore further are retrained on the whole dataset and evaluated on the leaderboard.  

</br> </br>  


## Fast Experiments 
*1/5 of original training data, 3 epochs, 4 fold cv*

---

### Prompt Engineering
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline           | `fast_baseline_cfg`   | 0.7430   | - | |
|  CPC Context Text   | `fast_context_cfg`    | 0.7523    | - | |  
|  Custom Tokens      | `fast_customtok_cfg`  | 0.7464  | - | |  



**Conclusion**: Adding context text works best here (0.7523 vs. 0.7430 for the baseline), so we continue with it.
</br> </br>  

### Model Type
    
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Regression       | `fast_reg_cfg` |  *see CPC Context Text*  | - | |
|  Classification   | `fast_class_cfg` | 0.7378 | - | |  
|  Ordinal          | `fast_ord_cfg` | 0.7262 | - | |  


**Conclusion**: Plain regression works best (0.7523 vs. 0.7378 for classification and 0.7262 for ordinal), so we continue with it.
</br> </br>  

### Classical NLP Preprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Stemming              |    |   | - | |
| Lemmatizing           |    |   | - | |
| Special Characters    |    |   | - | Removing special characters from the phrases |


### Postprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Clipping              | `fast_post_clip_cfg`     |  0.7423 | - | Range `[0,1]`|
| MinMax                | `fast_post_minmax_cfg`   |  0.7523 | - | Range `[0,1]`|
| Chemical Lookup       | `fast_post_chem_cfg`     |  0.7541 | - |              |

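**Conclusion**: `Clipping` performs slightly worse (0.7423 vs. 0.7523). This is plausible for a Pearson correlation score, because clipping is a non-linear change to the predictions and correlation is sensitive to it.
`MinMax` scores exactly the same as the unprocessed predictions, as expected: Pearson correlation is invariant to positive linear rescaling.
`Chemical Lookup` performs slightly better (0.7541), so we continue with it.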
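
The behaviour of `Clipping` and `MinMax` under a Pearson score can be checked on a toy example. The sketch below uses plain Python and made-up numbers (not predictions from this notebook); the helper names `pearson`, `min_max` and `clip` are illustrative only. It shows that min-max scaling leaves the correlation unchanged while clipping changes it; the direction of the change depends on the data.

```python
import math

def pearson(x, y):
    """Plain-Python Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def clip(values, lo=0.0, hi=1.0):
    """Cut off values outside [lo, hi] (a non-linear transformation)."""
    return [min(max(v, lo), hi) for v in values]

# Made-up labels and raw predictions; two predictions fall outside [0, 1].
y_true = [0.0, 0.25, 0.5, 0.75, 1.0]
y_pred = [-0.2, 0.3, 0.45, 0.8, 1.3]

r_raw = pearson(y_true, y_pred)
r_minmax = pearson(y_true, min_max(y_pred))
r_clip = pearson(y_true, clip(y_pred))

print(f"raw={r_raw:.4f}  minmax={r_minmax:.4f}  clip={r_clip:.4f}")
```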
</br> </br>  

### Data Augmentation
To acquire a fair estimate, we validate only on non-augmented data and keep the training set the same size as in the previous fast experiments.  


|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Identities              |   `fast_aug_ident_cfg` |  0.7456 | - | Adds the reverse mapping for Anchor-Target pairs with score 1|
| All Mirrored            |   `fast_aug_mirr_cfg` |  0.7382 | - | Adds the reverse mapping for all Anchor-Target pairs |
| Identity Paths          |  `fast_aug_identpaths_cfg` | 0.7597  | - | Adds all pairs in a path between phrases connected by a score of 1|
| All Mirrored + Identity Paths |  `fast_aug_mirridentpaths_cfg` | 0.7449 | - | Adds identity paths and additionally mirrors all Anchor-Target pairs|
| Neighbors               |  `fast_aug_neighbors_cfg` | 0.7518 | - | In addition to *All Mirrored + Identity Paths*, considers phrases adjacent to identity paths.|
| Chemical Compounds      |   `fast_aug_chem_cfg`  | 0.7576 | - | Finds synonyms for formulae of common chemical compounds in the dataset and creates new phrases from it.| 

**Conclusion**: Augmentation does not seem to hurt performance significantly, and the additional data will likely help once the training-set size is no longer held fixed.  
For a better comparison, we compare full models based on the leaderboard score.
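
The augmented CSV files are loaded ready-made further down, so the code that produced them is not part of this notebook. As a rough sketch of what *All Mirrored* and *Identity Paths* could look like, the example below treats score-1.0 pairs as edges of a graph (per context) and adds every ordered pair inside a connected component. The function names, the per-context grouping and the tuple layout `(anchor, target, context, score)` are assumptions for illustration; the phrase "article of wood" is made up.

```python
from collections import defaultdict
from itertools import permutations

def mirror_pairs(rows):
    """Add (target, anchor) for every (anchor, target) row that is not already present."""
    seen = {(a, t, c) for a, t, c, _ in rows}
    mirrored = [(t, a, c, s) for a, t, c, s in rows if (t, a, c) not in seen]
    return rows + mirrored

def identity_path_pairs(rows):
    """Link phrases joined by score-1.0 pairs and add all ordered pairs
    within each connected component with score 1.0."""
    graph = defaultdict(set)
    for a, t, c, s in rows:
        if s == 1.0 and a != t:
            graph[(a, c)].add((t, c))
            graph[(t, c)].add((a, c))
    existing = {(a, t, c) for a, t, c, _ in rows}
    visited, new_rows = set(), []
    for start in graph:
        if start in visited:
            continue
        # Depth-first search collects one connected component.
        component, stack = [], [start]
        visited.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in graph[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
        for (a, c), (t, _) in permutations(sorted(component), 2):
            if (a, t, c) not in existing:
                existing.add((a, t, c))
                new_rows.append((a, t, c, 1.0))
    return rows + new_rows

rows = [
    ("wood article", "wooden article", "B44", 1.0),
    ("wooden article", "article of wood", "B44", 1.0),  # made-up pair
    ("wood article", "wooden box", "B44", 0.5),
]
mirrored = mirror_pairs(rows)
augmented = identity_path_pairs(rows)
```

Here the two score-1.0 pairs form a path of three phrases, so the sketch also adds the indirect pair `("wood article", "article of wood")` and all reverse directions, while the 0.5 pair is left untouched.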

</br>  </br>  </br>  

## Full Experiments 
*full training data, 5 epochs, 4 fold cv*

---

    
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline                        | `full_baseline_cfg`      |   |  | |
|  Context Text                    | `full_ctxt_cfg`          |   |  | |  
|  Chemical Lookup                 | `full_post_chem_cfg`     |   |  | |
|  Neighbors Augmentation          | `full_aug_neighbors_cfg` |   |  | |     
|  Chemical Compounds Augmentation | `full_aug_chem_cfg`      |   |  | | 



# Directory settings

In [1]:
# ====================================================
# Directory settings
# ====================================================
import os

INPUT_DIR = '../input/us-patent-phrase-to-phrase-matching/'
IDENTITY_MAPPINGS_DIR = '../input/uspppm-identity-mappings/'
OUTPUT_DIR = './'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# Model Configuration

In [2]:
from dataclasses import dataclass, field
from typing import Set, Optional, Tuple, Dict


@dataclass
class ModelConfig:
    ############################################
    # Prompt engineering
    ############################################
    # - 'ctx_txt' (default): append the context text to the end of the prompt
    # - 'custom_tok': add custom tokens for contexts and a custom separator token
    # - None: append the context abbreviation to the end of the prompt
    prompt_engineering: Optional[str] = 'ctx_txt' # ['ctx_txt', 'custom_tok', None]
    custom_sep_token: str = '[S]'
        
    ############################################
    # Model Type
    ############################################
    # when true uses classification else a regression model
    classification: bool = False
    # when true uses ordinal regression (only active when `classification = False`)
    ordinal: bool = False
    
    ############################################
    # Traditional NLP preprocessing
    ############################################
    stemming: bool = False
    lemmatizing: bool = False
    special_chr_rem: bool = False
        
    ############################################
    # Post Processing
    ############################################
    clipping: bool = False
    minmax: bool = False
    # Average the predictions over the chemical compound synonyms of each sample
    chem_comp_pred_avg: bool = False
        
    ############################################
    # Data Augmentation
    ############################################
    # Use chemical component synonyms to create samples for augmenting the training set
    chem_comp_train_aug: bool = False
        
    # How to augment the data for training using the graph-based identity mappings
    # available options ['neighbors', 'paths_mirrored', 'paths', 'mirrored', 'identities', None]
    # None indicates that no augmentation should take place
    augment_identity_graph_data: Optional[str] = None 
    validate_on_original: bool = True
        
    
    ############################################
    # General Model Config and Hyperparams
    ############################################
    debug: bool = False
    apex: bool = True
    print_freq: int= 200
    num_workers: int = 4
    model: str = "microsoft/deberta-v3-small"
    scheduler: str = 'cosine' # ['linear', 'cosine']
    batch_scheduler: bool = True
    num_cycles: float = 0.5
    num_warmup_steps: int = 0
    encoder_lr: float = 2e-5
    decoder_lr: float = 2e-5
    min_lr: float = 1e-6
    eps: float = 1e-6
    betas: Tuple[float, float] = (0.9, 0.999)
    batch_size: int = 16
    fc_dropout: float = 0.2
    max_len: int = 512
    weight_decay: float = 0.01
    gradient_accumulation_steps: int = 1
    max_grad_norm: int = 1000
    seed: int = 42
    
    epochs: int = 5
    train_frac: Optional[float] = None
    n_fold: int = 4
    trn_fold: Tuple[int, ...] = (0, 1, 2, 3)
    map_score: Dict[float, int] = field(default_factory = lambda: ({0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}))
    map_labels: Dict[int, float] = field(default_factory = lambda: ({0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}))
        
    target_size: int=1
    def __post_init__(self):
        if self.classification or self.ordinal:
            self.target_size = 5
        else:
            self.target_size = 1  


############################################
# Experiment Configurations
############################################

# Fast Experiments
FAST_BASE_TRAIN_FRAC = 1/5
FAST_BASE_TRAIN_EPOCHS = 3

def fastModelConfg(**kwargs):
    return ModelConfig(**kwargs, train_frac=FAST_BASE_TRAIN_FRAC, epochs=FAST_BASE_TRAIN_EPOCHS)


fast_baseline_cfg = fastModelConfg(prompt_engineering=None)
fast_context_cfg = fastModelConfg()
fast_context_ctx = fast_context_cfg  # original (misspelled) name, kept as an alias
fast_customtok_cfg = fastModelConfg(prompt_engineering='custom_tok')

fast_reg_cfg = fastModelConfg()
fast_class_cfg = fastModelConfg(classification=True)
fast_ord_cfg = fastModelConfg(ordinal=True)

fast_aug_ident_cfg = fastModelConfg(augment_identity_graph_data='identities')
fast_aug_mirr_cfg = fastModelConfg(augment_identity_graph_data='mirrored')
fast_aug_identpaths_cfg = fastModelConfg(augment_identity_graph_data='paths')
fast_aug_mirridentpaths_cfg = fastModelConfg(augment_identity_graph_data='paths_mirrored')
fast_aug_neighbors_cfg = fastModelConfg(augment_identity_graph_data='neighbors')
fast_aug_chem_cfg = fastModelConfg(chem_comp_train_aug=True)

fast_post_clip_cfg = fastModelConfg(clipping=True)
fast_post_minmax_cfg = fastModelConfg(minmax=True)
fast_post_chem_cfg = fastModelConfg(chem_comp_pred_avg=True)

# Full Experiments
full_baseline_cfg = ModelConfig(prompt_engineering=None)
full_ctxt_cfg = ModelConfig()
full_post_chem_cfg = ModelConfig(chem_comp_pred_avg=True)
full_aug_neighbors_cfg = ModelConfig(augment_identity_graph_data='neighbors')
full_aug_chem_cfg = ModelConfig(chem_comp_train_aug=True)
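
Two details of the dataclass above are easy to miss: mutable defaults go through `field(default_factory=...)`, and `target_size` is derived in `__post_init__`, so it is only recomputed when a config object is constructed. The sketch below reproduces just these two mechanics on a stripped-down, hypothetical `MiniConfig` (not the notebook's `ModelConfig`).

```python
from dataclasses import dataclass, field, replace
from typing import Dict

@dataclass
class MiniConfig:
    classification: bool = False
    ordinal: bool = False
    # A bare dict default would raise ValueError; default_factory builds a fresh dict per instance.
    map_score: Dict[float, int] = field(
        default_factory=lambda: {0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4})
    target_size: int = 1

    def __post_init__(self):
        # Derived field: five outputs for the class-based heads, one for regression.
        self.target_size = 5 if (self.classification or self.ordinal) else 1

reg_cfg = MiniConfig()
ord_cfg = MiniConfig(ordinal=True)

# dataclasses.replace builds a new object, so __post_init__ runs again.
cls_cfg = replace(reg_cfg, classification=True)

# Plain attribute assignment does NOT re-run __post_init__.
stale_cfg = MiniConfig()
stale_cfg.classification = True
print(reg_cfg.target_size, ord_cfg.target_size, cls_cfg.target_size, stale_cfg.target_size)
```

The last case is the reason to create variants through the constructor (as `fastModelConfg(**kwargs)` does) rather than by mutating an existing config.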


### Config used throughout this notebook

In [3]:
############################################
# Fast Experiments
############################################

# CFG = fast_baseline_cfg
# CFG = fast_context_ctx
# CFG = fast_customtok_cfg

# CFG = fast_reg_cfg
# CFG = fast_class_cfg
# CFG = fast_ord_cfg

# CFG = fast_aug_ident_cfg
# CFG = fast_aug_mirr_cfg
# CFG = fast_aug_identpaths_cfg
# CFG = fast_aug_mirridentpaths_cfg
# CFG = fast_aug_neighbors_cfg
# CFG = fast_aug_chem_cfg

# CFG = fast_post_clip_cfg
# CFG = fast_post_minmax_cfg
# CFG = fast_post_chem_cfg



############################################
# Full Experiments
############################################
# CFG = full_baseline_cfg
# CFG = full_ctxt_cfg
CFG = full_post_chem_cfg
# CFG = full_aug_neighbors_cfg
# CFG = full_aug_chem_cfg

# Library

In [4]:
# ====================================================
# Library
# ====================================================
import os
import gc
import re
import ast
import sys
import copy
import json
import time
import math
import shutil
import string
import pickle
import random
import joblib
import itertools
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
from sklearn.preprocessing import minmax_scale
    
import torch
print(f"torch.__version__: {torch.__version__}")
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset

os.system('pip uninstall -y transformers')
os.system('pip uninstall -y tokenizers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels transformers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels tokenizers')
import tokenizers
import transformers
print(f"tokenizers.__version__: {tokenizers.__version__}")
print(f"transformers.__version__: {transformers.__version__}")
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
%env TOKENIZERS_PARALLELISM=true

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import nltk
nltk.data.path.append('../input/wordnet')
from nltk.stem import WordNetLemmatizer


# common chemical formulae lookup utility script
from uspppm_common_chemical_compound_lookup import USPPPMChemCompLookup

torch.__version__: 1.9.1
Found existing installation: transformers 4.16.2
Uninstalling transformers-4.16.2:
  Successfully uninstalled transformers-4.16.2
Found existing installation: tokenizers 0.11.6
Uninstalling tokenizers-0.11.6:
  Successfully uninstalled tokenizers-0.11.6
Looking in links: ../input/pppm-pip-wheels
Processing /kaggle/input/pppm-pip-wheels/transformers-4.18.0-py3-none-any.whl
Processing /kaggle/input/pppm-pip-wheels/tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.12.1 transformers-4.18.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.9.1 requires transformers<4.17,>=4.1, but you have transformers 4.18.0 which is incompatible.
Looking in links: ../input/pppm-pip-wheels
tokenizers.__version__: 0.12.1
transformers.__version__: 4.18.0
env: TOKENIZERS_PARALLELISM=true


# Utils

### Chemical Compound Lookup

In [5]:
chem_lookup = USPPPMChemCompLookup(chem_comp_path='../input/chemical-compounds-lookup/compounds.csv')

**Chemical Compound Lookup Tests**

In [6]:
print('Testing basic formula lookup...')
print(chem_lookup.lookup_df)
print(chem_lookup.phrase_chem_formula_synonym('agbr test'))
print(chem_lookup.phrase_chem_formula_synonym('agbr dna test agonc ag2cl2'))
print(chem_lookup.phrase_chem_formula_synonym('dna test d2o'))
print(chem_lookup.chem_formula_synonyms('c3h6'))
print('Done!')
print()
print('Testing train dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),
                          'score': pd.Series([3, 2, 1, 0]),
                          'anchor': pd.Series(['agbr dna test agonc ag2cl2', 'agbr test', 'agbr', 'last']),
                          'target': pd.Series(['agonc ag2cl2', 'test thingy', 'c4h7no4', 'last']),
                          'context': pd.Series(['G02', 'G02', 'C12', 'C12'])})
print("Before")
print(chem_test)
print("After")
print(chem_lookup.pre_augment_chem_formulae(chem_test, True))
print('Done!')
print()
print('Testing test dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),
                          'text': pd.Series([
                              'agbr dna test agonc ag2cl2 [SEP] agonc ag2cl2 [SEP] G02',
                              'agbr test [SEP] test thingy [SEP] G02',
                              'agbr [SEP] c4h7no4 [SEP] C12',
                              'last [SEP] last [SEP] C12'
                          ])})
print("Before")
print(chem_test)
print("After")
print(chem_lookup.post_augment_chem_formulae(chem_test, True))
print('Done!')

Testing basic formula lookup...
                            Name  Formula
0               actiniumiiioxide    ac2o3
1     silvertetrachloroaluminate  agalcl4
2                  silverbromide     agbr
3                  silverbromate   agbro3
4                  silvercyanide     agcn
...                          ...      ...
4067                zirconateion    zro32
4068          zirconiumphosphide     zrp2
4069            zirconiumsulfide     zrs2
4070           zirconiumsilicide    zrsi2
4071          zirconiumphosphate  zr3po44

[4072 rows x 2 columns]
[('silverbromide test', True)]
[('silverbromide dna test silverfulminate disilverdichloride', True), ('silverbromide dna test silvercyanate disilverdichloride', True), ('silverbromide dna test silverfulminate silveriidichloride', True), ('silverbromide dna test silvercyanate silveriidichloride', True)]
[('dna test deuteriumoxide', True), ('dna test heavywater', True)]
['cyclopropane', 'propylene']
Done!

Testing train dataset augmentat

In [7]:
# ====================================================
# Utils
# ====================================================
def get_score(y_true, y_pred):
    score = sp.stats.pearsonr(y_true, y_pred)[0]
    return score


def get_logger(filename=OUTPUT_DIR+'train'):
    from logging import getLogger, INFO, StreamHandler, FileHandler, Formatter
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=f"{filename}.log")
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    logger.propagate = False
    return logger

LOGGER = get_logger()

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

# Data Loading

In [8]:
# ====================================================
# Data Loading
# ====================================================
orig_data = pd.read_csv(INPUT_DIR+'train.csv')

if CFG.augment_identity_graph_data == 'neighbors':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_path_neighbors.csv')
elif CFG.augment_identity_graph_data == 'paths_mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_paths.csv')
elif CFG.augment_identity_graph_data == 'paths':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_paths_mirrored.csv')
elif CFG.augment_identity_graph_data == 'mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored.csv')
elif CFG.augment_identity_graph_data == 'identities':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_mirrored.csv')
elif CFG.augment_identity_graph_data is None:
    train = orig_data
else:
    raise ValueError('CFG.augment_identity_graph_data = {} not recognized!'.format(CFG.augment_identity_graph_data))

if CFG.chem_comp_train_aug:
    train = chem_lookup.pre_augment_chem_formulae(train)
    
display(train)
    
if CFG.train_frac:
    # to get a fair estimate we always use a fraction of the original data
    n = int(CFG.train_frac * len(orig_data))
    train = train.sample(n=n, replace=False, ignore_index=True)
    
test = pd.read_csv(INPUT_DIR+'test.csv')
submission = pd.read_csv(INPUT_DIR+'sample_submission.csv')
print(f"train.shape: {train.shape}")
print(f"test.shape: {test.shape}")
print(f"submission.shape: {submission.shape}")
# display(train.head())
# display(test.head())
# display(submission.head())

Unnamed: 0,id,anchor,target,context,score
0,37d61fd2272659b1,abatement,abatement of pollution,A47,0.50
1,7b9652b17b68b7a4,abatement,act of abating,A47,0.75
2,36d72442aefd8232,abatement,active catalyst,A47,0.25
3,5296b0c19e1ce60e,abatement,eliminating process,A47,0.50
4,54c1e3b9184cb5b6,abatement,forest region,A47,0.00
...,...,...,...,...,...
36468,8e1386cbefd7f245,wood article,wooden article,B44,1.00
36469,42d9e032d1cd3242,wood article,wooden box,B44,0.50
36470,208654ccb9e14fa3,wood article,wooden handle,B44,0.50
36471,756ec035e694722b,wood article,wooden material,B44,0.75


train.shape: (36473, 5)
test.shape: (36, 4)
submission.shape: (36, 2)


# Pre-processing

In [9]:
# Add an augmented indicator.
# Hack: augmented rows are recognised by an underscore in their id.
if 'augmented' not in train.columns:
    train['augmented'] = train['id'].str.contains('_')
else:
    train['augmented'] = train['id'].str.contains('_') | train['augmented']
    
print(sum(train['augmented']), 'augmented samples')

0 augmented samples


In [10]:
# ====================================================
# CPC Data
# ====================================================
def get_cpc_texts():
    contexts = []
    pattern = r'[A-Z]\d+'  # raw string, so that \d is not treated as a string escape
    for file_name in os.listdir('../input/cpc-data/CPCSchemeXML202105'):
        result = re.findall(pattern, file_name)
        if result:
            contexts.append(result)
    contexts = sorted(set(sum(contexts, [])))
    results = {}
    for cpc in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
        with open(f'../input/cpc-data/CPCTitleList202202/cpc-section-{cpc}_20220201.txt') as f:
            s = f.read()
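        pattern = f'{cpc}\t\t.+'
        result = re.findall(pattern, s)
        # str.lstrip(pattern) strips a *set of characters*, not a prefix, and can eat
        # leading letters of a title; split on the tab separator instead.
        cpc_result = result[0].split('\t\t', 1)[1]
        for context in [c for c in contexts if c[0] == cpc]:
            pattern = f'{context}\t\t.+'
            result = re.findall(pattern, s)
            results[context] = cpc_result + ". " + result[0].split('\t\t', 1)[1]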
    return results


cpc_texts = get_cpc_texts()
torch.save(cpc_texts, OUTPUT_DIR+"cpc_texts.pth")
train['context_text'] = train['context'].map(cpc_texts)
test['context_text'] = test['context'].map(cpc_texts)
# display(train.head())
# display(test.head())

# CV split

In [11]:
# ====================================================
# CV split
# ====================================================
train['score_map'] = train['score'].map({0.00: 0, 0.25: 1, 0.50: 2, 0.75: 3, 1.00: 4})
Fold = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
for n, (train_index, val_index) in enumerate(Fold.split(train, train['score_map'])):
    train.loc[val_index, 'fold'] = int(n)
train['fold'] = train['fold'].astype(int)
# display(train.groupby('fold').size())

In [12]:
def get_sec_toks(df):
    return '[' + df['context'].str[0] + ']'

if CFG.prompt_engineering == 'custom_tok':
    train['text'] = get_sec_toks(train) + train['context_text'] + '[SEP]'+ train['anchor'] + CFG.custom_sep_token + train['target']
    test['text'] = get_sec_toks(test) + test['context_text'] + '[SEP]' + test['anchor'] + CFG.custom_sep_token + test['target']
elif CFG.prompt_engineering == 'ctx_txt':
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]'  + train['context_text']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]'  + test['context_text']
elif CFG.prompt_engineering is None:
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]'  + train['context']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]'  + test['context']
else:
    raise ValueError('CFG.prompt_engineering = {} not recognized!'.format(CFG.prompt_engineering))
    

print(train['text'][0])
display(train['text'].head())
display(test['text'].head())

abatement[SEP]abatement of pollution[SEP]HUMAN NECESSITIES. FURNITURE; DOMESTIC ARTICLES OR APPLIANCES; COFFEE MILLS; SPICE MILLS; SUCTION CLEANERS IN GENERAL


0    abatement[SEP]abatement of pollution[SEP]HUMAN...
1    abatement[SEP]act of abating[SEP]HUMAN NECESSI...
2    abatement[SEP]active catalyst[SEP]HUMAN NECESS...
3    abatement[SEP]eliminating process[SEP]HUMAN NE...
4    abatement[SEP]forest region[SEP]HUMAN NECESSIT...
Name: text, dtype: object

0    opc drum[SEP]inorganic photoconductor drum[SEP...
1    adjust gas flow[SEP]altering gas flow[SEP]MECH...
2    lower trunnion[SEP]lower locating[SEP]PERFORMI...
3    cap component[SEP]upper portion[SEP]TEXTILES; ...
4    neural stimulation[SEP]artificial neural netwo...
Name: text, dtype: object
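
To make the three `prompt_engineering` options easier to compare, the following sketch builds the prompt for a single row with plain string formatting. `build_prompt` is a hypothetical helper written for this illustration; it mirrors the column concatenation in the cell above but is not used by the notebook.

```python
def build_prompt(anchor, target, context, context_text, mode, sep="[S]"):
    """Return the model input text for one sample under the given prompt mode."""
    if mode == "custom_tok":
        # Section token such as "[A]", then the context text, then both phrases.
        section_token = f"[{context[0]}]"
        return f"{section_token}{context_text}[SEP]{anchor}{sep}{target}"
    if mode == "ctx_txt":
        return f"{anchor}[SEP]{target}[SEP]{context_text}"
    if mode is None:
        # Baseline: only the CPC code, e.g. "A47".
        return f"{anchor}[SEP]{target}[SEP]{context}"
    raise ValueError(f"mode = {mode!r} not recognized!")

row = dict(
    anchor="abatement",
    target="abatement of pollution",
    context="A47",
    context_text="HUMAN NECESSITIES. FURNITURE; DOMESTIC ARTICLES OR APPLIANCES; "
                 "COFFEE MILLS; SPICE MILLS; SUCTION CLEANERS IN GENERAL",
)

for mode in (None, "ctx_txt", "custom_tok"):
    print(repr(mode), "->", build_prompt(**row, mode=mode))
```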

# tokenizer

In [13]:
# ====================================================
# tokenizer
# ====================================================
tokenizer = AutoTokenizer.from_pretrained(CFG.model)


# add special tokens for sections
cpc_sections = [
    'A', # Human Necessities
    'B', # Operations and Transport
    'C', # Chemistry and Metallurgy
    'D', # Textiles
    'E', # Fixed Constructions
    'F', # Mechanical Engineering
    'G', # Physics
    'H', # Electricity
    'Y' # Emerging Cross-Sectional Technologies
]
if CFG.prompt_engineering == 'custom_tok':
    tokenizer.add_special_tokens({'additional_special_tokens': ['['+  s + ']' for s in cpc_sections]})
    print(tokenizer.all_special_tokens)
    
tokenizer.save_pretrained(OUTPUT_DIR+'tokenizer/')
CFG.tokenizer = tokenizer

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Dataset

In [14]:
# ====================================================
# Define max_len
# ====================================================
lengths_dict = {}

lengths = []
tk0 = tqdm(cpc_texts.values(), total=len(cpc_texts))
for text in tk0:
    length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
    lengths.append(length)
lengths_dict['context_text'] = lengths

for text_col in ['anchor', 'target']:
    lengths = []
    tk0 = tqdm(train[text_col].fillna("").values, total=len(train))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
        lengths.append(length)
    lengths_dict[text_col] = lengths
    
CFG.max_len = max(lengths_dict['anchor']) + max(lengths_dict['target'])\
                + max(lengths_dict['context_text']) + 4 # CLS + SEP + SEP + SEP
LOGGER.info(f"max_len: {CFG.max_len}")

max_len: 133


In [15]:
# ====================================================
# Dataset
# ====================================================
def prepare_input(cfg, text):
    inputs = cfg.tokenizer(text,
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs

def prepare_labels(cfg, label):
    if cfg.classification:
        label_onehot = [0 for _ in range(cfg.target_size)]
        label_onehot[cfg.map_score[label]] = 1 
        return torch.tensor(label_onehot, dtype=torch.float)
    elif cfg.ordinal:
        label_ordinal = [1 if i <= cfg.map_score[label] else 0 for i in range(cfg.target_size)]
        return torch.tensor(label_ordinal, dtype=torch.float)
    else:
        return torch.tensor(label, dtype=torch.float)

class TrainDataset(Dataset):
    def __init__(self, cfg, df, chem_lookup=None):
        self.cfg = cfg
        if cfg.chem_comp_pred_avg and chem_lookup:
            df = chem_lookup.post_augment_chem_formulae(df)
        self.texts = df['text'].values
        self.labels = df['score'].values
        self.ids = df['id'].values

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, self.texts[item])
        label = prepare_labels(self.cfg, self.labels[item])
        return inputs, label, self.ids[item]

# Model

In [16]:
# ====================================================
# Model
# ====================================================
class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, self.cfg.target_size)
        self._init_weights(self.fc)
        self.attention = nn.Sequential(
            nn.Linear(self.config.hidden_size, 512),
            nn.Tanh(),
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )
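        # NOTE: `self.attention` is an nn.Sequential, which matches none of the isinstance
        # branches in `_init_weights`, so this call leaves PyTorch's default initialisation
        # in place. `self.attention.apply(self._init_weights)` would initialise the layers.
        self._init_weights(self.attention)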
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        # feature = torch.mean(last_hidden_states, 1)
        weights = self.attention(last_hidden_states)
        feature = torch.sum(weights * last_hidden_states, dim=1)
        return feature

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        return output
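
The `feature` method pools the token representations with learned attention weights instead of a plain mean. The arithmetic can be followed on a tiny example in plain Python. In this sketch the per-token scores are given directly, whereas the model computes them with its `Linear -> Tanh -> Linear` head; `softmax` and `attention_pool` are illustrative names, not functions from the notebook.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(hidden_states, token_scores):
    """Weighted sum of token vectors.

    hidden_states: list of seq_len vectors (each a list of floats)
    token_scores:  one unnormalised score per token
    """
    weights = softmax(token_scores)  # normalise over the sequence dimension
    dim = len(hidden_states[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)]

hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 tokens, hidden size 2

uniform = attention_pool(hidden, [0.0, 0.0, 0.0])   # equal scores: plain mean pooling
focused = attention_pool(hidden, [0.0, 50.0, 0.0])  # one dominant score: about that token's vector
print(uniform, focused)
```

Note that the module above applies the softmax over all positions without using the attention mask, so padding tokens also receive some weight; a masked variant would set their scores to a large negative value before the softmax.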

# Helper functions

In [17]:
# ====================================================
# Helper functions
# ====================================================
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))

def ordinal_regression(predictions, targets):
    """Ordinal regression with encoding as in https://arxiv.org/pdf/0704.1028.pdf"""
    return nn.MSELoss(reduction='mean')(predictions, targets)

def average_by_id(df):
    """Average the numeric columns per `id`, keeping the original order of ids."""
    orig_id_order = df['id']
    unordered_means = df.groupby('id').mean().reset_index()
    return unordered_means.set_index('id').loc[orig_id_order].reset_index().drop_duplicates()


def train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device):
    model.train()
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex)
    losses = AverageMeter()
    start = end = time.time()
    global_step = 0
    for step, (inputs, labels, _) in enumerate(train_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        # autocast is skipped for the ordinal model: with AMP enabled it failed with
        # "half expected got float" (possibly a bug)
        if CFG.ordinal:
            y_preds = model(inputs)
        else:
            with torch.cuda.amp.autocast(enabled=CFG.apex):
                y_preds = model(inputs)
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels) 
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        scaler.scale(loss).backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm)
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            global_step += 1
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'Grad: {grad_norm:.4f}  '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          grad_norm=grad_norm,
                          lr=scheduler.get_last_lr()[0]))
    return losses.avg


def valid_fn(valid_loader, model, criterion, device):
    losses = AverageMeter()
    model.eval()
    preds = []
    ids = []
    start = end = time.time()
    for step, (inputs, labels, sample_id) in enumerate(valid_loader):
        ids.append(sample_id)
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.no_grad():
            y_preds = model(inputs)
            
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels)
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        if CFG.classification or CFG.ordinal:
            preds.append(y_preds.to('cpu').numpy())
        else:
            preds.append(y_preds.sigmoid().to('cpu').numpy())
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    predictions = np.concatenate(preds)
    ids = np.concatenate(ids)
    # print("Sanity Check:", sum(pd.Series(ids).duplicated()))
    if CFG.classification:
        predictions = np.argmax(predictions, axis=1)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    elif CFG.ordinal:
        predictions = (predictions > 0.5).cumprod(axis=1).sum(axis=1) - 1
        predictions = np.clip(predictions, 0, None)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    else:
        predictions = predictions.reshape(-1)  # flatten (N, 1) -> (N,)
        if CFG.clipping:
            predictions = np.clip(predictions, 0, 1)
        if CFG.minmax:
            predictions = minmax_scale(predictions, feature_range=(0, 1))
            
    if CFG.chem_comp_pred_avg:
        pred_new = average_by_id(pd.DataFrame({'pred': predictions, 'id': ids}))
        predictions = pred_new['pred'].to_numpy()
    return losses.avg, predictions
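
The ordinal branch of `valid_fn` above turns the model's per-threshold outputs into a single rank by counting how many leading outputs exceed 0.5. A minimal sketch on toy values, assuming each row holds one output per ordinal threshold; `decode_ordinal` and `toy_outputs` are hypothetical names used only for illustration:

```python
import numpy as np


def decode_ordinal(outputs, threshold=0.5):
    """Map per-threshold outputs to a 0-based rank, mirroring the logic in valid_fn."""
    passed = outputs > threshold
    # cumprod keeps the leading run of passes and zeroes everything
    # after the first failed threshold
    leading = passed.cumprod(axis=1).sum(axis=1)
    # rows with no leading pass would give -1, so clip to rank 0
    return np.clip(leading - 1, 0, None)


toy_outputs = np.array([
    [0.9, 0.8, 0.2, 0.7],  # two leading passes; the late 0.7 is ignored
    [0.9, 0.9, 0.9, 0.9],  # all four thresholds passed
    [0.1, 0.9, 0.9, 0.9],  # first threshold failed
])
ranks = decode_ordinal(toy_outputs)
print(ranks)
```

In `valid_fn` the resulting rank is then looked up in `CFG.map_labels` to recover the score.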


In [18]:
# ====================================================
# train loop
# ====================================================
def train_loop(folds, fold):
    
    LOGGER.info(f"========== fold: {fold} training ==========")

    # ====================================================
    # loader
    # ====================================================
    train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
    valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
    if CFG.augment_identity_graph_data is not None and CFG.validate_on_original:
        valid_folds = valid_folds[valid_folds['augmented'] == False].reset_index(drop=True)

    valid_labels = valid_folds['score'].values
    
    train_dataset = TrainDataset(CFG, train_folds)
    # we only want the prediction using chemical synonyms for the validation
    valid_dataset = TrainDataset(CFG, valid_folds, chem_lookup)

    train_loader = DataLoader(train_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=True,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=True)
    valid_loader = DataLoader(valid_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=False,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

    # ====================================================
    # model & optimizer
    # ====================================================
    model = CustomModel(CFG, config_path=None, pretrained=True)
    torch.save(model.config, OUTPUT_DIR+'config.pth')
    model.to(device)
    
    def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
        # Three groups: encoder weights (with weight decay), encoder biases and
        # LayerNorm parameters (no decay), and the task head (decoder_lr, no decay)
        no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
        optimizer_parameters = [
            {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': weight_decay},
            {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': 0.0},
            {'params': [p for n, p in model.named_parameters() if "model" not in n],
             'lr': decoder_lr, 'weight_decay': 0.0}
        ]
        return optimizer_parameters

    optimizer_parameters = get_optimizer_params(model,
                                                encoder_lr=CFG.encoder_lr, 
                                                decoder_lr=CFG.decoder_lr,
                                                weight_decay=CFG.weight_decay)
    optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr, eps=CFG.eps, betas=CFG.betas)
    
    # ====================================================
    # scheduler
    # ====================================================
    def get_scheduler(cfg, optimizer, num_train_steps):
        if cfg.scheduler == 'linear':
            scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
            )
        elif cfg.scheduler == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
            )
        else:
            raise ValueError(f"Unknown scheduler: {cfg.scheduler!r}")
        return scheduler
    
    num_train_steps = int(len(train_folds) / CFG.batch_size * CFG.epochs)
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)

    # ====================================================
    # loop
    # ====================================================
    if CFG.classification:
        criterion = nn.CrossEntropyLoss()
    elif CFG.ordinal:
        criterion = ordinal_regression
    else:
        criterion = nn.BCEWithLogitsLoss(reduction="mean")
    
    best_score = 0.

    for epoch in range(CFG.epochs):

        start_time = time.time()

        # train
        avg_loss = train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device)

        # eval
        avg_val_loss, predictions = valid_fn(valid_loader, model, criterion, device)
        
        # scoring
        score = get_score(valid_labels, predictions)

        elapsed = time.time() - start_time

        LOGGER.info(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
        LOGGER.info(f'Epoch {epoch+1} - Score: {score:.4f}')       
        if best_score < score:
            best_score = score
            LOGGER.info(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
            torch.save({'model': model.state_dict(),
                        'predictions': predictions},
                        OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth")

    predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth", 
                             map_location=torch.device('cpu'))['predictions']
    valid_folds['pred'] = predictions

    torch.cuda.empty_cache()
    gc.collect()
    
    return valid_folds
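
The `'cosine'` branch of `get_scheduler` multiplies the base learning rate by a factor that decays along half a cosine wave. The sketch below reimplements that factor in plain Python as an illustration. It assumes zero warmup steps, `num_cycles=0.5`, a base rate of `2e-5`, and `1709 * 5` total steps; none of these `CFG` values are shown in this section, and `cosine_lr_factor` is a hypothetical helper, not the library function itself:

```python
import math


def cosine_lr_factor(step, num_warmup_steps, num_training_steps, num_cycles=0.5):
    """Multiplier applied to the base LR at a given optimizer step."""
    if step < num_warmup_steps:
        # linear warmup from 0 to 1
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    # half a cosine period when num_cycles == 0.5: factor goes from 1 down to 0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))


base_lr = 2e-5            # assumed value of CFG.encoder_lr
steps_per_epoch = 1709    # assumed from the training log
total_steps = steps_per_epoch * 5

lr_start = base_lr * cosine_lr_factor(0, 0, total_steps)
lr_epoch2 = base_lr * cosine_lr_factor(steps_per_epoch, 0, total_steps)
lr_end = base_lr * cosine_lr_factor(total_steps, 0, total_steps)
print(f"{lr_start:.8f} {lr_epoch2:.8f} {lr_end:.8f}")
```

Under these assumptions the rate at the start of the second epoch comes out near `1.809e-5`, which appears consistent with the `LR` column in the training log.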

In [19]:
if __name__ == '__main__':
    
    def get_result(oof_df):
        labels = oof_df['score'].values
        preds = oof_df['pred'].values
        score = get_score(labels, preds)
        LOGGER.info(f'Score: {score:<.4f}')
    
    oof_df = pd.DataFrame()
    for fold in range(CFG.n_fold):
        if fold in CFG.trn_fold:
            _oof_df = train_loop(train, fold)
            oof_df = pd.concat([oof_df, _oof_df])
            LOGGER.info(f"========== fold: {fold} result ==========")
            get_result(_oof_df)
    oof_df = oof_df.reset_index(drop=True)
    LOGGER.info("========== CV ==========")
    get_result(oof_df)
    oof_df.to_csv(OUTPUT_DIR+'oof_df.csv')
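
`get_result` is called once per fold and once on the concatenated out-of-fold frame, so the final CV score is computed over all held-out predictions together rather than as a mean of the fold scores. A small sketch of the difference, assuming Pearson correlation as the metric (`get_score` is not shown in this section, so `pearson` is a stand-in) and using made-up toy data:

```python
import numpy as np


def pearson(labels, preds):
    """Stand-in for get_score; Pearson correlation is an assumption."""
    return float(np.corrcoef(labels, preds)[0, 1])


# hypothetical out-of-fold results for two folds: (labels, predictions)
toy_folds = [
    (np.array([0.0, 0.5, 1.0]), np.array([0.1, 0.4, 0.9])),
    (np.array([0.0, 0.25, 0.75]), np.array([0.3, 0.5, 0.8])),
]

# score of each fold on its own, as logged after every fold
fold_scores = [pearson(labels, preds) for labels, preds in toy_folds]

# score on the concatenated out-of-fold predictions, as logged under "CV"
all_labels = np.concatenate([labels for labels, _ in toy_folds])
all_preds = np.concatenate([preds for _, preds in toy_folds])
cv_score = pearson(all_labels, all_preds)

print(fold_scores, float(np.mean(fold_scores)), cv_score)
```

The pooled score need not equal the mean of the fold scores, because pooling also penalises offsets between folds.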




Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['mask_predictions.dense.bias', 'lm_predictions.lm_head.dense.bias', 'lm_predictions.lm_head.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.weight', 'mask_predictions.classifier.weight', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.bias']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/1709] Elapsed 0m 1s (remain 42m 53s) Loss: 0.8040(0.8040) Grad: inf  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 38s (remain 4m 48s) Loss: 0.6125(0.6306) Grad: 49369.7969  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 15s (remain 4m 6s) Loss: 0.5629(0.6096) Grad: 51028.5742  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.6407(0.5983) Grad: 63374.3906  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.6690(0.5902) Grad: 111753.0156  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 6s (remain 2m 11s) Loss: 0.5897(0.5854) Grad: 87120.7031  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 43s (remain 1m 34s) Loss: 0.6149(0.5807) Grad: 35933.2227  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 20s (remain 0m 57s) Loss: 0.6781(0.5774) Grad: 71492.0547  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 57s (remain 0m 20s) Loss: 0.6919(0.5750) Grad: 151317.5938  LR: 0.00001832  
Epoch: [1][1708/1709] Elapsed 5m 

Epoch 1 - avg_train_loss: 0.5737  avg_val_loss: 0.5660  time: 347s
Epoch 1 - Score: 0.8052
Epoch 1 - Save Best Score: 0.8052 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.5418(0.5660) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 10m 48s) Loss: 0.5172(0.5172) Grad: 158434.5781  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.6282(0.5399) Grad: 194219.9688  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 14s (remain 4m 3s) Loss: 0.5311(0.5400) Grad: 156626.8906  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 51s (remain 3m 26s) Loss: 0.5950(0.5384) Grad: 128836.3438  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5695(0.5398) Grad: 168351.9062  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4507(0.5385) Grad: 167225.0938  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.6287(0.5381) Grad: 130522.4453  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 19s (remain 0m 57s) Loss: 0.5271(0.5376) Grad: 75191.2812  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 57s (remain 0m 20s) Loss: 0.4

Epoch 2 - avg_train_loss: 0.5368  avg_val_loss: 0.5472  time: 347s
Epoch 2 - Score: 0.8265
Epoch 2 - Save Best Score: 0.8265 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4966(0.5472) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 11m 1s) Loss: 0.5196(0.5196) Grad: 62670.8711  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 41s) Loss: 0.5232(0.5265) Grad: 59762.9766  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 14s (remain 4m 3s) Loss: 0.5957(0.5297) Grad: 83574.5391  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5234(0.5266) Grad: 76798.7031  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5290(0.5263) Grad: 219218.8594  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 6s (remain 2m 11s) Loss: 0.4472(0.5267) Grad: 75331.2891  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 43s (remain 1m 34s) Loss: 0.5008(0.5271) Grad: 76837.3516  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 20s (remain 0m 57s) Loss: 0.5848(0.5264) Grad: 74248.1406  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 57s (remain 0m 20s) Loss: 0.3819(0.5

Epoch 3 - avg_train_loss: 0.5262  avg_val_loss: 0.5476  time: 347s
Epoch 3 - Score: 0.8317
Epoch 3 - Save Best Score: 0.8317 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4850(0.5476) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 11m 0s) Loss: 0.5957(0.5957) Grad: 221782.9844  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.3524(0.5198) Grad: 152754.3125  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.5215(0.5210) Grad: 112179.4219  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5310(0.5207) Grad: 62994.4922  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.6184(0.5201) Grad: 50196.3359  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 5s (remain 2m 10s) Loss: 0.5531(0.5203) Grad: 80784.8125  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.5957(0.5205) Grad: 45348.3438  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.4952(0.5209) Grad: 90324.3828  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.4656(0

Epoch 4 - avg_train_loss: 0.5186  avg_val_loss: 0.5439  time: 345s
Epoch 4 - Score: 0.8394
Epoch 4 - Save Best Score: 0.8394 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4880(0.5439) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 18m 11s) Loss: 0.5314(0.5314) Grad: 469068.3750  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.4314(0.5132) Grad: 42065.6016  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 14s (remain 4m 3s) Loss: 0.5329(0.5138) Grad: 88884.5938  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.4899(0.5118) Grad: 132327.2656  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5381(0.5140) Grad: 74150.9844  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4707(0.5122) Grad: 128140.6328  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.5530(0.5124) Grad: 46896.4922  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 19s (remain 0m 57s) Loss: 0.5627(0.5126) Grad: 57887.4727  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 56s (remain 0m 20s) Loss: 0.5758(

Epoch 5 - avg_train_loss: 0.5143  avg_val_loss: 0.5426  time: 346s
Epoch 5 - Score: 0.8409
Epoch 5 - Save Best Score: 0.8409 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4843(0.5426) 


Score: 0.8409


Epoch: [1][0/1709] Elapsed 0m 0s (remain 11m 12s) Loss: 0.7117(0.7117) Grad: 129001.0391  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.6349(0.6391) Grad: 98042.7500  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.5773(0.6145) Grad: 49581.6016  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 51s (remain 3m 24s) Loss: 0.5609(0.5998) Grad: 44315.4922  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 28s (remain 2m 47s) Loss: 0.5708(0.5934) Grad: 77761.5625  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 5s (remain 2m 10s) Loss: 0.5670(0.5852) Grad: 55597.9180  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.5795(0.5816) Grad: 75790.8516  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.6357(0.5784) Grad: 223707.5156  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.6099(0.5759) Grad: 93591.5000  LR: 0.00001832  
Epoch: [1][1708/1709] Elap

Epoch 1 - avg_train_loss: 0.5747  avg_val_loss: 0.5446  time: 346s
Epoch 1 - Score: 0.8181
Epoch 1 - Save Best Score: 0.8181 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4890(0.5446) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 10m 42s) Loss: 0.5925(0.5925) Grad: 60855.3594  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.6056(0.5388) Grad: 300052.0938  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.4717(0.5364) Grad: 95758.4453  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.4507(0.5332) Grad: 88648.0391  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.4984(0.5342) Grad: 454053.1875  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.5271(0.5339) Grad: 63945.5391  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.4027(0.5358) Grad: 63243.8672  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 19s (remain 0m 57s) Loss: 0.5263(0.5368) Grad: 269802.0938  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.4191(

Epoch 2 - avg_train_loss: 0.5371  avg_val_loss: 0.5414  time: 346s
Epoch 2 - Score: 0.8301
Epoch 2 - Save Best Score: 0.8301 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4259(0.5414) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 11m 4s) Loss: 0.4460(0.4460) Grad: 103212.7188  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.5925(0.5308) Grad: 126616.2812  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.4927(0.5217) Grad: 62245.1133  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5466(0.5236) Grad: 84215.0625  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5130(0.5253) Grad: 55198.5820  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.6300(0.5254) Grad: 196060.5156  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.4970(0.5257) Grad: 539934.0000  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 19s (remain 0m 57s) Loss: 0.5844(0.5262) Grad: 80450.8672  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.4899(

Epoch 3 - avg_train_loss: 0.5259  avg_val_loss: 0.5391  time: 346s
Epoch 3 - Score: 0.8369
Epoch 3 - Save Best Score: 0.8369 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4222(0.5391) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 10m 46s) Loss: 0.5711(0.5711) Grad: 76580.1328  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.6093(0.5238) Grad: 177608.9219  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.5381(0.5211) Grad: 312752.1250  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5345(0.5193) Grad: 50275.1406  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5191(0.5182) Grad: 134924.6875  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 5s (remain 2m 10s) Loss: 0.5285(0.5174) Grad: 88904.5000  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.5241(0.5164) Grad: 61988.1250  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.5273(0.5179) Grad: 51510.5625  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.5806(

Epoch 4 - avg_train_loss: 0.5177  avg_val_loss: 0.5441  time: 345s
Epoch 4 - Score: 0.8401
Epoch 4 - Save Best Score: 0.8401 Model


EVAL: [569/570] Elapsed 0m 28s (remain 0m 0s) Loss: 0.4065(0.5441) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 19m 41s) Loss: 0.5525(0.5525) Grad: 30430.2637  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 44s) Loss: 0.6016(0.5077) Grad: 123119.4922  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 14s (remain 4m 4s) Loss: 0.3890(0.5077) Grad: 143839.1719  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 51s (remain 3m 26s) Loss: 0.5652(0.5106) Grad: 35292.5625  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5331(0.5118) Grad: 119785.0938  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4749(0.5124) Grad: 84330.2266  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.4596(0.5120) Grad: 84648.1875  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 19s (remain 0m 57s) Loss: 0.3808(0.5120) Grad: 66105.4297  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 56s (remain 0m 20s) Loss: 0.5605(

Epoch 5 - avg_train_loss: 0.5133  avg_val_loss: 0.5457  time: 346s
Epoch 5 - Score: 0.8389


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4115(0.5457) 


Score: 0.8401


Epoch: [1][0/1709] Elapsed 0m 0s (remain 11m 9s) Loss: 0.6957(0.6957) Grad: inf  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.6314(0.6339) Grad: 50085.0742  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.5806(0.6077) Grad: 50686.9766  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5352(0.5947) Grad: 61452.3594  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5217(0.5892) Grad: 54277.6289  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 5s (remain 2m 10s) Loss: 0.5725(0.5817) Grad: 60214.6562  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.5457(0.5768) Grad: 49101.2266  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.4743(0.5744) Grad: 30280.4180  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.4967(0.5722) Grad: 56240.9609  LR: 0.00001832  
Epoch: [1][1708/1709] Elapsed 5m 16s

Epoch 1 - avg_train_loss: 0.5715  avg_val_loss: 0.5517  time: 346s
Epoch 1 - Score: 0.7978
Epoch 1 - Save Best Score: 0.7978 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4964(0.5517) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 10m 37s) Loss: 0.5486(0.5486) Grad: 120707.6641  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 41s) Loss: 0.5115(0.5342) Grad: 70278.3594  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.4628(0.5329) Grad: 93928.4531  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.3642(0.5358) Grad: 107952.6406  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5385(0.5353) Grad: 213039.4688  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4942(0.5361) Grad: 59890.8633  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.5965(0.5366) Grad: 472491.4688  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.4775(0.5367) Grad: 142711.3281  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.481

Epoch 2 - avg_train_loss: 0.5363  avg_val_loss: 0.5436  time: 346s
Epoch 2 - Score: 0.8254
Epoch 2 - Save Best Score: 0.8254 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4460(0.5436) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 10m 58s) Loss: 0.5205(0.5205) Grad: 123394.9688  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.6057(0.5300) Grad: 175148.5781  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.6075(0.5260) Grad: 142455.6406  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5782(0.5301) Grad: 70167.9531  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5649(0.5273) Grad: 94056.6953  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4464(0.5255) Grad: 402692.0938  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.4571(0.5264) Grad: 35738.0469  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.5511(0.5252) Grad: 387133.7812  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.614

Epoch 3 - avg_train_loss: 0.5248  avg_val_loss: 0.5443  time: 345s
Epoch 3 - Score: 0.8318
Epoch 3 - Save Best Score: 0.8318 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4456(0.5443) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 10m 50s) Loss: 0.5694(0.5694) Grad: 94055.4844  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.5036(0.5225) Grad: 109344.5703  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.5503(0.5183) Grad: 169945.7656  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5229(0.5180) Grad: 72632.2344  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5330(0.5180) Grad: 115218.2266  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.3966(0.5158) Grad: 64504.1602  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.5150(0.5167) Grad: 50601.5625  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.4846(0.5176) Grad: 87642.6953  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.5175(

Epoch 4 - avg_train_loss: 0.5172  avg_val_loss: 0.5489  time: 346s
Epoch 4 - Score: 0.8341
Epoch 4 - Save Best Score: 0.8341 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4433(0.5489) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 11m 4s) Loss: 0.4837(0.4837) Grad: 48846.4453  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.6017(0.5140) Grad: 132882.3906  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.4749(0.5176) Grad: 48003.7500  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.4553(0.5143) Grad: 185814.3281  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5698(0.5139) Grad: 46776.2188  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4859(0.5126) Grad: 130829.0312  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.6561(0.5131) Grad: 235430.8750  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.4683(0.5125) Grad: 68381.3125  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.5080(

Epoch 5 - avg_train_loss: 0.5136  avg_val_loss: 0.5494  time: 345s
Epoch 5 - Score: 0.8337


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4446(0.5494) 


Score: 0.8341


Epoch: [1][0/1709] Elapsed 0m 0s (remain 11m 37s) Loss: 0.6941(0.6941) Grad: 70077.0938  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 37s (remain 4m 39s) Loss: 0.6542(0.6319) Grad: 62846.9844  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.6251(0.6137) Grad: 60752.0664  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 51s (remain 3m 24s) Loss: 0.6950(0.6020) Grad: 119778.7969  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 28s (remain 2m 47s) Loss: 0.5354(0.5936) Grad: 40954.8164  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 5s (remain 2m 10s) Loss: 0.6193(0.5885) Grad: 40608.2539  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.5668(0.5826) Grad: 85989.5703  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.5155(0.5788) Grad: 96245.6094  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.5211(0.5759) Grad: 51536.0742  LR: 0.00001832  
Epoch: [1][1708/1709] Elaps

Epoch 1 - avg_train_loss: 0.5743  avg_val_loss: 0.5483  time: 345s
Epoch 1 - Score: 0.8065
Epoch 1 - Save Best Score: 0.8065 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4609(0.5483) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 10m 57s) Loss: 0.5831(0.5831) Grad: 61326.2578  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.5363(0.5426) Grad: 90311.5000  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.6631(0.5423) Grad: 152255.9844  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5186(0.5395) Grad: 90858.6719  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5057(0.5396) Grad: 110358.7578  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.5976(0.5389) Grad: 203966.5781  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.4840(0.5371) Grad: 51112.3359  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.5546(0.5369) Grad: 99511.0234  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.4827(

Epoch 2 - avg_train_loss: 0.5369  avg_val_loss: 0.5469  time: 346s
Epoch 2 - Score: 0.8216
Epoch 2 - Save Best Score: 0.8216 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4587(0.5469) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 11m 20s) Loss: 0.4807(0.4807) Grad: 138922.7969  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.5306(0.5252) Grad: 96016.1797  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.5454(0.5246) Grad: 67460.0625  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.6288(0.5247) Grad: 263974.8438  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.4459(0.5262) Grad: 85620.6094  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.5424(0.5253) Grad: 82228.0000  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 42s (remain 1m 34s) Loss: 0.4571(0.5254) Grad: 62072.2695  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 19s (remain 0m 57s) Loss: 0.5383(0.5264) Grad: 23719.7793  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.4391(0

Epoch 3 - avg_train_loss: 0.5267  avg_val_loss: 0.5530  time: 346s
Epoch 3 - Score: 0.8239
Epoch 3 - Save Best Score: 0.8239 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4620(0.5530) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 12m 40s) Loss: 0.5603(0.5603) Grad: 69074.3438  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.4737(0.5200) Grad: 78662.2656  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.6249(0.5185) Grad: 393065.8125  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.6630(0.5167) Grad: 60226.0547  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 28s (remain 2m 47s) Loss: 0.6183(0.5163) Grad: 80190.7891  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 5s (remain 2m 10s) Loss: 0.4609(0.5154) Grad: 88312.0234  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.3506(0.5148) Grad: 106302.0703  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.5762(0.5164) Grad: 290477.5625  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.3960(

Epoch 4 - avg_train_loss: 0.5168  avg_val_loss: 0.5479  time: 346s
Epoch 4 - Score: 0.8305
Epoch 4 - Save Best Score: 0.8305 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4418(0.5479) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 11m 20s) Loss: 0.5254(0.5254) Grad: 72192.1016  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 40s) Loss: 0.5535(0.5138) Grad: 137403.0000  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 14s (remain 4m 2s) Loss: 0.6031(0.5106) Grad: 14890.8340  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 51s (remain 3m 25s) Loss: 0.5215(0.5099) Grad: 33048.7578  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 28s (remain 2m 48s) Loss: 0.5598(0.5123) Grad: 34295.5469  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 5s (remain 2m 11s) Loss: 0.4256(0.5130) Grad: 43358.9102  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 42s (remain 1m 33s) Loss: 0.5340(0.5136) Grad: 49083.4023  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 19s (remain 0m 56s) Loss: 0.6459(0.5139) Grad: 46656.3828  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 56s (remain 0m 19s) Loss: 0.5191(0.

Epoch 5 - avg_train_loss: 0.5135  avg_val_loss: 0.5498  time: 345s
Epoch 5 - Score: 0.8306
Epoch 5 - Save Best Score: 0.8306 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4415(0.5498) 


Score: 0.8306
Score: 0.8362
