# deberta-v3-small Experiments

We run fast experiments on 1/5 of the data for 3 epochs to quickly determine which adaptations help, and report the CV score.  
For adaptations we want to explore further, we retrain on the whole dataset and evaluate on the leaderboard.  

</br> </br>  


## Fast Experiments 
*1/5 of original training data, 3 epochs, 4 fold cv*

---

### Prompt Engineering
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline           | `fast_baseline_cfg`   | 0.7430   | - | |
|  CPC Context Text   | `fast_context_cfg`    | 0.7523    | - | |  
|  Custom Tokens      | `fast_customtok_cfg`  | 0.7464  | - | |  



**Conclusion**: Adding Context Text works best here (0.7523 vs. 0.7430 for the baseline), so we continue with it.
</br> </br>  

### Model Type
    
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Regression       | `fast_reg_cfg` |  *see CPC Context Text*  | - | |
|  Classification   | `fast_class_cfg` | 0.7378 | - | |  
|  Ordinal          | `fast_ord_cfg` | 0.7262 | - | |  


**Conclusion**: Regular regression works best (0.7523 vs. 0.7378 for classification and 0.7262 for ordinal), so we continue with it.
</br> </br>  

### Classical NLP Preprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Stemming              |    |   | - | |
| Lemmatizing           |    |   | - | |
| Special Characters    |    |   | - | Removing special characters from the phrases |


### Postprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Clipping              | `fast_post_clip_cfg`     |  0.7423 | - | Range `[0,1]`|
| MinMax                | `fast_post_minmax_cfg`   |  0.7523 | - | Range `[0,1]`|
| Chemical Lookup       | `fast_post_chem_cfg`     |  0.7541 | - |              |

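**Conclusion**: `Clipping` performs slightly worse (0.7423 vs. 0.7523); clipping is a non-linear transformation, so it can change the Pearson correlation used as the score.
`MinMax` scores exactly the same as the unprocessed predictions (0.7523), which is expected because the Pearson correlation is invariant under positive linear rescaling.
`Chemical Lookup` performs slightly better (0.7541), so we continue with it.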
</br> </br>  
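
The behaviour of the two rescaling steps can be checked on toy numbers. The sketch below is illustrative only: the values are made up, and `pearson` is a small hand-written helper rather than the notebook's `get_score`. Min-max scaling is a positive linear map and leaves the Pearson correlation unchanged, whereas clipping to `[0, 1]` is non-linear and can move it in either direction.

```python
import math


def pearson(x, y):
    """Plain-Python Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Made-up targets and raw predictions; two predictions fall outside [0, 1].
y_true = [0.0, 0.25, 0.5, 0.75, 1.0, 0.5]
y_pred = [-0.2, 0.3, 0.4, 0.8, 1.3, 0.6]

lo, hi = min(y_pred), max(y_pred)
y_minmax = [(p - lo) / (hi - lo) for p in y_pred]   # linear rescaling to [0, 1]
y_clip = [min(max(p, 0.0), 1.0) for p in y_pred]    # non-linear clipping to [0, 1]

r_raw = pearson(y_true, y_pred)
r_minmax = pearson(y_true, y_minmax)
r_clip = pearson(y_true, y_clip)
print(f"raw={r_raw:.4f} minmax={r_minmax:.4f} clip={r_clip:.4f}")
```
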

### Data Augmentation
To acquire a fair estimate, we validate only on non-augmented data and keep the training set the same size as in the previous fast experiments.  


|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Identities              |   `fast_aug_ident_cfg` |  0.7456 | - | Adds the reverse mapping for Anchor-Target pairs with score 1|
| All Mirrored            |   `fast_aug_mirr_cfg` |  0.7382 | - | Adds the reverse mapping for all Anchor-Target pairs |
| Identity Paths          |  `fast_aug_identpaths_cfg` | 0.7597  | - | Adds all pairs in a path between phrases connected by a score of 1|
| All Mirrored + Identity Paths |  `fast_aug_mirridentpaths_cfg` | 0.7449 | - | Adds identity paths and additionally mirrors all Anchor-Target pairs|
| Neighbors               |  `fast_aug_neighbors_cfg` | 0.7518 | - | In addition to *All Mirrored + Identity Paths*, considers phrases adjacent to identity paths.|
| Chemical Compounds      |   `fast_aug_chem_cfg`  | 0.7576 | - | Finds synonyms for formulae of common chemical compounds in the dataset and creates new phrases from it.| 

**Conclusion**: Augmenting does not seem to hurt performance significantly, and it is likely that the additional data will increase performance.  
To give a better comparison, we will compare full models based on the leaderboard score.
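
The augmented datasets are loaded pre-computed, so their construction is not shown in this notebook. As a rough sketch of what the table descriptions suggest, mirroring and identity paths could look like the code below. It rests on two assumptions: that the context can be ignored, and that an "identity path" is the connected component formed by score-1 pairs. The helper names `mirror` and `identity_paths` are hypothetical as well.

```python
from itertools import combinations


def mirror(pairs):
    """Add the reversed (target, anchor, score) triple for every pair."""
    return pairs + [(t, a, s) for a, t, s in pairs]


def identity_paths(pairs):
    """Link all phrases that are connected through score-1.0 pairs.

    Phrases joined by a chain of score-1.0 pairs are treated as one group
    (union-find), and every not-yet-seen pair inside a group gets score 1.0.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, t, s in pairs:
        if s == 1.0:
            parent[find(a)] = find(t)

    groups = {}
    for phrase in list(parent):
        groups.setdefault(find(phrase), []).append(phrase)

    seen = {frozenset((a, t)) for a, t, _ in pairs}
    new_pairs = []
    for members in groups.values():
        for a, t in combinations(sorted(members), 2):
            if frozenset((a, t)) not in seen:
                new_pairs.append((a, t, 1.0))
    return pairs + new_pairs


# Made-up toy data: a-b and b-c are identical, so a-c is implied.
toy = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "d", 0.25)]
augmented = identity_paths(toy)
mirrored = mirror(toy)
print(augmented)
print(mirrored)
```

Whether the real CSV files respect the CPC context when linking phrases cannot be told from this notebook, so the sketch should be read as one possible interpretation.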

</br>  </br>  </br>  

## Full Experiments 
*full training data, 5 epochs, 4 fold cv*

---

    
|     Model              |  CFG | cv score | lb (pb) score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline                        | `full_baseline_cfg`      |   |  | |
|  Context Text                    | `full_ctxt_cfg`          | 0.8337  | 0.8293 (0.8171) | |  
|  Chemical Lookup                 | `full_post_chem_cfg`     | 0.8362  | 0.8293 (0.8171) | |
|  Neighbors Augmentation          | `full_aug_neighbors_cfg` |   |  | |     
|  Chemical Compounds Augmentation | `full_aug_chem_cfg`      |   |  | | 



# Directory settings

In [1]:
# ====================================================
# Directory settings
# ====================================================
import os

INPUT_DIR = '../input/us-patent-phrase-to-phrase-matching/'
IDENTITY_MAPPINGS_DIR = '../input/uspppm-identity-mappings/'
OUTPUT_DIR = './'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# Model Configuration

In [2]:
from dataclasses import dataclass, field
from typing import Set, Optional, Tuple, Dict


@dataclass
class ModelConfig:
    ############################################
    # Prompt engineering
    ############################################
    # - 'ctx_txt' (default): add the context text at the end of the prompt
    # - 'custom_tok': add custom tokens for contexts and a custom separator token
    # - None: append the context abbreviation to the end of the prompt
    prompt_engineering: Optional[str] = 'ctx_txt' # ['ctx_txt', 'custom_tok', None]
    # separator token used when prompt_engineering == 'custom_tok'
    custom_sep_token: str = '[S]'
        
    ############################################
    # Model Type
    ############################################
    # when true uses classification else a regression model
    classification: bool = False
    # when true uses ordinal regression (only active when `classification = False`)
    ordinal: bool = False
    
    ############################################
    # Traditional NLP preprocessing
    ############################################
    stemming: bool = False
    lemmatizing: bool = False
    special_chr_rem: bool = False
        
    ############################################
    # Post Processing
    ############################################
    clipping: bool = False
    minmax: bool = False
    # Average the predictions over the chemical compound synonyms of a phrase
    chem_comp_pred_avg: bool  = False
        
    ############################################
    # Data Augmentation
    ############################################
    # Use chemical component synonyms to create samples for augmenting the training set
    chem_comp_train_aug: bool = False
        
    # How to augment the data for training using the graph-based identity mappings
    # available options: ['neighbors', 'paths_mirrored', 'paths', 'mirrored', 'identities', None]
    # None indicates that no augmentation should take place
    augment_identity_graph_data: Optional[str] = None 
    validate_on_original: bool = True
        
    
    ############################################
    # General Model Config and Hyperparams
    ############################################
    debug: bool = False
    apex: bool = True
    print_freq: int= 200
    num_workers: int = 4
    model: str = "microsoft/deberta-v3-small"
    scheduler: str = 'cosine' # ['linear', 'cosine']
    batch_scheduler: bool = True
    num_cycles: float = 0.5
    num_warmup_steps: int = 0
    encoder_lr: float = 2e-5
    decoder_lr: float = 2e-5
    min_lr: float = 1e-6
    eps: float = 1e-6
    betas: Tuple[float, float] = (0.9, 0.999)
    batch_size: int = 16
    fc_dropout: float = 0.2
    max_len: int = 512
    weight_decay: float = 0.01
    gradient_accumulation_steps: int = 1
    max_grad_norm: int = 1000
    seed: int = 42
    
    epochs: int = 5
    train_frac: Optional[float] = None
    n_fold: int = 4
    trn_fold: Tuple[int, ...] = (0, 1, 2, 3)
    map_score: Dict[float, int] = field(default_factory = lambda: ({0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}))
    map_labels: Dict[int, float] = field(default_factory = lambda: ({0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}))
        
    target_size: int=1
    def __post_init__(self):
        if self.classification or self.ordinal:
            self.target_size = 5
        else:
            self.target_size = 1  


############################################
# Experiment Configurations
############################################

# Fast Experiments
FAST_BASE_TRAIN_FRAC = 1/5
FAST_BASE_TRAIN_EPOCHS = 3

def fastModelConfg(**kwargs):
    return ModelConfig(**kwargs, train_frac=FAST_BASE_TRAIN_FRAC, epochs=FAST_BASE_TRAIN_EPOCHS)


fast_baseline_cfg = fastModelConfg(prompt_engineering=None)
fast_context_cfg = fastModelConfg()
fast_context_ctx = fast_context_cfg  # alias kept for the name used in the selection cell
fast_customtok_cfg = fastModelConfg(prompt_engineering='custom_tok')

fast_reg_cfg = fastModelConfg()
fast_class_cfg = fastModelConfg(classification=True)
fast_ord_cfg = fastModelConfg(ordinal=True)

fast_aug_ident_cfg = fastModelConfg(augment_identity_graph_data='identities')
fast_aug_mirr_cfg = fastModelConfg(augment_identity_graph_data='mirrored')
fast_aug_identpaths_cfg = fastModelConfg(augment_identity_graph_data='paths')
fast_aug_mirridentpaths_cfg = fastModelConfg(augment_identity_graph_data='paths_mirrored')
fast_aug_neighbors_cfg = fastModelConfg(augment_identity_graph_data='neighbors')
fast_aug_chem_cfg = fastModelConfg(chem_comp_train_aug=True)

fast_post_clip_cfg = fastModelConfg(clipping=True)
fast_post_minmax_cfg = fastModelConfg(minmax=True)
fast_post_chem_cfg = fastModelConfg(chem_comp_pred_avg=True)

# Full Experiments
full_baseline_cfg = ModelConfig(prompt_engineering=None)
full_ctxt_cfg = ModelConfig()
full_post_chem_cfg = ModelConfig(chem_comp_pred_avg=True)
full_aug_neighbors_cfg = ModelConfig(augment_identity_graph_data='neighbors')
full_aug_chem_cfg = ModelConfig(chem_comp_train_aug=True, augment_identity_graph_data='neighbors')
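
For readers unfamiliar with the pattern used above, here is a stripped-down, hypothetical stand-in (`MiniConfig` and `fast_config` are illustration names, not part of the notebook). It shows how `__post_init__` derives `target_size` from the model type and how a small factory pins the fast-experiment settings.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniConfig:
    classification: bool = False
    ordinal: bool = False
    epochs: int = 5
    train_frac: Optional[float] = None
    target_size: int = 1

    def __post_init__(self):
        # Five outputs for the five score levels, a single output for regression.
        self.target_size = 5 if (self.classification or self.ordinal) else 1


def fast_config(**kwargs):
    # Passing train_frac or epochs in kwargs would raise a TypeError (duplicate keyword).
    return MiniConfig(**kwargs, train_frac=1 / 5, epochs=3)


reg = fast_config()
cls = fast_config(classification=True)
print(reg.target_size, cls.target_size, cls.epochs)
```
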


### Config used throughout this notebook

In [3]:
############################################
# Fast Experiments
############################################

# CFG = fast_baseline_cfg
# CFG = fast_context_ctx
# CFG = fast_customtok_cfg

# CFG = fast_reg_cfg
# CFG = fast_class_cfg
# CFG = fast_ord_cfg

# CFG = fast_aug_ident_cfg
# CFG = fast_aug_mirr_cfg
# CFG = fast_aug_identpaths_cfg
# CFG = fast_aug_mirridentpaths_cfg
# CFG = fast_aug_neighbors_cfg
# CFG = fast_aug_chem_cfg

# CFG = fast_post_clip_cfg
# CFG = fast_post_minmax_cfg
# CFG = fast_post_chem_cfg



############################################
# Full Experiments
############################################
CFG = full_baseline_cfg
# CFG = full_ctxt_cfg
# CFG = full_post_chem_cfg
# CFG = full_aug_neighbors_cfg
# CFG = full_aug_chem_cfg

# Library

In [4]:
# ====================================================
# Library
# ====================================================
import os
import gc
import re
import ast
import sys
import copy
import json
import time
import math
import shutil
import string
import pickle
import random
import joblib
import itertools
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
from sklearn.preprocessing import minmax_scale
    
import torch
print(f"torch.__version__: {torch.__version__}")
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset

os.system('pip uninstall -y transformers')
os.system('pip uninstall -y tokenizers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels transformers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels tokenizers')
import tokenizers
import transformers
print(f"tokenizers.__version__: {tokenizers.__version__}")
print(f"transformers.__version__: {transformers.__version__}")
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
%env TOKENIZERS_PARALLELISM=true

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import nltk
nltk.data.path.append('../input/wordnet')
from nltk.stem import WordNetLemmatizer


# common chemical formulae lookup utility script
from uspppm_common_chemical_compound_lookup import USPPPMChemCompLookup

torch.__version__: 1.9.1
Found existing installation: transformers 4.16.2
Uninstalling transformers-4.16.2:
  Successfully uninstalled transformers-4.16.2




Found existing installation: tokenizers 0.11.6
Uninstalling tokenizers-0.11.6:
  Successfully uninstalled tokenizers-0.11.6




Looking in links: ../input/pppm-pip-wheels
Processing /kaggle/input/pppm-pip-wheels/transformers-4.18.0-py3-none-any.whl
Processing /kaggle/input/pppm-pip-wheels/tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Installing collected packages: tokenizers, transformers


ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.9.1 requires transformers<4.17,>=4.1, but you have transformers 4.18.0 which is incompatible.


Successfully installed tokenizers-0.12.1 transformers-4.18.0
Looking in links: ../input/pppm-pip-wheels




tokenizers.__version__: 0.12.1
transformers.__version__: 4.18.0
env: TOKENIZERS_PARALLELISM=true


# Utils

### Chemical Compound Lookup

In [5]:
chem_lookup = USPPPMChemCompLookup(chem_comp_path='../input/chemical-compounds-lookup/compounds.csv')

**Chemical Compound Lookup Tests**

In [6]:
print('Testing basic formula lookup...')
print(chem_lookup.lookup_df)
print(chem_lookup.phrase_chem_formula_synonym('agbr test'))
print(chem_lookup.phrase_chem_formula_synonym('agbr dna test agonc ag2cl2'))
print(chem_lookup.phrase_chem_formula_synonym('dna test d2o'))
print(chem_lookup.chem_formula_synonyms('c3h6'))
print('Done!')
print()
print('Testing train dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),
                          'score': pd.Series([3, 2, 1, 0]),
                          'anchor': pd.Series(['agbr dna test agonc ag2cl2', 'agbr test', 'agbr', 'last']),
                          'target': pd.Series(['agonc ag2cl2', 'test thingy', 'c4h7no4', 'last']),
                          'context': pd.Series(['G02', 'G02', 'C12', 'C12'])})
print("Before")
print(chem_test)
print("After")
print(chem_lookup.pre_augment_chem_formulae(chem_test, True))
print('Done!')
print()
print('Testing test dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),
                          'text': pd.Series([
                              'agbr dna test agonc ag2cl2 [SEP] agonc ag2cl2 [SEP] G02',
                              'agbr test [SEP] test thingy [SEP] G02',
                              'agbr [SEP] c4h7no4 [SEP] C12',
                              'last [SEP] last [SEP] C12'
                          ])})
print("Before")
print(chem_test)
print("After")
print(chem_lookup.post_augment_chem_formulae(chem_test, True))
print('Done!')

Testing basic formula lookup...
                            Name  Formula
0               actiniumiiioxide    ac2o3
1     silvertetrachloroaluminate  agalcl4
2                  silverbromide     agbr
3                  silverbromate   agbro3
4                  silvercyanide     agcn
...                          ...      ...
4067                zirconateion    zro32
4068          zirconiumphosphide     zrp2
4069            zirconiumsulfide     zrs2
4070           zirconiumsilicide    zrsi2
4071          zirconiumphosphate  zr3po44

[4072 rows x 2 columns]
[('silverbromide test', True)]
[('silverbromide dna test silverfulminate disilverdichloride', True), ('silverbromide dna test silvercyanate disilverdichloride', True), ('silverbromide dna test silverfulminate silveriidichloride', True), ('silverbromide dna test silvercyanate silveriidichloride', True)]
[('dna test deuteriumoxide', True), ('dna test heavywater', True)]
['cyclopropane', 'propylene']
Done!

Testing train dataset augmentat

In [7]:
# ====================================================
# Utils
# ====================================================
def get_score(y_true, y_pred):
    score = sp.stats.pearsonr(y_true, y_pred)[0]
    return score


def get_logger(filename=OUTPUT_DIR+'train'):
    from logging import getLogger, INFO, StreamHandler, FileHandler, Formatter
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=f"{filename}.log")
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    logger.propagate = False
    return logger

LOGGER = get_logger()

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

# Data Loading

In [8]:
# ====================================================
# Data Loading
# ====================================================
orig_data = pd.read_csv(INPUT_DIR+'train.csv')

if CFG.augment_identity_graph_data == 'neighbors':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_path_neighbors.csv')
elif CFG.augment_identity_graph_data == 'paths_mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_paths.csv')
elif CFG.augment_identity_graph_data == 'paths':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_paths_mirrored.csv')
elif CFG.augment_identity_graph_data == 'mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored.csv')
elif CFG.augment_identity_graph_data == 'identities':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_mirrored.csv')
elif CFG.augment_identity_graph_data is None:
    train = orig_data
else:
    raise ValueError('CFG.augment_identity_graph_data = {} not recognized!'.format(CFG.augment_identity_graph_data))

if CFG.chem_comp_train_aug:
    train = chem_lookup.pre_augment_chem_formulae(train)
    
display(train)
    
if CFG.train_frac:
    # to get a fair estimate we always use a fraction of the original data
    n = int(CFG.train_frac * len(orig_data))
    train = train.sample(n=n, replace=False, ignore_index=True)
    
test = pd.read_csv(INPUT_DIR+'test.csv')
submission = pd.read_csv(INPUT_DIR+'sample_submission.csv')
print(f"train.shape: {train.shape}")
print(f"test.shape: {test.shape}")
print(f"submission.shape: {submission.shape}")
# display(train.head())
# display(test.head())
# display(submission.head())

|       | id | anchor | target | context | score |
|------:|:---|:-------|:-------|:--------|------:|
| 0 | 37d61fd2272659b1 | abatement | abatement of pollution | A47 | 0.50 |
| 1 | 7b9652b17b68b7a4 | abatement | act of abating | A47 | 0.75 |
| 2 | 36d72442aefd8232 | abatement | active catalyst | A47 | 0.25 |
| 3 | 5296b0c19e1ce60e | abatement | eliminating process | A47 | 0.50 |
| 4 | 54c1e3b9184cb5b6 | abatement | forest region | A47 | 0.00 |
| ... | ... | ... | ... | ... | ... |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article | B44 | 1.00 |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box | B44 | 0.50 |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle | B44 | 0.50 |
| 36471 | 756ec035e694722b | wood article | wooden material | B44 | 0.75 |


train.shape: (36473, 5)
test.shape: (36, 4)
submission.shape: (36, 2)


# Pre-processing

In [9]:
# Add augmented indicator
# I'm sorry for this dirty hack
if 'augmented' not in train.columns:
    train['augmented'] = train['id'].str.contains('_')
else:
    train['augmented'] = train['id'].str.contains('_') | train['augmented']
    
print(sum(train['augmented']), 'augmented samples')

0 augmented samples


In [10]:
# ====================================================
# CPC Data
# ====================================================
def get_cpc_texts():
    # Collect all context codes (e.g. 'A47') from the CPC scheme file names
    contexts = []
    pattern = r'[A-Z]\d+'
    for file_name in os.listdir('../input/cpc-data/CPCSchemeXML202105'):
        result = re.findall(pattern, file_name)
        if result:
            contexts.append(result)
    contexts = sorted(set(sum(contexts, [])))
    results = {}
    for cpc in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
        with open(f'../input/cpc-data/CPCTitleList202202/cpc-section-{cpc}_20220201.txt') as f:
            s = f.read()
        # Capture only the title after the two tabs. str.lstrip(pattern) treats
        # `pattern` as a set of characters and can also strip leading title letters.
        cpc_result = re.findall(f'{cpc}\t\t(.+)', s)[0]
        for context in [c for c in contexts if c[0] == cpc]:
            result = re.findall(f'{context}\t\t(.+)', s)
            # Section title followed by the title of the specific context
            results[context] = cpc_result + ". " + result[0]
    return results


cpc_texts = get_cpc_texts()
torch.save(cpc_texts, OUTPUT_DIR+"cpc_texts.pth")
train['context_text'] = train['context'].map(cpc_texts)
test['context_text'] = test['context'].map(cpc_texts)
# display(train.head())
# display(test.head())
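
A note on extracting the title from a `CODE<tab><tab>TITLE` line: `str.lstrip` removes any leading characters contained in its argument rather than a literal prefix, so a regex capture group is the safer way to take the title. A minimal sketch on a sample line (the line is illustrative and not read from the CPC files):

```python
import re

line = "C\t\tCHEMISTRY; METALLURGY"  # illustrative sample line
prefix_pattern = "C\t\t.+"

# lstrip treats its argument as the character set {C, tab, ., +},
# so the leading "C" of the title is stripped as well.
stripped = line.lstrip(prefix_pattern)

# A capture group returns exactly the text after the two tabs.
captured = re.findall(r"C\t\t(.+)", line)[0]

print(repr(stripped))
print(repr(captured))
```
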

# CV split

In [11]:
# ====================================================
# CV split
# ====================================================
train['score_map'] = train['score'].map({0.00: 0, 0.25: 1, 0.50: 2, 0.75: 3, 1.00: 4})
Fold = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
for n, (train_index, val_index) in enumerate(Fold.split(train, train['score_map'])):
    train.loc[val_index, 'fold'] = int(n)
train['fold'] = train['fold'].astype(int)
# display(train.groupby('fold').size())

In [12]:
def get_sec_toks(df):
    return '[' + df['context'].str[0] + ']'

if CFG.prompt_engineering == 'custom_tok':
    train['text'] = get_sec_toks(train) + train['context_text'] + '[SEP]'+ train['anchor'] + CFG.custom_sep_token + train['target']
    test['text'] = get_sec_toks(test) + test['context_text'] + '[SEP]' + test['anchor'] + CFG.custom_sep_token + test['target']
elif CFG.prompt_engineering == 'ctx_txt':
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]'  + train['context_text']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]'  + test['context_text']
elif CFG.prompt_engineering is None:
    train['text'] = train['anchor'] + '[SEP]' + train['target'] + '[SEP]'  + train['context']
    test['text'] = test['anchor'] + '[SEP]' + test['target'] + '[SEP]'  + test['context']
else:
    raise ValueError('CFG.prompt_engineering = {} not recognized!'.format(CFG.prompt_engineering))
    

print(train['text'][0])
display(train['text'].head())
display(test['text'].head())

abatement[SEP]abatement of pollution[SEP]A47


0    abatement[SEP]abatement of pollution[SEP]A47
1            abatement[SEP]act of abating[SEP]A47
2           abatement[SEP]active catalyst[SEP]A47
3       abatement[SEP]eliminating process[SEP]A47
4             abatement[SEP]forest region[SEP]A47
Name: text, dtype: object

0    opc drum[SEP]inorganic photoconductor drum[SEP...
1        adjust gas flow[SEP]altering gas flow[SEP]F23
2            lower trunnion[SEP]lower locating[SEP]B60
3              cap component[SEP]upper portion[SEP]D06
4    neural stimulation[SEP]artificial neural netwo...
Name: text, dtype: object
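
The three prompt variants can be summarised in a small stand-alone helper. This is a sketch rather than the notebook's code: `build_prompt` is a hypothetical name and the context text is a placeholder, while the string layouts follow the cell above.

```python
def build_prompt(anchor, target, context, context_text,
                 prompt_engineering="ctx_txt", custom_sep_token="[S]"):
    """Assemble the model input string for one anchor/target pair."""
    if prompt_engineering == "custom_tok":
        section_token = "[" + context[0] + "]"  # e.g. "[A]" for context "A47"
        return (section_token + context_text + "[SEP]"
                + anchor + custom_sep_token + target)
    if prompt_engineering == "ctx_txt":
        return anchor + "[SEP]" + target + "[SEP]" + context_text
    if prompt_engineering is None:
        return anchor + "[SEP]" + target + "[SEP]" + context
    raise ValueError(f"prompt_engineering = {prompt_engineering!r} not recognized!")


# The context text below is a placeholder, not the real CPC title.
sample = dict(anchor="abatement", target="abatement of pollution",
              context="A47", context_text="PLACEHOLDER CONTEXT TEXT")
for mode in (None, "ctx_txt", "custom_tok"):
    print(mode, "->", build_prompt(**sample, prompt_engineering=mode))
```
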

# tokenizer

In [13]:
# ====================================================
# tokenizer
# ====================================================
tokenizer = AutoTokenizer.from_pretrained(CFG.model)


# add special tokens for sections
cpc_sections = [
    'A', # Human Necessities
    'B', # Operations and Transport
    'C', # Chemistry and Metallurgy
    'D', # Textiles
    'E', # Fixed Constructions
    'F', # Mechanical Engineering
    'G', # Physics
    'H', # Electricity
    'Y' # Emerging Cross-Sectional Technologies
]
if CFG.prompt_engineering == 'custom_tok':
    tokenizer.add_special_tokens({'additional_special_tokens': ['['+  s + ']' for s in cpc_sections]})
    print(tokenizer.all_special_tokens)
    
tokenizer.save_pretrained(OUTPUT_DIR+'tokenizer/')
CFG.tokenizer = tokenizer

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Dataset

In [14]:
# ====================================================
# Define max_len
# ====================================================
lengths_dict = {}

lengths = []
tk0 = tqdm(cpc_texts.values(), total=len(cpc_texts))
for text in tk0:
    length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
    lengths.append(length)
lengths_dict['context_text'] = lengths

for text_col in ['anchor', 'target']:
    lengths = []
    tk0 = tqdm(train[text_col].fillna("").values, total=len(train))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
        lengths.append(length)
    lengths_dict[text_col] = lengths
    
CFG.max_len = max(lengths_dict['anchor']) + max(lengths_dict['target'])\
                + max(lengths_dict['context_text']) + 4 # CLS + SEP + SEP + SEP
LOGGER.info(f"max_len: {CFG.max_len}")

max_len: 133


In [15]:
# ====================================================
# Dataset
# ====================================================
def prepare_input(cfg, text):
    inputs = cfg.tokenizer(text,
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs

def prepare_labels(cfg, label):
    if cfg.classification:
        label_onehot = [0 for _ in range(cfg.target_size)]
        label_onehot[cfg.map_score[label]] = 1 
        return torch.tensor(label_onehot, dtype=torch.float)
    elif cfg.ordinal:
        label_ordinal = [1 if i <= cfg.map_score[label] else 0 for i in range(cfg.target_size)]
        return torch.tensor(label_ordinal, dtype=torch.float)
    else:
        return torch.tensor(label, dtype=torch.float)

class TrainDataset(Dataset):
    def __init__(self, cfg, df, chem_lookup=None):
        self.cfg = cfg
        if cfg.chem_comp_pred_avg and chem_lookup:
            df = chem_lookup.post_augment_chem_formulae(df)
        self.texts = df['text'].values
        self.labels = df['score'].values
        self.ids = df['id'].values

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, self.texts[item])
        label = prepare_labels(self.cfg, self.labels[item])
        return inputs, label, self.ids[item]
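
To make the three target encodings in `prepare_labels` concrete, the following sketch reproduces them with plain Python lists instead of tensors. `encode_label` and its `mode` argument are hypothetical names chosen for this illustration; the mapping mirrors `map_score` from the config.

```python
MAP_SCORE = {0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}


def encode_label(score, mode="regression", target_size=5):
    """Encode a similarity score for the chosen model type."""
    if mode == "regression":
        return float(score)  # the raw score is the target
    idx = MAP_SCORE[score]
    if mode == "classification":
        # one-hot: only the position of the score level is set
        return [1.0 if i == idx else 0.0 for i in range(target_size)]
    if mode == "ordinal":
        # cumulative: all positions up to and including the score level are set
        return [1.0 if i <= idx else 0.0 for i in range(target_size)]
    raise ValueError(f"unknown mode: {mode!r}")


for mode in ("regression", "classification", "ordinal"):
    print(mode, encode_label(0.5, mode))
```
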

# Model

In [16]:
# ====================================================
# Model
# ====================================================
class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, self.cfg.target_size)
        self._init_weights(self.fc)
        self.attention = nn.Sequential(
            nn.Linear(self.config.hidden_size, 512),
            nn.Tanh(),
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )
        self._init_weights(self.attention)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        # feature = torch.mean(last_hidden_states, 1)
        weights = self.attention(last_hidden_states)
        feature = torch.sum(weights * last_hidden_states, dim=1)
        return feature

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        return output
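
The pooling in `feature` can be restated without PyTorch. In the model, the per-token scores come from the `Linear -> Tanh -> Linear` head and the softmax runs over the sequence dimension (`dim=1`). The sketch below takes the scores as given and handles a single sequence; `attention_pool` is a hypothetical helper name. As in the model code above, the pooling head receives no attention mask, so padded positions also get a weight.

```python
import math


def attention_pool(hidden_states, scores):
    """Softmax-weighted sum over the tokens of one sequence.

    hidden_states: list of token vectors (seq_len x hidden_size)
    scores: one unnormalised attention score per token
    """
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    hidden_size = len(hidden_states[0])
    pooled = [sum(w * tok[j] for w, tok in zip(weights, hidden_states))
              for j in range(hidden_size)]
    return pooled, weights


# With equal scores the result reduces to mean pooling.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(hidden, scores=[0.0, 0.0, 0.0])
print(pooled, weights)
```
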

# Helper functions

In [17]:
# ====================================================
# Helper functions
# ====================================================
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))

def ordinal_regression(predictions, targets):
    """Ordinal regression with encoding as in https://arxiv.org/pdf/0704.1028.pdf"""
    return nn.MSELoss(reduction='mean')(predictions, targets)

def average_by_id(df):
    ''' Averages the columns per `id` and returns one row per id, in the original order of first appearance'''
    orig_id_order = df['id']
    unordered_means = df.groupby('id').mean().reset_index()
    return unordered_means.set_index('id').loc[orig_id_order].reset_index().drop_duplicates()


def train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device):
    model.train()
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex)
    losses = AverageMeter()
    start = end = time.time()
    global_step = 0
    for step, (inputs, labels, _) in enumerate(train_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        # AMP is disabled for the ordinal model: autocast raised a half/float dtype error (maybe a bug)
        if CFG.ordinal:
            y_preds = model(inputs)
        else:
            with torch.cuda.amp.autocast(enabled=CFG.apex):
                y_preds = model(inputs)
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels) 
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        scaler.scale(loss).backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm)
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            global_step += 1
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'Grad: {grad_norm:.4f}  '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          grad_norm=grad_norm,
                          lr=scheduler.get_last_lr()[0]))
    return losses.avg


def valid_fn(valid_loader, model, criterion, device):
    losses = AverageMeter()
    model.eval()
    preds = []
    ids = []
    start = end = time.time()
    for step, (inputs, labels, sample_id) in enumerate(valid_loader):
        ids.append(sample_id)
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.no_grad():
            y_preds = model(inputs)
            
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels)
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        if CFG.classification or CFG.ordinal:
            preds.append(y_preds.to('cpu').numpy())
        else:
            preds.append(y_preds.sigmoid().to('cpu').numpy())
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    predictions = np.concatenate(preds)
    ids = np.concatenate(ids)
    # print("Sanity Check:", sum(pd.Series(ids).duplicated()))
    if CFG.classification:
        predictions = np.argmax(predictions, axis=1)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    elif CFG.ordinal:
        predictions = (predictions > 0.5).cumprod(axis=1).sum(axis=1) - 1
        predictions = np.clip(predictions, 0, None)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    else:
        predictions = predictions.reshape(-1)  # flatten (N, 1) -> (N,)
        if CFG.clipping:
            predictions = np.clip(predictions, 0, 1)
        if CFG.minmax:
            predictions = minmax_scale(predictions, feature_range=(0, 1))
            
    if CFG.chem_comp_pred_avg:
        pred_new = average_by_id(pd.DataFrame({'pred': predictions, 'id': ids}))
        predictions = pred_new['pred'].to_numpy()
    return losses.avg, predictions
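
The ordinal branch of `valid_fn` turns each row of per-threshold outputs into a single rank by counting how many leading outputs exceed 0.5. A minimal NumPy sketch of that decoding step follows; the example outputs and the five-level `map_labels` list are illustrative assumptions, not values taken from the notebook's `CFG`.

```python
import numpy as np

# Hypothetical rank -> score mapping; the notebook reads this from CFG.map_labels.
map_labels = [0.0, 0.25, 0.5, 0.75, 1.0]

def decode_ordinal(outputs, threshold=0.5):
    """Count the leading outputs above `threshold` and convert the count to a rank."""
    above = outputs > threshold
    # cumprod zeroes every position after the first output that is not above the threshold
    leading = above.cumprod(axis=1).sum(axis=1)
    # rank = leading - 1; rows with no output above the threshold are clipped to rank 0
    return np.clip(leading - 1, 0, None)

outputs = np.array([
    [0.9, 0.8, 0.7, 0.2, 0.1],  # three leading outputs above 0.5 -> rank 2
    [0.9, 0.3, 0.8, 0.1, 0.1],  # the later 0.8 is ignored -> rank 0
    [0.1, 0.2, 0.1, 0.1, 0.1],  # nothing above 0.5 -> clipped to rank 0
])
ranks = decode_ordinal(outputs)
scores = np.array([map_labels[r] for r in ranks])
```

Because of the cumulative product, an isolated high output after a low one does not raise the rank, which keeps the decoded ranks monotone in the thresholds.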


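`average_by_id` collapses several predictions that share an `id` into their mean while keeping the ids in order of first appearance, so its output can have fewer rows than its input. The sketch below repeats the helper on a toy frame (the values are made up) and compares it with `groupby(..., sort=False)`, which is one possible shorter way to obtain the same ordering.

```python
import pandas as pd

def average_by_id(df):
    """Same logic as the notebook helper: mean per id, original id order."""
    orig_id_order = df['id']
    unordered_means = df.groupby('id').mean().reset_index()
    return unordered_means.set_index('id').loc[orig_id_order].reset_index().drop_duplicates()

# Toy predictions: id 7 appears twice, id 3 once (illustrative values only).
toy = pd.DataFrame({'pred': [0.25, 0.75, 1.0], 'id': [7, 7, 3]})

averaged = average_by_id(toy)

# Possible alternative: sort=False keeps groups in order of first appearance.
alternative = toy.groupby('id', sort=False)['pred'].mean().reset_index()
```

Both frames contain one row for id 7 (the mean of its two predictions) followed by one row for id 3.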
In [18]:
# ====================================================
# train loop
# ====================================================
def train_loop(folds, fold):
    
    LOGGER.info(f"========== fold: {fold} training ==========")

    # ====================================================
    # loader
    # ====================================================
    train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
    valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
    if CFG.augment_identity_graph_data is not None and CFG.validate_on_original:
        valid_folds = valid_folds[valid_folds['augmented'] == False].reset_index(drop=True)

    valid_labels = valid_folds['score'].values
    
    train_dataset = TrainDataset(CFG, train_folds)
    # The chemical-synonym lookup is passed only to the validation dataset.
    valid_dataset = TrainDataset(CFG, valid_folds, chem_lookup)

    train_loader = DataLoader(train_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=True,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=True)
    valid_loader = DataLoader(valid_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=False,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

    # ====================================================
    # model & optimizer
    # ====================================================
    model = CustomModel(CFG, config_path=None, pretrained=True)
    torch.save(model.config, OUTPUT_DIR+'config.pth')
    model.to(device)
    
    def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
        param_optimizer = list(model.named_parameters())
        no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
        optimizer_parameters = [
            {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': weight_decay},
            {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': 0.0},
            {'params': [p for n, p in model.named_parameters() if "model" not in n],
             'lr': decoder_lr, 'weight_decay': 0.0}
        ]
        return optimizer_parameters

    optimizer_parameters = get_optimizer_params(model,
                                                encoder_lr=CFG.encoder_lr, 
                                                decoder_lr=CFG.decoder_lr,
                                                weight_decay=CFG.weight_decay)
    optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr, eps=CFG.eps, betas=CFG.betas)
    
    # ====================================================
    # scheduler
    # ====================================================
    def get_scheduler(cfg, optimizer, num_train_steps):
        if cfg.scheduler == 'linear':
            scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
            )
        elif cfg.scheduler == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
            )
        else:
            raise ValueError(f"Unknown scheduler: {cfg.scheduler!r}")
        return scheduler
    
    num_train_steps = int(len(train_folds) / CFG.batch_size * CFG.epochs)
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)

    # ====================================================
    # loop
    # ====================================================
    if CFG.classification:
        criterion = nn.CrossEntropyLoss()
    elif CFG.ordinal:
        criterion = ordinal_regression
    else:
        criterion = nn.BCEWithLogitsLoss(reduction="mean")
    
    best_score = 0.

    for epoch in range(CFG.epochs):

        start_time = time.time()

        # train
        avg_loss = train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device)

        # eval
        avg_val_loss, predictions = valid_fn(valid_loader, model, criterion, device)
        
        # scoring
        score = get_score(valid_labels, predictions)

        elapsed = time.time() - start_time

        LOGGER.info(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
        LOGGER.info(f'Epoch {epoch+1} - Score: {score:.4f}')       
        if best_score < score:
            best_score = score
            LOGGER.info(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
            torch.save({'model': model.state_dict(),
                        'predictions': predictions},
                        OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth")

    predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth", 
                             map_location=torch.device('cpu'))['predictions']
    valid_folds['pred'] = predictions

    torch.cuda.empty_cache()
    gc.collect()
    
    return valid_folds
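
The `LR` column in the training output further down is consistent with a cosine schedule that has no warmup and runs for half a cycle. The sketch below reimplements, in plain Python, the multiplier that a cosine-with-warmup schedule applies to the base learning rate. The base rate `2e-5`, `num_warmup_steps=0`, `num_cycles=0.5`, and the total of `1709 * 5` steps are assumptions read off the log, not values taken from `CFG`.

```python
import math

def cosine_multiplier(step, num_training_steps, num_warmup_steps=0, num_cycles=0.5):
    """Learning-rate multiplier: linear warmup followed by cosine decay."""
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    progress = (step - num_warmup_steps) / max(1, num_training_steps - num_warmup_steps)
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * num_cycles * 2.0 * progress)))

BASE_LR = 2e-5          # assumed from the first logged LR
TOTAL_STEPS = 1709 * 5  # assumed: 1709 batches per epoch, 5 epochs

lr_start = BASE_LR * cosine_multiplier(0, TOTAL_STEPS)
lr_epoch2 = BASE_LR * cosine_multiplier(1709, TOTAL_STEPS)  # first step of epoch 2
lr_epoch3 = BASE_LR * cosine_multiplier(3418, TOTAL_STEPS)  # first step of epoch 3
```

Under these assumptions the values at the start of epochs 2 and 3 round to the `0.00001809` and `0.00001309` shown in the log.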

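`num_train_steps` in `train_loop` is computed from the raw sample count, whereas `train_fn` only steps the optimizer (and, with `CFG.batch_scheduler`, the scheduler) after a full accumulation window, and the training loader uses `drop_last=True`. A small sketch of the two counts, using made-up dataset sizes, shows how they can differ:

```python
def scheduled_steps(n_samples, batch_size, epochs):
    """Step count as computed in train_loop: fractional batches are included."""
    return int(n_samples / batch_size * epochs)

def optimizer_steps(n_samples, batch_size, epochs, accumulation_steps=1):
    """Steps actually taken: incomplete batches are dropped (drop_last=True)
    and the optimizer steps once per `accumulation_steps` batches."""
    batches_per_epoch = n_samples // batch_size
    return (batches_per_epoch // accumulation_steps) * epochs

# Hypothetical sizes chosen only for illustration.
n_samples, batch_size, epochs = 27350, 16, 5

planned = scheduled_steps(n_samples, batch_size, epochs)
taken = optimizer_steps(n_samples, batch_size, epochs)
taken_accum2 = optimizer_steps(n_samples, batch_size, epochs, accumulation_steps=2)
```

Without accumulation the two counts differ by at most a few steps. With an accumulation factor of 2 the scheduler would cover only about half of the planned steps, so the learning rate would not decay fully; dividing the planned count by the accumulation factor is one possible way to align them.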
In [19]:
if __name__ == '__main__':
    
    def get_result(oof_df):
        labels = oof_df['score'].values
        preds = oof_df['pred'].values
        score = get_score(labels, preds)
        LOGGER.info(f'Score: {score:<.4f}')
    
    oof_df = pd.DataFrame()
    for fold in range(CFG.n_fold):
        if fold in CFG.trn_fold:
            _oof_df = train_loop(train, fold)
            oof_df = pd.concat([oof_df, _oof_df])
            LOGGER.info(f"========== fold: {fold} result ==========")
            get_result(_oof_df)
    oof_df = oof_df.reset_index(drop=True)
    LOGGER.info(f"========== CV ==========")
    get_result(oof_df)
    oof_df.to_csv(OUTPUT_DIR+'oof_df.csv')




Some weights of the model checkpoint at microsoft/deberta-v3-small were not used when initializing DebertaV2Model: ['lm_predictions.lm_head.bias', 'mask_predictions.classifier.weight', 'mask_predictions.classifier.bias', 'mask_predictions.dense.bias', 'lm_predictions.lm_head.dense.bias', 'mask_predictions.LayerNorm.bias', 'lm_predictions.lm_head.LayerNorm.weight', 'lm_predictions.lm_head.dense.weight', 'mask_predictions.LayerNorm.weight', 'lm_predictions.lm_head.LayerNorm.bias', 'mask_predictions.dense.weight']
- This IS expected if you are initializing DebertaV2Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DebertaV2Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Epoch: [1][0/1709] Elapsed 0m 1s (remain 46m 37s) Loss: 0.8129(0.8129) Grad: inf  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 38s (remain 4m 52s) Loss: 0.6084(0.6220) Grad: 22293.9668  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 16s (remain 4m 8s) Loss: 0.5703(0.6036) Grad: 25987.8574  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 53s (remain 3m 29s) Loss: 0.6233(0.5946) Grad: 22003.6387  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 30s (remain 2m 51s) Loss: 0.6607(0.5881) Grad: 63647.8125  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 8s (remain 2m 13s) Loss: 0.5898(0.5845) Grad: 39769.1289  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 45s (remain 1m 35s) Loss: 0.6223(0.5802) Grad: 14565.5303  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 23s (remain 0m 57s) Loss: 0.6684(0.5774) Grad: 19106.8691  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 5m 0s (remain 0m 20s) Loss: 0.6561(0.5751) Grad: 38454.6602  LR: 0.00001832  
Epoch: [1][1708/1709] Elapsed 5m 20s

Epoch 1 - avg_train_loss: 0.5739  avg_val_loss: 0.5606  time: 350s
Epoch 1 - Score: 0.8066
Epoch 1 - Save Best Score: 0.8066 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.5120(0.5606) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 13m 5s) Loss: 0.5022(0.5022) Grad: 95969.3906  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 44s) Loss: 0.6045(0.5432) Grad: 214206.1719  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.5152(0.5414) Grad: 104295.8516  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5939(0.5390) Grad: 93785.5625  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5631(0.5398) Grad: 127155.6953  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 6s (remain 2m 12s) Loss: 0.4599(0.5386) Grad: 150896.0781  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.6226(0.5387) Grad: 151727.9219  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5271(0.5379) Grad: 76608.9609  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 58s (remain 0m 20s) Loss: 0.4184

Epoch 2 - avg_train_loss: 0.5374  avg_val_loss: 0.5473  time: 348s
Epoch 2 - Score: 0.8261
Epoch 2 - Save Best Score: 0.8261 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4898(0.5473) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 10m 59s) Loss: 0.5360(0.5360) Grad: 111288.3359  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 42s) Loss: 0.5270(0.5303) Grad: 42125.6992  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 14s (remain 4m 4s) Loss: 0.6135(0.5338) Grad: 80224.7188  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 52s (remain 3m 26s) Loss: 0.5376(0.5304) Grad: 147923.4531  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5156(0.5297) Grad: 152080.7188  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 6s (remain 2m 12s) Loss: 0.4455(0.5301) Grad: 58976.0312  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 43s (remain 1m 34s) Loss: 0.5158(0.5306) Grad: 118926.0703  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5894(0.5298) Grad: 113538.4453  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 58s (remain 0m 20s) Loss: 0.392

Epoch 3 - avg_train_loss: 0.5294  avg_val_loss: 0.5441  time: 348s
Epoch 3 - Score: 0.8331
Epoch 3 - Save Best Score: 0.8331 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4772(0.5441) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 10m 41s) Loss: 0.5738(0.5738) Grad: 102032.2266  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 42s) Loss: 0.3547(0.5227) Grad: 82888.7656  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 14s (remain 4m 4s) Loss: 0.5423(0.5252) Grad: 184858.4375  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 52s (remain 3m 26s) Loss: 0.5335(0.5251) Grad: 65079.4336  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.6262(0.5245) Grad: 120131.4922  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 6s (remain 2m 12s) Loss: 0.5777(0.5244) Grad: 218012.2188  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 43s (remain 1m 34s) Loss: 0.5945(0.5247) Grad: 46860.7891  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5115(0.5250) Grad: 116836.4609  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 58s (remain 0m 20s) Loss: 0.459

Epoch 4 - avg_train_loss: 0.5229  avg_val_loss: 0.5436  time: 348s
Epoch 4 - Score: 0.8357
Epoch 4 - Save Best Score: 0.8357 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4827(0.5436) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 10m 9s) Loss: 0.5050(0.5050) Grad: 54567.0195  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 42s) Loss: 0.4465(0.5191) Grad: 70519.7266  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 14s (remain 4m 4s) Loss: 0.5185(0.5189) Grad: 118233.3594  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 52s (remain 3m 26s) Loss: 0.5036(0.5178) Grad: 191480.9219  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5417(0.5190) Grad: 49145.8438  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 6s (remain 2m 11s) Loss: 0.4825(0.5169) Grad: 232960.5781  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 43s (remain 1m 34s) Loss: 0.5570(0.5168) Grad: 54607.8008  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5615(0.5170) Grad: 51556.3711  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 58s (remain 0m 20s) Loss: 0.5676(0

Epoch 5 - avg_train_loss: 0.5185  avg_val_loss: 0.5449  time: 348s
Epoch 5 - Score: 0.8364
Epoch 5 - Save Best Score: 0.8364 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4808(0.5449) 


Score: 0.8364


Epoch: [1][0/1709] Elapsed 0m 0s (remain 11m 23s) Loss: 0.7134(0.7134) Grad: 109696.0391  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 37s (remain 4m 42s) Loss: 0.5978(0.6286) Grad: 73042.0234  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 14s (remain 4m 4s) Loss: 0.5984(0.6087) Grad: 56065.8320  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5836(0.5968) Grad: 67609.0000  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5859(0.5915) Grad: 71261.5547  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.5664(0.5839) Grad: 45436.4805  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.5585(0.5812) Grad: 59260.6055  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.6317(0.5781) Grad: 204904.5938  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.6113(0.5758) Grad: 111620.3906  LR: 0.00001832  
Epoch: [1][1708/1709] Ela

Epoch 1 - avg_train_loss: 0.5748  avg_val_loss: 0.5487  time: 349s
Epoch 1 - Score: 0.8159
Epoch 1 - Save Best Score: 0.8159 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4668(0.5487) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 11m 18s) Loss: 0.6161(0.6161) Grad: 258156.5312  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 44s) Loss: 0.5549(0.5420) Grad: 97631.4688  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.4923(0.5394) Grad: 111218.7266  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.4609(0.5362) Grad: 113198.3750  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5666(0.5373) Grad: 386534.0312  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.5260(0.5367) Grad: 49634.9883  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.4154(0.5389) Grad: 125588.7969  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.4962(0.5398) Grad: 91100.1016  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.447

Epoch 2 - avg_train_loss: 0.5407  avg_val_loss: 0.5416  time: 349s
Epoch 2 - Score: 0.8267
Epoch 2 - Save Best Score: 0.8267 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4043(0.5416) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 13m 7s) Loss: 0.4425(0.4425) Grad: 84593.4922  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.6184(0.5345) Grad: 317619.5312  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.4867(0.5260) Grad: 67129.2109  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5690(0.5265) Grad: 207792.9375  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5203(0.5275) Grad: 72743.1094  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.6091(0.5275) Grad: 120049.9766  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.3970(0.5277) Grad: 715017.8750  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5837(0.5282) Grad: 82004.2500  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.5107(

Epoch 3 - avg_train_loss: 0.5278  avg_val_loss: 0.5419  time: 349s
Epoch 3 - Score: 0.8356
Epoch 3 - Save Best Score: 0.8356 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4100(0.5419) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 11m 24s) Loss: 0.5790(0.5790) Grad: 69182.3359  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.5813(0.5257) Grad: 311861.1562  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.5218(0.5228) Grad: 113750.1641  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5399(0.5213) Grad: 78849.5234  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.4950(0.5205) Grad: 233848.1562  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.5151(0.5193) Grad: 88728.5703  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 45s (remain 1m 35s) Loss: 0.5260(0.5184) Grad: 89016.1641  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.5240(0.5197) Grad: 24708.8535  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 5m 0s (remain 0m 20s) Loss: 0.5846(0

Epoch 4 - avg_train_loss: 0.5196  avg_val_loss: 0.5445  time: 351s
Epoch 4 - Score: 0.8375
Epoch 4 - Save Best Score: 0.8375 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4503(0.5445) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 15m 3s) Loss: 0.5511(0.5511) Grad: 28939.2402  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 38s (remain 4m 48s) Loss: 0.6363(0.5113) Grad: 547815.5625  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 15s (remain 4m 7s) Loss: 0.3657(0.5102) Grad: 109013.9375  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 53s (remain 3m 29s) Loss: 0.5678(0.5135) Grad: 102504.9141  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 31s (remain 2m 51s) Loss: 0.5450(0.5145) Grad: 99635.6953  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 8s (remain 2m 13s) Loss: 0.4685(0.5149) Grad: 173156.4688  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 46s (remain 1m 35s) Loss: 0.4531(0.5147) Grad: 68999.6641  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 24s (remain 0m 58s) Loss: 0.3993(0.5149) Grad: 128957.0000  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 5m 1s (remain 0m 20s) Loss: 0.5611(

Epoch 5 - avg_train_loss: 0.5160  avg_val_loss: 0.5462  time: 352s
Epoch 5 - Score: 0.8377
Epoch 5 - Save Best Score: 0.8377 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4645(0.5462) 


Score: 0.8377


Epoch: [1][0/1709] Elapsed 0m 0s (remain 11m 21s) Loss: 0.6991(0.6991) Grad: inf  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 37s (remain 4m 44s) Loss: 0.6380(0.6210) Grad: 26085.8613  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.5882(0.6012) Grad: 36736.5664  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5309(0.5903) Grad: 33720.8555  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 30s (remain 2m 50s) Loss: 0.5406(0.5860) Grad: 43545.1250  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.5757(0.5801) Grad: 35973.8125  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 45s (remain 1m 35s) Loss: 0.5681(0.5760) Grad: 58657.6953  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.4688(0.5739) Grad: 18112.8223  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.4831(0.5722) Grad: 22665.0879  LR: 0.00001832  
Epoch: [1][1708/1709] Elapsed 5m 20

Epoch 1 - avg_train_loss: 0.5714  avg_val_loss: 0.5506  time: 350s
Epoch 1 - Score: 0.8038
Epoch 1 - Save Best Score: 0.8038 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.5182(0.5506) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 10m 57s) Loss: 0.5487(0.5487) Grad: 55716.8125  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 44s) Loss: 0.5157(0.5343) Grad: 94078.4453  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.4759(0.5329) Grad: 170016.5625  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.3740(0.5356) Grad: 107158.2109  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 30s (remain 2m 50s) Loss: 0.5502(0.5348) Grad: 160695.9375  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.4944(0.5355) Grad: 66245.6094  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 44s (remain 1m 35s) Loss: 0.5223(0.5360) Grad: 165572.0000  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.4835(0.5360) Grad: 135429.6250  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.478

Epoch 2 - avg_train_loss: 0.5358  avg_val_loss: 0.5482  time: 350s
Epoch 2 - Score: 0.8212


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4617(0.5482) 


Epoch 2 - Save Best Score: 0.8212 Model


Epoch: [3][0/1709] Elapsed 0m 0s (remain 11m 4s) Loss: 0.5196(0.5196) Grad: 194327.1562  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 42s) Loss: 0.6076(0.5329) Grad: 161223.1562  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 15s (remain 4m 4s) Loss: 0.6149(0.5288) Grad: 149633.6562  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5750(0.5327) Grad: 59628.8984  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5527(0.5298) Grad: 64435.6016  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.4539(0.5282) Grad: 201595.2188  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 44s (remain 1m 35s) Loss: 0.4624(0.5291) Grad: 51648.5039  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.5327(0.5278) Grad: 139991.3750  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.5935(0.5281) Grad: 163898.5938  LR: 0.00000730  
Epoch: [3][1708/1709] E

Epoch 3 - avg_train_loss: 0.5276  avg_val_loss: 0.5507  time: 349s
Epoch 3 - Score: 0.8236
Epoch 3 - Save Best Score: 0.8236 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4662(0.5507) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 12m 41s) Loss: 0.5820(0.5820) Grad: 614546.2500  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.5208(0.5275) Grad: 112154.5078  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.5803(0.5232) Grad: 146115.2969  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5640(0.5223) Grad: 162647.5156  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5436(0.5217) Grad: 101036.3281  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.3943(0.5196) Grad: 62732.1758  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 44s (remain 1m 35s) Loss: 0.5215(0.5208) Grad: 43774.9102  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.4862(0.5219) Grad: 109904.2656  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.51

Epoch 4 - avg_train_loss: 0.5211  avg_val_loss: 0.5495  time: 349s
Epoch 4 - Score: 0.8271
Epoch 4 - Save Best Score: 0.8271 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4488(0.5495) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 11m 30s) Loss: 0.5093(0.5093) Grad: 254134.5000  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.5970(0.5174) Grad: 81761.1328  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.5678(0.5218) Grad: 461110.7812  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.4855(0.5183) Grad: 60162.2305  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5661(0.5181) Grad: 18620.3906  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.4842(0.5166) Grad: 35575.7812  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 44s (remain 1m 35s) Loss: 0.6490(0.5169) Grad: 124150.1250  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.4661(0.5163) Grad: 40472.9453  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.5145(

Epoch 5 - avg_train_loss: 0.5175  avg_val_loss: 0.5491  time: 349s
Epoch 5 - Score: 0.8277
Epoch 5 - Save Best Score: 0.8277 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4431(0.5491) 


Score: 0.8277


Epoch: [1][0/1709] Elapsed 0m 0s (remain 12m 5s) Loss: 0.6992(0.6992) Grad: 74274.9531  LR: 0.00002000  
Epoch: [1][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.6554(0.6222) Grad: 79431.5781  LR: 0.00001997  
Epoch: [1][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.6537(0.6068) Grad: 81566.1641  LR: 0.00001989  
Epoch: [1][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.6779(0.5970) Grad: 93405.6016  LR: 0.00001976  
Epoch: [1][800/1709] Elapsed 2m 30s (remain 2m 50s) Loss: 0.5360(0.5898) Grad: 39505.8164  LR: 0.00001957  
Epoch: [1][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.6123(0.5855) Grad: 42378.8555  LR: 0.00001933  
Epoch: [1][1200/1709] Elapsed 3m 44s (remain 1m 35s) Loss: 0.5723(0.5803) Grad: 91408.0156  LR: 0.00001904  
Epoch: [1][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.4813(0.5771) Grad: 90534.1953  LR: 0.00001870  
Epoch: [1][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.4926(0.5749) Grad: 34549.2266  LR: 0.00001832  
Epoch: [1][1708/1709] Elapsed

Epoch 1 - avg_train_loss: 0.5734  avg_val_loss: 0.5525  time: 350s
Epoch 1 - Score: 0.7917
Epoch 1 - Save Best Score: 0.7917 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4623(0.5525) 
Epoch: [2][0/1709] Elapsed 0m 0s (remain 10m 27s) Loss: 0.5821(0.5821) Grad: 62044.2422  LR: 0.00001809  
Epoch: [2][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.5367(0.5440) Grad: 90874.4062  LR: 0.00001764  
Epoch: [2][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.6618(0.5440) Grad: 124619.8281  LR: 0.00001714  
Epoch: [2][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5241(0.5417) Grad: 97337.1797  LR: 0.00001661  
Epoch: [2][800/1709] Elapsed 2m 30s (remain 2m 50s) Loss: 0.5124(0.5424) Grad: 114465.6797  LR: 0.00001604  
Epoch: [2][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.6259(0.5415) Grad: 294933.8750  LR: 0.00001544  
Epoch: [2][1200/1709] Elapsed 3m 44s (remain 1m 35s) Loss: 0.4884(0.5398) Grad: 63139.2617  LR: 0.00001481  
Epoch: [2][1400/1709] Elapsed 4m 22s (remain 0m 57s) Loss: 0.5801(0.5395) Grad: 160796.0000  LR: 0.00001415  
Epoch: [2][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.4795

Epoch 2 - avg_train_loss: 0.5394  avg_val_loss: 0.5511  time: 349s
Epoch 2 - Score: 0.8131
Epoch 2 - Save Best Score: 0.8131 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4763(0.5511) 
Epoch: [3][0/1709] Elapsed 0m 0s (remain 12m 1s) Loss: 0.5001(0.5001) Grad: 208907.8438  LR: 0.00001309  
Epoch: [3][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.5505(0.5281) Grad: 72927.7266  LR: 0.00001238  
Epoch: [3][400/1709] Elapsed 1m 15s (remain 4m 4s) Loss: 0.5543(0.5287) Grad: 46122.2148  LR: 0.00001166  
Epoch: [3][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5320(0.5292) Grad: 143634.8750  LR: 0.00001094  
Epoch: [3][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.4555(0.5307) Grad: 30564.7070  LR: 0.00001020  
Epoch: [3][1000/1709] Elapsed 3m 7s (remain 2m 12s) Loss: 0.5199(0.5299) Grad: 65140.1094  LR: 0.00000947  
Epoch: [3][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.4698(0.5292) Grad: 30377.5293  LR: 0.00000874  
Epoch: [3][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5478(0.5301) Grad: 32605.4980  LR: 0.00000801  
Epoch: [3][1600/1709] Elapsed 4m 59s (remain 0m 20s) Loss: 0.4349(0.

Epoch 3 - avg_train_loss: 0.5298  avg_val_loss: 0.5539  time: 349s
Epoch 3 - Score: 0.8206
Epoch 3 - Save Best Score: 0.8206 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4641(0.5539) 
Epoch: [4][0/1709] Elapsed 0m 0s (remain 10m 43s) Loss: 0.5611(0.5611) Grad: 53209.0430  LR: 0.00000691  
Epoch: [4][200/1709] Elapsed 0m 37s (remain 4m 42s) Loss: 0.4973(0.5216) Grad: 182911.9219  LR: 0.00000622  
Epoch: [4][400/1709] Elapsed 1m 14s (remain 4m 4s) Loss: 0.5717(0.5199) Grad: 68862.7188  LR: 0.00000555  
Epoch: [4][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.6595(0.5187) Grad: 92183.1250  LR: 0.00000491  
Epoch: [4][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.6340(0.5180) Grad: 289735.9688  LR: 0.00000429  
Epoch: [4][1000/1709] Elapsed 3m 6s (remain 2m 12s) Loss: 0.4652(0.5170) Grad: 79861.7578  LR: 0.00000370  
Epoch: [4][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.3584(0.5166) Grad: 103491.3828  LR: 0.00000315  
Epoch: [4][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.5453(0.5185) Grad: 204079.9375  LR: 0.00000263  
Epoch: [4][1600/1709] Elapsed 4m 58s (remain 0m 20s) Loss: 0.4753

Epoch 4 - avg_train_loss: 0.5184  avg_val_loss: 0.5510  time: 348s
Epoch 4 - Score: 0.8233
Epoch 4 - Save Best Score: 0.8233 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4507(0.5510) 
Epoch: [5][0/1709] Elapsed 0m 0s (remain 14m 24s) Loss: 0.5342(0.5342) Grad: 69058.6797  LR: 0.00000191  
Epoch: [5][200/1709] Elapsed 0m 37s (remain 4m 43s) Loss: 0.5444(0.5157) Grad: 68609.2734  LR: 0.00000150  
Epoch: [5][400/1709] Elapsed 1m 15s (remain 4m 5s) Loss: 0.6071(0.5115) Grad: 51437.0508  LR: 0.00000114  
Epoch: [5][600/1709] Elapsed 1m 52s (remain 3m 27s) Loss: 0.5146(0.5111) Grad: 32685.7930  LR: 0.00000082  
Epoch: [5][800/1709] Elapsed 2m 29s (remain 2m 49s) Loss: 0.5727(0.5137) Grad: 90109.1953  LR: 0.00000056  
Epoch: [5][1000/1709] Elapsed 3m 6s (remain 2m 12s) Loss: 0.4177(0.5145) Grad: 83061.0391  LR: 0.00000034  
Epoch: [5][1200/1709] Elapsed 3m 44s (remain 1m 34s) Loss: 0.5214(0.5149) Grad: 58722.0234  LR: 0.00000018  
Epoch: [5][1400/1709] Elapsed 4m 21s (remain 0m 57s) Loss: 0.6516(0.5153) Grad: 113788.7031  LR: 0.00000007  
Epoch: [5][1600/1709] Elapsed 4m 58s (remain 0m 20s) Loss: 0.5232(0.

Epoch 5 - avg_train_loss: 0.5148  avg_val_loss: 0.5540  time: 349s
Epoch 5 - Score: 0.8242
Epoch 5 - Save Best Score: 0.8242 Model


EVAL: [569/570] Elapsed 0m 29s (remain 0m 0s) Loss: 0.4503(0.5540) 


Score: 0.8242
Score: 0.8314
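
The `LR` column in the log above looks like a half-cosine decay from `2e-5` to zero over all 5 × 1709 steps, with no warmup. That is an inference from the logged numbers, not something the log states. The sketch below is a minimal check of that assumption: `cosine_lr` and the constants are hypothetical names introduced here, and the formula is the standard cosine schedule, not code taken from the training script.

```python
import math

BASE_LR = 2e-5          # first LR value printed in the log
STEPS_PER_EPOCH = 1709  # from the "[step/1709]" counter in the log
EPOCHS = 5              # number of epochs shown in the log
TOTAL_STEPS = STEPS_PER_EPOCH * EPOCHS


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS, base_lr: float = BASE_LR) -> float:
    """Half-cosine decay from base_lr to 0 without warmup (assumed schedule)."""
    progress = step / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the LR at the first step of every epoch, in the log's format.
    for epoch in range(EPOCHS):
        step = epoch * STEPS_PER_EPOCH
        print(f"Epoch [{epoch + 1}][0/{STEPS_PER_EPOCH}] LR: {cosine_lr(step):.8f}")
```

At the first step of each epoch this gives `0.00002000`, `0.00001809`, `0.00001309`, `0.00000691` and `0.00000191`, which agrees with the logged values to the eight decimals the log prints.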
