# deberta-v3-small Experiments

We run fast experiments on 1/5 of the data for 3 epochs to quickly determine which adaptations help, and report the CV score.  
For adaptations we want to explore further, we retrain on the whole dataset and evaluate on the leaderboard.  

</br> </br>  


## Fast Experiments 
*1/5 of original training data, 3 epochs, 4 fold cv*

---

### Prompt Engineering
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline           | `fast_baseline_cfg`   | 0.7430   | - | |
|  CPC Context Text   | `fast_context_cfg`    | 0.7523    | - | |  
|  Custom Tokens      | `fast_customtok_cfg`  | 0.7464  | - | |  



**Conclusion**: Adding Context Text works best here (0.7523 vs. 0.7430 for the baseline), so we continue with it.
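
For orientation, the three prompt variants compared above can be sketched as follows. This is a minimal illustration that mirrors the string concatenation used later in the notebook; `build_prompt` is a hypothetical helper, and the context text shown is a shortened placeholder rather than the exact CPC title.

```python
def build_prompt(anchor, target, context, context_text, mode, sep_token="[S]"):
    """Build the model input for one sample (hypothetical helper, for illustration)."""
    if mode == "custom_tok":
        # Section token such as "[A]" first, then context text, then the phrase pair
        return f"[{context[0]}]{context_text} [SEP] {anchor}{sep_token}{target}"
    if mode == "ctx_txt":
        # Phrase pair followed by the full context text
        return f"{anchor} [SEP] {target} [SEP] {context_text}"
    if mode is None:
        # Baseline: phrase pair followed by the raw context code
        return f"{anchor} [SEP] {target} [SEP] {context}"
    raise ValueError(f"mode = {mode} not recognized!")


example = dict(
    anchor="abatement",
    target="abatement of pollution",
    context="A47",
    context_text="HUMAN NECESSITIES. FURNITURE",  # placeholder, shortened
)
baseline_prompt = build_prompt(**example, mode=None)
context_prompt = build_prompt(**example, mode="ctx_txt")
custom_prompt = build_prompt(**example, mode="custom_tok")
```

The baseline only exposes the short CPC code to the model, while the other two variants expose a natural-language description of the context.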
</br> </br>  

### Model Type
    
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Regression       | `fast_reg_cfg` |  *see CPC Context Text*  | - | |
|  Classification   | `fast_class_cfg` | 0.7378 | - | |  
|  Ordinal          | `fast_ord_cfg` | 0.7262 | - | |  


**Conclusion**: Plain regression works best (0.7523 vs. 0.7378 for classification and 0.7262 for ordinal regression), so we continue with it.
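
The three model types differ mainly in how the score is encoded as a training target. The sketch below mirrors the logic of `prepare_labels` further down in this notebook, using plain lists instead of tensors; `encode_label` is a hypothetical name used only for illustration.

```python
MAP_SCORE = {0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}


def encode_label(score, mode):
    """Encode a similarity score as a training target (illustrative sketch)."""
    k = MAP_SCORE[score]
    n = len(MAP_SCORE)
    if mode == "classification":
        # One-hot vector over the five score classes
        return [1.0 if i == k else 0.0 for i in range(n)]
    if mode == "ordinal":
        # Cumulative encoding: every position up to the class index is set
        return [1.0 if i <= k else 0.0 for i in range(n)]
    # Regression: the raw score is the target
    return score


regression_target = encode_label(0.5, "regression")
classification_target = encode_label(0.5, "classification")
ordinal_target = encode_label(0.5, "ordinal")
```

The ordinal encoding keeps the ordering of the classes, which a one-hot target discards.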
</br> </br>  

### Classical NLP Preprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Stemming              | `fast_stem_cfg`   |  0.6663 | - | |
| Lemmatizing           | `fast_lemma_cfg`   | 0.7510  | - | |
| Special Characters    | `fast_specchr_cfg`   |   | - | Removing special characters from the prompt |


### Postprocessing
|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Clipping              | `fast_post_clip_cfg`     |  0.7423 | - | Range `[0,1]`|
| MinMax                | `fast_post_minmax_cfg`   |  0.7523 | - | Range `[0,1]`|
| Chemical Lookup       | `fast_post_chem_cfg`     |  0.7541 | - |              |

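**Conclusion**: `Clipping` performs slightly worse (0.7423 vs. 0.7523). This is plausible because the metric is the Pearson correlation and clipping is a non-linear transformation of the predictions.
`MinMax` scores exactly the same as the unprocessed model, as expected: the Pearson correlation is invariant under positive affine rescaling.
`Chemical Lookup` performs slightly better (0.7541), so we continue with it.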
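
The behaviour of the two range-based postprocessing steps can be checked on synthetic data. The sketch below uses made-up noisy predictions (an assumption, not the notebook's model outputs) and only NumPy: min-max scaling leaves the Pearson correlation unchanged, while clipping alters it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic targets on the competition's score grid and noisy predictions,
# some of which fall outside [0, 1]
y_true = rng.choice([0.0, 0.25, 0.5, 0.75, 1.0], size=1000)
y_pred = y_true + rng.normal(scale=0.3, size=1000)


def pearson(a, b):
    return np.corrcoef(a, b)[0, 1]


def minmax(x):
    # Positive affine map onto [0, 1]
    return (x - x.min()) / (x.max() - x.min())


raw = pearson(y_true, y_pred)
scaled = pearson(y_true, minmax(y_pred))              # same as raw up to rounding
clipped = pearson(y_true, np.clip(y_pred, 0.0, 1.0))  # differs from raw
```

Whether clipping raises or lowers the score depends on the prediction distribution; the point is only that, unlike min-max scaling, it can change the metric.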
</br> </br>  

### Data Augmentation
To acquire a fair estimate, here, we only validate on non-augmented data and keep the training data the same size as in previous fast experiments.  


|     Model              |  CFG | cv score | lb score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
| Identities              |   `fast_aug_ident_cfg` |  0.7456 | - | Adds the reverse mapping for Anchor-Target pairs with score 1|
| All Mirrored            |   `fast_aug_mirr_cfg` |  0.7382 | - | Adds the reverse mapping for all Anchor-Target pairs |
| Identity Paths          |  `fast_aug_identpaths_cfg` | 0.7597  | - | Adds all pairs in a path between phrases connected by a score of 1|
| All Mirrored + Identity Paths |  `fast_aug_mirridentpaths_cfg` | 0.7449 | - | Adds identity paths and additionally mirrors all Anchor-Target pairs|
| Neighbors               |  `fast_aug_neighbors_cfg` | 0.7518 | - | In addition to *All Mirrored + Identity Paths*, considers phrases adjacent to identity paths.|
| Chemical Compounds      |   `fast_aug_chem_cfg`  | 0.7576 | - | Finds synonyms for formulae of common chemical compounds in the dataset and creates new phrases from it.| 

**Conclusion**: Augmenting does not seem to hurt performance significantly, and the additional data is likely to increase performance once the training set is no longer capped in size.  
To give a better comparison, we will compare full models based on the leaderboard score.
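
The augmented training files are loaded precomputed from an external dataset, so their exact construction is not shown in this notebook. As a rough sketch of the idea behind *All Mirrored* and *Identity Paths*, assuming pairs are plain `(anchor, target, score)` tuples and ignoring the CPC context, the augmentation could look like this. Both function names are hypothetical.

```python
from itertools import combinations


def mirror(pairs):
    """Add (target, anchor, score) for every (anchor, target, score)."""
    seen = set(pairs)
    out = list(pairs)
    for a, t, s in pairs:
        if (t, a, s) not in seen:
            seen.add((t, a, s))
            out.append((t, a, s))
    return out


def identity_path_pairs(pairs):
    """Return new score-1 pairs implied by chains of score-1 pairs."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Phrases linked by a score of 1 end up in the same component
    for a, t, s in pairs:
        if s == 1.0:
            parent[find(a)] = find(t)

    groups = {}
    for phrase in list(parent):
        groups.setdefault(find(phrase), []).append(phrase)

    existing = {frozenset((a, t)) for a, t, _ in pairs}
    new_pairs = []
    for members in groups.values():
        for a, t in combinations(sorted(members), 2):
            if frozenset((a, t)) not in existing:
                new_pairs.append((a, t, 1.0))
    return new_pairs


pairs = [("a", "b", 1.0), ("b", "c", 1.0), ("c", "d", 0.5)]
mirrored = mirror(pairs)
implied = identity_path_pairs(pairs)
```

In this toy example `a` and `c` are never paired directly, but the chain `a = b = c` implies a new pair with score 1.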

</br>  </br>  </br>  

## Full Experiments 
*full training data, 5 epochs, 4 fold cv*

---

    
|     Model              |  CFG | cv score | lb (pb) score | comment|
|:----------------------|:--------|:--------:|:--------:|:--------|
|  Baseline                        | `full_baseline_cfg`      | 0.8314  | 0.8210 (0.8107) | |
|  Context Text                    | `full_ctxt_cfg`          | 0.8337  | 0.8293 (0.8171) | |  
|  Chemical Lookup                 | `full_post_chem_cfg`     | 0.8361  | 0.8292 (0.8167) | |
|  Neighbors Augmentation          | `full_aug_neighbors_cfg` |   |  | |     
|  Chemical Compounds Augmentation | `full_aug_chem_cfg`      |   |  | | 



# Directory settings

In [2]:
# ====================================================
# Directory settings
# ====================================================
import os

INPUT_DIR = '../input/us-patent-phrase-to-phrase-matching/'
IDENTITY_MAPPINGS_DIR = '../input/uspppm-identity-mappings/'
OUTPUT_DIR = './'
if not os.path.exists(OUTPUT_DIR):
    os.makedirs(OUTPUT_DIR)

# Model Configuration

In [3]:
from dataclasses import dataclass, field
from typing import Set, Optional, Tuple, Dict


@dataclass
class ModelConfig:
    ############################################
    # Prompt engineering
    ############################################
    # - 'ctx_txt' (default): add the context text at the end of the prompt
    # - 'custom_tok': add custom tokens for contexts and a custom separator token
    # - None: append the context abbreviation to the end of the prompt
    prompt_engineering: Optional[str] = 'ctx_txt' # ['ctx_txt', 'custom_tok', None]
    custom_sep_token: str = '[S]'
        
    ############################################
    # Model Type
    ############################################
    # when true uses classification else a regression model
    classification: bool = False
    # when true uses ordinal regression (only active when `classification = False`)
    ordinal: bool = False
    
    ############################################
    # Traditional NLP preprocessing
    ############################################
    stemming: bool = False
    lemmatizing: bool = False
    spec_chr_rem: bool = False
        
    ############################################
    # Post Processing
    ############################################
    clipping: bool = False
    minmax: bool = False
    # Average the predictions over the chemical compound synonyms of a sample
    chem_comp_pred_avg: bool  = False
        
    ############################################
    # Data Augmentation
    ############################################
    # Use chemical component synonyms to create samples for augmenting the training set
    chem_comp_train_aug: bool = False
        
    # How to augment the data for training using the graph-based identity mappings
    # available options ['neighbors', 'paths_mirrored', 'paths', 'mirrored', 'identities', None]
    # None indicates that no augmentation should take place
    augment_identity_graph_data: Optional[str] = None 
    validate_on_original: bool = True
        
    
    ############################################
    # General Model Config and Hyperparams
    ############################################
    debug: bool = False
    apex: bool = True
    print_freq: int= 200
    num_workers: int = 4
    model: str = "microsoft/deberta-v3-small"
    scheduler: str = 'cosine' # ['linear', 'cosine']
    batch_scheduler: bool = True
    num_cycles: float = 0.5
    num_warmup_steps: int = 0
    encoder_lr: float = 2e-5
    decoder_lr: float = 2e-5
    min_lr: float = 1e-6
    eps: float = 1e-6
    betas: Tuple[float, float] = (0.9, 0.999)
    batch_size: int = 16
    fc_dropout: float = 0.2
    max_len: int = 512
    weight_decay: float = 0.01
    gradient_accumulation_steps: int = 1
    max_grad_norm: int = 1000
    seed: int = 42
    
    epochs: int = 5
    train_frac: Optional[float] = None
    n_fold: int = 4
    trn_fold: Tuple[int, ...] = (0, 1, 2, 3)
    map_score: Dict[float, int] = field(default_factory = lambda: ({0.0: 0, 0.25: 1, 0.5: 2, 0.75: 3, 1.0: 4}))
    map_labels: Dict[int, float] = field(default_factory = lambda: ({0: 0.0, 1: 0.25, 2: 0.5, 3: 0.75, 4: 1.0}))
        
    target_size: int=1
    def __post_init__(self):
        if self.classification or self.ordinal:
            self.target_size = 5
        else:
            self.target_size = 1  


############################################
# Experiment Configurations
############################################

# Fast Experiments
FAST_BASE_TRAIN_FRAC = 1/5
FAST_BASE_TRAIN_EPOCHS = 3

def fastModelConfig(**kwargs):
    return ModelConfig(**kwargs, train_frac=FAST_BASE_TRAIN_FRAC, epochs=FAST_BASE_TRAIN_EPOCHS)


fast_baseline_cfg = fastModelConfig(prompt_engineering=None)
fast_context_cfg = fastModelConfig()
fast_context_ctx = fast_context_cfg  # alias: this config is referenced under both names
fast_customtok_cfg = fastModelConfig(prompt_engineering='custom_tok')

fast_reg_cfg = fastModelConfig()
fast_class_cfg = fastModelConfig(classification=True)
fast_ord_cfg = fastModelConfig(ordinal=True)

fast_stem_cfg = fastModelConfig(stemming=True)
fast_lemma_cfg = fastModelConfig(lemmatizing=True)
fast_specchr_cfg = fastModelConfig(spec_chr_rem=True)

fast_aug_ident_cfg = fastModelConfig(augment_identity_graph_data='identities')
fast_aug_mirr_cfg = fastModelConfig(augment_identity_graph_data='mirrored')
fast_aug_identpaths_cfg = fastModelConfig(augment_identity_graph_data='paths')
fast_aug_mirridentpaths_cfg = fastModelConfig(augment_identity_graph_data='paths_mirrored')
fast_aug_neighbors_cfg = fastModelConfig(augment_identity_graph_data='neighbors')
fast_aug_chem_cfg = fastModelConfig(chem_comp_train_aug=True)

fast_post_clip_cfg = fastModelConfig(clipping=True)
fast_post_minmax_cfg = fastModelConfig(minmax=True)
fast_post_chem_cfg = fastModelConfig(chem_comp_pred_avg=True)

# Full Experiments
full_baseline_cfg = ModelConfig(prompt_engineering=None)
full_ctxt_cfg = ModelConfig()
full_post_chem_cfg = ModelConfig(chem_comp_pred_avg=True)
full_aug_neighbors_cfg = ModelConfig(augment_identity_graph_data='neighbors')
full_aug_chem_cfg = ModelConfig(chem_comp_train_aug=True)


### Config used throughout this notebook

In [4]:
############################################
# Fast Experiments
############################################

# CFG = fast_baseline_cfg
# CFG = fast_context_ctx
# CFG = fast_customtok_cfg

# CFG = fast_reg_cfg
# CFG = fast_class_cfg
# CFG = fast_ord_cfg

# CFG = fast_stem_cfg
# CFG = fast_lemma_cfg
CFG = fast_specchr_cfg

# CFG = fast_aug_ident_cfg
# CFG = fast_aug_mirr_cfg
# CFG = fast_aug_identpaths_cfg
# CFG = fast_aug_mirridentpaths_cfg
# CFG = fast_aug_neighbors_cfg
# CFG = fast_aug_chem_cfg

# CFG = fast_post_clip_cfg
# CFG = fast_post_minmax_cfg
# CFG = fast_post_chem_cfg

############################################
# Full Experiments
############################################
# CFG = full_baseline_cfg
# CFG = full_ctxt_cfg
# CFG = full_post_chem_cfg
# CFG = full_aug_neighbors_cfg
# CFG = full_aug_chem_cfg

# Library

In [5]:
# ====================================================
# Library
# ====================================================
import os
import gc
import re
import ast
import sys
import copy
import json
import time
import math
import shutil
import string
import pickle
import random
import joblib
import itertools
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import scipy as sp
import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold, GroupKFold, KFold
from sklearn.preprocessing import minmax_scale
    
import torch
print(f"torch.__version__: {torch.__version__}")
import torch.nn as nn
from torch.nn import Parameter
import torch.nn.functional as F
from torch.optim import Adam, SGD, AdamW
from torch.utils.data import DataLoader, Dataset

os.system('pip uninstall -y transformers')
os.system('pip uninstall -y tokenizers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels transformers')
os.system('python -m pip install --no-index --find-links=../input/pppm-pip-wheels tokenizers')
import tokenizers
import transformers
print(f"tokenizers.__version__: {tokenizers.__version__}")
print(f"transformers.__version__: {transformers.__version__}")
from transformers import AutoTokenizer, AutoModel, AutoConfig
from transformers import get_linear_schedule_with_warmup, get_cosine_schedule_with_warmup
%env TOKENIZERS_PARALLELISM=true

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

import nltk
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.data.path.append('../input/wordnet')


# common chemical formulae lookup utility script
from uspppm_common_chemical_compound_lookup import USPPPMChemCompLookup

torch.__version__: 1.9.1+cpu
Found existing installation: transformers 4.16.2
Uninstalling transformers-4.16.2:
  Successfully uninstalled transformers-4.16.2
Found existing installation: tokenizers 0.11.6
Uninstalling tokenizers-0.11.6:
  Successfully uninstalled tokenizers-0.11.6
Looking in links: ../input/pppm-pip-wheels
Processing /kaggle/input/pppm-pip-wheels/transformers-4.18.0-py3-none-any.whl
Processing /kaggle/input/pppm-pip-wheels/tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
Installing collected packages: tokenizers, transformers
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
allennlp 2.9.1 requires transformers<4.17,>=4.1, but you have transformers 4.18.0 which is incompatible.
Successfully installed tokenizers-0.12.1 transformers-4.18.0
Looking in links: ../input/pppm-pip-wheels
tokenizers.__version__: 0.12.1
transformers.__version__: 4.18.0
env: TOKENIZERS_PARALLELISM=true


# Utils

### Chemical Compound Lookup

In [6]:
chem_lookup = USPPPMChemCompLookup(chem_comp_path='../input/chemical-compounds-lookup/compounds.csv')

**Chemical Compound Lookup Tests**

In [7]:
print('Testing basic formula lookup...')
print(chem_lookup.lookup_df)
print(chem_lookup.phrase_chem_formula_synonym('agbr test'))
print(chem_lookup.phrase_chem_formula_synonym('agbr dna test agonc ag2cl2'))
print(chem_lookup.phrase_chem_formula_synonym('dna test d2o'))
print(chem_lookup.chem_formula_synonyms('c3h6'))
print('Done!')
print()
print('Testing train dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),
                          'score': pd.Series([3, 2, 1, 0]),
                          'anchor': pd.Series(['agbr dna test agonc ag2cl2', 'agbr test', 'agbr', 'last']),
                          'target': pd.Series(['agonc ag2cl2', 'test thingy', 'c4h7no4', 'last']),
                          'context': pd.Series(['G02', 'G02', 'C12', 'C12'])})
print("Before")
print(chem_test)
print("After")
print(chem_lookup.pre_augment_chem_formulae(chem_test, True))
print('Done!')
print()
print('Testing test dataset augmentation...')
chem_test = pd.DataFrame({'id': pd.Series(['t1', 't2', 't3', 't4']),
                          'text': pd.Series([
                              'agbr dna test agonc ag2cl2 [SEP] agonc ag2cl2 [SEP] G02',
                              'agbr test [SEP] test thingy [SEP] G02',
                              'agbr [SEP] c4h7no4 [SEP] C12',
                              'last [SEP] last [SEP] C12'
                          ])})
print("Before")
print(chem_test)
print("After")
print(chem_lookup.post_augment_chem_formulae(chem_test, True))
print('Done!')

Testing basic formula lookup...
                            Name  Formula
0               actiniumiiioxide    ac2o3
1     silvertetrachloroaluminate  agalcl4
2                  silverbromide     agbr
3                  silverbromate   agbro3
4                  silvercyanide     agcn
...                          ...      ...
4067                zirconateion    zro32
4068          zirconiumphosphide     zrp2
4069            zirconiumsulfide     zrs2
4070           zirconiumsilicide    zrsi2
4071          zirconiumphosphate  zr3po44

[4072 rows x 2 columns]
[('silverbromide test', True)]
[('silverbromide dna test silverfulminate disilverdichloride', True), ('silverbromide dna test silvercyanate disilverdichloride', True), ('silverbromide dna test silverfulminate silveriidichloride', True), ('silverbromide dna test silvercyanate silveriidichloride', True)]
[('dna test deuteriumoxide', True), ('dna test heavywater', True)]
['cyclopropane', 'propylene']
Done!

Testing train dataset augmentat

In [8]:
# ====================================================
# Utils
# ====================================================
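def get_score(y_true, y_pred):
    # Pearson correlation coefficient between targets and predictions.
    # scipy.stats is imported explicitly because `import scipy as sp` alone
    # does not guarantee that the `stats` submodule is loaded on older SciPy versions.
    from scipy.stats import pearsonr
    return pearsonr(y_true, y_pred)[0]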


def get_logger(filename=OUTPUT_DIR+'train'):
    from logging import getLogger, INFO, StreamHandler, FileHandler, Formatter
    logger = getLogger(__name__)
    logger.setLevel(INFO)
    handler1 = StreamHandler()
    handler1.setFormatter(Formatter("%(message)s"))
    handler2 = FileHandler(filename=f"{filename}.log")
    handler2.setFormatter(Formatter("%(message)s"))
    logger.addHandler(handler1)
    logger.addHandler(handler2)
    logger.propagate = False
    return logger

LOGGER = get_logger()

def seed_everything(seed=42):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.backends.cudnn.deterministic = True
    
seed_everything(seed=42)

# Data Loading

In [9]:
# ====================================================
# Data Loading
# ====================================================
orig_data = pd.read_csv(INPUT_DIR+'train.csv')

if CFG.augment_identity_graph_data == 'neighbors':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_path_neighbors.csv')
elif CFG.augment_identity_graph_data == 'paths_mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored_w_identity_paths.csv')
elif CFG.augment_identity_graph_data == 'paths':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_paths_mirrored.csv')
elif CFG.augment_identity_graph_data == 'mirrored':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'all_mirrored.csv')
elif CFG.augment_identity_graph_data == 'identities':
    train = pd.read_csv(IDENTITY_MAPPINGS_DIR+'identity_mirrored.csv')
elif CFG.augment_identity_graph_data is None:
    train = orig_data
else:
    raise ValueError('CFG.augment_identity_graph_data = {} not recognized!'.format(CFG.augment_identity_graph_data))

if CFG.chem_comp_train_aug:
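    # reindex() without arguments is a no-op; reset the index so that the positional
    # indices used in the CV split below match the row labels
    train = chem_lookup.pre_augment_chem_formulae(train).reset_index(drop=True)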
    
display(train)
    
if CFG.train_frac:
    # to get a fair estimate we always use a fraction of the original data
    n = int(CFG.train_frac * len(orig_data))
    train = train.sample(n=n, replace=False, ignore_index=True)
    
test = pd.read_csv(INPUT_DIR+'test.csv')
submission = pd.read_csv(INPUT_DIR+'sample_submission.csv')
print(f"train.shape: {train.shape}")
print(f"test.shape: {test.shape}")
print(f"submission.shape: {submission.shape}")
# display(train.head())
# display(test.head())
# display(submission.head())

|       | id               | anchor       | target                 | context | score |
|------:|:-----------------|:-------------|:-----------------------|:--------|------:|
| 0     | 37d61fd2272659b1 | abatement    | abatement of pollution | A47     | 0.50  |
| 1     | 7b9652b17b68b7a4 | abatement    | act of abating         | A47     | 0.75  |
| 2     | 36d72442aefd8232 | abatement    | active catalyst        | A47     | 0.25  |
| 3     | 5296b0c19e1ce60e | abatement    | eliminating process    | A47     | 0.50  |
| 4     | 54c1e3b9184cb5b6 | abatement    | forest region          | A47     | 0.00  |
| ...   | ...              | ...          | ...                    | ...     | ...   |
| 36468 | 8e1386cbefd7f245 | wood article | wooden article         | B44     | 1.00  |
| 36469 | 42d9e032d1cd3242 | wood article | wooden box             | B44     | 0.50  |
| 36470 | 208654ccb9e14fa3 | wood article | wooden handle          | B44     | 0.50  |
| 36471 | 756ec035e694722b | wood article | wooden material        | B44     | 0.75  |

train.shape: (7294, 5)
test.shape: (36, 4)
submission.shape: (36, 2)


In [10]:
# Sanity check for the cell below: count duplicated ids after chemical formula augmentation
sum(chem_lookup.pre_augment_chem_formulae(train)['id'].duplicated())

10

# Pre-processing

In [11]:
# Add augmented indicator
# I'm sorry for this dirty hack
if 'augmented' not in train.columns:
    train['augmented'] = train['id'].str.contains('_')
else:
    train['augmented'] = train['id'].str.contains('_') | train['augmented']
    
print(sum(train['augmented']), 'augmented samples')

0 augmented samples


In [12]:
# ====================================================
# CPC Data
# ====================================================
def get_cpc_texts():
    # Collect all context codes (e.g. 'A47') from the CPC scheme file names
    contexts = []
    pattern = r'[A-Z]\d+'
    for file_name in os.listdir('../input/cpc-data/CPCSchemeXML202105'):
        result = re.findall(pattern, file_name)
        if result:
            contexts.append(result)
    contexts = sorted(set(sum(contexts, [])))
    results = {}
    for cpc in ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'Y']:
        with open(f'../input/cpc-data/CPCTitleList202202/cpc-section-{cpc}_20220201.txt') as f:
            s = f.read()
        # Capture the title that follows the code and two tabs. A capture group is used
        # instead of str.lstrip, which strips a set of characters rather than a prefix
        # and can therefore remove leading letters of the title.
        cpc_result = re.findall(f'{cpc}\t\t(.+)', s)[0]
        for context in [c for c in contexts if c[0] == cpc]:
            result = re.findall(f'{context}\t\t(.+)', s)
            # Section title followed by the title of the specific context
            results[context] = cpc_result + ". " + result[0]
    return results


cpc_texts = get_cpc_texts()
torch.save(cpc_texts, OUTPUT_DIR+"cpc_texts.pth")
train['context_text'] = train['context'].map(cpc_texts)
test['context_text'] = test['context'].map(cpc_texts)
# display(train.head())
# display(test.head())

## Traditional NLP Preprocessing

In [15]:
train_old = train.copy()

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def stem_phrase(phrase: str):
    tokens = word_tokenize(phrase)
    return " ".join([stemmer.stem(token) for token in tokens])
        
def lemmatize_phrase(phrase: str):
    tokens = word_tokenize(phrase)
    return " ".join([lemmatizer.lemmatize(token) for token in tokens])

def rem_stop_words(phrase: str):
    tokens = word_tokenize(phrase)
    return " ".join([t for t in tokens if t not in stop_words])

def rem_spec_chr(phrase: str):
    return re.sub(r'[^A-Za-z0-9\s]+', "", phrase)
    
if CFG.stemming:
    train['anchor'] = train['anchor'].apply(stem_phrase)
    train['target'] = train['target'].apply(stem_phrase)
    test['anchor'] = test['anchor'].apply(stem_phrase)
    test['target'] = test['target'].apply(stem_phrase)
    
    
if CFG.lemmatizing:
    train['anchor'] = train['anchor'].apply(lemmatize_phrase)
    train['target'] = train['target'].apply(lemmatize_phrase)
    test['anchor'] = test['anchor'].apply(lemmatize_phrase)
    test['target'] = test['target'].apply(lemmatize_phrase)
    

#print("Nr. special characters: ", sum(train['anchor'].apply(lambda p: len(re.findall(r'[^A-Za-z0-9\s]+', p)) > 0)))

if CFG.spec_chr_rem:
    train['anchor'] = train['anchor'].apply(rem_spec_chr)
    train['target'] = train['target'].apply(rem_spec_chr)
    train['context_text'] = train['context_text'].apply(rem_spec_chr)
    test['anchor'] = test['anchor'].apply(rem_spec_chr)
    test['target'] = test['target'].apply(rem_spec_chr)
    test['context_text'] = test['context_text'].apply(rem_spec_chr)

# CV split

In [None]:
# ====================================================
# CV split
# ====================================================
train['score_map'] = train['score'].map({0.00: 0, 0.25: 1, 0.50: 2, 0.75: 3, 1.00: 4})
Fold = StratifiedKFold(n_splits=CFG.n_fold, shuffle=True, random_state=CFG.seed)
for n, (train_index, val_index) in enumerate(Fold.split(train, train['score_map'])):
    train.loc[val_index, 'fold'] = int(n)
train['fold'] = train['fold'].astype(int)
# display(train.groupby('fold').size())

In [None]:
def get_sec_toks(df):
    return '[' + df['context'].str[0] + ']'

if CFG.prompt_engineering == 'custom_tok':
    train['text'] = get_sec_toks(train) + train['context_text'] + ' [SEP] '+ train['anchor'] + CFG.custom_sep_token + train['target']
    test['text'] = get_sec_toks(test) + test['context_text'] + ' [SEP] ' + test['anchor'] + CFG.custom_sep_token + test['target']
elif CFG.prompt_engineering == 'ctx_txt':
    train['text'] = train['anchor'] + ' [SEP] ' + train['target'] + ' [SEP] '  + train['context_text']
    test['text'] = test['anchor'] + ' [SEP] ' + test['target'] + ' [SEP] '  + test['context_text']
elif CFG.prompt_engineering is None:
    train['text'] = train['anchor'] + ' [SEP] ' + train['target'] + ' [SEP] '  + train['context']
    test['text'] = test['anchor'] + ' [SEP] ' + test['target'] + ' [SEP] '  + test['context']
else:
    raise ValueError('CFG.prompt_engineering = {} not recognized!'.format(CFG.prompt_engineering))
    

print(train['text'][0])
display(train['text'].head())
display(test['text'].head())

# tokenizer

In [None]:
# ====================================================
# tokenizer
# ====================================================
tokenizer = AutoTokenizer.from_pretrained(CFG.model)


# add special tokens for sections
cpc_sections = [
    'A', # Human Necessities
    'B', # Operations and Transport
    'C', # Chemistry and Metallurgy
    'D', # Textiles
    'E', # Fixed Constructions
    'F', # Mechanical Engineering
    'G', # Physics
    'H', # Electricity
    'Y' # Emerging Cross-Sectional Technologies
]
if CFG.prompt_engineering == 'custom_tok':
    tokenizer.add_special_tokens({'additional_special_tokens': ['['+  s + ']' for s in cpc_sections]})
    print(tokenizer.all_special_tokens)
    
tokenizer.save_pretrained(OUTPUT_DIR+'tokenizer/')
CFG.tokenizer = tokenizer

# Dataset

In [None]:
# ====================================================
# Define max_len
# ====================================================
lengths_dict = {}

lengths = []
tk0 = tqdm(cpc_texts.values(), total=len(cpc_texts))
for text in tk0:
    length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
    lengths.append(length)
lengths_dict['context_text'] = lengths

for text_col in ['anchor', 'target']:
    lengths = []
    tk0 = tqdm(train[text_col].fillna("").values, total=len(train))
    for text in tk0:
        length = len(tokenizer(text, add_special_tokens=False)['input_ids'])
        lengths.append(length)
    lengths_dict[text_col] = lengths
    
CFG.max_len = max(lengths_dict['anchor']) + max(lengths_dict['target'])\
                + max(lengths_dict['context_text']) + 4 # CLS + SEP + SEP + SEP
LOGGER.info(f"max_len: {CFG.max_len}")

In [None]:
# ====================================================
# Dataset
# ====================================================
def prepare_input(cfg, text):
    inputs = cfg.tokenizer(text,
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs

def prepare_labels(cfg, label):
    if cfg.classification:
        label_onehot = [0 for _ in range(cfg.target_size)]
        label_onehot[cfg.map_score[label]] = 1 
        return torch.tensor(label_onehot, dtype=torch.float)
    elif cfg.ordinal:
        label_ordinal = [1 if i <= cfg.map_score[label] else 0 for i in range(cfg.target_size)]
        return torch.tensor(label_ordinal, dtype=torch.float)
    else:
        return torch.tensor(label, dtype=torch.float)

class TrainDataset(Dataset):
    def __init__(self, cfg, df, chem_lookup=None):
        self.cfg = cfg
        if cfg.chem_comp_pred_avg and chem_lookup:
            df = chem_lookup.post_augment_chem_formulae(df)
        self.texts = df['text'].values
        self.labels = df['score'].values
        self.ids = df['id'].values

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, self.texts[item])
        label = prepare_labels(self.cfg, self.labels[item])
        return inputs, label, self.ids[item]

# Model

In [None]:
# ====================================================
# Model
# ====================================================
class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, self.cfg.target_size)
        self._init_weights(self.fc)
        self.attention = nn.Sequential(
            nn.Linear(self.config.hidden_size, 512),
            nn.Tanh(),
            nn.Linear(512, 1),
            nn.Softmax(dim=1)
        )
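        # nn.Sequential matches none of the isinstance checks in _init_weights, so calling
        # it on the container directly is a no-op; apply() visits each submodule instead.
        self.attention.apply(self._init_weights)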
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        # feature = torch.mean(last_hidden_states, 1)
        weights = self.attention(last_hidden_states)
        feature = torch.sum(weights * last_hidden_states, dim=1)
        return feature

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        return output
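
To make the pooling step in `CustomModel.feature` easier to follow, here is a minimal NumPy sketch of the same computation: a small two-layer scorer produces one weight per token, the weights are normalised with a softmax over the sequence dimension, and the hidden states are summed with those weights. The shapes and random parameters are made up for illustration. As in the model above, no attention mask is applied, so padding positions also receive weight.

```python
import numpy as np


def attention_pool(hidden, w1, b1, w2, b2):
    """Attention pooling over the sequence axis.

    hidden: array of shape (batch, seq_len, hidden_size)
    """
    scores = np.tanh(hidden @ w1 + b1) @ w2 + b2             # (batch, seq_len, 1)
    scores = scores - scores.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)   # softmax over seq_len
    pooled = (weights * hidden).sum(axis=1)                  # (batch, hidden_size)
    return pooled, weights


rng = np.random.default_rng(0)
hidden = rng.normal(size=(2, 7, 16))        # toy batch: 2 sequences, 7 tokens, 16 dims
w1 = rng.normal(scale=0.02, size=(16, 8))
b1 = np.zeros(8)
w2 = rng.normal(scale=0.02, size=(8, 1))
b2 = np.zeros(1)

pooled, weights = attention_pool(hidden, w1, b1, w2, b2)
```

Compared with mean pooling (the commented-out alternative in `feature`), this lets the model learn which tokens matter most for the final prediction.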

# Helper functions

In [None]:
# ====================================================
# Helper functions
# ====================================================
class AverageMeter(object):
    """Computes and stores the average and current value"""
    def __init__(self):
        self.reset()

    def reset(self):
        self.val = 0
        self.avg = 0
        self.sum = 0
        self.count = 0

    def update(self, val, n=1):
        self.val = val
        self.sum += val * n
        self.count += n
        self.avg = self.sum / self.count


def asMinutes(s):
    m = math.floor(s / 60)
    s -= m * 60
    return '%dm %ds' % (m, s)


def timeSince(since, percent):
    now = time.time()
    s = now - since
    es = s / (percent)
    rs = es - s
    return '%s (remain %s)' % (asMinutes(s), asMinutes(rs))

def ordinal_regression(predictions, targets):
    """Ordinal regression with encoding as in https://arxiv.org/pdf/0704.1028.pdf"""
    return nn.MSELoss(reduction='mean')(predictions, targets)

def average_by_id(df):
    ''' Averages a dataframe by a column id'''
    orig_id_order = df['id']
    unordered_means = df.groupby('id').mean().reset_index()
    return unordered_means.set_index('id').loc[orig_id_order].reset_index().drop_duplicates()


def train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device):
    model.train()
    scaler = torch.cuda.amp.GradScaler(enabled=CFG.apex)
    losses = AverageMeter()
    start = end = time.time()
    global_step = 0
    grad_norm = 0.0  # placeholder until the first optimizer step
    for step, (inputs, labels, _) in enumerate(train_loader):
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        # autocast is skipped for the ordinal model: it raised a half-vs-float dtype error (possibly a bug)
        if CFG.ordinal:
            y_preds = model(inputs)
        else:
            with torch.cuda.amp.autocast(enabled=CFG.apex):
                y_preds = model(inputs)
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels)
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        scaler.scale(loss).backward()
        if (step + 1) % CFG.gradient_accumulation_steps == 0:
            # unscale first so max_grad_norm is applied to the true gradients
            scaler.unscale_(optimizer)
            grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), CFG.max_grad_norm)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()
            global_step += 1
            if CFG.batch_scheduler:
                scheduler.step()
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(train_loader)-1):
            print('Epoch: [{0}][{1}/{2}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  'Grad: {grad_norm:.4f}  '
                  'LR: {lr:.8f}  '
                  .format(epoch+1, step, len(train_loader), 
                          remain=timeSince(start, float(step+1)/len(train_loader)),
                          loss=losses,
                          grad_norm=grad_norm,
                          lr=scheduler.get_last_lr()[0]))
    return losses.avg


def valid_fn(valid_loader, model, criterion, device):
    losses = AverageMeter()
    model.eval()
    preds = []
    ids = []
    start = end = time.time()
    for step, (inputs, labels, sample_id) in enumerate(valid_loader):
        ids.append(sample_id)
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        labels = labels.to(device)
        batch_size = labels.size(0)
        with torch.no_grad():
            y_preds = model(inputs)
            
        if CFG.classification:
            loss = criterion(y_preds, torch.argmax(labels, 1))
        elif CFG.ordinal:
            loss = criterion(y_preds, labels)
        else:
            loss = criterion(y_preds.view(-1, 1), labels.view(-1, 1))
        if CFG.gradient_accumulation_steps > 1:
            loss = loss / CFG.gradient_accumulation_steps
        losses.update(loss.item(), batch_size)
        if CFG.classification or CFG.ordinal:
            preds.append(y_preds.to('cpu').numpy())
        else:
            preds.append(y_preds.sigmoid().to('cpu').numpy())
        end = time.time()
        if step % CFG.print_freq == 0 or step == (len(valid_loader)-1):
            print('EVAL: [{0}/{1}] '
                  'Elapsed {remain:s} '
                  'Loss: {loss.val:.4f}({loss.avg:.4f}) '
                  .format(step, len(valid_loader),
                          loss=losses,
                          remain=timeSince(start, float(step+1)/len(valid_loader))))
    predictions = np.concatenate(preds)
    ids = np.concatenate(ids)
    print("Sanity check, duplicated ids:", int(pd.Series(ids).duplicated().sum()))
    if CFG.classification:
        predictions = np.argmax(predictions, axis=1)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    elif CFG.ordinal:
        predictions = (predictions > 0.5).cumprod(axis=1).sum(axis=1) - 1
        predictions = np.clip(predictions, 0, None)
        predictions = np.array([CFG.map_labels[p] for p in predictions])
    else:
        predictions = predictions.reshape(-1)  # flatten (N, 1) sigmoid outputs to (N,)
        if CFG.clipping:
            predictions = np.clip(predictions, 0, 1)
        if CFG.minmax:
            predictions = minmax_scale(predictions, feature_range=(0, 1))
            
    if CFG.chem_comp_pred_avg:
        pred_new = average_by_id(pd.DataFrame({'pred': predictions, 'id': ids}))
        predictions = pred_new['pred'].to_numpy()
    return losses.avg, predictions
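
The ordinal branch of `valid_fn` turns a vector of per-threshold outputs into a class index by counting the unbroken run of values above 0.5. Below is a minimal sketch of that round trip. The encoder is an assumption, since the dataset code is not part of this cell: class `k` is taken to map to `k + 1` leading ones, which is what the `- 1` in the decoder implies. The names `encode_ordinal` and `decode_ordinal` are hypothetical.

```python
import numpy as np

def encode_ordinal(class_idx, n_classes):
    """Class k -> k+1 leading ones, e.g. k=2 of 5 -> [1, 1, 1, 0, 0] (assumed encoding)."""
    return (np.arange(n_classes) <= class_idx).astype(np.float32)

def decode_ordinal(outputs):
    """Count the unbroken run of outputs above 0.5, mirroring the rule in valid_fn."""
    leading_ones = (outputs > 0.5).cumprod(axis=1).sum(axis=1)
    # an output with no leading one would give -1, so clip it to class 0
    return np.clip(leading_ones - 1, 0, None)

targets = np.stack([encode_ordinal(k, 5) for k in range(5)])
decoded = decode_ordinal(targets)
# the 0.4 breaks the run, so the later 0.9 is ignored
noisy = decode_ordinal(np.array([[0.9, 0.8, 0.4, 0.9, 0.1]]))
```

The `cumprod` is what makes the decoder ignore any confident output that comes after the first value below the threshold.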


In [None]:
# ====================================================
# train loop
# ====================================================
def train_loop(folds, fold):
    
    LOGGER.info(f"========== fold: {fold} training ==========")

    # ====================================================
    # loader
    # ====================================================
    train_folds = folds[folds['fold'] != fold].reset_index(drop=True)
    valid_folds = folds[folds['fold'] == fold].reset_index(drop=True)
    if CFG.augment_identity_graph_data is not None and CFG.validate_on_original:
        valid_folds = valid_folds[valid_folds['augmented'] == False].reset_index(drop=True)

    valid_labels = valid_folds['score'].values
    
    train_dataset = TrainDataset(CFG, train_folds)
    # we only want the prediction using chemical synonyms for the validation
    valid_dataset = TrainDataset(CFG, valid_folds, chem_lookup)

    train_loader = DataLoader(train_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=True,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=True)
    valid_loader = DataLoader(valid_dataset,
                              batch_size=CFG.batch_size,
                              shuffle=False,
                              num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

    # ====================================================
    # model & optimizer
    # ====================================================
    model = CustomModel(CFG, config_path=None, pretrained=True)
    torch.save(model.config, OUTPUT_DIR+'config.pth')
    model.to(device)
    
    def get_optimizer_params(model, encoder_lr, decoder_lr, weight_decay=0.0):
        no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
        optimizer_parameters = [
            {'params': [p for n, p in model.model.named_parameters() if not any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': weight_decay},
            {'params': [p for n, p in model.model.named_parameters() if any(nd in n for nd in no_decay)],
             'lr': encoder_lr, 'weight_decay': 0.0},
            {'params': [p for n, p in model.named_parameters() if "model" not in n],
             'lr': decoder_lr, 'weight_decay': 0.0}
        ]
        return optimizer_parameters

    optimizer_parameters = get_optimizer_params(model,
                                                encoder_lr=CFG.encoder_lr, 
                                                decoder_lr=CFG.decoder_lr,
                                                weight_decay=CFG.weight_decay)
    optimizer = AdamW(optimizer_parameters, lr=CFG.encoder_lr, eps=CFG.eps, betas=CFG.betas)
    
    # ====================================================
    # scheduler
    # ====================================================
    def get_scheduler(cfg, optimizer, num_train_steps):
        if cfg.scheduler == 'linear':
            scheduler = get_linear_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps
            )
        elif cfg.scheduler == 'cosine':
            scheduler = get_cosine_schedule_with_warmup(
                optimizer, num_warmup_steps=cfg.num_warmup_steps, num_training_steps=num_train_steps, num_cycles=cfg.num_cycles
            )
        else:
            raise ValueError(f"Unknown scheduler: {cfg.scheduler!r}")
        return scheduler
    
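    # one scheduler step per optimizer step: drop_last and gradient accumulation both reduce the count
    num_train_steps = (len(train_loader) // CFG.gradient_accumulation_steps) * CFG.epochs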
    scheduler = get_scheduler(CFG, optimizer, num_train_steps)

    # ====================================================
    # loop
    # ====================================================
    if CFG.classification:
        criterion = nn.CrossEntropyLoss()
    elif CFG.ordinal:
        criterion = ordinal_regression
    else:
        criterion = nn.BCEWithLogitsLoss(reduction="mean")
    
    best_score = float('-inf')  # so the first finite score is always saved as a checkpoint

    for epoch in range(CFG.epochs):

        start_time = time.time()

        # train
        avg_loss = train_fn(fold, train_loader, model, criterion, optimizer, epoch, scheduler, device)

        # eval
        avg_val_loss, predictions = valid_fn(valid_loader, model, criterion, device)
        
        # scoring
        score = get_score(valid_labels, predictions)

        elapsed = time.time() - start_time

        LOGGER.info(f'Epoch {epoch+1} - avg_train_loss: {avg_loss:.4f}  avg_val_loss: {avg_val_loss:.4f}  time: {elapsed:.0f}s')
        LOGGER.info(f'Epoch {epoch+1} - Score: {score:.4f}')       
        if best_score < score:
            best_score = score
            LOGGER.info(f'Epoch {epoch+1} - Save Best Score: {best_score:.4f} Model')
            torch.save({'model': model.state_dict(),
                        'predictions': predictions},
                        OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth")

    predictions = torch.load(OUTPUT_DIR+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth", 
                             map_location=torch.device('cpu'))['predictions']
    valid_folds['pred'] = predictions

    torch.cuda.empty_cache()
    gc.collect()
    
    return valid_folds
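
The warmup and decay schedule is spread over the number of steps passed to `get_scheduler`, so that number should match how many optimizer steps the loop actually takes. Below is a rough sketch of the arithmetic with made-up sizes. `optimizer_steps` is a hypothetical helper, not part of the notebook, and it assumes the scheduler is stepped once per optimizer step.

```python
def optimizer_steps(n_samples, batch_size, accumulation_steps, epochs, drop_last=True):
    """Number of optimizer (and per-batch scheduler) steps over a whole run."""
    if drop_last:
        batches_per_epoch = n_samples // batch_size
    else:
        batches_per_epoch = -(-n_samples // batch_size)  # ceiling division
    # one optimizer step every `accumulation_steps` batches; leftover batches
    # at the end of an epoch do not trigger a step
    return (batches_per_epoch // accumulation_steps) * epochs

plain = optimizer_steps(1000, 32, accumulation_steps=1, epochs=3)
accumulated = optimizer_steps(1000, 32, accumulation_steps=4, epochs=3)
# an estimate from the sample count alone, ignoring accumulation
naive = int(1000 / 32 * 3)
```

Without accumulation the sample-count estimate and the exact count agree here. With four accumulation steps the exact count is far smaller, so a schedule sized from the sample count would only run through part of its decay.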

In [None]:
if __name__ == '__main__':
    
    def get_result(oof_df):
        labels = oof_df['score'].values
        preds = oof_df['pred'].values
        score = get_score(labels, preds)
        LOGGER.info(f'Score: {score:<.4f}')
    
    oof_df = pd.DataFrame()
    for fold in range(CFG.n_fold):
        if fold in CFG.trn_fold:
            _oof_df = train_loop(train, fold)
            oof_df = pd.concat([oof_df, _oof_df])
            LOGGER.info(f"========== fold: {fold} result ==========")
            get_result(_oof_df)
    oof_df = oof_df.reset_index(drop=True)
    LOGGER.info(f"========== CV ==========")
    get_result(oof_df)
    oof_df.to_csv(OUTPUT_DIR+'oof_df.csv')
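
The final CV number is computed once on the concatenated out-of-fold predictions, not as the mean of the per-fold scores. Below is a minimal sketch of that difference. It assumes `get_score` is a Pearson correlation, which this section does not show, and the labels and predictions are invented for illustration.

```python
import numpy as np

def pearson(labels, preds):
    """Pearson correlation; stands in for get_score, which is an assumption here."""
    return float(np.corrcoef(labels, preds)[0, 1])

# hypothetical out-of-fold labels and predictions for two folds
folds = {
    0: (np.array([0.0, 0.5, 1.0]), np.array([0.1, 0.4, 0.9])),
    1: (np.array([0.25, 0.75, 1.0]), np.array([0.3, 0.6, 0.8])),
}
fold_scores = {f: pearson(y, p) for f, (y, p) in folds.items()}

# the CV score pools every fold's rows before scoring, as get_result(oof_df) does
all_labels = np.concatenate([y for y, _ in folds.values()])
all_preds = np.concatenate([p for _, p in folds.values()])
cv_score = pearson(all_labels, all_preds)
mean_of_folds = float(np.mean(list(fold_scores.values())))
```

The pooled score and the mean of fold scores generally differ, because a correlation over all rows also reflects calibration differences between folds.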