# Introduction

Ensemble of the three public Deberta notebooks. Scores 0.884 on the leaderboard (scoring takes 5+ hours).

Please upvote the original notebooks:

**Deberta v3 large**

[inference](https://www.kaggle.com/code/lunapandachan/nbme-thanh-s-infer-add-test)

[original inference](https://www.kaggle.com/code/thanhns/deberta-v3-large-0-883-lb)

[model](https://www.kaggle.com/datasets/thanhns/deberta-v3-large-5-folds-public)

**Deberta v1 large**

[inference](https://www.kaggle.com/code/manojprabhaakr/nbme-deberta-large-baseline-inference)

[model](https://www.kaggle.com/datasets/manojprabhaakr/debertalarge)

**Deberta v1 base**

[train](https://www.kaggle.com/code/yasufuminakama/nbme-deberta-base-baseline-train)

[inference](https://www.kaggle.com/code/yasufuminakama/nbme-deberta-base-baseline-inference)

**What is going on:**
First I use the publicly available models to compute `predictions_v3_l`, `predictions_v1_l` and `predictions_v1_b`. These are lists of NumPy arrays. Each array corresponds to one patient note / feature number combination and holds, for every character position n, the probability that the n-th character of the patient note belongs to the feature.

Then, at the very end, I take these probabilities and, for each patient note + feature number pair, combine them in a simple linear combination:
```
predictions = []
for p1, p2, p3 in zip(predictions_v3_l, predictions_v1_l, predictions_v1_b):
    predictions.append(w1*p1 + w2*p2 + w3*p3)
```
I got the weights `w1`, `w2`, `w3` by experimenting with the out-of-fold results that come with each trained model (this is in a separate notebook, private at the moment).

With the final "probabilities" (they can now exceed one, so they are no longer true probabilities), I just get the results

```
results = get_results(predictions)
```

and save them.
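
As a minimal sketch of the blending step, the snippet below combines toy character-level probabilities for a single note. The arrays `p1`, `p2`, `p3` and the helper `blend` are made up for illustration; the weight values are the ones set in the Weights cell of this notebook. It also shows why the blended values are no longer probabilities: the weights sum to 1.1.

```python
import numpy as np

# Toy per-character probabilities from three hypothetical models for one 8-character note.
p1 = np.array([0.9, 0.8, 0.1, 0.0, 0.0, 0.7, 0.6, 0.1])
p2 = np.array([0.8, 0.9, 0.2, 0.1, 0.0, 0.4, 0.5, 0.0])
p3 = np.array([0.7, 0.6, 0.0, 0.0, 0.1, 0.2, 0.3, 0.0])
weights = [0.5, 0.4, 0.2]  # same values as w1, w2, w3 in this notebook


def blend(arrays, weights):
    """Weighted sum of equally shaped arrays (illustrative helper)."""
    return sum(w * a for w, a in zip(weights, arrays))


blended = blend([p1, p2, p3], weights)
selected = blended >= 0.5  # same default threshold as get_results

# If every model is fully confident, the blend exceeds one.
upper_bound = blend([np.ones(1)] * 3, weights)[0]
print(selected.astype(int), upper_bound)
```

Characters 0, 1, 5 and 6 end up selected here, while the upper bound of the blend is about 1.1 rather than 1.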

**Why it takes five hours:**
The notebook evaluates all three models on the full test dataset (~2000 patient notes) and then combines the results. The large Debertas take about 2 hours each; the base one takes about 1 hour. This is because the models are huge and the Kaggle GPU has limited memory, so you need to use small batch sizes. It may be possible to increase the batch sizes a bit (by roughly a factor of two), which would speed things up a little, but I did not look into it. When you run the notebook interactively, the models are only evaluated on the five notes in "test.csv", so it runs much quicker.

# Weights

In [None]:
w1 = 0.5     # Deberta v3 large  # 0.5
w2 = 0.4      # Deberta v3 large - 5fold
w3 = 0.2     # Deberta v1 (large or base)  # 0.4
# w4 = 0.1     # Deberta v1 base   # 0.18

# Imports

In [None]:
# The following is necessary if you want to use the fast tokenizer for deberta v2 or v3
# This must be done before importing transformers
import shutil
from pathlib import Path

transformers_path = Path("/opt/conda/lib/python3.7/site-packages/transformers")

input_dir = Path("../input/deberta-v2-3-fast-tokenizer")

convert_file = input_dir / "convert_slow_tokenizer.py"
conversion_path = transformers_path/convert_file.name

if conversion_path.exists():
    conversion_path.unlink()

shutil.copy(convert_file, transformers_path)
deberta_v2_path = transformers_path / "models" / "deberta_v2"

for filename in ['tokenization_deberta_v2.py', 'tokenization_deberta_v2_fast.py']:
    filepath = deberta_v2_path/filename
    
    if filepath.exists():
        filepath.unlink()

    shutil.copy(input_dir/filename, filepath)

In [None]:
import os
import gc
import ast
import sys
import copy
import json
import math
import string
import pickle
import random
import itertools
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
from tqdm.auto import tqdm
from sklearn.metrics import f1_score

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset

import tokenizers
import transformers
from transformers import AutoTokenizer, AutoModel, AutoConfig
%env TOKENIZERS_PARALLELISM=true

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(device)

# Seed

In [None]:
def seed_everything(seed=42):
    '''
    Sets the seed of the entire notebook so results are the same every time we run.
    This is for REPRODUCIBILITY.
    '''
    random.seed(seed)
    # Set a fixed value for the hash seed
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # When running on the CuDNN backend, two further options must be set
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

seed_everything(seed=42)

# Helper functions for scoring

In [None]:
def micro_f1(preds, truths):
    """
    Micro f1 on binary arrays.

    Args:
        preds (list of lists of ints): Predictions.
        truths (list of lists of ints): Ground truths.

    Returns:
        float: f1 score.
    """
    # Micro : aggregating over all instances
    preds = np.concatenate(preds)
    truths = np.concatenate(truths)
    
    return f1_score(truths, preds)


def spans_to_binary(spans, length=None):
    """
    Converts spans to a binary array indicating whether each character is in the span.

    Args:
        spans (list of lists of two ints): Spans.
        length (int, optional): Length of the output array. Defaults to the largest index in spans.

    Returns:
        np array [length]: Binarized spans.
    """
    length = np.max(spans) if length is None else length
    binary = np.zeros(length)
    for start, end in spans:
        binary[start:end] = 1
        
    return binary


def span_micro_f1(preds, truths):
    """
    Micro f1 on spans.

    Args:
        preds (list of lists of two ints): Prediction spans.
        truths (list of lists of two ints): Ground truth spans.

    Returns:
        float: f1 score.
    """
    bin_preds = []
    bin_truths = []
    for pred, truth in zip(preds, truths):
        if not len(pred) and not len(truth):
            continue
        length = max(np.max(pred) if len(pred) else 0, np.max(truth) if len(truth) else 0)
        bin_preds.append(spans_to_binary(pred, length))
        bin_truths.append(spans_to_binary(truth, length))
        
    return micro_f1(bin_preds, bin_truths)
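
To make the span-level metric concrete, here is a small worked example. It is only a sketch: the spans are invented, and the F1 is computed by hand with NumPy instead of `sklearn.metrics.f1_score`, so the true positive, false positive and false negative counts are visible. `spans_to_mask` is a hypothetical stand-in for `spans_to_binary` with an explicit length.

```python
import numpy as np


def spans_to_mask(spans, length):
    """Binary mask with ones on every character covered by a [start, end) span."""
    mask = np.zeros(length, dtype=int)
    for start, end in spans:
        mask[start:end] = 1
    return mask


pred_spans = [[0, 5]]    # hypothetical prediction
truth_spans = [[2, 7]]   # hypothetical annotation
length = 7

pred = spans_to_mask(pred_spans, length)
truth = spans_to_mask(truth_spans, length)

tp = int(np.sum((pred == 1) & (truth == 1)))  # characters 2, 3, 4
fp = int(np.sum((pred == 1) & (truth == 0)))  # characters 0, 1
fn = int(np.sum((pred == 0) & (truth == 1)))  # characters 5, 6
f1 = 2 * tp / (2 * tp + fp + fn)
print(tp, fp, fn, f1)  # 3 2 2 0.6
```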

In [None]:
def create_labels_for_scoring(df):
    # example: ['0 1', '3 4'] -> ['0 1;3 4']
    df['location_for_create_labels'] = [ast.literal_eval(f'[]')] * len(df)
    for i in range(len(df)):
        lst = df.loc[i, 'location']
        if lst:
            new_lst = ';'.join(lst)
            df.loc[i, 'location_for_create_labels'] = ast.literal_eval(f'[["{new_lst}"]]')
    # create labels
    truths = []
    for location_list in df['location_for_create_labels'].values:
        truth = []
        if len(location_list) > 0:
            location = location_list[0]
            for loc in [s.split() for s in location.split(';')]:
                start, end = int(loc[0]), int(loc[1])
                truth.append([start, end])
        truths.append(truth)
        
    return truths


def get_char_probs(texts, predictions, tokenizer):
    results = [np.zeros(len(t)) for t in texts]
    for i, (text, prediction) in enumerate(zip(texts, predictions)):
        encoded = tokenizer(text, 
                            add_special_tokens=True,
                            return_offsets_mapping=True)
        for idx, (offset_mapping, pred) in enumerate(zip(encoded['offset_mapping'], prediction)):
            start = offset_mapping[0]
            end = offset_mapping[1]
            results[i][start:end] = pred
            
    return results


def get_results(char_probs, th=0.5):
    results = []
    for char_prob in char_probs:
        result = np.where(char_prob >= th)[0] + 1
        result = [list(g) for _, g in itertools.groupby(result, key=lambda n, c=itertools.count(): n - next(c))]
        result = [f"{min(r)} {max(r)}" for r in result]
        result = ";".join(result)
        results.append(result)
        
    return results


def get_predictions(results):
    predictions = []
    for result in results:
        prediction = []
        if result != "":
            for loc in [s.split() for s in result.split(';')]:
                start, end = int(loc[0]), int(loc[1])
                prediction.append([start, end])
        predictions.append(prediction)
        
    return predictions


def get_score(y_true, y_pred):
    return span_micro_f1(y_true, y_pred)
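
The round trip from character probabilities to a location string and back to spans can be sketched as follows. The two functions are condensed, single-note copies of the `get_results` and `get_predictions` logic, renamed so the snippet runs on its own; the input values are toy numbers. Note that the notebook shifts the selected indices by one before grouping them into runs.

```python
import itertools
import numpy as np


def probs_to_location_string(char_prob, th=0.5):
    """Condensed copy of the get_results logic for a single note."""
    idx = np.where(char_prob >= th)[0] + 1  # the notebook shifts indices by one
    counter = itertools.count()
    # Consecutive indices share the same value of (index - running counter).
    runs = [list(g) for _, g in itertools.groupby(idx, key=lambda n: n - next(counter))]
    return ";".join(f"{min(r)} {max(r)}" for r in runs)


def location_string_to_spans(result):
    """Condensed copy of the get_predictions logic for a single note."""
    if result == "":
        return []
    return [[int(a), int(b)] for a, b in (s.split() for s in result.split(";"))]


char_prob = np.array([0.9, 0.9, 0.1, 0.8])  # toy values
location = probs_to_location_string(char_prob)
spans = location_string_to_spans(location)
print(location, spans)  # 1 2;4 4 [[1, 2], [4, 4]]
```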

# Post-processing and ensemble tools

In [None]:
def _trim_space_start_or_end(location_string):
    trim_location_string = location_string
    if not location_string != location_string:
        if location_string[0] == ' ' and location_string[-1] == ' ':
            trim_location_string = location_string[1:]
            trim_location_string = trim_location_string[:-1]
        elif location_string[0] == ' ':
            trim_location_string = location_string[1:]
        elif location_string[-1] == ' ':
            trim_location_string = location_string[:-1]
    return trim_location_string


def _get_start_end_from_location_string(base_location, other_location):
    base_location = _trim_space_start_or_end(base_location)
    other_location = _trim_space_start_or_end(other_location)
    base_start = int(base_location.split(' ')[0])
    base_end = int(base_location.split(' ')[1])
    other_start = int(other_location.split(' ')[0])
    other_end = int(other_location.split(' ')[1])
    return base_start, base_end, other_start, other_end


def _get_location_difference(base_location, other_location):
    base_start, base_end, other_start, other_end = _get_start_end_from_location_string(base_location, other_location)
    start_difference = abs(base_start - other_start)
    end_difference = abs(base_end - other_end)
    mid_difference_base = abs(base_end - other_start)
    mid_difference_other = abs(other_end - base_start)
    return start_difference, end_difference, mid_difference_base, mid_difference_other

# Weighted Box Fusions

In [None]:
def _get_box_fusion(base_location, other_location):
    base_start, base_end, other_start, other_end = _get_start_end_from_location_string(base_location, other_location)
    mean_start = int((base_start + other_start) / 2)
    mean_end = int((base_end + other_end) / 2)
    box_fusion = f'{mean_start} {mean_end}'
    return box_fusion


def _get_matrix_cross_difference(from_location_list):
    len_location_split = len(from_location_list)
    matrix_cross_location = np.ones((len_location_split, len_location_split), dtype='int') * 999
    for index_i in range(len_location_split):
        for index_j in range(index_i+1, len_location_split):
            location_difference = _get_location_difference(from_location_list[index_i], from_location_list[index_j])
            matrix_cross_location[index_i, index_j] = min(location_difference[2], location_difference[3])
    return matrix_cross_location


def _get_matrix_start_position_difference(from_location_list):
    len_location_split = len(from_location_list)
    matrix_start_position_location = np.ones((len_location_split, len_location_split), dtype='int') * 999
    for index_i in range(len_location_split):
        for index_j in range(index_i+1, len_location_split):
            location_difference = _get_location_difference(from_location_list[index_i], from_location_list[index_j])
            matrix_start_position_location[index_i, index_j] = min(location_difference[0], location_difference[1])
    return matrix_start_position_location

def _order_by_start_position(location_list):
    tuple_list = [(int(element.split(' ')[0]), int(element.split(' ')[1])) for element in location_list]
    sorted_tuple_list = sorted(tuple_list, key=lambda t: t[0])
    ordered_location_list = [f"{element[0]} {element[1]}" for element in sorted_tuple_list]
    return ordered_location_list


def _matrix_postprocess_location(location_string, matrix_usage='cross'):
    postprocess_location = location_string
    if location_string == '':
        postprocess_location = float('NaN')
    elif not location_string != location_string:
        location_split = location_string.split(';')
        location_split = [_trim_space_start_or_end(element) for element in location_split]
        location_split = list(set(location_split))
        location_split = _order_by_start_position(location_split)
        if len(location_split) > 1:
            if matrix_usage == 'cross':
                matrix_location = _get_matrix_cross_difference(location_split)
                where_ones = np.where(matrix_location <= 2)
            elif matrix_usage == 'start':
                matrix_location = _get_matrix_start_position_difference(location_split)
                where_ones = np.where(matrix_location <= 5)

            array_ones_size = where_ones[0].size
            if array_ones_size > 0:
                sub_location_split = []
                for index in range(array_ones_size):
                    index_i = where_ones[0][index]
                    index_j = where_ones[1][index]
                    start_end_location = _get_start_end_from_location_string(location_split[index_i], 
                                                                             location_split[index_j])
                    min_start = min(start_end_location[0], start_end_location[2])  
                    max_end = max(start_end_location[1], start_end_location[3])
                    sub_location_split.append(f'{min_start} {max_end}')
                
                processed_index = list(set(list(np.unique(where_ones[0])) + list(np.unique(where_ones[1]))))
                not_processed_index = [index_raw for index_raw in range(len(location_split)) if index_raw not in processed_index]
                for index_raw in not_processed_index:
                    sub_location_split.append(location_split[index_raw])
                sub_location = ';'.join(sub_location_split)
                postprocess_location = _matrix_postprocess_location(sub_location, matrix_usage=matrix_usage)
            else:
                postprocess_location = ';'.join(location_split)
    return postprocess_location

In [None]:
def _postprocess_location(location_string):
    cross_matrix_postprocess = _matrix_postprocess_location(location_string)
    postprocess_location = _matrix_postprocess_location(cross_matrix_postprocess, matrix_usage='start')
    return postprocess_location
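
The idea behind the post-processing, merging spans that touch or nearly touch, can be illustrated with a much simpler interval merge. This is a sketch and not the matrix-based routine above: `merge_close_spans` and its `max_gap` parameter are hypothetical, and unlike the original it also merges spans that overlap heavily.

```python
def merge_close_spans(location_string, max_gap=2):
    """Merge spans whose gap to the previous span is at most max_gap characters."""
    spans = sorted(
        tuple(int(x) for x in part.split())
        for part in location_string.split(";")
    )
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start - merged[-1][1] <= max_gap:
            # Close enough: extend the previous span instead of starting a new one.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return ";".join(f"{s} {e}" for s, e in merged)


print(merge_close_spans("0 4;5 8;20 25"))  # 0 8;20 25
```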

In [None]:
def _custom_weighted_box_fusion(base_location, other_location, treshold:int = 5):
    # Check if NaN (works with string and float('NaN') values)
    isnan_base = base_location != base_location
    isnan_other = other_location != other_location
    # If base is NaN and other is not, fill base value with other's
    if isnan_base and not isnan_other:
        wbf_location = other_location
    elif not isnan_other and not isnan_base:
        if base_location != other_location:
            wbf_location = ''
            base_location_split = base_location.split(";")
            other_location_split = other_location.split(';')
            set_location_split = list(set(base_location_split + other_location_split))
            first_set_location = set_location_split.pop(0)
            box_fusion_used = False
            for set_location_string in set_location_split:
                start_difference, end_difference, mid_difference_base, mid_difference_other = _get_location_difference(first_set_location, 
                                                                                                                       set_location_string)
                if start_difference <= treshold and end_difference <= treshold:
                    if wbf_location != '' and wbf_location[-1] != ';':
                        wbf_location += ';'
                    wbf_location += _get_box_fusion(first_set_location, set_location_string)
                    wbf_location += ';'
                    box_fusion_used = True
                else:
                    if wbf_location != '' and wbf_location[-1] != ';':
                        wbf_location += ';'
                    wbf_location += f'{first_set_location};{set_location_string}'

            if box_fusion_used:
                if wbf_location[-1] == ';':
                    wbf_location = wbf_location[:-1]
        else:
            wbf_location = base_location
    else:
        wbf_location = base_location
    
    return_wbf = _postprocess_location(wbf_location)     
    return return_wbf
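
The core fusion rule used above (average two spans when both their starts and their ends lie within the threshold, otherwise keep both) can be sketched in isolation. `fuse_if_close` is a hypothetical helper working on integer tuples rather than location strings, and it leaves out the NaN handling and the follow-up post-processing.

```python
def fuse_if_close(span_a, span_b, threshold=5):
    """Return one averaged span if the two spans nearly coincide, else both spans."""
    (a_start, a_end), (b_start, b_end) = span_a, span_b
    if abs(a_start - b_start) <= threshold and abs(a_end - b_end) <= threshold:
        # Integer mean of the boundaries, as in _get_box_fusion.
        return [((a_start + b_start) // 2, (a_end + b_end) // 2)]
    return [span_a, span_b]


print(fuse_if_close((10, 20), (12, 25)))  # [(11, 22)]
print(fuse_if_close((10, 20), (40, 50)))  # [(10, 20), (40, 50)]
```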

In [None]:
def _check_duplications(location_string):
    no_duplicate = location_string
    if not location_string != location_string:
        location_split = location_string.split(';')
        location_split = [_trim_space_start_or_end(element) for element in location_split]
        location_split = list(set(location_split))
        location_split = _order_by_start_position(location_split)
        no_duplicate = ';'.join(location_split)
    return no_duplicate

# Data Loading

In [None]:
main_dir="../input/nbme-score-clinical-patient-notes/"

def preprocess_features(features):
    features.loc[27, 'feature_text'] = "Last-Pap-smear-1-year-ago"
    return features


test = pd.read_csv(main_dir+'test.csv')
# test = pd.read_csv(main_dir+'train.csv')
submission = pd.read_csv(main_dir+'sample_submission.csv')
features = pd.read_csv(main_dir+'features.csv')
patient_notes = pd.read_csv(main_dir+'patient_notes.csv')

features = preprocess_features(features)

print(f"test.shape: {test.shape}")
display(test.head())
print(f"features.shape: {features.shape}")
display(features.head())
print(f"patient_notes.shape: {patient_notes.shape}")
display(patient_notes.head())

In [None]:
test = test.merge(features, on=['feature_num', 'case_num'], how='left')
test = test.merge(patient_notes, on=['pn_num', 'case_num'], how='left')
display(test.head())

# Deberta v3 large

In [None]:
# class CFG:
#     num_workers=4
#     path="../input/deberta-v3-large-5-folds-public/"
#     config_path=path+'config.pth'
#     model="microsoft/deberta-v3-large"
#     batch_size=32
#     fc_dropout=0.2
#     max_len=354
#     seed=42
#     n_fold=5
#     trn_fold=[0, 1, 2, 3, 4]


# ====================================================
# CFG
# ====================================================
class CFG:
    num_workers=4
    path="../input/nbmedebertav3largefold10/"
    config_path=path+'config.pth'
    model="microsoft/deberta-v3-large"  # ["microsoft/deberta-base", "kamalkraj/bioelectra-base-discriminator-pubmed"]
    batch_size=32
    fc_dropout=0.2
    max_len=354
    seed=42
    n_fold=10
    trn_fold=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9] # [0, 1, 2, 3, 4]


In [None]:
from transformers.models.deberta_v2.tokenization_deberta_v2_fast import DebertaV2TokenizerFast

tokenizer = DebertaV2TokenizerFast.from_pretrained('../input/deberta-tokenizer')
CFG.tokenizer = tokenizer

In [None]:
def prepare_input(cfg, text, feature_text):
    inputs = cfg.tokenizer(text, feature_text, 
                           add_special_tokens=True,
                           max_length=cfg.max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
        
    return inputs


class TestDataset(Dataset):
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.feature_texts = df['feature_text'].values
        self.pn_historys = df['pn_history'].values

    def __len__(self):
        return len(self.feature_texts)

    def __getitem__(self, item):
        inputs = prepare_input(self.cfg, 
                               self.pn_historys[item], 
                               self.feature_texts[item])
        
        return inputs

[Link to the notebook for fast inference](https://www.kaggle.com/code/anyai28/fast-inference-by-padding-optimization)

In [None]:
# ====================================================
# Dataset
# ====================================================
def prepare_input_fast(cfg, text, feature_text, batch_max_len):
    inputs = cfg.tokenizer(text, feature_text, 
                           add_special_tokens=True,
                           max_length=batch_max_len,
                           padding="max_length",
                           return_offsets_mapping=False)
    for k, v in inputs.items():
        inputs[k] = torch.tensor(v, dtype=torch.long)
    return inputs


class TestDatasetFast(Dataset):
    def __init__(self, cfg, df):
        self.cfg = cfg
        self.feature_texts = df['feature_text'].values
        self.pn_historys = df['pn_history'].values
        self.batch_max_len = df['batch_max_length'].values

    def __len__(self):
        return len(self.feature_texts)

    def __getitem__(self, item):
        inputs = prepare_input_fast(self.cfg, 
                               self.pn_historys[item], 
                               self.feature_texts[item],
                               self.batch_max_len[item],
                              )
        return inputs

In [None]:
class ScoringModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, 1)
        self._init_weights(self.fc)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        
        return last_hidden_states

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        
        return output

In [None]:
def inference_fn(test_loader, model, device):
    preds = []
    model.eval()
    model.to(device)
    
    tk0 = tqdm(test_loader, total=len(test_loader))
    for inputs in tk0:
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        with torch.no_grad():
            y_preds = model(inputs)
        preds.append(y_preds.sigmoid().to('cpu').numpy())
    predictions = np.concatenate(preds)
    
    return predictions

In [None]:
# ====================================================
# inference
# ====================================================
def inference_fn_fast(test_loader, model, device):
    preds = []
    model.eval()
    model.to(device)
    tk0 = tqdm(test_loader, total=len(test_loader))
    for inputs in tk0:
    # for inputs in test_loader:
        bs = len(inputs['input_ids'])
        pred_w_pad = np.zeros((bs, CFG.max_len, 1))
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        with torch.no_grad():
            y_preds = model(inputs)
        y_preds = y_preds.sigmoid().to('cpu').numpy()
        pred_w_pad[:, :y_preds.shape[1]] = y_preds
        preds.append(pred_w_pad)
    predictions = np.concatenate(preds)
    return predictions

In [None]:
###### Reduce Padding Inference ######

# sort by token num
# input_lengths = []
# tk0 = tqdm(zip(test['pn_history'].fillna("").values, test['feature_text'].fillna("").values), total=len(test))
# for text, feature_text in tk0:
#     length = len(CFG.tokenizer(text, feature_text, add_special_tokens=True)['input_ids'])
#     input_lengths.append(length)
# test['input_lengths'] = input_lengths
# length_sorted_idx = np.argsort([-len_ for len_ in input_lengths])

# # sort dataframe
# sort_df = test.iloc[length_sorted_idx]

# # calc max_len per batch
# sorted_input_length = sort_df['input_lengths'].values
# batch_max_length = np.zeros_like(sorted_input_length)
# bs = CFG.batch_size
# for i in range((len(sorted_input_length)//bs)+1):
#     batch_max_length[i*bs:(i+1)*bs] = np.max(sorted_input_length[i*bs:(i+1)*bs])    
# sort_df['batch_max_length'] = batch_max_length
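
The padding optimization in the commented-out cell sorts the inputs by token count and pads each batch only to its own longest sequence. A minimal NumPy sketch of that bookkeeping follows; `batch_max_lengths` is a hypothetical helper and the token counts are invented. Stepping through the batches with `range(0, n, batch_size)` avoids taking the maximum of an empty slice when the number of rows is an exact multiple of the batch size.

```python
import numpy as np


def batch_max_lengths(input_lengths, batch_size):
    """Sort lengths in descending order and return the per-row padding length."""
    lengths = np.asarray(input_lengths)
    order = np.argsort(-lengths, kind="stable")  # longest inputs first
    sorted_lengths = lengths[order]
    batch_max = np.zeros_like(sorted_lengths)
    for start in range(0, len(sorted_lengths), batch_size):
        stop = start + batch_size
        batch_max[start:stop] = sorted_lengths[start:stop].max()
    return order, batch_max


lengths = [5, 12, 7, 3, 9]  # toy token counts
order, batch_max = batch_max_lengths(lengths, batch_size=2)

# np.argsort(order) is the inverse permutation that restores the original row order.
restored = batch_max[np.argsort(order)]
print(order, batch_max, restored)
```

With these toy values the sorted batches are padded to 12, 7 and 3 tokens instead of one global maximum.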

In [None]:
# Original version

test_dataset = TestDataset(CFG, test)
test_loader = DataLoader(test_dataset,
                         batch_size=CFG.batch_size,
                         shuffle=False,
                         num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
predictions_v3_l = []
for fold in CFG.trn_fold:
    model = ScoringModel(CFG, config_path=CFG.config_path, pretrained=False)
    state = torch.load(CFG.path+f"debertav3-itpt-10ep_fold{fold}_best.pth",
                       map_location=torch.device('cpu'))
    
    model.load_state_dict(state['model'])
    prediction = inference_fn(test_loader, model, device)
    prediction = prediction.reshape((len(test), CFG.max_len))
    char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
    predictions_v3_l.append(char_probs)
    del model, state, prediction, char_probs
    gc.collect()
    torch.cuda.empty_cache()
predictions_v3_l = np.mean(predictions_v3_l, axis=0)
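
The fold averaging above calls `np.mean` on a list of per-fold lists whose arrays have one length per note. Depending on the NumPy version this may fail, since newer releases refuse to build ragged arrays implicitly. A hedged alternative, assuming every fold returns the notes in the same order, averages note by note; `average_folds` is a hypothetical helper that is not part of the original notebooks.

```python
import numpy as np


def average_folds(fold_predictions):
    """Average per-note arrays across folds; notes may differ in length."""
    # zip(*folds) groups the arrays that belong to the same note.
    return [np.mean(per_note, axis=0) for per_note in zip(*fold_predictions)]


# Two toy folds, each with predictions for a 2-character and a 3-character note.
fold_a = [np.array([0.2, 0.4]), np.array([1.0, 0.0, 0.5])]
fold_b = [np.array([0.4, 0.8]), np.array([0.0, 1.0, 0.5])]

averaged = average_folds([fold_a, fold_b])
print(averaged)
```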

In [None]:
# Fast version

# test_dataset = TestDatasetFast(CFG, sort_df)
# test_loader = DataLoader(test_dataset,
#                       batch_size=CFG.batch_size,
#                       shuffle=False,
#                       num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
# predictions = []
# for fold in CFG.trn_fold:
#     model = ScoringModel(CFG, config_path=CFG.config_path, pretrained=False)
#     state = torch.load(CFG.path+f"debertav3-itpt-10ep_fold{fold}_best.pth",
#                        map_location=torch.device('cpu'))
#     model.load_state_dict(state['model'])
#     prediction = inference_fn_fast(test_loader, model, device)
#     prediction = prediction.reshape((len(test), CFG.max_len))
#     ## data re-sort ## 
#     prediction = prediction[np.argsort(length_sorted_idx)]

#     char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
#     predictions.append(char_probs)
#     del model, state, prediction, char_probs; gc.collect()
#     torch.cuda.empty_cache()
# predictions_v3_l = np.mean(predictions, axis=0)

# Deberta v3 large (5 fold)

In [None]:
class CFG:
    num_workers=4
    path="../input/nbme-deverta-v3-large-new-model-ep5-fold5/"
    config_path=path+'config.pth'
    model="microsoft/deberta-v3-large"
    model_name='deberta-v3-large'
    batch_size=32
    fc_dropout=0.2
    max_len=354
    seed=42
    n_fold=5
    trn_fold=[0, 1, 2, 3, 4]

In [None]:
tokenizer = DebertaV2TokenizerFast.from_pretrained('../input/deberta-tokenizer')
CFG.tokenizer = tokenizer

In [None]:
# Original

test_dataset = TestDataset(CFG, test)
test_loader = DataLoader(test_dataset,
                         batch_size=CFG.batch_size,
                         shuffle=False,
                         num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
predictions = []
for fold in CFG.trn_fold:
    model = ScoringModel(CFG, config_path=CFG.config_path, pretrained=False)
    state = torch.load(CFG.path+f"microsoft-deberta-v3-large_fold{fold}_best.pth",
                           map_location=torch.device('cpu'))
       
    model.load_state_dict(state['model'])
    prediction = inference_fn(test_loader, model, device)
    prediction = prediction.reshape((len(test), CFG.max_len))
    char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
    predictions.append(char_probs)
    del model, state, prediction, char_probs
    gc.collect()
    torch.cuda.empty_cache()
    
predictions_v3_l_5f = np.mean(predictions, axis=0)

In [None]:
# Fast version

# test_dataset = TestDatasetFast(CFG, sort_df)
# test_loader = DataLoader(test_dataset,
#                       batch_size=CFG.batch_size,
#                       shuffle=False,
#                       num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
# predictions = []
# for fold in CFG.trn_fold:
#     model = ScoringModel(CFG, config_path=CFG.config_path, pretrained=False)
#     state = torch.load(CFG.path+f"{CFG.model_name.replace('/', '-')}_fold{fold}_best.pth",
#                        map_location=torch.device('cpu'))
#     model.load_state_dict(state['model'])
#     prediction = inference_fn_fast(test_loader, model, device)
#     prediction = prediction.reshape((len(test), CFG.max_len))
#     ## data re-sort ## 
#     prediction = prediction[np.argsort(length_sorted_idx)]

#     char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
#     predictions.append(char_probs)
#     del model, state, prediction, char_probs; gc.collect()
#     torch.cuda.empty_cache()
# predictions_v3_l_5f = np.mean(predictions, axis=0)

# Deberta v1 large

In [None]:
# ====================================================
# CFG
# ====================================================
class CFG:
    num_workers=4
    path="../input/debertalarge/"
    config_path=path+'config.pth'
    model="microsoft/deberta-large"
    batch_size=32
    fc_dropout=0.2
    max_len=466
    seed=42
    n_fold=4
    trn_fold=[0, 1, 2, 3]

In [None]:
# ====================================================
# tokenizer
# ====================================================
CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path+'tokenizer/')

In [None]:
# ====================================================
# Model
# ====================================================
class CustomModel(nn.Module):
    def __init__(self, cfg, config_path=None, pretrained=False):
        super().__init__()
        self.cfg = cfg
        if config_path is None:
            self.config = AutoConfig.from_pretrained(cfg.model, output_hidden_states=True)
        else:
            self.config = torch.load(config_path)
        if pretrained:
            self.model = AutoModel.from_pretrained(cfg.model, config=self.config)
        else:
            self.model = AutoModel.from_config(self.config)
        self.fc_dropout = nn.Dropout(cfg.fc_dropout)
        self.fc = nn.Linear(self.config.hidden_size, 1)
        self._init_weights(self.fc)
        
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.bias is not None:
                module.bias.data.zero_()
        elif isinstance(module, nn.Embedding):
            module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
            if module.padding_idx is not None:
                module.weight.data[module.padding_idx].zero_()
        elif isinstance(module, nn.LayerNorm):
            module.bias.data.zero_()
            module.weight.data.fill_(1.0)
        
    def feature(self, inputs):
        outputs = self.model(**inputs)
        last_hidden_states = outputs[0]
        return last_hidden_states

    def forward(self, inputs):
        feature = self.feature(inputs)
        output = self.fc(self.fc_dropout(feature))
        return output

In [None]:
# ====================================================
# inference
# ====================================================
def inference_fn(test_loader, model, device):
    preds = []
    model.eval()
    model.to(device)
    tk0 = tqdm(test_loader, total=len(test_loader))
    for inputs in tk0:
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        with torch.no_grad():
            y_preds = model(inputs)
        preds.append(y_preds.sigmoid().to('cpu').numpy())
    predictions = np.concatenate(preds)
    return predictions

In [None]:
# ====================================================
# inference (fast version)
# ====================================================
def inference_fn_fast(test_loader, model, device):
    """Inference for batches that may be shorter than CFG.max_len.

    Each batch's output is written into a zero-filled (bs, CFG.max_len, 1)
    array, so all batches can be concatenated regardless of their length.
    """
    preds = []
    model.eval()
    model.to(device)
    tk0 = tqdm(test_loader, total=len(test_loader))
    for inputs in tk0:
        bs = len(inputs['input_ids'])
        pred_w_pad = np.zeros((bs, CFG.max_len, 1))
        for k, v in inputs.items():
            inputs[k] = v.to(device)
        with torch.no_grad():
            y_preds = model(inputs)
        y_preds = y_preds.sigmoid().to('cpu').numpy()
        # Positions beyond the batch's sequence length keep probability 0.
        pred_w_pad[:, :y_preds.shape[1]] = y_preds
        preds.append(pred_w_pad)
    predictions = np.concatenate(preds)
    return predictions

In [None]:
# Original version

# test_dataset = TestDataset(CFG, test)
# test_loader = DataLoader(test_dataset,
#                          batch_size=CFG.batch_size,
#                          shuffle=False,
#                          num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
# predictions = []
# for fold in CFG.trn_fold:
#     model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
#     state = torch.load(CFG.path+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth",
#                        map_location=torch.device('cpu'))
#     model.load_state_dict(state['model'])
#     prediction = inference_fn(test_loader, model, device)
#     prediction = prediction.reshape((len(test), CFG.max_len))
#     char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
#     predictions.append(char_probs)
#     del model, state, prediction, char_probs; gc.collect()
#     torch.cuda.empty_cache()
# predictions_v1_l = np.mean(predictions, axis=0)

In [None]:
# Fast version

# test_dataset = TestDatasetFast(CFG, sort_df)  # the re-sort below assumes length-sorted input
# test_loader = DataLoader(test_dataset,
#                          batch_size=CFG.batch_size,
#                          shuffle=False,
#                          num_workers=CFG.num_workers, pin_memory=True, drop_last=False)

# predictions = []
# for fold in CFG.trn_fold:
#     model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
#     state = torch.load(CFG.path+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth",
#                        map_location=torch.device('cpu'))
#     model.load_state_dict(state['model'])
#     prediction = inference_fn_fast(test_loader, model, device)
#     prediction = prediction.reshape((len(test), CFG.max_len))
#     ## data re-sort ## 
#     prediction = prediction[np.argsort(length_sorted_idx)]

#     char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
#     predictions.append(char_probs)
#     del model, state, prediction, char_probs; gc.collect()
#     torch.cuda.empty_cache()
# predictions_v1_l = np.mean(predictions, axis=0)
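The `## data re-sort ##` step in the fast version works because `np.argsort` of a permutation gives its inverse. Below is a minimal sketch of that idea. It assumes `length_sorted_idx` was produced by argsorting the note lengths; that definition is not shown in this section, and the toy arrays are made up for illustration.

```python
import numpy as np

# Hypothetical token lengths for four test rows (not from the real data).
lengths = np.array([7, 3, 9, 5])

# Assumed definition: indices that order the rows from shortest to longest.
length_sorted_idx = np.argsort(lengths)

# Stand-in for per-row predictions, one value per row in the original order.
original_order = np.array([0.7, 0.3, 0.9, 0.5])

# A length-sorted loader yields predictions in this order.
sorted_order = original_order[length_sorted_idx]

# argsort of a permutation is its inverse, so this restores the original order.
restored = sorted_order[np.argsort(length_sorted_idx)]
```

If the loader is not fed the length-sorted frame, the same indexing would scramble the rows instead of restoring them, so the two steps have to be used together.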

# Deberta base

In [None]:
# ====================================================
# CFG
# ====================================================
# class CFG:
#     num_workers=4
#     path="../input/nbme-deberta-base-baseline-train/"
#     config_path=path+'config.pth'
#     model="microsoft/deberta-base"
#     batch_size=24
#     fc_dropout=0.2
#     max_len=466
#     seed=42
#     n_fold=5
#     trn_fold=[0, 1, 2, 3, 4]


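# Active config: deberta-v3-large pseudo-labeling checkpoints, used in place of
# the deberta-base settings commented out above.
class CFG: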
    num_workers=4
    path="../input/deberta-v3-pseudo-labeling/"
    config_path=path+'config.pth'
    model="microsoft/deberta-v3-large"
    model_name='deberta-v3-large'
    batch_size=32
    fc_dropout=0.2
    max_len=354
    seed=42
    n_fold=5
    trn_fold=[0, 1, 2, 3, 4]

In [None]:
# ====================================================
# tokenizer
# ====================================================
# CFG.tokenizer = AutoTokenizer.from_pretrained(CFG.path+'tokenizer/')

tokenizer = DebertaV2TokenizerFast.from_pretrained('../input/deberta-tokenizer')
CFG.tokenizer = tokenizer

In [None]:
# Original version

test_dataset = TestDataset(CFG, test)
test_loader = DataLoader(test_dataset,
                         batch_size=CFG.batch_size,
                         shuffle=False,
                         num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
predictions = []
for fold in CFG.trn_fold:
    model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
    # state = torch.load(CFG.path+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth",
    #                    map_location=torch.device('cpu'))
    # Same file name as CFG.model.replace('/', '-') produces for the current CFG.
    state = torch.load(CFG.path+f"microsoft-deberta-v3-large_fold{fold}_best.pth",
                       map_location=torch.device('cpu'))
    model.load_state_dict(state['model'])
    prediction = inference_fn(test_loader, model, device)
    prediction = prediction.reshape((len(test), CFG.max_len))
    char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
    predictions.append(char_probs)
    del model, state, prediction, char_probs; gc.collect()
    torch.cuda.empty_cache()
# NOTE: despite the `_v1_b` name, with the CFG above these are the
# deberta-v3-large pseudo-labeling predictions averaged over the folds.
predictions_v1_b = np.mean(predictions, axis=0)
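Here `predictions` is a list (one entry per fold) of lists of per-row arrays, and those arrays generally differ in length because the patient notes do. Calling `np.mean(..., axis=0)` on such a ragged structure depends on how the installed NumPy handles it, and recent versions may refuse to build the array. A hedged alternative is to average row by row. `mean_over_folds` is a hypothetical helper, not part of the original notebooks.

```python
import numpy as np

def mean_over_folds(fold_predictions):
    """Average per-row arrays across folds without building a ragged array.

    fold_predictions: list over folds; each entry is a list of 1-D arrays,
    one per test row, with matching lengths across folds for the same row.
    """
    n_folds = len(fold_predictions)
    averaged = []
    # zip(*...) groups together the arrays that belong to the same test row.
    for per_row in zip(*fold_predictions):
        averaged.append(sum(per_row) / n_folds)
    return averaged

# Toy data: two folds, two rows of different length (values are made up).
fold_a = [np.array([0.2, 0.4]), np.array([1.0, 0.0, 0.5])]
fold_b = [np.array([0.4, 0.8]), np.array([0.0, 1.0, 0.5])]
averaged = mean_over_folds([fold_a, fold_b])
```

The result is a plain Python list of arrays, which is the shape the final combination loop iterates over.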

In [None]:
# Fast version

# test_dataset = TestDatasetFast(CFG, sort_df)
# test_loader = DataLoader(test_dataset,
#                       batch_size=CFG.batch_size,
#                       shuffle=False,
#                       num_workers=CFG.num_workers, pin_memory=True, drop_last=False)
# predictions = []
# for fold in CFG.trn_fold:
#     model = CustomModel(CFG, config_path=CFG.config_path, pretrained=False)
#     state = torch.load(CFG.path+f"{CFG.model.replace('/', '-')}_fold{fold}_best.pth",
#                        map_location=torch.device('cpu'))
#     model.load_state_dict(state['model'])
#     prediction = inference_fn_fast(test_loader, model, device)
#     prediction = prediction.reshape((len(test), CFG.max_len))
#     ## data re-sort ## 
#     prediction = prediction[np.argsort(length_sorted_idx)]

#     char_probs = get_char_probs(test['pn_history'].values, prediction, CFG.tokenizer)
#     predictions.append(char_probs)
#     del model, state, prediction, char_probs; gc.collect()
#     torch.cuda.empty_cache()
# predictions_v1_b = np.mean(predictions, axis=0)

# Combine

In [None]:
predictions = []
# predictions_v3_l, predictions_v3_l_5f, predictions_v1_l, predictions_v1_b
for p1, p2, p3 in zip(predictions_v3_l, predictions_v3_l_5f, predictions_v1_b):
    predictions.append(w1*p1 + w2*p2 + w3*p3)
# for p1, p2, p3, p4 in zip(predictions_v3_l, predictions_v3_l_5f, predictions_v1_l, predictions_v1_b):
#     predictions.append(w1*p1 + w2*p2 + w3*p3 + w4*p4)
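One thing to watch with the `zip` loop: `zip` stops at the shortest input, so if one model's prediction list is shorter than the others, rows are dropped without any error. A small sketch of the same linear combination with that check added follows. `blend` is a hypothetical helper, and the weights and arrays in the example are illustrative, not the ones used for the leaderboard score.

```python
import numpy as np

def blend(prediction_sets, weights):
    """Weighted sum of per-row probability arrays from several models."""
    if len(prediction_sets) != len(weights):
        raise ValueError("need exactly one weight per prediction set")
    n_rows = len(prediction_sets[0])
    if any(len(s) != n_rows for s in prediction_sets):
        raise ValueError("prediction sets cover different numbers of rows")
    blended = []
    # Each per_row tuple holds the arrays for the same test row, one per model.
    for per_row in zip(*prediction_sets):
        blended.append(sum(w * p for w, p in zip(weights, per_row)))
    return blended

# Toy example with two models and two rows (values are made up).
set_a = [np.array([0.2, 0.8]), np.array([1.0])]
set_b = [np.array([0.6, 0.4]), np.array([0.0])]
blended = blend([set_a, set_b], [0.5, 0.5])
```

As noted in the introduction, the weights need not sum to one, so the blended values are scores rather than strict probabilities.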

In [None]:
# predictions = []
# my_w1 = 0.5
# my_w2 = 0.5
# for p1, p2 in zip(predictions_v3_l, predictions_v3_l_5f):
#     predictions.append(my_w1 * p1 + my_w2 * p2)

In [None]:
# # WBF example notebook: <https://www.kaggle.com/code/maximegatineau/nbme-post-processing-and-ensemble-tools/notebook>
# results1 = get_results(predictions_v3_l)
# results2 = get_results(predictions)
# postprocess_base_value = list(map(_postprocess_location, results1))
# postprocess_other_value = list(map(_postprocess_location, results2))
# list_wbf = list(map(_custom_weighted_box_fusion, postprocess_base_value, postprocess_other_value))

# # submission_df = pd.DataFrame({'id': test['id'].values, 'final_location': list_wbf})
# submission_df = pd.DataFrame({'id': submission['id'].values, 'location': list_wbf})
# submission_df.to_csv('submission.csv', index=False)

In [None]:
results = get_results(predictions)
submission['location'] = results
display(submission.head())
submission[['id', 'location']].to_csv('submission.csv', index=False)