## The title is a bit clickbait, because in the inference I've used regex for emails and numbers, so the LB score for this model would be about .955-.956 without them.

- The only thing I did was setting truncation to false in the tokenizer. That's it.

## 🛑 Wait a second - after this you should also look at the inference notebook
- My inference notebook (containing equally many emojis) is here:
- https://www.kaggle.com/code/valentinwerner/893-deberta3base-inference

## 🏟️ Credits (because this baseline mostly already existed when I joined)

- @Nicholas Broad published the transformer baseline which performs only marginally worse: https://www.kaggle.com/code/nbroad/transformer-ner-baseline-lb-0-854
- @Joseph Josia published the training notebook which I basically copy-pasted (and which is itself based on nbroad's, but yeah): https://www.kaggle.com/code/takanashihumbert/piidd-deberta-model-starter-training



## 💡 What I added
- Downsampling negative samples (samples without labels; they possibly still work as examples where names should not be tagged as names)
- Adding @moth's external data: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/469493
- Adding PJMathematician's external data: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/470921
- However, I used my cleaned version instead (the punctuation was flawed in the original dataset at the time of this training): https://www.kaggle.com/code/valentinwerner/fix-punctuation-tokenization-external-dataset
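
The training code further down keeps the first third of the negative documents. As a hedged variant, the subset could instead be drawn at random with a fixed seed. The sketch below assumes each document is a dict with a `labels` list, as in the competition JSON; `downsample_negatives`, `keep_fraction` and `seed` are hypothetical names, not part of the original notebook.

```python
import random


def downsample_negatives(docs, keep_fraction=1 / 3, seed=42):
    """Keep all documents with at least one non-"O" label and a random
    fraction of the documents that contain only "O" labels."""
    positives = [d for d in docs if any(label != "O" for label in d["labels"])]
    negatives = [d for d in docs if all(label == "O" for label in d["labels"])]
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    n_keep = int(len(negatives) * keep_fraction)
    return positives + rng.sample(negatives, n_keep)


# Toy data: one positive document and four negative ones.
toy_docs = [
    {"document": 1, "labels": ["O", "B-NAME_STUDENT"]},
    {"document": 2, "labels": ["O", "O"]},
    {"document": 3, "labels": ["O"]},
    {"document": 4, "labels": ["O", "O", "O"]},
    {"document": 5, "labels": ["O"]},
]
sampled = downsample_negatives(toy_docs, keep_fraction=0.5)
print(len(sampled), [d["document"] for d in sampled])
```

Random sampling avoids depending on the order of the input file, at the cost of the result changing whenever the seed changes.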

Doing this brought the LB score to .888 - Trained in Kaggle Notebook, no tricks or secrets.

- I added emojis because that seems to be the kaggle upvote meta

## 📝 Config & Imports
- 1024 max length has been working well for me. As some samples are longer, you may want to go as high as you can 
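
To make the interplay of maximum length and overlap concrete, here is a minimal pure-Python sketch of how a long token sequence can be cut into overlapping windows. The helper `sliding_windows` is hypothetical and ignores special tokens, so it only approximates what a tokenizer does when it returns overflowing tokens with a stride.

```python
def sliding_windows(token_ids, max_length, stride):
    """Split token_ids into windows of at most max_length tokens.

    Consecutive windows share `stride` tokens. Hypothetical helper for
    illustration only; special tokens are not modelled.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    step = max_length - stride
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_length])
        if start + max_length >= len(token_ids):
            break  # the current window already reaches the end
        start += step
    return windows


# Toy example: 10 tokens, windows of 4 tokens with 1 token of overlap.
for window in sliding_windows(list(range(10)), max_length=4, stride=1):
    print(window)
```

With a larger maximum length fewer documents need to be split at all, which is why going as high as memory allows can help.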

In [1]:
# !pip install wandb --upgrade

In [2]:
from dotenv import load_dotenv
import os

load_dotenv("/kaggle/.env")
WANDB_API_KEY = os.getenv("WANDB_API_KEY")

TRAINING_MODEL_PATH = "microsoft/deberta-v3-large"
TRAINING_MAX_LENGTH = 1536
STRIDE = 128
OUTPUT_DIR = "/kaggle/output/deberta3large_cnn"

BATCH_SIZE = 1
ACC_STEPS = 2
EPOCHS = 4
LR = 2e-5

arch_suffix = "deberta_large_CNN"

name = f"ex_ec2_{arch_suffix}_return_overflowing_tokens"

In [3]:
import json
import argparse
from itertools import chain
from functools import partial
from typing import Optional, Tuple, Union

import torch
from torch import nn
from transformers import (
    AutoTokenizer, 
    Trainer, 
    TrainingArguments,
    AutoModelForTokenClassification, 
    DataCollatorForTokenClassification, 
    DebertaV2ForTokenClassification
)
from transformers.models.deberta_v2 import DebertaV2PreTrainedModel, DebertaV2Model
from transformers.models.deberta_v2.modeling_deberta_v2 import (
    DEBERTA_START_DOCSTRING,
    DEBERTA_INPUTS_DOCSTRING,
    _CHECKPOINT_FOR_DOC,
    _CONFIG_FOR_DOC
)
from transformers.utils import (
    add_code_sample_docstrings,
    add_start_docstrings,
    add_start_docstrings_to_model_forward,
)
from transformers.modeling_outputs import TokenClassifierOutput
import evaluate
from datasets import Dataset, features
import numpy as np

import wandb



In [4]:
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33msueyoshi124[0m. Use [1m`wandb login --relogin`[0m to force relogin


True

In [5]:
run = wandb.init(
    project="kaggle_pii",
    name=name,
    config={
        "learning_rate": LR,
        "architecture":  f"{TRAINING_MODEL_PATH}_{arch_suffix}",
        "epochs": EPOCHS,
        "batch_size": BATCH_SIZE,
        "gradient_accumulation_steps": ACC_STEPS,
        "train_max_length": TRAINING_MAX_LENGTH,
    }
)

## 🗺️ Data Selection and Label Mapping
- As mentioned before, I additionally use the moth dataset

In [6]:
data = json.load(open("/kaggle/input/pii-detection-removal-from-educational-data/train.json"))

# downsampling of negative examples
p=[] # positive samples (contain relevant labels)
n=[] # negative samples (no PII labels; may still contain names that could be wrongly tagged as entities)
for d in data:
    if any(np.array(d["labels"]) != "O"): p.append(d)
    else: n.append(d)
print("original datapoints: ", len(data))

external = json.load(open("/kaggle/input/fix-punctuation-tokenization-external-dataset/pii_dataset_fixed.json"))
print("external datapoints: ", len(external))

moredata = json.load(open("/kaggle/input/fix-punctuation-tokenization-external-dataset/moredata_dataset_fixed.json"))
print("moredata datapoints: ", len(moredata))

data = moredata+external+p+n[:len(n)//3]
print("combined: ", len(data))

original datapoints:  6807
external datapoints:  4434
moredata datapoints:  2000
combined:  9333


In [7]:
all_labels = sorted(list(set(chain(*[x["labels"] for x in data]))))
label2id = {l: i for i,l in enumerate(all_labels)}
id2label = {v:k for k,v in label2id.items()}

target = [
    'B-EMAIL', 'B-ID_NUM', 'B-NAME_STUDENT', 'B-PHONE_NUM', 
    'B-STREET_ADDRESS', 'B-URL_PERSONAL', 'B-USERNAME', 'I-ID_NUM', 
    'I-NAME_STUDENT', 'I-PHONE_NUM', 'I-STREET_ADDRESS', 'I-URL_PERSONAL'
]

print(id2label)

{0: 'B-EMAIL', 1: 'B-ID_NUM', 2: 'B-NAME_STUDENT', 3: 'B-PHONE_NUM', 4: 'B-STREET_ADDRESS', 5: 'B-URL_PERSONAL', 6: 'B-USERNAME', 7: 'I-ID_NUM', 8: 'I-NAME_STUDENT', 9: 'I-PHONE_NUM', 10: 'I-STREET_ADDRESS', 11: 'I-URL_PERSONAL', 12: 'O'}


## ♟️ Data Tokenization
- This tokenizer is actually special, compared to usual NLP challenges
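
What makes it special is that the labels come per pre-split token, while the model tokenizer produces its own subword pieces. A minimal sketch of the alignment idea, under stated assumptions: labels are first expanded to one label per character, then each subword is labelled by the character at its start offset. The helper names and the hard-coded offsets are hypothetical (the offsets mimic subwords that carry a leading space); no real tokenizer is called here.

```python
def char_labels(tokens, labels, trailing_whitespace):
    """Rebuild the text and expand token labels to one label per character."""
    parts, per_char = [], []
    for tok, lab, ws in zip(tokens, labels, trailing_whitespace):
        parts.append(tok)
        per_char.extend([lab] * len(tok))
        if ws:  # the whitespace after a token is never PII
            parts.append(" ")
            per_char.append("O")
    return "".join(parts), per_char


def label_for_span(text, per_char, start, end):
    """Label of a subword covering text[start:end]; None for special tokens."""
    if start == end:  # special tokens are reported with an empty span
        return None
    if text[start].isspace():  # subword starts with a space: skip it
        start += 1
    return per_char[start]


text, per_char = char_labels(
    ["Hi", "Jane", "Doe", "!"],
    ["O", "B-NAME_STUDENT", "I-NAME_STUDENT", "O"],
    [True, True, False, False],
)
# Made-up offsets for "Hi", " Jane", " Doe", "!" plus one special token.
offsets = [(0, 0), (0, 2), (2, 7), (7, 11), (11, 12)]
aligned = [label_for_span(text, per_char, s, e) for s, e in offsets]
print(text)
print(aligned)
```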

In [8]:
def tokenize_train(example, tokenizer, label2id):

    # rebuild text from tokens
    text = []
    labels = []
    
    idx = 0

    for t, l, ws in zip(
        example["tokens"], example["provided_labels"], example["trailing_whitespace"]
    ):
        text.append(t)
        labels.extend([l] * len(t))

        if ws:
            text.append(" ")
            labels.append("O")
            

    # actual tokenization
    tokenized = tokenizer(
        "".join(text),
        return_offsets_mapping=True,
        max_length=TRAINING_MAX_LENGTH, 
        stride=STRIDE,
        truncation=True, 
        return_overflowing_tokens=True,
    )

    labels = np.array(labels)

    text = "".join(text)
    token_labels = []
    
    for offsets in tokenized.offset_mapping:
        tmp_labels = []
        
        for idxs in offsets:        
            start_idx = idxs[0]
            end_idx = idxs[1]
            # special tokens (e.g. CLS/SEP) have offsets (0, 0)
            if start_idx == 0 and end_idx == 0:
                tmp_labels.append(-100)
                continue

            # case when token starts with whitespace
            if text[start_idx].isspace():
                start_idx += 1

            tmp_labels.append(label2id[labels[start_idx]])
        token_labels.append(tmp_labels)
    
#     length = len(tokenized.input_ids)

    tokenized.pop("overflow_to_sample_mapping")
    tokenized.pop("offset_mapping")
    return {
        **tokenized, 
        "labels": token_labels, 
#         "length": length,
#         "token_map": token_map,
    }

In [9]:
tokenizer = AutoTokenizer.from_pretrained(TRAINING_MODEL_PATH)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [10]:
n = 64
ds = Dataset.from_dict({
    "full_text": [x["full_text"] for x in data],
    "document": [str(x["document"]) for x in data],
    "tokens": [x["tokens"] for x in data],
    "trailing_whitespace": [x["trailing_whitespace"] for x in data],
    "provided_labels": [x["labels"] for x in data],
})
# ds = Dataset.from_dict({
#     "full_text": [x["full_text"] for x in data[:n]],
#     "document": [str(x["document"]) for x in data[:n]],
#     "tokens": [x["tokens"] for x in data[:n]],
#     "trailing_whitespace": [x["trailing_whitespace"] for x in data[:n]],
#     "provided_labels": [x["labels"] for x in data[:n]],
# })
ds = ds.map(
    tokenize_train, 
    fn_kwargs={"tokenizer": tokenizer, "label2id": label2id}, 
    remove_columns=ds.column_names,
    num_proc=4
)
# ds = ds.class_encode_column("group")

     


In [11]:
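# Each example now holds a list of chunks (one per overflow window).
# Flatten them so that every chunk becomes its own training row.
train_dict = None
for d in ds:
    if train_dict is None:
        train_dict = d
    else:
        for k, v in d.items():
            train_dict[k] += v

ds = Dataset.from_dict(train_dict)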

In [12]:
# x = ds[0]

# for t,l in zip(x["tokens"], x["provided_labels"]):
#     if l != "O":
#         print((t,l))

# print("*"*100)

# for t, l in zip(tokenizer.convert_ids_to_tokens(x["input_ids"]), x["labels"]):
#     if id2label[l] != "O":
#         print((t,id2label[l]))

## 🧮 Competition metrics
- Note that we are not using the normal F1 score: the metric below is F-beta with beta = 5, which weights recall much more heavily than precision.
- Although it is early in the competition, there are plenty of discussions already explaining this:
- e.g., here: https://www.kaggle.com/competitions/pii-detection-removal-from-educational-data/discussion/470024
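
As a small worked example of why this differs from F1, the sketch below evaluates the F-beta formula with beta = 5 on two made-up precision/recall pairs. The function name `fbeta` and the zero-denominator guard are assumptions added for illustration.

```python
def fbeta(precision, recall, beta=5.0):
    """F-beta score: recall counts beta**2 times as much as precision."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # guard for the degenerate case of no correct predictions
    return (1 + beta**2) * precision * recall / denominator


# Same pair of values, swapped: missing entities costs far more than
# predicting too many.
print("high recall, low precision:", round(fbeta(0.5, 1.0), 4))
print("low recall, high precision:", round(fbeta(1.0, 0.5), 4))
```

In practice this means a model that over-predicts PII is punished much less than one that misses it.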

In [13]:
from seqeval.metrics import recall_score, precision_score
from seqeval.metrics import classification_report
from seqeval.metrics import f1_score

def compute_metrics(p, all_labels):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [all_labels[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [all_labels[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    
    recall = recall_score(true_labels, true_predictions)
    precision = precision_score(true_labels, true_predictions)
    # F-beta with beta=5; the local name avoids shadowing seqeval's f1_score
    f5_score = (1 + 5*5) * recall * precision / (5*5*precision + recall)
    
    results = {
        'recall': recall,
        'precision': precision,
        'f1': f5_score
    }
    return results

In [14]:
@add_start_docstrings(
    """
    DeBERTa Model with a token classification head on top (a Conv1d layer followed by a linear layer that is applied
    with five different dropout rates and averaged) e.g. for Named-Entity-Recognition (NER) tasks.
    """,
    DEBERTA_START_DOCSTRING,
)
class DebertaV2CnnForTokenClassification(DebertaV2PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.num_labels = config.num_labels

        self.deberta = DebertaV2Model(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.dropout22 = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        
        self.cnn = nn.Conv1d(
            config.hidden_size, 
            config.hidden_size, 
            kernel_size=3, 
            padding=1
        )
        self.relu = nn.ReLU()
        self.dropout1 = nn.Dropout(0.1)
        self.dropout2 = nn.Dropout(0.2)
        self.dropout3 = nn.Dropout(0.3)
        self.dropout4 = nn.Dropout(0.4)
        self.dropout5 = nn.Dropout(0.5)
        
        # Initialize weights and apply final processing
        self.post_init()
    
    @add_start_docstrings_to_model_forward(DEBERTA_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
    @add_code_sample_docstrings(
        checkpoint=_CHECKPOINT_FOR_DOC,
        output_type=TokenClassifierOutput,
        config_class=_CONFIG_FOR_DOC,
    )
    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        labels: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple, TokenClassifierOutput]:
        r"""
        labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
            Labels for computing the token classification loss. Indices should be in `[0, ..., config.num_labels - 1]`.
        """
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict

        outputs = self.deberta(
            input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        
        sequence_output = self.dropout(outputs[0])
        sequence_output = self.cnn(sequence_output.permute(0, 2, 1))
        sequence_output = self.relu(sequence_output.permute(0, 2, 1))
        sequence_output = self.dropout22(sequence_output)
        
        logits1 = self.classifier(self.dropout1(sequence_output))
        logits2 = self.classifier(self.dropout2(sequence_output))
        logits3 = self.classifier(self.dropout3(sequence_output))
        logits4 = self.classifier(self.dropout4(sequence_output))
        logits5 = self.classifier(self.dropout5(sequence_output))

        logits = (logits1 + logits2 + logits3 + logits4 + logits5) / 5


        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        if not return_dict:
            output = (logits,) + outputs[1:]
            return ((loss,) + output) if loss is not None else output

        return TokenClassifierOutput(
            loss=loss, logits=logits, hidden_states=outputs.hidden_states, attentions=outputs.attentions
        )

In [15]:
model = DebertaV2CnnForTokenClassification.from_pretrained(
    TRAINING_MODEL_PATH,
    num_labels=len(all_labels),
    id2label=id2label,
    label2id=label2id,
    ignore_mismatched_sizes=True
)
collator = DataCollatorForTokenClassification(tokenizer, pad_to_multiple_of=16)

Some weights of DebertaV2CnnForTokenClassification were not initialized from the model checkpoint at microsoft/deberta-v3-large and are newly initialized: ['cnn.weight', 'classifier.bias', 'cnn.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [16]:
# I decided to use no eval
# final_ds = ds.train_test_split(test_size=0.2, seed=42) # cannot use stratify_by_column='group'
# final_ds

## 🏋🏻‍♀️ Training
- For the submission model I do not use an eval set, so that I can train on all data
- Values are not really tuned and go by gut feeling, as this is my first iteration / baseline
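
For intuition about the `warmup_ratio` and `lr_scheduler_type='cosine'` settings used in the training arguments, here is a minimal sketch of a linear warmup followed by a half-cosine decay to zero. The function `lr_at_step` is hypothetical and only mirrors the general shape of such a schedule; the total step count is taken from the training log of this run.

```python
import math


def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step (illustrative sketch)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # linear ramp from 0 up to base_lr
        return base_lr * step / max(1, warmup_steps)
    # half cosine from base_lr down to 0 over the remaining steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total = 18768  # global_step reported at the end of training
for step in (0, 1000, 1877, 10000, 18768):
    print(step, f"{lr_at_step(step, total):.2e}")
```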

In [17]:
# I actually chose to not use any validation set. This is only for the model I use for submission.
args = TrainingArguments(
    output_dir=OUTPUT_DIR, 
    bf16=True,
    learning_rate=LR,
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=ACC_STEPS,
    report_to="wandb",
    evaluation_strategy="no",
    do_eval=False,
    logging_steps=20,
    lr_scheduler_type='cosine',
    metric_for_best_model="f1",
    greater_is_better=True,
    warmup_ratio=0.1,
    weight_decay=0.01,
    save_strategy="epoch",
    save_total_limit=1,
)

trainer = Trainer(
    model=model, 
    args=args, 
    train_dataset=ds,
    data_collator=collator, 
    tokenizer=tokenizer,
    compute_metrics=partial(compute_metrics, all_labels=all_labels),
)

In [18]:
%%time
trainer.train()

You're using a DebertaV2TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
20,2.4512
40,2.3993
60,2.2918
80,2.0336
100,1.4041
120,0.7367
140,0.3824
160,0.2558
180,0.1604
200,0.1693


CPU times: user 2h 7min 53s, sys: 11min 44s, total: 2h 19min 38s
Wall time: 2h 21min 4s


TrainOutput(global_step=18768, training_loss=0.019756440567508207, metrics={'train_runtime': 8464.485, 'train_samples_per_second': 4.435, 'train_steps_per_second': 2.217, 'total_flos': 3.614975933896714e+16, 'train_loss': 0.019756440567508207, 'epoch': 4.0})
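
The logged step count can be used as a rough sanity check on how many training chunks the overflow splitting produced. The arithmetic below is a sketch that assumes a single device and that the chunk count divides evenly by the accumulation steps; all numbers are copied from the config and outputs of this notebook.

```python
global_step = 18768   # from TrainOutput
epochs = 4            # EPOCHS
batch_size = 1        # BATCH_SIZE
acc_steps = 2         # ACC_STEPS
documents = 9333      # "combined" count printed after data selection

steps_per_epoch = global_step // epochs
chunks_per_epoch = steps_per_epoch * batch_size * acc_steps
extra_chunks = chunks_per_epoch - documents

# Cross-check against the logged throughput and runtime.
implied_samples = 4.435 * 8464.485
expected_samples = chunks_per_epoch * epochs

print("optimizer steps per epoch:", steps_per_epoch)
print("training chunks per epoch:", chunks_per_epoch)
print("chunks added by overflow splitting:", extra_chunks)
print("samples implied by throughput:", round(implied_samples))
```

Under these assumptions only a small number of documents exceed the maximum length and get split into more than one chunk.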

## 💾 Save models
- You can click on "Save version" (top right) and "Save & Run All (Commit)"
- Then you can use this notebook as input for your inference notebook

In [19]:
# trainer.save_model("deberta3base_1024")
# tokenizer.save_pretrained("deberta3base_1024")

In [20]:
torch.save(model.state_dict(), os.path.join(OUTPUT_DIR, "pt_output.pt"))

In [21]:
wandb.finish()


Metric,History
train/epoch,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/global_step,▁▁▁▁▂▂▂▂▂▃▃▃▃▃▃▄▄▄▄▄▅▅▅▅▅▅▆▆▆▆▆▇▇▇▇▇▇███
train/learning_rate,▂▃▅▆██████▇▇▇▇▇▆▆▆▆▅▅▅▄▄▄▃▃▃▃▂▂▂▂▂▁▁▁▁▁▁
train/loss,█▂▂▁▁▂▁▁▁▁▁▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁
train/total_flos,▁
train/train_loss,▁
train/train_runtime,▁
train/train_samples_per_second,▁
train/train_steps_per_second,▁

Metric,Value
train/epoch,4.0
train/global_step,18768.0
train/learning_rate,0.0
train/loss,0.0003
train/total_flos,3.614975933896714e+16
train/train_loss,0.01976
train/train_runtime,8464.485
train/train_samples_per_second,4.435
train/train_steps_per_second,2.217


In [22]:
!aws ec2 stop-instances --instance-ids i-0449cb0b94ddaa813

/bin/bash: line 1: aws: command not found
