## NER Tagging Demo

See details at: https://www.kaggle.com/c/feedback-prize-2021/discussion/296669

In [None]:
import os

import pandas as pd
import transformers as trm
from tqdm.auto import tqdm

# Uses this utility script: https://www.kaggle.com/xhlulu/ner-tagging
import ner_tagging as tag

Let's first load the training dataframe, along with the essay texts and a dictionary that maps each essay id to the rows of the dataframe for that essay:

In [None]:
%%time
train_df = pd.read_csv('../input/feedback-prize-2021/train.csv')

train_dir = '../input/feedback-prize-2021/train'
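# Sort for a reproducible order; ids and essays stay aligned because both derive from train_files
train_files = sorted(os.listdir(train_dir))
train_ids = [os.path.splitext(f)[0] for f in train_files]

def read_essay(path):
    # Use a context manager so each file handle is closed right away
    with open(path) as fh:
        return fh.read()

train_essays = [
    read_essay(os.path.join(train_dir, f))
    for f in tqdm(train_files)
]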

train_id_to_df = dict(list(train_df.groupby('id')))
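
To make the id lookup concrete, here is a minimal sketch on a toy dataframe. The rows and the names `toy_df` and `toy_id_to_df` are made up for illustration; only the column names are taken from this notebook.

```python
import pandas as pd

# Toy stand-in for train.csv (made-up rows, same column names as used in this notebook)
toy_df = pd.DataFrame({
    "id": ["A1", "A1", "B2"],
    "discourse_type": ["Lead", "Claim", "Position"],
    "discourse_start": [0, 40, 0],
    "discourse_end": [39, 80, 25],
})

# groupby yields (key, sub-dataframe) pairs, so dict() turns them into an id -> rows lookup
toy_id_to_df = dict(list(toy_df.groupby("id")))

print(sorted(toy_id_to_df))
print(toy_id_to_df["A1"])
```

An id with no rows in the dataframe would be missing from the dictionary, so `toy_id_to_df.get(some_id)` is the safer lookup if that can happen in your data.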

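Next, we tokenize the training text. I selected `bert-base-cased`, but you can swap in any tokenizer you want, as long as it is a "fast" tokenizer, since only those can return offsets. The important part is to keep `return_offsets_mapping=True` so we can pass the offsets to the `tag.create_target` function later. Note that `max_length=1024` is longer than the 512 positions a BERT model accepts, so lower it (or pick a long-sequence model) if you train with this checkpoint.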

In [None]:
tokenizer = trm.AutoTokenizer.from_pretrained("bert-base-cased")
tokens = tokenizer(
    train_essays,
    return_offsets_mapping=True,
    truncation=True,
    max_length=1024,
    padding='max_length',
)
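
If offset mappings are new to you, the sketch below mimics their shape without downloading anything. `toy_tokenize` is a hypothetical whitespace tokenizer written for this illustration, not part of `transformers`; a real subword tokenizer splits the text differently, but it likewise gives one `(start, end)` character span per token, with `(0, 0)` for padding and special tokens.

```python
import re

def toy_tokenize(text, max_length):
    """Hypothetical whitespace tokenizer that returns only an offset mapping."""
    # One (start, end) character span per whitespace-separated token
    offsets = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    offsets = offsets[:max_length]                      # truncation
    offsets += [(0, 0)] * (max_length - len(offsets))   # padding to max_length
    return offsets

text = "Phones distract drivers."
offsets = toy_tokenize(text, max_length=5)

# Slicing the original text with each non-empty span recovers the token strings
words = [text[s:e] for s, e in offsets if e > s]
print(offsets)
print(words)
```

Because every token carries its character span, character-level annotations such as `discourse_start` and `discourse_end` can be mapped onto tokens.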

We will now generate the training targets: one tag id per token for each essay, following the BILOU scheme.

In [None]:
train_targets = []

# First, you need to generate the tags from labels in the training dataframe
tags, tag_to_num = tag.generate_tags(train_df.discourse_type, scheme="BILOU")

for i, essay in enumerate(tqdm(train_essays)):
    essay_df = train_id_to_df[train_ids[i]]
    
    # Using the offset_mapping obtained from the tokenizer, we can align
    # it with the tagged characters to create the target for our model
    target = tag.create_target(
        text=essay,
        labels=essay_df.discourse_type,
        start=essay_df.discourse_start,
        end=essay_df.discourse_end,
        offset_mapping=tokens.offset_mapping[i],
        tag_to_num=tag_to_num,
        scheme="BILOU"
    )
    train_targets.append(target)
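
For intuition, here is a rough sketch of how character spans and an offset mapping can be combined into BILOU targets. This is not the implementation in the `ner_tagging` utility script. The function `toy_bilou_target`, the tag ordering in `toy_tag_to_num`, and the choice to treat span ends as exclusive are all assumptions made for the example.

```python
def toy_bilou_target(offsets, spans, tag_to_num):
    """Assign one BILOU tag id per token from character-level spans.

    offsets: list of (start, end) character offsets, (0, 0) for padding
    spans:   list of (label, start, end), with end treated as exclusive (assumption)
    """
    # Everything starts as "O" (outside any span), including padding
    target = [tag_to_num["O"]] * len(offsets)
    for label, start, end in spans:
        # Indices of real (non-padding) tokens that lie fully inside the span
        inside = [
            i for i, (s, e) in enumerate(offsets)
            if e > s and s >= start and e <= end
        ]
        if not inside:
            continue
        if len(inside) == 1:
            # A single-token span gets the "U" (unit) tag
            target[inside[0]] = tag_to_num[f"U-{label}"]
            continue
        target[inside[0]] = tag_to_num[f"B-{label}"]    # beginning
        for i in inside[1:-1]:
            target[i] = tag_to_num[f"I-{label}"]        # inside
        target[inside[-1]] = tag_to_num[f"L-{label}"]   # last
    return target

# Hypothetical tag vocabulary; the real id layout comes from tag.generate_tags
toy_tags = ["O"] + [f"{p}-{l}" for l in ("Claim", "Lead") for p in "BILU"]
toy_tag_to_num = {t: i for i, t in enumerate(toy_tags)}

toy_offsets = [(0, 6), (7, 15), (16, 23), (24, 27), (0, 0)]
toy_spans = [("Claim", 0, 23), ("Lead", 24, 27)]
toy_target = toy_bilou_target(toy_offsets, toy_spans, toy_tag_to_num)
print(toy_target)
```

Tokens that only partly overlap a span are left as `O` in this sketch; a real implementation has to decide how to handle them.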