### Token to Word - implementation

My strategy is to decode token by token until the decoded text matches a word. This may not be the most efficient approach, since I encode the text and then decode it back, but the whole process, tokenization included, finishes in 02:34 min, which is good enough (for now!).

Tokens that don't belong to any word are given position -1.

I tested this with the Longformer tokenizer. I'm not sure whether it works with other types of tokenizers.
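To make the strategy concrete, here is a minimal sketch that runs without a real tokenizer. It assumes a hypothetical list of string `pieces` standing in for the decoded tokens (the function name `map_pieces_to_words` is made up for this illustration too); the notebook itself calls `tokenizer.decode` on token ids instead.

```python
def map_pieces_to_words(pieces, text):
    """Give each piece the index of the whitespace-separated word it belongs to.

    `pieces` is a hypothetical stand-in for decoded tokens: plain strings
    whose concatenation reproduces `text`.
    """
    words = text.split()
    word_ids = []
    word_pos = 0  # index of the word currently being assembled
    start = 0     # first piece of the current word

    for end in range(1, len(pieces) + 1):
        candidate = "".join(pieces[start:end]).strip()
        if candidate == "":
            # whitespace-only piece: it belongs to no word
            word_ids.append(-1)
            start = end
            continue
        word_ids.append(word_pos)
        if candidate == words[word_pos]:
            # the word is complete: move on to the next one
            start = end
            word_pos += 1
    return word_ids


pieces = ["Hel", "lo", " wor", "ld", "\n", " again"]
print(map_pieces_to_words(pieces, "Hello world\n again"))
# [0, 0, 1, 1, -1, 2]
```

The newline piece strips to an empty string, so it gets -1, while every other piece gets the index of the word it helps to build.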

If you have any questions or ideas for improvements, please let me know!

In [None]:
import os

import numpy as np
import pandas as pd

from transformers import LongformerTokenizerFast

from tqdm import tqdm

### Dataloading

In [None]:
# load the train text data

config = {
    'model_name': 'allenai/longformer-base-4096',
    'batch_size': 4,
}

TEXT_FILES = os.listdir('../input/feedback-prize-2021/train')
TEXT_FILES = [f'../input/feedback-prize-2021/train/{file}' for file in TEXT_FILES]

text_data = dict()
for file_path in tqdm(TEXT_FILES):
    with open(file_path, 'r') as file:
        idx = os.path.basename(file_path).split('.txt')[0]
        text_data[idx] = file.read()
        
# strip trailing whitespace from the end of each text
for key, value in text_data.items():
    text_data[key] = value.rstrip()


Here, I am going to use the Longformer tokenizer.

In [None]:
tokenizer = LongformerTokenizerFast.from_pretrained(config['model_name'])

### Tokenize and get token to word mapping

In [None]:
data = []

for idx, text in tqdm(text_data.items()):
    
    # tokenize input text
    inputs = tokenizer(text,
                       add_special_tokens=True,
                       return_offsets_mapping=True,
                       return_length=True)
        
    # create token to word mapping
    words = text.split()
    
    word_ids = [] # list to store token -> word mapping
    word_pos = 0 # index of the word currently being matched

    tokens = inputs['input_ids'][1:-1] # exclude <s> and </s> tokens
    
    # current token positions (used for iteration)
    start = 0
    end = 1

    for _ in tokens:

        decoded_word = tokenizer.decode(tokens[start:end]).strip()

        if decoded_word == '':
            # if the stripped text is an empty string, that token doesn't belong to any word
            word_ids.append(-1)
            start += 1
            end += 1
            continue
        
        # no match
        # continue adding tokens
        if decoded_word != words[word_pos]:
            end += 1
            word_ids.append(word_pos)
        # match    
        else:
            word_ids.append(word_pos)
            start = end
            end = start+1
            word_pos += 1
    
    # add -1 position for the <s> and </s> tokens
    word_ids = [-1] + word_ids + [-1]
    
    data.append([idx, inputs['input_ids'], inputs['attention_mask'], word_ids])
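The loop above asks for `return_offsets_mapping=True` but never uses the offsets. A possible alternative to the decode-and-compare approach, which I have not benchmarked, is to map each token to a word directly from its character span. The sketch below is a different technique from the one in this notebook, and it rests on assumptions: `word_ids_from_offsets` is a hypothetical helper, the offsets are assumed to be in text order, and special tokens are assumed to carry an empty span such as `(0, 0)`. The offsets in the example are written by hand rather than produced by the Longformer tokenizer.

```python
import re


def word_ids_from_offsets(text, offsets):
    """Map (start, end) character spans to whitespace-separated word indices."""
    # character spans of the words, intended to line up with text.split()
    spans = [m.span() for m in re.finditer(r"\S+", text)]
    word_ids = []
    word_pos = 0
    for tok_start, tok_end in offsets:
        # skip whitespace at the start of the span, in case the tokenizer
        # keeps the preceding space inside the token
        while tok_start < tok_end and text[tok_start].isspace():
            tok_start += 1
        if tok_start == tok_end:
            # empty or whitespace-only span: belongs to no word
            word_ids.append(-1)
            continue
        # advance to the first word that ends after this token starts
        while word_pos < len(spans) and spans[word_pos][1] <= tok_start:
            word_pos += 1
        word_ids.append(word_pos if word_pos < len(spans) else -1)
    return word_ids


text = "Hi there\n ok"
# hand-written offsets: special token, "Hi", "the", "re", "\n", "ok", special token
offsets = [(0, 0), (0, 2), (3, 6), (6, 8), (8, 9), (10, 12), (0, 0)]
print(word_ids_from_offsets(text, offsets))
# [-1, 0, 1, 1, -1, 2, -1]
```

If this were tried on the real data, `inputs['offset_mapping']` would be the natural thing to pass as `offsets`, and the result should be compared against the `word_ids` computed above before trusting it.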

In [None]:
longformer_df = pd.DataFrame(data, columns=['id', 'input_ids', 'attention_mask', 'word_ids'])
longformer_df.head()

### Testing

In [None]:
# the number of tokens should be equal to the number of word_ids,
# so this check should return False (no row with mismatched lengths)

(longformer_df['input_ids'].apply(len) != longformer_df['word_ids'].apply(len)).any()
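The length check only shows that every token received some position. A slightly stronger test, sketched here with a hypothetical helper `check_word_ids`, is that the non-negative positions never decrease and that every word index from 0 to the number of words minus one appears at least once.

```python
def check_word_ids(word_ids, n_words):
    """Return True if word_ids cover words 0..n_words-1 in non-decreasing order."""
    assigned = [w for w in word_ids if w != -1]
    # every word index must appear, with no gaps and no extras
    covers_all = sorted(set(assigned)) == list(range(n_words))
    # word positions must never go backwards
    in_order = all(a <= b for a, b in zip(assigned, assigned[1:]))
    return covers_all and in_order


print(check_word_ids([-1, 0, 0, 1, -1, 2, -1], 3))  # True
print(check_word_ids([-1, 0, 2, -1], 3))            # False: word 1 is missing
```

On the notebook's data, this could be applied per row with `len(text_data[row['id']].split())` as `n_words`, assuming `text_data` and `longformer_df` are still in memory.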