# showus: Create a DatasetDict for NER (RoBERTa)

1. Create samples of text from the provided competition data.
2. Label them for NER.
3. Create a train-valid split.
4. Tokenize them.
5. Save the tokenized samples (in the form of a `datasets.dataset_dict.DatasetDict`) to disk.

In [1]:
! pip install /kaggle/input/nlp-packages/datasets/datasets/fsspec-2021.4.0-py3-none-any.whl
! pip install datasets --no-index --find-links=file:///kaggle/input/coleridge-packages/packages/datasets
! pip install ../input/coleridge-packages/seqeval-1.2.2-py3-none-any.whl
! pip install ../input/coleridge-packages/tokenizers-0.10.1-cp37-cp37m-manylinux1_x86_64.whl
! pip install ../input/coleridge-packages/transformers-4.5.0.dev0-py3-none-any.whl

Processing /kaggle/input/nlp-packages/datasets/datasets/fsspec-2021.4.0-py3-none-any.whl
Installing collected packages: fsspec
  Attempting uninstall: fsspec
    Found existing installation: fsspec 0.8.7
    Uninstalling fsspec-0.8.7:
      Successfully uninstalled fsspec-0.8.7
Successfully installed fsspec-2021.4.0
Looking in links: file:///kaggle/input/coleridge-packages/packages/datasets
Processing /kaggle/input/coleridge-packages/packages/datasets/datasets-1.5.0-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/huggingface_hub-0.0.7-py3-none-any.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/xxhash-2.0.0-cp37-cp37m-manylinux2010_x86_64.whl
Processing /kaggle/input/coleridge-packages/packages/datasets/tqdm-4.49.0-py2.py3-none-any.whl
Installing collected packages: tqdm, xxhash, huggingface-hub, datasets
  Attempting uninstall: tqdm
    Found existing installation: tqdm 4.59.0
    Uninstalling tqdm-4.59.0:
      Successf

In [2]:
import sys
from functools import partial
from tokenizers.pre_tokenizers import BertPreTokenizer

sys.path.append('/kaggle/input/showus-package/')
from showus import load_papers, load_train_meta
from showus import get_ner_classlabel, get_ner_data, write_ner_json, batched_write_ner_json
from showus import load_ner_datasets, create_tokenizer, tokenize_and_align_labels

In [3]:
%%time

data_kwargs = dict(sentence_definition='paper', mark_title=True, mark_text=True,
                   pretokenizer=BertPreTokenizer(), max_length=360, overlap=20)
ner_kwargs = dict(classlabel=get_ner_classlabel(), neg_keywords=None, neg_sample_prob=.2)
batch_size = 4_000
model_checkpoint = 'roberta-base'  # alternatives: 'xlm-roberta-base', 'distilbert-base-cased'

print('Loading meta data and text data...')
train_meta = load_train_meta('/kaggle/input/coleridgeinitiative-show-us-the-data/train.csv')
papers = load_papers('/kaggle/input/coleridgeinitiative-show-us-the-data/train', train_meta.Id)

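print('Creating train and valid splits...')
# Positional split without shuffling: the first 5% of rows in train_meta
# become the validation set, the remaining 95% the training set.
valid_cutoff = int(.05 * len(train_meta))
valid_meta = train_meta.iloc[:valid_cutoff].reset_index(drop=True)
train_meta = train_meta.iloc[valid_cutoff:].reset_index(drop=True)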

print('Creating NER data, writing to json files...')
batched_write_ner_json(papers, train_meta, pth='train_ner.json', 
                       batch_size=batch_size, **data_kwargs, **ner_kwargs)

batched_write_ner_json(papers, valid_meta, pth='valid_ner.json', 
                       batch_size=batch_size, **data_kwargs, **ner_kwargs)

print('Tokenizing samples...')
datasets = load_ner_datasets(data_files={'train':'train_ner.json', 'valid':'valid_ner.json'})
tokenizer = create_tokenizer(model_checkpoint)
tokenized_datasets = datasets.map(
    partial(tokenize_and_align_labels, tokenizer=tokenizer, label_all_tokens=True), batched=True)

tokenized_datasets.save_to_disk(f'datasetdict_{model_checkpoint}')

Loading meta data and text data...


Training data size: 5 positives + 20 negatives:   0%|          | 5/4000 [00:00<01:54, 35.00it/s]

Creating train and valid splits...
Creating NER data, writing to json files...
Batch 0...

Training data size: 9852 positives + 20103 negatives: 100%|██████████| 4000/4000 [04:05<00:00, 16.32it/s]
Training data size: 7 positives + 23 negatives:   0%|          | 7/4000 [00:00<01:55, 34.68it/s]

done in 4.165611843268077 mins.
Batch 1...

Training data size: 12 positives + 21 negatives:   0%|          | 7/4000 [00:00<01:26, 46.20it/s]

done in 5.262351679801941 mins.
Batch 2...

Training data size: 9533 positives + 22578 negatives: 100%|██████████| 4000/4000 [05:16<00:00, 12.64it/s]
Training data size: 9172 positives + 19244 negatives: 100%|██████████| 4000/4000 [04:45<00:00, 14.02it/s]
Training data size: 11 positives + 30 negatives:   0%|          | 5/1601 [00:00<00:54, 29.31it/s]

done in 4.8231809616088865 mins.
Batch 3...

Training data size: 3684 positives + 7663 negatives: 100%|██████████| 1601/1601 [01:48<00:00, 14.74it/s]
Training data size: 15 positives + 24 negatives:   1%|          | 8/715 [00:00<00:18, 37.99it/s]

done in 1.8435897747675578 mins.
Batch 0...

Training data size: 1652 positives + 3889 negatives: 100%|██████████| 715/715 [00:37<00:00, 12.53it/s]

done in 0.6640518387158711 mins.
Tokenizing samples...
Downloading and preparing dataset json/default (download: Unknown size, generated: Unknown size, post-processed: Unknown size, total: Unknown size) to /root/.cache/huggingface/datasets/json/default-ee02eab0887e6fef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02...



Dataset json downloaded and prepared to /root/.cache/huggingface/datasets/json/default-ee02eab0887e6fef/0.0.0/83d5b3a2f62630efc6b5315f00f20209b4ad91a00ac586597caee3a4da0bef02. Subsequent calls will reuse this data.

Special tokens have been added in the vocabulary, make sure the associated word embedding are fine-tuned or trained.

CPU times: user 25min 36s, sys: 26.3 s, total: 26min 2s
Wall time: 21min 31s
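
The `max_length=360` and `overlap=20` settings passed in `data_kwargs` suggest that each paper is cut into overlapping windows of pre-tokenized words. How the `showus` package actually applies these parameters is not shown in this notebook, so the following is only a minimal sketch under that assumption; the function name `split_into_windows` is hypothetical and is not part of `showus`.

```python
from typing import List


def split_into_windows(tokens: List[str], max_length: int = 360,
                       overlap: int = 20) -> List[List[str]]:
    """Split a token list into windows of at most `max_length` tokens.

    Consecutive windows share `overlap` tokens, so an entity that straddles
    a window boundary still appears whole in at least one window (as long as
    it is no longer than `overlap` tokens).
    Hypothetical helper: the real `showus` logic may differ.
    """
    if not 0 <= overlap < max_length:
        raise ValueError("overlap must satisfy 0 <= overlap < max_length")
    if not tokens:
        return []

    stride = max_length - overlap  # how far the window start advances each step
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + max_length])
        if start + max_length >= len(tokens):
            break  # this window already reaches the end of the text
        start += stride
    return windows


# Toy input: 1000 dummy words standing in for a pre-tokenized paper.
tokens = [f"w{i}" for i in range(1000)]
windows = split_into_windows(tokens)
print([len(w) for w in windows])           # [360, 360, 320]
print(windows[0][-20:] == windows[1][:20])  # True
```

With a stride of `max_length - overlap = 340`, a 1000-word text yields three windows, which is consistent in spirit with the log above, where a few thousand papers produce tens of thousands of samples.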


In [4]:
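# Sanity check: reload the saved DatasetDict from disk.
# Note: `datasets` here is the DatasetDict created in the previous cell, not the
# `datasets` module; `load_from_disk` is a static method, so the call still works.
datasets.load_from_disk(f'datasetdict_{model_checkpoint}')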

DatasetDict({
    train: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'ner_tags', 'tokens', 'word_ids'],
        num_rows: 101829
    })
    valid: Dataset({
        features: ['attention_mask', 'input_ids', 'labels', 'ner_tags', 'tokens', 'word_ids'],
        num_rows: 5541
    })
})
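
The `ner_tags` column holds one tag per word, while `labels` holds one label per subword token, aligned through `word_ids`. The implementation of `tokenize_and_align_labels` in `showus` is not shown here; the sketch below follows the pattern commonly used in Hugging Face token classification examples and assumes `showus` does something similar. The helper name `align_labels` and the toy inputs are illustrative only.

```python
from typing import List, Optional

# PyTorch's cross-entropy loss ignores targets equal to -100 by default,
# so special tokens labeled this way do not contribute to the loss.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]], word_labels: List[int],
                 label_all_tokens: bool = True) -> List[int]:
    """Expand word-level labels to token-level labels.

    `word_ids[i]` is the index of the word that token i came from,
    or None for special tokens such as <s> and </s>.
    Hypothetical helper: the real `showus` logic may differ.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special token: never scored.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First subword of a word always gets the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later subwords: copy the label, or mask them out.
            aligned.append(word_labels[word_id] if label_all_tokens else IGNORE_INDEX)
        previous = word_id
    return aligned


# Toy example: three words, the second one split into three subword tokens.
word_ids = [None, 0, 1, 1, 1, 2, None]
word_labels = [0, 1, 0]
print(align_labels(word_ids, word_labels, label_all_tokens=True))
# [-100, 0, 1, 1, 1, 0, -100]
print(align_labels(word_ids, word_labels, label_all_tokens=False))
# [-100, 0, 1, -100, -100, 0, -100]
```

With `label_all_tokens=True`, as used in this notebook, every subword of a labeled word is included in the loss; with `False`, only the first subword of each word is.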