Load the wikineural data

In [1]:
from datasets import load_dataset

raw_datasets = load_dataset("Babelscape/wikineural")

Found cached dataset parquet (/home/jhdavis/.cache/huggingface/datasets/Babelscape___parquet/Babelscape--wikineural-579d1dc98d2a6b93/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec)



Store the NER label names, taken from the Hugging Face dataset card, with keys and values inverted so that each integer ID maps to its tag.

In [2]:
label_names = {'O': 0, 'B-PER': 1, 'I-PER': 2, 'B-ORG': 3, 'I-ORG': 4, 'B-LOC': 5, 'I-LOC': 6, 'B-MISC': 7, 'I-MISC': 8}
label_names = {v: k for k, v in label_names.items()}
label_names

{0: 'O',
 1: 'B-PER',
 2: 'I-PER',
 3: 'B-ORG',
 4: 'I-ORG',
 5: 'B-LOC',
 6: 'I-LOC',
 7: 'B-MISC',
 8: 'I-MISC'}

Print out the entity labels for an example sentence.

In [7]:
example_num = 0
words = raw_datasets['train_en'][example_num]['tokens']
labels = raw_datasets['train_en'][example_num]['ner_tags']
line1 = ''
line2 = ''
for word, label in zip(words, labels):
    full_label = label_names[label]
    max_length = max(len(word), len(full_label))
    line1 += word + ' ' * (max_length - len(word) + 1)
    line2 += full_label + ' ' * (max_length - len(full_label) + 1)

print(line1)
print(line2)

This division also contains the Ventana Wilderness , home to the California condor . 
O    O        O    O        O   B-LOC   I-LOC      O O    O  O   B-LOC      O      O 


Create a tokenizer

In [8]:
from transformers import AutoTokenizer

model_checkpoint = "distilbert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

Tokenize an example sentence.

In [9]:
inputs = tokenizer(raw_datasets['train_en'][0]['tokens'], is_split_into_words=True)
print(inputs.tokens())
print(inputs.word_ids())

['[CLS]', 'This', 'division', 'also', 'contains', 'the', 'V', '##ent', '##ana', 'Wilderness', ',', 'home', 'to', 'the', 'California', 'con', '##dor', '.', '[SEP]']
[None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, None]
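
The `word_ids()` output maps each sub-token back to the index of the word it came from, with `None` marking the special tokens `[CLS]` and `[SEP]`. As a minimal sketch, using the tokens and word IDs printed above as hard-coded inputs, the pieces of each word can be regrouped as follows. The helper name `group_tokens_by_word` is hypothetical and not part of `transformers`.

```python
# Tokens and word IDs copied from the output above.
tokens = ['[CLS]', 'This', 'division', 'also', 'contains', 'the', 'V', '##ent',
          '##ana', 'Wilderness', ',', 'home', 'to', 'the', 'California', 'con',
          '##dor', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 4, 5, 5, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, None]


def group_tokens_by_word(tokens, word_ids):
    """Collect the sub-tokens that belong to each word index (hypothetical helper)."""
    groups = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] belong to no word.
            continue
        groups.setdefault(word_id, []).append(token)
    return groups


groups = group_tokens_by_word(tokens, word_ids)
print(groups[5])   # pieces of the sixth word
print(groups[12])  # pieces of the thirteenth word
```

Only words 5 and 12 are split into several pieces here, which is why the word-level labels need to be expanded before training.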


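Set up a function that expands word-level labels to token-level labels. Special tokens get -100, which the loss ignores, and continuation sub-tokens of a word labeled B- get the matching I- label.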

In [10]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # start of new word
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # special token
            new_labels.append(-100)
        else:
            # same word as previous token
            label = labels[word_id]
            # if the label is B- we change to I-
            if label % 2 == 1:
                label += 1
            new_labels.append(label)
        
    return new_labels
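
The `label % 2 == 1` test works only because, in this label scheme, every `B-` tag has an odd ID and its `I-` counterpart is the next integer. Below is a minimal sketch of a sanity check for that assumption. The label map is copied from the cell near the top, and `b_to_i_is_plus_one` is a hypothetical helper name.

```python
label_names = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG',
               5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


def b_to_i_is_plus_one(id_to_name):
    """Return True if exactly the odd IDs are B- tags and each B-X is followed by I-X."""
    for idx, name in id_to_name.items():
        is_begin = name.startswith('B-')
        # The parity trick requires that B- tags are exactly the odd IDs.
        if is_begin != (idx % 2 == 1):
            return False
        # Adding 1 to a B- ID must land on the I- tag of the same entity type.
        if is_begin and id_to_name.get(idx + 1) != 'I-' + name[2:]:
            return False
    return True


print(b_to_i_is_plus_one(label_names))
```

If a dataset orders its tags differently, this check fails and the alignment function would need an explicit B-to-I lookup instead of the parity trick.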

Apply our new function to an example.

In [11]:
labels = raw_datasets['train_en'][0]['ner_tags']
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[0, 0, 0, 0, 0, 5, 6, 0, 0, 0, 0, 5, 0, 0]
[-100, 0, 0, 0, 0, 0, 5, 6, 6, 6, 0, 0, 0, 0, 5, 0, 0, 0, -100]


Create a function that tokenizes a batch of examples and aligns the labels for each one.

In [12]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples['tokens'], truncation=True, is_split_into_words=True
    )
    all_labels = examples['ner_tags']
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))
    
    tokenized_inputs['labels'] = new_labels
    return tokenized_inputs

Apply the batch alignment function to every split of the dataset.

In [13]:
tokenized_datasets = raw_datasets.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns=raw_datasets['train_en'].column_names
)

Loading cached processed dataset at /home/jhdavis/.cache/huggingface/datasets/Babelscape___parquet/Babelscape--wikineural-579d1dc98d2a6b93/0.0.0/2a3b91fbd88a2c90d1dbbb32b460cf621d31bd5b05b934492fdef7d8d6f236ec/cache-e19219bcdc6b3f08.arrow



Collate data into batches with appropriate padding.

In [14]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
batch = data_collator([tokenized_datasets['train_en'][i] for i in range(2)])
batch['labels']

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


tensor([[-100,    0,    0,    0,    0,    0,    5,    6,    6,    6,    0,    0,
            0,    0,    5,    0,    0,    0, -100, -100, -100],
        [-100,    0,    0,    0,    0,    0,    0,    3,    0,    0,    0,    0,
            7,    8,    0,    0,    7,    8,    0,    0, -100]])
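
In the tensor above, the shorter first example ends with two extra -100 values so that both rows have the same length. The sketch below imitates that label padding in plain Python. It is an illustration under the assumption of right-padding with -100 (the value PyTorch's cross-entropy loss ignores by default), not the collator's actual implementation, and `pad_labels` is a hypothetical helper.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def pad_labels(label_lists, pad_value=IGNORE_INDEX):
    """Right-pad every label list to the length of the longest one."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_value] * (max_len - len(labels)) for labels in label_lists]


# Aligned labels of the first two training examples, copied from the outputs above.
first = [-100, 0, 0, 0, 0, 0, 5, 6, 6, 6, 0, 0, 0, 0, 5, 0, 0, 0, -100]
second = [-100, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 7, 8, 0, 0, 7, 8, 0, 0, -100]

padded = pad_labels([first, second])
for row in padded:
    print(len(row), row)
```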

Set up evaluation metric.

In [15]:
import evaluate

metric = evaluate.load("seqeval")

Test out evaluation metric on an example.

In [16]:
labels = raw_datasets['train_en'][0]['ner_tags']
labels = [label_names[i] for i in labels]
predictions = labels.copy()
predictions[5] = 'O'
metric.compute(predictions=[predictions], references=[labels])

{'LOC': {'precision': 0.5, 'recall': 0.5, 'f1': 0.5, 'number': 2},
 'overall_precision': 0.5,
 'overall_recall': 0.5,
 'overall_f1': 0.5,
 'overall_accuracy': 0.9285714285714286}
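
Changing a single tag drops precision and recall to 0.5 even though token accuracy stays near 0.93, because the scores are computed over whole entities rather than tokens. The sketch below reproduces the two 0.5 values under one stated assumption: an `I-` tag with no preceding `B-` still opens an entity. That lenient convention is consistent with the numbers above, but the code is an illustration and not the metric's implementation. `extract_spans` and `precision_recall` are hypothetical helpers.

```python
def extract_spans(tags):
    """Return a set of (entity_type, start, end) spans from a BIO tag sequence."""
    spans = set()
    start, ent_type = None, None
    # A trailing 'O' sentinel closes any span still open at the end.
    for i, tag in enumerate(list(tags) + ['O']):
        prefix, _, typ = tag.partition('-')
        # Close the open span unless this tag continues it.
        if start is not None and (prefix != 'I' or typ != ent_type):
            spans.add((ent_type, start, i - 1))
            start, ent_type = None, None
        # Open a span on B-, or on an orphan I- (lenient assumption).
        if prefix in ('B', 'I') and start is None:
            start, ent_type = i, typ
    return spans


def precision_recall(pred_tags, gold_tags):
    """Entity-level precision and recall from exact span matches."""
    pred, gold = extract_spans(pred_tags), extract_spans(gold_tags)
    correct = len(pred & gold)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


gold = ['O'] * 5 + ['B-LOC', 'I-LOC'] + ['O'] * 4 + ['B-LOC', 'O', 'O']
pred = gold.copy()
pred[5] = 'O'  # same corruption as in the cell above

print(extract_spans(gold))
print(extract_spans(pred))
print(precision_recall(pred, gold))
```

The corrupted prediction still contains two spans, but only the single-word one matches the reference exactly, so one of two predicted entities and one of two reference entities count as correct.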

Create function to compute metrics over predictions.

In [17]:
import numpy as np

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    
    # remove special tokens which are ignored, and convert to labels
    true_labels = [[label_names[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [label_names[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        'precision': all_metrics['overall_precision'],
        'recall': all_metrics['overall_recall'],
        'f1': all_metrics['overall_f1'],
        'accuracy': all_metrics['overall_accuracy']
    }
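
The filtering step inside `compute_metrics` can be seen on a toy batch. This is a minimal pure-Python sketch: the IDs are made up for illustration, and `strip_ignored` is a hypothetical helper that mirrors the two list comprehensions above.

```python
IGNORE_INDEX = -100
label_names = {0: 'O', 1: 'B-PER', 2: 'I-PER', 3: 'B-ORG', 4: 'I-ORG',
               5: 'B-LOC', 6: 'I-LOC', 7: 'B-MISC', 8: 'I-MISC'}


def strip_ignored(pred_ids, label_ids, id_to_name):
    """Drop positions whose label is -100 and convert the remaining IDs to tag names."""
    true_preds, true_labels = [], []
    for preds, labels in zip(pred_ids, label_ids):
        kept = [(p, l) for p, l in zip(preds, labels) if l != IGNORE_INDEX]
        true_preds.append([id_to_name[p] for p, _ in kept])
        true_labels.append([id_to_name[l] for _, l in kept])
    return true_preds, true_labels


# Made-up example: the first and last positions are special tokens.
pred_ids = [[0, 5, 6, 0, 3]]
label_ids = [[-100, 5, 6, 0, -100]]

preds, labels = strip_ignored(pred_ids, label_ids, label_names)
print(preds)
print(labels)
```

Whatever the model predicts at a -100 position is discarded, so special tokens and padding never affect the scores.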

Define model to fine-tune.

In [18]:
# provide mappings between labels and IDs
# label_names already maps ID -> tag name; enumerating the dict would yield its integer keys
id2label = dict(label_names)
label2id = {v: k for k, v in id2label.items()}

from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

model.config.num_labels

Some weights of the model checkpoint at distilbert-base-cased were not used when initializing DistilBertForTokenClassification: ['vocab_transform.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this 

9

Log in to the Hugging Face Hub so the trained model can be pushed.

In [19]:
from huggingface_hub import notebook_login

notebook_login()


Set the training arguments: evaluate and save every epoch, and push the result to the Hub.

In [20]:
from transformers import TrainingArguments

args = TrainingArguments(
    'bert-finetuned-ner',
    evaluation_strategy='epoch',
    save_strategy='epoch',
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

Create the trainer and fine-tune the model.

In [21]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets['train_en'],
    eval_dataset=tokenized_datasets['test_en'],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

/home/jhdavis/repos/cmsc828a-group-timber/hw1/bert-finetuned-ner is already a clone of https://huggingface.co/jhdavis/bert-finetuned-ner. Make sure you pull the latest changes with `repo.git_pull()`.
***** Running training *****
  Num examples = 92720
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 34770
  Number of trainable parameters = 65197833
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"
[34m[1mwandb[0m: Currently logged in as: [33mjhdavis[0m. Use [1m`wandb login --relogin`[0m to force relogin


Epoch  Training Loss  Validation Loss  Precision  Recall    F1        Accuracy
1      0.0488         0.047597         0.893307   0.915781  0.904405  0.984317
2      0.0313         0.048091         0.901214   0.920407  0.910709  0.98537
3      0.0139         0.053014         0.908696   0.925693  0.917116  0.985727


***** Running Evaluation *****
  Num examples = 11597
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-11590
Configuration saved in bert-finetuned-ner/checkpoint-11590/config.json
Model weights saved in bert-finetuned-ner/checkpoint-11590/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-11590/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-11590/special_tokens_map.json
tokenizer config file saved in bert-finetuned-ner/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 11597
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-23180
Configuration saved in bert-finetuned-ner/checkpoint-23180/config.json
Model weights saved in bert-finetuned-ner/checkpoint-23180/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-23180/tokenizer_config.json
Special tokens file saved

TrainOutput(global_step=34770, training_loss=0.03747569308529169, metrics={'train_runtime': 2265.3474, 'train_samples_per_second': 122.789, 'train_steps_per_second': 15.349, 'total_flos': 3742592772325680.0, 'train_loss': 0.03747569308529169, 'epoch': 3.0})
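
The step counts and throughput figures in the training log follow from the dataset size, batch size, and runtime. The sketch below redoes that arithmetic with the values copied from the log. It assumes a single device and no gradient accumulation, as the log reports, and that a final partial batch would be kept (there is none here, since the division is exact).

```python
import math

# Values copied from the training log above.
num_examples = 92720
batch_size = 8
num_epochs = 3
train_runtime = 2265.3474  # seconds

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch)   # matches the first checkpoint name, checkpoint-11590
print(total_steps)       # matches "Total optimization steps"
print(round(samples_per_second, 3))
print(round(steps_per_second, 3))
```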

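Push to model hub. This cell must run in the same kernel session as training. The cell counter below has reset to 1, so the NameError most likely comes from running it after a kernel restart, when `trainer` no longer exists.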

In [1]:
trainer.push_to_hub(commit_message="Training complete")

NameError: name 'trainer' is not defined