---------------------------------

<a href="https://www.youtube.com/watch?v=dzyDHMycx_c&list=PLxqBkZuBynVQEvXfJpq3smfuKq3AiNW-N&index=18"><h1 style="font-size:250%; font-family:cursive; color:#ff6666;"><b>Link YouTube Video - Fine Tuning BERT for Named Entity Recognition (NER) | NLP</b></h1></a>

[![IMAGE ALT TEXT](https://imgur.com/O7UNR3C.png)](https://bit.ly/3mXnKGH "")


## [Dataset in HuggingFace](https://huggingface.co/datasets/conll2003)

## First, what is BERT?

BERT stands for Bidirectional Encoder Representations from Transformers. The name itself gives us several clues to what BERT is all about.

The BERT architecture consists of several Transformer encoders stacked together. Each Transformer encoder contains two sub-layers: a self-attention layer and a feed-forward layer.

### There are two different BERT models:

- BERT base, which consists of 12 Transformer encoder layers, 12 attention heads, a hidden size of 768, and 110M parameters.

- BERT large, which consists of 24 Transformer encoder layers, 16 attention heads, a hidden size of 1024, and 340M parameters.



### BERT Input and Output

The BERT model expects a sequence of tokens (words) as input. In each sequence of tokens, there are two special tokens that BERT expects:

- [CLS]: This is the first token of every sequence. It stands for "classification token".
- [SEP]: This is the token that tells BERT which token belongs to which sequence. This special token is mainly important for a next sentence prediction task or a question-answering task. If we only have one sequence, this token is appended to the end of the sequence.


It is also important to note that the maximum number of tokens that can be fed into the BERT model is 512. If a sequence has fewer than 512 tokens, we can use padding to fill the unused token slots with the [PAD] token. If a sequence has more than 512 tokens, we need to truncate it.

And that’s all that BERT expects as input.
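The input rules above can be sketched in a few lines of plain Python. This is only an illustration: the helper name `build_bert_input` and the tiny `MAX_LEN` are assumptions made for readability, and in practice the tokenizer does this work for us.

```python
MAX_LEN = 8  # assumption: BERT's real limit is 512; a tiny value keeps the output readable


def build_bert_input(tokens, max_len=MAX_LEN):
    """Wrap tokens in [CLS]/[SEP], truncate to max_len, then pad with [PAD]."""
    # Reserve two slots for the special tokens before truncating.
    body = tokens[: max_len - 2]
    sequence = ["[CLS]"] + body + ["[SEP]"]
    # 1 marks a real token, 0 marks padding (the attention mask).
    attention_mask = [1] * len(sequence)
    padding = max_len - len(sequence)
    sequence += ["[PAD]"] * padding
    attention_mask += [0] * padding
    return sequence, attention_mask


short_seq, short_mask = build_bert_input(["eu", "rejects", "german", "call"])
long_seq, long_mask = build_bert_input(["a", "b", "c", "d", "e", "f", "g", "h", "i"])
print(short_seq, short_mask)
print(long_seq, long_mask)
```

The short sequence is padded up to `MAX_LEN`, while the long one is cut so that [CLS] and [SEP] still fit.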

The BERT model then outputs an embedding vector of size 768 (for BERT base) for each token. We can use these vectors as input for different kinds of NLP applications, whether it is text classification, next sentence prediction, Named-Entity-Recognition (NER), or question-answering.


------------

**For a text classification task**, we focus on the embedding vector output for the special [CLS] token. This means that we use the embedding vector of size 768 from the [CLS] token as input to our classifier, which then outputs a vector whose size is the number of classes in our classification task.

-----------------------

![Imgur](https://imgur.com/NpeB9vb.png)

-------------------------

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
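# Note: `!cd` runs in a subshell, so the notebook's working directory stays /content.
# Use `%cd` instead if later cells should run inside this folder.
!cd /content/drive/MyDrive/bert-ner-transformers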

In [3]:
!pip uninstall gcsfs -y
!pip install gcsfs==2024.9.0

Found existing installation: gcsfs 2024.10.0
Uninstalling gcsfs-2024.10.0:
  Successfully uninstalled gcsfs-2024.10.0
Collecting gcsfs==2024.9.0
  Downloading gcsfs-2024.9.0-py2.py3-none-any.whl.metadata (1.6 kB)
Collecting fsspec==2024.6.1 (from gcsfs==2024.9.0)
  Downloading fsspec-2024.6.1-py3-none-any.whl.metadata (11 kB)
Reason for being yanked: requirements incorrect
Downloading gcsfs-2024.9.0-py2.py3-none-any.whl (34 kB)
Downloading fsspec-2024.6.1-py3-none-any.whl (177 kB)
Installing collected packages: fsspec, gcsfs
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2024.10.0
    Uninstalling fsspec-2024.10.0:
      Successfully uninstalled fsspec-2024.10.0
Successfully installed fsspec-2024.6.1 gcsfs-2024.9.0


In [4]:
!pip install transformers datasets tokenizers seqeval -q


# Token classification

The first application we’ll explore is token classification. This generic task encompasses any problem that can be formulated as “attributing a label to each token in a sentence,” such as:

**Named entity recognition (NER):** Find the entities (such as persons, locations, or organizations) in a sentence. This can be formulated as attributing a label to each token by having one class per entity and one class for “no entity.”

**Part-of-speech tagging (POS):** Mark each word in a sentence as corresponding to a particular part of speech (such as noun, verb, adjective, etc.).

**Chunking:** Find the tokens that belong to the same entity. This task (which can be combined with POS or NER) can be formulated as attributing one label (usually B-) to any tokens that are at the beginning of a chunk, another label (usually I-) to tokens that are inside a chunk, and a third label (usually O) to tokens that don’t belong to any chunk.

* O means the word doesn’t correspond to any entity.
* B-PER/I-PER means the word corresponds to the beginning of/is inside a person entity.
* B-ORG/I-ORG means the word corresponds to the beginning of/is inside an organization entity.
* B-LOC/I-LOC means the word corresponds to the beginning of/is inside a location entity.
* B-MISC/I-MISC means the word corresponds to the beginning of/is inside a miscellaneous entity.
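To make the B-/I-/O scheme concrete, here is a minimal sketch that groups tagged tokens back into entities. The helper `iob_to_spans` and the hand-written tags are assumptions for illustration, not part of the dataset or of any library.

```python
def iob_to_spans(tokens, tags):
    """Group B-/I-/O tagged tokens into (entity_type, text) spans."""
    spans = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B-")
        continues = tag.startswith("I-") and tag[2:] == current_type
        # Anything that does not continue the open span closes it.
        if not continues and current_tokens:
            spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
        if starts_new:
            current_type, current_tokens = tag[2:], [token]
        elif continues:
            current_tokens.append(token)
        # A stray I- tag with no matching open span is ignored in this sketch.
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = ["Bill", "Gates", "is", "the", "Founder", "of", "Microsoft"]
tags = ["B-PER", "I-PER", "O", "O", "O", "O", "B-ORG"]  # hand-written tags
spans = iob_to_spans(tokens, tags)
print(spans)
```

Here `B-PER` opens a person entity, `I-PER` extends it, and the first `O` closes it.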

In [5]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Enter your token (input will not be visible): 
Add token as git credential? (Y/n) y
Token is valid (permission: fineGrained).
The token `bert-ner` has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-aut

### Reference: https://huggingface.co/course/chapter7/2

In [6]:
import datasets
import numpy as np
from transformers import BertTokenizerFast
from transformers import DataCollatorForTokenClassification
from transformers import AutoModelForTokenClassification

conll2003 = datasets.load_dataset("conll2003", trust_remote_code=True)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [7]:
conll2003

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 14041
    })
    validation: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3250
    })
    test: Dataset({
        features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
        num_rows: 3453
    })
})

In [8]:
conll2003.shape

{'train': (14041, 5), 'validation': (3250, 5), 'test': (3453, 5)}

In [9]:
conll2003["train"][0]

{'id': '0',
 'tokens': ['EU',
  'rejects',
  'German',
  'call',
  'to',
  'boycott',
  'British',
  'lamb',
  '.'],
 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7],
 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0],
 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}

In [10]:
conll2003["train"].features["ner_tags"]

Sequence(feature=ClassLabel(names=['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC'], id=None), length=-1, id=None)
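The integer `ner_tags` index into the `names` list of this `ClassLabel`. As a quick sanity check, the sketch below decodes the first training example by hand, with the values copied from the cell outputs above (the variable names are illustrative):

```python
# Label names as reported by the ClassLabel feature above.
label_names = ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

# First training example, copied from the output of conll2003["train"][0].
tokens = ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
ner_tags = [3, 0, 7, 0, 0, 0, 7, 0, 0]

# Each integer tag is simply an index into label_names.
decoded = [label_names[i] for i in ner_tags]
for token, label in zip(tokens, decoded):
    print(f"{token:<10} {label}")
```

So `EU` is tagged as the beginning of an organization, and `German` and `British` as miscellaneous entities.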

In [11]:
conll2003['train'].description

'The shared task of CoNLL-2003 concerns language-independent named entity recognition. We will concentrate on\nfour types of named entities: persons, locations, organizations and names of miscellaneous entities that do\nnot belong to the previous three groups.\n\nThe CoNLL-2003 shared task data files contain four columns separated by a single space. Each word has been put on\na separate line and there is an empty line after each sentence. The first item on each line is a word, the second\na part-of-speech (POS) tag, the third a syntactic chunk tag and the fourth the named entity tag. The chunk tags\nand the named entity tags have the format I-TYPE which means that the word is inside a phrase of type TYPE. Only\nif two phrases of the same type immediately follow each other, the first word of the second phrase will have tag\nB-TYPE to show that it starts a new phrase. A word with tag O is not part of a phrase. Note the dataset uses IOB2\ntagging scheme, whereas the original dataset uses 

In [12]:
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")


# Problem of consecutive subwords.

Transformers are often pretrained with subword tokenizers, so even if your inputs have already been split into words, each of those words may be split again by the tokenizer.

This means we need to do some processing on our labels, because the input ids returned by the tokenizer are longer than the lists of labels our dataset contains.

This happens for two reasons: first, special tokens are added (a [CLS] at the start and a [SEP] at the end), and second, some words may be split into multiple tokens.

## Strategy to handle the above

Here we set the labels of all special tokens to -100 (the index that is ignored by PyTorch) and the labels of all other tokens to the label of the word they come from. Another strategy is to set the label only on the first token obtained from a given word, and give a label of -100 to the other subtokens from the same word. Both strategies are supported here; switch between them with the `label_all_tokens` flag of `tokenize_and_align_labels()`.

-----------------------------------

### Setting -100 as the label for the special tokens and the subwords we wish to mask during training:

Why did we choose -100 as the ID to mask subword representations? The reason is
that in PyTorch the cross-entropy loss class `torch.nn.CrossEntropyLoss` has an
argument called `ignore_index` whose default value is -100. Targets equal to this
index are ignored when the loss is computed, so they contribute nothing to training.

We can also use it to ignore the tokens associated with consecutive subwords.
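To see what "ignored" means in practice, here is a small pure-Python sketch of a cross-entropy that skips masked targets. It only mirrors the idea behind `ignore_index`; the function `masked_cross_entropy` and the toy logits are assumptions for illustration, not PyTorch's implementation.

```python
import math

IGNORE_INDEX = -100  # same sentinel value used for masked labels in this notebook


def masked_cross_entropy(logits, targets, ignore_index=IGNORE_INDEX):
    """Mean cross-entropy over the positions whose target is not ignore_index."""
    total, counted = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        # Negative log-softmax of the target class, computed in a stable way.
        peak = max(row)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[target]
        counted += 1
    # This sketch returns 0.0 when every position is masked.
    return total / counted if counted else 0.0


logits = [[2.0, 0.5, 0.1], [0.2, 3.0, 0.3], [1.0, 1.0, 1.0]]
loss_two_tokens = masked_cross_entropy(logits[:2], [0, 1])
loss_with_masked_third = masked_cross_entropy(logits, [0, 1, -100])
print(loss_two_tokens, loss_with_masked_third)
```

Both calls give the same value, because the third position carries the label -100 and is skipped.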

-----------------------------------

## The cells below just check the output of some variables before applying `tokenize_and_align_labels()`

In [13]:
example_text = conll2003['train'][0]

tokenized_input = tokenizer(example_text["tokens"], is_split_into_words=True)

tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])

word_ids = tokenized_input.word_ids()

print(word_ids)

# As we can see, word_ids() returns a list with the same number of elements as our processed input ids,
# mapping special tokens to None and all other tokens to their respective word.
# This way, we can align the labels with the processed input ids.

tokenized_input

[None, 0, 1, 2, 3, 4, 5, 6, 7, 8, None]


{'input_ids': [101, 7327, 19164, 2446, 2655, 2000, 17757, 2329, 12559, 1012, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

## Problem of sub-tokens - the input ids returned by the tokenizer are longer than the lists of labels our dataset contains.

In [14]:
len(example_text['ner_tags']), len(tokenized_input["input_ids"])
# (9, 11)

(9, 11)

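## The function `tokenize_and_align_labels` below does 2 jobs

1. It sets -100 as the label for the special tokens, so they are ignored during training.
2. It labels the subwords after the first subword of each word: with the word's own label when `label_all_tokens=True` (the default here), or with -100 to mask them when `label_all_tokens=False`.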


### Then we align the labels with the token ids using the strategy we picked:

In [15]:
def tokenize_and_align_labels(examples, label_all_tokens=True):
    """
    Function to tokenize and align labels with respect to the tokens. This function is specifically designed for
    Named Entity Recognition (NER) tasks where alignment of the labels is necessary after tokenization.

    Parameters:
    examples (dict): A dictionary containing the tokens and the corresponding NER tags.
                     - "tokens": list of words in a sentence.
                     - "ner_tags": list of corresponding entity tags for each word.

    label_all_tokens (bool): A flag to indicate whether all tokens should have labels.
                             If False, only the first token of a word will have a label,
                             the other tokens (subwords) corresponding to the same word will be assigned -100.

    Returns:
    tokenized_inputs (dict): A dictionary containing the tokenized inputs and the corresponding labels aligned with the tokens.
    """
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # word_ids() => Return a list mapping the tokens
        # to their actual word in the initial sentence.
        # It Returns a list indicating the word corresponding to each token.
        previous_word_idx = None
        label_ids = []
        # Special tokens like [CLS] and [SEP] are mapped to None by word_ids().
        # We need to set the label to -100 so they are automatically ignored in the loss function.
        for word_idx in word_ids:
            if word_idx is None:
                # set -100 as the label for these special tokens
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # first token of a new word (the most common case):
                # use the label of the word it comes from
                label_ids.append(label[word_idx])
            else:
                # remaining sub-word tokens share the word_idx of the previous token:
                # keep the word's label if label_all_tokens is True, otherwise mask with -100
                label_ids.append(label[word_idx] if label_all_tokens else -100)

            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
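The alignment logic can be tried without a tokenizer. The sketch below repeats the inner loop on hand-written word ids; the helper `align_labels` and the sample `word_ids` are assumptions for illustration (the ids mimic a name split into three sub-tokens).

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Map word-level labels onto tokens, given the word index of each token."""
    aligned = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            aligned.append(-100)  # special token
        elif word_idx != previous_word_idx:
            aligned.append(word_labels[word_idx])  # first token of a word
        else:
            # later sub-token of the same word
            aligned.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return aligned


# Hand-written word ids for "[CLS] werner z ##wing ##mann [SEP]".
word_ids = [None, 0, 1, 1, 1, None]
word_labels = [1, 2]  # B-PER, I-PER

labels_all = align_labels(word_ids, word_labels, label_all_tokens=True)
labels_first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(labels_all)
print(labels_first_only)
```

With `label_all_tokens=False`, only the first sub-token of each word keeps its label and the rest are masked.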

In [16]:
q = tokenize_and_align_labels(conll2003['train'][4:5])
print(q)

{'input_ids': [[101, 2762, 1005, 1055, 4387, 2000, 1996, 2647, 2586, 1005, 1055, 15651, 2837, 14121, 1062, 9328, 5804, 2056, 2006, 9317, 10390, 2323, 4965, 8351, 4168, 4017, 2013, 3032, 2060, 2084, 3725, 2127, 1996, 4045, 6040, 2001, 24509, 1012, 102]], 'token_type_ids': [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 'attention_mask': [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], 'labels': [[-100, 5, 0, 0, 0, 0, 0, 3, 4, 0, 0, 0, 0, 1, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, -100]]}


### So before applying the `tokenize_and_align_labels()` the `tokenized_input` has 3 keys
- input_ids
- token_type_ids
- attention_mask

But after applying `tokenize_and_align_labels()` we have an extra key - `'labels'`


===================================

In [17]:
for token, label in zip(tokenizer.convert_ids_to_tokens(q["input_ids"][0]),q["labels"][0]):
    print(f"{token:_<40} {label}")

[CLS]___________________________________ -100
germany_________________________________ 5
'_______________________________________ 0
s_______________________________________ 0
representative__________________________ 0
to______________________________________ 0
the_____________________________________ 0
european________________________________ 3
union___________________________________ 4
'_______________________________________ 0
s_______________________________________ 0
veterinary______________________________ 0
committee_______________________________ 0
werner__________________________________ 1
z_______________________________________ 2
##wing__________________________________ 2
##mann__________________________________ 2
said____________________________________ 0
on______________________________________ 0
wednesday_______________________________ 0
consumers_______________________________ 0
should__________________________________ 0
buy_____________________________________ 0
sheep___

In [18]:
tokenized_datasets = conll2003.map(tokenize_and_align_labels, batched=True)


In [19]:

model = AutoModelForTokenClassification.from_pretrained("bert-base-uncased", num_labels=9)


Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [20]:
from transformers import TrainingArguments, Trainer
args = TrainingArguments(
    "test-ner",
    evaluation_strategy="epoch",  # newer transformers releases name this argument `eval_strategy`
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
)



In [21]:
data_collator = DataCollatorForTokenClassification(tokenizer)

In [22]:
!pip install evaluate

Collecting evaluate
  Downloading evaluate-0.4.3-py3-none-any.whl.metadata (9.2 kB)
Downloading evaluate-0.4.3-py3-none-any.whl (84 kB)
Installing collected packages: evaluate
Successfully installed evaluate-0.4.3


In [23]:
import evaluate
metric = evaluate.load("seqeval")


In [24]:
# metric = datasets.load_metric("seqeval")

In [25]:
example = conll2003['train'][0]

In [26]:
label_list = conll2003["train"].features["ner_tags"].feature.names

label_list

['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']

In [27]:


labels = [label_list[i] for i in example["ner_tags"]]

metric.compute(predictions=[labels], references=[labels])

{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'ORG': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

## seqeval - the package expects lists of lists

The seqeval package expects the predictions and labels as lists of lists, with
each list corresponding to a single example in our validation or test sets. To
integrate these metrics during training, we need a function that can take the
outputs of the model and convert them into the lists that seqeval expects.

The following does the trick by ensuring we ignore the label IDs associated with
subsequent subwords:

## Compute Metrics

This compute_metrics() function first takes the argmax of the logits to convert them to predictions (as usual, the logits and the probabilities are in the same order, so we don’t need to apply the softmax). Then we have to convert both labels and predictions from integers to strings. We remove all the values where the label is -100, then pass the results to the metric.compute() method:

In [28]:
def compute_metrics(eval_preds):
    """
    Function to compute the evaluation metrics for Named Entity Recognition (NER) tasks.
    The function computes precision, recall, F1 score and accuracy.

    Parameters:
    eval_preds (tuple): A tuple containing the predicted logits and the true labels.

    Returns:
    A dictionary containing the precision, recall, F1 score and accuracy.
    """
    pred_logits, labels = eval_preds

    # the logits and the probabilities are in the same order,
    # so we don't need to apply the softmax before taking the argmax
    pred_ids = np.argmax(pred_logits, axis=2)

    # We remove all the values where the label is -100
    predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]

    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(pred_ids, labels)
    ]
    results = metric.compute(predictions=predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }
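The filtering step is easier to follow on a tiny input. The following pure-Python sketch applies the same argmax-and-filter logic to one hypothetical sentence; the shortened label set, the logits and the `argmax` helper are all made up for illustration.

```python
label_list = ["O", "B-PER", "I-PER"]  # shortened label set, for illustration only

# One sentence, four token positions, three scores per position (hypothetical logits).
logits = [[
    [0.1, 0.2, 0.3],  # [CLS]   -> label -100, dropped
    [0.2, 2.5, 0.1],  # "bill"  -> argmax 1 (B-PER)
    [0.3, 0.4, 1.9],  # "gates" -> argmax 2 (I-PER)
    [0.9, 0.1, 0.2],  # [SEP]   -> label -100, dropped
]]
labels = [[-100, 1, 2, -100]]


def argmax(scores):
    """Index of the largest score (stand-in for np.argmax on one row)."""
    return max(range(len(scores)), key=scores.__getitem__)


pred_ids = [[argmax(scores) for scores in sentence] for sentence in logits]

# Keep only the positions whose gold label is not -100, as compute_metrics() does.
predictions = [
    [label_list[p] for p, l in zip(ps, ls) if l != -100]
    for ps, ls in zip(pred_ids, labels)
]
true_labels = [
    [label_list[l] for p, l in zip(ps, ls) if l != -100]
    for ps, ls in zip(pred_ids, labels)
]
print(predictions, true_labels)
```

Both results are lists of lists of label strings, which is the shape seqeval expects.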

### `predictions` is a long list of lists of label strings, like the one below

```
[['O', 'O', 'B-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'], ['B-LOC', 'O', 'O', 'O', 'O', 'O'], ['B-MISC', 'I-MISC', 'O', 'O', 'O', 'O', 'B-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'O', 'B-ORG', 'O', ['O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'B-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'B-ORG', 'B-ORG', 'I-ORG', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O', 'O'], ['O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O'],

---

---

, ['O', 'O', 'B-LOC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']]

```

In [29]:
trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)



In [30]:
trainer.train()


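| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|-------|---------------|-----------------|-----------|--------|----|----------|
| 1 | 0.2146 | 0.064176 | 0.916306 | 0.924712 | 0.92049 | 0.982001 |
| 2 | 0.0465 | 0.056843 | 0.930202 | 0.939255 | 0.934706 | 0.984813 |
| 3 | 0.0274 | 0.056498 | 0.933046 | 0.94317 | 0.938081 | 0.985385 |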


TrainOutput(global_step=2634, training_loss=0.07590922525459469, metrics={'train_runtime': 557.1835, 'train_samples_per_second': 75.6, 'train_steps_per_second': 4.727, 'total_flos': 1020143109346326.0, 'train_loss': 0.07590922525459469, 'epoch': 3.0})
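The `global_step` value can be checked with simple arithmetic from the dataset size and the training arguments. This sketch assumes a single device and no gradient accumulation, which the notebook does not state explicitly.

```python
import math

train_examples = 14041  # rows in the train split
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs

# The last batch of each epoch is partial, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```

Under those assumptions the total matches the `global_step=2634` reported by `TrainOutput`.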

In [31]:
model.save_pretrained("ner_model")

In [32]:
tokenizer.save_pretrained("tokenizer")

('tokenizer/tokenizer_config.json',
 'tokenizer/special_tokens_map.json',
 'tokenizer/vocab.txt',
 'tokenizer/added_tokens.json',
 'tokenizer/tokenizer.json')

In [33]:
id2label = {
    str(i): label for i, label in enumerate(label_list)
}
# label2id maps each label name back to its integer id
label2id = {
    label: i for i, label in enumerate(label_list)
}

In [34]:
import json

In [35]:
with open("ner_model/config.json") as f:
    config = json.load(f)

In [36]:
config["id2label"] = id2label
config["label2id"] = label2id

In [37]:
with open("ner_model/config.json", "w") as f:
    json.dump(config, f)

In [38]:
model_fine_tuned = AutoModelForTokenClassification.from_pretrained("ner_model")

In [39]:
from transformers import pipeline

In [40]:
nlp = pipeline("ner", model=model_fine_tuned, tokenizer=tokenizer)


example = "Bill Gates is the Founder of Microsoft"

ner_results = nlp(example)

print(ner_results)

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


[{'entity': 'B-PER', 'score': 0.9960288, 'index': 1, 'word': 'bill', 'start': 0, 'end': 4}, {'entity': 'I-PER', 'score': 0.99507254, 'index': 2, 'word': 'gates', 'start': 5, 'end': 10}, {'entity': 'B-ORG', 'score': 0.9729531, 'index': 7, 'word': 'microsoft', 'start': 29, 'end': 38}]
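The pipeline returns one entry per token, and the `word` field is lowercased because the tokenizer is uncased. A minimal sketch of merging these entries into whole entities, using the `start`/`end` offsets to recover the original text, could look like this. The helper `merge_entities` is hypothetical, and the results list is trimmed to the fields it needs; the `pipeline` API also offers an `aggregation_strategy` argument for this purpose.

```python
text = "Bill Gates is the Founder of Microsoft"

# Entries copied from the pipeline output above, trimmed to the fields used here.
ner_results = [
    {"entity": "B-PER", "start": 0, "end": 4},
    {"entity": "I-PER", "start": 5, "end": 10},
    {"entity": "B-ORG", "start": 29, "end": 38},
]


def merge_entities(text, results):
    """Merge B-/I- token predictions into (entity_type, original_text) pairs."""
    merged = []
    for item in results:
        # Assumes every entry is tagged B-XXX or I-XXX (no plain "O" entries).
        prefix, entity_type = item["entity"].split("-", 1)
        if prefix == "I" and merged and merged[-1]["type"] == entity_type:
            merged[-1]["end"] = item["end"]  # extend the open entity
        else:
            merged.append({"type": entity_type, "start": item["start"], "end": item["end"]})
    return [(m["type"], text[m["start"]:m["end"]]) for m in merged]


entities = merge_entities(text, ner_results)
print(entities)
```

Slicing the original string by character offsets keeps the original casing of each entity.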


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
!ls

drive  ner_model  sample_data  test-ner  tokenizer  wandb


In [None]:
!cp -r /content/* /content/drive/MyDrive/ner-bert-conll2003

cp: cannot create directory '/content/drive/MyDrive/ner-bert-conll2003/drive/.Encrypted/MyDrive/ner-bert-conll2003/drive/.Encrypted/MyDrive/ner-bert-conll2003/drive/.Encrypted/...

In [None]:
!rsync -av --progress --exclude='.Encrypted' --exclude='.ipynb_checkpoints' --exclude='*/.Encrypted/*' /content/ /content/drive/MyDrive/ner-bert-conll2003/

sending incremental file list
./
.config/
.config/.last_opt_in_prompt.yaml
              3 100%    0.00kB/s    0:00:00 (xfr#1, ir-chk=2516/2525)
.config/.last_survey_prompt.yaml
             37 100%   36.13kB/s    0:00:00 (xfr#2, ir-chk=2515/2525)
.config/.last_update_check.json
            134 100%  130.86kB/s    0:00:00 (xfr#3, ir-chk=2514/2525)
.config/active_config
              7 100%    6.84kB/s    0:00:00 (xfr#4, ir-chk=2513/2525)
.config/config_sentinel
              0 100%    0.00kB/s    0:00:00 (xfr#5, ir-chk=2512/2525)
.config/default_configs.db
         12,288 100%   11.72MB/s    0:00:00 (xfr#6, ir-chk=2511/2525)
.config/gce
              5 100%    4.88kB/s    0:00:00 (xfr#7, ir-chk=2510/2525)
.config/hidden_gcloud_config_universe_descriptor_data_cache_configs.db
         12,288 100%   11.72MB/s    0:00:00 (xfr#8, ir-chk=2509/2525)
.config/configurations/
.config/configurations/config_default
             94 100%   91.80kB/s    0:00:00 (xfr#9, ir-chk=2506/2525)
.config/logs

In [None]:
!rm -rf /content/drive/MyDrive

rm: cannot remove '/content/drive/MyDrive': Operation canceled



In [41]:
!git init

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/


In [None]:
!ls

drive  ner_model  sample_data  test-ner  tokenizer  wandb


In [42]:
!git remote add origin https://github.com/tuanngocfun/ML-practices.git

In [43]:
!git pull origin --rebase

remote: Enumerating objects: 2687, done.

In [44]:
!git checkout main

Branch 'main' set up to track remote branch 'main' from 'origin'.
Switched to a new branch 'main'


In [None]:
!mkdir bert-ner

In [None]:
!mv /ner_model /bert-ner

mv: cannot stat '/ner_model': No such file or directory


In [None]:
!mv /sample_data /bert-ner

mv: cannot stat '/sample_data': No such file or directory


In [None]:
!mv /test_ner /bert-ner

In [None]:
!mv /tokenizer /bert-ner

In [None]:
!mv /wandb /bert-ner

In [None]:
import os
import shutil

dirs_to_move = ['ner_model', 'sample_data', 'test-ner', 'tokenizer', 'wandb']
target_dir = 'bert-ner'
os.makedirs(target_dir, exist_ok=True)

for dir_name in dirs_to_move:
    if os.path.exists(dir_name):
        try:
            shutil.move(dir_name, os.path.join(target_dir, dir_name))
            print(f"Moved {dir_name} to {target_dir}")
        except Exception as e:
            print(f"Error moving {dir_name}: {e}")
    else:
        print(f"{dir_name} does not exist.")


ner_model does not exist.
sample_data does not exist.
Moved test-ner to bert-ner
tokenizer does not exist.
wandb does not exist.


In [None]:
!pwd

/content


In [None]:
!ls

 2-Symbolic   bert-ner	 drive	'Sequence Models'


In [47]:
!mv /content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1 (2).ipynb /content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

/bin/bash: -c: line 1: syntax error near unexpected token `('
/bin/bash: -c: line 1: `mv /content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1 (2).ipynb /content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb'


In [49]:
!mv "/content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1 (2).ipynb" "/content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb"


In [58]:
!git add /content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

fatal: pathspec '/content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb' did not match any files


In [53]:
!git commit -m "notebook for transformer ner task with bert model and conll2003 datasets"

[main 88a36ee] notebook for transformer ner task with bert model and conll2003 datasets
 1 file changed, 1 insertion(+)
 create mode 100644 drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb


In [None]:
!git add bert-ner

In [45]:
!git remote set-url origin https://<token>@github.com/tuanngocfun/ML-practices.git
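Embedding a token directly in a notebook cell risks committing it along with the notebook. One possible alternative, sketched here under assumptions, is to read the token from an environment variable and build the remote URL at run time. The variable name `GITHUB_TOKEN` and the helper `authenticated_remote` are illustrative choices, not something the original notebook uses.

```python
import os


def authenticated_remote(owner_repo: str, token_env: str = "GITHUB_TOKEN") -> str:
    """Build an HTTPS remote URL using a token taken from the environment."""
    token = os.environ.get(token_env)
    if not token:
        raise RuntimeError(f"Set the {token_env} environment variable first")
    return f"https://{token}@github.com/{owner_repo}.git"


# Example (not executed here, so the token is never printed):
# url = authenticated_remote("tuanngocfun/ML-practices")
# !git remote set-url origin {url}
```

The token still ends up in `.git/config`, so this only keeps it out of the notebook source, not out of the local repository settings.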

In [52]:
!git config --global user.email "tuanngoccs50@gmail.com"
!git config --global user.name "Nguyen Tuan Ngoc"

In [59]:
!git status

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	deleted:    drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.config/
	drive/
	ner_model/
	sample_data/
	test-ner/
	tokenizer/
	wandb/



In [60]:
!git restore --staged drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

In [57]:
!git commit --amend -m "fixing file"

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.config/
	drive/
	ner_model/
	sample_data/
	test-ner/
	tokenizer/
	wandb/

No changes
You asked to amend the most recent commit, but doing so would make
it empty. You can repeat your command with --allow-empty, or you can
remove the commit entirely with "git reset HEAD^".


In [62]:
!git diff --cached -- drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

In [63]:
!git add drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

In [64]:
!git diff --cached -- drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

diff --git a/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb b/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb
deleted file mode 100644
index 03e77f9..0000000
--- a/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb
+++ /dev/null
@@ -1 +0,0 @@
\ No newline at end of file


In [65]:
!git commit -m "bert model and conll2003 dataset for traininng the NER task"

[main 7005fb6] bert model and conll2003 dataset for traininng the NER task
 1 file changed, 1 deletion(-)
 delete mode 100644 drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb


In [67]:
!git filter-branch --force --index-filter 'git rm --cached --ignore-unmatch drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb' -f HEAD

	 rewrites.  Hit Ctrl-C before proceeding to abort, then use an
	 alternative filtering tool such as 'git filter-repo'
	 (https://github.com/newren/git-filter-repo/) instead.  See the
Proceeding with filter-branch...

Rewrite 88a36ee78dd79c6b4043a617e65b75070da0bc82 (4/5) (0 seconds passed, remaining 0 predicted)    rm 'drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb'
Rewrite 7005fb666d2c1f265fe9e619bdaac23afda62b94 (5/5) (0 seconds passed, remaining 0 predicted)    
Ref 'refs/heads/main' was rewritten


In [68]:
!git add drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb

fatal: pathspec 'drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1.ipynb' did not match any files


In [None]:
!mv "/content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v1 (2).ipynb" "/content/drive/MyDrive/bert-ner-transformers/YT_Fine_tuning_BERT_NER_v.ipynb"

In [None]:
!git commit -m "bert model and conll2003 dataset for traininng the NER task"

In [66]:
!git push origin main

Enumerating objects: 8, done.
Counting objects: 100% (8/8), done.
Delta compression using up to 2 threads
Compressing objects: 100% (5/5), done.
Writing objects: 100% (7/7), 42.03 KiB | 6.00 MiB/s, done.
Total 7 (delta 0), reused 0 (delta 0), pack-reused 0
remote: error: GH013: Repository rule violations found for refs/heads/main.
remote: 
remote: - GITHUB PUSH PROTECTION
remote:   ———————————
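The output above is cut off, so the exact rule that fired is not shown. GitHub push protection generally rejects a push when a commit appears to contain a secret, and a notebook that stored a token in a cell is a plausible cause. The following is a rough sketch for scanning text for token-like strings before committing. The two regular expressions are loose approximations of GitHub token prefixes and are assumptions, not GitHub's actual detection rules; `find_token_like_strings` is a hypothetical helper.

```python
import re

# Loose, assumed patterns for GitHub-style tokens (not GitHub's real rules).
TOKEN_PATTERNS = [
    re.compile(r"ghp_[A-Za-z0-9]{20,}"),
    re.compile(r"github_pat_[A-Za-z0-9_]{20,}"),
]


def find_token_like_strings(text: str) -> list:
    """Return a short prefix of every token-like match found in text."""
    hits = []
    for pattern in TOKEN_PATTERNS:
        for match in pattern.finditer(text):
            # Keep only the first 8 characters so the secret is not echoed.
            hits.append(match.group(0)[:8] + "...")
    return hits
```

A typical use would be reading the `.ipynb` file as text and passing its contents to `find_token_like_strings`; any hit suggests the cell should be cleaned, and the token revoked, before the notebook is committed.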

In [None]:
!du -sh bert-ner

7.8G	bert-ner


In [None]:
!du -h --max-depth=1 bert-ner

7.4G	bert-ner/test-ner
616K	bert-ner/wandb
416M	bert-ner/ner_model
936K	bert-ner/tokenizer
55M	bert-ner/sample_data
7.8G	bert-ner
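The `du` output shows that most of the 7.8G sits in `test-ner`, and GitHub blocks individual files larger than 100 MiB. As a sketch, the files that would exceed that limit can be listed from Python before staging anything. The helper name `files_over_limit` is hypothetical.

```python
import os

GITHUB_FILE_LIMIT = 100 * 1024 * 1024  # 100 MiB per-file limit on GitHub


def files_over_limit(root, limit=GITHUB_FILE_LIMIT):
    """Return (path, size_in_bytes) for files under root larger than limit."""
    oversized = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # skip symlinks; their targets may not exist
            size = os.path.getsize(path)
            if size > limit:
                oversized.append((path, size))
    # Largest files first, which are the ones to exclude or store elsewhere.
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

For example, `files_over_limit("bert-ner")` would be expected to report files such as `model.safetensors` and the checkpoint optimizer states, which are candidates for `.gitignore` or external storage.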


In [None]:
!git rm -r --cached bert-ner/test-ner
!git rm -r --cached /content/bert-ner/ner_model/model.safetensors

fatal: pathspec 'bert-ner/test-ner' did not match any files
rm 'bert-ner/ner_model/model.safetensors'


In [None]:
!git commit -m "Remove bert-ner/test-ner and /content/bert-ner/ner_model/model.safetensors from Git tracking"

[main 6744fbe] Remove bert-ner/test-ner and /content/bert-ner/ner_model/model.safetensors from Git tracking
 1 file changed, 0 insertions(+), 0 deletions(-)
 delete mode 100644 bert-ner/ner_model/model.safetensors


In [None]:
!git status

On branch main
Your branch is ahead of 'origin/main' by 4 commits.
  (use "git push" to publish your local commits)

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   bert-ner/wandb/run-20241117_061238-4v0ynkjq/logs/debug-internal.log
	modified:   bert-ner/wandb/run-20241117_061238-4v0ynkjq/logs/debug.log
	modified:   bert-ner/wandb/run-20241117_061238-4v0ynkjq/run-4v0ynkjq.wandb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	.config/
	bert-ner.zip
	bert-ner/ner_model/model.safetensors

no changes added to commit (use "git add" and/or "git commit -a")


In [None]:
!echo "bert-ner/test-ner" >> .gitignore
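# .gitignore patterns are resolved relative to the repository root (/content here), so the path is written without the /content prefix.
!echo "bert-ner/ner_model/model.safetensors" >> .gitignore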

In [None]:
!git add .gitignore
!git commit -m "Update .gitignore to exclude test-ner and model.safetensors"

[main b4d0b38] Update .gitignore to exclude test-ner and model.safetensors
 1 file changed, 2 insertions(+)
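Patterns in `.gitignore` are matched relative to the directory that holds the file, so absolute filesystem paths need to be converted first. The sketch below appends repository-relative patterns and skips ones that are already present. It assumes the repository root is `/content` (as `pwd` printed earlier); the helper name `add_ignore_patterns` is hypothetical.

```python
import os


def add_ignore_patterns(repo_root, paths, gitignore_name=".gitignore"):
    """Append repo-relative ignore patterns; return the patterns added."""
    gitignore = os.path.join(repo_root, gitignore_name)
    content = ""
    if os.path.exists(gitignore):
        with open(gitignore, encoding="utf-8") as fh:
            content = fh.read()
    existing = {line.strip() for line in content.splitlines() if line.strip()}

    added = []
    for path in paths:
        # Absolute paths are rewritten relative to the repository root.
        rel = os.path.relpath(path, repo_root) if os.path.isabs(path) else path
        pattern = rel.replace(os.sep, "/")
        if pattern not in existing:
            existing.add(pattern)
            added.append(pattern)

    if added:
        with open(gitignore, "a", encoding="utf-8") as fh:
            if content and not content.endswith("\n"):
                fh.write("\n")  # avoid gluing onto an unterminated last line
            for pattern in added:
                fh.write(pattern + "\n")
    return added
```

With the assumed root, `add_ignore_patterns("/content", ["/content/bert-ner/ner_model/model.safetensors", "bert-ner/test-ner"])` would write `bert-ner/ner_model/model.safetensors` and `bert-ner/test-ner`.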


In [None]:
!git push origin main

Enumerating objects: 87, done.
Counting objects:  35% (31/87)

In [None]:
!zip -r bert-ner.zip bert-ner

  adding: bert-ner/ (stored 0%)
  adding: bert-ner/test-ner/ (stored 0%)
  adding: bert-ner/test-ner/checkpoint-2000/ (stored 0%)
  adding: bert-ner/test-ner/checkpoint-2000/tokenizer_config.json (deflated 76%)
  adding: bert-ner/test-ner/checkpoint-2000/vocab.txt (deflated 53%)
  adding: bert-ner/test-ner/checkpoint-2000/rng_state.pth (deflated 25%)
  adding: bert-ner/test-ner/checkpoint-2000/optimizer.pt (deflated 19%)
  adding: bert-ner/test-ner/checkpoint-2000/model.safetensors (deflated 7%)
  adding: bert-ner/test-ner/checkpoint-2000/trainer_state.json (deflated 63%)
  adding: bert-ner/test-ner/checkpoint-2000/training_args.bin (deflated 51%)
  adding: bert-ner/test-ner/checkpoint-2000/special_tokens_map.json (deflated 42%)
  adding: bert-ner/test-ner/checkpoint-2000/config.json (deflated 56%)
  adding: bert-ner/test-ner/checkpoint-2000/tokenizer.json (deflated 71%)
  adding: bert-ner/test-ner/checkpoint-2000/scheduler.pt (deflated 55%)
  adding: bert-ner/test-ner/checkpoint-2500/
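The archive above includes every training checkpoint, which accounts for most of the 7.8G. If only the final model and tokenizer are needed, a smaller archive can be built with the standard `zipfile` module while skipping selected directories. This is a sketch under that assumption; `zip_without` and its `skip_dirs` default are hypothetical, and the zip file is assumed to be written outside `src_dir`.

```python
import os
import zipfile


def zip_without(src_dir, zip_path, skip_dirs=("test-ner",)):
    """Zip src_dir into zip_path, skipping any directory named in skip_dirs."""
    src_dir = os.path.abspath(src_dir)
    base = os.path.dirname(src_dir)  # keeps "bert-ner/" as the top-level folder
    written = []
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for dirpath, dirnames, filenames in os.walk(src_dir):
            # Pruning dirnames in place stops os.walk from descending into them.
            dirnames[:] = [d for d in dirnames if d not in skip_dirs]
            for name in filenames:
                full = os.path.join(dirpath, name)
                arcname = os.path.relpath(full, base)
                zf.write(full, arcname)
                written.append(arcname.replace(os.sep, "/"))
    return sorted(written)
```

For instance, `zip_without("bert-ner", "bert-ner-small.zip")` would leave out `bert-ner/test-ner` and its checkpoints while keeping `ner_model`, `tokenizer`, and the other folders.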

In [None]:
from google.colab import files
files.download('bert-ner.zip')

<IPython.core.display.Javascript object>