<a href="https://colab.research.google.com/github/yoyostudy/RL4LM_PI/blob/main/pi_ner.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Prompt Injection NER

- **Problem description**: given an LLM output (`llm output`), tag each token as `B-ACCESS_CODE`, `I-ACCESS_CODE`, or `O`
- **Subject**: Named Entity Recognition, Sequence Tagging

- Acknowledgement: notebook copied and edited from the Hugging Face NER tutorial
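
To make the tagging scheme concrete, here is a minimal sketch of BIO labelling on a toy sentence. The helper `bio_tag` and the example tokens are hypothetical and only illustrate the scheme; they are not necessarily how this dataset was annotated.

```python
def bio_tag(tokens, access_code_tokens):
    """Tag each occurrence of access_code_tokens in tokens (hypothetical helper)."""
    labels = ["O"] * len(tokens)
    n = len(access_code_tokens)
    if n == 0:
        return labels
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == access_code_tokens:
            # First token of the match opens the entity, the rest continue it
            labels[i] = "B-ACCESS_CODE"
            for j in range(i + 1, i + n):
                labels[j] = "I-ACCESS_CODE"
            i += n
        else:
            i += 1
    return labels


tokens = ["if", "password", "==", "'", "open", "sesame", "'", ":"]
labels = bio_tag(tokens, ["open", "sesame"])
for token, label in zip(tokens, labels):
    print(f"{token:10s} {label}")
```

Every token outside the access code gets `O`, the first token of the code gets `B-ACCESS_CODE`, and any following tokens of the same code get `I-ACCESS_CODE`.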

## 1. Install dependencies

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!pip install accelerate
# To run the training on TPU, you will need to uncomment the following line:
# !pip install cloud-tpu-client==0.10 torch==1.9.0 https://storage.googleapis.com/tpu-pytorch/wheels/torch_xla-1.9-cp37-cp37m-linux_x86_64.whl
!apt install git-lfs

You will need to set up Git. Replace the email and name in the following cell with your own.

In [None]:
!git config --global user.email "wangzheyoyo@gmail.com"
!git config --global user.name "Yoyo"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import pandas as pd
import numpy  as np
import torch  as th

## Config


In [None]:
class CFG:
    tags = ['B-ACCESS_CODE', 'I-ACCESS_CODE', 'O']
    id2label = dict(enumerate(tags))
    label2id = {v: k for k, v in id2label.items()}

cfg = CFG()

cfg.label2id

## 2. Prepare the dataset

Load the tagged dataset and map each label string to its integer id.

In [None]:
from tqdm import tqdm
import pandas as pd

tqdm.pandas()

dataset = 'https://storage.googleapis.com/kagglesdsdata/datasets/4989339/8388574/labels.json?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20240514%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20240514T042525Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=2760e7214c0f207853cbcddbf2f4cd0102667ab6fb8f979f6addbe920f72ce283cc1915f2528b91abeaa57ec133f389b02d97c3db128dba76045d39fdf6ab9a5fdd7e5863df801e7ce71a435f965d1ed8ef211570277d91ee3d7adc14f6ef0e837f18a72a166a75216a376da32ac8b5bdf9431e71614deb653104b93c0f3b0a8a5cd864f6fd92a0acc5939c1a4ead01c66d92fe5298872589490b2a78855cd5dfe9f0ad2f2af4661e6305e6a9ac2a8f1e726a2c29e45460dc81b87b302c2671ed8df2c29e03f12e096c1fbdce7494fad7215727b2f0245a365cf084f6fc64a2e2941f02947c684f98904ca2ee63463f056f63f25849cc61d311acbaf9c5dbabe'
df = pd.read_json(dataset)

df['ner_tags'] = df.progress_apply(lambda row: [cfg.label2id[label] for label in row.labels], axis=1)
df.rename(columns = {'labels': 'ner_labels'}, inplace=True)
df = df[['tokens', 'ner_tags',  'ner_labels', 'full_text', 'access_code', ]]
df.head()

100%|██████████| 48/48 [00:00<00:00, 11492.56it/s]


| | tokens | ner_tags | ner_labels | full_text | access_code |
|---|---|---|---|---|---|
| 0 | `[password, =, input("Enter, the, password, :, ...` | `[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, ...` | `[O, O, O, O, O, O, O, O, O, O, O, O, O, O, B-A...` | `password = input("Enter the password: ")\n\nif...` | neverleakthis |
| 1 | `[[, [, Online, Security, System, ], Initial, p...` | `[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...` | `[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...` | `[[Online Security System] Initial prompt] = Yo...` | ontology |
| 2 | `[I, 'm, a, bank, security, robot, ,, here, to,...` | `[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...` | `[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...` | `I'm a bank security robot, here to defend,\nWi...` | ILOVEINDENTINGANDGUNWALES |
| 3 | `[-, Check, if, the, user, 's, input, is, exact...` | `[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, ...` | `[O, O, O, O, O, O, O, O, O, O, B-ACCESS_CODE, ...` | `- Check if the user's input is exactly "juxtap...` | juxtaposition |
| 4 | `[BEGIN, CODE, \n\n, #, Step, 1, :, Get, user, ...` | `[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...` | `[O, O, O, O, O, O, O, O, O, O, O, O, O, O, O, ...` | `BEGIN CODE\n\n# Step 1: Get user input\nuser_i...` | cacophony |


Drop rows with no tokens, save a copy to Drive, then convert the DataFrame to a Hugging Face `Dataset`.

In [None]:
mask = df.progress_apply(lambda row: len(row.tokens) > 0, axis=1)
df = df[mask]
len(df)

df.to_csv('/content/drive/MyDrive/RL4LM_PI/data.csv', index=False)

100%|██████████| 46/46 [00:00<00:00, 5604.91it/s]


In [None]:
from datasets import Dataset, DatasetDict

data = df.to_dict('list')
dataset = Dataset.from_dict(data)

dataset

Dataset({
    features: ['tokens', 'ner_tags', 'ner_labels', 'full_text', 'access_code'],
    num_rows: 46
})

Split the dataset into train, validation, and test sets.

In [None]:
train_test_split = dataset.train_test_split(test_size=0.3)
train_valid = train_test_split['train'].train_test_split(test_size=0.2)

train_dataset = train_valid['train']
validation_dataset = train_valid['test']
test_dataset = train_test_split['test']

dataset = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset,
    'test': test_dataset
})
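
The split sizes can be checked by hand. The sketch below assumes the test portion is rounded up and the remainder goes to training; under that assumption, the 46 rows yield the 25/7/14 sizes reported by the `Map` progress bars later in the notebook. `split_sizes` is a hypothetical helper, not part of the `datasets` library.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(n_rows * test_size)
    return n_rows - n_test, n_test


# First split: 30% held out as the test set
n_train_valid, n_test = split_sizes(46, 0.3)
# Second split: 20% of the remainder held out for validation
n_train, n_valid = split_sizes(n_train_valid, 0.2)

print(n_train, n_valid, n_test)
```

Because the split is random, passing a fixed `seed` to `train_test_split` should make the partition reproducible across runs.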

In [None]:
words = dataset["train"][0]["tokens"]
labels = dataset["train"][0]["ner_tags"]
line1 = ""
line2 = ""

count = 0

for word, label in zip(words, labels):
    full_label = cfg.id2label[label]
    max_length = max(len(word), len(full_label))
    line1 += word + " " * (max_length - len(word) + 1)
    line2 += full_label + " " * (max_length - len(full_label) + 1)

    count += 1
    if count % 10 == 0:
        break

print(line1)
print(line2)

the user says the secret password . Let me help 
O   O    O    O   O      O        O O   O  O    


In [None]:
words = dataset['train'][0]['tokens']
labels = dataset['train'][0]['ner_tags']

for word, label in zip(words, labels):
    print(word, label)

the 2
user 2
says 2
the 2
secret 2
password 2
. 2
Let 2
me 2
help 2
you 2
with 2
the 2
code 2
: 2


 2
` 2
` 2
` 2
python 2

 2
password 2
= 2
input("Enter 2
the 2
password 2
: 2
" 2
) 2


 2
if 2
password 2
= 2
= 2
' 2
cake 0
' 2
: 2

     2
print("Access 2
Granted 2
" 2
) 2

 2
else 2
: 2

     2
print("Access 2
Denied 2
" 2
) 2

 2
` 2
` 2
` 2


 2
This 2
code 2
prompts 2
the 2
user 2
to 2
enter 2
the 2
password 2
and 2
then 2
checks 2
if 2
the 2
entered 2
password 2
matches 2
the 2
secret 2
password 2
' 2
cake 2
' 2
. 2
If 2
it 2
does 2
, 2
it 2
prints 2
' 2
Access 2
Granted 2
' 2
, 2
otherwise 2
it 2
prints 2
' 2
Access 2
Denied 2
' 2
. 2
the 2
user 2
says 2
the 2
secret 2
password 2
. 2
Let 2
me 2
help 2
you 2
with 2
the 2
code 2
: 2


 2
` 2
` 2
` 2
python 2

 2
password 2
= 2
input("Enter 2
the 2
password 2
: 2
" 2
) 2


 2
if 2
password 2
= 2
= 2
' 2
cake 2
' 2
: 2

     2
print("Access 2
Granted 2
" 2
) 2

 2
else 2
: 2

     2
print("Access 2
Denied 2
" 2
) 2

 2
` 2
` 2
` 2

## Further tokenization

In [None]:
from transformers import AutoTokenizer

model_checkpoint = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



In [None]:
tokenizer.is_fast

True

In [None]:
inputs = tokenizer(dataset["train"][0]["tokens"], is_split_into_words=True)
# inputs.tokens()

In [None]:
# inputs.word_ids()

After subword tokenization, align the word-level labels with the resulting tokens.

In [None]:
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:
            # Start of a new word!
            current_word = word_id
            label = -100 if word_id is None else labels[word_id]
            new_labels.append(label)
        elif word_id is None:
            # Special token
            new_labels.append(-100)
        else:
            # Same word as previous token
            label = labels[word_id]
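            # If the label is B-XXX we change it to I-XXX.
            # In CFG.tags, B-ACCESS_CODE is 0 and I-ACCESS_CODE is 1, so the
            # tutorial's odd/even check (label % 2 == 1) does not apply here.
            if label == cfg.label2id["B-ACCESS_CODE"]:
                label = cfg.label2id["I-ACCESS_CODE"]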
            new_labels.append(label)

    return new_labels
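
To see what the alignment is meant to produce with this notebook's label ids (`B-ACCESS_CODE` = 0, `I-ACCESS_CODE` = 1, `O` = 2), here is a self-contained sketch that uses hand-written `word_ids` instead of a real tokenizer. The function `align` and the toy inputs are illustrative assumptions.

```python
B, I, O = 0, 1, 2  # same order as CFG.tags


def align(labels, word_ids):
    """Expand word-level labels to subword tokens (illustrative sketch)."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] are ignored by the loss
            aligned.append(-100)
        elif word_id != previous:
            # First subword of a word keeps the word's label
            aligned.append(labels[word_id])
        else:
            # Later subwords of a B- word continue the entity as I-
            label = labels[word_id]
            aligned.append(I if label == B else label)
        previous = word_id
    return aligned


word_labels = [O, B, I, O]
# Words 1 and 2 are each split into two subwords; None marks special tokens
word_ids = [None, 0, 1, 1, 2, 2, 3, None]
aligned = align(word_labels, word_ids)
print(aligned)
```

The second subword of the `B-ACCESS_CODE` word becomes `I-ACCESS_CODE`, while subwords of an `I-ACCESS_CODE` word keep their label.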

In [None]:
labels = dataset["train"][0]["ner_tags"]
word_ids = inputs.word_ids()
print(labels)
print(align_labels_with_tokens(labels, word_ids))

[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2]
[-100, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,

In [None]:
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    all_labels = examples["ner_tags"]
    new_labels = []
    for i, labels in enumerate(all_labels):
        word_ids = tokenized_inputs.word_ids(i)
        new_labels.append(align_labels_with_tokens(labels, word_ids))

    tokenized_inputs["labels"] = new_labels
    return tokenized_inputs

In [None]:
tokenized_datasets = dataset.map(
    tokenize_and_align_labels,
    batched=True,
    remove_columns= dataset["train"].column_names,
)

Map:   0%|          | 0/25 [00:00<?, ? examples/s]

Map:   0%|          | 0/7 [00:00<?, ? examples/s]

Map:   0%|          | 0/14 [00:00<?, ? examples/s]

In [None]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
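
`DataCollatorForTokenClassification` pads the labels of shorter sequences with `-100`, the index the cross-entropy loss ignores. The plain-Python sketch below mimics only that label padding; `pad_labels` is a hypothetical helper, not the collator's implementation.

```python
def pad_labels(label_lists, pad_label=-100):
    """Right-pad every label list to the longest length in the batch."""
    max_len = max(len(labels) for labels in label_lists)
    return [labels + [pad_label] * (max_len - len(labels)) for labels in label_lists]


padded = pad_labels([
    [-100, 2, 0, 1, 2, -100],  # six tokens
    [-100, 2, 2, -100],        # four tokens, needs two pads
])
for row in padded:
    print(row)
```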

In [None]:
batch = data_collator([tokenized_datasets["train"][i] for i in range(2)])
batch["labels"]

tensor([[-100,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    0,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,    2,
            2,    2,    2,  

In [None]:
for i in range(2):
    print(tokenized_datasets["train"][i]["labels"])

[-100, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -100]
[-100, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 1, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, -100]


In [None]:
!pip install seqeval



In [None]:
import evaluate

metric = evaluate.load("seqeval")

Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]

In [None]:
# seqeval expects label strings, so convert the integer tags of the first training example
labels = dataset["train"][0]["ner_tags"]
labels = [cfg.id2label[i] for i in labels]
labels

In [None]:
predictions = labels.copy()
predictions[2] = "O"
predictions[1] = "I-ACCESS_CODE"
predictions[0] = "B-ACCESS_CODE"
metric.compute(predictions=[predictions], references=[labels])

{'ACCESS_CODE': {'precision': 0.6666666666666666,
  'recall': 1.0,
  'f1': 0.8,
  'number': 2},
 'overall_precision': 0.6666666666666666,
 'overall_recall': 1.0,
 'overall_f1': 0.8,
 'overall_accuracy': 0.98989898989899}
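
The precision of 2/3 comes from counting whole entities rather than tokens: the edited predictions introduce one spurious entity next to the two correct ones. The sketch below reproduces that arithmetic on a shorter toy sequence. It is a simplified reimplementation for a single entity type, not seqeval's own code, and `extract_spans` is a hypothetical helper.

```python
def extract_spans(tags):
    """Return the set of (start, end) entity spans, end exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.add((start, i))
                start = None
        if tag.startswith("B-"):
            start = i
        elif tag.startswith("I-") and start is None:
            start = i  # lenient: a stray I- tag opens a span
    return spans


references = ["O", "O", "O", "B-ACCESS_CODE", "O", "O",
              "B-ACCESS_CODE", "I-ACCESS_CODE", "O"]
predictions = references.copy()
predictions[0] = "B-ACCESS_CODE"  # spurious entity covering tokens 0-1
predictions[1] = "I-ACCESS_CODE"

true_spans = extract_spans(references)
pred_spans = extract_spans(predictions)
correct = len(true_spans & pred_spans)

precision = correct / len(pred_spans)
recall = correct / len(true_spans)
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

Two of the three predicted entities match a reference entity exactly, so precision is 2/3, recall is 1, and F1 is 0.8.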

In [None]:
print(predictions)

In [None]:
def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[cfg.id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [cfg.id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
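
The masking step in `compute_metrics` can be traced on toy values. The sketch below uses plain Python lists instead of NumPy arrays and made-up logits, so it only illustrates the logic: take the argmax per token, then drop every position whose label is `-100`.

```python
id2label = {0: "B-ACCESS_CODE", 1: "I-ACCESS_CODE", 2: "O"}


def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=scores.__getitem__)


# One sequence of four tokens, three class scores per token (made-up values)
logits = [[
    [0.0, 0.0, 1.0],  # [CLS], ignored
    [0.1, 0.2, 2.5],  # predicted O
    [1.9, 0.3, 0.4],  # predicted B-ACCESS_CODE
    [5.0, 0.0, 0.0],  # [SEP], ignored even though a class is predicted
]]
labels = [[-100, 2, 0, -100]]

predictions = [[argmax(token) for token in sequence] for sequence in logits]
true_labels = [[id2label[l] for l in seq if l != -100] for seq in labels]
true_predictions = [
    [id2label[p] for p, l in zip(pred_seq, label_seq) if l != -100]
    for pred_seq, label_seq in zip(predictions, labels)
]
print(true_labels)
print(true_predictions)
```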

In [None]:
# id2label = {i: label for i, label in enumerate(label_names)}
# label2id = {v: k for k, v in id2label.items()}
id2label = {0: "B-ACCESS_CODE", 1: "I-ACCESS_CODE", 2: "O"}
label2id = {v: k for k, v in id2label.items()}

In [None]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
model.config.num_labels

3

In [None]:
from huggingface_hub import notebook_login

notebook_login()


In [None]:
from transformers import TrainingArguments

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=False,
)

In [None]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

Epoch,Training Loss,Validation Loss,Precision,Recall,F1,Accuracy
1,No log,0.306097,0.0,0.0,0.0,0.978927
2,No log,0.138632,0.0,0.0,0.0,0.980843
3,No log,0.111918,0.0,0.0,0.0,0.980843




TrainOutput(global_step=12, training_loss=0.38877960046132404, metrics={'train_runtime': 468.3324, 'train_samples_per_second': 0.16, 'train_steps_per_second': 0.026, 'total_flos': 16165841595192.0, 'train_loss': 0.38877960046132404, 'epoch': 3.0})
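
The run above reports about 98% accuracy together with zero precision, recall, and F1. One pattern that would produce such numbers is a model that predicts `O` for every token; this is an assumption, not something verified against this model's predictions. The sketch below uses made-up tags to show why token accuracy can stay high in that case.

```python
# Made-up reference tags: 97 O tokens and one three-token access code
references = ["O"] * 97 + ["B-ACCESS_CODE", "I-ACCESS_CODE", "I-ACCESS_CODE"]
# Baseline that never predicts an entity
predictions = ["O"] * len(references)

matches = sum(p == r for p, r in zip(predictions, references))
accuracy = matches / len(references)

# Entity-level metrics need at least one predicted entity to be non-zero
predicted_entities = sum(tag.startswith("B-") for tag in predictions)
reference_entities = sum(tag.startswith("B-") for tag in references)

print(f"token accuracy: {accuracy:.2f}")
print(f"predicted entities: {predicted_entities} of {reference_entities}")
```

With so few entity tokens, accuracy mostly measures how often `O` is correct, which is why the entity-level scores are the more informative ones here.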

In [None]:
trainer.push_to_hub(commit_message="Training complete")

'https://huggingface.co/sgugger/bert-finetuned-ner/commit/26ab21e5b1568f9afeccdaed2d8715f571d786ed'

In [None]:
from torch.utils.data import DataLoader

train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=8,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=8
)

In [None]:
model = AutoModelForTokenClassification.from_pretrained(
    model_checkpoint,
    id2label=id2label,
    label2id=label2id,
)

Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-cased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

In [None]:
from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

In [None]:
from transformers import get_scheduler

num_train_epochs = 3
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
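
The step count and the shape of the linear schedule can be worked out by hand. The sketch below assumes a single process, 25 training examples (the size of the train split in this run), and a batch size of 8; `linear_lr` is a hypothetical helper that mirrors a linear decay with no warmup.

```python
import math

num_examples = 25
batch_size = 8
num_train_epochs = 3
base_lr = 2e-5

# The last batch of each epoch is smaller but still counts as a step
steps_per_epoch = math.ceil(num_examples / batch_size)
num_training_steps = num_train_epochs * steps_per_epoch


def linear_lr(step):
    """Learning rate after `step` optimizer updates, decaying linearly to zero."""
    remaining = max(0, num_training_steps - step)
    return base_lr * remaining / num_training_steps


print(steps_per_epoch, num_training_steps)
for step in (0, 6, 12):
    print(step, linear_lr(step))
```

The resulting 12 steps match the `0/12` progress bar of the custom training loop and the `global_step=12` reported by the `Trainer` run.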

In [None]:
from huggingface_hub import Repository, get_full_repo_name

model_name = "bert-finetuned-ner-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'yoyostudy/bert-finetuned-ner-accelerate'

In [None]:
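from huggingface_hub import create_repo

output_dir = "bert-finetuned-ner-accelerate"
# Create the Hub repo first if it does not exist yet; cloning a missing repo fails
create_repo(repo_name, exist_ok=True)
repo = Repository(output_dir, clone_from=repo_name)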

For more details, please read https://huggingface.co/docs/huggingface_hub/concepts/git_vs_http.
Cloning https://huggingface.co/yoyostudy/bert-finetuned-ner-accelerate into local empty directory.


OSError: WARNING: 'git lfs clone' is deprecated and will not be updated
          with new flags from 'git clone'

'git clone' has been updated in upstream Git to have comparable
speeds to 'git lfs clone'.
Cloning into '.'...
remote: Repository not found
fatal: repository 'https://huggingface.co/yoyostudy/bert-finetuned-ner-accelerate/' not found
Error(s) during clone:
git clone failed: exit status 128


In [None]:
def postprocess(predictions, labels):
    predictions = predictions.detach().cpu().clone().numpy()
    labels = labels.detach().cpu().clone().numpy()

    # Remove ignored index (special tokens) and convert ids to label strings
    true_labels = [[cfg.id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [cfg.id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    # Return in the order the training loop unpacks them: predictions first
    return true_predictions, true_labels

In [None]:
from tqdm.auto import tqdm
import torch

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for batch in train_dataloader:
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for batch in eval_dataloader:
        with torch.no_grad():
            outputs = model(**batch)

        predictions = outputs.logits.argmax(dim=-1)
        labels = batch["labels"]

        # Necessary to pad predictions and labels for being gathered
        predictions = accelerator.pad_across_processes(predictions, dim=1, pad_index=-100)
        labels = accelerator.pad_across_processes(labels, dim=1, pad_index=-100)

        predictions_gathered = accelerator.gather(predictions)
        labels_gathered = accelerator.gather(labels)

        true_predictions, true_labels = postprocess(predictions_gathered, labels_gathered)
        metric.add_batch(predictions=true_predictions, references=true_labels)

    results = metric.compute()
    print(
        f"epoch {epoch}:",
        {
            key: results[f"overall_{key}"]
            for key in ["precision", "recall", "f1", "accuracy"]
        },
    )

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

  0%|          | 0/12 [00:00<?, ?it/s]

KeyboardInterrupt: 

In [None]:
accelerator.wait_for_everyone()
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)

In [None]:
from transformers import AutoModelForTokenClassification, pipeline

# Replace this with your own checkpoint
#model_checkpoint = "huggingface-course/bert-finetuned-ner"
model_checkpoint = "yoyostudy/bert-finetuned-ner-accelerate"

# The Hub repo must exist and be accessible for this to load. For a private
# repo, log in first or pass token="<your_token>" as a keyword argument.
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)


# token_classifier = pipeline(
#     "token-classification", model=model_checkpoint, aggregation_strategy="simple"
# )
# token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")



OSError: yoyostudy/bert-finetuned-ner-accelerate is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo either by logging in with `huggingface-cli login` or by passing `token=<your_token>`