# DATA PREPARATION
After storing all news titles in a CSV file, we import the file into Label Studio and label the words in each title with one of two tags, "ACTIVITY" or "OBJECTIVITY", following the business case:
1. ACTIVITY: words related to deals, bids, investments, responses to problems, etc.

2. OBJECTIVITY: words related to organizations, products, competitors, business partners, etc.

Next, we export all of the titles to a txt file in CoNLL-2003 format, then manually split the file into three sub-files: 80% for training, 14% for validation, and 6% for testing. Finally, we compress them into a single zip file (i.e., "bbc_label_bert.zip") and use a Python script (i.e., "bbc_bert_processing.py") to load the data in a structure suitable for building the BERT model.
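The export-and-split step can be sketched in a few lines of Python. This is a minimal sketch, not the procedure used for this project (the split here was done manually): it assumes the exported file puts the token first and the tag last on each line and separates titles with blank lines. The function names `read_conll` and `split_sentences` are hypothetical, and the sample text reuses the first five tokens of the first training title in an assumed column layout.

```python
def read_conll(text):
    """Parse CoNLL-style text into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line or a document marker closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))  # assumed: token first, tag last
    if current:
        sentences.append(current)
    return sentences


def split_sentences(sentences, train_frac=0.80, val_frac=0.14):
    """Split in file order; whatever is left after train and validation is the test set."""
    n_train = round(len(sentences) * train_frac)
    n_val = round(len(sentences) * val_frac)
    train = sentences[:n_train]
    val = sentences[n_train:n_train + n_val]
    test = sentences[n_train + n_val:]
    return train, val, test


# Assumed export layout; the middle columns are placeholders.
sample = """-DOCSTART- -X- O
CAA -X- _ O
: -X- _ O
Microsoft -X- _ B-OBJECTIVITY
boss -X- _ B-OBJECTIVITY
calls -X- _ B-ACTIVITY
"""

sentences = read_conll(sample)
print(sentences[0])

# Split sizes for a stand-in list with as many items as there are titles.
train, val, test = split_sentences(list(range(128)))
print(len(train), len(val), len(test))
```

Because the function rounds the fractions, its split sizes can differ by a title or two from a manual split with the same nominal percentages.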

# DATA MODELING
Here we fine-tune a BERT model (DistilBERT) to identify the ACTIVITY and OBJECTIVITY entities in BBC news titles about Microsoft Corp.

In [1]:
!pip install datasets transformers[torch] seqeval


In [2]:
# Load the dataset
from datasets import load_dataset, load_metric
dataset = load_dataset("bbc_bert_processing.py")

Found cached dataset bbc_bert_processing (C:/Users/86158/.cache/huggingface/datasets/bbc_bert_processing/bbc_news_for_model/1.0.0/e63f6861124cf3d51d42aa550de7f2ee609cdc2401ad8d5a052406e45635bd4a)



We use 80% for training, 14% for validation and 6% for testing

In [3]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 103
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 17
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags'],
        num_rows: 8
    })
})
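As a quick check, the actual share of each split can be computed from the row counts printed above. This is only a small arithmetic sketch; the dictionary `split_rows` is typed in by hand from that output rather than read from the `dataset` object.

```python
# Row counts copied from the DatasetDict printed above.
split_rows = {"train": 103, "validation": 17, "test": 8}

total = sum(split_rows.values())
shares = {name: rows / total for name, rows in split_rows.items()}

for name, share in shares.items():
    print(f"{name:<10} {split_rows[name]:>3} rows  {share:.1%}")
```

The shares come out close to the nominal 80/14/6 split; with only 128 titles, the percentages cannot be matched exactly.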

In [4]:
# Check the first observation
dataset["train"][0]

{'id': '0',
 'tokens': ['CAA',
  ':',
  'Microsoft',
  'boss',
  'calls',
  'India',
  "'s",
  'new',
  'citizenship',
  'law',
  "'sad",
  "'"],
 'ner_tags': [0, 0, 2, 2, 1, 2, 2, 0, 2, 0, 0, 0]}

In [5]:
# Check the tags
tags = dataset["train"].features["ner_tags"]
print(tags)

Sequence(feature=ClassLabel(names=['O', 'B-ACTIVITY', 'B-OBJECTIVITY'], id=None), length=-1, id=None)


In [6]:
label_list = dataset["train"].features["ner_tags"].feature.names
label_list

['O', 'B-ACTIVITY', 'B-OBJECTIVITY']
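To make the integer tags readable, each tag id can be mapped back to its name through `label_list`. The sketch below does this for the first training observation, with the tokens and tag ids copied by hand from the output shown earlier so that it runs without loading the dataset.

```python
from collections import Counter

label_list = ["O", "B-ACTIVITY", "B-OBJECTIVITY"]

# First training observation, copied from the notebook output.
tokens = ["CAA", ":", "Microsoft", "boss", "calls", "India", "'s",
          "new", "citizenship", "law", "'sad", "'"]
ner_tags = [0, 0, 2, 2, 1, 2, 2, 0, 2, 0, 0, 0]

# Pair every token with the name of its tag.
decoded = [(token, label_list[tag]) for token, tag in zip(tokens, ner_tags)]
for token, label in decoded:
    print(f"{token:<12} {label}")

# Count how often each tag occurs in this title.
counts = Counter(label for _, label in decoded)
print(counts)
```

Even in this single title, OBJECTIVITY tags outnumber ACTIVITY tags five to one.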

### FINE-TUNING 

In [7]:
import torch
task = "ner"
model_checkpoint = "distilbert-base-uncased"
batch_size = 16
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [8]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


In [9]:
example = dataset["train"][2]
tokenized_input = tokenizer(example["tokens"], is_split_into_words=True)
tokens = tokenizer.convert_ids_to_tokens(tokenized_input["input_ids"])
print(tokens)

['[CLS]', 'bill', 'gates', 'steps', 'down', 'from', 'microsoft', 'board', 'to', 'focus', 'on', 'philanthropy', '[SEP]']


In [10]:
word_ids = tokenized_input.word_ids()
aligned_labels = [-100 if i is None else example[f"{task}_tags"][i] for i in word_ids]
label_all_tokens = True

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"], truncation=True, is_split_into_words=True)

    labels = []
    for i, label in enumerate(examples[f"{task}_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            # Special tokens have a word id that is None. We set the label to -100 so they are automatically
            # ignored in the loss function.
            if word_idx is None:
                label_ids.append(-100)
            # We set the label for the first token of each word.
            elif word_idx != previous_word_idx:
                label_ids.append(label[word_idx])
            # For the other tokens in a word, we set the label to either the current label or -100, depending on
            # the label_all_tokens flag.
            else:
                label_ids.append(label[word_idx] if label_all_tokens else -100)
            previous_word_idx = word_idx

        labels.append(label_ids)

    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
print(tokenized_datasets)


DatasetDict({
    train: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 103
    })
    validation: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 17
    })
    test: Dataset({
        features: ['id', 'tokens', 'ner_tags', 'input_ids', 'attention_mask', 'labels'],
        num_rows: 8
    })
})
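The alignment logic in `tokenize_and_align_labels` is easier to follow on a small case without the tokenizer. The sketch below reimplements the inner loop as a standalone function; `align_labels` and the `word_ids` list are hypothetical, chosen to mimic a three-word title whose second word is split into two sub-word tokens.

```python
def align_labels(word_ids, word_labels, label_all_tokens=True):
    """Expand word-level labels to token-level labels, as in the loop above."""
    label_ids = []
    previous_word_idx = None
    for word_idx in word_ids:
        if word_idx is None:
            # Special tokens such as [CLS] and [SEP] are ignored by the loss.
            label_ids.append(-100)
        elif word_idx != previous_word_idx:
            # First token of a word always takes the word's label.
            label_ids.append(word_labels[word_idx])
        else:
            # Later sub-word tokens repeat the label or are ignored.
            label_ids.append(word_labels[word_idx] if label_all_tokens else -100)
        previous_word_idx = word_idx
    return label_ids


# Hypothetical tokenization: [CLS], word 0, word 1 (two pieces), word 2, [SEP].
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [2, 1, 0]  # B-OBJECTIVITY, B-ACTIVITY, O

all_tokens = align_labels(word_ids, word_labels, label_all_tokens=True)
first_only = align_labels(word_ids, word_labels, label_all_tokens=False)
print(all_tokens)
print(first_only)
```

With `label_all_tokens = True`, as in this notebook, every sub-word piece counts as a labeled token, so words that split into several pieces contribute more than once to the loss and to the evaluation counts.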


In [11]:
id2label = {
    0: "O",
    1: "B-ACTIVITY",
    2: "B-OBJECTIVITY"

}

label2id = {
    "O": 0,
    "B-ACTIVITY": 1,
    "B-OBJECTIVITY": 2
}
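The two mappings can also be derived from `label_list` instead of being typed by hand, which keeps them in sync if the tag set changes. This is an optional sketch; `label_list` is redefined here only so the snippet runs on its own.

```python
label_list = ["O", "B-ACTIVITY", "B-OBJECTIVITY"]

# Position in the list is the class id.
id2label = dict(enumerate(label_list))
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id)
```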

In [12]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(model_checkpoint,
                                                        id2label=id2label,
                                                        label2id=label2id,
                                                        num_labels=len(label_list)).to(device)


Some weights of DistilBertForTokenClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [13]:
from transformers import TrainingArguments

model_name = model_checkpoint.split("/")[-1]
args = TrainingArguments(
    f"{model_name}-ner",
    evaluation_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=3,
    weight_decay=0.01,
    #push_to_hub=True
)

In [14]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer)

In [15]:
metric = load_metric("seqeval")
labels = [label_list[i] for i in example[f"{task}_tags"]]
metric.compute(predictions=[labels], references=[labels])


{'ACTIVITY': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 2},
 'OBJECTIVITY': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 5},
 'overall_precision': 1.0,
 'overall_recall': 1.0,
 'overall_f1': 1.0,
 'overall_accuracy': 1.0}

In [16]:
import numpy as np

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }

### TRAIN THE MODEL

In [17]:
from transformers import Trainer

trainer = Trainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

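# This object is not passed to the Trainer, so it has no effect on training.
# The import assumes an accelerate version that provides DataLoaderConfiguration.
from accelerate.utils import DataLoaderConfiguration
dataloader_config = DataLoaderConfiguration(dispatch_batches=None)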


In [18]:
print("Training starts NOW")
trainer.train()

You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Training starts NOW


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | No log | 0.922034 | 0.57047 | 0.669291 | 0.615942 | 0.59893 |
| 2 | No log | 0.818315 | 0.648855 | 0.669291 | 0.658915 | 0.663102 |
| 3 | No log | 0.776179 | 0.685484 | 0.669291 | 0.677291 | 0.679144 |




TrainOutput(global_step=21, training_loss=0.9139618646530878, metrics={'train_runtime': 26.6781, 'train_samples_per_second': 11.583, 'train_steps_per_second': 0.787, 'total_flos': 1504324122750.0, 'train_loss': 0.9139618646530878, 'epoch': 3.0})
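The `global_step=21` in the output follows from the dataset size and the batch size. The arithmetic below is a sketch that assumes a single device and no gradient accumulation (neither is changed in the arguments above); the value 500 is the default `logging_steps` of `TrainingArguments`, which is the likely reason the Training Loss column shows "No log".

```python
import math

train_rows = 103   # rows in the training split
batch_size = 16    # per_device_train_batch_size
num_epochs = 3     # num_train_epochs

# The last, partially filled batch still counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

default_logging_steps = 500  # TrainingArguments default, not set in this notebook

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"loss logged during training: {total_steps >= default_logging_steps}")
```

Setting a smaller `logging_steps` in `TrainingArguments` (for example, one epoch's worth of steps) would make the training loss appear in the table.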

# EVALUATION AND RECOMMENDATION

In [19]:
trainer.evaluate()

predictions, labels, _ = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions, axis=2)

# Remove ignored index (special tokens)
true_predictions = [
    [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]
true_labels = [
    [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
    for prediction, label in zip(predictions, labels)
]

results = metric.compute(predictions=true_predictions, references=true_labels)
results

{'ACTIVITY': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 37},
 'OBJECTIVITY': {'precision': 0.6854838709677419,
  'recall': 0.9444444444444444,
  'f1': 0.794392523364486,
  'number': 90},
 'overall_precision': 0.6854838709677419,
 'overall_recall': 0.6692913385826772,
 'overall_f1': 0.6772908366533864,
 'overall_accuracy': 0.679144385026738}

The model performs very well on the "OBJECTIVITY" tag, with a recall of about 0.94. However, precision, recall, and F1 are all 0 for the "ACTIVITY" tag, which means the model did not correctly identify a single ACTIVITY entity. This is why the overall accuracy of the model is only acceptable, at 67.91%.

To explain, the `number` metric shows that only 37 of the 127 entities counted in the validation set carry the "ACTIVITY" tag, while the "OBJECTIVITY" tag appears 90 times. It is clear that the dataset I collected is mainly related to "OBJECTIVITY" and does not contain enough "ACTIVITY" examples for the model to learn that tag, but there are two more possibilities:

1. Most of the news about Microsoft is related to "OBJECTIVITY", no matter how many titles I collect.
   
2. I was biased when labeling the words in the news titles.

To improve the model's performance, I can double-check the labeling with my group members or define another business case that leads to a different labeling scheme. As it stands, however, my labeling is not suitable for building a BERT model for this business case.
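The reported figures can be cross-checked with a little arithmetic. The sketch below starts from the per-tag values in the evaluation output and recovers the underlying counts; the variable names are hypothetical, and the rounding steps assume the reported ratios come from whole-number counts.

```python
# Values copied from the evaluation output.
gold = {"ACTIVITY": 37, "OBJECTIVITY": 90}   # the "number" field per tag
objectivity_recall = 0.9444444444444444
objectivity_precision = 0.6854838709677419

# Recover the whole-number counts behind the ratios.
correct = round(gold["OBJECTIVITY"] * objectivity_recall)   # correctly found entities
predicted = round(correct / objectivity_precision)          # entities the model predicted
total_gold = sum(gold.values())                             # entities in the references

# ACTIVITY contributes no correct predictions, so the overall scores
# depend only on the OBJECTIVITY hits.
overall_precision = correct / predicted
overall_recall = correct / total_gold
overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

print(correct, predicted, total_gold)
print(round(overall_precision, 4), round(overall_recall, 4), round(overall_f1, 4))
```

The overall precision equals the OBJECTIVITY precision, which is only possible if every predicted entity was an OBJECTIVITY entity. In other words, the model did not predict the ACTIVITY tag at all on the validation set, so all 37 ACTIVITY entities were missed and overall recall is capped at 90/127.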

In [20]:
trainer.save_model("bert_model")

In [21]:
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained("bert_model")
nlp = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="first")

In [22]:
EXAMPLE = "Microsoft loses key Xbox executive amid continued gaming shake-up"

ner_results = nlp(EXAMPLE)
ner_results

[{'entity_group': 'OBJECTIVITY',
  'score': 0.8046711,
  'word': 'microsoft',
  'start': 0,
  'end': 9},
 {'entity_group': 'OBJECTIVITY',
  'score': 0.7521492,
  'word': 'xbox',
  'start': 20,
  'end': 24},
 {'entity_group': 'OBJECTIVITY',
  'score': 0.5864086,
  'word': 'executive',
  'start': 25,
  'end': 34},
 {'entity_group': 'OBJECTIVITY',
  'score': 0.6137433,
  'word': 'gaming',
  'start': 50,
  'end': 56}]
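The pipeline returns lower-cased words because the tokenizer is uncased, but the `start` and `end` offsets point back into the original string. The sketch below uses them to recover the original spelling and to drop low-confidence predictions; the helper `extract_spans` and the 0.6 threshold are hypothetical, and the results are copied from the output above so the snippet runs without the model.

```python
EXAMPLE = "Microsoft loses key Xbox executive amid continued gaming shake-up"

# Pipeline output copied from the cell above.
ner_results = [
    {"entity_group": "OBJECTIVITY", "score": 0.8046711, "word": "microsoft", "start": 0, "end": 9},
    {"entity_group": "OBJECTIVITY", "score": 0.7521492, "word": "xbox", "start": 20, "end": 24},
    {"entity_group": "OBJECTIVITY", "score": 0.5864086, "word": "executive", "start": 25, "end": 34},
    {"entity_group": "OBJECTIVITY", "score": 0.6137433, "word": "gaming", "start": 50, "end": 56},
]


def extract_spans(text, results, min_score=0.6):
    """Return (original text span, tag, score) for predictions above the threshold."""
    return [
        (text[r["start"]:r["end"]], r["entity_group"], r["score"])
        for r in results
        if r["score"] >= min_score
    ]


spans = extract_spans(EXAMPLE, ner_results)
for span, tag, score in spans:
    print(f"{span:<10} {tag:<12} {score:.2f}")
```

Note that "loses", the one word in this title that describes an action, is not tagged at all, which is consistent with the zero ACTIVITY scores in the evaluation.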