# Fine-Tune BERT for Text Classification



#### Model: Google's pre-trained BERT model (2018)
#### Library: Hugging Face Transformers
#### Dataset: https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html
#### Problem: Text Classification


In [None]:
!pip3 install transformers


In [None]:
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from transformers import BertTokenizerFast, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
from sklearn.metrics import accuracy_score


In [None]:
model_name = "bert-base-uncased"
max_length = 512

In [None]:
tokenizer = BertTokenizerFast.from_pretrained(model_name, do_lower_case=True)

In [None]:
dataset = fetch_20newsgroups(subset="all", shuffle=True, remove=("headers","footers", "quotes"))
target_names=dataset.target_names
news_text = dataset.data
labels = dataset.target
(train_x,test_x,train_y,test_y)=train_test_split(news_text, labels, test_size=0.3)
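
As a quick sanity check on the 70/30 split, the resulting set sizes can be worked out by hand. The sketch below assumes that the test share is rounded up and the training set receives the remainder, and that the full corpus holds 18,846 posts (the sum of the 13,192 training and 5,654 evaluation examples reported in the training log further down). The helper name `split_sizes` is hypothetical and only serves this illustration.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test share is rounded up and the training set
    gets whatever is left, which is how train_test_split is understood
    to behave when only test_size is given.
    """
    n_test = math.ceil(n_samples * test_size)
    n_train = n_samples - n_test
    return n_train, n_test

# 18846 is inferred from the training log (13192 + 5654), not read from the dataset
n_train, n_test = split_sizes(18846, 0.3)
print(n_train, n_test)
```

Because no `random_state` is passed to `train_test_split`, the members of each set change from run to run, even though the sizes stay the same.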


In [None]:
train_encodings = tokenizer(train_x, truncation=True, padding=True, max_length=max_length)
test_encodings = tokenizer(test_x, truncation=True, padding=True, max_length=max_length)
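
To make the effect of `truncation=True`, `padding=True` and `max_length` more concrete, here is a toy sketch in plain Python. It is not the BERT tokenizer: the function `pad_and_truncate`, the token ids and `pad_id=0` are made-up placeholders, and unlike the real tokenizer it does not re-attach a closing special token after truncation. It only illustrates the general idea that long sequences are cut to `max_length` and shorter ones are padded to the longest sequence in the same call, with an attention mask marking the real tokens.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Cut each sequence to max_length, then pad to the longest one that remains."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# two made-up token id sequences of different lengths
toy_batch = [[101, 7592, 102], [101, 2023, 2003, 1037, 2146, 2695, 102]]
encoded = pad_and_truncate(toy_batch, max_length=5)
print(encoded["input_ids"])
print(encoded["attention_mask"])
```

Since each tokenizer call above receives an entire split at once, every example in that split is padded to the longest sequence in it (at most `max_length`), which costs memory when most posts are short.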

In [None]:
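# use the GPU when one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model = BertForSequenceClassification.from_pretrained(model_name, num_labels=len(target_names)).to(device)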
training_args = TrainingArguments(
    output_dir='./results',          # output directory
    num_train_epochs=3,              # total number of training epochs
    per_device_train_batch_size=8,  # batch size per device during training
    per_device_eval_batch_size=20,   # batch size for evaluation
    warmup_steps=500,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs',            # directory for storing logs
    load_best_model_at_end=True,     # load the best model when finished training (default metric is loss)
    # but you can specify `metric_for_best_model` argument to change to accuracy or other metric
    logging_steps=400,               # log & save weights each logging_steps
    save_steps=400,
    evaluation_strategy="steps",     # evaluate each `logging_steps`
)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForSequenceClassification: ['cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.seq_relationship.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.decoder.weight', 'cls.predictions.transform.LayerNorm.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at

In [None]:
class NewsGroupsDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor([self.labels[idx]])
        return item

    def __len__(self):
        return len(self.labels)


# convert our tokenized data into a torch Dataset
train_dataset = NewsGroupsDataset(train_encodings, train_y)
test_dataset = NewsGroupsDataset(test_encodings, test_y)

In [None]:
def compute_metrics(pred):
  labels = pred.label_ids
  preds = pred.predictions.argmax(-1)
  # calculate accuracy using sklearn's function
  acc = accuracy_score(labels, preds)
  return {
      'accuracy': acc,
  }
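
The two operations inside `compute_metrics` can be illustrated without any libraries. The sketch below uses made-up logits for a three-class problem; `argmax_last` and `accuracy` are hypothetical stand-ins for `predictions.argmax(-1)` and sklearn's `accuracy_score`, written only to show what those calls compute.

```python
def argmax_last(rows):
    """Index of the largest value in each row (the role of argmax(-1))."""
    return [max(range(len(row)), key=row.__getitem__) for row in rows]

def accuracy(labels, preds):
    """Fraction of positions where the prediction equals the label."""
    correct = sum(1 for y, p in zip(labels, preds) if y == p)
    return correct / len(labels)

# made-up logits: one row per example, one column per class
toy_logits = [
    [2.0, 0.1, -1.0],
    [0.3, 1.5, 0.2],
    [0.0, 0.4, 0.9],
    [1.2, 0.8, 0.1],
]
toy_labels = [0, 1, 2, 1]

toy_preds = argmax_last(toy_logits)
print(toy_preds, accuracy(toy_labels, toy_preds))
```

Here the last example is misclassified, so three of the four predictions match their labels.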

In [None]:
trainer = Trainer(
    model=model,                         # the instantiated Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=train_dataset,         # training dataset
    eval_dataset=test_dataset,          # evaluation dataset
    compute_metrics=compute_metrics,     # the callback that computes metrics of interest
)

In [None]:
# train the model
trainer.train()

***** Running training *****
  Num examples = 13192
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 4947
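
The step count in this log follows from the dataset size and the training arguments. The sketch below reproduces the arithmetic under the assumption of a single device and one gradient accumulation step, as the log reports; the function name `total_optimization_steps` is hypothetical.

```python
import math

def total_optimization_steps(num_examples, batch_size, epochs):
    """Optimizer updates for a run with one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * epochs

total_steps = total_optimization_steps(num_examples=13192, batch_size=8, epochs=3)

# steps at which evaluation and checkpointing happen with a 400-step interval
eval_points = list(range(400, total_steps + 1, 400))

# share of the run spent in learning-rate warmup (warmup_steps=500)
warmup_fraction = 500 / total_steps

print(total_steps, len(eval_points), eval_points[-1], round(warmup_fraction, 3))
```

With these settings the warmup phase covers about 10% of the run, and the last evaluation falls at step 4800, shortly before training ends.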


Step,Training Loss,Validation Loss,Accuracy
400,2.4203,1.462609,0.602759
800,1.2889,1.131437,0.664839
1200,1.0863,1.009583,0.690838
1600,0.9728,0.942397,0.710117
2000,0.7619,0.995978,0.72515
2400,0.7242,0.922742,0.734701
2800,0.7189,0.904325,0.738946
3200,0.6311,0.88911,0.746374
3600,0.4732,0.948225,0.752034
4000,0.3848,0.962657,0.757871


***** Running Evaluation *****
  Num examples = 5654
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-400
Configuration saved in ./results/checkpoint-400/config.json
Model weights saved in ./results/checkpoint-400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 5654
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-800
Configuration saved in ./results/checkpoint-800/config.json
Model weights saved in ./results/checkpoint-800/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 5654
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-1200
Configuration saved in ./results/checkpoint-1200/config.json
Model weights saved in ./results/checkpoint-1200/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 5654
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-1600
Configuration saved in ./results/checkpoint-1600/config.json
Model weights saved in ./results/checkpoint-1600/pytorch_model.bi

Saving model checkpoint to ./results/checkpoint-4400
Configuration saved in ./results/checkpoint-4400/config.json
Model weights saved in ./results/checkpoint-4400/pytorch_model.bin
***** Running Evaluation *****
  Num examples = 5654
  Batch size = 20
Saving model checkpoint to ./results/checkpoint-4800
Configuration saved in ./results/checkpoint-4800/config.json
Model weights saved in ./results/checkpoint-4800/pytorch_model.bin


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from ./results/checkpoint-3200 (score: 0.88910973072052).
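
The choice of checkpoint-3200 can be checked against the evaluation table. The sketch below copies the ten rows shown in that table (steps 4400 and 4800 are not listed there, so they are left out) and picks the best row twice: once by lowest validation loss, which is the default criterion mentioned in the training arguments, and once by highest accuracy, which is what setting `metric_for_best_model` to accuracy would aim for.

```python
# (step, validation loss, accuracy) copied from the table in the training output
eval_log = [
    (400, 1.462609, 0.602759),
    (800, 1.131437, 0.664839),
    (1200, 1.009583, 0.690838),
    (1600, 0.942397, 0.710117),
    (2000, 0.995978, 0.725150),
    (2400, 0.922742, 0.734701),
    (2800, 0.904325, 0.738946),
    (3200, 0.889110, 0.746374),
    (3600, 0.948225, 0.752034),
    (4000, 0.962657, 0.757871),
]

best_by_loss = min(eval_log, key=lambda row: row[1])  # lower loss is better
best_by_acc = max(eval_log, key=lambda row: row[2])   # higher accuracy is better

print("best by loss:", best_by_loss)
print("best by accuracy:", best_by_acc)
```

Among the listed rows the two criteria disagree: validation loss is lowest at step 3200, while accuracy is still rising at step 4000. The later rise in validation loss alongside falling training loss suggests the model is starting to overfit.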


TrainOutput(global_step=4947, training_loss=0.836740604429166, metrics={'train_runtime': 12262.8084, 'train_samples_per_second': 3.227, 'train_steps_per_second': 0.403, 'total_flos': 1.0414566002294784e+16, 'train_loss': 0.836740604429166, 'epoch': 3.0})

In [None]:
# evaluate the current model after training
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 5654
  Batch size = 20


{'epoch': 3.0,
 'eval_accuracy': 0.7463742483197736,
 'eval_loss': 0.88910973072052,
 'eval_runtime': 389.4286,
 'eval_samples_per_second': 14.519,
 'eval_steps_per_second': 0.727}

In [None]:
# saving the fine tuned model & tokenizer
model_path = "20newsgroups-bert-base-uncased"
model.save_pretrained(model_path)
tokenizer.save_pretrained(model_path)

Configuration saved in 20newsgroups-bert-base-uncased/config.json
Model weights saved in 20newsgroups-bert-base-uncased/pytorch_model.bin
tokenizer config file saved in 20newsgroups-bert-base-uncased/tokenizer_config.json
Special tokens file saved in 20newsgroups-bert-base-uncased/special_tokens_map.json


('20newsgroups-bert-base-uncased/tokenizer_config.json',
 '20newsgroups-bert-base-uncased/special_tokens_map.json',
 '20newsgroups-bert-base-uncased/vocab.txt',
 '20newsgroups-bert-base-uncased/added_tokens.json',
 '20newsgroups-bert-base-uncased/tokenizer.json')

In [None]:
def get_prediction(text):
    # tokenize the text and move the tensors to the same device as the model
    inputs = tokenizer(text, padding=True, truncation=True, max_length=max_length, return_tensors="pt").to(model.device)
    # run inference without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)
    # convert the logits to probabilities with softmax
    probs = outputs[0].softmax(1)
    # return the name of the most probable label
    return target_names[probs.argmax().item()]
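
`get_prediction` returns only the single most probable label. When a text sits between topics, it can help to look at the runner-up classes as well. The following is a minimal sketch of that idea in plain Python, assuming made-up logits and only three of the twenty newsgroup names; `softmax` and `top_k` are hypothetical helpers, not part of the Transformers library.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, names, k=3):
    """Return the k most probable (name, probability) pairs, best first."""
    probs = softmax(logits)
    ranked = sorted(zip(names, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# made-up scores for three of the twenty classes
toy_names = ["sci.space", "talk.politics.misc", "rec.sport.hockey"]
toy_logits = [3.0, 1.0, 0.5]

for name, prob in top_k(toy_logits, toy_names, k=2):
    print(f"{name}: {prob:.3f}")
```

The same ranking could be applied to the probabilities computed inside `get_prediction` to see how confident the fine-tuned model is in each prediction.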

# Classifying texts

In [None]:
# Example 
text = """
A black hole is a place in space where gravity pulls so much that even light can not get out. 
The gravity is so strong because matter has been squeezed into a tiny space. This can happen when a star is dying.
Because no light can get out, people can't see black holes. 
They are invisible. Space telescopes with special tools can help find black holes. 
The special tools can see how stars that are very close to black holes act differently than other stars.
"""

print(get_prediction(text))

sci.space


In [None]:
# Example 
text= """
As Earth’s climate warms, incidences of extreme heat and humidity are rising,
with significant consequences for human health. Climate scientists are tracking
a key measure of heat stress that can warn us of harmful conditions.
"""

print(get_prediction(text))

talk.politics.misc


In [None]:
# Example 
text= """
In Pittsburgh, he will compete with Mason Rudolph for the starting quarterback position
 as the franchise begins the post-Ben Roethlisberger era. Roethlisberger retired 
 in January after 18 seasons with the Steelers in which he helped the franchise
  to two Super Bowl victories and finished his career with the fifth-most
   passing yards (64,088) in NFL history.
"""

print(get_prediction(text))

rec.sport.hockey


Resource:

https://www.thepythoncode.com/article/finetuning-bert-using-huggingface-transformers-python