# Fine-Tuning BERT for Sequence Classification and Repository Creation on Hugging Face

This Jupyter notebook is a guide to preparing a pretrained base model from Hugging Face for fine-tuning: we load the base model, turn it into a model suitable for sequence classification, and push the result to a repository on Hugging Face.

The ultimate objective of this project is to train a pair of classifiers for the glosses of the Serbian WordNet. One classifier will distinguish positive from non-positive glosses, and the other will distinguish negative from non-negative glosses. Their predictions will help us create SentiWordNet-like annotations.

In this part of the project, we focus on preparing the pretrained models for fine-tuning. Subsequent notebooks will cover the fine-tuning process and the application of the models.

This notebook will take you through the following steps:

1. Initial setup - Importing required libraries and setting up the environment.
2. Loading the base BERT model from Hugging Face.
3. Initial configuration and preparation of the model for fine-tuning.
4. Creating a Hugging Face repository.
5. Pushing the prepared model to the repository.

Let's get started!


## Importing Required Libraries

Before we begin, we need to import the necessary libraries. The `transformers` library from Hugging Face provides us with pre-trained models and tokenizers that will help us in our task. We are specifically interested in the `AutoTokenizer` and `AutoModelForSequenceClassification` classes.

`AutoTokenizer` will be used to load the tokenizer corresponding to the base BERT model, while `AutoModelForSequenceClassification` is the class of the model we want to fine-tune.

The `huggingface_hub` library provides tools for working with the Hugging Face model hub. We'll use the `create_repo` function to create a new repository on the Hugging Face hub where we can store our fine-tuned model.

Let's import these classes and functions:


In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from huggingface_hub import create_repo

## Preparing the Model for Fine-tuning and Creating Repositories on Hugging Face

In this step, we prepare three pretrained models for fine-tuning: `bcms-bertic` provided by Classla, `Bertovo-sent-base` provided by Tanor, and the GPT-2 model `gpt2-orao` provided by JeRTeh.

We will train one classifier per polarity, positive and negative. For each polarity we therefore set the label names accordingly (`NON-POSITIVE`/`POSITIVE` or `NON-NEGATIVE`/`NEGATIVE`) and load the models and tokenizers with those labels.
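The label handling described above can be sketched in isolation. The helper `label_maps` below is hypothetical (it is not part of `transformers`); it only builds the two dictionaries, using the label names that appear in the code cell further down.

```python
def label_maps(polarity):
    """Return (id2label, label2id) for a binary polarity classifier.

    Hypothetical helper for illustration; `polarity` is "POS" or "NEG".
    """
    names = {"POS": "POSITIVE", "NEG": "NEGATIVE"}
    if polarity not in names:
        raise ValueError(f"unknown polarity: {polarity!r}")
    target = names[polarity]
    # Class 1 is the polarity itself, class 0 is everything else.
    id2label = {0: f"NON-{target}", 1: target}
    # The reverse mapping is derived so the two can never disagree.
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


id2label, label2id = label_maps("NEG")
print(id2label)
print(label2id)
```

Passing dictionaries like these as `id2label` and `label2id` to `from_pretrained` stores the label names in the model configuration, so they travel with the model when it is pushed to the Hub.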

We also create a series of repositories on Hugging Face to store our models. For each iteration in our range, we create three repositories: one for `BERTicovoSENT`, one for `BERTicSENT`, and one for `SRGPTSENT`. These repositories will later hold the different versions of our fine-tuned models and their tokenizers.

Each repository is private and lives under the `Tanor` namespace. It is named `BERTicovoSENT{polarity}{i}`, `BERTicSENT{polarity}{i}`, or `SRGPTSENT{polarity}{i}`, where `{polarity}` is either `POS` (for positive) or `NEG` (for negative), and `{i}` is the iteration number.
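To make the naming scheme concrete, here is a minimal sketch that only builds the repository names and never contacts the Hub. The helper `repo_names` is hypothetical (it is not part of `huggingface_hub`); the `Tanor` namespace, the prefixes, and the iteration numbers are taken from the code cell that follows.

```python
from itertools import product


def repo_names(owner, prefixes, polarities, iterations):
    """Return every repository id of the form {owner}/{prefix}{polarity}{i}."""
    return [
        f"{owner}/{prefix}{polarity}{i}"
        for polarity, i, prefix in product(polarities, iterations, prefixes)
    ]


names = repo_names(
    owner="Tanor",
    prefixes=["BERTicovoSENT", "BERTicSENT", "SRGPTSENT"],
    polarities=["NEG", "POS"],
    iterations=[0, 2, 4, 6],
)
print(len(names))  # 2 polarities x 4 iterations x 3 prefixes
print(names[:3])
```

Listing the names up front is a cheap way to check how many repositories a run will create before anything is uploaded.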

After creating each repository, we push the model and its tokenizer to it. At this stage, all repositories for a given model and polarity hold the same starting checkpoint: the pretrained weights plus a newly initialized classification head. Keeping the models on the Hub gives us version control and makes them easy to load in later notebooks.


In [2]:
iteration_range = [0, 2, 4, 6]
model_name_BERTic = "classla/bcms-bertic"
model_name_BERTicovo = "Tanor/Bertovo-sent-base"
model_name_SRGPT = "jerteh/gpt2-orao"

for polarity in ["NEG", "POS"]:
    # Label names for the binary classifier of the current polarity.
    if polarity == "NEG":
        id2label = {0: "NON-NEGATIVE", 1: "NEGATIVE"}
        label2id = {"NON-NEGATIVE": 0, "NEGATIVE": 1}
    else:
        id2label = {0: "NON-POSITIVE", 1: "POSITIVE"}
        label2id = {"NON-POSITIVE": 0, "POSITIVE": 1}

    # Load each base checkpoint with a new two-class classification head.
    tokenizer_BERTic = AutoTokenizer.from_pretrained(model_name_BERTic)
    model_BERTic = AutoModelForSequenceClassification.from_pretrained(
        model_name_BERTic, num_labels=2, id2label=id2label, label2id=label2id
    )
    tokenizer_BERTicovo = AutoTokenizer.from_pretrained(model_name_BERTicovo)
    model_BERTicovo = AutoModelForSequenceClassification.from_pretrained(
        model_name_BERTicovo, num_labels=2, id2label=id2label, label2id=label2id
    )
    tokenizer_SRGPT = AutoTokenizer.from_pretrained(model_name_SRGPT)
    model_SRGPT = AutoModelForSequenceClassification.from_pretrained(
        model_name_SRGPT, num_labels=2, id2label=id2label, label2id=label2id
    )

    # For every iteration, create one private repository per model and
    # upload the model together with its tokenizer.
    for i in iteration_range:
        model_name_out = f"Tanor/BERTicovoSENT{polarity}{i}"
        create_repo(model_name_out, private=True, exist_ok=True)
        model_BERTicovo.push_to_hub(model_name_out)
        tokenizer_BERTicovo.push_to_hub(model_name_out)

        model_name_out = f"Tanor/BERTicSENT{polarity}{i}"
        create_repo(model_name_out, private=True, exist_ok=True)
        model_BERTic.push_to_hub(model_name_out)
        tokenizer_BERTic.push_to_hub(model_name_out)

        model_name_out = f"Tanor/SRGPTSENT{polarity}{i}"
        create_repo(model_name_out, private=True, exist_ok=True)
        model_SRGPT.push_to_hub(model_name_out)
        tokenizer_SRGPT.push_to_hub(model_name_out)

Some weights of ElectraForSequenceClassification were not initialized from the model checkpoint at classla/bcms-bertic and are newly initialized: ['classifier.dense.bias', 'classifier.out_proj.weight', 'classifier.out_proj.bias', 'classifier.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at jerteh/gpt2-orao and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [8]:
# Re-create and re-upload only the GPT-2 repositories.
# iteration_range is defined in the previous code cell.
model_name_SRGPT = "jerteh/gpt2-orao"

for polarity in ["POS", "NEG"]:
    # Label names for the binary classifier of the current polarity.
    if polarity == "NEG":
        id2label = {0: "NON-NEGATIVE", 1: "NEGATIVE"}
        label2id = {"NON-NEGATIVE": 0, "NEGATIVE": 1}
    else:
        id2label = {0: "NON-POSITIVE", 1: "POSITIVE"}
        label2id = {"NON-POSITIVE": 0, "POSITIVE": 1}

    tokenizer_SRGPT = AutoTokenizer.from_pretrained(model_name_SRGPT)
    model_SRGPT = AutoModelForSequenceClassification.from_pretrained(
        model_name_SRGPT, num_labels=2, id2label=id2label, label2id=label2id
    )

    for i in iteration_range:
        model_name_out = f"Tanor/SRGPTSENT{polarity}{i}"
        create_repo(model_name_out, private=True, exist_ok=True)
        tokenizer_SRGPT.push_to_hub(model_name_out)
        model_SRGPT.push_to_hub(model_name_out)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Some weights of GPT2ForSequenceClassification were not initialized from the model checkpoint at JeRTeh/sr-gpt2-large and are newly initialized: ['score.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.

