<a href="https://colab.research.google.com/github/satyamverma95/DL_NLP/blob/main/DL_NLP_Assignment_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Make sure all the necessary libraries are installed.

In [2]:
#!pip install transformers datasets evaluate
#! pip install huggingface_hub
#!pip install datasets
#! pip install transformers==4.28.0
#!pip install --upgrade accelerate

Log in to your Hugging Face account.

In [3]:
from huggingface_hub import notebook_login

notebook_login()


Load the SQuAD dataset from the Datasets library.

In [4]:
from datasets import load_dataset

# Load a smaller subset of the data to make sure the whole pipeline works before training on the full dataset.
squad = load_dataset("squad", split="train[:5000]")



Split the dataset’s train split into a train and test set with the train_test_split method:

In [5]:
squad = squad.train_test_split(test_size=0.2)
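
The split above is random, so the rows that land in each set can change between runs. `train_test_split` also accepts a `seed` argument that makes the split repeatable. The following is only a rough pure-Python sketch of a seeded 80/20 split, not the library's implementation; `split_indices` and the seed value are hypothetical names chosen for illustration.

```python
import random


def split_indices(num_rows, test_size, seed):
    """Shuffle row indices with a fixed seed and cut off a test fraction.

    Illustrative stand-in only; the Datasets library has its own logic.
    """
    indices = list(range(num_rows))
    random.Random(seed).shuffle(indices)
    num_test = int(round(num_rows * test_size))
    # The first num_test shuffled indices form the test set, the rest the train set.
    return indices[num_test:], indices[:num_test]


train_idx, test_idx = split_indices(5000, 0.2, seed=42)
print(len(train_idx), len(test_idx))
```

With 5000 rows and `test_size=0.2`, this leaves 4000 rows for training and 1000 for evaluation, and the same seed always yields the same partition.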

Take a look at an example.

In [6]:
squad["train"][496]

{'id': '56bf95eaa10cfb1400551195',
 'title': 'Beyoncé',
 'context': 'Her first acting role of 2006 was in the comedy film The Pink Panther starring opposite Steve Martin, grossing $158.8 million at the box office worldwide. Her second film Dreamgirls, the film version of the 1981 Broadway musical loosely based on The Supremes, received acclaim from critics and grossed $154 million internationally. In it, she starred opposite Jennifer Hudson, Jamie Foxx, and Eddie Murphy playing a pop singer based on Diana Ross. To promote the film, Beyoncé released "Listen" as the lead single from the soundtrack album. In April 2007, Beyoncé embarked on The Beyoncé Experience, her first worldwide concert tour, visiting 97 venues and grossed over $24 million.[note 1] Beyoncé conducted pre-concert food donation drives during six major stops in conjunction with her pastor at St. John\'s and America\'s Second Harvest. At the same time, B\'Day was re-released with five additional songs, including her duet w
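
Each SQuAD record stores its answer as a text string plus a character offset into `context`, under `answers["text"]` and `answers["answer_start"]`. The sketch below uses a small made-up record (not taken from the dataset) to show how the two fields relate; the preprocessing step later relies on the same arithmetic to get the end character.

```python
# Toy record in the SQuAD layout; the values are invented for illustration.
example = {
    "context": "The Pink Panther was released in 2006.",
    "question": "When was The Pink Panther released?",
    "answers": {"text": ["2006"], "answer_start": [33]},
}

# The answer span is [start_char, end_char) in character positions.
start_char = example["answers"]["answer_start"][0]
end_char = start_char + len(example["answers"]["text"][0])

# Slicing the context with these offsets recovers the answer text.
recovered = example["context"][start_char:end_char]
print(start_char, end_char, repr(recovered))
```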

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [8]:
def preprocess_function(examples):
    # Strip leading and trailing whitespace from each question in the batch.
    questions = [q.strip() for q in examples["question"]]
    
    # Tokenize each question together with its context.
    # - max_length sets the maximum length of the tokenized inputs.
    # - truncation="only_second" truncates only the context (the second sequence)
    #   when the pair exceeds max_length.
    # - return_offsets_mapping=True makes the tokenizer return, for each token,
    #   its character offsets in the original text.
    # - padding="max_length" pads every input to max_length.
    
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    # Retrieve offset_mapping from the tokenizer's output and remove it from the inputs dictionary.
    # It maps each token to its (start, end) character offsets in the original text.
    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]  # Answers for every example in the batch.
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        # sequence_ids indicates which part of the input each token belongs to:
        # 0 for the question, 1 for the context, None for special tokens.
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context

        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1


        # If the answer is not fully inside the context, it lies outside the
        # portion of the text that was kept, so label it (0, 0).
        # Otherwise start_positions and end_positions receive the token positions
        # found below. After the loop, both lists are added to the inputs
        # dictionary, which is then returned.
        if offset[context_start][0] > end_char or offset[context_end][1] < start_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
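
To see how the character-to-token search in `preprocess_function` behaves, here is a small self-contained sketch that runs the same loop logic on hand-written offsets. The offsets and sequence IDs are invented to mimic a tokenized pair such as "Who?" / "Ann sings." and were not produced by a real tokenizer; `locate_answer` is a hypothetical helper name.

```python
def locate_answer(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) for a character span, or (0, 0)
    if the span is not inside the context tokens."""
    # Find the first and last token that belong to the context (sequence id 1).
    idx = 0
    while sequence_ids[idx] != 1:
        idx += 1
    context_start = idx
    while sequence_ids[idx] == 1:
        idx += 1
    context_end = idx - 1

    # Answer outside the kept context: label it (0, 0).
    if offsets[context_start][0] > end_char or offsets[context_end][1] < start_char:
        return 0, 0

    # Walk forward to the last token starting at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token ending at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token


# Layout: [CLS] who ? [SEP] ann sings . [SEP] [PAD]
offsets = [(0, 0), (0, 3), (3, 4), (0, 0), (0, 3), (4, 9), (9, 10), (0, 0), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None, None]

single = locate_answer(offsets, sequence_ids, 4, 9)    # "sings"
multi = locate_answer(offsets, sequence_ids, 0, 9)     # "Ann sings"
missing = locate_answer(offsets, sequence_ids, 20, 25)  # beyond the kept context
print(single, multi, missing)
```

The span for "sings" maps to token 5 alone, "Ann sings" maps to tokens 4 through 5, and a span past the end of the kept context falls back to (0, 0).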

In [9]:
tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)


In [10]:
tokenized_squad["train"]

Dataset({
    features: ['input_ids', 'attention_mask', 'start_positions', 'end_positions'],
    num_rows: 4000
})

In [11]:
from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()

In [12]:
from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForQuestionAnswering: ['vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForQuestionAnswering were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['qa_outputs.bias', 'qa_outputs.weight']
You should probably TRAIN this mode

Three steps are performed below:

1) Define the training hyperparameters in TrainingArguments.
2) Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
3) Call train() to fine-tune the model.

In [13]:
#!pip install --upgrade accelerate
#! pip install transformers==4.28.0

In [None]:
training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)

trainer.train()

Cloning https://huggingface.co/satyamverma/my_awesome_qa_model into local empty directory.


Epoch,Training Loss,Validation Loss
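
As a rough sanity check on the hyperparameters above, the number of optimizer steps can be estimated from the dataset size and batch size. This sketch assumes a single device, the default gradient accumulation of 1, and that the last partial batch is kept; a different setup would change the numbers.

```python
import math

num_train_examples = 4000          # rows in tokenized_squad["train"]
per_device_train_batch_size = 16   # from TrainingArguments above
num_train_epochs = 3               # from TrainingArguments above
num_devices = 1                    # assumption: one GPU or CPU
gradient_accumulation_steps = 1    # assumption: left at its default

# Examples consumed per optimizer step.
effective_batch_size = (
    per_device_train_batch_size * num_devices * gradient_accumulation_steps
)

steps_per_epoch = math.ceil(num_train_examples / effective_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions, one epoch is 250 steps and the full run is 750 steps.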


In [None]:
trainer.push_to_hub()