
In [3]:
### Use pre-trained pipeline
from transformers import pipeline
# Initialize a pipeline for question answering using a pre-trained model
qa_pipeline = pipeline("question-answering", model="bert-large-uncased-whole-word-masking-finetuned-squad")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
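- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).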

In [4]:
# Example context and question
context = "Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone."
question = "What is the aim of Transformers library?"


In [5]:
# Perform question answering
answer = qa_pipeline(question=question, context=context)
print(answer)

{'score': 0.4756828844547272, 'start': 226, 'end': 277, 'answer': 'to make cutting-edge NLP easier to use for everyone'}
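
The `start` and `end` values in the output above are character offsets into `context`, with `end` exclusive, so slicing the context with them should give back the `answer` string. The snippet below is a minimal sanity check that reuses the context and the printed result; the names `result` and `span` are only for illustration.

```python
# Context and pipeline output copied from the cells above
context = "Transformers provides thousands of pretrained models to perform tasks on texts such as classification, information extraction, question answering, summarization, translation, text generation, etc in 100+ languages. Its aim is to make cutting-edge NLP easier to use for everyone."
result = {'score': 0.4756828844547272, 'start': 226, 'end': 277,
          'answer': 'to make cutting-edge NLP easier to use for everyone'}

# Slice the context with the reported character offsets (end is exclusive)
span = context[result['start']:result['end']]
print(span)
print(span == result['answer'])
```

The same slicing convention is what the `start_position` and `end_position` labels in a fine-tuning dataset are usually expected to follow.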


### Fine-tune the model

In [1]:
%%capture
!pip install datasets
!pip install -U accelerate
!pip install -U "transformers[torch]"


In [2]:
import torch
# Only the names used below are imported; the fast tokenizer is imported in a later cell
from transformers import BertForQuestionAnswering, TrainingArguments, Trainer
from datasets import Dataset
import pandas as pd
from sklearn.model_selection import train_test_split


### What the training data looks like

In [3]:
# Sample data: each entry contains a context, a question, the answer text, and the start and end character offsets of the answer in the context
data = {
    'context': [
        "Python is a high-level, interpreted programming language known for its simplicity and readability.",
        "The Pacific Ocean is the largest ocean on Earth, covering more than 30% of the planet's surface.",
        "Machine learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience.",
        "Photosynthesis is a process used by plants, algae, and certain bacteria to convert light energy into chemical energy.",
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France, and is one of the most famous structures in the world."
    ],
    'question': [
        "What is Python known for?",
        "What is the largest ocean on Earth?",
        "What does machine learning enable systems to do?",
        "What is photosynthesis used by?",
        "Where is the Eiffel Tower located?"
    ],
    'answer': [
        "simplicity and readability",
        "The Pacific Ocean",
        "automatically learn and improve from experience",
        "plants, algae, and certain bacteria",
        "on the Champ de Mars in Paris, France"
    ],
    # Character offsets of each answer within its context (start inclusive, end exclusive)
    'start_position': [71, 0, 98, 36, 49],
    'end_position': [97, 17, 145, 71, 86]
}

df = pd.DataFrame(data)

# Split the data into training and validation sets
train_df, val_df = train_test_split(df, test_size=0.1)

# Convert the dataframes to Hugging Face's Dataset format
train_dataset = Dataset.from_pandas(train_df)
val_dataset = Dataset.from_pandas(val_df)
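
Character offsets such as `start_position` and `end_position` are easy to miscount by hand. One way to avoid that is to derive them from the answer text itself. The helper below is a small sketch under the assumption that each answer appears verbatim in its context; `find_answer_span` is a hypothetical name, not part of any library.

```python
def find_answer_span(context, answer):
    """Return (start, end) character offsets of `answer` in `context` (end exclusive)."""
    start = context.find(answer)
    if start == -1:
        # The answer must appear verbatim for extractive QA labels to make sense
        raise ValueError(f"answer not found in context: {answer!r}")
    return start, start + len(answer)

context = "The Pacific Ocean is the largest ocean on Earth, covering more than 30% of the planet's surface."
answer = "The Pacific Ocean"

start, end = find_answer_span(context, answer)
print(start, end)  # 0 17
print(context[start:end] == answer)
```

Running the helper over every row before training gives a cheap check that the labels and the answer strings agree.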


In [4]:
# Initialize the tokenizer and model
from transformers import BertTokenizerFast, BertForQuestionAnswering

tokenizer = BertTokenizerFast.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')
model = BertForQuestionAnswering.from_pretrained('bert-large-uncased-whole-word-masking-finetuned-squad')


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [5]:
# Tokenization and encoding of the data for the model
def tokenize_and_encode(examples):
    # Tokenize question/context pairs, truncating only the context and padding to the max length
    tokenized_inputs = tokenizer(
        examples['question'],
        examples['context'],
        truncation="only_second",
        max_length=512,
        padding="max_length",
        return_offsets_mapping=True
    )
    # Map the character offsets of each answer to token positions.
    # If the answer was truncated away, fall back to the [CLS] token (index 0).
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized_inputs['offset_mapping']):
        start_char = examples['start_position'][i]
        end_char = examples['end_position'][i]
        sequence_ids = tokenized_inputs.sequence_ids(i)
        # Context tokens (sequence id 1) whose character range overlaps the answer span
        span = [t for t, (s, e) in enumerate(offsets)
                if sequence_ids[t] == 1 and s < end_char and e > start_char]
        start_positions.append(span[0] if span else 0)
        end_positions.append(span[-1] if span else 0)
    tokenized_inputs['start_positions'] = start_positions
    tokenized_inputs['end_positions'] = end_positions

    return tokenized_inputs
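
The model's `start_positions` and `end_positions` labels are token indices, while the dataset stores character offsets, so the two have to be aligned through the tokenizer's offset mapping. The sketch below illustrates that alignment on a hand-written offset mapping so it runs without downloading a tokenizer. The function name `char_span_to_token_span` and the toy token layout are assumptions made for illustration only.

```python
def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    offsets: per-token (start, end) character ranges
    sequence_ids: None for special tokens, 0 for question tokens, 1 for context tokens
    Returns (0, 0), the [CLS] position, when no context token overlaps the span.
    """
    span = [i for i, (s, e) in enumerate(offsets)
            if sequence_ids[i] == 1 and s < end_char and e > start_char]
    return (span[0], span[-1]) if span else (0, 0)

# Toy layout: [CLS] where ? [SEP] paris , france [SEP]
# Question is "Where?", context is "Paris, France"
offsets = [(0, 0), (0, 5), (5, 6), (0, 0), (0, 5), (5, 6), (7, 13), (0, 0)]
sequence_ids = [None, 0, 0, None, 1, 1, 1, None]

print(char_span_to_token_span(offsets, sequence_ids, 7, 13))   # (6, 6): "france"
print(char_span_to_token_span(offsets, sequence_ids, 0, 13))   # (4, 6): "paris , france"
print(char_span_to_token_span(offsets, sequence_ids, 20, 25))  # (0, 0): span not in the tokens
```

Filtering on `sequence_ids` matters because question tokens carry their own character offsets, which start from zero again and would otherwise collide with the context offsets.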


In [6]:
# Apply the tokenization and encoding function to our datasets
train_dataset = train_dataset.map(tokenize_and_encode, batched=True, remove_columns=train_dataset.column_names)
val_dataset = val_dataset.map(tokenize_and_encode, batched=True, remove_columns=val_dataset.column_names)


In [15]:
# Define the training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    evaluation_strategy="steps",  # newer transformers releases name this argument eval_strategy
    eval_steps=50,
    save_steps=100,
    load_best_model_at_end=True,
)

# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)

# Train the model
trainer.train()

# Save the model
model.save_pretrained("./question_answering_model")
tokenizer.save_pretrained("./question_answering_model")


Step,Training Loss,Validation Loss


('./question_answering_model/tokenizer_config.json',
 './question_answering_model/special_tokens_map.json',
 './question_answering_model/vocab.txt',
 './question_answering_model/added_tokens.json',
 './question_answering_model/tokenizer.json')
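
The `Step,Training Loss,Validation Loss` table printed by the training cell is empty. A likely reason is that the run is far shorter than the configured intervals. The arithmetic below is a rough sketch, assuming a single device, no gradient accumulation, and 4 training examples (5 samples with one held out for validation).

```python
import math

# Assumed run size: 4 training examples, settings copied from TrainingArguments above
num_train_examples = 4
per_device_train_batch_size = 8
num_train_epochs = 3

steps_per_epoch = math.ceil(num_train_examples / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
print(f"total optimizer steps: {total_steps}")

# Intervals configured in the training cell
intervals = {"logging_steps": 10, "eval_steps": 50, "save_steps": 100, "warmup_steps": 500}
for name, value in intervals.items():
    reached = total_steps >= value
    print(f"{name}={value}: {'reached' if reached else 'never reached'}")
```

Under these assumptions the run finishes before the first logging or evaluation step and while the learning rate is still in warmup, so for a dataset this small the intervals would need to be much lower to see any logged losses.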

In [16]:
from transformers import BertTokenizerFast, BertForQuestionAnswering, pipeline

# Load the trained model and tokenizer
model_path = "./question_answering_model"
model = BertForQuestionAnswering.from_pretrained(model_path)
tokenizer = BertTokenizerFast.from_pretrained(model_path)

# Initialize the question-answering pipeline with the trained model
qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

# Prepare some examples for inference
examples = [
    {
        "context": "Python is an interpreted, high-level and general-purpose programming language. Python's design philosophy emphasizes code readability with its notable use of significant whitespace.",
        "question": "What is Python?"
    },
    {
        "context": "Java is a class-based, object-oriented programming language that is designed to have as few implementation dependencies as possible.",
        "question": "What is Java used for?"
    },
    {
        "context": "JavaScript, often abbreviated as JS, is a programming language that conforms to the ECMAScript specification. JavaScript is high-level, often just-in-time compiled, and multi-paradigm.",
        "question": "What does JavaScript conform to?"
    }
]

# Perform inference
for example in examples:
    result = qa_pipeline(question=example["question"], context=example["context"])
    print(f"Question: {example['question']}")
    print(f"Answer: {result['answer']}\n")


Question: What is Python?
Answer: an interpreted, high-level and general-purpose programming language

Question: What is Java used for?
Answer: to have as few implementation dependencies as possible

Question: What does JavaScript conform to?
Answer: ECMAScript specification

