1. Fine-Tuning BERT or DistilBERT for Prompt Classification

To use BERT or DistilBERT for prompt parsing, fine-tune the model on a small dataset of example prompts, each labeled with the action it requests (here: summarize_single_doc, summarize_multiple_doc, list_all_doc). The fine-tuned model then maps a new user prompt to one of these actions.

Steps to Fine-Tune BERT or DistilBERT


In [17]:
from transformers import DistilBertForSequenceClassification, Trainer, TrainingArguments, DistilBertTokenizerFast
from datasets import Dataset
import pandas as pd

# Build a small labeled dataset in memory.
# Labels: 0 = summarize_single_doc, 1 = summarize_multiple_doc, 2 = list_all_doc
data = {
    "prompt": [
        'Give me the summarize of the document entitled "LIFT: Cloud-based Management System for Research, Production, and Extension Services."',
        'Give me the summarize of a document entitled "LIFT: Cloud-based Management System for Research, Production, and Extension Services."',
        'Summarize the document entitled "LIFT: Cloud-based Management System for Research, Production, and Extension Services."',
        'Summarize a document entitled "LIFT: Cloud-based Management System for Research, Production, and Extension Services."',
        
        'Give me the summarize of all documents about "NLP, Machine Learning, IoT, Web-based".',
        'Give me the summarize of all documents with the following keywords "NLP, Machine Learning, IoT, Web-based".',
        'Give me the summarize of all documents using "NLP, Machine Learning, IoT, Web-based".',
        'Summarize all documents about "NLP, Machine Learning, IoT, Web-based".',
        'Summarize all documents with the following keywords "NLP, Machine Learning, IoT, Web-based".',
        'Summarize all documents using "NLP, Machine Learning, IoT, Web-based".',
        'Give me the summarize of all documents at year 2024',
        'Summarize all documents at year 2024',
        
        'List all the documents about "NLP, Machine Learning, IoT, Web-based".',
        'List all the documents with the following keywords "NLP, Machine Learning, IoT, Web-based".',
        'List all the documents using "NLP, Machine Learning, IoT, Web-based".',
        'List all the documents at year 2024',
        'List all documents about "NLP, Machine Learning, IoT, Web-based".',
        'List all documents with the following keywords "NLP, Machine Learning, IoT, Web-based".',
        'List all documents using "NLP, Machine Learning, IoT, Web-based".',
        'List all documents at year 2024'
    ],
    "label": [0] * 4 + [1] * 8 + [2] * 8  # 4 single-doc, 8 multi-doc, 8 list prompts
}
df = pd.DataFrame(data)
dataset = Dataset.from_pandas(df)
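
With only 20 prompts and no evaluation set, there is no way to tell whether the classifier generalizes beyond the phrasings it was trained on. The following is a minimal sketch of a stratified hold-out split using only the standard library. The function name `stratified_split`, the 25% hold-out fraction, and the seed are illustrative assumptions, not part of the original notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, holdout_fraction=0.25, seed=0):
    """Split example indices into train/holdout lists, holding out at least one example per class."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, holdout_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label][:]
        rng.shuffle(indices)
        # Hold out a fixed share of each class, but never fewer than one example.
        n_holdout = max(1, round(len(indices) * holdout_fraction))
        holdout_idx.extend(indices[:n_holdout])
        train_idx.extend(indices[n_holdout:])
    return sorted(train_idx), sorted(holdout_idx)

# Same label layout as the notebook: 4 single-doc, 8 multi-doc, 8 list prompts.
labels_column = [0] * 4 + [1] * 8 + [2] * 8
train_idx, holdout_idx = stratified_split(labels_column)
print(len(train_idx), len(holdout_idx))  # 15 5
```

The two index lists could then be passed to `dataset.select(...)` to obtain separate training and hold-out datasets.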


In [18]:

# Define labels and tokenize prompts
labels = ["summarize_single_doc", "summarize_multiple_doc", "list_all_doc"]
tokenizer = DistilBertTokenizerFast.from_pretrained("distilbert-base-uncased")

def preprocess_data(examples):
    return tokenizer(examples["prompt"], truncation=True, padding="max_length", max_length=50)

encoded_dataset = dataset.map(preprocess_data, batched=True)


Map: 100%|██████████| 20/20 [00:00<00:00, 19043.38 examples/s]
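
The label names are defined here as a list and later rewritten by hand as a dictionary for inference. A small sketch of deriving both mappings from the single list, so the two cannot drift apart; the variable names `id2label` and `label2id` are chosen to match the configuration fields that transformers models use.

```python
labels = ["summarize_single_doc", "summarize_multiple_doc", "list_all_doc"]

# Build both directions from the single list so they stay consistent.
id2label = dict(enumerate(labels))
label2id = {name: index for index, name in id2label.items()}

print(id2label[2])                        # list_all_doc
print(label2id["summarize_single_doc"])   # 0
```

One option is to pass `id2label=id2label, label2id=label2id` to `from_pretrained` alongside `num_labels`; the mappings are then stored in the saved configuration and can be read back from `model.config.id2label` after reloading.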


In [3]:

# Set up the model for classification
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=len(labels))



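DistilBERT is a lighter version of BERT and is a reasonable default for this task. If memory is still a constraint, consider a smaller checkpoint such as albert-base-v2 or bert-tiny (prajjwal1/bert-tiny).

Replace distilbert-base-uncased with a smaller model. Note that prajjwal1/bert-tiny is a BERT checkpoint, not a DistilBERT one: loaded through DistilBertForSequenceClassification, its parameter names do not match, and the warning below lists the embedding and transformer weights as newly initialized, so fine-tuning starts from random weights. AutoModelForSequenceClassification selects the architecture that matches the checkpoint.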

In [19]:
model = DistilBertForSequenceClassification.from_pretrained("prajjwal1/bert-tiny", num_labels=len(labels))

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight', 'embeddings.LayerNorm.bias', 'embeddings.LayerNorm.weight', 'embeddings.position_embeddings.weight', 'embeddings.word_embeddings.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'transformer.layer.0.attention.k_lin.bias', 'transformer.layer.0.attention.k_lin.weight', 'transformer.layer.0.attention.out_lin.bias', 'transformer.layer.0.attention.out_lin.weight', 'transformer.layer.0.attention.q_lin.bias', 'transformer.layer.0.attention.q_lin.weight', 'transformer.layer.0.attention.v_lin.bias', 'transformer.layer.0.attention.v_lin.weight', 'transformer.layer.0.ffn.lin1.bias', 'transformer.layer.0.ffn.lin1.weight', 'transformer.layer.0.ffn.lin2.bias', 'transformer.layer.0.ffn.lin2.weight', 'transformer.layer.0.output_layer_norm.bias', 'transformer.layer.0.output_layer_norm.weight', 'transformer.l

In [None]:

# Training arguments
training_args = TrainingArguments(
    output_dir="./BERT/results",
    evaluation_strategy="no",  # no evaluation during training (the default); named eval_strategy in recent transformers releases
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset,
)




In [21]:

# Fine-tune the model
trainer.train()

100%|██████████| 9/9 [00:01<00:00,  6.56it/s]

{'train_runtime': 1.3834, 'train_samples_per_second': 43.37, 'train_steps_per_second': 6.506, 'train_loss': 1.0938706927829318, 'epoch': 3.0}





TrainOutput(global_step=9, training_loss=1.0938706927829318, metrics={'train_runtime': 1.3834, 'train_samples_per_second': 43.37, 'train_steps_per_second': 6.506, 'total_flos': 31131702000.0, 'train_loss': 1.0938706927829318, 'epoch': 3.0})
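
The numbers in this output can be checked by hand. The sketch below assumes the Trainer ran on a single device with no gradient accumulation (the defaults). Twenty examples in batches of 8 give 3 steps per epoch, hence 9 steps over 3 epochs. A 3-class classifier that guesses uniformly has a cross-entropy loss of ln 3, about 1.0986, and the reported training loss of about 1.0939 sits almost exactly at that value, which suggests the model has barely moved from chance in this run.

```python
import math

num_examples = 20   # prompts in the dataset
batch_size = 8      # per_device_train_batch_size
num_epochs = 3      # num_train_epochs
num_classes = 3     # len(labels)

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs

# Cross-entropy of a uniform guess over num_classes classes.
chance_loss = math.log(num_classes)
reported_loss = 1.0938706927829318  # train_loss from the output above

print(total_steps)                                # 9
print(round(chance_loss, 4))                      # 1.0986
print(round(chance_loss - reported_loss, 4))      # 0.0047
```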

Saving the Fine-Tuned Model and Tokenizer

In [None]:
model.save_pretrained("./BERT/fine_tuned_distilbert")
tokenizer.save_pretrained("./BERT/fine_tuned_distilbert")

('./fine_tuned_distilbert\\tokenizer_config.json',
 './fine_tuned_distilbert\\special_tokens_map.json',
 './fine_tuned_distilbert\\vocab.txt',
 './fine_tuned_distilbert\\added_tokens.json',
 './fine_tuned_distilbert\\tokenizer.json')

Using the Fine-Tuned Model for Prompt Classification

In [None]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizerFast
import torch

# Load the fine-tuned model and tokenizer
model = DistilBertForSequenceClassification.from_pretrained("./BERT/fine_tuned_distilbert")
tokenizer = DistilBertTokenizerFast.from_pretrained("./BERT/fine_tuned_distilbert")


In [24]:

# Define label mapping
label_map = {0: "summarize_single_doc", 1: "summarize_multiple_doc", 2: "list_all_doc"}

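def classify_prompt(prompt):
    """Return the predicted action label for a single prompt string."""
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, padding=True, max_length=50)
    with torch.no_grad():  # inference only, so skip gradient tracking
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=1).item()
    return label_map[prediction]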
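
Because argmax always returns some label, a prompt that fits none of the three actions is still assigned one. The following is a minimal sketch of turning raw logits into a confidence score and rejecting low-confidence predictions. The helper names, the 0.6 threshold, and the example logits are illustrative assumptions, and the code is written in plain Python so that it runs without the model.

```python
import math

LABEL_MAP = {0: "summarize_single_doc", 1: "summarize_multiple_doc", 2: "list_all_doc"}

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

def label_with_confidence(logits, threshold=0.6):
    """Return (label, probability); label is None when confidence is below threshold."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    confidence = probabilities[best]
    label = LABEL_MAP[best] if confidence >= threshold else None
    return label, confidence

# Hypothetical logits: one confident prediction, one near-uniform.
confident = label_with_confidence([0.1, 0.2, 3.0])
uncertain = label_with_confidence([0.05, 0.00, 0.02])
print(confident)
print(uncertain)
```

With the notebook's model, `torch.softmax(outputs.logits, dim=1)` would produce the same probabilities directly from the logits tensor.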


In [31]:

# Example usage
prompt = "Give me the list of all documents."
classification = classify_prompt(prompt)
print(f"Classification: {classification}")

Classification: list_all_doc
