
---

# Fine-Tuning DialoGPT for Custom Chatbot Development

---

### **Introduction**
This project fine-tunes **DialoGPT**, a pre-trained conversational language model, on a custom dataset using the Hugging Face `transformers` library. The result is a chatbot adapted to the conversational patterns in that dataset. The project covers data preparation, fine-tuning, and interactive chat with the fine-tuned model.

---

### **Purpose and Objectives**
The primary goals of this project are:
1. To fine-tune the pre-trained **DialoGPT** model on a custom conversational dataset.
2. To create a conversational agent capable of generating context-aware responses.
3. To implement an efficient pipeline for dataset tokenization, training, and deployment.
4. To provide a reusable framework for further customization and enhancement.

---

### **Key Components and Explanation**

#### **1. Model and Tokenizer**
- **Model Used:** `microsoft/DialoGPT-medium`, a pre-trained conversational model.
- **Tokenizer:** Converts raw text into token IDs that the model can process. DialoGPT's tokenizer defines no padding token by default, so `pad_token` is set to the end-of-sequence token (`eos_token`) to allow batches to be padded.

---

#### **2. Dataset Preparation**
- **Custom Dataset:** A plain-text file of conversational examples with one dialogue turn per line (the code loads `/kaggle/input/chat-dataset/QA.txt`).
- **Tokenization:**
  - Limits each example to a fixed length of 128 tokens (`block_size=128` for `TextDataset`, `max_length=128` for the `datasets` pipeline).
  - Truncates or pads input to ensure uniformity across batches.
- **Caching:** Writes tokenized features to a writable cache directory so they do not have to be recomputed.
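
The two preparation styles above behave differently on small files. The following is a minimal pure-Python sketch of both ideas, not the library's implementation: `chunk_into_blocks` and `pad_or_truncate` are hypothetical helper names, the integer lists stand in for real token IDs, and the block-style chunking is assumed to drop any trailing remainder.

```python
from typing import List, Tuple


def chunk_into_blocks(token_ids: List[int], block_size: int) -> List[List[int]]:
    """Split one long token stream into non-overlapping fixed-length blocks.

    Any trailing remainder shorter than block_size is dropped (an assumption
    of this sketch), so a stream with fewer than block_size tokens yields
    no examples at all.
    """
    return [
        token_ids[start:start + block_size]
        for start in range(0, len(token_ids) - block_size + 1, block_size)
    ]


def pad_or_truncate(
    token_ids: List[int], max_length: int, pad_id: int
) -> Tuple[List[int], List[int]]:
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 0 marks padding positions
    return input_ids, attention_mask


stream = list(range(10))                  # stand-in for a tokenized file
print(chunk_into_blocks(stream, 4))       # two full blocks, last two tokens dropped
print(chunk_into_blocks([1, 2, 3], 4))    # too short for even one block
print(pad_or_truncate([7, 8, 9], 5, pad_id=0))
```

Per-line padding keeps every dialogue turn as its own example, which may be the safer choice when the dataset is only a handful of lines long.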

---

#### **3. Data Collation**
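- A **Data Collator for Language Modeling** (`DataCollatorForLanguageModeling` with `mlm=False`) assembles batches for causal language modeling. It stacks the tokenized examples and derives the training labels from the input IDs, excluding padding positions from the loss. Because the examples in this project are already padded to a fixed length during tokenization, no further padding is needed at collation time.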
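
To make the collation step concrete, here is a minimal pure-Python sketch of what a causal language modeling collator does conceptually. It is an illustration under stated assumptions, not the library's code: `collate` and `build_causal_lm_labels` are hypothetical names, plain lists stand in for tensors, and `-100` is used as the label value that the loss ignores.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def build_causal_lm_labels(input_ids: List[int], pad_token_id: int) -> List[int]:
    """Labels are a copy of the inputs, with padding positions masked out."""
    return [IGNORE_INDEX if tok == pad_token_id else tok for tok in input_ids]


def collate(batch: List[List[int]], pad_token_id: int) -> Dict[str, List[List[int]]]:
    """Pad every sequence to the longest one in the batch and build labels."""
    max_len = max(len(seq) for seq in batch)
    input_ids = [seq + [pad_token_id] * (max_len - len(seq)) for seq in batch]
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in batch]
    labels = [build_causal_lm_labels(row, pad_token_id) for row in input_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


example = collate([[5, 6, 7], [8]], pad_token_id=0)
print(example["input_ids"])   # shorter sequence is right-padded with 0
print(example["labels"])      # padded positions become -100
```

In this project `pad_token` is set to `eos_token`, so under this masking rule genuine end-of-turn tokens would be excluded from the loss as well. Whether that matters depends on the dataset.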

---

#### **4. Training Configuration**
- **TrainingArguments:** Specifies the hyperparameters for fine-tuning:
  - Epochs: 3 (`num_train_epochs=3`)
  - Batch size: 4 per device (`per_device_train_batch_size=4`)
  - Learning rate: 5e-5 (`learning_rate=5e-5`)
  - Checkpointing: Saves model weights every 500 steps (`save_steps=500`), keeping at most two checkpoints (`save_total_limit=2`).
  - Logging: Reports the training loss every 100 steps (`logging_steps=100`).

- **Trainer API:** Simplifies training by handling batching, optimization, logging, and checkpointing. Evaluation is disabled in this run (`evaluation_strategy="no"`).
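
The step-based settings above only take effect if training runs long enough to reach them. The sketch below works through the arithmetic for a single device without gradient accumulation. The helper names and the dataset sizes (8 and 2000 examples) are assumptions chosen for illustration.

```python
import math
from typing import List


def total_training_steps(num_examples: int, per_device_batch_size: int, num_epochs: int) -> int:
    """Optimizer steps for one device with no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / per_device_batch_size)
    return steps_per_epoch * num_epochs


def trigger_steps(total_steps: int, every: int) -> List[int]:
    """Steps at which an 'every N steps' action (logging, saving) fires."""
    return list(range(every, total_steps + 1, every))


small = total_training_steps(num_examples=8, per_device_batch_size=4, num_epochs=3)
print(small)                        # 6 steps in total
print(trigger_steps(small, 100))    # [] -> no loss is logged during training
print(trigger_steps(small, 500))    # [] -> no intermediate checkpoint is written

large = total_training_steps(num_examples=2000, per_device_batch_size=4, num_epochs=3)
print(trigger_steps(large, 500))    # [500, 1000, 1500]
```

For very small datasets, lowering `logging_steps` and `save_steps` (for example to 1 or 2) is one way to make progress visible during the run.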

---

#### **5. Fine-Tuning Process**
- The model is fine-tuned on the tokenized dataset to adapt it to the specific conversational patterns of the input data.
- Fine-tuned weights and tokenizer configurations are saved locally for deployment.

---

#### **6. Chatbot Interaction**
- The chatbot generates responses using the fine-tuned model.
- User inputs are tokenized and passed to the model for generation.
- Responses are decoded back into human-readable text, skipping special tokens.
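
A decoder-only model's generated sequence typically contains the prompt tokens followed by the continuation, so the reply has to be separated from the prompt before decoding. The following is a small sketch of that step on plain lists of token IDs. `extract_response_ids` is a hypothetical helper, and the IDs other than the end-of-sequence ID are made up.

```python
from typing import List

EOS_TOKEN_ID = 50256  # end-of-sequence ID in the GPT-2 vocabulary used by DialoGPT


def extract_response_ids(
    generated_ids: List[int], prompt_length: int, eos_token_id: int = EOS_TOKEN_ID
) -> List[int]:
    """Drop the prompt and cut the continuation at the first EOS token."""
    new_tokens = generated_ids[prompt_length:]
    if eos_token_id in new_tokens:
        new_tokens = new_tokens[:new_tokens.index(eos_token_id)]
    return new_tokens


prompt_ids = [10, 11, EOS_TOKEN_ID]                        # user turn ending with EOS
generated = prompt_ids + [20, 21, EOS_TOKEN_ID, EOS_TOKEN_ID]
print(extract_response_ids(generated, len(prompt_ids)))    # [20, 21]
```

Only the returned IDs would then be passed to the tokenizer's `decode` method, so the user's own message is not repeated in the bot's reply.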

---

### **Implementation**

#### **Steps**
1. **Model and Tokenizer Loading**
 

In [6]:
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments, DataCollatorForLanguageModeling
from datasets import load_dataset
from transformers.data.datasets.language_modeling import TextDataset

In [7]:
# Step 1: Load Pre-trained Model and Tokenizer
model_name = "microsoft/DialoGPT-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Ensure the tokenizer has a pad_token
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

2. **Dataset Preparation**

In [11]:
# Step 2: Dataset Preparation
def prepare_dataset(file_path, tokenizer, block_size=128, cache_dir="/kaggle/working/cache"):
    """Loads and tokenizes the dataset."""
    # Ensure the cache directory is writable
    os.makedirs(cache_dir, exist_ok=True)
    dataset = TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=block_size,
        overwrite_cache=True,  # Rebuild the cache instead of reusing an existing one
        cache_dir=cache_dir,   # Use writable cache directory
    )
    return dataset
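
# Path to the raw text file; tokenized features are cached in a writable directory
dataset_path = "/kaggle/input/chat-dataset/QA.txt"
# Note: train_dataset is built here, but the Trainer below uses tokenized_dataset["train"]
train_dataset = prepare_dataset(dataset_path, tokenizer, cache_dir="/kaggle/working/cache")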

# Load the dataset using the datasets library
dataset = load_dataset("text", data_files={"train": dataset_path})

# Tokenize the dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=["text"])




In [12]:
# Step 3: Data Collator for Padding
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

3. **Fine-Tuning**

In [13]:
# Step 4: Fine-tuning Setup
training_args = TrainingArguments(
    output_dir="./chatbot_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=500,
    save_total_limit=2,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="no",
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=data_collator,
)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [14]:
# Step 5: Fine-tune the Model
trainer.train()



TrainOutput(global_step=6, training_loss=5.529878616333008, metrics={'train_runtime': 88.7893, 'train_samples_per_second': 0.27, 'train_steps_per_second': 0.068, 'total_flos': 5572204167168.0, 'train_loss': 5.529878616333008, 'epoch': 3.0})
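
One way to read the `training_loss` reported above is as a perplexity, the exponential of the mean per-token cross-entropy. The snippet below is a minimal sketch of that conversion. `perplexity_from_loss` is a hypothetical helper, and the result is only an approximate training perplexity because the reported loss is averaged over training steps rather than measured on held-out data.

```python
import math


def perplexity_from_loss(mean_cross_entropy: float) -> float:
    """Perplexity is exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(mean_cross_entropy)


training_loss = 5.529878616333008  # value from the TrainOutput above
print(f"Approximate training perplexity: {perplexity_from_loss(training_loss):.1f}")
```

A loss of about 5.5 corresponds to a perplexity of roughly 250, which suggests that six optimizer steps are too few for the model to fit the dataset closely.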

In [15]:
# Save the fine-tuned model
model.save_pretrained("./chatbot_model")
tokenizer.save_pretrained("./chatbot_model")


('./chatbot_model/tokenizer_config.json',
 './chatbot_model/special_tokens_map.json',
 './chatbot_model/vocab.json',
 './chatbot_model/merges.txt',
 './chatbot_model/added_tokens.json',
 './chatbot_model/tokenizer.json')

4. **Chatbot Interaction**

In [16]:
# Step 6: Interact with the Chatbot
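def chat_with_bot(prompt, model, tokenizer):
    """Generates a response for the given prompt."""
    # DialoGPT expects each dialogue turn to end with the EOS token
    inputs = tokenizer(prompt + tokenizer.eos_token, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=100, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens so the prompt is not echoed back
    response_ids = outputs[0][inputs["input_ids"].shape[-1]:]
    return tokenizer.decode(response_ids, skip_special_tokens=True)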

In [None]:
# Load the fine-tuned model for chatting
fine_tuned_model = AutoModelForCausalLM.from_pretrained("./chatbot_model")
fine_tuned_tokenizer = AutoTokenizer.from_pretrained("./chatbot_model")

print("Chatbot ready! Type 'exit' to quit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == "exit":
        print("Goodbye!")
        break
    response = chat_with_bot(user_input, fine_tuned_model, fine_tuned_tokenizer)
    print(f"Bot: {response}")
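
The loop above treats every input as an independent prompt, so the bot has no memory of earlier turns. One common pattern for DialoGPT-style models is to concatenate recent turns, each terminated by the end-of-sequence token, and pass the combined string as the prompt. The following is a sketch under that assumption. `build_prompt` and `max_turns` are hypothetical names, not part of the project or the library.

```python
from typing import List

EOS_TOKEN = "<|endoftext|>"  # end-of-sequence token string used by DialoGPT's tokenizer


def build_prompt(history: List[str], user_message: str, max_turns: int = 6) -> str:
    """Join the most recent turns, ending each one with the EOS token.

    Older turns are dropped so the prompt stays within the model's context
    window; max_turns is an arbitrary cap chosen for this sketch.
    """
    turns = (history + [user_message])[-max_turns:]
    return "".join(turn + EOS_TOKEN for turn in turns)


history = ["Hello!", "Hi, how can I help?"]
print(build_prompt(history, "Tell me a joke."))
```

After each exchange, both the user message and the bot's reply would be appended to `history`, so the next prompt carries the conversation so far. Since `build_prompt` already ends every turn with the EOS token, the generation step should not append another one.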

---

### **Sample Output**

#### **Chat Example**
```
You: Hello! How are you?
Bot: I'm just a chatbot, but I'm doing great! How about you?

You: Tell me a joke.
Bot: Why don't scientists trust atoms? Because they make up everything!

You: exit
Goodbye!
```

---

### **Applications**
1. **Customer Support:** Train on customer interaction logs to automate FAQs.
2. **E-Learning:** Create conversational assistants for tutoring.
3. **Healthcare:** Develop chatbots for symptom checks and guidance.
4. **Entertainment:** Build conversational companions with specific personalities or themes.

---

### **Future Enhancements**
1. **Evaluation:** Add evaluation metrics (e.g., perplexity, BLEU) to assess model quality.
2. **Dataset Augmentation:** Incorporate diverse conversational datasets for improved generalization.
3. **Fine-Tuning Optimization:** Experiment with advanced techniques like LoRA or adapters for efficient fine-tuning.
4. **Deployment:** Integrate the chatbot with a web or mobile interface for real-world use.
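
To give a sense of why low-rank methods such as LoRA are considered efficient, the sketch below compares the number of trainable parameters in a full weight matrix with those in a rank-`r` update made of two smaller matrices. The function names are hypothetical, and the layer shape (1024 by 3072) and rank (8) are assumptions chosen for illustration, not measurements of this project.

```python
def full_param_count(d_in: int, d_out: int) -> int:
    """Trainable weights when the whole d_in x d_out matrix is updated."""
    return d_in * d_out


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable weights in a low-rank update: (d_in x rank) plus (rank x d_out)."""
    return rank * (d_in + d_out)


d_in, d_out, rank = 1024, 3072, 8  # assumed layer shape and rank
full = full_param_count(d_in, d_out)
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full:,}  low-rank: {lora:,}  ratio: {lora / full:.2%}")
```

Under these assumptions the low-rank update trains about 1% of the parameters of the full matrix, which is what makes such methods attractive on limited hardware.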

---

### **Conclusion**
This project fine-tunes DialoGPT into a custom chatbot and walks through each stage of the pipeline: data preparation, training, saving, and interactive inference. Fine-tuning a pre-trained model such as DialoGPT is an effective way to adapt general conversational ability to a specialized application, with response quality depending heavily on the size and quality of the training dataset.

---