# Airline Chatbot Project
**Name**: Richa Patel  
**Course**: Final NLP Project  
**Date**: May 6, 2025  

Interactive airline chatbot using FLAN-T5, fine-tuned on `chatbot_data.tsv` (1,576 pairs after filtering).

**Data was collected** from Kaggle datasets (airline booking, holiday destination, and travel airline datasets) and from the FAQ pages of multiple airlines (e.g., booking, baggage, pet policies), then converted to `.txt` and `.tsv` formats for training.

Features:
- Fine-tuned FLAN-T5-base with fuzzy matching (`fuzz.token_sort_ratio`) for context-aware responses.
- BLEU score evaluation (0.362) to assess performance.
- Interactive loop for real-time user interaction.
- Limitation: Filtering reduced dataset to 1,576 pairs, causing generic responses for some queries (e.g., pet, check-in, baggage limit).

Runs in Google Colab's T4 GPU environment. Training visualized at https://wandb.ai/richricha4939-univerai/huggingface/runs/kj0i5oab (validation loss 0.0015).

## Environment Setup

In [None]:
# Install required libraries for FLAN-T5, dataset handling, fuzzy matching, and evaluation
!pip install transformers datasets evaluate fuzzywuzzy python-Levenshtein torch -q

## Step 1: Load and Preprocess Dataset
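Loads `chatbot_data.tsv` (originally ~11,576 pairs), built from Kaggle datasets and airline FAQ pages and converted from `.txt` to `.tsv`. The cell drops NaNs and duplicates, keeps only pairs whose question matches the case-insensitive regex `flight|airline|baggage|ticket|check-in|pet|seat`, and tokenizes the result with `T5Tokenizer` for fine-tuning. The write-up reports 1,576 pairs after filtering, whereas the run logged below shows 4,416 training and 1,105 validation samples (5,521 pairs in total).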
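As a quick illustration of what the keyword filter keeps, the sketch below applies the same pattern with Python's standard `re` module instead of pandas. The sample questions are made up for illustration and are not taken from `chatbot_data.tsv`. Because the pattern has no word boundaries, it also matches substrings such as `pet` in "carpet" or `seat` in "Seattle", which may let some off-topic pairs through.

```python
import re

# Same alternation as the notebook's filter, compiled case-insensitively
AIRLINE_PATTERN = re.compile(
    r"flight|airline|baggage|ticket|check-in|pet|seat", re.IGNORECASE
)

def is_airline_question(text: str) -> bool:
    """Return True if any keyword occurs anywhere in the text."""
    return AIRLINE_PATTERN.search(text) is not None

# Hypothetical sample questions (not from the dataset)
samples = [
    "How do I change my FLIGHT?",       # kept: keyword match, case-insensitive
    "What is the weather like today?",  # dropped: no keyword
    "How do I clean a carpet?",         # kept by accident: "pet" inside "carpet"
    "Best restaurants in Seattle?",     # kept by accident: "seat" inside "Seattle"
]
for question in samples:
    print(is_airline_question(question), "-", question)
```

Wrapping the keywords in `\b...\b` would be one way to avoid such partial matches, at the cost of missing plurals like "pets" unless they are listed explicitly.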

In [None]:
# Import libraries
import pandas as pd
from google.colab import files
from datasets import Dataset

# Upload dataset
print("Please upload chatbot_data.tsv")
uploaded = files.upload()

# Load and clean dataset
df = pd.read_csv("chatbot_data.tsv", sep="\t", names=["input", "response"])
df.dropna(inplace=True)
df.drop_duplicates(inplace=True)

# Filter for airline-related queries (optional, to improve quality)
df = df[df["input"].str.contains("flight|airline|baggage|ticket|check-in|pet|seat", case=False, na=False)]

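# Convert to Hugging Face Dataset and split into train (80%) and validation (20%)
# A fixed seed keeps the split reproducible across runs (the value 42 is arbitrary)
dataset = Dataset.from_pandas(df, preserve_index=False)
dataset = dataset.train_test_split(test_size=0.2, seed=42)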

# Preprocess for FLAN-T5
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")

def preprocess_function(examples):
    inputs = ["Answer this airline question: " + q for q in examples["input"]]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(examples["response"], max_length=128, truncation=True, padding="max_length")
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_dataset = dataset.map(preprocess_function, batched=True)

# Print dataset info
print("Dataset loaded. Training samples:", len(dataset["train"]), "Validation samples:", len(dataset["test"]))

Please upload chatbot_data.tsv


Saving chatbot_data.tsv to chatbot_data.tsv


Dataset loaded. Training samples: 4416 Validation samples: 1105
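
`preprocess_function` pads the labels to 128 tokens, and those padding positions count toward the loss unless they are masked. Hugging Face seq2seq models ignore label positions set to `-100`, so a common practice is to replace the pad id with `-100` before training. The sketch below shows the idea on plain Python lists; `mask_pad_labels` is a hypothetical helper, not part of the notebook, and the token ids are made up. If padding dominates the labels, this may partly explain why the reported loss values are so small.

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss in PyTorch

def mask_pad_labels(label_ids, pad_token_id):
    """Replace padding ids with IGNORE_INDEX in a batch of label sequences."""
    return [
        [IGNORE_INDEX if tok == pad_token_id else tok for tok in seq]
        for seq in label_ids
    ]

# Made-up label ids, padded with 0 (the pad id used by T5 tokenizers)
batch = [[71, 1782, 1, 0, 0], [27, 1, 0, 0, 0]]
print(mask_pad_labels(batch, pad_token_id=0))
```

In the notebook, this would be applied to `labels["input_ids"]` with `tokenizer.pad_token_id` before assigning `model_inputs["labels"]`.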


## Step 2: Fine-Tune FLAN-T5 Model
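Fine-tunes FLAN-T5-base on the filtered dataset for 3 epochs with the Hugging Face `Trainer` (learning rate 2e-5, batch size 4, weight decay 0.01), logging to W&B. Validation loss fell from 0.0023 after epoch 1 to 0.0015 after epoch 3.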

In [None]:
# Install hf_xet to suppress Xet Storage warning (optional)
!pip install hf_xet -q

# Check library versions for debugging
import transformers
import huggingface_hub
print("Transformers version:", transformers.__version__)
print("Huggingface_hub version:", huggingface_hub.__version__)

# Fine-tune FLAN-T5
from transformers import T5ForConditionalGeneration, Trainer, TrainingArguments

# Load model (flan-t5-base is public, so no authentication token is needed;
# the deprecated use_auth_token argument is dropped)
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Set training parameters
training_args = TrainingArguments(
    output_dir="./flan-t5-finetuned",
    eval_strategy="epoch",  # Fixed from evaluation_strategy
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    num_train_epochs=3,
    weight_decay=0.01,
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=False,  # Explicitly disable Hub push to avoid authentication
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["test"],
)

# Start training
trainer.train()

# Save fine-tuned model locally
model.save_pretrained("./flan-t5-airline")
tokenizer.save_pretrained("./flan-t5-airline")
print("Model fine-tuned and saved.")

Transformers version: 4.51.3
Huggingface_hub version: 0.30.2





| Epoch | Training Loss | Validation Loss |
|-------|---------------|-----------------|
| 1 | 0.0103 | 0.002298 |
| 2 | 0.004 | 0.001789 |
| 3 | 0.0036 | 0.001513 |


There were missing keys in the checkpoint model loaded: ['encoder.embed_tokens.weight', 'decoder.embed_tokens.weight'].


Model fine-tuned and saved.
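
For a sense of the training budget, the number of optimizer steps follows from the sample count and batch size. The sketch assumes the 4,416 training samples logged in Step 1, a single GPU, and no gradient accumulation (the `TrainingArguments` default); `steps_per_epoch` is an illustrative helper, not a `transformers` function.

```python
import math

def steps_per_epoch(num_samples: int, batch_size: int) -> int:
    """Optimizer steps per epoch; a final partial batch still counts as a step."""
    return math.ceil(num_samples / batch_size)

train_samples = 4416  # from the Step 1 output
batch_size = 4        # per_device_train_batch_size
epochs = 3            # num_train_epochs

per_epoch = steps_per_epoch(train_samples, batch_size)
print("Steps per epoch:", per_epoch)
print("Total steps:", per_epoch * epochs)
```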


## Step 3: Chatbot Function with Fuzzy Matching
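Implements `flan_chatbot_reply`, which scores the user's question against every stored question with `fuzz.token_sort_ratio`. A top match scoring above 85 is returned directly; otherwise up to seven matches scoring above 40 are added to the prompt as context for the fine-tuned FLAN-T5 model.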
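
The routing logic can be sketched without the model. `fuzz.token_sort_ratio` lowercases and sorts the tokens of both strings before scoring them, so word order does not affect the result. The stand-in below uses `difflib.SequenceMatcher` from the standard library, so its scores only approximate those of fuzzywuzzy; `approx_token_sort_ratio` and `route` are hypothetical names, and `route` simplifies the notebook's logic to a per-score decision.

```python
from difflib import SequenceMatcher

def approx_token_sort_ratio(a: str, b: str) -> int:
    """Rough stand-in for fuzz.token_sort_ratio: sort tokens, then compare."""
    sorted_a = " ".join(sorted(a.lower().split()))
    sorted_b = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, sorted_a, sorted_b).ratio())

def route(score: int) -> str:
    """Map a similarity score to the notebook's three cases."""
    if score > 85:
        return "direct"   # return the stored answer as-is
    if score > 40:
        return "context"  # add the pair to the prompt as context
    return "none"         # ignore the pair

# The two strings contain the same words in a different order
score = approx_token_sort_ratio("change my flight date", "my flight date change")
print(score, route(score))
```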

In [None]:
# Chatbot function with fuzzy matching
from fuzzywuzzy import fuzz
import torch
import pandas as pd
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Reload fine-tuned model
model = T5ForConditionalGeneration.from_pretrained("./flan-t5-airline", token=None)
tokenizer = T5Tokenizer.from_pretrained("./flan-t5-airline")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

def flan_chatbot_reply(user_input):
    # Score every stored question against the user input once
    query = user_input.lower()
    scores = df["input"].apply(lambda x: fuzz.token_sort_ratio(x.lower(), query))
    # Keep questions scoring above 40, sorted by similarity (highest first)
    similar_questions = df[scores > 40].assign(similarity=scores)
    similar_questions = similar_questions.sort_values(by="similarity", ascending=False)
    if not similar_questions.empty:
        # Return top match directly if very similar
        top_match = similar_questions.iloc[0]
        if top_match["similarity"] > 85:
            return top_match["response"]
        # Use top 7 matches for context
        context = "\n".join([f"User: {q}\nBot: {a}" for q, a in similar_questions[["input", "response"]].values[:7]])
    else:
        context = ""

    # Enhanced prompt
    prompt = (
        "You are an expert airline assistant. Answer questions accurately and concisely using your fine-tuned airline dataset. Avoid vague or generic responses; use dataset knowledge or the examples below. For pet-related questions, include carrier and policy details.\n\n"
        f"{context}\n\n"
        "Examples:\n"
        "User: How can I modify my flight?\nBot: You can modify your flight by logging into your account on our website and selecting 'Manage Booking'.\n"
        "User: What is the baggage allowance for international travel?\nBot: Most international flights allow 2 checked bags, each up to 23 kg, and 1 carry-on.\n"
        "User: Can I travel with a pet?\nBot: Small pets are allowed in the cabin in an airline-approved carrier, subject to airline policies and fees.\n"
        "User: How do I check in for my flight?\nBot: You can check in online via our website or app 24 hours before your flight.\n\n"
        f"User: {user_input}\nBot:"
    )

    # Tokenize and move inputs to GPU
    inputs = tokenizer(prompt, return_tensors="pt", max_length=512, truncation=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Generate response
    outputs = model.generate(
        inputs["input_ids"],
        max_length=150,  # Increased for detail
        num_beams=5,
        no_repeat_ngram_size=2,
        early_stopping=True
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response.strip()

# Test the function
test_questions = [
    "How do I change my flight date?",
    "What is the baggage allowance for international flights?",
    "Can I bring my pet on board?"
]
for test_input in test_questions:
    print("Test Question:", test_input)
    print("Response:", flan_chatbot_reply(test_input))
    print()


Test Question: How do I change my flight date?
Response: you can change your flight date by logging into your account on our website or contacting customer support.

Test Question: What is the baggage allowance for international flights?
Response: the baggage allowance for international flights is 2 checked bags, each up to 23 kg.

Test Question: Can I bring my pet on board?
Response: [generic response]
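
One thing to watch in `flan_chatbot_reply`: the prompt is truncated to 512 tokens, and tokenizers truncate from the right by default, so a long context can push the final `User: ...` line out of the input. A possible safeguard is to add context pairs only while they fit a budget. The sketch counts whitespace-separated words as a crude stand-in for tokenizer tokens; `fit_context`, `count_tokens`, the sample pairs, and the budget value are assumptions for illustration.

```python
def count_tokens(text: str) -> int:
    """Crude token count: whitespace-separated words (not the T5 tokenizer)."""
    return len(text.split())

def fit_context(pairs, question, budget):
    """Keep the best-ranked (question, answer) pairs that fit next to the question."""
    tail = f"User: {question}\nBot:"
    remaining = budget - count_tokens(tail)
    kept = []
    for q, a in pairs:  # pairs are assumed sorted by similarity, best first
        block = f"User: {q}\nBot: {a}"
        cost = count_tokens(block)
        if cost > remaining:
            break
        kept.append(block)
        remaining -= cost
    return "\n".join(kept + [tail])

# Made-up pairs for illustration
pairs = [
    ("Can I change my seat?", "Yes, under Manage Booking."),
    ("Can I bring a pet?", "Small pets may travel in the cabin."),
]
print(fit_context(pairs, "Can I upgrade my seat?", budget=20))
```

With a real tokenizer, `count_tokens` would instead return the length of the tokenized ids, and the budget would also need to reserve room for the fixed instructions and examples.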



## Step 4: Evaluation
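Computes a BLEU score of 0.362 over three test questions by comparing the chatbot's replies with hand-written reference answers. With only three samples, the score is a rough indicator rather than a reliable benchmark; the brevity penalty of 0.76 shows that the replies are shorter than the references (40 vs. 51 tokens).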

In [None]:
!pip install evaluate



In [None]:
# Evaluate chatbot
import evaluate

bleu = evaluate.load("bleu")
test_questions = [
    "How do I change my flight date?",
    "What is the baggage allowance for international flights?",
    "Can I bring my pet on board?"
]
true_responses = [
    "You can change your flight date by logging into your account and selecting 'Manage Booking'.",
    "Most international flights allow 2 checked bags, each up to 23 kg, and 1 carry-on.",
    "Small pets are allowed in the cabin in an airline-approved carrier, subject to airline policies."
]
predicted_responses = [flan_chatbot_reply(q) for q in test_questions]

# Compute BLEU score
bleu_score = bleu.compute(predictions=predicted_responses, references=true_responses)
print("BLEU Score:", bleu_score)

# Print sample responses
for q, p in zip(test_questions, predicted_responses):
    print(f"Question: {q}\nResponse: {p}\n")

BLEU Score: {'bleu': 0.36228651914516385, 'precisions': [0.575, 0.4864864864864865, 0.4411764705882353, 0.41935483870967744], 'brevity_penalty': 0.7595721232249686, 'length_ratio': 0.7843137254901961, 'translation_length': 40, 'reference_length': 51}
Question: How do I change my flight date?
Response: you can change your flight date by logging into your account on our website or contacting customer support.

Question: What is the baggage allowance for international flights?
Response: the baggage allowance for international flights is 2 checked bags, each up to 23 kg.

Question: Can I bring my pet on board?
Response: [generic response]
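
The reported score can be reproduced from the components in the output above. BLEU is the brevity penalty multiplied by the geometric mean of the 1- to 4-gram precisions, and the brevity penalty is `exp(1 - r/c)` when the candidate length `c` is below the reference length `r`. The sketch plugs in the logged values; `bleu_from_parts` is an illustrative helper, not part of the `evaluate` API.

```python
import math

def bleu_from_parts(precisions, translation_length, reference_length):
    """Combine n-gram precisions and lengths into a corpus BLEU score."""
    if translation_length >= reference_length:
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - reference_length / translation_length)
    # Geometric mean of the precisions, computed in log space
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return brevity_penalty * math.exp(log_mean)

# Values copied from the BLEU output above
precisions = [0.575, 0.4864864864864865, 0.4411764705882353, 0.41935483870967744]
score = bleu_from_parts(precisions, translation_length=40, reference_length=51)
print(round(score, 3))
```

The brevity penalty accounts for much of the gap to a higher score: the geometric mean of the precisions alone is about 0.477.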





## Step 5: Interactive Chatbot
Provides an interactive loop for real-time user questions, with an empty-input check and exit commands (`exit`, `quit`, `bye`).

In [None]:
# Interactive chatbot loop
print("✈️ Airline Chatbot is LIVE! Type 'exit' to stop.\n")
while True:
    user_input = input("You: ")
    # Strip surrounding whitespace so inputs like "exit " still end the session
    if user_input.strip().lower() in ["exit", "quit", "bye"]:
        print("Chatbot: Have a safe flight! 👋")
        break
    if not user_input.strip():
        print("Chatbot: Please enter a valid question.")
        continue
    response = flan_chatbot_reply(user_input)
    print("Chatbot:", response)

✈️ Airline Chatbot is LIVE! Type 'exit' to stop.

You: Can I upgrade my seat?
Chatbot: yes, you can upgrade your seat by visiting our website or speaking with a representative at the airport.
You: What’s the baggage limit?
Chatbot: [generic response]
You: exit
Chatbot: Have a safe flight! 👋
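
The loop can be made easier to test, and given explicit error handling, by passing the input and output functions in as parameters. This is a sketch under assumptions: `run_chat`, `echo_reply`, and the fallback message are hypothetical, and in the notebook `reply_fn` would be `flan_chatbot_reply`.

```python
EXIT_COMMANDS = {"exit", "quit", "bye"}

def run_chat(reply_fn, read=input, write=print):
    """Run the chat loop until the user enters an exit command."""
    while True:
        text = read("You: ").strip()
        if text.lower() in EXIT_COMMANDS:
            write("Chatbot: Have a safe flight!")
            break
        if not text:
            write("Chatbot: Please enter a valid question.")
            continue
        try:
            write("Chatbot: " + reply_fn(text))
        except Exception as err:  # keep the session alive if generation fails
            write(f"Chatbot: Sorry, something went wrong ({err}).")

# Scripted demo with a stand-in reply function instead of the model
def echo_reply(text):
    return f"You asked: {text}"

scripted = iter(["Can I upgrade my seat?", "", "  EXIT "])
transcript = []
run_chat(echo_reply, read=lambda prompt: next(scripted), write=transcript.append)
print("\n".join(transcript))
```

Injecting `read` and `write` lets the same loop run interactively in Colab with the defaults and in a scripted check like the one above.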
