<a href="https://colab.research.google.com/github/sergekamanzi/Chat-Bot-/blob/main/chatbot2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [27]:
# One upgrade call covers the initial install as well
!pip install --upgrade datasets fsspec aiohttp


Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting aiohttp
  Downloading aiohttp-3.12.9-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.6 kB)
Collecting fsspec
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Downloading aiohttp-3.12.9-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.7 MB)
Installing collected packages: fsspec, aiohttp, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 202

In [1]:
import pandas as pd
from datasets import load_dataset
import re
from transformers import T5Tokenizer, TFT5ForConditionalGeneration
import tensorflow as tf
from sklearn.model_selection import train_test_split
from nltk.translate.bleu_score import sentence_bleu
import numpy as np

# Step 1: Load and preprocess the dataset
ds = load_dataset("RashmiMyneni/BankingDataset")
dataset_split = ds["train"]
df = pd.DataFrame(dataset_split)

# Check dataset size
print("Dataset size (rows):", len(dataset_split))
print("DataFrame shape:", df.shape)

# Check for missing values
print("\nMissing values per column:")
print(df.isnull().sum())

# Handle missing values
if df['text'].isnull().sum() > 0:
    df = df.dropna(subset=['text'])
    print("Dropped rows with missing 'text' values. New shape:", df.shape)

# Extract utterances and intents
df['utterance'] = df['text'].str.extract(r'####Utterance: (.*?) ####Category')
df['intent'] = df['text'].str.extract(r'####Intent: (\w+)</s>')

# Clean utterances
def clean_text(text):
    if isinstance(text, str):
        text = re.sub(r'\s+', ' ', text.strip()).lower()
        return text
    return text

df['utterance_cleaned'] = df['utterance'].apply(clean_text)

# Remove duplicates
print("\nNumber of duplicate utterances:", df['utterance_cleaned'].duplicated().sum())
df = df.drop_duplicates(subset=['utterance_cleaned'], keep='first')
print("Shape after removing duplicates:", df.shape)

# Check intent distribution
print("\nIntent distribution:")
print(df['intent'].value_counts())

# Step 2: Create conversational pairs
intent_to_response = {
    'create_account': "To create an online account, visit our website, click 'Register,' and follow the prompts to enter your personal details and verify your identity.",
    'delete_account': "To delete your account, log in to your online banking portal, navigate to account settings, and select 'Close Account.' Follow the verification steps to confirm.",
    'get_refund': "To request a refund, please contact our support team with your transaction ID and details. You can reach us at support@bank.com or call 1-800-555-1234.",
    'check_refund_policy': "Our refund policy allows refunds within 30 days of purchase for eligible transactions. Please review the terms on our website or contact support for details.",
    'get_invoice': "You can download your invoice from the online banking portal under 'Transaction History.' Select the transaction and click 'Download Invoice.'",
    'recover_password': "To recover your password, go to the login page, click 'Forgot Password,' and follow the instructions to reset it via email or SMS verification.",
    'payment_issue': "For payment issues, please verify your payment method and contact our support team at support@bank.com with details of the issue.",
    'switch_account': "To switch accounts, log in to your online banking portal, go to 'Account Settings,' and select 'Switch Account Type.' Follow the prompts to complete the process.",
    'edit_account': "To edit your account details, log in to your online banking portal, navigate to 'Profile Settings,' and update your information as needed.",
    'track_refund': "To track your refund, log in to your account and check the 'Refund Status' section under 'Transactions,' or contact support with your transaction ID."
}

# Assign responses using .loc to avoid SettingWithCopyWarning
df.loc[:, 'response'] = df['intent'].map(intent_to_response)
df = df.dropna(subset=['response'])
print("\nShape after mapping responses:", df.shape)

# Step 3: Tokenization for T5
tokenizer = T5Tokenizer.from_pretrained("t5-small")
max_length = 128

def prepare_t5_inputs(row):
    # T5 is a text-to-text model, so the input carries a task prefix.
    input_text = f"chatbot: {row['utterance_cleaned']}"
    target_text = row['response']
    return input_text, target_text

# Prepare inputs and targets in a single pass over the rows
pairs = df.apply(prepare_t5_inputs, axis=1, result_type='expand')
df.loc[:, 'input_text'] = pairs[0]
df.loc[:, 'target_text'] = pairs[1]

# Tokenization happens after the train/test split (Step 4), so each split
# is encoded exactly once; encoding the full DataFrame here would be unused.

# Step 4: Split dataset
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
train_inputs = tokenizer(
    train_df['input_text'].tolist(),
    max_length=max_length,
    padding='max_length',
    truncation=True,
    return_tensors='tf'
)
train_targets = tokenizer(
    train_df['target_text'].tolist(),
    max_length=max_length,
    padding='max_length',
    truncation=True,
    return_tensors='tf'
)
test_inputs = tokenizer(
    test_df['input_text'].tolist(),
    max_length=max_length,
    padding='max_length',
    truncation=True,
    return_tensors='tf'
)
test_targets = tokenizer(
    test_df['target_text'].tolist(),
    max_length=max_length,
    padding='max_length',
    truncation=True,
    return_tensors='tf'
)

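# Create TensorFlow datasets
def create_dataset(inputs, targets, shuffle=False):
    # Replace padding ids in the labels with -100 so the loss ignores them.
    ignore_index = tf.constant(-100, dtype=targets['input_ids'].dtype)
    labels = tf.where(tf.equal(targets['attention_mask'], 1), targets['input_ids'], ignore_index)
    # Pass only the labels: the model derives decoder_input_ids by shifting
    # them right. Feeding the unshifted targets as decoder input would let
    # the decoder see the very token it is asked to predict.
    dataset = tf.data.Dataset.from_tensor_slices({
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask'],
        'labels': labels
    })
    if shuffle:
        dataset = dataset.shuffle(buffer_size=int(labels.shape[0]), seed=42)
    return dataset.batch(8)

train_dataset = create_dataset(train_inputs, train_targets, shuffle=True)
test_dataset = create_dataset(test_inputs, test_targets)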

# Step 5: Fine-tune T5 model
model = TFT5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
model.compile(optimizer=optimizer)

# Define early stopping callback
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=2,
    restore_best_weights=True,
    verbose=1
)

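# Train model with early stopping.
# Note: the test split doubles as the validation set here, so the BLEU score
# computed later is not measured on fully held-out data.
print("\nTraining model...")
history = model.fit(
    train_dataset,
    epochs=10,  # Upper bound; early stopping halts sooner if val_loss stalls
    validation_data=test_dataset,
    callbacks=[early_stopping]
)
# These are the losses of the last epoch that ran; with restore_best_weights=True
# the model itself may have been rolled back to an earlier, better epoch.
print(f"Final-epoch training loss: {history.history['loss'][-1]:.4f}, validation loss: {history.history['val_loss'][-1]:.4f}")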

# Step 6: Evaluate model
def generate_response(input_text):
    input_text = f"chatbot: {clean_text(input_text)}"
    # A single query needs no padding; truncation still caps the input length.
    inputs = tokenizer(input_text, max_length=max_length, truncation=True, return_tensors='tf')
    # Allow up to max_length tokens so the longer canned responses are less likely to be cut off.
    outputs = model.generate(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

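# Calculate BLEU score
from nltk.translate.bleu_score import SmoothingFunction

# Smoothing avoids zero scores (and NLTK warnings) when a higher-order n-gram has no match.
smoothing = SmoothingFunction().method1
bleu_scores = []
for _, row in test_df.iterrows():
    predicted = generate_response(row['utterance_cleaned'])
    reference = [row['target_text'].split()]
    predicted_tokens = predicted.split()
    bleu = sentence_bleu(reference, predicted_tokens, smoothing_function=smoothing)
    bleu_scores.append(bleu)
print("\nAverage BLEU score:", np.mean(bleu_scores))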

# Step 7: Qualitative testing
# Simple keyword guard: queries without a banking keyword get a fixed refusal.
BANKING_KEYWORDS = ['account', 'bank', 'register', 'refund', 'invoice', 'password', 'payment']
FALLBACK_RESPONSE = "Sorry, I can only assist with banking-related queries."

def answer(query):
    # Run generation only for in-domain queries; out-of-domain ones skip the model.
    if any(keyword in query.lower() for keyword in BANKING_KEYWORDS):
        return generate_response(query)
    return FALLBACK_RESPONSE

test_queries = [
    "How do I open an online account?",
    "Can you help me delete my account?",
    "What's the weather today?"  # Out-of-domain
]
print("\nQualitative Testing:")
for query in test_queries:
    print(f"Query: {query}\nResponse: {answer(query)}\n")

# Step 8: Interactive prompt for user input
print("\nInteractive Chatbot Testing:")
print("Enter a query (or type 'exit' to quit):")
while True:
    user_input = input("> ")
    if user_input.strip().lower() == 'exit':
        print("Exiting chatbot...")
        break
    print(f"Response: {answer(user_input)}\n")

# Save the fine-tuned model
#model.save_pretrained("banking_chatbot_model")
#tokenizer.save_pretrained("banking_chatbot_model")
#print("\nModel and tokenizer saved to 'banking_chatbot_model'")



Dataset size (rows): 500
DataFrame shape: (500, 1)

Missing values per column:
text    0
dtype: int64

Number of duplicate utterances: 4
Shape after removing duplicates: (496, 4)

Intent distribution:
intent
create_account         50
delete_account         50
get_refund             50
check_refund_policy    50
get_invoice            50
recover_password       50
payment_issue          49
switch_account         49
edit_account           49
track_refund           49
Name: count, dtype: int64

Shape after mapping responses: (496, 5)
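
The extraction step relies on every `text` entry following a `####Utterance: ... ####Category ... ####Intent: ...</s>` layout. Below is a minimal sketch of the same two patterns applied to a single record with the standard `re` module. The sample string and its category value are invented for illustration and are not taken from the dataset.

```python
import re

# Hypothetical record shaped like the patterns used in the notebook;
# the wording and the category value are assumptions for illustration.
sample = "####Utterance: I want to  open an Account ####Category: ACCOUNT ####Intent: create_account</s>"

utterance_match = re.search(r'####Utterance: (.*?) ####Category', sample)
intent_match = re.search(r'####Intent: (\w+)</s>', sample)

# A failed match yields None, mirroring the NaN that str.extract would produce.
utterance = utterance_match.group(1) if utterance_match else None
intent = intent_match.group(1) if intent_match else None

# Same normalisation as clean_text: collapse whitespace, then lowercase.
cleaned = re.sub(r'\s+', ' ', utterance.strip()).lower() if utterance else None

print(cleaned)  # i want to open an account
print(intent)   # create_account
```

Rows where either pattern fails would end up without an utterance or intent, so it can be worth counting those before dropping duplicates.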


You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
All PyTorch model weights were used when initializing TFT5ForConditionalGeneration.

All the weights of TFT5ForConditionalGeneration were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFT5ForConditionalGeneration for predictions without further training.
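
For sequence-to-sequence fine-tuning, the decoder input is normally the label sequence shifted one position to the right, and padded label positions are usually excluded from the loss by setting them to -100, the ignore value that Hugging Face models recognise. The plain-Python sketch below shows both steps on made-up token ids. The helper names `mask_padding` and `shift_right` are hypothetical, and the sketch assumes T5's convention that id 0 is both the pad token and the decoder start token.

```python
PAD_ID = 0           # T5 pad token id, also used as the decoder start token
IGNORE_INDEX = -100  # label value that the loss skips

def mask_padding(label_ids, attention_mask):
    """Replace padded label positions with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(label_ids, attention_mask)]

def shift_right(label_ids):
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [PAD_ID] + label_ids[:-1]
    # The decoder cannot embed -100, so ignored positions go back to PAD_ID.
    return [PAD_ID if tok == IGNORE_INDEX else tok for tok in shifted]

# Made-up ids: 1 stands for </s>, 0 for padding.
target_ids = [363, 19, 8, 1, 0, 0]
target_mask = [1, 1, 1, 1, 0, 0]

labels = mask_padding(target_ids, target_mask)
decoder_input_ids = shift_right(labels)

print(labels)             # [363, 19, 8, 1, -100, -100]
print(decoder_input_ids)  # [0, 363, 19, 8, 1, 0]
```

At every position the decoder input holds the previous label, so the model never receives the token it is currently asked to predict.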



Training model...
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10

KeyboardInterrupt: 
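
The run above ended with a `KeyboardInterrupt` after epoch 6 had started, so the early-stopping callback never made a decision. As a rough illustration of what `patience=2` means, here is a plain-Python sketch over a made-up list of validation losses. The numbers are invented and are not results from this notebook, and `early_stopping_epoch` is a hypothetical simplification of the Keras callback, not its implementation.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    Training stops once `patience` consecutive epochs fail to improve on
    the best validation loss seen so far.
    """
    best_loss = float('inf')
    best_epoch = 0
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran through every epoch without exhausting the patience budget.
    return len(val_losses), best_epoch

# Invented validation losses, one per epoch.
val_losses = [2.10, 1.40, 1.15, 1.20, 1.18, 1.05]
stop_epoch, best_epoch = early_stopping_epoch(val_losses)
print(stop_epoch, best_epoch)  # 5 3
```

In this made-up series, training would stop at epoch 5 and, with `restore_best_weights=True`, the weights from epoch 3 would be kept, even though epoch 6 would have been better had training continued.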