## Multifunctional Fine-Tuned Retrieval-Based Chatbot Leveraging RoBERTa and BART Transformers

### Problem Statement

##### As digital communication has expanded rapidly, there is a rising need for smarter and more responsive chatbots that improve human-computer interaction, particularly in customer-facing settings. Traditional rule-based chatbots often fail to capture the complexity and nuance of human language. This creates a need for a versatile, adaptive chatbot that can understand queries and return contextually relevant responses by leveraging state-of-the-art natural language processing (NLP) techniques.

### Objective

##### The objective of the project is to create two chatbots:

##### The first is a fine-tuned, retrieval-based chatbot.

##### The second is a fine-tuned generative chatbot.

##### The retrieval-based chatbot integrates RoBERTa through a Sentence Transformer together with other NLP methods, while the generative chatbot is fine-tuned from the BART transformer. Together, these chatbots aim to improve the quality and relevance of user interactions by using sentence transformers for semantic understanding, cosine similarity for response retrieval, and BART for conditional text generation. The retrieval chatbot also uses TextBlob to check whether the sentiment of a question is positive or negative, which helps give the user a better experience.

##### The chatbots are intended to answer questions related to healthcare and finance and to keep up with general conversation.

### Dataset

##### The dataset consists of question-and-answer pairs, which are used for both training and retrieval. It contains healthcare, finance and conversational entries.

##### Import Libraries

In [5]:
import torch 
import re
import pandas as pd 
import numpy as np
from sentence_transformers import SentenceTransformer , InputExample, losses

##### Reading CSV File

In [2]:
chatDF = pd.read_csv("chatbot_data.csv")

In [3]:
chatDF.head() 

Unnamed: 0,query,response,intent,domain
0,Can I make changes to my loan repayment schedule?,Changes to your loan repayment schedule can be...,loan repayment adjustment,finance
1,How do I apply for a student loan?,You can apply for a student loan by visiting o...,student loan application,finance
2,What are the side effects of the COVID-19 vacc...,Common side effects of the COVID-19 vaccine in...,side effects inquiry,healthcare
3,How can I schedule an appointment with my doctor?,You can schedule an appointment by calling our...,appointment booking,healthcare
4,What should I do if I miss a dose of my medica...,"If you miss a dose, take it as soon as you rem...",medication inquiry,healthcare


`head()` returns the first five rows of the DataFrame, which has four columns: "query", "response", "intent" and "domain".

##### Using "value_counts()" to count the occurrences of unique values.

In [4]:
chatDF["domain"].value_counts() 

domain
healthcare      765
finance         523
conversation    393
Name: count, dtype: int64

The dataset consists of three major domains: healthcare, finance and conversation. Healthcare has the highest count, followed by finance and then conversation.
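To put the counts in proportion, the short sketch below turns them into shares of the whole dataset. The numbers are copied from the `value_counts()` output; the variable names are illustrative only.

```python
# Domain counts copied from the value_counts() output
counts = {"healthcare": 765, "finance": 523, "conversation": 393}

total = sum(counts.values())
# Share of each domain, rounded to three decimals
shares = {domain: round(n / total, 3) for domain, n in counts.items()}

print(total)
print(shares)
```

Healthcare makes up a little under half of the rows, so a retrieval index built from this data will contain noticeably more healthcare candidates than conversational ones.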


##### Checking the shape of the dataset

In [5]:
chatDF.shape

(1681, 4)

The dataset has 1681 rows and 4 columns.

##### Cleaning the text data

In [6]:
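def clean_text(text):
    if not isinstance(text, str):                # coerce non-string cells (e.g. NaN, integers) to text
        text = str(text)
    text = re.sub(r'\r\n', ' ', text)            # Windows line breaks -> space
    text = re.sub(r'\s+', ' ', text)             # collapse runs of whitespace
    text = re.sub(r'<.*?>', '', text)            # strip HTML-like tags
    text = re.sub(r'[?.,@!#$%^&*()]', '', text)  # drop common punctuation (hyphens and apostrophes are kept)
    text = re.sub(r'\d+', '', text)              # drop digits, so "covid-19" becomes "covid-"
    text = text.strip().lower()                  # trim and lowercase
    return text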
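To see what these cleaning rules do in practice, the sketch below repeats the same function so that it runs on its own and applies it to a few inputs. The first sample is the COVID-19 question from the preview above; the other two are illustrative inputs that are not taken from the dataset.

```python
import re

def clean_text(text):
    # Same rules as the cell above, repeated so the snippet is self-contained
    if not isinstance(text, str):
        text = str(text)
    text = re.sub(r'\r\n', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    text = re.sub(r'<.*?>', '', text)
    text = re.sub(r'[?.,@!#$%^&*()]', '', text)
    text = re.sub(r'\d+', '', text)
    return text.strip().lower()

samples = [
    "What are the side effects of the COVID-19 vaccine?",
    "<b>Hello</b>,  World!\r\n",   # illustrative: markup, punctuation, line break
    None,                          # illustrative: a missing value
]
cleaned = [clean_text(s) for s in samples]

for raw, out in zip(samples, cleaned):
    print(repr(raw), "->", repr(out))
```

Two side effects are worth knowing about: digits are removed, so "COVID-19" ends up as "covid-", and a missing value is turned into the literal word "none" rather than an empty string.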

##### Extracting the Response Column from the chatDF DataFrame

In [7]:
responseDF = chatDF["response"] 

##### Applying cleaning to the response column

In [8]:
responseDF = responseDF.apply(clean_text)

In [9]:
responseDF[0] 

'changes to your loan repayment schedule can be made by contacting our loan department or via the online portal'

In [10]:
len(responseDF) 

1681


##### Cleaning and Storing chatDF into newChatDF DataFrame

In [11]:
newChatDF = chatDF.applymap(clean_text)  # pandas >= 2.1 deprecates applymap in favor of DataFrame.map


In [12]:
newChatDF.head()

Unnamed: 0,query,response,intent,domain
0,can i make changes to my loan repayment schedule,changes to your loan repayment schedule can be...,loan repayment adjustment,finance
1,how do i apply for a student loan,you can apply for a student loan by visiting o...,student loan application,finance
2,what are the side effects of the covid- vaccine,common side effects of the covid- vaccine incl...,side effects inquiry,healthcare
3,how can i schedule an appointment with my doctor,you can schedule an appointment by calling our...,appointment booking,healthcare
4,what should i do if i miss a dose of my medica...,if you miss a dose take it as soon as you reme...,medication inquiry,healthcare


##### Contextual Word Embeddings Augmentation with NLPaug

Importing nlpaug library

In [13]:
import nlpaug.augmenter.word as naw 

##### Initialize the augmenter

In [14]:
aug = naw.ContextualWordEmbsAug(model_path='bert-base-uncased', action="insert")

##### Function to augment a single sentence

In [15]:
def augment_text(text):
    return aug.augment(text)

##### Applying Text Augmentation to DataFrame

In [16]:
newChatDF["augmentedQuery"] = newChatDF["query"].apply(augment_text)       

##### Converting each list of augmented sentences into a single string

In [17]:
newChatDF['augmentedQuery'] = newChatDF['augmentedQuery'].apply(lambda x: ', '.join(x)) 

In [18]:
newChatDF["augmentedQuery"]

0       can i make changes enough to reflect my curren...
1       how do i apply again for a state student agent...
2       10 what are the side effects characteristics o...
3       now how can i just schedule only an appointmen...
4       and what should i probably do soon if i miss a...
                              ...                        
1676                                      who you now are
1677                                   who really are you
1678                                     who there is you
1679               who you might consider better yourself
1680                                       yet how is you
Name: augmentedQuery, Length: 1681, dtype: object

##### Concatenating "query" and "augmentedQuery" columns into "fullQuery" column

In [19]:
newChatDF["fullQuery"] = newChatDF['query'] + ' ' + newChatDF['augmentedQuery']

##### Checking the columns in newChatDF

In [20]:
newChatDF.columns

Index(['query', 'response', 'intent', 'domain', 'augmentedQuery', 'fullQuery'], dtype='object')

##### Dropping the unnecessary columns

In [21]:
newChatDF = newChatDF.drop(columns=['intent','domain','query','augmentedQuery']) 

##### Checking the type of "newChatDF" 

In [22]:
type(newChatDF)  

pandas.core.frame.DataFrame

### InputExample

##### `InputExample` is the sentence-transformers container for a single training example. It holds one or more texts (here a query and its response) together with an optional label, and a list of these objects is what the training DataLoader consumes.

##### Converting the "newChatDF" DataFrame to InputExample objects with a default label

In [23]:
default_label = 1.0 
input_examples = newChatDF.apply(lambda row: InputExample(
    guid=str(row.name),
    texts=[row['fullQuery'], row['response']],
    label=default_label
), axis=1).tolist()

In the above code:

**guid**: gives each question-and-answer pair a unique identifier (the row index), which helps keep track of each example.

**texts**: holds the "fullQuery" and "response" values as a list of two separate text elements.

**label**: assigns 1.0 to every row as a default label, marking each pair as a positive example.

Finally, `apply` processes each row to create an `InputExample` object, and `.tolist()` converts the result into a list of these objects.
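The row-to-example conversion can be sketched without the library as follows. `PairExample` is a hypothetical stand-in dataclass that only mirrors the three fields described above (it is not the real `InputExample` class), and the two rows are made-up data.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PairExample:
    """Hypothetical stand-in with the same guid/texts/label fields as above."""
    guid: str
    texts: List[str]
    label: float = 1.0

# Made-up rows shaped like newChatDF after the unused columns are dropped
rows = [
    {"fullQuery": "how do i apply for a student loan", "response": "you can apply on our website"},
    {"fullQuery": "who are you", "response": "i am a chatbot"},
]

# One example per row: guid from the row position, texts as [query, response]
examples = [
    PairExample(guid=str(i), texts=[row["fullQuery"], row["response"]])
    for i, row in enumerate(rows)
]

for ex in examples:
    print(ex)
```

A list comprehension over the rows is a common alternative to `DataFrame.apply` for this kind of conversion and produces the same kind of list.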



##### Printing Input Examples

In [24]:
for example in input_examples:
    print(example)

<InputExample> label: 1.0, texts: can i make changes to my loan repayment schedule can i make changes enough to reflect my current loan repayment schedule; changes to your loan repayment schedule can be made by contacting our loan department or via the online portal
<InputExample> label: 1.0, texts: how do i apply for a student loan how do i apply again for a state student agent loan; you can apply for a student loan by visiting our website and filling out the application form
<InputExample> label: 1.0, texts: what are the side effects of the covid- vaccine 10 what are the side effects characteristics of the standard covid - 40 vaccine; common side effects of the covid- vaccine include soreness at the injection site fever and fatigue
<InputExample> label: 1.0, texts: how can i schedule an appointment with my doctor now how can i just schedule only an appointment with my doctor; you can schedule an appointment by calling our office or using our online portal
<InputExample> label: 1.0, t


### DataLoader

##### A `DataLoader` groups the training examples into batches and, with `shuffle=True`, reshuffles them at every epoch so the model sees the pairs in a different order each time.

##### Creating a Shuffled DataLoader "train_dataloader"

In [25]:
from torch.utils.data import DataLoader
train_dataloader = DataLoader(input_examples, shuffle=True, batch_size=16) 


### Sentence Transformer

The SentenceTransformer('stsb-roberta-base') model is used to convert sentences into 768-dimensional vectors. These vectors capture the semantic meaning of the sentences, making it useful for tasks like sentence similarity, clustering, and semantic search. Essentially, it helps in understanding and comparing the meaning of sentences in a numerical format.

##### Initialize Sentence Transformer Model


In [None]:
sentenceModel = SentenceTransformer('stsb-roberta-base')  

##### Sentence Model Training with Multiple Negatives Ranking Loss

With Multiple Negatives Ranking Loss, each question and its corresponding answer form a positive pair, while all other answers in the same batch are treated as negative samples. The model learns to pull the question and its correct answer closer together in the embedding space while pushing incorrect answers apart.
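The idea can be illustrated with a small dependency-free sketch. It is not the library's implementation: it assumes cosine similarity as the scoring function and a scale factor of 20 (chosen to resemble common defaults), and the two-dimensional vectors are made-up stand-ins for sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mnr_loss(query_vecs, answer_vecs, scale=20.0):
    """Mean cross-entropy where answer i is the correct class for query i."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        # Score the query against every answer in the batch
        scores = [scale * cosine(q, a) for a in answer_vecs]
        # Numerically stable log-sum-exp over the batch
        m = max(scores)
        log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the matching answer
        total += log_sum_exp - scores[i]
    return total / len(query_vecs)

# Toy "embeddings": answer i points in roughly the same direction as query i
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned = [[0.9, 0.1], [0.1, 0.9], [-0.9, 0.1]]
# The same answers paired with the wrong queries
shuffled = [aligned[1], aligned[2], aligned[0]]

loss_aligned = mnr_loss(queries, aligned)
loss_shuffled = mnr_loss(queries, shuffled)
print(loss_aligned, loss_shuffled)
```

The loss is close to zero when each query is already nearest to its own answer and large when the pairs are mismatched, which is the signal that drives the embeddings of matching pairs together during training.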

In [27]:
train_loss = losses.MultipleNegativesRankingLoss(sentenceModel) 

##### Training Sentence Model with Multiple Epochs and Warmup Steps

In [None]:
num_epochs = 5
warmup_steps = int(len(train_dataloader) * num_epochs * 0.1)

sentenceModel.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=num_epochs,
    warmup_steps=warmup_steps 
) 
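As a worked check of the warmup calculation, the sketch below reproduces the arithmetic with the numbers reported earlier (1681 rows, batch size 16, 5 epochs). It assumes the DataLoader keeps the final partial batch, which is its default behaviour.

```python
import math

num_rows = 1681    # rows reported by chatDF.shape
batch_size = 16    # batch_size passed to the DataLoader
num_epochs = 5

# Number of batches per epoch; the last, smaller batch is kept
steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
# 10% of all training steps are used for learning-rate warmup
warmup_steps = int(total_steps * 0.1)

print(steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the learning rate ramps up over the first 53 of 530 optimizer steps.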

##### Preparing, Cleaning and Encoding a New Query

In [None]:
new_query = "I wanted to book an appointment" 
new_query = clean_text(new_query) 
new_query_embedding = sentenceModel.encode([new_query])  

In [None]:
faq_embeddings = sentenceModel.encode(newChatDF["fullQuery"].tolist())  # pass a plain list of strings

##### Importing cosine similarity

In [8]:
from sklearn.metrics.pairwise import cosine_similarity

##### Calculating Query Embedding Similarities

In [None]:
similarities = cosine_similarity(new_query_embedding, faq_embeddings) 

##### Finding the index of the most similar query and Best Score

In [None]:
most_similar_query_index = np.argmax(similarities) 
best_score = similarities[0][most_similar_query_index].item() 
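The retrieval step above boils down to scoring the query against every stored embedding and keeping the best one. A minimal pure-Python sketch of that logic follows; the three-dimensional vectors are made-up stand-ins for the 768-dimensional embeddings, and `best_match` is an illustrative helper name.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def best_match(query_vec, candidate_vecs):
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine(query_vec, c) for c in candidate_vecs]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return best_index, scores[best_index]

# Toy stand-ins for the FAQ embeddings and the new query embedding
faq_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.5, 0.9, 0.1]

index, score = best_match(query_vec, faq_vecs)
print(index, round(score, 3))
```

The returned index plays the role of `most_similar_query_index` and the score plays the role of `best_score`, which is later compared against the 0.70 threshold.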

### TextBlob

Importing TextBlob 

In [None]:
from textblob import TextBlob 

##### Classifying The Sentiment Of The Given Input

In [117]:
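def classify_sentiment(text):
    # TextBlob polarity ranges from -1.0 (negative) to 1.0 (positive)
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        print(blob)  # debug output: echoes the text that was scored as positive
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"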


TextBlob helps here by flagging negatively worded questions that a user might ask and that are not present in the dataset, so they can be routed to a fallback message.

##### Classifying Sentiment of New Query

In [118]:
sentiment = classify_sentiment(new_query)  

okay


In [119]:
sentiment 

'Positive'

##### Handle Negative Sentiment and Similar Query Response

In [None]:
if sentiment == "Negative":
    print("Please drop us a mail regarding your concerns.") 
elif best_score >= 0.70: 
    # Retrieve the most similar query and its response 
    most_similar_query = newChatDF['fullQuery'][most_similar_query_index] 
    response = responseDF[most_similar_query_index] 
    print(f"Most Similar Query: {most_similar_query}")
    print(f"Response: {response}") 
elif best_score >= 0.30:
    print("Sorry we are facing some technical difficulties , please write to us on contact@healthcarerocks.com") 
elif best_score >= 0.20:
    print("Please write to us on our mail ID contact@healthcarerocks.com")
else:
    print("please write a mail regarding any queries related to our services")

Most Similar Query: ok bye ok u bye
Response: have a nice day


This helps handle questions that the model has not seen or that are not present in the dataset.
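The branch logic above can also be written as a small function that returns the reply instead of printing it, which makes the thresholds easy to test. This is a sketch under the same thresholds (0.70, 0.30, 0.20); the function and constant names are illustrative, and the reply texts are lightly tidied versions of the messages used above.

```python
NEGATIVE_REPLY = "Please drop us a mail regarding your concerns."
GENERIC_REPLY = "Please write a mail regarding any queries related to our services"

# (minimum score, reply) pairs checked in descending order
FALLBACK_TIERS = [
    (0.30, "Sorry, we are facing some technical difficulties. "
           "Please write to us at contact@healthcarerocks.com"),
    (0.20, "Please write to us at contact@healthcarerocks.com"),
]

def route_response(sentiment, best_score, matched_response, answer_threshold=0.70):
    """Pick the reply for a query given its sentiment label and best similarity score."""
    if sentiment == "Negative":
        return NEGATIVE_REPLY
    if best_score >= answer_threshold:
        return matched_response
    for threshold, reply in FALLBACK_TIERS:
        if best_score >= threshold:
            return reply
    return GENERIC_REPLY

# Example: the score reported in this notebook clears the 0.70 threshold
print(route_response("Positive", 0.7360935211181641, "have a nice day"))
```

Keeping the thresholds in one place makes it simpler to tune them later, for example against a held-out set of queries.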


##### Best Score

In [85]:
best_score

0.7360935211181641

##### Saving the trained model into a respective directory 

In [None]:
output_dir = 'chatbot/trainedModel' 
sentenceModel.save(output_dir) 

##### Saving Data To Pickle File

In [None]:
import pickle

with open('pickleFiles/faq_embeddings.pkl', 'wb') as f:
    pickle.dump(faq_embeddings, f)

with open('pickleFiles/responseDF.pkl',"wb") as f:
    pickle.dump(responseDF,f) 


### Model Loading and Testing the model

In [None]:
from sentence_transformers import SentenceTransformer

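# Load the model; the path is resolved relative to the current working directory
# (the model was saved to 'chatbot/trainedModel' above)
output_dir = 'trainedModel'
sentenceModel = SentenceTransformer(output_dir)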


In [3]:
import pickle 

with open('pickleFiles/faq_embeddings.pkl', 'rb') as f:
    faq_embeddings = pickle.load(f)

with open('pickleFiles/responseDF.pkl', 'rb') as f:
    responseDF = pickle.load(f)

In [None]:
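import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from textblob import TextBlob

# clean_text is the helper defined in the cleaning section; redefine or import it in a fresh session
new_query = "how to create a demat account "
new_query = clean_text(new_query)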
new_query_embedding = sentenceModel.encode([new_query])  


similarities = cosine_similarity(new_query_embedding, faq_embeddings)

most_similar_query_index = np.argmax(similarities) 
best_score = similarities[0][most_similar_query_index].item() 


def classify_sentiment(text):
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity
    if sentiment > 0:
        print(blob)
        return "Positive"
    elif sentiment < 0:
        return "Negative"
    else:
        return "Neutral"

sentiment = classify_sentiment(new_query)  


if sentiment == "Negative":
    print("Please drop us a mail regarding your concerns.") 
elif best_score >= 0.70: 
    # Retrieve the most similar query and its response 
    response = responseDF[most_similar_query_index] 
    print(f"Response: {response}") 
elif best_score >= 0.30:
    print("Sorry we are facing some technical difficulties , please write to us on contact@healthcarerocks.com") 
elif best_score >= 0.20:
    print("Please write to us on our mail ID contact@healthcarerocks.com")
else:
    print("please write a mail regarding any queries related to our services")


Response: you can download the account opening forms from the site and submit them at our branches offering demat services you can also visit the branches offering demat service for opening the demat account there is no fee for opening a dp account with bank however a nominal fee towards services is levied as per our standard rate card


In [35]:
best_score

0.7164210677146912


### Fine Tuned Chatbot Using BART Transformer

##### Creating a DataFrame

In [147]:
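# newQueryDataset is not defined in this notebook excerpt; judging by the preview
# below, it holds the cleaned queries joined with their augmented variants
newChatDF = pd.DataFrame({
    "query": newQueryDataset,
    "response": responseDF
})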

In [148]:
newChatDF.head()

Unnamed: 0,query,response
0,can i make changes to my loan repayment schedu...,changes to your loan repayment schedule can be...
1,how do i apply for a student loan for how do i...,you can apply for a student loan by visiting o...
2,what are the side effects of the covid- vaccin...,common side effects of the covid- vaccine incl...
3,how can i schedule an appointment with my doct...,you can schedule an appointment by calling our...
4,what should i do if i miss a dose of my medica...,if you miss a dose take it as soon as you reme...


The DataFrame has two columns: query and response.

##### Checking the shape of the dataset

In [149]:
newChatDF.shape

(1681, 2)

The dataset has 1681 rows and 2 columns.

##### Train Test Split

In [150]:
from sklearn.model_selection import train_test_split

##### Splitting the data into training and validation sets.

In [151]:
train_df, val_df = train_test_split(newChatDF, test_size=0.2, random_state=42)

##### Checking the shape of the dataframe

In [152]:
train_df.shape, val_df.shape

((1344, 2), (337, 2))

After the split, the training set train_df has 1344 rows and 2 columns, while the validation set val_df has 337 rows and 2 columns.
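The split sizes can be reproduced by hand. The sketch below assumes that the validation share is rounded up and that the remaining rows go to training, which is consistent with the shapes printed above.

```python
import math

n_samples = 1681   # rows in newChatDF
test_size = 0.2    # fraction passed to train_test_split

# Validation rows: 20% of the data, rounded up
n_val = math.ceil(test_size * n_samples)
# Training rows: everything that is left
n_train = n_samples - n_val

print(n_train, n_val)
```

Because 20% of 1681 is 336.2, the validation set receives 337 rows and the training set the remaining 1344.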


In [153]:
train_df.head()

Unnamed: 0,query,response
427,why am i unable to update my personal informat...,some updates require additional verification o...
720,what is square off or what is another square off,squaring off is a trading style used by trader...
266,how can i improve my credit score quickly how ...,pay bills on time reduce credit card balances ...
148,i'll see you tomorrow then now i ' ll hopefull...,see you then goodbye
425,i deposited cash at the atm but it hasnt been ...,atm deposits may take up to hours to reflect ...


In [154]:
val_df.head()

Unnamed: 0,query,response
1604,is distal myopathy inherited next is distal m...,distal myopathy is inherited in an autosomal ...
482,whats the weather like besides whats the weath...,its rainy
203,how can i maintain a healthy weight how can am...,maintain a healthy weight by balancing calorie...
49,i really wish it wasn't so hot every day may i...,me too i can't wait until winter
937,can an nri open demat account can an internet ...,yes nris can open a demat account in india to ...


##### Resetting Index for Training and Validation Data

In [155]:
train_data = train_df.reset_index(drop=True)
validation_data = val_df.reset_index(drop=True) 

In [156]:
train_data.head()

Unnamed: 0,query,response
0,why am i unable to update my personal informat...,some updates require additional verification o...
1,what is square off or what is another square off,squaring off is a trading style used by trader...
2,how can i improve my credit score quickly how ...,pay bills on time reduce credit card balances ...
3,i'll see you tomorrow then now i ' ll hopefull...,see you then goodbye
4,i deposited cash at the atm but it hasnt been ...,atm deposits may take up to hours to reflect ...


In [157]:
train_data['query'][0]

'why am i unable to update my personal information and why am i unable to simply update my personal progress information'

In [158]:
validation_data.head()

Unnamed: 0,query,response
0,is distal myopathy inherited next is distal m...,distal myopathy is inherited in an autosomal ...
1,whats the weather like besides whats the weath...,its rainy
2,how can i maintain a healthy weight how can am...,maintain a healthy weight by balancing calorie...
3,i really wish it wasn't so hot every day may i...,me too i can't wait until winter
4,can an nri open demat account can an internet ...,yes nris can open a demat account in india to ...


### BART Transformers

##### Importing BART libraries

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration, Trainer, TrainingArguments
from datasets import Dataset, DatasetDict

##### Initializing tokenizer

In [None]:
tokenizer = BartTokenizer.from_pretrained('facebook/bart-base')

##### Preprocessing By Tokenizing And Creating Labels

In [None]:
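def preprocess_function(examples):
    inputs = tokenizer(examples["query"], padding="max_length", truncation=True, max_length=512)
    targets = tokenizer(examples["response"], padding="max_length", truncation=True, max_length=512)
    # Replace padding ids with -100 so the loss skips padded label positions
    inputs["labels"] = [
        [tok if tok != tokenizer.pad_token_id else -100 for tok in seq]
        for seq in targets["input_ids"]
    ]
    return inputs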
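Because every response is padded to 512 tokens, most label positions are padding. A common practice is to replace those positions with -100, the value that PyTorch's cross-entropy loss ignores by default, so that the model is not trained to predict padding. The sketch below shows the masking on toy data; the token ids are made up, and the pad id of 1 is an assumption that should be read from `tokenizer.pad_token_id` in real code.

```python
PAD_ID = 1           # assumed pad token id; use tokenizer.pad_token_id in practice
IGNORE_INDEX = -100  # label value skipped by the loss

def mask_padding(label_ids, pad_id=PAD_ID, ignore_index=IGNORE_INDEX):
    """Replace pad ids with the ignore index in a batch of label sequences."""
    return [
        [ignore_index if tok == pad_id else tok for tok in seq]
        for seq in label_ids
    ]

# Made-up label ids for two short responses, padded to length 6
toy_labels = [
    [0, 713, 16, 2, 1, 1],
    [0, 100, 2, 1, 1, 1],
]
masked = mask_padding(toy_labels)
print(masked)
```

Only the real tokens then contribute to the loss, regardless of how much padding a short response carries.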

##### Creating Dataset Objects from Pandas DataFrames

In [None]:
train_dataset = Dataset.from_pandas(train_data)
validation_dataset = Dataset.from_pandas(validation_data)

##### Apply Preprocessing Function to Datasets

In [None]:
train_dataset = train_dataset.map(preprocess_function, batched=True)
validation_dataset = validation_dataset.map(preprocess_function, batched=True) 

##### Creating dataset dictionary

In [None]:
dataset_dict = DatasetDict({
    'train': train_dataset,
    'validation': validation_dataset
})

##### Initializing BART Model for Conditional Generation

In [None]:
model = BartForConditionalGeneration.from_pretrained('facebook/bart-base')

##### Training The Model

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',        # named eval_strategy in newer transformers releases
    save_strategy="epoch",
    learning_rate=3e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    load_best_model_at_end=True,
    metric_for_best_model='eval_loss',  # 'epoch' is not an evaluation metric; lower eval_loss is better
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset_dict['train'],
    eval_dataset=dataset_dict['validation']
)

# Train model
trainer.train()
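For a sense of scale, the training schedule implied by these arguments can be worked out by hand. The sketch assumes a single device and no gradient accumulation, neither of which is stated in the notebook.

```python
import math

train_rows = 1344                 # rows in train_df after the split
per_device_train_batch_size = 8
num_train_epochs = 5
logging_steps = 10

# Optimizer steps per epoch on one device, without gradient accumulation
steps_per_epoch = math.ceil(train_rows / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs
# How many times the training loss is logged
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)
```

Under these assumptions the run takes 840 optimizer steps, with one evaluation and one checkpoint at the end of each of the five epochs.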

##### Saving the trained model into chatbot_model directory

In [389]:
model.save_pretrained("./chatbot_model")
tokenizer.save_pretrained("./chatbot_model")

('./chatbot_model\\tokenizer_config.json',
 './chatbot_model\\special_tokens_map.json',
 './chatbot_model\\vocab.json',
 './chatbot_model\\merges.txt',
 './chatbot_model\\added_tokens.json')


##### Checking the model by passing new inputs

In [None]:
from transformers import BartTokenizer, BartForConditionalGeneration

# Initialize tokenizer and model
tokenizer = BartTokenizer.from_pretrained('./chatbot_model')
model = BartForConditionalGeneration.from_pretrained('./chatbot_model')

def generate_response(query):
    # Tokenize the input query (a single query needs no padding)
    inputs = tokenizer(query, return_tensors="pt", truncation=True, max_length=512)

    # Generate a response; pass the attention mask along with the input ids
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=512,
        num_beams=5,
        early_stopping=True,
    )

    # Decode the output token IDs to text
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

# Example usage
new_query = "How to book an appointment"
response = generate_response(new_query)
print("Generated Response:", response)


##### Evaluating the model

In [None]:
results = trainer.evaluate()
print(results)