# Smart Assist

### Chatbot to support customer service agents to answer customer queries and provide relevant answers. Also it will determine the sentiment and classify the query/

### Model details
* **Model**: Llama 3.2 3B Instruct fine tuned  
* **Framework**: Transformers  
* **Vector database**: ChromaDB

### The data for RAG
* Indigo website
* Air India website

### The data for Fine tunning
* Customer service





# Installations, imports, utils

# Initialize model, tokenizer, query pipeline

Define the model, the device, and the `bitsandbytes` configuration.

Prepare the model and the tokenizer.

Now we define the query pipeline using transformer

We also define here an utility function. This function will be used to display the output from the answer of the LLM.  

We define a function for testing the pipeline.

# Retrieval Augmented Generation

## Ingestion of data using PyPDFLoader

We will ingest the EU AI Act data using the PyPDFLoader from Langchain. There are actually multiple PDF ingestion utilities, we selected this one since it is easy to use.

## Split data in chunks

We split data in chunks using a recursive character text splitter.  

Note: You can experiment with several values of chunk_size and chunk_overlap. Here we will set the following values:
* chunk_size: 1000 (this gives the size of a chunk, in characters).
* chunk_overlap: 100 (this gives the number of characters with which two succesive chunks overlap).  

Chunk overlapping is required in order to be able to keep the context, even if we have a concept that we want to include that is spread over multiple document chunks.


## Creating Embeddings and Storing in Vector Store

Create the embeddings using Sentence Transformer and HuggingFace embeddings.  
Ocasionally, HuggingFace sentence-transformers might not be available. We implement therefore a mechanism to work with local stored sentence transformers.

Initialize ChromaDB with the document splits, the embeddings defined previously and with the option to persist it locally.  
We make sure to use the persistence option for the vector database.

## Initialize chain   

We are using `RetrievalQA` task chain utility from Langchain.  
This will first query the vector database (using similarity search) with the prompt we are using.   
Then, the query and the context retrieved (the documents that match with the query) are used to compose a prompt that instructs the LLM to answer to the query (**Generation**) using the information from the context retrieved (**Retrieval**). Therefore the name of the system, `Retrieval Augmented Generation`.


## Test the Retrieval-Augmented Generation


We define a test function, that will run the query and time it.

Let's check few queries.

## Document sources

Let's check the documents sources, for the last query run.  


# Conclusions


We used Langchain, ChromaDB and Llama3 as a LLM to build a Retrieval Augmented Generation solution. To improve the solution, we will have to refine the RAG implementation, first by optimizing the embeddings, then by using more complex RAG schemes.





In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [2]:
!pip install -q tensorflow
!pip install -q sentence-transformers "unstructured[pdf]" gradio langchain_community langchain-huggingface transformers torch

[?25l     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m0.0/981.5 kB[0m [31m?[0m eta [36m-:--:--[0m[2K     [91m━━━━━━━━━━━━━━━[0m[90m╺[0m[90m━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m368.6/981.5 kB[0m [31m11.2 MB/s[0m eta [36m0:00:01[0m[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m981.5/981.5 kB[0m [31m15.2 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m117.0/117.0 kB[0m [31m10.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.2/42.2 kB[0m [31m3.7 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25l[?25hdone
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m42.0/42.0 kB[0m [31m3.1 MB/s[0m eta [36m0:00:00[0m
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m48.5/48.5 kB[0m 

In [3]:
!pip install chromadb

Collecting chromadb
  Downloading chromadb-0.5.20-py3-none-any.whl.metadata (6.8 kB)
Collecting build>=1.0.3 (from chromadb)
  Downloading build-1.2.2.post1-py3-none-any.whl.metadata (6.5 kB)
Collecting chroma-hnswlib==0.7.6 (from chromadb)
  Downloading chroma_hnswlib-0.7.6-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (252 bytes)
Collecting posthog>=2.4.0 (from chromadb)
  Downloading posthog-3.7.4-py2.py3-none-any.whl.metadata (2.0 kB)
Collecting opentelemetry-exporter-otlp-proto-grpc>=1.2.0 (from chromadb)
  Downloading opentelemetry_exporter_otlp_proto_grpc-1.28.2-py3-none-any.whl.metadata (2.2 kB)
Collecting opentelemetry-instrumentation-fastapi>=0.41b0 (from chromadb)
  Downloading opentelemetry_instrumentation_fastapi-0.49b2-py3-none-any.whl.metadata (2.1 kB)
Collecting pypika>=0.48.9 (from chromadb)
  Downloading PyPika-0.48.9.tar.gz (67 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m67.3/67.3 kB[0m [31m2.5 MB/s[0m eta [36m0:00:00

In [4]:
import sys
import torch
from torch import cuda, bfloat16
import transformers
from time import time
import gradio as gr
from langchain.llms import HuggingFacePipeline
from langchain.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.chains import RetrievalQA
from langchain.vectorstores import Chroma
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline
from google.colab import userdata


In [5]:
hf_token = userdata.get('HF_TOKEN')
print(hf_token)

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


In [6]:
# Device setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

In [7]:
def setup_model():
    # Load model and tokenizer
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-3B-Instruct", token=hf_token)
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.2-3B-Instruct",
        token=hf_token,
        torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
    )

    # Move model to GPU and set to eval mode
    model.to(device)
    model.eval()

    return model, tokenizer

def setup_embeddings():
    model_name = "sentence-transformers/all-mpnet-base-v2"
    model_kwargs = {"device": device}
    return HuggingFaceEmbeddings(model_name=model_name, model_kwargs=model_kwargs)

def setup_pipeline(model, tokenizer):
    text_pipeline = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=512,  # Instead of max_length
        temperature=0.7,
        device_map="auto" if torch.cuda.is_available() else None,
        pad_token_id=tokenizer.eos_token_id,
        eos_token_id=tokenizer.eos_token_id
    )
    return HuggingFacePipeline(pipeline=text_pipeline)

def process_documents(directory_path):
    try:
        loader = DirectoryLoader(
            directory_path,
            glob="**/*.pdf",
            loader_cls=PyPDFLoader
        )
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,
            chunk_overlap=200,
            separators=["\n\n", "\n", ".", "!", "?", ";", ",", " ", ""],
            length_function=len
        )
        splits = text_splitter.split_documents(documents)
        print(f"Processed {len(splits)} document chunks")
        return splits
    except Exception as e:
        print(f"Error processing documents: {e}")
        return []


def init_rag_system(documents, embeddings, llm):
    try:
        # Try to remove existing database
        import shutil
        try:
            shutil.rmtree("travel_agent_db")
            print("Cleared existing database")
        except Exception as e:
            print(f"Note: Could not clear existing database: {e}")

        # Create fresh vectorstore
        vectordb = Chroma.from_documents(
            documents=documents,
            embedding=embeddings,
            persist_directory="travel_agent_db"
        )

        # Configure retriever
        retriever = vectordb.as_retriever(
            search_kwargs={"k": 5}
        )

        qa_chain = RetrievalQA.from_chain_type(
            llm=llm,
            chain_type="stuff",
            retriever=retriever,
            return_source_documents=True,
            verbose=True
        )
        return qa_chain, vectordb
    except Exception as e:
        print(f"Error initializing RAG system: {e}")
        return None, None

def analyze_query(query):
    intents = {
        'booking': ['book', 'reserve', 'purchase', 'buy', 'ticket'],
        'information': ['info', 'know', 'what', 'how', 'when', 'where', 'cost', 'price'],
        'modification': ['change', 'modify', 'update', 'cancel', 'refund'],
        'support': ['help', 'assist', 'problem', 'issue', 'error'],
        'availability': ['available', 'schedule', 'time', 'find']
    }

    query_lower = query.lower()
    detected_intent = 'general inquiry'
    for intent, keywords in intents.items():
        if any(keyword in query_lower for keyword in keywords):
            detected_intent = intent
            break

    negative_words = ['not', 'never', 'problem', 'issue', 'error', 'wrong', 'bad', 'frustrated']
    positive_words = ['good', 'great', 'please', 'thank', 'happy', 'excited']

    sentiment = 'neutral'
    if any(word in query_lower for word in negative_words):
        sentiment = 'negative'
    elif any(word in query_lower for word in positive_words):
        sentiment = 'positive'

    return detected_intent, sentiment

# def process_query(query, context, qa_chain):
#     if not qa_chain:
#         return "System not initialized properly."

#     intent, sentiment = analyze_query(query)

#     augmented_query = f"""Based on the user query: {query}

#     Provide a direct response as a travel agent without including any context or system prompts."""

#     try:
#         start_time = time()
#         response = qa_chain.run(augmented_query)

#         if "Helpful Answer:" in response:
#             cleaned_response = response.split("Helpful Answer:")[-1].strip()
#         else:
#             cleaned_response = response.split("pieces of context")[-1] if "pieces of context" in response else response
#             cleaned_response = cleaned_response.split("Question:")[-1] if "Question:" in response else cleaned_response
#             cleaned_response = cleaned_response.split("Response:")[-1] if "Response:" in response else cleaned_response
#             cleaned_response = cleaned_response.split("travel agent")[-1] if "travel agent" in response else cleaned_response

#         cleaned_response = cleaned_response.replace("Use the following", "").strip()
#         cleaned_response = cleaned_response.split("Processing time")[0].strip()

#         end_time = time()
#         processing_time = round(end_time - start_time, 2)

#         final_response = f"""Intent: {intent.capitalize()}
# Sentiment: {sentiment.capitalize()}
# ---
# Response: {cleaned_response}

# Processing time: {processing_time}s"""

#         return final_response
#     except Exception as e:
#         return f"Error processing query: {str(e)}"

def process_query(query, context, qa_chain):
    if not qa_chain:
        return "System not initialized properly."

    intent, sentiment = analyze_query(query)

    augmented_query = f"""Based on the following query and context, provide an accurate response:
    Query: {query}
    Additional Context: {context if context else 'None provided'}

    Instructions:
    1. If the answer is directly available in the knowledge base, use that information
    2. Ensure the response is accurate to the source documents
    3. Keep the original wording where possible
    4. If multiple relevant pieces of information are found, combine them coherently

    Provide a direct response as a travel agent."""

    try:
        start_time = time()
        # Get response and source documents
        raw_result = qa_chain({"query": augmented_query})
        response = raw_result['result'] if isinstance(raw_result, dict) else raw_result

        cleaned_response = response
        for prefix in ["Context:", "Question:", "Response:", "Helpful Answer:"]:
            if prefix in cleaned_response:
                cleaned_response = cleaned_response.split(prefix)[-1].strip()

        cleaned_response = cleaned_response.replace("Use the following pieces of context", "").strip()
        cleaned_response = cleaned_response.split("Processing time")[0].strip()

        end_time = time()
        processing_time = round(end_time - start_time, 2)

        final_response = f"""Intent: {intent.capitalize()}
Sentiment: {sentiment.capitalize()}
---
Response: {cleaned_response}

Processing time: {processing_time}s"""

        return final_response
    except Exception as e:
        return f"Error processing query: {str(e)}"


def debug_retrieval(query, vectordb):
    docs = vectordb.similarity_search(query, k=5)
    print("\nRetrieved Documents for:", query)
    for i, doc in enumerate(docs, 1):
        print(f"\nDocument {i}:")
        print(f"Content: {doc.page_content[:200]}...")
        print(f"Source: {doc.metadata.get('source', 'Unknown')}")


def create_interface(qa_chain):
    def handle_query(query, context):
        return process_query(query, context, qa_chain)

    interface = gr.Interface(
        fn=handle_query,
        inputs=[
            gr.Textbox(
                label="Your Question",
                placeholder="Ask about travel destinations, bookings, or advice"
            ),
            gr.Textbox(
                label="Additional Context (Optional)",
                placeholder="Add any relevant travel details..."
            )
        ],
        outputs=gr.Textbox(label="Response"),
        title="Travel Agent Assistant",
        description="Ask questions about travel destinations, bookings, or general travel advice!"
    )
    return interface









In [8]:
# Setup model and tokenizer
model, tokenizer = setup_model()
print("Model loaded successfully")

# Setup embeddings
embeddings = setup_embeddings()
print("Embeddings initialized")

# Setup pipeline
llm = setup_pipeline(model, tokenizer)
print("Pipeline created")

tokenizer_config.json:   0%|          | 0.00/54.5k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/9.09M [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/296 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/878 [00:00<?, ?B/s]

model.safetensors.index.json:   0%|          | 0.00/20.9k [00:00<?, ?B/s]

Downloading shards:   0%|          | 0/2 [00:00<?, ?it/s]

model-00001-of-00002.safetensors:   0%|          | 0.00/4.97G [00:00<?, ?B/s]

model-00002-of-00002.safetensors:   0%|          | 0.00/1.46G [00:00<?, ?B/s]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

generation_config.json:   0%|          | 0.00/189 [00:00<?, ?B/s]

Model loaded successfully


modules.json:   0%|          | 0.00/349 [00:00<?, ?B/s]

config_sentence_transformers.json:   0%|          | 0.00/116 [00:00<?, ?B/s]

README.md:   0%|          | 0.00/10.6k [00:00<?, ?B/s]

sentence_bert_config.json:   0%|          | 0.00/53.0 [00:00<?, ?B/s]

config.json:   0%|          | 0.00/571 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/438M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/363 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/239 [00:00<?, ?B/s]

1_Pooling/config.json:   0%|          | 0.00/190 [00:00<?, ?B/s]

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


Embeddings initialized
Pipeline created


  return HuggingFacePipeline(pipeline=text_pipeline)


In [9]:
import shutil
shutil.rmtree("travel_agent_db", ignore_errors=True)
print("Cleared existing database")

Cleared existing database


In [10]:
def main():
    print("Initializing Travel Agent Assistant...")

    try:

        documents = process_documents("/content/drive/MyDrive/main_docs_con")
        if not documents:
            print("No documents processed. Please check your document directory.")
            return
        print(f"Successfully processed {len(documents)} documents")

        # Initialize RAG system with fresh database
        qa_chain, vectordb = init_rag_system(documents, embeddings, llm)
        if not qa_chain:
            print("Failed to initialize RAG system.")
            return
        print("RAG system initialized successfully")

        # Create and launch interface
        interface = create_interface(qa_chain)
        interface.launch(share=True)

    except Exception as e:
        print(f"Error in initialization: {str(e)}")
        return

if __name__ == "__main__":
    main()




Initializing Travel Agent Assistant...
Processed 2242 document chunks
Successfully processed 2242 documents
Note: Could not clear existing database: [Errno 2] No such file or directory: 'travel_agent_db'
RAG system initialized successfully
Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
* Running on public URL: https://050efce9344fa0c94f.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)
