# Final Project Team 6 ChatBot Design - https://manikandan18ramalingam-ai-models.hf.space/

This project builds a working chatbot on top of the Stanford Question Answering Dataset (SQuAD 2.0).
The chatbot accepts an optional context and a user question and responds accordingly. It is an enterprise-style application deployed on Hugging Face Spaces with public access.

The chatbot can be accessed at the following link:

https://manikandan18ramalingam-ai-models.hf.space/


## Download and Pre-process the data

The data used is the Stanford Question Answering Dataset, SQuAD 2.0 for short. We used the SQuAD 2.0 data set from Kaggle (https://www.kaggle.com/datasets/stanfordu/stanford-question-answering-dataset) as the basis for the models used in the chatbot. The data set contains a large corpus of passages on topics ranging from music to physics.
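
For orientation, the sketch below walks a tiny hand-written sample that mimics the SQuAD 2.0 JSON layout: articles sit under `data`, each article has `paragraphs`, and each paragraph holds a `context` plus a list of `qas`, where unanswerable questions carry `is_impossible: true`. The sample text and the helper name `summarize_squad` are illustrative assumptions and not part of the project code.

```python
# Tiny hand-written sample in the SQuAD 2.0 layout (not real dataset content).
sample = {
    "version": "v2.0",
    "data": [
        {
            "title": "Example_Article",
            "paragraphs": [
                {
                    "context": "The piano was invented in Italy around 1700.",
                    "qas": [
                        {
                            "id": "q1",
                            "question": "Where was the piano invented?",
                            "answers": [{"text": "Italy", "answer_start": 26}],
                            "is_impossible": False,
                        },
                        {
                            "id": "q2",
                            "question": "Who invented the violin?",
                            "answers": [],
                            "is_impossible": True,
                        },
                    ],
                }
            ],
        }
    ],
}


def summarize_squad(squad):
    """Count paragraphs, answerable questions and unanswerable questions."""
    counts = {"paragraphs": 0, "answerable": 0, "unanswerable": 0}
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            counts["paragraphs"] += 1
            for qa in paragraph["qas"]:
                key = "unanswerable" if qa.get("is_impossible", False) else "answerable"
                counts[key] += 1
    return counts


print(summarize_squad(sample))
```

Running the same helper on the real `train-v2.0.json` (after `json.load`) should give a quick sense of how many questions have no answer in their passage.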



### Install the required packages

`requirements.txt` contains the following libraries:

transformers
tf-keras
streamlit==1.38.0
langchain-community==0.2.16
langchain-text-splitters==0.2.4
langchain-chroma==0.1.3
langchain-huggingface==0.0.3
langchain-groq==0.1.9
unstructured==0.15.0
unstructured[pdf]==0.15.0
nltk==3.8.1
jq

In [None]:
pip install -r requirements.txt

### Install the streamlit and langchain libraries

In [None]:
!pip install streamlit==1.38.0
!pip install langchain-community==0.2.16
!pip install langchain-text-splitters==0.2.4
!pip install langchain-chroma==0.1.3
!pip install langchain-huggingface==0.0.3
!pip install langchain-groq==0.1.9
!pip install unstructured==0.15.0
!pip install unstructured[pdf]==0.15.0
!pip install nltk==3.8.1
!pip install jq

### Import the necessary libraries

In [1]:

!pip install langchain_huggingface==0.0.3
import os
import langchain_community
import langchain_text_splitters
import langchain_huggingface
import langchain_chroma

from langchain_community.document_loaders import JSONLoader, DirectoryLoader
from langchain_text_splitters import CharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma



### Load Huggingface embeddings

In [2]:
# loading the embedding model
embeddings = HuggingFaceEmbeddings()



### Load the squad 2.0 train data set

Use the `JSONLoader` from LangChain to load the paragraph contexts.

In [4]:
from langchain_community.document_loaders import JSONLoader, DirectoryLoader

loader = JSONLoader(file_path='./train-v2.0.json', text_content=False, jq_schema='.data[].paragraphs[].context')
documents = loader.load()
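
To make the `jq_schema` concrete, here is a minimal plain-Python sketch of what the path `.data[].paragraphs[].context` selects. The function name `extract_contexts` and the two-article sample are hypothetical. The real loader wraps each selected string in a LangChain `Document`, which this sketch does not do.

```python
def extract_contexts(squad):
    """Plain-Python equivalent of the jq path .data[].paragraphs[].context."""
    return [
        paragraph["context"]
        for article in squad["data"]
        for paragraph in article["paragraphs"]
    ]


# Hand-written stand-in for the dataset: two articles, three paragraphs.
sample = {
    "data": [
        {"paragraphs": [{"context": "First passage."}, {"context": "Second passage."}]},
        {"paragraphs": [{"context": "Third passage."}]},
    ]
}

contexts = extract_contexts(sample)
print(len(contexts), contexts[0])
```

Note that this path selects only the passage text, so the questions and answers stored under `qas` are not part of what gets indexed.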

### Split the text data

Use `CharacterTextSplitter` to split the documents into chunks of up to 2000 characters with a 500-character overlap.

In [6]:
from langchain_text_splitters import CharacterTextSplitter

text_splitter = CharacterTextSplitter(chunk_size=2000,
                                      chunk_overlap=500)
text_chunks = text_splitter.split_documents(documents)
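
The effect of `chunk_size` and `chunk_overlap` can be illustrated with a simplified fixed-width sliding window. This is only a sketch of the overlap idea: `CharacterTextSplitter` first splits on a separator (by default a blank line) and then merges pieces up to `chunk_size`, so its real chunk boundaries will differ. The helper name `sliding_chunks` is an assumption.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width windows that overlap by chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
```

Each window repeats the last `chunk_overlap` characters of the previous one, which reduces the chance that a sentence relevant to a question is cut in half at a chunk boundary.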

### Create the text embeddings and store them in a vector database

Use Chroma to store the embeddings of the data set.

In [7]:
vectordb = Chroma.from_documents(
    documents=text_chunks,
    embedding=embeddings,
    persist_directory="vector_db_dir"
)

print("Documents Vectorized")

Documents Vectorized
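
Conceptually, a vector store answers a query by comparing the query embedding against the stored chunk embeddings and returning the closest ones. The sketch below shows that idea with hand-made 3-dimensional vectors and cosine similarity. The vectors, the texts and the helper names are all illustrative assumptions, and Chroma's actual index and distance metric may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Return the k texts whose vectors are most similar to the query vector."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy store of (text, embedding) pairs; real embeddings have hundreds of dimensions.
store = [
    ("passage about music", [0.9, 0.1, 0.0]),
    ("passage about physics", [0.1, 0.9, 0.2]),
    ("passage about history", [0.0, 0.2, 0.9]),
]

print(top_k([0.2, 0.8, 0.1], store, k=1))
```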


### Collect the Groq API key

In [12]:
import os
import json
from langchain_groq import ChatGroq
from langchain.memory import ConversationBufferMemory

# Get the current working directory (for environments where __file__ is not defined)
working_dir = os.getcwd()

# Read the Groq API key from config.json in the working directory
with open(os.path.join(working_dir, "config.json")) as config_file:
    config_data = json.load(config_file)
GROQ_API_KEY = config_data["GROQ_API_KEY"]
os.environ["GROQ_API_KEY"] = GROQ_API_KEY
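
A slightly more defensive variant is sketched below, assuming you may want an already-set environment variable to take precedence over `config.json` (hosted platforms commonly expose secrets as environment variables). The helper `load_api_key`, the precedence order and the `DEMO_KEY` placeholder are assumptions for illustration. The notebook itself reads only `config.json`.

```python
import json
import os
import tempfile


def load_api_key(config_path, key_name="GROQ_API_KEY"):
    """Return the key from the environment if set, else from the JSON config file."""
    value = os.environ.get(key_name)
    if value:
        return value
    with open(config_path, encoding="utf-8") as handle:
        config = json.load(handle)
    if key_name not in config:
        raise KeyError(f"{key_name} missing from {config_path}")
    return config[key_name]


# Demonstration with a throwaway config file and a placeholder value.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "config.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"DEMO_KEY": "placeholder-value"}, handle)
    print(load_api_key(path, key_name="DEMO_KEY"))
```

Raising a clear `KeyError` when the key is absent makes a misconfigured deployment fail at startup instead of on the first LLM call.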

### Load the question-answering model

In [14]:
from transformers import pipeline

# Load the question-answering pipeline
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")

Hardware accelerator e.g. GPU is available in the environment, but no `device` argument is passed to the `Pipeline` object. Model will be on CPU.


### Helper to get an answer from the pipeline

In [15]:
# Function to retrieve answer from the question-answering model
def get_answer(question, context):
    qa_result = qa_pipeline(question=question, context=context)
    return qa_result.get('answer')
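
The question-answering pipeline returns a dictionary with `score`, `start`, `end` and `answer` keys, so the score can be used to decide whether an extracted span is trustworthy. The sketch below applies a cut-off to such a dictionary. The helper `pick_answer`, the `0.3` threshold and the two sample results are hypothetical and would need tuning against real outputs.

```python
def pick_answer(qa_result, min_score=0.3):
    """Return the answer text, or None when the span is empty or low-confidence."""
    answer = (qa_result.get("answer") or "").strip()
    score = qa_result.get("score", 0.0)
    if not answer or score < min_score:
        return None
    return answer


# Hand-written dictionaries shaped like pipeline output (not real model results).
confident = {"score": 0.92, "start": 26, "end": 31, "answer": "Italy"}
uncertain = {"score": 0.04, "start": 0, "end": 3, "answer": "The"}

print(pick_answer(confident), pick_answer(uncertain))
```

A `None` result is a natural signal for the caller to try a different strategy, such as a retrieval-backed LLM.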

### Load the persisted vector store

In [17]:
# Setting up vectorstore for document retrieval
def setup_vectorstore():
    # os.getcwd() is used because __file__ is not defined inside a notebook
    persist_directory = os.path.join(os.getcwd(), "vector_db_dir")
    vectorstore = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    return vectorstore

### Create the Llama chat chain

In [21]:
from langchain.chains import RetrievalQA

# Setting up fallback LLaMA chain
def chat_chain(vectorstore):
    llm = ChatGroq(model="llama-3.1-8b-instant", temperature=0)
    retriever = vectorstore.as_retriever()
    # "stuff" places all retrieved chunks into a single prompt for the LLM
    chain = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=retriever,
        return_source_documents=False,
    )
    return chain
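
The `"stuff"` chain type means that all retrieved chunks are concatenated into one prompt together with the question. A rough sketch of that idea follows. The prompt wording and the helper `build_stuff_prompt` are illustrative assumptions and do not reproduce LangChain's actual template.

```python
def build_stuff_prompt(question, passages):
    """Concatenate all passages into a single prompt followed by the question."""
    context = "\n\n".join(passages)
    return (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_stuff_prompt(
    "Where was the piano invented?",
    ["The piano was invented in Italy.", "A modern piano has 88 keys."],
)
print(prompt)
```

Because every retrieved chunk goes into a single prompt, large chunks and many retrieved documents can exceed the model's context window, which is worth keeping in mind when choosing the chunk size.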

### Create the streamlit code

This creates the text boxes for the context and the user's question.

In [23]:
import streamlit as st

# Streamlit UI setup
st.set_page_config(
    page_title="Question Answering ChatBot",
    page_icon="📚",
    layout="centered"
)

# Title at the top
st.title("📚 Question Answering ChatBot")

# Add a robo-themed GIF (you can replace this URL with any GIF URL you prefer)
# Display centered GIF
st.markdown(
    """
    <div style="display: flex; justify-content: center;">
        <img src="https://media0.giphy.com/media/v1.Y2lkPTc5MGI3NjExZTJxcndrYW1pcnJ4bXpxNWM5eGYwa3J6d3BlNzQ4NnJsaXczYmZ4YyZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/58OujxlE7e19Mjv0gj/giphy.webp" alt="GIF" width="300" height="300">
    </div>
    """, unsafe_allow_html=True
)

# Input area for user-provided context
context_input = st.text_area("Context (optional):", placeholder="Provide context if available")

# "Ask AI" chat input field immediately after the context
user_input = st.text_input("Ask AI:")

# Check for user input
if user_input:
    # If user provides context in the context text area, use it
    if context_input:
        context = context_input
        answer = get_answer(user_input, context)
    else:
        # If no context is provided, fall back to the LLaMA retrieval chain
        llama_chain = chat_chain(setup_vectorstore())
        llama_response = llama_chain.invoke({"query": user_input})
        answer = llama_response["result"]

    # Display the answer
    st.write("Answer:", answer)
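
The routing decision in the script can be pulled out into a plain function so that it can be exercised without Streamlit or any model. The sketch below is one way to do that. `route_question` and the two stand-in callables are hypothetical names, with the stand-ins taking the place of `get_answer` and the retrieval chain.

```python
def route_question(question, context, extractive_qa, retrieval_qa):
    """Use the extractive model when context is given, else the retrieval chain."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    if context and context.strip():
        return "extractive", extractive_qa(question, context)
    return "retrieval", retrieval_qa(question)


# Stand-in callables so the routing can be checked without loading any model.
def fake_extractive(question, context):
    return f"span from: {context}"


def fake_retrieval(question):
    return f"generated for: {question}"


print(route_question("Who?", "Some passage.", fake_extractive, fake_retrieval))
print(route_question("Who?", "", fake_extractive, fake_retrieval))
```

Passing the two answerers in as arguments keeps the function independent of how the models are loaded, which also makes it straightforward to cache them once per session.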



### Run the streamlit code to start the chat bot 

In [26]:
import subprocess

# Path to your Streamlit app script (app.py in the deployed Space)
#streamlit_app_path = "app.py"

# Running the Streamlit app
#subprocess.run(["streamlit", "run", streamlit_app_path])

### Chat bot web access

The Streamlit app is deployed in a personal Hugging Face Space, which provides web access to the chatbot. Because Hugging Face requires the Streamlit app to live in `app.py`, the code is split across multiple files. The Groq API key is a personal one, so it is not shared here for loading the LLM.

The chatbot is publicly accessible at https://manikandan18ramalingam-ai-models.hf.space/
