### Project Overview
In this project, we provide contextual datasets containing internal information about a company, including employee HR details, contracts, products, and company history.

All data provided are fictional and created solely for project purposes.


We will use RAG (Retrieval-Augmented Generation) so that the question-answering assistant grounds its answers in these documents, which improves its accuracy.
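To make the idea concrete before bringing in LangChain, here is a minimal sketch of the RAG flow: find the chunks most similar to the question, then place them in the prompt. The bag-of-words `embed` function is only a toy stand-in for a real embedding model, and the helper names and sample snippets are invented for illustration.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question, chunks, k=2):
    """Return the k chunks most similar to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(question, chunks, k=2):
    """Put the retrieved chunks in front of the question."""
    context = "\n".join(retrieve(question, chunks, k))
    return f"Use the following context to answer the question.\n{context}\n\nQuestion: {question}"

# Hypothetical snippets, invented for illustration only
toy_chunks = [
    "Carllm is a product for auto insurance.",
    "Homellm is a product for home insurance.",
    "The office kitchen is cleaned on Fridays.",
]
prompt = build_prompt("Which product covers auto insurance?", toy_chunks, k=1)
print(prompt)
```

In the notebook itself, the embedding model, the similarity search, and the prompt assembly are handled by OpenAI embeddings, the vector store, and the LangChain chain respectively.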

In [54]:
# imports

import os
import glob
from dotenv import load_dotenv
import gradio as gr
import zipfile


In [56]:
#!pip install --upgrade chromadb pydantic


In [142]:
# imports for langchain and Chroma and plotly

# Imports from LangChain
from langchain.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.schema import Document

# Imports from LangChain's OpenAI embeddings and LLM wrapper
from langchain.chat_models import ChatOpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationalRetrievalChain


# Imports from Chroma and additional libraries
from langchain.vectorstores import Chroma
import numpy as np
from sklearn.manifold import TSNE
import plotly.graph_objects as go

# Ensure langchain, langchain-openai, chromadb, faiss-cpu, plotly, and scikit-learn are installed and up to date


In [60]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [62]:
# Load the API key from a .env file instead of hardcoding it in the notebook
load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')


In [113]:
zip_file_path = "knowledge-base.zip"

# Create a temporary directory to extract the zip contents
temp_dir = "knowledge_base"
os.makedirs(temp_dir, exist_ok=True)

# Extract the ZIP file
with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(temp_dir)

# Load all Markdown files from the extracted directory
folders = glob.glob(f"{temp_dir}/*")

# With thanks to CG and Jon R, students on the course, for this fix needed for some users 
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

In [115]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)

Created a chunk of size 1088, which is longer than the specified 1000
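The warning above appears because `CharacterTextSplitter` only cuts at its separator (a blank line by default), so a single paragraph longer than `chunk_size` is kept whole. The sketch below is a simplified illustration of that behaviour, not LangChain's implementation: it merges separator-delimited pieces greedily and ignores `chunk_overlap`. The function name and sample text are assumptions made for the example.

```python
def split_on_separator(text, chunk_size, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most
    chunk_size characters. A single piece longer than chunk_size is
    emitted whole, because there is no separator inside it to cut at."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks

# Four paragraphs of 30, 120, 40 and 40 characters, with a limit of 100
toy_text = "\n\n".join(["a" * 30, "b" * 120, "c" * 40, "d" * 40])
print([len(c) for c in split_on_separator(toy_text, chunk_size=100)])
# [30, 120, 82]
```

The 120-character paragraph exceeds the limit but stays in one chunk, which mirrors the 1088-character chunk reported above.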


In [117]:
len(chunks)


123

In [119]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: knowledge-base


# Vector Database - FAISS (Facebook AI Similarity Search)

In [122]:
# Put the chunks of data into a Vector Store that associates a Vector Embedding with each chunk

embeddings = OpenAIEmbeddings()

In [124]:
# Left over from a Chroma-based version of this notebook: if a Chroma datastore exists on disk, delete its collection.
# The FAISS store created below is built in memory and does not depend on it.

if os.path.exists(db_name):
    Chroma(persist_directory=db_name, embedding_function=embeddings).delete_collection()

In [126]:
# Create our FAISS vectorstore!
from langchain.vectorstores import FAISS

vectorstore = FAISS.from_documents(chunks, embedding=embeddings)
total_vectors = vectorstore.index.ntotal
dimensions = vectorstore.index.d

print(f"There are {total_vectors} vectors with {dimensions:,} dimensions in the vector store")

There are 123 vectors with 1,536 dimensions in the vector store
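Conceptually, the vector store answers a query by finding the stored vectors closest to the query's embedding. The sketch below shows an exact, brute-force version of that search using Euclidean (L2) distance on tiny 2-dimensional vectors. It is an illustration only: the function names and toy vectors are invented, and a real FAISS index performs the same kind of search far more efficiently over the 1,536-dimensional embeddings.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, vectors, k=4):
    """Return the indices of the k stored vectors closest to query_vec."""
    ranked = sorted(range(len(vectors)), key=lambda i: l2_distance(query_vec, vectors[i]))
    return ranked[:k]

# Toy 2-dimensional "embeddings", invented for illustration
toy_vectors = [[0.0, 1.0], [1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
print(nearest([1.0, 0.0], toy_vectors, k=2))
# [1, 2]
```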


In [144]:
# create a new Chat with OpenAI
llm = ChatOpenAI(temperature=0.7, model_name=MODEL)

# set up the conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# the retriever is an abstraction over the VectorStore that will be used during RAG
retriever = vectorstore.as_retriever()

# putting it together: set up the conversation chain with the gpt-4o-mini LLM, the vector store and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)
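As a rough mental model of what the chain does on each turn: if there is chat history, the follow-up question is first rewritten into a standalone question, then relevant chunks are retrieved for it, and finally the LLM answers using those chunks. The sketch below mimics that flow with plain functions. All names here (`answer_with_history` and the `toy_*` stubs) are hypothetical stand-ins, not LangChain APIs, and the real chain uses LLM calls where the stubs use string formatting.

```python
def answer_with_history(question, history, rewrite, retrieve, generate):
    """One conversational RAG turn: rewrite (if needed), retrieve, generate, remember."""
    standalone = rewrite(question, history) if history else question
    context = retrieve(standalone)
    answer = generate(standalone, context)
    history.append((question, answer))
    return answer

calls = []

def toy_rewrite(question, history):
    """Stub for the LLM call that turns a follow-up into a standalone question."""
    calls.append("rewrite")
    last_question, _ = history[-1]
    return f"{question} (follow-up to: {last_question})"

def toy_retrieve(query):
    """Stub for the retriever."""
    return [f"chunk about: {query}"]

def toy_generate(query, context):
    """Stub for the answering LLM call."""
    return f"answer to '{query}' using {len(context)} chunk(s)"

history = []
first = answer_with_history("Who founded the company?", history, toy_rewrite, toy_retrieve, toy_generate)
second = answer_with_history("When?", history, toy_rewrite, toy_retrieve, toy_generate)
print(first)
print(second)
```

The rewrite step is why a short follow-up such as "When?" can still retrieve useful chunks: the retriever sees a question that carries the earlier context.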

In [146]:
query = "Can you describe Insurellm in a few sentences"
result = conversation_chain.invoke({"question":query})
print(result["answer"])

Insurellm is an innovative insurance tech startup founded by Avery Lancaster in 2015, aimed at disrupting the insurance industry with advanced products. With 200 employees and 12 offices across the US, Insurellm offers four key software products: Carllm for auto insurance, Homellm for home insurance, Rellm for reinsurance, and Marketllm, a marketplace connecting consumers with insurance providers. The company has rapidly expanded and currently serves over 300 clients worldwide.


# Now we will bring this up in Gradio using the Chat interface


In [150]:
# Wrapping that in a function

def chat(message, history):
    result = conversation_chain.invoke({"question": message})
    return result["answer"]

In [152]:
# And in Gradio:

view = gr.ChatInterface(chat).launch()



* Running on local URL:  http://127.0.0.1:7884

To create a public link, set `share=True` in `launch()`.


# However, the assistant is not perfect yet

In [155]:
# Let's investigate what gets sent behind the scenes

from langchain_core.callbacks import StdOutCallbackHandler

llm = ChatOpenAI(temperature=0.7, model_name=MODEL)

memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

retriever = vectorstore.as_retriever()

conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory, callbacks=[StdOutCallbackHandler()])

query = "Who received the prestigious IIOTY award in 2023?"
result = conversation_chain.invoke({"question": query})
answer = result["answer"]
print("\nAnswer:", answer)



> Entering new ConversationalRetrievalChain chain...


> Entering new StuffDocumentsChain chain...


> Entering new LLMChain chain...
Prompt after formatting:
System: Use the following pieces of context to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
- **2022**: **Satisfactory**  
  Avery focused on rebuilding team dynamics and addressing employee concerns, leading to overall improvement despite a saturated market.  

- **2023**: **Exceeds Expectations**  
  Market leadership was regained with innovative approaches to personalized insurance solutions. Avery is now recognized in industry publications as a leading voice in Insurance Tech innovation.

## Annual Performance History
- **2020:**  
  - Completed onboarding successfully.  
  - Met expectations in delivering project milestones.  
  - Received positive feedback from the team leads.

- **2021:**  
  

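### For this question, the LLM answered that it doesn't know, even though the answer is in the knowledge base

The chunk containing the answer was not among the few chunks retrieved for the prompt, so the model had nothing to base an answer on.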

In [167]:
# create a new Chat with OpenAI
llm = ChatOpenAI(temperature=0.7, model_name=MODEL)

# set up the conversation memory for the chat
memory = ConversationBufferMemory(memory_key='chat_history', return_messages=True)

# the retriever is an abstraction over the VectorStore that will be used during RAG;
# k is the number of nearest chunks to retrieve for each query (LangChain's default is 4)
retriever = vectorstore.as_retriever(search_kwargs={"k": 20})

# putting it together: set up the conversation chain with the gpt-4o-mini LLM, the vector store and memory
conversation_chain = ConversationalRetrievalChain.from_llm(llm=llm, retriever=retriever, memory=memory)

In [168]:
query = "Who received the prestigious IIOTY award in 2023?"
result = conversation_chain.invoke({"question": query})
answer = result["answer"]
print("\nAnswer:", answer)


Answer: Maxine received the prestigious IIOTY 2023 award.


## This time, with more chunks retrieved (k=20), the LLM answered the question correctly
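The effect of `k` can be illustrated with a toy ranking. In the sketch below, the chunk holding the answer has a slightly lower similarity score than several performance-review chunks, so it is cut off at a small `k` and included at a larger one. The chunk names and scores are made up for illustration and are not taken from the actual vector store.

```python
def top_k(scores, k):
    """Return the ids of the k highest-scoring chunks."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Hypothetical similarity scores; "award_chunk" stands for the chunk that holds the answer
toy_scores = {
    "review_2022": 0.83,
    "review_2023": 0.82,
    "history_2020": 0.80,
    "history_2021": 0.79,
    "careers": 0.78,
    "award_chunk": 0.77,
    "contracts": 0.50,
}
print("award_chunk" in top_k(toy_scores, 4))   # False
print("award_chunk" in top_k(toy_scores, 20))  # True
```

A larger `k` trades a longer, more expensive prompt for a better chance that the relevant chunk is included.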