## Expert Knowledge Worker

### A question answering agent that is an expert knowledge worker
### To be used by employees of Insurellm, an Insurance Tech company
### The agent needs to be accurate and the solution should be low cost.

This project uses RAG (Retrieval-Augmented Generation) to improve the accuracy of our question-answering assistant.
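
As a rough illustration of the idea (not the implementation built in this notebook), RAG retrieves the passages most relevant to a question and places them in the prompt sent to the model. The sketch below substitutes a toy word-overlap score for a real vector search; `tokenize`, `retrieve`, `build_prompt`, and the two sample passages are hypothetical names and data chosen for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and return its alphanumeric words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, passages, k=1):
    """Return the k passages sharing the most words with the question (toy scoring)."""
    q_words = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(q_words & tokenize(p)), reverse=True)
    return ranked[:k]

def build_prompt(question, passages, k=1):
    """Place the retrieved passages ahead of the question as context."""
    context = "\n".join(retrieve(question, passages, k))
    return f"Use this context to answer.\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical mini knowledge base (assumed sample sentences, for illustration only)
passages = [
    "Avery Lancaster co-founded Insurellm in 2015 and is its CEO.",
    "The Basic Listing Fee on Markellm is $199 per month.",
]

prompt = build_prompt("Who co-founded Insurellm?", passages)
print(prompt)
```

A real pipeline would replace the word-overlap score with embedding similarity over a vector store, but the shape stays the same: retrieve first, then prompt.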

In [8]:
from google.colab import drive
drive.mount('/content/drive')
!pip install gradio
!git clone https://github.com/skchandrappa/llm_engineering.git

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Collecting gradio
  Downloading gradio-5.6.0-py3-none-any.whl.metadata (16 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.5-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.4.3 (from gradio)
  Downloading gradio_client-1.4.3-py3-none-any.whl.metadata (7.1 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart==0.0.12 (from gradio)
  Downloading python_multipart-0.0.12-py3-none-any.whl.metadata (1.

In [13]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.8-py3-none-any.whl.metadata (2.9 kB)
Collecting SQLAlchemy<2.0.36,>=1.4 (from langchain-community)
  Downloading SQLAlchemy-2.0.35-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (9.6 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.8 (from langchain-community)
  Downloading langchain-0.3.8-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.21 (from langchain-community)
  Downloading langchain_core-0.3.21-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.6.1-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from datac

In [9]:
# imports

import os
import glob
# from dotenv import load_dotenv
import gradio as gr

fatal: destination path 'llm_engineering' already exists and is not an empty directory.


In [6]:
import os

file_path = os.path.join('/content/llm_engineering/week5/knowledge-base/company', 'about.md')
with open(file_path, 'r', encoding='utf-8') as f:  # explicit encoding avoids platform-dependent defaults
    content = f.read()

In [14]:
# imports for langchain

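# The loaders live in langchain_community; the old langchain.document_loaders path is deprecated
from langchain_community.document_loaders import DirectoryLoader, TextLoader
from langchain.text_splitter import CharacterTextSplitter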

In [None]:
# price is a factor for our company, so we're going to use a low cost model

MODEL = "gpt-4o-mini"
db_name = "vector_db"

In [None]:
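# Load environment variables from a file called .env
# load_dotenv comes from the python-dotenv package (pip install python-dotenv if it is missing)

from dotenv import load_dotenv

load_dotenv()
os.environ['OPENAI_API_KEY'] = os.getenv('OPENAI_API_KEY', 'your-key-if-not-using-env')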

In [18]:
# Read in documents using LangChain's loaders
# Take everything in all the sub-folders of our knowledge base

folders = glob.glob("/content/llm_engineering/week5/knowledge-base/*")
print(folders)
# With thanks to CG and Jon R, students on the course, for this fix needed for some users
text_loader_kwargs = {'encoding': 'utf-8'}
# If that doesn't work, some Windows users might need to uncomment the next line instead
# text_loader_kwargs={'autodetect_encoding': True}

documents = []
for folder in folders:
    doc_type = os.path.basename(folder)
    loader = DirectoryLoader(folder, glob="**/*.md", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
    folder_docs = loader.load()
    for doc in folder_docs:
        doc.metadata["doc_type"] = doc_type
        documents.append(doc)

['/content/llm_engineering/week5/knowledge-base/employees', '/content/llm_engineering/week5/knowledge-base/products', '/content/llm_engineering/week5/knowledge-base/company', '/content/llm_engineering/week5/knowledge-base/contracts']
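
To make the loading loop concrete without LangChain, here is a minimal stdlib-only sketch of the same idea: walk each sub-folder, read every Markdown file, and tag it with the folder name as `doc_type`. The `load_markdown_docs` helper and the plain-dict layout are assumptions for illustration, not what `DirectoryLoader` produces internally.

```python
import glob
import os

def load_markdown_docs(base_dir):
    """Read every .md file under each sub-folder of base_dir, tagging it with the folder name."""
    docs = []
    for folder in sorted(glob.glob(os.path.join(base_dir, "*"))):
        if not os.path.isdir(folder):
            continue  # skip stray files at the top level
        doc_type = os.path.basename(folder)
        pattern = os.path.join(folder, "**", "*.md")
        for path in sorted(glob.glob(pattern, recursive=True)):
            with open(path, "r", encoding="utf-8") as f:
                docs.append({
                    "page_content": f.read(),
                    "metadata": {"source": path, "doc_type": doc_type},
                })
    return docs

# Same base path as the cell above; yields an empty list if the folder is absent
docs = load_markdown_docs("/content/llm_engineering/week5/knowledge-base")
print(len(docs))
```

Tagging each document with its folder name at load time is what later allows chunks to be grouped or filtered by `doc_type`.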


In [19]:
len(documents)

31

In [20]:
documents[24]

Document(metadata={'source': '/content/llm_engineering/week5/knowledge-base/contracts/Contract with BrightWay Solutions for Markellm.md', 'doc_type': 'contracts'}, page_content='# Contract with BrightWay Solutions for Markellm\n\n**Contract Date:** October 5, 2023  \n**Contract ID:** INS-2023-0092\n\n### Terms\nThis contract (“Contract”) is made between Insurellm, a company incorporated in the United States, and BrightWay Solutions, a technology provider specializing in insurance services.\n\n1. **Scope of Services:**  \n   Insurellm shall provide BrightWay Solutions access to the Markellm platform under the agreed pricing structure for a duration of one year from the effective date.\n\n2. **Payment Terms:**  \n   BrightWay Solutions agrees to pay an initial setup fee of $1,000 for integration services, followed by the Basic Listing Fee of $199 per month for featured listing on Markellm. Payment shall be made within 30 days of invoice.\n\n3. **Service Level Agreement (SLA):**  \n   Ins

In [21]:
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = text_splitter.split_documents(documents)
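
To illustrate what `chunk_size` and `chunk_overlap` mean, here is a simplified sketch that slices text into fixed-size character windows, each starting `chunk_size - chunk_overlap` characters after the previous one. This is a deliberate simplification: `CharacterTextSplitter` splits on a separator first and then merges the pieces, so its chunk boundaries will differ. `sliding_chunks` is a hypothetical helper, not part of LangChain.

```python
def sliding_chunks(text, chunk_size=1000, chunk_overlap=200):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = sliding_chunks(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# → ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.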



In [22]:
len(chunks)

123

In [23]:
chunks[6]

Document(metadata={'source': '/content/llm_engineering/week5/knowledge-base/employees/Oliver Spencer.md', 'doc_type': 'employees'}, page_content='## Compensation History\n- **March 2018**: Initial salary of $80,000.\n- **July 2019**: Salary increased to $90,000 post-promotion.\n- **June 2021**: Salary raised to $105,000 after role transition.\n- **September 2022**: Salary adjustment to $120,000 due to increased responsibilities and performance.\n- **January 2023**: Revised salary of $125,000 in recognition of mentorship role.\n\n## Other HR Notes\n- Oliver enjoys a strong rapport with team members and is known for organizing regular team-building activities.\n- Participated in Insurellm’s Hackathon in 2022, where he led a project that won “Best Overall Solution.” \n- Pursuing AWS Certified Solutions Architect certification to enhance cloud skillset.\n- Has expressed interest in further leadership opportunities within Insurellm and may consider project management roles in the future.')

In [24]:
doc_types = set(chunk.metadata['doc_type'] for chunk in chunks)
print(f"Document types found: {', '.join(doc_types)}")

Document types found: employees, products, company, contracts
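
Beyond listing the distinct types, it can help to see how many chunks each type contributes. The following is a minimal sketch using `collections.Counter`; the `FakeChunk` stand-in and its sample values are hypothetical so that the snippet runs on its own, and in the notebook the real `chunks` list would be passed to `count_by_doc_type` instead.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class FakeChunk:
    """Hypothetical stand-in exposing a .metadata dict like a LangChain Document."""
    metadata: dict = field(default_factory=dict)

def count_by_doc_type(chunks):
    """Count how many chunks carry each doc_type value in their metadata."""
    return Counter(chunk.metadata.get("doc_type", "unknown") for chunk in chunks)

# Made-up sample data, not the notebook's actual chunk counts
demo_chunks = [
    FakeChunk({"doc_type": "employees"}),
    FakeChunk({"doc_type": "employees"}),
    FakeChunk({"doc_type": "contracts"}),
]
counts = count_by_doc_type(demo_chunks)
for doc_type, n in sorted(counts.items()):
    print(f"{doc_type}: {n}")
```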


In [25]:
for chunk in chunks:
    if 'CEO' in chunk.page_content:
        print(chunk)
        print("_________")

page_content='# Avery Lancaster

## Summary
- **Date of Birth**: March 15, 1985  
- **Job Title**: Co-Founder & Chief Executive Officer (CEO)  
- **Location**: San Francisco, California  

## Insurellm Career Progression
- **2015 - Present**: Co-Founder & CEO  
  Avery Lancaster co-founded Insurellm in 2015 and has since guided the company to its current position as a leading Insurance Tech provider. Avery is known for her innovative leadership strategies and risk management expertise that have catapulted the company into the mainstream insurance market.  

- **2013 - 2015**: Senior Product Manager at Innovate Insurance Solutions  
  Before launching Insurellm, Avery was a leading Senior Product Manager at Innovate Insurance Solutions, where she developed groundbreaking insurance products aimed at the tech sector.' metadata={'source': '/content/llm_engineering/week5/knowledge-base/employees/Avery Lancaster.md', 'doc_type': 'employees'}
_________
page_content='## Support

1. **Customer 