## Chat with a Database of Documents
### Goals
1. Build a vector store (a database of embedded documents) ahead of time from files such as syllabi and class notes.
2. Query that database with various models (OpenAI and open-source) on a variety of questions.

In [2]:
!pip install -q faiss-cpu langchain sentence-transformers python-dotenv pypdf openai unstructured markdown


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.3.1[0m[39;49m -> [0m[32;49m23.3.2[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m


In [3]:
# Use dotenv to load the environment variables from .env in ipynb
%load_ext dotenv
%dotenv

In [4]:
import os
openai_key = os.environ['OPEN_AI_KEY']
hf_key = os.environ['HUGGINGFACEHUB_API_TOKEN']
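
If either variable is missing from `.env`, `os.environ[...]` raises a bare `KeyError`. A minimal sketch of a friendlier check follows; `require_env` is a hypothetical helper (not part of `python-dotenv`), and `EXAMPLE_API_KEY` is a placeholder name used only so the sketch runs without real credentials.

```python
import os

def require_env(name: str) -> str:
    """Return the value of an environment variable or fail with a clear message."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name!r} is not set; add it to your .env file."
        )
    return value

# Placeholder variable (an assumption for this demo, not one of the real keys above).
os.environ["EXAMPLE_API_KEY"] = "placeholder"
example_key = require_env("EXAMPLE_API_KEY")
```

In the notebook this would be called as `require_env('OPEN_AI_KEY')` and `require_env('HUGGINGFACEHUB_API_TOKEN')`.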

In [8]:
from langchain.document_loaders import PyPDFLoader

# Load PDFs
syllabi_pdfs = [
    PyPDFLoader("/workspaces/data-projects/langchain-chainlit/docs/rbain-syllabus-phys1-f23.pdf"),
    #PyPDFLoader('/workspaces/data-projects/langchain-chainlit/docs/cbain-syllabus-cs111-f23.pdf'),
    PyPDFLoader('/workspaces/data-projects/langchain-chainlit/docs/open-stax-ch3.pdf'),
]
docs = []
for syllabus in syllabi_pdfs:
    docs.extend(syllabus.load())

In [9]:
# Split
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1000,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(docs)
len(splits)

170
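
To see what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-width sliding window. It is only an illustration: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, so its chunk boundaries will differ. `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-width chunks; consecutive chunks share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.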

In [10]:
# Explore splits and do some preprocessing
import re
def preprocess_text(text):
    # Remove unnecessary newline characters
    text = re.sub(r'(?<=[0-9a-zA-ZIVXLCDMivxlcdm.])\n', ' ', text)
    text = re.sub(r'(?<=[.!?])\n', ' ', text)
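    # Split on periods and re-join with spaces. Note: a period is re-added only after
    # odd-indexed segments, so roughly every other period is dropped (see the output below).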
    lines = text.split('.')
    cleaned_lines = [lines[i] + '.' if i % 2 != 0 else lines[i] for i in range(len(lines))]
    text = ' '.join(cleaned_lines)

    # Remove unnecessary whitespace
    text = ' '.join(text.split())

    return text
print(preprocess_text(splits[0].page_content))

I Course Description 1. Course Summary a PHY 161/PHYS 215 General Physics I is an algebra-based introduction to mechanics, thermodynamics, and waves. Topics include motion in one and two dimensions, Newton’s laws of motion, equilibrium, work, energy, momentum, rotational motion, gravity, heat, waves, and sound Examples from medicine and biology will be included whenever possible. 2 College Credit Hours (Dual-Enrollment) a. This course is dual enrolled with Francis Marion University (FMU) and taught by a GSSM instructor Students will each have a FMU transcript with their overall grade earned in this course. Students may earn up 4 college credit hours depending on their grade and the transfer policies of their college/university Refer to the Dual Enrollment FAQ in the Course Catalog for more information. 3 Learning Outcomes a. Upon completion of this course, students will be able to: b Apply the laws of classical Newtonian mechanics (motion, force, energy, momentum, and.
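
The printed text shows a side effect of the split-and-rejoin step: some sentence-ending periods are gone (for example `sound Examples`). A minimal alternative sketch, which is not the function used in the rest of this notebook, collapses whitespace and leaves punctuation alone. `collapse_whitespace` is a hypothetical name and the sample string is a shortened stand-in for the raw page text.

```python
import re

def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace (including newlines) with a single space."""
    return re.sub(r"\s+", " ", text).strip()

raw = "Topics\ninclude\nheat,\nwaves,\nand\nsound.\nExamples\nfrom\nmedicine \nand\nbiology."
cleaned = collapse_whitespace(raw)
print(cleaned)
```

Like the original function, this also joins headings and list markers into the running text; it only avoids losing punctuation.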


In [11]:
# Apply to all splits
print(f'Before --> {splits[0].page_content}')
for split in splits:
    split.page_content = preprocess_text(split.page_content)

print(f'After --> {splits[0].page_content}')   

Before --> I.
Course
Description
1.
Course
Summary
a.
PHY
161/PHYS
215
General
Physics
I
is
an
algebra-based
introduction
to
mechanics, 
thermodynamics,
and
waves.
Topics
include
motion
in
one
and
two
dimensions, 
Newton’s
laws
of
motion,
equilibrium,
work,
energy,
momentum,
rotational
motion, 
gravity,
heat,
waves,
and
sound.
Examples
from
medicine
and
biology
will
be 
included
whenever
possible.
2.
College
Credit
Hours
(Dual-Enrollment)
a.
This
course
is
dual
enrolled
with
Francis
Marion
University
(FMU)
and
taught
by
a 
GSSM
instructor.
Students
will
each
have
a
FMU
transcript
with
their
overall
grade 
earned
in
this
course.
Students
may
earn
up
4
college
credit
hours
depending
on 
their
grade
and
the
transfer
policies
of
their
college/university.
Refer
to
the
Dual 
Enrollment
FAQ
in
the
Course
Catalog
for
more
information.
3.
Learning
Outcomes
a.
Upon
completion
of
this
course,
students
will
be
able
to: 
b.
Apply
the
laws
of
classical
Newtonian
mechanics
(motion,
force,
energy, 
mo

In [6]:
'''
from langchain.text_splitter import MarkdownTextSplitter
# Try importing markdown as plain text
from langchain.document_loaders import TextLoader

txt_loader = TextLoader("/workspaces/data-projects/langchain-chainlit/docs/cbain-syllabus-c111-f23.md")
markdown_doc = txt_loader.load()

markdown_splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=0)
md_splits = markdown_splitter.create_documents([markdown_doc[0].page_content])
#type(markdown_doc[0].page_content)
print(md_splits)
'''

[Document(page_content='---\nlayout: two-column\ntitle: Syllabus\npermalink: /syllabus/\n---'), Document(page_content='<table style="border-top: solid 1px #444; border-bottom: solid 1px #444; border-collapse: collapse;'), Document(page_content='width: 100%; margin-bottom: 30px;">'), Document(page_content='<caption>Summary Details for COMP_SCI 111</caption>\n  <tbody>\n      <tr>'), Document(page_content='<th scope="row" style="text-align: left; font-size: 0.9em; line-height: 1.5em;'), Document(page_content="margin-top: 5px; vertical-align: top; font-family: 'Akkurat Pro Light'; border-bottom: dotted 1px"), Document(page_content='#999; padding: 8px; min-width: 80px;"><strong style="font-family: \'Akkurat Pro'), Document(page_content='Regular\';">Term</strong></th>'), Document(page_content='<td style="font-size: 0.9em; line-height: 1.5em; margin-top: 5px; vertical-align: top;'), Document(page_content="font-family: 'Akkurat Pro Light'; border-bottom: dotted 1px #999; padding: 8px; min-wid

In [12]:
# Create embeddings
from langchain.embeddings import HuggingFaceEmbeddings
model_id = 'sentence-transformers/all-MiniLM-L6-v2'
embedding = HuggingFaceEmbeddings(model_name=model_id)



In [13]:
# Build a FAISS vector store from the embedded splits
from langchain.vectorstores import FAISS

db = FAISS.from_documents(
    documents = splits,
    embedding= embedding,
)

# Save this locally since it's just 2 documents
db.save_local("./FAISS/faiss_index")

In [9]:
'''
# Test web stuff
md_db = FAISS.from_documents(
    documents = md_splits,
    embedding=embedding
)
md_db.save_local('./FAISS/')
'''

In [14]:
# Query the db
query = 'What is the title of this chapter of the textbook?'
print(db.similarity_search(query)[1].page_content)

along a diff erent set o f axes—one r otat ed 22. A farmer w ants t o fence off a f our-sided plot o f flat land The y measur e the prs t thr ee sides , sho wn asandinFigure 3. 57 , and then c orrectly calculat e the length and orientation o f the f ourth side What is their r esul t? FIGURE 3. 5723 In an at temp t to escape his island, Gil ligan builds a raft and sets t o sea . The wind shifts a gr eat deal during the da y, and he is blo wn along the f ollowing straight lines:north of west; thensouth o f eas t; thensouth o f west; thenstraight eas t; theneast of nor th; thensouth o f west; and pnal lynorth of east What is his pnal position r elativ e to the island? 24. Suppose a pilot fliesin a dir ectionnorth of eas t and then fliesin a dir ectionnorth of eas t as sho wn in Figure 3 58 . Find her total dis tanc efrom the s tarting point and the directionof the s traight -line path t o the pnal position Discus s qualitativ ely ho w this flight.
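
`similarity_search` embeds the query and returns the stored chunks whose vectors are closest to it, nearest first, so `[1]` above is the second-ranked match rather than the top one. A toy sketch of that ranking follows, assuming Euclidean (L2) distance as the metric. The 3-dimensional vectors and labels are invented for illustration; real embeddings from `all-MiniLM-L6-v2` have far more dimensions.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest(query_vec, store, k=2):
    """Return the k labels whose vectors are closest to query_vec, nearest first."""
    ranked = sorted(store, key=lambda item: l2_distance(query_vec, item[1]))
    return [label for label, _ in ranked[:k]]

# Hypothetical store of (label, vector) pairs.
toy_store = [
    ("chapter title page", [0.9, 0.1, 0.0]),
    ("end-of-chapter problems", [0.6, 0.4, 0.1]),
    ("syllabus grading policy", [0.0, 0.2, 0.9]),
]
top_two = nearest([1.0, 0.0, 0.0], toy_store, k=2)
print(top_two)
```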


Now we will declare an LLM and a Q&A chain to query the database.

In [15]:
from langchain.llms import HuggingFaceHub
from langchain.chat_models import ChatOpenAI

repo_id = "tiiuae/falcon-7b-instruct"
#repo_id = 'tiiuae/falcon-40b-instruct'
falcon_llm = HuggingFaceHub(
    huggingfacehub_api_token=hf_key, 
    repo_id=repo_id, 
    model_kwargs={"temperature":0.01}
)
openai_llm = ChatOpenAI(
    model_name='gpt-3.5-turbo',
    openai_api_key=openai_key,
    temperature=0,
    streaming=True
)



In [16]:
# Build a prompt that tells the chain to use the context to answer a question
from langchain.prompts import PromptTemplate

template = """Use the following context to answer the question at the end. 
If you don't know the answer, just say that you don't know, don't try to make up an answer. 
Use three sentences maximum. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""

prompt = PromptTemplate.from_template(template)
#print(prompt.format(context='Reggie Bain is the CEO of Apple', question='Who is the CEO of apple?'))
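
The template has two placeholders, `{context}` and `{question}`, that are filled in before the text is sent to the model. The sketch below uses plain `str.format` as a stand-in for `PromptTemplate.format` so that it runs without LangChain; the context sentence is made up for the demo.

```python
template = """Use the following context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""

# Fill both placeholders; the context here is an invented example.
filled = template.format(
    context="Displacement is the change in position of an object.",
    question="What is the definition of displacement?",
)
print(filled)
```

The prompt ends with `Helpful Answer:` so that the model's completion continues directly with the answer.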

In [17]:
# Declare a retriever to get data from our vector store
# If the index was saved locally, load it. Pass the index directory, not a file inside it.
# Newer LangChain releases also require allow_dangerous_deserialization=True in this call.
new_db = FAISS.load_local("/workspaces/data-projects/langchain-chainlit/FAISS/faiss_index/", embedding)
retriever = new_db.as_retriever()

In [15]:
# Testing the retriever
print(retriever.invoke('List the kinematic equations')[1].page_content)

are denot ed with a subscrip t 0, as usual .
Step 2. Treat the motion as tw o independent one -dimensional motions , one horiz ontal and the other v ertical .The
kinematic equations f or horiz ontal and v ertical motion tak e the f ollowing f orms:Review of Kinematic E quations ( constant)3.283.293.303.313.32
3.333.343.353.363.373.383.39118 3 • T wo-Dimensional Kinematics
Access f or free at opens tax.org


In [18]:
# Declare memory object to track history of conversation
from langchain.memory import ConversationBufferMemory

memory = ConversationBufferMemory(
    memory_key='chat_history',
    return_messages=True
)
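
Conceptually, a buffer memory just keeps every turn of the conversation in order so later questions can refer back to earlier ones. A minimal sketch of that idea follows; `ToyBufferMemory` is a hypothetical stand-in, not the LangChain class, and the questions and answers are placeholders.

```python
class ToyBufferMemory:
    """Hypothetical stand-in that stores every conversation turn in order."""

    def __init__(self):
        self.messages = []

    def save_turn(self, question, answer):
        # Store the human question followed by the model's reply.
        self.messages.append(("human", question))
        self.messages.append(("ai", answer))

    def clear(self):
        # Forget the whole conversation.
        self.messages = []

toy_memory = ToyBufferMemory()
toy_memory.save_turn("What are the exam dates?", "placeholder answer 1")
toy_memory.save_turn("When are his office hours?", "placeholder answer 2")
history_len = len(toy_memory.messages)
toy_memory.clear()
```

This is why the cells below call `memory.clear()` before starting a new topic and skip it for follow-up questions such as "When are his office hours?".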

In [19]:
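# Use ConversationalRetrievalChain, which takes (1) an LLM, (2) a retriever, and (3) memory (a plain retrieval chain has no memory)
# Note: the custom `prompt` defined above is not passed in here, so the chain uses its default prompts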
from langchain.chains import ConversationalRetrievalChain

qa_conv_chain = ConversationalRetrievalChain.from_llm(
    #openai_llm,
    falcon_llm,
    retriever=retriever,
    memory=memory,
    chain_type='stuff',
    #return_source_documents = True
)

'''
# Try using LCEL format. Need runnable passthrough and str output parser and an implementation of stuff
from langchain_core.runnables import RunnablePassthrough
from langchain.schema import StrOutputParser

parser = StrOutputParser()
# First we'll stuff the docs together
def format_docs(docs):
    return "\n\n".join([d.page_content for d in docs])

chain = ({'context': retriever | format_docs, 'question':RunnablePassthrough()}
         | prompt
         | falcon_llm
         | parser
         )
'''
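
Per question, the conversational chain roughly does three things: rewrite a follow-up into a standalone question using the chat history, retrieve matching chunks, and "stuff" them into one prompt for the LLM. The sketch below mimics that flow with stubs. Every function is a hypothetical placeholder rather than a LangChain API, and the word-overlap retriever stands in for vector search.

```python
def condense_question(question, history):
    """Stub: a real chain asks the LLM to rewrite the follow-up; here history is appended."""
    if not history:
        return question
    previous = " ".join(q for q, _ in history)
    return f"{question} (earlier questions: {previous})"

def retrieve(standalone_question, chunks, k=2):
    """Stub retriever: rank chunks by how many question words they share."""
    words = set(standalone_question.lower().split())
    ranked = sorted(chunks, key=lambda c: -len(words & set(c.lower().split())))
    return ranked[:k]

def stuff(docs):
    """Join all retrieved chunks into a single context string."""
    return "\n\n".join(docs)

def answer(question, history, chunks, llm):
    standalone = condense_question(question, history)
    context = stuff(retrieve(standalone, chunks))
    reply = llm(f"{context}\nQuestion: {standalone}\nHelpful Answer:")
    history.append((question, reply))  # remember the turn for follow-ups
    return reply

toy_chunks = [
    "office hours are posted online",
    "exam dates are in the schedule",
    "bring a calculator",
]
toy_history = []
echo_llm = lambda prompt: prompt.splitlines()[0]  # stub LLM: returns the first context line
first_reply = answer("when are office hours", toy_history, toy_chunks, echo_llm)
```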


In [21]:
'''
memory.clear()
question = 'List the kinematic equations'
output = chain.invoke(question)
print(output)  # the LCEL chain ends in StrOutputParser, so output is already a string
'''

In [20]:
# Test our conversational retrieval chain. It takes a dictionary {'question': ...}; the chat history is supplied by the memory object
memory.clear()
question = 'What is the definition of displacement?'
output = qa_conv_chain({'question':question})
print(output['answer'])

 Displacement is the distance an object travels in a certain direction. It is the difference between the final and initial positions of an object.


In [28]:
memory.clear()
question = 'What formula do I use to calculate the magnitude of a velocity vector?'
output = qa_conv_chain({'question':question})
print(output['answer'])


The formula to calculate the magnitude of a velocity vector is:

v = sqrt(vx^2 + vy^2)

where vx and vy are the x and y components of the velocity vector, respectively.


In [32]:
# Now to query the syllabus
memory.clear()
question = 'What are the dates of the exams in Reginald Bain physics class?'
output = qa_conv_chain({'question': question})
print(output['answer'])

 The dates of the exams in Reginald Bain physics class are: 1. Exam 1: 9/22/2021 2. Exam 2: 10/13/2021 3. Exam 3: 11/10/2021 4. Exam 4: 12/8/2021 5. Final Exam: 12/15/2021


In [33]:
question = 'When are his office hours?'
output = qa_conv_chain({'question': question})
print(output['answer'])

 "His office hours are from 2:00pm to 4:00pm on Tuesdays and Thursdays."


In [38]:
question = 'What is the textbook used in his class?'
output = qa_conv_chain({'question': question})
print(output['answer'])

 The textbook used in his class is "General Physics" by Douglas C. Giancoli.
