### Langchains practice making a basic document questioning RAG to chat with our data

#### Here are the basic steps
1. Document Loading
2. Document Splitting
3. Vectorstores and Embeddings
4. Retrival
5. Question Answering
6. Chat

In [1]:
# Imports and setup for os environment variables
import os
import openai
import sys

sys.path.append('../..')

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read the local .env file

# OpenAI API key
openai.api_key = os.environ["OPENAI_API_KEY"]

# Langchain endpoint
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_ENDPOINT"] = "https://api.langchain.plus"
os.environ["LANGCHAIN_PROJECT"] = "AmexRag"
os.environ["LANGCHAIN_API_KEY"] = os.environ["LANGSMITH_API_KEY"]


#### 1. Document Loading

add the pdf documents from the directory into a place where we can process them

In [2]:
# Import PDF loaders - different loaders to see the difference in output
from langchain.document_loaders import PDFPlumberLoader
from langchain.document_loaders import PyPDFLoader

loaderpy = PyPDFLoader("AmexStatements/2024-02-21.pdf")
loaderplumber = PDFPlumberLoader("AmexStatements/2024-02-21.pdf")

In [3]:
pages = loaderpy.load() 
#print(len(pages))
#print(pages[0].page_content)

In [4]:
pages = loaderplumber.load()
#print(len(pages))
#print(pages[0].page_content)

In [5]:
pages[0].metadata

{'source': 'AmexStatements/2024-02-21.pdf',
 'file_path': 'AmexStatements/2024-02-21.pdf',
 'page': 0,
 'total_pages': 9,
 'Subject': '',
 'CreationDate': "D:20240519155457-07'00'",
 'Producer': 'OpenText Output Transformation Engine - 16.3.19                                                     ',
 'Author': '',
 'Creator': '',
 'Title': '',
 'ModDate': "D:20240519155457-07'00'",
 'Keywords': ''}

In [6]:
# Now lets load all the docuemnts into a list
loaders = []
for file in os.listdir("AmexStatements"):
    loaders.append(PyPDFLoader(f"AmexStatements/{file}"))
# Load PDF

docs = []
for loader in loaders:
    docs.extend(loader.load())

print(len(docs))
print(len(docs[0].page_content))

168
2563


#### 2. Document Splitting

we split documents into chunks in order to make it easier for the model to process
* **Level 1: Character Splitting** - Simple static character chunks of data
* **Level 2: Recursive Character Text Splitting** - Recursive chunking based on a list of separators
* **Level 3: Document Specific Splitting** - Various chunking methods for different document types (PDF, Python, Markdown)
* **Level 4: Semantic Splitting** - Embedding walk based chunking
* **Level 5: Agentic Splitting** - Experimental method of splitting text with an agent-like system. Good for if you believe that token cost will trend to $0.00
* **Alternative Representation Chunking + Indexing** - Derivative representations of your raw text that will aid in retrieval and indexing


There are many ways of splitting, I highlight 4 most useful for our cases.
1. By Characters or Resursively splitting on Characters. - Good Starting place
2. By Token. - Good if you have a fixed pricing problem (Not for our case as much)
3. By Document Specific Splitting. - Good for specific structures like tables. (Probably most useful for us)
4. By Semantic Chuncking. - Good to the understanding of the data (2nd most useful)

In [7]:
from langchain.text_splitter import CharacterTextSplitter # Doesnt look at whitespace and newlines
from langchain_text_splitters import RecursiveCharacterTextSplitter # Good for text but not pdfs
from langchain.text_splitter import TokenTextSplitter # Good for price based splitting

# text = 'fnsldnfsd'
# text_splitter = CharacterTextSplitter(chunk_size = 26, chunk_overlap = 4)
# recursive_splitter = RecursiveCharacterTextSplitter(chumk_size = 26, chunk_overlap = 4)
# token_splitter = TokenTextSplitter(chunk_size = 6, chunk_overlap = 1)

# text_splitter.split_text(text)
# recursive_splitter.split_text(text)
# token_splitter.split_text(text)


Super Cool below but too many dependencies and isnt working rn due to an Onnx installation issue, moving on...

In [8]:

# from unstructured.staging.base import elements_to_json
# from unstructured.partition.pdf import partition_pdf
# import numpy as np
# import pdf2image
# import wrapt
# from pdfminer import psparser

# elements = partition_pdf(
#     filename="AmexStatements/2024-02-21.pdf",

#     # Unstructured Helpers
#     strategy="hi_res", 
#     infer_table_structure=True, 
#     model_name="yolox"
# )

# elements

In [9]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai.embeddings import OpenAIEmbeddings

text_splitter = SemanticChunker(OpenAIEmbeddings())

splits = text_splitter.split_documents(docs)


In [10]:
print(len(splits))

299


#### 3. Vectorstores and Embeddings

In [11]:
from langchain.vectorstores import Chroma
persist_directory = 'docs/chroma/'
vectordb = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings()
    #, persist_directory=persist_directory
)
print(vectordb._collection.count())

299


In [12]:
question = "How much have I spent on Uber? There are various transactions on my statement, give me the sum."

In [13]:
docs = vectordb.similarity_search(question,k=3)
len(docs)

3

#### 4. Retrival

Problem with basic retrival on similarity is we dont get diversity of content if its the same across pages, for that we will use different retrival methods. 

Examples:
1. MMR (Maximum marginal relevance)
2. SelfQueryRetriever (The query string to use for vector search)
3. 

In [14]:
docs_mmr = vectordb.max_marginal_relevance_search(question,k=3)
docs_mmr[0].page_content[:100] # Based on MMR search

'Foreign\nSpend\n10/13/23 Uber Trip help.uber.com CA\nSP25Z6LF 60527$18.94\n10/13/23 CHIPOTLE ONLINE NEWP'

In [15]:
# Based on similarity search on a specific document (metadata)
docs = vectordb.similarity_search(
    question,
    k=3,
    filter={"source":"AmexStatements/2024-04-22.pdf"}
)

In [16]:
for d in docs:
    print(d.metadata)

{'page': 6, 'source': 'AmexStatements/2024-04-22.pdf'}
{'page': 0, 'source': 'AmexStatements/2024-04-22.pdf'}
{'page': 7, 'source': 'AmexStatements/2024-04-22.pdf'}


#### 4.2 Self Query Retriever

In [17]:
from langchain.retrievers.self_query.base import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain_openai import OpenAI
# Building the Self Query Retriever below few cells

In [18]:
metadata_field_info = [
    AttributeInfo(
        name="source",
        description="The statement the chunk is from, should be one of `AmexStatements/*.pdf` where the * is the date the statement was issued",
        type="string",
    ),
    AttributeInfo(
        name="page",
        description="The page from the statement",
        type="integer",
    ),
] 

# This is the info the retriever will use to construct the query

In [19]:
document_content_description = "Bank Statements from American Express for Sameer Himati"
llm = OpenAI(model='gpt-3.5-turbo-instruct', temperature=0)
retriever = SelfQueryRetriever.from_llm(
    llm,
    vectordb,
    document_content_description,
    metadata_field_info,
    verbose=True
)
docs = retriever.get_relevant_documents(question)


  warn_deprecated(


In [20]:
for d in docs:
    print(d.metadata)

{'page': 5, 'source': 'AmexStatements/2023-10-23.pdf'}
{'page': 3, 'source': 'AmexStatements/2024-02-21.pdf'}
{'page': 6, 'source': 'AmexStatements/2024-04-22.pdf'}
{'page': 13, 'source': 'AmexStatements/2023-11-22.pdf'}


#### 4.3 Compression

Another approach for improving the quality of retrieved docs is compression.
Information most relevant to a query may be buried in a document with a lot of irrelevant text.
Passing that full document through your application can lead to more expensive LLM calls and poorer responses.
Contextual compression is meant to fix this.

In [21]:
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

def pretty_print_docs(docs):
    print(f"\n{'-' * 100}\n".join([f"Document {i+1}:\n\n" + d.page_content for i, d in enumerate(docs)]))

# Wrap our vectorstore
llm = OpenAI(temperature=0, model="gpt-3.5-turbo-instruct")
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever()
)

compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

10/13/23 Uber Trip help.uber.com CA
SP25Z6LF 60527$18.94
10/14/23 Uber Trip help.uber.com CA
SYPLO52W 60523$10.97
----------------------------------------------------------------------------------------------------
Document 2:

03/15/23 Uber Trip help.uber.com CA
HHJZE4HR 33139$28.94
03/15/23 Uber Trip help.uber.com CA
6NDDBWD5 33122$22.99
03/16/23 Uber Trip help.uber.com CA
Y7LJAIZZ 60181$6.97
03/16/23 Uber Trip help.uber.com CA
UCW4EZRF 60148$6.98
----------------------------------------------------------------------------------------------------
Document 3:

-$14.07 -$14.07
01/31/24 UBER EATS
help.uber.com CA
KAQ2NFXW 94103-$14.07
----------------------------------------------------------------------------------------------------
Document 4:

Total Fees in2023 $250.00
Total Interest in2023 $0.00
Days inBilling Period: 31
Your Annual Percentage Rate (APR) istheannual interest rate onyour account. Pay Over Time 02/06/2023 27.99 %(v) $0.00 $0.00
Cash Advances 02/06/2023 29

In [22]:
# Combination of MMR search with Compression

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectordb.as_retriever(search_type = "mmr")
)

compressed_docs = compression_retriever.get_relevant_documents(question)
pretty_print_docs(compressed_docs)

Document 1:

10/13/23 Uber Trip help.uber.com CA
SP25Z6LF 60527$18.94
10/14/23 Uber Trip help.uber.com CA
SYPLO52W 60523$10.97
----------------------------------------------------------------------------------------------------
Document 2:

-$14.07 -$14.07
01/31/24 UBER EATS
help.uber.com CA
KAQ2NFXW 94103-$14.07
----------------------------------------------------------------------------------------------------
Document 3:

Total Payments and Credits -$3,903.45 -$463.45 -$4,366.90
----------------------------------------------------------------------------------------------------
Document 4:

02/19/23 AplPay STARBUCKS STORE 0835 SAINT PETERSBURG FL
FAST FOOD RESTAURANT$3.16
02/19/23 AplPay PUBLIX SAINT PETERSBURG FL
8636881188$11.27
02/19/23 AplPay GALLITO 927540460410837 TAMPA FL
TYRODRIGUEZ117@GMAIL.COM$27.82
02/20/23 AplPay MIO'S GRILL &CAFE -KAYRA LLC St.Petersburg FL
squareup.com/receipts$112.24


#### 5. Question Answering

In [23]:
from langchain.chains import RetrievalQA 
# Starting a QA chain, lets see if it helps
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)
result = qa_chain({"query": question})
result["result"]

  warn_deprecated(


' $69.88'

In [24]:
# Hmm the output above is not what we wanted, lets try with a Prompt

from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context which is parts of a credit card statement from Sameer Himati to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

# Run chain
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain({"query": question})
result["result"]

' Thanks for asking! Based on the transactions listed on your statement, it appears that you have spent a total of $36.88 on Uber.'

In [25]:
# hmm so perhaps the problem is in the type of Chain we are using, lets try a different one

qa_chain_mr = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    chain_type="refine"
)
result = qa_chain_mr({"query": question})
result["result"]

'\nBased on the information provided, it appears that you have spent a total of $34.21 on Uber trips. This includes a $28.94 trip on 03/15/23 and a $6.97 trip on 03/16/23. However, there are also several other transactions on your statement that may be related to Uber, such as a $69.50 charge for VTS Flash Cab and a $56.16 charge for Lyft. Without more specific information, it is difficult to determine the exact amount you have spent on Uber. Additionally, there are some other charges on your statement that may not be related to Uber, such as a $10.00 membership fee for LA Fitness and a $109.20 charge for Keller Williams Experience. It is recommended to review your statement in more detail to accurately determine the total amount spent on Uber. Please note that there may be additional charges for Uber that are not reflected on this statement, such as trailing interest charges. It is important to review all statements and transactions to get an accurate understanding of your total spend

In [26]:
# Well better but now im going to try using a Self Query Retriever with the qa chain using map reduce

qa_chain_mr_10k = RetrievalQA.from_chain_type(
    llm,
    retriever= ContextualCompressionRetriever(
    base_compressor=compressor,
    chain_type="map_reduce",
    base_retriever=vectordb.as_retriever()
)
)

result = qa_chain_mr_10k({"query": question})
result["result"]

' $65.85 ($18.94 + $10.97 + $28.94 + $22.99 + $6.97 + $6.98)'

##### Hmm well after trying all these, looks like we still cant get a good response. I think the problem is to do with the splitting and embedding for cases like this we may need to either retrieve all the docs with info about the question or find an efficient way to search thru the embedding vectors for all instances, so smaller splits perhaps

### Lets now do some Evaluation below

In [27]:
from langchain.evaluation.qa import QAGenerateChain

example_gen_chain = QAGenerateChain.from_llm(OpenAI())

new_examples = example_gen_chain.apply_and_parse(
    [{"doc": t} for t in docs[:5]]
)

examples = []

for pair in new_examples:
    a = pair.values()
    examples.extend(a)
examples   
    



[{'query': 'What was the total amount spent on October 16, 2023 at AirCanada by the passenger named ARYAEI/SOHEILA?',
  'answer': 'The total amount spent by the passenger named ARYAEI/SOHEILA on October 16, 2023 at AirCanada was $29.98. '},
 {'query': "On which date did the passenger named HIMATI/SAMEER MR depart from Chicago O'Hare International Airport on an Etihad Airways flight to Abu Dhabi International Airport?",
  'answer': "The passenger named HIMATI/SAMEER MR departed from Chicago O'Hare International Airport on February 6th, 2024."},
 {'query': 'What date did the customer make a purchase at Dunkin Donuts in Kenosha, Wisconsin and how much was the charge?',
  'answer': 'The customer made a purchase on 04/07/24 at Dunkin Donuts in Kenosha, Wisconsin for $14.08.'},
 {'query': 'What is the phone number for customer care for American Express cards?',
  'answer': 'The phone number for customer care for American Express cards can be found on the first page of your statement or on th

In [28]:
qa = qa_chain
docs[0]

Document(page_content="Foreign\nSpend\n10/13/23 Uber Trip help.uber.com CA\nSP25Z6LF 60527$18.94\n10/13/23 CHIPOTLE ONLINE NEWPORT BEACH CA\n1339 92660\nFAST FOOD RESTAURANT$13.19\n10/14/23 STATE FARM INSURANCE BLOOMINGTON IL\n8009566310$318.76\n10/14/23 PORTILLOS HOT DOGS #10 NAPERVI 1624665 NAPERVILLE IL\n130187 60540\nJumbo Hot Dog\nCelery Salt\nMustard\nLCCS$14.87\n10/14/23 Uber Trip help.uber.com CA\nSYPLO52W 60523$10.97\n10/14/23 TST* JIBEK JOLU -LINCOLN 00092818 CHICAGO IL\nRESTAURANT$63.12\n10/14/23 CHICK-FIL-A LOMBARD IL\n6305860830$22.24\n10/14/23 AplPay 1000 TALES FAMILY RESTAUR 545500001 MTPROSPECT IL\nINFO@RAINFOODUSA.COM$9.68\n10/15/23 PARK CHICAGO MOBILE 0539 CHICAGO IL\n877-242-7901$20.00\n10/15/23 PETE'S FRESH MARKET OAKBROOK TERRACE IL\n6308126100$34.14\n10/15/23 AplPay PETE'S FRESH MARKET OAKBROOK TERRACE IL\n6308126100$3.96\n10/16/23 AIRCANADA WINNIPEG\nAirCanada\nFrom: To: Carrier: Class:\nTORONTO LESTER BP CHICAGO O'HARE INT AC X\nCHICAGO O'HARE INT AC L\nTicket N

In [29]:
langchain.debug = False

NameError: name 'langchain' is not defined

In [None]:
predictions = qa.apply(examples)

In [None]:
from langchain.evaluation.qa import QAEvalChain

eval_chain = QAEvalChain.from_llm(llm)

graded_outputs = eval_chain.evaluate(examples, predictions)


for i, eg in enumerate(examples):
    print(f"Example {i}:")
    print("Question: " + predictions[i]['query'])
    print("Real Answer: " + predictions[i]['answer'])
    print("Predicted Answer: " + predictions[i]['result'])
    print("Predicted Grade: " + graded_outputs[i]['results'])
    print()

#### 6. Chat!

In [None]:
from langchain.memory import ConversationBufferMemory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)