## **Integrating document loaders**

Pre-trained language models don't have access to external data sources; their understanding comes purely from their training data. This means that if we need our model to have knowledge that goes beyond its training data, such as
- company data,
- knowledge of more recent world events,
- or even a specific document,

we need a way of integrating that data.

<img src='./images/rag.png' width=60%>

In retrieval-augmented generation (RAG), we use __embeddings__ to _retrieve_ relevant information and integrate it into the _prompt_, so that the model has extra context to inform its response.

### **RAG development steps**

<img src='./images/rag-dev-steps.png' width=60%>

There are three primary steps to RAG development in LangChain.

1. Document loading: loading the documents into LangChain.
2. Splitting: splitting the documents into chunks. Chunks are units of information that we can index and process individually.
3. Storage + retrieval: encoding and storing the chunks for retrieval, which could utilize a vector database if that meets the needs of the use case.

### **1. LangChain document loaders**

- LangChain document loaders are classes designed to _load_ and _configure_ documents for system integration.
- LangChain provides document loader classes for common file types such as CSVs and PDFs.
- There are also additional loaders provided by third parties for managing unique document formats, including
  - Amazon S3 files
  - Jupyter notebooks (.ipynb)
  - Audio transcripts (.wav, etc.)
  - and many more.

In this chapter, we will practice loading data from three common formats: PDFs, CSVs, and HTML.

LangChain has documentation on all of its document loaders, and there's a lot of overlap in syntax. Check it out [here](https://python.langchain.com/docs/integrations/document_loaders)!

1. __PDF document loader__

    There are a few different types of PDF loaders in LangChain, and there is documentation available online for each. Here we will use `PyPDFLoader`.
    
    We instantiate the `PyPDFLoader` class, passing in the path to the PDF file we're loading.

    Finally, we use the `.load()` method to load the document into memory, and assign the resulting object to the `data` variable.

In [5]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('./datasets/attention_is_all_you_need.pdf')

data = loader.load()

print(data[0])

page_content='Provided proper attribution is provided, Google hereby grants permission to
reproduce the tables and figures in this paper solely for use in journalistic or
scholarly works.
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗ †
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗ ‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or
convolutional neural networks that include an encoder and a decoder. The best
performing models also connect the encoder and decoder through an attention
mechanism. We propose a new simple network architecture, the Transformer,
based solely on attention mechanisms, dispensing with recurrence and convolutions

2. __CSV document loader__

    When loading CSVs, the syntax is very similar to the PDF loader, but we use the `CSVLoader` class instead.

In [19]:
from langchain_community.document_loaders import CSVLoader

loader = CSVLoader('./datasets/fifa_countries_audience.csv')

data = loader.load()

print(data)

[Document(metadata={'source': './datasets/fifa_countries_audience.csv', 'row': 0}, page_content='country: United States\nconfederation: CONCACAF\npopulation_share: 4.5\ntv_audience_share: 4.3\ngdp_weighted_share: 11.3'), Document(metadata={'source': './datasets/fifa_countries_audience.csv', 'row': 1}, page_content='country: Japan\nconfederation: AFC\npopulation_share: 1.9\ntv_audience_share: 4.9\ngdp_weighted_share: 9.1'), Document(metadata={'source': './datasets/fifa_countries_audience.csv', 'row': 2}, page_content='country: China\nconfederation: AFC\npopulation_share: 19.5\ntv_audience_share: 14.8\ngdp_weighted_share: 7.3'), Document(metadata={'source': './datasets/fifa_countries_audience.csv', 'row': 3}, page_content='country: Germany\nconfederation: UEFA\npopulation_share: 1.2\ntv_audience_share: 2.9\ngdp_weighted_share: 6.3'), Document(metadata={'source': './datasets/fifa_countries_audience.csv', 'row': 4}, page_content='country: Brazil\nconfederation: CONMEBOL\npopulation_share: 
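To make the shape of that output easier to see, here is a minimal standard-library sketch that mimics it: one document per row, `column: value` lines as content, and the source and row index as metadata. `SimpleDocument` and `load_csv_rows` are hypothetical names used only for illustration; this is not how `CSVLoader` is implemented.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_csv_rows(csv_text: str, source: str) -> list[SimpleDocument]:
    """Turn each CSV row into one document made of 'column: value' lines."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for row_number, row in enumerate(reader):
        content = "\n".join(f"{column}: {value}" for column, value in row.items())
        documents.append(
            SimpleDocument(content, {"source": source, "row": row_number})
        )
    return documents


# A small in-memory sample using three of the columns shown above.
sample_csv = (
    "country,confederation,population_share\n"
    "United States,CONCACAF,4.5\n"
    "Japan,AFC,1.9\n"
)
csv_docs = load_csv_rows(sample_csv, source="in-memory sample")
print(csv_docs[0].page_content)
print(csv_docs[1].metadata)
```

Keeping the row number in the metadata is what lets a later retrieval step point back to where a piece of text came from.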

3. __HTML document loader__

    Finally, we can load HTML files using the `UnstructuredHTMLLoader` class. We can access the document's contents, again, with subsetting, and extract the document's metadata with the metadata attribute.

In [None]:
from langchain_community.document_loaders import UnstructuredHTMLLoader

loader = UnstructuredHTMLLoader('./datasets/safe-secure-use-of-ai.html')

data = loader.load()

print(data[0])
print(data[0].metadata)

### **2. Splitting external data for retrieval**

<img src=./images/intro.png width=60%>

Let's examine the introduction from an academic paper, which is saved as a PDF. One naive splitting option would be to separate the document line by line.

Line 1:<br>
<img src=./images/intro-line1.png width=60%>

Line 2: <br>
<img src=./images/intro-line2.png width=60%>

This would be simple to implement, but because sentences are often split over multiple lines, and because those lines are processed separately, key context might be lost.

__Chunk overlap__

To counteract lost context during chunk splitting, a chunk overlap is often implemented. We've selected two chunks and a chunk overlap shown in green. Having this extra overlap present in both chunks helps retain context. If a model shows signs of losing context and misunderstanding information when answering from external sources, we may need to increase this chunk overlap.

<img src=./images/intro-chunk-overlap.png width=40%>
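As a rough illustration of the idea, the sketch below cuts a string into fixed-size character windows where each window repeats the last few characters of the previous one. The function name `chunk_with_overlap` is hypothetical, and this is deliberately simpler than the LangChain splitters discussed next, which split on separators rather than at fixed offsets.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
overlap_chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(overlap_chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the one before it, so a phrase that straddles a boundary appears in both neighbours.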

There isn't one document splitting strategy that works for all situations. We should experiment with multiple methods, and see which one strikes the right balance between retaining context and managing chunk size. We will compare two document splitting methods:
- `CharacterTextSplitter`
- `RecursiveCharacterTextSplitter`

As an example, let's split this quote by Elbert Hubbard, which contains 108 characters (including the newline between the two sentences), into chunks. We'll compare how the two methods perform on this quote with a `chunk_size` of 24 characters and a small `chunk_overlap` of three.

```python
quote = '''One machine can do the work of fifty ordinary humans.\nNo machine can do the work of one extraordinary human.'''
chunk_size=24
chunk_overlap=3
len(quote)
```
Output:<br>
`108`


1. __CharacterTextSplitter__
   
    This method splits on the separator first, then merges the resulting pieces into chunks, checking each against `chunk_size` and `chunk_overlap`.

    We instantiate `CharacterTextSplitter`, passing the separator to split on, along with `chunk_size` and `chunk_overlap`. Applying the splitter to the quote with the `.split_text()` method and printing the output reveals a problem:

In [8]:
quote = '''One machine can do the work of fifty ordinary humans.\nNo machine can do the work of one extraordinary human.'''
chunk_size=24
chunk_overlap=3

from langchain.text_splitter import CharacterTextSplitter

ct_splitter = CharacterTextSplitter(
    separator=".",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = ct_splitter.split_text(quote)
print(docs)
print([len(doc) for doc in docs])

Created a chunk of size 52, which is longer than the specified 24


['One machine can do the work of fifty ordinary humans', 'No machine can do the work of one extraordinary human']
[52, 53]


Each of these chunks contains more characters than our specified `chunk_size`. `CharacterTextSplitter` splits on the separator in an attempt to make chunks smaller than `chunk_size`, but in this case, splitting on the separator was unable to return chunks below our `chunk_size`. Let's take a look at a more robust splitting method!
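The behaviour above can be approximated in a few lines of plain Python. This is a simplified sketch under stated assumptions, not LangChain's implementation: the hypothetical `split_on_separator` ignores `chunk_overlap` entirely and only shows why a piece with no further separator to fall back on stays oversized.

```python
def split_on_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size.

    A piece that is already longer than chunk_size is kept whole, because
    there is no other separator to fall back on.
    """
    pieces = [piece.strip() for piece in text.split(separator) if piece.strip()]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            if current:
                chunks.append(current)
            current = piece  # start a new chunk, even if it is too long
    if current:
        chunks.append(current)
    return chunks


sample_quote = (
    "One machine can do the work of fifty ordinary humans.\n"
    "No machine can do the work of one extraordinary human."
)
oversized = split_on_separator(sample_quote, separator=".", chunk_size=24)
print([len(chunk) for chunk in oversized])
# [52, 53]
```

With a much larger `chunk_size`, say 200, the same function merges both sentences into a single chunk, which is the merging half of the behaviour.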

2. __RecursiveCharacterTextSplitter__
   
   `RecursiveCharacterTextSplitter` takes a list of separators to split on, and it works through the list from left to right, splitting the document using each separator in turn, and seeing if these chunks can be combined while remaining under `chunk_size`. Let's split the quote using the same `chunk_size` and `chunk_overlap`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

docs = rc_splitter.split_text(quote)
print(docs)

['One machine can do the', 'work of fifty ordinary', 'humans.', 'No machine can do the', 'work of one', 'extraordinary human.']


Notice how the length of each chunk varies. The splitter
1. tried to split by paragraphs first (`"\n\n"`), and found that the resulting chunk was still too big;
2. tried to split by lines next (`"\n"`), with the same result;
3. reached the third separator (`" "`) and split the text into words,

and found that words could be combined into chunks while remaining under the `chunk_size` character limit.

Some of these chunks are too small to contain meaningful context, but this recursive approach may work better on larger documents.
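The recursion described above can be sketched in plain Python. This is a simplified illustration, not LangChain's code: the hypothetical `recursive_split` ignores `chunk_overlap` and handles separators more naively than the real class. For this particular quote it happens to reproduce the chunks printed above, which is not guaranteed for other inputs.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive splitting (no overlap handling).

    Split on the first separator, merge neighbouring pieces while they fit,
    and recurse with the remaining separators on any piece that is too long.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; keep the oversized piece
    separator, remaining = separators[0], separators[1:]
    # An empty separator means "split into single characters".
    pieces = list(text) if separator == "" else text.split(separator)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the merged text still fits
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Too long even on its own: try the next, finer separator.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


sample_quote = (
    "One machine can do the work of fifty ordinary humans.\n"
    "No machine can do the work of one extraordinary human."
)
sketch_chunks = recursive_split(sample_quote, ["\n\n", "\n", " ", ""], chunk_size=24)
print(sketch_chunks)
print([len(chunk) for chunk in sketch_chunks])
```

No chunk exceeds 24 characters here because the space separator always offers a finer split, which is the main advantage over splitting on a single separator.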

__RecursiveCharacterTextSplitter with HTML__

We can also split other file formats, like HTML. Recall that we can load HTML using `UnstructuredHTMLLoader`. Defining the splitter is the same, but for splitting documents, we use the `.split_documents()` method instead of `.split_text()`.

NOTE: `UnstructuredHTMLLoader` requires NumPy 1.26 or lower, which conflicts with some of the packages needed for this course, so the HTML code was not run here. However, the splitter works essentially the same way as in the examples above.

In [15]:
# Import the character splitter
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter

quote = 'Words are flowing out like endless rain into a paper cup,\nthey slither while they pass,\nthey slip away across the universe.'
chunk_size = 24
chunk_overlap = 10

# Create an instance of the splitter class
ct_splitter = CharacterTextSplitter(
    separator="\n",
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

rc_splitter = RecursiveCharacterTextSplitter(
    separators=["\n", " ", ""],
    chunk_size=chunk_size,
    chunk_overlap=chunk_overlap
)

# Split the string and print the chunks
docs_ct = ct_splitter.split_text(quote)
print("Chunks splitted by CharacterTextSplitter:")
print(docs_ct)
print([len(doc) for doc in docs_ct], '\n')

print("Chunks splitted by RecursiveCharacterTextSplitter:")
docs_rc = rc_splitter.split_text(quote)
print(docs_rc)
print([len(doc) for doc in docs_rc])

Created a chunk of size 57, which is longer than the specified 24
Created a chunk of size 29, which is longer than the specified 24


Chunks splitted by CharacterTextSplitter:
['Words are flowing out like endless rain into a paper cup,', 'they slither while they pass,', 'they slip away across the universe.']
[57, 29, 35] 

Chunks splitted by RecursiveCharacterTextSplitter:
['Words are flowing out', 'out like endless rain', 'rain into a paper cup,', 'they slither while they', 'they pass,', 'they slip away across', 'across the universe.']
[21, 21, 22, 23, 10, 21, 20]


### **3. RAG storage and retrieval using vector databases**

Now that we've covered document loading and splitting, we'll round out the RAG workflow by learning how to store and retrieve this information using vector databases.

We've now loaded documents and split them into chunks using an appropriate `chunk_size` and `chunk_overlap`. All that's left is to store them for retrieval.

<img src='./images/rag.png' width=60%>

We'll be using a vector database to store our documents and make them available for retrieval. This requires embedding our text documents to create vectors that capture the semantic meaning of the text. Then, a user query can be embedded to retrieve the most similar documents from the database and insert them into the model prompt.
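To make the retrieval step concrete, here is a toy sketch of similarity search over stored vectors. The three-dimensional "embeddings" are made up for illustration, and cosine similarity is used as one common choice; the metric a given vector store actually applies depends on its configuration. `cosine_similarity` and `top_k` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector: list[float], stored: list[tuple], k: int) -> list[str]:
    """Return the k stored texts whose vectors are most similar to the query."""
    ranked = sorted(
        stored,
        key=lambda item: cosine_similarity(query_vector, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up vectors; a real embedding model returns far more dimensions.
stored_chunks = [
    ("guideline on brand capitalization", [0.9, 0.1, 0.0]),
    ("guideline on referring to users", [0.1, 0.9, 0.1]),
    ("unrelated release notes", [0.0, 0.1, 0.9]),
]
query_embedding = [0.8, 0.3, 0.0]
print(top_k(query_embedding, stored_chunks, k=2))
# ['guideline on brand capitalization', 'guideline on referring to users']
```

A vector database performs the same ranking, but with indexing structures that avoid comparing the query against every stored vector.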

There are many vector databases available in LangChain. In this course, we will use `ChromaDB` because it is lightweight and quick to set up.

We'll be storing documents containing guidelines for a company's marketing copy. There are two guidelines: one on brand capitalization, and another on how to refer to users.

In [3]:
import pickle

# Load from file
with open('./datasets/docs.pkl', 'rb') as f:
    docs = pickle.load(f)

with open('./datasets/docs_with_ids.pkl', 'rb') as f:
    docs_with_ids = pickle.load(f)

docs

docs_with_ids

[Document(metadata={'id': '0', 'guideline': 'brand-capitalization'}, page_content='In all marketing copy, TechStack should always be written with the T and S capitalized. Incorrect: techstack, Techstack, etc.'),
 Document(metadata={'id': '1', 'guideline': 'referring-to-users'}, page_content='Our users should be referred to as techies in both internal and external communications.')]

__Setting up a Chroma vector database__

Now that we've parsed the data, it's time to embed it. We'll use an embedding model from OpenAI by instantiating the `OpenAIEmbeddings` class, passing in our OpenAI API key via the `api_key` argument.

To create a Chroma database from a set of documents, call the `.from_documents()` method on the Chroma class, passing the documents and embedding function to use. 

We'd like to persist this database to disk for future use, so provide a path to the `persist_directory` argument. 

Finally, to integrate the database with other LangChain components, we need to convert it into a retriever with the `.as_retriever()` method. Here, we specify that we want to perform a similarity search and return the top two most similar documents for each user query.

__Building a prompt template__

So that the model knows what to do, we'll construct a prompt template. It starts with the instruction to review and fix the copy provided, then inserts the retrieved guidelines and the copy to review, and ends with an indication that the model should follow with a fixed version.
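As a rough picture of what filling the template produces, the sketch below uses plain `str.format` as a stand-in for the prompt template class. The name `message_sketch` and the hard-coded guideline list are assumptions for illustration; in the real chain the guidelines come from the retriever, and the chat template also wraps the text in a message with a role.

```python
message_sketch = """
Review and fix the following TechStack marketing copy with the following guidelines in consideration:
Guidelines:
{guidelines}
Copy:
{copy}
Fixed Copy:
"""

# Stand-ins for the two documents a retriever would return.
retrieved_guidelines = [
    "In all marketing copy, TechStack should always be written with the T and S capitalized. Incorrect: techstack, Techstack, etc.",
    "Our users should be referred to as techies in both internal and external communications.",
]

filled_prompt = message_sketch.format(
    guidelines="\n".join(retrieved_guidelines),
    copy="Here at techstack, our users are the best in the world!",
)
print(filled_prompt)
```

Ending the prompt with `Fixed Copy:` nudges the model to continue with the corrected text rather than with commentary.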

__Chaining it all together__

To chain together our `retriever`, `prompt_template`, and `llm`, we use LCEL in a similar way as before, using pipes to connect the three components. The only difference is that the chain starts with a dictionary: the `guidelines` key is assigned the retriever, so it receives the retrieved documents, and the `copy` key is assigned a `RunnablePassthrough` instance, which acts as a _placeholder_ that passes our input through unchanged when we invoke the chain. Printing the result, we can see the model fixed the two guideline breaches.

In [4]:
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.documents import Document
import os
import pickle
from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")

embedding_function = OpenAIEmbeddings(
    api_key=api_key,
    model="text-embedding-3-small"
)

vectorstore = Chroma.from_documents(
    docs_with_ids,
    embedding=embedding_function,
    persist_directory="./datasets/"
)

retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":2}
)

message="""
Review and fix the following TechStack marketing copy with the following guidelines in consideration:
Guidelines:
{guidelines}
Copy:
{copy}
Fixed Copy:
"""

prompt_template = ChatPromptTemplate.from_messages([("human", message)])

llm = ChatOpenAI(model="gpt-4o-mini", api_key=api_key)

rag_chain= (
    {"guidelines": retriever, "copy": RunnablePassthrough()}
    | prompt_template
    | llm
)

response = rag_chain.invoke("Here at techstack, our users are the best in the world!")

print(response.content)

Here at TechStack, our techies are the best in the world!
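One way to picture the data flow through the chain above is as ordinary function composition. The sketch below is an analogy only: `fake_retriever`, `passthrough`, `build_prompt`, and `fake_llm` are hypothetical stand-ins, and no LangChain or OpenAI code is involved.

```python
def fake_retriever(query: str) -> list[str]:
    """Hypothetical retriever: always returns the same two guidelines."""
    return ["Write TechStack with a capital T and S.", "Refer to users as techies."]


def passthrough(value):
    """Return the input unchanged, like a placeholder for the chain input."""
    return value


def build_prompt(inputs: dict) -> str:
    """Fill a tiny template from the mapping produced in the first step."""
    guidelines = "\n".join(inputs["guidelines"])
    return f"Guidelines:\n{guidelines}\nCopy:\n{inputs['copy']}\nFixed Copy:"


def fake_llm(prompt: str) -> str:
    """Hypothetical model: reports the prompt size instead of calling an API."""
    return f"(model received {len(prompt.splitlines())} prompt lines)"


def run_chain(user_input: str) -> str:
    # Step 1: every entry in the mapping receives the same user input.
    mapping = {"guidelines": fake_retriever, "copy": passthrough}
    inputs = {key: step(user_input) for key, step in mapping.items()}
    # Steps 2 and 3: the output of each stage is piped into the next.
    return fake_llm(build_prompt(inputs))


print(run_chain("Here at techstack, our users are the best in the world!"))
# (model received 6 prompt lines)
```

The single user input is used twice: once as the search query for the retriever and once, unchanged, as the copy to review.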


## Example

In [None]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
import os
from dotenv import load_dotenv

load_dotenv()

api_key = os.getenv("OPENAI_API_KEY")

loader = PyPDFLoader('./datasets/rag_vs_fine_tuning.pdf')
data = loader.load()

# Split the document using RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=50
)
docs = splitter.split_documents(data) 

# Embed the documents in a persistent Chroma vector database
embedding_function = OpenAIEmbeddings(api_key=api_key, model='text-embedding-3-small')
vectorstore = Chroma.from_documents(
    docs,
    embedding=embedding_function,
    persist_directory='./datasets/example/'
)

# Configure the vector store as a retriever
retriever = vectorstore.as_retriever(
    search_type="similarity",
    search_kwargs={"k":3}
)

In [5]:
# Add placeholders to the message string
message = """
Answer the following question using the context provided:

Context:
{context}

Question:
{question}

Answer:
"""

# Create a chat prompt template from the message string
prompt_template = ChatPromptTemplate.from_messages([("human", message)])

In [6]:
# Create a chain to link retriever, prompt_template, and llm
rag_chain = ({"context": retriever, "question":RunnablePassthrough()}
            | prompt_template
            | llm)

# Invoke the chain
response = rag_chain.invoke("Which popular LLMs were considered in the paper?")
print(response.content)

The provided context does not include any information regarding popular LLMs (Large Language Models) considered in a paper. It only discusses the capitalization rules for the brand "TechStack" in marketing copy. Please provide additional context or details related to LLMs for a more accurate response.

NOTE: This answer refers to the TechStack guidelines rather than the PDF. The cell that builds the PDF retriever is marked `In [None]`, so it was most likely not executed, and `rag_chain` was built from the `retriever` defined in the earlier TechStack example. Running the loading cell first should make the chain retrieve from `rag_vs_fine_tuning.pdf` instead.
