# Converting Documents into Text

As mentioned earlier, the first step in building a RAG application is to preprocess the documents by converting them to text. To do this, you need logic that parses each document and extracts its content with minimal loss of quality.

Luckily, **LangChain** has *document loaders* that handle the parsing logic and load data from various sources into a **Document** class that consists of text and associated metadata.

For example, you can use the LangChain **TextLoader** class to extract the text from a simple .txt file like this:

In [8]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('sample.txt')
loader.load()

[Document(metadata={'source': 'sample.txt'}, page_content='text content ')]

Loaders exist for many other file types and sources. You can read more [here](https://python.langchain.com/docs/integrations/document_loaders/).

In general, you should follow these steps:
1. Pick an available loader.
2. Create an instance of the loader, along with the necessary parameters, including the document location (e.g. file path, URL, etc.).
3. Load the documents by calling the `load` method of the loader instance.
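
To make the three steps concrete, here is a minimal sketch of the loader pattern in plain Python. `SimpleDocument` and `SimpleTextLoader` are hypothetical stand-ins written for illustration, not LangChain classes. They only mimic the general shape of a loader that returns documents made of text plus metadata.

```python
import tempfile
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader that reads one text file into one document."""

    def __init__(self, file_path, encoding="utf-8"):
        # Step 2: the instance stores the document location and parameters.
        self.file_path = str(file_path)
        self.encoding = encoding

    def load(self):
        # Step 3: calling load() returns a list of documents.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        return [SimpleDocument(page_content=text, metadata={"source": self.file_path})]


# Write a small sample file to a temporary directory, then load it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sample.txt"
    path.write_text("text content", encoding="utf-8")
    docs = SimpleTextLoader(path).load()

print(docs[0].page_content)
print(docs[0].metadata)
```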

For example, to use **WebBaseLoader** to load HTML from web URLs and parse it to text, first install `beautifulsoup4`:

In [9]:
pip install beautifulsoup4

Note: you may need to restart the kernel to use updated packages.


Let's see an example:

In [1]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.langchain.com/")

USER_AGENT environment variable not set, consider setting it to identify your requests.


This code emits a warning because we haven't provided a `USER_AGENT`. You can set one locally, for example:
> **Note**: To avoid this warning, remove the previous cell and then run the following code.

In [1]:
import os
os.environ['USER_AGENT']='myagent' # let's try again

from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.langchain.com/")
loader.load()

[Document(metadata={'source': 'https://www.langchain.com/', 'title': 'LangChain', 'description': 'LangChain’s suite of products supports developers along each step of their development journey.', 'language': 'en'}, page_content="LangChain\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nLangGraphLangSmithLangChainResources\n\nAll ResourcesBlogCustomer StoriesLangChain AcademyCommunityExpertsChangelogMethods\n\nAgentsEvaluationRetrievalDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricing\n\nLangSmithLangGraph PlatformGet a demoSign up\n\n\n\n\n\n\n\n\n\n\n\n\nProducts\n\nLangGraphLangSmithLangChainResources\n\nAll ResourcesBlogCustomer StoriesLangChain AcademyCommunityExpertsChangelogMethods\n\nAgentsEvaluationRetrievalDocs\n\nPythonLangGraphLangSmithLangChainJavaScriptLangGraphLangSmithLangChainCompany\n\nAboutCareersPricing\n\nLangSmithLangGraph PlatformGet a demoSign upLangChain’s suite of products supports developers 

Now let's try with a PDF file. First, install `pypdf`:

In [2]:
pip install pypdf

Collecting pypdf
  Downloading pypdf-5.3.1-py3-none-any.whl.metadata (7.3 kB)
Downloading pypdf-5.3.1-py3-none-any.whl (302 kB)
Installing collected packages: pypdf
Successfully installed pypdf-5.3.1
Note: you may need to restart the kernel to use updated packages.


In [4]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("basic-text.pdf")
pages = loader.load()
pages

[Document(metadata={'producer': 'Skia/PDF m126', 'creator': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36', 'creationdate': '2024-07-09T13:31:41+00:00', 'title': 'Sample Document for PDF Testing', 'moddate': '2024-07-09T13:31:41+00:00', 'source': 'basic-text.pdf', 'total_pages': 1, 'page': 0, 'page_label': '1'}, page_content="Sample Document for PDF Testing\nIntroduction\nThis is a simple document created to test basic PDF functionality. It includes various text formatting\noptions to ensure proper rendering in PDF readers.\nText Formatting Examples\n1. Bold text is used for emphasis.\n2. Italic text can be used for titles or subtle emphasis.\n3. Strikethrough is used to show deleted text.\nLists\nHere's an example of an unordered list:\nItem 1\nItem 2\nItem 3\nAnd here's an ordered list:\n1. First item\n2. Second item\n3. Third item\nQuote\nThis is an example of a block quote. It can be used to highlight important info

Documents can exceed the context window of most LLMs and embedding models, so it is recommended to split them into manageable chunks. These chunks can then be converted into embeddings and searched semantically, which is what makes the *retrieve* step possible.

> **Note**: LLMs and embedding models have a maximum limit on the number of tokens they can handle. This limit is usually called the **context window**, and for LLMs it usually applies to the combination of input and output. For example, if the context window is 1024 tokens, the input can be 512 tokens and the output can be 512 tokens.
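
The arithmetic in the note can be written as a small helper. This is only an illustrative sketch: `remaining_output_tokens` is a hypothetical function, and it assumes the limit applies to input and output combined, as described above.

```python
def remaining_output_tokens(context_window, input_tokens):
    """Return how many tokens are left for the output once the input is counted."""
    if input_tokens > context_window:
        raise ValueError("the input alone exceeds the context window")
    return context_window - input_tokens


print(remaining_output_tokens(1024, 512))  # 512
print(remaining_output_tokens(1024, 900))  # 124
```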

It may seem like it's simple to split the document into different chunks, but keeping the text **semantically** related (related by meaning) is a complex process. Luckily, **LangChain** offers a *RecursiveCharacterTextSplitter* class that can do the following:

1. Take a list of separators, in order of importance. By default these are:
    a. The paragraph separator, which is a double newline: '\n\n'
    b. The line separator, which is a single newline: '\n'
    c. The word separator, which is a space: ' '
    d. The empty string '', which splits between individual characters as a last resort
2. To respect the given chunk size (for instance, 1,000 characters), start by splitting the text into paragraphs.
3. For any paragraph that exceeds the chunk size, split by the next separator (lines, then words), and continue until all chunks are smaller than the desired length.
4. Emit each chunk as a **Document**, with the metadata of the original document passed in and additional information about the position in the original document.
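
The steps above can be sketched roughly as follows. This is a simplified illustration, not LangChain's implementation: the function name `recursive_split` is hypothetical, it returns plain strings rather than documents, it ignores chunk overlap, and it drops the separator that falls between two chunks.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    trying each separator in order of importance."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left (or only the empty one): cut at fixed positions.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue
        if len(part) > chunk_size:
            # This part is too large on its own: flush what we have,
            # then split the part with the next, finer separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, chunk_size, finer))
            continue
        # Greedily merge small parts back together while they still fit.
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee\n\nfff"
print(recursive_split(sample, chunk_size=8))
# ['aaa bbb', 'ccc ddd', 'eee', 'fff']
```

The second paragraph is too long for one chunk, so it is split on spaces, while the shorter paragraphs are kept whole.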

Let's see an example:

In [7]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

loader = TextLoader("test.txt")
docs = loader.load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

splitted_docs = splitter.split_documents(docs)

In [8]:
splitted_docs

[Document(metadata={'source': 'test.txt'}, page_content='The Eternal Voyage\n\nAcross the endless seas of time,\nWhere echoes fade, where stars align,\nA vessel sails with silver light,\nA phantom ship in endless flight.\n\nThe captain stands with eyes aglow,\nA soul of fire, a heart of snow,\nHis compass spins, yet finds no ground,\nLost in whispers, lost in sound.\n\nThe ocean sings in endless tides,\nOf kings once fallen, hopes that rise,\nOf lovers lost in stormy waves,\nAnd those who dream beyond their graves.\n\nThrough mist-clad isles of memory,\nWhere silent ghosts drift ceaselessly,\nHe charts a course through fate unknown,\nA mariner by stars alone.\n\nThe Isles of Lost Tomorrows\n\nThe wind it calls in spectral tones,\nTo lands where golden ages shone,\nWhere empires rose, then turned to dust,\nBound to time’s unyielding thrust.\n\nUpon the shores of shattered fate,\nThe footprints fade, the echoes wait,\nFor those who walk with fearless stride,\nTo learn what lingers, wha

Here the document is split into chunks of up to 1,000 characters, with an overlap of 200 characters between consecutive chunks to preserve context. The result is also a list of documents, where each document is up to 1,000 characters in length and split along the natural divisions of the text (paragraphs, lines, and words). Splitting along these divisions keeps each chunk coherent.
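
To see what overlap means in isolation, here is a minimal sketch that uses fixed character windows. It is an assumption-laden simplification: `sliding_chunks` is a hypothetical helper, and unlike the splitter above it ignores separators and simply cuts by position.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters, where each window
    starts chunk_overlap characters before the previous one ended."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one, so text that straddles a boundary still appears intact in at least one chunk.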

> **Note**: This class can also be used with programming languages and Markdown by using keywords specific to each language as the separators. This helps ensure, for example, that a function remains in the same chunk.

> **Tip**: LangChain ships with separators for languages such as Python, JS, Markdown, HTML, and more.

Here is an example of how to use it with Python code:

In [3]:
from langchain_text_splitters import (
    Language,
    RecursiveCharacterTextSplitter,
)

In [12]:
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(metadata={}, page_content='def hello_world():\n    print("Hello, World!")'),
 Document(metadata={}, page_content='# Call the function\nhello_world()')]

In this example we used the **from_language** method to specify the language of the code, which is important to ensure that the code is split correctly. This method accepts the programming language along with the chunk parameters.

> **Note**: For this example we used the **create_documents** method, which accepts a list of strings rather than a list of documents. This method is useful when you are working with raw text strings. You can also add metadata to each new document.

Here is another example, this time with Markdown text:

In [5]:
markdown_text = """
# LangChain

⚡ Building applications with LLMs through composability ⚡

## Quick Install

```bash
pip install langchain
```

As an open source project in a rapidly developing field, we are extremely open 
    to contributions.
"""

md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)

md_docs = md_splitter.create_documents([markdown_text],[{"source":"https://www.langchain.com"}])

md_docs

[Document(metadata={'source': 'https://www.langchain.com'}, page_content='# LangChain'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='⚡ Building applications with LLMs through composability ⚡'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='## Quick Install\n\n```bash\npip install langchain'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='```'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='As an open source project in a rapidly developing field, we'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='are extremely open'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='to contributions.')]

## Generating Text Embeddings

LangChain also has an *Embeddings* class designed to interface with text embedding models and generate vector representations of text. This class has two methods: `embed_documents`, which takes a list of text strings as input, and `embed_query`, which takes a single text string.

In [2]:
from langchain_ollama import OllamaEmbeddings

model = OllamaEmbeddings(model="mxbai-embed-large")

embeddings = model.embed_documents([
    "Hi there!",
    "Oh, hello!",
    "What's your name?",
    "My friends call me World",
    "Hello World!"
])

In [3]:
embeddings

[[0.017469143,
  0.027417513,
  -0.0019900969,
  0.031921364,
  -0.015324965,
  -0.00011685633,
  0.030310582,
  0.028300924,
  0.038620733,
  -0.0048352163,
  0.0012878991,
  0.008177046,
  -0.0142559195,
  0.007532992,
  -0.051368788,
  -0.006621033,
  -0.00030253254,
  -0.028879663,
  -0.030958323,
  -0.010676186,
  -0.017620008,
  0.026903763,
  -0.0615801,
  -0.0021567822,
  -0.00040583726,
  -0.0050553377,
  0.020341104,
  -0.024637353,
  0.056206927,
  0.019648915,
  -0.012905453,
  -0.0034635663,
  0.017119698,
  -0.0660243,
  -0.024760703,
  -0.018812343,
  0.026569206,
  -0.023269355,
  -0.026210701,
  -0.027162619,
  0.020740967,
  0.014012362,
  0.066791035,
  -0.027516782,
  -0.0629987,
  0.015871065,
  0.010159771,
  0.009515729,
  0.04530428,
  -0.049360875,
  0.01848147,
  0.011473189,
  -0.027298005,
  -0.029249527,
  -0.0039732456,
  -0.017265355,
  -0.015847396,
  0.0094845975,
  0.0052580303,
  0.031336255,
  -0.0014587672,
  0.02682033,
  0.023542795,
  -0.05591599

> **Note**: Generate embeddings for multiple documents in a single call rather than one at a time, as it is more efficient.

To summarize: you use *document loaders* to convert any document to plain text, then a *text splitter* to split each large document into many smaller ones, and finally an *embedding model* to create a numeric representation of the meaning of each chunk.
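
When there are too many texts to send in a single call, one option is to group them into batches. The sketch below illustrates the idea under stated assumptions: `in_batches` is a hypothetical helper, and `fake_embed_documents` is a stand-in that returns dummy vectors so the example runs without a model. In practice you would call your model's `embed_documents` on each batch instead.

```python
def in_batches(items, batch_size):
    """Yield consecutive slices of items with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def fake_embed_documents(texts):
    """Stand-in for a real embedding call: one dummy vector per text."""
    return [[float(len(text))] for text in texts]


texts = ["Hi there!", "Oh, hello!", "What's your name?", "My friends call me World", "Hello World!"]

vectors = []
calls = 0
for batch in in_batches(texts, batch_size=2):
    vectors.extend(fake_embed_documents(batch))  # one call per batch, not per text
    calls += 1

print(calls, len(vectors))  # 3 5
```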

Let's see a full example:

In [None]:
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings

# Load the document
loader = TextLoader("./test.txt")
doc = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
)
chunks = text_splitter.split_documents(doc)

# Generate an embedding for each chunk
embeddings_model = OllamaEmbeddings(model='mxbai-embed-large')
embeddings = embeddings_model.embed_documents(
    [chunk.page_content for chunk in chunks]
)

Once the embeddings are generated, the next step is to store them in a special database known as a vector store. A vector store is a database designed to store vectors and perform complex calculations, like cosine similarity, efficiently and quickly. 
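
Cosine similarity, mentioned above, compares the direction of two vectors: it is 1 when they point the same way, 0 when they are orthogonal, and -1 when they point in opposite directions. Here is a minimal plain-Python sketch of the calculation. The function is written for illustration only; a vector store computes this far more efficiently.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between vectors a and b."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```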

Vector stores handle unstructured data, like text and images. This makes them different from traditional databases, which are designed to store structured data, like JSON documents or data conforming to the schema of a relational database. Vector stores are capable of performing create, read, update, delete (CRUD), and search operations.

Vector stores allow you to create scalable applications that use AI to answer questions about large documents.

The following image shows how the document embeddings are inserted into the vector store and how later, when a query is sent, similar embeddings are retrieved from the vector store.

![Embedding process](<embeddingProcess.png>)

## A Few Points to Keep in Mind
1. Most vector stores are relatively new and may not stand the test of time.
2. Managing and optimizing vector stores can be complex.
3. Vector stores are not a replacement for traditional databases, but rather a complement to them, as they add complexity to your application and may drain valuable resources.

The good news is that vector store capabilities have recently been extended to PostgreSQL via the **pgvector** extension. This extension allows you to use the same database you're already familiar with to power both your transactional tables and your vector search tables.

For this example, we will be using a docker image of PostgreSQL with the **pgvector** extension installed. This image is available on Docker Hub and can be pulled and run with the following command:

```bash
docker run \
    --name pgvector-container \
    -e POSTGRES_USER=langchain \
    -e POSTGRES_PASSWORD=langchain \
    -e POSTGRES_DB=langchain \
    -p 6024:5432 \
    -d pgvector/pgvector:pg16
```

This command will pull the **pgvector** image from Docker Hub and run it as a container named **pgvector-container**. The container will be accessible on port 6024, with the username **langchain**, password **langchain**, and database **langchain**.

Save the connection string for later use:
`postgresql+psycopg://langchain:langchain@localhost:6024/langchain`
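
If you prefer to assemble the connection string from its parts, a small helper like the one below can do it. `build_connection_string` is a hypothetical helper written for this example, not part of LangChain. It percent-encodes the user name and password so that special characters such as `@` do not break the URL.

```python
from urllib.parse import quote_plus


def build_connection_string(user, password, host, port, database,
                            driver="postgresql+psycopg"):
    """Assemble a database URL from its components."""
    return (
        f"{driver}://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


# Values taken from the docker run command above.
connection = build_connection_string(
    user="langchain",
    password="langchain",
    host="localhost",
    port=6024,
    database="langchain",
)
print(connection)
```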

Let's see a full example using a vector store.

> **Note**: If you see the following error:

```powershell
Exception has occurred: ImportError
no pq wrapper available.
Attempts made:
- couldn't import psycopg 'c' implementation: No module named 'psycopg_c'
- couldn't import psycopg 'binary' implementation: DLL load failed while importing pq: The specified module could not be found.
- couldn't import psycopg 'python' implementation: libpq library not found
```

One of the dependencies may be missing. Try installing the following:

```powershell
pip install "psycopg[binary,pool]"
```

In [4]:
from langchain_community.document_loaders import TextLoader
from langchain_ollama import OllamaEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_postgres.vectorstores import PGVector
from langchain_core.documents import Document
import uuid

# Load the document, split it into chunks
raw_documents = TextLoader('./test.txt').load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000,
                                               chunk_overlap=200)
documents = text_splitter.split_documents(raw_documents)

# embed each chunk and insert it into the vector store
embeddings_model = OllamaEmbeddings(model='mxbai-embed-large')
connection = 'postgresql+psycopg://langchain:langchain@localhost:6024/langchain'
db = PGVector.from_documents(documents, embeddings_model, connection=connection)

The important thing to note here is the last line of code, the call to `PGVector.from_documents`, which does the following:
- Establishes a connection to the Postgres instance running in the Docker container.
- Runs any setup necessary, such as creating tables to hold your documents and vectors, if it's the first time running it.
- Creates the embeddings for each document you passed in, using the model you chose.
- Stores the embeddings, the document's metadata, and the document's text content in Postgres, ready to be searched.

Let's make a simple query:

In [5]:
db.similarity_search("cat", k=4)

[Document(id='b02b4d9e-c989-4777-abc7-b33b684b6ff2', metadata={'source': './test.txt'}, page_content="Upon the shores of shattered fate,\nThe footprints fade, the echoes wait,\nFor those who walk with fearless stride,\nTo learn what lingers, what has died.\n\nThrough ruins draped in ivy's clutch,\nWhere time and nature’s fingers touch,\nThe ship drifts on through silent years,\nPast sorrow’s sigh and fleeting tears.\n\nThe City Beneath the Moon\n\nBeneath a sky of argent hue,\nA city shines in silver dew,\nWhere domes like pearls in starlight gleam,\nAnd spires rise from mist and dream.\n\nThe streets are paved with whispered song,\nEach step a verse, each path prolonged,\nA melody of fate untold,\nA place where hearts are never old.\n\nBut shadows creep where dreams reside,\nA silent force that walks beside,\nFor every joy a sorrow deep,\nFor every dawn a dusk to keep.\n\nThe Forest of Forgotten Names\n\nBeyond the city’s dreaming gate,\nA forest waits in patient state,\nWhere t

At a glance, the *similarity_search* method finds the most relevant documents by following this process:

- The search query (here, the word *cat*) is sent to the embeddings model to retrieve its embedding.
- A query is run on Postgres to find the N (in this case 4) previously stored embeddings that are most similar to the query embedding.
- The text content and metadata related to each of those embeddings are fetched.
- The method returns a list of *Document* objects sorted by their similarity to the query, most similar first.
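
The process above can be imitated in memory with a few lines of plain Python. This is a toy sketch, not how Postgres or pgvector executes the search: `top_k` is a hypothetical function, the two-dimensional vectors are made up for the example, and cosine similarity is assumed as the similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vector, stored, k=4):
    """Return the k stored texts whose vectors are most similar to the query.

    stored is a list of (text, vector) pairs.
    """
    scored = [(cosine_similarity(query_vector, vector), text) for text, vector in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [text for _, text in scored[:k]]


# Made-up vectors: the first axis loosely means "cat-like", the second "pond-like".
stored = [
    ("cats", [1.0, 0.0]),
    ("kittens", [0.9, 0.1]),
    ("ducks", [0.0, 1.0]),
    ("ponds", [0.1, 0.9]),
]

print(top_k([1.0, 0.05], stored, k=2))
# ['cats', 'kittens']
```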

We can add more documents to an existing database. Let's see an example.

In [6]:
ids = [str(uuid.uuid4()), str(uuid.uuid4())]
db.add_documents(
    [
        Document(
            page_content="there are cats in the pond",
            metadata={"location": "pond", "topic": "animals"},
        ),
        Document(
            page_content="ducks are also found in the pond",
            metadata={"location": "pond", "topic": "animals"}
        ),
    ],
    ids=ids,
)

['8b3b1bfe-ea15-4c9f-aa48-45cc3a3217af',
 '9aeb5daf-505e-4a87-ab92-51535eba1a0f']

The *add_documents* method we're using here follows a similar process to *from_documents*:

- It creates an embedding for each document, using the model you chose.
- Stores the embeddings, metadata, and text content in Postgres, ready to be searched.

**ids** is an optional parameter that assigns an identifier to each document, which allows us to *update or delete* those documents later. Read more [here](https://api.python.langchain.com/en/latest/postgres/vectorstores/langchain_postgres.vectorstores.PGVector.html).

Let's see an example: 

In [8]:
db.delete(ids=[ids[1]])

This removes the second document added above by passing its UUID. There are more systematic ways to keep the store up to date, as the next section shows.

## Tracking Changes in Documents
Working with vector stores usually means working with data that changes regularly. Each change requires re-indexing, which can lead to costly recomputation of embeddings and duplication of existing content. LangChain offers an indexing **API** that makes it easier to keep your documents in sync with your vector store. It relies on a class called **RecordManager** to track what has been written. When content is indexed, a hash is computed for each document and the following information is stored in the **RecordManager**:

- The document hash (hash of both page content and metadata)
- Write time
- The source ID (each document should include information in its metadata to determine the ultimate source of this document)
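
To illustrate the first item, here is a minimal sketch of a hash that covers both page content and metadata. It is an assumption for illustration only: `doc_hash` is a hypothetical helper, and the hashing scheme LangChain uses internally may differ.

```python
import hashlib
import json


def doc_hash(page_content, metadata):
    """Return a SHA-256 hex digest over the page content and metadata."""
    # sort_keys makes the result independent of the order of metadata keys.
    payload = json.dumps(
        {"page_content": page_content, "metadata": metadata},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


original = doc_hash("there are cats in the pond", {"id": 1, "source": "cats.txt"})
same = doc_hash("there are cats in the pond", {"source": "cats.txt", "id": 1})
changed = doc_hash("I just modified this document!", {"id": 1, "source": "cats.txt"})

print(original == same)     # True: same content and metadata
print(original == changed)  # False: the content changed
```

Because the hash changes whenever the content or metadata changes, comparing hashes is enough to tell whether a document needs to be written again.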

The indexing API also has cleanup modes that determine how existing documents in the vector store are deleted. For example, if you changed how documents are processed before insertion, or if the source documents have changed, you may want to remove any existing documents that come from the same source as the new documents being indexed. If some source documents have been deleted, you'll want to delete all existing documents in the vector store and replace them with the re-indexed documents.

Here is the list of modes:

- **None** mode doesn't do any automatic cleanup, allowing the user to manually do cleanup of old content.
- **Incremental** and **full** modes delete previous versions of the content if the content of the source document or derived documents has changed.
- **Full** mode will additionally delete any documents not included in documents currently being indexed.
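
The cleanup modes listed above can be sketched with a toy in-memory index. Everything here is a simplified assumption meant to show the idea, not LangChain's implementation: `toy_index` and `fingerprint` are hypothetical names, documents are plain `(content, source)` tuples, and the "store" is a dictionary that maps each hash to its source.

```python
import hashlib


def fingerprint(content, source):
    """Hash a document's content together with its source."""
    return hashlib.sha256(f"{source}\x00{content}".encode("utf-8")).hexdigest()


def toy_index(docs, store, cleanup=None):
    """Write docs into store (hash -> source) and apply the cleanup mode."""
    stats = {"num_added": 0, "num_skipped": 0, "num_deleted": 0}
    batch_hashes, batch_sources = set(), set()

    for content, source in docs:
        digest = fingerprint(content, source)
        batch_hashes.add(digest)
        batch_sources.add(source)
        if digest in store:
            stats["num_skipped"] += 1  # unchanged document, nothing to write
        else:
            store[digest] = source
            stats["num_added"] += 1

    if cleanup in ("incremental", "full"):
        for digest, source in list(store.items()):
            if digest in batch_hashes:
                continue
            # incremental: only drop old versions from sources seen in this batch
            # full: drop everything that is not part of this batch
            if cleanup == "full" or source in batch_sources:
                del store[digest]
                stats["num_deleted"] += 1
    return stats


store = {}
docs = [
    ("there are cats in the pond", "cats.txt"),
    ("ducks are also found in the pond", "ducks.txt"),
]
first = toy_index(docs, store, cleanup="incremental")
second = toy_index(docs, store, cleanup="incremental")
docs[0] = ("I just modified this document!", "cats.txt")
third = toy_index(docs, store, cleanup="incremental")
print(first, second, third, sep="\n")
```

The first run adds both documents, the second skips them because their hashes are already known, and the third adds the modified document and deletes the old version that shares its source.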

In [13]:
from langchain.indexes import SQLRecordManager, index
from langchain_postgres.vectorstores import PGVector
from langchain_ollama import OllamaEmbeddings
from langchain_core.documents import Document

connection = "postgresql+psycopg://langchain:langchain@localhost:6024/langchain"
collection_name = "my_docs"
embeddings_model = OllamaEmbeddings(model='mxbai-embed-large')
namespace = "my_docs_namespace"

vectorstore = PGVector(
    embeddings=embeddings_model,
    collection_name=collection_name,
    connection=connection,
    use_jsonb=True,
)

record_manager = SQLRecordManager(
    namespace,
    db_url=connection,
)

# Create the schema if it doesn't exist
record_manager.create_schema()

# Create documents
docs=[
    Document(page_content="there are cats in the pond", metadata={
        "id": 1, "source": "cats.txt"
    }),
    Document(page_content="ducks are also found in the pond", metadata={
        "id": 2, "source": "ducks.txt"}),
]

# Index the documents
index_1 = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",  # prevent duplicate documents
    source_id_key="source",  # use the source field as the source id
)

print(f"Index attempt 1: {index_1}")

# second time you attempt to index, it will not add the documents again
index_2 = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

print(f"Index attempt 2: {index_2}")

# If we mutate a document, the new version will be written and all old versions sharing the same source will be deleted.

docs[0].page_content = "I just modified this document!"

index_3 = index(
    docs,
    record_manager,
    vectorstore,
    cleanup="incremental",
    source_id_key="source",
)

print(f"Index attempt 3: {index_3}")

Index attempt 1: {'num_added': 2, 'num_updated': 0, 'num_skipped': 0, 'num_deleted': 0}
Index attempt 2: {'num_added': 0, 'num_updated': 0, 'num_skipped': 2, 'num_deleted': 0}
Index attempt 3: {'num_added': 1, 'num_updated': 0, 'num_skipped': 1, 'num_deleted': 1}


This code first creates a record manager to keep track of the documents that have already been indexed. The **index** function is then used to synchronize the vector store with the new list of documents. Because we are using incremental mode, when a document changes, the previously indexed versions that share its source are replaced with the new version.