### Setup

Import the helper functions and load the PDF documents. Each loaded item is stored as a `langchain` `Document` object, and we will explore some of the properties of that class.

In [1]:
from populate_database import *

documents = load_documents()

We know the three PDFs have 16, 116, and 32 pages, for a total of 164. So we can check that the loader splits the documents into one object per page.

In [3]:
len(documents)

164
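A per-file count is a slightly stronger check than the total alone. The sketch below is only an illustration: it uses plain dictionaries as stand-ins for each object's `metadata`, and the file names (`report_a.pdf`, etc.) and the `fake_metadata` list are placeholders, not the real paths. With the real data you would iterate over `doc.metadata` for each `doc` in `documents` instead.

```python
from collections import Counter

# Stand-in metadata: one dict per page, mimicking what the loader attaches
# to each object. File names are placeholders, not the real paths.
page_counts = {"report_a.pdf": 16, "report_b.pdf": 116, "report_c.pdf": 32}
fake_metadata = [
    {"source": name, "page": page}
    for name, n_pages in page_counts.items()
    for page in range(n_pages)
]

# Count how many objects came from each source file.
objects_per_source = Counter(meta["source"] for meta in fake_metadata)

print(len(fake_metadata))        # total number of page objects
print(dict(objects_per_source))  # objects per file
```

If the loader really produces one object per page, the per-file counts should match the known page counts of each PDF.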

Check out the metadata per object.

We see a dictionary with the file name and the original page number. Note that page numbers are zero-based (Pythonic) and will likely not match the page numbers printed in the PDF, which may include front matter, appendices, etc.

We could add anything we want to this dictionary, including URLs, authors, year of publication, etc. Later, when running the model and receiving responses with citations, we will see how adding more metadata would be useful here.

In [14]:
print(documents[10].metadata)
print(documents[32].metadata)
print(documents[-1].metadata)

{'source': 'data/Alaskas-Changing-Environment-2024.pdf', 'page': 10}
{'source': 'data/Alaskas-Changing-Environment_2019_WEB.pdf', 'page': 0}
{'source': 'data/ArcticReportCard_full_report2024.pdf', 'page': 115}
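Since the metadata is an ordinary dictionary, extending it is straightforward. The following is a minimal sketch, not code from this repo: `enrich_metadata` is a hypothetical helper, and the `page_label` and `year` keys are illustrative names chosen here. It works on a copy of the dictionary printed above so the original is left untouched.

```python
def enrich_metadata(metadata, **extra):
    """Return a copy of `metadata` with a 1-based page label plus any extra fields."""
    enriched = dict(metadata)                       # copy, so the input is not mutated
    enriched["page_label"] = metadata["page"] + 1   # zero-based index -> 1-based label
    enriched.update(extra)                          # e.g. year, authors, URL
    return enriched


original = {"source": "data/Alaskas-Changing-Environment-2024.pdf", "page": 10}
enriched = enrich_metadata(original, year=2024)     # hypothetical extra field
print(enriched)
```

The same pattern could be applied to every object after loading, before the text is split, so that each chunk inherits the richer metadata.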


It's possible to give document objects a unique ID, but it is not required. Here we didn't assign any, so the attribute is empty.

In [12]:
documents[10].id

Now let's look at the page content for the object. If you compare to the PDF report, you will see that this is all of the text on the entire page, even text associated with graphics. Here we can see that some of the information lacks context because we don't have any images or specific spatial arrangement of the text.

In [13]:
print(documents[10].page_content)

11
FebruaryOctober 15 May 15
2021–2022 winter12 inches of snow each day
0
2022–2023 winter
0
12 inches of snow each day
2023–2024 winter
0
12 inches of snow each day
Spotlight event: Heavy snow
Winter warming in recent decades has been 
significant, but so far, much of Alaska remains cold 
enough for most winter precipitation to fall as snow. 
With warmer ocean temperatures, more moisture 
is available to evaporate, and when the ingredients 
come together, heavy snowfalls can occur.
In Anchorage, the past three winters all had one 
or more snowstorms producing more than a foot 
of snow in a 24-hour period. Following heavy snow 
in November 2023, numerous Anchorage streets 
went unplowed for days due to a mismatch between 
city and state road maintenance. Residents were 
stranded in homes, and businesses were unable to 
open without employees. In 2024, several dozen roofs 
collapsed and Anchorage officials warned more than 
1,000 commercial property owners that their roofs 
were at risk

In this format, these pages are simply too big to make use of. The text needs to be split into smaller chunks before it is embedded. While the embedder we chose (`nomic-embed-text`) has a long context length of up to 8192 tokens, making it usable for long text blocks, we still want to chunk the text into smaller units to improve our search capabilities.

For PDFs, we are using `langchain`'s `RecursiveCharacterTextSplitter` class, which attempts to keep larger units (e.g., paragraphs) intact. If a unit exceeds the chunk size, the splitter moves to the next level down (e.g., sentences). This is an example of [text-structure based splitting](https://python.langchain.com/docs/concepts/text_splitters/#text-structured-based), which uses natural hierarchical units of text (paragraphs, sentences, words) to split the text up.
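To make the recursive idea concrete, here is a simplified standard-library sketch. It is not `langchain`'s implementation: the function name `recursive_split` is hypothetical, the separator list (paragraph break, line break, space, single characters) is an assumption made for this illustration, and chunk overlap is deliberately left out.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split `text` into pieces of at most `chunk_size` characters,
    preferring the coarsest separator that keeps pieces small enough."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, *rest = separators
    if sep == "":
        # Last resort: hard cut every `chunk_size` characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)       # flush what we have so far
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece is still too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb cccc\n\ndddd eeee"
for chunk in recursive_split(sample, chunk_size=10):
    print(repr(chunk))
```

In this sketch the first paragraph is too long for a 10-character limit, so it is split again on spaces, while the second paragraph fits and stays intact.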

In other branches of this repo, we intend to explore [document-structure based splitting](https://python.langchain.com/docs/concepts/text_splitters/#document-structured-based) on structured files like Markdown, JSON, or HTML, which uses the features unique to those formats to split the text along the document's logical organization.

For now, let's see the chunking when we use a limit of 800 characters with an 80-character overlap.

In [15]:
chunks = split_documents(documents)
print(len(chunks))

536


Let's look at the chunks derived from the page content above. We can see that the page has been split, and that there is some overlap between the two chunks.

In [20]:
print(chunks[34].metadata)
print(chunks[34].page_content)
print("\n")
print(chunks[35].metadata)
print(chunks[35].page_content)

{'source': 'data/Alaskas-Changing-Environment-2024.pdf', 'page': 10}
11
FebruaryOctober 15 May 15
2021–2022 winter12 inches of snow each day
0
2022–2023 winter
0
12 inches of snow each day
2023–2024 winter
0
12 inches of snow each day
Spotlight event: Heavy snow
Winter warming in recent decades has been 
significant, but so far, much of Alaska remains cold 
enough for most winter precipitation to fall as snow. 
With warmer ocean temperatures, more moisture 
is available to evaporate, and when the ingredients 
come together, heavy snowfalls can occur.
In Anchorage, the past three winters all had one 
or more snowstorms producing more than a foot 
of snow in a 24-hour period. Following heavy snow 
in November 2023, numerous Anchorage streets 
went unplowed for days due to a mismatch between 
city and state road maintenance. Residents were


{'source': 'data/Alaskas-Changing-Environment-2024.pdf', 'page': 10}
city and state road maintenance. Residents were 
stranded in homes, and business
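The overlap between two consecutive chunks can also be found programmatically. The sketch below is an illustration only: `shared_overlap` is a hypothetical helper, and the two strings are short excerpts copied from the end of the first chunk and the start of the second chunk in the output above, not the full chunk text.

```python
def shared_overlap(first, second):
    """Return the longest suffix of `first` that is also a prefix of `second`."""
    max_len = min(len(first), len(second))
    for size in range(max_len, 0, -1):   # try the longest candidate first
        if first.endswith(second[:size]):
            return second[:size]
    return ""


# Excerpts from the two chunks printed above (tail of one, head of the next).
chunk_a_tail = (
    "went unplowed for days due to a mismatch between \n"
    "city and state road maintenance. Residents were"
)
chunk_b_head = (
    "city and state road maintenance. Residents were \n"
    "stranded in homes"
)

overlap = shared_overlap(chunk_a_tail, chunk_b_head)
print(repr(overlap))
print(len(overlap))
```

Here the shared text is a single line, which is shorter than the 80-character overlap limit, likely because the splitter only carries over whole pieces that fit within that limit.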

Just like the document objects, the chunks don't have unique IDs unless you assign them.

In [21]:
chunks[35].id

The functions in this codebase do assign unique IDs to chunks, based on document name, page number, and chunk number on that page. Here we add a unique ID to each chunk's metadata dictionary.

In [22]:
chunks_with_ids = calculate_chunk_ids(chunks)
print(chunks_with_ids[35].metadata["id"])

data/Alaskas-Changing-Environment-2024.pdf:10:1


So chunk #35 is from the Alaska's Changing Environment 2024 document, page index 10 (the 11th page, since pages are zero-based), chunk index 1 (the second chunk on that page). As mentioned previously, we can use a custom function here to associate as much metadata as we want with this chunk, so that if the LLM uses this chunk of text to create a response to a question, we can provide rich metadata at the same time and point the user back to the exact source of the information: not just the name of the source document, but the page number, paragraph, etc.
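The ID format shown in the output (`source:page:chunk index`) can be reproduced with a few lines of code. This is a sketch of one way such IDs might be computed, not the repo's actual `calculate_chunk_ids`: the function name `assign_chunk_ids` is hypothetical, it works on plain metadata dictionaries rather than chunk objects, and it assumes chunks from the same page are adjacent in the list.

```python
def assign_chunk_ids(chunk_metadata):
    """Add an "id" of the form source:page:index to each metadata dict,
    where index counts chunks within the same page, starting at 0."""
    last_page_id = None
    index = 0
    for meta in chunk_metadata:
        page_id = f"{meta['source']}:{meta['page']}"
        # Same page as the previous chunk: increment; new page: restart at 0.
        index = index + 1 if page_id == last_page_id else 0
        meta["id"] = f"{page_id}:{index}"
        last_page_id = page_id
    return chunk_metadata


source = "data/Alaskas-Changing-Environment-2024.pdf"
metadata = [
    {"source": source, "page": 10},
    {"source": source, "page": 10},
    {"source": source, "page": 11},
]
for meta in assign_chunk_ids(metadata):
    print(meta["id"])
```

Because the ID encodes the source, page, and position on the page, it also gives a stable key for detecting whether a chunk has already been added to the database.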