### Setup

Import the functions and load the Northern Climate Reports (NCR) text documents. The documents are stored as `langchain` `Document` objects, and we will explore some of the properties of that class.

In [1]:
from populate_database import *

documents = load_documents()

With the PDFs (16, 116, and 32 pages), we checked that the loader split each document into one object per page. Here we check how many objects the loader returns for the text files.

In [2]:
len(documents)

5

Check out the metadata per object.

We see a dictionary with the file name, but since this is unstructured text we have no page number.

In this branch, we're going to use a lookup table to add the source URL to the metadata dictionary. (See `calculate_chunk_ids()` below.)

In [3]:
print(documents[0].metadata)

{'source': 'data/Utqiagvik.txt'}


Now let's look at the page content for the object. If you compare to the NCR website, you will see that this is all of the text on the entire page, even text associated with graphics. Here we can see that some of the information lacks context because we don't have any images or specific spatial arrangement of the text.

In [4]:
print(documents[0].page_content)

University of Alaska Fairbanks  |  Alaska Climate Adaptation Science Center

Home

About

Credits

Data

Places

Northern Climate Reports

Ecological futures in stories, charts, and data

A changing climate is altering Northern landscapes.

Explore these changes with easy-to-understand climate model projections.

Projected Conditions for Utqiaġvik (Barrow)

In Utqiaġvik (Barrow),

average annual temperatures

may increase by about 18°F by the end of the century.

Winter temperatures are projected to increase the most (+31°F).

Models have higher uncertainty with regard to precipitation,

but summer is likely to have more precipitation (+76%).

In the past, this area had very low flammability. Future flammability may be about the same.

Based on climate and fire-driven shifts, vegetation in this area may be notably different by the end of the century.

Summary across all models and scenarios. See tables and sections below for more detailed information and definitions.

Introduction

The

For unstructured text, we use `langchain`'s `RecursiveCharacterTextSplitter` class, just as with the PDFs. It attempts to keep larger units (e.g., paragraphs) intact.

In other branches of this repo, we intend to explore [document-structure based splitting](https://python.langchain.com/docs/concepts/text_splitters/#document-structured-based) on structured files like Markdown, JSON, or HTML. That approach uses the features unique to each format to split the document along its logical organization.

For now, let's see the chunking when we use a limit of 400 characters with a 40-character overlap. This is smaller than with the PDFs, because the text on this website is condensed and there are no long-form paragraphs.

In [5]:
chunks = split_documents(documents)
print(len(chunks))

283


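Let's look at some chunks derived from the page content above. We can see that the page has been split, but there is not actually any overlap between consecutive chunks. We have not confirmed the cause, but a likely explanation is that the recursive splitter breaks the text at separators such as paragraph breaks and only carries whole pieces into the overlap. Each paragraph here is longer than 40 characters, so none of them would fit.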

In [6]:
print(chunks[3].metadata)
print(chunks[3].page_content)
print("\n")
print(chunks[4].metadata)
print(chunks[4].page_content)
print("\n")
print(chunks[5].metadata)
print(chunks[5].page_content)

{'source': 'data/Utqiagvik.txt'}
Introduction

The tables and charts below are specific to the gridded data extracted from the location of Utqiaġvik (Barrow), indicated by a marker on the map above. The shaded region on the map is the nearest watershed (hydrological unit, level 12) and is only used to summarize wildfire data around this place.


{'source': 'data/Utqiagvik.txt'}
The average (mean) elevation near this point is 20ft above sea level. The minimum elevation is 0ft and the maximum elevation is 59ft, which should be kept in mind when interpreting these results.


{'source': 'data/Utqiagvik.txt'}
The sections below show output from different scientific simulations of possible future conditions for temperature, precipitation, hydrology, flammability, and vegetation change. These simulations use different Global Climate Models (GCMs)—climate models—such as the National Center for Atmospheric Research Community Climate System Model 4.0 (NCAR CCSM4).
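
To make the missing overlap easier to reason about, here is a simplified, pure-Python sketch of packing whole paragraphs into chunks with an overlap budget. It is an illustration only, not `langchain`'s implementation; the function name `merge_with_overlap` and the sample strings are hypothetical. It assumes the splitter only carries whole pieces (never partial ones) into the next chunk.

```python
def merge_with_overlap(pieces, chunk_size, chunk_overlap, sep="\n\n"):
    """Greedily pack whole pieces into chunks.

    Trailing pieces are carried into the next chunk only if together
    they fit within chunk_overlap (hypothetical, simplified logic).
    """
    def length(parts):
        # Total characters including the separators between parts.
        return sum(len(p) for p in parts) + len(sep) * max(len(parts) - 1, 0)

    chunks, current = [], []
    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(sep.join(current))
            # Drop pieces from the front until the remainder fits the
            # overlap budget and leaves room for the incoming piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)
    if current:
        chunks.append(sep.join(current))
    return chunks


# Paragraphs longer than the overlap budget: nothing can be carried over.
long_paragraphs = ["A" * 300, "B" * 200, "C" * 350]
long_chunks = merge_with_overlap(long_paragraphs, chunk_size=400, chunk_overlap=40)

# Short pieces (30 characters each): the last piece of one chunk fits
# within the overlap budget and is repeated at the start of the next.
short_pieces = [f"piece {i} ".ljust(30, ".") for i in range(4)]
short_chunks = merge_with_overlap(short_pieces, chunk_size=100, chunk_overlap=40)

for chunk in short_chunks:
    print(repr(chunk))
```

Under this assumption, overlap only appears when the text contains pieces shorter than the overlap budget, which would be consistent with the long paragraphs in the chunks above.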


Here we add a URL to the chunk metadata dictionary.

In [8]:
chunks_with_ids = calculate_chunk_ids(chunks)
print(chunks_with_ids[0].metadata)

{'source': 'data/Utqiagvik.txt', 'url': 'https://northernclimatereports.org/report/community/AK418'}


Great! If the LLM uses this chunk of text to create a response to a question, we can point the user back to the exact source of the information.
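
The lookup-table idea can be sketched as follows. This is a minimal illustration and may differ from what `calculate_chunk_ids()` actually does; the `Chunk` dataclass is a hypothetical stand-in for a `langchain` document, `add_source_urls` and `data/unknown.txt` are made-up names, and the only table entry is the source/URL pair shown in the output above.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a langchain document (content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Lookup table from source file to report URL. Only this pair appears in the
# notebook output; other files would need their own entries.
SOURCE_TO_URL = {
    "data/Utqiagvik.txt": "https://northernclimatereports.org/report/community/AK418",
}


def add_source_urls(chunks, lookup=SOURCE_TO_URL):
    """Add a 'url' key to each chunk whose source is in the lookup table."""
    for chunk in chunks:
        url = lookup.get(chunk.metadata.get("source"))
        if url is not None:
            chunk.metadata["url"] = url
    return chunks


example = add_source_urls([
    Chunk("Introduction", {"source": "data/Utqiagvik.txt"}),
    Chunk("Some other text", {"source": "data/unknown.txt"}),  # not in the table
])
print(example[0].metadata)
print(example[1].metadata)
```

Chunks whose source is missing from the table are left unchanged here, so a missing entry shows up as an absent `url` key rather than an error.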