### Setup

Import the functions and load the NCR markdown document. The document is loaded as a list of `langchain` `Document` objects, and we will explore some of the properties of that class.

In [15]:
from populate_database import *

documents = load_documents()

We've combined the individual markdown file text into one file (`data/ncr_data.md`) that has all communities organized under their own headers.

In [16]:
len(documents)

1

Now let's look at the page content for the object. If you compare to the NCR website, you will see that this is all of the top-level data-driven text for all of the communities we are using in this project. Note how the data is organized hierarchically under headers, and that each subheader repeats the community and topic. This redundancy is intentional: it keeps topic and location context in each individual section, so that the text in each chunk does not become dislocated from the text providing the spatial context.

In [17]:
print(documents[0])

# Anchorage (Dgheyaytnu)
General Information: In Anchorage (Dgheyaytnu), The average (mean) elevation near this point is 69ft above sea level. The minimum elevation is 30ft and the maximum elevation is 121ft, which should be kept in mind when interpreting these results.
## Projected Conditions for Anchorage (Dgheyaytnu)
Projected Conditions for Anchorage (Dgheyaytnu): The sections below show output from different scientific simulations of possible future conditions for temperature, precipitation, hydrology, permafrost, flammability, and vegetation change. These simulations use different Global Climate Models (GCMs)—climate models—such as the National Center for Atmospheric Research Community Climate System Model 4.0 (NCAR CCSM4).

These climate models use Representative Concentration Pathways (RCPs) to compare different future greenhouse gas emissions scenarios. Compared to current emissions RCP 4.5 is a scenario representing a reduction in global emissions, while RCP 8.5 represents a 

For structured text, we are using `langchain`'s `MarkdownHeaderTextSplitter` class, which keeps the individual sections of the markdown document intact. Here we explicitly split the text according to our own logic, rather than letting a generic text splitter make those decisions.

This is the [document-structure based splitting](https://python.langchain.com/docs/concepts/text_splitters/#document-structured-based) mentioned in the other repo branches. It works on structured files like Markdown, JSON, or HTML, using the features unique to each format to define the logical organization used in splitting. Here, we don't set any chunking parameters except for a list of markdown header levels to split on, each paired with a name that becomes its metadata key, e.g.:

```
headers_to_split_on = [
    ("#", "Community"),
    ("##", "Topic"),
    ("###", "Subtopic"),
]
```
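To make the idea concrete, here is a minimal pure-Python sketch of header-based splitting. It is a simplified stand-in written for illustration, not `langchain`'s implementation and not the code behind `split_documents`; the function name `split_on_headers`, the dictionary-shaped chunks, and the placeholder sample text are all assumptions.

```python
import re

headers_to_split_on = [
    ("#", "Community"),
    ("##", "Topic"),
    ("###", "Subtopic"),
]


def split_on_headers(text, headers):
    """Split markdown text into one chunk per header section (illustrative only)."""
    # Map each header marker to its depth and the metadata key it should fill.
    levels = {marker: (len(marker), name) for marker, name in headers}
    chunks = []
    active = {}  # depth -> (metadata key, header title) for the headers currently in scope
    lines = []   # body lines collected since the last header

    def flush():
        # Emit the collected body text together with the headers currently in scope.
        content = "\n".join(lines).strip()
        if content:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            chunks.append({"metadata": metadata, "page_content": content})
        lines.clear()

    for line in text.splitlines():
        match = re.match(r"^(#+)\s+(.*)$", line)
        if match and match.group(1) in levels:
            flush()
            level, name = levels[match.group(1)]
            # A new header closes any header at the same or a deeper level.
            for depth in [d for d in active if d >= level]:
                del active[depth]
            active[level] = (name, match.group(2).strip())
        else:
            lines.append(line)
    flush()
    return chunks


# Hypothetical two-section sample in the same layout as the NCR file.
sample = """# Anchorage (Dgheyaytnu)
General Information: placeholder text.
## Projected Conditions for Anchorage (Dgheyaytnu)
Projected Conditions for Anchorage (Dgheyaytnu): placeholder text.
"""

sample_chunks = split_on_headers(sample, headers_to_split_on)
for chunk in sample_chunks:
    print(chunk["metadata"])
```

Each chunk inherits the titles of every header above it, which is why a `##` section carries both a `Community` and a `Topic` key in its metadata.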

Let's try it out.

In [18]:
chunks = split_documents(documents)
print(len(chunks))

35


In [19]:
chunks

[Document(metadata={'Community': 'Anchorage (Dgheyaytnu)'}, page_content='General Information: In Anchorage (Dgheyaytnu), The average (mean) elevation near this point is 69ft above sea level. The minimum elevation is 30ft and the maximum elevation is 121ft, which should be kept in mind when interpreting these results.'),
 Document(metadata={'Community': 'Anchorage (Dgheyaytnu)', 'Topic': 'Projected Conditions for Anchorage (Dgheyaytnu)'}, page_content='Projected Conditions for Anchorage (Dgheyaytnu): The sections below show output from different scientific simulations of possible future conditions for temperature, precipitation, hydrology, permafrost, flammability, and vegetation change. These simulations use different Global Climate Models (GCMs)—climate models—such as the National Center for Atmospheric Research Community Climate System Model 4.0 (NCAR CCSM4).  \nThese climate models use Representative Concentration Pathways (RCPs) to compare different future greenhouse gas emissions

Now we add a URL to the chunk metadata dictionary.

In [20]:
chunks_with_ids = calculate_chunk_ids(chunks)
print(chunks_with_ids[0].metadata)
print(chunks_with_ids[0].page_content)

{'Community': 'Anchorage (Dgheyaytnu)', 'url': 'https://northernclimatereports.org/report/community/AK15'}
General Information: In Anchorage (Dgheyaytnu), The average (mean) elevation near this point is 69ft above sea level. The minimum elevation is 30ft and the maximum elevation is 121ft, which should be kept in mind when interpreting these results.


Great! If the LLM uses this chunk of text to create a response to a question, we can point the user back to the exact source of the information.
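The URL step above could be sketched roughly as follows. This is not the actual body of `calculate_chunk_ids`, which is not shown here; the lookup table `COMMUNITY_IDS`, the helper `add_source_urls`, and the dictionary-shaped chunks are hypothetical, and only the Anchorage URL is taken from the output above.

```python
BASE_URL = "https://northernclimatereports.org/report/community/"

# Hypothetical lookup from community name to report ID (only this entry appears in the output above).
COMMUNITY_IDS = {"Anchorage (Dgheyaytnu)": "AK15"}


def add_source_urls(chunks, community_ids=COMMUNITY_IDS):
    """Attach a source URL to each chunk whose community has a known ID."""
    for chunk in chunks:
        community = chunk["metadata"].get("Community")
        community_id = community_ids.get(community)
        if community_id is not None:
            chunk["metadata"]["url"] = BASE_URL + community_id
    return chunks


# A stand-in chunk shaped like the metadata printed above.
example = [
    {
        "metadata": {"Community": "Anchorage (Dgheyaytnu)"},
        "page_content": "General Information: placeholder text.",
    }
]

tagged = add_source_urls(example)
print(tagged[0]["metadata"]["url"])
```

Because the URL lives in the metadata rather than in the page content, it can travel with the chunk through retrieval and be shown next to the answer without affecting the text that gets embedded.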