### Setup

Import the functions and load the CSVs. The documents will be stored as `langchain` `Document` objects, and we will explore some of the properties of that class.

In [1]:
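# Import only the helpers used in this notebook rather than everything
from populate_database import load_documents, calculate_chunk_ids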

documents = load_documents()

The CSV loader splits each CSV by row, so that each row becomes its own document object. For this reason, we reformatted the CSVs (in `reformat_csvs.ipynb`) so that every row of the dataset includes the CSV field metadata (as a description) and a new column for location. This is highly redundant, but it ensures that every single document object keeps its contextual information.
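To make the row-per-document behavior concrete, here is a minimal standard-library sketch that approximates it. It is not the `langchain` loader itself: the helper `rows_to_documents`, the file name `data/example.csv`, and the second sample row are hypothetical, and plain dictionaries stand in for document objects.

```python
import csv
import io

# Hypothetical two-row CSV standing in for one of the reformatted files.
# The second row's value is a placeholder, not real data.
SAMPLE_CSV = """season,model,mean,Location
DJF,CRU-TS40,-25.5,Utqiaġvik (Barrow)
MAM,CRU-TS40,0.0,Utqiaġvik (Barrow)
"""


def rows_to_documents(csv_text, source):
    """Turn each CSV row into a dict with `page_content` and `metadata`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    documents = []
    for i, row in enumerate(reader):
        # One "key: value" line per column, mirroring the output shown below
        page_content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": page_content, "metadata": {"source": source, "row": i}}
        )
    return documents


docs = rows_to_documents(SAMPLE_CSV, "data/example.csv")
print(len(docs))            # one document per data row
print(docs[0]["page_content"])
print(docs[0]["metadata"])
```

Because each row is turned into a document independently, any context that should travel with a row (such as the location) has to be present in that row, which is why the reformatting step repeats it everywhere.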

In [2]:
len(documents)

2000

Now let's look at the page content of the first document object. The CSV row appears as `key: value` pairs, one per line, including the location and description.

In [3]:
print(documents[0])

page_content='date_range: 1950_2009
season: DJF
model: CRU-TS40
scenario: Historical
variable: tas
mean: -25.5
min: -30.7
max: -19.0
median: -25.4
hi_std: -23.1
lo_std: -27.9
q1: -26.9
q3: -23.8
Location: Utqiaġvik (Barrow)
Description: Location: Utqiaġvik (Barrow); View a report for this location at https://earthmaps.io/temperature/point/71.2905/-156.789; tas is the mean annual near-surface air temperature in degrees Celsius for the specified model and scenario; model is the global climate model used to generate the data; scenario is the emissions scenario used to generate the data; mean is the mean of annual means; median is the median of annual means; max is the maximum annual mean; min is the minimum annual mean; q1 is the first quartile of the annual means; q3 is the third quartile of the annual means; hi_std is the mean + standard deviation of annual means; lo_std is the mean - standard deviation of annual means; DJF is December - February; MAM is March - May; JJA is June - Augus

Note that we don't need to split these documents into chunks! The dataset is small enough that we can leave each row as its own chunk. If we had a CSV with hundreds or thousands of rows, we might think about chunking them.

Now we want to add a URL to each chunk's metadata dictionary. We can parse the place name from the filename in the `source` field and use it to find the matching URL in our lookup table.

In [4]:
print(documents[0].metadata)

{'source': 'data/Utqiaġvik (Barrow) Temperature.csv', 'row': 0}
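The filename-to-URL step described above could look roughly like the sketch below. This is an illustration, not the actual body of `calculate_chunk_ids`: the helper `add_location_and_url` and the table `URL_LOOKUP` are hypothetical names, and the sketch assumes every filename follows the pattern `<place name> <variable>.csv`.

```python
import os

# Hypothetical lookup table; the single entry is the URL that appears in
# this notebook's output.
URL_LOOKUP = {
    "Utqiaġvik (Barrow)": "https://northernclimatereports.org/report/community/AK418",
}


def add_location_and_url(metadata, url_lookup):
    """Return a copy of `metadata` with 'Location' and 'URL' keys added."""
    filename = os.path.basename(metadata["source"])  # strip the directory
    stem = os.path.splitext(filename)[0]             # strip the '.csv' extension
    # Assumption: the last word of the stem is the variable name
    # (e.g. 'Temperature'), and everything before it is the place name.
    location = stem.rsplit(" ", 1)[0]
    enriched = dict(metadata)                        # leave the input untouched
    enriched["Location"] = location
    enriched["URL"] = url_lookup.get(location)       # None if the place is unknown
    return enriched


meta = {"source": "data/Utqiaġvik (Barrow) Temperature.csv", "row": 0}
print(add_location_and_url(meta, URL_LOOKUP))
```

Using `dict.get` means a place missing from the table yields `None` instead of raising an error; whether that is the right behavior depends on how strictly the pipeline should treat unmatched files.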


In [5]:
chunks_with_ids = calculate_chunk_ids(documents)
print(chunks_with_ids[0].metadata)
print(chunks_with_ids[0].page_content)

{'source': 'data/Utqiaġvik (Barrow) Temperature.csv', 'row': 0, 'Location': 'Utqiaġvik (Barrow)', 'URL': 'https://northernclimatereports.org/report/community/AK418'}
date_range: 1950_2009
season: DJF
model: CRU-TS40
scenario: Historical
variable: tas
mean: -25.5
min: -30.7
max: -19.0
median: -25.4
hi_std: -23.1
lo_std: -27.9
q1: -26.9
q3: -23.8
Location: Utqiaġvik (Barrow)
Description: Location: Utqiaġvik (Barrow); View a report for this location at https://earthmaps.io/temperature/point/71.2905/-156.789; tas is the mean annual near-surface air temperature in degrees Celsius for the specified model and scenario; model is the global climate model used to generate the data; scenario is the emissions scenario used to generate the data; mean is the mean of annual means; median is the median of annual means; max is the maximum annual mean; min is the minimum annual mean; q1 is the first quartile of the annual means; q3 is the third quartile of the annual means; hi_std is the mean + standard

Great! If the LLM uses this CSV row to answer a question, we can point the user back to the exact source of the information.