# Goal: build a RAG pipeline for trail directions

## requirements
- show the source of each piece of information
- provide an easy chat interface
- output easy-to-use trail directions

## technical outline
1. Load the white_blaze PDF data.
2. Use LangChain to load and split the PDF data into more usable chunks.
3. Use OpenAI to rewrite the chunks as propositions, which are easier for a machine to interpret.
4. Create text embeddings for the propositions so they can be stored in a vector database.
5. Run Redis and define schemas for the stored proposition embeddings.
6. Create a simple chat for querying the new RedisVL database with questions.
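
Steps 4 to 6 can be sketched without any external services. The example below is only an illustration of the retrieval idea: `toy_embed` is a bag-of-words stand-in for a real embedding model, the in-memory `records` list stands in for Redis, and the two propositions are hypothetical. Each record keeps its source file so an answer can show where it came from.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words term count."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical propositions; each keeps the file it came from so the
# chat interface can show the source of its answer.
SOURCE = "docs/white_blaze_sample_quick_info.pdf"
records = [
    {"text": "Neel Gap, GA is a resupply location.", "source": SOURCE},
    {"text": "Damascus, VA has a map available.", "source": SOURCE},
]
for record in records:
    record["vector"] = toy_embed(record["text"])


def search(question, k=1):
    """Return the k records most similar to the question."""
    query = toy_embed(question)
    ranked = sorted(records, key=lambda r: cosine(query, r["vector"]), reverse=True)
    return ranked[:k]


best = search("Where can I resupply near Neel Gap?")[0]
print(best["text"], "| source:", best["source"])
```

In the real pipeline, the toy embedding would be replaced by model-generated vectors and the list by a Redis index, but the shape of the flow (embed, store with source, rank by similarity) stays the same.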

In [1]:
# Boilerplate for loading resources

import os

# Load list of pdfs
data_path = "docs/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

# read the pdfs and turn them into something we can use

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

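# Start with the quick-info sample to see if the approach is viable
doc = [doc for doc in docs if "white_blaze_sample_quick_info" in doc][0]

# mode="single" returns the whole file as one document
loader = UnstructuredFileLoader(
    doc, mode="single", strategy="fast"
)

# Split into chunks of at most 2500 characters, with no overlap between chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)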

Listing available documents ... ['docs/at-1.pdf', 'docs/white_blaze_sample.pdf', 'docs/white_blaze_sample_quick_info.pdf']


In [3]:
doc

'docs/white_blaze_sample_quick_info.pdf'

In [2]:
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)



Done preprocessing. Created 8 chunks of the original pdf docs/white_blaze_sample_quick_info.pdf
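
To make the effect of `chunk_size` and `chunk_overlap=0` concrete, here is a simplified, hypothetical splitter. It is not LangChain's implementation: `RecursiveCharacterTextSplitter` falls back through several separators, while this sketch uses only one. It does show the core idea of greedily packing pieces into chunks that stay under a size limit.

```python
def split_text(text, chunk_size, separator="\n\n"):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size characters (a single piece longer than that stays whole)."""
    chunks, current = [], ""
    for piece in text.split(separator):
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            # Adding this piece would exceed the limit: close the chunk
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Resupply\n\nNeel Gap, GA\n\nHiawassee, GA\n\nFranklin, NC"
small_chunks = split_text(sample, chunk_size=25)
for chunk in small_chunks:
    print(repr(chunk))
```

With no overlap, every character of the input lands in exactly one chunk, which keeps the later proposition step from seeing the same text twice.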


In [3]:
chunks

[Document(page_content='Resupply\n\nResupply locations along the Appalachian Trail ATTENTION: For more detailed information read write-up under mileage in book and see maps. Shaded entries are 1.0 miles or less from the Appalachian Trail that are full resupplies or PO’s.\n\n~Designates map available = e Location Suches, GA~e Neel Gap, GA Blairsville, GA Dahlonega, GA Helen, GA~e Hiawassee, GA~e Franklin, NC~e NOC, NC~e Stecoah Gap, NC (NC. 143)~e Robbinsville, NC~e Fontana Village, NC~e Gatlinburg, TN~e Cherokee, NC Davenport Gap, TN~e Green Corner Road~e Hot Springs, NC~e Log Cabin Rd~e Sams Gap, TN~e Uncle Johnny’s Nolichucky Hostel~e Erwin, TN~e Elk Park, NC~e Roan Mountain, TN~e Scotty’s Budget Hostel Dennis Cove, TN~e Shook Branch Road~e Hampton, TN~e Shady Valley, TN~e Damascus, VA~e Troutdale, VA~e Sugar Grove, VA\n\nNOBO Mile 20.5 31.3 31.3 31.3 52.5 69.2 109.4 136.7 150.5 150.5 165.9 207.7 207.7 239.2 241.5 274.9 291.2 319.7 344.3 344.3 395.3 395.3 407.4 420.0 428.5 428.6 455.

In [None]:
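import json

import openai
import tqdm

# Assumption: CHAT_MODEL is not defined in the cells shown here. Set it to any
# OpenAI chat model that supports JSON mode (see response_format below).
# The OpenAI client reads the API key from the OPENAI_API_KEY environment variable.
CHAT_MODEL = "gpt-4o-mini"
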

def create_dense_props(chunk):
    """Create dense representation of raw text content."""
    # The system message here should be HEAVILY customized for your specific use case
    SYSTEM_PROMPT = """
    You are a helpful PDF extractor tool. You will be presented with segments from
    a pdf trail guide of the Appalachian Trail.

    Decompose and summarize the raw content into clear and simple propositions,
    ensuring they are interpretable out of context. Consider the following rules:
    1. Split compound sentences into simpler dense phrases that retain existing
    meaning.
    2. Simplify technical jargon or wording if possible while retaining existing
    meaning.
    3. For any named entity that is accompanied by additional descriptive information,
    separate this information into its own distinct proposition.
    4. Decontextualize the proposition by adding necessary modifiers to nouns or
    entire sentences and replacing pronouns (e.g., "it", "he", "she", "they", "this", "that")
    with the full name of the entities they refer to.
    5. Present the results as a list of strings, formatted in JSON, under the key "propositions".
    """

    response = openai.OpenAI().chat.completions.create(
        model=CHAT_MODEL,
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Decompose this raw content using the rules above:\n{chunk.page_content} "}
        ]
    )
    res = response.choices[0].message.content
    return json.loads(res)["propositions"]



# One list of propositions per chunk (a list of lists)
props = [
    create_dense_props(chunk) for chunk in tqdm.tqdm(chunks)
]
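
Two follow-up concerns are worth sketching: the model may return something other than the expected JSON, and `props` is a list of lists that loses track of where each proposition came from. The helpers below are hypothetical (neither `parse_propositions` nor `flatten_with_source` appears in the notebook) and the sample responses are made up. They show one way to parse defensively and to keep a source label next to every proposition.

```python
import json


def parse_propositions(raw):
    """Return the "propositions" list from a JSON string, or [] if the
    string is not valid JSON or lacks a list under that key."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return []
    props = data.get("propositions") if isinstance(data, dict) else None
    if not isinstance(props, list):
        return []
    return [p for p in props if isinstance(p, str)]


def flatten_with_source(props_per_chunk, sources):
    """Pair every proposition with the source of the chunk it came from."""
    return [
        {"text": prop, "source": source}
        for chunk_props, source in zip(props_per_chunk, sources)
        for prop in chunk_props
    ]


# Hypothetical model responses, one per chunk (the second one is malformed)
raw_responses = [
    '{"propositions": ["Neel Gap, GA is a resupply location."]}',
    "not json",
]
props_per_chunk = [parse_propositions(raw) for raw in raw_responses]
rows = flatten_with_source(props_per_chunk, ["chunk 0", "chunk 1"])
print(rows)
```

Returning an empty list for a bad response lets the loop over all chunks finish; whether to skip, retry, or raise instead is a design choice that depends on how much data loss is acceptable.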