## Goal

The goal of this exploration is to find a logical way to chunk prompt input for a large language model (LLM).

Our current application of this process will be taking parsed website information into something that we can prompt an LLM with to answer a question.


## Background:

ChatGPT: 4096 characters (about 500 words)

**Tips for dealing with output char limit**: [source](https://www.androidauthority.com/chatgpt-character-limit-3292997/)
- If ChatGPT stops prompting an answer because of its output char limit, then prompt "Continue"
- You can't ask ChatGPT to write beyond its char limit
- Break down prompt by heading/title/piece
- Ask for an outline
- Regenerate response

LLama: [source](https://huggingface.co/docs/transformers/model_doc/llama)
- Llama 1: 2048 tokents
- Llama 2: 4096
- CodeLlama: 16384

Cohere:
- 100000 char max [source](https://txt.cohere.com/command-usecase-patterns/#:~:text=It%20supports%20a%20much%20longer,endpoint%20at%20100%2C000%20characters%20maximum.)

Claude:
- 100K tokens/prompt [source](https://zapier.com/blog/claude-ai/#:~:text=Claude%202's%20unique%20selling%20point,amount%20offered%20by%20GPT%2D4.)

In [151]:
# Read in fake article on bananas (thanks ChatGPT)

file_path = "bananas.txt"
with open(file_path, "r") as f:
    text = f.read()

print(type(text))
escapes = ''.join([chr(char) for char in range(1, 32)])
translator = str.maketrans('', '', escapes)
t = text.translate(translator)
t.replace("\\n", "")

<class 'str'>


'The Versatile Banana: Nature\'s Perfect SnackBananas are one of the most popular and beloved fruits in the world. Their distinctive yellow color, convenient peel-and-eat packaging, and sweet, creamy taste make them a favorite among people of all ages. But beyond their delicious flavor and easy-to-carry nature, bananas have a fascinating history and a wide range of health benefits that have earned them a special place in the world of fruits.\\A Brief History of BananasBananas have a long and storied history that dates back thousands of years. They are believed to have originated in Southeast Asia and have been cultivated in regions as diverse as Africa, the Middle East, and the Americas. It wasn\'t until the 19th century that bananas gained popularity in the Western world, primarily due to efforts by entrepreneurs like Lorenzo Dow Baker and Minor C. Keith, who developed modern banana plantations in Central America. Today, bananas are a global commodity, with countries like India, China

In [173]:
# Chunk by number of characters (4092)

chunks = []

current_chunk = ""
# run through chars
for char in t:
    current_chunk += char
    if (len(current_chunk) == 4096):
        # print(current_chunk)
        chunks.append(current_chunk)
        current_chunk = ""
for chunk in chunks:
    # print(len(chunk))
    print(chunk)
    print("\n\n\n\n\n")

The Versatile Banana: Nature's Perfect SnackBananas are one of the most popular and beloved fruits in the world. Their distinctive yellow color, convenient peel-and-eat packaging, and sweet, creamy taste make them a favorite among people of all ages. But beyond their delicious flavor and easy-to-carry nature, bananas have a fascinating history and a wide range of health benefits that have earned them a special place in the world of fruits.\A Brief History of BananasBananas have a long and storied history that dates back thousands of years. They are believed to have originated in Southeast Asia and have been cultivated in regions as diverse as Africa, the Middle East, and the Americas. It wasn't until the 19th century that bananas gained popularity in the Western world, primarily due to efforts by entrepreneurs like Lorenzo Dow Baker and Minor C. Keith, who developed modern banana plantations in Central America. Today, bananas are a global commodity, with countries like India, China, an

In [169]:
# chunk by sentences, not just chars

sentences = t.split(".") # removes periods from character count, not perfect, but shouldn't affect word processing too much
# print(sentences)

chunks = []
current_chunk = ""
chunk_chars = 0

for s in sentences:
    if (chunk_chars + len(s) <= 4096):
        current_chunk += s
        chunk_chars += len(s)
    else:
        chunks.append(current_chunk)
        current_chunk = ""
        chunk_chars = 0

for chunk in chunks:
    print(len(chunk))

4063
3816
3708
4087
3954
4082
The Versatile Banana: Nature's Perfect SnackBananas are one of the most popular and beloved fruits in the world Their distinctive yellow color, convenient peel-and-eat packaging, and sweet, creamy taste make them a favorite among people of all ages But beyond their delicious flavor and easy-to-carry nature, bananas have a fascinating history and a wide range of health benefits that have earned them a special place in the world of fruits\A Brief History of BananasBananas have a long and storied history that dates back thousands of years They are believed to have originated in Southeast Asia and have been cultivated in regions as diverse as Africa, the Middle East, and the Americas It wasn't until the 19th century that bananas gained popularity in the Western world, primarily due to efforts by entrepreneurs like Lorenzo Dow Baker and Minor C Keith, who developed modern banana plantations in Central America Today, bananas are a global commodity, with countrie

In [170]:
# chunking by section
# we'll probably need more info/text structure to get this to work

## Trying it out with the model
Question for the prompt: When were bananas discovered?

**ChatGPT, without prompting:**

Bananas are believed to have been first domesticated in Southeast Asia, particularly in the region that includes Malaysia, Indonesia, and the Philippines. The exact date of their discovery is not well-documented, but it is estimated that bananas have been cultivated for thousands of years. They have a long history of cultivation and consumption by various civilizations throughout Asia, and their cultivation likely dates back to at least 6,000 to 8,000 years ago.

Bananas were later spread to other parts of the world through trade and exploration. They were introduced to Africa by Arab traders and to the Americas by European explorers in the 15th and 16th centuries. Today, bananas are one of the most popular and widely consumed fruits in the world.

**ChatGPT, char chunks, then question:**

Bananas have a long history and were first discovered and cultivated in Southeast Asia. The exact timeline of banana cultivation and domestication is challenging to pinpoint due to the fruit's long history and the limited preservation of historical records. However, archaeological and botanical evidence suggests that bananas have been cultivated for thousands of years.

Bananas were likely first domesticated in regions that are now part of Malaysia, Indonesia, and the Philippines. From there, they spread to other parts of the world through trade and exploration. It's believed that Arab merchants introduced bananas to Africa and the Middle East around the 9th century. Later, European explorers and traders played a significant role in bringing bananas to the Americas during the Age of Exploration.

Overall, bananas have a rich history dating back thousands of years, and they have evolved into the popular and versatile fruit we know today.



## I asked ChatGPT to help me write the code to do this:

In [174]:
!pip3 install beautifulsoup4
!pip3 install lxml

[33mDEPRECATION: Loading egg at /usr/local/Cellar/gpgme/1.22.0/lib/python3.11/site-packages/gpg-1.22.0-py3.11-macosx-13-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330[0m[33m
[33mDEPRECATION: Loading egg at /usr/local/Cellar/gpgme/1.22.0/lib/python3.11/site-packages/gpg-1.22.0-py3.11-macosx-13-x86_64.egg is deprecated. pip 24.3 will enforce this behaviour change. A possible replacement is to use pip for package installation.. Discussion can be found at https://github.com/pypa/pip/issues/12330[0m[33m


In [194]:
import requests

# The URL of the web page you want to fetch
url = "https://en.wikipedia.org/wiki/Banana"  # Replace with the URL of the web page you want to retrieve

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Get the HTML content from the response
    html_content = response.text

    # Now, you can work with the HTML content as needed, such as parsing it or saving it to a file.
    # For example, if you want to save it to a file:
    with open("web_page.html", "w", encoding="utf-8") as file:
        file.write(html_content)
else:
    print(f"Failed to retrieve the web page. Status code: {response.status_code}")


In [211]:
from bs4 import BeautifulSoup

with open("web_page.html", "r", encoding="utf-8") as file:
    html_content = file.read()

soup = BeautifulSoup(html_content, "lxml")

unwanted_elements = soup.find_all("div", class_="unwanted-class")
for element in unwanted_elements:
    element.decompose()

for script in soup(["script", "noscript", "meta", "link", "span", "button", "label", "style"]):  # You can also remove <noscript> tags if needed
    script.extract()

# Extract and concatenate the title, headers (h1, h2, h3, etc.), and article text
extracted_text = ""

# Extract the title (inside the <title> tag)
title = soup.title.string if soup.title else ""
extracted_text += title + "\n\n"

# Extract headers (h1, h2, h3, etc.)
headers = soup.find_all(["h1", "h2", "h3", "h4", "h5", "h6"])
for header in headers:
        # Add the header text to the extracted text
        extracted_text += header.get_text() + "\n\n"

        # Find the next sibling elements (usually the article text)
        sibling = header.find_next_sibling()
        while sibling and sibling.name not in ["h1", "h2", "h3", "h4", "h5", "h6"]:
            # Add the text of the sibling element to the extracted text
            extracted_text += sibling.get_text() + "\n\n"
            sibling = sibling.find_next_sibling()

# Now, you can save the extracted text to a text file
with open("extracted_text.txt", "w", encoding="utf-8") as file:
    file.write(extracted_text)

In [212]:
with open("extracted_text.txt", "r") as f:
    article = f.read()
print(article)

Banana - Wikipedia

Contents


















Young plant

The banana plant is the largest herbaceous flowering plant.[10] All the above-ground parts of a banana plant grow from a structure usually called a "corm".[11] Plants are normally tall and fairly sturdy with a treelike appearance, but what appears to be a trunk is actually a "false stem" or pseudostem. Bananas grow in a wide variety of soils, as long as the soil is at least 60 centimetres (2.0 ft) deep, has good drainage and is not compacted.[12] Banana plants are among the fastest growing of all plants, with daily surface growth rates recorded from 1.4 square metres (15 sq ft) to 1.6 square metres (17 sq ft).[13][14]


The leaves of banana plants are composed of a stalk (petiole) and a blade (lamina). The base of the petiole widens to form a sheath; the tightly packed sheaths make up the pseudostem, which is all that supports the plant. The edges of the sheath meet when it is first produced, making it tubular. As new growth o

In [213]:
import re

def split_text_into_chunks(text, max_tokens=4096):
    # Split the text into paragraphs or other meaningful divisions
    paragraphs = re.split(r'\n\s*\n', text)
    
    # Initialize variables to keep track of the current chunk and token count
    current_chunk = ''
    current_chunk_tokens = 0
    chunks = []
    
    for paragraph in paragraphs:
        # Split each paragraph into sentences
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?|\!)\s', paragraph)
        
        for sentence in sentences:
            sentence_tokens = len(sentence.split())
            
            # If adding the sentence to the current chunk won't exceed the token limit
            if current_chunk_tokens + sentence_tokens <= max_tokens:
                current_chunk += sentence + ' '
                current_chunk_tokens += sentence_tokens
            else:
                # If adding the sentence would exceed the token limit, start a new chunk
                chunks.append(current_chunk)
                current_chunk = sentence + ' '
                current_chunk_tokens = sentence_tokens
    
    # Add the last chunk
    chunks.append(current_chunk)
    
    return chunks

max_tokens_per_chunk = 512  # Set your desired maximum token limit per chunk
article_chunks = split_text_into_chunks(article, max_tokens_per_chunk)

# Print the resulting chunks
for i, chunk in enumerate(article_chunks):
    print(f"Chunk {i + 1} (Tokens: {len(chunk.split())}):\n")
    print(chunk)
    print("\n")


Chunk 1 (Tokens: 508):

Banana - Wikipedia Contents Young plant The banana plant is the largest herbaceous flowering plant.[10] All the above-ground parts of a banana plant grow from a structure usually called a "corm".[11] Plants are normally tall and fairly sturdy with a treelike appearance, but what appears to be a trunk is actually a "false stem" or pseudostem. Bananas grow in a wide variety of soils, as long as the soil is at least 60 centimetres (2.0 ft) deep, has good drainage and is not compacted.[12] Banana plants are among the fastest growing of all plants, with daily surface growth rates recorded from 1.4 square metres (15 sq ft) to 1.6 square metres (17 sq ft).[13][14] The leaves of banana plants are composed of a stalk (petiole) and a blade (lamina). The base of the petiole widens to form a sheath; the tightly packed sheaths make up the pseudostem, which is all that supports the plant. The edges of the sheath meet when it is first produced, making it tubular. As new growth