## Notebook Goal - Forecasting token usage
This notebook estimates a ballpark number of tokens needed to classify the ~28k PLRs we have in hand.
- It does this by calculating the average number of tokens in a typical PLR using the tiktoken library.

In [1]:
import tiktoken
import fitz
import shutil, random, os

Test tiktoken

In [3]:
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo-1106")

text = "This is an example sentence to count tokens."
token_count = len(encoding.encode(text))
print(f"The text contains {token_count} tokens.")

The text contains 9 tokens.


Check the token count of the PLR literature. This literature will be used to give our GPT model context about PLRs.

In [7]:
# convert PDF to text
pdf_to_convert = fitz.open("/Users/st414/Documents/PLR/plr_literature.pdf")
full_text = ""
for page in pdf_to_convert:
    text = page.get_text()
    full_text += text
# get token count
token_count = len(encoding.encode(full_text))
print(f"The PLR literature text contains {token_count} tokens.")

The PLR literature text contains 3316 tokens.
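
If the literature is sent as context with every classification request, its 3316 tokens are paid once per PLR. That is an assumption: this notebook does not say how the context will be supplied. A rough sketch of the resulting overhead, treating the "~28k" figure as exactly 28,000:

```python
# Rough overhead estimate. Assumes the full literature text is included
# in every request, which this notebook does not confirm.
CONTEXT_TOKENS = 3316  # literature token count measured above
N_PLRS = 28_000        # "~28k" rounded; the exact count is an assumption

context_overhead = CONTEXT_TOKENS * N_PLRS
print(f"Context overhead across all PLRs: {context_overhead:,} tokens")
```

If the context were instead supplied once, or shortened, this overhead would shrink accordingly.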


Run several iterations over randomly selected samples of PLRs to get an average number of tokens per PLR

In [4]:
# dirpath = '/Volumes/erds_marks_plr/marks_plr_downloads/PLR_scraping/files_definite_plr'
destDirectory = "/Users/st414/Documents/PLR/sample_plrs"
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
sample_size = 30
iters = 10

for i in range(iters):
    token_total = 0
    # draw a fresh random sample (without replacement) for each iteration
    filenames = random.sample(os.listdir(destDirectory), sample_size)

    for fname in filenames:
        srcpath = os.path.join(destDirectory, fname)
        pdf_to_convert = fitz.open(srcpath)
        # shutil.copyfile(srcpath, destDirectory)

        full_text = ""
        for page in pdf_to_convert:
            text = page.get_text()
            full_text += text
        # get token count
        token_count = len(encoding.encode(full_text))
        # print(token_count)
        token_total += token_count

    # print(f"total tokens {token_total}")
    token_avg = token_total / sample_size
    print(f"Token Avg for Iteration {i+1}: {token_avg} tokens.")

Token Avg for Iteration 1: 2224.3 tokens.
Token Avg for Iteration 2: 2244.0333333333333 tokens.
Token Avg for Iteration 3: 3214.5666666666666 tokens.
Token Avg for Iteration 4: 2097.366666666667 tokens.
Token Avg for Iteration 5: 2877.6 tokens.
Token Avg for Iteration 6: 2356.266666666667 tokens.
Token Avg for Iteration 7: 2694.5333333333333 tokens.
Token Avg for Iteration 8: 2846.5 tokens.
Token Avg for Iteration 9: 1963.0666666666666 tokens.
Token Avg for Iteration 10: 2721.133333333333 tokens.
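
The per-iteration averages above can be combined into a single ballpark figure for the full set. The sketch below copies the printed averages (rounded to one decimal place), takes their mean, and scales it to 28,000 PLRs, which is an assumed exact value for "~28k". It counts PLR text tokens only: prompt instructions, context, and model output tokens are not included.

```python
from statistics import mean

# Per-iteration averages printed above, rounded to one decimal place.
iteration_avgs = [2224.3, 2244.0, 3214.6, 2097.4, 2877.6,
                  2356.3, 2694.5, 2846.5, 1963.1, 2721.1]
N_PLRS = 28_000  # "~28k" rounded; the exact count is an assumption

avg_tokens = mean(iteration_avgs)
projected = avg_tokens * N_PLRS

# The lowest and highest iteration averages give a crude range for the projection.
low = min(iteration_avgs) * N_PLRS
high = max(iteration_avgs) * N_PLRS

print(f"Average tokens per PLR: {avg_tokens:.0f}")
print(f"Projected PLR tokens: {projected:,.0f} (range {low:,.0f} to {high:,.0f})")
```

This works out to roughly 2,524 tokens per PLR, or about 70.7 million tokens for the PLR text alone. Because each iteration samples from the same folder, the samples can overlap, so the range is only a loose indication of uncertainty.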
