In [14]:
%run useful_functions.ipynb

# Overview
In a previous notebook, `text_splitter_eval.ipynb`, we pickled a list of chunked texts. Now it's time to load the differently chunked texts into vector stores so that we can persist them and query them.


# Load the chunked text
Load the chunked texts from the pickled file. The table below shows that we split the text with four differently configured splitters: three are `CharacterTextSplitter` and one is `NLTKTextSplitter`.

In [3]:
import pickle

from IPython.display import display, Markdown

# Load the list of chunking results pickled by text_splitter_eval.ipynb
with open('chunks_list.pkl', 'rb') as f:
    chunked_texts = pickle.load(f)

# Create the markdown table
markdown_table = "| Index | Splitter | Number of Chunks | Chunk Size | Overlap Size |\n|---|---|---|---|---|\n"
for i, chunk in enumerate(chunked_texts):
    markdown_table += "| " + " | ".join(str(x) for x in [i, chunk['type'], len(chunk['chunks']), chunk['chunk_size'], chunk['overlap_size']]) + " |\n"

# Display the markdown table
display(Markdown(markdown_table))


| Index | Splitter | Number of Chunks | Chunk Size | Overlap Size |
|---|---|---|---|---|
| 0 | CharacterTextSplitter | 751 | 100 | 20 |
| 1 | CharacterTextSplitter | 75 | 1000 | 200 |
| 2 | CharacterTextSplitter | 16 | 4000 | 200 |
| 3 | NLTKTextSplitter | 16 | 3894 | 0 |


Let's take our splits and embed them. We'll be using `OpenAIEmbeddings` for no particular reason other than that the docs use them. They cost money, though, so we should evaluate open-source alternatives that might be as good or better at no cost.

Sadly, figuring out the cost can be a tad difficult: pricing is per token, and the token count depends on the encoding each model uses.

In [17]:

from IPython.display import display, Markdown

# Returns a list of dictionaries, one per LLM model, each holding the model name and the encoding its tokenizer uses.
encoding_lookup_dict = utils_get_llm_names_and_encoding()

# Get the keys from the first dictionary to use as column headers
headers = encoding_lookup_dict[0].keys()

# Create the markdown table
markdown_table = "| " + " | ".join(headers) + " |\n| " + " | ".join("---" for _ in headers) + " |\n"
for item in encoding_lookup_dict:
    markdown_table += "| " + " | ".join(str(item[key]) for key in headers) + " |\n"

# Display the markdown table
display(Markdown(markdown_table))


| model | encoding |
| --- | --- |
| gpt-4 | cl100k_base |
| gpt-3.5-turbo | cl100k_base |
| text-davinci-003 | p50k_base |
| text-davinci-002 | p50k_base |
| text-davinci-001 | r50k_base |
| text-curie-001 | r50k_base |
| text-babbage-001 | r50k_base |
| text-ada-001 | r50k_base |
| davinci | r50k_base |
| curie | r50k_base |
| babbage | r50k_base |
| ada | r50k_base |
| code-davinci-002 | p50k_base |
| code-davinci-001 | p50k_base |
| code-cushman-002 | p50k_base |
| code-cushman-001 | p50k_base |
| davinci-codex | p50k_base |
| cushman-codex | p50k_base |
| text-davinci-edit-001 | p50k_edit |
| code-davinci-edit-001 | p50k_edit |
| text-embedding-ada-002 | cl100k_base |
| text-similarity-davinci-001 | r50k_base |
| text-similarity-curie-001 | r50k_base |
| text-similarity-babbage-001 | r50k_base |
| text-similarity-ada-001 | r50k_base |
| text-search-davinci-doc-001 | r50k_base |
| text-search-curie-doc-001 | r50k_base |
| text-search-babbage-doc-001 | r50k_base |
| text-search-ada-doc-001 | r50k_base |
| code-search-babbage-code-001 | r50k_base |
| code-search-ada-code-001 | r50k_base |
| gpt2 | gpt2 |
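
Since `text-embedding-ada-002` appears in the table with the `cl100k_base` encoding, a rough cost estimate comes down to counting tokens per chunk and multiplying by a per-token price. The sketch below is only an illustration under stated assumptions: `estimate_embedding_cost`, `whitespace_token_count`, and `EXAMPLE_USD_PER_1K_TOKENS` are hypothetical names, the price is a placeholder rather than a real quote, and the whitespace counter is a crude stand-in for a real tokenizer.

```python
from typing import Callable, Iterable, Tuple


def estimate_embedding_cost(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    usd_per_1k_tokens: float,
) -> Tuple[int, float]:
    """Return (total_tokens, estimated_cost_usd) for embedding every chunk once."""
    total_tokens = sum(count_tokens(chunk) for chunk in chunks)
    return total_tokens, total_tokens / 1000 * usd_per_1k_tokens


def whitespace_token_count(text: str) -> int:
    # Crude stand-in tokenizer: counts whitespace-separated words.
    # Real token counts from a model encoding will differ.
    return len(text.split())


# HYPOTHETICAL price for illustration only; check the provider's current pricing.
EXAMPLE_USD_PER_1K_TOKENS = 0.0001

sample_chunks = ["the quick brown fox", "jumps over the lazy dog"]
tokens, cost = estimate_embedding_cost(
    sample_chunks, whitespace_token_count, EXAMPLE_USD_PER_1K_TOKENS
)
print(f"{tokens} tokens, estimated cost ${cost:.7f}")
```

If the `tiktoken` package is installed, passing `lambda text: len(tiktoken.get_encoding("cl100k_base").encode(text))` as `count_tokens` should give counts that match the encoding listed above, and the same function could then be run over each entry's `chunk['chunks']` to compare the four splitter configurations.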


In [None]:
!pip install openai

In [8]:
from langchain.embeddings import OpenAIEmbeddings

# Create an OpenAIEmbeddings instance with your callback handler
embedding = OpenAIEmbeddings(callbacks=callback_handler)


ValidationError: 1 validation error for OpenAIEmbeddings
callbacks
  extra fields not permitted (type=value_error.extra)

Before making any OpenAI call, we need to supply the OpenAI API key. This is handled by the `preamble` notebook.
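
The contents of `preamble.ipynb` are not shown here, so the following is only a guess at the kind of check such a notebook might perform. It assumes the key lives in the `OPENAI_API_KEY` environment variable; `get_openai_api_key` is a hypothetical helper, not something taken from the preamble itself.

```python
import os
from typing import Mapping, Optional


def get_openai_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the API key from the environment, failing early if it is missing."""
    # Assumption: the key is stored in the OPENAI_API_KEY environment variable.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the notebook."
        )
    return key


# Demonstration with a fake mapping so no real secret is needed.
demo_key = get_openai_api_key({"OPENAI_API_KEY": " sk-example "})
print(f"key loaded, {len(demo_key)} characters")
```

Failing early like this gives a clearer message than an authentication error raised deep inside the first embedding call.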

In [4]:
%run preamble.ipynb

INFO:numexpr.utils:Note: NumExpr detected 16 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
INFO:numexpr.utils:NumExpr defaulting to 8 threads.
