# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or an operation using generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/2/overview-of-embeddings-based-retrieval

In [1]:
# https://huggingface.co/nvidia/NV-Embed-v2

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nvidia/NV-Embed-v2", trust_remote_code=True)


In [2]:
import torch
# from sentence_transformers import SentenceTransformer

# Each query needs to be accompanied by a corresponding instruction describing the task.
task_name_to_instruct = {"example": "Given a question, retrieve passages that answer the question",}

query_prefix = "Instruct: "+task_name_to_instruct["example"]+"\nQuery: "
queries = [
    'are judo throws allowed in wrestling?', 
    'how to become a radiology technician in michigan?'
    ]

# No instruction needed for retrieval passages
passages = [
    "Since you're reading this, you are probably someone from a judo background or someone who is just wondering how judo techniques can be applied under wrestling rules. So without further ado, let's get to the question. Are Judo throws allowed in wrestling? Yes, judo throws are allowed in freestyle and folkstyle wrestling. You only need to be careful to follow the slam rules when executing judo throws. In wrestling, a slam is lifting and returning an opponent to the mat with unnecessary force.",
    "Below are the basic steps to becoming a radiologic technologist in Michigan:Earn a high school diploma. As with most careers in health care, a high school education is the first step to finding entry-level employment. Taking classes in math and science, such as anatomy, biology, chemistry, physiology, and physics, can help prepare students for their college studies and future careers.Earn an associate degree. Entry-level radiologic positions typically require at least an Associate of Applied Science. Before enrolling in one of these degree programs, students should make sure it has been properly accredited by the Joint Review Committee on Education in Radiologic Technology (JRCERT).Get licensed or certified in the state of Michigan."
]

# load model with tokenizer
# model = SentenceTransformer('nvidia/NV-Embed-v2', trust_remote_code=True)
model.max_seq_length = 32768
model.tokenizer.padding_side="right"

def add_eos(input_examples):
  input_examples = [input_example + model.tokenizer.eos_token for input_example in input_examples]
  return input_examples

# get the embeddings
batch_size = 2
query_embeddings = model.encode(add_eos(queries), batch_size=batch_size, prompt=query_prefix, normalize_embeddings=True)
passage_embeddings = model.encode(add_eos(passages), batch_size=batch_size, normalize_embeddings=True)

scores = (query_embeddings @ passage_embeddings.T) * 100
print(scores.tolist())




[[87.97085571289062, 0.47974157333374023], [1.026376485824585, 86.3504409790039]]
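
The scores above are dot products of embeddings that were L2-normalized (`normalize_embeddings=True`), which makes each score a cosine similarity scaled by 100. The following is a minimal sketch of that equivalence in plain Python; the 3-dimensional vectors and the helper names (`l2_normalize`, `dot`, `cosine_similarity`) are made up for illustration and are not part of the notebook.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two (not necessarily normalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Hypothetical toy "embeddings"; real embeddings have many more dimensions
toy_query = [1.0, 2.0, 2.0]
toy_passage = [2.0, 1.0, 2.0]

# Dot product of normalized vectors ...
score_normalized = dot(l2_normalize(toy_query), l2_normalize(toy_passage))
# ... matches the cosine similarity of the raw vectors
score_cosine = cosine_similarity(toy_query, toy_passage)

print(score_normalized * 100, score_cosine * 100)
```

This is why the matching query/passage pairs score close to 100 while the mismatched pairs score close to 0.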


In [1]:
import os
import openai
import sys
import chromadb

from openai import OpenAI
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb.config import Settings
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from dotenv import load_dotenv, find_dotenv
from pypdf import PdfReader
from helper_utils_02 import word_wrap

sys.path.append('../..')

#### Embedding function

In [None]:
# Default model for SentenceTransformerEmbeddingFunction is "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

#### SentenceTransformersTokenTextSplitter hyper-parameters

In [3]:
# The SentenceTransformers embedding model's context window is 256 tokens at most.
# Any text beyond that limit is truncated.
tokens_per_chunk = 256
tokens_chunk_overlap = int(tokens_per_chunk * 0.2)
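
With the values above, `tokens_chunk_overlap` is `int(256 * 0.2) = 51`, so consecutive chunks share 51 tokens. The sketch below illustrates the general sliding-window idea on a toy token list. It is a simplified illustration, not LangChain's implementation, and `chunk_with_overlap` is a hypothetical helper.

```python
def chunk_with_overlap(tokens, tokens_per_chunk, chunk_overlap):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")
    stride = tokens_per_chunk - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            break  # the last window already reached the end of the text
        start += stride
    return chunks

# Toy example: 12 integer "tokens", windows of 5 that overlap by 1
toy_tokens = list(range(12))
toy_chunks = chunk_with_overlap(toy_tokens, tokens_per_chunk=5, chunk_overlap=1)
for chunk in toy_chunks:
    print(chunk)
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some tokens twice.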

#### Persistence directory

In [4]:
# Store embedding database
is_persistent = True

# Specify the directory to store the database files
persist_dir = './persist_dir/'

#### Load pdf to text

In [5]:
reader = PdfReader("microsoft_annual_report_2022.pdf")

# Extracting text from pages and strip leading and trailing space characters
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [None]:
total_pdf_pages = len(pdf_texts)

print(f'type(reader): {type(reader)}')
print(f'type(pdf_texts): {type(pdf_texts)}')
print(f'number of pdf pages: {total_pdf_pages}')
print(f'\npdf page 1: {pdf_texts[0][:200]} .... {pdf_texts[0][-200:]}')

#### You can view the pdf in your browser [here](./microsoft_annual_report_2022.pdf) if you would like. 

#### https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846

#### Split the pdf text on the separators, in order, to satisfy chunk_size

In [7]:
# creates an instance of RecursiveCharacterTextSplitter
character_splitter = RecursiveCharacterTextSplitter(
    # List of delimiters used to split the text.
    # The text is split on the separators in order until the pieces satisfy chunk_size.
    # "": an empty string would effectively split on every character.
    # separators=["\n\n", "\n", ". ", " ", ""],
    separators=["\n\n", "\n", ". ", " "],  # "" separator removed so text is never split on single characters
    chunk_size=1000,
    chunk_overlap=0
)

#  concatenate pdf pages using "\n\n" (two newlines) as a delimiter.
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
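
To make "split in the order of the separators" concrete, here is a simplified sketch of the recursive idea in plain Python. It is not LangChain's actual implementation: `recursive_split` is a hypothetical helper, it drops empty pieces, and it ignores `chunk_overlap`.

```python
def recursive_split(text, separators, chunk_size):
    """Split `text` into chunks of at most `chunk_size` characters,
    trying the coarsest separator first and finer ones only when needed."""
    if len(text) <= chunk_size or not separators:
        # Short enough, or nothing left to split on (may exceed chunk_size)
        return [text]
    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        if not part:
            continue  # skip empty pieces produced by repeated separators
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep growing the chunk
            continue
        if current:
            chunks.append(current)  # flush the chunk built so far
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Too long even on its own: fall back to the next separator
            chunks.extend(recursive_split(part, finer, chunk_size))
    if current:
        chunks.append(current)
    return chunks

toy_text = "aaa bbb\n\nccc ddd eee\n\nfff"
toy_pieces = recursive_split(toy_text, ["\n\n", "\n", " "], chunk_size=8)
print(toy_pieces)
```

In the toy text, the middle paragraph is too long for one chunk, so only that paragraph is split further on spaces. Because the `""` separator is left out, a single word longer than `chunk_size` would be returned as one oversized piece.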

In [None]:
_n_chunk = 300
total_character_split_texts_chunks = len(character_split_texts)

print(f'The {total_pdf_pages} pdf pages are split into {total_character_split_texts_chunks} chunks\n')
print(f'Text in chunk[{_n_chunk}]: {character_split_texts[_n_chunk]}')
print(f"\nNumber of characters in chunk[{_n_chunk}]: {len(character_split_texts[_n_chunk])}")

#### Split the text into token-sized chunks, an early step in a RAG pipeline

In [33]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=tokens_chunk_overlap, tokens_per_chunk=tokens_per_chunk)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

total_token_split_texts_chunks = len(token_split_texts)

In [None]:
print(f'The {total_character_split_texts_chunks} character_split_texts chunks are split into {total_token_split_texts_chunks} token_split_texts chunks')
print(f'token_split_texts[{_n_chunk}]: {token_split_texts[_n_chunk]}')

#### Example of embedding text: a single word and a chunk of text both produce vectors of the same dimension

In [None]:
# _texts = [['hello'], [token_split_texts[10]]]
_texts = [['hello'], [' '], ['world'], ['hello world']]
for _text in _texts:
  # emb_text = [token_split_texts[10]]
  emb_text = _text
  emb_vector = embedding_function(emb_text)[0]  # list has only 1 vector 
  emb_vector_dim = len(emb_vector)

  print(f'text to be embedded: {_text}')
  print(f'embedding vector: {emb_vector}')
  print(f'embedding vector dimension: {emb_vector_dim}\n')
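
One way to see why the dimension does not depend on text length: models such as all-MiniLM-L6-v2 are documented to average (mean-pool) the per-token vectors into a single vector. The sketch below assumes mean pooling and uses made-up 4-dimensional token vectors; `mean_pool` is a hypothetical helper, not the library's code.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]

# Hypothetical token vectors: one token versus three tokens
one_token = [[1.0, 0.0, 2.0, 4.0]]
three_tokens = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 0.0, 2.0, 0.0],
    [2.0, 3.0, 2.0, 2.0],
]

pooled_one = mean_pool(one_token)
pooled_three = mean_pool(three_tokens)

# Both results have the same length, regardless of how many tokens went in
print(len(pooled_one), len(pooled_three))
print(pooled_three)
```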

In [None]:
import numpy as np

_texts = [['hello world']]
# _texts = [['hello'], [' '], ['world']]
summed_emb_vector2 = np.zeros(len(embedding_function(_texts[0])[0]))

for _text in _texts:
    emb_text = _text
    emb_vector = embedding_function(emb_text)[0]
    summed_emb_vector2 += emb_vector

print(summed_emb_vector2)


In [None]:
import numpy as np

# _texts = [['hello'], [' '], ['world'], ['hello world']]
_texts = [['hello'], [' '], ['world']]
summed_emb_vector = np.zeros(len(embedding_function(_texts[0])[0]))

for _text in _texts:
    emb_text = _text
    emb_vector = embedding_function(emb_text)[0]
    summed_emb_vector += emb_vector

print(summed_emb_vector)


In [None]:
# The embedding function expects a list of documents and returns a list of vectors
vect_hello = np.array(embedding_function(['hello'])[0])  # list has only 1 vector
vect_space = np.array(embedding_function([' '])[0])  # list has only 1 vector
vect_world = np.array(embedding_function(['world'])[0])  # list has only 1 vector
vect_hello_world = np.array(embedding_function(['hello world'])[0])  # list has only 1 vector


print(f'vect_hello:       {vect_hello[:5]} ...')
print(f'vect_space:       {vect_space[:5]} ...')
print(f'vect_world:       {vect_world[:5]} ...')
print(f'vect_hello_world: {vect_hello_world[:5]} ...')

In [None]:
hello, space, world = 'hello', ' ', 'world'  # define the strings before embedding them
vect_hello = np.array(embedding_function([hello])[0])  # list has only 1 vector
vect_space = np.array(embedding_function([space])[0])  # list has only 1 vector
vect_world = np.array(embedding_function([world])[0])  # list has only 1 vector
vect_hello_world = np.array(embedding_function(['hello world'])[0])  # list has only 1 vector

print(f'vect_hello:       {vect_hello[:5]} ...')
print(f'vect_space:       {vect_space[:5]} ...')
print(f'vect_world:       {vect_world[:5]} ...')
print(f'vect_hello_world: {vect_hello_world[:5]} ...')

In [None]:
import numpy as np
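# np.add takes two operands; a third positional argument is treated as the output
# array and would be overwritten, so add the three vectors with + instead
vect_sum = vect_hello + vect_space + vect_world
vect_net = vect_sum - vect_hello_world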
# print(vect_sum)
vect_net

#### Create a collection and add documents to it

In [3]:
# Create a Chroma client
client = chromadb.Client(Settings(is_persistent=is_persistent, persist_directory=persist_dir))  

# Create or retrieve a collection
collection_name = "microsoft_annual_report_2022"
collection = client.get_or_create_collection(name=collection_name, embedding_function=embedding_function)

# Add documents to the collection. Note: add() does not overwrite documents whose ids
# already exist; use collection.upsert() if existing documents should be replaced.
ids = [str(i) for i in range(len(token_split_texts))]
collection.add(ids=ids, documents=token_split_texts)

#### Retrieve a collection

In [21]:
# Create or retrieve a collection
collection_name = "microsoft_annual_report_2022"
collection = client.get_or_create_collection(name=collection_name, embedding_function=embedding_function)

# Retrieve the whole embedding database
results = collection.get()

In [None]:
results_keys = []
print("results is a dictionary. Here are its keys and values:")
for k, v in results.items():
  results_keys.append(k)
  print(f'key: {k}, value: {v}')

print(f'\nresults keys:\n{results_keys}')

# Get the first document
print(f'\nText in the first document:')
print(f'results["documents"][0]: {results["documents"][0]}')

#### Get a subset of a collection

In [None]:
ids_to_retrieve = ["0", "1"]
subset = collection.get(
  ids=ids_to_retrieve,
)
print(f'subset of collection by ids:')
subset

In [19]:
queries = [
    "What was the total revenue from the income statement?",
    "What was the operating margin?",
    "What is the Operating income from the Income Statements",
    "What is the cost of goods sold from the Income Statement",
    "What is the beginning inventory from the Balance Sheet",
    "What is the ending inventory from the Balance Sheet",
    "What is the Beginning Accounts Receivable from current asset on the balance sheet",
    "What is the Ending Accounts Receivable from current asset on the balance sheet",
]

results = collection.query(query_texts=queries, n_results=3)
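
Conceptually, `collection.query` embeds each query and returns the `n_results` stored documents whose embeddings are closest to it. The sketch below shows a brute-force version of that idea with squared L2 distance on made-up 2-dimensional vectors. It is only an illustration: `squared_l2` and `top_k` are hypothetical helpers, and Chroma itself uses an approximate nearest-neighbour index rather than a full scan.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k):
    """Return (id, distance) pairs for the k closest documents, nearest first."""
    scored = [(doc_id, squared_l2(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1])  # smaller distance = more similar
    return scored[:k]

# Hypothetical stored embeddings keyed by document id
toy_docs = {
    "0": [0.0, 0.0],
    "1": [1.0, 1.0],
    "2": [3.0, 4.0],
    "3": [1.0, 0.0],
}

toy_hits = top_k([1.0, 0.5], toy_docs, k=3)
for doc_id, distance in toy_hits:
    print(doc_id, distance)
```

Retrieval returns the nearest chunks whether or not they actually answer the question, which is why the retrieved documents printed below are worth reading critically.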

In [None]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  retrieved_ids = results['ids'][i]  
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    # print(word_wrap(document, n_chars=90))
    print(f"document: {j}, ids={retrieved_ids[j]}: {word_wrap(document, n_chars=90)}")
    print('\n')

In [9]:

# Build the path with os.path.join to avoid the invalid "\m" escape sequence
_ = load_dotenv(find_dotenv(os.path.join('.env', 'my_api_key.env'))) # read local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [None]:
# def rag_35(query, retrieved_documents, model="gpt-3.5-turbo"):
def rag_4o(query, retrieved_documents, model="gpt-4o-mini"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # Trailing space keeps the two concatenated sentences from running together
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [17]:
def rag_35(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # Trailing space keeps the two concatenated sentences from running together
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

#### Where to Find Key Metrics in Financial Statements

 - Cost of Goods Sold (COGS). Income Statement: COGS is typically found on the income statement, usually listed below revenue. It represents the direct costs associated with producing or purchasing goods sold.
 - Average Inventory. Balance Sheet: Inventory is listed as a current asset on the balance sheet. To calculate average inventory, you'll need the beginning and ending inventory balances for the period.
   Average Inventory = (Beginning Inventory + Ending Inventory) / 2
 - Net Credit Sales. Income Statement: This figure can be found on the income statement. It represents the total sales made on credit during the period.
 - Average Accounts Receivable. Balance Sheet: Accounts receivable is listed as a current asset. Similar to average inventory, calculate average accounts receivable by using the beginning and ending balances.
   Average Accounts Receivable = (Beginning Accounts Receivable + Ending Accounts Receivable) / 2

Note: The specific labels or placement of these items might vary slightly depending on the company and the accounting standards they follow. However, these general guidelines should help you locate the necessary information.
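
The two averaging formulas above are the same calculation applied to different balance sheet items. Here is a small worked sketch; the figures are invented for illustration and do not come from the annual report, and `average_balance` is a hypothetical helper.

```python
def average_balance(beginning, ending):
    """Average of the beginning and ending balances for a period."""
    return (beginning + ending) / 2

# Hypothetical balances, not taken from the annual report
avg_inventory = average_balance(beginning=1_000.0, ending=1_400.0)
avg_receivable = average_balance(beginning=500.0, ending=700.0)

print(f"Average Inventory: {avg_inventory}")
print(f"Average Accounts Receivable: {avg_receivable}")
```

In the notebook, the beginning and ending balances would come from the answers to the corresponding balance sheet queries.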

In [None]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  retrieved_ids = results['ids'][i]
  output = rag_4o(query=query, retrieved_documents=retrieved_documents)    
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    print(f"document: {j}, ids={retrieved_ids[j]}: {word_wrap(document, n_chars=90)}")
    print('\n')
  
  print(f'{"="*40}')
  print(f"query: {queries[i]}")
  print(f'output: {word_wrap(output, n_chars=90)}')
  print(f'{"="*40}\n')