# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or an operation using generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/2/overview-of-embeddings-based-retrieval

In [1]:
import os
import openai
import sys
import chromadb

from openai import OpenAI
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb.config import Settings
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from dotenv import load_dotenv, find_dotenv
from pypdf import PdfReader
from helper_utils_02 import word_wrap

sys.path.append('../..')

#### Embedding function

In [2]:
# Default model for SentenceTransformerEmbeddingFunction is "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")


#### SentenceTransformersTokenTextSplitter hyper-parameters

In [3]:
# The SentenceTransformers embedding model has a context window of at most 256 tokens.
# Any text beyond that limit is truncated.
tokens_per_chunk = 256
tokens_chunk_overlap = int(tokens_per_chunk * 0.2)
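To make the two hyper-parameters above concrete, here is a small arithmetic sketch. The helper `estimated_chunks` is hypothetical (it is not part of LangChain), and it ignores any special tokens the real splitter may reserve, so treat the count as a rough estimate only.

```python
# Sliding-window arithmetic for the chunking parameters above.
tokens_per_chunk = 256
tokens_chunk_overlap = int(tokens_per_chunk * 0.2)  # tokens shared by neighbouring chunks
stride = tokens_per_chunk - tokens_chunk_overlap    # new tokens covered by each extra chunk


def estimated_chunks(n_tokens: int) -> int:
    """Rough number of windows needed to cover n_tokens (hypothetical helper)."""
    if n_tokens <= tokens_per_chunk:
        return 1
    # Ceiling division of the tokens left over after the first full window.
    remaining = n_tokens - tokens_per_chunk
    return 1 + -(-remaining // stride)


print(f"overlap={tokens_chunk_overlap}, stride={stride}")
print(f"windows for 1000 tokens: {estimated_chunks(1000)}")
```

With a 20% overlap, each additional chunk contributes only about 80% new tokens, which is the price paid for not cutting context at chunk boundaries.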

#### Persistence directory

In [4]:
# Store embedding database
is_persistent = True

# Specify the directory to store the database files
persist_dir = './persist_dir/'

#### Load pdf to text

In [5]:
reader = PdfReader("microsoft_annual_report_2022.pdf")

# Extract the text of each page and strip leading and trailing whitespace
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [6]:
total_pdf_pages = len(pdf_texts)

print(f'type(reader): {type(reader)}')
print(f'type(pdf_texts): {type(pdf_texts)}')
print(f'number of pdf pages: {total_pdf_pages}')
print(f'\npdf page 1: {pdf_texts[0][:200]} .... {pdf_texts[0][-200:]}')

type(reader): <class 'pypdf._reader.PdfReader'>
type(pdf_texts): <class 'list'>
number of pdf pages: 90

pdf page 1: 1 Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2 ....  community, and country. This starts with 
increasing access to digital skills. This year alone, more than 23  million people accessed digital skills training as part of 
our global skills initiative.


#### You can view the pdf in your browser [here](./microsoft_annual_report_2022.pdf) if you would like. 

#### https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846

#### Split the pdf text, trying the separators in order, to satisfy chunk_size

In [7]:
# creates an instance of RecursiveCharacterTextSplitter
character_splitter = RecursiveCharacterTextSplitter(
    # List of delimiters used to split the text.
    # The splitter tries the separators in order until the chunks satisfy chunk_size.
    # "" (the empty string) would split on every character.
    # separators=["\n\n", "\n", ". ", " ", ""],
    separators=["\n\n", "\n", ". ", " "],  # "" is omitted so text is never split on single characters
    chunk_size=1000,
    chunk_overlap=0
)

#  concatenate pdf pages using "\n\n" (two newlines) as a delimiter.
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))
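The idea behind recursive splitting can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual implementation: `recursive_split` is a hypothetical function, it has no overlap handling, and it re-joins merged pieces with the current separator only.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified sketch of recursive character splitting (not LangChain's code)."""
    # Stop when the text already fits or there is nothing left to split on.
    if len(text) <= chunk_size or not separators:
        return [text]

    sep, finer_separators = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Still too long: retry this part with the next, finer separator.
            pieces.extend(recursive_split(part, finer_separators, chunk_size))
        elif part:
            pieces.append(part)

    # Greedily merge neighbouring pieces while the result fits in chunk_size.
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb\n\nccc ddd eee\n\nfff"
sample_chunks = recursive_split(sample, ["\n\n", " "], chunk_size=8)
print(sample_chunks)
```

Coarse separators are tried first so that paragraphs and lines stay intact whenever they fit, and finer separators are used only for the parts that are still too long.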

In [32]:
_n_chunk = 300
total_character_split_texts_chunks = len(character_split_texts)

print(f'The {total_pdf_pages} pdf pages are split into {total_character_split_texts_chunks} chunks\n')
print(f'Text in chunk[{_n_chunk}]: {character_split_texts[_n_chunk]}')
print(f"\nNumber of characters in chunk[{_n_chunk}]: {len(character_split_texts[_n_chunk])}")

The 90 pdf pages are split into 347 chunks

Text in chunk[300]: vigorously, adverse outcomes that we estimate could reach approximately $600  million in aggregate beyond recorded 
amounts are reasonably possible. Were unfavorable final outcomes to occur, there exists the possibility of a material 
adverse impact in our consolidated financial statements for the period in which the effects become reasonably estimable.

Number of characters in chunk[300]: 355


#### Splitting text into tokens is the first step of any RAG system

In [33]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=tokens_chunk_overlap, tokens_per_chunk=tokens_per_chunk)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

total_token_split_texts_chunks = len(token_split_texts)
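The token splitter above applies a sliding window with overlap. A minimal sketch of that windowing follows, assuming whitespace-separated words as stand-ins for real model tokens; `window_tokens` is a hypothetical helper, and the real splitter uses the model's own tokenizer.

```python
def window_tokens(tokens, tokens_per_chunk, chunk_overlap):
    """Cut a token list into overlapping windows (illustrative helper)."""
    if chunk_overlap >= tokens_per_chunk:
        raise ValueError("chunk_overlap must be smaller than tokens_per_chunk")
    stride = tokens_per_chunk - chunk_overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + tokens_per_chunk])
        # Stop once the current window reaches the end of the token list.
        if start + tokens_per_chunk >= len(tokens):
            break
        start += stride
    return chunks


# Whitespace "tokens" are an assumption made only for this example.
words = "the quick brown fox jumps over the lazy dog".split()
word_chunks = [" ".join(c) for c in window_tokens(words, tokens_per_chunk=4, chunk_overlap=1)]
for chunk in word_chunks:
    print(chunk)
```

Each window repeats the last token of the previous one, which is the role `chunk_overlap` plays in the cell above.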

In [34]:
print(f'The {total_character_split_texts_chunks} character_split_texts chunks are split into {total_token_split_texts_chunks} token_split_texts chunks')
print(f'token_split_texts[{_n_chunk}]: {token_split_texts[_n_chunk]}')

The 347 character_split_texts chunks are split into 349 token_split_texts chunks
token_split_texts[300]: law. the plaintiffs allege that their handsets either operated outside the fcc guidelines or were manufactured before the fcc guidelines went into effect. the lawsuits also allege an industry - wide conspiracy to manipulate the science and testing around emission guidelines. in 2013, the defendants in the consolidated cases moved to exclude the plaintiffs ’ expert evidence of general causation on the basis of flawed scientific methodologies. in 2014, the trial court granted in part and denied in part the defendants ’ motion to exclude the plaintiffs ’ general causation experts. the defendants filed an interlocutory appeal to the district of columbia court of appeals challenging the standard for evaluating expert scientific evidence. in october 2016, the court of appeals issued its decision adopting the standard advocated by the defendants and remanding the cases to the trial


#### Example of embedding text. The embedding vector of a single word and that of a chunk of text have the same dimension

In [68]:
# _texts = [['hello'], [token_split_texts[10]]]
_texts = [['hello'], [' '], ['world'], ['hello world']]
for _text in _texts:
  # emb_text = [token_split_texts[10]]
  emb_text = _text
  emb_vector = embedding_function(emb_text)[0]  # list has only 1 vector 
  emb_vector_dim = len(emb_vector)

  print(f'text to be embedded: {_text}')
  print(f'embedding vector: {emb_vector}')
  print(f'embedding vector dimension: {emb_vector_dim}\n')

text to be embedded: ['hello']
embedding vector: [-0.0627717673778534, 0.0549587607383728, 0.052164774388074875, 0.0857899934053421, -0.08274899423122406, -0.07457294315099716, 0.06855469197034836, 0.018396392464637756, -0.0820114016532898, -0.03738484904170036, 0.012124965898692608, 0.0035183115396648645, -0.004134250804781914, -0.04378445819020271, 0.021807344630360603, -0.005102711729705334, 0.0195465087890625, -0.04234874248504639, -0.11035964637994766, 0.005424516275525093, -0.0557347871363163, 0.02805245853960514, -0.023158729076385498, 0.028481375426054, -0.05370968207716942, -0.052601609379053116, 0.03393924608826637, 0.04538865014910698, 0.02371842972934246, -0.07312081009149551, 0.05477773770689964, 0.017047280445694923, 0.08136038482189178, -0.0028626977000385523, 0.01195810828357935, 0.07355853170156479, -0.09423747658729553, -0.0813620463013649, 0.040015459060668945, 0.000692174828145653, -0.013393301516771317, -0.054538097232580185, 0.005151425953954458, -0.02613980695605

In [71]:
import numpy as np

_texts = [['hello world']]
# _texts = [['hello'], [' '], ['world']]
summed_emb_vector2 = np.zeros(len(embedding_function(_texts[0])[0]))

for _text in _texts:
    emb_text = _text
    emb_vector = embedding_function(emb_text)[0]
    summed_emb_vector2 += emb_vector

print(summed_emb_vector2)


[-3.44772749e-02  3.10231540e-02  6.73496397e-03  2.61089895e-02
 -3.93620022e-02 -1.60302415e-01  6.69239387e-02 -6.44150935e-03
 -4.74505350e-02  1.47588458e-02  7.08753243e-02  5.55276237e-02
  1.91933736e-02 -2.62513123e-02 -1.01095485e-02 -2.69404594e-02
  2.23074369e-02 -2.22266242e-02 -1.49692550e-01 -1.74930040e-02
  7.67627917e-03  5.43522723e-02  3.25445249e-03  3.17258611e-02
 -8.46213549e-02 -2.94059888e-02  5.15955538e-02  4.81240526e-02
 -3.31485295e-03 -5.82792200e-02  4.19693105e-02  2.22106520e-02
  1.28188923e-01 -2.23389715e-02 -1.16562312e-02  6.29283041e-02
 -3.28763239e-02 -9.12260562e-02 -3.11753675e-02  5.26994728e-02
  4.70347889e-02 -8.42031017e-02 -3.00562009e-02 -2.07448211e-02
  9.51783638e-03 -3.72181856e-03  7.34325405e-03  3.93243320e-02
  9.32740271e-02 -3.78858414e-03 -5.27420491e-02 -5.80581799e-02
 -6.86435262e-03  5.28316759e-03  8.28929320e-02  1.93627756e-02
  6.28451118e-03 -1.03307562e-02  9.03235190e-03 -3.76837067e-02
 -4.52060625e-02  2.40163

In [69]:
import numpy as np

# _texts = [['hello'], [' '], ['world'], ['hello world']]
_texts = [['hello'], [' '], ['world']]
summed_emb_vector = np.zeros(len(embedding_function(_texts[0])[0]))

for _text in _texts:
    emb_text = _text
    emb_vector = embedding_function(emb_text)[0]
    summed_emb_vector += emb_vector

print(summed_emb_vector)


[-2.11848324e-01  1.34904202e-01 -1.37575339e-02  6.11020122e-02
  4.43835929e-03 -1.45303838e-01  2.86863469e-01  2.85047245e-02
 -1.43550921e-01 -1.04822398e-01  9.59619810e-02 -1.00200316e-01
  4.37183212e-02  7.48633454e-03 -4.99898829e-02 -4.40015029e-02
 -1.19766988e-01 -1.81008186e-01 -2.75501410e-01 -5.55342063e-02
 -1.52866770e-01  1.21568507e-01  1.86925968e-02  7.55881160e-02
 -1.06967868e-01  2.84147890e-02  4.40425184e-02  1.50314633e-01
  3.77841271e-02 -1.88795958e-01 -5.96537627e-03  7.61574171e-02
  2.36096323e-01 -5.30556797e-02 -8.84637823e-02  8.78671454e-02
 -1.87591057e-01 -1.52349630e-01  6.81530684e-02  3.17451675e-02
  8.79507884e-02 -1.50590342e-01 -4.01891130e-02 -1.09347217e-02
  7.28807114e-02 -1.24850341e-01  4.07027490e-02  7.64014628e-02
  1.25435639e-01 -1.44924843e-02 -1.51440281e-01 -1.64728792e-01
 -7.73474716e-02 -1.64813865e-02  1.65193500e-01 -1.99855473e-02
  1.88305844e-02 -9.64586511e-02  1.18892942e-01 -1.06276066e-01
  4.85221874e-02  2.07658

In [67]:
# Note: a bare string is treated as a sequence of characters here, so each call returns
# one vector per character (see the 5 rows printed for 'hello'), not a single vector.
# Wrap the text in a list, e.g. ['hello'], to embed it as one document.
vect_hello = np.array(embedding_function('hello'))
vect_space = np.array(embedding_function(' '))
vect_world = np.array(embedding_function('world'))
vect_hello_world = np.array(embedding_function('hello world'))


print(f'vect_hello:       {vect_hello[:5]} ...')
print(f'vect_space:       {vect_space[:5]} ...')
print(f'vect_world:       {vect_world[:5]} ...')
print(f'vect_hello_world: {vect_hello_world[:5]} ...')

vect_hello:       [[-0.05710156  0.0941014  -0.04296131 ...  0.05559215  0.01427933
   0.04930653]
 [-0.01528914  0.05226418  0.0412192  ... -0.01116124 -0.00367527
   0.01262808]
 [-0.02921028 -0.00813605  0.03420501 ...  0.07332278  0.04099218
  -0.00541234]
 [-0.02921028 -0.00813605  0.03420501 ...  0.07332278  0.04099218
  -0.00541234]
 [-0.0632946  -0.02908464 -0.03562215 ...  0.01235676  0.04910436
  -0.01412477]] ...
vect_space:       [[-1.18838452e-01  4.82987538e-02 -2.54810927e-03 -1.10111590e-02
   5.19507416e-02  1.02917179e-02  1.15433238e-01  7.00820121e-04
  -8.59253034e-02 -7.06540346e-02  1.33174425e-03 -3.54723595e-02
   1.84341129e-02 -6.73720753e-03  2.44029555e-02 -2.95032132e-02
  -5.81384748e-02 -5.04395626e-02 -2.07655299e-02  2.90360358e-02
  -6.36759996e-02  2.40299441e-02  2.62433048e-02 -6.03735307e-03
  -1.10765453e-02 -1.40070973e-03 -1.86198801e-02  3.27700824e-02
   2.88595771e-03 -5.69439158e-02 -4.39416058e-02  2.54140887e-02
   8.79094005e-02 -2.49912

In [66]:
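# Assumed definitions (not shown in the original notebook): each text wrapped in a list
hello, space, world = ['hello'], [' '], ['world']
vect_hello = np.array(embedding_function(hello)[0])  # single vector for 'hello'
vect_space = np.array(embedding_function(space)[0])  # single vector for ' '
vect_world = np.array(embedding_function(world)[0])  # single vector for 'world'
# A bare string is embedded character by character, so this is a matrix with one row per character
vect_hello_world = np.array(embedding_function('hello world'))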

print(f'vect_hello:       {vect_hello[:5]} ...')
print(f'vect_space:       {vect_space[:5]} ...')
print(f'vect_world:       {vect_world[:5]} ...')
print(f'vect_hello_world: {vect_hello_world[:5]} ...')

vect_hello:       [-0.06277177  0.05495876  0.05216477  0.08578999 -0.08274899] ...
vect_space:       [-0.11883845  0.04829875 -0.00254811 -0.01101116  0.05195074] ...
vect_world:       [-0.0302381   0.03164669 -0.0633742  -0.01367682  0.03523661] ...
vect_hello_world: [[-0.05710156  0.0941014  -0.04296131 ...  0.05559215  0.01427933
   0.04930653]
 [-0.01528914  0.05226418  0.0412192  ... -0.01116124 -0.00367527
   0.01262808]
 [-0.02921028 -0.00813605  0.03420501 ...  0.07332278  0.04099218
  -0.00541234]
 [-0.02921028 -0.00813605  0.03420501 ...  0.07332278  0.04099218
  -0.00541234]
 [-0.0632946  -0.02908464 -0.03562215 ...  0.01235676  0.04910436
  -0.01412477]] ...


In [62]:
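import numpy as np
# Caution: np.add takes only two operands; the third positional argument is `out`.
# This call therefore computes vect_hello + vect_space and writes the result into vect_world.
# To add all three vectors, use vect_hello + vect_space + vect_world.
vect_sum = np.add(vect_hello, vect_space, vect_world)
vect_net = np.subtract(vect_sum, vect_hello_world)
# print(vect_sum)
vect_net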

array([[-0.12450866,  0.00915612,  0.09257797, ...,  0.12172331,
         0.0837532 , -0.05793161],
       [-0.16632108,  0.05099334,  0.00839747, ...,  0.18847669,
         0.10170781, -0.02125315],
       [-0.15239994,  0.11139357,  0.01541165, ...,  0.10399267,
         0.05704036, -0.00321273],
       ...,
       [-0.12346346,  0.10249017,  0.1259632 , ...,  0.10942195,
         0.05163494,  0.02237053],
       [-0.15239994,  0.11139357,  0.01541165, ...,  0.10399267,
         0.05704036, -0.00321273],
       [-0.06609109,  0.06015231,  0.0592649 , ...,  0.15626523,
         0.14811261, -0.01196623]])

#### Create a collection and add documents to it

In [3]:
# Create a Chroma client
client = chromadb.Client(Settings(is_persistent=is_persistent, persist_directory=persist_dir))  

# Create or retrieve a collection
collection_name = "microsoft_annual_report_2022"
collection = client.get_or_create_collection(name=collection_name, embedding_function=embedding_function)

# Add documents to the collection. Note that add() does not overwrite records whose ids already exist; use upsert() to overwrite.
ids = [str(i) for i in range(len(token_split_texts))]
collection.add(ids=ids, documents=token_split_texts)

#### Retrieve a collection

In [21]:
# Create or retrieve a collection
collection_name = "microsoft_annual_report_2022"
collection = client.get_or_create_collection(name=collection_name, embedding_function=embedding_function)

# Retrieve the whole embedding database
results = collection.get()

In [None]:
results_keys = []
print("Results is a dictionary. Here are its keys and values (each value is a list or None):")
for k, v in results.items():
  results_keys.append(k)
  print(f'key: {k}, value: {v}')

print(f'\nresults keys:\n{results_keys}')

# Get the first document
print(f'\nText in the first document:')
print(f'results["documents"][0]: {results["documents"][0]}')

#### Get a subset of a collection

In [None]:
ids_to_retrieve = ["0", "1"]
subset = collection.get(
  ids=ids_to_retrieve,
)
print(f'subset of collection by ids:')
subset

In [19]:
queries = [
    "What was the total revenue from the income statement?",
    "What was the operating margin?",
    "What is the Operating income from the Income Statements",
    "What is the cost of goods sold from the Income Statement",
    "What is the beginning inventory from the Balance Sheet",
    "What is the ending inventory from the Balance Sheet",
    "What is the Beginning Accounts Receivable from current asset on the balance sheet",
    "What is the Ending Accounts Receivable from current asset on the balance sheet",
]

results = collection.query(query_texts=queries, n_results=3)
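Conceptually, the query above embeds each question and returns the `n_results` nearest document vectors. The sketch below does this by brute force over made-up 2-dimensional vectors, assuming squared Euclidean distance as the metric (believed to be Chroma's default, but treat that as an assumption). A real vector store typically uses an approximate index instead of scanning every vector, and `top_k` is a hypothetical helper.

```python
def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, distance) pairs closest to query_vec."""
    scored = []
    for doc_id, vec in doc_vecs.items():
        # Squared Euclidean distance between the query and this document.
        dist = sum((q - d) ** 2 for q, d in zip(query_vec, vec))
        scored.append((dist, doc_id))
    scored.sort()
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


# Made-up vectors keyed by document id.
toy_doc_vecs = {
    "0": [0.0, 0.0],
    "1": [1.0, 0.0],
    "2": [0.0, 3.0],
    "3": [5.0, 5.0],
}
nearest = top_k([1.0, 1.0], toy_doc_vecs, k=3)
print(nearest)
```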

In [None]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  retrieved_ids = results['ids'][i]  
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    # print(word_wrap(document, n_chars=90))
    print(f"document: {j}, ids={retrieved_ids[j]}: {word_wrap(document, n_chars=90)}")
    print('\n')
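The loop above relies on the query results being nested per query: entry `i` of `results['documents']` and `results['ids']` holds the hits for `queries[i]`. The sketch below mirrors that shape with a hand-built dictionary; the ids and texts are invented, and `iter_hits` is a hypothetical helper.

```python
def iter_hits(queries, results):
    """Yield (query, rank, doc_id, document) for every retrieved document."""
    for i, query in enumerate(queries):
        pairs = zip(results["ids"][i], results["documents"][i])
        for rank, (doc_id, document) in enumerate(pairs):
            yield query, rank, doc_id, document


# Hand-built stand-in with the same nesting: one inner list per query.
toy_queries = ["What was the total revenue?", "What was the operating margin?"]
toy_results = {
    "ids": [["12", "7"], ["3", "12"]],
    "documents": [
        ["chunk about revenue", "chunk about growth"],
        ["chunk about margin", "chunk about revenue"],
    ],
}

hits = list(iter_hits(toy_queries, toy_results))
for query, rank, doc_id, document in hits:
    print(f"{query!r} -> rank {rank}, id={doc_id}: {document}")
```

The same document id can appear under several queries, as id "12" does here.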

In [9]:

_ = load_dotenv(find_dotenv(os.path.join('.env', 'my_api_key.env')))  # read the local .env file
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [None]:
# def rag_35(query, retrieved_documents, model="gpt-3.5-turbo"):
def rag_4o(query, retrieved_documents, model="gpt-4o-mini"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report."
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [17]:
def rag_35(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report."
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
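It can help to inspect the prompt before spending an API call. The sketch below builds the same kind of message list without contacting OpenAI. `build_messages` and `SYSTEM_PROMPT` are hypothetical names introduced for this example, and the sample documents are invented.

```python
SYSTEM_PROMPT = (
    "You are a helpful expert financial research assistant. "
    "Your users are asking questions about information contained in an annual report. "
    "You will be shown the user's question, and the relevant information from the annual report. "
    "Answer the user's question using only this information."
)


def build_messages(query, retrieved_documents):
    """Assemble chat messages from a question and its retrieved chunks."""
    information = "\n\n".join(retrieved_documents)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"},
    ]


# Invented sample inputs, only to show the resulting structure.
messages = build_messages(
    "What was the total revenue?",
    ["first retrieved chunk", "second retrieved chunk"],
)
for message in messages:
    print(f"{message['role']}: {message['content'][:80]}")
```

Keeping prompt construction separate from the API call also makes it straightforward to reuse the same prompt with different models.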

#### Where to find key metrics in financial statements

 - Cost of Goods Sold (COGS), on the income statement: COGS is typically listed below revenue. It represents the direct costs associated with producing or purchasing goods sold.
 - Average Inventory, from the balance sheet: inventory is listed as a current asset. To calculate average inventory, you'll need the beginning and ending inventory balances for the period. Average Inventory = (Beginning Inventory + Ending Inventory) / 2
 - Net Credit Sales, on the income statement: this figure represents the total sales made on credit during the period.
 - Average Accounts Receivable, from the balance sheet: accounts receivable is listed as a current asset. Similar to average inventory, calculate average accounts receivable by using the beginning and ending balances. Average Accounts Receivable = (Beginning Accounts Receivable + Ending Accounts Receivable) / 2

Note: The specific labels or placement of these items might vary slightly depending on the company and the accounting standards they follow. However, these general guidelines should help you locate the necessary information.
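The two averaging formulas above can be worked through with a short example. All figures below are invented and do not come from the annual report. The turnover ratios at the end are the standard definitions that these averages usually feed into; the text above does not name them, so that connection is an assumption.

```python
def average_balance(beginning, ending):
    """Average of a beginning and an ending balance for a period."""
    return (beginning + ending) / 2


# Invented figures, in millions, for illustration only.
cost_of_goods_sold = 65_000.0
net_credit_sales = 205_000.0

average_inventory = average_balance(beginning=3_000.0, ending=3_500.0)
average_receivable = average_balance(beginning=38_000.0, ending=44_000.0)

# Standard turnover definitions (assumed to be the intended use of the averages).
inventory_turnover = cost_of_goods_sold / average_inventory
receivables_turnover = net_credit_sales / average_receivable

print(f"average inventory: {average_inventory}")
print(f"average accounts receivable: {average_receivable}")
print(f"inventory turnover: {inventory_turnover}")
print(f"receivables turnover: {receivables_turnover}")
```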

In [None]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  retrieved_ids = results['ids'][i]
  output = rag_4o(query=query, retrieved_documents=retrieved_documents)    
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    print(f"document: {j}, ids={retrieved_ids[j]}: {word_wrap(document, n_chars=90)}")
    print('\n')
  
  print(f'{"="*40}')
  print(f"query: {queries[i]}")
  print(f'output: {word_wrap(output, n_chars=90)}')
  print(f'{"="*40}\n')