# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or an operation that uses generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/2/overview-of-embeddings-based-retrieval

In [1]:
import os
import openai
import sys
import chromadb

from openai import OpenAI
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb.config import Settings
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter
from dotenv import load_dotenv, find_dotenv
from pypdf import PdfReader
from helper_utils_02 import word_wrap

sys.path.append('../..')

# Default model for SentenceTransformerEmbeddingFunction is "all-MiniLM-L6-v2"
embedding_function = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")


# Store embedding database
is_persistent = True

# Specify the directory to store the database files
persist_dir = './persist_dir/'

# SentenceTransformersTokenTextSplitter hyperparameters
# The SentenceTransformers embedding model has a maximum context window of 256 tokens.
# Any text beyond that limit is truncated.
tokens_per_chunk = 256
tokens_chunk_overlap = int(tokens_per_chunk * 0.2)



In [None]:
reader = PdfReader("microsoft_annual_report_2022.pdf")

# Extract the text of each page and strip leading and trailing whitespace
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter out empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0], n_chars=90))

In [None]:
print(f'type(reader): {type(reader)}')
print(f'type(pdf_texts): {type(pdf_texts)}')
print(f'len(pdf_texts): {len(pdf_texts)}')
print(f'\npdf page 1: {pdf_texts[0]}')

You can view the PDF in your browser [here](./microsoft_annual_report_2022.pdf) if you would like.

Print the first and last 200 characters of page 2.

In [None]:
print(word_wrap(pdf_texts[1][:200]))
print('....')
print(word_wrap(pdf_texts[1][-200:]))

https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846

In [None]:
# Create an instance of RecursiveCharacterTextSplitter
character_splitter = RecursiveCharacterTextSplitter(
    # Separators are tried in the order listed until the pieces fit within chunk_size.
    # A trailing "" (empty string) would split on every character as a last resort:
    # separators=["\n\n", "\n", ". ", " ", ""],
    separators=["\n\n", "\n", ". ", " "],  # "" removed so text is not split character by character
    chunk_size=1000,
    chunk_overlap=0
)
# Concatenate the PDF pages using "\n\n" (two newlines) as a delimiter
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\n{character_split_texts[10]}")
print(f"\nlen(character_split_texts[10]): {len(character_split_texts[10])}")
print(f"Total chunks: {len(character_split_texts)}")
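
The idea of trying separators in order can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the `recursive_split` helper and the toy text are hypothetical, and the sketch ignores `chunk_overlap`.

```python
def recursive_split(text, separators, chunk_size):
    """Split text on the first separator, recursing into pieces that are still too long."""
    if len(text) <= chunk_size or not separators:
        # Small enough, or no separators left: keep the piece as is.
        return [text]
    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # Keep merging neighbouring pieces while they fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: fall back to the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


toy_text = "aaaa bbbb\n\ncccc dddd eeee"
toy_split = recursive_split(toy_text, ["\n\n", "\n", ". ", " "], chunk_size=10)
print(toy_split)  # ['aaaa bbbb', 'cccc dddd', 'eeee']
```

The first paragraph fits within the limit and stays whole, while the second one is broken on spaces because no coarser separator applies.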

#### Splitting text into tokens is the first step of any RAG system

In [None]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=tokens_chunk_overlap, tokens_per_chunk=tokens_per_chunk)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal token_split_texts chunks: {len(token_split_texts)}")
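
How `tokens_per_chunk` and `chunk_overlap` interact can be shown with a small sliding-window sketch. It assumes the splitter advances by `tokens_per_chunk - chunk_overlap` tokens per window; the `sliding_window_chunks` helper and the integer "tokens" are hypothetical stand-ins for real tokenizer output.

```python
def sliding_window_chunks(tokens, tokens_per_chunk, chunk_overlap):
    """Split a token list into windows that share `chunk_overlap` tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be in [0, tokens_per_chunk)")
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        if start + tokens_per_chunk >= len(tokens):
            # The window already reaches the end of the input.
            break
    return chunks


toy_tokens = list(range(10))
toy_windows = sliding_window_chunks(toy_tokens, tokens_per_chunk=4, chunk_overlap=1)
print(toy_windows)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the notebook's settings, `int(256 * 0.2)` gives an overlap of 51 tokens, so under this assumption each window would start 205 tokens after the previous one.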

#### Example of embedding text: a single word and a whole chunk of text both map to vectors of the same dimension

In [None]:
_texts = [['hello'], [token_split_texts[10]]]
for _text in _texts:
  # emb_text = [token_split_texts[10]]
  emb_text = _text
  emb_vector = embedding_function(emb_text)[0]  # list has only 1 vector 
  emb_vector_dim = len(emb_vector)

  print(f'text to be embedded: {_text}')
  print(f'embedding vector: {emb_vector}')
  print(f'embedding vector dimension: {emb_vector_dim}\n')
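
One way to see why the dimension does not depend on input length is mean pooling, the pooling step described on the all-MiniLM-L6-v2 model card. The sketch below is a toy illustration under that assumption: the `mean_pool` helper and the 4-dimensional token vectors are hypothetical, whereas the real model produces 384-dimensional vectors.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one fixed-size vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


# One "token" versus three "tokens": the pooled vectors have the same length.
one_token = [[1.0, 2.0, 3.0, 4.0]]
three_tokens = [
    [1.0, 0.0, 0.0, 2.0],
    [3.0, 0.0, 3.0, 2.0],
    [2.0, 3.0, 0.0, 2.0],
]
short_vec = mean_pool(one_token)
long_vec = mean_pool(three_tokens)
print(len(short_vec), len(long_vec))  # 4 4
print(long_vec)  # [2.0, 1.0, 1.0, 2.0]
```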

In [2]:
# Create a Chroma client
client = chromadb.Client(Settings(is_persistent=is_persistent, persist_directory=persist_dir))  

# Create or retrieve a collection
collection_name = "microsoft_annual_report_2022"
collection = client.get_or_create_collection(name=collection_name, embedding_function=embedding_function)

In [11]:
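# Add documents to the collection. Note that add() does not overwrite records whose ids
# already exist (depending on the Chroma version they are skipped or rejected);
# use collection.upsert() to overwrite existing records.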
ids = [str(i) for i in range(len(token_split_texts))]
collection.add(ids=ids, documents=token_split_texts)

In [3]:
# Retrieve the whole embedding database
results = collection.get()

In [4]:
print(f"type(results['documents']): {type(results['documents'])}")
print(f"len(results['documents']): {len(results['documents'])}")
print(f"results['documents'][10]: {results['documents'][10]}")
print(f"type(results['documents'][10]): {type(results['documents'][10])}")

type(results['documents']): <class 'list'>
len(results['documents']): 349
results['documents'][10]: 26 • gaming, focuses on developing hardware, content, and services across a large range of platforms to help grow our user base through game experiences and social interaction. internal development allows us to maintain competitive advantages that come from product differentiation and closer technical control over our products and services. it also gives us the freedom to decide which modifications and enhancements are most important and when they should be implemented. we strive to obtain information as early as possible about changing usage patterns and hardware advances that may affect software and hardware design. before releasing new software platforms, and as we make significant modifications to existing platforms, we provide application vendors with a range of resources and guidelines for development, training, and testing. generally, we also create product documentation internally.


In [5]:
results_keys = []
for k, v in results.items():
  results_keys.append(k)
  print(k, v)

print(f'\nresults keys: {results_keys}')

# Get the first document
print(f'\nresults["documents"][0]: {results["documents"][0]}')

ids ['0', '1', '10', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '11', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '12', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '13', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '14', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '15', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '16', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '17', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '18', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '19', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '2', '20', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '21', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '22', '220', '221', '222', '223', '224', '225', '226', '227', '228',
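
The ids in the output above are listed in string order ('0', '1', '10', '100', ...) rather than numeric order, so position 10 of the returned lists holds id '107', not chunk 10. A minimal sketch of the effect, with zero-padding as one possible workaround; the padding scheme is an assumption for illustration and is not something the notebook uses.

```python
n_docs = 12

# String ids sort lexicographically, so '10' and '11' come before '2'.
toy_ids = [str(i) for i in range(n_docs)]
print(sorted(toy_ids))
# ['0', '1', '10', '11', '2', '3', '4', '5', '6', '7', '8', '9']

# Zero-padding to a fixed width makes string order match numeric order.
width = len(str(n_docs - 1))
padded_ids = [str(i).zfill(width) for i in range(n_docs)]
print(sorted(padded_ids) == padded_ids)  # True
```

Looking a chunk up by id, as in `collection.get(ids=[...])`, avoids relying on list position altogether.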

In [6]:
# Retrieve documents by ids
my_results = collection.get(ids=['0', '1'])
print(f'my_results["documents"][0]: {my_results["documents"][0]}')
print(f'my_results["documents"][1]: {my_results["documents"][1]}')

my_results["documents"][0]: 1 dear shareholders, colleagues, customers, and partners : we are living through a period of historic economic, societal, and geopolitical change. the world in 2022 looks nothing like the world in 2019. as i write this, inflation is at a 40 - year high, supply chains are stretched, and the war in ukraine is ongoing. at the same time, we are entering a technological era with the potential to power awesome advancements across every sector of our economy and society. as the world ’ s largest software company, this places us at a historic intersection of opportunity and responsibility to the world around us. our mission to empower every person and every organization on the planet to achieve more has never been more urgent or more necessary. for all the uncertainty in the world, one thing is clear : people and organizations in every industry are increasingly looking to digital technology to overcome today ’ s challenges and emerge stronger. and no
my_results["docu

In [19]:
queries = [
    "What was the total revenue from the income statement?",
    "What was the operating margin?",
    "What is the Operating income from the Income Statements",
    "What is the cost of goods sold from the Income Statement",
    "What is the beginning inventory from the Balance Sheet",
    "What is the ending inventory from the Balance Sheet",
    "What is the Beginning Accounts Receivable from current asset on the balance sheet",
    "What is the Ending Accounts Receivable from current asset on the balance sheet",
]

results = collection.query(query_texts=queries, n_results=3)
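
Conceptually, the query embeds each question and returns the stored chunks whose vectors lie closest to it. The following is a brute-force sketch of that ranking step, assuming squared L2 distance; the `squared_l2` and `top_k` helpers and the 2-dimensional toy vectors are hypothetical, and a real vector store uses an approximate index rather than scanning every vector.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, doc_vecs, k):
    """Return the k (doc_id, distance) pairs closest to the query vector."""
    scored = sorted((squared_l2(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items())
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 0.5]}
hits = top_k([1.0, 0.0], toy_docs, k=2)
print(hits)  # [('a', 0.0), ('c', 0.25)]
```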

In [20]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  retrieved_ids = results['ids'][i]  
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    # print(word_wrap(document, n_chars=90))
    print(f"document: {j}, ids={retrieved_ids[j]}: {word_wrap(document, n_chars=90)}")
    print('\n')

query: What was the total revenue from the income statement?
document: 0, ids=194: 47 financial statements and supplementary data income statements ( in millions, except
per share amounts ) year ended june 30, 2022 2021 2020 revenue : product $ 72, 732 $ 71,
074 $ 68, 041 service and other 125, 538 97, 014 74, 974 total revenue 198, 270 168, 088
143, 015 cost of revenue : product 19, 064 18, 219 16, 017 service and other 43, 586 34,
013 30, 061 total cost of revenue 62, 650 52, 232 46, 078 gross margin 135, 620 115, 856
96, 937 research and development 24, 512 20, 716 19, 269 sales and marketing 21, 825 20,
117 19, 598 general and administrative 5, 900 5, 107 5, 111 operating income 83, 383 69,
916 52, 959 other income, net 333 1, 186 77 income before income taxes 83, 716 71, 102
53, 036 provision for income taxes 10, 978 9, 831 8, 755


query: What was the total revenue from the income statement?
document: 1, ids=196: 48 comprehensive income statements ( in millions ) year ended june 

In [9]:

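# Build the path with os.path.join so it works on any OS and avoids the invalid "\m" escape
_ = load_dotenv(find_dotenv(os.path.join('.env', 'my_api_key.env')))  # read the local .env file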
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [10]:
# def rag_35(query, retrieved_documents, model="gpt-3.5-turbo"):
def rag_4o(query, retrieved_documents, model="gpt-4o-mini"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # The trailing space keeps the two adjacent string literals from running together
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content

In [17]:
def rag_35(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # The trailing space keeps the two adjacent string literals from running together
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
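
Both functions assemble the prompt the same way before calling the API. The sketch below isolates that step so the prompt can be inspected without an API key; the `build_messages` helper, the `SYSTEM_PROMPT` constant, and the toy chunks are hypothetical names introduced for illustration.

```python
SYSTEM_PROMPT = (
    "You are a helpful expert financial research assistant. "
    "Your users are asking questions about information contained in an annual report. "
    "You will be shown the user's question, and the relevant information from the annual report. "
    "Answer the user's question using only this information."
)


def build_messages(query, retrieved_documents):
    """Join the retrieved chunks and wrap them in chat messages (no API call)."""
    information = "\n\n".join(retrieved_documents)
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"},
    ]


toy_messages = build_messages(
    "What was the total revenue?",
    ["chunk one", "chunk two"],
)
print(toy_messages[1]["content"])
```

Keeping prompt construction separate from the API call also makes it easy to check how much retrieved text is being sent to the model.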

#### Where to find key metrics in financial statements

 - Cost of Goods Sold (COGS)
   - Income Statement: COGS is typically found on the income statement, usually listed below revenue. It represents the direct costs associated with producing or purchasing goods sold.
 - Average Inventory
   - Balance Sheet: Inventory is listed as a current asset on the balance sheet. To calculate average inventory, you'll need the beginning and ending inventory balances for the period.
   - Average Inventory = (Beginning Inventory + Ending Inventory) / 2
 - Net Credit Sales
   - Income Statement: This figure can be found on the income statement. It represents the total sales made on credit during the period.
 - Average Accounts Receivable
   - Balance Sheet: Accounts receivable is listed as a current asset. Similar to average inventory, calculate average accounts receivable by using the beginning and ending balances.
   - Average Accounts Receivable = (Beginning Accounts Receivable + Ending Accounts Receivable) / 2

Note: The specific labels or placement of these items might vary slightly depending on the company and the accounting standards they follow. However, these general guidelines should help you locate the necessary information.
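
The two averaging formulas above have the same shape, so a single helper covers both. This is a minimal sketch: the `average_balance` function is a hypothetical name, and the figures are made-up placeholders, not values from the annual report.

```python
def average_balance(beginning, ending):
    """Average of the beginning and ending balance for a period."""
    return (beginning + ending) / 2


# Placeholder figures for illustration only.
avg_inventory = average_balance(beginning=100.0, ending=140.0)
avg_receivable = average_balance(beginning=80.0, ending=60.0)
print(avg_inventory, avg_receivable)  # 120.0 70.0
```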

In [30]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  retrieved_ids = results['ids'][i]
  output = rag_4o(query=query, retrieved_documents=retrieved_documents)    
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    print(f"document: {j}, ids={retrieved_ids[j]}: {word_wrap(document, n_chars=90)}")
    print('\n')
  
  print(f'{"="*40}')
  print(f"query: {queries[i]}")
  print(f'output: {word_wrap(output, n_chars=90)}')
  print(f'{"="*40}\n')

query: What was the total revenue from the income statement?
document: 0, ids=194: 47 financial statements and supplementary data income statements ( in millions, except
per share amounts ) year ended june 30, 2022 2021 2020 revenue : product $ 72, 732 $ 71,
074 $ 68, 041 service and other 125, 538 97, 014 74, 974 total revenue 198, 270 168, 088
143, 015 cost of revenue : product 19, 064 18, 219 16, 017 service and other 43, 586 34,
013 30, 061 total cost of revenue 62, 650 52, 232 46, 078 gross margin 135, 620 115, 856
96, 937 research and development 24, 512 20, 716 19, 269 sales and marketing 21, 825 20,
117 19, 598 general and administrative 5, 900 5, 107 5, 111 operating income 83, 383 69,
916 52, 959 other income, net 333 1, 186 77 income before income taxes 83, 716 71, 102
53, 036 provision for income taxes 10, 978 9, 831 8, 755


query: What was the total revenue from the income statement?
document: 1, ids=196: 48 comprehensive income statements ( in millions ) year ended june 