# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or an operation that uses generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

In [69]:
from helper_utils import word_wrap

In [70]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0]))

1 
Dear shareholders, colleagues, customers, and partners:  
We are
living through a period of historic economic, societal, and
geopolitical change. The world in 2022 looks nothing like 
the world in
2019. As I write this, inflation is at a 40 -year high, supply chains
are stretched, and the war in Ukraine is 
ongoing. At the same time, we
are entering a technological era with the potential to power awesome
advancements 
across every sector of our economy and society. As the
world’s largest software company, this places us at a historic

intersection of opportunity and responsibility to the world around us.
 
Our mission to empower every person and every organization on the
planet to achieve more has never been more 
urgent or more necessary.
For all the uncertainty in the world, one thing is clear: People and
organizations in every 
industry are increasingly looking to digital
technology to overcome today’s challenges and emerge stronger. And no

company is better positioned to help t

You can view the PDF in your browser [here](./microsoft_annual_report_2022.pdf) if you would like.

In [71]:
import nltk
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /home/rares/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [72]:
import re

joint_pdf_texts = '\n'.join(pdf_texts)
joint_pdf_texts = re.sub(r'\s+', ' ', joint_pdf_texts)
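
The cell above joins the pages and collapses every run of whitespace into a single space. The standalone sketch below uses a made-up two-page sample (not text from the report) to show a side effect worth keeping in mind: once the text is normalized, no `"\n\n"` remains, so a paragraph-break separator has nothing left to match.

```python
import re

# Made-up sample standing in for the text extracted from two PDF pages.
pages = [
    "Dear shareholders,\ncolleagues:  \nWe are living",
    "through change.\n\nOur mission",
]

joined = "\n".join(pages)

# Same pattern as the notebook cell: any run of whitespace becomes one space.
normalized = re.sub(r"\s+", " ", joined)

print(normalized)
print("paragraph breaks before:", "\n\n" in joined)
print("paragraph breaks after: ", "\n\n" in normalized)
```

If paragraph boundaries matter for chunking, one option would be to normalize spaces and tabs only and leave newlines alone; whether that helps depends on how cleanly the PDF extraction preserves paragraphs.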

In [81]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

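# The whitespace normalization above removes every "\n\n", so in practice
# only the "." separator matches and the text is split at periods.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "."],
    keep_separator=False,
    chunk_size=1000,
    chunk_overlap=0
)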

split_texts = splitter.split_text(joint_pdf_texts)

print(word_wrap(split_texts[0]))
print(f"\nTotal chunks: {len(split_texts)}")

1 Dear shareholders, colleagues, customers, and partners: We are living
through a period of historic economic, societal, and geopolitical
change. The world in 2022 looks nothing like the world in 2019. As I
write this, inflation is at a 40 -year high, supply chains are
stretched, and the war in Ukraine is ongoing. At the same time, we are
entering a technological era with the potential to power awesome
advancements across every sector of our economy and society. As the
world’s largest software company, this places us at a historic
intersection of opportunity and responsibility to the world around us.
Our mission to empower every person and every organization on the
planet to achieve more has never been more urgent or more necessary.
For all the uncertainty in the world, one thing is clear: People and
organizations in every industry are increasingly looking to digital
technology to overcome today’s challenges and emerge stronger

Total chunks: 307
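
To build intuition for what period-based chunking with a size limit does, here is a deliberately simplified sketch. It is not LangChain's algorithm: `split_on_periods` is a hypothetical helper that splits on `"."` and greedily packs the pieces into chunks of at most `chunk_size` characters.

```python
def split_on_periods(text, chunk_size):
    """Greedy sketch: split on '.', then pack pieces into chunks up to chunk_size."""
    pieces = [p.strip() for p in text.split(".") if p.strip()]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = f"{current}. {piece}" if current else piece
        if current and len(candidate) > chunk_size:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "Revenue grew. Costs fell. Margins improved. Outlook is stable."
for chunk in split_on_periods(sample, chunk_size=30):
    print(repr(chunk))
```

Note that in this sketch a single piece longer than `chunk_size` is kept whole rather than split further, which is one reason real splitters fall back to finer separators.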


In [82]:
import os
import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file
openai_api_key = os.environ['OPENAI_API_KEY']

embedding_function = OpenAIEmbeddingFunction(api_key=openai_api_key, model_name="text-embedding-3-small")
print(embedding_function([split_texts[0]]))

[array([-0.00921401, -0.01512612,  0.03912692, ...,  0.00043272,
       -0.02321774, -0.00480603], shape=(1536,), dtype=float32)]


In [None]:
chroma_client = chromadb.PersistentClient('microsoft_annual_report_2022')
chroma_collection = chroma_client.create_collection(
    "microsoft_annual_report_2022", 
    embedding_function=embedding_function,
    get_or_create=True)

if not chroma_collection.count():
    ids = [str(i) for i in range(len(split_texts))]
    chroma_collection.add(ids=ids, documents=split_texts)

chroma_collection.count()

307

In [None]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

Revenue, classified by the major geographic areas in which our
customers were located, was as follows: (In millions) Year Ended June
30, 2022 2021 2020 United States (a) $ 100,218 $ 83,953 $ 73,160 Other
countries 98,052 84,135 69,855 Total $ 198,270 $ 168,088 $ 143,015 (a)
Includes billings to OEMs and certain multinational organizations
because of the nature of these businesses and the impracticability of
determining the geographic source of the revenue


Revenue, classified by significant product and service offerings, was
as follows: (In millions) Year Ended June 30, 2022 2021 2020 Server
products and cloud services $ 67,321 $ 52,589 $ 41,379 Office products
and cloud services 44,862 39,872 35,316 Windows 24,761 22,488 21,510
Gaming 16,230 15,370 11,575 LinkedIn 13,816 10,289 8,077 Search and
news advertising 11,591 9,267 8,524 Enterprise Services 7,407 6,943
6,409 Devices 6,991 6,791 6,457 Other 5,291 4,479 3,768 Total $ 198,270
$ 168,088 $ 143,015 We have recast certain previousl
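
Conceptually, the query above embeds the question and returns the chunks whose embeddings lie closest to it. The sketch below illustrates that idea with brute-force cosine similarity over made-up 3-dimensional vectors. The vectors, document ids, and the `top_k` helper are all hypothetical, and a vector store such as Chroma would normally use an index and a configurable distance metric rather than this loop.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), doc_id)
              for doc_id, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:k]]


# Toy embeddings; real ones from text-embedding-3-small have 1536 dimensions.
toy_docs = {
    "revenue_table": [0.9, 0.1, 0.0],
    "shareholder_letter": [0.1, 0.9, 0.2],
    "risk_factors": [0.0, 0.3, 0.9],
}
toy_query = [1.0, 0.2, 0.0]

print(top_k(toy_query, toy_docs, k=2))
```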

In [None]:
import openai

openai_client = openai.OpenAI()

In [None]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]

    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
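
It can help to inspect the prompt before spending an API call on it. The sketch below separates the message assembly from the model call; `build_rag_messages` is a hypothetical helper that mirrors the structure used in `rag`, and the sample documents are made-up snippets.

```python
def build_rag_messages(query, retrieved_documents):
    """Assemble chat messages for a RAG prompt without calling any API."""
    information = "\n\n".join(retrieved_documents)
    system_prompt = (
        "You are a helpful expert financial research assistant. "
        "Your users are asking questions about information contained in an annual report. "
        "You will be shown the user's question, and the relevant information from the annual report. "
        "Answer the user's question using only this information."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}\nInformation: {information}"},
    ]


messages = build_rag_messages(
    "What was the total revenue?",
    ["Total $ 198,270", "Windows 24,761"],  # made-up stand-ins for retrieved chunks
)
print(messages[1]["content"])
```

Printing the user message this way makes it easy to confirm that the retrieved chunks are separated by blank lines and that the adjacent string literals in the system prompt are joined with spaces between sentences.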

In [None]:
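query = "What is the revenue from outside the US?"

# Note: retrieved_documents still holds the results of the earlier
# "What was the total revenue?" query; no new retrieval happens here.
output = rag(query=query, retrieved_documents=retrieved_documents)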

print(word_wrap(output))

The revenue from outside the US (Other countries) was $98,052 million
for the year ended June 30, 2022.
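
The final cell answers a new question using documents that were retrieved for the earlier query. That works here because the same revenue table covers both questions, but in general each question needs its own retrieval. A minimal sketch of wiring the two steps together follows; `answer`, `fake_retrieve`, and `fake_generate` are hypothetical stand-ins so the example runs without Chroma or an API key.

```python
def answer(query, retrieve, generate, n_results=5):
    """Run retrieval for this specific query, then generate from those documents."""
    documents = retrieve(query, n_results)
    return generate(query, documents)


def fake_retrieve(query, n_results):
    # Stand-in for a vector store query: simple keyword match over a tiny corpus.
    corpus = {
        "revenue": "Other countries 98,052",
        "mission": "Our mission is to empower every person",
    }
    hits = [text for keyword, text in corpus.items() if keyword in query.lower()]
    return hits[:n_results]


def fake_generate(query, documents):
    # Stand-in for the LLM call: just report which documents were supplied.
    return f"{len(documents)} document(s) used: {documents}"


print(answer("What is the revenue from outside the US?", fake_retrieve, fake_generate))
```

In the notebook, `retrieve` would wrap `chroma_collection.query(...)` and `generate` would wrap the `rag` function defined above.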
