# Lab 1 - Overview of embeddings-based retrieval

Welcome! Here are a few notes about the Chroma course notebooks.
 - A number of warnings pop up when running the notebooks. These are normal and can be ignored.
 - Some operations, such as calling an LLM or using generated data, return unpredictable results, so your notebook outputs may differ from the video.
  
Enjoy the course!

https://learn.deeplearning.ai/courses/advanced-retrieval-for-ai/lesson/2/overview-of-embeddings-based-retrieval

In [1]:
from helper_utils_02 import word_wrap

In [2]:
from pypdf import PdfReader

reader = PdfReader("microsoft_annual_report_2022.pdf")

# Extract the text of each page and strip leading/trailing whitespace
pdf_texts = [p.extract_text().strip() for p in reader.pages]

# Drop pages that yielded no text
pdf_texts = [text for text in pdf_texts if text]

print(word_wrap(pdf_texts[0], n_chars=90))

1 Dear shareholders, colleagues, customers, and partners:  
We are living through a
period of historic economic, societal, and geopolitical change. The world in 2022 looks
nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply
chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are
entering a technological era with the potential to power awesome advancements 
across
every sector of our economy and society. As the world’s largest software company, this
places us at a historic 
intersection of opportunity and responsibility to the world
around us.  
Our mission to empower every person and every organization on the planet to
achieve more has never been more 
urgent or more necessary. For all the uncertainty in
the world, one thing is clear: People and organizations in every 
industry are
increasingly looking to digital technology to overcome today’s challenges and emerge
stronger. And no 
company is better positioned to help th

In [3]:
print(f'type(reader): {type(reader)}')
print(f'type(pdf_texts): {type(pdf_texts)}')
print(f'len(pdf_texts): {len(pdf_texts)}')
print(f'\npdf page 1: {pdf_texts[0]}')

type(reader): <class 'pypdf._reader.PdfReader'>
type(pdf_texts): <class 'list'>
len(pdf_texts): 90

pdf page 1: 1 Dear shareholders, colleagues, customers, and partners:  
We are living through a period of historic economic, societal, and geopolitical change. The world in 2022 looks nothing like 
the world in 2019. As I write this, inflation is at a 40 -year high, supply chains are stretched, and the war in Ukraine is 
ongoing. At the same time, we are entering a technological era with the potential to power awesome advancements 
across every sector of our economy and society. As the world’s largest software company, this places us at a historic 
intersection of opportunity and responsibility to the world around us.  
Our mission to empower every person and every organization on the planet to achieve more has never been more 
urgent or more necessary. For all the uncertainty in the world, one thing is clear: People and organizations in every 
industry are increasingly looking to digita

You can view the pdf in your browser [here](./microsoft_annual_report_2022.pdf) if you would like. 


Print the first and last 200 characters of page 2

In [4]:
print(word_wrap(pdf_texts[1][:200]))
print('....')
print(word_wrap(pdf_texts[1][-200:]))

2   
But skills alone aren’t enough —we need to help people better
prepare for and connect to jobs. That’s why we’ve 
committed to equip
10  million people from underserved communities with skills for
....
ard our five -year commitment to bridge the disability divide for the
more than 1  billion 
people around the world with disabilities,
seeking to expand accessibility in technology, the workforce, and


In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

https://dev.to/eteimz/understanding-langchains-recursivecharactertextsplitter-2846

In [6]:
# Create an instance of RecursiveCharacterTextSplitter
character_splitter = RecursiveCharacterTextSplitter(
    # Separators are tried in order: the text is split on the first one, and any
    # piece still longer than chunk_size is split again on the next one.
    # A trailing "" separator would split on every character as a last resort:
    # separators=["\n\n", "\n", ". ", " ", ""],
    separators=["\n\n", "\n", ". ", " "],  # "" removed so text is never split mid-word
    chunk_size=1000,
    chunk_overlap=0
)
# Concatenate the PDF pages with "\n\n" (two newlines) as the delimiter, then split
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

print(word_wrap(character_split_texts[10]))
print(f"\n{character_split_texts[10]}")
print(f"\nlen(character_split_texts[10]): {len(character_split_texts[10])}")
print(f"Total chunks: {len(character_split_texts)}")

increased, due in large part to significant global datacenter
expansions and the growth in Xbox sales and usage. Despite 
these
increases, we remain dedicated to achieving a net -zero future. We
recognize that progress won’t always be linear, 
and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time.  
On the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate 
over 1.3  million cubic meters of volumetric benefits in nine
water basins around the world. Progress toward our zero waste

commitment included diverting more than 15,200 metric tons of solid
waste otherwise headed to landfills and incinerators, 
as well as
launching new Circular Centers to increase reuse and reduce e -waste at
our datacenters.  
We contracted to protect over 17,000 acres of land
(50% more than the land we use to operate), thus achieving our

increased, due in large part to significant global d
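
The separator comments in the cell above describe a recursive strategy. The sketch below is a simplified, pure-Python illustration of that idea, not LangChain's actual implementation: `recursive_split` is a hypothetical helper, and it ignores `chunk_overlap` entirely. It tries each separator in order, packs pieces together while they fit in `chunk_size`, and recurses with the remaining separators on any piece that is still too long.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into chunks of at most chunk_size characters where possible.

    Simplified illustration only; this is not LangChain's algorithm.
    """
    # Short enough already, or no separators left to try: return as-is.
    if len(text) <= chunk_size or not separators:
        return [text]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits in the chunk being built.
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) > chunk_size:
            # Too long on its own: fall back to the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
            current = ""
        else:
            current = piece
    if current:
        chunks.append(current)
    return chunks


example = "aaa bbb\n\nccc ddd eee"
print(recursive_split(example, ["\n\n", "\n", ". ", " "], chunk_size=8))
# ['aaa bbb', 'ccc ddd', 'eee']
```

Because the separator list has no `""` entry, a single word longer than `chunk_size` is returned whole rather than being cut mid-word, which mirrors the choice made in the cell above.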

#### Splitting text into chunks is one of the first steps in an embeddings-based RAG pipeline

#### chunk_overlap is a hyper-parameter
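
To make the effect of `chunk_overlap` concrete, here is a minimal sketch of fixed-size windowing over a token list. `window_tokens` is a hypothetical helper written for this note, and plain list items stand in for the model tokens that a real tokenizer would produce.

```python
def window_tokens(tokens, tokens_per_chunk, chunk_overlap=0):
    """Cut a token list into windows that share chunk_overlap tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("chunk_overlap must be >= 0 and < tokens_per_chunk")

    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + tokens_per_chunk])
        # Stop once a window has reached the end of the token list.
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks


tokens = list(range(10))
print(window_tokens(tokens, tokens_per_chunk=4, chunk_overlap=0))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(window_tokens(tokens, tokens_per_chunk=4, chunk_overlap=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

A larger overlap repeats more context across neighbouring chunks at the cost of producing more chunks to embed and store, which is why it is worth tuning.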

In [7]:
# The SentenceTransformers embedding model has a maximum context window of 256 tokens.
# Input beyond that limit is truncated, so each chunk is capped at 256 tokens.
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

token_split_texts = []
for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal token_split_texts chunks: {len(token_split_texts)}")


increased, due in large part to significant global datacenter
expansions and the growth in xbox sales and usage. despite these
increases, we remain dedicated to achieving a net - zero future. we
recognize that progress won ’ t always be linear, and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time. on the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate over 1. 3 million cubic meters of volumetric benefits in nine
water basins around the world. progress toward our zero waste
commitment included diverting more than 15, 200 metric tons of solid
waste otherwise headed to landfills and incinerators, as well as
launching new circular centers to increase reuse and reduce e - waste
at our datacenters. we contracted to protect over 17, 000 acres of land
( 50 % more than the land we use to operate ), thus achieving our

Total token_split_texts chunks: 349


In [19]:
# import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

# embedding_function = SentenceTransformerEmbeddingFunction()
embedding_function = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

#### Example of embedding 

In [18]:
emb_text = [token_split_texts[10]]
emb_vector = embedding_function(emb_text)[0]  # list has only 1 vector 
emb_vector_dim = len(emb_vector)

print(f'text to be embedded: {[token_split_texts[10]]}')
print(f'embedding vector: {emb_vector}')
print(f'embedding vector dimension: {emb_vector_dim}')

text to be embedded: ['increased, due in large part to significant global datacenter expansions and the growth in xbox sales and usage. despite these increases, we remain dedicated to achieving a net - zero future. we recognize that progress won ’ t always be linear, and the rate at which we can implement emissions reductions is dependent on many factors that can fluctuate over time. on the path to becoming water positive, we invested in 21 water replenishment projects that are expected to generate over 1. 3 million cubic meters of volumetric benefits in nine water basins around the world. progress toward our zero waste commitment included diverting more than 15, 200 metric tons of solid waste otherwise headed to landfills and incinerators, as well as launching new circular centers to increase reuse and reduce e - waste at our datacenters. we contracted to protect over 17, 000 acres of land ( 50 % more than the land we use to operate ), thus achieving our']
embedding vector: [0.0425626
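
Embedding vectors become useful once they can be compared. The sketch below shows cosine similarity, one common comparison measure; it is illustrative only, since the measure a Chroma collection actually uses depends on how the collection is configured. `cosine_similarity` is a hypothetical helper and the short vectors are toy values, not real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Similarity is undefined for a zero vector; return 0.0 by convention here.
        return 0.0
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction: 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal: 0.0
```

Texts with similar meaning are expected to map to vectors pointing in similar directions, so a higher score suggests a more relevant chunk.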

In [None]:
# chroma_client = chromadb.Client()
# chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

# ids = [str(i) for i in range(len(token_split_texts))]

# chroma_collection.add(ids=ids, documents=token_split_texts)
# chroma_collection.count()

In [21]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
from chromadb.config import Settings

In [23]:
is_persistent = True

# Specify the directory to store the database files
persist_dir = '../persist_dir/'

# Create a Chroma client
client = chromadb.Client(Settings(is_persistent=is_persistent, persist_directory=persist_dir))  

# Create a collection
collection_name = "microsoft_annual_report_2022"
collection = client.get_or_create_collection(name=collection_name, embedding_function=embedding_function)

In [13]:
# Specify the persistence directory
persist_directory = "Chromadb"

# Create a ChromaDB client with persistence.
# `ClientSettings` is not defined; use `Settings`, imported from chromadb.config above.
chroma_client = chromadb.Client(Settings(is_persistent=True, persist_directory=persist_directory))

# Get or create the collection
collection = chroma_client.get_or_create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

# Add documents to the collection (if necessary)
ids = [str(i) for i in range(len(token_split_texts))]
collection.add(ids=ids, documents=token_split_texts)

# Retrieve all documents; get() returns a dict of parallel lists
documents = collection.get()

# Access the document text
for text in documents['documents']:
    print(text)

In [10]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)

In [None]:
ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

print(f'chroma_collection.count(): {chroma_collection.count()} should equal Total_token_split_texts chunks: {len(token_split_texts)}')
print(f'chroma_collection.count(): {chroma_collection.count()}')
print(f"Total token_split_texts chunks: {len(token_split_texts)}")

In [None]:
first_document

In [None]:
# Retrieve the first two documents and their IDs
# first_document = chroma_collection.get(ids=[ids[0]])
first_document = chroma_collection.get(ids=['0', '1'])

# get() returns parallel lists under the 'documents' and 'ids' keys
document_text = first_document['documents']  # list of document texts
document_id = first_document['ids']  # list of document IDs

print("Document Text:", document_text)
print("Document ID:", document_id)

In [None]:
print(word_wrap(document_text[1]))

In [None]:
queries = ["What was the total revenue?", "What was the operating margin?"]

results = chroma_collection.query(query_texts=queries, n_results=3)

In [None]:
for i, query in enumerate(queries):
  retrieved_documents = results['documents'][i]
  for j, document in enumerate(retrieved_documents):
    print(f"query: {queries[i]}")
    # print(word_wrap(document, n_chars=90))
    print(f"document {j}: {word_wrap(document, n_chars=90)}")
    print('\n')
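
Conceptually, a query with `n_results=3` asks for the three stored vectors closest to the query vector. The sketch below does this by brute force with squared Euclidean distance. It is only meant to show the idea: `nearest_neighbors` is a hypothetical helper, the two-dimensional vectors are toy values, and a vector store would normally use an index rather than scanning every vector.

```python
def nearest_neighbors(query_vector, document_vectors, n_results=3):
    """Return the indices of the n_results vectors closest to query_vector."""
    scored = []
    for index, vector in enumerate(document_vectors):
        # Squared Euclidean distance: smaller means closer.
        distance = sum((q - d) ** 2 for q, d in zip(query_vector, vector))
        scored.append((distance, index))
    scored.sort()
    return [index for _, index in scored[:n_results]]


toy_vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
print(nearest_neighbors([1.0, 1.0], toy_vectors, n_results=3))
# [1, 3, 0]
```

The returned indices play the role of the document IDs: the store looks up the text for each one and returns it, closest first.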

In [15]:
import os
import openai
import sys
sys.path.append('../..')
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv(r'.env\my_api_key.env'))  # read the local env file; the raw string avoids the invalid "\m" escape
openai.api_key = os.environ['OPENAI_API_KEY']

openai_client = OpenAI()

In [16]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model=model,
        messages=messages,
    )
    content = response.choices[0].message.content
    return content
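
It can help to inspect the prompt before sending it to the model. The sketch below pulls the message construction out of `rag` into a separate function so it can be printed or tested without an API call. `build_messages` is a hypothetical helper, and the `"chunk A"` / `"chunk B"` strings are placeholders rather than real retrieved text.

```python
def build_messages(query, retrieved_documents):
    """Assemble the chat messages that rag() would send, without calling the API."""
    # Retrieved chunks are separated by blank lines, as in rag().
    information = "\n\n".join(retrieved_documents)
    system_prompt = (
        "You are a helpful expert financial research assistant. "
        "Your users are asking questions about information contained in an annual report. "
        "You will be shown the user's question, and the relevant information from the annual report. "
        "Answer the user's question using only this information."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"},
    ]


messages = build_messages("What was the total revenue?", ["chunk A", "chunk B"])
print(messages[1]["content"])
```

Printing the user message this way shows exactly which chunks the model will see, which is a quick check that each query is paired with its own retrieved documents.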

In [None]:
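# `query` and `retrieved_documents` still hold the values from the last iteration of the loop above
output = rag(query=query, retrieved_documents=retrieved_documents)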

print(word_wrap(output))

In [None]:
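for query, retrieved_documents in zip(queries, results['documents']):
  # Pair each query with the documents retrieved for it
  output = rag(query=query, retrieved_documents=retrieved_documents)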
  print(f'query: {query}')
  for i, retrieved_document in enumerate(retrieved_documents):
    print(f'retrieved document {i}: {retrieved_document}')
  print(f'output: {word_wrap(output)}\n')
  # print('\n')  

In [None]:
retrieved_documents

In [None]:
output = rag(query=queries, retrieved_documents=retrieved_documents)

print(queries)
print(word_wrap(output))

In [None]:
output = rag(query=queries[0], retrieved_documents=results['documents'][0])

print(queries[0])
print(word_wrap(output))

In [None]:
output = rag(query=queries[1], retrieved_documents=results['documents'][1])

print(queries[1])
print(word_wrap(output))