In [1]:
%pip install --upgrade --quiet python_dotenv pypdf langchain \
 sentence-transformers huggingface_hub chromadb

Note: you may need to restart the kernel to use updated packages.


In [1]:
import os
from dotenv import load_dotenv, find_dotenv
from huggingface_hub import login
_ = load_dotenv(find_dotenv()) # read local .env file
hugging_face_access_token = os.environ['HUGGINGFACEHUB_API_TOKEN']
login(hugging_face_access_token)

Token will not been saved to git credential helper. Pass `add_to_git_credential=True` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /home/studio-lab-user/.cache/huggingface/token
Login successful


In [2]:
def word_wrap(text: str, max_len: int = 72) -> str:
    if len(text) < max_len:
        return text
    ans = text[:max_len].rsplit(' ', 1)[0] + "\n" + \
        word_wrap(text[len(text[:max_len].rsplit(' ', 1)[0]):], max_len)
    return ans
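
The helper above leaves a leading space on each continuation line, and it recurses until a RecursionError when a remainder starts with a space followed by a single word of max_len characters or more. A minimal alternative sketch using the standard library's textwrap module is shown here. The name word_wrap_stdlib is hypothetical, and textwrap treats whitespace slightly differently, so the printed output will not match the cells below character for character.

```python
import textwrap


def word_wrap_stdlib(text: str, max_len: int = 72) -> str:
    """Wrap each existing line of `text` to at most `max_len` characters."""
    wrapped_lines = []
    for line in text.splitlines():
        # textwrap.fill returns "" for an empty line, so blank lines survive.
        # Words longer than max_len are broken rather than causing recursion.
        wrapped_lines.append(textwrap.fill(line, width=max_len))
    return "\n".join(wrapped_lines)
```

It can be swapped in wherever word_wrap is called, for example print(word_wrap_stdlib(pdf_docs[0].page_content)).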

In [3]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("microsoft_annual_report_2022.pdf")
pdf_docs = [doc for doc in loader.load() if len(doc.page_content.strip()) != 0]

In [4]:
print(len(pdf_docs))
print(pdf_docs[0].metadata)
print(word_wrap(pdf_docs[0].page_content))

90
{'source': 'microsoft_annual_report_2022.pdf', 'page': 2}
1 Dear shareholders, colleagues, customers, and partners:  
We are
 living through a period of historic economic, societal, and
 geopolitical change. The world in 2022 looks nothing like 
the world
 in 2019. As I write this, inflation is at a 40 -year high, supply
 chains are stretched, and the war in Ukraine is 
ongoing. At the same
 time, we are entering a technological era with the potential to power
 awesome advancements 
across every sector of our economy and society.
 As the world’s largest software company, this places us at a historic
 
intersection of opportunity and responsibility to the world around
 us.  
Our mission to empower every person and every organization on
 the planet to achieve more has never been more 
urgent or more
 necessary. For all the uncertainty in the world, one thing is clear:
 People and organizations in every 
industry are increasingly looking
 to digital technology to overcome today’s chall

In [5]:
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter)

# separators are tried in order during recursive splitting:
# first "\n\n"; if a piece is still over 1000 characters, then "\n",
# and so on

# chunk_overlap is a hyperparameter you can tune
# to find the best chunking. Overlap carries some
# context over from one chunk into the next
character_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " ", ""],
    chunk_overlap=0,
    chunk_size=1000
)
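
To make the separator order concrete, here is a simplified sketch of the recursive idea in plain Python. It is not LangChain's implementation: the function name recursive_split is hypothetical, overlap is ignored, and separators at chunk boundaries are dropped, whereas the library can keep them.

```python
from typing import List


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, merge small neighbours, and recurse
    into any piece that is still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # No usable separator left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for part in text.split(sep):
        if len(part) > chunk_size:
            # Flush what has been merged so far, then try the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(part, rest, chunk_size))
            continue
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # still fits: keep merging
        else:
            chunks.append(current)
            current = part
    if current:
        chunks.append(current)
    return chunks
```

With chunk_size=7 and the same separator list as above, "aaa bbb\n\nccc ddd eee" is first split on the blank line, and only the oversized second half is split again on spaces.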

In [6]:
pdf_docs_character_splits = character_splitter.split_documents(pdf_docs)

In [7]:
len(pdf_docs_character_splits)

347

In [8]:
# The character-split documents need a second, token-based split so that every chunk
# fits the context window of the sentence-transformers embedding model that we will
# use to construct the vector store below. That maximum context size is 256 tokens.
# The splitter downloads a sentence-transformers model from Hugging Face to tokenize the documents.
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

In [9]:
pdf_docs_token_splits = token_splitter.split_documents(pdf_docs_character_splits)

In [10]:
# token splitting yields 2 more chunks than character splitting (349 vs. 347)
print(len(pdf_docs_token_splits))
print(pdf_docs_token_splits[10].metadata)
print(word_wrap(pdf_docs_token_splits[10].page_content))

349
{'source': 'microsoft_annual_report_2022.pdf', 'page': 4}
increased, due in large part to significant global datacenter
 expansions and the growth in xbox sales and usage. despite these
 increases, we remain dedicated to achieving a net - zero future. we
 recognize that progress won ’ t always be linear, and the rate at
 which we can implement emissions reductions is dependent on many
 factors that can fluctuate over time. on the path to becoming water
 positive, we invested in 21 water replenishment projects that are
 expected to generate over 1. 3 million cubic meters of volumetric
 benefits in nine water basins around the world. progress toward our
 zero waste commitment included diverting more than 15, 200 metric tons
 of solid waste otherwise headed to landfills and incinerators, as well
 as launching new circular centers to increase reuse and reduce e -
 waste at our datacenters. we contracted to protect over 17, 000 acres
 of land ( 50 % more than the land we use to operate 

In [11]:
# token splits fit within the 256-token context size of the
# selected sentence-transformers embedding model
token_splitter.count_tokens(text=pdf_docs_token_splits[10].page_content)

193
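
The cell above checks a single chunk. A small sketch for checking every chunk against the limit follows. The helper name find_oversized is hypothetical, and the whitespace counter is only a stand-in so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List, Tuple


def find_oversized(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int],
    limit: int = 256,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every chunk whose count exceeds `limit`."""
    oversized = []
    for index, chunk in enumerate(chunks):
        n_tokens = count_tokens(chunk)
        if n_tokens > limit:
            oversized.append((index, n_tokens))
    return oversized


def whitespace_count(text: str) -> int:
    """Stand-in token counter: counts whitespace-separated words."""
    return len(text.split())


print(find_oversized(["a b c", "a b c d e f"], whitespace_count, limit=4))
# prints [(1, 6)]
```

In this notebook the real counter could be passed as lambda t: token_splitter.count_tokens(text=t), with the chunks taken from doc.page_content for each doc in pdf_docs_token_splits. An empty result means every chunk fits.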

In [12]:
# If we were to use sentence-transformers outside of the
# LangChain integration, we could directly reuse the model
# we downloaded above when tokenizing the docs. However, the
# out-of-the-box embedder can't be used with LangChain,
# so we have to download it again from Hugging Face,
# this time wrapped so that it fits the LangChain framework.
# This will be used in constructing the vector store
# along with the appropriately tokenized docs.

# Note: sentence-transformers creates a single dense vector
# per document chunk. It builds that vector with a pooling
# layer on top of BERT-based token embeddings

from langchain_community.embeddings import HuggingFaceEmbeddings
hf_embedding_model = HuggingFaceEmbeddings(model_name='all-MiniLM-L6-v2')

In [13]:
embedding = hf_embedding_model.embed_query("this is a text")
# the embedding vector has 384 dimensions
print(len(embedding))

384
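
The pooling step mentioned in the note above can be illustrated in a few lines of plain Python. This is a sketch that assumes mean pooling over non-padding tokens followed by L2 normalisation; the function name mean_pool and the toy two-dimensional vectors are made up for illustration, and the real model does this on 384-dimensional tensors.

```python
import math
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask is 1, then scale to unit length."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # One averaged value per embedding dimension.
    pooled = [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm > 0 else pooled


# Three token vectors; the last one is padding and is masked out.
tokens = [[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]
mask = [1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
```

However many tokens a chunk contains, the result is always one vector of the model's embedding size, which is why every chunk ends up with 384 numbers here.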


In [14]:
from enum import Enum


class SearchType(Enum):
    """Convenient Enum for selecting the search type
    to be used when generating a retriever.

    Example with a similarity score threshold:

      retriever = db.as_retriever(
          search_type="similarity_score_threshold",
          search_kwargs={"score_threshold": 0.5}
      )
    """
    # default
    similarity = 1
    # Maximal marginal relevance optimizes for similarity
    # to query and diversity among selected documents.
    mmr = 2
    # You can also set a retrieval method that
    # sets a similarity score threshold and only
    # returns documents with a score above that threshold.
    similarity_score_threshold = 3


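# The LangChain Chroma integration hides some methods,
# like counting the number of items in the collection.
# The out-of-the-box Chroma collection has a count method.

from langchain_community.vectorstores import Chroma

# from_documents is a classmethod, so the collection name has to be
# passed to it directly; a name set on a separately constructed
# Chroma instance would not carry over to the store it returns.
db = Chroma.from_documents(
    pdf_docs_token_splits,
    hf_embedding_model,
    collection_name="microsoft_annual_report_2022",
)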

# retriever = db.as_retriever(search_type=SearchType.mmr.name)
# to specify top k results
# retriever = db.as_retriever(search_type=SearchType.mmr.name, search_kwargs={"k": 4})

In [15]:
# top k is a hyperparameter you can experiment with
retriever = db.as_retriever(search_type=SearchType.mmr.name, search_kwargs={"k": 4})
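
To show what the mmr search type trades off, here is a plain-Python sketch of greedy maximal marginal relevance selection. It is an illustration of the technique, not the code LangChain or Chroma runs; the names cosine and mmr_select and the toy vectors are assumptions.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(
    query_vec: Sequence[float],
    doc_vecs: List[Sequence[float]],
    k: int = 4,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick k document indices, balancing relevance and diversity."""
    selected: List[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = candidates[0], -math.inf
        for idx in candidates:
            relevance = cosine(query_vec, doc_vecs[idx])
            # Similarity to the closest already-selected document (0 if none yet).
            redundancy = max(
                (cosine(doc_vecs[idx], doc_vecs[s]) for s in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = idx, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query = [1.0, 0.0]
docs = [[1.0, 0.1], [1.0, 0.12], [0.7, -0.7]]  # docs 0 and 1 are near-duplicates
diverse = mmr_select(query, docs, k=2, lambda_mult=0.5)
relevance_only = mmr_select(query, docs, k=2, lambda_mult=1.0)
```

With lambda_mult=0.5 the near-duplicate is skipped in favour of the more different document, while lambda_mult=1.0 reduces to plain similarity ranking. LangChain's MMR search accepts lambda_mult and fetch_k through search_kwargs, so the same trade-off can be tuned on the retriever above.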

In [16]:
query = "What was the total revenue?"
retrieved_docs = retriever.get_relevant_documents(query)
retrieved_docs.sort(key=lambda doc: doc.metadata['page'])

In [17]:
for doc in retrieved_docs:
    print(doc.metadata)
    print(word_wrap(doc.page_content))
    print("")

{'page': 35, 'source': 'microsoft_annual_report_2022.pdf'}
engineering, gaming, and linkedin. • sales and marketing expenses
 increased $ 1. 7 billion or 8 % driven by investments in commercial
 sales and linkedin. sales and marketing included a favorable foreign
 currency impact of 2 %. • general and administrative expenses
 increased $ 793 million or 16 % driven by investments in corporate
 functions. operating income increased $ 13. 5 billion or 19 % driven
 by growth across each of our segments. current year net income and
 diluted eps were positively impacted by the net tax benefit related to
 the transfer of intangible properties, which resulted in an increase
 to net income and diluted eps of $ 3. 3 billion and $ 0. 44,
 respectively. prior year net income and diluted eps were positively
 impacted by the net tax benefit related to the india supreme court
 decision on withholding taxes, which resulted in an increase to net
 income and diluted eps of $ 620 million and $ 0. 08, res

In [18]:
from langchain.schema import HumanMessage, SystemMessage
from langchain.prompts import ChatPromptTemplate

def rag(query: str, retrieved_docs, chat_model) -> str:

    text = [doc.page_content for doc in retrieved_docs]
    information = "\n\n".join(text)

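    # Adjacent string literals inside parentheses are concatenated without
    # the stray indentation that backslash continuations leave in the prompt.
    system_message = (
        "You are a helpful expert financial research assistant. "
        "Your users are asking questions about information contained in an annual report. "
        "You will be shown the user's question, and the relevant information from "
        "the annual report. Answer the user's question using only this information."
    )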

    content = f"Question: {query}. \n Information: {information}"

    messages = [
      SystemMessage(content=system_message),
      HumanMessage(
        content=content
      ),
    ]

    response = chat_model.invoke(messages)

    return response.content
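
One possible variation on the prompt assembly in rag is to label each retrieved chunk with its page number, so that an answer can be traced back to the report. The sketch below is self-contained and uses a hypothetical Doc dataclass as a stand-in for a LangChain document; format_context and build_user_message are illustrative names, not part of any library.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in exposing the two attributes rag() relies on."""
    page_content: str
    metadata: Dict[str, object] = field(default_factory=dict)


def format_context(docs: List[Doc]) -> str:
    """Join chunks with a blank line, prefixing each with its page label."""
    blocks = []
    for doc in docs:
        page = doc.metadata.get("page", "unknown")
        blocks.append(f"[page {page}]\n{doc.page_content}")
    return "\n\n".join(blocks)


def build_user_message(query: str, docs: List[Doc]) -> str:
    """Assemble the human message text from the question and the labelled context."""
    return f"Question: {query}\nInformation:\n{format_context(docs)}"


example = build_user_message(
    "What was the total revenue?",
    [Doc("revenue text", {"page": 35}), Doc("other text")],
)
```

The returned string could be passed as the content of the HumanMessage in place of the f-string used above. Whether page labels actually improve the answers is something to check empirically.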

In [19]:
import llm_utils
llm = llm_utils.HuggingFaceHub(
    repo_id="HuggingFaceH4/zephyr-7b-beta",
    task="text-generation",
    model_kwargs={
        "max_new_tokens": 512,
        "top_k": 30,
        "temperature": 0.1,
        "repetition_penalty": 1.03,
    },
)

In [20]:
chat_model = llm_utils.ChatHuggingFace(llm=llm)

                    repo_id was transferred to model_kwargs.
                    Please confirm that repo_id is what you intended.
                    task was transferred to model_kwargs.
                    Please confirm that task is what you intended.
                    huggingfacehub_api_token was transferred to model_kwargs.
                    Please confirm that huggingfacehub_api_token is what you intended.


In [21]:
query = "What was the total revenue?"
retrieved_docs = retriever.get_relevant_documents(query)
retrieved_docs.sort(key=lambda doc: doc.metadata['page'])
response = rag(query, retrieved_docs, chat_model)
print(word_wrap(response))

Based on the provided information, the total revenue for the year ended
 June 30, 2022, is $198,270 million. This can be found by adding up the
 revenue figures listed under each product and service offering in the
 "Revenue, classified by significant product and service offerings"
 section. The total revenue for the previous years, 2021 and 2020, are
 also provided for reference.


In [22]:
query = "What were the expenses?"
retrieved_docs = retriever.get_relevant_documents(query)
retrieved_docs.sort(key=lambda doc: doc.metadata['page'])
response = rag(query, retrieved_docs, chat_model)
print(word_wrap(response))

Based on the provided information, the expenses for the given time
 period can be summarized as follows:

1. General and administrative
 expenses: These expenses include salaries, benefits, stock-based
 compensation, and other related costs for various departments such as
 finance, legal, facilities, human resources, and administrative
 personnel. They increased by $793 million or 16% in the latest year,
 mainly due to investments in corporate functions.

2. Other income
 (expense), net: This category includes various sources of income and
 expenses, such as interest and dividends, interest expense,
 gains/losses on investments, gains/losses on derivatives, and foreign
 currency remeasurements. In the latest year, other income was $333
 million, while other expense was $1,186 million in the previous
 year.

3. Other receivables due from suppliers: These are receivables
 from suppliers that are expected to be paid within one year. As of the
 latest year, they were $1 billion, compared t