#### Overview of embeddings-based retrieval

- Use LangChain for character-based text splitting (chunk size measured in characters).
- Use a Sentence Transformers tokenizer for token-based splitting (chunk size measured in tokens).
- Use ChromaDB to store the embedding vectors.

In [1]:
from helper_utils import word_wrap

In [3]:
from pypdf import PdfReader

In [4]:
reader = PdfReader("./data/microsoft_annual_report_2022.pdf")
pdf_texts = [p.extract_text().strip() for p in reader.pages]

In [5]:
len(pdf_texts)

93

In [6]:
# Filter the empty strings
pdf_texts = [text for text in pdf_texts if text]

In [7]:
len(pdf_texts)

90

In [8]:
print(word_wrap(pdf_texts[0]))

1 Dear shareholders, colleagues, customers, and partners:  
We are
living through a period of historic economic, societal, and
geopolitical change. The world in 2022 looks nothing like 
the world in
2019. As I write this, inflation is at a 40 -year high, supply chains
are stretched, and the war in Ukraine is 
ongoing. At the same time, we
are entering a technological era with the potential to power awesome
advancements 
across every sector of our economy and society. As the
world’s largest software company, this places us at a historic

intersection of opportunity and responsibility to the world around us.
 
Our mission to empower every person and every organization on the
planet to achieve more has never been more 
urgent or more necessary.
For all the uncertainty in the world, one thing is clear: People and
organizations in every 
industry are increasingly looking to digital
technology to overcome today’s challenges and emerge stronger. And no

company is better positioned to help th

#### The RecursiveCharacterTextSplitter

The `RecursiveCharacterTextSplitter` is a utility that helps split long text into smaller chunks while maintaining as much context as possible. Here's how it works:

##### Separators

The `separators` list defines the order in which the text will be split. In this example:

- It first attempts to split by two newlines (`"\n\n"`), which typically indicates a paragraph break.
- If a resulting piece is still larger than the chunk size, it moves on to split by a single newline (`"\n"`), which indicates a line break.
- Then it tries to split by period followed by a space (`". "`), which indicates sentence boundaries.
- After that, it splits by a space (`" "`), which breaks the text at the word level.
- Finally, it splits by individual characters (`""`) if none of the above yield a chunk that meets the size requirement.

##### Chunk size and overlap

- `chunk_size=1000` means that each chunk will have a maximum of 1000 characters.
- `chunk_overlap=0` means there will be no overlap between consecutive chunks (i.e., no repeated content).
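
To make the two parameters concrete, here is a minimal sketch of fixed-size character windows with overlap. It is not how `RecursiveCharacterTextSplitter` works internally (that splitter cuts at separators rather than at fixed offsets); the helper name `sliding_chunks` is hypothetical and only illustrates what `chunk_size` and `chunk_overlap` control.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters. Illustration only:
    the LangChain splitter cuts at separators, not at fixed offsets.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=0))
# ['abcd', 'efgh', 'ij']
print(sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2))
# ['abcd', 'cdef', 'efgh', 'ghij']
```

With an overlap of 0 every character appears in exactly one chunk; with an overlap of 2 the last two characters of each chunk are repeated at the start of the next one.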

##### Recursive splitting

The process is recursive: it starts with the coarsest separator (paragraph breaks), and any piece that is still larger than 1000 characters is split again with the next finer separator (lines, sentences, words, and finally characters). Adjacent pieces that fit together within the limit are merged back, so the chunks end up as close to 1000 characters as possible while remaining coherent pieces of text.


```python
text = "This is a long paragraph with multiple sentences. It discusses several topics and ideas, flowing continuously. For instance, it talks about machine learning, deep learning, and various AI applications. While doing so, it doesn’t include paragraph breaks or line breaks. Everything is packed in a single block."
```

##### Initial Split
It tries to split using `"\n\n"` (paragraph breaks). There are no `\n\n` in this text, so no split happens.

##### Next Split
It then looks for `"\n"` (line breaks). There are none here either.

##### Next Split
It tries `". "` (sentence breaks). Here, it successfully splits the text into five sentences:
- "This is a long paragraph with multiple sentences."
- "It discusses several topics and ideas, flowing continuously."
- "For instance, it talks about machine learning, deep learning, and various AI applications."
- "While doing so, it doesn’t include paragraph breaks or line breaks."
- "Everything is packed in a single block."

##### Final Chunks
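If any of these sentences exceeded 1000 characters, it would continue splitting by `" "` (spaces) and eventually by characters if necessary. None does here, and the whole text is shorter than 1000 characters, so the sentences are merged back into a single chunk.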



```python
text = """Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis.

Machine learning is a subset of AI that enables systems to learn from data and improve from experience.

Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."""
```

##### Initial Split
The first separator `"\n\n"` (paragraph breaks) will be applied:
- "Data science is an interdisciplinary field that uses various techniques to extract insights from data. It involves statistics, machine learning, and data analysis."
- "Machine learning is a subset of AI that enables systems to learn from data and improve from experience."
- "Deep learning, a branch of machine learning, uses neural networks to model complex patterns in data."

##### Next Split
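If any paragraph exceeded 1000 characters, it would then be split further using `"\n"`, `". "`, and so on. Paragraphs that together fit within `chunk_size` are merged back into one chunk, which is the case for this short example.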
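
The two walkthroughs above can be reproduced with a simplified sketch of the split-then-merge idea. This is not LangChain's code: `recursive_split` is a hypothetical helper, it measures length with `len`, and unlike the real splitter it drops the separator wherever it starts a new chunk.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Simplified recursive splitter (a sketch, not LangChain's implementation)."""
    if len(text) <= chunk_size or not separators:
        return [text]

    # Use the first (coarsest) separator that occurs in the text.
    # The empty string always matches and means "split into characters".
    index = next(
        (i for i, sep in enumerate(separators) if sep == "" or sep in text),
        len(separators) - 1,
    )
    separator, finer = separators[index], separators[index + 1:]
    pieces = list(text) if separator == "" else text.split(separator)

    chunks, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Too large on its own: flush what we have, then recurse
            # on this piece with the finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, finer, chunk_size))
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate  # merge small pieces back together
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


SEPARATORS = ["\n\n", "\n", ". ", " ", ""]
text = (
    "Data science extracts insights from data. It involves statistics.\n\n"
    "Machine learning is a subset of AI.\n\n"
    "Deep learning uses neural networks."
)

# The whole text fits in 1000 characters, so it comes back as a single chunk.
print(len(recursive_split(text, SEPARATORS, chunk_size=1000)))

# A smaller limit forces paragraph splits, then sentence splits where needed.
for chunk in recursive_split(text, SEPARATORS, chunk_size=60):
    print(len(chunk), repr(chunk))
```

Lowering `chunk_size` is a quick way to watch the splitter fall through from paragraphs to sentences to words.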


In [10]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, SentenceTransformersTokenTextSplitter

In [13]:
character_splitter = RecursiveCharacterTextSplitter(
    separators   = ["\n\n", "\n", ". ", " ", ""],
    chunk_size   = 1000,
    chunk_overlap= 0
)

In [14]:
character_split_texts = character_splitter.split_text('\n\n'.join(pdf_texts))

In [15]:
print(f"\nTotal chunks: {len(character_split_texts)}")


Total chunks: 347


In [16]:
print(word_wrap(character_split_texts[10]))

increased, due in large part to significant global datacenter
expansions and the growth in Xbox sales and usage. Despite 
these
increases, we remain dedicated to achieving a net -zero future. We
recognize that progress won’t always be linear, 
and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time.  
On the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate 
over 1.3  million cubic meters of volumetric benefits in nine
water basins around the world. Progress toward our zero waste

commitment included diverting more than 15,200 metric tons of solid
waste otherwise headed to landfills and incinerators, 
as well as
launching new Circular Centers to increase reuse and reduce e -waste at
our datacenters.  
We contracted to protect over 17,000 acres of land
(50% more than the land we use to operate), thus achieving our


#### The SentenceTransformersTokenTextSplitter

The `SentenceTransformersTokenTextSplitter` splits text by token count, using the tokenizer of a Sentence Transformers model, so chunk sizes are measured in the same units the embedding model consumes. Here's how it works:

##### chunk_overlap=0
This means there is no overlap between consecutive chunks. Each chunk will be entirely separate from the previous one, with no repeated content.

##### tokens_per_chunk=256
This indicates that each chunk will contain a maximum of 256 tokens. Tokens here refer to the processed units of text after tokenization, which could be words, parts of words, punctuation marks, etc., depending on the tokenizer.

##### Use Case
This splitter is typically useful when working with models that have token limits (like many transformer models), where you need to control the number of tokens being processed at a time.
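
As a rough illustration of token-based splitting, the sketch below groups tokens into windows of at most `tokens_per_chunk`. It assumes a whitespace tokenizer as a stand-in; the real splitter uses the model's subword tokenizer, so its token counts are higher than word counts, and it decodes chunks back from tokens (which is why its output in this notebook is lowercased and reads `net - zero`). The helper name `split_by_tokens` is hypothetical.

```python
def split_by_tokens(text: str, tokens_per_chunk: int, chunk_overlap: int = 0) -> list[str]:
    """Group tokens into windows of at most tokens_per_chunk tokens."""
    if not 0 <= chunk_overlap < tokens_per_chunk:
        raise ValueError("need 0 <= chunk_overlap < tokens_per_chunk")
    tokens = text.split()  # stand-in tokenizer: whitespace words, not subwords
    step = tokens_per_chunk - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))  # "decode" the window back to text
        if start + tokens_per_chunk >= len(tokens):
            break  # this window already covers the last token
    return chunks


print(split_by_tokens("one two three four five six seven", tokens_per_chunk=3))
# ['one two three', 'four five six', 'seven']
```

A 1000-character chunk usually fits in a single 256-token window, which is consistent with the chunk count rising only from 347 to 349 later in this notebook.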


#### Two-Step Chunking Strategy: RecursiveCharacterTextSplitter + SentenceTransformersTokenTextSplitter

Chunking first with LangChain's character splitter (1000 characters per chunk) and then splitting each of those chunks again with `SentenceTransformersTokenTextSplitter` (256 tokens per chunk) gives a layered approach. Its benefits:

##### 1. Balanced Chunk Sizes for Text Processing
- **Initial Character-Based Chunking:** The initial chunking by LangChain (1000 characters) ensures that the text is divided into manageable pieces that retain context, such as paragraphs or sentences, without breaking down into excessively small parts.
- **Token-Based Splitting for Model Constraints:** Each character-based chunk is then split again by token count (256 tokens per chunk) so that it fits within the input limit of the embedding model and is not truncated.

##### 2. Optimized for Transformer Models
- Transformer-based models typically have a **maximum token limit** (often 512 or 1024 tokens). By splitting into 256-token chunks, you ensure that each chunk is well within the limit, reducing the risk of truncation or cutting off important information in the middle of a chunk.

##### 3. Combines Flexibility with Granularity
- **Character-based Splitting:** Handles initial splitting by context (paragraphs, sentences) and ensures that large blocks of text are broken up in a logical way without splitting mid-word.
- **Token-based Splitting:** Offers more **granularity** by ensuring each piece fits neatly into a model’s processing window, providing efficient model performance without losing coherence.

##### 4. Improved Performance for Downstream Tasks
- The combination of these two splitting strategies helps to balance **context retention** (larger chunks from character splitting) with **computational efficiency** (smaller chunks optimized for transformer models).
- This is especially useful for **tasks like text embedding, summarization, and question answering**.


In [17]:
token_splitter = SentenceTransformersTokenTextSplitter(chunk_overlap=0, tokens_per_chunk=256)

In [18]:
token_split_texts = []

for text in character_split_texts:
    token_split_texts += token_splitter.split_text(text)

print(word_wrap(token_split_texts[10]))
print(f"\nTotal chunks: {len(token_split_texts)}")

increased, due in large part to significant global datacenter
expansions and the growth in xbox sales and usage. despite these
increases, we remain dedicated to achieving a net - zero future. we
recognize that progress won ’ t always be linear, and the rate at which
we can implement emissions reductions is dependent on many factors that
can fluctuate over time. on the path to becoming water positive, we
invested in 21 water replenishment projects that are expected to
generate over 1. 3 million cubic meters of volumetric benefits in nine
water basins around the world. progress toward our zero waste
commitment included diverting more than 15, 200 metric tons of solid
waste otherwise headed to landfills and incinerators, as well as
launching new circular centers to increase reuse and reduce e - waste
at our datacenters. we contracted to protect over 17, 000 acres of land
( 50 % more than the land we use to operate ), thus achieving our

Total chunks: 349


In [19]:
import chromadb
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction

In [20]:
embedding_function = SentenceTransformerEmbeddingFunction()
print(embedding_function([token_split_texts[10]]))

[array([ 4.25627157e-02,  3.32118273e-02,  3.03401388e-02, -3.48665789e-02,
        6.84165433e-02, -8.09091479e-02, -1.54743800e-02, -1.45093317e-03,
       -1.67444535e-02,  6.77076355e-02, -5.05413748e-02, -4.91953716e-02,
        5.13999276e-02,  9.19272900e-02, -7.17784017e-02,  3.95196974e-02,
       -1.28335375e-02, -2.49475036e-02, -4.62286547e-02, -2.43575107e-02,
        3.39496396e-02,  2.55024582e-02,  2.73171254e-02, -4.12622327e-03,
       -3.63383330e-02,  3.69086629e-03, -2.74304301e-02,  4.79670428e-03,
       -2.88962591e-02, -1.88706890e-02,  3.66662741e-02,  2.56958492e-02,
        3.13127823e-02, -6.39344081e-02,  5.39441109e-02,  8.22534561e-02,
       -4.17567901e-02, -6.99577061e-03, -2.34860033e-02, -3.07479408e-02,
       -2.97919614e-03, -7.79094175e-02,  9.35316738e-03,  3.16281826e-03,
       -2.22570635e-02, -1.82946585e-02, -9.61250253e-03, -3.15068848e-02,
       -5.51971514e-03, -3.27030569e-02,  1.68029740e-01, -4.74596545e-02,
       -5.00168912e-02, 

#### Overview of `SentenceTransformerEmbeddingFunction`

1. **Input Text**: 
   - The function takes input text, which can be sentences, paragraphs, or even entire documents.

2. **Tokenization**: 
   - It processes the text through a Sentence Transformer model, which first tokenizes the input to handle it appropriately for embedding.

3. **Embedding Generation**: 
   - The tokenized input is passed through the model to generate embeddings. Each embedding is typically a fixed-length vector that represents the semantic meaning of the text.

##### Key Parameters (Example)

While the implementation specifics may vary, common parameters for initializing a SentenceTransformer embedding function might include:

- **model_name**: The name of the pre-trained Sentence Transformer model to use (e.g., `"all-MiniLM-L6-v2"`).
- **device**: Specifies whether to run the model on CPU or GPU for faster processing.

##### Benefits

1. **High-Quality Embeddings**: 
   - Sentence Transformers are pre-trained on large datasets and are optimized for producing high-quality embeddings that capture nuanced semantic meanings.

2. **Versatility**: 
   - The embeddings can be used in various NLP applications, including:
   - **Semantic similarity**
   - **Information retrieval**
   - **Text classification**
   - **Clustering**
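
The idea of mapping text to fixed-length vectors and retrieving by similarity can be sketched without any model. The example below assumes a toy bag-of-words embedding as a stand-in for the Sentence Transformer: it only matches shared words, whereas the real model captures meaning beyond exact word overlap. All function names here are hypothetical, and the nearest-neighbour step is what the vector store performs when queried.

```python
import math
import re


def words(text: str) -> list[str]:
    """Lowercase the text and keep alphanumeric runs as words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocab(documents: list[str]) -> dict[str, int]:
    """Assign each distinct word in the corpus a vector position."""
    vocab = sorted({w for doc in documents for w in words(doc)})
    return {w: i for i, w in enumerate(vocab)}


def toy_embed(text: str, vocab: dict[str, int]) -> list[float]:
    """Fixed-length, unit-norm word-count vector (toy stand-in for a model)."""
    vector = [0.0] * len(vocab)
    for w in words(text):
        if w in vocab:
            vector[vocab[w]] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def top_k(query: str, documents: list[str], k: int) -> list[str]:
    """Return the k documents whose vectors are most similar to the query."""
    vocab = build_vocab(documents)
    q = toy_embed(query, vocab)

    def score(doc: str) -> float:
        # Dot product of unit vectors equals cosine similarity.
        return sum(a * b for a, b in zip(q, toy_embed(doc, vocab)))

    return sorted(documents, key=score, reverse=True)[:k]


docs = [
    "we invested in water replenishment projects",
    "total revenue was 198 billion dollars",
    "the board of directors met in june",
]
print(top_k("What was the total revenue?", docs, k=1))
# ['total revenue was 198 billion dollars']
```

Every document is embedded into a vector of the same length, so similarity reduces to a dot product regardless of how long the original text was.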


In [21]:
chroma_client = chromadb.Client()
chroma_collection = chroma_client.get_or_create_collection("microsoft_annual_report_2022", 
                                                            embedding_function = embedding_function)

In [22]:
%%time
# takes time
ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

CPU times: total: 26.5 s
Wall time: 20.8 s


349

In [28]:
query = "What was the total revenue?"

results = chroma_collection.query(query_texts= [query], 
                                  n_results  = 5)
retrieved_documents = results['documents'][0]

for document in retrieved_documents:
    print(word_wrap(document))
    print('\n')

revenue, classified by significant product and service offerings, was
as follows : ( in millions ) year ended june 30, 2022 2021 2020 server
products and cloud services $ 67, 321 $ 52, 589 $ 41, 379 office
products and cloud services 44, 862 39, 872 35, 316 windows 24, 761 22,
488 21, 510 gaming 16, 230 15, 370 11, 575 linkedin 13, 816 10, 289 8,
077 search and news advertising 11, 591 9, 267 8, 524 enterprise
services 7, 407 6, 943 6, 409 devices 6, 991 6, 791 6, 457 other 5, 291
4, 479 3, 768 total $ 198, 270 $ 168, 088 $ 143, 015 we have recast
certain previously reported amounts in the table above to conform to
the way we internally manage and monitor our business.


74 note 13 — unearned revenue unearned revenue by segment was as
follows : ( in millions ) june 30, 2022 2021 productivity and business
processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more
personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in
unearned revenue were as follows : ( in milli

In [29]:
import os
import openai
from openai import OpenAI

# from dotenv import load_dotenv, find_dotenv
# _ = load_dotenv(find_dotenv()) # read local .env file
# openai.api_key = os.environ['OPENAI_API_KEY']

# OpenAI() reads the API key from the OPENAI_API_KEY environment variable by default
openai_client = OpenAI()

In [30]:
def rag(query, retrieved_documents, model="gpt-3.5-turbo"):
    information = "\n\n".join(retrieved_documents)

    messages = [
        {
            "role": "system",
            # Adjacent string literals are concatenated, so the first must end with a space
            "content": "You are a helpful expert financial research assistant. Your users are asking questions about information contained in an annual report. "
            "You will be shown the user's question, and the relevant information from the annual report. Answer the user's question using only this information."
        },
        {"role": "user", "content": f"Question: {query}. \n Information: {information}"}
    ]
    
    response = openai_client.chat.completions.create(
        model   = model,
        messages= messages,
    )
    content = response.choices[0].message.content
    return content

In [31]:
output = rag(query=query, retrieved_documents=retrieved_documents)

print(word_wrap(output))

The total revenue for the year ended June 30, 2022, was $198,270
million.


#### Roles of the LLM in the RAG Function

1. **Understanding Natural Language**
   - **Interpretation of Queries**: The LLM interprets and understands user questions, grasping their meaning and intent.

2. **Contextual Response Generation**
   - **Utilizing Provided Information**: Generates responses by synthesizing the user's query with relevant information from `retrieved_documents`.
   - **Maintaining Coherence**: Ensures responses are coherent and logically structured.

3. **Inference and Knowledge Integration**
   - **Inferring Missing Information**: Connects facts that are spread across the provided context to produce a complete answer.
   - **Limited Contextualization**: The system prompt instructs the model to use only the given documents rather than external knowledge.

4. **Adjusting Tone and Style**
   - **Role-Specific Behavior**: Adjusts tone and style based on the specified role (e.g., expert financial research assistant).
   - **Customization of Responses**: Modifies language and complexity to suit different audiences.

5. **Output Generation**
   - **Returning a Response**: Outputs the generated response for presentation to the user, ensuring quality based on the retrieved documents.



#### Data Processing Flow in RAG with LLM

When you pass documents along with a query to a Large Language Model (LLM) hosted by a service provider, the following process occurs:

##### 1. Data Transmission
- **Sending Data**: The query and retrieved documents are sent to the LLM via an API call over the internet, typically as part of the request payload in a structured format (e.g., JSON).
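
To see what "part of the request payload" means, the sketch below serializes a query and retrieved chunks into a JSON body. The `model` and `messages` fields mirror the arguments passed to `chat.completions.create` in the `rag` function; the helper name `build_request_body` and the shortened system prompt are assumptions for illustration, and the client library handles the actual request, headers, and authentication.

```python
import json


def build_request_body(query: str, retrieved_documents: list[str],
                       model: str = "gpt-3.5-turbo") -> str:
    """Serialize the query and retrieved chunks as a JSON request body (sketch)."""
    information = "\n\n".join(retrieved_documents)
    body = {
        "model": model,
        "messages": [
            {
                "role": "system",
                "content": "Answer the user's question using only the information provided.",
            },
            {
                "role": "user",
                "content": f"Question: {query}\n\nInformation: {information}",
            },
        ],
    }
    return json.dumps(body)


payload = build_request_body(
    "What was the total revenue?",
    ["first retrieved chunk", "second retrieved chunk"],
)
print(len(payload), "characters in the request body")
print(json.loads(payload)["messages"][1]["content"])
```

Every retrieved chunk travels in full inside the user message, so the payload grows with the number and size of the chunks.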

##### 2. Processing on the Host Machine
- **Host Infrastructure**: The LLM resides on the service provider's servers, equipped with powerful hardware (like GPUs or TPUs) for large-scale computations.
- **Input Handling**: The host machine receives the query and documents, processing them through various NLP algorithms such as tokenization, embedding, and attention mechanisms.

##### 3. Response Generation
- **Contextual Processing**: The LLM synthesizes the information by analyzing the query in relation to the provided documents, utilizing its training to generate a coherent response.
- **Returning the Output**: After processing, the host machine sends the generated response back to your application via the API.

##### 4. Receiving the Response
- **Output Retrieval**: Your application receives the response and can present it to the user or perform additional processing as needed.

##### Important Considerations
- **Data Privacy**: Since documents are sent to a third-party service, consider privacy and data security, especially for sensitive or proprietary information. Adhere to relevant data protection regulations.
- **Latency**: The time taken for the request to travel, be processed, and return can introduce latency. The actual speed depends on network conditions and processing complexity.
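
One simple way to observe this latency is to time the call from the client side. The sketch below uses a hypothetical `fake_llm_call` that just sleeps, standing in for the real API request; `timed` is likewise an illustrative helper, not part of any library.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def fake_llm_call(prompt: str) -> str:
    """Hypothetical stand-in for a network call to a hosted model."""
    time.sleep(0.05)  # simulate network and processing time
    return f"echo: {prompt}"


answer, seconds = timed(fake_llm_call, "What was the total revenue?")
print(f"{seconds:.3f}s -> {answer}")
```

In this notebook the same helper could wrap the real call, for example `timed(rag, query=query, retrieved_documents=retrieved_documents)`.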

