In [None]:
!pip install pymilvus ollama llama-index-llms-ollama llama-index-vector-stores-milvus

In [None]:
!pip install llama-index-embeddings-jinaai llama-index-readers-file

In [None]:
!pip install sentence-transformers llama-index-embeddings-huggingface

### Load Jina AI Embedding Model

In [35]:
from llama_index.embeddings.huggingface import HuggingFaceEmbedding


jina_embedding_model = HuggingFaceEmbedding(model_name="jinaai/jina-embeddings-v2-base-en")
len(jina_embedding_model.get_text_embedding('This is a test'))

Some weights of BertModel were not initialized from the model checkpoint at jinaai/jina-embeddings-v2-base-en and are newly initialized: ['embeddings.position_embeddings.weight', 'encoder.layer.0.intermediate.dense.bias', 'encoder.layer.0.intermediate.dense.weight', 'encoder.layer.0.output.LayerNorm.bias', 'encoder.layer.0.output.LayerNorm.weight', 'encoder.layer.0.output.dense.bias', 'encoder.layer.0.output.dense.weight', 'encoder.layer.1.intermediate.dense.bias', 'encoder.layer.1.intermediate.dense.weight', 'encoder.layer.1.output.LayerNorm.bias', 'encoder.layer.1.output.LayerNorm.weight', 'encoder.layer.1.output.dense.bias', 'encoder.layer.1.output.dense.weight', 'encoder.layer.10.intermediate.dense.bias', 'encoder.layer.10.intermediate.dense.weight', 'encoder.layer.10.output.LayerNorm.bias', 'encoder.layer.10.output.LayerNorm.weight', 'encoder.layer.10.output.dense.bias', 'encoder.layer.10.output.dense.weight', 'encoder.layer.11.intermediate.dense.bias', 'encoder.layer.11.intermedi

768

## Chunking

Let's have a look at different chunking strategies

In [36]:
from llama_index.core import SimpleDirectoryReader

docs = SimpleDirectoryReader(input_files=["data/pdfs/doordash_listing_small.pdf", "data/pdfs/DASH_q1_24_financials.pdf"]).load_data()

### Chunk size 100, Overlap 20

In [37]:
from llama_index.core.node_parser import SentenceSplitter

base_splitter = SentenceSplitter(chunk_size=100, chunk_overlap=20)
base_nodes = base_splitter.get_nodes_from_documents(docs)

for elt in base_nodes[5:10]:
    print(f'element is: {elt.get_content()}\n')

element is: as defined in Rule 405 of the Securities Act. Yes ☐ No ☒Indicate by check mark if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act.

element is: Yes ☐ No ☒Indicate by check mark whether the registrant: (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding12  months  (or  for  such  shorter  period  that  the  registrant  was  required  to  file  such  reports);

element is: and  (2)  has  been  subject  to  such  filing  requirements  for  the  past  90days.

element is: Yes ☐    No ☒Indicate by check mark whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).

element is: Yes  ☒   No  ☐ Indicate by check mark whether the reg

### Chunk size 256, Overlap 50

In [38]:
from llama_index.core.node_parser import SentenceSplitter

base_splitter = SentenceSplitter(chunk_size=256, chunk_overlap=50)
base_nodes = base_splitter.get_nodes_from_documents(docs)

for elt in base_nodes[5:10]:
    print(f'element is: {elt.get_content()}\n')

element is: The registrant has elected to use December 31, 2020 as the calculation datebecause on June 30, 2020 (the last business day of the registrant's most recently completed second fiscal quarter), the registrant was a privately held company. Thiscalculation does not reflect a determination that certain persons are affiliates of the registrant for any other purpose.The registrant had outstanding 290,150,290 shares of Class A common stock, 31,313,450 shares of Class B common stock, and no shares of Class C common stock as ofFebruary 26, 2021.DOCUMENTS INCORPORATED BY REFERENCEPortions of the registrant’s Definitive Proxy Statement relating to the 2021 Annual Meeting of Stockholders are incorporated by reference into Part III of this Annual Report onForm 10-K where indicated. Such Definitive Proxy Statement will be filed with the Securities and Exchange Commission within 120 days after the end of the registrant’s fiscalyear ended December 31, 2020. 2

element is: Table of ContentsTA

### Chunk size 512, Overlap 100

In [39]:
from llama_index.core.node_parser import SentenceSplitter

base_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=100)
base_nodes = base_splitter.get_nodes_from_documents(docs)

for elt in base_nodes[5:10]:
    print(f'element is: {elt.get_content()}\n')

element is: Insome cases, you can identify forward-looking statements because they contain words such as “may,” “will,” “should,” “expect,” “plan,” “anticipate,”“could,” “would,” “intend,” “target,” “project,” “contemplate,” “believe,” “estimate,” “predict,” “potential” or “continue” or the negative of these words orother similar terms or expressions that concern our expectations, strategy, plans or intentions. Forward-looking statements contained in this AnnualReport on Form 10-K include, but are not limited to, statements about:•our future financial performance, including our expectations regarding our revenue, cost of revenue, operating expenses, Total Orders,Marketplace GOV, Contribution Profit (Loss), Contribution Margin, Adjusted Gross Profit, Adjusted Gross Margin, Adjusted EBITDA, andAdjusted EBITDA Margin, our ability to determine reserves, and our ability to maintain and increase long-term future profitability;•our ability to successfully execute our business and growth strat

## Semantic Chunking

In [40]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=jina_embedding_model
)

nodes = splitter.get_nodes_from_documents(docs)

for elt in nodes[5:10]:
    print(f'element is: {elt.get_content()}\n')

element is: Form 10-K Summary136Signatures137
3

element is: Table of ContentsSPECIAL NOTE REGARDING FORWARD-LOOKING STATEMENTSThis Annual Report on Form 10-K contains forward-looking statements within the meaning of the federal securities laws, which statements involvesubstantial risks and uncertainties. Forward-looking statements generally relate to future events or our future financial or operating performance. Insome cases, you can identify forward-looking statements because they contain words such as “may,” “will,” “should,” “expect,” “plan,” “anticipate,”“could,” “would,” “intend,” “target,” “project,” “contemplate,” “believe,” “estimate,” “predict,” “potential” or “continue” or the negative of these words orother similar terms or expressions that concern our expectations, strategy, plans or intentions. Forward-looking statements contained in this AnnualReport on Form 10-K include, but are not limited to, statements about:•our future financial performance, including our expectati

# Load data in Milvus

In [41]:
from llama_index.llms.ollama import Ollama
from llama_index.vector_stores.milvus import MilvusVectorStore

from llama_index.core import StorageContext, ServiceContext
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

llm = Ollama(model="llama3", request_timeout=120.0)

service_context_jina = ServiceContext.from_defaults(llm=llm, embed_model=jina_embedding_model, chunk_size=300, chunk_overlap=50)

vector_store_jina = MilvusVectorStore(
    uri="milvus_rag_llama_index.db",
    collection_name="doordash_listing_demo",
    dim=768,  # the value changes with embedding model
    overwrite=True  # drop table if exist and then create
    )
storage_context_jina = StorageContext.from_defaults(vector_store=vector_store_jina)

  service_context_jina = ServiceContext.from_defaults(llm=llm, embed_model=jina_embedding_model, chunk_size=300, chunk_overlap=50)


In [42]:
docs = SimpleDirectoryReader(input_files=['data/pdfs/doordash_listing_small.pdf',"data/pdfs/DASH_q1_24_financials.pdf"]).load_data()

In [43]:
vector_index_jina = VectorStoreIndex.from_documents(docs, storage_context=storage_context_jina, service_context=service_context_jina)

E20240611 19:08:15.668916 118249707 collection_data.cpp:84] [SERVER][Insert][] Insert data failed, errs: attempt to write a readonly database
E20240611 19:08:15.684504 118249707 collection_data.cpp:84] [SERVER][Insert][] Insert data failed, errs: attempt to write a readonly database


In [44]:
from llama_index.core.tools import RetrieverTool, ToolMetadata

milvus_tool_openai = RetrieverTool(
    retriever=vector_index_jina.as_retriever(similarity_top_k=3),  # retrieve top_k results
    metadata=ToolMetadata(
        name="CustomRetriever",
        description='Retrieve relevant information from provided documents.'
    ),
)

In [45]:
query_engine = vector_index_jina.as_query_engine()
response = query_engine.query("Can you tell me more about this Doordash Listing? Summarise it please and give me 5 points that are important")
print(response)

Based on the provided context, I can summarize the Doordash listing as follows:

The summary is not available in the given context. However, based on the financial reports of DoorDash (DASH) for Q1 2024, here are five key points that might be important:

1. Revenue growth rates may decline due to a widespread COVID-19 vaccine rollout.
2. The company identified a material weakness in its internal control over financial reporting and may identify additional weaknesses or fail to maintain an effective system of internal controls.
3. GAAP research and development expense as a percentage of Marketplace GOV was 1.5% in Q1 2024, consistent with previous periods.
4. GAAP general and administrative expense increased by 12% year-over-year (YoY) due to increases in litigation reserves, personnel-related costs, credit card chargebacks, and bad debt expense.
5. The company improved its GAAP net loss including redeemable non-controlling interests, decreasing it from $162 million in Q1 2023 to $25 mi

## Semantic Chunking

In [46]:
doc_semantic = SimpleDirectoryReader(input_files=["data/pdfs/doordash_listing_small.pdf","data/pdfs/DASH_q1_24_financials.pdf"]).load_data()

In [47]:
from llama_index.core.node_parser import SemanticSplitterNodeParser

splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=jina_embedding_model
)

nodes = splitter.get_nodes_from_documents(doc_semantic)

In [48]:
vector_index_semantic = VectorStoreIndex(nodes=nodes, storage_context=storage_context_jina, service_context=service_context_jina)
query_engine_semantic = vector_index_semantic.as_query_engine()

E20240611 19:08:59.387715 118249707 collection_data.cpp:84] [SERVER][Insert][] Insert data failed, errs: attempt to write a readonly database


In [49]:
print(query_engine_semantic.query('Tell me more about the assets that this company has'))

The company in question appears to be a technology-driven food delivery platform, as evident from its discussion of revenue growth and logistical efficiency. While it doesn't explicitly mention specific assets, we can infer some insights from its business operations.

Firstly, considering the increased demand for delivery services during the COVID-19 pandemic, the company likely possesses a substantial fleet of vehicles or arrangements with third-party logistics providers to facilitate its delivery network. These assets would enable the platform to scale up its operations and maintain a strong presence in the market.

Secondly, as a technology-driven company, DoorDash may own intellectual property rights (IPRs) related to its proprietary algorithms, software, and mobile applications that power its food delivery services. This IPR portfolio could be considered an intangible asset with significant value.

Lastly, the company's corporate offices, which are mentioned as potentially being c

In [51]:
print(query_engine_semantic.query('What is the Marketplace GOV, EBITDA and GAAP net loss of Doordash?'))

Based on the provided context information, it appears that DoorDash is a company with significant growth potential, but also faces various challenges in managing its growth, maintaining its reputation, and adapting to changes in the market. However, the information does not provide specific financial metrics such as Marketplace GOV, EBITDA, or GAAP net loss.

The provided text only discusses the company's internal policies, strategic plans, management infrastructure, employee retention, brand reputation, marketing efforts, and competition within the delivery industry. There is no mention of financial performance metrics. Therefore, it is not possible to provide an answer based on the given context information.
