<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/docs/examples/vector_stores/LindormDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lindorm

>[Lindorm](https://www.alibabacloud.com/help/en/lindorm) is a cloud native multi-model database service. It allows you to store data of all sizes. Lindorm supports low-cost storage and processing of large amounts of data and the pay-as-you-go billing method. It is compatible with the open standards of multiple open source software, such as Apache HBase, Apache Cassandra, Apache Phoenix, OpenTSDB, Apache Solr, and SQL.


To run this notebook you need a Lindorm instance running in the cloud. You can create one by following [this guide](https://alibabacloud.com/help/en/lindorm/latest/create-an-instance).

After creating the instance, you can look up its [endpoint information](https://www.alibabacloud.com/help/en/lindorm/latest/view-endpoints) and run [curl commands](https://www.alibabacloud.com/help/en/lindorm/latest/connect-and-use-the-search-engine-with-the-curl-command) to connect to and use LindormSearch.

## Setup

If you're opening this notebook on Colab, you will probably need to install `llama-index` and the related packages first:

In [None]:
!pip install llama-index

In [None]:
!pip install opensearch-py

In [None]:
%pip install llama-index-vector-stores-lindorm

In [None]:
# DashScope is used here for the embedding model and the LLM; you can also use the default OpenAI models or any other model
%pip install llama-index-embeddings-dashscope
%pip install llama-index-llms-dashscope

Import the required dependencies:

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.lindorm import (
    LindormVectorStore,
    LindormVectorClient,
)
from llama_index.core import VectorStoreIndex, StorageContext

Configure the DashScope embedding model and LLM. You can also use the default OpenAI models or any other model for testing.

In [None]:
# set the embedding model
from llama_index.core import Settings
from llama_index.embeddings.dashscope import DashScopeEmbedding

# Global Settings
Settings.embed_model = DashScopeEmbedding()

In [None]:
# configure the LLM
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels

dashscope_llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX)

## Download example data:

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-07-10 14:01:02--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: 'data/paul_graham/paul_graham_essay.txt'


2024-07-10 14:01:04 (43.2 KB/s) - 'data/paul_graham/paul_graham_essay.txt' saved [75042/75042]



## Load Data:

In [None]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Total documents: 1
First document, id: 5ddae8c1-f137-4500-83cd-e38e42d4f72b
First document, hash: 8fde8a692925d317c5544f3dbaa88eeb5e9ec0cbdb74da1de19d57ee75ac0c3c
First document, text (75014 characters):


What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...


## Create the Lindorm Vector Store object:

In [None]:
# only for jupyter notebook
import nest_asyncio

nest_asyncio.apply()

# lindorm instance info
host = "ld-bp******jm*******-proxy-search-pub.lindorm.aliyuncs.com"
port = 30070
username = "your_username"
password = "your_password"


# name of the index used to demonstrate the vector store implementation
index_name = "lindorm_rag_test"

# extension parameter of Lindorm search: number of cluster units to query; between 1 and method.parameters.nlist (an ivfpq parameter); no default value.
nprobe = "2"

# extension parameter of Lindorm search: usually raised to improve recall accuracy, at the cost of extra performance overhead; between 1 and 200; default: 10.
reorder_factor = "10"

# LindormVectorClient encapsulates the logic for a single index with vector search enabled
client = LindormVectorClient(
    host,
    port,
    username,
    password,
    index=index_name,
    dimension=1536,  # match dimension of your embedding model
    nprobe=nprobe,
    reorder_factor=reorder_factor,
    # filter_type="pre_filter/post_filter(default)"
)

# initialize vector store
vector_store = LindormVectorStore(client)
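
The connection values above are hard-coded placeholders. The following is a minimal sketch of one way to keep credentials out of the notebook and to check `nprobe` and `reorder_factor` against the ranges stated in the comments. The environment variable names (`LINDORM_HOST`, `LINDORM_PORT`, `LINDORM_USERNAME`, `LINDORM_PASSWORD`) and both helper functions are hypothetical; they are not part of `llama-index-vector-stores-lindorm`, and `nlist` stands for whatever value your index was built with.

```python
import os


def load_lindorm_settings(env=None):
    """Read connection settings from environment variables (hypothetical names)."""
    env = os.environ if env is None else env
    return {
        "host": env["LINDORM_HOST"],
        # Fall back to the port used in this notebook if none is set.
        "port": int(env.get("LINDORM_PORT", "30070")),
        "username": env["LINDORM_USERNAME"],
        "password": env["LINDORM_PASSWORD"],
    }


def check_search_params(nprobe, reorder_factor, nlist):
    """Validate nprobe in [1, nlist] and reorder_factor in [1, 200].

    Values are returned as strings, matching how the notebook passes them.
    """
    nprobe_int = int(nprobe)
    reorder_int = int(reorder_factor)
    if not 1 <= nprobe_int <= nlist:
        raise ValueError(
            f"nprobe must be between 1 and nlist={nlist}, got {nprobe_int}"
        )
    if not 1 <= reorder_int <= 200:
        raise ValueError(
            f"reorder_factor must be between 1 and 200, got {reorder_int}"
        )
    return str(nprobe_int), str(reorder_int)


# Example with an in-memory mapping standing in for os.environ.
settings = load_lindorm_settings(
    {
        "LINDORM_HOST": "example-host",
        "LINDORM_USERNAME": "your_username",
        "LINDORM_PASSWORD": "your_password",
    }
)
nprobe, reorder_factor = check_search_params("2", "10", nlist=100)
print(settings["port"], nprobe, reorder_factor)
```

The resulting values could then be passed to `LindormVectorClient` in place of the literals used in the cell above.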

## Build the Index from the Documents:

In [None]:
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# initialize an index using our sample data and the client we just created
index = VectorStoreIndex.from_documents(
    documents=documents, storage_context=storage_context, show_progress=True
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 25.27it/s]
Generating embeddings: 100%|██████████| 22/22 [00:02<00:00, 10.31it/s]


## Querying the store:

### Search Test

In [None]:
# Set Retriever
vector_retriever = index.as_retriever()
# search
source_nodes = vector_retriever.retrieve("What did the author do growing up?")
# check source_nodes
for node in source_nodes:
    # print(node.metadata)
    print("---------------------------------------------")
    print(f"Score: {node.score:.3f}")
    print(node.get_content())
    print("---------------------------------------------\n\n")

---------------------------------------------
Score: 0.448
What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep.

The first programs I tried writing were on the IBM 1401 that our school district used for what was then called "data processing." This was in 9th grade, so I was 13 or 14. The school district's 1401 happened to be in the basement of our junior high school, and my friend Rich Draves and I got permission to use it. It was like a mini Bond villain's lair down there, with all these alien-looking machines — CPU, disk drives, printer, card reader — sitting up on a raised floor under bright fluorescent lights.

The language we used was an early version of Fortran. You 

### Basic Querying

In [None]:
# run query
query_engine = index.as_query_engine(llm=dashscope_llm)
# query_engine = index.as_query_engine()
res = query_engine.query("What did the author do growing up?")
res.response

'Growing up, the author worked on two main activities outside of school: writing and programming. They wrote short stories instead of essays, and their early programming attempts involved using an IBM 1401 computer with Fortran, despite the challenges posed by the limited input methods and their lack of sophisticated mathematical knowledge.'

### Metadata Filtering

Lindorm Vector Store supports metadata filtering at query time: exact-match `key=value` pairs and range filters using `>`, `<`, `>=`, and `<=`.
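
To make the semantics concrete, here is a minimal pure-Python sketch of how exact-match and range conditions combine under an AND condition. It only illustrates the filtering logic on plain dictionaries; the `OPS` table and the `matches` helper are hypothetical and do not reflect how Lindorm evaluates filters internally.

```python
import operator

# Map operator symbols to Python comparison functions.
OPS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def matches(metadata, conditions):
    """Return True if metadata satisfies every (key, op, value) condition (AND)."""
    for key, op, value in conditions:
        if key not in metadata or not OPS[op](metadata[key], value):
            return False
    return True


nodes = [
    {"is_footnote": True, "mark_id": 173},
    {"is_footnote": False, "mark_id": 12},
    {"is_footnote": True, "mark_id": 5},
]
# Exact match on is_footnote combined with a range filter on mark_id.
conditions = [("is_footnote", "==", True), ("mark_id", ">=", 100)]
selected = [n for n in nodes if matches(n, conditions)]
print(selected)  # [{'is_footnote': True, 'mark_id': 173}]
```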

In [None]:
from llama_index.core import Document
from llama_index.core.vector_stores import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
    FilterCondition,
)
import re

In [None]:
# Split the text into paragraphs.
text_chunks = documents[0].text.split("\n\n")

# Create a document for each footnote
footnotes = [
    Document(
        text=chunk,
        id=documents[0].doc_id,
        metadata={
            "is_footnote": bool(re.search(r"^\s*\[\d+\]\s*", chunk)),
            "mark_id": i,
        },
    )
    for i, chunk in enumerate(text_chunks)
    if bool(re.search(r"^\s*\[\d+\]\s*", chunk))
]
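
The pattern `^\s*\[\d+\]\s*` used above keeps only chunks that start with a numeric marker such as `[19]`, optionally preceded by whitespace. The short check below shows which kinds of paragraphs it selects; the sample strings are made up for illustration.

```python
import re

# Same pattern as in the cell above: optional leading whitespace,
# then a bracketed number at the start of the chunk.
FOOTNOTE_PATTERN = re.compile(r"^\s*\[\d+\]\s*")

samples = [
    "[19] One way to get more precise ...",
    "  [2] Leading whitespace is allowed.",
    "See note [3] in the middle of a paragraph.",
    "[a] Non-numeric markers are ignored.",
]
flags = [bool(FOOTNOTE_PATTERN.search(s)) for s in samples]
print(flags)  # [True, True, False, False]
```

Because the list comprehension already filters on this pattern, every inserted document has `is_footnote` set to `True`, and `mark_id` is the position of the paragraph in `text_chunks`.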

In [None]:
# Insert the footnotes into the index
for f in footnotes:
    index.insert(f)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1140.07it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 506.62it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 957.82it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1170.94it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1043.88it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1337.47it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1055.97it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1331.10it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1408.43it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1081.84it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 479.68it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 946.15it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1062.66it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 363.93it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 760.53it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1027.76it/s]

In [None]:
retriever = index.as_retriever(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(
                key="is_footnote", value="true", operator=FilterOperator.EQ
            ),
            MetadataFilter(
                key="mark_id", value=0, operator=FilterOperator.GTE
            ),
        ],
        condition=FilterCondition.AND,
    ),
)

result = retriever.retrieve(
    "What did the author say about space aliens and Lisp?"
)

print(result)

[NodeWithScore(node=TextNode(id_='c307ea8e-3647-43f0-9858-34581cc50ce5', embedding=None, metadata={'ref_doc_id': 'd9f7000e-412c-466d-9792-88b15aad7148', 'mark_id': 173, 'is_footnote': True, 'document_id': 'd9f7000e-412c-466d-9792-88b15aad7148', '_node_type': 'TextNode', 'doc_id': 'd9f7000e-412c-466d-9792-88b15aad7148', 'content': "[19] One way to get more precise about the concept of invented vs discovered is to talk about space aliens. Any sufficiently advanced alien civilization would certainly know about the Pythagorean theorem, for example. I believe, though with less certainty, that they would also know about the Lisp in McCarthy's 1960 paper.", '_node_content': '{"id_": "c307ea8e-3647-43f0-9858-34581cc50ce5", "embedding": null, "metadata": {"is_footnote": true, "mark_id": 173}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {"1": {"node_id": "d9f7000e-412c-466d-9792-88b15aad7148", "node_type": "4", "metadata": {"is_footnote": true, "mark_id

In [None]:
# Create a query engine that only searches certain footnotes.
footnote_query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(
                key="is_footnote", value="true", operator=FilterOperator.EQ
            ),
            MetadataFilter(
                key="mark_id", value=0, operator=FilterOperator.GTE
            ),
        ],
        condition=FilterCondition.AND,
    ),
    llm=dashscope_llm,
)

res = footnote_query_engine.query(
    "What did the author say about space aliens and Lisp?"
)
res.response

"The author suggests that any sufficiently advanced alien civilization would be aware of the Pythagorean theorem and, albeit with less certainty, they would also be familiar with Lisp as described in McCarthy's 1960 paper."

### Hybrid Search

LindormSearch supports hybrid search. Note that the minimum search granularity of the query string is one token.
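
As a rough illustration of token-level granularity, the sketch below matches query terms against a document only as whole tokens, so a partial word does not match. The tokenizer (lowercasing and splitting on non-alphanumeric characters) is a simplifying assumption; the analyzer that Lindorm actually applies may tokenize text differently, and the helper names are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (an assumption)."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def token_match(query, document):
    """Return the query tokens that appear as whole tokens in the document."""
    doc_tokens = set(tokenize(document))
    return [t for t in tokenize(query) if t in doc_tokens]


doc = "They would also know about the Lisp in McCarthy's 1960 paper."
print(token_match("lisp aliens", doc))  # ['lisp']
print(token_match("lis", doc))  # []
```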

In [None]:
from llama_index.core.vector_stores.types import VectorStoreQueryMode

retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.HYBRID
)

result = retriever.retrieve(
    "What did the author say about space aliens and Lisp?"
)

print(result)

[NodeWithScore(node=TextNode(id_='c307ea8e-3647-43f0-9858-34581cc50ce5', embedding=None, metadata={'ref_doc_id': 'd9f7000e-412c-466d-9792-88b15aad7148', 'mark_id': 173, 'is_footnote': True, 'document_id': 'd9f7000e-412c-466d-9792-88b15aad7148', '_node_type': 'TextNode', 'doc_id': 'd9f7000e-412c-466d-9792-88b15aad7148', 'content': "[19] One way to get more precise about the concept of invented vs discovered is to talk about space aliens. Any sufficiently advanced alien civilization would certainly know about the Pythagorean theorem, for example. I believe, though with less certainty, that they would also know about the Lisp in McCarthy's 1960 paper.", '_node_content': '{"id_": "c307ea8e-3647-43f0-9858-34581cc50ce5", "embedding": null, "metadata": {"is_footnote": true, "mark_id": 173}, "excluded_embed_metadata_keys": [], "excluded_llm_metadata_keys": [], "relationships": {"1": {"node_id": "d9f7000e-412c-466d-9792-88b15aad7148", "node_type": "4", "metadata": {"is_footnote": true, "mark_id

In [None]:
query_engine = index.as_query_engine(
    llm=dashscope_llm, vector_store_query_mode=VectorStoreQueryMode.HYBRID
)
res = query_engine.query(
    "What did the author say about space aliens and Lisp?"
)
res.response

"The author believes that any sufficiently advanced alien civilization would know about fundamental mathematical concepts like the Pythagorean theorem. They also express, with less certainty, the idea that these aliens would be familiar with Lisp, a programming language discussed in McCarthy's 1960 paper. This thought experiment serves as a way to explore the distinction between ideas that are invented versus discovered."