<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/LindormSearchDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lindorm

>[Lindorm](https://www.alibabacloud.com/help/en/lindorm) is a cloud native multi-model database service. It allows you to store data of all sizes. Lindorm supports low-cost storage and processing of large amounts of data and the pay-as-you-go billing method. It is compatible with the open standards of multiple open source software, such as Apache HBase, Apache Cassandra, Apache Phoenix, OpenTSDB, Apache Solr, and SQL.


To run this notebook you need a Lindorm instance running in the cloud. You can create one by following [this link](https://alibabacloud.com/help/en/lindorm/latest/create-an-instance?spm=a2c63.l28256.0.0.4cc0f53cUfKOxI).

After creating the instance, you can look up its [endpoint information](https://www.alibabacloud.com/help/en/lindorm/latest/view-endpoints?spm=a2c63.p38356.0.0.37121bcdxsDvbN) and run [curl commands](https://www.alibabacloud.com/help/en/lindorm/latest/connect-and-use-the-search-engine-with-the-curl-command) to connect to and use LindormSearch.
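If you prefer to stay in Python rather than use curl, the request can be prepared with the standard library. This is a minimal sketch, not an official client: it assumes the search endpoint speaks plain HTTP with basic authentication on port 30070 (the port used later in this notebook), and the function name `build_search_request` and the host value are hypothetical placeholders.

```python
import base64
import urllib.request


def build_search_request(host, port, username, password, path="/"):
    """Build (but do not send) a GET request that carries HTTP basic auth."""
    # Assumption: the endpoint is reachable over plain HTTP.
    url = f"http://{host}:{port}{path}"
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    return request


# Placeholder values; replace them with your own instance information.
request = build_search_request(
    "your-lindorm-search-host", 30070, "your_username", "your_password"
)
print(request.full_url)

# To actually send it (requires network access to your instance):
# print(urllib.request.urlopen(request, timeout=10).read().decode("utf-8"))
```

Building the request separately from sending it makes it easy to inspect the URL and headers before any network traffic happens.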

## Setup

If you're opening this notebook on Colab, you will probably need to install `llama-index` and the LindormSearch integration packages first:

In [None]:
!pip install llama-index

In [None]:
!pip install opensearch-py

In [None]:
%pip install llama-index-vector-stores-lindormsearch

In [None]:
# Use DashScope for the embedding and LLM models; you can also use the default OpenAI models or any other model to test
%pip install llama-index-embeddings-dashscope
%pip install llama-index-llms-dashscope

Import the required dependencies:

In [1]:
from llama_index.core import SimpleDirectoryReader
from llama_index.vector_stores.lindormsearch import (
    LindormSearchVectorStore,
    LindormSearchVectorClient,
)
from llama_index.core import VectorStoreIndex, StorageContext

Configure the DashScope embedding and LLM models. You can also use the default OpenAI models or any other model to test.

In [2]:
# Set the embedding model
from llama_index.core import Settings
from llama_index.embeddings.dashscope import DashScopeEmbedding
# Global Settings
Settings.embed_model = DashScopeEmbedding()


In [3]:
# Configure the LLM
from llama_index.llms.dashscope import DashScope, DashScopeGenerationModels
dashscope_llm = DashScope(model_name=DashScopeGenerationModels.QWEN_MAX)

## Download example data:

In [4]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-07-08 15:17:12--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-07-08 15:17:14 (58.4 KB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]
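On systems where `mkdir -p` and `wget` are not available, the same download can be done from Python with the standard library. This is only a sketch of an alternative to the shell commands above; the helper name `download_file` is hypothetical, while the URL and target path are the ones used in the cell.

```python
import urllib.request
from pathlib import Path

ESSAY_URL = (
    "https://raw.githubusercontent.com/run-llama/llama_index/main/"
    "docs/docs/examples/data/paul_graham/paul_graham_essay.txt"
)


def download_file(url: str, destination: str) -> Path:
    """Fetch `url` and write the bytes to `destination`, creating parent folders."""
    target = Path(destination)
    target.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    return target


# Uncomment to download the essay (requires network access):
# path = download_file(ESSAY_URL, "data/paul_graham/paul_graham_essay.txt")
# print(f"Saved {path.stat().st_size} bytes to {path}")
```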



## Load Data:

In [5]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Total documents: 1
First document, id: dc941e05-3d58-41d0-b948-f6a184dae96f
First document, hash: 3fb8fcb30130991930dfa44a3c664bea16fb6d271c6aee796552ae3ad61f1a5c
First document, text (75014 characters):


What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...


## Create the LindormSearch Vector Store object:

In [6]:
# only for jupyter notebook
import nest_asyncio
nest_asyncio.apply()

# lindorm instance info
host = 'ld-bp******jm*******-proxy-search-pub.lindorm.aliyuncs.com'
port = 30070
username = 'your_username'
password = 'your_password'

# index used to demonstrate the vector store implementation
index_name = "lindorm_vector_test"

# LindormSearchVectorClient encapsulates the logic for a single index with vector search enabled
client = LindormSearchVectorClient(
    host,
    port,
    username,
    password,
    index=index_name,
    dimension=1536,  # dimensionality of the embedding vectors
)

# initialize vector store
vector_store = LindormSearchVectorStore(client)
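Hardcoding the username and password in a notebook makes it easy to leak them when the notebook is shared. One common alternative is to read the connection details from environment variables. The sketch below assumes hypothetical variable names (`LINDORM_SEARCH_HOST`, `LINDORM_SEARCH_PORT`, `LINDORM_SEARCH_USERNAME`, `LINDORM_SEARCH_PASSWORD`); nothing in LindormSearch or LlamaIndex requires these names.

```python
import os

# Hypothetical variable names, chosen for this example only.
REQUIRED_KEYS = (
    "LINDORM_SEARCH_HOST",
    "LINDORM_SEARCH_USERNAME",
    "LINDORM_SEARCH_PASSWORD",
)


def load_lindorm_settings(environ=None):
    """Collect connection settings from a mapping (defaults to os.environ)."""
    env = os.environ if environ is None else environ
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["LINDORM_SEARCH_HOST"],
        # Fall back to the port used in this notebook when none is set.
        "port": int(env.get("LINDORM_SEARCH_PORT", "30070")),
        "username": env["LINDORM_SEARCH_USERNAME"],
        "password": env["LINDORM_SEARCH_PASSWORD"],
    }


# Demonstration with an explicit mapping instead of the real environment.
example_env = {
    "LINDORM_SEARCH_HOST": "example-host",
    "LINDORM_SEARCH_USERNAME": "demo_user",
    "LINDORM_SEARCH_PASSWORD": "demo_password",
}
settings = load_lindorm_settings(example_env)
print(settings["host"], settings["port"])
```

The resulting values can then be passed to `LindormSearchVectorClient` in place of the literal `host`, `port`, `username`, and `password` shown in the cell above.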

## Build the Index from the Documents:

In [7]:
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# initialize an index using our sample data and the client we just created
index = VectorStoreIndex.from_documents(
    documents=documents, 
    storage_context=storage_context,
    show_progress=True
)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 24.63it/s]
Generating embeddings: 100%|██████████| 22/22 [00:02<00:00,  8.93it/s]


## Querying the store:

### Search Test

In [8]:
# Set Retriever
vector_retriever = index.as_retriever()
# search
source_nodes = vector_retriever.retrieve("Why did the author choose to work on AI?")
# check source_nodes
for node in source_nodes:
    # print(node.metadata)
    print("---------------------------------------------")
    print(f"Score: {node.score:.3f}")
    print(node.get_content())
    print("---------------------------------------------\n\n")

---------------------------------------------
Score: 0.536
All that seemed left for philosophy were edge cases that people in other fields felt could safely be ignored.

I couldn't have put this into words when I was 18. All I knew at the time was that I kept taking philosophy courses and they kept being boring. So I decided to switch to AI.

AI was in the air in the mid 1980s, but there were two things especially that made me want to work on it: a novel by Heinlein called The Moon is a Harsh Mistress, which featured an intelligent computer called Mike, and a PBS documentary that showed Terry Winograd using SHRDLU. I haven't tried rereading The Moon is a Harsh Mistress, so I don't know how well it has aged, but when I read it I was drawn entirely into its world. It seemed only a matter of time before we'd have Mike, and when I saw Winograd using SHRDLU, it seemed like that time would be a few years at most. All you had to do was teach SHRDLU more words.

There weren't any classes in AI

### Basic Querying

In [10]:
# run query
query_engine = index.as_query_engine(llm=dashscope_llm)
# query_engine = index.as_query_engine()
res = query_engine.query("Why did the author choose to work on AI?")
res.response

'The author chose to work on AI because it was a field that captivated their interest due to two influential factors: a novel by Robert A. Heinlein titled "The Moon is a Harsh Mistress," featuring an intelligent computer named Mike, and a PBS documentary showcasing Terry Winograd\'s work with the SHRDLU program. These inspirations led the author to believe that creating intelligent machines was within reach and sparked their passion for artificial intelligence.'

### Metadata Filtering

LindormSearch supports metadata filtering at query time, both as exact-match `key=value` pairs and as range filters using `>`, `<`, `>=`, and `<=`.
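To make the semantics of these operators concrete, here is a small local sketch that evaluates exact-match and range conditions against a metadata dictionary and combines them with AND or OR. It is an illustration in plain Python, not how LindormSearch evaluates filters internally; the `matches` helper and the tuple-based filter format are hypothetical.

```python
import operator

# The comparison operators named in the text, mapped to Python equivalents.
OPERATORS = {
    "==": operator.eq,
    ">": operator.gt,
    "<": operator.lt,
    ">=": operator.ge,
    "<=": operator.le,
}


def matches(metadata, filters, condition="and"):
    """Return True if `metadata` satisfies the (key, op, value) filters."""
    results = (
        key in metadata and OPERATORS[op](metadata[key], value)
        for key, op, value in filters
    )
    return all(results) if condition == "and" else any(results)


node_metadata = {"is_footnote": True, "mark_id": 173}

# Exact match AND range filter that both hold.
print(matches(node_metadata, [("is_footnote", "==", True), ("mark_id", ">=", 0)]))
# The range filter fails, so the AND combination fails.
print(matches(node_metadata, [("is_footnote", "==", True), ("mark_id", "<", 100)]))
# With OR, one satisfied filter is enough.
print(
    matches(
        node_metadata,
        [("is_footnote", "==", True), ("mark_id", "<", 100)],
        condition="or",
    )
)
```

The three calls print `True`, `False`, and `True` respectively.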

In [11]:
from llama_index.core import Document
from llama_index.core.vector_stores import (
    MetadataFilters,
    MetadataFilter,
    FilterOperator,
    FilterCondition,
)
import re

In [12]:
# Split the text into paragraphs.
text_chunks = documents[0].text.split("\n\n")

# Create a document for each footnote
footnotes = [
    Document(
        text=chunk,
        id=documents[0].doc_id,
        metadata={
            "is_footnote": bool(re.search(r"^\s*\[\d+\]\s*", chunk)),
            "mark_id": i,
        },
    )
    for i, chunk in enumerate(text_chunks)
    if bool(re.search(r"^\s*\[\d+\]\s*", chunk))
]
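The regular expression used above treats a paragraph as a footnote when it starts, after optional whitespace, with a number in square brackets. The short check below shows which kinds of paragraphs it selects. The first and third samples are taken from the essay; the other two are made-up strings for illustration, and `is_footnote` is a hypothetical helper name.

```python
import re

# Same pattern as in the cell above: optional whitespace, then "[<digits>]".
FOOTNOTE_PATTERN = re.compile(r"^\s*\[\d+\]\s*")


def is_footnote(chunk: str) -> bool:
    """Return True when the paragraph starts with a bracketed number."""
    return bool(FOOTNOTE_PATTERN.search(chunk))


samples = [
    "[19] One way to get more precise about the concept of invented vs discovered is to talk about space aliens.",
    "  [2] An indented footnote still matches.",
    "Before college the two main things I worked on, outside of school, were writing and programming.",
    "A reference like [3] in the middle of a sentence does not match.",
]

for sample in samples:
    print(is_footnote(sample), "|", sample[:40])
```

Because the pattern is anchored with `^` and no multiline flag is set, only the very start of each paragraph is considered.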


In [13]:
# Insert the footnotes into the index
for f in footnotes:
    index.insert(f)

Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 896.99it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 2126.93it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 368.99it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 732.50it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 1049.63it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 449.02it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 460.15it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 592.75it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 482.88it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 458.29it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 564.89it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 431.96it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 528.05it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 604.54it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 434.73it/s]
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 358.79it/s]
Parsin

In [14]:
# Create a query engine that only searches certain footnotes.
footnote_query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(
                key="metadata.is_footnote", 
                value="true",
                operator=FilterOperator.EQ
            ),
            MetadataFilter(
                key="metadata.mark_id",
                value=0,
                operator=FilterOperator.GTE
            )
        ],
        condition=FilterCondition.AND
    ),
    llm=dashscope_llm
)

res = footnote_query_engine.query(
    "What did the author say about space aliens and lisp?"
)
res.response

"The author speculates that any advanced alien civilization would be aware of the Pythagorean theorem and, with less certainty, suggests they might also know about the Lisp programming language as described in McCarthy's 1960 paper."

### Hybrid Search

LindormSearch supports hybrid search. Note that the minimum search granularity of the query string (`query_str`) is one token.
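One way to read "one token" is that keyword matching works on whole tokens rather than on arbitrary substrings. The toy example below illustrates that difference with a deliberately naive tokenizer (lowercasing and splitting on non-alphanumeric characters). This is an assumption made for illustration: the analyzer that LindormSearch actually applies is not described here and may behave differently, for example by stemming. The names `tokenize` and `contains_all_tokens` are hypothetical.

```python
import re


def tokenize(text: str) -> list:
    """Naive tokenizer: lowercase, then keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_all_tokens(text: str, query_str: str) -> bool:
    """True when every token of `query_str` appears as a whole token in `text`."""
    tokens = set(tokenize(text))
    return all(token in tokens for token in tokenize(query_str))


text = (
    "One way to get more precise about the concept of invented vs "
    "discovered is to talk about space aliens."
)

print(contains_all_tokens(text, "space aliens"))  # whole tokens: True
print(contains_all_tokens(text, "space alien"))   # "alien" is not a token: False
print("space alien" in text.lower())              # substring search: True
```

Under this reading, a `query_str` such as `space aliens` is matched token by token, so a partial word would not be expected to match.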

In [16]:
from llama_index.core.vector_stores.types import VectorStoreQueryMode

retriever = index.as_retriever(
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    query_str="space aliens",
)

result = retriever.retrieve("What did the author say about space aliens and lisp?")

print(result)

[NodeWithScore(node=TextNode(id_='a6db04e5-2bc0-4b73-a893-345c3f32c9a4', embedding=None, metadata={'is_footnote': True, 'mark_id': 173}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='20584e85-09a4-4c34-b145-64785d5dccfa', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'is_footnote': True, 'mark_id': 173}, hash='b43f450088029936fd7a03f5917ff9c487ba2e3ed9c6c22de43e024a67f8f48e')}, text="[19] One way to get more precise about the concept of invented vs discovered is to talk about space aliens. Any sufficiently advanced alien civilization would certainly know about the Pythagorean theorem, for example. I believe, though with less certainty, that they would also know about the Lisp in McCarthy's 1960 paper.", mimetype='text/plain', start_char_idx=0, end_char_idx=323, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n'), score=0.5169236), NodeWithSco

In [17]:
query_engine = index.as_query_engine(
    llm=dashscope_llm,
    vector_store_query_mode=VectorStoreQueryMode.HYBRID,
    query_str="space aliens",
)
res = query_engine.query("What did the author say about space aliens and lisp?")
res.response

"The author believes that any sufficiently advanced alien civilization would know about the Pythagorean theorem and, with less certainty, they would also be aware of the Lisp programming language as described in John McCarthy's 1960 paper."