<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/vector_stores/CassandraIndexDemo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cassandra Vector Store

[Apache Cassandra®](https://cassandra.apache.org) is a NoSQL, row-oriented, highly scalable and highly available database.
The newest Cassandra releases natively [support](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor(ANN)+Vector+Search+via+Storage-Attached+Indexes) Vector Similarity Search.

**This notebook shows the basic usage of Cassandra as a Vector Store in LlamaIndex.**

To run this notebook you need either a running Cassandra cluster equipped with Vector Search capabilities (in pre-release at the time of writing) or a DataStax Astra DB instance running in the cloud (you can get one for free at [datastax.com](https://astra.datastax.com)).
_This notebook shows the latter choice; check [cassio.org](https://cassio.org/start_here/) for more information, quickstarts and tutorials._

## Setup

In [None]:
!pip install "cassio>=0.1.3"

...


In [None]:
import os

from llama_index import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Document,
    StorageContext,
)
from llama_index.vector_stores import CassandraVectorStore

### Please provide database connection parameters and secrets

Now you need a database connection. Make sure you have either a vector-capable running Cassandra cluster or an [Astra DB](https://astra.datastax.com) instance in the cloud.

_In the following, the latter is assumed (see the references at the top for details)._

In [None]:
import os
import getpass

database_id = input("\nPlease enter your Database ID (e.g. '0123abcd...'):")
token = getpass.getpass(
    "\nPlease enter your 'Database Administrator' Token (e.g. 'AstraCS:...'):"
)


Please enter your Database ID (e.g. '0123abcd...'): 0123abcd-01ab-01ab-01ab-012345abcdef

Please enter your 'Database Administrator' Token (e.g. 'AstraCS:...'): ········


This cell sets the database connection as a global `cassio` property for usage later (it is also possible to explicitly supply a DB connection when creating the vector store):

In [None]:
import cassio

cassio.init(database_id=database_id, token=token)

### Please provide OpenAI access key

In order to use embeddings by OpenAI you need to supply an OpenAI API Key:

In [None]:
import openai

OPENAI_API_KEY = getpass.getpass("OpenAI API Key:")
openai.api_key = OPENAI_API_KEY

OpenAI API Key: ········


### Download data

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

## Creating and populating the Vector Store

You will now load an essay by Paul Graham from a local file and store it into the Cassandra Vector Store.

In [None]:
# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
print(f"Total documents: {len(documents)}")
print(f"First document, id: {documents[0].doc_id}")
print(f"First document, hash: {documents[0].hash}")
print(
    "First document, text"
    f" ({len(documents[0].text)} characters):\n{'='*20}\n{documents[0].text[:360]} ..."
)

Total documents: 1
First document, id: 7c966f42-36f4-4ff6-ad75-357978a65381
First document, hash: 2e2d9629223c077019a6dde689049344ff2293d6c52372871420119ec049f25c
First document, text (75014 characters):


What I Worked On

February 2021

Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined ma ...


### Initialize the Cassandra Vector Store

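Creating the vector store also creates the underlying database table if it does not exist yet. The `embedding_dimension` must match the size of the vectors produced by the embedding model (1536 is the size of OpenAI's `text-embedding-ada-002` vectors):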

In [None]:
cassandra_store = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)

Now wrap this store into an `index` LlamaIndex abstraction for later querying:

In [None]:
storage_context = StorageContext.from_defaults(vector_store=cassandra_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

Note that the above `from_documents` call does several things at once: it splits the input documents into chunks of manageable size ("nodes"), computes embedding vectors for each node, and stores them all in the Cassandra Vector Store.
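
The three steps can be pictured with a small, library-free sketch. Everything in it is an illustrative assumption: the character-window splitter, the `toy_embedding` function and the row layout are hypothetical stand-ins, not what LlamaIndex or `cassio` actually do internally (the real pipeline splits by tokens and sentences and calls an embedding model).

```python
from typing import Dict, List


def split_into_chunks(text: str, chunk_size: int = 40, overlap: int = 10) -> List[str]:
    """Split text into fixed-size character windows that overlap slightly."""
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


def toy_embedding(chunk: str, dimension: int = 8) -> List[float]:
    """Hypothetical stand-in for an embedding model: a normalized character histogram."""
    vector = [0.0] * dimension
    for ch in chunk:
        vector[ord(ch) % dimension] += 1.0
    norm = sum(x * x for x in vector) ** 0.5 or 1.0
    return [x / norm for x in vector]


def ingest(doc_id: str, text: str) -> List[Dict]:
    """Chunk a document, embed each chunk, and return rows ready to be stored."""
    rows = []
    for idx, chunk in enumerate(split_into_chunks(text)):
        rows.append(
            {
                "node_id": f"{doc_id}-{idx}",
                "ref_doc_id": doc_id,  # every node remembers its source document
                "text": chunk,
                "embedding": toy_embedding(chunk),
            }
        )
    return rows


rows = ingest(
    "doc-1",
    "Before college the two main things I worked on were writing and programming.",
)
print(f"{len(rows)} nodes, embedding dimension {len(rows[0]['embedding'])}")
```

Note how each row keeps a `ref_doc_id` pointing back to its source document: this is what later makes document-level deletion possible.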

## Querying the store

### Basic querying

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)

The author chose to work on AI because they were inspired by a novel called "The Moon is a Harsh Mistress" by Heinlein, which featured an intelligent computer called Mike. Additionally, they were influenced by a PBS documentary that showed Terry Winograd using SHRDLU, a program that they believed could be improved by teaching it more words.


### MMR-based queries

The MMR (maximal marginal relevance) method is designed to fetch text chunks from the store that are both relevant to the query and as different as possible from each other, with the goal of providing a broader context for building the final answer:

In [None]:
query_engine = index.as_query_engine(vector_store_query_mode="mmr")
response = query_engine.query("Why did the author choose to work on AI?")
print(response.response)

The author chose to work on AI because they believed that it was a field that held promise and potential for advancing the understanding of natural language and intelligence. They were initially drawn to AI because of their fascination with the program SHRDLU, which they considered to be a step towards achieving intelligence. However, as they delved deeper into the field, they realized that the existing approaches to AI, which involved explicit data structures and formal representations, were limited and not capable of truly understanding natural language. Despite this realization, the author still found value in working on AI and decided to focus on Lisp, a programming language associated with AI, as they believed it was interesting in its own right.
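
To make the MMR idea more concrete, here is a minimal, library-free sketch of the textbook greedy selection. It is not necessarily the exact formula used by the Cassandra vector store; the `mmr_select` function, the `lambda_mult` trade-off parameter and the two-dimensional vectors are assumptions made for illustration only.

```python
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)


def mmr_select(
    query: Sequence[float],
    candidates: Sequence[Sequence[float]],
    top_k: int,
    lambda_mult: float = 0.5,
) -> List[int]:
    """Greedily pick candidate indices, trading relevance against redundancy."""
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def mmr_score(i: int) -> float:
        relevance = cosine(query, candidates[i])
        # Redundancy: similarity to the closest already-selected candidate.
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected), default=0.0
        )
        return lambda_mult * relevance - (1 - lambda_mult) * redundancy

    while remaining and len(selected) < top_k:
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
candidates = [
    [0.95, 0.31],  # 0: most relevant
    [0.94, 0.33],  # 1: near-duplicate of 0
    [0.8, -0.6],   # 2: less relevant, but points in a different direction
]
print(mmr_select(query, candidates, top_k=2))  # → [0, 2]
print(mmr_select(query, candidates, top_k=2, lambda_mult=1.0))  # → [0, 1]
```

With `lambda_mult=1.0` the redundancy penalty vanishes and the selection falls back to plain similarity ranking, which picks the near-duplicate instead of the more diverse chunk.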


## Connecting to an existing store

Since this store is backed by Cassandra, it is persistent by definition. So, if you want to connect to a store that was created and populated previously, here is how:

In [None]:
new_store_instance = CassandraVectorStore(
    table="cass_v_table", embedding_dimension=1536
)

# Create index (from preexisting stored vectors)
new_index_instance = VectorStoreIndex.from_vector_store(
    vector_store=new_store_instance
)

# now you can do querying, etc:
query_engine = new_index_instance.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What did the author study prior to working on AI?"
)

In [None]:
print(response.response)

The author studied painting and drawing prior to working on AI.


## Removing documents from the index

First get an explicit list of pieces of a document, or "nodes", from a `Retriever` spawned from the index:

In [None]:
retriever = new_index_instance.as_retriever(
    vector_store_query_mode="mmr",
    similarity_top_k=3,
    vector_store_kwargs={"mmr_prefetch_factor": 4},
)
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)

In [None]:
print(f"Found {len(nodes_with_scores)} nodes.")
for idx, node_with_score in enumerate(nodes_with_scores):
    print(f"    [{idx}] score = {node_with_score.score}")
    print(f"        id    = {node_with_score.node.node_id}")
    print(f"        text  = {node_with_score.node.text[:90]} ...")

Found 3 nodes.
    [0] score = 0.42933435561941374
        id    = 7734b895-a738-4c56-a433-d24a80179759
        text  = What I Worked On

February 2021

Before college the two main things I worked on, outside o ...
    [1] score = 0.002203557726127847
        id    = fea6c20f-e707-4c66-be3f-f963639def30
        text  = Now all I had to do was learn Italian.

Only stranieri (foreigners) had to take this entra ...
    [2] score = 0.022935334418004605
        id    = cf09e631-ab10-4355-923a-b9711c197600
        text  = All you had to do was teach SHRDLU more words.

There weren't any classes in AI at Cornell ...


But wait! When using the vector store, you should treat the **document**, not any individual node belonging to it, as the sensible unit to delete. In this case, you inserted a single text file, so all nodes have the same `ref_doc_id`:

In [None]:
print("Nodes' ref_doc_id:")
print("\n".join([nws.node.ref_doc_id for nws in nodes_with_scores]))

Nodes' ref_doc_id:
7c966f42-36f4-4ff6-ad75-357978a65381
7c966f42-36f4-4ff6-ad75-357978a65381
7c966f42-36f4-4ff6-ad75-357978a65381


Now let's say you need to remove the text file you uploaded:

In [None]:
new_store_instance.delete(nodes_with_scores[0].node.ref_doc_id)

Repeat the very same query and check the results now. You should see _no results_ being found:

In [None]:
nodes_with_scores = retriever.retrieve(
    "What did the author study prior to working on AI?"
)

print(f"Found {len(nodes_with_scores)} nodes.")

Found 0 nodes.
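
Why deleting by `ref_doc_id` empties the results can be seen in a small in-memory sketch. The `ToyNodeStore` class below is purely hypothetical and only mimics the idea of document-level deletion; the real store runs the corresponding delete against the Cassandra table.

```python
from typing import Dict


class ToyNodeStore:
    """Hypothetical in-memory stand-in for a vector store table."""

    def __init__(self) -> None:
        # node_id -> {"ref_doc_id": ..., "text": ...}
        self._rows: Dict[str, Dict[str, str]] = {}

    def add(self, node_id: str, ref_doc_id: str, text: str) -> None:
        self._rows[node_id] = {"ref_doc_id": ref_doc_id, "text": text}

    def delete(self, ref_doc_id: str) -> int:
        """Remove every node that came from the given document; return the count."""
        doomed = [
            node_id
            for node_id, row in self._rows.items()
            if row["ref_doc_id"] == ref_doc_id
        ]
        for node_id in doomed:
            del self._rows[node_id]
        return len(doomed)

    def __len__(self) -> int:
        return len(self._rows)


store = ToyNodeStore()
store.add("n1", "doc-a", "What I Worked On")
store.add("n2", "doc-a", "Now all I had to do was learn Italian.")
store.add("n3", "doc-a", "All you had to do was teach SHRDLU more words.")
store.add("n4", "doc-b", "A node from an unrelated document.")

removed = store.delete("doc-a")
print(f"Removed {removed} nodes, {len(store)} left.")  # → Removed 3 nodes, 1 left.
```

Nodes from other documents are untouched, which is why the demo above (with a single source file) ends up with an empty result set.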


## Metadata filtering

The Cassandra vector store supports metadata filtering in the form of exact-match `key=value` pairs at query time. The following cells, which work on a brand new Cassandra table, demonstrate this feature.

In this demo, for the sake of brevity, a single source document is loaded (the same `paul_graham_essay.txt` text file downloaded earlier). Nevertheless, you will attach some custom metadata to the document to illustrate how you can restrict queries with conditions on the metadata attached to the documents.

In [None]:
md_storage_context = StorageContext.from_defaults(
    vector_store=CassandraVectorStore(
        table="cass_v_table_md", embedding_dimension=1536
    )
)


def my_file_metadata(file_name: str):
    """Depending on the input file name, associate a different metadata."""
    if "essay" in file_name:
        source_type = "essay"
    elif "dinosaur" in file_name:
        # this (unfortunately) will not happen in this demo
        source_type = "dinos"
    else:
        source_type = "other"
    return {"source_type": source_type}


# Load documents and build index
md_documents = SimpleDirectoryReader(
    "./data/paul_graham", file_metadata=my_file_metadata
).load_data()
md_index = VectorStoreIndex.from_documents(
    md_documents, storage_context=md_storage_context
)

That's it: you can now add filtering to your query engine:

In [None]:
from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters

In [None]:
md_query_engine = md_index.as_query_engine(
    filters=MetadataFilters(
        filters=[ExactMatchFilter(key="source_type", value="essay")]
    )
)
md_response = md_query_engine.query(
    "How long did it take the author to write his thesis?"
)
print(md_response.response)

It took the author approximately 5 weeks to write his thesis.


To test that the filtering is at play, try to change it to use only `"dinos"` documents... there will be no answer this time :)
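
The effect of an exact-match filter can be sketched without any database. The helper names (`matches`, `filter_rows`) and the sample rows below are hypothetical and only illustrate the semantics: a row qualifies when every `key=value` pair in the filter equals the corresponding entry in its metadata, and only qualifying rows would then take part in the similarity search.

```python
from typing import Dict, List


def matches(metadata: Dict[str, str], filters: Dict[str, str]) -> bool:
    """True when every filter key is present in the metadata with the same value."""
    return all(metadata.get(key) == value for key, value in filters.items())


def filter_rows(rows: List[Dict], filters: Dict[str, str]) -> List[Dict]:
    """Keep only the rows whose metadata satisfies all exact-match conditions."""
    return [row for row in rows if matches(row["metadata"], filters)]


# Sample rows: in this demo every node comes from the essay file.
sample_rows = [
    {"text": "What I Worked On", "metadata": {"source_type": "essay"}},
    {"text": "Now all I had to do was learn Italian.", "metadata": {"source_type": "essay"}},
]

essay_rows = filter_rows(sample_rows, {"source_type": "essay"})
dinos_rows = filter_rows(sample_rows, {"source_type": "dinos"})
print(len(essay_rows), len(dinos_rows))  # → 2 0
```

With the `"dinos"` filter no row survives, so there is nothing for the query engine to build an answer from.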