# DuckDB

>[DuckDB](https://duckdb.org/docs/api/python/overview) is a fast in-process analytical database. DuckDB is under an MIT license.

In this notebook, we show how to use DuckDB as a vector store in LlamaIndex.

Install DuckDB with:

```sh
pip install duckdb
```

Make sure to use a recent DuckDB version (0.10.0 or later).

You can run DuckDB in different modes depending on persistence:
- `in-memory` is the default mode: the database is created in memory. You can force this mode by setting `database_name = ":memory:"` when initializing the vector store.
- `persistence` is enabled by giving the database a file name, for example `database_name = "my_vector_store.duckdb"`. The database is then persisted in the default `persist_dir` or in the directory you specify.
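
The relationship between the two settings can be sketched in plain Python. This is only an illustration, not the library's code: the helper name `resolve_database_location` and the default directory `./storage` are assumptions made for this sketch.

```python
import os


def resolve_database_location(database_name=":memory:", persist_dir="./storage"):
    """Hypothetical helper: return where the database would live.

    ":memory:" means nothing is written to disk; any other name is
    treated as a file inside persist_dir.
    """
    if database_name == ":memory:":
        return ":memory:"
    return os.path.join(persist_dir, database_name)


# In-memory mode (the default in this sketch)
print(resolve_database_location())

# Persistent mode: a file name plus a directory
print(resolve_database_location("pg.duckdb", persist_dir="./persist/"))
```

The second call yields `./persist/pg.duckdb`, the same kind of path you would later pass when reloading a persisted store.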

With the vector store created, you can:
- `.add` 
- `.get` 
- `.update`
- `.upsert`
- `.delete`
- `.peek`
- `.query` to run a search. 


## Basic example

In this basic example, we take the Paul Graham essay, split it into chunks, embed the chunks with an embedding model, load them into `DuckDBVectorStore`, and then query the resulting index.

For the embedding model we will use OpenAI. 
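
To make the chunk, embed, store, and query steps concrete, here is a minimal plain-Python sketch of the same pipeline. It is not how LlamaIndex or DuckDB implement it: the embedding is a toy word-count vector standing in for a real model, the store is a Python list, and the helper names (`chunk`, `embed`, `cosine`, `top_k`) and the sample text are made up for illustration.

```python
import math
import re
from collections import Counter


def chunk(text, size):
    """Split text into chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text):
    """Toy embedding: lowercase word counts (a stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, store, k=1):
    """Return the k stored chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(
        store, key=lambda item: cosine(query_vec, item[1]), reverse=True
    )
    return [text for text, _ in ranked[:k]]


# Made-up sample text; the notebook itself uses the Paul Graham essay.
sample = (
    "DuckDB is an in-process analytical database. "
    "Vector stores rank chunks by similarity."
)
store = [(piece, embed(piece)) for piece in chunk(sample, size=6)]
print(top_k("How are chunks ranked by similarity?", store))
```

In the notebook, `VectorStoreIndex.from_documents` handles chunking and embedding, and `DuckDBVectorStore` takes the place of the list.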

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index

### Creating a DuckDB Index

In [None]:
!pip install duckdb
!pip install llama-index-vector-stores-duckdb

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.duckdb import DuckDBVectorStore
from llama_index.core import StorageContext

from IPython.display import Markdown, display

In [None]:
# Set up the OpenAI API key
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

Download and prepare the sample dataset:

In [None]:
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

--2024-02-16 19:38:34--  https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/paul_graham/paul_graham_essay.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.111.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 75042 (73K) [text/plain]
Saving to: ‘data/paul_graham/paul_graham_essay.txt’


2024-02-16 19:38:34 (1.24 MB/s) - ‘data/paul_graham/paul_graham_essay.txt’ saved [75042/75042]



In [None]:
documents = SimpleDirectoryReader("data/paul_graham/").load_data()

vector_store = DuckDBVectorStore()
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

In [None]:
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

<b>The author mentions that before college, they worked on two main things outside of school: writing and programming. They wrote short stories and also tried writing programs on an IBM 1401 computer. They later got a microcomputer and started programming more extensively.</b>

## Persisting to disk example

Extending the previous example, to save the index to disk, initialize `DuckDBVectorStore` with a database name and a persist directory.

In [None]:
# Save to disk
documents = SimpleDirectoryReader("data/paul_graham/").load_data()

vector_store = DuckDBVectorStore("pg.duckdb", persist_dir="./persist/")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

In [None]:
# Load from disk
vector_store = DuckDBVectorStore.from_local("./persist/pg.duckdb")
index = VectorStoreIndex.from_vector_store(vector_store)

In [None]:
# Query Data
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")
display(Markdown(f"<b>{response}</b>"))

<b>The author mentions that before college, they worked on two main things outside of school: writing and programming. They wrote short stories and also tried writing programs on an IBM 1401 computer. They later got a microcomputer and started programming more extensively.</b>

## Metadata filter example

You can narrow the search space by filtering on metadata. The example below shows this in practice.

In [None]:
from llama_index.core.schema import TextNode

nodes = [
    TextNode(
        **{
            "text": "The Shawshank Redemption",
            "metadata": {
                "author": "Stephen King",
                "theme": "Friendship",
                "year": 1994,
                "ref_doc_id": "doc_1",
            },
        }
    ),
    TextNode(
        **{
            "text": "The Godfather",
            "metadata": {
                "director": "Francis Ford Coppola",
                "theme": "Mafia",
                "year": 1972,
                "ref_doc_id": "doc_1",
            },
        }
    ),
    TextNode(
        **{
            "text": "Inception",
            "metadata": {
                "director": "Christopher Nolan",
                "theme": "Sci-fi",
                "year": 2010,
                "ref_doc_id": "doc_2",
            },
        }
    ),
]

vector_store = DuckDBVectorStore()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes, storage_context=storage_context)

Define the metadata filters.

In [None]:
from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

filters = MetadataFilters(
    filters=[ExactMatchFilter(key="theme", value="Mafia")]
)
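
Conceptually, an exact-match filter restricts the candidate set to nodes whose metadata value equals the requested one, so similarity ranking only sees those nodes. The plain-Python sketch below illustrates that semantics only. How `DuckDBVectorStore` applies the filter internally is not shown here, and the helper `exact_match` and the dictionary layout are assumptions made for this example.

```python
# Simplified copies of the three nodes, as plain dictionaries.
nodes = [
    {"text": "The Shawshank Redemption", "metadata": {"theme": "Friendship", "year": 1994}},
    {"text": "The Godfather", "metadata": {"theme": "Mafia", "year": 1972}},
    {"text": "Inception", "metadata": {"theme": "Sci-fi", "year": 2010}},
]


def exact_match(nodes, key, value):
    """Hypothetical helper: keep nodes whose metadata[key] equals value."""
    return [node for node in nodes if node["metadata"].get(key) == value]


candidates = exact_match(nodes, key="theme", value="Mafia")
print([node["text"] for node in candidates])  # ['The Godfather']
```

With the `theme == "Mafia"` filter in place, a query can only return "The Godfather", regardless of what the query text asks about.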

Create a retriever from the index and pass the metadata filters through the `filters` argument.

In [None]:
retriever = index.as_retriever(filters=filters)
retriever.retrieve("What is inception about?")

[NodeWithScore(node=TextNode(id_='736a1279-4ebd-496e-87b5-925197646477', embedding=[-0.006784645840525627, -0.021635770797729492, -0.015731574967503548, -0.03265434503555298, -0.005616107489913702, 0.025351788848638535, -0.0057811918668448925, 0.0027044713497161865, -0.01623653806746006, -0.023759208619594574, 0.027164479717612267, 0.017932699993252754, 0.029028963297605515, 0.003991158679127693, -0.0009047273779287934, 0.010973258875310421, 0.027164479717612267, -0.012844215147197247, 0.006972389295697212, -0.011148054152727127, 0.003528274828568101, 0.007736308965831995, -0.031022923067212105, -0.013996569439768791, 0.0012567456578835845, 0.004988139029592276, 0.010571876540780067, -0.024290068075060844, 0.019123896956443787, -0.02119554579257965, 0.014022464863955975, -0.023098871111869812, -0.009050510823726654, 0.001241370104253292, 0.006881754379719496, -0.007186027709394693, -0.0036577528808265924, -0.012734158895909786, 0.0034473512787371874, 0.003987921867519617, 0.01084378082