# Neo4j Vector Index

- Author: [Jongho](https://github.com/XaviereKU)
- Design: 
- Peer Review: 
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/99-TEMPLATE/00-BASE-TEMPLATE-EXAMPLE.ipynb)

## Overview

In this tutorial, we use **Neo4j** as our vector store.

We build the vector store with the Neo4j vector index (`Neo4jVector`) and `OpenAIEmbeddings`, then use it to retrieve the data we want.

To fully utilize Neo4j, you need to know how to use `Cypher`, the declarative query language for Neo4j.

We use a few Cypher queries but do not cover the language in depth. The official Cypher documentation is linked in References.

For more information, visit [Neo4j](https://neo4j.com/).

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Setup Neo4j](#setup-neo4j)
- [Load Dataset](#load-dataset)
- [Store Dataset from Scratch](#store-dataset-from-scratch)
- [Working with Existing Vectorstore](#working-with-existing-vectorstore)
- [Manage vector store](#manage-vector-store)
- [Query vector store](#query-vector-store)

### References

- [Cypher](https://neo4j.com/docs/cypher-manual/current/introduction/)
- [Neo4j Docker Installation](https://hub.docker.com/_/neo4j)
- [Neo4j Official Installation guide](https://neo4j.com/docs/operations-manual/current/installation/)
- [Langchain Neo4j document](https://python.langchain.com/docs/integrations/vectorstores/neo4jvector/)
----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Pip install necessary package
%pip install -qU  neo4j
%pip install -qU  langchain-neo4j

Note: you may need to restart the kernel to use updated packages.
Note: you may need to restart the kernel to use updated packages.


In [3]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_core",
        "langchain_community",
        "langchain_text_splitters",
        "langchain_openai",
        "langchain_neo4j",
        "neo4j",
    ],
    verbose=False,
    upgrade=False,
)


In [4]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "Neo4j",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

In [5]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

False

## Setup Neo4j
Neo4j supports Linux, macOS, and Windows, but in this tutorial we set up `Neo4j` using Docker.

See the **Neo4j Docker Installation** reference for more details, and the **Neo4j Official Installation guide** reference for installing on Linux, macOS, or Windows.

To run the Neo4j container, use the following command.
```
docker run \
    -itd \
    --publish=7474:7474 --publish=7687:7687 \
    --volume=$HOME/neo4j/data:/data \
    --env=NEO4J_AUTH=none \
    --name neo4j \
    neo4j
```

**[NOTE]**

* `NEO4J_AUTH=none` disables authentication, which is acceptable only because this is a tutorial. If you use Neo4j in production, **DO NOT** set it to none.
* Port **7474** is used for HTTP. You can access the dashboard in your browser at http://localhost:7474.
* Port **7687** is used for Bolt access. This is the port we use in this notebook.
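
Before connecting from the notebook, it can help to confirm that the two published ports accept TCP connections. The following is a minimal sketch using only the standard library; `port_is_open` is a hypothetical helper, and `localhost` assumes the container runs on the same machine as the notebook. A successful TCP connection only shows that something is listening, not that Neo4j has finished starting.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


if __name__ == "__main__":
    # Ports match the docker command above; the host is an assumption.
    for name, port in (("HTTP", 7474), ("Bolt", 7687)):
        status = "reachable" if port_is_open("localhost", port) else "not reachable"
        print(f"{name} port {port}: {status}")
```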

Now we set the Neo4j URL used throughout this tutorial.

In [6]:
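# Set parameters for Neo4j.
# Replace the host below with the address of your own Neo4j server;
# with the Docker command above on the same machine, use "bolt://localhost:7687".
url = "bolt://172.16.16.192:7687"
username = "neo4j"
password = "neo4j"

# You can also use environment variables instead of directly passing named parameters
# (this requires `import os`):
# os.environ["NEO4J_URI"] = "bolt://localhost:7687"
# os.environ["NEO4J_USERNAME"] = "neo4j"
# os.environ["NEO4J_PASSWORD"] = "neo4j"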

If you did not set the `NEO4J_AUTH` environment variable to none, you **MUST** set your credentials first: open http://localhost:7474, change the password, and set the `password` parameter to the new value.
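
One way to keep credentials out of the notebook is to read them from the environment and fall back to the tutorial values. This is only a sketch: `neo4j_settings` is a hypothetical helper, and the variable names `NEO4J_URI`, `NEO4J_USERNAME`, and `NEO4J_PASSWORD` are the ones mentioned in the commented lines of the cell above.

```python
import os


def neo4j_settings(env=None):
    """Collect Neo4j connection settings, falling back to tutorial defaults."""
    env = os.environ if env is None else env
    return {
        "url": env.get("NEO4J_URI", "bolt://localhost:7687"),
        "username": env.get("NEO4J_USERNAME", "neo4j"),
        "password": env.get("NEO4J_PASSWORD", "neo4j"),
    }


# Passing a plain dict makes the behavior easy to inspect without touching os.environ.
settings = neo4j_settings({"NEO4J_PASSWORD": "my-new-password"})
print(settings["url"], settings["username"])
```

The dictionary keys match the keyword arguments used later in this tutorial, so the result could be unpacked as `**neo4j_settings()` when creating the vector store.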

## Load Dataset

Prepare your own dataset in **txt** format and place it under the `data` folder.

In [7]:
# Import required packages to load dataset
from langchain_community.document_loaders import TextLoader
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

# Import neo4j related packages
from langchain_neo4j import Neo4jVector
from langchain_neo4j.vectorstores.utils import DistanceStrategy

In [8]:
# Load data with TextLoader
loader = TextLoader("data/state_of_the_union.txt", encoding="utf8")

documents = loader.load()

# Set textsplitter
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)

# Split text
docs = text_splitter.split_documents(documents)

# Initialize OpenAIEmbeddings
embeddings = OpenAIEmbeddings()

In [9]:
# Check the type of docs and of each item in the docs list
print("Type of docs:", type(docs))
print()
print("Type of item in docs:", type(docs[0]))
print()
print("Item in docs:", docs[0].page_content[:200])

Type of docs: <class 'list'>

Type of item in docs: <class 'langchain_core.documents.base.Document'>

Item in docs: Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. 


## Store Dataset from Scratch

Now we are ready to store our dataset into Neo4j from scratch.

In [10]:
# Initialize database
db = Neo4jVector.from_documents(
    docs,
    embeddings,
    url=url,
    username=username,
    password=password,
    index_name="Example_cosine",
    node_label="Example",
)

By default, similarity search uses the cosine distance strategy.

To use Euclidean distance instead, see the following cell.

In [11]:
# import DistanceStrategy
from langchain_neo4j.vectorstores.utils import DistanceStrategy

# Initialize database with EUCLIDEAN_DISTANCE
db_euc = Neo4jVector.from_documents(
    docs,
    embeddings,
    url=url,
    username=username,
    password=password,
    distance_strategy=DistanceStrategy.EUCLIDEAN_DISTANCE,
    index_name="Example_Euc",
    node_label="Euclidean",
)

Now we are ready!

## Working with Existing Vectorstore
We created our first Neo4j vector store from scratch.
But what if we want to work with an existing vector store?

Recall that we set the `index_name` of the first vector store to `Example_cosine`.

To reuse it, we pass `index_name="Example_cosine"` to the `from_existing_index` method.

In [12]:
# set index_name
index_name = "Example_cosine"

# initialize new db named store from existing index
store = Neo4jVector.from_existing_index(
    embeddings, url=url, username=username, password=password, index_name=index_name
)

## Manage vector store

In this section, we see how to add, update, and delete items in the vector store.

### Add items
First, we look at how to add items to the index (`Example_cosine`) we created.

To add items directly, we use the `Document` class we imported.

Before adding anything, let's check how many items the `Example_cosine` index holds.

Note that Neo4j represents each item in the vector store as a node.

In [13]:
# We use a Cypher query here. count(*) counts every matched node, including null items
store.query(
    """
 MATCH (n:Example) 
 RETURN count(*)
 """
)

[{'count(*)': 42}]

Now we add some items to the store.

In [14]:
# set documents with Document class
new_docs = [
    Document(
        page_content="This is langchain tutorial an open-source project done by langchain-opentutorial group",
        metadata={"source": ""},
    ),
    Document(
        page_content="Now we are adding some documents",
        metadata={"source": ""},
    ),
]

# add documents to store
# this will return ids of the newly added items
store.add_documents(new_docs)

['e7a6f17b2fa2f355f28fc83977fe61e4', '78c0ff48ad161b68ce52e076e0e4effa']

In [15]:
# Count the number of items again
store.query(
    """
 MATCH (n:Example) 
 RETURN count(*)
 """
)

[{'count(*)': 44}]

As you can see, the number of items in the `Example_cosine` index (node label `Example`) increased from 42 to 44.

### Update items
Now we update items in the store.

In the add example above, we left the `source` metadata empty. Let's set it to 'langchain-opentutorial'.

In [16]:
# set documents with Document class
new_docs = [
    Document(
        page_content="This is langchain tutorial an open-source project done by langchain-opentutorial group",
        metadata={"source": "langchain-opentutorial"},
    ),
    Document(
        page_content="Now we are adding some documents",
        metadata={"source": "langchain-opentutorial"},
    ),
]

# add the documents again; items whose ids already exist are updated
# this returns the same ids as in the add example
store.add_documents(new_docs)

['e7a6f17b2fa2f355f28fc83977fe61e4', '78c0ff48ad161b68ce52e076e0e4effa']

This updates the metadata of the two documents we added earlier.

The `add_documents` method does not only add items; it also updates existing ones.

Notice that the returned ids are the same as those returned in the Add items section.
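
Why would the same text come back with the same id? The ids shown above are 32 hexadecimal characters, which is what a content hash such as MD5 produces. The sketch below assumes, without confirmation from this tutorial, that the store derives a default id from the document text when no `ids` are passed; `content_id` is a hypothetical helper that only illustrates the idea.

```python
from hashlib import md5


def content_id(text: str) -> str:
    """Hypothetical helper: derive a deterministic id from the text itself."""
    return md5(text.encode("utf-8")).hexdigest()


first = content_id("Now we are adding some documents")
second = content_id("Now we are adding some documents")
edited = content_id("Now we are adding some documents.")

print(first == second)  # identical text maps to the same id
print(first == edited)  # changing the text changes the id
print(len(first))       # number of hex characters in the id
```

Under that assumption, re-adding a document with unchanged text targets the existing node, while any edit to the text produces a new id unless you pass `ids` explicitly.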

You can also pass `ids` as a keyword argument to `add_documents` to update a specific item.

In [17]:
# Update item but with id
store.add_documents(
    [
        Document(
            page_content="This is langchain tutorial, an open-source project done by langchain-opentutorial group.",
            metadata={"source": "langchain-opentutorial", "method": "update_by_id"},
        )
    ],
    ids=["e7a6f17b2fa2f355f28fc83977fe61e4"],
)

['e7a6f17b2fa2f355f28fc83977fe61e4']

### Delete items
We can delete items by id or by other criteria.

Unfortunately, as of `langchain-neo4j` version 0.2.0, the `delete` method is not implemented yet, so we need to use Cypher.
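
When the id comes from a variable rather than a literal, a parameterized query avoids pasting values into the Cypher string. The sketch below only builds the query text and its parameters; `build_delete_by_id` is a hypothetical helper, and passing the parameters through a `params` keyword of `store.query` is an assumption you should check against your installed `langchain-neo4j` version.

```python
from typing import Any, Dict, Tuple


def build_delete_by_id(node_label: str, node_id: str) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: return a Cypher DELETE statement and its parameters."""
    # Labels cannot be passed as Cypher parameters, so validate before formatting.
    if not node_label.isidentifier():
        raise ValueError(f"unsafe node label: {node_label!r}")
    query = f"MATCH (n:`{node_label}`) WHERE n.id = $id DELETE n"
    return query, {"id": node_id}


query, params = build_delete_by_id("Example", "e7a6f17b2fa2f355f28fc83977fe61e4")
print(query)
print(params)
# With the store from this tutorial, the call might look like (assumed signature):
# store.query(query, params=params)
```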

In [18]:
# query single item with id
query_result = store.query(
    """
    MATCH (n:Example)
    WHERE n.id = 'e7a6f17b2fa2f355f28fc83977fe61e4'
    Return n
    """
)

print(query_result[0]["n"].keys())
print("text:", query_result[0]["n"]["text"])
print("id:", query_result[0]["n"]["id"])

dict_keys(['method', 'id', 'source', 'text', 'embedding'])
text: This is langchain tutorial, an open-source project done by langchain-opentutorial group.
id: e7a6f17b2fa2f355f28fc83977fe61e4


In [19]:
# delete single item with id
store.query(
    """
    MATCH (n:Example)
    WHERE n.id = 'e7a6f17b2fa2f355f28fc83977fe61e4'
    DELETE n
    """
)

[]

In [20]:
# check if item is deleted
query_result = store.query(
    """
    MATCH (n:Example)
    WHERE n.id = 'e7a6f17b2fa2f355f28fc83977fe61e4'
    Return count(*)
    """
)

print("After deletion:", query_result)

After deletion: [{'count(*)': 0}]


In [21]:
# query item where source metadata is langchain-opentutorial
query_result = store.query(
    """
    MATCH (n:Example)
    WHERE n.source = 'langchain-opentutorial'
    Return count(*)
    """
)

print("Before deletion:", query_result)

# delete by metadata
store.query(
    """
    MATCH (n:Example)
    WHERE n.source = 'langchain-opentutorial'
    DELETE n
    """
)

# check if item is deleted
query_result = store.query(
    """
    MATCH (n:Example)
    WHERE n.source = 'langchain-opentutorial'
    Return count(*)
    """
)
print("After deletion:", query_result)

Before deletion: [{'count(*)': 1}]
After deletion: [{'count(*)': 0}]


You can also delete multiple items, or all data.

Let's try it with the nodes labeled `Euclidean`, which belong to the `Example_Euc` index.

In [22]:
# query items with node_label Euclidean
query_result = store.query(
    """
    MATCH (n:Euclidean)
    Return count(*)
    """
)

print(query_result)

[{'count(*)': 42}]


In [23]:
# delete all items with node_label Euclidean
store.query(
    """
    MATCH (n:Euclidean)
    DELETE n
    """
)

[]

In [24]:
# check if all the items are deleted
query_result = store.query(
    """
    MATCH (n:Euclidean)
    Return count(*)
    """
)

print(query_result)

[{'count(*)': 0}]


## Query vector store
### Query directly
By default, similarity search uses **cosine distance**.

Neo4j also supports Euclidean distance, so you can use that instead if you want.

First, we run a similarity search with cosine distance to retrieve 2 documents.

In [25]:
# Write your query
query = "What did the president say about Ketanji Brown Jackson"

# Do similarity search
# Here k means the number of documents we want.
docs_only = db.similarity_search(query, k=2)

print("Searched document number :", len(docs_only))

Searched document number : 2


In [26]:
# print searched docs
for doc in docs_only:
    print("-" * 80)
    print(doc.page_content[:200])
    print("-" * 80)

--------------------------------------------------------------------------------
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s r
--------------------------------------------------------------------------------


You can also get similarity scores along with the documents.

In [27]:
# Do similarity search with scores
docs_with_scores = db.similarity_search_with_score(query, k=2)

print("Searched document number :", len(docs_with_scores))

# print searched docs and their cosine similarity scores
for doc, score in docs_with_scores:
    print("-" * 80)
    print("Score :", score)
    print(doc.page_content[:200])
    print("-" * 80)

Searched document number : 2
--------------------------------------------------------------------------------
Score : 0.902435302734375
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score : 0.8859710693359375
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s r
--------------------------------------------------------------------------------


In [28]:
# Do similarity search with scores - Euclidean distance
euc_docs_with_scores = db_euc.similarity_search_with_score(query, k=2)

print("Searched document number :", len(euc_docs_with_scores))

# print searched docs and their Euclidean-based similarity scores
for doc, score in euc_docs_with_scores:
    print("-" * 80)
    print("Score :", score)
    print(doc.page_content[:200])
    print("-" * 80)

Searched document number : 0


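Here the search returns no documents because we deleted every node labeled `Euclidean` in the Delete items section. If you run this cell before that deletion, you can see that a different distance strategy gives different scores.

You can learn more in the [DistanceStrategy](https://python.langchain.com/api_reference/neo4j/vectorstores/langchain_neo4j.vectorstores.utils.DistanceStrategy.html) reference for Neo4jVector.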
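
To see why the two strategies produce different numbers, it helps to compute both scores by hand for tiny vectors. The sketch below assumes the normalizations that, to my understanding, Neo4j vector indexes apply: `(1 + cosine) / 2` for cosine and `1 / (1 + d²)` for Euclidean, where `d` is the Euclidean distance. Treat these formulas as an assumption and verify them against the Neo4j documentation; the function names are hypothetical.

```python
import math
from typing import Sequence


def cosine_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity rescaled from [-1, 1] to [0, 1] (assumed normalization)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return (1.0 + dot / norm) / 2.0


def euclidean_score(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance mapped into (0, 1] (assumed normalization)."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)


query_vec = [1.0, 0.0]
doc_vec = [0.0, 1.0]  # orthogonal to the query, squared distance 2

print("cosine score   :", cosine_score(query_vec, doc_vec))
print("euclidean score:", euclidean_score(query_vec, doc_vec))
```

For these two orthogonal unit vectors the cosine score is 0.5 while the Euclidean score is 1/3, so the same pair of vectors is scored differently under each strategy even though both scores equal 1 for identical vectors.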

### Query by turning into retriever
We can also use the `as_retriever` method to turn the database into a retriever.

A retriever can be placed in a chain and used for Retrieval-Augmented Generation (RAG).

In [29]:
# turn db into retriever
retriever = store.as_retriever()

# invoke retriever to get result
results = retriever.invoke("What did the president say about Ketanji Brown Jackson")

for result in results:
    print("-" * 80)
    print(result.page_content[:200])
    print("-" * 80)

--------------------------------------------------------------------------------
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s r
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Tonight, I’m announcing a crackdown on these companies overcharging American businesses and consumers. 

And as Wall Street firms take over more nursing homes, quality in those homes has gone d

By default, a retriever created with `as_retriever` returns 4 documents.

If you want to retrieve more or fewer documents, you can do the following.

In [30]:
# turn db into retriever that returns 2 documents
retriever = store.as_retriever(search_kwargs={"k": 2})

# invoke retriever to get result
results = retriever.invoke("What did the president say about Ketanji Brown Jackson")

for result in results:
    print("-" * 80)
    print(result.page_content[:200])
    print("-" * 80)

--------------------------------------------------------------------------------
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s r
--------------------------------------------------------------------------------


If you want the score as well, you can use the `@chain` decorator.

In [31]:
# import necessary package
from langchain_core.runnables import chain


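# define retriever_with_score with chain decorator
@chain
def retriever_with_score(query):
    # Iterate over (document, score) pairs directly; this also works
    # when the search returns no results, where zip(*...) would raise.
    docs = []
    for doc, score in store.similarity_search_with_score(query, k=2):
        doc.metadata["score"] = score
        docs.append(doc)

    return docs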

In [32]:
# invoke retriever_with_score
results = retriever_with_score.invoke(
    "What did the president say about Ketanji Brown Jackson"
)

# print result
# Note that Score is inside the metadata now
for result in results:
    print("-" * 80)
    print("Score :", result.metadata["score"])
    print(result.page_content[:200])
    print("-" * 80)

--------------------------------------------------------------------------------
Score : 0.902435302734375
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score : 0.8859710693359375
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s r
--------------------------------------------------------------------------------
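
Once the score lives in the metadata, results can be filtered by a minimum score with plain Python. The sketch below uses a stand-in class so it runs without a database; `FakeDoc`, `keep_above`, and the 0.89 threshold are all hypothetical, and the two scores are copied from the output above.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDoc:
    """Stand-in for a LangChain Document, so the sketch runs without a database."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def keep_above(docs: List[FakeDoc], min_score: float) -> List[FakeDoc]:
    """Keep only documents whose metadata score is at least min_score."""
    return [doc for doc in docs if doc.metadata.get("score", 0.0) >= min_score]


results = [
    FakeDoc("Tonight. I call on the Senate to: Pass the Freedom to Vote Act.",
            {"score": 0.902435302734375}),
    FakeDoc("A former top litigator in private practice.",
            {"score": 0.8859710693359375}),
]

for doc in keep_above(results, min_score=0.89):
    print(doc.metadata["score"], doc.page_content)
```

The same function should work on the documents returned by `retriever_with_score`, since it only relies on a `metadata` dictionary containing a `score` key.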
