# Vector Similarity Search with RedisPy
![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [RedisPy](https://redis-py.readthedocs.io/en/stable/), the standard Redis Python client library, to index documents and their embeddings and to run semantic search queries.

## Install Python Dependencies

In [None]:
!pip install -q redis "redisvl>=0.0.4" numpy sentence-transformers

## Load Document Chunks and Embeddings
**You are expected to have first run the Data Prep Notebook**

In [1]:
import os
import json

data_path = "notebooks/resources/"

with open(os.path.join(data_path, "embeddings.json"), "r") as f:
    chunk_embeddings = json.load(f)

with open(os.path.join(data_path, "docs.json"), "r") as f:
    chunks = json.load(f)
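
Before indexing, it can help to confirm that the two files line up. The sketch below assumes, as the later cells do, that `chunks` is a list of dicts with a `page_content` key and that `chunk_embeddings` is a parallel list of 384-dimensional float lists. `validate_embeddings` is a hypothetical helper name, not part of any library, and the toy data stands in for the real files.

```python
def validate_embeddings(chunks, chunk_embeddings, dim=384):
    """Return a list of problems found; an empty list means the data looks consistent."""
    problems = []
    if len(chunks) != len(chunk_embeddings):
        problems.append(f"{len(chunks)} chunks but {len(chunk_embeddings)} embeddings")
    for i, (chunk, vector) in enumerate(zip(chunks, chunk_embeddings)):
        if "page_content" not in chunk:
            problems.append(f"chunk {i} has no 'page_content' key")
        if len(vector) != dim:
            problems.append(f"embedding {i} has {len(vector)} dimensions, expected {dim}")
    return problems


# Toy data standing in for docs.json / embeddings.json (3 dimensions instead of 384)
toy_chunks = [{"page_content": "first"}, {"page_content": "second"}]
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(validate_embeddings(toy_chunks, toy_embeddings, dim=3))  # prints []
```

On the real data you would call `validate_embeddings(chunks, chunk_embeddings)` and expect an empty list.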

## Install Redis Stack (OPTIONAL)

Redis Search is used as the vector similarity search engine in this notebook.

Instead of running Redis Stack inside the notebook (https://redis.io/docs/getting-started/install-stack/), you can provision your own free Redis Cloud instance at https://redis.com/try-free/.

In [None]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install -y redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

### Connect to Redis

By default, this notebook connects to the local instance of Redis Stack. If you have your own Redis Cloud instance, replace the REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own.

In [2]:
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# Shortcut so shell cells can run `redis-cli $REDIS_CONN <command>`
if REDIS_PASSWORD != "":
    os.environ["REDIS_CONN"] = f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
    os.environ["REDIS_CONN"] = f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
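
Redis Cloud passwords can contain characters such as `@`, `:` or `/`, which change the meaning of a URL when interpolated as-is. A minimal sketch of one way to guard against that, percent-encoding the password with the standard library; `build_redis_url` is a hypothetical helper, and the password shown is made up.

```python
from urllib.parse import quote, unquote, urlparse


def build_redis_url(host, port, password=""):
    """Build a redis:// URL, percent-encoding the password if one is set."""
    if password:
        return f"redis://:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"


url = build_redis_url("localhost", 6379, "p@ss:word/1")
print(url)  # redis://:p%40ss%3Aword%2F1@localhost:6379

# Round trip: parsing the URL and unquoting recovers the original password
parsed = urlparse(url)
print(unquote(parsed.password), parsed.hostname, parsed.port)
```

With no password set, the helper returns `redis://localhost:6379`, leaving out the empty credentials section.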


## Create the HASH index from schema
Below we connect to Redis and create an index for vector similarity search that contains a single text field and a single vector field.

In [4]:
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


r = redis.Redis.from_url(REDIS_URL)

index_name = "redispy"
key_prefix = "doc:redispy"

def create_index(index_type: str = "FLAT"):       # Creates a FLAT index by default
    try:
        # check to see if index exists
        r.ft(index_name).info()
        print("Index already exists!")
    except redis.exceptions.ResponseError:
        # schema
        schema = (
            TextField("content"),                  # Text Field Name
            VectorField("chunk_vector",            # Vector Field Name
                index_type, {                      # Vector Index Type: FLAT or HNSW
                    "TYPE": "FLOAT32",
                    "DIM": 384,                    # Number of Vector Dimensions
                    "DISTANCE_METRIC": "COSINE",   # Vector Search Distance Metric
                }
            ),
        )

        # index Definition
        definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.HASH)   # Uses HASH by default

        # create Index
        r.ft(index_name).create_index(fields=schema, definition=definition)

In [5]:
# Create the index
create_index()
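
`create_index` accepts `"HNSW"` as well as `"FLAT"`. An HNSW index takes a few optional graph parameters (`M`, `EF_CONSTRUCTION`, `EF_RUNTIME`) in the same attributes dict. The sketch below only builds that dict, so it runs without a Redis connection. `vector_attributes` is a hypothetical helper, and the parameter values are illustrative starting points rather than tuned recommendations.

```python
def vector_attributes(index_type="FLAT", dim=384, metric="COSINE",
                      m=16, ef_construction=200, ef_runtime=10):
    """Build the attribute dict passed as the third argument to VectorField."""
    if index_type not in ("FLAT", "HNSW"):
        raise ValueError(f"unsupported index type: {index_type}")
    attrs = {"TYPE": "FLOAT32", "DIM": dim, "DISTANCE_METRIC": metric}
    if index_type == "HNSW":
        # Graph parameters only apply to HNSW; FLAT does an exact brute-force scan
        attrs.update({"M": m, "EF_CONSTRUCTION": ef_construction, "EF_RUNTIME": ef_runtime})
    return attrs


print(vector_attributes("FLAT"))
print(vector_attributes("HNSW"))
```

The result could then be passed along as `VectorField("chunk_vector", "HNSW", vector_attributes("HNSW"))`.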

In [6]:
# Check the info related to the newly created index
r.ft(index_name).info()

{'index_name': 'redispy',
 'index_options': [],
 'index_definition': [b'key_type',
  b'HASH',
  b'prefixes',
  [b'doc:redispy'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'content',
   b'attribute',
   b'content',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'chunk_vector',
   b'attribute',
   b'chunk_vector',
   b'type',
   b'VECTOR']],
 'num_docs': '0',
 'max_doc_id': '0',
 'num_terms': '0',
 'num_records': '0',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '0.00818634033203125',
 'total_inverted_index_blocks': '0',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': 'nan',
 'bytes_per_record_avg': 'nan',
 'offsets_per_term_avg': 'nan',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_time': '0',
 'indexing': '0',
 'percent_indexed': '1',
 'number_of_uses': 1,
 'cleaning': 0,
 'gc_stats

### Process and load data using a Redis pipeline
Below we use a Redis pipeline (not a transaction) to batch writes to Redis, which improves throughput significantly. The `batch_size` parameter can be tuned and benchmarked against your hardware and data. We typically recommend starting small (100-200) and increasing as needed.

In [7]:
# Store each chunk as a HASH: the text plus the embedding packed as float32 bytes
import numpy as np

batch_size = 200
pipe = r.pipeline(transaction=False)
for i, chunk in enumerate(chunks):
    data = {
        'content': chunk['page_content'],
        'chunk_vector': np.array(chunk_embeddings[i]).astype(np.float32).tobytes()
    }
    pipe.hset(f"{key_prefix}:{i}", mapping=data)
    # execute in "mini batches", once every batch_size commands
    if (i + 1) % batch_size == 0:
        res = pipe.execute()

# cleanup final batch execution
res = pipe.execute()
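
The mini-batch pattern can be checked without a Redis server. The sketch below uses a hypothetical `FakePipeline` stand-in (not part of redis-py) that only records how many commands each `execute()` call flushes. Flushing on `(i + 1) % batch_size == 0` keeps every full batch at exactly `batch_size` commands.

```python
class FakePipeline:
    """Stand-in for a Redis pipeline: buffers commands and records flush sizes."""

    def __init__(self):
        self.buffer = []
        self.flush_sizes = []

    def hset(self, key, mapping):
        self.buffer.append((key, mapping))

    def execute(self):
        self.flush_sizes.append(len(self.buffer))
        self.buffer = []


def load_in_batches(pipe, records, key_prefix, batch_size):
    for i, record in enumerate(records):
        pipe.hset(f"{key_prefix}:{i}", mapping=record)
        # Flush once every batch_size buffered commands
        if (i + 1) % batch_size == 0:
            pipe.execute()
    # Flush whatever is left over after the loop
    pipe.execute()


fake = FakePipeline()
records = [{"content": f"chunk {n}"} for n in range(450)]
load_in_batches(fake, records, "doc:redispy", batch_size=200)
print(fake.flush_sizes)  # [200, 200, 50]
```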

In [8]:
len(chunks) == r.dbsize()

True

In [9]:
# do NOT run this command in production
keys = r.keys()

r.hgetall(keys[0])

{b'chunk_vector': b'_^\x9f\xbc%\xb8\x02\xbd{\x16\xf0\xbb\x1b\xbcO\xbb\x08x\xa5\xbbY-q<Y-q\xbdL\xa6\x80\xbc\xe1\xe5\x04<\x8a\xf5>\xbc\x93\x17\xd4<\x11\x04\xe3<\xfa\x8a\x8b\xbc\re\x8d<h$\xd7\xbb\x08>+=\xd2p^<Z{\x03\xbd\xbf~\xc5<c\xc3\xfa\xbc\x16\xc1\x1c\xbd\xc8\x14\xcf<)\xe3c\xbd\x8f\x90\x95:\xbf,4\xbc%\x96\x1f<2\x05\xf9<\x1bt\x8a\xbc\x0c!G<\xd3\xd6\x07<\xac\xae\x8f\xbbA\xaaM\xbc\xbff.\xbc%"+\xbb\xb05N\xbch\xea\xdc\xbb\xd2X\xc7;\x85\x94\xe2\xbcd\x11\r<\x1b\xde2\xbb\xc4\x19\x9c<\xcd\xbdp<\xef\x1a\xe4\xbb\xac\xa4\xc3;<\xf7\xdf9Ll\x86<{\xe6\xc1;\xb0\x13k\xbcK\xeeE\xbc\xd2\x92A=\xeb\xef\x02=\xce#\x1a=w\x03\xa6\xbc\x89\xeb\xf2\xba\x03[\x8f<A6Y\xbcQq\x05<\xbf,\xb4;\x86\xfa\x8b\xbc\x0cQ\xf5\xbb\x8ai3;\x99\xd4\x8d;mm\x9c\xbb\xe1\xa1>\xbc\xdc\xd69\xbd=\x8d\xb7\xba\x9ee\x18\xbc\xd7;c<\xef\x02M\xbb\x8a\x8b\x16;K\x10\xa9<\x94s1<\xac\xa4C<\x07\xfad\xbb\xd7\xc7\xee<m\xbf\xad\xbb\x98\xc0\xf5\xbbc{\xb5;\xfe\x11\xca<\xc06\x80<\xa7\x9fD<\x0c\x95;\xbc\x1b:\x10\xbd%\xe80\xbc\xb1\x05\xa0<.D@;AX\xbc;\xac\xe8\
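
The `chunk_vector` value is the raw float32 buffer written during loading: 384 dimensions at 4 bytes each, or 1536 bytes. A minimal round trip, sketched with the standard-library `struct` module and assuming little-endian byte order (the native order on most machines this notebook is likely to run on). `to_float32_bytes` and `from_float32_bytes` are hypothetical helper names.

```python
import struct


def to_float32_bytes(vector):
    """Pack a list of floats as little-endian float32, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Unpack a float32 buffer back into a list of Python floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


vector = [0.5, -1.25, 3.0]        # values chosen to be exactly representable in float32
blob = to_float32_bytes(vector)
print(len(blob))                   # 12
print(from_float32_bytes(blob))    # [0.5, -1.25, 3.0]
```

With NumPy, `np.frombuffer(blob, dtype=np.float32)` performs the same decoding.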

## Initialize embeddings engine

Get the vectorizer from RedisVL; it creates the embeddings for our queries. You could also use the HuggingFace `sentence-transformers` client directly, but RedisVL makes this too easy!

In [10]:
from redisvl.vectorize.text import HFTextVectorizer

# create a vectorizer
hf = HFTextVectorizer(model="sentence-transformers/all-MiniLM-L6-v2")


### Query the database
Now we can use the Redis search index to perform similarity search. This query takes a user input, converts it to an embedding, and fetches the 2 most semantically similar chunks from Redis.

In [14]:
# Grab user input
_input = "Profit margins"

query = (
    Query("*=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": hf.embed(_input, as_buffer=True)
}

r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:a39f4db7-ee0b-468d-8091-95a107c3087f', 'payload': None, 'score': '0.178667545319', 'content': 'Inventories as of May 31, 2023 were $8.5 billion, flat compared to the prior year, driven by the actions we took throughout fiscal 2023 to manage inventory levels\n\nWe returned $7.5 billion to our shareholders in fiscal 2023 through share repurchases and dividends\n\nReturn on Invested Capital ("ROIC") as of May 31, 2023 was 31.5% compared to 46.5% as of May 31, 2022. ROIC is considered a non-GAAP financial measure, see "Use of Non-GAAP Financial Measures" for further information.\n\nFor discussion related to the results of operations and changes in financial condition for fiscal 2022 compared to fiscal 2021 refer to Part II, Item 7. Management\'s Discussion and Analysis of Financial Condition and Results of Operations in our fiscal 2022 Form 10-K, which was filed with the United States Securities and Exchange Commission on July 21, 2022.\n\nCURRENT ECONOMIC CON
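
The `score` field is a vector distance, so with the `COSINE` metric it is 1 minus the cosine similarity and lower means more similar, which is why the query sorts by `score` in ascending order. A FLAT index computes this distance against every stored vector. The sketch below reproduces that brute-force search on toy 2-dimensional vectors; `cosine_distance` and `knn` are hypothetical helper names and the vectors are made up.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical directions, 1 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query, docs, k):
    """Score every document against the query and keep the k closest."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


docs = {
    "doc:0": [1.0, 0.0],
    "doc:1": [0.0, 1.0],
    "doc:2": [1.0, 1.0],
}
for doc_id, score in knn([1.0, 0.0], docs, k=2):
    print(doc_id, round(score, 4))
# doc:0 0.0
# doc:2 0.2929
```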

### Range Queries
Range queries let you set a predefined distance threshold (a radius) and return every document within it.

In [15]:
query = (
    Query("@chunk_vector:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}")
     .sort_by("score")
     .return_fields("content", "score")
     .dialect(2)
)

# Find all vectors within a distance of 0.18 from the query vector
query_params = {
    "radius": 0.18,
    "vec": hf.embed(_input, as_buffer=True)
}
r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:a39f4db7-ee0b-468d-8091-95a107c3087f', 'payload': None, 'score': '0.178667545319', 'content': 'Inventories as of May 31, 2023 were $8.5 billion, flat compared to the prior year, driven by the actions we took throughout fiscal 2023 to manage inventory levels\n\nWe returned $7.5 billion to our shareholders in fiscal 2023 through share repurchases and dividends\n\nReturn on Invested Capital ("ROIC") as of May 31, 2023 was 31.5% compared to 46.5% as of May 31, 2022. ROIC is considered a non-GAAP financial measure, see "Use of Non-GAAP Financial Measures" for further information.\n\nFor discussion related to the results of operations and changes in financial condition for fiscal 2022 compared to fiscal 2021 refer to Part II, Item 7. Management\'s Discussion and Analysis of Financial Condition and Results of Operations in our fiscal 2022 Form 10-K, which was filed with the United States Securities and Exchange Commission on July 21, 2022.\n\nCURRENT ECONOMIC CON
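
Unlike KNN, a range query has no fixed result count: the number of hits depends entirely on the radius. A sketch of the selection logic over hypothetical precomputed distances (the document ids and values are made up, and the boundary is treated as inclusive here):

```python
def range_search(distances, radius):
    """Keep (doc_id, distance) pairs within the radius, closest first."""
    hits = [(doc_id, d) for doc_id, d in distances.items() if d <= radius]
    return sorted(hits, key=lambda pair: pair[1])


# Hypothetical distances between a query vector and four stored chunks
distances = {"doc:a": 0.1787, "doc:b": 0.1845, "doc:c": 0.1502, "doc:d": 0.4100}
print(range_search(distances, radius=0.18))  # [('doc:c', 0.1502), ('doc:a', 0.1787)]
print(range_search(distances, radius=0.10))  # []
```

A small radius can legitimately return nothing, so callers should handle an empty result list.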

### Hybrid Queries
Hybrid queries combine traditional filters (numeric, tag, text) with vector similarity search (VSS) in a single Redis command.

In [17]:
query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": hf.embed(_input, as_buffer=True)
}
r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:7c93cdc5-b6b8-4382-82a5-d33da13db5c3', 'payload': None, 'score': '0.184492409229', 'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales 
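
The query string has two parts: a filter expression before `=>` that narrows the candidate set, and the KNN clause that ranks what remains. Using `*` as the filter gives the plain KNN query from earlier. The sketch below only assembles the string; `knn_query_string` is a hypothetical helper, and it does not escape user input, so the filter is assumed to be a trusted expression.

```python
def knn_query_string(k, vector_field, filter_expr="*", param="vec", score_alias="score"):
    """Assemble a KNN query string with an optional pre-filter."""
    # Parenthesize non-trivial filters so the whole expression applies before KNN
    prefix = "*" if filter_expr == "*" else f"({filter_expr})"
    return f"{prefix}=>[KNN {k} @{vector_field} ${param} AS {score_alias}]"


print(knn_query_string(2, "chunk_vector"))
# *=>[KNN 2 @chunk_vector $vec AS score]
print(knn_query_string(2, "chunk_vector", filter_expr="@content:profit"))
# (@content:profit)=>[KNN 2 @chunk_vector $vec AS score]
```

The resulting string can be passed to `Query(...)` in place of the literal used above.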

In [18]:
# Drop the index; delete_documents=True also deletes the indexed documents
r.ft(index_name).dropindex(delete_documents=True)

b'OK'

## What about JSON Support?

Redis also allows you to store data in JSON objects. The JSON fields can contain metadata and vectors. Below is a simple example of indexing JSON data.

**For now** -- JSON support is only enabled in the base redis-py library. It is coming soon to LangChain and RedisVL.

In [51]:
index_name = "redispy:json"
key_prefix = "doc:redispy:json"


try:
    # check to see if index exists
    r.ft(index_name).info()
    print("Index already exists!")
except redis.exceptions.ResponseError:
    # schema
    schema = (
        TextField("$.content",                     # Text Field Name (JSON path)
            as_name="content"                      # Text Field Alias -- required for JSON
        ),
        VectorField("$.chunk_vector",              # Vector Field Name (JSON path)
            "FLAT", {                              # Vector Index Type: FLAT or HNSW
                "TYPE": "FLOAT32",
                "DIM": 384,                        # Number of Vector Dimensions
                "DISTANCE_METRIC": "COSINE",       # Vector Search Distance Metric
            },
            as_name="chunk_vector"                 # Vector Field Alias -- required for JSON
        ),
    )

    # index Definition
    definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.JSON) # select JSON here

    # create Index
    r.ft(index_name).create_index(fields=schema, definition=definition)

In [54]:
r.ft(index_name).info()

{'index_name': 'redispy:json',
 'index_options': [],
 'index_definition': [b'key_type',
  b'JSON',
  b'prefixes',
  [b'doc:redispy:json'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'$.content',
   b'attribute',
   b'content',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'$.chunk_vector',
   b'attribute',
   b'chunk_vector',
   b'type',
   b'VECTOR']],
 'num_docs': '0',
 'max_doc_id': '0',
 'num_terms': '0',
 'num_records': '0',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '0.00818634033203125',
 'total_inverted_index_blocks': '0',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': 'nan',
 'bytes_per_record_avg': 'nan',
 'offsets_per_term_avg': 'nan',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_time': '0',
 'indexing': '0',
 'percent_indexed': '1',
 'number_of_uses': 2,
 'cleaning':

In [55]:
# Write JSON data to the index

batch_size = 200
pipe = r.pipeline(transaction=False)

for i, chunk in enumerate(chunks):
    redis_key = f"{key_prefix}:{i}"
    data = {
        'content': chunk['page_content'],
        'chunk_vector': chunk_embeddings[i] # notice that we don't need to convert JSON embeddings to bytes
    }
    pipe.json().set(redis_key, "$", data)
    # mini batch: flush once every batch_size commands
    if (i + 1) % batch_size == 0:
        res = pipe.execute()

res = pipe.execute() # make sure to use mini batches if working with larger datasets

In [None]:
# Fetch the JSON doc
r.json().get(f"{key_prefix}:0", "$")

In [57]:
# And now you can perform the same kinds of queries

query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": hf.embed(_input, as_buffer=True)
}
r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:json:145', 'payload': None, 'score': '0.184492409229', 'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales were $12.6 billion for fisca

In [None]:
# Drop the index; delete_documents=True also deletes the indexed documents
r.ft(index_name).dropindex(delete_documents=True)

## What's Next?

Now that you have the basics down with the baseline Redis Python client, here is where to go next:

- [RedisVL](RedisVL_VSS.ipynb) is probably the next one to try. RedisVL makes it incredibly easy to get started with Redis as a vector database.
- [LangChain](LangChain_VSS.ipynb) comes after that, and makes it easy to integrate Redis with an LLM.