<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/RedisPython_VSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search with RedisPy
![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [RedisPy](https://redis-py.readthedocs.io/en/stable/), the standard Redis Python client library, to index documents with their embeddings and run semantic search over them.

## Setup and Data Prep

### Pull GitHub Materials
We need to clone the supporting materials from GitHub.

In [1]:
# This clones your git repository into a directory named 'temp_repo'.
!git clone https://github.com/Redislabs-Solution-Architects/financial-vss.git temp_repo

# This command moves the 'resources' directory from 'temp_repo' to your current directory.
!mv temp_repo/resources .

# This deletes the 'temp_repo' directory, cleaning up the unwanted files.
!rm -rf temp_repo


Cloning into 'temp_repo'...
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 73 (delta 30), reused 56 (delta 17), pack-reused 0
Receiving objects: 100% (73/73), 6.92 MiB | 21.40 MiB/s, done.
Resolving deltas: 100% (30/30), done.
mv: cannot move 'temp_repo/resources' to './resources': Directory not empty


### Install Python Dependencies

In [2]:
!pip install -q redis "redisvl>=0.0.4" langchain pdf2image "unstructured[all-docs]" sentence-transformers

### Preprocess PDF Doc(s)

Now we will load a single financial document (a 10-K filing) and preprocess it using some LangChain helpers.

In [3]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

# For simplicity, we will just work with one of the 10k files. This will take some time still.
# To Note: the UnstructuredFileLoader is not the only document loader type that LangChain provides
# To Note: the RecursiveCharacterTextSplitter is what we use to create smaller chunks of text from the doc.
# Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
# Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
doc = [doc for doc in docs if "nke" in doc][0]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Listing available documents ... ['resources/aapl-10k-2023.pdf', 'resources/jnj-10k-2023.pdf', 'resources/msft-10k-2023.pdf', 'resources/amzn-10k-2023.pdf', 'resources/nke-10k-2023.pdf', 'resources/nvd-10k-2023.pdf']
Done preprocessing. Created 323 chunks of the original pdf resources/nke-10k-2023.pdf
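To make the `chunk_size` and `chunk_overlap` parameters more concrete, here is a minimal sketch of overlapping, fixed-size chunking in plain Python. It is a simplification and not the LangChain algorithm: `RecursiveCharacterTextSplitter` also tries to split on separators such as paragraph breaks, which this sketch ignores. The function name `sliding_window_chunks` and the sample text are made up for illustration.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int):
    """Split text into fixed-size windows that overlap by chunk_overlap characters.

    Returns (start_index, chunk) pairs, loosely mirroring add_start_index=True.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return windows


sample = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
for start, piece in toy_chunks:
    print(start, piece)
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so the last 3 characters of one chunk reappear at the start of the next. The overlap helps keep sentences that straddle a boundary searchable from either side.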


In [4]:
# Take a look at one item
print(chunks[2])

page_content="NIKE, Inc.(Exact name of Registrant as specified in its charter)Oregon93-0584541(State or other jurisdiction of incorporation)(IRS Employer Identification No.)One Bowerman Drive, Beaverton, Oregon 97005-6453(Address of principal executive offices and zip code)(503) 671-6453(Registrant's telephone number, including area code)SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:Class B Common StockNKENew York Stock Exchange(Title of each class)(Trading symbol)(Name of each exchange on which registered)SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:NONE\n\nAs of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:Class A$7,831,564,572 Class B136,467,702,472 $144,299,267,044\n\nTable of ContentsUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934FOR THE FISCAL YEAR ENDED MAY 31, 2023

### Create document chunk embeddings

In [5]:
from redisvl.vectorize.text import HFTextVectorizer

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")

# Embed each page_content from the document chunks
chunk_embeddings = hf.embed_many([chunk.page_content for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(chunk_embeddings) == len(chunks)

True

### Install Redis Stack (OPTIONAL)

Redis Search will be used as the vector similarity search engine for this notebook.

Instead of installing [Redis Stack](https://redis.io/docs/getting-started/install-stack/) inside the notebook, you can provision your own free Redis Cloud instance at https://redis.com/try-free/.

In [6]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


gpg: cannot open '/dev/tty': No such device or address
curl: (23) Failed writing body


### Connect to Redis

By default, this notebook connects to the local instance of Redis Stack. If you have your own Redis Cloud instance, replace the REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own.

In [7]:
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
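The URL above always includes the `:password@` part, even when the password is empty. As a hedged alternative, the sketch below only adds credentials when a password is set and percent-encodes it, so characters such as `@` or `:` cannot be mistaken for URL delimiters. `build_redis_url` and the example host are hypothetical names, and the sketch assumes the client decodes percent-encoded credentials when parsing the URL (which, to my understanding, `redis.Redis.from_url` does).

```python
from urllib.parse import quote


def build_redis_url(host: str, port, password: str = "") -> str:
    """Assemble a redis:// URL, adding credentials only when a password is set."""
    if password:
        # Percent-encode the password so '@' or ':' do not break URL parsing
        return f"redis://:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"


local_url = build_redis_url("localhost", "6379")
cloud_url = build_redis_url("example.host", 18374, "p@ss:word")
print(local_url)
print(cloud_url)
```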


## VSS with Redis Python

### Create the HASH index from schema
Below we connect to Redis and create an index for vector similarity search that contains a single text field and vector field.

In [8]:
import redis
from redis.commands.search.field import TextField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition, IndexType
from redis.commands.search.query import Query


r = redis.Redis.from_url(REDIS_URL)

index_name = "redispy"
key_prefix = "doc:redispy"

def create_index(index_type: str = "FLAT"):       # Creates a FLAT index by default
    try:
        # check to see if index exists
        r.ft(index_name).info()
        print("Index already exists!")
    except redis.exceptions.ResponseError:   # raised when the index does not exist yet
        # schema
        schema = (
            TextField("content"),                  # Text Field Name
            VectorField("chunk_vector",            # Vector Field Name
                index_type, {                      # Vector Index Type: FLAT or HNSW
                    "TYPE": "FLOAT32",
                    "DIM": 384,                    # Number of Vector Dimensions
                    "DISTANCE_METRIC": "COSINE",   # Vector Search Distance Metric
                }
            ),
        )

        # index Definition
        definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.HASH)   # Uses HASH by default

        # create Index
        r.ft(index_name).create_index(fields=schema, definition=definition)

In [9]:
# Create the index
create_index()
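The schema sets `DISTANCE_METRIC` to `COSINE`, so the `score` values returned by the queries later in this notebook are cosine distances: lower means more similar. As a rough illustration in plain Python (not how Redis computes it internally), cosine distance can be written as one minus the cosine similarity. The helper name `cosine_distance` and the toy 2-dimensional vectors are assumptions for the example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


same = cosine_distance([1.0, 0.0], [1.0, 0.0])       # identical direction
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])  # unrelated direction
opposite = cosine_distance([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Under this definition the distance ranges from 0 (same direction) to 2 (opposite direction), and it ignores vector length.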

In [10]:
# Check the info related to the newly created index
r.ft(index_name).info()

{'index_name': 'redispy',
 'index_options': [],
 'index_definition': [b'key_type',
  b'HASH',
  b'prefixes',
  [b'doc:redispy'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'content',
   b'attribute',
   b'content',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'chunk_vector',
   b'attribute',
   b'chunk_vector',
   b'type',
   b'VECTOR']],
 'num_docs': '0',
 'max_doc_id': '0',
 'num_terms': '0',
 'num_records': '0',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '0.00818634033203125',
 'total_inverted_index_blocks': '0',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': 'nan',
 'bytes_per_record_avg': 'nan',
 'offsets_per_term_avg': 'nan',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_time': '0',
 'indexing': '0',
 'percent_indexed': '1',
 'number_of_uses': 1,
 'cleaning': 0,
 'gc_stats

### Process and load data using Redis
Below we use a Redis pipeline (not a transaction) to send writes to Redis in batches, which significantly improves throughput. The `batch_size` parameter can be tuned and benchmarked on your hardware and with your data. We typically recommend starting small (100-200) and increasing as needed.

In [11]:
# Write each chunk and its embedding to Redis as a HASH
import numpy as np

batch_size = 200
pipe = r.pipeline(transaction=False)
for i, chunk in enumerate(chunks):
    data = {
        'content': chunk.page_content,
        # For HASH -- must convert embeddings to bytes
        'chunk_vector': np.array(chunk_embeddings[i]).astype(np.float32).tobytes()
    }
    pipe.hset(f"{key_prefix}:{i}", mapping=data)
    # execute in "mini batches" once batch_size commands are queued
    if (i + 1) % batch_size == 0:
        res = pipe.execute()

# cleanup final batch execution
res = pipe.execute()
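To see how the mini-batch flushing behaves without a running Redis server, here is a small sketch that swaps the real pipeline for a stand-in that only records how many commands each `execute()` call sends. `RecordingPipeline`, `load_in_batches`, and the `doc:demo` prefix are hypothetical names used only for this illustration.

```python
class RecordingPipeline:
    """Hypothetical stand-in for a Redis pipeline that records flushed batch sizes."""

    def __init__(self):
        self.pending = []
        self.flushed_sizes = []

    def hset(self, key, mapping):
        # Queue the command instead of sending it
        self.pending.append((key, mapping))

    def execute(self):
        # "Send" everything queued so far and remember how many commands went out
        if self.pending:
            self.flushed_sizes.append(len(self.pending))
        self.pending = []


def load_in_batches(pipe, items, batch_size):
    for i, item in enumerate(items):
        pipe.hset(f"doc:demo:{i}", mapping={"content": item})
        # Flush once batch_size commands are queued
        if (i + 1) % batch_size == 0:
            pipe.execute()
    pipe.execute()  # flush whatever is left over


demo_pipe = RecordingPipeline()
load_in_batches(demo_pipe, [f"chunk {n}" for n in range(323)], batch_size=200)
print(demo_pipe.flushed_sizes)
```

Flushing when `i + 1` is a multiple of `batch_size` keeps every full batch at exactly `batch_size` commands, whereas testing `i % batch_size == 0` would also flush a one-item batch at `i = 0`. For 323 items and a batch size of 200, the sketch sends one batch of 200 and a final batch of 123.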

In [12]:
# check the data size in Redis
len(chunks) == r.dbsize()

True
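The HASH fields written above store each embedding as raw bytes, which is what the `astype(np.float32).tobytes()` call produces. As a sketch of that layout using only the standard library, the example below packs floats as 32-bit values (4 bytes per dimension) and unpacks them again. It assumes little-endian byte order, which matches NumPy's native order on common x86 and ARM platforms; the helper names are made up for this example.

```python
import struct


def to_float32_bytes(vector):
    """Pack floats as little-endian 32-bit values, 4 bytes per dimension."""
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(blob):
    """Inverse of to_float32_bytes: recover the list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


toy_vector = [0.5, -1.25, 3.0]   # values chosen to be exactly representable in float32
blob = to_float32_bytes(toy_vector)
print(len(blob))
print(from_float32_bytes(blob))
```

By the same arithmetic, a 384-dimensional FLOAT32 vector like the ones in this index occupies 384 × 4 = 1536 bytes. The byte length must match the `DIM` and `TYPE` declared in the schema.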

In [13]:
# do NOT run this command in production
keys = r.keys()

r.hgetall(keys[0])

{b'content': b'There are provisions within our articles of incorporation and Oregon law intended to protect shareholder interests by providing the Board of Directors a means to attempt to deny coercive takeover attempts or to negotiate with a potential acquirer in order to obtain more favorable terms. Such provisions include a control share acquisition statute, a freeze-out statute, two classes of stock that vote separately on certain issues, and the fact that holders of Class A Common Stock elect three-quarters of the Board of Directors rounded down to the next whole number. However, such provisions could discourage, delay or prevent an unsolicited merger, acquisition or other change in control of the Company that some shareholders might believe to be in their best interests or in which shareholders might receive a premium for their common stock over the prevailing market price. These provisions could also discourage proxy contests for control of the Company.\n\nWe may fail to meet ma

### Query the database
Now we can use the Redis search index to perform similarity search. This query takes a user input, converts it to an embedding, and fetches the 2 most semantically similar chunks from Redis.

In [14]:
# Grab user input
_input = "Nike profit margins and company performance"

query = (
    Query("*=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": hf.embed(_input, as_buffer=True)
}

r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:150', 'payload': None, 'score': '0.354782044888', 'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -43 %\n\n5,343 25\
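Conceptually, a `KNN 2` query against a FLAT index compares the query vector with every stored vector and keeps the two closest. The sketch below imitates that with a brute-force scan in plain Python. It illustrates the idea only and is not the Redis implementation; `knn`, `cosine_distance`, and the toy 2-dimensional documents are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def knn(query_vec, docs, k):
    """Score every document against the query and return the k closest as (id, distance)."""
    scored = sorted((cosine_distance(query_vec, vec), doc_id) for doc_id, vec in docs.items())
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


toy_docs = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.8, 0.6],
    "doc:c": [0.0, 1.0],
    "doc:d": [-1.0, 0.0],
}
top2 = knn([1.0, 0.1], toy_docs, k=2)
print(top2)
```

Sorting ascending by distance mirrors `.sort_by("score")` in the query above, since a smaller cosine distance means a closer match.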

### Range Queries
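Range queries let you set a predefined distance threshold (a radius) and return every document whose vector falls within that distance of the query vector, rather than a fixed number of results.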

In [15]:
query = (
    Query("@chunk_vector:[VECTOR_RANGE $radius $vec]=>{$YIELD_DISTANCE_AS: score}")
     .sort_by("score")
     .return_fields("content", "score")
     .dialect(2)
)

# Find all vectors within a distance of 0.8 from the query vector
query_params = {
    "radius": 0.8,
    "vec": hf.embed(_input, as_buffer=True)
}
r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:150', 'payload': None, 'score': '0.354782044888', 'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -43 %\n\n5,343 25\
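The difference from KNN is that the number of results is not fixed: everything inside the radius comes back. A brute-force sketch of that idea in plain Python follows. The helper names and toy vectors are hypothetical, and how the exact boundary is treated here (`<=`) is an assumption of the sketch.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def range_search(query_vec, docs, radius):
    """Return (id, distance) for every document within radius, closest first."""
    hits = [(doc_id, cosine_distance(query_vec, vec)) for doc_id, vec in docs.items()]
    hits = [hit for hit in hits if hit[1] <= radius]
    return sorted(hits, key=lambda hit: hit[1])


toy_docs = {
    "doc:a": [1.0, 0.0],
    "doc:b": [0.8, 0.6],
    "doc:c": [0.0, 1.0],
    "doc:d": [-1.0, 0.0],
}
within_08 = range_search([1.0, 0.1], toy_docs, radius=0.8)
within_10 = range_search([1.0, 0.1], toy_docs, radius=1.0)
print([doc_id for doc_id, _ in within_08])
print([doc_id for doc_id, _ in within_10])
```

Widening the radius from 0.8 to 1.0 admits one more toy document, which shows why the threshold usually needs tuning against your own data and embedding model.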

### Hybrid Queries
Hybrid queries contain both traditional filters (numeric, tags, text) and VSS in one single Redis command.

In [16]:
query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": hf.embed(_input, as_buffer=True)
}
r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:145', 'payload': None, 'score': '0.362203001976', 'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales were $12.6 billion for fiscal 202
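One way to picture the hybrid query is as two steps: first keep only documents that pass the text filter, then rank the survivors by vector distance. The sketch below does exactly that in plain Python. It is a conceptual simplification: Redis full-text matching involves tokenization and stemming, and Redis may interleave filtering and vector search differently. All names and toy documents here are hypothetical.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def hybrid_search(query_vec, docs, term, k):
    """Filter by a whole-word text match, then return the k nearest survivors."""
    candidates = {
        doc_id: doc for doc_id, doc in docs.items()
        if term.lower() in doc["content"].lower().split()
    }
    scored = sorted(
        (cosine_distance(query_vec, doc["vector"]), doc_id)
        for doc_id, doc in candidates.items()
    )
    return [(doc_id, dist) for dist, doc_id in scored[:k]]


toy_docs = {
    "doc:a": {"content": "gross profit rose", "vector": [1.0, 0.0]},
    "doc:b": {"content": "store openings in asia", "vector": [0.9, 0.1]},
    "doc:c": {"content": "profit margins narrowed", "vector": [0.6, 0.8]},
    "doc:d": {"content": "operating profit fell", "vector": [0.0, 1.0]},
}
hybrid_top2 = hybrid_search([1.0, 0.0], toy_docs, term="profit", k=2)
print(hybrid_top2)
```

Note that `doc:b` is the second-closest vector overall, but it never appears in the results because it fails the text filter. This is why the hybrid query above returns different chunks from the plain KNN query.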

In [17]:
r.ft(index_name).dropindex(True)

b'OK'

### What about JSON Support?

Redis also allows you to store data in JSON objects. The JSON fields can contain metadata and vectors. Below is a simple example of indexing JSON data.

**For now** -- JSON support is only enabled in the base redis-py library. It is coming soon to LangChain and RedisVL.

In [18]:
index_name = "redispy:json"
key_prefix = "doc:redispy:json"

# schema
schema = (
    TextField("$.content",                     # Text Field Name (JSON path)
        as_name="content"                      # Text Field Alias -- required for JSON
    ),
    VectorField("$.chunk_vector",              # Vector Field Name (JSON path)
        "FLAT", {                              # Vector Index Type: FLAT or HNSW
            "TYPE": "FLOAT32",
            "DIM": 384,                        # Number of Vector Dimensions
            "DISTANCE_METRIC": "COSINE",       # Vector Search Distance Metric
        },
        as_name="chunk_vector"                 # Vector Field Alias -- required for JSON
    ),
)

# index Definition
definition = IndexDefinition(prefix=[key_prefix], index_type=IndexType.JSON) # select JSON here

# create Index
r.ft(index_name).create_index(fields=schema, definition=definition)

b'OK'

In [19]:
r.ft(index_name).info()

{'index_name': 'redispy:json',
 'index_options': [],
 'index_definition': [b'key_type',
  b'JSON',
  b'prefixes',
  [b'doc:redispy:json'],
  b'default_score',
  b'1'],
 'attributes': [[b'identifier',
   b'$.content',
   b'attribute',
   b'content',
   b'type',
   b'TEXT',
   b'WEIGHT',
   b'1'],
  [b'identifier',
   b'$.chunk_vector',
   b'attribute',
   b'chunk_vector',
   b'type',
   b'VECTOR']],
 'num_docs': '0',
 'max_doc_id': '0',
 'num_terms': '0',
 'num_records': '0',
 'inverted_sz_mb': '0',
 'vector_index_sz_mb': '0.00818634033203125',
 'total_inverted_index_blocks': '0',
 'offset_vectors_sz_mb': '0',
 'doc_table_size_mb': '0',
 'sortable_values_size_mb': '0',
 'key_table_size_mb': '0',
 'geoshapes_sz_mb': '0',
 'records_per_doc_avg': 'nan',
 'bytes_per_record_avg': 'nan',
 'offsets_per_term_avg': 'nan',
 'offset_bits_per_record_avg': 'nan',
 'hash_indexing_failures': '0',
 'total_indexing_time': '0',
 'indexing': '0',
 'percent_indexed': '1',
 'number_of_uses': 1,
 'cleaning':

In [20]:
# Write JSON data to the index

batch_size = 200
pipe = r.pipeline(transaction=False)

for i, chunk in enumerate(chunks):
    redis_key = f"{key_prefix}:{i}"
    data = {
        'content': chunk.page_content,
        'chunk_vector': chunk_embeddings[i] # notice that we don't need to convert JSON embeddings to bytes
    }
    pipe.json().set(redis_key, "$", data)
    # mini batch: flush once batch_size commands are queued
    if (i + 1) % batch_size == 0:
        res = pipe.execute()

res = pipe.execute() # make sure to use mini batches if working with larger datasets
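The reason the JSON path skips the byte conversion is that a JSON document can hold the vector as an ordinary array of numbers, while raw bytes are not a JSON type. The short sketch below shows this with the standard `json` module. The toy embedding and variable names are made up for illustration, and it does not talk to Redis.

```python
import json
import struct

toy_embedding = [0.5, -1.25, 3.0]

# A list of floats serializes cleanly as a JSON array
json_doc = {"content": "example chunk", "chunk_vector": toy_embedding}
encoded = json.dumps(json_doc)
print(encoded)

# The packed float32 bytes used for HASH fields are not JSON-serializable
packed = struct.pack("<3f", *toy_embedding)
try:
    json.dumps({"chunk_vector": packed})
    bytes_ok = True
except TypeError:
    bytes_ok = False
print(bytes_ok)
```

The query side is unchanged: the query vector is still passed as a byte buffer (`as_buffer=True`) for both HASH and JSON indexes in this notebook.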

In [21]:
# Fetch the JSON doc
r.json().get(f"{key_prefix}:0", "$")

[{'content': 'Indicate by check mark:YESNO•if the registrant is a well-known seasoned issuer, as defined in Rule 405 of the Securities Act.þ¨•if the registrant is not required to file reports pursuant to Section 13 or Section 15(d) of the Act.¨þ•whether the registrant (1) has filed all reports required to be filed by Section 13 or 15(d) of the Securities Exchange Act of 1934 during the preceding12 months (or for such shorter period that the registrant was required to file such reports), and (2) has been subject to such filing requirements for thepast 90 days.þ¨•whether the registrant has submitted electronically every Interactive Data File required to be submitted pursuant to Rule 405 of Regulation S-T(§232.405 of this chapter) during the preceding 12 months (or for such shorter period that the registrant was required to submit such files).þ¨•whether the registrant is a large accelerated filer, an accelerated filer, a non-accelerated filer, a smaller reporting company or an emerging gr

In [22]:
# And now you can perform the same kinds of queries

query = (
    Query("@content:profit=>[KNN 2 @chunk_vector $vec as score]")
     .sort_by("score")
     .return_fields("content", "score")
     .paging(0, 2)
     .dialect(2)
)

query_params = {
    "vec": hf.embed(_input, as_buffer=True)
}
r.ft(index_name).search(query, query_params).docs

[Document {'id': 'doc:redispy:json:145', 'payload': None, 'score': '0.362203001976', 'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales were $12.6 billion for fisca

## Cleanup
Clean up the index and data.

In [23]:
r.ft(index_name).dropindex(True)

b'OK'

## What's Next?

Now that you have the basics down with the baseline Redis Python client, there are two notebooks to try next.

**RedisVL_VSS** is probably the next one to try. RedisVL makes it incredibly easy to get started with Redis as a vector database.
<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/RedisVL_VSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**LangChain_VSS** is next after that -- which will make it easy to integrate Redis with an LLM.
<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/LangChain_VSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>