![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

# Vector Search from scratch with the Redis Vector Library


Now that you have a good foundation in Redis data structures and basic search capabilities, this notebook builds on that path. It introduces [redisvl](https://redisvl.com), a dedicated Python client library that streamlines GenAI application development.

We will go through the same initial setup and data prep stage, then dive into building an **end-to-end RAG system from scratch**, including a few special topics/techniques:
- Dense content representation
- Query rewriting / expansion
- Semantic caching
- Conversational memory persistence




In [48]:
import os
import warnings
warnings.filterwarnings("ignore")
dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["ROOT_DIR"] = parent_directory
print(dir_path)
print(parent_directory)

/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/getting_started
/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss


### Install Python Dependencies

In [49]:
%pip install -r $ROOT_DIR/requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m



### Define the Redis Connection URL

By default, this notebook connects to a local instance of Redis Stack. **If you have your own Redis Enterprise or Redis Cloud instance**, replace the `REDIS_PASSWORD`, `REDIS_HOST`, and `REDIS_PORT` values with your own.

In [51]:
import os

# Replace the values below with your own if using a Redis Enterprise or Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
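
The comment above mentions switching to the `rediss://` prefix when SSL is enabled. A minimal sketch of how that choice could be made in code follows. The helper name `build_redis_url` and its `use_ssl` flag are hypothetical and not part of redis-py or RedisVL; percent-encoding the password is an added precaution for passwords that contain characters such as `@`.

```python
from urllib.parse import quote


def build_redis_url(host, port, password="", use_ssl=False):
    """Assemble a Redis connection URL (hypothetical helper, not a library API)."""
    # rediss:// (double "s") selects TLS; redis:// is a plain TCP connection
    scheme = "rediss" if use_ssl else "redis"
    # Percent-encode the password so characters like "@" or ":" do not break the URL
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url("localhost", "6379"))
print(build_redis_url("localhost", "6379", password="p@ss", use_ssl=True))
```

With no password the credentials part is omitted entirely, which keeps the URL readable for a local Redis Stack instance.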

## Simplified Vector Search with RedisVL

### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load a single
financial document (a 10-K filing) and preprocess it using some helpers from LangChain:

- `UnstructuredFileLoader` extracts the text from the PDF; it is one of several document loader types that LangChain provides. Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
- `RecursiveCharacterTextSplitter` is what we use to create smaller chunks of text from the doc. Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

In [52]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs from a folder
data_path = f"{parent_directory}/resources/10K"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

Listing available documents ... ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/nke-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/mu-10K-2019.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/amzn-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/jnj-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/amzn-10K-2019.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/aapl-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/aapl-10K-2019.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/nvd-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/msft-10k-2023.pdf']


In [53]:
# pick out the Nike doc for this exercise
doc = [doc for doc in docs if "nke" in doc][0]

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)
loader = UnstructuredFileLoader(
    doc, mode="single", strategy="fast"
)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Done preprocessing. Created 180 chunks of the original pdf /Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/nke-10k-2023.pdf
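
To make the `chunk_size` and `chunk_overlap` parameters concrete, here is a deliberately simplified sketch of fixed-size character chunking. It is not LangChain's algorithm: `RecursiveCharacterTextSplitter` tries to split on natural separators first, whereas the hypothetical `split_fixed` helper below cuts at fixed character offsets.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Split text into windows of at most chunk_size characters (simplified sketch)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    # Each window starts (chunk_size - chunk_overlap) characters after the previous one
    step = chunk_size - chunk_overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


sample = "abcdefghij"
print(split_fixed(sample, chunk_size=4))                   # no overlap
print(split_fixed(sample, chunk_size=4, chunk_overlap=2))  # neighbours share 2 characters
```

With `chunk_overlap=0`, as in the cell above, every character belongs to exactly one chunk; a non-zero overlap repeats the tail of each chunk at the head of the next.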


### Text embedding generation with RedisVL
RedisVL has built-in extensions and utilities to aid the GenAI development process.

#### SentenceTransformer Models Cache folder
This demo uses `SentenceTransformer` models, and here we specify their cache folder. If you have already downloaded the models to the local file system, point this folder at them; otherwise the library downloads them into this folder.

In particular, the following model will be downloaded if it is not present in the cache folder:

models/models--sentence-transformers--all-MiniLM-L6-v2

In [54]:
# Set the local cache folder for the downloaded SentenceTransformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"

In [55]:
from redisvl.utils.vectorize import HFTextVectorizer

hf = HFTextVectorizer(
    "sentence-transformers/all-MiniLM-L6-v2",
    cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"),
)


# Embed each chunk content
embeddings = hf.embed_many([chunk.page_content for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(embeddings) == len(chunks)

True

In [56]:
print(f"{len(chunks)} chunks embedded - with embeddings of {len(embeddings[0])} size")

180 chunks embedded - with embeddings of 384 size


### Define a schema and create an index

Below we connect to Redis and create an index that contains a text field, tag field, and vector field.

In [57]:
from redis import Redis
from redisvl.schema import IndexSchema
from redisvl.index import SearchIndex


index_name = "redisvl"


schema = IndexSchema.from_dict({
  "index": {
    "name": index_name,
    "prefix": "chunk"
  },
  "fields": [
    {
        "name": "doc_id",
        "type": "tag",
        "attrs": {
            "sortable": True
        }
    },
    {
        "name": "content",
        "type": "text"
    },
    {
        "name": "text_embedding",
        "type": "vector",
        "attrs": {
            "dims": hf.dims,
            "distance_metric": "cosine",
            "algorithm": "hnsw",
            "datatype": "float32"
        }
    }
  ]
})

In [58]:
# connect to redis
client = Redis.from_url(REDIS_URL)

# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

In [59]:
# use the RedisVL CLI tool to list all indices
!rvl index listall

20:43:41 [RedisVL] INFO   Indices:
20:43:41 [RedisVL] INFO   1. langchain
20:43:41 [RedisVL] INFO   2. redisvl


In [60]:
# get info about the index
!rvl index info -i redisvl



Index Information:
╭──────────────┬────────────────┬────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes   │ Index Options   │   Indexing │
├──────────────┼────────────────┼────────────┼─────────────────┼────────────┤
│ redisvl      │ HASH           │ ['chunk']  │ []              │          0 │
╰──────────────┴────────────────┴────────────┴─────────────────┴────────────╯
Index Fields:
╭────────────────┬────────────────┬────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name           │ Attribute      │ Type   │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │ Field Option   │   Option Value │ Field Option    │   Option Value │
├────────────────┼────────────────┼────────┼────────────────┼─

### Process and load dataset
Below we use the RedisVL index to load the list of document chunks into Redis.

In [61]:
# load expects an iterable of dictionaries
from redisvl.redis.utils import array_to_buffer

data = [
    {
        'doc_id': f'{i}',
        'content': chunk.page_content,
        # For HASH -- must convert embeddings to bytes
        'text_embedding': array_to_buffer(embeddings[i])
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = index.load(data, id_field="doc_id")
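
The "must convert embeddings to bytes" comment above can be illustrated with the standard library alone. The sketch below assumes the stored format is a flat sequence of 4-byte floats, matching the `float32` datatype in the schema; `to_float32_bytes` and `from_float32_bytes` are hypothetical stand-ins, not the RedisVL implementation, and the little-endian byte order is an assumption.

```python
import struct


def to_float32_bytes(vector):
    """Pack a list of floats into float32 bytes (hypothetical stand-in for array_to_buffer)."""
    # "<" = little-endian (assumed), "f" = 4-byte float, repeated once per element
    return struct.pack(f"<{len(vector)}f", *vector)


def from_float32_bytes(buffer):
    """Inverse of to_float32_bytes: recover the list of floats."""
    count = len(buffer) // 4
    return list(struct.unpack(f"<{count}f", buffer))


vector = [0.5, -1.25, 2.0]
buffer = to_float32_bytes(vector)
print(len(buffer))                 # 4 bytes per dimension
print(from_float32_bytes(buffer))  # round-trips for values exactly representable in float32
```

By the same arithmetic, each 384-dimensional embedding in this notebook occupies 384 × 4 = 1536 bytes in its hash field.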

### Query the database
Now we can use the RedisVL index to perform similarity search operations with Redis.

In [62]:
from redisvl.query import VectorQuery

query = "Nike profit margins and company performance"

query_embedding = hf.embed(query)

vector_query = VectorQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=3,
    return_fields=["doc_id", "content"],
    return_score=True
)

# show the raw redis query
str(vector_query)

'*=>[KNN 3 @text_embedding $vector AS vector_distance] RETURN 3 doc_id content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 3'
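
The raw query asks for the 3 nearest neighbours of the query vector, sorted by ascending `vector_distance`. As a rough mental model, the sketch below does the same thing by brute force in plain Python, using cosine distance defined as 1 minus cosine similarity. The function names and toy vectors are illustrative assumptions; the index in this notebook uses HNSW, which is an approximate method, so this exact scan only shows what the ranking means.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def knn(query, docs, k):
    """Exact k-nearest-neighbour scan: score every vector, sort ascending, keep k."""
    scored = [(doc_id, cosine_distance(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-dimensional vectors, invented for illustration
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
print(knn([1.0, 0.0], toy_docs, k=2))
```

Smaller distances indicate closer matches, which is why the query sorts in ascending order.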

In [63]:
# execute the query with RedisVL
index.query(vector_query)

[{'id': 'chunk:85',
  'vector_distance': '0.321347296238',
  'doc_id': '85',
  'content': 'TOTAL NIKE BRAND Converse\n\n$\n\n1,932 (4,841)\n\n8,359 676\n\n$\n\n1,896 (4,262)\n\n8,406 669\n\n2 % -14 %\n\n1 % $ 1 %\n\n1,530 (3,656)\n\n8,641 543\n\nCorporate TOTAL NIKE, INC. EARNINGS BEFORE INTEREST AND TAXES\n\n(1)\n\n$\n\n(2,840)\n\n6,195\n\n$\n\n(2,219)\n\n6,856\n\n28 %\n\n10 % $\n\n(2,261)\n\n6,923\n\nEBIT margin\n\n(1)\n\n12.1 %\n\n14.7 %\n\n15.5 %\n\nInterest expense (income), net\n\n(6)\n\n205\n\n—\n\n262\n\nTOTAL NIKE, INC. INCOME BEFORE INCOME TAXES\n\n$\n\n6,201\n\n$\n\n6,651\n\n7 % $\n\n6,661\n\n(1) Total NIKE Brand EBIT, Total NIKE, Inc. EBIT and EBIT Margin represent non-GAAP financial measures. See "Use of Non-GAAP Financial Measures" for further information.\n\n2023 FORM 10-K 36\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES\n\n7 % 12 % -13 %\n\n16 % 302 %\n\n6 % 7 %\n\n— 6 %\n\n% CHANGE\n\n0 % 35 % -27 %\n\n24 % -17 %\n\n3 % 23 % 2 %\n\n1 %\n\n—\n\n0 %\n\nTable of Contents\n\n

In [64]:
# paginate through results
for result in index.paginate(vector_query, page_size=1):
    print(result[0]["doc_id"], result[0]["vector_distance"], flush=True)

85 0.321347296238
84 0.328033983707
118 0.358749687672
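
Pagination hands back the ranked results one page at a time. The generator below is a minimal in-memory sketch of that idea, using the three results printed above. The `paginate` function here is a hypothetical illustration; a real index would typically fetch each page from the server instead of slicing a list.

```python
def paginate(results, page_size):
    """Yield consecutive slices of at most page_size items (in-memory sketch)."""
    for start in range(0, len(results), page_size):
        yield results[start:start + page_size]


# (doc_id, vector_distance) pairs copied from the output above
ranked = [("85", "0.321347296238"), ("84", "0.328033983707"), ("118", "0.358749687672")]

for page in paginate(ranked, page_size=1):
    doc_id, distance = page[0]
    print(doc_id, distance)
```

A larger `page_size` reduces the number of pages at the cost of bigger batches; the final page may be shorter than the rest.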


### Sort by alternative fields

In [65]:
# Sort by doc_id field after vector search limits to topK
vector_query = VectorQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=4,
    return_fields=["doc_id"],
    return_score=True
)

# Decompose vector_query into the core query and the params
query = vector_query.query
params = vector_query.params

# Pass query and params directly to index.search()
result = index.search(
    query.sort_by("doc_id", asc=True),
    params
)

[doc.__dict__ for doc in result.docs]

[{'id': 'chunk:118',
  'payload': None,
  'vector_distance': '0.358749687672',
  'doc_id': '118'},
 {'id': 'chunk:158',
  'payload': None,
  'vector_distance': '0.360825479031',
  'doc_id': '158'},
 {'id': 'chunk:84',
  'payload': None,
  'vector_distance': '0.328033983707',
  'doc_id': '84'},
 {'id': 'chunk:85',
  'payload': None,
  'vector_distance': '0.321347296238',
  'doc_id': '85'}]
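
Note the order of the results above: `118`, `158`, `84`, `85`. This is consistent with `doc_id` being compared as a string, since it is declared as a `tag` field, and not as a number. The short sketch below reproduces that ordering in plain Python and contrasts it with a numeric sort.

```python
# doc_id values returned by the query above
doc_ids = ["85", "84", "118", "158"]

# String comparison goes character by character, so "118" sorts before "84"
as_strings = sorted(doc_ids)

# Converting to int first gives the numeric order most readers expect
as_numbers = sorted(doc_ids, key=int)

print(as_strings)
print(as_numbers)
```

If numeric ordering matters, one option (an assumption, not something this notebook does) would be to store the identifier in a numeric field instead of a tag.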

### Add filters to vector queries

In [66]:
from redisvl.query.filter import Text

vector_query = VectorQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=4,
    return_fields=["content"],
    return_score=True
)

# Set a text filter
text_filter = Text("content") % "profit"

vector_query.set_filter(text_filter)

index.query(vector_query)

[{'id': 'chunk:81',
  'vector_distance': '0.363789796829',
  'content': 'NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling price ("ASP") per pair contributed approximately 7 percentage points of footwear revenue growth. Higher ASP was primarily due to higher full-price ASP, net of discounts, on a wholesale equivalent basis, and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP.\n\nNIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily d

### Range queries in RedisVL

In [67]:
from redisvl.query import RangeQuery

range_query = RangeQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=4,
    return_fields=["content"],
    return_score=True,
    distance_threshold=0.8  # find all items with a semantic distance of less than 0.8
)

In [68]:
index.query(range_query)

[{'id': 'chunk:85',
  'vector_distance': '0.321347296238',
  'content': 'TOTAL NIKE BRAND Converse\n\n$\n\n1,932 (4,841)\n\n8,359 676\n\n$\n\n1,896 (4,262)\n\n8,406 669\n\n2 % -14 %\n\n1 % $ 1 %\n\n1,530 (3,656)\n\n8,641 543\n\nCorporate TOTAL NIKE, INC. EARNINGS BEFORE INTEREST AND TAXES\n\n(1)\n\n$\n\n(2,840)\n\n6,195\n\n$\n\n(2,219)\n\n6,856\n\n28 %\n\n10 % $\n\n(2,261)\n\n6,923\n\nEBIT margin\n\n(1)\n\n12.1 %\n\n14.7 %\n\n15.5 %\n\nInterest expense (income), net\n\n(6)\n\n205\n\n—\n\n262\n\nTOTAL NIKE, INC. INCOME BEFORE INCOME TAXES\n\n$\n\n6,201\n\n$\n\n6,651\n\n7 % $\n\n6,661\n\n(1) Total NIKE Brand EBIT, Total NIKE, Inc. EBIT and EBIT Margin represent non-GAAP financial measures. See "Use of Non-GAAP Financial Measures" for further information.\n\n2023 FORM 10-K 36\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES\n\n7 % 12 % -13 %\n\n16 % 302 %\n\n6 % 7 %\n\n— 6 %\n\n% CHANGE\n\n0 % 35 % -27 %\n\n24 % -17 %\n\n3 % 23 % 2 %\n\n1 %\n\n—\n\n0 %\n\nTable of Contents\n\nNORTH AMERICA\n\n(

In [69]:
# Add filter to range query
range_query.set_filter(text_filter)

index.query(range_query)

[{'id': 'chunk:81',
  'vector_distance': '0.363789796829',
  'content': 'NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling price ("ASP") per pair contributed approximately 7 percentage points of footwear revenue growth. Higher ASP was primarily due to higher full-price ASP, net of discounts, on a wholesale equivalent basis, and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP.\n\nNIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily d
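
A range query keeps only results within `distance_threshold`, then applies any filter and the `num_results` cap. The sketch below models that behaviour in plain Python on invented toy records. The `range_search` function, the inclusive boundary check, and the substring match are all simplifying assumptions; Redis full-text filters work on tokenized terms, not raw substrings.

```python
def range_search(scored_docs, distance_threshold, num_results, keep=None):
    """Keep docs within the threshold (and passing the optional filter), closest first."""
    hits = [
        doc for doc in scored_docs
        # Inclusive boundary is an assumption of this sketch
        if doc["vector_distance"] <= distance_threshold and (keep is None or keep(doc))
    ]
    hits.sort(key=lambda doc: doc["vector_distance"])
    return hits[:num_results]


# Toy records, invented for illustration (not taken from the filing)
toy = [
    {"id": "x", "vector_distance": 0.32, "content": "EBIT margin table"},
    {"id": "y", "vector_distance": 0.36, "content": "gross profit increased"},
    {"id": "z", "vector_distance": 0.95, "content": "profit outlook"},
]

print([d["id"] for d in range_search(toy, 0.8, 4)])
print([d["id"] for d in range_search(toy, 0.8, 4, keep=lambda d: "profit" in d["content"])])
```

Unlike a KNN query, a range query can return fewer than `num_results` items, or none at all, when few vectors fall inside the threshold.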

## Cleanup

Clean up the database.

In [70]:
!rvl index delete -i redisvl

20:43:41 [RedisVL] INFO   Index deleted successfully
