<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/redisvl-02.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Search with RedisVL

![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [redisvl](https://redisvl.com), a dedicated Python client library for using Redis as a vector database, to index documents together with their embeddings and to run semantic search.

## Setup and Data Prep

### Pull GitHub Materials
We need to clone the supporting materials from GitHub.

In [1]:
# This clones your git repository into a directory named 'temp_repo'.
!git clone https://github.com/Redislabs-Solution-Architects/financial-vss.git temp_repo

# These commands move the 'resources' directory and 'requirements.txt' from 'temp_repo' to your current directory.
!mv temp_repo/resources .
!mv temp_repo/requirements.txt .

# This deletes the 'temp_repo' directory, cleaning up the unwanted files.
!rm -rf temp_repo


Cloning into 'temp_repo'...
remote: Enumerating objects: 96, done.
remote: Counting objects: 100% (96/96), done.
remote: Compressing objects: 100% (64/64), done.
remote: Total 96 (delta 43), reused 75 (delta 27), pack-reused 0
Receiving objects: 100% (96/96), 7.06 MiB | 6.55 MiB/s, done.
Resolving deltas: 100% (43/43), done.
mv: cannot move 'temp_repo/resources' to './resources': Directory not empty


### Install Python Dependencies

In [2]:
!pip install -q -r requirements.txt

### Preprocess PDF Doc(s)

Now we will load a single financial document (a 10-K filing) and preprocess it using LangChain helpers.

In [3]:
import os

# Load list of pdfs
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

Listing available documents ... ['resources/aapl-10k-2023.pdf', 'resources/nke-10k-2023.pdf', 'resources/jnj-10k-2023.pdf', 'resources/msft-10k-2023.pdf', 'resources/nvd-10k-2023.pdf', 'resources/amzn-10k-2023.pdf']
Done preprocessing. Created 323 chunks of the original pdf resources/nke-10k-2023.pdf


In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# For simplicity, we work with just one of the 10-K files. This will still take some time.
# Note: UnstructuredFileLoader is only one of several document loaders that LangChain provides.
# Note: RecursiveCharacterTextSplitter splits the document into smaller chunks of text.
# Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
# Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
doc = [doc for doc in docs if "nke" in doc][0]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

In [4]:
# Take a look at one item
print(chunks[2])

page_content="NIKE, Inc.(Exact name of Registrant as specified in its charter)Oregon93-0584541(State or other jurisdiction of incorporation)(IRS Employer Identification No.)One Bowerman Drive, Beaverton, Oregon 97005-6453(Address of principal executive offices and zip code)(503) 671-6453(Registrant's telephone number, including area code)SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:Class B Common StockNKENew York Stock Exchange(Title of each class)(Trading symbol)(Name of each exchange on which registered)SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:NONE\n\nAs of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:Class A$7,831,564,572 Class B136,467,702,472 $144,299,267,044\n\nTable of ContentsUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934FOR THE FISCAL YEAR ENDED MAY 31, 2023
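
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of overlapping fixed-width chunking. It is a deliberate simplification and not LangChain's algorithm: `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and sentences. The helper name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[dict]:
    """Split text into fixed-width chunks that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({"start_index": start, "text": text[start:start + chunk_size]})
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for piece in split_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(piece["start_index"], piece["text"])
```

Each chunk repeats the last three characters of the previous one, so text that straddles a boundary still appears whole in at least one chunk as long as it is no longer than the overlap. The `start_index` key plays the same role as the metadata that `add_start_index=True` requests.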

### Create document chunk embeddings

In [5]:
from redisvl.utils.vectorize import HFTextVectorizer

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")

# Embed each page_content from the document chunks
chunk_embeddings = hf.embed_many([chunk.page_content for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(chunk_embeddings) == len(chunks)

True

### Run Localized Redis Stack

If you don't have a remote Redis instance, use an in-notebook version of [Redis Stack](https://redis.io/docs/getting-started/install-stack/). Or you can provision your own free instance of [Redis Cloud](https://redis.com/try-free/).


Use the code below to download and run a local copy of Redis Stack here in the notebook.

In [6]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


gpg: cannot open '/dev/tty': No such device or address
curl: (23) Failed writing body


### Connect to Redis

By default, this notebook connects to the local instance of Redis Stack. If you have your own Redis Cloud instance, replace the REDIS_PASSWORD, REDIS_HOST, and REDIS_PORT values with your own.

In [7]:
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # "1TNxTEdYRDgIDKM2gDfasupCADXXXX"


#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
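
The URL above always contains the `:password@` segment, even when the password is empty. As a sketch of an alternative, the hypothetical helper `build_redis_url` below omits the credentials when no password is set and percent-encodes the password otherwise, so that reserved characters such as `@` or `/` do not break URL parsing. The host, port, and password values in the example calls are placeholders.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "") -> str:
    """Build a redis:// URL, adding credentials only when a password is given."""
    if password:
        # safe='' forces every reserved character in the password to be encoded
        return f"redis://:{quote(password, safe='')}@{host}:{port}"
    return f"redis://{host}:{port}"


print(build_redis_url("localhost", "6379"))
print(build_redis_url("example.com", "18374", "p@ss/word"))
```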


## Getting Started with RedisVL

### Create an index from schema
Below we connect to Redis and create an index for vector search that contains a tag field, a text field, and a vector field.

In [None]:
from redis import Redis
from redisvl.schema import IndexSchema
from redisvl.index import SearchIndex

index_name = "redisvl"

schema = IndexSchema.from_dict({
  "index": {
    "name": index_name,
    "prefix": "chunk"
  },
  "fields": [
    {"name": "label", "type": "tag", "attrs": {"sortable": True}},
    {"name": "content", "type": "text"},
    {
      "name": "chunk_vector",
      "type": "vector",
      "attrs": {
        "dims": 384,
        "distance_metric": "cosine",
        "algorithm": "hnsw",
        "datatype": "float32"
      }
    }
  ]
})

# connect to redis
client = Redis.from_url(REDIS_URL)

# create an index
index = SearchIndex(schema, client)
index.create(overwrite=True)

In [18]:
# use the CLI to see the created index
!rvl index listall

22:19:42 [RedisVL] INFO   Indices:
22:19:42 [RedisVL] INFO   1. redisvl


In [19]:
!rvl index info -i redisvl



Index Information:
╭──────────────┬────────────────┬─────────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes        │ Index Options   │   Indexing │
├──────────────┼────────────────┼─────────────────┼─────────────────┼────────────┤
│ redisvl      │ HASH           │ ['doc:redisvl'] │ []              │          0 │
╰──────────────┴────────────────┴─────────────────┴─────────────────┴────────────╯
Index Fields:
╭──────────────┬──────────────┬────────┬────────────────┬────────────────╮
│ Name         │ Attribute    │ Type   │ Field Option   │   Option Value │
├──────────────┼──────────────┼────────┼────────────────┼────────────────┤
│ label        │ label        │ TEXT   │ WEIGHT         │              1 │
│ content      │ content      │ TEXT   │ WEIGHT         │              1 │
│ chunk_vector │ chunk_vector │ VECTOR │                │                │
╰──────────────┴──────────────┴────────┴────────────────┴────────────────╯


### Process and load data using RedisVL
Below we use the RedisVL index to load the list of document chunks into Redis.

In [20]:
# load expects an iterable of dictionaries
from redisvl.redis.utils import array_to_buffer

data = [
    {
        'label': f'ID-{i}',
        'content': chunk.page_content,
        # For HASH -- must convert embeddings to bytes
        'chunk_vector': array_to_buffer(chunk_embeddings[i])
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = index.load(data)
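
Because the index stores hashes, each vector is written as raw bytes in the `float32` layout declared in the schema, which is the job `array_to_buffer` does above. The standard-library sketch below illustrates that layout only; it is not RedisVL's implementation. The helper names are hypothetical and little-endian byte order is an assumption.

```python
import struct


def floats_to_float32_bytes(vector: list[float]) -> bytes:
    """Pack floats as consecutive little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(vector)}f", *vector)


def float32_bytes_to_floats(buffer: bytes) -> list[float]:
    """Inverse of floats_to_float32_bytes."""
    count = len(buffer) // 4
    return list(struct.unpack(f"<{count}f", buffer))


embedding = [0.25, -0.5, 1.0]  # toy 3-dimensional vector
blob = floats_to_float32_bytes(embedding)
print(len(blob))                      # 4 bytes per dimension
print(float32_bytes_to_floats(blob))
```

By the same arithmetic, a 384-dimensional `float32` embedding occupies 384 × 4 = 1536 bytes per document.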

### Query the database
Now we can use the RedisVL index to perform similarity search operations with Redis.

In [23]:
from redisvl.query import VectorQuery

query = "Nike profit margins and company performance"

vector_query = VectorQuery(
    vector=hf.embed(query),
    vector_field_name="chunk_vector",
    num_results=4,
    return_fields=["label", "content"],
    return_score=True
)

# show the raw redis query
str(vector_query)

'*=>[KNN 4 @chunk_vector $vector AS vector_distance] RETURN 3 label content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 4'

In [24]:
# execute the query with RedisVL
index.query(vector_query)

[{'id': 'doc:redisvl:faf74f986a86418fb1be5d46dd5d3707',
  'vector_distance': '0.354781925678',
  'label': 'ID-150',
  'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n
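
The `vector_distance` values in these results are cosine distances, because the schema sets `distance_metric` to `cosine`. Cosine distance is 1 minus cosine similarity, so 0 means the two vectors point in the same direction and larger values mean less similar text. A minimal pure-Python sketch of the calculation, using toy two-dimensional vectors rather than real embeddings:

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```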

In [None]:
# paginate through results; each iteration yields one page (a list of results)
for batch in index.paginate(vector_query, page_size=1):
    for result in batch:
        print(result["label"], result["vector_distance"], flush=True)

### Sort by alternative fields

In [27]:
# Sort by field other than vector score
result = index.search(vector_query.query.sort_by("label", asc=True), vector_query.params)

[doc.__dict__ for doc in result.docs]

[{'id': 'doc:redisvl:9f2f26f97d674466bf32c058973adc96',
  'payload': None,
  'vector_distance': '0.362202882767',
  'label': 'ID-145',
  'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". N

### Add filters to vector queries

In [28]:
from redisvl.query.filter import Text

# set the text filter
text_filter = Text("content") % "profit"

vector_query.set_filter(text_filter)

index.query(vector_query)

[{'id': 'doc:redisvl:9f2f26f97d674466bf32c058973adc96',
  'vector_distance': '0.362202882767',
  'label': 'ID-145',
  'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital s

In [29]:
from redisvl.query import FilterQuery

# Perform a standalone lexical search
filter_query = FilterQuery(return_fields=["content"], filter_expression=text_filter, num_results=4)

# inspect raw redis query
str(filter_query)

'@content:(profit) RETURN 1 content DIALECT 2 LIMIT 0 4'

In [30]:
index.query(filter_query)

[{'id': 'doc:redisvl:f0fae0b64b964d7383b53936b34502f2',
  'content': 'Proposals to reform U.S. and foreign tax laws could significantly impact how U.S. multinational corporations are taxed on global earnings and could increase the U.S. corporate tax rate. For example, the Organization for Economic Co-operation and Development (OECD) and the G20 Inclusive Framework on Base Erosion and Profit Shifting (the "Inclusive Framework") has put forth two proposals—Pillar One and Pillar Two—that revise the existing profit allocation and nexus rules and ensure a minimal level of taxation, respectively. On December 12, 2022, the European Union member states agreed to implement the Inclusive Framework\'s global corporate minimum tax rate of 15%. Other countries are also actively considering changes to their tax laws to adopt certain parts of the Inclusive Framework\'s proposals. Although we cannot predict whether or in what form these proposals will be enacted into law, these changes, if enacted int

### Range queries in RedisVL

In [31]:
from redisvl.query import RangeQuery

range_query = RangeQuery(
    vector=hf.embed(query),
    vector_field_name="chunk_vector",
    num_results=4,
    return_fields=["content"],
    return_score=True,
    distance_threshold=0.5  # find all items with a semantic distance of less than 0.5
)


# inspect query
str(range_query)

'@chunk_vector:[VECTOR_RANGE $distance_threshold $vector]=>{$yield_distance_as: vector_distance} RETURN 2 content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 4'

In [32]:
index.query(range_query)

[{'id': 'doc:redisvl:faf74f986a86418fb1be5d46dd5d3707',
  'vector_distance': '0.354781925678',
  'content': '2023 FORM 10-K 35\n\nTable of Contents\n\nOPERATING SEGMENTS\n\nAs discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -

In [33]:
# Add filter to range query
range_query.set_filter(text_filter)

index.query(range_query)

[{'id': 'doc:redisvl:9f2f26f97d674466bf32c058973adc96',
  'vector_distance': '0.362202882767',
  'content': 'NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales were $12.6 billi

## Building a RAG Pipeline with RedisVL

### Use AsyncSearchIndex

In [None]:
from redis.asyncio import Redis
from redisvl.index import AsyncSearchIndex

client = Redis.from_url(REDIS_URL)
index = AsyncSearchIndex(index.schema, client)

### Prep OpenAI Helpers & Prompts

In [None]:
import openai


CHAT_MODEL = "gpt-3.5-turbo"


SYSTEM_PROMPT = """You are a helpful financial analyst assistant that has access
to public financial 10k documents in order to answer users' questions about company
performance, ethics, characteristics, and core information.
"""


async def answer_question(index: AsyncSearchIndex, query: str):
    """Answer the user's question"""
    query_vector = hf.embed(query)
    context = await retrieve_context(index, query_vector)
    response = await openai.ChatCompletion.acreate(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": promptify(query, context)}
        ],
        max_tokens=100
    )

    # Response provided by GPT-3.5
    return response['choices'][0]['message']['content']


async def retrieve_context(index: AsyncSearchIndex, query_vector) -> str:
    """Fetch the relevant context from Redis using vector search"""
    results = await index.query(
        VectorQuery(
            vector=query_vector,
            vector_field_name="chunk_vector",
            return_fields=["content"],
            num_results=3
        )
    )
    content = "\n".join([result["content"] for result in results])
    return content


def promptify(query: str, context: str) -> str:
    return f'''Use the provided context below, derived from public financial
    documents, to answer the user's question. If you can't answer the user's
    question based on the context, do not guess. If there is no context at all,
    respond with "I don't know".

    User question:

    {query}

    Helpful context:

    {context}

    Answer:
    '''
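
The helpers above follow a retrieve, then prompt, then complete flow. The sketch below shows that flow with stand-ins for both Redis and the chat model, so it can run without either service. The names `answer`, `fake_retrieve`, and `fake_complete` are hypothetical, and the passages are placeholders rather than text from the filing.

```python
import asyncio


async def fake_retrieve(question: str) -> list[dict]:
    """Stand-in for the vector search; rows are shaped like query results."""
    return [{"content": "placeholder passage one"},
            {"content": "placeholder passage two"}]


async def fake_complete(prompt: str) -> str:
    """Stand-in for the chat model; echoes the prompt it was given."""
    return "STUB:" + prompt


async def answer(question: str, retrieve, complete) -> str:
    rows = await retrieve(question)                       # 1. fetch relevant chunks
    context = "\n".join(row["content"] for row in rows)   # 2. join them into one context
    prompt = f"Question:\n{question}\n\nContext:\n{context}\n\nAnswer:"
    return await complete(prompt)                         # 3. ask the model


print(asyncio.run(answer("What changed?", fake_retrieve, fake_complete)))
```

Inside a notebook, where an event loop is already running, call `await answer(...)` directly instead of `asyncio.run(...)`.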

### Vanilla RAG

In [None]:
query = "What was Nike's revenue last year compared to this year??"

await answer_question(index, query)

In [None]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"

await answer_question(index, query)

In [None]:
query = "Is Nike an ethical company?"

await answer_question(index, query)

### Improve performance and cut costs with LLM caching

In [None]:
from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",
    vectorizer=hf,
    redis_url=REDIS_URL,  # reuse the connection settings defined earlier
    ttl=120,
    distance_threshold=0.2
)

In [None]:
from functools import wraps


def cache(func):
    @wraps(func)
    async def wrapper(index, query_text, *args, **kwargs):
        query_vector = llmcache._vectorizer.embed(query_text)

        # Check the semantic cache (llmcache) using the query vector
        if result := llmcache.check(vector=query_vector):
            return result[0]['response']

        # Cache miss: run the wrapped function, then store its response
        response = await func(index, query_text, query_vector=query_vector)
        llmcache.store(query_text, response, query_vector)
        return response
    return wrapper

@cache
async def answer_question(index: AsyncSearchIndex, query_text: str, **kwargs):
    """Answer the user's question using the query vector and text"""
    context = await retrieve_context(index, kwargs["query_vector"])
    response = await openai.ChatCompletion.acreate(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": promptify(query_text, context)}
        ],
        max_tokens=100
    )

    # Response provided by GPT-3.5
    return response['choices'][0]['message']['content']

In [None]:
query = "What was Nike's revenue last year compared to this year??"

await answer_question(index, query)

In [None]:
query = "What was Nike's total revenue in the last year compared to now??"

await answer_question(index, query)
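
The second question is worded differently from the first, yet it can still be served from the cache because the lookup compares embeddings rather than exact strings. The toy in-memory class below sketches that idea only. `TinySemanticCache` is hypothetical, it is not how `SemanticCache` is implemented, and it leaves out the TTL expiry configured above.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class TinySemanticCache:
    """Toy cache keyed by embedding similarity (illustrative only)."""

    def __init__(self, distance_threshold: float):
        self.distance_threshold = distance_threshold
        self.entries = []  # list of (vector, response) pairs

    def store(self, vector, response):
        self.entries.append((vector, response))

    def check(self, vector):
        """Return the closest cached response if it is near enough, else None."""
        best = min(self.entries, key=lambda e: cosine_distance(vector, e[0]), default=None)
        if best is not None and cosine_distance(vector, best[0]) <= self.distance_threshold:
            return best[1]
        return None


cache_demo = TinySemanticCache(distance_threshold=0.2)
cache_demo.store([1.0, 0.0], "cached answer")
print(cache_demo.check([0.9, 0.1]))  # close to the stored vector
print(cache_demo.check([0.0, 1.0]))  # far from the stored vector
```

A lower threshold makes the cache stricter: fewer paraphrases hit it, and there is less risk of returning an answer to a different question.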

## Cleanup

Clean up the index.

In [None]:
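# Uncomment to drop the index and its data. The index is now an AsyncSearchIndex, so the call must be awaited.
# await index.delete(drop=True)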

## What's Next?

Now that you have tried the easy-to-use RedisVL client, try your hand with LangChain -- the highest level of abstraction for using and integrating Redis as a vector database.


<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/langchain-03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>