![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)

# RAG from scratch with the Redis Vector Library


In this recipe we will cover the basics of the Redis Vector Library and build a basic RAG app from scratch.

## Let's Begin!
<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/RAG/01_redisvl.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


## Environment Setup

### Pull Github Materials
Because you are likely running this notebook in **Google Colab**, we first need to
pull the necessary dataset and materials directly from GitHub.

**If you are running this notebook locally**, you may not need to perform this
step at all.

In [1]:
# NBVAL_SKIP
!git clone https://github.com/redis-developer/redis-ai-resources.git temp_repo
!mv temp_repo/python-recipes/RAG/resources .
!rm -rf temp_repo

Cloning into 'temp_repo'...
remote: Enumerating objects: 227, done.
remote: Counting objects: 100% (90/90), done.
remote: Compressing objects: 100% (59/59), done.
remote: Total 227 (delta 50), reused 47 (delta 29), pack-reused 137
Receiving objects: 100% (227/227), 19.63 MiB | 9.16 MiB/s, done.
Resolving deltas: 100% (74/74), done.
mv: rename temp_repo/python-recipes/RAG/resources to ./resources: Directory not empty


### Install Python Dependencies

In [2]:
# NBVAL_SKIP
!pip install -q redis redisvl "unstructured[pdf]" sentence-transformers langchain langchain-community openai


[notice] A new release of pip is available: 24.0 -> 24.2
[notice] To update, run: pip install --upgrade pip


### Install Redis Stack

Later in this tutorial, Redis will be used to store, index, and query vector
embeddings created from PDF document chunks. **We need to make sure we have a Redis
instance available.**

#### For Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### For Alternative Environments
There are several ways to get the necessary Redis Stack instance running:
1. In the cloud: deploy a [free Redis instance](https://redis.com/try-free/). Or, if you have your
own version of Redis Enterprise running, that works too!
2. Per OS: [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With Docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default, this notebook connects to a local instance of Redis Stack. **If you have your own Redis Cloud or Redis Enterprise instance**, replace the `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` values with your own.

In [79]:
import os
import warnings
warnings.filterwarnings('ignore')

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
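The comment above mentions `rediss://` for SSL-enabled endpoints. As a minimal sketch, the URL could be assembled by a small helper that picks the scheme and percent-encodes the password. `build_redis_url` and its `use_ssl` flag are hypothetical names introduced here for illustration; they are not part of redis-py or RedisVL, and the host and password values are made up.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "", use_ssl: bool = False) -> str:
    """Assemble a Redis connection URL (hypothetical helper, for illustration only)."""
    # rediss:// signals a TLS connection, redis:// a plain one
    scheme = "rediss" if use_ssl else "redis"
    # Percent-encode the password so characters such as '@' or '/'
    # cannot be mistaken for URL delimiters
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


local_url = build_redis_url("localhost", "6379")
cloud_url = build_redis_url("example-host", "18374", password="p@ss/word", use_ssl=True)
print(local_url)
print(cloud_url)
```

Unlike the f-string above, this sketch omits the `:@` credentials part entirely when no password is set.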

## Simplified Vector Search with RedisVL

### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load a single
financial document (a 10-K filing) and preprocess it using some helpers from LangChain:

- `UnstructuredFileLoader` is not the only document loader type that LangChain provides. Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
- `RecursiveCharacterTextSplitter` is what we use to create smaller chunks of text from the doc. Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter

In [80]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs from a folder
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

Listing available documents ... ['resources/nke-10k-2023.pdf', 'resources/amzn-10k-2023.pdf', 'resources/jnj-10k-2023.pdf', 'resources/aapl-10k-2023.pdf', 'resources/retrieval_basic_rag_test.csv', 'resources/nvd-10k-2023.pdf', 'resources/testset.csv', 'resources/msft-10k-2023.pdf', 'resources/propositions.json', 'resources/generation_basic_rag_test.csv']


In [81]:
# pick out the Nike doc for this exercise
doc = [doc for doc in docs if "nke" in doc][0]

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)
loader = UnstructuredFileLoader(
    doc, mode="single", strategy="fast"
)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Done preprocessing. Created 179 chunks of the original pdf resources/nke-10k-2023.pdf
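To build intuition for what the splitter does, here is a simplified sketch of recursive character splitting: try the coarsest separator first, pack pieces greedily up to `chunk_size`, and fall back to finer separators only for pieces that are still too long. This is an illustration under simplifying assumptions (no overlap, a fixed separator list), not LangChain's actual implementation; `split_recursive` is a hypothetical name.

```python
def split_recursive(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into pieces of at most chunk_size characters (simplified sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            # The piece still fits: keep growing the current chunk
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators
            chunks.extend(split_recursive(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


demo_chunks = split_recursive("aaaa bbbb\n\ncccc dddd eeee", chunk_size=10)
print(demo_chunks)
```

The first paragraph fits in one chunk, while the second is too long and is split again on spaces.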


### Text embedding generation with RedisVL
RedisVL has built-in extensions and utilities to aid the GenAI development process. In the following snippet we use RedisVL's `HFTextVectorizer` class in tandem with the **all-MiniLM-L6-v2** model to generate vector embeddings for the chunks created above. These embeddings capture the "meaning" of the text so that we can later retrieve the chunks that are semantically related to a user's query.

In [82]:
from redisvl.utils.vectorize import HFTextVectorizer

hf = HFTextVectorizer("sentence-transformers/all-MiniLM-L6-v2")
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Embed each chunk content
embeddings = hf.embed_many([chunk.page_content for chunk in chunks])

# Check to make sure we've created enough embeddings, 1 per document chunk
len(embeddings) == len(chunks)

True
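Semantic relatedness between two embeddings is measured with a distance function; the index in this notebook uses cosine distance, which is one minus the cosine similarity. A minimal pure-Python sketch on toy 3-dimensional vectors (the real embeddings have 384 dimensions) is shown here; `cosine_distance` is a hypothetical helper name.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Vectors pointing in the same direction have distance 0, regardless of length
same_direction = cosine_distance([1.0, 0.0, 0.0], [2.0, 0.0, 0.0])
# Orthogonal vectors have distance 1
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 3.0, 0.0])
print(same_direction, orthogonal)
```

Smaller values mean closer meaning, which is why the query results later in this notebook are sorted by `vector_distance` in ascending order.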

### Define a schema and create an index

Below we connect to Redis and create an index that contains a text field, tag field, and vector field.

In [83]:
from redis import Redis
from redisvl.index import SearchIndex


index_name = "redisvl"


schema = {
  "index": {
    "name": index_name,
    "prefix": "chunk"
  },
  "fields": [
    {
        "name": "chunk_id",
        "type": "tag",
        "attrs": {
            "sortable": True
        }
    },
    {
        "name": "content",
        "type": "text"
    },
    {
        "name": "text_embedding",
        "type": "vector",
        "attrs": {
            "dims": 384,
            "distance_metric": "cosine",
            "algorithm": "hnsw",
            "datatype": "float32"
        }
    }
  ]
}
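The `dims` value in the vector field has to match the length of the embeddings produced by the model (384 for all-MiniLM-L6-v2). A quick sanity check along these lines can catch a mismatch before loading any data. This is a sketch on a toy schema and a placeholder embedding; `vector_dims` is a hypothetical helper, not a RedisVL function.

```python
def vector_dims(schema: dict, field_name: str) -> int:
    """Look up the configured dimensionality of a vector field in a schema dict."""
    for field in schema["fields"]:
        if field["name"] == field_name and field["type"] == "vector":
            return field["attrs"]["dims"]
    raise KeyError(f"no vector field named {field_name!r}")


# Toy schema with the same shape as the one defined above
toy_schema = {
    "index": {"name": "toy", "prefix": "chunk"},
    "fields": [
        {"name": "content", "type": "text"},
        {"name": "text_embedding", "type": "vector", "attrs": {"dims": 384}},
    ],
}

# Placeholder standing in for one real embedding
toy_embedding = [0.0] * 384
dims_match = len(toy_embedding) == vector_dims(toy_schema, "text_embedding")
print(dims_match)
```

In the notebook itself, the equivalent check would compare `len(embeddings[0])` against the `dims` value in `schema`.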

In [84]:
# connect to redis
client = Redis.from_url(REDIS_URL)

# create an index from schema and the client
index = SearchIndex.from_dict(schema)
index.set_client(client)
index.create(overwrite=True, drop=True)

In [85]:
# use the RedisVL CLI tool to list all indices
!rvl index listall

14:02:38 [RedisVL] INFO   Indices:
14:02:38 [RedisVL] INFO   1. redisvl


In [59]:
# get info about the index
!rvl index info -i redisvl



Index Information:
╭──────────────┬────────────────┬────────────┬─────────────────┬────────────╮
│ Index Name   │ Storage Type   │ Prefixes   │ Index Options   │   Indexing │
├──────────────┼────────────────┼────────────┼─────────────────┼────────────┤
│ redisvl      │ HASH           │ ['chunk']  │ []              │          0 │
╰──────────────┴────────────────┴────────────┴─────────────────┴────────────╯
Index Fields:
╭────────────────┬────────────────┬────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────┬────────────────┬────────────────┬─────────────────┬────────────────╮
│ Name           │ Attribute      │ Type   │ Field Option   │ Option Value   │ Field Option   │ Option Value   │ Field Option   │   Option Value │ Field Option    │ Option Value   │ Field Option   │   Option Value │ Field Option    │   Option Value │
├────────────────┼────────────────┼────────┼────────────────┼────────────

### Process and load dataset
Below we use the RedisVL index to load the list of document chunks into Redis.

In [86]:
# load expects an iterable of dictionaries
from redisvl.redis.utils import array_to_buffer

data = [
    {
        'chunk_id': i,
        'content': chunk.page_content,
        # For HASH -- must convert embeddings to bytes
        'text_embedding': array_to_buffer(embeddings[i])
    } for i, chunk in enumerate(chunks)
]

# RedisVL handles batching automatically
keys = index.load(data, id_field="chunk_id")

### Query the database
Now we can use the RedisVL index to perform similarity search operations with Redis.

In [87]:
from redisvl.query import VectorQuery

query = "Nike profit margins and company performance"

query_embedding = hf.embed(query)

vector_query = VectorQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=3,
    return_fields=["chunk_id", "content"],
    return_score=True
)

# show the raw redis query
str(vector_query)

'*=>[KNN 3 @text_embedding $vector AS vector_distance] RETURN 3 chunk_id content vector_distance SORTBY vector_distance ASC DIALECT 2 LIMIT 0 3'

In [88]:
# execute the query with RedisVL
index.query(vector_query)

[{'id': 'chunk:84',
  'vector_distance': '0.321347117424',
  'chunk_id': '84',
  'content': 'TOTAL NIKE BRAND Converse\n\n$\n\n1,932 (4,841)\n\n8,359 676\n\n$\n\n1,896 (4,262)\n\n8,406 669\n\n2 % -14 %\n\n1 % $ 1 %\n\n1,530 (3,656)\n\n8,641 543\n\nCorporate TOTAL NIKE, INC. EARNINGS BEFORE INTEREST AND TAXES\n\n(1)\n\n$\n\n(2,840)\n\n6,195\n\n$\n\n(2,219)\n\n6,856\n\n28 %\n\n10 % $\n\n(2,261)\n\n6,923\n\nEBIT margin\n\n(1)\n\n12.1 %\n\n14.7 %\n\n15.5 %\n\nInterest expense (income), net\n\n(6)\n\n205\n\n—\n\n262\n\nTOTAL NIKE, INC. INCOME BEFORE INCOME TAXES\n\n$\n\n6,201\n\n$\n\n6,651\n\n7 % $\n\n6,661\n\n(1) Total NIKE Brand EBIT, Total NIKE, Inc. EBIT and EBIT Margin represent non-GAAP financial measures. See "Use of Non-GAAP Financial Measures" for further information.\n\n2023 FORM 10-K 36\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES\n\n7 % 12 % -13 %\n\n16 % 302 %\n\n6 % 7 %\n\n— 6 %\n\n% CHANGE\n\n0 % 35 % -27 %\n\n24 % -17 %\n\n3 % 23 % 2 %\n\n1 %\n\n—\n\n0 %\n\nTable of Contents\n

In [89]:
# paginate through results
for result in index.paginate(vector_query, page_size=1):
    print(result[0]["chunk_id"], result[0]["vector_distance"], flush=True)

84 0.321347117424
83 0.328033864498
117 0.358749628067


### Sort by alternative fields

In [90]:
# Sort by chunk_id field after vector search limits to topK
vector_query = VectorQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=4,
    return_fields=["chunk_id"],
    return_score=True
)

# Decompose vector_query into the core query and the params
query = vector_query.query
params = vector_query.params

# Pass query and params direct to index.search()
result = index.search(
    query.sort_by("chunk_id", asc=True),
    params
)

[doc.__dict__ for doc in result.docs]

[{'id': 'chunk:117',
  'payload': None,
  'vector_distance': '0.358749628067',
  'chunk_id': '117'},
 {'id': 'chunk:157',
  'payload': None,
  'vector_distance': '0.360825419426',
  'chunk_id': '157'},
 {'id': 'chunk:83',
  'payload': None,
  'vector_distance': '0.328033864498',
  'chunk_id': '83'},
 {'id': 'chunk:84',
  'payload': None,
  'vector_distance': '0.321347117424',
  'chunk_id': '84'}]
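Note the order of the results: 117, 157, 83, 84. This is consistent with `chunk_id` values being compared as strings rather than numbers, which would follow from the field being declared as a `tag`. The sketch below reproduces that ordering in plain Python and contrasts it with a numeric sort; it illustrates the ordering only and does not query Redis.

```python
# chunk_id values returned by the query above, as strings
chunk_ids = ["84", "83", "117", "157"]

# String comparison goes character by character, so "117" < "83" because "1" < "8"
lexicographic = sorted(chunk_ids)

# Converting to int first gives the order a reader would usually expect
numeric = sorted(chunk_ids, key=int)

print(lexicographic)
print(numeric)
```

If numeric ordering matters, one option is to re-sort the returned documents on the client side as shown with `key=int`.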

### Add filters to vector queries

In [91]:
from redisvl.query.filter import Text

vector_query = VectorQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=4,
    return_fields=["content"],
    return_score=True
)

# Set a text filter
text_filter = Text("content") % "profit"

vector_query.set_filter(text_filter)

index.query(vector_query)

[{'id': 'chunk:80',
  'vector_distance': '0.363789737225',
  'content': 'NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling price ("ASP") per pair contributed approximately 7 percentage points of footwear revenue growth. Higher ASP was primarily due to higher full-price ASP, net of discounts, on a wholesale equivalent basis, and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP.\n\nNIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily d

### Range queries in RedisVL

In [92]:
from redisvl.query import RangeQuery

range_query = RangeQuery(
    vector=query_embedding,
    vector_field_name="text_embedding",
    num_results=4,
    return_fields=["content"],
    return_score=True,
    distance_threshold=0.8  # find all items with a semantic distance of less than 0.8
)

In [93]:
index.query(range_query)

[{'id': 'chunk:84',
  'vector_distance': '0.321347117424',
  'content': 'TOTAL NIKE BRAND Converse\n\n$\n\n1,932 (4,841)\n\n8,359 676\n\n$\n\n1,896 (4,262)\n\n8,406 669\n\n2 % -14 %\n\n1 % $ 1 %\n\n1,530 (3,656)\n\n8,641 543\n\nCorporate TOTAL NIKE, INC. EARNINGS BEFORE INTEREST AND TAXES\n\n(1)\n\n$\n\n(2,840)\n\n6,195\n\n$\n\n(2,219)\n\n6,856\n\n28 %\n\n10 % $\n\n(2,261)\n\n6,923\n\nEBIT margin\n\n(1)\n\n12.1 %\n\n14.7 %\n\n15.5 %\n\nInterest expense (income), net\n\n(6)\n\n205\n\n—\n\n262\n\nTOTAL NIKE, INC. INCOME BEFORE INCOME TAXES\n\n$\n\n6,201\n\n$\n\n6,651\n\n7 % $\n\n6,661\n\n(1) Total NIKE Brand EBIT, Total NIKE, Inc. EBIT and EBIT Margin represent non-GAAP financial measures. See "Use of Non-GAAP Financial Measures" for further information.\n\n2023 FORM 10-K 36\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES\n\n7 % 12 % -13 %\n\n16 % 302 %\n\n6 % 7 %\n\n— 6 %\n\n% CHANGE\n\n0 % 35 % -27 %\n\n24 % -17 %\n\n3 % 23 % 2 %\n\n1 %\n\n—\n\n0 %\n\nTable of Contents\n\nNORTH AMERICA\n\n(

In [94]:
# Add filter to range query
range_query.set_filter(text_filter)

index.query(range_query)

[{'id': 'chunk:80',
  'vector_distance': '0.363789737225',
  'content': 'NIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling price ("ASP") per pair contributed approximately 7 percentage points of footwear revenue growth. Higher ASP was primarily due to higher full-price ASP, net of discounts, on a wholesale equivalent basis, and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP.\n\nNIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily d
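The difference between the two query types can be sketched without Redis: a KNN-style query returns the `k` closest items no matter how far away they are, while a range query returns the items within a distance threshold. The distances for chunks 84, 83, 117, and 157 are taken from the outputs above; the `far` entry is a made-up value added for contrast, and `top_k` and `within_threshold` are hypothetical helper names. Treating the threshold as inclusive is also an assumption.

```python
# (chunk_id, vector_distance) pairs; the last entry is hypothetical
candidates = [
    ("117", 0.358749628067),
    ("84", 0.321347117424),
    ("far", 0.95),
    ("157", 0.360825419426),
    ("83", 0.328033864498),
]


def top_k(items, k):
    """KNN-style selection: the k items with the smallest distance."""
    return sorted(items, key=lambda item: item[1])[:k]


def within_threshold(items, threshold):
    """Range-style selection: every item at or below the distance threshold."""
    return sorted(
        (item for item in items if item[1] <= threshold),
        key=lambda item: item[1],
    )


knn_ids = [chunk_id for chunk_id, _ in top_k(candidates, 3)]
range_ids = [chunk_id for chunk_id, _ in within_threshold(candidates, 0.8)]
print(knn_ids)
print(range_ids)
```

A range query is useful when a weak match is worse than no match, for example to avoid passing irrelevant context to an LLM.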

## Building a basic RAG Pipeline from Scratch
We're going to build a basic RAG pipeline from scratch incorporating the following components:

- Standard semantic search
- Integration with OpenAI for LLM
- Chat completion

### Setup RedisVL AsyncSearchIndex

In [109]:
from redis.asyncio import Redis as AsyncRedis
from redisvl.index import AsyncSearchIndex

client = AsyncRedis.from_url(REDIS_URL)
async_index = AsyncSearchIndex.from_dict(schema)
await async_index.set_client(client)

<redisvl.index.index.AsyncSearchIndex at 0x310e91390>

### Setup OpenAI API

In [110]:
import openai
import os
import getpass


CHAT_MODEL = "gpt-3.5-turbo-0125"

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")


### Baseline Retrieval Augmented Generation
The code below answers a user's questions following this basic flow:

1. Generate a `query_vector` from the user's chat question so that it can be compared like-for-like against the embeddings stored in the vector database.
2. Retrieve the chunks from the database that are most semantically relevant to the user's query.
3. Pass the user query and retrieved context to the `promptify` function to generate the final prompt, which is sent to the LLM along with the system prompt and the necessary hyperparameters.
4. Return the LLM's response to the user.

In [111]:

async def answer_question(index: AsyncSearchIndex, query: str):
    """Answer the user's question"""

    SYSTEM_PROMPT = """You are a helpful financial analyst assistant that has access
    to public financial 10k documents in order to answer users questions about company
    performance, ethics, characteristics, and core information.
    """

    query_vector = hf.embed(query)
    # Fetch context from Redis using vector search
    context = await retrieve_context(index, query_vector)
    # Generate contextualized prompt and feed to OpenAI
    response = await openai.AsyncClient().chat.completions.create(
        model=CHAT_MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": promptify(query, context)}
        ],
        temperature=0.1,
        seed=42
    )
    # Response provided by LLM
    return response.choices[0].message.content


async def retrieve_context(async_index: AsyncSearchIndex, query_vector) -> str:
    """Fetch the relevant context from Redis using vector search"""
    results = await async_index.query(
        VectorQuery(
            vector=query_vector,
            vector_field_name="text_embedding",
            return_fields=["content"],
            num_results=3
        )
    )
    content = "\n".join([result["content"] for result in results])
    return content


def promptify(query: str, context: str) -> str:
    return f'''Use the provided context below derived from public financial
    documents to answer the user's question. If you can't answer the user's
    question, based on the context; do not guess. If there is no context at all,
    respond with "I don't know".

    User question:

    {query}

    Helpful context:

    {context}

    Answer:
    '''

### Let's test it out...

In [112]:
# Generate a list of questions
questions = [
    "What is the trend in the company's revenue and profit over the past few years?",
    "What are the company's primary revenue sources?",
    "How much debt does the company have, and what are its capital expenditure plans?",
    "What does the company say about its environmental, social, and governance (ESG) practices?",
    "What is the company's strategy for growth?"
]

In [113]:
import asyncio

results = await asyncio.gather(*[
    answer_question(async_index, question) for question in questions
])
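`asyncio.gather` runs the coroutines concurrently but returns the results in the order the awaitables were passed in, not the order in which they finish. That is why `results[i]` can be matched with `questions[i]` when printing. A minimal sketch with a stand-in coroutine is shown here; `fake_answer` is hypothetical and sleeps instead of calling Redis or OpenAI. It uses `asyncio.run` because it is written as a standalone script; inside a notebook you would `await` directly, as in the cell above.

```python
import asyncio


async def fake_answer(question: str, delay: float) -> str:
    """Hypothetical stand-in for answer_question: waits, then returns a canned reply."""
    await asyncio.sleep(delay)
    return f"answer to {question!r}"


async def run_all(questions, delays):
    # All coroutines run concurrently; results come back in input order
    return await asyncio.gather(
        *[fake_answer(q, d) for q, d in zip(questions, delays)]
    )


demo_questions = ["q1", "q2", "q3"]
# q2 finishes first and q1 last, yet the result order follows demo_questions
demo_answers = asyncio.run(run_all(demo_questions, [0.03, 0.01, 0.02]))
print(demo_answers)
```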

### Let's view the results

In [114]:
for i, r in enumerate(results):
    print(f"Question: {questions[i]}")
    print(f"Answer: \n {r}", "\n-----------\n")

Question: What is the trend in the company's revenue and profit over the past few years?
Answer: 
 Based on the provided financial data:

- The company's revenue has been increasing over the past few years. From fiscal 2021 to fiscal 2022, there was a 5% increase in revenue, and from fiscal 2022 to fiscal 2023, there was a 10% increase in revenue.

- The company's profit, represented by net income, has also been increasing over the past few years. From fiscal 2021 to fiscal 2022, there was a 16% increase in net income, and from fiscal 2022 to fiscal 2023, there was an 18% increase in net income.

Therefore, both revenue and profit have shown a positive trend of growth over the past few years for the company. 
-----------

Question: What are the company's primary revenue sources?
Answer: 
 The company's primary revenue sources are as follows:

1. Footwear: 
   - North America: $21,608 million
   - Europe, Middle East & Africa: $13,418 million
   - Greater China: $7,248 million
   - Asia

# You now have a working RAG pipeline!

As you can see, it is easy to get started with RAG, and we were able to get decent chat results from this simple setup. To go beyond the basic example, see the [advanced_rag](../advanced_capabilities/advanced_RAG.ipynb) notebook.

That notebook covers:

- **Improving accuracy** with dense content representations and query rewriting/expansion
- **Improving performance and optimizing cost** with semantic caching
- **Improving personalization** with chat session memory.


## Cleanup

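Clean up the database. Note that `flushall` removes every key in the Redis instance, not only the `redisvl` index, so run it only against a disposable instance.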

In [75]:
await async_index.client.flushall()

True