![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)
# RAG with LlamaIndex

This notebook uses [LlamaIndex](https://docs.llamaindex.ai/en/stable/) and [Redis](https://redis.com) to set up a basic RAG implementation.

## Let's Begin!
<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/RAG/03_llamaindex.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Environment Setup

### Pull GitHub Materials
Because you are likely running this notebook in **Google Colab**, we first need to
pull the necessary dataset and materials directly from GitHub.

**If you are running this notebook locally**, you may not need to perform this
step at all.

In [None]:
# NBVAL_SKIP
!git clone https://github.com/redis-developer/redis-ai-resources.git temp_repo
!mv temp_repo/python-recipes/RAG/resources .
!rm -rf temp_repo

### Install Python Dependencies

In [1]:
# NBVAL_SKIP
%pip install -U -q llama-index llama-index-vector-stores-redis llama-index-embeddings-cohere llama-index-embeddings-openai

Note: you may need to restart the kernel to use updated packages.


### Install Redis Stack

Later in this tutorial, Redis will be used to store, index, and query vector
embeddings created from PDF document chunks. **We need to make sure we have a Redis
instance available.**

#### For Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
# NBVAL_SKIP
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### For Alternative Environments
There are many ways to get the necessary Redis Stack instance running:
1. In the cloud, deploy a [FREE instance of Redis](https://redis.com/try-free/). Or, if you have your own version of Redis Enterprise running, that works too!
2. Per OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/)
3. With Docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default, this notebook connects to the local instance of Redis Stack. **If you have your own Redis Enterprise instance**, replace the `REDIS_PASSWORD`, `REDIS_HOST`, and `REDIS_PORT` values with your own.

In [1]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
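
If the password contains characters such as `@` or `:`, interpolating it directly into the URL can yield a string that clients parse incorrectly. The sketch below shows one possible way to assemble the URL with percent-encoding and an optional `rediss://` scheme. Note that `build_redis_url` is a hypothetical helper introduced here for illustration; it is not part of this notebook or of any Redis library.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "", use_ssl: bool = False) -> str:
    """Hypothetical helper: assemble a Redis URL, percent-encoding the password."""
    # rediss:// is the scheme for SSL/TLS-enabled endpoints
    scheme = "rediss" if use_ssl else "redis"
    # Only include the credentials section when a password is set
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url("localhost", "6379"))
print(build_redis_url("example.host", "18374", "p@ss:word", use_ssl=True))
```

The result can be assigned to `REDIS_URL` in place of the f-string above if your credentials need encoding.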

## RAG with LlamaIndex

### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load the financial
documents (10-K filings) in the `resources/` folder using a helper from LlamaIndex:

- `SimpleDirectoryReader` reads the files in a folder and returns a list of `Document` objects. It is not the only document loader that LlamaIndex provides.
- Splitting the text into smaller chunks does not happen in this cell; `VectorStoreIndex.from_documents` handles it later, when the index is created.

In [2]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.redis import RedisVectorStore

# Load all documents (PDFs) from a folder
data_path = "resources/"
docs = SimpleDirectoryReader(data_path).load_data()

print(f"Sample doc {docs[0]}")

Sample doc Doc ID: c013353e-dae7-4d17-befd-9e784c8acf79
Text: UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington,
D.C. 20549 FORM 10-K (Mark One) ☒ ANNUAL  REPORT PURSUANT T O SECTION
13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the fiscal year
ended September 24, 2022 or ☐ TRANSITION REPORT PURSUANT T O SECTION
13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the transition
period...
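
Before documents like the one above are embedded, they are split into smaller chunks so that each vector represents a focused passage. The toy splitter below illustrates the general idea with fixed-size, overlapping character windows. It is only a sketch: `chunk_text` and its default sizes are assumptions made for illustration, and real splitters are typically token- or sentence-aware.

```python
def chunk_text(text: str, chunk_size: int = 40, overlap: int = 10) -> list[str]:
    """Toy splitter: fixed-size character windows that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K"
for chunk in chunk_text(sample):
    print(repr(chunk))
```

The overlap keeps a sentence that straddles a boundary partly visible in both neighboring chunks, at the cost of storing some text twice.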


In [3]:
import getpass

if "OPENAI_API_KEY" not in os.environ:
    os.environ["OPENAI_API_KEY"] = getpass.getpass("OPENAI_API_KEY")

## Create Index
In the following block, LlamaIndex automatically embeds the provided docs (using OpenAI by default) and then stores them in the `storage_context` (Redis).

In [4]:
from llama_index.core import StorageContext

vector_store = RedisVectorStore(redis_url=REDIS_URL, overwrite=True)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

### Init retriever and query engine

In [5]:
query_engine = index.as_query_engine()
retriever = index.as_retriever()

### Run vector search
We can see the results of the vector search with the `retrieve` method.

In [6]:
result_nodes = retriever.retrieve("What was nike's revenue in fiscal 23?")
for node in result_nodes:
    print(node)

Node ID: d2e6cd9c-0716-49d8-8563-407a00d05445
Text: Table of Contents FISCAL 2023 NIKE BRAND REVENUE HIGHLIGHTS The
following tables present NIKE Brand revenues disaggregated by
reportable operating segment, distribution channel and major product
line: FISCAL 2023 COMPARED TO FISCAL 2022 •NIKE, Inc. Revenues were
$51.2 billion in fiscal 2023, which increased 10% and 16% compared to
fiscal 2022 on...
Score:  0.900

Node ID: 28542d3b-b345-4e9e-b675-f62361ec85d9
Text: Table of Contents NORTH AMERICA (Dollars in millions) FISCAL
2023FISCAL 2022 % CHANGE% CHANGE EXCLUDING CURRENCY CHANGESFISCAL 2021
% CHANGE% CHANGE EXCLUDING CURRENCY CHANGES Revenues by: Footwear $
14,897 $ 12,228 22 % 22 %$ 11,644 5 % 5 % Apparel 5,947 5,492 8 % 9 %
5,028 9 % 9 % Equipment 764 633 21 % 21 % 507 25 % 25 % TOTAL REVENUES
$ 21,6...
Score:  0.885
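
The `Score` values above rank chunks by how close their embeddings are to the query embedding. As a rough illustration of that ranking step, the sketch below computes cosine similarity over toy 3-dimensional vectors and keeps the top matches. It assumes a cosine-based metric (the custom schema later in this notebook uses `cosine`); the vectors, chunk names, and the `top_k` helper are made up for the example and do not reflect how Redis computes or indexes vectors internally.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy embeddings standing in for real chunk vectors
toy_chunks = {
    "revenue_highlights": [0.9, 0.1, 0.0],
    "north_america_table": [0.7, 0.3, 0.1],
    "legal_proceedings": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.0, 0.0]

for name, score in top_k(query_vec, toy_chunks):
    print(f"{name}: {score:.3f}")
```

This brute-force scan compares the query with every vector; an index such as HNSW exists to avoid that full scan on large collections.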



### Run query engine
Now let's get a final RAG-style response.
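
Conceptually, the query engine used in the next cell combines two steps: it retrieves the most relevant chunks, then passes them to the LLM together with the question. The sketch below shows only the second step, prompt assembly, with plain strings. The `build_rag_prompt` function and its template wording are hypothetical and are not the template LlamaIndex actually uses; the chunk texts are abbreviated from the retrieval output shown earlier.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Hypothetical template: place retrieved text ahead of the question."""
    context = "\n---\n".join(retrieved_chunks)
    return (
        "Context information is below.\n"
        f"{context}\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Abbreviated stand-ins for the chunks returned by the retriever
chunks = [
    "NIKE, Inc. Revenues were $51.2 billion in fiscal 2023 ...",
    "NORTH AMERICA (Dollars in millions) ... TOTAL REVENUES ...",
]
prompt = build_rag_prompt("What was nike's revenue in fiscal 23?", chunks)
print(prompt)
```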

In [7]:
response = query_engine.query("What was nike's revenue in fiscal 23?")
response.response

"NIKE's revenue in fiscal 23 was $51.2 billion."

### Use a custom index schema

In most use cases, you need the ability to customize the underlying index configuration
and specification. For example, this is handy for defining the specific metadata fields you want to filter on.

With Redis, this is as simple as defining an index schema object
(from file or dict) and passing it through to the vector store client wrapper.

In [8]:
from redisvl.schema import IndexSchema


custom_schema = IndexSchema.from_dict(
    {
        # customize basic index specs
        "index": {
            "name": "custom_index",
            "prefix": "docs",
            "key_separator": ":",
        },
        # customize fields that are indexed
        "fields": [
            # required fields for llamaindex
            {"type": "tag", "name": "id"},
            {"type": "tag", "name": "doc_id"},
            {"type": "text", "name": "text"},
            # custom metadata fields
            {"type": "numeric", "name": "updated_at"},
            {"type": "tag", "name": "file_name"},
            # custom vector field definition (1536 dims matches the default OpenAI embeddings)
            {
                "type": "vector",
                "name": "vector",
                "attrs": {
                    "dims": 1536,
                    "algorithm": "hnsw",
                    "distance_metric": "cosine",
                },
            },
        ],
    }
)

In [9]:
custom_schema.index

IndexInfo(name='custom_index', prefix='docs', key_separator=':', storage_type=<StorageType.HASH: 'hash'>)

In [10]:
custom_schema.fields

{'id': TagField(name='id', type='tag', path=None, attrs=TagFieldAttributes(sortable=False, separator=',', case_sensitive=False, withsuffixtrie=False)),
 'doc_id': TagField(name='doc_id', type='tag', path=None, attrs=TagFieldAttributes(sortable=False, separator=',', case_sensitive=False, withsuffixtrie=False)),
 'text': TextField(name='text', type='text', path=None, attrs=TextFieldAttributes(sortable=False, weight=1, no_stem=False, withsuffixtrie=False, phonetic_matcher=None)),
 'updated_at': NumericField(name='updated_at', type='numeric', path=None, attrs=NumericFieldAttributes(sortable=False)),
 'file_name': TagField(name='file_name', type='tag', path=None, attrs=TagFieldAttributes(sortable=False, separator=',', case_sensitive=False, withsuffixtrie=False)),
 'vector': HNSWVectorField(name='vector', type='vector', path=None, attrs=HNSWVectorFieldAttributes(dims=1536, algorithm=<VectorIndexAlgorithm.HNSW: 'HNSW'>, datatype=<VectorDataType.FLOAT32: 'FLOAT32'>, distance_metric=<VectorDist

In [11]:
# from datetime import datetime


# def date_to_timestamp(date_string: str) -> int:
#     date_format: str = "%Y-%m-%d"
#     return int(datetime.strptime(date_string, date_format).timestamp())


# # iterate through documents and add new field
# for document in docs:
#     document.metadata["updated_at"] = date_to_timestamp(
#         document.metadata["last_modified_date"]
#     )
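
The commented-out cell above converts a `YYYY-MM-DD` date string into an integer so that it can be stored in the numeric `updated_at` field. Because a naive `datetime` is interpreted in the machine's local timezone, the same date can produce different timestamps on different hosts. The sketch below is a variant pinned to UTC, run against toy metadata dictionaries. The `last_modified_date` key is taken from the commented code; the file names and dates are made up.

```python
from datetime import datetime, timezone


def date_to_timestamp(date_string: str) -> int:
    """Convert a YYYY-MM-DD string to a Unix timestamp, interpreted as UTC."""
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    return int(parsed.replace(tzinfo=timezone.utc).timestamp())


# Toy stand-ins for document metadata (hypothetical values)
toy_metadata = [
    {"file_name": "a.pdf", "last_modified_date": "2024-01-01"},
    {"file_name": "b.pdf", "last_modified_date": "2024-03-15"},
]

# Add the numeric field that the schema's `updated_at` entry expects
for metadata in toy_metadata:
    metadata["updated_at"] = date_to_timestamp(metadata["last_modified_date"])
    print(metadata["file_name"], metadata["updated_at"])
```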

In [12]:
vector_store = RedisVectorStore(
    schema=custom_schema,  # provide customized schema
    redis_url=REDIS_URL,
    overwrite=True,
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

# build and load index from documents and storage context
index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context
)

### Query the vector store and filter on metadata
Now that we have additional metadata indexed in Redis, let's try some queries that add filters. As an example, we'll search for chunks containing the word "audit" in one specific file, "amzn-10k-2023.pdf".

In [13]:
from llama_index.core.vector_stores import (
    MetadataFilters,
    MetadataFilter,
    ExactMatchFilter,
)

retriever = index.as_retriever(
    similarity_top_k=3,
    filters=MetadataFilters(
        filters=[
            ExactMatchFilter(key="file_name", value="amzn-10k-2023.pdf"),
            MetadataFilter(
                key="text",
                value="audit",
                operator="text_match",
            ),
        ],
        condition="and",
    ),
)
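
To make the effect of `condition="and"` concrete, the sketch below applies the same two conditions (exact file name and keyword presence) to a few in-memory records. It is only an illustration of the filter logic: the records, the second file name, and the `matches` helper are hypothetical, and the whole-word matching used here is simpler than Redis full-text search.

```python
import re


def matches(record: dict, file_name: str, keyword: str) -> bool:
    """True only when both conditions hold, mirroring condition="and"."""
    exact_file = record["file_name"] == file_name
    # Tokenize on word characters so punctuation does not block a match
    words = re.findall(r"\w+", record["text"].lower())
    return exact_file and keyword.lower() in words


# Hypothetical records standing in for indexed chunks
toy_records = [
    {"file_name": "amzn-10k-2023.pdf", "text": "The audit committee reviewed the filing."},
    {"file_name": "amzn-10k-2023.pdf", "text": "Revenue grew during the year."},
    {"file_name": "other-10k.pdf", "text": "Independent audit of internal controls."},
]

filtered = [r for r in toy_records if matches(r, "amzn-10k-2023.pdf", "audit")]
for record in filtered:
    print(record["text"])
```

In the retriever above, vector similarity ranking applies only to the chunks that satisfy the filters, so `similarity_top_k=3` returns at most three of them.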

In [14]:
result_nodes = retriever.retrieve("What did the author learn?")

for node in result_nodes:
    print(node)

Node ID: cd0c5d8f-e3b1-4cbb-aa6a-5960003cdb2d
Text: Table of Contents valuation. In the ordinary course of our
business, there are many transactions and calculations for which the
ultimate tax determination is uncertain. Significant judgment is
required in evaluating and estimating our tax expense, assets, and
liabilities. We are also subject to tax controversies in various
jurisdictions that can...
Score:  0.746

Node ID: 6745f668-4c7a-43bf-a9c3-9b04e1a497f8
Text: Table of Contents Included in other income (expense), net in
2021 and 2022 is a marketable equity securities valuation gain (loss)
of $11.8 billion and $(12.7) billion from our equity investment in
Rivian Automotive, Inc. (“Rivian”). Our investment in Rivian’s
preferred stock was accounted for at cost, with adjustments for
observable changes in ...
Score:  0.740

Node ID: 717666fe-fea5-488b-999c-84e6d8b9a0db
Text: Exhibit 31.1 CERTIFICATIONS I, Andrew R. Jassy, certify that: 1.
I have reviewed this Form 10-K of Amazon.com, I