![Redis](https://redis.io/wp-content/uploads/2024/04/Logotype.svg?auto=webp&quality=85,75&width=120)
# RAG with LangChain

This notebook uses [LangChain](https://python.langchain.com/docs/get_started/introduction) and [Redis](https://redis.com) to index documents and their embeddings and to run semantic search over them. It also shows how to integrate with an LLM such as one of OpenAI's GPT models. See the full partner package source code [here](https://github.com/langchain-ai/langchain-redis/tree/main).

## Let's Begin!
<a href="https://colab.research.google.com/github/redis-developer/redis-ai-resources/blob/main/python-recipes/RAG/02_langchain.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Environment Setup

### Pull GitHub Materials
If you are running this notebook in **Google Colab**, first pull the necessary dataset and materials directly from GitHub. The following commands clone the repository, move the `resources` folder into the working directory, and remove the rest of the clone.

**If you are running this notebook locally**, you may not need to perform this step at all.

In [None]:
# NBVAL_SKIP
!git clone https://github.com/redis-developer/redis-ai-resources.git temp_repo
!mv temp_repo/python-recipes/RAG/resources .
!rm -rf temp_repo

### Install Python Dependencies

In [23]:
%pip install -q redis "unstructured[pdf]" sentence-transformers langchain 
%pip install -q langchain-community "langchain-redis>=0.2.0" langchain-huggingface langchain-openai

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
Note: you may need to restart the kernel to use updated packages.
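
The `huggingface/tokenizers` message above is a warning rather than an error. As the message itself suggests, it can be silenced by setting `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs before any tokenizer is used in the process:

```python
import os

# Set this before any tokenizer is used in the process.
# "false" disables tokenizer parallelism, which avoids the fork warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(os.environ["TOKENIZERS_PARALLELISM"])
```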




### Install Redis Stack

Later in this tutorial, Redis will be used to store, index, and query vector
embeddings created from PDF document chunks. **We need to make sure we have a Redis
instance available.**

#### For Colab
Use the shell script below to download, extract, and install [Redis Stack](https://redis.io/docs/getting-started/install-stack/) directly from the Redis package archive.

In [None]:
%%sh
# NBVAL_SKIP
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

#### For Alternative Environments
There are many ways to get the necessary Redis Stack instance running:
1. In the cloud, deploy a [free instance of Redis](https://redis.com/try-free/). If you have your own Redis Enterprise deployment running, that works too.
2. On your own OS, [see the docs](https://redis.io/docs/latest/operate/oss_and_stack/install/install-stack/).
3. With Docker: `docker run -d --name redis-stack-server -p 6379:6379 redis/redis-stack-server:latest`

### Define the Redis Connection URL

By default, this notebook connects to a local instance of Redis Stack. **If you have your own Redis Cloud or Redis Enterprise instance**, replace the `REDIS_HOST`, `REDIS_PORT`, and `REDIS_PASSWORD` values with your own.

In [1]:
import os
import warnings
warnings.filterwarnings('ignore')

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost") # ex: "redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
REDIS_PORT = os.getenv("REDIS_PORT", "6379")      # ex: 18374
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")  # ex: "1TNxTEdYRDgIDKM2gDfasupCADXXXX"

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
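
Passwords issued by managed Redis services can contain characters such as `@` or `/` that are not safe inside a URL. The sketch below is not part of the original notebook; it shows one way to build the connection URL with the password percent-encoded and the `rediss://` scheme selected for SSL endpoints. The helper name `build_redis_url`, its `use_ssl` parameter, and the sample host and password are hypothetical.

```python
from urllib.parse import quote


def build_redis_url(host, port, password="", use_ssl=False):
    """Assemble a Redis connection URL from its parts."""
    scheme = "rediss" if use_ssl else "redis"
    # safe="" percent-encodes every reserved character in the password
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


local_url = build_redis_url("localhost", "6379")
cloud_url = build_redis_url("example-host", 18374, password="p@ss/word", use_ssl=True)
print(local_url)
print(cloud_url)
```

Unlike the cell above, this helper omits the `:@` prefix entirely when no password is set.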

## RAG with LangChain

### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load a single
financial document (a 10-K filing) and preprocess it using some helpers from LangChain:

- `UnstructuredFileLoader` extracts the text from the file. It is only one of the document loader types that LangChain provides. Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
- `RecursiveCharacterTextSplitter` is what we use to create smaller chunks of text from the doc. Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
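
To give a feel for what recursive character splitting means, here is a simplified pure-Python sketch. It is an illustration of the general idea (try coarse separators first, fall back to finer ones only for pieces that are still too long), not LangChain's actual implementation, and the function name `recursive_split` is hypothetical.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters,
    preferring coarse separators before fine ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # Separator not present, so try the next finer one
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small parts together
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # This part alone is too long: split it with finer separators
            chunks.extend(recursive_split(part, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaaa bbbb\n\ncccc dddd eeee"
sample_chunks = recursive_split(sample, chunk_size=10)
print(sample_chunks)
```

The first paragraph fits in one chunk, while the second is too long and gets split again on spaces.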

In [2]:
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import UnstructuredFileLoader

# Load list of pdfs from a folder
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

Listing available documents ... ['resources/nke-10k-2023.pdf', 'resources/amzn-10k-2023.pdf', 'resources/metrics_2500_0.csv', 'resources/jnj-10k-2023.pdf', 'resources/aapl-10k-2023.pdf', 'resources/testset_15.csv', 'resources/retrieval_basic_rag_test.csv', 'resources/2022-chevy-colorado-ebrochure.pdf', 'resources/nvd-10k-2023.pdf', 'resources/testset.csv', 'resources/msft-10k-2023.pdf', 'resources/propositions.json', 'resources/generation_basic_rag_test.csv']


In [3]:
# pick out the Nike doc for this exercise
doc = [doc for doc in docs if "nke" in doc][0]

# set up the file loader/extractor and text splitter to create chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2500, chunk_overlap=0
)
loader = UnstructuredFileLoader(
    doc, mode="single", strategy="fast"
)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)


Done preprocessing. Created 179 chunks of the original pdf resources/nke-10k-2023.pdf


### Initialize Embeddings Model
Here we will use one of LangChain's built-in embedding integrations so that it works seamlessly with the LangChain VectorStore classes.

In [4]:
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## Vector Search with LangChain
### Create Redis vector store instance

We also need to define a metadata schema for the vector index so that we can filter on the metadata alongside the vectors.

**Important Note**: The LangChain integration does not support the JSON data type yet; it only supports HASH for now. JSON support should be coming soon.

In [5]:
from langchain_redis import RedisVectorStore

index_name = "langchain_ex"

# construct the vector store class from texts and metadata
rds = RedisVectorStore.from_documents(
    chunks,
    embeddings,
    index_name=index_name,
    redis_url=REDIS_URL,
    metadata_schema=[
        {
            "name": "source",
            "type": "text"
        },
    ]
)

16:18:04 redisvl.index.index INFO   Index already exists, not overwriting.


In [6]:
# access the underlying redis client to see how many keys have been stored
rds._index.client.dbsize()

1123

### Query the database
Now we can use the LangChain vector store class to perform similarity search operations on Redis.

In [7]:
from redisvl.query.filter import Text

In [8]:
# basic "top 4" vector search on a given query
rds.similarity_search_with_score(query="Profit margins", k=4)

[(Document(metadata={'source': 'resources/nke-10k-2023.pdf'}, page_content="(Dollars in millions, except per share data)\n\nRevenues Cost of sales\n\nGross profit Gross margin\n\nDemand creation expense Operating overhead expense\n\nTotal selling and administrative expense % of revenues\n\nInterest expense (income), net\n\nOther (income) expense, net Income before income taxes\n\nIncome tax expense Effective tax rate\n\nNET INCOME Diluted earnings per common share\n\n$\n\n$ $\n\nFISCAL 2023\n\n51,217 28,925\n\n22,292\n\n43.5 %\n\n4,060 12,317\n\n16,377\n\n32.0 % (6)\n\n(280) 6,201\n\n1,131\n\n18.2 %\n\n5,070 3.23\n\n$\n\n$ $\n\nFISCAL 2022\n\n46,710 25,231\n\n21,479\n\n46.0 %\n\n3,850 10,954\n\n14,804\n\n31.7 % 205\n\n(181) 6,651\n\n605 9.1 %\n\n6,046 3.75\n\n% CHANGE\n\n10 % $ 15 %\n\n4 %\n\n5 % 12 %\n\n11 %\n\n—\n\n— -7 %\n\n87 %\n\n16 % $ -14 % $\n\nFISCAL 2021\n\n% CHANGE\n\n44,538 24,576\n\n5 % 3 %\n\n19,962\n\n8 %\n\n44.8 %\n\n3,114 9,911\n\n24 % 11 %\n\n13,025\n\n14 %\n\n29.2 % 

In [9]:
# vector search with metadata filtering
f = Text("text") % "profit"
rds.similarity_search_with_score(query="Profit margins", k=4, filter=f)

[(Document(metadata={'source': 'resources/nke-10k-2023.pdf'}, page_content="(Dollars in millions, except per share data)\n\nRevenues Cost of sales\n\nGross profit Gross margin\n\nDemand creation expense Operating overhead expense\n\nTotal selling and administrative expense % of revenues\n\nInterest expense (income), net\n\nOther (income) expense, net Income before income taxes\n\nIncome tax expense Effective tax rate\n\nNET INCOME Diluted earnings per common share\n\n$\n\n$ $\n\nFISCAL 2023\n\n51,217 28,925\n\n22,292\n\n43.5 %\n\n4,060 12,317\n\n16,377\n\n32.0 % (6)\n\n(280) 6,201\n\n1,131\n\n18.2 %\n\n5,070 3.23\n\n$\n\n$ $\n\nFISCAL 2022\n\n46,710 25,231\n\n21,479\n\n46.0 %\n\n3,850 10,954\n\n14,804\n\n31.7 % 205\n\n(181) 6,651\n\n605 9.1 %\n\n6,046 3.75\n\n% CHANGE\n\n10 % $ 15 %\n\n4 %\n\n5 % 12 %\n\n11 %\n\n—\n\n— -7 %\n\n87 %\n\n16 % $ -14 % $\n\nFISCAL 2021\n\n% CHANGE\n\n44,538 24,576\n\n5 % 3 %\n\n19,962\n\n8 %\n\n44.8 %\n\n3,114 9,911\n\n24 % 11 %\n\n13,025\n\n14 %\n\n29.2 % 

In [10]:
# vector search with combinations of metadata filtering

f = (Text("text") % "profit") | (Text("text") % "revenue")
rds.similarity_search_with_score(query="Nike company revenue", k=4, filter=f)

[(Document(metadata={'source': 'resources/nke-10k-2023.pdf'}, page_content='As discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -43 %\n\n5,343 25\n\n11 % 308 %\n\nTOTAL NIKE BRAND Converse\n\n$\n\n48,763 $ 2,427\n\n44,436 2,34

In [11]:
# filter results to a certain distance threshold
rds.similarity_search_with_score(query="Nike company revenue", k=4, distance_threshold=0.3)

[(Document(metadata={'source': 'resources/nke-10k-2023.pdf'}, page_content='As discussed in Note 15 — Operating Segments and Related Information in the accompanying Notes to the Consolidated Financial Statements, our operating segments are evidence of the structure of the Company\'s internal organization. The NIKE Brand segments are defined by geographic regions for operations participating in NIKE Brand sales activity.\n\nThe breakdown of Revenues is as follows:\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES FISCAL 2021\n\n% CHANGE\n\nNorth America Europe, Middle East & Africa Greater China\n\n$\n\n21,608 $ 13,418 7,248\n\n18,353 12,479 7,547\n\n18 % 8 % -4 %\n\n18 % $ 21 % 4 %\n\n17,179 11,456 8,290\n\n7 % 9 % -9 %\n\nAsia Pacific & Latin America Global Brand Divisions\n\n(3)\n\n(2)\n\n6,431 58\n\n5,955 102\n\n8 % -43 %\n\n17 % -43 %\n\n5,343 25\n\n11 % 308 %\n\nTOTAL NIKE BRAND Converse\n\n$\n\n48,763 $ 2,427\n\n44,436 2,34
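
The `distance_threshold` argument keeps only results whose vector distance to the query is at or below the given value. The following is a minimal pure-Python sketch of that idea, assuming the index uses cosine distance (1 minus cosine similarity). The vectors and names are made up for illustration, and in the notebook Redis performs this computation server-side.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def within_threshold(query_vec, candidates, threshold):
    """Score every candidate, drop those beyond the threshold, sort by distance."""
    scored = [(name, cosine_distance(query_vec, vec)) for name, vec in candidates.items()]
    kept = [(name, dist) for name, dist in scored if dist <= threshold]
    return sorted(kept, key=lambda pair: pair[1])


toy_query = [1.0, 0.0]
toy_candidates = {
    "same_direction": [2.0, 0.0],   # distance 0
    "orthogonal": [0.0, 1.0],       # distance 1
    "close": [1.0, 0.1],            # small distance
}
toy_hits = within_threshold(toy_query, toy_candidates, threshold=0.3)
print([name for name, _ in toy_hits])
```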

## RAG with LangChain
LangChain makes it easy to take this vector store and build retrieval augmented generation (RAG) applications over your data.

### Initialize OpenAI

You need to supply an OpenAI API key (starts with `sk-...`). If the key is already set in your environment as `OPENAI_API_KEY`, it is picked up automatically; otherwise, enter it when prompted below. You can find your API key at https://platform.openai.com/account/api-keys

In [12]:
import getpass
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(openai_api_key=os.getenv("OPENAI_API_KEY") or getpass.getpass(prompt="OpenAI API Key:"))

### Setup prompt

In [13]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

prompt_template = """
    Use the following pieces of context from financial 10k filings data to answer the user question at the end. 
    If you don't know the answer, say that you don't know, don't try to make up an answer.

    Context:
    ---------
    {context}
    ---------
    Question:
    {question}
    Answer:
"""

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

prompt = ChatPromptTemplate.from_template(prompt_template)
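
To see roughly what the model will receive, it can help to render the template by hand. The sketch below is a simplified stand-in that uses plain string formatting rather than `ChatPromptTemplate`; `FakeDoc`, the shortened template, and the sample texts are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


def format_docs(docs):
    # Same helper as above: join chunk texts with blank lines between them
    return "\n\n".join(doc.page_content for doc in docs)


template = "Context:\n---------\n{context}\n---------\nQuestion:\n{question}\nAnswer:"

sample_docs = [
    FakeDoc("Revenues were $51,217 million in fiscal 2023."),
    FakeDoc("Gross margin was 43.5%."),
]
rendered = template.format(
    context=format_docs(sample_docs),
    question="What was the gross margin?",
)
print(rendered)
```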

### Putting it all together

This is where LangChain brings all the components together in the form of a simple RAG application over the financial PDF document.

In [14]:
rag_chain = (
    {
        "context": rds.as_retriever() | format_docs,
        "question": RunnablePassthrough()
    }
    | prompt
    | llm
    | StrOutputParser()
)
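
The `|` syntax can look opaque at first. As a rough mental model only (this is not how LangChain implements it), the sketch below chains plain functions with `|` and fans one input out to several branches, mirroring the shape of `rag_chain`. The `Step` class, `parallel`, `fake_retriever`, and `fake_llm` are all hypothetical stand-ins.

```python
class Step:
    """Wraps a function so steps can be chained with the | operator."""

    def __init__(self, fn):
        self.fn = fn

    def __or__(self, other):
        # Accept either another Step or a plain callable on the right-hand side
        nxt = other if isinstance(other, Step) else Step(other)
        return Step(lambda x: nxt.fn(self.fn(x)))

    def invoke(self, x):
        return self.fn(x)


def parallel(branches):
    """Feed the same input to every branch and collect the results in a dict."""
    return Step(lambda x: {key: fn(x) for key, fn in branches.items()})


def fake_retriever(question):  # hypothetical stand-in for rds.as_retriever()
    return ["NIKE revenues were $51,217 million in fiscal 2023."]


def fake_llm(prompt_text):  # hypothetical stand-in for the chat model
    return f"[echo] {prompt_text}"


toy_chain = (
    parallel({
        "context": lambda q: "\n\n".join(fake_retriever(q)),
        "question": lambda q: q,
    })
    | (lambda d: f"Context: {d['context']} Question: {d['question']}")
    | fake_llm
)

toy_answer = toy_chain.invoke("What was the revenue?")
print(toy_answer)
```

The question string flows into both branches, the resulting dict fills the prompt, and the prompt text is handed to the model stand-in.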

### Finally - let's ask questions!



In [15]:
query = "What was Nike's revenue last year compared to this year??"
rag_chain.invoke(query)

"Nike's revenue last year was $44,538 million, and this year it was $51,217 million."

In [16]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
rag_chain.invoke(query)

'The exact number of products Nike offers is not explicitly stated in the provided context. However, Nike is part of the athletic footwear, apparel, and equipment industry, which is highly competitive both in the United States and worldwide.'

In [17]:
query = "Is Nike an ethical company?"
rag_chain.invoke(query)

'Based on the provided information, there is no specific mention or data that directly addresses the ethical practices of Nike as a company. Therefore, it is not possible to determine if Nike is an ethical company based on the provided context.'

## Cleanup

Clean up the index and data.

In [18]:
# Reuse the index handle already held by the vector store
# (the same `_index` attribute accessed earlier in this notebook).
# drop=True also deletes the stored documents along with the index.
rds._index.delete(drop=True)

In [19]:
import redisvl

redisvl.__version__

'0.4.0'

In [20]:
import redis

redis.__version__

'5.2.1'

In [21]:
REDIS_URL

'redis://:@localhost:6379'