# Multi-document Single-index RAG with LangChain and Redis Hybrid Search

## Environment Setup

In [1]:
import json
import os
import warnings
warnings.filterwarnings("ignore")
dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["ROOT_DIR"] = parent_directory
print(dir_path)
print(parent_directory)

/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/multi_doc_RAG
/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss


### Install Python Dependencies

In [2]:
!pip3 install -r $ROOT_DIR/requirements.txt



### Configure your Redis Stack


In [3]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
os.environ["REDIS_URL"] = REDIS_URL
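
The cell above reads `REDIS_PASSWORD` but does not place it in the URL, which matters for a password-protected Redis Cloud endpoint. The sketch below shows one way to assemble the URL, assuming the usual `redis://` and `rediss://` schemes; `build_redis_url` is a hypothetical helper and not part of this repository.

```python
from urllib.parse import quote


def build_redis_url(host, port, password="", ssl=False):
    """Hypothetical helper: assemble a Redis connection URL."""
    # rediss:// (double "s") selects a TLS connection, redis:// a plain one.
    scheme = "rediss" if ssl else "redis"
    # Percent-encode the password so characters such as '@' or '/' stay valid.
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url("localhost", "6379"))
```

With an empty password the result is the same URL the notebook builds, so the helper can be swapped in without changing local behavior.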

### SentenceTransformerEmbeddings Models Cache folder
We use `SentenceTransformerEmbeddings` in this demo, and here we specify its cache folder. If you have already downloaded the models to the local file system, point to that folder here; otherwise the library downloads the models into this folder.

In particular, these models will be downloaded if not present in the cache folder:

models/models--sentence-transformers--all-MiniLM-L6-v2

models/models--sentence-transformers--all-mpnet-base-v2


In [4]:
# Point the cache at the locally downloaded sentence-transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"

## RAG with LangChain

### Create Custom index based on your data using RedisVL

In [5]:
from redisvl.index import SearchIndex
from redisvl.schema import IndexSchema
from redis import Redis
index_name = 'langchain'
prefix = 'chunk'
schema = IndexSchema.from_yaml('sec_index.yaml')
client = Redis.from_url(REDIS_URL)
# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)
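
`sec_index.yaml` is not shown in this notebook. As a rough guide only, the sketch below writes a comparable schema as a Python dict. The `ticker`, `doc_type` and `content` fields are the ones the filters later in the notebook refer to, and 384 is the embedding size of `all-MiniLM-L6-v2`. The vector field name `chunk_vector`, the HNSW algorithm and the cosine metric are assumptions, and the real file may differ.

```python
# Hypothetical stand-in for sec_index.yaml, written as a Python dict.
schema_dict = {
    "index": {"name": "langchain", "prefix": "chunk", "storage_type": "hash"},
    "fields": [
        {"name": "ticker", "type": "tag"},      # e.g. AAPL, AMZN
        {"name": "doc_type", "type": "tag"},    # e.g. 10K, earning_call
        {"name": "content", "type": "text"},    # chunk text, full-text searchable
        {
            "name": "chunk_vector",             # assumed field name
            "type": "vector",
            "attrs": {
                "dims": 384,                    # all-MiniLM-L6-v2 output size
                "algorithm": "hnsw",            # assumption
                "distance_metric": "cosine",    # assumption
                "datatype": "float32",
            },
        },
    ],
}

field_names = [field["name"] for field in schema_dict["fields"]]
print(field_names)
```

If the layout matches what your installed RedisVL version expects, a dict of this shape can be handed to `IndexSchema.from_dict` in place of the YAML file.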

In [6]:
# get info about the index
!rvl index info -i langchain

/bin/sh: rvl: command not found


### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load financial documents (10-K filings and earnings call transcripts) and preprocess them using some helpers from LangChain:

- `UnstructuredFileLoader` is one of several document loader types that LangChain provides. Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
- `SentenceTransformersTokenTextSplitter` is what we use to create smaller chunks of text from the doc. Docs: https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html
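
To illustrate what chunking does without downloading a tokenizer, here is a minimal sketch of an overlapping sliding window. It splits on whitespace rather than on model tokens, so it is a simplified stand-in for `SentenceTransformersTokenTextSplitter` and not its actual algorithm; `chunk_tokens` and its defaults are hypothetical.

```python
def chunk_tokens(text, chunk_size=200, overlap=20):
    """Split text into overlapping windows of whitespace-delimited tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    tokens = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_tokens(sample, chunk_size=4, overlap=1))
```

The overlap repeats the tail of one chunk at the head of the next, which reduces the chance that a sentence such as a deferred-revenue disclosure is cut in half at a chunk boundary.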

In [7]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))

In [8]:
from ingestion import get_sec_data
from ingestion import redis_bulk_upload 
sec_data = get_sec_data()

/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/multi_doc_RAG
/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss
 ✅ Loaded doc info for  112 tickers...


In [9]:
sec_data

{'VZ': {'10K_files': [],
  'metadata_file': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/VZ/VZ-metadata.json'],
  'transcript_files': []},
 'AMZN': {'10K_files': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-2023-10K.pdf',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-2022-10K.pdf',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-2021-10K.pdf'],
  'metadata_file': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-metadata.json'],
  'transcript_files': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/earning_calls/AMZN/2020-Jan-30-AMZN.txt',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/earning_calls/AMZN/2016-Apr-28-AMZN.txt',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/earning_calls/A

In [10]:
redis_bulk_upload(sec_data, index, embeddings, tickers=['AAPL', 'AMZN'])

✅ Loaded 108 10K chunks for ticker=AAPL from AAPL-2021-10K.pdf
✅ Loaded 94 10K chunks for ticker=AAPL from AAPL-2023-10K.pdf
✅ Loaded 103 10K chunks for ticker=AAPL from AAPL-2022-10K.pdf
✅ Loaded 27 earning_call chunks for ticker=AAPL from 2018-May-01-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2019-Oct-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2016-Jan-26-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2020-Jul-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2017-Aug-01-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2020-Jan-28-AAPL.txt
✅ Loaded 34 earning_call chunks for ticker=AAPL from 2016-Apr-26-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2017-Jan-31-AAPL.txt
✅ Loaded 28 earning_call chunks for ticker=AAPL from 2019-Apr-30-AAPL.txt
✅ Loaded 26 earning_call chunks for ticker=AAPL from 2017-Nov-02-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2016-Oct-25-AAPL.tx

## Vector Search with LangChain
**Important note**: The LangChain Redis vector store does not support the JSON data type yet; only HASH is supported for now. This update should be coming soon.

In [11]:
from langchain_community.vectorstores import Redis as LangChainRedis
from utils import create_langchain_schemas_from_redis_schema

index_name = 'langchain'

vec_schema , main_schema = create_langchain_schemas_from_redis_schema('sec_index.yaml')

rds = LangChainRedis.from_existing_index( embedding=embeddings, 
                                          index_name= index_name, 
                                          schema = main_schema)

### Query the database
Now we can use the LangChain vector store class to perform similarity search operations on Redis.

In [12]:
from langchain.vectorstores.redis import RedisText
from langchain.vectorstores.redis import RedisTag

In [14]:
f = RedisTag("ticker") == "AAPL"
rds.similarity_search(query="How many employees work at this company???", k=4, distance_threshold=0.8, filter=f)

[Document(page_content='The Company has historically experienced higher net sales in its first quarter compared to other quarters in its fiscal year due in part to seasonal holiday demand. Additionally, new product and service introductions can significantly impact net sales, cost of sales and operating expenses. The timing of product introductions can also impact the Company’s net sales to its indirect distribution channels as these channels are filled with new inventory following a product launch, and channel inventory of an older product often declines as the launch of a newer product approaches. Net sales can also be affected when consumers and distributors anticipate a product introduction.\n\nHuman Capital\n\nThe Company believes it has a talented, motivated, and dedicated team, and is committed to supporting the development of all of its team members and to continuously building on its strong culture. As of September 25, 2021, the Company had approximately 154,000 full-time equi

In [15]:
f = RedisTag("ticker") == "AAPL"
rds.similarity_search(query="What did Tim Cook said in 2020 earning calls regarding NANDs?", k=4, distance_threshold=0.8, filter=f)

[Document(page_content="Thank you. Good afternoon, and thanks to everyone for joining us. Speaking first today is Apple's CEO, Tim Cook; and he'll be followed by CFO, Luca Maestri. After that, we'll open the call to questions from analysts. Please note that some of the information you'll hear during our discussion today will consist of forward-looking statements, including, without limitation, those regarding revenue, gross margin, operating expenses, other income and expense, taxes, capital allocation and future business outlook. Actual results or trends could differ materially from our forecast. For more information, please refer to the risk factors discussed in Apple's most recently filed periodic reports on Form 10-K and Form 10-Q and the Form 8-K filed with the SEC today, along with the associated press release. Apple assumes no obligation to update any forward-looking statements or information which speak as of their respective dates. I'd now like to turn the call over to Tim for

In [16]:
f = RedisTag("doc_type") == "earning_call"
rds.similarity_search(query="What did Tim Cook said in 2020 earning calls regarding NANDs?", k=4, distance_threshold=0.8, filter=f)

[Document(page_content="Thank you. Good afternoon, and thanks to everyone for joining us. Speaking first today is Apple's CEO, Tim Cook; and he'll be followed by CFO, Luca Maestri. After that, we'll open the call to questions from analysts. Please note that some of the information you'll hear during our discussion today will consist of forward-looking statements, including, without limitation, those regarding revenue, gross margin, operating expenses, other income and expense, taxes, capital allocation and future business outlook. Actual results or trends could differ materially from our forecast. For more information, please refer to the risk factors discussed in Apple's most recently filed periodic reports on Form 10-K and Form 10-Q and the Form 8-K filed with the SEC today, along with the associated press release. Apple assumes no obligation to update any forward-looking statements or information which speak as of their respective dates. I'd now like to turn the call over to Tim for

In [17]:
# vector search with combinations of metadata filtering
f = (RedisText("content") % "profit") | (RedisText("content") % "revenue")

rds.similarity_search_with_score(query="Apple company revenue", k=4, filter=f)


[(Document(page_content="Earlier this month, released macOS Catalina with all new entertainment apps, innovative Sidecar feature that uses iPad to expand Mac workspace and new accessibility tools that enable users to control their Mac entirely with their voice. 1. Catalina brings Apple Arcade experience to Mac. 1. Already seeing some third-party developers bring their iPad apps to Mac App Store with Mac Catalyst, including Twitter, Post-it and more. 4. Launching newly redesigned Mac Pro this fall, which Co. is manufacturing in Austin, Texas. 7. Others: 1. In FY19, crossed $100b in revenue in US for first time. 2. Introduce new services from Apple Card to Apple TV+ and generated over $46b in total Services revenue, setting new yearly Services records in all five geographic segments and driving Services business to size of Fortune 70 co. 3. Delivered new hardware in all device categories. 4. Wearables business showed explosive growth and generated more annual revenue than two-thirds of c

In [18]:
# filter results to a certain distance threshold
rds.similarity_search_with_score(query="Nike company revenue", k=4, distance_threshold=0.5)

[]
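
The empty list above is the distance threshold at work: no stored chunk is close enough to the Nike query, so nothing is returned. The sketch below reproduces the idea in plain Python, assuming the index uses cosine distance (the schema file is not shown, so this is an assumption) and treating the threshold as inclusive; the vectors are toy values, not real embeddings.

```python
import math


def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def within_threshold(query_vec, candidates, distance_threshold):
    """Return (name, distance) pairs no farther than the threshold, nearest first."""
    scored = [(name, cosine_distance(query_vec, vec)) for name, vec in candidates.items()]
    kept = [pair for pair in scored if pair[1] <= distance_threshold]
    return sorted(kept, key=lambda pair: pair[1])


query = [1.0, 0.0]
candidates = {"close": [0.9, 0.1], "orthogonal": [0.0, 1.0]}
print(within_threshold(query, candidates, 0.5))    # only the nearby vector survives
print(within_threshold(query, candidates, 0.001))  # too strict: nothing is returned
```

A lower threshold trades recall for precision, which is why 0.5 filters out the unrelated Nike query while the 0.8 used elsewhere in this notebook lets more loosely related chunks through.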

## RAG with Ollama running Llama 3 LLM

### Initialize a Llama LLM served via Ollama
To connect to a local LLM served by Ollama, use the LLM below. If you instead run a local OpenAI-compatible server (for example via vLLM), plug in your LLM here.

In [19]:
from langchain_community.llms import Ollama
llm = Ollama(model="llama3")

### Setup prompt
`PromptTemplate` defines the exact text of the prompt that is fed to the LLM. This step is optional; the defaults usually work well for OpenAI models but might fall short for other models.

In [20]:
def get_prompt():
    """Create the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

### Putting it all together

This is where LangChain brings all the components together in the form of a simple RAG application over the financial documents.

In [21]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}
    

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold",
                               search_kwargs={"distance_threshold":0.8, 'include_metadata': True}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)
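
Conceptually, the `"stuff"` chain type places all retrieved chunks into the `{context}` slot of a single prompt. The sketch below mimics that step with plain string formatting. It uses a shortened copy of the template and assumes chunks are joined by blank lines; `stuff_prompt` is a hypothetical helper, not LangChain's implementation.

```python
TEMPLATE = """Context:
---------
{context}
---------
Question: {question}
Answer:"""


def stuff_prompt(chunks, question, template=TEMPLATE):
    """Join all retrieved chunks into one context block and fill the template."""
    context = "\n\n".join(chunks)
    return template.format(context=context, question=question)


prompt_text = stuff_prompt(["chunk one", "chunk two"], "What was revenue?")
print(prompt_text)
```

Because every retrieved chunk ends up in the same prompt, an irrelevant chunk that passes the distance threshold is presented to the LLM with the same weight as a relevant one.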

### Finally - let's ask questions!



In [22]:
query = "What was Apple's revenue last year compared to this year??"
res=qa(query)
res['result']

> Entering new RetrievalQA chain...

> Finished chain.


"Based on the provided context from financial 10K filings data, we can answer the question.\n\nLast year's revenue (from the previous quarter) is not explicitly mentioned in the text. However, we can infer that it was $5.5 billion, excluding the one-time payment of $548 million received from a patent infringement dispute.\n\nThis year's revenue (mentioned in the 10K filing for the current quarter) is approximately $6.1 billion, which includes the $548 million one-time payment. Excluding this amount, the revenue would be around $5.5 billion, similar to last year.\n\nSo, while there is some growth from $5.5 billion to $6.1 billion, it's not a significant increase. The 15% growth mentioned in the text refers to the sequential quarter-over-quarter growth, rather than year-over-year growth."

In [23]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
res=qa(query)
res['result']

> Entering new RetrievalQA chain...

> Finished chain.


"I don't know the answer to this question because it was not provided in the given context from Apple Inc.'s 2021 Form 10-K filing. The context only discusses Apple's business, competition, and supply chain, but does not mention Nike or its products. Therefore, I cannot provide an answer.\n\nSource: Apple Inc. 2021 Form 10-K"

In [28]:
query = "what was the deferred revenue of Apple in 2022?"
res=qa(query)
res

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what was the deferred revenue of Apple in 2022?',
 'result': 'The financial statements provided do not include information about deferred revenue for Apple Inc. in 2022 or any other year. The provided data includes information about commercial paper, repurchase agreements, and term debt, as well as segment operating performance and macroeconomic conditions, but it does not include specific information about deferred revenue.',
 'source_documents': [Document(page_content='The Company’s proportion of net sales by disaggregated revenue source was generally consistent for each reportable segment in Note 11, “Segment Information and Geographic Data” for 2022, 2021 and 2020, except in Greater China, where iPhone revenue represented a moderately higher proportion of net sales in 2022 and 2021.\n\nAs of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% 

Wrong answer. The 2022 Apple 10-K states: "As of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% of total deferred revenue to be realized in less than a year, 27% within one-to-two years, 7% within two-to-three years and 2% in greater than three years." The first source document listed above contains this passage, so the chunk was retrieved but the LLM did not use it.

In [29]:
query = "what was revenue of Apple in 2022?"
res=qa(query)
res['result']

> Entering new RetrievalQA chain...

> Finished chain.


'According to the provided Form 10-K filing for Apple Inc. as of 2023, the company\'s revenue for 2022 is not explicitly stated. However, we can find the relevant information by analyzing the tables and text.\n\nThe table "Segment Operating Performance" shows net sales by reportable segment for 2023, 2022, and 2021. The total net sales for 2022 are mentioned as $394,328 million.\n\nLater in the filing, it is mentioned that "In FY19, crossed $100b in revenue in US for first time." This suggests that Apple\'s revenue in the United States was $100 billion in fiscal year 2019. Since this information is not relevant to our question, we can ignore it.\n\nWe are still missing the total revenue figure for 2022. To find this out, we need to analyze the change percentages and numbers provided in the table "Segment Operating Performance".\n\nLet\'s look at the Americas segment:\n\n* Total net sales: $169,658 million (2022)\n* Change from 2021: +11% ($23,658 million)\n\nWe can calculate the total 

In [31]:
query = "How many employees work at Nike???"
res=qa(query)
res

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'How many employees work at Nike???',
 'result': 'Question: How many employees work at Amazon?\nAnswer: As of September 25, 2021, the Company had approximately 154,000 full-time equivalent employees.\n\nSource: [Amazon.com, Inc. - Form 10-K (2021)](https://www.sec.gov/Archives/edgar/data/1127599/000119279221144395/a02-16713_10-k.htm)',
 'source_documents': [Document(page_content="This included changes to over 150 of our processes to provide for social distancing as well as costs to onboard and train over 175,000 new employees who are hired to meet the higher customer demand. This $4 billion also included investments in personal protective equipment for employees and enhanced cleaning for our facilities. Our consolidated revenue and operating income significantly exceeded the top end of our guidance range. Strong top line performance was driven by increased consumer demand, led by Prime members. We continue to see high Prime member engagement throughout the quarter. Prime memb

Wrong answer: the index holds no Nike data, but retrieval still returned unrelated chunks and the LLM hallucinated an answer from that context.

### Adding query analysis and hybrid search in QA chain

In [32]:
from custom_ners import get_redis_filters

 ✅ Loaded doc info for  112 tickers...
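
The `custom_ners` module is not shown in this notebook. To give a feel for what a query-analysis step of this kind could do, here is a toy sketch that maps company names and tickers in a question to a Redis tag filter. The alias table and `toy_ticker_filter` are hypothetical and far simpler than a real NER component.

```python
import re

# Hypothetical alias table; a real system would cover all loaded tickers.
KNOWN_TICKERS = {"aapl": "aapl", "apple": "aapl", "amzn": "amzn", "amazon": "amzn"}


def toy_ticker_filter(question):
    """Return a Redis tag filter such as '@ticker:{aapl}', or None if no ticker is found."""
    words = re.findall(r"[a-z]+", question.lower())
    tickers = []
    for word in words:
        ticker = KNOWN_TICKERS.get(word)
        if ticker and ticker not in tickers:
            tickers.append(ticker)
    if not tickers:
        return None
    # Inside a tag filter, '|' matches any of the listed values.
    return "@ticker:{" + " | ".join(tickers) + "}"


print(toy_ticker_filter("what is the revenue of aapl?"))
```

Returning `None` for a company that is not in the table (Nike, in the earlier example) gives the caller a chance to decline the question instead of searching the whole index.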


In [33]:
# Plug in your own query analysis here (NER, topic detection, intent detection, semantic routing, etc.)
def query_analysis(q):
    filters = get_redis_filters(q)
    print(filters)
    return filters
    

def ask_question(question,
                 filters = None,
                 filter_strategy = 'AND',
                 distance_threshold =0.8,
                 search_type="similarity_distance_threshold"):
    
    q_filters = query_analysis(question)
    print(f"inferred filters: {q_filters}")
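    if filters is None:
        filters = q_filters
    elif q_filters is not None:
        # Combine only when query analysis inferred a filter; otherwise keep the caller's filter as-is
        filters = " ( "+q_filters + " ) " + filter_strategy+ " ( " + filters + " ) "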
    
    print(f"Final filters: {filters} to apply")
    if filters is not None:
        search_args = {"distance_threshold":distance_threshold, 
                   'include_metadata': True, 
                   'filter':filters}
    else:
        search_args = {"distance_threshold":distance_threshold, 
                   'include_metadata': True}
        
    fqa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=rds.as_retriever(search_type=search_type,
                                   search_kwargs= search_args),
        return_source_documents=True,
        chain_type_kwargs={"prompt": get_prompt()},
        verbose=True
    )
    response = fqa(question)
    return response  
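
`ask_question` joins the two filter strings with the literal word held in `filter_strategy`. As far as I understand Redis query syntax, intersection is written by placing clauses side by side and union with `|`, while words such as `AND` or `OR` are parsed as ordinary search terms (and usually dropped as stop words). If that holds for your Redis version, `filter_strategy='OR'` would not produce a union. The sketch below is a hypothetical `combine_filters` helper that emits the operator forms instead.

```python
def combine_filters(inferred, extra, strategy="AND"):
    """Hypothetical helper: merge two Redis query filter strings."""
    # Ignore missing or blank filters.
    parts = [p.strip() for p in (inferred, extra) if p and p.strip()]
    if not parts:
        return None
    if len(parts) == 1:
        return parts[0]
    mode = strategy.upper()
    if mode == "AND":
        joiner = " "      # juxtaposition means intersection
    elif mode == "OR":
        joiner = " | "    # pipe means union
    else:
        raise ValueError(f"unsupported strategy: {strategy}")
    return joiner.join(f"({p})" for p in parts)


print(combine_filters("@ticker:{aapl}", "@doc_type:{10K}"))
print(combine_filters("@ticker:{aapl}", "@doc_type:{10K}", strategy="OR"))
```

The helper also covers the case where only one of the two filters is present, so the caller does not need a separate branch for it.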

In [34]:
ask_question("what is the revenue of aapl?")

@ticker:{aapl}
inferred filters: @ticker:{aapl}
Final filters: @ticker:{aapl} to apply

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the revenue of aapl?',
 'result': 'Question: What is the revenue of AAPL?\nAnswer: $58.3b (from the 2Q18 Financials section)\nSource: II. 2Q18 Financials (L. M.)',
 'source_documents': [Document(page_content="Generated almost $34b in earnings in six month; bullish on Co.'s future. 12. Has best pipeline of products and services Co. ever had. 1. Has huge installed base of active devices that is growing across all products. 1. Has highest customer loyalty and satisfaction in industry. 13. Services business is growing dramatically. 14. Balance sheet and cash flow generation are strong. 1. Allows Co. to invest significantly in product roadmap and still return meaningful amount of capital to shareholders. 15. Recent corporate tax reform enables Co. to deploy global cash more efficiently. 1. In US, expects direct investment in economy to exceed $350b over next five years, including $30b in CapEx and Co. expects to create over 20,000 US jobs at AAPL over that time frame. 16.

In [35]:
ask_question("what is the revenue of aapl in 2022?", filters = "@doc_type:{10K}")

@ticker:{aapl}
inferred filters: @ticker:{aapl}
Final filters:  ( @ticker:{aapl} ) AND ( @doc_type:{10K} )  to apply

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the revenue of aapl in 2022?',
 'result': 'The revenue data for Apple Inc. (AAPL) is not directly provided in the text. However, we can deduce the net sales for 2021 and 2020 from the given context.\n\nFrom the text:\n\n* 2021:\n\t+ Total net sales: $133,803 million\n\t+ U.S.: $68,366 million\n\t+ China: $40,308 million\n\t+ Other countries: $125,010 million\n* 2020:\n\t+ Total net sales: $109,197 million\n\t+ U.S.: NA (not available)\n\t+ China: $7,256 million\n\t+ Other countries: $101,941 million\n\nTo find the revenue for 2022, we can look at the "Segment Operating Performance" table provided in the text. However, this table only provides data for 2023 and 2021.\n\nUnfortunately, there is no direct mention of Apple\'s revenue for 2022. We cannot determine the revenue for 2022 based on the given context.',
 'source_documents': [Document(page_content='Research and Development\n\nThe year-over-year growth in R&D expense in 2022 was driven primarily by increases in h

In [36]:
ask_question("what is the total deferred revenue of aapl in 2022?", filters = "@doc_type:{10K} AND @content:(deferred revenue)")

@ticker:{aapl}
inferred filters: @ticker:{aapl}
Final filters:  ( @ticker:{aapl} ) AND ( @doc_type:{10K} AND @content:(deferred revenue) )  to apply

> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the total deferred revenue of aapl in 2022?',
 'result': 'According to the provided financial data from Apple Inc.\'s Form 10-K filing for 2022, the total deferred revenue is $5.9 billion.\n\nThis information can be found in the "Unrealized losses" section under "Deferred tax assets" in the "Deferred Tax Assets and Liabilities" note on page 41 of the filing.',
 'source_documents': [Document(page_content='The Company’s proportion of net sales by disaggregated revenue source was generally consistent for each reportable segment in Note 11, “Segment Information and Geographic Data” for 2022, 2021 and 2020, except in Greater China, where iPhone revenue represented a moderately higher proportion of net sales in 2022 and 2021.\n\nAs of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% of total deferred revenue to be realized in less than a year

Correct retrieval by Redis search, but wrong extraction and generation by the LLM!
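
The answer above quotes $5.9 billion although the retrieved chunk says $12.4 billion. A cheap guard against this kind of mismatch is to check whether the dollar amounts in the answer appear anywhere in the retrieved text. The sketch below is one crude way to do that with a regular expression; `ungrounded_amounts` is a hypothetical helper, and plain substring matching will miss reformatted figures (for example "$12,400 million").

```python
import re

# Matches amounts such as "$5.9 billion", "$394,328 million" or "$548".
MONEY = re.compile(
    r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s(?:million|billion|trillion))?",
    re.IGNORECASE,
)


def ungrounded_amounts(answer, sources):
    """Return dollar amounts quoted in the answer that appear in none of the sources."""
    context = " ".join(sources).lower()
    missing = []
    for amount in MONEY.findall(answer):
        if amount.lower() not in context and amount not in missing:
            missing.append(amount)
    return missing


answer = "the total deferred revenue is $5.9 billion."
sources = [
    "the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively."
]
print(ungrounded_amounts(answer, sources))
```

A non-empty result is a signal to retry generation or flag the response for review; it does not prove that the answer is wrong.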

In [37]:
rds.similarity_search_with_score(query="what is the total deferred revenue of Apple in 2022?", k=5, filter='(@content:(deferred) | @content:(revenue))')

[(Document(page_content='The Company’s proportion of net sales by disaggregated revenue source was generally consistent for each reportable segment in Note 11, “Segment Information and Geographic Data” for 2022, 2021 and 2020, except in Greater China, where iPhone revenue represented a moderately higher proportion of net sales in 2022 and 2021.\n\nAs of September 24, 2022 and September 25, 2021, the Company had total deferred revenue of $12.4 billion and $11.9 billion, respectively. As of September 24, 2022, the Company expects 64% of total deferred revenue to be realized in less than a year, 27% within one-to-two years, 7% within two-to-three years and 2% in greater than three years.\n\nApple Inc. | 2022 Form 10-K | 37\n\n2020\n\n137,781 28,622 23,724 30,620 53,768 274,515\n\nNote 3 – Financial Instruments\n\nCash, Cash Equivalents and Marketable Securities\n\nThe following tables show the Company’s cash, cash equivalents and marketable securities by significant investment category as

Correct retrieval by Redis search!

## Cleanup

Clean up the index and data.

In [None]:
#rds.drop_index(index_name=index_name, redis_url=REDIS_URL, delete_documents=True)