# Multi-document Single-index RAG with LangChain and Redis Hybrid Search

## Environment Setup

In [2]:
import json
import os
import warnings
warnings.filterwarnings("ignore")
dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["ROOT_DIR"] = parent_directory
print(dir_path)
print(parent_directory)

/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/multi_doc_RAG
/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss


### Install Python Dependencies

In [2]:
!pip3 install -r $ROOT_DIR/requirements.txt



### Configure your Redis Stack


In [3]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
os.environ["REDIS_URL"] = REDIS_URL
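
The cell above reads `REDIS_PASSWORD` but does not put it into the URL. As a minimal sketch, assuming a password-protected or TLS-enabled endpoint, the URL could be assembled as follows. `build_redis_url` is a hypothetical helper, not part of this notebook or of any Redis library.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "", use_ssl: bool = False) -> str:
    """Build a Redis connection URL (hypothetical helper, not part of the notebook)."""
    scheme = "rediss" if use_ssl else "redis"  # rediss:// enables TLS
    # Percent-encode the password so characters such as '@' or ':' do not break the URL
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


print(build_redis_url("localhost", "6379"))
print(build_redis_url("redis.example.com", "6380", password="s3cr@t", use_ssl=True))
```

With no password and no SSL the result matches the `REDIS_URL` built in the cell above.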

### SentenceTransformerEmbeddings Models Cache folder
This demo uses `SentenceTransformerEmbeddings`, and here we set its cache folder. If you have already downloaded the models to the local file system, point this folder at them; otherwise the library downloads the models into this folder on first use.

In particular, these models will be downloaded if not present in the cache folder:

models/models--sentence-transformers--all-MiniLM-L6-v2

models/models--sentence-transformers--all-mpnet-base-v2


In [4]:
# Set the folder that holds the locally downloaded sentence-transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"

## RAG with LangChain

### Create Custom index based on your data using RedisVL

In [5]:
from redisvl.index import SearchIndex
from redisvl.schema import IndexSchema
from redis import Redis
index_name = 'langchain'
prefix = 'chunk'
schema = IndexSchema.from_yaml('sec_index.yaml')
client = Redis.from_url(REDIS_URL)
# create an index from schema and the client
index = SearchIndex(schema, client)
index.create(overwrite=True, drop=True)

11:47:22 redisvl.index.index INFO   Index already exists, overwriting.
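
The contents of `sec_index.yaml` are not shown in this notebook. As a rough sketch only, a schema of this kind can be written as a plain dictionary. The field list below is an assumption that mirrors a few of the metadata keys appearing in the search results later in this notebook, and the vector field name, algorithm, and distance metric are guesses rather than the file's real values.

```python
# Hypothetical schema; the real sec_index.yaml may differ in every detail.
EMBEDDING_DIMS = 384  # output size of sentence-transformers/all-MiniLM-L6-v2

schema_dict = {
    "index": {"name": "langchain", "prefix": "chunk", "storage_type": "hash"},
    "fields": [
        {"name": "content", "type": "text"},
        {"name": "ticker", "type": "tag"},
        {"name": "doc_type", "type": "tag"},
        {"name": "source_doc", "type": "tag"},
        {
            "name": "content_vector",  # assumed vector field name
            "type": "vector",
            "attrs": {
                "dims": EMBEDDING_DIMS,
                "algorithm": "hnsw",          # assumption
                "distance_metric": "cosine",  # assumption
                "datatype": "float32",
            },
        },
    ],
}

# With redisvl installed, this dict could be loaded via IndexSchema.from_dict(schema_dict).
tag_fields = [f["name"] for f in schema_dict["fields"] if f["type"] == "tag"]
print(tag_fields)
```

Tag fields are the ones that support exact-match filters such as ticker or document type, while the text field supports full-text matching on the chunk content.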


In [6]:
# get info about the index
!rvl index info -i langchain

/bin/sh: rvl: command not found


### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load financial documents
(10-K filings and earnings call transcripts) and preprocess them using some helpers from LangChain:

- `UnstructuredFileLoader` is not the only document loader type that LangChain provides. Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
- `SentenceTransformersTokenTextSplitter` is what we use to create smaller chunks of text from the doc. Docs: https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html
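
To illustrate the chunking idea without loading a tokenizer, here is a simplified sketch that splits on whitespace-separated words instead of model tokens. It is a stand-in for, not a reimplementation of, `SentenceTransformersTokenTextSplitter`; the helper names and the tiny chunk size are assumptions chosen for readability. The `<source_doc>-<n>` id format follows the `chunk_id` values visible in the search results later in this notebook.

```python
def chunk_words(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into windows of `chunk_size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def to_records(source_doc: str, ticker: str, text: str) -> list[dict]:
    """Attach an id of the form '<source_doc>-<n>' to every chunk."""
    return [
        {"chunk_id": f"{source_doc}-{i}", "source_doc": source_doc, "ticker": ticker, "content": chunk}
        for i, chunk in enumerate(chunk_words(text, chunk_size=5, overlap=2))
    ]


sample = "one two three four five six seven eight nine ten"
for record in to_records("EXAMPLE-10K.pdf", "XYZ", sample):
    print(record["chunk_id"], "->", record["content"])
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk, at the cost of storing some words twice.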

In [7]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))

In [8]:
from ingestion import get_sec_data
from ingestion import redis_bulk_upload 
sec_data = get_sec_data()

/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/multi_doc_RAG
/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss
 ✅ Loaded doc info for  112 tickers...


In [9]:
sec_data

{'VZ': {'10K_files': [],
  'metadata_file': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/VZ/VZ-metadata.json'],
  'transcript_files': []},
 'AMZN': {'10K_files': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-2023-10K.pdf',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-2022-10K.pdf',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-2021-10K.pdf'],
  'metadata_file': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/filings/AMZN/AMZN-metadata.json'],
  'transcript_files': ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/earning_calls/AMZN/2020-Jan-30-AMZN.txt',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/earning_calls/AMZN/2016-Apr-28-AMZN.txt',
   '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/earning_calls/A

In [10]:
redis_bulk_upload(sec_data, index, embeddings, tickers=['AAPL', 'AMZN'])

✅ Loaded 108 10K chunks for ticker=AAPL from AAPL-2021-10K.pdf
✅ Loaded 94 10K chunks for ticker=AAPL from AAPL-2023-10K.pdf
✅ Loaded 103 10K chunks for ticker=AAPL from AAPL-2022-10K.pdf
✅ Loaded 27 earning_call chunks for ticker=AAPL from 2018-May-01-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2019-Oct-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2016-Jan-26-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2020-Jul-30-AAPL.txt
✅ Loaded 30 earning_call chunks for ticker=AAPL from 2017-Aug-01-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2020-Jan-28-AAPL.txt
✅ Loaded 34 earning_call chunks for ticker=AAPL from 2016-Apr-26-AAPL.txt
✅ Loaded 29 earning_call chunks for ticker=AAPL from 2017-Jan-31-AAPL.txt
✅ Loaded 28 earning_call chunks for ticker=AAPL from 2019-Apr-30-AAPL.txt
✅ Loaded 26 earning_call chunks for ticker=AAPL from 2017-Nov-02-AAPL.txt
✅ Loaded 31 earning_call chunks for ticker=AAPL from 2016-Oct-25-AAPL.tx

## Vector Search with LangChain
**Important Note**: The LangChain Redis vector store does not support the JSON data type yet; it only supports HASH for now. This update should be coming soon.

In [11]:
from langchain_community.vectorstores import Redis as LangChainRedis
from utils import create_langchain_schemas_from_redis_schema

index_name = 'langchain'

vec_schema , main_schema = create_langchain_schemas_from_redis_schema('sec_index.yaml')

rds = LangChainRedis.from_existing_index( embedding=embeddings, 
                                          index_name= index_name, 
                                          schema = main_schema)

### Query the database
Now we can use the LangChain vector store class to perform similarity search operations on Redis.

In [12]:
from langchain.vectorstores.redis import RedisText
from langchain.vectorstores.redis import RedisTag

In [13]:
f = RedisTag("ticker") == "AAPL"
rds.similarity_search(query="Profit How many employees work at this company???", k=4, distance_threshold=0.8, filter=f)

[Document(page_content='Senior Director of Corporate Accounting (Principal Accounting Officer)\n\nOctober 28, 2021\n\nOctober 28, 2021\n\n/s/ James A. Bell JAMES A. BELL\n\nDirector\n\nOctober 28, 2021\n\n/s/ Al Gore AL GORE\n\nDirector\n\nOctober 28, 2021\n\n/s/ Andrea Jung ANDREA JUNG\n\nDirector\n\nOctober 28, 2021\n\n/s/ Arthur D. Levinson ARTHUR D. LEVINSON\n\nDirector and Chair of the Board\n\nOctober 28, 2021\n\n/s/ Monica Lozano MONICA LOZANO\n\nDirector\n\nOctober 28, 2021\n\n/s/ Ronald D. Sugar RONALD D. SUGAR\n\nDirector\n\nOctober 28, 2021\n\n/s/ Susan L. Wagner SUSAN L. WAGNER\n\nDirector\n\nApple Inc. | 2021 Form 10-K | 60', metadata={'id': 'chunk:AAPL-2021-10K.pdf-107', 'chunk_id': 'AAPL-2021-10K.pdf-107', 'source_doc': 'AAPL-2021-10K.pdf', 'doc_type': '10K', 'ticker': 'AAPL', 'company_name': 'APPLE INC', 'sector': 'Information Technology', 'asset_class': 'Equity', 'location': 'United States', 'exchange': 'NASDAQ', 'currency': 'USD', 'market_value': '559365151.11', 'weig

In [14]:
f = RedisTag("ticker") == "AAPL"
rds.similarity_search(query="What did Tim Cook said in 2020 earning calls regarding NANDs?", k=4, distance_threshold=0.8, filter=f)

[Document(page_content='Senior Director of Corporate Accounting (Principal Accounting Officer)\n\nOctober 28, 2021\n\nOctober 28, 2021\n\n/s/ James A. Bell JAMES A. BELL\n\nDirector\n\nOctober 28, 2021\n\n/s/ Al Gore AL GORE\n\nDirector\n\nOctober 28, 2021\n\n/s/ Andrea Jung ANDREA JUNG\n\nDirector\n\nOctober 28, 2021\n\n/s/ Arthur D. Levinson ARTHUR D. LEVINSON\n\nDirector and Chair of the Board\n\nOctober 28, 2021\n\n/s/ Monica Lozano MONICA LOZANO\n\nDirector\n\nOctober 28, 2021\n\n/s/ Ronald D. Sugar RONALD D. SUGAR\n\nDirector\n\nOctober 28, 2021\n\n/s/ Susan L. Wagner SUSAN L. WAGNER\n\nDirector\n\nApple Inc. | 2021 Form 10-K | 60', metadata={'id': 'chunk:AAPL-2021-10K.pdf-107', 'chunk_id': 'AAPL-2021-10K.pdf-107', 'source_doc': 'AAPL-2021-10K.pdf', 'doc_type': '10K', 'ticker': 'AAPL', 'company_name': 'APPLE INC', 'sector': 'Information Technology', 'asset_class': 'Equity', 'location': 'United States', 'exchange': 'NASDAQ', 'currency': 'USD', 'market_value': '559365151.11', 'weig

In [15]:
f = RedisTag("doc_type") == "earning_call"
rds.similarity_search(query="What did Tim Cook said in 2020 earning calls regarding NANDs?", k=4, distance_threshold=0.8, filter=f)

[Document(page_content="THE INFORMATION CONTAINED IN EVENT TRANSCRIPTS IS A TEXTUAL REPRESENTATION OF THE APPLICABLE COMPANY'S CONFERENCE CALL AND WHILE EFFORTS ARE MADE TO PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY BE MATERIAL ERRORS, OMISSIONS, OR INACCURACIES IN THE REPORTING OF THE SUBSTANCE OF THE CONFERENCE CALLS. IN NO WAY DOES REFINITIV OR THE APPLICABLE COMPANY ASSUME ANY RESPONSIBILITY FOR ANY INVESTMENT OR OTHER DECISIONS MADE BASED UPON THE INFORMATION PROVIDED ON THIS WEB SITE OR IN ANY EVENT TRANSCRIPT. USERS ARE ADVISED TO REVIEW THE APPLICABLE COMPANY'S CONFERENCE CALL ITSELF AND THE APPLICABLE COMPANY'S SEC FILINGS BEFORE MAKING ANY INVESTMENT OR OTHER DECISIONS. -------------------------------------------------------------------------------- Copyright 2020 Refinitiv. All Rights Reserved. --------------------------------------------------------------------------------", metadata={'id': 'chunk:2020-Jul-30-AMZN.txt-22', 'chunk_id': '2020-Jul-30-AMZN.txt-22', 'source_do

In [18]:
# vector search with combinations of metadata filtering
f = (RedisText("content") % "profit") | (RedisText("content") % "revenue")
rds.similarity_search_with_score(query="Nike company revenue", k=4, filter=f)

[]

In [19]:
# filter results to a certain distance threshold
rds.similarity_search_with_score(query="Nike company revenue", k=4, distance_threshold=0.5)

[]
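
An empty list means that no stored vector fell within the requested distance of the query vector. The sketch below shows the idea on toy vectors, assuming the index uses cosine distance (the metric configured in `sec_index.yaml` is not shown here). The function names are illustrative only.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance = 1 - cosine similarity (0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def within_threshold(query, candidates, distance_threshold, k):
    """Return up to k (name, distance) pairs whose distance is at most the threshold, closest first."""
    scored = [(name, cosine_distance(query, vec)) for name, vec in candidates.items()]
    kept = [(name, d) for name, d in scored if d <= distance_threshold]
    return sorted(kept, key=lambda pair: pair[1])[:k]


# Toy 2-D vectors standing in for real embeddings
candidates = {"close": [1.0, 0.1], "orthogonal": [0.0, 1.0], "opposite": [-1.0, 0.0]}
print(within_threshold([1.0, 0.0], candidates, distance_threshold=0.5, k=4))
```

A lower threshold trades recall for precision: fewer chunks reach the LLM, but the ones that do are closer to the query.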

## RAG with Ollama running Llama 3 LLM

### Initialize a Llama LLM served via Ollama
To use a local Ollama LLM, initialize it as shown below. If you have a local OpenAI-compatible server running via vLLM, add your LLM here instead.

In [20]:
from langchain_community.llms import Ollama
llm = Ollama(model="llama3")

### Setup prompt
`PromptTemplate` defines the exact text of the prompt that is fed to the LLM. This step is optional; the defaults usually work well for OpenAI models but might fall short for other models.

In [21]:
def get_prompt():
    """Create the prompt template for the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt
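
To make the role of the two placeholders concrete, the following sketch fills a shortened stand-in template with plain string formatting. It approximates what a "stuff"-style chain does, namely joining all retrieved chunks into a single `{context}` string; the template text and the `stuff_prompt` helper are illustrative assumptions, not LangChain internals.

```python
# Simplified stand-in for the template above; only the placeholders matter here.
TEMPLATE = """Context:
---------
{context}
---------
Question: {question}
Answer:"""


def stuff_prompt(chunks: list[str], question: str, separator: str = "\n\n") -> str:
    """Join all chunks into one context string and fill the template."""
    return TEMPLATE.format(context=separator.join(chunks), question=question)


prompt_text = stuff_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What does the context say?",
)
print(prompt_text)
```

Because every retrieved chunk ends up in one prompt, the number and size of chunks must fit within the model's context window.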

### Putting it all together

This is where LangChain brings all the components together in the form of a simple RAG application over the financial documents.

In [22]:
from langchain.chains import RetrievalQA

def get_search_kwargs(filters, distance_threshold):
    return {"distance_threshold":distance_threshold,"filter":filters}
    

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold",
                               search_kwargs={"distance_threshold":0.8, 'include_metadata': True}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)

### Finally - let's ask questions!



In [59]:
query = "What was Apple's revenue last year compared to this year??"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


'A financial question!\n\nAccording to the provided 10-K filings, we can find the answer.\n\nIn the 2022 Form 10-K, it is mentioned that:\n\n"Fiscal 2022 Highlights...\nTotal net sales increased 8% or $28.5 billion during 2022 compared to 2021..."\n\nAnd in the 2021 Form 10-K (not provided), we would find the revenue figure for the previous year.\n\nSo, let\'s assume the revenue for 2021 is x.\n\nThen, the revenue for 2022 would be x + $28.5 billion (8% increase).\n\nNow, if you provide me with the numbers:\n\n21,280 (2021) and\n46,291 (2022)\n\nI can help you find the difference between last year\'s revenue and this year\'s revenue.\n\nPlease provide the numbers!'

In [24]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


'Question: How many products does Nike offer? What is the industry that Nike is part of?\nAnswer: According to Nike\'s 10-K filing for 2020, the company offers a wide range of products across its three core businesses: Jordan Brand, Converse, and Nike. As stated in the filing, "We design, develop, market, and sell athletic footwear, apparel, equipment, and accessories." Additionally, the filing notes that "our products are designed to be worn for various activities, including running, training, basketball, soccer, tennis, golf, football, baseball, softball, volleyball, and other sports."\n\nAs for the industry, Nike is part of the Apparel, Footwear, and Textile Sector.\n\nSource: Nike\'s 2020 Annual Report (Form 10-K), filed with the Securities and Exchange Commission on June 15, 2021. The relevant information can be found on page 2-3 of the filing.\n\nReference: https://www.sec.gov/Archives/edgar/data/783557/000119312521245114/nke-20200615.htm'

In [25]:
query = "what was revenue of Apple in 2022?"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


"I don't know the answer to that question. The provided context is from financial 10K filings, but it does not include information about Apple's revenue for 2022. To find this information, I would need access to Apple's 2022 Form 10-K filing or another relevant document that provides this data."

In [26]:
query = "How many employees work at Nike???"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


"I apologize, but the provided context does not mention Nike or any information related to employee count. The document appears to be a 10-K filing for Apple Inc., and it only includes the signature block for the company's executives and directors.\n\nTherefore, I do not have the answer to this question.\n\nSource: Apple Inc. | 2021 Form 10-K"
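
Because the chain was built with `return_source_documents=True`, each response also carries the retrieved chunks under `source_documents`. Inspecting them is one way to check whether an answer is grounded in the indexed data; the Nike answer above, for example, cites a filing although only AAPL and AMZN documents were uploaded in this run. The sketch below uses stand-in objects so it runs without Redis or an LLM; `summarize_sources` is a hypothetical helper.

```python
from types import SimpleNamespace


def summarize_sources(response: dict) -> list[tuple]:
    """Return the distinct (ticker, doc_type, source_doc) triples behind an answer, in retrieval order."""
    seen = []
    for doc in response.get("source_documents", []):
        meta = doc.metadata
        key = (meta.get("ticker"), meta.get("doc_type"), meta.get("source_doc"))
        if key not in seen:
            seen.append(key)
    return seen


# Stand-in response; real entries would be LangChain Document objects
fake_response = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(metadata={"ticker": "AAPL", "doc_type": "10K", "source_doc": "AAPL-2021-10K.pdf"}),
        SimpleNamespace(metadata={"ticker": "AAPL", "doc_type": "10K", "source_doc": "AAPL-2021-10K.pdf"}),
    ],
}
print(summarize_sources(fake_response))
```

With a real response, `summarize_sources(res)` would list which filings or transcripts the retriever actually handed to the LLM.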

### Adding query analysis and hybrid search in QA chain

In [27]:
from custom_ners import get_redis_filters

 ✅ Loaded doc info for  112 tickers...
[{'LOWER': 'vz'}]
[{'LOWER': 'verizon'}, {'LOWER': 'communications'}, {'LOWER': 'inc'}]
[{'LOWER': 'communication'}]
[{'LOWER': 'equity'}]
[{'LOWER': 'new'}, {'LOWER': 'york'}, {'LOWER': 'stock'}, {'LOWER': 'exchange'}, {'LOWER': 'inc.'}]
[{'LOWER': 'amzn'}]
[{'LOWER': 'amazon'}, {'LOWER': 'com'}, {'LOWER': 'inc'}]
[{'LOWER': 'consumer'}, {'LOWER': 'discretionary'}]
[{'LOWER': 'equity'}]
[{'LOWER': 'nasdaq'}]
[{'LOWER': 'cat'}]
[{'LOWER': 'caterpillar'}, {'LOWER': 'inc'}]
[{'LOWER': 'industrials'}]
[{'LOWER': 'equity'}]
[{'LOWER': 'new'}, {'LOWER': 'york'}, {'LOWER': 'stock'}, {'LOWER': 'exchange'}, {'LOWER': 'inc.'}]
[{'LOWER': 'aapl'}]
[{'LOWER': 'apple'}, {'LOWER': 'inc'}]
[{'LOWER': 'information'}, {'LOWER': 'technology'}]
[{'LOWER': 'equity'}]
[{'LOWER': 'nasdaq'}]
[{'LOWER': 'pm'}]
[{'LOWER': 'philip'}, {'LOWER': 'morris'}, {'LOWER': 'international'}, {'LOWER': 'inc'}]
[{'LOWER': 'consumer'}, {'LOWER': 'staples'}]
[{'LOWER': 'equity'}]
[{'LO
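
The implementation of `get_redis_filters` lives in `custom_ners` and is not shown in this notebook. As a rough illustration of the idea only, the sketch below maps company mentions in a question to a Redis tag filter using a small hard-coded lookup table. The table, the function name, and the matching logic are assumptions and are much simpler than pattern-based entity recognition.

```python
import re

# Hypothetical lookup table; the real module loads doc info for 112 tickers.
ALIASES = {
    "aapl": "aapl", "apple": "aapl",
    "amzn": "amzn", "amazon": "amzn",
}


def infer_ticker_filter(question: str):
    """Return a Redis tag filter such as '@ticker:{aapl}', or None if no known company is mentioned."""
    tokens = re.findall(r"[a-z0-9]+", question.lower())
    tickers = []
    for token in tokens:
        ticker = ALIASES.get(token)
        if ticker and ticker not in tickers:
            tickers.append(ticker)
    if not tickers:
        return None
    # Inside a tag filter, '|' matches any of the listed values
    return "@ticker:{" + " | ".join(tickers) + "}"


print(infer_ticker_filter("what is the revenue of aapl?"))
print(infer_ticker_filter("Compare Apple and Amazon margins"))
print(infer_ticker_filter("How is the weather?"))
```

Returning `None` when nothing is recognized lets the caller fall back to an unfiltered vector search.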

In [41]:
# Plug in your own query analysis here, e.g. NER, topic detection, intent detection, or semantic routing.
def query_analysis(q):
    filters = get_redis_filters(q)
    print(filters)
    return filters
    

def ask_question(question,
                 filters = None,
                 filter_strategy = 'AND',
                 distance_threshold =0.8,
                 search_type="similarity_distance_threshold"):
    
    q_filters = query_analysis(question)
    print(f"inferred filters: {q_filters}")
    # Combine inferred and user-supplied filters; either one may be missing
    if filters is None:
        filters = q_filters
    elif q_filters:
        filters = " ( "+q_filters + " ) " + filter_strategy+ " ( " + filters + " ) "
    
    print(f"Final filters: {filters} to apply")
    if filters is not None:
        search_args = {"distance_threshold":distance_threshold, 
                   'include_metadata': True, 
                   'filter':filters}
    else:
        search_args = {"distance_threshold":distance_threshold, 
                   'include_metadata': True}
        
    fqa = RetrievalQA.from_chain_type(
        llm=llm,
        chain_type="stuff",
        retriever=rds.as_retriever(search_type=search_type,
                                   search_kwargs= search_args),
        return_source_documents=True,
        chain_type_kwargs={"prompt": get_prompt()},
        verbose=True
    )
    response = fqa(question)
    return response  
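
One detail worth noting when combining raw filter strings: in Redis query syntax, placing two clauses side by side means intersection and `|` means union; the words `AND` and `OR` are not query operators. The sketch below is one possible way to combine two filter strings under that syntax. `combine_filters` is a hypothetical helper, not part of this notebook or of LangChain.

```python
def combine_filters(inferred, extra, strategy: str = "AND"):
    """Combine two raw Redis filter strings; either one may be None or empty."""
    present = [p for p in (inferred, extra) if p]
    if not present:
        return None
    if len(present) == 1:
        return present[0]
    # Redis query syntax: a space means intersection, '|' means union
    joiner = " | " if strategy.upper() == "OR" else " "
    return joiner.join(f"({p})" for p in present)


print(combine_filters("@ticker:{aapl}", "@doc_type:{10K}"))
print(combine_filters("@ticker:{aapl}", "@doc_type:{10K}", strategy="OR"))
print(combine_filters(None, "@doc_type:{10K}"))
```

The helper also covers the case where query analysis finds nothing, so a user-supplied filter is still applied on its own.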

In [42]:
ask_question("what is the revenue of aapl?")

@ticker:{aapl}
inferred filters: @ticker:{aapl}
Final filters: @ticker:{aapl} to apply


> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the revenue of aapl?',
 'result': "I don't know.\n\nSource: None (the provided context does not contain financial information or revenue data for AAPL)",
 'source_documents': [Document(page_content="THE INFORMATION CONTAINED IN EVENT TRANSCRIPTS IS A TEXTUAL REPRESENTATION OF THE APPLICABLE COMPANY'S CONFERENCE CALL AND WHILE EFFORTS ARE MADE TO PROVIDE AN ACCURATE TRANSCRIPTION, THERE MAY BE MATERIAL ERRORS, OMISSIONS, OR INACCURACIES IN THE REPORTING OF THE SUBSTANCE OF THE CONFERENCE CALLS. IN NO WAY DOES THOMSON REUTERS OR THE APPLICABLE COMPANY ASSUME ANY RESPONSIBILITY FOR ANY INVESTMENT OR OTHER DECISIONS MADE BASED UPON THE INFORMATION PROVIDED ON THIS WEB SITE OR IN ANY EVENT TRANSCRIPT. USERS ARE ADVISED TO REVIEW THE APPLICABLE COMPANY'S CONFERENCE CALL ITSELF AND THE APPLICABLE COMPANY'S SEC FILINGS BEFORE MAKING ANY INVESTMENT OR OTHER DECISIONS. -------------------------------------------------------------------------------- Copyright 2019 Thomson Reute

In [43]:
ask_question("what is the revenue of aapl?", filters = "@doc_type:{10K}")

@ticker:{aapl}
inferred filters: @ticker:{aapl}
Final filters:  ( @ticker:{aapl} ) AND ( @doc_type:{10K} )  to apply


> Entering new RetrievalQA chain...

> Finished chain.


{'query': 'what is the revenue of aapl?',
 'result': "Question: What is the revenue of AAPL?\nAnswer: I don't know.\nSource: None (since the provided context only contains signatures and certification information, it does not include financial data such as revenue.)",
 'source_documents': [Document(page_content='Apple Inc. | 2023 Form 10-K | 57\n\nSIGNATURES\n\nPursuant to the requirements of Section 13 or 15(d) of the Securities Exchange Act of 1934, the Registrant has duly caused this report to be signed on its behalf by the undersigned, thereunto duly authorized.\n\nDate: November 2, 2023\n\nApple Inc.\n\nBy:\n\n/s/ Luca Maestri Luca Maestri Senior Vice President, Chief Financial Oﬃcer\n\nPower of Attorney\n\nKNOW ALL PERSONS BY THESE PRESENTS, that each person whose signature appears below constitutes and appoints Timothy D. Cook and Luca Maestri, jointly and severally, his or her attorneys-in-fact, each with the power of substitution, for him or her in any and all capacities, to s

## Cleanup

Clean up the index and data.

In [47]:
#rds.drop_index(index_name=index_name, redis_url=REDIS_URL, delete_documents=True)