# Basic RAG with LangChain

## Environment Setup

In [25]:
import os
import warnings
warnings.filterwarnings("ignore")
dir_path = os.getcwd()
parent_directory = os.path.dirname(dir_path)
os.environ["TOKENIZERS_PARALLELISM"] = "false"
os.environ["ROOT_DIR"] = parent_directory
print(dir_path)
print(parent_directory)

/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/getting_started
/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss


### Install Python Dependencies

In [26]:
!pip install -r $ROOT_DIR/requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.2.1[0m[39;49m -> [0m[32;49m24.0[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m


### Configure your Redis Stack


In [27]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")

# If SSL is enabled on the endpoint, use rediss:// as the URL prefix
REDIS_URL = f"redis://{REDIS_HOST}:{REDIS_PORT}"
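
The cell above builds `REDIS_URL` from the host and port only; `REDIS_PASSWORD` is read but not included. As a minimal sketch of how the URL rules in the comments could be captured in one place, the helper below assembles the URL with an optional password and the `rediss://` scheme for SSL endpoints. `build_redis_url` is a hypothetical name, not part of the original notebook or of any library.

```python
from urllib.parse import quote


def build_redis_url(host: str, port: str, password: str = "", ssl: bool = False) -> str:
    """Build a Redis connection URL (hypothetical helper, for illustration)."""
    # rediss:// (double "s") selects an SSL/TLS connection
    scheme = "rediss" if ssl else "redis"
    # Percent-encode the password so special characters do not break the URL
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"{scheme}://{auth}{host}:{port}"


# Local instance without a password
print(build_redis_url("localhost", "6379"))
```

With the variables defined above, `build_redis_url(REDIS_HOST, REDIS_PORT, REDIS_PASSWORD)` would be the equivalent call. Avoid printing the resulting URL when a password is set.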

### SentenceTransformerEmbeddings Models Cache folder
This demo uses `SentenceTransformerEmbeddings`, and here we specify its cache folder. If you have already downloaded the models to the local file system, point to that folder here; otherwise, the library downloads the models into this folder.

In particular, these models will be downloaded if not present in the cache folder:

models/models--sentence-transformers--all-MiniLM-L6-v2

models/models--sentence-transformers--all-mpnet-base-v2


In [28]:
# Set the cache folder for the locally downloaded sentence transformer models
os.environ["TRANSFORMERS_CACHE"] = f"{parent_directory}/models"

## RAG with LangChain

### Dataset Preparation (PDF Documents)

To best demonstrate Redis as a vector database layer, we will load a single
financial document (a 10-K filing) and preprocess it using some helpers from LangChain:

- `UnstructuredFileLoader` is not the only document loader type that LangChain provides. Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
- `SentenceTransformersTokenTextSplitter` is what we use to create smaller chunks of text from the doc. Docs: https://api.python.langchain.com/en/latest/sentence_transformers/langchain_text_splitters.sentence_transformers.SentenceTransformersTokenTextSplitter.html
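
To make the chunking step concrete, here is a minimal sketch of splitting text into fixed-size token windows with overlap. It is not the `SentenceTransformersTokenTextSplitter` implementation: whitespace-separated words stand in for model tokens, and `split_by_tokens`, `tokens_per_chunk`, and `overlap` are illustrative names chosen for this example.

```python
def split_by_tokens(text, tokens_per_chunk=8, overlap=2):
    """Split text into overlapping windows of whitespace tokens (illustration only)."""
    if overlap >= tokens_per_chunk:
        raise ValueError("overlap must be smaller than tokens_per_chunk")
    tokens = text.split()
    step = tokens_per_chunk - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + tokens_per_chunk]
        chunks.append(" ".join(window))
        # Stop once the current window has reached the end of the text
        if start + tokens_per_chunk >= len(tokens):
            break
    return chunks


sample = "a b c d e f g h i j"
for chunk in split_by_tokens(sample, tokens_per_chunk=4, overlap=1):
    print(chunk)
```

The overlap repeats the tail of one chunk at the head of the next, so a sentence that straddles a boundary still appears intact in at least one chunk.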

In [29]:
from langchain.text_splitter import SentenceTransformersTokenTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs from a folder
data_path = f"{parent_directory}/resources/10K"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

Listing available documents ... ['/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/nke-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/mu-10K-2019.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/amzn-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/jnj-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/amzn-10K-2019.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/aapl-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/aapl-10K-2019.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/nvd-10k-2023.pdf', '/Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/msft-10k-2023.pdf']


In [30]:
# pick out the Nike doc for this exercise
doc = [doc for doc in docs if "nke" in doc][0]

text_splitter = SentenceTransformersTokenTextSplitter()

loader = UnstructuredFileLoader(
    doc, mode="single", strategy="fast"
)

# extract, load, and make chunks
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Done preprocessing. Created 226 chunks of the original pdf /Users/rouzbeh.farahmand/PycharmProjects/commit/financial-vss/resources/10K/nke-10k-2023.pdf


In [31]:
print(f"created {len(chunks)} chunks ")

created 226 chunks 


### Initialize Embeddings Engine
Here we will use LangChain's built-in embedding engine so that it works seamlessly with the LangChain VectorStore classes.

In [32]:
from langchain.embeddings.sentence_transformer import SentenceTransformerEmbeddings 
embeddings = SentenceTransformerEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2", cache_folder=os.getenv("TRANSFORMERS_CACHE", f"{parent_directory}/models"))

## Vector Search with LangChain
### Create Redis vector store instance

We also need to create a schema for the vector index so we can take advantage of the metadata along with the vectors.


**Important Note-1**: If you use your own embedding model with different dimensions, make sure to modify the `dims` in the schema below accordingly.

**Important Note-2**: The LangChain Redis vector store does not support the JSON data type yet; it supports only HASH for now. JSON support should be coming soon.
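
One way to avoid a mismatched `dims` value is to measure the embedding length instead of hard-coding it. The following is a minimal sketch, assuming only that the embedding object exposes an `embed_query` method (as LangChain embedding classes do). `make_vector_schema` and the stub `FakeEmbeddings` are hypothetical and exist only so the example runs without downloading a model.

```python
class FakeEmbeddings:
    """Stand-in for a real embedding model (illustration only)."""

    def embed_query(self, text):
        # Pretend every text maps to a 384-dimensional vector
        return [0.0] * 384


def make_vector_schema(embeddings, name="content_vector"):
    """Build a vector field schema whose dims match the embedding model."""
    # Embed a throwaway string and count the components of the result
    dims = len(embeddings.embed_query("dimension probe"))
    return {
        "name": name,
        "algorithm": "FLAT",
        "dims": dims,
        "distance_metric": "COSINE",
        "datatype": "FLOAT32",
    }


schema = make_vector_schema(FakeEmbeddings())
print(schema["dims"])
```

In this notebook, passing the `embeddings` object created earlier in place of `FakeEmbeddings()` should yield the same 384 dimensions for `all-MiniLM-L6-v2`.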

In [33]:
from langchain_community.vectorstores import Redis

index_name = 'langchain'
vector_schema = {
        "name": "content_vector",
        "algorithm": "FLAT",
        "dims": 384,
        "distance_metric": "COSINE",
        "datatype": "FLOAT32",
    }
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}],
    "content_vector_key": "chunk_vector"    # name of the vector field in langchain
}

print(f"loading {len(chunks)} chunks to REDIS_URL={REDIS_URL}")
# construct the vector store class from texts and metadata
rds = Redis.from_documents(
    documents=chunks,
    embedding=embeddings,
    vector_schema=vector_schema,
    redis_url=REDIS_URL,
    index_name = index_name
)

loading 226 chunks to REDIS_URL=redis://localhost:6379


In [34]:
# access the underlying redis client to see how many docs have been stored
rds.client.dbsize()

226

### Query the database
Now we can use the LangChain vector store class to run similarity searches against Redis.

In [35]:
from langchain.vectorstores.redis import RedisText

In [36]:
# basic "top 4" vector search on a given query
rds.similarity_search_with_score(query="Profit margins", k=4)

[(Document(page_content='nike - owned in - line and factory stores. management considers this metric when making financial and operating decisions. the method of calculating comparable store sales varies across the retail industry. as a result, our calculation of this metric may not be comparable to similarly titled metrics used by other companies. 2023 form 10 - k 30 table of contents results of operations ( dollars in millions, except per share data ) revenues cost of sales gross profit gross margin demand creation expense operating overhead expense total selling and administrative expense % of revenues interest expense ( income ), net other ( income ) expense, net income before income taxes income tax expense effective tax rate net income diluted earnings per common share $ $ $ fiscal 2023 51, 217 28, 925 22, 292 43. 5 % 4, 060 12, 317 16, 377 32. 0 % ( 6 ) ( 280 ) 6, 201 1, 131 18. 2 % 5, 070 3. 23 $ $ $ fiscal 2022 46, 710 25, 231 21, 479 46. 0 % 3, 850 10, 954 14, 804 31. 7 % 205

In [37]:
# vector search with metadata filtering
f = RedisText("content") % "profit"
rds.similarity_search_with_score(query="Profit margins", k=4, filter=f)

[(Document(page_content='nike - owned in - line and factory stores. management considers this metric when making financial and operating decisions. the method of calculating comparable store sales varies across the retail industry. as a result, our calculation of this metric may not be comparable to similarly titled metrics used by other companies. 2023 form 10 - k 30 table of contents results of operations ( dollars in millions, except per share data ) revenues cost of sales gross profit gross margin demand creation expense operating overhead expense total selling and administrative expense % of revenues interest expense ( income ), net other ( income ) expense, net income before income taxes income tax expense effective tax rate net income diluted earnings per common share $ $ $ fiscal 2023 51, 217 28, 925 22, 292 43. 5 % 4, 060 12, 317 16, 377 32. 0 % ( 6 ) ( 280 ) 6, 201 1, 131 18. 2 % 5, 070 3. 23 $ $ $ fiscal 2022 46, 710 25, 231 21, 479 46. 0 % 3, 850 10, 954 14, 804 31. 7 % 205

In [38]:
# vector search with combinations of metadata filtering
f = (RedisText("content") % "profit") | (RedisText("content") % "revenue")
rds.similarity_search_with_score(query="Nike company revenue", k=4, filter=f)

[(Document(page_content='nike - owned in - line and factory stores. management considers this metric when making financial and operating decisions. the method of calculating comparable store sales varies across the retail industry. as a result, our calculation of this metric may not be comparable to similarly titled metrics used by other companies. 2023 form 10 - k 30 table of contents results of operations ( dollars in millions, except per share data ) revenues cost of sales gross profit gross margin demand creation expense operating overhead expense total selling and administrative expense % of revenues interest expense ( income ), net other ( income ) expense, net income before income taxes income tax expense effective tax rate net income diluted earnings per common share $ $ $ fiscal 2023 51, 217 28, 925 22, 292 43. 5 % 4, 060 12, 317 16, 377 32. 0 % ( 6 ) ( 280 ) 6, 201 1, 131 18. 2 % 5, 070 3. 23 $ $ $ fiscal 2022 46, 710 25, 231 21, 479 46. 0 % 3, 850 10, 954 14, 804 31. 7 % 205

In [39]:
# filter results to a certain distance threshold
rds.similarity_search_with_score(query="Nike company revenue", k=4, distance_threshold=0.5)

[(Document(page_content='##4 14. 0 % 35 % 5, 727 3. 56 6 % 5 % 2023 form 10 - k 31 table of contents consolidated operating results revenues ( dollars in millions ) fiscal 2023 fiscal 2022 % change % change excluding currency ( 1 ) changes fiscal 2021 % change nike, inc. revenues : nike brand revenues by : footwear apparel $ 33, 135 $ 13, 843 29, 143 13, 567 14 % 2 % 20 % $ 8 % 28, 021 12, 865 4 % 5 % equipment global brand divisions ( 2 ) total nike brand revenues $ 1, 727 58 48, 763 $ 1, 624 102 44, 436 6 % - 43 % 10 % 13 % - 43 % 16 % $ 1, 382 25 42, 293 18 % 308 % 5 % converse corporate ( 3 ) 2, 427 27 2, 346 ( 72 ) 3 % — 8 % — 2, 205 40 6 % — total nike, inc. revenues $ 51, 217 $ 46, 710 10 % 16 % $ 44, 538 5 % supplemental nike brand revenues details : nike brand revenues by : sales to wholesale customers $ 27, 397 $ 25, 608 7 % 14 % $ 25, 898 1 % sales through nike direct global brand divisions ( 2 ) 21, 308 58 18, 726 102 14 % - 43 % 20 % - 43 % 16, 370 25 14 % 308 % total nike
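
The scores returned by these searches are distances under the `COSINE` metric configured in the schema, so a smaller score means a closer match. The sketch below illustrates the idea in plain Python: a brute-force scan over all vectors (conceptually what a `FLAT` index does), cosine distance computed as 1 minus cosine similarity, then a distance threshold and a top-k cut. The vectors and the names `cosine_distance` and `search` are made up for illustration; this is not how Redis implements the search internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def search(query_vec, docs, k=4, distance_threshold=None):
    """Brute-force nearest-neighbor search over (text, vector) pairs."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in docs]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    if distance_threshold is not None:
        # Keep only results at or below the distance threshold
        scored = [pair for pair in scored if pair[1] <= distance_threshold]
    return scored[:k]


# Toy two-dimensional "embeddings"
docs = [
    ("revenue table", [1.0, 0.0]),
    ("store leases", [0.0, 1.0]),
    ("gross margin", [1.0, 1.0]),
]
results = search([1.0, 0.0], docs, k=2, distance_threshold=0.5)
for text, distance in results:
    print(f"{distance:.3f}  {text}")
```

In this toy data the orthogonal vector ("store leases") has distance 1.0 and is dropped by the threshold, which mirrors how `distance_threshold` trims weak matches from the result list.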

## RAG with HuggingFace LLM
LangChain makes it easy to take this vector store and build retrieval-augmented generation (RAG) applications over your data.

### Initialize a Llama LLM served via Ollama
To connect to a local LLM served by Ollama, use the `Ollama` class below. If you instead run a local OpenAI-compatible server (for example, via vLLM), configure your LLM here.

In [40]:
from langchain_community.llms import Ollama
llm = Ollama(model="llama3")

### Setup prompt
`PromptTemplate` defines the exact text of the prompt that is fed to the LLM. This step is optional: the default prompts usually work well for OpenAI models but might fall short for other models.

In [41]:
def get_prompt():
    """Create the prompt template for the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. Only use the result from tools and evidence provided to you. If you don't know the answer, say that you don't know, don't try to make up an answer. Provide the source of the document that you used to get the answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]
    Source: [source document here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

### Putting it all together

This is where LangChain brings all the components together into a simple RAG application over the financial PDF document.

In [42]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(search_type="similarity_distance_threshold",search_kwargs={"distance_threshold":0.8}),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)
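
Conceptually, the `stuff` chain type concatenates the retrieved chunks into a single context string and fills the prompt with it. The following is a minimal sketch of that step using plain string formatting. `stuff_prompt` is a hypothetical helper, the template is a shortened version of the one above, and the separator between chunks is an assumption rather than a statement of the library's exact behavior.

```python
# Shortened version of the prompt template, for illustration
TEMPLATE = """Context:
---------
{context}
---------
Question: {question}
Answer:"""


def stuff_prompt(chunks, question, separator="\n\n"):
    """Join all retrieved chunks and substitute them into the template."""
    context = separator.join(chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = stuff_prompt(
    [
        "Revenues were $51,217 million in fiscal 2023.",
        "Revenues were $46,710 million in fiscal 2022.",
    ],
    "What was Nike's revenue?",
)
print(prompt_text)
```

Because every retrieved chunk lands in one prompt, the number and size of chunks must fit the model's context window, which is one reason to bound retrieval with a distance threshold.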

### Finally - let's ask questions!



In [43]:
query = "What was Nike's revenue last year compared to this year??"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


"According to the financial highlights section, in fiscal 2022, Nike's revenues were $48.7 billion, which increased 10% and 16% on a reported and currency-neutral basis, respectively.\n\nIn comparison, for fiscal 2023, Nike's revenues were $51.2 billion, an increase of 10% and 16% on a reported and currency-neutral basis, respectively, compared to the previous year.\n\nSo, to summarize:\n\n* Fiscal 2022: $48.7 billion\n* Fiscal 2023: $51.2 billion"

In [44]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


"Question: How many products does Nike offer? What is the industry that Nike is part of?\nAnswer: According to the provided context, Nike offers a wide range of products including athletic footwear, apparel, equipment, accessories, and services. The company's principal business activity is the design, development, and worldwide marketing and selling of these products.\n\nSource: 2023 Form 10-K filing with the Securities and Exchange Commission (SEC), page 1-2"

In [45]:
query = "Is Nike an ethical company?"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


"I don't know.\n\nThe provided context does not contain information that would enable me to determine whether Nike is an ethical company or not. The text only mentions some of the company's policies and initiatives related to employee well-being, financial coaching, and other corporate social responsibility efforts. It does not provide any specific information about ethics or morality."

In [46]:
query = "How many employees work at Nike???"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

> Finished chain.


'Question: How many employees work at Nike?\nAnswer: As of May 31, 2023, we had approximately 83,700 employees worldwide, including retail and part-time employees.\nSource: 2023 Form 10-K, page 30.'
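
Because the chain was created with `return_source_documents=True`, each result should also carry the retrieved chunks under the `source_documents` key, which helps when checking an answer against its evidence. The following is a minimal sketch: `summarize_sources` is a hypothetical helper, and the stand-in object below only mimics the `page_content` and `metadata` attributes of a LangChain document so that the example runs on its own.

```python
from types import SimpleNamespace


def summarize_sources(result, max_chars=80):
    """Return one short line per source document in a chain result."""
    lines = []
    for i, doc in enumerate(result.get("source_documents", []), start=1):
        # Fall back to "unknown" when the loader did not record a source
        source = doc.metadata.get("source", "unknown")
        snippet = doc.page_content[:max_chars].replace("\n", " ")
        lines.append(f"[{i}] {source}: {snippet}")
    return lines


# Stand-in for a real chain result (illustration only)
fake_result = {
    "result": "...",
    "source_documents": [
        SimpleNamespace(
            page_content="total nike, inc. revenues $ 51, 217",
            metadata={"source": "nke-10k-2023.pdf"},
        )
    ],
}
for line in summarize_sources(fake_result):
    print(line)
```

In the notebook, `summarize_sources(res)` could be called on any of the results above to see which chunks the answer was based on.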

## Cleanup

Clean up the index and data.

In [47]:
#rds.drop_index(index_name=index_name, redis_url=REDIS_URL, delete_documents=True)