<a href="https://colab.research.google.com/github/Redislabs-Solution-Architects/financial-vss/blob/main/LangChain_VSS.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Vector Similarity Search & Document QnA with LangChain
![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [LangChain](https://python.langchain.com/docs/get_started/introduction) and [Redis](https://redis.com) to index documents and their embeddings and to run semantic search over them. It also shows how to integrate with an LLM such as one of OpenAI's GPT models.

## Setup and Data Prep

### Pull GitHub Materials
We need to clone the supporting materials from GitHub.

In [1]:
# This clones the git repository into a directory named 'temp_repo'.
!git clone https://github.com/Redislabs-Solution-Architects/financial-vss.git temp_repo

# This command moves the 'resources' directory from 'temp_repo' to your current directory.
!mv temp_repo/resources .

# This deletes the 'temp_repo' directory, cleaning up the unwanted files.
!rm -rf temp_repo


Cloning into 'temp_repo'...
remote: Enumerating objects: 73, done.
remote: Counting objects: 100% (73/73), done.
remote: Compressing objects: 100% (52/52), done.
remote: Total 73 (delta 30), reused 56 (delta 17), pack-reused 0
Receiving objects: 100% (73/73), 6.92 MiB | 10.96 MiB/s, done.
Resolving deltas: 100% (30/30), done.
mv: cannot move 'temp_repo/resources' to './resources': Directory not empty


### Install Python Dependencies

In [2]:
!pip install -q redis "redisvl>=0.0.4" langchain pdf2image "unstructured[all-docs]" sentence-transformers openai tiktoken

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.

### Preprocess PDF Doc(s)

Now we will load a single financial document (a 10-K filing) and preprocess it using some LangChain helpers.

In [3]:
import os

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import UnstructuredFileLoader

# Load list of pdfs
data_path = "resources/"
docs = [os.path.join(data_path, file) for file in os.listdir(data_path)]

print("Listing available documents ...", docs)

# For simplicity, we will just work with one of the 10k files. This will still take some time.
# Note: the UnstructuredFileLoader is not the only document loader type that LangChain provides.
# Note: the RecursiveCharacterTextSplitter is what we use to create smaller chunks of text from the doc.
# Docs: https://python.langchain.com/docs/integrations/document_loaders/unstructured_file
# Docs: https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter
doc = [doc for doc in docs if "nke" in doc][0]
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
loader = UnstructuredFileLoader(doc, mode="single", strategy="fast")
chunks = loader.load_and_split(text_splitter)

print("Done preprocessing. Created", len(chunks), "chunks of the original pdf", doc)

Listing available documents ... ['resources/nke-10k-2023.pdf', 'resources/msft-10k-2023.pdf', 'resources/amzn-10k-2023.pdf', 'resources/aapl-10k-2023.pdf', 'resources/nvd-10k-2023.pdf', 'resources/jnj-10k-2023.pdf']
Done preprocessing. Created 323 chunks of the original pdf resources/nke-10k-2023.pdf
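
To make the `chunk_size` and `chunk_overlap` settings more concrete, here is a minimal sketch of overlapping character windows. It is a simplified stand-in, not the actual `RecursiveCharacterTextSplitter` algorithm (which also tries to break on separators such as paragraphs and sentences). The function name `split_with_overlap` and the toy text are assumptions made for illustration.

```python
def split_with_overlap(text, chunk_size=1500, chunk_overlap=100):
    """Split text into fixed-size character windows that overlap.

    Simplified illustration only: each window starts
    (chunk_size - chunk_overlap) characters after the previous one.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        # Keep the start offset, similar in spirit to add_start_index=True
        chunks.append({"start_index": start, "text": piece})
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghij" * 5  # 50 characters of toy text
pieces = split_with_overlap(sample, chunk_size=20, chunk_overlap=5)
for piece in pieces:
    print(piece["start_index"], len(piece["text"]))
```

The overlap means the tail of one chunk is repeated at the head of the next, so a sentence cut at a chunk boundary still appears whole in at least one chunk when it is shorter than the overlap.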


In [4]:
# Take a look at one item
print(chunks[2])

page_content="NIKE, Inc.(Exact name of Registrant as specified in its charter)Oregon93-0584541(State or other jurisdiction of incorporation)(IRS Employer Identification No.)One Bowerman Drive, Beaverton, Oregon 97005-6453(Address of principal executive offices and zip code)(503) 671-6453(Registrant's telephone number, including area code)SECURITIES REGISTERED PURSUANT TO SECTION 12(B) OF THE ACT:Class B Common StockNKENew York Stock Exchange(Title of each class)(Trading symbol)(Name of each exchange on which registered)SECURITIES REGISTERED PURSUANT TO SECTION 12(G) OF THE ACT:NONE\n\nAs of November 30, 2022, the aggregate market values of the Registrant's Common Stock held by non-affiliates were:Class A$7,831,564,572 Class B136,467,702,472 $144,299,267,044\n\nTable of ContentsUNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549FORM 10-K(Mark One)☑ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(D) OF THE SECURITIES EXCHANGE ACT OF 1934FOR THE FISCAL YEAR ENDED MAY 31, 2023

### Install Redis Stack (OPTIONAL)

Redis Search will be used as the vector similarity search engine for LangChain.

Instead of using the in-notebook Redis Stack (https://redis.io/docs/getting-started/install-stack/), you can provision your own free Redis instance in the cloud. Get your own free Redis Cloud instance at https://redis.com/try-free/

In [5]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack


gpg: cannot open '/dev/tty': No such device or address
curl: (23) Failed writing body


### Connect to Redis

By default, this notebook connects to the local instance of Redis Stack. If you have your own Redis Cloud instance, replace the REDIS_HOST, REDIS_PORT and REDIS_PASSWORD values with your own.

In [6]:
# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
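
The URL above is built by plain string interpolation, which can break if a password contains characters such as `@` or `/`. The following is a hedged sketch of a more defensive builder; `build_redis_url` is a hypothetical helper, and it assumes the client decodes percent-encoded credentials when it parses the URL.

```python
from urllib.parse import quote


def build_redis_url(host, port, password=""):
    """Return a redis:// URL, percent-encoding the password if present."""
    if password:
        encoded = quote(password, safe="")  # escape '@', '/', ':' and friends
        return f"redis://:{encoded}@{host}:{port}"
    # No password: omit the credentials part entirely
    return f"redis://{host}:{port}"


print(build_redis_url("localhost", "6379"))
print(build_redis_url("example.invalid", 18374, "p@ss/word"))
```

In this notebook the call would look like `build_redis_url(REDIS_HOST, REDIS_PORT, REDIS_PASSWORD)`.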


### Initialize Embeddings Engine
Here we will use LangChain's built-in embedding engine so that it works seamlessly with the LangChain VectorStore classes.

In [7]:
from langchain.embeddings.huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## VSS with LangChain
### Create Redis vector store instance

We also need to create a schema for the vector index so we can take advantage of the metadata along with the vectors.

**Important Note**: LangChain's Redis integration does not support the JSON data type yet; it only supports HASH for now. Support should be coming soon.

In [8]:
from langchain.vectorstores.redis import Redis


# set the index name for this example
index_name = "langchain"

# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "chunk_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use L2 (Euclidean) or IP (inner product)
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}],
    "content_vector_key": "chunk_vector"    # name of the vector field in langchain
}


# construct the vector store class from texts and metadata
rds = Redis.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
    redis_url=REDIS_URL,
    index_schema=index_schema,
)

If you meant to manually override the schema, please ignore this message.
index_schema: {'vector': [{'name': 'chunk_vector', 'algorithm': 'HNSW', 'dims': 384, 'distance_metric': 'COSINE', 'datatype': 'FLOAT32'}], 'text': [{'name': 'content'}], 'content_vector_key': 'chunk_vector'}
generated_schema: {'text': [{'name': 'source'}], 'numeric': [{'name': 'start_index'}], 'tag': []}
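
The `distance_metric` setting controls how two embedding vectors are compared. As a rough illustration, the sketch below computes the textbook versions of cosine distance, Euclidean distance and inner product on toy 2-dimensional vectors. The helper names are made up for this example, and the exact score convention a search engine reports for each metric may differ from these plain formulas.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 1 for orthogonal."""
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return 1.0 - dot(a, b) / norm


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


query = [1.0, 0.0]
same_direction = [2.0, 0.0]   # same direction, different length
orthogonal = [0.0, 3.0]       # at a right angle to the query

print(cosine_distance(query, same_direction))
print(cosine_distance(query, orthogonal))
print(euclidean_distance(query, same_direction))
print(dot(query, same_direction))
```

Cosine distance ignores vector length, which is one reason it is a common choice for sentence embeddings.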



In [9]:
# If you wish to connect to an existing Redis vector store instance
rds = Redis.from_existing_index(
    embedding=embeddings,
    index_name=index_name,
    schema=index_schema,
    redis_url=REDIS_URL,
)

In [10]:
# check out the schema we created
rds.schema

{'text': [{'name': 'content',
   'weight': 1,
   'no_stem': False,
   'withsuffixtrie': False,
   'no_index': False,
   'sortable': False}],
 'vector': [{'name': 'chunk_vector',
   'dims': 384,
   'algorithm': 'HNSW',
   'datatype': 'FLOAT32',
   'distance_metric': 'COSINE',
   'initial_cap': 20000,
   'm': 16,
   'ef_construction': 200,
   'ef_runtime': 10,
   'epsilon': 0.8}]}

In [11]:
# access the underlying redis client to see how many docs have been stored
rds.client.dbsize()

646

In [12]:
# do NOT run KEYS in production: it walks the whole keyspace and can block the server (prefer SCAN)
keys = rds.client.keys()

rds.client.hgetall(keys[0])

{b'chunk_vector': b'\x03"{\xbd\xe2`P\xbd\xec \x9e\xbb~\xd2\xad\xbd\xb3dz=\x086\xc4=%\x02W<\xa2\x96\x18<\xab=#\xbdn\\\x89:O\x9a5=\x0c\'\xc4<\xe4\x1f\x1e=\x9bE`=\xf6\xb3N\xbc\xf0k\xd3\xbb\xd0\xb7\xc1<\x94\xbb\xb3:\xa7\xb1\xdd\xbde\'+\xbb\x85\xa2F=\x80>\x0c\xbd\n\xc7\xa2\xbcl\xf27=\xb8\xcc\x03\xbe\xbf\xb1>\xbc\xe4Q@\xbc@\xfb\xdb\xbb\n\xf4\x01\xbc%#Z\xbd\xad\xb4s\xbdb1\xbb<\xce\x85\x87=\xe2\x85-\xbb\xac\x91\x84=\xfcM\x9f\xbbs\xe6\x02\xbc\xf0\x84\x8a\xbdK\xeb\x9c=\xda\x84\x8e\xbc\xce\xa2\xaa<\xdas\xd1\xbd\x88\x94\x96\xbc1T\x96<\xb8\xac\x9e<\xe6\xbc.\xbd5\xdd0=\x80\x00d\xbc\xa6\x97x\xbd\nB\xbf<\x18\xa2\xe2\xbc\xc45\xa3=\x94\xac\x17\xbd\xec3\xfa<\xb5y*=:\xad\xd3\xbdPB\x1b=!t\x89\xbd\xb1\xc9\x93\xbd\x06\xd1>\xbb\xe4\xcc%=%\xba<\xbd\xa9(\xb1\xbc\x04\xf6Z<\x950_\xbd\xfcV@\xbc/"|\xbd1\x02Y=/\x08\x9d\xbd2\xcf\xb4<t\xe2\x83\xbc\xfd\t\x93\xbdu\x14M\xbd\\\xd8\x85=\xd2\x7f\xe2<\xd9\x925=\x0fh\xf8<_\'\x14<x\xcc\xbf;\xb3\xc0\x96\xbd\xc9X\x81=Q\x81\x84=\xf8n\xf1<"\x98_\xbd\xad\xae=\xbd\tI\x1a\xbd$\x0bD<\
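
The `chunk_vector` field above is the embedding stored as raw bytes. With `"datatype": "FLOAT32"` each component takes 4 bytes, so a 384-dimensional vector occupies 384 × 4 = 1536 bytes. The sketch below shows one way to decode such a blob with the standard library; the helper names are hypothetical and little-endian byte order is an assumption.

```python
import struct


def encode_float32(values):
    """Pack floats as little-endian 32-bit values (assumed byte order)."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_float32(blob):
    """Unpack a bytes blob of little-endian float32 values into a list."""
    if len(blob) % 4 != 0:
        raise ValueError("blob length must be a multiple of 4 bytes")
    count = len(blob) // 4  # 4 bytes per FLOAT32 component
    return list(struct.unpack(f"<{count}f", blob))


blob = encode_float32([1.0, -0.5, 0.25])
print(len(blob), decode_float32(blob))
```

Applied to the hash shown above, something like `decode_float32(rds.client.hgetall(keys[0])[b"chunk_vector"])` should yield a list of 384 floats, under the same byte-order assumption.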

### Query the database
Now we can use the LangChain vector store class to perform similarity search operations on Redis.

In [13]:
from langchain.vectorstores.redis import RedisText

In [14]:
# basic "top 4" vector search on a given query
rds.similarity_search_with_score(query="Profit margins", k=4)

[(Document(page_content='6,923\n\nEBIT margin\n\n(1)\n\n12.1 %\n\n14.7 %\n\n15.5 %\n\nInterest expense (income), net\n\n(6)\n\n205\n\n—\n\n262\n\nTOTAL NIKE, INC. INCOME BEFORE INCOME TAXES\n\n$\n\n6,201\n\n$\n\n6,651\n\n7 % $\n\n6,661\n\n(1) Total NIKE Brand EBIT, Total NIKE, Inc. EBIT and EBIT Margin represent non-GAAP financial measures. See "Use of Non-GAAP Financial Measures" for further information.\n\n2023 FORM 10-K 36\n\n% CHANGE EXCLUDING CURRENCY (1) CHANGES\n\n7 % 12 % -13 %\n\n16 % 302 %\n\n6 % 7 %\n\n— 6 %\n\n% CHANGE\n\n0 % 35 % -27 %\n\n24 % -17 %\n\n3 % 23 % 2 %\n\n1 %\n\n—\n\n0 %\n\nTable of Contents\n\nNORTH AMERICA\n\n(Dollars in millions)\n\nFISCAL 2023 FISCAL 2022\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY\n\nCHANGES FISCAL 2021\n\n% CHANGE\n\n% CHANGE EXCLUDING CURRENCY CHANGES\n\nRevenues by: Footwear Apparel\n\n$\n\n14,897 $ 5,947\n\n12,228 5,492\n\n22 % 8 %\n\n22 % $ 9 %\n\n11,644 5,028\n\n5 % 9 %\n\n5 % 9 %\n\nEquipment\n\nTOTAL REVENUES\n\n$\n\n764 21,608 $\n

In [15]:
# vector search with metadata filtering

f = RedisText("content") % "profit"
rds.similarity_search_with_score(query="Profit margins", k=4, filter=f)

[(Document(page_content='COMPARABLE STORE SALES Comparable store sales: This key metric, which excludes NIKE Brand Digital sales, comprises revenues from NIKE-owned in-line and factory stores for which all three of the following requirements have been met: (1) the store has been open at least one year, (2) square footage has not changed by more than 15% within the past year and (3) the store has not been permanently repositioned within the past year. Comparable store sales includes revenues from stores that were temporarily closed during the period as a result of COVID-19. Comparable store sales represents a performance metric that we believe is useful information for management and investors in understanding the performance of our established NIKE-owned in-line and factory stores. Management considers this metric when making financial and operating decisions. The method of calculating comparable store sales varies across the retail industry. As a result, our calculation of this metric

In [16]:
# vector search with combinations of metadata filtering

f = (RedisText("content") % "profit") | (RedisText("content") % "revenue")
rds.similarity_search_with_score(query="Nike company revenue", k=4, filter=f)

[(Document(page_content='FISCAL 2023 COMPARED TO FISCAL 2022\n\nNIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n\nNIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling pr
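
Conceptually, a filtered vector search keeps only the candidates that satisfy the filter and then ranks the survivors by distance. The sketch below imitates that idea in plain Python. It is not how Redis or LangChain evaluate filters internally: the substring matching is a simplification of full-text matching, and the helper names and toy distances are invented for illustration.

```python
def contains(term):
    """Return a predicate that checks for a term, case-insensitively."""
    return lambda text: term.lower() in text.lower()


def any_of(*predicates):
    """Combine predicates with OR, analogous to the | operator above."""
    return lambda text: any(p(text) for p in predicates)


def filtered_search(candidates, predicate, k=4):
    """Keep candidates passing the filter, then return the k closest."""
    kept = [(text, dist) for text, dist in candidates if predicate(text)]
    return sorted(kept, key=lambda pair: pair[1])[:k]


# Toy (text, distance) pairs; the distances are made up for illustration.
candidates = [
    ("Gross profit rose on higher margins", 0.30),
    ("Revenue increased in North America", 0.20),
    ("The store has been open one year", 0.10),
]
text_filter = any_of(contains("profit"), contains("revenue"))
print(filtered_search(candidates, text_filter, k=2))
```

Note that the closest candidate overall is dropped because it matches neither term, which is the point of filtering before ranking.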

In [17]:
# filter results to a certain distance threshold
rds.similarity_search_with_score(query="Nike company revenue", k=4, distance_threshold=0.14)

[]
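
An empty list here means that no chunk was within a distance of 0.14 of the query; a threshold is a hard cut-off, unlike `k`, which always returns the nearest items available. The sketch below illustrates that behaviour on made-up scores. The function name is hypothetical, and treating the boundary as inclusive is an assumption.

```python
def within_threshold(scored, distance_threshold, k=4):
    """Drop results farther than the threshold, then keep the k closest."""
    kept = [(doc, dist) for doc, dist in scored if dist <= distance_threshold]
    return sorted(kept, key=lambda pair: pair[1])[:k]


# Made-up (chunk, distance) pairs for illustration.
scored = [("chunk A", 0.21), ("chunk B", 0.35), ("chunk C", 0.48)]

print(within_threshold(scored, 0.14))  # threshold tighter than every score
print(within_threshold(scored, 0.40))  # looser threshold lets two through
```

Raising the threshold is the usual way to get results back when a strict cut-off returns nothing.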

## RAG with LangChain
LangChain makes it easy to take this vector store and build retrieval augmented generation (RAG) applications over your data.

### Initialize OpenAI

You need to supply an OpenAI API key (starts with `sk-...`). If the key is already set in the `OPENAI_API_KEY` environment variable, it is used; otherwise enter it when prompted below. You can find your API key at https://platform.openai.com/account/api-keys

In [18]:
import getpass
from langchain.llms import OpenAI

llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY") or getpass.getpass(prompt="OpenAI API Key:"))

OpenAI API Key:··········


### Setup prompt
PromptTemplate defines the exact text of the prompt that is fed to the LLM. This step is optional: the defaults usually work well for OpenAI but might fall short for other models.

In [19]:
def get_prompt():
    """Create the prompt template for the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context from financial 10k filings data to answer the user question at the end. If you don't know the answer, say that you don't know, don't try to make up an answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt
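
To see what the template turns into at query time, the sketch below fills the two placeholders with plain `str.format`. This only mimics the substitution step and is not LangChain's `PromptTemplate` implementation; the shortened template and the `render_prompt` helper are assumptions for illustration, and the context sentence is taken from the search output earlier in this notebook.

```python
TEMPLATE = """Use the following context to answer the question at the end.

Context:
---------
{context}
---------
Question: {question}
Answer:"""


def render_prompt(context_chunks, question):
    """Join retrieved chunks and substitute both placeholders."""
    context = "\n\n".join(context_chunks)
    return TEMPLATE.format(context=context, question=question)


prompt_text = render_prompt(
    ["NIKE, Inc. Revenues were $51.2 billion in fiscal 2023."],
    "What was Nike's revenue in fiscal 2023?",
)
print(prompt_text)
```

Ending the prompt with `Answer:` nudges a completion-style model to continue with the answer itself.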

### Putting it all together

This is where LangChain brings all the components together in the form of a simple RAG application over the financial PDF document.

In [20]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)
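
The flow that the chain automates can be sketched in a few lines: retrieve the relevant chunks, "stuff" them all into a single prompt, call the model, and return the answer together with its sources. The code below is a rough stand-in with hypothetical stub functions so that it runs without Redis or an API key; it is not the `RetrievalQA` implementation.

```python
def answer_question(question, retriever, llm, k=4):
    """Retrieve chunks, stuff them into one prompt, and call the model."""
    docs = retriever(question)[:k]
    context = "\n\n".join(docs)  # "stuff": every chunk goes into one prompt
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return {"query": question, "result": llm(prompt), "source_documents": docs}


# Hypothetical stand-ins for the vector store retriever and the LLM.
def fake_retriever(question):
    return ["Chunk one about revenue.", "Chunk two about margins."]


def fake_llm(prompt):
    return f"(model output for a prompt of {len(prompt)} characters)"


res = answer_question("What was the revenue?", fake_retriever, fake_llm)
print(res["result"])
print(len(res["source_documents"]))
```

Because every retrieved chunk lands in one prompt, this style is simple but limited by the model's context window.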

### Finally - let's ask questions!

Examples:
- What was Nike's revenue last year compared to this year?
- How many products does Nike offer?
- How many employees work at Nike?

In [21]:
query = "What was Nike's revenue last year compared to this year??"
res=qa(query)
res['result']





> Entering new RetrievalQA chain...

> Finished chain.


" Nike's revenue increased 10% from $44.4 billion in fiscal 2022 to $48.8 billion in fiscal 2023."

In [22]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
res=qa(query)
res['result']





> Entering new RetrievalQA chain...

> Finished chain.


' Nike offers a variety of products including athletic footwear, apparel, and equipment. Nike operates in the athletic footwear, apparel, and equipment industry.'

In [23]:
query = "Is Nike an ethical company?"
res=qa(query)
res['result']





> Entering new RetrievalQA chain...

> Finished chain.


" I don't know."

In [24]:
query = "How many employees work at Nike???"
res=qa(query)
res['result']





> Entering new RetrievalQA chain...

> Finished chain.


' Approximately 83,700 employees worldwide, including retail and part-time employees.'

## Cleanup

Clean up the index and data.

In [25]:
rds.drop_index(index_name=index_name, redis_url=REDIS_URL, delete_documents=True)

True