# Vector Similarity Search & Document QnA with LangChain
![Redis](https://redis.com/wp-content/themes/wpx/assets/images/logo-redis.svg?auto=webp&quality=85,75&width=120)

This notebook uses [LangChain](https://python.langchain.com/docs/get_started/introduction) and [Redis](https://redis.com) to index documents and their embeddings and to run semantic search over them. It also shows how to integrate with an LLM such as one of OpenAI's GPT models.

## Install Python Dependencies

In [None]:
!pip install -q langchain openai tiktoken redis sentence-transformers

## Load Document Chunks and Embeddings
**Run the Data Prep Notebook first: this section loads the `embeddings.json` and `docs.json` files it is expected to produce.**

In [5]:
import os
import json

data_path = "notebooks/resources/"

with open(os.path.join(data_path, "embeddings.json"), "r") as f:
    chunk_embeddings = json.load(f)

with open(os.path.join(data_path, "docs.json"), "r") as f:
    chunks = json.load(f)
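
Before indexing, it can help to confirm that the two files line up. The sketch below assumes, as the later cells do, that each chunk is a dict with `page_content` and `metadata` keys. That `embeddings.json` holds exactly one vector per chunk is an assumption made here, not something the notebook states, and `validate_chunks` is a hypothetical helper.

```python
def validate_chunks(chunks, chunk_embeddings):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    # Assumption: one embedding per chunk, in the same order.
    if len(chunks) != len(chunk_embeddings):
        problems.append(
            f"{len(chunks)} chunks but {len(chunk_embeddings)} embeddings"
        )

    # Later cells read chunk['page_content'] and chunk['metadata'].
    for i, chunk in enumerate(chunks):
        for key in ("page_content", "metadata"):
            if key not in chunk:
                problems.append(f"chunk {i} is missing '{key}'")

    # All vectors should share one dimension.
    dims = {len(vec) for vec in chunk_embeddings}
    if len(dims) > 1:
        problems.append(f"embeddings have mixed dimensions: {sorted(dims)}")

    return problems
```

Calling `validate_chunks(chunks, chunk_embeddings)` right after loading would surface a stale or partial data prep run before anything is written to Redis.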

## Install Redis Stack (OPTIONAL)

Redis Search serves as the vector similarity search engine for LangChain.

Instead of running Redis Stack inside the notebook (https://redis.io/docs/getting-started/install-stack/), you can provision your own free Redis instance in the cloud at https://redis.com/try-free/

In [None]:
%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

### Connect to Redis

By default, this notebook connects to the local Redis Stack instance. If you have your own Redis Cloud instance, replace the REDIS_PASSWORD, REDIS_HOST and REDIS_PORT values with your own.

In [7]:
import os

# Replace values below with your own if using Redis Cloud instance
REDIS_HOST = os.getenv("REDIS_HOST", "localhost")
REDIS_PORT = os.getenv("REDIS_PORT", "6379")
REDIS_PASSWORD = os.getenv("REDIS_PASSWORD", "")
#REDIS_HOST="redis-18374.c253.us-central1-1.gce.cloud.redislabs.com"
#REDIS_PORT=18374
#REDIS_PASSWORD="1TNxTEdYRDgIDKM2gDfasupCADXXXX"

#shortcut for redis-cli $REDIS_CONN command
if REDIS_PASSWORD!="":
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT} -a {REDIS_PASSWORD} --no-auth-warning"
else:
  os.environ["REDIS_CONN"]=f"-h {REDIS_HOST} -p {REDIS_PORT}"

REDIS_URL = f"redis://:{REDIS_PASSWORD}@{REDIS_HOST}:{REDIS_PORT}"
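
The URL above is built by plain string interpolation, which can break if a Redis Cloud password contains characters such as `@` or `/`. A minimal sketch of a more defensive builder follows; `build_redis_url` is a hypothetical helper, and it relies on the assumption that the client percent-decodes the password when parsing the URL.

```python
from urllib.parse import quote


def build_redis_url(host, port, password=""):
    """Build a redis:// URL, percent-encoding the password if one is set."""
    if password:
        # safe="" also encodes "/" so the password cannot be read as a path.
        return f"redis://:{quote(password, safe='')}@{host}:{port}"
    # No password: omit the credentials part entirely.
    return f"redis://{host}:{port}"
```

With the notebook's variables this would be used as `REDIS_URL = build_redis_url(REDIS_HOST, REDIS_PORT, REDIS_PASSWORD)`.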


In [4]:
import os
import getpass

if "OPENAI_API_KEY" in os.environ:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
else:
    OPENAI_API_KEY = getpass.getpass(prompt='OpenAI Key: ', stream=None)

## Initialize Embeddings Engine
Here we use LangChain's built-in embedding engine so that it works seamlessly with the LangChain VectorStore classes.

In [8]:
from langchain.vectorstores.redis import Redis
from langchain.embeddings.hugging_face import HuggingFaceEmbeddings

# model_name must be passed as a keyword argument
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

## Create Langchain / Redis vector store

We also need to define a schema for the vector index so that we can use the metadata alongside the vectors. **Important:** LangChain's Redis integration does not support the JSON data type yet; it only supports HASH.

In [26]:
# set the index name for this example
index_name = "langchain"

# with langchain we can manually modify the default vector schema configuration
vector_schema = {
    "name": "chunk_vector",        # name of the vector field in langchain
    "algorithm": "HNSW",           # could use FLAT instead
    "dims": 384,                   # set based on the HF model embedding dimension
    "distance_metric": "COSINE",   # could use EUCLIDEAN or IP
    "datatype": "FLOAT32",
}

# here we can define the entire schema spec for our index in LangChain
index_schema = {
    "vector": [vector_schema],
    "text": [{"name": "content"}],
    "content_vector_key": "chunk_vector"    # name of the vector field in langchain
}


try:
    rds = Redis.from_existing_index(
        embedding=embeddings,
        index_name=index_name,
        schema=index_schema,
        redis_url=REDIS_URL,
    )
except Exception as e:
    print("No vector index created yet -- creating new vectorstore")
    # Load Redis with documents
    rds = Redis.from_texts(
        texts=[chunk['page_content'] for chunk in chunks],
        metadatas=[chunk['metadata'] for chunk in chunks],
        embedding=embeddings,
        index_name=index_name,
        redis_url=REDIS_URL,
        index_schema=index_schema,
    )
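
Because `dims` is hard-coded in the schema, a mismatch with the embedding model only shows up once vectors are indexed or queried. The sketch below is one way to check early. `check_embedding_dims` is a hypothetical helper; it takes any callable that maps a string to a vector, such as `embeddings.embed_query`.

```python
def check_embedding_dims(embed_query, vector_schema, probe_text="dimension probe"):
    """Embed a probe string and compare its length with the schema's dims."""
    actual = len(embed_query(probe_text))
    expected = vector_schema["dims"]
    if actual != expected:
        raise ValueError(
            f"embedding model returns {actual} dims, schema expects {expected}"
        )
    return actual
```

In this notebook the call would be `check_embedding_dims(embeddings.embed_query, vector_schema)`, run before the index is created.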

In [28]:
# check out the schema we created
rds.schema

{'text': [{'name': 'content',
   'weight': 1,
   'no_stem': False,
   'withsuffixtrie': False,
   'no_index': False,
   'sortable': False}],
 'vector': [{'name': 'chunk_vector',
   'dims': 1536,
   'algorithm': 'FLAT',
   'datatype': 'FLOAT32',
   'distance_metric': 'COSINE',
   'initial_cap': 20000,
   'block_size': 1000}]}

In [30]:
# access the underlying redis client to see how many docs have been stored
rds.client.dbsize()

323

In [41]:
# do NOT run KEYS in production: it blocks the server while it walks every key (prefer SCAN)
keys = rds.client.keys()

rds.client.hgetall(keys[0])

{b'chunk_vector': b'\xc7\x0e\xd2;O\xd0\x8b\xbc\xe4O\x9a;\xf9\x8a\x87\xbc\xf8\x13J<f\xe3\xa5\xbb\x97\x18\x91\xbc\x07\xc0\x19<\x02\x99\x82\xbc\xd5\xab\xea\xbci\x8b\xdf\xbb8\x86g<\xbe?\x81\xbb\xbf\xa7\x87<\xcd\x9d\xef: s\xcd<\x92\x91\x13<\x15\x8d%\xbc\xf4\\Y<w/\x8f\xbca\xe5\xea\xbc\xbf\x0f\x8e<\x19t\t\xbd\xf8{P\xbc\xbb\xf0\x16\xbc\x04\xd95\xbc\x8e\n\x96<\xfc2\xc1\xbc\xd6{w<X\xae\x93\xbch\xebE\xbc\xfaS}\xbc\xb9A\xc6\xbb\xae\xf3\x97<\x92)\r\xbc\x959M<\xee\xfd\xae\xba\xb5\xf2\xdb\xbc\xa9\xc5i<p)\xb4\xbcG\xfa#<\xcdD\xa0<\xd6\x82\x0e\xbc\xb1k^;\xc3\xbf\xe7;IjJ<\x0b\xa0\xe6;\xc4\xfe\x91\xbc\xcfL@<\x02Z\xd8<|\x1e\x93\xbc@\\O=\x17-\xbf\xbc\x14\x85\x85<\xce\x14\xad\xb9\xa9\x04\x14\xbb\xc8\x16r\xbcw\x88\xde<p)4\xbb\xdf0\xa3\xbc\xf2\xec2\xbbo\x1a}<\xf0\xa5\xe8\xbc`T\x08=\x8c+\xd2\xbc\xfeI\x98\xbc\t`\xb3\xbb-89;\xb0\xcb\xc4;\xb5\xf2[;\xd5\xab\xea<\x0f\xef\xd0<?$<<R\xe0K<\xf6;\x1d<f\xe3\xa5\xb9\x89\xb3\x0b<\xd7\x8a\xae\xb9\xdf`\x16;\xbe\x98P<\xd0T\xe0<\x89\x1b\x12\xbd\x9a\x90W\xbcX\xae\x93\xbcKBw<\x89
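
The `chunk_vector` field is stored as raw bytes. With the `FLOAT32` datatype from the schema, each value takes 4 bytes, so the blob can be unpacked back into floats. This is a minimal sketch that assumes little-endian byte order (the usual case on x86 and ARM machines); `decode_float32_vector` is a hypothetical helper.

```python
import struct


def decode_float32_vector(raw):
    """Unpack a bytes blob of little-endian float32 values into a list."""
    if len(raw) % 4 != 0:
        raise ValueError("blob length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" = little-endian, "f" = 4-byte float
    return list(struct.unpack(f"<{count}f", raw))
```

For the hash above this would be `decode_float32_vector(rds.client.hgetall(keys[0])[b"chunk_vector"])`, and the length of the result should equal the `dims` value of the index.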

### Sample Semantic Search Queries

In [42]:
from langchain.vectorstores.redis import RedisText

In [58]:
# basic "top 4" vector search on a given query
rds.similarity_search_with_score(query="Profit margins", k=4)

[(Document(page_content='Inventories as of May 31, 2023 were $8.5 billion, flat compared to the prior year, driven by the actions we took throughout fiscal 2023 to manage inventory levels\n\nWe returned $7.5 billion to our shareholders in fiscal 2023 through share repurchases and dividends\n\nReturn on Invested Capital ("ROIC") as of May 31, 2023 was 31.5% compared to 46.5% as of May 31, 2022. ROIC is considered a non-GAAP financial measure, see "Use of Non-GAAP Financial Measures" for further information.\n\nFor discussion related to the results of operations and changes in financial condition for fiscal 2022 compared to fiscal 2021 refer to Part II, Item 7. Management\'s Discussion and Analysis of Financial Condition and Results of Operations in our fiscal 2022 Form 10-K, which was filed with the United States Securities and Exchange Commission on July 21, 2022.\n\nCURRENT ECONOMIC CONDITIONS AND MARKET DYNAMICS\n\nConsumer Spending: Our fiscal 2023 growth in Revenues reflects strong

In [59]:
# vector search combined with a text filter on the "content" field

f = RedisText("content") % "profit"
rds.similarity_search_with_score(query="Profit margins", k=4, filter=f)

[(Document(page_content='NIKE Brand apparel revenues increased 8% on a currency-neutral basis, primarily due to higher revenues in Men\'s. Unit sales of apparel increased 4%, while higher ASP per unit contributed approximately 4 percentage points of apparel revenue growth. Higher ASP was primarily due to higher full-price ASP and growth in the size of our NIKE Direct business, partially offset by lower NIKE Direct ASP, reflecting higher promotional activity.\n\nNIKE Direct revenues increased 14% from $18.7 billion in fiscal 2022 to $21.3 billion in fiscal 2023. On a currency-neutral basis, NIKE Direct revenues increased 20% primarily driven by NIKE Brand Digital sales growth of 24%, comparable store sales growth of 14% and the addition of new stores. For further information regarding comparable store sales, including the definition, see "Comparable Store Sales". NIKE Brand Digital sales were $12.6 billion for fiscal 2023 compared to $10.7 billion for fiscal 2022.\n\n2023 FORM 10-K 33\n

In [66]:
# vector search with a combination of filters (logical OR)

f = (RedisText("content") % "profit") | (RedisText("content") % "revenue")
rds.similarity_search_with_score(query="Nike company revenue", k=4, filter=f)

[(Document(page_content='FISCAL 2023 COMPARED TO FISCAL 2022\n\nNIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n\nNIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling pr

In [68]:
# filter results to a certain distance threshold
rds.similarity_search_with_score(query="Nike company revenue", k=4, distance_threshold=0.14)

[(Document(page_content='FISCAL 2023 COMPARED TO FISCAL 2022\n\nNIKE, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively. The increase was due to higher revenues in North America, Europe, Middle East & Africa ("EMEA"), APLA and Greater China, which contributed approximately 7, 6, 2 and 1 percentage points to NIKE, Inc. Revenues, respectively.\n\nNIKE Brand revenues, which represented over 90% of NIKE, Inc. Revenues, increased 10% and 16% on a reported and currency-neutral basis, respectively. This increase was primarily due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\' which grew 17%, 35%,11% and 10%, respectively, on a wholesale equivalent basis.\n\nNIKE Brand footwear revenues increased 20% on a currency-neutral basis, due to higher revenues in Men\'s, the Jordan Brand, Women\'s and Kids\'. Unit sales of footwear increased 13%, while higher average selling pr
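
With the `COSINE` metric, the score is a distance rather than a similarity: 1 minus the cosine similarity, so 0 means the same direction and lower is closer. The plain-Python sketch below illustrates that definition only; it is not how Redis computes the value internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Under this definition, `distance_threshold=0.14` keeps only chunks whose embeddings point in nearly the same direction as the query embedding.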

## Integrate with an LLM
LangChain makes it easy to take this vector store and build retrieval-augmented generation (RAG) applications over your data.

### Initialize OpenAI

You need to supply an OpenAI API key (starts with `sk-...`). If the key is already set in your environment, it is used; otherwise enter it when prompted below. You can find your API key at https://platform.openai.com/account/api-keys

In [70]:
import os
import getpass
from langchain.llms import OpenAI

# getpass is a module; the prompt function is getpass.getpass
llm = OpenAI(openai_api_key=os.getenv("OPENAI_API_KEY") or getpass.getpass("OpenAI API Key: "))

### Setup prompt
PromptTemplate defines the exact text of the prompt that is fed to the LLM. This step is optional: the defaults usually work well for OpenAI models but might fall short for others.

In [71]:
def get_prompt():
    """Create the prompt template for the QA chain."""
    from langchain.prompts import PromptTemplate

    # Define our prompt
    prompt_template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, say that you don't know, don't try to make up an answer.

    This should be in the following format:

    Question: [question here]
    Answer: [answer here]

    Begin!

    Context:
    ---------
    {context}
    ---------
    Question: {question}
    Answer:"""

    prompt = PromptTemplate(
        template=prompt_template,
        input_variables=["context", "question"]
    )
    return prompt

### Putting it all together

This is where LangChain brings all the components together into a simple RAG application over the financial PDF document.

In [72]:
from langchain.chains import RetrievalQA

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=rds.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": get_prompt()},
    verbose=True
)
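
The `"stuff"` chain type places the text of all retrieved chunks into the `{context}` slot of a single prompt. The sketch below imitates that with plain string formatting to show the idea. It is a simplification rather than LangChain's implementation; `stuff_prompt`, the shortened `TEMPLATE`, and the `"\n\n"` separator are assumptions for illustration.

```python
# Shortened stand-in for the notebook's prompt template.
TEMPLATE = (
    "Context:\n---------\n{context}\n---------\n"
    "Question: {question}\nAnswer:"
)


def stuff_prompt(template, chunk_texts, question, separator="\n\n"):
    """Join retrieved chunk texts and fill them into one prompt string."""
    context = separator.join(chunk_texts)
    return template.format(context=context, question=question)


prompt_text = stuff_prompt(
    TEMPLATE,
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What was the revenue?",
)
print(prompt_text)
```

Since every retrieved chunk lands in one prompt, the number of chunks the retriever returns directly affects how much of the model's context window is used.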

### Finally - let's ask questions!

Examples:
- What was Nike's revenue last year compared to this year?
- How many employees work at Nike?
- Is Nike an ethical company?

In [73]:
query = "What was Nike's revenue last year compared to this year??"
res=qa(query)
res['result']



> Entering new RetrievalQA chain...

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.

> Finished chain.


' Nike, Inc. Revenues were $51.2 billion in fiscal 2023, which increased 10% and 16% compared to fiscal 2022 on a reported and currency-neutral basis, respectively.'

In [74]:
query = "How many products does Nike offer? What is the industry that Nike is part of?"
res=qa(query)
res['result']

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.




> Entering new RetrievalQA chain...

> Finished chain.


' Nike offers products from the NIKE Brand, Jordan Brand, and Converse. Nike is part of the athletic footwear, apparel and equipment industry.'

In [75]:
query = "Is Nike an ethical company?"
res=qa(query)
res['result']

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.




> Entering new RetrievalQA chain...

> Finished chain.


" I don't know."

In [76]:
query = "How many employees work at Nike???"
res=qa(query)
res['result']

score_threshold is deprecated. Use distance_threshold instead.score_threshold should only be used in similarity_search_with_relevance_scores.score_threshold will be removed in a future release.




> Entering new RetrievalQA chain...

> Finished chain.


' Approximately 83,700 employees worldwide, including retail and part-time employees.'
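
Because the chain was built with `return_source_documents=True`, each result also carries the retrieved chunks under the `source_documents` key, which is useful for checking where an answer came from. The sketch below formats them for a quick look; `summarize_sources` is a hypothetical helper and the 80-character snippet width is arbitrary.

```python
def summarize_sources(result, width=80):
    """Return one short line per source document in a RetrievalQA result."""
    lines = []
    for i, doc in enumerate(result["source_documents"], start=1):
        # Collapse newlines and repeated spaces, then truncate.
        snippet = " ".join(doc.page_content.split())[:width]
        lines.append(f"[{i}] {snippet} | metadata={doc.metadata}")
    return lines
```

After any of the queries above, `print("\n".join(summarize_sources(res)))` lists the chunks that were passed to the LLM.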

## Cleanup

Clean up the index and data.

In [83]:
rds.drop_index(index_name=index_name, redis_url=REDIS_URL, delete_documents=True)

True