# Lesson 3: Retrieving Relevant Information with Similarity Search


Welcome back! In the previous lesson, we explored how to generate embeddings for document chunks using OpenAI and LangChain. Today, we will build on that knowledge by diving into vector databases and how they enable the efficient retrieval of relevant information through similarity search.

Vector databases are specialized storage systems designed to handle high-dimensional vector data, such as the embeddings we generated in the last lesson. They are crucial for similarity search, which lets us find document chunks that are semantically similar to a given query. In this lesson, we will focus on **FAISS**, an open-source similarity-search library developed by Facebook AI Research (now Meta), and use it to create a local vector store. This will let us store and search our embeddings efficiently, paving the way for more advanced document retrieval tasks.

---

## Preparing Documents and Embedding Model

Before we can perform a similarity search, we need to prepare our document and initialize our embedding model. Here’s a quick recap:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()
```

This snippet loads a PDF, splits it into 1,000-character chunks (with a 100-character overlap), and initializes the OpenAI embeddings model.
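
To make the role of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-window chunking with overlap. It is a simplification, not how `RecursiveCharacterTextSplitter` works internally (the real splitter also tries to break on separators such as paragraphs and sentences). The helper name `chunk_text` and the sample string are hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows; neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Hypothetical sample text, small enough to inspect by eye
sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is how overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks.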

---

## Creating Embeddings and Vector Store

With our document chunks ready and embedding model initialized, the next step is to generate embeddings and create a vector store. We’ll use **FAISS** for this:

```python
from langchain_community.vectorstores import FAISS

# Generate embeddings for all chunks and create a FAISS vector store
vectorstore = FAISS.from_documents(split_docs, embedding_model)
```

`FAISS.from_documents(...)` does the following under the hood:

1. Takes each chunk in `split_docs`.
2. Converts each chunk’s text into a high-dimensional embedding via `embedding_model`.
3. Indexes all vectors in a FAISS index for ultra-fast similarity search.
4. Returns a vector store that links each vector back to its original document metadata.
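
The four steps above can be sketched in plain Python. This is an illustrative toy under stated assumptions, not LangChain's or FAISS's actual implementation: `Chunk`, `ToyVectorStore`, and `toy_embed` are hypothetical names, and the "embedding" is just three crude text features standing in for a real model.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical stand-in for a document chunk with text and metadata."""
    page_content: str
    metadata: dict


def toy_embed(text: str) -> list[float]:
    """Hypothetical embedding: three crude features instead of a real model."""
    return [float(len(text)), float(text.count(" ")), float(text.count("e"))]


@dataclass
class ToyVectorStore:
    index: list = field(default_factory=list)     # row i holds the vector for chunk i
    docstore: dict = field(default_factory=dict)  # row i -> original chunk

    @classmethod
    def from_documents(cls, chunks, embed):
        store = cls()
        for chunk in chunks:                      # step 1: take each chunk
            vector = embed(chunk.page_content)    # step 2: text -> vector
            row = len(store.index)
            store.index.append(vector)            # step 3: add the vector to the index
            store.docstore[row] = chunk           # step 4: link the row back to the chunk
        return store


chunks = [
    Chunk("Here is the stone", {"page": 3}),
    Chunk("The goose came from Mr. Henry Baker", {"page": 3}),
]
store = ToyVectorStore.from_documents(chunks, toy_embed)
print(len(store.index), "vectors of dimension", len(store.index[0]))
print(store.docstore[1].metadata)
```

The key design point is the shared row number: a search only has to return row positions, and the store can then look up the original text and metadata for each one.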

---

## Performing Similarity Search

Now that our FAISS index is ready, we can query it for semantically relevant chunks:

```python
# Define our search query
query = "What was the main clue?"

# Retrieve the top 3 most similar chunks
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Display the first 300 characters of each result
for doc in retrieved_docs:
    print(doc.page_content[:300], "...\n")
```

**Sample Output**

```
The little man stood glancing from one to the
other of us with half-frightened, half-hopeful eyes,
as one who is not sure whether he is on the verge
of a windfall or of a catastrophe. Then he stepped
into the cab, and in half an hour we were back in
the sitting-room at Baker Street. Nothing had been ...

less innocent aspect. Here is the stone; the stone
came from the goose, and the goose came from Mr.
Henry Baker, the gentleman with the bad hat and
all the other characteristics with which I have bored
you. So now we must set ourselves very seriously
to ﬁnding this gentleman and ascertaining what
pa ...

she found matters as described by the last
witness. Inspector Bradstreet, B division,
gave evidence as to the arrest of Horner,
who struggled frantically, and protested his
innocence in the strongest terms. Evidence
of a previous conviction for robbery having
been given against the prisoner, the mag ...
```

Even though the exact phrase “main clue” doesn’t appear, FAISS retrieves passages discussing the key evidence (the blue carbuncle, witness testimony, etc.)—all relevant to our query.
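
To see why a match does not need shared keywords, it may help to look at the ranking step on its own. The sketch below ranks chunks by Euclidean (L2) distance to a query vector, which, as far as I know, is the default metric for LangChain's FAISS wrapper. The two-dimensional vectors and chunk labels are hypothetical values chosen for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import math

# Hypothetical 2-D "embeddings" for three chunks (illustrative values only)
chunk_vectors = {
    "stone found in the goose": [0.9, 0.1],
    "arrest of Horner": [0.7, 0.4],
    "weather in London": [0.0, 1.0],
}

# Hypothetical embedding of the query
query_vector = [1.0, 0.0]


def l2_distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Sort chunk labels from closest to farthest; a smaller distance means more similar
ranked = sorted(
    chunk_vectors,
    key=lambda text: l2_distance(query_vector, chunk_vectors[text]),
)

for text in ranked:
    print(f"{l2_distance(query_vector, chunk_vectors[text]):.3f}  {text}")
```

The search compares positions in vector space rather than words, so a chunk can rank first without containing any term from the query.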

---

## Summary and Next Steps

In this lesson you learned:

1. **How to load, split, and embed documents** with LangChain.
2. **How to build a FAISS vector store** to index those embeddings.
3. **How to perform a similarity search** to retrieve semantically related text chunks.

🔍 **Practice Exercise**: Try using different queries or documents (e.g., another public domain text) and observe how FAISS returns the most relevant passages. This hands-on practice will solidify your understanding before we move on to the next unit.

Keep up the great work, and see you in the next lesson! 🚀



## Exploring Vector Store Details

You've been doing well with understanding document processing and embeddings. Now, let's explore the details of the vector store we created using FAISS.

Just run the code and observe the output:

- It will show the number of document chunks created.
- It will display the embedding dimensions used.

This will help you see how the setup works. Enjoy exploring the results!

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# Store document vectors in FAISS using the embedding model
vectorstore = FAISS.from_documents(split_docs, embedding_model)

# Print the number of stored vectors (one per chunk) and the embedding dimensions
print(f"Number of documents: {vectorstore.index.ntotal}")
print(f"Embedding dimensions: {vectorstore.index.d}")
```

## Formulate a Query for the Similarity Search

Well done on understanding vector stores. Now, let's perform your first similarity search on a document using FAISS.

Your task is to formulate a question about the document to guide the search. Here are some example questions you might consider:

- "What is the main event?"
- "Who is the main character?"
- "What is the setting of the story?"

Once you've crafted your question, execute the code to perform the similarity search and retrieve relevant information from the document. Dive in and see how effectively you can extract meaningful insights!


```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# Store document vectors in FAISS using the embedding model
vectorstore = FAISS.from_documents(split_docs, embedding_model)

# TODO: Write a question about the document for the similarity search
query = "_________________________"

# Perform similarity search to find the top 3 most relevant document chunks
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Print the content of the retrieved documents
for doc in retrieved_docs:
    print(doc.page_content[:300], "...\n")


```

Here’s the completed snippet with a query asking for the main character in the story:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# Store document vectors in FAISS using the embedding model
vectorstore = FAISS.from_documents(split_docs, embedding_model)

# Ask about the main character in the story
query = "Who is the main character in The Adventure of the Blue Carbuncle?"

# Perform similarity search to find the top 3 most relevant document chunks
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Print the content of the retrieved document snippets
for doc in retrieved_docs:
    print(doc.page_content[:300], "...\n")
```

Running this will retrieve and print the passages that most closely discuss Sherlock Holmes (the main character) and related context.


## Adjusting Document Retrieval Quantity

Nice work on learning how to perform similarity searches! Now, let's explore how to adjust the number of document chunks retrieved during a search.

Currently, the code retrieves the top 3 most relevant document chunks. Your task is to change this number to 5.

This small change will help you understand how to control the amount of context you retrieve for a query. Dive in and see the difference!
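
As a rough illustration of what `k` controls, the sketch below picks the `k` closest chunks from a table of distances. The chunk names and distance values are hypothetical, and `top_k` is an illustrative helper rather than part of LangChain or FAISS.

```python
import heapq

# Hypothetical distances between one query and six chunks (lower = more similar)
distances = {
    "chunk_a": 0.21,
    "chunk_b": 0.35,
    "chunk_c": 0.40,
    "chunk_d": 0.48,
    "chunk_e": 0.52,
    "chunk_f": 0.77,
}


def top_k(distances, k):
    """Return the k chunk names with the smallest distance, closest first."""
    return heapq.nsmallest(k, distances, key=distances.get)


top_3 = top_k(distances, 3)
top_5 = top_k(distances, 5)
print("k=3:", top_3)
print("k=5:", top_5)
```

In this sketch, raising `k` never changes the earlier results; it only appends the next-closest chunks. The extra context comes at the price of including passages that are progressively less similar to the query.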


```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# Store document vectors in FAISS using the embedding model
vectorstore = FAISS.from_documents(split_docs, embedding_model)

# Define the search query
query = "What was the main clue?"

# TODO: Change the number of retrieved document chunks to 5
retrieved_docs = vectorstore.similarity_search(query, k=3)

# Print the content of the retrieved documents
for doc in retrieved_docs:
    print(doc.page_content[:300], "...\n")


```

Here’s the completed snippet with `k` set to 5:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# Store document vectors in FAISS using the embedding model
vectorstore = FAISS.from_documents(split_docs, embedding_model)

# Define the search query
query = "What was the main clue?"

# Retrieve the top 5 most relevant document chunks
retrieved_docs = vectorstore.similarity_search(query, k=5)

# Print the content of the retrieved document snippets
for doc in retrieved_docs:
    print(doc.page_content[:300], "...\n")
```


## Similarity Search with FAISS

Finally, let's put all your knowledge into practice by creating a vector store using FAISS and performing a similarity search.

Your task is to:

1. Generate and store document vectors in FAISS using the embeddings model that has already been initialized for you.
2. Define a search query related to the Sherlock Holmes story.
3. Perform a similarity search to retrieve the top 5 most relevant document chunks.
4. Print the first 100 characters of each retrieved document.

This exercise will help you see how effectively you can retrieve relevant information from the text using vector similarity.


```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# TODO: Generate and store document vectors in FAISS using the embeddings model

# TODO: Define a search query

# TODO: Perform a similarity search to retrieve the top 5 most relevant document chunks

# TODO: Print the first 100 characters of each retrieved document


```

Here’s one possible completed solution:

```python
from langchain_openai import OpenAIEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS

# Define the file path
file_path = "data/the_adventure_of_the_blue_carbuncle.pdf"

# Create a loader for our document
loader = PyPDFLoader(file_path)

# Load the document
docs = loader.load()

# Split the document into chunks
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
split_docs = text_splitter.split_documents(docs)

# Initialize the OpenAI embedding model
embedding_model = OpenAIEmbeddings()

# Generate and store document vectors in FAISS using the embeddings model
vectorstore = FAISS.from_documents(split_docs, embedding_model)

# Define a search query
query = "What was the main clue in the mystery?"

# Perform a similarity search to retrieve the top 5 most relevant document chunks
retrieved_docs = vectorstore.similarity_search(query, k=5)

# Print the first 100 characters of each retrieved document chunk
for i, doc in enumerate(retrieved_docs, start=1):
    snippet = doc.page_content[:100].replace("\n", " ")
    print(f"Result {i}: {snippet}...\n")
```
