# FAISS


- Author: [Jeongeun Lim](https://www.linkedin.com/in/jeongeun-lim-808978188/)
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)


[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/08-OutputFixingParser.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/08-OutputFixingParser.ipynb)


## Overview


`FAISS` is a library for efficient similarity search and clustering of dense vectors. It provides algorithms that search vector sets of any size, including sets that may not fit entirely in RAM.


In addition to the core search functionality, `FAISS` includes support code for evaluation and parameter tuning, making it a versatile tool for various applications in machine learning and artificial intelligence.


----
Key Benefits:


- Efficient Large-Scale Search:
`FAISS` ensures fast and accurate vector searches, even with millions of high-dimensional vectors.


- Memory Optimization:
Offers quantization techniques that compress vectors to reduce memory usage, usually at the cost of some search accuracy.


- Customizable Search Accuracy:
Users can fine-tune parameters to balance between search accuracy and speed according to specific requirements.


- Versatile Applications:
From machine learning to AI-powered recommendation systems, `FAISS` supports a wide range of use cases.


---- 
Implementation Steps:


To effectively integrate `FAISS` into your workflow, follow these steps:


1. Data Preparation:
Prepare and normalize your data, ensuring vectors are in a dense representation format.


2. Index Creation:
Select and build a `FAISS` index based on your dataset size and performance requirements. Common options include `IndexFlatL2` for exact brute-force search and `IVF` indexes for scalable inverted-file search.


3. Index Training (if needed):
For certain indices, such as `IVF` or `PQ`, train the index with representative data samples to optimize performance.


4. Search Execution:
Use the index to search for nearest neighbors, leveraging optional GPU acceleration for faster performance.


5. Evaluation and Tuning:
Test and evaluate the performance of your index, adjusting parameters like quantization levels or clustering size for improved results.
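
To make the steps above concrete, here is a minimal pure-Python sketch of what an exhaustive ("flat") L2 search does conceptually: normalize the vectors, store them, and score every stored vector against the query. The helper names (`l2_normalize`, `squared_l2`, `flat_search`) and the toy 2-D vectors are assumptions made for illustration; they are not part of the `FAISS` API, which performs the same kind of search with optimized native code.

```python
import math
from typing import List, Tuple

Vector = List[float]


def l2_normalize(vec: Vector) -> Vector:
    """Scale a vector to unit length (data preparation)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def squared_l2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_search(vectors: List[Vector], query: Vector, k: int) -> List[Tuple[int, float]]:
    """Exhaustive search: score every stored vector and keep the k closest."""
    scored = [(i, squared_l2(v, query)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy 2-D "embeddings" (hypothetical data, for illustration only)
stored = [l2_normalize(v) for v in [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]]
query = l2_normalize([1.0, 0.1])

# Each hit is (position in the index, squared distance), nearest first
hits = flat_search(stored, query, k=2)
print(hits)
```

An exhaustive scan like this needs no training step, but its cost grows linearly with the number of stored vectors, which is why trained indexes such as `IVF` become attractive for larger datasets.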


### Table of Contents


- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Load a Sample Dataset](#load-a-sample-dataset)
- [Create a VectorStore](#create-a-vectorstore)
    - [Create a FAISS VectorStore(from_documents)](#create-a-faiss-vectorstorefrom_documents)
    - [Create a FAISS VectorStore(from_texts)](#create-a-faiss-vectorstorefrom_texts)
- [Similarity Search](#similarity-search)
- [Data Addition Methods](#data-addition-methods)
- [Delete Documents](#delete-documents)
- [Local Persistence](#local-persistence)
- [FAISS Object Merge (Merge From)](#faiss-object-merge-merge-from)
- [Convert to Searcher (as_retriever)](#convert-to-searcher-as_retriever)


### References


- [LangChain : Faiss](https://python.langchain.com/docs/integrations/vectorstores/faiss)
- [Faiss Docs](https://faiss.ai/)


----

## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**
- `langchain-opentutorial` is a package that provides easy-to-use environment setup helpers, functions, and utilities for the tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.

In [75]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [76]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langsmith",
        "langchain",
        "langchain_openai",
        "langchain_community",
    ],
    verbose=False,
    upgrade=False,
)

In [77]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "FAISS",
    }
)

Environment variables have been set successfully.


Alternatively, you can set `OPENAI_API_KEY` in a `.env` file and load it.

[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps.

In [78]:
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Load a Sample Dataset
This section demonstrates how to load a text file with LangChain’s `TextLoader` and split it into smaller chunks with `RecursiveCharacterTextSplitter`.
The resulting documents are then ready to be embedded and stored in a `FAISS` vector store.

In [79]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Text splitting
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=0)

# Load the text file and convert it to List[Document] format
loader = TextLoader("data/the_little_prince.txt")

# Documents splitting
split_doc = loader.load_and_split(text_splitter)

# Classification by chapters: Split into Chapters 1-13 and 14-27
# Assumption: Each chapter starts with a specific pattern ("Chapter X") and is split based on this
chapter_1_to_13 = []
chapter_14_to_27 = []
# Set the starting group
current_group = 1

for doc in split_doc:
    content = doc.page_content
    # Identify chapters using a keyword (e.g., "Chapter X")
    if "Chapter 14" in content:
        # Switch to the second group after Chapter 14
        current_group = 2

    # Split documents based on the group
    if current_group == 1:
        chapter_1_to_13.append(doc)
    else:
        chapter_14_to_27.append(doc)

# Check the number of documents in each group
print(f"Group 1 (Chapters 1-13): {len(chapter_1_to_13)} documents")
print(f"Group 2 (Chapters 14-27): {len(chapter_14_to_27)} documents")

Group 1 (Chapters 1-13): 111 documents
Group 2 (Chapters 14-27): 99 documents


## Create a VectorStore

Key Initialization Parameters:

- Indexing Parameters
    - `embedding_function` (Embeddings): The embedding function to be used.
- Client Parameters
    - `index` (Any): The FAISS index to be used.
    - `docstore` (Docstore): The document store to be utilized.
    - `index_to_docstore_id` (Dict[int, str]): A mapping from the index to document store IDs.

**[Note]** 

- `FAISS` is a high-performance library for vector search and clustering.
- This class integrates `FAISS` with LangChain's VectorStore interface.
- By combining the `embedding function`, `FAISS index`, and `document store`, you can build an efficient vector search system.

In [80]:
import faiss
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_openai import OpenAIEmbeddings

# Embedding
embeddings = OpenAIEmbeddings()

# Calculate the size of the embedding dimension
dimension_size = len(embeddings.embed_query("hello world"))
print(dimension_size)

1536


In [81]:
# Create a FAISS vector store
db = FAISS(
    embedding_function=OpenAIEmbeddings(),
    index=faiss.IndexFlatL2(dimension_size),
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)
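
The relationship between the three constructor arguments can be easier to see in a toy model. The sketch below is not the LangChain class: `ToyVectorStore` and `toy_embed` are hypothetical names, and a plain Python list stands in for the `FAISS` index. It only illustrates how a row position in the index is resolved to a document through `index_to_docstore_id` and the docstore.

```python
from typing import Callable, Dict, List


class ToyVectorStore:
    """Illustrative stand-in showing how the three pieces relate."""

    def __init__(self, embedding_function: Callable[[str], List[float]]):
        self.embedding_function = embedding_function
        self.index: List[List[float]] = []                # plays the role of the FAISS index
        self.docstore: Dict[str, str] = {}                # document ID -> text
        self.index_to_docstore_id: Dict[int, str] = {}    # row position -> document ID

    def add(self, doc_id: str, text: str) -> None:
        """Embed the text, append the vector, and record both mappings."""
        position = len(self.index)
        self.index.append(self.embedding_function(text))
        self.docstore[doc_id] = text
        self.index_to_docstore_id[position] = doc_id

    def lookup(self, position: int) -> str:
        """Resolve an index row back to the stored text."""
        return self.docstore[self.index_to_docstore_id[position]]


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: [character count, word count]."""
    return [float(len(text)), float(len(text.split()))]


store = ToyVectorStore(toy_embed)
store.add("doc-a", "hello world")
store.add("doc-b", "the little prince")

print(store.index_to_docstore_id)
print(store.lookup(1))
```

A search on the real index returns row positions, so the mapping and the docstore are what turn those positions back into `Document` objects.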

### Create a FAISS VectorStore(from_documents)

The `from_documents` class method creates a FAISS vector store using a list of documents and an embedding function.

- Parameters:
    - `documents` (List[Document]): A list of documents to be added to the vector store.
    - `embedding` (Embeddings): The embedding function to be used.
    - `**kwargs`: Additional keyword arguments.

- How It Works:
    1. Extracts the text content (`page_content`) and metadata from the list of documents.
    2. Calls the `from_texts` method using the extracted text and metadata.

- Return Value:
    - `VectorStore`: An instance of the vector store initialized with the provided documents and embeddings.

**[Note]** 
- This method internally calls the `from_texts` method to create the vector store.
- The `page_content` of each document is used as text, while `metadata` is used as the document's metadata.
- Additional configurations can be passed through `kwargs`.
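
The delegation described above amounts to splitting each document into its text and its metadata before handing two parallel lists to `from_texts`. A minimal sketch of that split follows; `SimpleDoc` and `split_documents` are hypothetical names used here instead of the real `Document` class so that the example runs without any dependencies.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class SimpleDoc:
    """Hypothetical stand-in for a LangChain Document."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def split_documents(docs: List[SimpleDoc]) -> Tuple[List[str], List[Dict[str, Any]]]:
    """Separate documents into parallel lists of texts and metadata."""
    texts = [doc.page_content for doc in docs]
    metadatas = [doc.metadata for doc in docs]
    return texts, metadatas


docs = [
    SimpleDoc("[ Chapter 1 ]", {"source": "data/the_little_prince.txt"}),
    SimpleDoc("[ Chapter 2 ]", {"source": "data/the_little_prince.txt"}),
]

# Position i in `texts` corresponds to position i in `metadatas`
texts, metadatas = split_documents(docs)
print(texts)
print(metadatas)
```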

In [82]:
# Create a FAISS vector store from the documents
db = FAISS.from_documents(documents=chapter_1_to_13, embedding=OpenAIEmbeddings())

In [83]:
# Check the document store IDs
db.index_to_docstore_id

{0: 'b45f8bc1-c003-44a8-b99d-d9c254092cd5',
 1: 'ea5ccd67-04ea-408f-940a-230e351367e2',
 2: 'a8afd216-2235-4c24-8f9c-094b087057e8',
 3: 'd9cfe1d4-6a16-4185-81bf-666ce8909a91',
 4: '324c48ed-2364-4433-8ee5-ae1a862eab7d',
 5: '62bc291e-ff75-48f5-a9e9-9976762e3be2',
 6: '7d013dc0-5d55-41a6-93a4-1e8423e76d72',
 7: 'cf7224d3-2cf3-4871-b7b1-4991696e6760',
 8: 'e55dccd4-2598-4076-b4c3-456aaaf2adf8',
 9: '03ee783e-12b6-48a3-90ec-a754dce2bda1',
 10: '06c680a2-acb9-4a1a-a6cb-9c4b554294f8',
 11: 'af300200-186b-40b0-a378-7d68590b4d83',
 12: '57ac805e-8244-4bb0-aab3-9499e70e4520',
 13: 'f4a1c9ab-a294-4570-beb0-9f4de78e6bc1',
 14: '7ac38095-1bf1-402a-86de-c5ff398c867b',
 15: 'f779e2ba-e786-4968-b55d-1b00664b93a8',
 16: 'ca94a960-3227-4a8f-b5b9-6727e654a5a6',
 17: '933cee0c-130e-47f7-9495-1f783ad22494',
 18: 'cb7fb67f-22fd-43f7-b4d5-69b7dd3a60de',
 19: '6daf1f41-8ab2-40a3-81ac-05b7416f6abf',
 20: '216aa3e6-d401-4f6f-9195-22ad9f627ac5',
 21: '08b05c42-c715-4816-8769-04b8227b6f8d',
 22: '430a29cc-46a5-

In [84]:
# Inspect the stored documents (document ID -> Document)
db.docstore._dict

{'b45f8bc1-c003-44a8-b99d-d9c254092cd5': Document(id='b45f8bc1-c003-44a8-b99d-d9c254092cd5', metadata={'source': 'data/the_little_prince.txt'}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)'),
 'ea5ccd67-04ea-408f-940a-230e351367e2': Document(id='ea5ccd67-04ea-408f-940a-230e351367e2', metadata={'source': 'data/the_little_prince.txt'}, page_content='[ Antoine de Saiot-Exupery ]\nOver the past century, the thrill of flying has inspired some to perform remarkable feats of daring. For others, their desire to soar into the skies led to dramatic leaps in technology. For Antoine de Saint-Exupéry, his love of aviation inspired stories, which have touched the hearts of millions around the world.'),
 'a8afd216-2235-4c24-8f9c-094b087057e8': Document(id='a8afd216-2235-4c24-8f9c-094b087057e8', metadata={'source': 'data/the_little_prince.txt'}, page_content='Born in 1900 in Lyons, France, young Antoine was filled with a passion for adventure. When he failed an entr

### Create a FAISS VectorStore(from_texts)

The `from_texts` class method creates a FAISS vector store using a list of texts and an embedding function.

- Parameters:
    - `texts` (List[str]): A list of texts to be added to the vector store.
    - `embedding` (Embeddings): The embedding function to use.
    - `metadatas` (Optional[List[dict]]): A list of metadata. Default is None.
    - `ids` (Optional[List[str]]): A list of document IDs. Default is None.
    - `**kwargs`: Additional keyword arguments.

- How It Works:
    1. Texts are embedded using the provided embedding function.
    2. The `__from` method is called with the embedded vectors to create a `FAISS` instance.

- Return Value:
    - `FAISS`: The created FAISS vector store instance.

**[Note]**
- This method provides a user-friendly interface, handling document embedding, in-memory document storage, and `FAISS` database initialization in a single step.
- It’s a convenient way to get started quickly.

**[Caution]**
- Be mindful of memory usage when processing a large number of texts.
- When using metadata or IDs, ensure they are provided as lists with the same length as the text list.
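
One way to act on the second caution is to validate the inputs before calling `from_texts`. The helper below is a sketch under that assumption: `check_parallel_inputs` is a hypothetical function, not part of LangChain, and the uniqueness check on IDs is an extra safeguard added here for illustration.

```python
from typing import List, Optional


def check_parallel_inputs(
    texts: List[str],
    metadatas: Optional[List[dict]] = None,
    ids: Optional[List[str]] = None,
) -> None:
    """Raise ValueError if the optional lists do not line up with `texts`."""
    if metadatas is not None and len(metadatas) != len(texts):
        raise ValueError(
            f"Expected {len(texts)} metadata entries, got {len(metadatas)}"
        )
    if ids is not None:
        if len(ids) != len(texts):
            raise ValueError(f"Expected {len(texts)} IDs, got {len(ids)}")
        if len(set(ids)) != len(ids):
            raise ValueError("IDs must be unique")


# Matching lengths pass silently
check_parallel_inputs(
    ["first text", "second text"],
    metadatas=[{"chapter": 1}, {"chapter": 11}],
    ids=["doc1", "doc2"],
)

# A mismatched ID list is rejected before any embedding call is made
try:
    check_parallel_inputs(["first text", "second text"], ids=["doc1"])
    mismatch_detected = False
except ValueError as error:
    mismatch_detected = True
    print(error)
```

Running the check first avoids paying for embedding calls on input that would be rejected or misaligned later.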

In [85]:
# List of text content from "The Little Prince"
texts = [
    "Once when I was six years old I saw a magnificent picture in a book.",
    "The second planet was inhabited by a conceited man.",
    '"Good morning," said the fox.',
    "I had been so proud of my baobabs!",
]

# Corresponding metadata for each document
metadatas = [
    {"source": "text document", "chapter": 1},
    {"source": "text document", "chapter": 11},
    {"source": "text document", "chapter": 21},
    {"source": "text document", "chapter": 25},
]

# Unique IDs for each document
ids = ["doc1", "doc2", "doc3", "doc4"]

# Create the FAISS database
db2 = FAISS.from_texts(
    texts=texts, embedding=OpenAIEmbeddings(), metadatas=metadatas, ids=ids
)

# Verify that all documents are stored correctly
stored_content = db2.docstore._dict
print("Stored Documents:")
for doc_id, doc in stored_content.items():
    print(f"ID: {doc_id}, Metadata: {doc.metadata}, Content: {doc.page_content}")

Stored Documents:
ID: doc1, Metadata: {'source': 'text document', 'chapter': 1}, Content: Once when I was six years old I saw a magnificent picture in a book.
ID: doc2, Metadata: {'source': 'text document', 'chapter': 11}, Content: The second planet was inhabited by a conceited man.
ID: doc3, Metadata: {'source': 'text document', 'chapter': 21}, Content: "Good morning," said the fox.
ID: doc4, Metadata: {'source': 'text document', 'chapter': 25}, Content: I had been so proud of my baobabs!


## Similarity Search

The `similarity_search` method allows you to search for documents most similar to a given query.

- Parameters:
    - `query` (str): The search query text for finding similar documents.
    - `k` (int): The number of documents to return. Default is 4.
    - `filter` (Optional[Union[Callable, Dict[str, Any]]]): A metadata filtering function or dictionary. Default is None.
    - `fetch_k` (int): The number of documents to retrieve before applying filtering. Default is 20.
    - `**kwargs`: Additional keyword arguments.

- Returns:
    - `List[Document]`: A list of documents most similar to the query.

- How It Works:
    1. Internally calls the `similarity_search_with_score` method to search for documents along with their similarity scores.
    2. Extracts and returns only the documents from the results, excluding the scores.

- Key Features:
    - The `filter` parameter enables metadata-based filtering.
    - The `fetch_k` parameter allows control over the number of documents retrieved before filtering, ensuring enough documents remain after filtering.

- Considerations:
    - Search performance heavily depends on the quality of the embedding model used.
    - In large datasets, it is important to adjust the values of `k` and `fetch_k` to balance search speed and accuracy.
    - For complex filtering needs, pass a custom function to the `filter` parameter for fine-grained control.

- Optimization Tips:
    - Cache the results for frequently used queries to improve the speed of repeated searches.
    - Avoid setting `fetch_k` too high, as it may slow down search performance. Experiment to find an appropriate value.
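
The interaction between `k`, `fetch_k`, and `filter` can be sketched without a vector store. The example below assumes the candidates are already ranked by distance and applies the order of operations described above: take the `fetch_k` nearest, filter them, then keep at most `k`. The function `filtered_top_k` and the sample data are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Candidate = Tuple[float, Dict[str, Any]]  # (distance, metadata)


def filtered_top_k(
    ranked: List[Candidate],
    k: int,
    fetch_k: int,
    predicate: Callable[[Dict[str, Any]], bool],
) -> List[Candidate]:
    """Take the fetch_k nearest candidates, filter them, then keep at most k."""
    pool = ranked[:fetch_k]
    kept = [candidate for candidate in pool if predicate(candidate[1])]
    return kept[:k]


# Candidates already sorted by distance, nearest first (made-up values)
ranked = [
    (0.10, {"chapter": 3}),
    (0.12, {"chapter": 9}),
    (0.20, {"chapter": 3}),
    (0.25, {"chapter": 9}),
    (0.30, {"chapter": 9}),
]


def is_chapter_9(metadata: Dict[str, Any]) -> bool:
    return metadata["chapter"] == 9


# With a small fetch_k, filtering leaves fewer than k results
narrow = filtered_top_k(ranked, k=2, fetch_k=2, predicate=is_chapter_9)
# A larger fetch_k leaves enough candidates after filtering
wide = filtered_top_k(ranked, k=2, fetch_k=5, predicate=is_chapter_9)

print(len(narrow), len(wide))
```

This is why a selective filter may call for a `fetch_k` well above `k`, while an unselective one does not.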

In [86]:
# Similarity Search
db.similarity_search("Your planet is very beautiful")

[Document(id='e96703a3-ce03-4b96-bf01-e5e5b85da792', metadata={'source': 'data/the_little_prince.txt'}, page_content='But that did not really surprise me much. I knew very well that in addition to the great planets-- such as the Earth, Jupiter, Mars, Venus-- to which we have given names, there are also hundreds of others, some of which are so small that one has a hard time seeing them through the telescope. When an astronomer discovers one of these he does not give it a name, but only a number. He might call it, for example, "Asteroid 325."'),
 Document(id='8d6db04e-3f95-4b3e-a80b-57efa25a9be5', metadata={'source': 'data/the_little_prince.txt'}, page_content='[ Chapter 9 ]\n- the little prince leaves his planet'),
 Document(id='111d25df-2a3c-4d28-8593-dc639bdefa51', metadata={'source': 'data/the_little_prince.txt'}, page_content='On our earth we are obviously much too small to clean out our volcanoes. That is why they bring no end of trouble upon us. \nThe little prince also pulled up,

In [87]:
# Specify the value of k (number of documents to return)
db.similarity_search("Your planet is very beautiful", k=2)

[Document(id='e96703a3-ce03-4b96-bf01-e5e5b85da792', metadata={'source': 'data/the_little_prince.txt'}, page_content='But that did not really surprise me much. I knew very well that in addition to the great planets-- such as the Earth, Jupiter, Mars, Venus-- to which we have given names, there are also hundreds of others, some of which are so small that one has a hard time seeing them through the telescope. When an astronomer discovers one of these he does not give it a name, but only a number. He might call it, for example, "Asteroid 325."'),
 Document(id='8d6db04e-3f95-4b3e-a80b-57efa25a9be5', metadata={'source': 'data/the_little_prince.txt'}, page_content='[ Chapter 9 ]\n- the little prince leaves his planet')]

In [88]:
# Use a filter to narrow results based on metadata
db.similarity_search(
    "Your planet is very beautiful",
    filter={"source": "data/the_little_prince.txt"},
    k=2,
)

[Document(id='e96703a3-ce03-4b96-bf01-e5e5b85da792', metadata={'source': 'data/the_little_prince.txt'}, page_content='But that did not really surprise me much. I knew very well that in addition to the great planets-- such as the Earth, Jupiter, Mars, Venus-- to which we have given names, there are also hundreds of others, some of which are so small that one has a hard time seeing them through the telescope. When an astronomer discovers one of these he does not give it a name, but only a number. He might call it, for example, "Asteroid 325."'),
 Document(id='8d6db04e-3f95-4b3e-a80b-57efa25a9be5', metadata={'source': 'data/the_little_prince.txt'}, page_content='[ Chapter 9 ]\n- the little prince leaves his planet')]

## Data Addition Methods
This section describes how to add data to a `FAISS` vector store from either documents or raw texts, so you can populate the store with whichever input type you have.

### Add from Document (add_documents)
The `add_documents` method adds `Document` objects to the vector store.

- Parameters:
    - `documents` (List[Document]): A list of Document objects to be added to the vector store.
    - `**kwargs`: Additional keyword arguments.

- Return Value:
    - `List[str]`: A list of IDs for the added texts.

- Functionality:
    1. Extracts text content and metadata from the documents.
    2. Calls the `add_texts` method to perform the actual addition process.

- Key Features:
    - Convenient for handling Document objects directly.
    - Includes ID handling logic to ensure the uniqueness of the documents.
    - Operates based on the `add_texts` method, promoting code reusability.

In [89]:
from langchain_core.documents import Document

# Specify page_content and metadata
db.add_documents(
    [
        Document(
            page_content="Hello! This time, I will add a new document.",
            # Metadata specifying the source of the document
            metadata={"source": "mydata.txt"},
        )
    ],
    # Unique ID for the new document
    ids=["new_doc1"],
)

['new_doc1']

In [90]:
# Verify the added data by performing a similarity search
db.similarity_search("hello", k=1)

[Document(id='new_doc1', metadata={'source': 'mydata.txt'}, page_content='Hello! This time, I will add a new document.')]

### Add from text (add_texts)


The `add_texts` method provides the functionality to embed texts and add them to the vector store.


- Parameters:
    - `texts` (Iterable[str]): An iterable of texts to be added to the vector store.
    - `metadatas` (Optional[List[dict]]): A list of metadata associated with the texts (optional).
    - `ids` (Optional[List[str]]): A list of unique identifiers for the texts (optional).
    - `**kwargs`: Additional keyword arguments.


- Return Value:
    - `List[str]`: A list of IDs of the texts added to the vector store.


- How it works:
    1. The input texts iterable is converted into a list.
    2. The `_embed_documents` method is used to embed the texts.
    3. The `__add` method is called to add the embedded texts to the vector store.
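
The three steps can be sketched in plain Python. This is an illustration only: `add_texts_sketch` and `toy_embed` are hypothetical names, a list stands in for the index, and the fallback to random UUIDs when no IDs are given is an assumption suggested by the UUID-style IDs shown in the earlier outputs.

```python
import uuid
from typing import Dict, Iterable, List, Optional


def toy_embed(text: str) -> List[float]:
    """Hypothetical one-dimensional embedding: the text length."""
    return [float(len(text))]


def add_texts_sketch(
    texts: Iterable[str],
    index: List[List[float]],
    index_to_id: Dict[int, str],
    ids: Optional[List[str]] = None,
) -> List[str]:
    """Materialize the iterable, embed each text, and append to the index."""
    texts = list(texts)                                  # 1. iterable -> list
    vectors = [toy_embed(text) for text in texts]        # 2. embed
    ids = ids or [str(uuid.uuid4()) for _ in texts]      # generate IDs if none given
    start = len(index)
    index.extend(vectors)                                # 3. add to the index
    for offset, doc_id in enumerate(ids):
        index_to_id[start + offset] = doc_id
    return ids


index: List[List[float]] = []
index_to_id: Dict[int, str] = {}

# A generator works because the input is converted to a list first
explicit = add_texts_sketch((t for t in ["alpha", "beta"]), index, index_to_id, ids=["a", "b"])
generated = add_texts_sketch(["gamma"], index, index_to_id)

print(explicit, generated)
print(index_to_id)
```

New entries always receive the next free positions, so existing rows keep their index numbers when data is appended.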

In [91]:
# Add new text data
db.add_texts(
    ["This time, we're adding text data.", "This is the second text data being added."],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["new_doc2", "new_doc3"],
)

['new_doc2', 'new_doc3']

In [92]:
# Check the added data
db.index_to_docstore_id

{0: 'b45f8bc1-c003-44a8-b99d-d9c254092cd5',
 1: 'ea5ccd67-04ea-408f-940a-230e351367e2',
 2: 'a8afd216-2235-4c24-8f9c-094b087057e8',
 3: 'd9cfe1d4-6a16-4185-81bf-666ce8909a91',
 4: '324c48ed-2364-4433-8ee5-ae1a862eab7d',
 5: '62bc291e-ff75-48f5-a9e9-9976762e3be2',
 6: '7d013dc0-5d55-41a6-93a4-1e8423e76d72',
 7: 'cf7224d3-2cf3-4871-b7b1-4991696e6760',
 8: 'e55dccd4-2598-4076-b4c3-456aaaf2adf8',
 9: '03ee783e-12b6-48a3-90ec-a754dce2bda1',
 10: '06c680a2-acb9-4a1a-a6cb-9c4b554294f8',
 11: 'af300200-186b-40b0-a378-7d68590b4d83',
 12: '57ac805e-8244-4bb0-aab3-9499e70e4520',
 13: 'f4a1c9ab-a294-4570-beb0-9f4de78e6bc1',
 14: '7ac38095-1bf1-402a-86de-c5ff398c867b',
 15: 'f779e2ba-e786-4968-b55d-1b00664b93a8',
 16: 'ca94a960-3227-4a8f-b5b9-6727e654a5a6',
 17: '933cee0c-130e-47f7-9495-1f783ad22494',
 18: 'cb7fb67f-22fd-43f7-b4d5-69b7dd3a60de',
 19: '6daf1f41-8ab2-40a3-81ac-05b7416f6abf',
 20: '216aa3e6-d401-4f6f-9195-22ad9f627ac5',
 21: '08b05c42-c715-4816-8769-04b8227b6f8d',
 22: '430a29cc-46a5-

## Delete Documents


The `delete` method is used to remove documents from the vector store based on their specified IDs.


- Parameters:
    - `ids` (Optional[List[str]]): A list of document IDs to delete.
    - `**kwargs`: Additional keyword arguments (not utilized in this method).


- Return Value:
    - `Optional[bool]`: Returns True if the deletion is successful, False if it fails, or None if the functionality is not implemented.


- How It Works:
    1. Validates the provided IDs.
    2. Finds the indices corresponding to the IDs to be deleted.
    3. Removes the entries with the given IDs from the `FAISS` index.
    4. Deletes the documents associated with the IDs from the document store.
    5. Updates the index-to-ID mapping.


- Key Features:
    - Ensures precise document management using ID-based deletion.
    - Performs deletion on both the `FAISS` index and the document store for consistency.
    - Keeps the index-to-ID mapping consistent by renumbering the remaining entries after deletion.


**[Caution]**
- Deletion is irreversible, so it should be done with care.
- The method lacks concurrency control, which requires caution in multi-threaded environments.
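
The bookkeeping behind these steps can be sketched with plain Python containers. The function below is a simplified illustration, not the library's implementation: `delete_by_ids` is a hypothetical name, a list stands in for the index, and the renumbering simply closes the gaps left by the removed rows.

```python
from typing import Dict, List


def delete_by_ids(
    index: List[List[float]],
    docstore: Dict[str, str],
    index_to_id: Dict[int, str],
    ids: List[str],
) -> bool:
    """Remove the given IDs from all three structures and renumber the mapping."""
    # 1. Validate: every requested ID must exist
    missing = set(ids) - set(index_to_id.values())
    if missing:
        raise ValueError(f"Unknown IDs: {sorted(missing)}")

    # 2. Find the index positions that belong to those IDs
    doomed = {pos for pos, doc_id in index_to_id.items() if doc_id in ids}

    # 3. Drop the vectors, 4. drop the documents
    index[:] = [vec for pos, vec in enumerate(index) if pos not in doomed]
    for doc_id in ids:
        del docstore[doc_id]

    # 5. Renumber the surviving entries so positions stay contiguous
    remaining = [doc_id for pos, doc_id in sorted(index_to_id.items()) if pos not in doomed]
    index_to_id.clear()
    index_to_id.update(enumerate(remaining))
    return True


index = [[0.0], [1.0], [2.0]]
docstore = {"a": "first", "b": "second", "c": "third"}
index_to_id = {0: "a", 1: "b", 2: "c"}

deleted = delete_by_ids(index, docstore, index_to_id, ["b"])
print(deleted, index_to_id)
```

Because all three structures are modified one after another, a concurrent reader could observe them out of sync, which is the reason for the concurrency caution above.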

In [93]:
# Add data for deletion
ids = db.add_texts(
    ["Adding data for deletion.", "This is the second data entry for deletion."],
    metadatas=[{"source": "mydata.txt"}, {"source": "mydata.txt"}],
    ids=["delete_doc1", "delete_doc2"],
)

# Verify the IDs of the added data
print(ids)

['delete_doc1', 'delete_doc2']


Pass the IDs to the `delete` method to remove the corresponding documents.

In [94]:
# Delete by IDs
db.delete(ids)

True

In [95]:
# Output the result after deletion
db.index_to_docstore_id

{0: 'b45f8bc1-c003-44a8-b99d-d9c254092cd5',
 1: 'ea5ccd67-04ea-408f-940a-230e351367e2',
 2: 'a8afd216-2235-4c24-8f9c-094b087057e8',
 3: 'd9cfe1d4-6a16-4185-81bf-666ce8909a91',
 4: '324c48ed-2364-4433-8ee5-ae1a862eab7d',
 5: '62bc291e-ff75-48f5-a9e9-9976762e3be2',
 6: '7d013dc0-5d55-41a6-93a4-1e8423e76d72',
 7: 'cf7224d3-2cf3-4871-b7b1-4991696e6760',
 8: 'e55dccd4-2598-4076-b4c3-456aaaf2adf8',
 9: '03ee783e-12b6-48a3-90ec-a754dce2bda1',
 10: '06c680a2-acb9-4a1a-a6cb-9c4b554294f8',
 11: 'af300200-186b-40b0-a378-7d68590b4d83',
 12: '57ac805e-8244-4bb0-aab3-9499e70e4520',
 13: 'f4a1c9ab-a294-4570-beb0-9f4de78e6bc1',
 14: '7ac38095-1bf1-402a-86de-c5ff398c867b',
 15: 'f779e2ba-e786-4968-b55d-1b00664b93a8',
 16: 'ca94a960-3227-4a8f-b5b9-6727e654a5a6',
 17: '933cee0c-130e-47f7-9495-1f783ad22494',
 18: 'cb7fb67f-22fd-43f7-b4d5-69b7dd3a60de',
 19: '6daf1f41-8ab2-40a3-81ac-05b7416f6abf',
 20: '216aa3e6-d401-4f6f-9195-22ad9f627ac5',
 21: '08b05c42-c715-4816-8769-04b8227b6f8d',
 22: '430a29cc-46a5-

## Local Persistence


### Save Local


The `save_local` method enables saving the `FAISS` index, document store, and index-to-document ID mapping to the local disk.


- Parameters:
    - `folder_path` (str): The folder path where the data will be saved.
    - `index_name` (str): The name of the index file to be saved (default: "index").


- How It Works:
    1. Creates the specified folder path (ignored if it already exists).
    2. Saves the `FAISS` index as a separate file.
    3. Stores the document store and index-to-document ID mapping in pickle format.


- Usage Considerations:
    - Write permissions are required for the specified save path.
    - For large datasets, significant storage space and time may be required.
    - Be mindful of potential security risks associated with using pickle.

In [96]:
# Save to local disk
db.save_local(folder_path="faiss_db", index_name="faiss_index")

### Load Local


The `load_local` class method allows loading a `FAISS` index, document store, and index-to-document ID mapping saved on the local disk.


- Parameters:
    - `folder_path` (str): The folder path where the saved files are located.
    - `embeddings` (Embeddings): The embedding object used for generating queries.
    - `index_name` (str): The name of the index file to load (default: "index").
    - `allow_dangerous_deserialization` (bool): Whether to allow deserialization of pickle files (default: False).


- Returns:
    - `FAISS`: The loaded `FAISS` object.


- How It Works:
    1. Ensures deserialization risks are considered and requires explicit user permission.
    2. Loads the `FAISS` index separately.
    3. Uses pickle to deserialize the document store and index-to-document ID mapping.
    4. Creates and returns a `FAISS` object using the loaded data.
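
The save and load steps, including the explicit opt-in for deserialization, can be sketched with the standard library. This is a simplified illustration rather than the library's code: `save_state` and `load_state` are hypothetical helpers, only the docstore and mapping are persisted, and the `.pkl` file name is an assumption.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Dict, Tuple


def save_state(
    folder: str,
    docstore: Dict[str, str],
    index_to_id: Dict[int, str],
    index_name: str = "index",
) -> None:
    """Create the folder if needed and pickle the docstore with its mapping."""
    path = Path(folder)
    path.mkdir(parents=True, exist_ok=True)
    with open(path / f"{index_name}.pkl", "wb") as handle:
        pickle.dump((docstore, index_to_id), handle)


def load_state(
    folder: str,
    index_name: str = "index",
    allow_dangerous_deserialization: bool = False,
) -> Tuple[Dict[str, str], Dict[int, str]]:
    """Unpickle the saved state, but only after an explicit opt-in."""
    if not allow_dangerous_deserialization:
        raise ValueError(
            "Unpickling can execute arbitrary code; "
            "opt in only for files you created or trust."
        )
    with open(Path(folder) / f"{index_name}.pkl", "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    save_state(tmp, {"doc1": "hello"}, {0: "doc1"}, index_name="faiss_index")
    docstore, index_to_id = load_state(
        tmp, "faiss_index", allow_dangerous_deserialization=True
    )
    try:
        load_state(tmp, "faiss_index")  # no opt-in: refused
        refused = False
    except ValueError:
        refused = True

print(docstore, index_to_id, refused)
```

Requiring the flag makes the trust decision visible at the call site; it should be set to `True` only for files from a source you control.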

In [97]:
# Load the saved data
loaded_db = FAISS.load_local(
    folder_path="faiss_db",
    index_name="faiss_index",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)

# Verify the loaded data
loaded_db.index_to_docstore_id

{0: 'b45f8bc1-c003-44a8-b99d-d9c254092cd5',
 1: 'ea5ccd67-04ea-408f-940a-230e351367e2',
 2: 'a8afd216-2235-4c24-8f9c-094b087057e8',
 3: 'd9cfe1d4-6a16-4185-81bf-666ce8909a91',
 4: '324c48ed-2364-4433-8ee5-ae1a862eab7d',
 5: '62bc291e-ff75-48f5-a9e9-9976762e3be2',
 6: '7d013dc0-5d55-41a6-93a4-1e8423e76d72',
 7: 'cf7224d3-2cf3-4871-b7b1-4991696e6760',
 8: 'e55dccd4-2598-4076-b4c3-456aaaf2adf8',
 9: '03ee783e-12b6-48a3-90ec-a754dce2bda1',
 10: '06c680a2-acb9-4a1a-a6cb-9c4b554294f8',
 11: 'af300200-186b-40b0-a378-7d68590b4d83',
 12: '57ac805e-8244-4bb0-aab3-9499e70e4520',
 13: 'f4a1c9ab-a294-4570-beb0-9f4de78e6bc1',
 14: '7ac38095-1bf1-402a-86de-c5ff398c867b',
 15: 'f779e2ba-e786-4968-b55d-1b00664b93a8',
 16: 'ca94a960-3227-4a8f-b5b9-6727e654a5a6',
 17: '933cee0c-130e-47f7-9495-1f783ad22494',
 18: 'cb7fb67f-22fd-43f7-b4d5-69b7dd3a60de',
 19: '6daf1f41-8ab2-40a3-81ac-05b7416f6abf',
 20: '216aa3e6-d401-4f6f-9195-22ad9f627ac5',
 21: '08b05c42-c715-4816-8769-04b8227b6f8d',
 22: '430a29cc-46a5-

## FAISS Object Merge (Merge From)


The `merge_from` method allows merging another `FAISS` object into the current `FAISS` object.


- Parameters:
    - `target` (`FAISS`): The target `FAISS` object to be merged into the current object.


- How It Works:
    1. Checks if the document stores are compatible for merging.
    2. Assigns new indices to the incoming documents based on the length of the existing index.
    3. Merges the `FAISS` index.
    4. Extracts documents and ID information from the target `FAISS` object.
    5. Adds the extracted information to the current document store and index-to-document ID mapping.


- Key Features:
    - Merges the indices, document stores, and index-to-document ID mappings of two `FAISS` objects.
    - Maintains continuity of index numbering during the merge.
    - Ensures compatibility of document stores before proceeding with the merge.


**[Cautions]**
- The structure of the target `FAISS` object must be compatible with the current object.
- Be cautious of duplicate IDs, as the current implementation does not check for duplicates.
- If an exception occurs during the merge process, it may leave the system in a partially merged state.
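
Given the cautions above, one defensive pattern is to check for overlapping IDs before merging. The sketch below models only the index-to-ID mapping: `merge_mappings` is a hypothetical helper, and the offset logic mirrors the description of assigning new indices based on the length of the existing index.

```python
from typing import Dict


def merge_mappings(current: Dict[int, str], incoming: Dict[int, str]) -> Dict[int, str]:
    """Append `incoming` after `current`, refusing to merge overlapping IDs."""
    duplicates = set(current.values()) & set(incoming.values())
    if duplicates:
        raise ValueError(f"Duplicate IDs: {sorted(duplicates)}")

    offset = len(current)  # incoming rows continue after the existing ones
    merged = dict(current)  # work on a copy so a failure leaves `current` intact
    for position, doc_id in incoming.items():
        merged[offset + position] = doc_id
    return merged


first = {0: "a", 1: "b"}
second = {0: "c", 1: "d"}
merged = merge_mappings(first, second)
print(merged)

try:
    merge_mappings(first, {0: "b"})
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

With real vector stores, the same pre-check can be written against the attribute used throughout this tutorial, for example by intersecting `set(db.index_to_docstore_id.values())` with `set(db2.index_to_docstore_id.values())` before calling `merge_from`.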

In [98]:
# Load the saved data
db = FAISS.load_local(
    folder_path="faiss_db",
    index_name="faiss_index",
    embeddings=embeddings,
    allow_dangerous_deserialization=True,
)

# Create a second FAISS vector store from the remaining chapters (14-27)
db2 = FAISS.from_documents(documents=chapter_14_to_27, embedding=OpenAIEmbeddings())

# Check the data in db
db.index_to_docstore_id

{0: 'b45f8bc1-c003-44a8-b99d-d9c254092cd5',
 1: 'ea5ccd67-04ea-408f-940a-230e351367e2',
 2: 'a8afd216-2235-4c24-8f9c-094b087057e8',
 3: 'd9cfe1d4-6a16-4185-81bf-666ce8909a91',
 4: '324c48ed-2364-4433-8ee5-ae1a862eab7d',
 5: '62bc291e-ff75-48f5-a9e9-9976762e3be2',
 6: '7d013dc0-5d55-41a6-93a4-1e8423e76d72',
 7: 'cf7224d3-2cf3-4871-b7b1-4991696e6760',
 8: 'e55dccd4-2598-4076-b4c3-456aaaf2adf8',
 9: '03ee783e-12b6-48a3-90ec-a754dce2bda1',
 10: '06c680a2-acb9-4a1a-a6cb-9c4b554294f8',
 11: 'af300200-186b-40b0-a378-7d68590b4d83',
 12: '57ac805e-8244-4bb0-aab3-9499e70e4520',
 13: 'f4a1c9ab-a294-4570-beb0-9f4de78e6bc1',
 14: '7ac38095-1bf1-402a-86de-c5ff398c867b',
 15: 'f779e2ba-e786-4968-b55d-1b00664b93a8',
 16: 'ca94a960-3227-4a8f-b5b9-6727e654a5a6',
 17: '933cee0c-130e-47f7-9495-1f783ad22494',
 18: 'cb7fb67f-22fd-43f7-b4d5-69b7dd3a60de',
 19: '6daf1f41-8ab2-40a3-81ac-05b7416f6abf',
 20: '216aa3e6-d401-4f6f-9195-22ad9f627ac5',
 21: '08b05c42-c715-4816-8769-04b8227b6f8d',
 22: '430a29cc-46a5-

In [99]:
# Check the data in db2
db2.index_to_docstore_id

{0: '01d607fa-e877-4d97-9ca8-54321dbda6c3',
 1: '83bacd37-f782-46f2-8e04-fca5e2dc3003',
 2: 'bb48edb9-63dc-4358-9029-7f73e1cf5737',
 3: '3cb10a63-f69e-4e82-aa5b-80e48e6c6fa0',
 4: 'ced7912a-e69c-4f37-85ca-f5644373eeec',
 5: 'a8e5eb0c-2270-49f9-9654-a51103ee86a0',
 6: '00c30b78-1629-43a6-8637-b8e42576e910',
 7: 'a96510ac-41f3-48ef-b63d-1252cfebad2d',
 8: '08332f8b-5180-41d3-a42a-e113c055b936',
 9: '529b64bc-4194-4f85-bd89-a41c1b399971',
 10: '3ef7ce4c-1d82-4e1a-b960-a7ebaf6e0414',
 11: 'eba927f6-2512-40af-bbfb-c05c805c0fd7',
 12: '3f002b89-ce88-44b6-9c03-1f65b9585809',
 13: 'a0492526-56d7-43c9-a611-b7bd2a520556',
 14: '6918176e-1abe-421a-8c97-1d5da1da5a41',
 15: 'c77bf41a-15ad-4a3c-8f7a-b8f484a462c6',
 16: 'e51a432d-d6fc-49fe-9ed1-5d213c5778a2',
 17: '13cd9bcd-8efa-46da-b325-d7745e7ac7b3',
 18: '8e6a74f0-066a-4bff-b156-d78bfcd1a663',
 19: 'cf52715c-d5d3-4a43-88f6-1bfac05d190e',
 20: '30ab4c44-176b-4078-bfbd-e93a38b48f3d',
 21: '6df564ac-2082-4011-bd60-76be03818724',
 22: '002d4518-f27c-

Use `merge_from` to combine the two databases.

In [100]:
# Merge db + db2
db.merge_from(db2)

# Check the merged data
db.index_to_docstore_id

{0: 'b45f8bc1-c003-44a8-b99d-d9c254092cd5',
 1: 'ea5ccd67-04ea-408f-940a-230e351367e2',
 2: 'a8afd216-2235-4c24-8f9c-094b087057e8',
 3: 'd9cfe1d4-6a16-4185-81bf-666ce8909a91',
 4: '324c48ed-2364-4433-8ee5-ae1a862eab7d',
 5: '62bc291e-ff75-48f5-a9e9-9976762e3be2',
 6: '7d013dc0-5d55-41a6-93a4-1e8423e76d72',
 7: 'cf7224d3-2cf3-4871-b7b1-4991696e6760',
 8: 'e55dccd4-2598-4076-b4c3-456aaaf2adf8',
 9: '03ee783e-12b6-48a3-90ec-a754dce2bda1',
 10: '06c680a2-acb9-4a1a-a6cb-9c4b554294f8',
 11: 'af300200-186b-40b0-a378-7d68590b4d83',
 12: '57ac805e-8244-4bb0-aab3-9499e70e4520',
 13: 'f4a1c9ab-a294-4570-beb0-9f4de78e6bc1',
 14: '7ac38095-1bf1-402a-86de-c5ff398c867b',
 15: 'f779e2ba-e786-4968-b55d-1b00664b93a8',
 16: 'ca94a960-3227-4a8f-b5b9-6727e654a5a6',
 17: '933cee0c-130e-47f7-9495-1f783ad22494',
 18: 'cb7fb67f-22fd-43f7-b4d5-69b7dd3a60de',
 19: '6daf1f41-8ab2-40a3-81ac-05b7416f6abf',
 20: '216aa3e6-d401-4f6f-9195-22ad9f627ac5',
 21: '08b05c42-c715-4816-8769-04b8227b6f8d',
 22: '430a29cc-46a5-
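
The effect of the merge on `index_to_docstore_id` can be pictured with a small pure-Python sketch. It assumes that the first store keeps its positions and that the second store's entries are appended after them, shifted by the size of the first store. The function name `merge_id_maps` and the shortened ids are hypothetical and are not part of LangChain.

```python
def merge_id_maps(base: dict[int, str], other: dict[int, str]) -> dict[int, str]:
    """Append other's entries after base's, shifting positions by len(base)."""
    offset = len(base)
    merged = dict(base)  # copy so the inputs stay untouched
    for position, doc_id in other.items():
        merged[offset + position] = doc_id
    return merged


# Hypothetical ids, shortened for readability
db_ids = {0: "a-0", 1: "a-1", 2: "a-2"}
db2_ids = {0: "b-0", 1: "b-1"}

merged = merge_id_maps(db_ids, db2_ids)
print(merged)
# {0: 'a-0', 1: 'a-1', 2: 'a-2', 3: 'b-0', 4: 'b-1'}
```

In this picture, positions 0 to 2 still point at the original documents, and the merged-in documents start at position 3.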

## Convert to Retriever (as_retriever)


The `as_retriever` method creates a `VectorStoreRetriever` object based on the current vector store.


- Parameters:
    - `**kwargs`: Keyword arguments that configure the retriever, including the two below.
    - `search_type` (Optional[str]): Type of search to perform (`"similarity"`, `"mmr"`, or `"similarity_score_threshold"`).
    - `search_kwargs` (Optional[Dict]): Keyword arguments passed to the search function, such as `k`, `score_threshold`, `fetch_k`, `lambda_mult`, and `filter`.


- Return Value:
    - `VectorStoreRetriever`: A retriever object based on the vector store.


- Key Features:
    - Supports Multiple Search Types
        - `"similarity"`: Default similarity-based search.
        - `"mmr"`: Maximal Marginal Relevance search.
        - `"similarity_score_threshold"`: Similarity threshold-based search.
    - Customizable Search Parameters
        - `k`: Number of documents to return.
        - `score_threshold`: Similarity score threshold.
        - `fetch_k`: Number of documents fetched for the MMR algorithm.
        - `lambda_mult`: Diversity of `MMR` results, from 0 (maximum diversity) to 1 (minimum diversity).
        - `filter`: Filter documents based on metadata.


- Usage Considerations:
    - Choose appropriate search types and parameters to balance the quality and diversity of search results.
    - Adjust `fetch_k` and `k` values for large datasets to balance performance and accuracy.
    - Use the filter option to search only documents that match specific conditions.


- Optimization Tips:
    - For `MMR` searches, increase `fetch_k` and adjust `lambda_mult` to balance diversity and relevance.
    - Use threshold-based search to return only highly relevant documents.


**[Cautions]**
- Improper parameter settings may impact search performance or result quality.
- High `k` values on large datasets can significantly increase search time. By default, the retriever returns 4 documents (`k=4`).

In [101]:
# Create a new FAISS vector store
db = FAISS.from_documents(
    documents=chapter_1_to_13 + chapter_14_to_27,
    embedding=OpenAIEmbeddings(),
)

The default retriever returns 4 documents.

In [102]:
# Convert to retriever
retriever = db.as_retriever()

# Perform search
retriever.invoke("What is the story of the Little Prince?")

[Document(id='934f36a0-823b-453e-ac47-cded72f2c5d0', metadata={'source': 'data/the_little_prince.txt'}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)'),
 Document(id='06d9242f-ecb4-4bf7-8014-35f5f1784ea1', metadata={'source': 'data/the_little_prince.txt'}, page_content='prince from another world traveling the universe in order to understand life. In the book, the little prince discovers the true meaning of life. At the end of his conversation with the Little Prince, the aviator manages to fix his plane and both he and the little prince continue on their journeys'),
 Document(id='d5e05a3e-ecd3-4856-b2a9-845f1e1cd0d8', metadata={'source': 'data/the_little_prince.txt'}, page_content='The little prince sat down on a stone, and raised his eyes toward the sky. \n"I wonder," he said, "whether the stars are set alight in heaven so that one day each one of us may find his own again... Look at my planet. It is right there above us. But how far away it is!" \n"I

Retrieve more documents with higher diversity by using `MMR` search.


- `k`: The number of documents to return (default: 4)
- `fetch_k`: The number of documents to pass to the `MMR` algorithm (default: 20)
- `lambda_mult`: Adjusts the diversity of `MMR` results (range: 0 to 1, where 0 gives maximum diversity and 1 gives minimum diversity; default: 0.5)

In [103]:
# Perform MMR search
retriever = db.as_retriever(
    search_type="mmr", search_kwargs={"k": 6, "lambda_mult": 0.25, "fetch_k": 10}
)
# Invoke search with a query
retriever.invoke("What is the story of the Little Prince?")

[Document(id='934f36a0-823b-453e-ac47-cded72f2c5d0', metadata={'source': 'data/the_little_prince.txt'}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)'),
 Document(id='6f29cf82-220d-41f2-99fd-bc80964b0f68', metadata={'source': 'data/the_little_prince.txt'}, page_content='"It is your own fault," said the little prince. "I never wished you any sort of harm; but you wanted me to tame you..." \n"Yes, that is so," said the fox. \n"But now you are going to cry!" said the little prince. \n"Yes, that is so," said the fox. \n"Then it has done you no good at all!" \n"It has done me good," said the fox, "because of the color of the wheat fields." And then he added: \n"Go and look again at the roses. You will understand now that yours is unique in all the world. Then come back to say goodbye to me, and I will make you a present of a secret."'),
 Document(id='a4cba619-6b9a-4499-bd84-a18992e3b193', metadata={'source': 'data/the_little_prince.txt'}, page_content='[ C
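
To make the roles of `k`, `fetch_k`, and `lambda_mult` concrete, here is a minimal pure-Python sketch of the general Maximal Marginal Relevance idea. It is not LangChain's implementation, which may differ in details. The helper names `cosine` and `mmr_select` and the toy 2-dimensional vectors are assumptions made for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query, candidates, k=4, fetch_k=20, lambda_mult=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance."""
    # Step 1: keep only the fetch_k candidates most similar to the query.
    relevance = [cosine(query, c) for c in candidates]
    pool = sorted(range(len(candidates)), key=lambda i: relevance[i], reverse=True)
    pool = pool[:fetch_k]

    # Step 2: greedily pick the candidate with the best trade-off between
    # relevance to the query and redundancy with what is already selected.
    selected = []
    while pool and len(selected) < k:

        def score(i):
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy

        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected


# Toy data: candidates 0 and 1 are near-duplicates, candidate 2 is different.
query = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]

relevance_only = mmr_select(query, candidates, k=2, lambda_mult=1.0)
diverse = mmr_select(query, candidates, k=2, lambda_mult=0.25)
print(relevance_only)  # [0, 1]
print(diverse)  # [0, 2]
```

With `lambda_mult=1.0` the second pick is the near-duplicate, because only relevance counts. With `lambda_mult=0.25` the redundancy penalty dominates, so the less similar but more novel candidate is chosen instead.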

Pass 10 candidate documents to the `MMR` algorithm, but return only the top 2.

In [104]:
# Perform MMR search, return only the top 2 documents
retriever = db.as_retriever(search_type="mmr", search_kwargs={"k": 2, "fetch_k": 10})
retriever.invoke("What is the story of the Little Prince?")

[Document(id='934f36a0-823b-453e-ac47-cded72f2c5d0', metadata={'source': 'data/the_little_prince.txt'}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)'),
 Document(id='d5e05a3e-ecd3-4856-b2a9-845f1e1cd0d8', metadata={'source': 'data/the_little_prince.txt'}, page_content='The little prince sat down on a stone, and raised his eyes toward the sky. \n"I wonder," he said, "whether the stars are set alight in heaven so that one day each one of us may find his own again... Look at my planet. It is right there above us. But how far away it is!" \n"It is beautiful," the snake said. "What has brought you here?"\n"I have been having some trouble with a flower," said the little prince. \n"Ah!" said the snake. \nAnd they were both silent. \n"Where are the men?" the little prince at last took up the conversation again. "It is a little lonely in the desert..."')]

Return only documents whose similarity score is at or above a given threshold (here 0.8).

In [105]:
# Perform threshold-based similarity search
retriever = db.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.8}
)

retriever.invoke("What is the story of the Little Prince?")

[Document(id='934f36a0-823b-453e-ac47-cded72f2c5d0', metadata={'source': 'data/the_little_prince.txt'}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)'),
 Document(id='06d9242f-ecb4-4bf7-8014-35f5f1784ea1', metadata={'source': 'data/the_little_prince.txt'}, page_content='prince from another world traveling the universe in order to understand life. In the book, the little prince discovers the true meaning of life. At the end of his conversation with the Little Prince, the aviator manages to fix his plane and both he and the little prince continue on their journeys'),
 Document(id='d5e05a3e-ecd3-4856-b2a9-845f1e1cd0d8', metadata={'source': 'data/the_little_prince.txt'}, page_content='The little prince sat down on a stone, and raised his eyes toward the sky. \n"I wonder," he said, "whether the stars are set alight in heaven so that one day each one of us may find his own again... Look at my planet. It is right there above us. But how far away it is!" \n"I
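
The threshold step itself is simple and can be sketched in plain Python. The sketch assumes relevance scores are already normalized so that higher means more similar. The function name `filter_by_threshold`, the labels, and the scores are hypothetical and do not come from the run above.

```python
def filter_by_threshold(scored_docs, score_threshold, k=4):
    """Keep at most k (text, score) pairs whose score meets the threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:k] if score >= score_threshold]


# Hypothetical relevance scores; higher means more similar.
scored = [
    ("title page", 0.86),
    ("plot summary", 0.83),
    ("chapter 5", 0.74),
    ("chapter 21", 0.81),
]

kept = filter_by_threshold(scored, score_threshold=0.8)
print([text for text, _ in kept])
# ['title page', 'plot summary', 'chapter 21']
```

Note that a threshold can return fewer than `k` documents, or none at all, when nothing scores high enough.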

Retrieve only the most similar document

In [106]:
# Perform search to retrieve the most similar single document with k=1
retriever = db.as_retriever(search_kwargs={"k": 1})

retriever.invoke("What is the story of the Little Prince?")

[Document(id='934f36a0-823b-453e-ac47-cded72f2c5d0', metadata={'source': 'data/the_little_prince.txt'}, page_content='The Little Prince\nWritten By Antoine de Saiot-Exupery (1900〜1944)')]

Apply specific metadata filters

In [107]:
# Apply filter based on metadata and retrieve top 2 documents
retriever = db.as_retriever(
    search_kwargs={"filter": {"source": "data/the_little_prince.txt"}, "k": 2}
)

# Perform retrieval with a question related to The Little Prince
results = retriever.invoke("What does the little prince learn about the baobab trees?")

print("\n" + "=" * 40)
print("Search Results")
print("=" * 40)

for i, doc in enumerate(results):
    print(f"\nResult {i+1}")
    print("-" * 40)
    print(f"Source: {doc.metadata.get('source', 'Unknown')}")
    print(
        f"Content Preview: {doc.page_content[:150]}..."
    )  # Show first 150 characters as a preview
    print("-" * 40)
    print(f"Full Content:\n{doc.page_content}")
    print("-" * 40)


Search Results

Result 1
----------------------------------------
Source: data/the_little_prince.txt
Content Preview: [ Chapter 5 ]
- we are warned as to the dangers of the baobabs
As each day passed I would learn, in our talk, something about the little prince‘s plan...
----------------------------------------
Full Content:
[ Chapter 5 ]
- we are warned as to the dangers of the baobabs
As each day passed I would learn, in our talk, something about the little prince‘s planet, his departure from it, his journey. The information would come very slowly, as it might chance to fall from his thoughts. It was in this way that I heard, on the third day, about the catastrophe of the baobabs.
This time, once more, I had the sheep to thank for it. For the little prince asked me abruptly-- as if seized by a grave doubt-- "It is true, isn‘t it, that sheep eat little bushes?" 
"Yes, that is true." 
"Ah! I am glad!"
----------------------------------------

Result 2
---------------------------------
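
The metadata filter used above can be pictured as a post-filter over ranked candidates. The sketch below assumes exact-match semantics for scalar values and membership semantics for list values, and it assumes that the filter is applied to the first `fetch_k` ranked candidates before the result is cut to `k`. The names `matches` and `filtered_top_k` and the sample documents are hypothetical.

```python
def matches(metadata: dict, flt: dict) -> bool:
    """True if every filter key matches the document metadata."""
    for key, expected in flt.items():
        actual = metadata.get(key)
        if isinstance(expected, (list, tuple, set)):
            if actual not in expected:  # list value: any listed value is accepted
                return False
        elif actual != expected:  # scalar value: exact match required
            return False
    return True


def filtered_top_k(ranked_docs, flt, k=4, fetch_k=20):
    """Inspect the first fetch_k ranked docs, keep matches, return up to k."""
    candidates = ranked_docs[:fetch_k]
    return [doc for doc in candidates if matches(doc["metadata"], flt)][:k]


# Hypothetical documents, already ordered from most to least similar.
ranked = [
    {"text": "A", "metadata": {"source": "other.txt"}},
    {"text": "B", "metadata": {"source": "data/the_little_prince.txt"}},
    {"text": "C", "metadata": {"source": "other.txt"}},
    {"text": "D", "metadata": {"source": "data/the_little_prince.txt"}},
]
flt = {"source": "data/the_little_prince.txt"}

wide = [doc["text"] for doc in filtered_top_k(ranked, flt, k=2)]
narrow = [doc["text"] for doc in filtered_top_k(ranked, flt, k=2, fetch_k=3)]
print(wide)  # ['B', 'D']
print(narrow)  # ['B']
```

The second call shows why a restrictive filter may need a larger `fetch_k`: matching documents that fall outside the candidate window are never seen.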

In [None]:
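# instantiate an empty FAISS vector store
import faiss
from langchain_community.docstore.in_memory import InMemoryDocstore

embeddings = OpenAIEmbeddings()
# Embed a sample query once to discover the embedding dimension
index = faiss.IndexFlatL2(len(embeddings.embed_query("hello world")))
vector_store = FAISS(
    embedding_function=embeddings,
    index=index,
    docstore=InMemoryDocstore(),
    index_to_docstore_id={},
)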

In [None]:
vector_store.index_to_docstore_id

In [None]:
# add documents
# from langchain_core.documents import Document

# document_1 = Document(page_content="foo", metadata={"baz": "bar"})
# document_2 = Document(page_content="thud", metadata={"bar": "baz"})
# document_3 = Document(page_content="i will be deleted :(")

# documents = [document_1, document_2, document_3]
# ids = ['1', '2', '3']
# vector_store.add_documents(documents=documents, ids=ids)

In [None]:
# upsert
texts = ["Hello World", "FAISS is great", "LangChain"]
ids = ["123", "456", "789"]
vector_store.upsert(texts=texts, ids=ids)

In [None]:
print(vector_store.index_to_docstore_id)
print(vector_store.index_to_docstore_id.values())

In [None]:
vector_store.get_by_ids(vector_store.index_to_docstore_id.values())

In [None]:
# upsert
texts = ["Hello World", "FAISS is great", "LangChain"]
ids = ["123", "456", "789"]
metadatas = [{"baz": "bar"}, {}, {}]
vector_store.upsert(texts=texts, ids=ids, metadatas=metadatas)

In [None]:
print(vector_store.index_to_docstore_id)
print(vector_store.index_to_docstore_id.values())

In [None]:
vector_store.get_by_ids(vector_store.index_to_docstore_id.values())

In [None]:
vector_store.upsert_parallel(texts=texts, ids=ids, metadatas=metadatas)

In [None]:
# delete documents
vector_store.delete(ids=["3"])

In [None]:
# Check the existing docstore ids
vector_store.index_to_docstore_id

In [None]:
# search documents by ids
vector_store.get_by_ids(["1", "2"])

In [None]:
# search
results = vector_store.similarity_search(query="thud", k=1)
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

In [None]:
# search with filter
# filter: (Optional[Dict[str, str]]): Filter by metadata. Defaults to None.
results = vector_store.similarity_search(query="thud", k=1, filter={"bar": "baz"})
for doc in results:
    print(f"* {doc.page_content} [{doc.metadata}]")

In [None]:
# search with score
results = vector_store.similarity_search_with_score(query="qux", k=1)
for doc, score in results:
    # Print the score with three decimal places
    print(f"* [SIM={score:.3f}] {doc.page_content} [{doc.metadata}]")

In [None]:
# search documents by ids
vector_store.get_by_ids(["2", "3"])

In [None]:
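# async
# add documents
from langchain_core.documents import Document

document_2 = Document(page_content="thud", metadata={"bar": "baz"})
document_3 = Document(page_content="i will be deleted :(")

documents = [document_2, document_3]
ids = ["2", "3"]

# aadd_documents is a coroutine, so it must be awaited
# (top-level await works in notebook cells)
await vector_store.aadd_documents(documents=documents, ids=ids)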
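
Outside a notebook there is no running event loop, so an async method such as `aadd_documents` has to be driven with `asyncio.run`. The sketch below shows the pattern with a hypothetical stand-in class, `StandInStore`, so that it runs without an API key. It is not the FAISS vector store and only mimics the shape of the call.

```python
import asyncio


class StandInStore:
    """Hypothetical stand-in for a vector store with an async add method."""

    def __init__(self):
        self.docs = {}

    async def aadd_documents(self, documents, ids):
        await asyncio.sleep(0)  # yield control, as real I/O would
        self.docs.update(zip(ids, documents))
        return list(ids)


async def main():
    store = StandInStore()
    # Without `await`, this call would only create a coroutine object
    # and nothing would be added.
    added = await store.aadd_documents(
        documents=["thud", "i will be deleted :("], ids=["2", "3"]
    )
    return store, added


store, added_ids = asyncio.run(main())
print(added_ids)  # ['2', '3']
```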

In [None]:
vector_store._create_filter_func(filter={"bar": "baz"})

In [None]:
vector_store.index