
Add vector store support (Weaviate, Pinecone, Faiss) #108

Merged · 19 commits · Dec 19, 2022
Binary file added docs/_static/vector_stores/faiss_index_0.png
Binary file added docs/_static/vector_stores/faiss_index_1.png
Binary file added docs/_static/vector_stores/pinecone_reader.png
Binary file added docs/_static/vector_stores/weaviate_reader_0.png
Binary file added docs/_static/vector_stores/weaviate_reader_1.png
15 changes: 14 additions & 1 deletion docs/how_to/data_connectors.md
@@ -3,11 +3,24 @@
We currently offer connectors into the following data sources. External data sources are retrieved through their APIs, together with the corresponding authentication tokens.
The API reference documentation can be found [here](/reference/readers.rst).

#### External APIs
- [Notion](https://developers.notion.com/) (`NotionPageReader`)
- [Google Docs](https://developers.google.com/docs/api) (`GoogleDocsReader`)
- [Slack](https://api.slack.com/) (`SlackReader`)
- MongoDB (`SimpleMongoReader`)
- Wikipedia (`WikipediaReader`)

#### Databases
- MongoDB (`SimpleMongoReader`)

#### Vector Stores

See [How to use Vector Stores with GPT Index](vector_stores.md) for a more thorough guide on integrating vector stores with GPT Index.

- Weaviate (`WeaviateReader`)
- Pinecone (`PineconeReader`)
- Faiss (`FaissReader`)

#### File
- local file directory (`SimpleDirectoryReader`)

We offer [example notebooks of connecting to different data sources](https://github.com/jerryjliu/gpt_index/tree/main/examples/data_connectors). Please check them out!
46 changes: 46 additions & 0 deletions docs/how_to/vector_stores.md
@@ -0,0 +1,46 @@
# Using Vector Stores
Collaborator: For a later PR, I think this page would benefit from some diagrams to show the differences between how GPT Index interacts with the vector stores. Made an issue for later: #109

Collaborator (Author): Yeah, totally!


GPT Index offers multiple integration points with vector stores / vector databases:

1) GPT Index can load data from vector stores, similar to any other data connector. This data can then be used within GPT Index data structures.
2) GPT Index can use a vector store itself (Faiss) as an index. Like any other index, this index can store documents and be used to answer queries.


## Loading Data from Vector Stores using a Data Connector
GPT Index supports loading data from the following sources. See [Data Connectors](data_connectors.md) for more details and API documentation.

- Weaviate (`WeaviateReader`). [Installation](https://weaviate.io/developers/weaviate/current/getting-started/installation.html). [Python Client](https://weaviate.io/developers/weaviate/current/client-libraries/python.html).
- Pinecone (`PineconeReader`). [Installation/Quickstart](https://docs.pinecone.io/docs/quickstart).
- Faiss (`FaissReader`). [Installation](https://github.com/facebookresearch/faiss/blob/main/INSTALL.md).

NOTE: Both the Pinecone and Faiss data loaders assume that the respective data sources store only vectors; the text content is stored elsewhere. Both loaders therefore require the user to specify an `id_to_text_map` in the `load_data` call.
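Concretely, an `id_to_text_map` is just a plain dict from vector-store IDs to the original text. The sketch below (IDs and text blobs are made up, mirroring this PR's demo notebooks) shows the kind of lookup the loaders perform after a nearest-neighbor search returns IDs:

```python
# id_to_text_map pairs each vector ID in the store with the text that
# produced that vector (IDs and blobs here are illustrative only).
id_to_text_map = {
    "id1": "text blob 1",
    "id2": "text blob 2",
}

# After a nearest-neighbor search returns IDs, the reader recovers the text:
retrieved_ids = ["id2", "id1"]
texts = [id_to_text_map[i] for i in retrieved_ids]
print(texts)  # → ['text blob 2', 'text blob 1']
```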

For instance, this is an example usage of the Pinecone data loader `PineconeReader`:

![](/_static/vector_stores/pinecone_reader.png)
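Rendered as text, the screenshot above corresponds to roughly the following sketch; the API key, environment, index name, and vectors are placeholders taken from this PR's Pinecone demo notebook. The call is wrapped in a function and not executed here, since it requires a live Pinecone index:

```python
def load_pinecone_documents(api_key, id_to_text_map, query_vector, top_k=3):
    """Hedged sketch of PineconeReader usage; requires a live Pinecone index."""
    # Deferred import: needs gpt_index (and pinecone-client) installed.
    from gpt_index.readers.pinecone import PineconeReader

    # environment and index_name below are placeholder values
    reader = PineconeReader(api_key=api_key, environment="us-west1-gcp")
    return reader.load_data(
        index_name="quickstart",
        id_to_text_map=id_to_text_map,
        top_k=top_k,
        vector=query_vector,
        separate_documents=True,  # one Document per retrieved vector
    )
```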


NOTE: Since Weaviate can store a hybrid of document and vector objects, the user may either explicitly specify `class_name` and `properties` in order to query documents, or specify a raw GraphQL query. See below for usage.

![](/_static/vector_stores/weaviate_reader_0.png)
![](/_static/vector_stores/weaviate_reader_1.png)
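As a hedged sketch of the two query styles in the screenshots: the module path, constructor argument, and keyword names below are assumptions inferred from the prose above and this PR's reader names, not a confirmed API. A live Weaviate instance is required, so the function is defined but never called here:

```python
def load_weaviate_documents(host="http://localhost:8080"):
    """Sketch of the two WeaviateReader query styles described above (assumed API)."""
    # Deferred import: needs gpt_index (and the weaviate Python client) installed.
    from gpt_index.readers.weaviate import WeaviateReader  # module path is an assumption

    reader = WeaviateReader(host)  # constructor argument is an assumption

    # Style 1: explicitly name the class and properties to pull documents.
    docs_by_class = reader.load_data(class_name="Document", properties=["text"])

    # Style 2: hand the reader a raw GraphQL query instead.
    query = """
    {
      Get {
        Document {
          text
        }
      }
    }
    """
    docs_by_graphql = reader.load_data(graphql_query=query)
    return docs_by_class, docs_by_graphql
```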

[Example notebooks can be found here](https://github.com/jerryjliu/gpt_index/tree/main/examples/data_connectors).


## Using a Vector Store as an Index
Collaborator: (Possibly a noob question; I'm still getting familiar with Faiss and vector DBs.) Is the difference between this, versus using Faiss directly to store embeddings of the Paul Graham essay, mainly that GPT Index also generates a coherent answer? Or are there other things going on?

Collaborator (Author): Using Faiss as a data loader (the first section) means that you load documents from an existing Faiss index (say, one the user already has), and you can then use a GPT Index structure on top of the retrieved documents, e.g. build a tree over them.

This section says that once you have documents, you can also build a GPT Index data struct, with Faiss under the hood, over those documents. The documents could come from anywhere (e.g. Slack, Notion), and we'll create an index data structure over them, taking care of tokenization/chunking/querying.

Collaborator (Author): This is something where a diagram absolutely would help!


GPT Index also supports using a vector store itself (specifically, Faiss) as an index. Like any
other index within GPT Index (tree, keyword table, list), this index can be constructed over any collection
of documents. We use the vector store within the index to store embeddings for the input text chunks.

Once constructed, the index can be used for querying.

**Index Construction**
![](/_static/vector_stores/faiss_index_0.png)

**Index Querying**
![](/_static/vector_stores/faiss_index_1.png)
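To make the two screenshots concrete without requiring faiss itself, here is a minimal pure-Python sketch of what a flat L2 vector index does at construction and query time. The class and names are illustrative only, not the GPT Index API; the real index uses `faiss.IndexFlatL2` and real embeddings rather than toy 2-D vectors:

```python
import math

def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class TinyFlatIndex:
    """Toy stand-in for a flat L2 vector store: brute-force nearest neighbors."""

    def __init__(self):
        self.embeddings = []  # embedding i belongs to chunk i
        self.chunks = []

    def add(self, embedding, chunk):
        # Index construction: store one embedding per input text chunk.
        self.embeddings.append(embedding)
        self.chunks.append(chunk)

    def query(self, query_embedding, k=2):
        # Index querying: rank chunks by distance to the query embedding.
        ranked = sorted(
            range(len(self.chunks)),
            key=lambda i: l2_distance(self.embeddings[i], query_embedding),
        )
        return [self.chunks[i] for i in ranked[:k]]

index = TinyFlatIndex()
index.add([0.1, 0.1], "chunk about cats")
index.add([0.9, 0.9], "chunk about dogs")
index.add([0.15, 0.05], "another cat chunk")
print(index.query([0.1, 0.1], k=2))  # → ['chunk about cats', 'another cat chunk']
```

On top of this retrieval step, GPT Index also handles chunking the input documents and synthesizing a final answer from the retrieved chunks.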


[Example notebooks can be found here](https://github.com/jerryjliu/gpt_index/tree/main/examples/vector_indices).
1 change: 1 addition & 0 deletions docs/index.rst
@@ -59,6 +59,7 @@ At the core of GPT Index is a **data structure**. Instead of relying on world kn
how_to/embeddings.md
how_to/custom_prompts.md
how_to/custom_llms.md
how_to/vector_stores.md


.. toctree::
1 change: 1 addition & 0 deletions docs/reference/indices.rst
@@ -13,3 +13,4 @@ classes allow for index creation, insertion, and also querying.
indices/list.rst
indices/table.rst
indices/tree.rst
indices/vector_store.rst
1 change: 1 addition & 0 deletions docs/reference/query.rst
@@ -12,3 +12,4 @@ This doc specifically shows the classes that are used to query indices.
indices/list_query.rst
indices/table_query.rst
indices/tree_query.rst
indices/vector_store_query.rst
150 changes: 150 additions & 0 deletions examples/data_connectors/FaissDemo.ipynb
@@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "5d974136",
"metadata": {},
"source": [
"# Faiss Demo"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b541d8ec",
"metadata": {},
"outputs": [],
"source": [
"from gpt_index.readers.faiss import FaissReader"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "90d37078",
"metadata": {},
"outputs": [],
"source": [
"# Build the Faiss index. \n",
"# A guide for how to get started with Faiss is here: https://github.com/facebookresearch/faiss/wiki/Getting-started\n",
"# We provide some example code below.\n",
"\n",
"import faiss\n",
"import numpy as np\n",
"\n",
"# # Example Code\n",
"# d = 8\n",
"# docs = np.array([\n",
"# [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],\n",
"# [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2],\n",
"# [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3],\n",
"# [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4],\n",
"# [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]\n",
"# ])\n",
"# # id_to_text_map is used for query retrieval\n",
"# id_to_text_map = {\n",
"# 0: \"aaaaaaaaa bbbbbbb cccccc\",\n",
"# 1: \"foooooo barrrrrr\",\n",
"# 2: \"tmp tmptmp tmp\",\n",
"# 3: \"hello world hello world\",\n",
"# 4: \"cat dog cat dog\"\n",
"# }\n",
"# # build the index\n",
"# index = faiss.IndexFlatL2(d)\n",
"# index.add(docs)\n",
"\n",
"id_to_text_map = {\n",
" \"id1\": \"text blob 1\",\n",
" \"id2\": \"text blob 2\",\n",
"}\n",
"index = ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fd470a09",
"metadata": {},
"outputs": [],
"source": [
"reader = FaissReader(index)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c33084c5",
"metadata": {},
"outputs": [],
"source": [
"# To load data from the Faiss index, you must specify: \n",
"# k: top nearest neighbors\n",
"# query: a 2D embedding representation of your queries (rows are queries)\n",
"k = 4\n",
"query1 = np.array([...])\n",
"query2 = np.array([...])\n",
"query = np.array([query1, query2])\n",
"\n",
"documents = reader.load_data(query=query, id_to_text_map=id_to_text_map, k=k)"
]
},
{
"cell_type": "markdown",
"id": "0b74697a",
"metadata": {},
"source": [
"### Create index"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e85d7e5b",
"metadata": {},
"outputs": [],
"source": [
"from gpt_index import GPTListIndex\n",
"\n",
"index = GPTListIndex(documents)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "31c3b68f",
"metadata": {},
"outputs": [],
"source": [
"response = index.query(\"<query_text>\", verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "56fce3fb",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, Markdown\n",
"\n",
"display(Markdown(f\"<b>{response}</b>\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python [conda env:conda_gpt_env]",
"language": "python",
"name": "conda-env-conda_gpt_env-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.15"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
136 changes: 136 additions & 0 deletions examples/data_connectors/PineconeDemo.ipynb
@@ -0,0 +1,136 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "f3ca56f0-6ef1-426f-bac5-fd7c374d0f51",
"metadata": {},
"source": [
"# Pinecone Demo"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "e2f49003-b952-4b9b-b935-2941f9303773",
"metadata": {},
"outputs": [],
"source": [
"api_key = \"<api_key>\""
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "262f990a-79c8-413a-9f3c-cd9a3c191307",
"metadata": {},
"outputs": [],
"source": [
"from gpt_index.readers.pinecone import PineconeReader"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "252f8163-7297-44b6-a838-709e9662f3d6",
"metadata": {},
"outputs": [],
"source": [
"reader = PineconeReader(api_key=api_key, environment=\"us-west1-gcp\")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "53b49187-8477-436c-9718-5d2f8cc6fad0",
"metadata": {},
"outputs": [],
"source": [
"# the id_to_text_map specifies a mapping from the ID specified in Pinecone to your text. \n",
"id_to_text_map = {\n",
" \"id1\": \"text blob 1\",\n",
" \"id2\": \"text blob 2\",\n",
"}\n",
"\n",
"# the query_vector is an embedding representation of your query\n",
"# Example query vector:\n",
"# query_vector=[0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]\n",
"\n",
"query_vector=[n1, n2, n3, ...]"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a88be1c4-603f-48b9-ac64-10a219af4951",
"metadata": {},
"outputs": [],
"source": [
"# NOTE: Required args are index_name, id_to_text_map, vector.\n",
"# In addition, we pass through all kwargs that can be passed into the `Query` operation in Pinecone.\n",
"# See the API reference: https://docs.pinecone.io/reference/query\n",
"# and also the Python client: https://github.com/pinecone-io/pinecone-python-client\n",
"# for more details. \n",
"documents = reader.load_data(index_name='quickstart', id_to_text_map=id_to_text_map, top_k=3, vector=query_vector, separate_documents=True)"
]
},
{
"cell_type": "markdown",
"id": "a4baf59e-fc97-4a1e-947f-354a6438ffa6",
"metadata": {},
"source": [
"### Create index "
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "109d083e-f3b4-420b-886b-087c8cf3f98b",
"metadata": {},
"outputs": [],
"source": [
"from gpt_index import GPTListIndex\n",
"\n",
"index = GPTListIndex(documents)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e15b9177-9e94-4e4e-9a2e-cd3a288a7faf",
"metadata": {},
"outputs": [],
"source": [
"response = index.query(\"<query_text>\", verbose=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "67b50613-a589-4acf-ba16-10571b415268",
"metadata": {},
"outputs": [],
"source": [
"from IPython.display import display, Markdown\n",
"\n",
"display(Markdown(f\"<b>{response}</b>\"))"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "gpt_retrieve_venv",
"language": "python",
"name": "gpt_retrieve_venv"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.16"
}
},
"nbformat": 4,
"nbformat_minor": 5
}