# MongoDB Atlas

- Author: [Ivy Bae](https://github.com/ivybae)
- Design:
- Peer Review :
- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)

[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-MongoDB.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/09-VectorStore/07-MongoDB.ipynb)

## Overview

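This tutorial covers how to use **MongoDB Atlas** as a vector store with LangChain: deploying a Free cluster, initializing `MongoDBAtlasVectorSearch`, loading a PDF, adding and deleting documents, and running semantic search with scores and pre-filters.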

### Table of Contents

- [Overview](#overview)
- [Environment Setup](#environment-setup)
- [Initialization](#initialization)
- [Load Data](#load-data)
- [Manage vector store](#manage-vector-store)
- [Query vector store](#query-vector-store)

### References

- [Get Started with Atlas](https://www.mongodb.com/docs/atlas/getting-started/)
- [Deploy a Free Cluster](https://www.mongodb.com/docs/atlas/tutorial/deploy-free-tier-cluster/)
- [Connection Strings](https://www.mongodb.com/docs/manual/reference/connection-string/)
- [Integrate Atlas Vector Search with LangChain](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/)
- [Get Started with the LangChain Integration](https://www.mongodb.com/docs/atlas/atlas-vector-search/ai-integrations/langchain/get-started/)
- [Comparison Query Operators](https://www.mongodb.com/docs/manual/reference/operator/query-comparison/)
- [MongoDB Atlas](https://python.langchain.com/docs/integrations/vectorstores/mongodb_atlas/)
- [Document loaders](https://python.langchain.com/docs/concepts/document_loaders/)

---


## Environment Setup

Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.

**[Note]**

- `langchain-opentutorial` is a package that provides easy-to-use environment setup, useful functions, and utilities for tutorials.
- You can check out [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details.


In [1]:
%%capture --no-stderr
%pip install langchain-opentutorial

In [2]:
# Install required packages
from langchain_opentutorial import package

package.install(
    [
        "langchain_openai",
        "langsmith",
        "langchain_core",
        "langchain_community",
        "langchain-mongodb",
        "pymongo",
        "certifi",
    ],
    verbose=False,
    upgrade=False,
)



In [3]:
# Set environment variables
from langchain_opentutorial import set_env

set_env(
    {
        "OPENAI_API_KEY": "",
        "LANGCHAIN_API_KEY": "",
        "MONGODB_ATLAS_CLUSTER_URI": "",
        "LANGCHAIN_TRACING_V2": "true",
        "LANGCHAIN_ENDPOINT": "https://api.smith.langchain.com",
        "LANGCHAIN_PROJECT": "07-MongoDB",
    }
)

Environment variables have been set successfully.


You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.

[Note] This is not necessary if you've already set the required API keys in previous steps.

`MONGODB_ATLAS_CLUSTER_URI` is required to use **MongoDB Atlas**; see [Connect to your cluster](#connect-to-your-cluster) for how to obtain it.

If you are already using **MongoDB Atlas**, you can set the cluster **connection string** to `MONGODB_ATLAS_CLUSTER_URI` in your `.env` file.
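
Before moving on, it can help to confirm that the variables this tutorial relies on are actually set. The following is a minimal sketch, not part of `langchain-opentutorial`: the helper name `missing_env_vars` is hypothetical, and the list of required names is an assumption based on the `set_env` call above.

```python
import os

# Names taken from the `set_env` call above; adjust to what your setup needs.
REQUIRED_VARS = ["OPENAI_API_KEY", "MONGODB_ATLAS_CLUSTER_URI"]


def missing_env_vars(names, env=None):
    """Return the names that are unset or empty in the given environment."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars(REQUIRED_VARS)
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
else:
    print("All required environment variables are set.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching the real environment.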


In [4]:
# Load API keys from .env file
from dotenv import load_dotenv

load_dotenv(override=True)

True

## Initialization

**MongoDB Atlas** is a multi-cloud database service that provides an easy way to host and manage your data in the cloud.

After you register with and log in to **Atlas**, you can create a Free cluster.

You can get started with **Atlas** using either the [Atlas CLI](https://www.mongodb.com/docs/atlas/cli/current/atlas-cli-getting-started/) or the **Atlas UI**.

**Atlas CLI** can be difficult to use if you're not used to working with development tools, so this tutorial will walk you through how to use **Atlas UI**.

### Deploy a cluster

Select the appropriate project in your **Organization**. If the project doesn't exist, you'll need to create it.

Once you have selected a project, you can create a cluster.

![mongodb-atlas-project](./assets/07-mongodb-atlas-initialization-01.png)

Follow the procedure below to deploy a cluster:

- Select **Cluster**: choose the **M0** Free cluster option.

> Note: You can deploy only one Free cluster per Atlas project.

- Select **Provider**: **M0** is available on AWS, GCP, and Azure.

- Select **Region**.

- Create a database user and add your IP address to the IP access list.

After you deploy a cluster, you can see the cluster you deployed as shown in the image below.

![mongodb-atlas-cluster-deploy](./assets/07-mongodb-atlas-initialization-02.png)


### Connect to your cluster

Click **Get connection string** in the image above to get the cluster URI and set the value of `MONGODB_ATLAS_CLUSTER_URI` in the `.env` file.

The **connection string** resembles the following example:

> mongodb+srv://[databaseUser]:[databasePassword]@[clusterName].[hostName].mongodb.net/?retryWrites=true&w=majority

Then go back to the [Environment Setup](#environment-setup) and run the `load_dotenv` function again.
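
If your database username or password contains special characters such as `@` or `/`, they need to be percent-escaped before they go into the connection string. The sketch below shows one way to do that with the standard library; the helper name `build_atlas_uri` and all values passed to it are placeholders for illustration, not real credentials or hosts.

```python
from urllib.parse import quote_plus


def build_atlas_uri(user, password, host):
    """Build an SRV connection string with percent-escaped credentials."""
    return (
        f"mongodb+srv://{quote_plus(user)}:{quote_plus(password)}@{host}"
        "/?retryWrites=true&w=majority"
    )


# Placeholder values for illustration only.
uri = build_atlas_uri("tutorial_user", "p@ss/word", "cluster0.example.mongodb.net")
print(uri)
```

The resulting string is what you would store in `MONGODB_ATLAS_CLUSTER_URI`.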


### Initialize MongoDB Python client

To integrate with LangChain, you need to connect to the cluster using [PyMongo](https://github.com/mongodb/mongo-python-driver), the MongoDB Python driver.


In [5]:
import os
import certifi
from pymongo import MongoClient

MONGODB_ATLAS_CLUSTER_URI = os.getenv("MONGODB_ATLAS_CLUSTER_URI")
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI, tlsCAFile=certifi.where())
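
Creating a `MongoClient` does not by itself prove that the cluster is reachable, so a quick `ping` can catch a wrong URI or a missing IP access entry early. This is a minimal sketch: the helper name `can_reach_cluster` is hypothetical, and it only assumes the object passed in exposes `admin.command(...)` the way a PyMongo client does.

```python
def can_reach_cluster(client):
    """Return True if the server answers a `ping` command, False otherwise."""
    try:
        client.admin.command("ping")
        return True
    except Exception as error:  # broad on purpose: network, TLS, and auth failures
        print(f"Could not reach the cluster: {error}")
        return False


# Usage with the `client` created in the cell above:
# can_reach_cluster(client)
```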

### Initialize MongoDBAtlasVectorSearch

A **MongoDB database** stores collections of documents.

Create a vector store using `MongoDBAtlasVectorSearch`.

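- `collection`: the MongoDB collection in which the documents and their embeddings are stored

- `embedding`: the embedding model, here OpenAI's `text-embedding-3-small`, which produces 1536-dimensional vectors by default

- `index_name`: the index to use when querying the vector store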


In [6]:
from langchain_openai import OpenAIEmbeddings
from langchain_mongodb import MongoDBAtlasVectorSearch

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

ATLAS_VECTOR_SEARCH_INDEX_NAME = "langchain-opentutorial-index"
DB_NAME = "langchain-opentutorial-db"
COLLECTION_NAME = "little-prince"
collection = client[DB_NAME][COLLECTION_NAME]

vector_store = MongoDBAtlasVectorSearch(
    collection=collection,
    embedding=embeddings,
    index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
    relevance_score_fn="cosine",
)
vector_store.create_vector_search_index(dimensions=1536)

You can **browse collections** to see the **little-prince** collection you just created and the sample data provided by Atlas.

- [available sample datasets](https://www.mongodb.com/docs/atlas/sample-data/#std-label-available-sample-datasets)

![mongodb-atlas-collection](./assets/07-mongodb-atlas-database-01.png)

Click the **Atlas Search tab** to see the search index **langchain-opentutorial-index** that you created.

![mongodb-atlas-search-index](./assets/07-mongodb-atlas-database-02.png)

In this tutorial, we will use the **little-prince** collection in the **langchain-opentutorial-db** database.


## Load Data

LangChain provides **Document loaders** that can load a variety of data sources.

In this tutorial, we'll use `PyPDFLoader` to add data from the **TheLittlePrince.pdf** in the data directory to the **little-prince** collection.


In [7]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("./data/TheLittlePrince.pdf")
documents = loader.load()

If you are working with large datasets, you can use `lazy_load` instead of the `load` method.
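
One way to take advantage of `lazy_load` is to consume its iterator in small batches instead of holding every page in memory at once. The sketch below shows a generic batching helper; the name `batched` and the batch size are illustrative choices, and a generator of strings stands in for the `Document` objects that `loader.lazy_load()` would yield.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from any iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


# Stand-in for `loader.lazy_load()`, which yields Document objects one at a time.
pages = (f"page-{i}" for i in range(5))
for batch in batched(pages, size=2):
    print(batch)
```

With the objects from this tutorial, the loop would look like `for batch in batched(loader.lazy_load(), size=16): vector_store.add_documents(batch)`.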

The `load` method returns **List[Document]**, so let's check the first **Document** object.


In [8]:
documents[0]

Document(metadata={'source': './data/TheLittlePrince.pdf', 'page': 0}, page_content='1!!!!!\n!!!The Little Prince written and illustrated by Antoine de Saint Exupéry translated from the French by Katherine Woods!!!!!!')

- `metadata`: data associated with content

- `page_content`: string text

If you open **TheLittlePrince.pdf** and compare its first page with `page_content`, you will see that the contents are the same.


## Manage vector store

Now that you've initialized the `vector_store` and loaded the data, you can add **Documents** to and delete them from the **little-prince** collection.

### Add

- `add_documents`: Adds **documents** to the `vector_store` and returns a list of IDs for the added documents.


In [9]:
ids = vector_store.add_documents(documents=documents)

The `delete` function lets you specify the IDs of the documents to delete, so `ids` stores the IDs of the added documents.

Check the first document ID. The number of **IDs** matches the number of **documents**, and each ID is a unique value.


In [10]:
ids[0]

'6787965b3c24f7045d2ca706'
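
The claim that there is one unique ID per document can be checked programmatically. This is a small sketch with a hypothetical helper name, `ids_match_documents`, and stand-in lists in place of the real `ids` and `documents`.

```python
def ids_match_documents(ids, documents):
    """Check that there is one ID per document and that no ID repeats."""
    return len(ids) == len(documents) and len(set(ids)) == len(ids)


# Stand-in values; in the notebook you would pass `ids` and `documents`.
print(ids_match_documents(["a1", "b2", "c3"], ["doc 1", "doc 2", "doc 3"]))
```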

In the image below, after adding **documents** the **STORAGE SIZE** of the collection increases and you can see the documents corresponding to each ID, such as `ids[0]`.

![mongodb-atlas-add-documents](./assets/07-mongodb-atlas-vectorstore-01.png)

The `embedding` field is a **vector representation of the text** data. It is used to determine similarity to the query vector for vector search.
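
To make "similarity to the query vector" concrete, the sketch below computes cosine similarity for tiny two-dimensional vectors in plain Python. The vectors are made up for illustration (real embeddings here have 1536 dimensions), and the `(1 + cosine) / 2` normalization is an assumption based on how MongoDB documents score normalization for cosine similarity, not something this notebook computes itself.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def normalized_score(cosine):
    """Map a cosine in [-1, 1] to a score in [0, 1] (assumed normalization)."""
    return (1 + cosine) / 2


query_vector = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(normalized_score(cosine_similarity(query_vector, same_direction)))  # 1.0
print(normalized_score(cosine_similarity(query_vector, orthogonal)))  # 0.5
```

Vectors pointing in the same direction score highest regardless of their length, which is why cosine similarity suits text embeddings.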


### Delete

Create a `Document` object, add it to a **collection**, and then `delete` its data.


In [11]:
from langchain_core.documents import Document

sample_document = Document(
    page_content="I am leveraging my experience as a developer to provide development education and nurture many new developers.",
    metadata={"source": "linkedin"},
)
sample_id = vector_store.add_documents([sample_document])
print(sample_id)

['678796bd3c24f7045d2ca746']


**TOTAL DOCUMENTS** has increased from 64 to 65.

If you check the **Document** on the last page, you'll see that its ID matches `sample_id` and that its text matches the `page_content` of `sample_document`.

![mongodb-atlas-last-document](./assets/07-mongodb-atlas-vectorstore-02.png)


You can specify the **document IDs to delete** as arguments to the `delete` function, such as `sample_id`.

If you don't specify an ID, all documents added to the collection are deleted.


In [13]:
vector_store.delete(ids=sample_id)

True

If `delete` returns `True`, the deletion was successful.

You can see that **TOTAL DOCUMENTS** has decreased from 65 to 64 and that `sample_document` has been deleted.


## Query vector store

Make a `query` related to the content of The Little Prince and see if the `vector_store` returns results from a search for similar documents.

The `query` is based on the most well-known story about the relationship between the Little Prince and the Fox.


In [15]:
query = "What does it mean to be tamed according to the fox?"

### Semantic Search

The `similarity_search` method performs a basic semantic search.

It returns a **List[Document]** ranked by relevance.


In [None]:
vector_store.similarity_search(query=query)

[Document(metadata={'_id': '6787965b3c24f7045d2ca734', 'source': './data/TheLittlePrince.pdf', 'page': 46}, page_content='47\n "Please--tame me!" he said.  "I want to, very much," the little prince replied. "But I have not much time. I have friends to discover, and a great many things to understand."  "One only understands the things that one tames," said the fox. "Men have no more time to understand anything. They buy things all ready made at the shops. But there is no shop anywhere where one can buy friendship, and so men have no friends any more. If you want a friend, tame me . . ."  "What must I do, to tame you?" asked the little prince.  "You must be very patient," replied the fox. "First you will sit down at a little distance from me--like that--in the grass. I shall look at you out of the corner of my eye, and you will say nothing. Words are the source of misunderstandings. But you will sit a little closer to me, every day . . ."  The next day the little prince came back.  "It w

### Semantic Search with Score

The `similarity_search_with_score` method also performs a semantic search.

Unlike `similarity_search`, it returns each document together with a **relevance score** between 0 and 1.

The `k` parameter in the example below specifies the number of documents to return. The `similarity_search` method supports it as well.


In [18]:
vector_store.similarity_search_with_score(query=query, k=10)

[(Document(metadata={'_id': '6787965b3c24f7045d2ca734', 'source': './data/TheLittlePrince.pdf', 'page': 46}, page_content='47\n "Please--tame me!" he said.  "I want to, very much," the little prince replied. "But I have not much time. I have friends to discover, and a great many things to understand."  "One only understands the things that one tames," said the fox. "Men have no more time to understand anything. They buy things all ready made at the shops. But there is no shop anywhere where one can buy friendship, and so men have no friends any more. If you want a friend, tame me . . ."  "What must I do, to tame you?" asked the little prince.  "You must be very patient," replied the fox. "First you will sit down at a little distance from me--like that--in the grass. I shall look at you out of the corner of my eye, and you will say nothing. Words are the source of misunderstandings. But you will sit a little closer to me, every day . . ."  The next day the little prince came back.  "It 
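
Because each result is a `(document, score)` pair, a common follow-up is to keep only the matches above some score. The sketch below uses made-up pairs with strings in place of `Document` objects; the helper name `filter_by_score` and the threshold value are illustrative assumptions, not part of the LangChain API.

```python
def filter_by_score(results, min_score):
    """Keep (document, score) pairs whose score is at least `min_score`."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Made-up pairs standing in for the output of `similarity_search_with_score`.
results = [("page 46", 0.82), ("page 47", 0.79), ("page 3", 0.61)]
print(filter_by_score(results, min_score=0.75))
```

A suitable threshold depends on the embedding model and the data, so it is worth inspecting real scores before fixing one.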

### Semantic Search with Filtering

**MongoDB Atlas** supports pre-filtering your data using **MongoDB Query Language (MQL) operators**.

To filter on a field, you must first add it to the index definition as a filter field using `create_vector_search_index`.


In [23]:
vector_store.create_vector_search_index(dimensions=1536, filters=["page"], update=True)

Compare the image below to when you first created the index in [Initialize MongoDBAtlasVectorSearch](#initialize-mongodbatlasvectorsearch).

Notice that `page` has been added to **Index Fields** and that **Documents** have been added as well.

![mongodb-atlas-index-update](./assets/07-mongodb-atlas-vectorstore-03.png)
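
The updated index can be pictured as a plain dictionary in the Atlas Vector Search index definition format: one `vector` field plus one `filter` field per filterable path. The sketch below builds such a dictionary; the helper name `vector_index_definition` is hypothetical, and the `embedding` path and `cosine` similarity are assumed to match the settings used in this tutorial.

```python
def vector_index_definition(dimensions, path="embedding", similarity="cosine", filters=()):
    """Build an Atlas Vector Search index definition as a plain dict."""
    fields = [
        {
            "type": "vector",
            "path": path,
            "numDimensions": dimensions,
            "similarity": similarity,
        }
    ]
    # Each filterable metadata field gets its own `filter` entry.
    fields.extend({"type": "filter", "path": field} for field in filters)
    return {"fields": fields}


print(vector_index_definition(1536, filters=["page"]))
```

Comparing this structure with the definition shown in the Atlas UI is a quick way to confirm that the filter field was registered.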


There are **comparison query operators** that find values that match a condition.

For example, the `$eq` operator finds **documents** that match a specified value.
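
To illustrate what these operators mean, the sketch below evaluates a small subset of them against a metadata dictionary in plain Python. It is only a local illustration of the semantics, not how Atlas evaluates filters: the helper name `matches` is hypothetical, only six operators are covered, and a missing field never matches in this simplified version.

```python
import operator

# Subset of MQL comparison operators, mapped to their Python equivalents.
OPERATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
}


def matches(metadata, mql_filter):
    """Return True if `metadata` satisfies every condition in `mql_filter`."""
    for field, conditions in mql_filter.items():
        if field not in metadata:
            return False
        value = metadata[field]
        for op, expected in conditions.items():
            if not OPERATORS[op](value, expected):
                return False
    return True


print(matches({"page": 46}, {"page": {"$gte": 20}}))  # True
print(matches({"page": 3}, {"page": {"$gte": 20}}))  # False
print(matches({"page": 20}, {"page": {"$eq": 20}}))  # True
```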

In [Semantic Search with Score](#semantic-search-with-score), the `similarity_search_with_score` results included pages from the beginning of the book.

Now you can add a `pre_filter` condition with the `$gte` operator so that only documents whose **page** is greater than or equal to 20 are returned.


In [24]:
vector_store.similarity_search_with_score(
    query=query, k=10, pre_filter={"page": {"$gte": 20}}
)

[(Document(metadata={'_id': '6787965b3c24f7045d2ca734', 'source': './data/TheLittlePrince.pdf', 'page': 46}, page_content='47\n "Please--tame me!" he said.  "I want to, very much," the little prince replied. "But I have not much time. I have friends to discover, and a great many things to understand."  "One only understands the things that one tames," said the fox. "Men have no more time to understand anything. They buy things all ready made at the shops. But there is no shop anywhere where one can buy friendship, and so men have no friends any more. If you want a friend, tame me . . ."  "What must I do, to tame you?" asked the little prince.  "You must be very patient," replied the fox. "First you will sit down at a little distance from me--like that--in the grass. I shall look at you out of the corner of my eye, and you will say nothing. Words are the source of misunderstandings. But you will sit a little closer to me, every day . . ."  The next day the little prince came back.  "It 