# Populate MongoDB Atlas Database
In this Python notebook, we generate embeddings for a set of dog breed PDF documents, index them, and store them in a MongoDB Atlas database.

## Step 1: The Documents

These are the dog breeds we'll be using. You can find the documents in the `data/dogs` folder:
```text
data/dogs
├── alaskan-malmute.pdf
├── american-bulldog.pdf
├── golden-retriever.pdf
├── siberian-husky.pdf
├── border-collie.pdf
├── german-shepherd.pdf
├── rottweiler.pdf
├── dalmation.pdf
├── shiba-inu.pdf
├── dobermann.pdf
├── poodle.pdf
├── chihuahua.pdf
└── beagle.pdf
```

## Step 2: Load Settings

In [13]:
import os
import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [14]:
# Load settings from .env file
from dotenv import find_dotenv, dotenv_values

# Change system path to root directory
sys.path.insert(0, '../')

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# For debugging purposes
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception ("'ATLAS_URI' is not set.  Please set it above to continue...")
else:
    print("ATLAS_URI Connection string found:", ATLAS_URI)

## Only uncomment this if you are using OpenAI for embeddings
# OPENAI_API_KEY = config.get("OPENAI_API_KEY")
# if not OPENAI_API_KEY:
#     raise Exception ("'OPENAI_API_KEY' is not set. Please set it above to continue...")
# else:
#     print("OPENAI_API_KEY found")

ATLAS_URI Connection string found: mongodb+srv://<username>:<password>@cluster0.9jwfxrl.mongodb.net/?retryWrites=true&w=majority&appName=Cluster0
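
Printing the full connection string exposes the database password in the notebook output. If you still want to echo the URI for debugging, one option is to mask the password first. The snippet below is a minimal sketch using only the standard library; `redact_uri` is a hypothetical helper name, not part of PyMongo or LlamaIndex.

```python
from urllib.parse import urlsplit, urlunsplit


def redact_uri(uri: str) -> str:
    """Return the URI with the password (if any) replaced by '***'."""
    parts = urlsplit(uri)
    userinfo, at, host = parts.netloc.rpartition("@")
    if not at or ":" not in userinfo:
        # No credentials, or a username without a password: nothing to hide
        return uri
    user = userinfo.split(":", 1)[0]
    netloc = f"{user}:***@{host}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Example with a made-up connection string
print(redact_uri("mongodb+srv://alice:s3cret@cluster0.example.mongodb.net/?retryWrites=true"))
```

In the cell above, this would mean printing `redact_uri(ATLAS_URI)` instead of `ATLAS_URI`.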


In [15]:
# Define our variables
DB_NAME = 'dogs'
COLLECTION_NAME = 'type'
INDEX_NAME = 'idx_embedding'

In [16]:
# LlamaIndex will download embeddings models as needed
# Set llamaindex cache dir to ../cache dir here (Default is system tmp)
# This way, we can easily see downloaded artifacts
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath('../'), 'cache')

In [17]:
import pymongo

mongodb_client = pymongo.MongoClient(ATLAS_URI)

print ("Atlas client initialized")

Atlas client initialized


## Step 3: Setup Embeddings

Next, we need an embedding model to generate embeddings for the documents. I'm using HuggingFace embeddings here.

### Using HuggingFace Embeddings

This option uses a HuggingFace embedding model. Listed below are some examples, taken from the leaderboard at https://huggingface.co/spaces/mteb/leaderboard. We'll be going with the `BAAI/bge-small-en-v1.5` embedding model here.

| model name                              | overall score | model size | model params | embedding length | License  | url                                                            |
|-----------------------------------------|---------------|------------|--------------|------------------|----------|----------------------------------------------------------------|
| BAAI/bge-large-en-v1.5                  | 64.x          | 1.34 GB    | 335 M        | 1024             | MIT      | https://huggingface.co/BAAI/bge-large-en-v1.5                  |
| BAAI/bge-small-en-v1.5                  | 62.x          | 133 MB     | 33.5 M       | 384              | MIT      | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          | 438 MB     |              | 768              | Apache 2 | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L12-v2 | 56.x          | 134 MB     |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          | 91 MB      |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |

In [18]:
# from llama_index.embeddings import HuggingFaceEmbedding
# Uncomment the line above and comment the line below if you face an import error
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: BAAI/bge-small-en-v1.5
INFO:sentence_transformers.SentenceTransformer:2 prompts are loaded, with the keys: ['query', 'text']


In [19]:
## Set up the service context (embedding model + LLM)
# The LLM used to generate natural language responses to queries
# If not provided, it will default to gpt-3.5-turbo from OpenAI
# If your OpenAI API key is not set, it will default to llama2-chat-13B from Llama.cpp
# Since we don't need an LLM just yet, we'll be setting it to None

# from llama_index import ServiceContext
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import ServiceContext

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)

LLM is explicitly disabled. Using MockLLM.




## Step 4: Connect Llama-Index and MongoDB Atlas

I'll be using MongoDB Atlas as my vector store. It holds the indexed data so that it can be queried later on.

In [20]:
# Run this cell to install llama-index-vector-stores-mongodb
#!pip install llama-index-vector-stores-mongodb

In [21]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
# from llama_index.storage.storage_context import StorageContext
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import StorageContext

vector_store = MongoDBAtlasVectorSearch(
    mongodb_client=mongodb_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name=INDEX_NAME,
    ## The following keys are left at their default values
    # embedding_key='embedding', text_key='text', metadata_key='metadata',
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)
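
The vector store above refers to a search index named `idx_embedding`, but the notebook does not show how that index is defined. Assuming the default `embedding` field and the 384-dimensional output of `BAAI/bge-small-en-v1.5` (see the table in Step 3), an Atlas Vector Search index definition might look like the sketch below. The `cosine` similarity setting is an assumption; pick whatever matches your use case.

```python
import json

# Embedding length of BAAI/bge-small-en-v1.5, taken from the table in Step 3
EMBEDDING_DIM = 384

# Assumed definition for an Atlas Vector Search index on the default 'embedding' field
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": EMBEDDING_DIM,
            "similarity": "cosine",  # assumption, not stated in the notebook
        }
    ]
}

# Print as JSON so it can be pasted into the Atlas UI JSON editor
print(json.dumps(index_definition, indent=2))
```

If you switch to a different embedding model, `numDimensions` has to match that model's embedding length.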

## Step 5: Read PDF Documents

Llama-index has a very handy `SimpleDirectoryReader` that can read a single file, multiple files, or an entire directory. I'll use it to read the PDF files in `data/dogs` and store the result in `docs`.

In [22]:
%%time

# from llama_index.readers.file.base import SimpleDirectoryReader
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import SimpleDirectoryReader

data_dir = '../data/dogs/'

## This reads an entire directory
docs = SimpleDirectoryReader(
        input_dir=data_dir
).load_data()

print (f"Loaded {len(docs)} chunks from '{data_dir}'")

Loaded 122 chunks from '../data/dogs/'
CPU times: total: 2.38 s
Wall time: 4.59 s
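
The reader reports chunks rather than files (it typically yields one document per PDF page, which would explain why the count is much larger than the number of PDFs). To double-check which files were actually picked up, a small standard-library sketch like the following can help; `list_pdfs` is a hypothetical helper, not part of LlamaIndex.

```python
from pathlib import Path


def list_pdfs(data_dir: str) -> list[str]:
    """Return the sorted names of the PDF files directly inside data_dir."""
    return sorted(p.name for p in Path(data_dir).glob("*.pdf"))


# Same directory as in the cell above; skipped if it does not exist
data_dir = "../data/dogs/"
if Path(data_dir).is_dir():
    pdfs = list_pdfs(data_dir)
    print(f"Found {len(pdfs)} PDF files in '{data_dir}':")
    for name in pdfs:
        print(" -", name)
```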


## Step 6: Index the Documents and Store Them Into MongoDB Atlas

The code cell below is where everything we've prepared in this notebook comes together:
- Embeddings are generated by the embedding model packaged in `service_context`
- Our documents `docs` are indexed through `storage_context`, which stores both the text and the embeddings in MongoDB Atlas

In [23]:
%%time

# from llama_index.indices.vector_store.base import VectorStoreIndex
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs, 
    storage_context=storage_context,
    service_context=service_context,
)

Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:06<00:00,  6.55s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.95s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.40s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.07s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  8.00s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:07<00:00,  7.02s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:08<00:00,  8.00s/it]
Batches: 100%|███████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.10it/s]

CPU times: total: 9min 55s
Wall time: 1min 22s


After running the code cell above, Atlas should contain a new database `dogs` with a new collection `type` that holds the text as well as the generated embeddings of the PDF files.

# Making Queries to the RAG Model
In this section, we use our RAG model together with an LLM to ask questions about the uploaded documents. If all goes to plan, the RAG model (powered by Atlas Vector Search) retrieves the portions of the documents that are relevant to our query and feeds them to the LLM, enabling it to answer the query correctly.

## Step 1: Setup LLM
We need to set up an LLM that takes the results from Atlas Vector Search and responds to the user's query. We'll be using OpenAI for this purpose.

In [24]:
import openai
from llama_index.llms.openai import OpenAI

openai.api_key = config.get("OPENAI_API_KEY")

llm = OpenAI(model="gpt-3.5-turbo")

completion_response = llm.complete("To infinity, and")
print(completion_response)

INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 401 Unauthorized"
Retrying llama_index.llms.openai.base.OpenAI._chat in 0.8096267740813633 seconds as it raised AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************MU8m. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}.

AuthenticationError: Error code: 401 - {'error': {'message': 'Incorrect API key provided: sk-proj-********************************************MU8m. You can find your API key at https://platform.openai.com/account/api-keys.', 'type': 'invalid_request_error', 'param': None, 'code': 'invalid_api_key'}}
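
A 401 like the one above means a key was sent but OpenAI rejected it, so the fix is to put a valid key in `.env`. A local check cannot tell whether a key is valid, but it can at least fail fast when the key is missing or blank, and it can show a masked version so you can confirm which key was loaded. The helpers below are a hypothetical sketch (`require_setting` and `mask_secret` are not library functions).

```python
def require_setting(config: dict, key: str) -> str:
    """Return the stripped value for key, or raise if it is missing or blank."""
    value = (config.get(key) or "").strip()
    if not value:
        raise ValueError(f"'{key}' is not set. Add it to your .env file to continue.")
    return value


def mask_secret(secret: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a secret."""
    if len(secret) <= visible:
        return "*" * len(secret)
    return "*" * (len(secret) - visible) + secret[-visible:]


# Example with a made-up config dict (dotenv_values returns a similar mapping)
example_config = {"OPENAI_API_KEY": " sk-test-1234 "}
api_key = require_setting(example_config, "OPENAI_API_KEY")
print("Using OpenAI key:", mask_secret(api_key))
```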

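Once the call above returns a completion (a 401 `AuthenticationError` like the one shown means the `OPENAI_API_KEY` in your `.env` file is invalid and needs to be replaced), we have both an embedding model and an LLM. Let's combine them into a unified interface, `service_context`, that we can use later on.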

In [None]:
# from llama_index import ServiceContext
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import ServiceContext

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=llm)

## Step 2: Connect Llama-Index and MongoDB Atlas

This is where everything comes together: we combine MongoDB Atlas as our vector store with the `service_context` we just defined. This setup allows us to ask the LLM questions about our uploaded documents. Atlas Vector Search locates the portions of the documents that most closely match our query and supplies them to the LLM, which gives us a more accurate response.
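
To make "most closely match" concrete: vector search compares the embedding of the query with the embedding of each stored chunk and returns the best-scoring chunks. The sketch below illustrates the idea with cosine similarity in plain Python. The 2-dimensional toy vectors and the `rank_chunks` helper are hypothetical; in this notebook the real search runs server-side in Atlas over 384-dimensional embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec: list[float], chunks: list[dict], k: int = 2) -> list[dict]:
    """Return the k chunks whose embeddings are most similar to the query."""
    return sorted(
        chunks,
        key=lambda c: cosine_similarity(query_vec, c["embedding"]),
        reverse=True,
    )[:k]


# Toy data: made-up 2-dimensional "embeddings" for three chunks
chunks = [
    {"text": "chunk A", "embedding": [1.0, 0.0]},
    {"text": "chunk B", "embedding": [0.0, 1.0]},
    {"text": "chunk C", "embedding": [0.9, 0.1]},
]
for chunk in rank_chunks([1.0, 0.0], chunks):
    print(chunk["text"])
```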

In [None]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch

# from llama_index.storage.storage_context import StorageContext
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import StorageContext

# from llama_index.indices.vector_store.base import VectorStoreIndex
# Uncomment the line above and comment away the line below if you face an import error
from llama_index.core import VectorStoreIndex

vector_store = MongoDBAtlasVectorSearch(
    mongodb_client=mongodb_client,
    db_name=DB_NAME,
    collection_name=COLLECTION_NAME,
    index_name=INDEX_NAME,
    ## The following keys are left at their default values
    # embedding_key='embedding', text_key='text', metadata_key='metadata',
)

storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_vector_store(vector_store=vector_store, service_context=service_context)

## Step 3: Query Data / Ask Questions

In [None]:
from IPython.display import Markdown, display

# Build a query engine on top of the index and ask a question about the documents
response = index.as_query_engine().query("Why are German Shepherds the best partners, and not Border Collies?")
display(Markdown(f"<b>{response}</b>"))