[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate.ipynb)

#  RAG-1A - Doing RAG with MongoDB

## Overview 

Here is the overall RAG pipeline.  In this notebook we will do step 1.

- 👉 Load PDF documents
- Use embedding models to calculate embeddings for PDF documents
- Upload them into Atlas
- Then query these PDF documents

![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/rag-1.svg)

### What you need to run this notebook

- a (free) MongoDB Atlas Account
- and connection credentials

### The Stack

- Language: Python
- Vector database: Atlas
- Embedding Model: `BAAI/bge-small-en-v1.5` (from Hugging Face)


### How to run

This notebook can be run on Google Colab or in a standalone Python development environment.  Click the badge below to run it on Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate.ipynb)



## Step-1: Setup Atlas

We will need to have Atlas set up.

Follow [instructions here](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-1-atlas-setup/setup-atlas.md)


## Step-: Configuration

We need to configure the following
- Atlas connection credentials

### Option 3A - If running on Colab

- Click on the 'Colab secrets' icon (🔑) in the left pane and create the following secrets:
- `ATLAS_URI` and `OPENAI_API_KEY`
- Make sure the `notebook access` toggle is enabled for both
- See the screenshot below for an example

<!-- ![](../images/colab-secret-2.png) -->

![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/colab-secret-1.png)


### Option 3B - If running on local python environment

- Set up your local Python environment following this [setup guide](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/setup-python-env.md)
- Create a file named `.env` in the same location as the notebook
- Add the following settings to it

```text
ATLAS_URI=mongodb+srv://<username>:<password>@sandbox.....
```
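
The notebook itself only reads the `.env` file later on. As a rough sketch of how both options could feed the same variable, the helper below tries Colab secrets first (via `google.colab.userdata`), then the process environment, then a dictionary of `.env` values. The name `resolve_setting` and the lookup order are assumptions made for illustration; they are not part of the original notebook.

```python
import os

def resolve_setting(name, dotenv_settings=None):
    """Hypothetical helper: look up a setting in Colab secrets, then env vars, then .env values."""
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        value = userdata.get(name)
        if value:
            return value
    except Exception:
        # Not on Colab, or the secret is missing / not shared with this notebook
        pass

    value = os.environ.get(name)
    if value:
        return value

    # dotenv_settings is a plain dict, e.g. the result of dotenv_values(find_dotenv())
    return (dotenv_settings or {}).get(name)
```

A call such as `resolve_setting("ATLAS_URI", config)` would then work in either environment, returning `None` when the setting is not found anywhere.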


## Step-?: Determine Runtime Environment

This code detects whether we are running in Google Colab or in a local environment.  We use the result later to decide whether to install packages.

In [None]:
# We will keep all global variables in an object to not pollute the global namespace.
class MyConfig(object):
    pass

MY_CONFIG = MyConfig()

In [None]:

# are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
    MY_CONFIG.RUNNING_IN_COLAB = True
else:
    print("NOT running in Colab")
    MY_CONFIG.RUNNING_IN_COLAB = False

## Step-4: Install dependencies (if necessary)

We will install the required libraries in cloud environments like Google Colab.  For local environments, we assume the dependencies are already set up.

In [None]:
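if MY_CONFIG.RUNNING_IN_COLAB:
    # NOTE: llama-index 0.10 moved most imports under `llama_index.core`.
    # The imports later in this notebook use the older top-level layout,
    # so they may need adjusting with this pin.
    !pip install \
                openai==1.13.3 \
                pymongo==4.6.2 \
                llama-index==0.10.17 \
                transformers==4.38.2 \
                sentence_transformers==2.5.1 \
                torch==2.2.1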

## Step-2: Basic Setup

### 2.1 - Check if we have GPU

In [1]:
## Check if GPU is enabled
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  True
device  0 NVIDIA GeForce RTX 2070


### 2.2 - Logging and Paths

In [2]:
## Setup logging.  To see more logging, set the level to DEBUG

import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

In [3]:
import os, sys

this_dir = os.path.abspath('')
parent_dir = os.path.dirname(this_dir)
sys.path.append (os.path.abspath (parent_dir))

## Step-1: Inspect the Documents

We are going to use 10-K filings: financial documents that US public companies file with the SEC (Securities and Exchange Commission).  You can read about them [here](https://www.investor.gov/introduction-investing/investing-basics/glossary/form-10-k).

We have two 10-K documents, one from Lyft and one from Uber:

```text
data/10k
├── lyft_2021.pdf
└── uber_2021.pdf
```

Don't be fooled by the count of just 2 documents.  Each PDF document is 200+ pages long, so these are substantial documents.

Go ahead and inspect the documents.


## Step-2: Load Settings

In [4]:
import os,sys
## Load Settings from .env file
from dotenv import find_dotenv, dotenv_values

# _ = load_dotenv(find_dotenv()) # read local .env file
config = dotenv_values(find_dotenv())

# debug
# print (config)

ATLAS_URI = config.get('ATLAS_URI')

if not ATLAS_URI:
    raise Exception ("'ATLAS_URI' is not set.  Please set it above to continue...")

## Only need this if we are using OpenAI for Embeddings
# OPENAI_API_KEY = config.get("OPENAI_API_KEY")
# if not OPENAI_API_KEY:
#     raise Exception ("'OPENAI_API_KEY' is not set.  Please set it above to continue...")
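
When debugging the configuration, printing `config` as-is would expose the database password. Here is a minimal sketch of masking it first, using only the standard library. The helper name `mask_uri` and the example URI are hypothetical.

```python
from urllib.parse import urlsplit

def mask_uri(uri):
    """Return the URI with its password replaced by '****' (hypothetical helper)."""
    password = urlsplit(uri).password
    if not password:
        return uri
    # Replace only the first ':<password>@' occurrence, i.e. the credentials part
    return uri.replace(f":{password}@", ":****@", 1)

# Made-up URI for illustration
print(mask_uri("mongodb+srv://alice:s3cret@sandbox.example.net/"))
```

In the notebook this could be used as `print(mask_uri(ATLAS_URI))` instead of printing the raw value.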

In [5]:
DB_NAME = 'rag1'
COLLECTION_NAME = '10k'
INDEX_NAME = 'idx_embedding'

In [6]:
import os
## LlamaIndex will download embeddings models as needed.
## Set llamaindex cache dir to ./cache dir here (Default is system tmp)
## This way, we can easily see downloaded artifacts
os.environ['LLAMA_INDEX_CACHE_DIR'] = os.path.join(os.path.abspath(''), '..', 'llama-index-cache')

In [7]:
import pymongo

mongodb_client = pymongo.MongoClient(ATLAS_URI)

print ("Atlas client initialized")

Atlas client initialized
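
`MongoClient` connects lazily, so the message above does not yet prove that the cluster is reachable. One common way to check is the `ping` admin command. The wrapper below is a small sketch (the name `can_reach_atlas` is made up here), and it accepts any client object so it can be tried without a live cluster.

```python
def can_reach_atlas(client):
    """Return True if the server answers a ping, False otherwise (hypothetical helper)."""
    try:
        # Forces a round trip to the server; raises if the server cannot be reached
        client.admin.command("ping")
        return True
    except Exception as exc:
        print(f"Could not reach Atlas: {exc}")
        return False
```

In this notebook it would be called as `can_reach_atlas(mongodb_client)`.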


## Step-3: Clear out the collection

For a fresh start!

In [8]:
database = mongodb_client[DB_NAME]
collection = database [COLLECTION_NAME]

doc_count = collection.count_documents (filter = {})
print (f"Document count before delete : {doc_count:,}")

result = collection.delete_many(filter= {})
print (f"Deleted docs : {result.deleted_count}")

Document count before delete : 0
Deleted docs : 0


## Step-4: Setup Embeddings

The default embedding is OpenAI.  We can always plug in custom embeddings.

### 4.1 - Option A - OpenAI Embeddings

This option uses the OpenAI embedding model.
You will need an API key (defined in the env variable `OPENAI_API_KEY`).

In [9]:
# from llama_index import  OpenAIEmbedding
# embed_model = OpenAIEmbedding()

### 4.2 - Option B : Using Custom Embeddings

Here are a few select models for comparison, taken from the leaderboard: https://huggingface.co/spaces/mteb/leaderboard

| model name                              | overall score | model size | model params | embedding length | License  | url                                                            |
|-----------------------------------------|---------------|------------|--------------|------------------|----------|----------------------------------------------------------------|
| BAAI/bge-large-en-v1.5                  | 64.x          | 1.34 GB    | 335 M        | 1024             | MIT      | https://huggingface.co/BAAI/bge-large-en-v1.5                  |
| BAAI/bge-small-en-v1.5                  | 62.x          | 133 MB     | 33.5 M       | 384              | MIT      | https://huggingface.co/BAAI/bge-small-en-v1.5                  |
| sentence-transformers/all-mpnet-base-v2 | 57.8          | 438 MB     |              | 768              | Apache 2 | https://huggingface.co/sentence-transformers/all-mpnet-base-v2 |
| sentence-transformers/all-MiniLM-L12-v2 | 56.x          | 134 MB     |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2 |
| sentence-transformers/all-MiniLM-L6-v2  | 56.x          | 91 MB      |              | 384              | Apache 2 | https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2  |
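
The embedding length column matters later: the Atlas vector index must declare the same number of dimensions as the model produces. As a sketch, the lookup below copies the lengths from the table and builds an index definition in the shape used in Step-9. The helper name `vector_index_definition` is hypothetical.

```python
import json

# Embedding lengths copied from the table above
EMBEDDING_LENGTH = {
    "BAAI/bge-large-en-v1.5": 1024,
    "BAAI/bge-small-en-v1.5": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
    "sentence-transformers/all-MiniLM-L12-v2": 384,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
}

def vector_index_definition(model_name, similarity="dotProduct", path="embedding"):
    """Build an Atlas vector index definition matching the chosen model (hypothetical helper)."""
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": EMBEDDING_LENGTH[model_name],
                "similarity": similarity,
            }
        ]
    }

print(json.dumps(vector_index_definition("BAAI/bge-small-en-v1.5"), indent=2))
```

Switching to a different model means the index definition has to change with it, since vectors of a different length will not match the index.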

In [10]:
from llama_index.embeddings import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")


In [11]:
## setup embed model

# The LLM used to generate natural language responses to queries.
# If not provided, defaults to gpt-3.5-turbo from OpenAI
# If your OpenAI key is not set, defaults to llama2-chat-13B from Llama.cpp
# We don't need an LLM just yet, so setting it to None

from llama_index import  ServiceContext

service_context = ServiceContext.from_defaults(embed_model=embed_model, llm=None)

LLM is explicitly disabled. Using MockLLM.


## Step-5: Connect LlamaIndex and MongoDB Atlas

Let's define MongoDB Atlas as our vector store.  This is where the indexed data is stored and later queried.

In [12]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.storage.storage_context import StorageContext


vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                 db_name = DB_NAME, collection_name = COLLECTION_NAME,
                                 index_name  = INDEX_NAME,
                                 ## the following field names are left at their default values
                                 # embedding_key = 'embedding', text_key = 'text', metadata_key = 'metadata',
                                 )
storage_context = StorageContext.from_defaults(vector_store=vector_store)

## Step-6: Read PDF Documents

LlamaIndex has a very handy `SimpleDirectoryReader` that can read a single file, multiple files, or an entire directory.

In [13]:
%%time

from llama_index.readers.file.base import SimpleDirectoryReader

data_dir = '../data/10k/'

## This reads one doc
# docs = SimpleDirectoryReader(
#     input_files=["../data/10k/uber_2021.pdf"]
# ).load_data()

## here we read entire directory content
docs = SimpleDirectoryReader(
        input_dir=data_dir
).load_data()

print (f"Loaded {len(docs)} chunks from '{data_dir}'")

Loaded 545 chunks from '../data/10k/'
CPU times: user 10.3 s, sys: 34.7 ms, total: 10.4 s
Wall time: 10.4 s
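
To see how the loaded items split across the two filings, one option is to count them per source file. This sketch assumes each loaded document carries a `file_name` entry in its `metadata` dictionary (`SimpleDirectoryReader` typically sets one, but treat that as an assumption). The stand-in objects below only mimic that shape so the snippet runs on its own.

```python
from collections import Counter
from types import SimpleNamespace

def count_per_file(documents):
    """Count loaded items per source file, based on metadata['file_name']."""
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in documents)

# Stand-in objects for illustration; in the notebook you would pass `docs` instead
sample_docs = [
    SimpleNamespace(metadata={"file_name": "lyft_2021.pdf"}),
    SimpleNamespace(metadata={"file_name": "uber_2021.pdf"}),
    SimpleNamespace(metadata={"file_name": "uber_2021.pdf"}),
]
print(count_per_file(sample_docs))
```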


## Step-7: Index the docs and Store them into MongoDB Atlas

When we execute the code below, the following happens:

- documents are indexed
- embeddings are created for the text
- the documents (text and embeddings) are stored in our vector store (MongoDB Atlas in this case)
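
The steps above can be pictured with a toy version of the pipeline. This is only a sketch: `toy_embed` is a stand-in that derives a short fake vector from a hash (a real model such as `BAAI/bge-small-en-v1.5` produces 384 meaningful values), and the record layout follows the default field names `text`, `embedding` and `metadata` mentioned in Step-5. The metadata keys and the sample texts are made up for illustration.

```python
import hashlib

def toy_embed(text, dims=8):
    """Stand-in for an embedding model: maps text to a deterministic fake vector."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]

def build_records(pages, file_name):
    """Turn page texts into the kind of records a vector store would hold."""
    return [
        {
            "text": page_text,
            "embedding": toy_embed(page_text),
            "metadata": {"file_name": file_name, "page": page_number},
        }
        for page_number, page_text in enumerate(pages, start=1)
    ]

# Placeholder page texts, not taken from the actual filings
records = build_records(["first page text", "second page text"], "uber_2021.pdf")
for record in records:
    print(record["metadata"], len(record["embedding"]), record["text"])
```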

In [14]:
%%time

from llama_index.indices.vector_store.base import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context,
    service_context=service_context,
)

CPU times: user 10.5 s, sys: 375 ms, total: 10.9 s
Wall time: 15.7 s


## Step-8: Verify Created Documents in Atlas

- Go to your Atlas dashboard
- Select 'Browse Collections'
- Select database `rag1` and collection `10k`
- Click around to see some of the inserted sample data
- You will see the `text` attribute holding the text data
- The `embedding` attribute is populated too
- Expand the `metadata` attribute.  This is automatically populated for us by LlamaIndex

![](../images/10k-documents-1.png)

## Step-9: Setup Vector Index

Before we do vector search, we need to define an embedding index

You can look at steps here [setup-atlas-index.md](setup-atlas-index.md)

Here are the details:

index_name = **'idx_embedding'**

Index definition JSON:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "dotProduct"
    }
  ]
}
```

The similarity function can be one of:
- "euclidean"
- "cosine"
- "dotProduct"


### Follow these steps to setup index


![](../images/atlas-index-2.png)

![](../images/atlas-index-7.png)

![](../images/atlas-index-8.png)



## We are done 

Now the data is populated and ready to be queried.

Let's go to the next lab: query