[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-mistral.ipynb)

#  RAG-10k - Populate data with Mistral embeddings

## Overview

The diagram below shows the overall RAG pipeline.  In this notebook we will do step 1.

This notebook shows how to use the **Mistral embedding model** to create embeddings.

We will do the following:

- 👉 Load PDF documents
- 👉 Use Mistral embedding models to calculate embeddings for PDF documents
- 👉 Upload them into Atlas


![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/rag-1.svg)

### What you need to run this notebook

- a (free) MongoDB Atlas Account
- and connection credentials
- a Mistral API Key

### The Stack

- Language: Python
- Vector database: Atlas
- Embedding Model: Mistral embedding model


### How to run

This notebook can be run on Google Colab and in standalone Python development environments.  Click the badge below to run it on Colab.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sujee/mongodb-atlas-vector-search/blob/main/lab-4-rag/rag-10k-a-populate-embeddings-mistral.ipynb)




## Step-1: Setup Atlas

We need an Atlas cluster set up.

Follow the [instructions here](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/lab-1-atlas-setup/setup-atlas.md).


## Step-2: Configuration

We will set up some common configuration here.

In [1]:
# We will keep all global variables in an object to not pollute the global namespace.
class MyConfig(object):
    pass

MY_CONFIG = MyConfig()

MY_CONFIG.DB_NAME = 'rag1'
MY_CONFIG.COLLECTION_NAME = '10k_mistral'
MY_CONFIG.EMBEDDING_ATTRIBUTE = 'embedding_mistral'
MY_CONFIG.INDEX_NAME = 'idx_embedding_mistral'
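
The same "one object holds all settings" idea can also be sketched with the standard library's `types.SimpleNamespace`, which gives attribute-style access without defining an empty class. This is only an illustrative alternative: the name `sketch_config` is hypothetical, and the rest of the notebook keeps using `MY_CONFIG`.

```python
from types import SimpleNamespace

# Hypothetical alternative to the empty MyConfig class: a single namespace
# object holds every setting, so nothing else leaks into the global namespace.
sketch_config = SimpleNamespace(
    DB_NAME="rag1",
    COLLECTION_NAME="10k_mistral",
    EMBEDDING_ATTRIBUTE="embedding_mistral",
    INDEX_NAME="idx_embedding_mistral",
)

# Attributes can still be added later, just like with MY_CONFIG.
sketch_config.RUNNING_IN_COLAB = False

print(vars(sketch_config))
```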


## Step-3: Load Configuration

We need to configure the following:
- Atlas connection credentials (`ATLAS_URI`)
- Mistral API key (`MISTRAL_API_KEY`)

### Option 3A - If running on Colab

- Click on the 'Colab secrets' icon (🔑) in the left pane, and create the following secrets:
   - `ATLAS_URI`
   - `MISTRAL_API_KEY`
- Make sure the `notebook access` toggle is switched on for each secret
- See the screenshot below for an example

<!-- ![](../images/colab-secret-2.png) -->

![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/colab-secret-3.png)


### Option 3B - If running on local python environment

- Set up your local Python environment following this [setup guide](https://github.com/sujee/mongodb-atlas-vector-search/blob/main/setup-python-env.md)
- Create a file named `.env` in the same location as the notebook
- Add the following settings:

```text
ATLAS_URI=mongodb+srv://<username>:<password>@sandbox.....
MISTRAL_API_KEY=xyz
```
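
To make the `.env` format concrete, here is a minimal sketch of how such a file can be read into a dictionary. It is a simplified stand-in written only for illustration: the notebook itself relies on the `python-dotenv` package, which handles quoting and other details this sketch ignores. The function name `parse_env_text` and the sample values are hypothetical.

```python
def parse_env_text(text):
    """Parse simple KEY=VALUE lines into a dict (simplified sketch)."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first '=' only, because a URI may contain '=' itself.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


# Placeholder values, not real credentials.
sample = """
# Atlas and Mistral settings
ATLAS_URI=mongodb+srv://user:secret@example.mongodb.net/?retryWrites=true
MISTRAL_API_KEY=xyz
"""
print(parse_env_text(sample))
```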


## Step-4: Determine Runtime Environment

This code figures out whether we are running on Google Colab or in a local environment.  We use the result later to install the relevant packages.

In [2]:
# are we running in Colab?
import os

if os.getenv("COLAB_RELEASE_TAG"):
    print("Running in Colab")
    MY_CONFIG.RUNNING_IN_COLAB = True
else:
    print("NOT running in Colab")
    MY_CONFIG.RUNNING_IN_COLAB = False

Running in Colab


## Step-5: Install dependencies (if necessary)

We will install the required libraries in cloud environments like Google Colab.  For local environments, we assume the dependencies are already set up.

In [3]:
if MY_CONFIG.RUNNING_IN_COLAB:
    !pip install \
                pymongo==4.6.2 \
                llama-index \
                llama-index-embeddings-mistralai \
                llama-index-vector-stores-mongodb \
                transformers==4.38.2 \
                torch==2.2.1

Collecting openai
  Downloading openai-1.14.2-py3-none-any.whl (262 kB)
Collecting pymongo==4.6.2
  Downloading pymongo-4.6.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (677 kB)
Collecting llama-index
  Downloading llama_index-0.10.20-py3-none-any.whl (5.6 kB)
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.1.4-py3-none-any.whl (7.7 kB)
Collecting llama-index-embeddings-mistralai
  Downloading llama_index_embeddings_mistralai-0.1.4-py3-none-any.whl (2.6 kB)
Collecting llama-index-vector-stores-mongodb
  Downloading llama_index_vector_stores_mongodb-0.1.4-py3-none-any.whl (4.0 kB)
Collecting dnspython<3.0.0,>=1.16.0 (from pymongo==4.6.2)
  Downloading dnspython-2.6.1-py3-

## Step-6: Basic Setup

### 6.1 - Check if we have GPU

In [4]:
## Check if GPU is enabled
import os
import torch

## To disable GPU and experiment, uncomment the following line
## Normally, you would want to use GPU, if one is available.
# os.environ["CUDA_VISIBLE_DEVICES"]=""

print ("using CUDA/GPU: ", torch.cuda.is_available())

for i in range(torch.cuda.device_count()):
   print("device ", i , torch.cuda.get_device_properties(i).name)

using CUDA/GPU:  False


### 6.2 - Logging

In [5]:
## Set up logging.  To see more logging, set the level to DEBUG

import sys
import logging

# logging.basicConfig(stream=sys.stdout, level=logging.DEBUG)
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))

## Step-7: Load Configurations

In [6]:
## Load settings based on where we are running
##  - if running on Google Colab, load from secrets
##  - if running locally, use dotenv

if MY_CONFIG.RUNNING_IN_COLAB:
    from google.colab import userdata
    MY_CONFIG.ATLAS_URI = userdata.get('ATLAS_URI')
    MY_CONFIG.MISTRAL_API_KEY = userdata.get('MISTRAL_API_KEY')
    # MY_CONFIG.OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
else:
    import os, sys
    from dotenv import find_dotenv, dotenv_values

    this_dir = os.path.abspath('')
    parent_dir = os.path.dirname(this_dir)
    sys.path.append (os.path.abspath (parent_dir))

    config = dotenv_values(find_dotenv())
    # debug
    # print (config)
    MY_CONFIG.ATLAS_URI = config.get('ATLAS_URI')
    MY_CONFIG.MISTRAL_API_KEY = config.get("MISTRAL_API_KEY")
## --- end load config

## If you just want to quickly set the config manually, you can do so here.
# MY_CONFIG.ATLAS_URI = ''
# MY_CONFIG.MISTRAL_API_KEY = ''

if MY_CONFIG.ATLAS_URI:
    print ("✅ config ATLAS_URI found")
else:
    raise Exception ("❌ 'ATLAS_URI' is not set.  Please set it above to continue...")


if MY_CONFIG.MISTRAL_API_KEY:
    print ("✅ config MISTRAL_API_KEY found")
else:
    raise Exception ("❌ 'MISTRAL_API_KEY' is not set.  Please set it above to continue...")

✅ config ATLAS_URI found
✅ config MISTRAL_API_KEY found


## Step-8 : Get Data Files (if needed)

We are going to use 10-K filings.  These are financial documents that US public companies file with the SEC (Securities and Exchange Commission).  You can read about them [here](https://www.investor.gov/introduction-investing/investing-basics/glossary/form-10-k).

We have two 10-K documents, one from Lyft and one from Uber.

Don't be fooled by the count of just two documents.  Each PDF document is about 200+ pages long, so there is plenty of text to work with.

Let's get these data files.

In [7]:
import os

# ------- begin -------
def download_data_file (remote_file, local_file):
    if os.path.exists (local_file):
        print (f"✅ Local data file exists : {local_file}")
    else:
        !wget -O {local_file}  {remote_file}
        print (f"✅ Downloaded data file : {local_file}")
#-------- end -------

# figure out data dir
if MY_CONFIG.RUNNING_IN_COLAB:
    MY_CONFIG.DATA_DIR = "data/10k"
else:
    MY_CONFIG.DATA_DIR = "../data/10k"

if not os.path.exists (MY_CONFIG.DATA_DIR):
  !mkdir -p {MY_CONFIG.DATA_DIR}

download_data_file ('https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/lyft_2021.pdf',
                    os.path.join (MY_CONFIG.DATA_DIR, 'lyft_2021.pdf'))

download_data_file ('https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/uber_2021.pdf',
                    os.path.join (MY_CONFIG.DATA_DIR, 'uber_2021.pdf'))

--2024-03-20 07:16:56--  https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/lyft_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1440303 (1.4M) [application/octet-stream]
Saving to: ‘data/10k/lyft_2021.pdf’


2024-03-20 07:16:56 (44.2 MB/s) - ‘data/10k/lyft_2021.pdf’ saved [1440303/1440303]

✅ Downloaded data file : data/10k/lyft_2021.pdf
--2024-03-20 07:16:56--  https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/data/10k/uber_2021.pdf
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 2
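
The download cell above relies on the `!wget` shell escape, which only works inside a notebook and needs the `wget` binary to be installed. As a rough sketch of a portable alternative, the same "download only if missing" logic can be written with the standard library. The function name `download_if_missing` is hypothetical and not part of the original notebook.

```python
import os
import urllib.request


def download_if_missing(remote_url, local_path):
    """Download remote_url to local_path unless the file already exists.

    Returns True if a download happened, False if the local copy was reused.
    """
    if os.path.exists(local_path):
        return False
    # Create the target directory (if any) before writing the file.
    target_dir = os.path.dirname(local_path)
    if target_dir:
        os.makedirs(target_dir, exist_ok=True)
    urllib.request.urlretrieve(remote_url, local_path)
    return True
```

In the notebook this could be called as `download_if_missing(url, os.path.join(MY_CONFIG.DATA_DIR, 'lyft_2021.pdf'))`, with `url` set to the same GitHub raw URL used above.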

## Step-9: Inspect the PDF files

They will be in the following directory:


```text
data/10k/
├── lyft_2021.pdf
└── uber_2021.pdf
```


## Step-10: Initialize Atlas Client

If this step fails, make sure 'connect from anywhere' is enabled in your Atlas network configuration.

![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/atlas-connect-2.png)

In [8]:
import pymongo

mongodb_client = pymongo.MongoClient(MY_CONFIG.ATLAS_URI)
print ('✅ Connected to Atlas instance!')

✅ Connected to Atlas instance!
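
One caveat: `pymongo.MongoClient` connects lazily, so constructing the client does not by itself prove that the cluster is reachable. A minimal sketch of an explicit check is to send the lightweight `ping` command. The helper name `check_atlas_connection` is hypothetical, and the broad `except` is a simplification for this sketch.

```python
def check_atlas_connection(client):
    """Return True if the server answers a ping, False otherwise."""
    try:
        # 'ping' forces a real round trip to the server.
        client.admin.command("ping")
        return True
    except Exception as err:  # deliberately broad for this sketch
        print(f"❌ Could not reach Atlas: {err}")
        return False
```

In the notebook this would be called as `check_atlas_connection(mongodb_client)`.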


## (Optional) Step-11: Clear out the collection

For a fresh start!

In [9]:
## if a clean start is required, you can use the following code to clear out old data

database = mongodb_client[MY_CONFIG.DB_NAME]
collection = database [MY_CONFIG.COLLECTION_NAME]

doc_count = collection.count_documents (filter = {})
print (f"Document count before delete : {doc_count:,}")

result = collection.delete_many(filter= {})
print (f"Deleted docs : {result.deleted_count}")

## Step-12: Calculate Embeddings

There are many choices here:

* OpenAI embeddings - call via API (see the sample notebook)
* **MistralAI embeddings - call via API (this notebook)**
* Local embedding models (see the sample notebook)

We are going to use the `llama-index-embeddings-mistralai` package ([documentation](https://docs.llamaindex.ai/en/stable/examples/embeddings/mistralai.html)).

We will call:

```python
MistralAIEmbedding(model_name=model_name, api_key=api_key)
```

Our model name is `mistral-embed`.

In [10]:
from llama_index.embeddings.mistralai import MistralAIEmbedding
from llama_index.core import Settings


Settings.embed_model = MistralAIEmbedding(model_name='mistral-embed', api_key=MY_CONFIG.MISTRAL_API_KEY)

In [11]:
## testing
embeddings = Settings.embed_model.get_text_embedding("La Plateforme - The Platform")
print ('embedding len : ', len(embeddings))
print ('first few embeddings : ', embeddings[:10])
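
Two properties of the returned vector are worth checking: its length, which has to match the `numDimensions` value (1024) used in the index definition in Step-17, and its L2 norm, since the `dotProduct` similarity chosen there is, to my understanding, meant for unit-length vectors. The helper below is a small sketch with a hypothetical name (`describe_embedding`), shown here on a toy vector.

```python
import math


def describe_embedding(vector):
    """Return (dimension, L2 norm) for a list of floats."""
    norm = math.sqrt(sum(x * x for x in vector))
    return len(vector), norm


# Toy vector for illustration; in the notebook you would pass `embeddings`.
dims, norm = describe_embedding([0.6, 0.8, 0.0])
print("dimensions:", dims, "norm:", round(norm, 6))
```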

## Step-13: Connect LlamaIndex and MongoDB Atlas

Let's define MongoDB Atlas as our vector store.  This is where the indexed data is stored and later queried.

In [12]:
from llama_index.vector_stores.mongodb import MongoDBAtlasVectorSearch
from llama_index.core import StorageContext


vector_store = MongoDBAtlasVectorSearch(mongodb_client = mongodb_client,
                                        db_name = MY_CONFIG.DB_NAME,
                                        collection_name = MY_CONFIG.COLLECTION_NAME,
                                        index_name  = MY_CONFIG.INDEX_NAME,
                                        embedding_key = MY_CONFIG.EMBEDDING_ATTRIBUTE,
                                        ## the following fields are left at their default values
                                       # text_key = 'text', metadata_key = 'metadata',
                                 )
storage_context = StorageContext.from_defaults(vector_store=vector_store)

## Step-14: Read PDF Documents

LlamaIndex has a very handy `SimpleDirectoryReader` that can read a single file, multiple files, or the entire content of a directory.

In [13]:
%%time

from llama_index.core import SimpleDirectoryReader


## This reads one doc
# docs = SimpleDirectoryReader(
#     input_files=["./data/10k/uber_2021.pdf"]
# ).load_data()

## here we read entire directory content
docs = SimpleDirectoryReader(
        input_dir=MY_CONFIG.DATA_DIR
).load_data()

print (f"Loaded {len(docs)} chunks from '{MY_CONFIG.DATA_DIR}'")

Loaded 545 chunks from 'data/10k'
CPU times: user 26.4 s, sys: 163 ms, total: 26.6 s
Wall time: 28 s
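
The reader returned 545 chunks for just two PDFs. To see how they split across the files, the chunks can be counted per source file. This sketch assumes each loaded chunk exposes a `metadata` dictionary with a `file_name` key, which is my understanding of what `SimpleDirectoryReader` attaches by default. The helper name and the stand-in objects are hypothetical.

```python
from collections import Counter
from types import SimpleNamespace


def count_chunks_by_file(docs):
    """Count loaded chunks per source file using each chunk's metadata."""
    return Counter(doc.metadata.get("file_name", "<unknown>") for doc in docs)


# Stand-in objects for illustration; in the notebook, pass the real `docs` list.
fake_docs = [
    SimpleNamespace(metadata={"file_name": "lyft_2021.pdf"}),
    SimpleNamespace(metadata={"file_name": "uber_2021.pdf"}),
    SimpleNamespace(metadata={"file_name": "uber_2021.pdf"}),
]
print(count_chunks_by_file(fake_docs))
```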


## Step-15: Index the docs and Store them into MongoDB Atlas

When we execute the code below, the following will happen:

- documents are indexed
- embeddings are created for the text
- the documents (text and embeddings) are stored in our vector store (MongoDB Atlas in this case)

**Note: This might take a couple of minutes**

In [14]:
%%time

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(
    docs, storage_context=storage_context
)

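# refresh_ref_docs returns a list of booleans (one per document) marking which
# documents were inserted or updated; it does not return a new index.
refreshed_docs = index.refresh_ref_docs (docs)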

CPU times: user 7.83 s, sys: 228 ms, total: 8.05 s
Wall time: 53.3 s


## Step-16: View Created Documents in Atlas

- Go to your Atlas dashboard
- Select 'Browse Collections'
- Select database **`rag1`** and collection **`10k_mistral`**
- Click around to see some of the inserted sample data
- The `text` attribute holds the text data
- The `embedding_mistral` attribute holds the embeddings
- Expand the `metadata` attribute.  It is populated automatically by llama-index

![](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/10k-documents-1.png)
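
The same inspection can be done from Python. The sketch below summarizes one stored document: its field names, the length of its embedding, and a short text preview. The helper name `summarize_stored_doc` and the stand-in document are hypothetical; the field names follow the configuration used in this notebook.

```python
def summarize_stored_doc(doc, embedding_key="embedding_mistral"):
    """Summarize one stored document: field names, embedding length, text preview."""
    embedding = doc.get(embedding_key) or []
    return {
        "fields": sorted(doc.keys()),
        "embedding_length": len(embedding),
        "text_preview": str(doc.get("text", ""))[:80],
    }


# Stand-in document for illustration only.
fake_doc = {
    "_id": "abc",
    "text": "Sample 10-K text",
    "embedding_mistral": [0.1, 0.2, 0.3],
    "metadata": {},
}
print(summarize_stored_doc(fake_doc))
```

In the notebook, a real document could be fetched with `collection.find_one()` and passed in together with `MY_CONFIG.EMBEDDING_ATTRIBUTE`.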

## Step-17: Setup Vector Index

Before we do vector search, we need to define an embedding index

You can look at steps here [setup-atlas-index.md](setup-atlas-index.md)

Here are the details:

- Database: **`rag1`**
- Collection: **`10k_mistral`**
- Index name: **`idx_embedding_mistral`**

Index definition JSON:

```json
{
  "fields": [
    {
      "type": "vector",
      "path": "embedding_mistral",
      "numDimensions": 1024,
      "similarity": "dotProduct"
    }
  ]
}
```
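
Because the `path` and `numDimensions` values must match what the notebook actually stored, it can help to generate the definition from the same settings rather than typing it by hand. The sketch below does that with the standard library only. The helper name `build_vector_index_definition` is hypothetical, and the resulting JSON still has to be pasted into the Atlas UI as described in this step.

```python
import json


def build_vector_index_definition(path, num_dimensions, similarity="dotProduct"):
    """Build the Atlas vector index definition shown above as a Python dict."""
    allowed = {"euclidean", "cosine", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,
                "similarity": similarity,
            }
        ]
    }


# Values taken from this notebook's configuration.
definition = build_vector_index_definition("embedding_mistral", 1024)
print(json.dumps(definition, indent=2))
```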

The similarity function can be one of:
- "euclidean"
- "cosine"
- "dotProduct"
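
To give a feel for what these three options measure, here is a plain-Python sketch of each on two toy vectors. It only illustrates the underlying math; how Atlas converts these raw values into a search score is not covered here, and the function names are my own.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    return dot_product(a, b) / (math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two unit-length toy vectors.
a = [1.0, 0.0]
b = [0.6, 0.8]
print("dotProduct:", dot_product(a, b))
print("cosine    :", cosine_similarity(a, b))
print("euclidean :", euclidean_distance(a, b))
```

For unit-length vectors such as these, the dot product and the cosine similarity give the same value, which is why normalized embeddings are a natural fit for `dotProduct`.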


### Follow these steps to set up the index


![missing image](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/atlas-index-2.png)

![missing image](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/atlas-index-rag-mistral-1.png)

![missing image](https://raw.githubusercontent.com/sujee/mongodb-atlas-vector-search/main/images/atlas-index-rag-mistral-2.png)

## We are done! 👏




Now the data is populated and ready to be queried.

Let's go to the next lab: query