# MongoDB: Testing Vectors

Notes: 

- Used this notebook as a starting point: https://github.com/esteininger/vector-search/blob/master/foundations/atlas-vector-search/Atlas_Vector_Search_Demonstration.ipynb
- Creating the search index through pymongo did not work for me (see the failed attempt below). I had to use Atlas Admin to create the index instead (https://www.mongodb.com/community/forums/t/createsearchindex-not-found-in-mongosh/234699/8)
- It took around 2 minutes to create the index after loading the 11 docs into the collection
- You have to know the index name and add it to the query pipeline (it is unusual to refer to index names in queries)
- Indexes in MongoDB support more datatypes than Astra (int32, int64, double)
- The similarity metrics are the same as in Astra
- I was a bit confused about whether they use KNN or ANN: the operator is named after KNN, but the docs then mention HNSW as the algorithm, which is an approximate method (https://www.mongodb.com/docs/atlas/atlas-search/knn-beta/#mongodb-expression-exp.knnBeta)
- knnBeta is the search operator and knnVector is the index field type. I didn't understand why they don't share the same name (for simplicity).
- Having the meta field "score" available in the projection is nice. We could return an automatic "similarity" field whenever an ANN order-by is found in the command. That would prevent errors when the similarity function used in the select projection is different from the one used by the index.
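
Because index creation took a couple of minutes, a small polling helper can save some guesswork before running the first query. This is only a sketch: `wait_for_search_index` is a hypothetical helper, and it assumes PyMongo 4.5+ where `Collection.list_search_indexes()` is available and each returned document carries `name` and `queryable` fields. I did not run it against the cluster used in this notebook.

```python
import time


def wait_for_search_index(collection, index_name, timeout_s=300.0, poll_s=5.0):
    """Poll until the named search index reports queryable, or time out.

    `collection` is assumed to be a pymongo Collection (PyMongo 4.5+),
    whose list_search_indexes() yields one document per search index.
    Returns True if the index became queryable, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        for index in collection.list_search_indexes():
            # "name" and "queryable" are the fields this sketch assumes.
            if index.get("name") == index_name and index.get("queryable"):
                return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With the names used later in this notebook, the call would look like `wait_for_search_index(connection[database][collection], "default")`.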

In [5]:
pip install pymongo sentence-transformers pprintpp -q

Note: you may need to restart the kernel to use updated packages.


In [6]:
from sentence_transformers import SentenceTransformer
from pprint import pprint

# Model card: https://huggingface.co/obrizum/all-MiniLM-L6-v2
# Open question: how does this model convert text into vectors?
model = SentenceTransformer('obrizum/all-MiniLM-L6-v2')


In [10]:
# Product names that we will embed and load into the collection
products = [
    {"name":"Mozzarella"},
    {"name":"Parmesan"},
    {"name":"Cheddar"},
    {"name":"Brie"},
    {"name":"Swiss"},
    {"name":"Gruyere"},
    {"name":"Feta"},
    {"name":"Gouda"},
    {"name":"Provolone"},
    {"name":"Monterey Jack"},
    {"name":"Telephone"}
]

# Create a new embedding field for each product object
for product in products:
    # Encode the name to an embedding, then convert it to a plain list
    embeddings = model.encode(product['name']).tolist()
    product['embedding'] = embeddings
    
pprint(f"""{products[0]["name"]}: {products[0]["embedding"][:5]}... """)

('Mozzarella: [-0.09950762987136841, -0.02402164228260517, '
 '-0.046839337795972824, 0.06274435669183731, -0.09200147539377213]... ')
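
The index definition further down declares 384 dimensions and `dotProduct` similarity. My understanding is that dot-product similarity is only meaningful for ranking when the vectors have unit length, so it seems worth checking both properties before inserting. The helpers below (`l2_norm`, `normalize`, `check_embedding`) are hypothetical names for this sketch and use only the standard library.

```python
import math


def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = l2_norm(vector)
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def check_embedding(vector, expected_dims=384, tolerance=1e-3):
    """Return (dims_ok, unit_length_ok) for one embedding."""
    dims_ok = len(vector) == expected_dims
    unit_ok = abs(l2_norm(vector) - 1.0) <= tolerance
    return dims_ok, unit_ok
```

For example, `check_embedding(products[0]["embedding"])` would report whether the first product's vector has the expected size and length. If the second value comes back `False`, `normalize` can be applied before storing the vector.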


In [14]:
mongo_uri = ""

In [15]:
import pymongo

# connection object
connection = pymongo.MongoClient(mongo_uri)
database = 'eap'
collection = 'vector'

In [57]:
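# Tried to create the index from pymongo by passing a mongosh-style snippet to command(), but it failed.
# command() treats a string argument as a command name, so the server could not find a matching command.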
index_definition = f"""
{database}.{collection}.createSearchIndex(
    knn_vector_index,
    {{ mappings: 
        {{ dynamic: true,
            "fields": {{
                "embedding": {{
                    "type": "knnVector",
                    "dimensions": 384,
                    "similarity": "dotProduct"
                }}
            }} 
        }}
    }}
)"""

connection[database].command(index_definition)

OperationFailure: command not found, full error: {'ok': 0, 'errmsg': 'command not found', 'code': 59, 'codeName': 'CommandNotFound'}
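
A possible alternative is to send the index definition as a command document instead of a shell snippet. The sketch below only builds that document as a plain dict. It assumes the `createSearchIndexes` database command is available on the cluster (it depends on the server version and Atlas tier), and `build_create_search_index_command` is a hypothetical helper name. I have not verified it against the cluster used here, where the index was created through Atlas Admin.

```python
def build_create_search_index_command(collection_name, index_name="default",
                                      path="embedding", dimensions=384,
                                      similarity="dotProduct"):
    """Build a createSearchIndexes command document as a plain dict.

    The command name must be the first key; Python dicts keep insertion order.
    """
    return {
        "createSearchIndexes": collection_name,
        "indexes": [
            {
                "name": index_name,
                "definition": {
                    "mappings": {
                        "dynamic": True,
                        "fields": {
                            path: {
                                "type": "knnVector",
                                "dimensions": dimensions,
                                "similarity": similarity,
                            }
                        },
                    }
                },
            }
        ],
    }
```

It would be used as `connection[database].command(build_create_search_index_command(collection))`. PyMongo 4.5+ also exposes `Collection.create_search_index`, which may be the simpler route if that driver version is installed.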

In [16]:
# Inserting data
connection[database][collection].insert_many(products)

<pymongo.results.InsertManyResult at 0x1071b9e50>

In [37]:
# Counting the inserted documents
pprint(connection[database][collection].count_documents({}))

11


In [59]:
# Inspecting recorded data
pprint(connection[database][collection].find_one({}))

{'_id': ObjectId('650f057937db9df4c1b498f3'),
 'embedding': [-0.09950762987136841,
               -0.02402164228260517,
               -0.046839337795972824,
               0.06274435669183731,
               -0.09200147539377213,
               -0.03168807551264763,
               0.06376274675130844,
               -0.044427450746297836,
               0.031521499156951904,
               -0.14870190620422363,
               0.009347978048026562,
               -0.02179163508117199,
               -0.005592826288193464,
               0.03899405896663666,
               -0.048328742384910583,
               0.01296360045671463,
               0.08245743811130524,
               0.01641601324081421,
               -0.05583083629608154,
               -0.023067114874720573,
               0.032100629061460495,
               0.01212228648364544,
               0.020215777680277824,
               0.0018870396306738257,
               0.03407026827335358,
               -0.0093941148370

In [62]:
# Querying data

query = "cheese"
vector_query = model.encode(query).tolist()

# pprint(vector_query)
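pipeline = [
    {
        "$search": {
            # Name of the search index created through Atlas Admin
            "index": "default",
            "knnBeta": {
                "vector": vector_query,
                "path": "embedding",
                "k": 5,
            },
        }
    },
    {
        # Hide the raw vector and _id; expose the similarity score
        "$project": {
            "embedding": 0,
            "_id": 0,
            "score": {"$meta": "searchScore"},
        }
    },
]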

results = list(connection[database][collection].aggregate(pipeline))
pprint(results)

[{'name': 'Cheddar', 'score': 0.8514009714126587},
 {'name': 'Mozzarella', 'score': 0.8419662714004517},
 {'name': 'Swiss', 'score': 0.7116225361824036},
 {'name': 'Provolone', 'score': 0.7058044075965881},
 {'name': 'Monterey Jack', 'score': 0.6898983120918274}]
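
The scores above all fall between 0 and 1, while a raw dot product of unit vectors ranges from -1 to 1. As far as I understand the Atlas documentation, for `cosine` and `dotProduct` the reported score is the raw similarity rescaled as `(1 + similarity) / 2`. The sketch below illustrates that assumed mapping in plain Python. The function names are hypothetical, and I have not cross-checked the formula against the vectors stored in this collection.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalized_score(a, b):
    """Map a dot product in [-1, 1] to a score in [0, 1] (assumed formula)."""
    return (1.0 + dot(a, b)) / 2.0


def raw_similarity(score):
    """Invert the assumed mapping to recover the raw dot product."""
    return 2.0 * score - 1.0
```

Under this assumption, the Cheddar score of 0.8514 would correspond to a raw dot product of roughly 0.70 with the "cheese" query vector.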
