# 03.1 - Embedding with Chroma

To find semantically related documents, we'll use Chroma (https://www.trychroma.com/), a lightweight vector data store. Chroma supports swappable embedding models, metadata filtering, keyword search, and multiple distance metrics. We'll use these features to evaluate approaches to organizing papers for downstream processing (search, summarization, keyword extraction, etc.).

This notebook uses Chroma's default embedding model. The other "03" notebooks show how to use different embedding models.

## Section 0 - Notebook Setup

In [None]:
%pip install --upgrade --quiet tinydb

In [None]:
%pip install --upgrade --quiet chromadb

In [None]:
%pip install --upgrade --quiet sentence_transformers

Load the articles and prune those without abstracts, since the embeddings are generated from the abstracts.

In [1]:
from tinydb import TinyDB, Query

db = TinyDB('../data/data.json')
table = db.table('articles')

articles = table.all()
print(f'loaded {len(articles)} articles')

articles = [x for x in articles if x['abstract'] != 'No abstract available.']
print(f'retaining {len(articles)} articles')

loaded 1600 articles
retaining 1505 articles


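Stage the articles so they can be loaded easily into the vector database. Each article's link serves as its ID; articles that share a link are treated as duplicates, and only the first one is kept.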

In [2]:
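documents = []
ids = []
seen = set()  # set lookups are O(1); `doc_id not in ids` rescans the whole list

for article in articles:
    doc_id = article['link']  # the article link doubles as the document ID
    if doc_id not in seen:
        seen.add(doc_id)
        documents.append(article['abstract'])
        ids.append(doc_id)

print(f'loaded {len(ids)} articles')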

loaded 1479 articles
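
Because the link is the deduplication key, two links that differ only in session-style query parameters would count as different articles. Some links in the query output later in this notebook end in `?error=cookies_not_supported&code=...`, so this may be worth checking. The sketch below is one possible normalization step, not something the notebook currently does. The helper name `strip_noise_params` and the `NOISE_PARAMS` set are hypothetical, and the choice of `error` and `code` as noise is an assumption based only on the links visible here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: these parameter names are session noise, judged only from the
# links that appear in this notebook's output.
NOISE_PARAMS = {"error", "code"}


def strip_noise_params(url: str) -> str:
    """Return url without the query parameters listed in NOISE_PARAMS."""
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    kept = [(k, v) for k, v in pairs if k not in NOISE_PARAMS]
    if len(kept) == len(pairs):
        return url  # nothing to strip, so leave the link untouched
    # safe="/" keeps DOI-style values such as "10.1371/journal..." unescaped
    query = urlencode(kept, safe="/")
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(strip_noise_params(
    "https://www.nature.com/articles/s41467-024-49080-9"
    "?error=cookies_not_supported&code=97f93c4a-08eb-43ca-a702-1d5b9904a90a"
))
```

Links whose query string carries the actual article identifier (for example the PLOS `?id=...` links) pass through unchanged, since `id` is not in the noise set.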


## Section I - Create a Vector Database using Chroma

Create a Chroma database for storing the vector data.

In [4]:
import chromadb

client = chromadb.PersistentClient(path="../data/chroma_db")

Create a collection in the database. Each Chroma collection can have its own embedding function and distance metric.

In [5]:
from chromadb.errors import InvalidCollectionException

collection_name = 'articles-default-embeddings'

try:
    collection = client.get_collection(
        name=collection_name
    )

    print(f'loaded collection {collection_name}')
except InvalidCollectionException:
    print(f'creating collection {collection_name}')
    
    collection = client.create_collection(
        name=collection_name
    )

creating collection articles-default-embeddings


Add the documents to the collection. This step is needed when the collection is new or when there are new documents to add.

In [6]:
collection.add(
    documents=documents,
    ids=ids
)
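
The cell above passes every staged document to `collection.add`, whether or not the collection already holds it. A minimal sketch of how the "only new documents" case could be handled is shown here, using plain Python lists so it runs without a database. The helper `select_new` is hypothetical; in the notebook, the existing IDs could presumably be read with `collection.get()["ids"]`.

```python
def select_new(ids, documents, existing_ids):
    """Return the (ids, documents) pairs whose ID is not already stored."""
    existing = set(existing_ids)
    new_ids, new_docs = [], []
    for doc_id, doc in zip(ids, documents):
        if doc_id not in existing:
            new_ids.append(doc_id)
            new_docs.append(doc)
    return new_ids, new_docs


# Toy data standing in for the staged lists and for the IDs already stored.
toy_ids, toy_docs = select_new(
    ids=["a", "b", "c"],
    documents=["doc a", "doc b", "doc c"],
    existing_ids=["a"],
)
print(toy_ids)   # ['b', 'c']
print(toy_docs)  # ['doc b', 'doc c']
```

If nothing is new, it is probably simplest to skip the `collection.add` call altogether.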

In [7]:
results = collection.query(
    query_texts=["infectious diseases modeling"],
    n_results=10
)

results

{'ids': [['https://link.springer.com/article/10.1007/s11538-024-01326-9?error=cookies_not_supported&code=20758f1b-cc00-482f-91c4-2eb161967b2f',
   'https://www.cambridge.org/core/journals/infection-control-and-hospital-epidemiology/article/expanding-the-use-of-mathematical-modeling-in-healthcare-epidemiology-and-infection-prevention-and-control/2C400B94C466610D0F00D920313A34ED',
   'http://biorxiv.org/lookup/doi/10.1101/2024.06.29.601123',
   'https://linkinghub.elsevier.com/retrieve/pii/S1879625724000427',
   'https://doi.org/10.1098/rspb.2024.1296',
   'https://linkinghub.elsevier.com/retrieve/pii/S002555642400110X',
   'https://doi.org/10.1016/j.epidem.2024.100809',
   'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0294579',
   'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0290821',
   'https://doi.org/10.1002/sim.10050']],
 'embeddings': None,
 'documents': [["Fred Brauer was an eminent mathematician who studied dynamical systems, especially
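
The result is a dictionary of nested lists: the outer list has one entry per query text, and each inner list holds the matches for that query. The sketch below flattens one query's matches into rows that are easier to scan. It runs on a small hand-made dictionary shaped like the output above, not on real results. The helper names are hypothetical, and the `distances` key is an assumption, since the printed output is cut off before any such field would appear.

```python
def _column(results, key, query_index, n):
    """Return one query's values for key, or n Nones if the field is absent."""
    values = results.get(key)
    return values[query_index] if values is not None else [None] * n


def flatten_results(results, query_index=0):
    """Turn one query's nested result lists into a list of row dicts."""
    ids = results["ids"][query_index]
    docs = _column(results, "documents", query_index, len(ids))
    dists = _column(results, "distances", query_index, len(ids))
    return [
        {"id": i, "document": d, "distance": s}
        for i, d, s in zip(ids, docs, dists)
    ]


# Hand-made stand-in for a query result (one query text, two matches).
toy_results = {
    "ids": [["id-1", "id-2"]],
    "embeddings": None,
    "documents": [["first abstract", "second abstract"]],
    "distances": [[0.25, 0.5]],  # assumed field, see the note above
}

rows = flatten_results(toy_results)
for row in rows:
    print(row["distance"], row["id"])
```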

In [None]:
results = collection.query(
    query_texts=["agent-based models"],
    n_results=10
)

results

Create a new collection that uses cosine distance rather than squared L2 (the default). Ref: https://docs.trychroma.com/guides#changing-the-distance-function

In [8]:
collection_name = 'articles-default-embeddings-cosign-distance'

try:
    print(f'loading collection {collection_name}')
    
    collection = client.get_collection(
        name=collection_name
    )
except InvalidCollectionException:
    print(f'creating collection {collection_name}')
    
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )

loading collection articles-default-embeddings-cosign-distance
creating collection articles-default-embeddings-cosign-distance
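
To make the difference between the two distance settings concrete, here is a small pure-Python sketch. Squared L2 sums the squared coordinate differences, so it is sensitive to vector length, while cosine distance is one minus the cosine similarity, so it depends only on direction. The two-dimensional vectors are toy values chosen for illustration and are not real embeddings.

```python
import math


def squared_l2(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


short = [1.0, 0.0]
longer = [3.0, 0.0]  # same direction as short, three times the length
other = [0.0, 1.0]   # perpendicular to short

print(squared_l2(short, longer))       # 4.0
print(cosine_distance(short, longer))  # 0.0
print(squared_l2(short, other))        # 2.0
print(cosine_distance(short, other))   # 1.0
```

Under squared L2 the perpendicular vector looks closer to `short` than the longer parallel one does; under cosine distance the ranking flips. For unit-length vectors the two measures produce the same ranking.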


In [9]:
collection.add(
    documents=documents,
    ids=ids
)

In [10]:
results = collection.query(
    query_texts=["infectious diseases transmitted by mosquitoes and that affect children"],
    n_results=10
)

results

{'ids': [['https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0310480',
   'https://www.nature.com/articles/s41467-024-49080-9?error=cookies_not_supported&code=97f93c4a-08eb-43ca-a702-1d5b9904a90a',
   'https://parasitesandvectors.biomedcentral.com/articles/10.1186/s13071-024-06279-5',
   'https://doi.org/10.3390/pathogens13121105',
   'https://linkinghub.elsevier.com/retrieve/pii/S0025556424001792',
   'https://doi.org/10.1093/infdis/jiae609',
   'https://doi.org/10.1126/sciadv.adp1657',
   'https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0011908',
   'https://www.ebm-journal.org/journals/experimental-biology-and-medicine/articles/10.3389/ebm.2024.10114/full',
   'https://doi.org/10.1111/gcb.17610']],
 'embeddings': None,
 'documents': [['Aedes mosquito-borne viruses (ABVs) place a substantial strain on public health resources in the Americas. Vector control of Aedes mosquitoes is an important public health strategy to decrease or prevent spread of AB