# 03.1 Embedding with Chroma

To find semantically related documents, we'll use Chroma (https://www.trychroma.com/), a lightweight vector data store. Chroma supports swappable embedding models, metadata filtering, keyword search, and multiple distance measures. We'll use these features to evaluate approaches to organizing papers for downstream processing (search, summarization, keyword extraction, etc.).

In [7]:
%pip install --upgrade --quiet chromadb

Note: you may need to restart the kernel to use updated packages.


In [8]:
%pip install --upgrade --quiet sentence_transformers

Note: you may need to restart the kernel to use updated packages.


Load the articles and drop those without abstracts, since the embeddings are generated from the abstracts.

In [9]:
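from tinydb import TinyDB

# Open the local TinyDB file and read every record from the 'articles' table.
db = TinyDB('db.json')
table = db.table('articles')

articles = table.all()
print(f'loaded {len(articles)} articles')

# Drop records whose abstract is the placeholder text, since there is nothing to embed.
articles = [x for x in articles if x['abstract'] != 'No abstract available.']
print(f'retaining {len(articles)} articles')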

loaded 1398 articles
retaining 1320 articles


Stage the articles for loading into the vector database, using each article's link as its ID and skipping duplicate links.

In [10]:
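documents = []
ids = []
seen = set()  # links already staged; set lookups avoid rescanning the ids list

for article in articles:
    doc_id = article['link']  # the article link doubles as the document ID
    if doc_id not in seen:
        seen.add(doc_id)
        documents.append(article['abstract'])
        ids.append(doc_id)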

print(f'loaded {len(ids)} articles')

loaded 1318 articles


In [11]:
import chromadb

client = chromadb.PersistentClient(path="vectors_db")

In [12]:
from chromadb.errors import InvalidCollectionException

collection_name = 'articles-default-embeddings'

try:
    collection = client.get_collection(
        name=collection_name
    )

    print(f'loaded collection {collection_name}')
except InvalidCollectionException:
    print(f'creating collection {collection_name}')
    
    collection = client.create_collection(
        name=collection_name
    )

creating collection articles-default-embeddings


Add the staged documents to the collection. This step is needed only when the collection is new or there are new documents to add.

In [13]:
collection.add(
    documents=documents,
    ids=ids
)
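The "only new documents" condition can be made explicit by filtering out IDs the collection already holds before calling `collection.add`. A minimal sketch, assuming a hypothetical helper `select_new` (not part of Chroma) and toy data in place of the staged lists; in the notebook, the stored IDs could presumably be fetched with `collection.get(include=[])['ids']`:

```python
def select_new(ids, documents, existing_ids):
    """Return the ids and documents whose ID is not in existing_ids.

    Hypothetical helper for illustration; not part of the Chroma API.
    """
    existing = set(existing_ids)
    new_ids, new_documents = [], []
    for doc_id, document in zip(ids, documents):
        if doc_id not in existing:
            new_ids.append(doc_id)
            new_documents.append(document)
    return new_ids, new_documents


# Toy data standing in for the staged `ids` and `documents` lists.
staged_ids = ['a', 'b', 'c']
staged_documents = ['abstract A', 'abstract B', 'abstract C']
already_stored = ['a', 'c']  # assumed to come from the collection in real use

new_ids, new_documents = select_new(staged_ids, staged_documents, already_stored)
print(new_ids, new_documents)
```

The filtered lists would then be passed to `collection.add(documents=new_documents, ids=new_ids)`, and the call skipped entirely when `new_ids` is empty.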

In [15]:
results = collection.query(
    query_texts=["infectious diseases transmitted by mosquitoes that affect children"],
    n_results=10
)

results

{'ids': [['https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0310480',
   'https://www.nature.com/articles/s41467-024-49080-9?error=cookies_not_supported&code=97f93c4a-08eb-43ca-a702-1d5b9904a90a',
   'https://parasitesandvectors.biomedcentral.com/articles/10.1186/s13071-024-06279-5',
   'https://linkinghub.elsevier.com/retrieve/pii/S2542519623002528',
   'https://journals.plos.org/plosntds/article?id=10.1371/journal.pntd.0011908',
   'https://linkinghub.elsevier.com/retrieve/pii/S0025556424001792',
   'https://doi.org/10.1126/sciadv.adp1657',
   'https://www.ebm-journal.org/journals/experimental-biology-and-medicine/articles/10.3389/ebm.2024.10114/full',
   'https://academic.oup.com/jid/advance-article/doi/10.1093/infdis/jiae335/7701446',
   'https://doi.org/10.1371/journal.pntd.0012616']],
 'embeddings': None,
 'documents': [['Aedes mosquito-borne viruses (ABVs) place a substantial strain on public health resources in the Americas. Vector control of Aedes mosquitoes i
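The raw result dictionary is hard to scan. Chroma returns one inner list per query text, so the hits for the single query here sit at index 0. A small sketch, assuming a hypothetical helper `top_hits` and a made-up result dictionary with the same shape as the output above (plus the `distances` field), that pairs each ID with its distance and a short snippet:

```python
def top_hits(results, query_index=0, width=60):
    """Flatten one query's hits into (id, distance, snippet) tuples.

    Hypothetical helper; expects the dict shape returned by collection.query.
    """
    ids = results['ids'][query_index]
    documents = results['documents'][query_index]
    # 'distances' may be missing or None if it was excluded from the query.
    all_distances = results.get('distances')
    distances = all_distances[query_index] if all_distances else [None] * len(ids)
    return [
        (doc_id, distance, document[:width])
        for doc_id, distance, document in zip(ids, distances, documents)
    ]


# Made-up values for illustration only; not real query output.
example = {
    'ids': [['id-1', 'id-2']],
    'documents': [['First abstract text.', 'Second abstract text.']],
    'distances': [[0.42, 0.57]],
}

for doc_id, distance, snippet in top_hits(example, width=14):
    print(doc_id, distance, snippet)
```

In the notebook, `top_hits(results)` would be called on the dictionary returned by `collection.query`.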

In [17]:
results = collection.query(
    query_texts=["infectious diseases modeling"],
    n_results=10
)

results

{'ids': [['https://link.springer.com/article/10.1007/s11538-024-01326-9?error=cookies_not_supported&code=20758f1b-cc00-482f-91c4-2eb161967b2f',
   'https://www.cambridge.org/core/journals/infection-control-and-hospital-epidemiology/article/expanding-the-use-of-mathematical-modeling-in-healthcare-epidemiology-and-infection-prevention-and-control/2C400B94C466610D0F00D920313A34ED',
   'http://biorxiv.org/lookup/doi/10.1101/2024.06.29.601123',
   'https://linkinghub.elsevier.com/retrieve/pii/S1879625724000427',
   'https://linkinghub.elsevier.com/retrieve/pii/S002555642400110X',
   'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0294579',
   'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0290821',
   'https://doi.org/10.1002/sim.10050',
   'http://medrxiv.org/lookup/doi/10.1101/2024.10.11.24314870',
   'https://linkinghub.elsevier.com/retrieve/pii/S0025556424001251']],
 'embeddings': None,
 'documents': [["Fred Brauer was an eminent mathematician who 

In [19]:
results = collection.query(
    query_texts=["agent-based models"],
    n_results=10
)

results

{'ids': [['https://www.ncbi.nlm.nih.gov/pubmed/?term=38827450',
   'https://linkinghub.elsevier.com/retrieve/pii/S1755436524000409',
   'https://linkinghub.elsevier.com/retrieve/pii/S1755436524000355',
   'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0297247',
   'https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0290821',
   'https://www.canada.ca/content/dam/phac-aspc/documents/services/reports-publications/canada-communicable-disease-report-ccdr/monthly-issue/2024-50/issue-10-october-2024/ccdrv50i10a03-eng.pdf',
   'https://doi.org/10.1098/rsif.2024.0217',
   'https://linkinghub.elsevier.com/retrieve/pii/S2772653324000157',
   'https://linkinghub.elsevier.com/retrieve/pii/S2468042724000058',
   'https://linkinghub.elsevier.com/retrieve/pii/S1755436524000136']],
 'embeddings': None,
 'documents': [["The vision of personalized medicine is to identify interventions that maintain or restore a person's health based on their individual biology. Medical

Create a new collection that uses cosine distance rather than squared L2 (the default). Ref: https://docs.trychroma.com/guides#changing-the-distance-function

In [None]:
collection_name = 'articles-default-embeddings-cosign-distance'

try:
    collection = client.get_collection(
        name=collection_name
    )

    print(f'loaded collection {collection_name}')
except InvalidCollectionException:
    print(f'creating collection {collection_name}')
    
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )
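To make the difference between the two measures concrete, here is a small pure-Python sketch. It assumes the usual definitions: squared L2 is the sum of squared coordinate differences, and cosine distance is one minus the cosine similarity. The vectors are made-up toy values, not real embeddings.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity; insensitive to vector length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
same_direction_longer = [3.0, 0.0]  # same direction as the query, larger magnitude
different_direction = [0.6, 0.8]    # unit length, but rotated away from the query

for name, vec in [('same direction', same_direction_longer),
                  ('different direction', different_direction)]:
    print(name, squared_l2(query, vec), cosine_distance(query, vec))
```

Squared L2 ranks the rotated vector closer to the query, while cosine distance ranks the longer same-direction vector closer. If the embedding model emits unit-length vectors, the two measures produce the same ranking, because squared L2 then equals twice the cosine distance.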

In [None]:
collection.add(
    documents=documents,
    ids=ids
)

In [None]:
results = collection.query(
    query_texts=["infectious diseases transmitted by mosquitoes that affect children"],
    n_results=10
)

results
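One simple way to compare the two collections is to measure how much their top-10 ID lists overlap for the same query. A minimal sketch, assuming a hypothetical helper `jaccard_overlap` and made-up ID lists standing in for `results['ids'][0]` from each collection:

```python
def jaccard_overlap(ids_a, ids_b):
    """Share of IDs common to both result lists (0.0 = disjoint, 1.0 = same set)."""
    set_a, set_b = set(ids_a), set(ids_b)
    if not set_a and not set_b:
        return 1.0  # two empty result lists are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up IDs for illustration only; not real query output.
l2_ids = ['doc-1', 'doc-2', 'doc-3', 'doc-4']
cosine_ids = ['doc-2', 'doc-1', 'doc-5', 'doc-3']

print(jaccard_overlap(l2_ids, cosine_ids))
```

This measure ignores rank order, so two lists with the same documents in a different order still score 1.0. A rank-aware comparison would need a different metric.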