In [None]:
%pip install --upgrade --quiet chromadb

In [None]:
%pip install --upgrade --quiet sentence_transformers

Load the articles and drop those without an abstract, since the embeddings are generated from the abstracts.

In [None]:
from tinydb import TinyDB, Query

db = TinyDB('db.json')
table = db.table('articles')

articles = table.all()
print(f'loaded {len(articles)} articles')

articles = [x for x in articles if x['abstract'] != 'No abstract available.']
print(f'retaining {len(articles)} articles')

Stage the articles so that they can easily be loaded into the vector database. Remove duplicates.

In [None]:
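documents = []
ids = []
seen = set()  # IDs already staged; set membership checks are O(1)

for article in articles:
    doc_id = article['link']  # the article link serves as the unique ID
    if doc_id not in seen:
        seen.add(doc_id)
        documents.append(article['abstract'])
        ids.append(doc_id)

print(f'staged {len(ids)} unique articles')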

To find semantically related documents, we'll use Chroma (https://www.trychroma.com/), a lightweight vector data store. Chroma supports swappable embedding models, metadata filtering, keyword search, and multiple distance measures. We'll use these features to evaluate approaches to organizing papers for downstream processing (search, summarization, keyword extraction, etc.).

In [None]:
import chromadb

client = chromadb.PersistentClient(path="vectors_db")

In [None]:
from chromadb.errors import InvalidCollectionException

collection_name = 'articles-default-embeddings'

try:
    collection = client.get_collection(
        name=collection_name
    )

    print(f'loaded collection {collection_name}')
except InvalidCollectionException:
    print(f'creating collection {collection_name}')
    
    collection = client.create_collection(
        name=collection_name
    )

Add the documents to the collection. This step only needs to run when the collection is new or there are new documents to add.

In [None]:
collection.add(
    documents=documents,
    ids=ids
)
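
The cell above sends every staged document on each run. One way to add only the documents that are not yet stored is sketched below. The helper name `add_missing` and the `batch_size` of 500 are assumptions for illustration and are not part of the original notebook. The sketch relies only on the collection's `get` and `add` methods, and assumes that `include=[]` returns IDs alone.

```python
def add_missing(collection, documents, ids, batch_size=500):
    """Add only the documents whose IDs are not yet in the collection.

    `add_missing` and `batch_size` are hypothetical. `collection` is any
    object with Chroma-style `get(ids=..., include=...)` and `add(...)`
    methods. Returns the number of documents added.
    """
    added = 0
    for start in range(0, len(ids), batch_size):
        batch_ids = ids[start:start + batch_size]
        batch_docs = documents[start:start + batch_size]

        # Look up which IDs in this batch already exist
        # (include=[] is assumed to skip documents and embeddings).
        existing = set(collection.get(ids=batch_ids, include=[])['ids'])

        new_pairs = [
            (doc_id, doc)
            for doc_id, doc in zip(batch_ids, batch_docs)
            if doc_id not in existing
        ]
        if new_pairs:
            new_ids, new_docs = zip(*new_pairs)
            collection.add(documents=list(new_docs), ids=list(new_ids))
            added += len(new_ids)
    return added
```

Calling `add_missing(collection, documents, ids)` in place of the plain `collection.add(...)` call would make the cell safe to re-run, and the batching keeps each request small.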

In [None]:
results = collection.query(
    query_texts=["infectious diseases transmitted by mosquitoes and that affect children"],
    n_results=10
)

results
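
The raw `results` dictionary is nested: the `ids`, `distances`, and `documents` entries each hold one inner list per query text. A small sketch for flattening one query's hits into readable rows follows. The helper name `top_hits` and its parameters are hypothetical and are not part of the original notebook.

```python
def top_hits(results, query_index=0, snippet_chars=80):
    """Flatten one query's results into (id, distance, snippet) tuples.

    `top_hits` is a hypothetical helper. It assumes the Chroma query
    result layout, where 'ids', 'distances', and 'documents' each hold
    one inner list per query text.
    """
    ids = results['ids'][query_index]
    distances = results['distances'][query_index]
    docs = results['documents'][query_index]
    return [
        (doc_id, round(dist, 4), doc[:snippet_chars])
        for doc_id, dist, doc in zip(ids, distances, docs)
    ]
```

Looping over `top_hits(results)` and printing each tuple gives a compact view of the matches. Smaller distances indicate closer matches.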

Create a new collection using cosine distance rather than squared L2 (the default). Ref: https://docs.trychroma.com/guides#changing-the-distance-function
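
To illustrate how the two measures differ, here is a small pure-Python sketch. The function names and example vectors are made up for illustration. It assumes the usual definitions: squared L2 is the sum of squared coordinate differences, and cosine distance is one minus cosine similarity.

```python
import math

def squared_l2(a, b):
    """Sum of squared coordinate differences; sensitive to vector length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus cosine similarity; depends only on the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

a = [3.0, 4.0]
b = [30.0, 40.0]  # same direction as a, ten times longer
c = [4.0, 3.0]    # same length as a, different direction

# a and b point the same way: cosine distance is zero, squared L2 is large.
same_direction = (squared_l2(a, b), cosine_distance(a, b))

# For unit-length vectors, squared L2 equals twice the cosine distance,
# so both measures rank neighbors in the same order.
unit_a, unit_c = normalize(a), normalize(c)
unit_pair = (squared_l2(unit_a, unit_c), cosine_distance(unit_a, unit_c))
```

If the embeddings are already unit length, switching the distance function changes the reported distances but not the ranking. If they are not, the rankings can differ.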

In [None]:
collection_name = 'articles-default-embeddings-cosign-distance'

try:
    print(f'loading collection {collection_name}')
    
    collection = client.get_collection(
        name=collection_name
    )
except InvalidCollectionException:
    print(f'creating collection {collection_name}')
    
    collection = client.create_collection(
        name=collection_name,
        metadata={"hnsw:space": "cosine"}
    )

In [None]:
collection.add(
    documents=documents,
    ids=ids
)

In [None]:
results = collection.query(
    query_texts=["infectious diseases transmitted by mosquitoes and that affect children"],
    n_results=10
)

results