ChromaDB Tutorial: https://www.youtube.com/watch?v=Qs_y0lTJAp0

In [1]:
import chromadb
from pprint import pprint

In [2]:
chroma_client = chromadb.Client()

In [3]:
#collection = chroma_client.create_collection(name="documents")
# Better: reuse the collection if it already exists, so re-running this cell does not fail
collection = chroma_client.get_or_create_collection(name="documents")

In [4]:
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)


In [5]:
results = collection.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=2 # how many results to return
)
pprint(results)


{'data': None,
 'distances': [[1.0404009819030762, 1.2430799007415771]],
 'documents': [['This is a document about pineapple',
                'This is a document about oranges']],
 'embeddings': None,
 'ids': [['id1', 'id2']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None, None]],
 'uris': None}


As we can see above, the pineapple sentence is closer to the query (smaller distance), perhaps because of Hawaiian pizza?
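
To make "smaller distance means closer" concrete, here is a minimal sketch of how results get ordered by distance. It assumes the collection uses squared Euclidean (L2) distance, which as far as we know is Chroma's default; the three-dimensional vectors are made up for illustration and are not real embeddings.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
query = [1.0, 0.0, 0.0]
docs = {
    "id1": [0.8, 0.2, 0.0],  # points in nearly the same direction as the query
    "id2": [0.0, 1.0, 0.0],  # points somewhere else entirely
}

# Sort the ids by distance to the query, smallest (closest) first.
ranked = sorted(docs, key=lambda doc_id: squared_l2(query, docs[doc_id]))
print(ranked)
```

The document whose vector lies nearest the query vector comes first, which is the same ordering we see in the 'distances' and 'ids' lists above.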

In [6]:
results = collection.query(
    query_texts=["This is a query document about florida"],
    n_results=1
)
pprint(results)

{'data': None,
 'distances': [[1.1462137699127197]],
 'documents': [['This is a document about oranges']],
 'embeddings': None,
 'ids': [['id2']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None]],
 'uris': None}
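
The result is a dict of nested lists: one inner list per query text. A small helper can flatten that into something easier to read. This is only a sketch; `top_hits` is a hypothetical name of our own, not part of chromadb, and the dict below is copied by hand from the output above.

```python
# The relevant keys from the query result printed above.
results = {
    "ids": [["id2"]],
    "documents": [["This is a document about oranges"]],
    "distances": [[1.1462137699127197]],
}

def top_hits(results, query_index=0):
    """Return (id, document, distance) tuples for one of the query texts."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))

hits = top_hits(results)
for doc_id, document, distance in hits:
    print(f"{doc_id}: {document!r} (distance {distance:.3f})")
```

With several query texts, `query_index` selects which query's hits to unpack.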


We can also include filters in the query.

Think of these as SQL-like filters that operate on the underlying collection before the similarity search is run.

For instance:

In [7]:
results = collection.query(
    query_texts=["This is a query document about hawaii"],
    n_results=2,
    where_document={"$contains": "pineapple"}
)
pprint(results)

{'data': None,
 'distances': [[1.0404009819030762]],
 'documents': [['This is a document about pineapple']],
 'embeddings': None,
 'ids': [['id1']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None]],
 'uris': None}


Note that this time we got back only one result, even though we asked for two.
That is because only one document contains the word pineapple.
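
The filter-then-search order can be sketched in plain Python. This is not how Chroma is implemented: `filtered_query` and `word_distance` are hypothetical helpers, and the word-overlap distance is only a toy stand-in for a real embedding distance.

```python
def word_distance(a, b):
    """Toy stand-in for an embedding distance: 1 minus the Jaccard overlap of words."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(words_a & words_b) / len(words_a | words_b)

def filtered_query(docs, query_text, n_results, contains):
    """Keep documents containing a substring, then rank the survivors by distance."""
    # Step 1: the filter narrows the candidate set (plain substring test).
    candidates = {doc_id: text for doc_id, text in docs.items() if contains in text}
    # Step 2: the similarity ranking only sees what survived the filter.
    ranked = sorted(candidates, key=lambda doc_id: word_distance(query_text, candidates[doc_id]))
    # Step 3: n_results is an upper bound, not a guarantee.
    return ranked[:n_results]

docs = {
    "id1": "This is a document about pineapple",
    "id2": "This is a document about oranges",
}
hits = filtered_query(docs, "This is a query document about hawaii", n_results=2, contains="pineapple")
print(hits)
```

Because the slice in step 3 can only return what step 1 left over, asking for two results still yields a single id here.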

Next, download a dataset. We are using https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles
Put it into an "archive" directory in the project root and extract it.
Now let's work with it using polars (don't forget to pip install polars if you haven't already):

In [8]:
import polars as pl

In [15]:
import os
print(os.getcwd()) # Make sure we can read the file
print(os.path.exists("./archive/Articles.csv"))

C:\Users\thoma\Documents\GitHub\ai\rag\chromadb
True



In [16]:
try:
    print("Reading csv...")
    articles = pl.read_csv("./archive/Articles.csv", encoding="ISO-8859-1").with_row_index(offset=1) # add an incrementing index column
    print("articles: ")
    print(articles)
except Exception as e:
    print(f"An error occurred: {e}")

Reading csv...
articles: 
shape: (2_692, 5)
┌───────┬─────────────────────────────────┬───────────┬─────────────────────────────────┬──────────┐
│ index ┆ Article                         ┆ Date      ┆ Heading                         ┆ NewsType │
│ ---   ┆ ---                             ┆ ---       ┆ ---                             ┆ ---      │
│ u32   ┆ str                             ┆ str       ┆ str                             ┆ str      │
╞═══════╪═════════════════════════════════╪═══════════╪═════════════════════════════════╪══════════╡
│ 1     ┆ KARACHI: The Sindh government … ┆ 1/1/2015  ┆ sindh govt decides to cut publ… ┆ business │
│ 2     ┆ HONG KONG: Asian markets start… ┆ 1/2/2015  ┆ asia stocks up in new year tra… ┆ business │
│ 3     ┆ HONG KONG:  Hong Kong shares o… ┆ 1/5/2015  ┆ hong kong stocks open 0.66 per… ┆ business │
│ 4     ┆ HONG KONG: Asian markets tumbl… ┆ 1/6/2015  ┆ asian stocks sink euro near ni… ┆ business │
│ 5     ┆ NEW YORK: US oil prices Monday… ┆ 1/6

In [14]:
print(articles.shape)  # this cell ran as In [14], before the index column was added in In [16], hence 4 columns

(2692, 4)
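
The index column from with_row_index(offset=1) is simply a 1-based row counter placed in front of the other columns. A rough standard-library equivalent is sketched below, assuming the same column names as Articles.csv; the two rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Made-up sample rows, encoded as ISO-8859-1 like the real file.
raw = (
    "Article,Date,Heading,NewsType\n"
    "First sample text,1/1/2015,first sample heading,business\n"
    "Café prices rise,1/2/2015,second sample heading,business\n"
).encode("ISO-8859-1")

# Decode with the matching encoding, then number the rows starting at 1.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="ISO-8859-1", newline="")
rows = [{"index": i, **row} for i, row in enumerate(csv.DictReader(text), start=1)]

for row in rows:
    print(row["index"], row["Heading"])
```

Reading with the wrong encoding is what usually garbles characters such as the é above, which is why the encoding is passed explicitly in the polars call as well.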


In [17]:
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env
openai_api_key = os.getenv("OPENAI_API_KEY")
print("OPENAI_API_KEY loaded:", openai_api_key is not None)  # never print the key itself; notebook outputs get saved and shared

OPENAI_API_KEY loaded: True
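
Anything printed in a notebook is stored in the saved output, so a secret should never be printed in full. If we want a visible confirmation that the right key was picked up, a masked version is enough. A minimal sketch; `mask_secret` is a hypothetical helper of our own, not part of python-dotenv or the OpenAI client.

```python
import os

def mask_secret(value, visible=4):
    """Return a display-safe form of a secret: stars plus the last few characters."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to reveal any part of it safely.
        return "*" * len(value)
    return "*" * 8 + value[-visible:]

# Prints "<not set>" when the variable is missing, otherwise the masked key.
print(mask_secret(os.getenv("OPENAI_API_KEY")))
```

If a real key has already ended up in a saved or committed notebook, it is safest to treat it as leaked and revoke it.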
