### ChromaDB Tutorial: https://www.youtube.com/watch?v=Qs_y0lTJAp0
Using the ChromaDB ephemeral client and a custom OpenAI embedding model.

In [1]:
import chromadb
from pprint import pprint

In [2]:
chroma_client = chromadb.Client()

In [3]:
#collection = chroma_client.create_collection(name="documents")
# Better, because it may already exist:
collection = chroma_client.get_or_create_collection(name="documents")

In [4]:
collection.add(
    documents=[
        "This is a document about pineapple",
        "This is a document about oranges"
    ],
    ids=["id1", "id2"]
)


In [5]:
results = collection.query(
    query_texts=["This is a query document about hawaii"], # Chroma will embed this for you
    n_results=2 # how many results to return
)
pprint(results)


{'data': None,
 'distances': [[1.0404009819030762, 1.2430799007415771]],
 'documents': [['This is a document about pineapple',
                'This is a document about oranges']],
 'embeddings': None,
 'ids': [['id1', 'id2']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None, None]],
 'uris': None}


##### As we can see above, the pineapple sentence is closer to the Hawaii query (smaller distance), perhaps because of pizza Hawaii?
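The result dictionary nests every field one level deep: one inner list per query text. A minimal sketch of pairing each hit with its distance, using values copied from the output above. The helper name `hits_for_query` is made up for illustration and is not part of chromadb.

```python
# Values copied from the pprint output above (only the keys we need).
results = {
    "ids": [["id1", "id2"]],
    "documents": [[
        "This is a document about pineapple",
        "This is a document about oranges",
    ]],
    "distances": [[1.0404009819030762, 1.2430799007415771]],
}


def hits_for_query(results, query_index=0):
    """Hypothetical helper: flatten one query's hits into (id, document, distance) tuples."""
    return list(zip(
        results["ids"][query_index],
        results["documents"][query_index],
        results["distances"][query_index],
    ))


hits = hits_for_query(results)
for doc_id, document, distance in hits:
    print(f"{doc_id}  {distance:.4f}  {document}")
```

Since we passed a single query text, only index 0 exists; a query with several texts would yield one inner list per text.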

In [6]:
results = collection.query(
    query_texts=["This is a query document about florida"],
    n_results=1
)
pprint(results)

{'data': None,
 'distances': [[1.1462137699127197]],
 'documents': [['This is a document about oranges']],
 'embeddings': None,
 'ids': [['id2']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None]],
 'uris': None}


##### We can also include filters in the query.

Think of these as **SQL-like** filters that operate on the underlying collection before the similarity search is conducted.

For instance:

In [7]:
results = collection.query(
    query_texts=["This is a query document about hawaii"],
    n_results=2,
    where_document={"$contains": "pineapple"}
)
pprint(results)

{'data': None,
 'distances': [[1.0404009819030762]],
 'documents': [['This is a document about pineapple']],
 'embeddings': None,
 'ids': [['id1']],
 'included': [<IncludeEnum.distances: 'distances'>,
              <IncludeEnum.documents: 'documents'>,
              <IncludeEnum.metadatas: 'metadatas'>],
 'metadatas': [[None]],
 'uris': None}


**Note** how this time we only got back one result, although we asked for two (`n_results=2`).
This is because only one document contains the word "pineapple".
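The filter-then-rank behaviour described above can be imitated in plain Python. This is only a toy sketch, not Chroma's implementation: the 2-d vectors are made up, `toy_query` is a hypothetical function, and squared Euclidean distance is used as a stand-in for whatever distance the collection is configured with.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy collection: id -> (document text, made-up 2-d embedding).
toy_collection = {
    "id1": ("This is a document about pineapple", [1.0, 0.0]),
    "id2": ("This is a document about oranges", [0.0, 1.0]),
}


def toy_query(query_vector, n_results, contains=None):
    """Filter by substring first, then rank the survivors by distance."""
    candidates = [
        (doc_id, vector)
        for doc_id, (text, vector) in toy_collection.items()
        if contains is None or contains in text
    ]
    ranked = sorted(candidates, key=lambda item: squared_l2(query_vector, item[1]))
    return [doc_id for doc_id, _ in ranked[:n_results]]


query_vector = [0.9, 0.1]  # made-up embedding that sits near "pineapple"
print(toy_query(query_vector, n_results=2))                        # both documents
print(toy_query(query_vector, n_results=2, contains="pineapple"))  # only id1 survives the filter
```

Because the filter runs first, `n_results` acts as an upper bound: it cannot bring back documents that the filter has already removed.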

Next, download a dataset. We are using https://www.kaggle.com/datasets/asad1m9a9h6mood/news-articles
Put it into an "archive" directory in the project root and extract it.
Now we work with it using polars (don't forget to `pip install polars` if you haven't already):

In [8]:
import polars as pl

In [9]:
import os
print(os.getcwd()) # Make sure we can read the file
print(os.path.exists("./archive/Articles.csv"))

C:\Users\thoma\Documents\GitHub\ai\rag\chromadb
True


In [10]:
try:
    print("Reading csv...")
    articles = pl.read_csv("./archive/Articles.csv", encoding="ISO-8859-1").with_row_index(offset=1) # add an incrementing index column
    print("articles: ")
    print(articles)
except Exception as e:
    print(f"An error occurred: {e}")

Reading csv...
articles: 
shape: (2_692, 5)
┌───────┬─────────────────────────────────┬───────────┬─────────────────────────────────┬──────────┐
│ index ┆ Article                         ┆ Date      ┆ Heading                         ┆ NewsType │
│ ---   ┆ ---                             ┆ ---       ┆ ---                             ┆ ---      │
│ u32   ┆ str                             ┆ str       ┆ str                             ┆ str      │
╞═══════╪═════════════════════════════════╪═══════════╪═════════════════════════════════╪══════════╡
│ 1     ┆ KARACHI: The Sindh government … ┆ 1/1/2015  ┆ sindh govt decides to cut publ… ┆ business │
│ 2     ┆ HONG KONG: Asian markets start… ┆ 1/2/2015  ┆ asia stocks up in new year tra… ┆ business │
│ 3     ┆ HONG KONG:  Hong Kong shares o… ┆ 1/5/2015  ┆ hong kong stocks open 0.66 per… ┆ business │
│ 4     ┆ HONG KONG: Asian markets tumbl… ┆ 1/6/2015  ┆ asian stocks sink euro near ni… ┆ business │
│ 5     ┆ NEW YORK: US oil prices Monday… ┆ 1/6

In [11]:
print(articles.shape)

(2692, 5)


In [12]:
from dotenv import load_dotenv

load_dotenv()  # Load environment variables from .env
openai_api_key = os.getenv("OPENAI_API_KEY")
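`os.getenv` returns `None` when the variable is missing, and the problem would then only surface later, when the embedding function is created or first called. A small guard can make that failure explicit. This is a sketch; `require_env` is a hypothetical helper, not part of python-dotenv or chromadb.

```python
import os


def require_env(name, environ=os.environ):
    """Hypothetical helper: return the variable's value or fail with a clear message."""
    value = environ.get(name)
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to .env or export it before running the notebook"
        )
    return value


# Usage in the notebook, after load_dotenv():
#   openai_api_key = require_env("OPENAI_API_KEY")
```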

In [13]:
# We're using a custom OpenAI embedding model here:
import chromadb.utils.embedding_functions as embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key=openai_api_key,
                model_name="text-embedding-3-small"
            )


In [14]:
# It is not necessary to call the OpenAI API 2692 times for this demo project
articles = articles[:50]
print(articles.shape)

(50, 5)


In [15]:
articles["Article"][0][:100]

'KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to '

In [16]:
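# Embed the first 30 characters of the first article.
# Caution: a bare string (not a list) is passed, so the output appears to hold one vector per character
# (vectors 2 and 4 are identical, matching the two "A"s in "KARACHI"); pass [text] to get one embedding.
openai_ef(articles["Article"][0][:30])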

[array([ 0.00988591,  0.02568489,  0.00104518, ..., -0.02508434,
        -0.01958704,  0.00807657], dtype=float32),
 array([ 0.05952451,  0.01269055, -0.00069948, ..., -0.02103287,
        -0.01620701,  0.00878703], dtype=float32),
 array([ 0.00331469,  0.04346333, -0.01833384, ..., -0.03085521,
         0.0225992 ,  0.02097981], dtype=float32),
 array([ 0.05952451,  0.01269055, -0.00069948, ..., -0.02103287,
        -0.01620701,  0.00878703], dtype=float32),
 array([ 0.06287624,  0.04790863,  0.00675536, ..., -0.04803132,
        -0.00498026,  0.01923093], dtype=float32),
 array([ 0.02750912,  0.0018673 ,  0.01547583, ..., -0.01396485,
        -0.02123156,  0.00511318], dtype=float32),
 array([ 0.02756455,  0.01539813, -0.00891095, ..., -0.04537062,
        -0.01197632,  0.03469332], dtype=float32),
 array([ 0.03073194, -0.00257686, -0.00697588, ..., -0.0302968 ,
        -0.02063301,  0.01929132], dtype=float32),
 array([-0.01382395, -0.00268456,  0.01436963, ..., -0.01435664,
       

### Test

In [17]:
# Create articles collection
collection = chroma_client.get_or_create_collection(name="articles")

# Add first article
vectors = openai_ef([articles["Article"][0]]) # Note we always pass a list of texts (hence the [...])
print(len(vectors))
print(len(vectors[0])) # Embedding dimensionality

1
1536
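Why the list matters: a function that embeds "one vector per input item" will iterate over whatever it receives, and a Python string iterates character by character. The toy function below only illustrates that pitfall. It is not Chroma's code, and a given Chroma version may handle a bare string differently.

```python
def toy_embed(texts):
    """Stand-in for an embedding function: returns one 'vector' per item in the input."""
    # Iterating over a str yields single characters, so a bare string
    # is treated as many one-character documents.
    return [[float(len(item))] for item in texts]


text = "KARACHI"
per_character = toy_embed(text)    # bare string: one item per character
per_document = toy_embed([text])   # list with one string: one item
print(len(per_character), len(per_document))  # prints: 7 1
```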


In [18]:
# Add to chroma. Also add the OpenAI embedding.
# If we didn't pass the embeddings explicitly, chroma would use the default embedding function.
#  (But we have to remember to use the same embedding model for query texts, see cell below this one where we create the embedding)
collection.add(
    documents=[
        articles["Article"][0]
    ],
    ids=["id1"],
    embeddings=vectors
)

In [19]:
query = "public transport fares by 7 per cent"

# Create an embedding out of the query first. 
# We would obviously not need this if we used the default embedding function.
query_embeddings = openai_ef([query]) 

# Perform query
collection.query(
    query_embeddings=query_embeddings,
    n_results=1
)

{'ids': [['id1']],
 'embeddings': None,
 'documents': [['KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \n\n\n\n\n\n\n\n\n\n\n']],
 'uris': None,
 'data': None,
 'metadatas': [[None]],
 'distances': [[0.9042726755142212]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: '

##### Now that the test succeeded, embed all 50 articles

In [20]:
# First let's clear the collection
chroma_client.delete_collection(name="articles")

In [21]:
# Create articles collection again
collection = chroma_client.get_or_create_collection(name="articles")

# Add 50 articles
vectors=openai_ef(articles["Article"][:50].to_list()) # Note here since we're adding a list, no [] is needed
print(len(vectors))
print(len(vectors[0])) # Embedding dimensionality

50
1536


##### Create IDs

In [22]:
indexes = [f"id{x}" for x in articles["index"]]
print(indexes[:5])

['id1', 'id2', 'id3', 'id4', 'id5']


##### Add to collection

In [23]:
print(articles.shape) # Make sure articles is of expected size

(50, 5)


In [24]:
collection.add(
    documents=articles["Article"].to_list(), # We want just the articles, not the whole articles DATASET!
    ids=indexes,
    embeddings=vectors
)
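Since `ids`, `documents` and `embeddings` are three parallel lists, it can be worth checking that they line up before adding them. A minimal sketch with toy data; `check_batch` is a hypothetical pre-flight helper and not part of chromadb, which performs its own validation.

```python
def check_batch(ids, documents, embeddings):
    """Hypothetical pre-flight check for three parallel lists; returns the batch size."""
    if not (len(ids) == len(documents) == len(embeddings)):
        raise ValueError(
            f"length mismatch: {len(ids)} ids, {len(documents)} documents, "
            f"{len(embeddings)} embeddings"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    dimensions = {len(vector) for vector in embeddings}
    if len(dimensions) > 1:
        raise ValueError(f"embeddings have mixed dimensionality: {sorted(dimensions)}")
    return len(ids)


# Toy data standing in for indexes, the article texts, and vectors.
toy_ids = ["id1", "id2"]
toy_documents = ["first article", "second article"]
toy_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
print(check_batch(toy_ids, toy_documents, toy_vectors))  # prints: 2
```

In the notebook this would be called as `check_batch(indexes, articles["Article"].to_list(), vectors)` right before `collection.add(...)`.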

In [25]:
collection.count() # Should return 50.

50

In [26]:
# We should get the same article (Or a similar one)
collection.query(
    query_embeddings=query_embeddings,
    n_results=1
)

{'ids': [['id1']],
 'embeddings': None,
 'documents': [['KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \n\n\n\n\n\n\n\n\n\n\n']],
 'uris': None,
 'data': None,
 'metadatas': [[None]],
 'distances': [[0.9040755033493042]],
 'included': [<IncludeEnum.distances: 'distances'>,
  <IncludeEnum.documents: 'documents'>,
  <IncludeEnum.metadatas: '

##### Now let's return more results

In [27]:
collection.query(
    query_embeddings=query_embeddings,
    n_results=2
)

{'ids': [['id1', 'id29']],
 'embeddings': None,
 'documents': [['KARACHI: The Sindh government has decided to bring down public transport fares by 7 per cent due to massive reduction in petroleum product prices by the federal government, Geo News reported.Sources said reduction in fares will be applicable on public transport, rickshaw, taxi and other means of traveling.Meanwhile, Karachi Transport Ittehad (KTI) has refused to abide by the government decision.KTI President Irshad Bukhari said the commuters are charged the lowest fares in Karachi as compare to other parts of the country, adding that 80pc vehicles run on Compressed Natural Gas (CNG). Bukhari said Karachi transporters will cut fares when decrease in CNG prices will be made.                        \n\n\n\n\n\n\n\n\n\n\n',
   'ISLAMABAD:  Federal Minister for Finance Ishaq Dar on Saturday announced a five percent increase in the General Sales Tax (GST) on petroleum products.Dar said that the increment would enable a recovery

**Note** the second article's similarity to the query: "...five percent increase in the General Sales Tax..."

### Retrieval
Now let's do a retrieval by asking an actual question.

Note that this will just be similarity search by chroma again, no inference or anything like that.

In [28]:
query = "What's going on with petrol prices?"
query_embeddings = openai_ef([query]) 

# Perform query
collection.query(
    query_embeddings=query_embeddings,
    n_results=3
)

{'ids': [['id50', 'id39', 'id24']],
 'embeddings': None,
 'documents': [['London: World oil prices rose Friday, extending solid gains in the previous session in response to reports that leading petroleum producers are curtailing investment.US benchmark West Texas Intermediate (WTI) for March delivery added 75 cents to $51.96 a barrel compared with Thursday´s close.Brent North Sea crude for April leapt $1.02 to $60.30 a barrel in London afternoon trade -- the first time this year above $60.Crude futures had rallied by almost $2.50 on Thursday on reports of falling investment in the oil sector.Daniel Ang, investment analyst at Phillip Futures in Singapore, said dealers are wary about the impact of such cuts in the immediate term. "We will continue to see range-bound trading in the immediate term with little change to fundamentals," Ang told AFP. "The main thing is crude production. As long as production levels don´t go down for now, it is hard to see a reversal of low oil prices," he sai