--------------------------
#### ChromaDB

- Add TF-IDF vectors to ChromaDB
- Query the database
----------------------------

In [1]:
import pandas as pd

Use the **IMDb** movie review dataset

In [2]:
# Load the IMDb dataset
file_path = r'D:\AI-DATASETS\02-MISC-large\IMDB Dataset.csv'
df = pd.read_csv(file_path)

In [3]:
df.shape

(50000, 2)

In [4]:
# Take a random sample of 1000 reviews (sample() is random, so the selection differs between runs)
reviews = df['review'].sample(1000).tolist()

#### Generate TF-IDF Vectors
- Use scikit-learn's TfidfVectorizer to generate TF-IDF vectors for the movie reviews.
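
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF on a toy corpus. It assumes whitespace tokenization (a simplification; `TfidfVectorizer` uses its own token pattern) and follows the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` with L2 normalisation, which is scikit-learn's default configuration. The `tfidf` function and the toy documents are hypothetical and not part of the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, vectors) using smoothed idf and L2 normalisation."""
    # Simplified tokenization: lowercase and split on whitespace
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        # Scale each row to unit length (guard against an all-zero row)
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0
        vectors.append([x / norm for x in raw])
    return vocab, vectors

toy_docs = ["great space movie", "boring movie", "great acting"]
vocab, vectors = tfidf(toy_docs)
```

In the first toy document, "space" appears in fewer documents than "movie", so it receives the larger weight.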

In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [6]:
# Initialize the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=1000)  # Limiting to 1000 features for efficiency

In [7]:
# Generate TF-IDF vectors for the reviews
tfidf_matrix = vectorizer.fit_transform(reviews)

In [8]:
import chromadb

from chromadb.config import Settings

In [9]:
# Initialize an in-memory ChromaDB client (nothing is persisted to disk)
client = chromadb.Client(Settings(allow_reset = True))

In [10]:
# List all collections
collections = client.list_collections()
print([collection.name for collection in collections])

[]


In [11]:
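# Create (or fetch, if it already exists) the collection that stores the TF-IDF vectors.
# Note: despite its name, collection_name holds the Collection object, not a string.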
collection_name = client.get_or_create_collection("imdb_reviews")

In [12]:
# Convert the sparse TF-IDF matrix to a dense NumPy array (one row per review)
tfidf_dense = tfidf_matrix.toarray()

In [13]:
# Add each review vector into ChromaDB
for idx, vector in enumerate(tfidf_dense):
    collection_name.add(
        ids       =[str(idx)],                  # Unique ID for each review
        embeddings=[vector],                    # The TF-IDF vector
        metadatas =[{"review": reviews[idx]}],  # Store the actual review
    )

- ids: Unique string identifier for each review (here, its index in the sample).
- embeddings: The TF-IDF vector for the review.
- metadatas: Extra fields stored alongside the vector, here the review text, which is returned with query results.
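
Because `add` takes parallel lists for these three arguments, the same data could also be sent in batches rather than one review per call. The sketch below only builds the batched arguments; `batched_payloads` is a hypothetical helper, the batch size is arbitrary, and the toy lists stand in for `tfidf_dense` and `reviews`.

```python
def batched_payloads(vectors, texts, batch_size=100):
    """Yield one dict of keyword arguments per batch (hypothetical helper)."""
    for start in range(0, len(texts), batch_size):
        stop = min(start + batch_size, len(texts))
        yield {
            # IDs are the stringified positions, as in the loop above
            "ids": [str(i) for i in range(start, stop)],
            # Convert each vector to a plain list of floats
            "embeddings": [list(map(float, vectors[i])) for i in range(start, stop)],
            # One metadata dict per item, holding the review text
            "metadatas": [{"review": texts[i]} for i in range(start, stop)],
        }

# Toy data standing in for tfidf_dense and reviews
toy_vectors = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]
toy_texts = ["first review", "second review", "third review"]
payloads = list(batched_payloads(toy_vectors, toy_texts, batch_size=2))
```

With a real collection, each payload could then be passed along as `collection_name.add(**payload)`.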

In [16]:
tfidf_dense

array([[0.06330038, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.08488014,
        0.        ],
       [0.08795382, 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

#### Querying ChromaDB with TF-IDF
- Query ChromaDB for the stored reviews closest to a query, using the query's TF-IDF vector.

In [14]:
def query_chromadb(query_text, top_k=5):
    # Convert the query to a TF-IDF vector
    query_vector = vectorizer.transform([query_text]).toarray()[0]
    
    # Perform similarity search in ChromaDB
    results = collection_name.query(
        query_embeddings=[query_vector],  # The query vector
        n_results       =top_k            # Number of results to return
    )
    
    return results

In [17]:
# Example query
query_text = "I love movies about space adventures"
result = query_chromadb(query_text)

In [18]:
type(result)

dict

In [19]:
result.keys()

dict_keys(['ids', 'distances', 'metadatas', 'embeddings', 'documents', 'uris', 'data', 'included'])
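
The values under `ids`, `distances`, and `metadatas` are nested lists with one inner list per query embedding, so a single query is read at index `[0]`. As a sketch, the helper below flattens such a result into one row per hit; `first_query_rows` and the mock dictionary are hypothetical and only mimic the nesting seen in this notebook.

```python
def first_query_rows(result):
    """Pair up id, distance, and review text for the first query in a result."""
    ids = result["ids"][0]
    distances = result["distances"][0]
    metadatas = result["metadatas"][0]
    return [
        {"id": i, "distance": d, "review": m["review"]}
        for i, d, m in zip(ids, distances, metadatas)
    ]

# Mock result with the same nesting as a single-query result; values are illustrative
mock_result = {
    "ids": [["7", "3"]],
    "distances": [[0.5, 0.9]],
    "metadatas": [[{"review": "a space epic"}, {"review": "a slow drama"}]],
}
rows = first_query_rows(mock_result)
```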

In [20]:
result['ids']

[['932', '173', '362', '963', '442']]

In [21]:
result['distances']

[[1.2156418561935425,
  1.5228118896484375,
  1.6340508460998535,
  1.6497654914855957,
  1.6783223152160645]]
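
Smaller distances mean closer matches. As a rough guide to reading these numbers: assuming the collection uses Chroma's default squared-L2 distance, and that both the stored and query vectors are unit length (`TfidfVectorizer` L2-normalises rows by default), the distance `d` relates to cosine similarity as `cos = 1 - d / 2`. The helper name below is hypothetical.

```python
def cosine_from_squared_l2(distance):
    """Convert a squared-L2 distance between unit vectors to cosine similarity."""
    # For unit vectors a and b: ||a - b||^2 = 2 - 2 * cos(a, b)
    return 1.0 - distance / 2.0

# First two distances from the output above
distances = [1.2156418561935425, 1.5228118896484375]
similarities = [cosine_from_squared_l2(d) for d in distances]
```

Under those assumptions the best match has a cosine similarity of roughly 0.39, so even the top hit shares only part of the query's vocabulary.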

In [22]:
for idx, review in enumerate(result['metadatas'][0]):
    print(review['review'])
    print('---')

This was such a great film. It was done with such beautiful design, such symmetry. I love the way the classical music tied in with the classical art of earth, space and beyond. It was such a fluent, and thought provoking masterpiece. I loved the way the monoliths never changed, although primate, earth and space did. I loved how it was a question of "Do you need me/us now"? The movie expressed desire, peace, love, curiosity, finesse, gentleness, courage and innocence. What more could you ask for from a mere movie? Perhaps a complement movie. Any appreciator of 2001: A Space Odyssey will find the movie 2010 is 2001's complement movie. 2010 is more story orientated than 2001.
---
I'm not a movie snob. I've liked lots of movies that critics hate, and I've hated movies that critics love. However, I have to agree with critics here--"Galaxina" is just substandard. Clearly intended to be a comedy, it only has a few scattered laughs. "Galaxina" has poor photography; it has poor special effects;