-----------------------
#### Default Embedding Function in ChromaDB
---------------------

In [1]:
import chromadb

In [2]:
client = chromadb.PersistentClient(path=r"D:\AI-DATASETS\02-MISC-large\GenAI-LLMs\chromadb\bhupen")

In [3]:
# check if collections exist
client.list_collections()

[Collection(id=d3e8521c-0d49-4d69-aea5-7d06a09d138b, name=learn_chromadb1)]

In [4]:
collection = client.get_or_create_collection("learn_chromadb")

In [5]:
dict(collection.get_model())

{'id': UUID('c1fd9b46-0c31-4869-8fa5-4499d2ddaddb'),
 'name': 'learn_chromadb',
 'configuration_json': {'hnsw_configuration': {'space': 'l2',
   'ef_construction': 100,
   'ef_search': 10,
   'num_threads': 4,
   'M': 16,
   'resize_factor': 1.2,
   'batch_size': 100,
   'sync_threshold': 1000,
   '_type': 'HNSWConfigurationInternal'},
  '_type': 'CollectionConfigurationInternal'},
 'metadata': None,
 'dimension': None,
 'tenant': 'default_tenant',
 'database': 'default_database',
 'version': 0,
 'log_position': 0}
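
The `'space': 'l2'` entry in the configuration above names the distance used to compare embeddings. The sketch below assumes the definitions given in the Chroma documentation (`l2` as squared Euclidean distance, with `cosine` and `ip` as the alternatives) and reimplements them in plain Python for illustration only; the function names are hypothetical and are not part of Chroma.

```python
import math

def squared_l2(a, b):
    """Squared Euclidean distance: sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def inner_product_distance(a, b):
    """1 - dot product; equals cosine distance for unit-length vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

u = [1.0, 0.0]
v = [0.0, 1.0]
print(squared_l2(u, v))              # 2.0
print(cosine_distance(u, v))         # 1.0
print(inner_product_distance(u, v))  # 1.0
```

With any of these metrics a smaller value means a closer match, which is why query results are ordered by ascending distance.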

In [6]:
tagline_dict = {
    1: "Empowering Ideas, Unleashing Words: Your Voice, Our Language Model.",
    2: "Infinite Possibilities, One Model: Transforming Text with LLM Excellence.",
    3: "Where Language Meets Limitless: Unleashing Creativity with LLM.",
    4: "Crafting Tomorrow's Conversations Today: LLM - The Language Pioneer.",
    5: "Words Redefined, Ideas Amplified: Navigate the Future with LLM.",
    6: "Unlocking Textual Brilliance: Your Ideas, Supercharged by LLM.",
    7: "Empowering Communication, Enriching Experiences: LLM at Your Service.",
    8: "The Art of Words, Elevated: LLM - Your Gateway to Expressive Text.",
    9: "In the Realm of Language Mastery: Unleash Potential with LLM.",
    10: "Transforming Ideas into Textual Triumphs: LLM, Your Creative Companion."
}

In [8]:
# Add each tagline as its own document, using the dictionary key as its ID
for key, val in tagline_dict.items():
    collection.add(documents = [val],
                   ids       = [str(key)]   # IDs must be strings
                  )

Insert of existing embedding ID: 1
Add of existing embedding ID: 1
Insert of existing embedding ID: 2
Add of existing embedding ID: 2
Insert of existing embedding ID: 3
Add of existing embedding ID: 3
Insert of existing embedding ID: 4
Add of existing embedding ID: 4
Insert of existing embedding ID: 5
Add of existing embedding ID: 5
Insert of existing embedding ID: 6
Add of existing embedding ID: 6
Insert of existing embedding ID: 7
Add of existing embedding ID: 7
Insert of existing embedding ID: 8
Add of existing embedding ID: 8
Insert of existing embedding ID: 9
Add of existing embedding ID: 9
Insert of existing embedding ID: 10
Add of existing embedding ID: 10
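
The messages above suggest that these IDs were already stored by an earlier run of the cell, so the repeated `add` did not create new records. One way to avoid the noise is to check which IDs exist first and add only the missing ones. The sketch below assumes only that the collection offers `get(ids=..., include=[])` and `add(ids=..., documents=...)`, as Chroma collections do; `add_missing` and `FakeCollection` are hypothetical names, and the fake collection exists purely so the example runs without a database.

```python
def add_missing(collection, docs_by_id):
    """Add only the documents whose IDs are not yet in the collection.

    Returns the list of IDs that were actually added.
    """
    pairs = [(str(key), doc) for key, doc in docs_by_id.items()]
    ids = [doc_id for doc_id, _ in pairs]
    # get() returns only records that exist; include=[] skips the payloads.
    existing = set(collection.get(ids=ids, include=[])["ids"])
    new_pairs = [(doc_id, doc) for doc_id, doc in pairs if doc_id not in existing]
    if new_pairs:
        collection.add(
            ids=[doc_id for doc_id, _ in new_pairs],
            documents=[doc for _, doc in new_pairs],
        )
    return [doc_id for doc_id, _ in new_pairs]


class FakeCollection:
    """Hypothetical in-memory stand-in for a Chroma collection."""

    def __init__(self):
        self._docs = {}

    def get(self, ids, include=None):
        return {"ids": [i for i in ids if i in self._docs]}

    def add(self, ids, documents):
        for doc_id, doc in zip(ids, documents):
            self._docs[doc_id] = doc

    def count(self):
        return len(self._docs)


taglines = {1: "first tagline", 2: "second tagline"}
fake = FakeCollection()
print(add_missing(fake, taglines))  # ['1', '2']
print(add_missing(fake, taglines))  # []
```

If existing records should be overwritten rather than skipped, Chroma's `collection.upsert(...)` takes the same arguments as `add`.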


In [9]:
collection.count()

10

In [10]:
phrases=[
    "Amanda baked cookies and will bring Jerry some tomorrow.",
    "Olivia and Olivier are voting for liberals in this election.",
    "Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do.",
    "John's cookies were only half-baked but he still carries them for Mary.",
]

In [11]:
ids=["001","002","003","004"]

In [12]:
metadatas =[{"source": "pdf-1"},
            {"source": "doc-1"},
            {"source": "pdf-2"},
            {"source": "txt-1"}]

In [13]:
collection = client.get_or_create_collection("learn_chromadb1")

In [14]:
collection.add(
    documents = phrases,
    metadatas = metadatas,
    ids       = ids
)

Add of existing embedding ID: 001
Add of existing embedding ID: 002
Add of existing embedding ID: 003
Add of existing embedding ID: 004
Add of existing embedding ID: 001
Add of existing embedding ID: 002
Add of existing embedding ID: 003
Add of existing embedding ID: 004
Insert of existing embedding ID: 001
Insert of existing embedding ID: 002
Insert of existing embedding ID: 003
Insert of existing embedding ID: 004
Add of existing embedding ID: 001
Add of existing embedding ID: 002
Add of existing embedding ID: 003
Add of existing embedding ID: 004
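
The `metadatas` stored with each phrase can later be used to narrow a lookup: Chroma's `get` and `query` accept a `where` argument such as `where={"source": "pdf-1"}`. The sketch below only emulates simple equality filters locally so the effect is visible without a database; `filter_ids` is a hypothetical helper, not a Chroma function.

```python
ids = ["001", "002", "003", "004"]
metadatas = [{"source": "pdf-1"},
             {"source": "doc-1"},
             {"source": "pdf-2"},
             {"source": "txt-1"}]

def filter_ids(ids, metadatas, where):
    """Return the IDs whose metadata matches every key/value pair in `where`."""
    return [doc_id for doc_id, meta in zip(ids, metadatas)
            if all(meta.get(key) == value for key, value in where.items())]

print(filter_ids(ids, metadatas, {"source": "pdf-1"}))  # ['001']
```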


#### Importance of Embeddings in Collections
- Embeddings are the vector representations a collection uses to index documents and answer similarity queries.
- They can be generated implicitly by Chroma's built-in embedding function when documents are added.
- Alternatively, embeddings can be generated externally with models from providers such as OpenAI, PaLM, or Cohere.
- Chroma provides embedding-function wrappers for several external APIs, so embedding generation and storage can happen in a single step.
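
To make the "external embeddings" option concrete, the sketch below shows the general shape of an embedding function: a callable that takes a list of texts and returns one vector per text. `ToyHashEmbedding` is a hypothetical toy that hashes words into buckets; it carries no semantic meaning and merely stands in for a real model or API call.

```python
import hashlib
import math

class ToyHashEmbedding:
    """Toy stand-in for an external embedding model (not semantically meaningful)."""

    def __init__(self, dim=8):
        self.dim = dim

    def __call__(self, input):
        vectors = []
        for text in input:
            vec = [0.0] * self.dim
            for token in text.lower().split():
                # hashlib keeps the bucket stable across runs, unlike hash().
                digest = hashlib.md5(token.encode("utf-8")).digest()
                vec[digest[0] % self.dim] += 1.0
            # Scale to unit length; fall back to 1.0 for empty text.
            norm = math.sqrt(sum(x * x for x in vec)) or 1.0
            vectors.append([x / norm for x in vec])
        return vectors

embed = ToyHashEmbedding(dim=8)
vectors = embed(["Amanda baked cookies", "Olivia votes"])
print(len(vectors), len(vectors[0]))  # 2 8
```

Precomputed vectors like these can be stored with `collection.add(ids=..., documents=..., embeddings=vectors)`, in which case Chroma does not compute embeddings itself. A callable can also be supplied through the `embedding_function` argument when the collection is created, although the exact interface it must satisfy may differ between Chroma versions.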

#### Default Embedding Model in Chroma
- Chroma uses the Sentence Transformers model **all-MiniLM-L6-v2** by default.
- This model maps sentences and short documents to 384-dimensional embeddings that suit a range of tasks.
- The embedding function runs locally on your machine; no external API call is needed.
- Required model files are downloaded automatically the first time they are needed.

In [15]:
print(collection.peek())

{'ids': ['001', '002', '003', '004'], 'embeddings': array([[-0.02752167,  0.05736668, -0.00122983, ..., -0.01637481,
        -0.03250027, -0.0933486 ],
       [ 0.08045931, -0.06479898, -0.00384998, ..., -0.0260211 ,
         0.08252283, -0.02077583],
       [-0.07562497, -0.02735966, -0.01390161, ...,  0.06941753,
        -0.09370109, -0.0335044 ],
       [ 0.00651375,  0.05261214,  0.05810095, ..., -0.06113185,
         0.02064367, -0.07570463]]), 'metadatas': [{'source': 'pdf-1'}, {'source': 'doc-1'}, {'source': 'pdf-2'}, {'source': 'txt-1'}], 'documents': ['Amanda baked cookies and will bring Jerry some tomorrow.', 'Olivia and Olivier are voting for liberals in this election.', 'Sam is confused, because he overheard Rick complaining about him as a roommate. Naomi thinks Sam should talk to Rick. Sam is not sure what to do.', "John's cookies were only half-baked but he still carries them for Mary."], 'uris': None, 'data': None, 'included': ['embeddings', 'metadatas', 'documents']}


In [16]:
collection.peek(1)

{'ids': ['001'],
 'embeddings': array([[-2.75216699e-02,  5.73666804e-02, -1.22983486e-03,
         -2.36949883e-02, -9.43863690e-02, -2.78063747e-03,
          7.46858940e-02, -5.72701655e-02, -2.50429623e-02,
          1.85038298e-02, -2.71457154e-02, -8.91745184e-03,
         -1.07487887e-01, -7.39713386e-03,  2.43057478e-02,
          1.12196254e-02,  3.65282334e-02, -5.46093062e-02,
         -8.01465288e-02, -9.81636811e-03, -1.41378702e-03,
          1.17890192e-02,  1.02783382e-01,  6.10934123e-02,
         -7.94704035e-02,  4.80488278e-02,  5.60123287e-03,
         -9.59517714e-03, -3.92248482e-03,  7.35128159e-03,
         -5.84082156e-02,  2.79703978e-02, -1.15979970e-01,
          2.91299429e-02,  4.04155105e-02, -3.52354981e-02,
          7.95362145e-03,  2.77526164e-03,  4.08411287e-02,
          2.25484744e-02,  1.23183820e-02, -4.28285897e-02,
         -7.29242787e-02,  1.79624427e-02, -1.42921925e-01,
         -6.66533932e-02,  3.31913517e-03, -1.35064693e-02,
         

In [17]:
collection.peek(1)['embeddings'].shape

(1, 384)

In [18]:
results = collection.query(
    query_texts= ["Mary got half-baked cake from John"],
    n_results  = 1
)

In [19]:
print(results['documents'][0][0])

John's cookies were only half-baked but he still carries them for Mary.
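
Conceptually, the query above embeds the query text and returns the stored documents whose embeddings lie closest to it. Chroma uses an approximate HNSW index for this; the sketch below swaps in an exact brute-force search over made-up 3-dimensional vectors, which gives the same kind of answer on a small collection. `brute_force_query` and the sample vectors are hypothetical. The result mirrors the nested-list layout seen in `results['documents'][0][0]`: one inner list per query.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def brute_force_query(stored, query_embeddings, n_results=1):
    """Rank stored vectors by distance to each query vector.

    `stored` maps document ID -> embedding. The result holds one inner
    list per query, closest match first.
    """
    out = {"ids": [], "distances": []}
    for query in query_embeddings:
        ranked = sorted(stored.items(),
                        key=lambda item: squared_l2(item[1], query))
        top = ranked[:n_results]
        out["ids"].append([doc_id for doc_id, _ in top])
        out["distances"].append([squared_l2(vec, query) for _, vec in top])
    return out

# Made-up vectors, for illustration only.
stored = {
    "001": [1.0, 0.0, 0.0],
    "002": [0.0, 1.0, 0.0],
    "004": [0.9, 0.1, 0.0],
}
results = brute_force_query(stored, [[1.0, 0.0, 0.0]], n_results=2)
print(results["ids"][0])  # ['001', '004']
```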


#### Sentence Transformers, all-MiniLM-L6-v2

##### Model Architecture
- **Base Model:** Based on Microsoft's **MiniLM** (Mini Language Model), a compact transformer distilled from larger models such as BERT and RoBERTa.
- **Layers:** 6 transformer layers (the `L6` in the name).
- **Hidden Size:** 384 hidden dimensions, which is also the size of each output embedding (compare the `(1, 384)` shape above); this keeps the model small and fast while maintaining strong performance.
- **Distillation:** MiniLM was trained using **knowledge distillation** from a larger teacher model (e.g., BERT), learning to replicate the teacher's behavior with fewer parameters.

##### Datasets Used
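- Fine-tuned on large-scale **sentence-pair datasets** so that sentences with similar meaning receive nearby embeddings, which supports tasks such as semantic similarity and paraphrase detection. Relevant datasets include:
  - **STS (Semantic Textual Similarity) benchmark:** Measures how semantically similar two sentences are; it is typically used to evaluate sentence-embedding models.
  - **Quora Question Pairs:** Duplicate or paraphrased questions, used as examples of sentences that should embed close together.
  - **NLI (Natural Language Inference) datasets:** Sentence pairs labeled as entailment, contradiction, or neutral, which provide related and unrelated pairs for training.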

##### Practical Use Cases
- **Semantic Search:** Enables finding semantically similar sentences or documents, improving search engine results and information retrieval.
- **Text Clustering:** Clustering similar sentences or paragraphs, helpful for organizing or summarizing text datasets.
- **Paraphrase Identification:** Detects whether two sentences convey the same meaning but in different words.
- **Recommendation Systems:** Generates content-based recommendations by comparing text descriptions of items through embeddings.
- **Zero-shot Classification:** Uses embeddings to classify text into categories by comparing sentence similarity, even without labeled data for each class.
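
Several of these use cases reduce to the same operation: compare embeddings with cosine similarity and pick the best match. As a minimal sketch of the zero-shot classification idea, the example below assigns a text to the label whose embedding is most similar. The 3-dimensional vectors are made up for illustration; in practice both the text and the label descriptions would be embedded with the same model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def classify(text_vec, label_vecs):
    """Return the label whose embedding is most similar to the text embedding."""
    return max(label_vecs,
               key=lambda label: cosine_similarity(text_vec, label_vecs[label]))

# Made-up embeddings, for illustration only.
label_vecs = {
    "food":     [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.0],
}
text_vec = [0.8, 0.2, 0.1]
print(classify(text_vec, label_vecs))  # food
```

The same `cosine_similarity` function, combined with a threshold, gives a simple paraphrase check; the right threshold depends on the data and has to be chosen empirically.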

The all-MiniLM-L6-v2 model is optimized for tasks requiring sentence-level embeddings, balancing speed and memory efficiency with robust performance.