#### This notebook demonstrates populating a vector database (ChromaDB)

The example uses text data from a WhatsApp news group: a read-only group that publishes Indian business news as one-liners
from different newspapers.


In [1]:
import os
import re
from datetime import datetime, timedelta

from tqdm.notebook import tqdm
from time import sleep

In [2]:
import chromadb
chroma_client = chromadb.Client()   ## in memory vdb

### Embedding function
By default, Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model to create embeddings. This model can create sentence and document embeddings that are usable for a wide variety of tasks. The embedding function runs locally on your machine and may require you to download the model files (this happens automatically).


### create_collection : distance func
`create_collection` also takes an optional `metadata` argument, which can be used to customize the distance method of the embedding space by setting the value of `hnsw:space`.

```
collection = client.create_collection(
        name="collection_name",
        metadata={"hnsw:space": "cosine"} # l2 is the default
    )
```

HNSW stands for Hierarchical Navigable Small World. The options for `hnsw:space` are `cosine`, `l2`, and `ip` (inner product).
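
The three `hnsw:space` options can be compared with a small pure-Python sketch. It assumes the definitions given in the Chroma documentation (`l2` is the squared Euclidean distance, `ip` is one minus the dot product, `cosine` is one minus the cosine similarity); the function names are illustrative and not part of the Chroma API.

```python
import math

def l2_squared(a, b):
    """Squared Euclidean distance (assumed meaning of "l2")."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inner_product_distance(a, b):
    """One minus the dot product (assumed meaning of "ip")."""
    return 1.0 - sum(x * y for x, y in zip(a, b))

def cosine_distance(a, b):
    """One minus the cosine similarity (assumed meaning of "cosine")."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two orthogonal unit vectors
u = [1.0, 0.0]
v = [0.0, 1.0]
print(l2_squared(u, v))              # 2.0
print(inner_product_distance(u, v))  # 1.0
print(cosine_distance(u, v))         # 1.0
```

For unit-length vectors the squared L2 distance equals twice the cosine distance, so the two settings rank neighbors identically when the embedding model normalizes its output.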

In [3]:
## VDB related imports
from chromadb.utils import embedding_functions
# By default Chroma uses the Sentence Transformers all-MiniLM-L6-v2 model
embedding_func = embedding_functions.DefaultEmbeddingFunction()
news_collection = chroma_client.create_collection(name="news",embedding_function=embedding_func)

## news preprocessor
from util.text2json import get_news

In [4]:
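def add_record(rec, collection):
    ''' Adds a record to ChromaDB; returns True on success, False otherwise. '''
    if not isinstance(rec, dict):  # rec should be a dict (parsed JSON)
        return False
    # all three keys are needed to build the collection.add() call
    if not all(x in rec for x in ['ids', 'documents', 'metadatas']):
        return False
    try:
        collection.add(ids=rec['ids'], documents=rec['documents'], metadatas=rec['metadatas'])
    except Exception as e:
        print(f"Error : {e}")
        return False
    return True  # previously the success path returned None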


#### PREPROCESSING TEXT FILE

`get_news` returns a list of JSON-style records (dicts), which makes them easy to add to ChromaDB. See the `text2json` notebook for details.
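
As a rough illustration of the record shape that `add_record` expects, the sketch below builds one record by hand. The helper `make_record` and its arguments are hypothetical; the layout (ID made of date plus source code, headlines joined by commas, `date` and `source` metadata) is inferred from the outputs shown in this notebook, not taken from `text2json` itself.

```python
def make_record(date, source_code, source_name, headlines):
    """Build one record with the ids/documents/metadatas keys (hypothetical helper)."""
    return {
        "ids": f"{date}{source_code}",         # e.g. date + newspaper code
        "documents": ",".join(headlines),      # one comma-joined string per paper and day
        "metadatas": {"date": date, "source": source_name},
    }

rec = make_record("2023-05-08", "ET", "Economic Times",
                  ["Headline one", "Headline two"])
print(rec["ids"])        # 2023-05-08ET
print(rec["documents"])  # Headline one,Headline two
```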

In [5]:
news=get_news("./_test.txt")

Rejected lines : 4319
News Items : 396


#### ADD NEWS TO CHROMADB

Uses the helper function `add_record`.

In [6]:
for i in tqdm(range(len(news))):
    add_record(news[i],news_collection)

  0%|          | 0/396 [00:00<?, ?it/s]

Insert of existing embedding ID: 2023-06-06ET
Add of existing embedding ID: 2023-06-06ET
Insert of existing embedding ID: 2023-06-06BS
Add of existing embedding ID: 2023-06-06BS
Insert of existing embedding ID: 2023-06-06FE
Add of existing embedding ID: 2023-06-06FE
Insert of existing embedding ID: 2023-06-06M
Add of existing embedding ID: 2023-06-06M
Insert of existing embedding ID: 2023-06-10ET
Add of existing embedding ID: 2023-06-10ET
Insert of existing embedding ID: 2023-06-10BS
Add of existing embedding ID: 2023-06-10BS
Insert of existing embedding ID: 2023-06-10FE
Add of existing embedding ID: 2023-06-10FE
Insert of existing embedding ID: 2023-06-10M
Add of existing embedding ID: 2023-06-10M
Insert of existing embedding ID: 2023-06-13ET
Add of existing embedding ID: 2023-06-13ET
Insert of existing embedding ID: 2023-06-13BS
Add of existing embedding ID: 2023-06-13BS
Insert of existing embedding ID: 2023-06-13FE
Add of existing embedding ID: 2023-06-13FE
Insert of existing embedd
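
The warnings above indicate that some IDs were submitted more than once. A minimal sketch for spotting such collisions before inserting is shown below; it assumes each record carries an `ids` entry that is either a single string or a list of strings, and `find_duplicate_ids` is an illustrative name rather than part of Chroma.

```python
from collections import Counter

def find_duplicate_ids(records):
    """Return the IDs that occur more than once across the records."""
    counts = Counter()
    for rec in records:
        ids = rec["ids"]
        # Accept either a single ID string or a list of IDs.
        counts.update([ids] if isinstance(ids, str) else ids)
    return sorted(i for i, n in counts.items() if n > 1)

sample = [
    {"ids": "2023-06-06ET"},
    {"ids": "2023-06-06BS"},
    {"ids": "2023-06-06ET"},
]
print(find_duplicate_ids(sample))  # ['2023-06-06ET']
```

Running `find_duplicate_ids(news)` before the loop would list the colliding IDs. If later entries are meant to overwrite earlier ones, `collection.upsert` may be a better fit than `collection.add`.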

#### FETCH FROM AND QUERY CHROMADB

1. Fetch records using `.get` or `.peek`, with the `limit` parameter to cap the output.
2. Use `.query` to run a similarity search, optionally filtering on metadata or document content.

Examples follow

In [7]:
news_collection.get(limit=2)        # limit to 2 records
# you can use peek() instead of get() as it shows embeddings too
#news_collection.peek()

{'ids': ['2023-05-08ET', '2023-05-08BS'],
 'embeddings': None,
 'metadatas': [{'date': '2023-05-08', 'source': 'Economic Times'},
  {'date': '2023-05-08', 'source': 'Business Standard'}],
 'documents': ['Aditya Birla Fashion to raise up to ₹800 crore for TCNS acquisition,Indus plans capex push this fiscal to make the most of 5G boom,Blackstone signs binding pact for controlling stake in care hospitals,Equitas Small Finance Bank reports Q4 net profit at Rs 190.03 cr,Coal India Q4 Results: Profit declines 18% YoY to Rs 5,528 crore, dividend declared at Rs 4/share,TPG-backed RR Kabel files IPO papers with Sebi,Gold imports dip 24% to $35 billion in 2022-23,GSTN defers by 3 months implementation of e-invoice reporting time limit,Saudi Arabia economy grew 3.9% in Q1 boosted by non-oil activities,ChrysCap diversifies into public market, launches special fund,Daikin India becomes billion-dollar company, expect to double business in next 3 years,Electronic wearables production in India reaches

In [8]:
news_collection.query(
    query_texts=["Blackstone"],
    n_results=2
)

{'ids': [['2023-07-07BS', '2023-07-29BS']],
 'distances': [[1.305416226387024, 1.5939037799835205]],
 'metadatas': [[{'date': '2023-07-07', 'source': 'Business Standard'},
   {'date': '2023-07-29', 'source': 'Business Standard'}]],
 'embeddings': None,
 'documents': [['Blackstone Group top bidder for ESR, German investor Allianz platform,RBI governor Shaktikanta Das asks states to improve expenditure quality,Non-life insurance companies report 14% growth in June, shows GIC data,Epsilon Advanced Materials to up India capacity from 200 TPA to 10,000,PSBs should follow transparent NPA recognition norms, pursue risk mgmt: FM,Rationalise tariff on telecom parts to encourage other countries: Industry,State Bank of India rejigs senior leadership roles to boost dominance,Bombay Dyeing may sell Worli land parcel at Rs 5,000 cr valuation',
   "Blackstone, Baring in race to acquire up to 20% stake in pharma firm Cipla,'Diversify credit portfolio': FinMin flags concentration risk at 5 PSBs,Sterlit

In [9]:
# The first two results are identical to the query above, suggesting that query_texts is
# converted to an embedding before the nearest results are searched for
news_collection.query(
    query_embeddings=embedding_func(["Blackstone"]),
    n_results=3,
)

{'ids': [['2023-07-07BS', '2023-07-29BS', '2023-08-14ET']],
 'distances': [[1.305416226387024, 1.5939037799835205, 1.5994011163711548]],
 'metadatas': [[{'date': '2023-07-07', 'source': 'Business Standard'},
   {'date': '2023-07-29', 'source': 'Business Standard'},
   {'date': '2023-08-14', 'source': 'Economic Times'}]],
 'embeddings': None,
 'documents': [['Blackstone Group top bidder for ESR, German investor Allianz platform,RBI governor Shaktikanta Das asks states to improve expenditure quality,Non-life insurance companies report 14% growth in June, shows GIC data,Epsilon Advanced Materials to up India capacity from 200 TPA to 10,000,PSBs should follow transparent NPA recognition norms, pursue risk mgmt: FM,Rationalise tariff on telecom parts to encourage other countries: Industry,State Bank of India rejigs senior leadership roles to boost dominance,Bombay Dyeing may sell Worli land parcel at Rs 5,000 cr valuation',
   "Blackstone, Baring in race to acquire up to 20% stake in pharma

In [10]:

news_collection.query(
    query_embeddings=embedding_func(["What does black BlackStone do"]),
    where_document={"$contains":"Blackstone"},
    n_results=10,
)


{'ids': [['2023-07-07BS',
   '2023-07-29BS',
   '2023-08-14ET',
   '2023-05-22ET',
   '2023-05-08ET',
   '2023-06-08BS',
   '2023-05-18ET',
   '2023-05-08M',
   '2023-06-27M']],
 'distances': [[1.186376929283142,
   1.550341248512268,
   1.5727812051773071,
   1.7622734308242798,
   1.9258756637573242,
   1.9982293844223022,
   2.0005414485931396,
   2.0247809886932373,
   2.078800916671753]],
 'metadatas': [[{'date': '2023-07-07', 'source': 'Business Standard'},
   {'date': '2023-07-29', 'source': 'Business Standard'},
   {'date': '2023-08-14', 'source': 'Economic Times'},
   {'date': '2023-05-22', 'source': 'Economic Times'},
   {'date': '2023-05-08', 'source': 'Economic Times'},
   {'date': '2023-06-08', 'source': 'Business Standard'},
   {'date': '2023-05-18', 'source': 'Economic Times'},
   {'date': '2023-05-08', 'source': 'Mint'},
   {'date': '2023-06-27', 'source': 'Mint'}]],
 'embeddings': None,
 'documents': [['Blackstone Group top bidder for ESR, German investor Allianz platf

In [11]:
# project only documents
news_collection.query(
    query_embeddings=embedding_func(["BlackStone"]),
    where_document={"$contains":"BetterPlace"},
    n_results=10,
    include=["documents"]
)


{'ids': [['2023-05-08M']],
 'distances': None,
 'metadatas': None,
 'embeddings': None,
 'documents': [["BetterPlace acquires fintech lending startup Bueno Finance,LIC, MFs plough $2 billion into IT firms in Q4 as shares tumble,India eyes clean energy sources to tackle tariffs,India conducts talks with UAE on pharma export pricing challenges,Blackstone, ADIA are likely bidders for HDFC’s Credila,NCLT to hear BoB plea in Rel Home case,Silver ETFs getting investors' traction; asset bases reach ₹1,800 crore,Connekkt Media Network signs ₹270 crore deal with AVS Studios for 3 films,Defence ministry approves posting women officers of Territorial Army along LoC,Grindwell Norton Q4 Earnings: PAT rise 10% YoY in Q4, net income up 20%, Board declares highest ever dividend."]]}
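
The queries above filter on document text with `where_document`; metadata can be filtered in a similar way through the `where` argument. The sketch below builds such a filter for the `source` field. The helper `source_filter` is hypothetical, and it assumes the `$eq` and `$or` operators behave as described in the Chroma documentation.

```python
def source_filter(sources):
    """Build a `where` filter that matches any of the given source names."""
    clauses = [{"source": {"$eq": s}} for s in sources]
    if not clauses:
        return None          # no filter at all
    if len(clauses) == 1:
        return clauses[0]    # $or needs at least two clauses
    return {"$or": clauses}

where = source_filter(["Mint", "Economic Times"])
print(where)

# Usage against the collection built in this notebook (not executed here):
# news_collection.query(query_texts=["Blackstone"], where=where, n_results=2)
```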