# Documentation

## Overview
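This script reads text data from a Parquet file, splits it into overlapping chunks, generates an embedding for each chunk with the Google GenAI API, and stores the results in a ChromaDB collection. Two follow-up cells query the collection and export it to JSON.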

## Requirements
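- Python 3
- Third-party libraries: `pandas` (with a Parquet engine such as `pyarrow`), `numpy`, `chromadb`, `tqdm`, `python-dotenv` (imported as `dotenv`), `google-genai` (imported as `google.genai`)
- Standard library: `os`, `json`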
- Environment Variables:
  - `CHROMA_DB_PERSIST_DIRECTORY`: Path for ChromaDB persistence.
  - `GENAI_API_KEY`: API key for Google GenAI.
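
A minimal sketch of a startup check for the two variables listed above, so that a missing value fails early with a clear message instead of surfacing later as a client error. The helper name `require_env` and its `environ` parameter are hypothetical and not part of the original script.

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise if it is unset or empty.

    `require_env` is an illustrative helper, not part of the original script.
    """
    value = environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required environment variable: {name}")
    return value
```

After `load_dotenv()`, the script could call `require_env("CHROMA_DB_PERSIST_DIRECTORY")` and `require_env("GENAI_API_KEY")` in place of the bare `os.getenv` calls.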

## Key Steps
1. **Load Environment Variables**
   - Load API keys and directory paths using `dotenv`.

2. **Initialize Clients**
   - Create ChromaDB and GenAI clients.

3. **Load Data**
   - Read the Parquet file into a DataFrame and cast the `date` column to strings.

4. **Chunking Text**
   - Define `chunk_text()` to split text into overlapping chunks (8,000 characters with a 500-character overlap by default).

5. **Generate Embeddings**
   - Call `client.models.embed_content` on a `genai.Client` with the `text-embedding-004` model to create a 768-dimensional embedding for each text chunk.

6. **Store in ChromaDB**
   - Save embeddings, metadata, and documents into the ChromaDB collection.
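
The chunking step is easiest to see on a tiny input. The sketch below repeats the script's `chunk_text()` and runs it with deliberately small, illustrative values for `chunk_size` and `overlap` (the script itself uses 8000 and 500).

```python
def chunk_text(text, chunk_size=8000, overlap=500):
    """Split text into chunks of at most chunk_size characters that overlap by `overlap`."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by `overlap` so consecutive chunks share some characters.
        start = end - overlap
    return chunks


# Illustrative values only: each chunk starts 3 characters after the previous one.
small = chunk_text("abcdefghij", chunk_size=4, overlap=1)
print(small)  # ['abcd', 'defg', 'ghij']
```

Each chunk begins `chunk_size - overlap` characters after the previous one, so `overlap` needs to stay smaller than `chunk_size` for the loop to advance.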

## Output
- Prints the number of embeddings stored in the ChromaDB collection.



# Code

In [1]:
import os
import pandas as pd
import chromadb
from chromadb.config import Settings
from tqdm import tqdm
from dotenv import load_dotenv
import google.genai as genai
from google.genai.types import EmbedContentConfig

load_dotenv()

persist_directory = os.getenv("CHROMA_DB_PERSIST_DIRECTORY")
genai_api_key = os.getenv("GENAI_API_KEY")

chroma_client = chromadb.PersistentClient(path=persist_directory)

client = genai.Client(api_key=genai_api_key)

collection_name = "pwc_kpmg_insights_collection"

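try:
    collection = chroma_client.get_collection(collection_name)
except Exception:
    # The exception type raised for a missing collection differs between
    # chromadb versions, so a broad catch is used here.
    print(f"Collection '{collection_name}' not found, creating a new one.")
    collection = chroma_client.create_collection(name=collection_name)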

file_path = "final_categorized_with_themes_and_summaries.parquet"

df = pd.read_parquet(file_path)
df["date"] = df["date"].astype(str)
# file_path is a single path; len(file_path) would only count its characters.
print(f"Loaded {len(df)} documents from {file_path}.")

import numpy as np

def convert_to_str(value, sep=", "):
    if isinstance(value, (list, np.ndarray)):
        return sep.join([str(item) for item in value])
    return str(value)

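def chunk_text(text, chunk_size=8000, overlap=500):
    """
    Split the input text into chunks of at most chunk_size characters,
    with each chunk overlapping the previous one by 'overlap' characters.
    """
    if not 0 <= overlap < chunk_size:
        # Otherwise 'start' would never advance and the loop would not terminate.
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        # Step back by 'overlap' so consecutive chunks share context.
        start = end - overlap
    return chunks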

embedding_ids = []
embedding_vectors = []
metadatas = []
documents = []
embedding_count = 0

for index, row in tqdm(df.iterrows(), total=len(df), desc="Processing documents"):
    date_val = row["date"]
    print(f"Row {index}: date value = {date_val!r}, type = {type(date_val)}")
    
    text = row["normalized_concatenated_text"]
    source = row.get("source", row.get("url_link", ""))
    title = row["title"]
    category = row["category"]
    niche = row["niche"]
    
    # Convert the key_themes and recurring_topics to strings
    key_themes = convert_to_str(row["key_themes"], ", ")
    recurring_topics = convert_to_str(row["recurring_topics"], ", ")
    
    chunks = chunk_text(text, chunk_size=8000, overlap=500)
    
    for i, chunk in enumerate(chunks):
        embedding_count += 1
        result = client.models.embed_content(
            model="text-embedding-004",
            contents=[chunk],
            config=EmbedContentConfig(
                task_type="RETRIEVAL_DOCUMENT",
                output_dimensionality=768,
                title=title
            )
        )
        embedding_vector = result.embeddings[0].values
        embedding_id = f"{index}_{i}"
        
        metadata_date = str(date_val) if date_val is not None else ""
        metadata = {
            "source": source,
            "title": title,
            "date": metadata_date,
            "category": category,
            "niche": niche,
            "key_themes": key_themes,
            "recurring_topics": recurring_topics,
            "embedding_number": embedding_count
        }
        embedding_ids.append(embedding_id)
        embedding_vectors.append(embedding_vector)
        metadatas.append(metadata)
        documents.append(chunk)

# --- Step 6: Store the embeddings in ChromaDB ---
collection.add(
    ids=embedding_ids,
    embeddings=embedding_vectors,
    metadatas=metadatas,
    documents=documents
)

print(f"Stored {len(embedding_ids)} embeddings in Chroma DB.")


Collection 'pwc_kpmg_insights_collection' not found, creating a new one.
Loaded 17 documents from 51 files.


Processing documents:   0%|          | 0/17 [00:00<?, ?it/s]

Row 0: date value = '02/28/25', type = <class 'str'>


Processing documents:   6%|▌         | 1/17 [00:13<03:39, 13.69s/it]

Row 1: date value = '02/20/25', type = <class 'str'>


Processing documents:  12%|█▏        | 2/17 [00:15<01:39,  6.63s/it]

Row 2: date value = '02/10/25', type = <class 'str'>


Processing documents:  18%|█▊        | 3/17 [00:17<01:01,  4.38s/it]

Row 3: date value = '02/07/25', type = <class 'str'>


Processing documents:  24%|██▎       | 4/17 [00:27<01:27,  6.70s/it]

Row 4: date value = '02/07/25', type = <class 'str'>


Processing documents:  29%|██▉       | 5/17 [00:37<01:33,  7.79s/it]

Row 5: date value = '02/07/25', type = <class 'str'>


Processing documents:  35%|███▌      | 6/17 [00:47<01:34,  8.59s/it]

Row 6: date value = '02/06/25', type = <class 'str'>


Processing documents:  41%|████      | 7/17 [00:58<01:34,  9.43s/it]

Row 7: date value = '07/03/25', type = <class 'str'>


Processing documents:  47%|████▋     | 8/17 [01:09<01:30, 10.07s/it]

Row 8: date value = '05/03/25', type = <class 'str'>


Processing documents:  53%|█████▎    | 9/17 [01:28<01:43, 12.92s/it]

Row 9: date value = '04/03/25', type = <class 'str'>


Processing documents:  59%|█████▉    | 10/17 [01:38<01:22, 11.80s/it]

Row 10: date value = '04/03/25', type = <class 'str'>


Processing documents:  65%|██████▍   | 11/17 [02:02<01:33, 15.58s/it]

Row 11: date value = '27/02/25', type = <class 'str'>


Processing documents:  71%|███████   | 12/17 [02:15<01:13, 14.76s/it]

Row 12: date value = '25/02/25', type = <class 'str'>


Processing documents:  76%|███████▋  | 13/17 [02:17<00:43, 10.84s/it]

Row 13: date value = '24/02/25', type = <class 'str'>


Processing documents:  82%|████████▏ | 14/17 [02:44<00:47, 15.92s/it]

Row 14: date value = '21/02/25', type = <class 'str'>


Processing documents:  88%|████████▊ | 15/17 [02:46<00:23, 11.76s/it]

Row 15: date value = '18/02/25', type = <class 'str'>


Processing documents:  94%|█████████▍| 16/17 [02:51<00:09,  9.74s/it]

Row 16: date value = '14/02/25', type = <class 'str'>


Processing documents: 100%|██████████| 17/17 [02:58<00:00, 10.52s/it]


Stored 118 embeddings in Chroma DB.
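
The cell above passes all 118 embeddings to `collection.add` in a single call. For larger corpora a single call may run into batch-size limits or memory pressure, so the following is a sketch of adding records in slices instead. The helpers `iter_batches` and `add_in_batches` and the default `batch_size=100` are assumptions for illustration, not part of the original notebook; only the `collection.add` keyword arguments are taken from the cell above.

```python
def iter_batches(n_items, batch_size):
    """Yield (start, stop) index pairs that cover range(n_items) in order."""
    for start in range(0, n_items, batch_size):
        yield start, min(start + batch_size, n_items)


def add_in_batches(collection, ids, embeddings, metadatas, documents, batch_size=100):
    """Call collection.add once per slice of at most batch_size records.

    Hypothetical helper: batch_size=100 is an arbitrary illustrative default.
    """
    for start, stop in iter_batches(len(ids), batch_size):
        collection.add(
            ids=ids[start:stop],
            embeddings=embeddings[start:stop],
            metadatas=metadatas[start:stop],
            documents=documents[start:stop],
        )
```

With the lists built in the loop above, the final `collection.add(...)` call could be replaced by `add_in_batches(collection, embedding_ids, embedding_vectors, metadatas, documents)`.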


In [2]:
import os
from dotenv import load_dotenv
import chromadb
import google.genai as genai
from google.genai.types import EmbedContentConfig

load_dotenv()

persist_directory = os.getenv("CHROMA_DB_PERSIST_DIRECTORY")
genai_api_key = os.getenv("GENAI_API_KEY")

client = genai.Client(api_key=genai_api_key)

chroma_client = chromadb.PersistentClient(path=persist_directory)
collection_name = "pwc_kpmg_insights_collection"
collection = chroma_client.get_collection(collection_name)

query_text = "What is the capital of France?"

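# Queries use the RETRIEVAL_QUERY task type; the 'title' field applies only
# to RETRIEVAL_DOCUMENT, so it is omitted here.
query_result = client.models.embed_content(
    model="text-embedding-004",
    contents=[query_text],
    config=EmbedContentConfig(
        task_type="RETRIEVAL_QUERY",
        output_dimensionality=768
    )
)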
query_embedding = query_result.embeddings[0].values

results = collection.query(
    query_embeddings=[query_embedding],
    n_results=5,
    include=["documents", "metadatas", "embeddings", "distances"]
)

print("Retrieved Documents:")
for i, doc in enumerate(results["documents"][0]):
    print(f"\nResult {i+1}:")
    print("Document:", doc)
    print("Metadata:", results["metadatas"][0][i])
    print("Distance:", results["distances"][0][i])


Retrieved Documents:

Result 1:
Document:  city demonstrate unprecedented momentum aum contribution rise 9 approximately 19 remarkable growth rate 95x absolute term digital transformation revolutionise access 60 transaction form 21 transaction value occur digitally compare 45 transaction form 1 transaction value fy20134 however unlocking industrys full potential require fundamental shift approach draw insight transformative success across sector telecommunication healthcare realise sustainable change require product innovation entail deep cultural resonance community engagement solution complement rather replace exist financial wisdom engineer transformation propose comprehensive framework build four interconnected pillar strategic orchestration enhance coordination among industry stakeholder association mutual fund indias amfi governance mechanism development regional strategy respect cultural nuance drive financial inclusion creation robust research mechanism understand address behav

In [3]:
import os
import json
import numpy as np
from dotenv import load_dotenv
import chromadb

load_dotenv()

persist_directory = os.getenv("CHROMA_DB_PERSIST_DIRECTORY")  # e.g., "/path/to/save/to"
chroma_client = chromadb.PersistentClient(path=persist_directory)

collection_name = "pwc_kpmg_insights_collection"
collection = chroma_client.get_collection(collection_name)

data = collection.get(include=["documents", "embeddings", "metadatas"])

def make_json_serializable(obj):
    """
    Recursively convert objects (e.g., numpy arrays) into JSON serializable types.
    """
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    elif isinstance(obj, dict):
        return {k: make_json_serializable(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [make_json_serializable(item) for item in obj]
    else:
        return obj

serializable_data = make_json_serializable(data)

output_file = "collection_export.json"
with open(output_file, "w") as f:
    json.dump(serializable_data, f, indent=2)

print(f"Collection exported to {output_file}")


Collection exported to collection_export.json
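
To sanity-check the exported file, it can be loaded back and summarized. The sketch below assumes the JSON has the top-level `ids` and `embeddings` keys returned by `collection.get` in the cell above; the helper name `summarize_export` is hypothetical.

```python
import json


def summarize_export(path):
    """Return the record count and embedding dimension found in an export file.

    Assumes top-level "ids" and "embeddings" keys, as produced by the export cell.
    """
    with open(path) as f:
        data = json.load(f)
    ids = data.get("ids") or []
    embeddings = data.get("embeddings") or []
    dimension = len(embeddings[0]) if embeddings else 0
    return {"count": len(ids), "dimension": dimension}
```

Calling `summarize_export("collection_export.json")` after the export should report a count equal to the number of stored embeddings.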
