<a href="https://colab.research.google.com/github/springboardmentor1234x-stack/Internal-Chatbot-with-RBAC/blob/Reethika-A/reethka_rbac_metadata.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
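
The cell in this notebook attaches an `allowed_roles` list to every chunk's metadata. As a minimal sketch of how a retrieval step might use that field for role-based access control, the helper below keeps only the chunks a given role may see. `filter_chunks_by_role` and `sample_chunks` are hypothetical names, not part of the notebook; the case-insensitive match and the deny-by-default handling of untagged chunks are assumptions, and the sample values are illustrative.

```python
from typing import Dict, List


def filter_chunks_by_role(chunks: List[Dict], role: str) -> List[Dict]:
    """Return only the chunks whose metadata lists `role` as allowed.

    Hypothetical helper. Assumptions: role names are compared
    case-insensitively, and a chunk without metadata is treated as
    restricted (deny by default).
    """
    wanted = role.strip().lower()
    visible = []
    for chunk in chunks:
        roles = chunk.get("metadata", {}).get("allowed_roles", [])
        if wanted in (r.lower() for r in roles):
            visible.append(chunk)
    return visible


# Illustrative chunks in the same shape the notebook cell produces.
sample_chunks = [
    {
        "id": "fin_report_2024_1",
        "text": "sample text",
        "token_count": 2,
        "metadata": {
            "source": "Financial Report for FinSolve Technologies Inc. - 2024",
            "department": "Finance",
            "allowed_roles": ["Accountant", "Financial Analyst", "CFO"],
        },
    },
    # No metadata at all: hidden from every role under the deny-by-default assumption.
    {"id": "untagged_chunk", "text": "sample text", "token_count": 2},
]

cfo_view = filter_chunks_by_role(sample_chunks, "CFO")
engineer_view = filter_chunks_by_role(sample_chunks, "Engineer")
```

Filtering before retrieval, and not after generation, keeps restricted text out of the prompt entirely; whether that fits this project depends on how its vector store exposes metadata filters.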

In [8]:
import re
import unicodedata
import json
import tiktoken

raw_text = """
Financial Report for FinSolve Technologies Inc. - 2024 Executive Summary ------------------------------------------- 2024 marked a year of both opportunity and challenge for FinSolve Technologies. Despite a robust revenue increase, we saw significant pressure in certain expense categories, notably vendor-related costs and software subscriptions. However, these pressures were balanced by cost-saving measures in operational efficiency, strong gross margin performance, and strategic investment in growth areas. The company is well-positioned to continue scaling its core offerings, but focused attention on cost optimization will be essential for maintaining profitability in the coming years. Year-Over-Year YoY Analysis ------------------------------------------- FinSolve Technologiess revenue grew by 25% in 2024, driven largely by the global expansion of its services, especially in Asia and Europe. This was accompanied by a 10% increase in vendor-related expenses, impacting overall profit margins. While gross profit increased by 25%, reflecting higher operational efficiency, net income saw a more modest increase of 12%. This suggests that while revenue growth is strong, controlling vendor costs and maintaining healthy cash flows remain key to long-term profitability. Expense Breakdown by Category ------------------------------------------- The primary drivers of expense in 2024 were 1. Vendor Services - A total of $30M, representing a 18% increase from the previous year. The largest contributors were - Marketing-related expenses Dinner, corporate events 40% of vendor services. - Training and education expenses 30% of vendor services. - Software subscriptions cloud services, licensing 25% of vendor services. - Other miscellaneous expenses 5% of vendor services. Analysis The Dinner and Training categories accounted for an increasing share of the marketing budget. 
While essential for brand positioning, these expenses need tighter management, potentially through vendor renegotiations or reduced event frequency. 2. Software Subscriptions - A significant expense totaling $25M, up 22% from 2023. Given the heavy reliance on cloud-based tools and SaaS subscriptions, this area could benefit from more rigorous contract negotiation and potential consolidation of service providers. 3. Employee Benefits and HR Costs - With FinSolve Technologiess growth in headcount, HR expenses benefits, recruitment, training saw a 10% increase. While employee growth is essential, optimizing benefits packages and hiring processes could reduce per-employee cost. 4. Other Operational Expenses - A mix of general operational and administrative expenses totaling $15M, with a notable increase in travel and miscellaneous office costs, which grew by 8% yea
"""

department = "Finance"
allowed_roles = ["Accountant", "Financial Analyst", "CFO"]
source_name = "Financial Report for FinSolve Technologies Inc. - 2024"

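def clean_text(text):
    """Normalize Unicode, drop separator rules and unsupported symbols, collapse whitespace."""
    # NFKC folds compatibility characters (e.g. full-width digits) into canonical forms.
    text = unicodedata.normalize('NFKC', text)

    # Replace runs of two or more hyphens (section rules such as "------") with a space.
    text = re.sub(r'-{2,}', ' ', text)

    # Keep only ASCII letters, digits, whitespace, and common punctuation and currency symbols.
    # Anything else, including typographic quotes and accented letters, is removed.
    text = re.sub(r"[^a-zA-Z0-9\s.,!?;:%$€£¥\-\'\"()\[\]{}/]", '', text)

    # Collapse every whitespace run (including newlines) into a single space.
    text = re.sub(r'\s+', ' ', text).strip()

    return text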

def chunk_text(text, max_tokens=350, encoding_name="cl100k_base"):
    """Split text into consecutive, non-overlapping chunks of at most max_tokens tokens."""
    encoder = tiktoken.get_encoding(encoding_name)
    tokens = encoder.encode(text)
    total_tokens = len(tokens)

    chunks = []
    start_idx = 0
    chunk_num = 1

    while start_idx < total_tokens:
        end_idx = min(start_idx + max_tokens, total_tokens)
        chunk_tokens = tokens[start_idx:end_idx]
        # Named chunk_str so the local variable does not shadow this function's own name.
        chunk_str = encoder.decode(chunk_tokens)

        # Chunk IDs are 1-based and share a fixed document prefix.
        chunk_id = f"fin_report_2024_{chunk_num}"
        chunks.append({
            "id": chunk_id,
            "text": chunk_str,
            "token_count": len(chunk_tokens)
        })

        start_idx = end_idx
        chunk_num += 1

    return chunks

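def assign_metadata(chunks, source, dept, roles):
    """Attach source, department, and allowed-role metadata to every chunk (in place)."""
    for chunk in chunks:
        chunk["metadata"] = {
            "source": source,
            "department": dept,
            # Copy the list so chunks do not all share one mutable roles object.
            "allowed_roles": list(roles)
        }
    return chunks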


cleaned_text = clean_text(raw_text)
chunks = chunk_text(cleaned_text)
chunks_with_metadata = assign_metadata(chunks, source_name, department, allowed_roles)


print(json.dumps(chunks_with_metadata, indent=2))

[
  {
    "id": "fin_report_2024_1",
    "text": "Financial Report for FinSolve Technologies Inc. - 2024 Executive Summary 2024 marked a year of both opportunity and challenge for FinSolve Technologies. Despite a robust revenue increase, we saw significant pressure in certain expense categories, notably vendor-related costs and software subscriptions. However, these pressures were balanced by cost-saving measures in operational efficiency, strong gross margin performance, and strategic investment in growth areas. The company is well-positioned to continue scaling its core offerings, but focused attention on cost optimization will be essential for maintaining profitability in the coming years. Year-Over-Year YoY Analysis FinSolve Technologiess revenue grew by 25% in 2024, driven largely by the global expansion of its services, especially in Asia and Europe. This was accompanied by a 10% increase in vendor-related expenses, impacting overall profit margins. While gross profit increased b