
In [18]:
import re
import unicodedata
import tiktoken


raw_financial_data = """Financial Report for FinSolve Technologies Inc. - 2024 Executive Summary ------------------------------------------- 2024 marked a year of both opportunity and challenge for FinSolve Technologies. Despite a robust revenue increase, we saw significant pressure in certain expense categories, notably vendor-related costs and software subscriptions. However, these pressures were balanced by cost-saving measures in operational efficiency, strong gross margin performance, and strategic investment in growth areas. The company is well-positioned to continue scaling its core offerings, but focused attention on cost optimization will be essential for maintaining profitability in the coming years. Year-Over-Year YoY Analysis ------------------------------------------- FinSolve Technologiess revenue grew by 25% in 2024, driven largely by the global expansion of its services, especially in Asia and Europe. This was accompanied by a 10% increase in vendor-related expenses, impacting overall profit margins. While gross profit increased by 25%, reflecting higher operational efficiency, net income saw a more modest increase of 12%. This suggests that while revenue growth is strong, controlling vendor costs and maintaining healthy cash flows remain key to long-term profitability. Expense Breakdown by Category ------------------------------------------- The primary drivers of expense in 2024 were 1. Vendor Services - A total of $30M, representing a 18% increase from the previous year. The largest contributors were - Marketing-related expenses Dinner, corporate events 40% of vendor services. - Training and education expenses 30% of vendor services. - Software subscriptions cloud services, licensing 25% of vendor services. - Other miscellaneous expenses 5% of vendor services. Analysis The Dinner and Training categories accounted for an increasing share of the marketing budget. 
While essential for brand positioning, these expenses need tighter management, potentially through vendor renegotiations or reduced event frequency. 2. Software Subscriptions - A significant expense totaling $25M, up 22% from 2023. Given the heavy reliance on cloud-based tools and SaaS subscriptions, this area could benefit from more rigorous contract negotiation and potential consolidation of service providers. 3. Employee Benefits and HR Costs - With FinSolve Technologiess growth in headcount, HR expenses benefits, recruitment, training saw a 10% increase. While employee growth is essential, optimizing benefits packages and hiring processes could reduce per-employee cost. 4. Other Operational Expenses - A mix of general operational and administrative expenses totaling $15M, with a notable increase in travel and miscellaneous office costs, which grew by 8% yea"""


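def clean_text(text: str) -> str:
    """Normalize text and strip characters outside a small allow-list."""
    if not isinstance(text, str) or not text:
        return ""

    # Fold compatibility characters (e.g. full-width digits) to canonical forms.
    text = unicodedata.normalize('NFKC', text)

    # Replace runs of two or more hyphens (section rules) with a space.
    text = re.sub(r'-{2,}', ' ', text)

    # Keep only letters, digits, whitespace, and basic punctuation.
    pattern_to_remove = r"[^a-zA-Z0-9\s.,!?\-\'$%]"
    text = re.sub(pattern_to_remove, '', text)

    # Collapse whitespace runs, including newlines, into single spaces.
    text = re.sub(r'\s+', ' ', text)

    return text.strip()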


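def chunk_text_by_tokens(
    cleaned_text: str,
    doc_id_prefix: str = "finsolve_2024",
    max_tokens: int = 512,
    model_name: str = "cl100k_base"
) -> list[dict]:
    """Split text into consecutive, non-overlapping chunks of at most max_tokens tokens.

    Note: model_name is passed to tiktoken.get_encoding, so it must be an
    encoding name such as "cl100k_base", not a model name.
    """
    # A non-positive limit would never advance the loop below.
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")

    try:
        encoder = tiktoken.get_encoding(model_name)
    except Exception as e:
        print(f"Error loading tokenizer: {e}")
        return []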


    all_tokens = encoder.encode(cleaned_text)
    total_token_count = len(all_tokens)

    chunks = []
    start_index = 0
    seq_num = 1


    while start_index < total_token_count:
        end_index = min(start_index + max_tokens, total_token_count)

        chunk_tokens_list = all_tokens[start_index:end_index]
        current_count = len(chunk_tokens_list)


        chunk_text_str = encoder.decode(chunk_tokens_list)

        chunks.append({
            "id": f"{doc_id_prefix}_part_{seq_num:03d}",
            "tokens": current_count,
            "text": chunk_text_str
        })

        start_index = end_index
        seq_num += 1

    return chunks


if __name__ == "__main__":
    print("--- Starting Process ---")

    print("Cleaning text...")
    cleaned_data = clean_text(raw_financial_data)

    token_limit = 350
    print(f"Chunking text (Max {token_limit} tokens)...")
    chunked_data = chunk_text_by_tokens(cleaned_data, max_tokens=token_limit)

    print(f"\nTotal Chunks Generated: {len(chunked_data)}\n")

    for segment in chunked_data:
        print(f"CHUNK ID: {segment['id']}  |  TOKEN COUNT: {segment['tokens']}")
        print(segment['text'])
        print("\n")
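
The fixed-size slicing used by `chunk_text_by_tokens` above can be checked without downloading a tokenizer. The following is a minimal sketch under stated assumptions: `window_bounds` is a hypothetical helper that is not part of the notebook, and whitespace-separated words stand in for tiktoken tokens, so the counts will not match real token counts.

```python
def window_bounds(total_tokens: int, max_tokens: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs for consecutive, non-overlapping windows.

    Hypothetical helper for illustration; it mirrors the start/end arithmetic
    of the while-loop in chunk_text_by_tokens.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be a positive integer")
    return [
        (start, min(start + max_tokens, total_tokens))
        for start in range(0, total_tokens, max_tokens)
    ]


# Assumption: words stand in for tokenizer output, purely to show the slicing.
words = "a b c d e f g".split()
bounds = window_bounds(len(words), max_tokens=3)
pieces = [" ".join(words[start:end]) for start, end in bounds]
```

With seven stand-in tokens and a limit of three, the helper yields two full windows and one shorter final window, which is the same shape the notebook produces when the last chunk holds fewer than `max_tokens` tokens.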

--- Starting Process ---
Cleaning text...
Chunking text (Max 350 tokens)...

Total Chunks Generated: 2

CHUNK ID: finsolve_2024_part_001  |  TOKEN COUNT: 350
Financial Report for FinSolve Technologies Inc. - 2024 Executive Summary 2024 marked a year of both opportunity and challenge for FinSolve Technologies. Despite a robust revenue increase, we saw significant pressure in certain expense categories, notably vendor-related costs and software subscriptions. However, these pressures were balanced by cost-saving measures in operational efficiency, strong gross margin performance, and strategic investment in growth areas. The company is well-positioned to continue scaling its core offerings, but focused attention on cost optimization will be essential for maintaining profitability in the coming years. Year-Over-Year YoY Analysis FinSolve Technologiess revenue grew by 25% in 2024, driven largely by the global expansion of its services, especially in Asia and Europe. This was accompanied by