# Documentation

## Overview
This script processes article data using OpenAI's GPT model to categorize articles, identify niches, extract key themes, and recurring topics. The results are saved in a Parquet file format.

## Key Functionalities
1. **Data Loading:**
   - Merges data from two Parquet files.
2. **Category and Niche Extraction:**
   - Uses LLM to categorize articles and identify niches based on titles and content.
3. **Key Themes and Recurring Topics:**
   - Extracts broad themes and specific topics from text excerpts.
4. **Error Handling:**
   - Implements retry logic for API calls and checkpoint saving.

## Dependencies
- pandas
- openai
- tqdm
- json
- re
- os

## Code Walkthrough
### 1. Data Loading
- Loads and merges Parquet files into a DataFrame.
- Displays the count of loaded rows.

### 2. Category and Niche Extraction
- `process_title`: Sends article titles to LLM for category and niche extraction.
- `process_text_chunks`: Processes content in chunks if the title is insufficient.

### 3. Key Themes and Recurring Topics
- `extract_key_themes_and_recurring_topics`: Extracts themes from the introduction and recurring topics from the entire text.

### 4. Error Handling
- Implements retries for API rate limits.
- Saves checkpoints periodically.

### 5. Saving Results
- Saves processed results into a Parquet file.

## Usage
Run the script with a valid OpenAI API key set in the environment variables:
```
export OPENAI_API_KEY="your_api_key"
python script.py
```

## Conclusion
This script automates the categorization of articles and extraction of insights using OpenAI's API, providing a structured approach to handling large datasets.



# Code

In [None]:
import os
import pandas as pd
import json
import re
from tqdm import tqdm
import openai
import time

openai.api_key = os.getenv("OPENAI_API_KEY")

print("Loading data...")
df1 = pd.read_parquet("kpmg_india/kpmg_final_concatenated_insights_gzip.parquet")
df2 = pd.read_parquet("pwc_india/pwc_final_concatenated_insights_gzip.parquet")
df = pd.concat([df1, df2], ignore_index=True)
print(f"Loaded {len(df)} rows after merging")

def save_checkpoint(df, iteration, total):
    checkpoint_file = f"checkpoint_{iteration}_of_{total}.parquet"
    df.to_parquet(checkpoint_file, compression="gzip")
    print(f"Checkpoint saved: {checkpoint_file}")

def extract_json(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        json_pattern = r'({[\s\S]*})'
        json_match = re.search(json_pattern, text)
        if json_match:
            try:
                return json.loads(json_match.group(1))
            except json.JSONDecodeError:
                pass
        category_match = re.search(r'category[:\s]+["\'](.*?)["\']', text, re.IGNORECASE)
        niche_match = re.search(r'niche[:\s]+["\'](.*?)["\']', text, re.IGNORECASE)
        if category_match and niche_match:
            return {"category": category_match.group(1), "niche": niche_match.group(1)}
        return {"category": "", "niche": ""}

def extract_list_from_response(text):
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "items" in data:
            return data["items"]
        elif isinstance(data, list):
            return data
        
        matches = re.findall(r'[\"\']([^\"\']+)[\"\']', text)
        if matches:
            return matches
            
        list_items = re.findall(r'(?:^|\n)[\d\-\*]+\.?\s*([^\n]+)', text)
        if list_items:
            return [item.strip() for item in list_items]
            
    except json.JSONDecodeError:
        pass
    
    if "," in text:
        return [item.strip() for item in text.split(",") if item.strip()]
    else:
        return [item.strip() for item in text.split("\n") if item.strip()]

def invoke_llm(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0
            )
            return response.choices[0].message['content'].strip()
        except Exception as e:
            if "429" in str(e):
                wait_time = 2 ** attempt
                print(f"Rate limit exceeded. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"LLM request failed: {e}")
                break
    return ""

def process_title(title):
    prompt = f"""
    Given the following article title:
    "{title}"

    Classify it into one of these categories:
    ["Supply Chain", "Energy Renewables", "Cyber Security", "Economy and Growth", "ESG", "Technology", "Risk Regulation", "Workforce", "Transformation", "India (Country)", "Healthcare"]

    Additionally, identify a relevant **niche** within this category.
    The **category must be from the list**, but the **niche can be any relevant term**.

    Return ONLY a JSON object with 2 keys: "category" and "niche".
    Example: {{"category": "Technology", "niche": "Cloud Computing"}}
    
    Do not include ANY explanatory text before or after the JSON.
    """
    response = invoke_llm(prompt)
    return extract_json(response)

def process_text_chunks(text):
    chunk_size = 400
    words = text.split()
    num_chunks = min(2, len(words) // chunk_size + (1 if len(words) % chunk_size > 0 else 0))

    category_votes = []
    niche_votes = []

    for i in range(num_chunks):
        chunk = " ".join(words[i * chunk_size:(i + 1) * chunk_size])
        prompt = f"""
        Given the following article excerpt:
        "{chunk}"

        Classify it into one of these categories:
        ["Supply Chain", "Energy Renewables", "Cyber Security", "Economy and Growth", "ESG", "Technology", "Risk Regulation", "Workforce", "Transformation", "India (Country)", "Healthcare"]

        Additionally, identify a relevant **niche** within this category.
        The **category must be from the list**, but the **niche can be any relevant term**.

        Return ONLY a JSON object with 2 keys: "category" and "niche".
        Example: {{"category": "Technology", "niche": "Cloud Computing"}}
        
        Do not include ANY explanatory text before or after the JSON.
        """
        response = invoke_llm(prompt)
        result = extract_json(response)
        category_votes.append(result.get("category", ""))
        niche_votes.append(result.get("niche", ""))

    final_category = max(set(category_votes), key=category_votes.count) if category_votes else ""
    final_niche = max(set(niche_votes), key=niche_votes.count) if niche_votes else ""

    return {"category": final_category, "niche": final_niche}

def extract_key_themes_recurring_topics_summaries(text):
    """
    Extracts key themes from the introduction (first 1/3rd of the text) and 
    recurring topics from the remaining text, processing the text in chunks.
    
    Additionally, extracts summaries for each chunk and concatenates them into a final summary.
    
    Returns:
        tuple: (list of key themes, list of recurring topics, concatenated summary)
    """
    if pd.isna(text) or text == "":
        return [], [], ""
    
    total_length = len(text)
    intro_threshold = total_length // 3
    chunk_size = 3000
    key_themes = []
    recurring_topics = []
    summaries = []

    for i in range(0, intro_threshold, chunk_size):
        chunk = text[i:i + chunk_size]
        prompt_intro = f'''
        Given the following text excerpt from an article:
        "{chunk}"
        
        Instructions:
        Identify the most important key themes addressed in this text.
        Key themes should be broad concepts or subjects that form the foundation of the content.
        
        Additionally, identify specific recurring topics that are mentioned or discussed multiple times in this excerpt.
        Recurring topics should be specific subjects, terms, or concepts that appear repeatedly.
        
        Finally, provide a concise summary of the text excerpt.
        Just give me the summary do not give an introduction like This excerpt discusses or This text summarizes t.
        
        Return ONLY a JSON object with three keys: "key_themes", "recurring_topics", and "summary".
        Example: {{"key_themes": ["Digital Transformation", "Regulatory Compliance"], 
                  "recurring_topics": ["Supply Chain Disruption", "Blockchain Implementation"], 
                  "summary": "Digital transformation and regulatory compliance are central themes, with discussions on supply chain disruptions and blockchain implementation."}}
        
        Be concise and specific. Each summary should be under 300 words, and should include any of the metrics that are mentioned.
        Do not include ANY explanatory text before or after the JSON.
        '''
        response_intro = invoke_llm(prompt_intro)
        result = extract_json(response_intro) or {}
        
        if "key_themes" in result:
            new_key_themes = result.get("key_themes", [])
            for theme in new_key_themes:
                cleaned_theme = theme.strip()
                if cleaned_theme and cleaned_theme.lower() not in [t.lower() for t in key_themes]:
                    key_themes.append(cleaned_theme)
        else:
            new_key_themes = extract_list_from_response(response_intro)
            for theme in new_key_themes:
                cleaned_theme = theme.strip()
                if cleaned_theme and cleaned_theme.lower() not in [t.lower() for t in key_themes]:
                    key_themes.append(cleaned_theme)
        
        if "recurring_topics" in result:
            new_topics = result.get("recurring_topics", [])
            recurring_topics.extend(new_topics)
        
        summary = result.get("summary", "").strip()
        if summary:
            summaries.append(summary)

    for j in range(intro_threshold, total_length, chunk_size):
        chunk = text[j:j + chunk_size]
        prompt_remaining = f'''
        Given the following text excerpt from an article:
        "{chunk}"
        
        Instructions:
        Identify the recurring topics that are mentioned or discussed multiple times.
        Recurring topics should be specific subjects, terms, or concepts that appear repeatedly.
        
        Additionally, provide a concise summary of the text excerpt.
        Just give me the summary do not give an introduction like This excerpt discusses or This text summarizes t.
        
        Return ONLY a JSON object with two keys: "items" and "summary".
        Example: {{"items": ["Supply Chain Disruption", "Blockchain Implementation"], 
                  "summary": "Digital transformation and regulatory compliance are central themes, with discussions on supply chain disruptions and blockchain implementation."}}
        
        Be specific and detailed. Each summary should be under 300 words, and should include any of the metrics that are mentioned.
        Do not include ANY explanatory text before or after the JSON.
        '''
        response_remaining = invoke_llm(prompt_remaining)
        result = extract_json(response_remaining)
        
        if "items" in result:
            remaining_topics = result.get("items", [])
            recurring_topics.extend(remaining_topics)
        else:
            remaining_topics = extract_list_from_response(response_remaining)
            recurring_topics.extend(remaining_topics)
        
        summary = result.get("summary", "").strip()
        if summary:
            summaries.append(summary)

    unique_recurring_topics = []
    lower_topics = set()
    for topic in recurring_topics:
        cleaned_topic = topic.strip()
        if cleaned_topic and cleaned_topic.lower() not in lower_topics:
            unique_recurring_topics.append(cleaned_topic)
            lower_topics.add(cleaned_topic.lower())
    
    final_summary = "\n".join(summaries) if summaries else ""
    
    return key_themes, unique_recurring_topics, final_summary

def main():
    if "category" not in df.columns:
        df["category"] = None
    if "niche" not in df.columns:
        df["niche"] = None
    if "key_themes" not in df.columns:
        df["key_themes"] = None
    if "recurring_topics" not in df.columns:
        df["recurring_topics"] = None
    if "summary" not in df.columns:
        df["summary"] = None

    total_rows = len(df)
    checkpoint_interval = 50

    try:
        for i in tqdm(range(total_rows)):
            title = df.at[i, "title"]
            text = df.at[i, "normalized_concatenated_text"] if "normalized_concatenated_text" in df.columns else ""

            if (pd.notna(df.at[i, "category"]) and 
                pd.notna(df.at[i, "niche"]) and 
                pd.notna(df.at[i, "key_themes"]) and 
                pd.notna(df.at[i, "recurring_topics"]) and 
                pd.notna(df.at[i, "summary"])):
                continue

            if pd.isna(title) or title == "":
                df.at[i, "category"] = ""
                df.at[i, "niche"] = ""
                df.at[i, "key_themes"] = []
                df.at[i, "recurring_topics"] = []
                df.at[i, "summary"] = ""
                continue

            if pd.isna(df.at[i, "category"]) or pd.isna(df.at[i, "niche"]):
                title_result = process_title(title)
                text_result = process_text_chunks(text) if text else {"category": "", "niche": ""}
                final_category = title_result["category"] if title_result["category"] else text_result["category"]
                final_niche = title_result["niche"] if title_result["niche"] else text_result["niche"]
                df.at[i, "category"] = final_category
                df.at[i, "niche"] = final_niche
                print(f"Row {i}: {title} → Category: {final_category}, Niche: {final_niche}")

            if pd.isna(df.at[i, "key_themes"]) or pd.isna(df.at[i, "recurring_topics"]) or pd.isna(df.at[i, "summary"]):
                key_themes, recurring_topics, final_summary = extract_key_themes_recurring_topics_summaries(text)
                df.at[i, "key_themes"] = key_themes
                df.at[i, "recurring_topics"] = recurring_topics
                df.at[i, "summary"] = final_summary
                print(f"Row {i}: Key Themes: {key_themes}")
                print(f"Row {i}: Recurring Topics: {recurring_topics}")
                print(f"Row {i}: Final Summary: {final_summary}")

            if (i + 1) % checkpoint_interval == 0:
                save_checkpoint(df, i + 1, total_rows)

        df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True).dt.strftime('%m/%d/%y')
        df.to_parquet("final_categorized_with_themes_and_summaries.parquet", compression="gzip")
        print("Processing complete. Final results saved.")
    
    except Exception as e:
        print(f"Error during processing: {e}")
        save_checkpoint(df, "error", total_rows)
        print("Progress saved before error.")
        raise e

if __name__ == "__main__":
    main()


Loading data...
Loaded 17 rows after merging


  0%|          | 0/17 [00:00<?, ?it/s]

Row 0: Issue no. 103 | February 2025 → Category: Economy and Growth, Niche: Inflation Trends


  6%|▌         | 1/17 [01:29<23:49, 89.37s/it]

Row 0: Key Themes: ['Business Combinations', 'Sustainability Disclosure', 'Regulatory Compliance', 'Business Combination Accounting', 'Measurement Period', 'Business Combination', 'Provisional Amount', 'Financial Reporting', 'Accounting Business Combinations', 'Measurement Period Adjustments', 'Fair Value Assessment', 'Depreciation and Goodwill', 'Sustainability Reporting', 'Material Information Disclosure', 'Financial Statement Adjustments', 'Sustainability-related Financial Disclosure', 'Materiality Judgement', 'General Purpose Financial Reporting', 'User Decision-Making']
Row 0: Recurring Topics: ['IFRS', 'Measurement Period', 'Acquisition Method', 'Material Information', 'Sustainability-related Risks and Opportunities', 'Sustainability-related risk and opportunity', 'Material information disclosure', 'Regulatory updates from SEBI', 'Audit committee review', 'Measurement period in business combinations', 'Acquisition Date', 'Adjustment', 'Goodwill', 'Identifiable Assets', 'Liabiliti

 12%|█▏        | 2/17 [02:09<15:03, 60.26s/it]

Row 1: Key Themes: ['Food Security', 'Public Policy', 'Nutrition', 'Sustainable Agriculture', 'Technology Innovation', 'Collaboration', 'Nutritional Security', 'Multistakeholder Approach', 'Agricultural Sustainability', 'Child Malnutrition', 'Nutritional Access', 'School Meal Programs', 'Micronutrient Fortification', 'Child Health and Development', 'Local Agriculture Support']
Row 1: Recurring Topics: ['Public Distribution System (PDS)', 'Malnutrition', 'Chronic Undernourishment', 'Digital Reform', 'Smallholder Farmers', 'Community Participation', 'Zero Hunger', 'Food Waste', 'India', 'Hunger', 'Undernutrition', 'Global Hunger Index', 'Food Distribution', 'Supplementary Nutrition', 'School Feeding Programs', 'Fortified Milk', 'Midday Meal Scheme', 'Ragi and Millet', 'Nutritional Education', 'Anaemia Prevention', 'Poshan Abhiyaan', 'Government Initiatives', 'School Meal Programs', 'Fortified Foods', 'Local Agriculture', 'Technology in Nutrition Tracking', 'Food Subsidy', 'Nutritional St

 18%|█▊        | 3/17 [02:22<09:00, 38.60s/it]

Row 2: Key Themes: ['Financial Crime', 'Regulatory Initiatives', 'Technological Integration', 'Fraud Prevention', 'Compliance Frameworks']
Row 2: Recurring Topics: ['Anti-Money Laundering (AML)', 'Corporate Transparency Act (CTA)', 'Beneficial Ownership Information', 'Fraud Risk Management', 'Financial Institutions', 'Regulatory Compliance', 'Financial Crime Prevention', 'Corporate White-Collar Crime', 'Internal Risk Assessment', 'Due Diligence', 'Digital Payment Advancements', 'Whistleblower Policy', 'Surveillance Systems']
Row 2: Final Summary: Financial crime is an evolving issue, with an estimated USD 31 trillion in illicit funds and USD 485.6 billion in fraud and scams expected in 2023. Technological advancements complicate the landscape for financial institutions (FIs). Key regulatory initiatives include the Corporate Transparency Act (CTA), effective January 1, 2024, requiring companies to file beneficial ownership information by January 1, 2025. The U.S. Department of Justice w

 24%|██▎       | 4/17 [03:54<12:59, 59.97s/it]

Row 3: Key Themes: ['Digital Transformation', 'Operational Efficiency', 'AI Adoption', 'Workforce Upskilling', 'Cybersecurity', 'Interoperability', 'Technology Innovation', 'Value Creation', 'Risk Management', 'AI and Quantum Computing', 'Industry Insights', 'Technology Investment', 'Organizational Strategy', 'Competitive Advantage', 'Strategic Implementation', 'AI and Automation', 'Organizational Agility', 'Profitability', 'Technology Adoption', 'Tech Debt', 'Investment Strategies', 'Leadership in Technology']
Row 3: Recurring Topics: ['Industrial Manufacturing', 'KPMG Global Tech Report', 'Digital Maturity', 'Supply Chain', 'Procurement', 'Finance', 'Return on Investment (ROI)', 'Data Strategy', 'KPMG Global Tech Report 2024', 'Technology Professionals', 'Evidence-Based Decision Making', 'Senior Leadership', 'Generative AI', 'Quantum Computing', 'High Performers', 'Fear of Missing Out (FOMO)', 'Technology Adoption', 'Market Trends', 'XaaS', 'Cloud Computing', 'Cybersecurity', 'Data A

 29%|██▉       | 5/17 [05:22<13:59, 69.98s/it]

Row 4: Key Themes: ['Digital Transformation', 'Strategic Investment', 'Technology Innovation', 'Risk Management', 'AI and ESG', 'Technology Adoption', 'Value Optimization', 'Evidence-Based Decision Making', 'Industry Insights', 'Technology Investment', 'Competitive Advantage', 'Profitability Improvement', 'Cloud Computing', 'Technical Debt', 'Organizational Strategy', 'Sustainability', 'Data-Driven Decision Making']
Row 4: Recurring Topics: ['KPMG Global Tech Report', 'Technology Sector', 'Evidence-Based Decision Making', 'Technical Debt', 'Generative AI', 'Quantum Computing', 'KPMG Global Tech Report 2024', 'AI Solutions', 'Senior Leadership Insights', 'High Performers in Tech', 'Investment Strategies', 'FOMO (Fear of Missing Out)', 'Third-Party Guidance', 'In-House Trials/Proof of Concept', 'Competitor Analysis', 'Customer Feedback', 'Senior Leadership Influence', 'XaaS Technology', 'AI Automation', 'Cybersecurity', 'Data Analytics', 'Executive Buy-in', 'Agility and Cost Reduction', 

 35%|███▌      | 6/17 [06:51<13:59, 76.28s/it]

Row 5: Key Themes: ['Energy Transformation', 'Technological Innovation', 'Strategic Decision-Making', 'Risk Management', 'Technology Adoption', 'Digital Transformation', 'Evidence-Based Decision Making', 'AI Implementation', 'Industry Insights', 'Technology Investment', 'Fear of Missing Out (FOMO)', 'Competitive Advantage', 'Investment Strategy', 'Profitability Improvement', 'Cloud Computing', 'Profitability', 'Tech Debt', 'Ecosystem Partnerships', 'Data-Driven Decision Making', 'Sustainability and Social Responsibility']
Row 5: Recurring Topics: ['AI', 'Advanced Analytics', 'Renewable Integration', 'Cyber Threats', 'Evidence-Based Decision Making', 'KPMG Global Tech Report 2024', 'Senior Leadership', 'C-Suite Executives', 'Value Optimization', 'High Performers', 'Industry Sectors', 'AI and Technology Adoption', 'Third-Party Guidance', 'In-House Trials/Proof of Concept', 'Customer Feedback', 'Competitor Influence', 'XaaS Technology', 'Data Analytics', 'AI Automation', 'Cybersecurity', 

 41%|████      | 7/17 [08:29<13:53, 83.39s/it]

Row 6: Key Themes: ['Tech Innovation', 'Strategic Decision-Making', 'Value Creation', 'Risk Management', 'Business Transformation', 'Digital Transformation', 'AI Implementation', 'Value Optimization', 'Industry Insights', 'Leadership in Technology', 'Technology Investment', 'Fear of Missing Out (FOMO)', 'Technology Adoption', 'Profitability Improvement', 'Strategic Investment', 'Cloud Computing', 'Tech Debt', 'Profitability Enhancement', 'Organizational Strategy', 'Sustainability', 'Decision-Making']
Row 6: Recurring Topics: ['Evidence-Based Decision Making', 'AI Technology', 'Technical Debt', 'Organizational Challenges', 'Leadership in Technology', 'KPMG Global Tech Report 2024', 'High Performers in Technology', 'Senior Leadership Insights', 'Industry Sectors', 'Competitor Influence', 'Third-Party Guidance', 'In-House Trials/Proof of Concept', 'Customer Feedback', 'Return on Investment', 'XaaS', 'AI Automation', 'Cybersecurity', 'Data Analytics', 'Executive Buy-in', 'Agility', 'Cost R

 47%|████▋     | 8/17 [10:18<13:45, 91.68s/it]

Row 7: Key Themes: ['Value-Based Healthcare (VBHC)', 'Quality Measurement Standards', 'Healthcare System Transformation', 'Patient-Centered Care', 'Cost Efficiency', 'Value-Based Healthcare', 'Healthcare Accessibility', 'Technological Innovation', 'Healthcare Quality Standards', 'Government Policy and Regulation', 'Healthcare Infrastructure', 'Government Initiatives', 'Patient Outcomes', 'Healthcare Innovation', 'Regulatory Evolution', 'Healthcare Cost Management', 'Quality Metrics and Performance Incentives', 'Integrated Care Systems', 'Quality Improvement in Healthcare', 'Patient Experience and Outcomes', 'International Healthcare Standards', 'Cost Management', 'Healthcare Access', 'Quality Improvement', 'Data Utilization', 'Chronic Disease Management', 'Preventive Care', 'Cost-Effectiveness', 'Data Analytics in Healthcare', 'Patient-Centric Care', 'Health Equity']
Row 7: Recurring Topics: ['Patient Outcomes', 'Healthcare Costs', 'Healthcare Delivery', 'Incentives for Providers', 'Ch

 53%|█████▎    | 9/17 [12:42<14:23, 107.94s/it]

Row 8: Key Themes: ['Financial Inclusion', 'Investor Empowerment', 'Mutual Fund Industry Growth', 'Technological Integration', 'Financial Literacy', 'Mutual Fund Growth', 'Investment Culture', 'Economic Development', 'Financial Empowerment', 'Digital Transformation', 'Cultural Integration', 'Investor Inclusion', 'Technology Integration', 'Community Engagement', 'Investment Accessibility', 'Cultural Alignment', 'Economic Transformation', 'Investment Choices', 'Mutual Fund Industry', 'Regulatory Evolution', 'Mutual Fund Market Development', 'Investor Education', 'Regulatory Framework', 'Private Sector Innovation', 'Growth of Mutual Funds', 'Investor Participation', 'Regulatory Reforms', 'Technological Advancements', 'Market Volatility', 'Accessibility of Financial Products', 'Regulatory Challenges', 'Investment Confidence', 'Systematic Investment Plans (SIPs)', 'Digital Transformation in Finance', 'Personalized Financial Planning', 'Cost Efficiency in Investments', 'Investor Education an

 59%|█████▉    | 10/17 [13:44<10:57, 94.00s/it]

Row 9: Key Themes: ['Financial Inclusion', 'Economic Growth', 'Financial Health', 'Digital Innovation', 'Poverty Reduction', 'Customer Segmentation', 'Behavioral Insights', 'Data-Driven Solutions', 'Digital Infrastructure', 'Policy Initiatives', 'Economic Stability', 'Debt Management', 'Investment', 'Financial Awareness', 'Access to Financial Services', 'Impact Measurement', 'Financial Wellbeing']
Row 9: Recurring Topics: ['Pradhan Mantri Jan Dhan Yojana (PMJDY)', 'Aadhaar', 'Digital Wallets', 'Direct Benefit Transfer', 'Retail Investors', 'Financial Ecosystem', 'Income Volatility', 'Debt Management', 'Low-Income Households', 'Financial Service Providers', 'Impact Measurement', 'Policy Intervention', 'Market-Linked Investments', 'PMJDY', 'Digital Lending', 'Financial Service Access', 'Rural Population', 'Risk Management', 'Financial Access', 'Credit Score', 'Income Stability', 'Occupation', 'Wealth Level', 'Financial Products', 'Access Metrics', 'Usage Metrics', 'Impact Metrics', 'Prad

 65%|██████▍   | 11/17 [16:30<11:35, 115.87s/it]

Row 10: Key Themes: ['Climate Resilience', 'Sustainable Development', 'Community Engagement', 'Climate Financing', 'Carbon Credit Mechanisms', 'Ecological Vulnerability', 'Economic Development', 'Climate Change Impact', 'Land Use and Land Cover Change', 'Resilience and Vulnerability', 'Green Finance', 'Biodiversity Conservation', 'Infrastructure Development', 'Geopolitical Significance', 'Ecosystem Preservation', 'Socioeconomic Equity', 'Climate Change Mitigation and Adaptation', 'Climate Change', 'Land Use and Land Cover (LULC)', 'Ecosystem Restoration', 'Disaster Impact and Recovery', 'Land Use Change', 'Agricultural Practices', 'Environmental Sustainability', 'Forest Cover', 'Water Bodies', 'Biodiversity Loss', 'Sustainable Management', 'Impact on Small Island Ecosystems', 'Adaptation and Resilience Strategies', 'Coastal Vulnerability', 'Environmental Impact', 'Temperature and Precipitation Changes', 'Natural Disasters', 'Community Resilience', 'Land Use Management']
Row 10: Recurri

 71%|███████   | 12/17 [17:57<08:55, 107.20s/it]

Row 11: Key Themes: ['Retail Reinvention', 'E-commerce Growth', 'Consumer Engagement', 'Channel Strategy', 'Market Dynamics', 'Consumer Behavior', 'Omnichannel Shopping', 'Quick Commerce', 'Omnichannel Retail Integration', 'Jevons Paradox', 'Omnichannel Retail', 'Digital Transformation', 'Retail Ecosystem', 'Offline vs Online Shopping', 'Retail Experience Enhancement', 'Technology Integration in Retail', 'Omnichannel Retailing', 'Q-commerce', 'Retail Disruption', 'Data Analytics', 'Ecommerce Evolution', 'Qcommerce', 'Retail Adaptation']
Row 11: Recurring Topics: ['Tier 2 and Tier 3 Cities', 'Brick-and-Mortar Retail', 'Quick Commerce', 'Digital Commerce', 'Consumer Utility', 'Retail Disruption', 'Customer Expectations', 'Digital Wave', 'Product Availability', 'Delivery Options', 'Market Growth Metrics', 'Ecommerce', 'Traditional Retailers', 'Consumer Experience', 'Survey Insights', 'Metro and Tier Cities', 'Online Shopping', 'Offline Shopping', 'Hybrid Approach', 'Consumer Preferences',

 76%|███████▋  | 13/17 [18:09<05:12, 78.13s/it] 

Row 12: Key Themes: ['Cybersecurity Challenges', 'Regulatory Compliance', 'Financial Services Risk Management', 'Data Protection']
Row 12: Recurring Topics: ['KYC (Know Your Customer)', 'AML (Anti-Money Laundering)', 'Geopolitical Risks', 'Crypto Exchange Vulnerabilities', 'National Risk Assessment (NRA)', 'Cybersecurity', 'Incident Response', 'Geopolitical Risk', 'Regulatory Compliance', 'Client Due Diligence', 'Know Your Customer (KYC) Procedures', 'Risk Management']
Row 12: Final Summary: The PwC India Financial Services Risk Symposium held on February 6, 2025, in Gurugram focused on the evolving complexities of cybersecurity and regulatory compliance in the financial sector. Keynote speaker Smarak Swain from the Ministry of Finance highlighted the vulnerabilities in the crypto field, particularly the significant losses attributed to poor cybersecurity practices, with incidents resulting in losses of around USD 230 million. The symposium emphasized the importance of a top-down strat

 82%|████████▏ | 14/17 [21:13<05:30, 110.30s/it]

Row 13: Key Themes: ['Consumer Spending Behavior', 'Economic Growth', 'Digital Infrastructure', 'Investment Opportunities', 'Middle Class Expansion', 'Digital Payment Adoption', "Financial Institutions' Strategies", 'Economic Factors Influencing Consumption', 'E-commerce Growth', 'Fintech Solutions', 'Psychological Factors in Consumption', 'Government Regulation', 'Payment Modes', 'Digital Payment Methods', 'Income Segmentation', 'Financial Inclusion', 'Expenditure Patterns', 'Consumer Engagement', 'Financial Services Innovation', 'Investment Trends', 'Digital Payment Solutions', 'Healthcare Financing', 'Consumer Behavior', 'Digital Accessibility', 'Financial Data Analysis', 'Geographic Segmentation', 'Payment Instruments', 'Economic Factors', 'Inflation Impact', 'Employment Trends', 'Digital Commerce', 'Government Policy and Regulation', 'Income Distribution', 'Obligatory vs Discretionary Expenses', 'Loan Repayment', 'Income Levels', 'Loan Repayment Trends', 'Income and Expenditure Pa

 88%|████████▊ | 15/17 [21:44<02:52, 86.22s/it] 

Row 14: Key Themes: ['Quality Evolution', 'Technological Integration', 'Customer-Centric Approach', 'Business Transformation', 'Stakeholder Engagement', 'Quality Management', 'Technological Innovation', 'Employee Engagement', 'Sustainability', 'Adaptability']
Row 14: Recurring Topics: ['Quality 50', 'Quality Management Systems', 'Industry 4.0', 'Data-Driven Approaches', 'Consumer Expectations', 'Data Management', 'Collaboration', 'Continuous Learning', 'Product Quality', 'Employee Satisfaction', 'Service Quality', 'Sustainability', 'Customer Experience', 'Data Quality', 'Real-time Quality Monitoring', 'Predictive Quality Analytics', 'Operational Success', 'Big Data', 'Cloud Solutions', 'Data Quality Assurance', 'Blockchain', 'Quality Control', 'Regulatory Compliance', 'Employee Wellbeing', 'Customer Loyalty', 'Quality Management', 'Quality Management System', 'Change Management', 'Data Infrastructure', 'Employee Buy-in', 'Technology Selection', 'Process Integration', 'Operational Effic

 94%|█████████▍| 16/17 [22:18<01:10, 70.62s/it]

Row 15: Key Themes: ['Power Automation Agent', 'Artificial Intelligence', 'Intelligent Automation', 'Business Transformation', 'Efficiency and Productivity', 'Multi-Agent Systems', 'Workflow Orchestration', 'User-Centric Design', 'AI and Human Collaboration', 'Automation and Efficiency', 'Decision Making and Insight', 'Adaptability and Context Awareness']
Row 15: Recurring Topics: ['AI Technology', 'Large Language Models (LLMs)', 'Governance', 'Automation', 'Decision Making', 'GenAI', 'Agent', 'User Interaction', 'Feedback', 'AI Agents', 'Human Capability Augmentation', 'Repetitive Tasks', 'Emotional Intelligence', 'Ethical Decision Making', 'Agentic Workflow', 'AI Integration', 'Self-learning', 'Self-healing', 'Human Feedback', 'Process Improvement', 'Task Execution', 'Ethical Guidelines', 'Performance Analysis', 'Agent Orchestration', 'Efficiency', 'Workflow Management', 'Invoice Processing', 'Expense Reconciliation', 'Cash Flow Management', 'Demand Forecasting', 'Inventory Managemen

100%|██████████| 17/17 [23:12<00:00, 81.90s/it]

Row 16: Key Themes: ['Economic Growth', 'Investment Trends', 'Mergers and Acquisitions', 'Private Equity', 'Market Dynamics', 'Domestic Transactions', 'Market Trends', 'Investment in Technology', 'Healthcare Sector Growth', 'Private Equity Activity', 'Capital Investment', 'Initial Public Offerings (IPOs)', 'Sector Consolidation']
Row 16: Recurring Topics: ['Deal Volume', 'Transaction Value', 'Private Equity Investment', "India's GDP Growth", 'Foreign Direct Investment', 'Deal Value', 'Domestic Deals', 'Technology Sector', 'Healthcare Sector', 'Average Ticket Size', 'Quarterly Trends', 'Large Deals', 'Energy Sector', 'Entertainment Sector', 'Healthcare', 'Renewable Energy', 'Strategic Investments', 'Domestic Mergers', 'IPOs', 'Small and Medium Enterprises (SMEs)', 'Initial Public Offering (IPO)', 'Mainboard IPO', 'SME IPO', 'Investment in Real Estate', 'Private Credit', 'Nifty 50 Index', 'Global Investment Climate', 'Credit Growth', 'Market Trends', 'Insolvency and Bankruptcy Code (IBC)


  df['date'] = pd.to_datetime(df['date'], errors='coerce', infer_datetime_format=True).dt.strftime('%m/%d/%y')


Processing complete. Final results saved.


In [14]:
date_mapping = df2.set_index("title")["date"]
df.loc[df["title"].isin(date_mapping.index), "date"] = df["title"].map(date_mapping)

df.head(16)

df.to_parquet("final_categorized_with_themes_and_summaries.parquet", compression="gzip")
print("Processing complete. Final results saved.")

Processing complete. Final results saved.


In [22]:
lengths = df['concatenated_text'].apply(len)
average_length = lengths.mean()

print("Average length of 'concatenated_text':", average_length)

lengths = df['normalized_concatenated_text'].apply(len)
average_length = lengths.mean()

print("Average length of 'normalized_concatenated_text':", average_length)

lengths = df['summary'].apply(len)
average_length = lengths.mean()

print("Average length of 'summary':", average_length)

Average length of 'concatenated_text': 66063.35294117648
Average length of 'normalized_concatenated_text': 48706.117647058825
Average length of 'summary': 17242.176470588234


In [24]:
import pandas as pd

df = pd.read_parquet("final_categorized_with_themes_and_summaries.parquet")

In [25]:
df.head()

Unnamed: 0,source,url_link,title,description,date,content,pdf_link,pdf_content,concatenated_text,tokenized_text,normalized_concatenated_text,category,niche,key_themes,recurring_topics,summary
0,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/aau-ac...,Issue no. 103 | February 2025,This edition of AAU covers relevant financial ...,02/28/25,"Ind AS 103, Business Combination provides guid...",https://kpmg.com/content/dam/kpmgsites/in/pdf/...,February 2025\nkpmg.com/in\nAccounting and \nA...,"Ind AS 103, Business Combination provides guid...",ind_as as_103 103_business business_combinatio...,ind 103 business combination provide guidance ...,Economy and Growth,Inflation Trends,"[Business Combinations, Sustainability Disclos...","[IFRS, Measurement Period, Acquisition Method,...",The text addresses the guidance provided by IN...
1,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/food-a...,Food and Nutritional Security in India,Solutions for achieving zero hunger and ensuri...,02/20/25,Food security has been a critical aspect of In...,https://kpmg.com/content/dam/kpmgsites/in/pdf/...,Food and nutritional \nsecurity in India\nSolu...,Food security has been a critical aspect of In...,food_security security_has has_been been_a a_c...,food security critical aspect indias public po...,India (Country),Agricultural Policy,"[Food Security, Public Policy, Nutrition, Sust...","[Public Distribution System (PDS), Malnutritio...",Food security is a critical aspect of India's ...
2,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/financ...,Financial Crime Bulletin,Dive deep into the financial crime avenues and...,02/10/25,Financial crimes have become an ever-evolving ...,,,Financial crimes have become an ever-evolving ...,financial_crimes crimes_have have_become becom...,financial crime become everevolving problem me...,Risk Regulation,Financial Compliance,"[Financial Crime, Regulatory Initiatives, Tech...","[Anti-Money Laundering (AML), Corporate Transp...","Financial crime is an evolving issue, with an ..."
3,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/kpmg-g...,KPMG global tech report – industrial manufactu...,"Interoperability, hybrid models and AI innovat...",02/07/25,In the rapidly evolving landscape of industria...,https://kpmg.com/content/dam/kpmgsites/xx/pdf/...,KPMG global \ntech report 2024\nKPMG Internati...,In the rapidly evolving landscape of industria...,in_the the_rapidly rapidly_evolving evolving_l...,rapidly evolve landscape industrial manufactur...,Technology,Industrial IoT,"[Digital Transformation, Operational Efficienc...","[Industrial Manufacturing, KPMG Global Tech Re...",The KPMG Global Tech Report 2024 highlights th...
4,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/kpmg-g...,KPMG global tech report: Technology insights,Tech: A bold sector that innovates while leadi...,02/07/25,The digital transformation journey is an impor...,https://kpmg.com/content/dam/kpmgsites/xx/pdf/...,KPMG global \ntech report 2024\nKPMG Internati...,The digital transformation journey is an impor...,the_digital digital_transformation transformat...,digital transformation journey important strat...,Technology,Emerging Technologies,"[Digital Transformation, Strategic Investment,...","[KPMG Global Tech Report, Technology Sector, E...",The text emphasizes the importance of digital ...
