# Documentation

## Overview
This script uses an OpenAI GPT model (`gpt-4o-mini`) to categorize articles, identify niches, and extract key themes and recurring topics. The results are saved to a Parquet file.

## Key Functionalities
1. **Data Loading:**
   - Concatenates the rows of two Parquet files into one DataFrame.
2. **Category and Niche Extraction:**
   - Uses LLM to categorize articles and identify niches based on titles and content.
3. **Key Themes and Recurring Topics:**
   - Extracts broad themes and specific topics from text excerpts.
4. **Error Handling:**
   - Retries rate-limited API calls and saves periodic checkpoints.
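
The LLM-backed steps above depend on turning a model reply into a dictionary. The following is a minimal sketch of that parsing step, assuming the reply may carry extra text around the JSON; `parse_json_object` is a hypothetical helper name used for illustration, not a function from the script.

```python
import json
import re

def parse_json_object(reply):
    """Return the JSON object found in reply as a dict, or {} if none parses."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span, e.g. when the reply
        # has explanatory text before or after the JSON.
        match = re.search(r"\{[\s\S]*\}", reply)
        if not match:
            return {}
        try:
            parsed = json.loads(match.group(0))
        except json.JSONDecodeError:
            return {}
    # Valid JSON that is not an object (a list, a string) is treated as a miss.
    return parsed if isinstance(parsed, dict) else {}

wrapped_reply = 'Here you go:\n{"category": "Technology", "niche": "Cloud Computing"}\nDone.'
print(parse_json_object(wrapped_reply))
```

Returning an empty dict on failure lets callers use `.get("category", "")` without special-casing malformed replies.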

## Dependencies
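- pandas (with a Parquet engine such as pyarrow)
- openai (a pre-1.0 release; the script calls the legacy `openai.ChatCompletion` interface, which was removed in version 1.0)
- tqdm
- json, re, os, time (standard library)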

## Code Walkthrough
### 1. Data Loading
- Loads two Parquet files and concatenates them into a single DataFrame.
- Displays the count of loaded rows.

### 2. Category and Niche Extraction
- `process_title`: Sends article titles to LLM for category and niche extraction.
- `process_text_chunks`: Classifies up to two 400-word chunks of the content and takes a majority vote; the result is used only when the title yields no category or niche.
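
The chunk-and-vote idea can be sketched without any API calls. This is an illustrative sketch, not the script's code: `split_into_word_chunks` and `majority_vote` are hypothetical helper names, and the sketch skips empty votes so that a failed chunk cannot win the vote.

```python
from collections import Counter

def split_into_word_chunks(text, chunk_size=400, max_chunks=2):
    """Split text into chunks of at most chunk_size words, keeping the first max_chunks."""
    words = text.split()
    chunks = [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]
    return chunks[:max_chunks]

def majority_vote(votes):
    """Return the most common non-empty vote, or "" when there is none."""
    counted = Counter(vote for vote in votes if vote)
    if not counted:
        return ""
    # most_common(1) returns [(value, count)]; ties go to the value seen first.
    return counted.most_common(1)[0][0]

chunks = split_into_word_chunks("word " * 1000)
print(len(chunks), len(chunks[0].split()))
print(majority_vote(["Technology", "", "Technology", "ESG"]))
```

With only two chunks per article a tie is common, so the order in which votes are collected effectively decides the result.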

### 3. Key Themes and Recurring Topics
- `extract_key_themes_and_recurring_topics`: Extracts key themes from the introduction (the first third of the text) and recurring topics from the entire text, processing both parts in 3,000-character chunks.
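
As a rough sketch of the two mechanical parts of this step, the code below computes the chunk start offsets for the introduction and the remainder, and removes case-insensitive duplicates from the collected topics. `split_offsets` and `dedupe_case_insensitive` are hypothetical names chosen for illustration.

```python
def split_offsets(total_length, chunk_size=3000):
    """Return (intro_starts, remaining_starts): chunk start offsets for the
    first third of the text and for the rest."""
    intro_threshold = total_length // 3
    intro_starts = list(range(0, intro_threshold, chunk_size))
    remaining_starts = list(range(intro_threshold, total_length, chunk_size))
    return intro_starts, remaining_starts

def dedupe_case_insensitive(items):
    """Drop blanks and case-insensitive duplicates, preserving first-seen order."""
    seen = set()
    unique = []
    for item in items:
        cleaned = item.strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            unique.append(cleaned)
    return unique

print(split_offsets(10000))
print(dedupe_case_insensitive(["Tech Debt", "tech debt ", "", "Cloud Computing"]))
```

Because each chunk is sliced as `text[start:start + chunk_size]`, an introduction chunk that starts just before the threshold can extend past it, so the two passes may read some of the same characters.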

### 4. Error Handling
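- Retries rate-limited (HTTP 429) API calls up to five times with exponential backoff; any other error ends the attempt and yields an empty response.
- Saves a checkpoint every 50 rows, and once more if processing fails.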
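
The retry pattern can be sketched independently of the OpenAI client. In this minimal example, `call_with_backoff` and `flaky_request` are hypothetical names, and the `sleep` parameter is an assumption added so the waits can be recorded instead of actually slept.

```python
import time

def call_with_backoff(request, max_retries=5, sleep=time.sleep):
    """Call request(); on a rate-limit error wait 2**attempt seconds and retry.

    Returns the result, or "" once retries are exhausted or on any other error.
    """
    for attempt in range(max_retries):
        try:
            return request()
        except Exception as exc:
            if "429" not in str(exc):
                print(f"Request failed: {exc}")
                break
            wait_time = 2 ** attempt
            print(f"Rate limit exceeded. Retrying in {wait_time} seconds...")
            sleep(wait_time)
    return ""

# Hypothetical stand-in for an API call: fails twice with a 429, then succeeds.
attempts = {"count": 0}

def flaky_request():
    attempts["count"] += 1
    if attempts["count"] <= 2:
        raise RuntimeError("HTTP 429: too many requests")
    return "ok"

waits = []
result = call_with_backoff(flaky_request, sleep=waits.append)
print(result, waits)
```

Matching on the string "429" is fragile; where the client library exposes a dedicated rate-limit exception, catching that type is usually the more reliable choice.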

### 5. Saving Results
- Formats the `date` column as `MM/DD/YY` strings and saves the processed results to `final_categorized_with_themes.parquet` (gzip-compressed).

## Usage
Run the script with a valid OpenAI API key set in the `OPENAI_API_KEY` environment variable:
```
export OPENAI_API_KEY="your_api_key"
python script.py
```
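
Since the script reads the key with `os.getenv`, a missing key only surfaces when the first API call fails. One possible guard, sketched here under the assumption that failing fast is preferable, is to validate the variable at startup; `require_api_key` is a hypothetical helper, not part of the script.

```python
import os

def require_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key from env, or raise a clear error when it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script.")
    return key

# Example with an explicit mapping instead of the real environment.
print(require_api_key({"OPENAI_API_KEY": "your_api_key"}))
```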

## Conclusion
This script automates article categorization and theme extraction with OpenAI's API, and writes the enriched dataset to a single Parquet file, with checkpoints along the way so that long runs do not lose progress.



# Code

In [3]:
import os
import pandas as pd
import json
import re
from tqdm import tqdm
import openai
import time

openai.api_key = os.getenv("OPENAI_API_KEY")

print("Loading data...")
df1 = pd.read_parquet("kpmg_india/kpmg_final_concatenated_insights_gzip.parquet")
df2 = pd.read_parquet("pwc_india/pwc_final_concatenated_insights_gzip.parquet")
df = pd.concat([df1, df2], ignore_index=True)
print(f"Loaded {len(df)} rows after merging")

def save_checkpoint(df, iteration, total):
    checkpoint_file = f"checkpoint_{iteration}_of_{total}.parquet"
    df.to_parquet(checkpoint_file, compression="gzip")
    print(f"Checkpoint saved: {checkpoint_file}")

def extract_json(text):
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        json_pattern = r'({[\s\S]*})'
        json_match = re.search(json_pattern, text)
        if json_match:
            try:
                return json.loads(json_match.group(1))
            except json.JSONDecodeError:
                pass
        category_match = re.search(r'category[:\s]+["\'](.*?)["\']', text, re.IGNORECASE)
        niche_match = re.search(r'niche[:\s]+["\'](.*?)["\']', text, re.IGNORECASE)
        if category_match and niche_match:
            return {"category": category_match.group(1), "niche": niche_match.group(1)}
        return {"category": "", "niche": ""}

def extract_list_from_response(text):
    # Preferred path: the response is valid JSON in one of the expected shapes.
    try:
        data = json.loads(text)
        if isinstance(data, dict) and "items" in data:
            return data["items"]
        elif isinstance(data, list):
            return data
    except json.JSONDecodeError:
        pass

    # Fallbacks for non-JSON or unexpected responses: quoted strings first,
    # then numbered or bulleted list items.
    matches = re.findall(r'[\"\']([^\"\']+)[\"\']', text)
    if matches:
        return matches

    list_items = re.findall(r'(?:^|\n)[\d\-\*]+\.?\s*([^\n]+)', text)
    if list_items:
        return [item.strip() for item in list_items]

    # Last resort: split on commas, or on newlines when there are none.
    if "," in text:
        return [item.strip() for item in text.split(",") if item.strip()]
    else:
        return [item.strip() for item in text.split("\n") if item.strip()]

def invoke_llm(prompt, max_retries=5):
    for attempt in range(max_retries):
        try:
            response = openai.ChatCompletion.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": prompt}],
                temperature=0.0
            )
            return response.choices[0].message['content'].strip()
        except Exception as e:
            if "429" in str(e):
                wait_time = 2 ** attempt
                print(f"Rate limit exceeded. Retrying in {wait_time} seconds...")
                time.sleep(wait_time)
            else:
                print(f"LLM request failed: {e}")
                break
    return ""

def process_title(title):
    prompt = f"""
    Given the following article title:
    "{title}"

    Classify it into one of these categories:
    ["Supply Chain", "Energy Renewables", "Cyber Security", "Economy and Growth", "ESG", "Technology", "Risk Regulation", "Workforce", "Transformation", "India (Country)", "Healthcare"]

    Additionally, identify a relevant **niche** within this category.
    The **category must be from the list**, but the **niche can be any relevant term**.

    Return ONLY a JSON object with 2 keys: "category" and "niche".
    Example: {{"category": "Technology", "niche": "Cloud Computing"}}
    
    Do not include ANY explanatory text before or after the JSON.
    """
    response = invoke_llm(prompt)
    return extract_json(response)

def process_text_chunks(text):
    chunk_size = 400
    words = text.split()
    num_chunks = min(2, len(words) // chunk_size + (1 if len(words) % chunk_size > 0 else 0))

    category_votes = []
    niche_votes = []

    for i in range(num_chunks):
        chunk = " ".join(words[i * chunk_size:(i + 1) * chunk_size])
        prompt = f"""
        Given the following article excerpt:
        "{chunk}"

        Classify it into one of these categories:
        ["Supply Chain", "Energy Renewables", "Cyber Security", "Economy and Growth", "ESG", "Technology", "Risk Regulation", "Workforce", "Transformation", "India (Country)", "Healthcare"]

        Additionally, identify a relevant **niche** within this category.
        The **category must be from the list**, but the **niche can be any relevant term**.

        Return ONLY a JSON object with 2 keys: "category" and "niche".
        Example: {{"category": "Technology", "niche": "Cloud Computing"}}
        
        Do not include ANY explanatory text before or after the JSON.
        """
        response = invoke_llm(prompt)
        result = extract_json(response)
        # Skip empty results so a failed chunk cannot win the vote.
        if result.get("category"):
            category_votes.append(result["category"])
        if result.get("niche"):
            niche_votes.append(result["niche"])

    final_category = max(set(category_votes), key=category_votes.count) if category_votes else ""
    final_niche = max(set(niche_votes), key=niche_votes.count) if niche_votes else ""

    return {"category": final_category, "niche": final_niche}

def extract_key_themes_and_recurring_topics(text):
    """
    Extracts key themes from the introduction (first 1/3rd of the text) and 
    recurring topics from the remaining text, processing the text in chunks.
    
    For the introduction portion, a combined prompt extracts both key themes and recurring topics.
    For the remaining text, we process in chunks to extract recurring topics.
    
    Returns:
        tuple: (list of key themes, list of recurring topics)
    """
    if pd.isna(text) or text == "":
        return [], []
    
    total_length = len(text)
    intro_threshold = total_length // 3
    chunk_size = 3000
    key_themes = []
    recurring_topics = []

    for i in range(0, intro_threshold, chunk_size):
        chunk = text[i:i+chunk_size]
        prompt_intro = f'''
        Given the following text excerpt from an article:
        "{chunk}"
        
        Identify the most important key themes addressed in this text.
        Key themes should be broad concepts or subjects that form the foundation of the content.
        
        Additionally, identify specific recurring topics that are mentioned or discussed multiple times in this excerpt.
        Recurring topics should be specific subjects, terms, or concepts that appear repeatedly.
        
        Return ONLY a JSON object with two keys: "key_themes" and "recurring_topics".
        Example: {{"key_themes": ["Digital Transformation", "Regulatory Compliance", "Risk Management"],
        "recurring_topics": ["Supply Chain Disruption", "Blockchain Implementation", "Vendor Management", "Cost Optimization"]}}
        
        Be concise and specific. Each key theme should be 1-3 words when possible, and each recurring topic 2-4 words.
        Do not include ANY explanatory text before or after the JSON.
        '''
        response_intro = invoke_llm(prompt_intro)
        result = extract_json(response_intro)
        
        # Take themes from the parsed JSON when available, else from a looser parse.
        if "key_themes" in result:
            new_key_themes = result.get("key_themes", [])
        else:
            new_key_themes = extract_list_from_response(response_intro)
        for theme in new_key_themes:
            cleaned_theme = theme.strip()
            if cleaned_theme and cleaned_theme.lower() not in [t.lower() for t in key_themes]:
                key_themes.append(cleaned_theme)
        
        if "recurring_topics" in result:
            new_topics = result.get("recurring_topics", [])
            recurring_topics.extend(new_topics)
    
    for j in range(intro_threshold, total_length, chunk_size):
        chunk = text[j:j+chunk_size]
        prompt_remaining = f'''
        Given the following text excerpt from an article:
        "{chunk}"
        
        Identify the recurring topics that are mentioned or discussed multiple times.
        Recurring topics should be specific subjects, terms, or concepts that appear repeatedly.
        
        Return ONLY a JSON object with a single key "items" containing an array of topics.
        Example: {{"items": ["Supply Chain Disruption", "Blockchain Implementation", "Vendor Management", "Cost Optimization"]}}
        
        Be specific and detailed. Each topic can be 2-4 words.
        Do not include ANY explanatory text before or after the JSON.
        '''
        response_remaining = invoke_llm(prompt_remaining)
        remaining_topics = extract_list_from_response(response_remaining)
        recurring_topics.extend(remaining_topics)
    
    unique_recurring_topics = []
    lower_topics = set()
    for topic in recurring_topics:
        cleaned_topic = topic.strip()
        if cleaned_topic and cleaned_topic.lower() not in lower_topics:
            unique_recurring_topics.append(cleaned_topic)
            lower_topics.add(cleaned_topic.lower())
    
    return key_themes, unique_recurring_topics

def main():
    if "category" not in df.columns:
        df["category"] = None
    if "niche" not in df.columns:
        df["niche"] = None
    if "key_themes" not in df.columns:
        df["key_themes"] = None
    if "recurring_topics" not in df.columns:
        df["recurring_topics"] = None
    
    total_rows = len(df)
    checkpoint_interval = 50
    
    try:
        for i in tqdm(range(total_rows)):
            title = df.at[i, "title"]
            text = df.at[i, "normalized_concatenated_text"] if "normalized_concatenated_text" in df.columns else ""
            
            if (pd.notna(df.at[i, "category"]) and 
                pd.notna(df.at[i, "niche"]) and 
                pd.notna(df.at[i, "key_themes"]) and 
                pd.notna(df.at[i, "recurring_topics"])):
                continue
                
            if pd.isna(title) or title == "":
                df.at[i, "category"] = ""
                df.at[i, "niche"] = ""
                df.at[i, "key_themes"] = []
                df.at[i, "recurring_topics"] = []
                continue
            
            if pd.isna(df.at[i, "category"]) or pd.isna(df.at[i, "niche"]):
                title_result = process_title(title)
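                # NaN is truthy, so check the type explicitly before chunking the text.
                text_result = process_text_chunks(text) if isinstance(text, str) and text else {"category": "", "niche": ""}
                # Use .get so a response missing either key falls back to the text-based result.
                final_category = title_result.get("category") or text_result.get("category", "")
                final_niche = title_result.get("niche") or text_result.get("niche", "")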
                df.at[i, "category"] = final_category
                df.at[i, "niche"] = final_niche
                print(f"Row {i}: {title} → Category: {final_category}, Niche: {final_niche}")
            
            if pd.isna(df.at[i, "key_themes"]) or pd.isna(df.at[i, "recurring_topics"]):
                key_themes, recurring_topics = extract_key_themes_and_recurring_topics(text)
                df.at[i, "key_themes"] = key_themes
                df.at[i, "recurring_topics"] = recurring_topics
                print(f"Row {i}: Key Themes: {key_themes}")
                print(f"Row {i}: Recurring Topics: {recurring_topics}")
            
            if (i + 1) % checkpoint_interval == 0:
                save_checkpoint(df, i + 1, total_rows)
        df['date'] = pd.to_datetime(df['date'], errors='coerce').dt.strftime('%m/%d/%y')
        df.to_parquet("final_categorized_with_themes.parquet", compression="gzip")
        print("Processing complete. Final results saved.")
    except Exception as e:
        print(f"Error during processing: {e}")
        save_checkpoint(df, "error", total_rows)
        print("Progress saved before error.")
        raise

if __name__ == "__main__":
    main()


Loading data...
Loaded 17 rows after merging


  0%|          | 0/17 [00:00<?, ?it/s]

Row 0: Issue no. 103 | February 2025 → Category: Economy and Growth, Niche: Inflation Trends


  6%|▌         | 1/17 [00:33<08:51, 33.19s/it]

Row 0: Key Themes: ['Business Combination', 'Sustainability Disclosure', 'Regulatory Updates', 'Regulatory Compliance', 'Measurement Period', 'Provisional Amount', 'Accounting Standards', 'Business Combinations', 'Financial Reporting', 'Fair Value Accounting', 'Sustainability Reporting', 'Material Information', 'Financial Disclosure', 'Materiality Judgement']
Row 0: Recurring Topics: ['Acquisition Method', 'Measurement Period', 'Provisional Amount', 'Material Information', 'Sustainability-related Risk', 'Risk Opportunity', 'Audit Committee Approval', 'Accounting Guidance', 'Acquisition Date', 'Adjustment Provisional Amount', 'Information Obtain', 'Goodwill Measurement', 'Asset Valuation', 'Financial Statement', 'Depreciation Expense', 'Goodwill Adjustment', 'Provisional Fair Value', 'Financial Statement Disclosure', 'Disclosure Requirements', 'IFRS Standards', 'Materiality Judgement', 'Primary User Decision', 'Sustainability-related Financial Disclosure', 'Information Need', 'Potential

 12%|█▏        | 2/17 [00:48<05:43, 22.90s/it]

Row 1: Key Themes: ['Food Security', 'Public Policy', 'Sustainable Agriculture', 'Nutritional Security', 'Multistakeholder Approach', 'Nutrition Education', 'School Meal Programs']
Row 1: Recurring Topics: ['Public Distribution System', 'Nutritional Programs', 'Digital Reform', 'Malnutrition', 'Food Waste', 'Smallholder Farmers', 'Undernutrition', 'Child Development', 'Food Distribution', 'School Feeding Program', 'Agricultural Sustainability', 'Fortified Milk', 'Midday Meal Scheme', 'Micronutrient Deficiency', 'Local Agriculture', 'Child Nutrition', 'Anaemia Prevention', 'School Meal Programs', 'Nutritional Education', 'Fortified Milk Initiatives', 'Local Agriculture Promotion', 'Community Gardens', 'Regional Meal Plans', 'Technology in Nutrition Tracking', 'Food Subsidy Program', 'Nutritional Status', 'Digitalization and Technology', 'Food Grain Management', 'Access to Nutritious Food', 'Rural and Urban Households', 'Aadhaar-based Authentication', 'Leakage and Waste', 'Smart Ration C

 18%|█▊        | 3/17 [00:54<03:28, 14.86s/it]

Row 2: Key Themes: ['Financial Crime', 'Regulatory Initiatives', 'Risk Management']
Row 2: Recurring Topics: ['Anti-Money Laundering', 'Fraud Detection', 'Beneficial Ownership', 'Corporate Transparency', 'Technological Integration', 'Compliance Requirements', 'Corporate White-Collar Crime', 'AML Compliance', 'Fraud Risk Management', 'Financial Crime Prevention', 'Internal Risk Assessment', 'Digital Payment Advancements', 'Surveillance Systems', 'Whistleblower Policies', 'Technology in Financial Crime', 'Regulatory Changes']
Row 3: KPMG global tech report – industrial manufacturing insights → Category: Technology, Niche: Industrial IoT


 24%|██▎       | 4/17 [01:30<05:03, 23.37s/it]

Row 3: Key Themes: ['Digital Transformation', 'Operational Efficiency', 'AI Adoption', 'Technology Innovation', 'Value Creation', 'Risk Management', 'Investment Strategy', 'Technology Adoption', 'Technology Investment', 'Strategic Implementation', 'Data Analytics', 'Profitability', 'Leadership Strategy']
Row 3: Recurring Topics: ['Industrial Manufacturing', 'Data Strategy', 'Cybersecurity Measures', 'Technology Leadership', 'Generative AI', 'Quantum Computing', 'Evidence-Based Decision', 'Technology Investment', 'Senior Leadership', 'High Performers', 'Fear of Missing Out', 'Executive Decision Making', 'Customer Feedback', 'Competitor Influence', 'XaaS Technology', 'Cloud Computing', 'Cybersecurity', 'AI Automation', 'Cost Reduction', 'Employee Feedback', 'Tech Adoption', 'Tech Debt', 'Data Analytics', 'Executive Satisfaction', 'Value Evaluation', 'Adaptive Approach', 'Executive Decision-Making', 'Partnership Ecosystem', 'Tech Investment Evaluation', 'Sustainability Initiatives', 'Data

 29%|██▉       | 5/17 [01:59<05:03, 25.30s/it]

Row 4: Key Themes: ['Digital Transformation', 'Technology Strategy', 'Investment Discipline', 'AI Implementation', 'Value Optimization', 'Technology Investment', 'Competitive Advantage', 'Technology Adoption', 'Profitability Improvement', 'Technical Debt']
Row 4: Recurring Topics: ['AI and ESG', 'Risk Management', 'Evidence-Based Decisions', 'Technology Innovation', 'KPMG Global Tech Report', 'Evidence-Based Decision', 'Senior Leadership Insight', 'Tech Maturity Stage', 'FOMO', 'Third-party Guidance', 'In-house Trials', 'Customer Feedback', 'Senior Leadership', 'Cost Effectiveness', 'XaaS Technology', 'AI Automation', 'Data Analytics', 'Cybersecurity', 'Cloud Computing', 'Strategic Vision', 'Executive Buy-in', 'High Performers', 'Legacy Systems', 'New Technology', 'Tech Debt', 'Investment Strategy', 'Ecosystem Partnerships', 'Sustainability', 'Data maturity', 'Tech initiative', 'Long-term goals', 'Tech Investment', 'Data-Driven Decision', 'Social Responsibility', 'Data Accessibility', 

 35%|███▌      | 6/17 [02:35<05:17, 28.90s/it]

Row 5: Key Themes: ['Energy Transformation', 'Technological Innovation', 'Strategic Decision-Making', 'Technology Leadership', 'Digital Transformation', 'AI Implementation', 'AI Investment', 'Market Competition', 'Technology Adoption', 'Strategic Investment', 'Technology Investment', 'Profitability', 'Decision Making']
Row 5: Recurring Topics: ['AI and Advanced Analytics', 'Evidence-Based Decisions', 'Renewable Integration', 'Cybersecurity Threats', 'KPMG Global Tech Report', 'Evidence-Based Decision Making', 'Senior Leadership Insights', 'Industry-Specific Challenges', 'Fear of Missing Out', 'Third-Party Guidance', 'In-House Trials', 'Customer Feedback', 'Technology Adoption', 'XaaS Technology', 'Cloud Computing', 'Data Analytics', 'AI Automation', 'Cybersecurity', 'Cost Reduction', 'Executive Buy-in', 'Tech Debt', 'Business Value', 'High Performers', 'New Technology', 'Existing Technology', 'Executive Intentions', 'Ecosystem Partnership', 'Evidence-Based Decision', 'Sustainability', 

 41%|████      | 7/17 [03:06<04:55, 29.58s/it]

Row 6: Key Themes: ['Tech Innovation', 'Strategic Decision-Making', 'Value Creation', 'Digital Transformation', 'AI Implementation', 'Business Strategy', 'Technology Investment', 'Competitive Advantage', 'Technology Adoption', 'Profitability Improvement', 'Tech Debt', 'Value Optimization', 'Sustainability']
Row 6: Recurring Topics: ['Evidence-Based Decisions', 'Technology Leadership', 'Risk Management', 'AI Implementation', 'Business Transformation', 'Value Optimization', 'Evidence-Based Decision', 'Senior Leadership Insight', 'Tech Initiative Measurement', 'Fear of Missing Out', 'Third-Party Guidance', 'In-House Trials', 'Customer Feedback', 'Competitor Analysis', 'XaaS Technology', 'Cloud Computing', 'AI Automation', 'Data Analytics', 'Cybersecurity', 'Strategic Vision', 'Investment Priorities', 'Profitability Boost', 'Legacy Systems', 'XaaS Investment', 'High Performers', 'Tech Initiative', 'Business Value Outcomes', 'Tech Investment Portfolio', 'Long-Term Goals', 'Customer Experien

 47%|████▋     | 8/17 [03:53<05:17, 35.29s/it]

Row 7: Key Themes: ['Value-Based Healthcare', 'Quality Measurement', 'Healthcare Transformation', 'Healthcare Accessibility', 'Technological Innovation', 'Patient Experience', 'Healthcare Infrastructure', 'Value-Based Care', 'Healthcare Innovation', 'Patient Outcomes', 'Healthcare Spending', 'Quality Metrics', 'Payment Models', 'Quality Improvement', 'Healthcare Costs', 'Cost Optimization', 'Healthcare Access', 'Chronic Disease Management']
Row 7: Recurring Topics: ['Patient Outcomes', 'Cost Efficiency', 'Healthcare Delivery', 'Chronic Disease Management', 'Healthcare Access', 'Patient Satisfaction', 'Quality Healthcare', 'Healthcare Costs', 'Telemedicine', 'Healthcare Providers', 'Government Initiatives', 'Integrated Care Delivery', 'Outcome-Based Payment', 'Patient-Centered Care', 'Healthcare Provider', 'Cost Reduction', 'Quality Care', 'Patient Safety', 'Digital Innovation', 'Monitoring Mechanism', 'Hospital Readmission Reduction', 'Preventive Care', 'Integrated Care Systems', 'Cost

 53%|█████▎    | 9/17 [04:47<05:27, 41.00s/it]

Row 8: Key Themes: ['Financial Inclusion', 'Investor Empowerment', 'Mutual Fund Industry', 'Mutual Fund Growth', 'Investment Culture', 'Financial Empowerment', 'Digital Transformation', 'Investment Growth', 'Cultural Integration', 'Investor Education', 'Technology Integration', 'Community Engagement', 'Cultural Alignment', 'Economic Transformation', 'Investment Choices', 'Mutual Funds', 'Financial Sector Development', 'Regulatory Framework', 'Market Expansion', 'Market Participation', 'Investor Awareness', 'Market Volatility', 'Mutual Fund Adoption', 'Financial Education', 'Investment Accessibility', 'Digital Platforms', 'Financial Planning', 'Regulatory Compliance', 'Cost Efficiency', 'ESG Integration', 'Investor Protection']
Row 8: Recurring Topics: ['Democratization of Wealth', 'Financial Literacy', 'Technological Integration', 'Trust Building', 'Investment Accessibility', 'Asset Management', 'Retail Mutual Fund', 'Systematic Investment Plan', 'Investor Confidence', 'Financial Wellb

 59%|█████▉    | 10/17 [05:18<04:26, 38.02s/it]

Row 9: Key Themes: ['Financial Inclusion', 'Economic Growth', 'Digital Finance', 'Customer Retention', 'Data-Driven Solutions', 'Financial Health', 'Digital Infrastructure', 'Debt Management', 'Access Metrics', 'Financial Wellbeing']
Row 9: Recurring Topics: ['Pradhan Mantri Jandhan Yojana', 'Aadhaar Biometric Identification', 'Financial Health', 'Retail Investor', 'Access to Financial Services', 'Impactful Financial Inclusion', 'Income Stability', 'Low-Income Households', 'Financial Service Providers', 'Behavioral Insights', 'Market-Linked Investments', 'Policy Initiatives', 'Income Volatility', 'Risk Management', 'Data-Driven Solutions', 'Customer Retention', 'Technological Innovation', 'Financial Awareness', 'Access to Finance', 'Investment', 'Credit Score', 'Usage of Financial Products', 'Impact Measurement', 'Financial Health Survey', 'Government Insurance Schemes', 'Financial Wellbeing', 'Financial Inclusion', 'Access Metrics', 'Usage Metrics', 'PMJDY Accounts', 'Financial Produc

 65%|██████▍   | 11/17 [06:15<04:22, 43.71s/it]

Row 10: Key Themes: ['Climate Resilience', 'Sustainable Development', 'Financing Solutions', 'Ecological Sensitivity', 'Climate Change', 'Land Use', 'Green Finance', 'Economic Development', 'Biodiversity Preservation', 'Ecosystem Protection', 'Ecosystem Management', 'Land Use Change', 'Agricultural Practices', 'Biodiversity Loss', 'Sustainable Management', 'Impact on Small Islands', 'Adaptation Strategies', 'Coastal Vulnerability', 'Environmental Impact', 'Biodiversity']
Row 10: Recurring Topics: ['Andaman Nicobar Islands', 'Climate Change Impact', 'Comprehensive Action Plan', 'Green Finance Options', 'Vulnerability Assessment', 'Carbon Credit', 'Climate Change', 'Vulnerability Index', 'Clean Cookstove', 'Mangrove Restoration', 'Climate Action Plan', 'Carbon Market', 'Renewable Energy', 'Agricultural Activity', 'Ecological Impact', 'Infrastructure Projects', 'Natural Resources', 'Ecological Sensitivity', 'Climate Resilience', 'Geopolitical Importance', 'Greenhouse Gas Mitigation', 'Ada

 71%|███████   | 12/17 [06:56<03:34, 43.00s/it]

Row 11: Key Themes: ['Retail Reinvention', 'Ecommerce Growth', 'Consumer Engagement', 'Consumer Behavior', 'Omnichannel Retail', 'Digital Technology', 'Offline Shopping', 'Consumer Experience', 'Technology Integration', 'Consumer Data', 'Qcommerce Impact', 'Qcommerce', 'Retail Adaptation']
Row 11: Recurring Topics: ['Brick-and-Mortar Retail', 'Tier 2 and 3 Cities', 'Digital Commerce', 'Consumer Utility', 'Hybrid Model', 'Retail Strategy', 'Omnichannel Shopping', 'Quick Commerce', 'Customer Expectations', 'Market Trends', 'Digital Wave', 'Ecommerce Integration', 'Traditional Retail', 'Shopping Experience', 'Jevons Paradox', 'Consumer Preferences', 'Market Survey', 'Online Shopping', 'Offline Shopping', 'Hybrid Approach', 'Retail Ecosystem', 'In-store Experience', 'Online Shopping Preference', 'Retail Personalization', 'Consumer Insights', 'Immersive Retail', 'Digital Tools', 'Retailer Challenges', 'Customer Preferences', 'Sales Decline', 'Inventory Management', 'Delivery Costs', 'Urban 

 76%|███████▋  | 13/17 [07:02<02:07, 31.83s/it]

Row 12: Key Themes: ['Cybersecurity', 'Regulatory Compliance', 'Risk Management']
Row 12: Recurring Topics: ['Financial Services', 'KYC Compliance', 'AML Challenges', 'Geopolitical Risks', 'Data Protection', 'Crypto Vulnerabilities', 'Cybersecurity Measures', 'Incident Response System', 'Regulatory Compliance', 'National Risk Assessment', 'Geopolitical Risk', 'Client Due Diligence', 'Risk Management Strategy', 'Private Sector Involvement', 'Employee Psychology', 'Reputational Damage']
Row 13: How India spends: A deep dive into consumer spending behaviour → Category: Economy and Growth, Niche: Consumer Behavior


 82%|████████▏ | 14/17 [08:16<02:13, 44.46s/it]

Row 13: Key Themes: ['Consumer Spending', 'Economic Growth', 'Digital Infrastructure', 'Digital Payments', 'Financial Institutions', 'Consumer Behavior', 'Ecommerce Growth', 'Fintech Solutions', 'Spending Patterns', 'Financial Inclusion', 'Consumer Engagement', 'Financial Services', 'Investment Trends', 'Digital Access', 'Financial Analysis', 'Data Privacy', 'Income Segmentation', 'Economic Factors', 'Digital Commerce', 'Investment Behavior', 'Income Distribution', 'Loan Repayment', 'Income Levels', 'Income Disparity', 'E-commerce Growth', 'Income Bracket', 'Fashion Purchases', 'City Tier']
Row 13: Recurring Topics: ["India's Middle Class", 'Transactional Data', 'Consumer Behavior', 'Investment Patterns', 'Indian Consumption', 'Financial Management', 'Credit Products', 'Market Growth', 'Indian Consumer Market', 'Spending Categories', 'Payment Modes', 'Government Regulation', 'Emotional Spending', 'UPI Transactions', 'Obligatory Expenditures', 'Discretionary Spending', 'Income Levels', 

 88%|████████▊ | 15/17 [08:29<01:10, 35.04s/it]

Row 14: Key Themes: ['Quality Evolution', 'Manufacturing Innovation', 'Technology Integration', 'Quality Management', 'Sustainability', 'Employee Engagement']
Row 14: Recurring Topics: ['Quality 50', 'Consumer Expectations', 'Data-Driven Approach', 'Stakeholder Needs', 'Technological Innovation', 'Human Insight', 'Data Management', 'Collaboration', 'Continuous Learning', 'Product Quality', 'Market Adaptability', 'Customer Satisfaction', 'Sustainability Consideration', 'Real-time Quality Monitoring', 'Predictive Quality Analytics', 'Service Quality', 'Customer Experience', 'Data Quality', 'Operational Success', 'Environmental Impact', 'Data Quality Assurance', 'Blockchain Technology', 'Regulatory Compliance', 'Quality Control', 'Employee Wellbeing', 'Automation in Compliance', 'Supply Chain Management', 'Digital Skills', 'Integrated Systems', 'Quality Management System', 'Change Management', 'Data Infrastructure', 'Employee Buy-in', 'Process Integration', 'Technology Selection', 'Operat

 94%|█████████▍| 16/17 [08:45<00:29, 29.42s/it]

Row 15: Key Themes: ['Automation', 'Artificial Intelligence', 'Business Transformation', 'Intelligent Automation', 'Multi-Agent Systems', 'User Interaction', 'AI Collaboration', 'Human Augmentation']
Row 15: Recurring Topics: ['Power Automation Agent', 'Intelligent Automation', 'AI Technology', 'Productivity Efficiency', 'Generative AI', 'Decision Making', 'Automation Framework', 'Goal-Oriented Agents', 'Workflow Orchestration', 'User Feedback', 'Self-Improvement', 'Agent Autonomy', 'Human Intervention', 'Emotional Intelligence', 'Context Awareness', 'Task Execution', 'Agentic Workflow', 'Automation', 'AI Integration', 'Process Improvement', 'Ethical Guidelines', 'Human Feedback', 'Self-learning Capability', 'Performance Analysis', 'Workflow Management', 'Data Management', 'Continuous Monitoring', 'Proactive Action', 'Collaboration', 'Efficiency and Productivity', 'Agent Orchestration', 'Task Automation', 'Invoice Processing', 'Expense Reconciliation', 'Cash Flow Management', 'Demand F

100%|██████████| 17/17 [09:04<00:00, 32.05s/it]

Row 16: Key Themes: ['Economic Growth', 'Investment Trends', 'Mergers and Acquisitions', 'Domestic Transactions', 'Market Growth', 'Market Outlook']
Row 16: Recurring Topics: ['Deal Volume', 'Private Equity Investment', "India's GDP Growth", 'Market Consolidation', 'Regulatory Amendments', 'Deal Value', 'Domestic Deals', 'Technology Sector', 'Healthcare Sector', 'Private Equity', 'large deals', 'renewable energy', 'joint ventures', 'domestic mergers', 'IPO activity', 'capital investment', 'Initial Public Offering', 'Mainboard IPO', 'SME IPO', 'Investment Growth', 'Private Credit', 'Real Estate Sector', 'Market Volatility', 'Credit Investment', 'Deal Structure', 'Global Investment Climate', 'Insolvency Framework', 'Corporate Insolvency Resolution Process', 'National Company Law Tribunal', 'Insolvency Bankruptcy Board of India', 'Voluntary Liquidation', 'Operational Creditor', 'Group Insolvency', 'Cross-Border Insolvency', 'Digital Personal Data Protection', 'NCLT Capacity', 'Infrastruct




Processing complete. Final results saved.


In [None]:
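# Overwrite `date` for rows whose title appears in df2 (the PwC data), using df2's original values.
# `map` with a Series requires the titles in df2 to be unique.
date_mapping = df2.set_index("title")["date"]
df.loc[df["title"].isin(date_mapping.index), "date"] = df["title"].map(date_mapping)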

df.head(16)

Unnamed: 0,source,url_link,title,description,date,content,pdf_link,pdf_content,concatenated_text,tokenized_text,normalized_concatenated_text,category,niche,key_themes,recurring_topics
0,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/aau-ac...,Issue no. 103 | February 2025,This edition of AAU covers relevant financial ...,02/28/25,"Ind AS 103, Business Combination provides guid...",https://kpmg.com/content/dam/kpmgsites/in/pdf/...,February 2025\nkpmg.com/in\nAccounting and \nA...,"Ind AS 103, Business Combination provides guid...",ind_as as_103 103_business business_combinatio...,ind 103 business combination provide guidance ...,Economy and Growth,Inflation Trends,"[Business Combination, Sustainability Disclosu...","[Acquisition Method, Measurement Period, Provi..."
1,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/food-a...,Food and Nutritional Security in India,Solutions for achieving zero hunger and ensuri...,02/20/25,Food security has been a critical aspect of In...,https://kpmg.com/content/dam/kpmgsites/in/pdf/...,Food and nutritional \nsecurity in India\nSolu...,Food security has been a critical aspect of In...,food_security security_has has_been been_a a_c...,food security critical aspect indias public po...,India (Country),Agricultural Policy,"[Food Security, Public Policy, Sustainable Agr...","[Public Distribution System, Nutritional Progr..."
2,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/financ...,Financial Crime Bulletin,Dive deep into the financial crime avenues and...,02/10/25,Financial crimes have become an ever-evolving ...,,,Financial crimes have become an ever-evolving ...,financial_crimes crimes_have have_become becom...,financial crime become everevolving problem me...,Risk Regulation,Anti-Money Laundering,"[Financial Crime, Regulatory Initiatives, Risk...","[Anti-Money Laundering, Fraud Detection, Benef..."
3,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/kpmg-g...,KPMG global tech report – industrial manufactu...,"Interoperability, hybrid models and AI innovat...",02/07/25,In the rapidly evolving landscape of industria...,https://kpmg.com/content/dam/kpmgsites/xx/pdf/...,KPMG global \ntech report 2024\nKPMG Internati...,In the rapidly evolving landscape of industria...,in_the the_rapidly rapidly_evolving evolving_l...,rapidly evolve landscape industrial manufactur...,Technology,Industrial IoT,"[Digital Transformation, Operational Efficienc...","[Industrial Manufacturing, Data Strategy, Cybe..."
4,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/kpmg-g...,KPMG global tech report: Technology insights,Tech: A bold sector that innovates while leadi...,02/07/25,The digital transformation journey is an impor...,https://kpmg.com/content/dam/kpmgsites/xx/pdf/...,KPMG global \ntech report 2024\nKPMG Internati...,The digital transformation journey is an impor...,the_digital digital_transformation transformat...,digital transformation journey important strat...,Technology,Emerging Technologies,"[Digital Transformation, Technology Strategy, ...","[AI and ESG, Risk Management, Evidence-Based D..."
5,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/kpmg-g...,KPMG global tech report: energy insights,Empower your digital transformation with data-...,02/07/25,The energy industry is at a pivotal moment of ...,https://assets.kpmg.com/content/dam/kpmgsites/...,KPMG global \ntech report 2024\nKPMG Internati...,The energy industry is at a pivotal moment of ...,the_energy energy_industry industry_is is_at a...,energy industry pivotal moment transformation ...,Energy Renewables,Sustainable Energy Solutions,"[Energy Transformation, Technological Innovati...","[AI and Advanced Analytics, Evidence-Based Dec..."
6,KPMG Insights,https://kpmg.com/in/en/insights/2025/02/kpmg-g...,KPMG global tech report 2024,In the race to keep up with rapid tech innovat...,02/06/25,"As tech innovation opens endless potential, te...",https://assets.kpmg.com/content/dam/kpmgsites/...,KPMG global \ntech report 2024\nKPMG Internati...,"As tech innovation opens endless potential, te...",as_tech tech_innovation innovation_opens opens...,tech innovation open endless potential tech le...,Technology,Emerging Technologies,"[Tech Innovation, Strategic Decision-Making, V...","[Evidence-Based Decisions, Technology Leadersh..."
7,PWC Insights,,Quality measures and standards for transitioni...,A roadmap to facilitate the transition to VBHC.,07/03/25,,https://www.pwc.in/ghost-templates/quality-mea...,Quality measures and standards \nMarch 2025\nf...,Quality measures and standards for transitioni...,quality_measures measures_and and_standards st...,quality measure standard transition valuebased...,Healthcare,Value-Based Care,"[Value-Based Healthcare, Quality Measurement, ...","[Patient Outcomes, Cost Efficiency, Healthcare..."
8,PWC Insights,,The mutual funds route to Viksit Bharat @2047,A comprehensive roadmap for the evolution of t...,05/03/25,,https://www.pwc.in/ghost-templates/the-mutual-...,The mutual funds route \nto Viksit Bharat @204...,The mutual funds route to Viksit Bharat @2047 ...,the_mutual mutual_funds funds_route route_to t...,mutual fund route viksit bharat 2047 comprehen...,Economy and Growth,Investment Strategies,"[Financial Inclusion, Investor Empowerment, Mu...","[Democratization of Wealth, Financial Literacy..."
9,PWC Insights,,Financial health: Transcending from access to ...,Explore India’s financial inclusion journey an...,04/03/25,,https://www.pwc.in/ghost-templates/financial-h...,TM\nMarch 2025\nFinancial health: \nTranscendi...,Financial health: Transcending from access to ...,financial_health health_transcending transcend...,financial health transcend access impact explo...,Economy and Growth,Financial Inclusion,"[Financial Inclusion, Economic Growth, Digital...","[Pradhan Mantri Jandhan Yojana, Aadhaar Biomet..."
