# Narration Keywords Analysis

This notebook defines keyword lists for each transaction category and measures how often those keywords appear in transaction narrations. It serves as the reference library for keyword-based classification throughout the project.

## Objectives
- Define domain-specific keywords for each transaction category
- Calculate keyword frequency in narrations
- Identify patterns and common terms for each category type
- Enable data-driven category prediction

## Categories Covered
- Business Receipts
- Dividend Income
- Savings Account Interest
- Fixed Deposit/Term Deposit Interest
- Rental Income
- Salary
- Life Insurance Receipts
- Income Tax Refunds
- Bond/Securities Interest

In [None]:
# Import required libraries
import pandas as pd  # Data manipulation and analysis
import numpy as np   # Numerical computations
import re            # Regular expressions for text processing
from collections import Counter  # For counting keyword frequencies

In [None]:
# Load transaction data from Excel file
# Data source: merged transactions file with raw master sheet
all_df = pd.read_excel(
    "merged transactions_2412_v3_2512.xlsx",
    sheet_name="Raw Master Sheet"
)

In [None]:
# Convert Date column to datetime format for temporal analysis
all_df["Date"] = pd.to_datetime(all_df["Date"])

In [None]:
def normalize(text):
    """
    Normalize transaction narration text for consistent analysis.
    
    Process:
    1. Convert to lowercase for case-insensitive matching
    2. Replace special characters with spaces (keep only alphanumeric and spaces)
    3. Collapse repeated whitespace and trim both ends
    
    Args:
        text (str): Raw narration text
        
    Returns:
        str: Normalized text
    """
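    if pd.isna(text):  # Missing narration: return "" rather than the string "nan"
        return ""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9 ]", " ", text)  # Replace non-alphanumeric chars with spaces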
    text = re.sub(r"\s+", " ", text)  # Collapse whitespace
    return text.strip()  # Remove leading/trailing whitespace
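
To make the normalization steps concrete, the sketch below runs the same three steps on a few narrations. The sample strings are hypothetical and invented for illustration; they are not taken from the dataset. The function body is repeated only so the example runs on its own.

```python
import re

def normalize(text):
    """Same steps as the notebook's normalize(): lowercase, strip punctuation, collapse spaces."""
    text = str(text).lower()
    text = re.sub(r"[^a-z0-9 ]", " ", text)  # punctuation becomes a space
    text = re.sub(r"\s+", " ", text)         # runs of whitespace become one space
    return text.strip()

# Hypothetical narrations, for illustration only
samples = [
    "NEFT-SAL/ACME CORP/0123",
    "  Int.Pd:01-04-2023 to 30-06-2023 ",
    "T-BILL  REDEMPTION",
]

normalized = [normalize(s) for s in samples]
for raw, norm in zip(samples, normalized):
    print(f"{raw!r} -> {norm!r}")
```

Because punctuation is replaced by a space rather than deleted, `T-BILL` becomes `t bill`, not `tbill`. That is one reason a keyword list may need both spellings.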

In [None]:
# Apply normalization to all narrations and store in new column
all_df["narr_norm"] = all_df["Narration"].apply(normalize)

In [None]:
# Display category distribution to understand class balance
all_df['Category'].value_counts()

Category
Mutual Funds                                  6888
Business receipts (service-based business)    6780
Dividend Income                                677
Interest from Savings Bank                     634
Interest from savings bank                     581
Interest from deposit                          356
Rental Income                                  336
Salary Income                                  224
Interest from deposits                         141
Income Tax Refund                               48
Interest on bonds & government securities        9
Receipts from life insurance policy              3
Name: count, dtype: int64
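
The counts above list some labels twice with small spelling differences ("Interest from Savings Bank" and "Interest from savings bank", "Interest from deposit" and "Interest from deposits"). If these are meant to be the same class, which is an assumption and not something the data confirms, a minimal sketch for consolidating them could look like this. `LABEL_ALIASES` and `clean_category` are hypothetical names introduced here.

```python
import pandas as pd

# Small stand-in for all_df['Category']; the labels are copied from the output above
categories = pd.Series([
    "Interest from Savings Bank",
    "Interest from savings bank",
    "Interest from deposit",
    "Interest from deposits",
    "Rental Income",
])

# Hypothetical alias table for variants that differ by more than case or spacing
LABEL_ALIASES = {"interest from deposits": "interest from deposit"}

def clean_category(label):
    """Lowercase, collapse whitespace, then apply the alias table."""
    key = " ".join(str(label).lower().split())
    return LABEL_ALIASES.get(key, key)

cleaned = categories.map(clean_category)
print(cleaned.value_counts())
```

On the real data, the result would be assigned to a new column so the original labels stay available for auditing.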

In [None]:
# Define comprehensive keyword dictionaries for each transaction category
# These keywords are used to identify and classify transactions

BUSINESS_RECEIPT_KEYWORDS = [
    'invoice', 'inv',
    'consulting', 'consultancy',
    'professional fee', 'professional charges',
    'service charge', 'service charges', 'service fee',
    'maintenance',
    'amc',
    'freelance',
    'consultant',
    'it services',
    'tech services'
]

DIVIDEND_KEYWORDS = [
    'dividend',
    'interim dividend',
    'final dividend',
    'mf dividend',  # Mutual Fund dividend
    'for fy'  # For financial year
]

SAVINGS_INTEREST_KEYWORDS = [
    'interest',
    'sb interest',  # Savings Bank interest
    'savings interest',
    'interest paid',
    'sb int',
    'int'
]

DEPOSIT_INTEREST_KEYWORDS = [
    'fd interest',  # Fixed Deposit interest
    'fd int',
    'rd int',  # Recurring Deposit interest
    'td int',  # Term Deposit interest
    'int on fd'
]

RENTAL_INCOME_KEYWORDS = [
    'rent',
    'house rent',
    'flat rent',
    'shop rent',
    'office rent',
    'room rent',
    'lease',
    'rent for',
    'rent received'
]

SALARY_KEYWORDS = [
    'salary', 'sal', 'slry', 'salry',
    'payroll',
    'neft sal',  # NEFT (National Electronic Funds Transfer) salary
    'rtgs sal',  # RTGS (Real Time Gross Settlement) salary
    'wages',
    'remuneration',
    'stipend',
    'honorarium'
]

LIFE_INSURANCE_RECEIPT_KEYWORDS = [
    'life insurance',
    'maturity proceeds',
    'lic',  # Life Insurance Corporation
    'life insurance corporation',
    'hdfc life',
    'icici prudential',
    'sbi life',
    'max life',
    'tata aia',
    'bajaj allianz life'
]

INCOME_TAX_REFUND_KEYWORDS = [
    'it refund',
    'tax refund',
    'cbdt'  # Central Board of Direct Taxes
]

BOND_INTEREST_KEYWORDS = [
    'gsec',  # Government Securities
    't bill',  # Treasury Bill
    'tbill',
    'sovereign gold bond',
    'ncd'  # Non-Convertible Debenture
]
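
Short keywords such as `sal`, `int`, `rent`, or `lic` can also occur inside longer words. One possible way to use the lists for classification is to group them in a dictionary and match whole words only. The sketch below assumes that approach and uses a subset of the lists above; `CATEGORY_KEYWORDS`, `compile_keywords`, and `matching_categories` are hypothetical names, not part of the notebook.

```python
import re

# Subset of the keyword lists above, repeated so the example runs on its own
CATEGORY_KEYWORDS = {
    "Rental Income": ["rent", "house rent", "lease"],
    "Salary": ["salary", "sal", "payroll"],
    "Bond/Securities Interest": ["gsec", "t bill", "tbill", "ncd"],
}

def compile_keywords(keywords):
    """Build one regex that matches any keyword as a whole word or phrase."""
    ordered = sorted(keywords, key=len, reverse=True)  # try longer phrases first
    alternatives = "|".join(re.escape(k) for k in ordered)
    return re.compile(r"\b(?:" + alternatives + r")\b")

PATTERNS = {cat: compile_keywords(kws) for cat, kws in CATEGORY_KEYWORDS.items()}

def matching_categories(narr_norm):
    """Return every category whose keywords appear in a normalized narration."""
    return [cat for cat, pat in PATTERNS.items() if pat.search(narr_norm)]

# Hypothetical normalized narrations
print(matching_categories("neft sal acme corp"))        # ['Salary']
print(matching_categories("current account transfer"))  # []
print(matching_categories("house rent for march"))      # ['Rental Income']
```

With a plain substring test, `rent` would match inside `current`; the `\b` word boundaries avoid that. A narration can still match more than one category, so a real classifier would need a tie-breaking rule.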

In [None]:
# Calculate keyword frequency for Bond Interest keywords
# This demonstrates how to analyze keyword presence in the dataset

keyword_freq = {}

# Count the narrations that contain each keyword
for kw in BOND_INTEREST_KEYWORDS:
    keyword_freq[kw] = (
        all_df['narr_norm']
        .str.contains(kw, regex=False, na=False)  # Literal substring match; narr_norm is already lowercase
        .sum()  # Number of narrations containing the keyword
    )

# Convert to DataFrame and sort by frequency
keyword_freq_df = (
    pd.DataFrame.from_dict(keyword_freq, orient='index', columns=['count'])
      .sort_values('count', ascending=False)  # Sort descending
      .reset_index()
      .rename(columns={'index': 'keyword'})
)

keyword_freq_df

               keyword  count
0               t bill    358
1                tbill    257
2                 gsec     12
3  sovereign gold bond      4
4                  ncd      2
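
The `t bill` and `tbill` counts (358 and 257) are much larger than the 9 rows labeled "Interest on bonds & government securities" in the category distribution, which suggests that substring matches may be landing inside unrelated text. A rough way to check is to compare substring counts with whole-word counts. The sketch below assumes that comparison is useful; `keyword_frequency` is a hypothetical helper and the narrations are invented for illustration.

```python
import re
import pandas as pd

def keyword_frequency(narrations, keywords):
    """Count narrations matching each keyword as a substring and as a whole word."""
    rows = []
    for kw in keywords:
        substring = narrations.str.contains(kw, regex=False, na=False).sum()
        whole_word = narrations.str.contains(
            rf"\b{re.escape(kw)}\b", regex=True, na=False
        ).sum()
        rows.append({
            "keyword": kw,
            "substring_count": int(substring),
            "whole_word_count": int(whole_word),
        })
    return (
        pd.DataFrame(rows)
        .sort_values("substring_count", ascending=False, kind="stable")
        .reset_index(drop=True)
    )

# Hypothetical normalized narrations (not from the dataset)
narr = pd.Series([
    "t bill redemption 91 day",
    "credit bill payment",   # "t bill" appears across two words: credi[t bill]
    "tbill maturity",
    "gsec coupon",
    None,                    # missing narration, ignored via na=False
])

freq = keyword_frequency(narr, ["gsec", "t bill", "tbill"])
print(freq)
```

Running the same helper on `all_df['narr_norm']` with each keyword list would show which keywords lose most of their matches under whole-word matching, and so which ones deserve a closer look.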
