# Enhanced Terrapay Transaction Monitoring Analysis

This notebook analyzes Terrapay's transaction monitoring alerts to measure rule efficiency, locate the main sources of false positives, and quantify the impact of proposed rule and threshold changes.

## Advanced Objectives

1. Identify KYC IDs that have alerted in the last 3 months across rules
2. Determine what percentage of KYC IDs alerted on multiple rules
3. Analyze true positive vs false positive rates for rules
4. Examine KYC breakage impact on rule efficiency
5. Identify redundant or overlapping rules
6. Quantify the impact of rule modifications and threshold adjustments
7. Implement ATL/BTL threshold optimization analysis
8. Develop advanced clustering and pattern detection for rules
9. Generate specific, actionable recommendations with quantified impact

## 1. Setup and Data Loading

Let's import the necessary libraries and load the data.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
from collections import defaultdict, Counter
import itertools
import os
from fuzzywuzzy import fuzz  # For name similarity matching
import networkx as nx
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.preprocessing import StandardScaler
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.stats import pearsonr
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 200)
pd.set_option('display.max_rows', 100)

# Set visualization style (the seaborn theme is applied last and overrides most ggplot settings)
plt.style.use('ggplot')
sns.set(style="whitegrid")

# Create output directory for visualizations
os.makedirs('visualizations', exist_ok=True)

In [None]:
# Load all data from Excel file
def load_data():
    """Load all data from Excel file"""
    # Read transaction data
    transaction_data = pd.read_excel('transaction_dummy_data_10k_final.xlsx', 
                                     sheet_name='transaction_dummy_data_10k')
    
    # Read metadata
    metadata = pd.read_excel('transaction_dummy_data_10k_final.xlsx', 
                             sheet_name='Sheet2')
    
    # Read rule descriptions
    rule_descriptions = pd.read_excel('transaction_dummy_data_10k_final.xlsx', 
                                     sheet_name='rule_description')
    
    # Convert dates to datetime if not already
    date_columns = ['transaction_date_time_local', 'created_at', 'closed_at', 
                    'kyc_sender_create_date', 'kyc_receiver_create_date',
                    'dob_sender', 'dob_receiver', 'self_closure_date']
    
    for col in date_columns:
        if col in transaction_data.columns:
            transaction_data[col] = pd.to_datetime(transaction_data[col])
    
    return transaction_data, metadata, rule_descriptions

# Load all data
transaction_data, metadata, rule_descriptions = load_data()
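
The date conversion in `load_data` raises on the first unparseable value. As a minimal sketch of a more tolerant alternative, the helper below coerces bad values to `NaT` and reports how many were lost per column. The name `coerce_dates` and the demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def coerce_dates(df, date_columns):
    """Parse the listed columns as datetimes, turning unparseable values into NaT.

    Returns the converted copy and, per column, how many non-null values failed to parse.
    Columns that are absent from the frame are skipped, as in load_data.
    """
    out = df.copy()
    bad_counts = {}
    for col in date_columns:
        if col in out.columns:
            parsed = pd.to_datetime(out[col], errors='coerce')
            # A value is "bad" if it was present originally but is NaT after parsing
            bad_counts[col] = int((parsed.isna() & out[col].notna()).sum())
            out[col] = parsed
    return out, bad_counts

# Hypothetical demo data: one valid date, one garbage string, one missing value
demo_dates = pd.DataFrame({'created_at': ['2024-01-05', 'not a date', None]})
clean_dates, bad_dates = coerce_dates(demo_dates, ['created_at', 'closed_at'])
print(bad_dates)
```

Reporting the failure counts keeps the coercion from silently hiding data quality problems in the source workbook.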

## 2. Initial Data Exploration

Let's explore the dataset to understand its structure and contents.

In [None]:
# Basic exploratory analysis
print(f"Dataset shape: {transaction_data.shape}")
print(f"Data timeframe: {transaction_data['transaction_date_time_local'].min().date()} to {transaction_data['transaction_date_time_local'].max().date()}")

# Basic stats about the data
print("\nStatus distribution:")
status_counts = transaction_data['status'].value_counts()
print(status_counts)

# Plot status distribution
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=status_counts.index, y=status_counts.values)
plt.title('Alert Status Distribution')
plt.ylabel('Count')

# Add percentage labels
total = sum(status_counts)
for i, p in enumerate(ax.patches):
    percentage = 100 * p.get_height() / total
    ax.annotate(f'{percentage:.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')
                
plt.tight_layout()
plt.savefig('visualizations/status_distribution.png')
plt.show()

print("\nRule frequency distribution:")
freq_counts = transaction_data['rule_frequency'].value_counts()
print(freq_counts)

# Plot frequency distribution with percentages
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=freq_counts.index, y=freq_counts.values)
plt.title('Rule Frequency Distribution')
plt.ylabel('Count')

# Add percentage labels
total = sum(freq_counts)
for i, p in enumerate(ax.patches):
    percentage = 100 * p.get_height() / total
    ax.annotate(f'{percentage:.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')

plt.tight_layout()
plt.savefig('visualizations/frequency_distribution.png')
plt.show()

print("\nRule pattern distribution:")
pattern_counts = transaction_data['rule_pattern'].value_counts()
print(pattern_counts)

# Plot pattern distribution with percentages
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=pattern_counts.index, y=pattern_counts.values)
plt.title('Rule Pattern Distribution')
plt.ylabel('Count')

# Add percentage labels
total = sum(pattern_counts)
for i, p in enumerate(ax.patches):
    percentage = 100 * p.get_height() / total
    ax.annotate(f'{percentage:.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')

plt.tight_layout()
plt.savefig('visualizations/pattern_distribution.png')
plt.show()

print("\nTop 10 most frequent alerting rules:")
top_rules = transaction_data['alert_rules'].value_counts().head(10)
print(top_rules)

# Plot top rules with percentages
plt.figure(figsize=(12, 6))
ax = sns.barplot(x=top_rules.index, y=top_rules.values)
plt.title('Top 10 Most Frequent Alerting Rules')
plt.xticks(rotation=45)
plt.ylabel('Count')

# Add percentage labels
total = len(transaction_data)
for i, p in enumerate(ax.patches):
    percentage = 100 * p.get_height() / total
    ax.annotate(f'{percentage:.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')

plt.tight_layout()
plt.savefig('visualizations/top_rules.png')
plt.show()

# Distribution of triggered_on (sender vs receiver)
print("\nDistribution of triggered_on:")
triggered_counts = transaction_data['triggered_on'].value_counts()
print(triggered_counts)

# Plot triggered_on distribution with percentages
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=triggered_counts.index, y=triggered_counts.values)
plt.title('Distribution of Rules Triggered on Sender vs Receiver')
plt.ylabel('Count')

# Add percentage labels
total = sum(triggered_counts)
for i, p in enumerate(ax.patches):
    percentage = 100 * p.get_height() / total
    ax.annotate(f'{percentage:.1f}%', (p.get_x() + p.get_width() / 2., p.get_height()), 
                ha='center', va='bottom')

plt.tight_layout()
plt.savefig('visualizations/triggered_on_distribution.png')
plt.show()

# Count of unique KYC IDs
print("\nUnique KYC IDs:")
print(f"Sender KYC IDs: {transaction_data['sender_kyc_id_no'].nunique()}")
print(f"Receiver KYC IDs: {transaction_data['receiver_kyc_id_no'].nunique()}")

# Display sample of rule descriptions
print("\nRule descriptions sample:")
print(rule_descriptions.head())

## 3. KYC Alert Overlap Analysis with Enhanced Visualizations

In this section, we'll identify KYC IDs that have alerted across multiple rules and create detailed visualizations of the overlaps.
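
The first objective refers to alerts from the last 3 months. The sketch below shows one way to restrict the data to a trailing window and count distinct rules per alerting KYC ID without a row-by-row loop. It assumes `created_at` is the alert creation timestamp and that the window is measured back from the latest alert; both are assumptions, and `rules_per_kyc` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

def rules_per_kyc(df, months=3):
    """Count distinct rules per alerting KYC ID within a trailing window of `months`."""
    # Assumption: the window ends at the most recent alert in the data
    cutoff = df['created_at'].max() - pd.DateOffset(months=months)
    recent = df[df['created_at'] >= cutoff].copy()
    # Pick the sender or receiver KYC ID depending on which side triggered the alert
    recent['alerting_kyc'] = np.where(
        recent['triggered_on'] == 'sender',
        recent['sender_kyc_id_no'],
        recent['receiver_kyc_id_no'],
    )
    return recent.groupby('alerting_kyc')['alert_rules'].nunique()

# Hypothetical demo data: the January alert falls outside the 3-month window
demo_alerts_df = pd.DataFrame({
    'created_at': pd.to_datetime(['2024-01-05', '2024-05-10', '2024-06-01', '2024-06-20']),
    'triggered_on': ['sender', 'sender', 'sender', 'receiver'],
    'sender_kyc_id_no': ['S1', 'S1', 'S1', 'S2'],
    'receiver_kyc_id_no': ['R1', 'R1', 'R2', 'R1'],
    'alert_rules': ['RULE_A', 'RULE_A', 'RULE_B', 'RULE_A'],
})
rule_counts = rules_per_kyc(demo_alerts_df)
overlap_pct_demo = (rule_counts > 1).mean() * 100
print(rule_counts)
print(f"KYC IDs alerting on multiple rules: {overlap_pct_demo:.1f}%")
```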

In [None]:
def analyze_kyc_alert_overlap(transaction_data):
    """Analyze KYC IDs that have alerted across rules and determine overlap."""
    print("Analyzing KYC IDs that have alerted...")
    
    # Group alerts by KYC ID (based on triggered_on field)
    kyc_alerts = defaultdict(set)
    kyc_entity_type = {}
    
    for idx, row in transaction_data.iterrows():
        if row['triggered_on'] == 'sender':
            kyc_id = row['sender_kyc_id_no']
            kyc_entity_type[kyc_id] = 'sender'
        else:  # receiver
            kyc_id = row['receiver_kyc_id_no']
            kyc_entity_type[kyc_id] = 'receiver'
            
        kyc_alerts[kyc_id].add(row['alert_rules'])
    
    # Calculate statistics
    total_kyc_with_alerts = len(kyc_alerts)
    kyc_with_multiple_rules = sum(1 for rules in kyc_alerts.values() if len(rules) > 1)
    
    # Distribution of number of rules per KYC
    rule_count_per_kyc = [len(rules) for rules in kyc_alerts.values()]
    rule_count_distribution = pd.Series(rule_count_per_kyc).value_counts().sort_index()
    
    # Calculate overlap percentage
    overlap_percentage = (kyc_with_multiple_rules / total_kyc_with_alerts) * 100 if total_kyc_with_alerts > 0 else 0
    
    print(f"Total KYC IDs with alerts: {total_kyc_with_alerts}")
    print(f"KYC IDs alerting on multiple rules: {kyc_with_multiple_rules}")
    print(f"Percentage of KYC IDs alerting on multiple rules: {overlap_percentage:.2f}%")
    
    print("\nDistribution of number of rules per KYC ID:")
    print(rule_count_distribution)
    
    # Calculate statistics by entity type (sender vs receiver)
    sender_kycs = {k: v for k, v in kyc_alerts.items() if kyc_entity_type.get(k) == 'sender'}
    receiver_kycs = {k: v for k, v in kyc_alerts.items() if kyc_entity_type.get(k) == 'receiver'}
    
    # Sender statistics
    sender_multiple_rules = sum(1 for rules in sender_kycs.values() if len(rules) > 1)
    sender_overlap_pct = (sender_multiple_rules / len(sender_kycs)) * 100 if sender_kycs else 0
    
    # Receiver statistics
    receiver_multiple_rules = sum(1 for rules in receiver_kycs.values() if len(rules) > 1)
    receiver_overlap_pct = (receiver_multiple_rules / len(receiver_kycs)) * 100 if receiver_kycs else 0
    
    print(f"\nSender KYCs with multiple rules: {sender_multiple_rules} ({sender_overlap_pct:.2f}%)")
    print(f"Receiver KYCs with multiple rules: {receiver_multiple_rules} ({receiver_overlap_pct:.2f}%)")
    
    # Plot the distribution as a histogram
    plt.figure(figsize=(12, 6))
    plt.subplot(1, 2, 1)
    bins = range(1, max(rule_count_per_kyc) + 2)
    plt.hist(rule_count_per_kyc, bins=bins, alpha=0.7, color='skyblue', edgecolor='black')
    plt.title('Histogram of Rules per KYC ID')
    plt.xlabel('Number of Rules')
    plt.ylabel('Count of KYC IDs')
    plt.xticks(range(1, max(rule_count_per_kyc) + 1))
    plt.grid(axis='y', alpha=0.75)
    
    # Plot as a bar chart too
    plt.subplot(1, 2, 2)
    rule_count_distribution.plot(kind='bar')
    plt.title('Number of Rules Triggered per KYC ID')
    plt.xlabel('Number of Rules')
    plt.ylabel('Count of KYC IDs')
    plt.xticks(rotation=0)
    plt.tight_layout()
    plt.savefig('visualizations/rules_per_kyc_distribution.png')
    plt.show()
    
    # Compare sender vs receiver overlap distribution
    sender_rule_counts = [len(rules) for rules in sender_kycs.values()]
    receiver_rule_counts = [len(rules) for rules in receiver_kycs.values()]
    
    plt.figure(figsize=(12, 6))
    bins = range(1, max(max(sender_rule_counts, default=1), max(receiver_rule_counts, default=1)) + 2)
    plt.hist([sender_rule_counts, receiver_rule_counts], bins=bins, 
             label=['Sender', 'Receiver'], alpha=0.7, edgecolor='black')
    plt.title('Histogram of Rules per KYC ID by Entity Type')
    plt.xlabel('Number of Rules')
    plt.ylabel('Count of KYC IDs')
    plt.legend()
    plt.grid(axis='y', alpha=0.75)
    plt.xticks(range(1, max(bins)))
    plt.tight_layout()
    plt.savefig('visualizations/rules_per_kyc_by_entity.png')
    plt.show()
    
    # Find co-occurring rules
    rule_pairs = []
    for rules in kyc_alerts.values():
        if len(rules) > 1:
            # Sort so each pair has one canonical order; otherwise (A, B) and (B, A) are counted separately
            rule_list = sorted(rules)
            for i in range(len(rule_list)):
                for j in range(i+1, len(rule_list)):
                    rule_pairs.append((rule_list[i], rule_list[j]))
    
    # Count occurrences of each rule pair
    rule_pair_counts = pd.Series(rule_pairs).value_counts().head(15)
    
    print("\nTop 15 co-occurring rule pairs:")
    print(rule_pair_counts)
    
    # Plot top co-occurring rule pairs
    plt.figure(figsize=(14, 8))
    ax = rule_pair_counts.plot(kind='barh', color='teal')
    plt.title('Top 15 Co-occurring Rule Pairs')
    plt.xlabel('Count of Co-occurrences')
    plt.ylabel('Rule Pair')
    
    # Add percentage labels relative to total overlapping KYCs
    for i, v in enumerate(rule_pair_counts):
        percentage = 100 * v / kyc_with_multiple_rules
        ax.text(v + 0.1, i, f'{percentage:.1f}%', va='center')
        
    plt.tight_layout()
    plt.savefig('visualizations/top_cooccurring_rules.png')
    plt.show()
    
    # Create and plot co-occurrence matrix if we have rule pairs
    if rule_pairs:
        unique_rules = sorted(set(rule for pair in rule_pairs for rule in pair))
        
        # Only create a heatmap if not too large
        if len(unique_rules) <= 30:  
            cooccurrence_matrix = pd.DataFrame(0, index=unique_rules, columns=unique_rules)
            
            for r1, r2 in rule_pairs:
                cooccurrence_matrix.loc[r1, r2] += 1
                cooccurrence_matrix.loc[r2, r1] += 1
            
            plt.figure(figsize=(14, 12))
            sns.heatmap(cooccurrence_matrix, cmap="YlGnBu", annot=True, fmt='.0f')
            plt.title('Rule Co-occurrence Matrix')
            plt.tight_layout()
            plt.savefig('visualizations/rule_cooccurrence_matrix.png')
            plt.show()
    
    # Additional analysis: most common rule combinations (pairs, triples, and quadruples)
    rule_combinations = defaultdict(int)
    
    # Look for combinations of 2-4 rules that frequently co-occur
    for rules in kyc_alerts.values():
        rule_list = sorted(list(rules))  # Sort to ensure consistent ordering
        if len(rule_list) >= 2:
            # Generate all combinations of 2-4 rules (or fewer if not enough rules)
            max_combo_size = min(4, len(rule_list))
            for size in range(2, max_combo_size + 1):
                for combo in itertools.combinations(rule_list, size):
                    rule_combinations[combo] += 1
    
    # Get the top combinations
    top_combinations = sorted(rule_combinations.items(), key=lambda x: x[1], reverse=True)[:10]
    
    print("\nTop rule combinations (2 to 4 rules):")
    for combo, count in top_combinations:
        print(f"{' + '.join(combo)}: {count} occurrences")
    
    return kyc_alerts, rule_count_distribution, rule_pair_counts, kyc_entity_type, rule_combinations

In [None]:
# Execute the enhanced KYC alert overlap analysis
kyc_alerts, rule_count_dist, rule_pairs, kyc_entity_type, rule_combinations = analyze_kyc_alert_overlap(transaction_data)
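
Pair counts are only meaningful when each pair has a single canonical ordering, because sets do not guarantee iteration order. A minimal sketch of the counting step, assuming rule identifiers are strings; `count_rule_pairs` and the demo mapping are hypothetical:

```python
from collections import Counter
from itertools import combinations

def count_rule_pairs(kyc_to_rules):
    """Count how many KYC IDs triggered each unordered pair of rules."""
    pair_counts = Counter()
    for rules in kyc_to_rules.values():
        # sorted() fixes the order, so (A, B) and (B, A) map to the same key
        pair_counts.update(combinations(sorted(rules), 2))
    return pair_counts

# Hypothetical demo mapping of KYC ID -> set of rules that alerted
demo_kyc_rules = {
    'K1': {'RULE_B', 'RULE_A'},
    'K2': {'RULE_A', 'RULE_B', 'RULE_C'},
    'K3': {'RULE_C'},
}
demo_pair_counts = count_rule_pairs(demo_kyc_rules)
print(demo_pair_counts.most_common(3))
```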

## 4. Rule Efficiency Analysis with Impact Assessment

Next, we'll analyze the efficiency of each rule and quantify the impact of potential modifications.

In [None]:
def analyze_rule_efficiency_with_impact(transaction_data, rule_descriptions):
    """Analyze the efficiency of rules and quantify impact of potential modifications."""
    print("Analyzing rule efficiency and potential impact of modifications...")
    
    # Filter for closed alerts only (where investigation is complete)
    closed_alerts = transaction_data[transaction_data['status'].isin(['Closed TP', 'Closed FP'])]
    
    # Overall TP/FP rates
    true_positive_rate = len(closed_alerts[closed_alerts['status'] == 'Closed TP']) / len(closed_alerts) * 100
    false_positive_rate = len(closed_alerts[closed_alerts['status'] == 'Closed FP']) / len(closed_alerts) * 100
    
    print(f"Overall True Positive Rate: {true_positive_rate:.2f}%")
    print(f"Overall False Positive Rate: {false_positive_rate:.2f}%")
    
    # Create a performance dataframe for each rule
    rule_performance = closed_alerts.groupby('alert_rules').apply(
        lambda x: pd.Series({
            'Total': len(x),
            'TP': sum(x['status'] == 'Closed TP'),
            'FP': sum(x['status'] == 'Closed FP'),
            'TP_Rate': sum(x['status'] == 'Closed TP') / len(x) * 100 if len(x) > 0 else 0,
            'FP_Rate': sum(x['status'] == 'Closed FP') / len(x) * 100 if len(x) > 0 else 0,
            'Frequency': x['rule_frequency'].iloc[0] if not x['rule_frequency'].empty else 'Unknown',
            'Pattern': x['rule_pattern'].iloc[0] if not x['rule_pattern'].empty else 'Unknown'
        })
    ).reset_index()
    
    # Merge with rule descriptions
    rule_performance = rule_performance.merge(
        rule_descriptions[['Rule no.', 'Rule description', 'Current threshold']], 
        left_on='alert_rules', 
        right_on='Rule no.', 
        how='left'
    ).drop('Rule no.', axis=1)
    
    # Calculate efficiency score: F1 with recall fixed at 100% and TP_Rate as precision
    # The score increases monotonically with TP_Rate, so it ranks rules exactly as TP_Rate does
    rule_performance['Efficiency_Score'] = (2 * rule_performance['TP_Rate']) / (100 + rule_performance['TP_Rate'])
    
    # Sort by efficiency score descending
    rule_performance_by_efficiency = rule_performance.sort_values('Efficiency_Score', ascending=False)
    
    print("\nRule performance by efficiency score (Top 10):")
    print(rule_performance_by_efficiency[['alert_rules', 'Total', 'TP', 'FP', 'TP_Rate', 'Efficiency_Score', 'Frequency', 'Pattern']].head(10))
    
    print("\nRule performance by efficiency score (Bottom 10):")
    print(rule_performance_by_efficiency[['alert_rules', 'Total', 'TP', 'FP', 'TP_Rate', 'Efficiency_Score', 'Frequency', 'Pattern']].tail(10))
    
    # Calculate impact of removing inefficient rules
    # Find inefficient rules (high volume, low TP rate)
    inefficient_rules = rule_performance[(rule_performance['Total'] > 50) & 
                                         (rule_performance['TP_Rate'] < 30)].sort_values('Total', ascending=False)
    
    print("\nInefficient rules (high volume, low TP rate):")
    print(inefficient_rules[['alert_rules', 'Total', 'TP', 'FP', 'TP_Rate', 'Efficiency_Score', 'Frequency', 'Pattern']].head(10))
    
    # Calculate impact of removing these rules
    total_alerts = closed_alerts.shape[0]
    total_tp = closed_alerts[closed_alerts['status'] == 'Closed TP'].shape[0]
    total_fp = closed_alerts[closed_alerts['status'] == 'Closed FP'].shape[0]
    
    # Impact if we remove the top 5 inefficient rules
    top5_inefficient = inefficient_rules.head(5)['alert_rules'].tolist()
    removed_alerts = closed_alerts[closed_alerts['alert_rules'].isin(top5_inefficient)]
    removed_tp = removed_alerts[removed_alerts['status'] == 'Closed TP'].shape[0]
    removed_fp = removed_alerts[removed_alerts['status'] == 'Closed FP'].shape[0]
    
    # Calculate new metrics after removal
    new_total = total_alerts - removed_alerts.shape[0]
    new_tp = total_tp - removed_tp
    new_fp = total_fp - removed_fp
    new_tp_rate = new_tp / new_total * 100 if new_total > 0 else 0
    
    # Calculate percentage changes
    alert_reduction_pct = removed_alerts.shape[0] / total_alerts * 100
    tp_reduction_pct = removed_tp / total_tp * 100 if total_tp > 0 else 0
    fp_reduction_pct = removed_fp / total_fp * 100 if total_fp > 0 else 0
    tp_rate_change = new_tp_rate - true_positive_rate
    
    print("\nImpact of removing top 5 inefficient rules:")
    print(f"Alert volume reduction: {removed_alerts.shape[0]} alerts ({alert_reduction_pct:.2f}% of total)")
    print(f"True positives lost: {removed_tp} ({tp_reduction_pct:.2f}% of all TPs)")
    print(f"False positives eliminated: {removed_fp} ({fp_reduction_pct:.2f}% of all FPs)")
    print(f"True positive rate change: {true_positive_rate:.2f}% → {new_tp_rate:.2f}% ({tp_rate_change:+.2f}%)")
    
    # Impact of converting daily rules to weekly/monthly
    daily_rules = rule_performance[rule_performance['Frequency'] == 'daily'].copy()
    # Find inefficient daily rules
    inefficient_daily = daily_rules[daily_rules['TP_Rate'] < 30].sort_values('Total', ascending=False)
    
    # Calculate impact of converting top 5 inefficient daily rules
    top5_daily = inefficient_daily.head(5)['alert_rules'].tolist()
    daily_alerts = closed_alerts[closed_alerts['alert_rules'].isin(top5_daily)]
    
    # Estimate reduction in alerts (assuming weekly = ~1/5 of daily volume)
    daily_to_weekly_reduction = daily_alerts.shape[0] * 0.8  # 80% reduction
    daily_to_weekly_pct = daily_to_weekly_reduction / total_alerts * 100
    
    print("\nEstimated impact of converting top 5 inefficient daily rules to weekly:")
    print(f"Alert volume reduction: ~{daily_to_weekly_reduction:.0f} alerts ({daily_to_weekly_pct:.2f}% of total)")
    
    # Analyze performance by pattern
    pattern_performance = rule_performance.groupby('Pattern').agg({
        'Total': 'sum',
        'TP': 'sum',
        'FP': 'sum'
    }).reset_index()
    
    pattern_performance['TP_Rate'] = pattern_performance['TP'] / pattern_performance['Total'] * 100
    pattern_performance['Volume_Pct'] = pattern_performance['Total'] / pattern_performance['Total'].sum() * 100
    pattern_performance = pattern_performance.sort_values('TP_Rate', ascending=False)
    
    print("\nPerformance by rule pattern:")
    print(pattern_performance)
    
    # Analyze performance by frequency
    frequency_performance = rule_performance.groupby('Frequency').agg({
        'Total': 'sum',
        'TP': 'sum',
        'FP': 'sum'
    }).reset_index()
    
    frequency_performance['TP_Rate'] = frequency_performance['TP'] / frequency_performance['Total'] * 100
    frequency_performance['Volume_Pct'] = frequency_performance['Total'] / frequency_performance['Total'].sum() * 100
    frequency_performance = frequency_performance.sort_values('TP_Rate', ascending=False)
    
    print("\nPerformance by rule frequency:")
    print(frequency_performance)
    
    # Plot TP rate by rule (top 20 by volume) with FP rate
    top_rules_by_volume = rule_performance.sort_values('Total', ascending=False).head(20)
    
    # Create a figure for the visualization
    plt.figure(figsize=(16, 8))
    
    # Set the positions for the bars
    x = np.arange(len(top_rules_by_volume))
    width = 0.35
    
    # Create the bars
    plt.bar(x - width/2, top_rules_by_volume['TP_Rate'], width, label='TP Rate', color='green', alpha=0.7)
    plt.bar(x + width/2, top_rules_by_volume['FP_Rate'], width, label='FP Rate', color='red', alpha=0.7)
    
    # Add a reference line at 50%
    plt.axhline(y=50, color='blue', linestyle='--', label='50% Rate')
    
    # Add some text for labels, title and axes ticks
    plt.xlabel('Rule')
    plt.ylabel('Rate (%)')
    plt.title('True Positive and False Positive Rates for Top 20 Rules by Volume')
    plt.xticks(x, top_rules_by_volume['alert_rules'], rotation=45, ha='right')
    plt.legend()
    
    # Add a second y-axis for alert volume
    ax2 = plt.twinx()
    # Plot the total volume as a line
    ax2.plot(x, top_rules_by_volume['Total'], 'o-', color='purple', alpha=0.6, label='Total Alerts')
    ax2.set_ylabel('Number of Alerts', color='purple')
    ax2.tick_params(axis='y', colors='purple')
    
    # Add annotations for total volume
    for i, v in enumerate(top_rules_by_volume['Total']):
        ax2.annotate(str(v), (x[i], v), xytext=(0, 5), textcoords='offset points', ha='center')
    
    plt.tight_layout()
    plt.savefig('visualizations/tp_fp_rates_top_volume_rules.png')
    plt.show()
    
    # Plot by pattern with volume proportion
    plt.figure(figsize=(12, 8))
    ax = plt.subplot(111)
    
    # Bar plot for TP rate
    bars = plt.bar(pattern_performance['Pattern'], pattern_performance['TP_Rate'], 
                   color='green', alpha=0.7, label='TP Rate')
    
    # Add volume percentage annotations
    for i, bar in enumerate(bars):
        height = bar.get_height()
        volume_pct = pattern_performance.iloc[i]['Volume_Pct']
        plt.text(bar.get_x() + bar.get_width()/2., height + 1,
                f'Vol: {volume_pct:.1f}%', ha='center', va='bottom')
    
    plt.title('True Positive Rate by Rule Pattern (with Volume Percentage)')
    plt.xlabel('Pattern')
    plt.ylabel('True Positive Rate (%)')
    plt.ylim(0, 100)  # Set y-axis limit to 100%
    plt.tight_layout()
    plt.savefig('visualizations/tp_rate_by_pattern_with_volume.png')
    plt.show()
    
    # Plot by frequency with volume proportion
    plt.figure(figsize=(12, 8))
    
    # Bar plot for TP rate
    bars = plt.bar(frequency_performance['Frequency'], frequency_performance['TP_Rate'], 
                   color='purple', alpha=0.7, label='TP Rate')
    
    # Add volume percentage annotations
    for i, bar in enumerate(bars):
        height = bar.get_height()
        volume_pct = frequency_performance.iloc[i]['Volume_Pct']
        plt.text(bar.get_x() + bar.get_width()/2., height + 1,
                f'Vol: {volume_pct:.1f}%', ha='center', va='bottom')
    
    plt.title('True Positive Rate by Rule Frequency (with Volume Percentage)')
    plt.xlabel('Frequency')
    plt.ylabel('True Positive Rate (%)')
    plt.ylim(0, 100)  # Set y-axis limit to 100%
    plt.tight_layout()
    plt.savefig('visualizations/tp_rate_by_frequency_with_volume.png')
    plt.show()
    
    # Get list of true positives
    true_positives = transaction_data[transaction_data['status'] == 'Closed TP']
    
    # Extract unique KYC IDs with true positive alerts
    tp_kyc_ids = []
    for idx, row in true_positives.iterrows():
        if row['triggered_on'] == 'sender':
            tp_kyc_ids.append(row['sender_kyc_id_no'])
        else:  # receiver
            tp_kyc_ids.append(row['receiver_kyc_id_no'])
    
    unique_tp_kyc_ids = set(tp_kyc_ids)
    
    print(f"\nTotal true positive alerts: {len(true_positives)}")
    print(f"Number of unique KYC IDs with true positive alerts: {len(unique_tp_kyc_ids)}")
    
    # Calculate effectiveness ratio (TP per KYC)
    tp_per_kyc = len(true_positives) / len(unique_tp_kyc_ids) if unique_tp_kyc_ids else 0
    print(f"True positives per unique KYC ID: {tp_per_kyc:.2f}")
    
    return rule_performance, pattern_performance, frequency_performance, true_positives, unique_tp_kyc_ids

In [None]:
# Execute the enhanced rule efficiency analysis with impact assessment
rule_performance, pattern_performance, frequency_performance, true_positives, tp_kyc_ids = \
    analyze_rule_efficiency_with_impact(transaction_data, rule_descriptions)
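
The per-rule summary can also be built with named aggregation, which keeps the count and rate columns numeric instead of the object dtype that `groupby().apply()` with a mixed-type `pd.Series` tends to produce. This is a sketch of that alternative, not the notebook's own method; `summarize_rules` and the demo frame are hypothetical.

```python
import pandas as pd

def summarize_rules(closed):
    """Per-rule totals, TP/FP counts, TP rate, and the notebook's efficiency score."""
    flagged = closed.assign(is_tp=closed['status'].eq('Closed TP'))
    summary = flagged.groupby('alert_rules').agg(
        Total=('is_tp', 'size'),
        TP=('is_tp', 'sum'),
    )
    summary['FP'] = summary['Total'] - summary['TP']
    summary['TP_Rate'] = summary['TP'] / summary['Total'] * 100
    # Same formula as the notebook: 2 * TP_Rate / (100 + TP_Rate), ranging from 0 to 1
    summary['Efficiency_Score'] = 2 * summary['TP_Rate'] / (100 + summary['TP_Rate'])
    return summary

# Hypothetical demo data: RULE_A is 1 TP out of 4, RULE_B is 2 TP out of 2
demo_closed = pd.DataFrame({
    'alert_rules': ['RULE_A'] * 4 + ['RULE_B'] * 2,
    'status': ['Closed TP', 'Closed FP', 'Closed FP', 'Closed FP', 'Closed TP', 'Closed TP'],
})
demo_summary = summarize_rules(demo_closed)
print(demo_summary)
```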

## 5. ATL/BTL Threshold Optimization Analysis

Now we'll perform ATL/BTL (Above-the-Line/Below-the-Line) threshold analysis for the top 5 alerting rules to find optimal thresholds.
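
The core of the analysis is a threshold sweep: for each candidate threshold, count how many known true positives would still alert and how much alert volume would disappear. The sketch below illustrates that idea on made-up values, assuming alerts trigger when the value exceeds the threshold and that the score rewards both TP retention and alert reduction with 0.7/0.3 weights. `sweep_thresholds` and the demo values are hypothetical.

```python
import numpy as np
import pandas as pd

def sweep_thresholds(values, is_tp, thresholds, tp_weight=0.7):
    """Evaluate candidate thresholds on closed alerts with known outcomes."""
    values = np.asarray(values, dtype=float)
    is_tp = np.asarray(is_tp, dtype=bool)
    rows = []
    for t in thresholds:
        kept = values > t  # alerts that would still fire at this threshold
        tp_retained = (kept & is_tp).sum() / is_tp.sum() * 100 if is_tp.any() else 0.0
        alert_reduction = (1 - kept.mean()) * 100
        rows.append({
            'Threshold': float(t),
            'TP_Retained': tp_retained,
            'Alert_Reduction': alert_reduction,
            # Higher is better on both terms, so both enter with a positive sign
            'Score': tp_weight * tp_retained + (1 - tp_weight) * alert_reduction,
        })
    return pd.DataFrame(rows)

# Hypothetical demo data: ten alerts, true positives at 700, 900, and 1000
demo_values = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
demo_is_tp = [v in (700, 900, 1000) for v in demo_values]
demo_sweep = sweep_thresholds(demo_values, demo_is_tp, thresholds=[0, 500, 800])
best_threshold = demo_sweep.loc[demo_sweep['Score'].idxmax(), 'Threshold']
print(demo_sweep)
print(f"Best threshold in this demo: {best_threshold}")
```

In this demo the middle threshold wins because it halves the alert volume while keeping every true positive; the weights are a policy choice rather than a statistical result.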

In [None]:
def perform_atl_btl_analysis(transaction_data, rule_descriptions):
    """Perform ATL/BTL threshold analysis for the top 5 alerting rules."""
    print("Performing ATL/BTL threshold optimization analysis...")
    
    # Get the top 5 rules by alert volume
    top_rules = transaction_data['alert_rules'].value_counts().head(5).index.tolist()
    print(f"Top 5 rules by alert volume: {top_rules}")
    
    # Get current thresholds
    rule_thresholds = rule_descriptions.set_index('Rule no.')['Current threshold'].to_dict()
    
    # For each rule, simulate different thresholds
    for rule in top_rules:
        if rule in rule_thresholds:
            current_threshold = rule_thresholds[rule]
            print(f"\nAnalyzing Rule {rule} (Current threshold: {current_threshold})")
            
            # Get alerts for this rule
            rule_alerts = transaction_data[transaction_data['alert_rules'] == rule]
            
            # Filter for closed alerts (with known outcome)
            closed_rule_alerts = rule_alerts[rule_alerts['status'].isin(['Closed TP', 'Closed FP'])]
            
            # Check if there's value information
            if 'usd_value' in closed_rule_alerts.columns:
                # For volume-based rules, we'll use transaction value as a proxy
                # In real-world scenarios, you'd use the actual metric that triggers the rule
                
                # Get the rule pattern to determine what type of rule it is
                rule_pattern = rule_descriptions[rule_descriptions['Rule no.'] == rule]['Rule Pattern'].values[0] \
                    if rule in rule_descriptions['Rule no.'].values else 'Unknown'
                
                print(f"Rule pattern: {rule_pattern}")
                
                # Extract transaction values and statuses
                values = closed_rule_alerts['usd_value'].values
                statuses = closed_rule_alerts['status'].values
                
                if len(values) > 0:
                    # Create threshold ranges based on the data
                    min_val = values.min()
                    max_val = values.max()
                    
                    # Generate potential thresholds
                    # For demonstration, use percentiles as potential thresholds
                    percentiles = np.arange(10, 100, 10)  # 10th, 20th, ..., 90th percentiles
                    thresholds = np.percentile(values, percentiles)
                    
                    # Add current threshold if it's a numeric value (including NumPy scalars) and not NaN
                    if isinstance(current_threshold, (int, float, np.integer, np.floating)) and not pd.isna(current_threshold):
                        thresholds = np.append(thresholds, current_threshold)
                        thresholds = np.sort(thresholds)
                    
                    # Calculate TP/FP rates for each threshold
                    results = []
                    for threshold in thresholds:
                        # Alerts are assumed to trigger when value > threshold for every rule pattern;
                        # pattern-specific trigger logic would go here
                        triggered = values > threshold
                        
                        # Count TPs and FPs with this threshold
                        tp_count = sum((statuses == 'Closed TP') & triggered)
                        fp_count = sum((statuses == 'Closed FP') & triggered)
                        total = sum(triggered)
                        
                        # Calculate rates: share of known TPs and FPs that would still alert at this threshold
                        tp_rate = tp_count / sum(statuses == 'Closed TP') * 100 if sum(statuses == 'Closed TP') > 0 else 0
                        fp_rate = fp_count / sum(statuses == 'Closed FP') * 100 if sum(statuses == 'Closed FP') > 0 else 0
                        precision = tp_count / total * 100 if total > 0 else 0
                        
                        # Calculate alert volume reduction
                        alert_reduction = (1 - sum(triggered) / len(values)) * 100
                        
                        results.append({
                            'Threshold': threshold,
                            'TP_Count': tp_count,
                            'FP_Count': fp_count,
                            'Total_Alerts': sum(triggered),
                            'TP_Rate': tp_rate,
                            'FP_Rate': fp_rate,
                            'Precision': precision,
                            'Alert_Reduction': alert_reduction
                        })
                    
                    # Convert to DataFrame
                    results_df = pd.DataFrame(results)
                    
                    # Find optimal threshold
                    # Define a simple scoring function that rewards both TP retention and alert reduction
                    results_df['Score'] = results_df['TP_Rate'] * 0.7 + results_df['Alert_Reduction'] * 0.3
                    optimal_threshold = results_df.loc[results_df['Score'].idxmax(), 'Threshold']
                    
                    print(f"Optimal threshold: {optimal_threshold:.2f} (current: {current_threshold})")
                    
                    # Print summary of results
                    current_results = results_df[results_df['Threshold'] == current_threshold] if current_threshold in results_df['Threshold'].values else None
                    optimal_results = results_df[results_df['Threshold'] == optimal_threshold]
                    
                    print("Current threshold performance:")
                    if current_results is not None and not current_results.empty:
                        print(f"TP Rate: {current_results.iloc[0]['TP_Rate']:.2f}%, FP Rate: {current_results.iloc[0]['FP_Rate']:.2f}%, "
                              f"Precision: {current_results.iloc[0]['Precision']:.2f}%, Alerts: {current_results.iloc[0]['Total_Alerts']}")
                    else:
                        print("Current threshold not in the analysis range")
                    
                    print("Optimal threshold performance:")
                    print(f"TP Rate: {optimal_results.iloc[0]['TP_Rate']:.2f}%, FP Rate: {optimal_results.iloc[0]['FP_Rate']:.2f}%, "
                          f"Precision: {optimal_results.iloc[0]['Precision']:.2f}%, Alerts: {optimal_results.iloc[0]['Total_Alerts']}")
                    
                    # Calculate impact of changing threshold
                    if current_results is not None and not current_results.empty:
                        alert_change = optimal_results.iloc[0]['Total_Alerts'] - current_results.iloc[0]['Total_Alerts']
                        alert_change_pct = (alert_change / current_results.iloc[0]['Total_Alerts']) * 100 if current_results.iloc[0]['Total_Alerts'] > 0 else 0
                        tp_rate_change = optimal_results.iloc[0]['TP_Rate'] - current_results.iloc[0]['TP_Rate']
                        precision_change = optimal_results.iloc[0]['Precision'] - current_results.iloc[0]['Precision']
                        
                        print(f"Impact of threshold change: {alert_change:+.0f} alerts ({alert_change_pct:+.2f}%), "
                              f"TP Rate: {tp_rate_change:+.2f}%, Precision: {precision_change:+.2f}%")
                    
                    # Plot threshold analysis results
                    plt.figure(figsize=(14, 8))
                    
                    # Plot TP Rate, Precision and Alert Reduction
                    plt.plot(results_df['Threshold'], results_df['TP_Rate'], 'o-', label='TP Rate', color='green')
                    plt.plot(results_df['Threshold'], results_df['Precision'], 's-', label='Precision', color='blue')
                    plt.plot(results_df['Threshold'], results_df['Alert_Reduction'], '^-', label='Alert Reduction', color='red')
                    
                    # Add current and optimal threshold lines
                    if current_threshold in results_df['Threshold'].values:
                        plt.axvline(x=current_threshold, color='black', linestyle='--', label=f'Current Threshold ({current_threshold})')
                    plt.axvline(x=optimal_threshold, color='purple', linestyle='-', label=f'Optimal Threshold ({optimal_threshold:.2f})')
                    
                    plt.title(f'Threshold Analysis for Rule {rule}')
                    plt.xlabel('Threshold Value')
                    plt.ylabel('Percentage (%)')
                    plt.legend()
                    plt.grid(True, alpha=0.3)
                    plt.tight_layout()
                    plt.savefig(f'visualizations/threshold_analysis_{rule}.png')
                    plt.show()
                else:
                    print("Insufficient data for threshold analysis")
            else:
                print("Value information not available for threshold analysis")
        else:
            print(f"No threshold information found for rule {rule}")
    
    return top_rules

In [None]:
# Execute ATL/BTL threshold analysis
top_rules = perform_atl_btl_analysis(transaction_data, rule_descriptions)
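
To make the trade-off in the threshold sweep concrete, the sketch below reruns the same idea on a handful of made-up alerts in plain Python. The function name `sweep_thresholds`, the sample amounts, and the assumption that a rule fires when the value is at or above the threshold are all illustrative and are not taken from the notebook's data.

```python
def sweep_thresholds(values, statuses, thresholds):
    """Return per-threshold TP capture, precision and alert reduction (all in %)."""
    total_tp = sum(1 for s in statuses if s == 'Closed TP')
    rows = []
    for threshold in thresholds:
        # Assumption: the rule fires when the value is at or above the threshold
        triggered = [v >= threshold for v in values]
        alerts = sum(triggered)
        tp = sum(1 for t, s in zip(triggered, statuses) if t and s == 'Closed TP')
        rows.append({
            'threshold': threshold,
            'tp_capture': tp / total_tp * 100 if total_tp else 0.0,
            'precision': tp / alerts * 100 if alerts else 0.0,
            'alert_reduction': (1 - alerts / len(values)) * 100 if values else 0.0,
        })
    return rows


# Made-up transaction amounts and dispositions, for illustration only
values = [100, 200, 300, 400, 500, 600]
statuses = ['Closed FP', 'Closed FP', 'Closed TP', 'Closed FP', 'Closed TP', 'Closed TP']

sweep = sweep_thresholds(values, statuses, thresholds=[100, 300, 500])
for row in sweep:
    print(row)
```

On this toy data, raising the threshold from 100 to 300 removes a third of the alerts while still capturing every true positive; going to 500 cuts more alerts but starts to miss true positives.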

## 6. KYC Breakage Analysis with Enhanced Metrics

This section examines KYC breakage, where a single sender or receiver name maps to more than one KYC ID, and quantifies its effect on alert volume and false positives.
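
As a minimal sketch of the detection logic, the snippet below counts distinct KYC IDs per name on a few made-up records, using only the standard library. The names and IDs are hypothetical.

```python
from collections import defaultdict

# Made-up (name, KYC ID) pairs; the names and IDs are illustrative only
records = [
    ('Asha Rao', 'KYC001'),
    ('Asha Rao', 'KYC002'),   # same name, second KYC ID -> breakage
    ('Asha Rao', 'KYC001'),
    ('Ben Okoro', 'KYC003'),
    ('Chen Li', 'KYC004'),
    ('Chen Li', 'KYC005'),
]

# Collect the distinct KYC IDs seen for each name
kyc_ids_by_name = defaultdict(set)
for name, kyc_id in records:
    kyc_ids_by_name[name].add(kyc_id)

# A name counts as "broken" when it maps to more than one KYC ID
broken_names = sorted(n for n, ids in kyc_ids_by_name.items() if len(ids) > 1)
broken_pct = len(broken_names) / len(kyc_ids_by_name) * 100

print(broken_names)
print(f"{broken_pct:.1f}% of names have multiple KYC IDs")
```

Matching on the exact name alone can also flag distinct people who happen to share a name, so counts produced this way are best read as an upper bound on true breakage.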

In [None]:
def analyze_kyc_breakage_enhanced(transaction_data):
    """Analyze the KYC breakage issue with enhanced metrics and visualizations."""
    print("\nEnhanced KYC breakage analysis...")
    
    # Analyze sender KYC breakage
    print("\n=== Sender KYC Breakage Analysis ===")
    
    # Group by sender name and count KYC IDs
    sender_name_groups = transaction_data.groupby('sender_name_kyc_wise')['sender_kyc_id_no'].nunique().reset_index()
    sender_name_groups.columns = ['sender_name', 'kyc_id_count']
    
    # Distribution of KYC IDs per sender name
    sender_kyc_distribution = sender_name_groups['kyc_id_count'].value_counts().sort_index()
    
    print("Distribution of KYC IDs per unique sender name:")
    print(sender_kyc_distribution.head(10))
    
    # Calculate stats
    avg_kyc_per_sender = sender_name_groups['kyc_id_count'].mean()
    max_kyc_per_sender = sender_name_groups['kyc_id_count'].max()
    senders_with_multiple_kyc = sum(sender_name_groups['kyc_id_count'] > 1)
    percentage_senders_with_multiple_kyc = senders_with_multiple_kyc / len(sender_name_groups) * 100
    
    print(f"\nAverage KYC IDs per unique sender name: {avg_kyc_per_sender:.2f}")
    print(f"Maximum KYC IDs for a single sender name: {max_kyc_per_sender}")
    print(f"Senders with multiple KYC IDs: {senders_with_multiple_kyc} ({percentage_senders_with_multiple_kyc:.2f}%)")
    
    # Visualize the distribution
    plt.figure(figsize=(10, 6))
    sender_kyc_counts = sender_name_groups['kyc_id_count'].values
    plt.hist(sender_kyc_counts, bins=range(1, max(sender_kyc_counts) + 2), 
             alpha=0.7, color='blue', edgecolor='black')
    plt.title('Distribution of KYC IDs per Sender Name')
    plt.xlabel('Number of KYC IDs')
    plt.ylabel('Count of Sender Names')
    plt.xticks(range(1, max(sender_kyc_counts) + 1))
    plt.grid(axis='y', alpha=0.75)
    plt.savefig('visualizations/sender_kyc_distribution.png')
    plt.show()
    
    # Find senders with the most KYC IDs
    top_multiple_kyc_senders = sender_name_groups.sort_values('kyc_id_count', ascending=False).head(10)
    print("\nTop 10 senders with most KYC IDs:")
    print(top_multiple_kyc_senders)
    
    # Analyze receiver KYC breakage
    print("\n=== Receiver KYC Breakage Analysis ===")
    
    # Group by receiver name and count KYC IDs
    receiver_name_groups = transaction_data.groupby('receiver_name_kyc_wise')['receiver_kyc_id_no'].nunique().reset_index()
    receiver_name_groups.columns = ['receiver_name', 'kyc_id_count']
    
    # Distribution of KYC IDs per receiver name
    receiver_kyc_distribution = receiver_name_groups['kyc_id_count'].value_counts().sort_index()
    
    print("Distribution of KYC IDs per unique receiver name:")
    print(receiver_kyc_distribution.head(10))
    
    # Calculate stats
    avg_kyc_per_receiver = receiver_name_groups['kyc_id_count'].mean()
    max_kyc_per_receiver = receiver_name_groups['kyc_id_count'].max()
    receivers_with_multiple_kyc = sum(receiver_name_groups['kyc_id_count'] > 1)
    percentage_receivers_with_multiple_kyc = receivers_with_multiple_kyc / len(receiver_name_groups) * 100
    
    print(f"\nAverage KYC IDs per unique receiver name: {avg_kyc_per_receiver:.2f}")
    print(f"Maximum KYC IDs for a single receiver name: {max_kyc_per_receiver}")
    print(f"Receivers with multiple KYC IDs: {receivers_with_multiple_kyc} ({percentage_receivers_with_multiple_kyc:.2f}%)")
    
    # Visualize the distribution
    plt.figure(figsize=(10, 6))
    receiver_kyc_counts = receiver_name_groups['kyc_id_count'].values
    plt.hist(receiver_kyc_counts, bins=range(1, max(receiver_kyc_counts) + 2), 
             alpha=0.7, color='green', edgecolor='black')
    plt.title('Distribution of KYC IDs per Receiver Name')
    plt.xlabel('Number of KYC IDs')
    plt.ylabel('Count of Receiver Names')
    plt.xticks(range(1, max(receiver_kyc_counts) + 1))
    plt.grid(axis='y', alpha=0.75)
    plt.savefig('visualizations/receiver_kyc_distribution.png')
    plt.show()
    
    # Analyze impact on alerts
    print("\n=== Impact of KYC Breakage on Alerts ===")
    
    # Filter for alerts triggered on receivers and senders
    receiver_alerts = transaction_data[transaction_data['triggered_on'] == 'receiver']
    sender_alerts = transaction_data[transaction_data['triggered_on'] == 'sender']
    
    # Find alerts for entities with multiple KYC IDs
    multiple_kyc_senders = sender_name_groups[sender_name_groups['kyc_id_count'] > 1]['sender_name'].tolist()
    multiple_kyc_receivers = receiver_name_groups[receiver_name_groups['kyc_id_count'] > 1]['receiver_name'].tolist()
    
    # Filter alerts for these entities
    sender_multiple_kyc_alerts = sender_alerts[sender_alerts['sender_name_kyc_wise'].isin(multiple_kyc_senders)]
    receiver_multiple_kyc_alerts = receiver_alerts[receiver_alerts['receiver_name_kyc_wise'].isin(multiple_kyc_receivers)]
    
    # Calculate impact percentages
    sender_affected_pct = len(sender_multiple_kyc_alerts) / len(sender_alerts) * 100 if len(sender_alerts) > 0 else 0
    receiver_affected_pct = len(receiver_multiple_kyc_alerts) / len(receiver_alerts) * 100 if len(receiver_alerts) > 0 else 0
    overall_affected_pct = (len(sender_multiple_kyc_alerts) + len(receiver_multiple_kyc_alerts)) / len(transaction_data) * 100
    
    print(f"Sender alerts affected by KYC breakage: {len(sender_multiple_kyc_alerts)} ({sender_affected_pct:.2f}%)")
    print(f"Receiver alerts affected by KYC breakage: {len(receiver_multiple_kyc_alerts)} ({receiver_affected_pct:.2f}%)")
    print(f"Overall alerts affected: {len(sender_multiple_kyc_alerts) + len(receiver_multiple_kyc_alerts)} ({overall_affected_pct:.2f}%)")
    
    # Analyze impact on rule performance
    # Check if KYC breakage affects certain rules more than others
    if not receiver_multiple_kyc_alerts.empty and not sender_multiple_kyc_alerts.empty:
        sender_rule_impact = sender_multiple_kyc_alerts.groupby('alert_rules').size().sort_values(ascending=False)
        receiver_rule_impact = receiver_multiple_kyc_alerts.groupby('alert_rules').size().sort_values(ascending=False)
        
        print("\nSender rules most affected by KYC breakage:")
        print(sender_rule_impact.head(5))
        
        print("\nReceiver rules most affected by KYC breakage:")
        print(receiver_rule_impact.head(5))
        
        # Calculate false positive impact among closed alerts only, so the figures
        # are comparable with the overall FP rate computed below
        closed_statuses = ['Closed TP', 'Closed FP']
        sender_fp = sender_multiple_kyc_alerts[sender_multiple_kyc_alerts['status'] == 'Closed FP']
        receiver_fp = receiver_multiple_kyc_alerts[receiver_multiple_kyc_alerts['status'] == 'Closed FP']
        sender_closed = sender_multiple_kyc_alerts['status'].isin(closed_statuses).sum()
        receiver_closed = receiver_multiple_kyc_alerts['status'].isin(closed_statuses).sum()
        
        sender_fp_pct = len(sender_fp) / sender_closed * 100 if sender_closed > 0 else 0
        receiver_fp_pct = len(receiver_fp) / receiver_closed * 100 if receiver_closed > 0 else 0
        
        print(f"\nFalse positive rate in sender multiple KYC alerts: {sender_fp_pct:.2f}%")
        print(f"False positive rate in receiver multiple KYC alerts: {receiver_fp_pct:.2f}%")
        
        # Compare with overall FP rate (guarding against an empty set of closed alerts)
        total_closed = transaction_data['status'].isin(closed_statuses).sum()
        overall_fp_rate = (transaction_data['status'] == 'Closed FP').sum() / total_closed * 100 if total_closed > 0 else 0
        print(f"Overall false positive rate: {overall_fp_rate:.2f}%")
        print(f"KYC breakage impact on FP rate - Sender: {sender_fp_pct - overall_fp_rate:+.2f} pp, Receiver: {receiver_fp_pct - overall_fp_rate:+.2f} pp")
    
    # Visualize the KYC breakage impact
    plt.figure(figsize=(14, 8))
    
    # Create data for the chart
    affected_data = [
        len(sender_multiple_kyc_alerts), 
        len(sender_alerts) - len(sender_multiple_kyc_alerts),
        len(receiver_multiple_kyc_alerts), 
        len(receiver_alerts) - len(receiver_multiple_kyc_alerts)
    ]
    
    x = ['Sender\n(Affected)', 'Sender\n(Not Affected)', 'Receiver\n(Affected)', 'Receiver\n(Not Affected)']
    colors = ['red', 'blue', 'red', 'blue']
    
    # Create the bar chart
    bars = plt.bar(x, affected_data, color=colors, alpha=0.7)
    
    # Add percentage labels
    for i, bar in enumerate(bars):
        if i < 2:  # Sender categories
            percentage = 100 * bar.get_height() / len(sender_alerts) if len(sender_alerts) > 0 else 0
        else:  # Receiver categories
            percentage = 100 * bar.get_height() / len(receiver_alerts) if len(receiver_alerts) > 0 else 0
        plt.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.1,
                f'{percentage:.1f}%', ha='center', va='bottom')
    
    plt.title('Impact of KYC Breakage on Alerts')
    plt.ylabel('Number of Alerts')
    plt.grid(axis='y', alpha=0.3)
    plt.tight_layout()
    plt.savefig('visualizations/kyc_breakage_impact.png')
    plt.show()
    
    # Simulate impact of KYC deduplication
    # For each entity with multiple KYCs, estimate reduction in false positives
    print("\n=== Estimated Impact of KYC Deduplication ===")
    
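    # Count false positives from multiple-KYC entities directly from the alert frames;
    # sender_fp and receiver_fp are only defined when the conditional branch above runs
    total_fp = len(transaction_data[transaction_data['status'] == 'Closed FP'])
    multiple_kyc_fp = int((sender_multiple_kyc_alerts['status'] == 'Closed FP').sum() +
                          (receiver_multiple_kyc_alerts['status'] == 'Closed FP').sum())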
    multiple_kyc_fp_pct = multiple_kyc_fp / total_fp * 100 if total_fp > 0 else 0
    
    # Estimate FP reduction (assumes 70% of these FPs would be eliminated; this share is an assumption, not a measured value)
    estimated_fp_reduction = multiple_kyc_fp * 0.7
    estimated_fp_reduction_pct = estimated_fp_reduction / total_fp * 100 if total_fp > 0 else 0
    
    print(f"False positives from entities with multiple KYCs: {multiple_kyc_fp} ({multiple_kyc_fp_pct:.2f}% of all FPs)")
    print(f"Estimated reduction in false positives with deduplication: {estimated_fp_reduction:.0f} ({estimated_fp_reduction_pct:.2f}%)")
    
    # Estimate alert volume reduction
    total_alerts = len(transaction_data)
    multiple_kyc_alerts = len(sender_multiple_kyc_alerts) + len(receiver_multiple_kyc_alerts)
    
    # Conservative estimate: 50% of these alerts are duplicates
    estimated_alert_reduction = multiple_kyc_alerts * 0.5
    estimated_alert_reduction_pct = estimated_alert_reduction / total_alerts * 100
    
    print(f"Estimated reduction in alert volume with deduplication: {estimated_alert_reduction:.0f} ({estimated_alert_reduction_pct:.2f}%)")
    
    return sender_name_groups, receiver_name_groups, sender_multiple_kyc_alerts, receiver_multiple_kyc_alerts

In [None]:
# Execute the enhanced KYC breakage analysis
sender_name_groups, receiver_name_groups, sender_multiple_kyc_alerts, receiver_multiple_kyc_alerts = \
    analyze_kyc_breakage_enhanced(transaction_data)
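
Because the deduplication estimate depends on an assumed elimination share, it can help to see how the result moves as that share changes. The sketch below is a small sensitivity check; the helper `dedup_estimate` and the counts are hypothetical and are not taken from the notebook's data.

```python
def dedup_estimate(multi_kyc_fp, total_fp, eliminated_share):
    """Estimated FPs removed by deduplication, as a count and as % of all FPs."""
    removed = multi_kyc_fp * eliminated_share
    pct_of_all_fp = removed / total_fp * 100 if total_fp else 0.0
    return removed, pct_of_all_fp


# Hypothetical counts, not taken from the notebook's data
multi_kyc_fp, total_fp = 400, 2000

for share in (0.5, 0.7, 0.9):
    removed, pct = dedup_estimate(multi_kyc_fp, total_fp, share)
    print(f"share={share:.0%}: {removed:.0f} FPs removed ({pct:.1f}% of all FPs)")
```

The estimate scales linearly with the assumed share, so reporting a range rather than a single figure may be more defensible until the share is validated against reviewed cases.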

## 7. Rule Clustering Analysis

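This section groups rules by how often they fire on the same KYC IDs. Rule co-occurrence is turned into a correlation-based distance, and hierarchical clustering on that distance surfaces rules that behave alike.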

In [None]:
def perform_rule_clustering(transaction_data, rule_performance, kyc_alerts):
    """Perform rule clustering analysis to identify natural groups of rules."""
    print("\nPerforming rule clustering analysis...")
    
    # Extract rule co-occurrence patterns
    # Create a matrix where each row is a KYC ID and each column is a rule
    # Value is 1 if the KYC triggered that rule, 0 otherwise
    all_rules = sorted(transaction_data['alert_rules'].unique())
    rule_matrix = pd.DataFrame(0, index=kyc_alerts.keys(), columns=all_rules)
    
    for kyc_id, rules in kyc_alerts.items():
        for rule in rules:
            if rule in all_rules:  # Ensure rule is in the columns
                rule_matrix.loc[kyc_id, rule] = 1
    
    # Calculate correlation matrix between rules
    rule_corr_matrix = rule_matrix.corr()
    
    # Convert NaN to 0 (in case some rules have no co-occurrences)
    rule_corr_matrix.fillna(0, inplace=True)
    
    # Create a distance matrix (1 - correlation)
    distance_matrix = 1 - rule_corr_matrix.abs()
    
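    # Perform hierarchical clustering.
    # linkage() expects a condensed distance vector; a square matrix would be treated
    # as raw observations. Average linkage is used because Ward assumes Euclidean distances.
    from scipy.spatial.distance import squareform
    condensed_distances = squareform(distance_matrix.values, checks=False)
    linkage_matrix = linkage(condensed_distances, method='average')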
    
    # Plot dendrogram
    plt.figure(figsize=(16, 10))
    plt.title("Hierarchical Clustering of Rules Based on Co-occurrence")
    dendrogram(linkage_matrix, labels=all_rules, leaf_rotation=90)
    plt.tight_layout()
    plt.savefig('visualizations/rule_hierarchical_clustering.png')
    plt.show()
    
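    # Choose the number of clusters (a fixed cap of 10 for readability, not a tuned value)
    num_clusters = min(10, len(all_rules))
    
    # Apply hierarchical clustering on the precomputed distance matrix.
    # Ward linkage only supports Euclidean distances, so average linkage is used here.
    # Note: `metric` replaces the older `affinity` argument in scikit-learn >= 1.2.
    cluster_model = AgglomerativeClustering(n_clusters=num_clusters, metric='precomputed', linkage='average')
    cluster_labels = cluster_model.fit_predict(distance_matrix.values)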
    
    # Create a DataFrame with rule clusters
    rule_clusters = pd.DataFrame({
        'Rule': all_rules,
        'Cluster': cluster_labels
    })
    
    # Add performance metrics
    rule_clusters = rule_clusters.merge(rule_performance[['alert_rules', 'Total', 'TP_Rate', 'Pattern', 'Frequency']], 
                                        left_on='Rule', right_on='alert_rules', how='left').drop('alert_rules', axis=1)
    
    # Analyze each cluster
    print("\nRule clusters from hierarchical clustering:")
    for cluster_id in range(num_clusters):
        cluster_rules = rule_clusters[rule_clusters['Cluster'] == cluster_id]
        print(f"\nCluster {cluster_id} ({len(cluster_rules)} rules):")
        
        # Print rules in this cluster
        print("Rules: " + ", ".join(cluster_rules['Rule'].tolist()))
        
        # Calculate cluster metrics
        avg_tp_rate = cluster_rules['TP_Rate'].mean()
        patterns = cluster_rules['Pattern'].value_counts()
        frequencies = cluster_rules['Frequency'].value_counts()
        
        print(f"Average TP Rate: {avg_tp_rate:.2f}%")
        print(f"Dominant patterns: {patterns.to_dict()}")
        print(f"Frequencies: {frequencies.to_dict()}")
    
    # Visualize clusters on a 2D plot
    # We'll use a simple 2D representation based on TP rate and total alerts
    plt.figure(figsize=(14, 10))
    
    # Create a scatterplot of rules
    scatter = plt.scatter(rule_clusters['Total'], rule_clusters['TP_Rate'], 
                         c=rule_clusters['Cluster'], cmap='viridis', 
                         s=100, alpha=0.7)
    
    # Add rule labels
    for i, row in rule_clusters.iterrows():
        plt.annotate(row['Rule'], (row['Total'], row['TP_Rate']), 
                     xytext=(5, 5), textcoords='offset points')
    
    plt.colorbar(scatter, label='Cluster')
    plt.title('Rule Clusters by Alert Volume and TP Rate')
    plt.xlabel('Total Alerts')
    plt.ylabel('True Positive Rate (%)')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('visualizations/rule_clusters_scatter.png')
    plt.show()
    
    # Network visualization of rule relationships
    # Create a graph where rules are nodes and edges represent correlation
    G = nx.Graph()
    
    # Add nodes
    for rule in all_rules:
        cluster_id = rule_clusters.loc[rule_clusters['Rule'] == rule, 'Cluster'].iloc[0]
        G.add_node(rule, cluster=cluster_id)
    
    # Add edges (only for correlations above threshold)
    threshold = 0.3  # Correlation threshold
    for i in range(len(all_rules)):
        for j in range(i+1, len(all_rules)):
            rule1, rule2 = all_rules[i], all_rules[j]
            correlation = rule_corr_matrix.loc[rule1, rule2]
            if correlation > threshold:
                G.add_edge(rule1, rule2, weight=correlation)
    
    # Plot the network if it's not too large
    if len(all_rules) <= 40:  # Limit for readability
        plt.figure(figsize=(16, 12))
        
        # Position nodes using force-directed layout
        pos = nx.spring_layout(G, k=0.3, iterations=50)
        
        # Get node colors based on cluster
        node_colors = [G.nodes[rule]['cluster'] for rule in G.nodes()]
        
        # Draw nodes
        nx.draw_networkx_nodes(G, pos, node_size=200, node_color=node_colors, cmap='viridis', alpha=0.8)
        
        # Draw edges with varying width based on correlation
        edges = G.edges(data=True)
        edge_weights = [edge[2]['weight']*3 for edge in edges]
        nx.draw_networkx_edges(G, pos, width=edge_weights, alpha=0.5)
        
        # Draw labels
        nx.draw_networkx_labels(G, pos, font_size=8)
        
        plt.title('Rule Correlation Network (Colored by Cluster)')
        plt.axis('off')
        plt.tight_layout()
        plt.savefig('visualizations/rule_correlation_network.png')
        plt.show()
    
    return rule_clusters, rule_corr_matrix

In [None]:
# Execute rule clustering analysis
rule_clusters, rule_corr_matrix = perform_rule_clustering(transaction_data, rule_performance, kyc_alerts)
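
The distance used for clustering is one minus the absolute correlation between two rules' trigger indicators. The sketch below works through that calculation by hand for three hypothetical rules, using only the standard library; the indicator vectors are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # mirrors the notebook's fillna(0) for undefined correlations
    return cov / sqrt(var_x * var_y)


# Hypothetical trigger indicators for three rules across six KYC IDs
rule_a = [1, 1, 0, 0, 1, 0]
rule_b = [1, 1, 0, 0, 1, 0]   # fires on exactly the same KYC IDs as rule_a
rule_c = [0, 1, 1, 0, 0, 1]

dist_ab = 1 - abs(pearson(rule_a, rule_b))
dist_ac = 1 - abs(pearson(rule_a, rule_c))
print(f"distance(a, b) = {dist_ab:.2f}")
print(f"distance(a, c) = {dist_ac:.2f}")
```

Taking the absolute value means strongly negatively correlated rules are also treated as close. That suits a search for related rules, but it is worth keeping in mind when reading clusters as candidates for consolidation.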

## 8. Comprehensive Recommendations with Quantified Impact

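This final section turns the preceding analyses into recommendations, each with an estimated effect on alert volume and TP rate. The estimates rest on fixed assumptions (for example, the share of alerts treated as duplicates), so they should be read as approximate.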
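
As a worked example of the rule-removal arithmetic, the sketch below computes how the overall TP rate would shift if one rule's closed alerts were dropped. The helper `removal_impact` and all counts are hypothetical, and the calculation assumes each closed alert belongs to exactly one rule.

```python
def removal_impact(total_tp, total_fp, rule_tp, rule_fp):
    """Change in overall TP rate (percentage points) if one rule's closed alerts are dropped."""
    closed = total_tp + total_fp
    current_rate = total_tp / closed * 100 if closed else 0.0
    remaining = closed - rule_tp - rule_fp
    new_rate = (total_tp - rule_tp) / remaining * 100 if remaining else 0.0
    return new_rate - current_rate


# Hypothetical counts: 300 TP and 700 FP overall; the rule contributes 10 TP and 190 FP
change = removal_impact(total_tp=300, total_fp=700, rule_tp=10, rule_fp=190)
print(f"TP rate change: {change:+.2f} percentage points")
```

Here the overall TP rate rises from 30% to 36.25%, but the 10 true positives the rule used to catch are lost. The rate change alone does not show that cost, so the missed true positives are worth reporting alongside it.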

In [None]:
def generate_comprehensive_recommendations(transaction_data, rule_performance, rule_clusters, rule_corr_matrix, kyc_alerts):
    """Generate comprehensive recommendations with quantified impact."""
    print("\nGenerating comprehensive recommendations with quantified impact...")
    
    # Total alert stats for reference
    total_alerts = len(transaction_data)
    closed_alerts = transaction_data[transaction_data['status'].isin(['Closed TP', 'Closed FP'])]
    total_tp = len(closed_alerts[closed_alerts['status'] == 'Closed TP'])
    total_fp = len(closed_alerts[closed_alerts['status'] == 'Closed FP'])
    current_tp_rate = total_tp / len(closed_alerts) * 100 if len(closed_alerts) > 0 else 0
    
    # Create recommendations list
    recommendations = []
    
    # 1. Remove or Modify Inefficient Rules
    inefficient_rules = rule_performance[(rule_performance['Total'] > 50) & 
                                         (rule_performance['TP_Rate'] < 30)].sort_values('Total', ascending=False)
    
    if not inefficient_rules.empty:
        for _, rule in inefficient_rules.head(5).iterrows():
            rule_alerts = closed_alerts[closed_alerts['alert_rules'] == rule['alert_rules']]
            rule_tp = rule['TP']
            rule_fp = rule['FP']
            rule_total = rule['Total']
            
            # Calculate impact if rule is removed
            new_tp_count = total_tp - rule_tp
            new_fp_count = total_fp - rule_fp
            new_total = len(closed_alerts) - rule_total
            new_tp_rate = new_tp_count / new_total * 100 if new_total > 0 else 0
            tp_rate_change = new_tp_rate - current_tp_rate
            
            # Alert volume reduction
            rule_volume_in_all = transaction_data[transaction_data['alert_rules'] == rule['alert_rules']].shape[0]
            volume_reduction = rule_volume_in_all / total_alerts * 100
            
            recommendations.append({
                'Category': 'Remove Inefficient Rules',
                'Rules': rule['alert_rules'],
                'Action': 'Remove rule or significantly increase threshold',
                'Rationale': f"Low TP rate ({rule['TP_Rate']:.1f}%) with high volume ({rule['Total']} alerts)",
                'Impact': f"Alert volume: -{volume_reduction:.1f}%, TP rate: {tp_rate_change:+.2f}%",
                'Priority': 'High' if rule['Total'] > 100 else 'Medium'
            })
    
    # 2. Consolidate Similar Rules (from clusters and correlation)
    # Get highly correlated rule pairs
    high_corr_threshold = 0.7
    high_corr_pairs = []
    
    # Find all pairs of rules with correlation above threshold
    for i in range(len(rule_corr_matrix.index)):
        for j in range(i+1, len(rule_corr_matrix.columns)):
            rule1 = rule_corr_matrix.index[i]
            rule2 = rule_corr_matrix.columns[j]
            corr = rule_corr_matrix.iloc[i, j]
            if corr >= high_corr_threshold:
                high_corr_pairs.append((rule1, rule2, corr))
    
    # Sort by correlation
    high_corr_pairs.sort(key=lambda x: x[2], reverse=True)
    
    # Generate recommendations for top correlated pairs
    for rule1, rule2, corr in high_corr_pairs[:5]:  # Top 5 pairs
        # Get performance data
        rule1_perf = rule_performance[rule_performance['alert_rules'] == rule1]
        rule2_perf = rule_performance[rule_performance['alert_rules'] == rule2]
        
        if not rule1_perf.empty and not rule2_perf.empty:
            rule1_tp_rate = rule1_perf.iloc[0]['TP_Rate']
            rule2_tp_rate = rule2_perf.iloc[0]['TP_Rate']
            
            # Determine which rule to keep
            keep_rule = rule1 if rule1_tp_rate >= rule2_tp_rate else rule2
            remove_rule = rule2 if keep_rule == rule1 else rule1
            
            # Calculate impact of consolidation
            # For all alerts (not just closed ones)
            keep_alerts = transaction_data[transaction_data['alert_rules'] == keep_rule].shape[0]
            remove_alerts = transaction_data[transaction_data['alert_rules'] == remove_rule].shape[0]
            
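            # Estimate overlap based on correlation (a rough proxy, not an exact count)
            # Higher correlation means more of the removed rule's alerts are already
            # covered by the kept rule, so those redundant alerts are what consolidation saves
            overlap_factor = corr
            redundant_remove_alerts = remove_alerts * overlap_factor
            
            # Volume reduction
            volume_reduction = redundant_remove_alerts / total_alerts * 100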
            
            recommendations.append({
                'Category': 'Consolidate Similar Rules',
                'Rules': f"{rule1}, {rule2}",
                'Action': f"Combine rules, keep {keep_rule}",
                'Rationale': f"High correlation ({corr:.2f}), {keep_rule} has higher TP rate ({max(rule1_tp_rate, rule2_tp_rate):.1f}%)",
                'Impact': f"Alert volume: -{volume_reduction:.1f}%, No significant impact on TP rate",
                'Priority': 'High' if corr > 0.9 else 'Medium'
            })
    
    # 3. Convert Inefficient Daily Rules to Weekly
    daily_rules = rule_performance[(rule_performance['Frequency'] == 'daily') & 
                                  (rule_performance['TP_Rate'] < 30)].sort_values('Total', ascending=False).head(3)
    
    if not daily_rules.empty:
        # Aggregate daily rules
        daily_rule_list = daily_rules['alert_rules'].tolist()
        daily_alert_count = transaction_data[transaction_data['alert_rules'].isin(daily_rule_list)].shape[0]
        
        # Estimate reduction (weekly = ~1/5 of daily)
        estimated_reduction = daily_alert_count * 0.8  # 80% reduction
        volume_reduction_pct = estimated_reduction / total_alerts * 100
        
        recommendations.append({
            'Category': 'Adjust Rule Frequency',
            'Rules': ', '.join(daily_rule_list),
            'Action': 'Convert inefficient daily rules to weekly frequency',
            'Rationale': 'Daily rules generate high volume with low true positive rates',
            'Impact': f"Alert volume: -{volume_reduction_pct:.1f}%, Potential improved accuracy",
            'Priority': 'Medium'
        })
    
    # 4. Implement KYC Deduplication
    # Estimate FP reduction from KYC deduplication
    # Get entities with multiple KYCs
    sender_kyc_counts = transaction_data.groupby('sender_name_kyc_wise')['sender_kyc_id_no'].nunique()
    receiver_kyc_counts = transaction_data.groupby('receiver_name_kyc_wise')['receiver_kyc_id_no'].nunique()
    
    senders_with_multiple_kyc = sender_kyc_counts[sender_kyc_counts > 1].index.tolist()
    receivers_with_multiple_kyc = receiver_kyc_counts[receiver_kyc_counts > 1].index.tolist()
    
    sender_multiple_kyc_alerts = transaction_data[
        (transaction_data['triggered_on'] == 'sender') & 
        (transaction_data['sender_name_kyc_wise'].isin(senders_with_multiple_kyc))
    ]
    
    receiver_multiple_kyc_alerts = transaction_data[
        (transaction_data['triggered_on'] == 'receiver') & 
        (transaction_data['receiver_name_kyc_wise'].isin(receivers_with_multiple_kyc))
    ]
    
    # Calculate metrics
    multiple_kyc_alerts = pd.concat([sender_multiple_kyc_alerts, receiver_multiple_kyc_alerts])
    multiple_kyc_alert_pct = len(multiple_kyc_alerts) / total_alerts * 100
    
    # Estimate reduction (assuming 60% are duplicate alerts)
    estimated_alert_reduction = len(multiple_kyc_alerts) * 0.6
    volume_reduction_pct = estimated_alert_reduction / total_alerts * 100
    
    # Estimate FP reduction
    multiple_kyc_fp = multiple_kyc_alerts[multiple_kyc_alerts['status'] == 'Closed FP'].shape[0]
    estimated_fp_reduction = multiple_kyc_fp * 0.7  # 70% of FPs could be eliminated
    new_fp = total_fp - estimated_fp_reduction
    new_tp_rate = total_tp / (total_tp + new_fp) * 100 if (total_tp + new_fp) > 0 else 0
    tp_rate_improvement = new_tp_rate - current_tp_rate
    
    recommendations.append({
        'Category': 'KYC Breakage Mitigation',
        'Rules': 'All rules (system-wide)',
        'Action': 'Implement name/phone fuzzy matching system to deduplicate KYC IDs',
        'Rationale': f"{multiple_kyc_alert_pct:.1f}% of alerts involve entities with multiple KYC IDs",
        'Impact': f"Alert volume: -{volume_reduction_pct:.1f}%, TP rate: {tp_rate_improvement:+.2f}%",
        'Priority': 'High'
    })
    
    # 5. Recommendations based on clusters
    # Find the most inefficient clusters
    cluster_performance = rule_clusters.groupby('Cluster').apply(
        lambda x: pd.Series({
            'Avg_TP_Rate': x['TP_Rate'].mean(),
            'Total_Alerts': x['Total'].sum(),
            'Rule_Count': len(x)
        })
    ).sort_values('Avg_TP_Rate')
    
    # Get the most inefficient cluster with multiple rules
    inefficient_clusters = cluster_performance[(cluster_performance['Avg_TP_Rate'] < 40) & 
                                              (cluster_performance['Rule_Count'] > 1)]
    
    if not inefficient_clusters.empty:
        worst_cluster = inefficient_clusters.index[0]
        cluster_rules = rule_clusters[rule_clusters['Cluster'] == worst_cluster]['Rule'].tolist()
        
        # Calculate impact of optimizing this cluster
        cluster_alerts = inefficient_clusters.loc[worst_cluster, 'Total_Alerts']
        cluster_alert_pct = cluster_alerts / rule_performance['Total'].sum() * 100
        
        # Estimate improvement (consolidate to best rule in cluster)
        best_rule_in_cluster = rule_performance[rule_performance['alert_rules'].isin(cluster_rules)].sort_values('TP_Rate', ascending=False).iloc[0]
        
        # Estimate alert reduction
        estimated_alert_reduction = cluster_alerts * 0.6  # 60% reduction through consolidation
        volume_reduction_pct = estimated_alert_reduction / total_alerts * 100
        
        recommendations.append({
            'Category': 'Optimize Rule Clusters',
            'Rules': ', '.join(cluster_rules),
            'Action': f"Consolidate cluster to a single optimized rule based on {best_rule_in_cluster['alert_rules']}",
            'Rationale': f"Cluster of {len(cluster_rules)} similar rules with low TP rate ({inefficient_clusters.loc[worst_cluster, 'Avg_TP_Rate']:.1f}%)",
            'Impact': f"Alert volume: -{volume_reduction_pct:.1f}%, Improved consistency",
            'Priority': 'Medium'
        })
    
    # 6. Advanced Analytical Approach (Rule Scoring System)
    recommendations.append({
        'Category': 'Advanced Analytics',
        'Rules': 'All rules (system-wide)',
        'Action': 'Implement entity risk scoring system based on rule trigger patterns',
        'Rationale': 'Current rules operate independently, missing the value of combined risk signals',
        'Impact': 'Estimated 15-25% reduction in false positives while maintaining true positive catch rate',
        'Priority': 'Medium-Long term'
    })
    
    # 7. Custom Recommendations for Risky Country Rules
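    # Assumption: "9xxx series" means a four-digit rule number starting with 9.
    # A plain substring match on '9' would also catch any rule whose ID merely contains a 9.
    risky_country_rules = rule_performance[
        rule_performance['alert_rules'].astype(str).str.contains(r'(?<!\d)9\d{3}(?!\d)', regex=True)
    ]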
    
    if not risky_country_rules.empty:
        # Check performance of risky country rules
        avg_tp_rate = risky_country_rules['TP_Rate'].mean()
        
        if avg_tp_rate < 40:
            recommendations.append({
                'Category': 'Optimize Risky Country Rules',
                'Rules': 'All risky country rules (9xxx series)',
                'Action': 'Revise risk country classification and thresholds',
                'Rationale': f"Low average TP rate for risky country rules ({avg_tp_rate:.1f}%)",
                'Impact': 'More focused monitoring of genuinely high-risk countries, estimated 10-15% reduction in alerts',
                'Priority': 'Medium'
            })
    
    # Calculate combined impact
    # Estimate total alert volume reduction from all recommendations
    # This is approximate since some recommendations overlap
    total_volume_reduction = 0
    tp_rate_change = 0
    
    for rec in recommendations:
        impact = rec['Impact']
        
        # Extract volume reduction percentage
        if 'Alert volume:' in impact:
            try:
                volume_part = impact.split('Alert volume:')[1].split('%')[0].strip(' -')
                volume_reduction = float(volume_part)
                total_volume_reduction += volume_reduction
            except (IndexError, ValueError):
                # Impact text has no parseable volume figure; skip it
                pass
        
        # Extract TP rate change
        if 'TP rate:' in impact:
            try:
                tp_part = impact.split('TP rate:')[1].split('%')[0].strip()
                if '+' in tp_part or '-' in tp_part:
                    tp_change = float(tp_part)
                    tp_rate_change += tp_change
            except (IndexError, ValueError):
                # Impact text has no parseable TP rate figure; skip it
                pass
    
    # Cap the maximum reduction to a reasonable value (80%)
    total_volume_reduction = min(total_volume_reduction, 80)
    
    # Create summary DataFrame
    recommendations_df = pd.DataFrame(recommendations)
    
    # Print summary
    print("\nSummary of recommendations:")
    print(f"Total recommendations: {len(recommendations_df)}")
    print(f"Estimated total alert volume reduction: {total_volume_reduction:.1f}%")
    print(f"Estimated true positive rate change: {tp_rate_change:+.2f}%")
    print("\nRecommendations by category:")
    print(recommendations_df['Category'].value_counts())
    
    # Visualize recommendations by category
    plt.figure(figsize=(12, 6))
    recommendations_df['Category'].value_counts().plot(kind='barh', color='teal')
    plt.title('Recommendations by Category')
    plt.xlabel('Number of Recommendations')
    plt.tight_layout()
    plt.savefig('visualizations/recommendations_by_category.png')
    plt.show()
    
    # Visualize recommendations by priority
    plt.figure(figsize=(10, 6))
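    # value_counts() orders by frequency, so colors are mapped by label rather than by position;
    # labels missing from the map fall back to gray
    priority_counts = recommendations_df['Priority'].value_counts()
    priority_colors = {'High': 'red', 'Medium': 'orange', 'Low': 'green'}
    priority_counts.plot(kind='pie', autopct='%1.1f%%',
                         colors=[priority_colors.get(p, 'gray') for p in priority_counts.index])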
    plt.title('Recommendations by Priority')
    plt.ylabel('')
    plt.tight_layout()
    plt.savefig('visualizations/recommendations_by_priority.png')
    plt.show()
    
    return recommendations_df, total_volume_reduction, tp_rate_change

In [None]:
# Generate comprehensive recommendations
recommendations_df, total_volume_reduction, tp_rate_change = generate_comprehensive_recommendations(
    transaction_data, rule_performance, rule_clusters, rule_corr_matrix, kyc_alerts
)
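
The combined-impact loop in `generate_comprehensive_recommendations` recovers percentages by splitting the free-text `Impact` field, so a small wording change in a recommendation can silently drop it from the totals. A minimal sketch of a stricter parser follows, assuming the `Alert volume: -X%` and `TP rate: +X%` wording used in this notebook. `parse_impact` is a hypothetical helper, not part of the analysis above.

```python
import re

# Patterns assume the 'Alert volume: -12.5%' / 'TP rate: +1.20%' wording used in this notebook
VOLUME_RE = re.compile(r'Alert volume:\s*-?\s*(\d+(?:\.\d+)?)\s*%')
TP_RATE_RE = re.compile(r'TP rate:\s*([+-]\s*\d+(?:\.\d+)?)\s*%')

def parse_impact(impact):
    """Return (volume_reduction_pct, tp_rate_change_pct); 0.0 where a figure is absent."""
    volume_match = VOLUME_RE.search(impact)
    tp_match = TP_RATE_RE.search(impact)
    # Volume reduction is reported as a positive magnitude, matching the loop above
    volume = float(volume_match.group(1)) if volume_match else 0.0
    # TP rate change keeps its explicit sign
    tp_change = float(tp_match.group(1).replace(' ', '')) if tp_match else 0.0
    return volume, tp_change

print(parse_impact("Alert volume: -12.5%, TP rate: +1.20%"))
```

Impact strings that carry only a range, such as "Estimated 15-25% reduction in false positives", return `(0.0, 0.0)` and are excluded from the totals, as in the original loop.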

## 9. Save Results to Excel

Let's save all our analysis results to Excel for easier sharing and reference.

In [None]:
# Save key results to Excel
# Precompute outcome counts once; guard the TP rate against having no closed alerts
tp_count = int((transaction_data['status'] == 'Closed TP').sum())
fp_count = int((transaction_data['status'] == 'Closed FP').sum())
closed_count = tp_count + fp_count
overall_tp_rate = tp_count / closed_count * 100 if closed_count else 0.0

with pd.ExcelWriter('terrapay_enhanced_analysis_results.xlsx') as writer:
    # Overall performance metrics
    overall_metrics = pd.DataFrame({
        'Metric': [
            'Total Alerts',
            'True Positives',
            'False Positives',
            'True Positive Rate',
            'Estimated Alert Volume Reduction',
            'Estimated TP Rate Improvement'
        ],
        'Value': [
            len(transaction_data),
            tp_count,
            fp_count,
            f"{overall_tp_rate:.2f}%",
            f"{total_volume_reduction:.2f}%",
            f"{tp_rate_change:+.2f}%"
        ]
    })
    overall_metrics.to_excel(writer, sheet_name='Summary', index=False)
    
    # Rule performance
    rule_performance.to_excel(writer, sheet_name='Rule Performance', index=False)
    
    # True Positive cases
    true_positives.to_excel(writer, sheet_name='True Positives', index=False)
    
    # KYC breakage analysis
    pd.DataFrame({'sender_name': sender_name_groups['sender_name'], 
                 'kyc_count': sender_name_groups['kyc_id_count']}).to_excel(
        writer, sheet_name='Sender KYC Breakage', index=False)
    
    pd.DataFrame({'receiver_name': receiver_name_groups['receiver_name'], 
                 'kyc_count': receiver_name_groups['kyc_id_count']}).to_excel(
        writer, sheet_name='Receiver KYC Breakage', index=False)
    
    # Rule correlation matrix
    rule_corr_matrix.to_excel(writer, sheet_name='Rule Correlation Matrix')
    
    # Rule clusters
    rule_clusters.to_excel(writer, sheet_name='Rule Clusters', index=False)
    
    # Recommendations
    recommendations_df.to_excel(writer, sheet_name='Recommendations', index=False)
    
    # High-risk entities (KYC IDs with multiple true positives)
    tp_kyc_counts = pd.Series(tp_kyc_ids).value_counts()
    high_risk_kycs = tp_kyc_counts[tp_kyc_counts > 1].reset_index()
    high_risk_kycs.columns = ['KYC_ID', 'TP_Count']
    if not high_risk_kycs.empty:
        high_risk_kycs.to_excel(writer, sheet_name='High Risk KYCs', index=False)

print("\nAnalysis complete. Results saved to 'terrapay_enhanced_analysis_results.xlsx'.")
print("Key findings and recommendations have been generated based on the analysis.")

## 10. Conclusions and Next Steps

Based on our enhanced analysis, we can draw several key conclusions and recommend next steps for optimization:

### Key Findings

1. **Rule Efficiency**: We've identified significant variation in rule performance, with some rules generating high volumes of alerts but low true positive rates. The most inefficient rules could be removed or modified to substantially reduce false positives.

2. **Rule Overlap**: Our clustering and correlation analysis revealed considerable redundancy in the rule set. Several rules are highly correlated and could be consolidated without significant loss of detection capability.

3. **KYC Breakage Impact**: The issue of multiple KYC IDs for the same entity is a major driver of false positives, particularly for receiver-focused rules. Our analysis quantifies this impact and shows significant potential for improvement through identity deduplication.

4. **Threshold Optimization**: The ATL/BTL analysis demonstrates that many rules could benefit from threshold adjustments. Optimized thresholds could significantly reduce alert volume while minimizing the loss of true positives.

5. **Pattern and Frequency Effectiveness**: Different rule patterns and frequencies show varying levels of effectiveness. Daily rules tend to generate more false positives than weekly/monthly rules, while certain patterns (e.g., Many-to-1, 1-to-Many) may be more effective than others.
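
Finding 3 rests on counting how many distinct KYC IDs share one name. The idea can be sketched as follows, assuming hypothetical columns `receiver_name` and `receiver_kyc_id` and made-up rows. The normalization step (trim, collapse whitespace, lowercase) is an illustrative choice, and the notebook's own grouping may differ.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions
df = pd.DataFrame({
    'receiver_name': ['Jane Doe', 'jane  doe', 'JANE DOE ', 'John Roe', 'John Roe'],
    'receiver_kyc_id': ['K1', 'K2', 'K3', 'K4', 'K4'],
})

# Normalize names so trivial spelling differences map to the same key
df['name_key'] = (df['receiver_name'].str.strip()
                  .str.replace(r'\s+', ' ', regex=True)
                  .str.lower())

# Count distinct KYC IDs per normalized name
kyc_per_name = (df.groupby('name_key')['receiver_kyc_id']
                .nunique()
                .reset_index(name='kyc_id_count'))

# Names linked to more than one KYC ID are candidate breakage cases
breakage = kyc_per_name[kyc_per_name['kyc_id_count'] > 1]
print(breakage)
```

Exact matching on a normalized key only catches trivial variants. Misspellings or reordered name parts would need fuzzy matching, at the cost of possible false merges.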

### Estimated Impact of Recommendations

Our comprehensive recommendations could lead to:
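- A reduction in overall alert volume equal to the `total_volume_reduction` estimate printed by the recommendations cell (a sum of per-recommendation estimates, capped at 80%, which may double-count overlapping recommendations)
- A change in true positive rate equal to the `tp_rate_change` estimate printed by the same cell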
- More consistent and interpretable rules
- Improved ability to focus investigative resources on genuine risks
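
The volume estimate is a capped sum of per-recommendation percentages, which can overstate the benefit when recommendations touch the same alerts. One alternative, sketched here under the assumption that each reduction applies independently to the alerts left by the previous ones, is to compound the reductions as $1 - \prod_i (1 - r_i)$. `combine_reductions` is a hypothetical helper, and the input values are illustrative, not results from this analysis.

```python
def combine_reductions(reductions_pct):
    """Compound percentage reductions: 1 - prod(1 - r_i), returned as a percentage."""
    remaining = 1.0
    for pct in reductions_pct:
        # Clamp each estimate to [0, 100] so a bad input cannot push the result out of range
        fraction = min(max(pct, 0.0), 100.0) / 100.0
        remaining *= 1.0 - fraction
    return (1.0 - remaining) * 100.0

# Illustrative inputs only
example = [20.0, 10.0, 5.0]
print(f"Summed: {sum(example):.1f}%  Compounded: {combine_reductions(example):.1f}%")
# prints "Summed: 35.0%  Compounded: 31.6%"
```

The compounded figure never exceeds 100%, so no arbitrary cap is needed. It is still an approximation, since real recommendations are rarely independent.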

### Next Steps

1. **Staged Implementation**: Implement recommendations in phases, starting with the highest-priority items (KYC deduplication and removal of the most inefficient rules)

2. **KYC Optimization**: Develop and implement a name/phone matching system to mitigate the KYC breakage issue, especially for receiver identification

3. **Rule Consolidation**: Consolidate redundant rules based on the cluster analysis, keeping the most effective rule in each group

4. **Threshold Adjustment**: Apply the optimized thresholds identified in the ATL/BTL analysis to the top alerting rules

5. **Monitoring Framework**: Establish a continuous monitoring process to evaluate rule performance over time and make ongoing adjustments

6. **Advanced Analytics**: Consider moving toward a more sophisticated entity risk scoring system that weights the combined signals from multiple rules

7. **Regular Review**: Implement a quarterly review process for rule performance and optimization
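
The entity risk score in step 6 could take many forms. One minimal sketch is shown here, assuming each rule is weighted by a historical TP rate and a KYC ID's score is the sum of the weights of the distinct rules it triggered. The `kyc_id` column, the weights, and the rows are illustrative assumptions rather than the notebook's method.

```python
import pandas as pd

# Illustrative alerts: one row per (KYC ID, rule) trigger
alerts = pd.DataFrame({
    'kyc_id': ['A', 'A', 'A', 'B', 'C', 'C'],
    'alert_rules': ['R1', 'R2', 'R2', 'R3', 'R1', 'R3'],
})

# Hypothetical per-rule weights, e.g. historical TP rate as a fraction
rule_weights = {'R1': 0.50, 'R2': 0.25, 'R3': 0.125}

# Count each rule once per entity so repeat triggers do not inflate the score
distinct_hits = alerts.drop_duplicates(['kyc_id', 'alert_rules']).copy()
distinct_hits['weight'] = distinct_hits['alert_rules'].map(rule_weights).fillna(0.0)

# Sum the weights per entity and rank from highest to lowest risk
risk_scores = (distinct_hits.groupby('kyc_id')['weight']
               .sum()
               .sort_values(ascending=False))
print(risk_scores)
```

Ranking entities by such a score would let investigators prioritize KYC IDs that trip several effective rules, rather than reviewing each alert in isolation.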

By implementing these recommendations, Terrapay can significantly improve the efficiency and effectiveness of its transaction monitoring system, reducing the burden of false positives while maintaining or improving its ability to detect genuine suspicious activity.