# Enhanced Fraud Data Generation and Neptune Bulk Loading

This workshop demonstrates how to generate sophisticated synthetic fraud data and efficiently load it into Amazon Neptune for graph-based fraud detection. We'll create a realistic financial network with 100,000 transactions containing various fraud patterns including money laundering rings, shell companies, and synthetic identity fraud.

## Workshop Architecture

```mermaid
graph TD
    A[Enhanced Fraud Generator] --> B[Synthetic Data]
    B --> C[CSV Files]
    C --> D[S3 Bucket]
    D --> E[Neptune Bulk Loader]
    E --> F[Neptune Graph Database]
    F --> G[Fraud Detection Analytics]
    
    B --> B1[Institutions: 20]
    B --> B2[Accounts: 5,000]
    B --> B3[Transactions: 100,000]
    B --> B4[Fraud Patterns: 9 types]
    
    style A fill:#e1f5fe
    style F fill:#f3e5f5
    style G fill:#e8f5e8
```

## What You'll Build
- **Money laundering rings** with circular transaction patterns
- **Shell company networks** for hiding illicit funds
- **Synthetic identity fraud** using fake personas
- **Risk scoring system** for transaction analysis
- **High-performance bulk loading** to Neptune

For fraud detection queries and analysis, see: `Fraud_Detection_Analytics.ipynb`
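
To make "circular transaction patterns" concrete, the sketch below builds one small ring in which funds hop from account to account and return to the originator. It is an illustration only, not the logic of `EnhancedFraudGenerator`: the function name, the field names (`from_account`, `to_account`, `fraud_type`), the 2% per-hop fee, and the account IDs are all assumptions made for this example.

```python
import random
from datetime import datetime, timedelta


def make_ring_transactions(account_ids, start_amount, fee_rate=0.02, seed=0):
    """Return transactions that pass funds around a closed loop of accounts.

    Each hop forwards the previous amount minus a small fee, and the last
    hop sends the money back to the first account, closing the cycle.
    """
    rng = random.Random(seed)
    timestamp = datetime(2024, 1, 1, 9, 0, 0)
    amount = start_amount
    transactions = []
    for i, sender in enumerate(account_ids):
        # The modulo wraps the final hop back to the first account.
        receiver = account_ids[(i + 1) % len(account_ids)]
        transactions.append({
            "from_account": sender,
            "to_account": receiver,
            "amount": round(amount, 2),
            "timestamp": timestamp.isoformat(),
            "is_fraud": True,
            "fraud_type": "money_laundering_ring",  # hypothetical label
        })
        amount *= 1 - fee_rate
        timestamp += timedelta(minutes=rng.randint(5, 120))
    return transactions


ring = make_ring_transactions(["ACC-1", "ACC-2", "ACC-3", "ACC-4"], start_amount=10_000.0)
for tx in ring:
    print(tx["from_account"], "->", tx["to_account"], tx["amount"])
```

In a graph database this shape shows up as a cycle of transfer edges, which is the kind of structure a path query can search for.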

## Setup and Configuration

### Retrieve Neptune Endpoint from CloudFormation

Get the Neptune endpoint from the outputs of the `neptune-cluster` CloudFormation stack. It should have the format `financial-network-cluster.cluster-xxx.us-west-2.neptune.amazonaws.com`.

Then either set the `NEPTUNE_ENDPOINT` environment variable or replace the placeholder value of `NEPTUNE_ENDPOINT` under `# Configuration` in the next cell.
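
If you would rather look the endpoint up programmatically than copy it from the console, a sketch along these lines may help. It relies on the boto3 CloudFormation `describe_stacks` call; the output key `NeptuneEndpoint` is a hypothetical name, so check the stack's Outputs tab for the real key. The helper names are also made up for this example.

```python
def outputs_to_dict(describe_stacks_response):
    """Flatten the Outputs list of the first stack into {OutputKey: OutputValue}."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def get_stack_outputs(stack_name, region_name=None):
    """Fetch a stack's outputs. Needs AWS credentials that allow DescribeStacks."""
    import boto3  # imported lazily so outputs_to_dict works without boto3 installed

    cfn = boto3.client("cloudformation", region_name=region_name)
    return outputs_to_dict(cfn.describe_stacks(StackName=stack_name))


# Offline illustration using a response shaped like describe_stacks output.
# "NeptuneEndpoint" is a hypothetical output key.
sample_response = {
    "Stacks": [{
        "StackName": "neptune-cluster",
        "Outputs": [{
            "OutputKey": "NeptuneEndpoint",
            "OutputValue": "financial-network-cluster.cluster-xxx.us-west-2.neptune.amazonaws.com",
        }],
    }]
}
outputs = outputs_to_dict(sample_response)
print(outputs["NeptuneEndpoint"])
```

Inside the notebook you could call `get_stack_outputs('neptune-cluster')`, pick the endpoint value, and assign it to `os.environ['NEPTUNE_ENDPOINT']` before running the configuration cell.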

In [None]:
# Load graph notebook extensions
%load_ext graph_notebook.magics

# Import required libraries
import pandas as pd
import boto3
import json
import os
from src.enhanced_fraud_generator import EnhancedFraudGenerator
from src.neptune_bulk_loader import NeptuneBulkLoader

# Auto-detect configuration
session = boto3.Session()
account_id = boto3.client('sts').get_caller_identity()['Account']
region = session.region_name

print("🎯 Enhanced Fraud Data Generator - Workshop Ready!")
print(f"Account ID: {account_id}")
print(f"Region: {region}")

# Check prerequisites
print("🔍 Checking prerequisites...")

# Check if config file exists
config_exists = os.path.exists('config/enhanced_fraud_rules.yaml')
print(f"  Config file: {'✅' if config_exists else '❌'}")

# Check S3 bucket (head_bucket raises if the bucket is missing or inaccessible)
S3_BUCKET = f"{account_id}-neptune-bulk-load"
try:
    boto3.client('s3').head_bucket(Bucket=S3_BUCKET)
    s3_exists = True
except Exception:
    s3_exists = False
print(f"  S3 Bucket ({S3_BUCKET}): {'✅' if s3_exists else '❌ Deploy CloudFormation first'}")

# Configuration
NEPTUNE_ENDPOINT = os.environ.get('NEPTUNE_ENDPOINT', 'UPDATE-ME.cluster-xyz.us-west-2.neptune.amazonaws.com')
# Use the Neptune role directly (we know it exists from CF stack)
NEPTUNE_ROLE_ARN = f"arn:aws:iam::{account_id}:role/neptune-workbench-NeptuneS3AccessRole"
print(f"✅ Using Neptune role: {NEPTUNE_ROLE_ARN}")

print(f"\nNeptune Endpoint: {NEPTUNE_ENDPOINT}")
print(f"Neptune Role ARN: {NEPTUNE_ROLE_ARN}")
if 'UPDATE-ME' in NEPTUNE_ENDPOINT:
    print('⚠️  UPDATE NEPTUNE_ENDPOINT above with your actual Neptune cluster endpoint')
else:
    print('✅ Ready to generate and load data!')

## Step 1: Generate Enhanced Fraud Data (100,000 transactions)

In [None]:
# Initialize the enhanced fraud generator
generator = EnhancedFraudGenerator('config/enhanced_fraud_rules.yaml')

print("🎯 Generating enhanced fraud network with 100,000 transactions...")
print("📊 This includes:")
print("   • 20 financial institutions")
print("   • 5,000 accounts (including 100 shell + 50 synthetic)")
print("   • 100,000 transactions (3% fraud rate)")
print("   • 9 sophisticated fraud patterns")
print("\n⏱️  This may take 2-3 minutes...")

# Generate the complete network
data = generator.generate_network()

# Save to files
generator.save_data('enhanced_output')

print("\n✅ Generation Complete! Files saved to enhanced_output/")

# Prepare enhanced data for QuickSight
print("\n📊 Preparing enhanced data for QuickSight...")
quicksight_bucket = f"{account_id}-quicksight-fraud-data"
original_dir = os.getcwd()
try:
    # Import the QuickSight helpers before changing directory
    from src.prepare_quicksight_data import prepare_quicksight_data, create_s3_manifest, create_dashboard_config

    os.chdir('enhanced_output')

    # Generate enhanced QuickSight dataset
    prepare_quicksight_data()

    # Create manifest with correct bucket name
    create_s3_manifest(quicksight_bucket)

    # Create dashboard configuration
    create_dashboard_config()

    # Upload enhanced dataset to S3
    s3_client = boto3.client('s3')

    # Upload the single enhanced file
    enhanced_file = 'quicksight_fraud_data.csv'
    if os.path.exists(enhanced_file):
        s3_client.upload_file(enhanced_file, quicksight_bucket, f'enhanced-data/{enhanced_file}')
        print(f"  ✅ Uploaded enhanced dataset: {enhanced_file}")

        # Also upload manifest and summary files
        for file in ['quicksight_manifest.json', 'fraud_summary.json', 'dashboard_config.json']:
            if os.path.exists(file):
                s3_client.upload_file(file, quicksight_bucket, f'config/{file}')
                print(f"  ✅ Uploaded {file}")

    print(f"\n✅ QuickSight data ready at s3://{quicksight_bucket}/enhanced-data/")
    print("   → Ready for Lab 4: QuickSight Dashboard creation")

except Exception as e:
    print(f"\n⚠️  QuickSight data preparation failed: {e}")
    print("   This is optional - Neptune bulk loading will still work")
finally:
    # Always return to the notebook directory so later cells can find enhanced_output/
    os.chdir(original_dir)
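
For orientation, this is roughly what an Amazon QuickSight S3 manifest looks like. The sketch builds one by hand and is not the output of `create_s3_manifest`, whose exact contents are not shown in this notebook. The function name and the placeholder account ID `123456789012` are assumptions for the example.

```python
import json


def build_quicksight_manifest(bucket, key):
    """Build a minimal QuickSight manifest pointing at one CSV object in S3."""
    return {
        "fileLocations": [
            {"URIs": [f"s3://{bucket}/{key}"]},
        ],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ",",
            "textqualifier": "\"",
            "containsHeader": "true",
        },
    }


manifest = build_quicksight_manifest(
    "123456789012-quicksight-fraud-data",  # placeholder account ID
    "enhanced-data/quicksight_fraud_data.csv",
)
print(json.dumps(manifest, indent=2))
```

QuickSight reads this file when you create an S3 data source, so the URI has to match the key the dataset was uploaded under.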

## Step 2: Examine Generated Data

In [None]:
# Load and examine the generated data
institutions_df = pd.read_csv('enhanced_output/institutions.csv')
accounts_df = pd.read_csv('enhanced_output/accounts.csv')
transactions_df = pd.read_csv('enhanced_output/transactions.csv')

print(f"📈 Data Summary:")
print(f"  Institutions: {len(institutions_df):,}")
print(f"  Accounts: {len(accounts_df):,}")
print(f"  Transactions: {len(transactions_df):,}")

# Fraud statistics
fraud_df = transactions_df[transactions_df['is_fraud'] == True]
print(f"\n🚨 Fraud Statistics:")
print(f"  Fraud Transactions: {len(fraud_df):,}")
print(f"  Fraud Rate: {len(fraud_df)/len(transactions_df)*100:.2f}%")

# Show fraud type distribution
print(f"\n🎭 Fraud Types:")
fraud_counts = fraud_df['fraud_type'].value_counts()
for fraud_type, count in fraud_counts.items():
    print(f"  {fraud_type}: {count:,}")
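
A quick sanity check is to compare the observed fraud rate with the 3% target announced during generation. The sketch below does this with the standard library on a tiny in-memory CSV so it runs without the generated files. The `transaction_id` column, the fraud type labels, and the one-percentage-point tolerance are assumptions for illustration.

```python
import csv
import io
from collections import Counter


def summarize_fraud(rows):
    """Return (fraud_rate, counts_by_type) for an iterable of transaction dicts."""
    total = 0
    by_type = Counter()
    for row in rows:
        total += 1
        # CSV values arrive as strings, so compare against the text "true".
        if str(row["is_fraud"]).strip().lower() == "true":
            by_type[row["fraud_type"]] += 1
    fraud_total = sum(by_type.values())
    rate = fraud_total / total if total else 0.0
    return rate, by_type


# Tiny stand-in for enhanced_output/transactions.csv (hypothetical rows)
sample_csv = io.StringIO(
    "transaction_id,is_fraud,fraud_type\n"
    "T1,False,\n"
    "T2,True,money_laundering_ring\n"
    "T3,False,\n"
    "T4,True,shell_company\n"
)
rate, by_type = summarize_fraud(csv.DictReader(sample_csv))
print(f"fraud rate: {rate:.2%}")
print(dict(by_type))

# Flag a run whose rate drifts far from the configured target.
TARGET_RATE, TOLERANCE = 0.03, 0.01
print("within tolerance:", abs(rate - TARGET_RATE) <= TOLERANCE)
```

The toy sample is deliberately far from 3%, so the last line reports that it is outside the tolerance. On the real file you would pass `csv.DictReader(open('enhanced_output/transactions.csv', newline=''))` instead.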

In [None]:
# Show sample transactions
print("💸 Sample Transactions:")
display(transactions_df.head())

print("\n🚨 Sample Fraud Transactions:")
display(fraud_df.head())

## Step 3: Bulk Load to Neptune via S3

In [None]:
# Initialize bulk loader
bulk_loader = NeptuneBulkLoader(
    neptune_endpoint=NEPTUNE_ENDPOINT,
    s3_bucket=S3_BUCKET,
    neptune_role_arn=NEPTUNE_ROLE_ARN
)

print("🌊 Starting bulk load process...")
print("This will:")
print("1. Convert data to Neptune CSV format")
print("2. Upload to S3")
print("3. Start Neptune bulk load job")
print("4. Monitor progress")
print("\nThis may take 5-10 minutes...")
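
Step 1, converting to Neptune CSV format, refers to the Gremlin load format: vertex files carry `~id` and `~label` system columns, edge files add `~from` and `~to`, and property columns are typed as `name:Type`. The sketch below shows the idea with the standard library. It is not the conversion performed by `NeptuneBulkLoader`: the input field names, the `Account` and `TRANSFER` labels, and the choice to model transactions as edges are assumptions for this example.

```python
import csv
import io


def accounts_to_vertex_csv(accounts):
    """Render account dicts as a Neptune Gremlin vertex file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~label", "account_type:String", "balance:Double"])
    for acc in accounts:
        writer.writerow([acc["account_id"], "Account", acc["account_type"], acc["balance"]])
    return buf.getvalue()


def transactions_to_edge_csv(transactions):
    """Render transaction dicts as a Neptune Gremlin edge file."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["~id", "~from", "~to", "~label", "amount:Double", "is_fraud:Bool"])
    for tx in transactions:
        writer.writerow([
            tx["transaction_id"],
            tx["from_account"],
            tx["to_account"],
            "TRANSFER",
            tx["amount"],
            str(tx["is_fraud"]).lower(),  # booleans are written as true/false
        ])
    return buf.getvalue()


vertex_csv = accounts_to_vertex_csv([
    {"account_id": "ACC-1", "account_type": "checking", "balance": 2500.0},
    {"account_id": "ACC-2", "account_type": "shell", "balance": 90000.0},
])
edge_csv = transactions_to_edge_csv([
    {"transaction_id": "T1", "from_account": "ACC-1", "to_account": "ACC-2",
     "amount": 1200.5, "is_fraud": True},
])
print(vertex_csv)
print(edge_csv)
```

Every `~from` and `~to` value in an edge file must match a vertex `~id`, so the vertex and edge files should be generated from the same account list.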

In [None]:
# Execute bulk load (this is the main workshop step)
print("🚀 Starting bulk load to Neptune...")
print("This will:")
print("  1️⃣ Convert data to Neptune CSV format")
print("  2️⃣ Upload to S3 bucket")
print("  3️⃣ Start Neptune bulk load job")
print("  4️⃣ Monitor progress until complete")
print("\n⏱️  Total time: ~2-3 minutes for bulk loading")

success = bulk_loader.bulk_load_enhanced_fraud_data('enhanced_output')

if success:
    print("\n🎉 SUCCESS: Enhanced fraud network loaded into Neptune!")
    print("\n📊 Data Processing Status:")
    print("   ✅ Neptune: Graph data loaded and ready for queries")
    print("   ✅ QuickSight: Enhanced dataset ready for Lab 4")
    print("\n📊 Next Steps:")
    print("   • Open the 'Fraud_Detection_Analytics.ipynb' notebook")
    print("   • Run fraud detection queries and analysis")
    print("   • Create QuickSight dashboards using processed data")
    print("   • Explore the graph data with advanced analytics")
else:
    print("\n💥 FAILED: Bulk load unsuccessful.")
    print("Check CloudFormation stack outputs for correct role ARN")
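
When troubleshooting a failed load, it can help to know what a bulk load job looks like at the HTTP level. The sketch below builds a request for Neptune's `/loader` endpoint and interprets a status response. It is an approximation, not the internals of `NeptuneBulkLoader`: the helper names and the `neptune-data/` prefix are hypothetical, and the POST helper assumes IAM database authentication is disabled (otherwise the request would need SigV4 signing) and that the caller can reach the cluster inside its VPC.

```python
import json
import urllib.request


def build_loader_request(s3_bucket, prefix, role_arn, region):
    """Build the JSON body for a Neptune bulk load of Gremlin CSV files."""
    return {
        "source": f"s3://{s3_bucket}/{prefix}",
        "format": "csv",
        "iamRoleArn": role_arn,
        "region": region,
        "failOnError": "FALSE",
        "parallelism": "MEDIUM",
    }


def start_bulk_load(neptune_endpoint, payload, port=8182):
    """POST the load request and return the loadId (no SigV4 signing)."""
    req = urllib.request.Request(
        f"https://{neptune_endpoint}:{port}/loader",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["payload"]["loadId"]


def is_finished(status_response):
    """True once the job has left the queued and running states."""
    status = status_response["payload"]["overallStatus"]["status"]
    return status not in ("LOAD_NOT_STARTED", "LOAD_IN_QUEUE", "LOAD_IN_PROGRESS")


# Offline illustration: no network calls are made here.
payload = build_loader_request(
    "123456789012-neptune-bulk-load",  # placeholder account ID
    "neptune-data/",                   # hypothetical prefix
    "arn:aws:iam::123456789012:role/neptune-workbench-NeptuneS3AccessRole",
    "us-west-2",
)
print(json.dumps(payload, indent=2))
print(is_finished({"payload": {"overallStatus": {"status": "LOAD_IN_PROGRESS"}}}))
```

A finished job is not necessarily a successful one: statuses such as `LOAD_FAILED` also end the polling loop, so inspect the final status and the error details returned by `GET /loader/<loadId>` before assuming the data is in the graph.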

## Summary

✅ **Completed Successfully:**
1. Generated 100,000 enhanced fraud transactions with sophisticated patterns
2. Bulk loaded all data to Neptune via S3
3. Data is now ready for analysis

🎯 **What was created:**
- Money laundering rings
- Shell company networks
- Synthetic identity fraud
- Risk scoring system
- High-performance bulk loading

📊 **Next Steps:**
- Open `Fraud_Detection_Analytics.ipynb` for fraud detection queries
- Explore graph patterns and relationships
- Build ML models on the graph data