# Exploratory Data Analysis (EDA)

This notebook explores the credit risk data for Bati Bank's buy-now-pay-later service.

## Objectives
1. Understand the structure and quality of the dataset
2. Identify patterns and relationships in the data
3. Detect outliers and missing values
4. Form hypotheses for feature engineering
5. Generate insights to guide model development


In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Load data
df = pd.read_csv('../data/raw/data.csv')

print("="*80)
print("EXPLORATORY DATA ANALYSIS - CREDIT RISK MODEL")
print("="*80)
print(f'\nData shape: {df.shape}')
print(f'Number of rows: {df.shape[0]:,}')
print(f'Number of columns: {df.shape[1]}')
print(f'\nColumns: {list(df.columns)}')


## 1. Overview of the Data

Understanding the structure of the dataset, including data types and basic information.


In [None]:
# Display basic information about the dataset
print("="*80)
print("1. DATA OVERVIEW")
print("="*80)

print("\n--- Data Types ---")
print(df.dtypes)

print("\n--- First 5 rows ---")
display(df.head())

print("\n--- Last 5 rows ---")
display(df.tail())

print("\n--- Sample of data ---")
display(df.sample(5))


In [None]:
# Convert TransactionStartTime to datetime
df['TransactionStartTime'] = pd.to_datetime(df['TransactionStartTime'])

# Extract date components for analysis
df['Year'] = df['TransactionStartTime'].dt.year
df['Month'] = df['TransactionStartTime'].dt.month
df['Day'] = df['TransactionStartTime'].dt.day
df['DayOfWeek'] = df['TransactionStartTime'].dt.dayofweek
df['Hour'] = df['TransactionStartTime'].dt.hour

print("TransactionStartTime converted to datetime")
print(f"Date range: {df['TransactionStartTime'].min()} to {df['TransactionStartTime'].max()}")
print(f"Time span: {(df['TransactionStartTime'].max() - df['TransactionStartTime'].min()).days} days")


In [None]:
# Identify numerical and categorical columns
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()

# Remove ID columns from categorical for analysis
id_cols = ['TransactionId', 'BatchId', 'AccountId', 'SubscriptionId', 'CustomerId']
categorical_cols = [col for col in categorical_cols if col not in id_cols]

print("="*80)
print("COLUMN CATEGORIZATION")
print("="*80)
print(f"\nNumerical columns ({len(numerical_cols)}): {numerical_cols}")
print(f"\nCategorical columns ({len(categorical_cols)}): {categorical_cols}")
print(f"\nID columns: {id_cols}")


## 2. Summary Statistics

Understanding the central tendency, dispersion, and shape of the dataset's distribution.


In [None]:
# Summary statistics for numerical features
print("="*80)
print("2. SUMMARY STATISTICS - NUMERICAL FEATURES")
print("="*80)
display(df[numerical_cols].describe())


In [None]:
# Additional statistics for key numerical features
print("\n--- Additional Statistics for Key Features ---")
key_features = ['Amount', 'Value', 'FraudResult']

for feature in key_features:
    if feature in df.columns:
        print(f"\n{feature}:")
        print(f"  Mean: {df[feature].mean():.2f}")
        print(f"  Median: {df[feature].median():.2f}")
        print(f"  Std Dev: {df[feature].std():.2f}")
        print(f"  Min: {df[feature].min():.2f}")
        print(f"  Max: {df[feature].max():.2f}")
        print(f"  Skewness: {df[feature].skew():.2f}")
        print(f"  Kurtosis: {df[feature].kurtosis():.2f}")


In [None]:
# Summary statistics for categorical features
print("="*80)
print("2. SUMMARY STATISTICS - CATEGORICAL FEATURES")
print("="*80)

for col in categorical_cols:
    print(f"\n--- {col} ---")
    print(f"Unique values: {df[col].nunique()}")
    print(f"Most frequent: {df[col].mode()[0] if len(df[col].mode()) > 0 else 'N/A'}")
    print(f"Frequency of most frequent: {df[col].value_counts().iloc[0] if len(df[col].value_counts()) > 0 else 0}")
    print("\nTop 10 values:")
    print(df[col].value_counts().head(10))


## 3. Distribution of Numerical Features

Visualizing the distribution of numerical features to identify patterns, skewness, and potential outliers.


In [None]:
# Distribution plots for key numerical features
print("="*80)
print("3. DISTRIBUTION OF NUMERICAL FEATURES")
print("="*80)

# Select key numerical features for visualization (excluding ID-like columns)
key_numerical = ['Amount', 'Value', 'CountryCode', 'PricingStrategy', 'FraudResult', 'Year', 'Month', 'DayOfWeek', 'Hour']

fig, axes = plt.subplots(3, 3, figsize=(18, 15))
axes = axes.ravel()

for idx, col in enumerate(key_numerical):
    if col in df.columns:
        ax = axes[idx]
        df[col].hist(bins=50, ax=ax, edgecolor='black')
        ax.set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
        ax.set_xlabel(col)
        ax.set_ylabel('Frequency')
        ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Detailed distribution for Amount and Value (log scale for better visualization)
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Amount distribution
axes[0, 0].hist(df['Amount'], bins=100, edgecolor='black', alpha=0.7)
axes[0, 0].set_title('Distribution of Amount (Linear Scale)', fontweight='bold')
axes[0, 0].set_xlabel('Amount')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].axvline(df['Amount'].mean(), color='r', linestyle='--', label=f'Mean: {df["Amount"].mean():.2f}')
axes[0, 0].axvline(df['Amount'].median(), color='g', linestyle='--', label=f'Median: {df["Amount"].median():.2f}')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# Amount distribution (log scale for positive values)
positive_amounts = df[df['Amount'] > 0]['Amount']
axes[0, 1].hist(np.log10(positive_amounts + 1), bins=50, edgecolor='black', alpha=0.7)
axes[0, 1].set_title('Distribution of Positive Amounts (Log10 Scale)', fontweight='bold')
axes[0, 1].set_xlabel('Log10(Amount + 1)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].grid(True, alpha=0.3)

# Value distribution
axes[1, 0].hist(df['Value'], bins=100, edgecolor='black', alpha=0.7)
axes[1, 0].set_title('Distribution of Value (Linear Scale)', fontweight='bold')
axes[1, 0].set_xlabel('Value')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].axvline(df['Value'].mean(), color='r', linestyle='--', label=f'Mean: {df["Value"].mean():.2f}')
axes[1, 0].axvline(df['Value'].median(), color='g', linestyle='--', label=f'Median: {df["Value"].median():.2f}')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# Value distribution (log scale)
axes[1, 1].hist(np.log10(df['Value'] + 1), bins=50, edgecolor='black', alpha=0.7)
axes[1, 1].set_title('Distribution of Value (Log10 Scale)', fontweight='bold')
axes[1, 1].set_xlabel('Log10(Value + 1)')
axes[1, 1].set_ylabel('Frequency')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nAmount Statistics:")
print(f"  Positive transactions: {(df['Amount'] > 0).sum():,} ({(df['Amount'] > 0).mean()*100:.2f}%)")
print(f"  Negative transactions: {(df['Amount'] < 0).sum():,} ({(df['Amount'] < 0).mean()*100:.2f}%)")
print(f"  Zero transactions: {(df['Amount'] == 0).sum():,} ({(df['Amount'] == 0).mean()*100:.2f}%)")


## 4. Distribution of Categorical Features

Analyzing the distribution of categorical features to understand frequency and variability of categories.


In [None]:
# Distribution of categorical features
print("="*80)
print("4. DISTRIBUTION OF CATEGORICAL FEATURES")
print("="*80)

# Plot distributions for key categorical features
key_categorical = ['CurrencyCode', 'ProductCategory', 'ChannelId', 'ProviderId']

fig, axes = plt.subplots(2, 2, figsize=(18, 12))
axes = axes.ravel()

for idx, col in enumerate(key_categorical):
    if col in df.columns:
        ax = axes[idx]
        value_counts = df[col].value_counts()
        
        # Plot top 15 categories if there are many
        if len(value_counts) > 15:
            top_values = value_counts.head(15)
            top_values.plot(kind='bar', ax=ax, color='steelblue', edgecolor='black')
            ax.set_title(f'Top 15 {col} Distribution', fontsize=12, fontweight='bold')
        else:
            value_counts.plot(kind='bar', ax=ax, color='steelblue', edgecolor='black')
            ax.set_title(f'{col} Distribution', fontsize=12, fontweight='bold')
        
        ax.set_xlabel(col)
        ax.set_ylabel('Frequency')
        ax.tick_params(axis='x', rotation=45)
        ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()


In [None]:
# ProductCategory distribution with percentages
print("\n--- ProductCategory Distribution ---")
product_dist = df['ProductCategory'].value_counts()
product_pct = df['ProductCategory'].value_counts(normalize=True) * 100

product_df = pd.DataFrame({
    'Count': product_dist,
    'Percentage': product_pct
})
display(product_df)

# Visualize ProductCategory
plt.figure(figsize=(12, 6))
product_dist.plot(kind='bar', color='coral', edgecolor='black')
plt.title('ProductCategory Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Product Category')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()


In [None]:
# ChannelId distribution
print("\n--- ChannelId Distribution ---")
channel_dist = df['ChannelId'].value_counts()
channel_pct = df['ChannelId'].value_counts(normalize=True) * 100

channel_df = pd.DataFrame({
    'Count': channel_dist,
    'Percentage': channel_pct
})
display(channel_df)

# Visualize ChannelId
plt.figure(figsize=(10, 6))
channel_dist.plot(kind='bar', color='lightgreen', edgecolor='black')
plt.title('ChannelId Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Channel ID')
plt.ylabel('Frequency')
plt.xticks(rotation=45, ha='right')
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()


## 5. Correlation Analysis

Understanding the relationship between numerical features.


In [None]:
# Correlation matrix for numerical features
print("="*80)
print("5. CORRELATION ANALYSIS")
print("="*80)

# Select numerical features for correlation (excluding derived date features for now)
corr_features = ['Amount', 'Value', 'CountryCode', 'PricingStrategy', 'FraudResult']
corr_matrix = df[corr_features].corr()

print("\n--- Correlation Matrix ---")
display(corr_matrix)

# Visualize correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, fmt='.3f', cmap='coolwarm', center=0,
            square=True, linewidths=1, cbar_kws={"shrink": 0.8})
plt.title('Correlation Matrix of Numerical Features', fontsize=14, fontweight='bold', pad=20)
plt.tight_layout()
plt.show()


In [None]:
# Relationship between Amount and Value
print("\n--- Amount vs Value Relationship ---")
abs_amount = df['Amount'].abs()
print(f"Correlation between |Amount| and Value: {abs_amount.corr(df['Value']):.4f}")
print("Note: Value is expected to equal the absolute value of Amount")

# Check whether Value equals |Amount| (a local Series avoids adding a helper column to df)
value_match = (df['Value'] == abs_amount).all()
print(f"Value equals absolute Amount: {value_match}")

if not value_match:
    mismatch_count = (df['Value'] != abs_amount).sum()
    print(f"Number of mismatches: {mismatch_count} ({(mismatch_count/len(df)*100):.2f}%)")

# Scatter plot of Amount vs Value
plt.figure(figsize=(10, 6))
plt.scatter(df['Amount'].abs(), df['Value'], alpha=0.5, s=1)
plt.plot([df['Amount'].abs().min(), df['Amount'].abs().max()], 
         [df['Amount'].abs().min(), df['Amount'].abs().max()], 
         'r--', linewidth=2, label='y = x (Value = |Amount|)')
plt.xlabel('Absolute Amount')
plt.ylabel('Value')
plt.title('Amount (Absolute) vs Value', fontsize=14, fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()


## 6. Identifying Missing Values

Locating missing values to understand their pattern and to choose an appropriate imputation strategy.


In [None]:
# Check for missing values
print("="*80)
print("6. MISSING VALUES ANALYSIS")
print("="*80)

missing_values = df.isnull().sum()
missing_pct = (missing_values / len(df)) * 100

missing_df = pd.DataFrame({
    'Missing Count': missing_values,
    'Missing Percentage': missing_pct
})

missing_df = missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False)

if len(missing_df) > 0:
    print("\n--- Columns with Missing Values ---")
    display(missing_df)
else:
    print("\n✓ No missing values found in the dataset!")

# Check for empty strings or whitespace
print("\n--- Checking for Empty Strings ---")
for col in df.select_dtypes(include=['object']).columns:
    empty_count = (df[col].astype(str).str.strip() == '').sum()
    if empty_count > 0:
        print(f"{col}: {empty_count} empty strings ({(empty_count/len(df)*100):.2f}%)")


## 7. Outlier Detection

Using box plots and statistical methods to identify outliers.


In [None]:
# Box plots for numerical features to detect outliers
print("="*80)
print("7. OUTLIER DETECTION")
print("="*80)

# Box plots for key numerical features
fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Amount box plot
axes[0, 0].boxplot(df['Amount'], vert=True)
axes[0, 0].set_title('Box Plot: Amount', fontweight='bold')
axes[0, 0].set_ylabel('Amount')
axes[0, 0].grid(True, alpha=0.3)

# Value box plot
axes[0, 1].boxplot(df['Value'], vert=True)
axes[0, 1].set_title('Box Plot: Value', fontweight='bold')
axes[0, 1].set_ylabel('Value')
axes[0, 1].grid(True, alpha=0.3)

# Log scale box plots for better visualization
axes[1, 0].boxplot(np.log10(df['Amount'].abs() + 1), vert=True)
axes[1, 0].set_title('Box Plot: Log10(Absolute Amount + 1)', fontweight='bold')
axes[1, 0].set_ylabel('Log10(|Amount| + 1)')
axes[1, 0].grid(True, alpha=0.3)

axes[1, 1].boxplot(np.log10(df['Value'] + 1), vert=True)
axes[1, 1].set_title('Box Plot: Log10(Value + 1)', fontweight='bold')
axes[1, 1].set_ylabel('Log10(Value + 1)')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Statistical outlier detection using IQR method
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound, Q1, Q3, IQR

print("\n--- Outlier Detection using IQR Method ---")
for col in ['Amount', 'Value']:
    outliers, lower, upper, Q1, Q3, IQR = detect_outliers_iqr(df, col)
    print(f"\n{col}:")
    print(f"  Q1: {Q1:.2f}")
    print(f"  Q3: {Q3:.2f}")
    print(f"  IQR: {IQR:.2f}")
    print(f"  Lower bound: {lower:.2f}")
    print(f"  Upper bound: {upper:.2f}")
    print(f"  Number of outliers: {len(outliers):,} ({(len(outliers)/len(df)*100):.2f}%)")
    print(f"  Outlier range: [{outliers[col].min():.2f}, {outliers[col].max():.2f}]")


In [None]:
# Analyze FraudResult distribution
print("\n--- FraudResult Analysis ---")
fraud_dist = df['FraudResult'].value_counts()
fraud_pct = df['FraudResult'].value_counts(normalize=True) * 100

print(f"Fraud cases (1): {fraud_dist.get(1, 0):,} ({fraud_pct.get(1, 0):.2f}%)")
print(f"Non-fraud cases (0): {fraud_dist.get(0, 0):,} ({fraud_pct.get(0, 0):.2f}%)")
print(f"\nClass imbalance ratio: {fraud_dist.get(0, 1) / fraud_dist.get(1, 1):.2f}:1")

# Visualize FraudResult
plt.figure(figsize=(8, 6))
# Sort by class label so that green always maps to 0 and red to 1
fraud_dist.sort_index().plot(kind='bar', color=['green', 'red'], edgecolor='black')
plt.title('FraudResult Distribution', fontsize=14, fontweight='bold')
plt.xlabel('Fraud Result (0=No, 1=Yes)')
plt.ylabel('Frequency')
plt.xticks(rotation=0)
plt.grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()


In [None]:
# Analyze fraud by different categories
print("\n--- Fraud Analysis by Category ---")

fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# Fraud by ProductCategory
fraud_by_category = df.groupby('ProductCategory')['FraudResult'].agg(['sum', 'count'])
fraud_by_category['fraud_rate'] = (fraud_by_category['sum'] / fraud_by_category['count']) * 100
fraud_by_category = fraud_by_category.sort_values('fraud_rate', ascending=False)

top_categories = fraud_by_category.head(10)
axes[0, 0].barh(top_categories.index, top_categories['fraud_rate'], color='coral')
axes[0, 0].set_title('Fraud Rate by ProductCategory (Top 10)', fontweight='bold')
axes[0, 0].set_xlabel('Fraud Rate (%)')
axes[0, 0].grid(True, alpha=0.3, axis='x')

# Fraud by ChannelId
fraud_by_channel = df.groupby('ChannelId')['FraudResult'].agg(['sum', 'count'])
fraud_by_channel['fraud_rate'] = (fraud_by_channel['sum'] / fraud_by_channel['count']) * 100
fraud_by_channel = fraud_by_channel.sort_values('fraud_rate', ascending=False)

axes[0, 1].barh(fraud_by_channel.index, fraud_by_channel['fraud_rate'], color='steelblue')
axes[0, 1].set_title('Fraud Rate by ChannelId', fontweight='bold')
axes[0, 1].set_xlabel('Fraud Rate (%)')
axes[0, 1].grid(True, alpha=0.3, axis='x')

# Fraud by PricingStrategy
fraud_by_pricing = df.groupby('PricingStrategy')['FraudResult'].agg(['sum', 'count'])
fraud_by_pricing['fraud_rate'] = (fraud_by_pricing['sum'] / fraud_by_pricing['count']) * 100
fraud_by_pricing = fraud_by_pricing.sort_values('fraud_rate', ascending=False)

axes[1, 0].bar(fraud_by_pricing.index.astype(str), fraud_by_pricing['fraud_rate'], color='lightgreen', edgecolor='black')
axes[1, 0].set_title('Fraud Rate by PricingStrategy', fontweight='bold')
axes[1, 0].set_xlabel('Pricing Strategy')
axes[1, 0].set_ylabel('Fraud Rate (%)')
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Fraud over time (by month)
df['YearMonth'] = df['TransactionStartTime'].dt.to_period('M')
fraud_by_month = df.groupby('YearMonth')['FraudResult'].agg(['sum', 'count'])
fraud_by_month['fraud_rate'] = (fraud_by_month['sum'] / fraud_by_month['count']) * 100

axes[1, 1].plot(fraud_by_month.index.astype(str), fraud_by_month['fraud_rate'], marker='o', linewidth=2, markersize=6)
axes[1, 1].set_title('Fraud Rate Over Time (by Month)', fontweight='bold')
axes[1, 1].set_xlabel('Year-Month')
axes[1, 1].set_ylabel('Fraud Rate (%)')
axes[1, 1].tick_params(axis='x', rotation=45)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()


In [None]:
# Customer-level analysis (preview for feature engineering)
print("\n--- Customer-Level Statistics Preview ---")
customer_stats = df.groupby('CustomerId').agg({
    'TransactionId': 'count',
    'Amount': ['sum', 'mean', 'std'],
    'Value': ['sum', 'mean'],
    'FraudResult': 'sum',
    'ProductCategory': lambda x: x.nunique(),
    'ChannelId': lambda x: x.nunique()
}).round(2)

customer_stats.columns = ['Transaction_Count', 'Total_Amount', 'Avg_Amount', 'Std_Amount',
                          'Total_Value', 'Avg_Value', 'Fraud_Count', 'Unique_Categories', 'Unique_Channels']

print(f"Number of unique customers: {df['CustomerId'].nunique():,}")
print(f"Average transactions per customer: {customer_stats['Transaction_Count'].mean():.2f}")
print(f"Median transactions per customer: {customer_stats['Transaction_Count'].median():.2f}")

display(customer_stats.head(10))


## 8. Key Insights Summary

Based on the exploratory data analysis, here are the most important insights that will guide feature engineering and model development.


### Insight 1: Severe Class Imbalance in Fraud Detection
- **Finding**: The dataset shows extreme class imbalance with fraud cases representing only ~0.2% of all transactions.
- **Implication**: This will require special handling during model training (e.g., class weighting, SMOTE, or stratified sampling).
- **Action**: Use stratified k-fold cross-validation and evaluate with AUC-ROC and precision-recall curves rather than accuracy.
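
As a minimal sketch of the class-weighting option mentioned above, the cell below derives per-class weights with the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The toy label vector `y` is a stand-in for `df['FraudResult']`, and the names `class_weights` and `sample_weight` are illustrative, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df['FraudResult']: 2 fraud cases in 1,000 rows (0.2%).
y = pd.Series(np.r_[np.ones(2, dtype=int), np.zeros(998, dtype=int)],
              name="FraudResult")

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c),
# so each class contributes the same total weight.
counts = y.value_counts()
class_weights = (len(y) / (counts.size * counts)).to_dict()

# Per-row weights, usable wherever an estimator accepts sample weights.
sample_weight = y.map(class_weights)

print(class_weights)
```

With these weights the two classes carry equal total weight, so a model can no longer minimise its loss by ignoring the fraud class. Whether weighting, resampling, or a mix works best here is something to validate, not assume.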

### Insight 2: Highly Skewed Transaction Amounts with Significant Outliers
- **Finding**: Transaction amounts (Amount and Value) are highly right-skewed with a wide range (from small values to millions).
- **Implication**: 
  - Standard scaling may not be sufficient; log transformation or robust scaling may be needed.
  - Outliers may represent legitimate high-value transactions or errors that need investigation.
- **Action**: 
  - Apply log transformation or robust scaling for Amount/Value features.
  - Consider capping extreme values or using percentile-based features.
  - Create features like transaction amount categories (low/medium/high).
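
The actions above could look roughly like the following. This is a sketch on a toy Series, not the notebook's data. The signed `log1p` transform, the 1st/99th percentile caps, and the three-band split are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Toy amounts: negatives mimic the negative Amount rows; one extreme value.
amount = pd.Series([-5000.0, -50.0, 100.0, 500.0, 1000.0, 2500.0, 9_000_000.0])

# Signed log1p keeps the sign and compresses the magnitude.
amount_log = np.sign(amount) * np.log1p(amount.abs())

# Percentile capping (winsorising); the 1%/99% cut-offs are an assumption.
lower, upper = amount.quantile([0.01, 0.99])
amount_capped = amount.clip(lower=lower, upper=upper)

# Robust scaling: centre on the median and scale by the IQR.
q1, q3 = amount.quantile([0.25, 0.75])
amount_robust = (amount - amount.median()) / (q3 - q1)

# Quantile-based bands on the absolute amount (low / medium / high).
amount_band = pd.qcut(amount.abs(), q=3, labels=["low", "medium", "high"])
```

Any quantiles or caps used for modelling should be estimated on the training split only and then applied unchanged to validation and test data.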

### Insight 3: Strong Relationship Between Amount and Value
- **Finding**: Value appears to be the absolute value of Amount (with some potential discrepancies to investigate).
- **Implication**: These features are highly correlated and may provide redundant information.
- **Action**: 
  - Use either Amount or Value, not both, or create derived features like transaction direction (credit/debit).
  - Investigate cases where Value ≠ |Amount| as they may indicate data quality issues.
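
A small sketch of both actions on a toy frame: a direction flag derived from the sign of `Amount`, and a mismatch flag for rows where `Value` differs from `|Amount|`. The column names `IsNegativeAmount`, `ValueMismatch`, and `ValueGap` are hypothetical, and whether a negative `Amount` means a credit or a debit is an assumption to confirm against the data dictionary.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Amount": [1000.0, -200.0, 500.0, -50.0],
    "Value": [1000, 200, 550, 50],
})

# Direction flag from the sign of Amount (its credit/debit meaning is assumed).
toy["IsNegativeAmount"] = (toy["Amount"] < 0).astype(int)

# Flag rows where Value differs from |Amount|; np.isclose tolerates float noise.
toy["ValueMismatch"] = ~np.isclose(toy["Value"], toy["Amount"].abs())

# Size of the gap, useful when inspecting the flagged rows.
toy["ValueGap"] = toy["Value"] - toy["Amount"].abs()

print(toy[toy["ValueMismatch"]])
```

If the gaps turn out to be systematic (for example, a fee added on top of the amount), `ValueGap` may itself be worth keeping as a feature.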

### Insight 4: Temporal Patterns and Customer Behavior Diversity
- **Finding**: 
  - Transactions span multiple months, allowing for temporal feature engineering.
  - Customers show diverse behavior in terms of transaction frequency, product categories, and channels used.
- **Implication**: 
  - RFM (Recency, Frequency, Monetary) features can be engineered at the customer level.
  - Time-based features (day of week, hour, month) may capture behavioral patterns.
- **Action**: 
  - Create customer-level aggregations: transaction count, total/avg amount, recency of last transaction.
  - Engineer temporal features: transaction frequency, days since first/last transaction.
  - Build features for product category diversity and channel preferences.
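
To make the RFM idea concrete, here is a sketch on a toy transaction table that reuses the dataset's column names. The snapshot date (one day after the last observed transaction) and the output column names are assumptions.

```python
import pandas as pd

tx = pd.DataFrame({
    "CustomerId": ["C1", "C1", "C2", "C2", "C2", "C3"],
    "TransactionId": ["T1", "T2", "T3", "T4", "T5", "T6"],
    "Value": [100, 250, 40, 60, 20, 900],
    "TransactionStartTime": pd.to_datetime([
        "2019-01-01", "2019-01-20", "2018-12-15",
        "2019-01-05", "2019-01-28", "2018-11-30",
    ]),
})

# Reference point for recency: one day after the last transaction (assumed).
snapshot = tx["TransactionStartTime"].max() + pd.Timedelta(days=1)

per_customer = tx.groupby("CustomerId").agg(
    FirstSeen=("TransactionStartTime", "min"),
    LastSeen=("TransactionStartTime", "max"),
    Frequency=("TransactionId", "count"),
    Monetary=("Value", "sum"),
)

# Recency: days since the last transaction; tenure: days between first and last.
per_customer["Recency"] = (snapshot - per_customer["LastSeen"]).dt.days
per_customer["TenureDays"] = (per_customer["LastSeen"] - per_customer["FirstSeen"]).dt.days

rfm = per_customer[["Recency", "Frequency", "Monetary", "TenureDays"]]
print(rfm)
```

`Value` is summed here rather than `Amount` so that negative amounts do not cancel out a customer's spending. That choice is a judgment call, not something the data dictates.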

### Insight 5: Fraud Patterns Vary by Product Category and Channel
- **Finding**: Different product categories and channels show varying fraud rates.
- **Implication**: 
  - ProductCategory and ChannelId are important predictors of fraud risk.
  - These categorical features should be encoded (one-hot, target encoding, or WoE) for model use.
- **Action**: 
  - Create fraud rate features by category/channel (target encoding).
  - Consider interaction features between ProductCategory, ChannelId, and transaction amounts.
  - Use Weight of Evidence (WoE) transformation for these categorical features if using logistic regression.
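
The WoE transformation can be sketched as below. The helper `woe_table` and its `smoothing` parameter are hypothetical. The sign convention used is `ln(share of non-events / share of events)`, so categories with relatively more fraud receive a negative WoE. Some references use the inverse ratio.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "ChannelId": ["web"] * 6 + ["android"] * 4,
    "FraudResult": [0, 0, 0, 0, 0, 1, 0, 0, 1, 1],
})

def woe_table(frame, feature, target, smoothing=0.5):
    """Per-category WoE as ln(share of non-events / share of events)."""
    grouped = frame.groupby(feature)[target].agg(events="sum", total="count")
    grouped["non_events"] = grouped["total"] - grouped["events"]
    n_categories = len(grouped)
    # Additive smoothing keeps the log finite for categories with no events.
    event_share = (grouped["events"] + smoothing) / (
        grouped["events"].sum() + smoothing * n_categories)
    non_event_share = (grouped["non_events"] + smoothing) / (
        grouped["non_events"].sum() + smoothing * n_categories)
    grouped["woe"] = np.log(non_event_share / event_share)
    # Each category's contribution to the Information Value of the feature.
    grouped["iv_part"] = (non_event_share - event_share) * grouped["woe"]
    return grouped

woe = woe_table(toy, "ChannelId", "FraudResult")
toy["ChannelId_woe"] = toy["ChannelId"].map(woe["woe"])
print(woe)
```

Like any target-based encoding, the WoE table should be fitted on training data only and then mapped onto held-out rows, otherwise the encoded feature leaks the label.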

### Additional Considerations:
1. **Data Quality**: No missing values were detected, but the Value = |Amount| relationship still needs to be verified.
2. **Feature Engineering Opportunities**: 
   - Customer-level RFM features (Recency, Frequency, Monetary)
   - Transaction velocity features (transactions per day/week)
   - Behavioral diversity features (number of unique categories, channels, providers)
   - Temporal features (time since last transaction, transaction patterns)
3. **Model Considerations**: 
   - Given class imbalance, focus on precision-recall metrics
   - Consider ensemble methods or cost-sensitive learning
   - Ensure interpretability for regulatory compliance (Basel II requirements)
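
To illustrate why precision and recall matter more than accuracy under class imbalance, the sketch below scores a handful of made-up predictions. The labels, scores, and the helper `precision_recall_at` are invented for illustration and do not come from any fitted model.

```python
import numpy as np

# Made-up labels (2 positives out of 10) and model scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.15, 0.30, 0.02, 0.60, 0.80, 0.40, 0.35])

def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when rows with score >= threshold are flagged."""
    predicted = scores >= threshold
    tp = np.sum(predicted & (y_true == 1))
    fp = np.sum(predicted & (y_true == 0))
    fn = np.sum(~predicted & (y_true == 1))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return float(precision), float(recall)

# A model that never flags fraud still looks accurate.
accuracy_all_negative = float(np.mean(y_true == 0))

print("Accuracy of 'always non-fraud':", accuracy_all_negative)
for threshold in (0.5, 0.35):
    print(threshold, precision_recall_at(y_true, scores, threshold))
```

Lowering the threshold trades precision for recall. Which point on that curve is acceptable depends on the relative cost of missed fraud and false alarms.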


In [None]:
# Final summary statistics
print("="*80)
print("FINAL SUMMARY")
print("="*80)
print(f"\nDataset Overview:")
print(f"  Total transactions: {len(df):,}")
print(f"  Unique customers: {df['CustomerId'].nunique():,}")
print(f"  Unique accounts: {df['AccountId'].nunique():,}")
print(f"  Date range: {df['TransactionStartTime'].min()} to {df['TransactionStartTime'].max()}")
print(f"  Time span: {(df['TransactionStartTime'].max() - df['TransactionStartTime'].min()).days} days")
print(f"\nTarget Variable (FraudResult):")
print(f"  Fraud cases: {(df['FraudResult'] == 1).sum():,} ({(df['FraudResult'] == 1).mean()*100:.2f}%)")
print(f"  Non-fraud cases: {(df['FraudResult'] == 0).sum():,} ({(df['FraudResult'] == 0).mean()*100:.2f}%)")
print(f"\nTransaction Amounts:")
print(f"  Total amount: {df['Amount'].sum():,.2f}")
print(f"  Average amount: {df['Amount'].mean():.2f}")
print(f"  Median amount: {df['Amount'].median():.2f}")
print(f"  Total value: {df['Value'].sum():,.2f}")
print(f"\nData Quality:")
print(f"  Missing values: {df.isnull().sum().sum()}")
print(f"  Duplicate rows: {df.duplicated().sum()}")
print(f"\nReady for feature engineering!")
