# AI Payment Risk Scoring - Exploratory Data Analysis

This notebook provides comprehensive exploratory data analysis for the AI-based customer payment risk evaluation system.

## Overview
- **Objective**: Analyze customer payment data to understand risk patterns
- **Data**: Customer demographics, financial history, and payment behavior
- **Target**: Payment failure prediction and risk scoring

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import warnings
warnings.filterwarnings('ignore')

# Configure plotting
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

print("Libraries imported successfully!")

# Load the data
data_path = '../data/raw/your_excel_file.xlsx'  # Update with your actual file path
df = pd.read_excel(data_path)

# Display the first few rows of the dataset
df.head()

In [None]:
# Import custom modules
import sys
import os
sys.path.append(os.path.join(os.getcwd(), '..', 'src'))

from data_preparation import DataPreparator
from model_training import ModelTrainer
from scoring import RiskScorer
from utils import ResultsExporter
import config

print("Custom modules imported successfully!")

## 1. Data Generation and Loading

Let's start by generating sample data or loading existing data for analysis.

In [None]:
# Check for missing values
missing_values = df.isnull().sum()
missing_values[missing_values > 0]

## 2. Data Overview and Quality Assessment

In [None]:
# Visualize the distribution of the target variable
plt.figure(figsize=(8, 5))
sns.countplot(x='未払FLAG', data=df)
plt.title('Distribution of Payment Default Flag')
plt.xlabel('未払FLAG')
plt.ylabel('Count')
plt.show()

In [None]:
# Initialize data preparator
data_prep = DataPreparator()

# Generate sample data
print("Generating sample customer data...")
sample_data = data_prep.generate_sample_data(n_customers=1000, n_transactions_per_customer=5)

print(f"Generated data shape: {sample_data.shape}")
print(f"Columns: {list(sample_data.columns)}")
sample_data.head()

# Visualize correlations between features
plt.figure(figsize=(12, 8))
correlation_matrix = df.corr()
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

## 3. Customer-Level Data Analysis

Let's aggregate the transaction data to customer level and analyze patterns.

In [None]:
# Aggregate to customer level
customer_data = data_prep.aggregate_customer_data(sample_data)

print(f"Customer-level data shape: {customer_data.shape}")
print(f"Unique customers: {len(customer_data)}")
customer_data.head()

In [None]:
# Customer demographics analysis
fig, axes = plt.subplots(2, 3, figsize=(18, 12))

# Age distribution
customer_data['age'].hist(bins=30, ax=axes[0, 0], alpha=0.7)
axes[0, 0].set_title('Age Distribution')
axes[0, 0].set_xlabel('Age')
axes[0, 0].set_ylabel('Frequency')

# Income distribution
customer_data['income'].hist(bins=30, ax=axes[0, 1], alpha=0.7, color='green')
axes[0, 1].set_title('Income Distribution')
axes[0, 1].set_xlabel('Income')
axes[0, 1].set_ylabel('Frequency')

# Credit score distribution
customer_data['credit_score'].hist(bins=30, ax=axes[0, 2], alpha=0.7, color='orange')
axes[0, 2].set_title('Credit Score Distribution')
axes[0, 2].set_xlabel('Credit Score')
axes[0, 2].set_ylabel('Frequency')

# Account balance distribution
customer_data['account_balance'].hist(bins=30, ax=axes[1, 0], alpha=0.7, color='red')
axes[1, 0].set_title('Account Balance Distribution')
axes[1, 0].set_xlabel('Account Balance')
axes[1, 0].set_ylabel('Frequency')

# Total transactions
customer_data['total_transactions'].hist(bins=20, ax=axes[1, 1], alpha=0.7, color='purple')
axes[1, 1].set_title('Total Transactions Distribution')
axes[1, 1].set_xlabel('Number of Transactions')
axes[1, 1].set_ylabel('Frequency')

# Payment failure rate
if 'payment_failure' in customer_data.columns:
    failure_counts = customer_data['payment_failure'].value_counts()
    axes[1, 2].pie(failure_counts.values, labels=['No Failure', 'Failure'], autopct='%1.1f%%')
    axes[1, 2].set_title('Payment Failure Distribution')

plt.tight_layout()
plt.show()

## Conclusioninformation
print("=== DATA OVERVIEW ===")
This exploratory data analysis reveals:








The analysis confirms the dataset is suitable for training an effective payment risk scoring model.5. **Model Readiness**: Data is well-prepared for machine learning pipeline4. **Feature Relationships**: Clear patterns between financial metrics and payment risk3. **Target Balance**: Reasonable class distribution for payment failures2. **Feature Distribution**: Well-distributed customer demographics and financial metrics1. **Data Quality**: Clean dataset with minimal missing values








This exploratory data analysis provides insights into the dataset, including the distribution of the target variable and correlations between features. Further steps will involve data preparation and model training.## Conclusionsample_data.describe()print("\n=== BASIC STATISTICS ===")
print(missing_values[missing_values > 0])missing_values = sample_data.isnull().sum()

print("\n=== MISSING VALUES ===")print(sample_data.dtypes)print("\n=== DATA TYPES ===")print(f"Memory usage: {sample_data.memory_usage(deep=True).sum() / 1024**2:.2f} MB")






plt.show()plt.tight_layout()axes[1, 1].set_xlabel('Bytes')axes[1, 1].set_title('Top 10 Columns by Memory Usage')axes[1, 1].barh(memory_usage.index, memory_usage.values)


memory_usage = sample_data.memory_usage(deep=True).sort_values(ascending=False)[:10]

# Memory usage by columnaxes[1, 0].set_title('Duplicate Records')axes[1, 0].bar(['Unique', 'Duplicate'], [unique_count, duplicate_count])

unique_count = len(sample_data) - duplicate_countduplicate_count = sample_data.duplicated().sum()# Data quality visualization



# Duplicate recordsaxes[0, 1].set_title('Data Types Distribution')fig, axes = plt.subplots(2, 2, figsize=(15, 12))

axes[0, 1].pie(dtype_counts.values, labels=dtype_counts.index, autopct='%1.1f%%')
# Missing values heatmap
sns.heatmap(sample_data.isnull(), cbar=True, ax=axes[0, 0], cmap='viridis')
axes[0, 0].set_title('Missing Values Heatmap')

# Data types distribution
dtype_counts = sample_data.dtypes.value_counts()