# Credit Card Customer Dataset Documentation

This notebook provides a comprehensive overview of the credit card customer dataset through visualizations and descriptive statistics.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

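# Set style for better visualizations
# (the bare 'seaborn' Matplotlib style name was deprecated in Matplotlib 3.6
# and later removed, so the theme is set through seaborn itself)
sns.set_theme(style='darkgrid', palette='Set2')
%matplotlib inline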

In [None]:
# Load the dataset
df = pd.read_csv('customer_segmentation.csv')

# Display basic information
print("Dataset Shape:", df.shape)
print("\nColumn Types:")
df.info()

## Dataset Column Descriptions

### Demographic Information
- **customer_id**: Unique identifier for each customer
- **age**: Customer's age in years (22-50)
- **gender**: Customer's gender (M/F)
- **dependent_count**: Number of dependents (0-4)
- **education_level**: Education level achieved
  - Categories: Uneducated, High School, College, Graduate, Post-Graduate, Doctorate
- **marital_status**: Current marital status
  - Categories: Single, Married, Divorced, Unknown
- **estimated_income**: Projected annual income (25,000-150,000 USD)

### Account Behavior
- **months_on_book**: Customer tenure in months (12-72)
- **total_relationship_count**: Number of bank contacts (1-5)
- **months_inactive_12_mon**: Number of months inactive in the last 12 months (1-4)

### Transaction Patterns
- **total_trans_amount**: Total transaction value (300.00-5200.25 USD)
- **total_trans_count**: Number of transactions (10-50)
- **avg_utilization_ratio**: Card utilization ratio (0.36-0.57)
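
The ranges documented above can be checked against the loaded data. The following is a minimal sketch, assuming `df` has been loaded as in the earlier cell. `DOCUMENTED_RANGES` and `check_documented_ranges` are illustrative names, not part of the original notebook, and the bounds are copied from the descriptions above.

```python
import pandas as pd

# Ranges copied from the column descriptions above (assumed to be inclusive bounds).
DOCUMENTED_RANGES = {
    "age": (22, 50),
    "dependent_count": (0, 4),
    "estimated_income": (25_000, 150_000),
    "months_on_book": (12, 72),
    "total_relationship_count": (1, 5),
    "months_inactive_12_mon": (1, 4),
    "total_trans_amount": (300.00, 5200.25),
    "total_trans_count": (10, 50),
    "avg_utilization_ratio": (0.36, 0.57),
}


def check_documented_ranges(frame, ranges=DOCUMENTED_RANGES):
    """Compare each column's observed min/max with its documented range."""
    rows = []
    for col, (low, high) in ranges.items():
        if col not in frame.columns:
            continue  # skip columns that are missing from this frame
        observed_min = frame[col].min()
        observed_max = frame[col].max()
        rows.append({
            "column": col,
            "documented_min": low,
            "documented_max": high,
            "observed_min": observed_min,
            "observed_max": observed_max,
            "within_range": bool(observed_min >= low and observed_max <= high),
        })
    return pd.DataFrame(rows)
```

Calling `check_documented_ranges(df)` returns one row per documented column. Any row with `within_range` set to `False` means either the data or the description above needs updating.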

In [None]:
# Display summary statistics
print("Summary Statistics for Numerical Variables:")
df.describe()

In [None]:
# Visualize distributions of numerical variables
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
numeric_cols = [col for col in numeric_cols if col != 'customer_id']

fig, axes = plt.subplots(3, 3, figsize=(15, 15))
axes = axes.ravel()

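# One panel per numeric column; zip stops at whichever runs out first
for ax, col in zip(axes, numeric_cols):
    sns.histplot(data=df, x=col, ax=ax, kde=True)
    ax.set_title(f'Distribution of {col}')
    if col in ['estimated_income', 'total_trans_amount']:
        ax.set_xlabel(f'{col} (USD)')

# Hide any panels left without a column
for ax in axes[len(numeric_cols):]:
    ax.set_visible(False)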

plt.tight_layout()
plt.show()

In [None]:
# Visualize categorical variables
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

# Gender distribution
sns.countplot(data=df, x='gender', ax=axes[0])
axes[0].set_title('Gender Distribution')

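# Education level distribution (bars ordered from lowest to highest level)
education_order = ['Uneducated', 'High School', 'College', 'Graduate', 'Post-Graduate', 'Doctorate']
sns.countplot(data=df, x='education_level', order=education_order, ax=axes[1])
axes[1].set_title('Education Level Distribution')
axes[1].tick_params(axis='x', rotation=45)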

# Marital status distribution
sns.countplot(data=df, x='marital_status', ax=axes[2])
axes[2].set_title('Marital Status Distribution')
axes[2].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

In [None]:
# Create correlation heatmap
plt.figure(figsize=(12, 8))
correlation_matrix = df.select_dtypes(include=['int64', 'float64']).drop('customer_id', axis=1).corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix of Numerical Variables')
plt.tight_layout()
plt.show()

## Key Insights

1. **Education Distribution**:
   - Full range from Uneducated to Doctorate level
   - Graduate and Post-Graduate levels most common
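   - Apparent association with income levels (education_level is categorical, so this is a group-wise comparison rather than a correlation coefficient)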

2. **Marital Status Patterns**:
   - Balanced distribution across categories
   - 'Unknown' status present for privacy considerations
   - Apparent associations with age and income (group-wise, since marital_status is categorical)

3. **Financial Behavior**:
   - Transaction amounts vary significantly
   - Credit utilization stays below 60% (avg_utilization_ratio of 0.36-0.57)
   - Most customers maintain regular activity

4. **Customer Engagement**:
   - Tenure ranges from 1 to 6 years (12-72 months on book)
   - Most customers have multiple bank interactions
   - Low inactivity periods overall
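
The insights above can be cross-checked with a few aggregate figures. The following is a rough sketch, assuming the column names and education categories documented earlier. `EDUCATION_ORDER` and `summarize_insights` are hypothetical helpers introduced here for illustration, not part of the original analysis.

```python
import pandas as pd

# Order taken from the education_level categories documented above.
EDUCATION_ORDER = [
    "Uneducated", "High School", "College",
    "Graduate", "Post-Graduate", "Doctorate",
]


def summarize_insights(frame):
    """Compute the figures behind the key insights for comparison with the text."""
    # Share of customers per education level (0.0 for levels with no customers)
    education_share = (
        frame["education_level"]
        .value_counts(normalize=True)
        .reindex(EDUCATION_ORDER)
        .fillna(0.0)
    )
    # Median income per education level (NaN for levels with no customers)
    median_income = (
        frame.groupby("education_level")["estimated_income"]
        .median()
        .reindex(EDUCATION_ORDER)
    )
    months = frame["months_on_book"]
    return {
        "education_share": education_share,
        "median_income_by_education": median_income,
        "marital_status_share": frame["marital_status"].value_counts(normalize=True),
        "max_utilization": float(frame["avg_utilization_ratio"].max()),
        "tenure_years": (months.min() / 12, months.max() / 12),
        "median_months_inactive": float(frame["months_inactive_12_mon"].median()),
    }
```

Calling `summarize_insights(df)` returns a dictionary whose entries map onto the four insight groups. For example, `max_utilization` can be compared with the 60% figure and `tenure_years` with the stated tenure range.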