# Data Cleaning – Credit Card Churn Dataset
This notebook performs initial data cleaning on the raw credit card churn dataset.  
The goal is to prepare the dataset for EDA and modeling by:
- Removing duplicates
- Handling missing values
- Fixing data types
- Addressing outliers
- Managing high-cardinality categorical features  
The cleaned dataset will be saved in `data/processed/` for use in later stages.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
from pathlib import Path
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

# Reproducibility
np.random.seed(42)

In [None]:
# Paths
DATA_DIR = Path("../../data/raw")
FILE_PATH = DATA_DIR / "credit_card_attrition_dataset_mark.csv" 

In [None]:
# Load
df = pd.read_csv(FILE_PATH)

## 1. Looking at the Dataset

In [None]:
df.shape

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

## 2. Checking for Duplicates

In [None]:
# Count duplicates
df.duplicated().sum()

In [None]:
# Remove duplicates
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()
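
The duplicate check can also record how many rows were dropped, which is useful when the raw file changes between runs. The following is a minimal sketch on a toy frame, not the notebook's data; the `CustomerID` column and the `toy` values are hypothetical.

```python
import pandas as pd

# Toy frame for illustration only; in the notebook this would be df
toy = pd.DataFrame({
    "CustomerID": [1, 2, 2, 3],  # hypothetical column name
    "Income": [50_000, 62_000, 62_000, 48_000],
})

rows_before = len(toy)
# keep="first" (the pandas default) retains the first copy of each fully duplicated row
deduped = toy.drop_duplicates(keep="first").reset_index(drop=True)
rows_removed = rows_before - len(deduped)
print(f"Removed {rows_removed} duplicate row(s); {len(deduped)} remain")
```

Only rows that match on every column count as duplicates here, which is the same behavior as `df.drop_duplicates()` with no arguments.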

## 3. Checking for Missing Data

In [None]:
# Temporarily show all rows; the max_columns setting from the setup cell stays in effect
with pd.option_context('display.max_rows', None):
    print(df.isna().sum())

In [None]:
df[["Income", "CreditLimit", "TotalSpend"]].isnull().sum()

*The columns with missing values are `Income`, `CreditLimit`, and `TotalSpend` (5k missing values).*
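
Raw counts are easier to judge next to the share of rows they represent. The following is a minimal sketch of a summary helper; `missing_summary` is a hypothetical name and the toy frame only stands in for `df`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "n_missing": counts,
        "pct_missing": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first
    return summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)


# Toy frame for illustration only; in the notebook, call missing_summary(df)
toy = pd.DataFrame({
    "Income": [50_000, np.nan, 72_000, np.nan],
    "CreditLimit": [5_000, 8_000, np.nan, 12_000],
    "Age": [34, 45, 29, 51],
})
summary = missing_summary(toy)
print(summary)
```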

In [None]:
cols_with_missing = ["Income", "CreditLimit", "TotalSpend"]

df[cols_with_missing].skew()

In [None]:
# plt and sns are already imported in the setup cell
cols_with_missing = ["Income", "CreditLimit", "TotalSpend"]

for col in cols_with_missing:
    plt.figure(figsize=(6, 4))
    # histplot ignores NaN values, so this shows the observed (non-missing) distribution
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f"Distribution of {col} (before imputation)")
    plt.show()

*`Income` and `TotalSpend` are highly right-skewed, so I will use median imputation, which is less sensitive to outliers than the mean. `CreditLimit` is nearly symmetric, so I will use mean imputation, which leaves the column mean unchanged.*
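
The same choice can be made programmatically from the skewness values. The following is a minimal sketch, assuming a cutoff of |skew| > 1 for "highly skewed" (a common rule of thumb, not a value taken from this notebook); `impute_by_skew`, `SKEW_THRESHOLD`, and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

SKEW_THRESHOLD = 1.0  # assumed cutoff; not taken from the notebook


def impute_by_skew(frame, columns, threshold=SKEW_THRESHOLD):
    """Fill NaNs with the median for skewed columns and the mean otherwise."""
    out = frame.copy()  # leave the caller's frame untouched
    chosen = {}
    for col in columns:
        skew = out[col].skew()  # NaNs are skipped by default
        if abs(skew) > threshold:
            fill, chosen[col] = out[col].median(), "median"
        else:
            fill, chosen[col] = out[col].mean(), "mean"
        out[col] = out[col].fillna(fill)
    return out, chosen


# Toy data for illustration only
toy = pd.DataFrame({
    "Income": [20, 22, 25, 27, 30, 400, np.nan],          # long right tail
    "CreditLimit": [10, 20, 30, 40, 50, np.nan, np.nan],  # symmetric
})
filled, strategy = impute_by_skew(toy, ["Income", "CreditLimit"])
print(strategy)
```

Returning the chosen strategy alongside the filled frame makes it easy to log which statistic was applied to each column.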

In [None]:
# Median for skewed features
df["Income"] = df["Income"].fillna(df["Income"].median())
df["TotalSpend"] = df["TotalSpend"].fillna(df["TotalSpend"].median())

# Mean for symmetric feature
df["CreditLimit"] = df["CreditLimit"].fillna(df["CreditLimit"].mean())

In [None]:
df[["Income", "CreditLimit", "TotalSpend"]].isnull().sum()