# Data Cleaning – Credit Card Churn Dataset
This notebook performs initial data cleaning on the raw credit card churn dataset.  
The goal is to prepare the dataset for EDA and modeling by:
- Removing duplicates
- Handling missing values
- Addressing outliers
- Managing high-cardinality categorical features
- Identifying the imbalanced target variable (`AttritionFlag`)
- Handling high dimensionality
 
The cleaned dataset will be saved in `data/processed/` for use in later stages.

In [None]:
# Standard libraries
import pandas as pd
import numpy as np
from pathlib import Path
import os

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Display settings
pd.set_option('display.max_columns', None)
sns.set(style="whitegrid")

# Reproducibility
np.random.seed(42)

In [None]:
# Paths
DATA_DIR = Path("../../data/raw")
FILE_PATH = DATA_DIR / "credit_card_attrition_dataset_mark.csv" 

## Loading the Dataset

In [None]:
# Load
df = pd.read_csv(FILE_PATH)

## 1. Looking at the Dataset

*I begin by examining the dataset: the number of rows and columns, the data types, and summary statistics (mean, median, min, max, and standard deviation) for the numeric columns.  
This gives an initial sense of the data and helps spot obvious issues early on.*

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

*Remove unnecessary columns such as `CustomerID`.*

In [None]:
# Drop CustomerID
df.drop(columns=["CustomerID"], inplace=True)

print("CustomerID column removed. New shape:", df.shape)

## 2. Checking for Duplicates

*Duplicate records can skew the analysis and lead to biased models.  
I check for exact duplicates using the `.duplicated()` method and remove them with `drop_duplicates()`.*
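
As a small illustration on made-up data (the toy frame below is hypothetical, not the churn dataset), `duplicated()` by default flags only the second and later copies of a row, so its sum should equal the number of rows that `drop_duplicates()` removes:

```python
import pandas as pd

# Hypothetical toy data, not taken from the churn dataset
toy = pd.DataFrame({
    "Age": [34, 34, 51, 34],
    "CardType": ["Gold", "Gold", "Silver", "Gold"],
})

# keep="first" (the default) marks every repeat after the first occurrence
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

print("Redundant rows:", n_dupes)
print("Shape after removing them:", deduped.shape)
```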

In [None]:
# Count duplicates
df.duplicated().sum()

In [None]:
# Remove duplicates
df = df.drop_duplicates()

In [None]:
df.duplicated().sum()

In [None]:
# Get unique count for each variable
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
df.nunique()

In [None]:
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [None]:
df.shape

## 3. Checking for Missing Data

Missing values can bias the model if not handled properly.  
I check the distribution of missing values per column and decide on appropriate strategies. 

In [None]:
pd.set_option('display.max_rows', None)  # Show all rows
pd.set_option('display.max_columns', None)  # Show all columns
print(df.isna().sum())

In [None]:
pd.reset_option('display.max_rows')
pd.reset_option('display.max_columns')

In [None]:
df[["Income", "CreditLimit", "TotalSpend"]].isnull().sum()

*The columns with missing values are `Income`, `CreditLimit`, and `TotalSpend`, which have 5k missing values.*

In [None]:
cols_with_missing = ["Income", "CreditLimit", "TotalSpend"]

df[cols_with_missing].skew()

In [None]:
# Output directory for the distribution plots (plt, sns and Path are imported at the top)
figures_dir = Path("../../reports/figures/data_cleaning")
figures_dir.mkdir(parents=True, exist_ok=True)

cols_with_missing = ["Income", "CreditLimit", "TotalSpend"]

for col in cols_with_missing:
    plt.figure(figsize=(6,4))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f"Distribution of {col} (with Missing Values)")
    
    # Save figure
    save_path = figures_dir / f"{col}_distribution.png"
    plt.savefig(save_path, dpi=300, bbox_inches="tight")
    
    # Show plot
    plt.show()

    print(f"Saved: {save_path}")

*Since `Income` and `TotalSpend` are highly right-skewed, I will use median imputation to avoid distortion from outliers. `CreditLimit` is nearly symmetric, so I will use mean imputation, which is a reasonable central value for a symmetric distribution.*
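
The rule described above can be sketched as a small helper. This is only an illustration on made-up data: the function name `impute_by_skew` and the skewness cut-off of 1.0 are my own assumptions (a common rule of thumb), not values taken from this notebook.

```python
import numpy as np
import pandas as pd

def impute_by_skew(series: pd.Series, skew_threshold: float = 1.0) -> pd.Series:
    """Fill NaNs with the median if |skew| exceeds the threshold, else with the mean."""
    use_median = abs(series.skew()) > skew_threshold  # threshold is an assumed rule of thumb
    fill_value = series.median() if use_median else series.mean()
    return series.fillna(fill_value)

# Hypothetical toy columns: one right-skewed, one symmetric
toy = pd.DataFrame({
    "skewed": [1.0, 2.0, 2.0, 3.0, 100.0, np.nan],
    "symmetric": [10.0, 20.0, 30.0, 40.0, 50.0, np.nan],
})

filled = toy.apply(impute_by_skew)
print(filled.tail(1))
```

The skewed column receives its median, which the single extreme value barely affects, while the symmetric column receives its mean.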

In [None]:
# Median for skewed features
df["Income"] = df["Income"].fillna(df["Income"].median())
df["TotalSpend"] = df["TotalSpend"].fillna(df["TotalSpend"].median())

# Mean for symmetric feature
df["CreditLimit"] = df["CreditLimit"].fillna(df["CreditLimit"].mean())

In [None]:
df[["Income", "CreditLimit", "TotalSpend"]].isnull().sum()

In [None]:
df.describe()

## 4. Checking for Outliers

Outliers can distort model performance.  
I will use boxplots and the IQR method to detect outliers.

In [None]:
# Get all numeric columns except target
numeric_cols = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
if "AttritionFlag" in numeric_cols:
    numeric_cols.remove("AttritionFlag")

print("Numeric columns:", numeric_cols)

In [None]:
# Function for detecting outliers using IQR
def detect_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = df[(df[column] < lower_bound) | (df[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

In [None]:
# Output directory for the boxplots before capping (plt, sns and Path are imported at the top)
figures_dir = Path("../../reports/figures/data_cleaning/outliers_before")
figures_dir.mkdir(parents=True, exist_ok=True)

outlier_summary = []

for col in numeric_cols:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    outlier_summary.append({
        "Column": col,
        "Num_Outliers": len(outliers),
        "Lower_Bound": lower,
        "Upper_Bound": upper
    })
    
    # Save boxplot
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df[col])
    plt.title(f"{col} - Outlier Detection")
    plt.savefig(figures_dir / f"{col}_boxplot.png", dpi=300, bbox_inches="tight")
    plt.show()
    plt.close()

# Summary DataFrame
outlier_df = pd.DataFrame(outlier_summary).sort_values(by="Num_Outliers", ascending=False)
print(outlier_df)

In [None]:
def cap_outliers_iqr(df, column):
    """Cap values of `column` outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]; modifies `df` in place."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1

    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df[column] = np.where(df[column] < lower_bound, lower_bound,
                          np.where(df[column] > upper_bound, upper_bound, df[column]))

# Apply to all numeric columns
for col in numeric_cols:
    cap_outliers_iqr(df, col)

print("Outliers capped for all numeric columns.")

In [None]:
# Output directory for the boxplots after capping (plt, sns and Path are imported at the top)
figures_dir = Path("../../reports/figures/data_cleaning/outliers_after")
figures_dir.mkdir(parents=True, exist_ok=True)

outlier_summary = []

for col in numeric_cols:
    outliers, lower, upper = detect_outliers_iqr(df, col)
    outlier_summary.append({
        "Column": col,
        "Num_Outliers": len(outliers),
        "Lower_Bound": lower,
        "Upper_Bound": upper
    })
    
    # Save boxplot
    plt.figure(figsize=(6,4))
    sns.boxplot(x=df[col])
    plt.title(f"{col} - After Outlier Capping")
    plt.savefig(figures_dir / f"{col}_boxplot.png", dpi=300, bbox_inches="tight")
    plt.show()
    plt.close()

# Summary DataFrame
outlier_df = pd.DataFrame(outlier_summary).sort_values(by="Num_Outliers", ascending=False)
print(outlier_df)

*I detected outliers using the IQR method and capped them instead of removing them to preserve dataset size while reducing the influence of extreme values. This is particularly important for financial datasets where high values can be genuine but shouldn’t overly bias the model.*
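
To make the effect of capping concrete, here is a minimal sketch on made-up values. The helper `cap_with_iqr` is a hypothetical, non-mutating variant written with `Series.clip`; it is not the function used in this notebook, although it applies the same bounds.

```python
import pandas as pd

def cap_with_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a copy with values clipped to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical spend values with one extreme observation
spend = pd.Series([10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 20.0, 500.0])
capped = cap_with_iqr(spend)

print("Rows before and after:", len(spend), len(capped))
print("Max before and after:", spend.max(), capped.max())
```

The row count is unchanged and only the extreme value moves to the upper bound. One caveat: for a binary or mostly constant column the IQR can be 0, so both bounds collapse to a single value and capping would flatten the column. It may be worth checking for such columns before applying the cap to every numeric feature.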

## 5. Handling High Cardinality in Categorical Features

Features with too many unique categories can cause problems for modeling, such as overfitting or high dimensionality.  
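
As a rough illustration on made-up data (the `Category` column below is hypothetical), one-hot encoding adds one column per distinct value, whereas frequency encoding keeps a single numeric column:

```python
import pandas as pd

# Hypothetical toy column with several distinct categories
toy = pd.DataFrame({"Category": ["US", "US", "DE", "FR", "IN", "BR", "US", "DE"]})

# One-hot encoding: one new column per distinct category
one_hot = pd.get_dummies(toy, columns=["Category"])

# Frequency encoding: replace each category by how often it occurs
freq = toy["Category"].map(toy["Category"].value_counts())

print("One-hot columns:", one_hot.shape[1])
print("Frequency-encoded values:", freq.tolist())
```

The trade-off is that categories with the same count become indistinguishable after frequency encoding.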

In [None]:
# Identify categorical columns
categorical_cols = df.select_dtypes(include=["object"]).columns.tolist()
print("Categorical columns:", categorical_cols)

In [None]:
# Count unique values in each categorical column
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique values")

*The `Country` column is the only high-cardinality feature, but all categorical variables still need to be converted to numeric form.*

In [None]:
pd.set_option('display.max_rows', None) 

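for col in categorical_cols:
    print(f"Distribution of {col}:")
    print(df[col].value_counts())
    print("-" * 40)

# Restore the default row limit so later cells are not flooded with output
pd.reset_option('display.max_rows')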

### *One-Hot Encoding for Low-Cardinality Variables*

In [None]:
# Columns to one-hot encode
low_cardinality_cols = ["Gender", "MaritalStatus", "EducationLevel", "CardType"]

# One-hot encode
df_encoded = pd.get_dummies(df, columns=low_cardinality_cols, drop_first=False)

print("Shape before encoding:", df.shape)
print("Shape after encoding:", df_encoded.shape)

print("One-hot encoding done.")

*Note: from here on, `df_encoded` is the working DataFrame.*

In [None]:
df_encoded.head(15)

### *Frequency Encoding for the High-Cardinality `Country` Column*

In [None]:
# Frequency encoding for Country column
country_freq = df_encoded['Country'].value_counts(normalize=False)  # counts (not percentage)

# Map frequencies back to the dataframe
df_encoded['Country_FE'] = df_encoded['Country'].map(country_freq)

# Drop original Country column
df_encoded.drop(columns=['Country'], inplace=True)

print("Frequency encoding applied to Country column.")
print(df_encoded[['Country_FE']].head())

In [None]:
df_encoded.shape

In [None]:
df_encoded.head(15)

In [None]:
df_encoded.info()

*`df_encoded` is now fully numeric (in pandas 2.0 and later the one-hot columns are stored as `bool`).*

## 6. Identifying the Imbalanced Target Variable `AttritionFlag`

Imbalanced classes can lead to biased models that favor the majority class. 

In [None]:
# Check distribution
class_counts = df_encoded['AttritionFlag'].value_counts()
print(class_counts)

# Visualize
plt.figure(figsize=(5,4))
sns.barplot(x=class_counts.index, y=class_counts.values)
plt.title("AttritionFlag Distribution")
plt.xlabel("AttritionFlag")
plt.ylabel("Count")
plt.show()

### Note on Handling Class Imbalance

The target variable `AttritionFlag` is imbalanced:

- **Class 0:** 95,040 records  
- **Class 1:** 4,960 records  

I will **not** address this imbalance during the data cleaning phase for the following reasons:

1. **Single Cleaned Dataset** – I aim to produce **one cleaned CSV file** (`credit_card_attrition_cleaned.csv`) containing all rows and columns after cleaning, without any artificial oversampling or undersampling applied at this stage.  

2. **Preventing Data Leakage** – Oversampling techniques such as **SMOTE** must be applied **only to the training set**. Applying them before splitting the dataset could introduce information from the test set into the training process, leading to overly optimistic performance estimates.  

3. **Better Workflow Separation** – Class balancing will be handled in the **model preparation stage**, after the train-test split. The steps will be:
   - Split the cleaned dataset into training and testing sets.
   - Apply **SMOTE** only on the training set to generate synthetic samples for the minority class.
   - Leave the test set untouched for an unbiased evaluation of the model.

This approach ensures a clean, reusable dataset for multiple modeling experiments while preserving the integrity of model evaluation.
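
The ordering described above can be sketched as follows. This is a hedged illustration on made-up data, not the modeling code for this project. To keep it dependency-free it uses plain random oversampling as a stand-in for SMOTE (which would come from the `imbalanced-learn` package), and the 80/20 split ratio is my own assumption.

```python
import numpy as np
import pandas as pd

TARGET = "AttritionFlag"

# Hypothetical imbalanced toy data (5% positives), not the real dataset
rng = np.random.default_rng(42)
toy = pd.DataFrame({
    "feature": rng.normal(size=1000),
    TARGET: np.repeat([0, 1], [950, 50]),
})

# 1. Stratified 80/20 split: sample 80% of each class for training
train = toy.groupby(TARGET, group_keys=False).sample(frac=0.8, random_state=42)
test = toy.drop(train.index)

# 2. Balance the training set only (random oversampling, standing in for SMOTE)
n_major = train[TARGET].value_counts().max()
parts = []
for _, grp in train.groupby(TARGET):
    n_extra = n_major - len(grp)
    if n_extra > 0:
        # Draw extra minority rows with replacement until the classes match
        extra = grp.sample(n=n_extra, replace=True, random_state=42)
        grp = pd.concat([grp, extra])
    parts.append(grp)
train_balanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle rows

# 3. The test set is left untouched and keeps the original class ratio
print("Train (balanced):", train_balanced[TARGET].value_counts().to_dict())
print("Test (untouched):", test[TARGET].value_counts().to_dict())
```

Because the resampling happens after the split, no test row can appear in the balanced training set.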


## 7. Handling High Dimensionality

High dimensionality can lead to increased computational cost, model overfitting, and difficulty in interpreting results.  
In this step, we aim to simplify the dataset while retaining useful information by applying two techniques:

1. **Check for Near-Zero Variance Features**  
   Features with little to no variance across samples carry minimal predictive power.  
   Such columns provide almost the same value for all observations, making them unhelpful for distinguishing between classes.  
   Removing them helps reduce noise and speeds up model training without sacrificing accuracy.

2. **Remove Highly Correlated Features**  
   Features with very high correlation (e.g., Pearson correlation coefficient > 0.95) are essentially redundant, as they carry the same information.  
   Keeping both leads to:
   - Multicollinearity issues in linear models (e.g., Logistic Regression), where coefficient estimates become unstable.
   - Increased complexity in tree-based models (e.g., Random Forest, XGBoost), with no real gain in predictive power.
   
By performing these steps **now**, we ensure that the dataset is optimized for modeling without introducing data leakage.  
No transformations here use the target variable (`AttritionFlag`), so the process is safe to apply before splitting the dataset.


### *Identify and Remove Near-Zero Variance Features*

In [None]:
print("Initial shape:", df_encoded.shape)

In [None]:
# Exclude target
target = "AttritionFlag"
feature_cols = [c for c in df_encoded.columns if c != target]

# Calculate stats for each feature
nzv_stats = []
for col in feature_cols:
    top_freq = df_encoded[col].value_counts(normalize=True, dropna=False).iloc[0]
    var = df_encoded[col].var()
    nunique = df_encoded[col].nunique(dropna=False)
    nzv_stats.append((col, top_freq, var, nunique))

nzv_df = pd.DataFrame(nzv_stats, columns=["feature","top_freq","variance","nunique"])
nzv_df = nzv_df.sort_values("top_freq", ascending=False)

# Show features where >99% of values are the same (adjust threshold if you want)
top_freq_threshold = 0.99
near_constant = nzv_df[nzv_df["top_freq"] >= top_freq_threshold]["feature"].tolist()

print(f"Found {len(near_constant)} near-constant features (top_freq >= {top_freq_threshold}):")
print(near_constant)

# Optionally also show very small variance features (useful for continuous)
var_threshold = 1e-6   # you can tune this small number if needed
low_variance = nzv_df[nzv_df["variance"] <= var_threshold]["feature"].tolist()
print(f"\nFound {len(low_variance)} very low variance features (variance <= {var_threshold}):")
print(low_variance)

# Consolidate features to drop (union of both sets)
to_drop_nzv = sorted(set(near_constant + low_variance))
print(f"\nTotal features suggested to drop (near-zero var): {len(to_drop_nzv)}")

# Preview table for inspection
display(nzv_df.head(20))

*None of the features is a useless constant, so we can skip removing near-zero variance columns.*

### *Identify Highly Correlated Columns*

In [None]:
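# Exclude the target so that feature selection here never uses AttritionFlag
corr_matrix = df_encoded.drop(columns=["AttritionFlag"]).corr().abs()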
upper_triangle = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

high_corr_cols = [column for column in upper_triangle.columns if any(upper_triangle[column] > 0.95)]
print(f"Highly correlated columns ({len(high_corr_cols)}): {high_corr_cols}")

### Renaming Encoded Gender Column

`Gender_Female` and `Gender_Male` are highly correlated because they are direct complements (if one is 1, the other is 0). Therefore, we will drop `Gender_Male` and rename `Gender_Female` to `Is_Female` for clarity.
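
A quick check on made-up data shows why the two dummy columns are flagged. The toy frame is hypothetical, and `dtype=int` is set only so the example prints 0/1 values regardless of pandas version.

```python
import pandas as pd

# Hypothetical toy column
toy = pd.DataFrame({"Gender": ["Female", "Male", "Male", "Female", "Male"]})
dummies = pd.get_dummies(toy, columns=["Gender"], dtype=int)

# Exactly one of the two columns is 1 in every row
print(dummies.sum(axis=1).unique())

# Pearson correlation between the complementary columns
corr = dummies["Gender_Female"].corr(dummies["Gender_Male"])
print(round(corr, 6))
```

The correlation is negative, and its absolute value is what the check against the 0.95 threshold sees.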


In [None]:
# Drop Gender_Male
df_encoded = df_encoded.drop(columns=["Gender_Male"])

# Rename Gender_Female to Is_Female
df_encoded = df_encoded.rename(columns={"Gender_Female": "Is_Female"})

print("Dropped 'Gender_Male' and renamed 'Gender_Female' to 'Is_Female'.")

In this step, only **1 column** (`Gender_Male`) was dropped due to perfect correlation with `Gender_Female`.  
This leaves us with **71 columns** in the dataset.

While 71 features are not considered high dimensional in a strict sense  
(high dimensionality typically refers to datasets with hundreds or thousands of features),  
this step was still important to:

- Remove redundant variables to avoid multicollinearity, which can impact model interpretability (especially in Logistic Regression).  
- Eliminate unnecessary complexity without losing useful information.

No additional dimensionality reduction was performed at this stage.  
Further feature selection will be considered later, during the **feature engineering and modeling phase**, using model-based feature importance, domain knowledge, or dimensionality reduction techniques if needed.


### Saving the Cleaned Dataset

We have now finalized the **data cleaning process** and will save the cleaned dataset into the `data/processed` folder.  
This ensures that subsequent steps such as **feature engineering** and **model training** use a consistent, pre-cleaned dataset without repeating the cleaning steps.

In [None]:
# Save cleaned dataset to processed folder
processed_path = Path("../../data/processed/credit_card_attrition_cleaned.csv")
processed_path.parent.mkdir(parents=True, exist_ok=True)  # create the folder if it does not exist
df_encoded.to_csv(processed_path, index=False)

print(f"Cleaned dataset saved to: {processed_path}")