# Exploratory Data Analysis (EDA) and Feature Engineering

In this notebook, I will perform exploratory data analysis (EDA) on the cleaned credit card customer data to better understand patterns, relationships, and distributions of variables that may influence customer churn.  

Insights from EDA will guide the creation of meaningful features that could improve the performance of our predictive models.  

### Note on Univariate Analysis

Univariate analysis for both numerical and categorical features was already performed during the **Data Cleaning** stage:

- **Numerical features:**  
  - **Skewness:** Specifically examined `Income`, `CreditLimit`, and `TotalSpend`, as these columns contained missing values and required careful handling for imputation and potential transformation.  
  - **Outliers:** All numerical features were checked for outliers using the **IQR method**. Outliers were **capped** rather than removed to preserve dataset size while reducing the influence of extreme values.

- **Categorical features:**  
  Frequency counts and distributions were assessed.  
  - Low-cardinality features were **one-hot encoded**.  
  - High-cardinality features were **frequency encoded** to retain predictive information without inflating dimensionality.
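For reference, the two cleaning steps mentioned above (IQR capping and frequency encoding) can be sketched as follows. This is a minimal illustration on toy data, not the code from the Data Cleaning notebook: the helper names `cap_outliers_iqr` and `frequency_encode` are hypothetical, and the multiplier `k=1.5` is an assumed conventional default.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those bounds."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


def frequency_encode(series: pd.Series) -> pd.Series:
    """Replace each category with its relative frequency in the column."""
    freqs = series.value_counts(normalize=True)
    return series.map(freqs)


# Toy data for illustration only (not taken from the project dataset)
toy = pd.DataFrame({
    "Income": [10, 12, 11, 13, 12, 100],
    "Country": ["US", "US", "UK", "US", "DE", "UK"],
})
toy["Income_capped"] = cap_outliers_iqr(toy["Income"])
toy["Country_FE"] = frequency_encode(toy["Country"])
print(toy)
```

Capping keeps every row in the dataset while pulling the extreme value (100) back to the upper fence.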

As a result, the dataset has already been cleaned and transformed at a univariate level, allowing this notebook to focus on:

- Bivariate analysis of features and the target variable  
- Correlation analysis  
- Feature Engineering
- Encoding categorical variables and preparing data for modeling

In [None]:
# Import necessary libraries
import pandas as pd

# Load the cleaned dataset saved from the previous step
# Forward slashes work on Windows, macOS, and Linux
cleaned_data_path = '../../data/processed/credit_card_attrition_cleaned.csv'
df = pd.read_csv(cleaned_data_path)

# Display the first few rows to verify loading
df.head()

In [None]:
df.columns

In [None]:
df.shape

## 1. Bivariate analysis of features and the target variable
Goal: Identify features that differ significantly between churned (AttritionFlag=1) and non-churned (AttritionFlag=0) customers.

In [None]:
# Import plotting libraries (pandas is already imported above)
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Count the number of churned and non-churned customers
attrition_counts = df['AttritionFlag'].value_counts()

# Pie chart
plt.figure(figsize=(6,6))
plt.pie(attrition_counts, labels=attrition_counts.index, autopct='%1.1f%%', startangle=90, colors=['#4CAF50', '#F44336'])
plt.title('Distribution of AttritionFlag')
plt.show()

In [None]:
# Separate features by type
continuous_features = ['Age', 'Income', 'CreditLimit', 'TotalSpend', 'Tenure', 'TotalTransactions']
boolean_features = ['Is_Female', 'MaritalStatus_Divorced', 'MaritalStatus_Married', 
                    'MaritalStatus_Single', 'MaritalStatus_Widowed',
                    'EducationLevel_Bachelor', 'EducationLevel_High School',
                    'EducationLevel_Master', 'EducationLevel_PhD',
                    'CardType_Black', 'CardType_Gold', 'CardType_Platinum', 'CardType_Silver']
engineered_features = [f'Feature_{i}' for i in range(50)] + ['Country_FE']

In [None]:
# -------------------------------
# 1. Continuous Features
# -------------------------------
import os

# Set folder to save plots
save_folder = '../../reports/figures/EDA_FeatureEng/Continuous_FeaturesVSAttritionFlag'
os.makedirs(save_folder, exist_ok=True)  # Create folder if it doesn't exist

print("Bivariate Analysis: Continuous Features\n")

for col in continuous_features:
    print(col + " vs AttritionFlag")
    print(df.groupby('AttritionFlag')[col].describe(), "\n")
    
    plt.figure(figsize=(6,4))
    sns.boxplot(x='AttritionFlag', y=col, data=df)
    plt.title(f'{col} vs AttritionFlag')
    plt.tight_layout()
    plt.savefig(os.path.join(save_folder, f'{col}_boxplot.png'))
    plt.show() 

*For the continuous features `Age`, `Income`, `CreditLimit`, `TotalSpend`, `Tenure`, and `TotalTransactions`, the box plots and group summaries show no notable differences between customers who churned (AttritionFlag = 1) and those who did not (AttritionFlag = 0). This suggests that these features, taken individually, may have limited predictive power for churn in this dataset.*
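A visual comparison can be backed by a formal test. The sketch below uses the Mann-Whitney U test from SciPy, which does not assume normality, to compare each continuous feature between the two classes. It is an optional check rather than part of the original analysis; the helper name `compare_groups` and the synthetic `demo` data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu


def compare_groups(data, features, target="AttritionFlag"):
    """Run a two-sided Mann-Whitney U test per feature (churned vs. not churned)."""
    rows = []
    for col in features:
        churned = data.loc[data[target] == 1, col].dropna()
        stayed = data.loc[data[target] == 0, col].dropna()
        stat, p_value = mannwhitneyu(churned, stayed, alternative="two-sided")
        rows.append({
            "feature": col,
            "U": stat,
            "p_value": p_value,
            "median_diff": churned.median() - stayed.median(),
        })
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Synthetic demo: "Age" is unrelated to the target, "Shifted" depends on it
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "AttritionFlag": rng.integers(0, 2, size=500),
    "Age": rng.normal(45, 10, size=500),
})
demo["Shifted"] = rng.normal(0, 1, size=500) + 2 * demo["AttritionFlag"]

result = compare_groups(demo, ["Age", "Shifted"])
print(result)
```

On the real data, the same call would be `compare_groups(df, continuous_features)`. With many features tested at once, some small p-values are expected by chance, so a multiple-testing correction may be appropriate.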

In [None]:
# -------------------------------
# 2. Boolean Features (Readable)
# -------------------------------
# Set folder to save plots
save_folder = '../../reports/figures/EDA_FeatureEng/Boolean_FeaturesVSAttritionFlag'
os.makedirs(save_folder, exist_ok=True)  # Create folder if it doesn't exist

print("Bivariate Analysis: Boolean Features\n")

for col in boolean_features:
    # Print value counts per AttritionFlag
    print(f'{col} vs AttritionFlag')
    print(pd.crosstab(df[col], df['AttritionFlag'], normalize='columns'), "\n")  # proportions
    
    # Plot grouped bar chart
    plt.figure(figsize=(6,4))
    sns.countplot(x=col, hue='AttritionFlag', data=df, palette='Set2')
    plt.title(f'{col} distribution by AttritionFlag')
    plt.xlabel(f'{col} value')
    plt.ylabel('Count')
    plt.legend(title='AttritionFlag', labels=['No Churn (0)', 'Churn (1)'])
    plt.tight_layout()
    plt.savefig(os.path.join(save_folder, f'{col}_barplot.png'))
    plt.show()

*As with the continuous features, the boolean features show no notable differences between churned and non-churned customers.*
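For binary features, a chi-square test of independence on the contingency table is one way to quantify what the bar charts show. The following is a sketch under assumptions: `chi2_vs_target` is a hypothetical helper and the data are synthetic, so it illustrates the technique rather than reproducing results from this dataset.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_vs_target(data, features, target="AttritionFlag"):
    """Chi-square test of independence between each feature and the target."""
    rows = []
    for col in features:
        table = pd.crosstab(data[col], data[target])
        chi2, p_value, dof, _expected = chi2_contingency(table)
        rows.append({"feature": col, "chi2": chi2, "p_value": p_value, "dof": dof})
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)


# Synthetic demo: "Linked" agrees with the target about 90% of the time,
# while "Random" is generated independently of it
rng = np.random.default_rng(0)
target = rng.integers(0, 2, size=400)
demo = pd.DataFrame({
    "AttritionFlag": target,
    "Random": rng.integers(0, 2, size=400),
    "Linked": np.where(rng.random(400) < 0.9, target, 1 - target),
})

result = chi2_vs_target(demo, ["Random", "Linked"])
print(result)
```

On the notebook's data, the equivalent call would be `chi2_vs_target(df, boolean_features)`.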

In [None]:
# -------------------------------
# Bivariate Analysis: Engineered Features
# -------------------------------
# Set folder to save plots
save_folder = '../../reports/figures/EDA_FeatureEng/Engineered_FeaturesVSAttritionFlag'
os.makedirs(save_folder, exist_ok=True)  # Create folder if it doesn't exist

print("Bivariate Analysis: Engineered Features\n")

for col in engineered_features:
    print(f"{col} vs AttritionFlag")
    # Group statistics
    print(df.groupby('AttritionFlag')[col].mean(), "\n")
    
    # Boxplot
    plt.figure(figsize=(6,4))
    sns.boxplot(x='AttritionFlag', y=col, data=df)
    plt.title(f'{col} vs AttritionFlag')
    plt.xlabel('AttritionFlag (0=No, 1=Yes)')
    plt.ylabel(col)
    plt.tight_layout()
    
    # Save plot
    plt.savefig(os.path.join(save_folder, f'{col}_boxplot.png'))
    plt.show()

*The bivariate analysis of the engineered features (`Feature_0` to `Feature_49` and `Country_FE`) against the target variable `AttritionFlag` shows that the mean value of each feature is very similar between customers who stayed (`0`) and those who left (`1`).*

### Bivariate Analysis Conclusion
The bivariate analysis shows minimal differences in feature distributions between the two target classes (`AttritionFlag`). Continuous, boolean, and engineered features all look similar for churned and non-churned customers. Overall, no individual feature shows a clear relationship with attrition, which suggests that any predictive patterns may only emerge through feature interactions or more advanced techniques.

## 2. Correlation Analysis

Goal: Discover highly correlated features to inform which features to combine, transform, or drop during feature engineering.

In [None]:
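# Spearman correlation (numeric columns only, so a leftover non-numeric column does not raise an error)
corr_matrix = df.corr(method='spearman', numeric_only=True)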

pd.set_option('display.max_rows', None) 

# Correlation with target
target_corr = corr_matrix['AttritionFlag'].sort_values(ascending=False)
print(target_corr)

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x=target_corr.index, y=target_corr.values)
plt.xticks(rotation=90)
plt.ylabel('Spearman Correlation with AttritionFlag')
plt.title('Feature Correlation with Target')
plt.tight_layout()
plt.show()

*Spearman correlations between the numeric features and `AttritionFlag` are all extremely close to zero, indicating no notable monotonic (and hence no linear) relationship. This is consistent with the bivariate analysis: none of the features individually differentiates the target.*

In [None]:
plt.figure(figsize=(20,16))
sns.heatmap(corr_matrix, cmap='coolwarm', center=0, linewidths=0.5)
plt.title('Spearman Correlation Matrix')
plt.xticks(rotation=90)
plt.yticks(rotation=0)
plt.show()

### Correlation Analysis Conclusion
*There are no notable feature-feature or feature-target correlations. The few negative correlations observed are expected: they occur between dummy columns created by one-hot encoding the same categorical variable, which are mutually exclusive by construction.*
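A heatmap of this size is hard to read cell by cell, so it can help to list the strongest feature pairs explicitly. The sketch below extracts the upper triangle of the absolute correlation matrix and keeps pairs above a threshold. The helper name `top_correlated_pairs` and the `0.8` cutoff are illustrative assumptions, and the data are synthetic.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(data, threshold=0.8, method="spearman"):
    """Return feature pairs whose absolute correlation is at least `threshold`."""
    corr = data.corr(method=method, numeric_only=True).abs()
    # Keep only the upper triangle; k=1 excludes the diagonal of self-correlations
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # The comparison also discards any NaN entries left over from masking
    return pairs[pairs >= threshold].sort_values(ascending=False)


# Synthetic demo: "b" is almost a rescaled copy of "a"; "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pairs = top_correlated_pairs(toy, threshold=0.8)
print(pairs)
```

Applied to the notebook's data, `top_correlated_pairs(df.drop(columns='AttritionFlag'))` would list candidate features to combine or drop.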

## 3. Feature Engineering

Goal: To transform the existing dataset into a richer, more informative set of features that maximizes the model’s ability to predict attrition/churn, despite weak raw correlations.

In [None]:
df.head()

### Create target-independent features

In [None]:
import numpy as np

df['AvgTransaction'] = df['TotalSpend'] / np.where(df['TotalTransactions'] == 0, 1, df['TotalTransactions'])
df['CreditUsage'] = df['TotalSpend'] / np.where(df['CreditLimit'] == 0, 1, df['CreditLimit'])
df['SpendIncomeRatio'] = df['TotalSpend'] / np.where(df['Income'] == 0, 1, df['Income'])
df['TenureRatio'] = df['Tenure'] / np.where(df['Age'] == 0, 1, df['Age'])

`AvgTransaction`: Measures average spend per transaction. It normalizes spending by the number of transactions to capture customer behavior.

`CreditUsage`: Indicates how much of their credit limit the customer uses; can reflect financial stress or engagement with the product.

`SpendIncomeRatio`: Shows the relative spending compared to income, which can indicate whether spending is sustainable or risky.

`TenureRatio`: Normalizes tenure by age, capturing how long a customer has been with the bank relative to their age (loyalty vs. age).
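The zero-denominator guard used for these ratios is worth a small worked example, because replacing a zero denominator with 1 makes the ratio equal to the numerator. The sketch below wraps the pattern in a hypothetical helper, `safe_ratio`, and also shows an alternative that returns `NaN` instead. The values are made up for illustration.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator, zero_fill=1.0):
    """Divide two Series, substituting `zero_fill` wherever the denominator is 0."""
    safe_denominator = denominator.where(denominator != 0, zero_fill)
    return numerator / safe_denominator


# Made-up values for illustration
toy = pd.DataFrame({
    "TotalSpend": [1200.0, 300.0, 0.0],
    "TotalTransactions": [40, 0, 0],
    "CreditLimit": [6000.0, 1500.0, 2000.0],
})

# Same behavior as the notebook cell: a zero denominator is treated as 1
toy["AvgTransaction"] = safe_ratio(toy["TotalSpend"], toy["TotalTransactions"])

# Alternative: mark undefined ratios as missing instead
toy["AvgTransaction_nan"] = safe_ratio(
    toy["TotalSpend"], toy["TotalTransactions"], zero_fill=np.nan
)

toy["CreditUsage"] = safe_ratio(toy["TotalSpend"], toy["CreditLimit"])
print(toy)
```

In the second row, a customer with 300 in spend and zero recorded transactions gets `AvgTransaction = 300` under the fill-with-1 rule. Whether that or a missing value is more appropriate depends on how the model handles `NaN`.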

In [None]:
feature_cols = [f'Feature_{i}' for i in range(50)]

df['Feature_sum'] = df[feature_cols].sum(axis=1)
df['Feature_mean'] = df[feature_cols].mean(axis=1)
df['Feature_std'] = df[feature_cols].std(axis=1)
df['Feature_max'] = df[feature_cols].max(axis=1)
df['Feature_min'] = df[feature_cols].min(axis=1)

Aggregates condense high-dimensional features into summary statistics:

- **sum:** total activity/score across all features
- **mean:** average behavior across features
- **std:** variability of behavior
- **max/min:** extremes or outliers

This helps reduce dimensionality and captures overall trends in customer behavior.
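One property of these aggregates is easy to overlook: when no values are missing, the row-wise sum is exactly the mean multiplied by the number of columns, so the two carry the same information. The sketch below demonstrates this on random data that stands in for `Feature_0` to `Feature_49`. It assumes no missing values, which may or may not hold in the real dataset.

```python
import numpy as np
import pandas as pd

# Random stand-in for the 50 anonymous feature columns (no missing values)
rng = np.random.default_rng(42)
cols = [f"Feature_{i}" for i in range(50)]
toy = pd.DataFrame(rng.normal(size=(200, 50)), columns=cols)

aggregates = pd.DataFrame({
    "Feature_sum": toy[cols].sum(axis=1),
    "Feature_mean": toy[cols].mean(axis=1),
    "Feature_std": toy[cols].std(axis=1),  # pandas uses the sample std (ddof=1)
})

# sum == 50 * mean for every row, so the two columns are perfectly correlated
sum_mean_corr = aggregates["Feature_sum"].corr(aggregates["Feature_mean"])
print(f"Correlation between sum and mean: {sum_mean_corr:.6f}")
```

For linear models, keeping only one of the two avoids a perfectly collinear pair. Tree-based models are not harmed by the redundancy, but they gain nothing from it either.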

In [None]:
num_cols = df.select_dtypes(include=np.number).columns.tolist()
num_cols.remove('AttritionFlag')  # exclude target

skew_values = df[num_cols].skew().sort_values(ascending=False)
print(skew_values.head(20))  # top 20 most skewed features

In [None]:
for col in ['CreditUsage', 'SpendIncomeRatio']:  # only skewed features
    min_val = df[col].min()
    if min_val < 0:
        # Shift so the minimum maps to 0; log1p already adds 1 before taking the log
        df[col + '_log'] = np.log1p(df[col] - min_val)
    else:
        df[col + '_log'] = np.log1p(df[col])

*Since `CreditUsage` and `SpendIncomeRatio` are highly skewed, we apply a log transformation (with a shift if necessary) to reduce skewness and make the distributions more suitable for modeling.*
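To confirm that a log transform actually helps, skewness can be compared before and after. The sketch below uses a hypothetical helper, `shifted_log1p`, which is one variant of a shifted log transform, and applies it to synthetic log-normal data. It illustrates the check and does not reproduce this dataset's values.

```python
import numpy as np
import pandas as pd


def shifted_log1p(series: pd.Series) -> pd.Series:
    """Apply log(1 + x), shifting first if the series contains negative values."""
    shift = min(series.min(), 0)  # 0 when all values are already non-negative
    return np.log1p(series - shift)


# Synthetic right-skewed data standing in for a ratio feature
rng = np.random.default_rng(1)
toy = pd.Series(rng.lognormal(mean=0, sigma=1, size=1000), name="CreditUsage")

skew_before = toy.skew()
skew_after = shifted_log1p(toy).skew()
print(f"Skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

If the skewness after the transform is not clearly closer to zero, the extra `_log` column probably adds little.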

In [None]:
df.shape

In [None]:
df.columns

In [None]:
df.head()

In [None]:
# Only numeric columns
numeric_cols = df.select_dtypes(include=np.number).columns.tolist()

# Compute Pearson correlation (the pandas default) with the target
corr_with_target = df[numeric_cols].corr()['AttritionFlag'].sort_values(ascending=False)

# Show top 10 positively and negatively correlated features
print("Top positive correlations:\n", corr_with_target.head(10))
print("\nTop negative correlations:\n", corr_with_target.tail(10))


In [None]:
# Spearman correlation with the target (numeric columns only)
corr_spearman = df.corr(method='spearman', numeric_only=True)['AttritionFlag'].sort_values(ascending=False)
print("Top positive correlations:\n", corr_spearman.head(10))
print("\nTop negative correlations:\n", corr_spearman.tail(10))

All numeric features, including the engineered ones, show very low correlations with the target (highest ~0.009, lowest ~-0.011).

Features with low correlation are still kept, as they may contribute in combination with others or capture non-linear relationships.

If churn is predictable from these features at all, it is likely driven by interactions or non-linear effects rather than by any single feature, so further feature engineering and non-linear models may be important.

### Create target-dependent features

*Split the dataset first to avoid data leakage, since these features involve the target variable `AttritionFlag`.*
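The split-then-encode order can be sketched as follows. The example uses scikit-learn's `train_test_split` with stratification and a simple mean target encoding learned from the training rows only. The `Segment` column, the 80/20 split, and the random seeds are hypothetical choices for illustration, not taken from this project.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data with a hypothetical categorical column "Segment"
rng = np.random.default_rng(7)
toy = pd.DataFrame({
    "Segment": rng.choice(["A", "B", "C"], size=300),
    "AttritionFlag": rng.integers(0, 2, size=300),
})

# 1) Split first, stratifying on the target to preserve the class balance
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["AttritionFlag"], random_state=42
)
train, test = train.copy(), test.copy()

# 2) Learn the encoding from the training rows only
global_rate = train["AttritionFlag"].mean()
segment_rate = train.groupby("Segment")["AttritionFlag"].mean()

# 3) Apply the learned mapping to both sets; categories unseen in training
#    fall back to the overall training churn rate
train["Segment_TE"] = train["Segment"].map(segment_rate)
test["Segment_TE"] = test["Segment"].map(segment_rate).fillna(global_rate)

print(test.head())
```

The test set never contributes to `segment_rate`, which is what prevents leakage. Within the training set, each row's encoding still includes its own target value, so out-of-fold encoding or smoothing is worth considering for small categories.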