# 📘 **EDA: Hardship & Settlement Analysis**

This notebook performs exploratory data analysis (EDA) on customer hardship and settlement data. The goal is to understand the structure, patterns, and behaviors associated with financial hardship events and debt settlement activities.

## 🎯 **Objectives of This Notebook**
- Explore the dataset containing combined hardship and settlement fields  
- Split the dataset into meaningful subsets for focused analysis  
- Clean and standardize the data using a reusable preprocessing function  
- Examine patterns in hardship events such as hardship types, durations, and loan impacts  
- Investigate settlement behaviors including settlement amounts, timing, and customer characteristics  
- Generate insights that may support risk assessment, customer segmentation, or policy strategies  

By the end of this notebook, we establish a clear understanding of the hardship and settlement dynamics within the portfolio, enabling deeper modeling or reporting work.


## 📊 **1. Loading the Hardship & Settlement Dataset**

In this step, the dataset containing hardship and settlement information is loaded from the CSV file. The initial shape of the dataset and the list of available columns are inspected to understand the structure before further exploration.

### ✔ Steps Performed
- Read the `hardship_settlement.csv` file into a pandas DataFrame  
- Display the dataset shape (rows, columns)  
- Preview the column names to understand variable availability  


In [None]:
import pandas as pd

df = pd.read_csv("hardship_settlement.csv")
df.shape, df.columns

## 🧩 **2. Splitting the Dataset into Hardship and Settlement Subsets**

Since the dataset includes both *hardship-related* and *settlement-related* fields, this section separates them into two dedicated DataFrames for cleaner and more focused analysis.

### ✔ Steps Performed
- Define column groups related to **hardship** and **settlement**  
- Create two new DataFrames (`hardship_df` and `settlement_df`) by selecting only the relevant columns  
- Display the first few rows of each subset for validation  

This separation allows targeted exploratory analysis for each domain.


In [None]:
hardship_cols = [
    "id",
    "hardship_flag",
    "hardship_type",
    "hardship_reason",
    "hardship_status",
    "deferral_term",
    "hardship_amount",
    "hardship_start_date",
    "hardship_end_date",
    "payment_plan_start_date",
    "hardship_length",
    "hardship_dpd",
    "hardship_loan_status",
    "orig_projected_additional_accrued_interest",
    "hardship_payoff_balance_amount",
    "hardship_last_payment_amount",
]

settlement_cols = [
    "id",
    "debt_settlement_flag",
    "debt_settlement_flag_date",
    "settlement_status",
    "settlement_date",
    "settlement_amount",
    "settlement_percentage",
    "settlement_term",
]

hardship_df = df[hardship_cols].copy()
settlement_df = df[settlement_cols].copy()

hardship_df.head(), settlement_df.head()
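
Selecting with `df[hardship_cols]` raises a `KeyError` if any listed column is absent from the CSV. A minimal sketch of a more defensive selection follows, assuming a hypothetical helper `select_existing` (not part of the original notebook) and a toy frame with made-up values standing in for `df`:

```python
import pandas as pd

def select_existing(frame, wanted):
    """Hypothetical helper: keep the wanted columns that exist and report the rest."""
    present = [c for c in wanted if c in frame.columns]
    missing = [c for c in wanted if c not in frame.columns]
    return frame[present].copy(), missing

# Toy data standing in for the real CSV (values are illustrative only)
toy = pd.DataFrame({
    "id": [1, 2],
    "hardship_flag": ["N", "Y"],
    "settlement_amount": [None, 500.0],
})

subset, missing = select_existing(toy, ["id", "hardship_flag", "hardship_type"])
print(list(subset.columns))  # columns that were found
print(missing)               # columns that were requested but not present
```

Reporting the missing names instead of failing keeps the split usable when a source extract omits a field, at the cost of needing a manual check of the `missing` list.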


## 🧼 **3. Creating a Reusable Cleaning Function**

A general-purpose cleaning function is defined to standardize the preprocessing applied to both hardship and settlement datasets. This ensures consistent formatting and prepares each dataset for deeper analysis.

### ✔ Cleaning Steps Applied
- Trim all column names to remove unintended whitespace  
- Strip whitespace in text columns and convert placeholder values (e.g., `"None"`, `"nan"`, `"NA"`) into proper `NaN`  
- Convert selected columns into numeric format  
- Convert date-like columns into `datetime`  
- Drop columns that are entirely empty  
- Remove duplicate rows to avoid redundant observations  

This function provides a clean and uniform structure, reducing noise and improving data reliability for downstream EDA tasks.


In [None]:
import numpy as np

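def clean_df(dataframe, numeric_cols=None, date_cols=None):
    """Return a cleaned copy: trimmed names and text, typed columns,
    no entirely empty columns, no duplicate rows."""
    df_clean = dataframe.copy()

    # Trim stray whitespace from column names
    df_clean.columns = df_clean.columns.str.strip()

    # Strip text values and map placeholder strings to NaN.
    # astype(str) turns real missing values into "nan"/"None",
    # which the replace step then maps back to NaN.
    obj_cols = df_clean.select_dtypes(include="object").columns
    for col in obj_cols:
        df_clean[col] = (
            df_clean[col]
            .astype(str)
            .str.strip()
            .replace({"": np.nan, "nan": np.nan, "None": np.nan, "NA": np.nan})
        )

    # Convert the requested columns to numeric; unparseable values become NaN
    if numeric_cols is not None:
        for col in numeric_cols:
            if col in df_clean.columns:
                df_clean[col] = pd.to_numeric(df_clean[col], errors="coerce")

    # Convert the requested columns to datetime; unparseable values become NaT
    if date_cols is not None:
        for col in date_cols:
            if col in df_clean.columns:
                df_clean[col] = pd.to_datetime(df_clean[col], errors="coerce")

    # Drop columns that contain no data at all
    df_clean = df_clean.dropna(axis=1, how="all")

    # Drop exact duplicate rows
    df_clean = df_clean.drop_duplicates()

    return df_clean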
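
To make the placeholder handling concrete, here is a small sketch of the same strip-and-replace step applied to a toy Series. The sample values are invented for illustration and are not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy column mixing real values, placeholder strings, and a true missing value
raw = pd.Series(["  Active ", "None", "nan", "", "NA", None, "Broken"], dtype="object")

# Same placeholder mapping as in clean_df
placeholders = {"": np.nan, "nan": np.nan, "None": np.nan, "NA": np.nan}
normalized = raw.astype(str).str.strip().replace(placeholders)

print(normalized.tolist())
print(int(normalized.isna().sum()))  # number of values treated as missing
```

One caveat of this approach is that a legitimate value spelled exactly `"NA"` or `"None"` would also be treated as missing, so the placeholder list should be checked against the actual category values.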


## 🧼 **4. Cleaning Hardship and Settlement DataFrames**

In this section, numeric and date columns for both the hardship and settlement datasets are specified, and the previously defined cleaning function is applied. This ensures standardized formatting, consistent data types, and removal of noise before deeper exploration.

### ✔ Steps Performed
- Identify numeric and date columns relevant to **hardship** records  
- Identify numeric and date columns relevant to **settlement** records  
- Apply the reusable `clean_df()` function separately to:
  - `hardship_df` → producing `hardship_clean`  
  - `settlement_df` → producing `settlement_clean`  
- Inspect the cleaned DataFrames using `.info()` to verify:
  - Correct data types  
  - Successful numeric/date conversions  
  - Removal of entirely empty columns and duplicate rows  

This step establishes a clean and reliable foundation for all subsequent EDA tasks.


In [None]:
hardship_numeric_cols = [
    "hardship_amount",
    "hardship_length",
    "hardship_dpd",
    "orig_projected_additional_accrued_interest",
    "hardship_payoff_balance_amount",
    "hardship_last_payment_amount",
]

hardship_date_cols = [
    "hardship_start_date",
    "hardship_end_date",
    "payment_plan_start_date",
]

settlement_numeric_cols = [
    "settlement_amount",
    "settlement_percentage",
    "settlement_term",
]

settlement_date_cols = [
    "debt_settlement_flag_date",
    "settlement_date",
]

hardship_clean = clean_df(
    hardship_df,
    numeric_cols=hardship_numeric_cols,
    date_cols=hardship_date_cols,
)

settlement_clean = clean_df(
    settlement_df,
    numeric_cols=settlement_numeric_cols,
    date_cols=settlement_date_cols,
)

hardship_clean.info()
settlement_clean.info()
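
Because `errors="coerce"` silently turns unparseable values into `NaN`, `.info()` alone cannot distinguish values that were missing in the source from values that failed conversion. A minimal sketch of one way to count the latter, assuming a hypothetical helper `coercion_losses` and invented sample values:

```python
import pandas as pd

def coercion_losses(raw: pd.Series) -> int:
    """Count values that are non-null in the raw column but NaN after numeric coercion."""
    converted = pd.to_numeric(raw, errors="coerce")
    return int((raw.notna() & converted.isna()).sum())

# Toy amounts: two parse cleanly, "n/a" and "1,200" do not, one is truly missing
toy_amount = pd.Series(["100.5", "250", "n/a", None, "1,200"])
lost = coercion_losses(toy_amount)
print(lost)
```

Running a check like this on the raw columns before cleaning would show whether a formatting issue, such as thousands separators, is causing avoidable data loss.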


## 🔤 **5. Standardizing Text Columns to Lowercase**

To ensure consistent formatting across textual fields, this step converts all string-based columns in both the hardship and settlement datasets to lowercase. This prevents mismatches during grouping, filtering, or merging operations caused by inconsistent capitalization.

### ✔ Steps Performed
- Identify all columns with `object` or `category` data types  
- Convert only non-null string values to lowercase (NaN values remain unchanged)  
- Apply the transformation to both:
  - `hardship_clean`  
  - `settlement_clean`  
- Display the updated data types to confirm the transformation  

This standardization step improves data quality and helps avoid case-sensitive inconsistencies in later analysis.


In [None]:
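def lowercase_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with string values in text columns lowercased."""
    df_copy = df.copy()

    text_cols = df_copy.select_dtypes(include=["object", "category"]).columns

    for col in text_cols:
        # Lowercase only real strings; NaN and non-string values pass through unchanged
        # (.str.lower() would turn non-string values in a mixed column into NaN)
        df_copy[col] = df_copy[col].map(lambda v: v.lower() if isinstance(v, str) else v)

    return df_copy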

hardship_clean = lowercase_text_columns(hardship_clean)
settlement_clean = lowercase_text_columns(settlement_clean)

hardship_clean.dtypes, settlement_clean.dtypes
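
The effect of case standardization on grouping can be illustrated with a toy example. The status labels here are invented for illustration:

```python
import pandas as pd

# Toy status column with inconsistent capitalization and one missing value
status = pd.Series(["ACTIVE", "Active", "active", None, "BROKEN"])

before = status.value_counts(dropna=False)
after = status.str.lower().value_counts(dropna=False)

print(len(before))      # distinct groups before lowercasing
print(len(after))       # distinct groups after lowercasing
print(after["active"])  # the three spellings collapse into one group
```

The missing value is still counted as its own group in both cases, consistent with NaN values remaining unchanged.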


## 💾 **6. Exporting Cleaned Hardship and Settlement Data**

After completing all preprocessing and standardization steps, the cleaned datasets are exported as separate CSV files. These outputs will be used for downstream analysis or modeling tasks.

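### ✔ Steps Performed
- Write `hardship_clean` to `hardship_clean.csv`  
- Write `settlement_clean` to `settlement_clean.csv`  
- Exclude the index (`index=False`) for a clean tabular structure  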

This step finalizes the data preparation workflow and provides clean, ready-to-use datasets for further exploration.


In [None]:
hardship_clean.to_csv("hardship_clean.csv", index=False)
settlement_clean.to_csv("settlement_clean.csv", index=False)
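
CSV does not store data types, so date columns come back as text when the exported files are reloaded. The sketch below shows the round trip on a toy frame, using an in-memory buffer instead of a file. The values are invented, and `parse_dates` is the standard `pd.read_csv` option for restoring datetimes:

```python
import io
import pandas as pd

# Toy frame with one datetime column (illustrative values only)
toy = pd.DataFrame({
    "id": [1, 2],
    "settlement_date": pd.to_datetime(["2019-01-15", "2019-03-01"]),
    "settlement_amount": [500.0, 1250.0],
})

# Write to an in-memory buffer rather than to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reload once without and once with explicit date parsing
buffer.seek(0)
plain = pd.read_csv(buffer)
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["settlement_date"])

print(plain["settlement_date"].dtype)   # text, not datetime
print(parsed["settlement_date"].dtype)  # datetime restored
```

If preserving types matters for downstream work, a typed format such as Parquet is a possible alternative, though it requires an additional engine to be installed.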

## 🔎 **7. Running a Comprehensive EDA Summary Function**

This section defines a reusable exploratory analysis function that provides a structured overview of any given DataFrame. The goal is to quickly understand dataset composition, data quality, and variable characteristics without writing repetitive code.

### ✔ Key Features of the Function
- Display total number of rows and columns  
- Show data types of all variables  
- Identify missing values with counts and ratios  
- Report the number of unique values for the 20 highest-cardinality columns  
- Separate numerical and categorical columns  
- Provide summary statistics for numerical columns  
- Display the three most frequent values for up to the first 10 categorical columns  

### ✔ Why This Is Useful
- Helps validate the integrity and structure of both **hardship** and **settlement** datasets  
- Quickly reveals data issues such as high missing rates, inconsistent categories, or extreme cardinality  
- Enables efficient comparison across datasets during EDA  
- Provides a standardized diagnostic tool for future notebooks  


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
sns.set(style="whitegrid")

def basic_eda(df: pd.DataFrame, name: str = "df"):
    print(f"====== {name} – General Information ======")
    print(f"Number of rows: {df.shape[0]}")
    print(f"Number of columns: {df.shape[1]}")
    print("\nData types:")
    print(df.dtypes)

    # Missing value analysis
    print("\nMissing values (count and ratio):")
    missing = df.isna().sum()
    missing = missing[missing > 0].sort_values(ascending=False)
    if missing.empty:
        print("No missing values.")
    else:
        missing_ratio = (missing / len(df)).round(4)
        missing_df = pd.DataFrame({
            "missing_count": missing,
            "missing_ratio": missing_ratio
        })
        print(missing_df)

    # Unique value counts
    print("\nNumber of unique values per column (top 20):")
    nunique = df.nunique().sort_values(ascending=False).head(20)
    print(nunique)

    # Numerical and categorical column counts
    num_cols = df.select_dtypes(include=["int64", "float64", "Int64", "Float64"]).columns.tolist()
    cat_cols = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
    print(f"\nNumber of numerical columns: {len(num_cols)}")
    print(f"Number of categorical columns: {len(cat_cols)}")

    # Summary statistics for numerical columns
    if len(num_cols) > 0:
        print("\nSummary statistics for numerical columns:")
        print(df[num_cols].describe().T)

    # Most frequent categories for categorical columns
    if len(cat_cols) > 0:
        print("\nMost frequent values for categorical columns (top 3 categories):")
        for col in cat_cols[:10]:  # if many columns exist, show only the first 10
            print(f"\nColumn: {col}")
            print(df[col].value_counts(dropna=False).head(3))


## 📈 **8. Visualizing Numerical Distributions**

This section introduces a helper function that visualizes the distribution of numerical features using both histograms and boxplots. These visual tools make it easier to identify outliers, detect skewness, and understand the spread of numerical variables.

### ✔ Steps Performed
- Automatically detect numerical columns in the dataset  
- Display a message if no numerical fields are present  
- For each of the first `max_cols` numerical columns:
  - Plot a histogram with a KDE curve to show the distribution shape  
  - Plot a boxplot to highlight outliers and distribution spread  
- Limit the number of plotted columns using the `max_cols` parameter to keep visual output manageable  

This function enhances exploratory analysis by providing quick and consistent visual summaries of numerical features.


In [None]:
def plot_numeric_distributions(df: pd.DataFrame, name: str = "df", max_cols: int = 6):
    # Identify numeric columns
    num_cols = df.select_dtypes(include=["int64", "float64", "Int64", "Float64"]).columns.tolist()
    if len(num_cols) == 0:
        print(f"No numeric columns found in {name}.")
        return

    print(f"{name} – Numerical column distributions (first {max_cols} columns):")

    for col in num_cols[:max_cols]:
        fig, axes = plt.subplots(1, 2, figsize=(12, 4))

        # Histogram
        sns.histplot(df[col].dropna(), kde=True, ax=axes[0])
        axes[0].set_title(f"{col} – Histogram")

        # Boxplot
        sns.boxplot(x=df[col], ax=axes[1])
        axes[1].set_title(f"{col} – Boxplot")

        plt.tight_layout()
        plt.show()
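
The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule by default. A minimal sketch that counts such points numerically, assuming a hypothetical helper `iqr_outlier_count` and invented sample values:

```python
import pandas as pd

def iqr_outlier_count(values: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing values."""
    clean = values.dropna()
    q1, q3 = clean.quantile(0.25), clean.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return int(((clean < lower) | (clean > upper)).sum())

# Toy data: six values close together, one extreme value, one missing
toy = pd.Series([10, 12, 11, 13, 12, 11, 95, None])
n_outliers = iqr_outlier_count(toy)
print(n_outliers)
```

A count like this can complement the plots when many columns need to be compared at once, since it gives one number per column instead of one figure.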


## 🧩 **9. Visualizing Categorical Feature Distributions**

This section introduces a helper function designed to visualize the frequency distribution of categorical variables. These bar charts make it easy to identify dominant categories, rare labels, and potential inconsistencies within the dataset.

### ✔ Steps Performed
- Automatically detect categorical fields (object, category, boolean)  
- Display a message if the dataset contains no categorical variables  
- For each of the first `max_cols` categorical columns:
  - Compute the top 10 most frequent categories  
  - Plot a bar chart showing category counts  
  - Rotate labels for improved readability  

This function provides a quick and interpretable overview of categorical data, supporting downstream segmentation, cleaning, or feature engineering tasks.


In [None]:
def plot_categorical_distributions(df: pd.DataFrame, name: str = "df", max_cols: int = 6):
    # Identify categorical columns
    cat_cols = df.select_dtypes(include=["object", "category", "bool"]).columns.tolist()
    if len(cat_cols) == 0:
        print(f"No categorical columns found in {name}.")
        return

    print(f"{name} – Categorical column distributions (first {max_cols} columns):")

    for col in cat_cols[:max_cols]:
        vc = df[col].value_counts(dropna=False).head(10)  # top 10 most frequent categories

        plt.figure(figsize=(10, 4))
        sns.barplot(x=vc.index.astype(str), y=vc.values)
        plt.xticks(rotation=45, ha="right")
        plt.title(f"{col} – Category Frequencies (Top 10)")
        plt.ylabel("Count")
        plt.xlabel(col)
        plt.tight_layout()
        plt.show()


## 🔗 **10. Correlation Analysis of Numerical Features**

This section provides a helper function that computes and visualizes the correlation matrix for all numerical fields within a dataset. By examining pairwise correlations, we can detect multicollinearity, strong linear relationships, or unexpected variable interactions.

### ✔ Steps Performed
- Automatically identify all numerical columns  
- Display a message if fewer than two numerical fields are available  
- Compute the correlation matrix using Pearson correlation  
- Visualize the matrix using a heatmap with a diverging color palette  
- Center the color scale at zero to highlight positive and negative correlations  

This visualization supports feature selection, risk analysis, and deeper understanding of variable dependencies.


In [None]:
def plot_correlation(df: pd.DataFrame, name: str = "df"):
    # Identify numeric columns
    num_cols = df.select_dtypes(include=["int64", "float64", "Int64", "Float64"]).columns.tolist()

    # Check if correlation can be computed
    if len(num_cols) < 2:
        print(f"Not enough numeric columns to compute correlation for {name}.")
        return

    # Compute correlation matrix
    corr = df[num_cols].corr()

    # Plot heatmap
    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=False, cmap="coolwarm", center=0)
    plt.title(f"{name} – Numerical Feature Correlation Matrix")
    plt.tight_layout()
    plt.show()
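
Since the heatmap is drawn without annotations, it can help to list the strongest pairs numerically. A minimal sketch follows, assuming a hypothetical helper `top_correlated_pairs` and a toy frame with invented values:

```python
import numpy as np
import pandas as pd

def top_correlated_pairs(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (self-correlation) is excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by absolute value so strong negative correlations are not missed
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data: b is exactly 2 * a, c moves mostly in the opposite direction
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 3.0, 4.0, 1.0, 2.0],
})
top = top_correlated_pairs(toy, n=2)
print(top)
```

Identifier-like columns such as `id` are numeric and would appear in this ranking too, so they may need to be dropped before interpreting the result.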


## 🧪 **11. Running Complete EDA on Hardship and Settlement Data**

In this step, the full exploratory analysis workflow is executed for both the hardship and settlement datasets. This includes structural inspection, distribution analysis, categorical exploration, and correlation evaluation.

### ✔ Steps Performed
- Run the `basic_eda()` function to review:
  - Dataset structure  
  - Missing values  
  - Unique value counts  
  - Numerical and categorical summaries  
- Visualize numerical feature distributions using `plot_numeric_distributions()`  
- Visualize categorical feature distributions using `plot_categorical_distributions()`  
- Analyze correlations between numerical variables with `plot_correlation()`  
- Perform all analyses separately for:
  - **hardship_clean**
  - **settlement_clean**

This step consolidates all previously defined EDA tools into a complete diagnostic review of both datasets.

In [None]:
# If needed, load cleaned datasets
# (a plain read_csv reload returns date columns as text; pass parse_dates=[...] to restore them)
# hardship_clean = pd.read_csv("hardship_clean.csv")
# settlement_clean = pd.read_csv("settlement_clean.csv")

# Run full EDA summary
basic_eda(hardship_clean, "hardship_clean")
basic_eda(settlement_clean, "settlement_clean")

# Numerical distributions
plot_numeric_distributions(hardship_clean, "hardship_clean", max_cols=6)
plot_numeric_distributions(settlement_clean, "settlement_clean", max_cols=6)

# Categorical distributions
plot_categorical_distributions(hardship_clean, "hardship_clean", max_cols=6)
plot_categorical_distributions(settlement_clean, "settlement_clean", max_cols=6)

# Correlation analysis
plot_correlation(hardship_clean, "hardship_clean")
plot_correlation(settlement_clean, "settlement_clean")