# Credit Risk Scorecard Demo

## Before you begin
To use the ValidMind Developer Framework with a Jupyter notebook, you need to install and initialize the client library first, along with getting your Python environment ready.

If you don't already have one, you should also create a documentation project on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the client library

In [1]:
# %pip install --upgrade validmind

## Initialize the client library
In a browser, go to the Client Integration page of your documentation project and click Copy to clipboard next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

This step requires a documentation project. Learn how you can create one.

Next, replace this placeholder with your own code snippet:

In [2]:
import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "clk00h0u800x9qjy67gduf5om"
)

2023-08-09 15:56:18,021 - INFO(validmind.api_client): Connected to ValidMind. Project: [6] Credit Risk Scorecard - Initial Validation (clk00h0u800x9qjy67gduf5om)


## Setup

#### Introduction

The **Credit risk Scorecard** model created from the Lending Club dataset is instrumental in computing the Probability of Default (PD), a key factor in ECL calculations. This scorecard assesses several credit characteristics of potential borrowers, like their credit history, income, outstanding debts, and more, each of which is assigned a specific score. By combining these scores, we derive a total score for each borrower, which translates into an estimated Point-in-Time (PiT) PD. The PiT PD reflects the borrower's likelihood of default at a specific point in time, accounting for both current and foreseeable future conditions.

Additionally, for a holistic view of credit risk, it's essential to estimate the Lifetime PD. The Lifetime PD, as the name suggests, predicts the borrower's likelihood of default throughout the life of the exposure, taking into account potential future changes in the economic and financial conditions.

#### Import Libraries

In [3]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

# Standard library imports
import re
import pickle
from datetime import datetime
from typing import List

# Data handling and analysis imports
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import statsmodels.api as sm
import inspect

# Visualization imports
%matplotlib inline

# Scorecard development
import scorecardpy as sc

#### Helper Functions

In [4]:
def save_model(model, df, base_filename):
    """Save a model and a dataframe with a timestamp in the filename"""
    # Get current date and time
    now = datetime.datetime.now()

    # Convert the current date and time to string
    timestamp_str = now.strftime("%Y%m%d_%H%M%S")

    filename = f'{base_filename}_{timestamp_str}.pkl'

    # Save the model and dataframe
    with open(filename, 'wb') as file:
        pickle.dump((model, df), file)
        
    print(f"Model and dataframe saved as {filename}")

def get_numerical_columns(df):
        numerical_columns = df.select_dtypes(
            include=["int", "float", "uint"]
        ).columns.tolist()
        return numerical_columns

def get_categorical_columns(df):
        categorical_columns = df.select_dtypes(
            include=["object", "category"]
        ).columns.tolist()
        return categorical_columns

def compute_outliers(series, threshold=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    return series[(series < lower_bound) | (series > upper_bound)]

def transform_woe_df(woe_df):
    # Select and rename columns
    transformed_df = woe_df[['variable', 'bin', 'count', 'count_distr', 'good', 'bad', 'badprob', 'woe', 'bin_iv', 'total_iv']].copy()
    transformed_df.rename(columns={
        'bin_iv': 'total_iv'
    }, inplace=True)
    
    # Create 'is_special_values' column (assuming there are no special values)
    transformed_df['is_special_values'] = False
    
    # Transform 'bin' column into interval format and store it in 'breaks' column
    transformed_df['breaks'] = transformed_df['bin'].apply(lambda x: '[-inf, %s)' % x if isinstance(x, float) else '[%s, inf)' % x)
    
    # Group by 'variable' to create bins dictionary
    bins = {}
    for variable, group in transformed_df.groupby('variable'):
        bins[variable] = group
    
    return bins

def get_features_with_min_missing(df, min_missing_percentage):
    # Calculate the percentage of missing values in each column
    missing_percentages = df.isnull().mean() * 100

    # Get the variables where the percentage of missing values is greater than the specified minimum
    variables_to_drop = missing_percentages[missing_percentages > min_missing_percentage].index.tolist()

    # Also add any columns where all values are missing
    variables_to_drop.extend(df.columns[df.isnull().all()].tolist())

    # Remove duplicates (if any)
    variables_to_drop = list(set(variables_to_drop))

    return variables_to_drop

#### Developer Tasks

In [5]:
def import_raw_data(source): 
    print("Importing raw data from:", source)
    df_out = pd.read_csv(source)
    print(f"Data imported successfully with {df_out.shape[0]} rows and {df_out.shape[1]} columns.")
    return df_out

def drop_features(df, to_drop):
    df_out = df.copy()

    # Before dropping
    initial_cols = df_out.shape[1]
    
    df_out.drop(columns=to_drop, axis=1, inplace=True)

    # After dropping
    after_drop_cols = df_out.shape[1]
    
    print(f"Dropped {initial_cols - after_drop_cols} columns.")
    print(f"Columns remaining after dropping: {after_drop_cols}")

    return df_out 

def add_default_definition(df, default_column):
    
    # Check if 'loan_status' is in the DataFrame
    if 'loan_status' not in df.columns:
        raise ValueError("'loan_status' column not found in the DataFrame.")

    print("Converting 'loan_status' to target column...")
    # Assuming the column name is 'loan_status'
    df[default_column] = df['loan_status'].apply(lambda x: 0 if x == "Fully Paid" else 1 if x == "Charged Off" else np.nan)

    initial_row_count = df.shape[0]
    # Remove rows where the target column is NaN
    df = df.dropna(subset=[default_column])
    removed_rows = initial_row_count - df.shape[0]
    print(f"Removed {removed_rows} rows with undefined 'loan_status' values.")

    # Convert target column to integer
    df[default_column] = df[default_column].astype(int)
    print(f"Converted 'loan_status' to '{default_column}' and set its data type to integer.")
    
    # Remove the 'loan_status' column from the DataFrame
    df.drop(columns=['loan_status'], inplace=True)
    print("'loan_status' column has been removed from the DataFrame.")
    
    return df

def convert_term_column(df):
    """
    Function to remove 'months' string from the 'term' column and convert it to categorical
    """
    
    column = "term"
    
    # Ensure the column exists in the dataframe
    if column not in df.columns:
        raise ValueError(f"The column '{column}' does not exist in the dataframe.")
    
    df[column] = df[column].str.replace(' months', '')
    
    # Convert to categorical
    df[column] = df[column].astype('object')

    return df

def convert_emp_length_column(df):
    """
    Function to clean 'emp_length' column and convert it to categorical.
    """
    
    column = "emp_length"
    
    # Ensure the column exists in the dataframe
    if column not in df.columns:
        raise ValueError(f"The column '{column}' does not exist in the dataframe.")
    
    df[column] = df[column].replace('n/a', np.nan)
    df[column] = df[column].str.replace('< 1 year', str(0))
    df[column] = df[column].apply(lambda x: re.sub('\D', '', str(x)))
    df[column].fillna(value = 0, inplace=True)

    # Convert to categorical
    df[column] = df[column].astype('object')

    return df 

def convert_inq_last_6mths_column(df):
    """
    Function to convert 'inq_last_6mths' column into categorical.
    """
    column = "inq_last_6mths"

    # Ensure the column exists in the dataframe
    if column not in df.columns:
        raise ValueError(f"The column '{column}' does not exist in the dataframe.")

    # Convert to categorical
    df[column] = df[column].astype('category')

    return df


def remove_iqr_outliers(df, target_column, threshold=1.5):
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    num_cols.remove(target_column)  # Exclude target_column from numerical columns
    for col in num_cols:
        outliers = compute_outliers(df[col], threshold)
        df = df[~df[col].isin(outliers)]
    return df


def data_split(df, target_column):
    df_out = df.copy()

    # Split data into train and test 
    X = df_out.drop(target_column, axis=1)
    y = df_out[target_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

    print(f"Training data has {X_train.shape[0]} rows and {X_train.shape[1]} columns.")
    print(f"Test data has {X_test.shape[0]} rows and {X_test.shape[1]} columns.")

    # Concatenate X_train with y_train to form df_train
    df_train = pd.concat([X_train, y_train], axis=1)

    # Concatenate X_test with y_test to form df_test
    df_test = pd.concat([X_test, y_test], axis=1)

    return df_train, df_test


def remove_features_missing_values(df, min_missing_percentage):
    """Drop columns with missing values exceeding a certain percentage."""
    
    def get_features_with_min_missing(data, threshold_percentage):
        """Get features with missing values above the given threshold."""
        missing_percent = data.isnull().mean() * 100
        return missing_percent[missing_percent > threshold_percentage].index.tolist()

    print("Analyzing missing values in the dataset...")
    vars_to_drop = get_features_with_min_missing(df, min_missing_percentage)
    
    if vars_to_drop:
        print(f"Found {len(vars_to_drop)} features with more than {min_missing_percentage}% missing values.")
        print("Dropping the following columns:", ', '.join(vars_to_drop))
        return df.drop(columns=vars_to_drop)
    else:
        print(f"No features found with more than {min_missing_percentage}% missing values.")
        return df


def drop_categories(df):
    df_out = df.copy()

    # Initial count
    initial_count = df_out.shape[0]

    # Select rows where purpose is 'debt_consolidation' or 'credit_card'
    df_out = df_out[df_out['purpose'].isin(['debt_consolidation', 'credit_card'])]
    print(f"Rows retained with purpose 'debt_consolidation' or 'credit_card': {df_out.shape[0]}")

    # Remove rows where grade is 'F' or 'G'
    df_out = df_out[~df_out['grade'].isin(['F', 'G'])]
    print(f"Rows after removing grades 'F' or 'G': {df_out.shape[0]}")

    # Remove rows where sub_grade starts with 'F' or 'G'
    df_out = df_out[~df_out['sub_grade'].str.startswith(('F', 'G'))]
    print(f"Rows after removing sub_grades starting with 'F' or 'G': {df_out.shape[0]}")

    # Remove rows where home_ownership is 'OTHER', 'NONE', or 'ANY'
    df_out = df_out[~df_out['home_ownership'].isin(['OTHER', 'NONE', 'ANY'])]
    print(f"Rows after removing home_ownership values 'OTHER', 'NONE', or 'ANY': {df_out.shape[0]}")

    print(f"Total rows dropped: {initial_count - df_out.shape[0]}")

    return df_out


def convert_to_woe(df, woe_df, target_col):
    df_out = df.copy()
    
    # Placeholder for the transformation function - you'll need to define or import it
    bins = transform_woe_df(woe_df)
    
    # Print how many features are getting transformed
    print(f"Converting {len(bins)} features to WoE values.")
    
    # Make sure we don't transform the target column
    if target_col in bins:
        del bins[target_col]
        print(f"Excluded {target_col} from WoE transformation.")
    
    # Apply the WoE transformation
    df_out = sc.woebin_ply(df_out, bins=bins)
    
    print(f"Successfully converted features to WoE values.")

    return df_out

def add_constant(df):
    df_out = df.copy()

    # Before adding constant
    initial_cols = df_out.shape[1]

    # Add constant
    df_out = sm.add_constant(df_out)

    # After adding constant
    after_add_cols = df_out.shape[1]

    print(f"Added constant to dataframe. Number of columns went from {initial_cols} to {after_add_cols}.")
    
    return df_out

def train_model(df, target_column):
    
    # Ensure that the target column is in the DataFrame
    if target_column not in df.columns:
        raise ValueError(f"'{target_column}' not found in DataFrame.")

    # Get X (features) and y (target) from df
    X = df.drop(target_column, axis=1)  # Drop the target column to get features
    y = df[target_column]

    # Define the model
    model = sm.GLM(y, X, family=sm.families.Binomial())

    print(f"Training the model with {X.shape[1]} features and {X.shape[0]} data points.")

    # Fit the model
    model_fit = model.fit()

    print("Model trained successfully.")

    return model_fit

#### Developer Class

In [6]:
import datetime
import inspect
import re
import pandas as pd
import logging
from IPython.core.display import display, HTML

# Set up the logging configuration
logging.basicConfig(level=logging.INFO, format='INFO: %(message)s')

class Developer:
    def __init__(self):
        self.tasks_log = []
        self.tasks_details = []
        self.validation_log = []  # Log for validation tests
        self.tasks = {}  # Dictionary to store tasks

    def add_task(self, task_id, task):
        """Register a task."""
        if task_id in self.tasks:
            raise ValueError(f"Task ID '{task_id}' already exists!")
        self.tasks[task_id] = {'task': task}
        return task_id

    def get_caller_info(self, frame):
        """Fetch the calling line of code and the variable names."""
        code_context = inspect.getframeinfo(frame).code_context
        line_of_code = code_context[0].strip() if code_context else ""
        input_vars = {id(var): name for name, var in frame.f_locals.items()}
        return line_of_code, input_vars

    def get_task(self, task_id):
        """Retrieve task entry based on the task ID."""
        task_entry = self.tasks.get(task_id)
        if not task_entry:
            raise ValueError(f"No task found for ID {task_id}")
        return task_entry

    def execute_task(self, task_id, inputs=None, area_id=None, validation_tests=None):
        if inputs is None:
            inputs = []
        
        logging.info(f"Executing task '{task_id}'...\n")
        
        frame = inspect.currentframe().f_back
        line_of_code, input_vars = self.get_caller_info(frame)
        input_var_names = [input_vars.get(id(inp), "N/A") for inp in inputs]
        task_entry = self.get_task(task_id)
        
        result = task_entry['task'](*inputs)
        
        # Extract the variable name to which the result is assigned
        output_match = re.search(r'^\s*([\w\s,]+?)\s*=', line_of_code)
        output_var_name = output_match.group(1).replace(" ", "") if output_match else "N/A"
        
        start_time = datetime.datetime.now()
        end_time = datetime.datetime.now()
        duration = (end_time - start_time).seconds
        
        # Log the task details internally
        self.tasks_log.append(task_id)
        self.tasks_details.append({
            'Time': start_time.strftime('%Y-%m-%d %H:%M:%S'),
            'Area ID': area_id,
            'Task ID': task_id,
            'Input': ", ".join(input_var_names),
            'Output': output_var_name,
            'Duration': f"{duration} seconds"
        })

        # Log the validation tests
        self.validation_log.append({
            'Area ID': area_id,
            'Task ID': task_id,
            'Input': ", ".join(input_var_names),
            'Output': output_var_name,
            'Validation Tests': ", ".join(validation_tests) if validation_tests else "N/A"
        })

        return result

    def show_validation_plan(self):
        """Return the validation plan details in a tabular format."""
        df = pd.DataFrame(self.validation_log)

        # Use HTML line breaks for Jupyter Notebook rendering
        separator = "<br>"
        df['Validation Tests'] = df['Validation Tests'].apply(lambda x: separator.join(x.split(", ")) if x != "none" else "none")

        # Replace "N/A" with "none"
        df.replace({"N/A": "none"}, inplace=True)
        
        return df


    
    def show_lifecycle(self):
        """Display the model lifecycle details in a tabular format."""
        df = pd.DataFrame(self.tasks_details)
        return df



#### Model Development Parameters

In [7]:
default_column = "default"

lending_club_url = "https://vmai.s3.us-west-1.amazonaws.com/datasets/lending_club_loan_data_2007_2014.csv"

preliminary_features_to_drop = [
    "id", "member_id", "funded_amnt", "emp_title", "url", "desc", "application_type",
    "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record",
    "revol_bal", "total_rec_prncp", "total_rec_late_fee", "recoveries", "out_prncp_inv", "out_prncp", 
    "collection_recovery_fee", "next_pymnt_d", "initial_list_status", "pub_rec",
    "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "pymnt_plan",
    "tot_coll_amt", "tot_cur_bal", "total_rev_hi_lim", "last_pymnt_d", "last_credit_pull_d",
    'earliest_cr_line', 'issue_d']

final_features_to_drop = ['addr_state', 'total_rec_int', 'loan_amnt',
                    'funded_amnt_inv', 'dti', 'revol_util', 'total_pymnt', 
                    'total_pymnt_inv', 'last_pymnt_amnt', "inq_last_6mths"]

min_missing_percentage = 80

#### Register Developer Tasks

In [8]:
# Instantiate the Developer class
developer = Developer()

# Register developer tasks
developer.add_task(
    task_id="import_raw_data", 
    task=import_raw_data,
)

developer.add_task(
    task_id="drop_features",
    task=drop_features,  
)

developer.add_task(
    task_id="add_default_definition",
    task=add_default_definition,  
)

developer.add_task(
    task_id="convert_term_column",
    task=convert_term_column,  
)

developer.add_task(
    task_id="convert_emp_length_column",
    task=convert_emp_length_column,  
)

developer.add_task(
    task_id="convert_inq_last_6mths_column",
    task=convert_inq_last_6mths_column,  
)

developer.add_task(
    task_id="data_split",
    task=data_split,  
)

developer.add_task(
    task_id="drop_categories",
    task=drop_categories,  
)

developer.add_task(
    task_id="convert_to_woe",
    task=convert_to_woe,  
)

developer.add_task(
    task_id="add_constant",
    task=add_constant,  
)

developer.add_task(
    task_id="train_model",
    task=train_model,  
)

developer.add_task(
    task_id="remove_features_missing_values",
    task=remove_features_missing_values,  
)

'remove_features_missing_values'

## Model development

In [9]:
df_1 = developer.execute_task(
    area_id = "data_description",
    task_id = "import_raw_data", 
    inputs = [lending_club_url],
    validation_tests = ["descriptive_statistics", "missing_values_bar_plot"]
)

INFO: Executing task 'import_raw_data'...



Importing raw data from: https://vmai.s3.us-west-1.amazonaws.com/datasets/lending_club_loan_data_2007_2014.csv


  df_out = pd.read_csv(source)


Data imported successfully with 466285 rows and 75 columns.


In [10]:
df_2 = developer.execute_task(
    area_id = "data_preparation",
    task_id = "drop_features", 
    inputs = [df_1, preliminary_features_to_drop],
    validation_tests = []
)

INFO: Executing task 'drop_features'...



Dropped 33 columns.
Columns remaining after dropping: 42


In [11]:
df_3 = developer.execute_task(
    area_id = "data_preparation",
    task_id = "add_default_definition", 
    inputs = [df_2, default_column],
    validation_tests = ["missing_values_bar_plot",
                        "class_imbalance", 
                        "iqr_outliers_table"]
)

INFO: Executing task 'add_default_definition'...



Converting 'loan_status' to target column...
Removed 239071 rows with undefined 'loan_status' values.
Converted 'loan_status' to 'default' and set its data type to integer.
'loan_status' column has been removed from the DataFrame.


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[default_column] = df[default_column].astype(int)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.drop(columns=['loan_status'], inplace=True)


In [12]:
df_4 = developer.execute_task(
    area_id="data_preparation",
    task_id="remove_features_missing_values", 
    inputs=[df_3, min_missing_percentage],
    validation_tests=["missing_values_bar_plot"]
)

INFO: Executing task 'remove_features_missing_values'...



Analyzing missing values in the dataset...
Found 18 features with more than 80% missing values.
Dropping the following columns: mths_since_last_major_derog, annual_inc_joint, dti_joint, verification_status_joint, open_acc_6m, open_il_6m, open_il_12m, open_il_24m, mths_since_rcnt_il, total_bal_il, il_util, open_rv_12m, open_rv_24m, max_bal_bc, all_util, inq_fi, total_cu_tl, inq_last_12m


In [13]:
df_5 = developer.execute_task(
    area_id="data_preparation",
    task_id="convert_term_column", 
    inputs=[df_4]
)

df_6 = developer.execute_task(
    area_id="data_preparation",
    task_id="convert_emp_length_column", 
    inputs=[df_5]
)

df_7 = developer.execute_task(
    area_id="data_preparation",
    task_id="convert_inq_last_6mths_column", 
    inputs=[df_6]
)

INFO: Executing task 'convert_term_column'...

INFO: Executing task 'convert_emp_length_column'...

INFO: Executing task 'convert_inq_last_6mths_column'...



In [14]:
df_train_1, df_test_1 = developer.execute_task(
    area_id="data_sampling",
    task_id="data_split", 
    inputs=[df_7, default_column],
    validation_tests=["tabular_numerical_histograms", 
                      "high_cardinality", 
                      "tabular_categorical_bar_plots"]
)

INFO: Executing task 'data_split'...



Training data has 181771 rows and 23 columns.
Test data has 45443 rows and 23 columns.


In [15]:
df_train_2 = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_categories", 
    inputs=[df_train_1],
    validation_tests=["target_rate_bar_plots"]
)

df_test_2 = drop_categories(df_test_1)

INFO: Executing task 'drop_categories'...



Rows retained with purpose 'debt_consolidation' or 'credit_card': 142293
Rows after removing grades 'F' or 'G': 137816
Rows after removing sub_grades starting with 'F' or 'G': 137816
Rows after removing home_ownership values 'OTHER', 'NONE', or 'ANY': 137723
Total rows dropped: 44048
Rows retained with purpose 'debt_consolidation' or 'credit_card': 35532
Rows after removing grades 'F' or 'G': 34349
Rows after removing sub_grades starting with 'F' or 'G': 34349
Rows after removing home_ownership values 'OTHER', 'NONE', or 'ANY': 34322
Total rows dropped: 11121


In [16]:
df_train_3 = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_features", 
    inputs=[df_train_2, final_features_to_drop],
    validation_tests=["chi_squared_features_table", 
                      "anova_one_way_table", 
                      "pearson_correlation_matrix", 
                      "feature_target_correlation_plot",
                      "woe_bin_table",
                      "woe_bin_table",   # with different parameters
                      "woe_bin_plots"]
)

df_test_3 = drop_features(df_test_2, final_features_to_drop)

INFO: Executing task 'drop_features'...



Dropped 10 columns.
Columns remaining after dropping: 14
Dropped 10 columns.
Columns remaining after dropping: 14


In [17]:
from validmind.vm_models.test_context import TestContext
from validmind.tests.data_validation.WOEBinTable import WOEBinTable

params = {
    "breaks_adj": {
        "int_rate": [5,10,15]}  
     }

vm_df = vm.init_dataset(dataset=df_train_3, target_column=default_column)
test_context = TestContext(dataset=vm_df)

metric = WOEBinTable(test_context, params=params)
metric.run()
woe_dic = metric.result.metric.value['woe_iv']
woe_df = pd.DataFrame(woe_dic)

2023-08-09 15:57:14,626 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...


Running with breaks_adj: {'int_rate': [5, 10, 15]}
Performing binning with breaks_adj: {'int_rate': [5, 10, 15]}
[INFO] creating woe binning ...


 (ColumnNames: emp_length)


In [18]:
df_train_4 = developer.execute_task(
    area_id="feature_engineering",
    task_id="convert_to_woe", 
    inputs=[df_train_3, woe_df, default_column],
)     

df_test_4 = convert_to_woe(df_test_3, woe_df, default_column)

INFO: Executing task 'convert_to_woe'...



Converting 13 features to WoE values.
[INFO] converting into woe values ...


 (ColumnNames: emp_length)


Successfully converted features to WoE values.
Converting 13 features to WoE values.
[INFO] converting into woe values ...


 (ColumnNames: emp_length)


Successfully converted features to WoE values.


In [19]:
df_train_5 = developer.execute_task(
    area_id="model_training",
    task_id="add_constant", 
    inputs=[df_train_4]
)

df_test_5 = add_constant(df_test_4)

INFO: Executing task 'add_constant'...



Added constant to dataframe. Number of columns went from 14 to 15.
Added constant to dataframe. Number of columns went from 14 to 15.


In [20]:
model_fit_1 = developer.execute_task(
    area_id="model_training",
    task_id="train_model", 
    inputs=[df_train_5, default_column]
)

print(model_fit_1.summary())

INFO: Executing task 'train_model'...



Training the model with 14 features and 137723 data points.
Model trained successfully.
                 Generalized Linear Model Regression Results                  
Dep. Variable:                default   No. Observations:               137723
Model:                            GLM   Df Residuals:                   137709
Model Family:                Binomial   Df Model:                           13
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -59975.
Date:                Wed, 09 Aug 2023   Deviance:                   1.1995e+05
Time:                        15:57:26   Pearson chi2:                 1.38e+05
No. Iterations:                     5   Pseudo R-squ. (CS):            0.06635
Covariance Type:            nonrobust                                         
                              coef    std err          z      P>|z|      [0.025      0.975]
------------------------------

In [21]:
model_features_to_drop = []

df_train_6 = developer.execute_task(
    area_id="model_training",
    task_id="drop_features", 
    inputs=[df_train_5, model_features_to_drop]
)

df_test_6 = drop_features(df_test_5, model_features_to_drop)

INFO: Executing task 'drop_features'...



Dropped 0 columns.
Columns remaining after dropping: 15
Dropped 0 columns.
Columns remaining after dropping: 15


In [22]:
model_fit_2 = developer.execute_task(
    area_id="model_training",
    task_id="train_model", 
    inputs=[df_train_6, default_column],
    validation_tests = ["regression_coeffs_plot", 
                        "regression_models_coeffs", 
                        "log_regression_confusion_matrix", 
                        "regression_roc_curve", "gini_table", 
                        "logistic_reg_prediction_histogram", 
                        "logistic_reg_cumulative_prob", 
                        "scorecard_histogram"]
)

print(model_fit_2.summary())

INFO: Executing task 'train_model'...



Training the model with 14 features and 137723 data points.
Model trained successfully.
                 Generalized Linear Model Regression Results                  
Dep. Variable:                default   No. Observations:               137723
Model:                            GLM   Df Residuals:                   137709
Model Family:                Binomial   Df Model:                           13
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -59975.
Date:                Wed, 09 Aug 2023   Deviance:                   1.1995e+05
Time:                        15:57:26   Pearson chi2:                 1.38e+05
No. Iterations:                     5   Pseudo R-squ. (CS):            0.06635
Covariance Type:            nonrobust                                         
                              coef    std err          z      P>|z|      [0.025      0.975]
------------------------------

## Model validation

#### Validation Plan

In [23]:
df_validation = developer.show_validation_plan()
display(HTML(df_validation.to_html(escape=False)))

Unnamed: 0,Area ID,Task ID,Input,Output,Validation Tests
0,data_description,import_raw_data,lending_club_url,df_1,descriptive_statistics missing_values_bar_plot
1,data_preparation,drop_features,"df_1, preliminary_features_to_drop",df_2,none
2,data_preparation,add_default_definition,"df_2, default_column",df_3,missing_values_bar_plot class_imbalance iqr_outliers_table
3,data_preparation,remove_features_missing_values,"df_3, min_missing_percentage",df_4,missing_values_bar_plot
4,data_preparation,convert_term_column,df_4,df_5,none
5,data_preparation,convert_emp_length_column,df_5,df_6,none
6,data_preparation,convert_inq_last_6mths_column,df_6,df_7,none
7,data_sampling,data_split,"df_7, default_column","df_train_1,df_test_1",tabular_numerical_histograms high_cardinality tabular_categorical_bar_plots
8,exploratory_data_analysis,drop_categories,df_train_1,df_train_2,target_rate_bar_plots
9,exploratory_data_analysis,drop_features,"df_train_2, final_features_to_drop",df_train_3,chi_squared_features_table anova_one_way_table pearson_correlation_matrix feature_target_correlation_plot woe_bin_table woe_bin_table woe_bin_plots


#### Create ValidMind Datasets

In [24]:
vm_df_1 = vm.init_dataset(dataset=df_1, target_column=default_column)
vm_df_3 = vm.init_dataset(dataset=df_3, target_column=default_column)
vm_df_4 = vm.init_dataset(dataset=df_4, target_column=default_column)
vm_df_train_1 = vm.init_dataset(dataset=df_train_1, target_column=default_column)
vm_df_train_2 = vm.init_dataset(dataset=df_train_2, target_column=default_column)
vm_df_train_3 = vm.init_dataset(dataset=df_train_3, target_column=default_column)

2023-08-09 15:57:26,948 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...
2023-08-09 15:57:33,263 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...
2023-08-09 15:57:34,386 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...
2023-08-09 15:57:35,456 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...
2023-08-09 15:57:36,797 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...
2023-08-09 15:57:37,404 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...

#### Create ValidMind Model

In [26]:
vm_df_train = vm.init_dataset(dataset=df_train_6, target_column=default_column)
vm_df_test = vm.init_dataset(dataset=df_test_6, target_column=default_column)

vm_model_fit_2 = vm.init_model(
    model = model_fit_2, 
    train_ds=vm_df_train, 
    test_ds=vm_df_test)

2023-08-09 15:57:38,211 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...
2023-08-09 15:57:38,945 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...
INFO: Pandas dataset detected. Initializing VM Dataset instance...


#### Run All Validation Tests

In [27]:
from validmind.vm_models.test_context import TestContext
from validmind.tests.data_validation.DescriptiveStatistics import DescriptiveStatistics

test_context_1 = TestContext(dataset=vm_df_1)

metric = DescriptiveStatistics(test_context_1)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>This section provides descriptive statistics for numerical and categorical varia…

In [28]:
from validmind.tests.data_validation.MissingValuesBarPlot import MissingValuesBarPlot

params = {"threshold": 80,
          "fig_height": 1100}

metric = MissingValuesBarPlot(test_context_1, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of missing values by plotting horizontal bar plots w…

In [29]:
test_context_3 = TestContext(dataset=vm_df_3)

params = {"threshold": 80,
          "fig_height": 1100}

metric = MissingValuesBarPlot(test_context_3, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of missing values by plotting horizontal bar plots w…

In [30]:
from validmind.tests.data_validation.ClassImbalance import ClassImbalance

metric = ClassImbalance(test_context_3)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='\n            <h2>Class Imbalance ❌</h2>\n            <p>The class imbalance test m…

In [31]:
from validmind.tests.data_validation.IQROutliersTable import IQROutliersTable

num_features = get_numerical_columns(df_3)
params = {"num_features": num_features,
          "threshold": 1.5
        }

metric = IQROutliersTable(test_context_3, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Analyzes the distribution of outliers in numerical features using the Interquart…

In [32]:
from validmind.tests.data_validation.IQROutliersBarPlot import IQROutliersBarPlot

num_features = get_numerical_columns(df_3)
params = {"num_features": num_features,
          "threshold": 1.5,
          "fig_width": 500}

metric = IQROutliersBarPlot(test_context_3, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of the outliers for numeric variables based on perce…

In [33]:
from validmind.tests.data_validation.TabularNumericalHistograms import TabularNumericalHistograms

test_context_train_1 = TestContext(dataset=vm_df_train_1)

metric = TabularNumericalHistograms(test_context_train_1)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of numerical data by plotting the histogram. The inp…

In [34]:
from validmind.tests.data_validation.HighCardinality import HighCardinality
metric = HighCardinality(test_context_train_1)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='\n            <h2>Cardinality ✅</h2>\n            <p>The high cardinality test meas…

In [35]:
from validmind.tests.data_validation.TabularCategoricalBarPlots import TabularCategoricalBarPlots
metric = TabularCategoricalBarPlots(test_context_train_1)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of categorical data by plotting bar plots. The input…

In [36]:
from validmind.tests.data_validation.TargetRateBarPlots import TargetRateBarPlots

test_context_train_2 = TestContext(dataset=vm_df_train_2)

# Configure the metric
params = {
    "default_column": default_column,
    "columns": None
}

metric = TargetRateBarPlots(test_context_train_2, params=params)
metric.run()
await metric.result.log()
metric.result.show()

The column default is correct and contains only 1 and 0.


VBox(children=(HTML(value='<p>Generates a visual analysis of target ratios by plotting bar plots. The input da…

In [37]:
from validmind.tests.data_validation.ChiSquaredFeaturesTable import ChiSquaredFeaturesTable

test_context_train_3 = TestContext(dataset=vm_df_train_3)

cat_features = get_categorical_columns(df_train_3)
params = {"cat_features": cat_features,
          "p_threshold": 0.05}

metric = ChiSquaredFeaturesTable(test_context_train_3, params)
metric.run()
await metric.result.log() 
metric.result.show()

VBox(children=(HTML(value='<p>Perform a Chi-Squared test of independence for each categorical variable with th…

In [38]:
from validmind.tests.data_validation.ANOVAOneWayTable import ANOVAOneWayTable

num_features = get_numerical_columns(df_train_3)
params = {"num_features": num_features,
          "p_threshold": 0.05}

metric = ANOVAOneWayTable(test_context_train_3, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Perform an ANOVA F-test for each numerical variable with the target. The input d…

In [39]:
from validmind.tests.data_validation.PearsonCorrelationMatrix import PearsonCorrelationMatrix

params = {"declutter": False,
          "features": None,
          "fontsize": 13}

metric = PearsonCorrelationMatrix(test_context_train_3, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Extracts the Pearson correlation coefficient for all pairs of numerical variable…

In [40]:
from validmind.tests.data_validation.FeatureTargetCorrelationPlot import FeatureTargetCorrelationPlot

params = {"features": None}

metric = FeatureTargetCorrelationPlot(test_context_train_3, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of correlations between features and target by plott…

In [41]:
from validmind.tests.data_validation.WOEBinTable import WOEBinTable

metric = WOEBinTable(test_context_train_3)
metric.run()
await metric.result.log()
metric.result.show()

Running with breaks_adj: None
Performing binning with breaks_adj: None
[INFO] creating woe binning ...



There are blank strings in 1 columns, which are replaced with NaN. 
 (ColumnNames: emp_length)



VBox(children=(HTML(value="<p>Implements WoE-based automatic binning for features in a dataset and calculates …

In [42]:
params = {
    "breaks_adj": {
        "int_rate": [5,10,15]}  
     }

metric = WOEBinTable(test_context_train_3, params)
metric.run()
await metric.result.log()
metric.result.show()

Running with breaks_adj: {'int_rate': [5, 10, 15]}
Performing binning with breaks_adj: {'int_rate': [5, 10, 15]}
[INFO] creating woe binning ...



There are blank strings in 1 columns, which are replaced with NaN. 
 (ColumnNames: emp_length)



VBox(children=(HTML(value="<p>Implements WoE-based automatic binning for features in a dataset and calculates …

In [43]:
from validmind.tests.data_validation.WOEBinPlots import WOEBinPlots

params = {
    "breaks_adj": {"int_rate": [5,10,15]},
    "fig_height": 500,
}

metric = WOEBinPlots(test_context_train_3, params=params)
metric.run()
await metric.result.log()
metric.result.show()

[INFO] creating woe binning ...



There are blank strings in 1 columns, which are replaced with NaN. 
 (ColumnNames: emp_length)



VBox(children=(HTML(value='<p>Generates a visual analysis of the WoE and IV values distribution for categorica…

In [44]:
from validmind.tests.model_validation.statsmodels.RegressionCoeffsPlot import RegressionCoeffsPlot

test_context_models_fit_2 = TestContext(models = [vm_model_fit_2])

metric = RegressionCoeffsPlot(test_context_models_fit_2)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value="<p>Regression Coefficients with Confidence Intervals Plot</p>\n<p>This class is use…

In [45]:
from validmind.tests.model_validation.statsmodels.RegressionModelsCoeffs import RegressionModelsCoeffs

metric = RegressionModelsCoeffs(test_context_models_fit_2)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>This section shows the coefficients of different regression models that were tra…

In [46]:
from validmind.tests.model_validation.statsmodels.LogRegressionConfusionMatrix import LogRegressionConfusionMatrix

test_context_model_fit_2 = TestContext(model= vm_model_fit_2)

# Configure test parameters
params = {
    "cut_off_threshold": 0.5,
}

metric = LogRegressionConfusionMatrix(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>A confusion matrix is a table that is used to describe the performance of a clas…

In [47]:
from validmind.tests.model_validation.statsmodels.RegressionROCCurve import RegressionROCCurve

metric = RegressionROCCurve(test_context_model_fit_2)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>A receiver operating characteristic (ROC), or simply ROC curve, is a graphical p…

In [48]:
from validmind.tests.model_validation.statsmodels.GINITable import GINITable

metric = GINITable(test_context_model_fit_2)
metric.run()
await metric.result.log() 
metric.result.show()

Predicted scores obtained...
Computing AUC...
Computing GINI...
Computing AUC...
Computing KS...
Predicted scores obtained...
Computing AUC...
Computing GINI...
Computing AUC...
Computing KS...


VBox(children=(HTML(value='<p>Compute and display the AUC, GINI, and KS for train and test sets.</p>'), HTML(v…

In [49]:
from validmind.tests.model_validation.statsmodels.LogisticRegPredictionHistogram import LogisticRegPredictionHistogram

# Configure test parameters
params = {
    "title": "Histogram of Probability of Default",
}

metric = LogisticRegPredictionHistogram(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

In [None]:
from validmind.tests.model_validation.statsmodels.LogisticRegCumulativeProb import LogisticRegCumulativeProb

# Configure test parameters
params = {
    "title": "Cumulative Probability of Default",
}

metric = LogisticRegCumulativeProb(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.ScorecardHistogram import ScorecardHistogram

# Configure test parameters
params = {
    "target_score": 600,
    "target_odds": 50,
    "pdo": 20,
    "title": "Histogram of Credit Scores",
}

metric = ScorecardHistogram(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()