# Credit Risk Scorecard Demo

## Before You Begin
To use the ValidMind Developer Framework from a Jupyter notebook, first set up your Python environment, then install and initialize the client library.

If you don't already have one, you should also create a documentation project on the ValidMind platform. You will use this project to upload your documentation and test results.

## Install the Client Library

In [1]:
# %pip install --upgrade validmind

## Initialize the Client Library
In a browser, go to the Client Integration page of your documentation project and click Copy to clipboard next to the code snippet. This code snippet gives you the API key, API secret, and project identifier to link your notebook to your documentation project.

This step requires a documentation project. Learn how you can create one.

Next, replace this placeholder with your own code snippet:

In [2]:
import validmind as vm

vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = "2494c3838f48efe590d531bfe225d90b",
  api_secret = "4f692f8161f128414fef542cab2a4e74834c75d01b3a8e088a1834f2afcfe838",
  project = "clk00h0u800x9qjy67gduf5om"
)

2023-08-07 18:59:35,678 - INFO(validmind.api_client): Connected to ValidMind. Project: [6] Credit Risk Scorecard - Initial Validation (clk00h0u800x9qjy67gduf5om)
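
The snippet above embeds the API key and secret directly in the notebook. If you would rather keep them out of the file, one option is to read them from environment variables. The following is a minimal sketch, not part of the ValidMind library: the helper `load_vm_credentials` and the variable names `VM_API_HOST`, `VM_API_KEY`, `VM_API_SECRET`, and `VM_API_PROJECT` are hypothetical and only illustrate the idea. The returned dictionary uses the same keyword arguments as the `vm.init` call above.

```python
import os


def load_vm_credentials(env=None):
    """Collect the vm.init keyword arguments from environment variables.

    The variable names below are hypothetical; use whatever names your
    own .env file defines.
    """
    env = os.environ if env is None else env
    mapping = {
        "api_host": "VM_API_HOST",
        "api_key": "VM_API_KEY",
        "api_secret": "VM_API_SECRET",
        "project": "VM_API_PROJECT",
    }
    # Fail early with a clear message if anything is unset or empty
    missing = [name for name in mapping.values() if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[name] for arg, name in mapping.items()}


# Usage (requires the variables to be set first):
# vm.init(**load_vm_credentials())
```

Passing a plain dictionary as `env` makes the helper easy to check without touching the real environment.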


## Use Case

#### Introduction

The **Credit Risk Scorecard** model, built from the Lending Club dataset, is used to compute the Probability of Default (PD), a key input to Expected Credit Loss (ECL) calculations. The scorecard assesses several credit characteristics of potential borrowers, such as credit history, income, and outstanding debts, and assigns a score to each. Summing these scores gives a total score for each borrower, which translates into an estimated Point-in-Time (PiT) PD. The PiT PD reflects the borrower's likelihood of default at a specific point in time, accounting for both current and foreseeable future conditions.
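
To make the link between a score and a PD concrete, here is a minimal sketch of the points-to-double-the-odds (PDO) scaling that is commonly used for scorecards. It is an assumption that this convention applies here: the functions `pd_to_score` and `score_to_pd` are illustrative, odds are taken as good:bad, and the defaults (600, 50, 20) simply mirror the `target_score`, `target_odds`, and `pdo` parameters passed to `ScorecardHistogram` later in this notebook. The library's internal formula may differ.

```python
import math


def pd_to_score(pd_value, target_score=600, target_odds=50, pdo=20):
    """Map a probability of default to a score on a points scale.

    Assumes odds are expressed as good:bad, i.e. (1 - PD) / PD, and that
    the score rises by `pdo` points each time the odds double.
    """
    if not 0 < pd_value < 1:
        raise ValueError("pd_value must be strictly between 0 and 1")
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    odds = (1 - pd_value) / pd_value
    return offset + factor * math.log(odds)


def score_to_pd(score, target_score=600, target_odds=50, pdo=20):
    """Invert pd_to_score: recover the PD implied by a score."""
    factor = pdo / math.log(2)
    offset = target_score - factor * math.log(target_odds)
    odds = math.exp((score - offset) / factor)
    return 1 / (1 + odds)
```

Under these assumptions, a borrower with good:bad odds of 50:1 receives exactly the target score, and doubling the odds to 100:1 adds `pdo` points.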

Additionally, for a holistic view of credit risk, it's essential to estimate the Lifetime PD. The Lifetime PD, as the name suggests, predicts the borrower's likelihood of default throughout the life of the exposure, taking into account potential future changes in the economic and financial conditions.
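
As a rough illustration of how a Lifetime PD relates to per-period PDs, the sketch below chains conditional default probabilities over the life of an exposure. This is a textbook survival-style calculation, not something this notebook implements; the function `lifetime_pd` and the example inputs are hypothetical.

```python
def lifetime_pd(marginal_pds):
    """Cumulative PD over consecutive periods.

    Each entry is the probability of default in that period, conditional
    on having survived all earlier periods.
    """
    survival = 1.0
    for period_pd in marginal_pds:
        if not 0 <= period_pd <= 1:
            raise ValueError("each PD must lie between 0 and 1")
        # Probability of surviving this period as well as all earlier ones
        survival *= 1 - period_pd
    return 1 - survival


# Hypothetical three-year exposure with rising conditional PDs
example = lifetime_pd([0.02, 0.03, 0.04])
```

The result is always at least as large as any single-period PD, because every additional period adds another chance to default.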

#### Import Libraries

In [3]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

# Standard library imports
import re
import pickle
import datetime  # imported as a module: save_model calls datetime.datetime.now()
from typing import List

# Data handling and analysis imports
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
import statsmodels.api as sm
import inspect

# Visualization imports
%matplotlib inline

# Scorecard development
import scorecardpy as sc



#### Processing Functions

In [4]:
unused_variables = [
    "id", "member_id", "funded_amnt", "emp_title", "url", "desc", "application_type",
    "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record",
    "revol_bal", "total_rec_prncp", "total_rec_late_fee", "recoveries", "out_prncp_inv", "out_prncp", 
    "collection_recovery_fee", "next_pymnt_d", "initial_list_status", "pub_rec",
    "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "pymnt_plan",
    "tot_coll_amt", "tot_cur_bal", "total_rev_hi_lim", "last_pymnt_d", "last_credit_pull_d",
    'earliest_cr_line', 'issue_d']

In [5]:
def save_model(model, df, base_filename):
    """Save a model and a dataframe with a timestamp in the filename"""
    # Get current date and time
    now = datetime.datetime.now()

    # Convert the current date and time to string
    timestamp_str = now.strftime("%Y%m%d_%H%M%S")

    filename = f'{base_filename}_{timestamp_str}.pkl'

    # Save the model and dataframe
    with open(filename, 'wb') as file:
        pickle.dump((model, df), file)
        
    print(f"Model and dataframe saved as {filename}")

def get_numerical_columns(df):
    numerical_columns = df.select_dtypes(
        include=["int", "float", "uint"]
    ).columns.tolist()
    return numerical_columns

def get_categorical_columns(df):
    categorical_columns = df.select_dtypes(
        include=["object", "category"]
    ).columns.tolist()
    return categorical_columns

def add_target_column(df, target_column):
    """Derive a binary default flag from 'loan_status' (0 = Fully Paid, 1 = Charged Off)."""
    # Any other status maps to NaN and is dropped below
    df[target_column] = df['loan_status'].map({"Fully Paid": 0, "Charged Off": 1})
    # Remove rows where the target column is NaN; copy so the assignment below
    # does not trigger pandas' SettingWithCopyWarning
    df = df.dropna(subset=[target_column]).copy()
    # Convert target column to integer
    df[target_column] = df[target_column].astype(int)
    return df

def variables_with_min_missing(df, min_missing_percentage):
    # Calculate the percentage of missing values in each column
    missing_percentages = df.isnull().mean() * 100

    # Get the variables where the percentage of missing values is greater than the specified minimum
    variables_to_drop = missing_percentages[missing_percentages > min_missing_percentage].index.tolist()

    # Also add any columns where all values are missing
    variables_to_drop.extend(df.columns[df.isnull().all()].tolist())

    # Remove duplicates (if any)
    variables_to_drop = list(set(variables_to_drop))

    return variables_to_drop

def clean_term_column(df, column):
    """
    Function to remove 'months' string from the 'term' column and convert it to categorical
    """
    # Ensure the column exists in the dataframe
    if column not in df.columns:
        raise ValueError(f"The column '{column}' does not exist in the dataframe.")
    
    df[column] = df[column].str.replace(' months', '')
    
    # Convert to categorical
    df[column] = df[column].astype('object')

def clean_emp_length_column(df, column):
    """
    Function to clean 'emp_length' column and convert it to categorical.
    """
    # Ensure the column exists in the dataframe
    if column not in df.columns:
        raise ValueError(f"The column '{column}' does not exist in the dataframe.")
    
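    df[column] = df[column].replace('n/a', np.nan)
    df[column] = df[column].str.replace('< 1 year', '0', regex=False)
    # Keep only the digits (the raw string avoids an invalid-escape warning).
    # Missing values become '0' here; applying str() to NaN would yield 'nan',
    # which a later fillna could no longer catch.
    df[column] = df[column].apply(lambda x: re.sub(r'\D', '', str(x)) if pd.notna(x) else '0')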

    # Convert to categorical
    df[column] = df[column].astype('object')

def clean_inq_last_6mths(df, column):
    """
    Function to convert 'inq_last_6mths' column into categorical.
    """
    # Ensure the column exists in the dataframe
    if column not in df.columns:
        raise ValueError(f"The column '{column}' does not exist in the dataframe.")

    # Convert to categorical
    df[column] = df[column].astype('category')

def compute_outliers(series, threshold=1.5):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - threshold * IQR
    upper_bound = Q3 + threshold * IQR
    return series[(series < lower_bound) | (series > upper_bound)]

def remove_iqr_outliers(df, target_column, threshold=1.5):
    num_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    num_cols.remove(target_column)  # Exclude target_column from numerical columns
    for col in num_cols:
        outliers = compute_outliers(df[col], threshold)
        df = df[~df[col].isin(outliers)]
    return df

def transform_woe_df(woe_df):
    # Select the binning columns. 'bin_iv' and 'total_iv' keep their own names:
    # renaming 'bin_iv' to 'total_iv' would create two columns with the same label.
    transformed_df = woe_df[['variable', 'bin', 'count', 'count_distr', 'good', 'bad', 'badprob', 'woe', 'bin_iv', 'total_iv']].copy()
    
    # Create 'is_special_values' column (assuming there are no special values)
    transformed_df['is_special_values'] = False
    
    # Transform 'bin' column into interval format and store it in 'breaks' column
    transformed_df['breaks'] = transformed_df['bin'].apply(lambda x: '[-inf, %s)' % x if isinstance(x, float) else '[%s, inf)' % x)
    
    # Group by 'variable' to create bins dictionary
    bins = {}
    for variable, group in transformed_df.groupby('variable'):
        bins[variable] = group
    
    return bins

#### Developer Tasks

In [6]:
# 1. Import Raw Data
def import_raw_data(source): 
    df_out = pd.read_csv(source)
    return df_out

# 2. Data Preparation: Add Definition of Default
def add_default_definition(df, default_column):
    
    df_out = df.copy()
    df_out = add_target_column(df_out, default_column)

    # Drop 'loan_status' variable 
    df_out.drop(columns='loan_status', axis=1, inplace=True)

    # Remove unused variables
    df_out = df_out.drop(columns=unused_variables)

    # Drop columns with more than 80% missing values
    min_missing_percentage = 80
    variables_to_drop = variables_with_min_missing(df_out, min_missing_percentage)
    df_out.drop(columns=variables_to_drop, inplace=True)
    # Drop rows with missing 'emp_length' or 'revol_util'
    df_out.dropna(axis=0, subset=["emp_length", "revol_util"], inplace=True)

    # Format variable types
    clean_emp_length_column(df_out, 'emp_length')
    clean_term_column(df_out, 'term')
    clean_inq_last_6mths(df_out, 'inq_last_6mths')

    # Remove outliers
    df_out = remove_iqr_outliers(df_out, default_column, threshold=1.5)

    return df_out

# 3. Data Sampling: Data Split 
def data_split(df, target_column):
    df_out = df.copy()

    # Split data into train and test 
    X = df_out.drop(target_column, axis = 1)
    y = df_out[target_column]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, 
                                                    random_state = 42, stratify = y)

    # Concatenate X_train with y_train to form df_train
    df_train = pd.concat([X_train, y_train], axis=1)

    # Concatenate X_test with y_test to form df_test
    df_test = pd.concat([X_test, y_test], axis=1)

    return df_train, df_test

# 4. EDA: Drop Unused Categories
def drop_categories(df):

    df_out = df.copy()
 
    # Select rows where purpose is 'debt_consolidation' or 'credit_card'
    df_out = df_out[df_out['purpose'].isin(['debt_consolidation', 'credit_card'])]
    
    # Remove rows where grade is 'F' or 'G'
    df_out = df_out[~df_out['grade'].isin(['F', 'G'])]

    # Remove rows where sub_grade starts with 'F' or 'G'
    df_out = df_out[~df_out['sub_grade'].str.startswith(('F', 'G'))]
    
    # Remove rows where home_ownership is 'OTHER', 'NONE', or 'ANY'
    df_out = df_out[~df_out['home_ownership'].isin(['OTHER', 'NONE', 'ANY'])]

    return df_out

# 5. EDA: Drop Unused Features
def drop_features(df, to_drop):
    df_out = df.copy()
    df_out = df_out.drop(columns=to_drop)
    return df_out 

# 6. EDA: Convert Features to WoE Values
def convert_to_woe(df, woe_df, target_col):
    
    df_out = df.copy()
    
    # Create bins from woe_df
    bins = transform_woe_df(woe_df)
    
    # Make sure we don't transform the target column
    if target_col in bins:
        del bins[target_col]
    
    # Apply the WoE transformation
    df_out = sc.woebin_ply(df_out, bins=bins)
    
    return df_out

# 7. Model Training: Add Constant
def add_constant(df):
    df_out = df.copy()

    # Add constant
    df_out = sm.add_constant(df_out)

    return df_out

# 8. Model Training: Train Model
def train_model(df, target_column):
    
    # Ensure that the target column is in the DataFrame
    if target_column not in df.columns:
        raise ValueError(f"'{target_column}' not found in DataFrame.")

    # Get X (features) and y (target) from df
    X = df.drop(target_column, axis=1)  # Drop the target column to get features
    y = df[target_column]

    # Define the model
    model = sm.GLM(y, X, family=sm.families.Binomial())

    # Fit the model
    model_fit = model.fit()

    return model_fit


#### Developer Class

In [7]:
import datetime
import inspect
import re
import pandas as pd

class Developer:
    def __init__(self):
        self.tasks_log = []
        self.tasks_details = []
        self.tasks = {}  # Dictionary to store tasks

    def add_task(self, task_id, task):
        """Register a task."""
        if task_id in self.tasks:
            raise ValueError(f"Task ID '{task_id}' already exists!")
        
        self.tasks[task_id] = {'task': task}
        return task_id

    def execute_task(self, task_id, inputs=None, area_id=None):
        """Execute a registered task by its ID."""
        # Avoid a mutable default argument
        inputs = [] if inputs is None else inputs
        frame = inspect.currentframe().f_back
        
        # Use inspect to get the calling line of code
        code_context = inspect.getframeinfo(frame).code_context
        line_of_code = code_context[0].strip() if code_context else ""
        
        # Match input variables with local variables of the calling frame
        input_vars = {id(var): name for name, var in frame.f_locals.items()}
        input_var_names = [input_vars.get(id(inp), "N/A") for inp in inputs]

        task_entry = self.tasks.get(task_id)
        if not task_entry:
            raise ValueError(f"No task found for ID {task_id}")

        start_time = datetime.datetime.now()  # Get the current timestamp
        result = task_entry['task'](*inputs)
        end_time = datetime.datetime.now()  # Get the timestamp after execution
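        # total_seconds() keeps sub-second precision; .seconds truncates to whole seconds
        duration = round((end_time - start_time).total_seconds(), 2)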

        # Extract the variable name to which the result is assigned
        output_match = re.search(r'^\s*([\w\s,]+?)\s*=', line_of_code)
        output_var_name = output_match.group(1).replace(" ", "") if output_match else "N/A"

        # Log the details
        self.tasks_log.append(task_id)
        self.tasks_details.append({
            'Time': start_time.strftime('%Y-%m-%d %H:%M:%S'),
            'Area ID': area_id,
            'Task ID': task_id,
            'Input': ", ".join(input_var_names),
            'Output': output_var_name,
            'Duration': f"{duration} seconds"
        })

        return result

    def show_lifecycle(self):
        """Display the model lifecycle details in a tabular format."""
        df = pd.DataFrame(self.tasks_details)
        return df



#### Register Developer Tasks

In [8]:
# Instantiate the Developer class
developer = Developer()

# Set parameters
default_column = "default"

# Register developer tasks
developer.add_task(
    task_id="import_raw_data", 
    task=import_raw_data,
)

developer.add_task(
    task_id="add_default_definition",
    task=add_default_definition,  
)

developer.add_task(
    task_id="data_split",
    task=data_split, 
)

developer.add_task(
    task_id="drop_categories",
    task=drop_categories, 
)

developer.add_task(
    task_id="drop_features",
    task=drop_features, 
)

developer.add_task(
    task_id="convert_to_woe",
    task=convert_to_woe, 
)

developer.add_task(
    task_id="add_constant",
    task=add_constant, 
)

developer.add_task(
    task_id="train_model",
    task=train_model,
)



'train_model'

## Data Description

#### Import Raw Data: `df_1`

In [9]:
lending_club_url = "https://vmai.s3.us-west-1.amazonaws.com/datasets/lending_club_loan_data_2007_2014.csv"

df_1 = developer.execute_task(
    area_id="data_description",
    task_id="import_raw_data", 
    inputs=[lending_club_url]   
)



#### Validate Dataset: `df_1`

In [10]:
from validmind.vm_models.test_context import TestContext
from validmind.tests.data_validation.DescriptiveStatistics import DescriptiveStatistics

vm_df_1 = vm.init_dataset(dataset=df_1)
test_context_1 = TestContext(dataset=vm_df_1)

metric = DescriptiveStatistics(test_context_1)
metric.run()
await metric.result.log()
metric.result.show()

2023-08-07 19:00:19,428 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


VBox(children=(HTML(value='<p>This section provides descriptive statistics for numerical and categorical varia…

In [11]:
from validmind.tests.data_validation.MissingValuesBarPlot import MissingValuesBarPlot

params = {"threshold": 70,
          "fig_height": 1100}

metric = MissingValuesBarPlot(test_context_1, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of missing values by plotting horizontal bar plots w…

## Data Preparation

#### Add Definition of Default: `df_2`

In [12]:
df_2 = developer.execute_task(
    area_id="data_preparation",
    task_id="add_default_definition", 
    inputs=[df_1, default_column]
)



A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy



#### Validate Dataset: `df_2`

In [13]:
from validmind.tests.data_validation.ClassImbalance import ClassImbalance

vm_df_2 = vm.init_dataset(dataset=df_2,
                        target_column=default_column)
test_context_2 = TestContext(dataset=vm_df_2)

metric = ClassImbalance(test_context_2)
metric.run()
await metric.result.log()
metric.result.show()

2023-08-07 19:00:35,116 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


VBox(children=(HTML(value='\n            <h2>Class Imbalance ❌</h2>\n            <p>The class imbalance test m…

In [14]:
from validmind.tests.data_validation.IQROutliersTable import IQROutliersTable

num_features = get_numerical_columns(df_2)
params = {"num_features": num_features,
          "threshold": 1.5
        }

metric = IQROutliersTable(test_context_2, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Analyzes the distribution of outliers in numerical features using the Interquart…

In [15]:
from validmind.tests.data_validation.IQROutliersBarPlot import IQROutliersBarPlot

num_features = get_numerical_columns(df_2)
params = {"num_features": num_features,
          "threshold": 1.5,
          "fig_width": 500}

metric = IQROutliersBarPlot(test_context_2, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of the outliers for numeric variables based on perce…
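
The two IQR tests above use the same rule as the `compute_outliers` helper defined earlier: a value is flagged when it lies more than `threshold` times the interquartile range below the first quartile or above the third. The following standard-library sketch shows that rule on a small made-up list. The functions `quantile` and `iqr_outliers` are illustrative only, and the linear-interpolation quantile is an assumption chosen to resemble the default of pandas' `Series.quantile`.

```python
def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted, non-empty list."""
    position = (len(sorted_values) - 1) * q
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] + fraction * (sorted_values[upper] - sorted_values[lower])


def iqr_outliers(values, threshold=1.5):
    """Return the values outside [Q1 - t*IQR, Q3 + t*IQR]."""
    if not values:
        return []
    ordered = sorted(values)
    q1 = quantile(ordered, 0.25)
    q3 = quantile(ordered, 0.75)
    iqr = q3 - q1
    lower_bound = q1 - threshold * iqr
    upper_bound = q3 + threshold * iqr
    return [v for v in values if v < lower_bound or v > upper_bound]


# Made-up data: six similar values and one extreme one
flagged = iqr_outliers([10, 12, 11, 13, 12, 11, 95])
```

A larger `threshold` widens the accepted band and flags fewer points, which is why both tests and `remove_iqr_outliers` expose it as a parameter.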

## Data Sampling

#### Sampling Method

We use stratified sampling to create the training and test sets. Setting `stratify = y` in `train_test_split` keeps the proportion of each target class (`y`) in both the training and test sets approximately equal to its proportion in the full dataset.

This matters most when the classes are imbalanced, as is typical for credit risk scorecards where defaults are far rarer than repaid loans: a purely random split could leave the test set with too few defaults to evaluate the model reliably.
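
To illustrate what stratification does, here is a hand-rolled sketch that uses only the standard library. It is not how `train_test_split` is implemented; the function `stratified_split` and the synthetic labels are hypothetical and exist only to show that splitting each class separately preserves the class shares.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_indices, test_indices) with class shares preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train, test = [], []
    for indices in by_class.values():
        # Shuffle within the class, then cut off the test share of that class
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Synthetic target: 10% defaults (1), 90% non-defaults (0)
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels)
train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
```

Because each class is split on its own, `train_rate` and `test_rate` both match the default rate of the full list, whereas a single shuffle over all rows would only match it on average.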

#### Data Split: `df_train_2` and `df_test_2`

In [16]:
df_train_2, df_test_2 = developer.execute_task(
    area_id="data_sampling",
    task_id="data_split", 
    inputs=[df_2, default_column]
)

## Exploratory Data Analysis 

#### Validate Dataset: `df_train_2`

In [17]:
from validmind.tests.data_validation.TabularNumericalHistograms import TabularNumericalHistograms

vm_df_train_2 = vm.init_dataset(dataset=df_train_2,
                                target_column=default_column)
test_context_train_2 = TestContext(dataset=vm_df_train_2)

metric = TabularNumericalHistograms(test_context_train_2)
metric.run()
await metric.result.log()
metric.result.show()

2023-08-07 19:00:59,389 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


VBox(children=(HTML(value='<p>Generates a visual analysis of numerical data by plotting the histogram. The inp…

In [18]:
from validmind.tests.data_validation.HighCardinality import HighCardinality
metric = HighCardinality(test_context_train_2)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='\n            <h2>Cardinality ✅</h2>\n            <p>The high cardinality test meas…

In [19]:
from validmind.tests.data_validation.TabularCategoricalBarPlots import TabularCategoricalBarPlots
metric = TabularCategoricalBarPlots(test_context_train_2)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of categorical data by plotting bar plots. The input…

#### Drop Categories: `df_train_3` and `df_test_3`

In [20]:
df_train_3 = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_categories", 
    inputs=[df_train_2]
)

df_test_3 = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_categories", 
    inputs=[df_test_2]
)

#### Validate Dataset: `df_train_3`

In [21]:
from validmind.tests.data_validation.TargetRateBarPlots import TargetRateBarPlots

vm_df_train_3 = vm.init_dataset(
    dataset=df_train_3, 
    target_column=default_column)

test_context_train_3 = TestContext(dataset=vm_df_train_3)

# Configure the metric
params = {
    "default_column": default_column,
    "columns": None
}

metric = TargetRateBarPlots(test_context_train_3, params=params)
metric.run()
await metric.result.log()
metric.result.show()

2023-08-07 19:01:56,192 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


The column default is correct and contains only 1 and 0.


VBox(children=(HTML(value='<p>Generates a visual analysis of target ratios by plotting bar plots. The input da…

#### Drop Features: `df_train_4` and `df_test_4`

In [22]:
to_drop = ['addr_state', 'total_rec_int', 'loan_amnt',
                    'funded_amnt_inv', 'dti', 'revol_util', 'total_pymnt', 
                    'total_pymnt_inv', 'last_pymnt_amnt']

df_train_4 = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_features", 
    inputs=[df_train_3, to_drop]
)

df_test_4 = developer.execute_task(
    area_id="exploratory_data_analysis",
    task_id="drop_features", 
    inputs=[df_test_3, to_drop]
)

#### Validate Dataset: `df_train_4`

In [23]:
from validmind.tests.data_validation.ChiSquaredFeaturesTable import ChiSquaredFeaturesTable

vm_df_train_4 = vm.init_dataset(dataset=df_train_4, target_column=default_column)
test_context_train_4 = TestContext(dataset=vm_df_train_4)

cat_features = get_categorical_columns(df_train_4)
params = {"cat_features": cat_features,
          "p_threshold": 0.05}

metric = ChiSquaredFeaturesTable(test_context_train_4, params)
metric.run()
await metric.result.log() 
metric.result.show()

2023-08-07 19:02:08,784 - INFO(validmind.client): Pandas dataset detected. Initializing VM Dataset instance...


VBox(children=(HTML(value='<p>Perform a Chi-Squared test of independence for each categorical variable with th…

In [24]:
from validmind.tests.data_validation.ANOVAOneWayTable import ANOVAOneWayTable

num_features = get_numerical_columns(df_train_4)
params = {"num_features": num_features,
          "p_threshold": 0.05}

metric = ANOVAOneWayTable(test_context_train_4, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Perform an ANOVA F-test for each numerical variable with the target. The input d…

In [25]:
from validmind.tests.data_validation.PearsonCorrelationMatrix import PearsonCorrelationMatrix

params = {"declutter": False,
          "features": None,
          "fontsize": 13}

metric = PearsonCorrelationMatrix(test_context_train_4, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Extracts the Pearson correlation coefficient for all pairs of numerical variable…

In [26]:
from validmind.tests.data_validation.FeatureTargetCorrelationPlot import FeatureTargetCorrelationPlot

params = {"features": None}

metric = FeatureTargetCorrelationPlot(test_context_train_4, params)
metric.run()
await metric.result.log()
metric.result.show()

VBox(children=(HTML(value='<p>Generates a visual analysis of correlations between features and target by plott…

## Feature Engineering

#### Validate Dataset: `df_train_4`

In [27]:
from validmind.tests.data_validation.WOEBinTable import WOEBinTable

# Run test
metric = WOEBinTable(test_context_train_4)
metric.run()
await metric.result.log()
woe_iv_dic = metric.result.metric.value['woe_iv']
metric.result.show()

Running with breaks_adj: None
Performing binning with breaks_adj: None
[INFO] creating woe binning ...


VBox(children=(HTML(value="<p>Implements WoE-based automatic binning for features in a dataset and calculates …

In [28]:
# Set test parameters
params = {
    "breaks_adj": {
        "int_rate": [5,10,15]}  
     }

# Run test
metric = WOEBinTable(test_context_train_4, params)
metric.run()
await metric.result.log()
woe_iv_dic = metric.result.metric.value['woe_iv']
metric.result.show()

Running with breaks_adj: {'int_rate': [5, 10, 15]}
Performing binning with breaks_adj: {'int_rate': [5, 10, 15]}
[INFO] creating woe binning ...


VBox(children=(HTML(value="<p>Implements WoE-based automatic binning for features in a dataset and calculates …

In [29]:
from validmind.tests.data_validation.WOEBinPlots import WOEBinPlots

# Set test parameters
params = {
    "breaks_adj": {"int_rate": [5,10,15]},
    "fig_height": 500,
}

# Run test
metric = WOEBinPlots(test_context_train_4, params=params)
metric.run()
await metric.result.log()
metric.result.show()

[INFO] creating woe binning ...


VBox(children=(HTML(value='<p>Generates a visual analysis of the WoE and IV values distribution for categorica…

#### Convert Features into WoE Values: `df_train_5` and `df_test_5`
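
Before converting the features, it may help to see how a Weight of Evidence (WoE) value and the Information Value (IV) are derived from the good and bad counts in each bin. The sketch below is a plain-Python illustration, not the `scorecardpy` implementation: the function `woe_iv` is hypothetical, and the sign convention (log of bad share over good share) is an assumption, since libraries differ on it. The keys `woe` and `bin_iv` mirror the column names used in `woe_df`.

```python
import math


def woe_iv(bins):
    """Compute WoE per bin and the total IV.

    bins: list of (good_count, bad_count) tuples, one per bin.
    Assumes every bin contains at least one good and one bad case.
    """
    total_good = sum(good for good, _ in bins)
    total_bad = sum(bad for _, bad in bins)
    rows = []
    for good, bad in bins:
        good_share = good / total_good
        bad_share = bad / total_bad
        # Assumed convention: positive WoE means the bin is riskier than average
        woe = math.log(bad_share / good_share)
        rows.append({"woe": woe, "bin_iv": (bad_share - good_share) * woe})
    total_iv = sum(row["bin_iv"] for row in rows)
    return rows, total_iv


# Two made-up bins: (good, bad) counts
rows, total_iv = woe_iv([(80, 10), (20, 10)])
```

The IV does not depend on the sign convention, because flipping the sign of the WoE also flips the sign of the share difference it is multiplied by.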

In [30]:
# Compute WoE 
params = {
    "breaks_adj": {
        "int_rate": [5,10,15]}  
     }

metric = WOEBinTable(test_context_train_4, params=params)
metric.run()
woe_dic = metric.result.metric.value['woe_iv']
woe_df = pd.DataFrame(woe_dic)

Running with breaks_adj: {'int_rate': [5, 10, 15]}
Performing binning with breaks_adj: {'int_rate': [5, 10, 15]}
[INFO] creating woe binning ...


In [31]:
df_train_5 = developer.execute_task(
    area_id="feature_engineering",
    task_id="convert_to_woe", 
    inputs=[df_train_4, woe_df, default_column]
)

df_test_5 = developer.execute_task(
    area_id="feature_engineering",
    task_id="convert_to_woe", 
    inputs=[df_test_4, woe_df, default_column]
)

[INFO] converting into woe values ...
[INFO] converting into woe values ...


## Model Training

#### Add Constant: `df_train_6` and `df_test_6`

In [32]:
df_train_6 = developer.execute_task(
    area_id="model_training",
    task_id="add_constant", 
    inputs=[df_train_5]
)

df_test_6 = developer.execute_task(
    area_id="model_training",
    task_id="add_constant", 
    inputs=[df_test_5]
)

#### Train Model 1: `model_fit_1`

In [33]:
df_train_6.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 105815 entries, 101002 to 462837
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   const                    105815 non-null  float64
 1   default                  105815 non-null  int64  
 2   purpose_woe              105815 non-null  float64
 3   installment_woe          105815 non-null  float64
 4   open_acc_woe             105815 non-null  float64
 5   term_woe                 105815 non-null  float64
 6   Unnamed: 0_woe           105815 non-null  float64
 7   emp_length_woe           105815 non-null  float64
 8   verification_status_woe  105815 non-null  float64
 9   sub_grade_woe            105815 non-null  float64
 10  inq_last_6mths_woe       0 non-null       float64
 11  int_rate_woe             105815 non-null  float64
 12  grade_woe                105815 non-null  float64
 13  annual_inc_woe           105815 non-null  float64
 14 

In [34]:
model_fit_1 = developer.execute_task(
    area_id="model_training",
    task_id="train_model", 
    inputs=[df_train_6, default_column]
)

print(model_fit_1.summary())

MissingDataError: exog contains inf or nans

#### Drop Features: `df_train_7` and `df_test_7`

In [None]:
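# Drop the WoE column that df_train_6.info() reports as entirely null, which
# triggers the MissingDataError above; extend this list if other columns
# also contain NaN or inf values
to_drop = ['inq_last_6mths_woe']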

df_train_7 = developer.execute_task(
    area_id="model_training",
    task_id="drop_features", 
    inputs=[df_train_6, to_drop]
)

df_test_7 = developer.execute_task(
    area_id="model_training",
    task_id="drop_features", 
    inputs=[df_test_6, to_drop]
)

#### Train Model 2: `model_fit_2`

In [None]:
model_fit_2 = developer.execute_task(
    area_id="model_training",
    task_id="train_model", 
    inputs=[df_train_7, default_column]
)

# Save the model and train dataset for PD development 
save_data = False
if save_data:
    save_model(model_fit_2, df=df_train_7, base_filename='model_fit_glm_scorecard')

print(model_fit_2.summary())

## Model Evaluation

#### Validate Model: `model_fit_2`

In [None]:
# Create VM dataset
vm_df_train_7 = vm.init_dataset(
    dataset=df_train_7,
    target_column=default_column)
vm_df_test_7 = vm.init_dataset(
    dataset=df_test_7,
    target_column=default_column)

# Create VM model
vm_model_fit_2 = vm.init_model(
    model = model_fit_2, 
    train_ds=vm_df_train_7, 
    test_ds=vm_df_test_7)

In [None]:
from validmind.tests.model_validation.statsmodels.RegressionCoeffsPlot import RegressionCoeffsPlot

test_context_models_fit_2 = TestContext(models = [vm_model_fit_2])

metric = RegressionCoeffsPlot(test_context_models_fit_2)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.RegressionModelsCoeffs import RegressionModelsCoeffs

metric = RegressionModelsCoeffs(test_context_models_fit_2)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.LogRegressionConfusionMatrix import LogRegressionConfusionMatrix

test_context_model_fit_2 = TestContext(model= vm_model_fit_2)

# Configure test parameters
params = {
    "cut_off_threshold": 0.5,
}

metric = LogRegressionConfusionMatrix(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.RegressionROCCurve import RegressionROCCurve

metric = RegressionROCCurve(test_context_model_fit_2)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.GINITable import GINITable

metric = GINITable(test_context_model_fit_2)
metric.run()
await metric.result.log() 
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.LogisticRegPredictionHistogram import LogisticRegPredictionHistogram

# Configure test parameters
params = {
    "title": "Histogram of Probability of Default",
}

metric = LogisticRegPredictionHistogram(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.LogisticRegCumulativeProb import LogisticRegCumulativeProb

# Configure test parameters
params = {
    "title": "Cumulative Probability of Default",
}

metric = LogisticRegCumulativeProb(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

In [None]:
from validmind.tests.model_validation.statsmodels.ScorecardHistogram import ScorecardHistogram

# Configure test parameters
params = {
    "target_score": 600,
    "target_odds": 50,
    "pdo": 20,
    "title": "Histogram of Credit Scores",
}

metric = ScorecardHistogram(test_context_model_fit_2, params)
metric.run()
await metric.result.log()
metric.result.show()

#### Summary of Model Lifecycle

In [None]:
lifecycle = developer.show_lifecycle()
display(lifecycle)