# Probability of Default Model using ValidMind

- Step 1: Connect Notebook to ValidMind Project
- Step 2: Import Raw Data
- Step 3: Data Description on Raw Data
- Step 4: Data Preparation
- Step 5: Data Description on Processed Data
- Step 6: Univariate Analysis
- Step 7: Multivariate Analysis
- Step 8: Model Training 

## Step 1: Connect Notebook to ValidMind Project

#### Import Libraries

In [50]:
# Load API key and secret from environment variables
%load_ext dotenv
%dotenv .env

import zipfile
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib as mpl
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score, confusion_matrix, precision_recall_curve, auc
from sklearn.feature_selection import f_classif
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from scipy.stats import chi2_contingency
%matplotlib inline



#### Connect Notebook to ValidMind Project

In [51]:
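import os
import validmind as vm

# Read the credentials from the environment populated by `%dotenv .env` above
# instead of hard-coding them in the notebook. The variable names VM_API_KEY
# and VM_API_SECRET are assumed here; use the names defined in your .env file.
vm.init(
  api_host = "http://localhost:3000/api/v1/tracking",
  api_key = os.environ["VM_API_KEY"],
  api_secret = os.environ["VM_API_SECRET"],
  project = "cliwzqjgv00001fy6869rlav9"
)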

2023-06-20 14:37:26,978 - INFO - api_client - Connected to ValidMind. Project: [3] PD Model - Initial Validation (cliwzqjgv00001fy6869rlav9)


In [55]:
from validmind.tests import list_tests, load_test, describe_test

list_tests(filter="data_validation")

| Test Type | Name | Description | ID |
|---|---|---|---|
| ThresholdTest | Skewness | The skewness test measures the extent to which a distribution of values differs from a normal distribution. A positive skew describes a longer tail of values in the right and a negative skew describes a longer tail of values in the left. | validmind.data_validation.Skewness |
| ThresholdTest | Duplicates | The duplicates test measures the number of duplicate rows found in the dataset. If a primary key column is specified, the dataset is checked for duplicate primary keys as well. | validmind.data_validation.Duplicates |
| Metric | DatasetDescription | Collects a set of descriptive statistics for a dataset | validmind.data_validation.DatasetDescription |
| Metric | ScatterPlot | Generates a visual analysis of data by plotting a scatter plot matrix for all columns in the dataset. The input dataset can have multiple columns (features) if necessary. | validmind.data_validation.ScatterPlot |
| ThresholdTest | TimeSeriesOutliers | Test that find outliers for time series data using the z-score method | validmind.data_validation.TimeSeriesOutliers |
| Metric | TabularCategoricalBarPlots | Generates a visual analysis of categorical data by plotting bar plots. The input dataset can have multiple categorical variables if necessary. In this case, we produce a separate plot for each categorical variable. | validmind.data_validation.TabularCategoricalBarPlots |
| Metric | AutoStationarity | Automatically detects stationarity for each time series in a DataFrame using the Augmented Dickey-Fuller (ADF) test. | validmind.data_validation.AutoStationarity |
| Metric | DescriptiveStatistics | Collects a set of descriptive statistics for a dataset, both for numerical and categorical variables | validmind.data_validation.DescriptiveStatistics |
| Metric | PearsonCorrelationMatrix | Extracts the Pearson correlation coefficient for all pairs of numerical variables in the dataset. This metric is useful to identify highly correlated variables that can be removed from the dataset to reduce dimensionality. | validmind.data_validation.PearsonCorrelationMatrix |
| Metric | TabularNumericalHistograms | Generates a visual analysis of numerical data by plotting the histogram. The input dataset can have multiple numerical variables if necessary. In this case, we produce a separate plot for each numerical variable. | validmind.data_validation.TabularNumericalHistograms |


## Step 2: Import Raw Data

#### Import Lending Club Dataset

In [None]:
# Specify the path to the CSV file
# filepath = '/Users/juanvalidmind/Dev/datasets/lending club/data_2007_2014/loan_data_2007_2014.csv'
filepath = '/Users/juanvalidmind/Dev/datasets/lending club/data_2007_2011/lending_club_loan_data_2007_2011.csv'
df = pd.read_csv(filepath)

# Preview the first rows of the raw data
print(df.head())

## Step 3: Data Description on Raw Data

In [None]:
from validmind.vm_models.test_context import TestContext
from validmind.data_validation.metrics import TabularDescriptionTables

vm_df = vm.init_dataset(dataset=df)
test_context = TestContext(dataset=vm_df)
metric = TabularDescriptionTables(test_context)
metric.run()
metric.result.show()

## Step 4: Data Preparation

#### Remove Unused Variables

Remove all the **Demographic** and **Customer Behavioural** features, which are of no use when analysing default risk for credit approval.

In [None]:
# Columns to remove, with the reason for each:
# id - not required
# member_id - not required
# funded_amnt - not useful; funded_amnt_inv (the amount funded to the person) is used instead
# emp_title - brand names, not useful
# pymnt_plan - fixed value "n" for all rows
# url - not useful
# desc - could be analysed with NLP, but not useful for EDA
# title - too many distinct values
# zip_code - the complete zip code is not available
# delinq_2yrs - post-approval feature
# mths_since_last_delinq - only about half of the values are present
# mths_since_last_record - only about 10% of the values are present
# revol_bal - post-approval/behavioural feature
# initial_list_status - fixed value "f" for all rows
# out_prncp - post-approval feature
# out_prncp_inv - not useful, as it is for investors
# total_pymnt - post-approval feature
# total_pymnt_inv - not useful, as it is for investors
# total_rec_prncp - post-approval feature
# total_rec_int - post-approval feature
# total_rec_late_fee - post-approval feature
# recoveries - post-approval feature
# collection_recovery_fee - post-approval feature
# last_pymnt_d - post-approval feature
# last_credit_pull_d - irrelevant for approval
# last_pymnt_amnt - post-approval feature
# next_pymnt_d - post-approval feature
# collections_12_mths_ex_med - only one distinct value
# policy_code - only one distinct value
# acc_now_delinq - empty or single-valued
# application_type - only one distinct value
# addr_state - location assumed not to matter in this financial context
#
# pub_rec_bankruptcies is single-valued for more than 99% of rows,
# but it is not included in the drop list below.
unused_variables = ["id", "member_id", "funded_amnt", "emp_title", "pymnt_plan", "url", "desc",
                    "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record",
                    "revol_bal", "initial_list_status", "out_prncp", "out_prncp_inv", "total_pymnt",
                    "total_pymnt_inv", "total_rec_prncp", "total_rec_int", "total_rec_late_fee", "recoveries",
                    "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt", "next_pymnt_d", "last_credit_pull_d",
                    "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "application_type", "addr_state"]
df_selected_vars = df.drop(columns=unused_variables)
print("Features we are left with",list(df_selected_vars.columns))
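Several of the reasons listed above ("only one distinct value", "fixed value for all rows") can be checked mechanically. The following is a minimal sketch on made-up data, not part of the original notebook; `find_constant_columns` and the `toy` frame are hypothetical names introduced for illustration.

```python
import pandas as pd


def find_constant_columns(df: pd.DataFrame) -> list:
    """Return the columns that hold at most one distinct non-null value."""
    return [col for col in df.columns if df[col].nunique(dropna=True) <= 1]


# Toy data standing in for the raw Lending Club frame (values are made up)
toy = pd.DataFrame({
    "policy_code": [1, 1, 1],
    "pymnt_plan": ["n", "n", "n"],
    "loan_amnt": [5000, 2400, 10000],
    "all_missing": [None, None, None],
})

constant_columns = find_constant_columns(toy)
print(constant_columns)
```

Running the same helper on `df` before building `unused_variables` would give a data-driven list to compare against the manual one.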

In [None]:
df_selected_vars.info()

#### Process `emp_length`

In [None]:
def process_emp_length(df, column='emp_length'):
    # Define a mapping from original string values to numeric values
    mapping = {'10+ years': 10, '< 1 year': 0, '1 year': 1, '2 years': 2, 
               '3 years': 3, '4 years': 4, '5 years': 5, '6 years': 6, 
               '7 years': 7, '8 years': 8, '9 years': 9, np.nan: np.nan}
    
    # Apply the mapping to the specified column
    df[column] = df[column].map(mapping)
    return df

In [None]:
df_selected_vars = process_emp_length(df_selected_vars, 'emp_length')
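One property of `Series.map` with a dictionary is worth keeping in mind here: any label that is not a key of the mapping silently becomes `NaN`. The sketch below shows this on a toy series; the `"n/a"` label is an assumed example of an unexpected value, not something confirmed to be in the dataset.

```python
import numpy as np
import pandas as pd

# Partial copy of the mapping, enough for the illustration
emp_length_map = {"< 1 year": 0, "1 year": 1, "10+ years": 10}

# "n/a" is a hypothetical unexpected label
toy_emp_length = pd.Series(["< 1 year", "1 year", "10+ years", np.nan, "n/a"])
mapped = toy_emp_length.map(emp_length_map)

# Labels that were present in the input but missing from the mapping
unmapped_labels = sorted(toy_emp_length[mapped.isna() & toy_emp_length.notna()].unique())
print(mapped.tolist())
print(unmapped_labels)
```

Checking `unmapped_labels` after the mapping step makes it visible when rows are about to be lost to a later `dropna`.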

#### Format Dates

In [None]:
def convert_to_datetime(df, columns):
    # Specify the date format
    date_format = "%b-%y"

    # Iterate over the specified columns and convert to datetime
    for column in columns:
        df[column] = pd.to_datetime(df[column], format=date_format)

    return df

In [None]:
# Convert the specified columns to datetime
columns_to_convert = ['issue_d']
df_dates_fixed = convert_to_datetime(df_selected_vars, columns_to_convert)
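As a quick illustration of the `"%b-%y"` format used above, the sketch below parses a few made-up month-year strings. It also shows `errors="coerce"`, which is not used in the notebook but turns unparseable values into `NaT` instead of raising.

```python
import pandas as pd

# Made-up values in the same "Mon-YY" layout as issue_d
issue_dates = pd.Series(["Dec-11", "Jan-08"])
parsed = pd.to_datetime(issue_dates, format="%b-%y")
print(parsed.dt.year.tolist(), parsed.dt.month.tolist())

# With errors="coerce", a malformed entry becomes NaT rather than an exception
with_bad_value = pd.to_datetime(
    pd.Series(["Dec-11", "not a date"]), format="%b-%y", errors="coerce"
)
print(with_bad_value.isna().tolist())
```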

#### Remove Variables with Large Number of Missing Values

In [None]:
def variables_with_min_missing(df, min_missing_percentage):
    # Calculate the percentage of missing values in each column
    missing_percentages = df.isnull().mean() * 100

    # Get the variables where the percentage of missing values is greater than the specified minimum
    variables_to_drop = missing_percentages[missing_percentages > min_missing_percentage].index.tolist()

    return variables_to_drop


In [None]:
# Threshold is a percentage of missing values, not a count
min_missing_percentage = 80
variables_to_drop = variables_with_min_missing(df_dates_fixed, min_missing_percentage)
df_no_missing = df_dates_fixed.drop(columns=variables_to_drop)
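The thresholding logic can be seen on a small example. This is a sketch on made-up data with hypothetical column names; it repeats the same `isnull().mean() * 100` computation used in `variables_with_min_missing`.

```python
import numpy as np
import pandas as pd

# Ten rows: one column is 90% missing, one is 50% missing, one is complete
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],
    "half_missing": [np.nan] * 5 + [1.0] * 5,
    "complete": list(range(10)),
})

missing_pct = toy.isnull().mean() * 100
to_drop = missing_pct[missing_pct > 80].index.tolist()
print(missing_pct.round(1).to_dict())
print(to_drop)
```

Only the column above the 80% threshold is selected; the half-empty column is kept.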

#### Remove Missing Values

In [None]:
# Drop rows where emp_length or revol_util is missing
df_no_missing = df_no_missing.dropna(axis=0, subset=["emp_length", "revol_util"])

#### Remove Rows with Loan Status `Current` 

Remove records whose loan status is **`Current`**: these loans are still running, so they say nothing about whether the borrower eventually defaults.

In [None]:
# Remove the rows with loan_status as "Current"; copy so that the assignments
# below modify a new DataFrame rather than a view of df_no_missing
df_no_current = df_no_missing[df_no_missing["loan_status"] != "Current"].copy()

# Encode loan_status: "Fully Paid" becomes 0, every other status (e.g. "Charged Off") becomes 1
df_no_current["loan_status"] = (df_no_current["loan_status"] != "Fully Paid").astype(int)

# emp_length was already mapped to numbers by process_emp_length
# (< 1 year -> 0, 10+ years -> 10) and its missing rows were dropped above,
# so it only needs to be stored as an integer here
df_no_current["emp_length"] = pd.to_numeric(df_no_current["emp_length"]).astype(int)

# Share of each loan purpose, in percent
loan_purpose_values = df_no_current["purpose"].value_counts() * 100 / df_no_current.shape[0]

# Remove rows whose purpose accounts for less than 1% of the records
loan_purpose_delete = loan_purpose_values[loan_purpose_values < 1].index.values
df_processed = df_no_current[~df_no_current["purpose"].isin(loan_purpose_delete)].copy()

# Strip the % sign from int_rate and revol_util and convert them to numeric
df_processed["int_rate"] = pd.to_numeric(df_processed["int_rate"].str.split('%').str[0])
df_processed["revol_util"] = pd.to_numeric(df_processed["revol_util"].str.split('%').str[0])

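The rare-category filter in this cell can be illustrated on its own. The sketch below uses made-up purpose counts and computes the same percentage share with `value_counts(normalize=True)`; the names `toy`, `share`, `rare` and `filtered` are introduced only for this example.

```python
import pandas as pd

# 200 made-up loans: one purpose accounts for only 0.5% of the rows
toy = pd.DataFrame(
    {"purpose": ["car"] * 150 + ["wedding"] * 49 + ["renewable_energy"] * 1}
)

# Percentage share of each purpose
share = toy["purpose"].value_counts(normalize=True) * 100

# Purposes below the 1% threshold, and the frame without them
rare = share[share < 1].index
filtered = toy[~toy["purpose"].isin(rare)]

print(list(rare), len(filtered))
```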


#### Add New Variables

In [None]:
# Extract month and year from issue_d
df_processed['month'] = df_processed['issue_d'].dt.month
df_processed['year'] = df_processed['issue_d'].dt.year

# Replace earliest_cr_line with its year part (the text after the hyphen)
df_processed["earliest_cr_line"] = pd.to_numeric(df_processed["earliest_cr_line"].apply(lambda x:x.split('-')[1]))

#### Binning Continuous Variables

Create bins for `loan_amnt`.

In [None]:
# Create bins for loan_amnt range
bins = [0, 5000, 10000, 15000, 20000, 25000, 36000]
bucket_l = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000','25000+']
df_processed['loan_amnt_range'] = pd.cut(df_processed['loan_amnt'], bins, labels=bucket_l)
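By default `pd.cut` builds right-closed intervals, so the bins above are `(0, 5000]`, `(5000, 10000]`, and so on. The sketch below, on made-up amounts, shows where boundary values land and that a value equal to the lowest edge is left unbinned unless `include_lowest=True` is passed (an option the notebook does not use).

```python
import pandas as pd

bins = [0, 5000, 10000, 15000, 20000, 25000, 36000]
labels = ['0-5000', '5000-10000', '10000-15000', '15000-20000', '20000-25000', '25000+']

# Made-up amounts sitting on or near bin edges
amounts = pd.Series([0, 5000, 5001, 35000])

binned = pd.cut(amounts, bins, labels=labels)
binned_with_lowest = pd.cut(amounts, bins, labels=labels, include_lowest=True)

print(binned.tolist())
print(binned_with_lowest.tolist())
```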

Create bins for `int_rate` range.

In [None]:
# Convert 'int_rate' to numeric
df_processed['int_rate'] = pd.to_numeric(df_processed['int_rate'], errors='coerce')

# Create bins for int_rate range
bins = [0, 7.5, 10, 12.5, 15, 100]
bucket_l = ['0-7.5', '7.5-10', '10-12.5', '12.5-15', '15+']

# Using pd.cut to create 'int_rate_range' column
df_processed['int_rate_range'] = pd.cut(df_processed['int_rate'], bins, labels=bucket_l)

# Label values that are missing or fall outside the bins as 'Unknown'
df_processed['int_rate_range'] = df_processed['int_rate_range'].cat.add_categories('Unknown').fillna('Unknown')

Create bins for `annual_inc` range.

In [None]:
# Create bins for annual_inc range
bins = [0, 25000, 50000, 75000, 100000, 1000000]
bucket_l = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']
df_processed['annual_inc_range'] = pd.cut(df_processed['annual_inc'], bins, labels=bucket_l)

# Label values that are missing or fall outside the bins as 'Unknown'
df_processed['annual_inc_range'] = df_processed['annual_inc_range'].cat.add_categories('Unknown').fillna('Unknown')
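A consequence of the finite top edge is that incomes above 1,000,000 are not placed in `100000+`; they fall outside every bin and end up as `Unknown` together with genuinely missing values. The sketch below shows this on made-up incomes, and also an open-ended alternative that uses `float("inf")` as the last edge. The alternative is a suggestion for comparison, not what the notebook does.

```python
import pandas as pd

bins = [0, 25000, 50000, 75000, 100000, 1000000]
labels = ['0-25000', '25000-50000', '50000-75000', '75000-100000', '100000+']

# Made-up incomes: one above the top edge and one missing
income = pd.Series([30000, 120000, 2000000, None])

# Same steps as in the notebook cell
income_range = pd.cut(income, bins, labels=labels)
income_range = income_range.cat.add_categories('Unknown').fillna('Unknown')
print(income_range.tolist())

# Alternative (assumption): an open-ended last bin keeps very high incomes in '100000+'
open_ended = pd.cut(income, bins[:-1] + [float("inf")], labels=labels)
print(open_ended.tolist())
```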

Create bins for `installment` range.

In [None]:
# Bucket installment into four labelled ranges
def installment(n):
    if n <= 200:
        return 'low'
    elif n <= 500:
        return 'medium'
    elif n <= 800:
        return 'high'
    else:
        return 'very high'

# Note: this overwrites the numeric installment column with its category label
df_processed['installment'] = df_processed['installment'].apply(installment)
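For comparison with the other binning cells, the same buckets can be expressed with `pd.cut`. The sketch below checks on made-up boundary values that both approaches agree; it also shows one difference: a missing value falls through to the `else` branch of the function and is labelled `very high`, whereas `pd.cut` leaves it as `NaN`.

```python
import numpy as np
import pandas as pd


def installment(n):
    """Copy of the bucketing rule from the notebook cell."""
    if n <= 200:
        return 'low'
    elif n <= 500:
        return 'medium'
    elif n <= 800:
        return 'high'
    else:
        return 'very high'


# Made-up values on and around each boundary
values = pd.Series([50.0, 200.0, 200.01, 500.0, 800.0, 800.01])

by_function = values.apply(installment)
by_cut = pd.cut(
    values,
    bins=[-np.inf, 200, 500, 800, np.inf],
    labels=['low', 'medium', 'high', 'very high'],
).astype(str)

print(by_function.tolist())
print(installment(float("nan")))  # every comparison with NaN is False
```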

Create bins for `dti` range.

In [None]:
# Create bins for dti range
bins = [-1, 5.00, 10.00, 15.00, 20.00, 25.00, 50.00]
bucket_l = ['0-5%', '5-10%', '10-15%', '15-20%', '20-25%', '25%+']
df_processed['dti_range'] = pd.cut(df_processed['dti'], bins, labels=bucket_l)

## Step 5: Data Description on Processed Data

In [None]:
vm_df = vm.init_dataset(dataset=df_processed,
                        target_column='loan_status')
test_context = TestContext(dataset=vm_df)
metric = TabularDescriptionTables(test_context)
metric.run()
metric.result.show()

## Step 6: Univariate Analysis

### Target Variable

In [None]:
# Check the number of defaults in the data using a count plot
plt.figure()
sns.countplot(y="loan_status", data=df_processed)
plt.show()

In [None]:
from validmind.tests.data_validation.ClassImbalance import ClassImbalance
metric = ClassImbalance(test_context)
metric.run()
metric.result.show()
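The class balance can also be read directly from the encoded target with plain pandas. This is a minimal sketch on a made-up target with 15% defaults; it is independent of the `ClassImbalance` test and makes no claim about how that test is implemented.

```python
import pandas as pd

# Made-up encoded target: 0 = fully paid, 1 = default
loan_status = pd.Series([0] * 85 + [1] * 15)

# Share of each class, and the default rate as the mean of a 0/1 column
class_share = loan_status.value_counts(normalize=True)
default_rate = loan_status.mean()

print(class_share.to_dict(), default_rate)
```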

#### Missing Values

In [None]:
from validmind.tests.data_validation.MissingValues import MissingValues
metric = MissingValues(test_context)
metric.run()
metric.result.show()

### Numerical Features

In [None]:
from validmind.data_validation.metrics import TabularNumericalHistograms
metric = TabularNumericalHistograms(test_context)
metric.run()
metric.result.show()

### Categorical Features

In [None]:
from validmind.tests.data_validation.HighCardinality import HighCardinality
metric = HighCardinality(test_context)
metric.run()
metric.result.show()


In [None]:
from validmind.data_validation.metrics import TabularCategoricalBarPlots
metric = TabularCategoricalBarPlots(test_context)
metric.run()
metric.result.show()

### Datetime Features

In [None]:
from validmind.data_validation.metrics import TabularDateTimeHistograms
metric = TabularDateTimeHistograms(test_context)
metric.run()
metric.result.show()

### Loan Defaults Ratio by Feature

In [None]:
from validmind.data_validation.metrics import LoanDefaultRatio

# Select numerical and categorical features (the binned *_range columns and
# the bucketed installment column are categorical)
numerical_features = ['emp_length', 'month', 'year', 'earliest_cr_line', 'inq_last_6mths', 'revol_util', 'total_acc',
                      'open_acc', 'pub_rec']
categorical_features = ['term', 'grade', 'sub_grade', 'home_ownership', 'verification_status', 'purpose',
                        'loan_amnt_range', 'int_rate_range', 'dti_range', 'installment', 'annual_inc_range']

# Configure the metric
params = {
    "loan_status_col": "loan_status",
    "columns": numerical_features + categorical_features
}

test_context = TestContext(dataset=vm_df)
metric = LoanDefaultRatio(test_context, params=params)
metric.run()
metric.result.show()
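Because `loan_status` is encoded as 0/1, the default ratio of a group is simply the mean of `loan_status` within that group. The sketch below computes it with a plain `groupby` on made-up data; it illustrates the quantity being plotted and is not a description of how `LoanDefaultRatio` is implemented.

```python
import pandas as pd

# Made-up loans: grade A never defaults, grade C always does
toy = pd.DataFrame({
    "grade": ["A", "A", "B", "B", "B", "C"],
    "loan_status": [0, 0, 0, 1, 1, 1],
})

default_ratio = toy.groupby("grade")["loan_status"].mean()
print(default_ratio.round(3).to_dict())
```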

## Step 7: Multivariate Analysis

Select variables for multivariate analysis.

In [None]:
df_processed.info()

In [None]:
target_variable = ['loan_status']
selected_features = ['term', 'grade', 'purpose', 'pub_rec',
                      'revol_util', 'funded_amnt_inv', 'int_rate', 
                      'annual_inc_range', 'dti', 'installment',
                      'loan_amnt_range', 'annual_inc', 'loan_amnt',
                      'earliest_cr_line']
df_multivariate = df_processed.loc[:, selected_features + target_variable]

vm_df = vm.init_dataset(dataset=df_multivariate)
test_context = TestContext(dataset=vm_df)

### Bivariate Analysis

Define metric as custom test.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

from dataclasses import dataclass
from validmind.vm_models import Figure, Metric


@dataclass
class BivariateBarPlots(Metric):
    """
    Generates a visual analysis of categorical data by plotting bivariate bar plots.
    The input dataset and variable_pairs are required.
    """

    name = "bivariate_bar_plots"
    required_context = ["dataset"]
    default_params = {"variable_pairs": None, "loan_status_filter": None}

    def plot_bivariate_bar(self, variable_pairs, loan_status_filter):
        figures = []
        df = self.dataset.df
        if loan_status_filter:
            df = df[df["loan_status"].isin(loan_status_filter)]

        # Cycle through the tab10 colors so more than ten hue categories still work
        colors = plt.get_cmap("tab10").colors

        for x, hue in variable_pairs.items():
            # The mean of the 0/1 loan_status per (x, hue) group is the default ratio
            means = df.groupby([x, hue])["loan_status"].mean().unstack().reset_index()
            hue_categories = means.columns[1:]

            n = len(hue_categories)
            width = 1 / (n + 1)

            # Keep a handle on the figure so the plotted figure is the one stored
            fig = plt.figure()

            for i, hue_category in enumerate(hue_categories):
                plt.bar(
                    np.arange(len(means)) + i * width,
                    means[hue_category],
                    color=colors[i % len(colors)],
                    alpha=0.7,
                    label=hue_category,
                    width=width,
                )

            plt.title(x + " by " + hue)
            plt.xlabel(x)
            plt.ylabel("Loan Default Ratio")
            plt.xticks(ticks=np.arange(len(means)), labels=means[x], rotation=90)
            plt.legend()
            plt.show()

            figures.append(
                Figure(for_object=self, key=f"{self.key}:{x}_{hue}", figure=fig)
            )

        plt.close("all")

        return figures

    def run(self):
        variable_pairs = self.params["variable_pairs"]
        loan_status_filter = self.params["loan_status_filter"]

        figures = self.plot_bivariate_bar(variable_pairs, loan_status_filter)

        return self.cache_results(figures=figures)
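The table that the bar plot is drawn from comes out of the `groupby(...).mean().unstack().reset_index()` chain in the class above. The sketch below runs that chain on made-up data to show its shape: one row per value of `x`, and one column per value of `hue` holding the default ratio.

```python
import pandas as pd

# Made-up loans with two terms and two purposes
toy = pd.DataFrame({
    "term": ["36", "36", "60", "60"],
    "purpose": ["car", "home", "car", "home"],
    "loan_status": [0, 1, 1, 1],
})

means = toy.groupby(["term", "purpose"])["loan_status"].mean().unstack().reset_index()
hue_categories = list(means.columns[1:])

print(means)
print(hue_categories)
```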

In [None]:
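# Note: importing the metric from the library does not work, so the class defined above is used instead:
# from validmind.tests.data_validation.BivariateBarPlots import BivariateBarPlots

# Configure the metric. variable_pairs is a dict, so each x variable can
# appear only once; a repeated key would keep only its last value.
variable_pairs = {'annual_inc_range': 'purpose', 
                  'term': 'purpose', 
                  'grade': 'purpose',
                  'loan_amnt_range': 'term',
                  'installment': 'purpose'}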

params = {
    "variable_pairs": variable_pairs,
    "loan_status_filter": None
}

metric = BivariateBarPlots(test_context, params=params)
#metric.run()
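Since `variable_pairs` is a dictionary, the same x variable cannot be paired with two different hue variables: Python keeps only the last value given for a key. The sketch below demonstrates this, along with a list of tuples as one possible alternative. Using the list form with the class above is an assumption and would require changing its loop from `variable_pairs.items()` to iterating over the list directly.

```python
# A dict literal with a repeated key keeps only the last value
pairs_as_dict = {
    'loan_amnt_range': 'purpose',
    'loan_amnt_range': 'term',
}

# A list of (x, hue) tuples can hold the same x variable more than once
pairs_as_list = [
    ('loan_amnt_range', 'purpose'),
    ('loan_amnt_range', 'term'),
]

print(pairs_as_dict)
for x, hue in pairs_as_list:
    print(x, "by", hue)
```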

**Scatter Plots**

In [None]:
df_multivariate.info()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from dataclasses import dataclass
from validmind.vm_models import Figure, Metric


@dataclass
class BivariateScatterPlots(Metric):
    """
    Generates a visual analysis of numerical data by plotting bivariate scatter plots.
    The input dataset and variable_pairs are required.
    """

    name = "bivariate_scatter_plots"
    required_context = ["dataset"]
    default_params = {"variable_pairs": None, "loan_status_filter": None}

    def plot_bivariate_scatter(self, variable_pairs, loan_status_filter):
        figures = []
        df = self.dataset.df
        # Compare with None so that filtering on loan_status 0 is possible
        if loan_status_filter is not None:
            df = df[df["loan_status"] == loan_status_filter]

        # Grey for non-default (0) and red for default (1);
        # the last value of the grey tuple is alpha (transparency)
        palette = {0: (0.8, 0.8, 0.8, 0.8), 1: 'tab:red'}
        label_map = {'0': 'Non-default', '1': 'Default'}

        for x, y in variable_pairs.items():
            # Keep a handle on the figure so the plotted figure is the one stored
            fig = plt.figure()

            # Scatterplot using seaborn, with color variation based on 'loan_status'
            plot = sns.scatterplot(data=df, x=x, y=y, hue='loan_status', palette=palette, alpha=1)

            # Relabel the legend entries; any other legend text is left unchanged
            for text in plot.get_legend().get_texts():
                text.set_text(label_map.get(text.get_text(), text.get_text()))

            plt.title(x + " and " + y)
            plt.xlabel(x)
            plt.ylabel(y)
            plt.show()

            figures.append(
                Figure(for_object=self, key=f"{self.key}:{x}_{y}", figure=fig)
            )

        plt.close("all")

        return figures

    def run(self):
        variable_pairs = self.params["variable_pairs"]
        loan_status_filter = self.params["loan_status_filter"]

        figures = self.plot_bivariate_scatter(variable_pairs, loan_status_filter)

        return self.cache_results(figures=figures)

In [None]:
# Each x variable can appear only once because variable_pairs is a dict
variable_pairs = {'int_rate': 'annual_inc', 
                  'funded_amnt_inv': 'dti', 
                  'annual_inc': 'funded_amnt_inv',
                  'loan_amnt': 'int_rate',
                  'earliest_cr_line': 'int_rate'}

params = {
    "variable_pairs": variable_pairs,
    "loan_status_filter": None
}

metric = BivariateScatterPlots(test_context, params=params)
metric.run()

**Bivariate Histograms**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

from dataclasses import dataclass
from validmind.vm_models import Figure, Metric


@dataclass
class BivariateHistograms(Metric):
    """
    Generates a visual analysis of numerical data by plotting bivariate histograms.
    The input dataset and variable_pairs are required.
    """

    name = "bivariate_histograms"
    required_context = ["dataset"]
    default_params = {"variable_pairs": None, "loan_status_filter": None}

    def plot_bivariate_histogram(self, variable_pairs, loan_status_filter):
        figures = []
        palette = {0: (0.5, 0.5, 0.5, 0.8), 1: 'tab:red'}

        for x, y in variable_pairs.items():
            df = self.dataset.df
            if loan_status_filter:
                df = df[df["loan_status"] == loan_status_filter]

            fig, axes = plt.subplots(2, 1)

            for ax, var in zip(axes, [x, y]):
                for loan_status, color in palette.items():
                    subset = df[df['loan_status'] == loan_status]
                    sns.histplot(subset[var],
                                 ax=ax, 
                                 color=color,
                                 edgecolor=None, 
                                 kde=True, 
                                 label='Default' if loan_status else 'Non-default')

                ax.set_title(f"Histogram of {var} by loan status")
                ax.set_xlabel(var)
                ax.legend()

            plt.tight_layout()
            plt.show()

            # Store the figure that was plotted, not a new empty one
            figures.append(
                Figure(
                    for_object=self, key=f"{self.key}:{x}_{y}", figure=fig
                )
            )

        plt.close("all")

        return figures

    def run(self):
        variable_pairs = self.params["variable_pairs"]
        loan_status_filter = self.params["loan_status_filter"]

        figures = self.plot_bivariate_histogram(variable_pairs, loan_status_filter)

        return self.cache_results(figures=figures)


In [None]:
# Each x variable can appear only once because variable_pairs is a dict
variable_pairs = {'int_rate': 'annual_inc', 
                  'funded_amnt_inv': 'dti', 
                  'annual_inc': 'funded_amnt_inv',
                  'loan_amnt': 'int_rate',
                  'earliest_cr_line': 'int_rate'}


params = {
    "variable_pairs": variable_pairs,
    "loan_status_filter": None
}

metric = BivariateHistograms(test_context, params=params)
metric.run()

#### Multivariate Analysis 

**Pearson Correlation Matrix**

In [None]:
from validmind.tests.data_validation.PearsonCorrelationMatrix import PearsonCorrelationMatrix

metric = PearsonCorrelationMatrix(test_context)
metric.run()
metric.result.show()
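A correlation matrix is mostly useful for spotting pairs of nearly redundant variables. The sketch below shows one way to list such pairs with plain pandas on synthetic data; the 0.9 cutoff, the synthetic relationship between the two amount columns, and all variable names are assumptions made for illustration, and it does not reproduce the ValidMind metric.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: funded_amnt_inv is built to track loan_amnt closely
loan_amnt = rng.uniform(1000, 35000, 500)
toy = pd.DataFrame({
    "loan_amnt": loan_amnt,
    "funded_amnt_inv": loan_amnt * 0.98 + rng.normal(0, 100, 500),
    "dti": rng.uniform(0, 30, 500),
})

corr = toy.corr(method="pearson")

# Keep each pair once by masking everything except the upper triangle
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()
high_pairs = pairs[pairs.abs() > 0.9]

print(high_pairs)
```

Pairs flagged this way are candidates for keeping only one of the two variables before fitting a linear model.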



## Step 8: Model Training

In [None]:
df_multivariate.info()

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# First, we define the preprocessing steps
numeric_features = ['pub_rec', 'revol_util', 'funded_amnt_inv', 'int_rate', 'dti', 'annual_inc', 'loan_amnt', 'earliest_cr_line']
categorical_features = ['term', 'grade', 'purpose', 'annual_inc_range', 'loan_amnt_range']

numeric_transformer = StandardScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

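# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression(solver='lbfgs', max_iter=1000))])

# Split the data into training and test sets
X = df_multivariate[numeric_features + categorical_features]
y = df_multivariate['loan_status']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train the model
clf.fit(X_train, y_train)

# Evaluate on the test set; score() reports mean accuracy
print("model score: %.3f" % clf.score(X_test, y_test))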
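Accuracy can be misleading when defaults are a small minority, and `roc_auc_score` is already imported at the top of the notebook. The sketch below shows how a probability-of-default score could be evaluated with ROC AUC. It runs on synthetic data from `make_classification`, and every `demo_` name is hypothetical; it is not a result for the Lending Club model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the loan data (roughly 15% positives)
demo_X, demo_y = make_classification(
    n_samples=1000, n_features=8, weights=[0.85, 0.15], random_state=0
)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, test_size=0.2, stratify=demo_y, random_state=0
)

demo_model = LogisticRegression(max_iter=1000).fit(demo_X_train, demo_y_train)

# Column 1 of predict_proba is the estimated probability of the positive class
demo_default_probability = demo_model.predict_proba(demo_X_test)[:, 1]
demo_auc = roc_auc_score(demo_y_test, demo_default_probability)

print("ROC AUC: %.3f" % demo_auc)
```

With the notebook's pipeline, the equivalent call would be `clf.predict_proba(X_test)[:, 1]` followed by `roc_auc_score(y_test, ...)`.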


In [62]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
import pandas as pd

# First, we define the preprocessing steps
numeric_features = ['pub_rec', 'revol_util', 'funded_amnt_inv', 'int_rate', 'dti', 'annual_inc', 'loan_amnt', 'earliest_cr_line']
categorical_features = ['term', 'grade', 'purpose', 'annual_inc_range', 'loan_amnt_range', 'installment']  # Added 'installment'

# Handle categorical features
df_encoded = pd.get_dummies(df_multivariate, columns=categorical_features)

# Split the data
X = df_encoded.drop('loan_status', axis=1)
y = df_encoded['loan_status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Add a constant to the independent values
X_train = sm.add_constant(X_train)

# Define the model
glm_model_fit = sm.GLM(y_train, X_train, family=sm.families.Binomial())

# Fit the model
results = glm_model_fit.fit()

# Print the summary
print(results.summary())

# Evaluate on the test set
X_test = sm.add_constant(X_test)  # Adding a constant to the test data
y_pred = results.predict(X_test)

# You can then further analyze y_pred to measure model performance on the test set.

                 Generalized Linear Model Regression Results                  
Dep. Variable:            loan_status   No. Observations:                29110
Model:                            GLM   Df Residuals:                    29072
Model Family:                Binomial   Df Model:                           37
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -10836.
Date:                Tue, 20 Jun 2023   Deviance:                       21673.
Time:                        14:58:39   Pearson chi2:                 2.88e+04
No. Iterations:                   100   Pseudo R-squ. (CS):            0.06862
Covariance Type:            nonrobust                                         
                                    coef    std err          z      P>|z|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
const         
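One thing that may be worth checking in the GLM design matrix is collinearity: `pd.get_dummies` creates one column per category, and the dummy columns of each variable sum to 1, which duplicates the constant added by `sm.add_constant`. Whether this explains the behaviour of the fit above is an assumption; the sketch below only demonstrates the effect on made-up data and shows `drop_first=True` as one common remedy.

```python
import numpy as np
import pandas as pd

# Made-up categorical data covering every grade/term combination
toy = pd.DataFrame({
    "grade": ["A", "B", "C", "A", "B", "C"],
    "term": ["36", "60", "36", "60", "36", "60"],
})

full = pd.get_dummies(toy, columns=["grade", "term"], dtype=float)
reduced = pd.get_dummies(toy, columns=["grade", "term"], drop_first=True, dtype=float)


def rank_with_constant(design: pd.DataFrame) -> int:
    """Rank of the design matrix after prepending a column of ones."""
    with_const = np.column_stack([np.ones(len(design)), design.to_numpy()])
    return int(np.linalg.matrix_rank(with_const))


# Full encoding: 1 + 5 columns but rank 4; reduced: 1 + 3 columns and rank 4
print(full.shape[1] + 1, rank_with_constant(full))
print(reduced.shape[1] + 1, rank_with_constant(reduced))
```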

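The summary above reports 100 IRLS iterations, which suggests the fit may have stopped at the iteration limit rather than converged. As a next attempt, scale the variables in `X` and refit the model.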

In [None]:
import statsmodels.api as sm
from sklearn.preprocessing import scale

# Scale your variables
X_scaled = scale(X)

# Add a constant to the independent values
X_scaled = sm.add_constant(X_scaled)

# Define the model
model = sm.GLM(y, X_scaled, family=sm.families.Binomial())

# Fit the model
results = model.fit()

# Print the summary
print(results.summary())
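Two details of scaling are easy to lose: `scale` returns a NumPy array, so column names disappear from the model summary, and scaling the full dataset uses test-set statistics. The sketch below shows an alternative with `StandardScaler`, fitted on the training rows only and wrapped back into a DataFrame. The data and the `demo_` names are made up; this is a suggestion, not the notebook's method.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up training and test rows
demo_train = pd.DataFrame({
    "int_rate": [7.5, 10.0, 12.5, 15.0],
    "dti": [5.0, 10.0, 15.0, 20.0],
})
demo_test = pd.DataFrame({"int_rate": [11.25], "dti": [12.5]})

# Learn the mean and standard deviation from the training data only
scaler = StandardScaler().fit(demo_train)

# Wrap the arrays back into DataFrames to keep column names and index
demo_train_scaled = pd.DataFrame(
    scaler.transform(demo_train), columns=demo_train.columns, index=demo_train.index
)
demo_test_scaled = pd.DataFrame(
    scaler.transform(demo_test), columns=demo_test.columns, index=demo_test.index
)

print(demo_train_scaled.round(3))
print(demo_test_scaled.round(3))
```

The test row equals the training means, so it scales to zero in both columns.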


#### ValidMind Models 

In [63]:
# Initialize training and testing datasets for model A.
# X_train and X_test hold only the features, so loan_status is added back
# for the target_column lookup.
vm_train_ds = vm.init_dataset(dataset=X_train.assign(loan_status=y_train), type="generic", target_column='loan_status')
vm_test_ds = vm.init_dataset(dataset=X_test.assign(loan_status=y_test), type="generic", target_column='loan_status')

# Initialize model A
vm_model_A = vm.init_model(
    model = glm_model_fit, 
    train_ds=vm_train_ds, 
    test_ds=vm_test_ds)

2023-06-20 14:58:44,278 - INFO - client - Pandas dataset detected. Initializing VM Dataset instance...
2023-06-20 14:58:44,279 - INFO - dataset - Inferring dataset types...
2023-06-20 14:58:44,940 - INFO - client - Pandas dataset detected. Initializing VM Dataset instance...
2023-06-20 14:58:44,941 - INFO - dataset - Inferring dataset types...


ValueError: Model type statsmodels.GLM is not supported at the moment.