# Application Scorecard Demo

## Introduction

TBC.

## Notebook Setup

In [None]:
import numpy as np  # provides np.nan, used when building the target column
import pandas as pd

## Initialize the Client Library

Every documentation project in the Platform UI comes with a _code snippet_ that lets the client library associate your documentation and tests with the right project when you run this notebook. Documentation projects act as containers for model documentation and validation reports, so you can organize all of your documentation work in one place.

Get your code snippet by creating a documentation project:

1. In a browser, log into the [Platform UI](https://app.prod.validmind.ai).

2. Go to **Documentation Projects** and click **Create new project**.

3. Select **`[Demo] Customer Churn Model`** and **`Initial Validation`** for the model name and type, give the project a unique name to make it yours, and then click **Create project**.

4. Go to **Documentation Projects** > **YOUR_UNIQUE_PROJECT_NAME** > **Getting Started** and click **Copy snippet to clipboard**.

Next, replace this placeholder with your own code snippet:

In [None]:
#import validmind as vm

#vm.init(
#  api_host = "https://api.dev.vm.validmind.ai/api/v1/tracking",
#  api_key = "...",
#  api_secret = "...",
#  project = "..."
#)
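If you prefer not to paste credentials into the notebook, one option is to read them from environment variables and pass them on to `vm.init`. The sketch below only builds the keyword arguments shown in the placeholder above; the helper `load_init_kwargs` and the environment variable names are hypothetical, chosen for illustration, and are not part of the client library.

```python
import os


def load_init_kwargs(env=None):
    """Collect the four vm.init arguments from environment variables.

    The variable names below are hypothetical; use whatever names you export.
    """
    env = os.environ if env is None else env
    names = {
        "api_host": "VM_API_HOST",
        "api_key": "VM_API_KEY",
        "api_secret": "VM_API_SECRET",
        "project": "VM_PROJECT",
    }
    # Fail early with a clear message if any value is absent.
    missing = [var for var in names.values() if var not in env]
    if missing:
        raise KeyError(f"Missing environment variables: {missing}")
    return {arg: env[var] for arg, var in names.items()}


# Intended use, once the variables are set and validmind is imported as vm:
#   vm.init(**load_init_kwargs())
```

Keeping secrets out of the notebook also makes it safer to share or commit the file.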

## Data Collection

In [None]:
# Define the URL to the Lending Club loan data set (2007-2014) hosted on AWS S3 for easy access.
source = "https://vmai.s3.us-west-1.amazonaws.com/datasets/lending_club_loan_data_2007_2014.csv"

# Load CSV with pandas, setting column 21 (index 20) to string data type to prevent DtypeWarning due to mixed types.
df = pd.read_csv(source, dtype={20: str})
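The cell above identifies the mixed-type column by position. As a minimal sketch of the same idea on a small in-memory CSV, the header can be read first to turn a position into a column name, which is then forced to string. The CSV contents and the names `csv_text`, `mixed_col`, and `toy` are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical three-column CSV; the second column mixes numbers and text.
csv_text = "loan_amnt,code,grade\n1000,7,A\n2000,x9,B\n"

# Read only the header row to look up the name of the column at position 1.
header = pd.read_csv(io.StringIO(csv_text), nrows=0)
mixed_col = header.columns[1]

# Force that column to string so pandas does not have to infer a type for it.
toy = pd.read_csv(io.StringIO(csv_text), dtype={mixed_col: str})
print(mixed_col)
print(toy[mixed_col].tolist())
```

Referring to the column by name makes the intent easier to read and keeps working if the column order of the file changes.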

## Data Preparation

In [None]:
# Drop columns that are not relevant for building an application scorecard model
COLS_TO_DROP = [
    "Unnamed: 0",
    "id", "member_id", "funded_amnt", "emp_title", "url", "desc", "application_type",
    "title", "zip_code", "delinq_2yrs", "mths_since_last_delinq", "mths_since_last_record",
    "revol_bal", "total_rec_prncp", "total_rec_late_fee", "recoveries", "out_prncp_inv", "out_prncp",
    "collection_recovery_fee", "next_pymnt_d", "initial_list_status", "pub_rec",
    "collections_12_mths_ex_med", "policy_code", "acc_now_delinq", "pymnt_plan",
    "tot_coll_amt", "tot_cur_bal", "total_rev_hi_lim", "last_pymnt_d", "last_credit_pull_d",
    "earliest_cr_line", "issue_d"
]

df.drop(columns=COLS_TO_DROP, inplace=True)
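Because the drop above modifies `df` in place, running the cell a second time raises a `KeyError`: the columns are already gone. A minimal sketch of a re-runnable variant, on a made-up frame, passes `errors="ignore"` and reports which requested columns were absent. The names `toy`, `cols_to_drop`, `absent`, and `trimmed` are illustrative only.

```python
import pandas as pd

# Made-up frame standing in for the loan data.
toy = pd.DataFrame({"id": [1, 2], "loan_amnt": [1000, 2000], "url": ["a", "b"]})

# "zip_code" is deliberately not present in the toy frame.
cols_to_drop = ["id", "url", "zip_code"]

# List requested columns that do not exist, so typos are not silently hidden.
absent = [col for col in cols_to_drop if col not in toy.columns]
print("Not found:", absent)

# errors="ignore" skips missing labels instead of raising KeyError.
trimmed = toy.drop(columns=cols_to_drop, errors="ignore")
print(list(trimmed.columns))
```

Returning a new frame instead of using `inplace=True` also leaves the original data available for comparison.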

In [None]:
# Calculate the fraction of missing values for each feature in the dataset.
missing_fractions = df.isnull().mean()

# Set the missing-value fraction above which a feature is dropped.
min_missing_fraction = 0.8

# Identify features where the missing value fraction exceeds the threshold.
to_drop = missing_fractions[missing_fractions > min_missing_fraction].index.tolist()

# Remove identified features with too many missing values from the dataset.
df.drop(columns=to_drop, inplace=True)

# Define the target variable for the model, representing loan default status.
# Map 'loan_status' to a binary variable where 'Fully Paid' loans are 0 (no default)
# and 'Charged Off' loans are 1 (default). Other statuses are treated as missing (NaN) and then removed.
target_column = "default"
df[target_column] = df["loan_status"].apply(
    lambda x: 0 if x == "Fully Paid" else 1 if x == "Charged Off" else np.nan
)

# Remove rows with missing target variable values to ensure model integrity.
df.dropna(subset=[target_column], inplace=True)

# Convert the target variable to integer type for modeling.
df[target_column] = df[target_column].astype(int)

# Drop the original 'loan_status' column as it's now redundant with 'default'.
df.drop(columns=["loan_status"], inplace=True)
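To see what the cell above does, here is a minimal sketch of the same two steps on a made-up five-row frame: drop features that are mostly missing, then build a binary `default` target. It uses `Series.map` in place of the lambda (statuses not in the mapping become `NaN`), and the frame `toy` with its column names is an assumption for illustration.

```python
import pandas as pd

# Made-up data: two loans have statuses other than "Fully Paid" or "Charged Off".
toy = pd.DataFrame({
    "loan_status": ["Fully Paid", "Charged Off", "Current", "Fully Paid", "Late (31-120 days)"],
    "loan_amnt": [1000, 2000, 1500, 3000, 2500],
    "all_missing": [None] * 5,                     # 100% missing
    "some_missing": [1.0, None, None, None, 2.0],  # 60% missing
})

# Step 1: drop features whose missing fraction exceeds the threshold.
missing_fractions = toy.isnull().mean()
to_drop = missing_fractions[missing_fractions > 0.8].index.tolist()
toy = toy.drop(columns=to_drop)

# Step 2: map the status to 0/1; any other status becomes NaN and is removed.
toy["default"] = toy["loan_status"].map({"Fully Paid": 0, "Charged Off": 1})
toy = toy.dropna(subset=["default"]).drop(columns=["loan_status"])
toy["default"] = toy["default"].astype(int)

print(to_drop)
print(toy)
```

Only `all_missing` crosses the 0.8 threshold, and the rows with unmapped statuses are excluded from the modeling data.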