# Credit Card Default Prediction

The purpose of this notebook is to construct a full machine learning pipeline to predict the probability of a customer defaulting on their credit card debt. The prompt is adapted from an assignment in UBC MDS DSCI 572 and has been reworked. The data is sourced from the Kaggle Credit Card Clients Dataset, found here: https://www.kaggle.com/datasets/uciml/default-of-credit-card-clients-dataset

# The Problem

The problem is a classification task to identify customers who will default on their credit card payments in the next month. In the raw data set, this column is labeled `default.payment.next.month`, but it will be renamed and referred to as `default` for convenience. A 0 indicates no default, and a 1 indicates a default. There is significant class imbalance in the dataset, with the majority of examples being non-default. As such, the analysis will be primarily concerned with metrics related to identifying the positive class (default). This can be considered an anomaly detection problem, meaning raw accuracy is an unreliable metric.

Initial model evaluation will be based on maximal F1 score. This analysis takes a balanced approach: we want to avoid casting an overly broad net with excessive false positives, so that intervention is as specifically targeted as possible. However, false negatives (low recall) can be considered more damaging due to the profit loss to the credit card company, while intervention against an individual who would not default may still be warranted if the model identifies them as sufficiently high risk. Overall, we want a somewhat balanced model that still prioritizes minimizing type II error (false negatives) over type I error (false positives).
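
To make the trade-off above concrete, here is a minimal sketch of how F1 treats precision and recall symmetrically, while an F-beta score with beta > 1 weights recall more heavily. The confusion-matrix counts are hypothetical and chosen only for illustration, and F2 is shown as one possible way to express a preference for recall; it is not the metric used for the initial evaluation in this notebook.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts on the same 120 true defaulters.
model_a = {"tp": 60, "fp": 20, "fn": 60}  # precision 0.75, recall 0.50
model_b = {"tp": 90, "fp": 90, "fn": 30}  # precision 0.50, recall 0.75

_, _, f1_a = precision_recall_fbeta(**model_a, beta=1.0)
_, _, f1_b = precision_recall_fbeta(**model_b, beta=1.0)
_, _, f2_a = precision_recall_fbeta(**model_a, beta=2.0)
_, _, f2_b = precision_recall_fbeta(**model_b, beta=2.0)

print(f"F1: model A = {f1_a:.3f}, model B = {f1_b:.3f}")
print(f"F2: model A = {f2_a:.3f}, model B = {f2_b:.3f}")
```

In this toy case F1 cannot separate the two models, whereas F2 prefers the model with fewer false negatives, which matches the stated priority on type II error.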

The ideal method would be to set a minimum acceptable operating point for precision, and use this to determine the probability threshold that maximizes recall for each model. The model with the highest recall at this operating point would then be selected. However, setting this operating point requires additional industry knowledge, and it is specific to the risk tolerance of the company and the deployment context.
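
The operating-point idea can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the procedure used in this notebook: the function name `best_threshold_at_min_precision`, the toy labels and scores, and the precision floor of 0.75 are all hypothetical.

```python
def best_threshold_at_min_precision(y_true, y_prob, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is at least min_precision, or None if no
    threshold qualifies. A score >= threshold is predicted as default (1)."""
    best = None
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 1)
        fp = sum(1 for y, p in zip(y_true, y_prob) if p >= t and y == 0)
        fn = sum(1 for y, p in zip(y_true, y_prob) if p < t and y == 1)
        # tp + fp >= 1 here because t is always one of the observed scores.
        precision = tp / (tp + fp)
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (t, precision, recall)
    return best


# Hypothetical labels and predicted default probabilities from one model.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

result = best_threshold_at_min_precision(y_true, y_prob, min_precision=0.75)
print(result)
```

Running the same function on each candidate model's held-out probabilities and comparing the returned recall values would give the model selection rule described above, once a precision floor has been agreed on.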

# EDA

#### Observations

1. Approximately 22% of examples are default (positive class), 78% are not.
2. We have 24 potential features and one target.
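
The class balance in observation 1 is why raw accuracy is treated as unreliable here. As a small worked example, assuming a hypothetical sample of 1,000 customers with the same roughly 22% default rate, a classifier that always predicts "no default" looks accurate while catching no defaulters:

```python
# Hypothetical sample size; the 22% / 78% split comes from the observation above.
n_total = 1000
n_default = 220
n_non_default = n_total - n_default

# Confusion-matrix counts for a baseline that always predicts 0 (no default).
tp, fn = 0, n_default
tn, fp = n_non_default, 0

accuracy = (tp + tn) / n_total
recall = tp / (tp + fn)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```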

Feature Descriptions:

* `ID`: Unique identifier, will be dropped
* `LIMIT_BAL`: Maximum credit, numeric
* `SEX`: Binary, numerically encoded. For ethical reasons, we do not want default predictions to depend on sex, so this will be dropped.
* `EDUCATION`: Ordinal feature for education level. According to the dataset description, 5 and 6 are unknown. These will be combined into a single category and changed to a value of zero.
* `MARRIAGE`: Married, Unmarried, Others. This is not ordinal, as we don't know the relative order, so it will be one-hot encoded as a categorical feature. There are also some unlabeled values (54 cases of 0, possibly meaning missing data), which will be combined into Others.

* `AGE`: a numeric integer
* `PAY_0` -> `PAY_6`: Series of ordinal features indicating payment behavior, with `PAY_0` being the most recent month at the time of data collection and each following feature being one month earlier. Low values indicate good payment behavior (e.g. -1 is paid duly), while positive values indicate how many months payment is delayed (e.g. 2 is 2 months delayed). There are unlabeled values (0 and -2); I speculate that 0 is something like partial repayment and -2 is overpayment or no credit used. Regardless, these are relatively uncommon and will be kept to maintain ordinality. Sequential columns are clearly correlated: if an individual is 6 months late in `PAY_0`, they must have been 5 months late in `PAY_1`. Conversely, in a predictive sense, if `PAY_1` is a large positive number, that individual is clearly not paying their bill, and `PAY_0` is more likely to be the next integer than to show the individual suddenly paying.
* `BILL_AMT1` -> `BILL_AMT6`: Numeric, indicating the amount owing. Follows the same monthly sequence as the `PAY` columns.
* `PAY_AMT1` -> `PAY_AMT6`: Numeric, indicating the payment made. Follows the same monthly sequence as the `PAY` columns.
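
The cleaning decisions listed above can be sketched with pandas on a tiny made-up sample. This is only an illustration, not the notebook's actual preprocessing: the rows are hypothetical, only a subset of columns is shown, and the code assumes `MARRIAGE` uses 3 for Others, which should be checked against the dataset description.

```python
import pandas as pd

# Hypothetical rows with a subset of the columns described above.
df = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "SEX": [1, 2, 2, 1],
    "EDUCATION": [1, 5, 6, 2],
    "MARRIAGE": [1, 2, 0, 3],
    "default.payment.next.month": [0, 1, 0, 0],
})

# Rename the target and drop the identifier and the SEX column.
clean = (
    df.rename(columns={"default.payment.next.month": "default"})
      .drop(columns=["ID", "SEX"])
)

# Merge the unknown EDUCATION codes (5 and 6) into a single value of 0.
clean["EDUCATION"] = clean["EDUCATION"].replace({5: 0, 6: 0})

# Fold the unlabeled MARRIAGE value 0 into Others (assumed to be code 3).
clean["MARRIAGE"] = clean["MARRIAGE"].replace({0: 3})

# One-hot encode MARRIAGE, since its categories have no natural order.
clean = pd.get_dummies(clean, columns=["MARRIAGE"], prefix="MARRIAGE")

print(clean)
```

In a full pipeline, the one-hot encoding would more likely live inside a fitted transformer so that the same categories are applied to training and test data; `pd.get_dummies` is used here only to keep the sketch short.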
