# Credit Approval Modeling

**Goal:** Binary classification: predict whether a credit card application is approved (`+`) or denied (`-`).

**You will learn to:**
- Inspect and clean a dataset (missing values, types).
- Build a preprocessing pipeline for numeric & categorical columns.
- Train and compare three models: Logistic Regression (baseline), Random Forest, and Gradient Boosting.
- Evaluate with Accuracy, Classification Report, ROC AUC, and ROC curves.
- Peek into feature importance/coefficients.
- Work on storytelling.


## 1. Setup & Imports

Read the data with the pandas `read_csv` function. The file has 16 columns and no header row. The data was obtained from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/dataset/27/credit+approval).

Description of the data from the source website:

> This file concerns credit card applications.  All attribute names and values have been changed to meaningless symbols to protect confidentiality of the data.
> This dataset is interesting because there is a good mix of attributes -- continuous, nominal with small numbers of values, and nominal with larger numbers of values.  There are also a few missing values.

In [2]:
import numpy as np
import pandas as pd

## 2. Load and Peek at the Data

In [5]:
df = pd.read_csv("cc_approvals.data",header=None, na_values=["?"])

df.columns = [f"A{i}" for i in range(1,17)]
df.head()

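|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202.0 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | + |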


In [7]:
df.isna().sum().sort_values(ascending=True)

A3      0
A8      0
A9      0
A10     0
A11     0
A12     0
A13     0
A15     0
A16     0
A4      6
A5      6
A6      9
A7      9
A1     12
A2     12
A14    13
dtype: int64

## 3. Target Encoding & Column Types

- Target (`A16`) values are `'+'` (approved) and `'-'` (denied).  
- We map `'+' = 1` and `'-' = 0`.
- Convert numeric-looking columns to numeric (values that cannot be parsed are coerced to `NaN`).
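
The two conversions listed above can be tried on a tiny frame first. This is a minimal sketch: the `toy` frame and its values are made up for illustration and only borrow the column names `A2` and `A16` from the dataset.

```python
import pandas as pd

# Hypothetical toy data: A2 looks numeric but contains a "?" placeholder.
toy = pd.DataFrame({"A2": ["30.83", "?", "24.5"], "A16": ["+", "-", "+"]})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
toy["A2"] = pd.to_numeric(toy["A2"], errors="coerce")

# Map the target symbols to integers: '+' -> 1 (approved), '-' -> 0 (denied).
toy["target"] = toy["A16"].map({"+": 1, "-": 0})

print(toy)
```

In the notebook itself the `"?"` placeholders are already turned into `NaN` by `na_values=["?"]` at load time, so the coercion mainly acts as a safety net.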


In [10]:
x = df.drop(columns=["A16"])
y_raw = df["A16"]

y = y_raw.map({"+":1, "-":0})


In [11]:
y.value_counts()

A16
0    383
1    307
Name: count, dtype: int64

In [13]:
x.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 15 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      678 non-null    object 
 1   A2      678 non-null    float64
 2   A3      690 non-null    float64
 3   A4      684 non-null    object 
 4   A5      684 non-null    object 
 5   A6      681 non-null    object 
 6   A7      681 non-null    object 
 7   A8      690 non-null    float64
 8   A9      690 non-null    object 
 9   A10     690 non-null    object 
 10  A11     690 non-null    int64  
 11  A12     690 non-null    object 
 12  A13     690 non-null    object 
 13  A14     677 non-null    float64
 14  A15     690 non-null    int64  
dtypes: float64(4), int64(2), object(9)
memory usage: 81.0+ KB


In [15]:
numeric_cols = ["A2", "A3", "A8", "A11", "A14", "A15"]

for c in numeric_cols:
    x[c] = pd.to_numeric(x[c], errors="coerce")

cat_cols = x.select_dtypes(include=["object"]).columns.tolist()
num_cols = [c for c in numeric_cols if c not in cat_cols]

print(cat_cols)
print(num_cols)


['A1', 'A4', 'A5', 'A6', 'A7', 'A9', 'A10', 'A12', 'A13']
['A2', 'A3', 'A8', 'A11', 'A14', 'A15']


## 4. Train/Test Split

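Passing `stratify=y` to `train_test_split` keeps class proportions similar across train and test. The notebook's own call in this section does not pass it, so the outputs recorded here come from a plain random 80/20 split; add `stratify=y` if you want the proportions preserved.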
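
To see what stratification changes, here is a small sketch on synthetic labels. The arrays `X_demo` and `y_demo` are placeholders that only reuse the class counts reported earlier (383 denied, 307 approved); they are not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as this dataset.
y_demo = np.array([0] * 383 + [1] * 307)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Plain random split versus a stratified split with the same seed.
_, _, _, y_te_plain = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
_, _, _, y_te_strat = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("overall positive rate:        ", round(y_demo.mean(), 3))
print("plain split, test positive:   ", round(y_te_plain.mean(), 3))
print("stratified split, test positive:", round(y_te_strat.mean(), 3))
```

The stratified test set should track the overall positive rate closely, while the plain split can drift by a few percentage points on a dataset of this size.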


In [None]:
from sklearn.model_selection import train_test_split

# random_state=42 fixes the shuffle so the same split is produced on every run.
# Note: stratify=y is not passed here, so class proportions may differ between train and test.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

x_train.shape, x_test.shape, y_train.shape, y_test.shape

((552, 15), (138, 15), (552,), (138,))

## 5. Preprocessing Pipelines

- **Numeric:** median imputation + standardization  
- **Categorical:** most-frequent imputation + one-hot encoding (ignore unknowns)
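
The behaviour of these two branches can be checked on a toy frame before applying them to the real data. The sketch below uses hypothetical columns `num` and `cat`, and sets `sparse_threshold=0` on the `ColumnTransformer` so the result is always a dense array; both choices are assumptions made for the demo.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with one missing value per column.
train = pd.DataFrame({"num": [1.0, np.nan, 3.0, 5.0],
                      "cat": ["a", "b", np.nan, "a"]})
# A new row: missing numeric value and a category ("z") never seen in training.
new_row = pd.DataFrame({"num": [np.nan], "cat": ["z"]})

num_branch = Pipeline([("imputer", SimpleImputer(strategy="median")),
                       ("scaler", StandardScaler())])
cat_branch = Pipeline([("imputer", SimpleImputer(strategy="most_frequent")),
                       ("onehot", OneHotEncoder(handle_unknown="ignore"))])

demo = ColumnTransformer(
    [("num", num_branch, ["num"]), ("cat", cat_branch, ["cat"])],
    sparse_threshold=0,  # always return a dense array
)
demo.fit(train)
out = demo.transform(new_row)

# The missing number becomes the training median (3.0), which scales to 0;
# the unseen category "z" becomes an all-zero one-hot block.
print(out)
```

Note that the imputers and the scaler learn their statistics from the training rows only, which is why the preprocessing is fitted inside the pipeline rather than on the full dataset.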


In [17]:
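# The data has numeric and categorical features.
# Numeric: missing values are imputed with the median, then features are scaled to mean 0 and standard deviation 1.
# Categorical: missing values are imputed with the most frequent value, then features are one-hot encoded.
# Categories unseen during training are ignored at transform time (handle_unknown="ignore").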

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_preprocessor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

cat_preprocessor = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])

preprocess = ColumnTransformer(transformers=[
        ("num", numeric_preprocessor, num_cols),
        ("cat", cat_preprocessor, cat_cols)
    ]
)


    

## 6. Train/Eval

In [18]:
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score, roc_curve

def fit_and_evaluate(model, x_train, y_train, x_test, y_test, name="(model)"):
    model.fit(x_train, y_train)
    y_pred = model.predict(x_test)
    y_pred_proba = model.predict_proba(x_test)[:, 1] if hasattr(model, "predict_proba") else None
    acc = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, digits=3)
    auc = roc_auc_score(y_test, y_pred_proba) if y_pred_proba is not None else None
    
    print(f"{name} - Accuracy: {acc:.4f}")
    print(f"{name} - Classification Report:")
    print(report)
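    # auc is None for models without predict_proba, so guard the float formatting.
    print(f"{name} - ROC AUC: {auc:.4f}" if auc is not None else f"{name} - ROC AUC: n/a")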

    return {
        "model": model,
        "name": name,
        "accuracy": acc,
        "report": report,
        "auc": auc
    }



## 7. Models

We train three models with the same preprocessing:
1. **Logistic Regression** – simple, strong baseline, interpretable coefficients  
2. **Random Forest** – non-linear, handles interactions, has feature importances  
3. **Gradient Boosting (sklearn)** – additive trees (note: this is **not** XGBoost; if you want XGBoost, install `xgboost` and swap the classifier)


In [19]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier

models_lr = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=200))
])

models_rfc = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=100, max_depth=None, min_samples_split=5, class_weight="balanced", n_jobs=-1, random_state=42))
])

models_gbc = Pipeline(steps=[
    ("preprocess", preprocess),
    ("model", GradientBoostingClassifier(n_estimators=100, max_depth=3, learning_rate=0.05, random_state=42))
])

    

## 8. Training & Evaluation

In [20]:
results = []

results.append(fit_and_evaluate(models_lr, x_train, y_train, x_test, y_test, "Logistic Regression"))
results.append(fit_and_evaluate(models_rfc, x_train, y_train, x_test, y_test, "Random Forest"))
results.append(fit_and_evaluate(models_gbc, x_train, y_train, x_test, y_test, "Gradient Boosting"))

summary = pd.DataFrame(results)
summary.sort_values(by="accuracy", ascending=False, inplace=True)
summary 



Logistic Regression - Accuracy: 0.8261
Logistic Regression - Classification Report:
              precision    recall  f1-score   support

           0      0.814     0.838     0.826        68
           1      0.838     0.814     0.826        70

    accuracy                          0.826       138
   macro avg      0.826     0.826     0.826       138
weighted avg      0.826     0.826     0.826       138

Logistic Regression - ROC AUC: 0.8910
Random Forest - Accuracy: 0.8406
Random Forest - Classification Report:
              precision    recall  f1-score   support

           0      0.829     0.853     0.841        68
           1      0.853     0.829     0.841        70

    accuracy                          0.841       138
   macro avg      0.841     0.841     0.841       138
weighted avg      0.841     0.841     0.841       138

Random Forest - ROC AUC: 0.9065
Gradient Boosting - Accuracy: 0.8261
Gradient Boosting - Classification Report:
              precision    recall  f1-score   support
[output truncated]

|   | model | name | accuracy | report | auc |
|---|---|---|---|---|---|
| 1 | (ColumnTransformer(transformers=[('num',\n ... | Random Forest | 0.84058 | precision recall f1-score ... | 0.906513 |
| 0 | (ColumnTransformer(transformers=[('num',\n ... | Logistic Regression | 0.826087 | precision recall f1-score ... | 0.890966 |
| 2 | (ColumnTransformer(transformers=[('num',\n ... | Gradient Boosting | 0.826087 | precision recall f1-score ... | 0.905882 |


### 1. Accuracy

**Definition:**  
The fraction of correct predictions out of all predictions.

**Significance:**  
- Simple and intuitive measure of performance.  
- Works well if classes are balanced.  
- Can be misleading if data is imbalanced (e.g., 95% “negative” and 5% “positive”: a model that predicts everything as “negative” still gets 95% accuracy).  
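
The imbalance caveat is easy to reproduce. This sketch builds a synthetic 95/5 label vector (not taken from the credit data) and scores a "model" that always predicts the negative class.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels: 95 negatives, 5 positives.
y_true = np.array([0] * 95 + [1] * 5)
# A degenerate classifier that predicts "negative" for every row.
y_always_negative = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_always_negative)
rec = recall_score(y_true, y_always_negative)

print(f"accuracy: {acc:.2f}")          # high, despite learning nothing
print(f"recall (positive): {rec:.2f}")  # no positive is ever found
```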

---

### 2. Classification Report (Precision, Recall, F1-score)

This expands accuracy into more detailed metrics for each class:

#### Precision
- **Definition:** Of all instances predicted positive, how many were actually positive?  
- **Significance:**  
  - High precision means fewer false alarms.  
  - Useful in domains where false positives are costly (e.g., spam detection).  

#### Recall (Sensitivity, True Positive Rate)
- **Definition:** Of all actual positive instances, how many did we correctly identify?  
- **Significance:**  
  - High recall means fewer missed positives.  
  - Important in healthcare, fraud detection, etc., where missing a positive case is critical.  

#### F1-score
- **Definition:** Harmonic mean of precision and recall.  
- **Significance:**  
  - Balances precision and recall.  
  - Especially useful in imbalanced datasets where accuracy is misleading.  

#### Support
- **Definition:** Number of actual instances of each class.  
- **Significance:** Helps contextualize precision/recall scores.  
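
As a worked example, the definitions above can be applied to confusion-matrix counts by hand. The counts below are back-calculated from the Logistic Regression report shown earlier (class `1` as positive), so treat them as an inference from the rounded figures rather than a recorded output.

```python
# Counts inferred from the Logistic Regression report (positive class = 1).
tp, fp, fn, tn = 57, 11, 13, 57

precision = tp / (tp + fp)  # of predicted positives, how many are correct
recall = tp / (tp + fn)     # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
support_pos = tp + fn       # number of actual positives
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f} "
      f"f1={f1:.3f} support={support_pos} accuracy={accuracy:.3f}")
```

These values line up with the `1` row of that report (0.838, 0.814, 0.826, support 70).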

---

### 3. ROC AUC (Area Under the ROC Curve)

- **ROC Curve:** Plots True Positive Rate (Recall) vs. False Positive Rate at different probability thresholds.  
- **AUC (Area Under Curve):** Single number summarizing the ROC curve.  
  - Ranges from 0 to 1: 0.5 corresponds to random guessing, 1.0 to a perfect ranking, and values below 0.5 to a ranking that is worse than random.

**Significance:**  
- Threshold-independent: evaluates model’s ability to rank positives above negatives.  
- Useful for comparing classifiers.  
- Less sensitive to class imbalance than accuracy, although under heavy imbalance a precision-recall curve is often more informative.
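
The ranking interpretation can be verified directly: AUC equals the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as half. A small sketch with made-up labels and scores:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted scores.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
pairwise_auc = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

sk_auc = roc_auc_score(y_true, scores)
print(pairwise_auc, sk_auc)
```

Here 8 of the 9 pairs are ordered correctly, so both computations give 8/9.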

## 9. ROC Curves

Single chart with the ROC curves from each model.


In [22]:
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 6))

for r in results:
    # The results dicts store the fitted pipeline but not the test-set scores,
    # so recompute the positive-class probabilities here.
    if not hasattr(r["model"], "predict_proba"):
        continue  # skip models that cannot produce probabilities

    y_pred_proba = r["model"].predict_proba(x_test)[:, 1]
    fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label=f"{r['name']} (AUC = {r['auc']:.2f})")

plt.plot([0, 1], [0, 1], "k--", label="Random Guessing")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend()
plt.show()