# Credit Card Approvals

This notebook shows how to use supervised machine learning to build a predictor for credit card approvals.

Preprocessing handles missing entries, encodes categorical features, and rescales all features to a common range. The data is split into training and test sets, and a logistic regression model is fit to the training set. Finally, predictions are made on the test set, and model performance is evaluated and then tuned with a grid search over hyperparameters.

## Table of Contents

1 Imports

2 Exploratory Data Analysis

3 Preprocessing

4 Model Training

5 Generating Predictions

6 Hyperparameter Tuning

7 Model Summary

## 1 Imports

In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

In [2]:
# Load dataset
cc_apps = pd.read_csv("data/cc_approvals.data", header=None)

## 2 Exploratory Data Analysis

In [3]:
# Inspect the first five rows
print(cc_apps.head())
print()
# Inspect the dimensions of the dataframe
print(cc_apps.shape)

  0      1      2  3  4  5  6     7  8  9   10 11 12     13   14 15
0  b  30.83  0.000  u  g  w  v  1.25  t  t   1  f  g  00202    0  +
1  a  58.67  4.460  u  g  q  h  3.04  t  t   6  f  g  00043  560  +
2  a  24.50  0.500  u  g  q  h  1.50  t  f   0  f  g  00280  824  +
3  b  27.83  1.540  u  g  w  v  3.75  t  t   5  t  g  00100    3  +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0  f  s  00120    0  +

(690, 16)


In [4]:
# Inspect datatypes of individual columns
print(cc_apps.dtypes)

0      object
1      object
2     float64
3      object
4      object
5      object
6      object
7     float64
8      object
9      object
10      int64
11     object
12     object
13     object
14      int64
15     object
dtype: object
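
Column 1 is reported as `object` even though its values look numeric in the preview above. A plausible cause (an assumption; the notebook does not say which columns contain placeholders) is that some entries are the `"?"` marker handled in section 3.3. The sketch below uses made-up values to show how one such entry keeps a column from being parsed as numbers, and how `pd.to_numeric` with `errors="coerce"` could convert it. The notebook itself does not perform this conversion.

```python
import pandas as pd

# Toy column (hypothetical values) mimicking a numeric field with a "?" placeholder
raw = pd.Series(["30.83", "58.67", "?", "27.83"])
print(raw.dtype)  # a text dtype, because "?" cannot be parsed as a number

# errors="coerce" turns unparseable entries into NaN and yields a float column
as_numeric = pd.to_numeric(raw, errors="coerce")
print(as_numeric.dtype)
print(as_numeric.isna().sum())
```

Without such a conversion, a column like this is skipped by mean imputation and treated as categorical by `pd.get_dummies`.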


## 3 Preprocessing

### 3.1 Drop Unnecessary Columns

In [5]:
# Drop features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

### 3.2 Train-Test Split

In [6]:
# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
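
As a quick sanity check on the split, the expected set sizes can be worked out by hand. The sketch below assumes scikit-learn's rule of rounding a fractional `test_size` up and assigning the remaining rows to the training set; the variable names are illustrative.

```python
import math

n_samples = 690   # rows reported by cc_apps.shape
test_size = 0.33  # fraction passed to train_test_split

# A fractional test size is rounded up; the remainder goes to the train set
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

The resulting test size agrees with the 103 + 125 = 228 instances counted in the confusion matrix in section 5.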

### 3.3 Replace "?" with NaN values

In [7]:
# Replace the '?'s with NaN in the train and test sets
cc_apps_train_nans_replaced = cc_apps_train.replace("?", np.nan)
cc_apps_test_nans_replaced = cc_apps_test.replace("?", np.nan)

### 3.4 Mean Imputation

In [8]:
# Get only numeric columns for mean calculation
numeric_cols = cc_apps_train_nans_replaced.select_dtypes(include=np.number).columns

# Compute the column means on the train set only
train_means = cc_apps_train_nans_replaced[numeric_cols].mean()

# Fill missing values in the numeric columns with the train means
cc_apps_train_imputed = cc_apps_train_nans_replaced.fillna(train_means)

# Reuse the train means for the test set so no test statistics are used
cc_apps_test_imputed = cc_apps_test_nans_replaced.fillna(train_means)

### 3.5 Mode Imputation

In [9]:
# Iterate over each column of cc_apps_train_imputed
for col in cc_apps_train_imputed.columns:
    # Check if the column is of type "object"
    if cc_apps_train_imputed[col].dtypes == "object":
        # Most frequent value of this column in the train set
        most_frequent = cc_apps_train_imputed[col].value_counts().index[0]
        # Impute this column only, in both the train and test sets
        cc_apps_train_imputed[col] = cc_apps_train_imputed[col].fillna(most_frequent)
        cc_apps_test_imputed[col] = cc_apps_test_imputed[col].fillna(most_frequent)
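
Mode imputation is meant to fill each categorical column with that column's own most frequent value, learned from the training set. A minimal sketch on toy frames (the column names `a` and `b` and all values are hypothetical) passes a dictionary of per-column modes to `fillna`:

```python
import numpy as np
import pandas as pd

# Toy frames (hypothetical values) with missing entries in text columns
train = pd.DataFrame({"a": ["x", "x", "y", np.nan], "b": ["p", "q", "q", "q"]})
test = pd.DataFrame({"a": [np.nan, "y"], "b": [np.nan, "p"]})

# Learn one mode per column from the training frame only
modes = {col: train[col].mode().iloc[0] for col in train.columns}

# fillna with a dict fills each column with its own value
train_filled = train.fillna(value=modes)
test_filled = test.fillna(value=modes)
print(test_filled)
```

Calling `fillna` with a single scalar on the whole frame would instead write that one value into every column that has missing entries.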

### 3.6 Handling Categorical Variables

In [10]:
# Convert the categorical features in the train and test sets
cc_apps_train_cat_encoding = pd.get_dummies(cc_apps_train_imputed)
cc_apps_test_cat_encoding = pd.get_dummies(cc_apps_test_imputed)

### 3.7 Reindexing Columns

In [11]:
# Reindex the columns of the test set to align with the train set
cc_apps_test_cat_encoding = cc_apps_test_cat_encoding.reindex(
    columns=cc_apps_train_cat_encoding.columns, fill_value=0
)
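
Reindexing matters because `pd.get_dummies` only creates columns for categories that actually occur in the frame it is given, so the train and test encodings can differ. The toy example below (the `color` column and its values are made up) shows a category missing from the test set being added as zeros and an unseen category being dropped:

```python
import pandas as pd

# Toy frames (hypothetical values): "violet" never occurs in the train set
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["red", "violet"]})

train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Align the test columns to the train columns; missing ones are filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```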

### 3.8 Create Training and Test Sets

In [12]:
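# Segregate features and labels into separate variables
# Caution: after get_dummies the last column is the "15_-" indicator of the label,
# while its complement "15_+" remains among the features, so the label leaks into X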
X_train, y_train = (
    cc_apps_train_cat_encoding.iloc[:, :-1].values,
    cc_apps_train_cat_encoding.iloc[:, [-1]].values.ravel(),
)
X_test, y_test = (
    cc_apps_test_cat_encoding.iloc[:, :-1].values,
    cc_apps_test_cat_encoding.iloc[:, [-1]].values.ravel(),
)
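
One detail of this step deserves a check. Because `pd.get_dummies` was applied to the whole frame, the label in column 15 is expanded into two complementary indicator columns, and slicing off only the last one leaves the other among the features. The sketch below reproduces this on a toy frame (all values hypothetical) and shows one possible alternative, which is an assumption rather than the notebook's method: separate the label first, then encode only the features.

```python
import pandas as pd

# Toy frame (hypothetical values); column 15 plays the role of the label
df = pd.DataFrame({
    0: ["a", "b", "a", "b"],
    2: [0.5, 1.5, 4.0, 2.0],
    15: ["+", "-", "+", "-"],
})

# Encoding the whole frame turns the label into two complementary columns
encoded = pd.get_dummies(df)
print(list(encoded.columns))

# Dropping only the last column leaves "15_+" among the features
leaky_features = encoded.iloc[:, :-1]
print("15_+" in leaky_features.columns)

# Alternative: split off the label first, then encode only the features
y = (df[15] == "+").astype(int)  # 1 = approved, an illustrative convention
X = pd.get_dummies(df.drop(columns=[15]))
print(list(X.columns))
```

Note that this sketch codes approval as 1, whereas the notebook's label is the `15_-` indicator; either convention works as long as it is applied consistently.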

### 3.9 Apply a MinMaxScaler

In [13]:
# Instantiate a MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
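
For the default range (0, 1), min-max scaling maps each feature to (x - min) / (max - min), with the minimum and maximum taken from the training set. The NumPy sketch below applies that formula by hand to toy arrays (hypothetical values); unlike `MinMaxScaler`, it does not guard against constant columns.

```python
import numpy as np

# Toy data (hypothetical values): three training rows, one test row
toy_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]])
toy_test = np.array([[2.0, 40.0]])

# Column-wise statistics come from the training data only
col_min = toy_train.min(axis=0)
col_max = toy_train.max(axis=0)

scaled_train = (toy_train - col_min) / (col_max - col_min)
scaled_test = (toy_test - col_min) / (col_max - col_min)
print(scaled_test)
```

Because the test set is transformed with training statistics, a test value outside the training range (the 40.0 here) ends up outside [0, 1].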

## 4 Model Training

In [14]:
# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

## 5 Generating Predictions

In [15]:
# Use logreg to predict instances from the test set and store them
y_pred = logreg.predict(rescaledX_test)

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

[[103   0]
 [  0 125]]
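
In the matrix returned by `confusion_matrix`, rows are true classes and columns are predicted classes, so for a binary problem the entries read as true negatives, false positives, false negatives, and true positives. A small sketch, treating label 1 as the positive class, derives the usual summary metrics from the printed matrix:

```python
import numpy as np

# Confusion matrix as printed above (rows: true class, columns: predicted class)
cm = np.array([[103, 0], [0, 125]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(accuracy, precision, recall)
```

A matrix with no off-diagonal entries is unusual for real-world data and is worth double-checking for label leakage in the features.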


## 6 Hyperparameter Tuning

In [16]:
# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary in which tol and max_iter are keys and the lists of their values are the corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train.ravel())
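
`GridSearchCV` evaluates every combination of the listed values with cross-validation. The sketch below enumerates the candidates in plain Python to show the size of this search; it only mirrors the grid defined above and does not call scikit-learn.

```python
from itertools import product

tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
cv = 5  # number of cross-validation folds used above

# Every (tol, max_iter) pair is one candidate setting
candidates = [dict(tol=t, max_iter=m) for t, m in product(tol, max_iter)]

# Each candidate is fit once per fold
n_fits = len(candidates) * cv
print(len(candidates), n_fits)
```

With the default `refit=True`, one additional fit of the best candidate on the full training set follows the search.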

## 7 Model Summary

In [17]:
# Summarise results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print(
    "Accuracy of logistic regression classifier: ",
    best_model.score(rescaledX_test, y_test),
)

Best: 1.000000 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  1.0
