# Predicting Credit Card Approvals with Logistic Regression

GitHub Repository: https://github.com/skhiearth/Predicting-Credit-Card-Approvals-with-Logistic-Regression

Commercial banks receive many credit card applications, and a large share of them are rejected for reasons such as high loan balances, low income, or too many inquiries on an applicant's credit report. Reviewing these applications manually is tedious, error-prone, and time-consuming. In this project, I build an automatic credit card approval predictor using machine learning, a task that banks also commonly automate.

The dataset used in this project is the [Credit Card Approval dataset](http://archive.ics.uci.edu/ml/datasets/credit+approval) from the UCI Machine Learning Repository.


Citation: Dua, D. and Graff, C. (2019). [UCI Machine Learning Repository](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.

In [1]:
# Import required libraries
import pandas as pd
import numpy as np

# Load dataset
data = pd.read_csv("datasets/cc_approvals.data", header = None)

# Inspect data
data.head()

   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


In [2]:
# Printing summary statistics
print(data.shape)
print("\n")
print(data.describe())
print("\n")
print(data.info())

# Inspecting missing values in the dataset
print("\n")
print("Dataset tail:")
print(data.tail())
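# Missing values are marked with "?"; replace them with NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
data = data.replace("?", np.nan)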

(690, 16)


               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
0     690 non-null object
1     690 non-null object
2     690 non-null float64
3     690 non-null object
4     690 non-null object
5     690 non-null object
6     690 non-null object
7     690 non-null float64
8     690 non-null object
9     690 non-null object
10    690 non-null int64
11    690 non-null object
12    690 non-null object
13    690 non-null object
14    690 no

#### The dataset mixes numerical and non-numerical values. Four columns (2, 7, 10 and 14) are parsed as numeric, and missing values are marked with "?" in the raw file.
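
To see where the "?" placeholders sit, they can be converted to NaN and counted per column. The snippet below is a minimal sketch on a small made-up frame (the `toy` values are hypothetical, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) that mimics the "?" placeholder in the raw file
toy = pd.DataFrame({
    0: ["b", "a", "?", "b"],
    1: ["30.83", "?", "24.50", "27.83"],
    2: [0.0, 4.46, 0.5, 1.54],
})

# Replace the placeholder with NaN so pandas treats it as missing
toy = toy.replace("?", np.nan)

# Count the missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

Applied to `data`, the same `isnull().sum()` call would list the missing count for each of the 16 columns.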

In [3]:
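# Impute missing values in the numeric columns with the column mean
data.fillna(data.mean(numeric_only=True), inplace=True)

# Count the missing values that remain (these are in the non-numeric columns).
# Note: sum(data.isnull()) would add up the column labels 0..15 and always
# return 120, whatever the data contains.
data.isnull().sum().sum()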
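
A caveat about counting missing values: Python's built-in `sum` iterates over a DataFrame's column labels, not its cells, so for 16 columns labelled 0 to 15 it returns 120 regardless of how many values are missing. A minimal sketch on a made-up frame (`toy` is hypothetical) contrasts this with the pandas `.sum()` method:

```python
import numpy as np
import pandas as pd

# Toy frame with 16 columns labelled 0..15 and exactly two missing cells
toy = pd.DataFrame(np.zeros((3, 16)))
toy.iloc[0, 1] = np.nan
toy.iloc[2, 5] = np.nan

# Built-in sum() adds up the column labels: 0 + 1 + ... + 15
label_sum = sum(toy.isnull())

# The DataFrame method sums per column, then across columns
total_missing = int(toy.isnull().sum().sum())

print(label_sum, total_missing)
```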

#### Imputing the missing values in the non-numeric columns and using LabelEncoder to convert them into numeric values:

In [4]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

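for col in data.columns:
    # Check if the column is of object type
    if data[col].dtype == 'object':
        # Impute this column only with its own most frequent value
        # (calling fillna on the whole DataFrame would fill every column with it)
        data[col] = data[col].fillna(data[col].value_counts().index[0])
        # Encode the column's categories as integers
        data[col] = le.fit_transform(data[col])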
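
To make the two steps concrete, here is a minimal sketch on a single made-up column (`col` and its values are hypothetical). It fills the missing entry with the most frequent value and then shows the integer codes that `LabelEncoder` assigns:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical column with one missing entry
col = pd.Series(["u", "y", np.nan, "u", "l"], dtype=object)

# value_counts() ignores NaN and sorts by frequency, so index[0] is the mode
most_frequent = col.value_counts().index[0]
filled = col.fillna(most_frequent)

# LabelEncoder sorts the distinct values and maps them to 0, 1, 2, ...
le = LabelEncoder()
encoded = le.fit_transform(filled)

print(list(le.classes_))
print(encoded.tolist())
```

The integer codes follow the sorted order of the categories, so they carry no meaningful ranking; one-hot encoding is a common alternative when that matters.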

#### Feature Selection, Feature Scaling and Train-Test Split

In [5]:
# Import train_test_split and StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Drop features 11 and 13 (driver's license and ZIP code), which are not
# relevant to a credit card application, then convert the DataFrame to a NumPy array

data_dropped = data.drop([11, 13], axis=1)
data_values = data_dropped.values

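# Segregate features and labels into separate variables
# Note: after the drop, 14 columns remain (13 features plus the label at index 13);
# the slice 0:12 keeps the first 12 features and leaves out the one at index 12
X, y = data_values[:, 0:12], data_values[:, 13]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=10)

# Instantiate StandardScaler, fit it on the training set only,
# and use it to rescale both X_train and X_test
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)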
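
The scaler is fitted on the training set and only applied to the test set, so no information from the test data leaks into the preprocessing. A minimal sketch with made-up one-feature arrays (`X_train_toy` and `X_test_toy` are hypothetical) shows what that means:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data: three training rows and one test row, a single feature
X_train_toy = np.array([[1.0], [2.0], [3.0]])
X_test_toy = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training rows
train_scaled = scaler.fit_transform(X_train_toy)

# transform reuses those training statistics on the test row
test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled training column has zero mean by construction, while the test row is simply expressed in training-set standard deviations.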

#### Implementing Logistic Regression

In [6]:
# Importing Logistic Regression and GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

classifier = LogisticRegression(solver = 'liblinear')

tolerance = [0.01, 0.001, 0.0001]
m_iter = [100, 150, 200]
pen = ['l1', 'l2']
c = [0.001, 0.01, 0.1, 1, 10, 100, 1000]

grid = dict(tol = tolerance, 
            max_iter = m_iter, 
            penalty = pen, 
            C = c)

searcher = GridSearchCV(classifier, 
                        param_grid = grid,
                        cv = 5)

searcher.fit(X_train, y_train)

# Report the best parameters
print("Best CV params", searcher.best_params_)
print(searcher.best_score_*100)

Best CV params {'C': 1, 'max_iter': 100, 'penalty': 'l1', 'tol': 0.01}
86.15136876006441
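
For a sense of how much work this search does, the number of parameter combinations can be counted with scikit-learn's `ParameterGrid`. This is a small sketch that rebuilds the same grid; the variable names `n_candidates` and `n_fits` are only illustrative:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the search above
grid = dict(tol=[0.01, 0.001, 0.0001],
            max_iter=[100, 150, 200],
            penalty=['l1', 'l2'],
            C=[0.001, 0.01, 0.1, 1, 10, 100, 1000])

# Every combination of the four lists is one candidate model
n_candidates = len(ParameterGrid(grid))

# Each candidate is trained once per fold when cv = 5
n_fits = n_candidates * 5

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` additionally trains the best candidate once more on the full training set.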


In [7]:
# Finalising the model using the best parameters found by the grid search
lr = searcher.best_estimator_

# Predicting the Test set results
y_pred = lr.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[32  2]
 [ 4 31]]
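
The confusion matrix can be summarised as accuracy and per-class recall and precision. The sketch below recomputes them from the printed matrix with NumPy. In scikit-learn's convention, rows are true classes and columns are predicted classes. Assuming the target column contained only `+` and `-`, `LabelEncoder` would map `+` (approved) to class 0 and `-` (denied) to class 1, since it sorts the labels.

```python
import numpy as np

# Confusion matrix printed above: rows = true class, columns = predicted class
cm = np.array([[32, 2],
               [4, 31]])

# Share of all test applications that were classified correctly
accuracy = np.trace(cm) / cm.sum()

# Recall: correct predictions divided by the number of true members of each class
recall_per_class = np.diag(cm) / cm.sum(axis=1)

# Precision: correct predictions divided by the number of predictions of each class
precision_per_class = np.diag(cm) / cm.sum(axis=0)

print(round(accuracy, 4))
print(recall_per_class.round(4))
print(precision_per_class.round(4))
```

This works out to 63 correct predictions out of 69 test applications, or about 91.3% accuracy on the held-out set.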
