![Credit card being held in hand](credit_card.jpg)

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected, for reasons such as high loan balances, low income levels, or too many inquiries on an individual's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and pretty much every commercial bank does so nowadays. In this workbook, you will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

### The Data

The data is a small subset of the Credit Card Approval dataset from the UCI Machine Learning Repository showing the credit card applications a bank receives. This dataset has been loaded as a `pandas` DataFrame called `cc_apps`. The last column in the dataset is the target value.

In [29]:
!pip install ydata_profiling


In [30]:
# Exploratory data analysis (EDA)
from ydata_profiling import ProfileReport
from ydata_profiling.utils.cache import cache_file

In [31]:
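# Note: `data` is only assigned in a later cell (In [33]), so this cell relies on
# the notebook having been run out of order.
profile = ProfileReport(data, title="Test", explorative=True)
profile.to_file("profile_2.html")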


# Making the changes

In [32]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, make_scorer, precision_score, recall_score

# Load the dataset; the file has no header row, so the columns are labeled 0 to 13
cc_apps = pd.read_csv("cc_approvals.data", header=None)
cc_apps.head()

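|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | g | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | g | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | g | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | g | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | s | 0 | + |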


In [33]:
data = cc_apps

In [34]:
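# Drop column 4 and convert column 1 to numeric ('?' placeholders become NaN)
data = data.drop(columns=[4])
data[1] = pd.to_numeric(data[1], errors='coerce')
# Collapse '?' and the listed categories into a single 'o' level
data[3] = data[3].replace(['?', 'l'], 'o')
data[6] = data[6].replace(['?', 'j', 'z', 'dd', 'n', 'o'], 'o')
# Encode the target as 1 ('+') and 0 ('-')
data[13] = data[13].replace(['+', '-'], [1, 0])
# Keep only rows within the chosen limits for columns 7, 10, and 12, then drop rows with NaN
data = data[(data[7] <= 15) & (data[10] <= 20) & (data[12] <= 4120)]
data = data.dropna()
# One-hot encode the categorical columns and make every column name a string
data = pd.get_dummies(data)
data.columns = data.columns.astype(str)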
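
To make the effect of these steps concrete, here is a minimal sketch on a made-up four-row frame. The `toy` frame and its values are illustrative assumptions, not rows taken from `cc_apps`. It shows how `errors='coerce'` turns a `'?'` placeholder into `NaN` so that `dropna` can remove the row, and how `get_dummies` expands a categorical column into indicator columns named `<column>_<level>`.

```python
import pandas as pd

# Hypothetical toy frame: column 0 is categorical, column 1 is numeric stored as text
toy = pd.DataFrame({0: ["a", "b", "a", "b"], 1: ["30.83", "?", "24.5", "27.83"]})

# Non-numeric strings such as '?' become NaN instead of raising an error
toy[1] = pd.to_numeric(toy[1], errors="coerce")

# Drop the row that now holds a missing value
toy = toy.dropna()

# One-hot encode the remaining categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(toy)
encoded.columns = encoded.columns.astype(str)

print(encoded.columns.tolist())  # ['1', '0_a', '0_b']
```

Converting the column labels to strings matters because the frame would otherwise mix integer labels (such as `1`) with string labels (such as `0_a`), which scikit-learn estimators do not accept as feature names.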

In [35]:
data.head()

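|   | 1 | 2 | 7 | 10 | 12 | 13 | 0_? | 0_a | 0_b | 3_o | 3_u | 3_y | 5_? | 5_aa | 5_c | 5_cc | 5_d | 5_e | 5_ff | 5_i | 5_j | 5_k | 5_m | 5_q | 5_r | 5_w | 5_x | 6_bb | 6_ff | 6_h | 6_o | 6_v | 8_f | 8_t | 9_f | 9_t | 11_g | 11_p | 11_s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.83 | 0.0 | 1.25 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 58.67 | 4.46 | 3.04 | 6 | 560 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 2 | 24.5 | 0.5 | 1.5 | 0 | 824 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 3 | 27.83 | 1.54 | 3.75 | 5 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| 4 | 20.17 | 5.625 | 1.71 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 |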


# Modeling

In [36]:
X = data.drop(columns=['13'])
y = data['13']

In [37]:
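# Hold out 30% of the rows, then split that hold-out evenly into test and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=325)

X_test, X_val, y_test, y_val = train_test_split(X_test, y_test, test_size=0.5, random_state=3255)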
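
The two-step split above yields roughly 70% training, 15% test, and 15% validation data. As a rough sketch of the arithmetic, the block below repeats the same two calls on placeholder arrays with 635 rows, the size of the cleaned dataset. The names `X_demo`, `y_demo`, and the `random_state=0` seeds are assumptions for illustration; the seed affects which rows land in each set, not how many.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as the cleaned dataset (635)
X_demo = np.arange(635).reshape(-1, 1)
y_demo = np.arange(635) % 2

# First split: keep 70% for training, hold out 30%
X_tr, X_hold, y_tr, y_hold = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Second split: divide the held-out rows in half (test and validation)
X_te, X_va, y_te, y_va = train_test_split(X_hold, y_hold, test_size=0.5, random_state=0)

print(len(X_tr), len(X_te), len(X_va))  # 444 95 96
```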

In [38]:
# Data
print("Total")
print(len(X.index))
print(len(y.index))

# Train and test (overall)
print("Train")
print(len(X_train.index))
print(len(y_train.index))

print("Test")
print(len(X_test.index))
print(len(y_test.index))

print("Validation")
print(len(X_val.index))
print(len(y_val.index))

Total
635
635
Train
444
444
Test
95
95
Validation
96
96


## Models

### 1. Random Forest

In [39]:
model_1 = RandomForestClassifier()
model_1.fit(X_train,y_train)

In [40]:
predictions_1 = model_1.predict(X_test)

In [41]:
print("Confusion Matrix")
print(confusion_matrix(y_test, predictions_1))
print("Accuracy:", round(accuracy_score(y_test, predictions_1), 3))
print("Precision:", round(precision_score(y_test, predictions_1), 3))
print("Sensitivity:", round(recall_score(y_test, predictions_1), 3))  # recall of the positive class
print("Specificity:", round(recall_score(y_test, predictions_1, pos_label=0), 3))  # recall of the negative class
print("F1-Score:", round(f1_score(y_test, predictions_1), 3))

Confusion Matrix
[[49  5]
 [ 6 35]]
Accuracy: 0.884
Precision: 0.875
Sensitivity: 0.854
Specificity: 0.907
F1-Score: 0.864
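
Each metric above follows directly from the four cells of the confusion matrix. The sketch below recomputes them by hand from the matrix printed for the Random Forest on the test set; only NumPy is needed, and the variable names are illustrative. In scikit-learn's layout, rows are the true classes (0, then 1) and columns are the predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows = true class (0, 1), columns = predicted class (0, 1)
cm = np.array([[49, 5],
               [6, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)
sensitivity = tp / (tp + fn)   # recall of the positive class
specificity = tn / (tn + fp)   # recall of the negative class
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(round(accuracy, 3), round(precision, 3), round(sensitivity, 3),
      round(specificity, 3), round(f1, 3))  # 0.884 0.875 0.854 0.907 0.864
```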


In [42]:
predictions_2 = model_1.predict(X_val)

In [43]:
print("Confusion Matrix")
print(confusion_matrix(y_val, predictions_2))
print("Accuracy:", round(accuracy_score(y_val, predictions_2), 3))
print("Precision:", round(precision_score(y_val, predictions_2), 3))
print("Sensitivity:", round(recall_score(y_val, predictions_2), 3))  # recall of the positive class
print("Specificity:", round(recall_score(y_val, predictions_2, pos_label=0), 3))  # recall of the negative class
print("F1-Score:", round(f1_score(y_val, predictions_2), 3))

Confusion Matrix
[[48 11]
 [ 7 30]]
Accuracy: 0.812
Precision: 0.732
Sensitivity: 0.811
Specificity: 0.814
F1-Score: 0.769


### 2. Logistic Regression

In [44]:
model_2 = LogisticRegression()
model_2.fit(X_train,y_train)
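
`StandardScaler` is imported at the top of the notebook but never applied. Logistic regression with the default solver can struggle to converge when features sit on very different scales, as they do here (column 12 runs into the hundreds while the dummy columns are 0/1). One possible variation, sketched below on synthetic data rather than the real features, is to chain the scaler and the classifier in a pipeline. The names `X_demo`, `y_demo`, and `scaled_logreg`, and the `max_iter=1000` setting, are assumptions for illustration, and the score it prints says nothing about the credit card data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card features (not the real data)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# The scaler is fitted on the training rows only, then applied before the classifier
scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logreg.fit(X_tr, y_tr)

demo_accuracy = scaled_logreg.score(X_te, y_te)
print(round(demo_accuracy, 3))
```

Wrapping both steps in a pipeline keeps the scaling statistics from leaking out of the training set, because `fit` only ever sees `X_tr`.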

In [45]:
predictions_3 = model_2.predict(X_test)

In [46]:
print("Confusion Matrix")
print(confusion_matrix(y_test, predictions_3))
print("Accuracy:", round(accuracy_score(y_test, predictions_3), 3))
print("Precision:", round(precision_score(y_test, predictions_3), 3))
print("Sensitivity:", round(recall_score(y_test, predictions_3), 3))  # recall of the positive class
print("Specificity:", round(recall_score(y_test, predictions_3, pos_label=0), 3))  # recall of the negative class
print("F1-Score:", round(f1_score(y_test, predictions_3), 3))

Confusion Matrix
[[47  7]
 [ 6 35]]
Accuracy: 0.863
Precision: 0.833
Sensitivity: 0.854
Specificity: 0.87
F1-Score: 0.843


In [47]:
predictions_4 = model_2.predict(X_val)

In [48]:
print("Confusion Matrix")
print(confusion_matrix(y_val, predictions_4))
print("Accuracy:", round(accuracy_score(y_val, predictions_4), 3))
print("Precision:", round(precision_score(y_val, predictions_4), 3))
print("Sensitivity:", round(recall_score(y_val, predictions_4), 3))  # recall of the positive class
print("Specificity:", round(recall_score(y_val, predictions_4, pos_label=0), 3))  # recall of the negative class
print("F1-Score:", round(f1_score(y_val, predictions_4), 3))

Confusion Matrix
[[47 12]
 [ 7 30]]
Accuracy: 0.802
Precision: 0.714
Sensitivity: 0.811
Specificity: 0.797
F1-Score: 0.759


### 3. Optimized Random Forest

In [49]:
model_opt = RandomForestClassifier(max_depth=6, max_features=None)
model_opt.fit(X_train,y_train)

In [50]:
predictions_5 = model_opt.predict(X_test)

In [51]:
print("Confusion Matrix")
print(confusion_matrix(y_test, predictions_5))
print("Accuracy:", round(accuracy_score(y_test, predictions_5), 3))
print("Precision:", round(precision_score(y_test, predictions_5), 3))
print("Sensitivity:", round(recall_score(y_test, predictions_5), 3))  # recall of the positive class
print("Specificity:", round(recall_score(y_test, predictions_5, pos_label=0), 3))  # recall of the negative class
print("F1-Score:", round(f1_score(y_test, predictions_5), 3))

Confusion Matrix
[[48  6]
 [ 6 35]]
Accuracy: 0.874
Precision: 0.854
Sensitivity: 0.854
Specificity: 0.889
F1-Score: 0.854


In [52]:
predictions_6 = model_opt.predict(X_val)

In [53]:
print("Confusion Matrix")
print(confusion_matrix(y_val, predictions_6))
print("Accuracy:", round(accuracy_score(y_val, predictions_6), 3))
print("Precision:", round(precision_score(y_val, predictions_6), 3))
print("Sensitivity:", round(recall_score(y_val, predictions_6), 3))  # recall of the positive class
print("Specificity:", round(recall_score(y_val, predictions_6, pos_label=0), 3))  # recall of the negative class
print("F1-Score:", round(f1_score(y_val, predictions_6), 3))

Confusion Matrix
[[50  9]
 [ 6 31]]
Accuracy: 0.844
Precision: 0.775
Sensitivity: 0.838
Specificity: 0.847
F1-Score: 0.805


## Optimizing the Models (GridSearch)

In [54]:
# Cross-validation
acc_score = make_scorer(accuracy_score)
pre_score = make_scorer(precision_score)

# A notebook cell displays only its last expression, so the array below holds the precision scores
cross_val_score(estimator=model_1, X=X, y=y, cv=5, scoring=acc_score)
cross_val_score(estimator=model_1, X=X, y=y, cv=5, scoring=pre_score)

array([0.5308642 , 0.96226415, 1.        , 0.64615385, 0.97560976])
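
Because only the last expression in a cell is displayed, the accuracy scores from the first call are computed and then discarded. One way to collect several metrics in a single pass is scikit-learn's `cross_validate`, which accepts a dictionary of scorers. The sketch below uses synthetic data and a smaller forest; `X_demo`, `y_demo`, and `cv_results` are illustrative names, and the scores it prints are unrelated to the credit card results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the credit card features (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# One call evaluates every scorer on the same five folds
cv_results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo,
    y_demo,
    cv=5,
    scoring={"accuracy": "accuracy", "precision": "precision"},
)

# Results are stored under "test_<scorer name>", one value per fold
print(cv_results["test_accuracy"])
print(cv_results["test_precision"])
```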

In [55]:
# Randomized Search
param_dist = {"criterion": ['gini', 'entropy', 'log_loss'],
              "max_depth":[4,6,8,None],
              "max_features":['sqrt', 'log2', None]}

In [56]:
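# The grid has 3 * 4 * 3 = 36 combinations, fewer than n_iter=40, so scikit-learn tries each one once
random_search = RandomizedSearchCV(estimator=model_1, param_distributions=param_dist, n_iter=40, cv=5)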

In [57]:
random_search.fit(X,y)

In [58]:
random_search.best_estimator_
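
Besides `best_estimator_`, a fitted search also exposes `best_params_` (the winning settings) and `best_score_` (their mean cross-validated score). The sketch below shows how these might be read, and it fits the search on the training portion only so that the held-out rows stay unseen during tuning. That differs from the cell above, which fits on all of `X` and `y`. The data is synthetic, and `param_dist_demo`, `search`, and the reduced grid are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic stand-in for the credit card features (not the real data)
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Hypothetical, reduced grid: 3 * 2 = 6 combinations, of which 4 are sampled
param_dist_demo = {"max_depth": [4, 6, None], "max_features": ["sqrt", None]}

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=30, random_state=0),
    param_distributions=param_dist_demo,
    n_iter=4,
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)  # tune on the training portion only

print(search.best_params_)                 # the selected hyperparameters
print(round(search.best_score_, 3))        # mean cross-validated accuracy of that setting
print(round(search.score(X_te, y_te), 3))  # accuracy of the refitted model on held-out rows
```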

In [59]:
# Solution
best_score = 0.854