# Credit Qualification

This tool provides information to help determine which clients should be granted a bank loan and which should not. It was built with real-world data, so the dataset comes with the following data privacy disclaimer.

Relevant Information:

    This file concerns credit card applications.  All attribute names
    and values have been changed to meaningless symbols to protect
    confidentiality of the data.

## Dataset description

First, we will take a look at the different variables in the dataset, as well as their data types.

Number of Attributes: 15 + class attribute

Attribute Information:

    A1:	b, a.
    A2:	continuous.
    A3:	continuous.
    A4:	u, y, l, t.
    A5:	g, p, gg.
    A6:	c, d, cc, i, j, k, m, r, q, w, x, e, aa, ff.
    A7:	v, h, bb, j, n, z, dd, ff, o.
    A8:	continuous.
    A9:	t, f.
    A10:	t, f.
    A11:	continuous.
    A12:	t, f.
    A13:	g, p, s.
    A14:	continuous.
    A15:	continuous.
    A16: +,-         (class attribute)




Using this information we can identify our target variable (**A16**). It is a discrete variable with only two possible values, "+" or "-", so we are facing a binary **Classification Problem**.

Missing Attribute Values:
    37 cases (5%) have one or more missing values.  The missing
    values from particular attributes are:

    A1:  12
    A2:  12
    A4:   6
    A5:   6
    A6:   9
    A7:   9
    A14: 13

Around 5% of the cases have at least one missing value, so we will have to deal with those shortly. For now, let's take a look at the data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
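# Note: read_csv treats the first line of the file as a header row.
# If crx3.data has no header line, pass header=None (with names=columns)
# so that the first record is kept as data.
df = pd.read_csv("data/crx3.data")
columns = ["A1", "A2", "A3", "A4", "A5", "A6", "A7", "A8",
           "A9", "A10", "A11", "A12", "A13", "A14", "A15", "A16"]
df.columns = columns
df.head()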

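|   | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 |
|---|----|----|----|----|----|----|----|----|----|-----|-----|-----|-----|-----|-----|-----|
| 0 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43.0 | 560 | + |
| 1 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280.0 | 824 | + |
| 2 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100.0 | 3 | + |
| 3 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120.0 | 0 | + |
| 4 | b | 32.08 | 4.0 | u | g | m | v | 2.5 | t | f | 0 | t | g | 360.0 | 0 | + |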
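
One detail worth checking when loading a file like this: by default `pd.read_csv` treats the first line as a header. The sketch below uses a tiny made-up sample (not the real file, and shortened to four columns) to show the difference between the default behaviour and `header=None` with explicit `names`. Whether `crx3.data` actually contains a header line is not stated here, so treat this as something to verify rather than a confirmed problem.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same comma-separated layout.
sample = "b,30.83,0.0,+\na,58.67,4.46,+\n"
names = ["A1", "A2", "A3", "A16"]

# Default: the first line is consumed as the header, so one data row is lost.
df_default = pd.read_csv(io.StringIO(sample))
df_default.columns = names

# header=None keeps every line as data and applies the given names.
df_named = pd.read_csv(io.StringIO(sample), header=None, names=names)

print(len(df_default), len(df_named))
```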


At first glance, we can see that the numerical variables differ significantly in scale.

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 689 entries, 0 to 688
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A1      677 non-null    object 
 1   A2      677 non-null    float64
 2   A3      689 non-null    float64
 3   A4      683 non-null    object 
 4   A5      683 non-null    object 
 5   A6      680 non-null    object 
 6   A7      680 non-null    object 
 7   A8      689 non-null    float64
 8   A9      689 non-null    object 
 9   A10     689 non-null    object 
 10  A11     689 non-null    int64  
 11  A12     689 non-null    object 
 12  A13     689 non-null    object 
 13  A14     676 non-null    float64
 14  A15     689 non-null    int64  
 15  A16     689 non-null    object 
dtypes: float64(4), int64(2), object(10)
memory usage: 86.2+ KB


From the non-null counts above we can check which columns have the most missing values: A14 (13), followed by A1 and A2 (12 each).
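
The same check can be done directly instead of reading it off `df.info()`. The snippet below is a minimal sketch on a made-up toy frame (the values are not from the dataset); on the real data the same two expressions would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are made up).
toy = pd.DataFrame({
    "A1": ["a", None, "b", "b"],
    "A2": [58.67, 24.5, np.nan, np.nan],
    "A3": [4.46, 0.5, 1.54, 5.625],
})

# Missing values per column, largest first.
missing_per_column = toy.isna().sum().sort_values(ascending=False)

# Share of rows with at least one missing value.
rows_with_missing = toy.isna().any(axis=1).mean()

print(missing_per_column)
print(f"{rows_with_missing:.0%} of rows have a missing value")
```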

In [4]:
df.describe()

|       | A2 | A3 | A8 | A11 | A14 | A15 |
|-------|----|----|----|-----|-----|-----|
| count | 677.0 | 689.0 | 689.0 | 689.0 | 676.0 | 689.0 |
| mean  | 31.569261 | 4.765631 | 2.224819 | 2.402032 | 183.988166 | 1018.862119 |
| std   | 11.96667 | 4.97847 | 3.348739 | 4.86618 | 173.934087 | 5213.743149 |
| min   | 13.75 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25%   | 22.58 | 1.0 | 0.165 | 0.0 | 74.5 | 0.0 |
| 50%   | 28.42 | 2.75 | 1.0 | 0.0 | 160.0 | 5.0 |
| 75%   | 38.25 | 7.25 | 2.625 | 3.0 | 277.0 | 396.0 |
| max   | 80.25 | 28.0 | 28.5 | 67.0 | 2000.0 | 100000.0 |


Using `describe` we can confirm that these numerical variables have notable differences in scale (A8 tops out at 28.5, while A15 reaches 100000), which is something we'll have to keep in mind.

## Data Processing
Our first step is to separate the data into a training set and a test set.

In [5]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='A16')
# Encode the target variable: "+" becomes 1 and "-" becomes 0
y = df['A16'].map({"+": 1, "-": 0})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=123)
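
As a side note, `train_test_split` also accepts a `stratify` argument that keeps the class ratio the same in both parts. The split above does not use it; the sketch below only illustrates the option on synthetic data (all names ending in `_demo` are made up for the example).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 samples, 40% positive (not the real data).
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 40 + [0] * 60)

# stratify=y_demo keeps the 40/60 class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=123, stratify=y_demo)

print(y_tr.mean(), y_te.mean())
```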

In [6]:
# Categorical variable names
cat_var = X_train.select_dtypes(include=['object', 'bool']).columns
# Numerical variable names
num_var = X_train.select_dtypes(np.number).columns

### Data Cleaning
Now we will impute the missing values.

In [7]:
from sklearn.impute import SimpleImputer  
from sklearn.impute import KNNImputer
from sklearn.compose import ColumnTransformer

For this project we used a KNN imputer for the numerical variables and a most-frequent imputer for the categorical variables.

In [8]:
# Creating both numerical and categorical imputer
t1 = ('num_imputer', KNNImputer(n_neighbors=5), num_var)
t2 = ('cat_imputer', SimpleImputer(strategy='most_frequent'),
      cat_var)

column_transformer_cleaning = ColumnTransformer(
    transformers=[t1, t2], remainder='passthrough')

column_transformer_cleaning.fit(X_train)

Train_transformed = column_transformer_cleaning.transform(X_train)
Test_transformed = column_transformer_cleaning.transform(X_test)

# Here we update the order in which variables are located in the dataframe:
# after transforming, all numerical variables come first,
# followed by all the categorical variables.
var_order = num_var.tolist() + cat_var.tolist()

# And finally we recreate the Data Frames
X_train_clean = pd.DataFrame(Train_transformed, columns=var_order)
X_test_clean = pd.DataFrame(Test_transformed, columns=var_order)
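
To make the imputation step easier to follow, here is a minimal sketch of the same idea on a made-up three-column frame. The column names (`kind`, `amount`, `score`) and `n_neighbors=2` are assumptions chosen for the toy example, not the settings used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer

# Made-up data with one numerical and one categorical gap.
toy = pd.DataFrame({
    "kind": ["u", "u", np.nan, "y"],
    "amount": [1.0, np.nan, 3.0, 5.0],
    "score": [10.0, 20.0, 30.0, 50.0],
})
num_cols = ["amount", "score"]
cat_cols = ["kind"]

imputer = ColumnTransformer(transformers=[
    ("num_imputer", KNNImputer(n_neighbors=2), num_cols),
    ("cat_imputer", SimpleImputer(strategy="most_frequent"), cat_cols),
])

# Output columns follow the transformer order (numerical first, then
# categorical), not the original column order, so names are rebuilt to match.
filled = pd.DataFrame(imputer.fit_transform(toy), columns=num_cols + cat_cols)
print(filled)
```

The missing `amount` is filled with the mean of its two nearest rows (by `score`), and the missing `kind` with the most frequent category.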

### Normalizing and encoding data
The next step is to normalize the numerical data and encode the categorical variables (one-hot encoding, also known as creating "dummy" variables).

In [9]:
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder

In [10]:
# We obtain the different values present in each categorical variable
dif_values = [df[column].dropna().unique() for column in cat_var]

In [11]:
# Now we create the transformers
t_norm = ("normalizer", MinMaxScaler(feature_range=(0, 1)), num_var)
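# As the dataset isn't huge, we request a dense (non-sparse) output.
# Note: scikit-learn 1.2+ names this argument sparse_output; `sparse` was removed in 1.4.
t_nominal = ("onehot", OneHotEncoder(
    sparse=False, categories=dif_values), cat_var)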

In [12]:
column_transformer_norm_enc = ColumnTransformer(transformers=[t_norm, t_nominal],
                                                remainder='passthrough')

column_transformer_norm_enc.fit(X_train_clean)

ColumnTransformer(remainder='passthrough',
                  transformers=[('normalizer', MinMaxScaler(),
                                 Index(['A2', 'A3', 'A8', 'A11', 'A14', 'A15'], dtype='object')),
                                ('onehot',
                                 OneHotEncoder(categories=[array(['a', 'b'], dtype=object),
                                                           array(['u', 'y', 'l'], dtype=object),
                                                           array(['g', 'p', 'gg'], dtype=object),
                                                           array(['q', 'w', 'm', 'r', 'cc', 'k', 'c', 'd', 'x', 'i', 'e', 'aa', 'ff',
       'j'], dtype=object),
                                                           array(['h', 'v', 'bb', 'ff', 'j', 'z', 'o', 'dd', 'n'], dtype=object),
                                                           array(['t', 'f'], dtype=object),
                                                           array(['t', 'f'], dtype

In [13]:
X_train_transformed = column_transformer_norm_enc.transform(X_train_clean)
X_test_transformed = column_transformer_norm_enc.transform(X_test_clean)

With these transformations, our data preprocessing is complete.
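
For reference, the two transformations can be reproduced on a few made-up values. Min-max scaling maps each value `x` to `(x - min) / (max - min)`, and one-hot encoding turns each category into its own 0/1 column. The numbers and categories below are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Made-up numerical column: min-max scaling maps x to (x - min) / (max - min).
x = np.array([[0.0], [5.0], [396.0], [1000.0]])
by_hand = (x - x.min()) / (x.max() - x.min())
by_sklearn = MinMaxScaler().fit_transform(x)

# Made-up categorical column with a fixed category list.
kinds = np.array([["g"], ["p"], ["g"]], dtype=object)
encoder = OneHotEncoder(categories=[["g", "p", "s"]])
# .toarray() densifies the default sparse result.
one_hot = encoder.fit_transform(kinds).toarray()

print(by_hand.ravel())
print(one_hot)
```

Passing the category list explicitly means a category that never appears in the fitted data (here `s`) still gets its own column.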

## Model Selection
Tuning a model against the test set leaks information and makes our performance estimates overly optimistic. That is why we split the training set further into a validation training set and a validation test set.

In [14]:
X_val_train, X_val_test, y_val_train, y_val_test = train_test_split(
    X_train_transformed, y_train, test_size=0.20, random_state=123)
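
A single validation split of this size can give noisy estimates. One common alternative, not used in this project, is k-fold cross-validation. The sketch below shows it on synthetic data; `X_demo`, `y_demo`, the five folds and the fixed `random_state` are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training data.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, random_state=123)

# Five folds: each sample is used for validation exactly once.
scores = cross_val_score(
    DecisionTreeClassifier(max_depth=6, random_state=123),
    X_demo, y_demo, cv=5, scoring="accuracy")

print(scores.mean(), scores.std())
```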

As our performance metric, we will use accuracy.

In [15]:
from sklearn.metrics import accuracy_score

Now begins the process of training different models to see which one performs best and with which hyperparameters.
For this project, we selected K-Nearest Neighbors (weighted and not weighted), Decision Tree Classifier and Logistic Regression.

### Decision Tree Classifier

In [16]:
from sklearn.tree import DecisionTreeClassifier

best_Tree_model = None
best_AC_T = 0

for i in range(1, 100):
    model_T = DecisionTreeClassifier(max_depth=i)
    model_T.fit(X_val_train, y_val_train)
    y_pred_T = model_T.predict(X_val_test)
    AC_Tree = accuracy_score(y_val_test, y_pred_T)

    if AC_Tree > best_AC_T:
        best_AC_T = AC_Tree
        best_Tree_model = model_T

print('The best Decision Tree Classifier had a depth of: ',
      best_Tree_model.max_depth)
print('With an Accuracy of: ', round(best_AC_T, 3))

The best Decision Tree Classifier had a depth of:  6    
With an Accuracy of:  0.928
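
Two details of this search are worth keeping in mind. `max_depth` is only an upper bound, so once it exceeds the depth the tree reaches on its own, larger values may fit the same tree again. Also, no `random_state` is set in the loop, so results can vary slightly between runs. The sketch below illustrates the first point with `get_depth()` on synthetic data (the data and the fixed seed are assumptions for the example).

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (not the credit dataset).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, random_state=123)

# max_depth is only an upper bound; get_depth() reports what was actually grown.
shallow = DecisionTreeClassifier(max_depth=3, random_state=123).fit(X_demo, y_demo)
deep = DecisionTreeClassifier(max_depth=99, random_state=123).fit(X_demo, y_demo)

print(shallow.get_depth(), deep.get_depth())
```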

### Distance Weighted K-Nearest Neighbors

In [17]:
from sklearn import neighbors

best_KNN_D = None
best_AC_KNN_D = 0

for i in range(1, 100):
    KNN_D_model = neighbors.KNeighborsClassifier(
        n_neighbors=i, weights='distance')
    KNN_D_model.fit(X_val_train, y_val_train)
    y_pred_KNN_D = KNN_D_model.predict(X_val_test)

    AC_KNN_D = accuracy_score(y_val_test, y_pred_KNN_D)

    if AC_KNN_D > best_AC_KNN_D:
        best_AC_KNN_D = AC_KNN_D
        best_KNN_D = KNN_D_model

print('The best distance weighted KNN model had: ',
      best_KNN_D.n_neighbors, ' neighbors')
print('With an accuracy of: ', round(best_AC_KNN_D, 3))

The best distance weighted KNN model had:  40  neighbors  
With an accuracy of:  0.919

### Not Weighted K-Nearest Neighbors

In [18]:
from sklearn import neighbors

best_KNN_U = None
best_AC_KNN_U = 0

for i in range(1, 100):
    KNN_U_model = neighbors.KNeighborsClassifier(
        n_neighbors=i, weights='uniform')
    KNN_U_model.fit(X_val_train, y_val_train)
    y_pred_KNN_U = KNN_U_model.predict(X_val_test)

    AC_KNN_U = accuracy_score(y_val_test, y_pred_KNN_U)

    if AC_KNN_U > best_AC_KNN_U:
        best_AC_KNN_U = AC_KNN_U
        best_KNN_U = KNN_U_model

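# Note: the output recorded below was produced while this cell still printed
# best_KNN_D / best_AC_KNN_D, so it may repeat the weighted result.
print('The best not weighted KNN model had: ',
      best_KNN_U.n_neighbors, ' neighbors')
print('With an accuracy of: ', round(best_AC_KNN_U, 3))

The best not weighted KNN model had: 40 neighbors  
With an accuracy of: 0.919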
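
The difference between the two KNN variants is easiest to see on a tiny made-up example where they disagree. With `weights='uniform'` every neighbour gets one vote; with `weights='distance'` each vote is weighted by the inverse of its distance to the query point. The data below is invented for illustration only.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Made-up 1-D data: one class-0 point near the query, two class-1 points farther away.
X_demo = np.array([[0.0], [1.0], [1.1]])
y_demo = np.array([0, 1, 1])
query = np.array([[0.1]])

uniform = KNeighborsClassifier(n_neighbors=3, weights="uniform").fit(X_demo, y_demo)
distance = KNeighborsClassifier(n_neighbors=3, weights="distance").fit(X_demo, y_demo)

# Uniform: two of the three neighbours are class 1, so the majority wins.
# Distance: the class-0 point at distance 0.1 outweighs the two farther points.
print(uniform.predict(query), distance.predict(query))
```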

### Logistic Regression

In [19]:
from sklearn import linear_model

# Note: penalty='none' (a string) is the older spelling; scikit-learn 1.2+ expects penalty=None.
LogR_model = linear_model.LogisticRegression(
    max_iter=20000, penalty='none', fit_intercept=True, random_state=123)
LogR_model.fit(X_val_train, y_val_train)
y_pred_LogR = LogR_model.predict(X_val_test)

AC_LogR = accuracy_score(y_val_test, y_pred_LogR)

print("Logistic Regression model had the following coefficients: \n", LogR_model.coef_)
print("Accuracy: ", round(AC_LogR, 5))

Logistic Regression model had the following coefficients: 
 [[  0.54512457  -1.13751313   1.78327063   9.05423909  -5.9583447
   15.93291865  -0.08989456  -0.28535717   0.07024966  -0.4455014
    0.           0.07024966  -0.4455014    0.          -0.51642635
    0.33065104  -0.51281162   4.16481675   1.78493893  -0.63033784
   -0.07534405   0.17111417   1.93482775   0.35350374   2.31896792
   -0.43179881  -4.39978618  -4.86756717   1.65236382   1.37986192
   -0.48524295   3.89445997   5.78178813  -3.8028768  -12.15040702
   -1.31685497   4.67165616   1.86432309  -2.23957482   0.15237269
   -0.52762442  -0.12488661  -0.25036513  -1.05918775  -1.03367138
    1.71760741]]
Accuracy:  0.9009
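
To see how coefficients like these turn into a prediction, the sketch below recomputes the predicted probability by hand: a linear score `z = X w + b` passed through the sigmoid `1 / (1 + exp(-z))`. It uses synthetic data and scikit-learn's default settings (which apply an L2 penalty), so it is an illustration of the mechanism rather than a reproduction of the model above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary data (not the credit dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=123)

# Default settings are used here, unlike the unpenalized model in the notebook.
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Linear score z = X w + b, then the sigmoid turns it into P(y = 1).
z = X_demo @ clf.coef_.ravel() + clf.intercept_[0]
p_by_hand = 1.0 / (1.0 + np.exp(-z))

p_sklearn = clf.predict_proba(X_demo)[:, 1]
print(np.allclose(p_by_hand, p_sklearn))
```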


## Model Training
On the validation split, the **Decision Tree Classifier** came out ahead with an accuracy of 0.928, although the margin was small (0.919 reported for the KNN variants and 0.901 for Logistic Regression). Now it's time to train that model on all of our training data, keeping the winning depth of 6, and evaluate it on the held-out test set to obtain a realistic estimate of its performance.

In [20]:
model = DecisionTreeClassifier(max_depth=6)
model.fit(X_train_transformed, y_train)
y_pred = model.predict(X_test_transformed)
accuracy = accuracy_score(y_test, y_pred)

print('Model Accuracy: ', round(accuracy, 3))

Model Accuracy:  0.804
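
Accuracy alone does not show which kind of mistake the model makes, and for a credit decision approving a bad applicant and rejecting a good one are not the same error. A confusion matrix separates the two. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`), not the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = approved ("+"), 0 = rejected ("-").
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()

print(acc)
print(cm)
```

On the real data, the same call would take `y_test` and `y_pred` from the cell above.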

## Conclusion

The tool we developed to help determine which clients can access a bank loan has an estimated accuracy of about 80% on the held-out test set (0.804), compared with 0.928 on the validation split.