<h1 style="color: LightSlateGray; font-size:250%; ">Modules</h1>
<ol style="padding: 10px;">
    <li><code>NumPy</code>: NumPy (Numerical Python) adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions that operate on these arrays.</li>
    <li><code>Pandas</code>: Pandas (Python Data Analysis Library) is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.</li>
    <li><code>scikit-learn</code>: scikit-learn is a machine learning library for Python. It features various classification, regression and clustering algorithms, including support vector machines, random forests, gradient boosting and k-means.</li>
</ol>

<h1 style="color: Tomato;  font-family:courier; font-size:300%;">Credit Card Approval</h1>
<h1 style="padding-bottom: 15px;">Dataset.</h1>
The data is taken from the <a href="https://archive.ics.uci.edu/ml/datasets/credit+approval">UCI Machine Learning Repository</a>.

In [121]:
import pandas as pd
cc_apps = pd.read_csv("cc_approvals.data", header=None)
cc_apps.head()

   0      1      2  3  4  5  6     7  8  9  10  11  12   13   14  15
0  b  30.83    0.0  u  g  w  v  1.25  t  t   1   f   g  202    0   +
1  a  58.67   4.46  u  g  q  h  3.04  t  t   6   f   g   43  560   +
2  a   24.5    0.5  u  g  q  h   1.5  t  f   0   f   g  280  824   +
3  b  27.83   1.54  u  g  w  v  3.75  t  t   5   t   g  100    3   +
4  b  20.17  5.625  u  g  w  v  1.71  t  f   0   f   s  120    0   +


# Understanding data.
The file has no header row, so pandas labels the columns 0 to 15. The columns in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>.
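
As a minimal sketch, the names listed above could be attached to a copy of the frame so that later cells read more naturally. The two-row `demo` frame and the `COLUMN_NAMES` list are illustrative and not part of the original notebook, which keeps the integer labels throughout; the one-to-one mapping of names onto columns 0 to 15 is an assumption based on the order of the list.

```python
import pandas as pd

# Names from the list above, assumed to map one-to-one onto columns 0..15.
COLUMN_NAMES = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Two rows modelled on the head() output, standing in for the real file.
demo = pd.DataFrame([
    ["b", 30.83, 0.0, "u", "g", "w", "v", 1.25, "t", "t", 1, "f", "g", "202", 0, "+"],
    ["a", 58.67, 4.46, "u", "g", "q", "h", 3.04, "t", "t", 6, "f", "g", "43", 560, "+"],
])

# Rename a copy so code that indexes by integer label keeps working on `demo`.
named = demo.copy()
named.columns = COLUMN_NAMES
print(named[["Age", "Income", "ApprovalStatus"]])
```

Renaming a copy rather than the frame itself matters here because later cells refer to columns by number, for example when dropping columns 11 and 13.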

In [122]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 non-null    object 
 9   9       690 non-null    object 
 10  10      690 non-null    int64  
 11  11      690 non-null    object 
 12  12      690 non-null    object 
 13  13      690 non-null    object 
 14  14      690 non-null    int64  
 15  15      690 non-null    object 
dtypes: float64(2), int64(2), object(12)
memory usage: 86.4+ KB


In [124]:
cc_apps.tail(20)

     0      1       2  3  4   5   6      7  8  9  10  11  12   13   14  15
670  b  47.17   5.835  u  g   w   v    5.5  f  f   0   f   g  465  150   -
671  b  25.83  12.835  u  g  cc   v    0.5  f  f   0   f   g    0    2   -
672  a  50.25   0.835  u  g  aa   v    0.5  f  f   0   t   g  240  117   -
673  ?   29.5     2.0  y  p   e   h    2.0  f  f   0   f   g  256   17   -
674  a  37.33     2.5  u  g   i   h   0.21  f  f   0   f   g  260  246   -
675  a  41.58    1.04  u  g  aa   v  0.665  f  f   0   f   g  240  237   -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12   t   g  129    3   -
677  b  19.42    7.25  u  g   m   v   0.04  f  t   1   f   g  100    1   -
678  a  17.92   10.21  u  g  ff  ff    0.0  f  f   0   f   g    0   50   -
679  a  20.08    1.25  u  g   c   v    0.0  f  f   0   f   g    0    0   -


# Preparing Data.
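Handling missing values in the dataset. Missing entries are marked with <code>?</code> (see row 673 above), so they are first replaced with <code>NaN</code>.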

In [125]:
import numpy as np
cc_apps = cc_apps.replace('?', np.nan)
cc_apps.tail(20)

       0      1       2  3  4   5   6      7  8  9  10  11  12   13   14  15
670    b  47.17   5.835  u  g   w   v    5.5  f  f   0   f   g  465  150   -
671    b  25.83  12.835  u  g  cc   v    0.5  f  f   0   f   g    0    2   -
672    a  50.25   0.835  u  g  aa   v    0.5  f  f   0   t   g  240  117   -
673  NaN   29.5     2.0  y  p   e   h    2.0  f  f   0   f   g  256   17   -
674    a  37.33     2.5  u  g   i   h   0.21  f  f   0   f   g  260  246   -
675    a  41.58    1.04  u  g  aa   v  0.665  f  f   0   f   g  240  237   -
676    a  30.58  10.665  u  g   q   h  0.085  f  t  12   t   g  129    3   -
677    b  19.42    7.25  u  g   m   v   0.04  f  t   1   f   g  100    1   -
678    a  17.92   10.21  u  g  ff  ff    0.0  f  f   0   f   g    0   50   -
679    a  20.08    1.25  u  g   c   v    0.0  f  f   0   f   g    0    0   -
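
An alternative to replacing `?` after loading is to let `pd.read_csv` treat it as missing through its `na_values` parameter. The sketch below uses two made-up inline rows instead of the real file. A side effect worth noting is that a numeric column containing `?` then parses as `float64` rather than `object`, which would change how later cells treat it.

```python
import io
import pandas as pd

# Made-up stand-in for cc_approvals.data: the second row has '?' in columns 0 and 1.
raw = io.StringIO(
    "b,30.83,0.0,u,g\n"
    "?,?,2.0,y,p\n"
)

# na_values tells the parser which strings to read as NaN.
demo = pd.read_csv(raw, header=None, na_values="?")

print(demo.isnull().sum())  # missing values per column
print(demo.dtypes)          # column 1 is float64: '?' no longer forces a string column
```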


In [126]:
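# Fill gaps in numeric columns with the column mean.
# numeric_only=True skips the object columns, which recent pandas versions refuse to average.
cc_apps.fillna(cc_apps.mean(numeric_only=True), inplace=True)
cc_apps.isnull().sum()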

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64

In [127]:
# Total number of NaN values in the dataset
cc_apps.isnull().values.sum()

67

In [128]:
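for col in cc_apps.columns:
    if cc_apps[col].dtypes == 'object':
        # value_counts() sorts by frequency, so index[0] is this column's most frequent value.
        # Note: fillna is applied to the whole frame, so the first object column's
        # most frequent value fills every remaining NaN, including those in other columns.
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])
cc_apps.isnull().sum()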

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64
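
`DataFrame.fillna` with a scalar fills every NaN in the frame, not only those in `col`. If the intent is to impute each categorical column with its own most frequent value, one way to write that is sketched below on a small made-up frame (`demo` and its values are illustrative). Using this variant in the notebook would change the encoded values shown in the later outputs.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps in two categorical columns and one numeric column.
demo = pd.DataFrame({
    0: ["b", "a", "b", np.nan],
    3: ["u", "y", np.nan, "u"],
    7: [1.25, 3.04, 1.5, 3.75],
})

for col in demo.columns:
    # Treat every non-numeric column as categorical.
    if not pd.api.types.is_numeric_dtype(demo[col]):
        # value_counts() sorts by frequency, so index[0] is the most frequent value.
        most_frequent = demo[col].value_counts().index[0]
        # Assign back to the single column so other columns stay untouched.
        demo[col] = demo[col].fillna(most_frequent)

print(demo)
```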

# Converting non-numeric data into numeric.


In [129]:
from sklearn.preprocessing import LabelEncoder

# Encode every object (string) column as integers; the encoder is refit for each column.
L = LabelEncoder()
for col in cc_apps.columns.values:
    if cc_apps[col].dtypes == 'object':
        cc_apps[col] = L.fit_transform(cc_apps[col])
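
`LabelEncoder` assigns integers in sorted order of the distinct values, and each call to `fit_transform` overwrites the previous fit, so a single shared encoder only remembers the last column. A small sketch, assuming one encoder per column is kept in a dictionary (the `encoders` dict and the `demo` frame are illustrative), makes each mapping inspectable:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up stand-in for two of the categorical columns.
demo = pd.DataFrame({
    0: ["b", "a", "a", "b"],
    15: ["+", "+", "-", "-"],
})

encoders = {}  # one fitted encoder per column, so mappings can be looked up later
for col in demo.columns:
    enc = LabelEncoder()
    demo[col] = enc.fit_transform(demo[col])
    encoders[col] = enc

# classes_ lists the original values; the position in the list is the integer code.
for col, enc in encoders.items():
    print(col, dict(zip(enc.classes_, range(len(enc.classes_)))))
```

With this ordering `+` maps to 0 and `-` to 1, which is consistent with the last column of the encoded output, where the approved rows at the top of the file show 0.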

In [130]:
cc_apps.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    int32  
 1   1       690 non-null    int32  
 2   2       690 non-null    float64
 3   3       690 non-null    int32  
 4   4       690 non-null    int32  
 5   5       690 non-null    int32  
 6   6       690 non-null    int32  
 7   7       690 non-null    float64
 8   8       690 non-null    int32  
 9   9       690 non-null    int32  
 10  10      690 non-null    int64  
 11  11      690 non-null    int32  
 12  12      690 non-null    int32  
 13  13      690 non-null    int32  
 14  14      690 non-null    int64  
 15  15      690 non-null    int32  
dtypes: float64(2), int32(12), int64(2)
memory usage: 54.0 KB


In [131]:
cc_apps.head()

   0    1      2  3  4   5  6     7  8  9  10  11  12  13   14  15
0  1  156    0.0  2  1  13  8  1.25  1  1   1   0   0  68    0   0
1  0  328   4.46  2  1  11  4  3.04  1  1   6   0   0  11  560   0
2  0   89    0.5  2  1  11  4   1.5  1  0   0   0   0  96  824   0
3  1  125   1.54  2  1  13  8  3.75  1  1   5   1   0  31    3   0
4  1   43  5.625  2  1  13  8  1.71  1  0   0   0   2  37    0   0


In [132]:
# Drop columns 11 (DriversLicense) and 13 (ZipCode)
cc_apps = cc_apps.drop([11, 13], axis=1)
cc_apps

      0    1       2   3   4   5   6     7   8   9  10  12   14  15
  0   1  156   0.000   2   1  13   8  1.25   1   1   1   0    0   0
  1   0  328   4.460   2   1  11   4  3.04   1   1   6   0  560   0
  2   0   89   0.500   2   1  11   4  1.50   1   0   0   0  824   0
  3   1  125   1.540   2   1  13   8  3.75   1   1   5   0    3   0
  4   1   43   5.625   2   1  13   8  1.71   1   0   0   2    0   0
...  ..  ...     ...  ..  ..  ..  ..   ...  ..  ..  ..  ..  ...  ..
685   1   52  10.085   3   3   5   4  1.25   0   0   0   0    0   1
686   0   71   0.750   2   1   2   8  2.00   0   1   2   0  394   1
687   0   97  13.500   3   3   6   3  2.00   0   1   1   0    1   1
688   1   20   0.205   2   1   0   8  0.04   0   0   0   0  750   1


In [133]:
# Convert the DataFrame to a NumPy array (all values become floats)
cc_apps = cc_apps.values
cc_apps

array([[1.000e+00, 1.560e+02, 0.000e+00, ..., 0.000e+00, 0.000e+00,
        0.000e+00],
       [0.000e+00, 3.280e+02, 4.460e+00, ..., 0.000e+00, 5.600e+02,
        0.000e+00],
       [0.000e+00, 8.900e+01, 5.000e-01, ..., 0.000e+00, 8.240e+02,
        0.000e+00],
       ...,
       [0.000e+00, 9.700e+01, 1.350e+01, ..., 0.000e+00, 1.000e+00,
        1.000e+00],
       [1.000e+00, 2.000e+01, 2.050e-01, ..., 0.000e+00, 7.500e+02,
        1.000e+00],
       [1.000e+00, 1.970e+02, 3.375e+00, ..., 0.000e+00, 0.000e+00,
        1.000e+00]])

# Machine Learning Start.
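Machine learning includes three types of learning: supervised, unsupervised and reinforcement.<br><br>
Regression vs. classification: supervised learning covers regression (predicting a continuous value) and classification (predicting a category). Predicting the approval status is a binary classification task.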


## Split the data into train and test sets.

In [134]:
from sklearn.model_selection import train_test_split

# Segregate features and labels into separate variables
X,Y = cc_apps[:,0:13], cc_apps[:,13]

# Split into train and test sets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.33, random_state=42)
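
With 690 rows, `test_size=0.33` leaves 228 rows for testing and 462 for training, because scikit-learn rounds the test count up. The sketch below checks this on a placeholder array of the same shape; the zeros are stand-ins, not the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder with the same shape as the prepared array: 690 rows, 13 features + 1 label.
data = np.zeros((690, 14))
X, Y = data[:, 0:13], data[:, 13]

X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.33, random_state=42
)
print(X_train.shape, X_test.shape)
```

Passing `stratify=Y` would additionally keep the class ratio the same in both parts; the notebook does not do this.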

## Preprocessing

In [135]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
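
`MinMaxScaler` maps each feature to `(x - min) / (max - min)` using the minimum and maximum seen during `fit`. Fitting on the training split only, as above, avoids leaking test information, but it also means that test values outside the training range land outside [0, 1]. A small sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature data; the test set has a value above the training maximum.
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train = scaler.fit_transform(X_train)
scaled_test = scaler.transform(X_test)

# Same computation by hand, using the training minimum and maximum.
lo, hi = X_train.min(axis=0), X_train.max(axis=0)
manual_test = (X_test - lo) / (hi - lo)

print(scaled_train.ravel())  # training values span 0 to 1
print(scaled_test.ravel())   # 40 is above the training maximum, so it scales past 1
```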

## 1. Logistic regression

In [136]:
from sklearn.linear_model import LogisticRegression

# Fitting logistic regression with default parameter values
logreg = LogisticRegression()
logreg.fit(rescaledX_train, Y_train)

LogisticRegression()
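
For a two-class problem, logistic regression turns a linear score into a probability with the sigmoid function, and `predict` picks class 1 when that probability exceeds 0.5. The sketch below uses a tiny made-up dataset (not the credit card data) to show that `predict_proba` matches the sigmoid of `decision_function`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up, already-scaled feature whose label flips around 0.5.
X = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]])
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(X, y)

scores = model.decision_function(X)      # linear part: w * x + b
sigmoid = 1.0 / (1.0 + np.exp(-scores))  # probability of class 1
proba = model.predict_proba(X)[:, 1]

print(np.allclose(sigmoid, proba))
print(model.predict(X))
```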

## 2. Making predictions

In [137]:
from sklearn.metrics import confusion_matrix
Y_pred = logreg.predict(rescaledX_test)

print("Logistic regression classifier has accuracy of: ", logreg.score(rescaledX_test, Y_test)*100, "%")

# Evaluate the confusion_matrix
confusion_matrix(Y_test, Y_pred)

Logistic regression classifier has accuracy of:  84.21052631578947 %


array([[94,  9],
       [27, 98]], dtype=int64)
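
The reported accuracy can be recovered from the confusion matrix: correct predictions sit on the diagonal, so (94 + 98) / 228 ≈ 0.842. The sketch below redoes that arithmetic and also derives per-class recall. Rows are true labels and columns are predictions, following scikit-learn's convention. Since the encoder sorts `+` before `-`, class 0 is taken here to mean approved; that reading is an inference from the encoded output rather than something the notebook states.

```python
import numpy as np

# Confusion matrix reported above: rows = true class, columns = predicted class.
cm = np.array([[94, 9],
               [27, 98]])

accuracy = np.trace(cm) / cm.sum()               # diagonal = correct predictions
recall_per_class = np.diag(cm) / cm.sum(axis=1)  # correct / actual count per class

print(f"accuracy: {accuracy:.4f}")
print("recall per class:", recall_per_class.round(3))
```

The model misclassifies 27 of the 125 class-1 applications against 9 of the 103 class-0 applications, an imbalance that the single accuracy figure hides.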