## 1. Import libraries and dataset
<p>First, we load and view the dataset. <br>
Because this data is confidential, the contributor of the dataset has anonymized the feature names.</p>

In [1]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv("datasets/cc_approvals.data", header=None, sep=',')

# Inspect data
cc_apps.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
0,b,30.83,0.0,u,g,w,v,1.25,t,t,1,f,g,202,0,+
1,a,58.67,4.46,u,g,q,h,3.04,t,t,6,f,g,43,560,+
2,a,24.5,0.5,u,g,q,h,1.5,t,f,0,f,g,280,824,+
3,b,27.83,1.54,u,g,w,v,3.75,t,t,5,t,g,100,3,+
4,b,20.17,5.625,u,g,w,v,1.71,t,f,0,f,s,120,0,+


## 2. Inspecting the applications
<p>The output may appear a bit confusing at first sight, but let's try to figure out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but <a href="http://rstudio-pubs-static.s3.amazonaws.com/73039_9946de135c0a49daa7a0a9eda4a67a72.html">this blog</a> gives us a pretty good overview of the probable features. </p>
<p>The probable features in a typical credit card application are <code>Gender</code>, <code>Age</code>, <code>Debt</code>, <code>Married</code>, <code>BankCustomer</code>, <code>EducationLevel</code>, <code>Ethnicity</code>, <code>YearsEmployed</code>, <code>PriorDefault</code>, <code>Employed</code>, <code>CreditScore</code>, <code>DriversLicense</code>, <code>Citizen</code>, <code>ZipCode</code>, <code>Income</code> and finally the <code>ApprovalStatus</code>. <br>
This gives us a pretty good starting point, and we can map these features with respect to the columns in the output.</p>
<p>As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.</p>

In [2]:
# inspect_dataframes() function definition 
def inspect_dataframes(filenames, dataframes):  #filenames as a list of strings ; dataframes as a list of corresponding dataframes
    # data validation code
    keys = filenames
    values = dataframes
    for z1, z2 in zip(keys, values):
        shape = z2.shape
        ncol = z2.shape[1]
        dup = z2.duplicated().sum()
        na = z2.isna().sum()
        uq = z2.nunique()
        datalists = dict()
        keys_p = ['duplicates', 'na', 'unique', 'dtype']
        keys_d = ['non-null count', 'mean/mode', 'std', 'min', '25%', '50%', '75%', 'max']
        for k in (keys_p + keys_d):
            datalists[k] = []
        for col in range(ncol):  #loading keys_p values
            datalists['duplicates'].append(z2.iloc[:,col].duplicated().sum())
            datalists['na'].append(z2.isna().sum()[col])
            datalists['unique'].append(z2.nunique()[col])
            datalists['dtype'].append(z2.dtypes[col])
        for k2 in enumerate(keys_d):  # loading keys_d values
            for col in range(ncol):
                if len(z2.iloc[:,col].describe())==8: # describe method outputs 8 values on numeric columns, 4 on others
                    datalists[k2[1]].append(round(z2.iloc[:,col].describe()[k2[0]],2))
                else:
                    if k2[1]=='non-null count':
                        datalists[k2[1]].append(z2.count()[col])
                    elif k2[1]=='mean/mode':
                        datalists[k2[1]].append(z2.iloc[:,col].mode()[0])
                    else:
                        datalists[k2[1]].append('NC')
        print(z1, 'dataframe - ', f'shape:{shape}', f'dupl:{dup}')
        display(pd.DataFrame(datalists, index = pd.MultiIndex.from_tuples([c for c in enumerate(z2.columns)], names=['#', 'Column'])))

In [3]:
# use custom function to inspect the dataframe
inspect_dataframes(["datasets/cc_approvals.data"], [cc_apps])

datasets/cc_approvals.data dataframe -  shape:(690, 16) dupl:0


#,Column,duplicates,na,unique,dtype,non-null count,mean/mode,std,min,25%,50%,75%,max
0,0,687,0,3,object,690.0,b,NC,NC,NC,NC,NC,NC
1,1,340,0,350,object,690.0,?,NC,NC,NC,NC,NC,NC
2,2,475,0,215,float64,690.0,4.76,4.98,0.0,1.0,2.75,7.21,28.0
3,3,686,0,4,object,690.0,u,NC,NC,NC,NC,NC,NC
4,4,686,0,4,object,690.0,g,NC,NC,NC,NC,NC,NC
5,5,675,0,15,object,690.0,c,NC,NC,NC,NC,NC,NC
6,6,680,0,10,object,690.0,v,NC,NC,NC,NC,NC,NC
7,7,558,0,132,float64,690.0,2.22,3.35,0.0,0.16,1.0,2.62,28.5
8,8,688,0,2,object,690.0,t,NC,NC,NC,NC,NC,NC
9,9,688,0,2,object,690.0,f,NC,NC,NC,NC,NC,NC


In [4]:
mapping = {0:'Gender', 1:'Age', 2:'Debt', 3:'Married', 4:'BankCustomer', 5:'EducationLevel', 6:'Ethnicity', 7:'YearsEmployed', 8:'PriorDefault', 9:'Employed', 10:'CreditScore', 11:'DriversLicense', 12:'Citizen', 13:'ZipCode', 14:'Income', 15:'ApprovalStatus'}
display(mapping)

{0: 'Gender',
 1: 'Age',
 2: 'Debt',
 3: 'Married',
 4: 'BankCustomer',
 5: 'EducationLevel',
 6: 'Ethnicity',
 7: 'YearsEmployed',
 8: 'PriorDefault',
 9: 'Employed',
 10: 'CreditScore',
 11: 'DriversLicense',
 12: 'Citizen',
 13: 'ZipCode',
 14: 'Income',
 15: 'ApprovalStatus'}
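
The mapping is only displayed here; the rest of the notebook keeps the integer column labels. As a minimal sketch, assuming we wanted readable names, the dictionary could be passed to `DataFrame.rename`. The two-row frame `toy` below is made up for illustration and is not the real dataset.

```python
import pandas as pd

# Hypothetical two-row frame with integer column labels, standing in for cc_apps
toy = pd.DataFrame([["b", "30.83", 0.0], ["a", "58.67", 4.46]])

# Subset of the notebook's mapping, restricted to the toy columns
mapping = {0: "Gender", 1: "Age", 2: "Debt"}

# rename() returns a new frame; the original integer labels are left untouched
named = toy.rename(columns=mapping)
print(list(named.columns))
```

Keeping the integer labels, as the notebook does, has the advantage that later cells can refer to columns by number (for example `drop(columns=[11, 13])`).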

In [5]:
# check for binary columns
cat_cols = [0, 3, 4, 8, 9, 11, 12, 15]
for c in cat_cols:
    print(cc_apps[c].name, cc_apps[c].unique())

0 ['b' 'a' '?']
3 ['u' 'y' '?' 'l']
4 ['g' 'p' '?' 'gg']
8 ['t' 'f']
9 ['t' 'f']
11 ['f' 't']
12 ['g' 's' 'p']
15 ['+' '-']


In [6]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print('\n')

# Print DataFrame information (info() prints directly and returns None)
cc_apps.info()

print('\n')

               2           7          10             14
count  690.000000  690.000000  690.00000     690.000000
mean     4.758725    2.223406    2.40000    1017.385507
std      4.978163    3.346513    4.86294    5210.102598
min      0.000000    0.000000    0.00000       0.000000
25%      1.000000    0.165000    0.00000       0.000000
50%      2.750000    1.000000    0.00000       5.000000
75%      7.207500    2.625000    3.00000     395.500000
max     28.000000   28.500000   67.00000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 690 entries, 0 to 689
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       690 non-null    object 
 1   1       690 non-null    object 
 2   2       690 non-null    float64
 3   3       690 non-null    object 
 4   4       690 non-null    object 
 5   5       690 non-null    object 
 6   6       690 non-null    object 
 7   7       690 non-null    float64
 8   8       690 no

In [7]:
# Inspect missing values in the dataset
print(cc_apps.tail(17))

    0      1       2  3  4   5   6      7  8  9   10 11 12     13   14 15
673  ?  29.50   2.000  y  p   e   h  2.000  f  f   0  f  g  00256   17  -
674  a  37.33   2.500  u  g   i   h  0.210  f  f   0  f  g  00260  246  -
675  a  41.58   1.040  u  g  aa   v  0.665  f  f   0  f  g  00240  237  -
676  a  30.58  10.665  u  g   q   h  0.085  f  t  12  t  g  00129    3  -
677  b  19.42   7.250  u  g   m   v  0.040  f  t   1  f  g  00100    1  -
678  a  17.92  10.210  u  g  ff  ff  0.000  f  f   0  f  g  00000   50  -
679  a  20.08   1.250  u  g   c   v  0.000  f  f   0  f  g  00000    0  -
680  b  19.50   0.290  u  g   k   v  0.290  f  f   0  f  g  00280  364  -
681  b  27.83   1.000  y  p   d   h  3.000  f  f   0  f  g  00176  537  -
682  b  17.08   3.290  u  g   i   v  0.335  f  f   0  t  g  00140    2  -
683  b  36.42   0.750  y  p   d   v  0.585  f  f   0  f  g  00240    3  -
684  b  40.58   3.290  u  g   m   v  3.500  f  f   0  t  s  00400    0  -
685  b  21.08  10.085  y  p   e   h  1

## 3. Splitting the dataset into train and test sets
<p>Now, we'll split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. <br>
No information from the test data should be used to preprocess the training data or should be used to direct the training process of a machine learning model : no data leakage. Hence, we first split the data and then preprocess it.</p>
<p>Also, features like <code>DriversLicense</code> and <code>ZipCode</code> are not as important as the other features in the dataset for predicting credit card approvals. To get a better sense, we could measure their <a href="https://realpython.com/numpy-scipy-pandas-correlation-python/">statistical correlation</a> to the labels of the dataset (this is out of scope for this project).
So we apply <em>feature selection</em> by dropping these less informative features, so that our machine learning model is built on a more relevant set of features.</p>
<p>We'll set a split ratio of 33% (the <code>test_size</code> argument) and set the <code>random_state</code> argument to 42, a conventional choice. <br>
Train and test DataFrames will be assigned to the following variables respectively: cc_apps_train, cc_apps_test.
We'll also keep track of the total number of features before and after dropping the features as this often helps with debugging.</p>

In [8]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13
cc_apps = cc_apps.drop(columns=[11, 13])

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)
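
The text above mentions keeping track of the number of features before and after the drop, which the cell itself does not print. A minimal sketch of that bookkeeping, run on a made-up frame `toy` with the same 16 integer column labels (the data values are placeholders, not the real dataset):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame: 6 rows, integer column labels 0..15 like cc_apps
toy = pd.DataFrame([[i] * 16 for i in range(6)])
n_before = toy.shape[1]

# Drop the same two columns as in the notebook
toy = toy.drop(columns=[11, 13])
n_after = toy.shape[1]
print(f"features before: {n_before}, after: {n_after}")

# Same split arguments as in the notebook
toy_train, toy_test = train_test_split(toy, test_size=0.33, random_state=42)
print(toy_train.shape, toy_test.shape)
```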

## 4. Handling the missing values
<p>Now we've split our data, we can handle some of the issues we identified when inspecting the DataFrame, including:</p>
<ul>
<li>Our dataset contains both numeric and non-numeric data (of <code>float64</code>, <code>int64</code> and <code>object</code> types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.</li>
<li>The dataset also contains values from several ranges. Feature 2 ranges from 0 to 28, feature 10 from 0 to 67, and feature 14 from 0 to 100000. Apart from these, we can get useful statistical information (like <code>mean</code>, <code>max</code>, and <code>min</code>) about the features that have numerical values. </li>
<li>Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', as can be seen in the output of the last cell of the second section.</li>
</ul>
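
Because the missing values are encoded as the string '?', `isna()` does not see them yet. As a minimal sketch, the markers can be counted per column with an element-wise comparison. The frame `toy` is a made-up stand-in for the real data.

```python
import pandas as pd

# Hypothetical frame mixing string and numeric columns, with '?' as the missing marker
toy = pd.DataFrame({
    0: ["b", "?", "a"],
    1: ["30.83", "?", "24.5"],
    2: [0.0, 4.46, 0.5],
})

# Element-wise comparison gives a boolean frame; summing counts the True values per column
marker_counts = (toy == "?").sum()
print(marker_counts)

# isna() reports nothing at this stage, since '?' is an ordinary string
print(toy.isna().sum().sum())
```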


### 4.1. Question marks
<p>We'll start by replacing the question marks with NaN, using numpy, both in train and test sets.</p>

In [9]:
# Import numpy
import numpy as np

# Replace the '?'s with NaN in the train and test sets
cc_apps_train = cc_apps_train.replace('?', np.nan)
cc_apps_test = cc_apps_test.replace('?', np.nan)

In [10]:
missing = pd.DataFrame()
complete = pd.DataFrame()

# display rows with missing values
for c in list(cc_apps_train.columns):
    missing = pd.concat([missing, cc_apps_train[cc_apps_train[c].isna()]], axis=0).drop_duplicates()
display(missing, missing.shape)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
346,,32.25,1.5,u,g,c,v,0.25,f,f,0,g,122,-
641,,33.17,2.25,y,p,cc,v,3.5,f,f,0,g,141,-
374,,28.17,0.585,u,g,aa,v,0.04,f,f,0,g,1004,-
479,,26.5,2.71,y,p,,,0.085,f,f,0,s,0,-
598,,20.08,0.125,u,g,q,v,1.0,f,t,1,g,768,+
489,,45.33,1.0,u,g,q,v,0.125,f,f,0,g,0,-
520,,20.42,7.5,u,g,k,v,1.5,t,t,1,g,234,+
673,,29.5,2.0,y,p,e,h,2.0,f,f,0,g,17,-
329,b,,4.0,y,p,i,v,0.085,f,f,0,g,0,-
445,a,,11.25,u,g,ff,ff,0.0,f,f,0,g,5200,-


(19, 14)

In [11]:
# display rows with non-missing values
complete = cc_apps_train.loc[list(set(cc_apps_train.index).difference(set(missing.index)))]
display(complete, complete.shape)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
0,b,30.83,0.000,u,g,w,v,1.250,t,t,1,g,0,+
1,a,58.67,4.460,u,g,q,h,3.040,t,t,6,g,560,+
3,b,27.83,1.540,u,g,w,v,3.750,t,t,5,g,3,+
4,b,20.17,5.625,u,g,w,v,1.710,t,f,0,s,0,+
5,b,32.08,4.000,u,g,m,v,2.500,t,f,0,g,0,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
683,b,36.42,0.750,y,p,d,v,0.585,f,f,0,g,3,-
684,b,40.58,3.290,u,g,m,v,3.500,f,f,0,s,0,-
687,a,25.25,13.500,y,p,ff,ff,2.000,f,t,1,g,1,-
688,b,17.92,0.205,u,g,aa,v,0.040,f,f,0,g,750,-


(443, 14)

In [12]:
cc_apps_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 462 entries, 382 to 102
Data columns (total 14 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       454 non-null    object 
 1   1       457 non-null    object 
 2   2       462 non-null    float64
 3   3       456 non-null    object 
 4   4       456 non-null    object 
 5   5       455 non-null    object 
 6   6       455 non-null    object 
 7   7       462 non-null    float64
 8   8       462 non-null    object 
 9   9       462 non-null    object 
 10  10      462 non-null    int64  
 11  12      462 non-null    object 
 12  14      462 non-null    int64  
 13  15      462 non-null    object 
dtypes: float64(2), int64(2), object(10)
memory usage: 70.3+ KB


In [13]:
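# Caution: besides '?', this also turns every numeric 0 into NaN,
# so zeros in columns 2, 7, 10 and 14 are mean-imputed in the cells below
cc_apps_train = cc_apps_train.replace(['?', 0], np.nan)
cc_apps_test = cc_apps_test.replace(['?', 0], np.nan)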

In [14]:
filled_test = cc_apps_train\
    .fillna(value={c: cc_apps_train[c].mean() for c in [2, 7, 10, 14]})\
    .fillna(value={c: cc_apps_train[c].mode()[0] for c in [0, 1, 3, 4, 5, 6, 12]})  # .mode()[0] as mode function always returns a series
filled_test

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
382,a,24.33,2.500000,y,p,i,bb,4.50000,f,f,5.663366,g,456.000000,-
137,b,33.58,2.750000,u,g,m,v,4.25000,t,t,6.000000,g,1625.654676,+
346,b,32.25,1.500000,u,g,c,v,0.25000,f,f,5.663366,g,122.000000,-
326,b,30.17,1.085000,y,p,c,v,0.04000,f,f,5.663366,g,179.000000,-
33,a,36.75,5.125000,u,g,e,v,5.00000,t,f,5.663366,g,4000.000000,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,b,34.83,4.000000,u,g,d,bb,12.50000,t,f,5.663366,g,1625.654676,-
106,b,28.75,1.165000,u,g,k,v,0.50000,t,f,5.663366,s,1625.654676,-
270,b,37.58,4.803781,u,g,c,v,2.32086,f,f,5.663366,p,1625.654676,+
435,b,19.00,4.803781,y,p,ff,ff,2.32086,f,t,4.000000,g,1.000000,-


In [15]:
cc_apps_train.fillna(cc_apps_train.mean(numeric_only=True))


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,12,14,15
382,a,24.33,2.500000,y,p,i,bb,4.50000,f,f,5.663366,g,456.000000,-
137,b,33.58,2.750000,u,g,m,v,4.25000,t,t,6.000000,g,1625.654676,+
346,,32.25,1.500000,u,g,c,v,0.25000,f,f,5.663366,g,122.000000,-
326,b,30.17,1.085000,y,p,c,v,0.04000,f,f,5.663366,g,179.000000,-
33,a,36.75,5.125000,u,g,e,v,5.00000,t,f,5.663366,g,4000.000000,+
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71,b,34.83,4.000000,u,g,d,bb,12.50000,t,f,5.663366,g,1625.654676,-
106,b,28.75,1.165000,u,g,k,v,0.50000,t,f,5.663366,s,1625.654676,-
270,b,37.58,4.803781,,,,,2.32086,f,f,5.663366,p,1625.654676,+
435,b,19.00,4.803781,y,p,ff,ff,2.32086,f,t,4.000000,g,1.000000,-


### 4.2. Mean imputation for numeric values
<p>We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.</p>
<p>An important question that gets raised here is <em>why are we giving so much importance to missing values</em>? Can't they be just ignored? Ignoring missing values can affect the performance of a machine learning model heavily. While ignoring the missing values our machine learning model may miss out on information about the dataset that may be useful for its training. Then, there are many models which cannot handle missing values implicitly such as Linear Discriminant Analysis (LDA). </p>
<p>So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.</p>
<ul>
<li>For the numeric columns, we'll impute the missing values (NaNs) with the pandas method <code>fillna()</code> and check that it performed as expected by printing the total number of NaNs in each column.</li>
<li>As our dataset contains both numeric and non-numeric data, for this task we will only impute the missing values (NaNs) present in the columns having numeric data types (columns 2, 7, 10 and 14).</li>
</ul>
<p>Helpful links : mean imputation <a href="https://machinelearningmastery.com/handle-missing-data-python/">tutorial</a></p>
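
The cell that follows fills each set with its own column means. An alternative, sketched here under the assumption that we want to treat the test set strictly as unseen data, is to compute the means on the training set only and reuse them for the test set. The frames `train` and `test` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric columns 2 and 7 with a few missing values
train = pd.DataFrame({2: [1.0, np.nan, 3.0], 7: [0.5, 1.5, np.nan]})
test = pd.DataFrame({2: [np.nan, 10.0], 7: [np.nan, 2.0]})

# Column means computed on the training set only
train_means = train.mean(numeric_only=True)

# fillna() with a Series fills each column using the value at the matching label
train_filled = train.fillna(train_means)
test_filled = test.fillna(train_means)

print(test_filled)
```

With this variant, the value used to fill a gap in the test set does not depend on which other rows happen to be in the test set.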


In [16]:
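# Impute the missing values in the numeric columns with mean imputation
# (numeric_only=True skips the object columns, which are handled in the next step)
cc_apps_train.fillna(cc_apps_train.mean(numeric_only=True), inplace=True)
cc_apps_test.fillna(cc_apps_test.mean(numeric_only=True), inplace=True)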

# Count the number of NaNs in the datasets and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     8
1     5
2     0
3     6
4     6
5     7
6     7
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     4
1     7
2     0
3     0
4     0
5     2
6     2
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64




### 4.3 Imputation for non-numeric data
<p>We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5 and 6. All of these columns are stored as non-numeric (object) data, which is why the mean imputation strategy would not work here. This needs a different treatment. </p>
<p>We are going to impute these missing values with the most frequent values as present in the respective columns. This is <a href="https://www.datacamp.com/community/tutorials/categorical-data">good practice</a> when it comes to imputing missing values for categorical data in general.</p>
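
As a minimal sketch of most-frequent-value imputation, the example below computes the mode of each column on a made-up training frame and applies it to both frames. Reusing the training modes for the test set is an assumption of this sketch; the cell that follows uses each set's own most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical columns 0 and 3 with missing entries
train = pd.DataFrame({0: ["b", "b", "a", np.nan], 3: ["u", np.nan, "u", "y"]})
test = pd.DataFrame({0: [np.nan, "a"], 3: ["y", np.nan]})

# DataFrame.mode() returns one row per tied value; the first row holds one mode per column
train_modes = train.mode().iloc[0]

# Fill each column with the mode found at the matching column label
train_filled = train.fillna(train_modes)
test_filled = test.fillna(train_modes)

print(test_filled)
```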

In [17]:
# Iterate over each column of cc_apps_train
for col in cc_apps_train.columns:
    # Check if the column is of object type
    if cc_apps_train[col].dtypes == 'object':
        # Impute with the most frequent value
        cc_apps_train = cc_apps_train.fillna({col: cc_apps_train[col].value_counts().index[0]})
        cc_apps_test = cc_apps_test.fillna({col: cc_apps_test[col].value_counts().index[0]})

# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps_train.isnull().sum())
print(cc_apps_test.isnull().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64
0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
12    0
14    0
15    0
dtype: int64


## 5. Preprocessing the data
<p>The missing values are now successfully handled.</p>
<p>There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into two main tasks:</p>
<ol>
<li>Convert the non-numeric data into numeric (using pandas function <code>get_dummies()</code>).</li>
<li>Scale the feature values to a uniform range.</li>
</ol>


### 5.1. Convert the non-numeric values to numeric
<p>First, we will convert all the non-numeric values into numeric ones. We do this not only because it results in faster computation, but also because many machine learning models (like XGBoost, and especially the ones developed using scikit-learn) require the data to be in a strictly numeric format. We will do this by using the <code>get_dummies()</code> function from pandas.</p>
<p>We'll also add a reindexing step (the <code>reindex()</code> method) on <code>cc_apps_test</code>, in order to discard any category that appears only in the test data and to add, filled with zeros, any column seen only in the training data.</p>
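
To make the effect of the reindexing step concrete, here is a minimal sketch on a made-up single-column example. The column name `Citizen` and its values are illustrative only.

```python
import pandas as pd

# Hypothetical data: category "s" occurs only in train, "p" only in test
train = pd.DataFrame({"Citizen": ["g", "s", "g"]})
test = pd.DataFrame({"Citizen": ["g", "p"]})

# Encoding each set independently produces different column sets
train_enc = pd.get_dummies(train)
test_enc = pd.get_dummies(test)

# Align the test columns on the train columns:
# "Citizen_p" is discarded, "Citizen_s" is added and filled with 0
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_aligned.columns))
```

After this step both frames have the same columns in the same order, which a model fitted on the training columns needs at prediction time.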

In [18]:
# Convert the categorical features in the train and test sets independently
cc_apps_train = pd.get_dummies(cc_apps_train)
cc_apps_test = pd.get_dummies(cc_apps_test)

In [19]:
print('nb of cols in cc_apps_train: ', cc_apps_train.shape[1])
print('nb of cols in cc_apps_test: ', cc_apps_test.shape[1])
cc_apps_test_save = cc_apps_test
comcol = len(list(set(cc_apps_train.columns).intersection(set(cc_apps_test.columns))))
diffcol1 = len(list(set(cc_apps_train.columns).difference(set(cc_apps_test.columns))))
diffcol2 = len(list(set(cc_apps_test.columns).difference(set(cc_apps_train.columns))))  # checking for cc_apps_test columns that are not in cc_apps_train
print('common cols: ', comcol, '\ncols from cc_apps_train not in cc_apps_test: ', diffcol1, '\ncols from cc_apps_test not in cc_apps_train: ', diffcol2)

nb of cols in cc_apps_train:  329
nb of cols in cc_apps_test:  218
common cols:  154 
cols from cc_apps_train not in cc_apps_test:  175 
cols from cc_apps_test not in cc_apps_train:  64


In [20]:
# Reindex the columns of the test set aligning with the train set
cc_apps_test = cc_apps_test.reindex(columns=cc_apps_train.columns, fill_value=0)
print('nb of cols in final cc_apps_test dataframe: ', cc_apps_test.shape[1])

nb of cols in final cc_apps_test dataframe:  329


### 5.2. Rescale the features of the data
<p>Now, we are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data. </p>
<p>Now, let's try to understand what these scaled values mean in the real world. Let's use <code>CreditScore</code> as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a <code>CreditScore</code> of 1 is the highest since we're rescaling all the values to the range of 0-1.</p>
<p>To complete this step, we will:</p>
<ul>
<li>first import the <code>MinMaxScaler</code> class from the <code>sklearn.preprocessing</code> module,</li>
<li>then segregate the features and labels into train and test sets of values,</li>
<li>and finally rescale <code>X_train</code> and <code>X_test</code>, taking care not to fit the scaler on the test set (no data leakage!).</li>
</ul>
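
As a worked illustration of what min-max scaling computes for a single feature, the sketch below applies the formula (x - min) / (max - min) by hand with NumPy, taking the minimum and maximum from the training values only. The numbers are made up.

```python
import numpy as np

# Hypothetical values of one feature
x_train = np.array([0.0, 5.0, 20.0])
x_test = np.array([10.0, 40.0])

# "Fitting" amounts to remembering the training minimum and maximum
lo, hi = x_train.min(), x_train.max()

# "Transforming" applies the same formula to any data
scaled_train = (x_train - lo) / (hi - lo)
scaled_test = (x_test - lo) / (hi - lo)

print(scaled_train)
print(scaled_test)
```

Because the bounds come from the training set, a test value larger than the training maximum is mapped above 1, as the second test value shows. This is the expected price of not fitting on the test set.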

In [21]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Segregate features and labels into separate variables
X_train, y_train = cc_apps_train.iloc[:, :-2].values, cc_apps_train.iloc[:,[-1]].values
X_test, y_test = cc_apps_test.iloc[:, :-2].values, cc_apps_test.iloc[:, [-1]].values

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0,1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

## 6. Fitting a logistic regression model to the train set
<p>Essentially, predicting if a credit card application will be approved or not is a <a href="https://en.wikipedia.org/wiki/Statistical_classification">classification</a> task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved. </p>
<p>This gives us a benchmark. A good machine learning model should be able to accurately predict the status of the applications with respect to these statistics.</p>
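
To make the benchmark concrete: a trivial classifier that always predicts the majority class ("Denied") would already be right for every denied application. A minimal sketch of that arithmetic, using the counts quoted above:

```python
# Class counts quoted above
denied, approved = 383, 307
total = denied + approved

# Accuracy of always predicting the majority class
baseline_accuracy = max(denied, approved) / total
print(round(baseline_accuracy, 3))
```

Any model we fit should therefore clearly exceed roughly 55% accuracy to be considered useful.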
<p>Which model should we pick? A question to ask is: <em>are the features that affect the credit card approval decision process correlated with each other?</em> Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).</p>
<p>To complete this step, we will:</p>
<ul>
<li>import <code>LogisticRegression</code> from the <code>sklearn.linear_model</code> module,</li>
<li>then fit <code>rescaledX_train</code> and <code>y_train</code> to a LogisticRegression instance <code>logreg</code> using the <code>fit()</code> method.</li>
</ul>

In [22]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set (ravel() flattens y_train to the 1d shape scikit-learn expects)
logreg.fit(rescaledX_train, y_train.ravel())


## 7. Making predictions and evaluating performance
<p>But how well does our model perform? </p>
<p>We will now evaluate our model on the test set with respect to <a href="https://developers.google.com/machine-learning/crash-course/classification/accuracy">classification accuracy</a>. But we will also take a look at the model's <a href="http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/">confusion matrix</a>. In the case of predicting credit card applications, it is important to see if our machine learning model is equally capable of predicting approved and denied status, in line with the frequency of these labels in our original dataset. If our model is not performing well in this aspect, then it might end up approving an application that should have been denied. The confusion matrix helps us to view our model's performance from these aspects.  </p>
<p>To complete this step, we will:</p>
<ul>
<li>import <code>confusion_matrix()</code> from the <code>sklearn.metrics</code> module,</li>
<li>predict labels named <code>y_pred</code> from <code>rescaledX_test</code>, and calculate the accuracy score using the <code>score()</code> method,</li>
<li>print a <code>confusion_matrix()</code> from the test labels (<code>y_test</code>) and the predicted labels (<code>y_pred</code>).</li>
</ul>

<p>Helpful links:</p>
<ul>
<li>sklearn confusion matrix <a href="https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html">documentation</a></li>
</ul>

In [23]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

Accuracy of logistic regression classifier:  0.8421052631578947
[[94  9]
 [27 98]]


## 8. Grid searching and making the model perform better
<p>Our model was pretty good! It yielded an accuracy score of about 84% on the test set.</p>
<p>In the confusion matrix, rows correspond to the true labels and columns to the predicted labels. Because the label extracted earlier is the last dummy column (<code>15_-</code>), class 1 stands for a denied application and class 0 for an approved one. The first element of the first row therefore denotes the true negatives, here the approved applications predicted correctly by the model (94). The last element of the second row denotes the true positives, here the denied applications predicted correctly by the model (98).</p>
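
The accuracy reported above can be recovered from the confusion matrix itself. A minimal sketch of that arithmetic, using the matrix printed in the previous cell and scikit-learn's convention that rows are true labels and columns are predicted labels:

```python
import numpy as np

# Confusion matrix printed by the previous cell
cm = np.array([[94, 9],
               [27, 98]])

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Share of each true class (row) that was predicted correctly
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(accuracy)
print(per_class_recall)
```

The per-row figures show that the two classes are not predicted equally well, which the single accuracy number hides.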
<p>But what can we do if we want a better score? We can perform a <a href="https://machinelearningmastery.com/how-to-tune-algorithm-parameters-with-scikit-learn/">grid search</a> of the model parameters to improve the model's ability to predict credit card approvals.</p>

### 8.1. Building a dictionary of parameters
<p><a href="http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html">scikit-learn's implementation of logistic regression</a> exposes several hyperparameters, but we will grid search over the following two:</p>
<ul>
<li>tol</li>
<li>max_iter</li>
</ul>
<p>To complete this step, we will:</p>
<ul>
<li>Define the grid of parameter values, with <code>tol</code> taking the values 0.01, 0.001 and 0.0001 and <code>max_iter</code> taking the values 100, 150 and 200.</li>
<li>Create a dictionary using <code>dict()</code>, so that we can plug it as is into <code>GridSearchCV</code> later on.</li>
</ul>

Grid search can be computationally expensive if the model is very complex and the dataset is extremely large. Luckily, that is not the case for this project.
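
To get a feel for the cost, the sketch below enumerates the combinations in the grid with `itertools.product` and multiplies by the number of cross-validation folds, assuming the five folds used later in this notebook.

```python
from itertools import product

# Same candidate values as in the grid defined in this section
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Every pairing of one tol value with one max_iter value
combos = [dict(tol=t, max_iter=m) for t, m in product(tol, max_iter)]

# Each combination is fitted once per fold
n_folds = 5
n_fits = len(combos) * n_folds
print(len(combos), n_fits)
```

The count grows multiplicatively with each added hyperparameter, which is why large grids on large datasets become slow.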

In [24]:
# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

### 8.2. Finding the best performing model
<p>We have defined the grid of hyperparameter values and converted them into a single dictionary format which <code>GridSearchCV()</code> expects as one of its parameters. Now, we will begin the grid search to see which values perform best.</p>
<p>We will instantiate <code>GridSearchCV()</code> with our earlier <code>logreg</code> model and fit it to the training data. We will also instruct <code>GridSearchCV()</code> to perform a <a href="https://www.dataschool.io/machine-learning-with-scikit-learn/">cross-validation</a> of five folds.</p>
<p>We'll end the notebook by storing the best-achieved score and the respective best parameters.</p>
<p>While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as <strong>scaling</strong>, <strong>one-hot encoding</strong>, and <strong>missing value imputation</strong>. We finished with some <strong>machine learning</strong> to predict if a person's application for a credit card would get approved or not given some information about that person.</p>

<p>To complete this step, we will:</p>
<ul>
<li>Import <code>GridSearchCV</code> from the <code>sklearn.model_selection</code> module.</li>
<li>Instantiate a <code>GridSearchCV()</code> using the previously defined <code>param_grid</code> and 5-fold cross-validation.</li>
<li>Fit the training features (<code>rescaledX_train</code>) and labels (<code>y_train</code>) to the <code>grid_model</code>.</li>
<li>And finally compute the <code>best_score_</code>, the <code>best_params_</code> and the <code>best_estimator_</code>'s score evaluated on the rescaled test features and labels (<code>rescaledX_test</code>, <code>y_test</code>).</li>
</ul>

<p>Grid searching is a process of finding an optimal set of values for the parameters of a certain machine learning model. This is often known as hyperparameter optimization which is an active area of research. Note that, here we have used the word parameters and hyperparameters interchangeably, but they are not exactly the same.</p>
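
The distinction between the two terms can be seen directly on a scikit-learn estimator. In this minimal sketch on made-up data, `tol` and `max_iter` are hyperparameters chosen before fitting, while `coef_` and `intercept_` are parameters learned during fitting.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny made-up data set with one feature and two classes
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0, 0, 1, 1])

# Hyperparameters: set by us when the estimator is created
model = LogisticRegression(tol=0.01, max_iter=100)
hyperparams = {k: model.get_params()[k] for k in ("tol", "max_iter")}
print(hyperparams)

# Parameters: estimated from the data by fit()
model.fit(X, y)
print(model.coef_, model.intercept_)
```

Grid search only varies the first kind; for each candidate setting, the second kind is re-estimated from the training folds.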

In [25]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

In [26]:
# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train.ravel())  # .ravel() used to flatten the numpy array and avoid following warning :
# DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel().
# y = column_or_1d(y, warn=True)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print("Accuracy of logistic regression classifier: ", best_model.score(rescaledX_test, y_test))

Best: 0.867906 using {'max_iter': 100, 'tol': 0.01}
Accuracy of logistic regression classifier:  0.8421052631578947
