## Predicting credit card approvals using machine learning

### To find the project click [here](https://app.datacamp.com/workspace/w/cd940d1b-68ea-458c-8471-5f2e5d8f6beb)
### Index (click to redirect to section)
- [Loading data into a dataframe](#load-the-data)
- [Inspecting the applications](#inspecting-the-applications)
- [Handling the missing values](#handling-the-missing-values)
- [Processing the data](#prepocessing-the-data)
- [Splitting the dataset into train and test sets](#splitting-the-dataset-into-train-and-test-sets)
- [Scaling the data](#scaling-the-data)
- [Fitting a logistic regression model to the train set](#fitting-a-logistic-regression-model-to-the-train-set)
- [Making predictions and evaluating performance](#making-predictions-and-evaluating-performance)
- [Grid search to hypertune the model](#grid-search-to-hypertune-the-model)
- [Find the best performing model](#find-the-best-performing-model)

<a id='load-the-data'></a>
#### Loading data into a dataframe

In [3]:
# Import pandas
import pandas as pd

# Load dataset
cc_apps = pd.read_csv('datasets/cc_approvals.data', header=None)

# Inspect data
print(cc_apps.head())

  0      1     2  3  4  5  6     7  8  9   10  11  12     13   14  15
0  0      1  2.00  3  4  5  6  7.00  8  9  10  11  12     13   14  15
1  b  30.83  0.00  u  g  w  v  1.25  t  t   1   f   g  00202    0   +
2  a  58.67  4.46  u  g  q  h  3.04  t  t   6   f   g  00043  560   +
3  a  24.50  0.50  u  g  q  h  1.50  t  f   0   f   g  00280  824   +
4  b  27.83  1.54  u  g  w  v  3.75  t  t   5   t   g  00100    3   +


<a id='inspecting-the-applications'></a>
#### Inspecting the applications

In [4]:
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)

print("\n")

# Print DataFrame information
cc_apps_info = cc_apps.info()
print(cc_apps_info)

print("\n")

# Inspect missing values in the dataset
cc_apps.isna().sum()

               2           7           10             14
count  691.000000  691.000000  691.000000     691.000000
mean     4.754732    2.230318    2.410999    1015.933430
std      4.975661    3.349021    4.868008    5206.465716
min      0.000000    0.000000    0.000000       0.000000
25%      1.000000    0.165000    0.000000       0.000000
50%      2.750000    1.000000    0.000000       5.000000
75%      7.165000    2.667500    3.000000     395.000000
max     28.000000   28.500000   67.000000  100000.000000


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 691 entries, 0 to 690
Data columns (total 16 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   0       691 non-null    object 
 1   1       691 non-null    object 
 2   2       691 non-null    float64
 3   3       691 non-null    object 
 4   4       691 non-null    object 
 5   5       691 non-null    object 
 6   6       691 non-null    object 
 7   7       691 non-null    float64
 8   8    

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

<a id='handling-the-missing-values'></a>
#### Handling the missing values

In [5]:
# Import numpy
import numpy as np

# Inspect missing values in the dataset
print(cc_apps.isna().sum())

# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)

# Inspect the missing values again
print(cc_apps.isna().sum())

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64
0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64


In [6]:
# Impute the missing values with mean imputation
cc_apps = cc_apps.fillna(cc_apps.mean())

# Count the number of NaNs in the dataset to verify
print(cc_apps.isna().sum())

0     12
1     12
2      0
3      6
4      6
5      9
6      9
7      0
8      0
9      0
10     0
11     0
12     0
13    13
14     0
15     0
dtype: int64


In [7]:
# Iterate over each column of cc_apps
for col in cc_apps.columns:
    # Check if the column is of object type
    if cc_apps[col].dtype == 'object':
        # Impute with the most frequent value
        cc_apps = cc_apps.fillna(cc_apps[col].value_counts().index[0])

# Count the number of NaNs in the dataset and print the counts to verify
cc_apps.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
dtype: int64

<a id='prepocessing-the-data'></a>
#### Preprocessing the data

In [8]:
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder

# Instantiate LabelEncoder
le = LabelEncoder()

# Iterate over all the values of each column and extract their dtypes
for col in cc_apps.columns.to_numpy():
    # Compare if the dtype is object
    if cc_apps[col].dtypes == 'object':
        # Use LabelEncoder to do the numeric transformation
        cc_apps[col] = le.fit_transform(cc_apps[col])

<a id='splitting-the-dataset-into-train-and-test-sets'></a>
#### Splitting the dataset into train and test sets

In [9]:
# Import train_test_split
from sklearn.model_selection import train_test_split

# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([11, 13], axis=1)
cc_apps = cc_apps.to_numpy()

# Segregate features and labels into separate variables
X, y = cc_apps[:, 0:-1], cc_apps[:, -1]

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

<a id='scaling-the-data'></a>
#### Scaling the data

In [10]:
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.fit_transform(X_test)

<a id='fitting-a-logistic-regression-model-to-the-train-set'></a>
#### Fitting a logistic regression model to the train set

In [11]:
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

LogisticRegression()

<a id='making-predictions-and-evaluating-performance'></a>
#### Making predictions and evaluating performance

In [15]:
# Import confusion_matrix
from sklearn.metrics import confusion_matrix, classification_report

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ",
      logreg.score(X_test, y_test))

# Print the confusion matrix of the logreg model
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))


Accuracy of logistic regression classifier:  0.6550218340611353
              precision    recall  f1-score   support

         0.0       0.78      0.83      0.81       103
         1.0       0.86      0.81      0.83       126

    accuracy                           0.82       229
   macro avg       0.82      0.82      0.82       229
weighted avg       0.82      0.82      0.82       229



<a id='grid-search-to-hypertune-the-model'></a>
#### Grid search to hypertune the model

In [16]:
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict({'tol': tol, 'max_iter': max_iter})

<a id='find-the-best-performing-model'></a>
#### Find the best performing model

In [17]:
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)

# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))



Best: 0.849828 using {'max_iter': 100, 'tol': 0.01}
