# Predicting Credit Card Approvals

Commercial banks receive _a lot_ of applications for credit cards. Many are rejected for reasons such as high loan balances, low income levels, or too many inquiries on the applicant's credit report. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with machine learning, and many commercial banks do so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like real banks do.

![Credit card being held in hand](credit_card.jpg)

You have been provided with a small subset of the credit card applications a bank receives. The dataset has been loaded as a Pandas DataFrame for you. You will start from there. 

In [2]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV

# Load the dataset
cc_apps = pd.read_csv("cc_approvals.data", header=None) 
cc_apps.head()

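|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |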


# Summary statistics for the numeric columns
cc_apps.describe()

### How to approach this project?

1. Splitting the dataset into train and test sets

2. Handling the missing values

3. Preprocessing the data

4. Segregating features and labels and feature rescaling

5. Training and evaluating a logistic regression model

6. Hyperparameter search and making the model perform better

### 1. Splitting the dataset into train and test sets

1. Drop features 11 and 13 using the drop() method, as they are non-essential for this project.
2. Using the train_test_split() function, split the data into train and test sets with a test size of 33% (test_size=0.33) and set random_state to 42.
3. Assign the train and test DataFrames to cc_apps_train and cc_apps_test, respectively.
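
The steps above can be tried out on a small stand-in before touching the real data. This is a minimal sketch, assuming a hypothetical 100-row frame named toy_apps with integer column labels 0 to 15 (the labels you get when reading a file with header=None); the values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for cc_apps: 100 rows, integer column labels 0..15
toy_apps = pd.DataFrame(np.arange(1600).reshape(100, 16))

# Drop columns 11 and 13 by label (axis=1 means "columns")
toy_apps = toy_apps.drop([11, 13], axis=1)

# Hold out roughly a third of the rows; random_state makes the split repeatable
toy_train, toy_test = train_test_split(toy_apps, test_size=0.33, random_state=42)

print(toy_train.shape, toy_test.shape)
```

Splitting before any imputation or scaling keeps information from the test rows out of the statistics computed later on the training rows.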


### 2. Handling the missing values

1. Replace '?' with NaNs using replace() in both train and test sets. Store the results in cc_apps_train_nans_replaced and cc_apps_test_nans_replaced, respectively.
2. Impute missing values (NaNs) in numeric columns with fillna(), ensuring that test set imputation uses mean values from the training set. Store the imputed DataFrames in cc_apps_train_imputed and cc_apps_test_imputed, respectively.
3. Iterate through the columns of cc_apps_train_imputed with a for loop, checking each one for the object data type.
4. For each object column, impute missing values in both cc_apps_train_imputed and cc_apps_test_imputed with that column's most frequent value in cc_apps_train_imputed, using fillna() and value_counts().
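
A minimal sketch of this imputation on made-up data, assuming two hypothetical columns named amount and kind. One deviation from the steps above, labeled as an assumption: the loop tests for "not numeric" with pd.api.types.is_numeric_dtype instead of comparing against the object dtype, so the sketch also works on pandas versions that store strings in a dedicated string dtype.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "?" marks a missing entry, as in the raw file
toy_train = pd.DataFrame(
    {"amount": [1.0, 3.0, np.nan, 5.0], "kind": ["u", "u", "?", "y"]}
)
toy_test = pd.DataFrame({"amount": [np.nan, 2.0], "kind": ["?", "y"]})

# Turn the "?" placeholders into real missing values
toy_train = toy_train.replace("?", np.nan)
toy_test = toy_test.replace("?", np.nan)

# Numeric columns: fill with the mean computed on the training set only
train_means = toy_train.mean(numeric_only=True)
toy_train = toy_train.fillna(train_means)
toy_test = toy_test.fillna(train_means)

# Non-numeric columns: fill each column with its own most frequent training value
for col in toy_train.columns:
    if not pd.api.types.is_numeric_dtype(toy_train[col]):
        most_frequent = toy_train[col].value_counts().index[0]
        toy_train[col] = toy_train[col].fillna(most_frequent)
        toy_test[col] = toy_test[col].fillna(most_frequent)

print(toy_test)
```

Note that fillna() is applied column by column inside the loop. Calling it on the whole DataFrame would spread one column's most frequent value into every other column that still has gaps.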



### 3. Preprocessing the data

1. Apply get_dummies() to cc_apps_train_imputed and cc_apps_test_imputed, storing the results in cc_apps_train_cat_encoding and cc_apps_test_cat_encoding, respectively.
2. Reindex cc_apps_test_cat_encoding with the columns of cc_apps_train_cat_encoding, filling missing columns with 0s.
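
Why the reindex step matters is easiest to see on a tiny example. The sketch below assumes two hypothetical frames whose kind column has different categories in train and test; the names and values are illustrative only.

```python
import pandas as pd

# Hypothetical toy data: "z" appears only in test, "l" and "y" only in train
toy_train = pd.DataFrame({"score": [1.0, 2.0, 3.0], "kind": ["u", "y", "l"]})
toy_test = pd.DataFrame({"score": [4.0, 5.0], "kind": ["u", "z"]})

# One-hot encode each set independently; the resulting columns differ
train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Align the test columns with the training columns:
# categories unseen in training are dropped, absent ones are added as 0
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(list(train_encoded.columns))
print(list(test_encoded.columns))
```

After the reindex, both frames have the same columns in the same order, which is what a model trained on the training columns expects at prediction time.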



### 4. Segregating features and labels and feature rescaling

1. Segregate cc_apps_train_cat_encoding into features and labels (X_train, y_train), and cc_apps_test_cat_encoding into (X_test, y_test).
2. Instantiate a MinMaxScaler as scaler with feature_range set to (0, 1).
3. Fit scaler to X_train and transform it, saving the result as rescaledX_train.
4. Transform X_test using the already fitted scaler and store the result in rescaledX_test.
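
The fit-on-train, transform-on-test pattern can be checked by hand on a few numbers. This is a minimal sketch with made-up arrays; the names rescaled_train and rescaled_test are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy features: two columns with different ranges
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[5.0, 40.0]])

scaler = MinMaxScaler(feature_range=(0, 1))

# Learn each column's min and max from the training data, then rescale it
rescaled_train = scaler.fit_transform(X_train)

# Reuse the training min and max for the test data (no refitting)
rescaled_test = scaler.transform(X_test)

print(rescaled_train)
print(rescaled_test)
```

Because the test set is scaled with the training minimum and maximum, a test value outside the training range (40.0 in the second column here) ends up outside the interval from 0 to 1. That is expected and is the price of not peeking at the test data.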

### 5. Training and evaluating a logistic regression model

1. Instantiate LogisticRegression into a variable named logreg with default values.
2. Train logreg on rescaledX_train and y_train.
3. Make predictions on rescaledX_test with logreg and store the results in y_pred.
4. Use confusion_matrix() with y_test and y_pred to print the confusion matrix.
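
To see how to read the confusion matrix, it helps to run the same four steps on data where the answer is obvious. The sketch below assumes a hypothetical one-feature dataset in which the label is 1 whenever the feature is large.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# Hypothetical, clearly separable toy data
X_train = np.array([[0.0], [0.1], [0.2], [0.3], [0.7], [0.8], [0.9], [1.0]])
y_train = np.array([0, 0, 0, 0, 1, 1, 1, 1])
X_test = np.array([[0.05], [0.95]])
y_test = np.array([0, 1])

# Default hyperparameters, as in the steps above
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)
```

The diagonal counts the correct predictions for each class, and the off-diagonal entries count the mistakes. For a credit card approval model, the two kinds of mistakes (approving an application that should be denied, and the reverse) usually do not cost the same, which is why the matrix is more informative than a single accuracy number.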

### 6. Hyperparameter search and making the model perform better

1. Create a tol list with values 0.01, 0.001, and 0.0001, and a max_iter list with values 100, 150, and 200.
2. Create a param_grid dictionary with keys tol and max_iter, mapping them to their respective lists of values.
3. Instantiate GridSearchCV() as grid_model with logreg as the estimator, param_grid as the parameter grid, and 5-fold cross-validation (cv=5).
4. Fit grid_model to rescaledX_train and y_train, storing the result in grid_model_result.
5. Store the best model from grid_model_result in best_model, the best model parameters in best_params, and the best performance score in best_score.
6. Evaluate the best model from grid_model_result on the test set (rescaledX_test, y_test).
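
A minimal sketch of the same grid search on synthetic data, assuming a hypothetical random feature matrix X whose first column determines the label. It is only meant to show which attributes of the fitted search hold the results.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Hypothetical synthetic data: the label depends on the first feature only
rng = np.random.default_rng(0)
X = rng.random((100, 3))
y = (X[:, 0] > 0.5).astype(int)

# 3 values of tol x 3 values of max_iter = 9 candidate settings
param_grid = {"tol": [0.01, 0.001, 0.0001], "max_iter": [100, 150, 200]}

# Each candidate is scored with 5-fold cross-validation on the data passed to fit()
grid_model = GridSearchCV(estimator=LogisticRegression(), param_grid=param_grid, cv=5)
grid_model_result = grid_model.fit(X, y)

best_model = grid_model_result.best_estimator_   # refitted on all of X, y
best_params = grid_model_result.best_params_     # dict with the winning tol and max_iter
best_score = grid_model_result.best_score_       # mean cross-validated accuracy

print(best_params, best_score)
```

best_score is a cross-validation score computed on the training data, so it is not a substitute for the final check on the held-out test set.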


In [None]:
# Drop the features 11 and 13
cc_apps = cc_apps.drop([11, 13], axis=1)

# Split into train and test sets
cc_apps_train, cc_apps_test = train_test_split(cc_apps, test_size=0.33, random_state=42)

# Replace the '?'s with NaN in the train and test sets
cc_apps_train_nans_replaced = cc_apps_train.replace("?", np.nan)
cc_apps_test_nans_replaced = cc_apps_test.replace("?", np.nan)

# Impute missing values in numeric columns with the training-set mean
# (numeric_only=True skips the object columns, which cannot be averaged)
train_means = cc_apps_train_nans_replaced.mean(numeric_only=True)
cc_apps_train_imputed = cc_apps_train_nans_replaced.fillna(train_means)
cc_apps_test_imputed = cc_apps_test_nans_replaced.fillna(train_means)

# Iterate over each column of cc_apps_train_imputed
for col in cc_apps_train_imputed.columns:
    # Check if the column is of object type
    if cc_apps_train_imputed[col].dtypes == "object":
        # Impute this column only, using its own most frequent training value
        most_frequent = cc_apps_train_imputed[col].value_counts().index[0]
        cc_apps_train_imputed[col] = cc_apps_train_imputed[col].fillna(most_frequent)
        cc_apps_test_imputed[col] = cc_apps_test_imputed[col].fillna(most_frequent)

# Convert the categorical features in the train and test sets independently
cc_apps_train_cat_encoding = pd.get_dummies(cc_apps_train_imputed)
cc_apps_test_cat_encoding = pd.get_dummies(cc_apps_test_imputed)

# Reindex the columns of the test set aligning with the train set
cc_apps_test_cat_encoding = cc_apps_test_cat_encoding.reindex(
    columns=cc_apps_train_cat_encoding.columns, fill_value=0
)

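# Segregate features and labels into separate variables.
# get_dummies() turns the label column 15 into two trailing dummy columns
# (assuming both classes occur in the training set), so exclude both from the
# features to avoid leaking the label, and keep the last one as a 1-D label array.
X_train, y_train = (
    cc_apps_train_cat_encoding.iloc[:, :-2].values,
    cc_apps_train_cat_encoding.iloc[:, -1].values,
)
X_test, y_test = (
    cc_apps_test_cat_encoding.iloc[:, :-2].values,
    cc_apps_test_cat_encoding.iloc[:, -1].values,
)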

# Instantiate MinMaxScaler and use it to rescale X_train and X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)

# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()

# Fit logreg to the train set
logreg.fit(rescaledX_train, y_train)

# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)

# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))

# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]

# Create a dictionary where tol and max_iter are keys and the lists of their values are the corresponding values
param_grid = dict(tol=tol, max_iter=max_iter)

# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator=logreg, param_grid=param_grid, cv=5)

# Fit grid_model to the data
grid_model_result = grid_model.fit(rescaledX_train, y_train)

# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))

# Extract the best model and evaluate it on the test set
best_model = grid_model_result.best_estimator_
print(
    "Accuracy of logistic regression classifier: ",
    best_model.score(rescaledX_test, y_test),
)