# Predicting credit card approval v2

Plan
1. Data already cleaned in the previous notebook
2. EDA
3. Get a sense of the dataset before building models
4. Define the problem statement and state assumptions
<br>
<br>
5. Preprocessing: what more can we do? Dimensionality reduction, NMF?
6. Fine-tune each model's parameters to get the best out of each model
7. Compare different classification models: kNN, logistic regression

## Get data from cleaned source

In [None]:
# import packages
import pandas as pd
import numpy as np
# read csv
df = pd.read_csv('cc_approvals_cleaned.csv')
print(df.head())
print('-'*40)
print(df.info())

In [None]:
# change ZipCode to string again; the dtype was reset when the CSV was reloaded
df.ZipCode = df.ZipCode.astype('str')
print(df.info())

## Problem statement
1. Who are our credit card customers?
2. Which features influenced the approval decision?
3. Given application data, develop a classification model that predicts credit card approval, to save time spent on manual application review.


## EDA

1. Most applicants are under 40 years old
2. Median income is only 5.00!
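
A quick way to check both observations directly is sketched below. It runs on a small toy frame rather than the real data, and the column names `Age` and `Income` are assumptions; swap in `df` and the actual column names.

```python
import pandas as pd

# Toy stand-in for the real data; 'Age' and 'Income' are assumed column names.
toy = pd.DataFrame({
    'Age': [22.0, 35.5, 28.0, 51.0, 39.0],
    'Income': [0, 5, 5, 1000, 3],
})

# Share of applicants younger than 40
share_under_40 = (toy['Age'] < 40).mean()

# Median versus mean income: a large gap points to a right-skewed distribution
median_income = toy['Income'].median()
mean_income = toy['Income'].mean()

print(f'Share under 40: {share_under_40:.2f}')
print(f'Median income: {median_income}, mean income: {mean_income}')
```

If the mean is far above the median, as in this toy example, a few very large incomes are probably pulling it up.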

In [None]:
# Inspect data
print(df.describe())
print(df.describe(include=['O']))

In [None]:
# Separate categorical and numerical features
cat_feats = []
num_feats = []
for col in df.columns:
    if col == 'ApprovalStatus':
        pass # this is our target variable
    elif df[col].dtype == object:
        # print(col, 'is a cat feat')
        cat_feats.append(col)
    else:
        # print(col, 'is a num feat')
        num_feats.append(col)
print('Categorical features:', cat_feats)
print('Numerical features:', num_feats)

In [None]:
# Visualize numerical features to find outliers
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(ncols=len(num_feats), figsize=(30,5))
for i in range(len(num_feats)):
    _ = sns.boxplot(data=df, x=num_feats[i], ax=ax[i])
plt.show()

In [None]:
# Cross-tabulate ApprovalStatus against each categorical feature
for cat_feat in cat_feats:
    # if df[cat_feat].nunique() < 7:
        # _ = sns.catplot(data=df, x='ApprovalStatus', kind='count', col=cat_feat)
        # plt.show()
    print(pd.crosstab(index=df[cat_feat], columns=df['ApprovalStatus']))
        

The values of the categorical features carry no meaning that we can interpret.
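
Even when the category labels are opaque, row-normalizing the crosstab makes approval rates comparable across categories. A minimal sketch on toy data follows; the column name `Employed` and the `'+'`/`'-'` target labels are assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy stand-in; 'Employed' is an assumed column name, '+'/'-' are assumed target labels.
toy = pd.DataFrame({
    'Employed': ['t', 't', 't', 'f', 'f', 'f', 'f', 't'],
    'ApprovalStatus': ['+', '+', '-', '-', '-', '+', '-', '+'],
})

# normalize='index' turns each row into proportions that sum to 1
rates = pd.crosstab(index=toy['Employed'],
                    columns=toy['ApprovalStatus'],
                    normalize='index')
print(rates)
```

A category whose approval rate is far from the overall rate may be worth a closer look as a predictor.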

The base model in the last notebook reached around 84% accuracy.<br>
Let's see how we can improve on that.

## Preprocessing
1. Drop ZipCode: it has too many distinct values
2. Label encode the categorical features
3. Median impute the numerical features
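
The same steps could also be bundled into a single scikit-learn transformer, so that imputation, encoding and scaling are fitted in one place. The sketch below is one possible arrangement, not what the cells in this notebook do: it uses `OrdinalEncoder` (the feature-side counterpart of `LabelEncoder`), and every column name in the toy frame is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Toy stand-in; all column names here are hypothetical.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 28.0, 51.0],
    'Income': [0.0, 5.0, np.nan, 1000.0],
    'Employed': ['t', 'f', 't', 'f'],
    'ZipCode': ['00202', '00043', '00280', '00100'],
})
toy_num = ['Age', 'Income']
toy_cat = ['Employed']  # ZipCode is not listed, so remainder='drop' removes it

preprocess = ColumnTransformer(
    transformers=[
        # numerical: median imputation, then standardization
        ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                          ('scale', StandardScaler())]), toy_num),
        # categorical: integer codes, then standardization (mirrors scaling everything)
        ('cat', Pipeline([('encode', OrdinalEncoder()),
                          ('scale', StandardScaler())]), toy_cat),
    ],
    remainder='drop',
)

X_toy = preprocess.fit_transform(toy)
print(X_toy.shape)
```

One reason to consider this layout is that the transformer can be fitted on the training split only and then applied to the test split.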

In [None]:
# separate cat and num features and drop ZipCode
X_num = df.loc[:, num_feats]
X_cat = df.loc[:, cat_feats]
X_cat = X_cat.drop(['ZipCode'], axis=1)
cat_feats.remove('ZipCode')
y = df.loc[:,'ApprovalStatus']


In [None]:
# try label encoding for categorical variables
from sklearn.preprocessing import LabelEncoder
for cat_feat in cat_feats:
    le = LabelEncoder()
    X_cat[cat_feat] = le.fit_transform(X_cat[cat_feat])

In [None]:
# median imputation for numerical features
from sklearn.impute import SimpleImputer
im = SimpleImputer(strategy='median')
X_num = pd.DataFrame(im.fit_transform(X_num), columns=X_num.columns)

In [None]:
# concatenate X_cat and X_num --> X
X = pd.concat([X_num, X_cat], axis=1)
print(X.info())

In [None]:
# scale all features, since all are now numeric
from sklearn.preprocessing import StandardScaler
X = StandardScaler().fit_transform(X)

## Build base model

In [None]:
# train:test = 80:20 with a fixed random state for reproducibility
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=69)


In [None]:
# fit and predict for base model
from sklearn.linear_model import LogisticRegression
log_reg = LogisticRegression().fit(X_train, y_train)
y_base_model = log_reg.predict(X_test)
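
For the question of which features influenced the decision, one rough indicator is the size of the logistic regression coefficients on standardized features. The sketch below uses synthetic data with hypothetical feature names; with the real model, `log_reg.coef_[0]` paired with the column names of `X` before scaling would play the same role.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic, already standardized features; the names are hypothetical
X_toy = pd.DataFrame(rng.normal(size=(500, 3)),
                     columns=['feat_a', 'feat_b', 'feat_c'])
# Synthetic target driven mostly by feat_a, slightly by feat_b, not at all by feat_c
signal = 2.0 * X_toy['feat_a'] + 0.3 * X_toy['feat_b']
y_toy = (signal + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

# Rank features by absolute coefficient size
coefs = pd.Series(model.coef_[0], index=X_toy.columns)
ranking = coefs.abs().sort_values(ascending=False)
print(ranking)
```

Coefficients of label-encoded categorical features should be read with care, since the integer codes impose an arbitrary order.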

In [None]:
# View results for base model
from sklearn.metrics import classification_report
# classification report
print('Classification report')
print(classification_report(y_test, y_base_model))

In [None]:
# ROC of base model
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay replaces it
RocCurveDisplay.from_estimator(log_reg, X_test, y_test)
plt.show()
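
Steps 6 and 7 of the plan (tuning each model and comparing kNN with logistic regression) could look roughly like the sketch below. It runs on synthetic data from `make_classification`, and the parameter grids are illustrative guesses rather than tuned choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the credit card data
X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Candidate models with illustrative (not tuned) parameter grids
candidates = {
    'logreg': (LogisticRegression(max_iter=1000),
               {'logisticregression__C': [0.01, 0.1, 1, 10]}),
    'knn': (KNeighborsClassifier(),
            {'kneighborsclassifier__n_neighbors': [3, 5, 11, 21]}),
}

results = {}
for name, (estimator, grid) in candidates.items():
    # Scaling sits inside the pipeline, so each CV fold is scaled on its own training part
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, grid, cv=5, scoring='accuracy').fit(X_tr, y_tr)
    # Store best parameters, mean CV accuracy, and held-out test accuracy
    results[name] = (search.best_params_, search.best_score_, search.score(X_te, y_te))
    print(name, results[name])
```

Comparing the cross-validated score with the held-out score for each model gives a first hint of whether the tuning overfits.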