## Before you get started
In this notebook we will work through a credit card approval prediction task with pandas and [Towhee](https://towhee.io/). Towhee is an open-source machine learning pipeline that helps you with various machine learning tasks. Before you start, make sure Towhee is installed via `pip install towhee`.

You are more than welcome to join our community; together we can make a difference.

**GitHub**: [https://github.com/towhee-io/towhee](https://github.com/towhee-io/towhee)

**Slack**: [https://slack.towhee.io](https://slack.towhee.io)

**Twitter**: [https://twitter.com/towheeio](https://twitter.com/towheeio)

## Data Processing With Pandas

In [1]:
import warnings
warnings.filterwarnings("ignore")

In [2]:
%pip install towhee imbalanced-learn pandas

First, load the data from [Kaggle](https://www.kaggle.com/code/chizzzy/credit-card-approval-prediction/data?scriptVersionId=92959791) into pandas DataFrames for further processing.

In [13]:
! curl -L https://github.com/towhee-io/examples/releases/download/data/application_record.csv.zip -O
! curl -L https://github.com/towhee-io/examples/releases/download/data/credit_record.csv.zip -O
! unzip -q -o application_record.csv.zip
! unzip -q -o credit_record.csv.zip


In [4]:
import pandas as pd

record = pd.read_csv("./credit_record.csv", encoding = 'utf-8')
data = pd.read_csv("./application_record.csv", encoding = 'utf-8')

Find the first month in which each user's data was recorded, and rename the column to something more descriptive.

In [5]:
begin_month=pd.DataFrame(record.groupby(["ID"])["MONTHS_BALANCE"].agg(min))
begin_month=begin_month.rename(columns={'MONTHS_BALANCE':'begin_month'}) 
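
To make this step concrete, here is a minimal sketch of the same group-and-minimum logic on a made-up table. The IDs and values are hypothetical, and the sketch assumes that `MONTHS_BALANCE` counts months backwards from the extraction month (0 is the current month, -1 the previous one), so the smallest value per user marks the earliest record.

```python
import pandas as pd

# Hypothetical toy records: two customers with a few monthly snapshots each.
toy_record = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -5],
})

# The smallest (most negative) value per ID is the earliest recorded month.
toy_begin = (
    toy_record.groupby("ID")["MONTHS_BALANCE"]
    .min()
    .rename("begin_month")
    .to_frame()
)
print(toy_begin)
```

The result is indexed by `ID`, which is what allows it to be merged back onto the application data by `ID` later on.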

Process the `STATUS` column to find out whether each applicant has a record of overdue payments. The labels mean the following:
- X: no loan for the month, labeled as -1;
- C: paid off that month, labeled as -1;
- 0: 1-29 days past due;
- 1: 30-59 days past due;
- 2: 60-89 days past due;
- 3: 90-119 days past due;
- 4: 120-149 days past due;
- 5: overdue or bad debts, write-offs for more than 150 days.

In [6]:
record.loc[record['STATUS']=='X', 'STATUS']=-1 
record.loc[record['STATUS']=='C', 'STATUS']=-1 
record.loc[record['STATUS']=='0', 'STATUS']=0 
record.loc[record['STATUS']=='1', 'STATUS']=1
record.loc[record['STATUS']=='2', 'STATUS']=2
record.loc[record['STATUS']=='3', 'STATUS']=3 
record.loc[record['STATUS']=='4', 'STATUS']=4 
record.loc[record['STATUS']=='5', 'STATUS']=5
record.groupby('ID')['STATUS'].max().value_counts(normalize=True)

 0    0.754202
-1    0.129455
 1    0.101838
 2    0.007307
 5    0.004241
 3    0.001914
 4    0.001044
Name: STATUS, dtype: float64
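
The eight assignments above can also be expressed as a single lookup table. The following is only a sketch of that alternative on a hypothetical five-value series; it assumes `STATUS` is read from the CSV as strings, and the names `status_codes` and `toy_status` are illustrative.

```python
import pandas as pd

# Lookup table taken from the label list above; X and C both become -1.
status_codes = {"X": -1, "C": -1, "0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5}

# Hypothetical sample of raw STATUS values (strings, as in the CSV).
toy_status = pd.Series(["X", "C", "0", "2", "5"], name="STATUS")

# map() replaces each label by its numeric code; astype(int) gives a numeric dtype.
toy_numeric = toy_status.map(status_codes).astype(int)
print(toy_numeric.tolist())
```

A side effect of this form is that the column ends up with an integer dtype rather than a mix of Python objects, which makes later comparisons such as `>= 2` unambiguous.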

As a rule of thumb, risky users should make up less than 3% of all users, so we mark users who have been 60 days or more overdue at any point as risky.

In [7]:
record.loc[record['STATUS']>=2, 'dep_value']=1
record.loc[record['STATUS']<2, 'dep_value']=0 
temp = record[['ID', 'dep_value']].groupby('ID').sum()
temp.loc[temp['dep_value']!=0, 'dep_value']='Yes'
temp.loc[temp['dep_value']==0, 'dep_value']= 'No'
temp.value_counts(normalize=True)

dep_value
No           0.985495
Yes          0.014505
dtype: float64
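
The per-user flag can be sketched on a small hypothetical table as follows. This version takes the maximum of a 0/1 indicator instead of summing it, which gives the same Yes/No answer; the column name `overdue_60` is made up for the illustration.

```python
import pandas as pd

# Hypothetical monthly statuses for three customers (already numeric).
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3],
    "STATUS": [0, 2, -1, 0, 5],
})

# 1 if the month is 60+ days overdue (STATUS >= 2), else 0.
toy["overdue_60"] = (toy["STATUS"] >= 2).astype(int)

# A customer is risky if any month carries the flag.
risky = toy.groupby("ID")["overdue_60"].max().map({1: "Yes", 0: "No"})
print(risky.to_dict())
```

Customer 1 is flagged because of a single bad month, which matches the "any overdue record" reading used in the cell above.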

Merge the information into one DataFrame, marking risky users with target `1` and all other users with `0`. We will use the `target` column as the label. Rows with missing values also need to be dropped so that they do not disturb training; this happens a few cells further down.

In [8]:
new_data=pd.merge(data,begin_month,how="left",on="ID")
new_data=pd.merge(new_data, temp,how='inner',on='ID')
new_data['target']=new_data['dep_value']
new_data.loc[new_data['target']=='Yes','target']=1
new_data.loc[new_data['target']=='No','target']=0
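
The difference between the two merges is easier to see on a toy example. In this sketch all three tables are hypothetical: the left merge keeps every application even when no first month is known, while the inner merge keeps only applications that have a label.

```python
import pandas as pd

# Hypothetical stand-ins for the application data, first-month table and labels.
apps = pd.DataFrame({"ID": [1, 2, 3], "CODE_GENDER": ["M", "F", "F"]})
first_month = pd.DataFrame({"ID": [1, 2], "begin_month": [-12, -3]})
labels = pd.DataFrame({"ID": [2, 3], "dep_value": ["No", "Yes"]})

# Left merge: all applications survive, missing begin_month becomes NaN.
merged = apps.merge(first_month, how="left", on="ID")
# Inner merge: only IDs present in both tables survive.
merged = merged.merge(labels, how="inner", on="ID")
# Encode the label as 0/1 in one step.
merged["target"] = (merged["dep_value"] == "Yes").astype(int)
print(merged)
```

ID 1 disappears because it has no label, and ID 3 keeps a missing `begin_month`, which is the kind of row that a later `dropna()` removes.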

In [9]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
new_data.head()

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS,begin_month,dep_value,target
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,-15.0,No,0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0,-14.0,No,0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0,-29.0,No,0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,-4.0,No,0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0,-26.0,No,0


Before we go further, we should take a rough look at the class distribution, since imbalanced data can lead to misleading results.

In [10]:
new_data = new_data.dropna()
new_data['target'].value_counts()

0    24712
1      422
Name: target, dtype: int64
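
A quick calculation on the two counts shown above illustrates how skewed the classes are. The sketch also computes the accuracy of a trivial model that always predicts `0`, which is an illustrative baseline and not part of the original workflow.

```python
# Class counts copied from the output above.
negatives, positives = 24712, 422
total = negatives + positives

# Share of risky users among all labeled applications.
positive_share = positives / total
# Accuracy of a model that always predicts the majority class.
majority_baseline_accuracy = negatives / total

print(f"positive share: {positive_share:.4f}")
print(f"always-0 accuracy: {majority_baseline_accuracy:.4f}")
```

With roughly 1.7% positives, a model can score about 98% accuracy without ever identifying a risky user, which is why accuracy alone is not a reliable signal on this data.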

The data is clearly extremely imbalanced, so we need to resample it.
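
To show the idea behind resampling without extra dependencies, here is a sketch of plain random oversampling with pandas. Note that this is a simpler stand-in and not the method used in this notebook: the notebook uses `SMOTEN` from imbalanced-learn, which synthesizes new minority samples for categorical features instead of duplicating existing rows. The toy data is hypothetical.

```python
import pandas as pd

# Hypothetical imbalanced data: five negatives, one positive.
toy = pd.DataFrame({
    "feature": ["a", "b", "a", "c", "b", "a"],
    "target": [0, 0, 0, 0, 0, 1],
})

majority_size = toy["target"].value_counts().max()

parts = []
for _, group in toy.groupby("target"):
    extra = majority_size - len(group)
    if extra > 0:
        # Duplicate randomly chosen minority rows until the class sizes match.
        duplicates = group.sample(n=extra, replace=True, random_state=0)
        group = pd.concat([group, duplicates])
    parts.append(group)

balanced = pd.concat(parts, ignore_index=True)
print(balanced["target"].value_counts().to_dict())
```

Either way, the resampled data has equal class counts, so metrics computed on it are no longer dominated by the majority class.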

In [11]:
from imblearn.over_sampling import SMOTEN
X = new_data[['ID', 'CODE_GENDER', 'FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'NAME_INCOME_TYPE', 
       'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS', 'NAME_HOUSING_TYPE', 'DAYS_BIRTH', 'DAYS_EMPLOYED', 'FLAG_MOBIL',
       'FLAG_WORK_PHONE', 'FLAG_PHONE', 'FLAG_EMAIL', 'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'begin_month','dep_value']]
y = new_data['target'].astype('int')
X_balance,y_balance = SMOTEN().fit_resample(X, y)
X_balance = pd.DataFrame(X_balance, columns = X.columns)
X_balance.insert(0, 'target', y_balance)
new_data = X_balance
new_data['years_birth'] = new_data.DAYS_BIRTH.map(lambda x: -int(x)//365)
new_data['years_employed'] = new_data.DAYS_EMPLOYED.map(lambda x: -int(x)//365)
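
The last two lines convert day offsets to whole years. The sketch below isolates that expression in a hypothetical helper, `days_to_years`, to show how it behaves: the sign is flipped first, then floor division is applied. The first two inputs are taken from the sample rows shown earlier; the positive input is a hypothetical value included only to show what happens when the offset is not negative.

```python
def days_to_years(days):
    # Same expression as in the cell above: negate first, then floor-divide by 365.
    return -int(days) // 365

age_years = days_to_years(-12005)       # DAYS_BIRTH from the first sample row
employed_years = days_to_years(-4542)   # DAYS_EMPLOYED from the first sample row
positive_case = days_to_years(365243)   # hypothetical positive offset

print(age_years, employed_years, positive_case)
```

A positive offset produces a negative number of years, so it may be worth checking how such rows, if present, land in the discretization bins later.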

## Building Models with [Towhee](https://towhee.io/) 

In the following part, we will use Towhee's DataCollection API to deal with the processed data. DataCollection provides a series of APIs that support training, prediction, and evaluation with machine learning models.

Also, Towhee has encapsulated several models as built-in operators, which we will introduce in the following training part.

### Prepare Training And Testing Data

We need to split the original dataset into a training set and a testing set.

In [12]:
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(new_data, test_size=0.2)
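
The split above is different on every run. If reproducible results are wanted, one option is to fix `random_state` and, optionally, to stratify on the label so both sets keep the same class ratio. The sketch below shows this on a hypothetical 20-row table; the seed value `42` is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced data: ten rows per class.
toy = pd.DataFrame({
    "feature": range(20),
    "target": [0] * 10 + [1] * 10,
})

# stratify keeps the class ratio in both parts; random_state fixes the shuffle.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, stratify=toy["target"], random_state=42
)
print(len(toy_train), len(toy_test))
print(toy_test["target"].value_counts().to_dict())
```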

### Feature Extraction

Then we need to process some of the data, including:
- Discretize the numerical data, both continuous data and binary data;
- Encode categorical data with non-numeric values;
- Stack the chosen features into a feature tensor.

The first step is to create the models and fit them on the corresponding columns.

In [13]:
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder
from scipy import sparse
import numpy as np

discretizers = {}
encoders = {}

for col in [
    'CNT_CHILDREN',
    'AMT_INCOME_TOTAL',
    'years_birth',
    'years_employed',
    'CNT_FAM_MEMBERS'
    ]:
    if col == 'CNT_CHILDREN':
        discretizers[col] = KBinsDiscretizer(n_bins=3, encode='onehot', strategy='quantile')
    else:
        discretizers[col] = KBinsDiscretizer(n_bins=5, encode='onehot', strategy='quantile')
    discretizers[col].fit(train_data[col].values.reshape(-1, 1))

for col in [
    'NAME_INCOME_TYPE',
    'OCCUPATION_TYPE',
    'NAME_HOUSING_TYPE',
    'NAME_EDUCATION_TYPE',
    'NAME_FAMILY_STATUS',
    'CODE_GENDER',
    'FLAG_OWN_CAR',
    'FLAG_OWN_REALTY'
    ]:
    encoders[col] = OneHotEncoder()
    encoders[col].fit(train_data[col].values.reshape(-1, 1))
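
To see what the fitted models produce, here is a small sketch on made-up values. The ages and housing categories are illustrative only; the point is that both transformers return a sparse one-hot row whose width equals the number of bins or categories.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, OneHotEncoder

# Hypothetical ages used to learn five quantile bins.
ages = np.array([21, 25, 33, 38, 44, 52, 59, 63, 70, 75]).reshape(-1, 1)
binner = KBinsDiscretizer(n_bins=5, encode="onehot", strategy="quantile")
binner.fit(ages)
age_row = binner.transform(np.array([[40]]))  # sparse, one active bin

# Hypothetical housing categories; columns follow the sorted category order.
housing = np.array(
    ["Rented apartment", "House / apartment", "With parents"]
).reshape(-1, 1)
encoder = OneHotEncoder()
encoder.fit(housing)
housing_row = encoder.transform(np.array([["With parents"]]))

print(age_row.shape, housing_row.toarray())
```

Because the encoders are fitted on the training split only, a category that appears only in the test split would raise an error with the default settings; `OneHotEncoder(handle_unknown='ignore')` is one way to tolerate that, if it turns out to be a problem.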

Then we can define functions to transform the data with the fitted models.

In [None]:
class NumDiscretizer:
    def __init__(self, col_name):
        self._model = discretizers[col_name]

    def __call__(self, data):
        data = np.array([data]).reshape([-1, 1])
        return self._model.transform(data)

class CateOneHotEncoder:
    def __init__(self, col_name):
        self._model = encoders[col_name]

    def __call__(self, data):
        data = np.array([data]).reshape([-1, 1])
        return self._model.transform(data)

def tensor_hstack(*arg):
    return sparse.hstack(arg)
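
The stacking step simply concatenates the per-column one-hot rows side by side. A minimal sketch with two hypothetical encoded columns:

```python
import numpy as np
from scipy import sparse

# Hypothetical one-hot rows: a 5-bin age and a 2-category gender.
age_bins = sparse.csr_matrix(np.array([[0, 0, 1, 0, 0]]))
gender = sparse.csr_matrix(np.array([[1, 0]]))

# hstack joins the columns, giving one 1 x 7 feature row per sample.
row = sparse.hstack([age_bins, gender])
print(row.shape)
print(row.toarray())
```

The order of the inputs determines the column layout of the feature row, so the same order has to be used for training and for prediction.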

Load the data from the pandas DataFrame, discretize the numerical data, encode the categorical data, and stack these features into a feature tensor.

In [None]:
from towhee import pipe, ops

feature_extract = (
    pipe.input('df')
        .flat_map('df', tuple(new_data), lambda x: [i for _, i in x.iterrows()])
        .map('NAME_INCOME_TYPE', 'inctp', CateOneHotEncoder(col_name='NAME_INCOME_TYPE'))
        .map('OCCUPATION_TYPE', 'occyp', CateOneHotEncoder(col_name='OCCUPATION_TYPE'))
        .map('NAME_HOUSING_TYPE', 'houtp', CateOneHotEncoder(col_name='NAME_HOUSING_TYPE'))
        .map('NAME_EDUCATION_TYPE', 'edutp', CateOneHotEncoder(col_name='NAME_EDUCATION_TYPE'))
        .map('NAME_FAMILY_STATUS', 'famtp', CateOneHotEncoder(col_name='NAME_FAMILY_STATUS'))
        .map('CODE_GENDER', 'gender', CateOneHotEncoder(col_name='CODE_GENDER'))
        .map('FLAG_OWN_CAR', 'car', CateOneHotEncoder(col_name='FLAG_OWN_CAR'))
        .map('FLAG_OWN_REALTY', 'realty', CateOneHotEncoder(col_name='FLAG_OWN_REALTY'))
        .map('CNT_CHILDREN', 'childnum', NumDiscretizer(col_name='CNT_CHILDREN'))
        .map('AMT_INCOME_TOTAL', 'inc', NumDiscretizer(col_name='AMT_INCOME_TOTAL'))
        .map('years_birth', 'age', NumDiscretizer(col_name='years_birth'))
        .map('years_employed', 'worktm', NumDiscretizer(col_name='years_employed'))
        .map('CNT_FAM_MEMBERS', 'fmsize', NumDiscretizer(col_name='CNT_FAM_MEMBERS'))
        .map(
            ('childnum', 'inc', 'age', 'worktm', 'fmsize', 'inctp', 'occyp', 'houtp', 'edutp', 'famtp','gender', 'car', 'realty'),
            'fea',
            tensor_hstack
        )
)

### Model

In this tutorial, we will use logistic regression, a decision tree, and a support vector machine.

#### Train

- Load the feature tensors and labels.
- Feed the features and labels to the logistic regression, decision tree, and support vector machine models for training.

In [15]:
from towhee.utils.sklearn_utils import LogisticRegression, DecisionTreeClassifier, svm
from towhee.utils.scipy_utils import sparse
LR = LogisticRegression(max_iter=10)
DT = DecisionTreeClassifier(splitter='random', max_depth=10)
SVM = svm.SVC(C = 0.8, kernel='rbf', probability=True)

def lr_fit(fea, tar):
    # Stack the per-row sparse features into one (n_samples, n_features) matrix.
    X = sparse.vstack(fea)
    # scikit-learn expects a 1-D label array, not a column vector.
    y = np.array(tar).ravel()
    LR.fit(X, y)

def dt_fit(fea, tar):
    X = sparse.vstack(fea)
    y = np.array(tar).ravel()
    DT.fit(X, y)

def svm_fit(fea, tar):
    X = sparse.vstack(fea)
    y = np.array(tar).ravel()
    SVM.fit(X, y)

def lr_predict(fea):
    return LR.predict(fea)[0]

def dt_predict(fea):
    return DT.predict(fea)[0]

def svm_predict(fea):
    return SVM.predict(fea)[0]

In [16]:
train = (
	feature_extract.output('fea', 'target')
)
train_size = len(train_data)
res = train(train_data[:train_size])
x = [res.get() for _ in range(train_size)]
fea = [i[0] for i in x]
tar = [i[1] for i in x]

lr_fit(fea, tar)
dt_fit(fea, tar)
svm_fit(fea, tar)
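
The fit functions receive one sparse feature row and one label per sample. The sketch below shows, on hypothetical rows, how these are assembled into the matrix and label array that a scikit-learn estimator expects.

```python
import numpy as np
from scipy import sparse

# Hypothetical per-sample outputs of the feature pipeline: 1 x 3 sparse rows.
row_features = [
    sparse.csr_matrix([[1, 0, 0]]),
    sparse.csr_matrix([[0, 1, 0]]),
    sparse.csr_matrix([[0, 0, 1]]),
    sparse.csr_matrix([[1, 0, 0]]),
]
row_targets = [0, 1, 1, 0]

# vstack turns n rows of shape (1, k) into one (n, k) matrix.
X = sparse.vstack(row_features)
# Labels as a flat 1-D array, one entry per row of X.
y = np.asarray(row_targets).ravel()
print(X.shape, y.shape)
```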

#### Evaluate

The trained models are ready and their states are stored, so we can predict with them and see how they work.

- Run the same feature extraction procedure on the testing set to get the feature tensors.
- Predict the result with the three models we trained previously.
- Compare the predicted result with the actual result, and calculate accuracy and recall.

In [18]:
from sklearn.metrics import accuracy_score, recall_score, confusion_matrix

def cal_accuracy(predicted_list, actual_list):
	return accuracy_score(actual_list, predicted_list)

def cal_recall(predicted_list, actual_list):
	return recall_score(actual_list, predicted_list, average='weighted')

def get_confusion_matrix(predicted_list, actual_list):
	return confusion_matrix(actual_list, predicted_list)

In [19]:
eval =(
	feature_extract.map('fea', 'lr_pre', lr_predict)
		.map('fea', 'dt_pre', dt_predict)
		.map('fea', 'svm_pre', svm_predict)
		.window_all(('lr_pre', 'target'), 'lr_accuracy', cal_accuracy)
		.window_all(('lr_pre', 'target'), 'lr_recall', cal_recall)
		.window_all(('dt_pre', 'target'), 'dt_accuracy', cal_accuracy)
		.window_all(('dt_pre', 'target'), 'dt_recall', cal_recall)
		.window_all(('svm_pre', 'target'), 'svm_accuracy', cal_accuracy)
		.window_all(('svm_pre', 'target'), 'svm_recall', cal_recall)
		.output(
			'ID', 'target', 'lr_pre', 'dt_pre', 'svm_pre',
			'lr_accuracy', 'lr_recall', 'dt_accuracy', 'dt_recall', 'svm_accuracy', 'svm_recall',
			tracer=True
		)
)

res = eval(test_data)
from towhee.datacollection import DataCollection
res = DataCollection(res)

ID,target,lr_pre,dt_pre,svm_pre,lr_accuracy,lr_recall,dt_accuracy,dt_recall,svm_accuracy,svm_recall
5047662,1,1,1,1,0.7542741527567021,0.7542741527567021,0.8801213960546282,0.8801213960546282,0.9757207890743552,0.9757207890743552
5091321,0,1,1,1,0.7542741527567021,0.7542741527567021,0.8801213960546282,0.8801213960546282,0.9757207890743552,0.9757207890743552
5009628,1,1,1,1,0.7542741527567021,0.7542741527567021,0.8801213960546282,0.8801213960546282,0.9757207890743552,0.9757207890743552
5051160,1,1,1,1,0.7542741527567021,0.7542741527567021,0.8801213960546282,0.8801213960546282,0.9757207890743552,0.9757207890743552
5022044,1,1,0,1,0.7542741527567021,0.7542741527567021,0.8801213960546282,0.8801213960546282,0.9757207890743552,0.9757207890743552
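
In the output above, each model's accuracy and recall columns hold the same value. This follows from `average='weighted'`: recall averaged over classes with weights proportional to class support reduces to the fraction of correct predictions, which is the accuracy. The sketch below checks this on hypothetical labels and compares it with the recall of the positive class alone, which is often the more informative number for risk detection.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical ground truth and predictions: 3 positives, 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
# Support-weighted mean of per-class recall.
weighted_recall = recall_score(y_true, y_pred, average="weighted")
# Recall of the positive (risky) class only.
positive_recall = recall_score(y_true, y_pred, pos_label=1)

print(acc, weighted_recall, positive_recall)
```

Here the weighted recall equals the accuracy, while the positive-class recall is much lower because only one of the three risky cases is caught.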


In [14]:
import os
os.remove('credit_record.csv.zip')
os.remove('application_record.csv.zip')
os.remove('credit_record.csv')
os.remove('application_record.csv')