# A FICO problem

FICO is the industry standard for determining creditworthiness. However, research suggests that FICO is overused: its usage fits the idiom "square peg in a round hole".

Evidence:

* A [research study](https://www.stlouisfed.org/publications/regional-economist/october-2008/did-credit-scores-predict-the-subprime-crisis) by the Federal Reserve Bank of St. Louis indicates that __FICO scores were a poor predictor of the subprime mortgage crisis__.
> Given the nature of FICO scores, one might expect to find a relationship between borrowers’ scores and the incidence of default and foreclosure.....FICO scores have not indicated that relationship....higher FICO scores have been associated with bigger increases in default rates over time.
* [Research suggests](https://www.marketwatch.com/story/your-digital-footprint-could-provide-a-more-accurate-credit-score-2018-05-03) that __digital footprints can outperform FICO__.
> Those who order from mobile phones are three times as likely to default as those who order from desktops. A customer who uses her name in her email address is 30 percent less likely to default than one who doesn’t. Those who shop between noon and 6 p.m. are half as likely to default as midnight to 6 a.m. buyers
* FICO depends on credit usage (CU). __CU is unreliable__ because it fluctuates through economic cycles. [CU among millennials dropped](http://fortune.com/2018/02/27/why-millennials-are-ditching-credit-cards/), especially during the 2008 recession. Yet being debt-conscious is financially desirable.
* Paying only the minimum on each credit statement, on time, results in both a strong score and __mounting debt__.
* Credit utilization ratios factor heavily into FICO ( $\text{util} = \frac{\text{debt}}{\text{availCredit}}$ ). A healthy utilization is < 0.3. Credit card companies sneakily encourage opening several credit lines to improve this ratio. This is a conflict of interest, because __they profit from credit debt__.
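
To make the utilization incentive concrete, here is a minimal sketch of the ratio from the last bullet. The function name `utilization` and all dollar amounts are hypothetical, chosen only for illustration; the 0.3 threshold is the one quoted above.

```python
def utilization(debt, available_credit):
    """Credit utilization ratio: debt divided by total available credit."""
    if available_credit <= 0:
        raise ValueError("available_credit must be positive")
    return debt / available_credit


# Hypothetical borrower: $3,000 of debt on a single card with a $5,000 limit.
one_card = utilization(3000, 5000)

# Same debt after opening a second card with a $7,000 limit.
two_cards = utilization(3000, 5000 + 7000)

print(f"one card:  {one_card:.2f}")   # above the 0.3 guideline
print(f"two cards: {two_cards:.2f}")  # below the guideline, with no debt repaid
```

In this made-up case the ratio moves from "unhealthy" to "healthy" even though the borrower owes exactly as much as before, which is the weakness the bullet describes.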

# An ML solution to credit defaults

The data set comes from the University of California, Irvine (UCI) Machine Learning Repository and contains 30,000 instances of credit card default data.

In [8]:
# Data Exploration

import pandas as pd
fname = './UCI_Credit_Card.csv'

df = pd.read_csv(fname)
df = df.drop(['ID'], axis=1)
df.sample(5)

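| | LIMIT_BAL | SEX | EDUCATION | MARRIAGE | AGE | PAY_0 | PAY_2 | PAY_3 | PAY_4 | PAY_5 | ... | BILL_AMT4 | BILL_AMT5 | BILL_AMT6 | PAY_AMT1 | PAY_AMT2 | PAY_AMT3 | PAY_AMT4 | PAY_AMT5 | PAY_AMT6 | default.payment.next.month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25058 | 130000.0 | 1 | 3 | 2 | 49 | 0 | 0 | 0 | 0 | 0 | ... | 16898.0 | 11236.0 | 6944.0 | 1610.0 | 1808.0 | 7014.0 | 27.0 | 7011.0 | 4408.0 | 0 |
| 20677 | 90000.0 | 1 | 3 | 3 | 40 | 0 | 0 | 0 | 0 | 0 | ... | 56544.0 | 57048.0 | 58421.0 | 1923.0 | 2597.0 | 2637.0 | 2041.0 | 2291.0 | 4824.0 | 1 |
| 5917 | 200000.0 | 2 | 3 | 1 | 46 | -1 | -1 | -1 | -1 | -2 | ... | 0.0 | 0.0 | 0.0 | 5590.0 | 2500.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 |
| 11139 | 60000.0 | 2 | 1 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | ... | 41098.0 | 43536.0 | 44670.0 | 2000.0 | 2000.0 | 2000.0 | 3100.0 | 2000.0 | 0.0 | 0 |
| 28631 | 120000.0 | 2 | 2 | 1 | 41 | -1 | -1 | -1 | -1 | -1 | ... | 4502.0 | 6013.0 | 5094.0 | 8209.0 | 421.0 | 5000.0 | 6013.0 | 0.0 | 7600.0 | 0 |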


In [3]:
from time import time
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
from read_csv import read

def main():
    data_set = read(fname)
    features, labels = split_features_labels(data_set)
    # test_size=0.7 holds out 70% of the rows for testing, leaving 30% for training.
    train_features, train_labels, test_features, test_labels = split_train_test(features, labels, 0.7)
    print(len(train_features), ' ', len(test_features))
    clf = svm.SVC()
    print('Start training...')
    t_start = time()
    clf.fit(train_features, train_labels)
    print('Training time: ', round(time() - t_start, 3), 's')
    # accuracy_score expects (y_true, y_pred).
    print('Accuracy: ', accuracy_score(test_labels, clf.predict(test_features)))

def split_train_test(features, labels, test_size):
    """Shuffle the rows with a fixed seed and hold out a `test_size` fraction as the test set."""
    total_test_size = int(len(features) * test_size)
    np.random.seed(2)  # fixed seed so the split is reproducible
    indices = np.random.permutation(len(features))
    train_features = features[indices[:-total_test_size]]
    train_labels = labels[indices[:-total_test_size]]
    test_features = features[indices[-total_test_size:]]
    test_labels = labels[indices[-total_test_size:]]
    return train_features, train_labels, test_features, test_labels

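def split_features_labels(data_set):
    features = data_set['features']
    labels = data_set['labels']

    # Drop the leading ID column from every row.
    ids_removed = [x[1:] for x in features]
    # Assuming CSV column order: LIMIT_BAL, SEX, EDUCATION, MARRIAGE, AGE, then six PAY_* codes.
    features = [x[:11] for x in ids_removed]

    # Next six columns are the bill amounts; the rest are the payment amounts.
    bills = [x[11:17] for x in ids_removed]
    payments = [x[17:] for x in ids_removed]

    # Fraction of each bill that was paid; a bill of zero or less counts as fully paid (1).
    div = lambda a, b: [x / y if y > 0 else 1 for x, y in zip(a, b)]
    fraction_paid = list(map(div, payments, bills))

    # Append the payment fractions to the base features of each row.
    features = [val + fraction_paid[idx] for idx, val in enumerate(features)]

    return np.array([np.array(x) for x in features]), np.array(labels)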

if __name__ == '__main__':
    main()

9000   21000
Start training...
Training time:  2.969 s
Accuracy:  0.7961904761904762
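
The fraction-paid feature built in `split_features_labels` can be checked by hand on two of the sampled rows shown in the data exploration output. This is only a minimal sketch: `fraction_paid` here is a hypothetical standalone helper that mirrors the `div` lambda, and it uses just the three bill/payment pairs visible in the truncated sample (`BILL_AMT4` to `BILL_AMT6` against `PAY_AMT4` to `PAY_AMT6`).

```python
def fraction_paid(payments, bills):
    """Payment divided by bill, pairwise; a bill of zero or less counts as fully paid (1)."""
    return [p / b if b > 0 else 1 for p, b in zip(payments, bills)]


# Row 11139 from the sample: PAY_AMT4..6 against BILL_AMT4..6.
row_11139 = fraction_paid([3100.0, 2000.0, 0.0], [41098.0, 43536.0, 44670.0])

# Row 5917 from the sample: every visible bill is 0, so the fallback value 1 is used.
row_5917 = fraction_paid([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])

print([round(f, 3) for f in row_11139])
print(row_5917)
```

Row 5917 shows a side effect of the fallback: an account with no bill is encoded the same way as an account that paid its bill in full, even though this particular borrower is labeled as a default.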
