## Big Data in Finance Assignment 1

#### Akos Furton, Joaquin Coitino, Marnelia Scribante, Siow Meng Low

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier

%matplotlib inline

#### Read Data

The code below reads in the data and standardises the features for use with the 1-NN classifier.

In [2]:
seedValue = 99

# `sheet_name` replaces the `sheetname` keyword, which was removed from pandas
loans = pd.read_excel('LCloanbook.xls', sheet_name = 'Data')

xDF = loans.iloc[ : , 1:]
yDF = loans.iloc[ :, 0]

# `to_numpy()` replaces `as_matrix()`, which was removed in pandas 1.0
x = xDF.to_numpy()
y = yDF.to_numpy()

# Standardise before using 1-NN
xScaled = StandardScaler().fit_transform(x)


#### Full Model

The accuracies of the three techniques in full models (computed using 10-Fold Cross Validation) are:  

In [3]:
# Full Models
loans_full_logistic = cross_val_score(LogisticRegression(), x, y, scoring = 'accuracy', 
                                      cv = KFold(10, shuffle = True, random_state = seedValue))
loans_full_tree = cross_val_score(DecisionTreeClassifier(random_state = seedValue), x, y, scoring = 'accuracy', 
                                  cv = KFold(10, shuffle = True, random_state = seedValue))
loans_full_knn = cross_val_score(KNeighborsClassifier(n_neighbors = 1), xScaled, y, scoring = 'accuracy', 
                                 cv = KFold(10, shuffle = True, random_state = seedValue))

print("Accuracy of Logistic Regression (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_full_logistic) * 100), \
      "%")
print("Accuracy of Tree Classifier (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_full_tree) * 100), "%")
print("Accuracy of 1-NN (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_full_knn) * 100), "%")


Accuracy of Logistic Regression (computed using 10-Fold Cross Validation): 81.94 %
Accuracy of Tree Classifier (computed using 10-Fold Cross Validation): 84.36 %
Accuracy of 1-NN (computed using 10-Fold Cross Validation): 59.79 %
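
The 1-NN model above is scored on features that were standardised once on the full data set, so each held-out fold has contributed to the scaling statistics. A minimal sketch of an alternative, assuming synthetic data from `make_classification` as a stand-in for the loan book, wraps the scaler and the classifier in a pipeline so that the scaler is refitted on the training folds only. The names `X_demo`, `y_demo` and `knn_pipeline` are illustrative and do not appear in the original notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the loan data (assumption: the real x and y would be used instead)
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=99)

# The scaler is fitted on each training split and then applied to the held-out split
knn_pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
cv = KFold(n_splits=10, shuffle=True, random_state=99)
scores = cross_val_score(knn_pipeline, X_demo, y_demo, scoring='accuracy', cv=cv)

print("Accuracy of 1-NN with in-fold scaling: %.2f %%" % (np.mean(scores) * 100))
```

With a data set of this size the difference from scaling up front is usually small, but the pipeline keeps the cross-validated estimate free of information from the test folds.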


#### Reduced Model Attributes

The loan attributes we believe are the most informative are:
  
1) acc_now_delinq - If a borrower currently has a non-zero number of delinquent accounts, we believe they would be much more likely to default on another loan.
   
2) delinq_2yrs - If a borrower has fallen over 30 days past due on a loan within the past two years, we believe they are liable to fall overdue again in future loans.
   
3) dti - We believe that the borrower's debt-to-income ratio is critical in determining delinquency because debt that is small relative to income is more easily repaid. A borrower who falls behind on a large loan would have difficulty meeting interest payments.
   
4) home_ownership_MORTGAGE - We believe a borrower who currently has a mortgage would be more likely to default on a loan, given that they have existing debt obligations.
   
5) home_ownership_OWN - A homeowner would be much less likely to default on a loan, considering they likely have no existing housing debt payments and a much higher likelihood of free cash flow.
   
6) home_ownership_RENT - A person who rents their home would be more likely to default on a loan considering that rent payments are often a person's largest monthly cash outflow. Therefore, since rent payments come first in a person's budget, they might not have enough money left for loan repayments.
   
7) int_rate - The interest rate on a loan reflects the market's prevailing judgement of its risk, so we believe that higher rates correspond with more delinquencies. Loans that are assessed as riskier carry higher interest rates to compensate for the higher risk of default.
   
8) mths_since_last_delinq - We see the length of time since a borrower was last delinquent as a key predictor because a borrower who has recently been delinquent is likely to default again. Conversely, a borrower whose last delinquency was long ago may have since improved his or her financial status.
   
9) open_acc_6m - The number of open credit lines in the last 6 months is an important attribute because people who lean on credit for their daily expenses are more likely to default. A person would only seek to open multiple lines of credit if they have exhausted their currently open lines.
   
10) pub_rec - We believe the number of derogatory public records is critical because it reflects the borrower's past history with credit. With prior bankruptcies or liens, the borrower has shown a history of non-repayment. Therefore, they would be much more likely to default on a new loan.
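
The three `home_ownership_*` attributes in the list above are indicator (dummy) columns, one per category. The spreadsheet used here already contains them, so the sketch below is only an illustration of how such columns can be produced; the raw `home_ownership` column and its values are hypothetical.

```python
import pandas as pd

# Hypothetical raw categorical column (assumption: not part of the original data file)
raw = pd.DataFrame({'home_ownership': ['RENT', 'OWN', 'MORTGAGE', 'RENT']})

# One indicator column per category, named <column>_<category>
dummies = pd.get_dummies(raw, columns=['home_ownership'])
print(dummies.columns.tolist())
```

Each row has exactly one of the indicators set, which is why the three attributes carry overlapping information about the same underlying variable.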

#### Reduced Model

The accuracies of the three techniques in reduced models (computed using 10-Fold Cross Validation) are:  

In [4]:
# Reduced Models
attrSlct = ['acc_now_delinq', 'delinq_2yrs', 'dti', 'home_ownership_MORTGAGE', 'home_ownership_OWN', \
            'home_ownership_RENT', 'int_rate', 'mths_since_last_delinq', 'open_acc_6m', 'pub_rec']

x_Reduced = xDF.loc[ : , attrSlct].to_numpy()
xScaled_Reduced = StandardScaler().fit_transform(x_Reduced)

loans_reduced_logistic = cross_val_score(LogisticRegression(), x_Reduced, y, scoring = 'accuracy', 
                                         cv = KFold(10, shuffle = True, random_state = seedValue))
loans_reduced_tree = cross_val_score(DecisionTreeClassifier(random_state = seedValue), x_Reduced, y, 
                                     scoring = 'accuracy', cv = KFold(10, shuffle = True, random_state = seedValue))
loans_reduced_knn = cross_val_score(KNeighborsClassifier(n_neighbors = 1), xScaled_Reduced, y, 
                                    scoring = 'accuracy', cv = KFold(10, shuffle = True, random_state = seedValue))

print("Accuracy of Logistic Regression (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_reduced_logistic) * 100), \
      "%")
print("Accuracy of Tree Classifier (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_reduced_tree) * 100), "%")
print("Accuracy of 1-NN (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_reduced_knn) * 100), "%")



Accuracy of Logistic Regression (computed using 10-Fold Cross Validation): 64.14 %
Accuracy of Tree Classifier (computed using 10-Fold Cross Validation): 63.14 %
Accuracy of 1-NN (computed using 10-Fold Cross Validation): 56.97 %


#### LASSO-MODEL (with 10-Fold Cross Validation)

The accuracy of the LASSO-MODEL, computed using 10-Fold Cross Validation, is shown below.  

In [5]:
# Logit with 10-fold cross-validation
LRCV_l1 = LogisticRegressionCV(Cs = [0.002], 
                               cv = KFold(10, shuffle = True, random_state = 99), 
                               penalty='l1', solver = 'liblinear')

LRCV_l1.fit(xScaled, y)

print("Number of Attributes after 10-fold cross-validation: ", sum(LRCV_l1.coef_[0] != 0))
print("Average accuracy over 10-fold cross-validation: %.2f" % (np.mean(LRCV_l1.scores_[1]) * 100), "%")


Number of Attributes after 10-fold cross-validation:  10
Average accuracy over 10-fold cross-validation: 81.87 %
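
The cell above passes a single value, `Cs = [0.002]`, so the cross-validation estimates accuracy at that penalty strength but does not choose between alternatives. A hedged sketch of letting `LogisticRegressionCV` pick from a small grid is given below. It uses synthetic data as a stand-in for `xScaled` and `y`, and the grid values are arbitrary assumptions rather than values taken from the assignment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardised loan features (assumption)
X_demo, y_demo = make_classification(n_samples=400, n_features=30,
                                     n_informative=5, random_state=99)
X_demo = StandardScaler().fit_transform(X_demo)

# Candidate inverse penalty strengths: smaller C means a stronger L1 penalty
candidate_Cs = [0.001, 0.01, 0.1, 1.0]
model = LogisticRegressionCV(Cs=candidate_Cs,
                             cv=KFold(10, shuffle=True, random_state=99),
                             penalty='l1', solver='liblinear', scoring='accuracy')
model.fit(X_demo, y_demo)

# C_ holds the value with the best mean cross-validated score
best_C = float(np.ravel(model.C_)[0])
n_nonzero = int(np.sum(np.ravel(model.coef_) != 0))
print("Selected C:", best_C)
print("Number of non-zero coefficients:", n_nonzero)
```

The value chosen this way maximises accuracy and will not generally leave exactly ten attributes, so a fixed `C` remains the simpler route when a specific number of attributes is required.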


#### LASSO-MODEL (without 10-Fold Cross Validation)

The in-sample accuracy of the LASSO-MODEL, fitted on all the data without cross-validation, is shown below.  

In [6]:
# Logit using all data without cross-validation
#for c in np.arange(0.0015, 0.0025, 0.0001):
#    LR_l1 = LogisticRegression(C = c, penalty='l1', solver = 'liblinear')
#    LR_l1.fit(xScaled, y)
#    
#    print("C=", c)
#    print("Number of Attributes=", sum(LR_l1.coef_[0] != 0))
#    print("In-Sample Accuracy=", LR_l1.score(xScaled, y))
    
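# liblinear supports the L1 penalty; the current default solver (lbfgs) does not
LR_l1 = LogisticRegression(C = 0.002, penalty='l1', solver = 'liblinear')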
LR_l1.fit(xScaled, y)

interceptDF = pd.DataFrame(LR_l1.intercept_, index = ['Intercept'], columns = ['Value'])
coefDF = pd.DataFrame(LR_l1.coef_[0][np.where(LR_l1.coef_[0] != 0)], 
                      index = xDF.columns[np.where(LR_l1.coef_[0] != 0)], 
                      columns = ['Value'])

finalDF = pd.concat([interceptDF, coefDF])

print("Number of Attributes:", sum(LR_l1.coef_[0] != 0))
print("In-Sample Accuracy: %.2f" % (LR_l1.score(xScaled, y) * 100), "%")
print("The coefficient values of LASSO-MODEL are: ")
print(finalDF)
    

Number of Attributes: 10
In-Sample Accuracy: 81.92 %
The coefficient values of LASSO-MODEL are: 
                    Value
Intercept        0.118545
loan_amnt        1.140102
int_rate         0.569961
installment      0.093198
annual_inc      -0.009127
dti              0.032038
inq_last_6mths   0.037829
out_prncp       -1.464046
total_rec_prncp -1.291023
issue_year      -0.500639
GRADE_A         -0.023369


#### Revisit the Techniques with LASSO-selected attributes

We retrain the three classifiers using the LASSO-selected attributes; the accuracies are shown below:  

In [3]:
# LASSO Attributes
attrLASSO = ['loan_amnt', 'int_rate', 'installment', 'annual_inc', 'dti', 'inq_last_6mths', \
             'out_prncp', 'total_rec_prncp', 'issue_year', 'GRADE_A']

x_LASSO = xDF.loc[ : , attrLASSO].to_numpy()
xScaled_LASSO = StandardScaler().fit_transform(x_LASSO)

loans_LASSO_logistic = cross_val_score(LogisticRegression(), x_LASSO, y, scoring = 'accuracy', 
                                         cv = KFold(10, shuffle = True, random_state = seedValue))
loans_LASSO_tree = cross_val_score(DecisionTreeClassifier(random_state = seedValue), x_LASSO, y, 
                                   scoring = 'accuracy', cv = KFold(10, shuffle = True, random_state = seedValue))
loans_LASSO_knn = cross_val_score(KNeighborsClassifier(n_neighbors = 1), xScaled_LASSO, y, scoring = 'accuracy', 
                                    cv = KFold(10, shuffle = True, random_state = seedValue))

print("Accuracy of Logistic Regression (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_LASSO_logistic) * 100), \
      "%")
print("Accuracy of Tree Classifier (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_LASSO_tree) * 100), "%")
print("Accuracy of 1-NN (computed using 10-Fold Cross Validation): %.2f" % (np.mean(loans_LASSO_knn) * 100), "%")


Accuracy of Logistic Regression (computed using 10-Fold Cross Validation): 82.01 %
Accuracy of Tree Classifier (computed using 10-Fold Cross Validation): 86.06 %
Accuracy of 1-NN (computed using 10-Fold Cross Validation): 81.50 %
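
The 1-NN accuracy rises from 59.79 % on the full attribute set to 81.50 % on the LASSO-selected attributes. One plausible explanation, which the notebook does not test directly, is that uninformative attributes dominate the Euclidean distances that 1-NN relies on. The sketch below illustrates the effect on synthetic data only; the sample size and the 200 added noise columns are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(99)

# Five informative features, then the same features padded with pure-noise columns
X_inf, y_demo = make_classification(n_samples=600, n_features=5, n_informative=5,
                                    n_redundant=0, random_state=99)
X_noisy = np.hstack([X_inf, rng.normal(size=(600, 200))])

cv = KFold(n_splits=10, shuffle=True, random_state=99)
one_nn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))

acc_informative = cross_val_score(one_nn, X_inf, y_demo, scoring='accuracy', cv=cv).mean()
acc_noisy = cross_val_score(one_nn, X_noisy, y_demo, scoring='accuracy', cv=cv).mean()

print("1-NN accuracy, informative features only: %.2f %%" % (acc_informative * 100))
print("1-NN accuracy, with noise features added: %.2f %%" % (acc_noisy * 100))
```

Logistic regression and the tree are less sensitive to this because they weight or select attributes individually instead of combining all of them into a single distance.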


#### Random Forest using LASSO-selected Attributes

In [6]:
loans_LASSO_RandomForest = cross_val_score(RandomForestClassifier(), x_LASSO, y, scoring = 'accuracy',
               cv = KFold(10, shuffle = True, random_state = seedValue)).mean()
print("Accuracy of Random Forest (computed using 10-Fold Cross Validation): %.2f" % (loans_LASSO_RandomForest * 100), \
      "%")


Accuracy of Random Forest (computed using 10-Fold Cross Validation): 88.92 %
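
`RandomForestClassifier()` is called above without a `random_state`, so the reported accuracy will vary slightly between runs, and the default number of trees changed from 10 to 100 in scikit-learn 0.22. A minimal sketch of a reproducible variant, assuming synthetic data in place of `x_LASSO` and `y`, fixes both settings and also reads the fitted feature importances. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the LASSO-selected attributes (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10,
                                     n_informative=4, random_state=99)

# Fixing n_estimators and random_state makes the result repeatable across runs
forest = RandomForestClassifier(n_estimators=100, random_state=99)
cv = KFold(n_splits=10, shuffle=True, random_state=99)
scores = cross_val_score(forest, X_demo, y_demo, scoring='accuracy', cv=cv)
print("Accuracy of Random Forest: %.2f %%" % (np.mean(scores) * 100))

# Impurity-based importances from a forest fitted on all the data; they sum to 1
forest.fit(X_demo, y_demo)
importances = forest.feature_importances_
top = np.argsort(importances)[::-1][:3]
print("Indices of the three most important features:", top.tolist())
```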
