# CS 559: Final Project
Yashvi, Annanya Chaturvedi, Zheng Ke Xu

## 1. Feature Selection
Analyze the 95-column data set and select the features most relevant to predicting bankruptcy.
- Section 3.1: Training data process for Clustering

In [301]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import joblib
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('./train_data.csv').drop(columns=["Index", " Net Income Flag"])

def preprocess_data(df):
    """Standardize the training data and select features correlated with the target."""
    # Standardize every column to zero mean and unit variance. The target is scaled
    # too, which does not affect the Pearson correlations used for selection below.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(df.to_numpy())

    df_scaled = pd.DataFrame(
        X_scaled,
        columns=df.columns,
        index=df.index
    )

    corr_matrix = df_scaled.corr()

    # Keep features whose absolute correlation with the target exceeds 0.1
    target_corr = corr_matrix["Bankrupt?"]
    strong_corr = target_corr[target_corr.abs() > 0.1].drop("Bankrupt?")
    features = df_scaled[strong_corr.index.tolist()]

    print(features)

    return features, corr_matrix

joblib.dump(preprocess_data, 'preprocess_data.joblib')
print("Custom preprocessor for training data 'preprocess_data' saved to 'preprocess_data.joblib'. Import to transform data.")

Custom preprocessor for training data 'preprocess_data' saved to 'preprocess_data.joblib'. Import to transform data.
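
One caveat on the cell above: `joblib.dump` relies on pickle, which stores a plain function by reference (its module and name) rather than its code, so the saved file holds neither the fitted scaler nor the list of selected columns. If the goal is to reapply the same transform to new data, one option is to persist the fitted `StandardScaler` together with the column lists. The sketch below is a minimal illustration under that assumption; the helper names `fit_preprocessor` and `apply_preprocessor`, the file name, and the toy data are all hypothetical and not part of the project code.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler


def fit_preprocessor(df, selected):
    """Fit a scaler on df and remember which columns to keep afterwards."""
    scaler = StandardScaler().fit(df)
    return {"scaler": scaler, "columns": df.columns.tolist(), "selected": list(selected)}


def apply_preprocessor(state, df):
    """Standardize df with the stored training statistics, then keep the stored columns."""
    X = df[state["columns"]]
    scaled = pd.DataFrame(state["scaler"].transform(X), columns=X.columns, index=X.index)
    return scaled[state["selected"]]


# Toy stand-in for the training data (hypothetical values).
train = pd.DataFrame({"a": [0.0, 2.0, 4.0], "b": [1.0, 1.5, 2.0], "c": [10.0, 20.0, 30.0]})
state = fit_preprocessor(train, selected=["a", "c"])

# Round-trip the fitted state through joblib, as a separate session would.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "preprocessor_state.joblib")
    joblib.dump(state, path)
    restored = joblib.load(path)

# New rows are scaled with the training mean and standard deviation, not their own.
new_rows = pd.DataFrame({"a": [2.0], "b": [1.5], "c": [20.0]})
features = apply_preprocessor(restored, new_rows)
print(features)
```

Storing fitted objects instead of the function means the file can be loaded without redefining `preprocess_data`, and new data is scaled with the training statistics.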


- Run `preprocess_data` on the training data set. This performs standardization and feature selection in a single call.

In [302]:
processed_features, corr_matrix = preprocess_data(df=df)
selected_features = processed_features

print(selected_features)

      ROA(C) before interest and depreciation before interest  \
0                                             -0.904871         
1                                              0.404409         
2                                              1.081097         
3                                             -0.362077         
4                                              0.081298         
...                                                 ...         
5802                                           0.033994         
5803                                           0.795670         
5804                                          -0.226579         
5805                                           0.135016         
5806                                           0.532691         

      ROA(A) before interest and % after tax  \
0                                  -0.835696   
1                                   0.213147   
2                                   0.933137   
3                          

- The main idea is to keep features whose correlation with `Bankrupt?` has an absolute value above 0.1, i.e. greater than 0.1 or less than -0.1. This is a deliberately loose cutoff, so even weakly correlated features are retained.
- The correlation matrix is shown below.

In [303]:
corr_matrix.style.background_gradient(cmap='coolwarm')

Unnamed: 0,Bankrupt?,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,Continuous interest rate (after tax),Operating Expense Rate,Research and development expense rate,Cash flow rate,Interest-bearing debt interest rate,Tax rate (A),Net Value Per Share (B),Net Value Per Share (A),Net Value Per Share (C),Persistent EPS in the Last Four Seasons,Cash Flow Per Share,Revenue Per Share (Yuan ¥),Operating Profit Per Share (Yuan ¥),Per Share Net profit before tax (Yuan ¥),Realized Sales Gross Profit Growth Rate,Operating Profit Growth Rate,After-tax Net Profit Growth Rate,Regular Net Profit Growth Rate,Continuous Net Profit Growth Rate,Total Asset Growth Rate,Net Value Growth Rate,Total Asset Return Growth Rate Ratio,Cash Reinvestment %,Current Ratio,Quick Ratio,Interest Expense Ratio,Total debt/Total net worth,Debt ratio %,Net worth/Assets,Long-term fund suitability ratio (A),Borrowing dependency,Contingent liabilities/Net worth,Operating profit/Paid-in capital,Net profit before tax/Paid-in capital,Inventory and accounts receivable/Net value,Total Asset Turnover,Accounts Receivable Turnover,Average Collection Days,Inventory Turnover Rate (times),Fixed Assets Turnover Frequency,Net Worth Turnover Rate (times),Revenue per person,Operating profit per person,Allocation rate per person,Working Capital to Total Assets,Quick Assets/Total Assets,Current Assets/Total Assets,Cash/Total Assets,Quick Assets/Current Liability,Cash/Current Liability,Current Liability to Assets,Operating Funds to Liability,Inventory/Working Capital,Inventory/Current Liability,Current Liabilities/Liability,Working Capital/Equity,Current Liabilities/Equity,Long-term Liability to Current Assets,Retained Earnings to Total Assets,Total income/Total expense,Total expense/Assets,Current Asset Turnover Rate,Quick Asset Turnover Rate,Working capitcal Turnover Rate,Cash Turnover Rate,Cash Flow to Sales,Fixed Assets to Assets,Current Liability to Liability,Current Liability to Equity,Equity to Long-term Liability,Cash Flow to Total Assets,Cash Flow to Liability,CFO to Assets,Cash Flow to Equity,Current Liability to Current Assets,Liability-Assets Flag,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Equity to Liability
Bankrupt?,1.0,-0.258319,-0.279412,-0.26943,-0.096922,-0.096414,0.000175,-0.008481,-0.008858,-0.017882,-0.008028,-0.003046,-0.03296,-0.075394,-0.02289,-0.10776,-0.163697,-0.163768,-0.163004,-0.213784,-0.075772,-0.004272,-0.142063,-0.196457,-0.00078,-0.015827,-0.038332,-0.038076,-0.009896,-0.040414,0.068779,-0.016447,-0.092378,-0.002466,0.031576,-0.003216,0.012873,0.255948,-0.255948,0.026166,0.194085,0.073503,-0.141154,-0.202788,0.092437,-0.068582,-0.004621,-0.007115,0.001436,0.073608,0.01817,0.069851,-0.090843,0.002566,-0.195747,-0.084633,-0.04711,-0.104714,-0.003477,0.082318,0.195163,-0.080985,-0.002211,0.002758,-0.025323,-0.15399,0.158252,-0.005161,-0.245174,-0.00702,0.150057,0.01284,0.022955,-0.00268,-0.015366,0.000608,0.069851,-0.025323,0.158252,0.174562,-0.079996,-0.051127,-0.123874,-0.061547,0.164024,0.130209,-0.319499,0.021277,-0.007735,-0.096923,-0.178342,0.178064,0.019088,-0.003186,-0.086477
ROA(C) before interest and depreciation before interest,-0.258319,1.0,0.939022,0.986581,0.326427,0.324283,0.035481,0.0513,0.04753,0.015869,0.049659,0.067838,0.102687,0.31613,0.055408,0.244951,0.504123,0.503939,0.503832,0.772327,0.375872,-0.009174,0.682962,0.746449,5.2e-05,0.036583,0.115842,0.116057,0.025958,0.003999,-0.023775,0.079202,0.34127,0.01422,-0.023135,-0.000375,-0.023585,-0.262291,0.262291,0.003436,-0.165805,-0.036279,0.680535,0.749051,-0.117544,0.210508,-0.035151,0.007019,-0.062505,-0.065194,0.035519,-0.011107,0.314631,-0.013727,0.266583,0.184578,0.106724,0.248698,-0.012352,-0.045973,-0.209421,0.381735,-0.00227,0.016291,0.054588,0.115619,-0.140083,0.02566,0.695126,0.022401,-0.292769,0.007351,-0.028431,0.004023,-0.03311,0.012853,-0.009992,0.054588,-0.140083,-0.089105,0.268913,0.169215,0.503955,0.119113,-0.159225,-0.107906,0.897289,-0.065082,0.009797,0.326427,0.259451,-0.142374,-0.021375,0.018494,0.060622
ROA(A) before interest and % after tax,-0.279412,0.939022,1.0,0.95459,0.320266,0.318084,0.031468,0.050891,0.047305,0.025164,0.048005,0.076321,0.08194,0.280526,0.05703,0.219697,0.531529,0.531542,0.531555,0.762051,0.321699,-0.003449,0.649041,0.749162,0.003434,0.042867,0.126238,0.127004,0.025633,0.012524,-0.069352,0.080966,0.306084,0.015222,-0.014514,0.00481,-0.010934,-0.26589,0.26589,0.021213,-0.167882,-0.036991,0.646183,0.755478,-0.08199,0.223502,-0.032219,0.009095,-0.051912,-0.134476,0.047817,-0.008559,0.336322,-0.006706,0.313757,0.205864,0.165733,0.231794,-0.013294,-0.038257,-0.193348,0.343269,0.001561,0.007291,0.084362,0.130712,-0.132449,0.02588,0.763735,0.028255,-0.35296,-0.000604,-0.024585,0.006493,-0.02729,0.01327,-0.006401,0.084362,-0.132449,-0.111379,0.270667,0.168998,0.440194,0.099353,-0.196662,-0.163373,0.972268,-0.094109,0.010463,0.320267,0.275467,-0.141451,-0.013228,0.018416,0.068064
ROA(B) before interest and depreciation after tax,-0.26943,0.986581,0.95459,1.0,0.325783,0.323621,0.034953,0.051643,0.048291,0.017906,0.050582,0.066367,0.099302,0.316054,0.052883,0.191227,0.500545,0.500501,0.50043,0.761101,0.361146,-0.006732,0.654977,0.717778,0.001785,0.036132,0.118018,0.118191,0.025085,0.007071,-0.028368,0.078981,0.338233,0.014,-0.021031,0.002614,-0.022553,-0.267003,0.267003,0.003721,-0.162512,-0.034514,0.652213,0.720797,-0.116931,0.194465,-0.035331,0.010142,-0.053052,-0.059517,0.024936,-0.011101,0.316409,-0.014013,0.26712,0.16774,0.100846,0.239495,-0.011958,-0.041335,-0.21791,0.381692,0.000449,0.010108,0.048026,0.11358,-0.140306,0.021728,0.718711,0.023499,-0.319325,-0.001005,-0.029615,0.004783,-0.032399,0.013156,-0.009117,0.048026,-0.140306,-0.084276,0.264618,0.167106,0.495913,0.114275,-0.161353,-0.120077,0.921417,-0.083162,0.0076,0.325784,0.264747,-0.141286,-0.018566,0.018008,0.06533
Operating Gross Margin,-0.096922,0.326427,0.320266,0.325783,1.0,0.999504,0.003981,0.029547,0.02453,0.050704,0.027058,-0.203346,-0.018604,0.345703,0.023546,0.063753,0.140959,0.141368,0.141424,0.247826,0.156195,0.182558,0.257332,0.239381,0.013289,0.022647,0.054165,0.05304,0.009078,0.008874,-0.018306,0.025527,0.134739,0.026152,0.013589,-0.004727,-0.023338,-0.240318,0.240318,0.005664,-0.082719,-0.021054,0.256679,0.239134,-0.092745,-0.097165,0.08498,0.0229,0.044253,0.002922,-0.133937,0.042898,0.23593,-0.007314,0.246827,0.155004,0.097916,0.245002,-0.003784,-0.025549,-0.195092,0.241928,-0.035268,0.028948,0.063027,0.071992,-0.074374,0.007148,0.195202,0.044854,0.233371,-0.117527,-0.12939,0.020065,-0.0716,-0.044079,0.00367,0.063027,-0.074374,-0.0706,0.110598,0.122402,0.226072,0.039164,-0.134583,-0.033724,0.306854,0.023122,-0.000388,1.0,0.06909,-0.080496,-0.014096,0.003759,0.111684
Realized Sales Gross Margin,-0.096414,0.324283,0.318084,0.323621,0.999504,1.0,0.003837,0.029269,0.024187,0.050496,0.026781,-0.203444,-0.019042,0.346033,0.023423,0.063471,0.139133,0.139548,0.139604,0.245735,0.156172,0.182784,0.25633,0.237469,0.013305,0.022555,0.054003,0.052875,0.009076,0.009533,-0.018309,0.025452,0.134865,0.026195,0.013639,-0.004844,-0.023333,-0.240551,0.240551,0.005846,-0.082619,-0.021036,0.255675,0.237166,-0.092857,-0.097708,0.085125,0.022969,0.043743,0.003052,-0.134196,0.04296,0.235276,-0.007638,0.246758,0.154942,0.097974,0.24542,-0.003768,-0.025737,-0.194923,0.241905,-0.035343,0.029175,0.064261,0.071921,-0.074319,0.006756,0.19329,0.04486,0.234137,-0.11747,-0.129386,0.020153,-0.071248,-0.044131,0.003689,0.064261,-0.074319,-0.070624,0.110372,0.122247,0.226008,0.03909,-0.134534,-0.033722,0.304647,0.023201,-0.000623,0.999504,0.068675,-0.080503,-0.013236,0.003747,0.111988
Operating Profit Rate,0.000175,0.035481,0.031468,0.034953,0.003981,0.003837,1.0,0.92497,0.867953,-0.623892,0.916083,0.013467,0.016831,0.024877,0.00289,0.019998,0.019114,0.019068,0.019094,0.019804,0.013808,-0.039801,0.0217,0.019352,0.000649,0.004866,0.005784,0.005701,0.001011,0.03554,-0.000176,0.003445,0.015894,0.000861,0.000464,0.001909,-0.001439,0.011548,-0.011548,0.000764,0.001938,0.000263,0.021443,0.019258,0.015088,0.029825,-0.024425,0.001061,0.010243,0.005562,0.016935,-0.002265,0.018683,0.000813,0.027244,0.027283,0.035383,0.011396,0.000269,0.001231,0.011009,0.021161,-0.001126,0.002031,0.020002,0.010776,0.001849,0.002099,0.023143,0.002012,0.007438,0.00839,0.012994,-0.230663,0.016966,-0.085395,0.000134,0.020002,0.001849,0.001053,0.02275,0.006143,0.026952,0.014356,-0.083109,-0.00032,0.028782,-0.003547,0.001036,0.003982,0.00565,0.002065,0.001147,0.000314,-0.014995
Pre-tax net Interest Rate,-0.008481,0.0513,0.050891,0.051643,0.029547,0.029269,0.92497,1.0,0.986141,-0.280075,0.999153,0.013251,0.015815,0.026089,0.003902,0.021305,0.031795,0.031774,0.03179,0.031747,0.015664,0.128504,0.024348,0.03109,0.000785,0.003741,0.006881,0.006759,0.001441,0.035263,-0.001016,0.004054,0.018415,0.003212,-0.01918,0.002318,-0.001994,-0.000313,0.000313,9.9e-05,-0.001657,0.002179,0.024102,0.031372,0.010126,0.02732,0.099777,0.0022,0.006837,0.003312,0.014754,-0.019578,0.019214,0.000689,0.037661,0.029786,0.036102,0.017702,0.00011,-0.000389,-0.001744,0.023082,0.010395,0.001506,0.012572,0.014873,-0.002638,0.001983,0.039887,0.003165,-0.005057,0.007526,0.011361,0.090352,0.014295,0.233716,-5e-05,0.012572,-0.002638,-0.000535,0.046569,0.014821,0.030499,0.027406,-0.144503,-0.003435,0.048416,-0.004671,0.0007,0.029547,0.010209,-0.002245,0.000992,0.00089,-0.012085
After-tax net Interest Rate,-0.008858,0.04753,0.047305,0.048291,0.02453,0.024187,0.867953,0.986141,1.0,-0.16455,0.988239,0.013182,0.015756,0.024171,0.003716,0.01982,0.03044,0.030413,0.03043,0.029091,0.014445,0.119019,0.022486,0.028057,0.000817,0.002793,0.005625,0.005468,0.001141,0.035192,-0.000866,0.003518,0.017381,0.002721,-0.016928,0.002354,-0.001389,-0.003259,0.003259,0.000867,-0.001762,0.00208,0.022272,0.028394,0.010169,0.027567,0.088774,0.002039,0.007145,0.004087,0.015207,-0.042867,0.017413,0.000718,0.042227,0.030802,0.037192,0.017764,0.000191,-7.5e-05,-0.006308,0.021333,0.009621,0.001457,0.007551,0.01691,-0.003681,0.001931,0.038517,0.002846,-0.003118,0.007745,0.011724,0.247629,0.014769,0.38167,-1e-06,0.007551,-0.003681,0.000519,0.051636,0.015629,0.028448,0.031665,-0.174057,-0.002976,0.045506,-0.004154,-0.00072,0.024531,0.009642,-0.002866,0.001093,0.000865,-0.008068
Non-industry income and expenditure/revenue,-0.017882,0.015869,0.025164,0.017906,0.050704,0.050496,-0.623892,-0.280075,-0.16455,1.0,-0.259368,-0.006767,-0.009993,-0.009191,0.000726,-0.006702,0.017099,0.017172,0.01714,0.015259,-0.002667,0.36479,-0.004746,0.015049,-2.7e-05,-0.004599,-0.000461,-0.000502,0.000408,-0.017261,-0.001644,-0.000365,-0.00228,0.004431,-0.040612,-5.5e-05,-0.000466,-0.029816,0.029816,-0.001727,-0.008304,0.003816,-0.0046,0.015867,-0.017289,-0.019157,0.266877,0.001844,-0.011816,-0.007239,-0.012438,-0.03454,-0.007681,-0.000636,0.008627,-0.007666,-0.015136,0.007616,-0.000453,-0.003911,-0.031394,-0.005986,0.02422,-0.002033,-0.024673,0.003367,-0.010096,-0.001225,0.023565,0.001425,-0.029187,-0.005716,-0.00946,0.768451,-0.01346,0.696317,-0.000441,-0.024673,-0.010096,-0.003761,0.0383,0.014959,-0.005361,0.020094,-0.087222,-0.006255,0.02686,-0.000647,-0.001179,0.050703,0.006721,-0.009833,-0.000857,0.001036,0.013026
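
The selection rule used in this section (keep a feature when the absolute value of its correlation with `Bankrupt?` exceeds 0.1) can be illustrated on a small frame. This is a minimal sketch on made-up values, not the project data; the function name `select_by_target_correlation` and the toy column names are hypothetical. It shows that taking the absolute value keeps negatively correlated features as well as positively correlated ones.

```python
import pandas as pd


def select_by_target_correlation(df, target="Bankrupt?", threshold=0.1):
    """Return the features whose |Pearson correlation| with the target exceeds threshold."""
    corr = df.corr()[target].drop(target)
    return corr[corr.abs() > threshold]


# Made-up data: two informative columns and one that is uncorrelated by construction.
toy = pd.DataFrame({
    "Bankrupt?":  [0, 1, 0, 1, 0, 1, 0, 1],
    "pos_signal": [0.1, 0.9, 0.2, 1.1, -0.1, 0.8, 0.0, 1.0],  # rises with the target
    "neg_signal": [1.0, 0.1, 0.9, -0.1, 1.1, 0.2, 0.8, 0.0],  # falls with the target
    "flat":       [1, 1, -1, -1, 1, 1, -1, -1],               # zero correlation with the target
})

kept = select_by_target_correlation(toy)
print(kept)
```

Filtering on the signed value alone (for example `corr > 0.1`) would drop `neg_signal`, which is why the notebook's filter uses the absolute value.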


## 2. Clustering Subgroups
Determine subgroups from the selected features using the K-means unsupervised learning model.
- Section 3.2: Company Characterization
- Start by training KMeans on selected features

In [304]:
from sklearn.cluster import KMeans

k_means = KMeans(n_clusters=6, n_init=10, random_state=100, init='k-means++')
k_means.fit(selected_features.to_numpy())

labels = k_means.labels_
unique_labels = pd.Series(labels).value_counts()
centroids = k_means.cluster_centers_ 

print("K-means labels: \n" + str(labels) + "\n")
print("K-means unique labels: \n" + str(unique_labels) + "\n")
print("K-means centroids: \n" + str(centroids))

K-means labels: 
[1 2 2 ... 1 2 2]

K-means unique labels: 
2    2379
1    2292
3     682
0     444
4       7
5       3
Name: count, dtype: int64

K-means centroids: 
[[ 1.68016643e+00  1.60494023e+00  1.60925902e+00  1.09862957e-01
   2.05128544e+00  2.04672919e+00  2.04567996e+00  2.15437349e+00
   2.01714695e+00  2.12007847e+00 -4.38994331e-01  4.38994331e-01
  -2.02573750e-01  2.00992760e+00  2.12395116e+00  7.02320531e-01
   9.35812672e-01 -2.54615636e-01  1.89551487e-01 -1.03904167e-01
   1.08894922e+00 -1.37730746e-01 -1.03904167e-01 -1.37955676e-01
   8.09255720e-01 -3.09588200e-01 -3.47404167e-02  1.35983855e+00
   2.64952414e-01 -1.29013280e-01]
 [-1.58166089e-01 -9.83395527e-02 -1.33079552e-01  1.35436709e-01
  -2.07511549e-01 -2.05508169e-01 -2.05104264e-01 -1.59634529e-01
  -1.34862606e-01 -1.54171638e-01  6.53137558e-01 -6.53137558e-01
   1.52949971e-01 -1.35059618e-01 -1.59245945e-01 -4.24878501e-01
  -4.25929440e-01  5.39523158e-01 -4.41222910e-02  1.38997180e-01
  -2.6

Once the labels are available, group the selected features by cluster label. This creates subgroups for further analysis. Any subgroup that contains no bankrupt entries is set as h=0.

In [329]:
selected_features['Cluster'] = labels
grouped_features = selected_features.groupby('Cluster')

# Save each cluster as its own array so it is easier to distribute among team members
# (each array also contains the 'Cluster' column added above)
subgroups = {}
for label, group in grouped_features:
    subgroups[label] = group.to_numpy()

# Cluster label numbers are arbitrary, so sort the subgroups by size (largest first) to keep the order consistent
sorted_subgroups = sorted(
    subgroups.items(),
    key=lambda item: len(item[1]),
    reverse=True
)

sorted_subgroups = [item[1] for item in sorted_subgroups]
for i in sorted_subgroups:
    print(len(i))

2379
2292
682
444
7
3
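
Because `Bankrupt?` is dropped during feature selection, the subgroup arrays do not carry the target, so finding the subgroups with no bankrupt entries requires the target column from the original frame. A minimal sketch of that check follows, assuming the cluster labels align row for row with the target. The toy series stand in for `df["Bankrupt?"]` and `k_means.labels_`, and the column names `n_rows`, `n_bankrupt`, and `no_bankrupt` are hypothetical.

```python
import pandas as pd

# Toy stand-ins: in the notebook these would be df["Bankrupt?"] and k_means.labels_.
bankrupt = pd.Series([0, 1, 0, 0, 0, 1, 0, 0], name="Bankrupt?")
cluster = pd.Series([0, 0, 1, 1, 1, 2, 2, 0], name="Cluster")

# Count rows and bankrupt rows per cluster, largest cluster first.
summary = (
    bankrupt.groupby(cluster)
    .agg(n_rows="count", n_bankrupt="sum")
    .sort_values("n_rows", ascending=False)
)

# Flag clusters that contain no bankrupt entries at all.
summary["no_bankrupt"] = summary["n_bankrupt"] == 0
clusters_without_bankrupt = summary.index[summary["no_bankrupt"]].tolist()

print(summary)
print(clusters_without_bankrupt)
```

The same table also makes very small clusters, such as the ones with 7 and 3 rows above, easy to spot before any per-subgroup model is trained.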


## 3. Training Models
Create individual models to process the subgroups.
- Section 3.3: Building Training Models