### 1.Import packages and load data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time

In [None]:
df=pd.read_csv('../input/company-bankruptcy-prediction/data.csv')
print(df.shape)
df.head(5)

### 2.Data analysis

In [None]:
df['Bankrupt?'].value_counts()

1. data.csv is a labeled training dataset with 6819 rows.

2. The training data has 95 feature dimensions; the remaining column, 'Bankrupt?', is the target to predict and serves as the y of the data.

3. The 'Bankrupt?' column takes the labels '1' and '0': '1' means the company went bankrupt and '0' means it did not. Bankrupt and non-bankrupt companies occur at a ratio of about 1:30, so we treat this as an imbalanced binary classification problem.
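
The class ratio quoted above can be checked with a few lines of pandas. This is a minimal sketch on hypothetical toy labels (the `labels` Series is an assumption standing in for `df['Bankrupt?']`), not the notebook's actual output:

```python
import pandas as pd

# Hypothetical toy labels standing in for df['Bankrupt?'] (1 = bankrupt, 0 = not bankrupt)
labels = pd.Series([0] * 30 + [1] * 1, name="Bankrupt?")

counts = labels.value_counts()        # number of rows per class
positive_share = labels.mean()        # fraction of bankrupt companies
ratio = counts.loc[0] / counts.loc[1] # non-bankrupt companies per bankrupt company

print(counts.to_dict())
print(f"positive share: {positive_share:.3f}, ratio 1:{ratio:.0f}")
```

On the real data, replacing `labels` with `df['Bankrupt?']` should give the same kind of summary.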

### 3.Data preprocessing

1. Missing value analysis

  According to the statistics, none of the 95 feature columns has missing values, so the features are relatively complete and missing data should interfere little with model prediction. No original features are removed; instead, feature selection is left to the model, which assigns weights to the features automatically.
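
A missing-value check of this kind might look like the sketch below. The `toy` frame is a hypothetical stand-in for the loaded `df`; the notebook does not show the code it actually used.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; in the notebook this would be the loaded df
toy = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [0.5, 0.1, 0.3]})

missing_per_column = toy.isnull().sum()        # count of NaN values in each column
total_missing = int(missing_per_column.sum())  # overall count across the frame

# Show only the columns that actually contain missing values
print(missing_per_column[missing_per_column > 0])
print("total missing:", total_missing)
```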

2. Deal with outliers

### 4.Feature engineering

1.Design feature Altman Z score according to Edward Altman in 1968
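
For reference, the original 1968 Z-score is a weighted sum of five financial ratios. The sketch below implements that formula with hypothetical raw inputs; how these inputs would map onto the ratio columns of data.csv is not stated in this notebook, so the function and its argument names are assumptions for illustration only.

```python
def altman_z(working_capital, retained_earnings, ebit,
             market_value_equity, sales, total_assets, total_liabilities):
    """Original 1968 Altman Z-score (publicly traded manufacturing firms)."""
    x1 = working_capital / total_assets
    x2 = retained_earnings / total_assets
    x3 = ebit / total_assets
    x4 = market_value_equity / total_liabilities
    x5 = sales / total_assets
    return 1.2 * x1 + 1.4 * x2 + 3.3 * x3 + 0.6 * x4 + 1.0 * x5


# Hypothetical company figures (all in the same currency unit)
z = altman_z(working_capital=20, retained_earnings=30, ebit=15,
             market_value_equity=80, sales=120,
             total_assets=100, total_liabilities=40)
print(round(z, 3))
```

In Altman's original scheme, scores below about 1.81 indicate distress and scores above about 2.99 indicate safety.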

2.One-hot coding of category variables
   
  For categorical variables, general models (LR, SVM, ...) require one-hot encoding, but ensemble tree models such as LGB and XGB can work without it. According to the official LightGBM documentation, one-hot encoding is a poor fit for tree learners, particularly for high-cardinality categorical features: the resulting tree tends to be unbalanced and must grow very deep to achieve good accuracy. The categorical variables in this dataset have already been converted to numerical variables, so no handling is needed here.
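
To make the two options concrete, here is a small sketch on a hypothetical frame with an invented `industry` column (the bankruptcy dataset itself has no such column). It contrasts one-hot encoding with the pandas `category` dtype, which LightGBM can treat as a categorical feature.

```python
import pandas as pd

# Hypothetical categorical column; not part of the bankruptcy dataset
toy = pd.DataFrame({"industry": ["tech", "retail", "tech", "energy"],
                    "debt_ratio": [0.2, 0.6, 0.3, 0.5]})

# Option 1: one-hot encoding, the usual choice for LR or SVM
one_hot = pd.get_dummies(toy, columns=["industry"])

# Option 2: keep a single column with the pandas 'category' dtype
as_category = toy.copy()
as_category["industry"] = as_category["industry"].astype("category")

print(one_hot.columns.tolist())
print(as_category.dtypes)
```

One-hot encoding adds one column per category, while the `category` dtype keeps the feature in a single column.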
  
3.Rank processing of numerical features

Rank processing is applied to the numerical features to make the model more robust to abnormal data, improve its stability, and reduce the risk of overfitting. From an economic standpoint, rank normalization also helps put bankrupt enterprises from different industries on a common scale.

In [None]:
# Collect all feature columns, excluding the target 'Bankrupt?'
no_features=['Bankrupt?']
features=[feat for feat in df.columns.values if feat not in no_features]
print("dimension:",len(features))

In [None]:
# All features are numerical, so each one is converted to a normalized rank
for feat in features:
    df[feat]=df[feat].rank()/float(df.shape[0]) # rank, then scale into (0, 1]

In [None]:
# Visually check the distribution after rank normalization
df[' ROA(C) before interest and depreciation before interest'].plot(kind='kde')
plt.show()

Tree models do not need normalization, because their splits depend only on the ordering of each variable's values, not on their magnitudes. Rank processing of the numerical features is carried out here with the aim of making the model more robust to abnormal data, improving its stability, reducing the risk of overfitting, and, from an economic perspective, improving prediction accuracy for enterprises in different industries.
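
The effect of rank normalization on an extreme value can be seen in a small worked example. The Series below is hypothetical; the rank transform is the same `rank() / n` expression used in the cell above, and min-max scaling is shown only for contrast.

```python
import pandas as pd

# Hypothetical feature with one extreme outlier
s = pd.Series([1.0, 2.0, 3.0, 1000.0])

# Min-max scaling: the outlier squeezes all other values toward 0
min_max = (s - s.min()) / (s.max() - s.min())

# Rank normalization: values are spread evenly regardless of magnitude
rank_norm = s.rank() / len(s)

print(min_max.round(3).tolist())
print(rank_norm.tolist())
```

After rank normalization the outlier is simply the largest rank, while the spacing between the remaining values is preserved.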

4.Feature selection

An embedded method based on the tree model is adopted. The model is first trained on all features, and the top-K features are then selected for training and analysis according to the feature importances obtained from the model.

Specifically, the parameter feature_fraction is set in the LightGBM model. If feature_fraction is less than 1.0, LightGBM randomly selects a subset of the features on each iteration. For example, if it is set to 0.8, 80% of the features are selected before each tree is trained.
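
The top-K selection step could be sketched as follows. The feature names, importance scores and `TOP_K` value are all hypothetical; in the notebook they would come from `features` and `gbm.feature_importance()` after training.

```python
import pandas as pd

# Hypothetical names and importance scores, standing in for
# `features` and `gbm.feature_importance()` from a trained model
names = ["f0", "f1", "f2", "f3", "f4"]
importance = [12, 0, 45, 7, 30]

TOP_K = 3  # illustrative value, not taken from the notebook

# Sort features by importance and keep the K highest
ranking = (pd.DataFrame({"feature": names, "importance": importance})
             .sort_values("importance", ascending=False))
top_k_features = ranking["feature"].head(TOP_K).tolist()

print(top_k_features)
```

The model would then be retrained on `df[top_k_features]` only.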

5.Feature extraction

Feature extraction is not done here, because in economics and management research it strips the feature space of its original meaning and greatly reduces interpretability. Feature extraction is more typically used in machine learning competitions to push up metric scores.

### 5.Model training and evaluation

1. Train set/test set  
  The StratifiedKFold method is used to split the data (train_test_split can also be used)
2. Model selection   
  The LGB model is adopted
3. Evaluation indicators   
  AUC and F1
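
The two indicators take different inputs: AUC is computed from predicted probabilities, while F1 needs hard 0/1 labels. The sketch below uses hypothetical labels and probabilities, and the 0.5 threshold is an assumption; with a class ratio near 1:30 a different cut-off may be more appropriate.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Hypothetical true labels and predicted probabilities
y_true = np.array([0, 0, 0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.7, 0.3])

THRESHOLD = 0.5  # assumed cut-off for turning probabilities into labels
y_label = (y_prob >= THRESHOLD).astype(int)

auc = roc_auc_score(y_true, y_prob)  # uses the probabilities directly
f1 = f1_score(y_true, y_label)       # uses the thresholded labels

print("AUC:", auc)
print("F1:", f1)
```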

In [None]:
# Build the feature matrix X and the label vector y
X=df[features].values
y=df['Bankrupt?'].values.astype(int)
print('X shape:',X.shape)
print('y shape:',y.shape)

  4.  K fold cross validation -- determine the value of K  

Reasons for choosing cross validation:  
   Root cause: because the data are limited, it is easy to overfit if all of the data are used only to train the model, with nothing held out for validation.

   Theory: with cross-validation, the model variance "should" be reduced and the generalization ability of the model improved. We expect the model to perform well on multiple subsets of the training set, rather than on the whole training set alone.

   From the bias-variance perspective: with K=1, cross validation is not used at all, so all the data are used for training and the model is prone to overfitting, characterized by low bias and high variance. With K=n (the leave-one-out method), each model is trained on nearly all the data, so the performance estimate has low bias, but its variance tends to be high and the computational cost is at its largest. A moderate K trades off between these two extremes.

A 2017 study proposed an empirical alternative: choose k ≈ log(n) while ensuring n/k > 3d, where n is the number of samples and d is the number of features.

In [None]:
k=np.log(X.shape[0]) # np.log is the natural logarithm (base e)
print('k:{}'.format(k))
# Check the n/k > 3d condition
if (X.shape[0])/k>3*(X.shape[1]):
    print("meet the condition")
print("the final k value is :",round(k))

5.Train with the parameters obtained from Bayesian hyperparameter tuning, using K-fold cross validation combined with LGB

In [None]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
import lightgbm as lgb

In [None]:
print("start:********************************")
start = time.time()

K = 9 
skf = StratifiedKFold(n_splits=K,shuffle=True,random_state=2018)

auc_cv = []
pred_cv = []

for k,(train_in,test_in) in enumerate(skf.split(X,y)):
    X_train,X_test,y_train,y_test = X[train_in],X[test_in],\
                                    y[train_in],y[test_in]
    
    # The data structure
    lgb_train = lgb.Dataset(X_train, y_train)
    lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)

    # Set the parameters
    params = {
                'boosting': 'gbdt',
                'objective':'binary',
                'verbosity': -1,
                'learning_rate': 0.01,
                'metric': 'auc',
                'num_leaves':17 ,
                'min_data_in_leaf': 26, 
                'min_child_weight': 1.12,
                'max_depth': 9,
                "feature_fraction": 0.91,
                "bagging_fraction": 0.82,
                "bagging_freq": 2,
                }

    print('................Start training..........................')
    # train
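    gbm = lgb.train(params,
                    lgb_train,
                    num_boost_round=2000,
                    valid_sets=[lgb_eval],
                    # LightGBM >= 4.0 takes early stopping and logging as callbacks
                    callbacks=[lgb.early_stopping(stopping_rounds=100),
                               lgb.log_evaluation(period=100)])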

    print('................Start predict .........................')
    # Predict
    y_pred = gbm.predict(X_test,num_iteration=gbm.best_iteration)
    # Evaluate
    tmp_auc = roc_auc_score(y_test,y_pred)
    auc_cv.append(tmp_auc)
    print("valid auc:",tmp_auc)
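    # Predict on the full dataset; X includes this fold's training rows,
    # so scores computed from these averaged predictions are optimistic
    pred = gbm.predict(X, num_iteration = gbm.best_iteration)
    pred_cv.append(pred)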
    
# the mean auc score of StratifiedKFold
print('the cv information:')
print(auc_cv)
lgb_mean_auc = np.mean(auc_cv)
print('cv mean score',lgb_mean_auc)

end = time.time()
lgb_practice_time=end-start
print("......................run with time: {} s".format(lgb_practice_time)  )
print("over:*********************************")

# Stack the per-fold predictions into an array of shape (K, n_samples)
res =  np.array(pred_cv)
print("result:",res.shape)
# Average the predictions across folds
r = res.mean(axis = 0)
print('result shape:',r.shape)
result = pd.DataFrame()
result['company_id'] = range(1,df.shape[0]+1)
result['pred_prob'] = r
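
Because each fold's model above predicts on all of X, every company's averaged probability includes predictions from models that saw that company during training. A common alternative is to keep only out-of-fold predictions. The sketch below illustrates the idea on synthetic data, with LogisticRegression as a lightweight stand-in for the LightGBM model; the data, model and fold count are assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Synthetic imbalanced data standing in for X and y
X_toy, y_toy = make_classification(n_samples=600, n_features=10,
                                   weights=[0.9, 0.1], random_state=0)

oof_pred = np.zeros(len(y_toy))  # one out-of-fold probability per row
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2018)

for train_idx, test_idx in skf.split(X_toy, y_toy):
    # LogisticRegression is a stand-in for the LightGBM model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_toy[train_idx], y_toy[train_idx])
    # Each row is scored only by the model that did not train on it
    oof_pred[test_idx] = model.predict_proba(X_toy[test_idx])[:, 1]

oof_auc = roc_auc_score(y_toy, oof_pred)
print("out-of-fold AUC:", round(oof_auc, 4))
```

An AUC computed this way reflects performance on unseen rows only, so it is usually lower than one computed from predictions that include training rows.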

### 6.Results display and analysis

In [None]:
# Plot the 30 most important features (gbm is the model from the last fold)
lgb.plot_importance(gbm,max_num_features = 30,figsize=(20,10))
plt.show()

In [None]:
# Rank features by importance
df1 = pd.DataFrame({'feature': features,'importance': gbm.feature_importance()}).sort_values(by='importance',ascending = False) 
use = df1.loc[df1['importance']!=0,'feature'].tolist()
print('Number of useful features:',len(use))

In [None]:
df1.head(10)

The top 10 features by importance are:
1. Interest-bearing debt interest rate
2. Borrowing dependency
3. Persistent EPS in the Last Four Seasons
4. Accounts Receivable Turnover
5. Net Value Growth Rate
6. Total debt/Total net worth
7. Non-industry income and expenditure/revenue
8. Inventory/Working Capital
9. Cash/Total Assets
10. Quick Ratio

We can conclude that, in economic terms, these ten indicators have strong explanatory power for corporate bankruptcy.

In [None]:
result

In [None]:
total_auc = roc_auc_score(y,result['pred_prob'])
print(total_auc)