# <center> Intro </center>
1. Problem = how to predict whether a company will survive or go bankrupt given its current condition. We position ourselves as a company consultant who must give the client insight to keep their company running well
2. Data =
>* What is being predicted? = the company bankruptcy label
>* What is needed for the prediction? = selecting the features that affect bankruptcy, so that we can reduce the risk of the company going bankrupt and increase its chances of continuing to operate
3. Machine Learning Objective = maximize the company's chance of survival
4. Action = avoid wrong predictions: saying a company will survive when it eventually goes bankrupt, or saying it will go bankrupt when it actually survives
5. Value = keep the company from going bankrupt

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# I. Exploratory Data Analysis

In [None]:
df=pd.read_csv('../input/company-bankruptcy-prediction/data.csv')
print(df.shape)
df.head(5)

In [None]:
df.info()

As we can see, all columns in the dataset are numerical (integer and float types).

The next step is to check whether the dataset has any missing values.

In [None]:
df.isna().sum()

There are no missing values, which is good. Now we look at the proportion of companies that go bankrupt compared with those that survive.

In [None]:
plt.figure(figsize= (10,10))
plt.pie(df['Bankrupt?'].value_counts().tolist(), labels = ['0','1'],autopct = '%.2f',colors=['skyblue','black'] ,explode = (0,0.1))
plt.title('Comparison of the number of companies that survive and go bankrupt')
plt.legend(['Survive','Bankrupt'])
plt.axis('equal')
plt.show()

The two classes are extremely imbalanced.

Because the dataset is imbalanced, we must balance the data during modelling.

The dataset has many columns, so we use the Spearman method to find which columns are strongly correlated.

In [None]:
corr = df.corr(method = 'spearman')

# np.bool was removed in NumPy 1.24, so use the built-in bool instead
mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(100,100))
with sns.axes_style("white"):
  ax = sns.heatmap(corr, mask=mask,vmin=0., vmax=1,annot = True, cmap='Blues')
plt.show()


In the heatmap above we focus on the Bankrupt? column. The features whose correlation with it is at least 0.2 in absolute value (≥ 0.2 or ≤ -0.2) are:
1. Total debt/Total net worth = 0.22
2. Debt ratio % = 0.22
3. Borrowing dependency = 0.22
4. Liability to Equity = 0.2
5. Net Value Growth Rate = -0.2
6. Quick Ratio = -0.2
7. Net Value Per Share (C) = -0.2
8. Net Value Per Share (B) = -0.21
9. Net Value Per Share (A) = -0.21
10. ROA(C) before interest and depreciation before interest = -0.22
11. ROA(A) before interest and % after tax = -0.22
12. ROA(B) before interest and depreciation after tax = -0.22
13. Pre-tax net Interest Rate = -0.22
14. After-tax net Interest Rate = -0.22
15. Non-industry income and expenditure/revenue = -0.22
16. Net worth/Assets = -0.22
17. Equity to Liability = -0.22
18. Continuous interest rate (after tax) = -0.23
19. Per Share Net profit before tax (Yuan ¥) = -0.23
20. Net profit before tax/Paid-in capital = -0.23
21. Persistent EPS in the Last Four Seasons = -0.24

However, many pairs of columns are almost perfectly correlated with each other (values close to 1), which may bias the analysis, so we draw a new heatmap using only the columns that correlate strongly (positively or negatively) with the bankruptcy column.

In [None]:
corr_df = df[['Bankrupt?',' Total debt/Total net worth',' Debt ratio %',' Borrowing dependency',' Liability to Equity',' Net Value Growth Rate',' Quick Ratio',' Net Value Per Share (B)', ' Net Value Per Share (A)',
       ' Net Value Per Share (C)',' ROA(C) before interest and depreciation before interest',' ROA(A) before interest and % after tax',' ROA(B) before interest and depreciation after tax',' Pre-tax net Interest Rate',
       ' After-tax net Interest Rate',' Non-industry income and expenditure/revenue',' Net worth/Assets',' Equity to Liability',' Continuous interest rate (after tax)',' Per Share Net profit before tax (Yuan ¥)',
       ' Net profit before tax/Paid-in capital',' Persistent EPS in the Last Four Seasons']]
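
The column list above was typed by hand from the heatmap. As a rough sketch, the same kind of selection could be derived from the correlation matrix directly. The toy dataframe and its column names (`strong_pos`, `strong_neg`, `noise`) are made up for illustration; only the 0.2 threshold and the Spearman method come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (assumption: three made-up feature columns)
rng = np.random.default_rng(0)
n_rows = 2000
target = rng.integers(0, 2, size=n_rows)
corr_toy = pd.DataFrame({
    "Bankrupt?": target,
    "strong_pos": target + rng.normal(0, 0.5, size=n_rows),   # rises with the target
    "strong_neg": -target + rng.normal(0, 0.5, size=n_rows),  # falls with the target
    "noise": rng.normal(0, 1, size=n_rows),                   # unrelated to the target
})

# Spearman correlation of every feature with the target column
target_corr = corr_toy.corr(method="spearman")["Bankrupt?"].drop("Bankrupt?")

# Keep the features whose absolute correlation reaches the threshold
threshold = 0.2
selected = target_corr[target_corr.abs() >= threshold].sort_values(ascending=False)
print(selected)
```

On the real data, `['Bankrupt?'] + list(selected.index)` would then play the role of the hand-typed list, which avoids typos in the long column names.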

In [None]:
corr2 = corr_df.corr(method = 'spearman')

# np.bool was removed in NumPy 1.24, so use the built-in bool instead
mask = np.zeros_like(corr2, dtype=bool)
mask[np.triu_indices_from(mask)] = True

plt.figure(figsize=(100,100))
with sns.axes_style("white"):
  ax = sns.heatmap(corr2, mask=mask,vmin=0., vmax=1,annot = True, cmap='Blues')
plt.show()

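Some pairs of columns still have a correlation of 1, but fewer than in the initial heatmap. That is acceptable, since we will apply feature selection later.

Now we compare the distribution of each of these features between companies that survive and companies that go bankrupt.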

In [None]:
# Features to compare, in the same order as the subplot grid
boxplot_features = [
    " Total debt/Total net worth", " Debt ratio %", " Borrowing dependency",
    " Liability to Equity", " Net Value Growth Rate", " Quick Ratio",
    " Net Value Per Share (B)", " Net Value Per Share (A)", " Net Value Per Share (C)",
    " ROA(C) before interest and depreciation before interest",
    " ROA(A) before interest and % after tax",
    " ROA(B) before interest and depreciation after tax",
    " Pre-tax net Interest Rate", " After-tax net Interest Rate",
    " Net worth/Assets", " Non-industry income and expenditure/revenue",
    " Equity to Liability", " Continuous interest rate (after tax)",
    " Per Share Net profit before tax (Yuan ¥)",
    " Net profit before tax/Paid-in capital",
    " Persistent EPS in the Last Four Seasons",
]

plt.figure(figsize=(40,40))
# One boxplot per feature on a 6x4 grid, split by bankruptcy status
for position, feature in enumerate(boxplot_features, start=1):
    plt.subplot(6, 4, position)
    sns.boxplot(x="Bankrupt?", y=feature, data=df, order=[0, 1])

As the boxplots above show, for most features the minimum value among surviving companies is higher than the minimum value among bankrupt companies. The one exception is the debt ratio.

Next we look for outliers in the dataset.

In [None]:
# Plotting Boxplots of the numerical features

plt.figure(figsize = (20,20))
ax =sns.boxplot(data = df, orient="h")
ax.set_title('Bank Data Boxplots', fontsize = 18)
ax.set(xscale="log")
plt.show()

Based on the analysis so far, the following points stand out:

1. The features are not normally distributed
2. There are many outliers, which are very likely to influence the modelling results
3. The dataset is imbalanced: the numbers of 1s and 0s are far apart. This can be addressed with random oversampling or SMOTE
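
To put a number on point 2, here is a small sketch of the 1.5 × IQR rule, which is the same rule seaborn's boxplots use by default to draw points beyond the whiskers. The helper name `iqr_outlier_mask` and the sample values are made up for illustration and are not taken from the dataset.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical ratio values with one obviously extreme entry
outlier_sample = np.array([0.10, 0.12, 0.11, 0.13, 0.12, 0.11, 0.10, 0.95])
outlier_mask = iqr_outlier_mask(outlier_sample)
print(outlier_mask.sum(), outlier_sample[outlier_mask])
```

Applied per column, for example with `df.apply(lambda col: iqr_outlier_mask(col).sum())`, this would give a per-feature outlier count to go along with the boxplot.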

# II. Modelling

## Feature Selection

In [None]:
from sklearn.feature_selection import SelectKBest, mutual_info_classif
X = df.drop(['Bankrupt?'], axis = 1)
y = df['Bankrupt?']
selector = SelectKBest(mutual_info_classif,k=9)
selector.fit(X, y)
X.columns[selector.get_support()]
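
To see what `SelectKBest` with `mutual_info_classif` is ranking, here is a sketch on synthetic data where only one column carries information about the label. The data and the names `mi_X`, `mi_y` and `mi_scorer` are made up for illustration. Note that `mutual_info_classif` has a `random_state` parameter; the cell above leaves it unset, so I assume the selected columns can vary slightly between runs.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic data: the label depends only on the "informative" column
rng = np.random.default_rng(0)
n_rows = 500
informative = rng.normal(size=n_rows)
mi_X = pd.DataFrame({
    "informative": informative,
    "noise_a": rng.normal(size=n_rows),
    "noise_b": rng.normal(size=n_rows),
})
mi_y = (informative > 0).astype(int)

def mi_scorer(X, y):
    """Mutual information with a fixed seed so the ranking is reproducible."""
    return mutual_info_classif(X, y, random_state=0)

mi_selector = SelectKBest(mi_scorer, k=1).fit(mi_X, mi_y)

# Scores per column, highest first
mi_scores = pd.Series(mi_selector.scores_, index=mi_X.columns).sort_values(ascending=False)
print(mi_scores)
```

Printing the scores like this, rather than only the selected column names, also shows how close the cut-off at `k` is.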

I use `mutual_info_classif` for feature selection because it estimates the mutual information between each feature and a discrete target variable. Now let's make a new dataframe for modelling.

In [None]:
df_model = df[['Bankrupt?',
       ' ROA(A) before interest and % after tax',
       ' Continuous interest rate (after tax)',
       ' Persistent EPS in the Last Four Seasons',
       ' Per Share Net profit before tax (Yuan ¥)', 
       ' Debt ratio %',
       ' Borrowing dependency', 
       ' Net profit before tax/Paid-in capital',
       ' Net Income to Total Assets', 
       " Net Income to Stockholder's Equity"]]

Before modelling, we tidy up the column names to avoid mistyping them later.

In [None]:
df_model.columns = df_model.columns.str.strip()
df_model.columns = df_model.columns.str.replace(" " ,"_")
print("Column names after renaming:","\n",df_model.columns[:10])

### <center> Confusion Matrix </center>

* 0 = Survive
* 1 = Bankrupt

- TP: companies that are predicted to go bankrupt and do go bankrupt
- TN: companies that are predicted to survive and do survive
- FP: companies that are predicted to go bankrupt but actually survive
- FN: companies that are predicted to survive but actually go bankrupt

Action:
* FP: the consulting firm looks inaccurate, but the client company is not harmed
* FN: the consulting firm looks inaccurate, the client is harmed, and this can cause problems

-> We focus on reducing FN, so the metric to optimize is recall
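
The link between FN and recall can be sketched with plain arithmetic. The counts below are hypothetical and are only there to show how the formulas behave; they are not results from this notebook.

```python
def recall(tp, fn):
    """Share of truly bankrupt companies that the model flags as bankrupt."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision(tp, fp):
    """Share of companies flagged as bankrupt that really go bankrupt."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical confusion-matrix counts (illustration only)
cm_tn, cm_fp, cm_fn, cm_tp = 950, 20, 18, 12

cm_accuracy = (cm_tp + cm_tn) / (cm_tp + cm_tn + cm_fp + cm_fn)
print("accuracy :", cm_accuracy)
print("precision:", precision(cm_tp, cm_fp))
print("recall   :", recall(cm_tp, cm_fn))
```

With these made-up counts the accuracy looks high while recall stays low, because every extra FN lowers recall directly but barely moves accuracy when bankrupt companies are rare.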

### <center> Method Selection </center>

As a first step in choosing the best modelling method, we use LazyPredict to get an initial shortlist of methods, then cross-check the results ourselves.

In [None]:
X = df_model.drop(['Bankrupt?'], axis = 1)
y = df_model['Bankrupt?']

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   stratify = y,
                                                   test_size = 0.3,
                                                   random_state = 2021)

First we need to install the lazypredict library.

In [None]:
!pip install lazypredict

In [None]:
!pip install pandas -U

In [None]:
import warnings
warnings.filterwarnings("ignore")
# The unused import from sklearn.utils.testing was dropped: that module no longer exists in current scikit-learn

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import sklearn.metrics as met
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# plot_roc_curve was unused here and has been removed from recent scikit-learn releases
from sklearn.metrics import classification_report, precision_score, roc_auc_score, f1_score, confusion_matrix, accuracy_score, recall_score
import lazypredict
from lazypredict.Supervised import LazyClassifier

In [None]:
LC = LazyClassifier(verbose=0,ignore_warnings=True, custom_metric=None)
model,predictions = LC.fit(X_train, X_test, y_train, y_test)

In [None]:
predictions.sort_values(by='Accuracy')

### <center> Moment of Truth </center>
Are the LazyPredict results accurate? Let's check by fitting some of the models ourselves.

#### Extra Trees Classifier

In [None]:
from sklearn.ensemble import ExtraTreesClassifier

clf = ExtraTreesClassifier()
clf.fit(X_train,y_train)
y_pred=clf.predict(X_test)

print('the result is :')
print('Accuracy : '+str(met.accuracy_score(y_test,y_pred)))
print('f1 score: '+str(met.f1_score(y_test,y_pred,average='weighted')))

#### SVC

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from sklearn.svm import SVC
svc = make_pipeline(StandardScaler(), SVC(gamma='auto'))
svc.fit(X_train,y_train)
y_pred=svc.predict(X_test)

print('the result is :')
print('Accuracy : '+str(met.accuracy_score(y_test,y_pred)))
print('f1 score: '+str(met.f1_score(y_test,y_pred,average='weighted')))

#### AdaBoost Classifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification
abc = AdaBoostClassifier(n_estimators=100, random_state=0)
abc.fit(X_train,y_train)
y_pred=abc.predict(X_test)

print('the result is :')
print('Accuracy : '+str(met.accuracy_score(y_test,y_pred)))
print('f1 score: '+str(met.f1_score(y_test,y_pred,average='weighted')))

#### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
y_pred=lr.predict(X_test)

print('the result is :')
print('Accuracy : '+str(met.accuracy_score(y_test,y_pred)))
print('f1 score: '+str(met.f1_score(y_test,y_pred,average='weighted')))

#### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred=rf.predict(X_test)

print('the result is :')
print('Accuracy : '+str(met.accuracy_score(y_test,y_pred)))
print('f1 score: '+str(met.f1_score(y_test,y_pred,average='weighted')))

#### KNN Classifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)

print('the result is :')
print('Accuracy : '+str(met.accuracy_score(y_test,y_pred)))
print('F1 score: '+str(met.f1_score(y_test,y_pred,average='weighted')))

'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters ‘macro’ to account for label imbalance; it can result in an F-score that is not between precision and recall.

The results match: we get the same values as LazyPredict.

Still, we cannot rely on LazyPredict alone. We re-check the models using cross-validation to make sure they generalize.
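
One reason for caution is that accuracy and weighted F1 are dominated by the majority class. As a sketch, take a made-up label vector with 3% positives (an assumed ratio, chosen only for illustration) and a "model" that predicts survival for everyone:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# Made-up labels: 97 surviving companies (0) and 3 bankrupt ones (1)
base_y_true = np.array([0] * 97 + [1] * 3)
# A useless baseline that always predicts "survive"
base_y_pred = np.zeros_like(base_y_true)

base_accuracy = accuracy_score(base_y_true, base_y_pred)
base_weighted_f1 = f1_score(base_y_true, base_y_pred, average="weighted", zero_division=0)
base_recall = recall_score(base_y_true, base_y_pred, pos_label=1, zero_division=0)

print("accuracy        :", base_accuracy)
print("weighted F1     :", round(base_weighted_f1, 3))
print("bankrupt recall :", base_recall)
```

The baseline scores well on the first two metrics even though it never catches a bankruptcy, so the comparison between models should also look at recall for class 1.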

In [None]:
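# random_state only has an effect when shuffle=True, and recent scikit-learn
# versions raise an error if it is set together with shuffle=False
skfold = StratifiedKFold(n_splits=5, shuffle=False)

# Each iteration overwrites the previous split, so after the loop the
# *_sm variables hold the train/validation parts of the last fold only
for train_index, test_index in skfold.split(X_train,y_train):
    
    print("Train:", train_index, "Test:", test_index)
    X_train_sm, X_val_sm = X_train.iloc[train_index], X_train.iloc[test_index]
    y_train_sm, y_val_sm = y_train.iloc[train_index], y_train.iloc[test_index]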



X_train_sm = X_train_sm.values
X_val_sm = X_val_sm.values
y_train_sm = y_train_sm.values
y_val_sm = y_val_sm.values


train_unique_label, train_counts_label = np.unique(y_train_sm, return_counts=True)
test_unique_label, test_counts_label = np.unique(y_val_sm, return_counts=True)
print('-' * 84)

print('Label Distributions: \n')
print(train_counts_label/ len(y_train_sm))
print(test_counts_label/ len(y_val_sm))
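
The label distributions printed above should be nearly identical, because `StratifiedKFold` keeps the class ratio the same in every fold. A minimal sketch on made-up labels (90 zeros and 10 ones, not the real data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels with a 10% positive rate, plus a dummy feature column
strat_y = np.array([0] * 90 + [1] * 10)
strat_X = np.arange(len(strat_y)).reshape(-1, 1)

strat_skf = StratifiedKFold(n_splits=5)
fold_positive_rates = []
for fold_train_idx, fold_val_idx in strat_skf.split(strat_X, strat_y):
    # Share of positives in this validation fold
    fold_positive_rates.append(strat_y[fold_val_idx].mean())

print(fold_positive_rates)
```

Every validation fold ends up with the same share of positives as the full label vector, which is what makes per-fold recall comparable on an imbalanced target.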

#### <center> Extra Trees Classifier </center>

In [None]:
from sklearn.model_selection import GridSearchCV
from imblearn.pipeline import make_pipeline as imbalanced_make_pipeline
from imblearn.over_sampling import SMOTE


accuracy_lst_xtc = []
precision_lst_xtc = []
recall_lst_xtc = []
f1_lst_xtc = []
auc_lst_xtc = []

xtc = ExtraTreesClassifier()


## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
              "max_features": [1, 3, 10],
              "min_samples_split": [2, 3, 10],
              "min_samples_leaf": [1, 3, 10],
              "bootstrap": [False],
              "n_estimators" :[100,300],
              "criterion": ["gini"]}

gsxtc = GridSearchCV(xtc,param_grid = ex_param_grid, cv=skfold, scoring="accuracy", n_jobs= 4, verbose = 1)


for train, val in skfold.split(X_train_sm, y_train_sm):
    pipeline_xtc = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), gsxtc)
    model_xtc = pipeline_xtc.fit(X_train_sm[train], y_train_sm[train])
    best_est_xtc = gsxtc.best_estimator_
    prediction_xtc = best_est_xtc.predict(X_train_sm[val])
    
    accuracy_lst_xtc.append(pipeline_xtc.score(X_train_sm[val], y_train_sm[val]))
    precision_lst_xtc.append(precision_score(y_train_sm[val], prediction_xtc))
    recall_lst_xtc.append(recall_score(y_train_sm[val], prediction_xtc))
    f1_lst_xtc.append(f1_score(y_train_sm[val], prediction_xtc))
    auc_lst_xtc.append(roc_auc_score(y_train_sm[val], prediction_xtc))

print('---' * 45)
print('')
print('Extra Trees Classifier results:')
print('')
print("accuracy: {}".format(np.mean(accuracy_lst_xtc)))
print("precision: {}".format(np.mean(precision_lst_xtc)))
print("recall: {}".format(np.mean(recall_lst_xtc)))
print("f1: {}".format(np.mean(f1_lst_xtc)))
#gsxtc.fit(X_train,y_train)
#xtc_best = gsxtc.best_estimator_
print('Best Estimator = {}'.format(best_est_xtc))
print('')
print('---' * 45)
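
The grid searches in this section use `scoring="accuracy"`, while the confusion-matrix discussion earlier names recall as the metric to optimize. As a sketch of the alternative, and not what this notebook actually ran, `GridSearchCV` can select hyperparameters by recall instead. The synthetic data, the logistic regression model and the small grid below are assumptions chosen to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the real features (about 10% positives)
demo_X, demo_y = make_classification(
    n_samples=600, n_features=6, weights=[0.9, 0.1], random_state=0
)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1, 10], "class_weight": [None, "balanced"]},
    cv=StratifiedKFold(n_splits=5),
    scoring="recall",  # pick the parameters that catch the most positives
)
demo_search.fit(demo_X, demo_y)

print(demo_search.best_params_)
print("mean CV recall:", round(demo_search.best_score_, 3))
```

Selecting on recall alone can favour models that flag almost everything as bankrupt, so precision or F1 on class 1 is worth reporting next to it.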

In [None]:
# Printing the classification report

label = ['0', '1']
pred_xtc_sm = best_est_xtc.predict(X_val_sm)
print(classification_report(y_val_sm, pred_xtc_sm, target_names=label))

In [None]:
CM = pd.DataFrame(confusion_matrix(y_val_sm, pred_xtc_sm), columns = ['Survive','Bankrupt'], index = ['Survive','Bankrupt'])
CM

#### <center> XGBoost </center>

In [None]:
from xgboost import XGBClassifier

accuracy_lst_xgb = []
precision_lst_xgb = []
recall_lst_xgb = []
f1_lst_xgb = []
auc_lst_xgb = []

xgb=XGBClassifier()

xgb_params={"n_estimators":[67,70,100,120],
        'reg_lambda':[2,1],
        'gamma':[0,0.3,0.2,0.1],
        'eta':[0.06,0.05,0.04],
        "max_depth":[3,5],
        'objective':['binary:logistic']}

gsxgb = GridSearchCV(xgb,param_grid = xgb_params, cv=skfold, scoring="accuracy", n_jobs= 4, verbose = 1)


for train, val in skfold.split(X_train_sm, y_train_sm):
    pipeline_xgb = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), gsxgb) # SMOTE happens during Cross Validation not before..
    model_xgb = pipeline_xgb.fit(X_train_sm[train], y_train_sm[train])
    best_est_xgb = gsxgb.best_estimator_
    prediction_xgb = best_est_xgb.predict(X_train_sm[val])
    
    accuracy_lst_xgb.append(pipeline_xgb.score(X_train_sm[val], y_train_sm[val]))
    precision_lst_xgb.append(precision_score(y_train_sm[val], prediction_xgb))
    recall_lst_xgb.append(recall_score(y_train_sm[val], prediction_xgb))
    f1_lst_xgb.append(f1_score(y_train_sm[val], prediction_xgb))
    auc_lst_xgb.append(roc_auc_score(y_train_sm[val], prediction_xgb))

print('---' * 45)
print('')
print('XGBOOST results:')
print('')
print("accuracy: {}".format(np.mean(accuracy_lst_xgb)))
print("precision: {}".format(np.mean(precision_lst_xgb)))
print("recall: {}".format(np.mean(recall_lst_xgb)))
print("f1: {}".format(np.mean(f1_lst_xgb)))
#gsxgb.fit(X_train,y_train)
#gsxgb_best = gsxgb.best_estimator_
print('Best Estimator = {}'.format(best_est_xgb))
print('')
print('---' * 45)

In [None]:
# Printing the classification report

label = ['0', '1']
pred_xgb_sm = best_est_xgb.predict(X_val_sm)
print(classification_report(y_val_sm, pred_xgb_sm, target_names=label))

In [None]:
CM = pd.DataFrame(confusion_matrix(y_val_sm, pred_xgb_sm), columns = ['Survive','Bankrupt'], index = ['Survive','Bankrupt'])
CM

#### <center> Logistic Regression </center>

In [None]:
accuracy_lst_reg = []
precision_lst_reg = []
recall_lst_reg = []
f1_lst_reg = []
auc_lst_reg = []

log_reg_sm = LogisticRegression()
#log_reg_params = {}
log_reg_params = {"penalty": ['l2'],
                  'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                  'class_weight': ['balanced',None],
                  'solver':['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']}

rand_log_reg = GridSearchCV(log_reg_sm,param_grid = log_reg_params, cv=skfold, scoring="accuracy", n_jobs= 4, verbose = 1)


for train, val in skfold.split(X_train_sm, y_train_sm):
    pipeline_reg = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_log_reg) # SMOTE happens during Cross Validation not before..
    model_reg = pipeline_reg.fit(X_train_sm[train], y_train_sm[train])
    best_est_reg = rand_log_reg.best_estimator_
    prediction_reg = best_est_reg.predict(X_train_sm[val])
    
    accuracy_lst_reg.append(pipeline_reg.score(X_train_sm[val], y_train_sm[val]))
    precision_lst_reg.append(precision_score(y_train_sm[val], prediction_reg))
    recall_lst_reg.append(recall_score(y_train_sm[val], prediction_reg))
    f1_lst_reg.append(f1_score(y_train_sm[val], prediction_reg))
    auc_lst_reg.append(roc_auc_score(y_train_sm[val], prediction_reg))


print('---' * 45)
print('')
print('Logistic Regression (SMOTE) results:')
print('')
print("accuracy: {}".format(np.mean(accuracy_lst_reg)))
print("precision: {}".format(np.mean(precision_lst_reg)))
print("recall: {}".format(np.mean(recall_lst_reg)))
print("f1: {}".format(np.mean(f1_lst_reg)))
print('Best Estimator = {}'.format(best_est_reg))
print('')
print('---' * 45)


In [None]:
# Printing the classification report

label = ['Survive', 'Bankrupt']
pred_reg_sm = best_est_reg.predict(X_val_sm)
print(classification_report(y_val_sm, pred_reg_sm, target_names=label))

In [None]:
CM = pd.DataFrame(confusion_matrix(y_val_sm, pred_reg_sm), columns = ['Survive','Bankrupt'], index = ['Survive','Bankrupt'])
CM

#### <center> Random Forest Classifier </center>

In [None]:
accuracy_lst_rfc = []
precision_lst_rfc = []
recall_lst_rfc = []
f1_lst_rfc = []
auc_lst_rfc = []

rfc_sm = RandomForestClassifier()
#rfc_params = {}
# 'auto' was an alias of 'sqrt' for classifiers and is no longer accepted by recent scikit-learn
rfc_params = {'max_features' : ['sqrt', 'log2'],
              'random_state' : [42],
              'class_weight' : ['balanced','balanced_subsample'],
              'criterion' : ['gini', 'entropy'],
              'bootstrap' : [True,False]}
    
    
rand_rfc = GridSearchCV(rfc_sm,param_grid = rfc_params, cv=skfold, scoring="accuracy", n_jobs= 4, verbose = 1)

for train, val in skfold.split(X_train_sm, y_train_sm):
    pipeline_rfc = imbalanced_make_pipeline(SMOTE(sampling_strategy='minority'), rand_rfc) # SMOTE happens during Cross Validation not before..
    # Fit on the training part of the fold only, so the validation rows stay unseen
    model_rfc = pipeline_rfc.fit(X_train_sm[train], y_train_sm[train])
    best_est_rfc = rand_rfc.best_estimator_
    prediction_rfc = best_est_rfc.predict(X_train_sm[val])
    
    accuracy_lst_rfc.append(pipeline_rfc.score(X_train_sm[val], y_train_sm[val]))
    precision_lst_rfc.append(precision_score(y_train_sm[val], prediction_rfc))
    recall_lst_rfc.append(recall_score(y_train_sm[val], prediction_rfc))
    f1_lst_rfc.append(f1_score(y_train_sm[val], prediction_rfc))
    auc_lst_rfc.append(roc_auc_score(y_train_sm[val], prediction_rfc))
    
print('---' * 45)
print('')
print("accuracy: {}".format(np.mean(accuracy_lst_rfc)))
print("precision: {}".format(np.mean(precision_lst_rfc)))
print("recall: {}".format(np.mean(recall_lst_rfc)))
print("f1: {}".format(np.mean(f1_lst_rfc)))
print('Best Estimator = {}'.format(best_est_rfc))
print('---' * 45)

In [None]:
# Printing the classification report

label = ['Survive', 'Bankrupt']
pred_reg_rfc = best_est_rfc.predict(X_val_sm)
print(classification_report(y_val_sm, pred_reg_rfc, target_names=label))

In [None]:
CM = pd.DataFrame(confusion_matrix(y_val_sm, pred_reg_rfc), columns = ['Survive','Bankrupt'], index = ['Survive','Bankrupt'])
CM

Based on the confusion matrices, we choose XGBoost.