# Company Bankruptcy Prediction

## By Minh Nguyet Nguyen (Selene)

This project aims to predict company bankruptcy by applying machine learning techniques, specifically Support Vector Machines (SVM) and Random Forest classifiers. The project explores different design choices, compares model performance, and analyzes time-space complexity to determine the most effective approach for this classification task.

In [1]:
# Python Version 3.9.13, conda version 23.1.0

In [7]:
# Import packages
import sklearn # v 1.0.2
import numpy as np # v 1.21.5
import pandas as pd # v 1.4.4
import matplotlib.pyplot as plt # v 3.5.2
import statsmodels.api as sm # v 0.13.2
import sweetviz as sv # v 2.2.1
import statsmodels.tsa.stattools as stattools # v 0.13.2
import warnings as ww

from sklearn.model_selection import train_test_split # v 1.0.2
from sklearn.ensemble import RandomForestClassifier # v 1.0.2


from sklearn.tree import DecisionTreeRegressor # v 1.0.2
from sklearn.svm import SVC # v 1.0.2
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix # v 1.0.2
from sklearn.tree import export_text # v 1.0.2
from mlxtend.feature_selection import SequentialFeatureSelector as SFS # v 0.23.0
from sklearn.utils import resample # v 1.0.2

from sklearn.impute import SimpleImputer # v 1.0.2 # Impute missing values

import seaborn as sns # v. 0.11.2

ww.filterwarnings("ignore")
%matplotlib inline

In [8]:
# Read in the dataset
bankruptcy_df = pd.read_csv("/content/company_bankruptcy_data.csv")

## EDA

In [9]:
bankruptcy_df.columns

Index(['Bankrupt', ' ROA(C) before interest and depreciation before interest',
       ' ROA(A) before interest and % after tax',
       ' ROA(B) before interest and depreciation after tax',
       ' Operating Gross Margin', ' Realized Sales Gross Margin',
       ' Operating Profit Rate', ' Pre-tax net Interest Rate',
       ' After-tax net Interest Rate',
       ' Non-industry income and expenditure/revenue',
       ' Continuous interest rate (after tax)', ' Operating Expense Rate',
       ' Research and development expense rate', ' Cash flow rate',
       ' Interest-bearing debt interest rate', ' Tax rate (A)',
       ' Net Value Per Share (B)', ' Net Value Per Share (A)',
       ' Net Value Per Share (C)', ' Persistent EPS in the Last Four Seasons',
       ' Cash Flow Per Share', ' Revenue Per Share (Yuan ¥)',
       ' Operating Profit Per Share (Yuan ¥)',
       ' Per Share Net profit before tax (Yuan ¥)',
       ' Realized Sales Gross Profit Growth Rate',
       ' Operating Profit 

In [10]:
bankruptcy_df.shape

(6819, 96)

In [11]:
bankruptcy_df.dtypes

Bankrupt                                                      int64
 ROA(C) before interest and depreciation before interest    float64
 ROA(A) before interest and % after tax                     float64
 ROA(B) before interest and depreciation after tax          float64
 Operating Gross Margin                                     float64
                                                             ...   
 Liability to Equity                                        float64
 Degree of Financial Leverage (DFL)                         float64
 Interest Coverage Ratio (Interest expense to EBIT)         float64
 Net Income Flag                                              int64
 Equity to Liability                                        float64
Length: 96, dtype: object

In [12]:
# Missing values
bankruptcy_df.isna().sum() # the total number of missing values (NaN or null) in each column of the dataframe

Bankrupt                                                    0
 ROA(C) before interest and depreciation before interest    0
 ROA(A) before interest and % after tax                     0
 ROA(B) before interest and depreciation after tax          0
 Operating Gross Margin                                     0
                                                           ..
 Liability to Equity                                        0
 Degree of Financial Leverage (DFL)                         0
 Interest Coverage Ratio (Interest expense to EBIT)         0
 Net Income Flag                                            0
 Equity to Liability                                        0
Length: 96, dtype: int64

In [13]:
# Find columns with maximum value greater than 1
columns_with_max_greater_than_1 = bankruptcy_df.columns[bankruptcy_df.max() > 1]
columns_with_max_greater_than_1

Index([' Operating Expense Rate', ' Research and development expense rate',
       ' Interest-bearing debt interest rate', ' Revenue Per Share (Yuan ¥)',
       ' Total Asset Growth Rate', ' Net Value Growth Rate', ' Current Ratio',
       ' Quick Ratio', ' Total debt/Total net worth',
       ' Accounts Receivable Turnover', ' Average Collection Days',
       ' Inventory Turnover Rate (times)', ' Fixed Assets Turnover Frequency',
       ' Revenue per person', ' Allocation rate per person',
       ' Quick Assets/Current Liability', ' Cash/Current Liability',
       ' Inventory/Current Liability',
       ' Long-term Liability to Current Assets',
       ' Current Asset Turnover Rate', ' Quick Asset Turnover Rate',
       ' Cash Turnover Rate', ' Fixed Assets to Assets',
       ' Total assets to GNP price'],
      dtype='object')

In [14]:
# First 5 rows of the dataset
bankruptcy_df.head()

Unnamed: 0,Bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
0,1,0.370594,0.424389,0.40575,0.601457,0.601457,0.998969,0.796887,0.808809,0.302646,...,0.716845,0.009219,0.622879,0.601453,0.82789,0.290202,0.026601,0.56405,1,0.016469
1,1,0.464291,0.538214,0.51673,0.610235,0.610235,0.998946,0.79738,0.809301,0.303556,...,0.795297,0.008323,0.623652,0.610237,0.839969,0.283846,0.264577,0.570175,1,0.020794
2,1,0.426071,0.499019,0.472295,0.60145,0.601364,0.998857,0.796403,0.808388,0.302035,...,0.77467,0.040003,0.623841,0.601449,0.836774,0.290189,0.026555,0.563706,1,0.016474
3,1,0.399844,0.451265,0.457733,0.583541,0.583541,0.9987,0.796967,0.808966,0.30335,...,0.739555,0.003252,0.622929,0.583538,0.834697,0.281721,0.026697,0.564663,1,0.023982
4,1,0.465022,0.538432,0.522298,0.598783,0.598783,0.998973,0.797366,0.809304,0.303475,...,0.795016,0.003878,0.623521,0.598782,0.839973,0.278514,0.024752,0.575617,1,0.03549


In [15]:
# Last 5 rows of the dataset
bankruptcy_df.tail()

Unnamed: 0,Bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
6814,0,0.493687,0.539468,0.54323,0.604455,0.604462,0.998992,0.797409,0.809331,0.30351,...,0.799927,0.000466,0.62362,0.604455,0.840359,0.279606,0.027064,0.566193,1,0.02989
6815,0,0.475162,0.538269,0.524172,0.598308,0.598308,0.998992,0.797414,0.809327,0.30352,...,0.799748,0.001959,0.623931,0.598306,0.840306,0.278132,0.027009,0.566018,1,0.038284
6816,0,0.472725,0.533744,0.520638,0.610444,0.610213,0.998984,0.797401,0.809317,0.303512,...,0.797778,0.00284,0.624156,0.610441,0.840138,0.275789,0.026791,0.565158,1,0.097649
6817,0,0.506264,0.559911,0.554045,0.60785,0.60785,0.999074,0.7975,0.809399,0.303498,...,0.811808,0.002837,0.623957,0.607846,0.841084,0.277547,0.026822,0.565302,1,0.044009
6818,0,0.493053,0.570105,0.549548,0.627409,0.627409,0.99808,0.801987,0.8138,0.313415,...,0.815956,0.000707,0.62668,0.627408,0.841019,0.275114,0.026793,0.565167,1,0.233902


In [16]:
# Summary statistics
bankruptcy_df.describe()

Unnamed: 0,Bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
count,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,...,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0,6819.0
mean,0.032263,0.50518,0.558625,0.553589,0.607948,0.607929,0.998755,0.79719,0.809084,0.303623,...,0.80776,18629420.0,0.623915,0.607946,0.840402,0.280365,0.027541,0.565358,1.0,0.047578
std,0.17671,0.060686,0.06562,0.061595,0.016934,0.016916,0.01301,0.012869,0.013601,0.011163,...,0.040332,376450100.0,0.01229,0.016934,0.014523,0.014463,0.015668,0.013214,0.0,0.050014
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
25%,0.0,0.476527,0.535543,0.527277,0.600445,0.600434,0.998969,0.797386,0.809312,0.303466,...,0.79675,0.0009036205,0.623636,0.600443,0.840115,0.276944,0.026791,0.565158,1.0,0.024477
50%,0.0,0.502706,0.559802,0.552278,0.605997,0.605976,0.999022,0.797464,0.809375,0.303525,...,0.810619,0.002085213,0.623879,0.605998,0.841179,0.278778,0.026808,0.565252,1.0,0.033798
75%,0.0,0.535563,0.589157,0.584105,0.613914,0.613842,0.999095,0.797579,0.809469,0.303585,...,0.826455,0.005269777,0.624168,0.613913,0.842357,0.281449,0.026913,0.565725,1.0,0.052838
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,9820000000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [17]:
# SweetViz
my_report = sv.analyze(bankruptcy_df)


In [18]:
my_report.show_html()

Report SWEETVIZ_REPORT.html was generated! NOTEBOOK/COLAB USERS: the web browser MAY not pop up, regardless, the report IS saved in your notebook/colab files.


In [19]:
bankruptcy_df["Bankrupt"].value_counts()

0    6599
1     220
Name: Bankrupt, dtype: int64

According to the Sweetviz report and the class counts above, the dataset is imbalanced in the target variable Bankrupt: 6,599 observations belong to class 0 (97%) and 220 observations belong to class 1 (3%).
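
The class shares quoted above can be reproduced from the two counts alone. The sketch below is illustrative: it hard-codes the counts from the `value_counts()` output rather than reading the CSV, and the names `class_counts`, `class_share`, and `imbalance_ratio` are placeholders introduced here.

```python
import pandas as pd

# Class counts copied from the value_counts() output above
class_counts = pd.Series({0: 6599, 1: 220}, name="Bankrupt")

# Share of each class in the full dataset
class_share = class_counts / class_counts.sum()

# How many non-bankrupt companies there are per bankrupt company
imbalance_ratio = class_counts.loc[0] / class_counts.loc[1]

print(class_share.round(4))
print(f"Majority-to-minority ratio: {imbalance_ratio:.1f} : 1")
```

On the real dataframe, `bankruptcy_df["Bankrupt"].value_counts(normalize=True)` should give the same shares directly.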

### Undersampling the dataset

In [20]:
# Separate the dataset into two datasets
dataset_1 = bankruptcy_df[bankruptcy_df["Bankrupt"] == 0]  # Class 0 (majority class)
dataset_2 = bankruptcy_df[bankruptcy_df["Bankrupt"] == 1]  # Class 1 (minority class)

In [21]:
# Check the number of observations in dataset_1
len(dataset_1)

6599

In [22]:
# Check the number of observations in dataset_2
len(dataset_2)

220

In [23]:
# Undersample the majority class
undersampled_dataset_1 = resample(dataset_1, replace=False, n_samples=220, random_state=42)

In [24]:
# Concatenate the new undersampled majority dataset to the minority dataset
undersampled_df = pd.concat([undersampled_dataset_1, dataset_2])

In [25]:
# Save dataset
undersampled_df.to_csv('/content/undersampled_df.csv', index=True)

In [26]:
undersampled_df

Unnamed: 0,Bankrupt,ROA(C) before interest and depreciation before interest,ROA(A) before interest and % after tax,ROA(B) before interest and depreciation after tax,Operating Gross Margin,Realized Sales Gross Margin,Operating Profit Rate,Pre-tax net Interest Rate,After-tax net Interest Rate,Non-industry income and expenditure/revenue,...,Net Income to Total Assets,Total assets to GNP price,No-credit Interval,Gross Profit to Sales,Net Income to Stockholder's Equity,Liability to Equity,Degree of Financial Leverage (DFL),Interest Coverage Ratio (Interest expense to EBIT),Net Income Flag,Equity to Liability
2236,0,0.471945,0.540667,0.523636,0.607518,0.607518,0.999034,0.797471,0.809381,0.303531,...,0.801098,0.002294,0.623034,0.607520,0.840529,0.281733,0.026791,0.565159,1,0.023960
5538,0,0.507093,0.554187,0.558702,0.611972,0.611986,0.999132,0.797542,0.809430,0.303451,...,0.808595,0.002987,0.625407,0.611967,0.840884,0.277676,0.026885,0.565571,1,0.042552
4593,0,0.503924,0.550425,0.556936,0.605788,0.605788,0.999048,0.797459,0.809370,0.303481,...,0.806939,0.001287,0.624203,0.605785,0.840704,0.276773,0.026811,0.565249,1,0.056514
6315,0,0.451275,0.498528,0.503346,0.598438,0.598438,0.998972,0.797280,0.809214,0.303326,...,0.771691,0.008026,0.623244,0.598436,0.837267,0.284545,0.026559,0.563737,1,0.020046
4205,0,0.533418,0.613607,0.596445,0.613939,0.612743,0.999142,0.797527,0.809452,0.303401,...,0.837382,0.000646,0.623888,0.613937,0.843529,0.280201,0.026804,0.565218,1,0.027769
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6591,1,0.418515,0.433984,0.461427,0.612750,0.612750,0.998864,0.796902,0.808857,0.302892,...,0.725750,0.000487,0.623730,0.612747,0.828067,0.292648,0.026666,0.564481,1,0.015620
6640,1,0.196802,0.211023,0.221425,0.598056,0.598056,0.998933,0.796144,0.808149,0.301423,...,0.519388,0.017588,0.623465,0.598051,0.856906,0.259280,0.026769,0.565052,1,0.003946
6641,1,0.337640,0.254307,0.378446,0.590842,0.590842,0.998869,0.796943,0.808897,0.302953,...,0.557733,0.000847,0.623302,0.590838,0.726888,0.336515,0.026777,0.565092,1,0.011797
6642,1,0.340028,0.344636,0.380213,0.581466,0.581466,0.998372,0.796292,0.808283,0.302857,...,0.641804,0.000376,0.623497,0.581461,0.765967,0.337315,0.026722,0.564807,1,0.011777


In [27]:
# Check the number of observations for Class 0 of the new undersampled dataset
len(undersampled_df[undersampled_df["Bankrupt"]==0])

220
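
The undersampling steps above can be wrapped in a small reusable function. This is a sketch under stated assumptions, not the notebook's own code: `undersample_majority` is a hypothetical helper, `toy_df` is synthetic data, and the final shuffle is an extra step that the notebook does not perform.

```python
import pandas as pd
from sklearn.utils import resample


def undersample_majority(df, target, random_state=42):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    smallest = df[target].value_counts().min()
    parts = [
        # Sampling without replacement keeps each retained row unique
        resample(group, replace=False, n_samples=smallest, random_state=random_state)
        for _, group in df.groupby(target)
    ]
    # Shuffle so the classes are not stored in contiguous blocks
    return pd.concat(parts).sample(frac=1, random_state=random_state)


# Synthetic stand-in: 95 majority rows and 5 minority rows
toy_df = pd.DataFrame({"Bankrupt": [0] * 95 + [1] * 5, "ratio": range(100)})
balanced = undersample_majority(toy_df, "Bankrupt")
print(balanced["Bankrupt"].value_counts())
```

Undersampling discards most of the majority class (here 6,379 of 6,599 rows in the real data), which is the price paid for a balanced training set.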

### Splitting the datasets into Training and Testing Sets

**Original dataset**

In [56]:
# Separate dataset into Y (dependent) and X (independent) variables
y = bankruptcy_df['Bankrupt']
X = bankruptcy_df.drop('Bankrupt', axis=1)

In [57]:
# Use the train_test_split function to split the original data into training and testing sets (80% vs 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

**Undersampled dataset**

In [58]:
# Separate dataset into Y (dependent) and X (independent) variables
y_undersampled = undersampled_df['Bankrupt']
X_undersampled = undersampled_df.drop('Bankrupt', axis=1)

In [59]:
# Use the train_test_split function to split the undersampled data into training and testing sets (80% vs 20%)
X_train_undersampled, X_test_undersampled, y_train_undersampled, y_test_undersampled = train_test_split(X_undersampled, y_undersampled, test_size=0.2, random_state=5)
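
Both splits above draw rows at random without stratification, so the share of bankrupt companies can differ between the training and testing sets. A minimal sketch of a stratified split follows, assuming synthetic labels with the same class counts as the original data. `X_demo` and `y_demo` are placeholder names; `stratify` is a standard `train_test_split` parameter, but using it here is a suggestion, not what the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the original dataset
y_demo = np.array([0] * 6599 + [1] * 220)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions nearly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(f"Bankrupt share in train: {y_tr.mean():.4f}")
print(f"Bankrupt share in test:  {y_te.mean():.4f}")
```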

## Part I

### Random Forest

**1. Fit the model on the training set using the default parameters and report your findings.**

In [60]:
# Instantiate the model
clf_rf = RandomForestClassifier()

In [61]:
%%time
# Fit the model and report CPU and wall time
clf_rf.fit(X_train, y_train)

CPU times: user 3.65 s, sys: 7.78 ms, total: 3.66 s
Wall time: 3.69 s


In [62]:
# Make predictions on test set
y_pred = clf_rf.predict(X_test)

In [63]:
# Evaluation metrics (I avoided using accuracy because the original dataset is imbalanced)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f_score = f1_score(y_test, y_pred)

In [64]:
precision, recall, f_score

(0.8888888888888888, 0.1568627450980392, 0.26666666666666666)

* A precision of 0.8889 indicates that when the classifier predicts a company as bankrupt, it is correct about 88.89% of the time.
* A recall of 0.1569 indicates that the classifier is capturing only about 15.69% of the actual bankrupt companies.
* The model has high precision but low recall: when it predicts bankruptcy it is usually correct, but it misses most of the actual bankrupt companies.
* An F-score of 0.2667 reflects this gap between precision and recall. The model does not achieve a satisfactory balance between the two.
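
The three scores can be checked by hand from confusion-matrix counts. The counts below are an inference, not notebook output: 8 true positives, 1 false positive, and 43 false negatives are consistent with the reported scores and with the 51 bankrupt companies in the test set (visible in the SVM confusion matrix later in the notebook).

```python
# Counts inferred from the reported scores (assumption, not printed by the notebook)
tp, fp, fn = 8, 1, 43

precision = tp / (tp + fp)   # share of predicted bankruptcies that are real
recall = tp / (tp + fn)      # share of real bankruptcies that are caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is a harmonic mean, it stays close to the smaller of the two values, which is why a precision near 0.89 still yields an F-score near 0.27.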

In [65]:
# Display the decision path
clf_rf.decision_path(X_train)

(<5455x21254 sparse matrix of type '<class 'numpy.int64'>'
 	with 5338611 stored elements in Compressed Sparse Row format>,
 array([    0,   219,   448,   687,   900,  1099,  1318,  1519,  1730,
         1955,  2168,  2393,  2594,  2805,  2984,  3223,  3422,  3651,
         3866,  4081,  4300,  4497,  4692,  4911,  5118,  5311,  5540,
         5763,  5948,  6187,  6402,  6615,  6830,  7059,  7280,  7479,
         7682,  7885,  8106,  8317,  8534,  8745,  8934,  9141,  9362,
         9571,  9756,  9951, 10170, 10381, 10586, 10797, 10998, 11233,
        11446, 11685, 11902, 12111, 12312, 12543, 12748, 12961, 13172,
        13339, 13548, 13745, 13930, 14139, 14382, 14623, 14836, 15069,
        15276, 15499, 15704, 15939, 16160, 16373, 16580, 16803, 17010,
        17231, 17434, 17651, 17850, 18059, 18288, 18503, 18718, 18925,
        19130, 19337, 19552, 19777, 19992, 20203, 20406, 20641, 20854,
        21055, 21254]))

In [66]:
# Pairing feature names with their importances
feature_importance_pairs = list(zip(X_train.columns, clf_rf.feature_importances_))

# Displaying the feature importances alongside their names
for feature, importance in feature_importance_pairs:
    print(f"{feature}: {importance}")

 ROA(C) before interest and depreciation before interest: 0.013561483122392685
 ROA(A) before interest and % after tax: 0.011312083660149757
 ROA(B) before interest and depreciation after tax: 0.010715766714336498
 Operating Gross Margin: 0.00619813084229941
 Realized Sales Gross Margin: 0.006206781812297031
 Operating Profit Rate: 0.005616937231320381
 Pre-tax net Interest Rate: 0.010460711423863758
 After-tax net Interest Rate: 0.009620457375436214
 Non-industry income and expenditure/revenue: 0.018033978155599078
 Continuous interest rate (after tax): 0.012508061435248236
 Operating Expense Rate: 0.009768870398829007
 Research and development expense rate: 0.006508726765356376
 Cash flow rate: 0.006920925901944419
 Interest-bearing debt interest rate: 0.01661466815668654
 Tax rate (A): 0.0024438870503382276
 Net Value Per Share (B): 0.014323912128276276
 Net Value Per Share (A): 0.013637673695433568
 Net Value Per Share (C): 0.020777944699338944
 Persistent EPS in the Last Four Seas

**2. Repeat 1 but change the parameters to different non-default parameters. Evaluate this model on your choice of metrics. Which model do you prefer?**

In [67]:
# Random Forest Classifier with Non-default Parameters
clf_rf_modified = RandomForestClassifier(criterion='entropy')

In [68]:
%%time
# Fit the model and report CPU and wall time
clf_rf_modified.fit(X_train, y_train)

CPU times: user 3.54 s, sys: 6.38 ms, total: 3.55 s
Wall time: 5.89 s


In [69]:
# Make predictions on test set
y_pred_modified = clf_rf_modified.predict(X_test)

In [70]:
# Calculate metrics
precision_modified = precision_score(y_test, y_pred_modified)
recall_modified = recall_score(y_test, y_pred_modified)
f_score_modified = f1_score(y_test, y_pred_modified)

In [71]:
precision_modified, recall_modified, f_score_modified

(0.7272727272727273, 0.1568627450980392, 0.25806451612903225)

* A precision of 0.7273 indicates that when the classifier predicts a company as bankrupt, it is correct about 72.73% of the time.
* A recall of 0.1569 indicates that the classifier is capturing only about 15.69% of the actual bankrupt companies.
* The model's precision is relatively high, while the recall is low.
* An F-score of 0.2581 reflects this gap between precision and recall. The model does not achieve a satisfactory balance between the two.

I prefer the Random Forest Classifier with default parameters: its recall is identical (0.1569), while its precision (0.8889 vs. 0.7273) and F-score (0.2667 vs. 0.2581) are higher than those of the classifier with non-default parameters.

**3. Which model takes longer to fit? 1. Or 2.? Report the CPU time.**

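Judged by total CPU time, the Random Forest Classifier with default parameters takes slightly longer to fit (3.66 s vs. 3.55 s), although its wall time was shorter (3.69 s vs. 5.89 s).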
* The Random Forest Classifier with Default parameters: user 3.65 s, sys: 7.78 ms, total: 3.66 s
* The Random Forest Classifier with Non-default parameters: user 3.54 s, sys: 6.38 ms, total: 3.55 s

**4. Based on your findings above which parameter combinations give you the best results for classification? Would you prefer “Gini” or “entropy” as your splitting metric?**

The two fits above differ only in the splitting criterion, so they cannot show which parameter combination is best. To identify that, I believe we need hyperparameter tuning with cross-validation over parameters such as n_estimators, selecting the combination that optimizes the model's performance on this dataset.
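
A minimal sketch of such a tuning run is shown below, assuming a small synthetic dataset in place of the bankruptcy data. The grid values, the F1 scoring choice, and the names `X_demo`, `y_demo`, and `search` are illustrative assumptions, not settings used in this project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic, mildly imbalanced stand-in for the real training data
X_demo, y_demo = make_classification(
    n_samples=400, n_features=10, weights=[0.9, 0.1], random_state=0
)

# Illustrative grid: 2 x 2 x 2 = 8 parameter combinations
param_grid = {
    "n_estimators": [25, 50],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="f1",  # accuracy would be misleading on imbalanced classes
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Stratified folds keep the minority class present in every fold, which matters when only about 3% of the rows are bankrupt.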

Based on my findings above, I would prefer "Gini" (the default Random Forest Classifier) as my splitting criterion, since the default classifier has a higher precision and F-score than the classifier with criterion='entropy', with the same recall.

**5. Choose one of the models from number 1. Or number 2, display, and discuss the decision rules. Do the rules make sense for classification?**

In [72]:
# Display the decision rules
for i, clf in enumerate(clf_rf.estimators_):
    # Display decision rules as text
    text_representation = export_text(clf, feature_names=X_train.columns.tolist())
    print(f"Decision Rules:\n{text_representation}")

Streaming output truncated to the last 5000 lines.
|   |   |---  Continuous Net Profit Growth Rate >  0.22
|   |   |   |---  Allocation rate per person <= 0.25
|   |   |   |   |---  Net Value Growth Rate <= 0.00
|   |   |   |   |   |---  Total assets to GNP price <= 0.00
|   |   |   |   |   |   |--- class: 0.0
|   |   |   |   |   |---  Total assets to GNP price >  0.00
|   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |---  Net Value Growth Rate >  0.00
|   |   |   |   |   |---  Current Ratio <= 0.01
|   |   |   |   |   |   |---  Current Liability to Current Assets <= 0.04
|   |   |   |   |   |   |   |--- class: 1.0
|   |   |   |   |   |   |---  Current Liability to Current Assets >  0.04
|   |   |   |   |   |   |   |---  Average Collection Days <= 0.01
|   |   |   |   |   |   |   |   |---  Interest Expense Ratio <= 0.63
|   |   |   |   |   |   |   |   |   |---  Operating profit/Paid-in capital <= 0.09
|   |   |   |   |   |   |   |   |   |   |---  Interest-bearing d

The decision rules display a hierarchical structure for classification, providing a clear path for determining the class of a given entity based on specific financial parameters (financial ratios).

The rules use various financial ratios. Each tree makes logical sense for classification, as it systematically navigates through different financial metrics to arrive at a final classification of either class 0.0 or class 1.0. Note that export_text rounds thresholds to two decimals by default, so a threshold shown as 0.00 may be a small positive value.

In the branch displayed above, if the Continuous Net Profit Growth Rate is larger than 0.22, the tree proceeds to examine the Allocation rate per person. If this rate is less than or equal to 0.25, it checks whether the Net Value Growth Rate is less than or equal to 0.00. If so, it evaluates Total assets to GNP price: if this value does not exceed 0.00, the instance is classified as Class 0; otherwise, it is classified as Class 1.

Therefore, the rules make sense for classification. This is the path of a single tree, however; the forest's prediction is an aggregate over all of its trees.
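
Printing every tree of the forest produces thousands of lines, so a shorter view can be easier to discuss. The sketch below assumes a tiny synthetic dataset with hypothetical feature names (`ratio_a` to `ratio_d`) and a deliberately shallow forest. It prints only the first tree and raises `decimals` so that small thresholds are not rounded to 0.00.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import export_text

# Synthetic stand-in with four made-up financial ratios
X_demo, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=3, n_redundant=1, random_state=0
)
feature_names = ["ratio_a", "ratio_b", "ratio_c", "ratio_d"]

# A shallow forest keeps each tree small enough to read
forest = RandomForestClassifier(n_estimators=5, max_depth=2, random_state=0)
forest.fit(X_demo, y_demo)

# Inspect one tree only; decimals=4 shows thresholds with more precision
first_tree = forest.estimators_[0]
rules = export_text(first_tree, feature_names=feature_names, decimals=4)
print(rules)
```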

**6. Which features are the most important for classification?**

In [73]:
# Get and pair feature importances
feature_importances = dict(zip(X.columns, clf_rf.feature_importances_))

# Top 5 most important features
top_features = sorted(feature_importances.items(), key=lambda x: x[1], reverse=True)[:5]
top_features

[(" Net Income to Stockholder's Equity", 0.028528101684305578),
 (' Net Value Growth Rate', 0.02712142741165896),
 (' Persistent EPS in the Last Four Seasons', 0.023315610984259494),
 (' Net Value Per Share (C)', 0.020777944699338944),
 (' Per Share Net profit before tax (Yuan ¥)', 0.018961126241586357)]

**7. On average, what do you think is the tradeoff between model fitting time and model performance?**

According to the above outputs, better performance came with a longer fitting time: the default Random Forest Classifier has the higher precision and F-score, and it also has the longer CPU time (3.66 s vs. 3.55 s). The gap of about 0.1 s is small, however, so this single pair of runs is only weak evidence of a general tradeoff between fitting time and performance.
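
A single `%%time` measurement is noisy, so one way to make the comparison more robust is to repeat each fit and look at the mean and spread. The sketch below uses synthetic data and small forests to stay fast. `time_fits` is a hypothetical helper, and `time.process_time` is used because it measures CPU time of the current process rather than wall time.

```python
import time
from statistics import mean, stdev

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=0)


def time_fits(criterion, repeats=5):
    """Return the CPU seconds of several fits with the given criterion."""
    durations = []
    for seed in range(repeats):
        model = RandomForestClassifier(
            n_estimators=20, criterion=criterion, random_state=seed
        )
        start = time.process_time()
        model.fit(X_demo, y_demo)
        durations.append(time.process_time() - start)
    return durations


timings = {criterion: time_fits(criterion) for criterion in ("gini", "entropy")}
for criterion, durations in timings.items():
    print(f"{criterion}: mean {mean(durations):.3f} s, sd {stdev(durations):.3f} s")
```

If the two means differ by less than the spread of the individual runs, the difference in fitting time is probably not meaningful.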

### Support Vector Machines

**1. Fit the model on the training set using the default parameters and report your findings.**

In [74]:
# Instantiate the model
clf = SVC()

In [75]:
%%time
# Fit the model and report CPU and wall time
clf.fit(X_train, y_train)

CPU times: user 688 ms, sys: 4.39 ms, total: 692 ms
Wall time: 694 ms


In [76]:
# Make predictions on test set
y_pred_clf = clf.predict(X_test)

In [77]:
# Calculate metrics
accuracy_clf = accuracy_score(y_test, y_pred_clf)
precision_clf_zero_division = precision_score(y_test, y_pred_clf, zero_division=1)
precision_clf = precision_score(y_test, y_pred_clf)

In [78]:
accuracy_clf, precision_clf_zero_division, precision_clf

(0.9626099706744868, 1.0, 0.0)

In [79]:
# Confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_clf).ravel()
tn, fp, fn, tp

(1313, 0, 51, 0)

* True Negatives (TN) is 1313, indicating the correct prediction of instances belonging to Class 0 (negative).
* False Positives (FP) is 0, implying no instances were mistakenly classified as Class 1 (positive) when they actually pertain to Class 0.
* False Negatives (FN) is 51, indicating the incorrect prediction of Class 0 when they should have been assigned to Class 1.
* True Positives (TP) is 0, indicating the model's inability to successfully identify any instances of Class 1.

The large number of True Negatives produces a high accuracy, but the model never predicts Class 1. Precision for Class 1 is therefore undefined (there are no positive predictions), while recall and the F1 score are 0. This issue could stem from the strong imbalance in the distribution of the classes.
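
The arithmetic behind these statements can be worked through directly from the confusion matrix above. The variable names are placeholders, and `majority_baseline` is an added comparison showing what a model that always predicts Class 0 would score.

```python
import math

# Counts from the confusion matrix above
tn, fp, fn, tp = 1313, 0, 51, 0
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall = tp / (tp + fn)

# Precision is 0/0 when nothing is predicted positive, so mark it as NaN
predicted_positive = tp + fp
precision = tp / predicted_positive if predicted_positive else math.nan

# Accuracy of always predicting the majority class (Class 0)
majority_baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f}, recall={recall:.4f}, precision={precision}")
print(f"majority-class baseline accuracy={majority_baseline:.4f}")
```

The accuracy equals the majority-class baseline exactly, which confirms that the default SVM behaves like a constant "not bankrupt" predictor on this test set.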

**2. Change the kernel from “rbf” to “linear” and repeat the model fitting procedure in 1 above.**

In [80]:
# Instantiate the model
clf_modified = SVC(kernel='linear')

In [53]:
# Fit the model
# clf_modified.fit(X_train, y_train)
# The linear SVM did not converge.

**3. Evaluate both models using any metrics of choice.**

In [81]:
# Calculate metrics
accuracy_clf = accuracy_score(y_test, y_pred_clf)
precision_clf_zero_division = precision_score(y_test, y_pred_clf, zero_division=1)
precision_clf = precision_score(y_test, y_pred_clf)

In [82]:
accuracy_clf, precision_clf_zero_division, precision_clf

(0.9626099706744868, 1.0, 0.0)

* An accuracy of 0.9626 indicates that the model is correct about 96.26% of the time. However, because this dataset is heavily imbalanced, accuracy is not a reliable evaluation metric here.
* The large number of True Negatives produces the high accuracy, but the model never predicts Class 1, so precision for Class 1 is undefined and recall and the F1 score are 0. This issue could stem from the strong imbalance in the distribution of the classes.
* With zero_division=1, precision_score returns 1.0 as a placeholder for the undefined 0/0 case; with the default setting it returns 0.0. Neither value reflects real predictive skill, because no instance was predicted as positive.
* Only the default (RBF) model is evaluated here, because the linear model was not fitted.

**4. How long does it take to fit the SVM model in 1. above? How about 2. above? Report only the CPU times.**

* The Default Support Vector Machines (Model 1): user 688 ms, sys: 4.39 ms, total: 692 ms
* The linear SVM (Model 2) did not converge.
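
Several columns in this dataset have maxima far above 1 (see the `columns_with_max_greater_than_1` output), and SVMs are sensitive to feature scale. Standardizing the features before fitting is a common remedy that might help the linear kernel finish, but it was not tried in this project, so the sketch below is an untested suggestion on synthetic data. `X_demo`, `y_demo`, and `scaled_linear_svm` are placeholder names.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in; one column is inflated to mimic a badly scaled ratio
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=6,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    random_state=0,
)
X_demo[:, 0] *= 1e9

# The scaler is fitted inside the pipeline, so it only sees the data passed to fit()
scaled_linear_svm = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scaled_linear_svm.fit(X_demo, y_demo)

train_accuracy = scaled_linear_svm.score(X_demo, y_demo)
print(f"Training accuracy on the synthetic data: {train_accuracy:.3f}")
```

Keeping the scaler in a pipeline also prevents test-set statistics from leaking into training when the pipeline is fitted on `X_train` only.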

## Part II

**1. Fit the Random Forest and the SVM on the undersampled data and compare the fitting times for both models? Which model takes longer to fit?**

**Random Forest**

In [83]:
# Instantiate the model
clf_rf_undersampled = RandomForestClassifier()

In [84]:
%%time
# Fit the model and report CPU and wall time
clf_rf_undersampled.fit(X_train_undersampled, y_train_undersampled)

CPU times: user 454 ms, sys: 4.38 ms, total: 459 ms
Wall time: 478 ms


**SVM**

In [85]:
# Instantiate the model
clf_undersampled = SVC()

In [86]:
%%time
# Fit the model and report CPU and wall time
clf_undersampled.fit(X_train_undersampled, y_train_undersampled)

CPU times: user 23.1 ms, sys: 585 µs, total: 23.7 ms
Wall time: 36.4 ms


The Default Random Forest Classifier takes longer to fit.

* Default Random Forest Classifier: user 454 ms, sys: 4.38 ms, total: 459 ms

* Default Support Vector Machines: user 23.1 ms, sys: 585 µs, total: 23.7 ms

**2. Compare the performance of both models on any metrics of choice.**

In [87]:
# Random Forest
y_pred_rf_undersampled = clf_rf_undersampled.predict(X_test_undersampled)

# SVM
y_pred_clf_undersampled = clf_undersampled.predict(X_test_undersampled)

# Evaluate models based on accuracy
accuracy_rf_undersampled = accuracy_score(y_test_undersampled, y_pred_rf_undersampled)
accuracy_svm_undersampled = accuracy_score(y_test_undersampled, y_pred_clf_undersampled)

In [88]:
# Accuracy of Random Forest
accuracy_rf_undersampled

0.7954545454545454

In [89]:
# Accuracy of SVM
accuracy_svm_undersampled

0.625

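Random Forest performs better based on accuracy (0.795 vs. 0.625 for the SVM). Accuracy is a reasonable metric here because the undersampled dataset is balanced.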
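
Accuracy alone does not show which class each model gets wrong, so it can help to report precision, recall, and F1 side by side. The sketch below uses made-up labels and predictions, not the notebook's results, and `summarize` is a hypothetical helper.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def summarize(y_true, y_pred):
    """Collect four common classification metrics in one dictionary."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
        "f1": f1_score(y_true, y_pred, zero_division=0),
    }


# Made-up balanced labels and two sets of made-up predictions
y_true_demo = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred_model_a = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_model_b = [0, 0, 1, 1, 1, 0, 0, 0]

summary_a = summarize(y_true_demo, y_pred_model_a)
summary_b = summarize(y_true_demo, y_pred_model_b)
print("model A:", summary_a)
print("model B:", summary_b)
```

Applied to this notebook, the same helper could be called with `y_test_undersampled` and each model's predictions.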