# Introduction 

This bankruptcy dataset is great for practice: much of the data cleaning and scaling is already done, so we can focus primarily on building the model itself. We will build a model to predict whether or not a company will file for bankruptcy.

In [None]:
!pip install pyjanitor

import pandas as pd
import numpy as np
import janitor  # pyjanitor adds the .clean_names() method to DataFrames

company_bankrupcy = pd.read_csv('../input/company-bankruptcy-prediction/data.csv').clean_names()
company_bankrupcy.head()

In [None]:
y = company_bankrupcy['bankrupt_']
X = company_bankrupcy.drop('bankrupt_', axis=1)

In [None]:
X.shape

We have almost seven thousand observations and 95 predictor variables.

## Splitting and setting up our model

We will first split the data into a training set and a testing set, then split the training set again into training and validation sets. The validation set lets us evaluate the model during training, and the testing set lets us evaluate it on new data it has not seen.
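
To make the resulting proportions concrete, here is a small arithmetic sketch. It assumes the first split holds out 25% of the rows, which is the `train_test_split` default when no `test_size` is passed, and the helper name `split_fractions` is hypothetical.

```python
# Sketch: share of the full dataset that lands in each subset
# after two successive hold-out splits.
def split_fractions(test_size=0.25, val_size=0.25):
    """Return the share of all rows in the train, validation and test sets.

    test_size: share of all rows held out by the first split
               (assumed 0.25, the train_test_split default).
    val_size:  share of the remaining rows held out by the second split.
    """
    remaining = 1.0 - test_size
    return {
        "train": remaining * (1.0 - val_size),
        "validation": remaining * val_size,
        "test": test_size,
    }


fractions = split_fractions()
for name, share in fractions.items():
    print(f"{name}: {share:.2%}")
```

Under these assumptions, roughly 56% of the rows are used for training, 19% for validation, and 25% for testing.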

The model will also use an early-stopping callback to halt training if the validation loss stops improving.
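
The idea behind patience-based early stopping can be sketched in plain Python. This is a simplified illustration rather than Keras's actual implementation: the function `early_stop_epoch` and the loss values are hypothetical, and a short patience of 3 is used only to keep the example small.

```python
# Simplified sketch of patience-based early stopping on a validation loss.
def early_stop_epoch(val_losses, min_delta=0.001, patience=15):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    An epoch counts as an improvement only if its loss beats the best loss
    so far by more than min_delta. Training stops once `patience` epochs in
    a row fail to improve. stop_epoch is None if training never stops early.
    """
    best_loss = float("inf")
    best_epoch = None
    epochs_without_improvement = 0

    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Made-up losses: 0.2995 improves on 0.3 by less than min_delta,
# so it does not reset the patience counter.
example_losses = [0.5, 0.4, 0.3, 0.2995, 0.31, 0.32]
stop_epoch, best_epoch = early_stop_epoch(example_losses, min_delta=0.001, patience=3)
print(stop_epoch, best_epoch)
```

With `restore_best_weights=True`, the weights kept after stopping are those from the best epoch rather than the last one.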

In [None]:
from sklearn.model_selection import train_test_split
from tensorflow import keras
from tensorflow.keras import layers, callbacks

# Split off the testing set (train_test_split holds out 25% by default)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=13, stratify=y)

# Split the remaining data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train,
                                                  test_size=0.25, random_state=10,
                                                  stratify=y_train)

early_stopping = callbacks.EarlyStopping(
    min_delta=0.001, # minimum amount of change to count as an improvement
    patience=15, # how many epochs to wait before stopping
    restore_best_weights=True, # keep the weights from the best epoch
)

model = keras.Sequential([
#     layers.Dense(95, activation='relu'),
    layers.Dense(190, activation='relu', input_shape=[95]),
    layers.Dense(190, activation='relu'),
    layers.Dense(1, activation='sigmoid'),
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['binary_accuracy'])

A wider model (190 units per hidden layer, twice the number of inputs) appears to work best here. This assumes that many of the predictor variables have a roughly linear relationship with our target variable `y`.
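
For a sense of scale, the number of trainable parameters in this network can be worked out by hand. The sketch below assumes standard fully connected layers with one bias per unit; `dense_params` is a hypothetical helper, not a Keras function.

```python
# Sketch: count the weights and biases in a stack of fully connected layers.
def dense_params(n_inputs, n_units):
    """Parameters in one dense layer: one weight per input-unit pair plus one bias per unit."""
    return n_inputs * n_units + n_units


# 95 input features, two hidden layers of 190 units, one output unit.
layer_sizes = [95, 190, 190, 1]
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)

print(per_layer)     # [18240, 36290, 191]
print(total_params)  # 54721
```

Calling `model.summary()` on the built model should report the same total.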

We will fit the model now and look at the corresponding accuracy and loss curves to check that we are not overfitting the model to the training set.

In [None]:
history = model.fit(
    X_train, 
    y_train, 
    validation_data=(X_val, y_val),
    epochs=500, #150 
    batch_size=100, #50 
    callbacks=[early_stopping]
)

In [None]:
history_df = pd.DataFrame(history.history)
history_df.loc[:, ['loss', 'val_loss']].plot();
print(f"Minimum validation loss: {history_df['val_loss'].min()}")

Our loss curves do not seem to indicate that we are overfitting the training data, although the validation loss is slightly higher than the training loss.
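
One rough way to back up a visual reading of the curves is to compare the training and validation loss at the epoch with the lowest validation loss. The sketch below uses made-up loss values purely for illustration (they are not results from this notebook), and `loss_gap_at_best_epoch` is a hypothetical helper.

```python
# Sketch: measure the train/validation loss gap at the best validation epoch.
def loss_gap_at_best_epoch(history):
    """Return (best_epoch, gap), where gap = val_loss - loss at the epoch
    with the lowest validation loss.

    history: dict with equal-length lists under the keys 'loss' and 'val_loss',
             the same layout as the `history.history` dict from `model.fit`.
    """
    val_losses = history["val_loss"]
    best_epoch = min(range(len(val_losses)), key=val_losses.__getitem__)
    gap = val_losses[best_epoch] - history["loss"][best_epoch]
    return best_epoch, gap


# Illustrative values only.
example_history = {
    "loss": [0.30, 0.20, 0.15, 0.12, 0.10],
    "val_loss": [0.32, 0.22, 0.18, 0.17, 0.19],
}
best_epoch, gap = loss_gap_at_best_epoch(example_history)
print(f"Best epoch: {best_epoch}, validation minus training loss: {gap:.3f}")
```

With a real run, `history.history` could be passed in directly. A small gap is consistent with little overfitting, while a gap that keeps widening as training continues is the usual warning sign.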

In [None]:
# Plot accuracy from epoch 5 onward (the first five epochs are skipped)
history_df.loc[5:, ['binary_accuracy', 'val_binary_accuracy']].plot();
print(f"Max validation accuracy: {history_df['val_binary_accuracy'].max()}")

## Final test set 

In [None]:
preds = model.predict(X_test)
scores = model.evaluate(X_test, y_test)

OK! Our highest validation accuracy during training was around 96.5%, and our accuracy on the new testing data was 95.8%. Because the two are close, this suggests that the model was not overfit.

In [None]:
preds.round()

In [None]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, preds.round())

# Conclusion

Because this bankruptcy dataset was clean, we could focus on building the deep learning model. The confusion matrix above shows how the model performed on the new data.
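
A confusion matrix can be reduced to a few summary numbers. The sketch below assumes the scikit-learn layout for a binary problem (rows are true classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`). The counts are made up for illustration, not taken from this notebook, and `summarize_confusion_matrix` is a hypothetical helper.

```python
# Sketch: derive accuracy, precision and recall from a 2x2 confusion matrix.
def summarize_confusion_matrix(cm):
    """cm is [[TN, FP], [FN, TP]], with class 1 (bankrupt) as the positive class."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        # Of the companies predicted bankrupt, the share that really were.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the companies that really went bankrupt, the share that were caught.
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative counts only.
example_cm = [[90, 2], [5, 3]]
summary = summarize_confusion_matrix(example_cm)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

With the real matrix, `confusion_matrix(y_test, preds.round()).tolist()` could be passed in. Looking at precision and recall alongside accuracy can be useful when one class is much rarer than the other, because a high accuracy can then coexist with many missed positive cases.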