Christopher Kerns  
Kaggle: ctkerns

In [1]:
import numpy as np
import pandas as pd

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# 1. Read the CSV data

In [1]:
train = pd.read_csv("/kaggle/input/cap-4611-spring-21-assignment-1/train.csv")
train.head()

In [1]:
test = pd.read_csv("/kaggle/input/cap-4611-spring-21-assignment-1/test.csv")
test.head()

# 2. Check for missing values in training data

In [1]:
train.info()

There is no data that is explicitly marked as null.

In [1]:
train.describe()

It looks like 0 is being used to represent missing values across most of the columns.

In [1]:
train[train == 0].info()

In some columns, there are hundreds of zeros, which probably don't indicate missing data. In other columns, there are one or two. If zero values in these columns don't represent missing values then they may still be outliers.
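
The zero counts can also be read off directly by summing a boolean mask per column. This is a minimal sketch on a made-up frame; the column names `a`, `b`, and `c` are hypothetical and stand in for the competition columns.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be `train`.
demo = pd.DataFrame({
    "a": [0.0, 0.5, 0.0, 0.2],
    "b": [1.0, 0.0, 3.0, 4.0],
    "c": [0.1, 0.2, 0.3, 0.4],
})

# (demo == 0) is a boolean frame; summing counts the True cells per column.
zero_counts = (demo == 0).sum()

# Keep only columns that contain at least one zero, largest count first.
zero_counts = zero_counts[zero_counts > 0].sort_values(ascending=False)
print(zero_counts)
```

Sorting the counts makes it easier to separate columns with hundreds of zeros from columns with only one or two.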

# 3. Handle Missing Values


I considered removing the rows where data is missing, specifically in the columns that contain only a couple of zero values. That might make it easier for a decision tree to split on those features, but every dropped row also discards valid values in all the other columns, which would skew the data the tree sees. I think the better approach would be to impute these values with the column mean. That leaves the features with valid data untouched and, compared to removing rows, should not pull the threshold chosen at a node in either direction. I didn't end up doing this because I didn't need to in order to pass the benchmark.
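
For reference, mean imputation limited to the suspicious columns could look roughly like the sketch below. It was not run as part of this notebook. The frame, the column names `ratio_x` and `ratio_y`, and the choice of which column treats 0 as missing are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; `ratio_x` and `ratio_y` are made-up column names.
demo = pd.DataFrame({
    "ratio_x": [0.0, 2.0, 4.0, 6.0],  # one suspicious zero
    "ratio_y": [0.0, 0.0, 0.0, 5.0],  # many zeros, assumed to be genuine
})

# Only the columns judged to use 0 as a missing-value marker (an assumption).
cols_with_missing = ["ratio_x"]

imputed = demo.copy()
# Treat zeros in the selected columns as missing...
imputed[cols_with_missing] = imputed[cols_with_missing].replace(0, np.nan)
# ...then fill each gap with the mean of that column's remaining values.
imputed[cols_with_missing] = imputed[cols_with_missing].fillna(
    imputed[cols_with_missing].mean()
)
print(imputed)
```

If this were applied for real, the means would presumably be computed on the training split only and then reused for the validation and test data.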

# 4. Check for outliers

In [1]:
import matplotlib.pyplot as plt



plt.boxplot(train.iloc[:,2:12])
plt.xticks(list(range(1,11)), list(train.columns.values[2:12]), rotation=-30, ha='left')
plt.show()

#--

plt.boxplot(train.iloc[:,16:22])
plt.xticks(list(range(1,7)), list(train.columns.values[16:22]), rotation=-30, ha='left')
plt.show()

#--

plt.boxplot(train.iloc[:,23:30])
plt.xticks(list(range(1,8)), list(train.columns.values[23:30]), rotation=-30, ha='left')
plt.show()

#--

plt.boxplot(train.iloc[:,31:35])
plt.xticks(list(range(1,5)), list(train.columns.values[31:35]), rotation=-30, ha='left')
plt.show()

# 5. Handle Outliers

The boxplots flag hundreds of points as outliers. Decision trees are resilient to outliers because a split depends only on which side of a threshold each sample falls, not on how far from the threshold it lies, so the scale of the data does not matter when calculating information gain. Clamping these values to a normal range would likely have no effect, as the samples would end up on the same side of the chosen thresholds as before. I also don't want to drop these rows entirely, as that could skew the thresholds chosen at each node. So I think the best decision is to leave the outliers alone.
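
To put a number on what the boxplots show, the points drawn as outliers can be counted per column. With matplotlib's default settings these are the values more than 1.5 × IQR beyond the quartiles. The sketch below uses a hypothetical toy frame (`f1`, `f2`) instead of the real columns.

```python
import pandas as pd

# Hypothetical toy frame standing in for a slice of `train`.
demo = pd.DataFrame({
    "f1": [1.0, 2.0, 3.0, 4.0, 5.0, 100.0],
    "f2": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0],
})

q1 = demo.quantile(0.25)
q3 = demo.quantile(0.75)
iqr = q3 - q1

# Default boxplot whiskers reach at most 1.5 * IQR beyond the quartiles;
# anything further out is drawn as an individual outlier point.
is_outlier = (demo < q1 - 1.5 * iqr) | (demo > q3 + 1.5 * iqr)
outlier_counts = is_outlier.sum()
print(outlier_counts)
```

Running the same computation on `train.iloc[:, 2:]` should give per-feature counts that match the points plotted above.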

# 6. Normalization and Standardization

In this assignment, we are limited to decision tree and random forest classifiers. Normalization will not affect the ability of a decision tree or random forest to classify the data. Normalization is just a linear rescaling of each feature, and decision trees can accommodate data at any scale because the value each node splits at is chosen based on information gain, which depends only on how the samples are ordered along the feature. After normalizing, a decision tree would simply choose a correspondingly rescaled value to split each feature at, which would have no effect on accuracy.

Standardization is of no use for the same reason.

# 7. Build and train a decision tree on training data

In [1]:
from sklearn.model_selection import train_test_split

seed = 1

# Split data into training and validation sets.
X = train.iloc[ : , 2: ]
y = train["Bankrupt"]

X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=seed)

In [1]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(max_depth=96, criterion='gini', random_state=seed)
dt.fit(X_train, y_train)

# 8. Decision tree accuracy

In [1]:
dt_pred_prob = dt.predict_proba(X_valid)[:,1]
dt_pred = dt.predict(X_valid)

## ROC AUC Score

In [1]:
from sklearn.metrics import roc_auc_score

dt_roc = roc_auc_score(y_valid, dt_pred_prob)
print(dt_roc)

## F1 Score

In [1]:
from sklearn.metrics import f1_score

dt_f1 = f1_score(y_valid, dt_pred)
print(dt_f1)

## Accuracy

In [1]:
from sklearn.metrics import accuracy_score

dt_acc = accuracy_score(y_valid, dt_pred)
print(dt_acc)

# 9. Build and train random forest model on training data

In [1]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=293, min_samples_split=2, min_samples_leaf=11, max_features='log2', max_depth=11, criterion='gini', bootstrap=False, random_state=seed)
rf.fit(X_train, y_train)

## Tune hyperparameters

In [1]:
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': randint(10,1000),
    'max_features': ['log2', 'sqrt'],
    'min_samples_leaf': randint(1,20),
    'min_samples_split': randint(2,20),
    'max_depth': list(range(4,50)) + [None],
    'criterion': ['gini'],
    'bootstrap': [True, False]
}

#rf_cv = RandomizedSearchCV(rf, param_dist, cv=5, scoring='roc_auc', n_iter=100, verbose=10, n_jobs=-1)

#rf_cv.fit(X, y)

#print("Tuned Random Forest Parameters: {}".format(rf_cv.best_params_))
#print("Best score is {}".format(rf_cv.best_score_))

I'm leaving the randomized search commented out because otherwise it takes a very long time every time I hit Run All. The best parameters I got were:
* bootstrap: False
* criterion: gini
* max_depth: 11
* max_features: log2
* min_samples_leaf: 11
* min_samples_split: 2
* n_estimators: 293

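With a best cross-validated score of 94.1956%. Because the search uses `scoring='roc_auc'`, this is a ROC AUC score rather than an accuracy.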
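
For a version of the search that is cheap enough to leave enabled, the sketch below runs the same kind of `RandomizedSearchCV` on synthetic data with a deliberately tiny budget. The data, the reduced search space `small_param_dist`, and the values `n_iter=3` and `cv=3` are assumptions chosen only so that it finishes quickly; they are not the settings used for the results above.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data; the notebook would pass its own X and y.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=1)

# Deliberately small search space and budget so the sketch runs in seconds.
small_param_dist = {
    "n_estimators": randint(10, 30),
    "max_depth": [3, 5, None],
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=1),
    small_param_dist,
    n_iter=3,            # number of sampled parameter combinations
    cv=3,                # folds per combination
    scoring="roc_auc",   # best_score_ is reported in this metric
    random_state=1,      # makes the sampled combinations repeatable
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated ROC AUC of the best candidate
```

Passing `random_state` to the search itself makes the sampled combinations reproducible between runs, which the commented-out cell above does not do.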

# 10. Random forest accuracy

In [1]:
rf_pred_prob = rf.predict_proba(X_valid)[:,1]
rf_pred = rf.predict(X_valid)

## ROC AUC Score

In [1]:
rf_roc = roc_auc_score(y_valid, rf_pred_prob)
print(rf_roc)

## F1 Score

In [1]:
rf_f1 = f1_score(y_valid, rf_pred)
print(rf_f1)

## Accuracy

In [1]:
rf_acc = accuracy_score(y_valid, rf_pred)
print(rf_acc)

# 11. Select the best model

I'm choosing the random forest model. Even before I started tuning hyperparameters, it was giving me much better results than the decision tree. I also think a random forest is almost always the better choice of the two: it is an ensemble of decision trees, so in theory it can do anything a single decision tree can. Its use is also justified by the large number of features in the dataset and the need for a model that can separate the useful features from the insignificant ones.
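
One way to see how a forest separates useful features from insignificant ones is its `feature_importances_` attribute. The sketch below uses synthetic data in which, by construction, only the first three columns carry signal. It illustrates the idea and is not an analysis of the competition features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: with shuffle=False the informative features come first,
# so columns 0-2 carry signal and columns 3-9 are pure noise.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, n_redundant=0,
    shuffle=False, random_state=1,
)

forest = RandomForestClassifier(n_estimators=100, random_state=1)
forest.fit(X_demo, y_demo)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_
ranking = np.argsort(importances)[::-1]  # most important feature first

for idx in ranking:
    print(f"feature {idx}: {importances[idx]:.3f}")
```

On the real data, the same attribute on `rf` could be paired with `X.columns` to list the features the model relies on most.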

In [1]:
X_test = test.iloc[ : , 1: ]
y_pred = rf.predict_proba(X_test)[:,1]

output = pd.DataFrame({"id": test.id, "Bankrupt": y_pred})
output.to_csv('my_submission.csv', index=False)
print("Your submission was successfully saved!")