Random forest playground, using the lab in An Introduction to Statistical Learning as a reference

In [None]:
import numpy as np
import pandas as pd
from matplotlib.pyplot import subplots
from sklearn.tree import (DecisionTreeClassifier as DTC, 
                          DecisionTreeRegressor as DTR,
                          plot_tree,
                          export_text)
from sklearn.metrics import (accuracy_score,
                             log_loss)
from sklearn.ensemble import (RandomForestRegressor as RF,
                              GradientBoostingRegressor as GBR)
import sklearn.model_selection as skm
from ISLP.models import ModelSpec as MS
from ISLP import load_data, confusion_table


First we load our smaller dataset (1std_dataset).
I have two ideas in mind for what to use as the response:
First: use "distressed" as a binary response; however, there is only a very small number of distressed observations, so the classes are heavily imbalanced.
Second: use 'z_score', which is continuous, by setting a threshold that splits the observations into "good" and "bad"; here we can move the threshold around and see how the model performs.
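
The effect of moving the threshold can be previewed before fitting anything, by checking how the share of each label changes with the cut-off. This is a minimal sketch on synthetic scores; the normal distribution, the helper `label_share` and the candidate cut-offs are assumptions for illustration, not values from the dataset.

```python
import numpy as np

# Synthetic stand-in for df['z_score']; the real values come from the dataset.
rng = np.random.default_rng(0)
z_score = rng.normal(loc=3.0, scale=1.5, size=667)

def label_share(scores, threshold):
    """Return the fraction of observations labelled 1 (score >= threshold)."""
    labels = np.where(scores >= threshold, 1, 0)
    return labels.mean()

# Candidate cut-offs (illustrative values only).
thresholds = [1.0, 1.8, 3.0]
shares = {t: label_share(z_score, t) for t in thresholds}
for t, share in shares.items():
    print(f"threshold={t}: share labelled 1 = {share:.3f}")
```

A threshold that leaves one label with very few observations would bring back the imbalance problem of the first idea.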

In [None]:
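df = pd.read_csv("../datasets/1std_dataset.csv")
distressed = np.asarray(df['distressed'].values)
# Use every numeric column as a predictor, but exclude the response itself
# (in case 'distressed' is stored as a number) so that it cannot leak into X.
predictors = df.select_dtypes(include=np.number).columns.drop('distressed', errors='ignore')
model = MS(predictors, intercept=False)
D = model.fit_transform(df)
feature_names = list(D.columns)
X = np.asarray(D)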

In [None]:
clf = DTC(criterion='gini', 
          max_depth=30,
          random_state=0)
clf.fit(X, distressed)

In [None]:
accuracy_score(distressed, clf.predict(X))

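With distressed the accuracy is 1, which is to be expected and not very informative: the accuracy is measured on the same data the tree was fitted on, and a tree with a depth of up to 30 can memorize it. On top of that, basically every observation is not distressed (out of 667 observations, 8 are distressed), so even a model that always predicts "not distressed" would be right almost every time.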
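
To put that accuracy in context, it helps to compare it with a majority-class baseline. The sketch below rebuilds labels with the class counts quoted above (8 distressed out of 667) and scores scikit-learn's `DummyClassifier`; the all-zero feature matrix is only a placeholder, since this baseline ignores the features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Labels with the class counts quoted above: 8 distressed out of 667.
y = np.zeros(667, dtype=int)
y[:8] = 1
# Placeholder features: a majority-class baseline never looks at them.
X_dummy = np.zeros((667, 1))

baseline = DummyClassifier(strategy='most_frequent').fit(X_dummy, y)
pred = baseline.predict(X_dummy)

baseline_acc = accuracy_score(y, pred)                # 659 correct out of 667
baseline_bal_acc = balanced_accuracy_score(y, pred)   # mean of per-class recall
print(f"accuracy={baseline_acc:.3f}, balanced accuracy={baseline_bal_acc:.3f}")
```

Plain accuracy is close to 1 even though no distressed observation is found, which is why a metric such as balanced accuracy is a more honest yardstick for this response.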

Let's use z_score as the response now.

In [None]:
zscore_save = np.where(df.z_score >= 1.8, 1, 0) 
model = MS(df.select_dtypes(include=np.number).columns.drop('z_score'), intercept=False)
D = model.fit_transform(df)
feature_names = list(D.columns)
X = np.asarray(D)

In [None]:
clf = DTC(criterion='gini', 
          max_depth=3)
clf.fit(X, zscore_save)

In [None]:
accuracy_score(zscore_save, clf.predict(X))

For the first iteration we get a highly accurate model with a 7% error rate. Because this is measured on the training data, it might be an indication of overfitting, so we should be careful with the interpretation.

In [None]:
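# log_loss already returns a single number (the mean negative log-likelihood
# per observation), so no extra summation is needed.
resid_dev = log_loss(zscore_save, clf.predict_proba(X))
resid_dev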

In [None]:
ax = subplots(figsize=(25,15))[1]
plot_tree(clf,
          feature_names=feature_names,
          ax=ax);

In [None]:
print(export_text(clf,
                  feature_names=feature_names,
                  show_weights=True))

Cost of revenue seems to be the most important feature in this model
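
An impression like this can be checked numerically with the `feature_importances_` attribute of a fitted tree, which sums to 1 across features. This is a minimal sketch on synthetic data; the feature names `feat_a`, `feat_b`, `feat_c` are hypothetical, and the label is constructed to depend only on `feat_a`.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: three features, label determined by the first one only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
names = ['feat_a', 'feat_b', 'feat_c']  # hypothetical feature names
y_demo = (X_demo[:, 0] > 0).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# Rank features by their impurity-based importance.
importances = pd.Series(tree.feature_importances_,
                        index=names).sort_values(ascending=False)
print(importances)
```

For the tree fitted in this notebook, the same ranking would come from `pd.Series(clf.feature_importances_, index=feature_names)`.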

Next, we will split the data into several random training and validation sets and see how the model performs on those.

In [None]:
validation = skm.ShuffleSplit(n_splits=5,
                              test_size=0.2,
                              random_state=0)
results = skm.cross_validate(clf,
                             D,
                             zscore_save,
                             cv=validation)
results['test_score']

The model still performs well on the validation sets.
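
A compact way to judge this is to compare the mean training and validation scores and look at the spread across splits. This is a minimal sketch on synthetic data, so the numbers it prints say nothing about the real dataset; it uses the same `ShuffleSplit` settings as above and asks `cross_validate` for training scores via `return_train_score=True`.

```python
import numpy as np
import sklearn.model_selection as skm
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix and the thresholded response.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

cv = skm.ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
res = skm.cross_validate(DecisionTreeClassifier(max_depth=3, random_state=0),
                         X_demo, y_demo,
                         cv=cv,
                         return_train_score=True)

train_mean = res['train_score'].mean()
test_mean = res['test_score'].mean()
test_std = res['test_score'].std()
print(f"train={train_mean:.3f}, validation={test_mean:.3f} +/- {test_std:.3f}")
```

A large gap between the training and validation means would point to overfitting; a small gap with a small spread supports the conclusion above.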

Now let's actually fit new models using separate training and test sets.

In [None]:
(X_train,
 X_test,
 zscore_train,
 zscore_test) = skm.train_test_split(X,
                                     zscore_save,
                                     test_size=0.2,
                                     random_state=0)

In [None]:
clf = DTC(criterion='entropy', random_state=0)
clf.fit(X_train, zscore_train)
accuracy_score(zscore_test, clf.predict(X_test))

In [None]:
ccp_path = clf.cost_complexity_pruning_path(X_train, zscore_train)
kfold = skm.KFold(10,
                  random_state=1,
                  shuffle=True)

In [None]:
grid = skm.GridSearchCV(clf,
                        {'ccp_alpha': ccp_path.ccp_alphas},
                        refit=True,
                        cv=kfold,
                        scoring='accuracy')
grid.fit(X_train, zscore_train)
grid.best_score_

In [None]:
# plot the best tree found by the grid search
ax = subplots(figsize=(25, 15))[1]
best_ = grid.best_estimator_
plot_tree(best_,
          feature_names=feature_names,
          ax=ax)

In [None]:
best_.tree_.n_leaves

In [None]:
print(accuracy_score(zscore_test,
                     best_.predict(X_test)))
confusion = confusion_table(best_.predict(X_test),
                            zscore_test)
confusion

This is already a really good result.