# Complex Machine Learning Models

Advanced machine learning models are not inherently interpretable.
Even a human with use-case-specific knowledge cannot grasp the inner workings of a neural network, a random forest, or a similar model at a glance.
This notebook builds such "complex machine learning models" from the 
[Breast cancer wisconsin (diagnostic) dataset](https://scikit-learn.org/stable/datasets/index.html#breast-cancer-wisconsin-diagnostic-dataset) provided with scikit-learn.

Because this is just an example, this notebook contains only three classifiers:

* Logistic Regression
* Random Forest Classifier
* Multi-Layer Perceptron or Artificial Neural Network

These are trained on the aforementioned dataset, which contains 30 features and 569 samples in total.
The trained classifiers are stored on the local disk for downstream analysis.
The two tables below give general information on the dataset and the minimum and maximum value of each feature.

| Property | Value |
|-|-|
| Classes | 2 |
| Samples per class | 212 malignant (M), 357 benign (B) |
| Samples total | 569 |
| Dimensionality | 30 |
| Features | real, positive |

| Feature | Min | Max |
|-|-|-|
| radius (mean) | 6.981 | 28.11 |
| texture (mean) | 9.71 | 39.28 |
| perimeter (mean) | 43.79 | 188.5 |
| area (mean) | 143.5 | 2501.0 |
| smoothness (mean) | 0.053 | 0.163 |
| compactness (mean) | 0.019 | 0.345 |
| concavity (mean) | 0.0 | 0.427 |
| concave points (mean) | 0.0 | 0.201 |
| symmetry (mean) | 0.106 | 0.304 |
| fractal dimension (mean) | 0.05 | 0.097 |
| radius (standard error) | 0.112 | 2.873 |
| texture (standard error) | 0.36 | 4.885 |
| perimeter (standard error) | 0.757 | 21.98 |
| area (standard error) | 6.802 | 542.2 |
| smoothness (standard error) | 0.002 | 0.031 |
| compactness (standard error) | 0.002 | 0.135 |
| concavity (standard error) | 0.0 | 0.396 |
| concave points (standard error) | 0.0 | 0.053 |
| symmetry (standard error) | 0.008 | 0.079 |
| fractal dimension (standard error) | 0.001 | 0.03 |
| radius (worst) | 7.93 | 36.04 |
| texture (worst) | 12.02 | 49.54 |
| perimeter (worst) | 50.41 | 251.2 |
| area (worst) | 185.2 | 4254.0 |
| smoothness (worst) | 0.071 | 0.223 |
| compactness (worst) | 0.027 | 1.058 |
| concavity (worst) | 0.0 | 1.252 |
| concave points (worst) | 0.0 | 0.291 |
| symmetry (worst) | 0.156 | 0.664 |
| fractal dimension (worst) | 0.055 | 0.208 |
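
The figures in the tables above can be reproduced directly from the dataset. The following is a minimal sketch that assumes the data is loaded with scikit-learn's `load_breast_cancer` rather than this notebook's own helper; the two value columns of the feature table correspond to the per-feature minimum and maximum reported in scikit-learn's dataset description.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

dataset = load_breast_cancer()
X, y = dataset.data, dataset.target

# General information: sample count, dimensionality, and class balance.
n_samples, n_features = X.shape
class_counts = {
    str(name): int(count)
    for name, count in zip(dataset.target_names, np.bincount(y))
}
print(f"Samples total:     {n_samples}")
print(f"Dimensionality:    {n_features}")
print(f"Samples per class: {class_counts}")

# Per-feature value range (minimum and maximum over all samples).
feature_ranges = {
    str(name): (float(X[:, i].min()), float(X[:, i].max()))
    for i, name in enumerate(dataset.feature_names)
}
for name, (low, high) in feature_ranges.items():
    print(f"{name:<25} {low:>10.3f} {high:>10.3f}")
```

Note that scikit-learn names the features `mean radius`, `radius error`, `worst radius`, and so on, rather than using the parenthesized form shown in the table.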

In [None]:
%load_ext autoreload
%autoreload 2

In [None]:
import os
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    precision_score, 
    recall_score, 
    accuracy_score, 
    roc_auc_score
)
from sklearn.neural_network import MLPClassifier
from tic import load_test_data

## Preparing the data

Here, we load the data and create the datasets for training and testing.

In [None]:
data = load_test_data()
X_train, X_test, y_train, y_test = data['dataset'].values()
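
The implementation of `load_test_data` lives in the project's `tic` module and is not shown in this notebook. As a rough stand-in, a comparable split can be sketched with scikit-learn alone. The 75/25 ratio, the stratification, and the fixed seed below are assumptions for illustration, not the helper's actual settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Assumption: stratified 75/25 split with a fixed seed; the real helper may differ.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
print(X_train.shape, X_test.shape)
```

Stratifying on `y` keeps the malignant/benign ratio roughly the same in both parts, which matters for a dataset with imbalanced classes like this one.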

## Building Classifiers

In this part, we train three classifiers:

* Logistic Regression
* Random Forest
* Multi-Layer Perceptron

to predict whether a breast cancer tumor is benign or malignant.

### Logistic Regression

In [None]:
clf_lr = LogisticRegression(solver='lbfgs', max_iter=10000)
clf_lr.fit(X_train, y_train)
y_pred_lr = clf_lr.predict(X_test)

### Random Forest Classifier

In [None]:
clf_rf = RandomForestClassifier(n_estimators=100)
clf_rf.fit(X_train, y_train)
y_pred_rf = clf_rf.predict(X_test)

### Multi-Layer Perceptron Classifier

In [None]:
clf_mlp = MLPClassifier()
clf_mlp.fit(X_train, y_train)
y_pred_mlp = clf_mlp.predict(X_test)
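
The features in this dataset span very different scales (areas in the thousands next to smoothness values below 1), and multi-layer perceptrons are generally sensitive to that. A possible variant, sketched here under the assumption that the inputs are not already standardized by `load_test_data`, wraps the network in a pipeline with a `StandardScaler`. The split, `max_iter`, and `random_state` values are illustrative choices, not settings taken from this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# The scaler is fitted on the training data only and reused at prediction time.
clf_mlp_scaled = make_pipeline(
    StandardScaler(),
    MLPClassifier(max_iter=1000, random_state=0),
)
clf_mlp_scaled.fit(X_train, y_train)

scaled_accuracy = clf_mlp_scaled.score(X_test, y_test)
print(f"Accuracy with standardized inputs: {scaled_accuracy:.3f}")
```

Because the pipeline is a single estimator, it can be pickled and loaded downstream exactly like the bare `MLPClassifier`.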

## Evaluation

In [None]:
# AUROC is computed from the predicted probability of the positive class,
# not from hard class labels, so that it reflects the full ROC curve.
print(f'''
Logistic Regression:
----------------------------------
Accuracy:   {accuracy_score(y_test, y_pred_lr)}
Precision:  {precision_score(y_test, y_pred_lr)}
Recall:     {recall_score(y_test, y_pred_lr)}
AUROC:      {roc_auc_score(y_test, clf_lr.predict_proba(X_test)[:, 1])}

Random Forest Classifier:
----------------------------------
Accuracy:   {accuracy_score(y_test, y_pred_rf)}
Precision:  {precision_score(y_test, y_pred_rf)}
Recall:     {recall_score(y_test, y_pred_rf)}
AUROC:      {roc_auc_score(y_test, clf_rf.predict_proba(X_test)[:, 1])}

Multi-Layer Perceptron Classifier:
----------------------------------
Accuracy:   {accuracy_score(y_test, y_pred_mlp)}
Precision:  {precision_score(y_test, y_pred_mlp)}
Recall:     {recall_score(y_test, y_pred_mlp)}
AUROC:      {roc_auc_score(y_test, clf_mlp.predict_proba(X_test)[:, 1])}
''')
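
The report above repeats the same four metric calls for every classifier. One way to reduce that repetition is a small helper, sketched below. The function name `evaluate` is hypothetical, and the data split is an assumed stand-in for `load_test_data`. The helper computes AUROC from the predicted probability of the positive class, which assumes the classifier exposes `predict_proba` (true for all three classifiers used here).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split


def evaluate(clf, X_test, y_test):
    """Return accuracy, precision, recall, and AUROC for a fitted binary classifier."""
    y_pred = clf.predict(X_test)
    # Probability of the class labeled 1, used as the ranking score for AUROC.
    y_score = clf.predict_proba(X_test)[:, 1]
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "auroc": roc_auc_score(y_test, y_score),
    }


# Assumed stand-in for the notebook's data loading.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)
clf = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X_train, y_train)

metrics = evaluate(clf, X_test, y_test)
for name, value in metrics.items():
    print(f"{name:<10} {value:.3f}")
```

Precision and recall use scikit-learn's default positive label `1`. In the dataset as shipped with scikit-learn that label denotes the benign class, so it is worth checking which class the scores refer to before interpreting them.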

## Persisting the classifiers

For downstream interpretability analysis, the classifiers are persisted.

In [None]:
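directory = '.classifiers'
# Create the target directory if it does not exist yet.
os.makedirs(directory, exist_ok=True)

# Use context managers so that every file is flushed and closed after writing.
with open(f'{directory}/logistic_regression.clf', 'wb') as f:
    pickle.dump(clf_lr, f)
with open(f'{directory}/random_forest.clf', 'wb') as f:
    pickle.dump(clf_rf, f)
with open(f'{directory}/multi_layer_perceptron.clf', 'wb') as f:
    pickle.dump(clf_mlp, f)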
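
For the downstream analysis, the classifiers have to be read back from disk. The round trip can be sketched as follows; to stay self-contained, the example trains a small stand-in model and writes it to a temporary directory instead of `.classifiers`, both of which are assumptions made for illustration.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
clf = LogisticRegression(solver="lbfgs", max_iter=10000).fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "logistic_regression.clf"

    # Persist the fitted classifier.
    with open(path, "wb") as f:
        pickle.dump(clf, f)

    # Load it again, as a downstream notebook would.
    with open(path, "rb") as f:
        clf_loaded = pickle.load(f)

# The restored classifier should reproduce the original predictions.
same_predictions = bool((clf.predict(X) == clf_loaded.predict(X)).all())
print(same_predictions)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code. Pickled scikit-learn models are also not guaranteed to load under a different scikit-learn version than the one that wrote them.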