## Handling imports

In [None]:
import sys
import os
sys.path.append(os.path.abspath('../'))
from main import train_and_evaluate_model, evaluate_model, detect_outliers
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings(action='ignore')

## Outlier detection
To run outlier detection, call the `detect_outliers` function with the path to the dataset and the path to the folder where the results will be saved.

The result is a CSV file containing the samples that the algorithm flagged as possible outliers, along with a plot that visualizes these samples using an MDS projection.

In [None]:
%matplotlib notebook
data_path = 'example_data.csv'
target_folder = 'example_results/outliers'
outliers = detect_outliers(data_path=data_path,
                           target_folder=target_folder)

#### Printing and visualizing the outliers
The outlier samples and the visualization plot are saved in the previously defined target folder as `possible_outliers.csv` and `outliers_visualized.png`, respectively.

***Note:** Looking at the visualization alone, it may not be obvious why some samples were or were not classified as outliers. The outlier detection algorithm runs on the high-dimensional input data, whereas the visualization projects this data onto a 2D plane, which inevitably loses some information.*
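
To make the point about lost information concrete, here is a toy sketch. It does not use MDS or the library's detector; it simply drops all but the first two coordinates, which is a much cruder projection, and the points and the helper `project_to_2d` are made up for illustration. MDS tries to preserve distances, but in general no 2D layout can preserve all of them, so the same effect can appear in a milder form.

```python
from math import dist


def project_to_2d(point):
    """Keep only the first two coordinates (a crude stand-in for MDS)."""
    return point[:2]


# Two hypothetical samples that differ mostly in the 3rd and 4th features.
typical = (1.0, 2.0, 0.0, 0.0)
unusual = (1.0, 2.1, 9.0, -7.0)

full_distance = dist(typical, unusual)
flat_distance = dist(project_to_2d(typical), project_to_2d(unusual))

# The samples are far apart in 4D but almost on top of each other in 2D.
print(f"distance in 4D: {full_distance:.2f}, distance in 2D: {flat_distance:.2f}")
```

A detector working on all four features can reasonably flag `unusual`, even though the flattened picture shows it right next to `typical`.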

In [None]:
outliers

In [None]:
plot = plt.imread(os.path.join(target_folder, 'outliers_visualized.png'))
plt.axis('off')
plt.imshow(plot)

## Training and evaluating
Training a new model can be done by calling the `train_and_evaluate_model` function, specifying the path to the dataset, the configuration file, and the target folder. Setting `calculate_feature_importances=True` also calculates feature importances.

In [None]:
data_path = 'example_data.csv'
target_folder = 'example_results/train'
config_path = 'config.yaml'
res = train_and_evaluate_model(data_path=data_path,
                         config_path=config_path,
                         target_folder=target_folder,
                         calculate_feature_importances=True)

#### Printing the results and the plots
Calling the function returns the cross-validated metrics with a 95% confidence interval. The same results are also saved in the target folder as `cross_val_result.csv`.

In [None]:
res
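
How the function derives its interval is not shown in this notebook. As a rough sketch of one common approach, the snippet below computes a mean and a 95% confidence interval from per-fold scores using a normal approximation. The fold scores and the helper `mean_with_ci` are hypothetical, and the library may well use a different method.

```python
from statistics import NormalDist, mean, stdev


def mean_with_ci(fold_scores, confidence=0.95):
    """Return (mean, lower, upper) using a normal approximation."""
    if len(fold_scores) < 2:
        raise ValueError("need at least two fold scores")
    center = mean(fold_scores)
    # Standard error of the mean across folds.
    sem = stdev(fold_scores) / len(fold_scores) ** 0.5
    # Two-sided critical value (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return center, center - z * sem, center + z * sem


# Hypothetical AUC values from a 5-fold cross validation.
fold_auc = [0.81, 0.79, 0.84, 0.80, 0.83]
avg, lower, upper = mean_with_ci(fold_auc)
print(f"AUC {avg:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

With only a handful of folds, a Student's t critical value would give a somewhat wider interval than the normal approximation used here.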

The ROC and PR curves are saved in the target folder as `test_result_roc_curve.png` and `test_result_pr_curve.png`, respectively.

In [None]:
%matplotlib inline 
roc = plt.imread(os.path.join(target_folder, 'test_result_roc_curve.png'))
plt.axis('off')
plt.imshow(roc)

In [None]:
pr = plt.imread(os.path.join(target_folder, 'test_result_pr_curve.png'))
plt.axis('off')
plt.imshow(pr)
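
For readers less familiar with ROC curves, the sketch below shows how such a curve can be built from true labels and predicted scores: the threshold is lowered step by step, and each step yields one (false positive rate, true positive rate) point. This is plain Python with made-up data, not the library's plotting code, and the helper names `roc_points` and `auc` are hypothetical.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) pairs, sweeping the threshold from high to low."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative label")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples sharing this score are counted.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels (1 = positive class) and predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(labels, scores)
print(f"AUC = {auc(curve):.3f}")
```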

#### Feature importances
The importance values (i.e. the absolute mean SHAP values) for each feature are saved in the target folder as `feature_importances.csv`.

In [None]:
pd.read_csv(os.path.join(target_folder, 'feature_importances.csv'), index_col=0)
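
As a small illustration of how per-feature importances can be derived from SHAP values, the sketch below averages the absolute SHAP value of each feature over all samples. It assumes the common convention of taking absolute values first and then the mean. The SHAP matrix, the feature names, and the helper `mean_abs_importance` are all made up, and the library's own computation may differ.

```python
def mean_abs_importance(shap_values, feature_names):
    """Average |SHAP| per feature; rows are samples, columns are features."""
    n_samples = len(shap_values)
    importances = {}
    for j, name in enumerate(feature_names):
        importances[name] = sum(abs(row[j]) for row in shap_values) / n_samples
    # Sort from most to least important.
    return dict(sorted(importances.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical SHAP values for three samples and three features.
shap_values = [
    [0.20, -0.05, 0.00],
    [-0.30, 0.10, 0.02],
    [0.10, -0.15, -0.01],
]
feature_names = ["feature_a", "feature_b", "feature_c"]  # placeholder names

ranking = mean_abs_importance(shap_values, feature_names)
for name, value in ranking.items():
    print(f"{name}: {value:.3f}")
```

Taking the absolute value before averaging matters: the signed values of `feature_a` sum to zero here, so a plain mean would wrongly suggest it has no influence.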

In addition, a plot is saved in the target folder as `feature_importances_plot.png` to better visualize these values.

In [None]:
shap_plot = plt.imread(os.path.join(target_folder, 'feature_importances_plot.png'))
plt.axis('off')
plt.imshow(shap_plot)

## Evaluating a trained model on an external dataset
Evaluating a trained model can be done by calling the `evaluate_model` function, specifying the path to the previously saved model, the dataset, configuration file, and the target folder, as well as the name of the target feature to predict. Feature importances can also be calculated with this function.

***Note:** For simplicity, this demonstration runs the evaluation on the same data the model was trained on. In a real setting, this feature should only be used on a different dataset.*

In [None]:
%matplotlib notebook
data_path = 'example_data.csv'
model_path = 'example_results/train/model.pickle'
target_folder = 'example_results/eval'
target_column = 'Death'
res_eval = evaluate_model(model_path=model_path,
                          data_path=data_path,
                          target_column=target_column,
                          target_folder=target_folder,
                          calculate_feature_importances=False)

#### Printing the results and the plots
Calling the function returns the resulting metrics. They are also saved in the target folder as `test_result.csv`.

In [None]:
res_eval

In [None]:
%matplotlib inline 
roc = plt.imread(os.path.join(target_folder, 'test_result_roc_curve.png'))
plt.axis('off')
plt.imshow(roc)

In [None]:
pr = plt.imread(os.path.join(target_folder, 'test_result_pr_curve.png'))
plt.axis('off')
plt.imshow(pr)