# Introduction
These two vignettes walk through machine learning (ML) development in both R and Python.
These tutorials aim to outline the basic steps in training and assessing ML models.
They are not meant to cover every detail of best practices in machine learning, nor do they necessarily show how to develop the best-performing model.
Instead, the goals are to provide a clear example of how we go from data to predictions in the ML framework and to illustrate general machine learning principles.

# A machine learning walkthrough in Python
## Loading packages and data

In [None]:
import pandas as pd
import numpy as np
import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import average_precision_score
import sklearn.metrics

## Data processing
We proceed by loading our data into Python.
The variable `path_to_data` is the location of the csv file with the data, and the function `pd.read_csv()` will import the data into a data frame.
A data frame is a rectangular representation of data where each row represents an observation and each column represents a variable.

Here, each row represents a patient, and the columns contain the data for that patient.
The data frame contains an identifying `SID` for each patient, the patient's `SEX`, the concentrations of different amino acids, and `N`/`Y` indicators for alloisoleucine (`Allo`), homocysteine (`Hcys`), and argininosuccinic acid lyase deficiency (`ASA`).
The `Class` column contains the "labels" (normal or abnormal) for each plasma amino acid (PAA) profile.
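To make the row-and-column layout concrete, here is a minimal sketch of a data frame with the same kinds of columns. The values are made up for illustration, and the `Ala` concentration column is a hypothetical stand-in for the amino acid columns in the real file.

```python
import pandas as pd

# Hypothetical toy records; these values are invented and are not taken
# from the real data set. "Ala" is a placeholder concentration column.
toy_df = pd.DataFrame(
    {
        "SID": ["S001", "S002", "S003"],
        "SEX": ["F", "M", "U"],
        "Ala": [310.0, 455.5, 289.2],
        "Allo": ["N", "N", "Y"],
        "Hcys": ["N", "Y", "N"],
        "ASA": ["N", "N", "N"],
        "Class": [
            "No.significant.abnormality.detected.",
            "X.Abnormal",
            "X.Abnormal",
        ],
    }
)

n_rows, n_cols = toy_df.shape  # rows are patients, columns are variables
print(n_rows, "patients and", n_cols, "variables")
print(toy_df.dtypes)           # text columns are stored with dtype "object"
```

Printing `dtypes` shows which columns hold text rather than numbers, which is what the encoding steps in the next section address.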

In [None]:
path_to_data = "/Users/ed/git/ml_review_paper/data/clc317479-file001.csv"
df = pd.read_csv(path_to_data)
df.head()   # print a preview of the data frame

To prepare our data frame for machine learning, we need to convert it into a form that the algorithms will accept.

1. We need to get rid of any features that we will not input into the algorithm.
In this case, we need to remove the patient identifiers (the `SID` column).
We do this by using the `drop()` function to remove the `SID` column.

2. We need to convert any categorical variables into numerical codes.
For example, the `SEX` column has values of `F` for female, `M` for male, and `U` for unidentified.
We use the `.loc[]` indexer to find the rows with a specific value (e.g. `F`) and then replace it with the appropriate numerical code (e.g. `0`).
We do this for all columns with categorical data.

3. We separate the features (every column except `Class`, e.g. the concentrations) from the labels (normal or abnormal) and save each as a NumPy array.
The features are saved into a matrix `X`, and the labels are saved into a single-column matrix `Y`.

In [None]:
df = df.drop(['SID'], axis = 1)

# convert SEX column to numerical codes
df.loc[(df['SEX'] == 'F'), 'SEX'] = 0
df.loc[(df['SEX'] == 'M'), 'SEX'] = 1
df.loc[(df['SEX'] == 'U'), 'SEX'] = 2

# convert ASA column to numerical codes
df.loc[(df['ASA'] == 'N'), 'ASA'] = 0
df.loc[(df['ASA'] == 'Y'), 'ASA'] = 1

# convert Allo column to numerical codes
df.loc[(df['Allo'] == 'N'), 'Allo'] = 0
df.loc[(df['Allo'] == 'Y'), 'Allo'] = 1

# convert Hcys column to numerical codes
df.loc[(df['Hcys'] == 'N'), 'Hcys'] = 0
df.loc[(df['Hcys'] == 'Y'), 'Hcys'] = 1

# convert labels from text to numerical codes 
df.loc[(df['Class'] == 'No.significant.abnormality.detected.'), 'Class'] = 0
df.loc[(df['Class'] == 'X.Abnormal'), 'Class'] = 1

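# save features as X and labels as Y
# (the recoded text columns still have dtype "object", so request numeric dtypes explicitly)
X = df.iloc[:,:-1].to_numpy(dtype=float)
Y = df.iloc[:,-1:].to_numpy(dtype=int)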
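The three preparation steps above can also be sketched with dictionary lookups through `Series.map`, which keeps each encoding in one place. This is an alternative illustration on made-up data, not the approach used in this walkthrough. The `toy_df` frame, its values, and the `Ala` column are hypothetical.

```python
import pandas as pd

# Hypothetical toy records (invented values; "Ala" is a placeholder column).
toy_df = pd.DataFrame(
    {
        "SID": ["S001", "S002", "S003"],
        "SEX": ["F", "M", "U"],
        "Ala": [310.0, 455.5, 289.2],
        "Allo": ["N", "N", "Y"],
        "Hcys": ["N", "Y", "N"],
        "ASA": ["N", "N", "N"],
        "Class": [
            "No.significant.abnormality.detected.",
            "X.Abnormal",
            "X.Abnormal",
        ],
    }
)

# The same codes as in the walkthrough, written as lookup tables.
sex_codes = {"F": 0, "M": 1, "U": 2}
yes_no_codes = {"N": 0, "Y": 1}
class_codes = {"No.significant.abnormality.detected.": 0, "X.Abnormal": 1}

# Step 1: drop the identifier column.
encoded = toy_df.drop(columns=["SID"])

# Step 2: convert categorical columns to numerical codes.
encoded["SEX"] = encoded["SEX"].map(sex_codes)
for col in ["ASA", "Allo", "Hcys"]:
    encoded[col] = encoded[col].map(yes_no_codes)
encoded["Class"] = encoded["Class"].map(class_codes)

# Values missing from a lookup table become NaN, so check before converting.
n_unmapped = int(encoded.isna().sum().sum())
if n_unmapped:
    raise ValueError(f"{n_unmapped} values had no numerical code")

# Step 3: separate features and labels as NumPy arrays.
X_toy = encoded.drop(columns=["Class"]).to_numpy(dtype=float)
Y_toy = encoded["Class"].to_numpy(dtype=int)
print(X_toy.shape, Y_toy)
```

One difference to be aware of: `Y_toy` here is a one-dimensional array, whereas the walkthrough's `Y` is a single-column matrix that is flattened later.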

## Splitting our data into training, validation, and test sets

In [None]:
SEED = 7
TEST_SIZE = 0.30
VAL_SIZE = 0.20

# First split: hold out the test set (TEST_SIZE of all rows)
x_train, x_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size = TEST_SIZE,
                                                    random_state = SEED)
# Second split: hold out the validation set (VAL_SIZE of the remaining rows)
x_train, x_val, y_train, y_val   = train_test_split(x_train,
                                                    y_train,
                                                    test_size = VAL_SIZE,
                                                    random_state = SEED)
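Because the validation set is carved out of what remains after the test split, the final proportions are roughly 56% training (0.70 × 0.80), 14% validation (0.70 × 0.20), and 30% test. The sketch below checks this on synthetic stand-in data. The arrays `X_demo` and `y_demo` are made up, and the `stratify` argument is an addition that the walkthrough does not use. It keeps the class balance similar across the splits, which may be worth considering when abnormal profiles are rare.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 7
TEST_SIZE = 0.30
VAL_SIZE = 0.20

# Synthetic stand-in data: 1000 rows, 4 features, roughly 10% positive labels.
rng = np.random.default_rng(SEED)
X_demo = rng.normal(size=(1000, 4))
y_demo = (rng.random(1000) < 0.10).astype(int)

# First split: hold out the test set.
x_rest, x_test, y_rest, y_test = train_test_split(
    X_demo, y_demo, test_size=TEST_SIZE, random_state=SEED, stratify=y_demo
)
# Second split: hold out the validation set from the remaining rows.
x_train, x_val, y_train, y_val = train_test_split(
    x_rest, y_rest, test_size=VAL_SIZE, random_state=SEED, stratify=y_rest
)

# Report each split's share of the data and its fraction of positive labels.
for name, x_part, y_part in [
    ("train", x_train, y_train),
    ("validation", x_val, y_val),
    ("test", x_test, y_test),
]:
    share = len(x_part) / len(X_demo)
    print(f"{name}: {len(x_part)} rows ({share:.0%}), positives {y_part.mean():.3f}")
```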

## ML Training Protocol

In [None]:
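model_xgb = xgb.XGBClassifier(
  random_state = SEED,
  n_estimators = 200,             # upper bound on the number of boosting rounds
  verbosity = 0,
  learning_rate = 0.1,
  objective = 'binary:logistic',
  # Since xgboost 1.6 these two are set on the estimator; passing them to
  # fit() is deprecated and is not accepted by recent releases.
  early_stopping_rounds = 10,     # stop if validation loss has not improved for 10 rounds
  eval_metric = ["logloss"]
)

model_xgb.fit(
  X=x_train, 
  y=y_train.flatten(),
  # validation_0 is the training set, validation_1 is the validation set;
  # early stopping monitors the last entry
  eval_set = [(x_train, y_train.flatten()), (x_val, y_val.flatten())]
)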

In [None]:
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties


results = model_xgb.evals_result()
epochs = len(results['validation_0']['logloss'])
x_axis = range(0, epochs)

fig, axs = plt.subplots(figsize=(7, 4.5), facecolor='w', edgecolor='k')

font = FontProperties()
font.set_name('Arial')

plt.plot(x_axis, results['validation_0']['logloss'],
         label='Train', linewidth=1.3);
plt.plot(x_axis, results['validation_1']['logloss'],
         label='Validation', color='C0',linestyle='--', linewidth=1.3);
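# mark the best iteration found by early stopping rather than a hard-coded value
plt.axvline(x=model_xgb.best_iteration, color='r', linestyle='--', linewidth=1);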

plt.legend();

plt.ylabel('Loss');
plt.xlabel('Iteration', fontproperties=font, fontsize=12);

plt.grid(True)

plt.show()

In [None]:
# get predicted class labels for test data
y_preds = model_xgb.predict(x_test).astype(int)

# get prediction probabilities for test data
y_probs = model_xgb.predict_proba(x_test).astype(float)

# flatten the labels to a 1-D integer array so they match y_preds
y_test = y_test.astype(int).ravel()

# evaluate predictions
accuracy = accuracy_score(y_test, y_preds)
print("Binary Classification Accuracy: %.2f%%" % (accuracy * 100.0))

In [None]:
precision, recall, _ = precision_recall_curve(y_test, y_probs[:,1])

auprc = round(sklearn.metrics.auc(recall, precision), 3)

fig, axs = plt.subplots(1,1, figsize=(5, 4), facecolor='w', edgecolor='k')

axs.annotate(
    "PRAUC: {0}".format(auprc), 
    xy=(0.8,0.1),
    xycoords="data",
    va="center", 
    ha="center", 
    fontsize=18,
    bbox=dict(boxstyle="round",fc="w")
);
# recall goes on the x-axis and precision on the y-axis, matching the axis labels
plt.plot(recall, precision, label="Test", linewidth=2, linestyle='--');
axs.set_xlabel('Recall', fontsize=14);
axs.set_ylabel('Precision', fontsize=14);

plt.grid(True)
plt.show()
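The area reported above comes from trapezoidal integration of the precision-recall curve via `sklearn.metrics.auc`. scikit-learn also provides `average_precision_score`, which summarises the same curve as a step-wise sum and is imported at the top of this walkthrough. The sketch below computes both on synthetic labels and scores so the two numbers can be compared. The data are made up and say nothing about the model trained here.

```python
import numpy as np
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

# Synthetic labels (about 20% positive) and scores that are higher on
# average for the positive class. These are invented for illustration.
rng = np.random.default_rng(7)
y_true = (rng.random(500) < 0.20).astype(int)
y_score = (0.5 * y_true + rng.random(500)) / 1.5  # rescaled into [0, 1]

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# Trapezoidal area under the curve (the quantity used in the walkthrough).
auprc_trapezoid = auc(recall, precision)

# Step-wise summary of the same curve.
avg_precision = average_precision_score(y_true, y_score)

# A classifier with no skill scores roughly the positive-class prevalence.
baseline = y_true.mean()

print(f"AUPRC (trapezoid): {auprc_trapezoid:.3f}")
print(f"Average precision: {avg_precision:.3f}")
print(f"No-skill baseline: {baseline:.3f}")
```

The two summaries are usually close but not identical, because linear interpolation between points on a precision-recall curve can be optimistic. Whichever is reported, comparing it against the positive-class prevalence gives a sense of how much better than chance the model is.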