# Machine Learning - Part 2


## Machine learning paradigms

There are many different types of models in machine learning and choosing the best one is dependent on:
1. The problem you aim to solve
2. The data you have

In some instances multiple models may work well for you, in which case you will have to consider other aspects of the model, such as:
* interpretability
* memory cost
* number of samples
* dimensionality
* and so on...

Though these considerations may help you narrow down your choices, choosing the *best* model remains a difficult task. I will provide some general information about different types of machine learning models while keeping some of the above aspects in mind.

Below is a figure that shows a very well defined hierarchy of different ML models that one can consider. The upper level of this hierarchy gives 3 main learning paradigms: **supervised**, **unsupervised**, and **reinforcement**. I will discuss all 3 of these as well as a fourth, called **semi-supervised**.

<img width="500px" src="img/ml_hierarchy.png"/>


In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score,\
        roc_auc_score, auc, precision_recall_curve, roc_curve, log_loss
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, RandomizedSearchCV, GridSearchCV
from sklearn import tree
from IPython.display import Image 

import random
## set seed for randomization
random.seed(42)

## Pima Indians Diabetes dataset
We will use the Pima Indians dataset to experiment with several classifiers. The Pima are a group of Native Americans living in Arizona. A genetic predisposition allowed this group to survive on a carbohydrate-poor diet. In recent years, a sudden shift from traditional agricultural crops to processed foods and a decline in physical activity led to a high prevalence of type 2 diabetes in this population.

The dataset can be downloaded here:

https://www.kaggle.com/uciml/pima-indians-diabetes-database#diabetes.csv

I have named the downloaded file: `diabetes.csv`.

The dataset includes data from 768 women. The columns are defined as follows:

* `Pregnancies`: Number of times pregnant
* `Glucose`: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
* `BloodPressure`: Diastolic blood pressure (mm Hg)
* `SkinThickness`: Triceps skin fold thickness (mm)
* `Insulin`: 2-Hour serum insulin (mu U/ml)
* `BMI`: Body mass index (weight in kg/(height in m)^2)
* `DiabetesPedigreeFunction`: The output of the pedigree function that provides measure of genetic influence and gives us an idea of the hereditary risk one might have with the onset of diabetes mellitus
* `Age`: Age (years)
* `Outcome`: Class variable (0 or 1); 268 of 768 are 1 (positive), the others are 0 (negative)

In [None]:
## load Pima Indians Diabetes dataset (downloaded May 14, 2019; N=768)
df = pd.read_csv("diabetes.csv")

In [None]:
## function to determine whether a row is free of missing values (treated here as any 0 entry)
def valid_value(row):
    if 0 == row['Glucose'] or \
       0 == row['BloodPressure'] or \
       0 == row['SkinThickness'] or \
       0 == row['Insulin'] or \
       0 == row['BMI'] or \
       0 == row['Age']:
        return False
    else:
        return True

## create dataframe with only valid rows
df_pima = df[df.apply(lambda row: valid_value(row), axis=1)]
df_pima.head()

In [None]:
print(f"length of original dataframe: {len(df)}")
print(f"length of filtered dataframe: {len(df_pima)}")

In [None]:
## split dataset in features and target variable
feature_cols = \
    ['Pregnancies', 'Insulin', 'BMI', 'Age','Glucose',
     'BloodPressure','DiabetesPedigreeFunction', 'SkinThickness']

X = df_pima[feature_cols]
y = df_pima['Outcome']

In [None]:
## split dataset into training set and test set
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.2, random_state=42, stratify=df_pima['Outcome']) # 80% training and 20% test
print(f"Number of samples in training set = {len(X_train)}")
print(f"Number of samples in testing set = {len(X_test)}")

In [None]:
print(f"Number of positive samples in training set = {y_train.to_list().count(1)}")
print(f"Number of negative samples in training set = {y_train.to_list().count(0)}")
print(f"Ratio of positive to negative samples in training set = {y_train.to_list().count(1)/y_train.to_list().count(0):.3f}\n")

print(f"Number of positive samples in testing set = {y_test.to_list().count(1)}")
print(f"Number of negative samples in testing set = {y_test.to_list().count(0)}")
print(f"Ratio of positive to negative samples in testing set = {y_test.to_list().count(1)/y_test.to_list().count(0):.3f}")

## Random forest
Random forest classifiers are similar to decision trees in that they use hierarchical structures to split the dataset based on features. However, unlike a single decision tree, these classifiers use multiple decision trees (a "forest") in the classification process, using a method called *bagging*. Random forest is called an *ensemble* method because the final prediction is made by combining multiple classifiers.

The random forest algorithm consists of four general steps:
* Select random samples from a given dataset - *bootstrapping*.
* Construct a decision tree for each sample and get a prediction result from each decision tree.
* Perform a vote for each predicted result.
* Select the prediction result with the most votes as the final prediction - *aggregating*.

<p style="text-align: center;">
  <img width="500px" src="img/random_forest_voting.png" />
  <em><small>Image taken from <a href="https://www.geeksforgeeks.org/bagging-vs-boosting-in-machine-learning/">Bagging vs Boosting in Machine Learning</a></small></em>
</p>
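
The four steps above can be sketched in plain Python. This is a minimal illustration rather than scikit-learn's implementation: the toy data, the one-feature threshold rule standing in for a full decision tree (`fit_stump`), and the helper `bagged_predict` are all assumptions made for this example. A real random forest also considers a random subset of features at each split, which this sketch leaves out.

```python
import random
from collections import Counter

def fit_stump(points):
    """Fit a one-feature threshold rule (a depth-1 'tree') to (x, label) pairs."""
    best = None
    for threshold in sorted({x for x, _ in points}):
        for above in (0, 1):  # label predicted when x >= threshold
            errors = sum((above if x >= threshold else 1 - above) != y
                         for x, y in points)
            if best is None or errors < best[0]:
                best = (errors, threshold, above)
    _, threshold, above = best
    return lambda x: above if x >= threshold else 1 - above

def bagged_predict(train, x, n_trees=25, seed=42):
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        # bootstrapping: sample with replacement, same size as the training set
        sample = [rng.choice(train) for _ in train]
        # one tree per bootstrap sample, each casting one vote
        tree = fit_stump(sample)
        votes.append(tree(x))
    # aggregating: the label with the most votes wins
    return Counter(votes).most_common(1)[0][0]

# toy data (hypothetical): the label is 1 when the feature is large
train = [(1, 0), (2, 0), (3, 0), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
print(bagged_predict(train, 1.5), bagged_predict(train, 8.5))
```

Individual trees can be wrong when their bootstrap sample happens to be unrepresentative, but the majority vote is much more stable than any single tree.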


**Advantages**
* Random forests are considered highly accurate and robust because of the number of decision trees participating in the process.
* They are less prone to overfitting than a single decision tree. The main reason is that the forest averages the predictions of many trees, which reduces the variance of the final prediction.
* The algorithm can be used in both classification and regression problems.
* Some random forest implementations can also handle missing values. Two common approaches are replacing missing continuous values with the median, and computing a proximity-weighted average of the missing values.
* You can get the relative feature importance, which helps in selecting the most contributing features for the classifier.

**Disadvantages**
* Random forests can be slow in generating predictions because they contain multiple decision trees. Whenever the forest makes a prediction, every tree has to make a prediction for the same input, and then a vote is taken. This whole process is time-consuming.
* The model is difficult to interpret compared to a decision tree, where you can easily make a decision by following the path in the tree.



## Implementing random forest
Like decision trees, building and fitting a random forest classifier is a straightforward task in scikit-learn. First, we define a random forest classifier variable, and, second, we train the classifier by calling the `fit` method.

Random forest has many hyperparameters, including:
* `n_estimators` = number of trees in the forest
* `criterion` = the criterion used to choose a split at each node (e.g. gini or entropy for classification)
* `max_depth` = maximum depth of each tree, i.e. the length of the longest route from the root to a leaf
* `min_samples_split` = minimum number of samples required to split a node
* `max_leaf_nodes` = maximum number of leaf nodes
* `max_features` = maximum number of random features to test at each node
* `max_samples` = size of bootstrapped dataset for each tree

In [None]:
## build and fit random forest classifier
rfc = RandomForestClassifier(n_estimators=200, max_depth=4, random_state=42)
rfc.fit(X_train, y_train)

## Evaluating random forest
We can evaluate our random forest classifier by calculating the accuracy, recall, precision, and F1 scores.
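
As a reminder of what these four scores measure, here is a small sketch that computes them directly from confusion-matrix counts. The helper name `classification_metrics` and the counts are made up for illustration; they are not results from the Pima data.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, recall, precision and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)  # fraction of all predictions that are correct
    recall = tp / (tp + fn)                     # fraction of actual positives that were found
    precision = tp / (tp + fp)                  # fraction of predicted positives that are correct
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
    return accuracy, recall, precision, f1

# illustrative counts only
accuracy, recall, precision, f1 = classification_metrics(tp=20, fp=10, fn=5, tn=45)
print(f"Accuracy = {accuracy:.3f}")
print(f"Recall = {recall:.3f}")
print(f"Precision = {precision:.3f}")
print(f"F1 score = {f1:.3f}")
```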

In [None]:
y_pred_forest = rfc.predict(X_test)
y_proba_forest = rfc.predict_proba(X_test)[:, 1]  # probability of the positive class

In [None]:
def show_confusion_matrix(y_test, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
    colors = sns.color_palette("Blues")
    ax = sns.heatmap([[tp,fp],[fn,tn]], square=True, annot=True, fmt='d', 
                     cbar=False, cmap=colors, vmin=-1, annot_kws={"size":13}, linewidths=1.0)
    # set labels on figure
    ax.set_xticklabels(labels=["pos","neg"], fontsize=13)
    ax.xaxis.tick_top()
    ax.set_yticklabels(labels=["pos","neg"], fontsize= 13)
    plt.xlabel("\nactual value", fontsize=15)
    ax.xaxis.set_label_position('top') 
    plt.ylabel("predicted value\n", fontsize=15)
    plt.show()

In [None]:
## get values for confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_forest).ravel()
print(f"True Negative = {tn}\nFalse Positive = {fp}\nFalse Negative = {fn}\nTrue Positive = {tp}")

In [None]:
## show confusion matrix for random forest
show_confusion_matrix(y_test, y_pred_forest)

In [None]:
print(f"Accuracy = {accuracy_score(y_test, y_pred_forest):.3f}")

In [None]:
print(f"Recall = {recall_score(y_test, y_pred_forest):.3f}")

In [None]:
print(f"Precision = {precision_score(y_test, y_pred_forest):.3f}")

In [None]:
print(f"F1 score = {f1_score(y_test, y_pred_forest):.3f}")

Next, we define helper functions to plot the receiver operating characteristic (ROC) and precision-recall (PR) curves of our classifier.

In [None]:
def plot_roc_curve(fpr, tpr):
    fig, ax = plt.subplots()
    ax.fill_between(fpr, tpr, alpha=.5, color='darkorange')
    ax.plot(fpr, tpr, color='darkorange', lw=2, label=f'AUROC = {auc(fpr,tpr):.3f}')
    # Add dotted diagonal line (slope of 1) showing a random classifier
    ax.plot([0,1], [0,1], color='black', linestyle='dotted', lw=2, \
            label='Random = 0.500')
    ax.legend()
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.show()
    
def plot_pr_curve(recall, precision, random_aupr):
    fig, ax = plt.subplots()
    ax.fill_between(recall, precision, alpha=.5, color='blue')
    ax.plot(recall, precision, color='blue', lw=2, label=f'AUPR = {auc(recall, precision):.3f}')
    # Add dotted line where random (or no skill) would be
    ax.plot([1,0], [random_aupr,random_aupr], color='black', linestyle='dotted', \
            lw=2, label=f'Random = {random_aupr:.3f}')
    ax.legend()
    plt.xlabel("Recall")
    plt.ylabel("Precision")
    plt.show()

In [None]:
fpr_forest, tpr_forest, thresholds_forest = roc_curve(y_test, y_proba_forest)
print("AUROC")
print(f"Random forest: {auc(fpr_forest,tpr_forest):.3f}")
print(f"Random (no skill) AUROC: {auc([0,1], [0,1]):.3f}")

In [None]:
plot_roc_curve(fpr_forest,tpr_forest)

In [None]:
precision_forest, recall_forest, thresholds_forest = precision_recall_curve(y_test, y_proba_forest)

print("AUPR")
print(f"Random forest: {auc(recall_forest, precision_forest):.3f}")

positive_class = y_test.to_list().count(1)
negative_class = y_test.to_list().count(0)
random_control = positive_class/(positive_class+negative_class)

print(f"Random (no skill) AUPR = {auc([0,1], [random_control,random_control]):.3f}")

In [None]:
plot_pr_curve(recall_forest,precision_forest,random_control)

## Hyperparameter tuning

Cross-validation is key to choosing the best possible hyperparameters. This involves splitting the training set into $k$ subsets (folds), where one subset is used as a validation set and the remaining $k-1$ are used for training. This is repeated $k$ times so that each subset serves as the validation set exactly once, and the average of the metrics is used to assess the model with the given hyperparameters.

<p style="text-align: center;">
  <img width="800px" src="img/k_fold_cv.png"/>
    <em><small>Image taken from <a href="https://www.sharpsightlabs.com/blog/cross-validation-explained/">Cross Validation, Explained</a></small></em>
</p>
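
The splitting scheme can be sketched as follows. The helper name `k_fold_indices` is hypothetical, and this version does no shuffling or stratification; it only shows how each fold takes one turn as the validation set.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for train, validation in folds:
    print(f"train={train} validation={validation}")
```

Every sample appears in exactly one validation fold, so each sample is used for validation once and for training $k-1$ times.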

To further this idea, we can use cross-validation in concert with a *grid search*, which runs the model with every combination of hyperparameter values drawn from user-defined lists. The metrics for each combination are computed on every fold and averaged, and the combination with the best average score is reported as the best model.
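
To get a feel for the cost of a grid search, the sketch below enumerates every combination for a small grid and counts the number of model fits. The values match the lists used in this notebook's grid search cell; the variable names are only for illustration.

```python
from itertools import product

param_grid = {
    "n_estimators": [100, 200, 300, 400],
    "max_depth": [2, 4, 6, 8],
    "max_features": [2, 3, 4],
}

# every combination of one value per hyperparameter
names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 5
print(len(combinations))            # 48
print(len(combinations) * n_folds)  # 240
print(combinations[0])
```

The number of fits grows multiplicatively with every hyperparameter added to the grid, which is why grids are usually kept small.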

In [None]:
# Number of trees to be used
rfc_n_estimators = [int(x) for x in np.linspace(100, 400, 4)]
# Maximum depth of each tree
rfc_max_depth = [int(x) for x in np.linspace(2, 8, 4)]
# Number of features to consider at each split
rfc_max_features = [2,3,4]

rfc_param_grid = {'n_estimators': rfc_n_estimators,
                  'max_depth': rfc_max_depth,
                  'max_features': rfc_max_features}

# Create the model to be tuned
rfc_base = RandomForestClassifier(random_state=42)

# Create the grid search over the random forest hyperparameters
rfc_grid = GridSearchCV(estimator=rfc_base, param_grid=rfc_param_grid,
                        cv=5, scoring='f1', n_jobs=2)

# Fit the grid search model
rfc_grid.fit(X_train, y_train)

In [None]:
# Get the optimal parameters
rfc_grid.best_params_

In [None]:
y_pred_grid = rfc_grid.predict(X_test)
y_proba_grid = rfc_grid.predict_proba(X_test)[:, 1]  # probability of the positive class

In [None]:
print(f"Accuracy = {accuracy_score(y_test, y_pred_grid):.3f}")

In [None]:
print(f"Recall = {recall_score(y_test, y_pred_grid):.3f}")

In [None]:
print(f"Precision = {precision_score(y_test, y_pred_grid):.3f}")

In [None]:
print(f"F1 score = {f1_score(y_test, y_pred_grid):.3f}")

In [None]:
fpr_grid, tpr_grid, thresholds_grid = roc_curve(y_test, y_proba_grid)
print("AUROC")
print(f"best rf: {auc(fpr_grid,tpr_grid):.3f}")
print(f"Random (no skill) AUROC: {auc([0,1], [0,1]):.3f}")

In [None]:
plot_roc_curve(fpr_grid,tpr_grid)

In [None]:
precision_grid, recall_grid, thresholds_grid = precision_recall_curve(y_test, y_proba_grid)

print("AUPR")
print(f"best rf: {auc(recall_grid, precision_grid):.3f}")
print(f"Random (no skill) AUPR = {auc([0,1], [random_control,random_control]):.3f}")

In [None]:
plot_pr_curve(recall_grid,precision_grid,random_control)

## Feature ranking
In addition to evaluating the random forest classifier, it is sometimes helpful to see how important each of the features was in arriving at the final predictions. If we notice that a feature is of little importance, we can eliminate it from our training dataset in order to gain efficiency.

After fitting, a scikit-learn random forest classifier exposes an attribute named `feature_importances_`.

In [None]:
## find important features
rfc.feature_importances_

The raw output is a little difficult to interpret. So, we will put the output in a Pandas Series.

In [None]:
feature_imp = \
    pd.Series(rfc.feature_importances_, index=feature_cols).sort_values(ascending=False)
feature_imp

We can also visualize the feature importances using a seaborn barplot.

In [None]:
## visualize important features
%matplotlib inline

# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)

# Add labels to your graph
plt.xlabel('\nFeature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features\n")

plt.show()

## XGBoost

How does this differ from Random Forest? Random Forest uses bagging in order to train a final model. XGBoost works by a method called **boosting**, which is an iterative, sequential method that adds a new decision tree to the overall model at each step to minimize the error from the previous trees. Each new tree is a *weak learner*; combined, the trees create a strong learner that can predict the outcome accurately.

<img width="500px" src="img/xgboost_boosting.png" />

A problem with XGBoost is that it is highly sensitive to its hyperparameters. If too many trees are added, the model can overfit. Moreover, the learning rate (`learning_rate`) is crucial: the model tends to perform better if trained slowly, but a smaller learning rate means more trees are needed. Finding the right balance is key to the robustness and generalizability of the model.
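
The idea of adding one weak learner at a time can be sketched with a stripped-down gradient boosting loop for a regression target. This is not XGBoost's algorithm, which is considerably more elaborate (regularization, second-order gradient information, and more); the helpers `fit_stump` and `boost` and the toy data are assumptions made for illustration. Each new stump is fit to the residuals left by the trees before it, and its contribution is scaled by the learning rate.

```python
def fit_stump(xs, residuals):
    """Find the single threshold split on x that best fits the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs))[1:]:  # skip the minimum so both sides are non-empty
        left = [r for x, r in zip(xs, residuals) if x < threshold]
        right = [r for x, r in zip(xs, residuals) if x >= threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean

def boost(xs, ys, n_trees=50, learning_rate=0.1):
    """Return the training mean squared error after each added tree."""
    predictions = [sum(ys) / len(ys)] * len(xs)  # start from the mean of the targets
    errors = []
    for _ in range(n_trees):
        # each new weak learner is trained on what the current model still gets wrong
        residuals = [y - p for y, p in zip(ys, predictions)]
        tree = fit_stump(xs, residuals)
        predictions = [p + learning_rate * tree(x) for x, p in zip(xs, predictions)]
        errors.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    return errors

# toy data (hypothetical)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
errors = boost(xs, ys)
print(f"MSE after 1 tree: {errors[0]:.3f}, after 50 trees: {errors[-1]:.3f}")
```

With a small learning rate each tree only corrects a fraction of the remaining error, so more trees are needed to reach the same training error.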


In [None]:
import xgboost as xgb

In [None]:
## build and fit XGBoost classifier
xgc = xgb.XGBClassifier(objective='binary:logistic', n_estimators=100, \
                        alpha=0.01, max_depth=4, learning_rate=0.1, \
                        colsample_bytree=0.3, verbosity=0)
xgc.fit(X_train, y_train)

y_pred_boost = xgc.predict(X_test)
y_proba_boost = xgc.predict_proba(X_test)[:, 1]  # probability of the positive class

In [None]:
show_confusion_matrix(y_test, y_pred_boost)

In [None]:
print(f"Accuracy = {accuracy_score(y_test, y_pred_boost):.3f}")

In [None]:
print(f"Recall = {recall_score(y_test, y_pred_boost):.3f}")

In [None]:
print(f"Precision = {precision_score(y_test, y_pred_boost):.3f}")

In [None]:
print(f"F1 score = {f1_score(y_test, y_pred_boost):.3f}")

In [None]:
fpr_boost, tpr_boost, thresholds_boost = roc_curve(y_test, y_proba_boost)
print("AUROC")
print(f"xgboost: {auc(fpr_boost,tpr_boost):.3f}")
print(f"Random (no skill) AUROC: {auc([0,1], [0,1]):.3f}")

In [None]:
plot_roc_curve(fpr_boost,tpr_boost)

In [None]:
precision_boost, recall_boost, thresholds_boost = precision_recall_curve(y_test, y_proba_boost)

print("AUPR")
print(f"xgboost: {auc(recall_boost, precision_boost):.3f}")
print(f"Random (no skill) AUPR = {auc([0,1], [random_control,random_control]):.3f}")

In [None]:
plot_pr_curve(recall_boost,precision_boost,random_control)

In [None]:
feature_imp = \
    pd.Series(xgc.feature_importances_, index=feature_cols).sort_values(ascending=False)

## visualize important features
%matplotlib inline

# Creating a bar plot
sns.barplot(x=feature_imp, y=feature_imp.index)

# Add labels to your graph
plt.xlabel('\nFeature Importance Score')
plt.ylabel('Features')
plt.title("Visualizing Important Features\n")

plt.show()

## Logistic regression
Logistic regression is a simple and commonly used machine learning algorithm for two-class classification. It is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables. Logistic regression can be used to answer questions such as:
* How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?
* Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?

*Logistic regression can also be used for multi-class predictions, but we will not cover that here.*

In general, logistic regression uses a linear combination of one or more feature values (explanatory variables) as the argument of the sigmoid function:

$f(x) = \frac{1}{1+e^{-x}}$

The corresponding output of the sigmoid function is a number between 0 and 1. 

<img width="300px" src="img/sigmoid.png"/>

The middle value, 0.5, is used as the threshold that separates class 1 from class 0. In particular, an input producing an output greater than 0.5 is considered to belong to class 1. Conversely, if the output is less than 0.5, the corresponding input is classified as belonging to class 0.

For our logistic regression model we use the logistic function:

$f_{w,b}(x) = \frac{1}{1+e^{-(wx+b)}}$

The logistic function is our **activation function**. This is going to tell us when a sample is 0 or 1.
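
A minimal sketch of the logistic function and the 0.5 threshold for a single feature is shown below. The weight `w`, the intercept `b`, and the helper names are hypothetical values chosen for illustration, not fitted parameters.

```python
import math

def logistic(x, w, b):
    """Logistic function applied to the linear combination w*x + b."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def classify(x, w, b, threshold=0.5):
    """Assign class 1 when the logistic output exceeds the threshold, else class 0."""
    return 1 if logistic(x, w, b) > threshold else 0

w, b = 2.0, -4.0  # hypothetical parameters
for x in (0.0, 1.0, 3.0, 4.0):
    print(f"x={x}: f(x)={logistic(x, w, b):.3f} -> class {classify(x, w, b)}")
```

With these parameters the output crosses 0.5 at $x=2$, where $wx+b=0$, so inputs above 2 are assigned to class 1 and inputs below 2 to class 0.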

To calculate the solution to this equation, i.e. obtain the best intercept and coefficients, we aim to maximize the **log likelihood** of the training data.

$\ln L_{\mathbf{w},b} = \sum_{i=1}^{N}\left[y_{i}\ln f_{\mathbf{w},b}(x_{i})+(1-y_{i}) \ln (1-f_{\mathbf{w},b}(x_{i}))\right]$

Though it may appear daunting, when you break it down, it isn't that bad. When $y_{i}=1$, the second part of the summation drops out (1-1=0), whereas when $y_{i}=0$ the first part of the equation drops out. 

The log likelihood defines our **cost function**. Generally, minimizing functions is preferred over maximizing, so the negative of the log likelihood is commonly used as the cost.
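
To see how the log likelihood rewards good predictions, the sketch below evaluates it for two sets of predicted probabilities on the same labels. The labels, the probabilities, and the helper name `log_likelihood` are made up for illustration.

```python
import math

def log_likelihood(y_true, p_pred):
    """Sum of y*ln(p) + (1-y)*ln(1-p) over all samples."""
    return sum(y * math.log(p) + (1 - y) * math.log(1 - p)
               for y, p in zip(y_true, p_pred))

y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]  # probabilities close to the true labels
unsure = [0.6, 0.4, 0.5, 0.5]     # probabilities close to 0.5

ll_confident = log_likelihood(y_true, confident)
ll_unsure = log_likelihood(y_true, unsure)
print(f"log likelihood (confident) = {ll_confident:.3f}")
print(f"log likelihood (unsure)    = {ll_unsure:.3f}")
print(f"negative log likelihood (confident) = {-ll_confident:.3f}")
```

The predictions that are closer to the true labels have the higher (less negative) log likelihood, and equivalently the lower negative log likelihood.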

## Optimizers

The general point is there are many methods that have been developed and well tested for optimizing (maximizing or minimizing) a function. You start with a guess then you adjust the parameters over several iterations until you have converged to some point (number of iterations, tolerance, etc.).

<img width="300px" src="img/minimization.gif"/>
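
The guess, adjust, converge loop can be sketched with plain gradient descent on the negative log likelihood of a one-feature logistic regression. This is only an illustration: the toy data, the learning rate, the tolerance, and the helper names are assumptions, and scikit-learn's `LogisticRegression` uses other solvers.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def negative_log_likelihood(w, b, xs, ys):
    return -sum(y * math.log(sigmoid(w * x + b))
                + (1 - y) * math.log(1 - sigmoid(w * x + b))
                for x, y in zip(xs, ys))

def gradient_descent(xs, ys, learning_rate=0.01, max_iter=10000, tol=1e-6):
    w, b = 0.0, 0.0  # initial guess
    previous = negative_log_likelihood(w, b, xs, ys)
    for iteration in range(1, max_iter + 1):
        # gradient of the negative log likelihood with respect to w and b
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs))
        grad_b = sum(errors)
        # adjust the parameters by a small step against the gradient
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        current = negative_log_likelihood(w, b, xs, ys)
        # stop when the cost no longer changes by more than the tolerance
        if abs(previous - current) < tol:
            break
        previous = current
    return w, b, current, iteration

# toy data (hypothetical); the classes overlap, so the optimum is finite
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
w, b, final_nll, n_iter = gradient_descent(xs, ys)
print(f"w={w:.3f}, b={b:.3f}, cost={final_nll:.3f}, iterations={n_iter}")
```

The loop stops either when the cost changes by less than the tolerance or when the maximum number of iterations is reached, which are the two stopping criteria mentioned above.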


I do not go into specific optimizers here, for the sake of time. I do think I will update this notebook in the future to include a nice section on optimizers.

https://en.wikipedia.org/wiki/Gradient_descent

https://scikit-learn.org/stable/modules/sgd.html

https://en.wikipedia.org/wiki/Limited-memory_BFGS

For our implementation of logistic regression, we will use scikit-learn's LogisticRegression model:
* https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

Let's load the libraries we will be using...

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import math
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, \
    f1_score, roc_auc_score, auc, precision_recall_curve, roc_curve,\
    classification_report, confusion_matrix

Finally, we fit the model to the training data and predict outcomes for the test data.

In [None]:
## create a logistic regression classifier and predict
logreg = LogisticRegression(random_state=42, max_iter=500)
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
y_proba_logreg = logreg.predict_proba(X_test)[:,1]
print(f"Our model converged after {logreg.n_iter_[0]} iterations.")

In [None]:
## show confusion matrix
show_confusion_matrix(y_test, y_pred_logreg)

In [None]:
print(f"Accuracy = {accuracy_score(y_test, y_pred_logreg):.3f}")

In [None]:
print(f"Recall = {recall_score(y_test, y_pred_logreg):.3f}")

In [None]:
print(f"Precision = {precision_score(y_test, y_pred_logreg):.3f}")

In [None]:
print(f"F1 score = {f1_score(y_test, y_pred_logreg):.3f}")

In [None]:
fpr_logreg, tpr_logreg, thresholds_logreg = roc_curve(y_test, y_proba_logreg)
print("AUROC")
print(f"logreg: {auc(fpr_logreg,tpr_logreg):.3f}")
print(f"Random (no skill) AUROC: {auc([0,1], [0,1]):.3f}")

In [None]:
plot_roc_curve(fpr_logreg,tpr_logreg)

In [None]:
precision_logreg, recall_logreg, thresholds_logreg = precision_recall_curve(y_test, y_proba_logreg)

print("AUPR")
print(f"logreg: {auc(recall_logreg, precision_logreg):.3f}")
print(f"Random (no skill) AUPR = {auc([0,1], [random_control,random_control]):.3f}")

In [None]:
plot_pr_curve(recall_logreg,precision_logreg,random_control)

### Interpretability and feature importance

We will analyze the model a bit more to understand what the model is doing and what features are most important to the classification.

First, we will extract the coefficients from the model and match them with the corresponding name of the features.

In [None]:
coefs = list(zip(feature_cols,logreg.coef_[0]))
coefs

Next, we take the exponential of the coefficients to calculate the odds ratio for each feature.

In [None]:
odds_ratio = [(x[0],math.exp(x[1])) for x in coefs]
odds_ratio

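We see that, of these features, the one with the largest odds ratio appears to be Diabetes Pedigree Function. The odds ratio tells us that for every 1 unit increase in Diabetes Pedigree Function, the odds of a patient experiencing the outcome (diabetes) are multiplied by about 2.26. Keep in mind that the features are on different scales, so a 1 unit increase does not mean the same thing for every feature.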

In [None]:
[(x[0],(x[1]-1)*100.0) for x in odds_ratio]
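
To make the odds ratio arithmetic concrete, here is a small worked sketch. The coefficient and the baseline probability are hypothetical values, chosen so that the odds ratio comes out near 2.26; they are not taken from the fitted model.

```python
import math

coefficient = 0.815  # hypothetical logistic regression coefficient (log odds per unit)
odds_ratio = math.exp(coefficient)
percent_change = (odds_ratio - 1) * 100.0

# effect of a 1 unit increase on a patient with a hypothetical 20% baseline probability
baseline_probability = 0.20
baseline_odds = baseline_probability / (1 - baseline_probability)
new_odds = baseline_odds * odds_ratio
new_probability = new_odds / (1 + new_odds)

print(f"odds ratio = {odds_ratio:.3f}")
print(f"percent change in odds = {percent_change:.1f}%")
print(f"probability: {baseline_probability:.2f} -> {new_probability:.2f}")
```

Note that the odds are multiplied by the odds ratio, not the probability: in this example the probability rises from 0.20 to roughly 0.36, which is less than a 2.26-fold increase.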

## Neural Networks

Now that we understand and have run a logistic regression model, let's go a bit "deeper". We can think of a neural network (NN) as a set of nested functions -- we call these layers. Each layer in our model takes input from the previous layer and outputs directly to the next layer, i.e. fully connected. 

We are going to create a 3 layer neural network with the previously used 8 variables as features and the "Outcome" as the label. 

The first layer of our NN will take in all 8 features as input, has a ReLU (rectified linear unit) activation function, and outputs 24 latent features (hidden). As opposed to the logistic function, discussed previously, ReLU sets the input to 0 if it is <0 or uses the input as is if >0.

$f(x)=\max(0,x)$

The second layer of our NN will take in all 24 latent features from the previous layer as input, has a ReLU (rectified linear unit) activation function, and outputs 12 latent features.

The third (and last) layer of our model is a sigmoid output layer that takes in the previous 12 latent features as input.
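
The three layers described above can be written out as nested functions in NumPy. This is a sketch of the forward pass only: the weights are random rather than trained, the input batch is synthetic, and the helper names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 8 input features -> 24 hidden -> 12 hidden -> 1 output
layer_sizes = [8, 24, 12, 1]
# random weights and zero biases; training would adjust these values
weights = [rng.normal(scale=0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    hidden1 = relu(x @ weights[0] + biases[0])        # first layer: 24 latent features
    hidden2 = relu(hidden1 @ weights[1] + biases[1])  # second layer: 12 latent features
    return sigmoid(hidden2 @ weights[2] + biases[2])  # output layer: value between 0 and 1

batch = rng.normal(size=(5, 8))  # 5 synthetic samples with 8 features each
output = forward(batch)
n_parameters = sum(w.size + b.size for w, b in zip(weights, biases))
print(output.shape, n_parameters)  # (5, 1) 529
```

Each sample produces a single number between 0 and 1, and the network has 529 trainable parameters in total (216 + 300 + 13 across the three layers).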

The loss function we use for this model is binary cross entropy, which sums, across all samples, the negative log probability that the model assigns to each sample's true class (0 or 1). This is the negative of the log likelihood we saw for logistic regression. We want to minimize this loss function.

$Loss = -\sum_{i=1}^{N}\left[y_{i}\ln f(x_{i})+(1-y_{i}) \ln (1-f(x_{i}))\right]$


For our implementation of the neural network, we will use Keras's Sequential model:
* https://keras.io/guides/sequential_model/

Let's load the libraries we will be using...

In [None]:
import random
import numpy as np
import tensorflow as tf
random.seed(42)
np.random.seed(42)
tf.random.set_seed(42)
from keras.models import Sequential
from keras.layers import Dense, Dropout

# summarize history for loss
def plot_fit(history):
    plt.plot(history.history['loss'])
    plt.plot(history.history['val_loss'])
    plt.title('model loss')
    plt.ylabel('loss')
    plt.xlabel('epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

In [None]:
# define the keras model
model = Sequential()
model.add(Dense(24, input_dim=8, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['AUC'])
# fit the keras model on the dataset
history = model.fit(X_train, y_train, validation_split=0.2, epochs=300, batch_size=20, verbose=False)

# make class predictions with the model
y_proba = model.predict(X_test)
y_pred = (y_proba > 0.5).astype("int32")

In [None]:
plot_fit(history)

In [None]:
## show confusion matrix
show_confusion_matrix(y_test, y_pred)

In [None]:
accuracy_score(y_test, y_pred)

In [None]:
# calculate F1 score
f1_score(y_test, y_pred)

In [None]:
# Calculate AUROC from the predicted probabilities (not the thresholded class labels)
roc_auc_score(y_test, y_proba.ravel())

In [None]:
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
# Plot the ROC curve
plot_roc_curve(fpr, tpr)

In [None]:
# Calculate PR curve
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
# Calculate AUPR
auc(recall, precision)

In [None]:
# Tabulate the results in a dataframe
df_pr = pd.DataFrame(list(zip(recall,precision)), columns=['recall','precision'])
# Plot the PR curve
plot_pr_curve(recall, precision, random_control)

<font color='red'>***Note***</font>: 
An interesting artifact of this plot is the rapid drop on the left side, which is due to the way the curve is calculated. As you move from right to left on the plot, the threshold for calling a prediction positive becomes more stringent. On the far left there are very few positive class predictions (both true and false positives), so adding one more false or true positive (moving towards the right) can greatly affect the precision.
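
The effect is easy to reproduce by hand. The sketch below computes precision and recall at each threshold for a handful of made-up scores; the labels, the scores, and the helper name `precision_recall_points` are assumptions for illustration, and the scores are assumed to be distinct.

```python
def precision_recall_points(y_true, scores):
    """Return (threshold, precision, recall) as the threshold is lowered one sample at a time."""
    ranked = sorted(zip(scores, y_true), reverse=True)
    total_positives = sum(y_true)
    tp = fp = 0
    points = []
    for score, label in ranked:
        # lowering the threshold to this score adds one more positive prediction
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((score, tp / (tp + fp), tp / total_positives))
    return points

# made-up example: the highest-scoring sample happens to be a false positive
y_true = [0, 1, 1, 0, 1, 1, 0, 0]
scores = [0.95, 0.90, 0.85, 0.70, 0.60, 0.40, 0.30, 0.10]
points = precision_recall_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f} precision={precision:.2f} recall={recall:.2f}")
```

At the strictest threshold there is only one positive prediction, so a single false positive gives a precision of 0, and one additional true positive immediately lifts it to 0.5. Further along the curve, each extra prediction changes the precision much less.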

### Underfitting and Overfitting

The below figure is an excellent example of underfitting, a good fit, and overfitting.

<img width="800px" src="img/model_fit.png"/>

One can also plot learning curves to determine a good fit during training. This plots loss vs. epoch.

<img width="900px" src="img/learning_curves.png"/>
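
As a rough sketch of how such curves can be read programmatically, the snippet below finds the epoch at which the validation loss is lowest. The loss values are made up to mimic the typical overfitting shape, and `best_epoch` is a hypothetical helper; in this notebook the real values would come from `history.history['loss']` and `history.history['val_loss']`.

```python
def best_epoch(val_loss):
    """Return the (0-based) epoch with the lowest validation loss."""
    return min(range(len(val_loss)), key=val_loss.__getitem__)

# made-up curves: training loss keeps falling, validation loss turns back up
train_loss = [0.70, 0.60, 0.52, 0.46, 0.41, 0.37, 0.33, 0.30, 0.27, 0.24]
val_loss = [0.72, 0.63, 0.57, 0.53, 0.51, 0.50, 0.51, 0.53, 0.56, 0.60]

epoch = best_epoch(val_loss)
final_gap = val_loss[-1] - train_loss[-1]
print(f"lowest validation loss at epoch {epoch}")
print(f"gap between validation and training loss at the last epoch = {final_gap:.2f}")
```

A validation loss that starts rising while the training loss keeps falling, together with a widening gap between the two, is the usual sign of overfitting.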

There are methods for avoiding overfitting/overtraining, such as regularization, dropout, etc. You can learn more about these in many of the references provided.

https://scikit-learn.org/stable/model_selection.html

*Choosing the right model(s), activation function(s), and **hyperparameters** is crucial for creating a robust and **generalizable** model.*

In [None]:
# define the keras model
model = Sequential()
model.add(Dense(24, input_dim=8, activation='relu'))
model.add(Dense(12, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['AUC'])
# fit the keras model on the dataset
history = model.fit(X_train, y_train, validation_split=0.2, epochs=1000, batch_size=20, verbose=False)
plot_fit(history)

## References and additional reading

In this module, we covered the basics of implementing and evaluating random forest and logistic regression classifiers in scikit-learn, an XGBoost classifier, and a neural network using Keras.

* Burkov A. The Hundred-Page Machine Learning Book by Andriy Burkov. Expert Systems. 2019;5(2):132-50.
* Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J. Scikit-learn: Machine learning in Python. the Journal of machine Learning research. 2011 Nov 1;12:2825-30.
* Chollet F. Keras documentation. keras.io. 2015;33.
* Goodfellow I, Bengio Y, Courville A. Deep learning.
* Bishop C. Pattern Recognition and Machine Learning.
* Friedman JH, Tibshirani R, Hastie T. The Elements of Statistical Learning.
* https://www.codecademy.com/learn/machine-learning
* https://www.w3schools.com/python/python_ml_getting_started.asp
* https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/