In [None]:
from supportLibrary import *
from IPython.display import display
import ipywidgets as widgets
import pandas as pd

## Partition the Data

Measuring how well a machine learning model performs on data it was not trained on is called validation. To do this, you either split your data into a training set and a testing set (typically an 80:20 split), or you split it into a training set, a validation set, and a testing set (typically 70:15:15). Holding out a single fixed partition like this is known as hold-out validation; cross-validation repeats the procedure over several different partitions. For the training/testing split, the model is trained with the training set, then its accuracy is evaluated using the testing set, which the model has never seen before. For the training/validation/testing split, the model is trained with the training set while its accuracy is periodically evaluated on the validation set to make sure it is not overfitting the training data. Finally, the model is evaluated using the testing set, which it has never seen before. It is important that each partition contains random samples of the dataset, so that every device is represented in roughly the same proportion in the data used for training and testing.

![title](Images/dataSplit.jpeg)
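
As a rough illustration of the 70:15:15 partition described above, the sketch below shuffles a list of samples and cuts it into three parts. The function name `split_dataset`, its parameters, and the stand-in data are assumptions made for this example; the helpers in `supportLibrary` may work differently.

```python
import random

def split_dataset(rows, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle rows, then cut them into training, validation, and testing lists."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    shuffled = list(rows)                   # copy so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    n = len(shuffled)
    train_end = round(n * fractions[0])
    val_end = train_end + round(n * fractions[1])
    # Whatever remains after the first two cuts becomes the testing set.
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]

rows = list(range(100))                     # stand-in for 100 samples
train, val, test = split_dataset(rows)
print(len(train), len(val), len(test))
```

Every sample lands in exactly one partition, so the testing set stays unseen until the final evaluation.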

In [None]:
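# Load the dataset, then shuffle and normalize it before partitioning
df = pd.read_csv('current4.csv')
df = shuffleAndNormalize(df)
# Split into features (X) and labels (Y) for training and testing
trainX, trainY = getTrainingSet(df)
testX, testY = getTestingSet(df)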

## Choose Which Model to Use

### Deep Neural Net
A model that, taking inspiration from the brain, is composed of layers consisting of simple connected units or neurons followed by nonlinearities.
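
To make "layers of connected units followed by nonlinearities" concrete, here is a minimal forward pass through one hidden layer and one output unit. The weights are arbitrary numbers chosen for illustration, not trained values, and the network built by `trainModel` may look quite different.

```python
import math

def relu(values):
    """Nonlinearity: negative activations are clipped to zero."""
    return [max(0.0, v) for v in values]

def dense(inputs, weights, biases):
    """One fully connected layer: each unit is a weighted sum of the inputs plus a bias."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, biases)]

def sigmoid(v):
    """Squash a real number into (0, 1) so it can be read as a probability."""
    return 1.0 / (1.0 + math.exp(-v))

# Hypothetical weights: 2 inputs -> 2 hidden units -> 1 output unit.
HIDDEN_W = [[0.5, -0.2], [-0.3, 0.8]]
HIDDEN_B = [0.0, 0.1]
OUTPUT_W = [[1.0, -1.0]]
OUTPUT_B = [0.0]

def forward(inputs):
    hidden = relu(dense(inputs, HIDDEN_W, HIDDEN_B))
    (logit,) = dense(hidden, OUTPUT_W, OUTPUT_B)
    return sigmoid(logit)

prob = forward([1.0, 2.0])
print(prob)
```

Training consists of adjusting the weights and biases so that outputs like `prob` match the labels; that step is omitted here.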

### Gradient Tree Boosting
An ensemble learning method that builds its model by sequentially adding new, shallow decision trees, each one trained to correct the errors made by the trees already in the ensemble.
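
The sequential idea can be sketched with one-split trees ("stumps") on one-dimensional data. This toy version assumes squared-error loss, where each new stump is fitted to the residuals of the current ensemble; it needs at least two distinct `x` values. All names here are illustrative, and production libraries add many refinements.

```python
def fit_stump(xs, residuals):
    """Find the threshold split on x that minimizes squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue                        # a split must put samples on both sides
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Start from the mean label, then add stumps that each fit the remaining error."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(rounds):
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict

xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
predict = boost(xs, ys)
print(predict(1), predict(6))
```

Each round shrinks the remaining error, which is why the trees can stay shallow.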

### Random Forest
An ensemble learning method that operates by constructing a multitude of decision trees at training time then outputting the class that is the mode of the classes output by the individual trees.
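
The "mode of the classes" step is a majority vote. In the sketch below, three hand-written rules stand in for trained trees; the feature names and thresholds are hypothetical and serve only to show the vote.

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted most often (the mode of the tree outputs)."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical "trees", each a simple rule on a single reading.
trees = [
    lambda sample: "maintenance" if sample["temperature"] > 80 else "ok",
    lambda sample: "maintenance" if sample["vibration"] > 0.5 else "ok",
    lambda sample: "maintenance" if sample["current"] > 12 else "ok",
]

sample = {"temperature": 85, "vibration": 0.2, "current": 14}
votes = [tree(sample) for tree in trees]
label = majority_vote(votes)
print(votes, "->", label)
```

With an even number of trees a tie is possible; `Counter.most_common` then returns the label it encountered first.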

### Support Vector Machine
A classification algorithm that seeks the decision boundary with the largest margin between the positive and negative classes, often after mapping the input data vectors to a higher-dimensional space where the classes are easier to separate.
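
The sketch below illustrates both ideas on toy data. The four 1-D points carry labels +1, -1, -1, +1, so no single threshold on `x` separates them, but after mapping `x` to `(x, x*x)` a straight line does. The hyperplane `w`, `b` is picked by hand for illustration; a real SVM finds it by optimization and typically uses a kernel rather than an explicit map.

```python
import math

def feature_map(x):
    """Lift a 1-D input into 2-D so that a straight line can separate the classes."""
    return (x, x * x)

def decision(w, b, point):
    """Signed score: positive on one side of the hyperplane, negative on the other."""
    return sum(wi * pi for wi, pi in zip(w, point)) + b

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any labeled point to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision(w, b, p) / norm for p, y in zip(points, labels))

xs = [-2.0, -1.0, 1.0, 2.0]
labels = [1, -1, -1, 1]
points = [feature_map(x) for x in xs]

w, b = (0.0, 1.0), -2.5          # hand-picked: the line x*x = 2.5 in the lifted space
margin = geometric_margin(w, b, points, labels)
print(margin)
```

A positive margin means every point is on the correct side; the SVM objective is to make this value as large as possible.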

In [None]:
# Show the model selector; the chosen model is read later from modelType.value
modelType = getModelType()
display(modelType)

## Train the Model

The models we use for these types of machine learning problems assume that samples are independent. That is, no sample depends on the values of a previous sample. So that the order in which the data was recorded does not violate this assumption, we shuffle the samples, making any sample equally likely to follow any other sample.
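
One detail worth showing is that features and labels must be shuffled together, or the pairing between them is lost. The sketch below assumes the data arrives grouped by device; `shuffle_together` and the sample values are made up for this example and are not part of `supportLibrary`.

```python
import random

def shuffle_together(features, labels, seed=0):
    """Shuffle samples while keeping each feature row paired with its label."""
    paired = list(zip(features, labels))
    if not paired:
        return [], []
    random.Random(seed).shuffle(paired)     # one shuffle applied to the pairs
    shuffled_features, shuffled_labels = zip(*paired)
    return list(shuffled_features), list(shuffled_labels)

# Stand-in data recorded device by device: all of "A" first, then all of "B".
features = [("A", 0.1), ("A", 0.2), ("B", 0.9), ("B", 1.1)]
labels = [0, 0, 1, 1]

shuffled_x, shuffled_y = shuffle_together(features, labels)
print(shuffled_x, shuffled_y)
```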

In [None]:
model = trainModel(modelType.value, trainX, trainY)

## Quantify the Model's Performance

#### Accuracy
Accuracy identifies the fraction of predictions that a classification model got right.
$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Number of Examples}}$$

#### Precision
Precision identifies the frequency with which a model was correct when predicting the positive class. Loss in precision occurs when the model predicts a device needs maintenance when it actually does not.
$$\text{Precision} = \frac{\text{True Positives}}{\text{True Positives + False Positives}}$$

#### Recall
Recall identifies how many of the positive examples in the dataset the model correctly identified. Loss in recall occurs when the model predicts a device does not need maintenance when it actually does.
$$\text{Recall} = \frac{\text{True Positives}}{\text{True Positives + False Negatives}}$$

#### F1 Score
F1 Score is the harmonic mean of Precision and Recall. It summarizes the two metrics in a single number that is high only when both of them are high.
$$\text{F1 Score} = \frac{2}{\frac{1}{\text{Precision}}+\frac{1}{\text{Recall}}}$$
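
The four formulas above can be computed directly from the counts of true/false positives and negatives. The sketch below is one possible implementation with made-up labels; it is not necessarily how `getAccuracyMetrics` works. It uses the algebraically equivalent form of F1, $2PR/(P+R)$, and returns 0 where a denominator would be zero (a convention chosen here, not a universal rule).

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, false negatives, and true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn

def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary classification task."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1

# Made-up labels: 1 = needs maintenance, 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 0, 0, 1]
accuracy, precision, recall, f1 = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

In this example there are 2 true positives, 1 false positive, 2 false negatives, and 3 true negatives, so accuracy is 5/8, precision is 2/3, recall is 1/2, and F1 is 4/7.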

In [None]:
# Predict on the held-out testing set, then score the predictions against the true labels
predictions = getPredictions(model, testX, testY, modelType.value)
accuracy, precision, recall, f1score = getAccuracyMetrics(testY, predictions)
print("Accuracy {}\nPrecision {}\nRecall {}\nf1 Score {}".format(accuracy, precision, recall, f1score))
# Plot predicted versus actual classes
plotConfusionMatrix(testY, predictions)