# Fault diagnostic using simulated data

Train and test an AI model on the previously generated datasets, to be used as a diagnostic tool that detects CPU faults.

In [54]:
import numpy as np
import pandas as pd
import plotly.graph_objects as go
import pickle

from sklearn.neural_network import MLPClassifier

## Import datasets

Import previously generated datasets

In [55]:
dataset = pd.read_csv("data/dataset_1000_cases_20_percent_broken.csv")
testset = pd.read_csv("data/test_set_1000_cases_20_percent_broken.csv")

## Clean and create datasets

We create the training sets: 
- Xtrain: the features that the neural network learns from, so that it can predict a class when we give it new data
- ytrain: the class of each case   

The working column contains the classes, so we drop it for Xtrain; for ytrain we keep it and drop all other columns.   

In [56]:
Xtrain = dataset.drop("working", axis="columns")

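We standardize the values of Xtrain column by column (subtract the mean, divide by the standard deviation) and keep the mean and standard deviation for later reuse. A column with zero standard deviation produces NaN after the division, which we replace by 0.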

In [57]:
Xmean = Xtrain.mean()
Xstd = Xtrain.std()
Xtrain = (Xtrain - Xmean) / Xstd
Xtrain = Xtrain.fillna(0.0)
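
As a minimal sketch of why the fillna step is needed, the toy frame below (illustrative values, not taken from the dataset) has one constant column. Its standard deviation is 0, so the division gives 0/0 = NaN, and fillna turns that into 0, the mean of a standardized column.

```python
import pandas as pd

# Toy frame with assumed values: "fan.tension" is constant on purpose.
toy = pd.DataFrame({"T_cpu": [40.0, 50.0, 60.0], "fan.tension": [12.0, 12.0, 12.0]})

toy_mean = toy.mean()
toy_std = toy.std()  # pandas computes the sample standard deviation (ddof=1)

scaled = (toy - toy_mean) / toy_std  # the constant column becomes 0/0 = NaN
cleaned = scaled.fillna(0.0)         # NaN replaced by 0

print(scaled["fan.tension"].isna().all())
print(cleaned)
```

Note that pandas uses ddof=1 by default, so the scaled values differ slightly from those of a scaler that divides by the population standard deviation.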

In [58]:
Xmean.to_csv("data/broken_fan_classifier_mean.csv")
Xstd.to_csv("data/broken_fan_classifier_std.csv")
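
The saved statistics are meant to be reused when the classifier is applied to new data. The sketch below shows one possible way to read them back as Series and standardize a new case; the temporary file names and the toy values are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Toy training frame with assumed values, standing in for Xtrain before scaling.
train = pd.DataFrame({"T_cpu": [40.0, 50.0, 60.0], "fan.tension": [12.0, 12.0, 12.0]})
mean, std = train.mean(), train.std()

with tempfile.TemporaryDirectory() as tmp:
    mean_path = os.path.join(tmp, "mean.csv")  # hypothetical file names
    std_path = os.path.join(tmp, "std.csv")
    mean.to_csv(mean_path)
    std.to_csv(std_path)

    # index_col=0 restores the column names as index; squeeze turns the
    # single-column frame back into a Series.
    loaded_mean = pd.read_csv(mean_path, index_col=0).squeeze("columns")
    loaded_std = pd.read_csv(std_path, index_col=0).squeeze("columns")

# Standardize a new case with the training statistics.
new_case = pd.DataFrame({"T_cpu": [55.0], "fan.tension": [12.0]})
scaled = ((new_case - loaded_mean) / loaded_std).fillna(0.0)
print(scaled)
```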

We create ytrain, which contains the class of each case: we drop every column other than working.

In [59]:
ytrain = dataset.drop(["T_cpu", "fan.T_air", "fan.tension"], axis="columns")

In [60]:
# Convert Xtrain and ytrain to NumPy arrays so that they can be used later.
Xtrain = Xtrain.to_numpy()
ytrain = ytrain.to_numpy().ravel()
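
Dropping every feature column and flattening the result should be equivalent to selecting the working column directly. The sketch below checks this on a toy frame whose values are assumptions; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the dataset (values are made up).
toy = pd.DataFrame(
    {
        "T_cpu": [40.0, 85.0],
        "fan.T_air": [25.0, 60.0],
        "fan.tension": [12.0, 0.0],
        "working": [True, False],
    }
)

# Approach used in the notebook: drop the features, then flatten to 1D.
y_by_drop = toy.drop(["T_cpu", "fan.T_air", "fan.tension"], axis="columns").to_numpy().ravel()

# Alternative: select the label column by name.
y_by_name = toy["working"].to_numpy()

print(np.array_equal(y_by_drop, y_by_name))
```

Selecting by name keeps working if feature columns are later added to the dataset, whereas the drop list would need to be updated.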

We create the test sets: 
- Xtest: the features for which the neural network will predict a class
- ytest: the true classes of the test data, against which we compare the predictions.   

We drop the columns the same way we did for Xtrain and ytrain.   

In [61]:
Xtest = testset.drop("working", axis="columns")
ytest = testset.drop(["T_cpu", "fan.T_air", "fan.tension"], axis="columns")

We also standardize the values of Xtest, using the mean and standard deviation of **Xtrain**, not Xtest.

In [62]:
Xtest = (Xtest - Xmean) / Xstd
Xtest = Xtest.fillna(0)

In [63]:
# Convert Xtest and ytest to NumPy arrays.
Xtest = Xtest.to_numpy()
ytest = ytest.to_numpy().ravel()

## Create and train the neural network

Here we create our neural network using the MLPClassifier class from sklearn.   
solver is the algorithm used for weight optimization. There are 3 choices: lbfgs, an optimizer in the family of quasi-Newton methods; sgd, stochastic gradient descent; and adam, a stochastic gradient-based optimizer.   
random_state fixes the seed of the random number generator, so that we know whether neuralNetwork really improved after we modified it, rather than going from an unlucky to a lucky draw. You can try 99 to see a huge difference.   
The length of hidden_layer_sizes is the number of hidden layers, and its ith value is the number of neurons in the ith hidden layer.   
verbose=False hides the log of each iteration: if True, the loss is printed at each iteration.   
shuffle=True shuffles the samples at each iteration (only used by the sgd and adam solvers).   
max_iter is the maximum number of iterations if the training does not stop earlier (default: 200, try 300 to see it stop by itself).

In [64]:
neuralNetwork = MLPClassifier(
    solver="sgd",
    random_state=1,
    hidden_layer_sizes=(10, 10, 10, 10),
    verbose=False,
    shuffle=True,
    max_iter=200,
)
neuralNetwork.fit(Xtrain, ytrain)


Stochastic Optimizer: Maximum iterations (200) reached and the optimization hasn't converged yet.
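
To check whether training stopped by itself or hit max_iter, the fitted classifier exposes n_iter_ and, for the sgd and adam solvers, loss_curve_. The sketch below uses a small synthetic dataset (an assumption, not the notebook's data) and silences the convergence warning to keep the output short.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.neural_network import MLPClassifier

# Synthetic two-class data, only for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = X[:, 0] + X[:, 1] > 0

clf = MLPClassifier(
    solver="sgd",
    random_state=1,
    hidden_layer_sizes=(10, 10, 10, 10),
    max_iter=50,
)
with warnings.catch_warnings():
    warnings.simplefilter("ignore", ConvergenceWarning)
    clf.fit(X, y)

print("iterations run:", clf.n_iter_)
print("stopped before max_iter:", clf.n_iter_ < clf.max_iter)
print("layers (input + hidden + output):", clf.n_layers_)
print("last training loss:", clf.loss_curve_[-1])
```

If n_iter_ equals max_iter, raising max_iter or inspecting loss_curve_ can help decide whether the loss was still decreasing.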



Save trained model

In [65]:
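# Use a context manager so that the file is flushed and closed after writing.
with open("data/broken_fan_classifier.pkl", "wb") as model_file:
    pickle.dump(neuralNetwork, model_file)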
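
A saved model can be restored with pickle.load. The sketch below round-trips a small classifier through a temporary file and checks that predictions are unchanged; the data, the model size and the file name are assumptions for the example.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.neural_network import MLPClassifier

# Small synthetic problem, only to have a fitted model to save.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] > 0
model = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(5,), random_state=1, max_iter=200)
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier.pkl")  # hypothetical file name
    with open(path, "wb") as f:
        pickle.dump(model, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

same = np.array_equal(model.predict(X), restored.predict(X))
print("same predictions after reload:", same)
```

Pickle files should only be loaded from trusted sources, and a model is best reloaded with the same scikit-learn version that produced it.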

## Data study

We will predict the class of the data from Xtest and collect the probabilities for each class to study them.

In [66]:
y_prediction = neuralNetwork.predict(Xtest)  # the classes
y_prediction_probas = neuralNetwork.predict_proba(Xtest)  # the probability of each class

y_prediction_probas is the list of probabilities for each row of Xtest. Here it has 2 columns, one per class, each giving the probability of that class.

In [67]:
y_prediction_probas

array([[9.75055164e-01, 2.49448362e-02],
       [1.84046369e-04, 9.99815954e-01],
       [9.74989359e-01, 2.50106409e-02],
       ...,
       [6.97148214e-02, 9.30285179e-01],
       [9.79883181e-01, 2.01168194e-02],
       [6.97476263e-02, 9.30252374e-01]], shape=(2000, 2))
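
The column order of predict_proba follows the classes_ attribute of the fitted classifier. The sketch below, on synthetic data with boolean labels (an assumption mirroring the working column), shows how to look up which column belongs to which class instead of hard-coding 0 and 1.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic data with boolean labels, like the "working" column.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] > 0

clf = MLPClassifier(solver="lbfgs", hidden_layer_sizes=(5,), random_state=1, max_iter=500)
clf.fit(X, y)
proba = clf.predict_proba(X)

# classes_ is sorted, so False comes before True.
broken_col = list(clf.classes_).index(False)
working_col = list(clf.classes_).index(True)

print("classes:", clf.classes_)
print("broken column:", broken_col, "working column:", working_col)
print("rows sum to 1:", np.allclose(proba.sum(axis=1), 1.0))
```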

We create masks that select the test cases of each class.

In [68]:
maskNoProblem = testset["working"] == True
maskBroken = testset["working"] == False

We check the percentage of right answers of our model, which is one way to check whether it works. However, this does NOT tell us whether the neural network gave the predicted class a probability of 50.1% or of 100%. We only know which class has the higher probability.

In [69]:
("Percentage of right guesses:", neuralNetwork.score(Xtest, ytest) * 100)

('Percentage of right guesses:', 100.0)
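
The limitation of the score can be illustrated with two hypothetical models that both classify every case correctly but with very different confidence. All numbers below are made up for the example.

```python
import numpy as np

y_true = np.array([1, 1, 0, 0])

# Predicted probability of class 1 from two hypothetical models.
confident = np.array([0.99, 0.98, 0.02, 0.01])
hesitant = np.array([0.51, 0.55, 0.49, 0.45])


def accuracy(p_class1, y):
    """Fraction of cases where the most probable class is the true one."""
    return float(np.mean((p_class1 > 0.5).astype(int) == y))


def mean_confidence(p_class1, y):
    """Mean probability assigned to the true class."""
    p_true = np.where(y == 1, p_class1, 1.0 - p_class1)
    return float(np.mean(p_true))


for name, p in [("confident", confident), ("hesitant", hesitant)]:
    print(name, accuracy(p, y_true), mean_confidence(p, y_true))
```

Both models reach an accuracy of 1.0, but the mean probability given to the true class differs widely, which is what the statistics in the next cells look at.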

That is why we create and use a function that computes various statistics of these probabilities, in percent: median, mean, standard deviation, minimum and maximum.

In [70]:
def compute_stats(arr):
    return dict(
        median=np.median(arr) * 100,
        mean=np.mean(arr) * 100,
        std=np.std(arr) * 100,
        min=np.min(arr) * 100,
        max=np.max(arr) * 100,
    )

Stats for the broken CPU cases (probability given to the broken class).

In [71]:
compute_stats(y_prediction_probas[maskBroken][:, 0])

{'median': np.float64(98.10567201214351),
 'mean': np.float64(97.1999820322584),
 'std': np.float64(1.650556182400494),
 'min': np.float64(93.04225385344056),
 'max': np.float64(98.91777067222705)}

Stats for the working CPU cases (probability given to the working class).

In [72]:
compute_stats(y_prediction_probas[maskNoProblem][:, 1])

{'median': np.float64(99.81876954327235),
 'mean': np.float64(98.56397042548898),
 'std': np.float64(2.6635313647235646),
 'min': np.float64(92.11408678070192),
 'max': np.float64(99.98159536309423)}

We plot the probability as a function of the air temperature for each type of CPU.   
This is the probability that the neural network gives to the right class for each point of the test set.

In [73]:
# Each class uses the air temperature of its own test cases as x values.
temp_broken = testset.loc[maskBroken, "fan.T_air"]
temp_working = testset.loc[maskNoProblem, "fan.T_air"]
fig = go.Figure()
fig.add_trace(
    go.Scatter(
        x=temp_broken, y=y_prediction_probas[maskBroken][:, 0], mode="markers", name="Broken"
    )
)
fig.add_trace(
    go.Scatter(
        x=temp_working,
        y=y_prediction_probas[maskNoProblem][:, 1],
        mode="markers",
        name="No problem",
    )
)
fig.layout = go.Layout(
    title={
        "text": "Probability of the correct class as a function of air temperature",
        "font": {"size": 34},
        "x": 0.5,
    },
    width=1200,
    height=600,
    xaxis={"title": {"text": "Air temperature", "font": {"size": 20}}, "gridcolor": "#EBF0F8"},
    yaxis={
        "title": {"text": "Probability of the correct class", "font": {"size": 20}},
        "gridcolor": "#EBF0F8",
    },
    plot_bgcolor="white",
    hovermode="x",
)
fig.show()

Boxplots of the probability given to the correct class, for each class.

In [74]:
fig = go.Figure()
fig.add_trace(
    go.Box(
        y=y_prediction_probas[maskBroken][:, 0],
        name="Broken",
        jitter=0.3,
        pointpos=-1.8,
        boxpoints="all",  # represent all points
        marker_color="rgb(255, 0, 0)",
        line_color="rgb(255, 0, 0)",
    )
)
fig.add_trace(
    go.Box(
        y=y_prediction_probas[maskNoProblem][:, 1],
        name="Working",
        jitter=0.3,
        pointpos=-1.8,
        boxpoints="all",  # represent all points
        marker_color="rgb(0, 255, 0)",
        line_color="rgb(0, 255, 0)",
    )
)

In [75]:
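def percentageNumbers(first, second, classes=[True, False], masks=[maskBroken, maskNoProblem]):
    """Return, per class, the fraction of test cases whose correct-class probability is
    above first, between second and first, at most second, or whose prediction is wrong."""
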
    firstGroup = [0] * len(classes)
    secondGroup = [0] * len(classes)
    lowGroup = [0] * len(classes)
    wrongGroup = [0] * len(classes)
    numberPerClass = [k.value_counts()[True] for k in masks]
    print(numberPerClass)

    for k in range(len(masks)):
        for i in range(len(y_prediction[masks[k]])):
            # print(ytest[masks[k]][i], y_prediction[masks[k]][i], y_prediction_probas[masks[k]][i][k])
            if ytest[masks[k]][i] != y_prediction[masks[k]][i]:
                wrongGroup[k] += 1
            elif y_prediction_probas[masks[k]][i][k] > first:
                firstGroup[k] += 1
            elif y_prediction_probas[masks[k]][i][k] > second:
                secondGroup[k] += 1
            else:
                lowGroup[k] += 1
    return (
        np.array(firstGroup) / np.array(numberPerClass),
        np.array(secondGroup) / np.array(numberPerClass),
        np.array(lowGroup) / np.array(numberPerClass),
        np.array(wrongGroup) / np.array(numberPerClass),
    )


percentageNumbers(0.95, 0.90)

[np.int64(1000), np.int64(1000)]


(array([0.855, 0.812]),
 array([0.145, 0.188]),
 array([0., 0.]),
 array([0., 0.]))
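
The same binning can be written without explicit loops using boolean masks. The following is a sketch for a single class, not the notebook's function: confidence_bins is a hypothetical helper, and the probabilities are made up for the example.

```python
import numpy as np


def confidence_bins(p_correct, is_correct, first=0.95, second=0.90):
    """Fractions of cases in the high, middle, low and wrong groups.

    p_correct: probability given to the true class for each case.
    is_correct: whether the predicted class equals the true class.
    """
    p = np.asarray(p_correct)
    ok = np.asarray(is_correct, dtype=bool)
    high = ok & (p > first)
    mid = ok & (p > second) & ~high  # second < p <= first
    low = ok & (p <= second)
    wrong = ~ok
    return tuple(float(group.mean()) for group in (high, mid, low, wrong))


# Made-up probabilities for five cases, one of them misclassified.
p = np.array([0.99, 0.93, 0.97, 0.60, 0.85])
correct = np.array([True, True, True, False, True])
bins = confidence_bins(p, correct)
print(bins)
```

The four fractions always sum to 1, because every case falls in exactly one group.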

In [76]:
test = percentageNumbers(0.95, 0.90)

[np.int64(1000), np.int64(1000)]


In [77]:
fig = go.Figure(
    data=[
        go.Bar(name="> 95%", x=["Broken", "No problem"], y=test[0], text=test[0]),
        go.Bar(name="90% to 95%", x=["Broken", "No problem"], y=test[1], text=test[1]),
        go.Bar(name="<= 90%", x=["Broken", "No problem"], y=test[2], text=test[2]),
        go.Bar(name="Wrong", x=["Broken", "No problem"], y=test[3], text=test[3]),
    ]
)
# Change the bar mode
fig.update_layout(barmode="stack")
fig.show()

In [78]:
fig = go.Figure()
fig.add_trace(
    go.Histogram(
        x=y_prediction_probas[:, 1],
        histnorm="percent",
        xbins=dict(start=0, end=1.0, size=0.01),
        opacity=0.7,
    )
)

fig.update_layout(
    title_text="Distribution of the predicted probability of the working class",
    xaxis_title_text="Predicted probability of the working class",
    yaxis_title_text="Percentage of test cases",
    bargap=0.2,
    bargroupgap=0.1,
)