# Introduction
This notebook focuses on optimizing the classifier for the [GECCO 2017 competition](http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2017/) ([Rules](https://notebooks.azure.com/n/UxScBeYo9pM/files/rulesGeccoIc2017.pdf)).

---

# Code
## Import packages/modules
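Some packages/modules are necessary for loading the data, training the classifier, and optimizing its hyperparameters.
- **pandas:** load the CSV files;
- **matplotlib.pyplot:** plot data;
- **numpy:** array manipulation;
- **MLPClassifier (scikit-learn):** neural network classifier;
- **confusion_matrix (scikit-learn):** classification metrics;
- **differential_evolution (scipy):** optimization.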

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix
from scipy.optimize import differential_evolution

## Load Training Data
Data is loaded using the *read_csv* function from the **pandas** package. After that, we convert the columns whose data types were inferred incorrectly.

**Note:** This step is covered in more detail in **[01 - load and view](https://notebooks.azure.com/n/vGEm8hmoKOg/notebooks/01%20-%20load%20and%20view.ipynb)**.

In [2]:
# Load data
train_data = pd.read_csv('../data/waterDataTraining.csv', delimiter=',')

In [3]:
# Convert first column to Date-Time
train_data['Time'] = pd.to_datetime(train_data["Time"])

In [4]:
# Convert last column to Boolean
train_data["Event (boolean)"] = train_data["Event (boolean)"] == " TRUE"
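The comparison above depends on the exact string `" TRUE"`, including its leading space. As a minimal sketch, assuming the column may contain inconsistent whitespace or letter case, the values can be normalized before comparing. The sample series `raw` is hypothetical and is not taken from the competition data.

```python
import pandas as pd

# Hypothetical sample mimicking an "Event (boolean)" column with stray spaces
raw = pd.Series([" TRUE", "FALSE", " FALSE", "TRUE "])

# Strip surrounding whitespace and normalize case before comparing
events = raw.astype(str).str.strip().str.upper() == "TRUE"
print(events.tolist())
```

This keeps the conversion working if the file is later exported without the leading space.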

## Prepare Training Data
Impute missing values, normalize each feature to [0, 1], and convert the data to NumPy arrays.

In [5]:
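# Impute missing values with the column mean
# (assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas)
train_data["Chlorine dioxide (mg/L)"] = train_data["Chlorine dioxide (mg/L)"].fillna(train_data["Chlorine dioxide (mg/L)"].mean())
train_data["PH value"] = train_data["PH value"].fillna(train_data["PH value"].mean())
train_data["Redox potential (mV)"] = train_data["Redox potential (mV)"].fillna(train_data["Redox potential (mV)"].mean())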
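The same mean imputation can be written once for all feature columns. The following is a minimal sketch on a small hypothetical DataFrame; the values and the name `feature_columns` are illustrative and do not come from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values in two of the sensor columns
df = pd.DataFrame({
    "PH value": [8.0, np.nan, 9.0],
    "Redox potential (mV)": [750.0, 760.0, np.nan],
})

feature_columns = ["PH value", "Redox potential (mV)"]

# DataFrame.fillna accepts a Series of per-column fill values,
# so each column is filled with its own mean
df[feature_columns] = df[feature_columns].fillna(df[feature_columns].mean())
print(df)
```

Handling the columns together avoids repeating each column name three times per line.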

In [6]:
# Create input array
X1 = np.array(train_data['Chlorine dioxide (mg/L)'])
X2 = np.array(train_data['PH value'])
X3 = np.array(train_data['Redox potential (mV)'])

In [7]:
# Normalize data to [0, 1]
X1 = (X1 - X1.min())/(X1.max() - X1.min())
X2 = (X2 - X2.min())/(X2.max() - X2.min())
X3 = (X3 - X3.min())/(X3.max() - X3.min())

In [8]:
# Create input 2D array
X_train = np.array([X1, X2, X3]).T

# Create output array
Y_train = np.array(train_data['Event (boolean)'])

## Load Test Data

In [9]:
# Load data
test_data = pd.read_csv('../data/waterDataTesting.csv', delimiter=',')

In [10]:
# Convert first column to Date-Time
test_data['Time'] = pd.to_datetime(test_data["Time"])

In [11]:
# Convert fourth column to float64
test_data['PH value'] = pd.to_numeric(test_data["PH value"], errors='coerce')

In [12]:
# Convert fifth column to float64
test_data['Redox potential (mV)'] = pd.to_numeric(test_data["Redox potential (mV)"], errors='coerce')

In [13]:
# Convert sixth column to float64
test_data['Electric conductivity (uS/cm)'] = pd.to_numeric(test_data["Electric conductivity (uS/cm)"], errors='coerce')

## Prepare Test Data

In [14]:
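# Impute missing values with the column mean
# (assign the result back: fillna(inplace=True) on a selected column is deprecated in recent pandas)
test_data["Chlorine dioxide (mg/L)"] = test_data["Chlorine dioxide (mg/L)"].fillna(test_data["Chlorine dioxide (mg/L)"].mean())
test_data["PH value"] = test_data["PH value"].fillna(test_data["PH value"].mean())
test_data["Redox potential (mV)"] = test_data["Redox potential (mV)"].fillna(test_data["Redox potential (mV)"].mean())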

In [15]:
# Create input array
X1 = np.array(test_data['Chlorine dioxide (mg/L)'])
X2 = np.array(test_data['PH value'])
X3 = np.array(test_data['Redox potential (mV)'])

In [16]:
# Normalize data to [0, 1]
X1 = (X1 - X1.min())/(X1.max() - X1.min())
X2 = (X2 - X2.min())/(X2.max() - X2.min())
X3 = (X3 - X3.min())/(X3.max() - X3.min())

In [17]:
# Create input 2D array
X_test = np.array([X1, X2, X3]).T

# Create output array
Y_test = np.array(test_data['Event (boolean)'])
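The cells above scale the test features with the test set's own minimum and maximum. A common variation, not used in this notebook, is to reuse the training-set statistics so that both sets share one scale. The sketch below illustrates that variation on hypothetical arrays; the helper names `fit_min_max` and `apply_min_max` are assumptions introduced here.

```python
import numpy as np

def fit_min_max(X):
    """Return the per-column minimum and maximum of a 2D array."""
    return X.min(axis=0), X.max(axis=0)

def apply_min_max(X, x_min, x_max):
    """Scale X with previously computed statistics; constant columns map to 0."""
    span = np.where(x_max > x_min, x_max - x_min, 1.0)
    return (X - x_min) / span

# Hypothetical feature matrices (rows are samples, columns are features)
train = np.array([[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]])
test = np.array([[1.0, 15.0], [5.0, 30.0]])

x_min, x_max = fit_min_max(train)
train_scaled = apply_min_max(train, x_min, x_max)
test_scaled = apply_min_max(test, x_min, x_max)
print(test_scaled)
```

With this approach, test values outside the training range fall outside [0, 1], which is expected.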

## Artificial neural networks model evaluation
An artificial neural network (ANN) classifies the events in the series. The F1 score is calculated from the resulting confusion matrix, and the function returns 1 - F1 so that the optimizer can minimize it.

In [18]:
def my_evaluation(x):

    # Initialize model
    model = MLPClassifier(activation='tanh', solver='adam', alpha=x[0], hidden_layer_sizes=(int(x[1]), int(x[2])), random_state=1)

    # fit the model on our data
    model.fit(X_train, Y_train)

    # Predict output
    y = model.predict(X_test)
    
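    # Calculate confusion matrix
    tn, fp, fn, tp = confusion_matrix(Y_test, y).ravel()

    # Calculate sensitivity (True Positive Rate); 0 if there are no positive samples
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0

    # Calculate precision (Positive Predictive Value); 0 if nothing was predicted positive
    ppv = tp / (tp + fp) if (tp + fp) > 0 else 0.0

    # Calculate GECCO 2017 result with F1 score; 0 if precision and sensitivity are both 0
    f1 = 2 * (ppv * tpr) / (ppv + tpr) if (ppv + tpr) > 0 else 0.0

    # Return the loss to minimize (lower is better)
    return 1 - f1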
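To make the F1 arithmetic concrete, here is a small worked example on hypothetical labels, cross-checked against scikit-learn's `f1_score`. The label arrays are invented for illustration, and passing `labels=[0, 1]` is an addition not present in the notebook's cell.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels (1 = event, 0 = no event), for illustration only
y_true = np.array([0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 1, 1, 1, 0, 0])

# labels=[0, 1] fixes the 2x2 layout even if one class is absent
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)  # sensitivity (recall)
ppv = tp / (tp + fp)  # precision
f1_manual = 2 * (ppv * tpr) / (ppv + tpr)

# Cross-check against scikit-learn's built-in implementation
f1_builtin = f1_score(y_true, y_pred)
print(f1_manual, f1_builtin)
```

Here the counts are 2 true positives, 1 false positive, and 1 false negative, so precision and sensitivity are both 2/3, and so is F1.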

## Differential Evolution
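
As a rough illustration of how `differential_evolution` is called, the sketch below minimizes a simple hypothetical function whose minimum is known to be at (1, 1). The function `sphere`, its bounds, and the fixed `seed` are assumptions chosen for the example and are unrelated to the classifier.

```python
import numpy as np
from scipy.optimize import differential_evolution

def sphere(x):
    """Hypothetical objective: squared distance from the point (1, 1)."""
    return float(np.sum((x - 1.0) ** 2))

# One (lower, upper) pair per parameter
toy_bounds = [(-5, 5), (-5, 5)]

# seed makes this toy run repeatable
toy_result = differential_evolution(sphere, toy_bounds, maxiter=50, popsize=10, seed=0)
print(toy_result.x, toy_result.fun)
```

The returned object exposes the best parameter vector as `x` and the corresponding objective value as `fun`, which are the two attributes this notebook reads.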

In [19]:
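# Optimizer settings
maxiter = 10
popsize = 10

# Search bounds: (alpha, neurons in 1st hidden layer, neurons in 2nd hidden layer)
bounds = [(0, 1e-1), (1, 100), (1, 10)]

# Minimize 1 - F1 over the bounds above
result = differential_evolution(my_evaluation, bounds, maxiter=maxiter, popsize=popsize)
result.x, result.fun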

(array([  4.45229911e-04,   8.01879614e+01,   9.54792776e+00]),
 0.52404105888708807)
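
The output above can be mapped back to the model settings. The sketch below copies the reported values and applies the same `int()` truncation used in `my_evaluation`; the variable names are introduced here for illustration.

```python
import numpy as np

# Values copied from the optimizer output above
best_x = np.array([4.45229911e-04, 8.01879614e+01, 9.54792776e+00])
best_loss = 0.52404105888708807

alpha = float(best_x[0])
# int() truncates toward zero, as in my_evaluation
hidden_layer_sizes = (int(best_x[1]), int(best_x[2]))
# The objective is 1 - F1, so the F1 score is recovered by inverting it
best_f1 = 1.0 - best_loss
print(alpha, hidden_layer_sizes, round(best_f1, 4))
```

Under these values, the best configuration found uses two hidden layers of 80 and 9 neurons, with an F1 score of about 0.476.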