# Introduction
This notebook classifies data from the [GECCO 2017 competition](http://www.spotseven.de/gecco/gecco-challenge/gecco-challenge-2017/) ([Rules](https://notebooks.azure.com/n/UxScBeYo9pM/files/rulesGeccoIc2017.pdf)).

---

# Code
## Import packages/modules
The following packages/modules are needed to load and visualise the data.
- **pandas:** load the CSV files;
- **matplotlib.pyplot:** plot data;
- **numpy:** array manipulation.

In [20]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

## Load Training Data
The data is loaded with the *read_csv* function from the **pandas** package. After that, columns with incorrect data types are converted.

**Note:** This step is covered in more detail in **[01 - load and view](https://notebooks.azure.com/n/vGEm8hmoKOg/notebooks/01%20-%20load%20and%20view.ipynb)**.

In [21]:
# Load data
train_data = pd.read_csv('waterDataTraining.csv', delimiter=',')

In [22]:
# Convert first column to Date-Time
train_data['Time'] = pd.to_datetime(train_data["Time"])

In [23]:
# Convert last column to Boolean
train_data["Event (boolean)"] = train_data["Event (boolean)"] == " TRUE"
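
As a hedged illustration of the two conversions above, the sketch below applies them to a small made-up frame. The sample rows are hypothetical; only the column names `Time` and `Event (boolean)` come from this notebook. It strips whitespace before the comparison, so that both `" TRUE"` and `"TRUE"` map to `True`, which is an assumption about how the raw labels may be formatted.

```python
import pandas as pd

# Hypothetical rows standing in for waterDataTraining.csv
sample = pd.DataFrame({
    "Time": ["2016-08-03 09:49:00", "2016-08-03 09:50:00", "2016-08-03 09:51:00"],
    "Event (boolean)": [" FALSE", " TRUE", "TRUE"],
})

# Parse the whole column into datetime values in one call
sample["Time"] = pd.to_datetime(sample["Time"])

# Strip surrounding whitespace, then compare with the literal "TRUE"
sample["Event (boolean)"] = sample["Event (boolean)"].str.strip() == "TRUE"

print(sample.dtypes)
```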

## Prepare Training Data
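Missing values are imputed with the column mean, each feature is normalized to [0, 1], and the data is converted to arrays.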

In [25]:
# Impute missing values with the column mean
# (assigning the result back avoids calling fillna in place on a column selection)
for column in ["Chlorine dioxide (mg/L)", "PH value", "Redox potential (mV)"]:
    train_data[column] = train_data[column].fillna(train_data[column].mean())
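
To make the effect of mean imputation concrete, here is a minimal sketch on a made-up frame. The values are hypothetical; the column names match the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps
toy = pd.DataFrame({
    "PH value": [8.0, np.nan, 8.4],
    "Redox potential (mV)": [750.0, 760.0, np.nan],
})

# Replace each NaN with the mean of the observed values in its column
for column in ["PH value", "Redox potential (mV)"]:
    toy[column] = toy[column].fillna(toy[column].mean())

print(toy)
```

Each gap is filled with the average of the remaining values in the same column, so the column mean itself does not change.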

In [26]:
# Create input array
X1 = np.array(train_data['Chlorine dioxide (mg/L)'])
X2 = np.array(train_data['PH value'])
X3 = np.array(train_data['Redox potential (mV)'])

In [27]:
# Normalize data to [0, 1]
X1 = (X1 - X1.min())/(X1.max() - X1.min())
X2 = (X2 - X2.min())/(X2.max() - X2.min())
X3 = (X3 - X3.min())/(X3.max() - X3.min())

In [28]:
# Create input 2D array
X_train = np.array([X1, X2, X3]).T

# Create output array
Y_train = np.array(train_data['Event (boolean)'])

## Load Test Data

In [29]:
# Load data
test_data = pd.read_csv('waterDataTesting.csv', delimiter=',')

In [30]:
# Convert first column to Date-Time
test_data['Time'] = pd.to_datetime(test_data["Time"])

In [31]:
# Convert fourth column to float64
test_data['PH value'] = pd.to_numeric(test_data["PH value"], errors='coerce')

In [32]:
# Convert fifth column to float64
test_data['Redox potential (mV)'] = pd.to_numeric(test_data["Redox potential (mV)"], errors='coerce')

In [33]:
# Convert sixth column to float64
test_data['Electric conductivity (uS/cm)'] = pd.to_numeric(test_data["Electric conductivity (uS/cm)"], errors='coerce')

## Prepare Test Data

In [35]:
# Impute missing values with the column mean
# (assigning the result back avoids calling fillna in place on a column selection)
for column in ["Chlorine dioxide (mg/L)", "PH value", "Redox potential (mV)"]:
    test_data[column] = test_data[column].fillna(test_data[column].mean())

In [36]:
# Create input array
X1 = np.array(test_data['Chlorine dioxide (mg/L)'])
X2 = np.array(test_data['PH value'])
X3 = np.array(test_data['Redox potential (mV)'])

In [37]:
# Normalize data to [0, 1]
X1 = (X1 - X1.min())/(X1.max() - X1.min())
X2 = (X2 - X2.min())/(X2.max() - X2.min())
X3 = (X3 - X3.min())/(X3.max() - X3.min())
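
The cell above scales each test feature with the minimum and maximum of the test set itself. As a sketch of one alternative (an assumption, not what this notebook does), the range can be taken from the training data and reused, so that both sets share the same scale. `min_max_scale` is a hypothetical helper and the arrays hold made-up values.

```python
import numpy as np

def min_max_scale(values, lo, hi):
    """Scale values linearly so that lo maps to 0 and hi maps to 1."""
    return (values - lo) / (hi - lo)

# Made-up feature values for illustration
train_feature = np.array([2.0, 4.0, 6.0])
test_feature = np.array([3.0, 7.0])

# Take the range from the training data only
lo, hi = train_feature.min(), train_feature.max()

train_scaled = min_max_scale(train_feature, lo, hi)
test_scaled = min_max_scale(test_feature, lo, hi)

print(train_scaled)  # [0.  0.5 1. ]
print(test_scaled)   # [0.25 1.25]
```

With this choice, test values outside the training range fall outside [0, 1], as the second output shows.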

In [38]:
# Create input 2D array
X_test = np.array([X1, X2, X3]).T

# Create output array
Y_test = np.array(test_data['Event (boolean)'])

## Estimate efficiency
A helper function is defined to quickly estimate how well a model performs.

In [48]:
# Import confusion matrix function
from sklearn.metrics import confusion_matrix

# Function to estimate an estimator's performance
def my_efficiency(Y, y):

    # Calculate confusion matrix
    tn, fp, fn, tp = confusion_matrix(Y, y).ravel()

    # Calculate accuracy
    acc = (tp + tn) / (tp + fp + tn + fn)

    # Calculate sensitivity (True Positive Rate)
    tpr = tp / (tp + fn)

    # Calculate specificity (True Negative Rate)
    tnr = tn / (fp + tn)

    # Calculate precision (Positive predictive value)
    ppv = tp / (tp + fp)

    # Calculate negative predictive value
    npv = tn / (fn + tn)
    
    # Calculate the F1 score (used for the GECCO 2017 result)
    f1 = 2 * (ppv * tpr) / (ppv + tpr)
    
    # Print values (first row: TP, FP; second row: FN, TN)
    print("Confusion Matrix:")
    print("\t" + str(tp) + "\t" + str(fp))
    print("\t" + str(fn) + "\t" + str(tn))
    print("\n")
    
    print("Accuracy = " + str(acc*100) + "%")
    print("F1 score = " + str(f1))
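
The quantities computed in `my_efficiency` can be checked on a small example. The sketch below uses made-up labels, unpacks the confusion matrix in the same order as above, and compares the hand-computed F1 score with `sklearn.metrics.f1_score`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Made-up labels: 1 marks an event, 0 marks normal operation
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(f1, f1_score(y_true, y_pred))
```

If a model predicts no events at all, `tp + fp` is zero and the precision is undefined, so a guard for that case may be worth adding before relying on the printed F1 score.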

## Classification
Classification methods are used to detect the events in the time series.

Several methods are tested here, based on the [scikit-learn notebooks](https://github.com/jakevdp/sklearn_tutorial/blob/master/notebooks/02.2-Basic-Principles.ipynb).
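
Every method in this section follows the same pattern: initialize the estimator, fit it on the training data, predict on the test data, and score the prediction. A minimal sketch of that pattern as a loop is shown here. It runs on synthetic, imbalanced data from `make_classification` rather than the GECCO files, and the `estimators` dictionary and all parameter values are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 3 features, roughly 10% positive samples
X, Y = make_classification(n_samples=2000, n_features=3, n_informative=3,
                           n_redundant=0, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, stratify=Y, random_state=0)

# Hypothetical selection of estimators to compare
estimators = {
    "logistic regression": LogisticRegression(),
    "k nearest neighbors": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

# Fit each estimator, predict on the held-out data, and record the F1 score
scores = {}
for name, model in estimators.items():
    model.fit(X_tr, Y_tr)
    scores[name] = f1_score(Y_te, model.predict(X_te))

for name, score in scores.items():
    print(f"{name}: F1 = {score:.3f}")
```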

### Logistic regression
Logistic regression achieved the best result in the 2017 GECCO challenge: Fitore Muharremi obtained an F1 score of 0.441 with it.

In [59]:
# Import Logistic Regression estimator
from sklearn import linear_model

# Initialize model
model = linear_model.LogisticRegression(C=1e5)

# fit the model on our data
model.fit(X_train, Y_train)

# Predict output
y = model.predict(X_test)

# Estimate efficiency
my_efficiency(Y_test, y)

Confusion Matrix:
	491	0
	2277	241900


Accuracy = 99.0693511207%
F1 score = 0.301319423136
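
The gap between the high accuracy and the low F1 score can be traced by hand from the confusion matrix printed above. The short check below only reuses those four counts.

```python
# Counts as printed in the confusion matrix above
tp, fp = 491, 0
fn, tn = 2277, 241900

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # 1.0, since there are no false positives
recall = tp / (tp + fn)      # fraction of the true events that were detected
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy = {accuracy:.4f}")  # accuracy = 0.9907
print(f"recall   = {recall:.4f}")    # recall   = 0.1774
print(f"F1       = {f1:.4f}")        # F1       = 0.3013
```

Only 2768 of the 244668 test samples are events, so the true negatives dominate the accuracy, while the F1 score reflects that most events are missed.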


### Support Vector Machine
Fitore Muharremi also used a support vector machine, with a linear kernel. The cell below uses an RBF kernel instead.

In [67]:
# Import Support vector machine estimator
from sklearn import svm

# Initialize model
model = svm.SVC(kernel='rbf')

# fit the model on our data
model.fit(X_train, Y_train)

# Predict output
y = model.predict(X_test)

# Estimate efficiency
my_efficiency(Y_test, y)

Confusion Matrix:
	361	0
	2407	241900


Accuracy = 99.0162178953%
F1 score = 0.230744646852


### K Nearest Neighbors (KNN)
K nearest neighbors (kNN) is one of the simplest learning strategies: given a new, unknown observation, look up in your reference database which ones have the closest features and assign the predominant class.
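
The lookup described above can be sketched in a few lines of NumPy. `knn_predict` is a hypothetical helper written for illustration, using Euclidean distance and a simple majority vote; the reference points are made up. `KNeighborsClassifier` from scikit-learn is what the notebook actually uses.

```python
import numpy as np

def knn_predict(X_ref, y_ref, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest reference points."""
    # Euclidean distance from x_new to every reference observation
    distances = np.linalg.norm(X_ref - x_new, axis=1)
    # Indices of the k closest observations
    nearest = np.argsort(distances)[:k]
    # Majority class among those neighbours
    classes, counts = np.unique(y_ref[nearest], return_counts=True)
    return classes[np.argmax(counts)]

# Made-up 2D reference points with boolean labels
X_ref = np.array([[0.1, 0.1], [0.2, 0.1], [0.15, 0.2], [0.9, 0.8], [0.8, 0.9]])
y_ref = np.array([False, False, False, True, True])

print(knn_predict(X_ref, y_ref, np.array([0.12, 0.15])))  # False
print(knn_predict(X_ref, y_ref, np.array([0.85, 0.85])))  # True
```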

In [55]:
# Import KNN estimator
from sklearn import neighbors

# Initialize model
model = neighbors.KNeighborsClassifier(n_neighbors=3)

# fit the model on our data
model.fit(X_train, Y_train)

# Predict output
y = model.predict(X_test)

# Estimate efficiency
my_efficiency(Y_test, y)

Confusion Matrix:
	1130	4332
	1638	237568


Accuracy = 97.5599588013%
F1 score = 0.274605103281


### Decision Tree

In [51]:
# Import Decision Tree estimator
from sklearn import tree

# Initialize model
model = tree.DecisionTreeClassifier()

# fit the model on our data
model.fit(X_train, Y_train)

# Predict output
y = model.predict(X_test)

# Estimate efficiency
my_efficiency(Y_test, y)

Confusion Matrix:
	1199	4723
	1569	237177


Accuracy = 97.4283518891%
F1 score = 0.275949367089


### Artificial Neural Networks (multilayer perceptron)

In [68]:
# Import MLP estimator
from sklearn.neural_network import MLPClassifier

# Initialize model
model = MLPClassifier(activation='tanh', solver='adam', alpha=1e-3, hidden_layer_sizes=(3, 1), random_state=1)

# fit the model on our data
model.fit(X_train, Y_train)

# Predict output
y = model.predict(X_test)

# Estimate efficiency
my_efficiency(Y_test, y)

Confusion Matrix:
	845	14
	1923	241886


Accuracy = 99.2083149411%
F1 score = 0.465949820789
