# Playing with Classifiers

## Integration test with Genetic Search
This section of the notebook checks that classifiers can be used with Genetic Search. It is essentially the pipe example for segmentors, but with classifiers.

In [1]:
from see.classifiers import Classifier
from see.classifier_fitness import ClassifierFitness
from see.Workflow import workflow


# Initialize Algorithm Space and Workflow
algorithm_space = Classifier.algorithmspace

workflow.addalgos([Classifier, ClassifierFitness])
wf = workflow()
print(wf)

<class 'see.Workflow.workflow'> parameters: 
	algorithm = MLP Neural Network
	max_iter = 200
	alpha = 0.0001
	max_depth = 1
	n_estimators = 100
	n_neighbors = 5



## Creating Trials
First, the provided dataset must be **preprocessed** before it enters the classifier pipeline.
We then split the dataset into a training set and a testing set. The **testing set** is **set aside**
until the very end, where it is used to compare the best solutions found by Genetic Search.
Solutions are compared in **side-by-side plots**.
The **training set** is **further split** into smaller trials that are used to tune each generation
of algorithms during the Genetic Search.
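
The two-stage split described above can be sketched with plain scikit-learn. This is only an illustration under stated assumptions: the internals of `helpers.generate_train_test_set` and `helpers.generate_tuning_trials` are not shown in this notebook and may differ. The names `X_demo`, `y_demo`, and `tuning_trials`, the dictionary layout, and the choices `test_size=0.2` and `n_splits=5` are hypothetical.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import KFold, train_test_split

# Same toy data generator as used in this notebook (100 samples by default).
X_demo, y_demo = make_moons(noise=0.3, random_state=10)

# Stage 1: hold out a testing set that the search never sees.
# test_size and random_state are illustrative choices, not the helpers' defaults.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# Stage 2: split the training set into folds. Each fold yields one tuning
# trial, made of a portion to fit on and a portion to validate on.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
tuning_trials = []
for fit_idx, val_idx in kfold.split(X_train):
    tuning_trials.append(
        {
            "fit": (X_train[fit_idx], y_train[fit_idx]),
            "validate": (X_train[val_idx], y_train[val_idx]),
        }
    )

print(len(X_train), len(X_test), len(tuning_trials))
```

Keeping the testing set out of both stages means that scores reported at the end are not influenced by the tuning process.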

In [2]:
from sklearn.model_selection import train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from see.base_classes import pipedata
from see.classifier_helpers import helpers

# Import/Create data
X, y = make_moons(noise=0.3, random_state=10)
#X, y = make_circles(noise=0.2, factor=0.5, random_state=1)

# Preprocess data
X = StandardScaler().fit_transform(X)

# Split data into training and testing sets
dataset = helpers.generate_train_test_set(X, y)

# Split training data into trials
trials = helpers.generate_tuning_trials(dataset.training_set.X, dataset.training_set.y)

In [3]:
# Example run of the default Classifier on the generated trials
individual = Classifier()
data = individual.runAlgo(trials)
print(data)
print("The fitness score of the default classifier: ", ClassifierFitness().pipe(data).fitness)

data <see.base_classes.pipedata object at 0x7fc08f5d8640>
<class 'see.classifiers.Classifier'> parameters: 
	algorithm = MLP Neural Network
	max_iter = 200
	alpha = 0.0001
	max_depth = 1
	n_estimators = 100
	n_neighbors = 5

<see.base_classes.pipedata object at 0x7fc08f5d8640>
Time: 1.248 s
<see.base_classes.pipedata object at 0x7fc08f5d8640>
The fitness score of the default classifier:  0.09999999999999999


## Genetic Search with Classifiers

In [4]:
from see import GeneticSearch

In [5]:
mydata = trials

In [6]:
my_evolver = GeneticSearch.Evolver(workflow, mydata, pop_size=30)

In [None]:
# warnings may appear when this runs
population = my_evolver.run(ngen=15)

Initializing a new random population
Generation 0/15 of population size 30
data <see.base_classes.pipedata object at 0x7fc08e95a400>
<class 'see.Workflow.workflow'> parameters: 
	algorithm = Random Forest
	max_iter = 746
	alpha = 0.9226
	max_depth = 24
	n_estimators = 390
	n_neighbors = 311

<see.base_classes.pipedata object at 0x7fc08e95a400>
Time: 3.373 s
fitness=0.05

data <see.base_classes.pipedata object at 0x7fc08e95a460>
<class 'see.Workflow.workflow'> parameters: 
	algorithm = K Nearest Neighbors
	max_iter = 88
	alpha = 0.1997
	max_depth = 329
	n_estimators = 684
	n_neighbors = 716

<see.base_classes.pipedata object at 0x7fc08e95a460>
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
Time: 0.011 s
fitness=0.4

data <see.base_classes.pipedata object at 0x7fc08e95a3a0>
<class 'see.Workflow.workflow'> parameters: 
	algorithm = Decision Tree
	max_iter = 659
	alpha = 0.0454
	max_depth = 393
	n_estimators = 33
	n_neighbors = 863


Time: 16.069 s
fitness=0.06666666666666667

data <see.base_classes.pipedata object at 0x7fc08e97eaf0>
<class 'see.Workflow.workflow'> parameters: 
	algorithm = K Nearest Neighbors
	max_iter = 594
	alpha = 0.5936
	max_depth = 830
	n_estimators = 18
	n_neighbors = 964

<see.base_classes.pipedata object at 0x7fc08e97eaf0>
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
RUNNING EVAL FOR KNN
Time: 0.069 s
fitness=0.05

data <see.base_classes.pipedata object at 0x7fc08e97ef10>
<class 'see.Workflow.workflow'> parameters: 
	algorithm = Gaussian Naive Bayes
	max_iter = 280
	alpha = 0.169
	max_depth = 933
	n_estimators = 124
	n_neighbors = 77

<see.base_classes.pipedata object at 0x7fc08e97ef10>
RUNNING EVAL FOR GNB
RUNNING EVAL FOR GNB
RUNNING EVAL FOR GNB
RUNNING EVAL FOR GNB
RUNNING EVAL FOR GNB
Time: 0.023 s
fitness=0.09999999999999999

data <see.base_classes.pipedata object at 0x7fc08e983370>
<class 'see.Workflow.workflow'> parameters: 
	algorithm = Rando

In [None]:
print(my_evolver.hof) # The best solution in each generation
print(my_evolver.best_avgs) # The fitness score of the best of each generation

## Final Comparison
We compare the solutions found by the search using the testing set that was set aside at the very beginning.
The classifiers are trained on the full training set that was created at the same time.

*TODO*: I'm not sure if it is correct to retrain the classifiers using the entire training
set. There might be issues with overfitting...
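
One way to probe the overfitting concern above is to compare the error on the training set with the error on the held-out testing set: a large gap suggests the model has memorized the training data. The sketch below uses a plain scikit-learn decision tree as a stand-in, not the `Classifier` wrapper from this notebook. The names `X_fit`, `X_hold`, `train_error`, and `test_error` are hypothetical, and treating the score as an error rate (1 minus accuracy) is an assumption.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_moons(noise=0.3, random_state=10)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# A tree with no depth limit keeps splitting until its leaves are pure,
# so it is a convenient example of a model that can overfit.
tree = DecisionTreeClassifier(random_state=0).fit(X_fit, y_fit)

# Error rate = 1 - accuracy on each set.
train_error = 1.0 - tree.score(X_fit, y_fit)
test_error = 1.0 - tree.score(X_hold, y_hold)

print(f"train error: {train_error:.3f}")
print(f"test error:  {test_error:.3f}")
print(f"gap:         {test_error - train_error:.3f}")
```

As long as the testing set played no part in the search, refitting on the full training set does not leak information into the final comparison. The gap printed here only indicates how well the fitted model generalizes.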

*TODO*: The plots below work nicely because we have two-dimensional data. Think about what to
do when the feature space has more than two dimensions.
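
One possible answer to the TODO above, sketched under assumptions rather than proposed as the final design: train the classifier in the full feature space and use a PCA projection only to obtain 2-D coordinates for the scatter plots. The example uses scikit-learn's `PCA` and `KNeighborsClassifier` directly instead of the `see` classes; the six-feature dataset and the names `X_hd`, `X_te_2d`, and `wrong` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical six-feature dataset standing in for a multidimensional problem.
X_hd, y_hd = make_classification(
    n_samples=200, n_features=6, n_informative=4, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_hd, y_hd, test_size=0.25, random_state=0
)

# Train and predict in the original six-dimensional space.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
pred = knn.predict(X_te)

# Fit the projection on the training set only, then project the testing
# points to two components purely for plotting.
pca = PCA(n_components=2).fit(X_tr)
X_te_2d = pca.transform(X_te)

# Boolean mask of misclassified testing points, usable for circling errors.
wrong = pred != y_te
print(X_te_2d.shape, int(wrong.sum()))
```

The columns of `X_te_2d` can take the place of `testing_set.X[:, 0]` and `testing_set.X[:, 1]` in the scatter calls. The filled decision contours do not carry over directly, because the classifier's decision boundary lives in the original feature space and a 2-D mesh grid covers only a slice of it.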

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

solution_set = my_evolver.hof

# Use stored away testing set and the entire training set for final comparisons
training_set = dataset.training_set
testing_set = dataset.testing_set

# TODO: Algorithms may be non-deterministic so fitness scores may change
# on different plot iterations...?

# Color Maps
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

# Create Mesh Grid
h = 0.02
x_min, x_max = testing_set.X[:, 0].min() - .5, testing_set.X[:, 0].max() + .5
y_min, y_max = testing_set.X[:, 1].min() - .5, testing_set.X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                  np.arange(y_min, y_max, h))

figure = plt.figure(figsize=(20, 10))

ax = plt.subplot(2,round(len(solution_set)/2) + 1,1)

# Plot the training points and testing points (Input data)
#ax.scatter(training_set.X[:, 0], training_set.X[:, 1], c=training_set.y, cmap=cm_bright,
#           edgecolors='k')
ax.scatter(testing_set.X[:, 0], testing_set.X[:, 1], c=testing_set.y, cmap=cm_bright, alpha=0.6,
           edgecolors='k')

# Format plot
ax.set_title("Input data")
ax.set_xlim(xx.min(), xx.max())
ax.set_ylim(yy.min(), yy.max())
ax.set_xticks(())
ax.set_yticks(())

for i, ind in enumerate(solution_set):
    # Initialize classifier
    algo_name = ind[0]
    param_list = ind
    clf = Classifier.algorithmspace[algo_name](param_list)
    
    predictions = clf.evaluate(training_set, testing_set)

    ax = plt.subplot(2, round(len(solution_set)/2) + 1,i + 2)

    # Plot predictions and fitness score
    ax.scatter(testing_set.X[:, 0], testing_set.X[:, 1], c=predictions, cmap=cm_bright, alpha=0.6,
               edgecolors='k')
    score = ClassifierFitness().evaluate(predictions, testing_set.y)
    
    ax.text(xx.max() - .3, yy.min() + .3, ('%.2f' % score).lstrip('0'),
            size=15, horizontalalignment='right')

    # Draw Contour/Color Plot
    # TODO: We have to retrain the classifier here because the model
    # fitted in clf.evaluate is not stored anywhere.
    # Consider storing it as an attribute.

    contour_set = pipedata()
    contour_set.X = np.c_[xx.ravel(), yy.ravel()]
    Z = clf.evaluate(training_set, contour_set)
    Z = Z.reshape(xx.shape)
    
    ax.contourf(xx, yy, Z, cmap=cm, alpha=.4)
    
    # Highlight/circle errors (all misclassified points in a single call)
    errors = testing_set.X[predictions != testing_set.y]
    ax.scatter(errors[:, 0], errors[:, 1], marker='o', s=150, lw=1,
               edgecolors="green", facecolors="none")

    # Format plot
    ax.set_title(algo_name)

    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    ax.set_xticks(())
    ax.set_yticks(())

plt.tight_layout()

In [None]:
print(my_evolver.hof[0])  # first entry in the hall of fame
print(my_evolver.hof)