## Run TPOT
Run TPOT's genetic-algorithm search to find a good classifier for this data. Then inspect the
results manually to choose the best trade-off between classifier complexity and accuracy: look along the 'Pareto front' and pick a pipeline by hand.

In [1]:
from ErrorML.ErrorML import *

In [2]:
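import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier, TPOTRegressor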

In [3]:
def robin_metric(y_true, y_pred):
    """Mean per-class recall, ignoring the best-predicted class.

    Given y_true and y_pred, this computes a confusion matrix, normalizes each
    row so that the diagonal holds the per-class recall, and averages the
    diagonal after dropping its highest value. The result is the accuracy over
    all classes except the one predicted best, which is useful for imbalanced
    learning, where one class is always predicted very well.
    """
    cm = confusion_matrix(y_true, y_pred)

    # Normalize each row by the number of true samples in that class
    cm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]

    # Sort the per-class recalls and drop the highest before averaging
    metric = np.sort(np.diag(cm))[:-1].mean()

    return metric
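
To see what this metric measures, the sketch below applies the same steps (row-normalize, sort, drop the best) to a made-up 3-class confusion matrix. The counts are invented purely for illustration and the helper name `mean_recall_excluding_best` is hypothetical. It uses only NumPy, so the matrix is given directly rather than computed with `confusion_matrix`.

```python
import numpy as np


def mean_recall_excluding_best(cm):
    """Same arithmetic as robin_metric, but starting from a confusion matrix."""
    cm = np.asarray(cm, dtype=float)
    # Row-normalize so the diagonal holds the per-class recall
    recalls = np.diag(cm / cm.sum(axis=1)[:, np.newaxis])
    # Sort ascending and drop the last (highest) recall before averaging
    return np.sort(recalls)[:-1].mean()


# Invented counts: rows are true classes, columns are predicted classes.
# Class 0 dominates the data and is predicted well; classes 1 and 2 are not.
toy_cm = np.array([[90, 5, 5],
                   [10, 6, 4],
                   [8, 2, 10]])

overall_accuracy = np.trace(toy_cm) / toy_cm.sum()  # 106 / 140
metric = mean_recall_excluding_best(toy_cm)         # mean of 0.3 and 0.5

print(f"overall accuracy: {overall_accuracy:.3f}")
print(f"metric:           {metric:.3f}")
```

In this toy case the overall accuracy is about 0.76, while the metric is 0.40, because it averages only the recalls of the two weaker classes (0.3 and 0.5).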

In [69]:
df = load_data('2017_ValidationPts_ALL_Update15March19b_ROBIN.csv')
X, y = get_processed_data(df, classes=[-2, -0.2, 0.2, 6.5], categorised=True, focal=False,
                          scale=False, exclude=None, absolute=False, subset=None)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)

In [74]:
tpot = TPOTClassifier(verbosity=3, scoring=robin_metric)
tpot.fit(X_train, y_train)



30 operators have been imported by TPOT.



_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='hinge' is not supported, Parameters: penalty='l1', loss='hinge', dual=False
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 69
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative
_pre_test decorator: _random_mutation_operator: num_test=0 Input X must be non-negative
Pipeline encountered that has previously been evaluated during the optimization process. Using the score from the previous evaluation

_pre_test decorator: _random_mutation_operator: num_test=0 X contains negative values.
_pre_test decorator: _random_mutation_operator: num_test=0 cosine was provided as affinity. Ward can only work with euclidean distances.
Generation 5 - Current Pareto front scores:
-1	0.594735760383695	BernoulliNB(input_matrix, BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False)
-2	0.7006090092428947	GaussianNB(RandomForestClassifier(input_matrix, RandomForestClassifier__bootstrap=True, RandomForestClassifier__criterion=gini, RandomForestClassifier__max_features=0.5, RandomForestClassifier__min_samples_leaf=8, RandomForestClassifier__min_samples_split=18, RandomForestClassifier__n_estimators=100))
-3	0.70947687944962	BernoulliNB(FastICA(CombineDFs(input_matrix, LogisticRegression(input_matrix, LogisticRegression__C=5.0, LogisticRegression__dual=True, LogisticRegression__penalty=l2)), FastICA__tol=1.0), BernoulliNB__alpha=1.0, BernoulliNB__fit_prior=False)

_pre_test decorator: _random_mutation_oper

TPOTClassifier(config_dict=None, crossover_rate=0.1, cv=5,
        disable_update_check=False, early_stop=None, generations=100,
        max_eval_time_mins=5, max_time_mins=None, memory=None,
        mutation_rate=0.9, n_jobs=1, offspring_size=None,
        periodic_checkpoint_folder=None, population_size=100,
        random_state=None, scoring=<function robin_metric at 0x11ce5fea0>,
        subsample=1.0, use_dask=False, verbosity=3, warm_start=False)
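
Some TPOT releases warn when `scoring` is given a plain `(y_true, y_pred)` function and suggest a scikit-learn scorer instead. The sketch below shows one way that could look using `make_scorer`; it is an assumption about a possible alternative, not what this notebook ran. It re-declares the metric so the block is self-contained, and it tries the scorer on synthetic data with a plain `LogisticRegression` rather than TPOT.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, make_scorer
from sklearn.model_selection import train_test_split


def robin_metric(y_true, y_pred):
    """Mean per-class recall, ignoring the best-predicted class."""
    cm = confusion_matrix(y_true, y_pred)
    cm = cm.astype(float) / cm.sum(axis=1)[:, np.newaxis]
    return np.sort(np.diag(cm))[:-1].mean()


# Wrap the metric as a scorer with signature scorer(estimator, X, y);
# make_scorer treats higher values as better by default.
robin_scorer = make_scorer(robin_metric)

# Synthetic, imbalanced 3-class data (illustration only, not the notebook's data)
X, y = make_classification(n_samples=600, n_features=8, n_informative=5,
                           n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
score = robin_scorer(clf, X_test, y_test)
print(score)

# With TPOT installed, the scorer would go where the function went before:
# tpot = TPOTClassifier(verbosity=3, scoring=robin_scorer)
```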

### Investigate the Pareto Front pipelines manually

In [77]:
tpot.pareto_front_fitted_pipelines_

{'BernoulliNB(input_matrix, BernoulliNB__alpha=0.1, BernoulliNB__fit_prior=False)': Pipeline(memory=None,
      steps=[('bernoullinb', BernoulliNB(alpha=0.1, binarize=0.0, class_prior=None, fit_prior=False))]),
 'BernoulliNB(FastICA(input_matrix, FastICA__tol=0.05), BernoulliNB__alpha=0.001, BernoulliNB__fit_prior=False)': Pipeline(memory=None,
      steps=[('fastica', FastICA(algorithm='parallel', fun='logcosh', fun_args=None, max_iter=200,
     n_components=None, random_state=None, tol=0.05, w_init=None,
     whiten=True)), ('bernoullinb', BernoulliNB(alpha=0.001, binarize=0.0, class_prior=None, fit_prior=False))])}
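
One way to make the manual trade-off more systematic is to pick the simplest pipeline whose score is within a tolerance of the best. The sketch below hard-codes the three Generation 5 Pareto-front rows from the log above with abbreviated pipeline labels. It assumes the first column is the negated operator count (so -1 is the simplest pipeline) and uses a hypothetical tolerance of 0.01; the helper name `simplest_within_tolerance` is also an assumption.

```python
# (negated operator count, CV score, abbreviated pipeline label)
pareto_front = [
    (-1, 0.594735760383695, "BernoulliNB"),
    (-2, 0.7006090092428947, "GaussianNB(RandomForestClassifier(...))"),
    (-3, 0.70947687944962, "BernoulliNB(FastICA(CombineDFs(...)))"),
]


def simplest_within_tolerance(front, tolerance=0.01):
    """Return the row with the fewest operators among near-best scores."""
    best_score = max(score for _, score, _ in front)
    good_enough = [row for row in front if row[1] >= best_score - tolerance]
    # Fewest operators means the largest (least negative) first column
    return max(good_enough, key=lambda row: row[0])


choice = simplest_within_tolerance(pareto_front)
print(choice)
```

With these numbers the two-operator pipeline is selected, since its score is within 0.01 of the three-operator one. A stricter or looser tolerance changes the choice, so the value is a judgment call rather than a fixed rule.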

In [49]:
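# This key is from an earlier TPOT run; keys differ between runs, so copy one from the dictionary printed for your own fit
gnb = tpot.pareto_front_fitted_pipelines_['GaussianNB(PCA(input_matrix, PCA__iterated_power=6, PCA__svd_solver=randomized))']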

In [50]:
# Plain accuracy on the held-out test set
gnb.score(X_test, y_test)

0.8604651162790697

In [51]:
y_pred = gnb.predict(X_test)

In [52]:
robin_metric(y_test, y_pred)

0.7490566037735849
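
The same check can be repeated for every pipeline on the Pareto front instead of one at a time. The sketch below is a minimal illustration under assumptions: `evaluate_pipelines` is a hypothetical helper, and the two `ConstantPredictor` objects are stand-ins for fitted pipelines so that the block runs without TPOT. In the notebook, the dictionary would be `tpot.pareto_front_fitted_pipelines_` and the metric would be `robin_metric`.

```python
def accuracy(y_true, y_pred):
    """Fraction of correct predictions (stand-in for any (y_true, y_pred) metric)."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate_pipelines(pipelines, X_test, y_test, metric):
    """Return (name, metric value) pairs, best first, for a dict of fitted models."""
    results = [(name, metric(y_test, model.predict(X_test)))
               for name, model in pipelines.items()]
    return sorted(results, key=lambda item: item[1], reverse=True)


class ConstantPredictor:
    """Stand-in for a fitted pipeline: always predicts the same label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


# Tiny invented test set: three samples of class 0 and one of class 1
X_demo = [[0.1], [0.4], [0.7], [0.9]]
y_demo = [0, 0, 0, 1]
demo_pipelines = {"always_0": ConstantPredictor(0),
                  "always_1": ConstantPredictor(1)}

ranking = evaluate_pipelines(demo_pipelines, X_demo, y_demo, accuracy)
for name, value in ranking:
    print(f"{value:.2f}  {name}")
```

Ranking by a held-out metric in this way complements the cross-validated scores TPOT reports, but the final choice along the Pareto front still depends on how much complexity is acceptable.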