This notebook documents our exploration of dimensionality reduction for our ML models. One benefit is that dimensionality reduction helps us prepare for t-SNE visualization.

# Initialize

In [1]:
# Modify the path so that we use GASpy_dev instead of GASpy
import os
import sys
gaspy_path = '/global/project/projectdirs/m2755/GASpy_dev/'
sys.path.insert(0, gaspy_path)
# Use os.path.join so we don't end up with doubled slashes in the paths
sys.path.insert(0, os.path.join(gaspy_path, 'GASpy_feedback'))
sys.path.insert(0, os.path.join(gaspy_path, 'GASpy_regressions'))
cache_path = os.path.join(gaspy_path, 'GASpy_regressions', 'cache')

In [2]:
import copy
import time
import dill as pickle
import numpy as np
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, iplot
import plotly.graph_objs as go
import seaborn as sns
from tpot import TPOTRegressor
from sklearn.gaussian_process import GaussianProcessRegressor
from gaspy_regress.regressor import GASpyRegressor
from gaspy_regress.preprocessor import GASpyPreprocessor
from gaspy_regress import gio, plot, predict
from gaspy.utils import vasp_settings_to_str, read_rc, docs_to_pdocs

In [3]:
init_notebook_mode(connected=True)
%matplotlib inline
%load_ext ipycache

VASP_SETTINGS = vasp_settings_to_str({'gga': 'RP',
                                      'pp_version': '5.4',
                                      'encut': 350})
features = ['coordcount']
outer_features = ['neighbors_coordcounts']
responses = ['energy']
blocks = ['adsorbate']
fingerprints = {'neighborcoord': '$processed_data.fp_final.neighborcoord'}
block = ('CO',)





# TPOT hyperparameter tuning
We run the TPOT inner loop repeatedly over a grid of generation counts and population sizes to see how each setting performs on our development set.

## Pre-work
We define a couple of helper functions to keep the later cells short.

In [7]:
model_name = 'profiling/inner_tpot'

# We make a function that creates and fits a GASpyRegressor.
# We'll be using this over and over for profiling.
def profile_inner(gen, pop, dim_red=None, **pp_args):
    '''
    This function makes and fits a GASpyRegressor class using TPOT.
    
    Inputs:
        gen         [int] How many TPOT generations you want to use
        pop         [int] The TPOT population size you want to use
        dim_red     [str] A string indicating the type of
                    dimensionality reduction technique you want to use.
        **pp_args   Any arguments you want to pass to the dimensionality
                    reducer you're using.
    '''
    tpot = TPOTRegressor(generations=gen, population_size=pop, verbosity=2, random_state=42)
    H = GASpyRegressor(features=features, responses=responses,
                       blocks=blocks, vasp_settings=VASP_SETTINGS,
                       fingerprints=fingerprints, train_size=0.8, dev_size=0.1,
                       dim_red=dim_red, **pp_args)
    H.fit_tpot(tpot, model_name=model_name, blocks=[block])
    return H

# Make a function to do the plotting
def plot_3d_profile(x, y, z):
    '''
    This function is simply a wrapper for creating 3D scatter plot profiles
    
    Inputs:
        x  [np.array] A vector for the x-axis
        y  [np.array] A vector for the y-axis
        z  [list of lists] The RMSE dictionaries that correspond to
           the x and y settings, indexed as z[i][j] for x[i] and y[j].
           This assumes that these dictionaries have the keys `block`
           and 'dev', where `block` is the model block defined near
           the top of this notebook.
    '''
    # Format the data
    X, Y = np.meshgrid(x, y)
    Z = np.empty((len(y), len(x)))
    for i, _ in enumerate(x):
        for j, _ in enumerate(y):
            Z[j, i] = z[i][j][block]['dev']
    # Plot it
    trace = go.Scatter3d(x=X.flatten(), y=Y.flatten(), z=Z.flatten())
    xaxis = dict(title='Generations', range=[x.min()-1, x.max()+1])
    yaxis = dict(title='Populations', range=[y.min()-1, y.max()+1])
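    zaxis = dict(title='RMSE', range=[0, 0.4])
    # Axis settings for 3D traces belong under `scene`, not at the top level of the layout
    layout = go.Layout(scene=dict(xaxis=xaxis, yaxis=yaxis, zaxis=zaxis))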
    iplot(go.Figure(data=[trace], layout=layout))
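
The index swap in `Z[j, i] = z[i][j]` follows from how `np.meshgrid` lays out its output: with the default `indexing='xy'`, the returned arrays have shape `(len(y), len(x))`. Here is a minimal sketch of that mapping. The values and the `fake_rmses` name are purely illustrative and are not results from this notebook.

```python
import numpy as np

x = np.array([1, 3, 5])          # e.g., generations
y = np.array([2, 11, 21, 30])    # e.g., population sizes

# Hypothetical results, indexed as fake_rmses[i][j] for x[i] and y[j]
fake_rmses = [[0.1 * i + 0.01 * j for j in range(len(y))]
              for i in range(len(x))]

# The default 'xy' indexing gives arrays of shape (len(y), len(x))
X, Y = np.meshgrid(x, y)
Z = np.empty((len(y), len(x)))
for i in range(len(x)):
    for j in range(len(y)):
        Z[j, i] = fake_rmses[i][j]

# Once flattened, each (X, Y, Z) triple corresponds to one (x[i], y[j]) setting
points = list(zip(X.flatten(), Y.flatten(), Z.flatten()))
```

Row `j` and column `i` of `X`, `Y`, and `Z` all describe the same setting, so the flattened vectors can be handed to a scatter trace directly.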

## Without dim_red
We first ran this without any dimensionality reduction because... well, we forgot. Here are the results! It looks like the number of generations doesn't do much, and a population size of ~21 seems to work reasonably well.
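
Only a handful of settings are actually evaluated, so a value like "~21" refers to one point on a coarse grid. A quick sketch of what the integer `np.linspace` calls used in this notebook produce (fractional values are rounded down by `dtype=int`):

```python
import numpy as np

gens = np.linspace(1, 5, 3, dtype=int)
pops = np.linspace(2, 40, 5, dtype=int)

print(gens.tolist())  # [1, 3, 5]
print(pops.tolist())  # [2, 11, 21, 30, 40]
```

The grid therefore holds 3 x 5 = 15 TPOT runs per profile.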

In [5]:
%%cache inner_tpot_nodimred.pkl gens, pops, rmses --cachedir=../cache/profiling

# Set the space we want to investigate.
gens = np.linspace(1, 5, 3, dtype=int)
pops = np.linspace(2, 40, 5, dtype=int)

# Execute the inner-loop regressions over a range.
# Build each row as its own list; `[[None]*n]*m` would repeat one row object m times.
rmses = [[None]*len(pops) for _ in gens]
for i, gen in enumerate(gens):
    for j, pop in enumerate(pops):
        rmses[i][j] = profile_inner(gen, pop).rmses

[Skipped the cell's code and loaded variables gens, pops, rmses from file '/global/project/projectdirs/m2755/GASpy_dev/GASpy_regressions/cache/profiling/inner_tpot_nodimred.pkl'.]
Version 0.9.1 of tpot is outdated. Version 0.9.2 was released Wednesday January 17, 2018.
Starting to pull documents...
Generation 1 - Current best internal CV score: 0.147014886117

Best pipeline: RandomForestRegressor(ElasticNetCV(input_matrix, l1_ratio=0.75, tol=0.01), bootstrap=True, max_features=0.4, min_samples_leaf=16, min_samples_split=14, n_estimators=100)
Version 0.9.1 of tpot is outdated. Version 0.9.2 was released Wednesday January 17, 2018.
Starting to pull documents...
Generation 1 - Current best internal CV score: 0.141750427645

Best pipeline: RandomForestRegressor(RidgeCV(input_matrix), bootstrap=True, max_features=0.75, min_samples_leaf=11, min_samples_split=9, n_estimators=100)
Version 0.9.1 of tpot is outdated. Version 0.9.2 was released Wednesday January 17, 2018.
Starting to pull documen

0it [00:00, ?it/s]102it [00:00, 430.92it/s]14742it [00:00, 612.23it/s]29132it [00:01, 868.05it/s]42345it [00:01, 37759.62it/s]
Optimization Progress:   0%|          | 0/4 [00:00<?, ?pipeline/s]Optimization Progress:  25%|##5       | 1/4 [00:01<00:04,  1.63s/pipeline]Optimization Progress:  50%|#####     | 2/4 [00:03<00:03,  1.62s/pipeline]Optimization Progress:  75%|#######5  | 3/4 [00:07<00:02,  2.45s/pipeline]Optimization Progress: 100%|##########| 4/4 [00:09<00:00,  2.16s/pipeline]                    Optimization Progress: 100%|##########| 4/4 [00:09<00:00,  2.16s/pipeline]                                                                          0it [00:00, ?it/s]102it [00:00, 296.81it/s]14742it [00:00, 422.92it/s]29132it [00:01, 600.99it/s]42345it [00:01, 38272.18it/s]
Optimization Progress:   0%|          | 0/22 [00:00<?, ?pipeline/s]Optimization Progress:   5%|4         | 1/22 [00:01<00:35,  1.67s/pipeline]Optimization Progress:   9%|9         | 2/22 [00:0

In [6]:
plot_3d_profile(gens, pops, rmses)
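
When pre-allocating a nested results list like `rmses`, each row needs to be a distinct list object. Otherwise a write to one row shows up in every row, and the grid looks identical along the generations axis regardless of the underlying results. A minimal sketch of the difference, using placeholder strings instead of RMSE dictionaries:

```python
n_gens, n_pops = 3, 5

# Multiplying the outer list repeats references to ONE inner list
aliased = [[None] * n_pops] * n_gens
aliased[0][0] = 'written once'

# A comprehension builds a separate inner list for each row
independent = [[None] * n_pops for _ in range(n_gens)]
independent[0][0] = 'written once'

print(aliased[2][0])      # written once (all rows are the same list)
print(independent[2][0])  # None
```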

## With PCA
Again, it looks like the number of generations doesn't really matter. With PCA, a population size of ~10 seems to work best.

In [8]:
%%cache inner_tpot_pca.pkl gens, pops, rmses --cachedir=../cache/profiling

# Set the space we want to investigate.
gens = np.linspace(1, 5, 3, dtype=int)
pops = np.linspace(2, 40, 5, dtype=int)

# Execute the inner-loop regressions over a range.
# Build each row as its own list; `[[None]*n]*m` would repeat one row object m times.
rmses = [[None]*len(pops) for _ in gens]
for i, gen in enumerate(gens):
    for j, pop in enumerate(pops):
        # Note that the settings we use for PCA trigger the use of
        # Thomas Minka's auto-selection algorithm for dimensionality
        model = profile_inner(gen, pop,
                              dim_red='pca',
                              n_components='mle',
                              svd_solver='full')
        rmses[i][j] = model.rmses

[Saved variables gens, pops, rmses to file '/global/project/projectdirs/m2755/GASpy_dev/GASpy_regressions/cache/profiling/inner_tpot_pca.pkl'.]
Version 0.9.1 of tpot is outdated. Version 0.9.2 was released Wednesday January 17, 2018.
Starting to pull documents...
Generation 1 - Current best internal CV score: 0.152029718352

Best pipeline: RandomForestRegressor(ElasticNetCV(input_matrix, l1_ratio=0.75, tol=0.01), bootstrap=True, max_features=0.4, min_samples_leaf=16, min_samples_split=14, n_estimators=100)
Version 0.9.1 of tpot is outdated. Version 0.9.2 was released Wednesday January 17, 2018.
Starting to pull documents...
Generation 1 - Current best internal CV score: 0.149553427104

Best pipeline: ExtraTreesRegressor(input_matrix, bootstrap=True, max_features=0.95, min_samples_leaf=2, min_samples_split=18, n_estimators=100)
Version 0.9.1 of tpot is outdated. Version 0.9.2 was released Wednesday January 17, 2018.
Starting to pull documents...
Generation 1 - Current best internal CV s



In [9]:
plot_3d_profile(gens, pops, rmses)
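
The PCA settings used above (`n_components='mle'`, `svd_solver='full'`) ask scikit-learn to choose the number of retained components itself, using Minka's MLE. The sketch below shows this in isolation on synthetic data. The data, the random seed, and the variable names are assumptions made for illustration and have nothing to do with the GASpy fingerprints.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(42)

# Synthetic data: 3 informative directions embedded in 10 features, plus small noise
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = np.dot(latent, mixing) + 0.01 * rng.normal(size=(500, 10))

# 'mle' is only supported when n_samples >= n_features
pca = PCA(n_components='mle', svd_solver='full')
X_reduced = pca.fit_transform(X)

print(pca.n_components_)  # number of components chosen for this data
print(X_reduced.shape)    # (n_samples, n_components_)
```

Because the number of components is inferred from the data, it can differ between training sets, so it is worth checking `n_components_` after fitting rather than assuming a fixed value.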