# regress.ipynb
Author:  Kevin Tran <ktran@andrew.cmu.edu>

This Python notebook performs regressions on data pulled from a local GASdb, saves the fitted models as pickles for later use, and creates parity plots of the regression fits.

## Importing

In [7]:
from pprint import pprint   # for debugging
import sys
import math
import numpy as np
sys.path.append('..')
from vasp_settings_to_str import vasp_settings_to_str
from gas_pull import GasPull
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from tpot import TPOTRegressor
import alamopy
import dill as pickle
pickle.settings['recurse'] = True     # required to pickle lambdify functions
import matplotlib.pyplot as plt
from plotly.offline import init_notebook_mode, plot, iplot
init_notebook_mode(connected=True)
import plotly.plotly as py
import plotly.graph_objs as go

## Load data
Pull the data using the `energy_fr_coordcount_ads` method of `GasPull`

In [2]:
# Location of the *.db file
#DB_LOC = '/global/cscratch1/sd/zulissi/GASpy_DB/'  # Cori
DB_LOC = '/Users/KTran/Nerd/GASpy'                 # Local

# Calculation settings we want to look at
VASP_SETTINGS = vasp_settings_to_str({'gga': 'BF',
                                      'pp_version': '5.4.',
                                      'encut': 350})

# Pull the data from the local database
GAS_PULL = GasPull(DB_LOC, VASP_SETTINGS, split=True)
X, Y, DATA, X_TRAIN, X_TEST, Y_TRAIN, Y_TEST, lb_ads, lb_coord = GAS_PULL.energy_fr_coordcount_ads()

## Regressions
Create surrogate models using different methods

### SKLearn Linear Regression
Use SKLearn's simple linear regressor

In [3]:
LR = LinearRegression()
LR.fit(X_TRAIN, Y_TRAIN)
LR.name = 'Linear'
pickle.dump({'model': LR,
             'pre_processors': {'coordination': lb_coord,
                                'adsorbate': lb_ads}},
            open('pkls/CoordcountAds_Energy_LR.pkl', 'wb'))


internal gelsd driver lwork query error, required iwork dimension not returned. This is likely the result of LAPACK bug 0038, fixed in LAPACK 3.2.2 (released July 21, 2010). Falling back to 'gelss' driver.
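
The cell above stores the fitted model together with its pre-processors in a single dictionary so that both can be recovered later. Below is a minimal sketch of that save-and-reload round trip. It uses the standard-library `pickle` instead of `dill`, and the helper names (`save_bundle`, `load_bundle`) and the toy dictionaries are hypothetical stand-ins, not objects from this notebook.

```python
import os
import pickle
import tempfile


def save_bundle(path, model, pre_processors):
    """Write a model and its pre-processors to one pickle file."""
    # Binary mode is required on Python 3 and is harmless on Python 2.
    with open(path, 'wb') as handle:
        pickle.dump({'model': model, 'pre_processors': pre_processors}, handle)


def load_bundle(path):
    """Read back the dictionary written by `save_bundle`."""
    with open(path, 'rb') as handle:
        return pickle.load(handle)


# Hypothetical stand-ins for a fitted model and its label binarizers
toy_model = {'name': 'Linear', 'coef': [0.5, -1.0], 'intercept': 0.25}
toy_pre_processors = {'coordination': ['Cu', 'Ni'], 'adsorbate': ['CO', 'H']}

path = os.path.join(tempfile.mkdtemp(), 'toy_bundle.pkl')
save_bundle(path, toy_model, toy_pre_processors)
restored = load_bundle(path)
```

Keeping the pre-processors next to the model means that whoever loads the pickle can vectorize new inputs the same way the training data was vectorized.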



### SKLearn Ensemble Regression
Use SKLearn's gradient-boosting ensemble method

In [4]:
GBE = GradientBoostingRegressor()
GBE.fit(X_TRAIN, Y_TRAIN)
GBE.name = 'GBE'
pickle.dump({'model': GBE,
             'pre_processors': {'coordination': lb_coord,
                                'adsorbate': lb_ads}},
            open('pkls/CoordcountAds_Energy_GBE.pkl', 'wb'))

### TPOT Regression
Use TPOT, an automated machine learning (AutoML) tool

In [None]:
TPOT = TPOTRegressor(generations=100,
                     population_size=100,
                     verbosity=2,
                     random_state=42)
TPOT.fit(X_TRAIN, Y_TRAIN)
TPOT.name = 'TPOT'
# TODO:  Figure out how to save the TPOT model
#pickle.dump({'model': TPOT,
#             'pre_processors': {'coordination': lb_coord,
#                                'adsorbate': lb_ads}},
#            open('pkls/CoordcountAds_Energy_TPOT.pkl', 'wb'))



Optimization Progress:   2%|▏         | 187/10100 [03:10<2:04:53,  1.32pipeline/s]

Generation 1 - Current best internal CV score: 0.245020628343


Optimization Progress:   3%|▎         | 272/10100 [05:15<3:07:01,  1.14s/pipeline]

Generation 2 - Current best internal CV score: 0.245020628343


Optimization Progress:   4%|▎         | 357/10100 [07:24<3:36:59,  1.34s/pipeline]

Generation 3 - Current best internal CV score: 0.245020628343


Optimization Progress:   4%|▍         | 453/10100 [09:50<2:32:20,  1.06pipeline/s] 

Generation 4 - Current best internal CV score: 0.245020628343


Optimization Progress:   5%|▌         | 543/10100 [11:57<3:02:35,  1.15s/pipeline]

Generation 5 - Current best internal CV score: 0.244618541169


Optimization Progress:   6%|▋         | 633/10100 [15:09<3:19:40,  1.27s/pipeline] 

Generation 6 - Current best internal CV score: 0.244164765208


Optimization Progress:   7%|▋         | 725/10100 [17:58<3:44:35,  1.44s/pipeline] 

Generation 7 - Current best internal CV score: 0.244164765208


Optimization Progress:   8%|▊         | 817/10100 [21:40<4:31:24,  1.75s/pipeline] 

Generation 8 - Current best internal CV score: 0.241615964464


Optimization Progress:   9%|▉         | 912/10100 [25:03<4:20:18,  1.70s/pipeline] 

Generation 9 - Current best internal CV score: 0.241332999545


Optimization Progress:  10%|▉         | 1003/10100 [27:35<4:21:21,  1.72s/pipeline]

Generation 10 - Current best internal CV score: 0.240282374664


Optimization Progress:  11%|█         | 1097/10100 [30:58<4:17:40,  1.72s/pipeline] 

Generation 11 - Current best internal CV score: 0.240282374664


Optimization Progress:  12%|█▏        | 1192/10100 [33:51<3:50:05,  1.55s/pipeline] 

Generation 12 - Current best internal CV score: 0.240282374664


Optimization Progress:  13%|█▎        | 1288/10100 [37:02<4:07:41,  1.69s/pipeline] 

Generation 13 - Current best internal CV score: 0.240282374664


Optimization Progress:  14%|█▎        | 1384/10100 [40:14<3:33:29,  1.47s/pipeline] 

Generation 14 - Current best internal CV score: 0.239611149217


Optimization Progress:  15%|█▍        | 1483/10100 [43:40<2:57:25,  1.24s/pipeline] 

Generation 15 - Current best internal CV score: 0.23809970003


Optimization Progress:  16%|█▌        | 1581/10100 [46:34<4:00:12,  1.69s/pipeline] 

Generation 16 - Current best internal CV score: 0.23809970003


Optimization Progress:  17%|█▋        | 1679/10100 [49:48<3:23:31,  1.45s/pipeline] 

Generation 17 - Current best internal CV score: 0.23809970003


Optimization Progress:  18%|█▊        | 1774/10100 [52:43<4:09:46,  1.80s/pipeline] 

Generation 18 - Current best internal CV score: 0.23809970003


Optimization Progress:  19%|█▊        | 1873/10100 [55:38<3:33:07,  1.55s/pipeline] 

Generation 19 - Current best internal CV score: 0.23809970003


Optimization Progress:  19%|█▉        | 1967/10100 [58:45<3:36:17,  1.60s/pipeline] 

Generation 20 - Current best internal CV score: 0.23809970003


Optimization Progress:  20%|██        | 2064/10100 [1:01:25<2:58:27,  1.33s/pipeline]

Generation 21 - Current best internal CV score: 0.23809970003


Optimization Progress:  21%|██▏       | 2154/10100 [1:04:06<3:07:52,  1.42s/pipeline] 

Generation 22 - Current best internal CV score: 0.235958437949


Optimization Progress:  22%|██▏       | 2244/10100 [1:07:00<3:14:41,  1.49s/pipeline] 

Generation 23 - Current best internal CV score: 0.235958437949


Optimization Progress:  23%|██▎       | 2330/10100 [1:09:28<3:19:36,  1.54s/pipeline] 

Generation 24 - Current best internal CV score: 0.235958437949


Optimization Progress:  24%|██▍       | 2422/10100 [1:12:31<3:49:38,  1.79s/pipeline] 

Generation 25 - Current best internal CV score: 0.235958437949


Optimization Progress:  25%|██▍       | 2511/10100 [1:15:09<3:13:59,  1.53s/pipeline] 

Generation 26 - Current best internal CV score: 0.234900002136


Optimization Progress:  26%|██▌       | 2602/10100 [1:17:42<2:57:36,  1.42s/pipeline] 

Generation 27 - Current best internal CV score: 0.234900002136


Optimization Progress:  27%|██▋       | 2690/10100 [1:20:14<2:45:56,  1.34s/pipeline] 

Generation 28 - Current best internal CV score: 0.234900002136


Optimization Progress:  28%|██▊       | 2778/10100 [1:22:30<2:38:46,  1.30s/pipeline] 

Generation 29 - Current best internal CV score: 0.234900002136


Optimization Progress:  28%|██▊       | 2870/10100 [1:25:01<2:22:36,  1.18s/pipeline] 

Generation 30 - Current best internal CV score: 0.234900002136


Optimization Progress:  29%|██▉       | 2963/10100 [1:27:33<3:31:27,  1.78s/pipeline] 

Generation 31 - Current best internal CV score: 0.234900002136


Optimization Progress:  30%|███       | 3053/10100 [1:30:17<3:31:59,  1.80s/pipeline]

Generation 32 - Current best internal CV score: 0.234900002136


Optimization Progress:  31%|███       | 3151/10100 [1:32:59<2:14:27,  1.16s/pipeline] 

Generation 33 - Current best internal CV score: 0.234900002136


Optimization Progress:  32%|███▏      | 3243/10100 [1:35:26<3:10:16,  1.66s/pipeline] 

Generation 34 - Current best internal CV score: 0.234900002136


Optimization Progress:  33%|███▎      | 3335/10100 [1:37:47<2:22:35,  1.26s/pipeline] 

Generation 35 - Current best internal CV score: 0.234900002136


Optimization Progress:  34%|███▍      | 3432/10100 [1:40:53<3:25:51,  1.85s/pipeline] 

Generation 36 - Current best internal CV score: 0.234430103727


Optimization Progress:  35%|███▍      | 3529/10100 [1:43:43<2:41:06,  1.47s/pipeline]

Generation 37 - Current best internal CV score: 0.234430103727


Optimization Progress:  36%|███▌      | 3624/10100 [1:46:19<2:23:35,  1.33s/pipeline] 

Generation 38 - Current best internal CV score: 0.234430103727


Optimization Progress:  37%|███▋      | 3713/10100 [1:48:55<2:28:22,  1.39s/pipeline] 

Generation 39 - Current best internal CV score: 0.234430103727


Optimization Progress:  38%|███▊      | 3804/10100 [1:51:43<2:35:40,  1.48s/pipeline] 

Generation 40 - Current best internal CV score: 0.234430103727


Optimization Progress:  39%|███▊      | 3897/10100 [1:54:07<2:53:16,  1.68s/pipeline] 

Generation 41 - Current best internal CV score: 0.234430103727


Optimization Progress:  39%|███▉      | 3988/10100 [1:56:35<2:14:09,  1.32s/pipeline]

Generation 42 - Current best internal CV score: 0.234158587386


Optimization Progress:  40%|████      | 4087/10100 [1:59:10<2:35:16,  1.55s/pipeline] 

Generation 43 - Current best internal CV score: 0.234158587386


Optimization Progress:  41%|████▏     | 4185/10100 [2:02:03<2:30:50,  1.53s/pipeline]

Generation 44 - Current best internal CV score: 0.234158587386


Optimization Progress:  42%|████▏     | 4281/10100 [2:04:55<3:06:00,  1.92s/pipeline] 

Generation 45 - Current best internal CV score: 0.234152080068


Optimization Progress:  43%|████▎     | 4379/10100 [2:08:01<3:20:39,  2.10s/pipeline]

Generation 46 - Current best internal CV score: 0.234152080068


Optimization Progress:  44%|████▍     | 4471/10100 [2:10:41<2:40:41,  1.71s/pipeline] 

Generation 47 - Current best internal CV score: 0.234152080068


Optimization Progress:  45%|████▌     | 4565/10100 [2:13:04<2:39:44,  1.73s/pipeline] 

Generation 48 - Current best internal CV score: 0.233534506268


Optimization Progress:  46%|████▌     | 4636/10100 [2:15:19<3:09:30,  2.08s/pipeline]

### Alamo Regression
Use Sahinidis' ALAMO

In [5]:
# Since Alamo can take a while, we actually try to load a pickle of the previous run
# before calling alamopy. Simply delete the pickle if you want to re-run.
try:
    ALA = pickle.load(open('pkls/CoordcountAds_Energy_Ala.pkl', 'rb'))['model']
except IOError:
    ALA = alamopy.doalamo(X_TRAIN, Y_TRAIN.reshape(len(Y_TRAIN), 1),
                          X_TEST, Y_TEST.reshape(len(Y_TEST), 1),
                          showalm=1,
                          linfcns=1,
                          expfcns=1,
                          logfcns=1,
                          monomialpower=(1, 2, 3),
                          multi2power=(1, 2, 3),
                          ratiopower=(1, 2, 3)
                     )
    ALA['name'] = 'Alamo'
    pickle.dump({'model': ALA,
                 'pre_processors': {'coordination': lb_coord,
                                    'adsorbate': lb_ads}},
                open('pkls/CoordcountAds_Energy_Ala.pkl', 'wb'))
pprint(ALA['model'])

'  z1 = 0.40689210671486919501660 * x1 + 0.23979449370395070073592 * x3 - 0.24026016988363380066929 * x4 - 0.24303314414319163172529 * x6 + 0.36793318358804466550183 * x7 + 0.34045386387183534937506 * x9 + 0.52626055135851601551877 * x18 + 0.30153379893248954957130 * x19 + 0.54007522653274131485546 * x20 - 0.24635185588857760885517 * x21 + 1.4588916942296219492192 * x23 - 1.0809185728635437584444 * x24 + 1.5294995359829648418071 * x26 + 0.27768071547722195102637 * x2*x24 + 0.13519052619007632110026 * (x11*x24)^3'
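
The printed expression is an algebraic function of the inputs `x1`, `x2`, ..., where `^` reads as exponentiation (`**` in Python). The sketch below shows how a function of separate positional arguments can be evaluated on rows of input data by unpacking each row. The three-input `toy_surrogate` and its coefficients are hypothetical and are not the fitted Alamo model.

```python
def toy_surrogate(x1, x2, x3):
    """Hypothetical surrogate with a linear, a product, and a cubed-product term."""
    return 0.5 * x1 + 0.25 * x2 * x3 + 0.125 * (x1 * x3) ** 3


def predict_rows(func, rows):
    """Evaluate `func` on each row by unpacking the row into positional arguments."""
    return [func(*row) for row in rows]


rows = [[1.0, 0.0, 2.0],
        [0.0, 4.0, 1.0]]
predictions = predict_rows(toy_surrogate, rows)
```

Unpacking with `*row` avoids writing out every argument by hand, but it only works when each row has exactly as many entries as the function has parameters.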


## Plotting

### SKLearn
Use Plotly to display parity plots for SKLearn-type models

In [274]:
# For each model...
for model in [LR, GBE, TPOT]:
    traces = []
    # Create a parity plot where each adsorbate is shown. We do that by pulling out
    # data for each adsorbate and then plotting them.
    for ads in np.unique(DATA['adsorbate']):
        # We loop through all of our data and pull out the vectorized coordination (x),
        # the DFT energy (y), and the coordination site (text).
        x = []
        y = []
        text = []
        for i, _ads in enumerate(DATA['adsorbate']):
            if _ads == ads:
                x.append(X[i])
                y.append(Y[i])
                text.append('Site:  %s' % DATA['coordination'][i])
        # Use the vectorized coordination (x) to calculate a predicted energy (y_predicted).
        # Then add it to `traces` for plotting.
        y_predicted = model.predict(np.array(x))
        traces.append(go.Scatter(x=y_predicted,
                                 y=y,
                                 mode='markers',
                                 text=text,
                                 name=ads))
    # Create a diagonal line for the parity plot
    lims = [-4, 6]
    traces.append(go.Scatter(x=lims, y=lims,
                             line=dict(color=('black'), dash='dash'), name='Parity line'))
    # Format and plot
    layout = go.Layout(xaxis=dict(title='Regressed (eV)'),
                       yaxis=dict(title='DFT (eV)'),
                       title='Adsorption Energy as a function of (Coordination Count, Adsorbate); Model = %s; RMSE = %0.3f eV' \
                             % (model.name, math.sqrt(metrics.mean_squared_error(Y_TEST, model.predict(X_TEST)))))
    iplot(go.Figure(data=traces, layout=layout))
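
The loop above rescans `DATA['adsorbate']` once for every unique adsorbate. As an alternative, the sketch below groups row indices by label in a single pass. The helper name `group_by_label` and the toy data are hypothetical.

```python
from collections import defaultdict


def group_by_label(labels):
    """Map each distinct label to the list of row indices where it occurs."""
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    return dict(groups)


# Hypothetical adsorbate labels and energies (eV)
adsorbates = ['CO', 'H', 'CO', 'OH', 'H']
energies = [-1.2, -0.3, -1.0, 0.4, -0.5]

groups = group_by_label(adsorbates)
# Pull out the energies for one adsorbate using its row indices
co_energies = [energies[i] for i in groups['CO']]
```

The same index lists can then be used to slice the feature vectors, the energies, and the hover text for each trace.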

### Alamo

In [6]:
# Create Plotly plots for each dictionary-type model
for model in [ALA]:
    traces = []
    # Create a parity plot where each adsorbate is shown. We do that by pulling out
    # data for each adsorbate and then plotting them.
    for ads in np.unique(DATA['adsorbate']):
        # We loop through all of our data and pull out the vectorized coordination (x),
        # the DFT energy (y), and the coordination site (text).
        x = []
        y = []
        text = []
        for i, _ads in enumerate(DATA['adsorbate']):
            if _ads == ads:
                x.append(X[i])
                y.append(Y[i])
                text.append('Site:  %s' % DATA['coordination'][i])
                
        # Alamo returns a function of separate positional arguments rather than one that
        # accepts np arrays, so we unpack each row of input data into arguments.
        def model_predict(factors):
            '''
            Turn a vector of input data, `factors`, into the model's guessed output by
            passing each element as its own positional argument. We should address this
            by making alamopy output a better lambda function.
            '''
            return model['f(model)'](*factors)
        y_predicted = [model_predict(row) for row in x]
        
        # Plot
        traces.append(go.Scatter(x=y_predicted,
                                 y=y,
                                 mode='markers',
                                 text=text,
                                 name=ads))
    # Create a diagonal line for the parity plot
    lims = [-4, 6]
    traces.append(go.Scatter(x=lims, y=lims,
                             line=dict(color=('black'), dash='dash'), name='Parity line'))
    # Format and plot
    layout = go.Layout(xaxis=dict(title='Regressed (eV)'),
                       yaxis=dict(title='DFT (eV)'),
                       title='Adsorption Energy as a function of (Coordination Count, Adsorbate); Model = %s; RMSE = %0.3f eV' \
                             % (model['name'], math.sqrt(metrics.mean_squared_error(Y_TEST, [model_predict(row) for row in X_TEST]))))
    iplot(go.Figure(data=traces, layout=layout))
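
The RMSE shown in the plot titles is the square root of the mean squared error between the DFT energies and the regressed energies on the test set. Below is a minimal pure-Python sketch of that calculation. The helper name `rmse` and the toy numbers are hypothetical, and the notebook itself relies on `metrics.mean_squared_error`.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    observed = list(observed)
    predicted = list(predicted)
    if len(observed) != len(predicted):
        raise ValueError('observed and predicted must have the same length')
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Hypothetical DFT and regressed energies (eV); only the last point is off, by 2 eV
dft = [0.0, 1.0, 2.0, 3.0]
regressed = [0.0, 1.0, 2.0, 5.0]
value = rmse(dft, regressed)
```

Because the errors are squared before averaging, a single large miss (here 2 eV on one of four points) dominates the result.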