# sandbox.ipynb

This Python notebook fits regression models to data pulled from the processed Mongo database that GASpy creates. It then pickles the fitted regressors for later use and draws parity plots of the fits.

# Initialize

In [1]:
# Importing
import pdb
import sys
from regressor import GASpyRegressor
import gpickle
# Put the parent directory on the import path so that `gaspy` can be imported
sys.path.insert(0, '../')
from gaspy.utils import vasp_settings_to_str

VASP_SETTINGS = vasp_settings_to_str({'gga': 'RP',
                                      'pp_version': '5.4',
                                      'encut': 350})

# Regress

## Gaussian Process Models

In [2]:
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel, RationalQuadratic, ExpSineSquared

In [3]:
model_name = 'GP'
features = ['coordcount', 'ads']
responses = ['energy']
blocks = None

In [4]:
gp = GaussianProcessRegressor(
                              #kernel= 1.0*RBF(length_scale=0.05) \
                              #       +1.0*RBF(length_scale=0.2) \
                              #       +1.0*WhiteKernel(noise_level=0.05**2.0),
                              #n_restarts_optimizer=2,
                             )
GP = GASpyRegressor(features=features, responses=responses,
                    blocks=blocks, vasp_settings=VASP_SETTINGS)
GP.fit_sk(gp, model_name=model_name)


Data with input dtype int64 was converted to float64 by StandardScaler.


From version 0.21, test_size will always complement train_size unless both are specified.
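
The `GaussianProcessRegressor` above is built with every argument commented out, so it runs with scikit-learn's default kernel. As a rough illustration of what the commented-out kernel would do if enabled, the sketch below fits the same kernel structure (two RBF terms with different length scales plus a white-noise term) to synthetic data. The data, the `gp_demo` name, and the 30/10 split are assumptions made for the example; nothing here goes through `GASpyRegressor`.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Synthetic stand-in data (an assumption; this is NOT the GASpy data set)
rng = np.random.RandomState(42)
X = rng.uniform(0.0, 1.0, size=(40, 2))
y = np.sin(6.0 * X[:, 0]) + 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=40)

# Same kernel structure as the commented-out lines in the cell above:
# a short-range RBF, a longer-range RBF, and a white-noise term
kernel = (1.0 * RBF(length_scale=0.05)
          + 1.0 * RBF(length_scale=0.2)
          + 1.0 * WhiteKernel(noise_level=0.05**2.0))

gp_demo = GaussianProcessRegressor(kernel=kernel,
                                   n_restarts_optimizer=2,
                                   random_state=42)
gp_demo.fit(X[:30], y[:30])

# Predictive mean and standard deviation for the 10 held-out points
mean, std = gp_demo.predict(X[30:], return_std=True)

# The fitted kernel, with hyperparameters tuned during fit()
print(gp_demo.kernel_)
```

The length scales and noise level passed in are only starting values; `fit` optimizes them, and `n_restarts_optimizer=2` repeats that optimization from two additional random starting points.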



In [5]:
gpickle.dump(GP)

In [4]:
GP = gpickle.load(model_name, features, responses, blocks)
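
`gpickle` is a project-local module whose internals are not shown in this notebook. The sketch below is one way such a dump/load pair could work with the standard-library `pickle` module, keyed by the same four arguments that `gpickle.load` takes. The helper names (`pickle_path`, `dump_model`, `load_model`) and the file-naming scheme are hypothetical and are not taken from `gpickle`.

```python
import os
import pickle
import tempfile


def pickle_path(folder, model_name, features, responses, blocks):
    """Hypothetical naming scheme: join the identifying settings into one file name."""
    parts = [model_name,
             '-'.join(features),
             '-'.join(responses),
             '-'.join(blocks) if blocks else 'no_block']
    return os.path.join(folder, '_'.join(parts) + '.pkl')


def dump_model(model, folder, model_name, features, responses, blocks):
    """Pickle `model` under a name derived from its settings and return the path."""
    path = pickle_path(folder, model_name, features, responses, blocks)
    with open(path, 'wb') as file_handle:
        pickle.dump(model, file_handle)
    return path


def load_model(folder, model_name, features, responses, blocks):
    """Load the model that was saved with the same settings."""
    path = pickle_path(folder, model_name, features, responses, blocks)
    with open(path, 'rb') as file_handle:
        return pickle.load(file_handle)


# Usage with a plain dict standing in for a fitted regressor
folder = tempfile.mkdtemp()
stand_in = {'model_name': 'GP', 'features': ['coordcount', 'ads']}
path = dump_model(stand_in, folder, 'GP', ['coordcount', 'ads'], ['energy'], None)
restored = load_model(folder, 'GP', ['coordcount', 'ads'], ['energy'], None)
```

Deriving the file name from the settings means a later session can reload a model without refitting, as long as it passes the same `model_name`, `features`, `responses`, and `blocks`.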

In [5]:
GP.parity_plot()

RMSE values:
	no_block
		test
			0.55133485125
		train
			0.435882992771
		train+test
			0.467433214201
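
For reference, the three RMSE values reported above can be read as follows: one error over the training rows, one over the held-out test rows, and one over both sets pooled together. The sketch below uses made-up numbers and a hypothetical `rmse` helper; it does not reproduce how `parity_plot` computes its values internally.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up targets and predictions (not data from this notebook)
y_train, p_train = [0.0, 1.0, 2.0], [0.0, 1.0, 4.0]
y_test, p_test = [1.0], [3.0]

report = {
    'train': rmse(y_train, p_train),
    'test': rmse(y_test, p_test),
    # Pooled over all rows, so it is weighted by set size rather than
    # being the plain average of the train and test values
    'train+test': rmse(y_train + y_test, p_train + p_test),
}
```

Because the pooled value is weighted by the number of rows in each set, it sits closer to the training RMSE whenever the training set is the larger of the two.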


## TPOT Models

In [2]:
from tpot import TPOTRegressor

In [3]:
model_name = 'TPOT'
features = ['coordcount', 'ads']
responses = ['energy']
blocks = None

In [4]:
tpot = TPOTRegressor(
                     generations=1,
                     population_size=2,
                     verbosity=2,
                     random_state=42
                    )
TPOT = GASpyRegressor(features=features, responses=responses,
                      blocks=blocks, vasp_settings=VASP_SETTINGS)
TPOT.fit_tpot(tpot, model_name=model_name)


This module was deprecated in version 0.18 in favor of the model_selection module into which all the refactored classes and functions are moved. Also note that the interface of the new CV iterators are different from that of this module. This module will be removed in 0.20.


Data with input dtype int64 was converted to float64 by StandardScaler.


From version 0.21, test_size will always complement train_size unless both are specified.

Generation 1 - Current best internal CV score: 0.294319983876

Best pipeline: KNeighborsRegressor(ElasticNetCV(input_matrix, ElasticNetCV__l1_ratio=0.35, ElasticNetCV__tol=0.1), KNeighborsRegressor__n_neighbors=80, KNeighborsRegressor__p=1, KNeighborsRegressor__weights=distance)


In [5]:
gpickle.dump(TPOT)

In [7]:
TPOT = gpickle.load(model_name, features, responses, blocks)

In [8]:
TPOT.parity_plot()

RMSE values:
	no_block
		test
			0.55007127944
		train
			0.436462798022
		train+test
			0.467462653202
