# Initial Design

This notebook contains the code used to select the reaction conditions for the initial training of the Gaussian process (GP) model. A modified Latin hypercube sampling (LHS) strategy is used to select both the discrete and the continuous variables. We previously demonstrated that several different designs work well for solvent selection, so LHS was chosen because it is already implemented in GPyOpt.
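To make the sampling idea concrete, the following is a minimal sketch of a centered Latin hypercube in plain NumPy. It is not the GPyOpt or Summit implementation; the function name `centered_lhs` and its parameters are hypothetical and chosen only for illustration. Each dimension is split into `n_samples` equal-width bins, and each bin is used exactly once per dimension.

```python
import numpy as np

def centered_lhs(n_samples, n_dims, seed=None):
    """Return an (n_samples, n_dims) array of centered LHS points in [0, 1).

    Hypothetical helper for illustration only, not part of Summit or GPyOpt.
    """
    rng = np.random.default_rng(seed)
    # Midpoints of n_samples equal-width bins on [0, 1)
    centers = (np.arange(n_samples) + 0.5) / n_samples
    # Shuffle the bin order independently for each dimension so that
    # every bin is used exactly once per dimension
    return np.column_stack([rng.permutation(centers) for _ in range(n_dims)])

unit = centered_lhs(20, 3, seed=0)
print(unit.shape)
```

Because each column is a permutation of the same bin midpoints, the projection of the design onto any single variable covers its range evenly, which is the property that makes LHS attractive for a small initial design.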

## 1. Setup

Let's get everything loaded and ready to go.

In [2]:
#Autoreload automatically reloads any dependencies as you change them
%load_ext autoreload
%autoreload 2

In [3]:
#Import all the necessary packages
from summit.data import solvent_ds, ucb_list
from summit.domain import Domain, ContinuousVariable, DiscreteVariable, DescriptorsVariable
from summit.experiment_design import LatinDesign
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

In [4]:
#Check which solvents from the UCB list are present in the solvent database
missed_cas_numbers = []
successful_cas_numbers = []
for cas_number in ucb_list:
    try:
        #xs raises a KeyError if the CAS number is not in the index
        solvent_ds.xs(cas_number, level='cas_number', drop_level=False)
        successful_cas_numbers.append(cas_number)
    except KeyError:
        missed_cas_numbers.append(cas_number)
print(f"{len(missed_cas_numbers)} out of {len(ucb_list)} solvents from the UCB list are not in our solvent database.")

36 out of 115 solvents from the UCB list are not in our solvent database.
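The same membership check can be done without a `try`/`except` by comparing against the values of the `cas_number` index level. The sketch below uses a small toy frame as a stand-in for `solvent_ds`; the names `toy_ds` and `toy_list` and the descriptor values are placeholders, and it assumes only that `cas_number` is a level of a `MultiIndex`, as the `xs` call above implies.

```python
import pandas as pd

# Toy stand-in for solvent_ds: a frame indexed by (name, cas_number).
# The descriptor values are placeholders, not real solvent data.
toy_index = pd.MultiIndex.from_tuples(
    [("water", "7732-18-5"), ("ethanol", "64-17-5"), ("acetone", "67-64-1")],
    names=["name", "cas_number"],
)
toy_ds = pd.DataFrame({"descriptor": [1.0, 2.0, 3.0]}, index=toy_index)

# Toy stand-in for ucb_list; the last CAS number is not in toy_ds
toy_list = ["64-17-5", "67-64-1", "108-88-3"]

# Collect the CAS numbers present in the index once, then split the list
known = set(toy_ds.index.get_level_values("cas_number"))
found = [cas for cas in toy_list if cas in known]
missed = [cas for cas in toy_list if cas not in known]

print(f"{len(missed)} out of {len(toy_list)} solvents are not in the toy database.")
```

Building the set once avoids a separate index lookup for every solvent, which matters little for 115 entries but keeps the intent of the check explicit.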


In [5]:
#Specify the optimization space

domain = Domain()

domain += ContinuousVariable(name='temperature',
                             description = "reaction temperature",
                             bounds=[20, 50])

domain += ContinuousVariable(name="acid_conc",
                             description = "propionic acid concentration",
                             bounds=[1,40])

domain += ContinuousVariable(name="cat_load",
                             description = "catalyst loading",
                             bounds=[0.1, 10])

domain += ContinuousVariable(name="co_cat_load",
                             description = "co-catalyst loading",
                             bounds=[15, 1500])

domain += ContinuousVariable(name="acrylate_amine_ratio",
                             description = "molar ratio of acrylate to amine",
                             bounds = [0.8, 2])


domain += ContinuousVariable(name="aldehyde_amine_ratio",
                             description = "molar ratio of aldehyde to amine",
                             bounds=[0.8, 2])

domain += DiscreteVariable(name="co_cat",
                           description="enumeration of the two potential cocatalysts",
                           levels = ['co_cat_1', 'co_cat_2'])


domain += DescriptorsVariable(name="solvent",
                              description="17 descriptors of the solvent",
                              df=solvent_ds,
                              select_subset=successful_cas_numbers,
                              select_index='cas_number')

domain #The domain should display as an HTML table

Name,Type,Description,Values
temperature,continuous,reaction temperature,"[20,50]"
acid_conc,continuous,propionic acid concentration,"[1,40]"
cat_load,continuous,catalyst loading,"[0.1,10]"
co_cat_load,continuous,co-catalyst loading,"[15,1500]"
acrylate_amine_ratio,continuous,molar ratio of acrylate to amine,"[0.8,2]"
aldehyde_amine_ratio,continuous,molar ratio of aldehyde to amine,"[0.8,2]"
co_cat,discrete,enumeration of the two potential cocatalysts,2 levels
solvent,descriptors,17 descriptors of the solvent,459 examples of 17 descriptors


## 2. Construct Initial Design

In [6]:
lhs = LatinDesign(domain)
experiments = lhs.generate_experiments(20)
experiments

Unnamed: 0,temperature,acid_conc,cat_load,co_cat_load,acrylate_amine_ratio,aldehyde_amine_ratio,co_cat,stenutz_name,cosmo_name,chemical_formula,cas_number
0,26.75,3.925,0.8425,1165.875,1.67,0.89,co_cat_2,"1,2,3,4-tetrachlorobenzene","1,2,3,4-tetrachlorobenzene",C6H2Cl4,634-66-2
1,44.75,7.825,5.7925,497.625,1.73,1.73,co_cat_1,"1,2,3,4-tetramethylbenzene","1,2,3,4-tetramethylbenzene",C10H14,488-23-3
2,32.75,37.075,3.3175,868.875,0.83,1.31,co_cat_1,"(1Z,5Z)-cycloocta-1,5-diene","1,5-cyclooctadiene",C8H12,111-78-4
3,34.25,21.475,9.2575,1314.375,1.37,1.91,co_cat_2,"1,5-dichloropentane","1,5-dichloro-pentane",C5H10Cl2,628-76-2
4,49.25,27.325,2.3275,349.125,1.07,1.61,co_cat_2,1-bromohexane,1-bromohexane,C6H13Br,111-25-1
5,47.75,31.225,6.7825,1091.625,1.01,1.55,co_cat_2,"1,1,1-trichloroethane",glycerol,C2H3Cl3,71-55-6
6,43.25,33.175,8.2675,126.375,0.95,1.19,co_cat_1,"1,1,2,2-tetrabromoethane","1,1,2,2-tetrabromoethane",C2H2Br4,79-27-6
7,22.25,23.425,8.7625,1240.125,1.91,1.85,co_cat_1,"1,2,3,4-tetramethylbenzene","1,2,3,4-tetramethylbenzene",C10H14,488-23-3
8,31.25,9.775,4.8025,52.125,1.97,1.49,co_cat_2,"1,2,3,4-tetramethylbenzene","1,2,3,4-tetramethylbenzene",C10H14,488-23-3
9,38.75,13.675,5.2975,943.125,1.55,1.07,co_cat_1,"1,4-butanediol","1,4-butadiol",C4H10O2,110-63-4
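The continuous values in the table above are consistent with the midpoints of 20 equal-width bins over each variable's bounds; for example, 26.75 = 20 + 30 × (4 + 0.5)/20 for temperature. The sketch below shows one way unit-interval samples could be mapped onto continuous bounds and discrete levels. The helpers `scale_to_bounds` and `pick_levels` are hypothetical and are not taken from `LatinDesign`, whose internals may differ.

```python
import numpy as np

def scale_to_bounds(unit_column, lower, upper):
    """Map values in [0, 1) linearly onto [lower, upper). Hypothetical helper."""
    return lower + np.asarray(unit_column) * (upper - lower)

def pick_levels(unit_column, levels):
    """Map values in [0, 1) onto equally sized slices of a list of levels.

    Hypothetical helper; one possible way to treat a discrete variable.
    """
    idx = (np.asarray(unit_column) * len(levels)).astype(int)
    idx = np.minimum(idx, len(levels) - 1)  # guard against a value of exactly 1.0
    return [levels[i] for i in idx]

# Midpoints of 20 equal-width bins on [0, 1)
centers = (np.arange(20) + 0.5) / 20

temperature = scale_to_bounds(centers, 20, 50)
co_cat = pick_levels(centers, ['co_cat_1', 'co_cat_2'])

print(temperature[:5])
print(co_cat[:3], co_cat[-3:])
```

In a full design the bin midpoints would be shuffled independently per variable before scaling, so that the pairing between variables is randomized.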


In [7]:
expr_df = experiments.to_frame()
expr_df.to_csv('initial_experiments.csv')
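When a frame is written with `to_csv` using the default settings, the index is saved as an unnamed first column. A short sketch of reading such a file back is shown below, using an in-memory buffer and a toy frame (`toy` is a placeholder, not the actual design) so that it runs without touching the disk.

```python
import io
import pandas as pd

# Toy stand-in for the experiment table
toy = pd.DataFrame({"temperature": [26.75, 44.75],
                    "co_cat": ["co_cat_2", "co_cat_1"]})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer)

# Reading without index_col turns the saved index into a regular column
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reading with index_col=0 restores the saved index instead
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```

Passing `index_col=0` when loading `initial_experiments.csv` would avoid carrying the extra column into later analysis.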