<small>This notebook was put together by [Wesley Beckner](http://wesleybeckner.github.io)</small>


<a id='top'></a>

# Contents

[scrape data](#scrape)

[create descriptors](#descriptors)

[optimize LASSO](#optimize)

[create confidence intervals for coefficients](#ci_coeff)

[multi-layer perceptron (MLP) regressor](#nn)

[create static files](#static)

In [1]:
import statistics
import requests
import json
import pickle
import salty
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as LINA
from scipy import stats
from scipy.stats import uniform as sp_rand
from scipy.stats import mode
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPRegressor
import os
import sys
import pandas as pd
from collections import OrderedDict
from numpy.random import randint
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV
from math import log
from time import sleep
%matplotlib inline
tableau20 = [(31, 119, 180), (174, 199, 232), (255, 127, 14), (255, 187, 120),    
             (44, 160, 44), (152, 223, 138), (214, 39, 40), (255, 152, 150),    
             (148, 103, 189), (197, 176, 213), (140, 86, 75), (196, 156, 148),    
             (227, 119, 194), (247, 182, 210), (127, 127, 127), (199, 199, 199),    
             (188, 189, 34), (219, 219, 141), (23, 190, 207), (158, 218, 229)]   

# Scale the RGB values to the [0, 1] range, which is the format matplotlib accepts.    
for i in range(len(tableau20)):    
    r, g, b = tableau20[i]    
    tableau20[i] = (r / 255., g / 255., b / 255.)   

class dev_model():
    def __init__(self, coef_data, data):
        self.Coef_data = coef_data
        self.Data = data
        
class prod_model():
    def __init__(self, coef_data, model):
        self.Coef_data = coef_data
        self.Model = model

<a id='scrape'></a>

# Scrape ILThermo Data

[back to top](#top)

ILThermo has specific 4-letter tags for the properties in the database. These can be determined by inspecting the web elements on their website.

Melting point: prp=lcRG (note this in the paper_url string)

All that needs to be changed to scrape other property data is the 4-letter tag and the directory in which to save the information.
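
As a minimal sketch of that idea, the search URL and output path can be built from two parameters. The helper names `build_search_url` and `output_path` are hypothetical and not part of salty; only the `lcRG` tag and the `MELTING_POINT` directory come from this notebook.

```python
from urllib.parse import urlencode

# Base of the ILThermo search endpoint used in this notebook.
SEARCH_URL = "http://ilthermo.boulder.nist.gov/ILT2/ilsearch?"


def build_search_url(prp_tag):
    """Return the search URL for one 4-letter property tag (hypothetical helper)."""
    params = [("cmp", ""), ("ncmp", "1"), ("year", ""),
              ("auth", ""), ("keyw", ""), ("prp", prp_tag)]
    return SEARCH_URL + urlencode(params)


def output_path(property_dir, index):
    """Return the JSON path for the index-th data set (hypothetical helper)."""
    return "../salty/data/%s/%s.json" % (property_dir, index)


melting_point_url = build_search_url("lcRG")
first_file = output_path("MELTING_POINT", 1)
```

Switching to another property would then only mean passing a different tag and directory name.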

In [2]:
paper_url = "http://ilthermo.boulder.nist.gov/ILT2/ilsearch?"\
    "cmp=&ncmp=1&year=&auth=&keyw=&prp=lcRG"
    
r = requests.get(paper_url)
header = r.json()['header']
papers = r.json()['res']
i = 1
data_url = 'http://ilthermo.boulder.nist.gov/ILT2/ilset?set={paper_id}'
for paper in papers[:1]:
    
    r = requests.get(data_url.format(paper_id=paper[0]))
    data = r.json()['data']
    with open("../salty/data/MELTING_POINT/%s.json" % i, "w") as outfile:
        json.dump(r.json(), outfile)
    #then do whatever you want with data, like writing it to a file
    sleep(0.5) #important step to avoid getting banned by the server
    i += 1

<a id='descriptors'></a>

# Create Descriptors

[back to top](#top)

The scraped data are saved as JSON files. These files contain all the experimental information NIST has archived, including methods and experimental error!

Unfortunately, the IUPAC names in the database are imperfect. We address this after the following cell.

In [None]:
###add JSON files to density.csv
outer_old = pd.DataFrame()
outer_new = pd.DataFrame()
number_of_files = 2266

for i in range(10):
    with open("../salty/data/DENSITY/%s.json" % str(i+1)) as json_file:
        
        #grab data, data headers (names), the salt name
        json_full = json.load(json_file)
        json_data = pd.DataFrame(json_full['data'])
        json_datanames = np.array(json_full['dhead'])
        json_data.columns =  json_datanames
        json_saltname = pd.DataFrame(json_full['components'])
        print(json_saltname.iloc[0][3])
        
        inner_old = pd.DataFrame()
        inner_new = pd.DataFrame()
        
        #loop through the columns of the data, note that some of the 
        #json files are missing pressure data. 
        for indexer in range(len(json_data.columns)):
            grab = json_data.columns[indexer]
            column = json_data[grab]  #avoid shadowing the builtin `list`
            #each cell is a list; keep only its first element
            values = [cell[0] for cell in column]
            dfmy_list = pd.DataFrame(values)
            dfmy_list.columns = [json_datanames[indexer][0]]
            inner_new = pd.concat([dfmy_list, inner_old], axis=1)
            inner_old = inner_new
            
        #add the name of the salt    
        inner_old['salt_name']=json_saltname.iloc[0][3]           
        
        #add to the growing dataframe
        outer_new = pd.concat([inner_old, outer_old], axis=0)
        outer_old = outer_new
print(outer_old)
# pd.DataFrame.to_csv(outer_old, path_or_buf='../salty/data/density.csv', index=False)
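
The flattening step in the cell above can be hard to follow, so here is a minimal sketch on a synthetic record. The layout of `record` is an assumption inferred from the indexing in that cell (each `dhead` entry and each data cell is a list whose first element is kept). The numbers, the salt name, and the helper `flatten_record` are made up for illustration.

```python
# Synthetic record mimicking the assumed layout of one scraped JSON file.
record = {
    "dhead": [["Temperature, K"], ["Pressure, kPa"],
              ["Specific density, kg/m<SUP>3</SUP>"]],
    "data": [
        [["298.15"], ["101.325"], ["1210.0", "0.5"]],
        [["308.15"], ["101.325"], ["1202.0", "0.5"]],
    ],
}


def flatten_record(record, salt_name):
    """Turn one record into a list of flat row dictionaries (hypothetical helper)."""
    names = [head[0] for head in record["dhead"]]
    rows = []
    for entry in record["data"]:
        # Keep the first element of every cell, as the notebook cell does.
        row = {name: cell[0] for name, cell in zip(names, entry)}
        row["salt_name"] = salt_name
        rows.append(row)
    return rows


rows = flatten_record(record, "example salt")
```

Passing `rows` to `pd.DataFrame` would give one column per header plus `salt_name`, and a header missing from a file would simply be absent from that file's rows.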

Dealing with messy data is commonplace. *Even highly vetted data in ILThermo.*

I addressed inaccuracies in the IUPAC naming by first parsing the IUPAC names into two strings (cation and anion) and then hand-checking the strings that had more than two components. I then matched these **weird** IUPAC names to their correct SMILES representations. These are stored in the salty database files cationInfo.csv and anionInfo.csv.

I've taken care of most of them, but I've left a few unaddressed; you can see these after executing the cell below.

In [2]:
###a hacky hack solution to cleaning raw ILThermo data
# df = pd.read_csv("../salty/data/viscosity_full.csv")
df = pd.read_csv('../salty/data/density.csv',delimiter=',')
salts = pd.DataFrame(df["salt_name"])
salts = salts.rename(columns={"salt_name": "salts"})
###our data parsing was imperfect... some of the columns contain NaN
print(df.isnull().sum())
df = pd.concat([df["Temperature, K"], df["Pressure, kPa"],\
                    df["Specific density, kg/m<SUP>3</SUP>"]], axis=1)
df.dropna(inplace=True) #remove incomplete entries
df.reset_index(inplace=True, drop=True)
print(df.shape)

Molar volume, m<SUP>3</SUP>/mol       31242
Pressure, kPa                           373
Specific density, kg/m<SUP>3</SUP>       87
Specific volume, m<SUP>3</SUP>/kg     31323
Temperature, K                            0
salt_name                                 0
dtype: int64
(30866, 3)


In [3]:
anions= []
cations= []
missed = 0
for i in range(df.shape[0]):
    if len(salts['salts'].iloc[i].split()) == 2:
        cations.append(salts['salts'].iloc[i].split()[0])
        anions.append(salts['salts'].iloc[i].split()[1])
    elif len(salts['salts'].iloc[i].split()) == 3:
        #two word cation
        if "tris(2-hydroxyethyl) methylammonium" in salts['salts'].iloc[i]:
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            anions.append(salts['salts'].iloc[i].split()[2])
            cations.append(first + ' ' + second)
            
        #these strings have two word anions
        elif("sulfate" in salts['salts'].iloc[i] or\
        "phosphate" in salts['salts'].iloc[i] or\
        "phosphonate" in salts['salts'].iloc[i] or\
        "carbonate" in salts['salts'].iloc[i]):
            first = salts['salts'].iloc[i].split()[1]
            second = salts['salts'].iloc[i].split()[2]
            cations.append(salts['salts'].iloc[i].split()[0])
            anions.append(first + ' ' + second)
        elif("bis(trifluoromethylsulfonyl)imide" in salts['salts'].iloc[i]): 
            #this string contains 2 word cations
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            third = salts['salts'].iloc[i].split()[2]
            cations.append(first + ' ' + second)
            anions.append(third)
        else:
            print(salts['salts'].iloc[i])
            missed += 1
    elif len(salts['salts'].iloc[i].split()) == 4:
        #this particular string block contains (1:1) at end of name
        if("1,1,2,3,3,3-hexafluoro-1-propanesulfonate" in salts['salts'].iloc[i]):
            first = salts['salts'].iloc[i].split()[0]
            second = salts['salts'].iloc[i].split()[1]
            cations.append(first + ' ' + second)
            anions.append(salts['salts'].iloc[i].split()[2])
        else:
            #and two word anion
            first = salts['salts'].iloc[i].split()[1]
            second = salts['salts'].iloc[i].split()[2]
            anions.append(first + ' ' + second)
            cations.append(salts['salts'].iloc[i].split()[0])
    elif("2-aminoethanol-2-hydroxypropanoate" in salts['salts'].iloc[i]):
        #one of the ilthermo salts is missing a space between cation/anion
        anions.append("2-hydroxypropanoate")
        cations.append("2-aminoethanol")
    elif len(salts['salts'].iloc[i].split()) == 5:
        if("bis[(trifluoromethyl)sulfonyl]imide" in salts['salts'].iloc[i]):
            anions.append("bis(trifluoromethylsulfonyl)imide")
            #the first four words make up the cation
            cations.append(' '.join(salts['salts'].iloc[i].split()[:4]))
        elif("trifluoro(perfluoropropyl)borate" in salts['salts'].iloc[i]):
            anions.append("trifluoro(perfluoropropyl)borate")
            cations.append("N,N,N-triethyl-2-methoxyethan-1-aminium")
        else:
            #report unmatched five-word names instead of silently skipping them
            print(salts['salts'].iloc[i])
            missed += 1
    else:
        print(salts['salts'].iloc[i])
        missed += 1
anions = pd.DataFrame(anions, columns=["name-anion"])
cations = pd.DataFrame(cations, columns=["name-cation"])
salts=pd.read_csv('../salty/data/salts_with_smiles.csv',delimiter=',')
new_df = pd.concat([salts["name-cation"], salts["name-anion"], salts["Temperature, K"],\
                    salts["Pressure, kPa"], salts["Specific density, kg/m<SUP>3</SUP>"]],\
                   axis = 1)
print(missed)

3-butyl-1-ethyl-1H-imidazolium trifluoromethanesulfonate (1:1)
1H-imidazolium, 1-butyl-3-methyl-, (OC-6-11)-hexafluoroantimonate(1-)
1,3-propanediol, 2-amino-2-(hydroxymethyl)-, hydrochloride
1H-imidazolium, 1-ethyl-3-methyl-, salt with trifluoroacetic acid (1:1)
1H-imidazolium, 1-ethyl-3-methyl-, salt with trifluoroacetic acid (1:1)
1H-imidazolium, 1-ethyl-3-methyl-, salt with trifluoroacetic acid (1:1)
1H-imidazolium, 1-ethyl-3-methyl-, salt with trifluoroacetic acid (1:1)
choline cyclopentane carboxylate
choline cyclopentane carboxylate
choline cyclopentane carboxylate
choline cyclopentane carboxylate
choline cyclopentane carboxylate
choline cyclopentane carboxylate
choline cyclopentane carboxylate
3-butyl-1-ethyl-1H-imidazolium trifluoromethanesulfonate (1:1)
3-butyl-1-ethyl-1H-imidazolium trifluoromethanesulfonate (1:1)
3-butyl-1-ethyl-1H-imidazolium trifluoromethanesulfonate (1:1)
3-butyl-1-ethyl-1H-imidazolium trifluoromethanesulfonate (1:1)
3-butyl-1-ethyl-1H-imidazolium triflu
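
The core of the parsing cell can be sketched as a small function that returns a (cation, anion) pair or flags the name for manual review. This is a simplified illustration covering only the two-word case and the two-word-anion keywords used above, not a replacement for the full cell. `split_salt_name` is a hypothetical helper, and the first two example names are illustrative inputs rather than entries confirmed to be in the data set.

```python
# Keywords that mark a two-word anion in a three-word salt name.
TWO_WORD_ANION_KEYS = ("sulfate", "phosphate", "phosphonate", "carbonate")


def split_salt_name(name):
    """Return (cation, anion), or None when the name needs manual review."""
    tokens = name.split()
    if len(tokens) == 2:
        # Simple case: one word each for cation and anion.
        return tokens[0], tokens[1]
    if len(tokens) == 3 and any(key in name for key in TWO_WORD_ANION_KEYS):
        # One-word cation followed by a two-word anion.
        return tokens[0], " ".join(tokens[1:])
    return None


simple = split_salt_name("1-butyl-3-methylimidazolium tetrafluoroborate")
two_word_anion = split_salt_name("1-ethyl-3-methylimidazolium ethyl sulfate")
unparsed = split_salt_name("choline cyclopentane carboxylate")
```

Returning `None` for unparsed names, rather than only printing them, would make it easier to keep the cation and anion lists aligned with the rows of `df`.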

After appending SMILES to the dataframe, we're ready to add RDKit descriptors. Because the descriptors are specific to a given cation or anion, and these repeat many times within the data (~10,000 datapoints with ~300 cations and ~150 anions), it is much faster to use pandas to merge existing descriptor dataframes into our growing dataframe from ILThermo than to recompute descriptors for every row.

In [4]:
cationDescriptors = salty.load_data("cationDescriptors.csv")
cationDescriptors.columns = [str(col) + '-cation' for col in cationDescriptors.columns]
anionDescriptors = salty.load_data("anionDescriptors.csv")
anionDescriptors.columns = [str(col) + '-anion' for col in anionDescriptors.columns]

In [5]:
new_df = pd.concat([cations, anions, df["Temperature, K"], df["Pressure, kPa"],\
                    df["Specific density, kg/m<SUP>3</SUP>"]], axis=1)
new_df = pd.merge(cationDescriptors, new_df, on="name-cation", how="right")
new_df = pd.merge(anionDescriptors, new_df, on="name-anion", how="right")
# new_df.dropna(inplace=True) #remove entries not in smiles database
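
To illustrate what the right merge does, here is a minimal sketch on toy data. The cation names, the `MolWt-cation` column, and all values are made up. The point is that `how="right"` keeps every measurement row and leaves NaN where no descriptor row matches.

```python
import pandas as pd

# Hypothetical descriptor table: one row per unique cation.
cation_desc = pd.DataFrame({
    "name-cation": ["cation A", "cation B"],
    "MolWt-cation": [100.0, 150.0],
})

# Hypothetical measurements: cations repeat, and "cation C" has no descriptors.
measurements = pd.DataFrame({
    "name-cation": ["cation A", "cation A", "cation B", "cation C"],
    "Temperature, K": [298.15, 308.15, 298.15, 298.15],
})

# A right merge keeps all measurement rows and copies descriptors where names match.
merged = pd.merge(cation_desc, measurements, on="name-cation", how="right")

# Rows whose cation was not found in the descriptor table.
unmatched = merged[merged["MolWt-cation"].isnull()]
```

Descriptors are computed once per unique ion and then broadcast to every matching row, which is why this scales well when ions repeat.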

In [13]:
from salty import check_name
check_name("3-(3-aminopropyl)-1-methyl-1H-imidazolium")

UnboundLocalError: local variable 'target_lookup' referenced before assignment

In [6]:
cat_missing=[]
an_missing=[]
missing=[]
for i in range(new_df.shape[0]):
    if pd.isnull(new_df.iloc[i]).any() == True:
#         print(new_df.iloc[i], i)
#         cat_missing.append(new_df["name-cation"].iloc[i])
        an_missing.append(new_df["name-anion"].iloc[i])
        print(new_df["name-cation"].iloc[i], new_df["name-anion"].iloc[i], i)

3,3'-(1,9-nonanediyl)bis[1-methylimidazolium] tetrafluoroborate 4690
3,3'-(1,10-decanediyl)bis[1-methylimidazolium] tetrafluoroborate 4691
1,1'-(dodecane-1,12-diyl)bis(3-methyl-1H-imidazolium) tetrafluoroborate 4692
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4693
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4694
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4695
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4696
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4697
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4698
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4699
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4700
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4701
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4702
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4703
1-(3-aminopropyl)-3-methyl-1H-imidazolium tetrafluoroborate 4704
1-(3

In [19]:
trends = an_missing
output = []
for x in trends:
    if x not in output:
        output.append(x)
print(len(output))

55


There appear to be 146 cations and 55 anions missing from the check_name database.
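
One caveat: the loop above records the anion of every row that has any missing value, so it may also count anions whose own descriptors are present when only the cation failed to match. A possible alternative, sketched here on toy data, checks the `-cation` and `-anion` descriptor columns separately. The helper `missing_names`, the `MolWt-*` columns, and all values are hypothetical.

```python
import pandas as pd

# Toy merged frame: c2 lacks cation descriptors, a2 lacks anion descriptors.
merged = pd.DataFrame({
    "name-cation": ["c1", "c1", "c2", "c3"],
    "name-anion": ["a1", "a2", "a1", "a2"],
    "MolWt-cation": [100.0, 100.0, None, 120.0],
    "MolWt-anion": [50.0, None, 50.0, None],
})


def missing_names(frame, suffix):
    """Return sorted unique names whose descriptor columns contain NaN."""
    name_col = "name" + suffix
    desc_cols = [c for c in frame.columns
                 if c.endswith(suffix) and c != name_col]
    # A row is flagged only if one of its own descriptor columns is missing.
    mask = frame[desc_cols].isnull().any(axis=1)
    return sorted(frame.loc[mask, name_col].unique())


missing_cations = missing_names(merged, "-cation")
missing_anions = missing_names(merged, "-anion")
```

Here `c1` and `a1` are not reported even though they share rows with unmatched ions.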