# An interactive data exploration tool for ionic liquid data from ILThermo (NIST) to guide intelligent solvent design

![](../webapp/salty_web_app/collection/static/images/about_header.png)

High-performance computing (HPC) and open-source software are changing strategies for materials discovery by making it possible to explore design spaces that are too large for traditional research methods. This project establishes predictive tools and models for the publicly available ILThermo data distributed by the **National Institute of Standards and Technology (NIST)**. In partnership with NIST, we combine the large body of chemical and physical property data on ionic liquids (ILs) with data science and statistical tools **(RDKit, scikit-learn)** to create publicly accessible data exploration and visualization tools. Specifically, through a built-in web application, site users can: 1) calculate common statistical metrics, 2) generate machine learning models such as neural network (NN), least absolute shrinkage and selection operator (LASSO), and support vector machine (SVM) regression to predict the properties of unknown ionic liquids, and 3) access visual representations of their data selection, together with methods of coefficient confidence estimation, to assess the robustness of their models and their sensitivity to the underlying data.

# Key steps in data acquisition, feature generation and model training

The following describes a general method for creating and selecting features and using them to train more sophisticated models, e.g. NN or SVM. The web app lets users experiment with this basic approach by selecting subsets of the underlying data, subsets of the features, and various hyperparameters of the models.

NIST provided **31,326** experimental density measurements of ILs. We used RDKit to generate **194** physical and chemical descriptors of the cation and anion moieties, and least absolute shrinkage and selection operator (LASSO) regression to assess how the selected features depend on salt type.

![](../webapp/salty_web_app/collection/static/images/density_hist.png)
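As a rough illustration of the descriptor step, the sketch below computes every descriptor in RDKit's built-in `Descriptors.descList` for one cation and one anion and joins them into a single feature dictionary. The helper names (`ion_descriptors`, `salt_features`), the `cation_`/`anion_` prefixes, and the example salt (1-butyl-3-methylimidazolium tetrafluoroborate) are assumptions made for illustration. The number of descriptors RDKit returns depends on its version and will not necessarily match the 194 used in this project.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors


def ion_descriptors(smiles, prefix):
    """Compute all built-in RDKit descriptors for one ion (hypothetical helper)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # Descriptors.descList is a list of (name, function) pairs.
    return {f"{prefix}_{name}": fn(mol) for name, fn in Descriptors.descList}


def salt_features(cation_smiles, anion_smiles):
    """Join cation and anion descriptors into one feature dictionary."""
    features = ion_descriptors(cation_smiles, "cation")
    features.update(ion_descriptors(anion_smiles, "anion"))
    return features


# Example salt chosen for illustration only.
features = salt_features("CCCCn1cc[n+](C)c1", "F[B-](F)(F)F")
print(len(features), "descriptors for this cation/anion pair")
```

Keeping the cation and anion descriptors under separate prefixes makes it possible to see later which ion a selected feature belongs to.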

Shuffle-split, cross-validation, and bootstrap algorithms are most commonly used to estimate confidence in model coefficients. They can, however, also be used for model parameterization. In developing the LASSO models, we used all three methods to systematically search for the optimum alpha value, the regularization parameter that controls how many features are selected.

![](../webapp/salty_web_app/collection/static/images/lasso_param.png)
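The alpha search described above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the project's actual code. The alpha grid, the split counts, the out-of-bag bootstrap scheme, and the helper names `bootstrap_splits` and `mean_test_mse` are all assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold, ShuffleSplit

# Synthetic stand-in for the descriptor matrix and density values.
data_rng = np.random.default_rng(0)
X = data_rng.normal(size=(200, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.5, 1.0]
y = X @ true_coef + data_rng.normal(scale=0.1, size=200)


def bootstrap_splits(n_samples, n_splits, seed):
    """Yield (train, test) index pairs: train is drawn with replacement,
    test is the out-of-bag remainder (one possible bootstrap scheme)."""
    local_rng = np.random.default_rng(seed)
    for _ in range(n_splits):
        train = local_rng.integers(0, n_samples, size=n_samples)
        test = np.setdiff1d(np.arange(n_samples), train)
        yield train, test


def mean_test_mse(alpha, splits):
    """Average held-out mean squared error of a LASSO fit over all splits."""
    errors = []
    for train, test in splits:
        model = Lasso(alpha=alpha, max_iter=10000).fit(X[train], y[train])
        errors.append(np.mean((model.predict(X[test]) - y[test]) ** 2))
    return float(np.mean(errors))


alphas = np.logspace(-4, 0, 9)  # assumed search grid
resamplers = {
    "shuffle_split": lambda: ShuffleSplit(
        n_splits=10, test_size=0.2, random_state=0).split(X),
    "k_fold": lambda: KFold(n_splits=5, shuffle=True, random_state=0).split(X),
    "bootstrap": lambda: bootstrap_splits(len(X), 10, seed=0),
}

best_alpha = {}
for name, make_splits in resamplers.items():
    scores = [mean_test_mse(a, make_splits()) for a in alphas]
    best_alpha[name] = float(alphas[int(np.argmin(scores))])

print(best_alpha)
```

Comparing the alpha chosen by each resampling scheme gives a quick check on how sensitive the choice is to the way the data are split.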

The LASSO model is then trained with the optimized alpha value (the regularization strength, often written as lambda) on a random selection of the dataset.

![](../webapp/salty_web_app/collection/static/images/lasso_regression.png)
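A minimal sketch of this training step, again on synthetic data: the alpha value, the 80/20 split, and the variable names are placeholders, not the values used for the figure above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in: only features 0 and 4 actually drive the target.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 15))
y = 3.0 * X[:, 0] - 2.0 * X[:, 4] + rng.normal(scale=0.1, size=300)

# Random selection of the data for training; the rest is held out.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# alpha=0.05 is a placeholder for the optimized value.
model = Lasso(alpha=0.05, max_iter=10000).fit(X_train, y_train)

# Features with non-zero coefficients are the ones LASSO "selected".
selected = np.flatnonzero(model.coef_)
r2_test = model.score(X_test, y_test)
print("selected feature indices:", selected)
print("held-out R^2:", round(r2_test, 3))
```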

LASSO is an excellent regression method for selecting features. Neural network (NN) models, however, typically achieve better prediction accuracy because the networks they build can combine the input features nonlinearly. Therefore, over 1000 iterations we randomly split the data into training and testing sets, trained LASSO models, and used the most frequently selected features to build NNs for the entire dataset and for the imidazolium subset. Histograms of the selected features are shown below.


<img src="../webapp/salty_web_app/collection/static/images/features_hist.png" style="width: 450px;"/>
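The repeated selection behind such a histogram can be sketched as follows. The code refits LASSO on many random training splits and counts how often each feature receives a non-zero coefficient. It uses synthetic data and 50 iterations (rather than 1000) to stay fast. The alpha value and the choice to keep the top two features are illustrative assumptions.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in: only features 1 and 7 carry signal.
rng = np.random.default_rng(2)
n_features = 12
X = rng.normal(size=(250, n_features))
y = 2.0 * X[:, 1] + 1.0 * X[:, 7] + rng.normal(scale=0.2, size=250)

# Count how often each feature is selected across random training splits.
counts = Counter()
splitter = ShuffleSplit(n_splits=50, test_size=0.2, random_state=2)
for train, _ in splitter.split(X):
    coef = Lasso(alpha=0.05, max_iter=10000).fit(X[train], y[train]).coef_
    counts.update(np.flatnonzero(coef).tolist())

# The selection counts are the raw data for a histogram of selected features.
top_features = [idx for idx, _ in counts.most_common(2)]
print("selection counts:", dict(counts))
print("top features:", sorted(top_features))
```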

Results from the NN regression are shown below.

<img src="../webapp/salty_web_app/collection/static/images/nn_regression.png" style="width: 550px;"/>
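For readers who want to try the final step, here is a minimal sketch of fitting a small neural network on a handful of pre-selected features. It assumes scikit-learn's `MLPRegressor`, which may not be the NN implementation used for the figure above. The network size, the solver, and the synthetic data are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the top LASSO-selected features and the target;
# the interaction term is something a purely linear model cannot capture.
rng = np.random.default_rng(3)
X_top = rng.normal(size=(400, 2))
y = (1.0 + 0.5 * X_top[:, 0] - 0.3 * X_top[:, 1]
     + 0.2 * X_top[:, 0] * X_top[:, 1]
     + rng.normal(scale=0.05, size=400))

X_train, X_test, y_train, y_test = train_test_split(
    X_top, y, test_size=0.2, random_state=3)

# Scale the inputs, then fit a single-hidden-layer network.
nn = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(16,), solver="lbfgs",
                 max_iter=2000, random_state=3),
)
nn.fit(X_train, y_train)
r2_test = nn.score(X_test, y_test)
print("held-out R^2:", round(r2_test, 3))
```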