Metal-Insulator Transition Classifiers
This repository contains the code and data used to construct the thermally driven metal-insulator transition (MIT) classifiers. There are 3 binary classifiers: a Metal vs. non-Metal model, an Insulator vs. non-Insulator model and an MIT vs. non-MIT model.
Check out our preprint paper on arXiv:
Georgescu, A. B.; Ren, P.; Toland, A. R.; Olivetti, E. A.; Wagner, N.; Rondinelli, J. M. A Database and Machine Learning Model to Identify Thermally Driven Metal-Insulator Transition Compounds. arXiv:2010.13306 [cond-mat] 2020.
Table of Contents
- Model Description
- General Workflow
- Demo Notebooks
- pipeline_demo.ipynb (Make a prediction right in your web browser!)
- Supporting notebooks
The research question of this project is whether a machine learning classification model can predict temperature-driven metal-insulator transition behavior based on a series of compositional and structural descriptors/features of a given compound.
The model type chosen for this task is an XGBoost tree classifier implemented in Python. XGBoost models have helped win numerous Kaggle competitions and have been shown to perform well on classification tasks. If you wonder why we chose XGBoost over other model types, and binary classification over multi-class classification, you can refer to this section. The takeaway is that XGBoost is consistently among the best-performing model types and is faster to train than other models with comparable performance. All model types also perform better on the binary classification tasks than on the multi-class task.
A Word of Caution
Since the vast majority of the training data comes from oxides, and there are not many well-documented oxides that exhibit MIT behavior, the training dataset is quite small by machine learning standards (343 observations / rows). As a result, the models can easily overfit, especially with a high-dimensional feature set. There is an ongoing effort to find new MIT materials to add to the dataset, so the models trained on it are subject to change over time.
We strongly encourage people to contribute temperature-driven MIT materials that aren't already included in our dataset. Please include your name, institution, the CIF file and reference publications in your email and send them to Professor James M. Rondinelli.
You can also suggest new MIT material(s) by opening an issue with the New MIT material template.
1. Data Preparation
1.1 Getting CIF files
The CIF files are obtained from online databases such as the ICSD, Springer Materials and the Materials Project, in addition to a few hand-generated ones. The vast majority of the CIF files are high-quality experimental structure files from the ICSD, with a few from Springer Materials and the Materials Project.
Note: Unfortunately, we cannot directly share the collected CIF files due to copyright concerns. However, you can find the material IDs of the compounds included in our dataset here (look at the struct_file_path column to find the IDs). Should you have access, you can use those IDs to download the CIF files from ICSD, Springer and Materials Project.
You will find 4 suffixes in struct_file_path, which correspond to 4 sources as follows.
| Suffix | Source |
|---|---|
| HandGenerated | Generated by hand based on publications |
1.2 Generate ionization lookup dataframe
This step creates an ionization lookup table that is used in the subsequent featurization process.
1.3 Generate features using the CIF files
A total of 164 compositional and structural features are generated using a combination of matminer and our in-house handbuilt featurizers. These features then undergo further processing and selection down the pipeline.
1.4 Clean up the data
After a brief exploratory data analysis, it is found that the raw output from the featurizers contains features with missing values, features with zero variance (i.e. the feature value is the same for all compounds) and pairs of features with high linear correlation (greater than 0.95). Therefore, the data cleaning process is carried out in the following order:
- Drop rows / compounds with more than 10 missing features
- Impute missing values with KNNImputer
- For each row with missing values, find the 5 nearest neighbors using features that are not missing
- Impute missing values based on features in the 5 nearest neighbors weighted by their distance
- Remove features with zero variance
- Remove features with high linear correlation
- Find features with a linear correlation greater than 0.95
- Drop one of the two features in each pair of highly correlated features
After data cleaning, 106 features remain (105 numeric features and 1 one-hot-encoded categorical feature with 2 levels). This set will be referred to as the full feature set from now on.
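The cleaning steps listed above can be sketched without any dependencies as follows. This is only an illustration of the order of operations, not the repository's code: the function names, the toy data and the choice to keep the first feature of each correlated pair are assumptions, and the distance used here is a simplified stand-in for the one used by scikit-learn's KNNImputer.

```python
import math

def drop_sparse_rows(rows, max_missing=10):
    """Step 1: keep rows with at most `max_missing` missing (None) values."""
    return [r for r in rows if sum(v is None for v in r) <= max_missing]

def _distance(a, b):
    """Root-mean-square difference over the features observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

def knn_impute(rows, k=5):
    """Step 2: fill each None from the k nearest rows, weighted by 1/distance."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate donors are the other rows in which feature j is observed.
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            donors = [(d, v) for d, v in donors if d != math.inf]
            exact = [v for d, v in donors if d == 0]
            if exact:  # identical neighbours: average them to avoid dividing by zero
                filled[i][j] = sum(exact) / len(exact)
            elif donors:
                total = sum(1 / d for d, _ in donors)
                filled[i][j] = sum(v / d for d, v in donors) / total
    return filled

def _pearson(x, y):
    """Pearson correlation coefficient of two non-constant columns."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return cov / norm

def select_columns(names, rows, threshold=0.95):
    """Steps 3-4: drop constant columns, then one column of each correlated pair."""
    columns = {n: [r[i] for r in rows] for i, n in enumerate(names)}
    columns = {n: c for n, c in columns.items() if len(set(c)) > 1}
    kept = {}
    for name, col in columns.items():
        # Assumption: the feature that appears first is the one that is kept.
        if all(abs(_pearson(col, other)) <= threshold for other in kept.values()):
            kept[name] = col
    return list(kept)

# Toy data (made up): "b" is almost a multiple of "a", "c" is constant.
names = ["a", "b", "c", "d"]
rows = [
    [1.0, 2.0, 5.0, 4.0],
    [2.0, 4.0, 5.0, 1.0],
    [3.0, 6.0, 5.0, None],
    [4.0, 8.1, 5.0, 2.0],
]
rows = knn_impute(drop_sparse_rows(rows), k=2)
selected = select_columns(names, rows)
print(selected)
```

On this toy input the constant column and the near-duplicate column are removed, and the missing value is replaced by a weighted average of its two nearest neighbors.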
2. Model Building
The model building process follows an iterative approach. During the first iteration, the classifiers are trained and evaluated on the cleaned-up full feature set. Then, with the help of SHAP values and domain knowledge, features with high importance are selected and used as input to a second iteration of model training and evaluation.
2.1 Tune the XGBoost model
The training process starts with hyperparameter tuning with grid search cross validation. The default parameter search grid for the XGBClassifier is as follows.
| Parameter | Search values |
|---|---|
| n_estimators | [10, 20, 30, 40, 80, 100, 150, 200] |
| max_depth | [2, 3, 4, 5] |
| learning_rate | np.logspace(-3, 2, num=6) |
| subsample | [0.5, 0.6, 0.7, 0.8, 0.9, 1.0] |
| scale_pos_weight | [num_of_negative_class / num_of_positive_class] |
| base_score | [0.3, 0.5, 0.7] |
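To get a feel for the size of this search space, the grid can be enumerated with the standard library alone. The sketch below is illustrative: the class counts are made-up placeholders (only their sum, 343, comes from this document), and the learning rates are written as explicit powers of ten, which are the six values that np.logspace(-3, 2, num=6) produces.

```python
from itertools import product

# Hypothetical class counts for one binary task; not taken from the dataset.
n_negative, n_positive = 250, 93

param_grid = {
    "n_estimators": [10, 20, 30, 40, 80, 100, 150, 200],
    "max_depth": [2, 3, 4, 5],
    "learning_rate": [10.0 ** e for e in range(-3, 3)],  # 1e-3 ... 1e2
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "scale_pos_weight": [n_negative / n_positive],
    "base_score": [0.3, 0.5, 0.7],
}

# An exhaustive grid search evaluates every combination of the values above.
candidates = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]
print(len(candidates))  # 8 * 4 * 6 * 6 * 1 * 3 = 3456
```

Each candidate is then refit once per cross-validation fold, which is one reason tuning is expensive on a small machine.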
The scoring metric during tuning is f1_weighted. The best tuned parameters are then stored for model evaluation.
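For readers unfamiliar with the metric, f1_weighted is the per-class F1 score averaged with weights equal to each class's share of the true labels. The following dependency-free function is a sketch of that definition for illustration; in practice the scoring is handled by scikit-learn, and the labels below are made up.

```python
def f1_weighted(y_true, y_pred):
    """Support-weighted average of the per-class F1 scores."""
    total = 0.0
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        total += f1 * (tp + fn)  # tp + fn is the number of true members of class c
    return total / len(y_true)

# Made-up labels: 2 positives and 4 negatives, with one error in each direction.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
score = f1_weighted(y_true, y_pred)
print(round(score, 4))
```

Weighting by support keeps the majority class from being ignored while still penalizing poor performance on the minority class in proportion to its size.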
2.2 Evaluate performance and save models
Due to the scarcity of training examples, stratified 5-fold cross validation (cv) is used to evaluate model performance instead of a hold-out test set. There are 4 evaluation metrics used:
Since the cross-validation splits depend on the random seed, a list of 10 seeds (integers from 0 to 9) is used to account for the variation in model performance across splits. For each seed, a stratified 5-fold cv is carried out, from which the median (or mean) of each metric over the 5 folds is obtained. With 10 seeds, this gives 10 median (or mean) values per metric, and a final median (or mean) is calculated from those 10 values, along with the interquartile range (or standard deviation, respectively). In other words, the reported values are a median of medians by default, or an average of averages if you choose so.
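The aggregation described above can be sketched with the standard library. The function name and the scores are hypothetical, and two details are assumptions: the quartile method (the default of statistics.quantiles, which can differ slightly from NumPy's) and the use of the sample standard deviation.

```python
import statistics

def summarize(scores_by_seed, use_median=True):
    """Collapse per-fold scores to one value per seed, then to one overall value.

    scores_by_seed holds one list of per-fold scores for each random seed.
    Returns (center, spread): median and IQR, or mean and standard deviation.
    """
    if use_median:
        per_seed = [statistics.median(folds) for folds in scores_by_seed]
        q1, _, q3 = statistics.quantiles(per_seed, n=4)
        return statistics.median(per_seed), q3 - q1
    per_seed = [statistics.mean(folds) for folds in scores_by_seed]
    return statistics.mean(per_seed), statistics.stdev(per_seed)

# Made-up scores for 3 seeds x 5 folds (the workflow above uses 10 seeds).
scores = [
    [0.80, 0.90, 0.70, 0.85, 0.75],
    [0.60, 0.90, 0.80, 0.70, 1.00],
    [0.50, 0.70, 0.90, 0.60, 0.80],
]
median_of_medians, iqr = summarize(scores)
mean_of_means, std = summarize(scores, use_median=False)
print(median_of_medians, iqr, mean_of_means, std)
```

Reporting the median with the interquartile range is less sensitive to a single unlucky split than the mean with the standard deviation, which matters with folds as small as these.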
After model evaluation, the models are trained on the entire dataset (343 compounds with the full feature set) with the best parameters and then stored.
2.3 Select important features and iterate
Using the stored models, a SHAP analysis is carried out to find the most important features. These important features are further screened using domain knowledge. Currently, 10 features are selected to create a reduced feature set. This feature selection step mainly serves to prevent overfitting.
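One common way to turn SHAP values into a feature ranking is to average their absolute values over all compounds. The sketch below assumes that convention, which this document does not spell out, and uses made-up feature names and SHAP values; the subsequent screening by domain knowledge is a manual step that code cannot reproduce.

```python
def rank_by_mean_abs_shap(feature_names, shap_values):
    """Order features by mean |SHAP value|, most important first.

    shap_values has one row per compound and one column per feature.
    """
    n_rows = len(shap_values)
    importance = {
        name: sum(abs(row[j]) for row in shap_values) / n_rows
        for j, name in enumerate(feature_names)
    }
    return sorted(importance, key=importance.get, reverse=True)

# Hypothetical feature names and SHAP values for three compounds.
feature_names = ["feat_a", "feat_b", "feat_c"]
shap_values = [
    [0.10, -0.40, 0.00],
    [-0.20, 0.30, 0.05],
    [0.15, -0.50, -0.05],
]
ranking = rank_by_mean_abs_shap(feature_names, shap_values)
top_features = ranking[:2]  # candidates for the reduced feature set
print(top_features)
```

Taking the absolute value first matters: a feature that pushes predictions strongly in both directions would otherwise average out to near zero.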
With this reduced feature set, the entire model building process is repeated and the models are re-tuned, re-evaluated and re-trained on the reduced feature set.
3. Deploy & Serve Models
The trained classifiers are made available to the larger materials science community through Jupyter notebooks hosted via the Binder service. One can immediately upload a CIF file and easily make a prediction using our classifiers directly in the web browser.
The models served on the Binder server are by default based on the reduced feature set.
There are several Jupyter notebooks available for easier result replication and demonstration purposes. You can immediately launch interactive versions of these notebooks in your web browser by clicking on the binder icon above or clicking on the subsection titles below.
Note: Any changes made on the server will not be saved unless you download a copy of the notebook onto your local machine.
You can replicate the workflow by using the notebooks in the following order.
This notebook generates the ionization energy lookup spreadsheet.
This notebook allows you to generate features for all the structures. As mentioned before, since we cannot share the structure files, running this notebook will not work due to the absence of CIF files.
This notebook presents an exploratory data analysis along with a data cleaning process on the output dataset from generate_compound_features.ipynb.
This notebook contains the code that tunes, trains and evaluates the models along with a SHAP analysis on models trained with the full feature set. It is NOT recommended to train the models directly on the Binder server since it is a very memory intensive process (it will also take a very long time to train!). The Binder container by default has 2GB of RAM and if the memory limit is exceeded, there is a possibility that the kernel will restart and you'll have to start over. That being said, you are welcome to download the repository onto your local machine and play around with the model parameters and selection.
This notebook demonstrates the prediction pipeline through which a prediction is made on a new structure that is not included in the original training set. You can even upload your own CIF structure and get a prediction! If you just want to play around with the trained models or make a prediction on a structure of your own choice, you can start here.
This notebook answers the question of "Why should one choose XGBoost over some other models?" by comparing the classification performance of 6 model types on the full feature set across 4 classification tasks. The model types are as follows.
| Model type | Description |
|---|---|
| DummyClassifier | Naive models that always guess at random (baseline performance) |
| LogisticRegression | Linear classifiers with L2 regularization |
| DecisionTreeClassifier | Generic decision tree classifiers |
| RandomForestClassifier | Ensemble decision tree classifiers |
| GradientBoostingClassifier | Gradient-boosting tree classifiers |
| XGBoostClassifier | Extreme gradient-boosting tree classifiers |
The 4 classification tasks are:
- Metal vs. non-Metal (Insulators + MITs)
- Insulator vs. non-Insulator (Metals + MITs)
- MIT vs. non-MIT (Metals + Insulators)
- Multi-class classification
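The three binary tasks are one-vs-rest views of the same three-class label. As a minimal sketch, assuming the labels are stored as the strings "Metal", "Insulator" and "MIT" (the actual encoding in the dataset may differ), the binary targets and the corresponding negative-to-positive ratio used for scale_pos_weight can be derived like this:

```python
def binary_targets(labels, classes=("Metal", "Insulator", "MIT")):
    """Build one 0/1 target vector per class (1 = the class, 0 = everything else)."""
    return {c: [1 if label == c else 0 for label in labels] for c in classes}

def negative_to_positive_ratio(target):
    """Ratio of negative to positive examples in a 0/1 target vector."""
    positives = sum(target)
    return (len(target) - positives) / positives

# Made-up labels for five compounds.
labels = ["Metal", "MIT", "Insulator", "Metal", "MIT"]
targets = binary_targets(labels)
mit_ratio = negative_to_positive_ratio(targets["MIT"])
print(targets["MIT"], mit_ratio)
```

The multi-class task uses the original three-valued label directly instead of these derived vectors.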
This notebook presents a brief SHAP analysis on models trained with the reduced feature set.
This notebook contains visualization plots to be included in the paper.