Metal-Insulator Transition Classifiers

This repository contains the code and data used in constructing the thermally-driven metal-insulator transition (MIT) classifiers, which are 3 binary classifiers: a Metal vs. non-Metal model, an Insulator vs. non-Insulator model and an MIT vs. non-MIT model.

Check out our preprint paper on arXiv:

Georgescu, A. B.; Ren, P.; Toland, A. R.; Olivetti, E. A.; Wagner, N.; Rondinelli, J. M. A Database and Machine Learning Model to Identify Thermally Driven Metal-Insulator Transition Compounds. arXiv:2010.13306 [cond-mat] 2020.

Model Description

Research Question

The research question of this project is whether a machine learning classification model can predict temperature-driven metal-insulator transition behavior based on a series of compositional and structural descriptors/features of a given compound.

Training Algorithm

The model type chosen for this task is an XGBoost tree classifier, implemented in Python. XGBoost models have helped win numerous Kaggle competitions and have been shown to perform well on classification tasks. For why we chose XGBoost over other model types, and binary classification over multi-class classification, see the model_comparison.ipynb section below. The takeaway is that XGBoost is consistently among the best-performing model types and is faster to train than other models with comparable performance. Across all model types, performance on the binary classification tasks is also better than on the multi-class task.

A Word of Caution

Since the vast majority of the training data comes from oxides, and there are not many well-documented oxides that exhibit MIT behavior, the training dataset is quite small by machine learning standards (343 observations / rows). The models can therefore easily overfit, especially with a high-dimensional feature set. There is an ongoing effort to find new MIT materials to add to the dataset, so the models trained on it are subject to change over time.

We strongly encourage people to contribute temperature-driven MIT materials that aren't already included in our dataset. Please include your name, institution, the CIF file and reference publications in your email and send them to Professor James M. Rondinelli.

You can also suggest new MIT material(s) by opening an issue with the New MIT material template.

General Workflow

1. Data Preparation

1.1 Getting CIF files

The CIF files are obtained from online databases such as the ICSD, Springer Materials and the Materials Project, in addition to a few hand-generated ones. The vast majority are high-quality experimental structure files from the ICSD, with a few from Springer Materials and the Materials Project.

Note: Unfortunately, we cannot directly share the collected CIF files due to copyright concerns. However, you can find the material IDs of the compounds included in our dataset here (look at the struct_file_path column to find the IDs). Should you have access, you can use those IDs to download the CIF files from ICSD, Springer Materials and the Materials Project. You will find 4 suffixes in struct_file_path, which correspond to the 4 sources as follows.

| Suffix | Source |
| --- | --- |
| CollCode | ICSD |
| SD | Springer Materials |
| MP | Materials Project |
| HandGenerated | Generated by hand based on publications |
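
As a rough illustration of the suffix-to-source mapping, the sketch below classifies a single struct_file_path entry. The exact layout of those entries is not shown here, so the assumption that the suffix starts the last underscore-separated token of the file stem, the function name `source_from_path`, and the example paths are all hypothetical.

```python
from pathlib import PurePosixPath

# Suffix-to-source mapping taken from the table above.
SUFFIX_TO_SOURCE = {
    "CollCode": "ICSD",
    "SD": "Springer Materials",
    "MP": "Materials Project",
    "HandGenerated": "Generated by hand based on publications",
}


def source_from_path(struct_file_path: str) -> str:
    """Return the structure source for one struct_file_path entry.

    Assumption (hypothetical layout): the last underscore-separated token of
    the file stem starts with one of the four suffixes, optionally followed
    by the database ID.
    """
    last_token = PurePosixPath(struct_file_path).stem.rsplit("_", 1)[-1]
    for suffix, source in SUFFIX_TO_SOURCE.items():
        if last_token.startswith(suffix):
            return source
    raise ValueError(f"No known source suffix in {struct_file_path!r}")


# Hypothetical example paths, for illustration only.
print(source_from_path("structures/ExampleA_CollCode12345.cif"))
print(source_from_path("structures/ExampleB_HandGenerated.cif"))
```

If the real paths use a different separator or ordering, only the line that extracts `last_token` should need to change.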

1.2 Generate ionization lookup dataframe

This step creates an ionization lookup table that is used in the subsequent featurization process.

1.3 Generate features using the CIF files

A total of 164 compositional and structural features are generated using a combination of matminer and our in-house handbuilt featurizers. These features then undergo further processing and selection down the pipeline.

1.4 Clean up the data

After a brief exploratory data analysis, it is found that the raw output from the featurizers contains features with missing values, zero variance (i.e. the feature value is the same for all compounds) and high linear correlation (greater than 0.95). The data cleaning process is therefore carried out in the following order:

  • Drop rows / compounds with more than 10 missing features
  • Impute missing values with KNNImputer
    • For each row with missing values, find the 5 nearest neighbors using features that are not missing
    • Impute missing values based on features in the 5 nearest neighbors weighted by their distance
  • Remove features with zero variance
  • Remove features with high linear correlation
    • Find features with a linear correlation greater than 0.95
    • Drop one of the two features in each pair of highly correlated features

After data cleaning, the dataset has 106 features remaining (105 numeric and 1 one-hot-encoded categorical with 2 levels); this is referred to as the full feature set from now on.
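
The cleaning steps listed above can be sketched roughly as follows with pandas and scikit-learn. This is a minimal sketch under stated assumptions, not the code in EDA_and_data_cleaning.ipynb: the function name `clean_features` and the toy table are hypothetical, the table is assumed to be purely numeric, correlation is taken as the absolute Pearson coefficient, and the choice of which feature to drop from a correlated pair (the later column) is arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer


def clean_features(df: pd.DataFrame, max_missing: int = 10,
                   n_neighbors: int = 5,
                   corr_threshold: float = 0.95) -> pd.DataFrame:
    """Apply the four cleaning steps to a purely numeric feature table."""
    # 1. Drop rows (compounds) with more than `max_missing` missing features.
    df = df[df.isna().sum(axis=1) <= max_missing]
    # Guard (not one of the listed steps): KNNImputer would silently discard
    # columns that are entirely empty, so remove them explicitly first.
    df = df.dropna(axis=1, how="all")

    # 2. Impute remaining gaps from the nearest neighbours, weighted by distance.
    imputer = KNNImputer(n_neighbors=n_neighbors, weights="distance")
    df = pd.DataFrame(imputer.fit_transform(df),
                      columns=df.columns, index=df.index)

    # 3. Remove zero-variance features (same value for every compound).
    df = df.loc[:, df.nunique() > 1]

    # 4. For each pair with |correlation| above the threshold, drop the later column.
    corr = df.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns
               if (upper[col] > corr_threshold).any()]
    return df.drop(columns=to_drop)


# Toy table for illustration only (not the real feature set).
rng = np.random.default_rng(0)
a = rng.normal(size=30)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a,              # perfectly correlated with "a"
    "c": np.ones(30),          # zero variance
    "d": rng.normal(size=30),
})
toy.loc[[3, 7], "d"] = np.nan  # a couple of missing values to impute
cleaned = clean_features(toy)
print(list(cleaned.columns))
```

In this toy case the constant column and the duplicate of "a" are removed, and the two gaps in "d" are filled by the imputer.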

2. Model Building

The model building process follows an iterative approach. During the first iteration, the cleaned-up full feature set is fed into the classifiers, trained and then evaluated. Then with the help of SHAP values and domain knowledge, features with high importance are selected and used as input to the second iteration of model training and evaluation.

2.1 Tune the XGBoost model

The training process starts with hyperparameter tuning with grid search cross validation. The default parameter search grid for the XGBClassifier is as follows.

| Parameter | Search space |
| --- | --- |
| n_estimators | [10, 20, 30, 40, 80, 100, 150, 200] |
| max_depth | [2, 3, 4, 5] |
| learning_rate | np.logspace(-3, 2, num=6) |
| subsample | [0.5, 0.6, 0.7, 0.8, 0.9, 1.0] |
| scale_pos_weight | [num_of_negative_class / num_of_positive_class] |
| base_score | [0.3, 0.5, 0.7] |
| random_state | [seed] |

The scoring metric during tuning is f1_weighted. The best tuned parameters are then stored for model evaluation.
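
To get a feel for the size of this search, the sketch below writes the grid as a plain Python dictionary and counts the parameter combinations. The helper name `build_param_grid` and the toy labels are hypothetical; the learning-rate list spells out the six powers of ten that np.logspace(-3, 2, num=6) produces.

```python
from math import prod


def build_param_grid(y, seed=0):
    """Return the default search grid for binary labels y (1 = positive class)."""
    n_pos = sum(1 for label in y if label == 1)
    n_neg = len(y) - n_pos
    return {
        "n_estimators": [10, 20, 30, 40, 80, 100, 150, 200],
        "max_depth": [2, 3, 4, 5],
        # Same six powers of ten as np.logspace(-3, 2, num=6): 0.001 ... 100.
        "learning_rate": [10.0 ** e for e in range(-3, 3)],
        "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
        # Single value that re-weights the positive class by the class ratio.
        "scale_pos_weight": [n_neg / n_pos],
        "base_score": [0.3, 0.5, 0.7],
        "random_state": [seed],
    }


# Toy labels for illustration only: 3 positives and 9 negatives.
toy_labels = [1] * 3 + [0] * 9
grid = build_param_grid(toy_labels, seed=0)

# Number of candidate settings: 8 * 4 * 6 * 6 * 1 * 3 * 1.
n_candidates = prod(len(values) for values in grid.values())
print(n_candidates)  # 3456
```

A dictionary of this shape is what scikit-learn's GridSearchCV accepts as `param_grid`, together with `scoring="f1_weighted"`. Each of the 3456 candidates is fitted once per cross-validation fold, which is why tuning is slow.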

2.2 Evaluate performance and save models

Due to the scarcity of training examples, stratified 5-fold cross validation (cv) is used to evaluate model performance instead of a hold-out test set. There are 4 evaluation metrics used:

  1. precision_weighted
  2. recall_weighted
  3. roc_auc
  4. f1_weighted

Since the cross validation splits depend on the random seed, a list of 10 seeds (integers from 0 to 9) is used to account for the variation in model performance across different splits. For each seed, a stratified 5-fold cv is carried out, from which the median / mean value of each metric is obtained. With 10 seeds, there are 10 median / mean values for each metric; a final median / mean is then calculated from those 10 values, along with the interquartile range / standard deviation respectively. In other words, the reported values are a median of medians by default, or a mean of means should you choose so.
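
The seed-averaged evaluation described above might look roughly like the following for a single metric. This is a sketch under stated assumptions: the data are synthetic, a small decision tree stands in for the tuned XGBoost classifier, and keeping the estimator's own random state fixed across seeds is a choice made here, not something taken from the repository.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, NOT the MIT dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

per_seed_medians = []
for seed in range(10):  # seeds 0 to 9
    # The seed only controls how the 5 stratified folds are drawn.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    # Stand-in estimator; the repository uses tuned XGBoost classifiers.
    model = DecisionTreeClassifier(max_depth=3, random_state=0)
    fold_scores = cross_val_score(model, X, y, cv=cv, scoring="f1_weighted")
    per_seed_medians.append(np.median(fold_scores))

# Median of the 10 per-seed medians, with the interquartile range as spread.
median_of_medians = np.median(per_seed_medians)
q1, q3 = np.percentile(per_seed_medians, [25, 75])
iqr = q3 - q1
print(f"f1_weighted: {median_of_medians:.3f} (IQR {iqr:.3f})")
```

Swapping `np.median` for `np.mean` and the interquartile range for `np.std` gives the mean-of-means variant.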

After model evaluation, the models are trained on the entire dataset (343 compounds with the full feature set) with the best parameters and then stored.

2.3 Select important features and iterate

Using the stored models, a SHAP analysis is carried out to find the most important features. These important features are further screened using domain knowledge. Currently, 10 features are selected to create a reduced feature set. This feature selection step mainly serves to prevent overfitting.
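
One common way to turn SHAP values into a global feature ranking is to average their absolute values over all compounds. The sketch below shows only that ranking step and assumes a matrix of SHAP values has already been computed. The ranking criterion, the function name and the placeholder numbers are assumptions for illustration; the screening by domain knowledge is a manual step that code like this does not capture.

```python
import numpy as np


def top_features_by_mean_abs_shap(shap_values, feature_names, k=10):
    """Rank features by mean |SHAP| across compounds and return the top k."""
    importance = np.abs(np.asarray(shap_values)).mean(axis=0)
    order = np.argsort(importance)[::-1][:k]
    return [(feature_names[i], float(importance[i])) for i in order]


# Placeholder matrix (rows = compounds, columns = features), illustration only.
shap_values = np.array([
    [0.5, -0.1, 0.0],
    [-0.7, 0.2, 0.1],
    [0.6, -0.3, 0.0],
])
names = ["feature_a", "feature_b", "feature_c"]

ranking = top_features_by_mean_abs_shap(shap_values, names, k=2)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```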

With this reduced feature set, the entire model building process is repeated and the models are re-tuned, re-evaluated and re-trained on the reduced feature set.

3. Deploy & Serve Models

The trained classifiers are made available to the larger materials science community through Jupyter notebooks hosted via the Binder service. One can immediately upload a CIF file and easily make a prediction using our classifiers directly in the web browser.

The models served on the Binder server are by default based on the reduced feature set.

Demo Notebooks

Binder

There are several Jupyter notebooks available for easier result replication and demonstration purposes. You can immediately launch interactive versions of these notebooks in your web browser by clicking on the binder icon above or clicking on the subsection titles below.

Note: Any changes made on the server will not be saved unless you download a copy of the notebook onto your local machine.

You can replicate the workflow by using the notebooks in the following order.

generate_lookup_table.ipynb

This notebook generates the ionization energy lookup spreadsheet.

generate_compound_features.ipynb

This notebook allows you to generate features for all the structures. As mentioned before, since we cannot share the structure files, running this notebook will not work due to the absence of CIF files.

EDA_and_data_cleaning.ipynb

This notebook presents an exploratory data analysis along with a data cleaning process on the output dataset from generate_compound_features.ipynb.

model_building_and_eval.ipynb

This notebook contains the code that tunes, trains and evaluates the models along with a SHAP analysis on models trained with the full feature set. It is NOT recommended to train the models directly on the Binder server since it is a very memory intensive process (it will also take a very long time to train!). The Binder container by default has 2GB of RAM and if the memory limit is exceeded, there is a possibility that the kernel will restart and you'll have to start over. That being said, you are welcome to download the repository onto your local machine and play around with the model parameters and selection.

pipeline_demo.ipynb

This notebook demonstrates the prediction pipeline through which a prediction is made on a new structure that is not included in the original training set. You can even upload your own CIF structure and get a prediction! If you just want to play around with the trained models or make a prediction on a structure of your own choice, you can start here.

Supporting notebooks

model_comparison.ipynb

This notebook answers the question of "Why should one choose XGBoost over some other models?" by comparing the classification performance of 6 model types on the full feature set across 4 classification tasks. The model types are as follows.

| Model type | Description |
| --- | --- |
| DummyClassifier | Naive models that always guess at random (baseline performance) |
| LogisticRegression | Linear classifiers with L2 regularization |
| DecisionTreeClassifier | Generic decision tree classifiers |
| RandomForestClassifier | Ensemble decision tree classifiers |
| GradientBoostingClassifier | Gradient-boosting tree classifiers |
| XGBoostClassifier | Extreme gradient-boosting tree classifiers |

The 4 classification tasks are:

  1. Metal vs. non-Metal (Insulators + MITs)
  2. Insulator vs. non-Insulator (Metals + MITs)
  3. MIT vs. non-MIT (Metals + Insulators)
  4. Multi-class classification
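
The three binary tasks above are one-vs-rest views of the same three-class labels. A minimal sketch of deriving them is shown below; the label strings and the helper name `binary_targets` are hypothetical, since the dataset's actual label encoding is not shown here.

```python
CLASSES = ("Metal", "Insulator", "MIT")


def binary_targets(labels, positive_class):
    """Map three-class labels to 1 (positive_class) or 0 (everything else)."""
    if positive_class not in CLASSES:
        raise ValueError(f"Unknown class: {positive_class!r}")
    return [1 if label == positive_class else 0 for label in labels]


# Hypothetical labels for five compounds, for illustration only.
labels = ["Metal", "MIT", "Insulator", "Metal", "MIT"]

# One binary target vector per task; the multi-class task uses `labels` as-is.
tasks = {f"{cls} vs. non-{cls}": binary_targets(labels, cls) for cls in CLASSES}
for name, target in tasks.items():
    print(name, target)
```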

The metrics and evaluation method are the same as the process mentioned earlier. The comparison results are summarized in this table. A summary plot is also provided for easier interpretation.

shap_analysis.ipynb

This notebook presents a brief SHAP analysis on models trained with the reduced feature set.

test_featurizer_sub_functions.ipynb

This is a brief tutorial notebook that explains some sub-functions in the compound_featurizer.py file.

handbuilt_featurizer_benchmark.ipynb

This notebook benchmarks how "good" the handbuilt featurizer is against values from Tables 2 and 3 of Torrance et al.

dataset_visualization.ipynb

This notebook contains visualization plots to be included in the paper.
