```{margin} Adaptation!
This work was adapted from one of the Materials Project workshop lessons (https://github.com/materialsproject/workshop/), which is released under the BSD-3-clause license. As such, this notebook is also available under the BSD-3-clause license.
```

`````{note}
This lecture is going to:
* Discuss why and how we can generate descriptors for materials problems
* Introduce several descriptors/features that can be used for:
    * chemical compositions
    * structure (atomic xyz structure) features
    * ...
* Demonstrate one package for generating features (`matminer`)
* Fit a simple linear model to a material science dataset 
`````

# Featurizing molecules and materials for chemical engineering

We've done a lot of work with polynomial features so far, but chemical engineering is full of examples where it's not clear exactly what the right features are. In this lecture we'll talk about how to turn data like chemical composition or atomic structures into features that can be used with common machine learning models that expect feature vectors (like our sklearn models). 

`````{seealso}
Google slides on various featurization strategies in molecules, materials, and catalysis!

https://docs.google.com/presentation/d/11cNUIWjqhxwYdVfmkkNXCac-rks7ZkrN9VgiIis6VhQ/edit#slide=id.g9b4bd6acd3_0_6
`````

## Demonstration: Materials features using matminer


A material science workflow looks very similar to the ones we've done so far, but the starting point is different!
1. Take raw inputs, such as a list of compositions, and an associated target property to learn.
2. Convert the raw inputs into *descriptors* or *features* that can be learned by machine learning algorithms.
3. Train a machine learning model on the data.
4. Plot and analyze the performance of the model.

![machine learning workflow](resources/ml_workflow.png)

There are many python packages available to featurize materials or molecules. The `matminer` package has been developed to help make machine learning of materials properties easy and hassle free. The aim of matminer is to connect materials data with data mining algorithms and data visualization.

`````{seealso}
Many more tutorials on how to use matminer (beyond the scope of this example) are available in the `matminer_examples` repository, available [here](https://github.com/hackingmaterials/matminer_examples).
`````


## Example materials science dataset (computational dielectric properties of inorganic crystals)

Matminer interfaces with many materials databases,  including:
- Materials Project
- Citrine
- AFLOW
- Materials Data Facility (MDF)
- Materials Platform for Data Science (MPDS)

In addition, it includes datasets from published literature. Matminer hosts a repository of ~45 datasets, which come from published and peer-reviewed machine learning investigations of materials properties or from publications of high-throughput computing studies.

A list of the literature-based datasets can be printed using the `get_available_datasets()` function. This also prints information about what the dataset contains, such as the number of samples, the target properties, and how the data was obtained (e.g., via theory or experiment).

`````{seealso}
More information on accessing other materials databases is available in the [matminer_examples](https://github.com/hackingmaterials/matminer_examples) repository.
`````



### Loading the dielectric dataset

All datasets can be loaded using the `load_dataset()` function and the dataset name. To save installation space, the datasets are not automatically downloaded when matminer is installed. Instead, the first time a dataset is loaded, it is downloaded from the internet and stored in the matminer installation directory.

Let's say we're interested in the `dielectric_constant` dataset,  which contains 1,056 structures with dielectric properties calculated with DFPT-PBE. We can download it with the `load_dataset` function.

We'll set an environment variable `MATMINER_DATA`, which tells matminer to download all our datasets to the directory `./data`. If you are running this locally, you usually don't need to set this variable, as matminer will download the dataset directly to your matminer source code folder.



In [None]:
# Set an environment variable to tell matminer where to store the data
%env MATMINER_DATA data

# Load the dielectric dataset
from matminer.datasets import load_dataset

df = load_dataset("dielectric_constant")
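
The `%env` line above is an IPython magic, so it only works inside a notebook. In a plain Python script, a rough equivalent is to set the variable through `os.environ` before calling `load_dataset`. This is a sketch that assumes matminer reads `MATMINER_DATA` when a dataset is loaded, as described above. The directory name `data` simply mirrors the cell above.

```python
import os
from pathlib import Path

# Directory where downloaded datasets should be cached (mirrors the cell above).
data_dir = Path("data")
data_dir.mkdir(parents=True, exist_ok=True)

# Rough equivalent of `%env MATMINER_DATA data` outside a notebook.
os.environ["MATMINER_DATA"] = str(data_dir)

print(os.environ["MATMINER_DATA"])
```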

We can get some more detailed information about this dataset using the `get_all_dataset_info(<dataset>)` function from matminer.

In [None]:
from matminer.datasets import get_all_dataset_info

print(get_all_dataset_info("dielectric_constant"))

### (Recap of pandas) Manipulating and examining pandas `DataFrame` objects

The datasets are made available as pandas `DataFrame` objects.

The `head()` function prints a summary of the first few rows of a dataset. You can scroll across to see more columns. From this, it is easy to see the types of data available in the dataset.

In [None]:
df.head()

Sometimes, if a dataset is very large, you will be unable to see all the available columns. Instead, you can see the full list of columns using the `columns` attribute:

In [None]:
df.columns

A pandas `DataFrame` includes a function called `describe()` that helps determine statistics for the various numerical/categorical columns in the data. Note that the `describe()` function only describes numerical columns by default.

Sometimes, the `describe()` function will reveal outliers that indicate mistakes in the data.

In [None]:
df.describe()
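
Since `describe()` skips non-numerical columns by default, it can help to pass `include="all"` when you also want counts and unique values for text columns. The small table below is made-up toy data (not rows from the dielectric dataset), used only to show the difference.

```python
import pandas as pd

# Toy data for illustration only; these values are not from the dielectric dataset.
toy_df = pd.DataFrame(
    {
        "formula": ["NaCl", "MgO", "NaCl", "SiO2"],
        "band_gap": [5.0, 4.5, 5.1, 5.7],
    }
)

numeric_summary = toy_df.describe()            # numerical columns only
full_summary = toy_df.describe(include="all")  # also summarizes text columns

print(list(numeric_summary.columns))
print(full_summary.loc[["count", "unique", "top"], "formula"])
```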

#### Indexing the dataset

We can access a particular column of `DataFrame` by indexing the object using the column name. For example:

In [None]:
df["band_gap"]

We can also access a particular row of a `DataFrame` by its integer position using the `iloc` attribute. (Selecting several columns at once is shown later in this recap.)

In [None]:
df.iloc[100]

#### Filtering the dataset

Pandas `DataFrame` objects make it very easy to filter the data based on a specific column. We can use the typical Python comparison operators (`==`, `>`, `>=`, `<`, etc.) to filter numerical values. For example, let's find all entries where the cell volume is greater than or equal to 580. We do this by filtering on the `volume` column.

Note that we first produce a *boolean mask* – a series of `True` and `False` depending on the comparison. We can then use the mask to filter the `DataFrame`. 

In [None]:
mask = df["volume"] >= 580
df[mask]

We can use this method of filtering to clean our dataset. For example, if we wanted our dataset to include only nonmetals (materials with a non-zero band gap), we can do this easily by filtering on the `band_gap` column.

In [None]:
mask = df["band_gap"] > 0
nonmetal_df = df[mask]
nonmetal_df
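
Boolean masks can also be combined with `&` (and), `|` (or), and `~` (not), with each comparison wrapped in parentheses. The sketch below uses a small made-up table (the column names mirror the dataset, the values are invented). It also shows that a filtered `DataFrame` keeps its original row labels unless you reset the index.

```python
import pandas as pd

# Invented values for illustration; only the column names mirror the dataset.
toy_df = pd.DataFrame(
    {
        "formula": ["A", "B", "C", "D"],
        "volume": [100.0, 600.0, 700.0, 50.0],
        "band_gap": [0.0, 1.2, 0.0, 3.4],
    }
)

# Each comparison needs its own parentheses before combining with & or |.
mask = (toy_df["volume"] >= 580) & (toy_df["band_gap"] > 0)
filtered = toy_df[mask]
print(filtered)

# Filtering keeps the original row labels; reset them if you want 0, 1, 2, ...
filtered = filtered.reset_index(drop=True)
print(filtered.index.tolist())
```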

Often, a dataset contains many additional columns that are not necessary for machine learning. Before we can train a model on the data, we need to remove any extraneous columns. We can remove whole columns from the dataset using the `drop()` function. This function can be used to drop both rows and columns.

The function takes a list of items to drop. For columns, these are the column names, whereas for rows they are the row index labels. Finally, the `axis` option specifies whether the items to drop are columns (`1`) or rows (`0`).

For example, to remove the `nsites`, `space_group`, `e_electronic`, and `e_total` columns, we can run: 

In [None]:
cleaned_df = df.drop(["nsites", "space_group", "e_electronic", "e_total"], axis=1)

Let's examine the cleaned `DataFrame` to see that the columns have been removed.

In [None]:
cleaned_df.head()

You can alternatively *select* multiple columns by passing in a list of column names as an index.

For example, if we're only interested in the `band_gap` and `structure` columns, we can index with `["band_gap", "structure"]`

In [None]:
df[["band_gap", "structure"]]
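
Dropping and selecting are two routes to the same result. As a sketch on made-up data, the snippet below shows `drop(columns=...)`, which is equivalent to `drop(..., axis=1)`, alongside list-based selection. Note that `drop()` returns a new `DataFrame` and leaves the original untouched unless you reassign it.

```python
import pandas as pd

# Made-up rows for illustration only.
toy_df = pd.DataFrame(
    {
        "formula": ["NaCl", "MgO"],
        "nsites": [2, 2],
        "band_gap": [5.0, 4.5],
    }
)

dropped = toy_df.drop(columns=["nsites"])   # same as toy_df.drop(["nsites"], axis=1)
selected = toy_df[["formula", "band_gap"]]  # keep only the listed columns

print(dropped.equals(selected))  # both routes give the same table
print(list(toy_df.columns))      # the original DataFrame is unchanged
```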

## Generating descriptors for machine learning using matminer

In this section, we will learn a bit about how to generate machine-learning descriptors from materials objects in pymatgen. First, we'll generate some descriptors with matminer's "featurizers" classes. Next, we'll use some of what we learned about dataframes in the previous section to examine our descriptors and prepare them for input to machine learning models.

![featurizers_overview.png](resources/featurizers_overview.png)

### Featurizers transform materials primitives into machine-learnable features

The general idea of featurizers is that they accept a materials primitive (e.g., pymatgen Composition) and output a vector. For example:

\begin{align}
f(\mathrm{Fe}_2\mathrm{O}_3) \rightarrow [1.5, 7.8, 9.1, 0.09]
\end{align}
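
To make the idea concrete, here is a deliberately simple, hypothetical featurizer written in plain Python. It is not matminer's implementation: the `toy_element_fractions` helper and its regular expression are assumptions for illustration, and it only handles flat formulas such as `Fe2O3` (no parentheses or fractional amounts).

```python
import re
from collections import Counter


def toy_element_fractions(formula, elements):
    """Map a flat formula string to a fixed-length vector of atomic fractions.

    Hypothetical helper for illustration; handles formulas like "Fe2O3" only.
    """
    counts = Counter()
    # An element symbol is one capital letter plus an optional lowercase letter,
    # followed by an optional integer amount (defaulting to 1).
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] += int(amount) if amount else 1
    total = sum(counts.values())
    # Fixed element order, so every material gets a vector of the same length.
    return [counts[el] / total for el in elements]


elements = ["H", "O", "Fe"]
vector = toy_element_fractions("Fe2O3", elements)
print(dict(zip(elements, vector)))
```

The key design point is the fixed element order: a model needs every sample to produce a vector with the same length and meaning, which is why featurizers come with a fixed list of feature labels.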

#### Matminer contains featurizers for the following pymatgen objects:
* Composition
* Crystal structure
* Crystal sites
* Bandstructure
* Density of states

#### Depending on the featurizer, the features returned may be:
* numerical, categorical, or mixed vectors
* matrices 
* other pymatgen objects (for further processing)

#### Featurizers play nice with dataframes
Since most of the time we are working with pandas dataframes, all featurizers work natively with pandas dataframes. We'll provide examples of this later in the lesson.

In this lesson, we'll go over the main methods present in all featurizers. By the end of this unit, you will be able to generate descriptors for a wide range of materials informatics problems using one common software interface.

#### Featurizers present in matminer

`````{seealso}
Matminer hosts over 60 featurizers, most of which are implemented from methods published in peer-reviewed papers. You can find a full list of featurizers on the [matminer website](https://hackingmaterials.lbl.gov/matminer/featurizer_summary.html). All featurizers have parallelization and convenient error tolerance built into their core methods.
`````


### The `featurize` method and basics

The core method of any matminer featurizer is `featurize()`. This method accepts a materials object and returns a machine learning vector or matrix. Let's see an example on a pymatgen composition:

In [None]:
from pymatgen.core import Composition

fe2o3 = Composition("Fe2O3")

As a trivial example, we'll get the element fractions with the `ElementFraction` featurizer.

In [None]:
from matminer.featurizers.composition.element import ElementFraction

ef = ElementFraction()

Now we can featurize our composition.

In [None]:
element_fractions = ef.featurize(fe2o3)

print(element_fractions)

We've managed to generate features for learning, but what do they mean? One way to check is by reading the `Features` section in the documentation of any featurizer... but a much easier way is to use the `feature_labels()` method.

In [None]:
element_fraction_labels = ef.feature_labels()
print(element_fraction_labels)

We now see the labels in the order that we generated the features. 

In [None]:
print(element_fraction_labels[7], element_fractions[7])
print(element_fraction_labels[25], element_fractions[25])

### Featurizing  dataframes

We just generated some descriptors and their labels from an individual sample but most of the time our data is in pandas dataframes. Fortunately, matminer featurizers implement a `featurize_dataframe()` method which interacts natively with dataframes.

Let's reload the dielectric dataset from matminer and use our `ElementFraction` featurizer on it.

First, we load the dataset as we did in the previous section.

In [None]:
from matminer.datasets.dataset_retrieval import load_dataset

df = load_dataset("dielectric_constant")
df.head()

The dataset we just loaded stores compositions only as strings in the `formula` column. To convert this data into a `composition` column containing pymatgen `Composition` objects, we can use the `StrToComposition` conversion featurizer on the `formula` column.



In [None]:
from matminer.featurizers.conversions import StrToComposition

stc = StrToComposition()
df = stc.featurize_dataframe(df, "formula", pbar=False)

Next, we can use the `featurize_dataframe()` method (implemented by all featurizers) to apply ElementFraction to all of our data at once. The only required arguments are the dataframe as input and the input column name (in this case it is `composition`). `featurize_dataframe()` is parallelized by default using multiprocessing.


If we look at the dataframe, now we can see our new feature columns.

In [None]:
df = ef.featurize_dataframe(df, "composition")

df.head()
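
Conceptually, featurizing a dataframe means applying a per-sample function to one column and joining the results back as new, labeled columns. The sketch below imitates that pattern with plain pandas. The `toy_featurize` function and its two features are made up for illustration and are not how matminer computes anything.

```python
import pandas as pd


def toy_featurize(formula):
    """Hypothetical per-sample featurizer: returns two made-up features."""
    return [len(formula), sum(ch.isdigit() for ch in formula)]


toy_labels = ["n_characters", "n_digits"]
toy_df = pd.DataFrame({"formula": ["Fe2O3", "NaCl", "SiO2"]})

# Apply the per-sample function to one column, then join the features as new columns.
features = pd.DataFrame(
    toy_df["formula"].apply(toy_featurize).tolist(),
    columns=toy_labels,
    index=toy_df.index,
)
toy_df = toy_df.join(features)
print(toy_df)
```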

### Structure Featurizers

We can use the same syntax for other kinds of featurizers. Let's now assign descriptors to a structure, again using the dataset of dielectric materials properties.

In [None]:
df = load_dataset("dielectric_constant")

df.head()

Let's calculate some basic density features of these structures using `DensityFeatures`.

In [None]:
from matminer.featurizers.structure import DensityFeatures

densityf = DensityFeatures()
densityf.feature_labels()

These are the features we will get. Now we use `featurize_dataframe()` to generate these features for all the samples in the dataframe. Since we are using the structures as input to the featurizer, we select the "structure" column.

Let's examine the dataframe and see the structural features.

In [None]:
df = densityf.featurize_dataframe(df, "structure")

df.head()

## Simple ML models using our matminer features!

In the previous sections, we demonstrated how to download a dataset and add machine-learnable features. In this section, we show how to train a machine learning model on a dataset and analyze the results.

### Scikit-Learn
This unit makes extensive use of the [scikit-learn](https://scikit-learn.org/stable/) package, an open-source Python package for machine learning. Matminer has been designed to make machine learning with scikit-learn as easy as possible. Other machine learning packages exist, such as [TensorFlow](https://www.tensorflow.org), which implement neural network architectures. These packages can also be used with `matminer` but are outside the scope of this lecture.

### Load and featurize the dataset

First, let's load a dataset that we can use for machine learning. We reload the `dielectric_constant` dataset used in the previous sections and add some composition features (`ElementProperty` with the `magpie` preset) and structure features (`DensityFeatures`).

In [None]:
from matminer.datasets.dataset_retrieval import load_dataset
from matminer.featurizers.composition.composite import ElementProperty
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers.structure import DensityFeatures

df = load_dataset("dielectric_constant")

stc = StrToComposition()
df = stc.featurize_dataframe(df, "formula", pbar=False)

ep_feat = ElementProperty.from_preset(preset_name="magpie")

df = ep_feat.featurize_dataframe(
    df, col_id="composition"
)  # input the "composition" column to the featurizer

densityf = DensityFeatures()
df = densityf.featurize_dataframe(df, "structure")

df.head()

We first need to split the dataset into the "target" property and the "features" used for learning. In this model, we will be using the band gap (`band_gap`) as the target property. We use the `values` attribute of the dataframe column to get the target property as a numpy array, rather than a pandas `Series` object.

In [None]:
y = df["band_gap"].values

print(y)

The machine learning algorithm can only use numerical features for training. Accordingly, we need to remove any non-numerical columns from our dataset. Additionally, we want to remove the `band_gap` column from the set of features, as the model should not know about the target property in advance.

The dataset loaded above includes `structure`, `formula`, and `composition` columns that were previously used to generate the machine-learnable features, along with other calculated properties and metadata columns. Let's remove them using the pandas `drop()` function, discussed in the pandas recap above. Remember, `axis=1` indicates we are dropping columns rather than rows.

In [None]:
X = df.drop(
    [
        "structure",
        "formula",
        "nsites",
        "space_group",
        "volume",
        "band_gap",
        "e_electronic",
        "e_total",
        "material_id",
        "n",
        "poly_electronic",
        "poly_total",
        "pot_ferroelectric",
        "cif",
        "meta",
        "poscar",
        "composition",
    ],
    axis=1,
)
X.columns

We can see all the descriptors in the model using the `columns` attribute.

In [None]:
print("There are {} possible descriptors:".format(len(X.columns)))
print(X.columns)

### Try a ridge regression model using scikit-learn

The `scikit-learn` library makes it easy to use our generated features for training machine learning models. It implements a variety of different regression models and contains tools for cross-validation.

In the interests of time, in this example we will only trial a single model, but it is good practice to trial multiple models to see which performs best for your machine learning problem. A good "starting" model is a regularized linear model such as ridge regression. Let's create a ridge regression model.

In [None]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

model = make_pipeline(
    Ridge(alpha=10)
)


Notice we created the model with the regularization strength (`alpha`) set to `10`. `alpha` is an example of a machine learning *hyper-parameter*. Most models contain tunable hyper-parameters. To obtain good performance, it is necessary to fine-tune these parameters for each individual machine learning problem. There is currently no simple way to know in advance what hyper-parameters will be optimal. Usually, a trial and error approach is used.
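
One way to organize that trial and error is a grid search with cross-validation. The sketch below, on synthetic data, tries several values of the `alpha` hyper-parameter of `Ridge` (the model class used in the cell above). The grid values and the toy data are arbitrary choices for illustration, not recommendations for the dielectric dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=60)

# Try a few regularization strengths and keep the one with the best CV score.
alphas = [0.01, 0.1, 1.0, 10.0, 100.0]
search = GridSearchCV(
    Ridge(),
    param_grid={"alpha": alphas},
    scoring="neg_mean_squared_error",
    cv=5,
)
search.fit(X_toy, y_toy)
print(search.best_params_)
```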

We can now train our model to use the input features (`X`) to predict the target property (`y`). This is achieved using the `fit()` function.

In [None]:
model.fit(X, y)

That's it, we have trained our first machine learning model!

### Evaluating model performance

Next, we need to assess how the model is performing. To do this, we first ask the model to predict the band gap for every entry in our original dataframe.

In [None]:
y_pred = model.predict(X)

Next, we can check the accuracy of our model by looking at the *root mean squared error* of our predictions. Scikit-learn provides a `mean_squared_error()` function to calculate the mean squared error. We then take the square-root of this to obtain our final performance metric.
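
For reference, with $N$ samples, true values $y_i$, and predictions $\hat{y}_i$, the metric described above is

\begin{align}
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}
\end{align}

Because of the square root, the RMSE has the same units as the target property.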

In [None]:
import numpy as np
from sklearn.metrics import mean_squared_error

mse = mean_squared_error(y, y_pred)
print("training RMSE = {:.3f} eV".format(np.sqrt(mse)))

The training RMSE may look very reasonable! However, as the model was trained and evaluated on exactly the same data, this is not a true estimate of how the model will perform for unseen materials (the primary purpose of machine learning studies).

#### Cross validation

To obtain a more accurate estimate of prediction performance and validate that we are not over-fitting, we need to check the **cross-validation score** rather than the fitting score.

In cross-validation, the data is partitioned randomly into $n$ "splits" (in this case 10), each containing roughly the same number of samples. The model is trained on $n-1$ splits (the training set) and the model performance evaluated by comparing the actual and predicted values for the final split (the testing set). In total, this process is repeated $n$ times, such that each split is at some point used as the testing set. The cross-validation score is the average score across all testing sets.

There are a number of ways to partition the data into splits. In this example, we use the `KFold` method and select the number of splits to be 10. I.e., 90 % of the data will be used as the training set, with 10 % used as the testing set.

In [None]:
from sklearn.model_selection import KFold

kfold = KFold(n_splits=10, random_state=1, shuffle=True)

Note, we set `random_state=1` to ensure every attendee gets the same answer for their model.
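
To see what these splits look like in practice, the sketch below applies the same `KFold` settings to 20 toy samples (an arbitrary number chosen so the split sizes are easy to read) and checks that every sample lands in a testing set exactly once.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(-1, 1)  # 20 toy samples, one feature each
kfold_toy = KFold(n_splits=10, shuffle=True, random_state=1)

split_sizes = []
test_counts = np.zeros(len(X_toy), dtype=int)
for train_idx, test_idx in kfold_toy.split(X_toy):
    # With 20 samples and 10 splits: 18 training and 2 testing samples per split.
    split_sizes.append((len(train_idx), len(test_idx)))
    test_counts[test_idx] += 1

print(split_sizes)
# Every sample is used for testing exactly once across the 10 splits.
print(test_counts)
```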

Finally, obtaining the cross-validation score can be automated using the scikit-learn `cross_val_score()` function. This function requires a machine learning model, the input features, and the target property as arguments. Note that we pass the `kfold` object as the `cv` argument, to make `cross_val_score()` use the correct test/train splits.

For each split, the model will be trained from scratch before the performance is evaluated. As we have to train and predict 10 times, cross-validation can often take some time to perform. In our case, the model is a simple linear one, so the process should be quick. The final cross-validation score is the average across all splits.

In [None]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=kfold)

rmse_scores = [np.sqrt(abs(s)) for s in scores]
print("Mean RMSE: {:.3f}".format(np.mean(rmse_scores)))
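
`cross_val_score` follows the scikit-learn convention that higher scores are better, so error metrics are returned negated. This is why the cell above takes `abs()` before the square root. The sketch below checks the sign convention on synthetic data; the data and the `alpha=1.0` setting are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data for illustration only.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 3))
y_toy = X_toy @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=50)

cv = KFold(n_splits=5, shuffle=True, random_state=1)
neg_mse = cross_val_score(
    Ridge(alpha=1.0), X_toy, y_toy, scoring="neg_mean_squared_error", cv=cv
)

print(neg_mse)            # one non-positive value per split
rmse = np.sqrt(-neg_mse)  # flip the sign, then take the square root
print(rmse.mean())
```

Recent scikit-learn versions also accept `scoring="neg_root_mean_squared_error"`, which returns the negated RMSE directly.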

You should find that the cross-validated RMSE is larger than the training RMSE, as it now reflects the predictive power of the model on data it has not seen.

### Visualizing model performance

We can visualize the predictive performance of our model by plotting our predictions against the actual values, for each sample in the test set for all test/train splits. First, we get the predicted values of the testing set for each split using the `cross_val_predict` method. This is similar to the `cross_val_score` method, except it returns the actual predictions rather than the model score.

In [None]:
from sklearn.model_selection import cross_val_predict

y_pred = cross_val_predict(model, X, y, cv=kfold)

Let's now add our predicted values to our dataframe and calculate an absolute percentage error for each sample. 

We can do this conveniently for all of our samples with the dataframe columns.

If we scroll to the end of the dataframe, our predicted band gap (`band_gap_predicted`) and percentage errors are given for each sample. This might allow us to examine manually which samples are performing well and which are performing poorly.

In [None]:
df["band_gap_predicted"] = y_pred
df["percentage_error"] = (df["band_gap"] - df["band_gap_predicted"]).abs() / df["band_gap"] * 100

df.describe()

A more convenient way of examining our model's performance is by creating a graph comparing our cross-validation predicted band gap to the actual band gap for every sample. Here, we use `plotly.express` from the Plotly package to create our graphs.

Plotly Express is designed to create many kinds of plots directly from dataframes. Since we already have our data inside a dataframe, we can specify the column names to tell Plotly the data we'd like to show.

We make two series of data:
- First, a reference line indicating "perfect" performance of the model.
- Second, a scatter plot of the predicted band gap vs the actual band gap for every sample.

In [None]:
import plotly.express as px
import plotly.graph_objects as go

reference_line = go.Scatter(
    x=[0, 8],
    y=[0, 8],
    line=dict(color="black", dash="dash"),
    mode="lines",
    showlegend=False,
)

fig = px.scatter(
    df,
    x="band_gap",
    y="band_gap_predicted",
    hover_name="formula",
    color="percentage_error",
    color_continuous_scale=px.colors.sequential.Bluered,
)

fig.add_trace(reference_line)
fig.show()

Not too bad! However, there are definitely some outliers (you can hover over the points with your mouse to see what they are).

### Model interpretation

An important aspect of machine learning is being able to understand why a model is making certain predictions. Linear models such as ridge regression are particularly amenable to interpretation, as they expose a `coef_` attribute containing the fitted weight of each feature in the final prediction. Since our model is a pipeline, we access the final (ridge) step with `model[-1]`. Let's look at the coefficients of our model.

In [None]:
model[-1].coef_

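To make sense of this, we need to know which feature each number corresponds to. We can use Plotly Express to plot the 5 largest coefficients by absolute value, which we use here as a rough measure of feature importance. Keep in mind that coefficient magnitudes depend on the scale of each feature, and this pipeline does not standardize its inputs, so treat the ranking as a rough guide.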

In [None]:
importances = np.abs(model[-1].coef_)
included = X.columns.values
indices = np.argsort(importances)[::-1]

fig_bar = px.bar(
    x=included[indices][0:5],
    y=importances[indices][0:5],
    title="Feature Importances of Linear Regression",
    labels={"x": "Feature", "y": "Importance"},
)
fig_bar.show()
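
Coefficient magnitudes are only comparable when the features share a common scale. The sketch below uses synthetic data, with `StandardScaler` added as an assumption since the pipeline above does not scale its inputs. It shows how re-expressing one feature in different units changes its raw coefficient but not its standardized one.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the first feature matters three times as much as the second.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = 3.0 * X_toy[:, 0] + 1.0 * X_toy[:, 1] + rng.normal(scale=0.1, size=200)

# Express the second feature in different units (multiply by 1000).
X_rescaled = X_toy.copy()
X_rescaled[:, 1] *= 1000.0

raw = Ridge(alpha=1.0).fit(X_rescaled, y_toy)
scaled = make_pipeline(StandardScaler(), Ridge(alpha=1.0)).fit(X_rescaled, y_toy)

print(np.abs(raw.coef_))         # second coefficient shrinks by roughly 1000x
print(np.abs(scaled[-1].coef_))  # magnitudes are now on a comparable footing
```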

## Bonus: Curated ML datasets with Matbench


If you are interested in comparing your machine learning algorithms with the state of the art (SoTA), matminer also offers access to a curated set of 13 benchmarking datasets called Matbench, which have been used to benchmark SoTA algorithms like Roost, CGCNN, CrabNet, MEGNet, Automatminer, and more.

The Matbench datasets span a wide variety of materials informatics tasks such as:

- Predicting materials properties given **only composition**, or given **composition _and_ structure**
- Predicting a **wide array of target properties**, such as elastic constants, dielectric constants, formation energies, and steel yield strength
- **Data-sparse** tasks (300 samples) and (relatively) **data-rich** tasks (100k+ samples)
- Both regression and classification tasks


The full set of datasets is given in the table below:

| Task name | Task type | Target column (unit) | Input type | Samples | MAD (regression) or Fraction True (classification) | Links |
|-------|-------|-------|-------|-------|-------|-------|
| `matbench_steels` | regression | `yield strength` (MPa) | composition | 312 | 229.3743 | [download](https://ml.materialsproject.org/projects/matbench_steels.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_steels) | 
| `matbench_jdft2d` | regression | `exfoliation_en` (meV/atom) | structure | 636 | 67.2020 | [download](https://ml.materialsproject.org/projects/matbench_jdft2d.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_jdft2d) | 
| `matbench_phonons` | regression | `last phdos peak` (cm^-1) | structure | 1,265 | 323.7870 | [download](https://ml.materialsproject.org/projects/matbench_phonons.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_phonons) | 
| `matbench_expt_gap` | regression | `gap expt` (eV) | composition | 4,604 | 1.1432 | [download](https://ml.materialsproject.org/projects/matbench_expt_gap.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_expt_gap) | 
| `matbench_dielectric` | regression | `n` (unitless) | structure | 4,764 | 0.8085 | [download](https://ml.materialsproject.org/projects/matbench_dielectric.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_dielectric) | 
| `matbench_expt_is_metal` | classification | `is_metal` | composition | 4,921 | 0.4981 | [download](https://ml.materialsproject.org/projects/matbench_expt_is_metal.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_expt_is_metal) | 
| `matbench_glass` | classification | `gfa` | composition | 5,680 | 0.7104 | [download](https://ml.materialsproject.org/projects/matbench_glass.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_glass) | 
| `matbench_log_gvrh` | regression | `log10(G_VRH)` (log10(GPa)) | structure | 10,987 | 0.2931 | [download](https://ml.materialsproject.org/projects/matbench_log_gvrh.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_log_gvrh) | 
| `matbench_log_kvrh` | regression | `log10(K_VRH)` (log10(GPa)) | structure | 10,987 | 0.2897 | [download](https://ml.materialsproject.org/projects/matbench_log_kvrh.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_log_kvrh) | 
| `matbench_perovskites` | regression | `e_form` (eV/unit cell) | structure | 18,928 | 0.5660 | [download](https://ml.materialsproject.org/projects/matbench_perovskites.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_perovskites) | 
| `matbench_mp_gap` | regression | `gap pbe` (eV) | structure | 106,113 | 1.3271 | [download](https://ml.materialsproject.org/projects/matbench_mp_gap.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_mp_gap) | 
| `matbench_mp_is_metal` | classification | `is_metal` | structure | 106,113 | 0.4349 | [download](https://ml.materialsproject.org/projects/matbench_mp_is_metal.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_mp_is_metal) |
| `matbench_mp_e_form` | regression | `e_form` (eV/atom) | structure | 132,752 | 1.0059 | [download](https://ml.materialsproject.org/projects/matbench_mp_e_form.json.gz), [interactive](https://ml.materialsproject.org/projects/matbench_mp_e_form) | 



### The Matbench Leaderboard and Benchmarking Code


The Matbench developers host an online benchmark leaderboard, similar to an "ImageNet" for materials science, at the following URL:

[https://hackingmaterials.lbl.gov/matbench](https://hackingmaterials.lbl.gov/matbench)

The site contains comprehensive data on the performance of various SoTA algorithms across the tasks in Matbench. On the website you can find:

- **A general purpose leaderboard comparing only the most-widely applicable algorithms**
- **Individual per-task (per-dataset) leaderboards for comparing any ML model on a particular task**
- Comprehensive breakdowns of cross-validation performance, statistics, and metadata for every model
- **Access to individual sample predictions for each and every submission**

![website_mb](resources/website_mb.png)



### General purpose leaderboard

| Task name | Samples | Algorithm | Verified MAE (unit) or ROCAUC | Notes |
|------------------|---------|-----------|----------------------|-------|
| matbench_steels | 312 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **97.4929 (MPa)** |  |
| matbench_jdft2d | 636 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **39.8497 (meV/atom)** |  |
| matbench_phonons | 1,265 | [CrabNet](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_CrabNet) | **55.1114 (cm^-1)** |  |
| matbench_expt_gap | 4,604 | [CrabNet](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_CrabNet) | **0.3463 (eV)** |  |
| matbench_dielectric | 4,764 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **0.3150 (unitless)** |  |
| matbench_expt_is_metal | 4,921 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **0.9209** |  |
| matbench_glass | 5,680 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **0.8607** |  |
| matbench_log_gvrh | 10,987 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **0.0874 (log10(GPa))** |  |
| matbench_log_kvrh | 10,987 | [AMMExpress v2020](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_automatminer_expressv2020) | **0.0647 (log10(GPa))** |  |
| matbench_perovskites | 18,928 | [CGCNN v2019](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_cgcnnv2019) | **0.0452 (eV/unit cell)** | structure required |
| matbench_mp_gap | 106,113 | [CrabNet](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_CrabNet) | **0.2655 (eV)** |  |
| matbench_mp_is_metal | 106,113 | [CGCNN v2019](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_cgcnnv2019) | **0.9520** | structure required |
| matbench_mp_e_form | 132,752 | [CGCNN v2019](https://hackingmaterials.github.io/matbench/Full%20Benchmark%20Data/matbench_v0.1_cgcnnv2019) | **0.0337 (eV/atom)** | structure required |
