# Materials Project Database

This notebook illustrates how to interface with the [Materials Project](https://materialsproject.org) (MP) database. We use the MP data retrieval tool to download data and convert it to a pandas dataframe, then apply matminer's tools to populate the dataframe with descriptors/features from pymatgen, and finally fit a linear regression model from the scikit-learn library to the dataset.

### Overview

In this notebook, we will:
1. Load and examine a dataset in a pandas dataframe
2. Add descriptors to the dataframe using matminer
3. Train a linear regression model with scikit-learn and visualize the results.

In [1]:
#%pip install mp_api
#%pip install matminer
#%pip install flatten_dict # Patch Materials Project API downloads

# Libraries

In [1]:
import numpy                          as np
import pandas                         as pd
import seaborn                        as sns
import json

import matplotlib.pyplot              as plt

from scipy                            import stats
from flatten_dict                     import flatten
from mp_api.client                    import MPRester

from pymatgen.core                    import Composition
from matminer.utils.data              import PymatgenData
from matminer.featurizers.composition import ElementProperty

from sklearn.metrics                  import mean_squared_error
from sklearn.linear_model             import LinearRegression
from sklearn.model_selection          import KFold, cross_val_score

from monty.serialization              import dumpfn, loadfn


## 1. Load and process data set

We use MPRester to load a data set of materials properties from the Materials Project. To download data from the [Materials Project](https://materialsproject.org), you will need to create an account. Simply go to the page and select "Sign in or Register." Then select "API" in the upper left of the screen and copy your API key.

You can either set the environment variable MP_API_KEY to your API key or simply add the API key in Python. To set the environment variable MP_API_KEY in Miniconda/Anaconda:

`conda env config vars set MP_API_KEY="api_key_from_materialsproject"`

For the environment variable to take effect, re-activate the conda environment (or restart Miniconda/Anaconda).
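
If the environment variable is used, the key can be read in Python without hard-coding it in the notebook. The following is a minimal sketch using only the standard library; the helper name `get_mp_api_key` is hypothetical and not part of `mp_api`.

```python
import os

def get_mp_api_key(explicit_key=None):
    """Return an explicitly supplied key, else the MP_API_KEY environment variable (or None)."""
    if explicit_key:
        return explicit_key
    return os.environ.get("MP_API_KEY")

api_key = get_mp_api_key()
print("API key found" if api_key else "MP_API_KEY is not set")
```

Keeping the key out of the notebook means it is not shared by accident when the notebook is published.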

In [2]:
# Use MPRester to get data from the Materials Project.
# Replace the placeholder with your own key, or set api_key = None to use the MP_API_KEY environment variable.
api_key = 'api_key_from_materialsproject'

# Create an adapter to the MP Database.
mpr = MPRester(api_key)

# Get list with fields available for extraction
mpr.materials.summary.available_fields

['builder_meta',
 'nsites',
 'elements',
 'nelements',
 'composition',
 'composition_reduced',
 'formula_pretty',
 'formula_anonymous',
 'chemsys',
 'volume',
 'density',
 'density_atomic',
 'symmetry',
 'property_name',
 'material_id',
 'deprecated',
 'deprecation_reasons',
 'last_updated',
 'origins',
 'structure',
 'task_ids',
 'uncorrected_energy_per_atom',
 'energy_per_atom',
 'formation_energy_per_atom',
 'energy_above_hull',
 'is_stable',
 'equilibrium_reaction_energy_per_atom',
 'decomposes_to',
 'xas',
 'grain_boundaries',
 'band_gap',
 'cbm',
 'vbm',
 'efermi',
 'is_gap_direct',
 'is_metal',
 'es_source_calc_id',
 'bandstructure',
 'dos',
 'dos_energy_up',
 'dos_energy_down',
 'is_magnetic',
 'ordering',
 'total_magnetization',
 'total_magnetization_normalized_vol',
 'total_magnetization_normalized_formula_units',
 'num_magnetic_sites',
 'num_unique_magnetic_sites',
 'types_of_magnetic_species',
 'bulk_modulus',
 'shear_modulus',
 'universal_anisotropy',
 'homogeneous_poisson

In [4]:
# Search for materials with the selected properties
properties = ['formula_pretty', 'bulk_modulus',
              'formation_energy_per_atom','band_gap',
              'energy_above_hull','density',
              'volume', 'nsites']

# If materials.json.gz already exists, load it; otherwise download it
try:
    # loadfn opens (and decompresses) the file itself
    docs = loadfn("materials.json.gz")
except FileNotFoundError:
    # Download the data
    docs = mpr.materials.summary.search(fields=properties)
    # Save the data into file
    dumpfn(docs, "materials.json.gz")

# Create a dataframe with the selected properties
dataframe = pd.DataFrame.from_records(docs)

dataframe = dataframe.drop(columns=[col for col in dataframe if col not in properties])

dataframe.head()

| | nsites | formula_pretty | volume | density | formation_energy_per_atom | energy_above_hull | band_gap | bulk_modulus |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Si | 40.329527 | 2.3128 | 0.0 | 0.0 | 0.6105 | {'voigt': 88.916, 'reuss': 88.916, 'vrh': 88.916} |
| 1 | 2 | Fe | 23.468168 | 7.902858 | 0.0 | 0.0 | 0.0 | |
| 2 | 4 | LiCoO2 | 31.733697 | 5.121431 | -1.74566 | 0.0 | 0.6623 | |


A quick inspection of the dataframe:

In [5]:
dataframe.describe()

| | nsites | volume | density | formation_energy_per_atom | energy_above_hull | band_gap |
|---|---|---|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 2.666667 | 31.843797 | 5.112363 | -0.581887 | 0.0 | 0.424267 |
| std | 1.154701 | 8.431218 | 2.79504 | 1.007857 | 0.0 | 0.368337 |
| min | 2.0 | 23.468168 | 2.3128 | -1.74566 | 0.0 | 0.0 |
| 25% | 2.0 | 27.600933 | 3.717116 | -0.87283 | 0.0 | 0.30525 |
| 50% | 2.0 | 31.733697 | 5.121431 | 0.0 | 0.0 | 0.6105 |
| 75% | 3.0 | 36.031612 | 6.512145 | 0.0 | 0.0 | 0.6364 |
| max | 4.0 | 40.329527 | 7.902858 | 0.0 | 0.0 | 0.6623 |


### 1.1 Filter unstable materials

The data set above has some entries that correspond to thermodynamically or mechanically unstable materials. Here we filter on thermodynamic stability using the distance from the convex hull (`energy_above_hull`); `K_VRH`, the Voigt-Reuss-Hill average of the bulk modulus, can serve as an additional criterion for mechanical stability.

In [6]:
# Keep only materials that lie less than 100 meV/atom above the convex hull,
# i.e. remove those unstable against decomposition into other phases
dataframe = dataframe[dataframe['energy_above_hull'] < 0.1].copy()

dataframe.describe()

| | nsites | volume | density | formation_energy_per_atom | energy_above_hull | band_gap |
|---|---|---|---|---|---|---|
| count | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |
| mean | 2.666667 | 31.843797 | 5.112363 | -0.581887 | 0.0 | 0.424267 |
| std | 1.154701 | 8.431218 | 2.79504 | 1.007857 | 0.0 | 0.368337 |
| min | 2.0 | 23.468168 | 2.3128 | -1.74566 | 0.0 | 0.0 |
| 25% | 2.0 | 27.600933 | 3.717116 | -0.87283 | 0.0 | 0.30525 |
| 50% | 2.0 | 31.733697 | 5.121431 | 0.0 | 0.0 | 0.6105 |
| 75% | 3.0 | 36.031612 | 6.512145 | 0.0 | 0.0 | 0.6364 |
| max | 4.0 | 40.329527 | 7.902858 | 0.0 | 0.0 | 0.6623 |
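
A filter on `K_VRH` could look roughly like the sketch below. It assumes that the `bulk_modulus` column holds dictionaries with a `'vrh'` key, as in the first row of the dataframe shown earlier, and that a positive `K_VRH` is the criterion of interest. The toy values and the helper `extract_k_vrh` are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy records mimicking the layout of the dataframe; the numbers are illustrative only.
toy = pd.DataFrame({
    "formula_pretty": ["A", "B", "C"],
    "energy_above_hull": [0.0, 0.25, 0.02],
    "bulk_modulus": [{"voigt": 90.0, "reuss": 88.0, "vrh": 89.0},
                     {"voigt": 50.0, "reuss": 48.0, "vrh": 49.0},
                     None],
})

def extract_k_vrh(entry):
    """Return the 'vrh' value of a bulk-modulus dict, or NaN if it is missing."""
    return entry.get("vrh", np.nan) if isinstance(entry, dict) else np.nan

toy["K_VRH"] = toy["bulk_modulus"].map(extract_k_vrh)

# Keep entries close to the hull that also have a known, positive K_VRH
stable = toy[(toy["energy_above_hull"] < 0.1) & (toy["K_VRH"] > 0)]
print(stable[["formula_pretty", "K_VRH"]])
```

Note that a comparison with NaN evaluates to False, so entries without bulk modulus data are dropped by this filter as well.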


### 1.2 Create a New Descriptor

We can create a new descriptor, e.g., the volume per atom, and add it to the pandas dataframe.

In [None]:
# Add a new column to the pandas dataframe for the volume per atom as a new descriptor
dataframe['volume_per_atom'] = dataframe['volume']/dataframe['nsites']

# Verify the added column
dataframe.head()

### 1.3 Add More Descriptors

We use matminer's pymatgen descriptor tools to add some more descriptors to our dataset.

In [None]:
# Convert the formula strings into pymatgen Composition objects
dataframe["composition"] = dataframe['formula_pretty'].map(Composition)

dataset     = PymatgenData()

descriptors = ['row', 'group', 'atomic_mass',
               'atomic_radius', 'boiling_point', 'melting_point', 'X']

statistics  = ["mean", "std_dev"]

element_property = ElementProperty(data_source=dataset, features=descriptors, stats=statistics)

dataframe        = element_property.featurize_dataframe(dataframe, "composition")

# Remove NaN values
dataframe = dataframe.dropna()

dataframe.head()
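
To give a feeling for what such composition descriptors represent, the sketch below computes a mean and a standard deviation of the atomic mass for LiCoO2 by hand. It assumes that both statistics are weighted by atomic fraction; matminer's exact definitions should be checked in its documentation. The helper functions are hypothetical and not part of matminer.

```python
import math

# Atomic masses in u; the atomic fractions are those of LiCoO2.
atomic_mass = {"Li": 6.94, "Co": 58.933, "O": 15.999}
fractions   = {"Li": 0.25, "Co": 0.25, "O": 0.50}

def weighted_mean(values, weights):
    """Mean of the elemental values, weighted by atomic fraction."""
    return sum(weights[el] * values[el] for el in weights)

def weighted_std_dev(values, weights):
    """Square root of the fraction-weighted squared deviation from the weighted mean."""
    mean = weighted_mean(values, weights)
    return math.sqrt(sum(weights[el] * (values[el] - mean) ** 2 for el in weights))

mean_mass = weighted_mean(atomic_mass, fractions)
std_mass  = weighted_std_dev(atomic_mass, fractions)
print(f"mean atomic mass = {mean_mass:.3f} u, std_dev = {std_mass:.3f} u")
```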

## 2. Fit a Linear Regression Model Using scikit-learn

We now have a sufficiently detailed dataset to fit a linear regression model that predicts the density. The linear model is given by
$$
y(x) = \beta_0 + \sum_{i=1}^n \beta_i\, x_i,
$$
where $x_i$, $i = 1, \dots, n$, are the $n$ descriptors, $\beta_i$ the corresponding coefficients, and $\beta_0$ the intercept.
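
As an illustration of the linear model, the coefficients can be recovered from synthetic data by ordinary least squares. This is a minimal sketch with NumPy only; the data, the chosen coefficients, and the variable names are made up for the example and are unrelated to the Materials Project data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 50 samples, n = 3 descriptors, known coefficients
X_toy     = rng.normal(size=(50, 3))
beta_true = np.array([1.5, -2.0, 0.5])
beta_0    = 4.0
y_toy     = beta_0 + X_toy @ beta_true          # noise-free for clarity

# Prepend a column of ones so that the intercept beta_0 is fitted too
A = np.column_stack([np.ones(len(X_toy)), X_toy])
coefficients, *_ = np.linalg.lstsq(A, y_toy, rcond=None)

print("beta_0 =", coefficients[0])
print("beta_i =", coefficients[1:])
```

Because the synthetic data contain no noise, the fitted values agree with the chosen coefficients up to numerical precision.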

Before we proceed to the fit, we remove outliers, defined here as entries whose density lies more than three standard deviations from the mean.

In [None]:
mean    = dataframe['density'].mean()
std_dev = dataframe['density'].std()

lower_bound = mean - 3.0 * std_dev
upper_bound = mean + 3.0 * std_dev

print(f"keeping entries with {lower_bound:.3f} < density < {upper_bound:.3f}\n")

dataframe = dataframe[ (dataframe['density'] > lower_bound) & (dataframe['density'] < upper_bound)]

### 2.1 Define the target output and relevant descriptors

The data set above has many columns, and we won't need all of them for our modeling. We try to predict the density, so we drop the target itself and most of the other output data from the descriptors.

In [None]:
# Target output column
y = dataframe['density'].values

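# Columns excluded from the descriptors
# ("bulk_modulus" holds a dict of Voigt/Reuss/VRH values rather than a number)
excluded = ["band_gap", "formula_pretty", "density",
            "volume", "nsites", "volume_per_atom",
            "energy_above_hull", "composition", "bulk_modulus"]

# Remove the excluded columns from the dataframe
X = dataframe.drop(columns=excluded)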

descriptor_values = '\n'.join(X.columns.values)
print(f"There are {X.shape[1]} possible descriptors:\n\n{descriptor_values}")

### 2.2 Fit the linear regression model

Now that we have our set of descriptors, we use scikit-learn to fit a linear model to our data.

In [None]:
# Define linear regression object
linear_regression = LinearRegression()

# Fit linear regression to the data
linear_regression.fit(X, y)

mse = mean_squared_error( y_true=y, y_pred=linear_regression.predict(X) )

# Get fit statistics
print(f"Training R2   = {linear_regression.score(X, y):.4f}")
print(f"Training RMSE = {np.sqrt(mse):.4f}")
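
The two statistics can also be computed by hand, which makes their definitions explicit. The sketch below uses NumPy and a small made-up set of reference and predicted values; the function names `rmse` and `r_squared` are chosen for this example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Made-up values: every prediction is off by 0.5
y_ref = [2.0, 4.0, 6.0, 8.0]
y_hat = [2.5, 3.5, 6.5, 7.5]

print(f"RMSE = {rmse(y_ref, y_hat):.4f}")
print(f"R2   = {r_squared(y_ref, y_hat):.4f}")
```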

In [None]:
# Use a 10-fold cross validation (90% training, 10% test);
# a fixed random_state makes both scorings use the same folds
crossvalidation = KFold(n_splits=10, shuffle=True, random_state=0)

# Compute cross-validation scores for the model
r2_scores   = cross_val_score(linear_regression, X, y,
                              scoring='r2', cv=crossvalidation, n_jobs=1)

mse_scores  = cross_val_score(linear_regression, X, y,
                              scoring='neg_mean_squared_error', cv=crossvalidation, n_jobs=1)

# The scorer returns negative MSE values, so flip the sign before taking the root
rmse_scores = np.sqrt(-mse_scores)

print("\n".join(f"fold {idx+1:2}, R2 = {r2:.4f}" for idx, r2 in enumerate(r2_scores)))

# R2 can be negative for a poor fold, so average the raw scores (no absolute value)
print(f"\nCross-validation results:\n"
      f"Folds: {len(r2_scores)}, mean R2   = "
      f"{np.mean(r2_scores):.3f}\n"
      f"Folds: {len(rmse_scores)}, mean RMSE = "
      f"{np.mean(rmse_scores):.3f}")
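
To make the idea behind k-fold cross validation concrete, the following sketch performs the splitting by hand with NumPy on synthetic data. It is not how scikit-learn implements `KFold` internally; the function `k_fold_rmse` and the data are assumptions made for illustration.

```python
import numpy as np

def k_fold_rmse(X, y, n_splits=5, seed=0):
    """Return the test RMSE of an ordinary least-squares fit for each fold."""
    rng     = np.random.default_rng(seed)
    indices = rng.permutation(len(y))
    folds   = np.array_split(indices, n_splits)
    scores  = []
    for i, test_idx in enumerate(folds):
        # All folds except the i-th one form the training set
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        # Add an intercept column and solve the least-squares problem on the training part
        A_train = np.column_stack([np.ones(len(train_idx)), X[train_idx]])
        A_test  = np.column_stack([np.ones(len(test_idx)),  X[test_idx]])
        beta, *_ = np.linalg.lstsq(A_train, y[train_idx], rcond=None)
        residual = y[test_idx] - A_test @ beta
        scores.append(float(np.sqrt(np.mean(residual ** 2))))
    return scores

# Synthetic linear data with a small amount of Gaussian noise
rng   = np.random.default_rng(1)
X_toy = rng.normal(size=(100, 2))
y_toy = 1.0 + X_toy @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=100)

scores = k_fold_rmse(X_toy, y_toy, n_splits=5)
print("RMSE per fold:", np.round(scores, 3))
```

Each sample is used for testing exactly once, so the mean over the folds estimates the error on data that the model has not seen during fitting.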

Finally, we can visualize the results using a scatter plot colored by a kernel density estimate.

In [None]:
reference_predicted = np.stack([y, linear_regression.predict(X)], axis=1)

results_frame = pd.DataFrame(reference_predicted,
                             columns=["reference", "predicted"])

# gaussian_kde expects shape (n_dimensions, n_points), so transpose
# (a reshape to (2, -1) would scramble the reference/predicted pairs)
kernel  = stats.gaussian_kde(reference_predicted.T)(reference_predicted.T)
idx     = kernel.argsort()

fig, ax = plt.subplots(figsize=(6, 6))

ax.plot([-20, 20], [-20, 20], color="black",
        label=None, ls="solid", lw=1, zorder=0)

# Plot the points in order of increasing kernel density so dense regions end up on top
pcm = sns.scatterplot(ax=ax,
                      x=results_frame["predicted"].to_numpy()[idx],
                      y=results_frame["reference"].to_numpy()[idx],
                      c=kernel[idx], s=4**2, cmap="viridis",
                      edgecolor="none")

# Tie the colorbar to the range of the kernel density values
mappable = plt.cm.ScalarMappable(norm=plt.Normalize(kernel.min(), kernel.max()),
                                 cmap="viridis")

cbar = fig.colorbar(mappable=mappable, ax=ax,
                    location="right", orientation="vertical",
                    shrink=0.70, pad=0.01)

cbar.ax.set_title("Density", x=0.6, y=1.02, rotation=90)

ax.set_xlim(0, 17.5)
ax.set_ylim(0, 17.5)

ax.set_xlabel(r"Predicted [g cm$^{-3}$]")
ax.set_ylabel(r"Reference [g cm$^{-3}$]")

plt.show()
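
The kernel density estimate used for the coloring can be examined in isolation. The sketch below uses synthetic reference and predicted values; the point to note is that `scipy.stats.gaussian_kde` expects its input with shape (number of dimensions, number of points).

```python
import numpy as np
from scipy import stats

# Synthetic reference values and noisy "predictions"
rng       = np.random.default_rng(0)
reference = rng.normal(loc=5.0, scale=1.0, size=200)
predicted = reference + rng.normal(scale=0.3, size=200)

# gaussian_kde expects shape (n_dimensions, n_points), here (2, 200)
points  = np.vstack([reference, predicted])
density = stats.gaussian_kde(points)(points)

# Sort so that the densest points come last and are drawn on top
order = density.argsort()
print(points.shape, density.shape)
```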