# XGBoost regression for `matbench_mp_e_form` task using basic crystallographic features
###### Created April 1, 2021

![logo](https://github.com/materialsproject/matbench/blob/main/benchmarks/matbench_v0.1_dummy/matbench_logo.png?raw=1)


## Description
###### Give a brief overview of this notebook and your algorithm.

This directory is an example of a Matbench submission, which should be made via pull request (PR). It is also a minimal working example of how to use the `matbench` Python package to run, record, and submit benchmarks with nested cross-validation.

The benchmark used here is the original Matbench v0.1, as described in [Dunn et al.](https://doi.org/10.1038/s41524-020-00406-3).

All submissions should include the following in a PR:
- Description
- Benchmark name
- Package versions
- Algorithm description
- Relevant citations
- Any other relevant info

## Benchmark name
###### Name the benchmark you are reporting results for.
Matbench v0.1

## Package versions
###### List all versions of packages required to run your notebook, including the matbench version used.
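- matbench==0.5 (as installed in the first code cell)
- scikit-learn==1.0 (as installed in the first code cell)
- numpy==1.20.1
- xgboost (version not recorded in this notebook)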

## Algorithm description
###### An in-depth explanation of your algorithm. 
###### Submissions are limited to one algorithm per notebook.
The model is an XGBoost gradient-boosted tree regressor (`booster="gbtree"`), trained with `xgb.train` for 100 boosting rounds on each fold. Each structure is described by eight basic crystallographic features:
- Lattice lengths `a`, `b`, `c`
- Lattice angles `alpha`, `beta`, `gamma`
- Unit cell volume
- Space group number
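
To make the "basic crystallographic features" in the title concrete, here is a minimal sketch of one feature row as the code cells in this notebook assemble it: three lattice lengths, three lattice angles, the cell volume, and the space group number. The `CrystalFeatures` class and `cell_volume` helper are hypothetical names for illustration only; the notebook itself reads these values from pymatgen `Structure` objects.

```python
import math
from dataclasses import dataclass, astuple


@dataclass
class CrystalFeatures:
    """Hypothetical container for the eight per-structure features."""
    a: float          # lattice lengths, in angstroms
    b: float
    c: float
    alpha: float      # lattice angles, in degrees
    beta: float
    gamma: float
    volume: float     # cell volume, in cubic angstroms
    space_group: int  # international space group number (1-230)


def cell_volume(a, b, c, alpha, beta, gamma):
    """Volume of a general (triclinic) cell from its six lattice parameters."""
    ca, cb, cg = (math.cos(math.radians(x)) for x in (alpha, beta, gamma))
    return a * b * c * math.sqrt(1 - ca**2 - cb**2 - cg**2 + 2 * ca * cb * cg)


# A cubic cell with a = 4 angstroms; 221 is the space group number of Pm-3m
params = (4.0, 4.0, 4.0, 90.0, 90.0, 90.0)
features = CrystalFeatures(*params, volume=cell_volume(*params), space_group=221)
row = astuple(features)  # one row of the feature matrix
```

Note that the volume column is a deterministic function of the other six lattice columns, so it adds no independent information; it only saves the trees from having to approximate the product themselves.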


## Relevant citations
###### List all relevant citations for your algorithm
- [Dunn et al.](https://doi.org/10.1038/s41524-020-00406-3)
- Your model's other citations go here.


## Any other relevant info
###### Freeform field to include any other relevant info about this notebook, your benchmark, or your PR submission.

---


General notes on notebooks:
- Please provide a short description for each code block, either
    - in markdown, as a separate cell
    - as inline comments
- Keep the output of each cell in the final notebook
- **The notebook must be named `notebook.ipynb`**!

In [1]:
# Import our required libraries and classes
%pip install matbench xgboost

from matbench.bench import MatbenchBenchmark
from sklearn.model_selection import train_test_split
import xgboost as xgb
import pandas as pd
import numpy as np
from typing import List, Optional, Sequence, Tuple, Union

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting matbench
  Downloading matbench-0.5-py3-none-any.whl (9.9 MB)
Collecting matminer==0.7.4
  Downloading matminer-0.7.4-py3-none-any.whl (1.4 MB)
Collecting monty==2021.8.17
  Downloading monty-2021.8.17-py3-none-any.whl (65 kB)
Collecting scikit-learn==1.0
  Downloading scikit_learn-1.0-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (23.1 MB)
Collecting pint>=0.17
  Downloading Pint-0.18-py2.py3-none-any.whl (209 kB)
Collecting pymatgen>=2022.0.11
  Downloading pymatgen-2022.0.17.tar.gz (40.6 MB)

In [2]:
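def training_model():
  """Train an XGBoost model on the module-level feature lists (latt_a ... space_group)
  and labels (formation_energy) that the benchmark loop fills in for each fold."""
  # Transfer the training features and labels into pandas objects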
  X = pd.DataFrame(
      {
          "a": latt_a,
          "b": latt_b,
          "c":latt_c,
          "alpha": alpha,
          "beta": beta,
          "gamma": gamma,
          "volume": volume,
          "space_group": space_group
      },
      # index=material_id
  )
  y = pd.Series(name="formation_energy", data=formation_energy)

  # X must keep one row per label in y, so no rows are dropped before building the DMatrix

  train = xgb.DMatrix(X, label=y)

  hyperparam = {
      'max_depth': 4,
      'learning_rate':0.05,
      'n_estimators':1000,
      'verbosity':1,
      'booster':"gbtree",
      'tree_method':"auto",
      'n_jobs':1,
      'gamma':0.0001,
      'min_child_weight':8,
      'max_delta_step':0,
      'subsample':0.6,
      'colsample_bytree':0.7,
      'colsample_bynode':1,
      'reg_alpha':0,
      'reg_lambda':4,
      'scale_pos_weight':1,
      'base_score':0.6,
      'num_parallel_tree':1,
      'importance_type':"gain",
      'eval_metric':"rmse",
      'nthread':4 }

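  # With xgb.train, the number of trees is set by num_round; the n_estimators
  # entry above is a scikit-learn wrapper argument and is not used here
  num_round = 100

  # Train the model
  my_model = xgb.train(hyperparam, train, num_round)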
  return my_model

def testing_model():
  """Build the test DMatrix from the module-level t_* feature lists."""
  # Create a DataFrame from the test features
  test_inputs = pd.DataFrame(
    {
        "a": t_latt_a,
        "b": t_latt_b,
        "c": t_latt_c,
        "alpha": t_alpha,
        "beta": t_beta,
        "gamma": t_gamma,
        "volume":t_volume,
        "space_group": t_space_group
    },
  )

  test = xgb.DMatrix(test_inputs)
  return test

## Running the actual benchmark

Create a benchmark restricted to the `matbench_mp_e_form` task (one of the 13 original Matbench v0.1 tasks), train a model on each of its five folds, and record the results with any salient metadata.
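
The train-on-each-fold, record-per-fold pattern described above can be sketched without any of the benchmark machinery. This is only an illustration under stated assumptions: `k_fold_indices` is a hypothetical helper, the shuffle seed is arbitrary, and the "model" is a stand-in that predicts the training mean. Matbench supplies its own fixed splits through `task.folds`, so none of this replaces the real loop.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    for k in range(n_folds):
        test_idx = indices[k::n_folds]
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx


# Toy targets standing in for formation energies
targets = [0.1 * i for i in range(20)]

fold_maes = []
for train_idx, test_idx in k_fold_indices(len(targets)):
    # "Training": a stand-in model that predicts the training mean
    prediction = mean(targets[i] for i in train_idx)
    # "Recording": score the held-out fold only
    errors = [abs(targets[i] - prediction) for i in test_idx]
    fold_maes.append(mean(errors))
```

The key property is that each fold's test samples never appear in that fold's training set, which is why all tuning in the real loop has to use only the data returned by `task.get_train_and_val_data(fold)`.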



In [5]:
# Create a benchmark
mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

# Run our benchmark on xgboost model
for task in mb.tasks:
  task.load()

  for fold in task.folds:
    # Define the per-fold feature lists (one scalar per structure)
    latt_a: List[float] = []
    latt_b: List[float] = []
    latt_c: List[float] = []
    alpha: List[float] = []
    beta: List[float] = []
    gamma: List[float] = []
    volume: List[float] = []
    space_group: List[int] = []

    formation_energy: List[float] = []

    t_latt_a: List[float] = []
    t_latt_b: List[float] = []
    t_latt_c: List[float] = []
    t_alpha: List[float] = []
    t_beta: List[float] = []
    t_gamma: List[float] = []
    t_volume: List[float] = []
    t_space_group: List[int] = []

    # Get the training inputs (an array of pymatgen.Structure or string Compositions, e.g. "Fe2O3")
    train_inputs, train_outputs = task.get_train_and_val_data(fold)

    for structure in train_inputs:
      lattice = structure.lattice  # public accessor instead of the private _lattice
      latt_a.append(lattice.a)
      latt_b.append(lattice.b)
      latt_c.append(lattice.c)
      alpha.append(lattice.angles[0])
      beta.append(lattice.angles[1])
      gamma.append(lattice.angles[2])
      volume.append(structure.volume)
      # get_space_group_info() returns (symbol, number); keep the number
      space_group.append(structure.get_space_group_info()[1])

    # Get the training outputs (an array of either bools or floats, depending on problem)
    formation_energy.extend(train_outputs)

    # Do all model tuning and selection with the training data only
    # The split of training/validation is up to you and your algorithm
    # Transfer train_inputs and train_outputs into a pandas DataFrame
    
    X = pd.DataFrame(
        {
            "a": latt_a,
            "b": latt_b,
            "c":latt_c,
            "alpha": alpha,
            "beta": beta,
            "gamma": gamma,
            "volume": volume,
            "space_group": space_group
        },
        # index=material_id
    )
    y = pd.Series(name="formation_energy", data=formation_energy)

    train = xgb.DMatrix(X, label=y)

    hyperparam = {
        'max_depth': 4,
        'learning_rate':0.05,
        'n_estimators':1000,
        'verbosity':1,
        'booster':"gbtree",
        'tree_method':"auto",
        'n_jobs':1,
        'gamma':0.0001,
        'min_child_weight':8,
        'max_delta_step':0,
        'subsample':0.6,
        'colsample_bytree':0.7,
        'colsample_bynode':1,
        'reg_alpha':0,
        'reg_lambda':4,
        'scale_pos_weight':1,
        'base_score':0.6,
        'num_parallel_tree':1,
        'importance_type':"gain",
        'eval_metric':"rmse",
        'nthread':4 }

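    # With xgb.train, the number of trees is set by num_round; the n_estimators
    # entry above is a scikit-learn wrapper argument and is not used here
    num_round = 100

    # Train the model on this fold's training+validation data
    my_model = xgb.train(hyperparam, train, num_round)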

    # Get test data (an array of pymatgen.Structure or string compositions, e.g., "Fe2O3")
    test_inputs_raw = task.get_test_data(fold, include_target=False)

    for structure in test_inputs_raw:
      lattice = structure.lattice  # public accessor instead of the private _lattice
      t_latt_a.append(lattice.a)
      t_latt_b.append(lattice.b)
      t_latt_c.append(lattice.c)
      t_alpha.append(lattice.angles[0])
      t_beta.append(lattice.angles[1])
      t_gamma.append(lattice.angles[2])
      t_volume.append(structure.volume)
      t_space_group.append(structure.get_space_group_info()[1])

    test_inputs = pd.DataFrame(
      {
          "a": t_latt_a,
          "b": t_latt_b,
          "c": t_latt_c,
          "alpha": t_alpha,
          "beta": t_beta,
          "gamma": t_gamma,
          "volume":t_volume,
          "space_group": t_space_group
      },
    )

    test = xgb.DMatrix(test_inputs)

    # Make predictions on the test data, returning an array of either bool or float, depending on problem
    predictions = my_model.predict(test)

    # Record our predictions into the benchmark object
    # you can optionally add parameters corresponding to the particular model in this fold
    # if particular hyperparameters or configurations are chosen based on training/validation
    task.record(fold, predictions)

2022-06-10 22:23:12 INFO     Initialized benchmark 'matbench_v0.1' with 1 tasks: 
['matbench_mp_e_form']
2022-06-10 22:23:12 INFO     Loading dataset 'matbench_mp_e_form'...
2022-06-10 22:26:56 INFO     Dataset 'matbench_mp_e_form loaded.
2022-06-10 22:39:05 INFO     Recorded fold matbench_mp_e_form-0 successfully.
2022-06-10 22:51:13 INFO     Recorded fold matbench_mp_e_form-1 successfully.
2022-06-10 23:03:22 INFO     Recorded fold matbench_mp_e_form-2 successfully.
2022-06-10 23:15:32 INFO     Recorded fold matbench_mp_e_form-3 successfully.
2022-06-10 23:27:40 INFO     Recorded fold matbench_mp_e_form-4 successfully.


In [None]:
print(len(mb.tasks))

In [6]:
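import pickle

# Save the model trained on the last fold. pickle.dump returns None,
# so its result is not assigned, and the file is closed by the with block.
with open("xgbmodel.dat", "wb") as f:
  pickle.dump(my_model, f)
# To reload: with open("xgbmodel.dat", "rb") as f: model = pickle.load(f)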

## Check out the results of the benchmark

First, validate the benchmark to make sure everything is OK. If you did not get any error messages during the recording process, your benchmark results will almost certainly be valid.

Next, get a feeling for how the model is doing. Matbench reports MAE for regression tasks and ROCAUC for classification tasks; for this regression task the scores are MAE, RMSE, MAPE, and max error, each summarized over the folds.

Finally, add some metadata related to this benchmark, if applicable.
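
The nested dictionary printed by `mb.scores` has one entry per metric, each holding the mean, max, min, and standard deviation over the folds. A minimal sketch of how such a summary can be computed from per-fold predictions follows. The function names are hypothetical, the toy numbers are made up, and the use of the population standard deviation is an assumption; this is not matbench's own scoring code.

```python
import math
from statistics import mean, pstdev


def regression_scores(y_true, y_pred):
    """MAE, RMSE, MAPE, and max error for one fold."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    return {
        "mae": mean(abs_err),
        "rmse": math.sqrt(mean(e * e for e in abs_err)),
        # Undefined when a target is exactly zero; targets near zero inflate it
        "mape": mean(e / abs(t) for e, t in zip(abs_err, y_true)),
        "max_error": max(abs_err),
    }


def summarize(fold_scores):
    """Aggregate each metric over folds into mean/max/min/std."""
    summary = {}
    for metric in fold_scores[0]:
        values = [scores[metric] for scores in fold_scores]
        summary[metric] = {
            "mean": mean(values),
            "max": max(values),
            "min": min(values),
            "std": pstdev(values),  # population std; an assumption
        }
    return summary


# Two toy folds with made-up targets and predictions
fold_scores = [
    regression_scores([1.0, 2.0, 4.0], [1.5, 2.0, 3.0]),
    regression_scores([2.0, -1.0], [1.0, -1.0]),
]
summary = summarize(fold_scores)
```

Because MAPE divides by the magnitude of each target, a dataset with many targets close to zero can show a large MAPE even when the MAE is moderate.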

In [7]:
# Make sure our benchmark is valid
valid = mb.is_valid
print(f"is valid: {valid}")


# Check out how our algorithm is doing using scores
import pprint
pprint.pprint(mb.scores)

# Get some more info about the benchmark
mb.get_info()

# Add some additional metadata about our algorithm
# These sections are very freeform; any and all data you think are relevant to your benchmark
# mb.add_metadata({"algorithm": "xgboost", "features": "basic crystallographic"})

is valid: True
{'matbench_mp_e_form': {'mae': {'max': 0.7559645762744662,
                                'mean': 0.7514603730363221,
                                'min': 0.7463943260812504,
                                'std': 0.004167347004583424},
                        'mape': {'max': 8.208108588940437,
                                 'mean': 6.904368768866061,
                                 'min': 4.8884393331071925,
                                 'std': 1.323520300873098},
                        'max_error': {'max': 4.242506746409874,
                                      'mean': 4.057536813383573,
                                      'min': 3.9335069535836924,
                                      'std': 0.10426539042254096},
                        'rmse': {'max': 0.9454158134116134,
                                 'mean': 0.9414775887737938,
                                 'min': 0.936303190895938,
                                 'std': 0.0038121183426142904}}}


## Save our benchmark to file

Make sure you use the filename `results.json.gz` - this is important for our automated leaderboard to work properly!
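
A `.json.gz` file is JSON text compressed with gzip, so it can be written and inspected with the standard library alone. The sketch below shows that round trip on a placeholder dictionary; the keys are illustrative assumptions and do not reproduce the schema that `mb.to_file` writes.

```python
import gzip
import json
import os
import tempfile

# Placeholder payload; NOT the schema matbench writes
payload = {"benchmark_name": "matbench_v0.1", "tasks": {"matbench_mp_e_form": {}}}

path = os.path.join(tempfile.mkdtemp(), "results.json.gz")

# Write JSON text through a gzip stream
with gzip.open(path, "wt", encoding="utf-8") as f:
    json.dump(payload, f)

# Read it back to confirm the round trip
with gzip.open(path, "rt", encoding="utf-8") as f:
    restored = json.load(f)
```

The same `gzip.open(..., "rt")` plus `json.load` pattern is a quick way to check a saved results file before attaching it to a PR.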

In [8]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
# Save the valid benchmark to file to include with your submission
mb.to_file("/content/drive/MyDrive/sparks-baird/xtal2png/results.json.gz")

2022-06-11 00:02:22 INFO     Successfully wrote MatbenchBenchmark to file '/content/drive/MyDrive/sparks-baird/xtal2png/results.json.gz'.


Citation:
Dunn, A., Wang, Q., Ganose, A., Dopp, D., Jain, A. 
Benchmarking Materials Property Prediction Methods: 
The Matbench Test Set and Automatminer Reference Algorithm. 
npj Computational Materials 6, 138 (2020). 
https://doi.org/10.1038/s41524-020-00406-3
