# Introduction

This notebook demonstrates the prediction pipeline for the trained classifiers. With the three pretrained classifiers, you can classify a new structure that is not included in the original training set.

**Note**: For easier readability, you can change the font size of this notebook by navigating to `Settings` -> `JupyterLab Theme` and increasing or decreasing the font size from the dropdown menu.

# Import packages and functions

In [1]:
import sys
# force the notebook to look for files in the upper level directory
sys.path.insert(1, '../')

In [2]:
import pandas as pd
import xgboost as xgb
from model.model_building import load_data
from data.data_cleaning import abbreviate_features
from data.compound_featurizer import read_new_struct, composition_featurizer, structure_featurizer, handbuilt_featurizer

# Set up constants

`PROCESSED_PATH` points to the training dataset used to construct the classifiers. `NEW_STRUCT_PATH` points to the demo CIF structure file. To test your own structure, upload its CIF file to the "user_defined_structures" folder and change `NEW_STRUCT_PATH` accordingly. You can save time by pressing <kbd>⇥ Tab</kbd> for auto-completion after typing the first few characters.

**Note**: If you upload your own CIF structure file, the structure should preferably already have an oxidation state assigned to each site. If not, the featurizer will try to guess the oxidation states using the [oxi_state_guesses()](https://pymatgen.org/pymatgen.core.composition.html?highlight=oxi_state_guesses#pymatgen.core.composition.Composition.oxi_state_guesses) function from pymatgen. There is no guarantee that the guessed oxidation states are correct, and the script will ask for user input if it is unable to guess them. In addition, the uploaded structure must contain at least **2 different elements** (i.e. at least a binary compound). A single-element structure such as Si will cause an error in the script.

There are two demo structures for you to try out, CuNiO$_2$ and CaFeO$_3$. Neither is present in the training database. The featurizer can guess the oxidation states for CuNiO$_2$, but not for CaFeO$_3$. If you run the script with CaFeO$_3$, you will be asked to manually assign the oxidation states by element.

In [3]:
PROCESSED_PATH = "../data/processed/IMT_Classification_Dataset_Processed_v4.xlsx"
NEW_STRUCT_PATH = "../notebooks/user_defined_structures/CuNiO2_mp-1178372_primitive.cif"
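
Before featurizing, it can help to confirm that `NEW_STRUCT_PATH` actually points to an existing CIF file, since a mistyped path would otherwise only surface inside the featurizer. The following is a minimal sketch using only the standard library; `validate_cif_path` is a hypothetical helper and not part of this repository.

```python
from pathlib import Path


def validate_cif_path(path_str):
    """Return the resolved path if path_str points to an existing .cif file.

    Hypothetical helper for illustration; not part of the repository.
    """
    path = Path(path_str)
    # reject anything that does not carry a .cif extension
    if path.suffix.lower() != ".cif":
        raise ValueError("Expected a .cif file, got: {}".format(path.name))
    # reject paths that do not exist or that point to a directory
    if not path.is_file():
        raise FileNotFoundError("No such file: {}".format(path))
    return path.resolve()
```

In the notebook, this could be called as `validate_cif_path(NEW_STRUCT_PATH)` right after the constants are defined.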

# Define some helper functions

In [4]:
def assign_oxi_state(elem_symbol):
    """Allow the user to assign oxidation state to each element."""
    oxi_state = input("{}:".format(elem_symbol))
    return float(oxi_state)

def check_oxi_state(structure):
    """Check whether oxidation states can be guessed for the structure. If not, trigger user input."""
    if not structure.composition.oxi_state_guesses():
        # get all the elements in the input structure
        elem_lst = [element.symbol for element in structure.composition.element_composition.elements]
        # get the reduced formula
        reduced_formula = structure.composition.reduced_formula
        print("Unable to guess oxidation states for {}. Please manually assign oxidation states by element".format(reduced_formula))
        # get a dictionary to overwrite the default guessed oxidation states
        elem_oxi_states = {elem_symbol: [assign_oxi_state(elem_symbol)] for elem_symbol in elem_lst}
        return elem_oxi_states
    return None

def featurizer_wrapper(df_input):
    """A wrapper function around the composition, structure and handbuilt featurizers."""
    # get the structure from the initialized dataframe
    new_struct = df_input.at[0, "structure"]
    # check whether oxidation states can be guessed and ask for a user override if not
    oxi_states_by_element = check_oxi_state(new_struct)
    # featurize the given structure using 3 predefined featurizers
    df_output = composition_featurizer(df_input, oxi_states_override=oxi_states_by_element)
    df_output = structure_featurizer(df_output, oxi_states_override=oxi_states_by_element)
    df_output = handbuilt_featurizer(df_output)
    return df_output
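
`assign_oxi_state` passes the raw input straight to `float()`, so an entry written in the common chemical notation `3+` or `2-` raises a `ValueError`. The sketch below shows one way to accept both sign conventions; `parse_oxi_state` is a hypothetical helper and not part of this repository.

```python
import re

# optional leading sign, a number, optional trailing sign (e.g. "3", "+3", "3+", "2-")
_OXI_PATTERN = re.compile(r"^\s*([+-]?)(\d+(?:\.\d+)?)([+-]?)\s*$")


def parse_oxi_state(text):
    """Convert input such as '3', '+3', '3+' or '2-' to a float.

    Hypothetical helper for illustration; not part of the repository.
    """
    match = _OXI_PATTERN.match(text)
    if match is None:
        raise ValueError("Cannot interpret {!r} as an oxidation state".format(text))
    lead_sign, magnitude, trail_sign = match.groups()
    # a sign on both sides (e.g. "+2-") is ambiguous, so refuse it
    if lead_sign and trail_sign:
        raise ValueError("Ambiguous sign in {!r}".format(text))
    sign = -1.0 if "-" in (lead_sign, trail_sign) else 1.0
    return sign * float(magnitude)
```

If adopted, `assign_oxi_state` could return `parse_oxi_state(oxi_state)` instead of `float(oxi_state)`.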

# Read in the processed data

This is a quick overview of the training dataset. It will be used later on to select the relevant features from the raw output of the featurizer.

In [5]:
df = pd.read_excel(PROCESSED_PATH)
df

Unnamed: 0,Compound,Label,struct_file_path,range MendeleevNumber,avg_dev MendeleevNumber,range AtomicWeight,mean AtomicWeight,avg_dev AtomicWeight,range MeltingT,mean MeltingT,...,max_xx_dists,min_xx_dists,avg_xx_dists,v_m,v_x,iv,iv_p1,est_hubbard_u,est_charge_trans,volumn_per_sites
0,SrRuO3,0,../data/Structures/Metals/SrRuO3_75561.cif,79,26.400000,85.07060,47.337640,37.605888,2552.20,764.280000,...,3.579973,2.760023,2.947568,-44.758483,23.738172,45.000000,59.00000,10.330721,8.527722,12.089967
1,OsO2,0,../data/Structures/Metals/OsO2_15070.cif,30,13.333333,174.23060,74.076267,77.435822,3251.20,1138.533333,...,2.805520,2.442651,2.684563,-44.387080,25.269881,41.000000,55.00000,9.953087,13.687053,10.747095
2,SrLaCuO4,0,../data/Structures/Metals/LaSrCuO4_10252.cif,79,29.346939,122.90607,50.581296,39.522167,1302.97,545.710000,...,3.421841,2.662257,2.966018,-36.448400,23.905465,36.841000,57.38000,18.524833,8.820620,13.436088
3,SrCrO3,0,../data/Structures/Metals/SrCrO3_245834.cif,79,28.080000,71.62060,37.522860,25.828152,2125.20,678.880000,...,2.701006,2.701006,2.701006,-46.659812,24.337085,49.160000,69.46000,16.530261,6.586603,11.146843
4,CrO2,0,../data/Structures/Metals/CrO2_202836.cif,38,16.888889,35.99670,27.998300,15.998533,2125.20,763.200000,...,2.688819,2.471404,2.616347,-46.102564,26.561430,49.160000,69.46000,16.126339,8.219417,9.504907
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
223,YbFe4(CuO4)3,2,../data/Structures/MIT_materials/HighT/YbCu3Fe...,48,14.700000,157.05460,38.953240,27.544608,1756.20,653.345500,...,2.924013,2.552849,2.744642,-33.639068,24.360992,36.841000,57.38000,16.597335,7.361181,9.750947
224,NiSeS,2,../data/Structures/MIT_materials/HighT/NiS(2-x...,28,12.222222,46.89500,56.572800,16.338533,1339.64,870.120000,...,3.287898,2.376963,3.060164,-18.496007,9.778249,18.168838,35.18700,13.516153,8.891048,16.385810
225,Ti2O3,2,../data/Structures/MIT_materials/HighT/Ti2O3_H...,44,21.120000,31.86760,28.746440,15.296448,1886.20,809.280000,...,2.900002,2.771288,2.844355,-33.753924,24.648770,27.491710,43.26717,11.068473,16.169806,10.490597
226,Ca1.2La2.8Mn4O12,2,../data/Structures/MIT_materials/HighT/La0.7Ca...,80,26.592000,122.90607,42.438695,32.010437,1464.20,570.600000,...,3.516261,2.747250,2.906344,-39.888114,22.934073,38.930600,57.57000,14.915598,8.843086,11.576092


# Make a prediction on a never-seen-before structure

## 1. Load the three trained models

In [6]:
# load the metal vs. non_metal classifier
metal_model = xgb.XGBClassifier()
metal_model.load_model("../model/saved_models/new_models/metal.model")

# load the insulator vs. non_insulator classifier
insulator_model = xgb.XGBClassifier()
insulator_model.load_model("../model/saved_models/new_models/insulator.model")

# load the mit vs. non_mit classifier
mit_model = xgb.XGBClassifier()
mit_model.load_model("../model/saved_models/new_models/mit.model")

## 2. Read in and featurize the new structure

In [7]:
new_struct_df = read_new_struct(NEW_STRUCT_PATH)
new_struct_df = featurizer_wrapper(new_struct_df)

Here is the raw output from the featurizer.

In [8]:
new_struct_df

Unnamed: 0,Compound,structure,composition,MagpieData minimum Number,MagpieData maximum Number,MagpieData range Number,MagpieData mean Number,MagpieData avg_dev Number,MagpieData mode Number,MagpieData minimum MendeleevNumber,...,max_xx_dists,min_xx_dists,avg_xx_dists,v_m,v_x,iv,iv_p1,est_hubbard_u,est_charge_trans,volumn_per_sites
0,CuNiO2,"[[0. 0. 0.] Cu, [0. 0. 2.84837...","(Cu, Ni, O)",8.0,29.0,21.0,18.25,10.25,8.0,61.0,...,2.864732,2.77446,2.817046,-24.969425,23.63686,18.168838,35.187,11.991636,15.842407,9.840079


## 3. Only select predictors that are in the processed data

In [9]:
new_struct_df = abbreviate_features(new_struct_df)
new_struct_df = new_struct_df.filter(items=df.columns).drop(columns="Compound")

In [10]:
new_struct_df

Unnamed: 0,range MendeleevNumber,avg_dev MendeleevNumber,range AtomicWeight,mean AtomicWeight,avg_dev AtomicWeight,range MeltingT,mean MeltingT,avg_dev MeltingT,range Column,avg_dev Column,...,max_xx_dists,min_xx_dists,avg_xx_dists,v_m,v_x,iv,iv_p1,est_hubbard_u,est_charge_trans,volumn_per_sites
0,26.0,12.25,47.5466,38.55955,22.56015,1673.2,798.8425,744.0425,6.0,2.75,...,2.864732,2.77446,2.817046,-24.969425,23.63686,18.168838,35.187,11.991636,15.842407,9.840079


Compare the number of predictors above with the number of features in the training data loaded for the MIT classifier.

In [11]:
train_x, _ = load_data(df, "MIT")
train_x.shape[1]

93
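
Because `DataFrame.filter(items=...)` keeps only the labels it finds and does not raise on absent ones, a feature that the featurizer failed to produce would be dropped silently in step 3. A minimal sketch of an explicit check follows; `missing_columns` is a hypothetical helper, and the column names are a small sample taken from the tables above.

```python
def missing_columns(expected, actual):
    """Return the names in `expected` that are absent from `actual`, in order.

    Hypothetical helper for illustration; not part of the repository.
    """
    actual_set = set(actual)
    return [name for name in expected if name not in actual_set]


# illustrative sample of column names, not the full feature set
expected_features = ["range MendeleevNumber", "v_m", "iv", "est_hubbard_u"]
available_features = ["v_m", "iv", "Compound"]
print(missing_columns(expected_features, available_features))
```

In the notebook, `missing_columns(train_x.columns, new_struct_df.columns)` should return an empty list if every training feature is available for the new structure.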

## 4. Print out the prediction label and probability

After selecting the relevant features, we are now ready to make a prediction for the given structure. Below, you will see the outputs from the metal vs. non_metal, insulator vs. non_insulator and mit vs. non_mit classifiers. `1` means the structure is predicted to be in the positive class and `0` means it is predicted to be in the negative class.

**Note**: It is possible for the classifiers to assign a structure to multiple classes (e.g. as both a metal and an MIT). The probability of each prediction is provided so that you can be the final judge.

In [12]:
print("Is a metal: {}, and the probability of being a metal is :{:0.4f}\n".format(metal_model.predict(new_struct_df)[0], metal_model.predict_proba(new_struct_df)[0][1]))
print("Is an insulator: {}, and the probability of being an insulator is :{:0.4f}\n".format(insulator_model.predict(new_struct_df)[0], 
                                                                                    insulator_model.predict_proba(new_struct_df)[0][1]))
print("Is an mit: {}, and the probability of being an mit is :{:0.4f}".format(mit_model.predict(new_struct_df)[0], mit_model.predict_proba(new_struct_df)[0][1]))

Is a metal: 0, and the probability of being a metal is :0.0935

Is an insulator: 1, and the probability of being an insulator is :0.9783

Is an mit: 0, and the probability of being an mit is :0.0970
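
The three classifiers are binary and independent of one another, so their positive-class probabilities need not sum to 1, and more than one (or none) can exceed the decision threshold. One possible way to summarize them is sketched below; `summarize_predictions` is a hypothetical helper, and the 0.5 threshold is an assumption rather than a value taken from the trained models.

```python
def summarize_predictions(probabilities, threshold=0.5):
    """Return (most likely class, classes at or above the threshold).

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    if not probabilities:
        raise ValueError("probabilities must not be empty")
    # class with the highest positive-class probability
    best = max(probabilities, key=probabilities.get)
    # every class whose classifier would count as a positive prediction
    positives = [name for name, p in probabilities.items() if p >= threshold]
    return best, positives


# probabilities printed above for CuNiO2
probs = {"metal": 0.0935, "insulator": 0.9783, "mit": 0.0970}
best, positives = summarize_predictions(probs)
print(best, positives)
```

An empty `positives` list or one with several entries indicates a case where the classifiers disagree and the probabilities deserve a closer look.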
