feature engineering notebook

potential fingerprint types:

MAGPIE: composition-based descriptor set for inorganic materials, computed purely from the chemical formula
includes features such as statistics of elemental properties (atomic weights etc), atomic packing, electronegativity information

other options - RDKit descriptors are aimed at organic molecules, so MAGPIE features are probably best for now

XGBoost feature importance - very quick and easy, but not a robust approach. 

suggested workflow and resources to properly investigate feature importance: 

1. remove low variance features 

2. correlations - correlation matrix? remove highly correlated features

3. supervised feature selection
- f_regression or mutual_info_regression for univariate statistical tests via SelectKBest or SelectPercentile
- recursive feature elimination?
- use random forest / gradient boosting to rank feature importance 

problem with these - do we really want to use regression-based selection when we know we are going to apply a categorical filter in our model? investigate

4. dimensionality reduction - PCA, check lecture notes
- not ideal for this example: the principal components replace the original features, so we lose insight into which features matter
- assumes linear relationships between features
- not target-aware - may not improve MAE for bandgap prediction, it simply compresses the data / features

5. test a bunch of models, basic untuned xgboost on different features could be a good shout for comparison, maybe make a function to test? 
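
Steps 1 to 3 of the workflow above could be sketched roughly as follows, assuming the features live in a pandas DataFrame. The helper names (drop_low_variance, drop_highly_correlated, select_k_best), the 0.95 correlation cutoff and the synthetic demo data are illustrative assumptions, not part of the original analysis. mutual_info_regression is used for step 3 here; f_regression would slot in the same way.

```python
from functools import partial

import numpy as np
import pandas as pd
from sklearn.feature_selection import (
    SelectKBest,
    VarianceThreshold,
    mutual_info_regression,
)


def drop_low_variance(X: pd.DataFrame, threshold: float = 0.0) -> pd.DataFrame:
    """Step 1: drop constant (or near-constant) columns."""
    selector = VarianceThreshold(threshold=threshold).fit(X)
    return X.loc[:, selector.get_support()]


def drop_highly_correlated(X: pd.DataFrame, max_corr: float = 0.95) -> pd.DataFrame:
    """Step 2: for each pair with |Pearson r| > max_corr, drop the later column."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > max_corr).any()]
    return X.drop(columns=to_drop)


def select_k_best(X: pd.DataFrame, y, k: int = 20) -> pd.DataFrame:
    """Step 3: keep the k columns with the highest mutual information with y."""
    score_func = partial(mutual_info_regression, random_state=0)
    selector = SelectKBest(score_func=score_func, k=min(k, X.shape[1])).fit(X, y)
    return X.loc[:, selector.get_support()]


# Synthetic demo data (an assumption, NOT the team-a.csv dataset).
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame(
    {
        "signal": rng.normal(size=n),
        "noise": rng.normal(size=n),
        "constant": np.ones(n),
    }
)
demo["signal_copy"] = demo["signal"] * 2.0 + 1e-3 * rng.normal(size=n)
y_demo = 3.0 * demo["signal"] + 0.1 * rng.normal(size=n)

step1 = drop_low_variance(demo)          # removes the constant column
step2 = drop_highly_correlated(step1)    # removes the near-duplicate column
step3 = select_k_best(step2, y_demo, k=1)
print(list(step3.columns))
```

Note that the correlation filter keeps whichever column of a correlated pair appears first, which is an arbitrary choice; ranking the pair by relevance to the target before dropping may be preferable.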

permutation importance - a more robust way to rank features than built-in XGBoost importance, and so a better basis for reducing the feature set

can use sklearn.inspection.permutation_importance 
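
A minimal sketch of sklearn.inspection.permutation_importance, run on synthetic data so it is self-contained. The data, the feature names f0 to f3 and the use of RandomForestRegressor (a stand-in so the sketch only needs scikit-learn) are assumptions for illustration; any fitted regressor, including the XGBRegressor used in this notebook, should work in its place.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only column 0 carries signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 5.0 * X[:, 0] + 0.1 * rng.normal(size=300)
names = ["f0", "f1", "f2", "f3"]  # hypothetical feature names

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_tr, y_tr)

# Shuffle each column of the held-out set n_repeats times and record
# how much the score (negated MAE) degrades.
result = permutation_importance(
    model,
    X_te,
    y_te,
    scoring="neg_mean_absolute_error",
    n_repeats=10,
    random_state=42,
)

# Rank features from most to least important.
order = np.argsort(result.importances_mean)[::-1]
for i in order:
    print(f"{names[i]}: {result.importances_mean[i]:.4f} +/- {result.importances_std[i]:.4f}")
```

With scoring="neg_mean_absolute_error" the importances are in the same units as the MAE, which makes them easier to compare against the cross-validation numbers below.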

RESULTS SUMMARY

5-fold CV MAE using only the single most important feature (MagpieData mean NpValence) is 0.82, roughly double the 0.43 obtained with the entire feature set

In [1]:
import pandas as pd
import numpy as np
import eli5
from eli5.sklearn import PermutationImportance
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error 
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
from matplotlib import pyplot as plt


In [2]:
# import data 

df = pd.read_csv("team-a.csv")
df = df.drop(['formula'],axis=1)

X_exp = df.drop(['gap expt'],axis=1).values
y_exp = df['gap expt'].values
y_exp = y_exp.reshape(-1,1)
X_train_exp,X_test_exp,y_train_exp,y_test_exp = train_test_split(X_exp,y_exp,test_size=0.2,random_state=42)

In [3]:
# use xgboost model as before on full dataset, with pretuned hyperparameters

xgb_model = XGBRegressor(n_estimators=200,subsample=0.8,max_depth=7,learning_rate=0.1,random_state=42)

scores = cross_val_score(xgb_model, X_train_exp, y_train_exp, cv=5, scoring="neg_mean_absolute_error")
mae_scores = -scores  # sklearn returns negated MAE, so flip the sign
print(f"Mean absolute error: {mae_scores.mean():.2f} (+/- {mae_scores.std() * 2:.2f})")

Mean absolute error: 0.43 (+/- 0.02)


In [4]:
print(df)

      gap expt  MagpieData minimum Number  MagpieData maximum Number  \
0         0.00                       16.0                       79.0   
1         0.00                       35.0                       74.0   
2         1.83                       16.0                       82.0   
3         1.51                       32.0                       82.0   
4         0.00                        5.0                       47.0   
...        ...                        ...                        ...   
4599      1.72                        7.0                       73.0   
4600      0.00                       40.0                       52.0   
4601      0.00                        8.0                       40.0   
4602      0.00                        9.0                       40.0   
4603      0.00                       40.0                       74.0   

      MagpieData range Number  MagpieData mean Number  \
0                        63.0               47.400000   
1                    

In [5]:
# now use permutation importance:

xgb_model.fit(X_train_exp,y_train_exp)
perm = PermutationImportance(xgb_model, random_state=42).fit(X_test_exp,y_test_exp)

# feature names in column order, excluding the target:
feature_cols = [col for col in df.columns if col != 'gap expt']
eli5.show_weights(perm, feature_names=feature_cols)


Weight             Feature
0.2017 ± 0.0087    MagpieData mean NpValence
0.0384 ± 0.0135    MagpieData mean MeltingT
0.0375 ± 0.0055    MagpieData range NdValence
0.0299 ± 0.0086    MagpieData avg_dev Electronegativity
0.0195 ± 0.0081    MagpieData maximum MendeleevNumber
0.0169 ± 0.0071    MagpieData mean NUnfilled
0.0156 ± 0.0095    MagpieData maximum Number
0.0155 ± 0.0074    MagpieData mean CovalentRadius
0.0154 ± 0.0061    MagpieData avg_dev Column
0.0147 ± 0.0047    MagpieData range Electronegativity


In [None]:
# let's train the model on just the top-ranked feature, MagpieData mean NpValence

df = pd.read_csv("team-a.csv")
df = df.drop(['formula'],axis=1)

X_exp = df['MagpieData mean NpValence'].values
y_exp = df['gap expt'].values
y_exp = y_exp.reshape(-1,1)
X_exp = X_exp.reshape(-1,1)
X_train_exp,X_test_exp,y_train_exp,y_test_exp = train_test_split(X_exp,y_exp,test_size=0.2,random_state=42)

# model: 

xgb_model = XGBRegressor(n_estimators=200,subsample=0.8,max_depth=7,learning_rate=0.1,random_state=42)

scores = cross_val_score(xgb_model, X_train_exp, y_train_exp, cv=5, scoring="neg_mean_absolute_error")
mae_scores = -scores  # sklearn returns negated MAE, so flip the sign
print(f"Mean absolute error: {mae_scores.mean():.2f} (+/- {mae_scores.std() * 2:.2f})")

Mean absolute error: 0.82 (+/- 0.05)
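
The single-feature experiment above could be generalised into the comparison function suggested in step 5 of the workflow. The sketch below is one possible shape for it: cv_mae is a hypothetical helper name, the data is synthetic, and RandomForestRegressor is a stand-in so the example only needs scikit-learn. The XGBRegressor defined earlier could be passed in instead.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score


def cv_mae(model, X: pd.DataFrame, y, columns, cv: int = 5):
    """Return (mean, std) of cross-validated MAE using only `columns` of X."""
    scores = cross_val_score(
        clone(model),  # fresh, unfitted copy so runs do not share state
        X[list(columns)],
        y,
        cv=cv,
        scoring="neg_mean_absolute_error",
    )
    mae = -scores  # sklearn returns negated MAE
    return mae.mean(), mae.std()


# Synthetic demo data (an assumption, NOT the team-a.csv dataset):
# the target depends on columns "a" and "b" but not "c".
rng = np.random.default_rng(0)
demo_X = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
demo_y = 2.0 * demo_X["a"] + 1.0 * demo_X["b"] + 0.1 * rng.normal(size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42)
subsets = {"a only": ["a"], "a + b": ["a", "b"], "all": ["a", "b", "c"]}

results = {name: cv_mae(model, demo_X, demo_y, cols) for name, cols in subsets.items()}
for name, (mean, std) in results.items():
    print(f"{name}: MAE {mean:.2f} (+/- {2 * std:.2f})")
```

Passing the top-k feature names from the permutation importance ranking as `columns`, for increasing k, would show how many features are needed before the MAE approaches the full-feature value.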
