# Featurizing the dataset with pGNN

This notebook goes through the main functions and objects implemented in this library, using a dataset of ~4,000 entries of the form (mp_id, structure, refractive index) taken from the Materials Project (MP). The workflow can be divided into two parts: first, each of the currently implemented featurizers is run on the dataset; then the featurized datasets are compared on the task of predicting the refractive index.

## 1. Loading the dataset

In this example the dataset is a DataFrame saved as a pickle, but any format works as long as you can retrieve the structures and targets.

In [1]:
import pandas as pd
df = pd.read_pickle('data/df_ref_index.pkl')
print('{} datapoints'.format(len(df)))
df.head()

4022 datapoints


| mp_id | structure | ref_index |
|---|---|---|
| mp-624234 | [[0.67808954 1.32800354 5.90141888] Te, [1.500... | 2.440483 |
| mp-560478 | [[-0.62755181 6.55361247 9.268476 ] Ba, [4.... | 1.790685 |
| mp-556346 | [[4.43332093 4.12714801 8.8721209 ] Pr, [ 1.40... | 2.056131 |
| mp-13676 | [[-0.14481557 3.41229366 4.12618551] O, [3.2... | 2.023772 |
| mp-7610 | [[ 0.12549448 3.01287591 -0.20434955] Li, [1.... | 1.745509 |


## 2. Import the featurizers

In [4]:
# Make the pgnn package importable from the parent directory
import os
import sys
pgnn_parent_path = os.path.abspath("..")
sys.path.append(pgnn_parent_path)
# Featurizers exported by pgnn.featurizers.structure (its __all__):
# "l_MM_v1", "l_OFM_v1", "mvl32", "mvl16", "adj_megnet", "adj_megnet_layer16"
from pgnn.featurizers.structure import (
    l_MM_v1,
    l_OFM_v1,
    mvl32,
    mvl16,
    adj_megnet,
    adj_megnet_layer16,
)
print("Featurizing with l_MM_v1...")
df_mmv1 = l_MM_v1.get_features(df['structure'])
print("l_MM_v1 features shape:", df_mmv1.shape)

print("Featurizing with l_OFM_v1...")
df_ofm = l_OFM_v1.get_features(df['structure'])
print("l_OFM_v1 features shape:", df_ofm.shape)

print("Featurizing with mvl32...")
df_mvl32 = mvl32.get_features(df['structure'])
print("mvl32 features shape:", df_mvl32.shape)

print("Featurizing with mvl16...")
df_mvl16 = mvl16.get_features(df['structure'])
print("mvl16 features shape:", df_mvl16.shape)

# The adjacent model must be trained before it can be used as a featurizer.
print("Training adj_megnet...")
adj_megnet.train_adjacent_megnet(df['structure'])

print("Featurizing with adj_megnet...")
df_adj_megnet = adj_megnet.get_features(df['structure'])
print("adj_megnet features shape:", df_adj_megnet.shape)

print("Featurizing with adj_megnet_layer16...")
df_adj_megnet_layer16 = adj_megnet_layer16.get_features(df['structure'])
print("adj_megnet_layer16 features shape:", df_adj_megnet_layer16.shape)

Featurizing with l_MM_v1...
/auto/globalscratch/users/r/g/rgouvea/pGNN/pgnn/featurizers
/auto/globalscratch/users/r/g/rgouvea/pGNN/pgnn/featurizers/custom_models/MEGNetModel__MatMinerEncoded_v1.h5


Total params: 681,702
Following invalid structures: [].
l_MM_v1 features shape: (4022, 759)
Featurizing with l_OFM_v1...


ValueError: MEGNetModel__OFMEncoded_v1.h5 not found in custom_models directory.
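
The run above stops at the first featurizer whose pretrained model file is missing, so the remaining featurizers never execute. A minimal sketch of one way to keep going is shown here, assuming each featurizer can be wrapped as a callable that takes the structures and raises `ValueError` when its model is unavailable (as the error message suggests). `run_featurizers` and the stand-in featurizers are hypothetical and not part of pgnn.

```python
def run_featurizers(featurizers, structures):
    """Apply each featurizer in turn, collecting results and failures.

    `featurizers` maps a name to a callable (hypothetical interface);
    in the notebook this could be e.g. l_MM_v1.get_features.
    """
    features, failed = {}, {}
    for name, get_features in featurizers.items():
        try:
            features[name] = get_features(structures)
        except ValueError as err:  # e.g. a missing .h5 model file
            failed[name] = str(err)
    return features, failed


# Stand-in featurizers so the sketch runs without pgnn installed.
def fake_ok(structures):
    return [[float(len(s))] for s in structures]


def fake_missing(structures):
    raise ValueError("model file not found")


features, failed = run_featurizers(
    {"ok": fake_ok, "missing": fake_missing}, ["abc", "de"]
)
print(sorted(features), failed)
```

With this pattern the failed featurizers are reported at the end instead of interrupting the whole cell.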

### (b) Featurizing the data
The MODData has an integrated database containing the features of many materials from the MP. With fast featurization enabled, the features are retrieved directly from this database rather than computed from the structure.
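
Conceptually, fast featurization is a table lookup keyed by material ID. The sketch below illustrates the idea with a plain dictionary. `lookup_features`, the IDs, and the toy values are hypothetical and do not reflect how MODNet stores or queries its feature database.

```python
def lookup_features(material_ids, feature_db):
    """Split IDs into those found in a precomputed table and those missing."""
    found = {mid: feature_db[mid] for mid in material_ids if mid in feature_db}
    missing = [mid for mid in material_ids if mid not in feature_db]
    return found, missing


# Toy database with made-up IDs and feature values.
toy_db = {"id-a": [1.0, 2.0], "id-b": [3.0, 4.0]}
requested = ["id-a", "id-c"]
found, missing = lookup_features(requested, toy_db)
print(f"Retrieved features for {len(found)} out of {len(requested)} materials")
```

Any IDs left in `missing` would still have to be featurized from their structures.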

In [4]:
md.featurize(fast=True,
             db_file='../modnet/data/feature_database.pkl'
            )

2021-02-24 14:27:48,222 - modnet - INFO - Computing features, this can take time...
2021-02-24 14:27:48,223 - modnet - INFO - Fast featurization on, retrieving from database...
2021-02-24 14:27:52,091 - modnet - INFO - Retrieved features for 4022 out of 4022 materials
2021-02-24 14:27:53,354 - modnet - INFO - Data has successfully been featurized!


In [5]:
md.get_featurized_df().head()

Unnamed: 0,ElementProperty|MagpieData minimum Number,ElementProperty|MagpieData maximum Number,ElementProperty|MagpieData range Number,ElementProperty|MagpieData mean Number,ElementProperty|MagpieData avg_dev Number,ElementProperty|MagpieData mode Number,ElementProperty|MagpieData minimum MendeleevNumber,ElementProperty|MagpieData maximum MendeleevNumber,ElementProperty|MagpieData range MendeleevNumber,ElementProperty|MagpieData mean MendeleevNumber,...,OPSiteFingerprint|std_dev square pyramidal CN_5,OPSiteFingerprint|std_dev trigonal bipyramidal CN_5,OPSiteFingerprint|std_dev q2 CN_11,OPSiteFingerprint|std_dev q4 CN_11,OPSiteFingerprint|std_dev q6 CN_11,OPSiteFingerprint|std_dev L-shaped CN_2,OPSiteFingerprint|std_dev water-like CN_2,OPSiteFingerprint|std_dev bent 120 degrees CN_2,OPSiteFingerprint|std_dev hexagonal pyramidal CN_7,OPSiteFingerprint|std_dev pentagonal bipyramidal CN_7
mp-624234,8.0,82.0,74.0,32.0,30.0,8.0,81.0,90.0,9.0,85.875,...,0.186438,0.175091,0.021637,0.0472,0.072313,0.2280295,0.355493,0.217585,0.134621,0.163703
mp-560478,8.0,56.0,48.0,16.0,10.75,8.0,9.0,87.0,78.0,71.0625,...,0.098554,0.1012,0.029021,0.021497,0.036379,0.064974,0.051046,0.253411,0.061584,0.155998
mp-556346,8.0,59.0,51.0,22.307692,19.810651,8.0,17.0,96.0,79.0,83.692308,...,0.197575,0.19499,0.048936,0.049705,0.071292,0.1099133,0.268237,0.282694,0.12368,0.167256
mp-13676,8.0,81.0,73.0,21.333333,19.888889,8.0,76.0,87.0,11.0,84.5,...,0.032056,0.032056,0.046716,0.024166,0.059264,1.084202e-19,0.024395,0.199876,0.057122,0.193736
mp-7610,3.0,20.0,17.0,9.0,4.0,8.0,1.0,87.0,86.0,54.375,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### (c) Computing the optimal features

This runs the feature selection algorithm. First the mutual information is computed, followed by the iterative selection based on relevance and redundancy.

This step takes time, but it normally needs to be run only once, after which the result is saved.
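
To make the relevance/redundancy idea concrete, here is a minimal greedy sketch. It assumes the mutual information values are already computed and uses a simplified score (relevance minus mean redundancy with the features already chosen). This score is a stand-in for illustration, not MODNet's actual criterion, and `greedy_select` is a hypothetical helper.

```python
def greedy_select(relevance, redundancy, n):
    """Greedily pick n features by relevance minus mean redundancy.

    relevance:  dict feature -> mutual information with the target
    redundancy: dict (feature_a, feature_b) -> mutual information between
                the two features (looked up symmetrically)
    """
    def cross(a, b):
        return redundancy.get((a, b), redundancy.get((b, a), 0.0))

    selected = []
    remaining = set(relevance)
    while remaining and len(selected) < n:
        def score(f):
            if not selected:
                return relevance[f]
            penalty = sum(cross(f, s) for s in selected) / len(selected)
            return relevance[f] - penalty

        # sorted() makes tie-breaking deterministic
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy values: "b" is relevant but largely redundant with "a".
relevance = {"a": 0.9, "b": 0.8, "c": 0.5}
redundancy = {("a", "b"): 0.7, ("a", "c"): 0.1, ("b", "c"): 0.1}
ranking = greedy_select(relevance, redundancy, 3)
print(ranking)  # ['a', 'c', 'b']
```

In the toy example, `c` is ranked ahead of `b` even though `b` is more relevant on its own, because `b` shares most of its information with the already selected `a`.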

In [6]:
md.feature_selection(n=-1,
                     use_precomputed_cross_nmi=True
                    ) # Here we use precomputed cross_nmi to save time

2021-02-24 14:27:53,399 - modnet - INFO - Loading cross NMI from 'Features_cross' file.
2021-02-24 14:27:53,446 - modnet - INFO - Starting target 1/1: refractive_index ...
2021-02-24 14:27:53,450 - modnet - INFO - Computing mutual information between features and target...
2021-02-24 14:32:37,844 - modnet - INFO - Computing optimal features...
2021-02-24 14:32:59,177 - modnet - INFO - Selected 50/1019 features...
2021-02-24 14:33:20,065 - modnet - INFO - Selected 100/1019 features...
2021-02-24 14:33:40,132 - modnet - INFO - Selected 150/1019 features...
2021-02-24 14:33:59,150 - modnet - INFO - Selected 200/1019 features...
2021-02-24 14:34:17,283 - modnet - INFO - Selected 250/1019 features...
2021-02-24 14:34:34,455 - modnet - INFO - Selected 300/1019 features...
2021-02-24 14:34:50,608 - modnet - INFO - Selected 350/1019 features...
2021-02-24 14:35:05,670 - modnet - INFO - Selected 400/1019 features...
2021-02-24 14:35:19,783 - modnet - INFO - Selected 450/1019 features...
2021-02

In [7]:
md.get_optimal_descriptors()[:10]

['ElementProperty|MagpieData maximum GSbandgap',
 'ElementFraction|Th',
 'CrystalNNFingerprint|std_dev hexagonal bipyramidal CN_8',
 'DensityFeatures|density',
 'ElementProperty|MagpieData avg_dev Number',
 'LocalPropertyDifference|mean local difference in Electronegativity',
 'BondOrientationParameter|mean BOOP Q l=2',
 'ElementProperty|MagpieData range NdValence',
 'DensityFeatures|packing fraction',
 'OPSiteFingerprint|mean sgl_bd CN_1']

### (d) Saving the MODData

In [8]:
md.save('out/md_ref_index')

2021-02-24 14:36:42,692 - modnet - INFO - Data successfully saved as out/md_ref_index!


## 3. MODNet model

In [9]:
from modnet.preprocessing import MODData
md = MODData.load('out/md_ref_index')

2021-02-24 14:36:44,826 - modnet - INFO - Loaded <modnet.preprocessing.MODData object at 0x7ffae983dcd0> object, created with modnet version 0.1.9~develop


### (a) Creating the MODNet

In [10]:
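from modnet.models import MODNetModel

# One target (refractive_index) with weight 1, using the first 1000 selected features
model = MODNetModel([[['refractive_index']]],
                    {'refractive_index':1},
                    n_feat=1000,
                    num_neurons=[[128],[64],[32],[]],
                    act='elu'
                   )
model.model.summary()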

Model: "functional_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 1000)]            0         
_________________________________________________________________
dense (Dense)                (None, 128)               128128    
_________________________________________________________________
dense_1 (Dense)              (None, 64)                8256      
_________________________________________________________________
dense_2 (Dense)              (None, 32)                2080      
_________________________________________________________________
refractive_index (Dense)     (None, 1)                 33        
Total params: 138,497
Trainable params: 138,497
Non-trainable params: 0
_________________________________________________________________
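
The parameter counts in the summary can be checked by hand: a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The short check below reproduces the numbers above. `dense_params` is an illustrative helper, not part of Keras or MODNet.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


layer_sizes = [1000, 128, 64, 32, 1]  # input, three hidden layers, output
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
print(counts, sum(counts))  # [128128, 8256, 2080, 33] 138497
```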


### (b) Training the model

#### option 1: using the fit_preset function

In [11]:
#model.fit_preset(md,nested=0) # no inner CV is used (only a simple train-val split here)

#### option 2: using the fit function
In this case, the user provides hand-chosen hyperparameters.

In [12]:
model = MODNetModel([[['refractive_index']]],
                    {'refractive_index':1},
                    n_feat=1000,
                    num_neurons=[[128],[64],[32],[]],
                    act='elu'
                   )

In [13]:
model.fit(md,val_fraction=0.1,
          val_key='refractive_index',
          loss='mae', lr=0.001, epochs = 300,
          batch_size = 64, xscale='minmax',
          yscale=None,
          verbose=1
         )

Epoch 1/300
Epoch 2/300
...
Epoch 77/300
Epoch 78

In [14]:
model.fit(md,
          val_fraction=0.1,
          val_key='refractive_index',
          lr=0.0005,
          epochs = 100,
          batch_size = 128,
          xscale='minmax',
          yscale=None,
          verbose=1
         )

Epoch 1/100
Epoch 2/100
...
Epoch 77/100
Epoch 78
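
Both `fit` calls pass `xscale='minmax'`, and the first one uses `loss='mae'`. For reference, the sketch below gives the textbook definitions of min-max scaling and mean absolute error. The helpers are illustrative only and are not MODNet's implementation.

```python
def minmax_scale(column):
    """Rescale a list of numbers linearly to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: map everything to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


scaled = minmax_scale([1.0, 3.0, 5.0])
mae = mean_absolute_error([2.0, 1.5, 3.0], [2.5, 1.5, 2.0])
print(scaled, mae)  # [0.0, 0.5, 1.0] 0.5
```

Because `yscale=None`, the targets are left unscaled, so an MAE loss can be read directly in units of refractive index.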

## 4. Saving the model

In [15]:
model.save('out/MODNet_refractive_index')

2021-02-24 14:37:16,879 - modnet - INFO - Saving model...
2021-02-24 14:37:16,894 - modnet - INFO - Saved model to out/MODNet_refractive_index(.json/.h5/.pkl)


## 5. Predicting on unseen data

See the "predicting_ref_index" notebook.