# Composition Based Feature Vector

Since most materials informatics models are trained on small amounts of data, we rely on feature engineering to inject domain knowledge into our materials representations. One of the simplest ways to do so is through the mighty Composition Based Feature Vector.

My research group created the `CBFV` package to make it easy to generate composition based feature vectors. To follow along with the in-class demo (https://github.com/sp8rks/MaterialsInformatics/blob/main/worked_examples/CBFV_example/CBFV_example.ipynb), do the following:
(1) open Miniconda
(2) activate your MatInformatics Python environment: `conda activate MatInformatics`
(3) install the CBFV package: `pip install CBFV` (read more at https://pypi.org/project/CBFV/)

Let's start by creating some dummy data

In [13]:
from CBFV import composition
import pandas as pd

data = [['Si1O2', 10], ['Al2O3', 15], ['Hf1C1Zr1', 14]]
#this next step is important!! The CBFV composition.generate_features() function 
#requires an input dataframe with a column named 'formula' and another column named 'target'
df = pd.DataFrame(data, columns=['formula', 'target'])

Now, let's do our simplest CBFV featurization and convert our data into a 'one hot encoding' vector

In [14]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='onehot')

Processing Input Data: 100%|██████████| 3/3 [00:00<00:00, 9131.29it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 3/3 [00:00<?, ?it/s]




	Creating Pandas Objects...


If we look at our input, the X variable, we'll see that the formulae strings are now converted to numerical values that are suitable for machine learning models.

For our first representation, the `avg` columns hold the *fractional encoding* of the elements. For example, SiO2 is 2/3 oxygen and 1/3 silicon, so we see 0.666667 in the `avg_8` column (atomic number 8, oxygen) and 0.333333 in the `avg_14` column (atomic number 14, silicon).

In [15]:
#TIP! Open up the X variable in the Data Wrangler extension to see all the columns since they are truncated below
print(X)


   avg_0  avg_1  avg_2  avg_3  avg_4  avg_5     avg_6  avg_7     avg_8  avg_9  \
0    0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0  0.666667    0.0   
1    0.0    0.0    0.0    0.0    0.0    0.0  0.000000    0.0  0.600000    0.0   
2    0.0    0.0    0.0    0.0    0.0    0.0  0.333333    0.0  0.000000    0.0   

   ...  mode_109  mode_110  mode_111  mode_112  mode_113  mode_114  mode_115  \
0  ...       0.0       0.0       0.0       0.0       0.0       0.0       0.0   
1  ...       0.0       0.0       0.0       0.0       0.0       0.0       0.0   
2  ...       0.0       0.0       0.0       0.0       0.0       0.0       0.0   

   mode_116  mode_117  mode_118  
0       0.0       0.0       0.0  
1       0.0       0.0       0.0  
2       0.0       0.0       0.0  

[3 rows x 714 columns]
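
To make the fractional encoding concrete, here is a minimal sketch that reproduces the fractions by hand using only the Python standard library. The helper `fractional_encoding` is a hypothetical name for illustration and is not part of CBFV; it assumes flat formulas with no parentheses.

```python
import re

def fractional_encoding(formula):
    """Return {element symbol: atomic fraction} for a flat formula string.

    Hypothetical helper for illustration only; CBFV does its own parsing.
    """
    # An element symbol is one capital letter plus an optional lowercase
    # letter, followed by an optional amount (a missing amount means 1).
    tokens = re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula)
    counts = {}
    for symbol, amount in tokens:
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    total = sum(counts.values())
    # Normalize so that the fractions of each formula sum to 1.
    return {symbol: n / total for symbol, n in counts.items()}

for formula in ["Si1O2", "Al2O3", "Hf1C1Zr1"]:
    print(formula, fractional_encoding(formula))
```

The oxygen fraction for `Si1O2` is 2/3 and for `Al2O3` is 3/5, which are the values that appear in the `avg_8` column of the output above.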


# Element property feature vectors
The one hot encoding is a very simple way to encode the formula. It carries no information about the actual chemistry beyond the formula itself. We know that other features should matter. For example, the melting point, ionic size, or number of valence electrons of the constituent elements should be useful in relating these materials to their properties.

Let's take a look at another featurization technique, the `magpie` feature vector, that encodes more chemical information beyond just one hot encoding.

Read more about `magpie` here in the original article https://doi.org/10.1038/npjcompumats.2016.28

Essentially, the feature vector is created by taking information from the individual elements and then combining the information from these individual elements based on their elemental ratio in the chemical formula. 
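
As a rough sketch of that idea, the snippet below computes a composition-weighted average of one elemental property (atomic number) for the three dummy compounds. It uses only the standard library; the names `ATOMIC_NUMBER` and `weighted_average` are made up for this illustration, and the sketch covers only the averaging step, not the other statistics the package produces.

```python
# Atomic numbers of the elements that appear in the dummy data.
ATOMIC_NUMBER = {"C": 6, "O": 8, "Al": 13, "Si": 14, "Zr": 40, "Hf": 72}

# Atomic fractions of each element in each compound.
COMPOSITIONS = {
    "Si1O2": {"Si": 1 / 3, "O": 2 / 3},
    "Al2O3": {"Al": 2 / 5, "O": 3 / 5},
    "Hf1C1Zr1": {"Hf": 1 / 3, "C": 1 / 3, "Zr": 1 / 3},
}

def weighted_average(fractions, elem_prop):
    """Average an elemental property, weighted by atomic fraction."""
    return sum(frac * elem_prop[symbol] for symbol, frac in fractions.items())

for formula, fractions in COMPOSITIONS.items():
    print(formula, round(weighted_average(fractions, ATOMIC_NUMBER), 6))
```

For `Si1O2` this is (14 + 2 × 8) / 3 = 10, which can be compared against the `avg_Number` column in the `magpie` output.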

In [16]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='magpie')
print(X)

Processing Input Data: 100%|██████████| 3/3 [00:00<?, ?it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 3/3 [00:00<?, ?it/s]

	Creating Pandas Objects...





   avg_Number  avg_MendeleevNumber  avg_AtomicWeight  avg_MeltingT  \
0   10.000000            84.000000         20.028100    598.866667   
1   10.000000            81.400000         20.392255    406.268000   
2   39.333333            55.333333         93.908233   2819.000000   

   avg_Column   avg_Row  avg_CovalentRadius  avg_Electronegativity  \
0   15.333333  2.333333                81.0               2.926667   
1   14.800000  2.400000                88.0               2.708000   
2    7.333333  4.333333               142.0               1.726667   

   avg_NsValence  avg_NpValence  ...  mode_NValence  mode_NsUnfilled  \
0            2.0       3.333333  ...            6.0              0.0   
1            2.0       2.800000  ...            6.0              0.0   
2            2.0       0.666667  ...            4.0              0.0   

   mode_NpUnfilled  mode_NdUnfilled  mode_NfUnfilled  mode_NUnfilled  \
0              2.0              0.0              0.0             2.0   
1    

There are several others too, including one of my favorites, `oliynyk`, which we named after Anton Oliynyk, a great chemist who put it together (https://hunter.cuny.edu/people/anton-oliynyk/). Another is `jarvis`, which came from the good folks at NIST. Read their article here https://doi.org/10.1038/s41524-020-00440-1

In [17]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='oliynyk')
print(X)

Processing Input Data: 100%|██████████| 3/3 [00:00<00:00, 838.41it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 3/3 [00:00<00:00, 417.61it/s]

	Creating Pandas Objects...





   avg_Atomic_Number  avg_Atomic_Weight  avg_Period  avg_group  avg_families  \
0          10.000000          20.028100    2.333333  15.333333      6.666667   
1          10.000000          20.392256    2.400000  14.800000      6.200000   
2          39.333333          93.908333    4.333333   7.333333      5.000000   

   avg_Metal  avg_Nonmetal  avg_Metalliod  avg_Mendeleev_Number  \
0   0.000000      1.000000            0.0             84.000000   
1   0.400000      0.600000            0.0             81.400000   
2   0.666667      0.333333            0.0             55.333333   

   avg_l_quantum_number  ...  mode_polarizability(A^3)  \
0              1.000000  ...                     0.793   
1              1.000000  ...                     0.793   
2              1.666667  ...                     1.800   

   mode_Melting_point_(K)  mode_Boiling_Point_(K)  mode_Density_(g/mL)  \
0                   54.75                   90.15              0.00143   
1                   54.75    

# Featurization based on scientific literature
There are also some really cool approaches for embedding domain knowledge. For example, `mat2vec` is a clever approach that uses a *natural language processing* tool known as word embeddings to create a feature vector based on scientific literature. You can read about it here https://doi.org/10.1038/s41586-019-1335-8

In [18]:
X, y, formulae, skipped = composition.generate_features(df, 
    elem_prop='mat2vec')
print(X)

Processing Input Data: 100%|██████████| 3/3 [00:00<?, ?it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 3/3 [00:00<?, ?it/s]

	Creating Pandas Objects...





      avg_0     avg_1     avg_2     avg_3     avg_4     avg_5     avg_6  \
0 -0.028195 -0.220953  0.103799  0.036430 -0.180534  0.168473  0.511174   
1 -0.051101 -0.165520  0.090990 -0.009840 -0.264100  0.131027  0.490254   
2 -0.153709 -0.147015  0.122145 -0.038134 -0.218932  0.098425  0.553805   

      avg_7     avg_8     avg_9  ...  mode_190  mode_191  mode_192  mode_193  \
0 -0.038618  0.203725 -0.196661  ...  0.383646 -0.112214 -0.140789 -0.297650   
1  0.051353  0.056146 -0.170191  ...  0.383646 -0.112214 -0.140789 -0.297650   
2 -0.026189  0.011437 -0.008558  ...  0.037482 -0.138211 -0.384747 -0.361625   

   mode_194  mode_195  mode_196  mode_197  mode_198  mode_199  
0 -0.016243  0.253866 -0.138298  0.224835 -0.092936 -0.036101  
1 -0.016243  0.253866 -0.138298  0.224835 -0.092936 -0.036101  
2  0.070045 -0.171340  0.141226  0.074487 -0.076700  0.051153  

[3 rows x 1200 columns]


Looking at these representations, which can be hundreds of columns wide, we see that a simple string like 'SiO2' has been turned into something rather complicated. These representations are less interpretable than a simple chemical formula, but they are now mathematical vectors that represent the materials with varying degrees of domain knowledge. In 2020 my group published a careful study that asked whether this domain knowledge is actually necessary or helpful in predicting materials properties. We essentially found that the domain knowledge does improve predictions, but as the amount of data increases this advantage slowly disappears.

You can read the article here https://doi.org/10.1007/s40192-020-00179-z

# Now you try it!
Generate a list of compounds you are interested in, look up their properties, and then featurize this data with your choice of feature set to create an X input and a y target label. Try adding a broken chemical formula that includes an abbreviation for an element that doesn't exist, and then see what you find in the `skipped` variable returned by the `composition.generate_features()` function.