# Question 1: Structural features
**1.1 cif-cn-featurizer**

Our friend Anton Oliynyk has recently released a new Python script, `cif-cn-featurizer`.

Here is their description: "A Python script designed to process CIF (Crystallographic Information File) files and extract various features from them. These features include interatomic distances, atomic environment information, and coordination numbers. The script can handle binary and ternary compounds."

Let's test it out and see how it works! To do so, we will need some CIFs. Right now the featurizer only works with binaries made up of certain elements, shown in this plot.

![allowed elements](https://github.com/sp8rks/MaterialsInformatics/blob/main/HW/HW2/cif-cn-featurizer-allowed-elements.png?raw=true)

**<font color='teal'>a)</font>** Download the `cif-cn-featurizer` files and run the script on the CIF files in the `HW\cn-featurizer\cifs` folder.

Note: in case you can't get it working, the csv folder already contains all the extracted features for these CIFs. Still, try to get the script working so you can use it in the future!
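
If you do fall back on the precomputed features, a minimal sketch for loading them might look like the following. The helper name `load_feature_tables` and the example folder path in the comment are hypothetical; adjust them to wherever the csv folder lives on your machine.

```python
from pathlib import Path

import pandas as pd


def load_feature_tables(csv_dir):
    """Read every .csv file in csv_dir into a dict keyed by file name (without extension)."""
    tables = {}
    for path in sorted(Path(csv_dir).glob("*.csv")):
        tables[path.stem] = pd.read_csv(path)
    return tables


# Example usage (the path is an assumption, not the actual course layout):
# tables = load_feature_tables("cn-featurizer/csv")
# for name, table in tables.items():
#     print(name, table.shape)
```

Printing each table's shape is a quick way to see which files hold one row per CIF and which hold one row per atomic site before you decide how to join them with the property data.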

In [1]:
# You can run this in your miniconda command prompt if you prefer



**1.2 Getting labeled data for the cifs**

**<font color='teal'>b)</font>** Now that you've got feature vectors in a series of .csv files, let's use them to build a model to predict a property. To get a property, let's search for a Materials Project entry using the CIF cards! If you've forgotten how, go back to the `legacy_MPRester_tutorial.ipynb` notebook, where we did an example. Once you have the Materials Project ID, run a query to extract a property such as the bulk modulus (`["elasticity"]["K_VRH"]`).

In [7]:
# Using the legacy version of the MPRester API, extract a list of material ids from the cif cards in the cifs folder

from pymatgen.ext.matproj import MPRester
import os
import pandas as pd

apiKeyPath = "/Users/stanleywessman/Downloads/oldApiKey.txt"

# Read API key from file
def readFile(filename):
    try:
        with open(filename, 'r') as f:
            return f.read().strip()
    except FileNotFoundError:
        print("File not found")
        return None

apiKey = readFile(apiKeyPath)

#set up the MPRester API
mpr = MPRester(apiKey)

# using the legacy MPRester API extract the bulk modulus from each of the cif files in the folder cn-featurizer/cifs

cifFolder = "/Users/stanleywessman/Downloads/MaterialsInformatics/HW/HW2/cn-featurizer/cifs"

# Build the full path of every .cif file in the folder
cifFiles = [
    os.path.join(cifFolder, f)
    for f in os.listdir(cifFolder)
    if f.endswith(".cif")
]


#extract the material ids from the cif files

materialIds = []

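for cifFile in cifFiles:
    try:
        # find_structure returns a list of matching material ids; keep the first match
        matID = mpr.find_structure(cifFile)[0]
        materialIds.append(matID)
    except Exception:
        # No match (empty list) or a failed lookup: skip this file
        continue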

print(materialIds)
        



['mp-22568', 'mp-20131', 'mp-1977', 'mp-2484', 'mp-20729', 'mp-30745', 'mp-980752', 'mp-11482', 'mp-376', 'mp-2751', 'mp-1911', 'mp-30866', 'mp-20369', 'mp-20903', 'mp-19977', 'mp-1080098', 'mp-633', 'mp-2465', 'mp-16513', 'mp-21197', 'mp-865411', 'mp-30787', 'mp-865411', 'mp-2391', 'mp-21430', 'mp-674', 'mp-481', 'mp-790', 'mp-20309', 'mp-2825', 'mp-2451', 'mp-2391', 'mp-2092', 'mp-1571', 'mp-801', 'mp-369', 'mp-718', 'mp-30634', 'mp-12553', 'mp-569196', 'mp-768', 'mp-2588', 'mp-16513', 'mp-567305', 'mp-674', 'mp-559', 'mp-2092', 'mp-959', 'mp-1571', 'mp-640095', 'mp-20469', 'mp-2333', 'mp-1451', 'mp-891', 'mp-20469', 'mp-636279', 'mp-1139', 'mp-357', 'mp-19977', 'mp-2134', 'mp-1451', 'mp-797', 'mp-2351', 'mp-718', 'mp-2391', 'mp-300', 'mp-1979', 'mp-1409', 'mp-21432', 'mp-2747', 'mp-21427', 'mp-20309', 'mp-2092', 'mp-2092', 'mp-21427', 'mp-865411', 'mp-801', 'mp-20469', 'mp-1549', 'mp-977', 'mp-2006', 'mp-1080756', 'mp-1082', 'mp-20258', 'mp-1232', 'mp-20920', 'mp-11506', 'mp-568823'
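
The list above contains repeated IDs (for example, `mp-865411` and `mp-2092` each appear more than once), presumably because several CIF files match the same Materials Project entry. If the repeats are left in, the same compound can end up in both the training and test sets. Below is a minimal sketch for dropping repeats while keeping the original order; `unique_in_order` is a hypothetical helper name, not part of pymatgen.

```python
def unique_in_order(items):
    """Drop repeated entries while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


# A short excerpt of the material ids printed above
sampleIds = ['mp-865411', 'mp-30787', 'mp-865411', 'mp-2391']
uniqueIds = unique_in_order(sampleIds)
print(uniqueIds)  # ['mp-865411', 'mp-30787', 'mp-2391']
```

If you later want to join the structural features back onto the property data, it may be more useful to store `(cifFile, matID)` pairs instead of bare IDs, so each feature row can still be traced to its CIF.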

In [9]:
# using the legacy api extract the bulk modulus for each material id

bulkModuli = []
formula = []

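for matID in materialIds:
    try:
        # "elasticity" is a dict that holds K_VRH; it is None when no elastic data exists
        data = mpr.query(criteria={"material_id": matID}, properties=["elasticity", "pretty_formula"])
        bulkModuli.append(data[0]["elasticity"]["K_VRH"])
        formula.append(data[0]["pretty_formula"])
    except Exception:
        # Skip entries with no match or no elasticity data
        continue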

# create a pandas dataframe with the formula and bulk moduli
df = pd.DataFrame({"formula": formula, "bulkModulus": bulkModuli})

| | formula | bulkModulus |
|---|---|---|
| 0 | V3Ga | 168.0 |
| 1 | YIn3 | 195.0 |
| 2 | NdSn3 | 56.0 |
| 3 | SmSn3 | 57.0 |
| 4 | LaIn3 | 179.0 |


**1.3 Comparing structural features to compositional features**

**<font color='teal'>c)</font>** Now that you've got structural features and you can get compositional features (use CBFV), let's compare them! Build a support vector regression (SVR) model with each feature set and determine which works better.
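
One way to keep such a comparison fair is to push every feature set through the same preprocessing, model, and cross-validation folds. The sketch below illustrates the idea on synthetic data only: `evaluate_svr` is a hypothetical helper, and the two random matrices merely stand in for real structural and compositional feature tables.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR


def evaluate_svr(X, y, n_splits=5, seed=0):
    """Return the mean cross-validated MAE of a scaled linear SVR."""
    # The pipeline refits the scaler inside each fold, so no test data leaks into scaling
    model = make_pipeline(StandardScaler(), SVR(kernel="linear", C=100))
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_absolute_error")
    return -scores.mean()


# Synthetic stand-ins (assumption): one informative feature set, one pure-noise set
rng = np.random.default_rng(0)
y = rng.normal(100.0, 30.0, size=60)
feature_sets = {
    "structural (synthetic)": np.column_stack([y + rng.normal(0.0, 5.0, size=60), rng.normal(size=60)]),
    "compositional (synthetic)": rng.normal(size=(60, 2)),
}

results = {name: evaluate_svr(X, y) for name, X in feature_sets.items()}
for name, mae in results.items():
    print(f"{name}: MAE = {mae:.1f}")
```

With only a few dozen compounds, averaging over several folds tends to give a steadier comparison than a single 80/20 split, where a handful of test points can swing the scores.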

In [31]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Normalizer
import numpy as np
from CBFV import composition

df.rename(columns={"bulkModulus": "target"}, inplace=True)

xMV, yMV, formulaMV, skippedMV = composition.generate_features(df, elem_prop='mat2vec')
xMag, yMag, formulaMag, skippedMag = composition.generate_features(df, elem_prop='magpie')
xOli, yOli, formulaOli, skippedOli = composition.generate_features(df, elem_prop='oliynyk')

scaler = StandardScaler()
normalizer = Normalizer()

# split the data into training and test sets
dfTrain = df.sample(frac=0.8, random_state=0)
dfTest = df.drop(dfTrain.index)

# Use the dfTrain and dfTest indices to split each feature set's X and y the same way
# (assumes the CBFV outputs keep df's row order and index, i.e. no formulas were skipped)
xMVtrain = xMV.drop(dfTest.index)
xMVtest = xMV.drop(dfTrain.index)
yMVtrain = yMV.drop(dfTest.index)
yMVtest = yMV.drop(dfTrain.index)

xMagtrain = xMag.drop(dfTest.index)
xMagtest = xMag.drop(dfTrain.index)
yMagtrain = yMag.drop(dfTest.index)
yMagtest = yMag.drop(dfTrain.index)

xOlitrain = xOli.drop(dfTest.index)
xOlitest = xOli.drop(dfTrain.index)
yOlitrain = yOli.drop(dfTest.index)
yOlitest = yOli.drop(dfTrain.index)

# standardize and normalize the data
xMVtrain = scaler.fit_transform(xMVtrain)
xMVtest = scaler.transform(xMVtest)
xMVtrain = normalizer.fit_transform(xMVtrain)
xMVtest = normalizer.transform(xMVtest)

xMagtrain = scaler.fit_transform(xMagtrain)
xMagtest = scaler.transform(xMagtest)
xMagtrain = normalizer.fit_transform(xMagtrain)
xMagtest = normalizer.transform(xMagtest)

xOlitrain = scaler.fit_transform(xOlitrain)
xOlitest = scaler.transform(xOlitest)
xOlitrain = normalizer.fit_transform(xOlitrain)
xOlitest = normalizer.transform(xOlitest)

# create the SVR model for each feature set, train it, and test it, then print the results
modelMV = SVR(kernel='linear', C=100)
modelMV.fit(xMVtrain, yMVtrain)
yMVpred = modelMV.predict(xMVtest)
print("R2 score for mat2vec: ", r2_score(yMVtest, yMVpred))
print("Mean absolute error for mat2vec: ", mean_absolute_error(yMVtest, yMVpred))
print("Mean squared error for mat2vec: ", mean_squared_error(yMVtest, yMVpred))

modelMag = SVR(kernel='linear', C=100)
modelMag.fit(xMagtrain, yMagtrain)
yMagpred = modelMag.predict(xMagtest)
print("R2 score for magpie: ", r2_score(yMagtest, yMagpred))
print("Mean absolute error for magpie: ", mean_absolute_error(yMagtest, yMagpred))
print("Mean squared error for magpie: ", mean_squared_error(yMagtest, yMagpred))

modelOli = SVR(kernel='linear', C=100)
modelOli.fit(xOlitrain, yOlitrain)
yOlipred = modelOli.predict(xOlitest)
print("R2 score for oliynyk: ", r2_score(yOlitest, yOlipred))
print("Mean absolute error for oliynyk: ", mean_absolute_error(yOlitest, yOlipred))
print("Mean squared error for oliynyk: ", mean_squared_error(yOlitest, yOlipred))





Processing Input Data: 100%|██████████| 74/74 [00:00<00:00, 40178.45it/s]
	Featurizing Compositions...
Assigning Features...: 100%|██████████| 74/74 [00:00<00:00, 8932.01it/s]
	Creating Pandas Objects...
Processing Input Data: 100%|██████████| 74/74 [00:00<00:00, 62437.84it/s]
	Featurizing Compositions...
Assigning Features...: 100%|██████████| 74/74 [00:00<00:00, 17728.82it/s]
	Creating Pandas Objects...
Processing Input Data: 100%|██████████| 74/74 [00:00<00:00, 60644.49it/s]
	Featurizing Compositions...
Assigning Features...: 100%|██████████| 74/74 [00:00<00:00, 18541.13it/s]
	Creating Pandas Objects...
R2 score for mat2vec:  0.6973978922963262
Mean absolute error for mat2vec:  20.005736798263275
Mean squared error for mat2vec:  1781.8341816127736
R2 score for magpie:  0.5764099406709824
Mean absolute error for magpie:  22.67196463845006
Mean squared error for magpie:  2494.256409618072
R2 score for oliynyk:  0.6772645361496904
Mean absolute error for oliynyk:  19.514916664757273
Mean squared error for oliynyk:  1900.3868990571273
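
Mean squared error is in squared units (GPa² here, since Materials Project reports K_VRH in GPa), so it cannot be read directly against the MAE. Taking the square root gives an RMSE on the same scale as the target. A small sketch using the MSE values printed above:

```python
import math

# Mean squared errors copied from the output above
mse = {
    "mat2vec": 1781.8341816127736,
    "magpie": 2494.256409618072,
    "oliynyk": 1900.3868990571273,
}

# RMSE has the same units as the target, so it can be compared with MAE
rmse = {name: math.sqrt(value) for name, value in mse.items()}
for name, value in rmse.items():
    print(f"RMSE for {name}: {value:.2f}")
```

The RMSE values come out at roughly 42, 50, and 44, about twice the corresponding MAE of around 20. A gap that large usually suggests that a few test compounds with large errors dominate the squared-error metrics.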



