<a href="https://colab.research.google.com/github/vinayak2019/chemistry_python_intermediate/blob/main/Introduction_pymatgen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Please complete the survey by the end of the workshop. Your answers will help improve the content of future workshops. The link is https://forms.gle/AohzLSuPgnJ7kfX79

# Installing pymatgen and matminer

We will install pymatgen and matminer with pip. (Alternatively, you can create a conda environment and install pymatgen there.) Run the next cell to install both packages. When the installation finishes, you will be asked to restart the runtime; click "Restart runtime". Nothing will visibly change after the restart. Then continue with the cells below.

In [None]:
!pip install pymatgen==2020.12.3
!pip install matminer

After restarting the runtime, run the following cell to check that the installation completed. If no error is displayed, the installation was successful.

In [None]:
from pymatgen.core.structure import Structure
from pymatgen.core.surface import SlabGenerator
from matminer.datasets import load_dataset
from sklearn import metrics
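# explicit module paths, which do not depend on pymatgen's top-level imports
from pymatgen.core.periodic_table import Element
from pymatgen.core.lattice import Lattice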
from matminer.featurizers.conversions import StrToComposition
from matminer.featurizers import composition as cf
from matminer.featurizers.base import MultipleFeaturizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
import seaborn
import matplotlib.pyplot as plt

pymatgen can handle both single molecules (the Molecule class) and periodic structures (the Structure class). Here we will discuss only periodic structures.

We will create the periodic structure object from a CIF file.

In [None]:
# download the cif file
!wget https://raw.githubusercontent.com/vinayak2019/chemistry_python_intermediate/main/nacl.cif

nacl = Structure.from_file("nacl.cif")
nacl

Several properties, such as the lattice dimensions and the composition, can be obtained from the structure object. Let's replace Na with K and generate a 2x1x3 supercell.

In [None]:
# replacing Na with K
nacl.replace_species({Element("Na"):Element("K")})
print("The new structure with K is \n",nacl)

In [None]:
# generating a 2x1x3 supercell (this modifies the structure in place)
nacl.make_supercell([2, 1, 3])
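To see what a 2x1x3 supercell means, here is a plain-Python sketch of the bookkeeping for a diagonal scaling. The helper `expand_supercell` and the two-site basis are hypothetical and only illustrate the idea; this is not pymatgen's implementation, and it ignores non-diagonal scaling matrices.

```python
from itertools import product


def expand_supercell(frac_coords, scaling):
    """Repeat fractional coordinates along a, b, c and rescale them
    so they are expressed in the new, larger cell."""
    na, nb, nc = scaling
    new_coords = []
    for x, y, z in frac_coords:
        for i, j, k in product(range(na), range(nb), range(nc)):
            new_coords.append(((x + i) / na, (y + j) / nb, (z + k) / nc))
    return new_coords


# Illustrative two-site basis (fractional coordinates)
basis = [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)]
supercell = expand_supercell(basis, (2, 1, 3))

# The number of sites grows by 2 * 1 * 3 = 6
print(len(basis), "sites ->", len(supercell), "sites")
```

The lattice vectors grow by the same factors, so the a and c lengths become two and three times longer while b is unchanged.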


# **Try it yourself**

Generate a cubic cell for CsCl (the body-centered arrangement, with Cs at the center of the cell). Then generate a 3x2x1 supercell. Finally, replace any two Cl atoms with Br atoms.

Hint: The fractional coordinates are (0.5, 0.5, 0.5) for Cs and (0, 0, 0) for Cl.

In [None]:
# YOUR CODE HERE

The hierarchy in pymatgen is Element --> Site --> Structure. In the following cells we retrieve both the site and element objects.

In [None]:
# A Structure object holds a list of sites
nacl.sites

In [None]:
# Now the Element
nacl.sites[0].specie

# **Try it yourself**

Explore five methods or attributes of the Site and Element objects.

In [None]:
# YOUR CODE HERE

# Machine learning

We will use a dataset of experimental formation energies (enthalpies) for inorganic materials and predict the formation energy of a material from its composition. We will use matminer to generate the input features.


In [None]:
# loading the dataset
df = load_dataset("expt_formation_enthalpy")
df

In [None]:
# Let's get the composition and the target value - e_form expt.
# dropna removes entries with NaN
df = df[["formula","e_form expt"]].dropna()
df

## Generating features

We have the formula representing each crystal. Let us convert this string into something that our random forest model can handle easily. We will first convert the formula to a composition object and then generate features with matminer.

In [None]:
# converting the formula to composition object
df = StrToComposition(target_col_id='composition').featurize_dataframe(df, 'formula')
df.head()

In [None]:
# creating the featurizer that generates the features
# We will use composition-based features and get the names of the features
feature_calculators = MultipleFeaturizer([cf.Stoichiometry(), cf.ElementProperty.from_preset("magpie"),
                                          cf.ValenceOrbital(props=['avg']), cf.IonProperty(fast=True)])

feature_labels = feature_calculators.feature_labels()
print("Number of features is ",len(feature_labels))

In [None]:
# creating the features
df = feature_calculators.featurize_dataframe(df, col_id='composition')
df.head()
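Conceptually, a composition-based feature turns a formula into a number by combining elemental properties with the element fractions. The plain-Python sketch below computes a fraction-weighted mean and a range of the atomic number for one formula. The helper names and the tiny lookup table are hypothetical, and this is only an illustration of the idea, not matminer's code.

```python
def weighted_mean(amounts, prop):
    """Fraction-weighted mean of an elemental property."""
    total = sum(amounts.values())
    return sum(n / total * prop[el] for el, n in amounts.items())


def prop_range(amounts, prop):
    """Difference between the largest and smallest property value present."""
    values = [prop[el] for el in amounts]
    return max(values) - min(values)


# Assumption: a tiny hand-written lookup table, for illustration only
ATOMIC_NUMBER = {"Fe": 26, "O": 8}
fe2o3 = {"Fe": 2, "O": 3}

mean_z = weighted_mean(fe2o3, ATOMIC_NUMBER)
range_z = prop_range(fe2o3, ATOMIC_NUMBER)
print("mean atomic number:", round(mean_z, 3))
print("range of atomic number:", range_z)
```

Applying such statistics to many elemental properties gives one fixed-length numeric vector per formula, whatever the number of elements in it.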

Now that we have the data for training a model, let's get it into the right format and split it into training and test sets.

In [None]:
X = df[feature_labels].values.tolist()
y = df["e_form expt"].values.tolist()

# splitting the data: 20% is held out as the test set
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2,random_state=42)
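To check what the split does, here is a small sketch on toy data (the `X_demo` and `y_demo` lists are made up for the example). It shows that `test_size=0.2` holds out 20% of the rows and that each feature row stays paired with its target.

```python
from sklearn.model_selection import train_test_split

# Toy data: 10 samples, 2 features each; the target equals the first feature
X_demo = [[i, i * i] for i in range(10)]
y_demo = [float(i) for i in range(10)]

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print("training samples:", len(X_tr))
print("test samples:", len(X_te))

# Rows and targets are shuffled together, so the pairing is preserved
print(all(row[0] == target for row, target in zip(X_te, y_te)))
```

Fixing `random_state` makes the shuffle reproducible, so the same rows end up in the test set on every run.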

In [None]:
# creating the random forest model and training it
model = RandomForestRegressor(random_state=42) # initialize the model
model.fit(X_train, y_train) # train the model
y_predict = model.predict(X_test) # get prediction on the test set

In [None]:
# evaluation metrics: sklearn expects (y_true, y_pred); the order matters for R2
print("R2 score is ",metrics.r2_score(y_test,y_predict))
print("mean absolute error is ",metrics.mean_absolute_error(y_test,y_predict))
print("mean_squared_error is ",metrics.mean_squared_error(y_test,y_predict))
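scikit-learn's metric functions take the reference values first and the predictions second. For the mean absolute and mean squared errors the order does not change the result, but R² is not symmetric. The sketch below re-implements the textbook R² formula in plain Python to show this; the helper `r2` and the toy numbers are illustrative and are not scikit-learn's code.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy reference values and predictions
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 3.0, 3.5]

r2_correct = r2(y_true, y_pred)   # reference values first
r2_swapped = r2(y_pred, y_true)   # arguments swapped

print("true values first:", round(r2_correct, 3))
print("arguments swapped:", round(r2_swapped, 3))
```

The two numbers differ because the denominator is the spread of whichever list is passed first, so the reference values should always be the first argument.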

In [None]:
# plotting the prediction
import numpy as np
x = np.arange(-1.2,0.2,0.2)

plt.scatter(y_test,y_predict)
plt.plot(x,x,color="red")
plt.xlabel("True values")
plt.ylabel("Predicted values")

# **Try it yourself**

Explore other featurizers available in matminer and retrain the model to see whether the results improve.

In [None]:
# YOUR CODE HERE