# Question 2: Training and using materials models
**2.1 data cleaning and splitting**

In a materials informatics workflow you need to find and clean data, featurize the data, train models, and use those models for some task. A few years ago, we put together a series of notebooks that describe this process for an example in which we train on heat capacity data as a function of temperature and composition, then use the model to predict heat capacity as a function of temperature for new materials. The best practices notebooks are a great starting point for you and can be found at `https://github.com/anthony-wang/BestPractices`. I'd like you to go through a similar exercise to the best practices notebooks, but with a few changes.

**<font color='teal'>a)</font>** First, you'll notice that the original notebooks used `pandas-profiling`, but this has been deprecated and replaced by `ydata-profiling`. Get `ydata-profiling` working and then use it to inspect your data.

In [55]:
# look at cp_data_demo.csv using ydata-profiling

import pandas as pd
from ydata_profiling import ProfileReport

# Load the data
data = pd.read_csv('cp_data_demo.csv')

# Create a ProfileReport
profile = ProfileReport(data, title='Temperature Data Profiling Report')
profile.to_notebook_iframe()





In [56]:
# remove duplicates and empty cells from the dataframe
data = data.drop_duplicates()
data = data.dropna()

profile = ProfileReport(data, title='Temperature Data Profiling Report')
profile.to_notebook_iframe()
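
The profiling report is interactive, so it can help to print the counts it is expected to show. The following is a minimal sketch, assuming a small made-up frame in place of `cp_data_demo.csv` (the column names and values are illustrative only). It counts the duplicate rows and missing cells that `drop_duplicates()` and `dropna()` remove.

```python
import pandas as pd

# Toy frame standing in for cp_data_demo.csv; names and values are made up.
toy = pd.DataFrame({
    "formula": ["Al2O3", "Al2O3", "Al2O3", "ZrO2", None],
    "temperature": [300.0, 300.0, 400.0, 300.0, 500.0],
    "target": [79.0, 79.0, 96.1, None, 60.0],
})

n_duplicates = int(toy.duplicated().sum())  # rows identical to an earlier row
n_missing = toy.isna().sum()                # missing cells per column

# Same cleaning steps as the notebook cell
cleaned = toy.drop_duplicates().dropna()

print(f"duplicate rows: {n_duplicates}")
print(n_missing)
print(f"rows before: {len(toy)}, rows after: {len(cleaned)}")
```

Comparing the row counts before and after cleaning is a quick check that the second profiling report describes the data you intend to train on.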



**<font color='teal'>b)</font>** Second, in the data-splitting notebook, you'll see that we came up with an elaborate way to ensure that, when we split the data, all the rows corresponding to a given formula went to exactly one of train, val, or test and were never randomly split across these groups. We were silly and didn't know about `GroupKFold` in the scikit-learn library (`https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GroupKFold.html`). Redo the splitting process using this much simpler tool.
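
To make the grouping behaviour concrete, here is a minimal sketch on toy data. The four formulas and the arrays `X_toy` and `y_toy` are made up for illustration and are not taken from the dataset. It checks that no formula appears in both the training and the test indices of any fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: three rows (think: three temperatures) for each of four formulas.
formulas = np.repeat(["Al2O3", "MgO", "SiO2", "TiO2"], 3)
X_toy = np.arange(len(formulas), dtype=float).reshape(-1, 1)
y_toy = np.zeros(len(formulas))

group_kfold = GroupKFold(n_splits=4)

shared = []  # formulas found on both sides of a split (should stay empty)
for fold, (train_idx, test_idx) in enumerate(
    group_kfold.split(X_toy, y_toy, groups=formulas)
):
    train_groups = set(formulas[train_idx])
    test_groups = set(formulas[test_idx])
    shared.append(train_groups & test_groups)
    print(f"fold {fold}: test formulas = {sorted(str(g) for g in test_groups)}")
```

The same `groups` array has to be passed to anything that consumes the splitter, such as `cross_val_score` or `GridSearchCV.fit`.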


In [57]:
from CBFV import composition
from sklearn.model_selection import GroupKFold
from sklearn.preprocessing import Normalizer, StandardScaler

# Reload and clean the data, then rename the columns to the names CBFV expects
data = pd.read_csv('cp_data_demo.csv')
data = data.drop_duplicates().dropna()
data = data.rename(columns={
    'FORMULA': 'formula',
    'CONDITION: Temperature (K)': 'temperature',
    'PROPERTY: Heat Capacity (J/mol K)': 'target',
})

# Featurize the compositions; extend_features=True keeps temperature as a feature
X, y, formula, skipped = composition.generate_features(data, elem_prop='oliynyk', extend_features=True)

# Scale, then normalize, the features (both transformers are reused for the ZrN prediction)
scaler = StandardScaler()
normalizer = Normalizer()
X = scaler.fit_transform(X)
X = normalizer.fit_transform(X)

# Group rows by formula so every row of a compound lands in the same fold;
# use the formulae returned by CBFV so the groups stay aligned with X and y
groups = formula

group_kfold = GroupKFold(n_splits=10)

Processing Input Data: 100%|██████████| 4553/4553 [00:00<00:00, 6434.69it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 4553/4553 [00:00<00:00, 26588.88it/s]


	Creating Pandas Objects...


**2.2 model training and hyperparameter tuning**

**<font color='teal'>c)</font>** Next, when we built our classic models, we never performed hyperparameter tuning! We just used them with default parameters. I'd like you to build two models and perform hyperparameter tuning on them. One model should be either `Ridge` or `Lasso` and the other should be `XGBoost`. Compare performance metrics including training time. 

In [58]:
# import Ridge from scikit-learn and XGBRegressor from xgboost
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from xgboost import XGBRegressor

# baseline: Ridge with default hyperparameters, scored with grouped 10-fold CV
score = cross_val_score(Ridge(), X, y, groups=groups, cv=group_kfold, scoring='r2')
print("Ridge without hyperparameter tuning:")
print('R2 score: ', end='')
print(score.mean())
print("")

# tune the Ridge regularization strength alpha over a grid from 0.1 to 100
param_grid = {'alpha': [0.1, 1, 5, 10, 100]}
ridgeGrid = GridSearchCV(Ridge(), param_grid, cv=group_kfold, scoring='r2')
ridgeGrid.fit(X, y, groups=groups)
print('Best alpha: ', end='')
print(ridgeGrid.best_params_['alpha'])
print("")
score = cross_val_score(ridgeGrid.best_estimator_, X, y, groups=groups, cv=group_kfold, scoring='r2')
print('Ridge with hyperparameter tuning:')
print('R2 score: ', end='')
print(score.mean())
print("")


Ridge without hyperparameter tuning:
R2 score: 0.4140157410576199

Best alpha: 5

Ridge with hyperparameter tuning:
R2 score: 0.4393280634449204
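
The prompt also asks for training time, which the cell above does not record. Below is a minimal sketch of one way to capture it, assuming synthetic data (`X_demo`, `y_demo`, `groups_demo` are made up) rather than the featurized heat capacity data. It additionally places the scaler inside a `Pipeline`, so that it is refit on each training fold rather than on the full dataset; this is a design choice of the sketch, not what the notebook does.

```python
import time

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the featurized data: 40 groups of 5 rows each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))
y_demo = X_demo @ rng.normal(size=8) + 0.1 * rng.normal(size=200)
groups_demo = np.repeat(np.arange(40), 5)

# The scaler lives inside the pipeline, so it is fit on training folds only.
pipe = make_pipeline(StandardScaler(), Ridge())
search = GridSearchCV(
    pipe,
    param_grid={"ridge__alpha": [0.1, 1, 5, 10, 100]},
    cv=GroupKFold(n_splits=5),
    scoring="r2",
)

start = time.perf_counter()
search.fit(X_demo, y_demo, groups=groups_demo)
search_seconds = time.perf_counter() - start

print("best alpha:", search.best_params_["ridge__alpha"])
print("best mean CV R2:", round(search.best_score_, 3))
print("whole search (s):", round(search_seconds, 3))
print("refit of best model (s):", round(search.refit_time_, 4))
```

`search.cv_results_["mean_fit_time"]` also holds the average fit time for each candidate, which gives a per-setting comparison between models.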



In [59]:
# create an XGBoost model 
score = cross_val_score(XGBRegressor(), X, y, groups=groups, cv=group_kfold, scoring='r2')
XGBmodel = XGBRegressor()
XGBmodel.fit(X, y)
print("XGBoost without hyperparameter tuning:")
print('R2 score: ', end='')
print(score.mean())
print("")

XGBoost without hyperparameter tuning:
R2 score: 0.3431932815208098



In [36]:
# perform hyperparameter tuning on the XGBoost model, tuning the learning rate, max depth, and number of estimators
param_grid = {'learning_rate': [0.01, 0.1, 0.5], 'max_depth': [1, 3, 5], 'n_estimators': [100, 500, 1000]}
XGBgrid = GridSearchCV(XGBRegressor(), param_grid, cv=group_kfold)
XGBgrid.fit(X, y, groups=groups)
print(XGBgrid.best_estimator_)
score = cross_val_score(XGBgrid.best_estimator_, X, y, groups=groups, cv=group_kfold, scoring='neg_mean_squared_error')
print('XGBoost with hyperparameter tuning:')
print('RMSE score: ', end='')
print(np.sqrt(-score).mean())

KeyboardInterrupt: 
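
The grid above has 3 × 3 × 3 = 27 settings, and with 10 folds that is 270 model fits plus a final refit, some with 1000 trees, which may be why the cell was interrupted. One cheaper option is a randomized search over the same parameter names. The sketch below makes two assumptions: it uses synthetic data, and it swaps in scikit-learn's `GradientBoostingRegressor` as a stand-in so that it runs without `xgboost` installed. `XGBRegressor` accepts the same three parameter names, but the tree counts here are deliberately smaller than in the notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GroupKFold, RandomizedSearchCV

# Synthetic stand-in data: 30 groups of 4 rows each.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 5))
y_demo = X_demo[:, 0] ** 2 + X_demo[:, 1] + 0.1 * rng.normal(size=120)
groups_demo = np.repeat(np.arange(30), 4)

param_distributions = {
    "learning_rate": [0.01, 0.1, 0.5],
    "max_depth": [1, 3, 5],
    "n_estimators": [25, 50, 100],  # kept small so the sketch runs quickly
}

search = RandomizedSearchCV(
    GradientBoostingRegressor(random_state=0),  # stand-in for XGBRegressor
    param_distributions=param_distributions,
    n_iter=6,  # sample 6 settings instead of trying all 27
    cv=GroupKFold(n_splits=3),
    scoring="neg_root_mean_squared_error",
    random_state=0,
)
search.fit(X_demo, y_demo, groups=groups_demo)

n_settings = len(search.cv_results_["params"])
n_fits = n_settings * search.n_splits_
print("settings tried:", n_settings)
print("model fits during search:", n_fits)
print("best params:", search.best_params_)
print("best CV RMSE:", round(-search.best_score_, 3))
```

Reducing `n_splits` or the largest `n_estimators` value trades some reliability for runtime, so the tuned score should be read with that in mind.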

**2.3 using your model to make predictions**

**<font color='teal'>d)</font>** Finally, pick the best model from **2.2** and use it to predict the heat capacity of ZrN from 1200 K to 2000 K. See how it compares to experiment.

![ZrN Cp](https://www.researchgate.net/publication/335403917/figure/fig2/AS:796198449467394@1566839911338/High-temperature-heat-capacity-Cp-of-zirconium-and-hafnium-carbides-and-carbonitrides.png)


In [52]:
# use the tuned Ridge model from 2.2, which had the best cross-validated R2 of the models that finished training,
# to predict the heat capacity of ZrN from 1200 K to 2000 K
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from CBFV import composition

# build a DataFrame with "formula" set to ZrN, temperatures from 1200 K to 2000 K in 1 K steps, and a placeholder "target" of 0
zrn = pd.DataFrame({'formula': ['ZrN'] * 801, 'temperature': np.linspace(1200, 2000, 801), 'target': [0] * 801})
# generate the features for ZrN, using new names so the training X, y, and formula are not overwritten
X_zrn, _, _, _ = composition.generate_features(zrn, elem_prop='oliynyk', extend_features=True)
# apply the scaler and normalizer that were fitted on the training data
X_zrn = scaler.transform(X_zrn)
X_zrn = normalizer.transform(X_zrn)

# predict with the Ridge model that uses the best alpha found in 2.2
y_zrn = ridgeGrid.best_estimator_.predict(X_zrn)
df = pd.DataFrame({'temperature': zrn['temperature'], 'target': y_zrn})

Processing Input Data: 100%|██████████| 801/801 [00:00<00:00, 40148.63it/s]


	Featurizing Compositions...


Assigning Features...: 100%|██████████| 801/801 [00:00<00:00, 21134.58it/s]

	Creating Pandas Objects...





|     | temperature | target    |
|-----|-------------|-----------|
| 0   | 300.000     | 69.913743 |
| 1   | 301.625     | 69.931386 |
| 2   | 303.250     | 69.949030 |
| 3   | 304.875     | 69.966675 |
| 4   | 306.500     | 69.984321 |
| ... | ...         | ...       |
| 796 | 1593.500    | 84.155182 |
| 797 | 1595.125    | 84.173214 |
| 798 | 1596.750    | 84.191247 |
| 799 | 1598.375    | 84.209279 |


In [53]:
# convert the target column from J/mol K to J/kg K (if applicable),
# using the molar mass of ZrN: 91.224 + 14.007 = 105.231 g/mol
df['target'] = df['target'] * (1000/105.231)
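
The conversion factor depends on the compound, so it is worth deriving it from atomic masses instead of typing a number in. This is a small sketch, assuming the model output is per mole of ZrN formula units; `molar_to_specific` is a hypothetical helper name, and the 50 J/(mol K) input is an arbitrary example value, not a model prediction.

```python
# Standard atomic masses in g/mol.
ATOMIC_MASS = {"Zr": 91.224, "N": 14.007}


def molar_to_specific(cp_molar, molar_mass_g_per_mol):
    """Convert a heat capacity from J/(mol K) to J/(kg K)."""
    return cp_molar * 1000.0 / molar_mass_g_per_mol


# One Zr and one N per formula unit
molar_mass_zrn = ATOMIC_MASS["Zr"] + ATOMIC_MASS["N"]

# Arbitrary example value of 50 J/(mol K)
cp_specific = molar_to_specific(50.0, molar_mass_zrn)
print(f"M(ZrN) = {molar_mass_zrn:.3f} g/mol, cp = {cp_specific:.1f} J/(kg K)")
```

Whether the conversion is needed at all depends on the units of the experimental figure being compared against.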

In [54]:
# plot the predicted heat capacity of ZrN from 1200K to 2000K


plt.figure()
plt.plot(df['temperature'], df['target'])
plt.xlabel('Temperature (K)')
plt.ylabel('Heat Capacity (J/kg K)')
plt.title('Predicted Heat Capacity of ZrN')

# save the plot to a png file
plt.savefig('ZrN_heat_capacity.png')

