<a href="https://colab.research.google.com/github/quothbonney/c06book/blob/master/psets/ps1-nonbio/pset1_template.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Problem Set 1 (Perovskites) </center>
<center> Spring 2025 </center>
<center> 3.C01/3.C51, 7.C01/7.C51, 10.C01/10.C51, 20.C01/20.C51 </center>
<center> Due: Monday, April 7, 2025 at 3:00 PM ET. </center>

<b>Name:</b>

<b>Kerberos ID:</b>

### Instructions:

Put your code in the code blocks flagged with `############# Code ##########`.

Numerical answers yielded from running the code should be included in an Answer Block (see next cell).

We have provided print statements where numerical answers are expected.

Your answer should be contained in a variable which you defined either in the Answer Block or the Code Block.

When a qualitative answer is expected, place your comments in Markdown/Text cells; when an answer is asked for within a Code block, you can write it as a code comment by placing a # before your answer.

Your Answer Block should look like the following:

In [None]:
########## Answer ############

ans = 2
print("My answer is: {}.".format(ans))

# My regressor over-fitted the training data, I need to add regularization

########## Answer ############

My answer is: 2.


## Imports

In [None]:
# import packages
import numpy as np
import sklearn
import pandas as pd
from sklearn.model_selection import train_test_split

# models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from torch.utils.data import Dataset, DataLoader
from sklearn import preprocessing
from torch import nn
import torch.nn.functional as F

# metrics
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay, roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
import matplotlib.pyplot as plt

import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter("ignore", category=ConvergenceWarning)
import torch
from tqdm import tqdm


# plotting style, you can choose your own parameters
import matplotlib

matplotlib.rcParams.update({'font.size': 15})
matplotlib.rc('lines', linewidth=3, color='g')
matplotlib.rcParams['axes.linewidth'] = 2.0
matplotlib.rcParams["xtick.major.size"] = 6
matplotlib.rcParams["ytick.major.size"] = 6
matplotlib.rcParams["ytick.major.width"] = 2
matplotlib.rcParams["xtick.major.width"] = 2
matplotlib.rcParams['text.usetex'] = False

In [None]:
# A helper function for students to produce plots
def plot_clf(model, X, y, title):

    '''
        A function to plot confusion matrix and ROC curve

        Args:
            model(classifier object): model object (e.g. RandomForestClassifier, LogisticRegression)
            X(np.array): feature set
            y(np.array): label set
            title(str): plot name

        Example Usage:
            plot_clf(model, X_test, y_test, "test")
    '''

    fig, [ax_roc, ax_conf] = plt.subplots(1, 2, figsize=(12, 6))
    fig.tight_layout()

    RocCurveDisplay.from_estimator(model, X, y, ax=ax_roc)
    ConfusionMatrixDisplay.from_estimator(model, X, y, ax=ax_conf)

    ax_roc.set_title('{} ROC'.format(title))
    ax_conf.set_title('{} Confusion Matrix'.format(title))

    plt.show()

## Grading guideline

- Didn't answer the question: 0%
- Showed some attempts, but clearly didn't try enough: 25%
- Showed solid attempts (showed code) but does not answer the question directly: 50%
- Showed solid attempts and get the question wrong: 60-80%
- Showed solid attempts with some small mistakes: 80-90%
- Showed code and answered the questions correctly: 100%

# Overview:
In this PSET, you will:

* Learn the basics of processing your data and training a machine learning model with PyTorch:
    * Processing data, including exploring OHE vs. other featurization techniques
    * Formatting your data into a `Dataset` object and wrapping in a `DataLoader` instance
    * Implementing a Multi-Layer Perceptron in `PyTorch`
    * Setting up a training and testing loop, and how to do evaluation
    * Using a GPU to accelerate training!
    * Finding the best hyperparameters


* Learn how to build some simple architectures for **classification** and **regression**:
    * Train a logistic regression model with `scikit-learn` to _classify_ breast cancers based on metabolite abundance
    * Train a random forest classifier on the same data
    * Train an MLP using `PyTorch` on the same data
    * Train an MLP with `PyTorch` to _regress_ perovskite hull energies from their compositions
    * Apply regularization techniques to avoid overfitting (L1 and L2)
    * Apply physical descriptor-based encoding to improve training performance


## Download required data

In [None]:
# for 1st set of tasks: binary classification
# TODO: update the datapaths
! wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps1-nonbio/data/breastcancer_X.csv
! wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps1-nonbio/data/breastcancer_y.csv

# for 2nd set of tasks: regression of perovskite hull energies

! wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps1-nonbio/data/data_perov/mendeleev.csv
! wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps1-nonbio/data/data_perov/perov_train.csv
! wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps1-nonbio/data/data_perov/perov_val.csv
! wget https://raw.githubusercontent.com/coleygroup/ML4MolEng/main/psets/ps1-nonbio/data/data_perov/elements.npy

# Problem 1: Breast cancer classification from metabolite data

## 1.1 (5 points) Load and inspect the raw data

We have provided the code to load the dataset. Take a moment to understand it, then briefly explain what each line does by replacing the `# Comments here` placeholders below with short comments.

You will have to do it by yourself again in Problem 2.

In [None]:
p1_X = pd.read_csv("./breastcancer_X.csv", header='infer', index_col=0) # Comments here
p1_y = pd.read_csv("./breastcancer_y.csv", header='infer', index_col=0) # Comments here

metabolite_name = p1_X.columns.tolist() # Comments here

p1_X = p1_X.values # Comments here
p1_y = p1_y.values # Comments here

Report how many examples are in this dataset and the number of features for each data point.

In [None]:
########## Answer ############

print("There are {} samples.".format(N_samples))
print("There are {} features per sample.".format(N_features))

########## Answer ############

## 1.2 (5 points) Generate train/test splits.
Generate and print the shapes of your four variables, `X_train`, `X_test`, `y_train`, and `y_test`, and ensure that the dimensions match your expectations.

In [None]:
########### Code #############

########### Code #############

In [None]:
########## Answer ############

X_train_shape = X_train.shape
y_train_shape = y_train.shape
X_test_shape = X_test.shape
y_test_shape = y_test.shape


print("X_train shape: {}".format(X_train_shape))
print("y_train shape: {}".format(y_train_shape))

print("X_test shape: {}".format(X_test_shape))
print("y_test shape: {}".format(y_test_shape))

########## Answer ############

## 1.3 (5 points) Preprocess the data through scaling
Scale the dataset.
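
As a minimal sketch of the usual pattern, assuming `StandardScaler` as the scaler (the problem does not mandate a particular one) and using synthetic stand-in data rather than the metabolite matrix: the scaling statistics are estimated on the training split only and then reused, unchanged, on the test split. All variable names here are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the breast cancer dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

demo_scaler = StandardScaler()
X_tr_scaled = demo_scaler.fit_transform(X_tr)  # statistics estimated from the training split only
X_te_scaled = demo_scaler.transform(X_te)      # the same statistics are reused for the test split

print(X_tr_scaled.mean(0), X_tr_scaled.std(0))
print(X_te_scaled.mean(0), X_te_scaled.std(0))
```

The training split ends up with per-feature mean 0 and standard deviation 1 by construction; the test split is transformed with the training statistics, so its values will generally differ slightly.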

In [None]:
########### Code #############



########### Code #############

Print the mean/variance for each transformed feature.

In [None]:
########## Answer ############

train_mean = X_train_scaled.mean(0)
train_variance = X_train_scaled.std(0) ** 2

test_mean = X_test_scaled.mean(0)
test_variance = X_test_scaled.std(0) ** 2

print("The means of the transformed feature train set are {}".format(train_mean))
print("The variances of the transformed feature train set are {}".format(train_variance))
print("The means of the transformed feature test set are {}".format(test_mean))
print("The variances of the transformed feature test set are {}".format(test_variance))

########## Answer ############


Q: Discuss the importance of not fitting the scaler transform on both train & test; why the transformed mean / variance between X_test_scaled & X_train_scaled may not be the same; and why this difference is potentially heartening.

Your answer here

## 1.4 (10 points) Training a logistic regression classifier with `scikit-learn`

`scikit-learn` provides many handy machine learning tools that make it relatively simple to train, cross-validate, and tune a model that does not need to be heavily customized.


Below, train and evaluate a Logistic Regression model.

In [None]:
########### Code #############

########### Code #############

Report the AUC for both the train and test datasets.

In [None]:
########## Answer ############

print("The training AUC score is {:.3f}".format(train_auc) )
print("The testing AUC score is {:.3f}".format(test_auc) )

########## Answer ############

Generate plots for the confusion matrices and the ROC curve for both training and testing. Please use the `plot_clf` function defined above.

In [None]:
########### Code #############

########### Code #############

Generate a plot of the model coefficients' distribution using `plt.hist`.

In [None]:
########### Code #############

########### Code #############

## 1.5 (5 points) Introduce L1 regularization
Modify the `LogisticRegression` call to run L1-regularized logistic regression.

In [None]:
########### Code #############

########### Code #############

Report the ROC-AUC score.

In [None]:
########## Answer ############


print("The training AUC score is {:.2f}".format(train_auc) )
print("The testing AUC score is {:.2f}".format(test_auc) )

########## Answer ############

Correspondingly, generate the new confusion matrix and ROC curve.

In [None]:
########### Code #############


########### Code #############

Plot the new distribution of model coefficients.

In [None]:
########### Code #############



########### Code #############

Comment on the histogram you obtained, by comparing it to the one generated from the unregularized model's coefficients.

Your answer here

## 1.6  (optional +2.5 points) Connect model coefficients back to metabolites

Write code to identify the top 5 metabolites that correlate most positively with a positive diagnosis.

In [None]:
########### Code #############


########### Code #############

Report the metabolites you identified.

In [None]:
########## Answer ############

print("The top 5 metabolites are {}".format(", ".join(metabolites)) )

########## Answer ############

## 1.7 (5 points) Hyperparameter tuning the regularization parameter
Scan over the following regularization values `C=[0.01, 1, 5, 10]` (in `scikit-learn`, the regularization parameter `C` is inversely related to the regularization strength) and report which one yields the best performance (with the AUROC metric) on the train dataset. What is its performance on the test data?
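
To make the inverse relationship between `C` and regularization strength concrete, here is a small sketch on synthetic data (not the metabolite dataset). It assumes an L1 penalty with the `liblinear` solver, one of the `scikit-learn` solvers that supports L1, and counts how many coefficients are driven exactly to zero at a small and a large `C`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the metabolite dataset)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)
X_demo = StandardScaler().fit_transform(X_demo)

zero_counts = {}
for C in [0.01, 10]:
    # Smaller C means a stronger penalty on the coefficients
    clf = LogisticRegression(penalty="l1", C=C, solver="liblinear")
    clf.fit(X_demo, y_demo)
    zero_counts[C] = int(np.sum(clf.coef_ == 0))

print(zero_counts)  # number of exactly-zero coefficients for each C
```

With the stronger penalty (smaller `C`), more coefficients are typically pushed to zero, which is the sparsity effect associated with L1 regularization.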

In [None]:
########### Code #############

best_value = None
best_metric = 0
C_values = [0.01, 1, 5, 10]

########### Code #############

In [None]:
########### Answer #############

print(f"Best value: C = {best_value}")
# Performance on test AUC:
print("The hyperparameterized model's test AUC score is {:.2f}".format(test_auc) )

########### Answer #############

Your answer here

## 1.8 (5 points) Training a random forest classifier with `scikit-learn`

To reduce the variance of our method, and with it the epistemic error that arises from a limited dataset, we often train multiple copies of a machine learning model, i.e., an _ensemble_ machine learning method.

Here, train a Random Forest classifier, which is an ensemble of decision trees. Random Forests, empirically, work really well on tabular data, but are known to overfit easily. To mitigate this concern, also perform cross validation.
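
The sketch below illustrates how `cross_val_score` works with a `Pipeline`, using synthetic data and a logistic regression as a stand-in model (deliberately not the random forest asked for here). Because the scaler lives inside the pipeline, it is refit on the training portion of each fold. The `cv=5` and `scoring="roc_auc"` settings are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical; not the breast cancer dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),      # refit on the training portion of every fold
    ("model", LogisticRegression()),   # stand-in estimator for illustration only
])

# One ROC-AUC score per fold
cv_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=5, scoring="roc_auc")
print(cv_scores.mean(), cv_scores.std())
```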

In [None]:
########### Code #############
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


scaler = # fill this in (reusing your scaler implementation from 1.3)
model = # fill this in
pipeline = Pipeline([('scaler', scaler), ('model', model)])

# Now call cross_val_score() and feed it: your pipeline as the estimator
# and your data (p1_X & p1_y) letting cross_val_score() handle the test / train split for you.

########### Code #############

Report the cross-validated ROC-AUC score.

In [None]:
########## Answer ############


print("The mean of CV scores is {:.2f}".format(mean) )
print("The std of CV scores is {:.2f}".format(std) )

########## Answer ############

# Problem 2: Training a Multi-Layer Perceptron (MLP) to predict perovskite energies
## 2.1: (5 points) Encoding the data

For this second task, we will utilize perovskite data to predict their $E_{hull}$ values. Before we can build a model, we need to process the data by one-hot encoding the perovskites based on their elements.

Generate the X matrix and y vector from processing `perov_train` and `perov_test` appropriately (do not run any inference on `perov_test` until Part 2.5, but it's helpful to process it the same way here for consistency). Hint: Note that the one-hot encoding we perform should still give us a 2D matrix of `n_samples` x `n_features`, but now `n_features` will be different from the number of features originally.
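
As a toy illustration of the hint, the sketch below one-hot encodes three made-up compositions with two sites each. The element list, the two-site layout, and all variable names are hypothetical; the real element list and column names come from the provided files.

```python
import numpy as np

# Hypothetical toy inputs, for illustration only
toy_elements = ["Ba", "Sr", "Ti", "Zr"]
toy_samples = [("Ba", "Ti"), ("Sr", "Ti"), ("Sr", "Zr")]  # (site 1, site 2) per sample

element_to_index = {el: i for i, el in enumerate(toy_elements)}
n_sites = 2
n_elements = len(toy_elements)

X_onehot = np.zeros((len(toy_samples), n_sites * n_elements))
for row, sample in enumerate(toy_samples):
    for site, el in enumerate(sample):
        # Each site gets its own block of n_elements columns
        X_onehot[row, site * n_elements + element_to_index[el]] = 1.0

print(X_onehot.shape)  # (3, 8)
```

The result is still a 2D `n_samples` x `n_features` matrix, but each original categorical feature has expanded into one column per possible value.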

In [None]:
perov_train = pd.read_csv("perov_train.csv") # read train
perov_test = pd.read_csv("perov_val.csv") # read test
all_elements = np.load('./elements.npy', allow_pickle=True) # Read all elements


# Your code to featurize elements
########### Code #############


########### Code #############

Report the number of samples, number of features, and the number of possible values for one (not yet one-hot-encoded) feature you might have.

In [None]:
########## Answer ############

print("There are {} samples.".format(N_samples))
print("There are {} features per sample.".format(N_features))
print("There are {} possible values for one feature.".format(N_feat_vals))

########## Answer ############

## Check GPU usage

In [None]:
# Check if your GPU is requested successfully or not
assert torch.cuda.device_count() != 0

To work with GPU-accelerated training, we need to use `PyTorch`'s `Tensor` objects, which can help us manage CPU vs. GPU usage.

Demonstrate moving this sample tensor to and from the GPU.

In [None]:
numpy_sample = np.zeros((3, 5))
tensor_sample = torch.Tensor(numpy_sample)
print(tensor_sample.device)

###
tensor_sample = tensor_sample.to('cuda')
print(tensor_sample.device)
tensor_sample = tensor_sample.to('cpu')
print(tensor_sample.device)


###

## Build Datasets and DataLoaders in PyTorch

Below, we provide you with an example of a `Dataset` subclass, the `PerovskiteDataset` class, to format your data.

Note that tensors are not moved onto the GPU at initialization or inside `__getitem__`; GPU memory is typically far more limited than CPU-available memory. To prevent out-of-memory errors, we typically minimize the amount of data on the GPU at any time. During `.forward()`, the easiest way to do this is by only putting one batch at a time on the GPU.
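
A minimal sketch of this one-batch-at-a-time pattern, using toy tensors with hypothetical shapes and a `TensorDataset` as a stand-in for the dataset class. The device selection falls back to the CPU so the sketch also runs without a GPU.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Falls back to the CPU when no GPU is available
demo_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy tensors (hypothetical shapes); the full dataset stays in CPU memory
toy_data = TensorDataset(torch.zeros(10, 4), torch.zeros(10))
toy_loader = DataLoader(toy_data, batch_size=4)

batch_sizes = []
for X_batch, y_batch in toy_loader:
    # Only the current batch is copied to the device
    X_batch = X_batch.to(demo_device)
    y_batch = y_batch.to(demo_device)
    batch_sizes.append(X_batch.shape[0])

print(batch_sizes)  # [4, 4, 2]
```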

In [None]:
# Generate dataset
class PerovskiteDataset(Dataset):
    def __init__(self, X, y):
        self.X = torch.Tensor(np.array(X))  # store X as a pytorch Tensor
        self.y = torch.Tensor(np.array(y))  # store y as a pytorch Tensor
        self.len=len(self.X)                # number of samples in the data

    def __getitem__(self, index):
        return self.X[index], self.y[index] # get the appropriate item

    def __len__(self):
        return self.len

Use the above `Dataset` object to make train, validation, and test `PerovskiteDataset` objects, then wrap each in a `DataLoader` object. Here are some handy arguments you can use to tweak the DataLoader object:
* `batch_size`: the number of examples that your model will see during one forward/backward call.
* `shuffle`: whether or not to shuffle your data between epochs; if it is set to True, then each epoch's batches will be different in identity.
* `num_workers`: useful for when you have a lot of data to load in when making a batch; this allows you to multiprocess batch formation before they are used in forward calls on the GPU. It won't make much difference in this homework, but can be handy down the line :)

In [57]:
########### Code #############
X_subtrain, X_val, y_subtrain, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

train_data = PerovskiteDataset(X_subtrain, y_subtrain)
val_data = PerovskiteDataset(X_val, y_val)
test_data = PerovskiteDataset(X_test, y_test)

batch_size = 128
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

########### Code #############

Run this cell to check that your DataLoaders work as expected.

In [58]:
########### Code #############

for loader in [train_dataloader, val_dataloader, test_dataloader]:
    for index, batch in enumerate(loader):
        # Your batch returns a X, y stacked in a batch
        X_batch, y_batch = batch[0], batch[1]
        print(X_batch.shape, y_batch.shape)
    print()

########### Code #############

torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([100, 136]) torch.Size([100])

torch.Size([128, 136]) torch.Size([128])
torch.Size([128, 136]) torch.Size([128])
torch.Size([90, 136]) torch.Size([90])

torch.Size([128, 136]) torch.Size([128])
torch.Size([82, 136]) torch.Size([82])



There should be 11 training batches, 3 validation batches, and 2 test batches, with the sizes shown above. Each set ends with a final, partial batch.
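
These batch counts follow from ceiling division of the number of samples by `batch_size`. The short check below reads the sample counts off the printed batch shapes; the variable names are for illustration only.

```python
import math

batch_size = 128
# Sample counts implied by the batch shapes printed above
n_samples = {"train": 10 * 128 + 100, "val": 2 * 128 + 90, "test": 1 * 128 + 82}

for name, n in n_samples.items():
    n_batches = math.ceil(n / batch_size)    # the last batch is partial when n % batch_size != 0
    last_batch = n % batch_size or batch_size
    print(name, n, n_batches, last_batch)
# train 1380 11 100
# val 346 3 90
# test 210 2 82
```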

## 2.2 (10 points) Define the MLP in PyTorch

Look at the following code snippet to understand how the linear layer works in PyTorch. Take careful note of the dimensions of the input and output.

In [59]:
linear = torch.nn.Linear(2, 3)

input_tensor = torch.ones((4, 2))
output_tensor = linear(input_tensor)

print(input_tensor, output_tensor, input_tensor.shape, output_tensor.shape)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]) tensor([[ 0.3321, -0.3318,  0.5806],
        [ 0.3321, -0.3318,  0.5806],
        [ 0.3321, -0.3318,  0.5806],
        [ 0.3321, -0.3318,  0.5806]], grad_fn=<AddmmBackward0>) torch.Size([4, 2]) torch.Size([4, 3])


Look at the following code snippet to understand how the ReLU layer works in PyTorch (the Tanh layer is similar). Take careful note of the dimensions of the input and output.

In [60]:
relu = torch.nn.ReLU()

input_tensor = torch.ones((4, 2))
output_tensor = relu(input_tensor)

print(input_tensor, output_tensor, input_tensor.shape, output_tensor.shape)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]) tensor([[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]) torch.Size([4, 2]) torch.Size([4, 2])


Look at the following code snippet to understand how to stack layers with the Sequential module.

In [61]:
layer1 = torch.nn.Linear(2, 3)
layer2 = torch.nn.Linear(3, 4)

sequential = torch.nn.Sequential(layer1, layer2)

input_tensor = torch.ones((5, 2))
output_tensor = sequential(input_tensor)

print(input_tensor, output_tensor, input_tensor.shape, output_tensor.shape)

tensor([[1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.],
        [1., 1.]]) tensor([[ 0.2243,  0.3194,  0.7627, -0.2021],
        [ 0.2243,  0.3194,  0.7627, -0.2021],
        [ 0.2243,  0.3194,  0.7627, -0.2021],
        [ 0.2243,  0.3194,  0.7627, -0.2021],
        [ 0.2243,  0.3194,  0.7627, -0.2021]], grad_fn=<AddmmBackward0>) torch.Size([5, 2]) torch.Size([5, 4])


Build your MLP within the following torch.nn.Module object.

In [65]:
########### Code #############
class PerovMLP(torch.nn.Module):
    def __init__(self, in_features=136, hidden_dim=256, layers=3, activation=nn.ReLU()):
        # You may either modify the above __init__ call to either take in hyperparameters as keyword arguments,
        # or hard-code them below. If you want to do hyperparameter search, then modifying to take in
        # keyword arguments is recommended.
        super().__init__()
        ########### Code #############

        # Implement your code here
        modules = [nn.Linear(in_features, hidden_dim), activation]

        for _ in range(layers - 1):
            modules.append(nn.Linear(hidden_dim, hidden_dim))
            modules.append(activation)

        modules.append(nn.Linear(hidden_dim, 1)) # output is a single scalar per sample

        # nn.Sequential registers every layer as a submodule, so no separate list needs to be kept.
        # Note: the original passed the integer `layers` here instead of the list of modules.
        self.model = nn.Sequential(*modules)
        ########### Code #############



    def forward(self, x):
        x = self.model(x)

        return x

########### Code #############

## 2.3 (10 points) Implement functions for training and testing

We define your device, model, and epochs; fill in the optimizer and loss function. For the optimizer, use an L2 weight of 0.01 and the Adam optimizer.
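
For orientation, here is a hedged sketch of a single-model training loop on toy data, not the `PerovMLP` or the perovskite data. In `torch.optim.Adam`, the `weight_decay` argument adds an L2 penalty on the parameters. The choice of `nn.MSELoss`, the learning rate, and all names below are assumptions made for illustration.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data (hypothetical shapes)
toy_model = nn.Linear(4, 1)
X_toy = torch.randn(32, 4)
y_toy = X_toy.sum(dim=1)

# weight_decay is Adam's L2 penalty on the parameters
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-2, weight_decay=0.01)
toy_loss_fn = nn.MSELoss()  # a common regression loss; an assumption here

losses = []
for _ in range(100):
    toy_optimizer.zero_grad()              # clear gradients from the previous step
    y_pred = toy_model(X_toy).squeeze(-1)  # (32, 1) -> (32,) to match the target shape
    loss = toy_loss_fn(y_pred, y_toy)
    loss.backward()                        # backpropagate
    toy_optimizer.step()                   # update the parameters
    losses.append(loss.item())

print(losses[0], losses[-1])
```

The `squeeze(-1)` matters whenever the model outputs shape `(batch, 1)` while the targets have shape `(batch,)`; without it, the loss would broadcast the two tensors against each other.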

In [66]:
########### Code #############

# device to train on
device = 'cuda:0'
# define your model
model = PerovMLP(in_features=136).to(device) # 136 matches the feature dimension printed by the DataLoader check

# define your optimizer
optimizer = # Fill in
# define your loss function
loss_fn = # Fill in

# define number of epochs
epochs = 250

########### Code #############


Implement your training and validation loops here.

In [None]:
########### Code #############

def train(model, dataloader, optimizer, loss_fn, device):

    '''
    A function to train on the entire dataset for one epoch.

    Args:
        model (torch.nn.Module): your model from before
        dataloader (torch.utils.data.DataLoader): DataLoader object for the train data
        optimizer (torch.optim.Optimizer): optimizer object to interface gradient calculation and optimization
        loss_fn (callable): loss function comparing predictions against targets
        device (str): Your device (usually 'cuda:0' for your GPU)

    Returns:
        float: loss averaged over all the batches
    '''

    epoch_loss = []
    model.train() # Set model to training mode

    for batch in dataloader:
        X, y = batch
        X = X.to(device)
        y = y.to(device)

        # train your model on each batch here
        y_pred = model(X)

        ########### Code #############



        ########### Code #############

    return epoch_mean



def validate(model, dataloader, loss_fn, device):

    '''
    A function to validate on the validation dataset for one epoch.

    Args:
        model (torch.nn.Module): your model from before
        dataloader (torch.utils.data.DataLoader): DataLoader object for the validation data
        loss_fn (callable): loss function comparing predictions against targets
        device (str): Your device (usually 'cuda:0' for your GPU)

    Returns:
        float: loss averaged over all the batches

    '''

    val_loss = []
    model.eval() # Set model to evaluation mode
    with torch.no_grad():
        for batch in dataloader:
            X, y = batch
            X = X.to(device)
            y = y.to(device)

            # validate your model on each batch here
            y_pred = model(X)

            ########### Code #############




            ########### Code #############


    return epoch_mean
########### Code #############

Train and validate your model.

In [None]:
########### Code ###########
val_loss_curve = []
train_loss_curve = []

def update(progress_bar, train_loss, val_loss):
    progress_bar.set_postfix({"train_loss": train_loss, "val_loss": val_loss})

progress_bar = tqdm(range(epochs)) # this wraps your iteration in a handy progress bar to track any metrics.

for epoch in progress_bar:

    # Train your model on training data
    train_loss = train(model, train_dataloader, optimizer, loss_fn=loss_fn, device=device)

    # Validate your model on validation data
    val_loss = validate(model, val_dataloader, loss_fn=loss_fn, device=device)

    # Record train and validation loss
    train_loss_curve.append(train_loss)
    val_loss_curve.append(val_loss)

    update(progress_bar, train_loss, val_loss)
########### Code ###########

In [None]:
plt.plot(train_loss_curve, label="training")
plt.plot(val_loss_curve, label="validation")
plt.legend()

In [None]:
########### Code #############
from sklearn.metrics import r2_score

fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(10, 5))

yhat_train = # Fill in
yhat_val = # Fill in

yhat_train = yhat_train.cpu().detach().numpy()
yhat_val = yhat_val.cpu().detach().numpy()

ax[0].scatter(yhat_train, y_subtrain, label='Train', alpha=0.5)
ax[1].scatter(yhat_val, y_val, label='Validation', alpha=0.5, c='orange')

ax[0].set_ylabel("True $E_{hull}$ (eV/atom)")
ax[0].set_xlabel("Predicted $E_{hull}$ (eV/atom)")
ax[1].set_xlabel("Predicted $E_{hull}$ (eV/atom)")
ax[0].set_title('Train')
ax[1].set_title('Validation')
fig.suptitle('Multi-Layer Perceptron')

print("Multi-Layer Perceptron training R^2 score: {:.2f}".format(r2_score(y_subtrain, yhat_train)))
print("Multi-Layer Perceptron validation R^2 score: {:.2f}".format((r2_score(y_val, yhat_val))))

########### Code #############

## 2.4 (5 points) (grad) Calculate model size

Calculate the total number of parameters in your MLP model. What does the input `hidden_layers_sizes = (256, 256, 256)` mean?
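
As a worked example of the counting rule, a fully connected layer with `n_in` inputs and `n_out` outputs has `n_in * n_out` weights plus `n_out` biases. The sketch below applies this to a hypothetical toy MLP, not the model from this problem set.

```python
# Layer widths of a hypothetical toy MLP: 4 inputs, two hidden layers of 8 units, 1 output
layer_sizes = [4, 8, 8, 1]

n_params = 0
for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
    n_params += n_in * n_out + n_out  # weight matrix plus bias vector

print(n_params)  # 121
```

For a PyTorch model, the same total can be cross-checked with `sum(p.numel() for p in model.parameters())`.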

In [None]:
########## Answer ############


########## Answer ############

Your answer here

## 2.5 (5 points) Chemical Transferability of One-Hot Representations

In [None]:
########### Code #############

# Load the test dataset which contains elements not seen in the training data
perov_test = pd.read_csv("perov_val.csv")

# Your code to preprocess data and predict on this test dataset, including a scatterplot of predictions





print("MLP test R^2 score: {:.2f}".format(r2_score(y_test, yhat_test)))
########### Code #############

Comment on your test results and briefly explain them.

Your answer here

## 2.6 (10 points) (grad) Featurize perovskites with physical descriptors

In [None]:
########### Code #############
elements_pd = pd.read_csv("mendeleev.csv")
elements_pd = elements_pd.set_index('symbol')







########### Code #############

In [None]:
########### Code #############

# New dataloaders needed
new_train_data = PerovskiteDataset(X_atom_train, y_atom_train)
new_train_dataloader = DataLoader(new_train_data, batch_size=batch_size, shuffle=True)

new_val_data = PerovskiteDataset(X_atom_val, y_atom_val)
new_val_dataloader = DataLoader(new_val_data, batch_size=batch_size, shuffle=True)

# Fill in the rest





print("Retrained MLP Training R^2 score: {:.2f}".format(r2_score(y_atom_train, yhat_new_train)))
print("Retrained MLP Validation R^2 score: {:.2f}".format(r2_score(y_atom_val, yhat_new_val)))

########### Code #############

## 2.7 (10 points) (grad) Chemical transferability of Physical Descriptors

In [None]:
########### Code #############


print("Retrained MLP Testing R^2 score: {:.2f}".format(r2_score(y_test, yhat_test)))

########### Code #############

Briefly comment on your test results and explain why you observe them.

Your answer here.

You've reached the end! Upon completing your pset, note any collaborators or assistance from AI tools in a cell below; and submit to Gradescope [here](https://www.gradescope.com/courses/1011324/assignments/6007173/).