# Homework 2: Learning Chemical Structure with Autoencoders

In this notebook, you'll explore how an autoencoder can learn a compact representation of chemical elements based on their physical and chemical properties.

You will:
- Load a dataset of chemical elements, from hydrogen (H) through radon (Rn),
- Normalize the data and encode categorical features,
- Train an autoencoder to compress features into 2D,
- Visualize the learned latent space,
- Interpret how chemical structure is captured in the latent space.

---

## 📌 Dataset Features

- `atomic_mass`
- `electronegativity`
- `type` (metal, nonmetal, metalloid)

---

## 🎯 Your Tasks

1. Preprocess the dataset (handle NaNs, normalize, encode types).
2. Define and train an autoencoder with 2D latent space.
3. Plot and interpret the latent space. Use color, marker shape, and size to visualize the data in the latent space.
4. Optimize the hyperparameters.
5. Interpret the results.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

# Load data
df = pd.read_csv("periodic_table_properties.csv")
df.head()


## 1. Preprocess the dataset

- Check the dataset for **missing or NaN** entries. Which entries are missing, and for which elements? What would be a reasonable replacement value? Perform the replacement.
- Normalize numerical values.
- Encode the element `type` labels (metal, nonmetal, metalloid) using a one-hot encoding.
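
The three steps above can be sketched on a small hand-written table. This is only one possible approach: the toy DataFrame stands in for the real CSV, the `symbol` column name is an assumption, and median imputation is just one candidate replacement strategy that you should justify or replace with your own choice.

```python
import numpy as np
import pandas as pd
import torch
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for a few rows of the real CSV.
# The `symbol` column name is an assumption; the other columns follow the feature list.
toy = pd.DataFrame({
    "symbol": ["H", "He", "Li", "C", "Ne", "Si"],
    "atomic_mass": [1.008, 4.0026, 6.94, 12.011, 20.180, 28.085],
    "electronegativity": [2.20, np.nan, 0.98, 2.55, np.nan, 1.90],
    "type": ["nonmetal", "nonmetal", "metal", "nonmetal", "nonmetal", "metalloid"],
})

# 1. Find out which elements have missing values.
missing = toy.loc[toy["electronegativity"].isna(), "symbol"].tolist()
print("Missing electronegativity:", missing)

# 2. Replace NaNs (median imputation is an assumption, not the required answer).
num_cols = ["atomic_mass", "electronegativity"]
X_num = SimpleImputer(strategy="median").fit_transform(toy[num_cols])

# 3. Standardize numerical columns to zero mean and unit variance.
X_num = StandardScaler().fit_transform(X_num)

# 4. One-hot encode the type; .toarray() converts the default sparse output to dense.
X_cat = OneHotEncoder().fit_transform(toy[["type"]]).toarray()

# 5. Combine everything into a single float tensor.
X_toy = torch.tensor(np.hstack([X_num, X_cat]), dtype=torch.float32)
print(f"Feature matrix shape: {tuple(X_toy.shape)}")
```

With two numerical columns and three type categories, each element ends up as a five-dimensional feature vector.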

In [None]:
# Handle missing electronegativity values


# Encode metal type


# Normalize numerical values


# Combine features



print(f"Feature matrix shape: {X_tensor.shape}")


## 2. Define the Autoencoder and Optimize the Weights

Create an autoencoder with a 2D latent space. To optimize the autoencoder's architecture, investigate how increasing the number of hidden layers affects the reconstruction loss.

In [3]:
class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(
          ...
        )
        self.decoder = nn.Sequential(
          ...
        )
    
    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon, z
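
One possible way to make the depth a parameter is to build the encoder from a list of layer sizes and mirror that list for the decoder. The class name `SketchAutoencoder`, the helper `make_mlp`, the hidden widths, and the default `Tanh` activation are illustrative assumptions, not the expected solution.

```python
import torch
import torch.nn as nn


def make_mlp(sizes, activation=nn.Tanh):
    """Stack Linear layers of the given sizes with an activation between them.

    No activation follows the last layer, so the output is unbounded.
    """
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(activation())
    return nn.Sequential(*layers)


class SketchAutoencoder(nn.Module):
    """Autoencoder whose decoder mirrors the encoder (hypothetical helper class)."""

    def __init__(self, input_dim, hidden_dims=(8,), latent_dim=2, activation=nn.Tanh):
        super().__init__()
        enc_sizes = [input_dim, *hidden_dims, latent_dim]
        self.encoder = make_mlp(enc_sizes, activation)
        self.decoder = make_mlp(enc_sizes[::-1], activation)

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z


# Compare zero, one, and two hidden layers on a random batch of 5 features.
x = torch.randn(4, 5)
for hidden in [(), (8,), (16, 8)]:
    model_k = SketchAutoencoder(input_dim=5, hidden_dims=hidden)
    x_recon, z = model_k(x)
    n_params = sum(p.numel() for p in model_k.parameters())
    print(hidden, tuple(z.shape), tuple(x_recon.shape), n_params)
```

Printing the parameter count next to each depth is a reminder that the dataset is small, so deeper models can memorize it rather than compress it.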


In [None]:
# Optimize the autoencoder

torch.manual_seed(42)

model = Autoencoder(input_dim=X_tensor.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=2e-3)
criterion = nn.MSELoss()

# Training loop
epochs = 10000
for epoch in range(epochs):
    model.train()
    optimizer.zero_grad()
    x_recon, _ = model(X_tensor)
    loss = criterion(x_recon, X_tensor)
    loss.backward()
    optimizer.step()
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {loss.item():.6f}")


## 3. Latent space

- Visualize the learned latent space.
- To help identify the structure of the latent space, I recommend using color, size, and marker shape.
- Color each point by its electronegativity, use a different marker for each of the three types, and scale the marker size with the atomic mass. Label each point with its element symbol.
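
A minimal sketch of such a plot is shown here. The latent coordinates are random stand-ins (not real encoder output), the `symbol` column name is an assumption, and the marker choices and size range are arbitrary. It also assumes missing electronegativities have already been filled in.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical metadata for six elements; `symbol` is an assumed column name.
meta = pd.DataFrame({
    "symbol": ["H", "Li", "C", "O", "Na", "Si"],
    "atomic_mass": [1.008, 6.94, 12.011, 15.999, 22.990, 28.085],
    "electronegativity": [2.20, 0.98, 2.55, 3.44, 0.93, 1.90],
    "type": ["nonmetal", "metal", "nonmetal", "nonmetal", "metal", "metalloid"],
})
# Random stand-in for the encoder output; these are NOT real latent coordinates.
Z_toy = np.random.default_rng(0).normal(size=(len(meta), 2))

markers = {"metal": "o", "nonmetal": "s", "metalloid": "^"}
en = meta["electronegativity"].to_numpy()
mass = meta["atomic_mass"].to_numpy()
# Map atomic mass linearly onto marker areas between 40 and 300 points^2.
sizes = 40 + 260 * (mass - mass.min()) / (mass.max() - mass.min())

fig, ax = plt.subplots(figsize=(7, 5))
for kind, marker in markers.items():
    mask = (meta["type"] == kind).to_numpy()
    sc = ax.scatter(
        Z_toy[mask, 0], Z_toy[mask, 1],
        c=en[mask], s=sizes[mask], marker=marker,
        cmap="viridis", vmin=en.min(), vmax=en.max(),  # one color scale for all groups
        edgecolors="k", label=kind,
    )

# Label each point with its element symbol, slightly offset from the marker.
for (x, y), sym in zip(Z_toy, meta["symbol"]):
    ax.annotate(sym, (x, y), xytext=(5, 5), textcoords="offset points")

fig.colorbar(sc, ax=ax, label="electronegativity")
ax.set_xlabel("latent dimension 1")
ax.set_ylabel("latent dimension 2")
ax.legend(title="type")
```

Passing the same `vmin` and `vmax` to every `scatter` call keeps the colors comparable across the three marker groups, so a single colorbar describes all of them.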

In [None]:
# Evaluate the model
model.eval()
with torch.no_grad():
    _, Z = model(X_tensor)
Z_np = Z.numpy()

# Plot


## 4. Optimize the Hyperparameters

- Using cross-validation, optimize the learning rate, the choice of activation function (up to three, e.g., Tanh, ReLU), the number of epochs, and the number of hidden layers (up to two).
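
The search described above could be organized roughly as follows. This is a sketch under several assumptions: the feature matrix is random stand-in data, the helper names (`build_autoencoder`, `cv_loss`), the hidden width of 8, the 3-fold split, and the grid values are all illustrative, and the epoch counts are kept small so the example runs quickly.

```python
import itertools

import numpy as np
import torch
import torch.nn as nn
from sklearn.model_selection import KFold


def build_autoencoder(input_dim, n_hidden, activation, hidden_dim=8, latent_dim=2):
    """Return encoder and decoder chained in one Sequential (hypothetical helper)."""
    sizes = [input_dim] + [hidden_dim] * n_hidden + [latent_dim]

    def mlp(layer_sizes):
        layers = []
        for i in range(len(layer_sizes) - 1):
            layers.append(nn.Linear(layer_sizes[i], layer_sizes[i + 1]))
            if i < len(layer_sizes) - 2:
                layers.append(activation())
        return nn.Sequential(*layers)

    return nn.Sequential(mlp(sizes), mlp(sizes[::-1]))


def cv_loss(X, params, n_splits=3, seed=0):
    """Mean validation reconstruction loss over the folds for one parameter set."""
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    criterion = nn.MSELoss()
    fold_losses = []
    for train_idx, val_idx in kfold.split(np.arange(len(X))):
        torch.manual_seed(seed)  # same initialization for every fold and setting
        X_train = X[torch.as_tensor(train_idx, dtype=torch.long)]
        X_val = X[torch.as_tensor(val_idx, dtype=torch.long)]
        model = build_autoencoder(X.shape[1], params["n_hidden"], params["activation"])
        optimizer = torch.optim.Adam(model.parameters(), lr=params["lr"])
        model.train()
        for _ in range(params["epochs"]):
            optimizer.zero_grad()
            loss = criterion(model(X_train), X_train)
            loss.backward()
            optimizer.step()
        model.eval()
        with torch.no_grad():
            fold_losses.append(criterion(model(X_val), X_val).item())
    return float(np.mean(fold_losses))


# Illustrative grid; extend it with your own values.
grid = {
    "lr": [1e-3, 1e-2],
    "activation": [nn.Tanh, nn.ReLU],
    "epochs": [50, 100],
    "n_hidden": [1, 2],
}

torch.manual_seed(0)
X_demo = torch.randn(30, 5)  # random stand-in for the real feature matrix

demo_best_params, demo_best_loss = None, float("inf")
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    loss = cv_loss(X_demo, params)
    if loss < demo_best_loss:
        demo_best_params, demo_best_loss = params, loss

print(f"Best parameters: {demo_best_params}, Best loss: {demo_best_loss:.4f}")
```

Scoring each setting on held-out folds rather than on the training loss matters here because a larger network or a longer run will almost always reduce the training loss. For a stricter evaluation, the normalization would also be fitted on the training fold only.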

In [None]:
# Hyperparameter optimization using cross-validation

# Define the parameter grid


# Create a function to train and evaluate the model


# Perform grid search



print(f"Best parameters: {best_params}, Best loss: {best_loss}")


## 🧠 5. Interpretation

- What do you notice about the placement of metals, nonmetals, and metalloids?
- Are chemically similar elements grouped together?
- What might each latent dimension represent?
- How does the choice of activation function affect compression?
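
One way to approach the question of what each latent dimension represents is to correlate every latent coordinate with every input feature. The sketch below uses synthetic data built so that the answer is known in advance; the function name `latent_feature_correlations` is hypothetical, and a Pearson correlation only captures linear relationships.

```python
import numpy as np


def latent_feature_correlations(Z, features, names):
    """Pearson correlation of each latent dimension with each feature column."""
    table = {}
    for j, name in enumerate(names):
        table[name] = [
            float(np.corrcoef(Z[:, k], features[:, j])[0, 1])
            for k in range(Z.shape[1])
        ]
    return table


# Synthetic check: dimension 1 is a linear function of mass,
# dimension 2 is the negated electronegativity.
rng = np.random.default_rng(0)
mass = rng.uniform(1.0, 222.0, size=50)
en = rng.uniform(0.7, 4.0, size=50)
Z_demo = np.column_stack([2.0 * mass + 1.0, -en])
features = np.column_stack([mass, en])

corr = latent_feature_correlations(Z_demo, features, ["atomic_mass", "electronegativity"])
for name, values in corr.items():
    print(name, [round(v, 3) for v in values])
```

On real latent coordinates the correlations will be weaker and possibly nonlinear, so treat such a table as a starting point for interpretation alongside the plot.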