# Problem Set 2: Analyzing pions with GNNs

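**Software requirements:** See below -- specialized packages include PyTorch Geometric (`torch_geometric`), `networkx`, `deepsnap`, and `wandb`. Note that `glob` is part of the Python standard library and does not need to be installed. All third-party requirements are installable via `pip` or `uv`. 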

**Datasets:**
I recommend creating a folder called something like `data/pset_2` wherever you are working on this problem set. The datasets can be downloaded at this link: https://uwmadison.box.com/s/fd34yleydcj8l5sonmwhxjqz0ivm8xoc 

**Grading:**
This problem set will be graded as a quiz within Canvas.

**Deadline:** 
The Canvas quiz will close by class time (4pm Central Time) on Wednesday, October 22nd, 2025.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import networkx as nx # for visualizing graphs
import glob
from tqdm.notebook import tqdm
import warnings
warnings.filterwarnings('ignore', category=pd.errors.PerformanceWarning)

import torch
import torch.nn as nn
from torch.nn import Sequential, Linear, ReLU
from torch_geometric.data import Data, Dataset
from torch_geometric.loader import DataLoader
from torch_geometric.utils.convert import to_networkx
from deepsnap.batch import Batch
import wandb



In [2]:
### WARNING: don't change this cell -- setting these seeds will ensure that your answers are consistent with mine 
seed = 1234 
np.random.seed(seed)
torch.random.manual_seed(seed);

In [3]:
### Load data (multiple files)
folder = "/mnt/ceph/home/rcruzcan/private/courses/MLPhysics/MLPhysics_Course/projects/project2/data/" # change this to your own data folder (e.g. "data/pset_2/")
pion_files = glob.glob(folder+"pion_files/*.npy")
pi0_files = glob.glob(folder+"pi0_files/*.npy")

# Pion classification

In [100]:
df_pion = pd.concat([pd.DataFrame(np.load(file, allow_pickle=True).item()) for file in tqdm(pion_files[:1])]) # restrict number of files to balance classes
df_pi0 = pd.concat([pd.DataFrame(np.load(file, allow_pickle=True).item()) for file in tqdm(pi0_files)])

print("Pion dataframe has {:,} events.".format(df_pion.shape[0]))
print("Pi0 dataframe has {:,} events.".format(df_pi0.shape[0]))

Pion dataframe has 10,371 events.
Pi0 dataframe has 6,551 events.


In [101]:
df_pion.columns

Index(['index', 'cluster_cell_E', 'cluster_cell_ID', 'trackPt', 'trackD0',
       'trackZ0', 'trackEta_EMB2', 'trackPhi_EMB2', 'trackEta_EME2',
       'trackPhi_EME2', 'trackEta', 'trackPhi', 'nCluster', 'nTrack',
       'truthPartE', 'truthPartPt', 'cluster_ENG_CALIB_TOT', 'cluster_E',
       'cluster_Eta', 'cluster_Phi', 'cluster_EM_PROBABILITY',
       'cluster_E_LCCalib', 'cluster_HAD_WEIGHT', 'deltaR', 'dR_pass',
       'event_number'],
      dtype='object')

In [102]:
df_pi0.columns

Index(['cluster_cell_E', 'cluster_cell_ID', 'trackPt', 'trackD0', 'trackZ0',
       'trackEta_EMB2', 'trackPhi_EMB2', 'trackEta', 'trackPhi', 'nCluster',
       'nTrack', 'truthPartE', 'truthPartPt', 'cluster_ENG_CALIB_TOT',
       'cluster_E', 'cluster_Eta', 'cluster_Phi', 'cluster_EM_PROBABILITY',
       'cluster_E_LCCalib', 'cluster_HAD_WEIGHT', 'dR', 'dR_pass',
       'event_number'],
      dtype='object')

In [97]:
df_pion.head()

Unnamed: 0,index,cluster_cell_E,cluster_cell_ID,trackPt,trackD0,trackZ0,trackEta_EMB2,trackPhi_EMB2,trackEta_EME2,trackPhi_EME2,...,cluster_ENG_CALIB_TOT,cluster_E,cluster_Eta,cluster_Phi,cluster_EM_PROBABILITY,cluster_E_LCCalib,cluster_HAD_WEIGHT,deltaR,dR_pass,event_number
0,1,"((7.7120857, 0.017031768, 0.8613132, 0.0408380...","((1149781504, 1149765120, 1149797888, 11497812...",[35.268047],[0.009745013],[73.633],[0.69286096],[-2.970729],[-1e+09],[-1e+09],...,"[27.174538, 0.29033858]","[22.896261, 0.40051812]","[0.6958733, 0.43738556]","[-2.981098, -2.8864744]","[0.00030118643, 0.019664463]","[30.707035, 0.76434183]","[1.152908, 1.0502272]","[0.010472925819019208, 0.27609453465122735]","[True, True]",1
1,4,"((1.9225851, 0.4782984, 1.0708888, 0.21156694,...","((767585972, 767585970, 767585974, 767585460, ...",[28.130478],[0.0010360059],[102.36897],[1.3765892],[2.274305],[1.3765893],[2.2724702],...,"[9.645189, 4.1669793, 7.294774, 3.8841047, 2.7...","[9.111117, 6.341108, 5.3945174, 3.2805374, 1.8...","[1.3920547, 1.4241304, 1.3460875, 1.3462025, 1...","[2.2259603, 2.304043, 2.2316482, 2.3136475, 2....","[0.19883664, 0.2674778, 0.0020222887, 0.013409...","[12.781342, 10.993462, 10.308264, 6.1097336, 6...","[1.0651466, 1.0251846, 1.240434, 1.0657715, 1....","[0.03441024025099552, 0.06707009514837796, 0.0...","[True, True, True, True, True, True]",4
2,5,"((16.31489, 1.3621913, 1.3081299, 0.17543113, ...","((1149732608, 1149716224, 1149732352, 11497328...",[74.77584],[-0.012955406],[-8.59665],[0.7014117],[2.9576569],[-1e+09],[-1e+09],...,"[49.449574, 26.568502, 0.24031131, 0.25694823]","[42.85047, 27.031446, 0.8118878, 0.2950738]","[0.7013916, 0.6824008, 0.78855264, 0.8235522]","[2.980659, 2.9671967, 2.7977874, 3.128827]","[0.00038892106, 0.041169945, 0.017027207, 0.02...","[53.96393, 36.372078, 0.8447713, 0.7373382]","[1.1539516, 1.0616913, 1.0, 1.2007517]","[0.013336900928857417, 0.01900154013426954, 0....","[True, True, True, True]",5
3,6,"((6.8593082, 3.1947296, 0.45396918, 0.6548578,...","((767561556, 767561554, 767561558, 767561044, ...",[64.417206],[-0.025962753],[118.277374],[0.12484062],[-2.106485],[-1e+09],[-1e+09],...,[49.988007],[47.388824],[0.18143325],[-2.1094959],[0.003837438],[63.956497],[1.1195381],[0.05722628251570102],[True],6
4,11,"((0.9065694, 0.04256741, 0.6076689, 0.03359242...","((1141014528, 1140998144, 1141030912, 11410147...",[10.022616],[-0.0033622894],[-35.850327],[0.03406541],[1.0819385],[-1e+09],[-1e+09],...,"[2.96507, 3.6390934]","[3.5169182, 3.2089727]","[-0.0043394375, 0.026746975]","[1.0550588, 1.0636826]","[0.0, 0.020064265]","[3.5169182, 5.518128]","[1.0, 1.1769588]","[0.0612177658530893, 0.05681867182434177]","[True, True]",11


In [118]:
def clean_dataframe(df, is_charged=False): 
    ### Start the dataframe of inputs 
    max_n_cols = pd.DataFrame(df.cluster_E.to_list()).shape[1]
    df2 = pd.DataFrame(pd.DataFrame(df.cluster_E.to_list(), columns=["cluster_e_"+str(x) for x in np.arange(max_n_cols)]))
    
    df3 = pd.DataFrame(pd.DataFrame(df.cluster_Eta.to_list(), columns=["cluster_eta_"+str(x) for x in np.arange(max_n_cols)]))
    df2['cluster_eta_0'] = df3['cluster_eta_0'] 
    
    df3 = pd.DataFrame(pd.DataFrame(df.cluster_Phi.to_list(), columns=["cluster_phi_"+str(x) for x in np.arange(max_n_cols)]))
    df2['cluster_phi_0'] = df3['cluster_phi_0']   
    
    ### Add cluster cell energy
    log10_cluster_cell_e = []
    for i in range(len(df2)): 
        log10_cluster_cell_e.append(np.array(np.log10(df.cluster_cell_E.iloc[i][0]))) # only cells from leading cluster
    df3["log10_cluster_cell_e"] = log10_cluster_cell_e
    max_n_cells = pd.DataFrame(df3.log10_cluster_cell_e.to_list()).shape[1]
    df_cells = pd.DataFrame(pd.DataFrame(df3.log10_cluster_cell_e.to_list(), columns=["log10_cluster_cell_e_"+str(x) for x in np.arange(max_n_cells)]))
    df2 = pd.concat([df2, df_cells], axis="columns")

    ### Leading cluster_E > 0.5
    df2 = df2[df2['cluster_e_0'] > 0.5]
    
    ### Cast as float
    df2 = df2.astype('float32')

    ### Add the log of leading cluster energy
    for var in ['cluster_e_0']:
        df2['log10_'+var] = np.log10(df2[var])
    
    ### Reduce variables
    vars = [
    'log10_cluster_e_0', 
    'cluster_eta_0',
    'cluster_phi_0',
             ]
    
    vars += [f'log10_cluster_cell_e_{i}' for i in range(10)] ### top 10 cells
    
    df2 = df2[vars]
    
    df2['num_cells_lead_cluster'] = np.sum(df2[[var for var in df2.keys() if "scaled" not in var and "cell" in var]] != 0, axis=1)

    if is_charged:
        df2['label'] = 1
    else:
        df2['label'] = 0
        
    ### Drop infs/NaNs 
    df2.replace([np.inf, -np.inf], np.nan, inplace=True)
    df2 = df2.fillna(0)
    
    return df2

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 1: Cleaning the data</span>
Add a cut to the dataframe cleaning function to require that the leading cluster energy is $>0.5$ GeV. What is the effect of this cut on the number of events in `df_pion`? 

- a) Reduced by <0.5%
- b) Reduced by 0.5%
- c) Reduced by 2%
- d) Reduced by 6%
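
One way to quantify the effect of a cut is to compare the number of rows before and after applying it. The sketch below shows only the bookkeeping: the energies are made-up toy values and the helper name `percent_removed` is a hypothetical choice, so the printed number says nothing about the real dataset.

```python
import numpy as np

def percent_removed(values, threshold):
    """Return the percentage of entries that fail the cut `values > threshold`."""
    values = np.asarray(values, dtype=float)
    n_before = values.size
    n_after = int(np.sum(values > threshold))
    return 100.0 * (n_before - n_after) / n_before

# Hypothetical leading-cluster energies in GeV (toy numbers, not real data)
toy_cluster_e = [0.2, 0.7, 3.1, 12.0, 0.4, 45.0, 0.9, 1.5]
print(f"Removed {percent_removed(toy_cluster_e, 0.5):.1f}% of events")
```

With the real data, the same comparison can be made by recording `df.shape[0]` before and after the cut inside `clean_dataframe`.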

In [85]:
df_pion = clean_dataframe(df_pion, is_charged = True)

In [None]:
df_pi0 = clean_dataframe(df_pi0, is_charged = False)

In [None]:
df = pd.concat([df_pion, df_pi0])
df = df.sample(frac=1) # shuffle the rows, for good measure
df = df.fillna(0) # fill nans with 0s 
df

In [None]:
class PionDataset_Classification(Dataset):
    def __init__(self, dataframe, 
                 cluster_features=['log10_cluster_e_0','cluster_eta_0','cluster_phi_0'], 
                 transform=None, 
                 pre_transform=None):
        self.dataframe = dataframe
        self.cluster_features = cluster_features
        super().__init__(None, transform, pre_transform)
        print(f"Initialized PionDataset with {len(dataframe)} samples")
    
    def len(self):
        return len(self.dataframe)
    
    def get(self, index):
        """Generates one sample of data"""
        dataframe = self.dataframe
        
        ### Define nodes 
        cluster_features = self.cluster_features

        ### define nodes with topo-cluster CELL energies! 
        cell_info = ['log10_cluster_cell_e_'+str(i) for i in range(int(dataframe.iloc[index].num_cells_lead_cluster))]
        cluster_nodes = np.zeros((int(dataframe.iloc[index].num_cells_lead_cluster), 1+len(cluster_features)))
        cluster_nodes[:,0] = dataframe.iloc[index][cell_info]
        
        cluster_global_node = np.array(dataframe.iloc[index][cluster_features])
        cluster_global_node = np.concatenate([np.zeros(1), cluster_global_node]) # layout: [cell energy slot (zero here), cluster features]
        
        nodes = np.vstack([cluster_nodes, cluster_global_node]) # shape = (num_nodes, num_node_features)
                
        ### Define edges (fully-connected, but no self-loops)
        edges = [(i, j) for i in range(nodes.shape[0]) for j in range(nodes.shape[0]) if i != j]
        edge_index = np.array(edges).T  # Shape: (2, num_edges)
        
        ### Define target labels 
        label = np.array([dataframe.iloc[index]['label']])
                
        ### Convert to torch tensors
        nodes = torch.tensor(nodes, dtype=torch.float)
        edge_index = torch.tensor(edge_index, dtype=torch.long)
        label = torch.tensor(label, dtype=torch.float)
        
        return Data(x=nodes, y=label, edge_index=edge_index)

In [None]:
dataset = PionDataset_Classification(df)

In [None]:
plt.figure() 
graph = dataset[0]
if graph.y[0] == 0:
    print("Neutral pion")
elif graph.y[0] == 1:
    print("Charged pion")
nx.draw(to_networkx(graph), 
        cmap='spring', 
        with_labels=True,
        font_weight='bold',
        node_color = np.arange(graph.num_nodes),
        node_size=200, linewidths=6)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 2: Defining a simple graph</span>
How many edges are in each graph? Report an integer value for your answer.
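
The dataset class above connects every ordered pair of distinct nodes, so the edge count depends only on the number of nodes. As a minimal sketch (the helper name `fully_connected_edges` is hypothetical), the same list comprehension can be checked against the closed form $n(n-1)$ for a few small graphs:

```python
def fully_connected_edges(num_nodes):
    """All ordered pairs (i, j) with i != j, i.e. fully connected with no self-loops."""
    return [(i, j) for i in range(num_nodes) for j in range(num_nodes) if i != j]

for n in (2, 3, 4):
    edges = fully_connected_edges(n)
    # Each node connects to the other n - 1 nodes, and both directions are stored
    assert len(edges) == n * (n - 1)
    print(n, "nodes ->", len(edges), "directed edges")
```

Because each direction is stored as its own column of `edge_index`, the count above is of directed edges.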

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 3: Node features</span>
How many features are associated with each node?

In [None]:
args = {
    "device" : 'cuda' if torch.cuda.is_available() else 'cpu',
    "hidden_size" : 64,
    "epochs" : 10,
    "lr" : 0.001,
    "num_layers": 3,
    "batch_size": 32,
}

dataset = dataset.shuffle()
dataset_train = dataset[:int(0.8*len(dataset))]
dataset_val = dataset[int(0.8*len(dataset)):int(0.9*len(dataset))]
dataset_test = dataset[int(0.9*len(dataset)):]

print(f'Number of training graphs: {len(dataset_train)} ({100*len(dataset_train)/len(dataset):.0f}% of total)')
print(f'Number of val graphs: {len(dataset_val)} ({100*len(dataset_val)/len(dataset):.0f}% of total)')
print(f'Number of test graphs: {len(dataset_test)} ({100*len(dataset_test)/len(dataset):.0f}% of total)')

train_loader = DataLoader(dataset_train, collate_fn=Batch.collate(),\
    batch_size=args["batch_size"], shuffle=True)
val_loader = DataLoader(dataset_val, collate_fn=Batch.collate(),\
    batch_size=args["batch_size"])
test_loader = DataLoader(dataset_test, collate_fn=Batch.collate(),\
    batch_size=args["batch_size"])

In [None]:
from torch.nn import Linear
import torch.nn.functional as F
from torch_geometric.nn import GCNConv
from torch_geometric.nn import global_mean_pool

class GCN(torch.nn.Module):
    def __init__(self, hidden_channels):
        super(GCN, self).__init__()
        torch.manual_seed(12345)
        self.conv1 = GCNConv(dataset.num_node_features, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, hidden_channels)
        self.conv3 = GCNConv(hidden_channels, hidden_channels)
        self.lin = Linear(hidden_channels, 1)

    def forward(self, x, edge_index, batch):
        ### 1. apply several graph convolutions
        x = self.conv1(x, edge_index)
        x = x.relu()
        x = self.conv2(x, edge_index)
        x = x.relu()
        x = self.conv3(x, edge_index)

        ### 2. aggregate embeddings across the graph using the "mean" aggregation function
        x = global_mean_pool(x, batch)  # shape: [batch_size, hidden_channels]

        ### 3. apply dropout & get a final graph-level output using a linear layer
        x = F.dropout(x, p=0.5, training=self.training)
        x = self.lin(x)
        
        return x

In [None]:
model = GCN(hidden_channels=args['hidden_size'])
print(model)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 4: Order of operations during training</span>
Put the following operations in order and add them to the training loop below. 

- a) `loss.backward()`
- b) `optimizer.step()`
- c) `loss = criterion(out, data.y.float())`
- d) `out = model(data.x, data.edge_index, data.batch).squeeze()`
- e) `optimizer.zero_grad()`

Then, fill out the `test` function to iterate over a generic dataloader and return the average accuracy and loss values.

In [None]:
optimizer = torch.optim.Adam(model.parameters(), lr=args['lr'])
criterion = torch.nn.BCEWithLogitsLoss() # no need to apply sigmoid activation b/c BCEWithLogitsLoss() has this built in

def train():
    model.train()

    for data in train_loader:
        ???

def test(loader):
    model.eval()
    correct = 0
    loss_ = 0
    for data in loader:
        ???
    return correct / len(loader.dataset), loss_ / len(loader.dataset)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 5: Prediction accuracy before training</span>
What is the accuracy on the test set *before* training the model? Report your answer rounded to the nearest 10%. 
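
Since the model returns raw logits (the sigmoid lives inside `BCEWithLogitsLoss`), computing an accuracy requires turning each logit into a 0/1 prediction first. The sketch below is one way to do that in plain Python, assuming a decision threshold of 0.5 on the sigmoid output; the helper names and the toy numbers are hypothetical.

```python
import math

def sigmoid(z):
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def accuracy_from_logits(logits, labels, threshold=0.5):
    """Fraction of examples whose thresholded sigmoid(logit) matches the 0/1 label."""
    preds = [1 if sigmoid(z) > threshold else 0 for z in logits]
    correct = sum(int(p == int(y)) for p, y in zip(preds, labels))
    return correct / len(labels)

toy_logits = [-2.0, 0.3, 1.5, -0.1]   # hypothetical raw model outputs
toy_labels = [0, 1, 0, 0]             # hypothetical truth labels
print(accuracy_from_logits(toy_logits, toy_labels))
```

A threshold of 0.5 on the probability is equivalent to asking whether the logit is positive, so the sigmoid can be skipped when only the predicted class is needed.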

In [None]:
import wandb
use_wandb = True 

# Initialize W&B run for training (only when logging is enabled)
if use_wandb:
    wandb.init(project="pset2_classification") # name your project whatever you like

for epoch in range(args["epochs"]):
    train()
    train_acc, train_loss = test(train_loader)
    val_acc, val_loss = test(val_loader)
    
    # Log metrics to W&B
    if use_wandb:
        wandb.log({
            "train/loss": train_loss,
            "train/acc": train_acc,
            "val/acc": val_acc,
            "val/loss": val_loss,
        })

    torch.save(model, "pion_classification_model.pt")

# Finish the W&B run
if use_wandb:
    wandb.finish()

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 6: Prediction accuracy after training</span>
What is the accuracy on the test set *after* training the model? Report your answer rounded to the nearest 5%. 

# Pion regression
The goal in this section is to predict the truth pion energy from the cell-level energies of each pion topo-cluster together with the leading track information. This graph is much more exciting, since it is constructed from as many nodes as there are calorimeter cells with energy deposits > 0.5 GeV, plus one additional node for the track information. The global feature is the total cluster energy, and the regression target is the truth pion energy. 

In [None]:
### Load data
### we're restricting this to one file just for speed purposes, but there are more files if you want to try training for longer on your own
### keep it as n_files = 1 for this problem set 
n_files = 1 
df_pion = pd.concat([pd.DataFrame(np.load(file, allow_pickle=True).item()) for file in tqdm(pion_files[:n_files])])
print("Pion dataframe has {:,} events.".format(df_pion.shape[0]))

In [None]:
def clean_dataframe(df): 
    ### Start the dataframe of inputs 
    max_n_cols = pd.DataFrame(df.cluster_E.to_list()).shape[1]
    df2 = pd.DataFrame(pd.DataFrame(df.cluster_E.to_list(), columns=["cluster_e_"+str(x) for x in np.arange(max_n_cols)]))
    
    df3 = pd.DataFrame(pd.DataFrame(df.cluster_Eta.to_list(), columns=["cluster_eta_"+str(x) for x in np.arange(max_n_cols)]))
    df2['cluster_eta_0'] = df3['cluster_eta_0'] 
    
    df3 = pd.DataFrame(pd.DataFrame(df.cluster_Phi.to_list(), columns=["cluster_phi_"+str(x) for x in np.arange(max_n_cols)]))
    df2['cluster_phi_0'] = df3['cluster_phi_0']   
    
    ### Add cluster cell energy
    log10_cluster_cell_e = []
    for i in range(len(df2)): 
        log10_cluster_cell_e.append(np.array(np.log10(df.cluster_cell_E.iloc[i][0]))) # only cells from leading cluster
    df3["log10_cluster_cell_e"] = log10_cluster_cell_e
    max_n_cells = pd.DataFrame(df3.log10_cluster_cell_e.to_list()).shape[1]
    df_cells = pd.DataFrame(pd.DataFrame(df3.log10_cluster_cell_e.to_list(), columns=["log10_cluster_cell_e_"+str(x) for x in np.arange(max_n_cells)]))
    df2 = pd.concat([df2, df_cells], axis="columns")
    
    ### Add track pT & truth particle E 
    track_pt = np.array(df.trackPt.explode())
    truth_particle_e = np.array(df.truthPartE.explode())
    track_eta = np.array(df.trackEta.explode())
    track_phi = np.array(df.trackPhi.explode())
    track_z0 = np.array(df.trackZ0.explode())

    df2["track_pt"] = track_pt
    df2["track_eta"] = track_eta
    df2["track_phi"] = track_phi
    df2["track_z0"] = track_z0
    df2["truth_particle_e"] = truth_particle_e
        
    ### Cluster_E > 0.5
    df2 = df2[df2.cluster_e_0 > 0.5]

    ### Lose outliers in track pT 
    df2 = df2[df2.track_pt < 5000]
    
    ### Cast as float
    df2 = df2.astype('float32')

    ### Add the log of all energy variables
    for var in ['cluster_e_0', 'track_pt', 'truth_particle_e']:
        df2['log10_'+var] = np.log10(df2[var])

    ### Drop infs/NaNs 
    df2.replace([np.inf, -np.inf], np.nan, inplace=True)
    df2 = df2.fillna(0)
    
    ### Reduce variables
    vars = [
    'log10_cluster_e_0', 
    'log10_track_pt',
    'track_eta', 
    'track_phi',
    'track_z0',
    'log10_truth_particle_e',
    'cluster_eta_0',
    'cluster_phi_0',
             ]
    
    vars += [var for var in df2.keys() if "cell" in var]
    
    df2 = df2[vars]
    
    df2['num_cells_lead_cluster'] = np.sum(df2[[var for var in df2.keys() if "scaled" not in var and "cell" in var]] != 0, axis=1)

    ### Drop infs/NaNs 
    df2.replace([np.inf, -np.inf], np.nan, inplace=True)
    df2 = df2.fillna(0)
    
    return df2

In [None]:
df = clean_dataframe(df_pion)

Let's inspect the dataframe we just assembled:

In [None]:
df

And now we'll convert the dataframe into a graph structure:

In [None]:
class PionDataset_Regression(Dataset):
    def __init__(self, dataframe, 
                 cluster_features=['log10_cluster_e_0'], 
                 track_features=['log10_track_pt', 'track_eta'],
                 transform=None, 
                 pre_transform=None):
        self.dataframe = dataframe
        self.cluster_features = cluster_features
        self.track_features = track_features
        super().__init__(None, transform, pre_transform)
        print(f"Initialized PionDataset with {len(dataframe)} samples")
    
    def len(self):
        return len(self.dataframe)
    
    def get(self, index):
        """Generates one sample of data"""
        dataframe = self.dataframe
        
        ### Define nodes 
        cluster_features = self.cluster_features
        track_features = self.track_features

        ### define nodes with topo-cluster CELL energies! 
        cell_info = ['log10_cluster_cell_e_'+str(i) for i in range(int(dataframe.iloc[index].num_cells_lead_cluster))]
        cluster_nodes = np.zeros((int(dataframe.iloc[index].num_cells_lead_cluster), 1+len(cluster_features)+len(track_features)))
        cluster_nodes[:,0] = dataframe.iloc[index][cell_info]
        
        cluster_global_node = np.array(dataframe.iloc[index][cluster_features])
        cluster_global_node = np.concatenate([np.zeros(1), cluster_global_node, np.zeros(len(track_features))]) # layout: [cell energy slot, cluster features, track features]
        
        track_node = np.array(dataframe.iloc[index][track_features])
        track_node = np.concatenate([np.zeros(len(cluster_features)+1), track_node]) # cell and cluster slots are zero; track features come last
        
        nodes = np.vstack([cluster_nodes, cluster_global_node, track_node]) # shape = (num_nodes, num_node_features)
                
        ### Define edges (fully-connected, but no self-loops)
        edges = [(i, j) for i in range(nodes.shape[0]) for j in range(nodes.shape[0]) if i != j]
        edge_index = np.array(edges).T  # Shape: (2, num_edges)
        
        ### Define target labels 
        target = np.array([dataframe.iloc[index]['log10_truth_particle_e']])
                
        ### Convert to torch tensors
        nodes = torch.tensor(nodes, dtype=torch.float)
        edge_index = torch.tensor(edge_index, dtype=torch.long)
        target = torch.tensor(target, dtype=torch.float)
        
        return Data(x=nodes, y=target, edge_index=edge_index)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 7: log(E)</span>
What is the main reason to take the log of the energy?

- a) to compress the dynamic range and make the distribution easier to train on
- b) to convert negative values into positive ones
- c) to increase the precision of low-energy measurements
- d) to satisfy the requirement that neural networks can only process logarithmic inputs

Create the pion dataset:

In [None]:
cluster_features = ['log10_cluster_e_0']
track_features = ['log10_track_pt', 'track_eta', 'track_phi', 'track_z0']
dataset = PionDataset_Regression(df, cluster_features, track_features)

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 8: Maximum number of nodes</span>
What is the maximum number of nodes in any graph in the dataset? 

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 9: Average number of nodes</span>
What is the average number of nodes across all the graphs, rounded to the nearest integer? 
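
Both node-count questions come down to collecting one node count per graph and summarizing the resulting array. A minimal sketch, using hypothetical all-zero node-feature matrices in place of real graphs:

```python
import numpy as np

# Hypothetical node-feature matrices, one per graph: shape (num_nodes, num_features)
toy_graphs = [np.zeros((5, 6)), np.zeros((12, 6)), np.zeros((7, 6))]

# The number of nodes in a graph is the number of rows in its feature matrix
node_counts = np.array([x.shape[0] for x in toy_graphs])

print("max nodes:", node_counts.max())
print("mean nodes:", round(float(node_counts.mean())))
```

For the graphs produced by `PionDataset_Regression`, the same count is available as `graph.num_nodes`, and from the `vstack` in `get` it equals the number of cell nodes plus one global cluster node and one track node.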

In [None]:
args = {
    "device" : 'cuda' if torch.cuda.is_available() else 'cpu',
    "hidden_size" : 32,
    "epochs" : 2, ### keeping the training short
    "lr" : 0.001,
    "num_layers": 3,
    "batch_size": 32,
}

dataset = dataset.shuffle()
dataset_train = dataset[:int(0.8*len(dataset))]
dataset_val = dataset[int(0.8*len(dataset)):int(0.9*len(dataset))]
dataset_test = dataset[int(0.9*len(dataset)):]

print(f'Number of training graphs: {len(dataset_train)} ({100*len(dataset_train)/len(dataset):.0f}% of total)')
print(f'Number of val graphs: {len(dataset_val)} ({100*len(dataset_val)/len(dataset):.0f}% of total)')
print(f'Number of test graphs: {len(dataset_test)} ({100*len(dataset_test)/len(dataset):.0f}% of total)')

train_loader = DataLoader(dataset_train, collate_fn=Batch.collate(),\
    batch_size=args["batch_size"], shuffle=True)
val_loader = DataLoader(dataset_val, collate_fn=Batch.collate(),\
    batch_size=args["batch_size"])
test_loader = DataLoader(dataset_test, collate_fn=Batch.collate(),\
    batch_size=args["batch_size"])

Using the `args` specified above, define the same GCN architecture that was used above, but make sure your input dimension reflects your new dataset. You should also change your loss function accordingly for the new task.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 10: Regression performance before training</span>
What is the MSE of the model before training is performed? Round to the nearest integer value.

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 11: Regression performance after training</span>
What is the RMSE (i.e. $\sqrt{\text{MSE}}$) of the model after training is performed over 2 epochs? Round to the nearest integer value and include units.
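
As a reminder of how the two metrics relate, the sketch below computes the MSE and RMSE on made-up numbers (the helper name `mse` and the toy arrays are hypothetical). The RMSE has the same units as the regression target, while the MSE has those units squared.

```python
import numpy as np

def mse(pred, target):
    """Mean squared error between two equally sized arrays."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return float(np.mean((pred - target) ** 2))

toy_pred = [1.0, 2.0, 4.0]      # hypothetical predictions
toy_target = [1.0, 1.0, 2.0]    # hypothetical targets
toy_mse = mse(toy_pred, toy_target)   # (0 + 1 + 4) / 3
toy_rmse = toy_mse ** 0.5
print(f"MSE = {toy_mse:.3f}, RMSE = {toy_rmse:.3f}")
```

When the loss is accumulated batch by batch with a mean-reducing criterion such as `torch.nn.MSELoss()`, weighting each batch by its number of graphs keeps the final average consistent with dividing by `len(loader.dataset)`.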

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 12: Performance evaluation</span>

Make some plots to evaluate the quality of the trained model. What is true of the performance overall? 

- a) The model is predicting the same value for every test input. 
- b) For true pion energies < 200 GeV, the model is generally predicting energy values that are too small. 
- c) For true pion energies < 200 GeV, the model is generally predicting energy values that are too large. 
- d) For true pion energies > 200 GeV, the model is generally predicting energy values that are too small. 
- e) For true pion energies > 200 GeV, the model is generally predicting energy values that are too large. 

### <span style="border:3px; border-style:solid; padding: 0.15em; border-color: #1f77b4; color: #1f77b4;">Problem 13 (challenge): Response IQR</span>
Define a custom function that computes the Interquantile Range (IQR), i.e. the spread between $+1\sigma$ and $-1\sigma$. We will report 1/2 the IQR divided by the median value using the following reasoning: 
- What % of the data is expected to fall within $\pm1\sigma$ in a Gaussian distribution?
- Use `np.percentile` to calculate the values of a generic input $x$ at $\pm1\sigma$ from the *median* value of $x$.
- Take the difference of these two values to find the IQR.
- Divide by 2 to get 1/2 the IQR.
- Divide by the median of $x$.

Use your custom function as a statistic within `scipy.stats.binned_statistic` applied to: 
- $\hat{x}$ axis: True Particle Energy
- $\hat{y}$ axis: Predicted/True Particle Energy
- `xbins = [10**exp for exp in np.arange(-1., 3.1, 0.2)]`

Make a plot of 1/2 the response IQR divided by the median. Plot each value at the center of its corresponding xbin. What is the value of this quantity for a truth particle energy of 10 GeV?
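
The recipe above can be sketched as follows. This is one possible reading of the steps, not a reference solution: the function name `half_iqr_over_median` is hypothetical, the percentiles assume the Gaussian convention that about 68.27% of values lie within $\pm1\sigma$, and the input is a toy Gaussian sample rather than model output.

```python
import numpy as np

def half_iqr_over_median(x):
    """Half the spread between the -1 sigma and +1 sigma quantiles, divided by the median."""
    x = np.asarray(x, dtype=float)
    # About 68.27% of a Gaussian lies within +/- 1 sigma of the centre,
    # which leaves about 15.87% in each tail.
    lo, med, hi = np.percentile(x, [15.865, 50.0, 84.135])
    return 0.5 * (hi - lo) / med

# Toy predicted/true ratios drawn from a Gaussian (assumed values, not real results)
rng = np.random.default_rng(0)
toy_response = rng.normal(loc=1.0, scale=0.1, size=100_000)
value = half_iqr_over_median(toy_response)
print(f"{value:.3f}")  # for a Gaussian sample this is close to scale / loc
```

Because the function takes a 1-D array and returns a single number, it can be passed as the `statistic` argument of `scipy.stats.binned_statistic`.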