# OTUS |  $ p p > t \bar{t} $ 

This notebook evaluates the trained model on validation data and saves the results to an npz file for the ablation study. Details about the problem are given below.

This notebook applies OTUS to our second test case: semi-leptonic $t \bar{t}$ decay.

Our physical latent-space is the $e^-$, $\bar{\nu}_e$, $b$, $\bar{b}$, $u$, $\bar{d}$ 4-momentum information produced by the program MadGraph.

Our data-space data is the $e^-$, $MET$, $jet1$, $jet2$, $jet3$, $jet4$ 4-momentum information produced by the program Delphes. The jets are ordered by descending $p_T$.

We arrange this information into 24-dimensional vectors:

- Latent space (z): [$p^{\mu}_{e-}$,$p^{\mu}_{\bar{\nu}_e}$,$p^{\mu}_{b}$,$p^{\mu}_{\bar{b}}$,$p^{\mu}_{u}$,$p^{\mu}_{\bar{d}}$]
- Data space (x): [$p^{\mu}_{e^-}$,$p^{\mu}_{MET}$,$p^{\mu}_{jet1}$,$p^{\mu}_{jet2}$,$p^{\mu}_{jet3}$,$p^{\mu}_{jet4}$]

where $p^{\mu}=[p_x, p_y, p_z, E]$ is the 4-momentum of the given particle.
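The layout above can be unpacked with a simple reshape. The following is a minimal sketch, assuming the four components of each object are stored contiguously in the order listed; the `events` array is made-up data used only for illustration.

```python
import numpy as np

# Hypothetical batch of 3 events, each a flat 24-dimensional vector
events = np.arange(3 * 24, dtype=np.float32).reshape(3, 24)

# View each event as 6 objects x 4 components [p_x, p_y, p_z, E]
four_momenta = events.reshape(-1, 6, 4)

p3 = four_momenta[..., :3]      # 3-momenta, shape (3, 6, 3)
energy = four_momenta[..., 3]   # energies, shape (3, 6)
```

Indexing `four_momenta[:, k]` then selects the k-th object (for example, `k = 1` is $\bar{\nu}_e$ in latent space or $MET$ in data space).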

###### Additional Losses and Constraints:
We impose the following additional losses and constraints in this problem.

As in the $p p > Z > e^+ e^-$ test case, we explicitly enforce the Minkowski metric in the output of the networks. Namely, the networks predict the 3-momenta ($\vec{p}$) of the particles. Energy information is then restored using the Minkowski metric: $E^2 = |\vec{p}|^2 + m^2$.
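As a rough illustration of the energy-restoration step, the sketch below appends $E = \sqrt{|\vec{p}|^2 + m^2}$ to a batch of 3-momenta. The function name `restore_energy` and the array shapes are assumptions made for this example, not the model's actual implementation; the mass values mirror the latent-space masses defined later in this notebook.

```python
import numpy as np

def restore_energy(p3, masses):
    """Append E = sqrt(|p|^2 + m^2) to 3-momenta of shape (N, 6, 3)."""
    p_sq = np.sum(p3 ** 2, axis=-1)           # |p|^2, shape (N, 6)
    energy = np.sqrt(p_sq + masses ** 2)      # masses broadcast over events
    return np.concatenate([p3, energy[..., None]], axis=-1)  # (N, 6, 4)

# One made-up event: only the fifth object has non-zero momentum
p3 = np.zeros((1, 6, 3))
p3[0, 4] = [3.0, 4.0, 0.0]
masses = np.array([0., 0., 4.7, 4.7, 0., 0.])

p4 = restore_energy(p3, masses)
```

By construction every output 4-vector satisfies the mass-shell relation exactly, so the networks only need to predict 18 numbers (6 objects times 3 momentum components) per event.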

We also explicitly enforce the lower $p_T$ threshold on jets, which requires that $p_T>20$ GeV. Only samples generated by the decoder which pass this threshold are used to calculate losses. This requires modifying the data-space loss term slightly. Additionally, to help with stable training, we choose a ResNet architecture for both our encoder and decoder networks.
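The threshold can be pictured as a per-event boolean mask. The sketch below is a hypothetical stand-in, not the `threshold_check` function used later in this notebook; it assumes the four jets occupy the last four 4-vectors of each data-space vector and that $p_T = \sqrt{p_x^2 + p_y^2}$.

```python
import numpy as np

def jet_pt_mask(x, pt_min=20.0):
    """Boolean mask of events whose four jets all have p_T > pt_min (GeV)."""
    jets = x.reshape(-1, 6, 4)[:, 2:, :]        # objects 2..5 are the jets
    pt = np.hypot(jets[..., 0], jets[..., 1])   # p_T = sqrt(p_x^2 + p_y^2)
    return np.all(pt > pt_min, axis=1)

# Two made-up events
x = np.zeros((2, 6, 4))
x[:, 2:, 0] = 30.0   # every jet gets p_x = 30 GeV
x[1, 5, 0] = 10.0    # jet4 of the second event falls below the threshold

mask = jet_pt_mask(x.reshape(2, 24))  # first event passes, second fails
```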

See the paper for more details: https://arxiv.org/abs/2101.08944.

# Load Required Libraries

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import torch
import numpy as np
import os

root_dir = '../../../../'

#-- Add utilityFunctions/ to easily use utility .py files --#
import sys
sys.path.append(os.path.join(root_dir, "utilityFunctions/"))

#-- Determine if using GPU or CPU --#
os.environ['CUDA_VISIBLE_DEVICES'] = '2'  # Set to '-1' to disable GPU
from configs import device, data_dims

print('Using device:', device)

Using device: cpu


# Meta Parameters

In [2]:
#-- Set appropriate lambda value --#
# allLambs    = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
selectLambs = [0.001, 10, 100]
lamb = selectLambs[0]
print('lamb = ', lamb)

lamb =  0.001


In [3]:
#-- Set directory short-cuts --#
data_directory    = os.path.join(root_dir, "data/")
dataset_name      = 'ppttbar'

#-- Set random seeds --#
seed = 2
torch.manual_seed(seed)
np.random.seed(seed)

#-- Raw or standardized inputs/outputs --#
# If True, model inputs/outputs should be in the "raw" (unstandardized) space
raw_io = True  

#-- Set data type --#
from configs import float_type
print('Using data type: ', float_type)

Using data type:  float32


# Load Validation Data for Ablation Study

In [4]:
from func_utils import get_dataset, standardize
from torch.utils.data import DataLoader

#-- Get training and validation dataset --#
dataset = get_dataset(dataset_name, data_dir=data_directory)
z_data, x_data = dataset['z_data'], dataset['x_data']
print("Data total shapes: ",z_data.shape, x_data.shape)

x_dim = int(x_data.shape[1])
z_dim = int(z_data.shape[1])

#-- Split into training and validation sets --#
train_size = 222761
val_size = 40000  # Validation set used to evaluate/tune models

x_train = x_data[:train_size, :]
x_val = x_data[train_size:train_size+val_size, :]

z_train = z_data[:train_size, :]
z_val = z_data[train_size:train_size+val_size, :]

#-- Convert data to proper type --#
x_train, x_val, z_train, z_val = list(map(lambda x: x.astype(float_type), [x_train, x_val, z_train, z_val]))

#-- Obtain mean and std information --#
# This is needed to standardize/unstandardize data
x_train_mean, x_train_std = np.mean(x_train, axis=0), np.std(x_train, axis=0)
z_train_mean, z_train_std = np.mean(z_train, axis=0), np.std(z_train, axis=0)

# If raw_io == False, then standardize the data with training set statistics
if not raw_io:
    x_train = (x_train - x_train_mean) / x_train_std
    x_val = (x_val - x_train_mean) / x_train_std
    z_train = (z_train - z_train_mean) / z_train_std
    z_val = (z_val - z_train_mean) / z_train_std

#-- Set evaluation parameters --#
eval_batch_size = 20000  # Always use high batch size on validation set to accurately assess performance
eval_loaders = DataLoader(dataset=x_val, batch_size=eval_batch_size, shuffle=True), \
               DataLoader(dataset=z_val, batch_size=eval_batch_size, shuffle=True)

print("z_train shape, x_train shape: ", z_train.shape, x_train.shape)
print("z_val   shape, x_val   shape: ", z_val.shape, x_val.shape)

Data total shapes:  (262761, 24) (262761, 24)
z_train shape, x_train shape:  (222761, 24) (222761, 24)
z_val   shape, x_val   shape:  (40000, 24) (40000, 24)


In [5]:
#-- Define dictionary object for easy reference --#
all_arrs = {'train': {}, 'val': {}}  # This will store all numpy arrays of interest
all_arrs['train']['x'] = x_train
all_arrs['train']['z'] = z_train
all_arrs['val']['x']   = x_val
all_arrs['val']['z']   = z_val

### Define target invariant masses (for both training and validation data)

Invariant mass relation: $m^2 = E^2 - |\vec{p}|^2$. For objects with ill-defined mass (MET and Jets) we fix $m=0$.

In [6]:
x_inv_masses = np.zeros(6)
z_inv_masses = np.array([0., 0., 4.7, 4.7, 0., 0.])

# Train

## Import Training Specific Libraries and Functions

In [7]:
import torch
from torch import optim
import torch.nn as nn
from ppttbar_constraints import threshold_check
from ppttbar_utils import train_and_val

## Define Meta Network Parameters

In [8]:
from models import Autoencoder, StochasticResNet

## Define Model and Hyperparameters

###### Latent loss function:
Finite sample approximation of Sliced Wasserstein Distance (SWD) between $p(z)$ and $p_E(z) = \int_x p(x) p_E(z|x)$

- $L_{latent}(Z, \tilde{Z}) = \frac{1}{L \cdot M} \sum_{l=1}^{L} \sum_{m=1}^{M} c((\theta_l \cdot z_m)_{sorted}, (\theta_l \cdot \tilde{z}_m)_{sorted})$

where $c(\cdot, \cdot) = |\cdot - \cdot|^2$
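The formula above can be sketched in a few lines of NumPy. This is an illustrative version under stated assumptions, not the training code: the slice directions $\theta_l$ are drawn as normalized Gaussian vectors, both samples contain the same number $M$ of points, and the names `sliced_wasserstein`, `num_slices`, and `rng` are chosen for this example.

```python
import numpy as np

def sliced_wasserstein(z, z_tilde, num_slices=100, rng=None):
    """Finite-sample SWD with squared cost between two (M, dim) samples."""
    rng = np.random.default_rng(rng)
    theta = rng.normal(size=(num_slices, z.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions

    proj = np.sort(z @ theta.T, axis=0)              # (M, L), sorted per slice
    proj_tilde = np.sort(z_tilde @ theta.T, axis=0)
    return np.mean((proj - proj_tilde) ** 2)         # average over L * M terms

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 24))

same = sliced_wasserstein(z, rng.permutation(z), rng=1)   # same point set, reordered
shifted = sliced_wasserstein(z, z + 1.0, rng=1)           # translated copy
```

Because each projection is sorted before comparison, the loss depends only on the two empirical distributions and not on how the samples are paired, which is why reordering `z` leaves it at (numerically) zero.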

###### Data loss function:
- $L_{data}(X, \tilde{X}) = \frac{1}{M} \sum_{m=1}^M [\frac{1_S(\tilde{x}_m)}{p_D(S)} c(x_m,  \tilde{x}_m)]$

where $c(\cdot, \cdot) = |\cdot - \cdot|^2$; $1_S(x)$ is the indicator function of $S$ so that it equals $1$ if $x \in S$, and $0$ otherwise, and $p_D(S) := \int dt p_D(t) 1_S(t)$ normalizes this distribution.
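A minimal sketch of this masked loss follows. It assumes that $c$ is the squared Euclidean distance summed over components and that $p_D(S)$ is estimated by the fraction of the batch passing the threshold; the function name `masked_data_loss` and the toy inputs are hypothetical.

```python
import numpy as np

def masked_data_loss(x, x_tilde, passes):
    """Data loss restricted to samples with passes == True, reweighted by 1 / p_D(S)."""
    cost = np.sum((x - x_tilde) ** 2, axis=1)   # c(x_m, x~_m) per sample
    p_s = passes.mean()                         # batch estimate of p_D(S)
    if p_s == 0:
        return 0.0                              # no sample passed the threshold
    return float(np.mean(passes * cost) / p_s)

x = np.zeros((4, 2))
x_tilde = np.array([[1., 0.], [0., 2.], [3., 0.], [0., 0.]])
passes = np.array([True, True, False, False])

loss = masked_data_loss(x, x_tilde, passes)
```

With the batch estimate of $p_D(S)$, the expression reduces to the mean cost over the passing samples only, so rejected decoder outputs contribute nothing to the loss.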

###### Full loss function:
- $L_{tot} = \beta L_{data}(X, \tilde{X}) + \lambda L_{latent}(Z, \tilde{Z})$ 

###### Core Hyperparameters
The hyperparameter definitions are as follows:

- num_hidden_layers: The number of hidden layers in both the encoder and decoder networks
- dim_per_hidden_layer: The dimensions per hidden layer in both the encoder and decoder networks
- lr: The learning rate of the networks
- lamb: The $\lambda$ coefficient in front of the latent loss term
- num_slices: Number of random projections used for computing SWD
- epochs: The number of epochs used during training

Hyperparameters for other losses that were tried; their use during main training is currently discouraged:

- tau: Coefficient in front of the alternate data-space loss ("alt_x_loss"), which is the SWD between $p(x)$ and $p_D(x):=\int_z p(z) p_D(x|z)$
- rho: Coefficient in front of an additional decoder constraint loss (based on soft-penalty approach to learning hard thresholds/ttbar_constraints)

###### Joint Training Hyperparameters
- beta: Coefficient in front of data loss, $L_{data}$ 
- nu_e: Coefficient in front of the encoder "anchor loss" 
- nu_d: Coefficient in front of the decoder "anchor loss" 

In [9]:
# Note: most of the unspecified hyperparameters are set to 0 by default

# common configs
joint_step_config = {
    'lr': 0.001,
    'beta': 1.,  # coefficient in front of data loss, E[c(x, x reconstructed)], where c is typically the 2-norm.
    'lamb': lamb,  # coefficient in front of latent loss, SWD between p(z) and Q(z):=\int_x p(x) Q(z|x)
    'tau': 0,  # coefficient in front of "alt_x_loss", which is the SWD between p(x) and p_G(x):=\int_z p(z) p_G(x|z);
               # this loss is not part of the original WAE formulation and is not used.
    'rho': 0, # coef in front of decoder constraint loss (based on soft-penalty approach to learning hard thresholds/ttbar_constraints)
    'nu_e': 0,  # coefficient in front of encoder "anchor loss"
    'nu_d': 0,  # coefficient in front of decoder "anchor loss"
    'epochs': 1000,
    'log_freq': 100,
}


decoder_finetuning_config = {
    'beta': 0,  # coefficient in front of data loss in (S)WAE objective
    'tau': 1, # coefficient in front of "alt_x_loss", which is the SWD between p(x) and p_D(x)
    'lamb': 0, # disable latent loss
    'rho': 0, # no x_constraint_loss (no longer used)
    'nu_e': 0,  # anchor loss
    'nu_d': 0,
    'lr': 0.0001,  # reduced lr for fine-tuning
    'epochs': 10,
    'log_freq': 1,
}


hidden_layer_dims = [64, 64]
activation = torch.nn.ReLU
from models import Autoencoder, StochasticResNet
model = Autoencoder(x_dim, z_dim, ConditionalModel=StochasticResNet, encoder_hidden_layer_dims=hidden_layer_dims,
                    stoch_enc=True, stoch_dec=True, activation=activation, raw_io=raw_io,
                    x_inv_masses=x_inv_masses, x_stats=np.stack([x_train_mean, x_train_std]),
                    z_inv_masses=z_inv_masses, z_stats=np.stack([z_train_mean, z_train_std]),
                    # ResNet settings:
                    io_residual=True,
                    res_mlp_depth=2
                            )

In [10]:
# Print model 
model

Autoencoder(
  (encoder): StochasticResNet(
    (nn): Sequential(
      (0): Linear(in_features=42, out_features=64, bias=True)
      (1): ResBlock(
        (module): Sequential(
          (0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=64, bias=True)
          (3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (4): ReLU()
          (5): Linear(in_features=64, out_features=64, bias=True)
        )
      )
      (2): Linear(in_features=64, out_features=18, bias=True)
    )
  )
  (decoder): StochasticResNet(
    (nn): Sequential(
      (0): Linear(in_features=42, out_features=64, bias=True)
      (1): ResBlock(
        (module): Sequential(
          (0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=64, bias=True)
          (3): B

# Evaluate Trained Model on Validation Data

For easier downstream analysis we also evaluate the trained model on our validation dataset.

In [11]:
#-- Reset random seeds --#
seed = 2
torch.manual_seed(seed)
np.random.seed(seed)

#-- Evaluate trained model on validation dataset --#

# Use CPU instead of GPU
model.to('cpu')
model.encoder.output_stats.to('cpu')
model.decoder.output_stats.to('cpu')

#-- Set save directory location for npz files --#
save_dir      = './npzFiles/'
save_filename = f'swae-lamb={lamb}.npz'

#-- Load model's trained weights and set to evaluation mode --#
model.load_state_dict(torch.load(f'swae-lamb={lamb}.pkl', map_location=torch.device('cpu')))
model.eval()

Autoencoder(
  (encoder): StochasticResNet(
    (nn): Sequential(
      (0): Linear(in_features=42, out_features=64, bias=True)
      (1): ResBlock(
        (module): Sequential(
          (0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=64, bias=True)
          (3): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (4): ReLU()
          (5): Linear(in_features=64, out_features=64, bias=True)
        )
      )
      (2): Linear(in_features=64, out_features=18, bias=True)
    )
  )
  (decoder): StochasticResNet(
    (nn): Sequential(
      (0): Linear(in_features=42, out_features=64, bias=True)
      (1): ResBlock(
        (module): Sequential(
          (0): BatchNorm1d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
          (1): ReLU()
          (2): Linear(in_features=64, out_features=64, bias=True)
          (3): B

In [12]:
#-- Retrieve the validation dataset from the dictionary --#
# z_val and x_val already defined from above
print(all_arrs['val']['x'].shape, all_arrs['val']['z'].shape)

# Evaluate trained model on validation dataset
arrs = all_arrs['val']

arrs['z_decoded'] = model.decode(torch.from_numpy(arrs['z'])) # p_D(x) = \int_z p(z) p_D(x|z)  "x_pred_truth"
arrs['x_encoded'] = model.encode(torch.from_numpy(arrs['x'])) # p_E(z) = \int_x p(x) p_E(z|x)  "z_pred"
arrs['x_reconstructed'] = model.decode(arrs['x_encoded'])     # p_D(y) = \int_x \int_z p(x) p_E(z|x) p_D(y|z) "x_pred"
       
# Feed the same z input to the decoder multiple times and study the stochastic output
num_repeats = 100
num_diff_zs = 100

arrs['z_rep'] = np.array([np.repeat(arrs['z'][i:i+1], num_repeats, axis=0) for i in range(num_diff_zs)])       # "z_fixed"
z_rep_tensor = torch.from_numpy(arrs['z_rep'])                                                                 # tmp
arrs['z_decoded_rep'] = np.array([model.decode(z_rep_tensor[i]).detach().numpy() for i in range(num_diff_zs)]) # "x_pred_truth_fixed"
arrs['x_rep'] = np.array([np.repeat(arrs['x'][i:i+1], num_repeats, axis=0) for i in range(num_diff_zs)])       # "x_fixed"

# Convert all results to numpy arrays
for (field, arr) in arrs.items():
    if isinstance(arr, torch.Tensor):
        arrs[field] = arr.detach().numpy()

(40000, 24) (40000, 24)


# Get mask corresponding to events which pass threshold

In [13]:
# Import ttbar constraint function
from ppttbar_constraints import threshold_check

In [14]:
#-- Create new arrays from model output that passes cuts --#
# Only save mask to dictionary

arrs = all_arrs['val']

for field in ('z_decoded', 'x_reconstructed'):
    arr_raw = arrs[field] # Raw = unstandardized
    
    # Get mask for events that pass threshold constraint
    good_mask = threshold_check(arr_raw)
    print('passing rate of', field, good_mask.mean())

    # Store masks that determine event-by-event passing
    arrs[field+'_good_mask'] = good_mask

# Print all keys in 'val' category
print(arrs.keys())

passing rate of z_decoded 0.68165
passing rate of x_reconstructed 0.84145
dict_keys(['x', 'z', 'z_decoded', 'x_encoded', 'x_reconstructed', 'z_rep', 'z_decoded_rep', 'x_rep', 'z_decoded_good_mask', 'x_reconstructed_good_mask'])


In [15]:
os.makedirs(save_dir, exist_ok=True)  # np.savez does not create missing directories
save_path = os.path.join(save_dir, save_filename)
np.savez(save_path, **all_arrs['val'])
print('Model results saved at', save_path)

Model results saved at ./npzFiles/swae-lamb=0.001.npz
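
For downstream analysis, the saved file can be read back with `np.load` and filtered with the stored masks. The sketch below is self-contained: it writes a small stand-in file to a temporary directory, reusing two of the key names saved above with made-up values, and then loads it the same way the real file would be loaded.

```python
import os
import tempfile
import numpy as np

# Stand-in arrays with the same key names as the saved file (values are made up)
demo = {
    'x_reconstructed': np.arange(12, dtype=np.float32).reshape(3, 4),
    'x_reconstructed_good_mask': np.array([True, False, True]),
}
path = os.path.join(tempfile.mkdtemp(), 'demo.npz')
np.savez(path, **demo)

# Load the arrays and keep only events that pass the threshold
with np.load(path) as results:
    x_rec = results['x_reconstructed']
    mask = results['x_reconstructed_good_mask']

x_rec_good = x_rec[mask]
```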
