# Dataset
I am using the "Quark and Gluon Jets" dataset from EnergyFlow [[https://energyflow.network/docs/datasets/#quark-and-gluon-jets]] for the initial part of my project[^1]. I will use this data at two different stages of my project:

**Stage 1 - Hypothesis Testing**

In this phase, I will:
  - Perform **statistical hypothesis tests** ($\chi^{2}$, Likelihood Ratio tests) to compare the **quark vs. gluon distributions**.
  - Use features like **multiplicity & jet spread** to check how different quark and gluon jets are.
  - Apply **classification techniques** to separate quark vs. gluon jets.

The goal at this stage is to develop a **baseline statistical understanding** of how quark and gluon jets differ.
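
As a rough illustration of the kind of test planned for this stage, here is a minimal sketch of a two-sample $\chi^{2}$ homogeneity statistic on binned multiplicities. The multiplicities are synthetic Poisson draws (the means 33 and 53 are made-up placeholders, not values measured from the dataset), and the function name `chi2_two_sample` and the bin edges are my own choices for illustration.

```python
import numpy as np

def chi2_two_sample(a, b, bins):
    """Chi-square homogeneity statistic for two samples sharing the same bins."""
    h_a, _ = np.histogram(a, bins=bins)
    h_b, _ = np.histogram(b, bins=bins)
    # Drop bins that are empty in both samples (expected count would be zero)
    keep = (h_a + h_b) > 0
    h_a = h_a[keep].astype(float)
    h_b = h_b[keep].astype(float)
    n_a, n_b = h_a.sum(), h_b.sum()
    # Pooled bin probabilities under the null hypothesis "same distribution"
    pooled = (h_a + h_b) / (n_a + n_b)
    exp_a, exp_b = n_a * pooled, n_b * pooled
    stat = ((h_a - exp_a) ** 2 / exp_a).sum() + ((h_b - exp_b) ** 2 / exp_b).sum()
    dof = int(keep.sum()) - 1
    return stat, dof

# Synthetic stand-ins for per-jet multiplicities (placeholder means, NOT real data)
rng = np.random.default_rng(0)
quark_mult = rng.poisson(lam=33, size=5000)
gluon_mult = rng.poisson(lam=53, size=5000)

bins = np.arange(0, 135, 5)  # hypothetical binning up to the padded length of 134
stat, dof = chi2_two_sample(quark_mult, gluon_mult, bins)
print(f"chi2 = {stat:.1f} with {dof} degrees of freedom")
```

A statistic far above the number of degrees of freedom indicates that the two binned distributions differ. If SciPy is available, `scipy.stats.chi2.sf(stat, dof)` converts the statistic into a p-value.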

**Stage 2 - Bayesian Uncertainty Quantification**

After completing hypothesis testing, I will:
  - Train a **Neural Network (NN)** to classify quark vs. gluon jets.
  - Modify the NN into a **Bayesian Neural Network (BNN)** to ***quantify*** uncertainty.
  - Test if **Uncertainty Quantification (UQ)** helps improve **classification accuracy** (e.g., are some jets “hard to classify” due to overlapping properties?).
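
One possible way to flag "hard to classify" jets is to look at the entropy of the averaged prediction over several stochastic forward passes of the network. The sketch below assumes such passes are already available as an array of quark probabilities; the numbers are invented for illustration, and `binary_predictive_entropy` is a hypothetical helper, not part of any library.

```python
import numpy as np

def binary_predictive_entropy(prob_samples):
    """Mean quark probability and predictive entropy per jet.

    prob_samples has shape (n_passes, n_jets); each entry is P(quark)
    from one stochastic forward pass.
    """
    p_mean = prob_samples.mean(axis=0)
    # Clip to avoid log(0) for fully confident predictions
    p = np.clip(p_mean, 1e-12, 1 - 1e-12)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))
    return p_mean, entropy

# Invented probabilities for 3 jets over 3 passes (NOT model output)
samples = np.array([
    [0.95, 0.55, 0.10],
    [0.97, 0.45, 0.05],
    [0.93, 0.50, 0.15],
])
p_mean, entropy = binary_predictive_entropy(samples)
hardest_jet = int(np.argmax(entropy))
print("Mean P(quark):", p_mean)
print("Predictive entropy:", entropy)
print("Most uncertain jet index:", hardest_jet)
```

The entropy is largest when the averaged probability is close to 0.5, so the jet in the middle column is the one this measure would mark as ambiguous.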


The "Quark and Gluon Jets" dataset covers only jet types (*quark vs. gluon*), not full event classification. It does not include a wide range of event types (like Higgs decays). So, for UQ, I will switch to "CMS Open Data and the MOD HDF5 Format", again from EnergyFlow [[https://energyflow.network/docs/datasets/#cms-open-data-and-the-mod-hdf5-format]].

Why use a different dataset later?
- It represents real detector data, where uncertainties arise naturally.
- It allows us to test if ML models trained on simulated jets generalize to real physics data.
- This is crucial for practical applications in experimental physics.

Let's begin by understanding and visualising the dataset.

[^1]: The dataset is a ".npz" file, which is a **compressed NumPy archive**. It stores multiple NumPy arrays inside one file. We have to extract each array by its key before using it (unlike ".csv", which stores tabular data and can be loaded directly into Pandas with pd.read_csv() and previewed with df.head()).
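
To make the ".npz" layout concrete, here is a small self-contained sketch that writes a toy archive to a temporary directory and reads it back by key. The file name `toy_jets.npz` and the array contents are placeholders chosen for illustration only.

```python
import os
import tempfile

import numpy as np

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_jets.npz")  # placeholder file name

    # Save two named arrays into one compressed archive
    np.savez_compressed(path, X=np.zeros((2, 3, 4)), y=np.array([0, 1]))

    # np.load returns a lazy archive object; arrays are read when indexed by key
    with np.load(path) as archive:
        keys = sorted(archive.files)
        X = archive["X"]
        y = archive["y"]

print(keys, X.shape, y.shape)
```

Using `np.load` as a context manager closes the underlying file once the arrays have been extracted.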

In [1]:
import numpy as np

# Define file path (using the first file for exploration)
file_path = "/Users/sauravbania/My Projects/Uncertainty-Quantification-CMS/datasets/QG_jets_withbc_0.npz"

# Loading the dataset
data1 = np.load(file_path)

# Checking what arrays are stored in this file
print("Dataset Keys:", data1.files)

# Inspecting the shape of each array
for key in data1.files:
    print(f"{key}: shape {data1[key].shape}")

# Extracting the full X (particle features) and y (labels) arrays
X_val = data1["X"]
y_val = data1["y"]

#for i in range(2):
    #print(f"Jet {i}:")
    #print(X_val[i])  # Print the particle features for each jet
    #print(y_val[i])
    #print(f"Label: {'Quark' if y_val[i] == 1 else 'Gluon'}\n")  # Convert label to text

Dataset Keys: ['X', 'y']
X: shape (100000, 134, 4)
y: shape (100000,)


The output shows that the file contains two arrays: X with shape (100000, 134, 4) and y with shape (100000,). The X array holds **100000 Quark and Gluon Jets**. Each jet has up to **134 particles** (the multiplicity varies from jet to jet, and shorter jets are zero-padded to this length), and each particle has **4 features**. The y array holds the label of each jet: y == 1 for Quark and y == 0 for Gluon. This label convention comes from the EnergyFlow documentation; the output above only shows the array shapes.
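
Because shorter jets are zero-padded, the true multiplicity of each jet can be recovered by counting the particle rows that are not entirely zero. The sketch below uses a tiny hand-built array in place of the real X and y; `X_toy` and `y_toy` are stand-ins, and the approach assumes that a padded particle has all four features equal to zero.

```python
import numpy as np

# Toy stand-in for X: 3 jets, padded to 4 particles, 4 features each
X_toy = np.zeros((3, 4, 4))
X_toy[0, :2] = 1.0  # jet 0 has 2 real particles
X_toy[1, :4] = 1.0  # jet 1 has 4 real particles
X_toy[2, :1] = 1.0  # jet 2 has 1 real particle
y_toy = np.array([1, 0, 1])  # 1 = quark, 0 = gluon

# A particle counts as real if any of its features is non-zero
is_real = np.any(X_toy != 0, axis=2)
multiplicity = is_real.sum(axis=1)
print("Multiplicity per jet:", multiplicity)

# Class balance: bincount index 0 is gluon, index 1 is quark
n_gluon, n_quark = np.bincount(y_toy, minlength=2)
print(f"Quark jets: {n_quark}, Gluon jets: {n_gluon}")
```

The same two lines for `is_real` and `multiplicity` should carry over to the real array unchanged, since they only rely on the (jets, particles, features) axis order.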

In [None]:
print(len(X_val))
print(len(X_val[0]))
print(len(X_val[0,0]))
#for i in range(2):
    #print(X_val[0:1,0:1])
    #if X_val[i] != 0.0:
        #counter += 1
#return counter   

print(f"First value: {X_val[0,0,0]}") # X_val[0,0,0] = [first jet, first particle in jet, first feature of that particle]

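# Count the jets that contain at least one non-zero entry
counter = 0  # initialise once, outside the loop
for i in range(len(X_val)):
    if X_val[i].any():  # check this jet only, not the whole array
        counter += 1
print(counter)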



100000
134
4
First value: 0.986502732089
