<a href="https://colab.research.google.com/github/ML4SCI/DeepLearnHackathon/blob/main/ParticleImagesChallenge/ParticleImages.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Particle Images

**Introduction:**

Machine Learning algorithms have become an increasingly important tool for analyzing the data from the Large Hadron Collider (LHC). Identification of particles in LHC collisions is an important task of LHC detector reconstruction algorithms.

Here we present a challenge where one of the detectors (the Electromagnetic Calorimeter or ECAL) is used as a camera to analyze detector images from two types of particles: electrons and photons that deposit their energy in this detector.

**Dataset:**

Each pixel in the image corresponds to a detector cell, and the pixel intensity corresponds to the energy measured in that cell. The timing of the energy deposits is also available, though it may or may not be relevant. The dataset contains 32x32 images of the energy hits and their timing (channel 1: hit energy, channel 2: hit timing) in each calorimeter cell (one cell = one pixel) for the two classes of particles: electrons and photons. In total it contains around four hundred thousand images. Please note that your final model will be evaluated on an unseen test dataset.

## Deliverables

* A `.ipynb` file (and a PDF version of it with outputs showing your results) presenting your solution, including your study of the data, the final model structure, the hyperparameters, and the way the model was trained to yield the best possible performance.
* Final model accuracy (training and validation), ROC curve, and AUC score, as well as an additional plot (e.g. precision-recall curves, confusion matrix) that further showcases the performance of your model.
* Your trained model containing the model architecture and its trained weights (HDF5 file, .pb file, .pt file, etc.). Also show in your notebooks how to load and use your model.

**Note: You are free to use the ML framework of your choice.**

## Download the Dataset

If you are working in Colab, you can avoid re-downloading the data in every session by mounting your Google Drive and storing the files there ([link for more info](https://towardsdatascience.com/different-ways-to-connect-google-drive-to-a-google-colab-notebook-pt-1-de03433d2f7a)).
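
A minimal sketch of that setup follows. It assumes the standard Colab helper `google.colab.drive.mount`; the function name `resolve_data_dir`, the Drive folder name `particle_images`, and the fallback path are hypothetical choices made for illustration. Outside Colab the function simply returns the fallback directory.

```python
import os


def resolve_data_dir(drive_subdir="particle_images", fallback="/content/"):
    """Return a directory for the dataset, preferring Google Drive in Colab.

    `drive_subdir` and `fallback` are illustrative defaults, not fixed names.
    """
    try:
        # Only importable inside a Colab runtime.
        from google.colab import drive
    except ImportError:
        return fallback

    mount_point = "/content/drive"
    drive.mount(mount_point)  # asks for authorization on first use
    target = os.path.join(mount_point, "MyDrive", drive_subdir)
    os.makedirs(target, exist_ok=True)
    return target


data_dir = resolve_data_dir()
print(data_dir)
```

If you use something like this, the download targets and `data_dir` need to point at the returned directory so that later sessions find the files already in place.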

In [None]:
# Download the electron and photon image files (shell commands run via "!").
!wget https://cernbox.cern.ch/s/9YELv279NIAE9B8/download -O /content/SingleElectronPt50_IMGCROPS_n249k_RHv1.hdf5
!wget https://cernbox.cern.ch/s/EahdXxzgq7nPodp/download -O /content/SinglePhotonPt50_IMGCROPS_n249k_RHv1.hdf5

In [None]:
data_dir = "/content/" # Put here in what directory your data lives.

## Import Modules

In [None]:
import os  # used below to build the dataset file paths

import numpy as np
import matplotlib.pyplot as plt
np.random.seed(1337)  # for reproducibility

import h5py
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

## Loading Image Data
- Two classes of particles: electrons and photons
- 32x32 matrices with two channels (hit energy and time) for the two classes of particles, electrons and photons, impinging on a calorimeter (one calorimetric cell = one pixel).
- Note that although the timing channel is provided, it may not necessarily improve the performance of the model.
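
The channel layout can be sketched on a synthetic array. The channels-last shape `(N, 32, 32, 2)` used here is an assumption, so check `X.shape` on the real files before relying on it; the array contents are random placeholders, not detector data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a batch of images; the channels-last layout
# (N, 32, 32, 2) is an assumption to verify against the real files.
X = rng.random((4, 32, 32, 2)).astype(np.float32)

energy = X[..., 0]  # channel 1 in the text: hit energy per cell
timing = X[..., 1]  # channel 2 in the text: hit time per cell
print(energy.shape, timing.shape)

# To check whether timing helps, one option is to train once on both
# channels and once on energy alone (keeping a channel axis of size 1).
X_energy_only = X[..., :1]
print(X_energy_only.shape)
```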

In [None]:
# 1 -> electron
filename = os.path.join(data_dir, "SingleElectronPt50_IMGCROPS_n249k_RHv1.hdf5")
data1 = h5py.File(filename, "r")
Y1 = data1["y"]
X1 = data1["X"]

# 0 -> photon
filename = os.path.join(data_dir, "SinglePhotonPt50_IMGCROPS_n249k_RHv1.hdf5")
data0 = h5py.File(filename, "r")
Y0 = data0["y"]
X0 = data0["X"] 

# Combining datasets into one mixed dataset
X_final = np.concatenate((X0[:], X1[:]), axis=0)
Y_final = np.concatenate((Y0[:], Y1[:]), axis=0)

num_imgs = Y_final.shape[0]
print("Number of images: {}".format(num_imgs))

## Configure Training / Validation / Test Sets

In [None]:
# Divide into train and test
X_train, X_test, y_train, y_test = train_test_split( 
    X_final,          
    Y_final,
    test_size=0.2,
    random_state=42
)

# Further divide test into test and validation
X_valid, X_test, y_valid, y_test = train_test_split( 
    X_test,
    y_test,
    test_size=0.5,
    random_state=42
)

# Colab has limited RAM, so you might need to clear some memory...
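del Y1, X1, Y0, X0, X_final, Y_final
# The arrays were copied into memory above, so the HDF5 files can be closed.
data0.close()
data1.close()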

print(f"X_train shape: {X_train.shape} - y_train shape: {y_train.shape}")
print(f"X_test shape: {X_test.shape} - y_test shape: {y_test.shape}")
print(f"X_valid shape: {X_valid.shape} - y_valid shape: {y_valid.shape}")
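
The two `train_test_split` calls above amount to an 80/10/10 split: 20% is held out first, then halved into validation and test. The same arithmetic can be sketched with plain NumPy on synthetic labels (the `labels` array is a hypothetical stand-in for `Y_final`), which also makes it easy to check the class balance of each part.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for Y_final: 1 = electron, 0 = photon.
labels = rng.integers(0, 2, size=1000).astype(np.float32)

# Shuffle the indices once, then cut them at 80% and 90%.
idx = rng.permutation(len(labels))
n_train = int(0.8 * len(idx))
n_valid = int(0.1 * len(idx))
train_idx, valid_idx, test_idx = np.split(idx, [n_train, n_train + n_valid])

for name, part in [("train", train_idx), ("valid", valid_idx), ("test", test_idx)]:
    frac = labels[part].mean()
    print(f"{name}: {len(part)} images, electron fraction {frac:.3f}")
```

If the fractions drift apart on the real data, passing `stratify=` to `train_test_split` is one way to keep the classes balanced across the splits.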

## Task 1:

*Data: Training data*

Explore, visualize and analyze the data found in the training dataset.
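
As a possible starting point, the sketch below computes a few summaries that are often worth plotting: class balance, total energy per image, number of active cells, and the average energy image per class. It runs on synthetic arrays (`X_demo` and `y_demo` are hypothetical stand-ins for `X_train` and `y_train`) and assumes a channels-last layout with energy in the first channel.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (hypothetical names); use X_train and y_train instead.
X_demo = rng.random((200, 32, 32, 2)).astype(np.float32)
X_demo[X_demo < 0.9] = 0.0  # zero out most cells so the images are sparse
y_demo = rng.integers(0, 2, size=200)

energy = X_demo[..., 0]  # assumes channels-last layout with energy first

# Class balance: how many images carry each label.
classes, counts = np.unique(y_demo, return_counts=True)
print("images per class:", dict(zip(classes.tolist(), counts.tolist())))

# Per-image summaries that can be histogrammed separately for each class.
total_energy = energy.sum(axis=(1, 2))
active_cells = (energy > 0).sum(axis=(1, 2))
print("mean total energy:", float(total_energy.mean()))
print("mean active cells:", float(active_cells.mean()))

# Average energy image per class.
mean_images = {int(c): energy[y_demo == c].mean(axis=0) for c in classes}
for label, img in mean_images.items():
    print(label, img.shape, float(img.max()))
```

Each entry of `mean_images` is a 32x32 array, so it can be passed directly to `plt.imshow` to compare the average electron and photon showers.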

In [None]:
# Your code here!

## Task 2:

*Data: Training and validation data*

Train a model by fitting it to the training data. Use at least one metric such as `roc_auc_score`, `accuracy`, etc. to analyze the model's performance on the validation data. Using that performance metric, optimize or improve your model. It should be clear from your notebook how you perform this optimization and you should explain your thinking clearly.
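
To illustrate the train/validate loop only, here is a deliberately simple baseline: logistic regression on the flattened energy channel, trained with full-batch gradient descent in NumPy and scored by validation accuracy. It is a sketch on synthetic data, not a competitive model for this challenge; the data, the injected signal, the learning rate, and the step count are all assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: class-1 images get extra energy in a central block.
n = 2000
X = rng.random((n, 32, 32, 2)).astype(np.float32)
y = rng.integers(0, 2, size=n)
X[y == 1, 14:18, 14:18, 0] += 0.3

# Energy channel only, flattened to one feature vector per image.
features = X[..., 0].reshape(n, -1)

# Simple hold-out split (the notebook's X_train / X_valid would replace this).
n_train = int(0.8 * n)
F_train, F_valid = features[:n_train], features[n_train:]
y_train_demo, y_valid_demo = y[:n_train], y[n_train:]

# Standardize with training statistics only, to avoid leaking validation data.
mean = F_train.mean(axis=0)
std = F_train.std(axis=0) + 1e-6
F_train = (F_train - mean) / std
F_valid = (F_valid - mean) / std

# Logistic regression trained by full-batch gradient descent.
w = np.zeros(F_train.shape[1])
b = 0.0
lr = 0.1
for _ in range(100):
    p = 1.0 / (1.0 + np.exp(-(F_train @ w + b)))
    grad = p - y_train_demo
    w -= lr * (F_train.T @ grad) / n_train
    b -= lr * grad.mean()

# Validation metric used to compare model variants.
valid_scores = 1.0 / (1.0 + np.exp(-(F_valid @ w + b)))
valid_acc = ((valid_scores > 0.5) == y_valid_demo).mean()
print(f"validation accuracy: {valid_acc:.3f}")
```

The point of the pattern is that every change to the model (features, architecture, hyperparameters) is judged by the same validation metric, while the test set stays untouched.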

As you work on your model, you may use a subset of the actual dataset to hasten your tests. However, for the final submission, you must use the full test set.
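
One way to draw such a subset is sketched here. The helper name `sample_subset` and its parameters are hypothetical; the only requirement is that images and labels are indexed with the same random indices so they stay aligned.

```python
import numpy as np


def sample_subset(X, y, fraction=0.1, seed=0):
    """Return a random subset of (X, y), keeping images and labels aligned."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(len(X) * fraction))
    # Sample without replacement so that no image appears twice.
    idx = np.sort(rng.choice(len(X), size=n_keep, replace=False))
    return X[idx], y[idx]


# Synthetic stand-ins for X_train and y_train.
X_demo = np.zeros((1000, 32, 32, 2), dtype=np.float32)
y_demo = np.arange(1000) % 2

X_small, y_small = sample_subset(X_demo, y_demo, fraction=0.05)
print(X_small.shape, y_small.shape)
```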

In [None]:
# Use your framework of choice. 

# Define your model here!

In [None]:
# Train your model here!

## Task 3: 

*Data: Testing data*

Without having done any optimization using the testing data set, analyze the performance of the model on the testing data. Your analysis should include the AUC score, a ROC curve plot, and at least one other plot of your choice such as precision-recall curves, confusion matrix, etc. Try to get your model to reach an AUC above 0.90 (90%).
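
To make these quantities concrete, the sketch below computes ROC points, the AUC, and a 2x2 confusion matrix with plain NumPy on a four-sample toy example. The function names are hypothetical; in the notebook, the `roc_curve` and `auc` functions already imported from `sklearn.metrics` should serve the same purpose, and this version is only meant to show what they compute.

```python
import numpy as np


def roc_points(y_true, scores):
    """Return (fpr, tpr) obtained by sweeping a threshold over the scores."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    order = np.argsort(-scores, kind="stable")
    y_sorted, s_sorted = y_true[order], scores[order]
    # Keep one point per distinct score so tied scores share a threshold.
    last = np.r_[np.nonzero(np.diff(s_sorted))[0], len(s_sorted) - 1]
    tps = np.cumsum(y_sorted == 1)[last]
    fps = np.cumsum(y_sorted == 0)[last]
    tpr = np.concatenate(([0.0], tps / tps[-1]))
    fpr = np.concatenate(([0.0], fps / fps[-1]))
    return fpr, tpr


def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def confusion_matrix_2x2(y_true, y_pred):
    """Rows are true labels, columns are predicted labels."""
    cm = np.zeros((2, 2), dtype=int)
    np.add.at(cm, (np.asarray(y_true).astype(int), np.asarray(y_pred).astype(int)), 1)
    return cm


# Toy example: replace with y_test and your model's predicted probabilities.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

fpr, tpr = roc_points(y_true, scores)
auc_value = area_under_curve(fpr, tpr)
confusion = confusion_matrix_2x2(y_true, scores >= 0.5)
print("AUC:", auc_value)  # 0.75 for this toy example
print(confusion)
```

Plotting `tpr` against `fpr` with `plt.plot(fpr, tpr)` gives the ROC curve; the 0.5 threshold used for the confusion matrix is an arbitrary choice that you may want to tune on the validation set.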

In [None]:
# Your code here!