<a href="https://colab.research.google.com/github/ML4SCI/DeepLearnHackathon/blob/main/HiggsBosonClassificationChallenge/higgs_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Higgs Classification Challenge

**Introduction:** 

High-energy collisions at the Large Hadron Collider (LHC) produce particles that interact with particle detectors. One important task is to classify different types of collisions based on their physics content, which helps physicists find patterns in the data and potentially make new discoveries.

The discovery of the Higgs boson by the CMS and ATLAS collaborations was announced at CERN in 2012. In this challenge, we will use machine learning to separate events that contain Higgs bosons from background events that do not.

**Dataset:** 

The dataset consists of a total of 11 million labeled samples of Higgs and background events produced by Monte Carlo simulations. Each sample consists of 28 features. The first 21 features are kinematic properties of the events. The last seven are functions of the first 21. The data labels are 1 for signal (an event with Higgs bosons) and 0 for background (an event without Higgs bosons).

The dataset is hosted by the Center for Machine Learning and Intelligent Systems at the University of California, Irvine, and can be found on the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/HIGGS).

## Deliverables

* PDF and .ipynb file showing your solution.
* Final model accuracy (training and validation), ROC curve, and AUC score, plus at least one additional plot (e.g. precision-recall curve or confusion matrix) that further showcases the performance of your model.
* Final model structure, parameters and hyper-parameters yielding the best possible performance.
* The trained model itself, containing the model architecture and its trained weights (HDF5 file, .pb file, .pt file, etc.). Also show in your notebook how to load and use your model.

**Note: You are free to use the ML framework of your choice.**
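
As one possible way to satisfy the last deliverable, the sketch below saves and reloads a model with `joblib`, assuming a scikit-learn estimator. The random data, the `LogisticRegression` stand-in, and the file name `higgs_model.joblib` are illustrative assumptions and not part of the challenge; other frameworks have their own save and load routines.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with 28 features (assumption: not the real HIGGS data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 28))
y_demo = (X_demo[:, 0] > 0).astype(int)

# Placeholder model; substitute your own trained estimator.
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Save the fitted model to disk (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "higgs_model.joblib")
joblib.dump(model, path)

# Reload it and confirm that it reproduces the original predictions.
restored = joblib.load(path)
same = np.allclose(model.predict_proba(X_demo), restored.predict_proba(X_demo))
print("Reloaded model matches original:", same)
```

In a submission, the reloaded model would then be applied to the real validation or test features instead of `X_demo`.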

## Downloading the Dataset
If you have problems with this step in Colab, you can also download the file manually and put it in your Google Drive. You can then [connect your Google Drive to Colab](https://towardsdatascience.com/different-ways-to-connect-google-drive-to-a-google-colab-notebook-pt-1-de03433d2f7a).

In [None]:
# Download the compressed dataset and decompress it (shell commands run from the notebook)
!wget https://archive.ics.uci.edu/ml/machine-learning-databases/00280/HIGGS.csv.gz
!gzip -d HIGGS.csv.gz

## Import Modules

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(1337)  # for reproducibility

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_curve, auc

## Loading Data

* Two classes: Higgs event and background event
* As loaded below, the first column is the label (1 = Higgs event, 0 = background event), and the remaining columns are the inputs.

In [None]:
data = pd.read_csv('./HIGGS.csv', header=None)
X = data.iloc[:,1:]
y = data.iloc[:,0]
#X = X.to_numpy(dtype=float) #Convert pandas dataframe to numpy array (optional)
#y = y.to_numpy(dtype=int)   #Convert pandas dataframe to numpy array (optional)
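
The full file holds 11 million rows of 29 columns, which comes to about 2.5 GB in memory as 64-bit floats (11,000,000 × 29 × 8 bytes). For quick experiments it may help to read only the first rows and store them as 32-bit floats. The sketch below uses the `nrows` and `dtype` options of `pd.read_csv` on a small synthetic stand-in, so it runs without the download. It assumes the first rows of the real file are representative of the whole, which is not verified here; the names `buf`, `subset`, `X_small`, and `y_small` are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Synthetic stand-in for HIGGS.csv: 1 label column + 28 feature columns, no header row.
rng = np.random.default_rng(0)
fake = np.column_stack([rng.integers(0, 2, size=1000), rng.normal(size=(1000, 28))])
buf = io.StringIO()
pd.DataFrame(fake).to_csv(buf, header=False, index=False)
buf.seek(0)

# With the real file, pass './HIGGS.csv' instead of `buf` and pick a larger `nrows`.
subset = pd.read_csv(buf, header=None, nrows=200, dtype=np.float32)
X_small = subset.iloc[:, 1:]             # 28 input features
y_small = subset.iloc[:, 0].astype(int)  # labels: 1 = signal, 0 = background

print(subset.shape)
print(subset.memory_usage(deep=True).sum(), "bytes")
```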

## Configuring Training / Validation / Test Sets

In [None]:
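# Hold out about 1/11 of the samples, then split that hold-out evenly into validation and test sets.
X_train, X_val1, y_train, y_val1 = train_test_split(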
    X, 
    y, 
    test_size=0.0909090909, 
    random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_val1, 
    y_val1, 
    test_size=0.5, 
    random_state=42
)

print(X_train.shape)
print(X_val.shape)
print(X_test.shape)
print(y_train.shape)
print(y_val.shape)
print(y_test.shape)
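
The `test_size` of 0.0909090909 is approximately 1/11, so on 11 million samples the two-stage split above should yield roughly 10,000,000 training, 500,000 validation, and 500,000 test samples. The sketch below reproduces the same two calls on a small synthetic array of 1,100 rows to show the proportions; the `*_demo` names and shortened variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,100 rows so that 1/11 of the data is exactly 100 rows.
X_demo = np.arange(1100 * 2).reshape(1100, 2)
y_demo = np.arange(1100) % 2

# Stage 1: hold out ~1/11 of the data.
X_tr, X_hold, y_tr, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.0909090909, random_state=42
)
# Stage 2: split the hold-out evenly into validation and test sets.
X_va, X_te, y_va, y_te = train_test_split(
    X_hold, y_hold, test_size=0.5, random_state=42
)

print(len(X_tr), len(X_va), len(X_te))  # 1000 50 50
```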

## Task 1:

Data: `X_train`

Generate histograms of the different variables in `X_train` with proper axis labels and titles.

Detailed information on each feature column can be found in the *Attribute Information* section of the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/HIGGS). For further information, refer to the [paper](https://www.nature.com/articles/ncomms5308) by Baldi et al.

The following may be helpful:

`names = ["lepton pT", "lepton eta", "lepton phi", "missing energy magnitude",` <br>
`"missing energy phi", "jet 1 pt", "jet 1 eta", "jet 1 phi", "jet 1 b-tag",` <br>
`"jet 2 pt", "jet 2 eta","jet 2 phi", "jet 2 b-tag", "jet 3 pt", "jet 3 eta",` <br>
`"jet 3 phi", "jet 3 b-tag", "jet 4 pt", "jet 4 eta", "jet 4 phi", "jet 4 b-tag",`<br>` "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]`

`for index, name in enumerate(names):`
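
One way the loop hinted at above could be completed is sketched here, under a few assumptions: the data is a random stand-in (`X_demo`) rather than `X_train`, the 7 × 4 grid and 50 bins are arbitrary choices, and columns are accessed by position with `iloc` because the DataFrame loaded earlier has integer column labels rather than feature names.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

names = ["lepton pT", "lepton eta", "lepton phi", "missing energy magnitude",
         "missing energy phi", "jet 1 pt", "jet 1 eta", "jet 1 phi", "jet 1 b-tag",
         "jet 2 pt", "jet 2 eta", "jet 2 phi", "jet 2 b-tag", "jet 3 pt", "jet 3 eta",
         "jet 3 phi", "jet 3 b-tag", "jet 4 pt", "jet 4 eta", "jet 4 phi", "jet 4 b-tag",
         "m_jj", "m_jjj", "m_lv", "m_jlv", "m_bb", "m_wbb", "m_wwbb"]

# Synthetic stand-in for X_train (assumption: replace with the real DataFrame).
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(500, len(names))))

# One panel per feature, laid out on a 7 x 4 grid.
fig, axes = plt.subplots(7, 4, figsize=(16, 24))
for index, name in enumerate(names):
    ax = axes.flat[index]
    ax.hist(X_demo.iloc[:, index], bins=50, histtype="step")
    ax.set_title(name)
    ax.set_xlabel(name)
    ax.set_ylabel("events")
fig.tight_layout()
```

In a notebook the figure is rendered at the end of the cell. Overlaying signal and background separately (by masking on `y_train`) would be a natural extension.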

## Task 2:

Data: `X_train`, `y_train`, `X_val`, `y_val`

Train a model by fitting it to the training data. Use at least one metric such as roc_auc_score, accuracy, etc. to analyze the model's performance on the validation data. Using that performance metric, optimize or improve your model. It should be clear from your notebook how you perform this optimization.

As you work on your model, you may use a subset of the actual dataset to hasten your tests. However, for the final submission, you must use the full test set.
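
As a rough illustration of the workflow described in this task, the sketch below fits a simple baseline and uses the validation AUC to choose one hyperparameter. Everything in it is an assumption made for the example: the data comes from `make_classification` instead of the HIGGS file, logistic regression is only a placeholder model, and the grid of `C` values is arbitrary. A competitive solution would likely need a more expressive model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 28 features (assumption: not the real HIGGS data).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=28, n_informative=10, random_state=0
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Try several regularization strengths and record the validation AUC for each.
results = {}
for C in [0.01, 0.1, 1.0, 10.0]:
    model = make_pipeline(StandardScaler(), LogisticRegression(C=C, max_iter=1000))
    model.fit(X_tr, y_tr)
    val_scores = model.predict_proba(X_va)[:, 1]  # predicted probability of class 1
    results[C] = roc_auc_score(y_va, val_scores)

# Keep the setting with the highest validation AUC.
best_C = max(results, key=results.get)
print(results)
print("best C:", best_C)
```

Only the training and validation sets are touched here; the test set stays untouched until the final evaluation.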

## Task 3:

Data: Testing data

Without having done any optimization using the testing data set, analyze the performance of the model on the testing data. Your analysis should include the AUC score, a ROC curve plot, and at least one other plot of your choice, such as a precision-recall curve or a confusion matrix.
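
The evaluation described above could look roughly like the following sketch, which computes the ROC curve, the AUC, and a confusion matrix. The labels and scores are synthetic stand-ins (`y_test_demo`, `scores_demo`) for `y_test` and your model's predicted probabilities, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, auc, confusion_matrix

# Synthetic labels and scores (assumption: replace with y_test and model outputs).
rng = np.random.default_rng(0)
y_test_demo = rng.integers(0, 2, size=1000)
logits = (2 * y_test_demo - 1) + rng.normal(size=1000)
scores_demo = 1.0 / (1.0 + np.exp(-logits))  # squash to (0, 1) like probabilities

# ROC curve and area under it.
fpr, tpr, thresholds = roc_curve(y_test_demo, scores_demo)
roc_auc = auc(fpr, tpr)

# Confusion matrix at a 0.5 threshold: rows are true labels, columns are predictions.
y_pred_demo = (scores_demo >= 0.5).astype(int)
cm = confusion_matrix(y_test_demo, y_pred_demo)

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(fpr, tpr, label=f"AUC = {roc_auc:.3f}")
ax.plot([0, 1], [0, 1], linestyle="--", label="random guess")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve (test set)")
ax.legend()

print("AUC:", roc_auc)
print(cm)
```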
