<h1 style="text-align:center;">Higgs Boson: A Case Study</h1>

# Introduction

Ah, the journey to finding the Higgs boson! Let's delve into this notebook, where the renowned Higgs Boson Kaggle Competition and XGBoost, our machine learning hero, share the stage.

## A Glimpse into the Past: Physics and the Higgs Boson

Known colloquially as the "God particle", the Higgs boson was theorized by Peter Higgs in 1964 as a solution to the mystery: why do particles have mass? The search ended in 2012 at CERN’s Large Hadron Collider, where the particle was revealed by smashing protons together at ludicrous speeds and scrutinizing the aftermath. The discovery was monumental, solidifying the Standard Model of physics. It also set the next task: meticulously measuring how the Higgs boson decays, particularly into two tau particles, amidst the data chaos.

## Kaggle Competitions: A Battlefield for Machine Learning Gladiators

Remember when Netflix sparked a machine learning competition frenzy in 2006 with a \$1 million prize? That was just the beginning! Kaggle became the arena where data scientists, from various realms, battled algorithms to solve problems, gaining insights and prizes along the way. The Higgs Boson Machine Learning Challenge was announced in 2014, [see here](https://www.kaggle.com/c/higgs-boson), drawing 1,875 teams into the fray with a \$13,000 prize pool.

## Enter the Dragon: XGBoost

XGBoost, launched just 6 months before the Higgs challenge, became a sensation, propelling competitors up the Kaggle leaderboard while being a paragon of computational efficiency.

## A Peek into the Data

We're not using Kaggle's dataset, but the original from CERN’s open data portal, accessible [here](http://opendata.cern.ch/record/328). It's a bit larger than Kaggle’s, so we'll stick to the first 250,000 rows and tweak it to mimic Kaggle's data. Fetch the dataset directly from [this GitHub repository](https://github.com/PacktPublishing/Hands-On-Gradient-Boosting-with-XGBoost-and-Scikit-learn/tree/master/Chapter05) and let's delve into it. Keep in mind that it's a `.csv.gz` file, so we'll pass `compression='gzip'` when reading it into a pandas DataFrame.

---

And thus begins our exploration, where we fuse physics and machine learning, uncovering the stories data can tell us about the subatomic world with XGBoost as our guide!


In [28]:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split, cross_val_score

from xgboost import XGBClassifier, XGBRegressor
from sklearn.metrics import accuracy_score

from helper_file import *

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [3]:
url = "data/atlas-higgs-challenge-2014-v2.csv.gz"

In [10]:
pd.read_csv(url, nrows=250000, compression='gzip').columns

Index(['EventId', 'DER_mass_MMC', 'DER_mass_transverse_met_lep',
       'DER_mass_vis', 'DER_pt_h', 'DER_deltaeta_jet_jet', 'DER_mass_jet_jet',
       'DER_prodeta_jet_jet', 'DER_deltar_tau_lep', 'DER_pt_tot', 'DER_sum_pt',
       'DER_pt_ratio_lep_tau', 'DER_met_phi_centrality',
       'DER_lep_eta_centrality', 'PRI_tau_pt', 'PRI_tau_eta', 'PRI_tau_phi',
       'PRI_lep_pt', 'PRI_lep_eta', 'PRI_lep_phi', 'PRI_met', 'PRI_met_phi',
       'PRI_met_sumet', 'PRI_jet_num', 'PRI_jet_leading_pt',
       'PRI_jet_leading_eta', 'PRI_jet_leading_phi', 'PRI_jet_subleading_pt',
       'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt',
       'Weight', 'Label', 'KaggleSet', 'KaggleWeight'],
      dtype='object')

To match the Kaggle training data, let's drop the `KaggleSet` and `Weight` columns and rename `KaggleWeight` to `Weight`. Note that `Label` then sits second to last, just before the renamed `Weight` column.

In [17]:
df = (
    pd.read_csv(url, nrows=250000, compression='gzip')
    .drop(columns=['Weight', 'KaggleSet'])
    .rename(columns={"KaggleWeight": "Weight"})
)


In [18]:
df.columns

Index(['EventId', 'DER_mass_MMC', 'DER_mass_transverse_met_lep',
       'DER_mass_vis', 'DER_pt_h', 'DER_deltaeta_jet_jet', 'DER_mass_jet_jet',
       'DER_prodeta_jet_jet', 'DER_deltar_tau_lep', 'DER_pt_tot', 'DER_sum_pt',
       'DER_pt_ratio_lep_tau', 'DER_met_phi_centrality',
       'DER_lep_eta_centrality', 'PRI_tau_pt', 'PRI_tau_eta', 'PRI_tau_phi',
       'PRI_lep_pt', 'PRI_lep_eta', 'PRI_lep_phi', 'PRI_met', 'PRI_met_phi',
       'PRI_met_sumet', 'PRI_jet_num', 'PRI_jet_leading_pt',
       'PRI_jet_leading_eta', 'PRI_jet_leading_phi', 'PRI_jet_subleading_pt',
       'PRI_jet_subleading_eta', 'PRI_jet_subleading_phi', 'PRI_jet_all_pt',
       'Label', 'Weight'],
      dtype='object')
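
The listing above ends with `Label` followed by `Weight`, whereas Kaggle's training file places `Label` last. If that exact ordering matters, one way to reorder the columns is sketched below on a tiny stand-in frame. The values are made up for illustration, and `move_to_end` is a hypothetical helper, not part of the notebook.

```python
import pandas as pd

# Tiny stand-in frame with made-up values; only the column order matters here.
toy = pd.DataFrame({
    "EventId": [100000, 100001],
    "PRI_jet_all_pt": [113.5, 46.2],
    "Label": ["s", "b"],
    "Weight": [0.002, 2.2],
})


def move_to_end(frame: pd.DataFrame, column: str) -> pd.DataFrame:
    """Return a new frame with `column` moved to the last position."""
    others = [c for c in frame.columns if c != column]
    return frame[others + [column]]


reordered = move_to_end(toy, "Label")
print(list(reordered.columns))
```

The same call could be chained onto the real frame with `.pipe(move_to_end, "Label")`.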

In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250000 entries, 0 to 249999
Data columns (total 33 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   EventId                      250000 non-null  int64  
 1   DER_mass_MMC                 250000 non-null  float64
 2   DER_mass_transverse_met_lep  250000 non-null  float64
 3   DER_mass_vis                 250000 non-null  float64
 4   DER_pt_h                     250000 non-null  float64
 5   DER_deltaeta_jet_jet         250000 non-null  float64
 6   DER_mass_jet_jet             250000 non-null  float64
 7   DER_prodeta_jet_jet          250000 non-null  float64
 8   DER_deltar_tau_lep           250000 non-null  float64
 9   DER_pt_tot                   250000 non-null  float64
 10  DER_sum_pt                   250000 non-null  float64
 11  DER_pt_ratio_lep_tau         250000 non-null  float64
 12  DER_met_phi_centrality       250000 non-null  float64
 13 

# A Quick Look at Our Data Columns

Alright, let's take a swift tour through the columns of our data!

### Column 0: `EventId`
- **What's this?** Simply an identifier for each event. 
- **Use in Modeling?** Nope, our model doesn’t need this one.

### Columns 1-30: Physics Features
- **What's inside?** Various properties from LHC collisions.
- **Special Note:** To get a deeper dive into what each of these columns represents, check out the [technical documentation](http://higgsml.lal.in2p3.fr/documentation). 
- **Use in Modeling?** Absolutely, these are the predictors for our model.

### Column 31: `Label`
- **What does it tell us?** Whether an event is a signal (`s`) or background (`b`). 
- **A Little Context:** The training data is simulated and has more signals than you’d typically find, to aid learning about Higgs boson decay.
- **Use in Modeling?** Yes, this one's our target variable.

### Column 32: `Weight`
- **What's its role?** It scales how much each event counts. Given the rarity of Higgs boson events, weights help manage the imbalance and ensure our model doesn’t overlook them.
- **A Caveat:** Weights aren't available in the test data, but we’ll discuss strategies on handling this in later chapters.

And there we have it! A straightforward overview of our data columns, ready to guide us as we dive into model building!

We need to change the `Label` column into numerical format by replacing `s` values with 1 and `b` values with 0.

In [16]:
df.Label

0         s
1         b
2         b
3         b
4         b
         ..
249995    b
249996    b
249997    s
249998    b
249999    b
Name: Label, Length: 250000, dtype: object

In [20]:
df = (
    pd.read_csv(url, nrows=250000, compression='gzip')
    # Drop unwanted columns
    .drop(columns=['Weight', 'KaggleSet'])
    # Rename column for clarity
    .rename(columns={"KaggleWeight": "Weight"})
    # Replace 's' and 'b' in 'Label' with 1 and 0, respectively
    .assign(Label=lambda x: x['Label'].replace(('s', 'b'), (1, 0)))
)

In [23]:
X, y = splitX_y(df, 'Label')

print(f"shape of target vector: {y.shape}")
print(f"shape of feature matrix: {X.shape}")

shape of target vector: (250000,)
shape of feature matrix: (250000, 32)
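
`splitX_y` comes from `helper_file`, which isn't shown here. Judging by the shapes printed above (33 columns in, 32 out), it presumably just separates the target column from everything else. A minimal sketch under that assumption follows; the name `split_X_y_sketch` and its `drop` argument are hypothetical additions, and the toy values are made up.

```python
import pandas as pd


def split_X_y_sketch(frame, target, drop=()):
    """Split `frame` into a feature matrix and a target vector.

    `drop` lists extra non-feature columns (for example identifiers or
    sample weights) to leave out of the feature matrix.
    """
    X = frame.drop(columns=[target, *drop])
    y = frame[target]
    return X, y


# Made-up rows that mimic the layout of the real frame.
toy = pd.DataFrame({
    "EventId": [100000, 100001, 100002],
    "DER_mass_vis": [97.8, 103.2, 125.9],
    "Label": [1, 0, 0],
    "Weight": [0.002, 2.2, 2.3],
})

# Plain split: everything except the target stays in the feature matrix.
X_all, y_toy = split_X_y_sketch(toy, "Label")

# Stricter split: the identifier and the weights are left out as well.
X_feat, _ = split_X_y_sketch(toy, "Label", drop=["EventId", "Weight"])

print(X_all.shape, X_feat.shape, y_toy.shape)
```

If the real helper behaves like the plain split, `X` still contains `EventId` and `Weight`. The weights in this dataset appear to differ sharply between signal and background, so it may be worth excluding them from the features before training.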


### A Simplified Guide to Scoring and Weights in the Higgs Challenge

##### The Unique Scoring: Approximate Median Significance (AMS)
- **What's Different?** The Higgs Challenge dances to its own tune - especially when it comes to scoring. It doesn't just look for accuracy but emphasizes the AMS. 
- **Defining AMS:** It’s a specific formula (detailed in the [technical documentation](http://higgsml.lal.in2p3.fr/documentation)) that rewards true positives and penalizes false positives, with a regularization term of 10.
- **Good News:** No need to manually define AMS - XGBoost has got our back here!
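
For reference, the AMS can be written as $\text{AMS} = \sqrt{2\left((s + b + b_r)\ln\left(1 + \frac{s}{b + b_r}\right) - s\right)}$, where $s$ is the weighted sum of true positives, $b$ the weighted sum of false positives, and $b_r = 10$ the regularization term. The function below is a small standalone sketch of that formula for experimenting with numbers; it is not XGBoost's internal implementation, and the name `ams` is just a convenient label.

```python
import math


def ams(s: float, b: float, b_reg: float = 10.0) -> float:
    """Approximate Median Significance.

    s:     weighted sum of signal events selected as signal (true positives)
    b:     weighted sum of background events selected as signal (false positives)
    b_reg: regularization term, 10 in the challenge
    """
    return math.sqrt(2.0 * ((s + b + b_reg) * math.log(1.0 + s / (b + b_reg)) - s))


# More selected signal raises the score; more selected background lowers it.
print(ams(10.0, 90.0))
print(ams(20.0, 90.0))
print(ams(10.0, 190.0))
```

When $s$ is small compared with $b + b_r$, the value is close to $s / \sqrt{b + b_r}$, which is a handy sanity check.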

##### A Quick Dive into Weights
- **Why Weights?** Because in the real world, signals are a needle in the haystack of background noise. We use weights to adjust our model’s sensitivity to this.
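- **Using Weights:** In this dataset, signal events carry much smaller weights than background events, because the simulation contains far more signal than reality would. During training we compensate by giving the signal class a larger overall weight, which ensures that our model doesn’t just predict everything as background!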
- **Practical Usage:** The weight column must reflect reality (the expected number of signal and background events). It’s crafted and scaled to match the expectations from our test data.
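
To make the role of the weights concrete, here is a toy illustration with made-up numbers (not taken from the dataset). Half of the rows are signal, yet once each row counts by its weight, signal becomes a tiny share of the total, and the background-to-signal ratio is large.

```python
import pandas as pd

# Made-up weights: three signal rows with tiny weights, three background rows
# with large ones.
toy = pd.DataFrame({
    "Label":  [1, 1, 1, 0, 0, 0],
    "Weight": [0.002, 0.003, 0.005, 1.0, 2.0, 3.0],
})

# Share of signal when every row counts equally.
row_share = toy["Label"].mean()

# Weighted totals per class.
s_toy = toy.loc[toy["Label"] == 1, "Weight"].sum()
b_toy = toy.loc[toy["Label"] == 0, "Weight"].sum()

# Share of signal when every row counts by its weight.
weighted_share = s_toy / (s_toy + b_toy)

print(f"signal share by rows:    {row_share:.2f}")
print(f"signal share by weight:  {weighted_share:.4f}")
print(f"background/signal ratio: {b_toy / s_toy:.1f}")
```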

###### Crafting the Weights: A Tiny How-to
1. **Scaling Weights:** Rescale the weights from the 250,000 training rows to the 550,000 rows of the test data.
   ```python
   df['test_Weight'] = df['Weight'] * 550000 / len(y)
   ```


In [24]:
df = (
    pd.read_csv(url, nrows=250000, compression='gzip')
    .drop(columns=['Weight', 'KaggleSet'])
    .rename(columns={"KaggleWeight": "Weight"})
    .assign(
        Label=lambda x: x['Label'].replace(('s', 'b'), (1, 0)),
        test_Weight=lambda x: x['Weight'] * 550000 / len(x)
    )
)

**Calculating the Scale Factor:** The sum of the background weights divided by the sum of the signal weights. It is later passed to XGBoost as `scale_pos_weight`.

In [26]:
s = np.sum(df[df['Label']==1]['test_Weight'])

b = np.sum(df[df['Label']==0]['test_Weight'])

b/s

593.9401931492318

#### More on Weights
- To delve deeper into the world of weights, explore [this KDnuggets article](https://www.kdnuggets.com/2019/11/machine-learning-what-why-how-weighting.html).

##### Building The Model: A Throwback to Original XGBoost API
- **Old School API:** Back during the Higgs Challenge, everyone used the original XGBoost API (pre-scikit-learn).
- **Why Mentioning This?** Because you’ll probably bump into it online, and it’s what was used in the Higgs Challenge.
- **Our Approach:** We'll stick with this original API just for this chapter to keep things authentic.

**And there we go!** A brief, uncomplicated guide to scoring and weights in the context of the Higgs Challenge, aiming to keep things as straightforward and clear as possible!


In [29]:
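import xgboost as xgb
# missing=-999.0 marks the dataset's placeholder for undefined values.
# Caution: X still contains 'EventId' and 'Weight'; since the weights differ
# by class, 'Weight' likely leaks the label (note the train-auc of 1.0 below).
xgb_clf = xgb.DMatrix(X, y,
                      missing=-999.0,
                      weight=df['test_Weight'])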

In [30]:
param = {}
param['objective'] = 'binary:logitraw'
param['scale_pos_weight'] = b/s
param['eta'] = 0.1
param['max_depth'] = 6
param['eval_metric'] = 'auc'

In [34]:
plst = list(param.items())+[('eval_metric', 'ams@0.15')]
watchlist = [(xgb_clf, 'train')]
num_round = 120

print('loading data end, start to boost trees')

bst = xgb.train(plst, xgb_clf, num_round, watchlist)

bst.save_model('higgs.model')

print('finished training')

loading data end, start to boost trees
[0]	train-auc:1.00000	train-ams@0.15:66.38881
[1]	train-auc:1.00000	train-ams@0.15:66.38881
[2]	train-auc:1.00000	train-ams@0.15:66.38881
[3]	train-auc:1.00000	train-ams@0.15:66.38881
[4]	train-auc:1.00000	train-ams@0.15:66.38881
[5]	train-auc:1.00000	train-ams@0.15:66.38881
[6]	train-auc:1.00000	train-ams@0.15:66.38881
[7]	train-auc:1.00000	train-ams@0.15:66.38881
[8]	train-auc:1.00000	train-ams@0.15:66.38881
[9]	train-auc:1.00000	train-ams@0.15:66.38881
[10]	train-auc:1.00000	train-ams@0.15:66.38881
[11]	train-auc:1.00000	train-ams@0.15:66.38881
[12]	train-auc:1.00000	train-ams@0.15:66.38881
[13]	train-auc:1.00000	train-ams@0.15:66.38881
[14]	train-auc:1.00000	train-ams@0.15:66.38881
[15]	train-auc:1.00000	train-ams@0.15:66.38881
[16]	train-auc:1.00000	train-ams@0.15:66.38881
[17]	train-auc:1.00000	train-ams@0.15:66.38881
[18]	train-auc:1.00000	train-ams@0.15:66.38881
[19]	train-auc:1.00000	train-ams@0.15:66.38881
[20]	train-auc:1.00000	train-am

```python
plst = list(param.items())+[('eval_metric', 'ams@0.15')]
```

This line is crafting a list, `plst`, intended to configure parameters for training our XGBoost model. Let's dissect it:

- `param.items()`: Returns a view of the key-value pairs in the `param` dictionary; wrapping it in `list()` turns it into a list of tuples.

- `[('eval_metric', 'ams@0.15')]`: A list housing a single tuple, where `'eval_metric'` is the parameter name and `'ams@0.15'` is its value. Here, `ams` refers to Approximate Median Significance, the metric used in the Higgs Boson Kaggle Competition, and `0.15` is the fraction of events with the highest predicted scores that are treated as signal when the AMS is computed.

- `list(param.items())+[('eval_metric', 'ams@0.15')]`: Merges the key-value pairs from `param` and the additional `eval_metric` parameter into one unified list, subsequently stored in `plst`.
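
Why a list rather than simply adding another key to the dictionary? A dictionary can hold only one value per key, so a second `eval_metric` would overwrite the first. A list of pairs can repeat the key, which is why the training log reports both `train-auc` and `train-ams@0.15`. The snippet below illustrates this in plain Python, using a shortened copy of `param`.

```python
# Shortened copy of the parameter dictionary used above.
param = {"objective": "binary:logitraw", "eval_metric": "auc"}

# Dictionary route: the second metric replaces the first.
as_dict = dict(param)
as_dict["eval_metric"] = "ams@0.15"

# List-of-pairs route: the key appears twice, so both metrics are kept.
plst = list(param.items()) + [("eval_metric", "ams@0.15")]
metrics = [value for key, value in plst if key == "eval_metric"]

print(as_dict["eval_metric"])
print(metrics)
```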


In the realm of the Higgs Boson Kaggle Competition, XGBoost didn’t just make an appearance; it shone brightly and was wielded effectively by competitors, particularly the victor, Gabor Melis. The baseline model for this competition, provided by Tianqi Chen, offered a solid foundation upon which Melis sculpted his winning model. 

You might wonder - what set Melis apart? Dive into his solution [here](https://github.com/melisgl/higgsml), and you'll note that while his tweaks to the baseline model weren’t groundbreaking, they were tactical and effective. A noteworthy strategy he employed was feature engineering, subtly crafting additional columns to enrich the data, a tactic we’ll explore more in Chapter 9.

If you're intrigued by the prospect of crafting and submitting your own model to Kaggle post-competition, it's entirely possible! However, bear in mind that Kaggle submissions have their own set of nuances - they need to be ranked, correctly indexed, and submitted through the Kaggle API. A deeper dive into this is warranted for true mastery. A peek at the XGBoost ranking code [here](https://github.com/dmlc/xgboost/blob/master/demo/kaggle-higgs/higgs-pred.py) might offer some insights for those eager to embark on this journey.
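
As a rough sketch of what "ranked and correctly indexed" can look like, the snippet below builds a submission table, assuming the format used in the linked demo script: the columns `EventId`, `RankOrder` and `Class`, with the most signal-like event receiving the highest rank and the top 15% of events labelled `s`. The function name `make_submission` and the scores are made up for illustration, and the toy call uses a larger fraction so that four rows produce both classes.

```python
import pandas as pd


def make_submission(event_ids, scores, signal_fraction=0.15):
    """Build a ranked submission table from event ids and model scores."""
    sub = pd.DataFrame({"EventId": list(event_ids), "score": list(scores)})
    n = len(sub)
    # RankOrder is a permutation of 1..n; the highest score gets rank n.
    sub["RankOrder"] = sub["score"].rank(method="first").astype(int)
    # The top `signal_fraction` of events by rank are labelled as signal.
    n_signal = int(signal_fraction * n)
    sub["Class"] = ["s" if r > n - n_signal else "b" for r in sub["RankOrder"]]
    return sub[["EventId", "RankOrder", "Class"]]


toy_sub = make_submission(
    event_ids=[350000, 350001, 350002, 350003],
    scores=[0.1, 2.3, -1.0, 0.7],
    signal_fraction=0.5,
)
print(toy_sub.to_csv(index=False))
```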

In essence, this episode of Kaggle competition showcased the might of XGBoost and the nuanced art of model tweaking and feature engineering, pivotal in propelling competitors towards victory.