# Predict invariant mass of two electrons in particle collision events using Gradient Boosting and the power of Feature Engineering

## Table of Contents
* [Admin Columns](#1)
* [Target Distribution](#2)
* [Feature Distributions](#3)
* [Feature Correlations](#4)
* [Target vs Features](#5)
* [Feature Engineering](#6)
* [Gradient Boosting Model w/o Feature Engineering](#7)
* [Gradient Boosting Model using Feature Engineering](#8)
* [Local Explanations](#9)

### The features are not self-explanatory, so here is a copy of the data description:

Run: The run number of the event.

Event: The event number.

E1, E2: The total energy of electrons 1 and 2 (GeV).

px1, py1, pz1, px2, py2, pz2: The momentum components of electrons 1 and 2 (GeV).

pt1, pt2: The transverse momentum of electrons 1 and 2 (GeV).

eta1, eta2: The pseudorapidity of electrons 1 and 2.

phi1, phi2: The phi angle of electrons 1 and 2 (rad).

Q1, Q2: The charge of electrons 1 and 2.

M: The invariant mass of two electrons (GeV) <= OUR TARGET

In [None]:
# packages

# standard
import numpy as np
import pandas as pd
import time

# plots
import matplotlib.pyplot as plt
import seaborn as sns

# Statistics
import scipy.stats as stats
from sklearn.metrics import mean_absolute_error, mean_squared_error

# ML tools
import h2o
from h2o.estimators import H2ORandomForestEstimator
from h2o.estimators import H2OGradientBoostingEstimator

In [None]:
# load data
df = pd.read_csv('../input/cern-electron-collision-data/dielectron.csv')
df.head()

In [None]:
# dimension of table
df.shape

In [None]:
# structure of data frame
df.info()

In [None]:
# clean column names
df.rename(columns = {'px1 ':'px1'}, inplace = True)

<a id='1'></a>
# Admin Columns

### Runs

In [None]:
# frequencies of run
df.Run.value_counts().plot(kind='bar')
plt.grid()
plt.title('Run')
plt.show()

#### We have 13 runs of varying size.

### Events

In [None]:
# events: a few of them occur more than once!
multis = df.Event.value_counts() # get counts
multis = multis[multis.values>1] # filter by frequency > 1
multis

In [None]:
# extract ids
multis_ids = multis.index.to_list()
print(multis_ids)
# and show corresponding rows
df[df.Event.isin(multis_ids)].sort_values('Event')

#### Events that occur more than once are exact duplicates, with one exception: Event=418006834.

In [None]:
df[df.Event==418006834]

#### Interestingly the two rows are even from different runs... maybe a bug?

In [None]:
# we fix the situation by changing the Event for the second row
df.loc[79612,'Event'] = 418006835 # use a number that is not yet in use!
# and adjust our duplicate list
multis_ids.remove(418006834)
# check:
df[df.Event==418006834]

#### Finally, remove remaining duplicates:

In [None]:
# remove duplicates
df = df.drop_duplicates(subset='Event')
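
The manual inspection above can also be done programmatically. The following is a minimal sketch on a made-up toy frame (the values and the names `toy`, `repeated` and `conflicts` are illustrative assumptions, not part of the dataset): it separates exact duplicate rows from rows that merely share an `Event` id but carry different data.

```python
import pandas as pd

# Toy frame with made-up values: Event 2 is an exact duplicate,
# Event 3 is reused for two different rows (different Run and M)
toy = pd.DataFrame({
    'Run':   [1, 1, 1, 1, 2],
    'Event': [1, 2, 2, 3, 3],
    'M':     [10.0, 20.0, 20.0, 30.0, 35.0],
})

# all rows whose Event id occurs more than once
repeated = toy[toy.duplicated(subset='Event', keep=False)]

# among those, keep only rows that are NOT identical copies across all columns
conflicts = repeated[~repeated.duplicated(keep=False)]
print(conflicts)
```

Rows that end up in `conflicts` need an individual decision (such as assigning a new id), while the rest can safely be dropped with `drop_duplicates`.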

<a id='2'></a>
# Target Distribution

In [None]:
# check for missing values in the target
df.M.isna().sum()

#### We have 85 rows without a target value.

In [None]:
# drop the rows with a missing target; they are only a very small fraction of the data
df = df[~df.M.isna()]
df.shape

In [None]:
# plot target
plt.figure(figsize=(10,6))
df.M.plot(kind='hist', bins=100)
plt.title('Distribution of M - The invariant mass of two electrons (GeV).')
plt.grid()
plt.show()

### Interesting shape: the values are concentrated near the minimum of 2 GeV, and there are two further peaks.

In [None]:
# stats for target; adding a few more percentiles compared to standard output
df.M.describe(percentiles=[0.01,0.1,0.25,0.5,0.75,0.9,0.99])

<a id='3'></a>
# Feature Distributions

### Charges of electron 1 and 2

In [None]:
df.Q1.value_counts().plot(kind='bar')
plt.title('Q1 - Charge of electron 1')
plt.grid()
plt.show()

df.Q2.value_counts().plot(kind='bar')
plt.title('Q2 - Charge of electron 2')
plt.grid()
plt.show()

#### Nicely balanced!

In [None]:
# define numeric features
features_num = ['E1', 'px1', 'py1', 'pz1', 'pt1', 'eta1', 'phi1', 
                'E2', 'px2', 'py2', 'pz2', 'pt2', 'eta2', 'phi2']

In [None]:
# summary stats, adding a few more percentiles compared to standard output
df[features_num].describe(percentiles=[0.01,0.1,0.25,0.5,0.75,0.9,0.99])

### Feature distributions

In [None]:
# combo plot hist / boxplot
for f in features_num:
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10,8))
    ax1.hist(df[f], bins=100)
    ax1.grid()
    ax1.set_title(f)
    ax2.boxplot(df[f], vert=False)
    ax2.grid()   
    ax2.set_title(f + ' - boxplot')
    plt.show()

<a id='4'></a>
# Feature Correlations

### Numerical Features

In [None]:
# Pearson correlation
corr_pearson = df[features_num].corr(method='pearson')

fig = plt.figure(figsize = (14,8))
sns.heatmap(corr_pearson, annot=True, cmap='RdYlGn',
            vmin=-1, vmax=1)
plt.title('Pearson Correlation')
plt.show()

In [None]:
# Spearman (Rank) correlation
corr_spearman = df[features_num].corr(method='spearman')

fig = plt.figure(figsize = (14,8))
sns.heatmap(corr_spearman, annot=True, cmap='RdYlGn',
            vmin=-1, vmax=1)
plt.title('Spearman Correlation')
plt.show()

#### A few interesting scatter plots (we pick the ones having the highest correlation):

In [None]:
plt.scatter(df.E1, df.pt1, alpha=0.1)
plt.xlabel('E1')
plt.ylabel('pt1')
plt.grid()
plt.show()

In [None]:
plt.scatter(df.E2, df.pt2, alpha=0.1)
plt.xlabel('E2')
plt.ylabel('pt2')
plt.grid()
plt.show()

In [None]:
plt.scatter(df.py1, df.phi1, alpha=0.1)
plt.xlabel('py1')
plt.ylabel('phi1')
plt.grid()
plt.show()

In [None]:
plt.scatter(df.py2, df.phi2, alpha=0.1)
plt.xlabel('py2')
plt.ylabel('phi2')
plt.grid()
plt.show()

In [None]:
plt.scatter(df.pz1, df.eta1, alpha=0.1)
plt.xlabel('pz1')
plt.ylabel('eta1')
plt.grid()
plt.show()

In [None]:
plt.scatter(df.pz2, df.eta2, alpha=0.1)
plt.xlabel('pz2')
plt.ylabel('eta2')
plt.grid()
plt.show()
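
The strong relationships in these scatter plots are to be expected if the columns follow the standard collider definitions of transverse momentum, azimuthal angle and pseudorapidity. The sketch below assumes exactly that (the dataset description does not spell out the formulas) and uses made-up momentum components for a single electron, with its mass neglected.

```python
import numpy as np

# made-up momentum components (GeV) for one electron; not taken from the dataset
px, py, pz = 3.0, -4.0, 12.0

pt = np.hypot(px, py)                # transverse momentum: sqrt(px^2 + py^2)
phi = np.arctan2(py, px)             # azimuthal angle in (-pi, pi]
eta = np.arcsinh(pz / pt)            # pseudorapidity
E = np.sqrt(px**2 + py**2 + pz**2)   # energy when the electron mass is neglected

print(pt, phi, eta, E)
```

Under these assumptions `py = pt * sin(phi)`, `pz = pt * sinh(eta)` and `E = pt * cosh(eta)`, which matches the pairs plotted above (py/phi, pz/eta, E/pt).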

### Electron Charges

In [None]:
# cross table for electron charges
pd.crosstab(df.Q1, df.Q2)

#### Again, almost symmetric. However, combinations with different signs are more frequent than those having same signs. Let's try to add the product of the two charges (+1 if both have same sign, -1 if they have different signs) to our features:

In [None]:
df['Q12'] = df.Q1 * df.Q2
df['Q12'].value_counts().plot(kind='bar')
plt.title('Q12 = Q1*Q2')
plt.grid()
plt.show()

<a id='5'></a>
# Target vs Features

### Numerical Features

In [None]:
# scatter plot including correlation figure
for f in features_num:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

### Electron charges

In [None]:
for f in ['Q1','Q2','Q12']:
    sns.violinplot(data=df, x=f, y='M')
    plt.title('Target vs '+f)
    plt.grid()
    plt.show()

### Neither Q1 nor Q2 on its own shows a visible impact on the target.
### But the product Q12 = Q1*Q2 seems to make a difference!

<a id='6'></a>
# Feature Engineering

### Let's check if this "product trick" works also with other features:

In [None]:
corr_method='spearman'

In [None]:
df['px12'] = df.px1 * df.px2
# calc correlation with target and visualize
for f in ['px1','px2','px12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

In [None]:
df['py12'] = df.py1 * df.py2
# calc correlation with target and visualize
for f in ['py1','py2','py12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

In [None]:
df['pz12'] = df.pz1 * df.pz2
# calc correlation with target and visualize
for f in ['pz1','pz2','pz12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

In [None]:
df['phi12'] = df.phi1 * df.phi2
# calc correlation with target and visualize
for f in ['phi1','phi2','phi12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

In [None]:
df['eta12'] = df.eta1 * df.eta2
# calc correlation with target and visualize
for f in ['eta1','eta2','eta12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

### OK, this is somewhat surprising: none of the individual features px1/2, py1/2, pz1/2, phi1/2 and eta1/2 shows a notable rank correlation with the target, but all of their products do!
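
This pattern is less mysterious than it looks. If two features are distributed symmetrically around zero and the target depends on their product, each factor alone carries almost no rank correlation with the target. The following sketch illustrates this on purely synthetic data; the variables `x1`, `x2`, `y` and the helper `spearman` are hypothetical and not taken from the dataset.

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation as Pearson correlation of ranks (assumes no ties)."""
    ra = np.argsort(np.argsort(a))
    rb = np.argsort(np.argsort(b))
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(0)
n = 5000
x1 = rng.normal(size=n)   # synthetic feature, symmetric around zero
x2 = rng.normal(size=n)   # second synthetic feature, independent of x1
y = -x1 * x2 + 0.1 * rng.normal(size=n)   # hypothetical target driven by the product

c_x1 = spearman(x1, y)
c_x2 = spearman(x2, y)
c_prod = spearman(x1 * x2, y)
print(c_x1, c_x2, c_prod)
```

The individual correlations come out close to zero while the product is strongly (here negatively) correlated with the synthetic target.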

In [None]:
df['pt12'] = df.pt1 * df.pt2
# calc correlation with target and visualize
for f in ['pt1','pt2','pt12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

In [None]:
df['E12'] = df.E1 * df.E2
# calc correlation with target and visualize
for f in ['E1','E2','E12']:
    c = np.round(df[f].corr(df.M, method='spearman'),4) # correlation
    plt.scatter(df[f], df.M, alpha=0.1)
    plt.title('Target vs '+f+' - corr_sp='+str(c))
    plt.grid()
    plt.show()

### The more predictive features pt1/2 and E1/2 show the same effect: the product seems more predictive than either of its factors!
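
There is a physical reason why products of this kind help. For two particles with four-momenta (E, px, py, pz), the invariant mass is M² = (E1+E2)² − |p1+p2|². If the electron mass is neglected, this expands to M² = 2·(E1·E2 − px1·px2 − py1·py2 − pz1·pz2), so E12, px12, py12 and pz12 are literally the terms of the target. The sketch below checks this identity on made-up momenta; it assumes natural units and massless electrons, and the function names are illustrative only.

```python
import numpy as np

def invariant_mass(E1, px1, py1, pz1, E2, px2, py2, pz2):
    """Invariant mass from the summed four-momentum (natural units, GeV)."""
    m2 = (E1 + E2)**2 - (px1 + px2)**2 - (py1 + py2)**2 - (pz1 + pz2)**2
    return np.sqrt(np.maximum(m2, 0.0))

def invariant_mass_products(E12, px12, py12, pz12):
    """Same quantity for massless particles, written only with pairwise products."""
    m2 = 2.0 * (E12 - px12 - py12 - pz12)
    return np.sqrt(np.maximum(m2, 0.0))

# made-up momenta (GeV); energies set to |p|, i.e. the electron mass is neglected
p1 = np.array([3.0, 4.0, 0.0])
p2 = np.array([-6.0, 0.0, 8.0])
E1, E2 = np.linalg.norm(p1), np.linalg.norm(p2)

m_full = invariant_mass(E1, *p1, E2, *p2)
m_prod = invariant_mass_products(E1 * E2, p1[0] * p2[0], p1[1] * p2[1], p1[2] * p2[2])
print(m_full, m_prod)
```

Both functions return the same value, which suggests that a tree model given the product features only has to learn a simple combination of them instead of approximating the multiplications itself.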

<a id='7'></a>
# Gradient Boosting Model w/o Feature Engineering

## Let's first build a model using just the original features:

In [None]:
# select predictors
predictors = features_num + ['Q1','Q2']
print('Number of predictors: ', len(predictors))
print(predictors)

# define target
target='M'

In [None]:
# start H2O
h2o.init(max_mem_size='12G', nthreads=4) # Use maximum of 12 GB RAM and 4 cores

In [None]:
# upload data frame in H2O environment
df_hex = h2o.H2OFrame(df)

# train / test split (80/20)
train_perc = 0.8
train_hex, test_hex = df_hex.split_frame(ratios=[train_perc], seed=999)

In [None]:
# export train/test for external processing
df_train = train_hex.as_data_frame()
df_test = test_hex.as_data_frame()

df_train.to_csv('df_train.csv')
df_test.to_csv('df_test.csv')

In [None]:
# # define (distributed) Random Forest model
# fit_1 = H2ORandomForestEstimator(ntrees=100,
#                                    max_depth=15,
#                                    min_rows=1,
#                                    nfolds=5,
#                                    seed=999)

In [None]:
# define Gradient Boosting model
fit_1 = H2OGradientBoostingEstimator(ntrees = 801,
                                     max_depth=4,
                                     min_rows=15,
                                     sample_rate=0.9,
                                     col_sample_rate=0.7,
                                     nfolds=5,
                                     seed=999)

In [None]:
# train model - this takes a few minutes...
t1 = time.time()
fit_1.train(x=predictors,
            y=target,
            training_frame=train_hex)
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

In [None]:
# show training scoring history
plt.rcParams['figure.figsize']=(7,4)
fit_1.plot()

In [None]:
# show cross validation metrics
fit_1.cross_validation_metrics_summary()

In [None]:
# show scoring history - training vs cross validations
for i in range(5):
    cv_model_temp = fit_1.cross_validation_models()[i]
    df_cv_score_history = cv_model_temp.scoring_history() # score_history() is a deprecated alias
    my_title = 'CV ' + str(1+i) + ' - Scoring History [RMSE]'
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.training_rmse, 
                c='blue', label='training')
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.validation_rmse, 
                c='darkorange', label='validation')
    plt.title(my_title)
    plt.xlabel('Number of Trees')
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
# variable importance using shap values => see direction as well as severity of feature impact
t1 = time.time()
fit_1.shap_summary_plot(train_hex);
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

#### The plot confirms that neither Q1 nor Q2 have relevant predictive power.

### Predict and evaluate performance on training data

In [None]:
# predict on training data
pred_train = fit_1.predict(train_hex)
y_train_act = train_hex.as_data_frame()[target].values # actuals
y_train_pred = pred_train.as_data_frame().predict.values # predictions

In [None]:
# plot predictions vs actual
p=sns.jointplot(x=y_train_act, y=y_train_pred, # x/y as keywords: required by recent seaborn versions
              joint_kws={'alpha' : 0.1})
p.fig.suptitle('Prediction vs Actual - Training Data')
plt.xlabel('Actual')
plt.ylabel('Prediction')
plt.show()

In [None]:
print('Correlations - Training Data')
print('Correlation Pearson:', stats.pearsonr(y_train_act, y_train_pred))
print('Correlation Spearman:', stats.spearmanr(y_train_act, y_train_pred))

In [None]:
# metrics on training data
print('MAE (train): ', np.round(mean_absolute_error(y_train_act, y_train_pred),2))
print('RMSE(train): ', np.round(np.sqrt(mean_squared_error(y_train_act, y_train_pred)),2))

### Predict and evaluate performance on test set:

In [None]:
# predict on test data
pred_test = fit_1.predict(test_hex)
y_test_act = test_hex.as_data_frame()[target].values # actual values
y_test_pred = pred_test.as_data_frame().predict.values # predictions

In [None]:
# plot predictions vs actuals
p=sns.jointplot(x=y_test_act, y=y_test_pred, # x/y as keywords: required by recent seaborn versions
              joint_kws={'alpha' : 0.1})
p.fig.suptitle('Prediction vs Actual - Test Data')
plt.xlabel('Actual')
plt.ylabel('Prediction')
plt.show()

In [None]:
print('Correlations - Test Set')
print('Correlation Pearson:', stats.pearsonr(y_test_act, y_test_pred))
print('Correlation Spearman:', stats.spearmanr(y_test_act, y_test_pred))

In [None]:
# metrics on test data
print('MAE (test): ', np.round(mean_absolute_error(y_test_act, y_test_pred),2))
print('RMSE(test): ', np.round(np.sqrt(mean_squared_error(y_test_act, y_test_pred)),2))

### Not bad, but let's see what we can achieve with our feature engineering...

<a id='8'></a>
# Gradient Boosting Model using Feature Engineering


## Let's check if our feature engineering (products feature_1 * feature_2) can improve our model:

In [None]:
# update predictors
predictors = features_num + ['Q1','Q2'] + ['Q12', 'px12', 'py12', 'pz12', 'pt12', 'phi12', 'eta12', 'E12']
print('Number of predictors: ', len(predictors))
print(predictors)

### We keep the same hyper-parameters for the sake of simplicity:

In [None]:
# define Gradient Boosting model
fit_2 = H2OGradientBoostingEstimator(ntrees = 801,
                                     max_depth=4,
                                     min_rows=15,
                                     sample_rate=0.9,
                                     col_sample_rate=0.7,
                                     nfolds=5,
                                     seed=999)

In [None]:
# train model - this takes a few minutes...
t1 = time.time()
fit_2.train(x=predictors,
            y=target,
            training_frame=train_hex)
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

In [None]:
# show training scoring history
plt.rcParams['figure.figsize']=(7,4)
fit_2.plot()

In [None]:
# show cross validation metrics
fit_2.cross_validation_metrics_summary()

### Wow, this is an unreal improvement! Our RMSE (on CV) is now less than half that of the first model!

In [None]:
# show scoring history - training vs cross validations
for i in range(5):
    cv_model_temp = fit_2.cross_validation_models()[i]
    df_cv_score_history = cv_model_temp.scoring_history() # score_history() is a deprecated alias
    my_title = 'CV ' + str(1+i) + ' - Scoring History [RMSE]'
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.training_rmse, 
                c='blue', label='training')
    plt.scatter(df_cv_score_history.number_of_trees,
                y=df_cv_score_history.validation_rmse, 
                c='darkorange', label='validation')
    plt.title(my_title)
    plt.xlabel('Number of Trees')
    plt.legend()
    plt.grid()
    plt.show()

In [None]:
# variable importance using shap values => see direction as well as severity of feature impact
t1 = time.time()
fit_2.shap_summary_plot(train_hex);
t2 = time.time()
print('Elapsed time [s]: ', np.round(t2-t1,2))

### Predict and evaluate performance on training data:

In [None]:
# predict on training data
pred_train = fit_2.predict(train_hex)
y_train_act = train_hex.as_data_frame()[target].values # actuals
y_train_pred = pred_train.as_data_frame().predict.values # predictions

In [None]:
# plot predictions vs actuals
p=sns.jointplot(x=y_train_act, y=y_train_pred, # x/y as keywords: required by recent seaborn versions
              joint_kws={'alpha' : 0.1})
p.fig.suptitle('Prediction vs Actual - Training Data')
plt.xlabel('Actual')
plt.ylabel('Prediction')
plt.show()

In [None]:
print('Correlations - Training Data')
print('Correlation Pearson:', stats.pearsonr(y_train_act, y_train_pred))
print('Correlation Spearman:', stats.spearmanr(y_train_act, y_train_pred))

In [None]:
# metrics on training data
print('MAE (train): ', np.round(mean_absolute_error(y_train_act, y_train_pred),2))
print('RMSE(train): ', np.round(np.sqrt(mean_squared_error(y_train_act, y_train_pred)),2))

### Predict and evaluate performance on test set:

In [None]:
# predict on test data
pred_test = fit_2.predict(test_hex)
y_test_act = test_hex.as_data_frame()[target].values # actual values
y_test_pred = pred_test.as_data_frame().predict.values # predictions

In [None]:
# plot predictions vs actuals
p=sns.jointplot(x=y_test_act, y=y_test_pred, # x/y as keywords: required by recent seaborn versions
              joint_kws={'alpha' : 0.1})
p.fig.suptitle('Prediction vs Actual - Test Data')
plt.xlabel('Actual')
plt.ylabel('Prediction')
plt.show()

### The improved performance is also clearly visible in our scatter plot.

In [None]:
print('Correlations - Test Set')
print('Correlation Pearson:', stats.pearsonr(y_test_act, y_test_pred))
print('Correlation Spearman:', stats.spearmanr(y_test_act, y_test_pred))

In [None]:
# metrics on test data
print('MAE (test): ', np.round(mean_absolute_error(y_test_act, y_test_pred),2))
print('RMSE(test): ', np.round(np.sqrt(mean_squared_error(y_test_act, y_test_pred)),2))

### Also the performance on the test set is more than twice as good!
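
One way to put a number on a statement like "twice as good" is to compare the error metrics of both models directly. The sketch below uses made-up arrays in place of the real actuals and predictions; the helper `mae_rmse` and all values are illustrative assumptions, not results from this notebook.

```python
import numpy as np

def mae_rmse(y_true, y_pred):
    """Return (MAE, RMSE) for two equally long 1-d arrays."""
    err = np.asarray(y_pred, dtype=float) - np.asarray(y_true, dtype=float)
    return np.mean(np.abs(err)), np.sqrt(np.mean(err**2))

# made-up numbers standing in for actuals and the predictions of two models
y_act    = np.array([10.0, 20.0, 30.0, 40.0])
y_pred_a = np.array([14.0, 16.0, 34.0, 36.0])   # hypothetical baseline model
y_pred_b = np.array([11.0, 19.0, 31.0, 39.0])   # hypothetical model with extra features

mae_a, rmse_a = mae_rmse(y_act, y_pred_a)
mae_b, rmse_b = mae_rmse(y_act, y_pred_b)
print('MAE ratio  (a / b):', mae_a / mae_b)
print('RMSE ratio (a / b):', rmse_a / rmse_b)
```

With the real data, the test-set predictions of the first model would have to be stored under a separate name before the second model overwrites `y_test_pred`.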

<a id='9'></a>
# Local Explanations

In [None]:
# select individual row from training data
my_row = 1
train_hex[my_row,:]

In [None]:
# show detailed explanations for this prediction
fit_2.explain_row(frame=train_hex, row_index=my_row);