<a href="https://colab.research.google.com/github/shir994/misis_seminars/blob/master/MiSiS_ldm_seminar.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this seminar we will examine toy data for the simulation of LDM and neutrino scattering inside the SND detector. All data presented here are Monte Carlo truth data. Your task will be to make some plots, make some guesses, and train basic algorithms to try to distinguish between LDM events and neutrino events based on the observed MC variables.

In [None]:
!wget -O dm_kinematics.pkl https://cernbox.cern.ch/index.php/s/Sw18OVFaIUcafus/download
!wget -O nu_e_elastic_kinematics.pkl https://cernbox.cern.ch/index.php/s/pQ2EnbWgY12tT3D/download
!wget -O nu_ccqe_elastic_kinematics.pkl https://cernbox.cern.ch/index.php/s/5w73TtuwUjMoTMp/download  

In [None]:
# Importing all the necessary packages 
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
import pickle

import matplotlib as mpl
mpl.rcParams['hatch.linewidth'] = 2.0
my_cmap = plt.cm.jet.copy()  # work on a copy so the globally registered colormap is not modified
my_cmap.set_under('white')

In [None]:
def read_file(filename):
    with open(filename, 'rb') as f:
        raw_data = pickle.load(f)
    
    dict_data = {"init_E":[], "ele_E": [], "ele_theta": [], "ele_theta_to_particle": [], "init_theta":[]}
    for event in raw_data:
        if len(event) == 2:
            dict_data["init_E"].append(event['initial_particle']['E'])
            dict_data["init_theta"].append(event["initial_particle"]["theta"])           
            dict_data["ele_E"].append(event["ele_params"]["E"])
            dict_data["ele_theta"].append(event["ele_params"]["theta"])
            dict_data["ele_theta_to_particle"].append(event["ele_params"]["theta_to_particle"])
    return pd.DataFrame(dict_data, columns=["init_E", "ele_E", "ele_theta", "ele_theta_to_particle", "init_theta"])
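
The loader above implies a particular layout for each pickled event. Here is a minimal sketch of that layout, inferred from the keys that `read_file` accesses. The numbers and the `toy_events` name are invented for illustration, and the notebook does not say why some events in the real files fail the `len(event) == 2` check:

```python
# Toy events mimicking the structure that read_file expects.
# All numbers are invented for illustration only.
toy_events = [
    {
        "initial_particle": {"E": 20.0, "theta": 0.01},
        "ele_params": {"E": 5.0, "theta": 0.02, "theta_to_particle": 0.015},
    },
    # An event with a single top-level key does not pass the len(event) == 2 check
    {"initial_particle": {"E": 12.0, "theta": 0.03}},
]

# Same selection as in read_file: keep events with exactly two top-level keys
kept = [event for event in toy_events if len(event) == 2]

# Flatten the kept events into rows, one dict per event
rows = [
    {
        "init_E": event["initial_particle"]["E"],
        "init_theta": event["initial_particle"]["theta"],
        "ele_E": event["ele_params"]["E"],
        "ele_theta": event["ele_params"]["theta"],
        "ele_theta_to_particle": event["ele_params"]["theta_to_particle"],
    }
    for event in kept
]
print(len(kept), rows[0]["ele_E"])
```

Each row then corresponds to one line of the resulting DataFrame.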

In [None]:
# Load data
dm_df = read_file("dm_kinematics.pkl")
nu_df_el = read_file("nu_e_elastic_kinematics.pkl")
nu_df_ccqe = read_file("nu_ccqe_elastic_kinematics.pkl")


## Examining the data

OK, we have downloaded our data. Let's have a look at it.

In [None]:
dm_df.head()

We have five initial features:
- init_E - energy of the incoming particle (the neutrino or the LDM particle)
- init_theta - angle of the incoming particle with respect to the Z-axis
- ele_E - energy of the outgoing electron
- ele_theta - angle of the outgoing electron with respect to the Z-axis
- ele_theta_to_particle - angle of the outgoing electron with respect to the direction of the incoming particle

What do you think we can actually observe in the detector? Can we approximate other variables somehow?

Let's plot some features to compare the distributions for LDM and neutrino events.

Plots for EL

In [None]:
def add_plots(dm_data, nu_data, xlabel=None, x_range=None, nu_label="Elastic"):
    plt.hist(dm_data, density=True, bins=50, label="DM", range=x_range,
             edgecolor='k', hatch='/', fill=False, histtype='step', linewidth=5)
    plt.hist(nu_data, density=True, bins=50, alpha=0.5, label=r"{} $\nu_e$".format(nu_label), range=x_range,
             edgecolor='k', hatch='x', fill=False, histtype='step', linewidth=5)
    plt.grid()
    plt.legend(fontsize=30)
    plt.xlabel(xlabel, fontsize=25);
    plt.ylabel("a.u.", fontsize=25)
    plt.tick_params('both', labelsize=20)
    
plt.figure(figsize=(32, 8))
plt.subplot(1,3,1)
add_plots(dm_df["ele_E"], nu_df_el["ele_E"], "Electron  $E$, GeV",x_range=(0,30))
plt.subplot(1,3,2)
add_plots(dm_df["ele_theta"], nu_df_el["ele_theta"], r"Electron $\theta$, rad",x_range=(0, 0.05))
plt.subplot(1,3,3)
add_plots(dm_df["ele_theta_to_particle"], nu_df_el["ele_theta_to_particle"],
          r"Electron $\theta$ wrt to $\nu$, rad", x_range=(0, 0.05))

Plots for CCQE

In [None]:
plt.figure(figsize=(32, 8))
plt.subplot(1,3,1)
add_plots(dm_df["ele_E"], nu_df_ccqe["ele_E"], "Electron  $E$, GeV",x_range=(0,30), nu_label="CCQE")
plt.subplot(1,3,2)
add_plots(dm_df["ele_theta"], nu_df_ccqe["ele_theta"], r"Electron $\theta$, rad",x_range=(0, 0.05), nu_label="CCQE")
plt.subplot(1,3,3)
add_plots(dm_df["ele_theta_to_particle"], nu_df_ccqe["ele_theta_to_particle"],
          r"Electron $\theta$ wrt to $\nu$, rad", x_range=(0, 0.05), nu_label="CCQE")

It looks like for CCQE it's quite hard to tell the two apart, so let's plot 2D histograms!

OK, now it's your turn to quickly draw some plots.
- Using the code above, plot a 2D histogram (use *plt.hist2d*) of electron energy ("ele_E") vs electron angle ("ele_theta") for LDM, EL, and CCQE
- Scale the histograms so that they cover the same range in x
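
One possible way to keep the panels comparable is to fix the binning and the axis ranges explicitly. Below is a minimal sketch on synthetic data using `np.histogram2d`; `plt.hist2d` accepts the same `bins` and `range` arguments. The sample distributions are invented, and the ranges are simply borrowed from the 1D plots above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for (energy, angle) of two samples; not real data
e_a, theta_a = rng.exponential(5.0, 1000), rng.exponential(0.01, 1000)
e_b, theta_b = rng.exponential(10.0, 1000), rng.exponential(0.02, 1000)

# One shared range for both samples: [[x_min, x_max], [y_min, y_max]]
shared_range = [[0.0, 30.0], [0.0, 0.05]]

counts_a, x_edges_a, y_edges_a = np.histogram2d(e_a, theta_a, bins=50, range=shared_range)
counts_b, x_edges_b, y_edges_b = np.histogram2d(e_b, theta_b, bins=50, range=shared_range)

# Identical bin edges mean the two histograms can be compared bin by bin
print(np.array_equal(x_edges_a, x_edges_b), counts_a.shape)
```

In a plotting cell the corresponding call would look like `plt.hist2d(x, y, bins=50, range=shared_range)`.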

In [None]:
def add_plots(dm_data, nu_data, xlabel=None, title=None):
    <YOUR CODE>
    
plt.figure(figsize=(33, 8))
plt.subplot(1,3,1)
add_plots(dm_df["ele_E"], dm_df["ele_theta"], xlabel="Electron  $E$, GeV", title="LDM")
plt.subplot(1,3,2)
add_plots(nu_df_el["ele_E"], nu_df_el["ele_theta"], xlabel="Electron  $E$, GeV", title="Elastic")
plt.subplot(1,3,3)
add_plots(nu_df_ccqe["ele_E"], nu_df_ccqe["ele_theta"], xlabel="Electron  $E$, GeV", title="CCQE")

What do you observe? How can we improve this very basic visual analysis?

# Fit some simple ML algorithm

Now, let's use some very basic algorithms to decide what is LDM and what is noise!

In [None]:
# import packages we want to use
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, GridSearchCV, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_recall_curve, roc_curve, roc_auc_score
from sklearn.metrics import average_precision_score

Let's first preprocess the data:

In [None]:
dm_df.shape, nu_df_el.shape, nu_df_ccqe.shape

In [None]:
def perpare_data(dm_df, nu_df_el, nu_df_ccqe):
    # Subsample the LDM events (signal) to match the number of nu_e elastic events (background).
    # This is done so that our metrics are representative
    dm_df_sample = dm_df.sample(n=nu_df_el.shape[0], random_state=1543)
    
    # Label the data on copies, so that the input DataFrames are left unchanged
    dm_df_sample = dm_df_sample.assign(label=1)
    nu_df_el = nu_df_el.assign(label=0)
    nu_df_ccqe = nu_df_ccqe.assign(label=2)

    # Join everything into one table and shuffle the rows
    joined_data = pd.concat((dm_df_sample, nu_df_el, nu_df_ccqe))
    joined_data = joined_data.sample(frac=1, replace=False)    
    print(joined_data.head())
    
    # Separate the label from our discriminative features
    y = joined_data['label'].copy()
    joined_data = joined_data.drop(['label'], axis=1)
    # Remember the positions of the CCQE events, then merge them into the background class
    ccqe_indeces = np.where(y==2)[0]
    y[y==2] = 0
    return joined_data, y, ccqe_indeces

In [None]:
joined_data, y, ccqe_indeces = perpare_data(dm_df, nu_df_el, nu_df_ccqe)


Now, out of pure curiosity, let's first train the algorithm using only one feature: the energy of the electron.

In [None]:
# These are the current columns of the dataset
joined_data.columns

In [None]:
# Select only the ele_E column (the electron energy) into a separate variable, name it e_only_train
<YOUR CODE>

## The output of our algorithm will be the probability that an event belongs to class 1

In [None]:
model = LogisticRegression(C=0.001)

probs = cross_val_predict(model, e_only_train, y=y, cv=5, n_jobs=4, method='predict_proba')

In [None]:
plt.hist(probs[:,1]);
plt.xlabel("Probability", fontsize=25)
plt.ylabel("a.u.", fontsize=25);

Not very informative, right? Of course, in physics we have standard metrics for quantifying the performance of an algorithm, for example $\frac{S}{\sqrt{B}}$. In our case we will consider signal efficiency vs background rejection. Sklearn has built-in functions that calculate almost what we need, just defined in mathematical terms:

In [None]:
fpr,tpr,t = roc_curve(y.values, probs[:, 1], pos_label=1)

The function above returns the False Positive Rate (FPR), which is defined as:
$$
FPR = \frac{FP}{FP + TN}
$$

and the True Positive Rate (TPR), which is defined as
$$
TPR = \frac{TP}{TP + FN}
$$
These definitions are basically what we need. How do they correspond to signal efficiency and background rejection?
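
To make the definitions concrete, here is a small hand-computed sketch at one fixed threshold. In this picture the signal efficiency is the TPR and the background rejection is $1 - FPR$ (the same quantity that `fit_and_plot` puts on the x-axis further down). Taking $S = TP$ and $B = FP$ also gives the $\frac{S}{\sqrt{B}}$ figure of merit. The scores, labels and threshold are invented for the example:

```python
import math

# Invented classifier outputs and true labels (1 = signal, 0 = background)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
threshold = 0.5  # hypothetical cut on the predicted probability

# An event is called "signal" if its score is above the cut
predicted = [1 if s > threshold else 0 for s in scores]

# Confusion-matrix counts
tp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 1)
fp = sum(1 for p, l in zip(predicted, labels) if p == 1 and l == 0)
tn = sum(1 for p, l in zip(predicted, labels) if p == 0 and l == 0)
fn = sum(1 for p, l in zip(predicted, labels) if p == 0 and l == 1)

tpr = tp / (tp + fn)            # signal efficiency
fpr = fp / (fp + tn)            # fraction of background that survives the cut
background_rejection = 1 - fpr  # fraction of background that is removed

# S / sqrt(B) with S = TP and B = FP; undefined when no background passes the cut
significance = tp / math.sqrt(fp)

print(tpr, background_rejection, significance)
```

`roc_curve` repeats this bookkeeping for every possible threshold, which is what turns a single working point into a curve.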

Now, plot signal efficiency vs background rejection as a curve.

In [None]:
plt.figure(figsize=(12,6))
<YOUR CODE>

plt.xlabel("Background rejection efficiency (EL + CCQE)", fontsize=25)
plt.ylabel("Signal efficiency", fontsize=25);
print("AVG precision", average_precision_score(y.values, probs[:, 1], pos_label=1))

Somewhat expected, isn't it? As you can see, the quality is quite low. This was obvious from the plots we made in the beginning. Now let's repeat the procedure using all the features we have!

In [None]:
def fit_and_plot(model, data, y):
    probs = cross_val_predict(model, data, y=y, cv=5, n_jobs=4, method='predict_proba')
    fpr,tpr,t = roc_curve(y.values, probs[:, 1], pos_label=1)
    plt.figure(figsize=(12,6))
    plt.plot(1 - fpr, tpr)
    plt.grid()
    plt.xlabel("Background rejection efficiency (EL + CCQE)", fontsize=25)
    plt.ylabel("Signal efficiency", fontsize=25);
    print("AVG precision", average_precision_score(y.values, probs[:, 1], pos_label=1))

In [None]:
model = LogisticRegression(C=0.001)
fit_and_plot(model, joined_data, y)

Much better, isn't it? But are there features commonly used in physics that we can create from the given data? YES! We can create, for example, $E_T$. So, add $E_T$ to the data and refit the algorithm.
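
For reference, the transverse energy with respect to the Z-axis is usually taken as $E_T = E \sin\theta$. Here is a minimal sketch on plain numbers. The values and the `transverse_energy` helper are invented for illustration; for DataFrame columns the same formula would be applied column-wise, for example with `np.sin`:

```python
import math

def transverse_energy(energy, theta):
    """Return E_T = E * sin(theta), with theta in radians measured from the Z-axis."""
    return energy * math.sin(theta)

# Hypothetical electron: 10 GeV at 0.02 rad from the Z-axis
e_t = transverse_energy(10.0, 0.02)
print(e_t)
```

For small angles $\sin\theta \approx \theta$, so $E_T$ is close to $E\,\theta$. The feature list used later also names `E_t_rel`, whose definition is not given in this notebook.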

In [None]:
<YOUR CODE>

In [None]:
model = LogisticRegression(C=0.001)
fit_and_plot(model, joined_data, y)

Finally, we can try a different algorithm:

In [None]:
from xgboost import XGBClassifier
model = XGBClassifier(n_estimators=500)
feature_columns = ['ele_E', 'ele_theta', 'ele_theta_to_particle', 'E_t', "init_theta", "E_t_rel"]
fit_and_plot(model, joined_data[feature_columns], y)

YAY, much better out of the box!

In [None]:
model = XGBClassifier(n_estimators=500)
probs = cross_val_predict(model, joined_data[feature_columns], y=y, cv=5, n_jobs=4, method='predict_proba')

In [None]:
plt.figure(figsize=(12,6))
plt.hist(probs[y.values == 0, 1], bins=50, label='nu')
plt.hist(probs[y.values == 1, 1], bins=50,alpha=0.5, label='dm');
plt.legend()

We can also study the distribution of the answers, dig into why we get these particular answers, sample the data in the right proportion to the reaction cross-sections, and so on. All of the above is just a toy example to give you an idea of what a day-to-day task looks like.

Bonus quest: try to do a grid search and improve the results!

In [None]:
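# C is a LogisticRegression hyperparameter, so search over that model rather than the XGBClassifier above
gs = GridSearchCV(LogisticRegression(), param_grid={"C": np.logspace(-3, 3, 10)}, scoring="roc_auc", n_jobs=3, cv=5)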
...
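
A minimal sketch of such a search on synthetic data, assuming a `LogisticRegression` model (the `C` parameter belongs to it, not to `XGBClassifier`). The dataset from `make_classification`, the `max_iter` setting and the names `X_toy`, `y_toy` and `gs_toy` are stand-ins chosen for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification problem standing in for joined_data / y
X_toy, y_toy = make_classification(n_samples=500, n_features=5, random_state=0)

# Same grid of regularisation strengths as in the cell above
param_grid = {"C": np.logspace(-3, 3, 10)}
gs_toy = GridSearchCV(LogisticRegression(max_iter=1000), param_grid=param_grid,
                      scoring="roc_auc", cv=5)
gs_toy.fit(X_toy, y_toy)

# Best regularisation strength and its cross-validated ROC AUC
print(gs_toy.best_params_, gs_toy.best_score_)
```

After fitting, `best_estimator_` holds the model refit on the full data with the selected `C`.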