## Model Suggestions for Predicting the Target Variable (Binary Classification)

In [13]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score

# Feature Selection
from sklearn.feature_selection import SelectFromModel

# Models
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier
from sklearn.svm import SVC


## Loading the Data

In [2]:
data = pd.read_csv('Data/training.csv')

In [3]:
data.head()

|   | EventId | DER_mass_MMC | DER_mass_transverse_met_lep | DER_mass_vis | DER_pt_h | DER_deltaeta_jet_jet | DER_mass_jet_jet | DER_prodeta_jet_jet | DER_deltar_tau_lep | DER_pt_tot | ... | PRI_jet_num | PRI_jet_leading_pt | PRI_jet_leading_eta | PRI_jet_leading_phi | PRI_jet_subleading_pt | PRI_jet_subleading_eta | PRI_jet_subleading_phi | PRI_jet_all_pt | Weight | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100000 | 138.47 | 51.655 | 97.827 | 27.98 | 0.91 | 124.711 | 2.666 | 3.064 | 41.928 | ... | 2 | 67.435 | 2.15 | 0.444 | 46.062 | 1.24 | -2.475 | 113.497 | 0.002653 | s |
| 1 | 100001 | 160.937 | 68.768 | 103.235 | 48.146 | -999.0 | -999.0 | -999.0 | 3.473 | 2.078 | ... | 1 | 46.226 | 0.725 | 1.158 | -999.0 | -999.0 | -999.0 | 46.226 | 2.233584 | b |
| 2 | 100002 | -999.0 | 162.172 | 125.953 | 35.635 | -999.0 | -999.0 | -999.0 | 3.148 | 9.336 | ... | 1 | 44.251 | 2.053 | -2.028 | -999.0 | -999.0 | -999.0 | 44.251 | 2.347389 | b |
| 3 | 100003 | 143.905 | 81.417 | 80.943 | 0.414 | -999.0 | -999.0 | -999.0 | 3.31 | 0.414 | ... | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -0.0 | 5.446378 | b |
| 4 | 100004 | 175.864 | 16.915 | 134.805 | 16.405 | -999.0 | -999.0 | -999.0 | 3.891 | 16.405 | ... | 0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | -999.0 | 0.0 | 6.245333 | b |


In [4]:
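# Separate the features from the target column
data_X = data.drop(columns=['Label'], errors='ignore')
# Treat -999.0 as missing and fill the gaps with column means.
# Note: data_X.mean() is evaluated before the replacement, so these means
# still include the -999.0 placeholders.
data_X = data_X.replace(-999.0, np.nan).fillna(data_X.mean())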

X = data_X.drop(columns=['EventId', 'Weight'], errors='ignore')

# Convert 's' (signal) to 1 and 'b' (background) to 0
y = (data['Label'] == 's').astype(int)
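
The imputation in the cell above calls `data_X.mean()` on the frame that still holds the -999.0 placeholders, so the fill values are pulled toward -999. The sketch below contrasts that order with one possible alternative (replace first, then average) on a toy frame; the column name `feature` and its values are made up for illustration and are not from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame; the column name and values are illustrative only.
toy = pd.DataFrame({"feature": [10.0, 20.0, -999.0, 30.0]})

# Order used above: the mean is taken while -999.0 is still in the column.
fill_with_sentinel = toy.replace(-999.0, np.nan).fillna(toy.mean())

# Alternative: replace the sentinel first, then average what remains.
toy_nan = toy.replace(-999.0, np.nan)
fill_without_sentinel = toy_nan.fillna(toy_nan.mean())

print(fill_with_sentinel["feature"].tolist())     # [10.0, 20.0, -234.75, 30.0]
print(fill_without_sentinel["feature"].tolist())  # [10.0, 20.0, 20.0, 30.0]
```

Applying the second order to the real data would change the imputed values and therefore, possibly, the selected features and the scores reported further down.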

In [5]:
# Feature Selection using RandomForest
selector = SelectFromModel(RandomForestClassifier(n_estimators=100, random_state=42))
selector.fit(X, y)
X_selected = selector.transform(X)
selected_features = X.columns[selector.get_support()]
print("Selected Features:", list(selected_features))

Selected Features: ['DER_mass_MMC', 'DER_mass_transverse_met_lep', 'DER_mass_vis', 'DER_deltar_tau_lep', 'DER_pt_ratio_lep_tau', 'DER_met_phi_centrality', 'PRI_tau_pt', 'PRI_met']
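
When no `threshold` is passed and the estimator has no L1 penalty, `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below makes that rule visible on synthetic data; `X_demo`, `y_demo`, and `selector_demo` are illustrative names and the data is a stand-in for the real `X` and `y`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the notebook's X and y (assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3, random_state=0
)

selector_demo = SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0))
selector_demo.fit(X_demo, y_demo)

# Importances of the fitted forest and the mask of retained features.
importances = selector_demo.estimator_.feature_importances_
mask = selector_demo.get_support()

print("threshold:", selector_demo.threshold_)
print("mean importance:", importances.mean())
print("kept feature indices:", np.flatnonzero(mask))
```

Passing an explicit `threshold` (for example `"median"`) or `max_features` is a way to control how many features survive.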


In [6]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.2, random_state=42)

In [7]:
# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
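
In the cells above, imputation and feature selection are fitted on the full dataset before the split, so the test rows influence those steps. One way to confine them to the training rows is a scikit-learn `Pipeline`. This is a sketch on synthetic data, not the notebook's method: `SimpleImputer`, the step names, the `stratify` argument, and the injected -999.0 values are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with about 5% of the entries set to the -999.0 placeholder.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=12, n_informative=4, random_state=0
)
rng = np.random.default_rng(0)
X_demo[rng.random(X_demo.shape) < 0.05] = -999.0

# Split first; stratify keeps the class ratio equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

# Every preprocessing step is fitted on the training rows only.
pipe = Pipeline([
    ("impute", SimpleImputer(missing_values=-999.0, strategy="mean")),
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=42))),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(X_tr, y_tr)
print("held-out accuracy:", pipe.score(X_te, y_te))
```

Because the whole chain is one estimator, it can also be passed directly to cross-validation utilities without refitting the preprocessing by hand.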

### Model Training and Evaluation Functions

In [8]:
def evaluate_model(model, name):
    model.fit(X_train_scaled, y_train)
    preds = model.predict(X_test_scaled)
    proba = model.predict_proba(X_test_scaled)[:, 1] if hasattr(model, 'predict_proba') else None

    print(f"\n\n{name} Results")
    print("Accuracy:", accuracy_score(y_test, preds))
    print("ROC AUC:", roc_auc_score(y_test, proba) if proba is not None else "N/A")
    print("Confusion Matrix:\n", confusion_matrix(y_test, preds))
    print("Classification Report:\n", classification_report(y_test, preds))
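
`evaluate_model` reports "N/A" for ROC AUC when a model has no `predict_proba`. Since `roc_auc_score` also accepts unthresholded decision values, a fallback to `decision_function` is possible. A minimal sketch, where `positive_class_scores` is a hypothetical helper and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def positive_class_scores(model, X):
    """Return ranking scores for the positive class, or None if unavailable."""
    if hasattr(model, "predict_proba"):
        return model.predict_proba(X)[:, 1]
    if hasattr(model, "decision_function"):
        return model.decision_function(X)
    return None


# Synthetic stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# SVC with the default probability=False exposes decision_function only.
svc = SVC().fit(X_tr, y_tr)
scores = positive_class_scores(svc, X_te)
print("ROC AUC:", roc_auc_score(y_te, scores))
```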


In [9]:
# Logistic Regression
evaluate_model(LogisticRegression(max_iter=1000), "Logistic Regression")



Logistic Regression Results
Accuracy: 0.73198
ROC AUC: 0.7935351180189185
Confusion Matrix:
 [[28476  4589]
 [ 8812  8123]]
Classification Report:
               precision    recall  f1-score   support

           0       0.76      0.86      0.81     33065
           1       0.64      0.48      0.55     16935

    accuracy                           0.73     50000
   macro avg       0.70      0.67      0.68     50000
weighted avg       0.72      0.73      0.72     50000
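
The figures in the report follow directly from the confusion matrix, whose rows are true classes and columns are predicted classes. As a worked check for the signal class (label 1), using the Logistic Regression matrix printed above:

```python
import numpy as np

# Confusion matrix from the Logistic Regression output above.
cm = np.array([[28476, 4589],
               [8812, 8123]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions over all events
precision = tp / (tp + fp)        # predicted signals that are true signals
recall = tp / (tp + fn)           # true signals that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 5), round(precision, 2), round(recall, 2), round(f1, 2))
```

The rounded values agree with the accuracy line and the class 1 row of the report.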



In [10]:
# Random Forest
evaluate_model(RandomForestClassifier(n_estimators=100, random_state=42), "Random Forest")



Random Forest Results
Accuracy: 0.8236
ROC AUC: 0.8917204122057676
Confusion Matrix:
 [[29405  3660]
 [ 5160 11775]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87     33065
           1       0.76      0.70      0.73     16935

    accuracy                           0.82     50000
   macro avg       0.81      0.79      0.80     50000
weighted avg       0.82      0.82      0.82     50000



In [11]:
# Gradient Boosting
evaluate_model(GradientBoostingClassifier(n_estimators=100), "Gradient Boosting")



Gradient Boosting Results
Accuracy: 0.82194
ROC AUC: 0.8905117658622237
Confusion Matrix:
 [[29292  3773]
 [ 5130 11805]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87     33065
           1       0.76      0.70      0.73     16935

    accuracy                           0.82     50000
   macro avg       0.80      0.79      0.80     50000
weighted avg       0.82      0.82      0.82     50000



In [12]:
# XGBoost
evaluate_model(XGBClassifier(eval_metric='logloss'), "XGBoost")



XGBoost Results
Accuracy: 0.82528
ROC AUC: 0.8950464427302316
Confusion Matrix:
 [[29272  3793]
 [ 4943 11992]]
Classification Report:
               precision    recall  f1-score   support

           0       0.86      0.89      0.87     33065
           1       0.76      0.71      0.73     16935

    accuracy                           0.83     50000
   macro avg       0.81      0.80      0.80     50000
weighted avg       0.82      0.83      0.82     50000



In [14]:
# Multi Layer Perceptron
evaluate_model(MLPClassifier(max_iter=1000), "Multi Layer Perceptron")



Multi Layer Perceptron Results
Accuracy: 0.82634
ROC AUC: 0.8960879508743347
Confusion Matrix:
 [[29532  3533]
 [ 5150 11785]]
Classification Report:
               precision    recall  f1-score   support

           0       0.85      0.89      0.87     33065
           1       0.77      0.70      0.73     16935

    accuracy                           0.83     50000
   macro avg       0.81      0.79      0.80     50000
weighted avg       0.82      0.83      0.82     50000



## Conclusion

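In this notebook, we built a binary classification pipeline to separate signal from background events in particle collision data. Preprocessing consisted of filling the -999.0 placeholder values with column means, selecting 8 features by Random Forest importance, and standardizing them. We then compared five machine learning models on a 20% hold-out set of 50,000 events.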

### Model Performance Summary:
| Model                  | Accuracy    | ROC AUC    |
|------------------------|-------------|------------|
| Logistic Regression    | 0.73198     | 0.7935     |
| Random Forest          | 0.82360     | 0.8917     |
| Gradient Boosting      | 0.82194     | 0.8905     |
| XGBoost                | 0.82528     | 0.8950     |
| Multi Layer Perceptron | **0.82634** | **0.8961** |

- Feature selection using **Random Forest importance scores** reduced the input to 8 features, on which the four nonlinear models still reach ROC AUC values of about 0.89 to 0.90.
- All four nonlinear models (Random Forest, Gradient Boosting, XGBoost, MLP) outperformed Logistic Regression by about 9 accuracy points and 0.10 ROC AUC, with **MLP and XGBoost** narrowly ahead.
- The best model, **MLP**, achieved an accuracy of **82.6%** and an ROC AUC of **0.896**. Its margin over XGBoost (about 0.1 accuracy points and 0.001 ROC AUC) is small and comes from a single train-test split.

### Final Notes:
- Further gains could be achieved with hyperparameter tuning, ensemble stacking, or deeper neural network architectures.
- The confusion matrices show the same pattern for every model: recall is lower for the signal class (0.48 for Logistic Regression, 0.70 to 0.71 for the others) than for the background class (0.86 to 0.89).
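
As an illustration of the hyperparameter tuning mentioned above, the sketch below runs a small cross-validated grid search scored by ROC AUC. The grid values, the fold count, and the synthetic data are assumptions chosen to keep the example fast; they are not tuned recommendations for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data (assumption for illustration).
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Small illustrative grid; a real search would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [2, 3]}
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=cv,
)
search.fit(X_demo, y_demo)
print("best parameters:", search.best_params_)
print("mean cross-validated ROC AUC:", search.best_score_)
```

Scoring over several folds also gives a spread for each model, which helps judge whether differences as small as those in the summary table are meaningful.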

This project demonstrates how classic and modern ML algorithms can be applied effectively in high-energy physics experiments for event classification.