# HEPMASS NEURAL NETWORK

In this notebook I build a neural network that tries to classify the results of particle collisions as either known particles or exotic particles. Because the dataset is so large, I have set up my computer to use an eGPU to speed up training. The dataset can be found [here](http://archive.ics.uci.edu/ml/datasets/HEPMASS).

In [1]:
# use plaidml to connect Keras to my eGPU (set this before importing keras)
import os

os.environ["KERAS_BACKEND"] = "plaidml.keras.backend"
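
Keras reads the `KERAS_BACKEND` variable when it is first imported, so the assignment has to come before any `import keras`. The sketch below is one possible way to make the cell tolerant of machines without PlaidML. The `BACKEND` constant and the availability check are illustrative assumptions, not part of the original setup.

```python
import importlib.util
import os

BACKEND = "plaidml.keras.backend"  # same value as in the cell above

# Only request PlaidML when the package can be found; otherwise the
# environment is left alone and Keras uses its default backend.
plaidml_available = importlib.util.find_spec("plaidml") is not None
if plaidml_available:
    os.environ["KERAS_BACKEND"] = BACKEND

print("PlaidML backend requested:", plaidml_available)
# `import keras` would follow here, after the variable is set.
```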

### Loading and Preprocessing

In [2]:
# read csv file into a pandas dataframe
import numpy as np
import pandas as pd

mass_train = pd.read_csv('all_train.csv')
mass_test = pd.read_csv('all_test.csv')
mass_test.head(10)



| | # label | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | ... | f18 | f19 | f20 | f21 | f22 | f23 | f24 | f25 | f26 | mass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.094394 | 0.012756 | 0.911933 | -0.090831 | -0.233575 | -1.054221 | -0.975937 | -1.067278 | -0.61385 | ... | -1.376865 | 0.067591 | 1.372576 | -0.573682 | -1.368692 | -0.479379 | 1.529256 | -0.575782 | -1.290232 | 499.999969 |
| 1 | 1.0 | 0.3272 | -0.239554 | -1.592038 | -2.324984 | -0.507093 | 1.574625 | -1.050106 | 0.968664 | 1.312387 | ... | -0.333943 | 1.058411 | 0.436482 | -0.573682 | -0.021727 | -0.579184 | -0.326044 | -0.202462 | -0.458558 | 750.0 |
| 2 | 1.0 | 1.43501 | 0.400359 | 0.260659 | 0.829901 | 0.453934 | -1.054221 | 1.16922 | -0.541082 | -1.230714 | ... | -1.654498 | 0.928221 | 0.63982 | -0.573682 | 0.494222 | -0.277551 | -0.342811 | 1.774911 | 0.305253 | 1000.0 |
| 3 | 0.0 | -1.18622 | 0.443335 | 0.003997 | 0.484752 | -1.159905 | -1.054221 | -1.581964 | -0.391629 | 0.529644 | ... | -0.520804 | -1.241476 | -0.137923 | -0.573682 | -0.254372 | -0.253829 | 0.333148 | -0.554347 | -0.905452 | 1000.0 |
| 4 | 1.0 | 0.392461 | -0.51525 | -1.336984 | 1.895459 | -1.068731 | -0.005984 | 1.404694 | 0.176146 | 0.700568 | ... | -0.557441 | 0.838925 | -0.128199 | -0.573682 | -0.629632 | -0.673854 | -0.238945 | 2.11899 | 0.938224 | 1250.0 |
| 5 | 1.0 | -0.762194 | -1.131781 | 1.212941 | -0.014585 | -1.627197 | 2.755198 | 0.069685 | 0.071915 | 0.690501 | ... | 1.696673 | 0.25882 | 0.071498 | 1.743123 | -0.57818 | -0.3148 | -0.282512 | -0.110409 | -0.322061 | 750.0 |
| 6 | 0.0 | -1.452897 | 1.220308 | -0.168225 | 0.221927 | 0.948663 | 0.850488 | -1.464295 | -0.437006 | -1.231089 | ... | -0.18007 | -1.599574 | 1.68113 | -0.573682 | 0.043071 | -0.362341 | -0.327349 | 0.658676 | -0.064869 | 1500.0 |
| 7 | 0.0 | -0.744263 | 0.867744 | 1.667862 | 0.816056 | 1.59054 | -1.054221 | -0.319818 | 0.616564 | -0.01622 | ... | 0.021906 | 1.21162 | -0.114015 | -0.573682 | 0.426346 | 0.280618 | -0.323626 | 0.630361 | -0.51855 | 1500.0 |
| 8 | 0.0 | 0.763578 | -0.190839 | -0.639178 | -0.690064 | -0.314243 | 1.574625 | -0.223045 | -1.314112 | 0.948066 | ... | -0.721701 | -1.414812 | -1.468528 | -0.573682 | 0.90579 | -0.298147 | -0.422263 | -0.764455 | -0.217903 | 1000.0 |
| 9 | 0.0 | -0.262279 | 0.048897 | -1.528087 | -0.895004 | -0.186632 | 1.574625 | 0.005288 | 0.659182 | 1.335679 | ... | 1.021908 | -0.137557 | -0.665569 | -0.573682 | 0.675431 | -0.250276 | -0.335505 | -0.385025 | -0.368745 | 1500.0 |


The data looks good and the labels are already numeric, so no conversion is needed. The next step is to split the data into train, test and validation sets and to create a vector of labels to pass to the model alongside the features.

In [3]:
from sklearn.model_selection import train_test_split

train = mass_train
# hold out 20% of the provided test file as a validation set
test, val = train_test_split(mass_test, test_size=0.2)
train_labels = train['# label']
test_labels = test['# label']
val_labels = val['# label']
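
As a rough illustration of the split described above, the sketch below holds out 20% of a small made-up frame as a validation set and pulls the `# label` column out of each part. The toy data, the `random_state` value and the use of `DataFrame.sample` in place of `train_test_split` are assumptions, chosen so the example runs without scikit-learn. The idea is the same.

```python
import numpy as np
import pandas as pd

# Toy stand-in for all_test.csv: 10 rows, one label column, two feature columns.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "# label": [0.0, 1.0] * 5,
    "f0": rng.normal(size=10),
    "f1": rng.normal(size=10),
})

# Hold out 20% of the rows for validation; the rest stays as the test set.
val = toy.sample(frac=0.2, random_state=0)
test = toy.drop(val.index)

# Pull the label vector out of each part.
test_labels = test["# label"]
val_labels = val["# label"]

print(len(test), len(val))  # 8 2
```

Fixing `random_state` makes the split reproducible between runs, which the original call does not guarantee.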



The final step before creating and training the model is to remove the label and mass columns from the train and test datasets so that only the features remain. The resulting dataframe is the other input to the model.

In [4]:
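# NOTE: drop returns a new DataFrame; the results are not assigned here, so `train`
# and `test` keep their '# label' and 'mass' columns and only a copy of `test` is shown
train.drop(['# label', 'mass'], axis=1)
test.drop(['# label', 'mass'], axis=1)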

| | f0 | f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | ... | f17 | f18 | f19 | f20 | f21 | f22 | f23 | f24 | f25 | f26 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1284512 | 0.304282 | -0.439926 | -1.659651 | 2.039861 | -1.453842 | -1.054221 | 2.167531 | -0.443090 | 0.372438 | 0.754261 | ... | -0.815440 | -1.380615 | -0.684099 | -1.227002 | -0.573682 | -0.917302 | 0.409713 | -0.376504 | 2.509622 | 1.253793 |
| 1540456 | -0.271841 | -2.308833 | 1.468884 | -0.296901 | -1.080800 | -1.054221 | -1.733294 | -1.157868 | 0.446842 | -1.325801 | ... | 1.226331 | -1.212674 | -0.405786 | 1.559520 | -0.573682 | -0.506484 | -1.229454 | -0.332171 | -0.847755 | -1.517690 |
| 865951 | -1.537759 | -0.021417 | 0.070131 | -0.475256 | 0.856006 | -0.005984 | 0.013012 | 0.397030 | -0.848228 | 0.754261 | ... | -0.815440 | 0.743664 | -1.501652 | 1.225735 | 1.743123 | -0.073985 | -0.148441 | -0.346559 | -1.017736 | 0.340852 |
| 224560 | 0.667630 | -0.928805 | -1.127893 | -0.020833 | -1.588764 | 0.850488 | -1.260588 | 1.493316 | 0.867540 | 0.754261 | ... | -0.815440 | -0.147696 | -0.939136 | 0.011533 | 1.743123 | 0.095526 | -0.351649 | -0.373619 | 0.762747 | -0.092751 |
| 306081 | 1.334873 | -0.410268 | -1.180585 | -0.617627 | -0.061532 | -1.054221 | 1.613617 | 0.052239 | 0.474271 | -1.325801 | ... | -0.815440 | 1.245540 | 1.067326 | 1.628257 | 1.743123 | 3.245801 | 3.682671 | 2.735915 | -0.080791 | 0.878276 |
| 2533445 | -1.075016 | -0.016649 | -0.328944 | 0.458232 | 0.620466 | -1.054221 | 0.242603 | 0.043725 | 0.432637 | 0.754261 | ... | -0.815440 | -1.271510 | 1.007514 | -0.927987 | -0.573682 | -0.243886 | -1.074700 | -0.286599 | -1.038682 | -0.304946 |
| 2113938 | -1.227678 | -0.752869 | -1.559379 | -0.049531 | -0.698523 | -0.005984 | -1.302763 | -0.109412 | 0.185828 | 0.754261 | ... | -0.815440 | -0.781330 | 1.728015 | -1.327846 | 1.743123 | -0.777429 | -0.203041 | -0.345196 | -0.545294 | -1.276736 |
| 2507344 | 0.461897 | -0.994147 | 0.701514 | 0.055164 | 0.345380 | -0.005984 | -0.396006 | -0.229545 | -1.131180 | 0.754261 | ... | -0.815440 | -1.092175 | 0.431336 | 0.500816 | 1.743123 | -1.080284 | -0.553976 | -0.253215 | -0.551367 | -0.760334 |
| 2588396 | 0.781106 | -0.418172 | -1.538029 | -1.289207 | 1.017530 | -0.005984 | 0.426534 | -0.269298 | 1.469144 | 0.754261 | ... | -0.815440 | 1.296718 | -0.730385 | 0.105802 | -0.573682 | -0.342472 | -0.446907 | -0.327669 | -0.689587 | -0.148086 |
| 343241 | 2.645493 | 1.644355 | -0.686422 | -0.107389 | -1.128313 | -1.054221 | 1.545731 | 0.411665 | 1.048003 | 0.754261 | ... | -0.815440 | -1.016928 | 2.197234 | -1.681630 | -0.573682 | 0.967173 | 0.896138 | 2.275317 | 2.524771 | 1.170549 |
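
One detail of `DataFrame.drop` matters for this step: it returns a new frame and leaves the original unchanged unless the result is assigned. The short sketch below shows the difference on a made-up three-column frame. The toy values and the `features` name are assumptions for illustration only.

```python
import pandas as pd

# Toy frame with the same non-feature column names as the dataset.
df = pd.DataFrame({"# label": [0.0, 1.0], "f0": [0.1, -0.2], "mass": [500.0, 750.0]})

# Calling drop without assigning the result leaves df unchanged.
df.drop(["# label", "mass"], axis=1)
print(list(df.columns))        # ['# label', 'f0', 'mass']

# Assigning the returned frame keeps only the feature columns.
features = df.drop(["# label", "mass"], axis=1)
print(list(features.columns))  # ['f0']
```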


The next step is to build the model. I use Keras because it can run on the eGPU through plaidml. The model itself is fairly simple: two dense layers with ReLU activations, each followed by a dropout layer, and an output layer with a sigmoid activation, because this is a binary classification problem.

In [5]:
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras import optimizers
from keras import layers

# number of input columns; prints 29 here, i.e. 27 features plus '# label' and 'mass'
dims = train.shape[1]
print(dims, 'dims')
print("Building model.....")

model = Sequential()
model.add(Dense(64, input_dim=dims, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])

model.fit(train, train_labels,
          epochs=1,
          batch_size=128)


Using plaidml.keras.backend backend.
INFO:plaidml:Opening device "metal_amd_radeon_rx_580.0"


29 dims
Building model.....
Epoch 1/1


<keras.callbacks.History at 0x14c902ef0>
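
To make the architecture above more concrete, here is a minimal NumPy sketch of the same layer stack: 29 inputs, two ReLU layers of 64 units and one sigmoid output. It is not the Keras implementation. The helper names, the random initialisation and the all-zero batch are assumptions, and dropout is left out because it is only active during training.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(layer_sizes, seed=0):
    """Random (weights, bias) pairs for consecutive dense layers."""
    rng = np.random.default_rng(seed)
    return [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x, params):
    """Inference-time forward pass: ReLU on hidden layers, sigmoid on the output."""
    *hidden, (w_out, b_out) = params
    for w, b in hidden:
        x = relu(x @ w + b)
    return sigmoid(x @ w_out + b_out)

layer_sizes = [29, 64, 64, 1]  # input width as printed by the notebook
params = init_params(layer_sizes)

# Trainable parameters: (29*64 + 64) + (64*64 + 64) + (64*1 + 1)
n_params = sum(w.size + b.size for w, b in params)
print(n_params)  # 6145

batch = np.zeros((4, 29))
probs = forward(batch, params)
print(probs.shape)  # (4, 1)
```

Each output is a probability between 0 and 1, which is why the sigmoid pairs naturally with the `binary_crossentropy` loss used in `model.compile`.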

The model trained better than expected in only one epoch. When a model trains this well, the next step is to check that it did not overfit and that it performs just as well on a test set.

In [6]:
score = model.evaluate(test, test_labels, batch_size=128)



In [7]:
score

[2.8188701655874087e-05, 0.9999935714285715]
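
The two numbers in `score` are the loss and the accuracy, in the order given to `model.compile`. As a rough guide to what they measure, the sketch below computes both by hand for four made-up predictions. The function names, the clipping constant and the 0.5 threshold are assumptions meant to mirror the usual definitions, not Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)  # avoid log(0)
    losses = -(y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob))
    return float(np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    return float(np.mean((y_prob > threshold) == (y_true == 1.0)))

y_true = np.array([0.0, 1.0, 1.0, 0.0])
y_prob = np.array([0.1, 0.9, 0.8, 0.4])

loss = binary_crossentropy(y_true, y_prob)
acc = accuracy(y_true, y_prob)
print(round(loss, 4), acc)  # 0.2362 1.0
```

A loss close to zero, as in the score above, means the predicted probabilities sit almost exactly on 0 or 1 for nearly every row.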

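The model's performance on the test set was consistent with its performance on the training set, which suggests that it most likely did not overfit. However, an accuracy of 0.99999 after a single epoch is suspicious. The `drop` calls above were never assigned, so the 29 input columns still include `# label` itself, and the model can read the answer directly from its input. The score therefore does not yet show that the model can make predictions on new data going forward. The model needs to be retrained on the 27 feature columns alone before drawing that conclusion.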