#### Background

> When two protons collide, the tracks followed by the particles they produce tell us a lot about those particles.
For example, the direction in which a track curves tells us whether the particle is positively or negatively charged,
and the radius of curvature relates directly to the momentum (and hence the velocity) of the particle.


A hit is a single instance of the detector sensing a particle on a detector surface.

A single track is followed by a set of hits, one for each place where the same particle was detected. We need to associate these hits with the correct track.

Particles produced in collisions normally travel in straight lines, but in the presence of a magnetic field their paths become curved. Electromagnets around particle detectors generate magnetic fields to exploit this effect. Physicists can calculate the momentum of a particle – a clue to its identity – from the curvature of its path: particles with high momentum travel in almost straight lines, whereas those with very low momentum move forward in tight spirals inside the detector.
(https://home.cern/about/how-detector-works)
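
The link between curvature and momentum quoted above can be made concrete. For a particle with charge q (in units of the elementary charge) moving perpendicular to a uniform magnetic field B, the transverse momentum is approximately p_T [GeV/c] ≈ 0.3 · |q| · B [T] · R [m], where R is the radius of curvature. The sketch below evaluates that relation; the function name and the 2 T default field are illustrative assumptions, not values taken from the dataset files.

```python
def transverse_momentum_gev(radius_m, b_field_tesla=2.0, charge=1):
    """Approximate p_T in GeV/c for a track of given radius in a uniform field.

    Uses p_T ~= 0.2998 * |q| * B * R. The default field of 2 T is an
    assumed, illustrative value.
    """
    return 0.2998 * abs(charge) * b_field_tesla * radius_m


# Larger radius (straighter track) means higher transverse momentum.
for radius in (0.1, 1.0, 10.0):
    print(f"R = {radius:5.1f} m  ->  p_T ~ {transverse_momentum_gev(radius):.3f} GeV/c")
```

The sign of the charge does not change the radius, only the direction in which the track bends, which is why the formula uses |q|.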

Modern particle detectors consist of layers of subdetectors, each designed to look for particular properties, or specific types of particle.

#### Goal 
The goal of the tracking machine learning challenge is to group the recorded measurements or hits for each event into tracks, sets of hits that belong to the same initial particle. A solution must uniquely associate each hit to one track.
 
#### Approach

The grouping labels need not match any of the known particle_ids from the training data; each one is an arbitrary label (a 'track_id' in this context) shared by a set of hit_ids. For the test data, the number of track_ids is not known in advance, and neither is the number of hits that map to a single particle.

Hence, this problem can be approached with unsupervised clustering.
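
To illustrate what "unsupervised" means here, the following is a minimal sketch of a greedy leader-style clustering on a toy event. It is not the method used in this notebook, and it assumes straight tracks that start at the origin, which real tracks in a magnetic field do not satisfy. The function `leader_cluster`, the threshold `min_cos`, and the toy data are all hypothetical.

```python
import numpy as np


def leader_cluster(points, min_cos=0.999):
    """Greedy 'leader' clustering of 3D hits by their direction from the origin."""
    # Hits on a straight track through the origin share one unit direction.
    directions = points / np.linalg.norm(points, axis=1, keepdims=True)
    leaders = []  # one representative direction per discovered track
    labels = np.empty(len(points), dtype=int)
    for i, direction in enumerate(directions):
        for track_id, leader in enumerate(leaders):
            # Join the first track whose leader points the same way.
            if np.dot(direction, leader) >= min_cos:
                labels[i] = track_id
                break
        else:
            # No existing track matches: this hit starts a new one.
            leaders.append(direction)
            labels[i] = len(leaders) - 1
    return labels


# Toy event: three straight tracks with four hits each at increasing distances.
true_dirs = np.array([[1.0, 0.0, 0.2], [0.0, 1.0, -0.5], [-1.0, -1.0, 1.0]])
distances = np.array([30.0, 70.0, 120.0, 170.0])
toy_hits = np.vstack([r * d for d in true_dirs for r in distances])

track_ids = leader_cluster(toy_hits)
print(track_ids)
```

No particle_id is used anywhere: the number of tracks and the label values emerge from the hit positions alone, which matches the situation for the test data.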

In [None]:
import pandas as pd
import featuretools as ft
import matplotlib.pyplot as plt
import seaborn as sns

#### Data

The dataset comprises multiple independent events, where each event contains simulated measurements (essentially 3D points) of particles generated in a collision between proton bunches at the Large Hadron Collider at CERN.

#### Exploring train_sample files

In [None]:
# choosing event event000002387
df_particle = pd.read_csv("../input/train_1/event000002387-particles.csv") 

In [None]:
df_hits = pd.read_csv("../input/train_1/event000002387-hits.csv")

In [None]:
df_cells = pd.read_csv("../input/train_1/event000002387-cells.csv")

In [None]:
df_truth = pd.read_csv("../input/train_1/event000002387-truth.csv")

In [None]:
df_particle.shape

In [None]:
df_hits.isnull().sum()

In [None]:
df_hits.info()

In [None]:
df_particle.isnull().sum()

In [None]:
df_particle.info()

In [None]:
df_truth.isnull().sum()

In [None]:
df_truth.info()

In [None]:
df_cells.isnull().sum()

In [None]:
df_cells.info()

Hits
* hit_id: numerical identifier of the hit inside the event.
* x, y, z: measured x, y, z position (in millimeters) of the hit in global coordinates.
* volume_id: numerical identifier of the detector group.
* layer_id: numerical identifier of the detector layer inside the group.
* module_id: numerical identifier of the detector module inside the layer.

In [None]:
df_hits.head(7)

In [None]:
df_hits.describe()

Particles
* particle_id: numerical identifier of the particle inside the event.
* vx, vy, vz: initial position or vertex (in millimeters) in global coordinates.
* px, py, pz: initial momentum (in GeV/c) along each global axis.
* q: particle charge (as multiple of the absolute electron charge).
* nhits: number of hits generated by this particle.
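
The momentum components listed above can be combined into derived quantities that are often more convenient than px, py, pz themselves. The sketch below computes the transverse momentum and the total momentum on a small made-up table; the column names `pt` and `p` and the values in `toy_particles` are illustrative assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Made-up particles using the same column names as the particles file.
toy_particles = pd.DataFrame({
    "particle_id": [1, 2],
    "px": [3.0, 0.0],
    "py": [4.0, 0.5],
    "pz": [12.0, 0.0],
    "q": [1, -1],
})

# Transverse momentum: the component perpendicular to the z axis.
toy_particles["pt"] = np.hypot(toy_particles["px"], toy_particles["py"])
# Total momentum magnitude.
toy_particles["p"] = np.sqrt(toy_particles["pt"] ** 2 + toy_particles["pz"] ** 2)

print(toy_particles[["particle_id", "pt", "p"]])
```

The same two lines should work unchanged on `df_particle`, since it carries the `px`, `py`, and `pz` columns described above.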

In [None]:
df_particle.head()

In [None]:
df_particle.tail()

In [None]:
df_particle.describe()

Truth
The truth file contains the mapping between hits and generating particles and the true particle state at each measured hit. Each entry maps one hit to one particle.

* hit_id: numerical identifier of the hit as defined in the hits file.
* particle_id: numerical identifier of the generating particle as defined in the particles file. A value of 0 means that the hit did not originate from a reconstructible particle, but e.g. from detector noise.
* tx, ty, tz: true intersection point in global coordinates (in millimeters) between the particle trajectory and the sensitive surface.
* tpx, tpy, tpz: true particle momentum (in GeV/c) in the global coordinate system at the intersection point. The corresponding vector is tangent to the particle trajectory at the intersection point.
* weight: per-hit weight used for the scoring metric; the total sum of weights within one event equals one.

Note: Multiple hits can belong to the same particle (at different coordinates). That is how we get the track of a single particle: multiple hits (detection sites) along its path of travel.
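
The hit-to-particle mapping described above can be summarised per particle with a groupby. The sketch below uses a small made-up truth table; the values, and the choice to give the noise hit a weight of zero, are assumptions made for illustration only.

```python
import pandas as pd

# Made-up truth table: particle_id 0 marks a hit that came from noise.
toy_truth = pd.DataFrame({
    "hit_id": [1, 2, 3, 4, 5],
    "particle_id": [101, 101, 202, 0, 202],
    "weight": [0.25, 0.25, 0.25, 0.0, 0.25],
})

# Drop noise hits, then count hits and sum weights for each particle (track).
real_hits = toy_truth[toy_truth["particle_id"] != 0]
per_particle = real_hits.groupby("particle_id").agg(
    n_hits=("hit_id", "count"),
    total_weight=("weight", "sum"),
)

print(per_particle)
print("event weight sum:", toy_truth["weight"].sum())
```

Each row of `per_particle` corresponds to one track, and `n_hits` plays the same role as the `nhits` column in the particles file.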

In [None]:
df_truth.head()

In [None]:
df_truth.tail()

Cells

The cells file contains the constituent active detector cells that comprise each hit. The cells can be used to refine the hit to track association. A cell is the smallest granularity inside each detector module, much like a pixel on a screen, except that depending on the volume_id a cell can be a square or a long rectangle. It is identified by two channel identifiers that are unique within each detector module and encode the position, much like column/row numbers of a matrix. A cell can provide signal information that the detector module has recorded in addition to the position. Depending on the detector type only one of the channel identifiers is valid, e.g. for the strip detectors, and the value might have different resolution.

* hit_id: numerical identifier of the hit as defined in the hits file.
* ch0, ch1: channel identifier/coordinates unique within one module.
* value: signal value information, e.g. how much charge a particle has deposited.
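
Because one hit can be made up of several cells, it can help to aggregate the cells file to one row per hit. The sketch below does this on made-up data; the summary column names `n_cells` and `total_value` are hypothetical.

```python
import pandas as pd

# Made-up cells table: hit 1 has one cell, hit 2 has two, hit 3 has three.
toy_cells = pd.DataFrame({
    "hit_id": [1, 2, 2, 3, 3, 3],
    "ch0": [10, 55, 56, 7, 7, 8],
    "ch1": [20, 90, 90, 3, 4, 4],
    "value": [0.3, 0.1, 0.2, 0.05, 0.1, 0.05],
})

# One row per hit: how many cells fired and how much signal they recorded.
cell_summary = toy_cells.groupby("hit_id").agg(
    n_cells=("ch0", "count"),
    total_value=("value", "sum"),
)

print(cell_summary)
```

A summary like this has hit_id as its index, so it could be joined back onto the hits table as extra per-hit features.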

In [None]:
df_cells.head()

In [None]:
df_cells.tail()

> At indices 669470 and 669471, the same hit_id (121511) appears with different cell coordinates, so a single hit can comprise more than one cell.

#### Exploring detectors file

In [None]:
# This file contains additional detector geometry information.

df_detectors = pd.read_csv("../input/detectors.csv")

In [None]:
# Each module has a different position and orientation described in the detectors file.

df_detectors.head(7)

In [None]:
df_hits.nunique()

In [None]:
df_hits.volume_id.unique()

In [None]:
df_hits.layer_id.unique()

In [None]:
df_hits.module_id.unique()

#### Exploring test file

The test dataset contains 125 events.

The submission file must associate each hit in each event to one and only one reconstructed particle track. The reconstructed tracks must be uniquely identified only within each event. 
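
The uniqueness requirement above can be checked before writing a submission. The sketch below assumes the submission has the columns `event_id`, `hit_id`, and `track_id`; that layout and the helper `make_submission` are assumptions for illustration and should be checked against the competition's sample submission.

```python
import pandas as pd


def make_submission(event_id, hit_ids, track_ids):
    """Build a per-event submission table (assumed column layout)."""
    submission = pd.DataFrame({
        "event_id": event_id,   # scalar, broadcast to every row
        "hit_id": hit_ids,
        "track_id": track_ids,  # arbitrary labels, unique only within the event
    })
    # Every hit must be assigned to exactly one track.
    if submission["hit_id"].duplicated().any():
        raise ValueError("each hit_id may appear only once per event")
    return submission


submission = make_submission(event_id=8, hit_ids=[1, 2, 3, 4], track_ids=[0, 0, 1, 1])
print(submission)
```

Since track labels only need to be unique within an event, the same `track_id` values can be reused for the next event.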

In [None]:
df_test_hits = pd.read_csv('../input/test/event000000008-hits.csv')

In [None]:
df_test_cells = pd.read_csv('../input/test/event000000008-cells.csv')

In [None]:
df_test_hits.info()

In [None]:
df_test_hits.head()

In [None]:
df_test_hits.tail()

In [None]:
df_test_cells.info()

In [None]:
df_test_cells.head()

In [None]:
df_test_cells.tail()

#### Feature engineering from training data

In [None]:
## Creating Entity set

es = ft.EntitySet(id="hits")

In [None]:
es1 = es.entity_from_dataframe(entity_id='hits', dataframe=df_hits,
                               index = 'hit_id',
                               variable_types = { "volume_id":ft.variable_types.Categorical,
                                                  "layer_id":ft.variable_types.Categorical,
                                                  "module_id":ft.variable_types.Categorical })

In [None]:
es1['hits'].variables

In [None]:
es2 = es1.entity_from_dataframe(entity_id='particle', dataframe=df_particle,
                               index = 'particle_id' )

In [None]:
es2['particle']

In [None]:
df_cells.info()

In [None]:
df_cells.reset_index(inplace=True)

In [None]:
df_cells.head()                                    # value column signifies the amount of charge deposited by the particle

In [None]:
df_cells.tail()

In [None]:
es3 = es2.entity_from_dataframe(entity_id='cells', dataframe=df_cells,index='index'  )

In [None]:
es4 = es3.entity_from_dataframe(entity_id='truth',dataframe=df_truth, index='hit_id')

In [None]:
df_detectors.reset_index(inplace=True)

In [None]:
es5 = es4.entity_from_dataframe(entity_id='detectors', dataframe=df_detectors, index='index')

In [None]:
es5

In [None]:
es5.entities

In [None]:
# Defining one-to-many relationships among features of different entities

relation1 = ft.Relationship(es5['hits']['hit_id'],es5['cells']['hit_id'])

relation2 = ft.Relationship(es5['particle']['particle_id'],es5['truth']['particle_id'])

In [None]:
es5

In [None]:
es5.add_relationships([relation1,relation2])

In [None]:
es5.entities

In [None]:
%time feature_matrix, features = ft.dfs(entityset=es5, target_entity='particle',agg_primitives=['min','max'],max_depth=2)

In [None]:
df_particle.head(1)

In [None]:
df_truth.head(1)

In [None]:
feature_matrix

In [None]:
features

####  Linking df_hits, df_truth and df_particle for the same hit_id

In [None]:
df_hits.head(2)

In [None]:
df_particle.head(2)

In [None]:
df_truth.head(2)

**tx, ty, tz**: true intersection point in global coordinates (in millimeters) between the particle trajectory and the sensitive surface (df_truth).

**x, y, z**: measured x, y, z position (in millimeters) of the hit in global coordinates (df_hits).

For the same hit_ids, the measured positions in df_hits and the true positions in df_truth are close to each other but not identical.
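
The gap between measured and true positions can be quantified by joining the two tables on hit_id and computing a per-hit distance. The sketch below uses made-up coordinates; the column name `residual_mm` is hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up measured positions (as in the hits file).
toy_hits = pd.DataFrame({
    "hit_id": [1, 2],
    "x": [10.0, -5.0],
    "y": [1.0, 2.0],
    "z": [100.0, -50.0],
})
# Made-up true positions (as in the truth file).
toy_truth = pd.DataFrame({
    "hit_id": [1, 2],
    "tx": [9.25, -5.0],
    "ty": [0.0, 2.0],
    "tz": [100.0, -50.0],
})

# Each hit appears once in both tables, so the join is one-to-one.
merged = toy_hits.merge(toy_truth, on="hit_id", validate="one_to_one")

# Euclidean distance between measured and true position, in millimeters.
merged["residual_mm"] = np.sqrt(
    (merged["x"] - merged["tx"]) ** 2
    + (merged["y"] - merged["ty"]) ** 2
    + (merged["z"] - merged["tz"]) ** 2
)

print(merged[["hit_id", "residual_mm"]])
```

Applied to `df_hits` and `df_truth`, `merged["residual_mm"].describe()` would give a quick summary of how far the measurements deviate from the true intersection points.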

In [None]:
# obtaining the number of times each particle was detected

df_truth.groupby('particle_id')['hit_id'].count()

In [None]:
temp = df_truth[df_truth['particle_id']==4503874505277440]

In [None]:
temp

In [None]:
temp.weight.sum()

In [None]:
temp.count()

In [None]:
# the above particle was sensed/detected at 12 different positions on the detector as observed in df_particle dataframe below.

df_particle[df_particle['particle_id']==4503874505277440]

In [None]:
hits_list = temp.hit_id.tolist()

In [None]:
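# Select rows by the hit_id column: .loc matches index labels, which are not guaranteed to equal hit_id.
df_hits[df_hits['hit_id'].isin(hits_list)]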