In [None]:
import pandas as pd
import featuretools as ft
import os

#### Exploring train_sample files

In [None]:
# Load the particles file for event000002387
df_particle = pd.read_csv("../input/train_1/event000002387-particles.csv")

In [None]:
df_hits = pd.read_csv("../input/train_1/event000002387-hits.csv")

In [None]:
df_cells = pd.read_csv("../input/train_1/event000002387-cells.csv")

In [None]:
df_truth = pd.read_csv("../input/train_1/event000002387-truth.csv")

In [None]:
df_particle.shape

In [None]:
df_hits.info()

In [None]:
df_particle.info()

In [None]:
df_truth.info()

In [None]:
df_cells.info()

Hits
* hit_id: numerical identifier of the hit inside the event.
* x, y, z: measured x, y, z position (in millimeters) of the hit in global coordinates.
* volume_id: numerical identifier of the detector group.
* layer_id: numerical identifier of the detector layer inside the group.
* module_id: numerical identifier of the detector module inside the layer.
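
The three identifiers above are hierarchical: a layer_id is only meaningful inside its volume, and a module_id only inside its layer. A minimal sketch of what that implies when grouping hits, using a made-up `toy_hits` frame (the name and all values are illustrative, not taken from the event file):

```python
import pandas as pd

# Toy stand-in for df_hits; every value here is invented for illustration.
toy_hits = pd.DataFrame({
    "hit_id":    [1, 2, 3, 4, 5],
    "x":         [-64.4, -55.3, -83.8, 12.1, 30.7],
    "y":         [-7.2, 0.6, -1.1, 40.5, -22.9],
    "z":         [-1502.5, -1502.5, -1502.5, 598.0, 598.0],
    "volume_id": [7, 7, 7, 8, 8],
    "layer_id":  [2, 2, 4, 2, 2],
    "module_id": [1, 1, 3, 5, 6],
})

# layer_id repeats across volumes, so hits are counted per (volume, layer) pair.
hits_per_layer = toy_hits.groupby(["volume_id", "layer_id"]).size()
print(hits_per_layer)

# A module is identified by the full (volume, layer, module) triple.
module_key = ["volume_id", "layer_id", "module_id"]
n_modules_hit = toy_hits[module_key].drop_duplicates().shape[0]
print(n_modules_hit)
```

The same two statements should work unchanged on df_hits, since they only rely on the columns listed above.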

In [None]:
df_hits.head()

In [None]:
df_hits.tail(2)

Particles
* particle_id: numerical identifier of the particle inside the event.
* vx, vy, vz: initial position or vertex (in millimeters) in global coordinates.
* px, py, pz: initial momentum (in GeV/c) along each global axis.
* q: particle charge (as multiple of the absolute electron charge).
* nhits: number of hits generated by this particle.
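
The momentum components are often easier to read once combined. The sketch below derives the total momentum and the component transverse to the z axis from px, py, pz; these derived columns (`p`, `pt`) are not part of the original files, and `toy_particle` holds invented values:

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_particle; values are invented for illustration.
toy_particle = pd.DataFrame({
    "particle_id": [101, 202],
    "px": [3.0, 0.6],
    "py": [4.0, 0.8],
    "pz": [12.0, 0.0],
    "q":  [1, -1],
})

# Momentum transverse to the z axis: sqrt(px^2 + py^2), in GeV/c.
toy_particle["pt"] = np.hypot(toy_particle["px"], toy_particle["py"])

# Total momentum magnitude: sqrt(px^2 + py^2 + pz^2), in GeV/c.
toy_particle["p"] = np.sqrt(toy_particle["pt"] ** 2 + toy_particle["pz"] ** 2)

print(toy_particle[["particle_id", "pt", "p"]])
```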

In [None]:
df_particle.head()

In [None]:
df_particle.tail(2)

In [None]:
df_particle.describe()

Truth
The truth file contains the mapping between hits and generating particles and the true particle state at each measured hit. Each entry maps one hit to one particle.

* hit_id: numerical identifier of the hit as defined in the hits file.
* particle_id: numerical identifier of the generating particle as defined in the particles file. A value of 0 means that the hit did not originate from a reconstructible particle, but e.g. from detector noise.
* tx, ty, tz: true intersection point in global coordinates (in millimeters) between the particle trajectory and the sensitive surface.
* tpx, tpy, tpz: true particle momentum (in GeV/c) in the global coordinate system at the intersection point. The corresponding vector is tangent to the particle trajectory at the intersection point.
* weight: per-hit weight used for the scoring metric; the total sum of weights within one event equals one.

Note: multiple hits can belong to the same particle (at different coordinates). This is how the track of a single particle arises: a sequence of hits (sensing sites) along its path of travel.
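
As a minimal sketch of that idea, the hits of each particle can be collected into one list per particle_id, with noise hits (particle_id 0) set aside first. The `toy_truth` frame and its values are invented, and the toy weights are simply chosen so that they sum to one:

```python
import pandas as pd

# Toy stand-in for df_truth; values are invented for illustration.
toy_truth = pd.DataFrame({
    "hit_id":      [1, 2, 3, 4, 5, 6],
    "particle_id": [101, 101, 101, 202, 202, 0],
    "weight":      [0.2, 0.2, 0.2, 0.2, 0.2, 0.0],
})

# particle_id 0 marks hits without a reconstructible particle (e.g. noise).
real_hits = toy_truth[toy_truth["particle_id"] != 0]

# One track per particle: the list of hit_ids that particle produced.
tracks = real_hits.groupby("particle_id")["hit_id"].agg(list)
print(tracks)

# Per-event weights are expected to sum to one.
total_weight = toy_truth["weight"].sum()
print(total_weight)
```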

In [None]:
df_truth.head()

In [None]:
df_truth.tail(2)

Cells

The cells file contains the constituent active detector cells that comprise each hit. The cells can be used to refine the hit-to-track association. A cell is the smallest granularity inside each detector module, much like a pixel on a screen, except that depending on the volume_id a cell can be a square or a long rectangle. It is identified by two channel identifiers that are unique within each detector module and encode the position, much like column/row numbers of a matrix. A cell can provide signal information that the detector module has recorded in addition to the position. Depending on the detector type, only one of the channel identifiers is valid (e.g. for the strip detectors), and the value might have a different resolution.

* hit_id: numerical identifier of the hit as defined in the hits file.
* ch0, ch1: channel identifier/coordinates unique within one module.
* value: signal value information, e.g. how much charge a particle has deposited.

In [None]:
df_cells.head()

In [None]:
df_cells.tail()

> At indices 669470 and 669471, the same hit_id (121511) appears with different cell coordinates, so a single hit can comprise more than one cell.
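
Because one hit can span several cells, the cells table usually has to be reduced to one row per hit before it can sit next to the hits table. A minimal sketch under that assumption, with an invented `toy_cells` frame and hypothetical output column names (`n_cells`, `total_value`):

```python
import pandas as pd

# Toy stand-in for df_cells; values are invented for illustration.
toy_cells = pd.DataFrame({
    "hit_id": [1, 2, 2, 3, 3, 3],
    "ch0":    [209, 210, 210, 68, 68, 69],
    "ch1":    [617, 12, 13, 446, 447, 446],
    "value":  [0.01, 0.12, 0.20, 0.05, 0.07, 0.03],
})

# Collapse to one row per hit: how many cells fired and their summed signal.
per_hit = toy_cells.groupby("hit_id").agg(
    n_cells=("value", "size"),
    total_value=("value", "sum"),
)
print(per_hit)
```

The result is indexed by hit_id, so it could be joined onto a hits table in the same way as the truth data.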

#### Exploring detectors file

In [None]:
# This file contains additional detector geometry information.

df_detectors = pd.read_csv("../input/detectors.csv")

In [None]:
# Each module has a different position and orientation described in the detectors file.

df_detectors

#### Linking df_hits, df_truth and df_particle for the same hit_id

In [None]:
print(df_hits.shape)
df_hits.head(2)

In [None]:
print(df_particle.shape)
df_particle.head(2)

In [None]:
print(df_truth.shape)
df_truth.head(2)

**tx, ty, tz**: true intersection point in global coordinates (in millimeters) between the particle trajectory and the sensitive surface (df_truth).

**x, y, z**: measured x, y, z position (in millimeters) of the hit in global coordinates (df_hits).

For the same hit_id, the measured position in df_hits and the true position in df_truth are close to each other but not identical.

Next, each row of df_hits is joined with the matching row of df_truth on hit_id, so that every hit carries its corresponding particle information.

In [None]:
df_hits.index

In [None]:
hits_truth = df_hits.set_index('hit_id').join(df_truth.set_index('hit_id'))
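
The join above assumes that hit_id is unique in both tables. A minimal sketch of how that assumption could be checked, and how far the measured position sits from the true one, using invented toy frames and a hypothetical `residual_mm` column:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df_hits and df_truth; values are invented for illustration.
toy_hits = pd.DataFrame({
    "hit_id": [1, 2, 3],
    "x": [10.0, 20.0, 30.0],
    "y": [0.0, 1.0, 2.0],
    "z": [100.0, 200.0, 300.0],
})
toy_truth = pd.DataFrame({
    "hit_id": [1, 2, 3],
    "particle_id": [101, 101, 0],
    "tx": [10.03, 19.96, 30.0],
    "ty": [0.04, 1.0, 2.0],
    "tz": [100.0, 200.0, 300.12],
})

# validate="one_to_one" raises a MergeError if hit_id repeats on either side.
merged = toy_hits.merge(toy_truth, on="hit_id", how="left", validate="one_to_one")

# Euclidean distance (mm) between the measured and the true position of each hit.
diff = merged[["x", "y", "z"]].to_numpy() - merged[["tx", "ty", "tz"]].to_numpy()
merged["residual_mm"] = np.linalg.norm(diff, axis=1)
print(merged[["hit_id", "particle_id", "residual_mm"]])
```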

In [None]:
df_hits.head(1)

In [None]:
df_truth.head(1)

In [None]:
hits_truth.head()

In [None]:
hits_truth.reset_index(inplace=True)

In [None]:
hits_truth.head(2)

In [None]:
hits_truth.shape

In [None]:
df_particle.shape

In [None]:
df_particle.head(2)

In [None]:
# Create an empty EntitySet to hold the tables and their relationships

es = ft.EntitySet(id="trackml")

In [None]:
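# Register the joined hits/truth table as an entity; the detector IDs are categorical, not numeric.
# Note: entity_from_dataframe and variable_types belong to the pre-1.0 featuretools API.
es1 = es.entity_from_dataframe(entity_id='hits_truth', dataframe=hits_truth,
                               index='hit_id',
                               variable_types={"volume_id": ft.variable_types.Categorical,
                                               "layer_id": ft.variable_types.Categorical,
                                               "module_id": ft.variable_types.Categorical})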

In [None]:
es2 = es1.entity_from_dataframe(entity_id='particle', dataframe=df_particle,
                               index = 'particle_id' )

In [None]:
es2

In [None]:
# Define a one-to-many relationship: one particle (parent) has many hits (children),
# linked through particle_id

relation1 = ft.Relationship(es2['particle']['particle_id'], es2['hits_truth']['particle_id'])

In [None]:
relation1

In [None]:
es2.add_relationships([relation1])
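
A one-to-many relationship presumes that every particle_id on the hits side points at a row on the particle side. Since particle_id 0 marks hits without a reconstructible particle, such rows may have no parent. The pandas-only sketch below shows one way to spot them and to cross-check nhits; the toy frames and their values are invented:

```python
import pandas as pd

# Toy stand-ins for df_particle and hits_truth; values are invented.
toy_particle = pd.DataFrame({
    "particle_id": [101, 202],
    "q": [1, -1],
    "nhits": [2, 1],
})
toy_hits_truth = pd.DataFrame({
    "hit_id": [1, 2, 3, 4],
    "particle_id": [101, 101, 202, 0],
})

# Child rows whose particle_id has no matching parent row (here: the noise hit).
has_parent = toy_hits_truth["particle_id"].isin(toy_particle["particle_id"])
orphans = toy_hits_truth.loc[~has_parent]
print(orphans)

# Cross-check: hits counted per particle should agree with the nhits column.
hits_per_particle = toy_hits_truth.loc[has_parent].groupby("particle_id").size()
counted = toy_particle["particle_id"].map(hits_per_particle)
nhits_consistent = bool((counted == toy_particle["nhits"]).all())
print(nhits_consistent)
```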

In [None]:
es2.entities

Now we want collective information for each particle_id.

The approach is unsupervised: the hits and cells data are merged to form the primary input features. These features are then mapped to the corresponding particle information through the truth data, which acts as something similar to a target, guiding the ML algorithm toward the kind of input features that should be clustered as belonging to the same particle. (Predicting the particle_id value itself is not in the scope of this problem.)
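
As a rough illustration of what "collective information for each particle_id" could look like, the sketch below aggregates hit-level columns per particle with plain pandas. It is only a hand-written stand-in for the kind of aggregation features a tool like featuretools can derive from a parent/child relationship; the toy frame, its values and the output column names are all invented:

```python
import pandas as pd

# Toy stand-in for the joined hits/truth table; values are invented.
toy_hits_truth = pd.DataFrame({
    "hit_id":      [1, 2, 3, 4, 5],
    "particle_id": [101, 101, 101, 202, 202],
    "x":           [10.0, 20.0, 30.0, -5.0, -15.0],
    "z":           [100.0, 200.0, 300.0, -50.0, -150.0],
    "layer_id":    [2, 4, 6, 2, 2],
})

# One row per particle, summarising all of its hits.
per_particle = toy_hits_truth.groupby("particle_id").agg(
    n_hits=("hit_id", "count"),
    mean_x=("x", "mean"),
    min_z=("z", "min"),
    max_z=("z", "max"),
    n_layers=("layer_id", "nunique"),
)
print(per_particle)
```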

In [None]:
df_hits.head()

In [None]:
df_cells.head()