# TrackML Problem Explanation and Data Exploration

Code along of [Wesam Elshamy's Kaggle kernel](https://www.kaggle.com/wesamelshamy/trackml-problem-explanation-and-data-exploration)

## 0. Problem Description

Link every **hit** to one **track**.

Every particle leaves a track behind it. We want to link every track to a unique (max 1 per detector) set of hits.

In every **event**, a large number of **particles** are released. They move along a path leaving behind their **tracks**. They eventually **hit** a particle detector surface on the other end.

In the training data we have the following information on each **event**:

- **Hits**: $x, y, z$ coords of each hit on the particle detector
- **Particles**: Each particle's initial position ($v_x, v_y, v_z$), momentum ($p_x, p_y, p_z$), charge ($q$), and number of hits.
- **Truth**: Mapping between hits and generating particles, particle trajectory, momentum, and hit weight.
- **Cells**: Precise location of where each particle hit the detector and how much energy it deposited.
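
To make the goal concrete, here is a minimal sketch of how the truth mapping turns individual hits into tracks: hits that share a generating particle belong to the same track. The rows are made up, the helper name `group_hits_by_particle` is hypothetical, and treating `particle_id` 0 as "not attributed to any particle" is an assumption about the dataset rather than something shown in this notebook.

```python
from collections import defaultdict

# Made-up (hit_id, particle_id) pairs standing in for the truth table.
# Assumption: particle_id 0 marks hits not attributed to any particle.
truth_rows = [(1, 101), (2, 102), (3, 101), (4, 0), (5, 102), (6, 101)]

def group_hits_by_particle(rows):
    """Return {particle_id: [hit_id, ...]}, skipping unattributed hits."""
    tracks = defaultdict(list)
    for hit_id, particle_id in rows:
        if particle_id != 0:
            tracks[particle_id].append(hit_id)
    return dict(tracks)

tracks = group_hits_by_particle(truth_rows)
print(tracks)
```

A reconstruction algorithm has to recover this grouping from the hit coordinates alone, since the truth table is only available for training events.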

## 1. Data Exploration

1. Settings → Add a custom package → GitHub user/repo (`LAL/trackml-library`)
2. restart kernel

In [5]:
%matplotlib inline

import os
import numpy as np
import pandas as pd

from trackml.dataset import load_event
from trackml.randomize import shuffle_hits
from trackml.score import score_event

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns

from pathlib import Path

In [10]:
PATH = Path('../input/train_1')
event_prefix = 'event000001000'
hits,cells,particles,truth = load_event(PATH/event_prefix)

mem_bytes = (hits.memory_usage(index=True).sum()
             + cells.memory_usage(index=True).sum()
             + particles.memory_usage(index=True).sum()
             + truth.memory_usage(index=True).sum())
print(f'{event_prefix} memory usage {mem_bytes/2**20:.2f} MB')

### 2.1. Hits Data

#### Where Did it Hit?

Here we have the $x,y,z$ global coords [mm] of where the particles hit the detector surface.

In [11]:
hits.head()

Here's the distribution of $x,y,z$ hit locations in event 1000. This is only for one of 8,850 events.

#### Vertical Intersection ($x, y$) in Detection Layers

As shown in the figure below, the hits are semi-evenly distributed on the detector surface $x, y$. The white circle in the center of the plot is where the beam pipe lies. [Clarification](https://www.kaggle.com/wesamelshamy/trackml-problem-explanation-and-data-exploration/comments#323803): [agerom](https://www.kaggle.com/artemiosgeromitsos).

The colors represent different detector volumes. See [Joshua Bonatt's notebook](https://www.kaggle.com/jbonatt/trackml-eda-etc).

In [14]:
# Recent seaborn versions take x/y as keywords and use `height` instead of `size`
g = sns.jointplot(x=hits.x, y=hits.y, s=1, height=12)
g.ax_joint.cla()
plt.sca(g.ax_joint)

volumes = hits.volume_id.unique()
for volume in volumes:
    vol = hits[hits.volume_id==volume]
    plt.scatter(vol.x, vol.y, s=3, label='volume {}'.format(volume))
    
plt.xlabel('X [mm]'); plt.ylabel('Y [mm]'); plt.legend(); plt.show()
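
The empty disc at the center of the $x, y$ plot can be put into numbers by computing each hit's distance from the beam line. The sketch below uses made-up coordinates and assumes the beam runs along the $z$ axis, so the distance is simply $\sqrt{x^2 + y^2}$. The helper name `transverse_radius` is hypothetical.

```python
import math

# Made-up (x, y) hit coordinates in mm; real values come from the hits table.
hits_xy = [(-64.4, -7.2), (31.0, 4.0), (0.0, 72.0), (-120.5, 260.3)]

def transverse_radius(x, y):
    """Distance of a hit from the z axis (assumed beam line) in the x-y plane."""
    return math.hypot(x, y)

radii = [transverse_radius(x, y) for x, y in hits_xy]
inner_radius = min(radii)
print(f'smallest hit radius: {inner_radius:.2f} mm')
```

On the real data, the smallest radius gives a rough estimate of where the innermost detector layer starts, which is the edge of the white circle.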

#### Horizontal Intersection ($y, z$) in Detection Layers

You can think of the chart below as a horizontal intersection in the detection surface, where every dot is a hit. Notice the relationship between the different activity levels in this chart and the one above for $x, y$.

Again, the colors represent different volumes in the detector surface.

In [17]:
# Recent seaborn versions take x/y as keywords and use `height` instead of `size`
g = sns.jointplot(x=hits.z, y=hits.y, s=1, height=12)
g.ax_joint.cla()
plt.sca(g.ax_joint)

volumes = hits.volume_id.unique()
for volume in volumes:
    vol = hits[hits.volume_id==volume]
    plt.scatter(vol.z, vol.y, s=3, label=f'volume {volume}')

plt.xlabel('Z [mm]');plt.ylabel('Y [mm]');plt.legend();plt.show()

And here is how the hits in this event look in 3D. Again, a sample from 1 event. This combines the previous 2 charts in 3D.

Notice how the particles penetrate the detector surface along the $z$ coordinate:

In [20]:
fig = plt.figure(figsize=(12,12))
ax  = fig.add_subplot(111, projection='3d')
for volume in volumes:
    vol = hits[hits.volume_id==volume]
    ax.scatter(vol.z,vol.x,vol.y, s=1, label=f'volume {volume}', alpha=0.5)
ax.set_title('Hit Locations');ax.set_xlabel('Z [mm]');ax.set_ylabel('X [mm]')
ax.set_zlabel('Y [mm]'); plt.show()

#### Affected Surface Object

The **volume**, **layer**, and **module** are nested parts on the detector surface. The volume is made of layers, which in turn have modules. Analyzing their response could help us understand if some of them are dead/defective, and therefore we may need to account for the bias they cause.

The figure below shows a plot of every combination of `x`, `y`, `volume`, `layer`, and `module`. The colors identify different *volumes*. Along the main diagonal we have the variables' histograms.

The (`hit_id`, `x`) and (`hit_id`, `y`) pairs show us how different volumes are layered.

In [21]:
hits_sample = hits.sample(8000)
sns.pairplot(hits_sample, hue='volume_id', height=8); plt.show()
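
The dead or defective module check mentioned above can be sketched as a simple hit count per (volume, layer, module) triple. Everything below is illustrative: the triples are made up, `all_modules` is a hypothetical list of every module in the detector, and the column names `layer_id` and `module_id` are assumed to follow the same pattern as `volume_id`.

```python
from collections import Counter

# Made-up (volume_id, layer_id, module_id) triples, one per hit.
hit_modules = [(8, 2, 1), (8, 2, 1), (8, 2, 2), (8, 4, 1), (13, 2, 1)]
# Hypothetical list of every module that exists in the detector.
all_modules = [(8, 2, 1), (8, 2, 2), (8, 2, 3), (8, 4, 1), (13, 2, 1)]

# Count hits per module, then list modules that recorded nothing
hits_per_module = Counter(hit_modules)
silent_modules = [m for m in all_modules if hits_per_module[m] == 0]

print('hits per module:', dict(hits_per_module))
print('modules with no hits:', silent_modules)
```

A module that is silent in a single event is not necessarily defective. The counts would need to be accumulated over many events before drawing that conclusion.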

### 2.2. Particle Data

The particle data help us understand each particle's initial position, momentum, and charge, which we can join with the event truth dataset to get the particle's final position and momentum. This is needed to identify the tracks that each particle generated.

The data looks like this:

In [22]:
particles.head()
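
The join between the particle and truth data described above can be sketched as a lookup on `particle_id`. The rows below are made up and only carry a few of the fields used in this notebook. With the real DataFrames, `truth.merge(particles, on='particle_id')` would be the pandas equivalent.

```python
# Made-up rows; the field names mirror columns used in this notebook.
particles_rows = [
    {'particle_id': 101, 'vx': 0.0, 'vy': 0.1, 'vz': -2.0, 'q': 1},
    {'particle_id': 102, 'vx': 0.0, 'vy': 0.0, 'vz': 5.0, 'q': -1},
]
truth_rows = [
    {'hit_id': 1, 'particle_id': 101, 'tx': 30.0, 'ty': 5.0, 'tz': 40.0},
    {'hit_id': 2, 'particle_id': 102, 'tx': -31.0, 'ty': 2.0, 'tz': -60.0},
    {'hit_id': 3, 'particle_id': 101, 'tx': 70.0, 'ty': 14.0, 'tz': 95.0},
]

# Index particles by id, then attach the particle fields to each truth row.
# Truth rows without a matching particle are dropped (an inner join).
particle_by_id = {p['particle_id']: p for p in particles_rows}
joined = [{**t, **particle_by_id[t['particle_id']]}
          for t in truth_rows if t['particle_id'] in particle_by_id]

for row in joined:
    print(row['hit_id'], row['particle_id'], row['q'], row['tz'])
```

Each joined row now carries both where the particle started (`vx`, `vy`, `vz`) and where it crossed the detector (`tx`, `ty`, `tz`).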

#### Hit Rate and Charge Distribution

Let's see the distribution of the number of hits per particle, shown below. A significant number of particles had no attributed hits, and most of them have positive charge in this event:

In [24]:
plt.figure(figsize=(15,5))
plt.subplot(1,2,1)
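# sns.distplot is deprecated; histplot with a KDE overlay draws the same kind of plot
sns.histplot(particles.nhits.values, bins=50, kde=True, stat='density')
plt.xlabel('Hits/Particle')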
plt.title('Distribution of number of hits per particle for event 1000.')
plt.subplot(1,2,2)
plt.pie(particles.groupby('q')['vx'].count(),
        labels=['negative', 'positive'],
        autopct='%.0f%%', shadow=True, radius=0.8)
plt.title('Distribution of particle charges.'); plt.show()
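
The two claims read off the plots (many particles without hits, more positive than negative charges) can also be checked numerically. This is a minimal sketch on made-up `(q, nhits)` pairs, not the values from event 1000.

```python
from collections import Counter

# Made-up (q, nhits) pairs standing in for the particles table.
particle_rows = [(1, 0), (-1, 12), (1, 9), (1, 0), (-1, 0), (1, 14)]

n_total = len(particle_rows)
n_no_hits = sum(1 for _, nhits in particle_rows if nhits == 0)
charge_counts = Counter(q for q, _ in particle_rows)

print(f'{n_no_hits / n_total:.0%} of particles have no hits')
print(f'positive: {charge_counts[1]}, negative: {charge_counts[-1]}')
```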

#### Initial Position and Momentum

Let's now take a look at the initial position of the particles around the global coordinates' origin $(x, y) = (0, 0)$, as shown in the figure below.

The initial positions are more concentrated around the origin (less variance) than the hit positions (shown above under the Hits Data section). As the particles hit the detection surface, they tend to scatter, as shown in the particle trajectory plot at the end of the notebook.

The colors here show the number of hits for each particle.

In [27]:
# Recent seaborn versions take x/y as keywords and use `height` instead of `size`
g = sns.jointplot(x=particles.vx, y=particles.vy, s=3, height=12)
g.ax_joint.cla()
plt.sca(g.ax_joint)

n_hits = particles.nhits.unique()
for n_hit in n_hits:
    p = particles[particles.nhits==n_hit]
    plt.scatter(p.vx, p.vy, s=3, label=f'Hits {n_hit}')

plt.xlabel('X [mm]'); plt.ylabel('Y [mm]'); plt.legend(); plt.show()

And here's the initial position of the particles in a $z, y$ view. Colors show number of hits.

In [28]:
# Recent seaborn versions take x/y as keywords and use `height` instead of `size`
g = sns.jointplot(x=particles.vz, y=particles.vy, s=3, height=12)
g.ax_joint.cla()
plt.sca(g.ax_joint)

n_hits = particles.nhits.unique()
for n_hit in n_hits:
    p = particles[particles.nhits == n_hit]
    plt.scatter(p.vz, p.vy, s=3, label=f'Hits {n_hit}')

plt.xlabel('Z [mm]'); plt.ylabel('Y [mm]'); plt.legend(); plt.show()

And this is what they look like in 3D:

In [29]:
fig = plt.figure(figsize=(12,12))
ax = fig.add_subplot(111, projection='3d')
for charge in [-1,1]:
    q = particles[particles.q==charge]
    ax.scatter(q.vz, q.vx, q.vy, s=1, label=f'Charge {charge}', alpha=0.5)

ax.set_title('Particle initial locations (event 1000)')
ax.set_xlabel('Z [mm]'); ax.set_ylabel('X [mm]'); ax.set_zlabel('Y [mm]')
ax.legend(); plt.show()

#### Pair plot

Let's now take a look at the relationship between different combinations of the particle variables. Again, the colors represent the number of hits.

There's no large skew in the distribution of the number of hits over other variables. It looks like the particles are targeted towards the global origin $(x, y) = (0, 0)$ and are evenly distributed around it.

In [30]:
p_sample = particles.sample(8000)
sns.pairplot(p_sample, vars=['particle_id', 'vx', 'vy', 'vz', 'px', 'py', 'pz',
                             'nhits'], hue='nhits', height=8)
plt.show()

#### Particle Trajectory

We can reconstruct the trajectories for a few particles given their intersection points with the detector layers. As explained in the [competition evaluation page](https://www.kaggle.com/c/trackml-particle-identification#evaluation), hits from straight tracks have larger weights, and random tracks or hits from very short tracks have weights of zero. The figure below shows 2 such examples.

See [Makahana's notebook for trajectory plotting](https://www.kaggle.com/makahana/quick-trajectory-plot).

In [33]:
# Get particle ID with max number of hits in this event
particle0 = particles.loc[particles.nhits==particles.nhits.max()].iloc[0]
particle1 = particles.loc[particles.nhits==particles.nhits.max()].iloc[1]

# Get points where the same particle intersected subsequent layers of the observation material
p_traj_surface0 = truth[truth.particle_id==particle0.particle_id][['tx','ty','tz']]
p_traj_surface1 = truth[truth.particle_id==particle1.particle_id][['tx','ty','tz']]

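# Add each particle's initial position to its hit points and order them by z.
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
start0 = pd.DataFrame([{'tx': particle0.vx, 'ty': particle0.vy, 'tz': particle0.vz}])
start1 = pd.DataFrame([{'tx': particle1.vx, 'ty': particle1.vy, 'tz': particle1.vz}])

p_traj0 = pd.concat([p_traj_surface0, start0], ignore_index=True).sort_values(by='tz')
p_traj1 = pd.concat([p_traj_surface1, start1], ignore_index=True).sort_values(by='tz')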

fig = plt.figure(figsize=(10,10))
ax  = fig.add_subplot(111, projection='3d')

ax.plot(xs=p_traj0.tx, ys=p_traj0.ty, zs=p_traj0.tz, marker='o')
ax.plot(xs=p_traj1.tx, ys=p_traj1.ty, zs=p_traj1.tz, marker='o')

ax.set_xlabel('X [mm]'); ax.set_ylabel('Y [mm]'); ax.set_zlabel('Z [mm] –– Detection layers')
plt.title('Trajectories of 2 particles as they cross the detection surface ($Z$ axis).')
plt.show()
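
The trajectory assembly in the cell above (start from the particle's initial position, then visit its hits in order of $z$) can be sketched without pandas. The points are made up and the helper names `build_trajectory` and `path_length` are hypothetical. Ordering by $z$ is the same simplification the cell makes, and it only gives the true order for particles that move monotonically along $z$.

```python
import math

def build_trajectory(vertex, hit_points):
    """Combine the initial position with the hit points and order them by z."""
    return sorted([vertex, *hit_points], key=lambda p: p[2])

def path_length(points):
    """Sum of straight-line distances between consecutive points."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))

# Made-up (x, y, z) points in mm, deliberately given out of order.
vertex = (0.0, 0.0, 0.0)
hit_points = [(6.0, 8.0, 20.0), (3.0, 4.0, 10.0)]

trajectory = build_trajectory(vertex, hit_points)
print(trajectory)
print(f'polyline length: {path_length(trajectory):.2f} mm')
```

The polyline length is only a lower bound on the real path, since the particle curves between layers rather than moving in straight segments.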