# Chapter 1: Introduction

This kernel demonstrates the clustering and expanding stages we used to reach 7th place. Running it on training event 1000 scores ~0.635 after the clustering stage and 0.735 after the expanding stage.

Each stage takes about 8-10 min on Kaggle and about half that on my laptop. In the clustering part of the kernel, the algorithm uses 5,500 pairs of (z0, 1/(2R)) (more on this below). Increasing the number to about 100,000 pairs makes the score plateau at about 0.765 (after expanding).

How it works:
In each clustering loop, the algorithm tries to find all tracks originating from (0,0,z0) with a radius of 1/(2*kt).
If a hit (x,y,z) is on such a track, the helix is fully defined by the following features (1) and (2):

rr=(x^2+y^2)^0.5
theta_=arctan(y/x)
dtheta = arcsin(kt*rr)
(1)	Theta=theta_+dtheta
(2)	(z-z0)*kt/dtheta

To avoid the discontinuity at +pi/-pi, we use sin and cos of Theta as features instead of Theta itself.
To make (2) more uniform, we use arctan((z-z0)/(3.3*dtheta/kt)).
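
The feature definitions above can be sketched in NumPy as follows. This is a minimal illustration, not the kernel's `calc_features`: the function name `helix_features` is hypothetical, `np.arctan2` is used in place of arctan(y/x) so the angle covers the full circle, and the constant 3.3 is assumed to correspond to the `phik` argument seen later in the notebook. It also assumes kt is non-zero.

```python
import numpy as np

def helix_features(x, y, z, z0, kt, phik=3.3):
    """Per-hit features for one (z0, kt) hypothesis (illustrative sketch)."""
    rr = np.sqrt(x**2 + y**2)
    theta_ = np.arctan2(y, x)  # assumption: arctan2 instead of arctan(y/x)
    arg = kt * rr
    # arcsin is only defined for |kt*rr| <= 1; hits outside that range get NaN
    dtheta = np.where(np.abs(arg) <= 1.0, np.arcsin(np.clip(arg, -1.0, 1.0)), np.nan)
    theta = theta_ + dtheta                            # feature (1)
    phi = np.arctan((z - z0) / (phik * dtheta / kt))   # feature (2), squashed by arctan
    # sin and cos of theta avoid the +pi/-pi discontinuity
    return np.sin(theta), np.cos(theta), phi

sint, cost, phi = helix_features(
    np.array([30.0]), np.array([40.0]), np.array([100.0]), z0=0.0, kt=0.01
)
```

Hits for which kt*rr exceeds 1 cannot lie on a circle of that radius through the origin, so the sketch marks them with NaN rather than clipping them silently.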

After calculating the features, the algorithm tries to cluster all the hits with the same features. This is done by sparse binning – using np.unique.
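
A rough sketch of sparse binning with `np.unique` is given below. It assumes each feature is multiplied by a bin density and floored to an integer, so that hits sharing the same integer tuple form one cluster; the helper name `sparse_bin_labels` and the `bins_per_unit` parameter are illustrative and do not come from the kernel's code.

```python
import numpy as np

def sparse_bin_labels(features, bins_per_unit):
    """Group hits whose binned feature vectors are identical.

    features:      array of shape (n_hits, n_features)
    bins_per_unit: per-feature number of bins per unit of feature value
    Returns a cluster label per hit and the size of each hit's cluster.
    """
    binned = np.floor(features * np.asarray(bins_per_unit)).astype(np.int64)
    # Only occupied bins are enumerated, hence "sparse" binning
    _, labels, counts = np.unique(
        binned, axis=0, return_inverse=True, return_counts=True
    )
    labels = labels.reshape(-1)  # guard against shape differences across NumPy versions
    return labels, counts[labels]

features = np.array([[0.101, 0.505], [0.102, 0.506], [0.905, 0.105]])
labels, sizes = sparse_bin_labels(features, bins_per_unit=[100, 100])
```

Because the cost is dominated by one sort inside `np.unique`, a single pass is cheap, which is what makes thousands of (z0, kt) hypotheses affordable.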

The disadvantage of sparse binning compared with DBSCAN is its sensitivity; the advantages are its speed and its sensitivity (almost no outliers).

After clustering, each hit decides whether its new cluster is good based on the cluster’s length.
Every 500 loops, all hits belonging to tracks that are long enough are removed from the dataset.
If two hits from the same detector are on the same track, the one closest to the track’s center of mass is kept.
The z0, kt pairs are chosen randomly. While running, the algorithm changes the bin width and the minimum length a track needs in order to be extracted from the dataset.
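
One plausible reading of the per-hit choice described above is sketched here: across loops, each hit keeps the label of the longest cluster it has been part of so far. The function `update_labels` and its parameters (`min_size`, `offset`) are hypothetical and only illustrate the idea; the kernel's `clustering` function may differ in its details.

```python
import numpy as np

def update_labels(best_label, best_size, new_label, new_size, min_size, offset):
    """Per hit, keep the label of the longest cluster seen so far.

    best_label/best_size: current label and cluster length per hit
    new_label/new_size:   label and cluster length from the latest loop
    min_size:             clusters shorter than this are ignored
    offset:               added to new labels so they stay unique across loops
    """
    better = (new_size > best_size) & (new_size >= min_size)
    best_label = np.where(better, new_label + offset, best_label)
    best_size = np.where(better, new_size, best_size)
    return best_label, best_size

best_label, best_size = update_labels(
    best_label=np.array([0, 0, 5]), best_size=np.array([0, 0, 9]),
    new_label=np.array([1, 1, 1]), new_size=np.array([3, 3, 3]),
    min_size=3, offset=100,
)
```

In the example the third hit already sits in a cluster of length 9, so it keeps its label while the first two hits adopt the new cluster.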

Expanding is done by adding un-clustered hits that are close to a track's center of mass to that track.
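
The idea can be illustrated with a deliberately simplified sketch: compute each track's centroid in some feature space and attach every un-clustered hit to the nearest centroid if it lies within a distance threshold. The name `expand_by_centroid`, the single `max_dist` threshold, and the convention that label 0 means "un-clustered" are assumptions made for this example; the kernel's `expand_tracks` takes many more parameters.

```python
import numpy as np

def expand_by_centroid(features, labels, max_dist):
    """Attach un-clustered hits (label 0) to the nearest track centroid."""
    labels = labels.copy()
    track_ids = np.unique(labels[labels > 0])
    free = np.flatnonzero(labels == 0)
    if track_ids.size == 0 or free.size == 0:
        return labels
    # Center of mass of every existing track in feature space
    centroids = np.stack([features[labels == t].mean(axis=0) for t in track_ids])
    # Distance from every free hit to every centroid
    dist = np.linalg.norm(features[free, None, :] - centroids[None, :, :], axis=2)
    nearest = dist.argmin(axis=1)
    close = dist[np.arange(free.size), nearest] <= max_dist
    labels[free[close]] = track_ids[nearest[close]]
    return labels

features = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [0.05, 0.05], [20.0, 20.0]])
labels = np.array([1, 1, 2, 2, 0, 0])
expanded = expand_by_centroid(features, labels, max_dist=0.5)
```

The dense distance matrix is fine for a toy example; for a full event a spatial index or per-track loop would likely be needed to keep memory in check.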

# Chapter 2: Clustering

Import necessary packages:

In [12]:
import numpy as np
import sys
sys.path.insert(0, 'other/')
import pandas as pd
import datetime
import os
from ipywidgets import FloatProgress,FloatText
from IPython.display import display
import time
import pdb
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.utils import shuffle
from sklearn.cluster import DBSCAN
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from itertools import product
import gc
import cProfile
from tqdm import tqdm_notebook
%matplotlib inline
#make wider graphs
sns.set(rc={'figure.figsize':(12,5)})
plt.figure(figsize=(12,5))
path='files/'

from functions.other import calc_features, get_event, score_event_fast
from functions.expand import *
from functions.cluster import *

# the following two lines are for changing imported functions, and not needing to restart kernel to use their updated version
%load_ext autoreload
%autoreload 2


Load training event 1000, the event we will work on:

In [9]:
event_num=0
event_prefix = 'event00000100{}'.format(event_num)
hits, cells, particles, truth = get_event(path,event_prefix)

Define parameters:

In [16]:
history=[]
weights={'pi':1,'theta':0.15}
stds={'z0':7.5, 'kt':7.5e-4}
d =    {'sint':[225,110,110,110,110,110],
        'cost':[225,110,110,110,110,110],
          'phi':[550,260,260,260,260,260],
        'min_group':[11,11,10,9,8,7],
        'npoints':[50,50,50,50,50,50]}  # quick run for testing
        #'npoints':[500,2000,1000,1000,500,500]}
filters=pd.DataFrame(d)
nu=500
nu=50  # overrides the value above (presumably for the quick test run)
resa=clustering(hits,stds,filters,phik=3.3,nu=nu,truth=truth,history=history)
resa["event_id"]=event_num
score = score_event_fast(truth, resa.rename(index=str, columns={"label": "track_id"}))
print("Your score: ", score)

took 13.16945 sec
Your score:  0.40190048140999995





# Chapter 3: Expand tracks

In [None]:
mstd_vol={7:0,8:0,9:0,12:2,13:1,14:2,16:3,17:2,18:3}
mstd_size=[4,4,4,4,3,3,3,2,2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
weights={'theta':0.1, 'phi':1}
nresa=expand_tracks(resa,hits,5,16,5,7,mstd=8,dstd=0.00085,phik=3.3,max_dtheta=0.9*np.pi/2,mstd_vol=mstd_vol,mstd_size=mstd_size,weights=weights,nhipo=100)
nresa['event_id']=0
score = score_event_fast(truth, nresa)
print("Your score: ", score)

# Chapter 4: Employ Machine Learning
## 4.1 General strategy
- Produce different submission candidates sub_1, sub_2, ..., sub_N
- Create a machine learning model that gives a probability between 0 and 1 for each track candidate
- Merge two submission candidates by assigning to each hit the track with the higher probability
- Merge the submission candidates successively (sub_1 and sub_2 to sub_12, sub_12 and sub_3 to sub_123, etc.) to get the final submission
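
The merge step in this strategy can be sketched as follows. This is an illustration under stated assumptions, not the kernel's `merge_with_probabilities`: the function name `merge_by_probability` is hypothetical, both submissions are assumed to list the same hits in the same row order, and `prob_a`/`prob_b` are assumed to hold, for every hit, the model probability of the track that hit belongs to.

```python
import numpy as np
import pandas as pd

def merge_by_probability(sub_a, sub_b, prob_a, prob_b):
    """Per hit, keep the track assignment with the higher track probability."""
    prob_a = np.asarray(prob_a)
    prob_b = np.asarray(prob_b)
    take_b = prob_b > prob_a
    # Shift sub_b's ids so they cannot collide with sub_a's ids
    offset = sub_a["track_id"].max() + 1
    merged = sub_a[["hit_id"]].copy()
    merged["track_id"] = np.where(
        take_b, sub_b["track_id"].to_numpy() + offset, sub_a["track_id"].to_numpy()
    )
    return merged

sub_a = pd.DataFrame({"hit_id": [1, 2, 3], "track_id": [1, 1, 2]})
sub_b = pd.DataFrame({"hit_id": [1, 2, 3], "track_id": [1, 2, 2]})
merged = merge_by_probability(sub_a, sub_b, [0.9, 0.9, 0.2], [0.5, 0.5, 0.8])
```

A per-hit merge can split tracks, so the probabilities of the merged tracks are no longer those of either input; they have to be recomputed before the next submission is merged in.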

[Note: For our final solution, the methods described in this chapter gave around +0.01 to the LB score. Some of the benefit of these methods is already captured by the function that expands the tracks.]

## 4.2 Create machine learning model (LightGBM)
- Training data: Use the truth files of the 13 training events 3-15 to get true tracks (target=1). To get wrong tracks (target=0), use clustering submissions generated on those same events: we treat every track candidate whose hits do not all belong to the same particle_id as a wrong track. Also, keep only tracks of length at least 4, which slightly reduces compute time at negligible cost.
- Use 13 features per track: 
 - variance of x,y,z (these are the most important)
 - minimum of x,y,z
 - maximum of x,y,z
 - mean of z
 - volume_id of first hit 
 - number of clusters per track (i.e. are there many hits, which are close together)
 - number of hits / number of clusters   
- Validation data: Built the same way as the training data, but from the 3 training events 0, 1, 2

[Note: We also tried many more features (see the dataframe df_train below), but they gave only small additional gains, so we kept just those 13 features for simplicity. In fact, many other interesting features, such as the number of hits or the number of different volumes crossed, are closely related to the ones used above.]
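
As a rough illustration of how most of these per-track features could be computed, the sketch below uses a pandas `groupby`. The function `track_features` is hypothetical, the population variance (`ddof=0`) and "first hit in row order" are assumptions, and the two cluster-based features (`nclusters`, `nhitspercluster`) are left out because their exact definition is not given here.

```python
import numpy as np
import pandas as pd

def track_features(hits, labels, min_length=4):
    """Aggregate hit coordinates into one feature row per track candidate."""
    df = hits[["x", "y", "z", "volume_id"]].copy()
    df["track_id"] = np.asarray(labels)
    g = df.groupby("track_id")
    feats = pd.DataFrame({
        "svolume": g["volume_id"].first(),  # assumption: first hit in row order
        "xmax": g["x"].max(), "ymax": g["y"].max(), "zmax": g["z"].max(),
        "xmin": g["x"].min(), "ymin": g["y"].min(), "zmin": g["z"].min(),
        "zmean": g["z"].mean(),
        "xvar": g["x"].var(ddof=0), "yvar": g["y"].var(ddof=0), "zvar": g["z"].var(ddof=0),
        "nhits": g.size(),
    })
    # Keep only candidates with at least min_length hits
    return feats[feats["nhits"] >= min_length]

hits = pd.DataFrame({
    "x": [0.0, 1.0, 2.0, 3.0, 10.0, 11.0],
    "y": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "z": [0.0, 10.0, 20.0, 30.0, 5.0, 5.0],
    "volume_id": [8, 8, 13, 13, 7, 7],
})
feats = track_features(hits, labels=[1, 1, 1, 1, 2, 2])
```

In the example the second candidate has only two hits and is dropped by the length filter.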

We have prepared the training and test data and load them directly from pkl files:

In [None]:
df_train=load_obj('../input/trackml-validation-data-for-ml-pandas-df/df_train_v1.pkl')
df_test=load_obj('../input/trackml-validation-data-for-ml-pandas-df/df_test_v1.pkl')
y_train=df_train.target.values
y_test=df_test.target.values
print("The dataframe with all features:")
display(df_train.head())
print("Features for each track:",df_train.columns.values)

In the competition, we used roughly 250 events for training, but the improvement over using just 13 events is small. We now create the LightGBM model, using the training data and features described above:

In [None]:
import lightgbm
s=time.time()
# choose which features of the tracks we want to use:
columns=['svolume','nclusters', 'nhitspercluster', 'xmax','ymax','zmax', 'xmin','ymin','zmin', 'zmean', 'xvar','yvar','zvar']
rounds=1000
round_early_stop=50
parameters = { 'subsample_for_bin':800, 'max_bin': 512, 'num_threads':8, 
               'application': 'binary','objective': 'binary','metric': 'auc','boosting': 'gbdt',
               'num_leaves': 128,'feature_fraction': 0.7,'learning_rate': 0.05,'verbose': 0}
train_data = lightgbm.Dataset(df_train[columns].values, label=y_train)
test_data = lightgbm.Dataset(df_test[columns].values, label=y_test)
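# Note: LightGBM >= 4.0 no longer accepts early_stopping_rounds/verbose_eval in train(); callbacks are the equivalent there
model = lightgbm.train(parameters,train_data,valid_sets=[test_data],num_boost_round=rounds,
                       callbacks=[lightgbm.early_stopping(round_early_stop),lightgbm.log_evaluation(50)])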
print('took',time.time()-s,'seconds')

## 4.3 Judge machine learning model

We double-check the model's performance by calculating its precision, recall and accuracy on the validation set:

[Note: For ML to be helpful in our situation, it needs to distinguish correct from wrong tracks with very high precision = true_positives/(false_positives+true_positives). It also needs to do so for various sets of track candidates, especially those generated when searching for tracks that originate far from the origin, where a lot of bad candidates are often produced.]

In [None]:


y_test_pred=model.predict(df_test[columns].values)
precision, recall, accuracy=precision_and_recall(y_test, y_test_pred,threshold=0.1)
precision, recall, accuracy=precision_and_recall(y_test, y_test_pred,threshold=0.5)
precision, recall, accuracy=precision_and_recall(y_test, y_test_pred,threshold=0.9)
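
The helper `precision_and_recall` comes from the author's own modules and is not shown in this notebook. A minimal sketch of what such a helper might compute at a given threshold is below; the name `precision_recall_accuracy` is hypothetical and the real helper may print or return something different.

```python
import numpy as np

def precision_recall_accuracy(y_true, y_prob, threshold=0.5):
    """Binary metrics after thresholding predicted probabilities."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_prob) >= threshold
    tp = np.sum(y_pred & y_true)
    fp = np.sum(y_pred & ~y_true)
    fn = np.sum(~y_pred & y_true)
    tn = np.sum(~y_pred & ~y_true)
    # Guard against empty denominators
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    accuracy = (tp + tn) / y_true.size
    return precision, recall, accuracy

for threshold in (0.1, 0.5, 0.9):
    p, r, a = precision_recall_accuracy([1, 1, 0, 0], [0.95, 0.4, 0.6, 0.05], threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```

Raising the threshold trades recall for precision, which is the direction that matters here since a wrongly trusted track candidate can displace a correct one during merging.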

## 4.4 Use machine learning model

We reuse the event loaded above and generate a second clustering submission to merge with the first:

In [None]:
### create one more submission
history=[]
resa2=clustering(hits,stds,filters,phik=3.3,nu=nu,truth=truth,history=history)
resa2["event_id"]=event_num
score = score_event_fast(truth, resa2.rename(index=str, columns={"label": "track_id"}))
print("Your score: ", score)

Calculate the probabilities for the tracks in those two submissions. Two small optimizations: also take the track length into account, and, after a couple of merged submissions, require the probability of the track from the new submission to be at least C higher than the current probability. The latter option is not used in this kernel, but was used when merging >= 4 submissions.

In [None]:

preds={}
preds[0]=get_predictions(resa,hits,model)
preds[1]=get_predictions(resa2,hits,model)

Merge both submissions, based on the probabilities of its track candidates:

In [None]:
print('Merge submission 0 and 1 into sub01:')
sub01=merge_with_probabilities(resa,resa2,preds[0],preds[1],truth,length_factor=0.5)

To continue, take an additional submission sub2 and merge it into sub01. To do so, first calculate the probabilities of the tracks in sub01 as well as in sub2. Proceed similarly for more submissions.

In [None]:
# Expanding the merged submission of the two clustering solutions gives:
mstd_vol={7:0,8:0,9:0,12:2,13:1,14:2,16:3,17:2,18:3}
mstd_size=[4,4,4,4,3,3,3,2,2,2,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
weights={'theta':0.1, 'phi':1}
nresa2=expand_tracks(sub01,hits,5,16,5,7,mstd=8,dstd=0.00085,phik=3.3,max_dtheta=0.9*np.pi/2,mstd_vol=mstd_vol,mstd_size=mstd_size,weights=weights,nhipo=100)
nresa2['event_id']=0
score = score_event_fast(truth, nresa2)
print("Your score: ", score)

We achieved 0.757, which is a +0.02 improvement in the case of this notebook.