**Track ML Challenge – Data Insights via Graph Analytics**

To explore what our universe is made of, scientists at CERN are colliding protons, essentially recreating mini big bangs, and meticulously observing these collisions with intricate silicon detectors.
While orchestrating the collisions and observations is already a massive scientific accomplishment, analysing the enormous amounts of data produced from the experiments is becoming an overwhelming challenge.
Event rates have already reached hundreds of millions of collisions per second, meaning physicists must sift through tens of petabytes of data per year.
And as detector resolution improves, ever better software is needed for real-time pre-processing and filtering of the most promising events, producing even more data.

This post examines whether Graph Analytics or Graph Databases can help the physicists working at CERN discover and characterize new particles.

**Quick Problem Description**

*Link every hit to one track.*

![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-18.png)


Every particle leaves a track behind it, like a car leaving tire marks in the sand. We did not catch the particle in action. Now we want to link every hit that the particle created to one track (tire mark).
In every event, a large number of particles are released. They move along a path leaving behind their tracks. They eventually hit a particle detector surface on the other end.
In the training data we have the following information on each event:
•	Hits: x, y, z coordinates of each hit on the particle detector
•	Particles: Each particle's initial position (vx, vy, vz), momentum (px, py, pz), charge (q) and number of hits
•	Truth: Mapping between hits and generating particles; the particle's trajectory, momentum and the hit weight
•	Cells: Precise location of where each particle hit the detector and how much energy it deposited

The complete dataset and its documentation are available on the Kaggle website - [https://www.kaggle.com/c/trackml-particle-identification/data]
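
As a rough illustration of how these tables fit together, the sketch below joins hits to the truth table on `hit_id`, so that each particle is associated with the hits it generated. The column names follow the Kaggle data description; the embedded CSV text is made-up sample data standing in for the real `event000001000-hits.csv` and `event000001000-truth.csv` files (only a subset of columns is shown), and the helper names `read_rows` and `hits_by_particle` are hypothetical.

```python
import csv
import io
from collections import defaultdict

# Made-up rows in the layout of the hits and truth files
# (only a subset of the real columns is shown).
HITS_CSV = """hit_id,x,y,z
1,10.0,1.0,100.0
2,20.0,2.0,200.0
3,-5.0,3.0,-50.0
4,-10.0,6.0,-100.0
"""
TRUTH_CSV = """hit_id,particle_id,weight
1,111,0.1
2,111,0.2
3,222,0.3
4,222,0.4
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def hits_by_particle(hits, truth):
    """Group hit ids and coordinates under the particle that generated them."""
    particle_of = {int(row["hit_id"]): int(row["particle_id"]) for row in truth}
    grouped = defaultdict(list)
    for hit in hits:
        hit_id = int(hit["hit_id"])
        xyz = (float(hit["x"]), float(hit["y"]), float(hit["z"]))
        grouped[particle_of[hit_id]].append((hit_id, xyz))
    return dict(grouped)


tracks = hits_by_particle(read_rows(HITS_CSV), read_rows(TRUTH_CSV))
for particle_id, track_hits in tracks.items():
    print(particle_id, [hit_id for hit_id, _ in track_hits])
```

With the real data, the same functions could be fed the contents of the event files instead of the inline strings.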


**What is Graph Analytics**

“A picture speaks a thousand words” is one of the most commonly used phrases. But a graph speaks so much more than that. 


![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-19.png)

A visual representation of data in the form of graphs helps us gain actionable insights and make better data-driven decisions.
The discipline of analysing graphs to support such decisions is called Graph Analytics.

**Graph Analytics and ML Challenge**

Exploring the data and the end result required for this ML challenge suggests that it is a natural use case for Graph Analytics: the hit IDs can be linked to a track ID, and users can have a visual representation of the result.
To begin with, data exploration was started and a rough picture of the output graph was sketched on paper. The initial sketches define the various relations and look as follows:


![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-20.png)

![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-21.png)

**Tool Selection**

Once the sketches are drawn and the data is explored, the second important task is tool selection. There is a wide variety of tools, each with different capabilities. Based on our experience, the following tools were shortlisted:
•	Gephi
•	Neo4j
•	igraph (R package)
•	GraphX with GraphStream and BreezeViz (Apache Spark)
•	NetworkX (Python)
Of these tools, Gephi is the most sought-after choice for an initial draft to check whether Graph Analytics serves our purpose.

Also, for the initial draft, only a subset of the data was selected, because the full dataset is too large.
Two events, event000001000 and event000001000, are used for this task.

The next important task is to create the nodes and edges in Gephi's format, which requires CSV files that can be imported directly into Gephi.
For this data preparation, SQL Server Management Studio was used to retrieve the results in the desired format with various queries:


![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-22.png)

Using these queries, the following two CSV files are created in Gephi's format:
•	Nodes.csv
•	Edges.csv
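
The post built these files with SQL queries; the same shaping step can be sketched in Python. This is an illustrative sketch under stated assumptions, not the queries actually used: the edge rule (connect consecutive hits of one particle, ordered by distance from the origin), the helper names `build_gephi_tables` and `to_csv_text`, and the sample rows are all hypothetical. The `Id`/`Label` and `Source`/`Target`/`Type` headers are the column names Gephi's spreadsheet importer recognises.

```python
import csv
import io
import math
from itertools import groupby


def build_gephi_tables(labelled_hits):
    """labelled_hits: iterable of (hit_id, x, y, z, particle_id) tuples.

    Returns (node_rows, edge_rows); each list starts with a header row.
    """
    nodes = [("Id", "Label", "x", "y", "z", "particle_id")]
    edges = [("Source", "Target", "Type", "Label")]
    by_particle = sorted(labelled_hits, key=lambda h: h[4])
    for particle_id, group in groupby(by_particle, key=lambda h: h[4]):
        # Assumed ordering along a track: distance of the hit from the origin.
        track = sorted(
            group, key=lambda h: math.sqrt(h[1] ** 2 + h[2] ** 2 + h[3] ** 2)
        )
        for hit_id, x, y, z, _ in track:
            nodes.append((hit_id, str(hit_id), x, y, z, particle_id))
        # Each hit is linked to the next hit of the same particle.
        for current_hit, next_hit in zip(track, track[1:]):
            edges.append((current_hit[0], next_hit[0], "Directed", particle_id))
    return nodes, edges


def to_csv_text(rows):
    """Serialise rows as CSV text (write this text to Nodes.csv / Edges.csv)."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Made-up sample: three hits from particle 111, two from particle 222.
labelled_hits = [
    (1, 30.0, 0.0, 0.0, 111),
    (2, 10.0, 0.0, 0.0, 111),
    (3, 20.0, 0.0, 0.0, 111),
    (4, -5.0, 0.0, 0.0, 222),
    (5, -10.0, 0.0, 0.0, 222),
]
nodes, edges = build_gephi_tables(labelled_hits)
print(to_csv_text(nodes))
print(to_csv_text(edges))
```

Using the particle_id as the edge label keeps the track identity visible when edge labels are switched on in Gephi.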
These files are then imported into Gephi and look as follows:


**NODES**
![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-23.png)

**EDGES**
![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-24.png)

Then the coordinate columns are recast as graph coordinates, and the nodes are partitioned into colours on the basis of particle_id.

**The initial Graph looks as follows:**


![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-25.png)

**The second Graph shows the edge labels, and the Tracks are clearly visible:**

![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-26.png)

**A close look at the Graph with All Tracks defined:**

![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-27.png)

**Hit IDs with Tracks**

![image.png](https://datafreakankur.com/wp-content/uploads/2019/02/image-28.png)

**Does this solve our Purpose?**

The TrackML challenge requires a submission file that assigns each hit ID in an event to a specific track.
Graph Analytics achieves the same result: we can link an event's hit ID to a track and also visualise it.
For the test data, these tracks can be analysed based on proximity and connections.
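
One way to read tracks back out of such a graph is to treat each connected component as one track. The sketch below does this with a small union-find in plain Python (NetworkX's `connected_components` would do the same job); the edge list is made up, `tracks_from_edges` is a hypothetical helper, and the `event_id, hit_id, track_id` columns follow the submission format described on the Kaggle competition page.

```python
def tracks_from_edges(hit_ids, edges):
    """Assign one track_id per connected component of the hit graph.

    Returns a list of (hit_id, track_id) pairs sorted by hit_id.
    """
    parent = {h: h for h in hit_ids}

    def find(h):
        # Follow parent links to the component root, compressing the path.
        while parent[h] != h:
            parent[h] = parent[parent[h]]
            h = parent[h]
        return h

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the two components

    track_of_root = {}
    assignment = []
    for h in sorted(hit_ids):
        root = find(h)
        # Number tracks 1, 2, 3, ... in order of first appearance.
        track_id = track_of_root.setdefault(root, len(track_of_root) + 1)
        assignment.append((h, track_id))
    return assignment


# Made-up example: hits 1-3 form one track, 4-5 another, 6 is isolated.
edges = [(2, 3), (3, 1), (4, 5)]
assignment = tracks_from_edges(hit_ids=[1, 2, 3, 4, 5, 6], edges=edges)

event_id = 1000  # illustrative event number
print("event_id,hit_id,track_id")
for hit_id, track_id in assignment:
    print(f"{event_id},{hit_id},{track_id}")
```

Isolated hits end up as single-hit tracks, so every hit still receives a track_id.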

**Challenges faced:**
•	Very large amount of data
•	Data cleaning and data processing

**Future Action Plan**
•	Including more events and enhancing data capabilities
•	Using the R igraph package for Graph Analytics

**Readings & Data Understanding**
•	[https://www.kaggle.com/c/trackml-particle-identification]

•	[https://www.kaggle.com/wesamelshamy/trackml-problem-explanation-and-data-exploration/comments#323803]

•	[https://www.kaggle.com/makahana/quick-trajectory-plot]

•	[https://www.kaggle.com/jbonatt/trackml-eda-etc]
