# Example 1: Array of Struct
- Collections of particles, one collection per particle type, stored as an array of structs (AoS).
- __Here, just a few manipulations:__
- We start with the ROOT DataSource.
- Apply some pipeline (here we skip the actual computation and just write out) and write to disk in Parquet.
- Read the Parquet back in and apply some pipeline (the pipeline is skipped here).

__Schema Structure:__
- AoS for several particle types
- For each particle, the fields are kinematic or physical properties of that particle.
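
To make the AoS layout concrete, here is a minimal plain-Scala sketch that needs no Spark. The field names `Number`, `Weight`, `Scale` and `Event_size` are borrowed from the `Event` entries that `df.printSchema` reports in this notebook; the case class names `EventStruct` and `Record` and all of the values are hypothetical and only serve as illustration.

```scala
// Hypothetical mirror of one struct in the `Event` array.
// Only three of the fields from the printed schema are kept for brevity.
case class EventStruct(Number: Long, Weight: Float, Scale: Float)

// One row of the dataset: an array of structs plus its size column.
case class Record(Event: Seq[EventStruct], Event_size: Int)

// Made-up values, for illustration only.
val records = Seq(
  Record(Seq(EventStruct(1L, 1.0f, 91.2f), EventStruct(2L, 0.5f, 45.0f)), 2),
  Record(Seq.empty, 0)
)

// Flatten the arrays of structs into one entry per struct.
// This is the plain-collections analogue of exploding an array column.
val allEvents: Seq[EventStruct] = records.flatMap(_.Event)

// Pull a single field out of every struct.
val weights: Seq[Float] = allEvents.map(_.Weight)
println(weights)
```

Each particle type in the dataset would follow the same pattern: one array column of structs per type, where a row with no particles of that type simply holds an empty array.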

__To launch with Jupyter Notebook and Apache Toree:__
- `SPARK_OPTS="--packages org.diana-hep:spark-root_2.11:0.1.16 --master local" jupyter-notebook`

In [1]:
import org.dianahep.sparkroot.experimental._

In [2]:
val df = spark.read.root("file:/Users/vk/data/ML_MP_JR/ttbar_lepFilter_13TeV/ttbar_lepFilter_13TeV_950.root")

df = [Event: array<struct<fUniqueID:int,fBits:int,Number:bigint,ReadTime:float,ProcTime:float,ProcessID:int,MPI:int,Weight:float,Scale:float,AlphaQED:float,AlphaQCD:float,ID1:int,ID2:int,X1:float,X2:float,ScalePDF:float,PDF1:float,PDF2:float>>, Event_size: int ... 26 more fields]



In [3]:
df.printSchema

root
 |-- Event: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- fUniqueID: integer (nullable = true)
 |    |    |-- fBits: integer (nullable = true)
 |    |    |-- Number: long (nullable = true)
 |    |    |-- ReadTime: float (nullable = true)
 |    |    |-- ProcTime: float (nullable = true)
 |    |    |-- ProcessID: integer (nullable = true)
 |    |    |-- MPI: integer (nullable = true)
 |    |    |-- Weight: float (nullable = true)
 |    |    |-- Scale: float (nullable = true)
 |    |    |-- AlphaQED: float (nullable = true)
 |    |    |-- AlphaQCD: float (nullable = true)
 |    |    |-- ID1: integer (nullable = true)
 |    |    |-- ID2: integer (nullable = true)
 |    |    |-- X1: float (nullable = true)
 |    |    |-- X2: float (nullable = true)
 |    |    |-- ScalePDF: float (nullable = true)
 |    |    |-- PDF1: float (nullable = true)
 |    |    |-- PDF2: float (nullable = true)
 |-- Event_size: integer (nullable = true)
 |-- Particle: arr

In [4]:
import org.apache.spark.sql._
df.limit(10).write.mode(SaveMode.Overwrite).parquet("file:/Users/vk/data/ML_MP_JR/ttbar_lepFilter_13TeV/test_oracle_particleList.parquet")

In [5]:
val dfp = spark.read.parquet("file:/Users/vk/data/ML_MP_JR/ttbar_lepFilter_13TeV/test_oracle_particleList.parquet")

dfp = [Event: array<struct<fUniqueID:int,fBits:int,Number:bigint,ReadTime:float,ProcTime:float,ProcessID:int,MPI:int,Weight:float,Scale:float,AlphaQED:float,AlphaQCD:float,ID1:int,ID2:int,X1:float,X2:float,ScalePDF:float,PDF1:float,PDF2:float>>, Event_size: int ... 26 more fields]



In [6]:
dfp.select("Photon").show

+--------------------+
|              Photon|
+--------------------+
|                  []|
|[[0,50331648,11.1...|
|                  []|
|                  []|
|[[0,50331648,99.7...|
|                  []|
|[[0,50331648,14.0...|
|                  []|
|[[0,50331648,16.0...|
|[[0,50331648,12.3...|
+--------------------+



# Example 2: 2D Matrix of Features
__Schema Structure:__
- `hfeatures` - an array of high-level kinematic features
- `lfeatures` - a 2D matrix in which each row holds the kinematic features of one particle. The matrix is 800 x 19.
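
As a rough illustration of the `lfeatures` layout, the sketch below builds an 800 x 19 nested array in plain Scala (no Spark) and reads one row and one column from it. The variable names and the values written into the matrix are hypothetical; only the 800 x 19 shape comes from the description above.

```scala
// Shape taken from the schema description: 800 particles, 19 features each.
val nParticles = 800
val nFeatures = 19

// Nested arrays, matching the array<array<double>> column type.
val lfeatures: Array[Array[Double]] = Array.fill(nParticles, nFeatures)(0.0)

// Made-up value: set feature 3 of particle 0.
lfeatures(0)(3) = 1.5

// One row: all features of a single particle.
val firstParticle: Array[Double] = lfeatures(0)

// One column: the same feature across all particles.
val featureColumn: Array[Double] = lfeatures.map(row => row(3))

println(s"rows = ${lfeatures.length}, cols = ${firstParticle.length}")
```

Row access is a direct lookup, while column access has to visit every row, which is worth keeping in mind when a pipeline mostly works feature by feature.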

In [7]:
val df2 = spark.read.parquet("file:/Users/vk/data/ML_MP_JR/parquet/qcd/part-06502-4e006c2f-7c89-4b89-aafb-e0977ef7f749.snappy.parquet")

df2 = [hfeatures: array<double>, lfeatures: array<array<double>>]



In [8]:
df2.printSchema

root
 |-- hfeatures: array (nullable = true)
 |    |-- element: double (containsNull = true)
 |-- lfeatures: array (nullable = true)
 |    |-- element: array (containsNull = true)
 |    |    |-- element: double (containsNull = true)



In [9]:
df2.show

+--------------------+--------------------+
|           hfeatures|           lfeatures|
+--------------------+--------------------+
|[0.0, 26.71513748...|[WrappedArray(797...|
|[77.6775894165039...|[WrappedArray(72....|
|[0.0, 17.52551651...|[WrappedArray(49....|
|[44.1640167236328...|[WrappedArray(55....|
|[138.158115386962...|[WrappedArray(126...|
|[139.724960327148...|[WrappedArray(187...|
|[0.0, 17.14591407...|[WrappedArray(69....|
|[0.0, 39.43039321...|[WrappedArray(251...|
|[41.2450332641601...|[WrappedArray(67....|
|[0.0, 19.49344825...|[WrappedArray(33....|
|[45.5153923034668...|[WrappedArray(49....|
|[40.5815162658691...|[WrappedArray(52....|
|[0.0, 28.79856681...|[WrappedArray(26....|
+--------------------+--------------------+



# Example 3: 3D Matrix: Images
__Schema Structure:__
- `label` is the classification label
- `image` is the 3D matrix. The actual pipeline, and what the image looks like when drawn, are shown here: https://github.com/vkhristenko/jupyter-notebooks/blob/master/python/pipeline_features2image_reproducelocally.ipynb
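
Spark reports the `image` column as a flat `vector`, so the 3D structure has to be recovered by index arithmetic. The plain-Scala sketch below shows one way to do that. The dimensions (`height`, `width`, `channels`) and the row-major ordering are assumptions made for illustration; the real shape and ordering are defined by the pipeline linked above.

```scala
// Hypothetical image shape; the real dimensions are set by the pipeline.
val height = 4
val width = 3
val channels = 2

// Flat storage, as in a vector column: one Double per (row, col, channel).
// The values are just the flat indices, to make the mapping easy to follow.
val flatImage: Array[Double] =
  Array.tabulate(height * width * channels)(i => i.toDouble)

// Assumed row-major layout: channel varies fastest, then column, then row.
def at(row: Int, col: Int, ch: Int): Double =
  flatImage((row * width + col) * channels + ch)

// Rebuild the nested 3D structure from the flat storage.
val image: Array[Array[Array[Double]]] =
  Array.tabulate(height, width, channels)((r, c, k) => at(r, c, k))

println(s"shape = ${image.length} x ${image(0).length} x ${image(0)(0).length}")
```

If the pipeline stores the axes in a different order, only the index expression inside `at` needs to change.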

In [13]:
val df3 = spark.read.parquet("file:/Users/vk/data/ML_MP_JR/parquet/fororacle/test_oracle_images")

df3 = [image: vector, label: int]



In [14]:
df3.printSchema

root
 |-- image: vector (nullable = true)
 |-- label: integer (nullable = true)



In [15]:
df3.show

+--------------------+-----+
|               image|label|
+--------------------+-----+
|[1.0,1.0,0.699999...|    0|
|[1.0,1.0,0.699999...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
|[1.0,1.0,1.0,1.0,...|    0|
+--------------------+-----+

