# An example of ingesting CMS ROOT data
- To start a Jupyter notebook with the Apache Toree kernel and the spark-root package loaded: `SPARK_OPTS="--packages org.diana-hep:spark-root_2.11:0.1.15" jupyter-notebook`

In [2]:
import org.dianahep.sparkroot.experimental._

## Read in the ROOT file
- Do not print the full schema: __it's huge!__ (67 top-level columns)

In [3]:
val df = spark.read.option("tree", "Events").root("file:/Users/vk/data/cms/*.root")

df = [EventAuxiliary: struct<processHistoryID_: struct<hash_: string>, id_: struct<run_: int, luminosityBlock_: int ... 1 more field> ... 8 more fields>, EventSelections: array<struct<hash_:string>> ... 65 more fields]



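## Select Muons
- Select only the momentum (`p4Polar_`) and position (`vertex_`) information
- The selected `m_state` struct also carries `qx3_`, the PDG ID (`pdgId_`), and the status (`status_`)
- chi2 is not part of `m_state`, so it has to be selected separately; alternatively, do not resolve the column path all the way down to `m_state`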

In [4]:
val df_muons = df.select("patMuons_slimmedMuons__PAT_.patMuons_slimmedMuons__PAT_obj.m_state")

df_muons = [m_state: array<struct<vertex_:struct<fCoordinates:struct<fX:float,fY:float,fZ:float>>,p4Polar_:struct<fCoordinates:struct<fPt:float,fEta:float,fPhi:float,fM:float>>,qx3_:int,pdgId_:int,status_:int>>]



In [5]:
df_muons.printSchema

root
 |-- m_state: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- vertex_: struct (nullable = true)
 |    |    |    |-- fCoordinates: struct (nullable = true)
 |    |    |    |    |-- fX: float (nullable = true)
 |    |    |    |    |-- fY: float (nullable = true)
 |    |    |    |    |-- fZ: float (nullable = true)
 |    |    |-- p4Polar_: struct (nullable = true)
 |    |    |    |-- fCoordinates: struct (nullable = true)
 |    |    |    |    |-- fPt: float (nullable = true)
 |    |    |    |    |-- fEta: float (nullable = true)
 |    |    |    |    |-- fPhi: float (nullable = true)
 |    |    |    |    |-- fM: float (nullable = true)
 |    |    |-- qx3_: integer (nullable = true)
 |    |    |-- pdgId_: integer (nullable = true)
 |    |    |-- status_: integer (nullable = true)



## Generate Product Classes for Easier Manipulation
- This approach is not pretty code-wise yet.
- Generate case classes so that Spark's `Row` objects can be cast into them
- This allows manipulation through field access (e.g. `object.field`), just like in Python

In [7]:
import codegen._
val queue = df_muons.schema.codeGen("Event")
val s = queue.mkString("\n")
println(s)


case class Record2 (
    fX : Float,
    fY : Float,
    fZ : Float
)


case class Record1 (
    fCoordinates : Record2
)


case class Record4 (
    fPt : Float,
    fEta : Float,
    fPhi : Float,
    fM : Float
)


case class Record3 (
    fCoordinates : Record4
)


case class Record0 (
    vertex_ : Record1,
    p4Polar_ : Record3,
    qx3_ : Int,
    pdgId_ : Int,
    status_ : Int
)


case class Event (
    m_state : Seq[Record0]
)
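
The printed classes follow a recognizable pattern: every nested struct becomes a `RecordN` case class, numbered in the order the structs are first encountered and emitted innermost first, with the root class last. The sketch below reproduces that pattern in plain Scala as an illustration only. `CodeGenSketch`, `FieldType`, `StructT`, `ArrayT`, and `generate` are hypothetical names introduced here; this is not the actual `codegen` implementation, and it uses a toy schema type instead of Spark's `StructType`.

```scala
object CodeGenSketch {
  // Minimal stand-in for a Spark schema. These types are hypothetical:
  // they are neither Spark's StructType nor part of spark-root.
  sealed trait FieldType
  case object FloatT extends FieldType
  case object IntT extends FieldType
  final case class ArrayT(element: FieldType) extends FieldType
  final case class StructT(fields: Seq[(String, FieldType)]) extends FieldType

  /** Returns case-class source text, innermost structs first, `rootName` last. */
  def generate(root: StructT, rootName: String): Seq[String] = {
    val out = scala.collection.mutable.ArrayBuffer.empty[String]
    var counter = 0

    // Appends the definition of `struct` after all of its nested structs.
    def emit(struct: StructT, name: String): Unit = {
      val fields = struct.fields.map { case (n, t) => s"    $n : ${typeName(t)}" }
      out += fields.mkString(s"case class $name (\n", ",\n", "\n)")
    }

    // Maps a field type to a Scala type name; a struct gets the next
    // RecordN name and is emitted as a side effect.
    def typeName(t: FieldType): String = t match {
      case FloatT    => "Float"
      case IntT      => "Int"
      case ArrayT(e) => s"Seq[${typeName(e)}]"
      case struct: StructT =>
        val name = s"Record$counter"
        counter += 1
        emit(struct, name)
        name
    }

    emit(root, rootName)
    out.toList
  }

  // The muon schema from the printSchema output above, in the toy representation.
  private val xyz = StructT(Seq("fX" -> FloatT, "fY" -> FloatT, "fZ" -> FloatT))
  private val polar =
    StructT(Seq("fPt" -> FloatT, "fEta" -> FloatT, "fPhi" -> FloatT, "fM" -> FloatT))
  val muonSchema: StructT = StructT(Seq(
    "m_state" -> ArrayT(StructT(Seq(
      "vertex_"  -> StructT(Seq("fCoordinates" -> xyz)),
      "p4Polar_" -> StructT(Seq("fCoordinates" -> polar)),
      "qx3_"     -> IntT,
      "pdgId_"   -> IntT,
      "status_"  -> IntT
    )))
  ))
}
```

Calling `CodeGenSketch.generate(CodeGenSketch.muonSchema, "Event").mkString("\n\n")` should yield class definitions in the same order as the output above (`Record2`, `Record1`, `Record4`, `Record3`, `Record0`, `Event`), which can then be pasted into a cell.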





In [8]:
case class Record2 (
    fX : Float,
    fY : Float,
    fZ : Float
)


case class Record1 (
    fCoordinates : Record2
)


case class Record4 (
    fPt : Float,
    fEta : Float,
    fPhi : Float,
    fM : Float
)


case class Record3 (
    fCoordinates : Record4
)


case class Record0 (
    vertex_ : Record1,
    p4Polar_ : Record3,
    qx3_ : Int,
    pdgId_ : Int,
    status_ : Int
)


case class Event (
    m_state : Seq[Record0]
)


defined class Record2
defined class Record1
defined class Record4
defined class Record3
defined class Record0
defined class Event


## Manipulate your data

In [10]:
val ds_muons = df_muons.as[Event]
ds_muons.show

+--------------------+
|             m_state|
+--------------------+
|[[[[0.16862841,0....|
|[[[[0.10406262,0....|
|[[[[0.10817633,0....|
|[[[[0.10356494,0....|
|[[[[0.10073657,0....|
|[[[[0.10944145,0....|
|[[[[0.11222846,0....|
|[[[[0.10781129,0....|
|[[[[0.10165176,0....|
|[[[[0.1002845,0.1...|
|                  []|
|[[[[0.09750543,0....|
|                  []|
|[[[[0.10538159,0....|
|[[[[0.12014926,0....|
|[[[[-0.0053656413...|
|[[[[0.10457884,0....|
|[[[[0.10495199,0....|
|[[[[0.106314346,0...|
|[[[[0.10803154,0....|
+--------------------+
only showing top 20 rows



ds_muons = [m_state: array<struct<vertex_:struct<fCoordinates:struct<fX:float,fY:float,fZ:float>>,p4Polar_:struct<fCoordinates:struct<fPt:float,fEta:float,fPhi:float,fM:float>>,qx3_:int,pdgId_:int,status_:int>>]



In [11]:
// Number of muons in each event
ds_muons.map(_.m_state.size).show

+-----+
|value|
+-----+
|    5|
|    4|
|    3|
|    3|
|    1|
|    2|
|    1|
|    1|
|    1|
|    3|
|    0|
|    4|
|    0|
|    3|
|    2|
|    2|
|    4|
|    4|
|    2|
|    1|
+-----+
only showing top 20 rows
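
Once rows are typed as `Event`, further manipulation is ordinary field access inside `map` and `flatMap`. The sketch below shows a few such lambdas on a plain Scala `Seq[Event]`, which stands in for the Dataset so that it runs without Spark. The `MuonSketch` object, the `muon` helper, and all numeric values are made up for illustration; the same lambdas would be expected to work on `ds_muons`, assuming the required encoders are in scope.

```scala
object MuonSketch {
  // Same shapes as the generated case classes.
  final case class Record2(fX: Float, fY: Float, fZ: Float)
  final case class Record1(fCoordinates: Record2)
  final case class Record4(fPt: Float, fEta: Float, fPhi: Float, fM: Float)
  final case class Record3(fCoordinates: Record4)
  final case class Record0(
    vertex_ : Record1,
    p4Polar_ : Record3,
    qx3_ : Int,
    pdgId_ : Int,
    status_ : Int
  )
  final case class Event(m_state: Seq[Record0])

  // Hypothetical helper that builds a toy muon; every value is invented.
  def muon(pt: Float, eta: Float): Record0 =
    Record0(
      Record1(Record2(0.1f, 0.1f, 0.0f)),
      Record3(Record4(pt, eta, 0.0f, 0.105f)),
      qx3_ = -3,
      pdgId_ = 13,
      status_ = 1
    )

  // Three toy events: two muons, none, one.
  val events: Seq[Event] = Seq(
    Event(Seq(muon(25.0f, 0.5f), muon(12.5f, -1.2f))),
    Event(Seq.empty),
    Event(Seq(muon(40.0f, 2.0f)))
  )

  // Muon multiplicity per event (the same lambda as in the cell above).
  val multiplicities: Seq[Int] = events.map(_.m_state.size)

  // Flatten to one pT value per muon across all events.
  val allPt: Seq[Float] =
    events.flatMap(_.m_state.map(_.p4Polar_.fCoordinates.fPt))

  // Highest muon pT per event; None for events without muons.
  val leadingPt: Seq[Option[Float]] =
    events.map(_.m_state.map(_.p4Polar_.fCoordinates.fPt).reduceOption(_ max _))
}
```

Using `reduceOption` rather than `max` keeps events with an empty `m_state` (such as the `[]` rows shown earlier) from throwing an exception.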

