# Describe the difference between eager and lazy execution

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Getting Started

Run the following cell to configure our "classroom."

In [0]:
%run ./Includes/Classroom-Setup

In [0]:
# If "/mnt/training" cannot be listed after "./Includes/Classroom-Setup", unmount it
# so that the next cell ("./Includes/Dataset-Mounts-New") can mount it again.
try:
    files = dbutils.fs.ls("/mnt/training")
except Exception:
    dbutils.fs.unmount('/mnt/training/')

/mnt/training/ has been unmounted.


In [0]:
%run "./Includes/Dataset-Mounts-New"

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Laziness By Design

Fundamental to Apache Spark are the notions that
* Transformations are **LAZY**
* Actions are **EAGER**
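
As a rough analogy for these two notions, the sketch below models them in plain Python. `LazyFrame`, its methods, and the sample readings are hypothetical names and values made up for illustration. This is not how Spark is implemented, only a minimal picture of "transformations record a plan, actions run it".

```python
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: records steps, runs them on demand."""
    jobs_run = 0  # counts how many times an action actually executed a plan

    def __init__(self, rows, plan=()):
        self._rows = rows          # source data, untouched until an action runs
        self._plan = tuple(plan)   # recorded transformations, in order

    # Transformations: return a new LazyFrame with a longer plan; no work is done.
    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyFrame(self._rows, self._plan + (("map", func),))

    # Action: executes the recorded plan and returns concrete results.
    def collect(self):
        LazyFrame.jobs_run += 1
        rows = list(self._rows)
        for kind, func in self._plan:
            if kind == "filter":
                rows = [r for r in rows if func(r)]
            else:
                rows = [func(r) for r in rows]
        return rows


# Made-up (station, temperature) pairs, for illustration only.
readings = LazyFrame([("BARNABY", 12.0), ("BRIONES", 55.0), ("CONCORD", 61.0)])

warm = readings.filter(lambda r: r[1] > 50).map(lambda r: r[0])
jobs_before = LazyFrame.jobs_run   # still 0: only a plan exists so far
result = warm.collect()            # the action triggers execution
jobs_after = LazyFrame.jobs_run    # now 1

print(jobs_before, jobs_after, result)
```

Chaining `filter` and `map` leaves the job counter at 0. Only `collect`, the action in this toy model, does any work.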

The following code condenses the logic from the DataFrames modules in this learning path, and uses the DataFrames API to:
- Specify a schema, format, and file source for the data to be loaded
- Select columns to `GROUP BY`
- Aggregate with a `COUNT`
- Provide an alias name for the aggregate output
- Specify a column to sort on

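The next two cells define a series of **transformations**: the first builds the grouped count, and the second repeats it with the alias and the sort added. By definition, this logic will result in a DataFrame and will not trigger any jobs.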

In [0]:
schemaDDL = "NAME STRING, STATION STRING, LATITUDE FLOAT, LONGITUDE FLOAT, ELEVATION FLOAT, DATE DATE, UNIT STRING, TAVG FLOAT"

sourcePath = "/mnt/training/weather/StationData/stationData.parquet/"

countsDF = (spark.read
  .format("parquet")
  .schema(schemaDDL)
  .load(sourcePath)
  .groupBy("NAME", "UNIT").count()
)


In [0]:
countsDF = (spark.read
  .format("parquet")
  .schema(schemaDDL)
  .load(sourcePath)
  .groupBy("NAME", "UNIT").count()
  .withColumnRenamed("count", "counts")
  .orderBy("NAME")
)

Because `display` is an **action**, a job _will_ be triggered, as logic is executed against the specified data to return a result.

In [0]:
display(countsDF)

| NAME | UNIT | counts |
|---|---|---|
| BARNABY CALIFORNIA, CA US | C | 151 |
| BIG ROCK CALIFORNIA, CA US | C | 151 |
| BLACK DIAMOND CALIFORNIA, CA US | C | 151 |
| BRIONES CALIFORNIA, CA US | F | 151 |
| CONCORD BUCHANAN FIELD, CA US | F | 149 |
| HAYWARD AIR TERMINAL, CA US | F | 149 |
| HOUSTON INTERCONTINENTAL AIRPORT, TX US | F | 150 |
| HOUSTON WILLIAM P HOBBY AIRPORT, TX US | C | 150 |
| LAS TRAMPAS CALIFORNIA, CA US | C | 151 |
| LOS PRIETOS CALIFORNIA, CA US | F | 151 |


### Why is Laziness So Important?

Laziness is at the core of Scala and Spark.

It has a number of benefits:
* Spark is not forced to load all the data at step #1
  * Loading everything up front is technically impossible with **REALLY** large datasets.
* Operations are easier to parallelize
  * N different transformations can be processed on a single data element, on a single thread, on a single machine.
* Optimizations can be applied to the full chain of transformations prior to code compilation
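
One way to picture the first two benefits is with plain Python generators, used here only as an analogy for Spark's behavior. The function names and the `log` list are assumptions made up for illustration. Building the pipeline does no work, and pulling one element runs every step on that element before the next one is read.

```python
log = []  # records the order in which steps touch each element


def source(n):
    # Stand-in for a data source: yields one element at a time.
    for i in range(n):
        log.append(f"read {i}")
        yield i


def double(items):
    for x in items:
        log.append(f"double {x}")
        yield x * 2


def add_one(items):
    for x in items:
        log.append(f"add_one {x}")
        yield x + 1


pipeline = add_one(double(source(3)))  # nothing has run yet
steps_before = len(log)                # 0: the source has not been read

first = next(pipeline)                 # pulls one element through every step
print(steps_before, first, log)
```

After the single `next` call the log holds `read 0`, `double 0`, `add_one 0`, in that order. Both transformations ran on one element before any other element was read, and the rest of the source was never loaded.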