# Describe the difference between eager and lazy execution

## ➡️ Getting Started

Run the following cell to configure our notebook.

In [None]:
%run Utilities

## ➡️ Laziness By Design

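Two notions are fundamental to Apache Spark:
* Transformations are **LAZY**: they only describe the work to be done.
* Actions are **EAGER**: they trigger execution of that work and return or write a result.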

The following code condenses the logic from the DataFrames modules in this learning path, and uses the DataFrames API to:
- Specify a format and file source for the data to be loaded (the schema is read from the Parquet files themselves)
- Select columns to `GROUP BY`
- Aggregate with a `COUNT`
- Provide an alias name for the aggregate output
- Specify a column to sort on

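This cell defines a series of **transformations**. Because every step is lazy, the result is a DataFrame that only describes the query, and no job runs against the data itself. (Spark may still read Parquet file metadata at this point to resolve the schema.)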

In [None]:
parquetDir = "Files/taxidata/yellow*.parquet"

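countsDF = (spark                          # Our SparkSession and entry point
  .read                                    # Our DataFrameReader
  .parquet(parquetDir)                     # Returns an instance of DataFrame
  .groupBy("VendorID")                     # Group rows by vendor
  .count()                                 # Count the rows in each group
  .withColumnRenamed("count", "Counts")    # Alias the aggregate column
  .orderBy("VendorID")                     # Sort the result by vendor
)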

Because `display` is an **action**, a job _will_ be triggered, as logic is executed against the specified data to return a result.

In [None]:
display(countsDF)

### ➡️ Why is Laziness So Important?

Laziness is at the core of Scala and Spark.

It has a number of benefits:
* Spark is not forced to load all data in the first step.
  * Loading everything up front is technically impossible with **REALLY** large datasets.
* Operations are easier to parallelize.
  * N different transformations can be processed on a single data element, on a single thread, on a single machine.
* Because the full chain of transformations is known before anything runs, optimizations can be applied to the query plan prior to code compilation.
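
The first two benefits can be illustrated without Spark at all. The sketch below is a plain-Python analogy built on generators, not Spark's actual execution engine; `read_rows`, `evaluated`, and the sample rows are hypothetical names and data introduced only for illustration. It shows that building a pipeline reads nothing, and that an "action" pulls each row through every step one at a time.

```python
evaluated = []  # records every row that is actually read from the "source"

def read_rows(n):
    """Hypothetical data source: yields rows one at a time instead of loading them all."""
    for i in range(n):
        evaluated.append(i)
        yield {"VendorID": i % 2 + 1, "trip_distance": float(i)}

# "Transformations": building the pipeline reads nothing.
rows = read_rows(1_000_000)
long_trips = (r for r in rows if r["trip_distance"] >= 3.0)
distances = (r["trip_distance"] for r in long_trips)
rows_read_before_action = len(evaluated)

# "Action": asking for a result pulls rows through every step, one row at a time.
first_distance = next(distances)
rows_read_after_action = len(evaluated)

print(rows_read_before_action, first_distance, rows_read_after_action)
```

Only the rows needed to produce the first match are ever read, even though the source nominally holds a million rows.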

# Actions & Transformations

## ➡️ Actions

In production code, actions will generally **write data to persistent storage** using the DataFrameWriter.

During interactive code development in Fabric notebooks, the `display` method will frequently be used to **materialize a view of the data** after logic has been applied.

A number of other actions return previews of the data or show the physical execution plans that describe how logic will map to data. For the complete list, review the [API docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset).

| Method | Return | Description |
|--------|--------|-------------|
| `collect()` | Collection | Returns an array that contains all of the rows in this Dataset. |
| `count()` | Long | Returns the number of rows in the Dataset. |
| `first()` | Row | Returns the first row. |
| `foreach(f)` | - | Applies a function f to all rows. |
| `foreachPartition(f)` | - | Applies a function f to each partition of this Dataset. |
| `head()` | Row | Returns the first row. |
| `reduce(f)` | Row | Reduces the elements of this Dataset using the specified binary function. |
| `show(..)` | - | Displays the top rows of the Dataset in tabular form (20 rows by default). |
| `take(n)` | Collection | Returns the first n rows in the Dataset. |
| `toLocalIterator()` | Iterator | Returns an iterator over all of the rows in this Dataset. |

❗ Actions such as `collect` can lead to out-of-memory errors by forcing all of the data to be collected onto the driver.
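
To see why a preview is cheaper than a full collection, consider the following plain-Python analogy. It does not use Spark; `make_source`, `take`, and `collect` here are hypothetical helper functions that only mimic the behavior of the actions with the same names, and real Spark may scan more rows than this sketch suggests.

```python
from itertools import islice

def make_source(n, counter):
    """Hypothetical lazy source; counter["rows"] tracks how many rows were produced."""
    for i in range(n):
        counter["rows"] += 1
        yield i

def take(rows, n):
    """Analogue of take(n): stop pulling rows after the first n."""
    return list(islice(rows, n))

def collect(rows):
    """Analogue of collect(): materialize every row in local memory."""
    return list(rows)

take_counter = {"rows": 0}
preview = take(make_source(100_000, take_counter), 5)

collect_counter = {"rows": 0}
everything = collect(make_source(100_000, collect_counter))

print(take_counter["rows"], collect_counter["rows"])
```

The preview touches five rows, while the full collection holds every row in one process's memory at once, which is the situation that leads to out-of-memory errors at scale.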

## ➡️ Transformations

Transformations have the following key characteristics:
* They eventually return another `DataFrame`.
* The DataFrames they operate on are immutable: an instance of a `DataFrame` cannot be altered once it's instantiated, so a transformation always produces a new one.
  * This means other optimizations are possible, such as the use of shuffle files (to be discussed in detail later).
* They are classified as either wide or narrow operations.

Most operations in Spark are **transformations**. While many transformations are [DataFrame operations](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset), writing efficient Spark code will require importing functions from the `pyspark.sql.functions` module (`org.apache.spark.sql.functions` in Scala), which contains [transformations corresponding to SQL built-in operations](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$).
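
The immutability described above can be sketched in plain Python. The `LazyFrame` class below is a hypothetical stand-in invented for this illustration, not part of any Spark API: each "transformation" returns a new object with a longer recorded plan, and only the "action" applies that plan to the rows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LazyFrame:
    """Hypothetical stand-in for a DataFrame: source rows plus a recorded plan."""
    rows: tuple
    plan: tuple = ()

    def filter(self, predicate):
        # Transformation: returns a NEW LazyFrame; self is left untouched.
        return LazyFrame(self.rows, self.plan + (("filter", predicate),))

    def select(self, func):
        # Transformation: again only extends the recorded plan.
        return LazyFrame(self.rows, self.plan + (("select", func),))

    def collect(self):
        # Action: only now is the recorded plan applied to the rows.
        result = list(self.rows)
        for kind, fn in self.plan:
            if kind == "filter":
                result = [r for r in result if fn(r)]
            else:
                result = [fn(r) for r in result]
        return result

base = LazyFrame(rows=(1, 2, 3, 4, 5, 6))
evens = base.filter(lambda r: r % 2 == 0)
doubled = evens.select(lambda r: r * 2)

print(base.collect(), evens.collect(), doubled.collect())
```

Because `base` is never modified, it can be reused as the starting point for any number of other pipelines.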

## ➡️ Types of Transformations

A transformation may be wide or narrow.

A wide transformation requires sharing data across workers. 

A narrow transformation can be applied per partition/worker with no need to share or shuffle data to other workers. 

## ➡️ Narrow Transformations

The data required to compute the records in a single partition resides in at most one partition of the parent DataFrame.

Examples include:
* `filter(..)`
* `drop(..)`
* `coalesce(n)`

![](https://github.com/weslbo/DP-601/blob/main/images/Narrow-Transformation.png?raw=true)

In [None]:
from pyspark.sql.functions import col

pickupLocation249 = (spark.read
                          .parquet(parquetDir)
                          .filter(col("PULocationID") == 249)
                          .limit(10))

display(pickupLocation249)
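
What makes a filter narrow can also be shown without Spark. In the plain-Python sketch below, the nested lists are a hypothetical dataset already split into three partitions, and `narrow_filter` is an illustrative helper rather than a Spark function. Each output partition is computed from exactly one input partition.

```python
# Hypothetical dataset already split into three partitions of PULocationID values.
partitions = [
    [249, 48, 249],
    [161, 249],
    [48, 237, 161],
]

def narrow_filter(partition, predicate):
    """Process one partition using only the rows it already holds."""
    return [row for row in partition if predicate(row)]

# Each output partition depends on exactly one input partition,
# so the three calls could run on different workers without any data exchange.
filtered = [narrow_filter(p, lambda loc: loc == 249) for p in partitions]

print(filtered)
```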

## ➡️ Wide Transformations

The data required to compute the records in a single partition may reside in many partitions of the parent DataFrame. These operations require that data is **shuffled** between executors.

Examples include:
* `distinct()`
* `groupBy(..).sum()`
* `repartition(n)`

![](https://github.com/weslbo/DP-601/blob/main/images/Wide-Transformation.png?raw=true)

In [None]:
from pyspark.sql.functions import desc

pickupLocations = (spark.read
                          .parquet(parquetDir)
                          .groupBy("PULocationID")
                          .sum("passenger_count")
                          .withColumnRenamed("sum(passenger_count)", "sum_passenger_count")
                          .sort(desc("sum_passenger_count"))
                          .limit(10))

display(pickupLocations)
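
The shuffle behind a grouped aggregation can be sketched in plain Python. The example below is an analogy only: the partitions and row values are hypothetical, and routing rows with `key % num_output_partitions` is a simplified stand-in for the hash partitioning Spark performs. Rows with the same key start out in different partitions and must be moved together before they can be summed.

```python
from collections import defaultdict

# Hypothetical input: two partitions of (PULocationID, passenger_count) rows.
partitions = [
    [(249, 1), (48, 2), (249, 3)],
    [(48, 1), (161, 4), (249, 2)],
]
num_output_partitions = 2

# Shuffle: route every row to an output partition chosen by its key,
# so that all rows for one key end up in the same place.
shuffled = [[] for _ in range(num_output_partitions)]
for partition in partitions:
    for key, value in partition:
        shuffled[key % num_output_partitions].append((key, value))

def sum_by_key(partition):
    """After the shuffle, each output partition can be aggregated independently."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)

sums = [sum_by_key(p) for p in shuffled]

print(sums)
```

Every output partition draws rows from both input partitions, which is exactly the cross-partition dependency that makes the operation wide.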