# 2. Apache Spark Core

## SparkSQL
- Spark SQL is a **module** for structured data processing that exposes multiple interfaces
- Structured data is any data that has a schema; it can be queried as SQL tables or through the DataFrame API for Python, Scala, Java and R

## Transformations
- DataFrame transformations are **lazily** evaluated (no Job starts until an **Action** is called)
  - The schema is evaluated eagerly by the Driver, but no Job is spawned
  - Benefit of "Lazy Evaluation": Spark can make optimization decisions after it looks at the whole DAG (Directed Acyclic Graph)
- Actions are methods that trigger computation
  - A Job is spawned
  - Examples: `df.count()`, `df.collect()`, `df.show()`, `display(df)` (Databricks notebooks)

## DataFrameReader
- Interface used to load a DataFrame from external storage
  - ```spark.read.csv("/Filestore/tables/LifeExp_headers.csv")```
- Explicit vs Implicit vs Infer Schema
  1. **Explicitly** define Schema _**without reading**_ data files
      ```
      DDL_schema = "country STRING, lifeexp DOUBLE, region STRING"
      userDF = spark.read.option("header", True).schema(DDL_schema).csv("/Filestore/tables/LifeExp_headers.csv")
      ```
  2. **Implicitly** accept default column names (`_c0`, `_c1`, ...) with every column typed as string, _**without scanning**_ the data files to infer types
      ```
      df1 = spark.read.load("/Filestore/tables/LifeExp_headers.csv", format = "csv", header = False)
      display(df1)
      ```
  3. **Infer** column names from the header and data types _**by reading**_ the data files (this costs an extra pass over the data)
      ```
      df2 = spark.read.load("/Filestore/tables/LifeExp_headers.csv", format = "csv", header = True, inferSchema = True)
      display(df2)
      ```

## DataFrameWriter
- Write DataFrame to external storage
    ```
    (df.write
      .format("delta")
      .mode("append")
      .save(outPath))
    ```
- Write as SQL table
    ```
    (df.write
      .mode("overwrite")
      .saveAsTable("events_p"))
    ```

## Query Execution
We can express the same query through any of these interfaces. The Spark SQL engine compiles each of them to the same query plan, which it then optimizes and executes on the Spark cluster.

![query execution engine](https://files.training.databricks.com/images/aspwd/spark_sql_query_execution_engine.png)

<img src="https://files.training.databricks.com/images/icon_note_32.png" alt="Note"> Resilient Distributed Datasets (RDDs) are the low-level representation of datasets processed by a Spark cluster. In early versions of Spark, you had to write <a href="https://spark.apache.org/docs/latest/rdd-programming-guide.html" target="_blank">code manipulating RDDs directly</a>. In modern versions of Spark you should instead use the higher-level DataFrame APIs, which Spark automatically compiles into low-level RDD operations.