# Chapter 1 Introduction

skip this

# Chapter 2 Getting Started

Spark download page: https://spark.apache.org/downloads.html

Or... since the release of Apache Spark 2.2, developers who only care about learning Spark in Python (that's me) have the option of installing pyspark through `pip install pyspark`.

## Transformation, Action and Lazy Evaluations

- **Transformation**: a transformation turns a Spark DataFrame into a new DataFrame without altering the original data, which gives DataFrames the property of immutability
    - their results are not computed immediately, but they are recorded or remembered as a _lineage_
    - lazy evaluation is Spark’s strategy for delaying execution until an action is invoked or data is "touched"
    - e.g., `orderBy()`, `groupBy()`, `filter()`, `select()`, `join()`
- **Action**: an action triggers the lazy evaluation of all the recorded transformations
    - e.g., `show()`, `take()`, `count()`, `collect()`, `save()`
    
    
### Narrow vs. wide transformation

- **Narrow**: a single output partition can be computed from a single input partition
    - e.g., `filter()`, `contains()`
- **Wide**: a single output partition needs data from multiple input partitions, so data is read from other partitions, combined, and written to disk (a shuffle)
    - e.g., `groupBy()`, `orderBy()`

## Access The Spark UI

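1. launch `pyspark` from a terminal
2. the startup output includes a line like `Spark context Web UI available at http://10.0.0.242:4040`; open that URL in a browser (the host and port will differ per machine, 4040 is the default port)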

# Chapter 3 Apache Spark's Structured APIs

## RDD

- **3 vital characteristics**: 
    - dependencies: a list of dependencies tells Spark how an RDD is constructed from its inputs, so Spark can recreate the RDD when it needs to reproduce results
    - partitions: give Spark the ability to split the work and parallelize computation on partitions across executors
    - compute function: produces an `Iterator[T]` for the data that will be stored in the RDD

## Python data types in Spark

- basics: `ByteType`, `ShortType`, `IntegerType`, `LongType`, `FloatType`, `DoubleType`, `StringType`, `BooleanType`, `DecimalType`
- more sophisticated: `BinaryType`, `TimestampType`, `DateType`, `ArrayType`, `MapType`, `StructType`, `StructField`

# Chapter 4 Spark SQL and DataFrames: Introduction to Built-in Data Sources

- DataFrame --> SQL: `df.createOrReplaceTempView([TABLENAME])`
- read and write data in different formats (check dbc notebook 4.2)
    - Parquet
    - JSON
    - CSV
    - Avro
    - ORC
    - Image
    - Binary

# Chapter 5 Spark SQL and DataFrames: Interacting with External Data Sources

## create UDFs

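- can create both regular Spark SQL UDFs and pandas UDFs (vectorized UDFs)
- Null checking (Spark SQL does not guarantee the evaluation order of subexpressions, so a separate `IS NOT NULL` filter may not run before the UDF):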
    1. Make the UDF itself `null`-aware and do null checking inside the UDF
    2. Use IF or CASE WHEN expressions to do the `null` check and invoke the UDF in a conditional branch
    
## querying with the Spark SQL Shell, Beeline, and Tableau

(skipped for now)

## common DataFrames and Spark SQL operations

Spark SQL doc: https://spark.apache.org/docs/latest/api/sql/index.html

- `union()`: combines two DataFrames that have the same schema (columns are matched by position)
- `join()`: the join type defaults to inner
- windowing: e.g., `rank()`, `dense_rank()` (interesting), `percent_rank()` (p149)
- modifications
    - adding: `withColumn()`
    - dropping: `drop()`
    - renaming: `withColumnRenamed()`
    - pivoting: `PIVOT` (p154)

# Chapter 6 Spark SQL and Datasets

In this chapter, we go under the hood to understand Datasets: we’ll explore working with Datasets in **Java and Scala**, how Spark manages memory to accommodate Dataset constructs as part of the high-level API, and the costs associated with using Datasets.

(skip for now)

# Chapter 7 Optimizing and Tuning Spark Applications

## Viewing and modifying Spark properties

- configuration files in your deployment’s `$SPARK_HOME` directory
- specify Spark configurations directly in your Spark application
    - or, on the command line when submitting the application with `spark-submit`, using the `--conf` flag

```
  spark-submit \
    --conf spark.sql.shuffle.partitions=5 \
    --conf "spark.executor.memory=2g" \
    --class main.scala.chapter7.SparkConfig_7_1 \
    jars/main-scala-chapter7_2.12-1.0.jar
```
- through a programmatic interface via the Spark shell

## Scaling Spark for Large Workloads
- 

# Chapter 8 