### Spark Concepts

#### What's new in Spark 3.0?

1. MLlib saw little new development, especially its RDD-based interface.
2. Spark 3 is significantly faster, helped by Adaptive Query Execution and Dynamic Partition Pruning.
3. Python 2 support is deprecated.
4. Deeper Kubernetes support.
5. Support for binary files, e.g. images and video.
6. SparkGraph, which uses the Cypher query language.
7. ACID support in data lakes, i.e. Delta Lake.

### RDD - Resilient Distributed Dataset

1. RDDs can be created in many ways, e.g. `sc.parallelize`, `sc.textFile`, Hive, JDBC connectors, Cassandra, HBase, etc.
2. Transformations: map, flatMap, filter, distinct, sample, union, intersection, subtract, cartesian, etc.
3. Actions: collect, count, countByValue, take, top, reduce, etc.
4. Lazy evaluation: nothing is computed until an action is called.
5. The RDD has been the primary user-facing API in Spark since its inception. At its core, an RDD is an immutable distributed collection of elements of your data, partitioned across the nodes in your cluster, that can be operated on in parallel with a low-level API that offers transformations and actions.
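
The split between transformations and actions, and the lazy evaluation that ties them together, can be sketched in plain Python. This is only an illustration of the idea, not Spark's API: `LazyCollection` is a hypothetical stand-in for an RDD, and the `calls` list exists only to show when the mapper actually runs.

```python
class LazyCollection:
    """Hypothetical stand-in for an RDD (not part of Spark)."""

    def __init__(self, source):
        # `source` is a zero-argument function that yields a fresh iterable,
        # so the whole pipeline can be re-evaluated on every action.
        self._source = source

    # Transformations: build a new LazyCollection, compute nothing yet.
    def map(self, fn):
        return LazyCollection(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return LazyCollection(lambda: (x for x in self._source() if pred(x)))

    # Actions: pull the data through the recorded pipeline.
    def collect(self):
        return list(self._source())

    def count(self):
        return sum(1 for _ in self._source())


calls = []  # records every element the mapper is actually applied to


def square(x):
    calls.append(x)
    return x * x


numbers = LazyCollection(lambda: range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)

calls_before_action = len(calls)  # the mapper has not run yet
result = evens_squared.collect()  # the action triggers the computation
```

Chaining `map` and `filter` only describes the work; `square` is first invoked when `collect()` runs. Real RDDs follow the same pattern, with the added steps of planning and distributing the work across the cluster.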

### When to use RDDs?
Consider using RDDs when:

1. you want low-level transformation and actions and control on your dataset;
2. your data is unstructured, such as media streams or streams of text;
3. you want to manipulate your data with functional programming constructs rather than domain-specific expressions;
4. you don’t care about imposing a schema, such as columnar format, while processing or accessing data attributes by name or column; and
5. you can forgo some optimization and performance benefits available with DataFrames and Datasets for structured and semi-structured data.

### Spark Context

1. Created by the driver program.
2. Responsible for creating RDDs.

### DataFrames

1. Contain Row objects.
2. Can run SQL queries.
3. Can have a schema (leading to more efficient storage).
4. Read from and write to JSON, Parquet, Hive, CSV, etc.
5. Communicate with JDBC/ODBC, Tableau, etc.
6. DataFrames allow for better interoperability.
7. DataFrames simplify development, e.g. you can perform most SQL operations with a few lines of code.
8. RDDs are good for map-reduce style problems, so sometimes you might need to convert a DataFrame to an RDD, e.g. `dataframe.rdd.map(mapper_function)` in PySpark.
9. Like an RDD, a DataFrame is an immutable distributed collection of data. Unlike an RDD, its data is organized into named columns, like a table in a relational database. Designed to make processing large data sets even easier, a DataFrame lets developers impose a structure onto a distributed collection of data, allowing a higher-level abstraction; it provides a domain-specific language API to manipulate your distributed data; and it makes Spark accessible to a wider audience, beyond specialized data engineers.
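
The map-reduce style of problem mentioned in item 8 can be sketched in plain Python over rows with named columns. This is an analogy rather than Spark code: the `Rating` row type, the sample data, and the `merge` helper are hypothetical, and `mapper_function` simply borrows the name used above.

```python
from collections import namedtuple
from functools import reduce

# Hypothetical row type with named columns, loosely analogous to a Row.
Rating = namedtuple("Rating", ["user_id", "movie_id", "score"])

rows = [
    Rating(1, 10, 4.0),
    Rating(2, 10, 5.0),
    Rating(1, 20, 3.0),
]


def mapper_function(row):
    # Map step: emit one (key, value) pair per row, reading fields by name.
    return (row.movie_id, (row.score, 1))


def merge(acc, pair):
    # Reduce step: combine the (sum, count) values that share a key.
    key, (score, n) = pair
    total, count = acc.get(key, (0.0, 0))
    acc[key] = (total + score, count + n)
    return acc


totals = reduce(merge, map(mapper_function, rows), {})
averages = {movie: total / count for movie, (total, count) in totals.items()}
```

In Spark, the map step would run on the RDD obtained from the DataFrame, and the per-key combination would be handled by a keyed reduction across partitions instead of a single in-memory dictionary.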

### Datasets

1. Datasets are used in Scala and Java rather than Python. Since Python and R have no compile-time type safety, they expose only the untyped API, namely DataFrames.
2. Datasets are the typed API; DataFrames are the untyped API.
3. Conceptually, consider DataFrame as an alias for a collection of generic objects Dataset[Row], where a Row is a generic untyped JVM object. Dataset, by contrast, is a collection of strongly-typed JVM objects, dictated by a case class you define in Scala or a class in Java.
4. Because Spark, as a compiler, understands the typed JVM objects in your Dataset, it maps them to Tungsten’s internal memory representation using Encoders. As a result, Tungsten Encoders can efficiently serialize and deserialize JVM objects, as well as generate compact bytecode that executes at superior speeds.

### When should I use DataFrames or Datasets?
1. If you want rich semantics, high-level abstractions, and domain specific APIs, use DataFrame or Dataset.
2. If your processing demands high-level expressions, filters, maps, aggregation, averages, sum, SQL queries, columnar access and use of lambda functions on semi-structured data, use DataFrame or Dataset.
3. If you want a higher degree of type safety at compile time, want typed JVM objects, and want to take advantage of Catalyst optimization and Tungsten’s efficient code generation, use Dataset.
4. If you want unification and simplification of APIs across Spark Libraries, use DataFrame or Dataset.
5. If you are an R user, use DataFrames.
6. If you are a Python user, use DataFrames and fall back to RDDs if you need more control.

### References

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

Frank Kane's Udemy course: Taming Big Data with Apache Spark and Python - Hands On!