"In Apache Spark, **_RDDs (Resilient Distributed Datasets)_** are the fundamental, low-level data structure representing collections of elements. DataFrames, on the other hand, are a higher-level abstraction built on top of RDDs, providing a structured view of data with named columns and a schema. While RDDs offer flexibility, DataFrames provide more optimization opportunities through Spark's Catalyst optimizer, which can significantly improve query performance. DataFrames are now the primary API for most Spark users due to their ease of use and optimization capabilities."

<img src="https://data-science-at-scale.s3.us-east-1.amazonaws.com/images/rdd_dataframe_lineage.png" width="640">

**Detailed Explanation:**

1.  **Foundation (RDDs):**
    *   Think of RDDs as the "assembly language" of Spark. They are the underlying building blocks of all Spark computations.
    *   They represent a collection of elements that are partitioned across a cluster of machines, allowing for parallel processing.
    *   RDDs are inherently flexible: they can hold any type of data, such as raw text or arbitrary objects.
    *   They expose low-level transformations and actions, giving developers fine-grained control over processing.
    *   However, this low-level control also means developers need to be more aware of optimization details.

2.  **Structured View (DataFrames):**
    *   DataFrames are like a "spreadsheet" or a "table" with columns and rows, where each column has a name and a specific data type (a schema).
    *   They are built on top of RDDs: a DataFrame query is ultimately executed as operations on RDDs, and the `.rdd` property exposes its contents as an RDD of `Row` objects.
    *   They bring structure to data, which allows Spark to reason about data and make optimization decisions.
    *   DataFrames utilize Spark's Catalyst optimizer, which analyzes execution plans and applies techniques like predicate pushdown, column pruning, and optimized join strategies, leading to increased performance.
    *   DataFrames are typically easier to use for most common data processing operations, as you can use SQL-like syntax for querying and manipulating data.
    *   DataFrames are the preferred API for new Spark applications, given their ease of use and improved performance over RDDs.

3.  **Key Differences (Table):**

| Feature         | RDD                                  | DataFrame                            |
|-----------------|--------------------------------------|----------------------------------------|
| **Level**       | Low-level                             | High-level                              |
| **Structure**   | Unstructured or semi-structured      | Structured with named columns & schema|
| **Optimization**| Requires manual optimization          | Optimized by Catalyst optimizer      |
| **Ease of Use**  | More flexible, less intuitive        | Easier, more intuitive                |
| **Performance**| Potentially slower, requires optimization | Generally faster due to optimization |
| **Data Types**   | Can handle any data type           | Works with structured and typed data    |

4.  **Analogy:**

*   Imagine RDDs as a box of random, unsorted objects. You can access each object individually, but you have to manually organize them to extract meaningful information.
*   DataFrames are like a neatly organized filing cabinet, where each folder is a column and documents within are the rows. You can quickly search and filter based on the folders (column names) and extract information effectively.

5.  **Interoperability:**

*   You can convert an RDD to a DataFrame with the `createDataFrame` method, and a DataFrame back to an RDD through its `.rdd` property. To use RDD methods on a DataFrame, convert it to an RDD first.
*   This ability to convert between RDDs and DataFrames allows for a gradual transition or mix-and-match based on specific use cases.
*   When custom operations that are not readily available on DataFrames are required, a developer may choose to manipulate RDDs directly.

**When to use RDDs?**

*   When you need very fine-grained control over data processing.
*   When working with unstructured data or custom data types that DataFrames can’t easily handle.
*   When you need highly custom manipulations that are not readily available in the DataFrame API.

**Key Takeaway:**

**_DataFrames build on top of RDDs, offering a structured and optimized way to process data with less coding needed for optimal performance, making them the preferred API for most use cases. RDDs are best used for specialized use cases that require more fine-grained control._**


## RDD Example

In [0]:
# 1. Read text file into an RDD
lines_rdd = spark.read.text("dbfs:/FileStore/sherlock.txt").rdd.map(lambda row: row[0])

# 2. Split into words and flatten, filter and make key-value pairs
word_counts_rdd = (
    lines_rdd.flatMap(lambda line: line.split(" "))
        .filter(lambda word: len(word) > 0)
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
)

# 3. Sort and take top 10
top_10_rdd = word_counts_rdd.sortBy(lambda x: x[1], ascending=False).take(10)

# 4. Print the top 10 words
print("Top 10 words using RDD:")
for word, count in top_10_rdd:
    print(f"{word}: {count}")

## DataFrame Example

In [0]:

from pyspark.sql.functions import split, explode, col, length

# 1. Read text file into a DataFrame
lines_df = spark.read.text("dbfs:/FileStore/sherlock.txt")

# 2. Split into words and explode and filter
words_df = lines_df.withColumn("word", explode(split(col("value"), " "))).filter(length(col("word")) >= 1)

# 3. Group, count, sort and limit
word_counts_df = (
    words_df.groupBy("word")
        .count()
        .orderBy(col("count"), ascending=False)
        .limit(10)
)

# 4. Show top 10 words
print("Top 10 words using DataFrame:")
word_counts_df.show()



**Observations:**

*   **Level of Abstraction:**
    *   **RDD:** The RDD code is more verbose and requires you to explicitly define every step of the computation. You manipulate individual elements (lines and words) directly through `map`, `flatMap`, etc. This is a low-level style that offers more flexibility, but also requires more effort.
    *   **DataFrame:** The DataFrame code is more concise and uses higher-level operations like `withColumn`, `explode`, `groupBy`, and `orderBy`. These are declarative operations that express what you want to do, not how, abstracting away the low-level details.
*   **Data Structure:**
    *   **RDD:** The RDD code uses raw text strings and tuples to represent the data, requiring developers to manually manage data structure and perform aggregations. The RDD is a collection of generic objects.
    *   **DataFrame:** The DataFrame code uses structured DataFrames with named columns (`value`, `word`, `count`). Because Spark knows each column's name and type, it can analyze the query as a whole instead of treating each step as an opaque function.
*   **Optimization:**
    *   **RDD:** The developer is responsible for optimization. `reduceByKey` performs well here because it combines values within each partition before shuffling, but more complex aggregations or data manipulation require more thought about how to optimize.
    *   **DataFrame:** The DataFrame code automatically goes through Spark's Catalyst optimizer. Although the code spells out operations such as `groupBy` and `orderBy`, these describe the result rather than the execution steps, so the optimizer is free to choose an efficient execution plan. For example, if the DataFrame had more columns than the query uses, Catalyst could prune the unused ones so that they are not carried through the rest of the plan.
*   **Ease of Use:**
    *   **RDD:** The RDD code might be less intuitive for some users, especially those not familiar with functional programming style.
    *   **DataFrame:** The DataFrame code is more readable and easier to understand due to its use of high level functions and schema.
*   **Interoperability:**
    *   Both approaches read the text file through the `SparkSession` (`spark`). The first extracts an RDD from the DataFrame returned by the read, while the second operates directly on that DataFrame.
    *   You can always extract an RDD from a DataFrame, as in the first example, and then work on it with RDD operations.



**Example 2: Conversion Between RDD and DataFrame**

This example shows how to convert an RDD to a DataFrame and vice versa, highlighting their interoperability.

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# 1. Create an RDD of key value pairs (as a simple example)
data_rdd = spark.sparkContext.parallelize([("apple", 5), ("banana", 10), ("apple", 2)])

# 2. Convert the RDD to a DataFrame with column names
schema = StructType([
    StructField("fruit", StringType(), True),
    StructField("count", IntegerType(), True)
])

df_from_rdd = spark.createDataFrame(data_rdd, schema)

# 3. Print the dataFrame
print("DataFrame from RDD:")
df_from_rdd.show()

# 4. Convert a DataFrame to RDD (for example purposes only)
rdd_from_df = df_from_rdd.rdd.map(lambda row: (row[0], row[1]))

# 5. Print elements from RDD extracted from DataFrame
print("RDD from DataFrame:")
for element in rdd_from_df.collect():
    print(element)




**Observations:**

*   **Flexibility of RDDs:** The `data_rdd` here shows that an RDD can be built directly from an in-memory Python collection with `parallelize`. RDD elements can be arbitrary Python objects, including custom data types and data structures.
*   **`createDataFrame` and Schema**: Here we pass an explicit schema that defines the column names and their data types. Every DataFrame has a schema; if you supply only column names, Spark infers the types from the data, but an explicit schema makes the types unambiguous.
*   **`rdd` property:**  The `.rdd` property of a DataFrame returns its contents as an RDD of `Row` objects, which you can manipulate with the RDD API when needed.
*   **Interoperability**: This shows that you can easily switch between RDDs and DataFrames, based on the needs of your workflow.

**In Conclusion:**

These examples highlight the main differences and relationships between RDDs and DataFrames in Apache Spark. The RDD API offers a lower-level approach to data manipulation, while the DataFrame API provides a higher-level abstraction that is easier to use and benefits from automatic optimization. The choice between RDDs and DataFrames depends on your specific use case, requirements, and expertise. For most cases, however, DataFrames are the preferred option because of their performance and conciseness.
