# Lesson 4 - Creating and Exploring DataFrames

---

**Technical Notes: PySpark DataFrame Fundamentals**

**Objective:** These notes delve into the PySpark DataFrame API, the primary interface for working with structured and semi-structured data in modern Spark applications. We will cover DataFrame creation from various sources, understanding and managing schemas, and performing fundamental data manipulation operations.

---

**1. Introduction to DataFrames**

*   **Theory:**
    A PySpark DataFrame is a **distributed, immutable collection of data organized into named columns**, conceptually equivalent to a table in a relational database or a data frame in R/pandas, but with powerful optimizations for distributed processing. It's built on top of RDDs but provides a higher-level abstraction with richer semantics.

    **Key Advantages over RDDs for Structured Data:**
    1.  **Schema:** DataFrames enforce a schema, a defined structure specifying column names and data types. This allows Spark to understand the data layout.
    2.  **Optimization (Catalyst Optimizer):** Spark leverages the schema and high-level operations (like `select`, `filter`, `groupBy`) to perform sophisticated query optimization through its Catalyst optimizer. Catalyst creates logical and physical execution plans, applying rules like predicate pushdown, column pruning, and join reordering to significantly improve performance.
    3.  **Tungsten Execution Engine:** Spark executes DataFrame operations using the Tungsten engine, which optimizes memory usage (off-heap storage) and CPU efficiency (whole-stage code generation).
    4.  **Unified API:** Provides seamless integration between Python, Scala, Java, R, and SQL for data manipulation.

    The primary entry point for DataFrame operations is the `SparkSession` object, typically named `spark`.

*   **Architecture Context:**
    While RDDs form the base, DataFrame operations are translated by Catalyst into optimized RDD operations. This abstraction layer allows developers to focus on *what* they want to compute, letting Spark figure out the most efficient *how*.

    ```
      +-----------------------+
      | DataFrame API (Python)| ----> User Code
      +-----------------------+
                | (Logical Plan)
      +-----------------------+
      |   Catalyst Optimizer  | ----> Optimization Rules, Cost Models
      +-----------------------+
                | (Optimized Logical & Physical Plans)
      +-----------------------+
      | Tungsten Execution    | ----> Code Generation, Off-Heap Memory
      +-----------------------+
                | (RDD Operations)
      +-----------------------+
      |      Spark Core (RDDs)| ----> Cluster Execution
      +-----------------------+
    ```

---

**2. Creating DataFrames from External Sources**

*   **Theory:**
    The most common way to create DataFrames is by reading data from external storage systems. PySpark's `DataFrameReader` interface, accessed via `spark.read`, provides methods to load data from various formats. The reader can often infer the schema, but providing an explicit schema is recommended for production robustness and performance.

*   **a) Reading from CSV Files**
    *   **Context:** Comma-Separated Values (CSV) files are ubiquitous for storing tabular data in plain text. However, they lack explicit type information and can have variations in delimiters, quoting, and header presence.
    *   **Code Example:**

        ```python
        from pyspark.sql import SparkSession
        import os

        # Initialize SparkSession
        spark = SparkSession.builder \
            .appName("DataFrameFromCSV") \
            .master("local[*]") \
            .getOrCreate()

        # Create a dummy CSV file for demonstration
        csv_file_path = "users.csv"
        with open(csv_file_path, "w") as f:
            f.write("id,name,age,country\n")
            f.write("1,Alice,30,USA\n")
            f.write("2,Bob,25,Canada\n")
            f.write("3,Charlie,35,USA\n")
            f.write("4,David,,UK\n") # Missing age
            f.write('5,"Eve, Jr.",40,Canada\n') # Name with comma

        # --- Reading CSV ---

        # Option 1: Basic read with header and schema inference
        print("--- Reading CSV with Header & Inference ---")
        df_infer = spark.read.csv(csv_file_path, header=True, inferSchema=True)
        df_infer.show()
        df_infer.printSchema()

        # Option 2: More robust read with specific options
        print("\n--- Reading CSV with Specific Options ---")
        df_options = spark.read.format("csv") \
            .option("header", "true") \
            .option("inferSchema", "false") \
            .option("sep", ",") \
            .option("nullValue", "") \
            .option("quote", "\"") \
            .option("escape", "\"") \
            .load(csv_file_path) # Use .load() when using .format() or multiple options

        # Note: Without inferSchema=True, all columns will be string type initially
        df_options.show()
        df_options.printSchema()

        # Clean up the dummy file
        os.remove(csv_file_path)

        spark.stop()
        ```
    *   **Explanation:**
        *   `spark.read.csv(path, ...)`: The primary method for reading CSV. Takes the path and various options as keyword arguments.
        *   `header=True`: Instructs Spark to use the first line as column names. Default is `False`.
        *   `inferSchema=True`: Tells Spark to scan the data and guess the data type for each column. This costs an extra pass over the input (the whole file by default) and can be inaccurate, especially with nulls or inconsistent data. Default is `False`.
        *   `spark.read.format("csv")`: An alternative way to specify the format, useful when chaining multiple `.option()` calls.
        *   `.option("sep", ",")`: Specifies the delimiter character (default is comma).
        *   `.option("nullValue", "")`: Defines the string representation of null values in the CSV (here, empty strings).
        *   `.option("quote", "\"")`: Specifies the character used for quoting fields that may contain the delimiter (default is `"`).
        *   `.option("escape", "\"")`: Specifies the character used to escape quotes within a quoted field. The default is a backslash (`\`); setting it to the quote character handles the common CSV convention of doubling quotes (`""`).
        *   `.load(path)`: Loads the data from the specified path after configuring options.
        *   `df.show()`: An action that displays the first 20 rows of the DataFrame in a tabular format.
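        *   `df.printSchema()`: Prints the DataFrame's schema (column names and inferred/defined types) as a tree. Unlike `show()`, it is not an action: the schema is already part of the DataFrame's plan, so calling it does not run a job. Notice how `inferSchema=False` results in all `string` types.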
    *   **Use Case:** Ingesting data exports from spreadsheets, databases, or simple log files where data is tabular and text-based. Be mindful of the parsing options required for reliable loading.

*   **b) Reading from JSON Files**
    *   **Context:** JavaScript Object Notation (JSON) is a common format for semi-structured data, often used in web APIs and configuration files. Spark typically expects **line-delimited JSON** (one valid JSON object per line) for efficient parallel reading. It can handle nested structures and arrays.
    *   **Code Example:**

        ```python
        from pyspark.sql import SparkSession
        import os

        spark = SparkSession.builder.appName("DataFrameFromJSON").master("local[*]").getOrCreate()

        # Create a dummy line-delimited JSON file
        json_file_path = "events.json"
        with open(json_file_path, "w") as f:
            f.write('{"timestamp": "2023-10-27T10:00:00Z", "user_id": 1, "event": "login", "details": {"ip": "192.168.1.1"}}\n')
            f.write('{"timestamp": "2023-10-27T10:05:00Z", "user_id": 2, "event": "view_page", "details": {"page": "/home"}}\n')
            f.write('{"timestamp": "2023-10-27T10:10:00Z", "user_id": 1, "event": "logout"}\n') # Missing details field

        # --- Reading JSON ---
        print("--- Reading Line-Delimited JSON ---")
        df_json = spark.read.json(json_file_path)
        # Schema is typically inferred for JSON
        df_json.show(truncate=False)
        df_json.printSchema()

        # Example for multi-line JSON (less common for large datasets)
        # multi_line_json_path = "single_object.json"
        # with open(multi_line_json_path, "w") as f:
        #     f.write('[{"id": 1}, {"id": 2}]') # A single JSON array across multiple lines
        # df_multi = spark.read.option("multiLine", True).json(multi_line_json_path)
        # os.remove(multi_line_json_path)

        os.remove(json_file_path)
        spark.stop()
        ```
    *   **Explanation:**
        *   `spark.read.json(path)`: Reads line-delimited JSON files. Schema inference is the default and generally works well due to JSON's self-describing nature.
        *   `truncate=False` (in `show()`): Prevents truncating long column values in the output display. Useful for inspecting nested structures.
        *   `printSchema()` output: Notice how Spark correctly infers nested structures (`details` as a `struct`) and handles missing fields (nullable).
        *   `option("multiLine", True)`: Required if the entire file represents a single JSON object or array spanning multiple lines. This is less scalable as the entire file might need to be read by a single node.
    *   **Use Case:** Processing data from web APIs, application logs formatted in JSON, data exports from NoSQL databases like MongoDB.

*   **c) Reading from Parquet Files**
    *   **Context:** Apache Parquet is a **columnar storage format** optimized for analytics workloads in the Hadoop ecosystem. It's highly efficient for Spark.
        *   **Columnar Storage:** Data for each column is stored contiguously, leading to better compression ratios and allowing queries to read only the required columns (column pruning).
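        *   **Self-Describing Schema:** Stores the schema within the data files, so readers need no inference. This also underpins schema evolution: files written with compatible but different schemas can be reconciled at read time (the `mergeSchema` option).
        *   **Predicate Pushdown:** Lets Spark push query filters down to the Parquet reader, which uses per-row-group min/max statistics to skip data that cannot match, minimizing data read from disk/network.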
    *   **Code Example:**

        ```python
        from pyspark.sql import SparkSession
        import os
        import shutil # To remove directory

        spark = SparkSession.builder.appName("DataFrameFromParquet").master("local[*]").getOrCreate()

        # --- First, create a Parquet file using a DataFrame ---
        data = [("Alice", 30), ("Bob", 25), ("Charlie", 35)]
        columns = ["name", "age"]
        df_source = spark.createDataFrame(data, columns)

        parquet_dir_path = "people.parquet"
        print(f"--- Writing DataFrame to Parquet: {parquet_dir_path} ---")
        # Parquet is typically written to a directory, not a single file
        df_source.write.mode("overwrite").parquet(parquet_dir_path)
        print("Write complete.")

        # --- Reading Parquet ---
        print(f"\n--- Reading Parquet from: {parquet_dir_path} ---")
        df_parquet = spark.read.parquet(parquet_dir_path)

        df_parquet.show()
        df_parquet.printSchema() # Schema is read directly from Parquet metadata

        # Clean up the dummy directory
        shutil.rmtree(parquet_dir_path)

        spark.stop()
        ```
    *   **Explanation:**
        *   `df_source = spark.createDataFrame(...)`: Creates a sample DataFrame to write out.
        *   `df_source.write.mode("overwrite").parquet(path)`: Writes the DataFrame to Parquet format.
            *   `write`: Accesses the `DataFrameWriter` interface.
            *   `mode("overwrite")`: Specifies the behavior if the path already exists. Other modes include `append`, `ignore`, `errorifexists` (default).
            *   `parquet(path)`: Specifies the format and output path (usually a directory).
        *   `spark.read.parquet(path)`: Reads data from a Parquet file or directory. Schema inference (`inferSchema`) is **not** needed as the schema is stored within the Parquet files' metadata. This makes reading faster and more reliable.
    *   **Use Case:** **Highly recommended format** for storing intermediate data in Spark pipelines, building data lakes, and achieving optimal performance for analytical queries. It's the de facto standard in many big data environments.

---

**3. Schema: Inference vs. Manual Definition**

*   **Theory:**
    The schema defines the structure of your DataFrame. Getting it right is crucial for data integrity and performance.
    *   **Schema Inference:** Convenient for initial exploration (`inferSchema=True`). Spark scans the data (for CSV, the whole input by default; the `samplingRatio` option reduces this to a fraction) to guess column types.
        *   **Pros:** Easy, less code initially.
        *   **Cons:**
            *   Can be slow (requires an extra pass over the data before the actual read).
            *   May infer incorrect types (e.g., inferring `int` when `long` is needed, misinterpreting dates, reading identifiers with leading zeros as integers, or falling back to `string` for columns that are ambiguous or entirely null).
            *   Potential for runtime errors or silent nulls if later data (or rows outside a sampled fraction) doesn't match the inferred schema.
            *   **Generally not recommended for production ETL pipelines.**
    *   **Manual Schema Definition:** Explicitly defining the schema using PySpark's data types.
        *   **Pros:**
            *   **Reliability:** Ensures data conforms to expected types. Errors occur predictably if data mismatches.
            *   **Performance:** Avoids the schema inference pass, speeding up reads.
            *   **Clarity:** Documents the expected data structure.
        *   **Cons:** Requires more upfront code.

*   **Code Example: Defining and Applying a Manual Schema**

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, TimestampType
    import os

    spark = SparkSession.builder.appName("ManualSchema").master("local[*]").getOrCreate()

    # Create a dummy CSV file (potentially with ambiguous data)
    csv_file_path = "products.csv"
    with open(csv_file_path, "w") as f:
        f.write("product_id,name,price,stock_count,last_updated\n")
        f.write("P100,Widget A,99.99,50,2023-10-27 10:00:00\n")
        f.write("P200,Gadget B,145.00,,2023-10-26 15:30:00\n") # Empty stock_count
        f.write("P300,Thingamajig,,10,2023-10-27 11:00:00\n") # Empty price

    # Define the desired schema explicitly
    # StructType: Represents the overall structure (list of fields)
    # StructField: Defines a single column (name, dataType, nullable)
    product_schema = StructType([
        StructField("product_id", StringType(), True),  # Column name, data type, nullable flag
        StructField("name", StringType(), True),
        StructField("price", DoubleType(), True),        # Double for simplicity; DecimalType avoids rounding errors for currency
        StructField("stock_count", IntegerType(), True), # Use Integer for counts
        StructField("last_updated", TimestampType(), True) # Use Timestamp for date/time
    ])

    print("--- Reading CSV with Manual Schema ---")
    df_manual = spark.read.format("csv") \
        .option("header", "true") \
        .option("nullValue", "") \
        .option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \
        .schema(product_schema) \
        .load(csv_file_path)

    df_manual.show()
    df_manual.printSchema() # Schema matches our definition

    os.remove(csv_file_path)
    spark.stop()
    ```
*   **Explanation:**
    *   `from pyspark.sql.types import ...`: Imports necessary classes for schema definition.
    *   `StructType([...])`: Creates the schema object, taking a list of `StructField` objects.
    *   `StructField("col_name", DataType(), nullable)`: Defines each column:
        *   `col_name`: The desired column name (string).
        *   `DataType()`: The PySpark SQL data type (e.g., `StringType()`, `IntegerType()`, `DoubleType()`, `BooleanType()`, `DateType()`, `TimestampType()`, `ArrayType()`, `MapType()`, `StructType()` for nested structures).
        *   `nullable`: A boolean indicating if the column can contain `null` values (usually `True` unless you have strong guarantees).
    *   `.option("timestampFormat", "...")`: Crucial when reading strings into `TimestampType`, to specify the exact format for parsing (`DateType` columns use the separate `dateFormat` option). Spark 3.x uses its own datetime patterns, modeled on Java's `DateTimeFormatter`; Spark 2.x used `SimpleDateFormat` patterns.
    *   `.schema(product_schema)`: Applies the defined schema during the read operation. Spark will now enforce these types. If data cannot be parsed according to the type (e.g., "abc" into an `IntegerType`), the default `PERMISSIVE` parse mode sets that field to `null`; the `mode` option can instead drop such rows (`DROPMALFORMED`) or raise an error (`FAILFAST`).
*   **Use Case:** Essential for production data pipelines, reading data with well-defined formats, ensuring type consistency, and improving read performance.

---

**4. Basic DataFrame Operations**

*   **Theory:**
    Once a DataFrame is created, you can perform various operations to manipulate and analyze the data. These operations are generally **transformations**, meaning they are lazily evaluated and return *new* DataFrames, preserving immutability. Common functions are available directly as DataFrame methods or within `pyspark.sql.functions`.

*   **a) Selecting Columns (`select`)**
    *   **Context:** Used to choose a subset of columns, rename columns, or derive new columns based on existing ones. Analogous to the `SELECT` clause in SQL.
    *   **Code Example:**

        ```python
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, upper # Import column functions

        spark = SparkSession.builder.appName("SelectOperation").master("local[*]").getOrCreate()

        data = [("Alice", 30, "USA"), ("Bob", 25, "Canada"), ("Charlie", 35, "USA")]
        df = spark.createDataFrame(data, ["name", "age", "country"])
        print("--- Original DataFrame ---")
        df.show()

        # Select specific columns by name
        print("\n--- Selecting 'name' and 'country' ---")
        df.select("name", "country").show()

        # Select using col() function (recommended for clarity and complex expressions)
        print("\n--- Selecting using col() ---")
        df.select(col("name"), col("age")).show()

        # Select with expressions and aliasing
        print("\n--- Selecting with expression and alias ---")
        df.select(
            col("name"),
            (col("age") + 5).alias("age_in_5_years"), # Calculate and rename
            upper(col("country")).alias("country_upper") # Apply function and rename
        ).show()

        # Select all columns (*)
        # df.select("*") # Less common in programmatic API but works

        spark.stop()
        ```
    *   **Explanation:**
        *   `df.select("col1", "col2", ...)`: Selects columns by passing their string names.
        *   `from pyspark.sql.functions import col`: Imports the `col` function, which returns a `Column` object based on the name. Using `col()` is often clearer and required for applying methods or operators directly to columns.
        *   `df.select(col("name"), ...)`: Selects using `Column` objects.
        *   `(col("age") + 5)`: Creates a new `Column` expression by performing arithmetic on an existing column.
        *   `.alias("new_name")`: Renames the resulting column expression. Essential for derived columns.
        *   `upper(col("country"))`: Applies a built-in Spark SQL function (`upper` from `pyspark.sql.functions`) to a column. Many functions are available (string manipulation, date/time functions, math functions, etc.).
    *   **Use Case:** Reducing data width for downstream processing, renaming columns for clarity, projecting data for reports, performing simple transformations within the selection.

*   **b) Filtering Rows (`filter` / `where`)**
    *   **Context:** Used to select a subset of rows based on specified conditions. Analogous to the `WHERE` clause in SQL. `filter()` and `where()` are aliases for the same operation.
    *   **Code Example:**

        ```python
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col

        spark = SparkSession.builder.appName("FilterOperation").master("local[*]").getOrCreate()

        data = [("Alice", 30, "USA"), ("Bob", 25, "Canada"), ("Charlie", 35, "USA"), ("David", 20, "UK")]
        df = spark.createDataFrame(data, ["name", "age", "country"])
        print("--- Original DataFrame ---")
        df.show()

        # Filter using a SQL-like string expression
        print("\n--- Filtering age > 28 (string expression) ---")
        df.filter("age > 28").show()

        # Filter using column expressions (recommended for programmatic conditions)
        print("\n--- Filtering country == 'USA' (column expression) ---")
        df.filter(col("country") == "USA").show()

        # Combine multiple conditions (AND)
        print("\n--- Filtering age > 25 AND country == 'USA' ---")
        df.filter((col("age") > 25) & (col("country") == "USA")).show()
        # Note: Parentheses are required because & and | bind more tightly than comparisons in Python

        # Combine multiple conditions (OR)
        print("\n--- Filtering country == 'Canada' OR country == 'UK' ---")
        df.filter((col("country") == "Canada") | (col("country") == "UK")).show()

        # Using 'where' (identical to filter)
        print("\n--- Filtering using 'where' (age < 30) ---")
        df.where(col("age") < 30).show()

        spark.stop()
        ```
    *   **Explanation:**
        *   `df.filter("sql_condition_string")`: Filters rows based on a SQL WHERE clause string. Simple for basic conditions but less flexible programmatically.
        *   `df.filter(Column_Condition)`: Filters based on a boolean `Column` expression.
        *   `col("country") == "USA"`: Creates a boolean `Column` evaluating to true where the country is "USA". Standard comparison operators (`==`, `!=`, `>`, `<`, `>=`, `<=`) work on columns.
        *   `&`: Logical AND operator for combining `Column` conditions.
        *   `|`: Logical OR operator.
        *   `~`: Logical NOT operator (e.g., `~ (col("country") == "USA")`).
        *   Parentheses `()`: Required around each comparison, because Python's `&` and `|` bind more tightly than `==`, `>`, and `<`. Also use them to group conditions correctly when mixing `&` and `|`.
        *   `df.where(...)`: An alias for `df.filter(...)`. Use whichever you prefer for readability.
    *   **Use Case:** Selecting specific subsets of data based on criteria, data cleaning (removing invalid rows), isolating data for targeted analysis.

*   **c) Adding or Replacing Columns (`withColumn`)**
    *   **Context:** Used to add a new column to a DataFrame or replace an existing column with the same name. The new column's values are defined by a `Column` expression.
    *   **Code Example:**

        ```python
        from pyspark.sql import SparkSession
        from pyspark.sql.functions import col, lit, concat_ws, lower, when

        spark = SparkSession.builder.appName("WithColumnOperation").master("local[*]").getOrCreate()

        data = [("Alice", 30, "USA"), ("Bob", 25, "Canada"), ("Charlie", 35, "USA")]
        df = spark.createDataFrame(data, ["name", "age", "country"])
        print("--- Original DataFrame ---")
        df.show()

        # Add a new column with a literal value
        print("\n--- Adding 'source' column with literal value ---")
        df_with_source = df.withColumn("source", lit("internal_db"))
        # lit() creates a Column from a literal value
        df_with_source.show()

        # Add a new column derived from existing columns
        print("\n--- Adding 'description' column ---")
        df_with_desc = df_with_source.withColumn(
            "description",
            concat_ws(" - ", col("name"), col("age").cast("string"), col("country"))
            # concat_ws concatenates columns with a separator
            # cast("string") converts age to string for concatenation
        )
        df_with_desc.show(truncate=False)

        # Replace an existing column (e.g., standardize country code)
        print("\n--- Replacing 'country' column with lowercase version ---")
        df_lower_country = df_with_desc.withColumn("country", lower(col("country")))
        df_lower_country.show()

        # Add a column based on conditional logic
        print("\n--- Adding 'age_group' column using 'when' ---")
        df_age_group = df_lower_country.withColumn("age_group",
             when(col("age") < 30, "Young")
            .when((col("age") >= 30) & (col("age") < 40), "Middle-aged")
            .otherwise("Senior") # Default case
        )
        df_age_group.show()

        spark.stop()
        ```
    *   **Explanation:**
        *   `df.withColumn("new_col_name", Column_Expression)`: Returns a *new* DataFrame with the added/replaced column.
        *   `lit(value)`: Creates a `Column` object representing a literal (constant) value. Necessary when adding a constant value, as `withColumn` expects a `Column` object as the second argument.
        *   `col("age").cast("string")`: Changes the data type of the `age` column to `string` for this operation. Casting is common when combining different types.
        *   `concat_ws(separator, col1, col2, ...)`: A function from `pyspark.sql.functions` that concatenates multiple columns into a single string column using a specified separator.
        *   `lower(col("country"))`: Applies the `lower` function to the `country` column. Since the column name "country" already exists, this *replaces* the original column.
        *   `when(condition1, value1).when(condition2, value2)...otherwise(default_value)`: Implements conditional logic (like SQL `CASE WHEN`). It evaluates conditions sequentially and returns the corresponding value for the first true condition, or the `otherwise` value if none match.
    *   **Use Case:** Feature engineering (creating new predictors for machine learning), data enrichment (adding information from other sources - often via joins first), data standardization/cleaning (e.g., converting cases, formatting dates), deriving calculated fields.

---

**Summary:**

DataFrames are the cornerstone of modern PySpark development for structured data. We learned how to create them from common formats like CSV, JSON, and the highly recommended Parquet, emphasizing the importance of schema definition (manual over inference for production). We also explored fundamental transformations: `select` for column projection/manipulation, `filter`/`where` for row selection based on conditions, and `withColumn` for adding or modifying columns. These operations, combined with Spark's lazy evaluation and Catalyst optimizer, provide a powerful and efficient way to work with large datasets.