# Lesson 5 - Data Cleaning and Manipulation

---

## PySpark Technical Notes: Data Cleaning and Manipulation

**Introduction**

Apache Spark, and its Python API PySpark, provides a powerful, distributed framework for processing large datasets. A critical phase in any data processing pipeline is cleaning and manipulating the raw data to make it suitable for analysis, machine learning, or downstream applications. This involves handling inconsistencies, missing values, incorrect data types, duplicates, and transforming data into desired formats. PySpark's DataFrame API offers a rich set of functions specifically designed for these tasks, leveraging Spark's distributed execution engine for performance at scale.

These notes delve into common data cleaning and manipulation techniques using the PySpark DataFrame API, focusing on practical implementation, underlying concepts, and performance considerations.

---

### 1. Handling Missing Data (Null Values)

**Theory**

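Missing data is a common issue in real-world datasets. Spark represents an absent value of any type as `null`; `NaN` (Not a Number) is a distinct, valid floating-point value that appears only in float/double columns and usually also signals missing or invalid data. Missing values can arise from data entry errors, sensor failures, data integration problems, or optional fields. Ignoring them can lead to biased analysis, incorrect model predictions, or runtime errors. PySpark provides the `DataFrameNaFunctions` API (accessed via `df.na`) to handle them; its `drop` and `fill` methods treat `NaN` in float/double columns like `null`.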

Common strategies include:

1.  **Dropping:** Removing rows or columns containing null values. This is simple but can lead to significant data loss if nulls are widespread.
2.  **Filling/Imputation:** Replacing null values with a specific constant (like 0, "Unknown"), or a calculated value (like the mean, median, or mode of the column). Imputation preserves data size but can introduce bias if not done carefully.

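The DataFrame operations used in this section (such as `na.drop`, `fillna`, `select`, and `withColumn`) are transformations: they are lazily evaluated and return a new DataFrame without touching the data. Actions like `show()`, `count()`, or a write (for example `df.write.parquet(...)`) trigger the actual computation across the cluster.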

**Code Examples**

Let's start by creating a sample DataFrame with missing values:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, isnan, when, count, mean, lit
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Initialize Spark Session
spark = SparkSession.builder.appName("DataCleaning").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("category", StringType(), True),
    StructField("sales", DoubleType(), True),
    StructField("revenue", DoubleType(), True)
])

# Create data with nulls and NaN
data = [
    (1, "Product A", "Category 1", 100.0, 500.0),
    (2, "Product B", None, 150.0, None),
    (3, None, "Category 1", 200.0, 600.0),
    (4, "Product D", "Category 2", None, 750.0),
    (5, "Product E", "Category 2", 50.0, 250.0),
    (6, "Product F", "Category 1", 100.0, float('nan')) # NaN example
]

df = spark.createDataFrame(data, schema=schema)
print("Original DataFrame:")
df.show()
```

**Line-by-Line Explanation:**

1.  `from pyspark.sql import SparkSession`: Imports the entry point for Spark SQL functionality.
2.  `from pyspark.sql.functions import ...`: Imports necessary SQL functions for data manipulation.
3.  `from pyspark.sql.types import ...`: Imports data types for defining the DataFrame schema.
4.  `spark = SparkSession.builder...getOrCreate()`: Creates or gets an existing SparkSession.
5.  `schema = StructType([...])`: Defines the structure and data types of the DataFrame columns. An explicit schema spares Spark a schema-inference pass and fixes the column types up front, which matters most when reading from sources without inherent schemas (such as CSV or JSON).
6.  `data = [...]`: Defines the sample data, including `None` (representing SQL NULL) and `float('nan')`.
7.  `df = spark.createDataFrame(data, schema=schema)`: Creates the PySpark DataFrame from the sample data and schema.
8.  `df.show()`: An action that displays the first 20 rows of the DataFrame.

**a) Checking for Nulls**

Before handling nulls, it's often useful to count them.

```python
# Count nulls (plus NaN for float/double columns) per column
# isnan() is only defined for float/double columns, so apply it selectively
float_cols = {name for name, dtype in df.dtypes if dtype in ("float", "double")}

def is_missing(c):
    """Missing = NULL, plus NaN when the column is float/double."""
    return (col(c).isNull() | isnan(col(c))) if c in float_cols else col(c).isNull()

print("Null counts per column:")
df.select([count(when(is_missing(c), c)).alias(c) for c in df.columns]).show()

# Explanation:
# 1. `df.dtypes`: List of (column name, type string) pairs; used here to find the float/double columns.
# 2. `is_missing(c)`: Builds the condition for column `c`: `isNull()` for every column, OR-ed (`|`) with `isnan()` only for float/double columns. Applying `isnan` to other types relies on an implicit cast and can fail.
# 3. `df.select([... for c in df.columns])`: One aggregate expression per column, built with a list comprehension.
# 4. `when(condition, c)`: SQL CASE WHEN equivalent. If the condition is true it returns the literal string `c` (any non-null value such as `lit(1)` works); otherwise it returns NULL.
# 5. `count(...)`: Counts the non-null values returned by `when`, i.e. the rows where the condition was true.
# 6. `.alias(c)`: Names each count after its source column.
# 7. `.show()`: Action that displays the counts.
```

**b) Dropping Rows with Nulls (`dropna`)**

```python
# Drop rows where *any* column has a null/NaN value
print("Dropping rows with any null/NaN:")
df_dropped_any = df.na.drop(how='any')
df_dropped_any.show()
# Explanation:
# 1. `df.na`: Accesses the DataFrameNaFunctions API. `df.dropna(...)` is an alias for `df.na.drop(...)`.
# 2. `.drop()`: Function to drop rows with null values.
# 3. `how='any'`: (Default) Drops a row if it contains at least one null or NaN value in any column.

# Drop rows where *all* columns have null/NaN values (less common)
print("Dropping rows where all columns are null/NaN (no change here):")
df_dropped_all = df.na.drop(how='all')
df_dropped_all.show()
# Explanation:
# 1. `how='all'`: Drops a row only if *all* its values are null or NaN.

# Drop rows based on nulls in specific columns
print("Dropping rows if 'name' or 'sales' is null/NaN:")
df_dropped_subset = df.na.drop(subset=['name', 'sales'])
df_dropped_subset.show()
# Explanation:
# 1. `subset=['name', 'sales']`: Specifies that the drop condition should only consider nulls/NaNs in the 'name' and 'sales' columns.

# Drop rows if they have fewer than 'thresh' non-null values
print("Dropping rows with fewer than 4 non-null/NaN values:")
df_dropped_thresh = df.na.drop(thresh=4) # Keep rows with at least 4 non-null values
df_dropped_thresh.show()
# Explanation:
# 1. `thresh=4`: Specifies that a row must have at least 4 non-null/non-NaN values to be kept. Rows with fewer non-nulls are dropped.
```

**Practical Use Cases & Performance:**

*   Use `dropna(how='any')` cautiously, as it can remove substantial data.
*   `dropna(subset=[...])` is often preferred when nulls in specific key columns make a record unusable.
*   Dropping rows is generally computationally cheaper than imputation but reduces dataset size. Consider the trade-off based on the percentage of missing data and its importance.

**c) Filling Null/NaN Values (`fillna`)**

```python
# Fill all null/NaN values with a specific value (e.g., 0 for numeric, 'Unknown' for string)
# Note: fillna only replaces nulls with values of the same type or compatible types.
print("Filling nulls with 0 for numeric and 'Unknown' for string:")
df_filled_generic = df.fillna({'sales': 0.0, 'revenue': 0.0, 'name': 'Unknown', 'category': 'Unknown'})
# Alternatively: df.fillna(0, subset=['sales', 'revenue']).fillna('Unknown', subset=['name', 'category'])
df_filled_generic.show()
# Explanation:
# 1. `df.fillna({...})`: Fills null/NaN values; it is an alias for `df.na.fill({...})`.
# 2. `{'col1': val1, 'col2': val2}`: A dictionary specifying the replacement value for each column. The value type must match or be compatible with the column type. PySpark applies the fill selectively based on column type if a single value is provided, but using a dictionary is more explicit and robust.

# Fill nulls with the mean of the column (requires calculating the mean first)
# NaN is an ordinary double value to Spark, so a single NaN would turn the mean into NaN.
# Mask NaN with when(); mean() already ignores nulls.
means = df.agg(
    mean(when(~isnan(col("sales")), col("sales"))).alias("mean_sales"),
    mean(when(~isnan(col("revenue")), col("revenue"))).alias("mean_revenue")
).first() # .first() collects the result to the driver

mean_sales_val = means["mean_sales"]
mean_revenue_val = means["mean_revenue"]

print(f"Calculated Means - Sales: {mean_sales_val}, Revenue: {mean_revenue_val}")

# Fill nulls using calculated means
print("Filling nulls with column means:")
df_filled_mean = df.fillna({
    "sales": mean_sales_val,
    "revenue": mean_revenue_val
})
df_filled_mean.show()
# Explanation:
# 1. `df.agg(...)`: Performs aggregation on the DataFrame.
# 2. `mean(when(~isnan(col("sales")), col("sales"))).alias("mean_sales")`: Calculates the mean of the 'sales' column over its non-null, non-NaN values and gives it an alias. `when` without `otherwise` returns NULL for NaN rows, and `mean` skips NULLs.
# 3. `.first()`: An action that collects the single row result of the aggregation (the means) back to the driver program as a Row object. **Caution:** Use `first()` or `collect()` only when the aggregated result is small.
# 4. `means["mean_sales"]`: Accesses the calculated mean value from the Row object.
# 5. `df.fillna({...})`: Fills nulls and NaN in 'sales' and 'revenue' with their respective calculated means.
```

**Practical Use Cases & Performance:**

*   Filling with constants is simple and fast but might not be statistically sound.
*   Mean/median imputation is common but can distort variance and correlations. Mode imputation is suitable for categorical columns.
*   Calculating aggregates (like mean) requires a full data scan (a Spark job). If imputing multiple columns, calculate all aggregates in a single `agg` call for efficiency.
*   Imputation preserves data size but can be computationally more expensive than dropping.

---

### 2. Casting and Renaming Columns

**Theory**

Data often arrives with incorrect data types (e.g., numbers as strings, dates as strings) or column names that are unclear, inconsistent, or contain characters unsuitable for certain operations or downstream systems.

*   **Casting:** Changing the data type of a column is essential for performing correct calculations (e.g., arithmetic on numbers, date functions on dates) and ensuring schema compatibility. PySpark uses the `cast()` method, typically within `withColumn`.
*   **Renaming:** Changing column names improves readability and consistency. PySpark offers `withColumnRenamed()` for single renames or `selectExpr()` / `alias()` for more complex selections and renames.

Each `withColumn` or `withColumnRenamed` call adds a `Project` node to Spark's logical plan. Catalyst usually collapses adjacent projections, so short chains are efficient. Long chains (for example, a loop over hundreds of columns) make the plan large and slow to analyze; a single `select` or `selectExpr` expresses the same changes in one projection and is often more concise.

**Code Examples**

Let's use the `df_filled_generic` DataFrame from the previous section.

```python
df_to_modify = df_filled_generic
df_to_modify.printSchema() # Check current schema

# a) Casting Columns
print("Casting 'sales' to IntegerType:")
df_casted = df_to_modify.withColumn("sales_int", col("sales").cast(IntegerType()))
df_casted.printSchema()
df_casted.show()
# Explanation:
# 1. `df_to_modify.withColumn("sales_int", ...)`: Creates a new column named 'sales_int' or replaces it if it exists.
# 2. `col("sales").cast(IntegerType())`: Selects the 'sales' column and casts its values to IntegerType. Note that casting float/double to integer truncates the decimal part.

# Cast using selectExpr for conciseness
print("Casting 'revenue' to Integer and 'id' to String using selectExpr:")
df_casted_selectexpr = df_to_modify.selectExpr(
    "id",
    "name",
    "CAST(revenue AS INT) as revenue_int", # Cast revenue to INT
    "CAST(id AS STRING) as id_string"      # Cast id to STRING
)
df_casted_selectexpr.printSchema()
df_casted_selectexpr.show()
# Explanation:
# 1. `df_to_modify.selectExpr(...)`: Selects columns using SQL-like expressions.
# 2. `"CAST(revenue AS INT) as revenue_int"`: SQL expression to cast 'revenue' to integer and alias it to 'revenue_int'.
# 3. `"CAST(id AS STRING) as id_string"`: SQL expression to cast 'id' to string and alias it to 'id_string'. This method allows multiple casts and renames within a single projection.


# b) Renaming Columns
print("Renaming 'name' to 'product_name' and 'category' to 'product_category':")
df_renamed = df_to_modify.withColumnRenamed("name", "product_name") \
                         .withColumnRenamed("category", "product_category")
df_renamed.printSchema()
df_renamed.show()
# Explanation:
# 1. `.withColumnRenamed("old_name", "new_name")`: Returns a new DataFrame with the specified column renamed. Can be chained for multiple renames.

# Renaming using select with alias
print("Renaming using select and alias:")
df_renamed_select = df_to_modify.select(
    col("id"),
    col("name").alias("product_name"), # Rename 'name'
    col("category").alias("product_category"), # Rename 'category'
    col("sales"),
    col("revenue")
)
df_renamed_select.printSchema()
df_renamed_select.show()
# Explanation:
# 1. `df_to_modify.select(...)`: Selects specific columns.
# 2. `col("name").alias("product_name")`: Selects the 'name' column and gives it a new alias 'product_name' in the resulting DataFrame. This requires listing all columns you want to keep.

# Renaming using selectExpr
print("Renaming using selectExpr:")
df_renamed_selectexpr = df_to_modify.selectExpr(
    "id",
    "name as product_name", # Rename 'name' using SQL alias syntax
    "category as product_category", # Rename 'category'
    "sales",
    "revenue"
)
df_renamed_selectexpr.printSchema()
df_renamed_selectexpr.show()
# Explanation:
# 1. `df_to_modify.selectExpr(...)`: Selects columns using SQL expressions.
# 2. `"name as product_name"`: Uses SQL 'AS' keyword within the expression string to rename the column.
```

**Practical Use Cases & Performance:**

*   Casting is crucial when reading data from loosely typed sources like CSV or JSON, or preparing data for mathematical operations or specific function requirements.
*   Renaming improves code readability and standardizes column names across different data sources.
*   For many simultaneous renames or casts, `select` or `selectExpr` can be more efficient and readable than chaining multiple `withColumn` / `withColumnRenamed` calls. Spark's Catalyst optimizer often fuses consecutive projections, but expressing the intent clearly in one operation is good practice.
*   Be mindful of data loss or errors during casting. For example, casting a non-numeric string to a number yields `null` when ANSI mode (`spark.sql.ansi.enabled`) is off and raises an error when it is on.

---

### 3. Working with Dates and Timestamps

**Theory**

Handling date and time information is fundamental in many data processing tasks, such as time series analysis, event logging, and reporting. PySpark provides `DateType` (representing calendar dates) and `TimestampType` (representing date and time with microsecond precision). Often, dates and times are initially loaded as strings and need to be parsed into the correct types. PySpark offers a suite of functions for parsing, formatting, extracting components, and performing date/time arithmetic.

Key Functions:

*   `to_date(col, format)`: Parses a string column into `DateType`.
*   `to_timestamp(col, format)`: Parses a string column into `TimestampType`.
*   `date_format(col, format)`: Formats a `DateType` or `TimestampType` column into a string.
*   `year()`, `month()`, `dayofmonth()`, `hour()`, `minute()`, `second()`: Extract components.
*   `current_date()`, `current_timestamp()`: Get the current date/timestamp.
*   `datediff(end, start)`: Calculates the difference in days between two dates.
*   `date_add(start, days)`, `date_sub(start, days)`: Adds or subtracts days from a date.
*   `months_between(end, start)`: Calculates the difference in months.

**Important:** Pay close attention to date/timestamp formats. The format string must match the input exactly. Spark 3.0 and later use their own datetime patterns, modeled on Java's `DateTimeFormatter`; Spark 2.x followed `SimpleDateFormat`. When the format argument is omitted, parsing follows Spark's cast rules, which accept ISO-style strings such as `yyyy-MM-dd` and `yyyy-MM-dd HH:mm:ss`. With ANSI mode off, a string that does not match the format parses to `null`; with ANSI mode on, it raises an error instead. Handling time zones correctly is also critical for timestamp data and is controlled by the Spark configuration `spark.sql.session.timeZone`.

**Code Examples**

```python
from pyspark.sql.functions import to_date, to_timestamp, date_format, year, month, dayofmonth, datediff, date_add, current_timestamp, expr

# Create a DataFrame with date/time strings
date_data = [
    (1, "2023-10-26", "26/10/2023 14:30:00"),
    (2, "2023-11-05", "05/11/2023 09:15:10"),
    (3, "invalid-date", "30/11/2023 20:00:55"),
    (4, "2023-12-15", None)
]
date_schema = ["id", "date_str_iso", "datetime_str_dmy"]
date_df = spark.createDataFrame(date_data, date_schema)
print("Original Date DataFrame:")
date_df.show(truncate=False)
date_df.printSchema()

# a) Parsing String to Date/Timestamp
print("Parsing date and timestamp strings:")
parsed_df = date_df.withColumn(
    "event_date",
    to_date(col("date_str_iso"), "yyyy-MM-dd") # Format matches input
).withColumn(
    "event_timestamp",
    to_timestamp(col("datetime_str_dmy"), "dd/MM/yyyy HH:mm:ss") # Specify custom format
)
parsed_df.show(truncate=False)
parsed_df.printSchema()
# Explanation:
# 1. `to_date(col("date_str_iso"), "yyyy-MM-dd")`: Parses the 'date_str_iso' column using the ISO standard format. If the format string were omitted, the default parsing rules would also accept this input. Note how 'invalid-date' (id 3) results in a null 'event_date'.
# 2. `to_timestamp(col("datetime_str_dmy"), "dd/MM/yyyy HH:mm:ss")`: Parses the 'datetime_str_dmy' column using the specified day/month/year hour:minute:second format. The format string must match the input string pattern exactly. The null input (id 4) stays null.

# b) Formatting Date/Timestamp to String
print("Formatting date and timestamp:")
formatted_df = parsed_df.withColumn(
    "date_formatted",
    date_format(col("event_date"), "MMMM dd, yyyy") # e.g., October 26, 2023
).withColumn(
    "time_formatted",
    date_format(col("event_timestamp"), "HH:mm") # e.g., 14:30
)
formatted_df.show(truncate=False)
formatted_df.printSchema()
# Explanation:
# 1. `date_format(col("event_date"), "MMMM dd, yyyy")`: Formats the 'event_date' column into a string with the full month name, day, and year.
# 2. `date_format(col("event_timestamp"), "HH:mm")`: Extracts and formats the hour and minute from the 'event_timestamp' column.

# c) Extracting Date/Time Components
print("Extracting components:")
components_df = parsed_df.filter(col("event_date").isNotNull()).select(
    col("event_date"),
    year(col("event_date")).alias("year"),
    month(col("event_date")).alias("month"),
    dayofmonth(col("event_date")).alias("day")
)
components_df.show()
# Explanation:
# 1. `.filter(col("event_date").isNotNull())`: Filters out rows where parsing might have failed, ensuring functions like year() don't receive nulls unexpectedly (though they typically handle nulls gracefully).
# 2. `year(col("event_date")).alias("year")`: Extracts the year component from 'event_date'. Similarly for month() and dayofmonth().

# d) Date Arithmetic
print("Date Arithmetic:")
arithmetic_df = parsed_df.filter(col("event_date").isNotNull()).withColumn(
    "date_plus_7_days",
    date_add(col("event_date"), 7)
).withColumn(
    "days_since_event",
    datediff(current_timestamp(), col("event_timestamp")) # Difference from now
).withColumn(
    "days_between_dates", # Example using two specific dates (or columns)
     datediff(lit("2023-12-01"), col("event_date"))
)
# Using expr for interval arithmetic
arithmetic_df = arithmetic_df.withColumn(
    "timestamp_plus_interval",
    expr("event_timestamp + interval 2 hours") # Add 2 hours using expr
)

arithmetic_df.select("id", "event_date", "date_plus_7_days", "event_timestamp", "timestamp_plus_interval", "days_since_event", "days_between_dates").show(truncate=False)
# Explanation:
# 1. `date_add(col("event_date"), 7)`: Adds 7 days to the 'event_date'.
# 2. `datediff(current_timestamp(), col("event_timestamp"))`: Calculates the difference in days between the current timestamp (cast to date for datediff) and the 'event_timestamp'.
# 3. `datediff(lit("2023-12-01"), col("event_date"))`: Calculates difference between a literal date string and the 'event_date' column.
# 4. `expr("event_timestamp + interval 2 hours")`: Uses the `expr` function to leverage SQL syntax for adding a time interval (2 hours) to the timestamp. `expr` is powerful for complex or SQL-native operations.
```

**Practical Use Cases & Performance:**

*   Parsing dates correctly is fundamental for filtering time ranges, calculating durations, and joining time-series data.
*   Date/timestamp functions can be computationally intensive, especially parsing from strings. Ensure input formats are consistent.
*   **Partitioning:** Partitioning large datasets by date columns (e.g., year, month, day) is a critical optimization technique. It allows Spark to prune partitions (skip reading irrelevant data) when a query filters on the partition columns themselves (e.g., `WHERE year = 2023 AND month >= 10`), drastically improving query performance. A filter on a non-partition column such as `event_date` does not prune directories unless the data is partitioned by that column. When writing data, use `.partitionBy("year", "month", "day")`.
*   **Time Zones:** Be extremely careful with `TimestampType` and time zones. Configure `spark.sql.session.timeZone` appropriately for your environment. Store timestamps preferably in UTC and convert to local time zones only for display purposes if needed.

---

### 4. Identifying and Dropping Duplicate Records

**Theory**

Duplicate records can inflate counts, skew aggregations, and lead to incorrect analysis. They might represent exact copies of rows or duplicates based on a subset of key columns (e.g., same user ID and timestamp, even if other columns differ slightly). PySpark provides two main methods:

1.  `distinct()`: Returns a new DataFrame containing only the unique rows from the original DataFrame (considers all columns).
2.  `dropDuplicates()`: More flexible, allowing you to remove rows that have duplicate values only in specified columns.

Both operations are computationally expensive as they require a full shuffle of the data across the cluster to group identical rows together. Spark needs to send all rows with the same hash value (based on the relevant columns) to the same partition to identify duplicates.

**Code Examples**

```python
# Create DataFrame with duplicates
duplicate_data = [
    (1, "Alice", "HR", 5000),
    (2, "Bob", "Engineering", 7000),
    (1, "Alice", "HR", 5000), # Exact duplicate
    (3, "Charlie", "Engineering", 7500),
    (4, "Bob", "Sales", 6000), # Different department/salary for Bob
    (5, "Alice", "HR", 5500)  # Different salary for Alice
]
dup_schema = ["id", "name", "department", "salary"]
dup_df = spark.createDataFrame(duplicate_data, dup_schema)
print("Original DataFrame with Duplicates:")
dup_df.show()

# a) Get Distinct Rows (all columns considered)
print("Distinct rows (all columns considered):")
distinct_df = dup_df.distinct()
distinct_df.show()
# Explanation:
# 1. `dup_df.distinct()`: Returns a new DataFrame containing only unique rows. Row (1, "Alice", "HR", 5000) appears only once.

# b) Drop Duplicates Based on All Columns
# dropDuplicates() with no arguments is equivalent to distinct()
print("Dropping duplicates based on all columns:")
drop_dup_all_df = dup_df.dropDuplicates()
drop_dup_all_df.show()
# Explanation:
# 1. `dup_df.dropDuplicates()`: Without arguments, considers all columns to identify duplicates and keeps one row per group. Because the copies are identical in every column, it does not matter which one is kept. Result is identical to `distinct()`.

# c) Drop Duplicates Based on a Subset of Columns
print("Dropping duplicates based on 'id' and 'name':")
# Keeps one row (not necessarily the first) for each combination of id, name
drop_dup_subset_df = dup_df.dropDuplicates(subset=['id', 'name'])
drop_dup_subset_df.show()
# Explanation:
# 1. `dup_df.dropDuplicates(subset=['id', 'name'])`: Identifies duplicates based *only* on the values in the 'id' and 'name' columns.
# 2. For (id=1, name='Alice'), there are two rows: (1, Alice, HR, 5000) and (1, Alice, HR, 5000). It keeps one.
# 3. For (id=4, name='Bob'), there's only one row: (4, Bob, Sales, 6000). It's kept.
# 4. For (id=2, name='Bob'), there's only one row: (2, Bob, Engineering, 7000). It's kept.
# 5. For (id=5, name='Alice'), there's only one row: (5, Alice, HR, 5500). It's kept.
# 6. **Important:** Which specific row is kept among duplicates is not guaranteed. `dropDuplicates` shuffles the data, so even sorting the DataFrame beforehand is not a reliable way to control it. If you need to keep a *specific* duplicate (e.g., the one with the latest timestamp or highest salary), rank the rows with a Window function and filter, as in the next example.

# Example: Keeping the duplicate record with the highest salary based on name
from pyspark.sql.window import Window
from pyspark.sql.functions import row_number, desc

print("Keeping the record with the highest salary per name:")
window_spec = Window.partitionBy("name").orderBy(desc("salary"))
highest_salary_per_name = dup_df.withColumn("rn", row_number().over(window_spec)) \
                                .filter(col("rn") == 1) \
                                .drop("rn")
highest_salary_per_name.show()
# Explanation:
# 1. `Window.partitionBy("name").orderBy(desc("salary"))`: Defines a window partitioned by 'name', ordered by 'salary' descending within each partition.
# 2. `.withColumn("rn", row_number().over(window_spec))`: Assigns a rank (row number) within each partition based on the defined order. The row with the highest salary gets rank 1.
# 3. `.filter(col("rn") == 1)`: Keeps only the row with rank 1 for each name (i.e., the one with the highest salary).
# 4. `.drop("rn")`: Removes the temporary rank column.
```

**Practical Use Cases & Performance:**

*   `distinct()` is used when any difference makes a row unique.
*   `dropDuplicates(subset=[...])` is crucial when defining uniqueness based on specific business keys (e.g., user ID, transaction ID).
*   **Performance:** Both `distinct()` and `dropDuplicates()` trigger a wide transformation (shuffle). This can be expensive for large datasets.
*   Ensure the columns used in the `subset` accurately define a unique record according to business logic.
*   For very large datasets, consider if approximate distinct counts (`approx_count_distinct` function) are sufficient for initial analysis before performing expensive exact deduplication.
*   If dropping duplicates based on a subset, think carefully about which record you *want* to keep if other columns differ. Use Window functions for deterministic selection if needed.

---

### 5. Advanced Considerations: Optimization and Best Practices

While the functions above provide the core tools, efficient data cleaning in PySpark also involves understanding Spark's execution model and applying optimization techniques:

1.  **Caching:** If a DataFrame resulting from expensive cleaning steps (like complex parsing, imputation involving aggregations, or deduplication) is reused multiple times later in the pipeline, consider caching it using `df.cache()` or `df.persist(StorageLevel.MEMORY_AND_DISK)` (with `from pyspark import StorageLevel`). This avoids recomputing the lineage for subsequent actions. Use caching judiciously, as it consumes cluster resources, and call `df.unpersist()` once the data is no longer needed. Monitor the Spark UI to see cache effectiveness.
2.  **Predicate Pushdown:** When reading data, filter as early as possible (`.filter()` or `WHERE` clause). For data sources that support it (like Parquet, ORC, databases), Spark pushes these filters down to the data source level, reducing the amount of data read into memory.
3.  **Column Pruning:** Select only the columns you need using `.select()`. Spark automatically prunes unused columns when reading columnar formats like Parquet, significantly reducing I/O. Avoid `select("*")` if not all columns are required.
4.  **Partitioning:** As mentioned under Dates/Timestamps, partitioning data correctly on disk (based on frequently filtered columns) is one of the most effective ways to speed up queries involving filters on those columns. Choose partition columns with appropriate cardinality (not too many small partitions, not too few large ones).
5.  **Efficient Function Usage:**
    *   Use built-in PySpark SQL functions whenever possible, as they are implemented in Scala/Java and optimized for Spark's engine. Avoid Python UDFs (User Defined Functions) for performance-critical tasks unless absolutely necessary, as they involve serialization/deserialization overhead and break Catalyst optimizations.
    *   Combine multiple operations in `selectExpr` or a single `select` with multiple expressions rather than chaining many `withColumn` calls; a single projection keeps the logical plan smaller and easier to read.
6.  **Shuffle Management:** Operations like `groupBy`, `join`, `distinct`, and `dropDuplicates` involve shuffling data. Understand how data is distributed. Misconfigured shuffle partitions (`spark.sql.shuffle.partitions`) can lead to performance bottlenecks (too few partitions -> large tasks, memory issues; too many -> scheduling overhead, small file issues). Tune this based on cluster size and data volume. Adaptive Query Execution (AQE) in newer Spark versions helps optimize shuffles automatically to some extent.
7.  **Monitor the Spark UI:** The Spark UI (typically at `http://<driver-node>:4040`) is invaluable for understanding job execution, identifying bottlenecks, checking shuffle read/write sizes, and analyzing query plans (logical and physical).

---

**Conclusion**

Data cleaning and manipulation are essential steps in leveraging PySpark for large-scale data processing. Mastering techniques for handling missing data, casting types, renaming columns, working with dates/timestamps, and removing duplicates using PySpark's DataFrame API is crucial for building robust and efficient data pipelines. By understanding the underlying principles of Spark's execution model and applying optimization best practices like caching, filtering early, and appropriate partitioning, professional learners can effectively prepare their data for complex analytics and machine learning tasks at scale.

---