# PySpark SQL when() Function: Implementing Conditional Logic Easily

## Introduction to the `when()` Function

The `when()` function in PySpark implements conditional logic, similar to an if-else statement or a SQL `CASE WHEN` expression. It evaluates a condition for each row and returns a specified value where the condition is true. Where it is false, it returns the value passed to `.otherwise()`, or null if `.otherwise()` is omitted. This is especially useful when creating new columns from existing values.


## Basic Syntax:

```python
from pyspark.sql.functions import when

# Template: df is any DataFrame, condition is a boolean Column expression
df.select(when(condition, true_value).otherwise(false_value).alias("new_column"))
```

### Parameters:

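- **`condition`**: A boolean Column expression to evaluate for each row, for example `df.sales > 300`.
- **`true_value`**: The value (a literal or a Column) returned where the condition is true.
- **`false_value`**: The value passed to `.otherwise()`, returned where no condition is true. If `.otherwise()` is omitted, those rows get null.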



## Why Use `when()`?

- `when()` covers data transformation tasks that need conditional logic, such as categorizing values, handling missing data, or combining several conditions to compute a new column.
- It is evaluated per row as a column expression, so the same rule applies consistently across the whole DataFrame in data cleaning and transformation steps.


## Practical Examples

### 1. Applying Simple Conditional Logic

**Scenario**: You have a DataFrame with sales data, and you want to create a new column that labels sales as either "High" or "Low" based on a threshold value.

**Code Example**:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

# Reuse the active session (notebooks usually provide `spark`), or create one
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("Product A", 500),
    ("Product B", 200),
    ("Product C", 700)
], ["product_name", "sales"])

# Label sales above 300 as "High" and everything else as "Low"
df.select(
    df.product_name,
    df.sales,
    when(df.sales > 300, "High").otherwise("Low").alias("sales_category")
).show()
```

**Output**:

```
+------------+-----+--------------+
|product_name|sales|sales_category|
+------------+-----+--------------+
|   Product A|  500|          High|
|   Product B|  200|           Low|
|   Product C|  700|          High|
+------------+-----+--------------+
```



### 2. Using Multiple Conditions

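**Scenario**: You want to categorize sales as "High" (above 600), "Medium" (above 300), or "Low" (everything else). Chained `when()` calls are checked in order, and the first condition that is true determines the result.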

**Code Example**:

```python
# Apply multiple conditions by chaining when() calls
df.select(
    df.product_name,
    df.sales,
    when(df.sales > 600, "High")
    .when(df.sales > 300, "Medium")
    .otherwise("Low").alias("sales_category")
).show()
```

**Output**:

```
+------------+-----+--------------+
|product_name|sales|sales_category|
+------------+-----+--------------+
|   Product A|  500|        Medium|
|   Product B|  200|           Low|
|   Product C|  700|          High|
+------------+-----+--------------+
```



### 3. Using `when()` for Conditional Column Creation

**Scenario**: You have a column of ages, and you want to create a new column that labels each person as "Adult" (18 or older) or "Minor."

**Code Example**:

```python
df_age = spark.createDataFrame([
    ("John", 25),
    ("Jane", 17),
    ("Tom", 30)
], ["name", "age"])

# Use when() to create a conditional column for age groups
df_age.select(
    df_age.name,
    df_age.age,
    when(df_age.age >= 18, "Adult").otherwise("Minor").alias("age_group")
).show()
```

**Output**:

```
+----+---+---------+
|name|age|age_group|
+----+---+---------+
|John| 25|    Adult|
|Jane| 17|    Minor|
| Tom| 30|    Adult|
+----+---+---------+
```



### 4. Combining `when()` with Other Functions

**Scenario**: You want to label sales as "Valid" if they are greater than 100, "Unknown" if the value is null, and "Invalid" otherwise. The null check comes first so that missing values are handled explicitly.

**Code Example**:

```python
from pyspark.sql.functions import col, lit, when

df_with_nulls = spark.createDataFrame([
    ("Product A", 500),
    ("Product B", None),
    ("Product C", 700)
], ["product_name", "sales"])

# Combine when() with isNull(), col(), and lit()
df_with_nulls.select(
    df_with_nulls.product_name,
    when(col("sales").isNull(), lit("Unknown"))
    .when(col("sales") > 100, "Valid")
    .otherwise("Invalid").alias("sales_status")
).show()
```

**Output**:

```
+------------+------------+
|product_name|sales_status|
+------------+------------+
|   Product A|       Valid|
|   Product B|     Unknown|
|   Product C|       Valid|
+------------+------------+
```



### 5. Using `when()` for Conditional Updates

**Scenario**: You want to derive a status from existing values, marking all sales below 300 as "Needs Improvement" and the rest as "Good."

**Code Example**:

```python
# Derive a sales_status column conditionally using when()
df.select(
    df.product_name,
    df.sales,
    when(df.sales < 300, "Needs Improvement").otherwise("Good").alias("sales_status")
).show()
```

**Output**:

```
+------------+-----+-----------------+
|product_name|sales|     sales_status|
+------------+-----+-----------------+
|   Product A|  500|             Good|
|   Product B|  200|Needs Improvement|
|   Product C|  700|             Good|
+------------+-----+-----------------+
```

