# DataFrame-Only Operations in Spark

This notebook is a tutorial on operations that rely on the Spark DataFrame API rather than on SQL alone. We'll explore custom User-Defined Functions (UDFs), chained transformations, conditional expressions using `when` and `otherwise`, and SQL expressions embedded with `expr`.

## Prerequisites
To start, make sure you have PySpark installed. If not, install it by running:
```python
!pip install pyspark
```

## Step 1: Initialize SparkSession

The first step is to create a SparkSession, which is the entry point for working with Spark DataFrames.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, udf
from pyspark.sql.types import StringType

# Initialize Spark Session
spark = SparkSession.builder.appName("DataFrame-Only Operations Tutorial").getOrCreate()



## Step 2: Create a Sample DataFrame

We'll create a sample DataFrame to use for the following examples.

In [2]:
data = [
    (1, "Alice", "Sales", 50000, 29),
    (2, "Bob", "Engineering", 60000, 35),
    (3, "Cathy", "Sales", 55000, 30),
    (4, "David", "Engineering", 65000, 40),
    (5, "Eva", "Marketing", 45000, 23)
]
columns = ["ID", "Name", "Department", "Salary", "Age"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()

+---+-----+-----------+------+---+
| ID| Name| Department|Salary|Age|
+---+-----+-----------+------+---+
|  1|Alice|      Sales| 50000| 29|
|  2|  Bob|Engineering| 60000| 35|
|  3|Cathy|      Sales| 55000| 30|
|  4|David|Engineering| 65000| 40|
|  5|  Eva|  Marketing| 45000| 23|
+---+-----+-----------+------+---+



## DataFrame-Only Operation 1: Custom User-Defined Functions (UDFs)

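Spark DataFrames can apply custom Python functions through UDFs. A SQL query cannot define a Python function inline; the function has to be written in Python first, and registered with `spark.udf.register` if it is to be called from SQL. UDFs are useful for transformations that the built-in functions cannot express, though they generally run slower than built-ins because each row is passed between the JVM and a Python worker.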

In this example, we'll define a UDF to categorize employees as "Young" or "Experienced" based on their `Age`.

In [3]:
# Define a Python function to categorize age
def age_category(age):
    return "Young" if age < 30 else "Experienced"

# Convert the Python function to a UDF
age_category_udf = udf(age_category, StringType())

# Apply the UDF to create a new column
df_with_category = df.withColumn("Age_Category", age_category_udf(col("Age")))
df_with_category.show()

+---+-----+-----------+------+---+------------+
| ID| Name| Department|Salary|Age|Age_Category|
+---+-----+-----------+------+---+------------+
|  1|Alice|      Sales| 50000| 29|       Young|
|  2|  Bob|Engineering| 60000| 35| Experienced|
|  3|Cathy|      Sales| 55000| 30| Experienced|
|  4|David|Engineering| 65000| 40| Experienced|
|  5|  Eva|  Marketing| 45000| 23|       Young|
+---+-----+-----------+------+---+------------+
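
A UDF wraps an ordinary Python function, so its logic can be checked in plain Python before Spark is involved. The sketch below also covers a case the sample data does not contain: Spark passes a SQL `NULL` to a Python UDF as `None`, and `None < 30` raises a `TypeError` in Python 3. The helper name `age_category_safe` is hypothetical.

```python
from typing import Optional


def age_category_safe(age: Optional[int]) -> Optional[str]:
    """Null-tolerant variant of age_category (hypothetical helper)."""
    if age is None:
        # A NULL age arrives as None; return None so the result column is NULL too
        return None
    return "Young" if age < 30 else "Experienced"


# Plain-Python checks; no SparkSession is needed at this stage
results = [age_category_safe(a) for a in (29, 30, None)]
print(results)
```

The function can then be wrapped in the same way as before, with `udf(age_category_safe, StringType())`.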



## DataFrame-Only Operation 2: Chaining Transformations

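DataFrames let you chain multiple transformations in a single expression, where SQL would typically need nested subqueries or CTEs. Chaining keeps the code readable, and it costs nothing at run time: transformations are lazy, so Spark optimizes the whole chain as one plan when an action such as `show()` runs.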

In this example, we’ll apply several transformations in a chain: filtering employees older than 30, adding a 10% bonus to their salary, and selecting specific columns.

In [4]:
# Chain transformations in a single expression
df_chained = df.filter(col("Age") > 30) \
    .withColumn("Bonus", col("Salary") * 0.1) \
    .select("Name", "Age", "Salary", "Bonus")

df_chained.show()

+-----+---+------+------+
| Name|Age|Salary| Bonus|
+-----+---+------+------+
|  Bob| 35| 60000|6000.0|
|David| 40| 65000|6500.0|
+-----+---+------+------+



## DataFrame-Only Operation 3: Conditional Expressions with `when` and `otherwise`

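DataFrames support conditional expressions through the `when` and `otherwise` functions, the DataFrame counterpart of SQL's `CASE WHEN`. Because the conditions are ordinary Python expressions on columns, they can be built, reused, and combined programmatically, which is cumbersome in a SQL string.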

In this example, we’ll create a new column `High_Salary` that marks employees with a salary greater than 55000 as "Yes" and others as "No".

In [5]:
# Use when and otherwise to create conditional column
df_with_condition = df.withColumn(
    "High_Salary",
    when(col("Salary") > 55000, "Yes").otherwise("No")
)

df_with_condition.show()

+---+-----+-----------+------+---+-----------+
| ID| Name| Department|Salary|Age|High_Salary|
+---+-----+-----------+------+---+-----------+
|  1|Alice|      Sales| 50000| 29|         No|
|  2|  Bob|Engineering| 60000| 35|        Yes|
|  3|Cathy|      Sales| 55000| 30|         No|
|  4|David|Engineering| 65000| 40|        Yes|
|  5|  Eva|  Marketing| 45000| 23|         No|
+---+-----+-----------+------+---+-----------+



## DataFrame-Only Operation 4: Using Expressions (`expr`)

The `expr` function lets us use Spark SQL expressions within the DataFrame API, which is useful for performing arithmetic operations and string manipulations directly within the code.

In this example, we’ll calculate a new column `Adjusted_Salary` where we apply a 5% increase to the salary if the employee is in the `Sales` department, and no change otherwise.

In [8]:
from pyspark.sql.functions import expr

# Use expr to apply a conditional operation
df_with_adjusted_salary = df.withColumn(
    "Adjusted_Salary",
    expr("CASE WHEN Department = 'Sales' THEN Salary * 1.05 ELSE Salary END")
)
df_with_adjusted_salary.show()


+---+-----+-----------+------+---+---------------+
| ID| Name| Department|Salary|Age|Adjusted_Salary|
+---+-----+-----------+------+---+---------------+
|  1|Alice|      Sales| 50000| 29|       52500.00|
|  2|  Bob|Engineering| 60000| 35|       60000.00|
|  3|Cathy|      Sales| 55000| 30|       57750.00|
|  4|David|Engineering| 65000| 40|       65000.00|
|  5|  Eva|  Marketing| 45000| 23|       45000.00|
+---+-----+-----------+------+---+---------------+



## TODO 1: Create a UDF for Customized Greeting

Define a UDF named `greeting` that creates a customized greeting for each employee based on their name and department. For example, "Hello Alice from Sales!".

### Solution

In [9]:
# Solution: Define and apply the greeting UDF
def greeting(name, department):
    return f"Hello {name} from {department}!"

greeting_udf = udf(greeting, StringType())

df_with_greeting = df.withColumn("Greeting", greeting_udf(col("Name"), col("Department")))
df_with_greeting.show()

+---+-----+-----------+------+---+--------------------+
| ID| Name| Department|Salary|Age|            Greeting|
+---+-----+-----------+------+---+--------------------+
|  1|Alice|      Sales| 50000| 29|Hello Alice from ...|
|  2|  Bob|Engineering| 60000| 35|Hello Bob from En...|
|  3|Cathy|      Sales| 55000| 30|Hello Cathy from ...|
|  4|David|Engineering| 65000| 40|Hello David from ...|
|  5|  Eva|  Marketing| 45000| 23|Hello Eva from Ma...|
+---+-----+-----------+------+---+--------------------+



## TODO 2: Add a Conditional Bonus Column

Add a column `Conditional_Bonus` that provides a 10% bonus to employees in `Engineering` and a 5% bonus for employees in other departments. Use the `when` and `otherwise` functions.

### Solution

In [10]:
# Solution: Create Conditional_Bonus column
df_with_conditional_bonus = df.withColumn(
    "Conditional_Bonus",
    when(col("Department") == "Engineering", col("Salary") * 0.1).otherwise(col("Salary") * 0.05)
)
df_with_conditional_bonus.show()

+---+-----+-----------+------+---+-----------------+
| ID| Name| Department|Salary|Age|Conditional_Bonus|
+---+-----+-----------+------+---+-----------------+
|  1|Alice|      Sales| 50000| 29|           2500.0|
|  2|  Bob|Engineering| 60000| 35|           6000.0|
|  3|Cathy|      Sales| 55000| 30|           2750.0|
|  4|David|Engineering| 65000| 40|           6500.0|
|  5|  Eva|  Marketing| 45000| 23|           2250.0|
+---+-----+-----------+------+---+-----------------+



## TODO 3: Chain Multiple Transformations

1. Keep only employees younger than 35.
2. Add a new column `High_Salary` that marks salaries over 55000 as "Yes" and all others as "No".
3. Select only `Name`, `Age`, `High_Salary`, and `Department`.

### Solution

In [11]:
# Solution: Chain multiple transformations
df_transformed = df.filter(col("Age") < 35) \
    .withColumn("High_Salary", when(col("Salary") > 55000, "Yes").otherwise("No")) \
    .select("Name", "Age", "High_Salary", "Department")

df_transformed.show()

+-----+---+-----------+----------+
| Name|Age|High_Salary|Department|
+-----+---+-----------+----------+
|Alice| 29|         No|     Sales|
|Cathy| 30|         No|     Sales|
|  Eva| 23|         No| Marketing|
+-----+---+-----------+----------+



## Conclusion

In this tutorial, we covered key operations that rely on the Spark DataFrame API rather than on SQL alone:

- **Custom User-Defined Functions (UDFs)**: Ideal for applying Python-based transformations.
- **Chaining Transformations**: Enables fluent, readable transformations.
- **Conditional Expressions**: Complex conditions handled easily with `when` and `otherwise`.
- **Using `expr`**: Integrates SQL expressions directly into DataFrame transformations.

These DataFrame-specific functions allow for more flexibility and Python integration, making them invaluable for complex data engineering tasks in Spark.

In [12]:
# Stop the Spark session
spark.stop()