# Spark DataFrame Column Manipulation with `withColumn`

In this notebook, we'll use `withColumn` to add, modify, and transform columns in a Spark DataFrame. We'll also cover two related methods: `withColumnRenamed` for renaming columns and `drop` for removing them.

## Step 1: Initialize SparkSession

The first step is to create a SparkSession, which is the entry point for working with Spark DataFrames.

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, lit, expr

# Initialize Spark Session
spark = SparkSession.builder.appName("withColumn Tutorial").getOrCreate()



## Step 2: Create a Sample DataFrame

We'll create a sample DataFrame to demonstrate various `withColumn` operations.

In [2]:
data = [
    (1, "Alice", 29, 50000),
    (2, "Bob", 35, 60000),
    (3, "Cathy", 30, 55000),
    (4, "David", 40, 65000),
    (5, "Eva", 23, 45000)
]
columns = ["ID", "Name", "Age", "Salary"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()

+---+-----+---+------+
| ID| Name|Age|Salary|
+---+-----+---+------+
|  1|Alice| 29| 50000|
|  2|  Bob| 35| 60000|
|  3|Cathy| 30| 55000|
|  4|David| 40| 65000|
|  5|  Eva| 23| 45000|
+---+-----+---+------+



## Step 3: Adding New Columns with `withColumn`

To add a new column, call `withColumn` with the new column's name and a column expression, which can be a literal value or a value derived from other columns. Here, we'll add a `Bonus` column with a fixed value and a `Salary_increase` column equal to 10% of the salary.

In [3]:
# Add a new column with a constant value
df_with_bonus = df.withColumn("Bonus", lit(5000))

# Add a new column with calculated values
df_with_salary_increase = df_with_bonus.withColumn("Salary_increase", col("Salary") * 0.1)
df_with_salary_increase.show()

+---+-----+---+------+-----+---------------+
| ID| Name|Age|Salary|Bonus|Salary_increase|
+---+-----+---+------+-----+---------------+
|  1|Alice| 29| 50000| 5000|         5000.0|
|  2|  Bob| 35| 60000| 5000|         6000.0|
|  3|Cathy| 30| 55000| 5000|         5500.0|
|  4|David| 40| 65000| 5000|         6500.0|
|  5|  Eva| 23| 45000| 5000|         4500.0|
+---+-----+---+------+-----+---------------+



## Step 4: Modifying Existing Columns

We can modify an existing column by passing its name to `withColumn`, which replaces the column with the new expression. In this example, we'll adjust the `Salary` column by adding the `Bonus` to it.

In [4]:
# Update the Salary column by adding Bonus
df_adjusted_salary = df_with_salary_increase.withColumn("Salary", col("Salary") + col("Bonus"))
df_adjusted_salary.show()

+---+-----+---+------+-----+---------------+
| ID| Name|Age|Salary|Bonus|Salary_increase|
+---+-----+---+------+-----+---------------+
|  1|Alice| 29| 55000| 5000|         5000.0|
|  2|  Bob| 35| 65000| 5000|         6000.0|
|  3|Cathy| 30| 60000| 5000|         5500.0|
|  4|David| 40| 70000| 5000|         6500.0|
|  5|  Eva| 23| 50000| 5000|         4500.0|
+---+-----+---+------+-----+---------------+



## Step 5: Renaming Columns with `withColumnRenamed`

`withColumnRenamed` is used to rename a column in a DataFrame. Let's rename the `Age` column to `Employee_Age`.

In [5]:
# Rename the Age column
df_renamed = df_adjusted_salary.withColumnRenamed("Age", "Employee_Age")
df_renamed.show()

+---+-----+------------+------+-----+---------------+
| ID| Name|Employee_Age|Salary|Bonus|Salary_increase|
+---+-----+------------+------+-----+---------------+
|  1|Alice|          29| 55000| 5000|         5000.0|
|  2|  Bob|          35| 65000| 5000|         6000.0|
|  3|Cathy|          30| 60000| 5000|         5500.0|
|  4|David|          40| 70000| 5000|         6500.0|
|  5|  Eva|          23| 50000| 5000|         4500.0|
+---+-----+------------+------+-----+---------------+



## Step 6: Dropping Columns with `drop`

We can remove columns from a DataFrame using the `drop` method. Here, we'll drop the `Bonus` column from our DataFrame.

In [6]:
# Drop the Bonus column
df_dropped = df_renamed.drop("Bonus")
df_dropped.show()

+---+-----+------------+------+---------------+
| ID| Name|Employee_Age|Salary|Salary_increase|
+---+-----+------------+------+---------------+
|  1|Alice|          29| 55000|         5000.0|
|  2|  Bob|          35| 65000|         6000.0|
|  3|Cathy|          30| 60000|         5500.0|
|  4|David|          40| 70000|         6500.0|
|  5|  Eva|          23| 50000|         4500.0|
+---+-----+------------+------+---------------+



## Step 7: Using Expressions with `withColumn`

`expr` lets us write a column expression as a SQL string inside `withColumn`. Let's calculate a 15% tax on the adjusted salary and add it as a new column called `Tax`.

In [7]:
# Calculate tax with expr
df_with_tax = df_dropped.withColumn("Tax", expr("Salary * 0.15"))
df_with_tax.show()

+---+-----+------------+------+---------------+--------+
| ID| Name|Employee_Age|Salary|Salary_increase|     Tax|
+---+-----+------------+------+---------------+--------+
|  1|Alice|          29| 55000|         5000.0| 8250.00|
|  2|  Bob|          35| 65000|         6000.0| 9750.00|
|  3|Cathy|          30| 60000|         5500.0| 9000.00|
|  4|David|          40| 70000|         6500.0|10500.00|
|  5|  Eva|          23| 50000|         4500.0| 7500.00|
+---+-----+------------+------+---------------+--------+



## TODO 1: Add a Column for Net Salary

Using `withColumn`, add a new column called `Net_Salary`, calculated by subtracting `Tax` from `Salary`.

### Solution

In [8]:
# Solution: Calculate Net Salary
df_net_salary = df_with_tax.withColumn("Net_Salary", col("Salary") - col("Tax"))
df_net_salary.show()

+---+-----+------------+------+---------------+--------+----------+
| ID| Name|Employee_Age|Salary|Salary_increase|     Tax|Net_Salary|
+---+-----+------------+------+---------------+--------+----------+
|  1|Alice|          29| 55000|         5000.0| 8250.00|  46750.00|
|  2|  Bob|          35| 65000|         6000.0| 9750.00|  55250.00|
|  3|Cathy|          30| 60000|         5500.0| 9000.00|  51000.00|
|  4|David|          40| 70000|         6500.0|10500.00|  59500.00|
|  5|  Eva|          23| 50000|         4500.0| 7500.00|  42500.00|
+---+-----+------------+------+---------------+--------+----------+



## TODO 2: Rename `Employee_Age` Back to `Age`

Use `withColumnRenamed` to rename the `Employee_Age` column back to `Age`.

### Solution

In [9]:
# Solution: Rename Employee_Age back to Age
df_final = df_net_salary.withColumnRenamed("Employee_Age", "Age")
df_final.show()

+---+-----+---+------+---------------+--------+----------+
| ID| Name|Age|Salary|Salary_increase|     Tax|Net_Salary|
+---+-----+---+------+---------------+--------+----------+
|  1|Alice| 29| 55000|         5000.0| 8250.00|  46750.00|
|  2|  Bob| 35| 65000|         6000.0| 9750.00|  55250.00|
|  3|Cathy| 30| 60000|         5500.0| 9000.00|  51000.00|
|  4|David| 40| 70000|         6500.0|10500.00|  59500.00|
|  5|  Eva| 23| 50000|         4500.0| 7500.00|  42500.00|
+---+-----+---+------+---------------+--------+----------+



## TODO 3: Add a Column to Calculate Years Until Retirement

Assume retirement age is 65. Create a new column `Years_Until_Retirement` that calculates the number of years left until retirement based on the `Age` column.

### Solution

In [10]:
# Solution: Calculate years until retirement
df_retirement = df_final.withColumn("Years_Until_Retirement", expr("65 - Age"))
df_retirement.show()

+---+-----+---+------+---------------+--------+----------+----------------------+
| ID| Name|Age|Salary|Salary_increase|     Tax|Net_Salary|Years_Until_Retirement|
+---+-----+---+------+---------------+--------+----------+----------------------+
|  1|Alice| 29| 55000|         5000.0| 8250.00|  46750.00|                    36|
|  2|  Bob| 35| 65000|         6000.0| 9750.00|  55250.00|                    30|
|  3|Cathy| 30| 60000|         5500.0| 9000.00|  51000.00|                    35|
|  4|David| 40| 70000|         6500.0|10500.00|  59500.00|                    25|
|  5|  Eva| 23| 50000|         4500.0| 7500.00|  42500.00|                    42|
+---+-----+---+------+---------------+--------+----------+----------------------+



## Conclusion

In this tutorial, we used `withColumn` to add new columns, overwrite existing ones, and compute values with both column expressions and `expr`. We also renamed columns with `withColumnRenamed` and removed them with `drop`.

`withColumn` accepts any column expression, so custom calculations and transformations can be written directly against the DataFrame, which makes it a common building block in data engineering work.

In [11]:
# Stop the Spark session
spark.stop()