# PySpark SQL withColumn() Function: How to Add or Modify Columns

## Introduction to the `withColumn()` Function

The `withColumn()` method of a PySpark DataFrame adds a new column or replaces an existing one. It returns a new DataFrame in which that column is computed from an expression, such as a transformation of existing columns, a conditional, or a constant.


## Basic Syntax

```
DataFrame.withColumn(columnName, expression)
```

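### Parameters

- **`columnName`**: The name of the column to create or replace, given as a string (this parameter is named `colName` in the PySpark API).
- **`expression`**: A `Column` expression that produces the column's values (named `col` in the PySpark API). Plain Python values must be wrapped, for example with `lit()`.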


## How It Works

- If the specified column already exists, the returned DataFrame contains the new values in its place.
- If the column does not exist, it is appended as a new column.
- DataFrames are immutable: `withColumn()` never changes the DataFrame it is called on. Assign the result to a variable to keep it, or chain further calls on it.


## Practical Examples

### 1. Adding a New Column with a Simple Expression

**Scenario**: You have a DataFrame with employee salary data, and you want to add a new column that increases each employee’s salary by 10%.

**Code Example**:

```python
from pyspark.sql import SparkSession

# In notebooks, `spark` is usually predefined; getOrCreate() reuses it
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    (1, "John", 3000),
    (2, "Jane", 4000),
    (3, "Tom", 3500)
], ["EMPLOYEE_ID", "NAME", "SALARY"])

# Add a new column "NEW_SALARY" with a 10% salary increase
df.withColumn("NEW_SALARY", df.SALARY * 1.1).show()
```

```
+-----------+----+------+------------------+
|EMPLOYEE_ID|NAME|SALARY|        NEW_SALARY|
+-----------+----+------+------------------+
|          1|John|  3000|3300.0000000000005|
|          2|Jane|  4000|            4400.0|
|          3| Tom|  3500|3850.0000000000005|
+-----------+----+------+------------------+
```



### 2. Modifying an Existing Column

**Scenario**: You want to replace the `SALARY` column itself with a value increased by 10%, instead of adding a new column.

**Code Example**:


```python
# Replace the existing "SALARY" column with a value increased by 10%
df.withColumn("SALARY", df.SALARY * 1.1).show()
```

```
+-----------+----+------------------+
|EMPLOYEE_ID|NAME|            SALARY|
+-----------+----+------------------+
|          1|John|3300.0000000000005|
|          2|Jane|            4400.0|
|          3| Tom|3850.0000000000005|
+-----------+----+------------------+
```



### 3. Adding a Column Based on a Conditional Expression

**Scenario**: You want to add a new column `BONUS_ELIGIBLE` that assigns `True` if an employee’s salary is greater than $3,500, and `False` otherwise.

**Code Example**:

```python
from pyspark.sql.functions import when

# Add a new column "BONUS_ELIGIBLE" based on a condition
df.withColumn("BONUS_ELIGIBLE", when(df.SALARY > 3500, True).otherwise(False)).show()
```

```
+-----------+----+------+--------------+
|EMPLOYEE_ID|NAME|SALARY|BONUS_ELIGIBLE|
+-----------+----+------+--------------+
|          1|John|  3000|         false|
|          2|Jane|  4000|          true|
|          3| Tom|  3500|         false|
+-----------+----+------+--------------+
```



### 4. Using `withColumn()` to Create a Column Based on Multiple Columns

**Scenario**: You want to add a new column `TOTAL_COMPENSATION` that is the sum of the `SALARY` and `BONUS` columns.

**Code Example**:

```python
df_bonus = spark.createDataFrame([
    (1, "John", 3000, 500),
    (2, "Jane", 4000, 800),
    (3, "Tom", 3500, 600)
], ["EMPLOYEE_ID", "NAME", "SALARY", "BONUS"])

# Add a new column "TOTAL_COMPENSATION" as the sum of SALARY and BONUS
df_bonus.withColumn("TOTAL_COMPENSATION", df_bonus.SALARY + df_bonus.BONUS).show()
```

```
+-----------+----+------+-----+------------------+
|EMPLOYEE_ID|NAME|SALARY|BONUS|TOTAL_COMPENSATION|
+-----------+----+------+-----+------------------+
|          1|John|  3000|  500|              3500|
|          2|Jane|  4000|  800|              4800|
|          3| Tom|  3500|  600|              4100|
+-----------+----+------+-----+------------------+
```



### 5. Using Built-in Functions in `withColumn()`

**Scenario**: You want to add a column that contains the length of each employee’s name.

**Code Example**:

```python
from pyspark.sql.functions import length

# Add a new column "NAME_LENGTH" using the built-in length function
df.withColumn("NAME_LENGTH", length(df.NAME)).show()
```

```
+-----------+----+------+-----------+
|EMPLOYEE_ID|NAME|SALARY|NAME_LENGTH|
+-----------+----+------+-----------+
|          1|John|  3000|          4|
|          2|Jane|  4000|          4|
|          3| Tom|  3500|          3|
+-----------+----+------+-----------+
```

