# PySpark DataFrame Manipulation
## Part 2: Adding, Renaming, and Dropping Columns

In this notebook, we will cover the following topics:
* Adding new columns to a DataFrame using `withColumn()`
* Renaming columns using `withColumnRenamed()`
* Dropping columns using `drop()`
* Immutability of DataFrames in Spark

### Adding New Columns with `withColumn()`
In PySpark, `withColumn()` is the usual way to add a column to a DataFrame. The new column can hold a constant value created with `lit()` or an expression computed from existing columns. If the name you pass already exists, `withColumn()` replaces that column rather than adding a second one.


In [8]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, expr, col

# Create Spark session
spark = SparkSession.builder.appName("ColumnManipulation").getOrCreate()

# Create sample data
data = [
    ("John", 30, "Sales", 50000),
    ("Lisa", 25, "Marketing", 45000),
    ("Mike", 35, "Engineering", 60000)
]

# Create DataFrame
df = spark.createDataFrame(data, ["Name", "Age", "Department", "Salary"])
df.show()

+----+---+-----------+------+
|Name|Age| Department|Salary|
+----+---+-----------+------+
|John| 30|      Sales| 50000|
|Lisa| 25|  Marketing| 45000|
|Mike| 35|Engineering| 60000|
+----+---+-----------+------+



## 1. Adding New Columns
The cell below adds three columns with `withColumn()`: a constant, a value computed from other columns, and a value chosen by conditional logic.

In [9]:
# Add a constant value column
df_with_bonus = df.withColumn("Bonus", lit(5000))

# Add a calculated column
df_with_total = df_with_bonus.withColumn("TotalComp", col("Salary") + col("Bonus"))

# Add a column with conditional logic
df_with_category = df_with_total.withColumn(
    "SalaryCategory",
    expr("CASE WHEN Salary >= 50000 THEN 'High' ELSE 'Standard' END")
)

df_with_category.show()

+----+---+-----------+------+-----+---------+--------------+
|Name|Age| Department|Salary|Bonus|TotalComp|SalaryCategory|
+----+---+-----------+------+-----+---------+--------------+
|John| 30|      Sales| 50000| 5000|    55000|          High|
|Lisa| 25|  Marketing| 45000| 5000|    50000|      Standard|
|Mike| 35|Engineering| 60000| 5000|    65000|          High|
+----+---+-----------+------+-----+---------+--------------+



## 2. Renaming Columns
`withColumnRenamed()` takes the existing column name and the new name. Calls can be chained to rename several columns:

In [10]:
# Rename single column
df_renamed = df_with_category.withColumnRenamed("Department", "Division")

# Rename multiple columns using multiple withColumnRenamed calls
df_final = (
    df_renamed
    .withColumnRenamed("TotalComp", "TotalCompensation")
    .withColumnRenamed("SalaryCategory", "CompLevel")
)

df_final.show()

+----+---+-----------+------+-----+-----------------+---------+
|Name|Age|   Division|Salary|Bonus|TotalCompensation|CompLevel|
+----+---+-----------+------+-----+-----------------+---------+
|John| 30|      Sales| 50000| 5000|            55000|     High|
|Lisa| 25|  Marketing| 45000| 5000|            50000| Standard|
|Mike| 35|Engineering| 60000| 5000|            65000|     High|
+----+---+-----------+------+-----+-----------------+---------+



## 3. Dropping Columns
`drop()` accepts one or more column names and returns a new DataFrame without them:

In [11]:
# Drop a single column
df_dropped = df_final.drop("Bonus")

# Drop multiple columns
df_minimal = df_dropped.drop("CompLevel", "TotalCompensation")

df_minimal.show()

+----+---+-----------+------+
|Name|Age|   Division|Salary|
+----+---+-----------+------+
|John| 30|      Sales| 50000|
|Lisa| 25|  Marketing| 45000|
|Mike| 35|Engineering| 60000|
+----+---+-----------+------+



## 4. Verify DataFrame Immutability
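Every transformation above returned a new DataFrame, so the original `df` still has its initial columns. The cell below shows `df` next to the final result: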

In [12]:
print("Original DataFrame:")
df.show()

print("\nFinal Modified DataFrame:")
df_minimal.show()

Original DataFrame:
+----+---+-----------+------+
|Name|Age| Department|Salary|
+----+---+-----------+------+
|John| 30|      Sales| 50000|
|Lisa| 25|  Marketing| 45000|
|Mike| 35|Engineering| 60000|
+----+---+-----------+------+


Final Modified DataFrame:
+----+---+-----------+------+
|Name|Age|   Division|Salary|
+----+---+-----------+------+
|John| 30|      Sales| 50000|
|Lisa| 25|  Marketing| 45000|
|Mike| 35|Engineering| 60000|
+----+---+-----------+------+

