## DataFrame Transformations in PySpark

**Topics covered**
- Aliasing
- Filter / Where
- Literal
- Adding columns
- Renaming columns
- Casting data types
- Removing columns

In [0]:
from pyspark.sql.functions import col

# creating a dataframe
df = spark.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load("dbfs:/FileStore/shared_uploads/zaderohish5@gmail.com/employee_1.csv")

# output of df
df.show(5)

+---+--------+---+------+------------+--------+
| id|    name|age|salary|     address| nominee|
+---+--------+---+------+------------+--------+
|  1|  Rohish| 26| 75000|        Pune|nominee1|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|
|  3|  Pritam| 22|150000|   Bangalore|nominee3|
|  4|Prantosh| 17|200000|     Kolkata|nominee4|
|  5|  Vikash| 31|300000|        null|nominee5|
+---+--------+---+------+------------+--------+



### Aliasing
Aliasing assigns a temporary name to a DataFrame or a column, either to make it more readable or to avoid naming conflicts (for example, in a self-join).

In [0]:
# Aliasing Columns
df.select(col("id").alias("emp_id"), col("name")).show()

+------+--------+
|emp_id|    name|
+------+--------+
|     1|  Rohish|
|     2|  Nikita|
|     3|  Pritam|
|     4|Prantosh|
|     5|  Vikash|
+------+--------+



### Filter / Where
Filters rows based on a condition. `filter()` and `where()` are equivalent; `where()` is an alias for `filter()`:

In [0]:
# filter for employees whose salary is more than 150000
df.filter(col("salary") > 150000).show()

+---+--------+---+------+-------+--------+
| id|    name|age|salary|address| nominee|
+---+--------+---+------+-------+--------+
|  4|Prantosh| 17|200000|Kolkata|nominee4|
|  5|  Vikash| 31|300000|   null|nominee5|
+---+--------+---+------+-------+--------+



In [0]:
# filter for employees whose salary is more than 150000: using where
df.where(col("salary") > 150000).show()

+---+--------+---+------+-------+--------+
| id|    name|age|salary|address| nominee|
+---+--------+---+------+-------+--------+
|  4|Prantosh| 17|200000|Kolkata|nominee4|
|  5|  Vikash| 31|300000|   null|nominee5|
+---+--------+---+------+-------+--------+



In [0]:
# filter for employees whose salary is more than 150000 and age < 18
df.filter((col("salary") > 150000) & (col("age") < 18)).show()

+---+--------+---+------+-------+--------+
| id|    name|age|salary|address| nominee|
+---+--------+---+------+-------+--------+
|  4|Prantosh| 17|200000|Kolkata|nominee4|
+---+--------+---+------+-------+--------+



In [0]:
# store the filter condition in a variable so it can be reused
fc = (col("salary") > 150000) & (col("age") < 18)
df.filter(fc).show()

+---+--------+---+------+-------+--------+
| id|    name|age|salary|address| nominee|
+---+--------+---+------+-------+--------+
|  4|Prantosh| 17|200000|Kolkata|nominee4|
+---+--------+---+------+-------+--------+



### Literal
`lit()` creates a column that holds a constant (literal) value, which is useful for adding the same value to every row of a DataFrame:

In [0]:
# add a column last name to df with literal value as "Zade"
from pyspark.sql.functions import lit

df.select("*", lit("Zade").alias("last_name")).show()

+---+--------+---+------+------------+--------+---------+
| id|    name|age|salary|     address| nominee|last_name|
+---+--------+---+------+------------+--------+---------+
|  1|  Rohish| 26| 75000|        Pune|nominee1|     Zade|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|     Zade|
|  3|  Pritam| 22|150000|   Bangalore|nominee3|     Zade|
|  4|Prantosh| 17|200000|     Kolkata|nominee4|     Zade|
|  5|  Vikash| 31|300000|        null|nominee5|     Zade|
+---+--------+---+------+------------+--------+---------+



In [0]:
# add a constant integer column using withColumn
df.withColumn("new_column", lit(100)).show()

+---+--------+---+------+------------+--------+----------+
| id|    name|age|salary|     address| nominee|new_column|
+---+--------+---+------+------------+--------+----------+
|  1|  Rohish| 26| 75000|        Pune|nominee1|       100|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|       100|
|  3|  Pritam| 22|150000|   Bangalore|nominee3|       100|
|  4|Prantosh| 17|200000|     Kolkata|nominee4|       100|
|  5|  Vikash| 31|300000|        null|nominee5|       100|
+---+--------+---+------+------------+--------+----------+



### Adding Columns: `withColumn`
The `withColumn` method returns a new DataFrame with a column added or replaced; the original DataFrame is not modified. It is used to:
- Add a new column to a DataFrame.
- Update an existing column by applying transformations or functions.
- Perform operations such as type casting, arithmetic, and applying expressions to columns.

**Syntax:** `DataFrame.withColumn(colName, col)`
- `colName`: The name of the new column to add or the existing column to update.
- `col`: A column expression, such as a transformation or calculation.

In [0]:
# adding a new column named middle_name
df.withColumn("middle_name", lit("Jesus")).show()

+---+--------+---+------+------------+--------+-----------+
| id|    name|age|salary|     address| nominee|middle_name|
+---+--------+---+------+------------+--------+-----------+
|  1|  Rohish| 26| 75000|        Pune|nominee1|      Jesus|
|  2|  Nikita| 23|100000|uttarpradesh|nominee2|      Jesus|
|  3|  Pritam| 22|150000|   Bangalore|nominee3|      Jesus|
|  4|Prantosh| 17|200000|     Kolkata|nominee4|      Jesus|
|  5|  Vikash| 31|300000|        null|nominee5|      Jesus|
+---+--------+---+------+------------+--------+-----------+



### Renaming Columns: `withColumnRenamed`
The `withColumnRenamed` method in PySpark is used to rename an existing column in a DataFrame. 

It is a straightforward and efficient way to change column names without altering the data.

**Syntax:** `DataFrame.withColumnRenamed(existing, new)`
- `existing`: The current name of the column.
- `new`: The new name to assign to the column.

**Non-destructive:** Like `withColumn`, this method returns a new DataFrame and leaves the original DataFrame unchanged.

**Renaming a Single Column**

In [0]:
# change the id column name to emp_id
df.withColumnRenamed("id", "emp_id").show()

+------+--------+---+------+------------+--------+
|emp_id|    name|age|salary|     address| nominee|
+------+--------+---+------+------------+--------+
|     1|  Rohish| 26| 75000|        Pune|nominee1|
|     2|  Nikita| 23|100000|uttarpradesh|nominee2|
|     3|  Pritam| 22|150000|   Bangalore|nominee3|
|     4|Prantosh| 17|200000|     Kolkata|nominee4|
|     5|  Vikash| 31|300000|        null|nominee5|
+------+--------+---+------+------------+--------+



**Rename Multiple Columns (Chained Calls)**

In [0]:
df.withColumnRenamed("id", "emp_id") \
    .withColumnRenamed("name", "emp_name").show()

+------+--------+---+------+------------+--------+
|emp_id|emp_name|age|salary|     address| nominee|
+------+--------+---+------+------------+--------+
|     1|  Rohish| 26| 75000|        Pune|nominee1|
|     2|  Nikita| 23|100000|uttarpradesh|nominee2|
|     3|  Pritam| 22|150000|   Bangalore|nominee3|
|     4|Prantosh| 17|200000|     Kolkata|nominee4|
|     5|  Vikash| 31|300000|        null|nominee5|
+------+--------+---+------+------------+--------+



### Casting Data Types: Changing the data type of a column

PySpark provides the `cast()` method as part of the Column class for changing the data type.

**Syntax:** `DataFrame.withColumn(colName, colExpression.cast(dataType))`
- `colName`: Name of the column you want to cast.
- `colExpression`: The column or transformation to apply.
- `dataType`: Target data type, given either as a `DataType` object (e.g., `StringType()`, `IntegerType()`) or as a type-name string (e.g., `"long"`, `"string"`).

In [0]:
df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: integer (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



In [0]:
# change the datatype of salary to long
df.withColumn("salary", col("salary").cast("long")).printSchema()

root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: long (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



In [0]:
# cast several columns by chaining withColumn calls
df.withColumn("salary", col("salary").cast("long")) \
   .withColumn("id", col("id").cast("string")).printSchema()

root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- salary: long (nullable = true)
 |-- address: string (nullable = true)
 |-- nominee: string (nullable = true)



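### Dropping Columns
PySpark provides a built-in `drop()` method to remove one or more columns.

**Syntax:** `DataFrame.drop(*cols)`
- `cols`: The columns to drop, given as column names (strings) or `Column` objects. Multiple columns can be passed; note that older Spark releases accept only strings when more than one column is dropped in a single call.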


In [0]:
# Drop a Single Column id
df.drop(col("id")).show()

+--------+---+------+------------+--------+
|    name|age|salary|     address| nominee|
+--------+---+------+------------+--------+
|  Rohish| 26| 75000|        Pune|nominee1|
|  Nikita| 23|100000|uttarpradesh|nominee2|
|  Pritam| 22|150000|   Bangalore|nominee3|
|Prantosh| 17|200000|     Kolkata|nominee4|
|  Vikash| 31|300000|        null|nominee5|
+--------+---+------+------------+--------+



In [0]:
# Drop Multiple Columns: id, age, salary (names and Column objects mixed)
df.drop("id", col("age"), df["salary"]).show()

+--------+------------+--------+
|    name|     address| nominee|
+--------+------------+--------+
|  Rohish|        Pune|nominee1|
|  Nikita|uttarpradesh|nominee2|
|  Pritam|   Bangalore|nominee3|
|Prantosh|     Kolkata|nominee4|
|  Vikash|        null|nominee5|
+--------+------------+--------+

