## Trim and Pad Functions in a PySpark DataFrame

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import ltrim, rtrim, rpad, lpad, trim, col

# Create a Spark session
spark = SparkSession.builder.appName("PySparkTrimFunctions").getOrCreate()

# Sample employee data with leading and trailing spaces in the 'Name' column
data = [
    (1, " Alice ", "HR"),
    (2, " Bob", "IT"),
    (3, "Charlie ", "Finance"),
    (4, " David ", "HR"),
    (5, "Eve ", "IT")
]

# Column names for the DataFrame (column types are inferred from the data)
columns = ["EmployeeID", "Name", "Department"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show the original DataFrame
df.show(truncate=False)


+----------+--------+----------+
|EmployeeID|Name    |Department|
+----------+--------+----------+
|1         | Alice  |HR        |
|2         | Bob    |IT        |
|3         |Charlie |Finance   |
|4         | David  |HR        |
|5         |Eve     |IT        |
+----------+--------+----------+



## Applying Trimming and Padding Functions

### 1. `ltrim()`, `rtrim()`, and `trim()`
- **`ltrim()`**: Removes spaces from the start (left side) of the string.
- **`rtrim()`**: Removes spaces from the end (right side) of the string.
- **`trim()`**: Removes spaces from both the start and the end of the string.


In [2]:
df_trimmed = df.select(
    col("EmployeeID"),
    col("Department"),
    ltrim(col("Name")).alias("ltrim_Name"),  # Remove leading spaces
    rtrim(col("Name")).alias("rtrim_Name"),  # Remove trailing spaces
    trim(col("Name")).alias("trim_Name")     # Remove both leading and trailing spaces
)

df_trimmed.show(truncate=False)


+----------+----------+----------+----------+---------+
|EmployeeID|Department|ltrim_Name|rtrim_Name|trim_Name|
+----------+----------+----------+----------+---------+
|1         |HR        |Alice     | Alice    |Alice    |
|2         |IT        |Bob       | Bob      |Bob      |
|3         |Finance   |Charlie   |Charlie   |Charlie  |
|4         |HR        |David     | David    |David    |
|5         |IT        |Eve       |Eve       |Eve      |
+----------+----------+----------+----------+---------+
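
Because `show()` pads every cell to the column width, the removed spaces are hard to see in the table above. The following is a minimal plain-Python sketch, assuming the five sample names from this section, that compares string lengths before and after trimming. Python's `str.lstrip(" ")`, `str.rstrip(" ")`, and `str.strip(" ")` are used only as stand-ins for `ltrim()`, `rtrim()`, and `trim()`; `trimmed_lengths` is a hypothetical helper and not part of PySpark.

```python
# Sample names from this section, with their original spacing
names = [" Alice ", " Bob", "Charlie ", " David ", "Eve "]

def trimmed_lengths(name):
    """Return lengths of (original, ltrim-like, rtrim-like, trim-like) strings."""
    return (
        len(name),
        len(name.lstrip(" ")),  # stand-in for ltrim(): drop leading spaces
        len(name.rstrip(" ")),  # stand-in for rtrim(): drop trailing spaces
        len(name.strip(" ")),   # stand-in for trim(): drop both sides
    )

for name in names:
    print(repr(name), trimmed_lengths(name))
# ' Alice ' (7, 6, 6, 5)
# ' Bob' (4, 3, 4, 3)
# 'Charlie ' (8, 8, 7, 7)
# ' David ' (7, 6, 6, 5)
# 'Eve ' (4, 4, 3, 3)
```

In Spark itself, the same check can be made by wrapping each column in `length()` from `pyspark.sql.functions`.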



### 2. `lpad()` and `rpad()`
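- **`lpad()`**: Pads the left side of a string with a specified pad string until it reaches the target length. A string that is already longer than the target length is shortened to that length.
- **`rpad()`**: Pads the right side of a string with a specified pad string until it reaches the target length. A string that is already longer than the target length is shortened to that length.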


In [3]:
df_padded = df.select(
    col("EmployeeID"),
    col("Department"),
    lpad(col("Name"), 10, "X").alias("lpad_Name"),  # Left pad with 'X' to make the string length 10
    rpad(col("Name"), 10, "Y").alias("rpad_Name")   # Right pad with 'Y' to make the string length 10
)

df_padded.show(truncate=False)


+----------+----------+----------+----------+
|EmployeeID|Department|lpad_Name |rpad_Name |
+----------+----------+----------+----------+
|1         |HR        |XXX Alice | Alice YYY|
|2         |IT        |XXXXXX Bob| BobYYYYYY|
|3         |Finance   |XXCharlie |Charlie YY|
|4         |HR        |XXX David | David YYY|
|5         |IT        |XXXXXXEve |Eve YYYYYY|
+----------+----------+----------+----------+
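
The padded values above can be checked by hand with a small plain-Python model of the padding rule. This is only a sketch of the behaviour, not Spark's implementation: `lpad_model` and `rpad_model` are hypothetical names, and the model assumes a non-empty pad string.

```python
def lpad_model(s, length, pad):
    """Plain-Python model of lpad(); assumes `pad` is non-empty."""
    if len(s) >= length:
        return s[:length]              # longer input is cut down to `length`
    missing = length - len(s)
    fill = (pad * missing)[:missing]   # repeat the pad, then cut to the exact size
    return fill + s

def rpad_model(s, length, pad):
    """Plain-Python model of rpad(); assumes `pad` is non-empty."""
    if len(s) >= length:
        return s[:length]
    missing = length - len(s)
    return s + (pad * missing)[:missing]

# Values from the sample data: the existing spaces count toward the length
print(repr(lpad_model(" Alice ", 10, "X")))   # 'XXX Alice '
print(repr(rpad_model(" Bob", 10, "Y")))      # ' BobYYYYYY'
# A target length shorter than the input shortens the string
print(repr(lpad_model("Charlie ", 4, "X")))   # 'Char'
```

The model makes two details explicit: spaces already present in `Name` are counted as ordinary characters, and a target length smaller than the input shortens the value instead of padding it.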



### Output Explanation
- **`ltrim_Name`**: The leading spaces from the `Name` column are removed.
- **`rtrim_Name`**: The trailing spaces from the `Name` column are removed.
- **`trim_Name`**: Both leading and trailing spaces are removed from the `Name` column.
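- **`lpad_Name`**: The `Name` column is padded on the left with `'X'` until the string length is `10`. The original spaces are kept and count toward that length, so `" Alice "` becomes `"XXX Alice "`.
- **`rpad_Name`**: The `Name` column is padded on the right with `'Y'` until the string length is `10`. The original spaces are kept here as well, so `" Alice "` becomes `" Alice YYY"`.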
