## PySpark SQL regexp_replace() Function: Replacing Substrings with Regex

## Introduction to the `regexp_replace()` Function

The `regexp_replace()` function in PySpark replaces every substring of a string column that matches a regular expression (regex) with a replacement string. This is particularly useful for cleaning or transforming text data where patterns like special characters, spaces, or certain words need to be replaced.


## Basic Syntax:

```
from pyspark.sql.functions import regexp_replace

DataFrame.select(regexp_replace(column, pattern, replacement).alias("new_column"))
```

### Parameters:

- **`column`**: The column containing the string to modify.
- **`pattern`**: The regex pattern to search for, written in Java regular-expression syntax. Every match is replaced, not just the first.
- **`replacement`**: The string to replace the matched pattern with.
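
To build intuition for the three parameters, here is a minimal local sketch that uses Python's built-in `re` module rather than Spark. The helper name `preview_replace` is hypothetical and is not part of PySpark. Spark evaluates patterns with Java's regex engine, so treat this only as an approximation for simple patterns.

```python
import re

def preview_replace(values, pattern, replacement):
    """Hypothetical helper (not a PySpark API): preview a replacement on plain strings."""
    # re.sub replaces every non-overlapping match, as regexp_replace() does per row
    return [re.sub(pattern, replacement, value) for value in values]

# "values" plays the role of the column, followed by the pattern and the replacement
print(preview_replace(["ABC-123-XYZ", "DEF-456-UVW"], "-", " "))
```

Trying a pattern on a handful of sample strings like this can catch mistakes before the expression runs against a full DataFrame.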


## Why Use `regexp_replace()`?

- It’s a powerful tool for cleaning and transforming data where you need to identify patterns in strings, such as removing special characters, correcting formatting, or replacing certain words.
- Regular expressions allow you to define complex patterns that can match multiple variations, making this function flexible for a wide range of tasks.


## Practical Examples

### 1. Replacing All Occurrences of a Substring

**Scenario**: You have a DataFrame with product codes, and you want to replace all dashes (`-`) with spaces.

**Code Example**:

```
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

# Reuse the active session (notebooks usually predefine `spark`)
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([
    ("ABC-123-XYZ",),
    ("DEF-456-UVW",),
    ("GHI-789-RST",)
], ["product_code"])

# Replace dashes with spaces using regexp_replace()
df.select(regexp_replace(df.product_code, "-", " ").alias("cleaned_product_code")).show()
```

```
+--------------------+
|cleaned_product_code|
+--------------------+
|         ABC 123 XYZ|
|         DEF 456 UVW|
|         GHI 789 RST|
+--------------------+
```
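
A dash has no special meaning outside a character class, so it can be used as the pattern directly. Characters such as `.` are regex metacharacters and need escaping. The sketch below illustrates the difference locally with Python's `re` module (an assumption for illustration; Spark itself uses Java regex, where the escaped pattern would be written as `"\\."` in a PySpark call). The sample value is made up.

```python
import re

code = "ABC.123.XYZ"  # made-up sample value with dots as separators

# An unescaped dot matches any character, so every character gets replaced
unescaped = re.sub(".", "-", code)

# An escaped dot matches only literal dots
escaped = re.sub(r"\.", "-", code)

print(unescaped)
print(escaped)
```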



### 2. Removing Special Characters

**Scenario**: You want to clean up a column by removing all non-alphanumeric characters (such as punctuation).

**Code Example**:

```
df_special = spark.createDataFrame([
    ("Product@123!",),
    ("Price$50%",),
    ("Code#ABC",)
], ["text"])

# Remove all non-alphanumeric characters
df_special.select(regexp_replace(df_special.text, "[^a-zA-Z0-9]", "").alias("cleaned_text")).show()
```

```
+------------+
|cleaned_text|
+------------+
|  Product123|
|     Price50|
|     CodeABC|
+------------+
```
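
Note that `[^a-zA-Z0-9]` also removes spaces, because a space is not alphanumeric. If word boundaries should survive, one option is to add a space to the negated class. The following is a local sketch with Python's `re` module and made-up sample strings; the same character class could be passed to `regexp_replace()`.

```python
import re

samples = ["Product@123 v2!", "Price: $50 (approx.)"]  # made-up sample values

# Original pattern: strips everything except ASCII letters and digits
strip_all = [re.sub(r"[^a-zA-Z0-9]", "", s) for s in samples]

# Variant: the space inside the class is kept along with letters and digits
keep_spaces = [re.sub(r"[^a-zA-Z0-9 ]", "", s) for s in samples]

print(strip_all)
print(keep_spaces)
```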



### 3. Replacing Patterns in Date Strings

**Scenario**: You have dates in different formats (e.g., `2023/10/12`, `2023-10-12`), and you want to standardize them by replacing slashes (`/`) and dashes (`-`) with dots (`.`).

**Code Example**:

```
df_dates = spark.createDataFrame([
    ("2023/10/12",),
    ("2023-10-12",),
    ("2023/01/01",)
], ["date"])

# Standardize date format by replacing slashes and dashes with dots
df_dates.select(
    regexp_replace(df_dates.date, "[-/]", ".").alias("standardized_date")
).show()
```

```
+-----------------+
|standardized_date|
+-----------------+
|       2023.10.12|
|       2023.10.12|
|       2023.01.01|
+-----------------+
```
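
The pattern `[-/]` works because a hyphen placed first inside a character class is treated as a literal character rather than as a range operator. A small local check with Python's `re` module (used here only as a stand-in for Spark's Java regex engine) suggests that the leading-hyphen form and the escaped form behave the same way on these inputs:

```python
import re

dates = ["2023/10/12", "2023-10-12"]

# Hyphen first in the class: treated as a literal hyphen
leading_hyphen = [re.sub(r"[-/]", ".", d) for d in dates]

# Hyphen escaped: also a literal hyphen, regardless of position
escaped_hyphen = [re.sub(r"[/\-]", ".", d) for d in dates]

print(leading_hyphen)
print(escaped_hyphen)
```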



### 4. Replacing Multiple Patterns in One Step

**Scenario**: You want to replace multiple patterns in a single transformation, such as replacing commas (`,`) and semicolons (`;`) with spaces.

**Code Example**:

```
df_text = spark.createDataFrame([
    ("Hello, world;",),
    ("Hi, there;",),
    ("Goodbye, everyone;",)
], ["text"])

# Replace both commas and semicolons with spaces
df_text.select(
    regexp_replace(df_text.text, "[,;]", " ").alias("cleaned_text")
).show()
```

```
+------------------+
|      cleaned_text|
+------------------+
|     Hello  world |
|        Hi  there |
|Goodbye  everyone |
+------------------+
```
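
The result above contains a double space after each first word and a trailing space, because every comma and semicolon becomes its own space. One possible follow-up is to collapse whitespace runs and trim the ends. The sketch below shows the idea locally with Python's `re` module; the helper name `tidy` is hypothetical. In PySpark, the same two steps could presumably be expressed by nesting a second `regexp_replace()` with the pattern `"\\s+"` and wrapping the result in `trim()` from `pyspark.sql.functions`.

```python
import re

def tidy(text):
    """Hypothetical helper: replace separators, then normalize whitespace."""
    # Step 1: same pattern as above, commas and semicolons become spaces
    spaced = re.sub(r"[,;]", " ", text)
    # Step 2: collapse runs of whitespace into one space and trim both ends
    return re.sub(r"\s+", " ", spaced).strip()

print(tidy("Hello, world;"))
```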



### 5. Using `regexp_replace()` with Complex Regex

**Scenario**: You want to replace all numeric values in a column with the word "NUMBER".

**Code Example**:

```
df_numbers = spark.createDataFrame([
    ("Item 123",),
    ("Product 456",),
    ("Order 789",)
], ["description"])

# Replace all numeric values with the word "NUMBER"
df_numbers.select(
    regexp_replace(df_numbers.description, "\\d+", "NUMBER").alias("cleaned_description")
).show()
```

```
+-------------------+
|cleaned_description|
+-------------------+
|        Item NUMBER|
|     Product NUMBER|
|       Order NUMBER|
+-------------------+
```
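
Two details of the pattern `"\\d+"` are worth checking. The `+` quantifier makes each run of digits a single match, and the doubled backslash is just Python string escaping for the regex `\d+`. The sketch below demonstrates both points locally with Python's `re` module, used here as a stand-in for Spark's regex engine.

```python
import re

text = "Item 123"

# With "+", the whole run of digits is one match and is replaced once
per_run = re.sub(r"\d+", "NUMBER", text)

# Without "+", each digit is a separate match and is replaced individually
per_digit = re.sub(r"\d", "NUMBER", text)

# A regular string with a doubled backslash equals the raw-string form
same_pattern = "\\d+" == r"\d+"

print(per_run)
print(per_digit)
print(same_pattern)
```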



### 6. Handling Null Values with `regexp_replace()`

**Scenario**: You want to use `regexp_replace()` on a column that contains null values and ensure that nulls are not affected by the replacement process. `regexp_replace()` returns null for a null input, so no extra handling is required.

**Code Example**:

```
df_with_nulls = spark.createDataFrame([
    ("ABC-123",),
    (None,),
    ("XYZ-789",)
], ["product_code"])

# Replace dashes; null inputs pass through as null without special handling
df_with_nulls.select(
    regexp_replace(df_with_nulls.product_code, "-", " ").alias("cleaned_product_code")
).show()
```

```
+--------------------+
|cleaned_product_code|
+--------------------+
|             ABC 123|
|                null|
|             XYZ 789|
+--------------------+
```
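
If a placeholder is preferred over null in the output, a default can be substituted after the replacement. The sketch below models that behavior on plain Python values; the helper `replace_or_default` and the `"UNKNOWN"` placeholder are hypothetical. In PySpark, a comparable result could likely be obtained by wrapping the expression in `coalesce(..., lit("UNKNOWN"))` from `pyspark.sql.functions`.

```python
import re

def replace_or_default(value, pattern, replacement, default=None):
    """Hypothetical helper: mimic null propagation, with an optional fallback."""
    if value is None:
        # Null in, null out, unless a default is supplied
        return default
    return re.sub(pattern, replacement, value)

codes = ["ABC-123", None, "XYZ-789"]
propagated = [replace_or_default(c, "-", " ") for c in codes]
with_default = [replace_or_default(c, "-", " ", default="UNKNOWN") for c in codes]

print(propagated)
print(with_default)
```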

