# PySpark SQL substring() Function: Extracting Substrings Easily

## Introduction to the `substring()` Function

The `substring()` function in PySpark extracts a portion of a string column, given a starting position and a length. It is typically used to pull out fixed-position fields from a string, such as a date component, an identifier, or a code prefix.


## Basic Syntax:

```
from pyspark.sql.functions import substring

DataFrame.select(substring(column, startPos, length).alias("new_column"))
```

### Parameters:

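- **`column`**: The string column (a `Column` object or a column name) from which to extract the substring.
- **`startPos`**: The starting position of the substring. Positions are 1-based, so `1` refers to the first character.
- **`length`**: The number of characters to extract. If fewer characters remain in the string, the result is simply shorter.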
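The 1-based `startPos` is easy to get wrong when you are used to Python's 0-based slicing. The following is a plain-Python sketch of how the two relate. It is a simplified model for positive start positions only, not Spark's implementation, and `substring_model` is a hypothetical helper name.

```python
def substring_model(value: str, start_pos: int, length: int) -> str:
    """Plain-Python model of substring() for positive, 1-based start positions."""
    if start_pos < 1 or length < 0:
        raise ValueError("this sketch only models start_pos >= 1 and length >= 0")
    begin = start_pos - 1  # convert the 1-based position to a 0-based index
    # Python slicing stops at the end of the string, so a too-large length is harmless
    return value[begin:begin + length]


print(substring_model("ABC12345", 1, 3))   # ABC
print(substring_model("ABC12345", 4, 5))   # 12345
print(substring_model("ABC12345", 7, 10))  # 45
```

In other words, `substring(column, startPos, length)` corresponds roughly to the Python slice `value[startPos - 1 : startPos - 1 + length]`.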


## Why Use `substring()`?

- It is useful for extracting parts of strings, such as splitting dates, pulling out area codes from phone numbers, or isolating specific characters in a structured string format.
- This function is often used for data transformation, data cleaning, or reformatting string columns.


## Practical Examples

### 1. Extracting the First Three Characters of a String

**Scenario**: You have a DataFrame with product codes, and you want to extract the first three characters of the product code.

**Code Example**:

```
from pyspark.sql.functions import substring

df = spark.createDataFrame([
    ("ABC12345",),
    ("XYZ98765",),
    ("LMN54321",)
], ["product_code"])

# Extract the first 3 characters of product_code
df.select(substring(df.product_code, 1, 3).alias("code_prefix")).show()
```

**Output**:

```
+-----------+
|code_prefix|
+-----------+
|        ABC|
|        XYZ|
|        LMN|
+-----------+
```



### 2. Extracting a Substring from the Middle of a String

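**Scenario**: You want to skip the three-letter prefix and extract the five characters that start at position 4. Because every product code here is exactly eight characters long, these are also the last five characters.

**Code Example**:

```
# Extract 5 characters starting at position 4
# (the last five characters of these 8-character codes)
df.select(substring(df.product_code, 4, 5).alias("last_five_chars")).show()
```

**Output**:

```
+---------------+
|last_five_chars|
+---------------+
|          12345|
|          98765|
|          54321|
+---------------+
```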



### 3. Extracting Substrings Based on a Dynamic Position

**Scenario**: You want to extract a dynamic portion of a string where the length is not fixed. For example, extract the first part of the string until a certain character (e.g., extracting everything before the '-' in a product code).

**Code Example**:

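```
from pyspark.sql.functions import expr

df_dynamic = spark.createDataFrame([
    ("ABC-1234",),
    ("XYZ-5678",),
    ("LMN-9876",)
], ["product_code"])

# Extract the part before the hyphen.
# instr() returns the 1-based position of the first '-',
# so the prefix length is that position minus 1.
df_dynamic.withColumn(
    "prefix",
    expr("substring(product_code, 1, instr(product_code, '-') - 1)")
).show()
```

**Output**:

```
+------------+------+
|product_code|prefix|
+------------+------+
|    ABC-1234|   ABC|
|    XYZ-5678|   XYZ|
|    LMN-9876|   LMN|
+------------+------+
```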



### 4. Using `substring()` for Date Formatting

**Scenario**: You have a DataFrame with a date column in the format `YYYY-MM-DD`, and you want to extract the year, month, and day separately.

**Code Example**:

```
df_dates = spark.createDataFrame([
    ("2023-10-12",),
    ("2022-05-06",),
    ("2024-01-01",)
], ["date"])

# Extract year, month, and day using substring
df_dates.select(
    substring(df_dates.date, 1, 4).alias("year"),
    substring(df_dates.date, 6, 2).alias("month"),
    substring(df_dates.date, 9, 2).alias("day")
).show()
```

**Output**:

```
+----+-----+---+
|year|month|day|
+----+-----+---+
|2023|   10| 12|
|2022|   05| 06|
|2024|   01| 01|
+----+-----+---+
```



### 5. Combining `substring()` with Other Functions

**Scenario**: You want to extract part of a string and combine it with other transformations, such as adding a prefix or suffix.

**Code Example**:

```
from pyspark.sql.functions import lit, concat

# Extract the first 3 characters and add a prefix
df.select(concat(lit("CODE_"), substring(df.product_code, 1, 3)).alias("new_code")).show()
```

**Output**:

```
+--------+
|new_code|
+--------+
|CODE_ABC|
|CODE_XYZ|
|CODE_LMN|
+--------+
```

