### Pattern Matching in PySpark

Pattern matching in PySpark covers several Column methods and functions for filtering rows by string patterns: `like()`, `rlike()`, `regexp_like`, `contains()`, `startswith()`, and `endswith()`.

In [0]:
# sample data
data = [
    (1, 'Rohish', 'HR', 5000),
    (2, 'Smit', 'HR', 6000),
    (3, 'Faisal', 'IT', 7000),
    (4, 'Pushpak', 'IT', 9000),
    (5, 'Rishabh', 'HR', 5500),
    (6, 'Vinit', 'IT', 8000),
    (7, 'DemonSlayer69', 'IT', 10000)  # Salary kept as an integer, like the other rows
]

columns = ["EmployeeID", "Name", "Department", "Salary"]

df = spark.createDataFrame(data, columns)
df.show()

+----------+-------------+----------+------+
|EmployeeID|         Name|Department|Salary|
+----------+-------------+----------+------+
|         1|       Rohish|        HR|  5000|
|         2|         Smit|        HR|  6000|
|         3|       Faisal|        IT|  7000|
|         4|      Pushpak|        IT|  9000|
|         5|      Rishabh|        HR|  5500|
|         6|        Vinit|        IT|  8000|
|         7|DemonSlayer69|        IT| 10000|
+----------+-------------+----------+------+



#### Using `like()` for SQL-style Pattern Matching

`like()` works like SQL's `LIKE` operator: `%` matches any sequence of zero or more characters, and `_` matches exactly one character. Matching is case-sensitive.

In [0]:
from pyspark.sql.functions import col

# Find names starting with 'R'
df.filter(col("Name").like("R%")).show()

+----------+-------+----------+------+
|EmployeeID|   Name|Department|Salary|
+----------+-------+----------+------+
|         1| Rohish|        HR|  5000|
|         5|Rishabh|        HR|  5500|
+----------+-------+----------+------+



In [0]:
# Find names ending with 'R' (no rows match: like() is case-sensitive)
df.filter(col("Name").like("%R")).show()

+----------+----+----------+------+
|EmployeeID|Name|Department|Salary|
+----------+----+----------+------+
+----------+----+----------+------+



In [0]:
# Filter names containing "sh"
df.filter(col("Name").like("%sh%")).show()

+----------+-------+----------+------+
|EmployeeID|   Name|Department|Salary|
+----------+-------+----------+------+
|         1| Rohish|        HR|  5000|
|         4|Pushpak|        IT|  9000|
|         5|Rishabh|        HR|  5500|
+----------+-------+----------+------+



In [0]:
# Filter names with exactly 6 characters, starting with "R"
df.filter(col("Name").like("R_____")).show()

+----------+------+----------+------+
|EmployeeID|  Name|Department|Salary|
+----------+------+----------+------+
|         1|Rohish|        HR|  5000|
+----------+------+----------+------+
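The wildcard rules above can be sketched in plain Python by translating a `LIKE` pattern into a regular expression. This is only an illustration of the semantics, not how Spark implements `like()`: the helper names `like_to_regex` and `like_filter` are hypothetical, and escape characters are not handled.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Translate a SQL LIKE pattern into a regex (hypothetical helper, no escape handling)."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")           # % -> any sequence of characters
        elif ch == "_":
            parts.append(".")            # _ -> exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "".join(parts)

def like_filter(values, pattern):
    """Keep the values that match the LIKE pattern as a whole string."""
    regex = re.compile(like_to_regex(pattern), re.DOTALL)
    return [v for v in values if regex.fullmatch(v)]

names = ["Rohish", "Smit", "Faisal", "Pushpak", "Rishabh", "Vinit", "DemonSlayer69"]

starts_with_r = like_filter(names, "R%")      # same rows as like("R%")
ends_with_r = like_filter(names, "%R")        # empty: matching is case-sensitive
contains_sh = like_filter(names, "%sh%")      # same rows as like("%sh%")
six_chars_r = like_filter(names, "R_____")    # "R" plus exactly five more characters
```

Note that a `LIKE` pattern must match the whole value, which is why the sketch uses `fullmatch`.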



#### Using `rlike()` for Regular Expression Matching

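`rlike()` filters with regular expressions (Java regex syntax) for more complex pattern matching. A row matches when the pattern is found anywhere in the string, so use `^` and `$` to anchor it. Common metacharacters: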
- `.` in regex: Matches any single character
- `*` in regex: Matches zero or more of the preceding element
- `+` in regex: Matches one or more of the preceding element
- `^` in regex: Matches the start of the string
- `$` in regex: Matches the end of the string
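The metacharacters listed above can be tried out quickly in plain Python before running them in Spark. This sketch uses Python's `re.search` as a stand-in for Spark's Java regex engine, on the assumption that simple patterns like these behave the same in both; the helper `regex_filter` is hypothetical.

```python
import re

names = ["Rohish", "Smit", "Faisal", "Pushpak", "Rishabh", "Vinit", "DemonSlayer69"]

def regex_filter(values, pattern):
    """Keep values where the pattern is found anywhere (partial match, like rlike)."""
    return [v for v in values if re.search(pattern, v)]

starts_with_r = regex_filter(names, r"^R")     # ^ anchors at the start
ends_with_h = regex_filter(names, r"h$")       # $ anchors at the end
has_digit = regex_filter(names, r"[0-9]")      # character class, found anywhere
r_then_h = regex_filter(names, r"^R.*h$")      # . and * combined: R, anything, h
one_or_more_s = regex_filter(names, r"is+h")   # + means one or more "s"

# Without anchors, the pattern only needs to occur somewhere in the string.
partial = re.search(r"sh", "Pushpak") is not None
whole = re.fullmatch(r"sh", "Pushpak") is not None
```

Here `partial` is true and `whole` is false, which is the difference between an unanchored regex match and a whole-string match.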

In [0]:
# Filter names starting with "R" (using regex)
df.filter(col("Name").rlike("^R")).show()

+----------+-------+----------+------+
|EmployeeID|   Name|Department|Salary|
+----------+-------+----------+------+
|         1| Rohish|        HR|  5000|
|         5|Rishabh|        HR|  5500|
+----------+-------+----------+------+



In [0]:
# Filter names ending with "h"
df.filter(col("Name").rlike("h$")).show()

+----------+-------+----------+------+
|EmployeeID|   Name|Department|Salary|
+----------+-------+----------+------+
|         1| Rohish|        HR|  5000|
|         5|Rishabh|        HR|  5500|
+----------+-------+----------+------+



In [0]:
# Filter names containing numbers
df.filter(col("Name").rlike("[0-9]")).show()

+----------+-------------+----------+------+
|EmployeeID|         Name|Department|Salary|
+----------+-------------+----------+------+
|         7|DemonSlayer69|        IT| 10000|
+----------+-------------+----------+------+



#### Using `regexp_like` Function (Regular Expressions, Spark 3.5+)

`regexp_like` is available in `pyspark.sql.functions` from PySpark 3.5 onward; importing it on an older runtime raises the `ImportError` shown below. It behaves like `rlike()`, and both use Java regular expression syntax.

In [0]:
from pyspark.sql.functions import lit, regexp_like

# Filter names starting with "F" (using regex)
# regexp_like expects the pattern as a Column, so wrap the string in lit()
df.filter(regexp_like(col("Name"), lit("^F"))).show()

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
File <command-2921023638831196>:1
----> 1 # Filter names starting with "F" (using regex)
      2 df.filter(regexp_like(col("name"), "^F")).show()

ImportError: cannot import name 'regexp_like' from 'pyspark.sql.functions' (/databricks/spark/python/pyspark/sql/functions.py)

##### NOTE: If `regexp_like` is not available in your PySpark version, use `rlike()`, which is its equivalent.
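One way to support both old and new runtimes is a guarded import that falls back to `rlike()`. This is a minimal sketch, not an official recipe: the helper `matches_regex` and the flag `HAS_REGEXP_LIKE` are hypothetical names, and it assumes `regexp_like` takes the pattern as a Column (hence `lit()`).

```python
# Guarded import: regexp_like may be missing from older PySpark runtimes.
try:
    from pyspark.sql.functions import regexp_like
    HAS_REGEXP_LIKE = True
except ImportError:
    HAS_REGEXP_LIKE = False


def matches_regex(column, pattern):
    """Return a boolean Column that is true where `column` matches `pattern`.

    Hypothetical helper: uses regexp_like when it can be imported,
    otherwise falls back to Column.rlike.
    """
    if HAS_REGEXP_LIKE:
        from pyspark.sql.functions import lit
        # Assumption: regexp_like expects the pattern as a Column.
        return regexp_like(column, lit(pattern))
    return column.rlike(pattern)


# Usage with the sample DataFrame from this notebook:
# df.filter(matches_regex(col("Name"), "^F")).show()
```

Because the fallback is decided once at import time, the rest of the notebook can call `matches_regex` without caring which runtime it is on.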

#### Using `contains()` for Substring Matching

`contains()` checks whether a column value contains a literal substring; wildcards and regex metacharacters are not interpreted:

In [0]:
# Find names containing 'ish'
df.filter(col("Name").contains("ish")).show()

+----------+-------+----------+------+
|EmployeeID|   Name|Department|Salary|
+----------+-------+----------+------+
|         1| Rohish|        HR|  5000|
|         5|Rishabh|        HR|  5500|
+----------+-------+----------+------+



#### Using `startswith()` and `endswith()`

If you only need to filter on a prefix or suffix, these methods state the intent directly and avoid the regular-expression evaluation that `rlike()` requires.

In [0]:
# Names starting with 'D'
df.filter(col("Name").startswith("D")).show()

+----------+-------------+----------+------+
|EmployeeID|         Name|Department|Salary|
+----------+-------------+----------+------+
|         7|DemonSlayer69|        IT| 10000|
+----------+-------------+----------+------+



In [0]:
# Names ending with 'h'
df.filter(col("Name").endswith("h")).show()

+----------+-------+----------+------+
|EmployeeID|   Name|Department|Salary|
+----------+-------+----------+------+
|         1| Rohish|        HR|  5000|
|         5|Rishabh|        HR|  5500|
+----------+-------+----------+------+
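The prefix and suffix checks above have anchored-regex equivalents (`startswith("D")` corresponds to `rlike("^D")`, and `endswith("h")` to `rlike("h$")`). The sketch below shows that correspondence in plain Python, using `str.startswith`, `str.endswith`, and `re.search` as stand-ins for the Spark methods; it illustrates the semantics only and says nothing about Spark performance.

```python
import re

names = ["Rohish", "Smit", "Faisal", "Pushpak", "Rishabh", "Vinit", "DemonSlayer69"]

# Direct prefix/suffix checks.
prefix_direct = [n for n in names if n.startswith("D")]
suffix_direct = [n for n in names if n.endswith("h")]

# The same filters written as anchored regular expressions.
prefix_regex = [n for n in names if re.search(r"^D", n)]
suffix_regex = [n for n in names if re.search(r"h$", n)]

# Both forms select the same names.
same_prefix = prefix_direct == prefix_regex
same_suffix = suffix_direct == suffix_regex
```

The direct form also sidesteps escaping: a prefix such as `"a.b"` would need its dot escaped in the regex version.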

