### What is colRegex() in PySpark?

In PySpark, colRegex() is a DataFrame method that selects columns whose names match a given regular expression. It is typically passed to DataFrame.select().
It is not part of selectExpr(), and it matches against column names, not column values.

In [0]:
data = [
    (1, "Alice", 5000, "HR"),
    (2, "Bob", 4000, "IT"),
    (3, "Charlie", 4500, "Finance")
]

columns = ["emp_id", "emp_name", "emp_salary", "dept_name"]
df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.display()

Original DataFrame:


emp_id,emp_name,emp_salary,dept_name
1,Alice,5000,HR
2,Bob,4000,IT
3,Charlie,4500,Finance


In [0]:
# Select columns starting with 'emp_'
df_emp = df.select(df.colRegex("`^emp_.*`"))
print("Columns starting with 'emp_':")
df_emp.display()

Columns starting with 'emp_':


emp_id,emp_name,emp_salary
1,Alice,5000
2,Bob,4000
3,Charlie,4500


In [0]:
# Select columns ending with '_name'
df_name = df.select(df.colRegex("`.*_name$`"))
print("Columns ending with '_name':")
df_name.display()

Columns ending with '_name':


emp_name,dept_name
Alice,HR
Bob,IT
Charlie,Finance


In [0]:
# Case-insensitive match for columns containing 'NAME'
df_ci = df.select(df.colRegex("`(?i).*name.*`"))
print("Case-insensitive match for 'name':")
df_ci.display()

Case-insensitive match for 'name':


emp_name,dept_name
Alice,HR
Bob,IT
Charlie,Finance
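
The three selections above can be previewed without a Spark session. The sketch below is a plain-Python approximation, not Spark's implementation: `preview_col_regex` is a hypothetical helper, and it assumes the pattern must match the whole column name. Spark's own matching may also depend on the session's case-sensitivity setting, which this sketch ignores.

```python
import re


def preview_col_regex(columns, backticked_pattern):
    """Return the column names that a backticked pattern fully matches.

    Hypothetical helper: a plain-Python approximation that never touches Spark.
    """
    # colRegex expects the pattern wrapped in backticks, so require them here too.
    if not (len(backticked_pattern) >= 3
            and backticked_pattern.startswith("`")
            and backticked_pattern.endswith("`")):
        raise ValueError("pattern must be wrapped in backticks")
    pattern = backticked_pattern[1:-1]
    # Assumption: the regex has to match the entire column name.
    return [name for name in columns if re.fullmatch(pattern, name)]


columns = ["emp_id", "emp_name", "emp_salary", "dept_name"]

print(preview_col_regex(columns, "`^emp_.*`"))
print(preview_col_regex(columns, "`.*_name$`"))
print(preview_col_regex(columns, "`(?i).*NAME.*`"))
```

Under the whole-name assumption, a pattern such as "`emp`" selects nothing here, which is why the examples use `.*` on either side of the fixed text.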


### Key Points:

Always wrap the regex in **backticks (`` ` ``)**, which `colRegex` requires in order to treat the string as a pattern.

You can combine multiple regex patterns with | (OR), for example:

In [0]:
df.select(df.colRegex("`^emp_.*|.*_name$`"))
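
When the patterns come from a list rather than a hand-written string, the combined pattern can be built programmatically. This is a minimal sketch: `combine_col_regex` is a hypothetical helper, not part of PySpark, and wrapping each alternative in a non-capturing group is a defensive choice rather than a requirement.

```python
def combine_col_regex(*patterns):
    """Join regex fragments with | and wrap the result in backticks.

    Hypothetical helper for building the string passed to colRegex.
    """
    if not patterns:
        raise ValueError("at least one pattern is required")
    # Group each alternative so a | inside one fragment stays inside that fragment.
    alternatives = "|".join(f"(?:{p})" for p in patterns)
    return f"`{alternatives}`"


pattern = combine_col_regex("^emp_.*", ".*_name$")
print(pattern)  # `(?:^emp_.*)|(?:.*_name$)`

# With the df defined earlier in this notebook, the result would be used as:
# df.select(df.colRegex(pattern))
```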