In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, regexp_extract

# Initialize Spark session
spark = SparkSession.builder \
    .appName("Data Validation") \
    .getOrCreate()

# Example DataFrame
data = [("James", "Sales", 3000), ("Michael", "Sales", 4600), ("Robert", "Sales", 4100), ("Maria", "Finance", 3000), ("James", "Sales", 3000)]
columns = ["Employee Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)

# Validate Salary: Must be greater than 0
df_valid = df.filter(col("Salary") > 0)

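# Validate Employee Name: Must be one or two words made of ASCII letters only.
# regexp_extract returns the whole match (group 0), or "" when the pattern does not match.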
df_valid = df_valid.withColumn("Valid Name", regexp_extract(col("Employee Name"), "^[a-zA-Z]+\\s*[a-zA-Z]*$", 0))
df_valid = df_valid.filter(col("Valid Name") != "")

# Show validated DataFrame
df_valid.show()

# Stop Spark session
spark.stop()


Regular Expression Breakdown
The regular expression used here is "^[a-zA-Z]+\\s*[a-zA-Z]*$":

^: Asserts the start of a string. This ensures that the pattern must match from the very beginning of the string.
[a-zA-Z]+: Matches one or more (+) characters that are either lowercase (a-z) or uppercase (A-Z). This part expects at least one alphabetic character at the beginning of the string.
\s*: Matches zero or more (*) whitespace characters. The doubled backslash comes from Python string escaping, not from Spark: the literal "\\s" produces the two-character sequence \s, which is what the regex engine receives. A raw string, r"\s", is equivalent.
[a-zA-Z]*: Matches zero or more (*) alphabetic characters following the optional whitespace. This allows for a second part of a name that can be completely omitted.
$: Asserts the end of a string. This ensures that the pattern must match until the very end of the string.
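
The doubled backslash in the pattern is a Python string-literal escape. A quick check, sketched in plain Python with no Spark needed, suggests that the escaped literal and a raw string carry the same pattern text. The variable names here are illustrative only.

```python
# The escaped literal and the raw string spell the same pattern text.
escaped = "^[a-zA-Z]+\\s*[a-zA-Z]*$"
raw = r"^[a-zA-Z]+\s*[a-zA-Z]*$"

print(escaped == raw)  # True
print(escaped)         # ^[a-zA-Z]+\s*[a-zA-Z]*$
```

Either spelling can be passed to regexp_extract; the raw string is often easier to read once a pattern contains several backslashes.
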

What the Regex Does
This regex pattern validates names consisting of one or two words of ASCII letters (e.g., "John" or "John Doe"). Because \s* accepts any amount of whitespace, the two words may be separated by several spaces or a tab, and a single word may carry trailing whitespace ("John "). Names with leading whitespace, names with three or more words (e.g., "Mary Jo Smith"), and names containing characters outside a-z and A-Z (digits, hyphens, apostrophes, accented letters) will not match.
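
To see which names the pattern accepts, it can be tried against a few sample strings. This is a sketch using Python's re module rather than Spark's Java regex engine; the re.ASCII flag is an assumption made so that \s covers the same ASCII whitespace set, and the sample names are made up for illustration.

```python
import re

# Same pattern as in the Spark code, compiled with Python's re module.
# re.ASCII restricts \s to ASCII whitespace (an assumption, to mirror Java's default).
NAME_PATTERN = re.compile(r"^[a-zA-Z]+\s*[a-zA-Z]*$", re.ASCII)

# Hypothetical sample names, not taken from the DataFrame above.
samples = ["John", "John Doe", "John   Doe", "John ", " John",
           "Mary Jo Smith", "O'Brien", "Anne-Marie", "R2D2"]

# Map each sample to True (pattern matches) or False (it does not).
results = {name: bool(NAME_PATTERN.search(name)) for name in samples}

for name, ok in results.items():
    print(f"{name!r:>16} -> {'match' if ok else 'no match'}")
```
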

Practical Example
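When the withColumn line is evaluated (Spark is lazy, so the work actually happens when df_valid.show() runs), Spark applies the regex to each entry in the "Employee Name" column and populates the "Valid Name" column with the matched text if the name fits the criteria described, or an empty string if it doesn't. The 0 passed to regexp_extract selects group 0, the entire match, and the filter that follows drops the rows whose "Valid Name" is empty. In the sample data every name and salary passes, so all five rows are shown.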
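
The extract-then-filter logic can be mimicked in plain Python to make the behaviour concrete. This is only a stand-in for what Spark computes, not Spark code: extract_group0 is a hypothetical helper imitating regexp_extract(..., 0), and the rows are invented so that some of them fail validation.

```python
import re

PATTERN = re.compile(r"^[a-zA-Z]+\s*[a-zA-Z]*$", re.ASCII)

def extract_group0(value: str) -> str:
    """Hypothetical stand-in for regexp_extract(col, pattern, 0):
    return the whole match, or "" when the pattern does not match."""
    m = PATTERN.search(value)
    return m.group(0) if m else ""

# Invented rows: (Employee Name, Department, Salary)
rows = [("James", "Sales", 3000), ("Ann Lee", "Finance", 4100),
        ("R2D2", "IT", 3900), ("Maria", "Finance", -50)]

# Step 1: keep rows with Salary > 0.
positive = [r for r in rows if r[2] > 0]
# Step 2: append the "Valid Name" value to each remaining row.
with_valid = [r + (extract_group0(r[0]),) for r in positive]
# Step 3: drop rows whose "Valid Name" is the empty string.
validated = [r for r in with_valid if r[3] != ""]

for row in validated:
    print(row)
```

Here "Maria" is removed by the salary check and "R2D2" by the name check, leaving the "James" and "Ann Lee" rows.
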