Problem Statement:

You have a dataset in a PySpark DataFrame containing a column with string values. You need to compute the count of occurrences of a specific character (e.g., 'l') in each string of the column. The solution should work efficiently at scale and can be implemented using either the DataFrame API (with expr) or a Spark SQL query. Both approaches are shown below.

In [0]:
from pyspark.sql.functions import expr

# Sample DataFrame
data = [("hello",), ("world",), ("spark",), ("hello world",)]
columns = ["text"]
df = spark.createDataFrame(data, columns)
df.display()

text
hello
world
spark
hello world


In [0]:
# Define the character to count
char_to_count = 'l'

# Count the occurrences of the character
df_with_count = df.withColumn(
    "char_count",
    expr(f"length(text) - length(replace(text, '{char_to_count}', ''))")
)

df_with_count.display()


text,char_count
hello,2
world,1
spark,0
hello world,3


In [0]:
# Create a temporary view for SQL queries
df.createOrReplaceTempView("text_table")

In [0]:
# Define the character to count
char_to_count = 'l'

# Write a SQL query to count occurrences of the character
query = f"""
SELECT 
    text, 
    LENGTH(text) - LENGTH(REPLACE(text, '{char_to_count}', '')) AS char_count 
FROM text_table
"""

result_df = spark.sql(query)

result_df.display()

text,char_count
hello,2
world,1
spark,0
hello world,3
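
Both snippets above interpolate char_to_count directly into a SQL string literal, so a character such as a single quote or a backslash would produce an invalid expression. The following is a minimal sketch of one way to guard against that. The helper names sql_string_literal and char_count_expr are hypothetical, and the escaping assumes Spark's default handling of backslash escapes in string literals. The column name is assumed to be a trusted identifier and is not escaped.

```python
def sql_string_literal(value: str) -> str:
    """Return value as a single-quoted SQL string literal (hypothetical helper)."""
    # Escape backslashes first, then single quotes. This assumes Spark's
    # default behaviour of processing backslash escapes in string literals.
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def char_count_expr(column: str, char: str) -> str:
    """Build the length-difference expression for a single character."""
    if len(char) != 1:
        raise ValueError("char must be exactly one character")
    literal = sql_string_literal(char)
    return f"length({column}) - length(replace({column}, {literal}, ''))"


# The resulting string can be passed to expr(...) or embedded in a SQL query.
print(char_count_expr("text", "l"))
print(char_count_expr("text", "'"))
```

With this helper, the DataFrame version would become df.withColumn("char_count", expr(char_count_expr("text", char_to_count))).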


Explanation:

replace(text, char_to_count, ''): Removes all occurrences of the specified character.

length(text): Computes the original length of the string.

length(replace(...)): Computes the length of the string after removing the specified character.

length(text) - length(replace(...)): The difference gives the count of the character, because each removed occurrence shortens the string by exactly one character.

Both approaches evaluate the same built-in Spark SQL expression, once through expr and once through spark.sql, so no Python UDF is involved and the work stays inside Spark's engine.
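
The length-difference arithmetic can be checked outside Spark. The sketch below is plain Python, not PySpark, and the function names count_char and count_substring are hypothetical. It reproduces the counts from the sample data and shows the adjustment needed if the target were a multi-character substring rather than a single character.

```python
def count_char(text: str, char: str) -> int:
    """Count occurrences of one character via the length difference."""
    return len(text) - len(text.replace(char, ""))


def count_substring(text: str, sub: str) -> int:
    """Same idea for a longer substring: each removal shortens the
    string by len(sub), so divide the difference by that length."""
    return (len(text) - len(text.replace(sub, ""))) // len(sub)


samples = ["hello", "world", "spark", "hello world"]
counts = {s: count_char(s, "l") for s in samples}
print(counts)  # {'hello': 2, 'world': 1, 'spark': 0, 'hello world': 3}

print(count_substring("hello world", "lo"))  # 1
```

The same division by the substring length would be needed in the Spark expression if char_to_count held more than one character.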