Problem statement:
Splitting Celebrity Names into First, Middle, and Last Names
You have a dataset containing celebrity names in a single column (e.g., celebrity_name). Each name may consist of one, two, or three parts:

Some names include only the first name.

Some names include the first and last name.

Some names include the first, middle, and last name.

You need to split this celebrity_name column into three separate columns:

First Name (fn)
Middle Name (mn) (or NULL if not available)
Last Name (ln) (or NULL if not available)
The solution should handle names with varying lengths (1, 2, or 3 parts) and output the first, middle, and last names accordingly.
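
As a reference for the rule stated above, here is a minimal plain-Python sketch of the expected mapping. The helper name split_name is hypothetical and is not part of the notebook; it only assumes that parts are separated by single spaces, as in the sample data.

```python
from typing import Optional, Tuple


def split_name(name: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Return (fn, mn, ln) for a name of one, two, or three space-separated parts."""
    parts = name.split(" ")
    fn = parts[0]
    # A middle name exists only when there are three parts
    mn = parts[1] if len(parts) > 2 else None
    # The last name is the third part of three, or the second part of two
    if len(parts) > 2:
        ln = parts[2]
    elif len(parts) == 2:
        ln = parts[1]
    else:
        ln = None
    return fn, mn, ln


for name in ["Virat Kohli", "Narendra Damodardas Modi", "Salman"]:
    print(name, "->", split_name(name))
```

The key point is that a two-part name fills fn and ln and leaves mn empty, which is the behaviour any Spark solution should reproduce.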

In [0]:
from pyspark.sql.types import StructType, StructField, StringType

# Define your data and schema
data = [
    ('Virat Kohli',),
    ('Narendra Damodardas Modi',),
    ('Salman',),
]
schema = StructType([
    StructField('celebrity_name', StringType(), True),
])

# Create DataFrame
df = spark.createDataFrame(data=data, schema=schema)

# Display the DataFrame
df.display()


celebrity_name
Virat Kohli
Narendra Damodardas Modi
Salman


In [0]:
from pyspark.sql.functions import col, size, split, when

# Split the name on spaces into an array of parts
sdf = df.withColumn("split_names", split(col("celebrity_name"), " "))
n_parts = size(col("split_names"))

# fn is always the first part; mn exists only for three-part names;
# ln is the third part of three, or the second part of two
finaldf = (sdf
           .withColumn('fn', col("split_names").getItem(0))
           .withColumn('mn', when(n_parts > 2, col("split_names").getItem(1)))
           .withColumn('ln', when(n_parts > 2, col("split_names").getItem(2))
                             .when(n_parts == 2, col("split_names").getItem(1)))
          )

# Show the resulting DataFrame
finaldf.display()


celebrity_name,split_names,fn,mn,ln
Virat Kohli,"List(Virat, Kohli)",Virat,,Kohli
Narendra Damodardas Modi,"List(Narendra, Damodardas, Modi)",Narendra,Damodardas,Modi
Salman,List(Salman),Salman,,


Explanation:

split: This function splits the celebrity_name column into an array of strings on the space delimiter (" ").
size: Returns the number of elements in the array, i.e. how many name parts are present.
getItem(n): Extracts the nth element (0-based index) from the array of strings generated by the split.
when: Selects an element only if the condition on the number of parts holds, so a two-part name puts its second part in ln rather than mn. A when chain without otherwise yields NULL when no condition matches, which is how missing middle and last names become NULL.

In [0]:
df.createOrReplaceTempView("celebrities")


In [0]:
result = spark.sql("""
    SELECT 
        celebrity_name,
        SPLIT(celebrity_name, ' ')[0] AS fn,
        CASE 
            WHEN SIZE(SPLIT(celebrity_name, ' ')) > 2 THEN SPLIT(celebrity_name, ' ')[1]
            ELSE NULL
        END AS mn,
        CASE 
            WHEN SIZE(SPLIT(celebrity_name, ' ')) > 2 THEN SPLIT(celebrity_name, ' ')[2]
            WHEN SIZE(SPLIT(celebrity_name, ' ')) = 2 THEN SPLIT(celebrity_name, ' ')[1]
            ELSE NULL
        END AS ln
    FROM celebrities
""")

# Show the result
result.show(truncate=False)


+------------------------+--------+----------+-----+
|celebrity_name          |fn      |mn        |ln   |
+------------------------+--------+----------+-----+
|Virat Kohli             |Virat   |null      |Kohli|
|Narendra Damodardas Modi|Narendra|Damodardas|Modi |
|Salman                  |Salman  |null      |null |
+------------------------+--------+----------+-----+



Explanation:

SPLIT(celebrity_name, ' '): Splits the celebrity_name column into an array using a space as the delimiter.

SIZE(): Returns the size of the array, which helps determine how many name components are available.
CASE WHEN: Handles different conditions to assign the middle name (mn) and last name (ln).
If there are more than two parts, the first part is the first name, the second part is the middle name, and the third part is the last name.
If there are exactly two parts, the first part is the first name, and the second part is considered the last name.
If there's only one part, the middle and last names are NULL.
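
One caveat worth illustrating: Spark's split treats its delimiter as a regular expression, so a single-space pattern produces an empty element wherever a name contains two consecutive spaces, and the part count is then off by one. The sketch below uses Python's re.split as a stand-in to show the effect; the helper name_parts and the double-spaced input are assumptions for illustration and do not come from the dataset above.

```python
import re


def name_parts(name: str) -> list:
    """Trim the name and split on runs of whitespace (hypothetical helper)."""
    return re.split(r"\s+", name.strip())


raw = "Virat  Kohli"  # note the double space between the two parts

# A single-space pattern keeps an empty string between the two spaces
print(re.split(" ", raw))

# Splitting on whitespace runs after trimming yields exactly two parts
print(name_parts(raw))
```

If the real data can contain irregular spacing, trimming the column and splitting on a whitespace-run pattern before applying the SIZE checks would be one way to keep the counts reliable; that variant is not tested in this notebook.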