# Problem Statement:
You are given two datasets representing cricket players' statistics and their associated country information. Your task is to transform and analyze this data using PySpark.

Dataset 1: Players

This dataset contains information about cricket players: their names (stored together with a country code in the format player-COUNTRY_CODE), their total runs scored, and their counts of 50s and 100s.
Columns:
player: The name of the player concatenated with a hyphen and the country code (e.g., Sachin-IND).
runs: The total runs scored by the player.
50s/100s: The number of 50s and 100s scored by the player, separated by a forward slash (e.g., 93/49).


Dataset 2: Countries

This dataset maps the country code to the full country name.
Columns:
SRT: The country code (e.g., IND).
country: The full name of the country (e.g., India).

Task:

Extract Information:

Split the player column in the Players dataset into two separate fields:
player_name: The name of the player.
SRT: The country code.
Split the 50s/100s column into two separate fields:
runs_50s: The number of 50s.
runs_100s: The number of 100s.

Join the Datasets:

Perform an inner join between the transformed Players dataset and the Countries dataset using the SRT field to map country codes to full country names.

Calculate the Sum:

Calculate the sum of 50s and 100s for each player.

Filter:

Filter the results to include only those players whose sum of 50s and 100s is greater than 90.

Sort:

Sort the filtered results by country name.
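
Before turning to Spark, the steps above can be sketched in plain Python to pin down the expected result. This is only an illustrative cross-check, not part of the notebook: the function name summarize and its threshold parameter are hypothetical, and the rows are copied from the sample data used in the cells below.

```python
# Plain-Python cross-check of the task steps (no Spark required).
# `summarize` and `threshold` are hypothetical names used for illustration.
players = [
    ("Sachin-IND", 18694, "93/49"),
    ("Ricky-AUS", 11274, "66/31"),
    ("Lara-WI", 10222, "45/21"),
    ("Rahul-IND", 10355, "95/11"),
    ("Jhonty-SA", 7051, "43/5"),
    ("Hayden-AUS", 8722, "67/19"),
]
countries = {"IND": "India", "AUS": "Australia", "WI": "WestIndies", "SA": "SouthAfrica"}


def summarize(players, countries, threshold=90):
    rows = []
    for player, runs, fifties_hundreds in players:
        # Extract: split "name-CODE" and "50s/100s" into their parts
        player_name, srt = player.split("-")
        runs_50s, runs_100s = (int(part) for part in fifties_hundreds.split("/"))
        # Inner join: skip players whose code has no matching country
        if srt not in countries:
            continue
        # Sum and filter
        total = runs_50s + runs_100s
        if total > threshold:
            rows.append((player_name, countries[srt], runs, total))
    # Sort by country name
    return sorted(rows, key=lambda row: row[1])


result = summarize(players, countries)
for row in result:
    print(row)
```

Python's sorted() is stable, so the two India rows keep their input order here. Spark's ORDER BY on country alone makes no such promise for ties, so the relative order of rows with the same country may differ between runs.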

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Initialize Spark session
spark = SparkSession.builder.appName("CricketData").getOrCreate()

# Define schema for players_df
schema = StructType([
    StructField("player", StringType(), True),
    StructField("runs", IntegerType(), True),
    StructField("50s/100s", StringType(), True)
])

# Create players_df DataFrame
data = [
    ("Sachin-IND", 18694, "93/49"),
    ("Ricky-AUS", 11274, "66/31"),
    ("Lara-WI", 10222, "45/21"),
    ("Rahul-IND", 10355, "95/11"),
    ("Jhonty-SA", 7051, "43/5"),
    ("Hayden-AUS", 8722, "67/19")
]
players_df = spark.createDataFrame(data, schema)

# Create countries_df DataFrame
data1 = [("IND", "India"), ("AUS", "Australia"), ("WI", "WestIndies"), ("SA", "SouthAfrica")]
countries_df = spark.createDataFrame(data1, ["SRT", "country"])

# Show the DataFrames
players_df.display()
countries_df.display()


player,runs,50s/100s
Sachin-IND,18694,93/49
Ricky-AUS,11274,66/31
Lara-WI,10222,45/21
Rahul-IND,10355,95/11
Jhonty-SA,7051,43/5
Hayden-AUS,8722,67/19


SRT,country
IND,India
AUS,Australia
WI,WestIndies
SA,SouthAfrica


In [0]:
players_df.createOrReplaceTempView('players')
countries_df.createOrReplaceTempView('country')

In [0]:
# Execute the Spark SQL query
result_df = spark.sql("""
    WITH player AS (
        SELECT 
            player,
            runs,
            SPLIT(player, '-')[0] AS player_name,
            SPLIT(player, '-')[1] AS SRT,
            SPLIT(`50s/100s`, '/')[0] AS runs_50s,
            SPLIT(`50s/100s`, '/')[1] AS runs_100s
        FROM players
    )
    SELECT 
        p.player_name AS playername,
        c.country,
        p.runs,
        CAST(p.runs_50s AS INT) + CAST(p.runs_100s AS INT) AS sum
    FROM player p
    INNER JOIN country c 
    ON p.SRT = c.SRT
    WHERE CAST(p.runs_50s AS INT) + CAST(p.runs_100s AS INT) > 90
    ORDER BY c.country
""")

# Show the result
result_df.display()

playername,country,runs,sum
Ricky,Australia,11274,97
Sachin,India,18694,142
Rahul,India,10355,106


In [0]:
from pyspark.sql.functions import col, split

# Split the player column into player_name and SRT using the '-' delimiter
players_transformed_df = players_df.withColumn(
    "player_name", split(col("player"), "-").getItem(0)
).withColumn("SRT", split(col("player"), "-").getItem(1))

# Split the 50s/100s column into runs_50s and runs_100s using the '/' delimiter
players_transformed_df = (
    players_transformed_df.withColumn(
        "runs_50s", split(col("50s/100s"), "/").getItem(0).cast("int")
    )
    .withColumn("runs_100s", split(col("50s/100s"), "/").getItem(1).cast("int"))
    .withColumn("total_50s_100s", col("runs_50s") + col("runs_100s"))
)

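# Inner join with countries_df on SRT, filter on the total, then select and sort.
# Joining on the column name keeps a single SRT column in the result.
result_df = (
    players_transformed_df.join(countries_df, on="SRT", how="inner")
    # Filter before the select so the column is referenced by its current name
    .where(col("total_50s_100s") > 90)
    .select(
        col("player_name").alias("playername"),
        col("country"),
        col("runs"),
        col("total_50s_100s").alias("sum"),
    )
    .orderBy("country")
)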

# Show the result
result_df.display()

playername,country,runs,sum
Ricky,Australia,11274,97
Sachin,India,18694,142
Rahul,India,10355,106


Explanation of Changes:

Splitting Columns: Instead of substring, the DataFrame version uses PySpark's split() function to break the player and 50s/100s columns into separate parts.

split(col("player"), "-").getItem(0): Extracts the part of the string before the hyphen (player name).
split(col("player"), "-").getItem(1): Extracts the part of the string after the hyphen (country code).
split(col("50s/100s"), "/").getItem(0) and split(col("50s/100s"), "/").getItem(1): Extract the number of 50s and 100s as strings, which are then cast to int.

Column Operations: Arithmetic such as adding runs_50s and runs_100s uses the + operator on Column objects, so Spark evaluates it row by row on the DataFrame rather than in Python.

This approach avoids the "Column is not iterable" error, which PySpark typically raises when a Column is passed where a plain Python value is expected (for example, as the position or length argument of substring in versions that require integers there), and it produces the same output as the SQL query.