Problem Statement:

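You are tasked with analyzing cricket match data stored in a PySpark DataFrame. Each row in the dataset represents one match with the following details:

match_id:
An integer identifier for the match.
team1:
The first team in the match.
team2:
The second team in the match.
result:
The name of the winning team, or "DRAW" if the match was drawn.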

The goal is to compute the following statistics for each team:

Team Name: 
The name of the cricket team.
Number of Matches Played: 
The total number of matches in which the team participated.
Number of Matches Won: 
The number of matches the team won (when the team's name matches the result column).
Number of Matches Lost: 
The number of matches the team lost (when the result column does not match the team name and is not "DRAW").
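
The won/lost rules above can be sketched in plain Python before any Spark code is written. This is only an illustration: the function name `outcome_for` is hypothetical, and it assumes the result column holds either a team name or the literal string "DRAW".

```python
def outcome_for(team: str, result: str) -> str:
    """Classify one match from the point of view of `team`.

    Assumes `team` took part in the match and `result` is either the
    winner's name or the literal string "DRAW".
    """
    if result == team:
        return "won"
    if result == "DRAW":
        return "drawn"
    return "lost"


# Matches 1 and 13 of the sample data defined in the next cell
print(outcome_for("NZ", "NZ"))    # NZ beat ENG in match 1
print(outcome_for("ENG", "NZ"))   # ENG lost match 1
print(outcome_for("SA", "DRAW"))  # match 13 was drawn
```

A drawn match falls into neither the won nor the lost bucket, which is why the "lost" rule has to exclude "DRAW" explicitly.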

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql.functions import col, when, count, sum as _sum

# Define the schema for the cricket_match DataFrame
schema = StructType([
    StructField("match_id", IntegerType(), True),
    StructField("team1", StringType(), True),
    StructField("team2", StringType(), True),
    StructField("result", StringType(), True)
])

# Define the data
data = [
    (1, 'ENG', 'NZ', 'NZ'),
    (2, 'PAK', 'NED', 'PAK'),
    (3, 'AFG', 'BAN', 'BAN'),
    (4, 'SA', 'SL', 'SA'),
    (5, 'AUS', 'IND', 'AUS'),
    (6, 'NZ', 'NED', 'NZ'),
    (7, 'ENG', 'BAN', 'ENG'),
    (8, 'SL', 'PAK', 'PAK'),
    (9, 'AFG', 'IND', 'IND'),
    (10, 'SA', 'AUS', 'SA'),
    (11, 'BAN', 'NZ', 'BAN'),
    (12, 'PAK', 'IND', 'IND'),
    (13, 'SA', 'IND', 'DRAW')
]

# Create a PySpark DataFrame (`spark` is the SparkSession provided by the notebook)
df = spark.createDataFrame(data, schema)

# Show the DataFrame
df.display()


match_id,team1,team2,result
1,ENG,NZ,NZ
2,PAK,NED,PAK
3,AFG,BAN,BAN
4,SA,SL,SA
5,AUS,IND,AUS
6,NZ,NED,NZ
7,ENG,BAN,ENG
8,SL,PAK,PAK
9,AFG,IND,IND
10,SA,AUS,SA
11,BAN,NZ,BAN
12,PAK,IND,IND
13,SA,IND,DRAW


In [0]:
# Step 1: Stack team1 and team2 into a single `team` column, keeping match_id and result.
# DataFrame.union keeps duplicate rows, so it mirrors the UNION ALL in the `all_matches` CTE of the SQL version.
all_matches = df.select(
    col("match_id"),
    col("team1").alias("team"),
    col("result")
).union(
    df.select(
        col("match_id"),
        col("team2").alias("team"),
        col("result")
    )
)

# Step 2: Calculate statistics for each team
team_stats = all_matches.groupBy("team").agg(
    count("match_id").alias("matches_played"),
    _sum(when(col("result") == col("team"), 1).otherwise(0)).alias("matches_won"),
    _sum(when((col("result") != col("team")) & (col("result") != "DRAW"), 1).otherwise(0)).alias("matches_lost")
)

# Step 3: Show the results
team_stats.orderBy("team").display()

team,matches_played,matches_won,matches_lost
AFG,2,0,2
AUS,2,1,1
BAN,3,2,1
ENG,2,1,1
IND,4,2,1
NED,2,0,2
NZ,3,2,1
PAK,3,2,1
SA,3,2,0
SL,2,0,2


In [0]:
df.createOrReplaceTempView("cricket_match")

In [0]:
%sql
WITH all_matches AS (
    SELECT match_id, team1 AS team, result FROM cricket_match
    UNION ALL
    SELECT match_id, team2 AS team, result FROM cricket_match
),
team_stats AS (
    SELECT 
        team AS team_name,
        COUNT(match_id) AS matches_played,
        SUM(CASE WHEN result = team THEN 1 ELSE 0 END) AS matches_won,
        SUM(CASE WHEN result != team AND result != 'DRAW' THEN 1 ELSE 0 END) AS matches_lost
    FROM all_matches
    GROUP BY team
)
SELECT *
FROM team_stats
ORDER BY team_name;



team_name,matches_played,matches_won,matches_lost
AFG,2,0,2
AUS,2,1,1
BAN,3,2,1
ENG,2,1,1
IND,4,2,1
NED,2,0,2
NZ,3,2,1
PAK,3,2,1
SA,3,2,0
SL,2,0,2


Explanation:

First CTE (all_matches):
Combines team1 and team2 into a single column (team) using UNION ALL, so every match appears once for each of its two teams.

Second CTE (team_stats):
Calculates the required statistics for each team:
COUNT(match_id) for the total matches played.
SUM(CASE WHEN result = team THEN 1 ELSE 0 END) for the matches won.
SUM(CASE WHEN result != team AND result != 'DRAW' THEN 1 ELSE 0 END) for the matches lost. A drawn match counts as played but as neither won nor lost.

Final SELECT:
Fetches all columns from the team_stats CTE and orders the output by team_name.
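
As a cross-check on the output above, the same unpivot-and-aggregate logic can be reproduced in plain Python without Spark. This is a minimal sketch and not part of the original solution; the names `matches` and `team_stats_py` are hypothetical, and the rows are copied from the sample data.

```python
from collections import defaultdict

# Sample rows copied from the notebook: (match_id, team1, team2, result)
matches = [
    (1, "ENG", "NZ", "NZ"),
    (2, "PAK", "NED", "PAK"),
    (3, "AFG", "BAN", "BAN"),
    (4, "SA", "SL", "SA"),
    (5, "AUS", "IND", "AUS"),
    (6, "NZ", "NED", "NZ"),
    (7, "ENG", "BAN", "ENG"),
    (8, "SL", "PAK", "PAK"),
    (9, "AFG", "IND", "IND"),
    (10, "SA", "AUS", "SA"),
    (11, "BAN", "NZ", "BAN"),
    (12, "PAK", "IND", "IND"),
    (13, "SA", "IND", "DRAW"),
]


def team_stats_py(rows):
    """Return {team: [played, won, lost]} using the same rules as the SQL query."""
    stats = defaultdict(lambda: [0, 0, 0])
    for _match_id, team1, team2, result in rows:
        # Equivalent of the UNION ALL: visit the match once per participating team
        for team in (team1, team2):
            stats[team][0] += 1          # matches_played
            if result == team:
                stats[team][1] += 1      # matches_won
            elif result != "DRAW":
                stats[team][2] += 1      # matches_lost
    return dict(stats)


stats = team_stats_py(matches)
for team in sorted(stats):
    played, won, lost = stats[team]
    print(team, played, won, lost)
```

For the sample data this prints the same ten rows as the result tables above. SA shows 3 played, 2 won, 0 lost because its drawn match counts only toward matches played.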