Problem Statement:

Write a SQL query to report the number of bank accounts in each salary category.
The salary categories are:

"Low Salary": all salaries strictly less than 20000.

"Average Salary": all salaries in the inclusive range [20000, 50000].

"High Salary": all salaries strictly greater than 50000.

The result table must contain all three categories.
If there are no accounts in a category, report 0.
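
Before turning to Spark, the category boundaries can be pinned down with a small plain-Python reference. This is only a sketch for checking the rules by hand: the names `salary_category` and `count_by_category` are hypothetical helpers, not part of the notebook, and incomes are assumed to be non-null integers.

```python
from collections import Counter

# Category labels in the order the problem statement lists them.
CATEGORIES = ("Low Salary", "Average Salary", "High Salary")


def salary_category(income):
    """Map one income to its category (hypothetical helper)."""
    if income < 20000:
        return "Low Salary"
    if income <= 50000:  # inclusive range [20000, 50000]
        return "Average Salary"
    return "High Salary"


def count_by_category(incomes):
    """Count accounts per category, reporting 0 for empty categories."""
    counts = Counter(salary_category(income) for income in incomes)
    # A Counter returns 0 for missing keys, so all three categories always appear.
    return {category: counts[category] for category in CATEGORIES}


sample_counts = count_by_category([108939, 12747, 87709, 91796])
print(sample_counts)
```

The boundary values 20000 and 50000 both fall into "Average Salary", which is the detail most likely to go wrong when the rules are translated into filters.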

In [0]:
from pyspark.sql.types import IntegerType, StructField, StructType

# Define schema
schema = StructType(
    [
        StructField("account_id", IntegerType(), True),
        StructField("income", IntegerType(), True),
    ]
)

# Create data
data = [(3, 108939), (2, 12747), (8, 87709), (6, 91796)]

# Create DataFrame
accounts_df = spark.createDataFrame(data, schema)

# Show the DataFrame
accounts_df.display()

account_id,income
3,108939
2,12747
8,87709
6,91796


In [0]:
# Register a temporary view so the DataFrame can be queried with SQL
accounts_df.createOrReplaceTempView("Accounts")

# Example SQL Query
result = spark.sql("SELECT * FROM Accounts WHERE income > 50000")
result.display()

account_id,income
3,108939
8,87709
6,91796


In [0]:
from pyspark.sql.functions import col, count, lit

# Each agg() below is a global aggregate (no groupBy), so it returns
# exactly one row even when the filter matches no accounts.

# Incomes strictly below 20000
low_salary = accounts_df.filter(col("income") < 20000).agg(
    lit("low salary").alias("category"), count("*").alias("accounts_count")
)
# Incomes in the inclusive range [20000, 50000]
average_salary = accounts_df.filter(
    (col("income") >= 20000) & (col("income") <= 50000)
).agg(lit("Average salary").alias("category"), count("*").alias("accounts_count"))
# Incomes strictly above 50000
high_salary = accounts_df.filter(col("income") > 50000).agg(
    lit("high salary").alias("category"), count("*").alias("accounts_count")
)
# Union all results
result = low_salary.union(average_salary).union(high_salary)

# Show the result
result.display()

category,accounts_count
low salary,1
Average salary,0
high salary,3


In [0]:
%sql
with cte as(
  select
    'low salary' as category,
    count(
      case
        when income < 20000 then 1
      end
    ) as accounts_count
  from
    Accounts
  union all
  select
    'Average salary' as category,
    count(*) as accounts_count
  from
    Accounts
  where
    income between 20000
    and 50000
  union all
  select
    'high salary' as category,
    count(*) as accounts_count
  from
    Accounts
  where
    income > 50000
)
select
  *
from
  cte

category,accounts_count
low salary,1
Average salary,0
high salary,3


Explanation:

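filter(): Keeps only the rows whose income falls in the given category's range.

agg(): Counts the filtered rows and attaches the category name as a literal column. Because agg() is called without groupBy(), it always returns exactly one row, so a category with no matching accounts still appears with a count of 0.

union(): Stacks the three one-row results into a single DataFrame. Columns are matched by position, so each branch must list category and accounts_count in the same order.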
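
A single GROUP BY on a computed category would drop any category with no accounts, which is why the solutions above count each category separately. One alternative, sketched below, labels each account with a CASE expression and left joins from a fixed list of the three categories. The sketch runs the query through Python's built-in `sqlite3` module purely so that it can be tried without a Spark cluster; the SQL uses only standard constructs, but it has not been run against Spark here, and the helper name `count_accounts_by_category` is hypothetical.

```python
import sqlite3

# Label each account, then LEFT JOIN from the fixed category list so that
# categories with no matching accounts still produce a row with count 0.
QUERY = """
SELECT c.category, COUNT(a.account_id) AS accounts_count
FROM (
    SELECT 'Low Salary' AS category
    UNION ALL SELECT 'Average Salary'
    UNION ALL SELECT 'High Salary'
) AS c
LEFT JOIN (
    SELECT account_id,
           CASE
               WHEN income < 20000 THEN 'Low Salary'
               WHEN income <= 50000 THEN 'Average Salary'
               WHEN income > 50000 THEN 'High Salary'
           END AS category
    FROM Accounts
) AS a ON a.category = c.category
GROUP BY c.category
"""


def count_accounts_by_category(rows):
    """Load (account_id, income) rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Accounts (account_id INTEGER, income INTEGER)")
        conn.executemany("INSERT INTO Accounts VALUES (?, ?)", rows)
        # Row order is not guaranteed without ORDER BY, so return a dict.
        return dict(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


counts = count_accounts_by_category([(3, 108939), (2, 12747), (8, 87709), (6, 91796)])
print(counts)
```

COUNT(a.account_id) counts only non-null values, so an unmatched category contributes 0 rather than 1. The CASE expression has no ELSE branch, so an account with a NULL income gets no label and is not counted in any category.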