### Problem Statement:
Candidate Experience Analysis
You are provided with a dataset named assessments that contains the following columns:

- id: Unique identifier for each candidate.
- experience: The candidate's years of experience.
- sql: Score for SQL proficiency (0 to 100, or NULL).
- algo: Score for algorithm knowledge (0 to 100, or NULL).
- bug_fixing: Score for debugging skills (0 to 100, or NULL).

Your goal is to analyze the data to:

1. Determine the total number of candidates for each experience level.
2. Count how many candidates per experience level meet all three of the following max-score conditions at once:
    - SQL score is 100 or is NULL.
    - Algorithm score is 100 or is NULL.
    - Bug fixing score is 100 or is NULL.

The output should include:

- experience: Candidate experience level.
- total_candidate: Total number of candidates with that experience level.
- max_score_flag: Number of candidates with that experience level who satisfy all three max-score conditions.

The results should be grouped by experience and ordered by experience.



In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType

# Define the schema
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("experience", IntegerType(), True),
    StructField("sql", IntegerType(), True),
    StructField("algo", IntegerType(), True),
    StructField("bug_fixing", IntegerType(), True)
])

# Define the data
data = [
    (1, 3, 100, None, 50),
    (2, 5, None, 100, 100),
    (3, 1, 100, 100, 100),
    (4, 5, 100, 50, None),
    (5, 5, 100, 100, 100)
]

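# Create the DataFrame (assumes `spark` is the SparkSession provided by the notebook)
assessments_df = spark.createDataFrame(data, schema)

# Display the DataFrame (display() is a notebook helper, e.g. in Databricks; plain PySpark uses show())
assessments_df.display()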


id,experience,sql,algo,bug_fixing
1,3,100.0,,50.0
2,5,,100.0,100.0
3,1,100.0,100.0,100.0
4,5,100.0,50.0,
5,5,100.0,100.0,100.0


In [0]:
from pyspark.sql import functions as F

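# Per experience level, count all candidates and those meeting every max-score condition.
# F.when(...) without otherwise() yields NULL when the condition fails, and F.count skips NULLs.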
result_df = assessments_df.groupBy("experience").agg(
    F.count("*").alias("total_candidate"),
    F.count(
        F.when(
            ((F.col("sql").eqNullSafe(100) | F.col("sql").isNull()) &
             (F.col("algo").eqNullSafe(100) | F.col("algo").isNull()) &
             (F.col("bug_fixing").eqNullSafe(100) | F.col("bug_fixing").isNull())),
            1
        )
    ).alias("max_score_flag")
).orderBy("experience")

# display the results
result_df.display()


experience,total_candidate,max_score_flag
1,1,1
3,1,0
5,3,2


In [0]:
# Register the DataFrame as a temporary view
assessments_df.createOrReplaceTempView("assessments")

In [0]:
# Spark SQL query
query = """
SELECT 
    experience, 
    COUNT(*) AS total_candidate,
    COUNT(
        CASE 
            WHEN 
                (sql = 100 OR sql IS NULL) AND 
                (algo = 100 OR algo IS NULL) AND 
                (bug_fixing = 100 OR bug_fixing IS NULL) 
            THEN 1 
        END
    ) AS max_score_flag
FROM assessments
GROUP BY experience
ORDER BY experience
"""

# Execute the query
result_df = spark.sql(query)

# display the results
result_df.display()


experience,total_candidate,max_score_flag
1,1,1
3,1,0
5,3,2
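
The result above can be cross-checked without Spark. The following is a minimal plain-Python sketch that repeats the notebook's sample rows and applies the same rule: every score must be 100 or missing. The helper names `meets_max_score` and `summarize` are illustrative and not part of the notebook.

```python
from collections import defaultdict

# Same rows as the notebook's sample data: (id, experience, sql, algo, bug_fixing)
rows = [
    (1, 3, 100, None, 50),
    (2, 5, None, 100, 100),
    (3, 1, 100, 100, 100),
    (4, 5, 100, 50, None),
    (5, 5, 100, 100, 100),
]


def meets_max_score(*scores):
    """Return True when every score is either 100 or missing (None)."""
    return all(score is None or score == 100 for score in scores)


def summarize(rows):
    """Return (experience, total_candidate, max_score_flag) tuples sorted by experience."""
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for _id, experience, sql, algo, bug_fixing in rows:
        totals[experience] += 1
        if meets_max_score(sql, algo, bug_fixing):
            flagged[experience] += 1
    return [(exp, totals[exp], flagged[exp]) for exp in sorted(totals)]


summary = summarize(rows)
for row in summary:
    print(row)
# (1, 1, 1)
# (3, 1, 0)
# (5, 3, 2)
```

The tuples match the rows returned by both the DataFrame and the Spark SQL versions.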


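### Explanation of the Query:

- COUNT(*) AS total_candidate: Counts all rows in each group defined by experience.
- COUNT(CASE WHEN ... THEN 1 END) AS max_score_flag: Evaluates three conditions for each row, combined with AND:
    - (sql = 100 OR sql IS NULL)
    - (algo = 100 OR algo IS NULL)
    - (bug_fixing = 100 OR bug_fixing IS NULL)

  If all three are true, the CASE expression returns 1 and the row is counted. Otherwise, because there is no ELSE branch, it returns NULL, and COUNT ignores NULL values. The explicit IS NULL checks are required because a comparison such as sql = 100 evaluates to NULL, not true, when the score is NULL.
- GROUP BY experience: Groups the rows by the experience column.
- ORDER BY experience: Sorts the results by experience in ascending order (the default).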
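
To make the NULL handling concrete, here is a small plain-Python sketch that emulates SQL's three-valued logic, with `None` standing in for NULL. All function names (`sql_eq`, `sql_or`, `sql_and`, `case_when`, `sql_count`, `score_ok`) are hypothetical helpers written for this illustration. They are not Spark APIs, and they only approximate what the engine does.

```python
def sql_eq(a, b):
    """SQL '=': the result is unknown (None) if either operand is NULL."""
    if a is None or b is None:
        return None
    return a == b


def sql_or(a, b):
    """SQL OR: true if either side is true, unknown if undecided."""
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return False


def sql_and(a, b):
    """SQL AND: false if either side is false, unknown if undecided."""
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True


def case_when(cond):
    """CASE WHEN cond THEN 1 END: NULL unless cond is strictly true."""
    return 1 if cond is True else None


def sql_count(values):
    """COUNT(expr): counts only the non-NULL values."""
    return sum(value is not None for value in values)


def score_ok(score):
    """(score = 100 OR score IS NULL)"""
    return sql_or(sql_eq(score, 100), score is None)


# The three sample candidates with experience = 5: (sql, algo, bug_fixing)
group = [(None, 100, 100), (100, 50, None), (100, 100, 100)]
flags = [
    case_when(sql_and(sql_and(score_ok(s), score_ok(a)), score_ok(b)))
    for s, a, b in group
]

print(flags)             # [1, None, 1]
print(sql_count(flags))  # 2

# Without the IS NULL check, a NULL score makes the comparison unknown,
# so the CASE expression returns NULL and the row is not counted.
print(case_when(sql_eq(None, 100)))  # None
```

The count of 2 matches the max_score_flag value reported for experience 5.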