Given a table of candidates and their skills, you're tasked with finding the candidates best suited for an open Data Science job. You want to find candidates who are proficient in Python, Tableau, and PostgreSQL.

Write a query to list the candidates who possess all of the required skills for the job. Sort the output by candidate ID in ascending order.

Assumption:

There are no duplicates in the candidates table.

candidates Table:

Column Name	Type
candidate_id	integer
skill	varchar

candidates Example Input:

candidate_id	skill
123	Python
123	Tableau
123	PostgreSQL
234	R
234	PowerBI
234	SQL Server
345	Python
345	Tableau

Example Output:

candidate_id
123

Explanation
Candidate 123 is displayed because they have Python, Tableau, and PostgreSQL skills. 345 isn't included in the output because they're missing one of the required skills: PostgreSQL.
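
The matching rule in this explanation can be sketched in plain Python, without Spark, as a set-containment check. This is only an illustration under stated assumptions: the helper name `candidates_with_all_skills` and the in-memory `rows` list are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

REQUIRED_SKILLS = {"Python", "Tableau", "PostgreSQL"}


def candidates_with_all_skills(rows, required=REQUIRED_SKILLS):
    """Return sorted candidate IDs whose skills include every required skill.

    Hypothetical helper for illustration; `rows` is an iterable of
    (candidate_id, skill) pairs.
    """
    skills_by_candidate = defaultdict(set)
    for candidate_id, skill in rows:
        skills_by_candidate[candidate_id].add(skill)
    # A candidate qualifies when the required set is a subset of their skills.
    return sorted(
        cid
        for cid, skills in skills_by_candidate.items()
        if set(required) <= skills
    )


# Same rows as the example input.
rows = [
    (123, "Python"), (123, "Tableau"), (123, "PostgreSQL"),
    (234, "R"), (234, "PowerBI"), (234, "SQL Server"),
    (345, "Python"), (345, "Tableau"),
]

print(candidates_with_all_skills(rows))
```

Candidate 345 drops out because the required set is not a subset of `{"Python", "Tableau"}`.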

The dataset you are querying against may have different input & output - this is just an example!


In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, countDistinct

# Initialize Spark session
spark = SparkSession.builder.appName("CandidatesSkills").getOrCreate()

# Sample data
data = [
    (123, 'Python'),
    (123, 'Tableau'),
    (123, 'PostgreSQL'),
    (234, 'R'),
    (234, 'PowerBI'),
    (234, 'SQL Server'),
    (345, 'Python'),
    (345, 'Tableau')
]

# Create DataFrame
candidates_df = spark.createDataFrame(data, ["candidate_id", "skill"])


candidates_df.display()


candidate_id,skill
123,Python
123,Tableau
123,PostgreSQL
234,R
234,PowerBI
234,SQL Server
345,Python
345,Tableau


In [0]:
# Define the required skills
required_skills = {'Python', 'Tableau', 'PostgreSQL'}

# Filter for the required skills
filtered_df = candidates_df.filter(col("skill").isin(required_skills))

# Group by candidate_id and count distinct skills
grouped_df = filtered_df.groupBy("candidate_id").agg(countDistinct("skill").alias("skill_count"))

# Filter candidates with all required skills (3 distinct skills)
result_df = grouped_df.filter(col("skill_count") == len(required_skills))

# Keep only candidate_id, sorted in ascending order
result_df = result_df.select("candidate_id").orderBy("candidate_id")

# Show the results
result_df.show()

+------------+
|candidate_id|
+------------+
|         123|
+------------+
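
The stated assumption is that the table has no duplicates, so a plain row count would give the same answer here. Counting distinct skills is the more defensive choice, as the small sketch below suggests. The duplicated row and the variable names are hypothetical, introduced only for illustration.

```python
from collections import Counter

required = {"Python", "Tableau", "PostgreSQL"}

# Hypothetical input: candidate 345 has a duplicated Python row.
rows_with_dupes = [(345, "Python"), (345, "Python"), (345, "Tableau")]

# Plain row count per candidate (what COUNT(*) would see after filtering).
row_counts = Counter(
    cid for cid, skill in rows_with_dupes if skill in required
)

# Distinct count per candidate (what COUNT(DISTINCT skill) would see).
distinct_counts = Counter(
    cid for cid, skill in set(rows_with_dupes) if skill in required
)

print(row_counts[345] == len(required))       # duplicate inflates the count
print(distinct_counts[345] == len(required))  # distinct count is not fooled
```

With the duplicate present, the plain row count reaches 3 even though PostgreSQL is missing, while the distinct count stays at 2.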



In [0]:
candidates_df.createOrReplaceTempView('candidates')

In [0]:
%sql
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT skill) = 3
ORDER BY candidate_id ASC;


candidate_id
123
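
To try the same SQL outside a Spark or Databricks environment, one option is Python's built-in `sqlite3` module with an in-memory database. This is a sketch under assumptions: SQLite stands in for the notebook's SQL engine, and the table is rebuilt by hand from the example input.

```python
import sqlite3

# Same rows as the example input.
rows = [
    (123, "Python"), (123, "Tableau"), (123, "PostgreSQL"),
    (234, "R"), (234, "PowerBI"), (234, "SQL Server"),
    (345, "Python"), (345, "Tableau"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (candidate_id INTEGER, skill VARCHAR)")
conn.executemany("INSERT INTO candidates VALUES (?, ?)", rows)

query = """
SELECT candidate_id
FROM candidates
WHERE skill IN ('Python', 'Tableau', 'PostgreSQL')
GROUP BY candidate_id
HAVING COUNT(DISTINCT skill) = 3
ORDER BY candidate_id ASC;
"""

# Each result row is a 1-tuple; unpack it into a flat list of IDs.
result = [row[0] for row in conn.execute(query)]
conn.close()

print(result)
```

The query text is unchanged from the `%sql` cell; only the surrounding setup differs.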
