# SQL-Only Operations in Spark SQL

This notebook demonstrates three Spark SQL operations on a small employees/departments dataset, using `spark.sql` queries and the DataFrame API:
1. Complex Joins with Multiple Conditions
2. Window Functions
3. Subqueries

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import col, rank

# Initialize Spark session
spark = SparkSession.builder.master("local").appName("SQLOnlyFunctions").getOrCreate()

# Create sample data for employees and departments DataFrames
employees_data = [
    (1, "Alice", 10000, "HR"),
    (2, "Bob", 20000, "Engineering"),
    (3, "Charlie", 30000, "Engineering"),
    (4, "David", 25000, "HR"),
    (5, "Eve", 15000, "Marketing")
]
departments_data = [
    (1, "HR"),
    (2, "Engineering"),
    (3, "Marketing")
]

# Create DataFrames and temporary views
employees_df = spark.createDataFrame(employees_data, ["ID", "Name", "Salary", "Department"])
departments_df = spark.createDataFrame(departments_data, ["ID", "DepartmentName"])

# Register as SQL temporary views for SQL queries
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")



## 1. Complex Joins with Multiple Conditions

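This section joins `employees` to `departments` with two conditions in the `ON` clause: the employee's department must match the department name, and the salary must exceed 15000.

In [2]:
# Join on department name and keep only employees earning more than 15000
result = spark.sql("""
    SELECT a.Name, b.DepartmentName
    FROM employees a
    JOIN departments b ON a.Department = b.DepartmentName AND a.Salary > 15000
    ORDER BY a.Name
""")
result.show()

+-------+--------------+
|   Name|DepartmentName|
+-------+--------------+
|    Bob|   Engineering|
|Charlie|   Engineering|
|  David|            HR|
+-------+--------------+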



## 2. Window Functions

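Spark SQL supports window functions, which compute a value for each row from a group of related rows. Here, `rank()` is applied through the DataFrame API over a `Window` partitioned by `Department` and ordered by `Salary` descending, so rank 1 is the highest-paid employee in each department.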

In [3]:
# Define a window specification by department and ordered by salary descending
windowSpec = Window.partitionBy("Department").orderBy(col("Salary").desc())

# Apply rank function
ranked_employees = employees_df.withColumn("SalaryRank", rank().over(windowSpec))
ranked_employees.show()

+---+-------+------+-----------+----------+
| ID|   Name|Salary| Department|SalaryRank|
+---+-------+------+-----------+----------+
|  3|Charlie| 30000|Engineering|         1|
|  2|    Bob| 20000|Engineering|         2|
|  4|  David| 25000|         HR|         1|
|  1|  Alice| 10000|         HR|         2|
|  5|    Eve| 15000|  Marketing|         1|
+---+-------+------+-----------+----------+



## 3. Subqueries

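Subqueries allow us to filter data based on a value computed by another query.
Example: select employees who earn more than the average salary. The cell below does this in two steps with the DataFrame API: it first collects the average salary to the driver, then filters on that value.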

In [4]:
# Calculate average salary
avg_salary = employees_df.selectExpr("avg(Salary)").collect()[0][0]

# Filter employees with salary above average salary
high_earners = employees_df.filter(employees_df["Salary"] > avg_salary)
high_earners.show()

+---+-------+------+-----------+
| ID|   Name|Salary| Department|
+---+-------+------+-----------+
|  3|Charlie| 30000|Engineering|
|  4|  David| 25000|         HR|
+---+-------+------+-----------+



In [5]:
# Stop the Spark session after running
spark.stop()