# Spark SQL: Comprehensive Guide with Best Practices

This notebook covers the essentials of Spark SQL, including:

1. **SQL Basics**: Creating tables/views and running queries
2. **User-Defined Functions (UDFs)** in SQL context
3. **Advanced SQL Features**: Window functions, complex types, and SQL optimization
4. **Performance Best Practices**: Tips for efficient Spark SQL usage

Let's get started!

In [1]:
!pip install pandas numpy pyarrow


In [2]:
# Initialize Spark Session
from pyspark.sql import SparkSession
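# Note: sum, max, and min imported here shadow Python's built-ins of the same name for the rest of the notebook
from pyspark.sql.functions import col, expr, udf, lit, when, avg, sum, max, min, count, desc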
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, ArrayType, MapType, BooleanType
import time
import pandas as pd
import numpy as np

# Create a Spark session
spark = SparkSession \
    .builder \
    .appName("Spark SQL Tutorial") \
    .config("spark.sql.shuffle.partitions", "4") \
    .getOrCreate()

print("Spark version:", spark.version)
print("Spark Session initialized successfully!")

/opt/spark/bin/load-spark-env.sh: line 68: ps: command not found
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/04/18 08:54:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Spark version: 3.5.1
Spark Session initialized successfully!


## 1. Creating Sample Datasets

Before diving into SQL, let's create some sample datasets to work with.

In [3]:
# Create sample data for employees
employee_data = [
    (1, "John", "Doe", "Engineering", 80000, 5),
    (2, "Jane", "Smith", "Engineering", 95000, 7),
    (3, "Alice", "Johnson", "Sales", 75000, 3),
    (4, "Bob", "Brown", "Sales", 68000, 2),
    (5, "Charlie", "Miller", "Marketing", 72000, 4),
    (6, "Dave", "Wilson", "Engineering", 105000, 9),
    (7, "Eve", "Davis", "HR", 65000, 5),
    (8, "Frank", "Jones", "Marketing", 78000, 6),
    (9, "Grace", "Taylor", "Engineering", 92000, 6),
    (10, "Helen", "Moore", "Sales", 81000, 4)
]

employee_schema = StructType([
    StructField("emp_id", IntegerType(), False),
    StructField("first_name", StringType(), False),
    StructField("last_name", StringType(), False),
    StructField("department", StringType(), False),
    StructField("salary", IntegerType(), True),
    StructField("years_exp", IntegerType(), True)
])

# Create DataFrame
employees_df = spark.createDataFrame(employee_data, employee_schema)
employees_df.show()

+------+----------+---------+-----------+------+---------+
|emp_id|first_name|last_name| department|salary|years_exp|
+------+----------+---------+-----------+------+---------+
|     1|      John|      Doe|Engineering| 80000|        5|
|     2|      Jane|    Smith|Engineering| 95000|        7|
|     3|     Alice|  Johnson|      Sales| 75000|        3|
|     4|       Bob|    Brown|      Sales| 68000|        2|
|     5|   Charlie|   Miller|  Marketing| 72000|        4|
|     6|      Dave|   Wilson|Engineering|105000|        9|
|     7|       Eve|    Davis|         HR| 65000|        5|
|     8|     Frank|    Jones|  Marketing| 78000|        6|
|     9|     Grace|   Taylor|Engineering| 92000|        6|
|    10|     Helen|    Moore|      Sales| 81000|        4|
+------+----------+---------+-----------+------+---------+



In [4]:
# Create sample data for departments
department_data = [
    ("Engineering", "San Francisco", "John Smith", 35),
    ("Sales", "New York", "Mary Johnson", 28),
    ("Marketing", "Chicago", "James Brown", 22),
    ("HR", "Boston", "Patricia Davis", 15),
    ("Finance", "San Jose", "Robert Wilson", 18)
]

department_schema = StructType([
    StructField("dept_name", StringType(), False),
    StructField("location", StringType(), True),
    StructField("manager", StringType(), True),
    StructField("employee_count", IntegerType(), True)
])

# Create DataFrame
departments_df = spark.createDataFrame(department_data, department_schema)
departments_df.show()

+-----------+-------------+--------------+--------------+
|  dept_name|     location|       manager|employee_count|
+-----------+-------------+--------------+--------------+
|Engineering|San Francisco|    John Smith|            35|
|      Sales|     New York|  Mary Johnson|            28|
|  Marketing|      Chicago|   James Brown|            22|
|         HR|       Boston|Patricia Davis|            15|
|    Finance|     San Jose| Robert Wilson|            18|
+-----------+-------------+--------------+--------------+



In [5]:
# Create a more complex dataset with arrays and maps
projects_data = [
    (1, "Mobile App", [1, 2, 6, 9], {"budget": 250000, "status": "active", "priority": "high"}),
    (2, "Website Redesign", [3, 5, 8], {"budget": 175000, "status": "active", "priority": "medium"}),
    (3, "Database Migration", [2, 6], {"budget": 300000, "status": "planning", "priority": "high"}),
    (4, "API Integration", [1, 4, 7], {"budget": 120000, "status": "completed", "priority": "low"}),
    (5, "Data Analytics", [5, 8, 9, 10], {"budget": 200000, "status": "active", "priority": "medium"})
]

projects_schema = StructType([
    StructField("project_id", IntegerType(), False),
    StructField("name", StringType(), False),
    StructField("team_members", ArrayType(IntegerType()), True),
    StructField("details", MapType(StringType(), StringType()), True)
])

# Create DataFrame
projects_df = spark.createDataFrame(projects_data, projects_schema)
projects_df.show(truncate=False)

+----------+------------------+-------------+--------------------------------------------------------+
|project_id|name              |team_members |details                                                 |
+----------+------------------+-------------+--------------------------------------------------------+
|1         |Mobile App        |[1, 2, 6, 9] |{priority -> high, status -> active, budget -> 250000}  |
|2         |Website Redesign  |[3, 5, 8]    |{priority -> medium, status -> active, budget -> 175000}|
|3         |Database Migration|[2, 6]       |{priority -> high, status -> planning, budget -> 300000}|
|4         |API Integration   |[1, 4, 7]    |{priority -> low, status -> completed, budget -> 120000}|
|5         |Data Analytics    |[5, 8, 9, 10]|{priority -> medium, status -> active, budget -> 200000}|
+----------+------------------+-------------+--------------------------------------------------------+



## 2. Spark SQL Basics

Now that we have our data, let's explore different ways to work with Spark SQL. Spark SQL provides a SQL interface to interact with structured data in Spark.

### 2.1 Creating Temporary Views

To query data using SQL, we first need to create temporary views from our DataFrames.

In [6]:
# Create temporary views from DataFrames
employees_df.createOrReplaceTempView("employees")
departments_df.createOrReplaceTempView("departments")
projects_df.createOrReplaceTempView("projects")

# List all tables in the current session
print("Available tables:")
spark.sql("SHOW TABLES").show()

Available tables:
+---------+-----------+-----------+
|namespace|  tableName|isTemporary|
+---------+-----------+-----------+
|         |departments|       true|
|         |  employees|       true|
|         |   projects|       true|
+---------+-----------+-----------+



### 2.2 Basic SQL Queries

Let's start with some basic SQL queries.

In [7]:
# Simple SELECT query
query = """
SELECT 
    emp_id, 
    first_name, 
    last_name, 
    department, 
    salary
FROM 
    employees
WHERE 
    salary > 80000
ORDER BY 
    salary DESC
"""

spark.sql(query).show()

+------+----------+---------+-----------+------+
|emp_id|first_name|last_name| department|salary|
+------+----------+---------+-----------+------+
|     6|      Dave|   Wilson|Engineering|105000|
|     2|      Jane|    Smith|Engineering| 95000|
|     9|     Grace|   Taylor|Engineering| 92000|
|    10|     Helen|    Moore|      Sales| 81000|
+------+----------+---------+-----------+------+



In [8]:
# Aggregation query
query = """
SELECT 
    department, 
    COUNT(*) as employee_count, 
    AVG(salary) as avg_salary, 
    MAX(salary) as max_salary,
    MIN(salary) as min_salary,
    SUM(salary) as total_salary
FROM 
    employees
GROUP BY 
    department
ORDER BY 
    avg_salary DESC
"""

spark.sql(query).show()

+-----------+--------------+-----------------+----------+----------+------------+
| department|employee_count|       avg_salary|max_salary|min_salary|total_salary|
+-----------+--------------+-----------------+----------+----------+------------+
|Engineering|             4|          93000.0|    105000|     80000|      372000|
|  Marketing|             2|          75000.0|     78000|     72000|      150000|
|      Sales|             3|74666.66666666667|     81000|     68000|      224000|
|         HR|             1|          65000.0|     65000|     65000|       65000|
+-----------+--------------+-----------------+----------+----------+------------+



In [9]:
# JOIN query
query = """
SELECT 
    e.emp_id, 
    e.first_name, 
    e.last_name, 
    e.department, 
    d.location,
    d.manager as dept_manager
FROM 
    employees e
JOIN 
    departments d
ON 
    e.department = d.dept_name
ORDER BY 
    e.emp_id
"""

spark.sql(query).show()

+------+----------+---------+-----------+-------------+--------------+
|emp_id|first_name|last_name| department|     location|  dept_manager|
+------+----------+---------+-----------+-------------+--------------+
|     1|      John|      Doe|Engineering|San Francisco|    John Smith|
|     2|      Jane|    Smith|Engineering|San Francisco|    John Smith|
|     3|     Alice|  Johnson|      Sales|     New York|  Mary Johnson|
|     4|       Bob|    Brown|      Sales|     New York|  Mary Johnson|
|     5|   Charlie|   Miller|  Marketing|      Chicago|   James Brown|
|     6|      Dave|   Wilson|Engineering|San Francisco|    John Smith|
|     7|       Eve|    Davis|         HR|       Boston|Patricia Davis|
|     8|     Frank|    Jones|  Marketing|      Chicago|   James Brown|
|     9|     Grace|   Taylor|Engineering|San Francisco|    John Smith|
|    10|     Helen|    Moore|      Sales|     New York|  Mary Johnson|
+------+----------+---------+-----------+-------------+--------------+



### 2.3 Working with Complex Types

Spark SQL can handle complex data types like arrays and maps.

In [10]:
# Query with array operations
query = """
SELECT 
    project_id, 
    name, 
    size(team_members) as team_size,
    team_members as member_ids,
    array_contains(team_members, 1) as has_employee_1
FROM 
    projects
ORDER BY 
    team_size DESC
"""

spark.sql(query).show(truncate=False)

+----------+------------------+---------+-------------+--------------+
|project_id|name              |team_size|member_ids   |has_employee_1|
+----------+------------------+---------+-------------+--------------+
|5         |Data Analytics    |4        |[5, 8, 9, 10]|false         |
|1         |Mobile App        |4        |[1, 2, 6, 9] |true          |
|2         |Website Redesign  |3        |[3, 5, 8]    |false         |
|4         |API Integration   |3        |[1, 4, 7]    |true          |
|3         |Database Migration|2        |[2, 6]       |false         |
+----------+------------------+---------+-------------+--------------+



In [11]:
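# Query with map operations
# The map is declared as MapType(StringType(), StringType()), so every value,
# including the budget, is a string. Cast it before sorting numerically.
query = """
SELECT 
    project_id, 
    name, 
    details['budget'] as budget,
    details['status'] as status,
    details['priority'] as priority
FROM 
    projects
WHERE 
    details['status'] = 'active'
ORDER BY 
    CAST(details['budget'] AS INT) DESC
"""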

spark.sql(query).show()

+----------+----------------+------+------+--------+
|project_id|            name|budget|status|priority|
+----------+----------------+------+------+--------+
|         1|      Mobile App|250000|active|    high|
|         5|  Data Analytics|200000|active|  medium|
|         2|Website Redesign|175000|active|  medium|
+----------+----------------+------+------+--------+



### 2.4 Subqueries and Common Table Expressions (CTEs)

Spark SQL supports advanced SQL features like subqueries and CTEs.

In [12]:
# Subquery example
query = """
SELECT 
    emp_id, 
    first_name, 
    last_name, 
    department, 
    salary
FROM 
    employees
WHERE 
    salary > (
        SELECT AVG(salary) 
        FROM employees
    )
ORDER BY 
    salary DESC
"""

spark.sql(query).show()

+------+----------+---------+-----------+------+
|emp_id|first_name|last_name| department|salary|
+------+----------+---------+-----------+------+
|     6|      Dave|   Wilson|Engineering|105000|
|     2|      Jane|    Smith|Engineering| 95000|
|     9|     Grace|   Taylor|Engineering| 92000|
+------+----------+---------+-----------+------+



In [13]:
# CTE (Common Table Expression) example
query = """
WITH dept_stats AS (
    SELECT 
        department, 
        AVG(salary) as avg_dept_salary,
        MAX(years_exp) as max_experience
    FROM 
        employees
    GROUP BY 
        department
),
high_paid_departments AS (
    SELECT 
        department
    FROM 
        dept_stats
    WHERE 
        avg_dept_salary > 80000
)
SELECT 
    e.first_name,
    e.last_name,
    e.department,
    e.salary,
    s.avg_dept_salary,
    s.max_experience
FROM 
    employees e
JOIN 
    dept_stats s
ON 
    e.department = s.department
WHERE 
    e.department IN (SELECT department FROM high_paid_departments)
ORDER BY 
    e.salary DESC
"""

spark.sql(query).show()

+----------+---------+-----------+------+---------------+--------------+
|first_name|last_name| department|salary|avg_dept_salary|max_experience|
+----------+---------+-----------+------+---------------+--------------+
|      Dave|   Wilson|Engineering|105000|        93000.0|             9|
|      Jane|    Smith|Engineering| 95000|        93000.0|             9|
|     Grace|   Taylor|Engineering| 92000|        93000.0|             9|
|      John|      Doe|Engineering| 80000|        93000.0|             9|
+----------+---------+-----------+------+---------------+--------------+



### 2.5 Window Functions

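Window functions compute a value for every row from a set of related rows (the window) without collapsing those rows the way `GROUP BY` does. A window with no `PARTITION BY`, such as the overall `DENSE_RANK` in the query below, pulls all rows into a single partition, which is why Spark logs a `WindowExec` warning for it.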
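To make the ranking semantics concrete, here is a plain-Python sketch (no Spark required) of how `RANK` and `DENSE_RANK` differ when values tie. The salary list is hypothetical and was chosen to include a tie, since the employee data in this notebook has none; the function name `rank_and_dense_rank` is illustrative only.

```python
def rank_and_dense_rank(values):
    """Return (value, rank, dense_rank) tuples for values ordered descending."""
    ordered = sorted(values, reverse=True)
    result = []
    rank = 0
    dense = 0
    previous = object()  # sentinel that never equals a real value
    for position, value in enumerate(ordered, start=1):
        if value != previous:
            rank = position   # RANK jumps to the row position after a tie
            dense += 1        # DENSE_RANK only ever advances by one
            previous = value
        result.append((value, rank, dense))
    return result


# Hypothetical salaries with one tie at the top
ranked = rank_and_dense_rank([92000, 95000, 80000, 95000])
for row in ranked:
    print(row)
```

After a tie, `RANK` leaves a gap in the numbering while `DENSE_RANK` does not, which is the only difference between the two functions.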

In [14]:
# Window function example
query = """
SELECT 
    emp_id,
    first_name,
    last_name,
    department,
    salary,
    RANK() OVER (PARTITION BY department ORDER BY salary DESC) as dept_salary_rank,
    DENSE_RANK() OVER (ORDER BY salary DESC) as overall_salary_rank,
    salary - AVG(salary) OVER (PARTITION BY department) as diff_from_dept_avg,
    salary / SUM(salary) OVER (PARTITION BY department) * 100 as pct_of_dept_total
FROM 
    employees
ORDER BY 
    department, 
    dept_salary_rank
"""

spark.sql(query).show()

25/04/18 08:54:55 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


+------+----------+---------+-----------+------+----------------+-------------------+-------------------+------------------+
|emp_id|first_name|last_name| department|salary|dept_salary_rank|overall_salary_rank| diff_from_dept_avg| pct_of_dept_total|
+------+----------+---------+-----------+------+----------------+-------------------+-------------------+------------------+
|     6|      Dave|   Wilson|Engineering|105000|               1|                  1|            12000.0|28.225806451612907|
|     2|      Jane|    Smith|Engineering| 95000|               2|                  2|             2000.0|25.537634408602152|
|     9|     Grace|   Taylor|Engineering| 92000|               3|                  3|            -1000.0|24.731182795698924|
|     1|      John|      Doe|Engineering| 80000|               4|                  5|           -13000.0| 21.50537634408602|
|     7|       Eve|    Davis|         HR| 65000|               1|                 10|                0.0|             100.0|


## 3. User-Defined Functions (UDFs) in Spark SQL

User-Defined Functions allow you to extend SQL with custom logic.

### 3.1 Creating and Using SQL UDFs

Let's create some UDFs and use them in SQL queries.

In [15]:
def calculate_bonus(salary, years, rate):
    # The SQL literal 0.1 reaches Python as decimal.Decimal while the columns
    # arrive as int, so convert everything to float before multiplying.
    # This is the simplest solution though it may lose some precision
    salary_float = float(salary)
    years_float = float(years)
    rate_float = float(rate)
    
    if years_float > 5:
        return salary_float * rate_float * 1.5
    else:
        return salary_float * rate_float

# Register the UDF with an explicit return type. Without one, register()
# defaults to StringType and ORDER BY bonus would sort the values as text.
spark.udf.register("calculate_bonus", calculate_bonus, DoubleType())

# Now use the UDF in a SQL query
query = """
SELECT 
    emp_id, 
    first_name,
    last_name,
    salary,
    years_exp,
    calculate_bonus(salary, years_exp, 0.1) AS bonus
FROM 
    employees
ORDER BY 
    bonus DESC
"""

spark.sql(query).show()

+------+----------+---------+------+---------+-------+
|emp_id|first_name|last_name|salary|years_exp|  bonus|
+------+----------+---------+------+---------+-------+
|     6|      Dave|   Wilson|105000|        9|15750.0|
|     2|      Jane|    Smith| 95000|        7|14250.0|
|     9|     Grace|   Taylor| 92000|        6|13800.0|
|     8|     Frank|    Jones| 78000|        6|11700.0|
|    10|     Helen|    Moore| 81000|        4| 8100.0|
|     1|      John|      Doe| 80000|        5| 8000.0|
|     3|     Alice|  Johnson| 75000|        3| 7500.0|
|     5|   Charlie|   Miller| 72000|        4| 7200.0|
|     4|       Bob|    Brown| 68000|        2| 6800.0|
|     7|       Eve|    Davis| 65000|        5| 6500.0|
+------+----------+---------+------+---------+-------+
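A rule this simple can also be written as a plain `CASE` expression instead of a Python UDF. The sketch below holds the query text in a hypothetical constant, `BONUS_QUERY`, and checks it against an in-memory SQLite table containing three of the employee rows, because the expression uses only standard SQL. Passing the same string to `spark.sql(...)` is an assumption of this sketch, not something it tests, and Spark would return the bonus as a decimal, not a float.

```python
import sqlite3

# Bonus rule from calculate_bonus, expressed with a CASE expression
# instead of a Python UDF (BONUS_QUERY is an illustrative name).
BONUS_QUERY = """
SELECT
    emp_id,
    salary * 0.1 * CASE WHEN years_exp > 5 THEN 1.5 ELSE 1.0 END AS bonus
FROM
    employees
ORDER BY
    bonus DESC
"""

# Stand-in table with three rows copied from the employees dataset
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER, salary INTEGER, years_exp INTEGER)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, 80000, 5), (6, 105000, 9), (10, 81000, 4)],
)
bonus_rows = conn.execute(BONUS_QUERY).fetchall()
conn.close()

for emp_id, bonus in bonus_rows:
    print(emp_id, bonus)
```

Built-in expressions are evaluated inside the engine, so a version like this avoids shipping each row to a Python worker; a UDF remains the right tool when the logic cannot be expressed in SQL.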



In [16]:
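# Method 1 (kept for reference, not runnable here): define the function in SQL.
# In Spark 3.5, CREATE TEMPORARY FUNCTION expects the class name of a JVM
# (Java/Scala/Hive) UDF, not an inline expression body, so the statement
# below is left commented out.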

# spark.sql("""
# CREATE TEMPORARY FUNCTION calculate_bonus
# AS (salary, years, rate) -> 
#     CASE
#       WHEN years > 5 THEN salary * rate * 1.5
#       ELSE salary * rate
#     END
# """)

# # Use the UDF in a SQL query
# query = """
# SELECT 
#     emp_id, 
#     first_name, 
#     last_name, 
#     salary, 
#     years_exp,
#     calculate_bonus(salary, years_exp, 0.1) as bonus
# FROM 
#     employees
# ORDER BY 
#     bonus DESC
# """

# spark.sql(query).show()

In [17]:
# Method 2: Register a Python function as a UDF
# Define Python function
def full_name(first_name, last_name, add_title=False):
    if add_title:
        return f"Mr./Ms. {first_name} {last_name}"
    else:
        return f"{first_name} {last_name}"

# Register as UDF
spark.udf.register("full_name", full_name, StringType())

# Use in SQL
query = """
SELECT 
    emp_id, 
    first_name, 
    last_name, 
    full_name(first_name, last_name) as name,
    full_name(first_name, last_name, true) as formal_name
FROM 
    employees
"""

spark.sql(query).show(truncate=False)

+------+----------+---------+--------------+----------------------+
|emp_id|first_name|last_name|name          |formal_name           |
+------+----------+---------+--------------+----------------------+
|1     |John      |Doe      |John Doe      |Mr./Ms. John Doe      |
|2     |Jane      |Smith    |Jane Smith    |Mr./Ms. Jane Smith    |
|3     |Alice     |Johnson  |Alice Johnson |Mr./Ms. Alice Johnson |
|4     |Bob       |Brown    |Bob Brown     |Mr./Ms. Bob Brown     |
|5     |Charlie   |Miller   |Charlie Miller|Mr./Ms. Charlie Miller|
|6     |Dave      |Wilson   |Dave Wilson   |Mr./Ms. Dave Wilson   |
|7     |Eve       |Davis    |Eve Davis     |Mr./Ms. Eve Davis     |
|8     |Frank     |Jones    |Frank Jones   |Mr./Ms. Frank Jones   |
|9     |Grace     |Taylor   |Grace Taylor  |Mr./Ms. Grace Taylor  |
|10    |Helen     |Moore    |Helen Moore   |Mr./Ms. Helen Moore   |
+------+----------+---------+--------------+----------------------+



### 3.2 Complex UDFs with Struct Returns

UDFs can return complex data types like structs.

In [18]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Define return schema for our UDF
compensation_schema = StructType([
    StructField("base", IntegerType(), True),
    StructField("bonus", DoubleType(), True),
    StructField("total", DoubleType(), True),
    StructField("rating", StringType(), True)
])

# Define Python function
def calculate_compensation(salary, years):
    # Calculate bonus
    bonus_rate = 0.05 + (years * 0.01)  # 5% + 1% per year
    bonus = salary * bonus_rate
    total = salary + bonus
    
    # Determine rating
    if total > 100000:
        rating = "Excellent"
    elif total > 80000:
        rating = "Good"
    else:
        rating = "Average"
        
    return (int(salary), float(bonus), float(total), rating)

# Register UDF
spark.udf.register("calculate_compensation", calculate_compensation, compensation_schema)

# Use in SQL
query = """
SELECT 
    emp_id, 
    first_name, 
    last_name, 
    department,
    calculate_compensation(salary, years_exp) as compensation,
    calculate_compensation(salary, years_exp).total as total_comp,
    calculate_compensation(salary, years_exp).rating as performance
FROM 
    employees
ORDER BY 
    total_comp DESC
"""

spark.sql(query).show(truncate=False)

+------+----------+---------+-----------+-------------------------------------------------+----------+-----------+
|emp_id|first_name|last_name|department |compensation                                     |total_comp|performance|
+------+----------+---------+-----------+-------------------------------------------------+----------+-----------+
|6     |Dave      |Wilson   |Engineering|{105000, 14700.000000000002, 119700.0, Excellent}|119700.0  |Excellent  |
|2     |Jane      |Smith    |Engineering|{95000, 11400.0, 106400.0, Excellent}            |106400.0  |Excellent  |
|9     |Grace     |Taylor   |Engineering|{92000, 10120.0, 102120.0, Excellent}            |102120.0  |Excellent  |
|10    |Helen     |Moore    |Sales      |{81000, 7290.0, 88290.0, Good}                   |88290.0   |Good       |
|1     |John      |Doe      |Engineering|{80000, 8000.0, 88000.0, Good}                   |88000.0   |Good       |
|8     |Frank     |Jones    |Marketing  |{78000, 8580.0, 86580.0, Good}         

### 3.3 Pandas UDFs (Vectorized UDFs)

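Pandas UDFs exchange data with Python in Arrow batches and operate on whole pandas Series, which usually makes them much faster than row-at-a-time Python UDFs. The gain is largest when the function body itself is vectorized; the example below still calls a Python function per element through `Series.apply`, so it benefits mainly from the batched transfer.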

In [19]:
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Register a pandas UDF
@pandas_udf(DoubleType())
def calculate_tax(salary_series: pd.Series) -> pd.Series:
    # Apply tax brackets
    # - 10% up to 50K
    # - 20% from 50K to 100K
    # - 30% above 100K
    def tax_for_salary(salary):
        if salary <= 50000:
            return salary * 0.10
        elif salary <= 100000:
            return 5000 + (salary - 50000) * 0.20
        else:
            return 5000 + 10000 + (salary - 100000) * 0.30
    
    return salary_series.apply(tax_for_salary)

# Register for SQL use
spark.udf.register("calculate_tax", calculate_tax)

# Use in SQL
query = """
SELECT 
    emp_id, 
    first_name, 
    last_name, 
    department,
    salary,
    calculate_tax(salary) as tax,
    salary - calculate_tax(salary) as after_tax
FROM 
    employees
ORDER BY 
    tax DESC
"""

spark.sql(query).show()

+------+----------+---------+-----------+------+-------+---------+
|emp_id|first_name|last_name| department|salary|    tax|after_tax|
+------+----------+---------+-----------+------+-------+---------+
|     6|      Dave|   Wilson|Engineering|105000|16500.0|  88500.0|
|     2|      Jane|    Smith|Engineering| 95000|14000.0|  81000.0|
|     9|     Grace|   Taylor|Engineering| 92000|13400.0|  78600.0|
|    10|     Helen|    Moore|      Sales| 81000|11200.0|  69800.0|
|     1|      John|      Doe|Engineering| 80000|11000.0|  69000.0|
|     8|     Frank|    Jones|  Marketing| 78000|10600.0|  67400.0|
|     3|     Alice|  Johnson|      Sales| 75000|10000.0|  65000.0|
|     5|   Charlie|   Miller|  Marketing| 72000| 9400.0|  62600.0|
|     4|       Bob|    Brown|      Sales| 68000| 8600.0|  59400.0|
|     7|       Eve|    Davis|         HR| 65000| 8000.0|  57000.0|
+------+----------+---------+-----------+------+-------+---------+
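The tax brackets can also be computed without a per-element Python call. The following is a minimal sketch using `numpy.select` on a whole Series; the function name `tax_vectorized` and the sample salaries are illustrative, and the code runs on plain pandas so it can be checked without a Spark session.

```python
import numpy as np
import pandas as pd


def tax_vectorized(salary: pd.Series) -> pd.Series:
    """Apply the same three tax brackets as calculate_tax to a whole Series at once."""
    s = salary.astype("float64")
    tax = np.select(
        [s <= 50_000, s <= 100_000],                   # first matching condition wins
        [s * 0.10, 5_000 + (s - 50_000) * 0.20],
        default=15_000 + (s - 100_000) * 0.30,         # above 100K
    )
    return pd.Series(tax, index=salary.index)


# Three salaries from the employees table plus one hypothetical value below 50K
sample = pd.Series([105000, 95000, 65000, 40000])
result = tax_vectorized(sample)
print(result)
```

Under the assumption that the body is used unchanged, the function could be wrapped with `@pandas_udf(DoubleType())` and registered through `spark.udf.register` in the same way as `calculate_tax`.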



## 4. Advanced Features and Best Practices

This section covers advanced features and best practices for working with Spark SQL.

### 4.1 Creating and Using Tables

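Temporary views disappear when the session ends. To keep data around, you can save a DataFrame as a managed table, which Spark registers in the metastore and stores under the warehouse directory.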

In [20]:
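# Save data as a managed Spark table (in the metastore).
# mode("overwrite") keeps this cell re-runnable; the default mode fails if the table already exists.
employees_df.write.mode("overwrite").saveAsTable("global_employees")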

# List all tables
spark.sql("SHOW TABLES").show()

25/04/18 08:54:58 WARN MemoryManager: Total allocation exceeds 95.00% (1,020,054,720 bytes) of heap memory
Scaling row group sizes to 95.00% for 8 writers

+---------+----------------+-----------+
|namespace|       tableName|isTemporary|
+---------+----------------+-----------+
|  default|global_employees|      false|
|         |     departments|       true|
|         |       employees|       true|
|         |        projects|       true|
+---------+----------------+-----------+



In [21]:
# Describe table schema
spark.sql("DESCRIBE TABLE global_employees").show(truncate=False)

+----------+---------+-------+
|col_name  |data_type|comment|
+----------+---------+-------+
|emp_id    |int      |NULL   |
|first_name|string   |NULL   |
|last_name |string   |NULL   |
|department|string   |NULL   |
|salary    |int      |NULL   |
|years_exp |int      |NULL   |
+----------+---------+-------+



### 4.2 Optimizing SQL Queries

Let's look at how to optimize SQL queries in Spark.

In [22]:
# Examine query plans to understand optimization
query = """
SELECT 
    e.department, 
    AVG(e.salary) as avg_salary,
    d.location
FROM 
    employees e
JOIN 
    departments d
ON 
    e.department = d.dept_name
WHERE 
    e.salary > 70000
GROUP BY 
    e.department, d.location
HAVING 
    AVG(e.salary) > 75000
ORDER BY 
    avg_salary DESC
"""

# Get query plan
print("Logical and Physical Plans:")
spark.sql(query).explain(True)

Logical and Physical Plans:
== Parsed Logical Plan ==
'Sort ['avg_salary DESC NULLS LAST], true
+- 'UnresolvedHaving ('AVG('e.salary) > 75000)
   +- 'Aggregate ['e.department, 'd.location], ['e.department, 'AVG('e.salary) AS avg_salary#729, 'd.location]
      +- 'Filter ('e.salary > 70000)
         +- 'Join Inner, ('e.department = 'd.dept_name)
            :- 'SubqueryAlias e
            :  +- 'UnresolvedRelation [employees], [], false
            +- 'SubqueryAlias d
               +- 'UnresolvedRelation [departments], [], false

== Analyzed Logical Plan ==
department: string, avg_salary: double, location: string
Sort [avg_salary#729 DESC NULLS LAST], true
+- Filter (avg_salary#729 > cast(75000 as double))
   +- Aggregate [department#3, location#38], [department#3, avg(salary#4) AS avg_salary#729, location#38]
      +- Filter (salary#4 > 70000)
         +- Join Inner, (department#3 = dept_name#37)
            :- SubqueryAlias e
            :  +- SubqueryAlias employees
            :   

In [23]:
# Broadcast join for small tables
# Spark broadcasts a join side automatically when its estimated size is below this threshold.
# 10485760 bytes (10 MB) is also the default value.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10485760)  # 10MB

# Force broadcast with a hint
query = """
SELECT /*+ BROADCAST(d) */ 
    e.emp_id, 
    e.first_name, 
    e.last_name, 
    e.department, 
    d.location
FROM 
    employees e
JOIN 
    departments d
ON 
    e.department = d.dept_name
"""

# Check if broadcast was applied
print("Physical Plan (should include BroadcastHashJoin):")
spark.sql(query).explain()

Physical Plan (should include BroadcastHashJoin):
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [emp_id#0, first_name#1, last_name#2, department#3, location#38]
   +- BroadcastHashJoin [department#3], [dept_name#37], Inner, BuildRight, false
      :- Project [emp_id#0, first_name#1, last_name#2, department#3]
      :  +- Scan ExistingRDD[emp_id#0,first_name#1,last_name#2,department#3,salary#4,years_exp#5]
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]),false), [plan_id=1193]
         +- Project [dept_name#37, location#38]
            +- Scan ExistingRDD[dept_name#37,location#38,manager#39,employee_count#40]
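Conceptually, the `BroadcastHashJoin` in the plan above builds a hash table from the small side once and then streams the large side past it. The sketch below mimics that build-and-probe pattern in plain Python on a few rows from the two datasets; `broadcast_hash_join` is a hypothetical helper written for illustration and is not part of Spark.

```python
from collections import defaultdict


def broadcast_hash_join(large_rows, small_rows, large_key, small_key):
    """Inner-join two lists of dicts using a build phase and a probe phase."""
    # Build phase: hash the small side once. In Spark, this table is what
    # gets shipped to every executor.
    hashed = defaultdict(list)
    for row in small_rows:
        hashed[row[small_key]].append(row)

    # Probe phase: stream the large side and look up matches, so the
    # large side never has to be shuffled.
    joined = []
    for row in large_rows:
        for match in hashed.get(row[large_key], []):
            joined.append({**row, **match})
    return joined


# A few rows taken from the employees and departments datasets
employees = [
    {"emp_id": 1, "department": "Engineering"},
    {"emp_id": 3, "department": "Sales"},
    {"emp_id": 7, "department": "HR"},
]
departments = [
    {"dept_name": "Engineering", "location": "San Francisco"},
    {"dept_name": "Sales", "location": "New York"},
    {"dept_name": "HR", "location": "Boston"},
    {"dept_name": "Finance", "location": "San Jose"},
]

joined = broadcast_hash_join(employees, departments, "department", "dept_name")
for row in joined:
    print(row["emp_id"], row["department"], row["location"])
```

Finance has no employees, so it never matches during the probe phase and is absent from the inner-join result.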




### 4.3 Performance Comparison: DataFrame vs SQL

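Let's compare the performance of equivalent operations written with the DataFrame API and with SQL. Both go through the same Catalyst optimizer, so equivalent queries should compile to the same physical plan; any gap in the timings below mostly reflects warm-up and measurement noise (note the slower first SQL run).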

In [24]:
# Create a larger dataset for performance testing
large_df = spark.range(0, 1000000) \
    .withColumn("random_value", (col("id") * 12.345) % 100) \
    .withColumn("group", (col("id") % 10).cast("integer")) \
    .withColumn("subgroup", (col("id") % 100).cast("integer"))

large_df.createOrReplaceTempView("large_table")

# Warm up the JVM
large_df.count()

1000000

In [25]:
# Test with SQL
def test_sql_performance():
    start_time = time.time()
    result = spark.sql("""
        SELECT 
            group, 
            subgroup, 
            COUNT(*) as count, 
            AVG(random_value) as avg_val,
            MAX(random_value) as max_val
        FROM 
            large_table
        WHERE 
            random_value > 50
        GROUP BY 
            group, subgroup
        HAVING 
            COUNT(*) > 5
        ORDER BY 
            group, subgroup
    """)
    result.collect()  # Force execution
    return time.time() - start_time

# Test with DataFrame API
def test_df_performance():
    start_time = time.time()
    result = (
        large_df
        .filter(col("random_value") > 50)
        .groupBy("group", "subgroup")
        .agg(
            count("*").alias("count"),
            avg("random_value").alias("avg_val"),
            max("random_value").alias("max_val"),
        )
        .filter(col("count") > 5)
        .orderBy("group", "subgroup")
    )
    result.collect()  # Force execution
    return time.time() - start_time

# Run multiple times for more accurate comparison
sql_times = []
df_times = []

for i in range(3):
    sql_time = test_sql_performance()
    df_time = test_df_performance()
    sql_times.append(sql_time)
    df_times.append(df_time)
    print(f"Run {i+1}: SQL: {sql_time:.3f}s, DataFrame: {df_time:.3f}s")

import builtins  # sum was shadowed by pyspark.sql.functions.sum, so call the built-in explicitly
print(f"\nAverage: SQL: {builtins.sum(sql_times)/len(sql_times):.3f}s, DataFrame: {builtins.sum(df_times)/len(df_times):.3f}s")

Run 1: SQL: 0.317s, DataFrame: 0.117s
Run 2: SQL: 0.100s, DataFrame: 0.105s
Run 3: SQL: 0.096s, DataFrame: 0.096s

Average: SQL: 0.171s, DataFrame: 0.106s
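The first run above includes one-off warm-up cost, which skews a plain average over three runs. A slightly more robust approach, sketched here with only the standard library, is to discard warm-up runs and report the median. The `benchmark` helper and the stand-in `workload` function are hypothetical; in the notebook, the callable would be something that forces execution, such as a function ending in `result.collect()`.

```python
import statistics
import time


def benchmark(fn, runs=5, warmup=1):
    """Time fn() several times, discarding warm-up runs, and return (median, timings)."""
    for _ in range(warmup):
        fn()  # untimed: absorbs JIT compilation, caching, and similar one-off costs
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.median(timings), timings


calls = []


def workload():
    # Stand-in for a Spark action; records each call so the run count can be checked
    calls.append(1)
    sorted(range(100_000), reverse=True)


median_s, timings = benchmark(workload, runs=3, warmup=1)
print(f"median: {median_s:.4f}s over {len(timings)} timed runs")
```

The median is less sensitive than the mean to a single slow run, and `time.perf_counter` is the clock intended for measuring short intervals.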


### 4.4 SQL Best Practices

Here are some best practices for using Spark SQL effectively:

#### 1. Filter Early

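Apply filters as early as possible to reduce the number of rows that reach expensive operations such as joins and shuffles. For simple predicates like the one below, Spark's Catalyst optimizer usually pushes the filter beneath an inner join automatically, so compare the two plans rather than assuming they differ.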

In [26]:
# Good: Filter before join
query_good = """
SELECT e.first_name, e.last_name, d.location
FROM 
    (SELECT * FROM employees WHERE salary > 80000) e
JOIN 
    departments d
ON 
    e.department = d.dept_name
"""

# Bad: Filter after join
query_bad = """
SELECT e.first_name, e.last_name, d.location
FROM 
    employees e
JOIN 
    departments d
ON 
    e.department = d.dept_name
WHERE 
    e.salary > 80000
"""

# Compare execution plans
print("Good query plan:")
spark.sql(query_good).explain()

print("\nBad query plan:")
spark.sql(query_bad).explain()

Good query plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [first_name#1, last_name#2, location#38]
   +- SortMergeJoin [department#3], [dept_name#37], Inner
      :- Sort [department#3 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(department#3, 4), ENSURE_REQUIREMENTS, [plan_id=1872]
      :     +- Project [first_name#1, last_name#2, department#3]
      :        +- Filter (isnotnull(salary#4) AND (salary#4 > 80000))
      :           +- Scan ExistingRDD[emp_id#0,first_name#1,last_name#2,department#3,salary#4,years_exp#5]
      +- Sort [dept_name#37 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(dept_name#37, 4), ENSURE_REQUIREMENTS, [plan_id=1873]
            +- Project [dept_name#37, location#38]
               +- Scan ExistingRDD[dept_name#37,location#38,manager#39,employee_count#40]



Bad query plan:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [first_name#1, last_name#2, location#38]
   +- SortMer
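
In the first plan the `Filter` node sits directly above the scan of `employees`. To check whether two queries compile to the same physical plan, the plan text can be compared once the per-query identifiers are removed. The sketch below is one way to do that; `normalize_plan` and `same_plan` are hypothetical helper names, and the regular expressions assume the `name#id` and `plan_id=N` formats visible in the output above.

```python
import re


def normalize_plan(plan_text: str) -> str:
    """Remove identifiers that change from query to query."""
    text = re.sub(r"#\d+L?", "", plan_text)           # expression IDs, e.g. salary#4
    text = re.sub(r"plan_id=\d+", "plan_id=?", text)  # exchange plan IDs
    # Drop blank lines and trailing whitespace; keep indentation (tree shape).
    lines = [line.rstrip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)


def same_plan(plan_a: str, plan_b: str) -> bool:
    """True when two plans match after normalization."""
    return normalize_plan(plan_a) == normalize_plan(plan_b)
```

Because `explain()` prints the plan instead of returning it, one option for obtaining the text is to wrap the call in `contextlib.redirect_stdout`; treat that as an assumption to verify on your Spark version.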

#### 2. Use Appropriate Join Strategies

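- Use broadcast joins when one side is small enough to be copied to every executor (Spark chooses this automatically below `spark.sql.autoBroadcastJoinThreshold`)
- Be mindful of join types (inner, left, right, full), since outer joins keep unmatched rows and can limit which side may be broadcast
- Avoid accidental Cartesian products (cross joins), which multiply the row counts of both inputs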

In [27]:
# Use broadcast hint for small tables
query = """
SELECT /*+ BROADCAST(d) */ 
    e.emp_id, e.first_name, e.last_name, d.location
FROM 
    employees e
JOIN 
    departments d
ON 
    e.department = d.dept_name
"""

spark.sql(query).explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [emp_id#0, first_name#1, last_name#2, location#38]
   +- BroadcastHashJoin [department#3], [dept_name#37], Inner, BuildRight, false
      :- Project [emp_id#0, first_name#1, last_name#2, department#3]
      :  +- Scan ExistingRDD[emp_id#0,first_name#1,last_name#2,department#3,salary#4,years_exp#5]
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, string, false]),false), [plan_id=1937]
         +- Project [dept_name#37, location#38]
            +- Scan ExistingRDD[dept_name#37,location#38,manager#39,employee_count#40]




#### 3. Favor Standard SQL Functions Over UDFs When Possible

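Built-in functions run inside the JVM and are visible to the Catalyst optimizer, so they are usually faster than custom UDFs. Python UDFs in particular must serialize each row to a Python worker and are treated as opaque by the optimizer.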

In [28]:
# Using built-in functions
query = """
SELECT 
    first_name, 
    last_name,
    CONCAT(first_name, ' ', last_name) AS full_name,
    CASE 
        WHEN salary > 90000 THEN 'High'
        WHEN salary > 70000 THEN 'Medium'
        ELSE 'Low'
    END AS salary_tier
FROM 
    employees
"""

spark.sql(query).show()

+----------+---------+--------------+-----------+
|first_name|last_name|     full_name|salary_tier|
+----------+---------+--------------+-----------+
|      John|      Doe|      John Doe|     Medium|
|      Jane|    Smith|    Jane Smith|       High|
|     Alice|  Johnson| Alice Johnson|     Medium|
|       Bob|    Brown|     Bob Brown|        Low|
|   Charlie|   Miller|Charlie Miller|     Medium|
|      Dave|   Wilson|   Dave Wilson|       High|
|       Eve|    Davis|     Eve Davis|        Low|
|     Frank|    Jones|   Frank Jones|     Medium|
|     Grace|   Taylor|  Grace Taylor|       High|
|     Helen|    Moore|   Helen Moore|     Medium|
+----------+---------+--------------+-----------+
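
For contrast, here is a sketch of the Python UDF that the `CASE` expression above makes unnecessary. The function name `salary_tier`, the SQL name `salary_tier_udf`, and the registration helper are all hypothetical; the thresholds are copied from the query.

```python
from typing import Optional


def salary_tier(salary: Optional[int]) -> str:
    """Pure-Python mirror of the CASE expression in the query above."""
    if salary is None:
        return "Low"  # a NULL salary fails both WHEN tests and reaches ELSE
    if salary > 90000:
        return "High"
    if salary > 70000:
        return "Medium"
    return "Low"


def register_salary_tier_udf(spark, name: str = "salary_tier_udf") -> None:
    """Expose salary_tier to SQL under a hypothetical name."""
    # Imported lazily so the pure function can be tested without Spark.
    from pyspark.sql.types import StringType

    spark.udf.register(name, salary_tier, StringType())
```

After `register_salary_tier_udf(spark)`, a query such as `SELECT first_name, salary_tier_udf(salary) FROM employees` should return the same tiers as the built-in version, but every row makes a round trip through a Python worker.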



#### 4. Use Persistent Tables for Frequently Accessed Data

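Save frequently used results as tables so they are computed once and can then be queried by name. Whether a saved table is visible in later sessions depends on the metastore your Spark deployment is configured with.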

In [29]:
# Compute per-department statistics with a SQL query
dept_stats_df = spark.sql("""
SELECT 
    e.department, 
    d.location,
    COUNT(*) AS emp_count,
    AVG(salary) AS avg_salary,
    MAX(salary) AS max_salary,
    MIN(salary) AS min_salary,
    SUM(salary) AS total_salary,
    AVG(years_exp) AS avg_experience
FROM 
    employees e
JOIN 
    departments d
ON 
    e.department = d.dept_name
GROUP BY 
    e.department, d.location
""")

# Save as table (for persistence)
dept_stats_df.write.mode("overwrite").saveAsTable("dept_statistics")

# Now use the table
spark.sql("SELECT * FROM dept_statistics").show()

+-----------+-------------+---------+-----------------+----------+----------+------------+--------------+
| department|     location|emp_count|       avg_salary|max_salary|min_salary|total_salary|avg_experience|
+-----------+-------------+---------+-----------------+----------+----------+------------+--------------+
|Engineering|San Francisco|        4|          93000.0|    105000|     80000|      372000|          6.75|
|      Sales|     New York|        3|74666.66666666667|     81000|     68000|      224000|           3.0|
|  Marketing|      Chicago|        2|          75000.0|     78000|     72000|      150000|           5.0|
|         HR|       Boston|        1|          65000.0|     65000|     65000|       65000|           5.0|
+-----------+-------------+---------+-----------------+----------+----------+------------+--------------+



#### 5. Leverage Spark SQL Configurations

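Tune Spark SQL through session configuration. Appropriate values depend on data volume and cluster size, so treat the settings below as illustrative rather than universally optimal.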

In [30]:
# Show current SQL configurations
print("Current SQL Configurations:")
configs = [
    "spark.sql.shuffle.partitions",
    "spark.sql.autoBroadcastJoinThreshold",
    "spark.sql.adaptive.enabled",
    "spark.sql.adaptive.coalescePartitions.enabled",
    "spark.sql.optimizer.maxIterations"
]

for config in configs:
    try:
        value = spark.conf.get(config)
        print(f"{config}: {value}")
    except Exception:
        print(f"{config}: Not set")

# Set some optimized values
print("\nSetting optimized values...")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Verify changes
print("\nVerify new configurations:")
for config in configs:
    try:
        value = spark.conf.get(config)
        print(f"{config}: {value}")
    except Exception:
        print(f"{config}: Not set")

Current SQL Configurations:
spark.sql.shuffle.partitions: 4
spark.sql.autoBroadcastJoinThreshold: 10485760
spark.sql.adaptive.enabled: true
spark.sql.adaptive.coalescePartitions.enabled: true
spark.sql.optimizer.maxIterations: 100

Setting optimized values...

Verify new configurations:
spark.sql.shuffle.partitions: 8
spark.sql.autoBroadcastJoinThreshold: 10485760
spark.sql.adaptive.enabled: true
spark.sql.adaptive.coalescePartitions.enabled: true
spark.sql.optimizer.maxIterations: 100
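
Note that `spark.conf.set` changes the setting for the rest of the session, so every later cell inherits `spark.sql.shuffle.partitions = 8`. When a setting should apply to a single query only, one option is to scope it with a context manager. The sketch below assumes the session object exposes `conf.get(key, default)`, `conf.set(key, value)`, and `conf.unset(key)`; `temporary_conf` itself is a hypothetical helper, not a Spark API.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def temporary_conf(spark, overrides: Dict[str, str]) -> Iterator[None]:
    """Apply settings inside a `with` block, then restore the old values."""
    # Remember what each key was before; None means "not explicitly set".
    previous = {key: spark.conf.get(key, None) for key in overrides}
    try:
        for key, value in overrides.items():
            spark.conf.set(key, value)
        yield
    finally:
        for key, old_value in previous.items():
            if old_value is None:
                spark.conf.unset(key)  # fall back to Spark's default
            else:
                spark.conf.set(key, old_value)
```

A call such as `with temporary_conf(spark, {"spark.sql.shuffle.partitions": "8"}): ...` would then leave the session at its earlier value once the block exits, even if the query inside raises.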


## 5. Conclusion

In this notebook, we've covered a wide range of Spark SQL features and best practices:

1. **Basic SQL Operations**: Queries, joins, aggregations, and complex data types
2. **Advanced SQL Features**: Subqueries, CTEs, and window functions
3. **User-Defined Functions**: Different types of UDFs and when to use them
4. **Performance Optimization**: Tips and tricks for efficient Spark SQL usage

Remember these key takeaways:

- Use Spark SQL when working with structured data and when you need SQL-like operations
- Prefer built-in functions over UDFs when possible for better performance
- Apply filters early and use appropriate join strategies
- Leverage Spark's SQL optimizer by understanding query plans
- Choose the right persistence strategy for your data

In [31]:
# Clean up resources
spark.sql("DROP TABLE IF EXISTS global_employees")
spark.sql("DROP TABLE IF EXISTS dept_statistics")
spark.catalog.clearCache()

print("Cleanup complete!")

Cleanup complete!
