# Spark Cluster Integration Demo

This notebook demonstrates how to connect to the Spark cluster and work with:
- Spark Master
- MinIO (S3-compatible storage)
- Hive Metastore

## Environment Overview
- **Spark Master**: spark://spark-master:7077
- **MinIO**: http://minio:9000
- **Hive Metastore**: thrift://hive-metastore:9083

## 1. Create Spark Session with Cluster Connection

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, avg, sum as spark_sum
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
import os

# Create a Spark session connected to the cluster.
# The MinIO access key and secret match the console login listed in section 8.
spark = SparkSession.builder \
    .appName("Jupyter Spark Demo") \
    .master("spark://spark-master:7077") \
    .config("spark.sql.warehouse.dir", "s3a://warehouse/") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000") \
    .config("spark.hadoop.fs.s3a.access.key", "admin") \
    .config("spark.hadoop.fs.s3a.secret.key", "admin123") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false") \
    .config("spark.sql.catalogImplementation", "hive") \
    .config("hive.metastore.uris", "thrift://hive-metastore:9083") \
    .config("spark.executor.memory", "1g") \
    .config("spark.executor.cores", "1") \
    .enableHiveSupport() \
    .getOrCreate()

# Confirm which cluster the session is attached to
print(f"Spark Version: {spark.version}")
print(f"Spark Master: {spark.sparkContext.master}")
print(f"Application ID: {spark.sparkContext.applicationId}")
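The session configuration in this section writes the MinIO access key and secret directly into the notebook. One possible alternative, sketched here, reads them from environment variables. The variable names `MINIO_ENDPOINT`, `MINIO_ACCESS_KEY`, and `MINIO_SECRET_KEY` and both helper functions are hypothetical and not part of this environment; only the `spark.hadoop.fs.s3a.*` option names come from the cell above.

```python
import os


def s3a_conf_from_env(env=None):
    """Build the S3A options for MinIO from environment variables.

    The variable names are assumptions made for this sketch.
    A missing key or secret raises KeyError so the problem surfaces early.
    """
    env = os.environ if env is None else env
    endpoint = env.get("MINIO_ENDPOINT", "http://minio:9000")
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": env["MINIO_ACCESS_KEY"],
        "spark.hadoop.fs.s3a.secret.key": env["MINIO_SECRET_KEY"],
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Enable SSL only when the endpoint itself uses https
        "spark.hadoop.fs.s3a.connection.ssl.enabled": str(
            endpoint.startswith("https://")
        ).lower(),
    }


def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession builder and return it."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With the variables exported before Jupyter starts, the builder chain could begin with `apply_conf(SparkSession.builder.appName("Jupyter Spark Demo"), s3a_conf_from_env())` and continue with the remaining `.config(...)` calls.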

## 2. Create Sample Data

In [1]:
# Create sample employee data
data = [
    (1, "Alice", "Engineering", 95000, "New York"),
    (2, "Bob", "Sales", 75000, "San Francisco"),
    (3, "Charlie", "Engineering", 105000, "New York"),
    (4, "Diana", "Marketing", 80000, "Chicago"),
    (5, "Eve", "Engineering", 98000, "San Francisco"),
    (6, "Frank", "Sales", 72000, "Chicago"),
    (7, "Grace", "Marketing", 85000, "New York"),
    (8, "Henry", "Engineering", 110000, "San Francisco"),
    (9, "Iris", "Sales", 78000, "New York"),
    (10, "Jack", "Marketing", 82000, "Chicago")
]

columns = ["id", "name", "department", "salary", "location"]

# Create DataFrame
df = spark.createDataFrame(data, columns)
df.show()
df.printSchema()

25/10/23 09:19:46 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties

+---+-------+-----------+------+-------------+
| id|   name| department|salary|     location|
+---+-------+-----------+------+-------------+
|  1|  Alice|Engineering| 95000|     New York|
|  2|    Bob|      Sales| 75000|San Francisco|
|  3|Charlie|Engineering|105000|     New York|
|  4|  Diana|  Marketing| 80000|      Chicago|
|  5|    Eve|Engineering| 98000|San Francisco|
|  6|  Frank|      Sales| 72000|      Chicago|
|  7|  Grace|  Marketing| 85000|     New York|
|  8|  Henry|Engineering|110000|San Francisco|
|  9|   Iris|      Sales| 78000|     New York|
| 10|   Jack|  Marketing| 82000|      Chicago|
+---+-------+-----------+------+-------------+

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- department: string (nullable = true)
 |-- salary: long (nullable = true)
 |-- location: string (nullable = true)
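The schema above was inferred from the Python values, which is why `id` and `salary` come out as `long`. A minimal sketch of pinning the types instead: `createDataFrame` also accepts a DDL-formatted schema string. The names `EMPLOYEE_FIELDS`, `ddl_schema`, and `EMPLOYEE_DDL` are hypothetical helpers introduced for illustration.

```python
# Column names and Spark SQL types, kept in one place so the DataFrame
# schema and any later table DDL can be derived from the same list.
EMPLOYEE_FIELDS = [
    ("id", "BIGINT"),
    ("name", "STRING"),
    ("department", "STRING"),
    ("salary", "BIGINT"),
    ("location", "STRING"),
]


def ddl_schema(fields):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in fields)


EMPLOYEE_DDL = ddl_schema(EMPLOYEE_FIELDS)
print(EMPLOYEE_DDL)

# In the notebook, with `spark` and `data` defined as in the cells above:
# df = spark.createDataFrame(data, schema=EMPLOYEE_DDL)
```

The types chosen here match the inferred schema, so the resulting DataFrame should be identical; the difference is that a value of the wrong type would fail at creation time rather than silently change the inferred type.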



## 3. Basic DataFrame Operations

In [None]:
# Filter employees in Engineering department
print("\n=== Engineering Employees ===")
engineering_df = df.filter(col("department") == "Engineering")
engineering_df.show()

In [None]:
# Calculate average salary by department
print("\n=== Average Salary by Department ===")
avg_salary_df = df.groupBy("department").agg(
    count("*").alias("employee_count"),
    avg("salary").alias("avg_salary")
).orderBy(col("avg_salary").desc())
avg_salary_df.show()

In [None]:
# Employee count by location
print("\n=== Employees by Location ===")
location_df = df.groupBy("location").agg(
    count("*").alias("employee_count")
).orderBy(col("employee_count").desc())
location_df.show()
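Because the sample has only ten rows, the aggregations in this section can be cross-checked without the cluster. The sketch below recomputes them in plain Python over the same tuples; `department_stats` and `location_counts` are hypothetical helper names, and the result is a reference to compare against, not a replacement for the Spark code.

```python
from collections import Counter, defaultdict

# Same rows as the sample data cell: (id, name, department, salary, location)
data = [
    (1, "Alice", "Engineering", 95000, "New York"),
    (2, "Bob", "Sales", 75000, "San Francisco"),
    (3, "Charlie", "Engineering", 105000, "New York"),
    (4, "Diana", "Marketing", 80000, "Chicago"),
    (5, "Eve", "Engineering", 98000, "San Francisco"),
    (6, "Frank", "Sales", 72000, "Chicago"),
    (7, "Grace", "Marketing", 85000, "New York"),
    (8, "Henry", "Engineering", 110000, "San Francisco"),
    (9, "Iris", "Sales", 78000, "New York"),
    (10, "Jack", "Marketing", 82000, "Chicago"),
]


def department_stats(rows):
    """Return (department, employee_count, avg_salary), highest average first."""
    salaries = defaultdict(list)
    for _id, _name, department, salary, _location in rows:
        salaries[department].append(salary)
    stats = [(dept, len(vals), sum(vals) / len(vals)) for dept, vals in salaries.items()]
    return sorted(stats, key=lambda s: s[2], reverse=True)


def location_counts(rows):
    """Return a mapping of location to number of employees."""
    return dict(Counter(row[4] for row in rows))


for dept, n, avg_salary in department_stats(data):
    print(f"{dept}: {n} employees, average salary {avg_salary:.2f}")
print(location_counts(data))
```

Engineering should come out first with 4 employees and an average of 102000.00. Two locations tie at 3 employees, so their relative order in the Spark output is not guaranteed.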

In [None]:
df.printSchema()

## 4. Save Data to MinIO (S3)

In [None]:
# Save as Parquet to MinIO
output_path = "s3a://data/employees/parquet"
df.write.mode("overwrite").parquet(output_path)
print(f"Data saved to: {output_path}")

In [None]:
# Read back from MinIO
df_from_s3 = spark.read.parquet(output_path)
print(f"\nRecords read from S3: {df_from_s3.count()}")
df_from_s3.show(5)
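A record count alone does not show that the data survived the round trip, and rows read back from Parquet are not guaranteed to arrive in their original order. A minimal sketch of an order-insensitive comparison follows; `same_rows` is a hypothetical helper, and it assumes both DataFrames are small enough to `collect()` to the driver.

```python
from collections import Counter


def same_rows(expected, actual):
    """Return True when both collections hold the same rows, ignoring order.

    Rows are compared as tuples and counted, so duplicates must match too.
    PySpark Row objects are tuple subclasses, so collected rows work as input.
    """
    return Counter(tuple(r) for r in expected) == Counter(tuple(r) for r in actual)


# In the notebook, after the write and read cells above:
# assert same_rows(df.collect(), df_from_s3.collect())
```

For datasets too large to collect, a different check (for example comparing counts and aggregates) would be needed.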

## 5. Create and Query Hive Tables

In [None]:
# Create a database
spark.sql("CREATE DATABASE IF NOT EXISTS company")
spark.sql("SHOW DATABASES").show()

In [None]:
# Create a managed table registered in the Hive Metastore (overwritten on each run)
df.write.mode("overwrite").saveAsTable("company.employees")
print("Table 'company.employees' created successfully")

# Show tables in the database
spark.sql("SHOW TABLES IN company").show()

In [None]:
# Query the Hive table using SQL
print("\n=== Query: Top 5 Highest Paid Employees ===")
result = spark.sql("""
    SELECT name, department, salary, location
    FROM company.employees
    ORDER BY salary DESC
    LIMIT 5
""")
result.show()

In [None]:
# Complex SQL query with aggregation
print("\n=== Query: Department Statistics ===")
result = spark.sql("""
    SELECT 
        department,
        COUNT(*) as num_employees,
        ROUND(AVG(salary), 2) as avg_salary,
        MIN(salary) as min_salary,
        MAX(salary) as max_salary
    FROM company.employees
    GROUP BY department
    ORDER BY avg_salary DESC
""")
result.show()

## 6. Check Table Metadata

In [None]:
# Describe the table
spark.sql("DESCRIBE EXTENDED company.employees").show(truncate=False)
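The `DESCRIBE EXTENDED` output mixes column definitions with table-level details such as the table type and storage location. The sketch below pulls a few of those details into a dictionary. `table_details` is a hypothetical helper, and it assumes the layout where table-level rows follow a `# Detailed Table Information` marker row, which may differ between Spark versions.

```python
def table_details(describe_rows, keys=("Type", "Provider", "Location")):
    """Extract selected table-level fields from DESCRIBE EXTENDED rows.

    Each row is expected to be (col_name, data_type, comment). Only rows
    after the '# Detailed Table Information' marker are considered, so a
    data column that happens to be called 'Type' is not picked up.
    """
    details = {}
    in_details = False
    for row in describe_rows:
        name, value = row[0], row[1]
        if name == "# Detailed Table Information":
            in_details = True
            continue
        if in_details and name in keys:
            details[name] = value
    return details


# In the notebook:
# rows = spark.sql("DESCRIBE EXTENDED company.employees").collect()
# print(table_details(rows))
```

Running it on both `company.employees` and an external table is a quick way to compare their `Type` and `Location` values.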

## 7. Create an External Table

In [None]:
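# Create an external table over the Parquet files written to MinIO in section 4.
# EXTERNAL makes explicit that dropping the table leaves those files in place.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS company.employees_external (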
        id LONG,
        name STRING,
        department STRING,
        salary LONG,
        location STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://data/employees/parquet'
""")

print("External table created successfully")
spark.sql("SHOW TABLES IN company").show()

In [None]:
# Query the external table
spark.sql("SELECT * FROM company.employees_external").show()

## 8. Monitor Spark Jobs

You can monitor the cluster and your Spark jobs at:
- Spark Master UI: http://localhost:8080
- Spark Application UI: http://localhost:4040 (available only while the application is running)
- MinIO Console: http://localhost:9001 (login: admin / admin123)

In [None]:
# Get Spark context information
sc = spark.sparkContext
print(f"Application Name: {sc.appName}")
print(f"Application ID: {sc.applicationId}")
print(f"Master: {sc.master}")
print(f"Spark UI: {sc.uiWebUrl}")
print(f"Default Parallelism: {sc.defaultParallelism}")
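Besides the web pages, the application UI serves a REST API under `/api/v1`, which can be handy for checking job status from a notebook cell. The sketch below is one way to summarize it. The helper names are hypothetical, and it assumes the UI address reported by `sc.uiWebUrl` is reachable from wherever the code runs.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_json(url, timeout=5):
    """GET a URL and parse the response body as JSON."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def application_jobs(ui_url, app_id):
    """Fetch the job list for one application from the Spark UI REST API."""
    return fetch_json(f"{ui_url}/api/v1/applications/{app_id}/jobs")


def job_status_counts(jobs):
    """Count jobs by their 'status' field (for example RUNNING or SUCCEEDED)."""
    return Counter(job["status"] for job in jobs)


# In the notebook, with `sc` defined as in the cell above:
# jobs = application_jobs(sc.uiWebUrl, sc.applicationId)
# print(job_status_counts(jobs))
```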

## 9. Cleanup (Optional)

In [None]:
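# Uncomment to drop the tables and the database.
# Dropping the managed table deletes its data; dropping the external table
# removes only the metastore entry, so the Parquet files in MinIO remain.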
# spark.sql("DROP TABLE IF EXISTS company.employees")
# spark.sql("DROP TABLE IF EXISTS company.employees_external")
# spark.sql("DROP DATABASE IF EXISTS company")
# print("Cleanup completed")

## Summary

This notebook demonstrated:
1. ✅ Connecting to the Spark cluster from Jupyter
2. ✅ Creating and manipulating DataFrames
3. ✅ Saving and reading data from MinIO (S3)
4. ✅ Creating and querying Hive tables
5. ✅ Working with both managed and external tables
6. ✅ Running SQL queries on distributed data

You can now use this setup to:
- Develop and test PySpark applications
- Analyze large datasets
- Build data pipelines
- Prototype ML models with MLlib