# PySpark Student Example Notebook

This notebook shows how to set up PySpark with a single function call, then walks through a few basic DataFrame and SQL operations.

## Quick Start

Just run the cell below to initialize your PySpark environment!

In [None]:
# One-line initialization - this does everything!
from pyspark_local import initialize_pyspark
spark = initialize_pyspark(run_tests=True)

## What Just Happened?

The `initialize_pyspark()` call above, with `run_tests=True`, did four things:
1. ✓ Checked that Java is installed
2. ✓ Verified all PySpark dependencies
3. ✓ Created a Spark session configured for local use
4. ✓ Ran validation tests to ensure everything works

Now you can start working with Spark!
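
For readers curious about what steps like these involve, here is a minimal sketch using only the standard library and the public PySpark API. It is not the actual implementation of `pyspark_local`, which this notebook does not show. The function names `check_local_environment` and `build_local_session` are hypothetical, and the sketch assumes Java is discoverable on `PATH`.

```python
import importlib.util
import shutil


def check_local_environment():
    """Report whether a Java executable and the pyspark package can be found."""
    return {
        "java_on_path": shutil.which("java") is not None,
        "pyspark_importable": importlib.util.find_spec("pyspark") is not None,
    }


def build_local_session(app_name="StudentNotebook", master="local[*]"):
    """Create (or reuse) a SparkSession that runs on this machine only."""
    # Imported here so the environment check above works even without pyspark.
    from pyspark.sql import SparkSession

    return SparkSession.builder.appName(app_name).master(master).getOrCreate()


status = check_local_environment()
print(status)
```

If both values in `status` are `True`, calling `build_local_session()` should return a usable session. The helper from `pyspark_local` remains the recommended route in this notebook.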

## Example 1: Simple Range Operation

In [None]:
# Create a simple range DataFrame
df = spark.range(10)
df.show()

## Example 2: Create DataFrame from Data

In [None]:
# Create a DataFrame from Python data
data = [
    ('James', '', 'Smith', '1991-04-01', 'M', 3000),
    ('Michael', 'Rose', '', '2000-05-19', 'M', 4000),
    ('Robert', '', 'Williams', '1978-09-05', 'M', 4000),
    ('Maria', 'Anne', 'Jones', '1967-12-01', 'F', 4000),
    ('Jen', 'Mary', 'Brown', '1980-02-17', 'F', 5000)
]

columns = ["firstname", "middlename", "lastname", "dob", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)

df.show()

In [None]:
# Check the schema
df.printSchema()

## Example 3: DataFrame Operations

In [None]:
# Filter data
high_earners = df.filter(df.salary > 3500)
high_earners.show()

In [None]:
# Select specific columns
df.select("firstname", "lastname", "salary").show()

In [None]:
# Group by and aggregate
df.groupBy("gender").avg("salary").show()
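
To check the aggregation by hand, the same averages can be computed in plain Python from the five rows in Example 2. This is only a cross-check of the arithmetic, not how Spark computes the result.

```python
from collections import defaultdict

# The (gender, salary) pairs from the rows defined in Example 2.
rows = [("M", 3000), ("M", 4000), ("M", 4000), ("F", 4000), ("F", 5000)]

# Collect salaries per gender, then average each group.
salaries_by_gender = defaultdict(list)
for gender, salary in rows:
    salaries_by_gender[gender].append(salary)

expected_avg = {g: sum(v) / len(v) for g, v in salaries_by_gender.items()}
print(expected_avg)
```

Spark should report about 3666.67 for `M` and 4500.0 for `F` in a column named `avg(salary)`. The order of the rows in the Spark output is not guaranteed.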

## Example 4: SQL Queries

In [None]:
# Register DataFrame as a temporary view
df.createOrReplaceTempView("employees")

# Run SQL queries
result = spark.sql("""
    SELECT gender, AVG(salary) as avg_salary, COUNT(*) as count
    FROM employees
    GROUP BY gender
    ORDER BY avg_salary DESC
""")

result.show()

## Example 5: Reading JSON Files

In [None]:
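# Read a JSON file if one is available
# (by default Spark expects one JSON object per line)
try:
    json_df = spark.read.json("data/people.json")
    json_df.show()
    json_df.printSchema()
except Exception as e:
    # A missing file is the most likely cause, but any read error lands here
    print(f"Note: could not read data/people.json. Error: {e}")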
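
If `data/people.json` does not exist yet, a small sample file can be created with plain Python. The records below are made up for illustration and are not part of any dataset shipped with this notebook. The file is written in JSON Lines format (one object per line), which is what `spark.read.json` expects by default.

```python
import json
from pathlib import Path

# Made-up sample records, used only so the read cell has something to load.
people = [
    {"name": "Ana", "age": 34},
    {"name": "Ben", "age": 27},
    {"name": "Chen"},  # no "age" key; Spark reads the missing field as null
]

path = Path("data/people.json")
path.parent.mkdir(parents=True, exist_ok=True)

# JSON Lines: one complete JSON object per line, with no enclosing array.
path.write_text("\n".join(json.dumps(p) for p in people) + "\n", encoding="utf-8")
print(f"Wrote {len(people)} records to {path}")
```

After running this, re-run the read cell above. For a regular multi-line JSON document, the reader would need `spark.read.option("multiLine", True)` instead.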

## Monitoring Your Spark Application

While your Spark session is running, you can monitor it at:
- **Spark UI**: http://localhost:4040 (the default port; if 4040 is already in use, Spark tries 4041, 4042, and so on)

This shows you:
- Jobs and stages
- Storage
- Environment settings
- Executors
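
Because the port can differ from 4040, it can help to ask the session for its actual UI address. The sketch below wraps the `uiWebUrl` property of `SparkContext`. The function name `spark_ui_url` is a hypothetical helper and not part of `pyspark_local`.

```python
def spark_ui_url(session):
    """Return the web UI address reported by a running SparkSession."""
    # sparkContext.uiWebUrl is the URL of the UI started by this session's context.
    return session.sparkContext.uiWebUrl
```

In this notebook, `print(spark_ui_url(spark))` would print the address to open in a browser while the session is active.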

## When You're Done

Stop your Spark session when you're finished so that it releases its memory and the Spark UI port.

In [None]:
# Stop the Spark session
spark.stop()
print("Spark session stopped successfully!")
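
If a cell raises an error before `spark.stop()` is reached, the session keeps running. One way to guard against that is a `try`/`finally` wrapper. This is a minimal sketch, and `run_then_stop` is a hypothetical helper name that is not part of `pyspark_local`.

```python
def run_then_stop(session, job):
    """Run job(session) and stop the session afterwards, even if job raises."""
    try:
        return job(session)
    finally:
        # Runs on success and on failure, so the session never lingers.
        session.stop()
```

For example, `run_then_stop(spark, lambda s: s.range(10).count())` would count the rows and then stop the session, whether or not the count succeeds.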

## Alternative: Manual Setup (for reference)

If you want more control, you can use the functions individually:

In [None]:
# Don't run this if you already initialized above!

from pyspark_local import check_environment, create_spark_session, run_validation_tests

# 1. Check environment manually
env = check_environment()
print(env)

# 2. Create session with custom config
spark = create_spark_session(
    app_name="MyCustomApp",
    master="local[2]",  # Run locally with 2 worker threads
    log_level="WARN"
)

# 3. Run validation tests
results = run_validation_tests(spark, verbose=True)

# Remember to stop when done
# spark.stop()