# Spark Pi Estimation - Assignment 3

This notebook demonstrates distributed Pi estimation using Apache Spark.

## Original Setup:
- **Local Cluster:** 1 master + 1 worker (2 cores, 4GB RAM each)
- **OCI Cluster:** 1 master + 1 worker (2 cores, 12GB RAM each)

## Current Setup:
- **Single PC:** 12 cores, 32GB RAM using Spark local mode

In [6]:
import time
from pyspark.sql import SparkSession
import matplotlib.pyplot as plt
import numpy as np

## Constants and Configuration

In [7]:
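FIXED_STEPS = 1_000_000  # Step count used in the core-scaling experiment
STEP_COUNTS = [1_000, 10_000, 100_000, 1_000_000]  # Step counts for the step-scaling experiment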

## Pi Estimation Function

Uses a midpoint Riemann sum to approximate π via the integral ∫₀¹ 4/(1+x²) dx = π.

In [8]:
def pi_estimate_riemann(index, total_steps_n):
    """
    Calculate one rectangle's contribution to the Riemann sum.
    
    Args:
        index: 1-based step index (1 to total_steps_n)
        total_steps_n: Total number of steps
    
    Returns:
        Area of rectangle at this index
    """
    delta_x = 1.0 / total_steps_n  # Width of rectangle
    x = delta_x * (index - 0.5)    # Midpoint
    area = 4.0 / (1.0 + x * x) * delta_x  # Height * width
    return area
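
Before involving Spark, it can help to check the function on its own. The sketch below sums the rectangles in plain Python and compares the result with `math.pi`. The helper name `pi_estimate_local` is an assumption made for this illustration; it is not part of the notebook. For a midpoint rule on a smooth integrand, the error should shrink roughly with 1/n² as the step count n grows.

```python
import math


def pi_estimate_riemann(index, total_steps_n):
    """Area of the midpoint rectangle at a 1-based index (same formula as above)."""
    delta_x = 1.0 / total_steps_n
    x = delta_x * (index - 0.5)
    return 4.0 / (1.0 + x * x) * delta_x


def pi_estimate_local(total_steps_n):
    """Hypothetical helper: sum all rectangles in plain Python, without Spark."""
    return sum(pi_estimate_riemann(i, total_steps_n)
               for i in range(1, total_steps_n + 1))


# Print the estimate and its absolute error for a few small step counts.
for n in (10, 100, 1000):
    estimate = pi_estimate_local(n)
    print(f"n={n:>5}  estimate={estimate:.10f}  abs_error={abs(estimate - math.pi):.3e}")
```

If the Spark result for the same step count differs noticeably from this local value, the problem is likely in the distribution step rather than in the formula.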

## Experiment Runner

Distributes computation across Spark executors using RDD operations.

In [9]:
def run_spark_experiment(spark, total_steps):
    """
    Run Pi estimation using Spark RDD operations.
    
    Args:
        spark: SparkSession instance
        total_steps: Number of Riemann sum steps
    
    Returns:
        runtime: Wall-clock execution time in seconds
        estimated_pi: Calculated π value
    """
    start_time = time.perf_counter()  # Monotonic clock, suited to measuring intervals
    
    # Create RDD with range of indices
    rdd = spark.sparkContext.range(1, total_steps + 1)
    
    # Map each index to its area contribution and sum all areas
    estimated_pi = rdd.map(lambda index: pi_estimate_riemann(index, total_steps)).sum()
    
    runtime = time.perf_counter() - start_time
    
    return runtime, estimated_pi
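
As a conceptual model of what the `map(...).sum()` pipeline does, the sketch below splits the index range into contiguous chunks, computes one partial sum per chunk, and then adds the partial sums. It is plain Python that runs sequentially and is not how Spark schedules work internally; `split_range` and `partitioned_pi` are hypothetical names introduced only for this illustration.

```python
def split_range(start, stop, num_partitions):
    """Split [start, stop) into contiguous chunks of near-equal size."""
    total = stop - start
    bounds = [start + (total * p) // num_partitions
              for p in range(num_partitions + 1)]
    return [range(bounds[p], bounds[p + 1]) for p in range(num_partitions)]


def partitioned_pi(total_steps, num_partitions=4):
    """Compute one partial sum per chunk, then combine the partial sums."""
    delta_x = 1.0 / total_steps
    partial_sums = []
    for chunk in split_range(1, total_steps + 1, num_partitions):
        # Each chunk stands in for the work one task would do on its partition.
        partial = sum(4.0 / (1.0 + (delta_x * (i - 0.5)) ** 2) * delta_x
                      for i in chunk)
        partial_sums.append(partial)
    return sum(partial_sums), partial_sums


estimate, partials = partitioned_pi(100_000, num_partitions=4)
print(f"partial sums: {partials}")
print(f"estimate:     {estimate:.8f}")
```

The chunks are independent of one another, which is why the sum can be spread over several cores; only the final combination of partial sums needs the results in one place.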

## Initialize Spark Session

Using `local[4]`, which runs Spark in a single JVM with 4 worker threads, to approximate the 4 cores of the original setup (2 nodes with 2 cores each).

In [None]:
import os
import sys

# Set critical environment variables before the Spark JVM is launched,
# i.e. before the first SparkSession is created in this kernel
os.environ["SPARK_LOCAL_IP"] = "127.0.0.1"
os.environ["HADOOP_USER_NAME"] = "spark"
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

# Comprehensive Java options for Java 17/21/23
# The critical flag is -Djava.security.manager=allow
# Driver JVM options only take effect if no Spark JVM is already running in this kernel
java_opts = (
    "-Djava.security.manager=allow "
    "--add-opens=java.base/java.lang=ALL-UNNAMED "
    "--add-opens=java.base/java.lang.invoke=ALL-UNNAMED "
    "--add-opens=java.base/java.lang.reflect=ALL-UNNAMED "
    "--add-opens=java.base/java.io=ALL-UNNAMED "
    "--add-opens=java.base/java.net=ALL-UNNAMED "
    "--add-opens=java.base/java.nio=ALL-UNNAMED "
    "--add-opens=java.base/java.util=ALL-UNNAMED "
    "--add-opens=java.base/java.util.concurrent=ALL-UNNAMED "
    "--add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED "
    "--add-opens=java.base/sun.nio.ch=ALL-UNNAMED "
    "--add-opens=java.base/sun.nio.cs=ALL-UNNAMED "
    "--add-opens=java.base/sun.security.action=ALL-UNNAMED "
    "--add-opens=java.base/sun.util.calendar=ALL-UNNAMED "
    "--add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED "
)

spark = SparkSession.builder \
    .appName("SparkPiEstimation") \
    .master("local[4]") \
    .config("spark.driver.memory", "2g") \
    .config("spark.ui.enabled", "false") \
    .config("spark.driver.bindAddress", "127.0.0.1") \
    .config("spark.driver.host", "localhost") \
    .config("spark.driver.extraJavaOptions", java_opts) \
    .config("spark.executor.extraJavaOptions", java_opts) \
    .getOrCreate()

print(f"Spark Version: {spark.version}")
print(f"Master: {spark.sparkContext.master}")
print(f"Default Parallelism: {spark.sparkContext.defaultParallelism}")

Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.UnsupportedOperationException: getSubject is supported only if a security manager is allowed
	at java.base/javax.security.auth.Subject.getSubject(Subject.java:347)
	at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:588)
	at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2446)
	at scala.Option.getOrElse(Option.scala:201)
	at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2446)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:339)
	at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:59)
	at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
	at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:501)
	at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:485)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
	at py4j.Gateway.invoke(Gateway.java:238)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:184)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:108)
	at java.base/java.lang.Thread.run(Thread.java:1575)
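
The message above is raised by the JDK's `Subject.getSubject`, which is the call that `-Djava.security.manager=allow` is meant to permit. One possibility is that the option never reached the driver JVM (for example, if a JVM had already been started earlier in the same kernel); another is that a different Java version is in use than expected. A first diagnostic step could be to check which Java version is found on `PATH`. The sketch below is an assumption-laden helper, not part of the notebook: `parse_java_major` and `detect_java_major` are hypothetical names, and it only inspects the `java` executable on `PATH`.

```python
import re
import shutil
import subprocess


def parse_java_major(version_output):
    """Extract the major Java version from `java -version` output, or None."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering scheme: "1.8.0_392" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def detect_java_major():
    """Run `java -version` if a java executable is on PATH; otherwise return None."""
    java_path = shutil.which("java")
    if java_path is None:
        return None
    # `java -version` writes its report to stderr.
    result = subprocess.run([java_path, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr or result.stdout)


print(f"Java major version on PATH: {detect_java_major()}")
```

Spark's launcher scripts prefer `JAVA_HOME` when it is set, so the version reported here may differ from the one Spark actually starts. If the options look correct, restarting the kernel before re-running the session cell ensures they are applied to a freshly launched JVM.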


## Experiment 1: Core Scaling

Compare performance with 1 core vs 4 cores (fixed steps = 1M).

In [None]:
print("=" * 80)
print("Core Scaling Experiment")
print("=" * 80)

core_results = []

# Test with 1 core
spark.stop()
spark_1 = SparkSession.builder \
    .appName("Spark_1Core") \
    .master("local[1]") \
    .getOrCreate()

runtime, pi = run_spark_experiment(spark_1, FIXED_STEPS)
core_results.append({'cores': 1, 'time': runtime, 'pi': pi})
print(f"1 Core: Time={runtime:.4f}s, Pi={pi:.6f}")
spark_1.stop()

# Test with 4 cores
spark_4 = SparkSession.builder \
    .appName("Spark_4Cores") \
    .master("local[4]") \
    .getOrCreate()

runtime, pi = run_spark_experiment(spark_4, FIXED_STEPS)
core_results.append({'cores': 4, 'time': runtime, 'pi': pi})
print(f"4 Cores: Time={runtime:.4f}s, Pi={pi:.6f}")

# Restore 4-core session for next experiments
spark = spark_4
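
Each configuration above is timed with a single run, and the 1-core run is also the first job executed in its session, so one-off startup costs may inflate its measurement. A minimal sketch of a more robust approach, assuming a hypothetical helper named `time_callable`, is to run an untimed warm-up and report the median of several timed runs. The workload below is a stand-in so the sketch runs without Spark.

```python
import statistics
import time


def time_callable(fn, repeats=3, warmup=1):
    """Return (median_seconds, last_result) for fn after untimed warm-up calls."""
    for _ in range(warmup):
        fn()  # Warm-up run; its duration is discarded.
    durations = []
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        durations.append(time.perf_counter() - start)
    return statistics.median(durations), result


# Stand-in workload (an assumption for illustration), not the Spark job itself.
median_s, value = time_callable(lambda: sum(range(100_000)), repeats=5)
print(f"median={median_s:.6f}s result={value}")
```

In the notebook, the callable could wrap the existing runner, for example `lambda: run_spark_experiment(spark_1, FIXED_STEPS)[1]`, so that both core counts are measured after the same amount of warm-up.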

## Experiment 2: Step Scaling

Test different step counts with 4 cores.

In [None]:
print("\n" + "=" * 80)
print("Step Scaling Experiment (4 Cores)")
print("=" * 80)

step_results = []

for steps in STEP_COUNTS:
    runtime, pi = run_spark_experiment(spark, steps)
    step_results.append({'steps': steps, 'time': runtime, 'pi': pi})
    print(f"Steps: {steps:>7}, Time: {runtime:.4f}s, Pi: {pi:.6f}")

## Visualization: Core Scaling

In [None]:
core_labels = [f"{r['cores']} Core(s)" for r in core_results]
core_times = [r['time'] for r in core_results]

plt.figure(figsize=(8, 6))
plt.bar(core_labels, core_times, color=['skyblue', 'lightcoral'])
plt.title('Spark: Core Scaling Performance (Steps = 1M)')
plt.xlabel('Number of Cores')
plt.ylabel('Execution Time (seconds)')
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()

# Calculate speedup
speedup = core_times[0] / core_times[1]
print(f"\nSpeedup (1 core → 4 cores): {speedup:.2f}x")
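
Speedup alone does not show how well the extra cores are used. A common companion metric is parallel efficiency, defined as speedup divided by the number of cores, where 1.0 means ideal scaling. The sketch below uses a hypothetical helper, `speedup_and_efficiency`, and placeholder timings that are not measurements from this notebook.

```python
def speedup_and_efficiency(time_baseline, time_parallel, cores):
    """Return (speedup, efficiency) for a run on `cores` cores vs a 1-core baseline."""
    speedup = time_baseline / time_parallel
    efficiency = speedup / cores
    return speedup, efficiency


# Placeholder timings for illustration only; these are NOT measured results.
example_speedup, example_efficiency = speedup_and_efficiency(2.0, 1.0, cores=4)
print(f"speedup={example_speedup:.2f}x  efficiency={example_efficiency:.0%}")
```

With the notebook's data, the call would be `speedup_and_efficiency(core_times[0], core_times[1], cores=4)`.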

## Visualization: Step Scaling

In [None]:
step_values = [r['steps'] for r in step_results]
step_times = [r['time'] for r in step_results]

plt.figure(figsize=(10, 6))
plt.plot(step_values, step_times, marker='o', linestyle='-', color='blue', linewidth=2)
plt.xscale('log')
plt.title('Spark: Step Scaling Performance (Cores = 4)')
plt.xlabel('Number of Steps (Log Scale)')
plt.ylabel('Execution Time (seconds)')
plt.grid(True, which="both", ls="--", alpha=0.7)
plt.show()

## Cleanup

In [None]:
spark.stop()
print("Spark session stopped.")

## Analysis

### Key Observations:
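1. **Core Scaling:** Speedup from 1 to 4 cores at 1M steps, compared with the ideal 4x
2. **Step Scaling:** How runtime grows from 1,000 to 1,000,000 steps, and how much of it is fixed overhead rather than computation
3. **Spark Overhead:** At small step counts, runtime is expected to be dominated by job startup and scheduling rather than by the arithmetic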

### Comparison with Original Clusters:
*(Add your original cluster results here)*

### Why Spark May Be Slower for Small Problems:
- Task scheduling overhead
- Data serialization costs
- JVM warmup time
- These fixed costs are amortized only on large workloads; Spark is better suited to large-scale data processing (100GB+)
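
One way to quantify the fixed costs listed above is to model runtime as `overhead + per_step * steps` and fit both terms to the step-scaling measurements by least squares. The sketch below is a simplified model under that assumption; `fit_fixed_overhead` is a hypothetical helper, and the timings are synthetic values generated from a known model, not results from this notebook.

```python
def fit_fixed_overhead(step_counts, runtimes):
    """Least-squares fit of runtime = overhead + per_step * steps."""
    n = len(step_counts)
    mean_x = sum(step_counts) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in step_counts)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(step_counts, runtimes))
    per_step = sxy / sxx                    # Slope: seconds per step
    overhead = mean_y - per_step * mean_x   # Intercept: fixed seconds per job
    return overhead, per_step


# Synthetic timings from a known model (0.5 s overhead, 1 microsecond per step).
# These are placeholders for illustration, NOT measurements.
steps = [1000, 10000, 100000, 1000000]
times = [0.5 + 1e-6 * s for s in steps]
overhead, per_step = fit_fixed_overhead(steps, times)
print(f"overhead={overhead:.3f}s  per_step={per_step:.2e}s")
```

Applied to the notebook's `step_values` and `step_times`, a large intercept relative to `per_step * steps` would indicate that the run is dominated by overhead at that problem size.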

### Spark vs Ray:
- Ray: Lower overhead, better for iterative/small tasks
- Spark: Optimized for batch processing at scale