# Overview
Apache Spark is a distributed compute environment. One of the classic use cases for distributed compute is running Monte Carlo simulations.

In this notebook we explore the "Hello, World!" example you'll find in most Spark tutorials. This example is commonly referred to as Spark Pi. It uses a Monte Carlo simulation to compute the value of pi.

# 1. The Spark Pi Problem
We are going to run the Spark Pi example, which uses a Monte Carlo method known as the "Circle Method" to approximate the value of pi.
In short, we will generate a large number of random points within a unit square and determine the fraction of those points that fall within the circle. From this fraction we can approximate the value of pi.

Recall that the area of a circle is defined as:
$$ A_c = \pi r^2 $$
Considering a circle inscribed in the unit square, its diameter is 1, so we have $r = 0.5$, and therefore

$$ A_c = 0.5^2 \pi  = 0.25\pi = \frac{\pi}{4}$$
Recall that the area of a square is defined as:
$$ A_s = l^2 = 1^2 = 1 $$
If we divide the area of the circle (smaller) by the area of the square (larger) we have the following equality:

$$ \frac{A_c}{A_s} = \frac{\pi / 4}{1} = \frac{\pi}{4}$$
And therefore we can say:
$$ \pi = 4 \frac{A_c}{A_s} $$
With this equation we can derive the value of pi using the area of the circle and the square.


We can approximate the ratio of these areas using a set of random numbers and a bit of logic.


If we generate pairs of uniform random numbers, we can treat each pair as the coordinates of a point in the square.
The number of points that fall in the circle, compared to the total number of points, approximates the ratio of the area of the circle to the area of the square.

$$ \frac{\text{number of points in circle}}{\text{total number of points}} \approx \frac{A_c}{A_s} $$

As the number of random points increases, this fraction converges to the true ratio of the areas, and the estimate converges to the true value of pi.

<center><img src='images/Convergence of Monte Carlo.gif' width="300px"/></center>
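
How quickly the estimate converges can be quantified with a short derivation, sketched here under the assumption that the points are independent and uniformly distributed. The symbols $p$, $k$, $n$ and $\hat{\pi}$ are notation introduced for this sketch. Each point lands in the circle with probability $p = A_c / A_s = \pi / 4$, so the number of hits $k$ out of $n$ points is binomial, and the estimate $\hat{\pi} = 4k / n$ has standard deviation:

$$ \sigma_{\hat{\pi}} = 4 \sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{\pi (4 - \pi)}{n}} \approx \frac{1.64}{\sqrt{n}} $$

For $n = 10000$ this is roughly $0.016$. Because the error shrinks with $\sqrt{n}$, about 100 times more points are needed for each additional correct decimal digit.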

We can determine which points are inside the circle and which are not by using the Pythagorean Theorem.
Given a right triangle, we can determine the length of the hypotenuse if we know the lengths of the other two sides.
$$ A^2 + B^2 = C^2 $$

$$ C = \sqrt{A^2 + B^2} $$
If we compare the hypotenuse with the radius of the circle, we can determine whether or not a point is within the circle.

<center><img src='images/Circle Method Pythagorean Diameter.png' width="300px"/></center>

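The criterion for being inside the circle thus becomes:

$$ \sqrt{X^2 + Y^2} \le r $$

The code below draws $X$ and $Y$ from $[0, 1)$, so the points cover the unit square in the first quadrant. The part of the unit circle ($r = 1$) inside that square is a quarter circle with area $\pi / 4$, which is the same ratio derived above. Because $r = 1$ and $r^2 = 1$, we can square both sides and also say:

$$ X^2 + Y^2 \le 1 $$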
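
A point lies inside the circle when its distance from the origin does not exceed the radius. The following is a minimal local sketch of that test and of the resulting estimate, with no Spark involved. The names `is_inside_circle` and `estimate_pi`, and the `seed` parameter, are assumptions made for illustration and are not part of the Spark Pi example itself.

```python
import math
import random


def is_inside_circle(x, y, r=1.0):
    """Return True when the point (x, y) lies within distance r of the origin."""
    # Comparing squared values avoids the square root without changing the result
    return x * x + y * y <= r * r


def estimate_pi(num_points, seed=0):
    """Estimate pi from num_points uniform random points in the unit square."""
    rng = random.Random(seed)  # seeded only so that repeated runs are reproducible
    hits = sum(is_inside_circle(rng.random(), rng.random()) for _ in range(num_points))
    return 4 * hits / num_points


print(is_inside_circle(0.5, 0.5))  # 0.25 + 0.25 = 0.5, which is at most 1
print(is_inside_circle(0.9, 0.9))  # 0.81 + 0.81 = 1.62, which is greater than 1

# The estimate should drift toward math.pi as the number of points grows
for n in (100, 10_000, 100_000):
    print(n, estimate_pi(n), abs(estimate_pi(n) - math.pi))
```

Working with squared distances is a small design choice: it gives the same answer as comparing $\sqrt{X^2 + Y^2}$ with $r$ while skipping a square root per point.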

# 2. The Spark Pi Code
First we create the Spark session, which gives us access to the SparkContext

In [4]:
from spark_helper import create_spark_session
spark_app_name = "jupyter-pi-spark"
docker_image = "tschneider/apache-spark-k8:v7"
k8_master_ip = "15.4.7.11"
spark_session = create_spark_session(spark_app_name, docker_image, k8_master_ip)

Setting SPARK_HOME
/usr/lib/spark-3.1.1-bin-hadoop2.7

Running findspark.init() function
['/usr/lib/spark-3.1.1-bin-hadoop2.7/python', '/usr/lib/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip', '/usr/lib/spark-3.1.1-bin-hadoop2.7/python', '/tmp/spark-7a9e5055-b789-413e-9b07-35377b9d309d/userFiles-3f81e9a9-d0c4-4bdf-b3ac-5a84d6ba4d0c', '/usr/lib/spark-3.1.1-bin-hadoop2.7/python/lib/py4j-0.10.9-src.zip', '/root/ml-training-jupyter-notebooks/Machine Learning/Big Data And Big Compute/Apache Spark', '/usr/local/lib/python39.zip', '/usr/local/lib/python3.9', '/usr/local/lib/python3.9/lib-dynload', '', '/usr/local/lib/python3.9/site-packages', '/root/ml-training-jupyter-notebooks/Utilities']

Setting PYSPARK_PYTHON
/usr/local/bin/python3

Configuring URL for kubernetes master
k8s://https://15.4.7.11:6443

Determining IP Of Server
The ip was detected as: 15.4.12.12

Creating SparkConf Object
('spark.executor.instances', '3')
('spark.kubernetes.container.image', 'tschneider/pyspark:v5

Then we define a function that runs a single Monte Carlo trial and submit the job to the Spark cluster

In [5]:
# Define a function to generate a pair of random numbers and determine whether they correspond to a point within the circle
import random

def monte_carlo_trial(var):
    # Generate random variables for x and y
    x, y = random.random(), random.random()
    # Calculate whether or not the point is inside the circle
    inside_circle =  x*x + y*y < 1
    # Return the value
    return inside_circle

# Set the number of trials for the monte carlo simulation
number_of_trials = 10000

# Use the SparkContext to apply the Monte Carlo trials in parallel and count the positive results
sc = spark_session.sparkContext
count = sc.parallelize(range(0, number_of_trials)).filter(monte_carlo_trial).count()

# Compute the value of pi based on the information from the monte carlo simulation
pi = 4 * count / number_of_trials

# Print the value of pi
print(pi)

22/02/16 03:01:04 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 10) (10.46.0.1 executor 2): java.io.IOException: Cannot run program "/usr/local/bin/python3": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:209)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:132)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.schedule

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 1.0 failed 4 times, most recent failure: Lost task 2.3 in stage 1.0 (TID 18) (10.46.0.1 executor 2): java.io.IOException: Cannot run program "/usr/local/bin/python3": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:209)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:132)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
	... 17 more

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2253)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2202)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2201)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2201)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1078)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1078)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2440)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2382)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2371)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:868)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2202)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2223)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2242)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2267)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: Cannot run program "/usr/local/bin/python3": error=2, No such file or directory
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1128)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1071)
	at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:209)
	at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:132)
	at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:105)
	at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:119)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:145)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1439)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.io.IOException: error=2, No such file or directory
	at java.base/java.lang.ProcessImpl.forkAndExec(Native Method)
	at java.base/java.lang.ProcessImpl.<init>(ProcessImpl.java:340)
	at java.base/java.lang.ProcessImpl.start(ProcessImpl.java:271)
	at java.base/java.lang.ProcessBuilder.start(ProcessBuilder.java:1107)
	... 17 more


22/02/16 03:01:06 WARN TaskSetManager: Lost task 1.0 in stage 1.0 (TID 9) (10.38.0.1 executor 3): TaskKilled (Stage cancelled)
22/02/16 03:01:06 WARN TaskSetManager: Lost task 4.0 in stage 1.0 (TID 12) (10.38.0.1 executor 3): TaskKilled (Stage cancelled)
22/02/16 03:01:06 WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 11) (10.34.0.0 executor 1): TaskKilled (Stage cancelled)
22/02/16 03:01:06 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 8) (10.34.0.0 executor 1): TaskKilled (Stage cancelled)
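
The trace above shows the job being aborted before `count` was produced. The computation that the `parallelize(...).filter(...).count()` chain describes can still be sketched locally with Python built-ins. This is only an illustration, not how Spark is implemented: the `local_count` helper, the `num_partitions` and `seed` parameters, and the chunking scheme are assumptions introduced here, loosely mirroring the idea of counting hits per partition and adding the partial counts together.

```python
import random


def monte_carlo_trial(_):
    # Same trial as in the cell above: one random point, True if it is inside the circle
    x, y = random.random(), random.random()
    return x * x + y * y < 1


def local_count(number_of_trials, num_partitions=4, seed=42):
    """Count successful trials chunk by chunk, then sum the partial counts."""
    random.seed(seed)  # fixed seed so the sketch is reproducible
    trials = range(number_of_trials)
    chunk = -(-number_of_trials // num_partitions)  # ceiling division
    # Slice the index range into contiguous chunks (stand-ins for partitions)
    partitions = [trials[i:i + chunk] for i in range(0, number_of_trials, chunk)]
    # filter keeps the indices whose trial succeeded; we only need how many there are
    partial_counts = [sum(1 for _ in filter(monte_carlo_trial, part)) for part in partitions]
    return sum(partial_counts)


number_of_trials = 10000
count = local_count(number_of_trials)
pi_estimate = 4 * count / number_of_trials
print(pi_estimate)
```

Note that the index passed to `monte_carlo_trial` is ignored; the range only determines how many trials run.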


In [6]:
# Cleanup
sc.stop()