# Provisioning a Spark Cluster in the Cloud

In this notebook you will provision a Spark cluster on Google Cloud without manually installing Apache Spark on multiple worker nodes or handling networking, logging and security. We will then run the same program (Pi estimation from Lecture 11) in two different programming languages (Java and Python).

To do this, we will use the [Dataproc](https://cloud.google.com/dataproc) service provided by Google Cloud. Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Hadoop, Apache Flink, Presto, and 30+ open source tools and frameworks. 

## 1. Setting up a Cloud Storage Bucket

1. Go to the [project selector](https://console.cloud.google.com/projectselector2/home/dashboard) in Google Cloud and either select an existing project or create a new one.

2. Open the **Navigation menu**, scroll down, click on **Cloud Storage** --> **Buckets** and then press **Create**.
<img src="images/bucket0_2.png" alt="bucket" width='100%'/>


3. Fill in the **Name your bucket** field with an appropriate and unique name and press **Create**. 
<img src="images/bucket1_2.png" alt="bucket" width='100%'/>


4. Once the bucket is created you will see the following view, where you can CREATE FOLDER, UPLOAD FOLDER, UPLOAD FILES, or drag and drop files. You can also delete folders and files after selecting them. In the following screenshot two folders have been created.<br> If you need the path to the bucket (e.g., when creating a cluster or moving files from the master node to your bucket), it is **gs://xxxxxxx**, where xxxxxxx is the name of your bucket.
<img src="images/bucket2_2.png" alt="bucket" width='100%'/>
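
The console steps above can also be scripted. The following is a minimal sketch that only assembles (and prints) the equivalent `gcloud storage buckets create` command and builds `gs://` paths; it does not run anything against Google Cloud. The bucket name `my-example-bucket` and the location `europe-west3` are hypothetical placeholders, not values from this notebook.

```python
import shlex


def gs_uri(bucket, *parts):
    """Build a gs:// URI for a bucket and an optional object path."""
    path = "/".join(p.strip("/") for p in parts if p.strip("/"))
    return f"gs://{bucket}/{path}" if path else f"gs://{bucket}"


def create_bucket_command(bucket, location, project=None):
    """Assemble (but do not run) a gcloud command that creates the bucket."""
    cmd = [
        "gcloud", "storage", "buckets", "create",
        gs_uri(bucket),
        f"--location={location}",
    ]
    if project:
        cmd.append(f"--project={project}")
    return cmd


# Hypothetical names: replace with your own bucket and location.
print(gs_uri("my-example-bucket", "jobs", "pi.py"))
print(shlex.join(create_bucket_command("my-example-bucket", "europe-west3")))
```

The printed command can be pasted into Cloud Shell, or passed to `subprocess.run` on a machine where the `gcloud` CLI is installed and authenticated.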

## 2. Setting up a Dataproc Cluster

1. Go to the [project selector](https://console.cloud.google.com/projectselector2/home/dashboard) in Google Cloud and either select an existing project or create a new one. Note that since Dataproc provisions VMs to run its services, you are billed for the resources you reserve. Hence, it is suggested that you create a new project (e.g., I called mine dataproc-playground) so that when finished you can simply delete the project and release any resources you have created.

2. If the project is new, or if you have never used Dataproc before, then you must enable the Dataproc API for your project. Navigate [here](https://console.cloud.google.com/flows/enableapi?apiid=dataproc) and select your project.

<img src="images/dataproc0.png" alt="dataproc" width='100%'/>

3. When done you will see the following. You can ignore the prompt to create credentials (these are only required if you will be working outside of the Google Cloud platform, e.g., in a notebook, like we did with Lecture5-firestore).

<img src="images/dataproc1.png" alt="dataproc" width='40%'/>

4. Now, navigate to Dataproc by clicking [here](https://console.cloud.google.com/dataproc/clusters) and let's start by configuring our first Spark cluster.

<img src="images/dataproc2.png" alt="dataproc" width='100%'/>

5. Give your cluster a name (e.g., my-spark-cluster), select a region in which to deploy the worker nodes (e.g., eu-central) and select `Standard` as the cluster type. High availability is the option you would use in production, where you would like data replication and failure handling (of course, it is a bit more expensive).

<img src="images/dataproc3.png" alt="dataproc" width='100%'/>

6. When finished click **Create**. It will take 2-3 minutes until the cluster is ready. By default, a standard cluster is created with one master and two worker VMs, but if the submitted job is intensive you can scale the resources up or down whenever you want!

<img src="images/dataproc4.png" alt="dataproc" width='100%'/>

7. You have now successfully created a Dataproc cluster with just a couple of clicks!
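
The same setup can be expressed with the `gcloud` CLI instead of the console. The sketch below only assembles the commands for enabling the Dataproc API, creating a cluster, and deleting it again when you are done; nothing is executed. The project ID `dataproc-playground` follows the example above, while the region `europe-west3` and the helper function names are assumptions made for illustration.

```python
import shlex


def enable_dataproc_api_command(project):
    """Command that enables the Dataproc API for a project."""
    return ["gcloud", "services", "enable", "dataproc.googleapis.com",
            f"--project={project}"]


def create_cluster_command(name, region, project, num_workers=2):
    """Command that creates a Dataproc cluster with the given number of workers."""
    return ["gcloud", "dataproc", "clusters", "create", name,
            f"--region={region}",
            f"--project={project}",
            f"--num-workers={num_workers}"]


def delete_cluster_command(name, region, project):
    """Command that deletes the cluster so it stops incurring charges."""
    return ["gcloud", "dataproc", "clusters", "delete", name,
            f"--region={region}",
            f"--project={project}"]


# Hypothetical region; pick the one you selected in the console.
for cmd in (
    enable_dataproc_api_command("dataproc-playground"),
    create_cluster_command("my-spark-cluster", "europe-west3", "dataproc-playground"),
    delete_cluster_command("my-spark-cluster", "europe-west3", "dataproc-playground"),
):
    print(shlex.join(cmd))
```

Keeping the delete command next to the create command is a small reminder that the VMs are billed for as long as the cluster exists.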

## 3. Submitting Spark Jobs to Dataproc

### Pi Estimation in Java
Note: we will not look into the code of the Pi estimation program, as our focus is on executing a job in the cloud. For the coded example, please take a look at the Lecture11-Spark notebook; for the Java implementation you can have a look [here](https://github.com/mudrutom/spark-examples/blob/master/src/main/java/org/apache/spark/examples/SparkPi.java).
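
For orientation only, here is a plain-Python sketch of the Monte Carlo idea behind the program. It is not the Spark code that Dataproc runs. It assumes, as the bundled Spark examples do to the best of my knowledge, that the job argument is a number of partitions and that each partition contributes 100,000 random points; the function name `estimate_pi` is hypothetical.

```python
import random


def estimate_pi(partitions, samples_per_partition=100_000, seed=None):
    """Estimate Pi by sampling random points in the square [-1, 1] x [-1, 1].

    The fraction of points that fall inside the unit circle approximates
    Pi / 4, so multiplying that fraction by 4 gives the estimate.
    """
    rng = random.Random(seed)
    n = partitions * samples_per_partition
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / n


# Small run on a single machine; Spark spreads the same loop over the workers.
print(f"Pi is roughly {estimate_pi(partitions=2, seed=42):.4f}")
```

A larger argument means more samples and a tighter estimate, which is why the cluster version benefits from spreading the work across workers.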

1. Navigate through the left menu to the **Jobs** tab.

2. Select **Submit Job** to create a new job.

<img src="images/dataproc5.png" alt="dataproc" width='100%'/>

3. When creating a job you can give it an ID (I left the default). In the region field, select the region of the created cluster (e.g., eu-central), then select your cluster from the cluster dropdown. As the job type, select `Spark`. You will notice how many frameworks Dataproc supports.

<img src="images/dataproc6.png" alt="dataproc" width='100%'/>

Note that `Spark` in Dataproc means Apache Spark running a JVM program (Java or Scala), while `PySpark` is for Python and `SparkR` is for R.

4. Now it is time to configure the Spark program we want to run. Dataproc runs programs that are uploaded to Google Cloud [Storage](https://console.cloud.google.com/storage/browser) (COMP-543 reminder!!!). The Pi estimation program is already available on the cluster as part of the Spark examples, so there is no need to upload anything, but if you write your own program you must first upload it to a Cloud Storage bucket (such as the one created in Section 1). For Java programs you must define the main class (the program's entry point); for this, paste `org.apache.spark.examples.SparkPi`. In the jar files field (storage URL) paste `file:///usr/lib/spark/examples/jars/spark-examples.jar`. Finally, the Pi estimation takes one `argument` that controls how much sampling is done (in the bundled example, the number of partitions the samples are split across), and for this you can give any number (e.g., 1000).

<img src="images/dataproc7.png" alt="dataproc" width='100%'/>

When finished click **Submit** and just watch the program run!


<img src="images/dataproc8.png" alt="dataproc" width='100%'/>
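
The form filled in above maps closely onto a single `gcloud dataproc jobs submit spark` command. The sketch below assembles that command from the main class, jar path and argument used in this section and prints it without running it. The cluster name follows the earlier example, the region `europe-west3` is a hypothetical placeholder, and the helper name is made up for illustration.

```python
import shlex

SPARK_PI_CLASS = "org.apache.spark.examples.SparkPi"
SPARK_EXAMPLES_JAR = "file:///usr/lib/spark/examples/jars/spark-examples.jar"


def submit_spark_pi_command(cluster, region, argument=1000):
    """Assemble the command that submits the Java SparkPi example as a job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "spark",
        f"--cluster={cluster}",
        f"--region={region}",
        f"--class={SPARK_PI_CLASS}",
        f"--jars={SPARK_EXAMPLES_JAR}",
        "--",              # everything after this goes to the program itself
        str(argument),
    ]


# Hypothetical region; use the region of your own cluster.
print(shlex.join(submit_spark_pi_command("my-spark-cluster", "europe-west3")))
```

The bare `--` separates the flags meant for `gcloud` from the arguments handed to `SparkPi`, which mirrors the separate arguments field in the console form.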


### Pi Estimation in Python

1. Navigate again to the **Jobs** tab and opt to submit a new job.

2. Select again the region of your provisioned cluster and select that cluster.

3. Now, let's configure the new job to run Python by selecting `PySpark` in the job type dropdown menu. For the main Python file give `file:///usr/lib/spark/examples/src/main/python/pi.py`, and remember to pass the same argument as in the Java job (e.g., 1000). When finished, click **Submit**.

<img src="images/dataproc9.png" alt="dataproc" width='100%'/>

4. Now, the cluster will run a PySpark job without any manual installations or library configurations:

<img src="images/dataproc10.png" alt="dataproc" width='100%'/>

5. If you navigate again to the **Jobs** tab you will see the status of the jobs you have run.

<img src="images/dataproc11.png" alt="dataproc" width='100%'/>
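
As with the Java job, the PySpark submission and the job listing have command-line counterparts. The sketch below assembles a `gcloud dataproc jobs submit pyspark` command for the `pi.py` file used above and a `gcloud dataproc jobs list` command for checking job status; both are printed, not executed. The region `europe-west3` and the helper names are assumptions for illustration.

```python
import shlex

PI_PY = "file:///usr/lib/spark/examples/src/main/python/pi.py"


def submit_pyspark_pi_command(cluster, region, argument=1000):
    """Assemble the command that submits the Python Pi example as a job."""
    return [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        PI_PY,             # the main Python file comes right after the job type
        f"--cluster={cluster}",
        f"--region={region}",
        "--",              # arguments after this are passed to pi.py
        str(argument),
    ]


def list_jobs_command(cluster, region):
    """Assemble the command that lists the jobs submitted to a cluster."""
    return ["gcloud", "dataproc", "jobs", "list",
            f"--region={region}",
            f"--cluster={cluster}"]


# Hypothetical region; use the region of your own cluster.
print(shlex.join(submit_pyspark_pi_command("my-spark-cluster", "europe-west3")))
print(shlex.join(list_jobs_command("my-spark-cluster", "europe-west3")))
```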

## Resources

- All Dataproc Sample Examples https://cloud.google.com/dataproc/docs/samples
- Quickstart using client libraries (try python in jupyter notebook) https://cloud.google.com/dataproc/docs/quickstarts/quickstart-lib
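
If you want to try the client-library route from a notebook, the following is a rough sketch of how the PySpark job from this notebook might be submitted with the `google-cloud-dataproc` Python package. It assumes the package is installed and that credentials are configured; the helper names `pyspark_job` and `submit` are hypothetical, and only the job description is built unless you call `submit` yourself.

```python
def pyspark_job(cluster_name, main_python_file_uri, args=()):
    """Describe a PySpark job as a plain dict in the shape the client accepts."""
    return {
        "placement": {"cluster_name": cluster_name},
        "pyspark_job": {
            "main_python_file_uri": main_python_file_uri,
            "args": [str(a) for a in args],
        },
    }


def submit(project_id, region, job):
    """Submit the job and wait for it to finish (needs google-cloud-dataproc)."""
    # Imported here so the sketch still loads when the package is missing.
    from google.cloud import dataproc_v1

    client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = client.submit_job_as_operation(
        request={"project_id": project_id, "region": region, "job": job}
    )
    return operation.result()


job = pyspark_job(
    "my-spark-cluster",
    "file:///usr/lib/spark/examples/src/main/python/pi.py",
    args=[1000],
)
print(job)
```

Calling `submit("dataproc-playground", "<your-region>", job)` would then send the job to the cluster; this is exactly the situation in which the credentials skipped in Section 2 become necessary.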