<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/08%20-%20R/R%20-%20Dataproc%20Serverless%20Spark-R%20Jobs.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/08%20-%20R/R%20-%20Dataproc%20Serverless%20Spark-R%20Jobs.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/R%20-%20Dataproc%20Serverless%20Spark-R%20Jobs.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https%3A//raw.githubusercontent.com/statmike/vertex-ai-mlops/main/08%20-%20R/R%20-%20Dataproc%20Serverless%20Spark-R%20Jobs.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# R - Dataproc Serverless Spark-R Jobs

This workflow runs an **R** script as a job using [SparkR](https://spark.apache.org/docs/latest/sparkr.html#overview).  A prepared script is submitted directly to Google Cloud [Dataproc Serverless](https://cloud.google.com/dataproc-serverless/docs/overview) as a batch job.  The result is a completely serverless Spark job with a startup time of under 60 seconds.

---
Part of the series of [**R**](https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/readme.md) workflows:

A series of workflows focused on using **R** in Vertex AI, as well as other Google Cloud services, to run R code, train models with R, and serve predictions with R.

---

**Prerequisites:**

- Run this notebook in a Vertex AI Workbench instance as described in the series [readme](./readme.md)
- Run the workflow: [R - Notebook Based Workflow](./R%20-%20Notebook%20Based%20Workflow.ipynb)
    - This prepares the data source used by the custom job in this workflow

---
## Installs

The list `packages` contains tuples of package import names and install names.  If the import name is not found, the install name is used to install the package quietly for the current user.

In [42]:
# tuples of (import name, install name)
packages = [
    ('google.cloud.storage', 'google-cloud-storage'),
    ('google.cloud.dataproc', 'google-cloud-dataproc')
]

import importlib.util  # the util submodule must be imported explicitly before calling find_spec
install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
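
One caveat with the check above: for a dotted name such as `google.cloud.storage`, `importlib.util.find_spec` imports the parent package first and raises `ModuleNotFoundError` when that parent is missing, rather than returning `None`. A minimal sketch of a more defensive check follows; the helper name `is_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_installed(import_name: str) -> bool:
    """Return True if `import_name` can be located, False otherwise.

    Hypothetical helper: wraps find_spec so that a missing parent package
    (for example `google.cloud` when checking `google.cloud.storage`)
    is reported as "not installed" instead of raising.
    """
    try:
        return importlib.util.find_spec(import_name) is not None
    except ModuleNotFoundError:
        # raised when a parent package of a dotted name cannot be imported
        return False


print(is_installed("json"))                        # standard library module
print(is_installed("definitely_missing_pkg.sub"))  # parent package is absent
```

In the install loop, `if not is_installed(package[0]):` could stand in for the direct `find_spec` call.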

### Enable APIs

In [43]:
!gcloud services enable dataproc.googleapis.com

### Restart Kernel (If Installs Occurred)

After a kernel restart, resume running the notebook from the cell that follows this one.

In [44]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [45]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [46]:
REGION = 'us-central1'
EXPERIMENT = 'dataproc-serverless'
SERIES = 'r'

# BigQuery Parameters
BQ_PROJECT = PROJECT_ID
BQ_DATASET = SERIES
BQ_TABLE = 'bigquery-data'
BQ_REGION = REGION[0:2]

# GCS Parameters: Give bucket name
GCS_BUCKET = PROJECT_ID

# key columns in the data:
VAR_TARGET = 'Class'
VAR_OMIT = ['transaction_id', 'splits']

packages:

In [47]:
from google.cloud import storage
from google.cloud import dataproc_v1

from IPython.display import Markdown as md
from datetime import datetime
import os

parameters:

In [48]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
URI = f"gs://{GCS_BUCKET}/{SERIES}/{EXPERIMENT}"

clients:

In [50]:
gcs = storage.Client(project = PROJECT_ID)
dataproc_batch = dataproc_v1.BatchControllerClient(client_options = dict(api_endpoint = f"{REGION}-dataproc.googleapis.com:443"))

---
## Prepare Training Code: **SparkR** Script

The prior workflow in this series, [R - Notebook Based Workflow](./R%20-%20Notebook%20Based%20Workflow.ipynb), did the model training work in a notebook using an **R** kernel.  

The first step is converting the workflow of the prior notebook to a script that will run with SparkR. The steps from the notebook workflow have been replicated in the **R** script included with this repository.  The cell below loads and shows this script.  
- review directly in GitHub with [this link](https://github.com/statmike/vertex-ai-mlops/blob/main/08%20-%20R/code/sparkr.R)

**Notes On Script**
- The steps are replicated identically with the following additions:


In [51]:
# load and view the script:
SCRIPT_PATH = './code/sparkr.R'

with open(SCRIPT_PATH, 'r') as file:
    data = file.read()
md(f"```R\n\n{data}\n```")

```R

n <- 1000000  # Number of random points
x <- runif(n, -1, 1)
y <- runif(n, -1, 1)

inside <- x^2 + y^2 <= 1  # Points within the unit circle
pi_estimate <- 4 * sum(inside) / n 
print(pi_estimate)

```
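
The script shown above is a Monte Carlo estimate of π: points are drawn uniformly from the square $[-1, 1]^2$, and the fraction that falls inside the unit circle approaches the area ratio $\pi / 4$. For a quick local sanity check before submitting the job, the same logic can be sketched in Python. This is an illustrative translation, not part of the repository; the function name `estimate_pi` and the `seed` parameter are assumptions added for reproducibility.

```python
import random


def estimate_pi(n: int, seed: int = 0) -> float:
    """Monte Carlo estimate of pi from n uniform points in [-1, 1] x [-1, 1]."""
    rng = random.Random(seed)  # seeded generator so repeated runs agree
    inside = 0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:  # point lies within the unit circle
            inside += 1
    # the circle covers pi/4 of the square, so scale the fraction by 4
    return 4.0 * inside / n


pi_estimate = estimate_pi(200_000)
print(pi_estimate)
```

The estimate's error shrinks roughly with $1/\sqrt{n}$, so the R script's one million points should land within a few thousandths of π.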

---
## Create Batch SparkR Job With Dataproc Serverless

### Setup Dataproc
Using Google APIs from Spark code will require the subnet to have Private Google Access enabled.
- Network Configuration: https://cloud.google.com/dataproc-serverless/docs/concepts/network
    - Configure Private Google Access: https://cloud.google.com/vpc/docs/configure-private-google-access#config-pga

Current network names:

In [52]:
!gcloud compute networks list

NAME     SUBNET_MODE  BGP_ROUTING_MODE  IPV4_RANGE  GATEWAY_IPV4
default  AUTO         REGIONAL


Enable Private Google Access on the network's subnet for this region:

In [53]:
status = !gcloud compute networks subnets describe default --region={REGION} --format="get(privateIpGoogleAccess)"
if status[0] == 'False':
    !gcloud compute networks subnets update default --region={REGION} --enable-private-ip-google-access
    status = !gcloud compute networks subnets describe default --region={REGION} --format="get(privateIpGoogleAccess)"
print(f"Private Google Access is Enabled = {status[0]}")

Private Google Access is Enabled = True


Open subnet connectivity to allow ingress communication:

In [54]:
!gcloud compute firewall-rules create allow-internal-ingress \
--network=default \
--source-ranges=10.128.0.0/9 \
--direction=ingress \
--action=allow \
--rules=all

Creating firewall...failed.
ERROR: (gcloud.compute.firewall-rules.create) Could not fetch resource:
 - The resource 'projects/statmike-mlops-349915/global/firewalls/allow-internal-ingress' already exists



### Copy Script To GCS

In [55]:
bucket = gcs.lookup_bucket(GCS_BUCKET)
SOURCEPATH = f'{SERIES}/{EXPERIMENT}/models/{TIMESTAMP}'

In [56]:
blob = bucket.blob(f'{SOURCEPATH}/sparkr.R')
blob.upload_from_filename(SCRIPT_PATH)

In [57]:
blob.name

'r/dataproc-serverless/models/20240129012536/sparkr.R'

### Submit Job

The [script can be submitted](https://cloud.google.com/dataproc-serverless/docs/quickstarts/spark-batch#submit_a_spark_batch_workload) with Google Cloud Console, the [`gcloud` CLI](https://cloud.google.com/sdk/gcloud/reference/dataproc/batches/submit) or [one of the APIs](https://cloud.google.com/dataproc-serverless/docs/reference) including the [Python Client](https://cloud.google.com/python/docs/reference/dataproc/latest) used here.
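
For comparison with the Python client used here, the sketch below assembles the corresponding `gcloud dataproc batches submit spark-r` command as an argument list. The helper `gcloud_submit_command` and every value passed to it (bucket, batch ID, project) are hypothetical placeholders; the command is only printed, not executed.

```python
import shlex


def gcloud_submit_command(main_r_file_uri, region, batch_id, project, args=()):
    """Build the argv list for a Dataproc Serverless SparkR batch submission."""
    cmd = [
        "gcloud", "dataproc", "batches", "submit", "spark-r",
        main_r_file_uri,
        f"--region={region}",
        f"--batch={batch_id}",
        f"--project={project}",
    ]
    if args:
        # everything after "--" is passed through to the R script
        cmd += ["--", *args]
    return cmd


# placeholder values for illustration only
example = gcloud_submit_command(
    "gs://my-bucket/r/dataproc-serverless/sparkr.R",
    "us-central1",
    "r-dataproc-serverless-cli",
    "my-project",
    ["1000"],
)
print(shlex.join(example))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run(example, check=True)` without shell quoting issues.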


In [59]:
operation = dataproc_batch.create_batch(
    parent = f'projects/{PROJECT_ID}/locations/{REGION}',
    batch = dataproc_v1.Batch(
        spark_r_batch = dataproc_v1.SparkRBatch(
            main_r_file_uri = f'gs://{GCS_BUCKET}/{blob.name}',
            args = ['1000']
        )
    ),
    batch_id = f'{SERIES}-{EXPERIMENT}'
)
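
The `batch_id` above is fixed, so submitting again while a batch with that ID still exists fails, as the firewall rule creation did earlier in this notebook. One way around this, sketched below, is to append the `TIMESTAMP` value defined in the setup. The helper `make_batch_id` is hypothetical, and its validation pattern (4 to 63 characters, lowercase letters, digits, and hyphens) is an assumption based on the documented batch ID constraints; verify it against the current API reference.

```python
import re

# assumed constraint: 4-63 chars of lowercase letters, digits, hyphens,
# starting and ending with a letter or digit
BATCH_ID_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{2,61}[a-z0-9]$")


def make_batch_id(series: str, experiment: str, timestamp: str) -> str:
    """Join the parts into a per-run batch ID and validate its shape."""
    batch_id = f"{series}-{experiment}-{timestamp}".lower()
    if not BATCH_ID_PATTERN.match(batch_id):
        raise ValueError(f"invalid batch id: {batch_id!r}")
    return batch_id


# values taken from this notebook's setup and recorded timestamp
batch_id = make_batch_id("r", "dataproc-serverless", "20240129012536")
print(batch_id)
```

Passing `batch_id = make_batch_id(SERIES, EXPERIMENT, TIMESTAMP)` to `create_batch` would give each run its own batch resource.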

In [61]:
result = operation.result()

In [65]:
result

name: "projects/statmike-mlops-349915/locations/us-central1/batches/r-dataproc-serverless"
uuid: "89bffc2c-ae2d-4234-863d-10d64a725c94"
create_time {
  seconds: 1706491705
  nanos: 480689000
}
spark_r_batch {
  main_r_file_uri: "gs://statmike-mlops-349915/r/dataproc-serverless/models/20240129012536/sparkr.R"
  args: "1000"
}
runtime_info {
  output_uri: "gs://dataproc-staging-us-central1-1026793852137-jnbcmtsj/google-cloud-dataproc-metainfo/15fac571-1146-4a43-93f2-b3752d108a52/jobs/srvls-batch-89bffc2c-ae2d-4234-863d-10d64a725c94/driveroutput"
  approximate_usage {
    milli_dcu_seconds: 720000
    shuffle_storage_gb_seconds: 72000
  }
}
state: SUCCEEDED
state_time {
  seconds: 1706491765
  nanos: 633147000
}
creator: "1026793852137-compute@developer.gserviceaccount.com"
labels {
  key: "goog-dataproc-location"
  value: "us-central1"
}
labels {
  key: "goog-dataproc-batch-uuid"
  value: "89bffc2c-ae2d-4234-863d-10d64a725c94"
}
labels {
  key: "goog-dataproc-batch-id"
  value: "r-datapr
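
The timestamps and `approximate_usage` values in the output above are easier to read after a unit conversion. The sketch below works on the raw numbers copied from that output, so it runs without a client. The helper names are hypothetical, and the conversions assume the field names mean what they say (thousandths of a DCU-second, and GB-seconds of shuffle storage).

```python
def elapsed_seconds(start_s, start_ns, end_s, end_ns):
    """Seconds between two (seconds, nanos) protobuf-style timestamps."""
    return (end_s - start_s) + (end_ns - start_ns) / 1e9


def dcu_hours(milli_dcu_seconds):
    """Convert milli-DCU-seconds to DCU-hours."""
    return milli_dcu_seconds / 1000 / 3600


def shuffle_gb_hours(shuffle_storage_gb_seconds):
    """Convert shuffle storage GB-seconds to GB-hours."""
    return shuffle_storage_gb_seconds / 3600


# numbers copied from the batch result shown above
duration = elapsed_seconds(1706491705, 480689000, 1706491765, 633147000)
print(f"create_time -> state_time: {duration:.2f} s")
print(f"compute: {dcu_hours(720000):.2f} DCU-hours")
print(f"shuffle storage: {shuffle_gb_hours(72000):.1f} GB-hours")
```

For this run, about 60 seconds passed between batch creation and the `SUCCEEDED` state, with reported usage of 0.2 DCU-hours.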

### Wait On Job

In [68]:
print(f"Review job details in the console at this link:\nhttps://console.cloud.google.com/dataproc/batches/{REGION}/{result.name.split('/')[-1]}/monitoring?project={PROJECT_ID}")

Review job details in the console at this link:
https://console.cloud.google.com/dataproc/batches/us-central1/r-dataproc-serverless/monitoring?project=statmike-mlops-349915
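
Besides the console, the driver output can be read from the `runtime_info.output_uri` recorded in the batch result. The sketch below splits that URI into a bucket name and an object prefix; the helper `split_gcs_uri` is hypothetical. The commented lines show how the pieces might be used with the `gcs` client created earlier, assuming the driver output is stored as one or more objects whose names start with that prefix.

```python
from typing import Tuple


def split_gcs_uri(uri: str) -> Tuple[str, str]:
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"not a Cloud Storage URI: {uri!r}")
    bucket_name, _, prefix = uri[len(scheme):].partition("/")
    return bucket_name, prefix


# output_uri copied from the batch result shown earlier
output_uri = (
    "gs://dataproc-staging-us-central1-1026793852137-jnbcmtsj/"
    "google-cloud-dataproc-metainfo/15fac571-1146-4a43-93f2-b3752d108a52/"
    "jobs/srvls-batch-89bffc2c-ae2d-4234-863d-10d64a725c94/driveroutput"
)
bucket_name, prefix = split_gcs_uri(output_uri)
print(bucket_name)
print(prefix)

# Possible use with the storage client from the setup (not run here):
# for blob in gcs.list_blobs(bucket_name, prefix=prefix):
#     print(blob.download_as_text())
```

In the notebook, `result.runtime_info.output_uri` would replace the hard-coded string.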
