# Dataproc Serverless - Tabular Setup Guide
Hey, welcome!

A few prereqs:
- you'll need the Google Cloud CLI (`gcloud`) installed
- you'll need pipenv installed if you want to use the `make lab` target to fire up your jupyter lab instance

## Deploying your custom image
Dataproc Serverless pulls custom Docker images from the GCP Artifact Registry. The Makefile in this repository handles all the dirty work for that.

First, clone this repository and from this directory run the following:
```bash
make push
```

This command:
- builds a Docker image that installs the Dataproc Serverless prereqs and the Iceberg dependencies for Spark
- tags the image with your current project ID and the region set in the Makefile (modify these if you like)
- authenticates with Google
- creates an Artifact Registry repository to push to (if this fails because the Artifact Registry API isn't enabled, jump into GCP and enable it)
- pushes the image to the Artifact Registry repository
- **Note:** go to GCP IAM and grant the Artifact Registry Reader role (`roles/artifactregistry.reader`) to your compute service account. The default compute service account is fine unless you run as a specific service account, in which case that account needs the role instead. Otherwise the Dataproc init will fail when it tries to pull this image.
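The tagging and IAM steps above can be sketched in Python. This is only an illustration, not what the Makefile actually runs: the helper names and the placeholder project, repository, and image names are hypothetical. The image path follows the standard Artifact Registry Docker format, and the `gcloud` command grants the reader role to the default compute service account.

```python
import shlex


def image_uri(region: str, project_id: str, repository: str, image: str, tag: str = "latest") -> str:
    """Build an Artifact Registry Docker path: REGION-docker.pkg.dev/PROJECT/REPO/IMAGE:TAG."""
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


def grant_reader_command(project_id: str, project_number: str) -> list[str]:
    """Build the gcloud argv that grants Artifact Registry Reader to the default compute service account."""
    member = f"serviceAccount:{project_number}-compute@developer.gserviceaccount.com"
    return [
        "gcloud", "projects", "add-iam-policy-binding", project_id,
        f"--member={member}",
        "--role=roles/artifactregistry.reader",
    ]


# Placeholder values (hypothetical); substitute your own project, repository, and image names.
uri = image_uri("us-central1", "my-project", "my-repo", "dataproc-iceberg")
cmd = grant_reader_command("my-project", "123456789012")

print(uri)
print(shlex.join(cmd))  # copy-paste friendly form of the command
```

To actually run the grant from Python rather than a shell, pass `cmd` to `subprocess.run(cmd, check=True)`.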

## Fire up jupyter lab
Google has a jupyter lab plugin for interactive spark jobs on serverless dataproc.

Make sure you have pipenv installed (`pip install pipenv`) and run the following:
```
make lab
```

This installs the dependencies from the Pipfile and runs your jupyter lab server.

## Create a serverless template
Once you open the jupyter lab UI, do the following:
- open a new jupyter tab and select create serverless template in the Dataproc section
- give your template a name, then copy your Current image ID from the output of the `make push` command and paste it into the custom image area
- select your network
- **Important:** the subnet you pick must allow all internal communication within the network so your Spark instances can talk to each other, and it must have Private Google Access enabled. [See bullet three in this GCP docs section for more info](https://cloud.google.com/dataproc-serverless/docs/quickstarts/jupyterlab-sessions)
- no need to save any Spark configs up front -- all Iceberg configs are provided in the pyspark below
- hit save, and you're good to go
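If your subnet doesn't meet the network requirements yet, the two `gcloud` commands involved look roughly like the sketch below. The helper names, rule name, network, subnet, and CIDR range are all hypothetical placeholders; the flags are the ones I understand the Dataproc Serverless networking docs to use, so double-check them against the linked page before running anything.

```python
import shlex


def private_google_access_command(subnet: str, region: str) -> list[str]:
    """gcloud argv that turns on Private Google Access for one subnet."""
    return [
        "gcloud", "compute", "networks", "subnets", "update", subnet,
        f"--region={region}",
        "--enable-private-ip-google-access",
    ]


def internal_ingress_command(rule_name: str, network: str, subnet_range: str) -> list[str]:
    """gcloud argv that allows all ingress traffic between VMs inside the given subnet range."""
    return [
        "gcloud", "compute", "firewall-rules", "create", rule_name,
        f"--network={network}",
        "--direction=ingress",
        "--action=allow",
        "--rules=all",
        f"--source-ranges={subnet_range}",
    ]


# Placeholder values (hypothetical); use your own subnet, region, network, and CIDR range.
pga = private_google_access_command("default", "us-central1")
fw = internal_ingress_command("allow-internal-ingress", "default", "10.128.0.0/20")

for argv in (pga, fw):
    print(shlex.join(argv))
```

Printing the commands first makes it easy to review them before running them in a shell.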

## Start Kernel and Test
After hitting save, you should be able to open this file in jupyter lab and select your template as your kernel.

You may need to refresh the page if you don't see the kernel as an option right away.

After that, fill in your own `tabular_credential` and `warehouse_name` below and you're good to test connectivity!
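As a rough sketch of how those two values feed the session config, the catalog settings from the cell below can be collected into a plain dict first, which makes them easy to inspect before Spark gets involved. The helper name `tabular_catalog_conf` is hypothetical; the keys and values mirror the ones used in this notebook.

```python
def tabular_catalog_conf(warehouse_name: str, credential: str) -> dict[str, str]:
    """Collect the Iceberg REST catalog settings for a Tabular warehouse into one dict."""
    prefix = f"spark.sql.catalog.{warehouse_name}"
    return {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": "https://api.tabular.io/ws",
        f"{prefix}.credential": credential,
        f"{prefix}.warehouse": warehouse_name,
        "spark.sql.defaultCatalog": warehouse_name,
    }


# Placeholder values taken from this notebook; replace with your own.
conf = tabular_catalog_conf("rpw_gcp", "t-123-123")

for key, value in conf.items():
    # Mask the credential so it doesn't end up in notebook output.
    shown = "***" if key.endswith(".credential") else value
    print(f"{key} = {shown}")
```

To apply the dict, loop over `conf.items()` and call `.config(key, value)` on `SparkSession.builder` for each pair before `.getOrCreate()`.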

In [None]:
# 👇 replace this with your own credential and your tabular warehouse name
tabular_credential = 't-123-123'
warehouse_name = 'rpw_gcp'

from pyspark.sql import SparkSession

# Maven coordinates take the form groupId:artifactId:version
pkgs_str = 'org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,org.apache.iceberg:iceberg-gcp-bundle:1.4.3'

spark = (
  SparkSession.builder
    .appName("Iceberg")
    .config("spark.jars.packages", pkgs_str)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{warehouse_name}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{warehouse_name}.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config(f"spark.sql.catalog.{warehouse_name}.uri", "https://api.tabular.io/ws")
    .config(f"spark.sql.catalog.{warehouse_name}.credential", tabular_credential)
    .config(f"spark.sql.catalog.{warehouse_name}.warehouse", warehouse_name)
    .config("spark.sql.defaultCatalog", warehouse_name)
    .getOrCreate()
)

# List the catalogs visible to this session
spark.sql('SHOW CATALOGS;').show()

# Create a test database and a one-row table in the warehouse
spark.sql(f'CREATE DATABASE IF NOT EXISTS {warehouse_name}.DATAPROC_INIT;')
spark.sql(f'CREATE TABLE IF NOT EXISTS {warehouse_name}.DATAPROC_INIT.HELLO_WORLD AS (SELECT 1 AS ID);')

# Read the row back to confirm the round trip works
spark.sql(
    f'SELECT * FROM {warehouse_name}.DATAPROC_INIT.HELLO_WORLD;'
).show()