# Tabular - GCP Dataproc Guide
This notebook shows you how to configure a fresh Dataproc cluster for use with [Tabular](https://tabular.io).

**NOTE**: You'll need a Tabular credential for this. Log in to your Tabular account and create a service credential. You'll figure it out 🌞 

The steps are:
- install Spark dependencies
- register the Tabular Iceberg REST Catalog with Spark
- write some sample queries!

🚨⚠️ - If you need ANY help, please reach out to support@tabular.io


## Before we begin, provide your inputs below.

In [None]:
# 👇 replace this with your own credential and your tabular warehouse name
tabular_credential = 't-123:456'
warehouse_name = 'rpw_gcp'
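
If you'd rather not paste a credential into the notebook, one option is to read both inputs from environment variables. This is only a sketch: `TABULAR_CREDENTIAL` and `TABULAR_WAREHOUSE` are hypothetical variable names made up for this example, not something Tabular or Dataproc defines.

```python
import os


def load_tabular_settings(env=os.environ):
    """Return (credential, warehouse_name) read from environment variables.

    TABULAR_CREDENTIAL and TABULAR_WAREHOUSE are hypothetical names chosen
    for this sketch; set them however your cluster manages secrets.
    """
    credential = env.get("TABULAR_CREDENTIAL")
    warehouse = env.get("TABULAR_WAREHOUSE")
    if not credential or not warehouse:
        raise ValueError("Set TABULAR_CREDENTIAL and TABULAR_WAREHOUSE first")
    return credential, warehouse


# Usage in the cell above (uncomment once the variables are set):
# tabular_credential, warehouse_name = load_tabular_settings()
```

Raising early on a missing value keeps the failure close to its cause instead of surfacing later as an authentication error from Spark.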

## 1 - Install dependencies
Nothing too tricky here. 
- Spark needs a JAR to talk to Apache Iceberg.
- Iceberg needs a JAR to talk to GCS buckets.

### Spark Iceberg Runtime
I like to go to the [Apache Iceberg site](https://iceberg.apache.org/releases/) to grab a release that matches the version of Spark I'm using.
- For this notebook, I'm using Spark 3.3 with Scala 2.12. 
- So I grabbed the [1.4.3 Spark 3.3_2.12 runtime Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.4.3/iceberg-spark-runtime-3.3_2.12-1.4.3.jar) release
- Copy the link you want and assign it to `spark_iceberg_runtime_url`, which is the first entry in the `urls` list.

### Iceberg GCP Bundle
This one is also available from the [Apache Iceberg site](https://iceberg.apache.org/releases/).
- I always just grab the latest version of this, regardless of Spark version.
- I'm using the [1.4.3 gcp-bundle Jar](https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-gcp-bundle/1.4.3/iceberg-gcp-bundle-1.4.3.jar) release.
- Copy the link you want and assign it to `iceberg_gcp_bundle_url`, which is the second entry in the `urls` list.
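
Both links follow the same pattern, so instead of copying them by hand you could build them from the version numbers. The sketch below just mirrors the URL layout of the two links above; `maven_jar_url` and `spark_runtime_artifact` are hypothetical helper names, and the pattern is assumed to hold for other releases.

```python
def maven_jar_url(artifact, version, group="org.apache.iceberg"):
    """Build a search.maven.org download URL shaped like the links above."""
    group_path = group.replace(".", "/")
    return (
        "https://search.maven.org/remotecontent?filepath="
        f"{group_path}/{artifact}/{version}/{artifact}-{version}.jar"
    )


def spark_runtime_artifact(spark_version, scala_version):
    """Artifact name of the Iceberg runtime for a Spark/Scala pair."""
    return f"iceberg-spark-runtime-{spark_version}_{scala_version}"


iceberg_version = "1.4.3"
runtime_url = maven_jar_url(spark_runtime_artifact("3.3", "2.12"), iceberg_version)
bundle_url = maven_jar_url("iceberg-gcp-bundle", iceberg_version)
print(runtime_url)
print(bundle_url)
```

With a different Spark environment, only the arguments change (for example the Spark and Scala versions), assuming a matching release exists on the Iceberg releases page.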

In [1]:
# 👇 replace these if you have a different spark env
spark_iceberg_runtime_url = "https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-spark-runtime-3.3_2.12/1.4.3/iceberg-spark-runtime-3.3_2.12-1.4.3.jar"
iceberg_gcp_bundle_url = "https://search.maven.org/remotecontent?filepath=org/apache/iceberg/iceberg-gcp-bundle/1.4.3/iceberg-gcp-bundle-1.4.3.jar"


import requests

# Define the paths where the jars should be stored
urls = [
  spark_iceberg_runtime_url,
  iceberg_gcp_bundle_url,
]
deps_folder = '/usr/lib/spark/jars'
filenames = [url.split('/')[-1] for url in urls]

# download each jar into Spark's jars folder
for url, filename in zip(urls, filenames):
    r = requests.get(url)
    r.raise_for_status()  # fail loudly instead of saving an error page as a jar
    path = f'{deps_folder}/{filename}'
    with open(path, 'wb') as f:
        f.write(r.content)
    print(f'Successfully installed {filename} into {deps_folder}')

        
# Now build a pkg string telling Spark what deps to load.
# Spark expects Maven coordinates in the form group:artifact:version.
pkgs = []
for filename in filenames:
    # split "artifact-version.jar" at the last dash
    artifact, _, version = filename.replace(".jar", "").rpartition('-')
    pkgs.append(f'org.apache.iceberg:{artifact}:{version}')
pkgs_str = ','.join(pkgs)
print(f'\n\nSpark package string: "{pkgs_str}"')

Successfully installed iceberg-spark-runtime-3.3_2.12-1.4.3.jar into /usr/lib/spark/jars
Successfully installed iceberg-gcp-bundle-1.4.3.jar into /usr/lib/spark/jars


Spark package string: "org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:1.4.3,org.apache.iceberg:iceberg-gcp-bundle:1.4.3"
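
A download can "succeed" and still leave something on disk that isn't a jar, for example an HTML error page. Since a jar is a zip archive, a quick sanity check is possible with the standard library alone. This is a minimal sketch; `check_jar` is a hypothetical helper, not part of Spark or Iceberg.

```python
import os
import zipfile


def check_jar(path):
    """Return True if path exists, is non-empty, and looks like a zip archive."""
    return bool(
        os.path.isfile(path)
        and os.path.getsize(path) > 0
        and zipfile.is_zipfile(path)
    )


# Usage after the install cell above (assumes deps_folder and filenames exist):
# for filename in filenames:
#     print(filename, check_jar(f'{deps_folder}/{filename}'))
```

This only confirms the file is a readable archive. It does not verify checksums or that the jar is the version you expected.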


## 2 - Connect Spark to Tabular 💪
Let's get an old-fashioned Python Spark session cooking and connect to Tabular.

In [2]:
from pyspark.sql import SparkSession

spark = (
  SparkSession.builder
    .appName("Iceberg")
    .config("spark.jars.packages", pkgs_str)
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{warehouse_name}", "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{warehouse_name}.catalog-impl", "org.apache.iceberg.rest.RESTCatalog")
    .config(f"spark.sql.catalog.{warehouse_name}.uri", "https://api.tabular.io/ws")
    .config(f"spark.sql.catalog.{warehouse_name}.credential", tabular_credential)
    .config(f"spark.sql.catalog.{warehouse_name}.warehouse", warehouse_name)
    .config("spark.sql.defaultCatalog", warehouse_name)
    .getOrCreate()
)

24/01/20 17:02:36 WARN SparkSession: Using an existing Spark session; only runtime SQL configurations will take effect.
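
The warning above suggests a session already existed when this cell ran, so it can be handy to see exactly which settings the builder was handed. One way is to collect the catalog settings in a plain dict first. This is a sketch that reuses the keys from the cell above; `tabular_catalog_conf` and `redact` are hypothetical helper names.

```python
def tabular_catalog_conf(catalog_name, credential, uri="https://api.tabular.io/ws"):
    """Return the Spark settings used above to register a Tabular REST catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.rest.RESTCatalog",
        f"{prefix}.uri": uri,
        f"{prefix}.credential": credential,
        f"{prefix}.warehouse": catalog_name,
        "spark.sql.defaultCatalog": catalog_name,
    }


def redact(conf):
    """Copy of conf with credential values masked, safe to print."""
    return {k: ("***" if k.endswith(".credential") else v) for k, v in conf.items()}


conf = tabular_catalog_conf("rpw_gcp", "t-123:456")
for key, value in redact(conf).items():
    print(f"{key} = {value}")

# Applying it to a builder (assumes pyspark is available):
# builder = SparkSession.builder.appName("Iceberg")
# for key, value in conf.items():
#     builder = builder.config(key, value)
```

Printing the redacted dict makes it easy to compare against what the running session reports, without leaking the credential into notebook output.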


## 3 - Run sample queries to validate your connection

In [4]:
spark.sql('SHOW CATALOGS;').show()

spark.sql(f'CREATE DATABASE IF NOT EXISTS {warehouse_name}.DATAPROC_INIT;')
spark.sql(f'CREATE TABLE IF NOT EXISTS {warehouse_name}.DATAPROC_INIT.HELLO_WORLD AS (SELECT 1 AS ID);')


spark.sql(
    f'SELECT * FROM {warehouse_name}.DATAPROC_INIT.HELLO_WORLD;'
).show()

+-------+
|catalog|
+-------+
|rpw_gcp|
+-------+

+---+
| ID|
+---+
|  1|
+---+
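
If you want to rerun this check later (say, after changing clusters), the same three statements can be wrapped in a small function. This is a sketch, not a Tabular utility: `smoke_test` is a hypothetical name, and it only assumes the object passed in has a `sql()` method whose result has `collect()`, as a `SparkSession` does.

```python
def smoke_test(spark, catalog, schema="DATAPROC_INIT", table="HELLO_WORLD"):
    """Create a one-row table if needed and return the rows read back from it."""
    fq_table = f"{catalog}.{schema}.{table}"
    spark.sql(f"CREATE DATABASE IF NOT EXISTS {catalog}.{schema}")
    spark.sql(f"CREATE TABLE IF NOT EXISTS {fq_table} AS SELECT 1 AS ID")
    return spark.sql(f"SELECT * FROM {fq_table}").collect()


# Usage with the session from section 2:
# rows = smoke_test(spark, warehouse_name)
# print(rows)
```

Returning the collected rows rather than calling `.show()` lets the caller assert on the result instead of eyeballing a printed table.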

