# Apache Iceberg with Local PySpark on SageMaker Studio

This notebook shows how to run local PySpark code within a SageMaker Studio notebook and use [Apache Iceberg](https://iceberg.apache.org/docs/latest/aws/) on AWS with Studio. This example uses the **Data Science - Python3** image and kernel, but the approach should work for any kernel within SageMaker Studio, including bring-your-own (BYO) custom images.

## Setup
Two things must be done to enable local PySpark within SageMaker Studio:
1. Make sure a Java installation is available. The easiest way to install a JDK and set the proper paths is to use conda.
2. Append the local container's hostname to `/etc/hosts` so that Spark can resolve it and communicate properly.

In [None]:
# Setup - Run only once per Kernel App
%conda install openjdk -y
!grep `hostname` /etc/hosts >/dev/null || echo 127.0.0.1 `hostname` >> /etc/hosts
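
The `/etc/hosts` step above can also be expressed in Python, which makes the check easier to read. The sketch below is an illustration only: `needs_hosts_entry` and `hosts_line` are hypothetical helper names, the host name `studio-host` is a placeholder, and the code only decides whether an entry is missing rather than writing the file. Unlike the `grep` version, it matches whole host names instead of substrings.

```python
def needs_hosts_entry(hosts_text: str, hostname: str) -> bool:
    """Return True if no non-comment line in hosts_text names hostname."""
    for line in hosts_text.splitlines():
        # Drop any trailing comment, then split into address + host names.
        fields = line.split("#", 1)[0].split()
        if hostname in fields[1:]:
            return False
    return True


def hosts_line(hostname: str, address: str = "127.0.0.1") -> str:
    """Format the line that the shell command above would append."""
    return f"{address} {hostname}"


# Example with an in-memory hosts file (placeholder host name).
example = "127.0.0.1 localhost\n"
if needs_hosts_entry(example, "studio-host"):
    print(hosts_line("studio-host"))
```

In a real kernel, the host name would come from `socket.gethostname()` and the text from reading `/etc/hosts`.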

## Install PySpark

In [None]:
%pip install pyspark==3.2.1

## Utilize S3 Data within local PySpark
* By specifying the `hadoop-aws` jar in our Spark config, we can access S3 datasets using the `s3a` file prefix.
* Since we have already authenticated to SageMaker Studio, we can use the assumed SageMaker execution role for any S3 reads and writes by setting the credential provider to `ContainerCredentialsProvider`.
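
As a quick sanity check before building the Spark session, you can confirm that the kernel advertises container credentials. This sketch assumes the AWS SDK convention that container credentials are announced through the `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` or `AWS_CONTAINER_CREDENTIALS_FULL_URI` environment variables. `container_credentials_configured` is a hypothetical helper name, not part of any AWS library.

```python
import os
from typing import Mapping, Optional

# Environment variables that the AWS SDKs consult for container credentials.
CONTAINER_CREDENTIAL_VARS = (
    "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI",
    "AWS_CONTAINER_CREDENTIALS_FULL_URI",
)


def container_credentials_configured(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True if either container-credentials variable is set and non-empty."""
    env = os.environ if env is None else env
    return any(env.get(name) for name in CONTAINER_CREDENTIAL_VARS)


print("Container credentials configured:", container_credentials_configured())
```

If this prints `False`, the `ContainerCredentialsProvider` setting is unlikely to find credentials, and a different provider may be needed.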

### Download data

In [None]:
! mkdir -p ./../../data

In [None]:
! aws s3 cp s3://ws-assets-prod-iad-r-iad-ed304a55c2ca1aee/9e2e09b0-7142-4ab8-8b89-531349b817b9/deep-ar-electricity/LD2011_2014.csv.gz ./../../data

### Upload Data to S3

In [None]:
import boto3

In [None]:
s3_client = boto3.client("s3")

In [None]:
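# Name of an existing S3 bucket that your execution role can write to (fill in before running)
s3_bucket = ""

# Local path of the file downloaded above
object_name = "./../../data/LD2011_2014.csv.gz"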

In [None]:
s3_client.upload_file(object_name, s3_bucket, "data/input/{}".format(object_name.split("/")[-1]))
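
The key layout used by the upload is worth making explicit, because the read step later rebuilds the same path with the `s3a://` prefix. A minimal sketch, assuming the `data/input/` prefix from the cell above. `input_key` and `s3a_uri` are hypothetical helper names, and `my-bucket` is a placeholder.

```python
import posixpath


def input_key(local_path: str, prefix: str = "data/input") -> str:
    """Build the S3 object key: <prefix>/<file name of local_path>."""
    return posixpath.join(prefix, posixpath.basename(local_path))


def s3a_uri(bucket: str, key: str) -> str:
    """Build the s3a:// URI that Spark reads through the hadoop-aws connector."""
    if not bucket:
        raise ValueError("bucket name is empty; set s3_bucket first")
    return f"s3a://{bucket}/{key}"


key = input_key("./../../data/LD2011_2014.csv.gz")
print(key)                        # data/input/LD2011_2014.csv.gz
print(s3a_uri("my-bucket", key))  # s3a://my-bucket/data/input/LD2011_2014.csv.gz
```

Raising early on an empty bucket name avoids a less obvious failure later, when Spark tries to resolve the path.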

***

## Work with Local PySpark

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

import random
from pyspark.sql import SparkSession
import pyspark.sql.functions as fn
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StructType, StructField, ArrayType, DoubleType, StringType, IntegerType

# Important: PySpark version 3.2.x

Run the cells below if you are using PySpark 3.2.x or later.

With `pyspark >= 3.2.x`, you need to provide `hadoop-aws` jars of version 3.2.x or later to interact with AWS services such as Amazon S3.

Apache Iceberg requires some additional Maven dependencies:

In [None]:
DEPENDENCIES=[
    "org.apache.hadoop:hadoop-aws:3.2.2",
    "org.apache.iceberg:iceberg-spark3-runtime:0.13.1",
    "software.amazon.awssdk:bundle:2.15.40",
    "software.amazon.awssdk:url-connection-client:2.15.40"
]
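
`spark.jars.packages` takes a comma-separated list of Maven coordinates in `groupId:artifactId:version` form, so a typo in one entry only surfaces when the session starts. The following small sketch checks the shape of each coordinate before joining. `packages_value` is a hypothetical helper, and the check is purely syntactic: it does not confirm that an artifact exists or matches your Spark version.

```python
from typing import Iterable


def packages_value(coordinates: Iterable[str]) -> str:
    """Join Maven coordinates into the comma-separated string Spark expects."""
    checked = []
    for coord in coordinates:
        parts = coord.split(":")
        # Require exactly three non-empty parts: group, artifact, version.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
        checked.append(coord)
    return ",".join(checked)


deps = [
    "org.apache.hadoop:hadoop-aws:3.2.2",
    "software.amazon.awssdk:bundle:2.15.40",
]
print(packages_value(deps))
```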

Two configuration parameters are central to using Apache Iceberg here:

* `warehouse`: the S3 path under which the catalog stores table data and metadata
* `lock.table`: the DynamoDB table used to lock tables so that concurrent writes to the S3 data are coordinated
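
The catalog settings all share the prefix `spark.sql.catalog.<catalog name>`, so they can be assembled in one place. The following is a sketch, not the notebook's own method: `iceberg_catalog_conf` and `apply_conf` are hypothetical helpers, `my-bucket` is a placeholder, and the class names are copied from the session configuration in the cell below.

```python
from typing import Dict


def iceberg_catalog_conf(catalog: str, bucket: str, lock_table: str) -> Dict[str, str]:
    """Return Spark properties for an Iceberg catalog backed by Glue and S3."""
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        # Root S3 location for the catalog's tables.
        f"{prefix}.warehouse": f"s3://{bucket}/warehouse",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        # DynamoDB-backed lock manager and the table it uses.
        f"{prefix}.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
        f"{prefix}.lock.table": lock_table,
    }


def apply_conf(builder, conf: Dict[str, str]):
    """Call builder.config(key, value) for each property and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


conf = iceberg_catalog_conf("my_catalog", "my-bucket", "LockTable")
for key, value in conf.items():
    print(key, "=", value)
```

With PySpark installed, passing `SparkSession.builder` to `apply_conf` would set the same catalog properties as the chained `.config(...)` calls. The package list and credential provider still need to be configured separately.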

In [None]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.jars.packages", ",".join(DEPENDENCIES))
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://{}/warehouse".format(s3_bucket))
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.my_catalog.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager")
    .config("spark.sql.catalog.my_catalog.lock.table", "LockTable")
    .getOrCreate()
)

print(spark.version)

***

# Important: PySpark version 2.4.x

Run the cells below if you are using PySpark 2.4.x.

With PySpark 2.4.x, you have to provide the list of `aws-java-sdk` jars to interact with AWS services such as Amazon S3.

You can install the Python module `sagemaker_pyspark==1.4.2` and extract the list of jars to pass when creating the Spark session.

In [None]:
%pip install pyspark==2.4.1

In [None]:
%pip install sagemaker_pyspark==1.4.2

In [None]:
import sagemaker_pyspark

classpath = ":".join(sagemaker_pyspark.classpath_jars())

Apache Iceberg requires some additional Maven dependencies:

In [None]:
DEPENDENCIES=[
    "org.apache.hadoop:hadoop-aws:3.2.2",
    "org.apache.iceberg:iceberg-spark3-runtime:0.13.1",
    "software.amazon.awssdk:bundle:2.15.40",
    "software.amazon.awssdk:url-connection-client:2.15.40"
]

In [None]:
# Import pyspark and build Spark session
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.driver.extraClassPath", classpath)
    .config("spark.jars.packages", ",".join(DEPENDENCIES))
    .config(
        "fs.s3a.aws.credentials.provider",
        "com.amazonaws.auth.ContainerCredentialsProvider",
    )
    .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.my_catalog.warehouse", "s3://{}/warehouse".format(s3_bucket))
    .config("spark.sql.catalog.my_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
    .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
    .config("spark.sql.catalog.my_catalog.lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager")
    .config("spark.sql.catalog.my_catalog.lock.table", "LockTable")
    .getOrCreate()
)

print(spark.version)

***

In [None]:
schema = "date TIMESTAMP, client STRING, value FLOAT"

In [None]:
df = (
    spark.read
    .schema(schema)
    .options(sep=",", header=True, mode="FAILFAST", timestampFormat="yyyy-MM-dd HH:mm:ss")
    .csv("s3a://{}/data/input/{}".format(s3_bucket, object_name.split("/")[-1]))
)
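
The `timestampFormat` option uses a Java-style datetime pattern. For readers more familiar with Python, this sketch parses a value with the equivalent `strptime` format. The sample timestamp is made up for illustration and is not taken from the dataset, and `parse_timestamp` is a hypothetical helper.

```python
from datetime import datetime

# The Spark pattern "yyyy-MM-dd HH:mm:ss" corresponds to this strptime format.
PY_FORMAT = "%Y-%m-%d %H:%M:%S"


def parse_timestamp(text: str) -> datetime:
    """Parse a timestamp laid out as year-month-day hour:minute:second."""
    return datetime.strptime(text, PY_FORMAT)


sample = "2011-01-01 00:15:00"  # hypothetical value
print(parse_timestamp(sample))
```

Because the reader runs with `mode="FAILFAST"`, a row that Spark cannot parse against the schema raises an error instead of being silently turned into nulls.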

In [None]:
df.show()

In [None]:
df.createOrReplaceTempView("tempview")
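
The statement in the next cell writes into the `iceberg_db` namespace of `my_catalog`. If that database does not yet exist in the Glue Data Catalog, the statement will most likely fail, so it can help to create it first. A minimal sketch, assuming a Spark 3 session with the Iceberg catalog configured as above. `namespace_ddl` and `ensure_namespace` are hypothetical helpers.

```python
def namespace_ddl(catalog: str, namespace: str) -> str:
    """Build a CREATE NAMESPACE statement that is a no-op if the namespace exists."""
    return f"CREATE NAMESPACE IF NOT EXISTS {catalog}.{namespace}"


def ensure_namespace(spark, catalog: str = "my_catalog", namespace: str = "iceberg_db"):
    """Run the statement on an existing SparkSession passed in as `spark`."""
    return spark.sql(namespace_ddl(catalog, namespace))


print(namespace_ddl("my_catalog", "iceberg_db"))
```

In the notebook this would be called as `ensure_namespace(spark)` before the table is created.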

In [None]:
spark.sql("CREATE OR REPLACE TABLE my_catalog.iceberg_db.tempview USING iceberg AS SELECT * FROM tempview");

In [None]:
data = spark.sql("SELECT * FROM my_catalog.iceberg_db.tempview")

In [None]:
data.show()