# From Delta Lake to Amazon SageMaker

[Delta Lake](https://delta.io/) is a common open-source framework used for storing data in Lakehouse architectures.

In this sample we demonstrate how to integrate Delta Tables with Amazon SageMaker for performing data exploration, ingestion, processing, training, and hosting for Machine Learning.

---

## 0 - Connection Set-up - Via Local Spark Session and Delta Sharing

***Use Kernel "Data Science 3.0 (Python 3)" for running this notebook***

In this notebook, we will set up a connection between Amazon SageMaker and a Delta Table. We will do this with two different methods:
1. By establishing a connection with a Delta Table in Amazon S3 via [Spark Sessions](https://sparkbyexamples.com/pyspark/pyspark-what-is-sparksession/)
2. By establishing a connection with a Delta Table via [Delta Sharing](https://delta.io/sharing/)

<center><img src="../images/DeltaLake_to_SageMaker_0.png" width="60%"></center>


Each method has its own pros and cons, for example:

* Connection via Spark Session allows you to interactively query the data in the Delta Table, without the need to materialize a full copy of the table in memory.
* Connection via Spark Session allows you to rely on AWS IAM roles for managing access to the data.
* Connection via Delta Sharing allows you to rely on tokens centrally managed by the Delta Sharing Server of the Delta Lake, for read-only access via short-lived URLs.
* Connection via Delta Sharing simplifies access, since it requires only a single library (`delta_sharing`) installed in the client.
* Connection via Delta Sharing allows you to read the tables as either Pandas or Spark data-frames.

In the following cells we will explore both alternatives.

Let's start by making sure we have an updated version of the SageMaker SDK, and install the other libraries required for this sample...

In [2]:
%pip install -U "sagemaker"

Collecting sagemaker
  Downloading sagemaker-2.128.0.tar.gz (660 kB)
  Preparing metadata (setup.py) ... done
Collecting boto3<2.0,>=1.26.28
  Downloading boto3-1.26.47-py3-none-any.whl (132 kB)
Collecting botocore<1.30.0,>=1.29.47
  Downloading botocore-1.29.47-py3-none-any.whl (10.3 MB)
Building wheels for collected packages: sagemaker
  Building wheel for sagemaker (setup.py) ... done
  Created wheel for sagemaker: filename=sagemaker-2.128.0-py2.py3-none-any.whl size=896997 sha256=2a970bd92d0b55dcd7b6bd7d102127cebf1d93b7204ec7ac6bbd9651edc8eda2
  Stored in di

In [3]:
%conda install openjdk -q -y 

Collecting package metadata (current_repodata.json): ...working... done
Solving environment: ...working... done

## Package Plan ##

  environment location: /opt/conda

  added / updated specs:
    - openjdk


The following packages will be downloaded:

    package                    |            build
    ---------------------------|-----------------
    ca-certificates-2022.10.11 |       h06a4308_0         124 KB
    certifi-2022.12.7          |  py310h06a4308_0         150 KB
    conda-22.11.1              |  py310h06a4308_4         937 KB
    openjdk-11.0.13            |       h87a67e3_0       341.0 MB
    ------------------------------------------------------------
                                           Total:       342.2 MB

The following NEW packages will be INSTALLED:

  openjdk            pkgs/main/linux-64::openjdk-11.0.13-h87a67e3_0 

The following packages will be UPDATED:

  ca-certificates    conda-forge::ca-certificates-2022.9.2~ --> pkgs/main::ca-certificates-2022.1

In [4]:
%pip install pyspark==3.2.0

Collecting pyspark==3.2.0
  Downloading pyspark-3.2.0.tar.gz (281.3 MB)
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.2
  Downloading py4j-0.10.9.2-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.0-py2.py3-none-any.whl size=281805898 sha256=48c46efd5555f5252a19562c912492b5c8f1728715817da991fdb1f8f588257e
  Stored in directory: /root/.cache/pip/wheels/3d/f5/13/04a82efe56a577a8f1509e75ffd4253dc31ee18bae0ff701ea
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.2 pyspark-3.2.0
Note: y

In [5]:
%pip install delta-spark==1.1.0

Collecting delta-spark==1.1.0
  Downloading delta_spark-1.1.0-py3-none-any.whl (19 kB)
Installing collected packages: delta-spark
Successfully installed delta-spark-1.1.0
Note: you may need to restart the kernel to use updated packages.


In [6]:
%pip install delta-sharing

Collecting delta-sharing
  Downloading delta_sharing-0.6.2-py3-none-any.whl (16 kB)
Collecting aiohttp
  Downloading aiohttp-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.0 MB)
Collecting yarl>=1.6.0
  Downloading yarl-1.8.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (264 kB)
Collecting multidict>=4.0
  Downloading multidict-6.0.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (114 kB)
Collecting frozenlist>=1.1.1
  Downloading frozenlist-1.3.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl

In [7]:
import sagemaker
sagemaker.__version__

'2.128.0'
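
Since several of the libraries above are pinned to specific versions, it can help to confirm what actually ended up in the kernel before going further. The following is a minimal sketch using only the standard library; the helper names `installed_version` and `check_versions` are hypothetical, and the expected versions are simply the pins used in the install cells above.

```python
from importlib import metadata

# Pins taken from the install cells above; None means "not pinned in this notebook".
EXPECTED = {
    "sagemaker": None,
    "pyspark": "3.2.0",
    "delta-spark": "1.1.0",
    "delta-sharing": None,
}


def installed_version(package):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_versions(expected=EXPECTED):
    """Map each package name to (installed_version, matches_pin)."""
    report = {}
    for name, pin in expected.items():
        found = installed_version(name)
        ok = found is not None and (pin is None or found == pin)
        report[name] = (found, ok)
    return report


for name, (found, ok) in check_versions().items():
    print(f"{name:15s} installed={found} ok={ok}")
```

If a package reports `ok=False`, restarting the kernel after the `%pip install` cells is usually the first thing to try.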

In [8]:
import numpy as np
import pandas as pd
import os
import boto3

In [9]:
# S3 bucket for saving processing job outputs
sm_session = sagemaker.Session()
bucket = sm_session.default_bucket()
region = sm_session.boto_region_name

sm_client = boto3.client('sagemaker')
iam_role = sagemaker.get_execution_role()

print('Default bucket: '+bucket)

Default bucket: sagemaker-eu-west-1-889960878219


----

### Option 1: Connecting to a Delta Table in Amazon S3 via Spark Session

In this section we will establish a Spark Session for interacting with a Delta Table stored in Amazon S3. For this purpose, we will:
* Import the required libraries
* Establish a Spark Session and Context
* Upload a sample dataset to Amazon S3, and write it as a Delta Table
* Test our connection to the Delta Table and verify the format

We can now import the required libraries, and set up the packages required for our Spark Session...

In [10]:
# Import pyspark and build Spark session
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext

In [11]:
# Build list of packages entries using Maven coordinates (groupId:artifactId:version)
pkg_list = []
pkg_list.append("io.delta:delta-core_2.12:1.1.0")
pkg_list.append("org.apache.hadoop:hadoop-aws:3.2.2")

packages=(",".join(pkg_list))
print('packages: '+packages)

packages: io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.2.2
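
The value passed to `spark.jars.packages` is a comma-separated list of Maven coordinates, and the `_2.12` suffix in `delta-core_2.12` is the Scala version the artifact was built for. As a small sketch, the same string can be built with a helper that rejects malformed parts; the function names `maven_coordinate` and `spark_packages` are hypothetical and not part of any library.

```python
def maven_coordinate(group_id, artifact_id, version):
    """Format one Maven coordinate as groupId:artifactId:version."""
    parts = (group_id, artifact_id, version)
    # A ':' or ',' inside a part would corrupt the final packages string.
    if any(not p or ":" in p or "," in p for p in parts):
        raise ValueError(f"invalid coordinate parts: {parts}")
    return ":".join(parts)


def spark_packages(*coordinates):
    """Join coordinates into the comma-separated form Spark expects."""
    return ",".join(coordinates)


packages = spark_packages(
    maven_coordinate("io.delta", "delta-core_2.12", "1.1.0"),
    maven_coordinate("org.apache.hadoop", "hadoop-aws", "3.2.2"),
)
print("packages: " + packages)
```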


We can now establish the Spark Session and open a Spark Context...

In [12]:
# Instantiate Spark via builder
# Note: we use the `ContainerCredentialsProvider` to give us access to underlying IAM role permissions
spark = (SparkSession
    .builder
    .appName("PySparkApp") 
    .config("spark.jars.packages", packages) 
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") 
    .config("fs.s3a.aws.credentials.provider",'com.amazonaws.auth.ContainerCredentialsProvider') 
    .getOrCreate())

sc = spark.sparkContext

print('Spark version: '+str(sc.version))



:: loading settings :: url = jar:file:/opt/conda/lib/python3.10/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-689a7d44-478e-484f-9686-8cb507f00344;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.1.0 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
downloading https://repo1.maven.org/maven2/io/delta/delta-core_2.12/1.1.0/delta-core_2.12-1.1.0.jar ...
	[SUCCESSFUL ] io.delta#delta-core_2.12;1.1.0!delta-core_2.12.jar (188ms)
downloading https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.2.2/hadoop-aws-3.2.2.jar ...
	[SUCCESSFUL ] org.apache.hadoop#hadoop-aws;3.2.2!hadoop-aws.jar (78ms)
downloading https://repo1.maven.org/maven2/org/antlr/

Spark version: 3.2.0
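
If you want to inspect or reuse the session settings, one option is to keep them in a plain dictionary and apply them to the builder in a loop. This is only a sketch: the keys and values are copied from the cell above, while the helper names `delta_spark_conf` and `build_session` are hypothetical. `build_session` imports `pyspark` lazily, so the dictionary can be examined even where Spark is not installed.

```python
def delta_spark_conf(packages):
    """Collect the session settings used in this notebook in a plain dict."""
    return {
        "spark.jars.packages": packages,
        "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.ContainerCredentialsProvider",
    }


def build_session(conf, app_name="PySparkApp"):
    """Apply every entry of `conf` to a SparkSession builder (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


conf = delta_spark_conf(
    "io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.2.2"
)
for key, value in conf.items():
    print(f"{key} = {value}")
```

With Spark available, `spark = build_session(conf)` should give a session configured like the one created above.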


For this example, we will create a table in Amazon S3 by uploading a synthetic sample dataset of ratings (`fact_rating_synthetic.csv`) and writing it as a Delta Table...

In [14]:
from sagemaker.s3 import S3Uploader

local_basename = 'fact_rating_synthetic.csv'
local_file = '../data/' + local_basename
upload_s3_uri = f's3://{bucket}/delta_to_sagemaker/raw_csv'

S3Uploader.upload(local_file, upload_s3_uri, sagemaker_session=sm_session)

's3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/raw_csv/fact_rating_synthetic.csv'

In [15]:
# Load raw data from S3 location

s3_raw_csv = f's3://{bucket}/delta_to_sagemaker/raw_csv/fact_rating_synthetic.csv'
s3a_raw_csv = s3_raw_csv.replace('s3:','s3a:')

print(s3a_raw_csv)

s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/raw_csv/fact_rating_synthetic.csv
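
The `replace('s3:','s3a:')` call works for the URIs in this notebook, but it would also rewrite any later occurrence of `s3:` inside an object key. A slightly more defensive sketch rewrites only the scheme prefix; the helper name `to_s3a` and the bucket and key below are made up for illustration.

```python
def to_s3a(uri):
    """Rewrite only the leading 's3://' scheme of `uri` to 's3a://'."""
    if uri.startswith("s3a://"):
        return uri  # already in the form the Hadoop S3A connector expects
    if not uri.startswith("s3://"):
        raise ValueError(f"not an S3 URI: {uri!r}")
    return "s3a://" + uri[len("s3://"):]


# Hypothetical key that happens to contain 's3:' in the middle.
tricky = "s3://my-bucket/data/s3:archive/file.csv"

print(to_s3a(tricky))                 # only the scheme changes
print(tricky.replace("s3:", "s3a:"))  # the key is altered as well
```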


In [19]:
%%time

rating_df = spark.read.csv(s3a_raw_csv, header=True)

CPU times: user 3.06 ms, sys: 285 µs, total: 3.34 ms
Wall time: 791 ms


In [20]:
print('Rows: '+str(rating_df.count()))
rating_df.dtypes

Rows: 8448


[('_c0', 'string'),
 ('timestamp', 'string'),
 ('ratingID', 'string'),
 ('userID', 'string'),
 ('placeID', 'string'),
 ('rating_overall', 'string'),
 ('rating_food', 'string'),
 ('rating_service', 'string')]

In [21]:
rating_df.show(10)

+---+----------+--------+------+-------+--------------+-----------+--------------+
|_c0| timestamp|ratingID|userID|placeID|rating_overall|rating_food|rating_service|
+---+----------+--------+------+-------+--------------+-----------+--------------+
|  0|2022-08-25|    3416|    gK|    681|             1|          2|             2|
|  1|2022-08-25|    3417|    gK|    719|             1|          1|             1|
|  2|2022-08-25|    3418|    gK|   1128|             1|          2|             2|
|  3|2022-08-25|    3419|    gK|   1203|             1|          2|             2|
|  4|2022-08-25|    3420|    gK|   1058|             1|          1|             1|
|  5|2022-08-25|    3421|    gK|    585|             1|          0|             0|
|  6|2022-08-25|    3422|    gL|    990|             2|          2|             2|
|  7|2022-08-25|    3423|    gL|   1192|             2|          2|             2|
|  8|2022-08-25|    3424|    gL|   1390|             2|          2|             2|
|  9

23/01/11 13:32:29 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , timestamp, ratingID, userID, placeID, rating_overall, rating_food, rating_service
 Schema: _c0, timestamp, ratingID, userID, placeID, rating_overall, rating_food, rating_service
Expected: _c0 but found: 
CSV file: s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/raw_csv/fact_rating_synthetic.csv
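
Every column above is typed as `string` because the CSV was read with `header=True` only, without schema inference or an explicit schema. One way to get typed columns is to pass a DDL-formatted schema string to `spark.read.csv`. The sketch below only builds that string; the column types are assumptions guessed from the preview above, not something the dataset documents, and the helper name `ddl_schema` is hypothetical.

```python
# Assumed column types, guessed from the preview of the synthetic dataset.
COLUMN_TYPES = {
    "_c0": "INT",            # unnamed first CSV column, which Spark labels _c0
    "timestamp": "DATE",
    "ratingID": "INT",
    "userID": "STRING",
    "placeID": "INT",
    "rating_overall": "INT",
    "rating_food": "INT",
    "rating_service": "INT",
}


def ddl_schema(column_types):
    """Build a DDL schema string, quoting names with backticks."""
    return ", ".join(f"`{name}` {dtype}" for name, dtype in column_types.items())


schema = ddl_schema(COLUMN_TYPES)
print(schema)

# With a live Spark session, the string can be used as:
#   rating_df = spark.read.csv(s3a_raw_csv, header=True, schema=schema)
```

Alternatively, `inferSchema=True` lets Spark guess the types at the cost of an extra pass over the file.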


In [23]:
# Write dataframe to Delta Table location using 's3a' protocol
s3_delta_table_uri=f's3://{bucket}/delta_to_sagemaker/delta_format/'
s3a_delta_table_uri=s3_delta_table_uri.replace('s3:','s3a:')

print(s3a_delta_table_uri)

s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_format/


In [24]:
rating_df.write.format("delta").mode("overwrite").save(s3a_delta_table_uri)

23/01/11 13:32:57 WARN CSVHeaderChecker: CSV header does not conform to the schema.
 Header: , timestamp, ratingID, userID, placeID, rating_overall, rating_food, rating_service
 Schema: _c0, timestamp, ratingID, userID, placeID, rating_overall, rating_food, rating_service
Expected: _c0 but found: 
CSV file: s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/raw_csv/fact_rating_synthetic.csv
                                                                                

In [25]:
from delta import DeltaTable

# Use static method to determine table type
print(f'Is this a Delta Table?:\n{DeltaTable.isDeltaTable(spark, s3a_delta_table_uri)}')

Is this a Delta Table?:
True
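
For context on what "Delta format" means on disk: a Delta Table is a folder of Parquet data files plus a `_delta_log` folder holding one JSON commit file per table version, named with the version number zero-padded to 20 digits. The sketch below computes where those files would sit for a given table URI; the helper names and the bucket are hypothetical, and it does not contact S3.

```python
def delta_log_prefix(table_uri):
    """Return the location of the transaction log folder for a table URI."""
    return table_uri.rstrip("/") + "/_delta_log/"


def commit_file(table_uri, version):
    """Return the location of the JSON commit file for a table version."""
    if version < 0:
        raise ValueError("version must be non-negative")
    return f"{delta_log_prefix(table_uri)}{version:020d}.json"


# Hypothetical bucket; substitute your own table URI.
table_uri = "s3a://my-bucket/delta_to_sagemaker/delta_format/"
print(commit_file(table_uri, 0))
```

Listing that prefix in Amazon S3 after the write above is a quick way to double-check that the table was created.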


We now have our sample Delta Table prepared and connected via the Spark Session.

Note that you can replace the Delta Table URI above to connect to your own tables.


----

### Option 2: Connecting to a Delta Table via [Delta Sharing](https://delta.io/sharing/)

In this section we will connect directly to an external Delta Table by using the open-source library Delta Sharing. To use this method, you need a Delta Sharing Server available in your Delta Lake for managing access and permissions. For further details, see the blog post [here](https://aws.amazon.com/blogs/opensource/delta-sharing-on-aws/).

Note that this code uses a `profile_file` that contains the endpoint of the Delta Sharing Server, hosted either at Databricks or as an open-source implementation, together with a bearer token that allows you to access the data.
Typically this file is managed and secured on the client side. Because our experiment with Delta Sharing only reads data from the Databricks server, we can use the example `profile_file` provided on GitHub and retrieve it via HTTP.

For this purpose we will:
* Configure the profile for Delta Sharing
* Establish a Delta Sharing Client
* Load data from our Delta Table and read it as a Pandas data-frame

This time, we will read the [Boston Housing](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html) dataset directly from the Delta-IO [Delta Sharing repository](https://github.com/delta-io/delta-sharing)...

In [26]:
profile_file = "https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share"
!wget {profile_file} -P ./ -O 'open-datasets.share'

--2023-01-11 13:33:24--  https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.108.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148 [text/plain]
Saving to: ‘open-datasets.share’


2023-01-11 13:33:25 (12.3 MB/s) - ‘open-datasets.share’ saved [148/148]



In [27]:
!cat ./open-datasets.share

{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.delta.io/delta-sharing/",
  "bearerToken": "faaie590d541265bcab1f2de9813274bf233"
}
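
The profile file is plain JSON with three fields: the credentials format version, the server endpoint, and the bearer token that the client sends in the `Authorization` header. As a rough sketch, you could validate a profile and avoid printing the token in notebook output as follows; the helper names are hypothetical and the token below is a placeholder, not a real credential.

```python
import json

REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def parse_profile(text):
    """Parse a Delta Sharing profile and check that the expected fields exist."""
    profile = json.loads(text)
    missing = [field for field in REQUIRED_FIELDS if field not in profile]
    if missing:
        raise ValueError(f"profile is missing fields: {missing}")
    return profile


def auth_header(profile):
    """Build the HTTP header carrying the bearer token."""
    return {"Authorization": f"Bearer {profile['bearerToken']}"}


def redact(profile):
    """Return a copy of the profile that is safe to print or log."""
    return {**profile, "bearerToken": "***"}


# Placeholder profile with the same shape as the file shown above.
sample = """
{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.delta.io/delta-sharing/",
  "bearerToken": "example-token"
}
"""

profile = parse_profile(sample)
print(redact(profile))
```

For your own Delta Sharing Server, treat the profile like any other secret rather than displaying it with `cat`.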

In [28]:
sample_profile_file_url = sagemaker.Session().upload_data(
    './open-datasets.share', bucket=bucket, key_prefix='delta_to_sagemaker/delta_sharing/profile'
)

print(sample_profile_file_url)

s3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_sharing/profile/open-datasets.share


In [29]:
# Create a SharingClient
import delta_sharing

client = delta_sharing.SharingClient(sample_profile_file_url)
table_url = profile_file + '#delta_sharing.default.boston-housing'
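
The table URL used above follows the pattern `<profile-file-path>#<share>.<schema>.<table>`. The sketch below builds and splits such URLs so the three names can be kept as separate variables; it is an illustration with hypothetical helper names, not the parsing code of the `delta_sharing` library.

```python
def make_table_url(profile_path, share, schema, table):
    """Build '<profile-file-path>#<share>.<schema>.<table>'."""
    for part in (share, schema, table):
        # '.' and '#' are the separators, so they cannot appear inside a name.
        if not part or "." in part or "#" in part:
            raise ValueError(f"invalid name: {part!r}")
    return f"{profile_path}#{share}.{schema}.{table}"


def split_table_url(url):
    """Split a table URL back into (profile_path, share, schema, table)."""
    profile_path, sep, fragment = url.rpartition("#")
    names = fragment.split(".")
    if not sep or not profile_path or len(names) != 3:
        raise ValueError(f"not a valid table URL: {url!r}")
    return (profile_path, *names)


url = make_table_url("open-datasets.share", "delta_sharing", "default", "boston-housing")
print(url)
print(split_table_url(url))
```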

In [30]:
# Load the table as a Pandas DataFrame
print('Loading boston-housing table from Delta Lake')
train_data = delta_sharing.load_as_pandas(table_url)
print(f'Train data shape: {train_data.shape}')

Loading boston-housing table from Delta Lake
Train data shape: (506, 15)


In [31]:
train_data.head()

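|   | ID | crim | zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | black | lstat | medv |
|---|----|------|----|-------|------|-----|----|-----|-----|-----|-----|---------|-------|-------|------|
| 0 | 1 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 396.9 | 4.98 | 24.0 |
| 1 | 2 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.9 | 9.14 | 21.6 |
| 2 | 4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 3 | 5 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.9 | 5.33 | 36.2 |
| 4 | 7 | 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.6 | 12.43 | 22.9 |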
