# From Delta Lake to Amazon SageMaker

[Delta Lake](https://delta.io/) is a widely used open-source storage framework for Lakehouse architectures.

This sample demonstrates how to integrate Delta tables with Amazon SageMaker for data exploration, ingestion, processing, training, and hosting in machine learning workflows.

---

## 1 - Data Exploration and Visualization

***Use Kernel: SparkMagic (PySpark) for running this notebook***

In this notebook, we perform Exploratory Data Analysis (EDA) on our Delta tables using the two connection methods explained in the previous notebook.


In [3]:
import sagemaker
sagemaker.__version__

'2.107.0'

In [4]:
import numpy as np
import pandas as pd
import boto3

In [5]:
# S3 bucket for saving processing job outputs
sm_session = sagemaker.Session()
bucket = sm_session.default_bucket()
region = sm_session.boto_region_name

sm_client = boto3.client('sagemaker')
iam_role = sagemaker.get_execution_role()

print('Default bucket: '+bucket)

Default bucket: sagemaker-eu-west-1-889960878219


----
### Option 1: Reading from a Delta Table stored in Amazon S3 via Spark Session and Context

In [6]:
# Import pyspark and build Spark session
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession, SQLContext

In [7]:
# Build the list of package entries using Maven coordinates (groupId:artifactId:version)
pkg_list = []
pkg_list.append("io.delta:delta-core_2.12:1.1.0")
pkg_list.append("org.apache.hadoop:hadoop-aws:3.2.2")

# Spark expects a single comma-separated string
packages = ",".join(pkg_list)
print('packages: '+packages)

packages: io.delta:delta-core_2.12:1.1.0,org.apache.hadoop:hadoop-aws:3.2.2
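
Spark resolves each of these entries through Ivy when the session starts, so a malformed coordinate only surfaces at startup. The following is a minimal sketch, not part of the original notebook, that checks the `groupId:artifactId:version` shape before the string is handed to Spark. The helper name `validate_maven_coordinates` is hypothetical.

```python
import re

# groupId:artifactId:version, each part made of letters, digits, dots, dashes or underscores
_COORD_RE = re.compile(r"^[\w.\-]+:[\w.\-]+:[\w.\-]+$")


def validate_maven_coordinates(coords):
    """Return the comma-separated string Spark expects, or raise ValueError."""
    bad = [c for c in coords if not _COORD_RE.match(c)]
    if bad:
        raise ValueError(f"Malformed Maven coordinates: {bad}")
    return ",".join(coords)


packages = validate_maven_coordinates([
    "io.delta:delta-core_2.12:1.1.0",
    "org.apache.hadoop:hadoop-aws:3.2.2",
])
```

The check is purely syntactic: it does not confirm that the artifact exists or that its version is compatible with the installed Spark release.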


In [8]:
# Instantiate Spark via builder
# Note: we use the `ContainerCredentialsProvider` to give us access to underlying IAM role permissions

spark = (SparkSession
    .builder
    .appName("PySparkApp") 
    .config("spark.jars.packages", packages) 
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") 
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") 
    .config("fs.s3a.aws.credentials.provider",'com.amazonaws.auth.ContainerCredentialsProvider') 
    .getOrCreate())

sc = spark.sparkContext

print('Spark version: '+str(sc.version))



:: loading settings :: url = jar:file:/opt/conda/lib/python3.8/site-packages/pyspark/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /root/.ivy2/cache
The jars for the packages stored in: /root/.ivy2/jars
io.delta#delta-core_2.12 added as a dependency
org.apache.hadoop#hadoop-aws added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-fb442fe9-413e-4efc-ae35-8a818023343e;1.0
	confs: [default]
	found io.delta#delta-core_2.12;1.1.0 in central
	found org.antlr#antlr4-runtime;4.8 in central
	found org.codehaus.jackson#jackson-core-asl;1.9.13 in central
	found org.apache.hadoop#hadoop-aws;3.2.2 in central
	found com.amazonaws#aws-java-sdk-bundle;1.11.563 in central
:: resolution report :: resolve 714ms :: artifacts dl 36ms
	:: modules in use:
	com.amazonaws#aws-java-sdk-bundle;1.11.563 from central in [default]
	io.delta#delta-core_2.12;1.1.0 from central in [default]
	org.antlr#antlr4-runtime;4.8 from central in [default]
	org.apache.hadoop#hadoop-aws;3.2.2 from central in [default]
	org.codehaus.jackson#jackson-core-asl;1.9.13 from central in [default]
	--------

Spark version: 3.2.0
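
Because `getOrCreate()` returns an already running session when one exists, the Delta settings passed to the builder may not be the ones in effect. As a hedged sketch (the helper name `missing_delta_settings` is hypothetical), the runtime configuration can be compared against the values used in the builder above:

```python
# Settings taken from the builder call in this notebook
EXPECTED_DELTA_CONF = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def missing_delta_settings(spark, expected=EXPECTED_DELTA_CONF):
    """Return {key: actual_value} for every expected setting that is absent or different."""
    mismatches = {}
    for key, wanted in expected.items():
        actual = spark.conf.get(key, None)  # None when the key is unset
        if actual != wanted:
            mismatches[key] = actual
    return mismatches
```

Calling `missing_delta_settings(spark)` should return an empty dictionary when the session was created with the Delta extension and catalog configured as above.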


In [9]:
s3a_delta_table_uri=f's3a://{bucket}/delta_to_sagemaker/delta_format/'
print(s3a_delta_table_uri)

s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_format/


In [10]:
# Create SQL command inserting the S3 path location

sql_cmd = f'SELECT * FROM delta.`{s3a_delta_table_uri}` ORDER BY medv'
print(f'SQL command: {sql_cmd}')

SQL command: SELECT * FROM delta.`s3a://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_format/` ORDER BY medv
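
The same table can also be read through the DataFrame reader instead of a SQL string, using the path-based `format("delta").load(...)` call from the Delta Lake API. This is a minimal sketch rather than the notebook's own method; `read_delta_sorted` is a hypothetical helper name, and it assumes a Spark session configured for Delta as above.

```python
def read_delta_sorted(spark, table_uri, sort_col="medv"):
    """Load a Delta table by path and sort it, without building a SQL string."""
    return spark.read.format("delta").load(table_uri).orderBy(sort_col)
```

For example, `read_delta_sorted(spark, s3a_delta_table_uri).show(10)` would display the first rows ordered by `medv`. Passing the path as an argument avoids having to quote it inside backticks.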


In [11]:
# Execute SQL command which returns dataframe

sql_results = spark.sql(sql_cmd)
print(type(sql_results))

sql_results.show(10)

22/08/31 13:52:52 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
                                                                                

<class 'pyspark.sql.dataframe.DataFrame'>


+-------+---+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|   crim| zn|indus|chas|  nox|   rm| age|   dis|rad|tax|ptratio|     b|lstat|medv|
+-------+---+-----+----+-----+-----+----+------+---+---+-------+------+-----+----+
|14.3337|  0| 18.1|   0|  0.7| 4.88| 100|1.5895| 24|666|   20.2|372.92|30.62|10.2|
|12.2472|  0| 18.1|   0|0.584|5.837|59.7|1.9976| 24|666|   20.2| 24.65|15.69|10.2|
|17.8667|  0| 18.1|   0|0.671|6.223| 100|1.3861| 24|666|   20.2|393.74|21.78|10.2|
|88.9762|  0| 18.1|   0|0.671|6.968|91.9|1.4165| 24|666|   20.2| 396.9|17.21|10.4|
|25.9406|  0| 18.1|   0|0.679|5.304|89.1|1.6475| 24|666|   20.2|127.36|26.64|10.4|
|22.0511|  0| 18.1|   0| 0.74|5.818|92.4|1.8662| 24|666|   20.2|391.45|22.11|10.5|
|24.3938|  0| 18.1|   0|  0.7|4.652| 100|1.4672| 24|666|   20.2| 396.9|28.28|10.5|
|12.8023|  0| 18.1|   0| 0.74|5.854|96.6|1.8956| 24|666|   20.2|240.52|23.79|10.8|
|15.8744|  0| 18.1|   0|0.671|6.545|99.1|1.5192| 24|666|   20.2| 396.9|21.08|10.9|
|37.

                                                                                

(TBC).....

----
### Option 2: Reading from an external Delta Table via Delta Sharing

In [12]:
profile_file = 'https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share'
# -O sets the output file name directly (a -P prefix has no effect when -O is given)
!wget {profile_file} -O 'open-datasets.share'

--2022-08-31 13:53:16--  https://raw.githubusercontent.com/delta-io/delta-sharing/main/examples/open-datasets.share
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 148 [text/plain]
Saving to: ‘open-datasets.share’


2022-08-31 13:53:16 (7.79 MB/s) - ‘open-datasets.share’ saved [148/148]



In [13]:
!cat ./open-datasets.share

{
  "shareCredentialsVersion": 1,
  "endpoint": "https://sharing.delta.io/delta-sharing/",
  "bearerToken": "faaie590d541265bcab1f2de9813274bf233"
}
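
The profile file is a small JSON document with the three fields shown in the output above. As a hedged sketch (the helper names `load_profile` and `redact_profile` are hypothetical), it can be parsed and checked before use, and the bearer token masked before anything is printed or logged:

```python
import json

# Fields present in the profile file shown in this notebook
REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def load_profile(path):
    """Parse a Delta Sharing profile file and verify the expected fields exist."""
    with open(path) as fh:
        profile = json.load(fh)
    missing = [f for f in REQUIRED_FIELDS if f not in profile]
    if missing:
        raise ValueError(f"Profile file is missing fields: {missing}")
    return profile


def redact_profile(profile):
    """Return a copy of the profile that is safe to print."""
    masked = dict(profile)
    masked["bearerToken"] = "***"
    return masked
```

For example, `print(redact_profile(load_profile('./open-datasets.share')))` shows the endpoint without exposing the token.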

In [15]:
# Reuse the SageMaker session created earlier to upload the profile file
sample_profile_file_url = sm_session.upload_data(
    './open-datasets.share', bucket=bucket, key_prefix='delta_to_sagemaker/delta_sharing/profile'
)

print(sample_profile_file_url)

s3://sagemaker-eu-west-1-889960878219/delta_to_sagemaker/delta_sharing/profile/open-datasets.share


In [16]:
# Create a SharingClient
import delta_sharing

client = delta_sharing.SharingClient(sample_profile_file_url)
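# Table URL format: <profile-file-path>#<share>.<schema>.<table>
# Use the same profile file location that was passed to the client
table_url = sample_profile_file_url + '#delta_sharing.default.boston-housing'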
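
A Delta Sharing table URL has the form `<profile-file-path>#<share>.<schema>.<table>`. The sketch below builds such URLs and uses the client's `list_all_tables()` method to discover what the profile gives access to. The helper names `make_table_url` and `all_table_urls` are hypothetical and not part of the `delta_sharing` library.

```python
def make_table_url(profile_path, share, schema, table):
    """Compose '<profile-file-path>#<share>.<schema>.<table>'."""
    return f"{profile_path}#{share}.{schema}.{table}"


def all_table_urls(client, profile_path):
    """Build a table URL for every table visible to a delta_sharing.SharingClient."""
    return [
        make_table_url(profile_path, t.share, t.schema, t.name)
        for t in client.list_all_tables()
    ]
```

For example, `all_table_urls(client, './open-datasets.share')` would return one URL per shared table, which can then be passed to the `delta_sharing` load functions.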

In [20]:
# Load the table as a pandas DataFrame.
# Note: delta_sharing.load_as_spark() fails in this session with
# "Failed to find data source: deltaSharing" because the Delta Sharing
# Spark connector is not among the packages on the Spark classpath.
print('Loading boston-housing table from Delta Lake')
train_data = delta_sharing.load_as_pandas(table_url)
#print(f'Train data shape: {train_data.shape}')

Loading boston-housing table from Delta Lake
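
The `deltaSharing` data source used by `delta_sharing.load_as_spark` is provided by the Delta Sharing Spark connector, which is not among the packages configured for the Spark session in this notebook. The sketch below shows one way it could be added. The connector version `0.5.0` is an assumption and should be checked against the Spark and Scala versions in use, and the helper names are hypothetical. Since `spark.jars.packages` is read at startup, any running session would need to be stopped first.

```python
def packages_with_sharing(base_packages, sharing_version="0.5.0", scala_version="2.12"):
    """Append the Delta Sharing Spark connector coordinate (version is an assumption)."""
    connector = f"io.delta:delta-sharing-spark_{scala_version}:{sharing_version}"
    if connector in base_packages:
        return list(base_packages)
    return list(base_packages) + [connector]


def build_session(packages):
    """Create a Spark session with the same settings as above plus the given packages."""
    # Imported lazily so packages_with_sharing() can be used without Spark installed
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .appName("PySparkApp")
        .config("spark.jars.packages", ",".join(packages))
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .config("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.ContainerCredentialsProvider")
        .getOrCreate()
    )
```

A possible usage would be `spark.stop()` followed by `spark = build_session(packages_with_sharing(pkg_list))`, after which `load_as_spark` should be able to resolve the data source.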
