# Interactive data processing with PySpark on SageMaker

This notebook demonstrates how to process data interactively with PySpark on Amazon SageMaker.

There are many ways to do this. Here are three options to start with:

* Option A: Use a **Notebook Instance** to develop locally with Spark (single node).
* Option B: Use **SageMaker Studio** and its native integration with [Glue Interactive sessions](https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions) (serverless).
* Option C: Use **SageMaker Studio** and its native integration with [EMR clusters](https://catalog.workshops.aws/sagemaker-studio-emr/en-US/01-interacting-emr-cluster) (cluster-based).


This notebook covers Option A: local Spark processing that runs on your **Notebook Instance**.


<div style="text-align:center">
    <img src="media/sm_spark.png" width="800"/>
</div>

# Option A: Use Spark on your Notebook Instance (single node)
Use this option to develop and debug Spark code quickly on a small or sample dataset.

## Set up the environment

In [None]:
import boto3
import sagemaker
from pyspark.sql import SparkSession

# Get the IAM execution role attached to this notebook instance
iam_role = sagemaker.get_execution_role()

# Request temporary credentials for that role (valid for 3600 seconds)
# so that Spark can use them to read from and write to S3
sts = boto3.client('sts')
response = sts.assume_role(
    RoleArn=iam_role,
    RoleSessionName='pyspark',
    DurationSeconds=3600,
)
credentials = response['Credentials']
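Because the credentials requested above are temporary (`DurationSeconds=3600`), a long interactive session can outlive them. The following is a minimal sketch of how the remaining validity could be checked. It relies on the `Expiration` field that `assume_role` returns alongside the keys, as a timezone-aware `datetime`. The helper name `seconds_until_expiry` and the stand-in dictionary are illustrative only and are not part of the original notebook.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_expiry(credentials, now=None):
    """Return the number of seconds left before the temporary credentials expire.

    `credentials` is expected to look like response['Credentials'], where
    'Expiration' is a timezone-aware datetime. A negative result means the
    credentials have already expired.
    """
    now = now or datetime.now(timezone.utc)
    return (credentials["Expiration"] - now).total_seconds()


# Stand-in for response['Credentials'] (hypothetical values, for illustration).
example_credentials = {
    "Expiration": datetime.now(timezone.utc) + timedelta(seconds=3600)
}

remaining = seconds_until_expiry(example_credentials)
print(f"Credentials valid for about {remaining / 60:.0f} more minutes")
```

In the notebook itself, `seconds_until_expiry(credentials)` could be called before a long read or write. If the value is close to zero, one option is to request new credentials and rebuild the Spark session with them.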

## Set up the Spark session
We add the [S3A dependencies](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html) to the session so that Spark can read from and write to S3.


In [None]:
spark = (
    SparkSession.builder.appName("PySparkApp")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .config("spark.hadoop.com.amazonaws.services.s3.enableV4", "true")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.2.2,com.amazonaws:aws-java-sdk-bundle:1.11.888")
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId'])
    .config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey'])
    .config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken'])
    .getOrCreate()
)
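The session above passes the three STS credential fields to S3A one property at a time. As an optional variation, the mapping can be collected in a single dictionary, which makes it easier to rebuild the session when the credentials are renewed. This is only a sketch: the helper name `s3a_credential_conf` and the placeholder values are assumptions for illustration, while the property names are the ones already used in the session above.

```python
def s3a_credential_conf(credentials):
    """Map STS credential fields to the S3A properties used by the Spark session."""
    return {
        "spark.hadoop.fs.s3a.aws.credentials.provider":
            "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider",
        "spark.hadoop.fs.s3a.access.key": credentials["AccessKeyId"],
        "spark.hadoop.fs.s3a.secret.key": credentials["SecretAccessKey"],
        "spark.hadoop.fs.s3a.session.token": credentials["SessionToken"],
    }


# Placeholder values for illustration; in the notebook, pass response['Credentials'].
example = {
    "AccessKeyId": "EXAMPLE_KEY_ID",
    "SecretAccessKey": "EXAMPLE_SECRET",
    "SessionToken": "EXAMPLE_TOKEN",
}

conf = s3a_credential_conf(example)
for key in conf:
    print(key)  # print the property names only, never the secret values

# Applying the settings to a builder (requires pyspark, so left as a comment):
# builder = SparkSession.builder.appName("PySparkApp")
# for key, value in conf.items():
#     builder = builder.config(key, value)
```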

## Process data with Spark

In [None]:
# GET DATA FROM S3
df = spark.read.csv("<ADD YOUR S3A FILE INPUT HERE>", header=True) # example: s3a://mybucket/spark-input/dataset.csv

# ==================================================
# ============= DO PROCESSING HERE =================
# ==================================================

# UPLOAD PROCESSED DATA TO S3
df.write.parquet("<ADD YOUR S3A FILE OUTPUT HERE>", mode="overwrite") # example: s3a://mybucket/spark-output/dataset.parquet
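The input and output placeholders above expect `s3a://` paths, because the session is configured with the S3A connector. S3 locations copied from the console or the AWS CLI usually start with `s3://` instead. A small helper along the following lines could normalize them before they are passed to `spark.read` or `df.write`. The function name `to_s3a` is hypothetical, and the bucket and key are the example values from the comments above.

```python
def to_s3a(uri):
    """Rewrite an s3:// URI as s3a://, leaving s3a:// URIs unchanged."""
    scheme, separator, rest = uri.partition("://")
    if not separator or scheme not in ("s3", "s3a") or not rest:
        raise ValueError(f"Not an S3 URI: {uri!r}")
    return "s3a://" + rest


input_path = to_s3a("s3://mybucket/spark-input/dataset.csv")
print(input_path)
```

The result can then be used directly, for example `spark.read.csv(input_path, header=True)`.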

# Options B and C: Use SageMaker Studio
Go to SageMaker Studio and follow the instructions for the [Glue integration](https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions) or the [EMR integration](https://catalog.workshops.aws/sagemaker-studio-emr/en-US/01-interacting-emr-cluster).