
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0).                              |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer.                      |

Our goal is to use the notebook to develop and test a batch function named `processBatch`. Inside our development environment, `processBatch` will process a test DataFrame that has the same schema as the streaming data.

Replace `${BUCKET_NAME}` with your bucket name, then run the code block below to set up the environment and variables for the test.

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType, StringType
from pyspark.sql import SQLContext
from datetime import datetime

glueContext = GlueContext(SparkContext.getOrCreate())
s3_bucket = "s3://${BUCKET_NAME}"
output_path = s3_bucket + "/output/lab4/notebook/"
job_time_string = datetime.now().strftime("%Y%m%d%H%M%S")
s3_target = output_path + job_time_string
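
As a quick sanity check on the path logic above, the following stand-alone sketch shows what `s3_target` ends up looking like. The helper `build_s3_target` and the bucket name `my-example-bucket` are hypothetical and used only for illustration; the prefix and timestamp format are taken from the cell above.

```python
from datetime import datetime


def build_s3_target(bucket_name: str, now: datetime) -> str:
    """Mirror the path logic above: bucket + fixed prefix + timestamp."""
    s3_bucket = "s3://" + bucket_name
    output_path = s3_bucket + "/output/lab4/notebook/"
    job_time_string = now.strftime("%Y%m%d%H%M%S")
    return output_path + job_time_string


# "my-example-bucket" is a hypothetical bucket name, not one from this lab.
example = build_s3_target("my-example-bucket", datetime(2023, 1, 15, 9, 30, 0))
print(example)  # s3://my-example-bucket/output/lab4/notebook/20230115093000
```

Because the timestamp has one-second resolution, each run of the setup cell normally writes to its own folder under `/output/lab4/notebook/`.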


Load the lookup dataframe from the S3 folder.

In [None]:
country_lookup_frame = glueContext.create_dynamic_frame.from_options(
                            format_options = {"withHeader":True, "separator":",", "quoteChar":"\""},
                            connection_type = "s3", 
                            format = "csv", 
                            connection_options = {"paths": [s3_bucket + "/input/lab4/country_lookup/"], "recurse":True}, 
                            transformation_ctx = "country_lookup_frame")


Here is the body of the batch function, where we apply the mapping and lookup transformations to the incoming data.

In [None]:
def processBatch(data_frame, batchId):
    # Skip empty batches.
    if data_frame.count() > 0:
        # Convert the Spark DataFrame into a Glue DynamicFrame.
        dynamic_frame = DynamicFrame.fromDF(data_frame, glueContext, "from_data_frame")
        # Select the expected columns; every field is kept as a string.
        apply_mapping = ApplyMapping.apply(frame=dynamic_frame, mappings=[
            ("uuid", "string", "uuid", "string"),
            ("country", "string", "country", "string"),
            ("itemtype", "string", "itemtype", "string"),
            ("saleschannel", "string", "saleschannel", "string"),
            ("orderpriority", "string", "orderpriority", "string"),
            ("orderdate", "string", "orderdate", "string"),
            ("region", "string", "region", "string"),
            ("shipdate", "string", "shipdate", "string"),
            ("unitssold", "string", "unitssold", "string"),
            ("unitprice", "string", "unitprice", "string"),
            ("unitcost", "string", "unitcost", "string"),
            ("totalrevenue", "string", "totalrevenue", "string"),
            ("totalcost", "string", "totalcost", "string"),
            ("totalprofit", "string", "totalprofit", "string")],
                                           transformation_ctx="apply_mapping")

        # Join with the country lookup on country == CountryName, then drop
        # the join keys and the price, cost, revenue and profit fields.
        final_frame = Join.apply(apply_mapping, country_lookup_frame, 'country', 'CountryName').drop_fields(
            ['CountryName', 'country', 'unitprice', 'unitcost', 'totalrevenue', 'totalcost', 'totalprofit'])

        # Write the result to the timestamped S3 folder as CSV.
        s3sink = glueContext.write_dynamic_frame.from_options(frame=final_frame,
                                                              connection_type="s3",
                                                              connection_options={"path": s3_target},
                                                              format="csv",
                                                              transformation_ctx="s3sink")
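
To see what the join and field-drop step does to a single record, here is a plain-Python sketch that needs no Glue session. It is an illustration of the idea only, not the Glue API: `join_and_drop` is a hypothetical helper, the sample rows are made up, and the `CountryCode` lookup column is an assumption about what the lookup file might contain. It assumes `Join.apply` behaves as an equality join, so rows without a match in the lookup are left out.

```python
def join_and_drop(records, lookup, key, lookup_key, drop):
    """Inner-join two lists of dicts on key == lookup_key, then remove the fields in drop."""
    # Index the lookup rows by their join key for fast matching.
    index = {}
    for row in lookup:
        index.setdefault(row[lookup_key], []).append(row)

    joined = []
    for rec in records:
        # Records with no matching lookup row produce no output (inner join).
        for match in index.get(rec[key], []):
            merged = {**rec, **match}
            joined.append({k: v for k, v in merged.items() if k not in drop})
    return joined


# Hypothetical sample rows; "CountryCode" is an assumed lookup column.
sales = [
    {"uuid": "1", "country": "France", "unitssold": "10", "totalprofit": "99.5"},
    {"uuid": "2", "country": "Atlantis", "unitssold": "3", "totalprofit": "1.0"},
]
countries = [{"CountryName": "France", "CountryCode": "FR"}]

result = join_and_drop(sales, countries, "country", "CountryName",
                       drop={"CountryName", "country", "totalprofit"})
print(result)  # [{'uuid': '1', 'unitssold': '10', 'CountryCode': 'FR'}]
```

In this sketch a name in `drop` that does not match a field exactly leaves that field in the output, so the names have to be spelled the same way as in the mapping.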


Now we will load some test data to test the batch function.

`The batchId in the following test is just an arbitrary value we use for testing. During a streaming job, it is provided by the Glue runtime.`


In [None]:
dynaFrame = glueContext.create_dynamic_frame.from_catalog(database="glueworkshop_cloudformation", 
                                                          table_name="json-static-table")
processBatch(dynaFrame.toDF(), "12")
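
For context, a streaming job would normally not call `processBatch` directly. The following is a minimal sketch, assuming the job registers the batch function through `GlueContext.forEachBatch`, which then invokes it for each batch with a runtime-supplied batch ID. The wrapper `run_streaming`, its parameter names, and the default window size are hypothetical and not part of this lab's code.

```python
def run_streaming(glue_context, streaming_df, batch_function, checkpoint_location,
                  window_size="60 seconds"):
    """Hand batch_function to Glue so it is called once per batch."""
    glue_context.forEachBatch(
        frame=streaming_df,
        batch_function=batch_function,
        options={
            # How much time each batch covers (assumed default).
            "windowSize": window_size,
            # Where streaming checkpoint information is stored.
            "checkpointLocation": checkpoint_location,
        },
    )


# Inside a Glue streaming job this might be called as follows; source_df and
# the checkpoint path are hypothetical placeholders:
# run_streaming(glueContext, source_df, processBatch, s3_bucket + "/checkpoint/")
```

Keeping `processBatch` as a plain function of `(data_frame, batchId)` is what makes it possible to test it in the notebook with static data first and reuse it unchanged in the streaming job.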
