
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0).                              |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer.                      |

In [6]:
%glue_version 3.0
%spark_conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
%spark_conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
%number_of_workers 2

%%configure
{
  "--datalake-formats": "delta"
}



Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0 
Setting Glue version to: 3.0
Previous Spark configuration: None
Setting new Spark configuration to: spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Previous Spark configuration: spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
Setting new Spark configuration to: spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
Previous number of workers: 5
Setting new number of workers to: 2
The following configurations have been updated: {'--datalake-formats': 'delta'}
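The log above shows the second `%spark_conf` replacing the first rather than adding to it, and the session start-up log further down lists only `spark.sql.extensions` under `--conf`. One way to pass both Delta settings is to chain them inside a single `--conf` value and set that through `%%configure`. The sketch below builds that value and prints the JSON to paste into a fresh cell. It assumes the session accepts several settings chained with ` --conf ` in one value, the pattern AWS documents for Glue job parameters; the helper name `chain_spark_confs` is hypothetical.

```python
import json

# Spark settings taken from the two %spark_conf lines in the cell above.
DELTA_SPARK_CONFS = {
    "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
    "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
}


def chain_spark_confs(confs):
    """Join key=value pairs with ' --conf ' so they fit in one --conf value."""
    return " --conf ".join(f"{key}={value}" for key, value in confs.items())


# Assumption: the session honours several settings chained in one "--conf" value.
configure_body = {
    "--conf": chain_spark_confs(DELTA_SPARK_CONFS),
    "--datalake-formats": "delta",
}

# Paste the printed JSON under a %%configure line in a fresh cell.
print(json.dumps(configure_body, indent=2))
```

After the session starts, both settings should then appear on the `--conf` line of the default arguments; if only one does, the chaining assumption does not hold for this kernel version.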


In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

from delta.tables import *
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, ArrayType, DateType, TimestampType, FloatType

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::175908995626:role/glue-role
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 2
Session ID: c15ab862-78f8-4788-8a7a-dbb73cc37045
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
--datalake-formats delta
Waiting for session c15ab862-78f8-4788-8a7a-dbb73cc37045 to get into ready status...
Session c15ab862-78f8-4788-8a7a-dbb73cc37045 has been created.



In [2]:
# Every column is read as a string; cast to numeric or date types later if needed.
# NOTE: each row in the file starts with an extra value ("I") that this schema does
# not declare, so the values in the output below sit one column to the right.
ORDERS_SCHEMA = [
    ('order_number', StringType()),
    ('customer_id', StringType()),
    ('product_id', StringType()),
    ('order_date', StringType()),
    ('units', StringType()),
    ('sale_price', StringType()),
    ('currency', StringType()),
    ('order_mode', StringType()),
    ('updated_at', StringType())
]
fields = [StructField(*field) for field in ORDERS_SCHEMA]
schema = StructType(fields)

df_read_data_full = spark.read.csv(
    "s3://aws-analytics-course/bronze/dms/sales/store_orders/LOAD00000001.csv",
    schema=schema,
)
df_read_data_full.show()

+------------+-----------+----------+----------+----------+----------+--------+----------+----------+
|order_number|customer_id|product_id|order_date|     units|sale_price|currency|order_mode|updated_at|
+------------+-----------+----------+----------+----------+----------+--------+----------+----------+
|           I|          1|       212|         5|02/03/2019|        10|   11.60|       USD|       NEW|
|           I|          2|      1940|        10|06/24/2020|         8|   72.31|       USD|       NEW|
|           I|          3|        60|         6|02/11/2019|         4|   24.82|       INR|       NEW|
|           I|          4|      2776|         6|05/20/2018|         4|   20.91|       USD|       NEW|
|           I|          5|       409|         9|07/05/2019|         5|   98.41|       INR|       NEW|
|           I|          6|       978|         6|12/16/2020|         1|    6.90|       USD|       NEW|
|           I|          7|      2904|         6|01/04/2021|         1|   71.56|   
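
In the output above every value sits one column to the right of where it belongs: `order_number` holds `I`, `units` holds a date, and `currency` holds a price. The rows carry ten values (the later output in this notebook shows columns `_c0` to `_c9`), while `ORDERS_SCHEMA` declares nine. Below is a minimal sketch of the check, in plain Python so it runs without a Spark session. The leading column name `op` is an assumption (the `I` looks like a DMS insert flag), and the sample row is copied from the later ten-column output.

```python
# Column names declared in ORDERS_SCHEMA (nine entries).
DECLARED = [
    "order_number", "customer_id", "product_id", "order_date", "units",
    "sale_price", "currency", "order_mode", "updated_at",
]

# Assumption: the extra leading value is an operation flag; "op" is a made-up name.
WITH_OP = ["op"] + DECLARED

# First data row, as it appears in the ten-column output later in the notebook.
SAMPLE_ROW = ["I", "1", "212", "5", "02/03/2019", "10", "11.60", "USD", "NEW",
              "2023-01-10 15:16:52"]


def align(names, row):
    """Pair names with values by position; zip drops any surplus values."""
    return dict(zip(names, row))


shifted = align(DECLARED, SAMPLE_ROW)  # reproduces the misaligned output
aligned = align(WITH_OP, SAMPLE_ROW)   # every value lands under a fitting name

print(shifted["units"], shifted["currency"])
print(aligned["order_date"], aligned["sale_price"], aligned["updated_at"])
```

If the flag column should be kept, `WITH_OP` could feed the same `StructField(name, StringType())` construction used in the cell above.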

In [3]:
df_read_data_full.write \
    .format("delta") \
    .save("s3://aws-analytics-course/temp/store_orders") 




In [9]:
spark.sql("CREATE DATABASE IF NOT EXISTS testing")

DataFrame[]


In [4]:
deltaTable = DeltaTable.forPath(spark, "s3://aws-analytics-course/temp/store_orders")
deltaTable.generate("symlink_format_manifest")




In [8]:
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS default.store_orders (
        order_number string, customer_id string, product_id string,
        order_date string, units string, sale_price string,
        currency string, order_mode string, updated_at string
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION 's3://aws-analytics-course/temp/store_orders/_symlink_format_manifest/'
""")

DataFrame[]
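
The `CREATE EXTERNAL TABLE` statement in this cell repeats each column name by hand. Below is a small sketch that generates the same DDL from a list of names, so the table definition cannot drift from the schema. `build_symlink_table_ddl` is a hypothetical helper, and typing every column as `string` mirrors the statement in the cell rather than a recommendation.

```python
COLUMNS = [
    "order_number", "customer_id", "product_id", "order_date", "units",
    "sale_price", "currency", "order_mode", "updated_at",
]


def build_symlink_table_ddl(table, columns, delta_path):
    """Return a CREATE EXTERNAL TABLE statement over a Delta symlink manifest."""
    column_list = ", ".join(f"{name} string" for name in columns)
    manifest_dir = delta_path.rstrip("/") + "/_symlink_format_manifest/"
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({column_list}) "
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' "
        "STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat' "
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' "
        f"LOCATION '{manifest_dir}'"
    )


ddl = build_symlink_table_ddl(
    "default.store_orders",
    COLUMNS,
    "s3://aws-analytics-course/temp/store_orders",
)
print(ddl)
# In the notebook, the result would be passed to spark.sql(ddl).
```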


In [38]:

# show() returns None, so keep the DataFrame and display it in a separate call.
df1 = spark.read.format("delta").load("s3://aws-analytics-course/temp/store_orders")
df1.show()

+---+---+----+---+----------+---+-----+---+---+-------------------+
|_c0|_c1| _c2|_c3|       _c4|_c5|  _c6|_c7|_c8|                _c9|
+---+---+----+---+----------+---+-----+---+---+-------------------+
|  I|  1| 212|  5|02/03/2019| 10|11.60|USD|NEW|2023-01-10 15:16:52|
|  I|  2|1940| 10|06/24/2020|  8|72.31|USD|NEW|2023-01-10 15:16:52|
|  I|  3|  60|  6|02/11/2019|  4|24.82|INR|NEW|2023-01-10 15:16:52|
|  I|  4|2776|  6|05/20/2018|  4|20.91|USD|NEW|2023-01-10 15:16:52|
|  I|  5| 409|  9|07/05/2019|  5|98.41|INR|NEW|2023-01-10 15:16:52|
|  I|  6| 978|  6|12/16/2020|  1| 6.90|USD|NEW|2023-01-10 15:16:52|
|  I|  7|2904|  6|01/04/2021|  1|71.56|EUR|NEW|2023-01-10 15:16:52|
|  I|  8|1269|  3|08/11/2018|  6|47.67|USD|NEW|2023-01-10 15:16:52|
|  I|  9|2628|  5|01/16/2017|  1|59.05|EUR|NEW|2023-01-10 15:16:52|
|  I| 10|1672|  8|08/01/2020|  3|43.42|USD|NEW|2023-01-10 15:16:52|
|  I| 11|2666|  5|05/25/2018|  5|43.98|EUR|NEW|2023-01-10 15:16:52|
|  I| 12|1521|  9|02/21/2019|  1| 9.70|EUR|NEW|2