# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using it, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell before starting the session to configure it for Apache Iceberg.


In [2]:
%%configure
{
   "--datalake-formats": "iceberg",
    "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse --conf spark.sql.defaultCatalog=glue_catalog"
}  

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 1.0.7 
The following configurations have been updated: {'--datalake-formats': 'iceberg', '--conf': 'spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse --conf spark.sql.defaultCatalog=glue_catalog'}
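The `--conf` entry above packs several Spark settings into one string by joining them with ` --conf `. As a minimal sketch of how that string is assembled, the snippet below rebuilds it from a dictionary. The names `ICEBERG_CONF` and `build_conf_value` are hypothetical helpers introduced for illustration; the keys and values are copied from the cell above.

```python
import json

# Spark settings copied from the %%configure cell above.
ICEBERG_CONF = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.warehouse": "file:///tmp/spark-warehouse",
    "spark.sql.defaultCatalog": "glue_catalog",
}


def build_conf_value(settings):
    """Join key=value pairs with ' --conf ' so they fit in a single '--conf' entry."""
    return " --conf ".join(f"{key}={value}" for key, value in settings.items())


# Body that could be pasted under the %%configure magic.
configure_body = {
    "--datalake-formats": "iceberg",
    "--conf": build_conf_value(ICEBERG_CONF),
}
print(json.dumps(configure_body, indent=4))
```

Keeping the settings in a dictionary makes a single value, such as the warehouse path, easier to review or change than editing one long string.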


####  Run this cell to set up and start your interactive session.


In [1]:
%idle_timeout 2880
%glue_version 4.0
%worker_type G.1X
%number_of_workers 5

# Standard Glue job imports
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

# Additional imports for this notebook
from datetime import datetime
import pandas as pd
from pyspark.sql.functions import to_timestamp
from awsglue import DynamicFrame

# Create (or reuse) the Spark and Glue contexts for this session
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Current idle_timeout is None minutes.
idle_timeout has been set to 2880 minutes.
Setting Glue version to: 4.0
Previous worker type: None
Setting new worker type to: G.1X
Previous number of workers: None
Setting new number of workers to: 5
Trying to create a Glue session for the kernel.
Session Type: glueetl
Worker Type: G.1X
Number of Workers: 5
Idle Timeout: 2880
Session ID: 6a919fa4-b9a5-4fb3-8cfa-91baeba3aafb
Applying the following default arguments:
--glue_kernel_version 1.0.7
--enable-glue-datacatalog true
--datalake-formats iceberg
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.warehouse=file:///tmp/spark-warehouse --conf spark.sql.defaultCatalog=glue_c

In [2]:
# Read Silver data
df_moth_silver = glueContext.create_data_frame.from_catalog(database="tgsn_silver", table_name="kobo_moth")



In [3]:
# Keep only rows submitted today: the cutoff is today's date formatted as 'YYYY-MM-DD'
df_moth_to_gold = df_moth_silver.where(df_moth_silver['submission_time'] >= datetime.today().strftime('%Y-%m-%d'))
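The filter above compares `submission_time` with a date string. As a small sketch of what that cutoff means, the snippet below computes the same string for a fixed, hypothetical run time and parses it back to show that it denotes midnight of that day. It assumes `submission_time` is a timestamp column; if it were stored as a string, the comparison would be lexicographic instead. The names `midnight_cutoff` and `sample_now` are illustrative only.

```python
from datetime import datetime


def midnight_cutoff(now):
    """Return the 'YYYY-MM-DD' string used as the lower bound in the filter."""
    return now.strftime("%Y-%m-%d")


# Hypothetical run time, used only to make the example deterministic.
sample_now = datetime(2024, 3, 15, 14, 30)
cutoff = midnight_cutoff(sample_now)

# Parsing the string back shows it carries no time part, i.e. 00:00:00.
cutoff_as_timestamp = datetime.strptime(cutoff, "%Y-%m-%d")

print(cutoff)
print(cutoff_as_timestamp)
```

Under that assumption, a notebook run at any time of day selects every row submitted since the start of the current day.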




In [5]:
df_moth_to_gold.select('id').show()

+---+
| id|
+---+
| 85|
+---+


In [6]:
# Write today's rows to the Gold Iceberg table: append if it exists, otherwise create it
additional_options = {}
tables_collection = spark.catalog.listTables("tgsn_gold")
table_names_in_db = [table.name for table in tables_collection]
table_exists = "kobo_moth" in table_names_in_db
if table_exists:
    # Table already exists: sort rows within each partition, then append
    df_moth_to_gold.sortWithinPartitions("submission_time") \
        .writeTo("glue_catalog.tgsn_gold.kobo_moth") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://tgsn-gold/kobo/moth/meta/tgsn_gold/kobo_moth") \
        .tableProperty("write.parquet.compression-codec", "gzip") \
        .options(**additional_options) \
        .append()
else:
    # First run: create the table, partitioned by submission_time
    df_moth_to_gold.writeTo("glue_catalog.tgsn_gold.kobo_moth") \
        .tableProperty("format-version", "2") \
        .tableProperty("location", "s3://tgsn-gold/kobo/moth/meta/tgsn_gold/kobo_moth") \
        .tableProperty("write.parquet.compression-codec", "gzip") \
        .options(**additional_options) \
        .partitionedBy("submission_time") \
        .create()
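The two branches above repeat the same table properties. One possible way to avoid the duplication, sketched here under assumptions, is a small helper that holds the properties in a dictionary and picks the final action from `table_exists`. The names `TABLE_PROPERTIES` and `write_gold` are hypothetical; the helper calls only the writer methods already used in the cell above and expects `df` to be a Spark DataFrame.

```python
# Table properties copied from the cell above.
TABLE_PROPERTIES = {
    "format-version": "2",
    "location": "s3://tgsn-gold/kobo/moth/meta/tgsn_gold/kobo_moth",
    "write.parquet.compression-codec": "gzip",
}


def write_gold(df, target, table_exists, properties=TABLE_PROPERTIES,
               order_col="submission_time"):
    """Append to `target` if it exists, otherwise create it partitioned by `order_col`.

    Returns the name of the action taken ("append" or "create").
    """
    if table_exists:
        # Existing table: sort rows within each partition before appending.
        writer = df.sortWithinPartitions(order_col).writeTo(target)
    else:
        # New table: declare the partition column before creating it.
        writer = df.writeTo(target).partitionedBy(order_col)

    # Apply the shared table properties in both cases, as the original cell does.
    for key, value in properties.items():
        writer = writer.tableProperty(key, value)

    if table_exists:
        writer.append()
        return "append"
    writer.create()
    return "create"
```

In the notebook this could be called as `write_gold(df_moth_to_gold, "glue_catalog.tgsn_gold.kobo_moth", table_exists)`.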


