
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A json-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your aws configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, s3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (eg: %glue_version 2.0).                               |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X.                                                                           |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer.                      |

In [2]:
%connections hudi-connection

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0 
Connections to be included:
hudi-connection


In [1]:
try:
    # Standard library
    import os
    import sys
    import uuid

    # AWS SDK
    import boto3

    # Spark
    import pyspark
    from pyspark import SparkConf, SparkContext
    from pyspark.sql import SparkSession

    # AWS Glue
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from awsglue.dynamicframe import DynamicFrame
    from awsglue.context import GlueContext
    from awsglue.job import Job

    print("All modules are loaded .....")

except Exception as e:
    print("Some modules are missing {} ".format(e))

Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::043916019468:role/Lab3
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: 4ca9b47c-b55c-48ec-8245-a2c796da21e3
Job Type: glueetl
Applying the following default arguments:
--glue_kernel_version 0.37.0
--enable-glue-datacatalog true
Waiting for session 4ca9b47c-b55c-48ec-8245-a2c796da21e3 to get into ready status...
Session 4ca9b47c-b55c-48ec-8245-a2c796da21e3 has been created.
All modules are loaded .....


In [2]:
print("lets start")

lets start


# Define Spark Session

In [3]:
def create_spark_session():
    """Build (or reuse) a SparkSession with the settings used for the Hudi write."""
    spark = (
        SparkSession.builder
        .config('spark.serializer', 'org.apache.spark.serializer.KryoSerializer')
        .config('spark.sql.hive.convertMetastoreParquet', 'false')
        .config('spark.sql.legacy.pathOptionBehavior.enabled', 'true')
        .getOrCreate()
    )
    return spark




In [4]:
spark = create_spark_session()
sc = spark.sparkContext
glueContext = GlueContext(sc)





# Read from DynamoDB Glue Catalog

In [5]:
dataFrame = glueContext.create_data_frame.from_catalog(
    database="dev.dynamodbdb",
    table_name="dev_users",
)
dataFrame = dataFrame.withColumnRenamed("id", "pk")




In [7]:
dataFrame.show(1)

+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|first_name|                city|               state|                text|                  pk|last_name|             address|
+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|     Jesus|Unit 7720 Box 773...|Clearly around si...|Clearly around si...|567adc65-79cd-48e...| Campbell|Unit 7720 Box 773...|
+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
only showing top 1 row


# HUDI Settings

In [9]:
hudi_options = {
    # Table identity and write behavior
    'hoodie.table.name': 'hudi_table',
    'hoodie.datasource.write.storage.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.recordkey.field': 'pk',
    'hoodie.datasource.write.table.name': 'hudi_table',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.precombine.field': 'first_name',

    # Catalog sync through the Hive metastore
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': 'mydb',
    'hoodie.datasource.hive_sync.table': 'hudi_table',
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
}
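
Before writing, it can help to confirm that the fields named in the options actually exist in the DataFrame. The sketch below is a plain-Python check, not part of Hudi: `missing_key_fields` is a hypothetical helper, and the column list is copied from the `show()` output above. In the notebook you would pass `dataFrame.columns` and `hudi_options` instead.

```python
# Hypothetical helper (not a Hudi API): report option fields that are
# absent from the DataFrame's columns.
FIELD_OPTIONS = (
    "hoodie.datasource.write.recordkey.field",
    "hoodie.datasource.write.precombine.field",
)


def missing_key_fields(options, columns):
    """Return {option_name: [missing columns]} for the key-related options."""
    available = set(columns)
    problems = {}
    for option in FIELD_OPTIONS:
        value = options.get(option)
        if value is None:
            continue
        # Assumption: a value may list several columns separated by commas.
        fields = [f.strip() for f in value.split(",") if f.strip()]
        missing = [f for f in fields if f not in available]
        if missing:
            problems[option] = missing
    return problems


# Column names copied from the show() output above.
columns = ["first_name", "city", "state", "text", "pk", "last_name", "address"]
options = {
    "hoodie.datasource.write.recordkey.field": "pk",
    "hoodie.datasource.write.precombine.field": "first_name",
}
print(missing_key_fields(options, columns))  # {}
```

An empty result means both fields are present. The check would have flagged `pk` had the `id` column not been renamed earlier.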





In [10]:
dynamic_frame = DynamicFrame.fromDF(dataFrame, glueContext, "from_kinesis_data_frame")
data_frame = dynamic_frame.toDF()




In [11]:
data_frame.show(1)

+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|first_name|                city|               state|                text|                  pk|last_name|             address|
+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
|     Jesus|Unit 7720 Box 773...|Clearly around si...|Clearly around si...|567adc65-79cd-48e...| Campbell|Unit 7720 Box 773...|
+----------+--------------------+--------------------+--------------------+--------------------+---------+--------------------+
only showing top 1 row


In [13]:
data_frame.write.format("hudi").options(**hudi_options).mode("overwrite").save("s3://glue-learn-begineers/hudi/")

Py4JJavaError: An error occurred while calling o118.save.
: java.lang.NoSuchMethodError: scala.Some.value()Ljava/lang/Object;
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:100)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apache.spark.rdd.RDD
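
A `NoSuchMethodError` on `scala.Some.value()` usually points to a Scala binary mismatch: `Some.value` exists in Scala 2.12 but not in 2.11, so a Hudi bundle built for Scala 2.12 can fail this way on a Scala 2.11 Spark runtime. That diagnosis is an inference from the stack trace, not something this notebook confirms. The sketch below compares a runtime Scala version string with the `_2.xx` suffix in a bundle's file name. The helper names and the sample jar name are illustrative assumptions.

```python
import re


def scala_binary_version(version_string):
    """Extract the binary version ('2.11') from e.g. 'version 2.11.12'."""
    match = re.search(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"no Scala version found in {version_string!r}")
    return f"{match.group(1)}.{match.group(2)}"


def jar_scala_suffix(jar_name):
    """Extract the Scala suffix from an artifact name such as
    'hudi-spark-bundle_2.12-0.10.1.jar'; return None if there is none."""
    match = re.search(r"_(\d+\.\d+)", jar_name)
    return match.group(1) if match else None


def is_compatible(runtime_version_string, jar_name):
    """True when the jar's Scala suffix matches the runtime's Scala version."""
    suffix = jar_scala_suffix(jar_name)
    return suffix is not None and suffix == scala_binary_version(runtime_version_string)


# Illustrative values only; neither comes from this notebook's session.
print(is_compatible("version 2.11.12", "hudi-spark-bundle_2.12-0.10.1.jar"))  # False
print(is_compatible("version 2.12.15", "hudi-spark-bundle_2.12-0.10.1.jar"))  # True
```

In the notebook, the runtime string can typically be read through py4j with `sc._jvm.scala.util.Properties.versionString()`, and `spark.version` reports the Spark version. If the two sides disagree, one option is to choose a Glue version whose Spark build matches the connector, using the `%glue_version` magic from the table above before the session starts.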

# Try Writing with the Custom Connector

In [16]:
commonConfig = {
    'path': "s3://glue-learn-begineers/hudi/"
}

hudiWriteConfig = {
    'className': 'org.apache.hudi',
    'hoodie.table.name': "hudi_table",
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.precombine.field': 'first_name',
    'hoodie.datasource.write.recordkey.field': 'pk',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.CustomKeyGenerator'
 
}

hudiGlueConfig = {
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.sync_as_datasource': 'false',
    'hoodie.datasource.hive_sync.database': "mydb1",
    'hoodie.datasource.hive_sync.table': "hudi_table",
    'hoodie.datasource.hive_sync.use_jdbc': 'false',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor'
    
}

combinedConf = {
    **commonConfig,
    **hudiWriteConfig,
    **hudiGlueConfig
}
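
`combinedConf` is built with dictionary unpacking, where a key repeated in a later dictionary silently replaces the earlier value. The three dictionaries above share no keys, but a small check makes that explicit. `merge_strict` is a hypothetical helper, shown here with shortened sample dictionaries rather than the full configuration.

```python
def merge_strict(*configs):
    """Merge dicts like {**a, **b, ...} but raise on a repeated key."""
    merged = {}
    for config in configs:
        overlap = merged.keys() & config.keys()
        if overlap:
            raise ValueError(f"duplicate option keys: {sorted(overlap)}")
        merged.update(config)
    return merged


# Shortened stand-ins for commonConfig, hudiWriteConfig and hudiGlueConfig.
common = {"path": "s3://glue-learn-begineers/hudi/"}
write = {"className": "org.apache.hudi", "hoodie.table.name": "hudi_table"}
sync = {"hoodie.datasource.hive_sync.table": "hudi_table"}

combined = merge_strict(common, write, sync)
print(len(combined))  # 4
```

Raising on a collision is a design choice for catching accidental overrides; plain unpacking remains fine when overriding is intended.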





In [19]:
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(data_frame, glueContext, "evolved_kinesis_data_frame"),
    connection_type="custom.spark",
    connection_options=combinedConf,
)

Py4JJavaError: An error occurred while calling o223.pyWriteDynamicFrame.
: java.lang.NoSuchMethodError: scala.Some.value()Ljava/lang/Object;
	at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:100)
	at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:164)
	at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:45)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
	at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:86)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
	at org.apach

# Stop the Session

In [16]:
%stop_session

Stopping session: 4d7f02f9-80a8-47e1-a175-30a8defe3b72
Stopped session.
