
# Glue Studio Notebook
You are now running a **Glue Studio** notebook; before you can start using your notebook you *must* start an interactive session.

## Available Magics
|          Magic              |   Type       |                                                                        Description                                                                        |
|-----------------------------|--------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------|
| %%configure                 |  Dictionary  |  A JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. |
| %profile                    |  String      |  Specify a profile in your AWS configuration to use as the credentials provider.                                                                          |
| %iam_role                   |  String      |  Specify an IAM role to execute your session with.                                                                                                        |
| %region                     |  String      |  Specify the AWS region in which to initialize a session.                                                                                                 |
| %session_id                 |  String      |  Returns the session ID for the running session.                                                                                                          |
| %connections                |  List        |  Specify a comma separated list of connections to use in the session.                                                                                     |
| %additional_python_modules  |  List        |  Comma separated list of pip packages, S3 paths or private pip arguments.                                                                                 |
| %extra_py_files             |  List        |  Comma separated list of additional Python files from S3.                                                                                                 |
| %extra_jars                 |  List        |  Comma separated list of additional Jars to include in the cluster.                                                                                       |
| %number_of_workers          |  Integer     |  The number of workers of a defined worker_type that are allocated when a job runs. worker_type must be set too.                                          |
| %worker_type                |  String      |  Standard, G.1X, *or* G.2X. number_of_workers must be set too. Default is G.1X                                                                            |
| %glue_version               |  String      |  The version of Glue to be used by this session. Currently, the only valid options are 2.0 and 3.0 (e.g. %glue_version 2.0)                               |
| %security_config            |  String      |  Define a security configuration to be used with this session.                                                                                            |
| %%sql                       |  String      |  Run SQL code. All lines after the initial %%sql magic will be passed as part of the SQL code.                                                            |
| %streaming                  |  String      |  Changes the session type to Glue Streaming.                                                                                                              |
| %etl                        |  String      |  Changes the session type to Glue ETL.                                                                                                                    |
| %status                     |              |  Returns the status of the current Glue session including its duration, configuration and executing user / role.                                          |
| %stop_session               |              |  Stops the current session.                                                                                                                               |
| %list_sessions              |              |  Lists all currently running sessions by name and ID.                                                                                                     |
| %spark_conf                 |  String      |  Specify custom spark configurations for your session. E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer                       |

In [None]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
  
sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.35 
Authenticating with environment variables and user-defined glue_role_arn: arn:aws:iam::043916019468:role/Lab3
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: d872c2db-addc-4f54-9bdd-b9c01176abb7
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session d872c2db-addc-4f54-9bdd-b9c01176abb7 to get into ready status...
Session d872c2db-addc-4f54-9bdd-b9c01176abb7 has been created




In [32]:
AWSGlueDataCatalog_node1668083342455 = glueContext.create_dynamic_frame.from_catalog(
    database="learndb",
    table_name="soumil_data",
    transformation_ctx="AWSGlueDataCatalog_node1668083342455",
)




In [2]:
AWSGlueDataCatalog_node1668083342455.toDF()

DataFrame[first_name: string, last_name: string, address: string, text: string, id: string, city: string, state: string]


In [3]:
AWSGlueDataCatalog_node1668083342455.toDF().show()

+----------+---------+--------------------+--------------------+--------------------+---------------+----------+
|first_name|last_name|             address|                text|                  id|           city|     state|
+----------+---------+--------------------+--------------------+--------------------+---------------+----------+
|      Cody|  Daniels|653 Jones Port Ap...|Price resource co...|5fe7b9c4-34f8-491...|Griffithborough|    Oregon|
|      Cody|  Daniels|653 Jones Port Ap...|Price resource co...|5fe7b9c4-34f8-491...|Griffithborough|    Oregon|
| Catherine|   Thomas|PSC 8441, Box 525...|Would end friend ...|978474fd-582a-4a5...|       Brayberg|      Ohio|
|     Laura|   Harris|176 Garcia Brook
...|Meeting edge nor ...|05f2eb79-de24-47a...|     Taylorside|  Maryland|
| Stephanie|     Roth|2530 John Locks A...|Song fear quality...|da6614d4-c2cf-47e...|    Lake Ashley|Washington|
|    Nathan|Mccormick|7480 Calvin Drive...|Drug this approac...|169569b1-75a3-449...|    Jeffrey

In [4]:
AWSGlueDataCatalog_node1668083342455.toDF().printSchema()

root
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- address: string (nullable = true)
 |-- text: string (nullable = true)
 |-- id: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)


# Removing Duplicates


In [33]:
import hashlib

from pyspark.sql.functions import concat, concat_ws, udf




In [45]:
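@udf("string")
def hasher(x):
    """Return the MD5 hex digest of repr(x), or "" if hashing fails."""
    try:
        # repr() wraps strings in quotes, so this differs from an MD5 of the raw value.
        return hashlib.md5(repr(x).encode("UTF-8")).hexdigest()
    except Exception:
        # Every failing row gets the same "" key and would be treated as a duplicate.
        return ""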
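
The `hasher` UDF hashes `repr(x)` rather than the raw string, which has two side effects worth knowing before relying on the key. The plain-Python sketch below reproduces the UDF body outside Spark to show them; `hash_like_udf` and `hash_raw` are hypothetical helper names, not part of the notebook. It relies on the fact that Spark's `concat` returns null when any input is null, and that a Python UDF receives that null as `None`.

```python
import hashlib


def hash_like_udf(value):
    """Plain-Python stand-in for the UDF body: MD5 of repr(value)."""
    return hashlib.md5(repr(value).encode("UTF-8")).hexdigest()


def hash_raw(value):
    """MD5 of the raw UTF-8 bytes of a string, without repr()."""
    return hashlib.md5(value.encode("UTF-8")).hexdigest()


name = "CodyDaniels"

# repr() adds surrounding quotes, so the UDF-style digest differs from
# an MD5 computed directly on the string.
print(repr(name))
print(hash_like_udf(name) != hash_raw(name))

# A null input arrives as None, and repr(None) is the string "None",
# so every row with a null name shares one digest.
print(hash_like_udf(None) == hash_raw("None"))
```

In practice this means the digests will not match an MD5 computed on the raw names by another tool, and rows with a missing name all collapse onto one key when duplicates are dropped.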





In [46]:
df = AWSGlueDataCatalog_node1668083342455.toDF()




In [47]:
df.show(2)

+----------+---------+--------------------+--------------------+--------------------+---------------+------+
|first_name|last_name|             address|                text|                  id|           city| state|
+----------+---------+--------------------+--------------------+--------------------+---------------+------+
|      Cody|  Daniels|653 Jones Port Ap...|Price resource co...|5fe7b9c4-34f8-491...|Griffithborough|Oregon|
|      Cody|  Daniels|653 Jones Port Ap...|Price resource co...|5fe7b9c4-34f8-491...|Griffithborough|Oregon|
+----------+---------+--------------------+--------------------+--------------------+---------------+------+
only showing top 2 rows


In [48]:
# withColumn already names the new column, so no extra alias is needed.
df = df.withColumn("dedup_hash", hasher(concat(df.first_name, df.last_name)))
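
The `dedup_hash` key is built from `first_name` and `last_name` joined without a separator, so two different name pairs can produce the same concatenated string and therefore the same hash. The plain-Python sketch below illustrates the collision and one possible way around it; the sample names, the `SEP` value, and the helper functions are all hypothetical. In PySpark, `concat_ws` (already imported in this notebook) joins columns with a separator.

```python
import hashlib

# Hypothetical separator choice: the ASCII unit separator, which is
# unlikely to appear inside a name.
SEP = "\x1f"


def key_plain(first, last):
    """Hash of the two fields joined with no separator."""
    return hashlib.md5((first + last).encode("UTF-8")).hexdigest()


def key_separated(first, last):
    """Hash of the two fields joined with SEP between them."""
    return hashlib.md5((first + SEP + last).encode("UTF-8")).hexdigest()


# Two different (first_name, last_name) pairs that concatenate to "AnneMarie".
a = ("Anne", "Marie")
b = ("Ann", "eMarie")

print(key_plain(*a) == key_plain(*b))          # collision without a separator
print(key_separated(*a) == key_separated(*b))  # distinct keys with a separator
```

Whether this matters depends on the data; for a small teaching data set the plain concatenation may be acceptable.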




In [50]:
df.show(2)

+----------+---------+--------------------+--------------------+--------------------+---------------+------+--------------------+
|first_name|last_name|             address|                text|                  id|           city| state|          dedup_hash|
+----------+---------+--------------------+--------------------+--------------------+---------------+------+--------------------+
|      Cody|  Daniels|653 Jones Port Ap...|Price resource co...|5fe7b9c4-34f8-491...|Griffithborough|Oregon|3cab1d104db40bb81...|
|      Cody|  Daniels|653 Jones Port Ap...|Price resource co...|5fe7b9c4-34f8-491...|Griffithborough|Oregon|3cab1d104db40bb81...|
+----------+---------+--------------------+--------------------+--------------------+---------------+------+--------------------+
only showing top 2 rows


In [51]:
df.count()

12


In [52]:
df_new = df.dropDuplicates(['dedup_hash'])




In [53]:
df_new.count()

6
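
The drop from 12 rows to 6 shows that half of the rows shared a key with another row. A plain-Python analogue of that step, using a small hypothetical sample rather than the real table, is sketched below: count the rows per key to see which ones repeat, then keep one row per key. This version keeps the first row it sees; Spark's `dropDuplicates` should not be relied on to keep any particular row.

```python
from collections import Counter

# Hypothetical sample rows; only the two fields used for the key are shown.
rows = [
    {"first_name": "Cody", "last_name": "Daniels"},
    {"first_name": "Cody", "last_name": "Daniels"},
    {"first_name": "Catherine", "last_name": "Thomas"},
]


def dedup_key(row):
    """Key used to decide whether two rows are duplicates."""
    return (row["first_name"], row["last_name"])


# Count the rows per key and keep only the keys that occur more than once.
counts = Counter(dedup_key(row) for row in rows)
duplicates = {key: n for key, n in counts.items() if n > 1}

# Keep one row per key (here: the first one encountered).
unique = {}
for row in rows:
    unique.setdefault(dedup_key(row), row)
unique_rows = list(unique.values())

print(duplicates)
print(len(rows), "->", len(unique_rows))
```

Inspecting the repeated keys before dropping them is a cheap way to confirm that the rows being removed are true duplicates and not just people who share a name.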


# Convert Spark DataFrame to Glue DynamicFrame and write to S3

In [54]:
from awsglue.dynamicframe import DynamicFrame




In [57]:
MyDynamicFrame = DynamicFrame.fromDF(df_new, glueContext, "test_nest")




In [61]:
MyDynamicFrame.show(1)

{"first_name": "Stephanie", "last_name": "Roth", "address": "2530 John Locks Apt. 941
Allisonchester, IL 65026", "text": "Song fear quality follow character. Star factor can lose child. Worker drop laugh.
Product Congress five. Guess left wish the increase especially example.", "id": "da6614d4-c2cf-47e4-bc28-80a06a6f94a6", "city": "Lake Ashley", "state": "Washington", "dedup_hash": "d62aae4bddd9fc014ac9b4feed8a804b"}
{"first_name": "Lynn", "last_name": "Garcia", "address": "PSC 3067, Box 3116
APO AP 33308", "text": "Measure gun trial. List note practice attack their heavy.
Economy start whether next get fight hot. Girl level century let.", "id": "0cf8f55f-0b22-498e-84db-46bb01d7025a", "city": "North Mark", "state": "Delaware", "dedup_hash": "c383845e6c1557c749eb01c8d2d0d524"}
{"first_name": "Catherine", "last_name": "Thomas", "address": "PSC 8441, Box 5251
APO AE 72289", "text": "Would end friend special international century sell yourself. Garden soldier letter pressure fast option. C

In [63]:
AmazonS3_node1668085615595 = glueContext.write_dynamic_frame.from_options(
    frame=MyDynamicFrame,  # pass the DynamicFrame instance, not the DynamicFrame class
    connection_type="s3",
    format="json",
    connection_options={
        "path": "s3://glue-learn-begineers/new/",
        "partitionKeys": [],
    },
    transformation_ctx="AmazonS3_node1668085615595",
)


