# Glue Dev Endpoint - SageMaker Notebook Development Lab

1. Lab overview
  In a SageMaker Notebook attached to a Glue Dev Endpoint, run a simple ETL job that converts the Titanic.csv sample data to Parquet.

2. Prerequisites
 - The sample data is the Titanic dataset.
   https://www.openml.org/d/40945
 - Upload the Titanic sample data to S3.
 - After the upload, run a Glue Crawler to create the Data Catalog table (the code below assumes Database: sample, Table: titanic_csv).
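
The upload and crawler steps above can also be scripted. The following is a minimal sketch using boto3, not the procedure the original lab used. The bucket `sample-titanic` is taken from the output path later in this lab, but the `titanic_csv/` prefix, the crawler name, the IAM role name, and the local file name are all assumptions made for illustration. A crawler usually names the table after the S3 folder it scans, which is why the prefix is `titanic_csv` here.

```python
def crawler_request(name, role, database, s3_path):
    """Build the keyword arguments for glue.create_crawler."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def upload_and_crawl(local_file, bucket, key, crawler_kwargs):
    """Upload the CSV file, then create and start the crawler."""
    import boto3  # imported here so crawler_request() works without AWS libraries

    boto3.client("s3").upload_file(local_file, bucket, key)
    glue = boto3.client("glue")
    glue.create_crawler(**crawler_kwargs)
    glue.start_crawler(Name=crawler_kwargs["Name"])


# Assumed names: adjust to your own account before running.
BUCKET = "sample-titanic"
PREFIX = "titanic_csv"
request = crawler_request(
    name="titanic-csv-crawler",        # hypothetical crawler name
    role="AWSGlueServiceRole-sample",  # hypothetical IAM role
    database="sample",
    s3_path=f"s3://{BUCKET}/{PREFIX}/",
)

# Uncomment to run against AWS (requires credentials and the role above):
# upload_and_crawl("titanic.csv", BUCKET, f"{PREFIX}/titanic.csv", request)
```

The request is built separately from the AWS calls so that it can be inspected before anything is created.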


3. To verify that the Glue Dev Endpoint and the SageMaker Notebook are set up correctly, print the row count and schema of the sample.titanic_csv table.

In [1]:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

glueContext = GlueContext(SparkContext.getOrCreate())
titanic_csv_DyF = glueContext.create_dynamic_frame.from_catalog(database="sample", table_name="titanic_csv")
print ("Count:  ", titanic_csv_DyF.count())
titanic_csv_DyF.printSchema()

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
4,application_1607786024852_0005,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


Count:   1309
root
|-- pclass: long
|-- survived: long
|-- name: string
|-- sex: string
|-- age: string
|-- sibsp: long
|-- parch: long
|-- ticket: string
|-- fare: choice
|    |-- double
|    |-- string
|-- cabin: string
|-- embarked: string
|-- boat: string
|-- body: string
|-- home.dest: string
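
In the schema above, `fare` is reported as a `choice` of `double` and `string`, which means Glue found values of both types in that column. As a minimal sketch of one way to settle on a single type, the helper below calls `DynamicFrame.resolveChoice` with a `cast:double` spec. The helper name `cast_fare_to_double` is hypothetical, and this is an alternative to the `make_struct` resolution used in step 5, not what the lab script itself does.

```python
def cast_fare_to_double(dyf):
    """Return a new DynamicFrame in which the ambiguous `fare` column is cast to double.

    `dyf` is expected to be an awsglue DynamicFrame such as titanic_csv_DyF.
    """
    return dyf.resolveChoice(specs=[("fare", "cast:double")])


# In the notebook session created above:
# fare_fixed = cast_fare_to_double(titanic_csv_DyF)
# fare_fixed.printSchema()
```

Other documented actions for a spec are `make_struct`, `make_cols`, and `project:<type>`. Which one fits depends on whether the string values in `fare` need to be preserved.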

4. Initialize a Job object for the existing Glue ETL Job "titanic-csv-parquet" (create the job first if it does not exist).

In [2]:
job = Job(glueContext)
job.init("titanic-csv-parquet")

5. Define the script that the Glue ETL Job "titanic-csv-parquet" runs.

In [3]:
# Read the source table from the Data Catalog.
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "sample", table_name = "titanic_csv", transformation_ctx = "datasource0")

# Map each source column and type to its output column and type.
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("pclass", "long", "pclass", "long"), ("survived", "long", "survived", "long"), ("name", "string", "name", "string"), ("sex", "string", "sex", "string"), ("age", "string", "age", "string"), ("sibsp", "long", "sibsp", "long"), ("parch", "long", "parch", "long"), ("ticket", "string", "ticket", "string"), ("fare", "double", "fare", "double"), ("cabin", "string", "cabin", "string"), ("embarked", "string", "embarked", "string"), ("boat", "string", "boat", "string"), ("body", "string", "body", "string"), ("`home.dest`", "string", "`home.dest`", "string")], transformation_ctx = "applymapping1")

# Resolve any remaining choice (ambiguous-type) columns into a struct with one field per type.
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")

# Drop fields of NullType (fields whose values are all null).
dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")

# Write the result to S3 in Parquet format.
datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "s3://sample-titanic/titanic-parquet"}, format = "parquet", transformation_ctx = "datasink4")

null_fields []
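
To confirm that the write succeeded, the Parquet files can be read back with the `spark` session that the notebook reports as available. The sketch below is an assumption about how one might check the result, not part of the original lab. The helper name `check_parquet_output` is hypothetical. Since `DropNullFields` removes fields rather than rows, the row count should match the 1309 rows counted in step 3, provided the output prefix held no files from an earlier run.

```python
OUTPUT_PATH = "s3://sample-titanic/titanic-parquet"  # path written by datasink4


def check_parquet_output(spark, path=OUTPUT_PATH, expected_rows=None):
    """Read the Parquet output back and optionally verify its row count.

    `spark` is expected to be a SparkSession. Returns the Spark DataFrame.
    """
    df = spark.read.parquet(path)
    if expected_rows is not None:
        actual = df.count()
        if actual != expected_rows:
            raise ValueError(f"expected {expected_rows} rows, found {actual}")
    return df


# In the notebook session:
# parquet_df = check_parquet_output(spark, expected_rows=1309)
# parquet_df.printSchema()
```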

6. Commit the Glue ETL Job "titanic-csv-parquet" to finish it.

In [4]:
job.commit()

Reference: check the data in the sample.titanic_csv table.

In [5]:
titanic_csv_DyF.toDF().select(['survived', 'name']).show()

+--------+--------------------+
|survived|                name|
+--------+--------------------+
|       1|Allen Miss. Elisa...|
|       1|Allison Master. H...|
|       0|Allison Miss. Hel...|
|       0|Allison Mr. Hudso...|
|       0|Allison Mrs. Huds...|
|       1|  Anderson Mr. Harry|
|       1|Andrews Miss. Kor...|
|       0|Andrews Mr. Thoma...|
|       1|Appleton Mrs. Edw...|
|       0|Artagaveytia Mr. ...|
|       0|Astor Col. John J...|
|       1|Astor Mrs. John J...|
|       1|Aubart Mme. Leont...|
|       1|Barber Miss. Elle...|
|       1|Barkworth Mr. Alg...|
|       0|  Baumann Mr. John D|
|       0|Baxter Mr. Quigg ...|
|       1|Baxter Mrs. James...|
|       1|Bazzani Miss. Albina|
|       0| Beattie Mr. Thomson|
+--------+--------------------+
only showing top 20 rows