# AWS Glue Studio Notebook
##### You are now running an AWS Glue Studio notebook. To start using your notebook, you need to start an AWS Glue Interactive Session.


#### Optional: Run this cell to see available notebook commands ("magics").


In [2]:
%help

Welcome to the Glue Interactive Sessions Kernel
For more information on available magic commands, please type %help in any new cell.

Please view our Getting Started page to access the most up-to-date information on the Interactive Sessions kernel: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
Installed kernel version: 0.37.0 



# Available Magic Commands

## Sessions Magic

----
    %help                             Return a list of descriptions and input types for all magic commands. 
    %profile            String        Specify a profile in your aws configuration to use as the credentials provider.
    %region             String        Specify the AWS region in which to initialize a session. 
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %idle_timeout       Int           The number of minutes of inactivity after which a session will timeout. 
                                      Default: 2880 minutes (48 hours).
    %session_id_prefix  String        Define a String that will precede all session IDs in the format 
                                      [session_id_prefix]-[session_id]. If a session ID is not provided,
                                      a random UUID will be generated.
    %status                           Returns the status of the current Glue session including its duration, 
                                      configuration and executing user / role.
    %session_id                       Returns the session ID for the running session. 
    %list_sessions                    Lists all currently running sessions by ID.
    %stop_session                     Stops the current session.
    %glue_version       String        The version of Glue to be used by this session. 
                                      Currently, the only valid options are 2.0 and 3.0. 
                                      Default: 2.0.
----

## Selecting Job Types

----
    %streaming          String        Sets the session type to Glue Streaming.
    %etl                String        Sets the session type to Glue ETL.
    %glue_ray           String        Sets the session type to Glue Ray.
----

## Glue Config Magic 
*(common across all job types)*

----

    %%configure         Dictionary    A json-formatted dictionary consisting of all configuration parameters for 
                                      a session. Each parameter can be specified here or through individual magics.
    %iam_role           String        Specify an IAM role ARN to execute your session with.
                                      Default from ~/.aws/config on Linux or macOS, 
                                      or C:\Users\%USERNAME%\.aws\config on Windows.
    %number_of_workers  int           The number of workers of a defined worker_type that are allocated 
                                      when a session runs.
                                      Default: 5.
    %additional_python_modules  List  Comma separated list of additional Python modules to include in your cluster 
                                      (can be from Pypi or S3).
----

                                      
## Magic for Spark Jobs (ETL & Streaming)

----
    %worker_type        String        Set the type of instances the session will use as workers. 
                                      ETL and Streaming support G.1X and G.2X. 
                                      Default: G.1X.
    %connections        List          Specify a comma separated list of connections to use in the session.
    %extra_py_files     List          Comma separated list of additional Python files From S3.
    %extra_jars         List          Comma separated list of additional Jars to include in the cluster.
    %spark_conf         String        Specify custom spark configurations for your session. 
                                      E.g. %spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer
----
                                      
## Magic for Ray Job

----
    %min_workers        Int           The minimum number of workers that are allocated to a Ray job. 
                                      Default: 1.
    %object_memory_head Int           The percentage of free memory on the instance head node after a warm start. 
                                      Minimum: 0. Maximum: 100.
    %object_memory_worker Int         The percentage of free memory on the instance worker nodes after a warm start. 
                                      Minimum: 0. Maximum: 100.
----

## Action Magic

----

    %%sql               String        Run SQL code. All lines after the initial %%sql magic will be passed
                                      as part of the SQL code.  
----



####  Run this cell to set up and start your interactive session.


In [2]:
%idle_timeout 2800
%glue_version 3.0
%worker_type G.1X
%number_of_workers 5

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark import SparkConf
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import col, to_date, to_timestamp
from pyspark.sql import SparkSession
from pyspark.ml.feature import OneHotEncoder, StringIndexer
  
config=SparkConf().set('spark.rpc.message.maxSize','256')
sc = SparkContext.getOrCreate(conf = config)
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

You are already connected to a glueetl session 4a89e243-8e86-4164-a131-05b62ada7cb5.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Current idle_timeout is 2800 minutes.
idle_timeout has been set to 2800 minutes.


You are already connected to a glueetl session 4a89e243-8e86-4164-a131-05b62ada7cb5.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Setting Glue version to: 3.0


You are already connected to a glueetl session 4a89e243-8e86-4164-a131-05b62ada7cb5.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Previous worker type: G.1X
Setting new worker type to: G.1X


You are already connected to a glueetl session 4a89e243-8e86-4164-a131-05b62ada7cb5.

No change will be made to the current session that is set as glueetl. The session configuration change will apply to newly created sessions.


Previous number of workers: 5
Setting new number of workers to: 5



#### Example: Create a DynamicFrame from a table in the AWS Glue Data Catalog and display its schema


## Aux function

In [6]:
from pyspark.sql.types import *

# Auxiliary functions
# Map a pandas dtype to a Spark SQL type; anything unrecognized becomes a string.
def equivalent_type(f):
    if f == 'datetime64[ns]': return TimestampType()
    elif f == 'int64': return LongType()
    elif f == 'int32' or f == 'uint8': return IntegerType()
    elif f == 'float64': return DoubleType()
    elif f == 'float32': return FloatType()
    else: return StringType()

# Build the StructField for one column, falling back to StringType on any error.
def define_structure(string, format_type):
    try: typo = equivalent_type(format_type)
    except Exception: typo = StringType()
    return StructField(string, typo)

# Given a pandas dataframe, return a Spark dataframe with an explicit schema.
def pandas_to_spark(pandas_df):
    columns = list(pandas_df.columns)
    types = list(pandas_df.dtypes)
    struct_list = []
    for column, typo in zip(columns, types):
        struct_list.append(define_structure(column, typo))
    p_schema = StructType(struct_list)
    # Use the session created in the setup cell instead of a separate SQLContext
    return spark.createDataFrame(pandas_df, p_schema)
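
To make the dtype mapping above concrete, here is a minimal standalone sketch that expresses the same rules as Spark DDL type names instead of `StructField` objects. The names `DTYPE_TO_DDL`, `ddl_type`, and `ddl_schema` are hypothetical and exist only for this illustration. Column names are backquoted because the one-hot columns created later in this notebook are purely numeric (for example `60302`).

```python
# Hypothetical helper names; the rules mirror equivalent_type above.
DTYPE_TO_DDL = {
    "datetime64[ns]": "timestamp",
    "int64": "bigint",
    "int32": "int",
    "uint8": "int",
    "float64": "double",
    "float32": "float",
}

def ddl_type(dtype):
    """Return the Spark DDL type name for a pandas dtype (or its name)."""
    # str() works for both numpy dtype objects and plain strings.
    return DTYPE_TO_DDL.get(str(dtype), "string")

def ddl_schema(dtypes):
    """Build a DDL-style schema string from a {column: dtype} mapping."""
    return ", ".join(f"`{name}` {ddl_type(dtype)}" for name, dtype in dtypes.items())

# Hypothetical columns; "object" is not in the table, so it falls back to string.
example = {
    "start_time": "datetime64[ns]",
    "trips": "int32",
    "temp": "float64",
    "60302": "uint8",
    "note": "object",
}
print(ddl_schema(example))
```

Any dtype missing from the table falls back to `string`, the same default the helper functions use.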





## Trips

In [7]:
# Read in trips static as dynamic frame
trips_static = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://refit-iot/data/divvy/trips_static/"],
        "recurse": True,
        "header": "true"
    },
    format="csv"
)

# Read in trips streamed as dynamic frame
trips_streamed = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://refit-iot/data/divvy/trips_streamed/"],
        "recurse": True,
        "header": "true"
    },
    format="csv"
)
# Convert to spark df
trips_df_static = trips_static.toDF()
trips_df_streamed = trips_streamed.toDF()




In [8]:
# Static
# The header line arrives as a data row: use it for the column names,
# then drop every row equal to it
header = trips_df_static.rdd.first()
trips_final_static = spark.createDataFrame(trips_df_static.rdd.filter(lambda x: x != header), header)
# Drop the column whose header is empty, if any
trips_final_static = trips_final_static.drop("")

# Streamed
header = trips_df_streamed.rdd.first()
trips_final_streamed = spark.createDataFrame(trips_df_streamed.rdd.filter(lambda x: x != header), header)
trips_final_streamed = trips_final_streamed.drop("")

# Display the PySpark DataFrames
trips_final_static.show(5)
trips_final_streamed.show(5)

+-------------------+-----+-----+
|         start_time|trips|  zip|
+-------------------+-----+-----+
|2013-06-27 01:00:00|    1|60661|
|2013-06-27 11:00:00|    1|60622|
|2013-06-27 11:00:00|    3|60607|
|2013-06-27 12:00:00|    1|60614|
|2013-06-27 12:00:00|    2|60611|
+-------------------+-----+-----+
only showing top 5 rows

+-------------------+-----+-----+
|         start_time|trips|  zip|
+-------------------+-----+-----+
|2019-06-14 08:00:00|   61|60661|
|2019-06-14 08:00:00|    1|60202|
|2019-06-14 08:00:00|   15|60603|
|2019-06-14 08:00:00|   75|60657|
|2019-06-14 08:00:00|    4|60641|
+-------------------+-----+-----+
only showing top 5 rows
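
The header handling above takes the first row as the column names and then removes every row equal to it, which also removes header lines repeated by additional CSV files in the same read. Below is a minimal plain-Python sketch of that idea; the function name `promote_header` and the sample rows are hypothetical.

```python
def promote_header(rows):
    """Use the first row as column names and drop every copy of it."""
    header = rows[0]
    # Filtering by equality (rather than slicing off row 0) also removes
    # header lines repeated by other files in the same read.
    data = [row for row in rows if row != header]
    return list(header), data

# Hypothetical rows, as if two CSV files with their own header lines were read together.
rows = [
    ("start_time", "trips", "zip"),
    ("2013-06-27 01:00:00", "1", "60661"),
    ("start_time", "trips", "zip"),
    ("2019-06-14 08:00:00", "61", "60661"),
]
columns, data = promote_header(rows)
print(columns)
print(len(data))
```

This relies on a header line being the first row Spark returns. AWS Glue's CSV reader also documents a `withHeader` format option, which may make this step unnecessary.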


In [29]:
# Static
# Time: str to timestamp (Spark pattern letters are case-sensitive)
trips_final_static = trips_final_static.withColumn("start_time", to_timestamp(col("start_time"), "yyyy-MM-dd HH:mm:ss"))

# trips: str to int
trips_final_static = trips_final_static.withColumn("trips", col("trips").cast("int"))

# Streamed
# Time: str to timestamp
trips_final_streamed = trips_final_streamed.withColumn("start_time", to_timestamp(col("start_time"), "yyyy-MM-dd HH:mm:ss"))

# trips: str to int
trips_final_streamed = trips_final_streamed.withColumn("trips", col("trips").cast("int"))
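
Spark's datetime patterns are case-sensitive: `yyyy` is the year, `MM` the month, `dd` the day of month, `HH` the hour, `mm` the minute, and `ss` the second. The values shown earlier (for example `2013-06-27 01:00:00`) carry no fractional seconds. As a quick local check of that layout, the sketch below parses sample values with Python's `datetime.strptime`, which uses a different pattern language (`%Y-%m-%d %H:%M:%S`). The names `PY_LAYOUT` and `parse_or_none` are hypothetical.

```python
from datetime import datetime

# Python strptime layout; the matching Spark pattern is "yyyy-MM-dd HH:mm:ss".
PY_LAYOUT = "%Y-%m-%d %H:%M:%S"

def parse_or_none(value):
    """Parse a timestamp string, returning None when it does not match the layout."""
    try:
        return datetime.strptime(value, PY_LAYOUT)
    except ValueError:
        return None

# The first two samples come from the output above; the third is a deliberate mismatch.
samples = ["2013-06-27 01:00:00", "2019-06-14 08:00:00", "not a timestamp"]
parsed = [parse_or_none(s) for s in samples]
print(parsed[0])
print(parsed[2])
```

Returning `None` on a mismatch resembles how a failed timestamp conversion can surface as a null in a column, so counting nulls after the conversion is a cheap sanity check.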




In [30]:
trips_final_static.printSchema()
trips_final_streamed.printSchema()

root
 |-- start_time: timestamp (nullable = true)
 |-- trips: integer (nullable = true)
 |-- zip: string (nullable = true)

root
 |-- start_time: timestamp (nullable = true)
 |-- trips: integer (nullable = true)
 |-- zip: string (nullable = true)


## Landmark

In [11]:
# Static and Streamed all in 1 DF
landmark = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://refit-iot/data/divvy/landmark_csv/dataload=20230423/"],
        "recurse": True,
        "header": "true"
    },
    format="csv"
)

# Convert to spark df
landmark_df = landmark.toDF()




In [12]:
# Make first row of data header
header = landmark_df.rdd.first()
landmark_final = spark.createDataFrame(landmark_df.rdd.filter(lambda x: x != header), header)
landmark_final = landmark_final.drop("")




In [13]:
landmark_final.show(5)

+--------+---------+
|zip_code|landmarks|
+--------+---------+
|   60302|        1|
|   60409|        1|
|   60601|       15|
|   60602|        9|
|   60603|       12|
+--------+---------+
only showing top 5 rows


In [14]:
# Landmark: str to int
landmark_final = landmark_final.withColumn("landmarks", col("landmarks").cast("int"))




In [15]:
landmark_final.show(5)
landmark_final.printSchema()

+--------+---------+
|zip_code|landmarks|
+--------+---------+
|   60302|        1|
|   60409|        1|
|   60601|       15|
|   60602|        9|
|   60603|       12|
+--------+---------+
only showing top 5 rows

root
 |-- zip_code: string (nullable = true)
 |-- landmarks: integer (nullable = true)


## Weather

In [16]:
# Read in from S3 as dynamic frame
# Static
weather_static = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://refit-iot/data/divvy/weather_static/"],
        "recurse": True,
        "header": "true"
    },
    format="csv"
)

weather_streamed = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": ["s3://refit-iot/data/divvy/weather_streamed/"],
        "recurse": True,
        "header": "true"
    },
    format="csv"
)
# Convert to spark df
weather_df_static = weather_static.toDF()
weather_df_streamed = weather_streamed.toDF()




In [17]:
# Make first row of data header
# Static
header = weather_df_static.rdd.first()
weather_final_static = spark.createDataFrame(weather_df_static.rdd.filter(lambda x: x != header), header)
weather_final_static = weather_final_static.drop("")

#Streamed
header = weather_df_streamed.rdd.first()
weather_final_streamed = spark.createDataFrame(weather_df_streamed.rdd.filter(lambda x: x != header), header)
weather_final_streamed = weather_final_streamed.drop("")

# Display the PySpark DataFrame
weather_final_static.show(5)
weather_final_streamed.show(5)

+-------------------+----+------------+--------+-------------+------+----+----+----------+---------+
|               time|temp|rel_humidity|dewpoint|apparent_temp|precip|rain|snow|cloudcover|windspeed|
+-------------------+----+------------+--------+-------------+------+----+----+----------+---------+
|2013-01-01 00:00:00|-4.2|          66|    -9.7|        -10.2|   0.0| 0.0| 0.0|        79|     15.8|
|2013-01-01 01:00:00|-4.3|          67|    -9.5|        -10.5|   0.0| 0.0| 0.0|        72|     16.1|
|2013-01-01 02:00:00|-4.4|          67|    -9.7|        -10.3|   0.0| 0.0| 0.0|        82|     14.6|
|2013-01-01 03:00:00|-4.6|          67|    -9.8|        -10.5|   0.0| 0.0| 0.0|        80|     14.4|
|2013-01-01 04:00:00|-4.8|          68|    -9.9|        -11.9|   0.0| 0.0| 0.0|        37|     16.3|
+-------------------+----+------------+--------+-------------+------+----+----+----------+---------+
only showing top 5 rows

+-------------------+----+------------+--------+-------------+----

In [31]:
# Static
# Time: str to timestamp (Spark pattern letters are case-sensitive)
weather_final_static = weather_final_static.withColumn("time", to_timestamp(col("time"), "yyyy-MM-dd HH:mm:ss"))

# The rest: str to double
cols_to_cast = weather_final_static.columns[1:]
for col_name in cols_to_cast:
    weather_final_static = weather_final_static.withColumn(col_name, col(col_name).cast("double"))

# Streamed
# Time: str to timestamp
weather_final_streamed = weather_final_streamed.withColumn("time", to_timestamp(col("time"), "yyyy-MM-dd HH:mm:ss"))

# The rest: str to double
cols_to_cast = weather_final_streamed.columns[1:]
for col_name in cols_to_cast:
    weather_final_streamed = weather_final_streamed.withColumn(col_name, col(col_name).cast("double"))





In [32]:
weather_final_static.show(5)
weather_final_static.printSchema()

weather_final_streamed.show(5)
weather_final_streamed.printSchema()

+-------------------+----+------------+--------+-------------+------+----+----+----------+---------+
|               time|temp|rel_humidity|dewpoint|apparent_temp|precip|rain|snow|cloudcover|windspeed|
+-------------------+----+------------+--------+-------------+------+----+----+----------+---------+
|2013-01-01 00:00:00|-4.2|        66.0|    -9.7|        -10.2|   0.0| 0.0| 0.0|      79.0|     15.8|
|2013-01-01 01:00:00|-4.3|        67.0|    -9.5|        -10.5|   0.0| 0.0| 0.0|      72.0|     16.1|
|2013-01-01 02:00:00|-4.4|        67.0|    -9.7|        -10.3|   0.0| 0.0| 0.0|      82.0|     14.6|
|2013-01-01 03:00:00|-4.6|        67.0|    -9.8|        -10.5|   0.0| 0.0| 0.0|      80.0|     14.4|
|2013-01-01 04:00:00|-4.8|        68.0|    -9.9|        -11.9|   0.0| 0.0| 0.0|      37.0|     16.3|
+-------------------+----+------------+--------+-------------+------+----+----+----------+---------+
only showing top 5 rows

root
 |-- time: timestamp (nullable = true)
 |-- temp: double (nul

## Join

In [33]:
# Weather and trips
wt_static = trips_final_static.join(weather_final_static, trips_final_static.start_time == weather_final_static.time, "left")
wt_streamed = trips_final_streamed.join(weather_final_streamed, trips_final_streamed.start_time == weather_final_streamed.time, "left")

# Weather, trips, and landmark
wtl_static = wt_static.join(landmark_final, wt_static.zip == landmark_final.zip_code, "left")
wtl_streamed = wt_streamed.join(landmark_final, wt_streamed.zip == landmark_final.zip_code, "left")

# Drop the duplicated join keys (start_time and zip_code are kept)
wtl_static_final = wtl_static.drop("zip", "time").orderBy("start_time")
wtl_streamed_final = wtl_streamed.drop("zip", "time").orderBy("start_time")

# Check
wtl_static_final.show(5)
wtl_static_final.printSchema()
wtl_streamed_final.show(5)
wtl_streamed_final.printSchema()

+-------------------+-----+----+------------+--------+-------------+------+----+----+----------+---------+--------+---------+
|         start_time|trips|temp|rel_humidity|dewpoint|apparent_temp|precip|rain|snow|cloudcover|windspeed|zip_code|landmarks|
+-------------------+-----+----+------------+--------+-------------+------+----+----+----------+---------+--------+---------+
|2013-06-27 01:00:00|    1|22.5|        87.0|    20.2|         24.6|   0.0| 0.0| 0.0|      34.0|      6.8|   60661|        2|
|2013-06-27 11:00:00|    1|26.5|        72.0|    21.0|         31.9|   0.1| 0.1| 0.0|      36.0|     10.5|   60622|        8|
|2013-06-27 11:00:00|    3|26.5|        72.0|    21.0|         31.9|   0.1| 0.1| 0.0|      36.0|     10.5|   60607|       10|
|2013-06-27 12:00:00|    2|27.2|        70.0|    21.2|         33.2|   0.0| 0.0| 0.0|      31.0|     12.9|   60611|       20|
|2013-06-27 12:00:00|    1|27.2|        70.0|    21.2|         33.2|   0.0| 0.0| 0.0|      31.0|     12.9|   60614|   
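
Both joins above are left joins, so every trips row is kept even when no weather or landmark row matches, and the unmatched columns come back empty. Below is a minimal plain-Python sketch of that behavior; the `left_join` helper and the unmatched zip code are hypothetical.

```python
def left_join(left_rows, right_rows, left_key, right_key):
    """Left-join two lists of dicts; unmatched right-hand columns become None."""
    right_columns = [c for c in right_rows[0] if c != right_key] if right_rows else []
    # Index right rows by key; assumes each key appears at most once on the right.
    index = {row[right_key]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(row[left_key])
        extra = {c: (match[c] if match else None) for c in right_columns}
        joined.append({**row, **extra})
    return joined

trips = [
    {"start_time": "2013-06-27 01:00:00", "trips": 1, "zip": "60661"},
    # Hypothetical zip code with no landmark row
    {"start_time": "2013-06-27 11:00:00", "trips": 3, "zip": "99999"},
]
landmarks = [{"zip_code": "60661", "landmarks": 2}]

joined = left_join(trips, landmarks, "zip", "zip_code")
for row in joined:
    print(row["zip"], row["landmarks"])
```

The second row keeps its trip count but has no landmark value. The `fillna(0)` call later in this notebook replaces such gaps with 0.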

In [34]:
import pandas as pd

# Turn spark into pd
wtl_pd_static = wtl_static_final.toPandas()
wtl_pd_streamed = wtl_streamed_final.toPandas()

# One-hot encode zipcode
static_ohe = pd.get_dummies(wtl_pd_static["zip_code"])
streamed_ohe = pd.get_dummies(wtl_pd_streamed["zip_code"])

# Combine
wtl_pd_static = pd.concat([wtl_pd_static, static_ohe], axis = 1)
wtl_pd_streamed = pd.concat([wtl_pd_streamed, streamed_ohe], axis = 1)

# Drop original
wtl_pd_static.drop(labels = 'zip_code', axis = 1, inplace = True)
wtl_pd_streamed.drop(labels = 'zip_code', axis = 1, inplace = True)
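
Because the static and streamed frames are encoded separately, each gets one column per zip code it happens to contain, so the two column sets can differ. Below is a minimal plain-Python sketch of one-hot encoding against a shared category list, which is one way to keep the two sets aligned. The function name and sample zip codes are hypothetical, and this is not what the cell above does.

```python
def one_hot(values, categories):
    """Return one dict per value with a 0/1 flag for every category."""
    return [{cat: int(value == cat) for cat in categories} for value in values]

# Hypothetical zip codes for the two datasets.
static_zips = ["60661", "60622", "60661"]
streamed_zips = ["60661", "60202"]

# Shared, sorted category list built from both datasets.
categories = sorted(set(static_zips) | set(streamed_zips))

static_flags = one_hot(static_zips, categories)
streamed_flags = one_hot(streamed_zips, categories)
print(categories)
print(streamed_flags[1])
```

In pandas, a similar effect should be achievable by reindexing each `get_dummies` result to a common column list with a fill value of 0.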




In [35]:
wtl_pd_streamed.columns

Index(['start_time', 'trips', 'temp', 'rel_humidity', 'dewpoint',
       'apparent_temp', 'precip', 'rain', 'snow', 'cloudcover', 'windspeed',
       'landmarks', '60302', '60601', '60602', '60603', '60604', '60605',
       '60606', '60607', '60608', '60609', '60610', '60611', '60612', '60613',
       '60614', '60615', '60616', '60617', '60618', '60619', '60620', '60621',
       '60622', '60623', '60624', '60625', '60626', '60628', '60629', '60630',
       '60632', '60636', '60637', '60640', '60641', '60642', '60643', '60644',
       '60645', '60647', '60649', '60651', '60653', '60654', '60657', '60659',
       '60660', '60661'],
      dtype='object')


In [36]:
# Back to spark and fill NaN
wtl_static_final = pandas_to_spark(wtl_pd_static).fillna(0)
wtl_streamed_final = pandas_to_spark(wtl_pd_streamed).fillna(0)




## Convert spark df to glue dynamic frame

In [37]:
from awsglue.dynamicframe import DynamicFrame

#Convert from spark df to dynamic frame
wtl_static_dyf = DynamicFrame.fromDF(wtl_static_final, glueContext, 'convert')
wtl_streamed_dyf = DynamicFrame.fromDF(wtl_streamed_final, glueContext, 'convert')




In [None]:
# Coalesce output into 1 file
# coalesced_wtl_static = wtl_static_dyf.coalesce(1)
# coalesced_wtl_streamed = wtl_streamed_dyf.coalesce(1)

# Write to S3
# Static
# glueContext.write_dynamic_frame.from_options(
#     frame = wtl_static_dyf,
#     connection_type = 's3',
#     connection_options = {'path': 's3://refit-iot/final_data_landing/static/'},
#     format = 'csv',
#     format_options = {
#         'separator': ','
#     },
#     transformation_ctx = 'datasink2'
# )

# # Streamed
# glueContext.write_dynamic_frame.from_options(
#     frame = wtl_streamed_dyf,
#     connection_type = 's3',
#     connection_options = {'path': 's3://refit-iot/final_data_landing/streamed/'},
#     format = 'csv',
#     format_options = {
#         'separator': ','
#     },
#     transformation_ctx = 'datasink2'
# )

<awsglue.dynamicframe.DynamicFrame object at 0x7f3299f14e50>


In [41]:
import boto3

# Housekeeping
database_name = "divvy"
table_name = "streamed"
glue_client = boto3.client('glue')

# Define schema
schema = wtl_streamed_dyf.schema()
columns = [
    {
        "Name": field.name,
        "Type": field.dataType.typeName()
    }
    for field in schema.fields
]

# Create table configurations
create_table_options_streamed = {
    "DatabaseName": database_name,
    "TableInput": {
        "Name": table_name,
        "Description": "Streamed data for divvy bikes",
        
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://refit-iot/final_data_landing/streamed/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": False,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {
                    "dateTimeFormat": "yyyy-MM-dd HH:mm:ss",
                    "field.delim": ",",
                    "skip.header.line.count" : "1"
                }
            }
        },
        "PartitionKeys": []
    }
}

# Check if the streamed table exists
# If it does not exist, create it

try:
    response = glue_client.get_table(
        DatabaseName=database_name,
        Name=table_name
    )
    print(f"{table_name} already exists. Directly writing...")
except glue_client.exceptions.EntityNotFoundException:
    # Only a missing table triggers creation; other errors are raised
    print(f"{table_name} does not exist. Creating...")
    response_streamed = glue_client.create_table(**create_table_options_streamed)

glueContext.write_dynamic_frame.from_catalog(
    frame = wtl_streamed_dyf,
    database = database_name,
    table_name = table_name
)

print(f"Successfully wrote to {table_name}")


streamed does not exist. Creating...
Successfully wrote to streamed


In [38]:
wtl_streamed_dyf.printSchema()

root
|-- start_time: timestamp
|-- trips: int
|-- temp: double
|-- rel_humidity: double
|-- dewpoint: double
|-- apparent_temp: double
|-- precip: double
|-- rain: double
|-- snow: double
|-- cloudcover: double
|-- windspeed: double
|-- landmarks: double
|-- 60302: int
|-- 60601: int
|-- 60602: int
|-- 60603: int
|-- 60604: int
|-- 60605: int
|-- 60606: int
|-- 60607: int
|-- 60608: int
|-- 60609: int
|-- 60610: int
|-- 60611: int
|-- 60612: int
|-- 60613: int
|-- 60614: int
|-- 60615: int
|-- 60616: int
|-- 60617: int
|-- 60618: int
|-- 60619: int
|-- 60620: int
|-- 60621: int
|-- 60622: int
|-- 60623: int
|-- 60624: int
|-- 60625: int
|-- 60626: int
|-- 60628: int
|-- 60629: int
|-- 60630: int
|-- 60632: int
|-- 60636: int
|-- 60637: int
|-- 60640: int
|-- 60641: int
|-- 60642: int
|-- 60643: int
|-- 60644: int
|-- 60645: int
|-- 60647: int
|-- 60649: int
|-- 60651: int
|-- 60653: int
|-- 60654: int
|-- 60657: int
|-- 60659: int
|-- 60660: int
|-- 60661: int


In [None]:
wtl_streamed_dyf.toDF().toPandas().dtypes

Execution Interrupted. Attempting to cancel the statement (statement_id=40)
Statement 40 has been cancelled


In [42]:
database_name = "divvy"
table_name = "streamed"
test = glueContext.create_dynamic_frame.from_catalog(
    database = database_name,
    table_name = table_name,
    additional_options={"skip.header.line.count": "1"}
)
    
# test.printSchema()

filtered_df = Filter.apply(frame = test, f = lambda x: x["trips"] != "trips")
filtered_df.printSchema()

root
|-- 60626: int
|-- 60632: int
|-- 60653: int
|-- 60647: int
|-- rain: double
|-- 60620: int
|-- 60608: int
|-- 60629: int
|-- 60614: int
|-- 60641: int
|-- 60623: int
|-- 60617: int
|-- 60602: int
|-- 60611: int
|-- 60605: int
|-- 60661: int
|-- apparent_temp: double
|-- 60302: int
|-- 60637: int
|-- 60643: int
|-- 60610: int
|-- 60619: int
|-- 60625: int
|-- 60628: int
|-- 60649: int
|-- 60607: int
|-- 60640: int
|-- 60613: int
|-- start_time: string
|-- cloudcover: double
|-- 60622: int
|-- 60601: int
|-- 60616: int
|-- temp: double
|-- rel_humidity: double
|-- dewpoint: double
|-- trips: int
|-- 60604: int
|-- 60660: int
|-- 60654: int
|-- windspeed: double
|-- 60621: int
|-- 60642: int
|-- 60636: int
|-- 60657: int
|-- 60624: int
|-- 60630: int
|-- 60603: int
|-- snow: double
|-- 60645: int
|-- 60651: int
|-- 60618: int
|-- 60612: int
|-- 60606: int
|-- precip: double
|-- 60609: int
|-- landmarks: double
|-- 60615: int
|-- 60659: int
|-- 60644: int


In [None]:
# Check if static table exists
# If the static table does not exist, create
database_name = "divvy"
table_name = "static"
glue_client = boto3.client('glue')

schema = wtl_static_dyf.schema()
columns = [
    {
        "Name": field.name,
        "Type": field.dataType.typeName()
    }
    for field in schema.fields
]

# Create table configurations
create_table_options_static = {
    "DatabaseName": database_name,
    "TableInput": {
        "Name": table_name,
        "Description": "Static data for divvy bikes",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://refit-iot/final_data_landing/static/",
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "Compressed": False,
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {
                    "field.delim": ","
                }
            }
        },
        "PartitionKeys": []
    }
}


try:
    response = glue_client.get_table(
        DatabaseName=database_name,
        Name=table_name
    )
    print(f"{table_name} already exists. Directly writing...")
except glue_client.exceptions.EntityNotFoundException:
    # Only a missing table triggers creation; other errors are raised
    print(f"{table_name} does not exist. Creating...")
    response_static = glue_client.create_table(**create_table_options_static)

# Write the static frame to the static table
glueContext.write_dynamic_frame.from_catalog(
    frame = wtl_static_dyf,
    database = database_name,
    table_name = table_name
)

print(f"Successfully wrote to {table_name}")