Glue and Redshift
Vaquar Khan edited this page Dec 9, 2022
```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)

# Read a Redshift table into a DynamicFrame over JDBC. Glue stages the
# data through the S3 path given in "redshiftTmpDir".
table = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "url": "jdbc:redshift://xxxxx.yyyyy.us-east-1.redshift.amazonaws.com:5439/db",
        "user": "yyyy",
        "password": "yyyyy",
        "dbtable": "schema.table_name",
        "redshiftTmpDir": "s3://aws-glue-temporary-accountnumber-us-east-1/",
    },
)

# Rename or cast columns as needed; here the mapping is an identity mapping.
applyformat = ApplyMapping.apply(
    frame=table,
    mappings=[
        ("field1", "string", "field1", "string"),
        ("field2", "string", "field2", "string"),
    ],
    transformation_ctx="applyformat",
)

# Register the data as a temp view so it can be queried with Spark SQL.
# Note: createOrReplaceTempView returns None, so keep the DataFrame
# reference separately instead of assigning the view call's result.
dfred = table.toDF()
dfred.createOrReplaceTempView("table_df")
sqlDF = spark.sql("SELECT COUNT(*) FROM table_df")
```
Internally, when the `write_dynamic_frame_from_jdbc_conf` method targets Redshift, AWS Glue writes the DynamicFrame's data as multiple CSV files, together with a manifest file, to the S3 location given by the job's temporary directory parameter. The CSV files are then loaded into Redshift with a COPY command that references the manifest. Because CSV has no map data type, a DynamicFrame containing a map column makes the Glue job fail at this step.
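One workaround is to serialize map-typed columns to JSON strings before the write, since CSV can only carry flat scalar values. The sketch below illustrates the idea in plain Python (no Glue dependency); the function name and sample rows are invented for illustration. In a real job you would instead cast the map column to a string in the `ApplyMapping` step or with a DataFrame transformation before writing to Redshift.

```python
import csv
import io
import json

def serialize_map_columns(rows):
    """Make rows CSV-safe by JSON-encoding any dict-valued ("map") fields.

    `rows` is a list of dicts; scalar values pass through unchanged,
    while nested dicts are replaced by their JSON text representation.
    """
    out = []
    for row in rows:
        out.append({
            key: json.dumps(value, sort_keys=True) if isinstance(value, dict) else value
            for key, value in row.items()
        })
    return out

# A row with a map-typed column that plain CSV cannot represent directly.
rows = [{"id": 1, "attrs": {"color": "red", "size": "L"}}]
flat = serialize_map_columns(rows)

# After serialization the row writes cleanly as CSV.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "attrs"])
writer.writeheader()
writer.writerows(flat)
```

Once the column is a string, the CSV/COPY path described above no longer has an unsupported type to choke on; the consumer can parse the JSON back out on the Redshift side (for example with Redshift's JSON functions) if the structure is still needed.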
- https://stackoverflow.com/questions/69219209/aws-glue-error-an-error-occurred-while-calling-o75-pywritedynamicframe-cannot
- https://stackoverflow.com/questions/48257351/error-while-connecting-pyspark-to-aws-redshift
- https://stackoverflow.com/questions/67483163/py4j-protocol-py4jjavaerror-an-error-occurred-while-calling-o133-pywritedynamic