##### Spark SQL Data Sources
- [Official Spark SQL Data Sources Documentation](https://spark.apache.org/docs/latest/sql-data-sources.html)
- [Demystifying the Parquet File Format](https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705)
- [What's the Buzz About Parquet File Format](https://medium.com/analytics-vidhya/whats-the-buzz-about-parquet-file-format-8a1fe4f65de)

##### Delimited File
- Creating Dataframe from List and Dictionary
- Reading CSV and Separator-Separated Files
- Understanding Escape Character, Quote Character, Escape Sequence, Separator
- Passing Custom Schema
- Available Options while Reading CSV
- Options Variation Across Different Sources
- How to Pass and Choose Options
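
The options and custom-schema topics above can be sketched as follows. This is a minimal sketch, not a definitive recipe: the option names (`header`, `sep`, `quote`, `escape`, `mode`) are Spark CSV reader options, while the option values, the `ORDERS_SCHEMA` columns, and the `read_delimited` helper are assumptions made for illustration.

```python
# Reader options for a hypothetical pipe-separated file with a header row.
# The option names are Spark CSV options; the values are illustrative assumptions.
CSV_OPTIONS = {
    "header": "true",      # first line holds the column names
    "sep": "|",            # field separator
    "quote": '"',          # character that wraps fields containing the separator
    "escape": "\\",        # character that escapes a quote inside a quoted field
    "mode": "PERMISSIVE",  # keep malformed rows; fields that cannot be parsed become null
}

# Custom schema written as a DDL string (hypothetical columns).
ORDERS_SCHEMA = "order_id INT, customer STRING, amount DOUBLE"


def read_delimited(spark, path, schema=ORDERS_SCHEMA, options=None):
    """Read a delimited file with an explicit schema instead of inferring one."""
    reader = spark.read.format("csv").options(**(options or CSV_OPTIONS))
    return reader.schema(schema).load(path)
```

With a live session this would be called as `read_delimited(spark, "<path-to-file>")`. Passing the options as one dictionary keeps them in a single place, which makes it easier to compare the option sets of different sources.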

##### Databases
- Reading from JDBC
- Passing Queries in JDBC
- Specifying Database Name, Schema, Table Name in JDBC
- Commonly Used Options in JDBC
- Understanding `dbtable` and `query` Parameters in JDBC
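
The `dbtable` and `query` parameters listed above are mutually exclusive in Spark's JDBC source, so a small guard helps catch mistakes early. The sketch below assumes a hypothetical PostgreSQL URL, credentials, and table names; `jdbc_options` and `read_jdbc` are illustrative helpers, not part of PySpark.

```python
def jdbc_options(url, user, password, dbtable=None, query=None, fetchsize=1000):
    """Build JDBC reader options, enforcing that exactly one of dbtable/query is set."""
    if (dbtable is None) == (query is None):
        raise ValueError("Specify exactly one of 'dbtable' or 'query'")
    options = {
        "url": url,
        "user": user,
        "password": password,
        "fetchsize": str(fetchsize),  # rows fetched per round trip
    }
    if dbtable is not None:
        # A schema-qualified table name, or a parenthesized subquery with an alias.
        options["dbtable"] = dbtable
    else:
        # A plain SELECT statement; Spark wraps it as a subquery internally.
        options["query"] = query
    return options


def read_jdbc(spark, options):
    """Load a DataFrame through the JDBC source using the prepared options."""
    return spark.read.format("jdbc").options(**options).load()


# Hypothetical connection details, for illustration only.
table_opts = jdbc_options(
    "jdbc:postgresql://localhost:5432/sales", "reader", "secret",
    dbtable="public.orders",
)
query_opts = jdbc_options(
    "jdbc:postgresql://localhost:5432/sales", "reader", "secret",
    query="SELECT order_id, amount FROM public.orders WHERE amount > 100",
)
```

The database name is part of the URL, while the schema and table are given together in `dbtable` (here `public.orders`). The matching JDBC driver JAR still has to be available to Spark for `read_jdbc` to work.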

##### Parquet File
- Introduction to Parquet File Format
- Reading Parquet Files
- Differences from Other Formats
- Why Parquet Is Spark's Default File Format
- Compression and Encoding Techniques in Parquet
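
A short sketch of the Parquet topics above: writing with an explicit compression codec and reading back only the columns that are needed, which is where a columnar format pays off. The helper names and the choice of `snappy` are assumptions for illustration.

```python
# Writer options; "compression" is a Parquet writer option in Spark.
PARQUET_WRITE_OPTIONS = {"compression": "snappy"}


def write_parquet(df, path, mode="overwrite"):
    """Write a DataFrame as Parquet files with the codec chosen above."""
    df.write.mode(mode).options(**PARQUET_WRITE_OPTIONS).parquet(path)


def read_columns(spark, path, columns):
    """Read only the requested columns; Parquet lets Spark skip the others."""
    return spark.read.parquet(path).select(*columns)
```

Because the schema is stored in the Parquet files themselves, no schema or parsing options have to be passed on read, unlike the CSV case.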

##### JSON File
- Reading JSON Files
- Handling Multiline JSON and its Relevance in Saving
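
The difference between line-delimited and multiline JSON is easiest to see by producing both. The sketch below uses only the standard library to write the two layouts; `read_json` shows how each would be read, assuming a live `spark` session. The sample records and helper names are made up for illustration.

```python
import json

# Hypothetical sample records.
records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def write_json_lines(path, rows):
    """One JSON object per line: the layout Spark reads by default and also writes."""
    with open(path, "w", encoding="utf-8") as fh:
        for row in rows:
            fh.write(json.dumps(row) + "\n")


def write_pretty_json(path, rows):
    """A single pretty-printed JSON array spanning many lines."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rows, fh, indent=2)


def read_json(spark, path, multiline=False):
    """Read JSON; set multiline=True for files where a record spans several lines."""
    return spark.read.option("multiLine", multiline).json(path)
```

A file produced by `write_pretty_json` needs `multiline=True`, while the output of `write_json_lines` does not. This matters when saving because `df.write.json` emits the line-delimited layout, so data written by Spark can be read back without the option.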

##### SaveAsTable vs. df.write
- Difference Between `df.write.saveAsTable` (registers a table in the metastore) and `df.write.save` (only writes files to a path)
- Available Writing Modes in `df.write`
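
The write paths and modes above can be summarized in a small sketch. The mode names are the ones accepted by `DataFrameWriter.mode`; the helper functions and the one-line descriptions are illustrative assumptions rather than an official API.

```python
# Save modes accepted by DataFrameWriter.mode, with a short description of each.
SAVE_MODES = {
    "append": "add rows to existing data",
    "overwrite": "replace existing data",
    "ignore": "do nothing if data already exists",
    "errorifexists": "fail if data already exists (the default; alias: error)",
}


def check_mode(mode):
    """Reject mode strings that the writer would not accept."""
    if mode not in SAVE_MODES and mode != "error":
        raise ValueError(f"Unknown save mode: {mode!r}")
    return mode


def save_files(df, path, fmt="parquet", mode="errorifexists"):
    """Write files to a path only; nothing is registered in the metastore."""
    df.write.format(fmt).mode(check_mode(mode)).save(path)


def save_managed_table(df, table, mode="errorifexists"):
    """Managed table: data lives under spark.sql.warehouse.dir and is dropped with the table."""
    df.write.mode(check_mode(mode)).saveAsTable(table)


def save_external_table(df, table, path, mode="errorifexists"):
    """Unmanaged table: an explicit path is given, so dropping the table keeps the files."""
    df.write.mode(check_mode(mode)).option("path", path).saveAsTable(table)
```

The only difference between the managed and unmanaged variants is the explicit `path` option, which decides whether Spark or the user owns the underlying files.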


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.conf import SparkConf

In [2]:
spark_conf = SparkConf()
spark_conf.setAppName('readwrite')
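# spark.sql.warehouse.dir is the setting Spark honors for the managed-table location.
spark_conf.set("spark.sql.warehouse.dir", "/home/glue_user/workspace/data-engineering/data-processing/read-write/data/wh")
# hive.metastore.warehouse.dir has been deprecated since Spark 2.0; Spark refuses to set it here and logs a warning.
spark_conf.set("hive.metastore.warehouse.dir", "/home/glue_user/workspace/data-engineering/data-processing/read-write/data/wh")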

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/glue_user/spark/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/glue_user/spark/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/glue_user/aws-glue-libs/jars/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/glue_user/aws-glue-libs/jars/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
df = spark.range(1).repartition(1)

24/03/04 18:25:02 WARN SharedState: Not allowing to set hive.metastore.warehouse.dir in SparkSession's options, please use spark.sql.warehouse.dir to set statically for cross-session usages


In [4]:
#df.write.format('csv').mode('overwrite').save('/home/glue_user/workspace/data-engineering/data-processing/read-write/data/csv-data')
# managed table -> the default data format is parquet and the default location is under spark.sql.warehouse.dir
df.write.mode('overwrite').saveAsTable('default.tabledata')
# managed table -> format is csv
#df.write.format('csv').mode('overwrite').saveAsTable('csvtabledata')
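# un-managed (external) table -> pass an explicit path (hypothetical path and table name below)
#df.write.mode('overwrite').option('path', '<external-path>').saveAsTable('externaltabledata')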

24/03/04 18:25:07 INFO HiveConf: Found configuration file file:/home/glue_user/spark/conf/hive-site.xml
24/03/04 18:25:13 WARN EC2MetadataUtils: Unable to retrieve the requested metadata (/latest/dynamic/instance-identity/document). connecting to 169.254.169.254:80: connecting to 169.254.169.254:80: dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network. (Service: null; Status Code: 403; Error Code: null; Request ID: null; Proxy: null)
com.amazonaws.AmazonServiceException: connecting to 169.254.169.254:80: connecting to 169.254.169.254:80: dial tcp 169.254.169.254:80: connectex: A socket operation was attempted to an unreachable network. (Service: null; Status Code: 403; Error Code: null; Request ID: null; Proxy: null)
	at com.amazonaws.internal.EC2ResourceFetcher.handleErrorResponse(EC2ResourceFetcher.java:149)
	at com.amazonaws.internal.EC2ResourceFetcher.doReadResource(EC2ResourceFetcher.java:94)
	at com.amazonaws.internal.EC2ResourceFetch

AnalysisException: Can not create the managed table('`default`.`tabledata`'). The associated location('file:/home/glue_user/workspace/sparklearning/practice/spark-warehouse/tabledata') already exists.

In [6]:
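# hive.metastore.warehouse.dir was never applied to the session (see the SharedState warning above),
# so this lookup raises NoSuchElementException; spark.sql.warehouse.dir is the key to query instead.
print(spark.conf.get("hive.metastore.warehouse.dir"))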

Py4JJavaError: An error occurred while calling o64.get.
: java.util.NoSuchElementException: hive.metastore.warehouse.dir
	at org.apache.spark.sql.errors.QueryExecutionErrors$.noSuchElementExceptionError(QueryExecutionErrors.scala:1660)
	at org.apache.spark.sql.internal.SQLConf.$anonfun$getConfString$3(SQLConf.scala:5434)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:5434)
	at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:72)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)


In [None]:
spark.sql("DESCRIBE EXTENDED default.tabledata").show(truncate=False)
