##### Spark SQL Data Sources
- [Official Spark SQL Data Sources Documentation](https://spark.apache.org/docs/latest/sql-data-sources.html)
- [Demystifying the Parquet File Format](https://towardsdatascience.com/demystifying-the-parquet-file-format-13adb0206705)
- [What's the Buzz About Parquet File Format](https://medium.com/analytics-vidhya/whats-the-buzz-about-parquet-file-format-8a1fe4f65de)

##### Delimited File
- Creating DataFrame from List and Dictionary
- Reading CSV and Separator-Separated Files
- Understanding Escape Character, Quote Character, Escape Sequence, Separator
- Passing Custom Schema
- Available Options while Reading CSV
- Options Variation Across Different Sources
- How to Pass and Choose Options

##### Databases
- Reading from JDBC
- Passing Queries in JDBC
- Specifying Database Name, Schema, Table Name in JDBC
- Commonly Used Options in JDBC
- Understanding `dbtable` and `query` Parameters in JDBC
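
To illustrate the `dbtable` versus `query` distinction, here is a hedged sketch of a helper that builds the option dictionary for a JDBC read. The helper name `jdbc_read_options`, the URL, the credentials, and the table names are all hypothetical. The two rules it enforces come from the Spark JDBC documentation: `dbtable` and `query` cannot be specified together, and `query` cannot be combined with `partitionColumn`.

```python
def jdbc_read_options(url, user, password, dbtable=None, query=None, **extra):
    """Build options for spark.read.format("jdbc") (hypothetical helper)."""
    # Spark rejects a read that sets both dbtable and query, so require exactly one.
    if (dbtable is None) == (query is None):
        raise ValueError("pass exactly one of dbtable or query")
    # Partitioned reads need dbtable; query cannot be used with partitionColumn.
    if query is not None and "partitionColumn" in extra:
        raise ValueError("use dbtable (with a subquery if needed) for partitioned reads")

    opts = {"url": url, "user": user, "password": password}
    if dbtable is not None:
        # A table name, optionally schema-qualified, or a "(subquery) alias".
        opts["dbtable"] = dbtable
    else:
        # A plain SELECT; Spark wraps it as a subquery itself.
        opts["query"] = query
    # Option values are passed as strings.
    opts.update({key: str(value) for key, value in extra.items()})
    return opts


# Hypothetical connection details, for illustration only.
url = "jdbc:postgresql://localhost:5432/shop"
table_opts = jdbc_read_options(url, "reader", "secret", dbtable="sales.orders", fetchsize=1000)
query_opts = jdbc_read_options(
    url, "reader", "secret", query="select id, total from sales.orders where total > 100"
)
```

With a JDBC driver on the classpath and a reachable database, either dictionary could be passed as `spark.read.format("jdbc").options(**table_opts).load()`.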

##### Parquet File
- Introduction to Parquet File Format
- Reading Parquet Files
- Differences from Other Formats
- Why Parquet File is a Default Choice
- Compression and Encoding Techniques in Parquet

##### JSON File
- Reading JSON Files
- Handling Multiline JSON and its Relevance in Saving

##### saveAsTable vs. df.write.save
- Difference Between `df.write.saveAsTable` and `df.write.save`
- Available Writing Modes in `df.write`


In [1]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import types as T
from pyspark.conf import SparkConf

In [2]:
spark_conf = SparkConf()
spark_conf.setAppName('readwrite')
#spark_conf.set("spark.sql.warehouse.dir", "/home/glue_user/workspace/data-engineering/data-processing/read-write/data/wh")

spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()



In [3]:
df = spark.range(1000).repartition(1)

In [23]:
#df.write.format('csv').mode('overwrite').save('/home/glue_user/workspace/data-engineering/data-processing/read-write/data/csv-data')
# Managed table: the default data format is Parquet and the data is stored under spark.sql.warehouse.dir.
# df.write.mode('overwrite').saveAsTable('default.tabledata')
# Unmanaged (external) table: an explicit path is supplied. The format written below is Parquet, despite the table name `csvdata`.
spark.range(1000).write.format('parquet').mode('overwrite').options(path='data-processing/read-write/data/sat/').saveAsTable('csvdata')
#spark.range(10).write.format('parquet').mode('overwrite').options(path='data-processing/read-write/data/sat/').saveAsTable('tabledata')


In [24]:
spark.sql('''
select * from csvdata''').show()

+---+
| id|
+---+
+---+



In [None]:
spark.sql("DESCRIBE EXTENDED csvdata").show(truncate=False)
