# Introduction to Built-in Data Sources
Spark SQL:
- Provides the engine upon which the high-level Structured APIs we explored previously are built.
- Can read and write data in a variety of structured formats (e.g., JSON, Hive tables, Parquet, Avro, ORC, CSV).
- Lets you query data using JDBC/ODBC connectors from external business intelligence (BI) data sources such as Tableau, Power BI, Talend, or from RDBMSs such as MySQL and PostgreSQL.
- Offers an interactive shell to issue SQL queries on your structured data.
- Supports ANSI SQL:2003-compliant commands and HiveQL.


## Using Spark SQL in Spark Applications
The SparkSession, introduced in Spark 2.0, provides a unified entry point for programming Spark with the Structured APIs. You can use a SparkSession to access Spark functionality.
To issue any SQL query, use the `sql()` method on the `SparkSession` instance (here named `spark`), for example `spark.sql("SELECT * FROM tableName")`. All `spark.sql` queries executed in this manner return a DataFrame on which you can perform further Spark operations.

### Basic Query Examples
The U.S. Department of Transportation's (DOT) Bureau of Transportation Statistics (BTS) tracks the on-time performance of domestic flights operated by large air carriers. Summary information on the number of on-time, delayed, canceled and diverted flights appears in DOT's monthly Air Travel Consumer Report, published about 30 days after the month's end, as well as in summary tables posted on the BTS website. BTS began collecting details on the causes of flight delays in June 2003. Summary statistics and raw data are made available to the public at the time the Air Travel Consumer Report is released.
The data is available as a CSV file with over a million records. We’ll read the data into a DataFrame, letting Spark infer the schema, and register the DataFrame as a temporary view so we can query it with SQL.

In [1]:
import findspark

# If you know spark path you can specify it as init function parameter
findspark.init()

In [2]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = (SparkSession
            .builder
            .appName("SparkSQLandDFsPart1")
            .enableHiveSupport()
            .getOrCreate())

In [3]:
# Path to data set
csv_file = "../data/DelayedFlights.csv"

# Read and create a temporary view
# Infer schema (note that for larger files you
# may want to specify the schema)
df = (spark.read.format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .load(csv_file))

df.createOrReplaceTempView("us_delay_flights_tbl")

<b>Note!</b> If you want to specify a schema, you can use a DDL-formatted string. For example:

`schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"`

Now that we have a temporary view, we can issue SQL queries using Spark SQL. These queries are no different from those you might issue against a SQL table in a database such as MySQL or PostgreSQL. The point here is to show that Spark SQL offers an ANSI SQL:2003-compliant SQL interface, and to demonstrate the interoperability between SQL and DataFrames.

In [4]:
df.printSchema()

root
 |-- _c0: integer (nullable = true)
 |-- Year: integer (nullable = true)
 |-- Month: integer (nullable = true)
 |-- DayofMonth: integer (nullable = true)
 |-- DayOfWeek: integer (nullable = true)
 |-- DepTime: double (nullable = true)
 |-- CRSDepTime: integer (nullable = true)
 |-- ArrTime: double (nullable = true)
 |-- CRSArrTime: integer (nullable = true)
 |-- UniqueCarrier: string (nullable = true)
 |-- FlightNum: integer (nullable = true)
 |-- TailNum: string (nullable = true)
 |-- ActualElapsedTime: double (nullable = true)
 |-- CRSElapsedTime: double (nullable = true)
 |-- AirTime: double (nullable = true)
 |-- ArrDelay: double (nullable = true)
 |-- DepDelay: double (nullable = true)
 |-- Origin: string (nullable = true)
 |-- Dest: string (nullable = true)
 |-- Distance: integer (nullable = true)
 |-- TaxiIn: double (nullable = true)
 |-- TaxiOut: double (nullable = true)
 |-- Cancelled: integer (nullable = true)
 |-- CancellationCode: string (nullable = true)
 |-- Diverted

In [5]:
df.columns

['_c0',
 'Year',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'DepTime',
 'CRSDepTime',
 'ArrTime',
 'CRSArrTime',
 'UniqueCarrier',
 'FlightNum',
 'TailNum',
 'ActualElapsedTime',
 'CRSElapsedTime',
 'AirTime',
 'ArrDelay',
 'DepDelay',
 'Origin',
 'Dest',
 'Distance',
 'TaxiIn',
 'TaxiOut',
 'Cancelled',
 'CancellationCode',
 'Diverted',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay']

Let’s try some example queries against this data set. First, we’ll find all flights whose distance is greater than 1,000 miles:

In [6]:
spark.sql("""SELECT Distance as distance, Origin as origin, Dest as destination
             FROM us_delay_flights_tbl WHERE Distance > 1000
             ORDER BY Distance DESC""").show(10)

+--------+------+-----------+
|distance|origin|destination|
+--------+------+-----------+
|    4962|   EWR|        HNL|
|    4962|   HNL|        EWR|
|    4962|   HNL|        EWR|
|    4962|   EWR|        HNL|
|    4962|   EWR|        HNL|
|    4962|   EWR|        HNL|
|    4962|   EWR|        HNL|
|    4962|   HNL|        EWR|
|    4962|   EWR|        HNL|
|    4962|   EWR|        HNL|
+--------+------+-----------+
only showing top 10 rows



As the results show, all of the longest flights were between Newark (EWR) and Honolulu (HNL). Next, we’ll list flights from San Francisco (SFO) to Chicago (ORD), ordered by their combined arrival and departure delay, longest first:

In [7]:
spark.sql("""SELECT CONCAT(dayOfMonth, '/', month, '/', year) as date, (ArrDelay + DepDelay) as delay, Origin as origin, Dest as destination
             FROM us_delay_flights_tbl
             WHERE (Origin = 'SFO') AND (Dest = 'ORD')
             ORDER by delay DESC""").show(10)

+----------+------+------+-----------+
|      date| delay|origin|destination|
+----------+------+------+-----------+
| 30/9/2008|1032.0|   SFO|        ORD|
|14/10/2008| 841.0|   SFO|        ORD|
| 27/7/2008| 831.0|   SFO|        ORD|
|  4/1/2008| 779.0|   SFO|        ORD|
| 30/9/2008| 759.0|   SFO|        ORD|
| 31/1/2008| 733.0|   SFO|        ORD|
| 15/2/2008| 707.0|   SFO|        ORD|
| 15/7/2008| 671.0|   SFO|        ORD|
| 17/2/2008| 669.0|   SFO|        ORD|
| 31/1/2008| 668.0|   SFO|        ORD|
+----------+------+------+-----------+
only showing top 10 rows



As with the DataFrame and Dataset APIs, with the `spark.sql` interface you can conduct common data analysis operations like those we explored in the previous chapter.

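Both of the preceding SQL queries can be expressed with an equivalent DataFrame API query. For example, the filter and ordering of the first query can be expressed in the Python DataFrame API as follows (note that this version selects the delay columns rather than the distance):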

In [8]:
# In Python
from pyspark.sql.functions import col, desc

(df.select("ArrDelay", "DepDelay", "Origin", "Dest")
    .where(col("distance") > 1000)
    .orderBy(desc("distance"))).show(10)

+--------+--------+------+----+
|ArrDelay|DepDelay|Origin|Dest|
+--------+--------+------+----+
|     9.0|     6.0|   EWR| HNL|
|    99.0|    84.0|   HNL| EWR|
|    14.0|    33.0|   HNL| EWR|
|    48.0|    42.0|   EWR| HNL|
|   -12.0|    11.0|   EWR| HNL|
|    29.0|    61.0|   EWR| HNL|
|     1.0|     8.0|   EWR| HNL|
|   264.0|   228.0|   HNL| EWR|
|    -7.0|    19.0|   EWR| HNL|
|    -5.0|    10.0|   EWR| HNL|
+--------+--------+------+----+
only showing top 10 rows



In [9]:
# Or
(df.select("ArrDelay", "DepDelay", "Origin", "Dest")
    .where("distance > 1000")
    .orderBy("distance", ascending=False).show(10))

+--------+--------+------+----+
|ArrDelay|DepDelay|Origin|Dest|
+--------+--------+------+----+
|     9.0|     6.0|   EWR| HNL|
|    99.0|    84.0|   HNL| EWR|
|    14.0|    33.0|   HNL| EWR|
|    48.0|    42.0|   EWR| HNL|
|   -12.0|    11.0|   EWR| HNL|
|    29.0|    61.0|   EWR| HNL|
|     1.0|     8.0|   EWR| HNL|
|   264.0|   228.0|   HNL| EWR|
|    -7.0|    19.0|   EWR| HNL|
|    -5.0|    10.0|   EWR| HNL|
+--------+--------+------+----+
only showing top 10 rows



To enable you to query structured data as shown in the preceding examples, Spark manages all the complexities of creating and managing views and tables, both in memory and on disk.

### SQL Tables and Views
Tables hold data. Associated with each table in Spark is its relevant metadata, which is information about the table and its data: the schema, description, table name, database name, column names, partitions, physical location where the actual data resides, etc. All of this is stored in a central metastore.
Instead of having a separate metastore for Spark tables, Spark by default uses the Apache Hive metastore, located at `/user/hive/warehouse`, to persist all the metadata about your tables. However, you may change the default location by setting the Spark config variable `spark.sql.warehouse.dir` to another location, which can be set to a local or external distributed storage.

### Managed Versus Unmanaged Tables
Spark allows you to create two types of tables: managed and unmanaged. For a `managed` table, Spark manages both the metadata and the data in the file store. This could
be a local filesystem, HDFS, or an object store such as Amazon S3 or Azure Blob. For an `unmanaged` table, Spark only manages the metadata, while you manage the data
yourself in an external data source such as Cassandra.
With a managed table, because Spark manages everything, a SQL command such as DROP TABLE `table_name` deletes both the metadata and the data. With an unmanaged table, the same command will delete only the metadata, not the actual data. We will look at some examples of how to create managed and unmanaged tables in the next section.

### Creating SQL Databases and Tables
By default, Spark creates tables under the default database. To create your own database, you can issue a SQL command from your Spark application or notebook. Using the US flight delays data set, let’s create both a managed and an unmanaged table. To begin, we’ll create a database called `learn_spark_db` and tell Spark we want to use that database:

In [10]:
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("USE learn_spark_db")

DataFrame[]

From this point, any commands we issue in our application to create tables will result in the tables being created in this database and residing under the database name `learn_spark_db`.

#### Creating a managed table
To create a managed table within the database learn_spark_db, you can issue a SQL query like the following:

In [11]:
spark.sql('DROP TABLE IF EXISTS us_delay_flights_tbl')

DataFrame[]

In [12]:
spark.sql("SHOW TABLES;").show()

+--------------+--------------------+-----------+
|     namespace|           tableName|isTemporary|
+--------------+--------------------+-----------+
|learn_spark_db|us_delay_flights_tbl|      false|
+--------------+--------------------+-----------+



In [13]:
# Using SQL
spark.sql("""CREATE TABLE IF NOT EXISTS managed_us_delay_flights_tbl (date STRING, delay INT,
             distance INT, origin STRING, destination STRING)""")

DataFrame[]

In [14]:
spark.catalog.dropGlobalTempView("managed_us_delay_flights_tbl")

False

In [15]:
spark.sql('DROP TABLE IF EXISTS managed_us_delay_flights_tbl')

DataFrame[]

In [16]:
# Using Dataframe API
# Path to our US flight delays CSV file
csv_file = "../data/DelayedFlights.csv"

# Infer the schema from the file, as in the preceding example
flights_df = spark.read.option("inferSchema", "true").option("header", "true").csv(csv_file)

# flights_df.write.saveAsTable("managed_us_delay_flights_tbl")

In [17]:
flights_df.columns

['_c0',
 'Year',
 'Month',
 'DayofMonth',
 'DayOfWeek',
 'DepTime',
 'CRSDepTime',
 'ArrTime',
 'CRSArrTime',
 'UniqueCarrier',
 'FlightNum',
 'TailNum',
 'ActualElapsedTime',
 'CRSElapsedTime',
 'AirTime',
 'ArrDelay',
 'DepDelay',
 'Origin',
 'Dest',
 'Distance',
 'TaxiIn',
 'TaxiOut',
 'Cancelled',
 'CancellationCode',
 'Diverted',
 'CarrierDelay',
 'WeatherDelay',
 'NASDelay',
 'SecurityDelay',
 'LateAircraftDelay']

In [18]:
flights_df.show(2, False)

+---+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|_c0|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+---+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|0  |2008|1    |3         |4        |2003.0 |1955      |2211.0 |2225      |WN     

#### Creating an unmanaged table
By contrast, you can create unmanaged tables from your own data sources—say, Parquet, CSV, or JSON files stored in a file store accessible to your Spark application.

To create an unmanaged table from a data source such as a CSV file, in SQL use:

In [19]:
spark.sql("DROP TABLE IF EXISTS us_delay_flights_tbl")

DataFrame[]

In [20]:
# spark.sql("""CREATE TABLE us_delay_flights_tbl(date STRING, delay INT,
#                 distance INT, origin STRING, destination STRING)
#                 USING csv OPTIONS (PATH
#                 'departuredelays.csv')""")

In [21]:
# Using Dataframe API
(flights_df.write
            .mode("overwrite")
            .saveAsTable("us_delay_flights_tbl"))

### Creating Views
In addition to creating tables, Spark can create views on top of existing tables. Views can be global (visible across all SparkSessions on a given cluster) or session-scoped
(visible only to a single SparkSession), and they are temporary: they disappear after your Spark application terminates.

Creating views has a similar syntax to creating tables within a database. Once you create a view, you can query it as you would a table. The difference between a view and a
table is that views don’t actually hold the data; tables persist after your Spark application terminates, but views disappear.

You can create a view from an existing table using SQL. For example, if you wish to work on only the subset of the US flight delays data set whose origin airport is San Francisco (SFO), the following cells select just that slice of the table and register it as a temporary view (the global temporary variant is left commented out):

In [22]:
# show() returns None, so there is nothing useful to assign here
spark.sql("""SELECT * FROM us_delay_flights_tbl""").show(2, False)

+-------+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|_c0    |Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+-------+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|4130166|2008|7    |26        |6        |1820.0 |1800      |1935.0 |19

In [23]:
# In Python
df_sfo = spark.sql("""SELECT CONCAT(dayOfMonth, '/', month, '/', year) as date, (ArrDelay + DepDelay) as delay, Origin, Dest FROM
                      us_delay_flights_tbl WHERE Origin = 'SFO'""")
df_sfo.show()

+---------+-----+------+----+
|     date|delay|Origin|Dest|
+---------+-----+------+----+
| 1/7/2008|271.0|   SFO| ORD|
| 2/7/2008|140.0|   SFO| ORD|
| 4/7/2008| 47.0|   SFO| ORD|
| 7/7/2008|153.0|   SFO| ORD|
|11/7/2008| 56.0|   SFO| ORD|
|15/7/2008| 35.0|   SFO| ORD|
|18/7/2008|428.0|   SFO| ORD|
|19/7/2008|102.0|   SFO| ORD|
|21/7/2008| 76.0|   SFO| ORD|
|23/7/2008| 25.0|   SFO| ORD|
|24/7/2008|144.0|   SFO| ORD|
|25/7/2008| 81.0|   SFO| ORD|
|27/7/2008|  4.0|   SFO| ORD|
|28/7/2008|495.0|   SFO| ORD|
| 1/7/2008| 48.0|   SFO| ORD|
| 3/7/2008|  8.0|   SFO| ORD|
| 4/7/2008| 14.0|   SFO| ORD|
| 7/7/2008| 41.0|   SFO| ORD|
|11/7/2008| 81.0|   SFO| ORD|
|14/7/2008| 63.0|   SFO| ORD|
+---------+-----+------+----+
only showing top 20 rows



In [24]:
# Create a temporary and global temporary view
# df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view")
df_sfo.createOrReplaceTempView("us_origin_airport_SFO_tmp_view")

Once you’ve created these views, you can issue queries against them just as you would against a table. Keep in mind that when accessing a global temporary view you must use the prefix `global_temp.<view_name>`, because Spark creates global temporary views in a global temporary database called `global_temp`. The cell below queries the session-scoped view, which needs no prefix:

In [25]:
# spark.read.table("us_origin_airport_SFO_tmp_view")
# Or
spark.sql("SELECT * FROM us_origin_airport_SFO_tmp_view").show()

+---------+-----+------+----+
|     date|delay|Origin|Dest|
+---------+-----+------+----+
| 1/7/2008|271.0|   SFO| ORD|
| 2/7/2008|140.0|   SFO| ORD|
| 4/7/2008| 47.0|   SFO| ORD|
| 7/7/2008|153.0|   SFO| ORD|
|11/7/2008| 56.0|   SFO| ORD|
|15/7/2008| 35.0|   SFO| ORD|
|18/7/2008|428.0|   SFO| ORD|
|19/7/2008|102.0|   SFO| ORD|
|21/7/2008| 76.0|   SFO| ORD|
|23/7/2008| 25.0|   SFO| ORD|
|24/7/2008|144.0|   SFO| ORD|
|25/7/2008| 81.0|   SFO| ORD|
|27/7/2008|  4.0|   SFO| ORD|
|28/7/2008|495.0|   SFO| ORD|
| 1/7/2008| 48.0|   SFO| ORD|
| 3/7/2008|  8.0|   SFO| ORD|
| 4/7/2008| 14.0|   SFO| ORD|
| 7/7/2008| 41.0|   SFO| ORD|
|11/7/2008| 81.0|   SFO| ORD|
|14/7/2008| 63.0|   SFO| ORD|
+---------+-----+------+----+
only showing top 20 rows



#### Temporary views versus global temporary views

The difference between temporary and global temporary views is subtle and can be a source of mild confusion among developers new to Spark. A temporary view is tied to a single `SparkSession` within a Spark application. In contrast, a global temporary view is visible across multiple SparkSessions within a Spark application. You can indeed create multiple SparkSessions within a single Spark application; this can be handy, for example, when you want to access (and combine) data from two different SparkSessions that don’t share the same Hive metastore configurations.

### Viewing the Metadata
Spark manages the metadata associated with each managed or unmanaged table and exposes it through the catalog, available as `spark.catalog`. The following calls list the databases, tables, and columns known to the current session:

In [26]:
spark.catalog.listDatabases()

[Database(name='default', description='Default Hive database', locationUri='file:/C:/Users/dm/Documents/Github/training-bigdata-lab/Section%2005%20Apache%20Spark%20(PySpark)/spark-warehouse'),
 Database(name='learn_spark_db', description='', locationUri='file:/C:/Users/dm/Documents/Github/training-bigdata-lab/Section%2005%20Apache%20Spark%20(PySpark)/spark-warehouse/learn_spark_db.db')]

In [27]:
spark.catalog.listTables()

[Table(name='us_delay_flights_tbl', database='learn_spark_db', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='us_origin_airport_sfo_tmp_view', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [28]:
spark.catalog.listColumns("us_delay_flights_tbl")

[Column(name='_c0', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='Year', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='Month', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='DayofMonth', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='DayOfWeek', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='DepTime', description=None, dataType='double', nullable=True, isPartition=False, isBucket=False),
 Column(name='CRSDepTime', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(name='ArrTime', description=None, dataType='double', nullable=True, isPartition=False, isBucket=False),
 Column(name='CRSArrTime', description=None, dataType='int', nullable=True, isPartition=False, isBucket=False),
 Column(nam

### Caching SQL Tables
We will discuss table caching strategies in more detail later. For now, note that, like DataFrames, SQL tables and views can be cached and uncached.
In Spark 3.0, in addition to other options, you can specify a table as `LAZY`, meaning that it should only be cached when it is first used instead of immediately.

### Reading Tables into DataFrames
Often, data engineers build data pipelines as part of their regular data ingestion and ETL processes. They populate Spark SQL databases and tables with cleansed data for consumption by applications downstream.
Let’s assume you have an existing database, `learn_spark_db`, and table, `us_delay_flights_tbl`, ready for use. Instead of reading from an external JSON file, you can simply use SQL to query the table and assign the returned result to a DataFrame:

In [29]:
# us_flights_df = spark.sql("SELECT * FROM us_delay_flights_tbl")
us_flights_df2 = spark.table("us_delay_flights_tbl")

## Data Sources for DataFrames and SQL Tables
Spark SQL provides an interface to a variety of data sources. It also provides a set of common methods for reading and writing data to and from these data sources using the <a href="https://www.databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html">Data Sources API</a>.

### DataFrameReader
DataFrameReader is the core construct for reading data from a data source into a DataFrame. It has a defined format and a recommended pattern for usage:

`DataFrameReader.format(args).option("key", "value").schema(args).load()`

Note that you can only access a DataFrameReader through a SparkSession instance. That is, you cannot create an instance of DataFrameReader. To get an instance handle to it, use:


In [30]:
# Access the readers through the SparkSession instance, not the class
spark.read
# or
spark.readStream

While `read` returns a handle to `DataFrameReader` to read into a DataFrame from a static data source, `readStream` returns an instance to read from a streaming source.

| Method      | Arguments | Description     |
| :----:      |    :----   |    :----:     |
| format()      | "parquet", "csv", "text", "json", "jdbc", "orc", "avro", etc.       | If you don’t specify this method, then the default is Parquet or whatever is set in `spark.sql.sources.default`.   |
| option()   | ("mode", {PERMISSIVE , FAILFAST, DROPMALFORMED } ) ("inferSchema", {true , false}) ("path", "path_file_data_source")        | A series of key/value pairs and options. The Spark documentation shows some examples and explains the different modes and their actions. The default mode is PERMISSIVE. The "inferSchema" and "mode" options are specific to the JSON and CSV file formats.     |
| schema()   | DDL String or StructType, e.g., 'A INT, B STRING' or StructType(...)        | For JSON or CSV format, you can specify to infer the schema in the option() method. Generally, providing a schema for any format makes loading faster and ensures your data conforms to the  expected schema.      |
| load()   | "/path/to/data/source"        | The path to the data source. This can be empty if specified in option("path", "...").     |

In [31]:
# Parquet
df = spark.read.format("parquet").load("../data/us_flights_delay")
df.show(1, False)

+-------+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|_c0    |Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+-------+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|4130166|2008|7    |26        |6        |1820.0 |1800      |1935.0 |19

In [32]:
# CSV
df3 = (spark.read.format("csv")
        .option("inferSchema", "true")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .load("../data/csv_mindex.csv"))
df3.show(1, False)

+----+----+------+------+
|key1|key2|value1|value2|
+----+----+------+------+
|one |a   |1     |2     |
+----+----+------+------+
only showing top 1 row



In [33]:
# JSON
df4 = spark.read.format("json").load("../data/examplePYspark.json")
df4.show(1, False)

+---+---+---+
|a  |b  |c  |
+---+---+---+
|1  |2  |3  |
+---+---+---+
only showing top 1 row



<b>Note!</b> Parquet is the default and preferred data source for Spark because it’s efficient, uses columnar storage, and employs a fast compression algorithm.

### DataFrameWriter

`DataFrameWriter` does the reverse of its counterpart: it saves or writes data to a specified built-in data source. Unlike with DataFrameReader, you access its instance not from a SparkSession but from the DataFrame you wish to save. It has a few recommended usage patterns:

`DataFrameWriter.format(args).option(args).sortBy(args).saveAsTable(table)`

| Method      | Arguments | Description     |
| :----:      |    :----   |    :----:     |
| format()      | "parquet", "csv", "text", "json", "jdbc", "orc", "avro", etc.       | If you don’t specify this method, then the default is Parquet or whatever is set in `spark.sql.sources.default`.   |
| option()   | ("mode", {append , overwrite, ignore , error or errorifexists} )("mode", {SaveMode.Overwrite, SaveMode.Append, SaveMode.Ignore, SaveMode.ErrorIfExists}) ("path", "path_to_write_to")        | A series of key/value pairs and options. The Spark documentation shows some examples. This is an overloaded method. The default mode options are error or errorifexists and SaveMode.ErrorIfExists; they throw an exception at runtime if the data already exists. In PySpark the save mode is usually set with the `mode()` method, as in the examples in this section.     |
| bucketBy()   | (numBuckets, col, col...,col)        | The number of buckets and names of columns to bucket by. Uses Hive’s bucketing scheme on a filesystem.      |
| save()   | "/path/to/data/source"        | The path to save to. This can be empty if specified in option("path", "...").     |
| saveAsTable()  | "table_name"            | The table to save to.|

<b>Note!</b> You can find PARQUET options for DataFrameReader and DataFrameWriter in this <a href="https://spark.apache.org/docs/latest/sql-data-sources-parquet.html">link</a>.

### Parquet
We start our exploration of data sources with Parquet, because it’s the default data source in Spark. Supported and widely used by many big data processing frameworks and platforms, Parquet is an open source columnar file format that offers many I/O optimizations (such as compression, which saves storage space and allows for quick access to data columns).
Because of its efficiency and these optimizations, we recommend that after you have transformed and cleansed your data, you save your DataFrames in the Parquet format for downstream consumption.

#### Reading Parquet files into a DataFrame
Parquet files are stored in a directory structure that contains the data files, metadata, a number of compressed files, and some status files. Metadata in the footer contains
the version of the file format, the schema, and column data such as the path, etc. For example, a directory in a Parquet file might contain a set of files like this:

            _SUCCESS
            _committed_1799640464332036264
            _started_1799640464332036264
            part-00000-tid-1799640464332036264-91273258-d7ef-4dc7-<...>-c000.snappy.parquet

#### Reading Parquet files into a Spark SQL table
As well as reading Parquet files into a Spark DataFrame, you can also create a Spark SQL unmanaged table or view directly using SQL.

Once you’ve created the table or view, you can read data into a DataFrame using SQL, as we saw in some earlier examples:

In [34]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show()

+-------+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|    _c0|Year|Month|DayofMonth|DayOfWeek|DepTime|CRSDepTime|ArrTime|CRSArrTime|UniqueCarrier|FlightNum|TailNum|ActualElapsedTime|CRSElapsedTime|AirTime|ArrDelay|DepDelay|Origin|Dest|Distance|TaxiIn|TaxiOut|Cancelled|CancellationCode|Diverted|CarrierDelay|WeatherDelay|NASDelay|SecurityDelay|LateAircraftDelay|
+-------+----+-----+----------+---------+-------+----------+-------+----------+-------------+---------+-------+-----------------+--------------+-------+--------+--------+------+----+--------+------+-------+---------+----------------+--------+------------+------------+--------+-------------+-----------------+
|4130166|2008|    7|        26|        6| 1820.0|      1800| 1935.0|  

#### Writing DataFrames to Parquet files
Writing or saving a DataFrame as a table or file is a common operation in Spark. To write a DataFrame you simply use the methods and arguments to the `DataFrameWriter` outlined earlier in this chapter, supplying the location to save the Parquet file to. For example:

In [35]:
(df.write.format("parquet")
            .mode("overwrite")
            .option("compression", "snappy")
            .save("../data/parquet/df_parquet"))

#### Writing DataFrames to Spark SQL tables

Writing a DataFrame to a SQL table is as easy as writing to a file: just use `saveAsTable()` instead of `save()`. This will create a managed table called `us_delay_flights_tbl`:

In [36]:
(df.write
    .mode("overwrite")
    .saveAsTable("us_delay_flights_tbl"))

### JSON
JavaScript Object Notation (JSON) is also a popular data format. It came to prominence as an easy-to-read and easy-to-parse format compared to XML. It has two representational
formats: <a href="https://docs.databricks.com/external-data/json.html">single-line mode and multiline mode</a>. Both modes are supported in Spark.

#### Reading a JSON file into a DataFrame

In [37]:
file = "../data/examplePYspark.json"
df = spark.read.format("json").load(file)

In [38]:
df.show()

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
|  7|  8|  9|
+---+---+---+



#### Reading a JSON file into a Spark SQL table

### CSV
As widely used as plain text files, this common text file format captures each field delimited by a comma; each line with comma-separated fields represents a record. Even though a comma is the default separator, you may use other delimiters to separate fields in cases where commas are part of your data. Popular spreadsheets can generate CSV files, so it’s a popular format among data and business analysts.
#### Reading a CSV file into a DataFrame

In [42]:
file = "../data/stock_px.csv"
schema = "index date, AAPL float, MSFT float, XOM float, SPX float"

df = (spark.read.format("csv")
            .option("header", "true")
            .schema(schema)
            .option("mode", "FAILFAST") # Exit if any errors
            .option("nullValue", "") # Treat empty fields as null
            .load(file))

In [43]:
df.show(1, False)

+----------+----+-----+-----+------+
|index     |AAPL|MSFT |XOM  |SPX   |
+----------+----+-----+-----+------+
|2003-01-02|7.4 |21.11|29.22|909.03|
+----------+----+-----+-----+------+
only showing top 1 row



#### Reading a CSV file into a Spark SQL table

<b>Note!</b> You can find CSV options for DataFrameReader and DataFrameWriter in this <a href="https://spark.apache.org/docs/latest/sql-data-sources-csv.html">link</a>.

<b>Note!</b> For the full list of possible data sources check this <a href="https://spark.apache.org/docs/latest/sql-data-sources.html">link</a>.