# Spark SQL

Spark SQL is arguably one of the most important and powerful features in Spark. In a nutshell, Spark SQL lets you run SQL queries against views or tables organized into databases. You can also use system functions or define your own, and analyze query plans to optimize your workloads. All of this integrates directly with the DataFrame API: as we saw in previous classes, you can express some of your data manipulations in SQL and others in DataFrames, and they will compile to the same underlying code.

## Big Data and SQL: Apache Hive

Before Spark’s rise, Hive was the de facto big data SQL access layer. Originally developed at Facebook, Hive became an incredibly popular tool across industry for performing SQL operations on big data. In many ways it helped propel Hadoop into different industries because analysts could run SQL queries. Although Spark began as a general processing engine with Resilient Distributed Datasets (RDDs), a large cohort of users now use Spark SQL.

## Big Data and SQL: Spark SQL

With the release of Spark 2.0, its authors created a superset of Hive’s support, writing a native SQL parser that supports both ANSI SQL and HiveQL queries. This, along with its unique interoperability with DataFrames, makes it a powerful tool for all sorts of companies. For example, in late 2016, Facebook announced that it had begun running Spark workloads and was seeing large benefits from doing so. In the words of the blog post’s authors:

>We challenged Spark to replace a pipeline that decomposed to hundreds of Hive jobs into a single Spark job. Through a series of performance and reliability improvements, we were able to scale Spark to handle one of our entity ranking data processing use cases in production…. The Spark-based pipeline produced significant performance improvements (4.5–6x CPU, 3–4x resource reservation, and ~5x latency) compared with the old Hive-based pipeline, and it has been running in production for several months.

The power of Spark SQL derives from several key facts: SQL analysts can take advantage of Spark’s computation abilities by plugging into the Thrift Server or Spark’s SQL interface, whereas data engineers and scientists can use Spark SQL where appropriate in any data flow. This unifying API allows data to be extracted with SQL, manipulated as a DataFrame, passed into one of Spark MLlib’s large-scale machine learning algorithms, written out to another data source, and everything in between.

**NOTE:** Spark SQL is intended to operate as an online analytic processing (OLAP) database, not an online transaction processing (OLTP) database. This means that it is not intended to perform extremely low-latency queries. Support for in-place modifications may arrive in the future, but it is not currently available.

In [1]:
spark.sql("SELECT 1 + 1").show()

+-------+
|(1 + 1)|
+-------+
|      2|
+-------+



As we have seen before, SQL and DataFrames interoperate completely, so you can mix them as you see fit. For instance, you can create a DataFrame, manipulate it with SQL, and then manipulate it again as a DataFrame. It’s a powerful abstraction that you will likely find yourself using quite a bit:

In [2]:
bucket = spark._jsc.hadoopConfiguration().get("fs.gs.system.bucket")
data = "gs://" + bucket + "/notebooks/data/"

spark.read.json(data + "flight-data/json/2015-summary.json")\
  .createOrReplaceTempView("flights_view") # DF => SQL

In [3]:
spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count)
FROM flights_view GROUP BY DEST_COUNTRY_NAME
""")\
  .where("DEST_COUNTRY_NAME like 'S%'").where("`sum(count)` > 10")\
  .count() # SQL => DF

12

## Creating Tables

You can create tables from a variety of sources. For instance, the following creates a table from a SELECT statement:

In [4]:
spark.sql('''
CREATE TABLE IF NOT EXISTS flights_from_select USING parquet AS SELECT * FROM flights_view
''')

DataFrame[]

In [5]:
spark.sql('SELECT * FROM flights_from_select').show(5)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|    United States|            Romania|   15|
|    United States|            Croatia|    1|
|    United States|            Ireland|  344|
|            Egypt|      United States|   15|
|    United States|              India|   62|
+-----------------+-------------------+-----+
only showing top 5 rows



In [6]:
spark.sql('''
DESCRIBE TABLE flights_from_select
''').show()

+-------------------+---------+-------+
|           col_name|data_type|comment|
+-------------------+---------+-------+
|  DEST_COUNTRY_NAME|   string|   null|
|ORIGIN_COUNTRY_NAME|   string|   null|
|              count|   bigint|   null|
+-------------------+---------+-------+



## Catalog

The highest-level abstraction in Spark SQL is the Catalog. The Catalog stores metadata about the data in your tables, as well as other helpful things like databases, tables, functions, and views. In PySpark it is available as the `spark.catalog` attribute of the session, and it provides a number of helpful functions for listing tables, databases, and functions.

In [7]:
Cat = spark.catalog

In [8]:
Cat.listTables()

[Table(name='flights_from_select', database='default', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='flights_view', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

In [9]:
spark.sql('SHOW TABLES').show(5, False)

+--------+-------------------+-----------+
|database|tableName          |isTemporary|
+--------+-------------------+-----------+
|default |flights_from_select|false      |
|        |flights_view       |true       |
+--------+-------------------+-----------+



In [10]:
Cat.listDatabases()

[Database(name='default', description='Default Hive database', locationUri='hdfs://bigcluster-m/user/hive/warehouse')]

In [11]:
spark.sql('SHOW DATABASES').show()

+------------+
|databaseName|
+------------+
|     default|
+------------+



In [12]:
Cat.listColumns('flights_from_select')

[Column(name='DEST_COUNTRY_NAME', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='ORIGIN_COUNTRY_NAME', description=None, dataType='string', nullable=True, isPartition=False, isBucket=False),
 Column(name='count', description=None, dataType='bigint', nullable=True, isPartition=False, isBucket=False)]

In [13]:
Cat.listTables()

[Table(name='flights_from_select', database='default', description=None, tableType='MANAGED', isTemporary=False),
 Table(name='flights_view', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

### Caching Tables

In [14]:
spark.sql('''
CACHE TABLE flights_view
''')

DataFrame[]

In [15]:
spark.sql('''
UNCACHE TABLE flights_view
''')

DataFrame[]

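## Explain

`EXPLAIN` shows the physical plan Spark would use for a query without executing it. Note that `just_usa_view` is not created until the next section, so the cell below reports an `AnalysisException` in the `plan` column instead of an actual plan. Run it after creating the view (and before dropping it) to see the physical plan.
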
In [16]:
spark.sql('''
EXPLAIN SELECT * FROM just_usa_view
''').show(1, False)

+-----------------------------------------------------------------------------------------------------------------+
|plan                                                                                                             |
+-----------------------------------------------------------------------------------------------------------------+
|== Physical Plan ==
org.apache.spark.sql.AnalysisException: Table or view not found: just_usa_view; line 2 pos 22|
+-----------------------------------------------------------------------------------------------------------------+



### Creating and Dropping Views

In [17]:
spark.sql('''
CREATE VIEW just_usa_view AS
  SELECT * FROM flights_from_select WHERE dest_country_name = 'United States'
''')

DataFrame[]

In [18]:
spark.sql('''
DROP VIEW IF EXISTS just_usa_view
''')

DataFrame[]

### Dropping Tables

In [19]:
spark.sql('DROP TABLE flights_from_select')

DataFrame[]

In [20]:
spark.sql('DROP TABLE IF EXISTS flights_from_select')

DataFrame[]

## `spark-sql`

Open the `spark-sql` command-line tool and list the available databases and tables. Statements in the tool end with a semicolon. For instance:

`SHOW TABLES;`