# Spark SQL
Revised for Spark 2.0  
<br>
Chris Overton   
2016.11.03  
Adapted most recently from Ivan Corneillet,
with advice from Jean-Francois Omhover and Miles Erickson

## Objectives
- Basic facility with SQL within Spark, as revised in version 2.0
- Mix and match RDD and SQL approaches to data manipulation
- Understand current status of Spark SQL technology

## Introduction: flow of ideas leading to Spark (and esp. to its SQL dialect)
1) 'Old SQL' - successively larger db's  
2a) Rebellion: other ways to deal with structured data  
    - for easier analysis, force data into single rectangular table 
    (e.g. star schema, R data frame, Pandas)  
    - programmatic manipulation of 'flat files'  
    - object-oriented db's  
    - NoSQL  
    - you only get to see the data once: streamed data  
    - graph db's  
2b) Scalability, especially through map-reduce  
3) Spark - combines learning from 2a and 2b

## Introduction: paradigm shift
- Old way: move data to centralized, fast executor of code
- Hadoop: forced parallelism/concurrency through map and reduce steps
    - Try to iron the cycles out of data flow -> immutability, fewer side-effects
- Spark:  
    1) **Move code to data**  
    2) Smarter optimization of process flow  
    3) Do more in memory, rather than through transits through slower storage (e.g. disk) and across machines  
    4) Maintain 'small data feel' with DataFrames

## Introduction: paradigm shift
Where this doesn't work well: highly 'cyclic', mutable data  
<br>
<details><summary>
Q: What's a good example of this?
</summary>
Real-time transactional db, like for banking
<br>
...but Spark would be fine for analyzing transaction logs
</details>

What is Spark SQL?

- Schema-enabled RDD subclasses + SQL DML (Data Manipulation Language)

What are schemas?

- Schema = Table Names + Column Names + Column Types

What are the pros of schemas?

- Can use column names instead of column positions
- Can query using SQL and DataFrame syntax
- Make your data more structured
    - Type safety can be verified before piping data through algorithms
    - Possible storage economy and improved access speed, such as through columnar db's
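
To make the schema benefits above concrete, here is a minimal plain-Python sketch (no Spark involved). The names `SCHEMA` and `parse_row` are hypothetical, made up for this illustration: they pair column names with column types, so values can be accessed by name and a badly typed row is rejected before it reaches any algorithm.

```python
# Toy schema: column names + column types (hypothetical, for illustration only).
SCHEMA = [('id', int), ('state', str), ('amount', float)]

def parse_row(fields):
    """Convert a list of raw strings into a dict keyed by column name.

    Raises ValueError if the field count or a field's type does not match.
    """
    if len(fields) != len(SCHEMA):
        raise ValueError('expected %d fields, got %d' % (len(SCHEMA), len(fields)))
    return {name: cast(value) for (name, cast), value in zip(SCHEMA, fields)}

row = parse_row(['101', 'WA', '300.00'])
print(row['amount'])          # access by column name, not by position

# A row whose amount is not numeric is rejected up front.
rejected = False
try:
    parse_row(['102', 'CA', 'not-a-number'])
except ValueError:
    rejected = True
print(rejected)
```

Spark SQL's `StructType` plays the role of `SCHEMA` here; the checking logic above is only an analogy, not how Spark implements it.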

# The state of Spark with release of version 2.0 
A lot has changed in the recent version upgrade, especially for Spark SQL, where fundamental classes have been overhauled.

In fact, Spark has continued to evolve rapidly. That means knowledge (and software) need to be kept up to date!  
This presentation emphasizes the new version: if someone is enough of an 'early adopter' to use Spark, they're likely to upgrade quickly to version 2.0  

Many texts and online resources cover Spark 1.* --> you may want to focus on more recent documentation  

We cover only the Python interface to Spark, PySpark. Spark also has R, Java, and Scala APIs; Scala is worth learning because Spark itself is implemented in it, and the language fits Spark's API paradigms.

In [21]:
import pyspark
#sc = pyspark.SparkContext() #Don't need to instantiate - is done by pyspark
type(sc)

pyspark.context.SparkContext

## Start Spark SQL

**In Spark >= 2.0**, create a SparkSession from the SparkContext

In [102]:
from pyspark.sql import SparkSession

#left out of following chain:     .config("spark.some.config.option", "some-value") \
spark = SparkSession.builder \
    .master("local[4]") \
    .appName("Word Count") \
    .getOrCreate()
type(spark)

pyspark.sql.session.SparkSession

**The old way for Spark 1.\***: create a HiveContext (or, alternatively, a SQLContext) from the SparkContext.  
**Not used further in this presentation.**

In [3]:
#not run: remains available for integration with Hadoop and backwards compatibility
#sqlContext = pyspark.HiveContext(sc)

sqlContext

<pyspark.sql.context.HiveContext at 0x10ae09990>

What is the difference between SparkContext and HiveContext?

- HiveContext gives you access to the metadata stored in Hive.
- This enables Spark SQL to interact with tables created in Hive.
- Hive tables can be backed by HDFS files, S3, HBase, and other data sources.

## DataFrames in 2.0

- DataFrames are the primary abstraction in Spark SQL.
- Think of DataFrames as RDDs with a schema.

For statically typed languages (Scala, Java), a DataFrame is a `Dataset[Row]`. This means an instance of class Dataset, parametrized by the type of its elements, namely class Row.  
<br>
For dynamically typed languages (Python, R), DataFrame offers a similar API, but it is a separate class that wraps the JVM object rather than a subclass of RDD.  
This sacrifices some execution speed and type safety, but requires less code.

## Spark SQL using CSV

How can I pull in my CSV data and use Spark SQL on it?

In [26]:
%%writefile sales.csv
#ID,Date,Store,State,Product,Amount
101,11/13/2014,100,WA,331,300.00
104,11/18/2014,700,OR,329,450.00
102,11/15/2014,203,CA,321,200.00
106,11/19/2014,202,CA,331,330.00
103,11/17/2014,101,WA,373,750.00
105,11/19/2014,202,CA,321,200.00

Overwriting sales.csv


In [27]:
!cat sales.csv

#ID,Date,Store,State,Product,Amount
101,11/13/2014,100,WA,331,300.00
104,11/18/2014,700,OR,329,450.00
102,11/15/2014,203,CA,321,200.00
106,11/19/2014,202,CA,331,330.00
103,11/17/2014,101,WA,373,750.00
105,11/19/2014,202,CA,321,200.00

- Read the file and convert columns to right types.

In [7]:
rdd = sc.textFile('sales.csv')\
    .filter(lambda line: not line.startswith('#'))\
    .map(lambda line: line.split(','))\
    .map(lambda f: (int(f[0]), f[1], int(f[2]), f[3], int(f[4]), float(f[5])))

In [8]:
rdd.collect()

[(101, u'11/13/2014', 100, u'WA', 331, 300.0),
 (104, u'11/18/2014', 700, u'OR', 329, 450.0),
 (102, u'11/15/2014', 203, u'CA', 321, 200.0),
 (106, u'11/19/2014', 202, u'CA', 331, 330.0),
 (103, u'11/17/2014', 101, u'WA', 373, 750.0),
 (105, u'11/19/2014', 202, u'CA', 321, 200.0)]

## Hard-coding a schema
- Import data types.

In [28]:
from pyspark.sql.types import *

- Define a schema.

In [29]:
schema = StructType( [
    StructField('id', IntegerType(), True),
    StructField('date', StringType(), True),
    StructField('store', IntegerType(), True),
    StructField('state', StringType(), True),
    StructField('product', IntegerType(), True),
    StructField('amount', FloatType(), True)
] )

- Define the DataFrame object.

In [77]:
df = spark.createDataFrame(rdd, schema)

In [103]:
df

DataFrame[id: int, date: string, store: int, state: string, product: int, amount: float]

In [32]:
df.show()

+---+----------+-----+-----+-------+------+
| id|      date|store|state|product|amount|
+---+----------+-----+-----+-------+------+
|101|11/13/2014|  100|   WA|    331| 300.0|
|104|11/18/2014|  700|   OR|    329| 450.0|
|102|11/15/2014|  203|   CA|    321| 200.0|
|106|11/19/2014|  202|   CA|    331| 330.0|
|103|11/17/2014|  101|   WA|    373| 750.0|
|105|11/19/2014|  202|   CA|    321| 200.0|
+---+----------+-----+-----+-------+------+



## Pop Quiz

<details><summary>
What change do we have to make to the code above if we are processing a TSV file instead of a CSV file?
</summary>
<br>
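Replace `line.split(',')` with `line.split('\t')`. A bare `line.split()` also works for simple data, but it splits on any whitespace, so it breaks fields that contain spaces and drops empty fields.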
</details>
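
A quick plain-Python check of the difference between the two calls, using a made-up record whose second field contains a space:

```python
# A tab-separated record whose second field contains a space (made-up data).
line = '101\tNew York\t300.00'

by_whitespace = line.split()      # splits on ANY run of whitespace
by_tab = line.split('\t')         # splits on tab characters only

print(by_whitespace)   # 'New York' is broken into two fields
print(by_tab)          # three fields, as intended
```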

## Using SQL With DataFrames

How can I run SQL queries on DataFrames?

- Register the DataFrame as a temporary table with the SparkSession (formerly SQLContext!)

In [33]:
#registerTempTable('sales') still works in 2.0, but is deprecated
df.createOrReplaceTempView('sales')

- Run queries on the registered tables.

In [104]:
result = spark.sql('''
SELECT state, amount
    FROM sales
    WHERE amount > 100
''')

## Pop Quiz

<details><summary>
Wait a minute?! What just got executed?
</summary>
<br>
Nothing yet: `spark.sql()` only records a transformation, which is not run until an action requires it
</details>

- View the results using `show()` or `collect()`.

In [81]:
result.show()

+-----+------+
|state|amount|
+-----+------+
|   WA| 300.0|
|   OR| 450.0|
|   CA| 200.0|
|   CA| 330.0|
|   WA| 750.0|
|   CA| 200.0|
+-----+------+



In [17]:
result.collect()

[Row(state=u'WA', amount=300.0),
 Row(state=u'OR', amount=450.0),
 Row(state=u'CA', amount=200.0),
 Row(state=u'CA', amount=330.0),
 Row(state=u'WA', amount=750.0),
 Row(state=u'CA', amount=200.0)]

## Pop Quiz

<details><summary>
If I run `result.collect()` twice how many times will the data be read from disk?
</summary>
1. RDDs are lazy.<br>
2. Therefore the data will be read twice.<br>
3. Unless you cache the RDD, all transformations in the RDD will execute on each action.<br>
</details>
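
The two quizzes above (lazy transformations, repeated reads without caching) can be mimicked in a few lines of plain Python. This is a toy analogy under stated assumptions, not Spark's implementation: `LazyTable`, `read_sales`, and the `reads` counter are hypothetical names invented for the sketch.

```python
reads = {'count': 0}

def read_sales():
    """Pretend to read the source file; count how often that happens."""
    reads['count'] += 1
    return [300.0, 450.0, 200.0]

class LazyTable(object):
    """Toy stand-in for a lazily evaluated dataset."""

    def __init__(self, source, steps=()):
        self.source = source          # function that "reads from disk"
        self.steps = tuple(steps)     # recorded transformations
        self.use_cache = False
        self.cached_rows = None

    def map(self, fn):
        # Transformation: only record the step, do no work yet.
        return LazyTable(self.source, self.steps + (fn,))

    def cache(self):
        # Mark for caching; the first action materializes the rows.
        self.use_cache = True
        return self

    def collect(self):
        # Action: reuse cached rows, or read the source and run every step.
        if self.cached_rows is not None:
            return list(self.cached_rows)
        rows = self.source()
        for fn in self.steps:
            rows = [fn(r) for r in rows]
        if self.use_cache:
            self.cached_rows = rows
        return list(rows)

doubled = LazyTable(read_sales).map(lambda amount: amount * 2)
reads_after_transformation = reads['count']     # nothing has run yet

doubled.collect()
doubled.collect()
reads_without_cache = reads['count']            # one read per action

cached = LazyTable(read_sales).map(lambda amount: amount * 2).cache()
cached.collect()
cached.collect()
reads_with_cache = reads['count'] - reads_without_cache

print('%d %d %d' % (reads_after_transformation,
                    reads_without_cache, reads_with_cache))
```

The counter stays at zero until the first `collect()`, grows by one per action without caching, and grows only once when the table is cached.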

## Caching Tables

How can I cache the RDD for a table to avoid roundtrips to disk on each action?

- Use `cacheTable()`. In v 2.0 this method is not on SparkSession itself: you reach it through `spark.catalog` (in 1.* it was a SQLContext method).

Note: this also lets you see the execution time, and gets it out of the way.

In [136]:
#Old v 1.* code: sqlContext.cacheTable('sales')
spark.catalog.cacheTable('sales')

- This is particularly useful if you are using Spark SQL to explore data.

# Saving Results - csv, json, and parquet file formats

How can I save the results back out to the file system?

(first, make sure the files don't exist...)

In [108]:
!rm -rf high-sales.json high-sales.parquet

- You can write them out using the JSON format:

In [84]:
result.toJSON().saveAsTextFile('high-sales.json')

- Or you can save them as Parquet...:

In [39]:
result.write.parquet('high-sales.parquet')

- Let's take a look at the files.

In [40]:
!ls -l sales.csv
!echo sales.csv
!cat sales.csv

-rw-r--r--  1 christopher.overton  staff  233 Nov  3 08:16 sales.csv
sales.csv
#ID,Date,Store,State,Product,Amount
101,11/13/2014,100,WA,331,300.00
104,11/18/2014,700,OR,329,450.00
102,11/15/2014,203,CA,321,200.00
106,11/19/2014,202,CA,331,330.00
103,11/17/2014,101,WA,373,750.00
105,11/19/2014,202,CA,321,200.00

In [41]:
!ls -l high-sales.json
!for i in high-sales.json/part-*; do echo $i; cat $i; done

total 16
-rw-r--r--  1 christopher.overton  staff   0 Nov  3 08:26 _SUCCESS
-rw-r--r--  1 christopher.overton  staff  90 Nov  3 08:26 part-00000
-rw-r--r--  1 christopher.overton  staff  90 Nov  3 08:26 part-00001
high-sales.json/part-00000
{"state":"WA","amount":300.0}
{"state":"OR","amount":450.0}
{"state":"CA","amount":200.0}
high-sales.json/part-00001
{"state":"CA","amount":330.0}
{"state":"WA","amount":750.0}
{"state":"CA","amount":200.0}


In [42]:
!ls -l high-sales.parquet 
!for i in high-sales.parquet/part-*; do echo $i; cat $i; done

total 16
-rw-r--r--  1 christopher.overton  staff    0 Nov  3 08:27 _SUCCESS
-rw-r--r--  1 christopher.overton  staff  533 Nov  3 08:27 part-r-00000-14cef377-3a09-4adc-ad38-a83d53049287.snappy.parquet
-rw-r--r--  1 christopher.overton  staff  549 Nov  3 08:27 part-r-00001-14cef377-3a09-4adc-ad38-a83d53049287.snappy.parquet
high-sales.parquet/part-r-00000-14cef377-3a09-4adc-ad38-a83d53049287.snappy.parquet
PAR1 02, WACA       4WA   OR   CA $(,   �C  HC    D     �C  �C  HC<Hspark_schema %state%  %amount ,&5 statejl&<WACA    &t5 amountfj&t<  �C  HC    � )org.apache.spark.sql.parquet.row.metadata�{"type":"struct","fields":[{"name":"state","type":"string","nullable":true,"metadata":{}},{"name":"amount","type":"float","nullable":true,"metadata":{}}]} ;parquet-mr (build 32c46643845ea8a705c35d4ec8fc654cc8ff816d) �  PAR1high-sales.parquet/part-r-00001-14cef377-3a09-4adc-ad38-a8

## JSON vs CSV vs Parquet 

What are the pros and cons of JSON vs CSV vs Parquet?

Feature | JSON | CSV | Parquet
---|---|---|---
Human-Readable | Yes | Yes | No
Compact | No | Moderately | Highly
Columnar | No | No | Yes
Self-Describing | Yes | No | Yes
Requires Schema | No | Yes | No
Splittable | Yes | Yes | Yes
Popular | Getting there | Yes | Not yet

What are columnar data formats?

- Columnar data formats store data column-wise.
- This allows them to do run-length encoding (RLE).
- Instead of storing `San Francisco` 100 times, they will just store it once and the count of how many times it occurs.
- When the data is repetitive and redundant as unstructured big data tends to be, columnar data formats use up a fraction of the disk space of non-columnar formats.
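
A minimal run-length encoding sketch in plain Python, to show why a repetitive column compresses so well. The helper names `rle_encode` and `rle_decode` are made up for this illustration; Parquet's actual encodings are more elaborate than this.

```python
from itertools import groupby

def rle_encode(column):
    """Collapse consecutive repeats into [value, run_length] pairs."""
    return [[value, len(list(run))] for value, run in groupby(column)]

def rle_decode(pairs):
    """Expand [value, run_length] pairs back into the original column."""
    out = []
    for value, count in pairs:
        out.extend([value] * count)
    return out

city_column = ['San Francisco'] * 100 + ['Seattle'] * 3
encoded = rle_encode(city_column)
print(encoded)   # [['San Francisco', 100], ['Seattle', 3]]
```

Note that this only pays off when equal values sit next to each other, which is exactly what column-wise storage tends to produce.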

What are splittable data formats?

- On big data systems data is stored in blocks.
- For example, on HDFS data is stored in 64/128 MB blocks.
- Splittable data formats enable records in a block to be processed without looking at the entire file.

What are some examples of a non-splittable data format?

- Gzip
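
Here is a rough plain-Python sketch of what "splittable" means for newline-delimited text. It follows the usual convention that a split skips the line it lands in (the previous split finishes it) and reads past its end to complete its last line. The function name `records_in_split` and the tiny block size are assumptions made for the illustration. A gzip stream, by contrast, generally has to be decompressed from its beginning, so a worker cannot start in the middle.

```python
def records_in_split(data, start, end):
    """Return the complete lines owned by the byte range [start, end)."""
    pos = start
    if start != 0:
        # The line we land in belongs to the previous split: skip past it.
        newline = data.find(b'\n', start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    # Own every line that starts at or before `end`, reading past `end`
    # if needed to finish the last one.
    while pos <= end and pos < len(data):
        newline = data.find(b'\n', pos)
        if newline == -1:
            newline = len(data)
        records.append(data[pos:newline])
        pos = newline + 1
    return records

data = (b'{"id":101,"state":"WA"}\n'
        b'{"id":104,"state":"OR"}\n'
        b'{"id":102,"state":"CA"}\n')

block_size = 30   # tiny "block" for illustration; HDFS uses 64/128 MB
splits = [(s, min(s + block_size, len(data)))
          for s in range(0, len(data), block_size)]
per_split = [records_in_split(data, s, e) for s, e in splits]

for (s, e), recs in zip(splits, per_split):
    print('bytes %d-%d: %d record(s)' % (s, e, len(recs)))
```

Each split can be handed to a different worker, and together they see every record exactly once.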

## Parsing text files

- JSON-formatted data can be saved as text using `saveAsTextFile()` and read using `textFile()`
- JSON works well with Spark SQL because the data has an embedded schema  
<br>
- Parquet is similarly self-describing, and is more efficient for larger tables  
<br>
- CSV carries no type information, so when you load it you still have to supply the schema yourself (or ask Spark to infer one)

In [44]:
%%writefile sales.json
{"id":101, "date":"11/13/2014", "store":100, "state":"WA", "product":331, "amount":300.00}
{"id":104, "date":"11/18/2014", "store":700, "state":"OR", "product":329, "amount":450.00}
{"id":102, "date":"11/15/2014", "store":203, "state":"CA", "product":321, "amount":200.00}
{"id":106, "date":"11/19/2014", "store":202, "state":"CA", "product":331, "amount":330.00}
{"id":103, "date":"11/17/2014", "store":101, "state":"WA", "product":373, "amount":750.00}
{"id":105, "date":"11/19/2014", "store":202, "state":"CA", "product":321, "amount":200.00}

Writing sales.json


In [45]:
!cat sales.json

{"id":101, "date":"11/13/2014", "store":100, "state":"WA", "product":331, "amount":300.00}
{"id":104, "date":"11/18/2014", "store":700, "state":"OR", "product":329, "amount":450.00}
{"id":102, "date":"11/15/2014", "store":203, "state":"CA", "product":321, "amount":200.00}
{"id":106, "date":"11/19/2014", "store":202, "state":"CA", "product":331, "amount":330.00}
{"id":103, "date":"11/17/2014", "store":101, "state":"WA", "product":373, "amount":750.00}
{"id":105, "date":"11/19/2014", "store":202, "state":"CA", "product":321, "amount":200.00}

- Now read in the file.

In [89]:
df = spark.read.json('sales.json')

In [48]:
df

DataFrame[amount: double, date: string, id: bigint, product: bigint, state: string, store: bigint]

In [49]:
df.show()

+------+----------+---+-------+-----+-----+
|amount|      date| id|product|state|store|
+------+----------+---+-------+-----+-----+
| 300.0|11/13/2014|101|    331|   WA|  100|
| 450.0|11/18/2014|104|    329|   OR|  700|
| 200.0|11/15/2014|102|    321|   CA|  203|
| 330.0|11/19/2014|106|    331|   CA|  202|
| 750.0|11/17/2014|103|    373|   WA|  101|
| 200.0|11/19/2014|105|    321|   CA|  202|
+------+----------+---+-------+-----+-----+



- Here is how to look at a 50% sample of the DataFrame (without
  replacement).

In [90]:
df.sample(False, 0.5).show()

+------+----------+---+-------+-----+-----+
|amount|      date| id|product|state|store|
+------+----------+---+-------+-----+-----+
| 200.0|11/15/2014|102|    321|   CA|  203|
| 330.0|11/19/2014|106|    331|   CA|  202|
| 750.0|11/17/2014|103|    373|   WA|  101|
+------+----------+---+-------+-----+-----+



- Here is how to inspect the schema.

In [91]:
df.schema

StructType(List(StructField(amount,DoubleType,true),StructField(date,StringType,true),StructField(id,LongType,true),StructField(product,LongType,true),StructField(state,StringType,true),StructField(store,LongType,true)))

In [52]:
df.schema.fields

[StructField(amount,DoubleType,true),
 StructField(date,StringType,true),
 StructField(id,LongType,true),
 StructField(product,LongType,true),
 StructField(state,StringType,true),
 StructField(store,LongType,true)]

In [92]:
df.describe()

DataFrame[summary: string, amount: string, id: string, product: string, store: string]

In [54]:
df.printSchema()

root
 |-- amount: double (nullable = true)
 |-- date: string (nullable = true)
 |-- id: long (nullable = true)
 |-- product: long (nullable = true)
 |-- state: string (nullable = true)
 |-- store: long (nullable = true)



## DataFrame Methods

How can I slice the DataFrame by column and by row?

- DataFrames provide a *Pandas*-like API for manipulating data.
- To select specific columns use `select()`.

In [55]:
df.select('state', 'amount').show()

+-----+------+
|state|amount|
+-----+------+
|   WA| 300.0|
|   OR| 450.0|
|   CA| 200.0|
|   CA| 330.0|
|   WA| 750.0|
|   CA| 200.0|
+-----+------+



- You can also modify the columns while selecting.

In [56]:
df.select('state', df.amount + 100).show()

# (or)
df.select('state', df['amount'] + 100).show()

+-----+--------------+
|state|(amount + 100)|
+-----+--------------+
|   WA|         400.0|
|   OR|         550.0|
|   CA|         300.0|
|   CA|         430.0|
|   WA|         850.0|
|   CA|         300.0|
+-----+--------------+

+-----+--------------+
|state|(amount + 100)|
+-----+--------------+
|   WA|         400.0|
|   OR|         550.0|
|   CA|         300.0|
|   CA|         430.0|
|   WA|         850.0|
|   CA|         300.0|
+-----+--------------+



- You can evaluate boolean expressions.

In [93]:
df.select('state', df.amount < 300).show()

df.select('state', df.amount == 300).show()

+-----+--------------+
|state|(amount < 300)|
+-----+--------------+
|   WA|         false|
|   OR|         false|
|   CA|          true|
|   CA|         false|
|   WA|         false|
|   CA|          true|
+-----+--------------+

+-----+--------------+
|state|(amount = 300)|
+-----+--------------+
|   WA|          true|
|   OR|         false|
|   CA|         false|
|   CA|         false|
|   WA|         false|
|   CA|         false|
+-----+--------------+



- You can group values.

In [111]:
df.select('state', 'amount').groupBy('state').count().show()

+-----+-----+
|state|count|
+-----+-----+
|   OR|    1|
|   CA|    3|
|   WA|    2|
+-----+-----+



- You can filter rows based on conditions.

In [59]:
df.filter(df.state == 'CA').select('id').show()

+---+
| id|
+---+
|102|
|106|
|105|
+---+



- You can use SQL to write more elaborate queries.

In [61]:
#registerTempTable('sales') still works in 2.0, but is deprecated
df.createOrReplaceTempView('sales')

spark.sql('''
SELECT id
    FROM sales
    WHERE amount > 300
''')\
    .show()

+---+
| id|
+---+
|104|
|106|
|103|
+---+



How can I convert DataFrames to regular RDDs?

Old way: Spark 1.*:  
- DataFrames are also RDDs.
- You can use `map()` to iterate over the rows of the DataFrame.
- You can access the values in a row by field name, either as an attribute (`row.amount`) or as an item (`row['amount']`).

In [64]:
#Old v 1.* code
#df.map(lambda row: row.amount).collect()

New way: Spark 2.0: get the underlying RDD from the DataFrame's `rdd` attribute.

In [95]:
(type(df), type(df.rdd))

(pyspark.sql.dataframe.DataFrame, pyspark.rdd.RDD)

- You can also use `collect()` or `take()` to pull DataFrame rows into
  the driver.

In [69]:
df.collect()

[Row(amount=300.0, date=u'11/13/2014', id=101, product=331, state=u'WA', store=100),
 Row(amount=450.0, date=u'11/18/2014', id=104, product=329, state=u'OR', store=700),
 Row(amount=200.0, date=u'11/15/2014', id=102, product=321, state=u'CA', store=203),
 Row(amount=330.0, date=u'11/19/2014', id=106, product=331, state=u'CA', store=202),
 Row(amount=750.0, date=u'11/17/2014', id=103, product=373, state=u'WA', store=101),
 Row(amount=200.0, date=u'11/19/2014', id=105, product=321, state=u'CA', store=202)]

How can I convert Spark DataFrames to Pandas data frames?

- Use `toPandas()` to convert Spark DataFrames to Pandas.

In [96]:
pandas_df = df.toPandas()

print(type(pandas_df))
pandas_df

<class 'pandas.core.frame.DataFrame'>


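| | amount | date | id | product | state | store |
|---|---|---|---|---|---|---|
| 0 | 300.0 | 11/13/2014 | 101 | 331 | WA | 100 |
| 1 | 450.0 | 11/18/2014 | 104 | 329 | OR | 700 |
| 2 | 200.0 | 11/15/2014 | 102 | 321 | CA | 203 |
| 3 | 330.0 | 11/19/2014 | 106 | 331 | CA | 202 |
| 4 | 750.0 | 11/17/2014 | 103 | 373 | WA | 101 |
| 5 | 200.0 | 11/19/2014 | 105 | 321 | CA | 202 |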


## User Defined Functions

How can I create my own User-Defined Functions ("UDF")?

- Import the types (e.g. StringType, IntegerType, FloatType) that we are returning  
- Register the new udf function

In [149]:
#The 1.* old way: 
def add_tax1(amount):
    return amount * 1.10
#sqlContext.registerFunction('add_tax1', add_tax1, FloatType())

#The 2.0 new way:
from pyspark.sql.functions import udf
from pyspark.sql.types import *

spark.udf.register('add_tax2', lambda x: x * 1.10, FloatType())
#or use legacy call:
#spark.catalog.registerFunction('add_tax2', lambda x: x * 1.10, FloatType())
spark.sql("select amount, add_tax2(amount) as amt_plus_tax from sales").show()

+------+------------+
|amount|amt_plus_tax|
+------+------------+
| 300.0|       330.0|
| 450.0|       495.0|
| 200.0|       220.0|
| 330.0|       363.0|
| 750.0|       825.0|
| 200.0|       220.0|
+------+------------+



- The optional last argument of `spark.udf.register` (and of the legacy `registerFunction`) is the function's return
  type; the default is `StringType`.

- UDFs can use single or multiple arguments. 


## Core and Spark SQL Changes (from 2.0 release notes)

One of the largest changes in Spark 2.0 is the new updated APIs:

- Unifying DataFrame and Dataset: In Scala and Java, DataFrame and Dataset have been unified, i.e. DataFrame is just a type alias for Dataset of Row. In Python and R, given the lack of type safety, DataFrame is the main programming interface.
- SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
- A new, streamlined configuration API for SparkSession
- Simpler, more performant accumulator API
- A new, improved Aggregator API for typed aggregation in Datasets

## SQL (from 2.0 release notes)

Spark 2.0 substantially improved SQL functionalities with SQL2003 support. Spark SQL can now run all 99 TPC-DS queries. More prominently, we have improved:

- A native SQL parser that supports both ANSI-SQL as well as Hive QL
- Native DDL command implementations
- Subquery support, including
    - Uncorrelated Scalar Subqueries
    - Correlated Scalar Subqueries
    - NOT IN predicate Subqueries (in WHERE/HAVING clauses)
    - IN predicate subqueries (in WHERE/HAVING clauses)
    - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
- View canonicalization support

In addition, when building without Hive support, Spark SQL should have almost all the functionality as when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.