In [None]:
# Consistent with Spark 2.1.0

## Objective 
- Interact with Spark RDDs using SQL


### Why?
- SQL is familiar
- Easy to use
- Portable

#### Use a SparkSession (formerly a HiveContext) to bridge the gap between an RDD and a "Hive table"

In [None]:
import pyspark
spark = pyspark.sql.SparkSession.builder \
            .master("local[2]") \
            .appName("SQL Lecture") \
            .getOrCreate()

- A SparkSession plays the role of a HiveContext, which in turn is an extended SQLContext.

- In Spark 2, use SparkSession.

Q: What is the difference between SparkContext and SparkSession (or HiveContext)?

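- SparkContext is the entry point for the RDD API; SparkSession (and
  HiveContext before it) adds DataFrames and SQL on top of it.

- HiveContext gives you access to the metadata stored in Hive.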

- This enables Spark SQL to interact with tables created in Hive.

- Hive tables can be backed by HDFS files, S3, HBase, and other data
  sources.


Spark SQL Using CSV
-------------------

In [None]:
%%writefile sales.csv
#ID,Date,Store,State,Product,Amount
101,11/13/2014,100,WA,331,300.00
104,11/18/2014,700,OR,329,450.00
102,11/15/2014,203,CA,321,200.00
106,11/19/2014,202,CA,331,330.00
103,11/17/2014,101,WA,373,750.00
105,11/19/2014,202,CA,321,200.00

### Read this data as a Spark DataFrame
- Deal with header
- Split into fields
- Assign column types

In [None]:
# Load RDD (the SparkSession exposes its SparkContext)
sc = spark.sparkContext
rdd = sc.textFile('sales.csv')

# Deal with header
rdd = rdd.filter(lambda line: not line.startswith('#'))

# Split into fields
rdd = rdd.map(lambda line: line.split(','))

# Assign variable types
# (index the fields; tuple-unpacking lambdas are a syntax error in Python 3)
rdd = rdd.map(lambda f: (int(f[0]), f[1], int(f[2]), f[3], int(f[4]), float(f[5])))

In [None]:
# Create a schema
from pyspark.sql.types import *
schema = StructType([
        StructField('id', IntegerType(), True),
        StructField('date', StringType(), True),
        StructField('store', IntegerType(), True),
        StructField('state', StringType(), True),
        StructField('product', IntegerType(), True),
        StructField('amount', FloatType(), True)])

In [None]:
df = spark.createDataFrame(rdd, schema)

In [None]:
rdd.collect()

In [None]:
df.show()

Using SQL With DataFrames
-------------------------

Register the table so it can be seen in 'SQL World'

In [None]:
# registerTempTable() is deprecated since Spark 2.0
df.createOrReplaceTempView('sales')

Run queries on the registered table.

In [None]:
query = '''
    SELECT * FROM sales 
    WHERE amount > 300
    '''

In [None]:
result = spark.sql(query)

In [None]:
type(result)

In [None]:
result

In [None]:
# Remember, you have to call the .show() method
# of a Spark DF to see its contents.

result.show()

Saving Results
--------------

- Saving to JSON is preferred (for ease of use)

In [None]:
!rm -rf high-sales.json
result.toJSON().saveAsTextFile('high-sales.json')

In [None]:
cat high-sales.json/part-00000

Spark SQL Using JSON Data
-------------------------

Q: What is JSON-formatted data?

- In Spark the JSON format means that each line is a JSON document.

- JSON-formatted data can be saved as text using `saveAsTextFile()` and
  read using `textFile()`.

- JSON works well with Spark SQL because the data has an embedded
  schema.
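
To make the "one JSON document per line" layout concrete, here is a small sketch that uses only Python's standard `json` module (no Spark required). The two records are sample values in the same shape as the sales data.

```python
import json

# Line-delimited JSON: every line is a complete, standalone document.
text = (
    '{"id": 101, "state": "WA", "amount": 300.0}\n'
    '{"id": 104, "state": "OR", "amount": 450.0}\n'
)

# Parse line by line; a whole-file json.loads(text) would fail here.
records = [json.loads(line) for line in text.splitlines() if line.strip()]

# Field names travel with every record, so a reader can derive a schema.
fields = sorted(set().union(*(r.keys() for r in records)))
```

Because each line parses on its own, the file can be split across workers at line boundaries.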

Q: What other formats are supported by Spark SQL?

- Spark SQL also supports Parquet, which is a compact binary format
  for big data.

- CSV files carry no type information, so you have to supply the
  schema yourself when you load the data.

Parsing JSON Data
-----------------

In [None]:
%%writefile sales.json
{"id":101, "date":"11/13/2014", "store":100, "state":"WA", "product":331, "amount":300.00}
{"id":104, "date":"11/18/2014", "store":700, "state":"OR", "product":329, "amount":450.00}
{"id":102, "date":"11/15/2014", "store":203, "state":"CA", "product":321, "amount":200.00}
{"id":106, "date":"11/19/2014", "store":202, "state":"CA", "product":331, "amount":330.00}
{"id":103, "date":"11/17/2014", "store":101, "state":"WA", "product":373, "amount":750.00}
{"id":105, "date":"11/19/2014", "store":202, "state":"CA", "product":321, "amount":200.00}

Read in the file.

In [None]:
df = spark.read.json('sales.json')

When the SparkSession reads a JSON file, it automatically infers the schema.

In [None]:
df.show()

Look at a 50% sample of the DataFrame (without replacement).

In [None]:
df.sample(False, .5).show()

Inspect the schema.

In [None]:
df.printSchema()

### Execute SQL statements

In [None]:
df.createOrReplaceTempView('sales')
result = spark.sql('select state, store from sales')
result.show()

In [None]:
result.collect()

DataFrame Methods
-----------------

Slice the DataFrame by column and by row

In [None]:
df.select(['state','store']).show()

In [None]:
df.filter(df.amount > 300).show()

Modify the columns while selecting.

In [None]:
df.select('state','store', df.amount + 100).show()

In [None]:
df.select('state','store',df['amount'] + 100).show()

Evaluate boolean expressions.

In [None]:
df.select('state', df.amount > 300).show()

GroupBy

In [None]:
df.select('state','amount').groupby('state').mean().show()

Use SQL to write more elaborate queries.

In [None]:
query = '''
    SELECT state, AVG(amount) as avg_amount
    FROM sales
    GROUP BY state
    '''
result = spark.sql(query)
result.show()

Write user-defined functions in Python

In [None]:
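from pyspark.sql.types import DoubleType

def add_one(x):
    return x + 1

# Declare the return type; it defaults to StringType.
spark.udf.register('add_one', add_one, DoubleType())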

In [None]:
result = spark.sql('''
    SELECT state, add_one(amount) AS amount_plus_one
    FROM sales
    ''')
result.show()

### DataFrames are backed by RDDs.
- `df.rdd` is an RDD of `Row` objects
- Field names are attributes of each `Row`

In [None]:
df.rdd.map(lambda row: row.amount**2).collect()

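### Once you are ready for 'big math', convert your DataFrame to Pandas
- `toPandas()` collects every row to the driver, so use it only when the data fits in local memory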

In [None]:
pandas_df = df.toPandas()
pandas_df

### More complicated schemas
- Nested JSON
- Field of lists

In [None]:
%%writefile people.json
{"name":"Yin", "address":{"city":"SF","state":"CA"}, "hobbies" : ["fishing","tennis"]}
{"name":"Mike", "address":{"city":"SE", "state":"WA"}, "hobbies":["coding", "fishing"]}
{"name":"Mary", "address":{"city":"SE", "state":"WA"}, "hobbies":["playing chess"]}

Read people.json using the SparkSession

In [None]:
people = spark.read.json('people.json')
people.show()

Get name and city for each record

In [None]:
people.createOrReplaceTempView('people')
result = spark.sql('''
    SELECT name, address.city
    FROM people
    ''')
result.show()

Lists are harder to deal with: use `LATERAL VIEW explode()` to get one row per list element.

In [None]:
result = spark.sql('''
    SELECT * FROM people
    LATERAL VIEW explode(hobbies) h as hobby
    ''')
result.show()
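
To see what `explode` does to each row, here is a plain-Python sketch of the same row-multiplying behavior (no Spark required). The helper name `explode` and its parameters are hypothetical stand-ins for illustration, not Spark's implementation.

```python
people = [
    {"name": "Yin", "hobbies": ["fishing", "tennis"]},
    {"name": "Mike", "hobbies": ["coding", "fishing"]},
    {"name": "Mary", "hobbies": ["playing chess"]},
]

def explode(rows, list_field, new_field):
    """Yield one copy of each row per element of row[list_field]."""
    for row in rows:
        for item in row[list_field]:
            out = dict(row)        # keep all original fields
            out[new_field] = item  # add the single list element
            yield out

exploded = list(explode(people, "hobbies", "hobby"))
pairs = [(r["name"], r["hobby"]) for r in exploded]
```

Three input records become five output rows, one per hobby. A record with an empty list would produce no rows at all, which matches a plain (non-`OUTER`) lateral view.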