# **SparkSQL Lab**
#### In this lab, you will write code to run SQL queries in Spark, which makes analytics work simpler and faster.
#### **During this lab we will cover:**
#### *Part 1:* Create a SchemaRDD (or DataFrame) 
#### *Part 2:* Loading data programmatically
#### *Part 3:* Caching for performance
#### *Part 4:* How many authors tagged as spam?
#### Reference for Spark RDD [Spark's Python API](https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD)

### **Part 1: Create a SchemaRDD (or DataFrame)**

In [14]:
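# `sc` is the SparkContext that the pyspark shell or notebook kernel provides.
# Without it, this cell fails with: NameError: name 'sc' is not defined
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)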

#### **(1a) DataFrame from existing RDD**

In [None]:
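from pyspark.sql import Row
# sc.parallelize returns an RDD of Rows; createDataFrame turns it into a DataFrame.
rdd = sc.parallelize([Row(name="Gordon", beverage="coffee"),
                      Row(name="Katrina", beverage="tea"),
                      Row(name="Graham", beverage="juice")])
df = sqlContext.createDataFrame(rdd)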

### **Part 2: Loading data programmatically**

#### **(2a) Read local JSON file to DataFrame**
#### Thanks to PIXNET for the hashed spam data: [PIXNET HACKATHON 2015](https://pixnethackathon2015.events.pixnet.net/)

In [None]:
# Spark 1.4+ (on Spark 1.3, use sqlContext.jsonFile(path) instead)
df = sqlContext.read.json("examples/src/main/resources/people.json")
# Equivalent generic form, also Spark 1.4+
#df = sqlContext.read.load("examples/src/main/resources/people.json", format="json")
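A note on input layout: as far as I know, the Spark 1.x JSON reader expects one self-contained JSON object per line, so a pretty-printed JSON array will not load cleanly with `read.json`. The sketch below converts the text of such an array into that line-oriented layout. The function name `array_to_json_lines` is hypothetical, and the approach assumes the file is small enough to parse in memory on the driver.

```python
import json


def array_to_json_lines(text):
    """Convert the text of a JSON array into one compact JSON object per line."""
    records = json.loads(text)
    # sort_keys keeps the field order stable from one run to the next
    return "\n".join(json.dumps(record, sort_keys=True) for record in records)
```

Writing the returned string to a new file gives `sqlContext.read.json` something it can split line by line.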

#### **(2b) Read data from HDFS**

In [None]:
from pyspark.sql import Row

# Load a text file and convert each line to a Row.
lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = sqlContext.createDataFrame(people)
schemaPeople.registerTempTable("people")
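The cell above calls `int(p[1])` on every line, so a single malformed line fails the whole job. Below is a minimal sketch of a more defensive variant, assuming each valid line looks like `name, age`. The names `parse_person` and `load_people` are hypothetical helpers, not part of the lab or of Spark.

```python
def parse_person(line):
    """Return (name, age) for a 'name, age' line, or None if it is malformed."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 2 or not parts[0]:
        return None
    try:
        return parts[0], int(parts[1])
    except ValueError:
        return None


def load_people(sc, sqlContext, path):
    """Hypothetical helper: register the `people` table, skipping bad lines."""
    # Imported here so parse_person can be tried without a Spark installation.
    from pyspark.sql import Row
    parsed = sc.textFile(path).map(parse_person).filter(lambda p: p is not None)
    people = parsed.map(lambda p: Row(name=p[0], age=p[1]))
    df = sqlContext.createDataFrame(people)
    df.registerTempTable("people")
    return df
```

Once the table is registered, it can be queried with plain SQL, for example `sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19").collect()`.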

#### **(2c) Read Hive table**

In [None]:
from pyspark.sql import HiveContext
sqlContext = HiveContext(sc)

sqlContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
sqlContext.sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")

# Queries can be expressed in HiveQL.
results = sqlContext.sql("FROM src SELECT key, value").collect()

#### **(2x) User defined functions (UDF) in Spark SQL**
#### Remember that Hive is configured by placing your `hive-site.xml` file in `conf/`.

In [None]:
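from pyspark.sql.types import IntegerType

# Register a UDF that returns the length of a string.
sqlContext.registerFunction("strLenPython", lambda x: len(x), IntegerType())
# Assumes a table named `tweets` with a string column `text` is already registered.
lengthSchemaRDD = sqlContext.sql("SELECT strLenPython(text) FROM tweets LIMIT 10")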
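One thing to watch with the lambda above: `len(None)` raises a `TypeError`, so a NULL value in the column would fail the query. A small sketch of a NULL-safe version follows. The names `str_len` and `register_str_len` are hypothetical, and the sketch assumes the column holds strings or NULLs.

```python
def str_len(text):
    """Length of a string, or None when the input is NULL."""
    return None if text is None else len(text)


def register_str_len(sqlContext):
    """Hypothetical helper: expose str_len to SQL under the name strLenPython."""
    # Imported here so str_len can be tried without a Spark installation.
    from pyspark.sql.types import IntegerType
    sqlContext.registerFunction("strLenPython", str_len, IntegerType())
```

Keeping the logic in a named function rather than a lambda also makes it easy to check in plain Python before registering it.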

In [13]:
from test_helper import Test
# TEST Pluralize and test (1b)
Test.assertEquals(y, 1, "y is incorrect")

ImportError: No module named test_helper

#### **(2x) Saving to persistent tables**
#### `saveAsTable`: saves the contents of this DataFrame to a data source as a table.

In [None]:
# Note: a built-in 'csv' source ships with Spark 2.0+; older versions need the spark-csv package.
df.saveAsTable('test_table', format='csv', mode='overwrite', path='file:///')

In [32]:
import subprocess
print subprocess.check_output(["ls"])

Kaggle_Device.ipynb
Lucky_Draw.ipynb
ML_lab1_review_student.ipynb
PIXNET_Spam_2015.ipynb
SparkSQL_handson_2015.ipynb
UTM_user_detection.ipynb
Vagrantfile
lab1_word_count_student.ipynb
mooc-setup-master



### **Part 3: Caching for performance**

In [None]:
sqlContext.cacheTable("tableName")

In [None]:
sqlContext.uncacheTable("tableName")
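To see whether caching helps for a given table, it can be useful to time the same query before and after `cacheTable`. The following is a rough sketch only: `timed` and `compare_cached` are hypothetical helpers, wall-clock timing is noisy, and the first query after `cacheTable` may still pay the cost of populating the cache.

```python
import time


def timed(action):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.time()
    result = action()
    return result, time.time() - start


def compare_cached(sqlContext, table, query):
    """Hypothetical helper: time `query` without and with `table` cached."""
    run = lambda: sqlContext.sql(query).collect()
    _, uncached_seconds = timed(run)
    sqlContext.cacheTable(table)
    timed(run)                       # warm-up run, may populate the cache
    _, cached_seconds = timed(run)   # this run should read cached data
    sqlContext.uncacheTable(table)
    return uncached_seconds, cached_seconds
```

For example, `compare_cached(sqlContext, "people", "SELECT COUNT(*) FROM people")` would return the two timings for a table registered under the name `people`.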

### **Part 4: How many authors tagged as spam?**

#### Use the `wordCount()` function and `takeOrdered()` to obtain the three most frequent author IDs and their counts.

In [41]:
spam_data = '/Users/etu/Desktop/kaggle/spam/user-action-log.json'

print subprocess.check_output("cat %s | head -n 7"%(spam_data), shell=True)

[
    {
        "operate_at": 1427817600,
        "operate_date": "2015-04-01T00:00:00+08:00",
        "author": "b1e26f5cbf4d68eee850946a7d788666bffa7cbd",
        "action": "87a4425c5b350b685a97b5c7aa123e74599dd481"
    },
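Each record in the output above carries an `author` field, so the task comes down to counting records per author. The sketch below shows one possible shape for that pipeline. It assumes `wordCount()` behaves like the usual word-count helper (map each item to `(item, 1)`, then sum by key), which this notebook does not define, and that `records` is a list of dicts already parsed from the file. All function names here are hypothetical. A plain-Python version is included for checking the Spark result on a small sample.

```python
from collections import Counter


def word_count(rdd):
    """Assumed shape of wordCount(): pair each item with 1, then sum per key."""
    return rdd.map(lambda item: (item, 1)).reduceByKey(lambda a, b: a + b)


def top_authors_spark(sc, records, n=3):
    """Top-n (author, count) pairs computed with Spark."""
    authors = sc.parallelize(records).map(lambda r: r["author"])
    # takeOrdered sorts ascending, so negate the count to get the largest first.
    return word_count(authors).takeOrdered(n, key=lambda kv: -kv[1])


def top_authors_local(records, n=3):
    """Same answer in plain Python, for checking on a small sample."""
    return Counter(r["author"] for r in records).most_common(n)
```

When several authors share the same count, the two versions may order the tied entries differently, so compare them as sets in that case.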



In [None]:
# TODO: Replace <FILL IN> with appropriate code
# 


In [None]:
from test_helper import Test
# TEST Pluralize and test (1b)
Test.assertEquals(y, 1, "y is incorrect")