# Hello Spark
Demonstration based on the [Spark Quick Start](https://spark.apache.org/docs/latest/quick-start.html)

# Create a Spark Session
The SparkSession object is our connection to the Spark cluster manager running on the spark-master host.

There are a few important details in the setting up of the SparkSession:
1. The `appName` is what shows up in the "Running Apps" section of http://localhost:8080/ -- It'll move to "Completed Apps" once we call `.stop()` on this session.
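2. The `master` tells it where to find our Spark cluster manager (the standalone master on `spark-master`) so we can launch Spark applications from this session.
3. The `spark.sql.warehouse.dir` sets the default location for managed databases and tables, which is where our Hive tables live. The builder call in this notebook does not set it, so whatever default is configured applies.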


In [1]:
from pyspark.sql import SparkSession

In [2]:
spark_session = SparkSession.builder\
    .appName("hello-pyspark")\
    .master("spark://spark-master:7077")\
    .config("spark.executor.instances", 1)\
    .config("spark.cores.max", 2)\
    .getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/04/17 12:48:41 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
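
The builder call above sets `appName` and `master` but not `spark.sql.warehouse.dir`. If that setting does not already come from the cluster defaults, it could be passed through the builder like any other config entry. The sketch below is one way to do that: it collects the three settings in a dict and applies them with `builder.config(key, value)`. The helper names `session_conf` and `create_session` are hypothetical, and using `/opt/warehouse` as the warehouse directory is an assumption based on the path used for the Parquet output later in this notebook.

```python
def session_conf(app_name, master_url, warehouse_dir):
    """Collect the three settings discussed above as Spark config keys."""
    return {
        "spark.app.name": app_name,                # same effect as .appName(...)
        "spark.master": master_url,                # same effect as .master(...)
        "spark.sql.warehouse.dir": warehouse_dir,  # default location for managed tables
    }


def create_session(conf):
    """Build (or reuse) a SparkSession from a dict of config entries."""
    # Imported lazily so session_conf can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


# Assumption: /opt/warehouse is the warehouse directory for this cluster.
conf = session_conf("hello-pyspark", "spark://spark-master:7077", "/opt/warehouse")
# spark_session = create_session(conf)  # needs a reachable spark-master
```

Since `getOrCreate()` reuses an existing session, the warehouse setting is only likely to take effect when the session is first created.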


# Word Count
This is a very basic hello-world to make sure we can run a little PySpark:

### Get Some Sample Data
We pull Shakespeare's "As You Like It" from Project Gutenberg and write it to `/opt/data`. This directory is on our `fileshare` volume, which is mounted on this Docker container as well as on all of the Spark containers (master and workers).

In [3]:
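import requests

# Download the play; fail loudly on an HTTP error instead of saving an error page.
resp = requests.get('https://www.gutenberg.org/cache/epub/1121/pg1121.txt')
resp.raise_for_status()
with open('/opt/data/as-you-like-it.txt', 'w') as fp:
    fp.write(resp.text)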


In [4]:
ls /opt/data

as-you-like-it.txt


### Count lines on Spark

In [5]:
ayli = spark_session.read.text('/opt/data/as-you-like-it.txt')
ans = ayli.count()
print(ans)

4215

In [6]:
ayli

DataFrame[value: string]
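
`read.text` gives us a DataFrame with a single `value` column holding one row per line, so `ayli.count()` is a line count rather than a word count. As a quick sanity check, the same number can be computed locally without Spark. This is a minimal sketch; `count_lines` is a hypothetical helper, and it assumes the file is UTF-8 and readable from wherever the code runs.

```python
def count_lines(path, encoding="utf-8"):
    """Count the lines of a text file by streaming through it once."""
    with open(path, encoding=encoding) as fp:
        return sum(1 for _ in fp)


# Usage (path taken from the cells above):
# count_lines('/opt/data/as-you-like-it.txt')
```

For ordinary newline-terminated text we would expect this to agree with the Spark count.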

23/04/16 21:37:43 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: Master removed our application: KILLED
23/04/16 21:37:43 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
	at org.apache.spark.errors.SparkCoreErrors$.clusterSchedulerError(SparkCoreErrors.scala:291)
	at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:978)
	at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.dead(StandaloneSchedulerBackend.scala:165)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint.markDead(StandaloneAppClient.scala:263)
	at org.apache.spark.deploy.client.StandaloneAppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(StandaloneAppClient.scala:170)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.proce

# Spark grep

In [23]:
orlandos_lines = ayli.filter(ayli.value.contains("ORLANDO"))

In [24]:
orlandos_lines.show(n=10)

+--------------------+
|               value|
+--------------------+
|  ORLANDO,  "   "...|
|Enter ORLANDO and...|
|  ORLANDO. As I r...|
|  ORLANDO. Go apa...|
|  ORLANDO. Nothin...|
|  ORLANDO. Marry,...|
|  ORLANDO. Shall ...|
|  ORLANDO. O, sir...|
|  ORLANDO. Ay, be...|
|  ORLANDO. Come, ...|
+--------------------+
only showing top 10 rows
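
The `contains` filter is a plain case-sensitive substring match, which is why stage directions such as the `Enter ORLANDO and...` row show up next to the spoken lines. If we only want lines where ORLANDO is the speaker, a regular expression anchored at the start of the line is one option. The sketch below shows the idea in plain Python on invented sample lines; `speaker_lines` is a hypothetical helper.

```python
import re

# Optional indentation, then the speaker name followed by a period.
SPEAKER = re.compile(r"^\s*ORLANDO\.")


def speaker_lines(lines):
    """Keep only the lines that start with the ORLANDO. speaker tag."""
    return [line for line in lines if SPEAKER.search(line)]


# Invented sample lines, not quotes from the play.
sample = ["  ORLANDO. (speech)", "Enter ORLANDO", "  OLIVER. (speech)"]
matches = speaker_lines(sample)
```

On the Spark side the same pattern should work with `ayli.filter(ayli.value.rlike(r"^\s*ORLANDO\."))`, though that has not been run here.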



# Term Frequency

In [25]:
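from pyspark.sql.functions import explode, split

# Split each line on runs of whitespace, emit one row per token, then count rows per token.
# The raw string keeps "\s" as a regex escape rather than a Python string escape.
wordCounts = ayli.select(explode(split(ayli.value, r"\s+")).alias("word")).groupBy("word").count()
_coll = wordCounts.collect()  # forces the job to run now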

In [26]:
wordCounts.show()

+-----------+-----+
|       word|count|
+-----------+-----+
|     online|    4|
|PERMISSION.|    7|
|       some|   26|
|  disgrace,|    1|
|       hope|    8|
|      still|    7|
|         By|   24|
| misplaced;|    1|
|      those|    8|
|    knight,|    1|
| FREDERICK.|   20|
|  wrestler?|    1|
|    embrace|    1|
|        art|   21|
|      burs,|    1|
| likelihood|    1|
|     travel|    3|
|assailants.|    1|
|      cold,|    1|
|    blossom|    1|
+-----------+-----+
only showing top 20 rows
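
To make the `split` / `explode` / `groupBy` chain concrete, here is a rough plain-Python equivalent using `re.split` and `collections.Counter`. It is a sketch of the same idea, not a drop-in replacement; `word_counts` is a hypothetical helper and the sample lines are invented.

```python
import re
from collections import Counter


def word_counts(lines):
    """Count whitespace-separated tokens across all lines."""
    counts = Counter()
    for line in lines:
        # re.split keeps empty strings at the edges, so indented or blank
        # lines contribute an "" token.
        counts.update(re.split(r"\s+", line))
    return counts


# Invented sample: one indented line, one plain line, one blank line.
sample = ["  to be  or not", "to be", ""]
counts = word_counts(sample)
```

Note the empty-string token: we would expect Spark's `split` to produce it for indented and blank lines as well, so filtering out `word == ""` may be worth considering before ranking words.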



## Save as Parquet File
https://spark.apache.org/docs/latest/sql-data-sources-parquet.html

In [27]:
wordCounts.write.mode('overwrite').parquet('/opt/warehouse/wordcounts.parquet')

## Read back Parquet data

In [3]:
wc2 = spark_session.read.parquet('/opt/warehouse/wordcounts.parquet/')
wc2.show()

+-----------+-----+
|       word|count|
+-----------+-----+
|     online|    4|
|PERMISSION.|    7|
|       some|   26|
|  disgrace,|    1|
|       hope|    8|
|      still|    7|
|         By|   24|
| misplaced;|    1|
|      those|    8|
|    knight,|    1|
| FREDERICK.|   20|
|  wrestler?|    1|
|    embrace|    1|
|        art|   21|
|      burs,|    1|
| likelihood|    1|
|     travel|    3|
|assailants.|    1|
|      cold,|    1|
|    blossom|    1|
+-----------+-----+
only showing top 20 rows

### Enable SQL-querying
Create a temp view from `wc2` named `wordcounts` so we can reference it as a table in subsequent SQL queries.

In [5]:
wc2.createOrReplaceTempView("wordcounts")

ans = spark_session.sql("SELECT * FROM wordcounts WHERE LEN(word) > 4 ORDER BY count DESC")
ans.show()

+------------+-----+
|        word|count|
+------------+-----+
|   ROSALIND.|  201|
|    ORLANDO.|  120|
|      CELIA.|  109|
|     Project|   78|
| TOUCHSTONE.|   74|
|       would|   68|
|       shall|   61|
|     JAQUES.|   57|
|Gutenberg-tm|   53|
|       Enter|   51|
|       which|   50|
|     OLIVER.|   37|
|      should|   35|
|       there|   35|
|       these|   32|
|     SENIOR.|   32|
|       their|   31|
|  electronic|   27|
|      cannot|   27|
|      Exeunt|   27|
+------------+-----+
only showing top 20 rows
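
To spell out what the query does, here is the same filter-and-sort logic in plain Python, applied to a few `(word, count)` rows copied from the outputs above. `long_words_by_count` is a hypothetical helper used only for illustration.

```python
def long_words_by_count(rows, min_len=4):
    """Keep words longer than min_len characters, most frequent first."""
    kept = [(word, count) for word, count in rows if len(word) > min_len]
    return sorted(kept, key=lambda row: row[1], reverse=True)


# A few (word, count) rows taken from the tables shown earlier.
rows = [("some", 26), ("art", 21), ("would", 68), ("ROSALIND.", 201), ("travel", 3)]
result = long_words_by_count(rows)
```

With `from pyspark.sql import functions as F`, the DataFrame form `wc2.filter(F.length("word") > 4).orderBy(F.desc("count"))` should give the same result as the SQL, without needing the temp view.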




In [55]:
ans.limit(10).write.json('/opt/warehouse/answer.json')
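
Despite the `.json` name, Spark writes a directory here, containing one or more `part-*` files in JSON Lines format (one object per line) alongside a `_SUCCESS` marker. A minimal sketch for loading that output back with the standard library follows; `read_json_dir` is a hypothetical helper, and it assumes uncompressed output in a directory that is readable from where the code runs.

```python
import glob
import json
import os


def read_json_dir(path):
    """Load every record from the part files of a Spark JSON output directory."""
    records = []
    # Marker files such as _SUCCESS do not match the part-*.json pattern.
    for part in sorted(glob.glob(os.path.join(path, "part-*.json"))):
        with open(part, encoding="utf-8") as fp:
            records.extend(json.loads(line) for line in fp if line.strip())
    return records


# Usage (path taken from the cell above):
# read_json_dir('/opt/warehouse/answer.json')
```

When there are several part files, the order of the combined records is not guaranteed to match the `ORDER BY` of the query.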

In [60]:
list_of_dicts = ans.limit(10).rdd.map(lambda row: row.asDict()).collect()



# Close Session
This shuts down the executors running on the workers and relinquishes cluster resources associated with this app.

In [15]:
spark_session.stop()
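
If a cell fails before we reach `.stop()`, the app keeps holding its executors until the kernel goes away. One way to guard against that in a script is a small context manager that always calls `stop()`. This is a sketch; `stopping` is a hypothetical helper that works with any object exposing a `stop()` method.

```python
from contextlib import contextmanager


@contextmanager
def stopping(session):
    """Yield the session and call session.stop() on exit, even after an error."""
    try:
        yield session
    finally:
        session.stop()


# Usage:
# with stopping(SparkSession.builder.appName("hello-pyspark").getOrCreate()) as spark:
#     spark.read.text('/opt/data/as-you-like-it.txt').count()
```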