# Spark Introduction

Spark Documentation: <http://spark.apache.org/docs/latest/>

## 1. Initialize Spark

**Please edit `env.py` in the HOME directory so that it points to the correct Spark installation!**

The following code shows how the Pilot-Abstraction is used to connect to an existing YARN cluster and start Spark.

In [13]:
%run ../env.py
%run ../util/init_spark.py

from pilot_hadoop import PilotComputeService as PilotSparkComputeService

pilotcompute_description = {
    "service_url": "yarn-client://yarn-aws.radical-cybertools.org",
    "number_of_processes": 2
}

print("SPARK HOME: %s" % os.environ["SPARK_HOME"])
print("PYTHONPATH: %s" % os.environ["PYTHONPATH"])

pilot_spark = PilotSparkComputeService.create_pilot(pilotcompute_description=pilotcompute_description)
sc = pilot_spark.get_spark_context()

SPARK HOME: /usr/hdp/2.3.2.0-2950/spark-1.5.1-bin-hadoop2.6
PYTHONPATH: /usr/hdp/2.3.2.0-2950/spark-1.5.1-bin-hadoop2.6/python:/usr/hdp/2.3.2.0-2950/spark-1.5.1-bin-hadoop2.6/python/lib/py4j-0.8.2.1-src.zip


After the Spark application has been submitted, it can be monitored via the YARN web interface (<http://yarn-aws.radical-cybertools.org:8088/>) or by executing the following command:

In [2]:
!yarn application -list -appTypes Spark -appStates RUNNING

15/11/08 16:40:16 INFO impl.TimelineClientImpl: Timeline service address: http://ip-10-63-179-69.ec2.internal:8188/ws/v1/timeline/
15/11/08 16:40:16 INFO client.RMProxy: Connecting to ResourceManager at ip-10-63-179-69.ec2.internal/10.63.179.69:8050
Total number of applications (application-types: [SPARK] and states: [RUNNING]):2
                Application-Id	    Application-Name	    Application-Type	      User	     Queue	             State	       Final-State	       Progress	                       Tracking-URL
application_1446998777703_0001	        PySparkShell	               SPARK	   radical	   default	           RUNNING	         UNDEFINED	            10%	          http://10.99.194.113:4041
application_1447000128355_0002	         Pilot-Spark	               SPARK	   radical	   default	           RUNNING	         UNDEFINED	            10%	          http://10.99.194.113:4040


In [4]:
output=!yarn application -list -appTypes Spark -appStates RUNNING
print_application_url(output)

|   | User | Name | Spark Application URL |
|---|------|------|-----------------------|
| 0 | radical | Pilot-Spark | http://yarn-aws.radical-cybertools.org:8088/proxy/application_1447000128355_0002 |
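
The mapping from the `yarn application -list` output to a proxy URL can be sketched in plain Python. This is not the actual `print_application_url` helper from `util/init_spark.py`, whose implementation is not shown here; the function name `application_urls`, its parameters, and the filtering by application name are assumptions made for illustration. The sample rows are rebuilt from the output shown above.

```python
def application_urls(lines, rm_url="http://yarn-aws.radical-cybertools.org:8088", name=None):
    """Hypothetical helper: build YARN proxy URLs from `yarn application -list` output."""
    result = []
    for line in lines:
        # Data rows are tab-separated and padded with spaces
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) < 4 or not fields[0].startswith("application_"):
            continue  # skip log lines, the summary line and the header
        app_id, app_name, _app_type, user = fields[:4]
        if name is None or app_name == name:
            result.append((user, app_name, "%s/proxy/%s" % (rm_url, app_id)))
    return result


# Two data rows rebuilt from the command output above
sample = [
    "Total number of applications (application-types: [SPARK] and states: [RUNNING]):2",
    "\t".join(["application_1446998777703_0001", "PySparkShell", "SPARK", "radical",
               "default", "RUNNING", "UNDEFINED", "10%", "http://10.99.194.113:4041"]),
    "\t".join(["application_1447000128355_0002", "Pilot-Spark", "SPARK", "radical",
               "default", "RUNNING", "UNDEFINED", "10%", "http://10.99.194.113:4040"]),
]

pilot_apps = application_urls(sample, name="Pilot-Spark")
print(pilot_apps)
```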


## 2. Spark: Hello RDD Abstraction

**Line Count:** How many lines of logs do we have?

In [5]:
text_rdd = sc.textFile("/data/nasa/")
text_rdd.count()

1891715

**Word Count:** How many words?

In [11]:
text_rdd.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda x,y: x+y).take(10)

[(u'', 2817),
 (u'[13/Jul/1995:17:48:56', 1),
 (u'[13/Jul/1995:17:48:54', 1),
 (u'[13/Jul/1995:17:48:52', 3),
 (u'/cgi-bin/imagemap/countdown?107,174', 19),
 (u'[13/Jul/1995:17:48:50', 3),
 (u'[22/Jul/1995:12:10:41', 1),
 (u'[03/Jul/1995:02:03:47', 2),
 (u'[16/Jul/1995:14:34:31', 1),
 (u'[07/Jul/1995:18:18:55', 1)]
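
The empty-string key `u''` in the result comes from `line.split(" ")`, which yields an empty token for every pair of consecutive spaces. The following plain-Python sketch reproduces the effect without Spark; the two sample lines are made up for illustration, and `collections.Counter` stands in for the `map`/`reduceByKey` step.

```python
from collections import Counter

# Made-up sample lines; the first one contains a double space
lines = ["GET  /index.html 200", "GET /index.html 404"]

# Same tokenization as in the notebook: split on a single space character
by_space = Counter(word for line in lines for word in line.split(" "))

# Alternative: split on runs of whitespace, which never yields empty tokens
by_whitespace = Counter(word for line in lines for word in line.split())

print(by_space[""])         # occurrences of the empty token
print("" in by_whitespace)  # no empty token with split()
```

Replacing `line.split(" ")` with `line.split()` in the Spark pipeline would therefore drop the empty token from the word count.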

**HTTP Response Code Count:** How often did each HTTP response code occur, and how many of the responses were errors (4xx and 5xx)?

In [7]:
# NASA
text_rdd = sc.textFile("/data/nasa/")
# The response code is the second-to-last whitespace-separated field of each line
text_rdd.filter(lambda x: len(x)>8).map(lambda x: (x.split()[-2],1)).reduceByKey(lambda x,y: x+y).collect()

[(u'403', 54),
 (u'302', 46573),
 (u'304', 132627),
 (u'500', 62),
 (u'501', 14),
 (u'200', 1701534),
 (u'404', 10845),
 (u'400', 5)]
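
To answer the question about errors directly, the collected counts can be summed per status class on the driver. The sketch below copies the list from the output above into plain Python; the variable names are chosen for illustration.

```python
# Result of the collect() call above, copied from the notebook output
counts = [("403", 54), ("302", 46573), ("304", 132627), ("500", 62),
          ("501", 14), ("200", 1701534), ("404", 10845), ("400", 5)]

# 4xx codes are client errors, 5xx codes are server errors
client_errors = sum(n for code, n in counts if code.startswith("4"))
server_errors = sum(n for code, n in counts if code.startswith("5"))

print("client errors: %d" % client_errors)                     # 10904
print("server errors: %d" % server_errors)                     # 76
print("total errors:  %d" % (client_errors + server_errors))   # 10980
```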

## 3. Spark-SQL

In [8]:
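from pyspark.sql import SQLContext, Row
sqlContext = SQLContext(sc)
# Keep only lines longer than 8 characters (skips empty or truncated lines)
text_filtered = text_rdd.filter(lambda x: len(x)>8)
logs = text_filtered.top(20)  # 20 largest lines in sort order; not used further below
# Extract (host, timestamp, requested path, response code) from each line
cleaned = text_filtered.map(lambda l: (l.split(" ")[0], l.split(" ")[3][1:], l.split(" ")[6], l.split(" ")[-2]))
# Field 0 is the requesting host; it is stored under the column name `referer`
rows = cleaned.map(lambda l: Row(referer=l[0], ts=l[1], response_code=l[3]))
schemaLog = sqlContext.createDataFrame(rows)
# Register the DataFrame as the temporary table "row" for SQL queries
schemaLog.registerTempTable("row")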

In [9]:
rows.take(10)

[Row(referer=u'199.72.81.55', response_code=u'200', ts=u'01/Jul/1995:00:00:01'),
 Row(referer=u'unicomp6.unicomp.net', response_code=u'200', ts=u'01/Jul/1995:00:00:06'),
 Row(referer=u'199.120.110.21', response_code=u'200', ts=u'01/Jul/1995:00:00:09'),
 Row(referer=u'burger.letters.com', response_code=u'304', ts=u'01/Jul/1995:00:00:11'),
 Row(referer=u'199.120.110.21', response_code=u'200', ts=u'01/Jul/1995:00:00:11'),
 Row(referer=u'burger.letters.com', response_code=u'304', ts=u'01/Jul/1995:00:00:12'),
 Row(referer=u'burger.letters.com', response_code=u'200', ts=u'01/Jul/1995:00:00:12'),
 Row(referer=u'205.212.115.106', response_code=u'200', ts=u'01/Jul/1995:00:00:12'),
 Row(referer=u'd104.aa.net', response_code=u'200', ts=u'01/Jul/1995:00:00:13'),
 Row(referer=u'129.94.144.152', response_code=u'200', ts=u'01/Jul/1995:00:00:13')]
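
The index-based parsing used to build the rows can be traced on a single line in plain Python. The log line below is a hypothetical example in Common Log Format, assumed to resemble the files under `/data/nasa/`; it is not copied from the data set. Note that the notebook stores the host under the name `referer` and does not carry the requested path into the `Row`.

```python
# Hypothetical access-log line in Common Log Format (an assumption for illustration)
line = '199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'

fields = line.split(" ")
record = {
    "host": fields[0],            # requesting host (named `referer` in the notebook)
    "ts": fields[3][1:],          # timestamp without the leading "["
    "path": fields[6],            # requested path (not stored in the Row)
    "response_code": fields[-2],  # second-to-last field
}
print(record)
```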

In [10]:
sqlContext.sql("select response_code, count(*) from row group by response_code").collect()

[Row(response_code=u'500', _c1=62),
 Row(response_code=u'501', _c1=14),
 Row(response_code=u'400', _c1=5),
 Row(response_code=u'403', _c1=54),
 Row(response_code=u'404', _c1=10845),
 Row(response_code=u'302', _c1=46573),
 Row(response_code=u'304', _c1=132627),
 Row(response_code=u'200', _c1=1701534)]

## 4. Stop Pilot-Spark Application

In [12]:
pilot_spark.cancel()