# Apache Spark Overview

* [**Apache Spark**](http://spark.apache.org/downloads.html) is a fast and general-purpose **cluster computing system**. It provides high-level APIs in Java, Scala and Python, and an optimized engine that supports general execution graphs. 
* It also supports a rich set of higher-level tools, including [Spark SQL](https://spark.apache.org/docs/latest/sql-programming-guide.html) for SQL and structured data processing, [MLlib](https://spark.apache.org/docs/latest/mllib-guide.html) for machine learning, [GraphX](https://spark.apache.org/docs/latest/graphx-programming-guide.html) for graph processing, and [Spark Streaming](https://spark.apache.org/docs/latest/streaming-programming-guide.html) for stream processing.

* Spark’s primary abstraction is a distributed collection of items called a **Resilient Distributed Dataset (RDD)**.
* RDDs can be created from Hadoop InputFormats (such as HDFS files) or by transforming other RDDs.
* Let’s make a new RDD from the text of the README file in the Spark source directory:

In [18]:
import os
import sys
from operator import add

In [19]:
textFile = sc.textFile("/home/sasidhar/spark-1.3.1-bin-hadoop2.6/README.md")
print "Lines in text file: ", textFile.count()
print "First line in the text file: ", textFile.first()

Lines in text file:  98
First line in the text file:  # Apache Spark


## RDD Explained

In [21]:
myRange = range(1,20)
sampleRDD = sc.parallelize(myRange)
myList = sampleRDD.glom().collect()
# glom() gathers each partition's elements into a list, so myList has one sub-list per partition
print len(myList)
print myList

2
[[1, 2, 3, 4, 5, 6, 7, 8, 9], [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]]
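
The two sub-lists above correspond to the RDD's two partitions, because `glom()` turns each partition into a list. As a rough mental model, the split can be reproduced in plain Python without a Spark context. The helper `slice_into_partitions` is hypothetical (it is not a Spark API), and the index rule it uses is an assumption chosen because it reproduces the output shown above; Spark's own slicing may differ between versions and input types.

```python
def slice_into_partitions(items, num_slices):
    """Split items into num_slices contiguous chunks (illustrative only)."""
    items = list(items)
    n = len(items)
    partitions = []
    for i in range(num_slices):
        # Assumed rule: partition i covers indices [i*n // k, (i+1)*n // k).
        start = (i * n) // num_slices
        end = ((i + 1) * n) // num_slices
        partitions.append(items[start:end])
    return partitions


partitions = slice_into_partitions(range(1, 20), 2)
print(len(partitions))
print(partitions)
```

With 19 elements and 2 slices, the first chunk gets 9 elements and the second gets 10, which matches the `glom()` output.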


In [8]:
textFile.first()

u'# Apache Spark'

In [9]:
# Keep only the lines that contain the string "Spark"
linesWithSpark = textFile.filter(lambda line: "Spark" in line)

In [10]:
linesWithSpark.count()

19

In [11]:
linesWithSpark.first()

u'# Apache Spark'

In [13]:
textFile.filter(lambda line: "Spark" in line).count()

19

In [14]:
# Largest number of words on any single line
textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

14
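
The cell above finds the largest number of words on any single line: `map` turns each line into its word count, and `reduce` keeps the larger of two counts until one value remains. Here is a minimal plain-Python sketch of the same pattern, runnable without a Spark context. The `sample_lines` list is made-up illustrative data, not the README contents.

```python
from functools import reduce

# Hypothetical sample data standing in for the lines of a text file.
sample_lines = [
    "# Apache Spark",
    "",
    "Spark is a fast and general cluster computing system",
    "It provides high-level APIs",
]

# map step: one word count per line
word_counts = list(map(lambda line: len(line.split()), sample_lines))

# reduce step: keep the larger of each pair of counts
longest = reduce(lambda a, b: a if a > b else b, word_counts)

print(word_counts)
print(longest)
```

For a local list the reduce step is equivalent to `max(word_counts)`; the pairwise lambda is kept here only to mirror the RDD call.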

In [16]:
foo = [2, 18, 9, 22, 17, 24, 8, 12, 27]

In [17]:
sum(foo)

139

In [28]:
print filter(lambda x: x % 3 == 0, foo)

[18, 9, 24, 12, 27]


In [30]:
print map(lambda x: x - 2, foo)

[0, 16, 7, 20, 15, 22, 6, 10, 25]
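
The cells above use Python 2, where `filter` and `map` return lists and `print` is a statement. Under Python 3 (an assumption about the reader's environment, not something this notebook uses), both built-ins return lazy iterators, so a sketch of the same cells wraps them in `list(...)`. It also shows `reduce` with `operator.add`, which produces the same total as `sum`.

```python
from functools import reduce
from operator import add

foo = [2, 18, 9, 22, 17, 24, 8, 12, 27]

# filter/map are wrapped in list() so the results print as lists on Python 3
multiples_of_three = list(filter(lambda x: x % 3 == 0, foo))
minus_two = list(map(lambda x: x - 2, foo))

# reduce(add, ...) folds the list pairwise, giving the same result as sum(foo)
total = reduce(add, foo)

print(multiples_of_three)  # [18, 9, 24, 12, 27]
print(minus_two)           # [0, 16, 7, 20, 15, 22, 6, 10, 25]
print(total)               # 139
```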


In [1]:
rdd1 = sc.parallelize(range(1,100))

In [7]:
len(rdd1.glom().collect())

2

In [9]:
rdd1.toDebugString()

'(2) ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:392 []'
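
The leading `(2)` in the debug string is the partition count, which agrees with the `glom()` result above. Below is a small sketch that pulls that number out of the string with a regular expression. The helper name `partition_count` is hypothetical, and the pattern assumes the string begins with the count in parentheses, as in the output shown.

```python
import re


def partition_count(debug_string):
    """Return the leading '(N)' partition count from an RDD debug string."""
    match = re.match(r"\s*\((\d+)\)", debug_string)
    if match is None:
        raise ValueError("no leading partition count found")
    return int(match.group(1))


debug_string = "(2) ParallelCollectionRDD[0] at parallelize at PythonRDD.scala:392 []"
print(partition_count(debug_string))
```

With a live RDD, `rdd1.getNumPartitions()` returns the same number directly; parsing the string is only a way to read the output shown here.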