In [1]:
# make a new RDD from the text file
textFile = sc.textFile('../data/README.md')

In [2]:
textFile

../data/README.md MapPartitionsRDD[1] at textFile at NativeMethodAccessorImpl.java:0

In [3]:
# number of items (lines) in this RDD
textFile.count()

49

In [4]:
# first item in this RDD
# same element as textFile.take(1)[0]; take(1) itself returns a one-element list
textFile.first()

'Welcome to the Spark documentation!'

In [5]:
# take(n) returns the first n elements of this RDD as a list
textFile.take(5)

['Welcome to the Spark documentation!',
 '',
 'This readme will walk you through navigating and building the Spark documentation, which is included here with the Spark source code. You can also find documentation specific to release versions of Spark at https://spark.apache.org/documentation.html.',
 '',
 'Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that correspond to whichever version of Spark you currently have checked out of revision control.']

In [6]:
# filter() returns a new RDD containing a subset of the items in this RDD:
# only the items for which the user-defined function returns True
linesWithSpark = textFile.filter(lambda line: "Spark" in line)
print(linesWithSpark)
print('')
# collect() returns all the elements of the RDD (dataset) to the driver as a list
print(linesWithSpark.collect())

PythonRDD[5] at RDD at PythonRDD.scala:49

['Welcome to the Spark documentation!', 'This readme will walk you through navigating and building the Spark documentation, which is included here with the Spark source code. You can also find documentation specific to release versions of Spark at https://spark.apache.org/documentation.html.', 'Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that correspond to whichever version of Spark you currently have checked out of revision control.', 'The Spark documentation build uses a number of tools to build HTML docs and API docs in Scala, Java, Python, R and SQL.', 'Note: Other versions of roxygen2 might work in SparkR documentation generation but RoxygenNote field in $SPARK_HOME/R/pkg/DESCRIPTION is 5.0.1, which is updated if the version is mismatched.', 'We include the Spark documentation as part of the source (as opposed to u

In [7]:
# equivalent to: textFile.filter(lambda line: "Spark" in line).count()
linesWithSpark.count()

9

In [8]:
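# map(): return a new distributed dataset formed by passing each element of the source through a function func.
# reduce(): aggregate the elements with a two-argument function; here it keeps the larger word count,
# so the result is the number of words in the longest line.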
textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a if (a > b) else b)

89

In [9]:
# named larger() so that it does not shadow Python's built-in max()
def larger(a, b):
    if a > b:
        return a
    else:
        return b

textFile.map(lambda line: len(line.split())).reduce(larger)

89
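
The map and reduce steps above can be mimicked without Spark. The following is a minimal sketch in plain Python, assuming a small hypothetical list of lines in place of the RDD; it only illustrates the logic and says nothing about how Spark distributes the work.

```python
from functools import reduce

# Hypothetical sample lines standing in for the contents of the RDD
lines = [
    "Welcome to the Spark documentation!",
    "",
    "Read on to learn more",
]

# map step: number of whitespace-separated words in each line
word_counts = list(map(lambda line: len(line.split()), lines))

# reduce step: repeatedly keep the larger of two values
longest = reduce(lambda a, b: a if a > b else b, word_counts)

print(word_counts)  # [5, 0, 5]
print(longest)      # 5
```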

In [10]:
# The function passed to reduce() should be commutative and associative,
# so that the result does not depend on how the data is partitioned.
# Subtraction is neither, so the result below changes with the number of partitions.

# sc.textFile() takes an optional second argument that controls the minimum number of partitions of the file.
# By default, Spark creates one partition for each block of the file (blocks being 128MB by default in HDFS)
textFile = sc.textFile('../data/README.md')
print(textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a - b))

textFile = sc.textFile('../data/README.md', 3)
print(textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a - b))

textFile = sc.textFile('../data/README.md', 5)
print(textFile.map(lambda line: len(line.split())).reduce(lambda a, b: a - b))

46
214
376
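
Why the result depends on the partition count can be seen with a simplified model. The sketch below is plain Python, not Spark: the helper `partitioned_reduce` and the sample values are hypothetical, and it assumes each partition is a contiguous chunk that is reduced on its own before the partial results are combined. Spark's actual partitioning of a file may differ.

```python
from functools import reduce

def partitioned_reduce(values, num_partitions, func):
    """Reduce each contiguous chunk, then reduce the per-chunk results."""
    size = -(-len(values) // num_partitions)  # ceiling division
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    partials = [reduce(func, chunk) for chunk in chunks]
    return reduce(func, partials)

values = [10, 4, 3, 2, 1, 5]  # hypothetical per-line word counts

subtract = lambda a, b: a - b               # neither commutative nor associative
largest = lambda a, b: a if a > b else b    # commutative and associative

for n in (1, 2, 3):
    print(n, partitioned_reduce(values, n, subtract), partitioned_reduce(values, n, largest))
# 1 -5 10
# 2 7 10
# 3 9 10
```

The subtraction result changes with every partition count, while the maximum stays the same.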


In [11]:
# Lazy transformations

"""RDDs support two types of operations: 
transformations, which create a new dataset from an existing one, and 
actions, which return a value to the driver program after running a computation on the dataset. 

For example, map is a transformation that passes each dataset element through a function 
    and returns a new RDD representing the results. 
On the other hand, reduce is an action that aggregates all the elements of the RDD using some function 
    and returns the final result to the driver program (although there is also a parallel reduceByKey that returns a distributed dataset).

All transformations in Spark are lazy, in that they do not compute their results right away. 
Instead, they just remember the transformations applied to some base dataset (e.g. a file). 
The transformations are only computed when an action requires a result to be returned to the driver program. 
This design enables Spark to run more efficiently. 

For example, we can realize that a dataset created through map will be used in a reduce 
    and return only the result of the reduce to the driver, rather than the larger mapped dataset.

By default, each transformed RDD may be recomputed each time you run an action on it. 
However, you may also persist an RDD in memory using the persist (or cache) method, 
    in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. 
There is also support for persisting RDDs on disk, or replicated across multiple nodes."""

import math
# range(1, 100000) supplies the input collection; parallelize() turns it into an RDD, not a list
# Parallelized collections are created by calling SparkContext’s parallelize method on an existing collection in your driver program (here a Python range). 
# The elements of the collection are copied to form a distributed dataset that can be operated on in parallel
a = sc.parallelize(range(1, 100000))

In [12]:
# returns almost immediately, unlike the count() in the next cell:
# map() is a lazy transformation, so it only records the operation and does not compute the result yet
b = a.map(lambda x: math.sqrt(x))

In [13]:
# slower than the map() cell above: count() is an action that requires a result,
# so Spark runs the pending map() and then the count here
# deferring the work this way lets Spark avoid keeping intermediate data around
b.count()

99999
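
The lazy behaviour can be illustrated without Spark. The sketch below is plain Python with a hypothetical toy class, `LazyDataset`, which is not part of any Spark API: `map` only records the function, and `count` is the action that runs it. It also shows the default recomputation on every action, which is what persist (or cache) avoids in Spark.

```python
import math

class LazyDataset:
    """Hypothetical toy class: records map functions, runs them only on an action."""

    def __init__(self, source, funcs=()):
        self.source = source
        self.funcs = tuple(funcs)

    def map(self, func):
        # Transformation: no data is touched, only the recorded plan grows
        return LazyDataset(self.source, self.funcs + (func,))

    def _compute(self):
        for item in self.source:
            for func in self.funcs:
                item = func(item)
            yield item

    def count(self):
        # Action: triggers the recorded transformations
        return sum(1 for _ in self._compute())

calls = []  # records every invocation of the mapped function

def traced_sqrt(x):
    calls.append(x)
    return math.sqrt(x)

a = LazyDataset(range(1, 11))
b = a.map(traced_sqrt)
print(len(calls))  # 0: nothing has been computed yet
print(b.count())   # 10
print(len(calls))  # 10: the action ran the map
b.count()
print(len(calls))  # 20: a second action recomputes everything
```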

In [14]:
# Notes for the self-contained applications section
# these apply to $ spark-submit
# not to the $ pyspark shell
"""
If you have built a Spark application, 
you need to use spark-submit to run it.
The code can be written in either Python or Scala.
The mode can be either local or cluster.

If you just want to test or run a few individual commands, 
you can use one of the shells provided by Spark:

pyspark (for Spark in Python)
spark-shell (for Spark in Scala)
"""

"""
If you are using EMR, there are three ways to run Spark code:

1. using pyspark (or spark-shell)
2. using spark-submit without --master and --deploy-mode
3. using spark-submit with --master and --deploy-mode cluster

Although all three run the application on the Spark cluster, 
they differ in where the driver program runs.

In the 1st and 2nd, the driver runs in client mode, 
whereas in the 3rd the driver also runs inside the cluster.

In the 1st and 2nd, you have to wait until one application completes before running another, 
but in the 3rd you can run multiple applications in parallel.
"""
