# Learning Spark: Lightning-fast data analysis

In [1]:
sc.version

u'2.1.0'

In [27]:
lines = sc.textFile("README.md")
lines.count()

104

In [28]:
lines.first()

u'# Apache Spark'

### Chapter 3 Programming with RDDs

In [29]:
pythonLines = lines.filter(lambda line: "Python" in line)

In [30]:
pythonLines.first()

u'high-level APIs in Scala, Java, Python, and R, and an optimized engine that'

persist() asks Spark to keep an RDD in memory once it has been computed, so you can query the same subset of your data repeatedly without recomputing it.

In [31]:
pythonLines.persist()
pythonLines.count()

3
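
The effect of persist() can be illustrated without Spark. The sketch below is plain Python; `LazyLines` is a hypothetical stand-in for an RDD (it is not a Spark class) that rescans its source on every action unless `persist()` was called first.

```python
class LazyLines(object):
    """Hypothetical stand-in for an RDD; not part of the Spark API."""

    def __init__(self, source, predicate):
        self.source = source
        self.predicate = predicate
        self.scans = 0          # how many times the source was scanned
        self._persist = False
        self._cache = None

    def persist(self):
        # Only marks the data for caching; nothing is computed yet.
        self._persist = True
        return self

    def _compute(self):
        if self._cache is not None:
            return self._cache
        self.scans += 1
        result = [x for x in self.source if self.predicate(x)]
        if self._persist:
            self._cache = result
        return result

    def count(self):
        return len(self._compute())

    def first(self):
        return self._compute()[0]


readme = ["# Apache Spark", "Python API", "Scala API", "R and Python"]

uncached = LazyLines(readme, lambda line: "Python" in line)
uncached.count()
uncached.first()
print(uncached.scans)   # 2: each action rescans the source

cached = LazyLines(readme, lambda line: "Python" in line).persist()
cached.count()
cached.first()
print(cached.scans)     # 1: the second action reuses the cached result
```

Note that calling `persist()` by itself computes nothing; the data is cached the first time an action runs.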

Creating RDDs
- Transformations:
    - operations on RDDs that return a new RDD
    - lazily evaluated: Spark does not begin executing them until it sees an action
- Actions: operations that return a final value to the driver program or write data to an external storage system
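
Lazy evaluation can be sketched in plain Python with a generator expression, assuming the generator plays the role of a transformation and `list()` plays the role of an action. The names `log` and `is_error` are hypothetical and exist only for this illustration.

```python
log = []

def is_error(line):
    log.append("checked " + line)   # records when the filter actually runs
    return "error" in line

lines = ["error: disk full", "ok", "error: timeout"]

# "Transformation": builds a lazy pipeline, the filter has not run yet.
errors = (line for line in lines if is_error(line))
calls_before_action = len(log)

# "Action": consuming the pipeline forces the computation.
result = list(errors)
calls_after_action = len(log)

print(calls_before_action)  # 0
print(calls_after_action)   # 3
print(result)               # ['error: disk full', 'error: timeout']
```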

In [32]:
lines = sc.parallelize(["pandas", "i like pandas"])

In [33]:
inputRDD = sc.textFile("log.txt")

In [34]:
errorsRDD = inputRDD.filter(lambda x: "Praesent" in x)
warningsRDD = inputRDD.filter(lambda x: "sapien" in x)
badLinesRDD = errorsRDD.union(warningsRDD)

In [35]:
badLinesRDD.count()

16

In [36]:
##print "Input had " + str(badLinesRDD.count()) + " concerning lines"
##print "Here are 10 examples:"
# example from the book; print only the first matching line
for line in badLinesRDD.take(1):
    print line

Nam pretium turpis et arcu. Duis arcu tortor, suscipit eget, imperdiet nec, imperdiet iaculis, ipsum. Sed aliquam ultrices mauris. Integer ante arcu, accumsan a, consectetuer eget, posuere ut, mauris. Praesent adipiscing. Phasellus ullamcorper ipsum rutrum nunc. Nunc nonummy metus. Vestibulum volutpat pretium libero. Cras id dui. Aenean ut eros et nisl sagittis vestibulum. Nullam nulla eros, ultricies sit amet, nonummy id, imperdiet feugiat, pede. Sed lectus. Donec mollis hendrerit risus. Phasellus nec sem in justo pellentesque facilisis. Etiam imperdiet imperdiet orci. Nunc nec neque. Phasellus leo dolor, tempus non, auctor et, hendrerit quis, nisi.


In [37]:

# Creating an RDD
nums = sc.parallelize([1,2,3,3])
# squaring the values in an RDD
squared = nums.map(lambda x: x * x).collect()
for k in squared:
    print "%i " % (k)

1 
4 
9 
9 


In [38]:
# flatMap() in Python, splitting lines into words
lines = sc.parallelize(["hello world", "hi"])
words = lines.flatMap(lambda line: line.split(" "))
words.first()

'hello'

In [39]:
# Basic transformations on page 38
# first create an RDD to work with
cijfers = sc.parallelize([1,2,3,3])
# map() - apply a function to each element in the RDD and return an RDD of the results
test = cijfers.map(lambda x: x + 1).collect()
for k in test:
    print "%i " % (k)

2 
3 
4 
4 


In [40]:
# flatMap() - apply a function to each element in the RDD and return an RDD of the contents of the iterators returned.
# Often used to extract words 
woorden = sc.parallelize(["woord", "het", "is vrijdag"])
woord = woorden.flatMap(lambda x: x.split(" "))
woord.first()

'woord'
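
The difference between map() and flatMap() is easiest to see side by side. This is a plain-Python sketch of the semantics only (list comprehensions and `itertools.chain` stand in for the RDD methods).

```python
from itertools import chain

lines = ["hello world", "hi"]

# map(): one output element per input element, so we get a list of lists.
mapped = [line.split(" ") for line in lines]

# flatMap(): the per-element results are flattened into a single sequence.
flat_mapped = list(chain.from_iterable(line.split(" ") for line in lines))

print(mapped)       # [['hello', 'world'], ['hi']]
print(flat_mapped)  # ['hello', 'world', 'hi']
```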

In [41]:
# filter() - return an RDD consisting of only elements that pass the condition passed to filter()
f = cijfers.filter(lambda x: x != 1)
f.first()

2

In [42]:
# distinct() - remove duplicates
cijfer = sc.parallelize([1,2,3,3])
d = cijfer.distinct().collect()
for k in d:
    print "%i " % (k)

1 
2 
3 


In [23]:
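# sample(withReplacement, fraction, [seed]) - return a random sample of the RDD
# withReplacement is a boolean; fraction is the expected sampling rate, so the result differs per run
t = cijfer.sample(True, 0.5).collect()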
for k in t:
    print (k)

2
3
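
As a rough illustration of what sample() does without replacement, the plain-Python sketch below keeps each element independently with probability `fraction`. `bernoulli_sample` is a hypothetical helper, not a Spark function, and it does not cover the with-replacement case, where an element may appear more than once.

```python
import random

def bernoulli_sample(data, fraction, seed=None):
    """Hypothetical helper: keep each element independently with probability `fraction`."""
    rng = random.Random(seed)
    return [x for x in data if rng.random() < fraction]

data = [1, 2, 3, 3]
s1 = bernoulli_sample(data, 0.5, seed=42)
s2 = bernoulli_sample(data, 0.5, seed=42)

print(s1 == s2)                # True: the same seed gives the same sample
print(len(s1) <= len(data))    # True: the sample never grows
```

Because each element is kept or dropped independently, the sample size is only approximately `fraction * len(data)`, which is why the cell above can return a different number of elements on each run.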


In [24]:
# union(), intersection(), subtract(), cartesian()
t1 = sc.parallelize([1,2,3])
t2 = sc.parallelize([3,4,5])
t3 = t1.union(t2).collect()
for k in t3:
    print (k)

1
2
3
3
4
5
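
Only union() is run above. The semantics of the other three set-like operations can be sketched in plain Python on the same inputs. This mirrors the behaviour described in the book; element order is chosen here for readability and should not be relied on in Spark.

```python
from itertools import product

t1 = [1, 2, 3]
t2 = [3, 4, 5]

# union(): concatenates both inputs and keeps duplicates
union = t1 + t2

# intersection(): elements present in both inputs, duplicates removed
intersection = sorted(set(t1) & set(t2))

# subtract(): elements of t1 that do not occur in t2
t2_set = set(t2)
subtract = [x for x in t1 if x not in t2_set]

# cartesian(): every (a, b) pair with a from t1 and b from t2
cartesian = list(product(t1, t2))

print(union)           # [1, 2, 3, 3, 4, 5]
print(intersection)    # [3]
print(subtract)        # [1, 2]
print(len(cartesian))  # 9
```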


In [47]:
# Basic actions on an RDD
# collect(), count(), countByValue()
v = sc.parallelize([1,2,4,4])
v.collect()

[1, 2, 4, 4]

In [46]:
v.countByValue()

defaultdict(int, {1: 1, 2: 1, 4: 2})

In [60]:
# take(num), top(num), takeOrdered(num, key), takeSample(withReplacement, num, [seed]), reduce(func), fold(zeroValue, func)
# aggregate(zeroValue, seqOp, combOp), foreach(func)
v = sc.parallelize([1,2,4,5])
v.take(2)
v.top(2)

[5, 4]
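
reduce(), fold() and aggregate() are listed above but not run. The plain-Python sketch below shows their semantics on the same values, assuming the data is split over two partitions. The `fold` and `aggregate` functions here are hypothetical local helpers that imitate the RDD actions; they are not the Spark implementations.

```python
from functools import reduce

# Two hypothetical partitions holding the values 1, 2, 4, 5
partitions = [[1, 2], [4, 5]]
values = [x for part in partitions for x in part]

# reduce(func): combine all elements pairwise
total = reduce(lambda a, b: a + b, values)

def fold(parts, zero, func):
    # Like reduce, but every partition starts from the zero value.
    per_partition = [reduce(func, part, zero) for part in parts]
    return reduce(func, per_partition, zero)

def aggregate(parts, zero, seq_op, comb_op):
    # seq_op folds elements into an accumulator; comb_op merges accumulators.
    per_partition = [reduce(seq_op, part, zero) for part in parts]
    return reduce(comb_op, per_partition, zero)

folded = fold(partitions, 0, lambda a, b: a + b)

# Accumulate a (sum, count) pair, then derive the average from it.
sum_count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
average = sum_count[0] / float(sum_count[1])

print(total)      # 12
print(folded)     # 12
print(sum_count)  # (12, 4)
print(average)    # 3.0
```

The zero value is applied once per partition and once more when merging, so it should be the identity for the operation (0 for addition, 1 for multiplication).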

### Chapter 4 Working with Key/Value Pairs

- Key/value RDDs are commonly used to perform aggregations. 
- Partitioning: lets users control the layout of pair RDDs across nodes
- Note: choosing the right partitioning for a distributed dataset is similar to choosing the right data structure for a local one: in both cases, data layout can greatly affect performance.
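
To make the two ideas above concrete, here is a plain-Python sketch of hash partitioning followed by a per-key aggregation. `partition_of` is a hypothetical, deliberately simple partitioner chosen so the example is reproducible; it is not how Spark hashes keys.

```python
from collections import defaultdict

pairs = [("panda", 1), ("spark", 2), ("panda", 3), ("hive", 4)]
num_partitions = 2

def partition_of(key, n):
    # Hypothetical partitioner: a deterministic hash of the key's characters.
    return sum(ord(c) for c in key) % n

# Hash partitioning: every record with the same key lands in the same partition.
partitions = defaultdict(list)
for key, value in pairs:
    partitions[partition_of(key, num_partitions)].append((key, value))

# Per-key aggregation: sum the values that share a key.
totals = defaultdict(int)
for part in partitions.values():
    for key, value in part:
        totals[key] += value

print(dict(totals))  # panda -> 4, spark -> 2, hive -> 4
```

Because all records for a key sit in one partition, the per-key sums can be computed without moving data between partitions, which is the performance benefit the note above refers to.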
