This notebook introduces the basic concepts of RDDs and the operations on them visually, by showing the contents of the RDDs as tables.

**Note: If you are looking at this in GitHub, you may not be able to see the HTML tables. Make sure to use the nbviewer link: http://nbviewer.jupyter.org/github/umddb/cmsc424-fall2016/tree/master/**

### Introduction

Apache Spark is a relatively new cluster computing framework, originally developed at UC Berkeley. It significantly generalizes the 2-stage Map-Reduce paradigm (originally proposed by Google and popularized by the open-source Hadoop system); Spark is instead based on the abstraction of **resilient distributed datasets (RDDs)**. An RDD is essentially a distributed collection of items that can be created in a variety of ways. Spark provides a set of operations to transform one or more RDDs into an output RDD, and analysis tasks are written as chains of these operations.

### Display RDD
The following helper class displays the current contents of an RDD, partition by partition. It is best used for small RDDs with a manageable number of partitions.

In [None]:
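class DisplayRDD:
    """Render an RDD as an HTML table with one column per partition."""

    def __init__(self, rdd):
        self.rdd = rdd

    def _repr_html_(self):
        # Tag each partition with its index, then collect everything to the driver.
        parts = self.rdd.mapPartitionsWithIndex(lambda i, it: [(i, list(it))]).collect()
        header = "".join("<th>Partition {}</th>".format(i) for (i, _) in parts)
        # One cell per partition; each cell is a bulleted list of that partition's elements.
        cells = "".join(
            '<td valign="bottom" align="left"><ul>{}</ul></td>'.format(
                "".join("<li>{}</li>".format(e) for e in elems))
            for (_, elems) in parts)
        return "<table><tr>{}</tr><tr>{}</tr></table>".format(header, cells)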

### Basics 1
Let's start with some basic operations using a small RDD to visualize what's going on. We will create an RDD of strings from the `states.txt` file, which contains a list of the state names.

The notebook has already initialized a SparkContext, and we can refer to it as `sc`.

We will use `sc.textFile` to create this RDD. This operation reads the file and treats every line as a separate object. The second argument of `sc.textFile` is the (minimum) number of partitions; we set it to 10 to get started. Otherwise Spark would create only a partition or two, since the file is small. We will use `DisplayRDD()` to visualize the result.

In [2]:
states_rdd = sc.textFile('states.txt', 10)
DisplayRDD(states_rdd)

                                                                                

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 | Partition 5 | Partition 6 | Partition 7 | Partition 8 | Partition 9 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Alabama, Hawaii, Massachusetts, New Mexico, South Dakota | Alaska, Idaho, Michigan, New York, Tennessee, Arizona | Illinois, Minnesota, North Carolina, Texas | Arkansas, Indiana, Mississippi, North Dakota, Utah | California, Iowa, Missouri, Ohio, Vermont, Colorado | Kansas, Montana, Oklahoma, Virginia, Connecticut, Kentucky | Nebraska, Oregon, Washington, Delaware, Louisiana | Nevada, Pennsylvania, West Virginia, Florida | Maine, New Hampshire, Rhode Island, Wisconsin, Georgia | Maryland, New Jersey, South Carolina, Wyoming |


The table above shows the contents of each partition as a list, so the first partition has 5 elements in it ('Alabama', ...). We can `repartition` the RDD to get fewer partitions, which makes the contents easier to see.

Note: There is some randomness in this process, so the result may vary if you repeat the command below.

In [3]:
states_rdd = states_rdd.repartition(5)
DisplayRDD(states_rdd)

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
|  | Kansas, Montana, Oklahoma, Virginia, Connecticut, Kentucky, Nebraska, Oregon, Washington, Delaware, Louisiana, Nevada, Pennsylvania, West Virginia, Florida, Maine, New Hampshire, Rhode Island, Wisconsin, Georgia |  | Arkansas, Indiana, Mississippi, North Dakota, Utah | Alabama, Hawaii, Massachusetts, New Mexico, South Dakota, Alaska, Idaho, Michigan, New York, Tennessee, Arizona, Illinois, Minnesota, North Carolina, Texas, California, Iowa, Missouri, Ohio, Vermont, Colorado, Maryland, New Jersey, South Carolina, Wyoming |


Let's do a transformation where we convert a string to a 2-tuple, where the second value is the length of the string. We will just use a `map` for this -- we have to provide a function as the input that transforms each element of the RDD. In this case, we are using the `lambda` keyword to define a function inline. See here: https://pythonconquerstheuniverse.wordpress.com/2011/08/29/lambda_tutorial/ for a tutorial on lambda functions.

The lambda function below simply takes a string `s` and returns the 2-tuple `(s, len(s))`.

In [4]:
states1 = states_rdd.map(lambda s: (s, len(s)))
DisplayRDD(states1)

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
|  | ('Kansas', 6), ('Montana', 7), ('Oklahoma', 8), ('Virginia', 8), ('Connecticut', 11), ('Kentucky', 8), ('Nebraska', 8), ('Oregon', 6), ('Washington', 10), ('Delaware', 8), ('Louisiana', 9), ('Nevada', 6), ('Pennsylvania', 12), ('West Virginia', 13), ('Florida', 7), ('Maine', 5), ('New Hampshire', 13), ('Rhode Island', 12), ('Wisconsin', 9), ('Georgia', 7) |  | ('Arkansas', 8), ('Indiana', 7), ('Mississippi', 11), ('North Dakota', 12), ('Utah', 4) | ('Alabama', 7), ('Hawaii', 6), ('Massachusetts', 13), ('New Mexico', 10), ('South Dakota', 12), ('Alaska', 6), ('Idaho', 5), ('Michigan', 8), ('New York', 8), ('Tennessee', 9), ('Arizona', 7), ('Illinois', 8), ('Minnesota', 9), ('North Carolina', 14), ('Texas', 5), ('California', 10), ('Iowa', 4), ('Missouri', 8), ('Ohio', 4), ('Vermont', 7), ('Colorado', 8), ('Maryland', 8), ('New Jersey', 10), ('South Carolina', 14), ('Wyoming', 7) |


Let's collect all the names with the same length together using a group by operation.
```
groupByKey([numTasks]) 	When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs. 
```
This wouldn't work as is, because `states1` is using the name as the key. Let's change that around.

In [5]:
states2 = states1.map(lambda t: (t[1], t[0]))
DisplayRDD(states2)

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
|  | (6, 'Kansas'), (7, 'Montana'), (8, 'Oklahoma'), (8, 'Virginia'), (11, 'Connecticut'), (8, 'Kentucky'), (8, 'Nebraska'), (6, 'Oregon'), (10, 'Washington'), (8, 'Delaware'), (9, 'Louisiana'), (6, 'Nevada'), (12, 'Pennsylvania'), (13, 'West Virginia'), (7, 'Florida'), (5, 'Maine'), (13, 'New Hampshire'), (12, 'Rhode Island'), (9, 'Wisconsin'), (7, 'Georgia') |  | (8, 'Arkansas'), (7, 'Indiana'), (11, 'Mississippi'), (12, 'North Dakota'), (4, 'Utah') | (7, 'Alabama'), (6, 'Hawaii'), (13, 'Massachusetts'), (10, 'New Mexico'), (12, 'South Dakota'), (6, 'Alaska'), (5, 'Idaho'), (8, 'Michigan'), (8, 'New York'), (9, 'Tennessee'), (7, 'Arizona'), (8, 'Illinois'), (9, 'Minnesota'), (14, 'North Carolina'), (5, 'Texas'), (10, 'California'), (4, 'Iowa'), (8, 'Missouri'), (4, 'Ohio'), (7, 'Vermont'), (8, 'Colorado'), (8, 'Maryland'), (10, 'New Jersey'), (14, 'South Carolina'), (7, 'Wyoming') |


Note above that Spark did not do a shuffle to ensure that the same `keys` end up on the same partition. In fact, the `map` operation does not do a shuffle. 

Now we can do a groupByKey. 

In [6]:
states3 = states2.groupByKey()
DisplayRDD(states3)

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
| (10, ), (5, ) | (6, ), (11, ) | (7, ), (12, ) | (8, ), (13, ) | (9, ), (4, ), (14, ) |


That looks weird... it seems to have done a group by, but we are missing the groups themselves. This is because each value is a `pyspark.resultiterable.ResultIterable`, which our DisplayRDD code does not translate into a readable string: its default string form is wrapped in angle brackets, so the browser treats it as an unknown HTML tag and shows nothing. We can fix that by converting the values to lists before calling DisplayRDD.

In [7]:
DisplayRDD(states3.mapValues(list))

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
| (10, ['Washington', 'New Mexico', 'California', 'New Jersey']), (5, ['Maine', 'Idaho', 'Texas']) | (6, ['Kansas', 'Oregon', 'Nevada', 'Hawaii', 'Alaska']), (11, ['Connecticut', 'Mississippi']) | (7, ['Montana', 'Florida', 'Georgia', 'Indiana', 'Alabama', 'Arizona', 'Vermont', 'Wyoming']), (12, ['Pennsylvania', 'Rhode Island', 'North Dakota', 'South Dakota']) | (8, ['Oklahoma', 'Virginia', 'Kentucky', 'Nebraska', 'Delaware', 'Arkansas', 'Michigan', 'New York', 'Illinois', 'Missouri', 'Colorado', 'Maryland']), (13, ['West Virginia', 'New Hampshire', 'Massachusetts']) | (9, ['Louisiana', 'Wisconsin', 'Tennessee', 'Minnesota']), (4, ['Utah', 'Iowa', 'Ohio']), (14, ['North Carolina', 'South Carolina']) |


There it goes. Now we can see that the operation properly grouped together the state names by their lengths. This operation required a `shuffle` since originally all names with length, say 10, were all over the place.
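
The layout in the table above is consistent with a simple rule for where each key ends up after the shuffle. The sketch below is plain Python (no Spark needed) and assumes that a key `k` is sent to partition `hash(k) % num_partitions`; that rule is an assumption about the default partitioner rather than something stated in this notebook, and `partition_for` is a hypothetical helper name.

```python
# Plain-Python model of hash partitioning (not Spark code).
# Assumption: a record with key k goes to partition hash(k) % num_partitions.

def partition_for(key, num_partitions):
    """Return the index of the partition that a key is sent to."""
    return hash(key) % num_partitions

# The distinct name lengths that occur as keys in `states2`.
lengths = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

# Group the keys by the partition they would land in.
layout = {}
for k in lengths:
    layout.setdefault(partition_for(k, 5), []).append(k)

for p in sorted(layout):
    print("Partition", p, "-> keys", layout[p])
```

For small integers Python's `hash` returns the integer itself, so the keys 5 and 10 share partition 0, 6 and 11 share partition 1, and so on, which is the same grouping of keys per partition as in the `groupByKey` output.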

`groupByKey` does not reduce the size of the RDD. If we were interested in `counting` the number of states with a given length (i.e., a `group by count` query), we can use `reduceByKey` instead. However that requires us to do a map first.

In [8]:
states4 = states2.mapValues(lambda x: 1)
DisplayRDD(states4)

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
|  | (6, 1), (7, 1), (8, 1), (8, 1), (11, 1), (8, 1), (8, 1), (6, 1), (10, 1), (8, 1), (9, 1), (6, 1), (12, 1), (13, 1), (7, 1), (5, 1), (13, 1), (12, 1), (9, 1), (7, 1) |  | (8, 1), (7, 1), (11, 1), (12, 1), (4, 1) | (7, 1), (6, 1), (13, 1), (10, 1), (12, 1), (6, 1), (5, 1), (8, 1), (8, 1), (9, 1), (7, 1), (8, 1), (9, 1), (14, 1), (5, 1), (10, 1), (4, 1), (8, 1), (4, 1), (7, 1), (8, 1), (8, 1), (10, 1), (14, 1), (7, 1) |


`reduceByKey` takes a single reduce function as input, which tells Spark how to combine any two values that share a key. In this case, we simply sum them up.

In [9]:
DisplayRDD(states4.reduceByKey(lambda v1, v2: v1 + v2))

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
| (10, 4), (5, 3) | (6, 5), (11, 2) | (7, 8), (12, 4) | (8, 12), (13, 3) | (9, 4), (4, 3), (14, 2) |
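
To make the role of the reduce function concrete, here is a plain-Python sketch of what `reduceByKey` computes. It is a model of the result only: it ignores partitions and shuffling entirely, and `reduce_by_key` is a hypothetical helper, not a Spark function.

```python
from functools import reduce

def reduce_by_key(pairs, func):
    """Combine all values that share a key, two at a time, using func."""
    grouped = {}
    for key, value in pairs:
        grouped.setdefault(key, []).append(value)
    # reduce applies func pairwise: func(func(v1, v2), v3), ...
    return {key: reduce(func, values) for key, values in grouped.items()}

# The first six (length, 1) pairs from partition 1 of `states4`.
pairs = [(6, 1), (7, 1), (8, 1), (8, 1), (11, 1), (8, 1)]
counts = reduce_by_key(pairs, lambda v1, v2: v1 + v2)
print(sorted(counts.items()))
```

Because the function is only ever applied to two values at a time, it should be associative and commutative so that the order in which values are combined does not change the result.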


This count can be computed more directly with `aggregateByKey`, which avoids the separate `mapValues` step, but the syntax takes some getting used to. `aggregateByKey` takes a start (zero) value, a function that folds one value from the RDD into the running result for its key, and a second function that merges two such partial results.

In [10]:
DisplayRDD(states2.aggregateByKey(0, lambda acc, v: acc + 1, lambda v1, v2: v1 + v2))

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
| (10, 4), (5, 3) | (6, 5), (11, 2) | (7, 8), (12, 4) | (8, 12), (13, 3) | (9, 4), (4, 3), (14, 2) |
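
The three arguments of `aggregateByKey` are easier to follow in a small model. The sketch below is plain Python, not Spark's implementation: it treats an RDD as a list of partitions, applies the first function within each partition, and then merges the per-partition results with the second function. The helper name `aggregate_by_key` and the two-partition sample are illustrative assumptions.

```python
def aggregate_by_key(partitions, zero, seq_op, comb_op):
    """Model of aggregateByKey over a list of partitions of (key, value) pairs."""
    # Step 1: within each partition, fold every value into an accumulator per key.
    partials = []
    for part in partitions:
        acc = {}
        for key, value in part:
            acc[key] = seq_op(acc.get(key, zero), value)
        partials.append(acc)
    # Step 2: merge the accumulators produced by different partitions.
    merged = {}
    for acc in partials:
        for key, value in acc.items():
            merged[key] = comb_op(merged[key], value) if key in merged else value
    return merged

# A few (length, name) pairs taken from `states2`, split over two partitions.
partitions = [
    [(8, 'Arkansas'), (7, 'Indiana'), (11, 'Mississippi')],
    [(7, 'Alabama'), (8, 'Michigan'), (8, 'New York')],
]
result = aggregate_by_key(partitions, 0, lambda acc, v: acc + 1, lambda a, b: a + b)
print(sorted(result.items()))
```

Note that the first function receives the accumulator (a count) and a value (a state name), which have different types; this is what lets `aggregateByKey` count names without first mapping every value to 1.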


### Basics 2: FlatMap

Unlike a `map`, the function used for `flatMap` returns a list -- this is used to allow for the possibility that we will generate different numbers of outputs for different elements. Here is an example where we split each string in `states_rdd` into multiple substrings.

The lambda function below splits a string into chunks of size 5, so 'South Dakota' gets split into 'South', ' Dako', and 'ta'. The lambda function itself returns a list. If you used `map` instead, the result would be different: each state would produce a single element, namely the whole list of its chunks.

In [11]:
DisplayRDD(states_rdd.flatMap(lambda x: [str(x[i:i+5]) for i in range(0, len(x), 5)]))

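| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 |
| --- | --- | --- | --- | --- |
|  | 'Kansa', 's', 'Monta', 'na', 'Oklah', 'oma', 'Virgi', 'nia', 'Conne', 'cticu', 't', 'Kentu', 'cky', 'Nebra', 'ska', 'Orego', 'n', 'Washi', 'ngton', 'Delaw', 'are', 'Louis', 'iana', 'Nevad', 'a', 'Penns', 'ylvan', 'ia', 'West ', 'Virgi', 'nia', 'Flori', 'da', 'Maine', 'New H', 'ampsh', 'ire', 'Rhode', ' Isla', 'nd', 'Wisco', 'nsin', 'Georg', 'ia' |  | 'Arkan', 'sas', 'India', 'na', 'Missi', 'ssipp', 'i', 'North', ' Dako', 'ta', 'Utah' | 'Alaba', 'ma', 'Hawai', 'i', 'Massa', 'chuse', 'tts', 'New M', 'exico', 'South', ' Dako', 'ta', 'Alask', 'a', 'Idaho', 'Michi', 'gan', 'New Y', 'ork', 'Tenne', 'ssee', 'Arizo', 'na', 'Illin', 'ois', 'Minne', 'sota', 'North', ' Caro', 'lina', 'Texas', 'Calif', 'ornia', 'Iowa', 'Misso', 'uri', 'Ohio', 'Vermo', 'nt', 'Color', 'ado', 'Maryl', 'and', 'New J', 'ersey', 'South', ' Caro', 'lina', 'Wyomi', 'ng' |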
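
The difference between `map` and `flatMap` can be seen without Spark. The following plain-Python sketch applies the same chunking function both ways to two state names; `chunks` is a hypothetical helper that mirrors the lambda used above, and the list comprehensions only model what the two operations return.

```python
def chunks(s, size=5):
    """Split a string into consecutive pieces of at most `size` characters."""
    return [s[i:i + size] for i in range(0, len(s), size)]

names = ['South Dakota', 'Utah']

# map: exactly one output element per input element (here, a list per name).
mapped = [chunks(n) for n in names]

# flatMap: the lists returned by the function are flattened into one collection.
flat_mapped = [c for n in names for c in chunks(n)]

print(mapped)       # [['South', ' Dako', 'ta'], ['Utah']]
print(flat_mapped)  # ['South', ' Dako', 'ta', 'Utah']
```

With `map` the output always has as many elements as the input; with `flatMap` it can have more or fewer, depending on how many items the function returns for each element.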


### Basics 3: Joins

Finally, let's look at an example of joins. We will still use small RDDs, but we now need two of them. We will use `sc.parallelize` to create those RDDs. That function takes a list and creates an RDD from it by splitting the list into partitions that are distributed across machines. It takes the number of partitions as an optional second argument.

Note again that Spark made no attempt to co-locate the objects (i.e., the tuples) with the same key.

In [12]:
rdd1 = sc.parallelize([('alpha', 1), ('beta', 2), ('gamma', 3), ('alpha', 5), ('beta', 6)], 3)
DisplayRDD(rdd1)

| Partition 0 | Partition 1 | Partition 2 |
| --- | --- | --- |
| ('alpha', 1) | ('beta', 2), ('gamma', 3) | ('alpha', 5), ('beta', 6) |


In [13]:
rdd2 = sc.parallelize([('alpha', 'South Dakota'), ('beta', 'North Dakota'), ('zeta', 'Maryland'), ('beta', 'Washington')], 3)
DisplayRDD(rdd2)

| Partition 0 | Partition 1 | Partition 2 |
| --- | --- | --- |
| ('alpha', 'South Dakota') | ('beta', 'North Dakota') | ('zeta', 'Maryland'), ('beta', 'Washington') |


Here is the definition of join from the programming guide.
```
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin. 
```
We want to join on the first attribute of each tuple, so we can call `join` directly; otherwise a `map` might have been required first.

In [14]:
rdd3 = rdd1.join(rdd2)
DisplayRDD(rdd3)

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 | Partition 5 |
| --- | --- | --- | --- | --- | --- |
|  |  |  |  |  | ('alpha', (1, 'South Dakota')), ('alpha', (5, 'South Dakota')), ('beta', (2, 'North Dakota')), ('beta', (2, 'Washington')), ('beta', (6, 'North Dakota')), ('beta', (6, 'Washington')) |


There are several empty partitions; we could have controlled the number of partitions with an optional argument to `join`. In any case, the output is what we were after. The outer joins behave as you would expect; `fullOuterJoin` produces two extra tuples, one for each key that appears in only one of the two RDDs.

In [15]:
DisplayRDD(rdd1.fullOuterJoin(rdd2))

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 | Partition 5 |
| --- | --- | --- | --- | --- | --- |
|  | ('zeta', (None, 'Maryland')) |  | ('gamma', (3, None)) |  | ('alpha', (1, 'South Dakota')), ('alpha', (5, 'South Dakota')), ('beta', (2, 'North Dakota')), ('beta', (2, 'Washington')), ('beta', (6, 'North Dakota')), ('beta', (6, 'Washington')) |
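
The join semantics quoted from the programming guide can be checked by hand on these two small datasets. The sketch below is a plain-Python model of the results only (no partitions, no shuffle); `group`, `inner_join`, and `full_outer_join` are hypothetical helpers written for illustration, not Spark functions.

```python
from collections import defaultdict

def group(pairs):
    """Collect the values of (key, value) pairs into a list per key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def inner_join(left, right):
    """All (k, (v, w)) combinations for keys present on both sides."""
    l, r = group(left), group(right)
    return [(k, (v, w)) for k in l if k in r for v in l[k] for w in r[k]]

def full_outer_join(left, right):
    """Like inner_join, but a key missing on one side is paired with None."""
    l, r = group(left), group(right)
    out = []
    for k in sorted(set(l) | set(r)):
        for v in l.get(k, [None]):
            for w in r.get(k, [None]):
                out.append((k, (v, w)))
    return out

data1 = [('alpha', 1), ('beta', 2), ('gamma', 3), ('alpha', 5), ('beta', 6)]
data2 = [('alpha', 'South Dakota'), ('beta', 'North Dakota'),
         ('zeta', 'Maryland'), ('beta', 'Washington')]

inner = inner_join(data1, data2)
outer = full_outer_join(data1, data2)
print(len(inner), len(outer))
```

The inner join yields 2 pairs for 'alpha' (2 values on the left, 1 on the right) and 4 for 'beta' (2 on each side); the full outer join adds one tuple each for 'gamma' and 'zeta', which appear on only one side.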


`cogroup` is a related function, but it creates two lists for each key, one per input RDD. As the output below shows, there is a single object for each key, and its value is a pair of iterables. Our DisplayRDD code can't display those iterables directly, so we convert them to lists first.

In [16]:
DisplayRDD(rdd1.cogroup(rdd2).mapValues(lambda x: (list(x[0]), list(x[1]))))

| Partition 0 | Partition 1 | Partition 2 | Partition 3 | Partition 4 | Partition 5 |
| --- | --- | --- | --- | --- | --- |
|  | ('zeta', ([], ['Maryland'])) |  | ('gamma', ([3], [])) |  | ('alpha', ([1, 5], ['South Dakota'])), ('beta', ([2, 6], ['North Dakota', 'Washington'])) |
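
One way to see how `cogroup` relates to `join` is that the inner join can be derived from the cogrouped values by taking, for each key, every combination of a left value with a right value. The plain-Python sketch below models this on the same data; the `cogroup` function here is a hypothetical stand-in written for illustration, not Spark's implementation.

```python
from itertools import product

def cogroup(left, right):
    """For every key on either side, pair up the list of left values and right values."""
    keys = sorted({k for k, _ in left} | {k for k, _ in right})
    return {k: ([v for kk, v in left if kk == k],
                [w for kk, w in right if kk == k])
            for k in keys}

data1 = [('alpha', 1), ('beta', 2), ('gamma', 3), ('alpha', 5), ('beta', 6)]
data2 = [('alpha', 'South Dakota'), ('beta', 'North Dakota'),
         ('zeta', 'Maryland'), ('beta', 'Washington')]

grouped = cogroup(data1, data2)

# An inner join is the per-key Cartesian product of the two lists;
# keys with an empty list on either side contribute nothing.
joined = [(k, pair) for k, (vs, ws) in grouped.items() for pair in product(vs, ws)]

print(grouped['beta'])
print(len(joined))
```

Keys such as 'gamma' and 'zeta' still appear in the cogrouped result with one empty list, which is why they show up in the `cogroup` output but not in the inner join.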


### Basics 4

Here we will run some of the commands from the README file, using an RDD created from the lines of the `README.md` file. You can use DisplayRDD here as well, but the output is rather large.

In [17]:
textFile = sc.textFile("README.md", 10)

In [18]:
textFile.count()

147

In [19]:
textFile.take(5)

['# Assignment 4: Apache Spark',
 '',
 "The goal of this assignment is to learn how to do large-scale data analysis tasks using Apache Spark: for this assignment, we will use relatively small datasets and  we won't run anything in distributed mode; however Spark can be easily used to run the same programs on much larger datasets.",
 '',
 '### Getting Started with Spark']

As described in the README file, the following command does a word count, by first separating out the words using a `flatMap`, and then using a `reduceByKey`.

In [20]:
counts = textFile.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
DisplayRDD(counts)

Partition 0,Partition 1,Partition 2,Partition 3,Partition 4,Partition 5,Partition 6,Partition 7,Partition 8,Partition 9
"('Assignment', 2)('', 91)('The', 20)('of', 55)('is', 36)('run', 4)('anything', 1)('basically', 2)('at', 6)('[Apache', 1)('(originally', 1)('instead', 1)('an', 17)('output', 7)('HDFS', 1)('1.', 2)('**Version', 1)('have', 7)('there,', 1)('ready', 1)('languages:', 1)('quick', 2)('PYSPARK_DRIVER_PYTHON_OPTS=""notebook', 1)('--no-browser', 1)('Shell', 2)('relevant', 1)('python', 3)('sc.textFile(""README.md"")`:', 1)('commands', 3)('line', 2)('RDD).', 1)('(http://spark.apache.org/docs/latest/quick-start.html).', 1)('following', 4)('appears', 1)('Use', 6)('class,', 1)('preferable,', 1)('especially', 1)('mode.', 1)('look', 2)('Programming', 1)('Details', 1)('initializes', 1)('*', 6)('consisting', 6)('tuple', 2)('documents', 1)('inspect', 1)('defined', 1)('would', 8)('(0.25)**:', 16)('no', 2)('year', 3)('`moviesRDD`.', 1)('(genre,', 1)('two', 6)('tag', 1)('before', 3)('applying', 1)('((7,', 3)('input', 5)('was', 1)('Easiest', 1)('[(0,', 1)('10', 1)('everything', 1)('not"",', 1)('characters', 1)('-->', 3)(""'before',"", 1)('50', 1)('score', 1)('`postsRDD`', 1)('purpose,', 1)('Output', 1)('count', 1)('score.', 1)('answers:', 1)('197,', 1)('starting', 1)('category', 1)('(just', 1)('format', 2)('Try', 1)('than', 1)('analogous', 1)('host,', 1)('host', 2)('something', 1)('(with', 2)('initial', 1)('""simply', 1)('Sample', 1)('results.txt', 1)","('how', 2)('analysis', 2)('and', 41)('much', 1)('computing', 1)('paradigm', 1)('by', 13)('Installing', 1)('you', 14)('Download', 1)('2.', 2)('available', 1)('`tar', 1)('provided,', 1)('here', 5)('follow', 2)('standard', 2)('Jupyter', 2)('Notebook', 3)('within', 1)('directly.', 1)('`$SPARKHOME/bin/pyspark`:', 1)('local', 2)('`sc.textFile`', 1)('(which', 1)('simply', 1)('application.', 1)('b)`', 1)('functions.', 1)('return', 5)('sum(a,', 1)('`flatmap`', 2)('splits', 1)('detailed', 1)('Walk-Through).', 1)('functions,', 2)('separate', 2)('write', 4)('which', 2)('program', 1)('RDDs:', 1)('""movies""', 1)('Dataset', 1)('any', 
2)('`functions.py`', 1)('(several', 1)('**Task', 16)('dictionaries', 1)('-', 15)('tuple.', 1)('lexicographically', 3)('`reduceByKey`', 2)('movieID', 1)('Using', 1)('movie.', 2)('First', 1)('null,', 1)('average', 3)('most', 1)('rating.', 1)('done', 2)('(3,', 1)('1163)]_', 1)('Write', 2)('`task10_flatmap`', 1)('""sanitization""', 1)('processing', 1)('""is', 1)('""s"".', 1)('order.', 1)(""'i',"", 2)(""'scene',"", 1)(""'beatrice',"", 1)('Nobel', 3)('13', 1)('were', 2)('logs', 1)(""'01/Jul/1995'"", 1)('self-explanatory,', 1)('need:', 1)('(provided', 1)('Task', 1)('recommendations.', 1)('closest', 1)('reality,', 1)('measure).', 1)('doable', 1)('watched.', 1)('ratingsRDD', 1)","('#', 1)('Spark', 12)('this', 9)('we', 13)('use', 12)('datasets', 2)('in', 29)('Started', 1)('new', 3)('Google', 1)('based', 1)('created', 1)('these', 3)('already', 1)('But', 1)('downloaded', 3)('directory', 2)('3.', 2)('supports', 1)('way', 2)('\tPYSPARK_PYTHON=/usr/bin/python3', 1)('8881', 1)('file.', 2)('array', 1)('rest', 1)('Count', 2)('command', 1)('word:', 1)('textfile.flatMap(split).map(generateone).reduceByKey(sum)', 1)('(we', 2)('representation', 1)('compact', 1)('out', 5)('wordcount.py`', 1)('""Users""', 1)('(`se_posts.json`),', 1)('Shakespeare', 1)('examples', 1)('16', 2)('1', 1)('form:', 3)('ViewCount).', 1)('Note', 3)('`flatMap`', 2)('extract', 2)('title', 1)('`join`', 1)('that,', 1)('element.', 1)('moviesRDD,', 1)('title,', 2)('latter', 1)('7:', 1)('form', 2)('movie).', 1)('requests', 1)('teh', 1)('`(0,', 1)('2268),', 1)('(1,', 1)('(3)', 1)('replace', 1)('""is\'t""', 1)('""the"",', 1)(""'hero',"", 1)('potentially', 1)('anomalous', 1)('behavior.', 1)('[(154,', 1)('(630,', 1)('64,', 1)('transformations', 1)('returns', 2)('print', 1)('*dates*', 1)('dates', 2)('[NASA', 1)('14', 1)('days', 1)('fetched', 2)('user,', 2)('Jaccard', 3)('rated', 2)('actual', 1)('compare', 1)('<=', 3)('sets', 1)('0.1270358306188925))_', 1)('sequences', 1)(""'motivation'"", 1)('`spark-submit`', 1)('over).', 
1)","('data', 3)('relatively', 2)('small', 3)('be', 22)('found', 1)('Spark](https://spark.apache.org)', 1)('cluster', 1)('2-stage', 1)('system);', 1)('items,', 1)('variety', 1)('or', 5)('RDDs', 6)('it', 14)('package', 1)('(modify', 1)('`/spark`', 1)('--', 7)('Scala', 2)('(Spark', 1)('To', 2)('\t```', 2)('--port=8881""', 1)('also', 5)('about', 2)('doing).', 1)('initialized', 1)('called', 1)('`textFile`,', 1)('reading', 1)('output.', 1)('without', 1)('cluster.', 1)('runs', 1)('`$SPARKHOME/bin/spark-submit', 1)('(`se_users.json`),', 1)('""Posts""', 1)('Noble', 1)('few', 3)('""ratings""', 1)('RDDs.', 1)('Before', 1)('`filter`', 2)('not', 4)('so', 1)('postsRDD', 1)('2', 3)('RDD.', 1)('So', 3)('appropriate', 3)('contain', 2)('first', 8)('rated,', 1)('2-tuple', 1)('genre),', 1)('6', 1)(""Let's"", 2)(""'mysql'),"", 1)('Complete', 2)('computes', 3)('rating', 3)('2.87)`', 1)('(not', 2)('followed', 3)('previous', 2)('100)`', 1)('days),', 1)('"",', 1)('""it', 2)('""\'tis""', 1)('(4),', 1)(""'leonato',"", 2)(""'with',"", 1)('9.341232227488153),', 1)('PairRDD', 2)('Make', 1)('strings,', 1)('entries', 1)('On', 1)(""'cogroup'"", 1)('15', 1)('earlier', 1)('neighbor', 1)('ratings', 1)('every', 1)('100)', 1)('collect', 1)('consecutive', 1)('sentence', 1)('are"",', 1)('its', 1)('easier', 1)('develop', 1)('(by', 1)","('easily', 1)('###', 9)('excellent', 2)('tutorials', 1)('[Spark', 2)('website](http://spark.apache.org).', 1)('developed', 1)('open-source', 1)('provides', 1)('set', 2)('more', 5)('RDD,', 2)('are', 14)('distribution.', 1)('manually:', 1)('uncompress', 1)('SPARKHOME=/data/Assignment-3/spark-3.5.0-bin-hadoop3/`', 1)('Note:', 2)('persist', 1)('docker', 1)('Python', 7)('Scala),', 1)('instructions', 1)('languages.', 1)('very', 1)('verbose', 1)('follow.', 1)('(and', 1)('--ip=0.0.0.0', 1)('make', 2)('just', 2)('=', 2)('(word,', 2)('`lambda`', 2)('b', 1)('Tutorial](http://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html#Source+Code)', 1)('tuples', 4)('last', 1)('MovieLens', 
1)('listing', 1)('genres,', 1)('appear', 2)('regular', 1)('`reduceByKey`.', 2)('Ignore', 1)('there', 2)('4', 2)('rated.', 4)(""['Adventure',"", 1)(""'1'"", 1)('connect', 1)('`ratingsRDD`,', 1)('userID', 1)('(title-word,', 1)('value', 5)('specific', 2)('owneruserid', 1)('tag)', 1)('may', 2)('[((7,', 1)(""'my.cnf'),"", 1)('1)]_', 1)('common', 2)('9', 2)('execute', 1)('further', 1)('ML).', 1)('(2)', 1)('stop', 1)('it"",', 1)('""\'twere""', 1)('after', 1)(""['act',"", 1)('3.979695431472081),', 1)('12', 1)('surnames).', 1)('class', 1)('(if', 1)(""'02/Jul/1995')."", 1)('details', 1)('""nearest', 1)('neighbors""', 1)('incorporate', 1)('Since', 1)('100).', 1)('100),', 1)('""sequences', 1)('different', 1)(""Don't"", 1)","('to', 49)('learn', 1)('large-scale', 1)('Spark:', 1)('assignment,', 1)(""won't"", 2)('distributed', 3)('on', 13)('Getting', 1)('Berkeley.', 1)('generalizes', 1)('Map-Reduce', 2)('Hadoop', 3)('(RDDs)**.', 1)('ways.', 1)('one', 2)('written', 2)('chains', 1)('ecosystem,', 1)('if', 3)('want', 3)('`Assignment-5/`', 1)('(so', 1)('`export', 1)('appropriately', 1)('else).', 1)('stuff', 2)('equivalent', 1)('code', 4)('play', 3)('PYSPARK_DRIVER_PYTHON=""jupyter""', 1)('You', 8)('variables', 1)('textFile', 1)('see', 3)('prints', 1)('5', 2)('(in', 1)('shell)', 1)('does', 1)('word', 3)('count,', 2)('i.e.,', 2)('number', 8)('+', 2)('```', 2)('`map`', 5)('discuss', 1)('Running', 1)('file,', 2)('contains', 4)('Guide](https://spark.apache.org/docs/latest/programming-guide.html)', 1)('manipulation', 1)('should', 14)('interfaces.', 1)('table', 2)('lines', 2)('all', 17)('tasks,', 1)('require', 3)('`null`', 1)('atleast', 1)('10000,', 1)('has', 6)('OwnerUserId,', 1)('desired', 1)('tuples.', 1)('moviesRDD', 1)('Genre),', 1)('outputRDD', 2)('(using', 1)('title.', 1)('smallest', 3)('associated', 3)('users', 5)('function', 8)('second', 3)('key', 5)('former', 1)('`aggregateByKey`', 2)('filter', 1)('split', 1)('reduceByKey', 1)('appears.', 1)('_Answer', 3)(""'schema'),"", 1)('7', 
1)('takes', 3)('`ratingsRDD`', 2)('2-tuples', 1)('`map`.', 1)('`groupByKey`', 1)('map', 1)('movies,', 1)('For', 5)('(2,', 1)('(4)', 1)('(very', 1)('were"",', 1)('remove:', 1)('""is"",', 1)('lines:', 1)('10.', 1)('userid', 2)('_A', 1)('211,', 1)('etc),', 1)('their', 1)('values', 1)('`take(5)`,', 1)('lists).', 1)('Logs](http://ita.ee.lbl.gov/html/contrib/NASA-HTTP.html)', 1)('above),', 1)('2-tuple,', 1)('product', 1)('restrict', 1)('id', 2)('u2', 1)('highest', 1)('comparing', 1)('answer:', 1)('simply"",', 1)('application', 1)('bigrams', 1)('`motivation`s', 1)('appear.', 1)('File', 1)","('Apache', 2)('goal', 3)('do', 8)('using', 5)('mode;', 1)('however', 1)('larger', 1)('It', 2)('**resilient', 1)('collection', 1)('operations', 2)('operations.', 1)('resource', 1)('image', 2)('http://spark.apache.org/downloads.html.', 1)('using:', 1)('spark-3.5.0-bin-hadoop3.tgz`', 1)('`spark-3.5.0-bin-hadoop3/`.', 1)('Set', 1)('SPARKHOME', 2)('5.', 1)('move', 1)('but', 9)('when', 1)('primarily', 1)('three', 2)('Java,', 1)('start', 3)('other', 4)('Java', 2)('shell.', 2)('sure', 2)('work.', 1)('PySpark', 2)('shell', 1)('####', 1)('counts', 2)('`README.md`.', 1)('"")).map(lambda', 1)('generateone(word):', 1)('b):', 1)('`project5`', 1)('try', 2)('folllowing', 1)('(`play.txt`)', 1)('years', 1)('Two', 1)('(starting', 1)('amount', 1)('typically', 1)('code.', 1)(""'OwnerUserId'"", 1)(""'viewcount'"", 1)('viewCount', 1)('contents', 1)('movie', 3)('`flatMap`s', 1)('tags,', 1)(""'version-control'),"", 1)('`mode`', 1)('ties,', 1)('`mode`.', 1)('aggregating', 1)('users.', 2)('(the', 1)('group', 1)('3004),', 1)('steps', 2)('Specifically:', 2)('lowercase,', 1)('expand', 1)('like', 2)('""do', 1)(""'and',"", 1)('sequence', 1)('`list`s,', 1)('`logsRDD`.', 2)('dates.', 1)('minimize', 1)('likely', 1)(""let's"", 1)('100', 2)(""('4',"", 1)('[Bigrams](http://en.wikipedia.org/wiki/Bigram)', 1)('example,', 1)('etc.', 1)('task', 1)('bigram', 1)('Prizes', 1)('copying', 1)","('assignment', 2)('for', 26)('will', 
14)('the', 165)('same', 3)('summary', 1)('that', 26)('RDD', 16)('transform', 1)('file', 8)('system', 1)('manager.', 1)('Docker', 1)('includes', 1)('up', 3)('We', 7)('3.5.0,', 1)('3.3', 1)('later**.', 1)('create', 6)('directory:', 1)('interface', 1)('--allow-root', 1)('(it', 1)('otherwise', 1)('`>>>', 2)('creates', 1)('per', 3)('some', 4)('Application', 2)('pyspark', 1)('times', 3)('each', 17)('`counts.take(5)`', 1)('def', 3)('"")', 1)('1)', 1)('words,', 1)('[Hadoop', 1)('your', 1)('*submit*', 1)('encourage', 1)('provided', 5)('file:', 2)('`spark_assignment.py`,', 1)('JSON', 1)('Laureates', 2)('beginning', 1)('`print(rdd.take(10))`).', 1)('`task`).', 1)('one-liners),', 1)('find', 7)('(None', 1)('then', 4)('Title,', 1)('running', 1)('`postsRDD.take(10)`', 1)('If', 2)('user', 8)('across', 5)('they', 8)(""'Fantasy']),"", 1)('genre', 2)('title-word', 1)('couple', 1)('count.', 1)('compute', 3)('`postsRDD`.', 1)('pair', 1)(""`('1',"", 1)('year.', 1)('records', 1)('3565),', 1)('often', 1)('text', 1)('(1)', 1)('common)', 1)('words.', 2)('(2),', 1)('is""', 1)(""'house',"", 1)('11', 1)('Specifically,', 1)('post', 2)('<', 1)('`map`,', 1)('(`physics`', 1)('objects', 1)('present', 1)('entries.', 1)('element', 2)('day,', 1)('day.', 1)('did', 2)('coefficient', 2)('users,', 1)('u1', 1)('transformation', 1)('_Example', 1)('sequences"",', 1)('of"",', 1)('reason', 1)('assume', 1)","('tasks', 4)('used', 4)('programs', 1)('datasets.', 1)('guide', 2)('originally', 1)('UC', 1)('significantly', 1)('proposed', 1)('into', 2)('as', 15)(""'/data/Assignment-5'),"", 1)('always', 2)('quit', 1)('image.', 1)('Spark.', 1)('Python.', 1)('below', 1)('do:', 3)('$SPARKHOME/bin/pyspark', 1)('bunch', 1)('shell,', 2)('`textFile.first()`,', 1)('`textFile.take(5)`', 1)('items', 1)('recommend', 1)('line.split(""', 2)('1)).reduceByKey(lambda', 1)('b:', 1)('split(line):', 1)('counting', 3)('description:', 1)('large', 1)('program,', 1)('More...', 1)('dictionary', 2)('(`NASA_logs_sample.txt`)', 
1)('(`prize.json`)', 1)('(e.g.,', 3)('fill', 1)('functions', 1)('where', 8)('python)', 1)('(ID,', 2)('movies', 11)('num-movies).', 1)('similar', 2)('above,', 1)('User', 1)('1),', 2)('correct', 2)('8', 1)('higher', 1)('question,', 1)('only', 1)('aggregate.', 1)('flatmap', 1)('`playRDD`', 1)('line,', 1)('tokenization', 1)('""don\'t""', 1)('""', 1)('remove', 1)(""'a',"", 1)('>', 1)('posts,', 1)('`filter`.', 1)('(26626,', 1)('2.734375)]_', 1)('`prizeRDD`', 1)('""hosts""', 1)('end', 1)('URLs', 2)('looking', 1)('ignore', 1)('distance', 1)(""('57',"", 1)('Prize).', 1)('bigram,', 1)('`motivations`', 1)('`pyspark`', 1)","('4:', 1)('can', 15)('with', 20)('This', 7)('a', 58)('framework,', 1)('popularized', 1)('abstraction', 1)('An', 6)('including', 1)('YARN', 1)('spark', 2)('Pre-built', 1)('Move', 1)('zxvf', 1)('4.', 2)('variable:', 1)('somewhere', 1)('tutorial', 1)('(http://spark.apache.org/docs/latest/quick-start.html)', 1)('hard', 1)('shows', 1)('through', 2)('provided),', 1)('need', 4)('mapping', 1)('port', 1)('what', 1)('from', 12)('containing', 2)('entry', 1)('information', 2)('doing', 1)('`textFile.count()`', 1)('Here', 3)('Word', 2)('textFile.flatMap(lambda', 1)('line:', 1)('a,', 1)('`reduce`', 1)('(look', 1)('better', 1)('definitions.', 1)('Instead', 1)('`wordcount.py`,', 1)('commands.', 1)('Stackexchange', 2)('log', 4)('pertaining', 1)('over', 2)('(https://grouplens.org/datasets/movielens/)', 1)('Your', 2)('posts', 2)('our', 1)('genres', 7)('3', 1)('year,', 1)('expression),', 1)(""('1',"", 1)('assuming', 1)('those', 2)('list', 8)('(user,', 1)('2),', 1)('answer).', 2)('either', 1)('(i.e.,', 3)('pick', 1)('unlike', 1)('`logsRDD`,', 2)('day', 2)('sample', 2)('operates', 2)('words', 2)('non-alphanumerical', 1)('""in"",', 1)(""'enter',"", 1)(""'messenger']_"", 1)('`reduceByKey`,', 1)('well', 2)('`category`', 1)('final', 1)('creating.', 1)('given', 3)('RDD:', 1)('Cartesian', 1)('who', 1)('computation,', 1)('bigrams:', 1)('""Bigrams', 1)('""are', 1)('many', 1)('present.', 
1)('`spark_assignment.py`', 1)"
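
The three steps of the word count can be traced on a tiny input in plain Python. This is a sketch of the logic only, with made-up sample lines; the list comprehensions and the loop stand in for `flatMap`, `map`, and `reduceByKey` respectively.

```python
# Made-up sample input: note the empty line and the double space.
lines = ["to be or not to be", "", "to  do"]

# flatMap: split every line and flatten the resulting lists into one list of words.
words = [w for line in lines for w in line.split(" ")]

# map: turn every word into a (word, 1) pair.
pairs = [(w, 1) for w in words]

# reduceByKey: add up the 1s for each distinct word.
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)
```

Splitting on a single space returns an empty string for an empty line and between consecutive spaces, which is why the empty string `''` shows up as a "word" with a high count in the output above.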
