<h1 align = "center"> Spark Fundamentals 1 - Introduction to Spark</h1>
<h2 align = "center"> Scala - Working with RDD operations</h2>
<h4 align = "center"> January 18, 2015 </h4>
<br align = "left">

**Related free online courses:**  
- [Spark Fundamentals II](http://bigdatauniversity.com/bdu-wp/bdu-course/spark-fundamentals-ii/)  
- [Data Analysis using R](https://bigdatauniversity.com/bdu-wp/bdu-course/introduction-to-data-analysis-using-r/)  
- [Big Data Fundamentals](http://bigdatauniversity.com/bdu-wp/bdu-course/big-data-fundamentals/)  

<img src="http://spark.apache.org/images/spark-logo.png" height="100" align="left">

<img src="https://upload.wikimedia.org/wikipedia/en/8/85/Scala_logo.png" height="85" align="left">


### Starting with Spark using Scala

In this lab you will create an RDD from an external data set and view the directed acyclic graph (DAG) of an RDD, which Spark updates as transformations are applied. You will also work with various RDD operations, including some already seen in lab one and some new ones, as well as shared variables and key-value pairs.

This first section repeats some of the work from the previous lab, except that it uses Scala rather than Python. There are some differences between the two, which are reflected in the commands below.

Now we are going to create an RDD from the README file. It is created using the Spark context method "textFile", just as in the previous lab. This initial operation is lazy, like a transformation, so nothing actually happens yet. We're just telling Spark that we want to create a readme RDD.

Since we are working with the README file again, it should show up in Recent Data. 

Highlight the text between the double quotes in the cell below, then click the twistie next to "README.md" in Recent Data and select Insert Path to paste the path. Then run the code in the cell.

Because this is a lazy operation, it only returns a pointer to an RDD, which we have named readme.

In [1]:
val readme = sc.textFile("")

Let’s perform some RDD actions on this text file. Count the number of items in the RDD using this command:

In [2]:
readme.count()

Name: java.lang.IllegalArgumentException
Message: Can not create a Path from an empty string
StackTrace: org.apache.hadoop.fs.Path.checkPathArg(Path.java:127)
org.apache.hadoop.fs.Path.<init>(Path.java:135)
org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:244)
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:409)
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$33.apply(SparkContext.scala:1015)
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:176)
scala.Option.map(Option.scala:145)
org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:176)
org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:195)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:239)
org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.sca

Let’s run another action. Run this command to find the first item in the RDD:

In [3]:
readme.first()

Now let’s try a transformation. Use the filter transformation to return a new RDD with a subset of the items in the file. Type in this command:

In [4]:
val linesWithSpark = readme.filter(line => line.contains("Spark"))
linesWithSpark.count()

Again, this returned a pointer to an RDD with the results of the filter transformation.

You can even chain together transformations and actions. To find out how many lines contain the word “Spark”, type in:

In [5]:
readme.filter(line => line.contains("Spark")).count()

### More on RDD Operations

This section builds upon the previous section. In this section, you will see that RDDs can be used for more complex computations. You will find the number of words in the line of the readme file that has the most words.

In [6]:
readme.map(line => line.split(" ").size).
                    reduce((a, b) => if (a > b) a else b)

There are two parts to this. The first maps each line to an integer value, the number of words in that line. In the second part, reduce is called to find the largest of those word counts. The arguments to map and reduce are Scala function literals (closures), and they can use any language feature or Scala/Java library.
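
To make the two steps concrete, here is a minimal sketch on a small in-memory RDD rather than the README. It assumes the notebook's SparkContext is available as sc; the sample lines and the names sampleLines, wordsPerLine, and maxWords are made up for illustration.

```scala
// Hypothetical sample data standing in for the README lines.
// Assumes `sc` is the SparkContext provided by the notebook.
val sampleLines = sc.parallelize(Seq(
  "Apache Spark",
  "a fast and general engine",
  "for big data"))

// Step 1 (map): replace every line with the number of space-separated words in it.
val wordsPerLine = sampleLines.map(line => line.split(" ").size)

// Step 2 (reduce): keep the larger of two counts until a single value is left.
val maxWords = wordsPerLine.reduce((a, b) => if (a > b) a else b)

println(maxWords) // 5, the word count of "a fast and general engine"
```

Note that the result is the largest word count, not the text of the line itself.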

In the next step, you use the Math.max() function to show that you can indeed use a Java library instead.
Import the java.lang.Math class:

In [7]:
import java.lang.Math

Now run with the max function:

In [8]:
readme.map(line => line.split(" ").size).
        reduce((a, b) => Math.max(a, b))

Spark makes it easy to implement the MapReduce data flow pattern. We can use this to do a word count on the readme file.

In [9]:
val wordCounts = readme.flatMap(line => line.split(" ")).
                        map(word => (word, 1)).
                        reduceByKey((a,b) => a + b)

Here we combined the flatMap, map, and reduceByKey functions to count the occurrences of each word in the readme file.
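
The same pipeline can be traced on a tiny in-memory data set, where the counts are easy to check by hand. This is only a sketch: it assumes the notebook's SparkContext is available as sc, and the sample text and the names sampleText, sampleCounts, and countsByWord are hypothetical.

```scala
// Hypothetical two-line input. Assumes `sc` is the notebook's SparkContext.
val sampleText = sc.parallelize(Seq("to be or", "not to be"))

// flatMap: one element per word; map: pair each word with 1;
// reduceByKey: add up the 1s of identical words.
val sampleCounts = sampleText.
  flatMap(line => line.split(" ")).
  map(word => (word, 1)).
  reduceByKey((a, b) => a + b)

// collectAsMap() brings everything to the driver, which is fine only because the sample is tiny.
val countsByWord = sampleCounts.collectAsMap()

println(countsByWord("to"))  // 2
println(countsByWord("be"))  // 2
println(countsByWord("or"))  // 1
println(countsByWord("not")) // 1
```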

To collect the word counts, use the collect action.

#### Note that the collect function brings all of the data to the driver node. For a small dataset this is acceptable, but for a large dataset it can cause an out-of-memory error. It is recommended to use collect() for testing only. A safer approach is to use the take() function, e.g. take(n).foreach(println)

In [11]:
wordCounts.collect().foreach(println)

Name: Compile Error
Message: <console>:20: error: not found: value wordCounts
              wordCounts.collect().foreach(println)
              ^
StackTrace: 

You can also do:


 println(wordCounts.collect().mkString("\n"))
 
 println(wordCounts.collect().deep)
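
As a sketch of the safer take() approach mentioned in the note above, the following uses a generated RDD instead of the word counts. It assumes the notebook's SparkContext is available as sc; manyNumbers and firstThree are hypothetical names.

```scala
// Hypothetical RDD with 1000 elements. Assumes `sc` is the notebook's SparkContext.
val manyNumbers = sc.parallelize(1 to 1000)

// take(3) returns at most three elements to the driver as a local Array,
// so the driver never has to hold the whole data set.
val firstThree = manyNumbers.take(3)

firstThree.foreach(println) // prints 1, 2 and 3 on separate lines
```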


### <span style="color: red">YOUR TURN:</span> 

#### In the cell below, determine the most frequent CHARACTER in the README and how many times it was used.

In [12]:
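// Split every line into its characters, count each character, then keep the pair with the highest count.
val mostFrequentChar = readme.flatMap(line => line.toList).
                        map(char => (char, 1)).
                        reduceByKey((a,b) => a + b).
                        reduce((a, b) => if (a._2 > b._2) a else b)

println(mostFrequentChar)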


Highlight text for answer:

<textarea rows="6" cols="80" style="color: white">
val mostFrequentChar = readme.flatMap(line => line.toList).
                        map(char => (char, 1)).
                        reduceByKey((a,b) => a + b).
                        reduce((a, b) => if (a._2 > b._2) a else b)

println(mostFrequentChar)
</textarea>

## Analysing a log file

First, create an RDD by loading in a log file:

In [13]:
val logFile = sc.textFile("")

Keep only the lines that contain INFO (or ERROR, if the particular log has it):

In [14]:
val info = logFile.filter(line => line.contains("INFO"))

Count the lines:

In [15]:
info.count()

Count the INFO lines that also contain "spark" by combining a transformation and an action:

In [16]:
info.filter(line => line.contains("spark")).count()

Fetch those lines as an array of Strings

In [17]:
info.filter(line => line.contains("spark")).collect() foreach println

Remember that we went over the DAG, which is what provides fault tolerance in Spark: a node can recompute its state by borrowing the DAG from a neighboring node. You can view the graph of an RDD using the toDebugString method.

In [18]:
println(info.toDebugString)

## Joining RDDs

Next, you are going to create RDDs for the README and the POM file.

In [19]:
val readmeFile = sc.textFile("")
val pom = sc.textFile("")

How many lines in each file contain the keyword Spark?

In [20]:
println(readmeFile.filter(line => line.contains("Spark")).count())
println(pom.filter(line => line.contains("Spark")).count())

Now do a WordCount on each RDD so that the results are (K,V) pairs of (word,count)

In [21]:
val readmeCount = readmeFile.
                    flatMap(line => line.split(" ")).
                    map(word => (word, 1)).
                    reduceByKey(_ + _)

val pomCount = pom.
                flatMap(line => line.split(" ")).
                map(word => (word, 1)).
                reduceByKey(_ + _)

To see the array for either of them, just call the collect function on it.

In [22]:
println("Readme Count\n")
readmeCount.collect() foreach println

Readme Count



Name: Compile Error
Message: <console>:20: error: not found: value readmeCount
              readmeCount.collect() foreach println
              ^
StackTrace: 

In [23]:
println("Pom Count\n")
pomCount.collect() foreach println

Pom Count



Name: Compile Error
Message: <console>:20: error: not found: value pomCount
              pomCount.collect() foreach println
              ^
StackTrace: 

Now let's join these two RDDs together to get a collective set. The join function combines two datasets of (K,V) and (K,W) pairs and returns a dataset of (K, (V,W)) pairs. Let's join these two counts together and then cache the result.
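
The shape of the join result is easier to see on a small example. The sketch below assumes the notebook's SparkContext is available as sc; the word counts and the names leftCounts, rightCounts, and sampleJoined are invented for illustration.

```scala
// Two hypothetical (word, count) RDDs. Assumes `sc` is the notebook's SparkContext.
val leftCounts  = sc.parallelize(Seq(("spark", 3), ("scala", 1), ("java", 2)))
val rightCounts = sc.parallelize(Seq(("spark", 5), ("scala", 4), ("maven", 7)))

// join is an inner join: only keys present in BOTH RDDs appear in the result,
// and each value becomes a (leftValue, rightValue) tuple.
val sampleJoined = leftCounts.join(rightCounts).collectAsMap()

println(sampleJoined("spark"))          // (3,5)
println(sampleJoined("scala"))          // (1,4)
println(sampleJoined.contains("java"))  // false, "java" is only in leftCounts
println(sampleJoined.contains("maven")) // false, "maven" is only in rightCounts
```

Words that occur in only one of the two files are therefore dropped from the joined RDD.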

In [24]:
val joined = readmeCount.join(pomCount)
joined.cache()

Name: Compile Error
Message: <console>:19: error: not found: value readmeCount
         val joined = readmeCount.join(pomCount)
                      ^
StackTrace: 

Let's see what's in the joined RDD.

In [25]:
joined.collect.foreach(println)

Name: Compile Error
Message: <console>:20: error: not found: value joined
              joined.collect.foreach(println)
              ^
StackTrace: 

Let's combine the values together to get the total count. The operations in this command tell Spark to combine the values from (K,V) and (K,W) to give us (K, V+W). The ._1 and ._2 notation accesses the element at that position of a tuple, such as a key-value pair.

In [26]:
val joinedSum = joined.map(k => (k._1, (k._2)._1 + (k._2)._2))
joinedSum.collect() foreach println

Name: Compile Error
Message: <console>:19: error: not found: value joined
         val joinedSum = joined.map(k => (k._1, (k._2)._1 + (k._2)._2))
                         ^
StackTrace: 

To check that it is correct, print the first five elements from the joined and the joinedSum RDDs:

In [27]:
println("Joined Individual\n")
joined.take(5).foreach(println)

println("\n\nJoined Sum\n")
joinedSum.take(5).foreach(println)

Joined Individual



Name: Compile Error
Message: <console>:20: error: not found: value joined
              joined.take(5).foreach(println)
              ^
StackTrace: 

## Shared variables

Broadcast variables allow the programmer to keep a read-only variable cached on each worker node rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. After the broadcast variable is created, it should be used instead of the value v in any functions run on the cluster so that v is not shipped to the nodes more than once. In addition, the object v should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).

Read more here: [http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables](http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables)
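
A typical use is a lookup table that every task needs to read. The following is a minimal sketch, assuming the notebook's SparkContext is available as sc; the table contents and the names numberNames and named are hypothetical.

```scala
// Hypothetical read-only lookup table, shipped once to each worker.
// Assumes `sc` is the notebook's SparkContext.
val numberNames = sc.broadcast(Map(1 -> "one", 2 -> "two", 3 -> "three"))

// Inside the closure, read the table through .value instead of capturing the Map directly.
val named = sc.parallelize(Seq(1, 2, 3, 2)).
  map(n => numberNames.value(n)).
  collect()

println(named.mkString(", ")) // one, two, three, two
```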

Let's create a broadcast variable:

In [28]:
val broadcastVar = sc.broadcast(Array(1,2,3))

To get the value, type in:

In [29]:
broadcastVar.value

Array(1, 2, 3)

Accumulators are variables that can only be added to through an associative operation. They are used to implement counters and sums efficiently in parallel. Spark natively supports accumulators of numeric types and standard mutable collections, and programmers can extend these for new types. Only the driver can read the value of an accumulator. The workers can only add to it.

Create the accumulator variable. Type in:

In [30]:
val accum = sc.accumulator(0)

Next, parallelize an array of four integers and run it through a loop to add each integer value to the accumulator variable. Type in:

In [31]:
sc.parallelize(Array(1,2,3,4)).foreach(x => accum += x)

To get the current value of the accumulator variable, type in:

In [32]:
accum.value

10

You should get a value of 10.
This command can only be invoked on the driver side. The worker nodes can only increment the accumulator.

## Key-value pairs

You have already seen a bit about key-value pairs in the Joining RDDs section. Here is a brief example of how to create a key-value pair and access its values. Remember that certain operations, such as reduceByKey and join, only work on RDDs of key-value pairs.

Create a key-value pair of two characters. Type in:

In [33]:
val pair = ('a', 'b')

Access the first element using the *._1* method and the second element using the *._2* method.

In [34]:
pair._1

a

In [35]:
pair._2

b
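
Besides the positional accessors, Scala can also take a pair apart with pattern matching, which often reads better inside map functions. This is a plain Scala sketch that does not need Spark; samplePair, sampleKey, sampleValue, and describe are hypothetical names.

```scala
// A hypothetical (word, count) pair.
val samplePair = ("spark", 3)

// Positional accessors, as used above.
val sampleKey = samplePair._1
val sampleValue = samplePair._2

// Pattern matching names both elements in one step.
val (word, count) = samplePair

// The same idea in a function literal, e.g. one that could be passed to map on a pair RDD.
val describe: ((String, Int)) => String = { case (k, v) => k + " appears " + v + " times" }

println(describe(samplePair)) // spark appears 3 times
```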

## Sample Application

In this section, you will use a subset of taxi trip data to determine the top 10 medallion numbers based on the number of trips.

In [36]:
val taxi = sc.textFile("/resources/LabData/nyctaxi.csv")

To view the first five rows of content, invoke the take function. Type in:

In [37]:
taxi.take(5).foreach(println)

"_id","_rev","dropoff_datetime","dropoff_latitude","dropoff_longitude","hack_license","medallion","passenger_count","pickup_datetime","pickup_latitude","pickup_longitude","rate_code","store_and_fwd_flag","trip_distance","trip_time_in_secs","vendor_id"
"29b3f4a30dea6688d4c289c9672cb996","1-ddfdec8050c7ef4dc694eeeda6c4625e","2013-01-11 22:03:00",+4.07033460000000E+001,-7.40144200000000E+001,"A93D1F7F8998FFB75EEF477EB6077516","68BC16A99E915E44ADA7E639B4DD5F59",2,"2013-01-11 21:48:00",+4.06760670000000E+001,-7.39810790000000E+001,1,,+4.08000000000000E+000,900,"VTS"
"2a80cfaa425dcec0861e02ae44354500","1-b72234b58a7b0018a1ec5d2ea0797e32","2013-01-11 04:28:00",+4.08190960000000E+001,-7.39467470000000E+001,"64CE1B03FDE343BB8DFB512123A525A4","60150AA39B2F654ED6F0C3AF8174A48A",1,"2013-01-11 04:07:00",+4.07280540000000E+001,-7.40020370000000E+001,1,,+8.53000000000000E+000,1260,"VTS"
"29b3f4a30dea6688d4c289c96758d87e","1-387ec30eac5abda89d2abefdf947b2c1","2013-01-11 22:02:00",+4.07277180000000E+00

Note that the first line contains the headers. Normally, you would want to filter it out, but since it will not affect our results, we can leave it in.
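
If you did want to drop the header, one common way is to read the first line and filter out every line equal to it. The sketch below uses a tiny made-up CSV instead of the taxi file and assumes the notebook's SparkContext is available as sc; sampleCsv, headerLine, and dataLines are hypothetical names.

```scala
// Hypothetical three-line CSV with a header row. Assumes `sc` is the notebook's SparkContext.
val sampleCsv = sc.parallelize(Seq(
  "\"_id\",\"medallion\"",
  "\"1\",\"AAA\"",
  "\"2\",\"BBB\""))

// first() returns the header line to the driver.
val headerLine = sampleCsv.first()

// Keep every line that differs from the header.
// Caveat: this would also drop a data row that happened to be identical to the header.
val dataLines = sampleCsv.filter(line => line != headerLine)

println(dataLines.count()) // 2
```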

To parse out the values, including the medallion numbers, you need to first create a new RDD by splitting the lines of the RDD using the comma as the delimiter. Type in:

In [38]:
val taxiParse = taxi.map(line=>line.split(","))

Now create the key-value pairs where the key is the medallion number and the value is 1. We use this model to later sum up the values for each key to find the number of trips a particular taxi took and, in particular, to see which taxi took the most trips. Map each of the medallions to the value of one. Type in:

In [39]:
val taxiMedKey = taxiParse.map(vals=>(vals(6), 1))

vals(6) corresponds to the column where the medallion key is located

Next, use the reduceByKey function to count the number of occurrences of each key.

In [40]:
val taxiMedCounts = taxiMedKey.reduceByKey((v1,v2)=>v1+v2)

taxiMedCounts.take(5).foreach(println)

("A9907052C8BBDED5079252EFE6177ECF",195)
("26DE3DC2FCBB37A233BE231BA6F7364E",173)
("BA60553FAA4FE1A36BBF77B4C10D3003",171)
("67DD83EA2A67933B2724269121BF45BB",196)
("AD57F6329C387766186E1B3838A9CEDD",214)


Finally, the keys and values are swapped so that the pairs can be ordered by trip count in descending order, and then swapped back in the output so the results are presented correctly.

In [41]:
for (pair <-taxiMedCounts.map(_.swap).top(10)) println("Taxi Medallion %s had %s Trips".format(pair._2, pair._1))

Taxi Medallion "FE4C521F3C1AC6F2598DEF00DDD43029" had 415 Trips
Taxi Medallion "F5BB809E7858A669C9A1E8A12A3CCF81" had 411 Trips
Taxi Medallion "8CE240F0796D072D5DCFE06A364FB5A0" had 406 Trips
Taxi Medallion "0310297769C8B049C0EA8E87C697F755" had 402 Trips
Taxi Medallion "B6585890F68EE02702F32DECDEABC2A8" had 399 Trips
Taxi Medallion "33955A2FCAF62C6E91A11AE97D96C99A" had 395 Trips
Taxi Medallion "4F7C132D3130970CFA892CC858F5ECB5" had 391 Trips
Taxi Medallion "78833E177D45E4BC520222FFBBAC5B77" had 383 Trips
Taxi Medallion "E097412FE23295A691BEEE56F28FB9E2" had 380 Trips
Taxi Medallion "C14289566BAAD9AEDD0751E5E9C73FBD" had 377 Trips
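
The swap works because tuples are compared element by element, so putting the count first makes top order by count. As a sketch of an alternative, top also accepts an explicit Ordering, which avoids the swap. The example assumes the notebook's SparkContext is available as sc; the sample data and the names sampleTrips, bySwap, and byOrdering are hypothetical.

```scala
// Hypothetical (medallion, tripCount) pairs. Assumes `sc` is the notebook's SparkContext.
val sampleTrips = sc.parallelize(Seq(("A", 3), ("B", 10), ("C", 7), ("D", 1)))

// Approach used above: swap to (count, key) so the default tuple ordering compares counts first.
val bySwap = sampleTrips.map(_.swap).top(2)

// Alternative: keep (key, count) and tell top() to order by the count.
val byOrdering = sampleTrips.top(2)(Ordering.by[(String, Int), Int](_._2))

println(bySwap.mkString(", "))     // (10,B), (7,C)
println(byOrdering.mkString(", ")) // (B,10), (C,7)
```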


While each step above was processed one line at a time, you can just as well process everything on one line:

In [42]:
val taxiMedCountsOneLine = taxi.map(line=>line.split(',')).map(vals=>(vals(6),1)).reduceByKey(_ + _)

Run the same line as above to print the taxiMedCountsOneLine RDD.

In [43]:
for (pair <-taxiMedCountsOneLine.map(_.swap).top(10)) println("Taxi Medallion %s had %s Trips".format(pair._2, pair._1))

Taxi Medallion "FE4C521F3C1AC6F2598DEF00DDD43029" had 415 Trips
Taxi Medallion "F5BB809E7858A669C9A1E8A12A3CCF81" had 411 Trips
Taxi Medallion "8CE240F0796D072D5DCFE06A364FB5A0" had 406 Trips
Taxi Medallion "0310297769C8B049C0EA8E87C697F755" had 402 Trips
Taxi Medallion "B6585890F68EE02702F32DECDEABC2A8" had 399 Trips
Taxi Medallion "33955A2FCAF62C6E91A11AE97D96C99A" had 395 Trips
Taxi Medallion "4F7C132D3130970CFA892CC858F5ECB5" had 391 Trips
Taxi Medallion "78833E177D45E4BC520222FFBBAC5B77" had 383 Trips
Taxi Medallion "E097412FE23295A691BEEE56F28FB9E2" had 380 Trips
Taxi Medallion "C14289566BAAD9AEDD0751E5E9C73FBD" had 377 Trips


Let's cache the taxiMedCountsOneLine RDD to see the difference caching makes. Run it with the logs set to INFO and you can see how long each line takes to execute. First, let's cache the RDD:

In [45]:
taxiMedCountsOneLine.cache()

ShuffledRDD[33] at reduceByKey at <console>:23

Next, you have to invoke an action for it to actually cache the RDD. Note the time it takes here (either measure it using the INFO log or just notice how long it takes).

In [46]:
taxiMedCountsOneLine.count()

13464

Run it again to see the difference.

In [47]:
taxiMedCountsOneLine.count()

13464

The bigger the dataset, the more noticeable the difference will be. In a sample file such as ours, the difference may be negligible.
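
If the INFO log is not convenient, you can measure the two runs yourself. The following sketch wraps an action in a small timing helper; timed is a hypothetical helper written for this example, not part of Spark, and the generated RDD sampleCached stands in for the taxi data. It assumes the notebook's SparkContext is available as sc.

```scala
// Hypothetical helper: run a block, print the elapsed wall-clock time, and return the result.
def timed[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(label + " took " + elapsedMs + " ms")
  result
}

// Generated stand-in data set with 100 distinct keys. Assumes `sc` is the notebook's SparkContext.
val sampleCached = sc.parallelize(1 to 1000000).map(x => (x % 100, 1)).reduceByKey(_ + _)
sampleCached.cache()

// The first action computes the RDD and stores it; the second can read the cached partitions.
val firstCount = timed("first count (computes and caches)")(sampleCached.count())
val secondCount = timed("second count (reads from cache)")(sampleCached.count())
```

The printed times depend on the machine and the cluster, so treat them as a rough comparison only.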

The next lab will show you how to create Spark applications using SQL, GraphX, MLlib, etc. The lab is only available in Scala.

<h1 align="center" style="font-family: Monaco;">Continue on "[Spark Fundamentals 1 - ScalaLibs.ipynb](/api/v1/resources/Spark%20Fundamentals%201%20-%20ScalaLibs.ipynb)"</h1>
