# Introduction to Apache Spark lab, part 1: Basic concepts
This notebook guides you through the basic concepts to start working with Apache Spark, including how to set up your environment, create and analyze data sets, and work with data files.

This notebook uses PySpark, the Python API for Spark. Some knowledge of Python is recommended. This notebook runs on Python 2 with Spark 2.X.

If you are new to notebooks, here's how the user interface works: [Parts of a notebook](http://datascience.ibm.com/docs/content/analyze-data/parts-of-a-notebook.html)


## About Apache Spark
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for processing structured data, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

<img src='https://github.com/carloapp2/SparkPOT/blob/master/spark.png?raw=true' width="50%" height="50%"></img>


A Spark program has a driver program and worker programs. Worker programs run on cluster nodes or in local threads. Data sets are distributed across workers. 

<img src='https://github.com/carloapp2/SparkPOT/blob/master/Spark%20Architecture.png?raw=true' width="50%" height="50%"></img>

## Table of Contents
In the first four sections of this notebook, you'll learn about Spark with very simple examples. In the last two sections, you'll use what you learned to analyze data files that have more realistic data sets.

1. [Work with the SparkContext](#sparkcontext)<br>
    1.1 [Invoke the SparkContext and get the version](#sparkcontext1)<br>
2. [Work with RDDs](#rdd)<br>
    2.1 [Create an RDD](#rdd1)<br>
    2.2 [View the data](#rdd2)<br>
    2.3 Return the first five elements<br>
    2.4 [Create another RDD](#rdd3)<br>
3. [Manipulate data in RDDs](#trans)<br>
    3.1 [Update numeric values](#trans1)<br>
    3.2 [Sum the numbers in an RDD](#trans2)<br>
    3.3 [Split and count text strings](#trans3)<br>
    3.4 Split the entries on spaces<br>
    3.5 Explore flatMap<br>
    3.6 Create word pairs<br>
    3.7 [Count words using a Pair RDD](#trans4)<br>
4. [Filter data](#filter)<br>
5. [Analyze text data from a file](#wordfile)<br>
    5.1 [Get the data from a URL](#wordfile1)<br>
    5.2 [Create an RDD from the file](#wordfile2)<br>
    5.3 [Filter for a word](#wordfile3)<br>
    5.4 [Count instances of a string at the beginning of words](#wordfile4)<br>
    5.5 [Count instances of a string within words](#wordfile5)<br>
6. [Analyze numeric data from a file](#numfile)<br>
7. [Summary and next steps](#summary)

# Lab 1 - Hello Spark

This lab will introduce you to Apache Spark.  It is written in Python and runs in IBM's Data Science Experience environment through a Jupyter notebook.  While you work, it will be valuable to reference the [Apache Spark Documentation](http://spark.apache.org/docs/latest/programming-guide.html).  Since it is Python, be careful of whitespace!

<a id="sparkcontext"></a>
## Step 1 - Working with the SparkContext object

The SparkContext object is the programming interface through which your code interacts with the Spark driver application. The SparkContext object tells Spark how and where to access a cluster.

The Data Science Experience notebook environment predefines the Spark context for you.   This context variable will always be called 'sc'.

In other environments, you need to pick an interpreter (for example, pyspark for Python) and create a SparkConf object to initialize a SparkContext object. For example:
<br/><br/>
`from pyspark import SparkContext, SparkConf`<br>
`conf = SparkConf().setAppName(appName).setMaster(master)`<br>
`sc = SparkContext(conf=conf)`<br>

<a id="sparkcontext1"></a>
### Step 1.1 - Read the <i>version</i> attribute of the Spark context object to get the working version of Apache Spark<br><br>
 <div class="panel-group" id="accordion-11">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-11" href="#collapse1-11">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-11" class="panel-collapse collapse">
      <div class="panel-body">The spark context is automatically set in a Jupyter notebook.   It is called: sc</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-11" href="#collapse2-11">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-11" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>sc.version</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-11" href="#collapse3-11">
        Optional</a>
      </h4>
    </div>
    <div id="collapse3-11" class="panel-collapse collapse">
      <div class="panel-body">Jupyter notebooks have command completion which can be invoked via the TAB key.<br>Type:<br>&nbsp;&nbsp;&nbsp;&nbsp;<i>sc.&lt;TAB&gt;</i><br>to see all the possible options within the Spark context</div>
    </div>
  </div>
</div> 

In [1]:
#Step 1 - Check spark version
sc.version


u'2.0.2'

<a id="rdd"></a>
## Step 2 - Working with Resilient Distributed Datasets (RDD)

Apache Spark uses an abstraction for working with data called a Resilient Distributed Dataset (RDD). An RDD is a collection of elements that can be operated on in parallel. RDDs are immutable, so you can't update the data in them. To update data in an RDD, you must create a new RDD. In Apache Spark, all work is done by creating new RDDs, transforming existing RDDs, or using RDDs to compute results. When working with RDDs, the Spark driver application automatically distributes the work across the cluster.

You can construct RDDs by parallelizing existing Python collections (lists), by transforming existing RDDs, or by reading files in HDFS or any other storage system.

You can run these types of methods on RDDs: 
 - Actions: query the data and return values
 - Transformations: manipulate data values and return pointers to new RDDs. 

Find more information on Python methods in the [PySpark documentation](http://spark.apache.org/docs/latest/api/python/pyspark.html).
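
The difference between the two kinds of methods can be illustrated without a cluster. The following is a plain-Python analogy, not Spark code: a generator expression stands in for a transformation (nothing runs until a result is requested), and `sum()` stands in for an action. The names `calls`, `add_one`, and `pending` are illustrative only.

```python
# Plain-Python analogy for lazy transformations and eager actions.
calls = []  # records each time the function actually runs


def add_one(value):
    calls.append(value)
    return value + 1


numbers = [1, 2, 3]

# "Transformation": building the generator runs nothing yet.
pending = (add_one(n) for n in numbers)
calls_before_action = len(calls)

# "Action": consuming the generator forces the computation.
total = sum(pending)
calls_after_action = len(calls)

print(calls_before_action, calls_after_action, total)
```

Spark behaves in a similar spirit: it records transformations and only runs them when an action asks for a value.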

<a id="rdd1"></a>
### Step 2.1 - Create an RDD with numbers 1 to 10

There are three ways to create an RDD: parallelizing an existing collection, referencing a dataset in an external storage system that offers a Hadoop InputFormat, or transforming an existing RDD.<br>
<br>
Create an iterable or collection in your program with numbers 1 to 10 and then invoke the Spark Context's (sc) <i>parallelize()</i> method on it.<br>

 <div class="panel-group" id="accordion-21">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-21" href="#collapse1-21">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-21" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br><br>
Or we can try to be a little clever by typing:<br>
x = range(1, 11)
      </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-21" href="#collapse2-21">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-21" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br>
x_nbr_rdd = sc.parallelize(x)
      </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-21" href="#collapse3-21">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-21" class="panel-collapse collapse">
      <div class="panel-body">An optional parameter to parallelize is the number of partitions to cut the dataset into.   Spark will run one task for each partition.   Typically you want 2-4 partitions for each CPU.   Normally, Spark will set it automatically, but you can control this by specifying it manually as a second parameter to the parallelize method.<br><br>
You can obtain the number of partitions by calling <i>&lt;RDD&gt;.getNumPartitions()</i><br>
Try experimenting with different numbers of partitions, including ones higher than the number of values.   To see how the values are distributed use:<br><br>
<i>
def f(iterator):<br>
    &nbsp;&nbsp;&nbsp;&nbsp;
    count = 0<br>
    &nbsp;&nbsp;&nbsp;&nbsp;
    for value in iterator:<br>
    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
        count = count + 1<br>
    &nbsp;&nbsp;&nbsp;&nbsp;
    yield count<br>
x_nbr_rdd.mapPartitions(f).collect()</i><br>
      </div>
    </div>
  </div>
</div> 

In [13]:
x = range(1,10)  # note: this yields 1 through 9; range(1, 11) would include 10
x_nbr_rdd = sc.parallelize(x, 20)
x_nbr_rdd.getNumPartitions()


20

In [16]:
x_nbr_rdd = sc.parallelize(x)
x_nbr_rdd.getNumPartitions()

7

In [17]:

def f(iterator):
     count = 0
     for value in iterator:
         count = count + 1
     yield count
x_nbr_rdd.mapPartitions(f).collect()

[1, 1, 1, 2, 1, 1, 2]
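
The per-partition counts above follow from how the collection is cut into partitions. The sketch below is plain Python, not Spark code, and it assumes that partition `i` of `n` receives the elements from index `floor(i * length / n)` up to `floor((i + 1) * length / n)`. The helper name `partition_sizes` is hypothetical. Under that assumption, the nine elements produced by `range(1, 10)` spread over seven partitions give the same counts as the output shown.

```python
def partition_sizes(length, num_partitions):
    """Return the element count per partition under the assumed slicing rule."""
    sizes = []
    for i in range(num_partitions):
        start = (i * length) // num_partitions
        end = ((i + 1) * length) // num_partitions
        sizes.append(end - start)
    return sizes


sizes_7 = partition_sizes(9, 7)
sizes_20 = partition_sizes(9, 20)

print(sizes_7)
# With more partitions than elements, some partitions stay empty.
print(sizes_20.count(0))
```
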

Notice that a call to parallelize() on its own displays no return value. The parallelize method didn't compute a result; like a transformation, it is lazy. Spark only recorded how to create the RDD.
<a id="rdd2"></a>
### Step 2.2 - View the data
Return the first element in the RDD<br/><br/>
<div class="panel-group" id="accordion-22">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-22" href="#collapse1-22">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-22" class="panel-collapse collapse">
      <div class="panel-body">Use the <i>first()</i> method on the RDD to return the first element in an RDD.   You could also use the <i>take()</i> method with a parameter of 1.   first() and take(1) return the same element, but first() returns the element itself while take(1) returns a one-element list.   Both read from the start of the RDD, beginning with its 0th partition.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-22" href="#collapse2-22">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-22" class="panel-collapse collapse">
      <div class="panel-body">Type: <br/>
x_nbr_rdd.first()</div>
    </div>
  </div>
</div> 

In [19]:
x_nbr_rdd.first()

1

In [22]:
x_nbr_rdd.take(3)

[1, 2, 3]

Each number in the collection is in a different element in the RDD. Because the first() method returned a value, it is an action.

### Step 2.3 - Return an array of the first five elements<br><br>
 <div class="panel-group" id="accordion-23">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-23" href="#collapse1-23">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-23" class="panel-collapse collapse">
      <div class="panel-body">Use the <i>take()</i> method</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-23" href="#collapse2-23">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-23" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
x_nbr_rdd.take(5)</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-23" href="#collapse3-23">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-23" class="panel-collapse collapse">
      <div class="panel-body">How would you get the 5th-7th elements?   <i>take()</i> only accepts one parameter so <i>take(5,7)</i> will not work.<br>
      </div>
    </div>
  </div>
</div> 


In [21]:
#Step 2.3 - Return an array of the first five elements
x_nbr_rdd.take(5)

[1, 2, 3, 4, 5]

In [34]:
test = x_nbr_rdd.take(7)
t_rdd = sc.parallelize(test)
t_rdd.top(3) # top(3) returns the three largest values in descending order; that matches the 5th-7th elements here only because the values are sorted

[7, 6, 5]

In [35]:
sc.parallelize(x_nbr_rdd.take(7)).top(3) # RDDs have no positional indexing, so this takes the first seven elements and keeps the three largest; zipWithIndex() followed by a filter on the index is another option

[7, 6, 5]
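
The two ways of reaching the 5th-7th elements can be sketched in plain Python (this is an analogy, not Spark code). A list stands in for the RDD contents, slicing stands in for slicing the list that `take(7)` returns, and `enumerate` stands in for pairing each element with its index. The variable names are illustrative only.

```python
elements = list(range(1, 10))        # stands in for the RDD contents

# Option 1: fetch the first seven elements, then slice the resulting list.
first_seven = elements[:7]           # analogous to calling take(7)
fifth_to_seventh = first_seven[4:7]  # ordinary list slicing on the result

# Option 2: pair each element with its index, then filter on the index.
indexed = [(value, index) for index, value in enumerate(elements)]
by_index = [value for value, index in indexed if 4 <= index <= 6]

print(fifth_to_seventh)
print(by_index)
```

Unlike `top(3)`, both options keep the original order and do not depend on the values being sorted.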

<a id="rdd3"></a>
### 2.4 Create another RDD 
Create an RDD that contains multiple strings and print the value of the first string:

<div class="panel-group" id="accordion-26">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-26" href="#collapse1-26">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-26" class="panel-collapse collapse">
      <div class="panel-body">Create a variable with the Strings "Hello Human" and "My Name is Spark" and turn it into an RDD with the parallelize() function.   Remember that parallelize() is invoked from the Spark context!</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-26" href="#collapse2-26">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-26" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
y = ["Hello Human", "My Name is Spark"]<br>
y_str_rdd = sc.parallelize(y)<br>
y_str_rdd.take(1)<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-26" href="#collapse3-26">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-26" class="panel-collapse collapse">
      <div class="panel-body">Is there a way to get the third element directly?</div>
    </div>
  </div>
</div> 

In [44]:
# create a string array
y = ["Hello Human", "My Name is Spark"]


Put the collection into an RDD:

In [45]:
# put the collection into an RDD
y_str_rdd = sc.parallelize(y)



View the first element in the RDD:

In [46]:
# view the first element
y_str_rdd.take(1)

['Hello Human']

In [47]:
y_str_rdd.lookup(2) # lookup(key) returns the values stored under a key in a pair RDD; it is not a positional index, so nothing matches here

[]

You created the strings "Hello Human" and "My Name is Spark" and you returned "Hello Human" as the first element of the RDD. To analyze a set of words, you can map each word into an RDD element.

<a id="trans"></a>
## 3. Manipulate data in RDDs

Remember that to manipulate data, you use transformation functions.

Here are some common Spark transformation functions that you'll be using in this notebook:

 - `map(func)`: returns a new RDD with the results of running the specified function on each element  
 - `filter(func)`: returns a new RDD with the elements for which the specified function returns true   
 - `distinct([numTasks])`: returns a new RDD that contains the distinct elements of the source RDD
 - `flatMap(func)`: returns a new RDD by first running the specified function on all elements, returning 0 or more results for each original element, and then flattening the results into individual elements

You can also create functions that run a single expression and don't have a name with the Python `lambda` keyword. For example, this function returns the sum of its arguments: `lambda a, b: a + b`.
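
To get a feel for what each of these transformations computes, here is a plain-Python analogy that runs on ordinary lists (it is not Spark code, and the variable names are illustrative only). Each line notes the RDD method it mimics.

```python
values = [1, 2, 2, 3]
lines = ["a b", "c"]

# like map(lambda v: v + 1): one output per input
mapped = [v + 1 for v in values]

# like filter(lambda v: v > 1): keep elements where the function is true
filtered = [v for v in values if v > 1]

# like distinct(): drop duplicates (sorted here only for a stable display)
unique = sorted(set(values))

# like flatMap(lambda line: line.split()): 0 or more outputs per input, flattened
flattened = [word for line in lines for word in line.split()]

# a lambda is an unnamed single-expression function
add = lambda a, b: a + b

print(mapped, filtered, unique, flattened, add(2, 3))
```
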

<a id="trans1"></a>
### 3.1 Update numeric values
Run the `map()` function with the `lambda` keyword to replace each element, X, in your first RDD (the one that has numeric values) with X+1. For more information, see [Transformations](http://spark.apache.org/docs/latest/programming-guide.html#transformations).

<br/>
 <div class="panel-group" id="accordion-24">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-24" href="#collapse1-24">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-24" class="panel-collapse collapse">
      <div class="panel-body">Use the <i>map(func)</i> function on the RDD.   Map invokes function <i>func</i> on each element of the RDD.   You can also use an inline (or lambda) function.   The syntax for a lambda function is:<br>

lambda &lt;var&gt;: &lt;myCode&gt;
</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-24" href="#collapse2-24">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-24" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
x_nbr_rdd_2 = x_nbr_rdd.map(lambda x: x+1)</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-24" href="#collapse3-24">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-24" class="panel-collapse collapse">
      <div class="panel-body">Instead of using lambda, write a Python function which increments the value by 1 and pass that function to map()</div>
    </div>
  </div>
</div> 


In [58]:
# Step 3.1 - Write your map function
x_nbr_rdd_2 = x_nbr_rdd.map(lambda x: x+1)
x_nbr_rdd_2.collect()



[2, 3, 4, 5, 6, 7, 8, 9, 10]

In [100]:
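# Optional Advanced
# The output has ten elements, presumably because this cell was run after
# x_nbr_rdd was recreated from range(1, 11) in Step 3.2 (see the In [ ] counters)
def incr(x):
    return x+1
x_nbr_rdd_3 = x_nbr_rdd.map(incr) # map() accepts any callable, so a named function works like a lambda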
x_nbr_rdd_3.collect()

[2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

Note that there was no result for Step 3.1.  Why was this?  Use collect() to look at all the elements of the new RDD.<br>
Type:<br>
&nbsp;&nbsp;&nbsp;&nbsp; x_nbr_rdd_2.collect()   

In [57]:
# Check out the elements of the new RDD. Warning: Be careful with this in real life! Collect returns everything!  Returning a large data set might not be very useful. No one wants to scroll through a million rows!
x_nbr_rdd_2.collect()

[2, 3, 4, 5, 6, 7, 8, 9, 10]

<a id="trans2"></a>
### 3.2 Sum the numbers in an RDD
Create an RDD consisting of a collection of numbers. 

<div class="panel-group" id="accordion-32">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-32" href="#collapse1-32">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-32" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br><br>
Or we can try to be a little clever by typing:<br>

x = range(1, 11)
      </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-32" href="#collapse2-32">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-32" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]<br>
x_nbr_rdd = sc.parallelize(x)
      </div>
    </div>
  </div>
</div>

In [76]:
x = range(1,11)
x_nbr_rdd = sc.parallelize(x)

Calculate the sum:

<div class="panel-group" id="accordion2-32">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion2-32" href="#collapse3-32">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse3-32" class="panel-collapse collapse">
      <div class="panel-body">Use the <i>reduce()</i> action, which aggregates the elements of the dataset using a function <i>func</i> (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
      </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion2-32" href="#collapse4-32">
        Solution</a>
      </h4>
    </div>
    <div id="collapse4-32" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
x_nbr_rdd.reduce(lambda x,y: x+y) <br>
      </div>
    </div>
  </div>
</div>

In [77]:
x_nbr_rdd.reduce(lambda x,y: x+y)

55
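
The requirement that the reduce function be commutative and associative can be seen in a plain-Python sketch (not Spark code). It assumes the data is split into three partitions, an arbitrary split chosen for illustration, reduces each partition on its own, and then combines the partial results. Addition gives the same answer either way; subtraction does not.

```python
from functools import reduce

add = lambda x, y: x + y
subtract = lambda x, y: x - y

numbers = list(range(1, 11))

# Pretend the data lives in three partitions (an arbitrary split for illustration).
partitions = [numbers[0:3], numbers[3:7], numbers[7:10]]

# Reduce each partition separately, then combine the partial results.
partial_sums = [reduce(add, part) for part in partitions]
total = reduce(add, partial_sums)

# Subtraction is neither commutative nor associative, so the partitioned
# result no longer matches a single left-to-right pass over the data.
sequential = reduce(subtract, numbers)
partitioned = reduce(subtract, [reduce(subtract, part) for part in partitions])

print(partial_sums, total)
print(sequential, partitioned)
```
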

<a id="trans3"></a>
### 3.3 Split and count text strings

Create an RDD with the following text strings and show the first element:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;"IBM Data Science Experience is built for enterprise-scale deployment."<br/>
&nbsp;&nbsp;&nbsp;&nbsp;"Manage your data, your analytical assets, and your projects in a secured cloud environment."<br/>
&nbsp;&nbsp;&nbsp;&nbsp;"When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data."<br/><br/>
 <div class="panel-group" id="accordion-27">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-27" href="#collapse1-27">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-27" class="panel-collapse collapse">
      <div class="panel-body">Use an array -- [] -- to contain all three strings.   Don't forget to enclose them in quotes!</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-27" href="#collapse2-27">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-27" class="panel-collapse collapse">
      <div class="panel-body">z = [ "IBM Data Science Experience is built for enterprise-scale deployment.", "Manage your data, your analytical assets, and your projects in a secured cloud environment.", "When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data." ]<br/>
z_str_rdd = sc.parallelize(z)<br/>
z_str_rdd.first()      
      </div>
    </div>
  </div>
</div> 

In [79]:
# create and parallelize strings
s = ["IBM Data Science Experience is built for enterprise-scale deployment.","Manage your data, your analytical assets, and your projects in a secured cloud environment.","When you create an account in the IBM Data Science Experience, we deploy for you a Spark as a Service instance to power your analysis and 5 GB of IBM Object Storage to store your data."]
z_str_rdd = sc.parallelize(s)
z_str_rdd.first()

'IBM Data Science Experience is built for enterprise-scale deployment.'

### Step 3.4 - Split all the entries in the RDD on the spaces.  Then print it out.  Pay careful attention to the new format.
<br/>
 <div class="panel-group" id="accordion-210">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-210" href="#collapse1-210">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-210" class="panel-collapse collapse">
      <div class="panel-body">To split on spaces, use the <a href="https://docs.python.org/2/library/stdtypes.html#string-methods"><i>split()</i></a> function.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-210" href="#collapse2-210">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse2-210" class="panel-collapse collapse">
      <div class="panel-body">Since you want to run on every line, use <i>map()</i> on the RDD and write a lambda function to call <i>split()</i></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-210" href="#collapse3-210">
        Solution</a>
      </h4>
    </div>
    <div id="collapse3-210" class="panel-collapse collapse">
      <div class="panel-body">Type: <br>
z_str_rdd_split = z_str_rdd.map(lambda line: line.split(" "))<br/>
z_str_rdd_split.collect()<br><br>
Question: Is there any difference between split(" ") and split()?</div>
    </div>
  </div>
</div> 

In [84]:
#Step 3.4 - Perform a map transformation to split all entries in the RDD
#Check out the entries in the new RDD
z_str_rdd_split = z_str_rdd.map(lambda x: x.split(" "))
z_str_rdd_split.collect()

[['IBM',
  'Data',
  'Science',
  'Experience',
  'is',
  'built',
  'for',
  'enterprise-scale',
  'deployment.'],
 ['Manage',
  'your',
  'data,',
  'your',
  'analytical',
  'assets,',
  'and',
  'your',
  'projects',
  'in',
  'a',
  'secured',
  'cloud',
  'environment.'],
 ['When',
  'you',
  'create',
  'an',
  'account',
  'in',
  'the',
  'IBM',
  'Data',
  'Science',
  'Experience,',
  'we',
  'deploy',
  'for',
  'you',
  'a',
  'Spark',
  'as',
  'a',
  'Service',
  'instance',
  'to',
  'power',
  'your',
  'analysis',
  'and',
  '5',
  'GB',
  'of',
  'IBM',
  'Object',
  'Storage',
  'to',
  'store',
  'your',
  'data.']]
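
On the question of `split(" ")` versus `split()`: the two agree on the strings used in this step, but they differ when a line contains repeated or trailing spaces. The short example below uses a made-up string, `line`, to show the difference in plain Python.

```python
line = "a  b "  # two spaces between a and b, plus one trailing space

# split(" ") cuts at every single space, so empty strings appear.
by_space = line.split(" ")

# split() cuts at any run of whitespace and drops empty strings.
by_whitespace = line.split()

print(by_space)
print(by_whitespace)
```

For word counting, `split()` is usually the safer choice because it never produces empty "words".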

### Step 3.5 - Explore a new transformation: <a href="https://spark.apache.org/docs/1.6.0/api/python/pyspark#pyspark.RDD.flatMap">flatMap</a>
<br/>
We want to count the words in <b>all</b> the lines, but currently they are split by line.   We need to 'flatten' the per-line results into one collection.<br/>
flatMap runs a function on each RDD element and "flattens" the results, so each input element produces 0 or more output elements.<br/><br/>
 <div class="panel-group" id="accordion-211">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-211" href="#collapse1-211">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-211" class="panel-collapse collapse">
      <div class="panel-body"><i>flatMap()</i> parameters work the same way as in <i>map()</i></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-211" href="#collapse2-211">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-211" class="panel-collapse collapse">
      <div class="panel-body">Type:<br/>
z_str_rdd_split_flatmap = z_str_rdd.flatMap(lambda line: line.split())<br/>
print(z_str_rdd_split_flatmap.collect())<br/></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-211" href="#collapse3-211">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-211" class="panel-collapse collapse">
      <div class="panel-body">Use the replace() and lower() methods to remove all commas and periods then make everything lower-case</div>
    </div>
  </div>
</div> 

In [87]:
# Step 3.5 - Learn the difference between two transformations: map and flatMap.
z_str_rdd_split_flatmap = z_str_rdd.flatMap(lambda line: line.split()) # split() splits on any run of whitespace; for this data the result matches split(" ")
print(z_str_rdd_split_flatmap.collect())
# print the result so that if you do the advanced section both results will be output



['IBM', 'Data', 'Science', 'Experience', 'is', 'built', 'for', 'enterprise-scale', 'deployment.', 'Manage', 'your', 'data,', 'your', 'analytical', 'assets,', 'and', 'your', 'projects', 'in', 'a', 'secured', 'cloud', 'environment.', 'When', 'you', 'create', 'an', 'account', 'in', 'the', 'IBM', 'Data', 'Science', 'Experience,', 'we', 'deploy', 'for', 'you', 'a', 'Spark', 'as', 'a', 'Service', 'instance', 'to', 'power', 'your', 'analysis', 'and', '5', 'GB', 'of', 'IBM', 'Object', 'Storage', 'to', 'store', 'your', 'data.']
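
The contrast between the nested output of Step 3.4 and the flat output of Step 3.5 can be reproduced in plain Python (an analogy, not Spark code). A list comprehension plays the role of `map()`, and `itertools.chain.from_iterable` plays the role of the flattening that `flatMap()` adds. The two sample strings are the ones from the earlier string RDD.

```python
from itertools import chain

lines = ["Hello Human", "My Name is Spark"]

# map analogue: one list of words per line, so the result stays nested
nested = [line.split() for line in lines]

# flatMap analogue: the per-line lists are flattened into a single list of words
flat = list(chain.from_iterable(line.split() for line in lines))

print(len(nested), nested)
print(len(flat), flat)
```
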


In [96]:
# Optional Advanced
# What do you notice? How are the outputs of 3.4 and 3.5 different?
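# Note: only commas are removed here, so periods remain in the output; chain .replace('.','') to strip them as well
z_str_rdd_split_flatmap_2 = z_str_rdd_split_flatmap.map(lambda x: x.lower().replace(',',''))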
print(z_str_rdd_split_flatmap_2.collect())


['ibm', 'data', 'science', 'experience', 'is', 'built', 'for', 'enterprise-scale', 'deployment.', 'manage', 'your', 'data', 'your', 'analytical', 'assets', 'and', 'your', 'projects', 'in', 'a', 'secured', 'cloud', 'environment.', 'when', 'you', 'create', 'an', 'account', 'in', 'the', 'ibm', 'data', 'science', 'experience', 'we', 'deploy', 'for', 'you', 'a', 'spark', 'as', 'a', 'service', 'instance', 'to', 'power', 'your', 'analysis', 'and', '5', 'gb', 'of', 'ibm', 'object', 'storage', 'to', 'store', 'your', 'data.']


### Step 3.6 - Augment each entry in the previous RDD with the number "1" to create pairs or tuples
The first element of the tuple will be the word and the second element of the tuple will be the digit "1".  This is a common step in performing a count as we need values to sum.
<br>
 <div class="panel-group" id="accordion-212">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-212" href="#collapse1-212">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-212" class="panel-collapse collapse">
      <div class="panel-body">Maps don't always have to perform calculations; they can just echo values as well.   Simply echo the value and a 1<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-212" href="#collapse2-212">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse2-212" class="panel-collapse collapse">
      <div class="panel-body">We need to create tuples, which are values enclosed in parentheses, so you'll need to enclose the value and the 1 in parentheses.   For example: (x, 1)<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-212" href="#collapse3-212">
        Solution</a>
      </h4>
    </div>
    <div id="collapse3-212" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
countWords = z_str_rdd_split_flatmap.map(lambda word:(word,1))<br>
countWords.collect()<br></div>
    </div>
  </div>
</div> 

In [98]:
#Step 3.6 - Create pairs or tuple RDD and print it.
countWords = z_str_rdd_split_flatmap_2.map(lambda x: (x,1))
countWords.collect()

[('ibm', 1),
 ('data', 1),
 ('science', 1),
 ('experience', 1),
 ('is', 1),
 ('built', 1),
 ('for', 1),
 ('enterprise-scale', 1),
 ('deployment.', 1),
 ('manage', 1),
 ('your', 1),
 ('data', 1),
 ('your', 1),
 ('analytical', 1),
 ('assets', 1),
 ('and', 1),
 ('your', 1),
 ('projects', 1),
 ('in', 1),
 ('a', 1),
 ('secured', 1),
 ('cloud', 1),
 ('environment.', 1),
 ('when', 1),
 ('you', 1),
 ('create', 1),
 ('an', 1),
 ('account', 1),
 ('in', 1),
 ('the', 1),
 ('ibm', 1),
 ('data', 1),
 ('science', 1),
 ('experience', 1),
 ('we', 1),
 ('deploy', 1),
 ('for', 1),
 ('you', 1),
 ('a', 1),
 ('spark', 1),
 ('as', 1),
 ('a', 1),
 ('service', 1),
 ('instance', 1),
 ('to', 1),
 ('power', 1),
 ('your', 1),
 ('analysis', 1),
 ('and', 1),
 ('5', 1),
 ('gb', 1),
 ('of', 1),
 ('ibm', 1),
 ('object', 1),
 ('storage', 1),
 ('to', 1),
 ('store', 1),
 ('your', 1),
 ('data.', 1)]

<a id="trans4"></a>
### Step 3.7 - Aggregate the [Pair RDD](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) by key<br>
The RDD above is known as a Pair RDD: each entry has a KEY and a VALUE. The KEY is the word (ibm, data, science, ...) and the VALUE is the number "1".  
We can now AGGREGATE this RDD by summing up all the values BY KEY<br><br>
 <div class="panel-group" id="accordion-213">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-213" href="#collapse1-213">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-213" class="panel-collapse collapse in">
      <div class="panel-body">We want to sum all values by key in the key-value pairs.  The generic function to do this is <i>reduceByKey(func)</i>:<br>
      &nbsp;&nbsp;&nbsp;&nbsp;When called on a dataset of (K [Key], V [Value]) pairs, returns a dataset of (K, V) pairs where the values for each key are aggregated using the given reduce function func, which must be of type (V,V) => V.<br><br>This means func(v1, v2) is applied repeatedly to the values that share a key.  Think of v1 as the running result, which starts as the first value seen for that key (there is no separate initial value such as 0 or ""), and v2 as the next value with the same key.  Each call returns the updated running result.  Because Spark combines partial results from different partitions, func should be associative and commutative.<br>
      Use a lambda function to sum up the values, just as you wrote one for <i>map()</i>.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-213" href="#collapse2-213">
         Solution</a>
      </h4>
    </div>
    <div id="collapse2-213" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
countWords2 = countWords.reduceByKey(lambda x,y: x+y)<br>
countWords2.collect()<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-213" href="#collapse3-213">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-213" class="panel-collapse collapse">
      <div class="panel-body">Sort the results by the count.   You could call <i>sortBy()</i> on the result....<br>
      Also, while the function used in <i>map()</i> has only one parameter, when working with Pair RDDs, that parameter is a tuple of two values (the key and the value)....
      </div>
    </div>
  </div>
</div> 
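
The per-key folding that <i>reduceByKey()</i> performs can be sketched in plain Python, without Spark. This is only an illustration of the semantics: `reduce_by_key` is a hypothetical helper, not part of the PySpark API, and it works on a small local list rather than a distributed RDD. The sample pairs are a shortened version of the Step 3.6 output.

```python
def reduce_by_key(pairs, func):
    """Local stand-in for RDD.reduceByKey (hypothetical helper, not a Spark API)."""
    totals = {}
    for key, value in pairs:
        if key in totals:
            # Combine the running result for this key with the next value.
            totals[key] = func(totals[key], value)
        else:
            # The first value seen for a key becomes the starting result.
            totals[key] = value
    return totals

# A shortened sample of (word, 1) pairs.
pairs = [("ibm", 1), ("data", 1), ("ibm", 1), ("your", 1), ("data", 1), ("ibm", 1)]
counts = reduce_by_key(pairs, lambda x, y: x + y)
print(counts)
```

In Spark the same lambda is applied within each partition and then again across partitions, which is why it should give the same answer regardless of the order in which values are combined.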


In [103]:
# Step 3.7 - Check out the results of the aggregation
countWords2 = countWords.reduceByKey(lambda x,y:x+y)
countWords2.collect()



[('enterprise-scale', 1),
 ('for', 2),
 ('storage', 1),
 ('secured', 1),
 ('when', 1),
 ('as', 1),
 ('spark', 1),
 ('cloud', 1),
 ('instance', 1),
 ('ibm', 3),
 ('built', 1),
 ('and', 2),
 ('a', 3),
 ('we', 1),
 ('power', 1),
 ('service', 1),
 ('account', 1),
 ('analysis', 1),
 ('deployment.', 1),
 ('5', 1),
 ('experience', 2),
 ('analytical', 1),
 ('the', 1),
 ('projects', 1),
 ('store', 1),
 ('data.', 1),
 ('deploy', 1),
 ('science', 2),
 ('manage', 1),
 ('an', 1),
 ('environment.', 1),
 ('gb', 1),
 ('is', 1),
 ('data', 3),
 ('create', 1),
 ('you', 2),
 ('assets', 1),
 ('of', 1),
 ('object', 1),
 ('to', 2),
 ('in', 2),
 ('your', 5)]

In [107]:
# Optional advanced: x[0] sorts by word; use x[1] to sort by count instead
countWords2.sortBy(lambda x: x[0]).collect()

[('5', 1),
 ('a', 3),
 ('account', 1),
 ('an', 1),
 ('analysis', 1),
 ('analytical', 1),
 ('and', 2),
 ('as', 1),
 ('assets', 1),
 ('built', 1),
 ('cloud', 1),
 ('create', 1),
 ('data', 3),
 ('data.', 1),
 ('deploy', 1),
 ('deployment.', 1),
 ('enterprise-scale', 1),
 ('environment.', 1),
 ('experience', 2),
 ('for', 2),
 ('gb', 1),
 ('ibm', 3),
 ('in', 2),
 ('instance', 1),
 ('is', 1),
 ('manage', 1),
 ('object', 1),
 ('of', 1),
 ('power', 1),
 ('projects', 1),
 ('science', 2),
 ('secured', 1),
 ('service', 1),
 ('spark', 1),
 ('storage', 1),
 ('store', 1),
 ('the', 1),
 ('to', 2),
 ('we', 1),
 ('when', 1),
 ('you', 2),
 ('your', 5)]
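
The cell above sorts by the word (`x[0]`). To sort by the count, as the optional exercise asks, the key function would select the second element of each tuple. A minimal plain-Python sketch of that key function, using a few pairs copied from the Step 3.7 output; the assumed PySpark equivalent would be `countWords2.sortBy(lambda kv: kv[1], ascending=False)`.

```python
# A few (word, count) pairs copied from the Step 3.7 output.
word_counts = [("for", 2), ("ibm", 3), ("a", 3), ("your", 5), ("the", 1)]

# Each element is a (key, value) tuple, so kv[1] is the count.
# reverse=True puts the most frequent words first.
by_count = sorted(word_counts, key=lambda kv: kv[1], reverse=True)
print(by_count)
```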

<a id="filter"></a>
## 4. Filter data

The `filter` transformation creates a new RDD that contains only the elements of the source RDD that satisfy a filter criterion.
The filter syntax is: 

`.filter(lambda line: <filter criterion>)`

Hint - The criterion for a simple string check is: &#60;string&#62; in &#60;variable&#62;.

Find the number of instances of the word `IBM` in the `z_str_rdd_split_flatmap` RDD:

In [110]:
words_rd3 = z_str_rdd_split_flatmap.filter(lambda line: "IBM" in line) 

print "The count of words " + str(words_rd3.first())
print "Is: " + str(words_rd3.count())

The count of words IBM
Is: 3
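
Note that `"IBM" in line` is a substring test, not an equality test. The plain-Python sketch below, which uses a made-up word list rather than the notebook's RDD, shows how the two criteria can give different counts. In PySpark the same predicates would be passed to `filter()` as lambdas.

```python
# Made-up sample words for illustration only.
words = ["IBM", "Data", "Science", "IBM's", "ibm", "IBM"]

# Substring test: "IBM's" matches too, but lowercase "ibm" does not.
contains_ibm = [w for w in words if "IBM" in w]

# Equality test: only exact matches are kept.
equals_ibm = [w for w in words if w == "IBM"]

print(len(contains_ibm))
print(len(equals_ibm))
```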


<a id="wordfile"></a>
## 5. Analyze text data from a file
In this section, you'll download a file from a URL, create an RDD from it, and analyze the text in it.

<a id="wordfile1"></a>
### Step 5.1 - Read the Apache Spark README.md file from GitHub.  The ! prefix lets you run shell commands from a notebook cell.
<br/>
We remove README.md first in case an older copy is present, and for another reason that you will discover in Lab 2.<br/><br/>
Type:<br/>

&nbsp;&nbsp;&nbsp;&nbsp;!rm README.md* -f<br>
&nbsp;&nbsp;&nbsp;&nbsp;!wget https://raw.githubusercontent.com/apache/spark/master/README.md<br>


In [111]:
# Step 5.1 - Pull data file into workbench
!rm README.md* -f
!wget https://raw.githubusercontent.com/apache/spark/master/README.md

--2017-09-19 11:27:13--  https://raw.githubusercontent.com/apache/spark/master/README.md
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 3809 (3.7K) [text/plain]
Saving to: ‘README.md’


2017-09-19 11:27:13 (18.5 MB/s) - ‘README.md’ saved [3809/3809]



<a id="wordfile2"></a>
### Step 5.2 - Create an RDD by reading from the local filesystem and count the number of lines.  Here is the [textFile()](http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=textfile#pyspark.SparkContext.textFile) documentation.<br><br>
 <div class="panel-group" id="accordion-52">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-52" href="#collapse1-52">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-52" class="panel-collapse collapse">
      <div class="panel-body">README.md has been loaded into local storage so there is no path needed.   <i>textFile()</i> returns an RDD -- you do not have to parallelize the result.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-52" href="#collapse2-52">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-52" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
textfile_rdd = sc.textFile("README.md")<br>
textfile_rdd.count()<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-52" href="#collapse3-52">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-52" class="panel-collapse collapse">
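      <div class="panel-body">By default, <i>textFile()</i> decodes each line into a unicode string (<i>use_unicode=True</i>).   Setting <i>use_unicode=False</i> keeps the lines as UTF-8 encoded <i>str</i> values instead.  Read the file both ways and compare the counts (refer to the docs).</div>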
    </div>
  </div>
</div> 


In [117]:
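# Step 5.2 - Create RDD from data file (use_unicode=False keeps lines as UTF-8 encoded str)
textfile_rdd = sc.textFile("README.md", use_unicode=False)
print(textfile_rdd.count())

# Optional Advanced - read the lines as unicode strings (use_unicode=True is the default)
textfile_rdd_uni = sc.textFile("README.md", use_unicode=True)
print(textfile_rdd_uni.count())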

103
103
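
Both reads return the same number of lines because `use_unicode` changes the type of each line, not how the file is split. The plain-Python sketch below involves no Spark and uses a made-up string. It shows roughly what the two representations look like: decoded text on one side and UTF-8 encoded bytes on the other.

```python
# A made-up line containing one non-ASCII character (e with an acute accent).
decoded = u"caf\u00e9"              # unicode text, roughly what use_unicode=True yields
encoded = decoded.encode("utf-8")   # UTF-8 bytes, roughly what use_unicode=False yields

print(len(decoded))   # length in characters
print(len(encoded))   # length in bytes; the accented letter takes two bytes
```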


<a id="wordfile3"></a>
### Step 5.3 - Use the [filter](http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=filter#pyspark.RDD.filter) transformation to keep only the lines that contain "Spark". Python's 'in' operator tests whether one string occurs inside another.<br>
We will also take a look at the first line in the newly filtered RDD. <br><br>
 <div class="panel-group" id="accordion-33">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-33" href="#collapse1-33">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-33" class="panel-collapse collapse">
      <div class="panel-body"><i>filter()</i>, just like <i>map()</i> can take a lambda function as its input</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-33" href="#collapse2-33">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-33" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
Spark_lines = textfile_rdd.filter(lambda line: "Spark" in line)<br>
Spark_lines.first()<br></div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-33" href="#collapse3-33">
        Advanced Optional</a>
      </h4>
    </div>
    <div id="collapse3-33" class="panel-collapse collapse">
      <div class="panel-body">In the run shown in this notebook, 20 lines contain the word "Spark" (the exact number depends on the current version of README.md).   Find all lines that contain it regardless of case.<br></div>
    </div>
  </div>
</div> 

In [122]:
#Step 5.3 - Filter for only lines with word Spark
Spark_lines = textfile_rdd.filter(lambda line: "Spark" in line)
print(Spark_lines.first())
print(Spark_lines.collect())

# Advanced optional


# Apache Spark
['# Apache Spark', 'Spark is a fast and general cluster computing system for Big Data. It provides', 'rich set of higher-level tools including Spark SQL for SQL and DataFrames,', 'and Spark Streaming for stream processing.', 'You can find the latest Spark documentation, including a programming', '## Building Spark', 'Spark is built using [Apache Maven](http://maven.apache.org/).', 'To build Spark and its example programs, run:', 'You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).', '["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).', 'For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).', 'The easiest way to start using Spark is through the Scala shell:', 'Spark also comes with several sample program

<a id="wordfile4"></a>
### Step 5.4 - Count the lines in the filtered RDD and the lines in the whole file, and print both numbers as a single concatenated string.<br/><br/>
 <div class="panel-group" id="accordion-34">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-34" href="#collapse1-34">
        Hint 1</a>
      </h4>
    </div>
    <div id="collapse1-34" class="panel-collapse collapse">
      <div class="panel-body">The <i>print</i> statement prints to the console.  (Note: be careful on a cluster, because output printed inside a function that runs on a worker goes to that worker's output and will not appear in the notebook.)  You can convert integers to strings by using the <i>str()</i> function.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-34" href="#collapse2-34">
        Hint 2</a>
      </h4>
    </div>
    <div id="collapse2-34" class="panel-collapse collapse">
      <div class="panel-body">Strings can be concatenated together with the + sign.   You can mark a statement as spanning multiple lines by putting a \ at the end of the line.</div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-34" href="#collapse3-34">
        Solution</a>
      </h4>
    </div>
    <div id="collapse3-34" class="panel-collapse collapse">
      <div class="panel-body">Type:<br>
print "The file README.md has " + str(Spark_lines.count()) + \<br/>
" of " + str(textfile_rdd.count()) + \<br/>
" lines with the word Spark in it."<br/></div>
    </div>
  </div>
</div> 

In [123]:
# Step 5.4 - count the number of lines
print "The file README.md has " + str(Spark_lines.count()) + \
" of " + str(textfile_rdd.count()) + \
" lines with the word Spark in it."

The file README.md has 20 of 103 lines with the word Spark in it.


<a id="wordfile5"></a>
### Step 5.5 - Now count the number of times the word Spark appears in the original text, not just the number of lines that contain it.
 <div class="panel-group" id="accordion-35">
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-35" href="#collapse1-35">
        Hint</a>
      </h4>
    </div>
    <div id="collapse1-35" class="panel-collapse collapse">
      <div class="panel-body">
        Looking back at previous exercises, you will need to: <br>
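        &nbsp;&nbsp;&nbsp;&nbsp;1 - Execute a flatMap transformation on the filtered RDD Spark_lines and split on white space. (Lines without "Spark" cannot contribute any matches, so the filtered RDD gives the same count as the original.)<br>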
        &nbsp;&nbsp;&nbsp;&nbsp;2 - Use filter to include all instances of the word Spark<br>
        &nbsp;&nbsp;&nbsp;&nbsp;3 - Count all instances<br>
        &nbsp;&nbsp;&nbsp;&nbsp;4 - Print the total count<br><br>
      </div>
    </div>
  </div>
  <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-35" href="#collapse2-35">
        Solution</a>
      </h4>
    </div>
    <div id="collapse2-35" class="panel-collapse collapse">
      <div class="panel-body">
      Flattened_Spark_lines = Spark_lines.flatMap(lambda line: line.split())<br>
      Spark_instances = Flattened_Spark_lines.filter(lambda word: "Spark" in word)<br>
      print "Number of Spark instances: ",str(Spark_instances.count())
      </div>
      </div>
    </div>
    <div class="panel panel-default">
    <div class="panel-heading">
      <h4 class="panel-title">
        <a data-toggle="collapse" data-parent="#accordion-35" href="#collapse3-35">
        Optional Advanced</a>
      </h4>
    </div>
    <div id="collapse3-35" class="panel-collapse collapse">
      <div class="panel-body">Put the entire statement on one line and make the filter case-insensitive.</div>
    </div>
  </div>
</div> 

In [124]:
# Step 5.5
Flattened_Spark_lines = Spark_lines.flatMap(lambda line: line.split())
Spark_instances = Flattened_Spark_lines.filter(lambda word: "Spark" in word)
print "Number of Spark instances: ",str(Spark_instances.count()) 
#Optional Advanced


Number of Spark instances:  21


In [139]:
print "Number of Spark instances: ",str(Spark_lines.flatMap(lambda line: line.split()).filter(lambda word: "spark" in word.lower()).count()) 

Number of Spark instances:  24
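
The three counts seen so far (matching lines, matching words, and case-insensitive matching words) differ for different reasons. The plain-Python sketch below reproduces the same pipeline on a tiny list. The first two lines come from the Step 5.3 output; the last two are hypothetical lines added for illustration. The nested comprehension plays the role of `flatMap`, and each `if` clause plays the role of `filter`.

```python
lines = [
    "# Apache Spark",
    "rich set of higher-level tools including Spark SQL for SQL and DataFrames,",
    "./bin/spark-shell",                               # hypothetical line
    "Spark SQL and Spark Streaming use spark-submit",  # hypothetical line
]

# Lines that contain "Spark" at least once (as in Step 5.3).
matching_lines = [line for line in lines if "Spark" in line]

# Flatten the lines into words, then keep the words containing "Spark" (as in Step 5.5).
words = [word for line in lines for word in line.split()]
exact_case = [word for word in words if "Spark" in word]

# Case-insensitive variant: lowercase each word before testing.
any_case = [word for word in words if "spark" in word.lower()]

print(len(matching_lines))
print(len(exact_case))
print(len(any_case))
```

A line that mentions Spark twice counts once in the first result and twice in the second, and lowercase forms such as `spark-submit` only appear in the third.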


<a id="numfile"></a>
## 6. Perform analysis on a data file
This part is a little more open-ended and there are a few ways to complete it.  Scroll up to previous examples for some guidance.  You will download a data file, transform the data, and then average the prices.  The data file is a sample of tech stock prices over six days. <br>

Data Location: https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv<br>
The data file is a CSV file.<br/><br/>
Here is a sample of the file:<br/>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;IBM,159.720001,159.399994,158.880005,159.539993,159.550003,160.350006

The cells below use a map-reduce approach to build a generic solution, but there are multiple ways to solve this problem.

In [140]:
# Step 6 - Delete the file if it exists, download a new copy and load it into an RDD
!rm StockPrices.csv -f
!wget https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv
    
SP = sc.textFile("StockPrices.csv")
SP.collect()

--2017-09-19 11:48:52--  https://raw.githubusercontent.com/JosephKambourakisIBM/SparkPoT/master/StockPrices.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.48.133
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|151.101.48.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 244 [text/plain]
Saving to: ‘StockPrices.csv’


2017-09-19 11:48:52 (34.6 MB/s) - ‘StockPrices.csv’ saved [244/244]



[u'IBM,159.720001,159.399994,158.880005,159.539993,159.550003,160.350006',
 u'MSFT,58.099998,57.889999,57.459999,57.59,57.669998,57.610001',
 u'AAPL,106.82,106,106.099998,106.730003,107.730003,107.699997',
 u'ORCL,41.310001,41.310001,41.220001,41.16,41.25,41.25']

In [142]:
# Split each CSV line into a list: [symbol, price1, ..., price6]
test1 = SP.map(lambda line: line.split(','))
test1.collect()

[[u'IBM',
  u'159.720001',
  u'159.399994',
  u'158.880005',
  u'159.539993',
  u'159.550003',
  u'160.350006'],
 [u'MSFT',
  u'58.099998',
  u'57.889999',
  u'57.459999',
  u'57.59',
  u'57.669998',
  u'57.610001'],
 [u'AAPL',
  u'106.82',
  u'106',
  u'106.099998',
  u'106.730003',
  u'107.730003',
  u'107.699997'],
 [u'ORCL',
  u'41.310001',
  u'41.310001',
  u'41.220001',
  u'41.16',
  u'41.25',
  u'41.25']]

In [144]:
# Emit one (symbol, [price, 1]) pair per price so that sums and counts can be reduced together
test2 = test1.flatMap(lambda row: map(lambda x: (row[0],[float(row[x]), 1]), range(1,len(row))))
test2.collect()

[(u'IBM', [159.720001, 1]),
 (u'IBM', [159.399994, 1]),
 (u'IBM', [158.880005, 1]),
 (u'IBM', [159.539993, 1]),
 (u'IBM', [159.550003, 1]),
 (u'IBM', [160.350006, 1]),
 (u'MSFT', [58.099998, 1]),
 (u'MSFT', [57.889999, 1]),
 (u'MSFT', [57.459999, 1]),
 (u'MSFT', [57.59, 1]),
 (u'MSFT', [57.669998, 1]),
 (u'MSFT', [57.610001, 1]),
 (u'AAPL', [106.82, 1]),
 (u'AAPL', [106.0, 1]),
 (u'AAPL', [106.099998, 1]),
 (u'AAPL', [106.730003, 1]),
 (u'AAPL', [107.730003, 1]),
 (u'AAPL', [107.699997, 1]),
 (u'ORCL', [41.310001, 1]),
 (u'ORCL', [41.310001, 1]),
 (u'ORCL', [41.220001, 1]),
 (u'ORCL', [41.16, 1]),
 (u'ORCL', [41.25, 1]),
 (u'ORCL', [41.25, 1])]

In [146]:
# Add up the prices and the counts for each symbol
test3 = test2.reduceByKey(lambda x,y: [x[0] + y[0], x[1] + y[1]])
test3.collect()

[(u'AAPL', [641.080001, 6]),
 (u'ORCL', [247.500003, 6]),
 (u'MSFT', [346.319995, 6]),
 (u'IBM', [957.4400019999999, 6])]

In [147]:
# Divide each sum by its count to get the average price per symbol
test4 = test3.map(lambda x: (x[0], float(x[1][0]) / int(x[1][1])))
test4.collect()

[(u'AAPL', 106.84666683333334),
 (u'ORCL', 41.2500005),
 (u'MSFT', 57.71999916666667),
 (u'IBM', 159.57333366666666)]
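
The (sum, count) pattern used in the cells above can be traced in plain Python, without Spark. This sketch uses the four rows shown in the `SP.collect()` output and a local dictionary in place of `reduceByKey`; the variable names are illustrative only.

```python
rows = [
    "IBM,159.720001,159.399994,158.880005,159.539993,159.550003,160.350006",
    "MSFT,58.099998,57.889999,57.459999,57.59,57.669998,57.610001",
    "AAPL,106.82,106,106.099998,106.730003,107.730003,107.699997",
    "ORCL,41.310001,41.310001,41.220001,41.16,41.25,41.25",
]

# Accumulate a running (sum, count) pair per symbol.
totals = {}
for row in rows:
    fields = row.split(",")
    symbol = fields[0]
    for price in fields[1:]:
        running_sum, running_count = totals.get(symbol, (0.0, 0))
        totals[symbol] = (running_sum + float(price), running_count + 1)

# Divide only at the end, once every price has been added.
averages = {symbol: s / n for symbol, (s, n) in totals.items()}
for symbol in sorted(averages):
    print("%s %.4f" % (symbol, averages[symbol]))
```

Carrying the sum and the count together, and dividing only in the final step, means partial results can be combined in any order. Averaging partial averages would not have that property.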

<a id="summary"></a>
## 7. Summary and next steps

You've learned how to work with data in RDDs to discover useful information.

Dig deeper:
 - [Apache Spark documentation](http://spark.apache.org/documentation.html)
 - [PySpark documentation](http://spark.apache.org/docs/latest/api/python/pyspark.html)

### Authors
Carlo Appugliese is a Spark and Hadoop evangelist at IBM.<br/>
Braden Callahan is a Big Data Technical Specialist for IBM.<br/>
Ross Lewis is a Big Data Technical Sales Specialist for IBM.<br/>
Mokhtar Kandil is a World Wide Big Data Technical Specialist for IBM.<br/>
Joel Patterson is a Big Data Technical Specialist for IBM.