Welcome to exercise one of “Apache Spark for Scalable Machine Learning on BigData”. In this exercise you’ll apply the basics of functional and parallel programming. 

Let’s start with a simple example. Suppose you have a list of integers.

Let’s find out what the size of this list is.

Note that we already provide an RDD object, so please have a look at the RDD API in order to find out what function to use:
https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD

The following link contains additional documentation:
https://spark.apache.org/docs/latest/rdd-programming-guide.html



This notebook is designed to run in an IBM Watson Studio default runtime (NOT the Watson Studio Apache Spark runtime, because the default runtime with 1 vCPU is free of charge). Therefore, we install Apache Spark in local mode for test purposes only. Please don't use it in production.

In case you are facing issues, please read the following two documents first:

https://github.com/IBM/skillsnetwork/wiki/Environment-Setup

https://github.com/IBM/skillsnetwork/wiki/FAQ

Then, please feel free to ask:

https://coursera.org/learn/machine-learning-big-data-apache-spark/discussions/all

Please make sure to follow the guidelines before asking a question:

https://github.com/IBM/skillsnetwork/wiki/FAQ#im-feeling-lost-and-confused-please-help-me


This notebook should also work outside Watson Studio. If you are running it in an existing Apache Spark context outside Watson Studio, please remove the Apache Spark setup in the first notebook cells.

In [2]:
from IPython.display import Markdown, display
def printmd(string):
    display(Markdown('# <span style="color:red">'+string+'</span>'))


if ('sc' in locals() or 'sc' in globals()):
    printmd('<<<<<!!!!! It seems that you are running in an IBM Watson Studio Apache Spark Notebook. Please run it in an IBM Watson Studio Default Runtime (without Apache Spark) !!!!!>>>>>')


In [3]:
!pip install pyspark==2.4.5

Collecting pyspark==2.4.5
  Downloading pyspark-2.4.5.tar.gz (217.8 MB)

In [4]:
try:
    from pyspark import SparkContext, SparkConf
    from pyspark.sql import SparkSession
except ImportError as e:
    printmd('<<<<<!!!!! Please restart your kernel after installing Apache Spark !!!!!>>>>>')

In [5]:
sc = SparkContext.getOrCreate(SparkConf().setMaster("local[*]"))

spark = SparkSession \
    .builder \
    .getOrCreate()

In [6]:
rdd = sc.parallelize(range(100))
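
`sc.parallelize` distributes a local collection across several partitions so that work on it can run in parallel. The following plain-Python sketch illustrates only the idea; `split_into_partitions` is a hypothetical helper, not part of the Spark API, and the number of partitions (4) is an assumption. The way Spark actually assigns elements to partitions may differ.

```python
def split_into_partitions(data, num_partitions):
    """Hypothetical helper (not Spark API): split data into contiguous chunks."""
    items = list(data)
    n = len(items)
    # Chunk boundaries chosen so that chunk sizes differ by at most one element
    bounds = [i * n // num_partitions for i in range(num_partitions + 1)]
    return [items[bounds[i]:bounds[i + 1]] for i in range(num_partitions)]


# Assumption: 4 partitions, chosen only for illustration
partitions = split_into_partitions(range(100), 4)
partition_sizes = [len(p) for p in partitions]
print(partition_sizes)
```

Each chunk can then be processed independently, and the partial results combined at the end.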

In [9]:
# count() returns the number of elements in the RDD
rdd.count()

100

You should see "100" as the answer. Now we want to know the sum of all elements. Again, please have a look at the API documentation and complete the code below to get the sum.

In [11]:
rdd.sum()

4950

You should get "4950" as the answer.
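
To connect both results to the functional programming basics this exercise is about, counting and summing can each be expressed as a map step followed by a reduce step. The sketch below uses plain Python and `functools.reduce` as a stand-in, so it runs without Spark; the variable names are chosen for illustration only.

```python
from functools import reduce

data = range(100)

# Count: map every element to 1, then add the ones up
count = reduce(lambda a, b: a + b, map(lambda _: 1, data))

# Sum: add the elements up directly
total = reduce(lambda a, b: a + b, data)

print(count, total)
```

On the RDD itself the same pattern would presumably read `rdd.map(lambda x: 1).reduce(lambda a, b: a + b)` for the count and `rdd.reduce(lambda a, b: a + b)` for the sum. Because addition is associative and commutative, the partial results from different partitions can be combined in any order.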