# Spark basics

Let's first initialize a Spark context:

In [8]:
import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='local[*]', appName="Spark course")

Now pretty-print the current Spark configuration using `sc.getConf` (the [Spark documentation](https://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.SparkConf) is a useful reference).

In [9]:
for c in sc.getConf().getAll():
    print(c)

('spark.memory.offHeap.size', '4g')
('spark.jars', 'file:///usr/local/spark/python/axs/AxsUtilities-1.0-SNAPSHOT.jar,file:///home/spark/first-edition/ch06/kafkaProducerWrapper.jar,file:///home/spark/.ivy2/jars/org.mariadb.jdbc_mariadb-java-client-2.2.3.jar')
('spark.driver.port', '40074')
('spark.app.name', 'Spark course')
('spark.files', 'file:///home/spark/.ivy2/jars/org.mariadb.jdbc_mariadb-java-client-2.2.3.jar')
('spark.jars.packages', 'org.mariadb.jdbc:mariadb-java-client:2.2.3')
('spark.local.dir', '/home/spark/sparktmp')
('spark.executor.id', 'driver')
('spark.driver.host', '10.0.2.15')
('spark.driver.memory', '1g')
('spark.ui.killEnabled', 'true')
('spark.memory.offHeap.enabled', 'true')
('spark.sql.warehouse.dir', 'file:///home/spark/spark-warehouse')
('spark.scheduler.minRegisteredResourcesRatio', '0.75')
('spark.executor.extraJavaOptions', '-XX:MaxDirectMemorySize=4096m')
('spark.executor.memory', '1g')
('spark.rdd.compress', 'True')
('spark.repl.local.jars', 'file:///usr/l

What is Spark's current parallelism level? (Hint: the parameter is called "spark.default.parallelism")

In [10]:
print(sc.getConf().get("spark.default.parallelism"))
print(sc.defaultParallelism)

4
4


Examine the corresponding Environment page in the [Spark Web UI](http://192.168.10.2:4040/environment/).

## Working with RDDs

Load Spark's `NOTICE` file (from the `/usr/local/spark` folder) into a variable called `noticeRdd`. (You might want to check out the [official documentation for SparkContext](https://spark.apache.org/docs/2.4.0/api/python/pyspark.html#pyspark.SparkContext).)

In [11]:
noticeRdd = sc.textFile("/usr/local/spark/NOTICE")

How many lines does the NOTICE file have?

In [12]:
noticeRdd.count()

1174

Create an RDD called `words` containing only the words from the NOTICE file (a word is any string of characters separated by whitespace).

How many words does the NOTICE file have? (Note: make sure not to count empty words.)

In [18]:
words = noticeRdd.flatMap(lambda line: line.split(" ")).filter(lambda word: word.strip() != "")
words.count()

4895
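The filter in the cell above is needed because `split(" ")` splits on every single space, so runs of spaces produce empty strings. A minimal plain-Python sketch (no Spark required; the sample line is made up for illustration) compares it with the no-argument `split()`, which splits on any run of whitespace:

```python
# Hypothetical sample line with leading spaces, a double space, and a tab.
line = "  Apache  Spark\tNOTICE file"

# split(" ") splits on every single space: empty strings appear,
# and the tab stays inside the token "Spark\tNOTICE".
by_space = line.split(" ")

# Drop whitespace-only tokens, mirroring the filter used on the RDD.
non_empty = [w for w in by_space if w.strip() != ""]

# split() with no argument splits on any whitespace run
# and never yields empty strings.
by_whitespace = line.split()

print(by_space)       # ['', '', 'Apache', '', 'Spark\tNOTICE', 'file']
print(non_empty)      # ['Apache', 'Spark\tNOTICE', 'file']
print(by_whitespace)  # ['Apache', 'Spark', 'NOTICE', 'file']
```

If the file contains tabs, `noticeRdd.flatMap(lambda line: line.split())` would split on them as well. The counts reported in this notebook come from the `split(" ")` version, so the numbers may differ if you switch.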

How many DISTINCT words does the NOTICE file have?

In [15]:
words.distinct().count()

1512

What is the average word length in the NOTICE file?

In [16]:
words.map(lambda word: len(word)).mean()

7.510316649642486

How many words are shorter than 3 characters, how many have between 3 and 7 characters, and how many have 8 or more? (Hint: use `histogram`)

In [17]:
words.map(lambda word: len(word)).histogram([0, 3, 8, 1000])

([0, 3, 8, 1000], [746, 2494, 1655])
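When `histogram` is given a list of boundaries, each bucket includes its lower boundary and excludes its upper one, except the last bucket, which is closed on both ends. The plain-Python sketch below mimics that rule so you can see where boundary values land. The function name `bucket_counts` and the sample lengths are made up for illustration; this is not Spark's implementation.

```python
from bisect import bisect_right

def bucket_counts(values, buckets):
    """Count values per bucket: [b0, b1), [b1, b2), ..., [b(n-1), bn] (last one closed)."""
    counts = [0] * (len(buckets) - 1)
    for v in values:
        if v < buckets[0] or v > buckets[-1]:
            continue  # values outside the overall range are ignored
        # bisect_right gives the index of the first boundary greater than v
        i = bisect_right(buckets, v) - 1
        # a value equal to the last boundary belongs to the last bucket
        i = min(i, len(counts) - 1)
        counts[i] += 1
    return counts

# Hypothetical word lengths, chosen to hit the bucket boundaries.
lengths = [1, 2, 3, 7, 8, 12, 1000]
print(bucket_counts(lengths, [0, 3, 8, 1000]))  # [2, 2, 3]
```

With the boundaries `[0, 3, 8, 1000]`, a word of length 3 is counted in the middle bucket and a word of length 8 in the last one.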

What is the total number of non-whitespace characters in the NOTICE file (Hint: use the `words` RDD and `reduce` action)?

In [19]:
words.map(lambda w: len(w)).reduce(lambda w1, w2: w1 + w2)

36763

Save distinct words from the NOTICE file to the `/home/spark/output/noticeWords` file.

In [20]:
words.distinct().saveAsTextFile("/home/spark/output/noticeWords")

Use your Linux shell to examine the output "file".
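The quotation marks around "file" are deliberate: `saveAsTextFile` creates a directory containing one `part-NNNNN` file per partition, plus a `_SUCCESS` marker. If you prefer to inspect the result from Python, a sketch along these lines reads the parts back. The helper name `read_saved_lines` and the demo contents are hypothetical, and the demo simulates the directory layout rather than running Spark.

```python
import glob
import os
import tempfile

def read_saved_lines(output_dir):
    """Collect the lines of all part-* files found in output_dir."""
    lines = []
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as f:
            lines.extend(line.rstrip("\n") for line in f)
    return lines

# Simulate a Spark output directory with made-up contents (no Spark needed).
with tempfile.TemporaryDirectory() as demo_dir:
    demo_files = [("part-00000", "Apache\nSpark\n"),
                  ("part-00001", "NOTICE\n"),
                  ("_SUCCESS", "")]
    for name, text in demo_files:
        with open(os.path.join(demo_dir, name), "w", encoding="utf-8") as f:
            f.write(text)
    demo_lines = read_saved_lines(demo_dir)

print(demo_lines)  # ['Apache', 'Spark', 'NOTICE']
```

On the course machine, `read_saved_lines("/home/spark/output/noticeWords")` should return the saved words, assuming the job wrote to the local filesystem. Note that rerunning the save cell fails if the output directory already exists.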

Visit [Spark's Web UI](http://192.168.10.2:4040) and examine jobs, stages and tasks that were executed when the previous cells were running.

## Using accumulators

Accumulators are variables shared across executors that you can only add to. You can use them to implement global sums and counters in your Spark jobs. Reading from accumulators is allowed only from the driver side.

Create two accumulators `totallen` and `totalcount` and initialize them to zero.

In [26]:
totallen = sc.accumulator(0)
totalcount = sc.accumulator(0)

Use the `foreach` method to calculate the total length of all words and the count of the words and put those values into the two accumulators.

In [27]:
def calc_words(word):
    totallen.add(len(word))
    totalcount.add(1)
    
words.foreach(calc_words)

Examine the accumulator values.

In [28]:
totallen.value, totalcount.value

(36763, 4895)

Now use the accumulator variables to calculate the average word length.

In [29]:
totallen.value / float(totalcount.value)

7.510316649642492
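This value differs from the earlier `mean()` result (7.510316649642486) only in the last digits. A likely explanation is floating-point rounding: `mean()` appears to maintain an incrementally updated running mean (an assumption about PySpark's internals worth checking for your version), whereas here a single sum is divided by a single count. The plain-Python sketch below, with made-up word lengths, shows the two ways of computing the same average:

```python
import math

def running_mean(values):
    """Incrementally updated mean: mu += (x - mu) / n for each new value."""
    mu, n = 0.0, 0
    for x in values:
        n += 1
        mu += (x - mu) / n
    return mu

# Hypothetical word lengths.
lengths = [6, 5, 6, 4, 11, 2, 9, 7, 3, 8]

direct = sum(lengths) / float(len(lengths))  # one sum divided by one count
incremental = running_mean(lengths)          # many small updates

print(direct, incremental)
# The two results agree up to rounding error, but need not be bit-identical.
print(math.isclose(direct, incremental, rel_tol=1e-12))
```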

## Using broadcast variables

Typically, variables created in the driver, needed by tasks for their execution, are serialized and shipped along with those tasks. But a single driver program can reuse the same variable in several jobs, and several tasks may get shipped to the same executor as part of the same job. So, a potentially large variable may get serialized and transferred over the network more times than necessary. In these cases, it's better to use broadcast variables, because they can transfer the data in a more optimized way and only once.

Let's say you have the following dictionary of words that need to be corrected (execute the following cell).

In [30]:
changedict = {"Guice": "Juice", "Glyphicons": "Glyphs", "Bootstrap": "bootstrap", "Bengtson": "Benson"}

Broadcast this dictionary to the executors and use it inside a transformation such as `map` or `mapPartitions` to replace all the matching words in the `words` RDD.

In [31]:
dictb = sc.broadcast(changedict)
newwords = words.map(lambda w: dictb.value[w] if w in dictb.value else w)
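The cell above uses `map`. A `mapPartitions` version would read the broadcast value once per partition instead of once per word. The sketch below defines such a per-partition function in plain Python so it can be tried without Spark; the name `replace_words` and the sample partition are made up for illustration.

```python
def replace_words(partition, mapping):
    """Yield each word of a partition, replaced by its mapping entry if one exists."""
    for word in partition:
        yield mapping.get(word, word)

changedict = {"Guice": "Juice", "Glyphicons": "Glyphs",
              "Bootstrap": "bootstrap", "Bengtson": "Benson"}

# Plain-Python check on a made-up partition (no Spark needed).
sample_partition = iter(["Google", "Guice", "Bootstrap", "license"])
replaced = list(replace_words(sample_partition, changedict))
print(replaced)  # ['Google', 'Juice', 'bootstrap', 'license']

# With Spark, assuming `words` and `dictb` from the cells above:
# newwords = words.mapPartitions(lambda part: replace_words(part, dictb.value))
```

Both variants should produce the same RDD contents; the difference is only in how often the broadcast value is looked up.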

Check whether you made any changes by counting some of the matching words in both the new and the old RDD.

In [32]:
print(words.filter(lambda x: x == "Guice").count())
print(words.filter(lambda x: x == "Juice").count())

2
0


In [33]:
print(newwords.filter(lambda x: x == "Guice").count())
print(newwords.filter(lambda x: x == "Juice").count())

0
2
