## 5.3.2 Step-by-Step Breakdown of the Word Frequency Spark Example

In this slide, we explain the code line by line so you can see how PySpark works and how familiar it feels if you have some prior Python experience. Run the notebook cell by cell, observe the outcome, and feel free to edit the code if you wish to find out more. Explanations are provided in-line in your notebook.

If you are not familiar with Jupyter Notebook, here is a crash course:
- Move your mouse over the numbers on the left (they look like [1], [2] and so on). The number turns into a "Play" button. Click it and that particular cell of code will be executed.
- Sometimes the standard output (anything "printed" to the console) is displayed right below your code. Sometimes you will see nothing. Both are OK. What is not OK is an error message.
- If you have assigned a value to a variable, you will not see any output. To view the value, type the name of the variable on the next line (some of the code examples already do that for you).
- You can also select Run All to execute all cells in sequence in one go, but that's not so fun, is it? :)

Enjoy and have fun learning.

Have you tried to work out how the previous Python code uses Spark (via PySpark) to perform the same word frequency calculation you tested with MapReduce in Week 4? Let's run through the code line by line.

1.0 First of all, you begin by importing *SparkSession* from the *pyspark.sql* module of the *pyspark* library.


In [2]:
from pyspark.sql import SparkSession

2.0 It is important to begin by creating a SparkSession. This is the entry point to Spark (which should be installed on the same machine beforehand). It gives you access to the functions Spark provides via PySpark.

In [3]:
spark = SparkSession.builder.master("local").appName('Firstprogram').getOrCreate()

If the SparkSession is created successfully, displaying the `spark` object in a notebook cell should show the Spark version, the master and the app name.

You can also set configuration options specific to your needs with the builder's `.config()` method, but for this example we will keep the default configuration.

Always end the SparkSession builder chain with `.getOrCreate()`, which returns the existing session if there is one and creates a new one otherwise.

3.0 Next comes reading from a file. Let's read the text file you processed in Week 4. It has the same name, text.txt. Use the following command to read it into an RDD (displayed as a PythonRDD object when printed).

In [4]:
text_file = spark.read.text("text.txt").rdd.map(lambda r: r[0])

Print the text_file object, and also try to check its contents. Can you see how the text file text.txt is now mapped into a PythonRDD object?
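
To see why the `map(lambda r: r[0])` step is there: `spark.read.text` returns a DataFrame with a single string column, so each line of the file arrives wrapped in a row object rather than as a bare string. The sketch below imitates that in plain Python, with no Spark required. The `Row` namedtuple and the two sample lines are stand-ins chosen for illustration, not PySpark's own classes or the full file.

```python
from collections import namedtuple

# Hypothetical stand-in for a Spark row with one column called "value".
Row = namedtuple("Row", ["value"])

# Roughly what the DataFrame holds: one row per line of the file.
rows = [
    Row("Four score and seven years ago"),
    Row("our fathers brought forth"),
]

# Same idea as .rdd.map(lambda r: r[0]): keep only the first (and only) field.
lines = list(map(lambda r: r[0], rows))

print(lines)
```

After this step every element is an ordinary string, which is what the later `line.split(" ")` call expects.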

4.0 Doing the word count...
Next, you will apply three transformations to the text_file RDD. First, `flatMap` splits each line on spaces and flattens the pieces into one collection of words. Then `map` turns each word into a key-value pair of (word, 1). Finally, `reduceByKey` adds up all the 1s for each word. Does this look familiar? It is the same map-then-reduce pattern you used in Week 4.

In [5]:
counts = text_file.flatMap(lambda line: line.split(" "))\
                  .map(lambda word: (word, 1))\
                  .reduceByKey(lambda x, y: x + y)
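
To make the three transformations concrete, here is a minimal plain-Python sketch of what each one does to the data. It runs on a single machine without Spark, so it only mirrors the logic, not the distributed execution. The two sample lines and the variable names are assumptions for illustration.

```python
from collections import defaultdict

lines = ["a new nation", "a great nation"]  # hypothetical sample input

# flatMap: split every line and flatten the pieces into one list of words.
words = [word for line in lines for word in line.split(" ")]

# map: pair each word with the number 1.
pairs = [(word, 1) for word in words]

# reduceByKey: combine the values of identical keys with x + y.
totals = defaultdict(int)
for word, one in pairs:
    totals[word] = totals[word] + one

local_counts = dict(totals)
print(local_counts)
```

Note that `flatMap` differs from `map` in the flattening: a plain `map` with `split` would give one list per line, whereas `flatMap` gives one long sequence of words.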

5.0 Collection time
The next step is to collect the results from all nodes. Remember, in a cluster setting an RDD is processed on more than one node, so this step is usually required to bring the results from all nodes back to your program.

In [None]:
output = counts.collect()
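
Spark transformations such as `flatMap` and `map` are lazy: they only describe the work, and nothing is computed until an action such as `collect()` asks for a result. Python generators behave in a loosely similar way, which the sketch below uses as an analogy. It is not how Spark is implemented, and the `log` list, the `split_words` helper and the sample lines are illustrative only.

```python
log = []  # records when work actually happens

def split_words(lines):
    """Yield the words of each line, noting every line that gets processed."""
    for line in lines:
        log.append("split: " + line)
        for word in line.split(" "):
            yield word

lines = ["a new nation", "a great nation"]

# Like a transformation: building the pipeline does no work yet.
pipeline = ((word, 1) for word in split_words(lines))
work_before = len(log)

# Like collect(): asking for the results forces the whole pipeline to run.
result = list(pipeline)
work_after = len(log)

print(work_before, work_after, len(result))
```

Because `collect()` brings every result into one Python list, it suits small outputs like this word count. For very large results it can exhaust the memory of the machine running the notebook.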

6.0 Checking the output
Finally, you can print the final output.

In [None]:
for (word, count) in output:
    print("%s: %i" % (word, count))

Due to the constraints on my local machine, this is the output:

Four: 1 
<br>score: 1
<br>and: 5
<br>seven: 1
<br>years: 1
<br>ago: 1
<br>our: 2
<br>fathers: 1
<br>brought: 1
<br>forth: 1
<br>on: 2
<br>this: 2
<br>continent,: 1
<br>a: 7
<br>new: 2
<br>nation,: 3
<br>conceived: 2
<br>in: 4
<br>Liberty,: 1
<br>dedicated: 3
<br>to: 8
<br>the: 9
<br>proposition: 1
<br>that: 10
<br>all: 1
<br>men: 1
<br>are: 3
<br>created: 1
<br>equal.: 1
<br>: 2
<br>Now: 1
<br>we: 6
<br>engaged: 1
<br>great: 3
<br>civil: 1
<br>war,: 1
<br>testing: 1
<br>whether: 1
<br>or: 2
<br>any: 1
<br>nation: 2
<br>so: 3
<br>dedicated,: 1
<br>can: 5
<br>long: 2
<br>endure.: 1
<br>We: 2
<br>met: 1
<br>battle-field: 1
<br>of: 5
<br>war.: 1
<br>have: 5
<br>come: 1
<br>dedicate: 1
<br>portion: 1
<br>field,: 1
<br>as: 1
<br>final: 1
<br>resting: 1
<br>place: 1
<br>for: 5
<br>those: 1
<br>who: 3
<br>here: 5
<br>gave: 2
<br>their: 1
<br>lives: 1
<br>might: 1
<br>live.: 1
<br>It: 3
<br>is: 3
<br>altogether: 1
<br>fitting: 1
<br>proper: 1
<br>should: 1
<br>do: 1
<br>this.: 1
<br>But,: 1
<br>larger: 1
<br>sense,: 1
<br>not: 5
<br>dedicate—we: 1
<br>consecrate—we: 1
<br>hallow—this: 1
<br>ground.: 1
<br>The: 2
<br>brave: 1
<br>men,: 1
<br>living: 1
<br>dead,: 1
<br>struggled: 1
<br>here,: 2
<br>consecrated: 1
<br>it,: 1
<br>far: 2
<br>above: 1
<br>poor: 1
<br>power: 1
<br>add: 1
<br>detract.: 1
<br>world: 1
<br>will: 1
<br>little: 1
<br>note,: 1
<br>nor: 1
<br>remember: 1
<br>what: 2
<br>say: 1
<br>but: 1
<br>it: 1
<br>never: 1
<br>forget: 1
<br>they: 3
<br>did: 1
<br>here.: 1
<br>us: 2
<br>living,: 1
<br>rather,: 1
<br>be: 2
<br>unfinished: 1
<br>work: 1
<br>which: 2
<br>fought: 1
<br>thus: 1
<br>nobly: 1
<br>advanced.: 1
<br>rather: 1
<br>task: 1
<br>remaining: 1
<br>before: 1
<br>us—that: 1
<br>from: 2
<br>these: 2
<br>honored: 1
<br>dead: 2
<br>take: 1
<br>increased: 1
<br>devotion: 1
<br>cause: 1
<br>last: 1
<br>full: 1
<br>measure: 1
<br>devotion—that: 1
<br>highly: 1
<br>resolve: 1
<br>shall: 3
<br>died: 1
<br>vain—that: 1
<br>under: 1
<br>God,: 1
<br>birth: 1
<br>freedom—and: 1
<br>government: 1
<br>people,: 3
<br>by: 1
<br>perish: 1
<br>earth.: 1
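
The listing above counts `nation,` and `nation` separately, treats `We` and `we` as different words, and contains an empty token (the line `: 2`). This is because `split(" ")` keeps punctuation and case, and returns an empty string for a blank line. One possible clean-up, sketched here in plain Python on a short excerpt, is to lowercase the text and extract runs of letters with a regular expression. The `tokenize` helper, the regular expression and the way the excerpt is broken into lines are assumptions chosen for illustration, not part of the original notebook.

```python
import re
from collections import Counter

def tokenize(line):
    """Lowercase a line and return its words without punctuation."""
    # [a-z]+ keeps runs of letters; the optional group keeps hyphenated
    # words such as "battle-field" together.
    return re.findall(r"[a-z]+(?:-[a-z]+)*", line.lower())

sample = [
    "Now we are engaged in a great civil war, testing whether that nation,",
    "",
    "or any nation so conceived and so dedicated, can long endure.",
]

word_counts = Counter(word for line in sample for word in tokenize(line))

print(word_counts["nation"], word_counts["so"])
```

In the Spark job, a helper like this could in principle replace `line.split(" ")` inside `flatMap`. The resulting counts would then differ from the listing above.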

Please visit [Google Colab: 5.3.2 Step by Step breakdown of the Word Frequency Spark example](https://colab.research.google.com/drive/1lRf_wySa7JU3uDUldWi1X1rZ9UVFp289?usp=sharing) to run the notebook yourself.