<img style="float: left"  src="images/python.png">
<img style="float: right" src="images/surfsara.png">

<hr style="clear: both" />

# Python introduction

We present a very brief introduction to Python here before we start working with Spark.

Python is relatively easy to learn. Of course, you can do complicated things but we focus on the basics here and assume that you have seen similar constructs in other programming languages.

_You can edit the cells below and execute the code by selecting a cell and pressing Shift-Enter. Code completion is available through the Tab key._

## Variables and dynamic typing

In the cell below we assign an integer to the variable `i`. We then print the type of `i`.
By executing the cell you can see that `i` is of type _int_.

In [None]:
# Assign an integer to i
i = 123567

# Print the type of i
print type(i)
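
The "dynamic typing" in the section title means that a type belongs to a value, not to a variable name, so the same name can later refer to a value of a different type. The short sketch below illustrates this. The names `value`, `first_type` and `second_type` are made up for the illustration, and `print` is written with parentheses around a single argument so that the cell should run under both Python 2 and Python 3.

```python
# Bind the name to an integer and remember the name of its type
value = 123567
first_type = type(value).__name__

# Rebind the same name to a string; no declaration or cast is needed
value = 'now a string'
second_type = type(value).__name__

print('value was an ' + first_type + ' and is now a ' + second_type)
```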

Next we assign a string to `s` and store its length in `s_length`. We print the length in the cell after that.

In [None]:
# Assign a string to s, using single quotes (or double quotes)
s = 'abcdABCD'

# Strings have a length
s_length = len(s)

We want to print a single string that combines the text with the value of `s_length`. The `+` operator cannot concatenate a string and an integer, so we first convert `s_length` from `int` to `str` using the `str` function.

In [None]:
# We print the length of s

# The + operator concatenates strings (and lists and tuples)
print 'length of s is : ' +  str(s_length)
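
To see why the conversion with `str` matters, the sketch below first tries the concatenation without it. The variable `message` is a name introduced only for this illustration, and `s_length` is set by hand so that the cell is self-contained.

```python
s_length = 8

try:
    # Concatenating a string and an integer is not allowed
    message = 'length of s is : ' + s_length
except TypeError:
    # Converting the integer to a string first makes the concatenation valid
    message = 'length of s is : ' + str(s_length)

print(message)
```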

Strings are very similar to lists (we get to them next). They can be accessed by using the index, the position within the string. Take a look at the following code.

In [None]:
# TODO: Replace FILL IN YOUR CODE HERE with appropriate code

# The first letter of s
f = s[0]
print 'the first letter is: ' + f

# The second letter of s
print 'the second letter is: ' + s[1]

# Slice of s from offsets 1 through 2 (3 not included)
print 'the second and third letter are: ' + s[1:3]

# Print last letter of s
print 'the last letter is: ' + s[-1]

# Print the last three letters of s
print 'the last three letters are: ' + s[-3:]

# Print s, except the last letter
print # PLEASE FILL IN YOUR CODE HERE

## Lists

Lists are ordered collections, which can contain elements of any type.

In [None]:
L = [123, 'some string', 1.666]
print 'length of L is: ' +  str(len(L))
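
The index and slice notation shown earlier for strings applies to lists as well. The following is a small sketch with a hypothetical list called `letters`; indexing returns a single element, while slicing returns a new list.

```python
letters = ['a', 'b', 'c', 'd']

# Indexing gives one element
first = letters[0]

# Slicing gives a new list; the end offset is not included
middle = letters[1:3]

# Negative offsets count from the end of the list
last_two = letters[-2:]

print('first: ' + str(first))
print('middle: ' + str(middle))
print('last two: ' + str(last_two))
```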

In [None]:
# We create two lists and concatenate these using the + operator

L1 = ['a', 'b', 'c']
L2 = ['D', 'E', 'F']
L3 = L1 + L2
print L3

Lists are objects in Python. They come with a number of methods, e.g. `append` and `reverse`.
These methods are called on the _list_ object by using the `.` (dot) notation.

In [None]:
print 'L3 before reverse: ' + str(L3)
L3.reverse()
print 'L3 after reverse: ' + str(L3)

Similarly we can append an element to the end of a list by calling the `append` method.

In [None]:
print 'L3 before append: ' + str(L3)
L3.append('G')
print 'L3 after append: ' + str(L3)

Notice that `reverse` and `append` modify the _existing list_ `L3`. These methods are therefore _impure_ functions: they have side effects. In fact you call them only for their side effects, because they return nothing useful (they return `None`).
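
The difference between modifying a list in place and building a new one can be sketched as follows. The list `letters` and the other variable names are made up for this example; the slice `[::-1]` is one common way to get a reversed copy without touching the original.

```python
letters = ['a', 'b', 'c']

# A reversed copy: letters itself is left unchanged
reversed_copy = letters[::-1]

# reverse() changes letters in place and returns None
returned = letters.reverse()

print('returned by reverse(): ' + str(returned))
print('letters after reverse(): ' + str(letters))
print('reversed copy made earlier: ' + str(reversed_copy))
```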

When a list contains integers or floats we can call the `sum` function to compute the sum of its elements.

Note that there are two types of 'functions' on lists: _methods_ like `append` and `reverse`, which are called on an object with the dot notation, and _functions_ like `sum` and `len`, which take a list as an argument.

In [None]:
print sum([4,5,7,77.00, 2])

## Tuples

Tuples are very much like _lists_, except that they are _immutable_: they cannot be changed. Tuples are usually written between parentheses `(` and `)`, but depending on the context the parentheses can be omitted, which can be confusing to people new to Python.

Tuples behave very much like lists, as is shown below.

In [None]:
my_tuple = (1,2,3,4)
print 'the length of my_tuple: ' + str(len(my_tuple))
print 'the first element of my_tuple: ' + str(my_tuple[0])
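
As mentioned above, the parentheses around a tuple can sometimes be left out. The sketch below shows this with made-up names (`pair`, `also_pair`, `single`, `x`, `y`): it is the comma that makes a tuple, which is also why a one-element tuple needs a trailing comma.

```python
# Both lines create the same tuple
pair = 1, 2
also_pair = (1, 2)

# A tuple with one element needs a trailing comma
single = (1,)

# A tuple can be unpacked into separate variables
x, y = pair

print('pair equals also_pair: ' + str(pair == also_pair))
print('length of single: ' + str(len(single)))
print('x is ' + str(x) + ' and y is ' + str(y))
```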

Tuples are immutable, so we cannot replace existing elements:

In [None]:
try:
    my_tuple[2] = 9 # this will raise a TypeError
except TypeError:
    print "cannot modify tuple elements!"
    
print "my_tuple[2]: " + str(my_tuple[2])

Tuples also lack the `append` method to add elements:

In [None]:
try:
    my_tuple.append(5) # this will raise an AttributeError
except AttributeError:
    print "cannot append to a tuple!"

print "my_tuple: " + str(my_tuple)

We can build lists whose elements are tuples and select those tuples by their index.

In [None]:
# A list of tuples
tuple_list = [('a','b'), (3,4), ('Z', 42)]

# Select the first tuple in the list
print 'the first tuple is: ' + str(tuple_list[0])

In [None]:
# Select the first element of the first tuple of the list
print 'the first element of the first tuple is: ' + tuple_list[0][0]

In [None]:
# TODO: Replace FILL IN YOUR CODE HERE with appropriate code

# Print the second element of the third tuple
print # PLEASE FILL IN YOUR CODE HERE

## Dictionaries

Dictionaries are very similar to `Maps` in Java. They are unordered collections of key-value pairs. The key is used to retrieve the related value from the dictionary.

To index a dictionary by a key we use the same bracket notation as with lists, but with the key instead of the offset.

In [None]:
my_dict = {'name' : 'John',
           'age' : 29,
           'city' : 'New York'}

print my_dict['name'] + ' lives in ' + my_dict['city'] + ' and is ' + str(my_dict['age']) + ' years old.'
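
The same bracket notation is used to add or replace entries. The following is a small sketch with a hypothetical dictionary `person`; the `in` operator and the `get` method are shown as two ways to deal with a key that may be missing.

```python
person = {'name': 'John', 'age': 29}

# Assigning to a new key adds an entry
person['city'] = 'New York'

# Assigning to an existing key replaces its value
person['age'] = 30

# Check whether a key is present
has_email = 'email' in person

# get returns a default value instead of raising a KeyError
email = person.get('email', 'unknown')

print('number of entries: ' + str(len(person)))
print('has email: ' + str(has_email))
print('email: ' + email)
```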

## Functions

You can define functions in Python, similar to other languages. We assume you are familiar with the concept of a function in programming languages.

There are two basic ways to define a function. The first is by using the `def` keyword and a name for the function, followed by arguments and `:`. The keyword `return` is used to return the value.

Here we see another feature of Python: it is sensitive to indentation. In the cell below you see that the return statement is indented. Removing this indentation will result in an indentation error. The rules for indentation are very intuitive, but you must remember to abide by them.

In [None]:
# We define a function called times, which takes two integers and returns their product
def times(x, y):
    return x * y

p = times(3, 2)
print 'the product of 3 and 2 is: ' + str(p)

## Lambda functions

There is another way of defining functions, called _lambda_ or _anonymous_ functions. The term lambda comes from the field of lambda calculus, which is a branch of mathematics. Lambda functions play a large role in functional programming languages.

Both MapReduce and Spark have taken their inspiration from functional programming and hence understanding lambda functions will help you to understand MapReduce and Spark.

Lambda functions are anonymous functions, that is, functions without a name. The keyword `lambda` denotes that a function is being defined. It is followed by the arguments, a `:`, and the body, which must be a single expression. The value of that expression is returned automatically, so there is no `return` statement.

A lambda function can be assigned to a variable, which is then used as the name of the function. More often a lambda function is passed directly as an argument to another function, without ever being given a name.

Let's look at an example.

In [None]:
# This lambda function has two arguments x and y which are multiplied
#  Note that there is no return statement and that the function is assigned to the variable l_times
#  The : separates the arguments from the body of the function
l_times = lambda x,y: x * y

# Next we call the function by using the variable as a function
result = l_times(2,3)
print 'the product of 2 and 3 is: ' + str(result)
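
Because a lambda is an expression, it does not have to be assigned to a variable at all. The sketch below, using made-up names, calls a lambda immediately and passes another one to the built-in `sorted` function through its `key` argument.

```python
# Define a lambda and call it right away, without giving it a name
product = (lambda x, y: x * y)(2, 3)
print('the product of 2 and 3 is: ' + str(product))

# Pass a lambda as an argument: sort the words by their length
words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=lambda w: len(w))
print('words sorted by length: ' + str(by_length))
```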

## Lambda function exercise

Next, define a lambda function which adds two numbers. Then execute the function on the integers 7 and 9 and print the result.

In [None]:
# TODO: Replace FILL IN YOUR CODE HERE with appropriate code

# A lambda function to add two numbers
my_add = lambda # PLEASE FILL IN YOUR CODE HERE

# Add two numbers using the function just defined
result = # PLEASE FILL IN YOUR CODE HERE

print str(result)

## Map and Reduce

Functions are very important to both MapReduce and Spark. To see why let us look at a function called `map`, which is part of Python.

`Map` is a function which takes two arguments, a function and a list. `Map` applies the function (its first argument) to every element of the list (its second argument).

In the next cell we show how this works.
We define a list called `celsius` which contains a number of temperature measurements in degrees Celsius. We are going to convert this list to degrees Fahrenheit.

For this we write a function which converts a single temperature from degrees Celsius to degrees Fahrenheit. We then call `map` to apply this function to all elements of the list `celsius`.

In [None]:
# A list of temperature measurements in degrees Celsius
celsius = [39.2, 36.5, 37.3, 37.8]

# A lambda function that converts a temperature from degrees Celsius to degrees Fahrenheit
convert = lambda x: (float(9)/5)*x + 32

# By using map we apply the function to every element of the list celsius
fahrenheit = map(convert, celsius)

# We can do exactly the same by using the lambda expression inside the map statement directly
# fahrenheit = map(lambda x: (float(9)/5)*x + 32, celsius)
print fahrenheit

We do not have to use lambda functions here. We can define the convert function using `def` and use the name of the function as the first argument for map. Let's see how this works:

In [None]:
# We define the same function as in the previous cell, now using def
def convert_def(x):
    fahr = (float(9)/5)*x + 32
    return fahr

# Let's use it in map on the celsius list
fahrenheit = map(convert_def, celsius)
print fahrenheit
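
To make clear what `map` does, the sketch below compares it with an explicit loop that builds the same list. The names `mapped` and `looped` are introduced for this illustration only. The call to `list` is an addition: under Python 3 `map` returns a lazy iterator rather than a list, and wrapping it in `list` should be harmless under Python 2.

```python
celsius = [39.2, 36.5, 37.3, 37.8]
convert = lambda x: (float(9)/5)*x + 32

# Apply convert to every element with map
mapped = list(map(convert, celsius))

# The same result, written as an explicit loop
looped = []
for temperature in celsius:
    looped.append(convert(temperature))

print('map and loop give the same list: ' + str(mapped == looped))
```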

This is conceptually very similar to Hadoop's Map as in MapReduce. Unlike in Hadoop, however, the Python version is not executed in parallel.

Next, let us look at Reduce. Python also has a function called `reduce` which takes as its first argument a function, and as a second argument a list. The function has to have two arguments.

`Reduce` will then apply the function to the first two elements of the list and use this result together with the next item in the list to compute the next step. This procedure is repeated until the entire list is traversed.

For example, suppose we want to add up all elements of the list `[47, 11, 42, 13]` then we write a function which will add up two integers, and we will call `reduce`. `Reduce` will then proceed as depicted in the picture below:

![python reduce](images/reduce.png)
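
The step-by-step procedure can also be sketched in code. To leave the exercise below open, this example multiplies the elements of a hypothetical list `numbers` instead of adding them. The import from `functools` is an addition: it is required under Python 3 and should also work under Python 2.6 and later.

```python
from functools import reduce

numbers = [1, 2, 3, 4]
multiply = lambda x, y: x * y

# reduce combines the elements pairwise, from left to right
product = reduce(multiply, numbers)

# The same computation spelled out, keeping the intermediate results
running = numbers[0]
steps = []
for n in numbers[1:]:
    running = multiply(running, n)
    steps.append(running)

print('intermediate results: ' + str(steps))
print('final result: ' + str(product))
```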

Note that this function `reduce` resembles Hadoop's Reduce. In Hadoop you have to supply Mapper and Reducer classes as part of a program; in Python you supply a function.

Spark works similarly to Python in this regard. You write functions that you feed into other functions, which will process the data for you.

As a final exercise, before we move to Spark, we are going to compute the mean of the Fahrenheit list, using reduce.

In [None]:
# TODO: Replace FILL IN YOUR CODE HERE with appropriate code

# First write a lambda function which adds up two elements
sum_up = # PLEASE FILL IN YOUR CODE HERE

# Use reduce to sum up the elements in fahrenheit
total = reduce(# PLEASE FILL IN YOUR CODE HERE)

# Divide total by the length of the list fahrenheit
# Use the division operator /
mean = total / len(fahrenheit)
print 'the mean temperature is: ' + str(mean)

In [None]:
# Run this cell to test if you have the right value for mean
from test_helper import Test
import numpy as np

Test.assertTrue(np.allclose(mean, 99.86), 'Wrong value for mean')

## Spark teaser

To show you how similar the operations in Spark are to the functional Python version, we will give you a teaser on how to do the Celsius to Fahrenheit conversion in Spark. For such a small example, of course, the overhead outweighs the benefit of parallel processing.

Instead of working on Python lists we will do our processing on Spark's data structure for collections, called _RDDs_. These will be explained later today. Comparing the following code to the earlier version we note:

- The workflow is exactly the same.
- We can even re-use the lambda functions.
- But `map` and `reduce` are methods of RDD instead of functions.
- Instead of the Python `len` function we use the `count` method.

In [None]:
# Initialize Spark
from pyspark import SparkContext, SparkConf

if 'sc' not in globals(): # This 'trick' makes sure the SparkContext sc is initialized exactly once
    conf = SparkConf().setMaster('local[*]')
    sc = SparkContext(conf=conf)

In [None]:
# Distribute the celsius list
celsius_rdd = sc.parallelize(celsius)

# This part runs in parallel
fahrenheit_rdd = celsius_rdd.map(convert)
print 'degrees in Fahrenheit: ' + str(fahrenheit_rdd.collect())
count = fahrenheit_rdd.count()
mean = fahrenheit_rdd.reduce(sum_up) / count

print 'the mean temperature is: ' + str(mean)