![](images/python.png)
<img style="float: right" src="images/surfsara.png">

# Python introduction

We present a very brief introduction to Python here before we work with HDFS and Spark.

Python is relatively easy to learn. Of course, you can do complicated things but we focus on the basics here and assume that you have seen similar constructs in other programming languages.

_You can edit the cells below and execute the code by selecting the cell and pressing Shift-Enter. Code completion is supported by use of the Tab key._

## Variables and dynamic typing

In the cell below we assign an integer to the variable `i` and then print the type of `i`. `print` converts its argument to a string automatically, but we call `str` explicitly here to show how a value, in this case `type(i)`, which is of type _type_, is converted to a _string_ object.

You can see by executing the cell that `i` is of type _int_.

Next we assign a string to `s` and determine its length. Then we print the length of `s`.

In [None]:
# Assign an integer to i
i = 123567

# print the type of i
print str(type(i))

In [None]:
# Assign a string to s, using single quotes
s = 'abcdABCD'

# Strings have a length
s_length = len(s)

# We print the length of s
# s_length is an int so we cast s_length to a string by using str

# The + operator concatenates strings (and lists and tuples)

print 'length of s is : ' +  str(s_length)
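
The heading above mentions dynamic typing; the following minimal sketch illustrates what that means. A name is not tied to one type and simply refers to whatever object was assigned to it last. The variable names are illustrative and not part of the original notebook, and `print(...)` is written with a single argument so that the cell runs under both Python 2 and Python 3.

```python
# The same name can refer to objects of different types over time
x = 42
first_type = type(x).__name__     # 'int'

x = 'forty-two'
second_type = type(x).__name__    # 'str'

x = 4.2
third_type = type(x).__name__     # 'float'

print(first_type + ', ' + second_type + ', ' + third_type)
```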

Strings are very similar to lists (we get to them next). They can be accessed by using the index, the position within the string. Take a look at the following code.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# The first item of s
f = s[0]
print f

# The second item of s
print s[1]

# Slice of s from offsets 1 through 2 (3 not included)
print s[1:3]

# print last letter of s
print s[-1]

# print s, except the last letter
print <FILL IN>

## Lists

Lists are ordered collections, which can contain elements of any type.

In [None]:
L = [123, 'some string', 1.666]
print 'length of L is: ' +  str(len(L))

In [None]:
# create two lists and concatenate these using the + operator

L1 = ['a', 'b', 'c']
L2 = ['D', 'E', 'F']
L3 = L1 + L2
print L3

(Of course you can use + with strings as well:)

In [None]:
print 'abc' + 'def'

Lists are objects in Python. They come with a number of methods, e.g. `append` and `reverse`.
These methods are called on the _list_ object by using the `.` (dot) notation.

In [None]:
print L3
L3.reverse()
print L3

Similarly we can append an element to the end of a list by calling the `append` method.

In [None]:
print L3
L3.append('G')
print L3

When a list contains integers or floats we can call the `sum` function to compute the sum of its elements.

Note that there are two types of 'functions' on lists: methods like `append` and `reverse`, which are called on the object with the dot notation, and functions like `sum` and `len`, which take the list as an argument.

In [None]:
print sum([4,5,7,77.00, 2])
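
To make the difference between methods and functions concrete, here is a small sketch. The list `numbers` is an illustrative name, not part of the original notebook. The methods change the list in place, while the functions return a new value and leave the list untouched.

```python
numbers = [4, 5, 7, 2]

# Methods are called on the list object and modify it in place
numbers.append(10)        # numbers is now [4, 5, 7, 2, 10]
numbers.reverse()         # numbers is now [10, 2, 7, 5, 4]

# Functions take the list as an argument and return a value
count = len(numbers)      # 5
total = sum(numbers)      # 28

print(numbers)
print('count: ' + str(count) + ', total: ' + str(total))
```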

## Tuples

Tuples are very much like _lists_, except that they are immutable: they cannot be changed. Tuples are usually written between parentheses `(` and `)`. Depending on the context the parentheses can be omitted, which can sometimes be confusing to people new to Python.

Tuples behave very much like lists, as is shown below.

In [None]:
T = (1,2,3,4)
print len(T)
print 'The first element of T : ' + str(T[0])
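
The sketch below illustrates the two properties of tuples mentioned earlier: they cannot be changed, and the parentheses are optional in some contexts. The names `changed`, `point`, `x` and `y` are illustrative only.

```python
T = (1, 2, 3, 4)

# Assigning to an element of a tuple raises a TypeError
try:
    T[0] = 99
    changed = True
except TypeError:
    changed = False

# The parentheses can be left out when creating a tuple
point = 3, 4

# A tuple can be unpacked into separate variables
x, y = point

print(changed)
print(point)
print(x + y)
```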

We can now build lists of tuples and select them by their index.

In [None]:
# A list of tuples
LT = [('a','b'), (3,4), ('Z', 'Z')]

# Select the first tuple in the list
print LT[0]

In [None]:
# select the first element of the first tuple of the list
print LT[0][0]

## Dictionaries

Dictionaries are very similar to a `Map` in Java. They are unordered collections of key-value pairs. The key is used to retrieve the related value from the dictionary.

To index a dictionary by a key we use the same bracket notation as with lists, but here the key is used instead of the offset.

In [None]:
my_dictionary = {'name' : 'john', 'age' : 29, 'city' : 'New York'}
print my_dictionary['name']
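
A few more common dictionary operations, as a hedged sketch. The dictionary `person` and the key `'email'` are illustrative and not part of the original notebook. The example adds and overwrites entries, tests whether a key is present, and loops over the keys in sorted order, since the dictionary itself does not promise any order here.

```python
person = {'name': 'john', 'age': 29, 'city': 'New York'}

# Assigning to a key adds a new pair or overwrites an existing one
person['country'] = 'USA'
person['age'] = 30

# The 'in' operator tests whether a key is present
has_email = 'email' in person

# get returns a default value instead of raising a KeyError
email = person.get('email', 'unknown')

# Loop over the keys in sorted order
for key in sorted(person):
    print(key + ' -> ' + str(person[key]))
```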

## Functions

You can define functions in Python, similar to other languages. We assume you are familiar with the concept of a function in programming languages.

There are two basic ways to define a function. The first is by using the `def` keyword and a name for the function, followed by arguments and `:`. The keyword `return` is used to return the value.

Here we see another feature of Python: it is sensitive to indentation. In the cell below, the `return` statement is indented. Removing this indentation will result in an indentation error. The rules for indentation are intuitive, but you must remember to abide by them.

In [None]:
# We define a function called times, which takes two integers and returns their product

def times(x, y):
    return x * y

p = times(3, 2)
print str(p)

In the example below, we define the function intersect, which identifies the common elements of two lists.
The function contains a `for` loop, and an `if` statement, which we haven't seen before, but should not surprise you too much.

Note that a function can contain multiple statements, in this case an assignment, a `for` loop and an `if` statement.

Also note the indentation in the code. Each line ending in `:` is followed by a block with one additional level of indentation. The indentation shows the scope of the statements. Again, if you change this indentation you will get an error message.

Walking through this example:

- After the definition a new empty list `res` is defined.
- Next, the `for` loop checks each element of the list `seq1`, which is the first argument of the function.
- If this element is also in the list `seq2` (the second argument) it is added to the list `res`.
- After the loop has ended, the list `res` is returned.

In [None]:
def intersect(seq1, seq2):
    res = []                     # Start empty
    for x in seq1:               # Scan seq1
        if x in seq2:            # Common item?
            res.append(x)        # Add to end
    return res
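
The cell above only defines `intersect`. As a short usage sketch, the function is repeated here so the example is self-contained, and then called on two lists and on two strings (the `in` operator and the `for` loop work on any sequence). The input values are illustrative.

```python
def intersect(seq1, seq2):
    res = []                     # Start empty
    for x in seq1:               # Scan seq1
        if x in seq2:            # Common item?
            res.append(x)        # Add to end
    return res

# Two lists: the common elements are returned in the order of the first list
common = intersect([1, 2, 3, 4], [3, 4, 5])
print(common)

# Strings are sequences too, so the same function works on them
letters = intersect('spam', 'scam')
print(letters)
```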

## Lambda functions

There is another way of defining functions, called lambda functions. The term lambda comes from the field of lambda calculus, which is a branch of mathematics. Lambda functions play a large role in functional programming languages.

Both MapReduce and Spark have taken their inspiration from functional programming and hence understanding lambda functions will help you to understand MapReduce and Spark.

Lambda functions are anonymous functions, that is, functions without a name. The keyword `lambda` simply denotes that a function is defined. What follows is the body of the function, which must be a single expression; there is no `return` statement, because the value of the expression is returned automatically.

Finally, a lambda function can be assigned to a variable, which is then used as the name of the function. A function defined with `def` is bound to its name by the `def` statement itself, although it too can be assigned to other variables afterwards.

Let's look at an example.

In [None]:
# This lambda function has two arguments x and y which are multiplied
#  Note that there is no return statement and that the function is assigned to the variable l_times
#  The : separates the arguments from the body of the function

l_times = lambda x,y: x * y

#Next we call the function by using the variable as a function

result = l_times(2,3)
print str(result)
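
Functions are ordinary objects in Python, whether they are created with `def` or with `lambda`. The sketch below shows a `def` function being bound to a second name, and a lambda being passed directly to another function without ever getting a name. The names `multiply`, `words` and `by_length` are illustrative; `sorted` and its `key` parameter are part of standard Python.

```python
def times(x, y):
    return x * y

# A function defined with def can be bound to another name as well
multiply = times
product = multiply(4, 5)
print(product)

# A lambda is often passed directly as an argument, without a name.
# Here it tells sorted to order the words by their length.
words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=lambda w: len(w))
print(by_length)
```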

## Lambda function exercise

Next, define a lambda function which adds two numbers. Then execute the function on the integers 7 and 9 and print the result.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# A lambda function to add two numbers

my_add = lambda <FILL IN>

# add two numbers using the function just defined
result = <FILL IN>

print str(result)

## MapReduce

Functions are very important to both MapReduce and Spark. To see why let us look at a function called `map`, which is part of Python.

`map` is a function which takes two arguments, a function and a list. It applies the function (its first argument) to every element of the list (its second argument).

In the next cell we show how this works.
We define a list, called `celsius` which contains a number of temperature measures in degrees Celsius. We are going to convert this list to degrees Fahrenheit.

For this we write a function which will convert a single degree Celsius into Fahrenheit.  We then call `map` to apply this function to all elements of the list `celsius`.

In [None]:
# A list of temperature measures in degrees Celsius
celsius = [39.2, 36.5, 37.3, 37.8]

# A lambda function defining the conversion from a degree in Celsius to one in Fahrenheit
convert = lambda x: (float(9)/5)*x + 32

# by using map we apply the function to every element of the list celsius
fahrenheit = map(convert, celsius)

# We can do exactly the same by using the lambda expression inside the map statement directly:
# fahrenheit = map(lambda x: (float(9)/5)*x + 32, celsius)

print fahrenheit

We do not have to use lambda functions here. We can define the convert function using `def` and use the name of the function as the first argument for `map`. Let's see how this works:

In [None]:
# We define the same function as in the previous cell, now using def

def convert_def(x):
    fahr = (float(9)/5)*x + 32
    return fahr

# Let's use it in map on the celsius list

fahrenheit = map(convert_def, celsius)
print fahrenheit
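
To see what `map` does internally, the sketch below writes the same conversion out as an explicit loop and compares the two results. The name `by_loop` is illustrative. The call to `list` is an addition: it changes nothing under Python 2, where `map` already returns a list, and is needed under Python 3, where `map` returns an iterator.

```python
celsius = [39.2, 36.5, 37.3, 37.8]
convert = lambda x: (float(9) / 5) * x + 32

# What map does, written out as a loop
by_loop = []
for c in celsius:
    by_loop.append(convert(c))

# The same computation with map
by_map = list(map(convert, celsius))

print(by_loop == by_map)
```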

This is conceptually very similar to Hadoop's Map as in MapReduce. Unlike in Hadoop, however, the Python version is not executed in parallel.

Next, let us look at Reduce. Python also has a function called `reduce` which takes as its first argument a function, and as a second argument a list. The function has to have two arguments.

`reduce` will then apply the function to the first two elements of the list and use this result together with the next item in the list to compute the next step. This procedure is repeated until the entire list is traversed.

For example, suppose we want to add up all elements of the list [47, 11, 42, 13]. We write a function which adds up two integers and pass it to `reduce`, which will then proceed as depicted in the picture below:

![python reduce](images/reduce_diagram.png)
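
The stepwise procedure can also be sketched in code. To leave the addition for the exercise further on, this example finds the smallest element instead. The helper `my_reduce` is a hypothetical, simplified re-implementation written only to show the steps; it is not how Python implements `reduce`. The import from `functools` is an addition that works in Python 2.6 and later and is required in Python 3, where `reduce` is no longer a built-in.

```python
from functools import reduce

numbers = [47, 11, 42, 13]

def smaller(a, b):
    # Return the smaller of two values
    return a if a < b else b

def my_reduce(func, seq):
    # Hypothetical helper: start with the first element, then combine
    # the running result with each following element in turn
    result = seq[0]
    for item in seq[1:]:
        result = func(result, item)
    return result

# Steps: smaller(47, 11) -> 11, smaller(11, 42) -> 11, smaller(11, 13) -> 11
smallest = reduce(smaller, numbers)
print(smallest)
print(my_reduce(smaller, numbers) == smallest)
```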

Note that this function `reduce` resembles Hadoop's Reduce. In Hadoop you have to supply a program with Mapper and Reducer classes; in Python it is a function.

Spark works similarly in this regard. You have to write functions that you feed into other functions, which will process data for you.

As a final exercise, before we move to Spark, we are going to compute the mean of the Fahrenheit list, using reduce.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# First write a lambda function which adds up two elements

sum_up = lambda <FILL IN>

# Use reduce to sum up the elements in fahrenheit

total = reduce(<FILL IN>)

# divide total by the length of the list fahrenheit
# Use the division operator /

mean = <FILL IN>
print mean

In [None]:
# Run this cell to test if you have the right value for mean
from test_helper import Test
import numpy as np

Test.assertTrue(np.allclose(mean, 99.86), 'Wrong value for mean')