<img style="height: 20px;"  src="images/python.png">
<img style="height: 20px;" src="images/surfsara.png">

<hr style="clear: both" />

# Python introduction

We present a very brief introduction to Python here, based on the very useful Python introduction given by SurfSara (https://github.com/sara-nl/jupyter-bigdata-notebooks). 

To use a notebook, open it in your browser on a (local or remote) IPython server. A notebook consists of cells, each of which contains Python code, the output of that code, or text instructions. You can integrate your own notes into the notebook by adding new cells; see the documentation on how to work with notebooks (https://ipython.org/ipython-doc/3/notebook/notebook.html#notebook-user-interface).

_Most importantly: you can edit the cells below and execute the code by selecting a cell and pressing Shift-Enter. Code completion is available via the Tab key._

Python is relatively easy to learn. Of course, you can do complicated things but we focus on the basics here and assume that you have seen similar constructs in other programming languages.

## Variables and dynamic typing

In the cell below we assign an integer to the variable `i`. We then print the type of `i`.
You can see by executing the cell that `i` is of type _int_.

In [1]:
# Assign an integer to i
i = 123567

# Print the type of i
print('the type of variable i is: ')
print(type(i))

the type of variable i is: 
<class 'int'>
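To illustrate the "dynamic typing" part of this section, here is a minimal sketch showing that a variable is only a name: it can be rebound to a value of a different type at any time. The names `value`, `type_before` and `type_after` are made up for this illustration.

```python
# A variable is just a name; the type belongs to the value it refers to
value = 123567
type_before = type(value).__name__

# Rebinding the same name to a string is perfectly legal
value = 'now a string'
type_after = type(value).__name__

print('before: ' + type_before + ', after: ' + type_after)
```

No type declaration is needed in either assignment; Python determines the type from the value at run time.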


Next we assign a string to `s` and determine its length. Then we print the length of `s`.

In [1]:
# Assign s to a string, using single quotes (or double quotes)
s = 'abcdABCD'

# Strings have a length
s_length = len(s)

The `print` function can take several arguments, but here we build a single string from the text and the variable `s_length`. For the concatenation variant we need to convert `s_length` from `int` to `str` using the `str` function.

In [2]:
# Three ways to print the length of s

# Pythonic way: each {} is replaced by the corresponding argument of format
print('the length of variable s is : {}'.format(s_length))

# C syntax, the %d, %s are substituted for the parameters of %
print('the length of variable s is : %d'%(s_length))

# Via string concatenation (+ concatenates strings, lists, tuples)
print('the length of variable s is : ' +  str(s_length))


the length of variable s is : 8
the length of variable s is : 8
the length of variable s is : 8
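As a side note to the three variants above, `print` also accepts several arguments and converts each of them to a string itself. The sketch below repeats the definition of `s` so that it runs on its own; the name `line` is only used here to show the text that ends up on the screen.

```python
s = 'abcdABCD'
s_length = len(s)

# print converts every argument to a string and joins them with a space
print('the length of variable s is :', s_length)

# The same text built explicitly, to show what print produces
line = ' '.join(['the length of variable s is :', str(s_length)])
print(line)
```

Both statements print the same line, so no explicit call to `str` is needed when the values are passed as separate arguments.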


In Python, a string behaves like an immutable sequence of characters, and a list is similar to an ArrayList in Java (in the sense that lists have a flexible size). Python has no built-in equivalent to Java's fixed-length array but rather uses lists for all mutable ordered collections. Both strings and lists support the `[]` syntax to address positions and slices, similar to the way array elements are accessed by index in Java. Take a look at the following code.

In [3]:
# The first letter of s
f = s[0]
print('the first letter is: ' + f)

# The second letter of s
print('the second letter is: ' + s[1])

# Slice of s from offsets 1 through 2 (3 not included)
print('the second and third letter are: ' + s[1:3])

# Print last letter of s
print('the last letter is: ' + s[-1])

# Print the last three letters of s
print('the last three letters are: ' + s[-3:])


the first letter is: a
the second letter is: b
the second and third letter are: bc
the last letter is: D
the last three letters are: BCD
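Indexing and slicing work the same way on lists, with one difference worth seeing: a list can be changed in place, a string cannot. The following is a small sketch; the names `letters`, `text` and `changed` are chosen for this illustration only.

```python
# Turn the string into a real list of one-character strings
letters = list('abcdABCD')
first_three = letters[:3]     # slicing a list gives a new list
letters[0] = 'z'              # lists can be modified in place

text = 'abcdABCD'
try:
    text[0] = 'z'             # strings do not support item assignment
    changed = True
except TypeError:
    changed = False

print(first_three, letters[0], changed)
```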


In [4]:
# TODO: Replace <FILL IN> with appropriate code

# Print s, except the last letter
print(<FILL IN>)

SyntaxError: invalid syntax (<ipython-input-4-cb53e5c79e87>, line 4)

## Lists

Lists are ordered collections, which can contain elements of any type.

In [5]:
L = [123, 'some string', 1.666]

print('the length of L is: ' +  str(len(L)))

the length of L is: 3


In [6]:
# We create two lists and concatenate these using the + operator
L1 = ['a', 'b', 'c']
L2 = ['D', 'E', 'F']
L3 = L1 + L2

print(L3)

['a', 'b', 'c', 'D', 'E', 'F']


Lists are objects in Python. They come with a number of methods, e.g. `append` and `reverse`.
These methods are called on the _list_ object by using the `.` (dot) notation.

In [7]:
print('L3 before reverse: ' + str(L3))

L3.reverse()

print('L3 after reverse: ' + str(L3))

L3 before reverse: ['a', 'b', 'c', 'D', 'E', 'F']
L3 after reverse: ['F', 'E', 'D', 'c', 'b', 'a']


Similarly we can append an element to the end of a list by calling the `append` method.

In [8]:
print('L3 before append: ' + str(L3))

L3.append('G')

print('L3 after append: ' + str(L3))

L3 before append: ['F', 'E', 'D', 'c', 'b', 'a']
L3 after append: ['F', 'E', 'D', 'c', 'b', 'a', 'G']


Notice that `reverse` and `append` **modify** the _existing list_ `L3`. This means these methods are _impure_ functions: they have side effects. In fact, you invoke them only for their side effects, because they return `None`!
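The sketch below makes the side effect visible, using a small list called `numbers` (a name made up for this example). It also shows the built-in `reversed`, which leaves its argument untouched and can be used when the original list must be kept.

```python
numbers = [3, 1, 2]

# reverse() changes the list in place and returns None
returned = numbers.reverse()

# reversed() is a built-in function that does not modify its argument
copy_reversed = list(reversed(numbers))

print(returned)
print(numbers)
print(copy_reversed)
```

A common mistake is to write `numbers = numbers.reverse()`, which replaces the list by `None`.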

When a list contains integers or floats we can call the `sum` function to compute the sum of its elements.

Note that there are two types of 'functions' on lists: _methods_ like `append` and `reverse`, which are called on objects with the dot notation, and _functions_ like `sum` and `len`, which take lists as arguments.

In [9]:
print(sum([4,5,7,77.00, 2]))

95.0


## Tuples

Tuples are very much like _lists_, except that they are _immutable_: they cannot be changed once created. Tuples are usually written between parentheses `(` and `)`. This can sometimes be confusing to people new to Python because, depending on the context, the parentheses can be omitted.

Tuples behave very much like lists, as is shown below.

In [10]:
my_tuple = (1,2,3,4)

print('the length of my_tuple: ' + str( len(my_tuple) ) )
print('the first element of my_tuple: ' + str( my_tuple[0] ) )

the length of my_tuple: 4
the first element of my_tuple: 1


Tuples are immutable, so we cannot replace existing elements:

In [11]:
try:
    my_tuple[2] = 9 # this will raise a TypeError
except TypeError:
    print("cannot modify tuple elements!")
    
print("my_tuple[2]: " + str( my_tuple[2] ) )

cannot modify tuple elements!
my_tuple[2]: 3


Tuples also lack the `append` method to add elements:

In [12]:
try:
    my_tuple.append(5) # this will raise an AttributeError
except AttributeError:
    print("cannot append to a tuple!")

print("my_tuple: " + str(my_tuple) )

cannot append to a tuple!
my_tuple: (1, 2, 3, 4)
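Coming back to the remark that the parentheses of a tuple can sometimes be omitted, here is a minimal sketch. The names `point`, `single` and `not_a_tuple` are made up for the illustration.

```python
# The comma makes the tuple; the parentheses are optional here
point = 3, 4

# Unpacking assigns the elements of a tuple to separate variables
x, y = point

# A one-element tuple needs a trailing comma
single = (5,)

# Without the comma the parentheses only group an expression
not_a_tuple = (5)

print(type(point), x, y)
print(type(single), type(not_a_tuple))
```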


We can build lists whose elements are tuples and select those tuples by their index.

In [13]:
# A list of tuples
tuple_list = [('a','b'), (3,4), ('Z', 42)]

# Select the first tuple in the list
print('the first tuple is: ' + str(tuple_list[0]))

the first tuple is: ('a', 'b')


In [14]:
# Select the first element of the first tuple of the list
print('the first element of the first tuple is: ' + tuple_list[0][0] )

the first element of the first tuple is: a


In [None]:
# TODO: Replace <FILL IN> with appropriate code

# Print the second element of the third tuple
print(<FILL IN>)

## Dictionaries

Dictionaries are very similar to `Map`s in Java. They are collections of key-value pairs whose order you should not rely on (only from Python 3.7 on do dictionaries preserve insertion order). The key is used to retrieve the related value from the dictionary.

To index a dictionary by a key we use the same bracket notation as with lists, but with the key instead of the offset.

In [15]:
my_dict = {'name' : 'John',
           'age' : 29}

my_dict['city'] = 'New York'

print(my_dict['name'] + ' lives in ' + my_dict['city'] + ' and is ' + str(my_dict['age']) + ' years old.')

John lives in New York and is 29 years old.
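A few more everyday dictionary operations are sketched below, using a separate dictionary called `person` (a hypothetical name) so that `my_dict` stays untouched.

```python
person = {'name': 'John', 'age': 29}

# Test whether a key is present
has_city = 'city' in person

# get() returns a default instead of raising a KeyError for a missing key
city = person.get('city', 'unknown')

# Assigning to an existing key replaces its value
person['age'] = 30

# items() yields (key, value) tuples; sorting makes the order predictable
pairs = sorted(person.items())

print(has_city, city)
print(pairs)
```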


## Functions

You can define functions in Python, similar to other languages. We assume you are familiar with the concept of a function in programming languages.

There are two basic ways to define a function. The first is by using the `def` keyword and a name for the function, followed by arguments and `:`. The keyword `return` is used to return the value.

Here we see another feature of Python: code blocks are indicated by indentation (where Java uses `{` and `}`). A code block starts with an indented statement and continues until the indentation returns to the level before the start of the block. It is common practice to indent by 4 spaces per level (editors such as the notebook typically expand a `tab` to 4 spaces). You have to use indentation consistently, and you cannot use it freely just to lay out your code!

In [16]:
# We define a function called times, which takes two integers and returns their product
def times(x, y):
    return x * y

p = times(3, 2)

print('the product of 3 and 2 is: ' + str(p))

the product of 3 and 2 is: 6
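To make the indentation rule more concrete, the following sketch nests an `if` block inside a function body. The function `describe` is a made-up example and not part of the original notebook.

```python
def describe(n):
    # The body of the function is indented one level
    if n % 2 == 0:
        # The body of the if is indented one level further
        kind = 'even'
    else:
        kind = 'odd'
    # Back at function level: this line runs for both branches
    return str(n) + ' is ' + kind

# Back at the outermost level: the function definition has ended
print(describe(6))
print(describe(7))
```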


## Lambda functions

There is another way of defining functions, called _lambda_ or _anonymous_ functions. The term lambda comes from the field of lambda calculus, which is a branch of mathematics. Lambda functions play a large role in functional programming languages.

Both MapReduce and Spark have taken their inspiration from functional programming and hence understanding lambda functions will help you to understand MapReduce and Spark, and to write very efficient code.

Lambda functions are anonymous functions, that is functions without a name. The keyword `lambda` simply denotes that a function is defined, by stating a parameter list, a colon, and a single expression (without return).

Finally, both lambda functions and normally defined functions (using `def`) can be assigned to variables. Such a variable can be passed as a parameter to another function, and it can also be called just like the function itself.

Let's look at an example.

In [17]:
# This lambda function has two arguments x and y which are multiplied
#  Note that there is no return statement and that the function is assigned to the variable l_times
#  The : separates the arguments from the body of the function
l_times = lambda x,y: x * y

# Next we call the function by using the variable as a function
result = l_times(2,3)

print('the product of 2 and 3 is: ' + str(result))

the product of 2 and 3 is: 6
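Passing a function as a parameter to another function can be sketched as follows. The helper `apply_twice` and the functions `add_three` and `double` are hypothetical names introduced only for this example.

```python
def apply_twice(f, x):
    # f is an ordinary parameter that happens to hold a function
    return f(f(x))

def add_three(n):
    return n + 3

double = lambda n: n * 2

# A def function, a lambda stored in a variable, and an inline lambda
r1 = apply_twice(add_three, 10)
r2 = apply_twice(double, 10)
r3 = apply_twice(lambda n: n - 1, 10)

print(r1, r2, r3)
```

Note that the functions are passed by name, without parentheses; writing `add_three()` would call the function instead of passing it.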


## Lambda function exercise

Next, define a lambda function which adds two numbers. Then execute the function on the integers 7 and 9 and print the result.

In [None]:
# TODO: Replace <FILL IN> with appropriate code

# A lambda function to add two numbers
my_add = lambda <FILL IN>

# Add two numbers using the function just defined
result = <FILL IN>

print(str(result))

## Map and Reduce

Map and Reduce are essential functions found in _functional languages_ (e.g. Haskell, Scala) and also available in Python. In distributed frameworks (e.g. Hadoop, Spark) Map and Reduce are commonly used to transparently speed up the processing by using a cluster of computers. We will look at distributed processing in the next lessons, but first introduce the essentials of Map and Reduce in Python.

`Map` is a function that executes another function on all elements in a collection. The `map` function takes two arguments: a function and a collection. `Map` applies the function (its first argument) to every element of the collection (its second argument). For example, `mapping` the function `squared` to the collection `[2, 4, 7]` results in `[squared(2), squared(4), squared(7)]`. 

In the next cell we show how this works.
We define a list called `celsius` which contains a number of temperature measures in degrees Celsius. We are going to convert this list to degrees Fahrenheit.

For this we write a function which will convert a single degree Celsius into Fahrenheit.  We then call `map` to apply this function to all elements of the list `celsius`.

In [18]:
# A list of temperature measures in degrees Celsius
celsius = [39.2, 36.5, 37.3, 37.8]

# A lambda function defining the conversion from a degree in Celsius to one in Fahrenheit
convert = lambda x: (float(9)/5)*x + 32

# By using map we apply the function to every element of the list celsius
fahrenheit = list(map(convert, celsius))

# We can do exactly the same by using the lambda expression inside the map statement directly
# fahrenheit = list(map(lambda x: (float(9)/5)*x + 32, celsius))
print(fahrenheit)

[102.56, 97.7, 99.14, 100.03999999999999]


We do not have to use lambda functions here. We can define the convert function using `def` and use the name of the function as the first argument for map. Let's see how this works:

In [19]:
# We define the same function as in the previous cell, now using def
def convert_def(x):
    fahr = (float(9)/5)*x + 32
    return fahr

# Let's use it in map on the celsius list
fahrenheit = list( map(convert_def, celsius) )
print(fahrenheit)

[102.56, 97.7, 99.14, 100.03999999999999]


**Important**: notice that we use the `list` function to convert the result of `map` into a list. Python supports **lazy evaluation** (which we will talk about more), which means that a result is only computed when needed, saving time and memory when not all results are needed. Consider the difference between the next two statements:

In [20]:
print(map(convert_def, celsius))
print(list(map(convert_def, celsius)))

<map object at 0x1050ffd30>
[102.56, 97.7, 99.14, 100.03999999999999]


The `map` function does not return a list, but rather a lazy _iterator_ (it behaves much like a _generator_) that can be iterated over to obtain the results. The `list` function iterates over it in order to create the list, thus requesting the evaluation of every element. Lazy evaluation can save time and memory when we do not need the entire result set but only want to use the first few results. More importantly, for parallel processing, lazy evaluation allows Spark to first define a complex processing pipeline, which is optimized (by Spark) before it is executed.
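The lazy behaviour can be observed directly. In the sketch below, the conversion function records every value it is called with in a list named `calls`; both `calls` and `convert_logged` are names invented for this illustration.

```python
calls = []

def convert_logged(x):
    # Remember that the function was really executed for x
    calls.append(x)
    return (9 / 5) * x + 32

# Creating the map object does not call the function yet
lazy = map(convert_logged, [39.2, 36.5, 37.3, 37.8])
calls_before = len(calls)

# Asking for one result computes exactly one element
first = next(lazy)
calls_after = len(calls)

# list() consumes the remaining three elements
rest = list(lazy)

print(calls_before, calls_after, len(rest))
```

Only when a result is requested, with `next` or `list`, is the function actually applied.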

Next, let us look at **Reduce**. Python also has a function called `reduce` (in Python 3 it lives in the `functools` module), which takes a function as its first argument and a list as its second argument. The function has to take two arguments.

`Reduce` will then apply the function to the first two elements of the list and use this result together with the next item in the list to compute the next step. This procedure is repeated until the entire list is traversed.

For example, suppose we want to add up all elements of the list `[47, 11, 42, 13]`. We then write a function that adds up two integers and pass it to `reduce`. `Reduce` will then proceed as depicted in the picture below:

![python reduce](images/reduce.png)
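The step-by-step procedure can also be written out as a plain loop. The sketch below is a hand-written imitation of `reduce` that records every intermediate result; it is meant as an illustration, not as the way `functools.reduce` is implemented. To leave the summing for the exercise, it uses a different function, `smaller`, which keeps the smaller of two numbers. The names `reduce_with_trace` and `smaller` are made up for this example.

```python
from functools import reduce

def reduce_with_trace(func, items):
    """Mimic reduce with a loop and record every intermediate result."""
    iterator = iter(items)
    result = next(iterator)            # start with the first element
    trace = [result]
    for item in iterator:
        result = func(result, item)    # combine running result and next item
        trace.append(result)
    return result, trace

# A function of two arguments, as reduce requires
smaller = lambda a, b: a if a < b else b

numbers = [47, 11, 42, 13]
smallest, steps = reduce_with_trace(smaller, numbers)

print(steps)
print(smallest == reduce(smaller, numbers))
```

Each entry in `steps` is the running result after one more list element has been combined, which is the same chain of calls the picture shows for addition.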

As a final exercise, before we move to Spark, we are going to compute the mean of the Fahrenheit list, using reduce.

In [None]:
# TODO: Replace <FILL IN> with appropriate code
from functools import reduce

# First write a lambda function which adds up two elements
sum_up = <FILL IN>

# Use reduce to sum up the elements in fahrenheit
total = reduce(<FILL IN>)

# Divide total by the length of the list fahrenheit
# Use the division operator /
mean = total / len(fahrenheit)
print('the mean temperature is: ' + str(mean))

## Parallel Processing

The `map` and `reduce` functions are essential to parallel processing, performing the same function on multiple processors that each hold a part of the data (e.g. a part of a list of elements). Additionally, since a program is often tiny compared to the volume of the data to be processed, sending the program to the data is often much more efficient than sending the data to a program. By itself, Python does not process data in parallel (unless you code it yourself), but when Python is used with Spark, parallel processing is applied automatically and transparently, as we will see in the upcoming workbooks.
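To give an impression of what "coding it yourself" could look like, here is a rough sketch that splits a list into parts, lets a pool of worker threads apply the same function to each part, and then reduces the partial results. It uses `ThreadPoolExecutor` from the standard library only to show the structure; threads in standard Python usually do not speed up pure computations, and this is not how Spark distributes work. The names `chunks`, `sum_of_squares` and `partial_results` are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

data = list(range(1, 101))

# Split the data into 4 parts of 25 elements each
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]

def sum_of_squares(chunk):
    # Map: square every element; reduce: add the squares up
    return reduce(lambda a, b: a + b, map(lambda x: x * x, chunk))

# Every worker processes its own part of the data with the same function
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_results = list(pool.map(sum_of_squares, chunks))

# A final reduce combines the partial results into one answer
total = reduce(lambda a, b: a + b, partial_results)

print(partial_results)
print(total)
```

The small function travels to each part of the data, and only the partial results, one number per part, are brought together at the end.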

## Interactive Help

Python provides interactive help that you can use within a notebook. Call the `help()` function without arguments to open the interactive help, where you can for instance ask for help on the `if` statement. Pass a variable to `help()` to learn more about the object it contains and the methods you can use on it.

In [21]:
help(fahrenheit)

Help on list object:

class list(object)
 |  list() -> new empty list
 |  list(iterable) -> new list initialized from iterable's items
 |  
 |  Methods defined here:
 |  
 |  __add__(self, value, /)
 |      Return self+value.
 |  
 |  __contains__(self, key, /)
 |      Return key in self.
 |  
 |  __delitem__(self, key, /)
 |      Delete self[key].
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __getitem__(...)
 |      x.__getitem__(y) <==> x[y]
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iadd__(self, value, /)
 |      Implement self+=value.
 |  
 |  __imul__(self, value, /)
 |      Implement self*=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __le__(self, value, /

In [22]:
help()


Welcome to Python 3.5's help utility!

If this is your first time using Python, you should definitely check out
the tutorial on the Internet at http://docs.python.org/3.5/tutorial/.

Enter the name of any module, keyword, or topic to get help on writing
Python programs and using Python modules.  To quit this help utility and
return to the interpreter, just type "quit".

To get a list of available modules, keywords, symbols, or topics, type
"modules", "keywords", "symbols", or "topics".  Each module also comes
with a one-line summary of what it does; to list the modules whose name
or summary contain a given string such as "spam", type "modules spam".


You are now leaving help and returning to the Python interpreter.
If you want to ask for help on a particular object directly from the
interpreter, you can type "help(object)".  Executing "help('string')"
has the same effect as typing a particular string at the help> prompt.
