# DAML 01 - PyData Primer

Michal Grochmal <michal.grochmal@city.ac.uk>

This is a summary of the features of Python that we will use.
It is by no means an extensive tutorial of the Python language;
instead it is a quick flyover of the basic features that
we will need throughout the course.  Think of it as a recap of
what you have already learned about Python.

In general, the following is structured so that anyone who understands
a programming language can follow the Python features we will need.
We will make analogies to other programming languages you may know.
If you struggle with this notebook I'll need to ask you to brush up on your
programming :) .

My main objective for the course is that everyone attending does learn
something.  In other words, the objective of the course is not to present
the entire material but to make sure that at least 70-80% of the material
is understood by 70-80% of the students.

## Functions

Python was originally built as an object oriented language, yet it wanted to compete with Perl,
which was a language heavily used for quick scripting.  Python succeeded by making functions
first-class citizens that do not depend on object oriented patterns (note though that under the hood
a Python function is an object).

A function definition starts with the `def` statement.  A call to the function ends when it executes
a `return` statement, when it reaches the end of the function body, or when an exception propagates out of it.
Contrary to many compiled programming languages, the `return` statement does not require a single value
to be returned, or any value at all; a function that returns nothing explicitly returns `None`.
The following are all valid function definitions:

In [None]:
def do_nothing():
    pass


def do_nothing_as_well():
    return None


def with_args(cat, pig):
    return 'Cat %s, pig %s' % (cat, pig)


def return_tuple(cat, pig):
    return 'Cat %s' % cat, 'Pig %s' % pig


print(do_nothing())
print(do_nothing_as_well())
print(with_args('is hungry', 'escaped'))
print(return_tuple('is hungry', 'escaped'))
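
Because functions are first-class citizens, they can be stored in variables and passed to other functions, and a returned tuple can be unpacked on assignment.  The following is a minimal sketch of both ideas; the names `describe` and `apply_to_pets` are made up for this illustration.

```python
def describe(cat, pig):
    # returning two values really returns a single tuple
    return 'Cat %s' % cat, 'Pig %s' % pig


def apply_to_pets(func):
    # `func` is an ordinary argument that happens to be a function
    return func('is hungry', 'escaped')


# unpack the returned tuple into two names
cat_status, pig_status = describe('is hungry', 'escaped')
print(cat_status)
print(pig_status)

# a function can be stored in a variable and passed around
alias = describe
print(apply_to_pets(alias))
print(type(describe))
```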

## Optional Arguments

You can provide optional/default *keyword* arguments to functions.
That is Python's way of giving different signatures/constructors to the same function/method.
Optional arguments are characterized by an assignment (equal sign) inside the `def`
statement, next to the defaulted argument.  All non-defaulted arguments *must come before*
the defaulted/optional arguments.  Examples:

In [None]:
def status(cat='is hungry'):
    return 'Cat %s' % cat


def neighbours_cat(neighbour, status='is hungry'):
    # the status already carries the verb, so the format string must not repeat "is"
    return '%s cat %s' % (neighbour, status)


print(status())
print(status('is well fed'))
print(neighbours_cat("Upstair's"))
print(neighbours_cat("'round the corner's", 'is well fed'))
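
As a small sketch of the two rules above: a defaulted argument can be passed by position or by name, and a definition that places a non-defaulted argument after a defaulted one is rejected.  The function `pet_status` and the source string `bad_definition` are hypothetical, used only to show the behaviour; `compile` is used so that the faulty definition can be tried without breaking the cell.

```python
def pet_status(neighbour, status='is hungry'):
    return '%s cat %s' % (neighbour, status)


# a defaulted argument can be given by position or by name
by_position = pet_status('Next door', 'is asleep')
by_name = pet_status('Next door', status='is asleep')
print(by_position == by_name)

# a non-defaulted argument after a defaulted one is rejected
# when the definition is compiled, before any call happens
bad_definition = "def broken(status='is hungry', neighbour):\n    pass\n"
try:
    compile(bad_definition, '<example>', 'exec')
    rejected = False
except SyntaxError as err:
    rejected = True
    print('SyntaxError:', err.msg)
print(rejected)
```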

## Function Arguments

Since Python is a dynamic language, it is possible to call the same function in several ways.
A function call is performed by evaluating all arguments in the call and then matching the resulting
positional and keyword arguments against the signature of the function.  Conceptually, a function call is processed as:

1.  From left to right all non-keyword arguments (positional arguments) are appended to a list
2.  All keyword arguments are placed inside a dictionary
3.  The positional arguments fill the parameters of the function signature, from left to right
4.  All parameters still unfilled are searched for in the keyword dictionary; those not found there take their default values
5.  If the function has a `*<arg>` argument the remaining list of positional arguments is passed there
6.  If the function has a `**<arg>` argument the remaining keyword dictionary is passed there
7.  If every parameter is filled and no positional or keyword arguments are left over the function is called, otherwise a `TypeError` is raised
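
The outcome of these steps can be observed with `inspect.signature` from the standard library, whose `bind` method matches arguments against a signature without calling the function.  This is only a sketch of the result of the matching, not of how the interpreter implements calls; the functions `feed` and `can_eat` below are hypothetical and have empty bodies on purpose.

```python
import inspect


def feed(cat, brand='felix', *extra_brands, **quantities):
    """Hypothetical function, only its signature matters here."""


signature = inspect.signature(feed)

# steps 1-3: positional arguments fill the signature from left to right
# steps 5-6: leftovers are collected by *extra_brands and **quantities
full = signature.bind('my cat', 'whiskas', 'wheat', felix=3, wheat=2)
print(dict(full.arguments))

# step 4: a parameter not filled by position is looked up by keyword
by_keyword = signature.bind('my cat', brand='wheat')
print(dict(by_keyword.arguments))

# parameters found nowhere fall back to their defaults
defaults = signature.bind('my cat')
defaults.apply_defaults()
print(dict(defaults.arguments))


def can_eat(cat, brand='felix'):
    """No * or ** parameters, so leftovers have nowhere to go."""


# step 7: a leftover argument makes the matching fail with TypeError
try:
    inspect.signature(can_eat).bind('my cat', 'whiskas', 'wheat')
    leftover_error = None
except TypeError as err:
    leftover_error = err
    print('TypeError:', err)
```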

By convention the argument for extra positional arguments is often called `*args`,
and the argument for extra keyword arguments is called `**kwargs` or `**kw`.
Yet that is not a very strong convention, and if better readability can be achieved
by giving these variables better names that is accepted.  For example, here we use
non-conventional names:

In [None]:
def can_eat(cat, brand='felix'):
    print(cat, 'eats', brand, 'food')


def cat_food_brands(market, *brands):
    print('In', market, 'we found the following brands of cat food:')
    for brand in brands:
        print(brand)
        

def deliver_cat_food(address, **quantity):
    print('Delivery to', address)
    for b, q in quantity.items():
        print(q, 'cans of', b)


can_eat('my cat', 'whiskas')
print('-' * 30)
can_eat('my cat', brand='wheats')
print('-' * 30)
cat_food_brands('Tesco', 'felix', 'whiskas', 'wheats')
print('-' * 30)
cat_food_brands("Sainsbury's", 'whiskas', 'sainsbury')
print('-' * 30)
deliver_cat_food('Northampton Square', whiskas=7, felix=3)

## List Comprehensions

Despite its object oriented origin Python did fall in love with functional patterns.
The idea of functional execution originated in LISP (list processing), and is based
on operations such as `map` and `filter`.  Python supports the `map` and `filter` functions
as built-ins but it also comes with a syntax called *list comprehension*.

List comprehensions are often easier to read and shorter to write than their equivalents with
`map` and `filter`.  Also, a comprehension avoids the function call that `map` and `filter` make
for every element when given a `lambda`, so it is often faster than the equivalent chain of
`map` and `filter`.  Below we can see a couple of list comprehensions and their lisp-like counterparts:

In [None]:
numbers = list(range(10))
print('numbers:', numbers)

odd = [x for x in numbers if x % 2 == 1]
# list(filter(lambda x: x % 2 == 1, numbers))
print('odd:', odd)

even_squared = [x*x for x in numbers if x % 2 == 0]
# list(map(lambda x: x*x, filter(lambda x: x % 2 == 0, numbers)))
print('even squared:', even_squared)
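
To check that the two styles really agree, the sketch below runs the `map` and `filter` versions next to the comprehensions.  Note that in Python 3 `map` and `filter` return lazy iterators, so they are wrapped in `list` before comparing; the variable names are made up for this illustration.

```python
numbers = list(range(10))

# in Python 3 `map` and `filter` return lazy iterators, not lists
lazy_odd = filter(lambda x: x % 2 == 1, numbers)
print(lazy_odd)

# wrapping the iterator in list() gives the same result as the comprehension
odd_functional = list(lazy_odd)
odd_comprehension = [x for x in numbers if x % 2 == 1]
print(odd_functional == odd_comprehension)

even_squared_functional = list(
    map(lambda x: x*x, filter(lambda x: x % 2 == 0, numbers)))
even_squared_comprehension = [x*x for x in numbers if x % 2 == 0]
print(even_squared_functional)
print(even_squared_functional == even_squared_comprehension)
```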

## Combining Comprehensions

A single list comprehension is powerful but a combination of them brings out
the full power of the functional paradigm.  An example is in order.

Let's try to distribute cat food across several households in a way that most cats are happy.
Note that we will ignore the special preferences of each cat,
e.g. a cat that likes "whiskas special" will have to make do with
plain whiskas food since we do not want to spend too much.

The code below uses the functional paradigm to distribute the amount of cat food equally
across the neighborhood cats.  Note: iterating over a dictionary is the same as
iterating over the keys returned by its `.keys()` method.

In [None]:
from pprint import pprint


cat_preferences = {
    'my cat': ['whiskas', 'felix pork', 'wheat'],
    "neighbour's cat": ['whiskas special', 'wheat'],
    "'round the corner cat": ['felix', 'sainsbury']
}
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}


preferences = dict(
    [(cat, [food for food in food_in_drawer if [x for x in cat_preferences[cat] if x.startswith(food)]])
        for cat in cat_preferences])
print('preferences')
pprint(preferences)
print('-' * 30)
food_div = dict(
    [(food, len([cat for cat in cat_preferences if food in preferences[cat]]))
        for food in food_in_drawer])
print('food division')
pprint(food_div)
print('-' * 30)
rations = dict(
    [(cat, dict([(food, food_in_drawer[food] // food_div[food])
                    for food in food_in_drawer if food in preferences[cat]]))
        for cat in cat_preferences])
rations

This was an exercise in *relational algebra*, which is often used in `NumPy` and `Pandas`.
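
The three steps above build dictionaries by passing lists of pairs to `dict`.  As a sketch of an alternative spelling, the same steps can be written with dictionary comprehensions and the built-in `any`; the data is copied from the example above and the result should be identical.

```python
cat_preferences = {
    'my cat': ['whiskas', 'felix pork', 'wheat'],
    "neighbour's cat": ['whiskas special', 'wheat'],
    "'round the corner cat": ['felix', 'sainsbury']
}
food_in_drawer = {'felix': 6, 'whiskas': 10, 'wheat': 12, 'sainsbury': 5}

# which foods from the drawer each cat accepts (prefix match on its preferences)
preferences = {
    cat: [food for food in food_in_drawer
          if any(liked.startswith(food) for liked in likes)]
    for cat, likes in cat_preferences.items()
}
# how many cats share each food
food_div = {
    food: sum(1 for accepted in preferences.values() if food in accepted)
    for food in food_in_drawer
}
# integer share of each accepted food per cat
rations = {
    cat: {food: food_in_drawer[food] // food_div[food] for food in accepted}
    for cat, accepted in preferences.items()
}
print(rations)
```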

## String Operations

Above we saw `startswith`, which is a string operation, i.e. an operation performed on string objects.
Being able to handle strings is an important skill regardless of whether you are analyzing data,
writing a web crawler or scripting your cat food delivery network.  Let's have a look at some of these
operations, specifically the operations that may be useful in data munging.

In [None]:
cat = 'Aubrey'
dog = 'Rose'
address = ' Northampton Square, Clerkenwell '  # note the spaces

print(cat.startswith('A'))
print(cat.endswith('y'))
print(cat.lower())
print(cat.upper())
print(', '.join([cat, dog]))
print('[' + address + ']')
print('[' + address.lstrip() + ']')
print('[' + address.rstrip() + ']')
print('[' + address.strip() + ']')
print(address.split())
print([x.strip(',') for x in address.split()])

For anything more complex [regular expressions][regex] are the way to go.
Yet, we will cover very little of regular expressions.
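
As a small taste only, here is a sketch of how the `re` module from the standard library could clean the same address in one step.  The two patterns are just one possible choice for this string.

```python
import re

address = ' Northampton Square, Clerkenwell '  # note the spaces

# \w+ matches runs of letters, digits and underscores,
# so the spaces and the comma are dropped in one step
words = re.findall(r'\w+', address)
print(words)

# split on a comma surrounded by optional whitespace
parts = re.split(r'\s*,\s*', address.strip())
print(parts)
```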

[regex]: https://docs.python.org/3/library/re.html "Regular Expressions - Python Documentation"

## Data Types

Python is dynamically typed, i.e. types belong to objects rather than to variables,
and are only checked at run time when an operation needs them.
More specifically Python is duck-typed, which means that as long as an object (data type,
data structure or even function) abides by a certain protocol it will work as the type intended
for that protocol.  In other words, as long as a data type behaves well enough as the intended
data type for an operation, it will just work.

This also means that a function may receive completely different types
of objects and act differently based on what it got.
One example of such behavior can be outlined with:

In [None]:
CAT_NUM = 3


def divide_food(food):
    """Divides the food among cats, can receive a dictionary or list of 2-tuples"""
    if not hasattr(food, 'keys'):
        food = dict(food)
    for f in food:
        food[f] //= CAT_NUM
    return food


print(divide_food({'felix': 7, 'whiskas': 6}))
print(divide_food([('felix', 7), ('whiskas', 6)]))

Duck-typing, and protocol checking like above, is heavily used throughout the Python data stack.
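
To see a protocol at work outside our own functions, the sketch below defines a class that is not a dictionary but offers a `keys` method and `__getitem__`.  That is enough for the built-in `dict` to treat it as a mapping.  The class `FoodShelf` and its contents are hypothetical.

```python
class FoodShelf:
    """Not a dictionary, but it offers the part of the mapping protocol that dict() uses."""

    def __init__(self):
        self._cans = [('felix', 7), ('whiskas', 6)]

    def keys(self):
        return [brand for brand, _ in self._cans]

    def __getitem__(self, brand):
        for name, quantity in self._cans:
            if name == brand:
                return quantity
        raise KeyError(brand)


shelf = FoodShelf()
print(isinstance(shelf, dict))
print(hasattr(shelf, 'keys'))
# dict() asks for the keys and then looks each of them up with shelf[key]
as_dict = dict(shelf)
print(as_dict)
```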

## Lambdas

Since functions are first class citizens in Python, nothing stops us from having variables with
references to functions.  And since we have references to functions nothing stops us from referencing
a function to which we did not give a name.

Anonymous functions are functions without a given name (in Python, without a meaningful
`__name__` attribute).  These are often used to pass simple functions around.  A *lambda
function* can only contain a single expression and has an implicit return.

In [None]:
def named_function(food):
    return 'Cat ate %s' % food
    

anon_function = lambda food: 'Cat ate %s' % food


print(named_function('felix'))
print(anon_function('felix'))
print(named_function.__name__)
print(anon_function.__name__)
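
A typical case of passing a simple function around is the `key` argument of `sorted` and `max`.  The sketch below uses a made-up list of (brand, quantity) pairs; the lambda only picks the quantity out of each pair.

```python
cans = [('felix', 6), ('whiskas', 10), ('wheat', 12), ('sainsbury', 5)]

# sort by quantity, the lambda picks the second element of each pair
by_quantity = sorted(cans, key=lambda pair: pair[1])
print(by_quantity)

# the brand with the most cans
most_stocked = max(cans, key=lambda pair: pair[1])
print(most_stocked)
```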

## Objects

We will deal very little with the object oriented nature of Python
but we will need to know a bit about objects.  An object is an encapsulation of state
together with methods (functions) that operate on this state.  In Python **object state
and object methods live in different places in memory**: the methods live on the class, and the
first argument to all normal methods of an object points to the actual state encapsulated by the current
instance of the object.  By convention we use `self` as the name of the first argument
of the object methods, and this is a very strong convention.

After an object is constructed its `__init__` method is invoked.  It takes the `self` argument
and then anything that we wish to be stored or used for initializing an instance of our object.
Optional arguments are accepted and encouraged within the definition of `__init__`; these optional
arguments achieve what in other languages is accomplished with multiple constructors.
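
A minimal sketch of both points, using a hypothetical `Bowl` class: an optional argument in `__init__` plays the role of a second constructor, and a method call on an instance is the same as calling the function stored on the class with the instance passed as `self`.

```python
class Bowl:

    def __init__(self, brand, cans=1):
        # `self` is the instance being initialised, the state is stored on it
        self.brand = brand
        self.cans = cans

    def refill(self, cans):
        self.cans += cans
        return self.cans


small = Bowl('felix')            # uses the default number of cans
large = Bowl('whiskas', cans=5)  # overrides the default
print(small.cans, large.cans)

# the usual call passes the instance as `self` implicitly
after_first = small.refill(2)
print(after_first)
# calling the function on the class and passing the instance by hand is equivalent
after_second = Bowl.refill(small, 2)
print(after_second)
```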

A Python function is actually an object.  The `def` statement simply defines an object which has a
`__call__` method; this method is invoked when the object is called (by placing brackets after it).
Dictionaries and lists are just Python objects too; they define the `__getitem__` method.
In Python these *dunder* (double underscore) methods define the protocols of the basic objects.
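
The dunder methods can also be called by hand, which makes the protocols visible.  The following sketch uses a made-up function `greet` and made-up data; normally one would never write these calls explicitly.

```python
def greet(name):
    return 'Hello %s' % name


# brackets after a function go through its __call__ method
print(callable(greet))
print(greet('Aubrey') == greet.__call__('Aubrey'))

pets = ['Aubrey', 'Rose']
ages = {'Aubrey': 3, 'Rose': 5}  # hypothetical data

# square brackets go through __getitem__ for lists and dictionaries alike
print(pets[1] == pets.__getitem__(1))
print(ages['Rose'] == ages.__getitem__('Rose'))

# a slice in square brackets reaches __getitem__ as a slice object
print(pets[0:1] == pets.__getitem__(slice(0, 1)))
```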

What follows is an example of a multi-protocol object,
with a similar `__getitem__` as the multidimensional array object which we will see next.
Note: do not worry if you do not understand what is happening below,
we will not explicitly cover it.
On the other hand, if you know Python well and are interested in what goes on
behind the scenes in the data manipulation libraries, this object outlines it.

In [None]:
class Cat(object):

    def __init__(self, greeting='Meaow!', legs=4):
        self.greeting = greeting
        self.legs = legs
        self.fed = True

    def is_hungry(self):
        return not self.fed

    def feed(self):
        self.fed = True

    def __call__(self):
        if self.fed:
            print(self.greeting)
        self.fed = False

    def __getitem__(self, key):
        """
        This one is pretty complicated - this is how NumPy and Pandas work under the hood.
        
        If you really want to go deep try figuring out what it does and how it does it.
        """
        if slice == type(key):
            return 'Do not slice me!'
        elif int == type(key):
            return min(abs(key), self.legs)
        else:
            return key


cat = Cat('Mieau!')
print('Hungry:', cat.is_hungry())
cat()
print('Hungry:', cat.is_hungry())
cat()  # is hungry, will not meaow
cat.feed()
cat()
print('List slice:', cat[1:3:2])
print('List access:', cat[1])
print('Too many legs:', cat[7])
print('Dictionary access:', cat['are you my cat?'])
print('Arbitrary access:', cat[1:7:2, 'fur', 3])

Finally, if anything in this section was too much, do have a look at one
of the several extensive resources for learning more about Python.
The list below is by no means comprehensive.

## Extra Resources

- [Dive Into Python 3][dive] by Mark Pilgrim
- [Think Python][think] by Allen B. Downey
- [Official Python Tutorial][tut]

[dive]: http://www.diveintopython3.net/
[think]: http://greenteapress.com/wp/think-python-2e/
[tut]: https://docs.python.org/3/tutorial/