# Advanced Data Manipulation

Thus far in this course, our focus has been on what we might call general-purpose Python--skills that you will need for nearly any programming project.  When we write code to analyze a data set, however, there are several unique concerns that we need to keep in mind. In this section we will cover several high-level concepts related to data manipulation.  We will present these using Python, but they are actually important components of many data manipulation frameworks.

These concepts open the door to an alternative to the object-oriented style of programming that we covered before. They all belong to the functional style, my personal favorite.

## What Is Functional Programming?

In computer science, functional programming is a programming paradigm, or a style of writing code.  It is an alternative to the object-oriented framework that we have been working with up to this point. Rather than keeping state in objects, as object-oriented programming (OOP) does, functional programming is declarative and emphasizes the use of functions to perform computation. At the core of functional programming is a simple idea: the output of a function should depend only on the arguments passed into the function. This means that calling a function f with the same argument x will always return the same value.
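To make the core idea concrete, here is a minimal sketch that contrasts a function whose result depends only on its arguments with one that also reads outside state. The names `tax_rate`, `price_with_tax_impure`, and `price_with_tax_pure` are invented for this illustration and do not come from the course code.

```python
# Hypothetical example: every name here is made up for illustration.
tax_rate = 0.25  # module-level state that any other code could reassign


def price_with_tax_impure(price):
    # Reads the global tax_rate, so the same argument can give different results.
    return price * (1 + tax_rate)


def price_with_tax_pure(price, rate):
    # Depends only on its arguments.
    return price * (1 + rate)


first = price_with_tax_impure(100)
tax_rate = 0.5  # someone changes the outside state
second = price_with_tax_impure(100)

print(first == second)  # False: same call, different answers
print(price_with_tax_pure(100, 0.25) == price_with_tax_pure(100, 0.25))  # True
```

Only the second function satisfies the functional ideal: everything that influences its result is visible in the call itself.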

At first, it may be hard to understand why a functional approach is useful or desirable.  Some (if not most) of the code that we wrote for this course violates this principle. That may seem like an obvious point, but let's reinforce it with an example.

Say that I have an instance of a class.

In [1]:
class Drone:
    power_system = "battery"
    def fly(self):
        return "The %s-powered drone is flying" % (self.power_system)
d = Drone()

In [2]:
def drone_flyer(drone):
    print(drone.fly())

In [3]:
drone_flyer(d)

The battery-powered drone is flying


Now that I have that instance, let's change its power system.

In [4]:
d.power_system = "'Subway - Eat Fresh'"

In [5]:
drone_flyer(d)

The 'Subway - Eat Fresh'-powered drone is flying


This may seem like a trivial example, but we have already violated the principles of functional programming. We are executing the same method and might expect the same output. Because we have an object with a *state* that is *mutable*, however, we cannot count on getting the same output each time we call a method. You may think to yourself, "Well I just won't change it," but that only helps for small projects. You may not intentionally change a value, but perhaps a user of your system will or you may write the code and change it later.

Once you have started working in an object-oriented style, you need to think about and manage state. We can start controlling this through properties like we did previously, but the functional programmer would say that these are all just Band-Aids on the core problem. A program should have no state; that is, given the same input, a function should always return the same output.

More simply, **once something is created, you should not be able to change its value or behavior.**
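One way to approximate this rule in Python is a frozen dataclass from the standard library. The sketch below is only an illustration: `FrozenDrone` is a hypothetical immutable counterpart to the `Drone` class, not something used elsewhere in this course.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class FrozenDrone:
    # Hypothetical immutable variant of the Drone class, for illustration only.
    power_system: str = "battery"

    def fly(self):
        return "The %s-powered drone is flying" % self.power_system


drone = FrozenDrone()

try:
    drone.power_system = "solar"  # assignment is rejected on a frozen dataclass
    changed = True
except FrozenInstanceError:
    changed = False

# Instead of mutating, build a new object with the field replaced.
solar_drone = replace(drone, power_system="solar")

print(changed)            # False
print(drone.fly())        # The battery-powered drone is flying
print(solar_drone.fly())  # The solar-powered drone is flying
```

The original object never changes, so any code holding a reference to `drone` can rely on what `fly()` will return.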

Let's walk through another example.

In [6]:
d = Drone()

def drone_changer():
    d.power_system = "I've made a huge mistake"

print(d.fly())
drone_changer()
print(d.fly())

The battery-powered drone is flying
The I've made a huge mistake-powered drone is flying


By this point, you should recognize that this is a rather poor coding practice.  This function has a rather unexpected *side effect*. That means that it is modifying something outside of its own scope.  There is no way to know this unless you actually examine the code inside the function.

A good way to spot this kind of hidden dependency is to ask whether the function would still work if you moved it to another file. In this case it would not, unless that file also happened to contain a drone instance named `d`. Why is this so bad?  In short, it makes it difficult to reason about the program. Imagine if you had 50 functions that all had the potential to change the drone instance. Anyone reading your code would have no idea of the *state* of the drone.
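For comparison, here is one possible side-effect-free alternative to `drone_changer`. It is a sketch under the assumption that returning a modified copy is acceptable; the helper name `with_power_system` is hypothetical.

```python
import copy


class Drone:
    power_system = "battery"

    def fly(self):
        return "The %s-powered drone is flying" % (self.power_system)


def with_power_system(drone, power_system):
    # Hypothetical helper: works on a copy, so the caller's drone is untouched.
    new_drone = copy.copy(drone)
    new_drone.power_system = power_system
    return new_drone


original = Drone()
updated = with_power_system(original, "solar")

print(original.fly())  # The battery-powered drone is flying
print(updated.fly())   # The solar-powered drone is flying
```

Everything the function needs arrives through its parameters and everything it produces leaves through its return value, so it can be moved to any file and still behave the same way.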

## Why Functional Programming?

Functional programming is a way to avoid the problems we have been discussing.  Instead of using objects with mutable state, functional programmers organize a program into functions, none of which has any side effect.  The execution of a program is then viewed as the evaluation of functions.  There is no state except for the values being passed into functions and returned from functions, which are stored in the execution stack.

This is a very different way of viewing programs, and it may seem unintuitive if you are used to object-oriented programs.  Still, while object-oriented programming remains dominant in modern software development, functional programming is quite useful in some places.

According to advocates, one of the great benefits of a functional approach is code reliability.  In a functional program, the same function call will always return the same result, so there is less potential for problems to pop up after initial testing.  This is one reason that functional programming has been applied to fault-resistant telecommunications networks.  As another benefit, functional programming enforces good code modularity, forcing programmers to divide tasks into subtasks with clear results.  Functional programming is also widely used for writing parallelized algorithms that distribute computation over many computer servers.

We are not suggesting that you should use a functional style for all of your code.  In the context of a data analysis, however, we will see that Python's data libraries are well suited to a functional style.  Understanding how functional programming works will help you derive the greatest benefit from these packages.  Let's examine some Python components that fit closely with a functional programming paradigm.

## Mapping

Mapping means applying a function to every value in a collection, so that each input value is associated with an output value, almost like the keys and values of a dictionary. It is a functional programming concept that will certainly come up in your data analysis career, and it is a fundamental part of the MapReduce style of programming popular in big data.

Let's explore how it works.

In [7]:
x = range(0,10)

In [8]:
x

range(0, 10)

Now that we have a range of integers, we will want to apply a transformation to each value in that range. For example, let's cube every value. So we write our cube function, which operates on one individual value at a time.

In [9]:
def cube(num):
    return num ** 3

You may think a for loop is the way to do this, but a loop gives us no way to capture the output without first creating a list and then changing it on every iteration. Here is that example.

In [11]:
new_list = []
for item in x:
    new_list.append(cube(item))

In [12]:
print(new_list)

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]


Why don't we just do this? Because we had to create a mutable variable, `new_list`, and modify it repeatedly, which is completely unnecessary.

Our solution is easy: we just create a map. Remember that mapping maps a value to another value via a functional transformation. It does this with no mutability and no side effects. It is also much more concise and easy to read.

In [13]:
map_list = map(cube, x)

In [14]:
print(list(map_list))

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]


We have to convert it to a list because `map` gives us back an *iterator* (a `map` object) rather than a list. That means that functional transformations are *lazily evaluated*. Python will not execute this code until the values are actually needed, which means there is very little waste: if we do not need a transformation, we do not have to perform it.
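The following sketch makes the laziness visible by pulling values out of the object returned by `map` one at a time. The variable names are chosen for illustration.

```python
def cube(num):
    return num ** 3


lazy_cubes = map(cube, range(0, 10))  # nothing has been computed yet

first = next(lazy_cubes)      # computes only cube(0)
second = next(lazy_cubes)     # computes only cube(1)
remaining = list(lazy_cubes)  # computes the rest
leftover = list(lazy_cubes)   # the map object is now exhausted

print(first, second)  # 0 1
print(remaining)      # [8, 27, 64, 125, 216, 343, 512, 729]
print(leftover)       # []
```

Note the last line: a `map` object can be consumed only once, so convert it to a list if you need to reuse the results.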

## Filters

Filters are more self-explanatory than maps: they allow you to keep only the values in a list that meet a criterion and discard the rest.

In [15]:
x

range(0, 10)

We have our familiar range again. Let's create a function that checks whether or not a number is divisible by two.

In [16]:
def divis_by_2(num):
    return num % 2 == 0

In [17]:
divis_by_2(2)

True

In [18]:
divis_by_2(3)

False

Filters can be thought of as a special kind of map. Using divis_by_2 as an example, we first map each value to true or false depending on whether or not it is divisible by 2.

In [19]:
list(map(divis_by_2, x))

[True, False, True, False, True, False, True, False, True, False]

A filter then keeps only the values for which the function returns true and removes the rest.

In [20]:
list(filter(divis_by_2, x))

[0, 2, 4, 6, 8]

This is really valuable because we can start chaining a lot of these operations together and we will always get the same output given the same input. Let's try it with a different example.

In [21]:
test_list = [
"hello",
"x,y,z. i like this.",
"this is two xx",
"this, x, here x is, here it is againx"
]

We will check whether or not a string has at least two 'x' characters.

In [22]:
def has_x(my_string):
    return my_string.count("x") >= 2

In [23]:
has_x("hello")

False

In [24]:
has_x("xxhello")

True

In [25]:
list(filter(has_x, test_list))

['this is two xx', 'this, x, here x is, here it is againx']

The filter keeps the strings that match this criterion and drops the ones that do not.

Remember, this is just like a map followed by removing the values that mapped to false.

In [26]:
list(map(has_x, test_list))

[False, False, True, True]
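To spell out that relationship, the sketch below rebuilds the filter result from the map result. It uses `itertools.compress` from the standard library, which the text above does not use; it simply keeps the items whose matching flag is true.

```python
from itertools import compress


def has_x(my_string):
    return my_string.count("x") >= 2


test_list = [
    "hello",
    "x,y,z. i like this.",
    "this is two xx",
    "this, x, here x is, here it is againx",
]

flags = list(map(has_x, test_list))      # the map step: one True/False per string
kept = list(compress(test_list, flags))  # the removal step: drop the False entries

print(flags)                                   # [False, False, True, True]
print(kept == list(filter(has_x, test_list)))  # True
```

In practice `filter` does both steps at once, but seeing them separately shows why a filter can be described as a map plus a removal.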

## List Comprehensions

We looked at list comprehensions before, but this is a good time to review them.  After all, list comprehensions combine some features of maps with some features of filters.

Let's start with our usual x range.

In [27]:
x

range(0, 10)

Now we will cube every value in this list like we did above.

In [28]:
[item**3 for item in x]

[0, 1, 8, 27, 64, 125, 216, 343, 512, 729]

These often look strange to novice programmers because a list comprehension looks like a for loop, and in a way it is; it is just a bit more compressed. Also, rather than outputting or appending to a list, we are just wrapping the expression in brackets to tell Python to build a list.

Let's do the same with multiplying each number by 2.

In [29]:
[item * 2 for item in x]

[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

Let's add a little bit of filtering to it now. We will keep only the values that are divisible by 3 and multiply each of them by 3.

In [30]:
[item * 3 for item in x if item % 3 == 0]

[0, 9, 18, 27]

It may seem strange that we can simply tack on an if statement like that, but we can. We can of course do other things with list comprehensions too.

We can convert types, for example.

In [31]:
[float(item) for item in x]

[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]

Or we can apply a more complicated function, or one that we create outside of the comprehension.

In [32]:
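def cube_if(num):
    # Despite its name, this function does not cube anything:
    # it triples multiples of 3 and doubles every other number.
    if num % 3 == 0:
        return num * 3
    else:
        return num * 2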

In [33]:
[cube_if(z) for z in x]

[0, 2, 4, 9, 8, 10, 18, 14, 16, 27]

Notice that we do not have to call the thing we are iterating through 'item'; that is just a convention. We can call it whatever we want. List comprehensions do not fit every use case, but they are useful and worth knowing about. Just keep in mind that they express the same operations as maps and filters; the main difference is that a list comprehension builds its list immediately instead of lazily.
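As a quick check of that correspondence, the sketch below writes the divisible-by-3 example both ways. The helper names `triple` and `divis_by_3` are introduced here for illustration.

```python
x = range(0, 10)


def triple(num):
    return num * 3


def divis_by_3(num):
    return num % 3 == 0


# The comprehension filters and maps in one expression.
from_comprehension = [triple(item) for item in x if divis_by_3(item)]

# The same work as an explicit filter followed by a map.
from_map_filter = list(map(triple, filter(divis_by_3, x)))

# Parentheses instead of brackets give a lazy generator expression.
lazy_version = (triple(item) for item in x if divis_by_3(item))

print(from_comprehension)                     # [0, 9, 18, 27]
print(from_comprehension == from_map_filter)  # True
print(list(lazy_version))                     # [0, 9, 18, 27]
```

The bracketed form computes everything at once, while the parenthesized form behaves like `map` and `filter` and waits until its values are requested.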

## Lambda Functions

Another concept we touched on before is lambda functions.  These play an important role in functional programming, so it is worth reviewing them now. Lambda functions are anonymous functions: they do not need to have a name assigned to them prior to use. This means that you do not have to define a named function like "square_a_num" when you want to square a number; you can just write the function inline where you need it.

Let's go through a couple of ways to square a list of items.

We can create a list comprehension.

In [34]:
[item**2 for item in x]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

We can create a function to do it.

In [35]:
def square(num):
    return num ** 2

In [36]:
[square(z) for z in x]

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

We can do a map with this operation.

In [37]:
list(map(square, x))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

And we can create a lambda function.

In [38]:
list(map(lambda z: z ** 2, x))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

Lambda functions take some getting used to, but they are quite useful when you are performing small operations on lists that you do not want to have to save somewhere.

The requirement for lambda functions is that they are single expressions of Python code; they are for things that are simple.
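A short sketch of what "single expression" allows in practice follows. The sample data is made up for illustration; a conditional expression is fine inside a lambda, whereas statements such as assignments or `return` are not.

```python
words = ["kiwi", "fig", "banana"]  # hypothetical sample data

# A lambda is handy as a small, throwaway key function.
by_length = sorted(words, key=lambda w: len(w))

# A conditional expression is still one expression, so it fits in a lambda.
parity = list(map(lambda z: "even" if z % 2 == 0 else "odd", range(0, 4)))

print(by_length)  # ['fig', 'kiwi', 'banana']
print(parity)     # ['even', 'odd', 'even', 'odd']
```

Once the logic needs more than one expression, a regular `def` is the better tool.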

We can also assign that lambda function to a name if we want to.

In [39]:
square_lambda = lambda z: z ** 2

In [40]:
list(map(square_lambda,x))

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [41]:
print(type(square_lambda))

<class 'function'>


In Python, functions can be passed around just like any other object. We see that with maps and filters too: we essentially tell Python to take this function, apply it to every value in this list, and give back the result. Many other programming languages have this feature too, and it is one of my favorites.
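Because functions are ordinary objects, we can also write our own functions that accept or return other functions. The helpers `apply_twice` and `compose` below are hypothetical names used only to illustrate the idea.

```python
def square(num):
    return num ** 2


def add_one(num):
    return num + 1


def apply_twice(func, value):
    # Takes a function as an argument and calls it two times.
    return func(func(value))


def compose(outer, inner):
    # Returns a new function that runs inner first, then outer.
    return lambda value: outer(inner(value))


square_then_add_one = compose(add_one, square)

print(apply_twice(square, 3))  # 81
print(square_then_add_one(4))  # 17
```

`map` and `filter` are built on the same principle: they are functions that take another function as their first argument.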

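Let's look at what this allows us to do on a bigger scale. Let's say we want to square all values that are divisible by 2 from 1 to 19 (`range(1, 20)` stops before 20). Because the calls are nested, read the expression from the inside out: the filter runs first, and the map is applied to its result.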

In [42]:
map(lambda z: z**2, filter(lambda z: z % 2 == 0, range(1,20)))

<map at 0x104063898>

Notice that we get back a map object here because of the lazy evaluation; we have not asked for the result, so it has not given it to us. Let's take that and convert it to a list and we will get the results.

In [43]:
list(map(lambda z: z**2, filter(lambda z: z % 2 == 0, range(1,20))))

[4, 16, 36, 64, 100, 144, 196, 256, 324]

Let's reflect on how we would have written that before.

In [44]:
my_range = range(1,20)
output = []
for z in my_range:
    if z % 2 == 0:
        output.append(z ** 2)

The loop may seem more familiar, but it is also more prone to errors as we modify the code later on; it takes more lines and is not nearly as extensible. This may seem trivial, but it really is not.
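As a sanity check, the sketch below runs the loop, the map and filter pipeline, and an equivalent list comprehension side by side and confirms that all three agree. The variable names are chosen for illustration.

```python
my_range = range(1, 20)

# Imperative version: build up a mutable list step by step.
output = []
for z in my_range:
    if z % 2 == 0:
        output.append(z ** 2)

# Functional version: filter first, then map.
functional = list(map(lambda z: z ** 2, filter(lambda z: z % 2 == 0, my_range)))

# List comprehension: the same filter and map in one expression.
comprehension = [z ** 2 for z in my_range if z % 2 == 0]

print(output)                                # [4, 16, 36, 64, 100, 144, 196, 256, 324]
print(output == functional == comprehension) # True
```

All three produce the same list; the difference lies in how much mutable bookkeeping the reader has to track.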

There are many other reasons to love functional programming, but these are some that you should appreciate. This way of thinking is certainly a departure from what you may be accustomed to, but you will see that it is valid and extremely useful in later contexts.

We will see a lot of this style of programming in the coming chapters.