## The Zen of Python

In [None]:
import this

## Mini-focus: Code layout and structure in Python

Python is an unusual language in that "whitespace" (spaces, tabs, newlines, and so forth) is significant in determining what is valid Python.

This was (is?) shocking to proponents of other languages.  For example, in Python, we would write a for loop like this:

```
for i in range(5):
    print(i)
```

where the `print` function call must be indented.  Although you have a choice of HOW MUCH to indent (as long as you are consistent), you MUST indent.

Meanwhile, in other languages like C, one would write

```
for (i = 0; i < 5; i++) {
    printf("%d", i);
}
```

But also it could be written

```
for (i = 0; i < 5; i++) { printf("%d", i); }
```

```
for (i = 0; i < 5; i++) 
{
    printf("%d", i);
}
```

```
for(i=0;i<5;i++)printf("%d",i);
```

among a myriad of other ways.

An important implication of the spacing rules of Python is that they (somewhat) encourage you to structure your code to make it more human-readable.  We'll look a bit at some potentially confusing points and also some suggestions for good practice.
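As a small illustration of how indentation alone decides what belongs to a block, the sketch below runs two loops that differ only in how far the last line is indented.  The names `inside` and `outside` are made up for this example.

```python
# Two loops that differ only in the indentation of the final statement.

inside = []
for i in range(3):
    doubled = 2 * i
    inside.append(doubled)    # indented: part of the loop body, runs 3 times

outside = []
for i in range(3):
    doubled = 2 * i
outside.append(doubled)       # not indented: runs once, after the loop ends

print(inside)   # [0, 2, 4]
print(outside)  # [4]
```

In C, the braces would settle this question and the indentation would be decoration; in Python the indentation *is* the answer.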



You may have noticed I tend to have certain conventions in how I have laid out the expressions I have written.

First, because spacing is important in Python, an expression ends at the end of a line.  So for example I cannot do

In [None]:
foo =
1

The end of the line ends the expression, which is incomplete because Python is expecting something to be on the right-hand side of that assignment.  Now, if we had a good reason to, there is a way to continue an expression across multiple lines, which is to use the `\` (backslash) character:

In [None]:
foo = \
1

In [None]:
foo

There's really no good reason to write such a simple expression over multiple lines.  However, if you have long expressions, you might want to break them up for readability.  Most Python style guides suggest lines should not be too long - PEP 8, the standard Python style guide, sets the limit at 79 characters, although there are a few guides which are more permissive.

In [None]:
def myexpression(x):
    return \
        3 * x**2 + \
        16 * x + \
        24
myexpression(15)

However, there is an exception to the end-of-line equals end-of-expression rule.  If you have an "open" delimiter like `(`, `[`, or `{`, then the expression **automatically** is assumed to continue to the next line, and you don't need to use the backslash character to continue the expression.

So I could write the function `myexpression` like this as well, by wrapping the expression inside parentheses.
The parentheses don't affect the meaning of the code, but by having the open parenthesis, the expression automatically extends to the next line.

Further, I use the flexibility of spacing to lay out the expression in a way that is visually appealing (I think!)

In [None]:
def myexpression(x):
    return (
        3 * x**2 +
        16 * x +
        24
    )
myexpression(15)
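The same automatic continuation applies to square brackets and curly braces, not only parentheses.  A minimal sketch, with made-up variable names:

```python
# An open [ or { continues the expression onto the next line,
# just as an open ( does.
cities = [
    "Aberdeen",
    "Norwich",
]

temperatures = {
    "Aberdeen": 0,
    "Norwich": 5,
}

# An open ( in a function call works the same way
total = sum(
    temperatures[city]
    for city in cities
)
print(total)  # 5
```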

This technique combines very well with working with `DataFrame`s and transforming data in `pandas`.

For example, when I have created ad-hoc `DataFrame`s, I usually use a layout like the one below.  This creates a `DataFrame` based on a list of `dict`s.  In this case, I am able to put one row of the `DataFrame` (one `dict`) on each line.  Further, because I have open delimiters - two in fact (a parenthesis and a square bracket), I don't have to worry about explicitly continuing a line.

In [None]:
import pandas as pd
df = pd.DataFrame([
    {'city': "Aberdeen", 'temperature': 0},
    {'city': "Norwich", 'temperature': 5}
])
df

Compare this with the below, which is exactly equivalent, but more difficult to read, and more difficult to keep your delimiters straight - there is the sequence of `([{` and then the sequence of `}])`.

In [None]:
temps = pd.DataFrame([{'city': "Aberdeen", 'temperature': 0}, {'city': "Norwich", 'temperature': 5}])
temps

## Mini-focus: Strings in Python

Python takes a flexible approach to how you indicate literal text strings.  In particular, you can use either single-quotes or double-quotes - as long as you use the same type of quote at both ends of a given string.  You can mix-and-match all you want with different strings.

In the above, I used both double-quotes (when giving the city names) and single-quotes (when giving the field names).  This is a convention I use myself - I use the different types of quoting to denote different types of information.  I usually use single-quotes for column names and double-quotes for data values.  However, you will also find plenty of examples where I don't follow this convention.  And, it's purely a personal convention - an example of using the language to let you try to express more information to the human reader.

I could just as well create the previous DataFrame like this.  It's exactly the same and exactly as correct.

In [None]:
temps = pd.DataFrame([
    {'city': 'Aberdeen', 'temperature': 0},
    {'city': 'Norwich', 'temperature': 5}
])
temps

Allowing both the single-quote and double-quote makes it easier to deal with text strings which themselves have quotes in them.  If we have a string that has a single-quote in it, then we can use double-quotes to indicate the string:

In [None]:
"Dwayne 'The Rock' Johnson"

Or, if we have a string that has a double-quote inside of it, we can use single-quotes to indicate the string:

In [None]:
'Dwayne "The Rock" Johnson'

Of course, the two strings are *not* the same, because although single-quotes and double-quotes both mean "this is a string" in Python, when comparing the text **inside** the string, a single-quote and a double-quote are different characters.

In [None]:
"Dwayne 'The Rock' Johnson" == 'Dwayne "The Rock" Johnson'

If for some reason you really wanted to use double-quotes for strings everywhere but had a string with a double-quote in it, you can "escape" each of the inner quotes by putting a backslash before it.

In [None]:
"Dwayne \"The Rock\" Johnson"

As noted before, however, we don't like to have long lines in programs because they're difficult to read and difficult to maintain.  Python allows you to create multi-line strings using **three double-quotes** or **three single-quotes** in succession.
If you do this, everything between the quotes is included in the string.  Notice below that the newline characters (represented by `\n` in the output) are retained, so the formatting inside these strings is significant and is taken literally.

In [None]:
"""It is a period of civil war.
Rebel spaceships, striking
from a hidden base, have won
their first victory against
the evil Galactic Empire.

During the battle, Rebel
spies managed to steal secret
plans to the Empire's
ultimate weapon, the DEATH
STAR, an armored space
station with enough power to
destroy an entire planet.

Pursued by the Empire's
sinister agents, Princess
Leia races home aboard her
starship, custodian of the
stolen plans that can save
her people and restore
freedom to the galaxy....
"""

But what if you don't want to retain the newlines?  Python also automatically joins up strings which are adjacent.  So we could write:

In [None]:
crawler = (
    "It is a period of civil war. "
    "Rebel spaceships, striking "
    "from a hidden base, have won "
    "their first victory against "
    "the evil Galactic Empire."
)
crawler
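To compare the two approaches side by side, here is a short sketch using a made-up two-line string.  The triple-quoted version keeps its line breaks, while the adjacent literals are joined into a single line.

```python
# Triple-quoted: the line breaks are part of the string
verse = """Roses are red,
Violets are blue.
"""
print(repr(verse))         # 'Roses are red,\nViolets are blue.\n'
print(verse.count("\n"))   # 2

# Adjacent literals: joined into one string with no line breaks
joined = (
    "Roses are red, "
    "Violets are blue."
)
print(repr(joined))        # 'Roses are red, Violets are blue.'
print(joined.count("\n"))  # 0
```

Note the trailing space inside the first literal of `joined`: the literals are joined exactly as written, so any spacing between them has to be included explicitly.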

This feature is useful, but also leads to a type of error, which arises from a common kind of typo where you forget to put, for example, a comma between successive strings in a list.

For example, in the below I forgot the comma between `'city'` and `'temperature'`.  You might think this would be a syntax error and Python would tell you something is missing.  But actually, what it does is treat this as if you wrote `'citytemperature'`, and as a result the error we get is from `pandas` telling us there is no such column.  This can be a tricky kind of bug to track down because if you search your code for `citytemperature`, you won't find it...!

In [None]:
temps[['city' 'temperature']]
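The effect can be seen without `pandas` at all.  In this sketch (the variable names are made up), the list that is missing its comma silently ends up with one element instead of two:

```python
with_comma = ['city', 'temperature']
without_comma = ['city' 'temperature']   # adjacent literals are joined

print(len(with_comma))     # 2
print(len(without_comma))  # 1
print(without_comma)       # ['citytemperature']
```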

## Mini-focus: Documentation and help

In most Python environments, you can access documentation on objects, function calls, and so on from within the interface, using `help`.

Including these documentation strings ("docstrings") is something that package developers/maintainers are responsible for - but you'll find that mature and actively-developed packages, such as the ones we're focusing on in this module, will have very good coverage.  (In fact, those web-based help documents I've been linking to are actually generated by pulling these docstrings out of the code!)

In [None]:
help(temps)

In [None]:
help(temps.assign)

In [None]:
help(temps['city'])

In [None]:
help(temps['city'].str)

## Mini-focus: Functions (and lambda)

We've seen two ways to define functions: using `def` and using `lambda`.

Most functions are defined using `def`.  For example, to convert Centigrade to Fahrenheit, we could write:

In [None]:
def CtoF(x):
    "Converts from Centigrade to Fahrenheit temperatures"
    return 32 + 9*x/5

In [None]:
CtoF(10)

Python also allows functions to be defined using another syntax, using the `lambda` keyword.

In the below, inside the first set of parentheses we define our function, and then we call it.

In [None]:
(lambda x: 32 + 9*x/5)(10)

Functions defined by `lambda` work the same as those defined by `def`, with two restrictions:

1. `lambda` functions can only consist of a single expression
2. `lambda` functions are "anonymous".  They're intended to be used ad-hoc in one place.

The benefit of `lambda` functions is they are more compact to define, which can help with readability of code, as we'll see in a moment.
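As a quick check that the two forms behave alike, the sketch below defines the same conversion both ways.  Binding a `lambda` to a name, as done here with the made-up name `c_to_f_lambda`, is only for the sake of comparison; if you want a named function you would normally just use `def`.

```python
def c_to_f_def(x):
    return 32 + 9*x/5

c_to_f_lambda = lambda x: 32 + 9*x/5

# Both give the same result when called
print(c_to_f_def(10))      # 50.0
print(c_to_f_lambda(10))   # 50.0

# The function made by def knows its name; the lambda is anonymous
print(c_to_f_def.__name__)     # c_to_f_def
print(c_to_f_lambda.__name__)  # <lambda>
```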

Before we move on, you might notice I put a string in the first line of the function.  This is the function's `docstring`.  It's optional, but if you include it, it becomes the documentation of the function.

Python is somewhat unusual in that this convention of putting a string as the first line of a function definition is supported by the language itself.  In most languages, documentation is relegated to a comment, and you can't access it from within the language environment.  Python builds in documentation as a "first-class" concept.

This also means that because I gave my function a docstring, I can access that help using the `help` facility!

In [None]:
help(CtoF)
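Because the docstring is stored on the function object itself, it can also be read directly through the `__doc__` attribute.  A minimal sketch, redefining the function so the example stands alone:

```python
def CtoF(x):
    "Converts from Centigrade to Fahrenheit temperatures"
    return 32 + 9*x/5

# The docstring is available as an ordinary string
print(CtoF.__doc__)   # Converts from Centigrade to Fahrenheit temperatures

# A lambda has no place to put a docstring, so its __doc__ is None
anonymous = lambda x: 32 + 9*x/5
print(anonymous.__doc__)   # None
```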

## Mini-focus: Transforming data columns

We can put some of the above together to look more closely at what we are doing when we are transforming columns of data.  Using our simple temperature example, we'll look at several ways to solve the problem of creating a column of temperatures in Fahrenheit.  They're all "correct" - but we will also look at ways which are more robust and, importantly, easier for people to read.

The simplest way to do this would be something like this:

In [None]:
temps.assign(tempF=32 + 9*temps['temperature']/5)

One way that can improve readability is to put the calculation into a named function; the function name then makes the intent of the calculation more clear.  This is quite useful especially if you have this function defined for some other reason, such as part of a library.

In [None]:
temps.assign(tempF=CtoF(temps['temperature']))

But wait a minute here.  Previously we were calling `CtoF` just giving it a single number.  Now we are giving it a pandas `Series`.  What's going on here?

In [None]:
CtoF(temps['temperature'])

In [None]:
CtoF(10)

Python determines the types of variables while running the code.  When we define `CtoF` as above, all we need for the code to run is that multiplication, division, and addition are defined on whatever we pass as `x` to the function.  These operations are defined on `Series` (as well as on floating-point numbers).  So we can call this function on a `Series` and it does the right thing.

This is called "duck typing" in Python.  ("If it looks like a duck...")
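The same idea can be tried with types other than `Series`.  In the sketch below, `Fraction` from the standard library stands in as another type that defines multiplication, division, and addition, while a plain string shows what happens when one of those operations is missing.

```python
from fractions import Fraction

def CtoF(x):
    "Converts from Centigrade to Fahrenheit temperatures"
    return 32 + 9*x/5

print(CtoF(10))               # 50.0
print(CtoF(10.0))             # 50.0
print(CtoF(Fraction(1, 2)))   # 329/10  (exact fraction arithmetic)

# A string supports * (repetition) but not /, so the call fails
# when it reaches the division
try:
    CtoF("10")
    failed = False
except TypeError as err:
    failed = True
    print("TypeError:", err)
```

Nothing in `CtoF` checks the type of `x`; the function simply works for anything that supports the arithmetic it uses.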

Remember that in general, when we're working with `assign`, we work with the entire `Series` (or column) at once, and `pandas` applies each operation to every row for us.


In the above code, we were computing a `Series` by passing temperatures through `CtoF`, and then passing them to `assign`.  Instead of passing a `Series` to `assign`, we can pass a **function** to `assign`.  `pandas` will then call that function to compute the value for the column.

Here's where `lambda` functions can really shine.  When using `assign`, `pandas` will pass the `DataFrame` to your function, and so you can compute new columns using whatever combination of existing columns you might want.

In [None]:
temps.assign(tempF=lambda x: 32 + 9*x['temperature']/5)

Using a function with `assign` is also powerful because it works with the `DataFrame` as it is at the time you call `assign`.  So if you are chaining together several operations, you can write things like this - where we have better names for the temperature columns.

In [None]:
(
    temps.rename(columns={'temperature': 'tempC'})
    .assign(tempF=lambda x: 32 + 9*x['tempC']/5)
)
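To see why this matters, here is a sketch comparing the two approaches after a `rename`.  A precomputed `Series` has to come from the original `temps`, which has no `tempC` column, whereas the function is handed the renamed `DataFrame`.  The `DataFrame` is rebuilt here so the example stands alone.

```python
import pandas as pd

temps = pd.DataFrame([
    {'city': "Aberdeen", 'temperature': 0},
    {'city': "Norwich", 'temperature': 5}
])

# Passing a Series: temps itself still has no 'tempC' column,
# so looking it up fails before assign is even called
try:
    temps.rename(columns={'temperature': 'tempC'}).assign(
        tempF=32 + 9*temps['tempC']/5
    )
except KeyError as err:
    print("KeyError:", err)

# Passing a function: x is the renamed DataFrame, so 'tempC' exists
result = (
    temps.rename(columns={'temperature': 'tempC'})
    .assign(tempF=lambda x: 32 + 9*x['tempC']/5)
)
print(result['tempF'].tolist())   # [32.0, 41.0]
```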

There's a useful (but a bit advanced) Python language feature which works really well with `assign`.  When we write `columns=` or `tempF=` in the above, these are called "keyword arguments" to those functions (because we specify the names of the arguments, specifically `columns` or `tempF` in this case).

Python allows you also to specify keyword arguments to functions using **"dictionary unpacking"**.  We can create a dictionary of the arguments to the function, and then pass them to the function, putting `**` in front of the dictionary.

In [None]:
(
    temps.rename(columns={'temperature': 'tempC'})
    .assign(**{
        'tempF': lambda x: 32 + 9*x['tempC']/5
    })
)

In most cases there's no difference between the two styles of calling `assign`, because unpacking the dictionary using `**` is exactly the same in Python as just passing the keyword arguments.  However, you can have valid column names in `pandas` which are not valid argument names in Python.  The most common situation - which we've encountered already - is when there is a full-stop in the name of a column.

In a situation like the one below, we need to use the dictionary-unpacking method in order to assign to the column `temp.F`.

In my own practice, I just about **always** use dictionary-unpacking with `assign`.  But it is completely OK to use the keyword-argument method as well in the usual case.

In [None]:
(
    temps.rename(columns={'temperature': 'temp.C'})
    .assign(**{
        'temp.F': lambda x: 32 + 9*x['temp.C']/5
    })
)
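The underlying rule is plain Python rather than anything specific to `pandas`.  In this sketch, `show_columns` is a made-up function that simply returns whatever keyword arguments it receives:

```python
def show_columns(**kwargs):
    "Returns the keyword arguments it was called with, as a dict"
    return kwargs

# The two calling styles are equivalent for ordinary names
print(show_columns(tempF=50))           # {'tempF': 50}
print(show_columns(**{'tempF': 50}))    # {'tempF': 50}

# With unpacking, the key only has to be a string, so a full-stop is fine
print(show_columns(**{'temp.F': 50}))   # {'temp.F': 50}

# The keyword form is rejected as invalid syntax before anything runs;
# compile() is used here only so the error can be caught and shown
try:
    compile("show_columns(temp.F=50)", "<example>", "eval")
    rejected = False
except SyntaxError as err:
    rejected = True
    print("SyntaxError:", err.msg)
```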

## Mini-topic: using `query` with awkward column names

The `DataFrame` method called `query` is the usual way we select a subset of rows.  As we previously mentioned, it's a slightly unusual function because we write the expression for the selection criteria inside a text string, even though the expression we write inside of it is (usually) valid Python.  At this moment, let's not go into exactly why; you don't need to know the technical details in order to be able to get your work done!

As above, however, when we have awkward column names, such as those with a full-stop in them, we can't write them directly in the query expression:

In [None]:
(
    temps.rename(columns={'temperature': 'temp.C'})
    .assign(**{
        'temp.F': lambda x: 32 + 9*x['temp.C']/5
    })
    .query("temp.F > 40")
)

We can solve this by surrounding the column name with "back-ticks".  This is the character \`.  This is one of those characters that tends to go walkabout in terms of where it is on your keyboard - but don't confuse it with the single quote that we were using earlier for strings!

In [None]:
(
    temps.rename(columns={'temperature': 'temp.C'})
    .assign(**{
        'temp.F': lambda x: 32 + 9*x['temp.C']/5
    })
    .query("`temp.F` > 40")
)
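Back-ticks help with other awkward names too, such as column names containing a space.  The sketch below uses a made-up `DataFrame` called `awkward` and checks the result against ordinary boolean indexing.

```python
import pandas as pd

awkward = pd.DataFrame({
    'city': ["Aberdeen", "Norwich"],
    'max temp': [0, 5],
})

# Back-ticks let query refer to a column whose name contains a space
warm = awkward.query("`max temp` > 3")
print(warm['city'].tolist())   # ['Norwich']

# The same selection written with ordinary indexing, for comparison
same = awkward[awkward['max temp'] > 3]
print(warm.equals(same))       # True
```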

## When to use .str?

Let's suppose our incoming data wasn't quite so nicely formatted, and instead looked like this.

In [None]:
temps = pd.DataFrame([
    {'City': "aberdeen", 'Temperature': 0},
    {'City': "norwich", 'Temperature': 5}
])
temps

We could make some very good use of `lambda` functions to clean this data, because both `rename` and `assign` accept functions.

In [None]:
(
    temps.rename(columns=lambda x: x.lower())
    .assign(**{
        'city': lambda x: x['city'].str.title()
    })
)

Notice however a difference.  When we call `rename`, we call the string function `lower` directly on the object being passed.  Meanwhile, when we call `assign`, we refer to the `.str` member of `x['city']` and then call the string function `title` on that.  This looks confusing!  (And, it is a bit confusing.)

The difference is in what the functions are operating on.  When you call `rename` with a function, `pandas` passes each column name to the function in turn.  Column names are strings themselves.  So in that `lambda` function, `x` will be a string, and you can call string functions directly on it.

What are the string functions?  Happy you asked!



In [None]:
help(str)

Meanwhile, when we are using `assign`, remember that what is being passed is a `DataFrame`, and in particular `x['city']` is going to be a `Series`.  The `Series` class collects all of the string operations under `.str`:

In [None]:
help(pd.Series.str)

In short: If you are working with a `Series`, you'll access string operations using `.str`.  Principally this will happen when you are doing operations on a column as a whole - and so the most common situation will be in the context of `assign`.
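The contrast fits in a few lines.  In this sketch, `name` is a plain Python string and `names` is a `Series` of strings (both made up for the example):

```python
import pandas as pd

name = "aberdeen"
names = pd.Series(["aberdeen", "norwich"])

# A plain string: call the method directly
print(name.title())                 # Aberdeen

# A Series of strings: go through .str
print(names.str.title().tolist())   # ['Aberdeen', 'Norwich']

# Calling the string method directly on the Series does not work
try:
    names.title()
    direct_call_failed = False
except AttributeError as err:
    direct_call_failed = True
    print("AttributeError:", err)
```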