<div style="float:left">
    <h1 style="width:450px">Practical 7: Visualising Data</h1>
    <h2 style="width:450px">Using Seaborn and Matplotlib</h2>
</div>
<div style="float:right"><img width="100" src="https://github.com/jreades/i2p/raw/master/img/casa_logo.jpg" /></div>

## This Week’s Overview

This week we're going to explore how visualising data (using pandas and seaborn) helps us to make more sense of our data. We're then going to move on to automating this process because coding isn't _just_ about being able to load lots of data, it's also about being able to be constructively lazy with the data once it's loaded.

## Learning Outcomes

By the end of this practical you should:
- Have created a set of different plots using seaborn
- Have automated the presentation of data for a number of columns
- Have grasped how a mix of graphs and numbers can help you make sense of your data

## Setup

As usual we will be using pandas, so we need to import the package:

In [1]:
import pandas as pd

We will start working with the initial LSOA data that we used previously, hosted at https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz?raw=true

In [None]:
df = pd.read_csv(
    'https://github.com/kingsgeocomp/geocomputation/blob/master/data/LSOA%20Data.csv.gz?raw=true',
    compression='gzip', low_memory=False) # low_memory=False reads the whole file in one go so that each column gets one consistent data type

Later we'll look at how we can add more data to this `DataFrame`, but first let's just check what columns of data we have:

In [None]:
df.columns

Okay, now we have our data loaded and we've reminded ourselves of what the data set contains (maybe by consulting the [metadata](https://github.com/kingsgeocomp/geocomputation/raw/master/data/LSOA_metadata.xlsx)) we can move on. 

## More Data Exploration

Let's start with some really obvious questions about the data. We looked at distributions and min/max values using built-in pandas methods (like `describe()`) the previous week, but that doesn't tell us _which_ places are very expensive/large/whatever. 

#### Task 1
Let's start by examining the mean price for AirBnB properties. Use the metadata spreadsheet and list of column names above to identify which Series name you should use to replace `???`:

In [None]:
df[???].describe()

And now let's figure out where the highest means can be found:

In [None]:
df.sort_values(by='MeanPrice', ascending=False).head(5)[['MeanPrice','LSOA11NM']]

That last command is actually quite complex, so let's take a second to unpack it:
```python
df.sort_values(by='MeanPrice', ascending=False).head(5)[['MeanPrice','LSOA11NM']]
```

1. Take the data frame `df`;
2. Sort it using the `MeanPrice` column in descending order (_i.e._ `ascending=False`);
3. Take the first five rows (`head(5)`);
4. And print out the columns specified by the list (`['MeanPrice','LSOA11NM']`).

Let's pull it apart step-by-step at the code level:

* The first step in this process is `df.sort_values` -- you can probably guess what this does: it sorts the data frame!
* The parameters passed to the `sort_values(...)` function are `by`, which is the column on which to sort, and `ascending=False`, which gives us the data frame sorted in _descending_ order!
* The output of `df.sort_values(...)` is _itself_ a new data frame, which means that we can simply add `.head(5)` to get the first five rows of the newly-sorted data frame.
* And the output of `df.sort_values(...).head(...)` is yet _another_ data frame, which means that we can print out the values of selected columns using the 'dictionary-like' syntax: we use the outer set of square brackets (`[...]`) to tell pandas that we want to access a subset of the top-5 data frame, and we use the inner set of square brackets (`['MeanPrice','LSOA11NM']`) to tell pandas which columns we want to see.

I'd say 'simples, right?' but that's obviously _not_ simple. It _is_, however, very, very _elegant_ because it's quite clear (once you get past the way that lots of commands can be chained together) and it's very succinct (we did all of that in _one_ line of code!).
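
To see that each link in the chain really does hand a new data frame to the next one, here is a minimal sketch using a tiny made-up data frame (the `toy` values and LSOA names are invented for illustration and are not taken from the real data):

```python
import pandas as pd

# Hypothetical toy data standing in for the LSOA data frame
toy = pd.DataFrame({
    'LSOA11NM':  ['A 001', 'B 002', 'C 003', 'D 004'],
    'MeanPrice': [80.0, 250.0, 45.0, 120.0],
})

step1 = toy.sort_values(by='MeanPrice', ascending=False)  # a new, sorted data frame
step2 = step1.head(2)                                     # its first two rows
step3 = step2[['MeanPrice', 'LSOA11NM']]                  # only the columns we want

# The chained version does the same three things in one line
chained = toy.sort_values(by='MeanPrice', ascending=False).head(2)[['MeanPrice', 'LSOA11NM']]
print(chained)
```

Writing the steps out separately like this is also a handy way to debug a long chain: you can inspect `step1`, `step2` and so on until you find the link that isn't doing what you expected.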

#### Task 2
Now, can you output the Mean and Median prices (as well as the LSOA11NM) for the 5 _cheapest (by median price)_ LSOAs?

In [None]:
df.sort_values(???).head(5)???

If all has gone well (using the full data set) you _should_ have something like:

|      | MeanPrice | MedianPrice | LSOA11NM |
|------|-----------|-------------|----------|
| 2417 | 0.0 | 0.0 | Hillingdon 010D |
| 2101 | 0.0 | 0.0 | Harrow 006D |
| 647 | 0.0 | 0.0 | Bromley 026B |
| 648 | 0.0 | 0.0 | Bromley 026C |
| 651 | 0.0 | 0.0 | Bromley 033B |

Not perhaps the _most_ useful thing to know -- it would probably be more useful to know the cheapest that _is not zero_.

#### Task 3
Edit the code below so that only LSOAs with a Median Price greater than zero are sorted and the cheapest five returned.

In [None]:
df[???].sort_values(by='MedianPrice', ascending=True).head(5)[['MeanPrice','MedianPrice','LSOA11NM']]

The cheapest _non_-zero LSOA should be Newham 032B (£9/night).

## Working with a Data Series

Implicitly, we've already done quite a bit with the Series (i.e. column) class offered by pandas, but I want to revisit it so that you understand why getting to grips with how the Series works (especially the 'index', which is a special type of Series) is crucial to getting the most out of pandas.

The easiest way to think about this: a `Series` is just another name for a pandas column.

### Adding a New Series

You may recall that you can add a new series to an existing data frame using the dictionary-like syntax:
```python
df['<new series name>'] = pd.Series(... <series definition> ...)
``` 
But just to remind you: see how familiar that syntax is? `df['key'] = value` is _exactly_ like creating and assigning a new key/value pair to a dictionary called `df`! The only difference here is that the 'value' we store in the dictionary is a Series object, and not a simple variable (String, int, float).

Let's do this for the price so that we get a sense of how expensive a place is _relative_ to all other listings... Perhaps something like standard deviations from the mean?

In [None]:
df['score'] = pd.Series((df['MeanPrice'] - df['MeanPrice'].mean()) / df['MeanPrice'].std())

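You might be tempted to do the same using 'dot notation':
```python
df.score = pd.Series((df.MeanPrice - df.MeanPrice.mean()) / df.MeanPrice.std())
```
However, pandas will _not_ create a new column this way: assigning to an attribute that doesn't exist yet simply attaches a plain Python attribute to the `df` object (and pandas issues a warning). Dot notation is fine for _reading_ an existing column, but to create one you should always write it as:
```python
df['score'] = pd.Series((df.MeanPrice - df.MeanPrice.mean()) / df.MeanPrice.std())
```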
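
The difference between attribute assignment and dictionary-style assignment is easy to check for yourself. The sketch below uses a small made-up data frame and a hypothetical column name (`zscore`), and hides the warning that pandas raises on the attribute assignment:

```python
import warnings
import pandas as pd

# Hypothetical toy data; 'zscore' is just an illustrative name
toy = pd.DataFrame({'MeanPrice': [80.0, 250.0, 45.0, 120.0]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')  # pandas warns about the next line
    toy.zscore = (toy.MeanPrice - toy.MeanPrice.mean()) / toy.MeanPrice.std()

before = 'zscore' in toy.columns     # attribute assignment: no column was created
print(before)

toy['zscore'] = (toy.MeanPrice - toy.MeanPrice.mean()) / toy.MeanPrice.std()
after = 'zscore' in toy.columns      # dictionary-style assignment: column exists
print(after)
```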

If we list the columns of the pandas data frame again we can see that it has a new Series (i.e. column) named 'score' (compare the output from the next code block to the same one above):

In [None]:
df.columns

And we can do things as we would have with any of the pre-existing Series:

In [None]:
df.score.describe()

Let's step through the code to make sense of what we did:
```python
df['score'] = pd.Series((df.MeanPrice - df.MeanPrice.mean()) / df.MeanPrice.std())
```

1. `pd.Series(...)` creates a new pandas Series from whatever data we pass it;
2. `(df.MeanPrice - df.MeanPrice.mean())` does something really clever: it takes each _individual_ row value of the `MeanPrice` column and _subtracts_ the mean price of the entire column using `df.MeanPrice.mean()`.
3. We then divide that by the standard deviation, which we calculate for the _entire_ column using `df.MeanPrice.std()`!

Doing this in Excel would be a bit more work and, more importantly, a _lot_ slower. This is especially true if what you want to do is get things working on a subset of the data _before_ analysing the entire data set in one go. Imagine doing this for 25,000,000 records and you begin to see how _scriptability_ is incredibly useful here.

### What's in a Score?

Do you remember any GCSE statistics? What this pandas command does:
```python
(df.MeanPrice - df.MeanPrice.mean()) / df.MeanPrice.std()
```

...is standardisation (more on this near the end of term):

$$
z = \frac{x - \bar x}{\sigma}
$$

This is often known as the _z-score_. The value of z that we have now tells us how far the mean price in _each_ individual LSOA is from the _average_ across all LSOAs in London, as a multiple of the standard deviation. So an LSOA priced at the average would have a value of 0, while the most expensive have the largest values...
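
As a quick sanity check of the formula, here is a sketch with three made-up prices chosen so that the arithmetic is easy to do in your head (note that pandas' `std()` calculates the _sample_ standard deviation by default):

```python
import pandas as pd

# Three invented prices: the mean is 100 and the sample standard deviation is 50
prices = pd.Series([50.0, 100.0, 150.0], name='MeanPrice')

z = (prices - prices.mean()) / prices.std()
print(z.tolist())  # one standard deviation below, at, and above the mean

# Going the other way: price = mean + z * standard deviation
recovered = prices.mean() + z * prices.std()
print(recovered.tolist())
```

The second calculation is worth remembering: if you know a value and its z-score you can work backwards towards the mean and standard deviation.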

#### Task 4

Using _only_ the output of the `sample(5)` function below, what do you think (_approximately_) is the mean price of an Airbnb property in London? Justify it by reference to the scores for these five LSOAs.

In [None]:
df.sort_values(by='score', ascending=False).sample(5, random_state=123456789)[['MeanPrice','score','LSOA11NM']]

#### Your answer here 

&lt;Your reasoning here&gt; 

Explain your reasoning before correcting and running the code below to find the true mean price.

In [None]:
df.???.mean()

## Visualising Data

OK, we've added one useful new column (data series) to our data frame just to see how it's done, but now let's get down to some visualisation!

### Start with a Chart

If we weren't learning how to program at the same time as we learn to do data analysis then my recommendation would have been this: **start with a chart**. Of course, we mean start with a _good_ chart:

[![Do maps lie?](http://img.youtube.com/vi/hYaoE4Kh9fk/0.jpg)](http://www.youtube.com/watch?v=hYaoE4Kh9fk)

There is _no_ better tool for understanding what is going on in your data than to visualise it, but we couldn't show you how to make a plot without first teaching you how to load data and perform some basic operations on a data frame! 

Now that we've done *that*, we can get to grips with VDQI (the [Visual Display of Quantitative Information](https://www.edwardtufte.com/tufte/books_vdqi)) and how this supports our understanding of the data.

### Why seaborn?

We can do some straightforward plotting directly from pandas, but for the data visualisation part of the practical we are going to use the [seaborn package](http://stanford.edu/~mwaskom/software/seaborn/) because it provides a lot of quite complex functionality (and very pretty graphs) at quite low 'cost' (_i.e._ effort). For some examples of the beautiful and powerful data visualisations possible, check out the seaborn gallery.

[![seaborn Gallery](https://kingsgeocomputation.files.wordpress.com/2018/10/seaborngallery.png)](http://seaborn.pydata.org/examples/index.html)

There are, however, other options out there that are worth checking out if you take things further; the two that you are most likely to hear mentioned are: [Bokeh](http://bokeh.pydata.org/en/latest/) and matplotlib. 

1. Bokeh is, like seaborn, designed to make it easy for you to create good-looking plots with minimal effort. 
2. Matplotlib is a different beast: it is actually the _underlying_ package that supports the majority of plotting (drawing graphs) in Python. 

So seaborn makes use of the matplotlib library to create its plots (Bokeh, by contrast, renders its plots in the web browser using its own JavaScript library), and if you want to customise a seaborn figure then you will eventually need to get to grips with matplotlib. The reason we don't teach matplotlib directly is that it's much harder to make a good plot and the syntax is much more complex.

A more recent entrant is yhat's (a data science 'joke') ggplot library, which deliberately mimics R’s ggplot2 (http://ggplot.yhathq.com). ggplot2 has become the dominant way of creating plots in the R programming language and it uses a 'visualisation' grammar that many people find incredibly powerful and highly customisable. Unfortunately, ggplot on Python does not currently support mapping (which ggplot2 does in R).

### Loading seaborn 

As with other libraries that we’ve used, we’ll import seaborn using an alias:
```python
import seaborn as sns
```
So to access seaborn's functions we will now always just write `sns.<function name>()` (where `<function name>` would be something like `distplot`). 

### But First!

<span style="color:red;">Important Note for Mac Users</span>

Recent changes in the way that the Mac OS handles the plotting of data means that you need to do certain things in a specific order at the start of any notebook in which you intend to show maps or graphs. Please make a copy of the following code for any notebook that you create and make it the _first_ code that you run in the notebook:

```python
# Needed on a Mac
import matplotlib as mpl
mpl.use('TkAgg')
%matplotlib inline
import matplotlib.pyplot as plt
```

For non-Mac users the code above should run fine, but not all of it is entirely necessary and Windows users could likely get away with:
```python
%matplotlib inline
import matplotlib.pyplot as plt
```
This _should_ enable you to create plots, including in the practical that we're about to start! If you forget to run this code _before_ trying to use seaborn then you will probably need to restart the Kernel (Kernel > Restart from the menu). If you do _that_ you will lose all of your 'live' work (_i.e._ variables, loaded modules, etc.).

The `%matplotlib inline` command only needs to be run _once_ in a jupyter notebook; it tells jupyter to show the plots as part of the web page, rather than trying to show them in a separate window. So the easiest thing to do is just stick whatever code you need at the top of your notebook so that you _always_ run it once when you start up a notebook and can then forget about it.

In [None]:
# Needed on a Mac (also fine for Windows)
import matplotlib as mpl
mpl.use('TkAgg')
%matplotlib inline
import matplotlib.pyplot as plt

And of course we need to import the seaborn package itself:

In [None]:
import seaborn as sns

## Making Distribution Plots

One of the most useful ways to get a sense of a data series is simply to look at its overall distribution. Something like this:
```python
sns.distplot(<data series>)
```

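Run the code below to see how we can plot the distribution of the mean AirBnB price using the seaborn `distplot` function ([don't worry](https://stackoverflow.com/a/52595447/10219907) about a `FutureWarning` if you receive one; from seaborn 0.11 onwards `distplot` is itself deprecated in favour of `histplot`, `kdeplot` and `displot`, so a warning about that is also to be expected).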

In [None]:
sns.distplot(df.MeanPrice, hist=False)

Note the x-axis range on the plot.

Now let's plot the distribution of the _score_ Series, using a different argument passed to the `distplot` function. 

In [None]:
sns.distplot(df.score, kde=False)

Now note the x-axis range... check you understand why it has the range of values it does (remember this is the _z-score_ not the absolute values).

If all went well you should see an 'okay' (kind of 'meh', really) distribution plot for the price! But there's a lot of blank space there on the plot for LSOAs with z-score greater than 5. So let's try plotting but with z-scores greater than 5 removed:

In [None]:
sns.distplot(df.MeanPrice[df.score < 5], kde=False)

### Recap

OK, let's take a second here: although there was a lot of setup work that needed to be done, we just created a distribution plot in one line of code _after_ filtering out values that we felt were skewing our view of the data. _One line_. This is a more sophisticated plot than you could ever create in Excel and you just created it in one line of code!

Let's review:
```python
sns.distplot(df.MeanPrice[df.score < 5], kde=False)
```
We are:
1. Creating a seaborn distribution plot (`sns.distplot(...)`)
2. Using the price column (`df.MeanPrice`)
3. But first selecting only those data where the z-score was less than 5 (`[df.score < 5]`).

Pretty cool huh? You'll notice in Point 3 the square-bracket syntax `df.MeanPrice[...]`: just as we used square brackets to select columns by name above, here we use them to select rows, keeping only those for which the condition inside the brackets is true.
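
If the filtering step feels like magic, it can help to pull the condition out into its own variable. This is a minimal sketch on a made-up data frame (the prices and z-scores are invented, and `mask` and `kept` are just illustrative names):

```python
import pandas as pd

# Hypothetical toy data: the last row plays the part of an extreme outlier
toy = pd.DataFrame({
    'MeanPrice': [80.0, 95.0, 110.0, 2000.0],
    'score':     [-0.5, -0.4, -0.3, 6.2],
})

mask = toy.score < 5         # a Series of True/False values, one per row
print(mask.tolist())

kept = toy.MeanPrice[mask]   # only the rows where the mask is True
print(kept.tolist())
```

So `df.MeanPrice[df.score < 5]` is really two steps: build a Series of booleans, then use it to pick out rows.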

To see what else you can do with a `distplot`, why not check out the help, either in the notebook using `?sns.distplot` or [online](https://seaborn.pydata.org/generated/seaborn.distplot.html).

## Making Scatter Plots

Another type of plot we might want to make to look at relationships between variables (i.e. Series) is some kind of scatterplot. There are a few different ways to do this using seaborn, including `jointplot` and `pairplot` functions. 

In the next plot we'll compare two variables, but it's worth understanding what we're looking at:
> Occupancy rating provides a measure of whether a household's accommodation is overcrowded or under occupied. There are two measures of occupancy rating, one based on the number of rooms in a household's accommodation, and one based on the number of bedrooms. The ages of the household members and their relationships to each other are used to derive the number of rooms/bedrooms they require, based on a standard formula. The number of rooms/bedrooms required is subtracted from the number of rooms/bedrooms in the household's accommodation to obtain the occupancy rating. An occupancy rating of -1 implies that a household has one fewer room/bedroom than required, whereas +1 implies that they have one more room/bedroom than the standard requirement.

> '1 bedroom' includes households who indicated '0 bedrooms' and '1 bedroom'. This is because all households where someone usually lives must have at least one room used as a bedroom.

> The following occupancy rating (bedrooms) classifications are available: <br/>
>  All categories: Occupancy rating (bedrooms) <br/>
>  Occupancy rating (bedrooms) of +2 or more <br/>
>  Occupancy rating (bedrooms) of +1 <br/>
>  Occupancy rating (bedrooms) of 0 <br/>
>  Occupancy rating (bedrooms) of -1 or less <br/>

In [None]:
sns.jointplot(x=df['MedianIncome'], y=df['ORbedsM1'], color='#4CB391') # Recent seaborn versions require x and y to be passed by name

Check you understand how the `jointplot` [function](https://seaborn.pydata.org/generated/seaborn.jointplot.html) works (note, we have to pass _two_ Series) and what the output contains (e.g. What does each point in this plot represent?).

Let's try tidying up the chart a bit (this is more for the purposes of demonstration for now -- you can try playing with parameters and seeing what happens!):

In [None]:
sns.set_style("whitegrid")
# jointplot creates its own figure, so no separate plt.figure() call is needed
p = sns.jointplot(x=df['MedianIncome'], y=df['ORbedsM1'], 
              color='#4CB391',
              height=10, # this parameter was called 'size' before seaborn 0.9
              joint_kws={"s":2})
# The x-axis label comes first, then the y-axis label
p = (p.set_axis_labels("Median Income","Occupancy Rating (<= -1 Bedrooms)",fontsize=15))
plt.suptitle("Overcrowding and Income",fontsize=25,y=0.99)
p.fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.savefig("Income and Overcrowding.png")

Another seaborn function for visualising relationships between variables (Series) in a dataframe is `pairplot`. This function plots scatter plots for _every_ combination of variables. As the dataframe we are using has very many variables, this would take some time to plot. So, to use `pairplot` let's first select only the AirBnB variables:

In [None]:
AirBnB = df.loc[:,'MedianPrice':'SharedRoom']
AirBnB.columns
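
The `.loc[:, 'first':'last']` syntax used above slices columns by _label_ and, unlike ordinary Python slicing, it includes _both_ end points. A minimal sketch with a made-up data frame (the column names `a` to `d` are placeholders):

```python
import pandas as pd

# Hypothetical toy data with four columns in a known order
toy = pd.DataFrame({'a': [1], 'b': [2], 'c': [3], 'd': [4]})

subset = toy.loc[:, 'b':'c']   # all rows, columns 'b' through 'c' inclusive
print(list(subset.columns))
```

This only gives you what you expect if the columns you want sit next to one another in the data frame, which is why it is worth checking the output of `.columns` afterwards.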

And now we can create our pairplot (it may still take a few moments, despite reducing the size of the dataframe): 

In [None]:
sns.pairplot(AirBnB)

Clearly this is a great way to do some exploratory data analysis: which variables seem to be strongly related, and which more weakly? Certainly Excel can't do this and definitely not in one line of code!

## Other Types of Plots

Using our original dataframe, let’s have a look at some other types of plots... We'll start with the basics, then follow up by playing with [aesthetics](https://seaborn.pydata.org/tutorial/aesthetics.html) that change how the plot looks.

### Boxplot

Single variable for all data

In [None]:
sns.boxplot(x=df.MeanPrice)

To make things a bit neater, we can use the fact that `boxplot` (like some other seaborn functions) returns a matplotlib axes object. Then we can use the `set` methods of that axes object to format the plot.

In [None]:
ax = sns.boxplot(x = "MeanPrice", 
              data = df, 
              color = "red")
ax.set(xlabel="Mean Price (£)",title="AirBnB Data")
ax.set_xlim([0,400])
plt.show()

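The final line above explicitly shows the plot. In a notebook where we have already run `%matplotlib inline` the figure would be displayed at the end of the cell anyway, but calling `plt.show()` makes the intent clear and is what you would need if you ran the same code as a script.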
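
Because the object that seaborn hands back is an ordinary matplotlib axes object, the same `set` methods work on _any_ matplotlib plot. The sketch below uses plain matplotlib and made-up numbers; the `Agg` backend line is only there so that it runs as a stand-alone script without a display, and you would leave it out in a notebook:

```python
import matplotlib
matplotlib.use('Agg')  # assumption: running outside a notebook; omit this line in Jupyter
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.hist([10, 20, 30, 40, 500])             # invented values with one extreme outlier

ax.set(xlabel='Mean Price', title='Toy Data')
ax.set_xlim([0, 100])                      # zoom in so the outlier doesn't dominate

print(ax.get_xlabel(), ax.get_title(), ax.get_xlim())
plt.close(fig)
```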

We could also plot a single variable, data split by Local Authority Districts (LADs):

In [None]:
plt.figure(figsize=(15, 6))
sns.boxplot(x='LAD11NM', y='MeanPrice', data=df.sort_values(by='MeanPrice'))

We can also use aesthetics to make the plot larger and change the colours:

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) # plt is the matplotlib.pyplot alias we imported earlier
b = sns.boxplot(ax=ax, 
        x='LAD11NM', 
        y='MeanPrice', 
        data=df.sort_values(by='MeanPrice'), 
        palette='PRGn', 
        fliersize=4, 
        linewidth=1)
plt.ylim(-10, 800)
sns.despine(offset=10, trim=True)

But the LAD names are still not legible. So let's do the last plot again, but then add a loop to rotate each of the LAD names so that they can be easily read:

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) # plt is the matplotlib.pyplot alias we imported earlier
b = sns.boxplot(ax=ax, 
        x='LAD11NM', 
        y='MeanPrice', 
        data=df.sort_values(by='MeanPrice'), 
        palette='PRGn', 
        fliersize=4, 
        linewidth=1)
plt.ylim(-10, 800)
sns.despine(offset=10, trim=True)
for item in b.get_xticklabels():
    item.set_rotation('vertical')
plt.title("Price Distribution")

### Violin Plot

Similar to a boxplot, but shows variation differently.

In [None]:
sns.violinplot(y=df.MeanPrice[df.LAD11NM == "Merton"]) # Passing the data as y gives a vertical violin

### Regression Plot

Another type of scatter plot but includes a regression line.

In [None]:
sns.regplot(x='MeanPrice', y='MedianPrice', data=df)

## Saving Plots

One way to save a plot that you create in a notebook as an image file that you could use in a report, is to simply right-click (or Cmd-click) the image and 'Save as...'. Try it on one of the figures you made above. 

But... wouldn’t it be a lot easier if we could save our plot automatically and not even have to use the mouse to do so? This is where we need to use matplotlib syntax (and where you'll see why we opted not to spend too much time on it):

```python
series = df.MeanPrice 
fig = plt.figure(series.name)
sns.distplot(series)
fig = plt.gcf()
plt.savefig("{0}-distplot.png".format(series.name), bbox_inches="tight")
plt.close()
```

To explain what's happening here: 

1. We assign the price data series to a new variable (this means that, to plot a different column such as `MedianPrice`, we'd only have to change this one line... see: we're preparing to use a `for` loop!)
2. We create a `figure` object into which seaborn can 'print' its outputs, labelled with the series name.
3. We call seaborn and ask it to print a `distplot` (it doesn't really need to know that it's printing to something, it just needs to know what 'device' it should use for output).
4. We _G_et the _C_urrent _F_igure environment so that the next command works
5. We save the plot, using the `format` command to replace the `{0}` with the name of the data series (so in this example we'd be saving our figure to `MeanPrice-distplot.png`).
6. We then close the figure output so that we don't print other plots over top of it.
 
The plot should have been saved to your working directory (where the Jupyter notebook is running) using the name of the data series that you were working with. We want to use string replacement (the `{0}`) so that when we save the plot for another column, such as `MedianPrice`, we don't overwrite the one for `MeanPrice`!
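
Here is a minimal sketch of why building the file name from the series name matters once a loop is involved. It uses plain matplotlib histograms and invented numbers rather than seaborn and the real data, writes into a temporary directory (`out_dir` is just an illustrative name), and selects the `Agg` backend only so that it can run as a stand-alone script:

```python
import os
import tempfile
import matplotlib
matplotlib.use('Agg')  # assumption: running outside a notebook; omit this line in Jupyter
import matplotlib.pyplot as plt

out_dir = tempfile.mkdtemp()   # somewhere harmless to write the files
toy = {'MeanPrice': [80, 95, 110], 'MedianPrice': [70, 90, 100]}  # invented values

saved = []
for name, values in toy.items():
    fig = plt.figure(name)     # a new figure, labelled with the series name
    plt.hist(values)
    path = os.path.join(out_dir, "{0}-distplot.png".format(name))
    plt.savefig(path, bbox_inches="tight")
    plt.close()                # close it so the next plot starts from scratch
    saved.append(os.path.basename(path))

print(saved)
```

Each pass through the loop produces a different file name, so nothing gets overwritten.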

Let's try saving a plot of the area data! Run the code below, which creates `series` from the `Area` column, then go and check that a file has been made in your working directory (where this notebook is saved).

In [None]:
series = df.Area                 
fig = plt.figure(series.name) 
sns.distplot(series)          
fig = plt.gcf()               
plt.savefig("{0}-distplot.png".format(series.name), bbox_inches="tight")
plt.close()                   

## Exercises

1. Create a violinplot to compare population density distributions between Local Authority Districts

In [None]:
# To get you started
df['Density'] = ???

2. Create a ['KDE' jointplot with marginal distributions](https://seaborn.pydata.org/generated/seaborn.jointplot.html) to examine the relationship between total greenspace area and population density in Islington for LSOAs that have some greenspace.

3. Create a [barplot](https://seaborn.pydata.org/generated/seaborn.barplot.html) to compare number of household residents in LSOAs in the 'City of London' Local Authority District. The plot should have:
    - a y-axis label of 'Household Residents'
    - an x-axis label of 'LSOA'
    - a title of 'City of London'
    - x-axis tick labels rotated by 45 degrees

# Automation

When we're undertaking an analysis of a data set, we often have to perform the same (or at least similar) tasks for many different columns. We *could* copy and paste the code, and then just change the variable names to update the analysis... but that would be a definite instance of what Larry Wall would have called 'false laziness': it seems like a time-saving device in the short run, but in the long run you've made your code less readily maintainable (what if you want to _add_ to your analysis or find a bug?) and less easy to understand.

There are nearly always two things that you should look at if you find yourself repeating the same code: 
1. write a `for` loop; 
2. consider writing a function.

### Using a `for` loop

For example, let's try producing `distplots` for several variables. We should be able to do this using two things:
* A `for` loop to iterate over the column names
* Dictionary-style notation to access each column in the data frame by name

How about we try: `'Area that is Designated Greenspace', 'Average number of bedrooms per household', 'Population Density', 'Accomodation type: Private room'`? (**Note:** you'll need to work out the simple names of these!)

I'll get you started... this also demonstrates how we can use a string held in a variable as the key in pandas' dictionary-style notation to retrieve a data series.

In [None]:
my_vars = ["GreenspaceArea","BedsHH","POPDEN","PrivateRoom"]
for c in my_vars:
    plt.figure() # Create a new figure so they don't over-print
    sns.distplot(df[c], hist=False)
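
The same pattern works for anything you might want to do to a column, not just plotting. A minimal sketch on a made-up data frame (the values are invented, and `medians` is just an illustrative name) that collects the median of each column in turn:

```python
import pandas as pd

# Hypothetical toy data borrowing two of the column names used above
toy = pd.DataFrame({
    'GreenspaceArea': [0.0, 1200.5, 300.0],
    'POPDEN':         [55.2, 120.9, 80.1],
})

medians = {}
for c in ['GreenspaceArea', 'POPDEN']:
    series = toy[c]                          # the string in c is used as the key
    medians[series.name] = series.median()   # series.name is the same string

print(medians)
```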

## Exercise

1. Use a `for` loop to create a separate boxplot for each variable in the _Airbnb_ columns of the dataframe (EntireHome, PrivateRoom, SmallHost, MultiHost). Make sure that each plot has a title.

2. Use a `for` loop to plot the distribution (not histogram) of the following property count variables as a proportion of the total AirBnB property count, **on the same plot**:
    - EntireHome
    - PrivateRoom
    - SmallHost
    - MultiHost

The code below gets you started:

In [None]:
counts = ???
for v in counts:
    a = sns.distplot(??? / df.PropertyCount, ???=False, label=???)
    a.set(xlabel="")
a.set_xlim(0,1)
a.set(???='Hosts/Rooms in Category', ???='Proportion')

### Catching Potential Errors 

Above, we selected variables (Series) to plot, knowing that they were numeric. But what if we wanted to just plot everything in the dataframe? The problem with that is that some of our variables (Series) are not numeric. It often happens that we will be working with data but not knowing its type. Let's use an example of calculating summary statistics to see how we can deal with this challenge before going on to apply it to plotting figures. 

Here's the code, try it!

In [None]:
for c in df.columns.values[5:]:
    series = df[c]
    print('Summarising ' + series.name)
    series.describe()['mean']

As you've just now discovered we can't summarise every column because they are not all numeric. So it's time to introduce a useful little 'emergency' handler for what Python calls '[exceptions](https://docs.python.org/2/tutorial/errors.html)'. 

Let's test this with the first few columns of the dataframe, which we know are text (testing with a subset of data is always a good idea!):

In [None]:
for c in df.columns[2:5]: # Notice that we start with some testing!!!
    series = df[c]
    try:              # Try to do something
        print('Summarising ' + series.name)
        print("\tMean:    {0:> 9.2f}".format(series.mean()))
        print("\tMedian:  {0:> 9.2f}".format(series.median()))
    except TypeError: # If you see this problem don't blow up please!
        print("\tData cannot be summarised numerically.")

So using `try` and `except` helps us to 'catch' possible errors so that our code can continue to execute instead of crashing. To see how this works, compare the output from the last code block with output from the next code block and then answer this question:

_Why do we see the output for columns like `MSOA11CD` 'twice' even though we don't expect it?_

In [None]:
for c in df.columns[2:5]:
    series = df[c]
    try:              # Try to do something
        print('Summarising ' + series.name)
        print("\tMean:    {0:> 9.2f}".format(series.mean()))
        print("\tMedian:  {0:> 9.2f}".format(series.median()))
    except TypeError: # If you see this problem don't blow up please!
        print("Series " + series.name + " cannot be summarised numerically.")

What the `try:except` does is allow us to intercept the error _before_ Python simply gives up and throws an error at you. For automation this is a really useful feature since it allows us to figure out if there are problems and do something about them before the user is left trying to figure out what went wrong! You should read up on these as they are very useful!
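
Stripped of pandas altogether, the pattern looks like this. The sketch below uses plain Python lists and a hypothetical helper called `safe_mean`; the column names and values are made up:

```python
def safe_mean(values):
    """Return the mean of values, or None if they cannot be added up."""
    try:
        return sum(values) / len(values)
    except TypeError:   # e.g. trying to add strings to numbers
        return None

# Invented data: one text 'column' and one numeric 'column'
columns = {
    'LSOA11NM':  ['Bromley 026B', 'Harrow 006D'],
    'MeanPrice': [80.0, 120.0],
}

results = {name: safe_mean(vals) for name, vals in columns.items()}
print(results)
```

Notice that only the exception we _named_ is caught: an empty list would still fail with a `ZeroDivisionError`, which is usually what you want because it means that unexpected problems don't get silently hidden.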

So now we've done our testing, let's run the code for the entire _df_ dataframe (it may take a few moments):

In [None]:
for c in df.columns:
    series = df[c]
    try:              # Try to do something
        print('Summarising ' + series.name)
        print("\tMean:    {0:> 9.2f}".format(series.mean()))
        print("\tMedian:  {0:> 9.2f}".format(series.median()))
    except TypeError: # If you see this problem don't blow up please!
        print("Series " + series.name + " cannot be summarised numerically.")

### Adding Graphs

OK, now we've got the `for` loop working and can handle issues relating to whether or not the column is actually numeric. So we've solved _one_ part of the problem. We can now move on to the _next_ part of the problem: creating a chart from each of the numeric columns.

To help us make sense of the data it will be useful to add some additional information to our distribution plots: lines to show the location of the mean, median, and outlier thresholds.

To do this, we need to get at the library that seaborn itself uses: `matplotlib`.

In [None]:
# Setup work -- enables parameterisation
series = df['MeanPrice'][df.score.abs() <= 6]
fig = plt.figure(series.name)

# Create the plot
d    = sns.distplot(series)
# Find the limits
ymin = d.get_ylim()[0]
ymax = d.get_ylim()[1]

# Now add mean and median
plt.vlines(series.mean(), ymin, ymax, colors='red', linestyles='solid', label='Mean')
plt.vlines(series.median(), ymin, ymax, colors='green', linestyles='dashed', label='Median')

# Add outlier marks (more than 1.5 times the IQR below the 1st quartile or above the 3rd quartile)
iqr = series.quantile(0.75)-series.quantile(0.25)
if series.quantile(0.25)-1.5*iqr > 0:
    plt.vlines(series.quantile(0.25)-1.5*iqr, ymin, ymax, colors='blue', linestyles='dotted', label='Lower Outlier')
if series.quantile(0.75)+1.5*iqr > 0:
    plt.vlines(series.quantile(0.75)+1.5*iqr, ymin, ymax, colors='blue', linestyles='dotted', label='Upper Outlier')
plt.legend() # Show the labels that we attached to each line


That's a lot of code in that last block! Hopefully, you can work out how most of it works, but the `plt.vlines` function might be new. Try `?plt.vlines` or `help(plt.vlines)` to discover what parameters the function takes. 
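
The outlier thresholds are the least obvious part of that block, so here is the calculation on its own. This is a minimal sketch using nine made-up numbers, chosen so that the quartiles fall exactly on data points:

```python
import pandas as pd

# Invented values: eight well-behaved numbers and one extreme one
s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])

q1, q3 = s.quantile(0.25), s.quantile(0.75)   # 1st and 3rd quartiles
iqr = q3 - q1                                 # inter-quartile range

lower = q1 - 1.5 * iqr                        # anything below this is a low outlier
upper = q3 + 1.5 * iqr                        # anything above this is a high outlier

outliers = s[(s < lower) | (s > upper)]
print(lower, upper, outliers.tolist())
```

Here the lower threshold is negative, which is why the plotting code checks that a threshold is greater than zero before drawing it: for data like prices a line at a negative value would not be meaningful.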

### Creating a Function

Now that we've got the figure looking right for _one_ column, why don't we try to create a useful _function_ -- we _can_ always do this directly within the `for` loop, but a function is more elegant since it makes it simpler to see what is going on. The function is below, check you understand how it works. 

In [None]:
def create_plot(s):

    fig = plt.figure(s.name)

    # Create the plot
    d    = sns.distplot(s.dropna())
    # Find the limits
    ymin = d.get_ylim()[0]
    ymax = d.get_ylim()[1]

    # Now add mean and median
    plt.vlines(s.mean(), ymin, ymax, colors='red', linestyles='solid', label='Mean')
    plt.vlines(s.median(), ymin, ymax, colors='green', linestyles='dashed', label='Median')

    # Add outlier marks (more than 1.5 times the IQR above or below the 1st and 3rd quartiles)
    iqr = s.quantile(0.75)-s.quantile(0.25) # Use the function's own argument (s), not the global 'series'
    if s.quantile(0.25)-1.5*iqr > 0:
        plt.vlines(s.quantile(0.25)-1.5*iqr, ymin, ymax, colors='blue', linestyles='dotted', label='Lower Outlier')
    if s.quantile(0.75)+1.5*iqr > 0:
        plt.vlines(s.quantile(0.75)+1.5*iqr, ymin, ymax, colors='blue', linestyles='dotted', label='Upper Outlier')

    # Show the labels set on each line above
    plt.legend()

    fig = plt.gcf() # *G*et the *C*urrent *F*igure environment so that the next command works
    plt.savefig("{0}-automated-plot.png".format(s.name), bbox_inches="tight")
    plt.close() 

Now we can combine this function with a loop to create summaries and plots for _multiple_ Series.  

Fix the `???` then run the code:

In [None]:
for ??? in df.columns[1:20]:
    series = df[c]
    print('Summarising ' + series.name)
    try:              # Try to do something        
        # Print numerical summaries
        print("\tMean:    {0:> 9.2f}".format(series.mean()))
        print("\tMedian:  {0:> 9.2f}".format(series.median()))
        print("\tMin:     {0:> 9.2f}".format(series.min()))
        print("\tMax:     {0:> 9.2f}".format(series.max()))
        print("\tIQR:     {0:> 9.2f}".format(series.quantile(0.75)-series.quantile(0.25)))
        
        # Create graphical summary
        print("\tCreating graph...")
        create_plot(???)    #here's the function defined above
        
    except TypeError: # If you see this problem don't blow up please!
        print("\tData cannot be summarised numerically.")

print("Done!")

If you fixed the code correctly, you should see output above with `Done!` on the last line. 

From the output above, how many image files should have been created in your working directory? **Go and check the right number are there!** (think about what filename format they should have).  
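If you would rather check from within the notebook than in a file browser, something like the following sketch should work. It assumes the plots were saved to the current working directory using the `-automated-plot.png` suffix from `create_plot`; the variable name `plots` is just for illustration:

```python
from pathlib import Path

# Find every file in the working directory whose name ends with the
# suffix used by plt.savefig() in create_plot()
plots = sorted(Path('.').glob('*-automated-plot.png'))

print("Found {0} plot file(s)".format(len(plots)))
for p in plots:
    print("\t" + p.name)
```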

So this is all pretty cool, huh? We can now make data summaries and plots as image files for _multiple_ variables with just a few lines of code. What's more, because the plotting code is encapsulated within a function, and the loop that calls it includes exception handling, we can re-use this code in future!
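Catching a `TypeError` is one way to cope with non-numeric columns. Another option, sketched below, is to ask pandas which columns are numeric _before_ looping. The `toy` DataFrame and its column names are made up for illustration and stand in for the real LSOA data:

```python
import pandas as pd

# A small made-up DataFrame mixing text and numeric columns
toy = pd.DataFrame({
    'Code':       ['A01', 'A02', 'A03'],
    'Borough':    ['Camden', 'Camden', 'Hackney'],
    'MeanPrice':  [450000.0, 520000.0, 390000.0],
    'Households': [610, 702, 655],
})

# select_dtypes keeps only the columns whose dtype is numeric
numeric_cols = toy.select_dtypes(include='number').columns.tolist()

for c in numeric_cols:
    s = toy[c]
    print("Summarising " + s.name)
    print("\tMean:    {0:> 12.2f}".format(s.mean()))
    print("\tMedian:  {0:> 12.2f}".format(s.median()))
```

With this approach the loop never sees the text columns at all, so there is nothing to catch. The trade-off is that a column of numbers that was accidentally read in as text would be skipped silently rather than reported.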

## And finally: 3D Plots

Let's see how to create a 3D scatter plot. In this case the plot doesn’t add a lot to our understanding of the data, but there are cases where it might, and it does illustrate how pandas, seaborn, and matplotlib work together to produce some pretty incredible outputs. Here's some code to examine:

In [None]:
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.colors as colours 
import matplotlib.cm as cmx

# Set up the figure
w, h = 12, 8
fig = plt.figure(figsize=(w, h))
ax  = fig.add_subplot(111, projection='3d')

# Set up the 3D axes
x = df['BedsHH']
y = df['MedianIncome']
z = df['PrivateRented']

# Set up the colourmap so that we see
# different colours for each borough's
# data.
# From: http://stackoverflow.com/questions/28033046/matplotlib-scatter-color-by-categorical-factors
boroughs  = list(set(df.LAD11NM)) 
hot       = plt.get_cmap('hot')
cNorm     = colours.Normalize(vmin=0, vmax=len(boroughs))
scalarMap = cmx.ScalarMappable(norm=cNorm, cmap=hot)

for i in range(len(boroughs)):
    indx = df.LAD11NM==boroughs[i]
    ax.scatter(x[indx], y[indx], z[indx], c=[scalarMap.to_rgba(i)], marker='.')

ax.set_xlabel(x.name)
ax.set_ylabel(y.name)
ax.set_zlabel(z.name)

Look into the options in the code above in more detail:
* Can you change the colour map used to indicate which borough each LSOA is drawn from?
* Can you change the markers used for each borough so that they are different?
* Can you add a legend to indicate which marker is for which borough?

## Credits!

#### Contributors:
The following individuals have contributed to these teaching materials: Jon Reades (jonathan.reades@kcl.ac.uk), James Millington (james.millington@kcl.ac.uk)

#### License
These teaching materials are licensed under a mix of [The MIT License](https://opensource.org/licenses/mit-license.php) and the [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/).

#### Acknowledgements:
Supported by the [Royal Geographical Society](https://www.rgs.org/HomePage.htm) (with the Institute of British Geographers) with a Ray Y Gildea Jr Award.

#### Potential Dependencies:
This notebook may depend on the following libraries: pandas, matplotlib, seaborn