<!--BOOK_INFORMATION-->
<img align="left" style="padding-right:10px;" src="../../figures/PDSH-cover-small.png">

*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Marathon Finishing Times Visualization with Seaborn

Here we'll look at using Seaborn to help visualize and understand finishing results from a marathon.
Jake has scraped the data from sources on the Web, aggregated it and removed any identifying information, and put it on GitHub where it can be downloaded. Your instructor has downloaded it and it is included in the directory with the notebook.

In [None]:
# Import all the necessary libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import datetime
plt.style.use('classic')
%matplotlib inline
import seaborn as sns
sns.set()

### Load the data and examine some rows

In [None]:
data = pd.read_csv('marathon-data.csv')
data

### 1. What `type` are the `split` and `final` attributes of the dataframe?

By default, Pandas loads the time columns as Python strings (type `object`); confirm this by looking at the `dtypes` attribute of the DataFrame `data`.

In [None]:
# Fill in

### 2. Convert the `split` and `final` attributes to `datetime.timedelta`.

Write a converter function that turns the time strings into timedeltas. Then confirm that the `split` and `final` columns of the dataframe are now of type `timedelta64[ns]`.

In [None]:
def convert_time(s):
    # Fill in...

data = pd.read_csv('marathon-data.csv',
                   converters={'split':convert_time, 'final':convert_time})

In [None]:
# Fill in
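
One possible shape for such a converter is sketched below. It assumes the time strings look like `H:MM:SS` (for example `01:05:38`); that format is an assumption here, so check a few rows of the CSV before relying on it.

```python
import datetime

def convert_time(s):
    """Parse a time string into a datetime.timedelta.

    Assumes the string has the form 'H:MM:SS'.
    """
    hours, minutes, seconds = map(int, s.split(':'))
    return datetime.timedelta(hours=hours, minutes=minutes, seconds=seconds)

# Quick check on a made-up value
print(convert_time('01:05:38'))
```

Because `pd.read_csv` calls each converter once per cell, the function only needs to handle a single string at a time.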

### 3. Time in seconds

For the purpose of our Seaborn plotting utilities, let's next add columns `split_sec` and `final_sec` that give the times in seconds.

In [None]:
# .dt.total_seconds() gives float seconds without relying on integer casting
data.insert(4, 'split_sec', data['split'].dt.total_seconds())
data.insert(5, 'final_sec', data['final'].dt.total_seconds())
data.head(3)
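
As a small sanity check of the seconds conversion, the sketch below builds a hypothetical two-runner frame (the values are made up for illustration, not read from `marathon-data.csv`) and converts its timedelta columns with `Series.dt.total_seconds()`.

```python
import pandas as pd

# Hypothetical sample; not taken from the real data file
sample = pd.DataFrame({
    'split': pd.to_timedelta(['01:05:38', '01:06:26']),
    'final': pd.to_timedelta(['02:08:51', '02:09:28']),
})

# Convert each timedelta column to float seconds
sample['split_sec'] = sample['split'].dt.total_seconds()
sample['final_sec'] = sample['final'].dt.total_seconds()
print(sample[['split_sec', 'final_sec']])
```

Checking one value by hand (1 h 5 min 38 s is 3600 + 300 + 38 = 3938 s) is a quick way to confirm the units.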

#### Jointplot

To get an idea of what the data looks like, we can plot a `jointplot` over the data. The code for this is written for you; run it and make sure you understand what it does.

In [None]:
with sns.axes_style('white'):
    # x is the half-way time and y the final time, so the dotted line y = 2x marks a steady pace
    g = sns.jointplot(data=data, x="split_sec", y="final_sec", kind='hex')
    g.ax_joint.plot(np.linspace(4000, 16000),
                    np.linspace(8000, 32000), ':k')

The dotted line shows where someone's time would lie if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this indicates (as you might expect) that most people slow down over the course of the marathon.
If you have run competitively, you'll know that those who do the opposite—run faster during the second half of the race—are said to have "negative-split" the race.

**What did the code do?**

**Your answer:**

--- 

--- 

---

### 4. Split Fraction

Let's create another column in the data, the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race.

In [None]:
data['split_frac'] = 1 - 2 * data['split_sec'] / data['final_sec']
data.head(3)
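
To see how the sign of the split fraction behaves, here is a minimal sketch using made-up times in seconds. The helper `split_fraction` is a hypothetical name introduced only for this illustration; it applies the same formula as the cell above to a single runner.

```python
def split_fraction(split_sec, final_sec):
    """Return 1 - 2 * split / final for one runner."""
    return 1 - 2 * split_sec / final_sec

# All three hypothetical runners finish in 10000 s
even = split_fraction(5000, 10000)     # both halves take 5000 s
slowed = split_fraction(4500, 10000)   # second half slower: positive split
sped_up = split_fraction(5200, 10000)  # second half faster: negative split

print(even, slowed, sped_up)
```

An even pace gives exactly zero, a slower second half gives a positive value, and a faster second half gives a negative value.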

Where this split fraction is negative, the person negative-split the race by that fraction.
Let's do a distribution plot of this split fraction:

In [None]:
sns.histplot(data['split_frac'], kde=False);
plt.axvline(0, color="k", linestyle="--");

### 5. The number of people who negative-split their marathon

Find the number of people who negative-split their marathon.

In [None]:
# Fill in
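
One way to count entries below zero is to sum a boolean mask, since `True` counts as 1. The sketch below uses a small hypothetical Series rather than the real column; on the actual data the same pattern would be applied to `data['split_frac']`.

```python
import pandas as pd

# Hypothetical split fractions for five runners
frac = pd.Series([0.03, -0.01, 0.10, -0.002, 0.0])

# (frac < 0) is a boolean Series; summing it counts the True entries
n_negative = int((frac < 0).sum())
print(n_negative)
```

Note that a runner with a split fraction of exactly zero ran an even pace and is not counted as a negative split.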

Out of nearly 40,000 participants, there were only 250 people who negative-split their marathon.

### 6. Pair-grids

Let's see whether there is any correlation between this split fraction and other variables. We'll do this using a `PairGrid`, which draws a scatter plot for every pair of the chosen variables.

Look carefully at the pair grid. Because of the symmetry of the situation, we expect the plot of age against split_sec to depict the same information as the plot of split_sec against age, and similarly for every other pair. Do you find such symmetry in the grid? Simple checks like this are _essential_ for sanity-checking your results.

**Your answer**

---

---

In [None]:
# No coding required for this cell.
# Just run it and answer the question under Pair-grid
g = sns.PairGrid(data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
                 hue='gender', palette='RdBu_r')
g.map(plt.scatter, alpha=0.8)
g.add_legend();

It looks like the split fraction does not correlate strongly with age, but does correlate with the final time: faster runners tend to have closer to even splits on their marathon time.
(We see here that Seaborn is no panacea for Matplotlib's ills when it comes to plot styles: in particular, the x-axis labels overlap. Because the output is a simple Matplotlib plot, however, the methods in [Customizing Ticks](04.10-Customizing-Ticks.ipynb) can be used to adjust such things if desired.)

The difference between men and women here is interesting. Let's look at kernel density estimates of the split fractions for these two groups:

In [None]:
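# `fill` replaces the deprecated `shade` argument in recent Seaborn releases
sns.kdeplot(data=data.split_frac[data.gender=='M'], label='men', fill=True)
sns.kdeplot(data=data.split_frac[data.gender=='W'], label='women', fill=True)
plt.xlabel('split_frac')
plt.legend();  # show the 'men' / 'women' labels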

The interesting thing here is that there are many more men than women who are running close to an even split!
This almost looks like some kind of bimodal distribution among the men and women. Let's see if we can suss out what's going on by looking at the distributions more closely.

A nice way to compare distributions is to use a *violin plot*.

### 7. Violin Plots

Violin plots are yet another way to compare the distributions between men and women. Examine the two plots below and answer these questions:

1. What do you conclude from the observation that the blue butterfly extends _above_ the "wings" far more than the pink butterfly, while the pink butterfly extends further _below_ the wings?
2. What do you conclude from the observation that the pink butterfly's wings are _wider_ than those of the blue?

**Your answer**

--- 

---


In [None]:
sns.violinplot(x = "gender", y = "split_frac", data=data,
               palette=["lightblue", "lightpink"]);

![The end](https://live.staticflickr.com/32/89187454_3ae6aded89_b.jpg)

