# Seaborn Axes Plots

This chapter introduces the seaborn visualization library in Python. seaborn has a high-level, easy-to-use interface for creating powerful and beautiful visualizations. Like pandas, seaborn relies entirely on matplotlib to do all of the actual plotting. The library is fairly minimal and exposes relatively few functions.

## The seaborn API

Visit the [seaborn API page][1] for an overview of the library. The top of the page lists the sections below, which contain most of the plotting functions in the library. The other sections cover less central topics such as building specific grids and controlling aesthetics. This chapter focuses only on the plots in these sections.

* Relational
* Categorical
* Distribution
* Regression
* Matrix

### Axes and Grid plots

All of the seaborn plotting functions return either a matplotlib axes or a seaborn grid. As the name implies, axes plots use a single matplotlib axes. Grid plots are more complex and are composed of a matplotlib figure with multiple axes. From the five sections in the API above, `relplot`, `displot`, `catplot`, `lmplot`, and `clustermap` are the grid plots. These will be the focus of the following chapter. In this chapter, we focus on the axes plots.
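
The distinction can be checked from the return type. The following is a minimal sketch on a small made-up DataFrame (the `toy` data and its column names are placeholders, not one of the chapter's datasets), and it assumes a non-interactive matplotlib backend.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, only for illustration
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b'],
                    'value': [1.0, 2.5, 3.0, 4.5, 6.0]})

# An axes plot returns a matplotlib Axes
ax = sns.boxplot(x='value', data=toy)
print(type(ax).__name__)

# A grid plot creates its own figure and returns a seaborn grid object
grid = sns.catplot(x='value', y='group', data=toy, kind='box')
print(type(grid).__name__)
plt.close('all')
```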

### A different categorization of plots

If you look at the seaborn API, you'll notice a section on categorical plots. Unfortunately, this isn't labeled properly as there are multiple distribution plotting functions such as box and violin plots in that section. Instead of using seaborn's classification, I like to divide the plotting functions into the following three categories:

* **Distribution plots** - These plots show the distribution of some set of points of a continuous valued variable. Examples of these are box, violin, histogram, and KDE plots.
* **Grouping and aggregating plots** - These plots group by some categorical variable and aggregate another. Examples of these plots include bar, count, and point plots.
* **Raw data plots** - These plots do not change the underlying data. They simply display it. Some examples are scatter and line plots along with heatmaps.

[1]: http://seaborn.pydata.org/api.html

## seaborn integration with pandas

seaborn is tightly integrated with pandas. Nearly all seaborn plotting functions contain a `data` parameter that accepts a pandas DataFrame. This allows you to use **strings** of the column names for the function arguments.

### The four common seaborn plotting function parameters - `x`, `y`, `hue`, and `data`

Most seaborn plotting function signatures look similar and use the parameters `x`, `y`, `hue`, and `data`. The syntax will often look like one of the following lines of code, where `x`, `y`, and `hue` are all optional and set to a column name if used.

```python
>>> sns.plotting_func(x='col1', data=df)
>>> sns.plotting_func(y='col1', data=df)
>>> sns.plotting_func(x='col1', y='col2', data=df)
>>> sns.plotting_func(x='col1', y='col2', hue='col3', data=df)
```

## Distribution Plots

We'll begin our adventure in seaborn by making distribution plots. We've seen how to make box plots, histograms, and KDEs both directly with matplotlib and with pandas. seaborn uses the functions `boxplot` for box plots, `histplot` for histograms, and `kdeplot` for both univariate and bivariate KDEs. By convention, seaborn is imported as `sns`. Let's begin by reading in Airbnb listing data from Washington, D.C.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
plt.style.use('../../mdap.mplstyle')
airbnb = pd.read_csv('../data/airbnb.csv')
airbnb.head(3)

### Univariate distribution plots

Univariate distribution plots involve a single numeric variable. For these plots, set either the `x` or the `y` parameter to a DataFrame column name. Choosing `x` creates a horizontal plot, while `y` creates a vertical one. Let's create a simple horizontal box plot of the price and assign the result (which is an axes) to a variable. Because many values are much larger than the median, we use only listings priced under &#36;1,000.

In [None]:
ax = sns.boxplot(x='price', data=airbnb.query('price < 1000'));

### Recreate box plot with matplotlib

seaborn creates the above plot by calling matplotlib's `boxplot` method, providing it many different settings for the median line, box, whiskers, caps (vertical lines at the end of the whiskers), and fliers. The plot is recreated below directly with matplotlib.

In [None]:
fig, ax = plt.subplots()
ax.boxplot(x='price', data=airbnb.query('price < 1000'), 
           widths=.8, vert=False, patch_artist=True,
           medianprops={'color': '.25', 'lw': 1.5},
           boxprops={'ec': '.25', 'lw': 1.5, 'fc': '#3274a1'},
           whiskerprops={'color': '.25', 'lw': 1.5}, 
           capprops={'color': '.25', 'lw': 1.5},
           flierprops={'marker': 'd', 'mfc': '.25', 'mec': '.25', 'ms': 5});

Let's do some work to change the appearance of this box plot. The height of the box is unnecessarily large. We'll control this by making the figure height much smaller. All seaborn axes plots have an `ax` parameter that can be set to a previously created axes.

The maximum price of the data is much larger than 1,000 and using a linear scale would compress the above data into less than 10% of the axes width. We'll use a log scale instead, which places major ticks every power of 10 (along with minor ticks) and uses scientific notation as the format. We use the `ticker` module to specify more major tick locations, remove the minor ticks, and format the x-axis labels as dollars. All spines but the bottom are made invisible.

In [None]:
from matplotlib import ticker
fig, ax = plt.subplots(figsize=(4, .6))
sns.boxplot(x='price', data=airbnb, whis=(5, 95), ax=ax)
ax.set_xscale('log')
ax.xaxis.set_major_locator(ticker.LogLocator(base=10, subs=(1, 2, 5)))
func = lambda x, pos: f'${x:,.0f}' if x < 1000 else f'${x // 1000:.0f}k'
ax.xaxis.set_major_formatter(ticker.FuncFormatter(func))
ax.xaxis.set_minor_locator(ticker.NullLocator())
ax.yaxis.set_major_locator(ticker.NullLocator())
ax.spines['left'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
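
To see what the formatting function does on its own, here is a small standalone sketch. The name `dollar_label` is just an illustrative stand-in for the lambda above; matplotlib passes the tick value and its position to the function.

```python
def dollar_label(x, pos=None):
    """Format a tick value as dollars, abbreviating thousands with 'k'."""
    if x < 1000:
        return f'${x:,.0f}'
    # Floor division drops the remainder, so 2,500 would be shown as $2k
    return f'${x // 1000:.0f}k'

for value in (50.0, 200.0, 1000.0, 5000.0):
    print(value, '->', dollar_label(value))
```

Because the major ticks are placed at 1, 2, and 5 times each power of 10, the truncation never changes a displayed label.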

### Vertical plots

To make a vertical plot, set the `y` parameter to the data you want to plot. To remove spines, you can use the seaborn `despine` function instead of accessing each spine directly like we did above. By default, the top and right spines are removed. We tell it to remove the bottom spine as well.

In [None]:
fig, ax = plt.subplots(figsize=(.6, 2))
sns.boxplot(y='price', data=airbnb, ax=ax, whis=(5, 95))
ax.set_yscale('log')
sns.despine(ax=ax, bottom=True)

### Histograms

Histograms are made by the seaborn `histplot` function. Below, we plot a histogram of the rental price of each listing under &#36;400. Like all seaborn plotting functions, `histplot` has many parameters. Here we filter the data before plotting; the `binrange` parameter is an alternative way to limit the range of values in the histogram.

In [None]:
fig, ax = plt.subplots(figsize=(4.5, 2))
sns.histplot(x='price', data=airbnb.query('price < 400'), ax=ax);

Like most seaborn plots, the orientation of the visualization can be changed from vertical to horizontal by using `y` instead of `x`. A number of other parameters are set, which are described fully in the [histplot documentation][0].

[0]: https://seaborn.pydata.org/generated/seaborn.histplot.html#seaborn.histplot

In [None]:
fig, ax = plt.subplots(figsize=(4, 2.5))
sns.histplot(data=airbnb, y='price', bins=20, binrange=[0, 1000], 
             stat='probability', ax=ax, color='orange', ec='darkgreen');
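
With `stat='probability'`, the bar heights are the bin counts divided by the total number of counted observations, so they sum to 1. A rough sketch of that normalization with NumPy, on randomly generated stand-in prices rather than the Airbnb data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical right-skewed prices standing in for the real column
prices = rng.lognormal(mean=4.8, sigma=0.7, size=5_000)

# Same binning as above: 20 equal-width bins between 0 and 1,000
counts, edges = np.histogram(prices, bins=20, range=(0, 1000))
probability = counts / counts.sum()   # share of counted observations per bin

print(edges[:3])            # the first few bin edges
print(probability.sum())    # the bar heights add up to 1
```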

### KDE Plots

Kernel Density Estimation or KDE plots are created with the `kdeplot` function. Here, we estimate the distribution of prices less than &#36;400.

In [None]:
sns.kdeplot(x='price', data=airbnb.query('price < 400'), fill=True);
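
Conceptually, a KDE places a small bell curve on every observation and averages them. The sketch below does this by hand with NumPy, using a Gaussian kernel and Scott's rule of thumb for the bandwidth. It illustrates the idea on made-up data and is not seaborn's actual implementation; the function name `gaussian_kde_1d` is hypothetical.

```python
import numpy as np

def gaussian_kde_1d(data, grid):
    """Evaluate a Gaussian kernel density estimate of `data` at `grid`."""
    data = np.asarray(data, dtype=float)
    n = data.size
    # Scott's rule of thumb for one-dimensional data
    bandwidth = data.std(ddof=1) * n ** (-1 / 5)
    # Standardized distance from every grid point to every observation
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z ** 2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(1)
sample = rng.normal(loc=150, scale=40, size=500)   # hypothetical prices
grid = np.linspace(sample.min() - 150, sample.max() + 150, 400)
density = gaussian_kde_1d(sample, grid)

# The area under a density curve should be close to 1 (trapezoidal rule)
area = np.sum((density[1:] + density[:-1]) / 2 * np.diff(grid))
print(round(area, 3))
```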

Both histograms and KDEs can be plotted together by setting the `kde` parameter of the `histplot` function to `True`. Keyword arguments for the KDE plot can be forwarded to it via the `kde_kws` parameter.

In [None]:
fig, ax = plt.subplots(figsize=(4.5, 2))
sns.histplot(x='price', data=airbnb.query('price < 1000'), kde=True, ax=ax);

### Bivariate KDE plots

A bivariate KDE estimates the density of two numeric values co-occurring from two different variables. The `kdeplot` function in seaborn produces a bivariate KDE as contour lines of different colors along a sequential color map. Let's construct an example and then explain it further. We read in a few columns from the housing dataset, which has more continuous variables that make for better examples.

In [None]:
cols = ['OverallQual', 'GrLivArea', 'SalePrice']
housing = pd.read_csv('../data/housing.csv', usecols=cols)
housing.head(3)

As with the univariate plots, pass the column names to the `x` and `y` parameters and the DataFrame to `data`. Here, we'll estimate the joint distribution of living area and sale price. We choose to shade in the contours and clip both the x and y limits to where the majority of the data is located.

In [None]:
fig, ax = plt.subplots(figsize=(4, 2))
sns.kdeplot(x='GrLivArea', y='SalePrice', data=housing, 
            fill=True, clip=((500, 2_500), (50_000, 300_000)), ax=ax);

The darkest areas represent the greatest concentration of data. All contours with the same color have approximately the same probability of occurrence. In our dataset, houses around 1,000 square feet of living area priced at around &#36;140,000 are more common than others.

It's possible to plot multiple bivariate KDEs on the same axes. Here we use the same two variables to construct the KDE, but filter the houses with the lowest and highest overall quality into separate DataFrames. The distributions for each group are quite distinct and do not overlap. We choose to plot just the contours without shading.

In [None]:
fig, ax = plt.subplots(figsize=(4, 2.5))
df = housing.query('OverallQual <= 3')
sns.kdeplot(x='GrLivArea', y='SalePrice', data=df, 
            clip=((500, 2_500), (50_000, 300_000)), ax=ax, label='3 or less')

df = housing.query('OverallQual >= 9')
sns.kdeplot(x='GrLivArea', y='SalePrice', data=df,
            clip=((500, 2_500), (50_000, 500_000)), ax=ax, label='9 or more')
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', title='Overall Quality');

## Seaborn style sheets

There are several seaborn style sheets that alter the appearance of a plot by changing some of the matplotlib configuration settings. Call the `set_style` function with one of the strings `'darkgrid'`, `'whitegrid'`, `'dark'`, `'white'`, or `'ticks'`. We will use `'darkgrid'` for the remainder of this chapter. It uses a light gray background with white grid lines and removes the tick marks. We'll plot a histogram and KDE like the one above to view the contrast in styles.

In [None]:
sns.set_style('darkgrid')
fig, ax = plt.subplots(figsize=(4.5, 2))
sns.histplot(x='price', data=airbnb.query('price < 400'), ec='black', kde=True);
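
To see which settings a style changes, the seaborn `axes_style` function returns the parameters for a named style without applying them. A short sketch comparing two styles (the particular keys inspected are just examples):

```python
import seaborn as sns

dark = sns.axes_style('darkgrid')
white = sns.axes_style('white')

# Print a few settings that differ between the two styles
for key in ('axes.grid', 'axes.facecolor', 'grid.color'):
    print(f'{key:15} darkgrid={dark[key]!r:12} white={white[key]!r}')
```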

## Other distribution plots

The following functions are also capable of producing univariate distribution plots and work similarly to `boxplot`. We'll only cover `violinplot` below.

* `violinplot`
* `stripplot`
* `swarmplot`
* `boxenplot`

### Violin plots

Violin plots produce the same KDE as `kdeplot` but duplicate the curve so that it appears both above and below the baseline. The final shape often resembles a violin because many distributions have a single peak and a long tail like the list price. seaborn computes the density itself and draws the filled shape with matplotlib, rather than calling the matplotlib `violinplot` method.

In [None]:
fig, ax = plt.subplots(figsize=(4, 1))
sns.violinplot(x='price', data=airbnb.query('price < 400'), ax=ax);

To prove that the violin plot really is a KDE, we use `kdeplot` to create just the KDE. The shape of the curve is exactly the same.

In [None]:
fig, ax = plt.subplots(figsize=(4, 1))
sns.kdeplot(x='price', data=airbnb.query('price < 400'), ax=ax);

Look back up at the original violin plot and you'll see a small white circle inside of a rectangle with a line running through it. seaborn creates a miniature box plot without fliers inside of the violin plot.

## Automatic grouping by category

Thus far it would seem that seaborn isn't all that useful or necessary as pandas can duplicate all of the above plots besides the violin plot. The big benefit from seaborn comes when you want to group or split the data without manually doing so with DataFrame operations. seaborn automatically splits data into independent groups just by providing it the DataFrame column name.

For instance, let's say we are interested in making a box plot of listing price for each of the neighborhoods. This isn't possible to do using pandas without a for-loop iterating through each unique neighborhood name, filtering the data for that neighborhood and making a box plot. With seaborn, we just need to supply the grouping and plotting variables to the `x` and `y` parameters like this:

```python
sns.boxplot(x='price', y='neighborhood', data=airbnb)
```

seaborn chooses the column that has the data type of either object or categorical to use as the **grouping variable** and creates a different box plot for each of the unique values in this column using the other variable as the data. Internally, it uses pandas to make these splits, before using matplotlib to make the box plots.

Since there are dozens of unique neighborhoods, we'll plot just the five most common neighborhoods. It's not even necessary to filter the data first with seaborn, as you can set the `order` parameter to the subset of categories that you want to plot. The boxes will also appear in that order. The fliers are not shown so that focus is drawn to the middle of the distribution to better compare neighborhood prices.

In [None]:
top5 = airbnb['neighborhood'].value_counts().index[:5]
sns.boxplot(x='price', y='neighborhood', data=airbnb, order=top5,
            whis=(5, 90), showfliers=False);
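
For intuition, the split that seaborn performs for each box is roughly the same as a pandas `groupby` on the grouping column. The following sketch computes the percentiles behind each box and its whiskers on a small made-up DataFrame (the neighborhood names and prices are placeholders).

```python
import pandas as pd

toy = pd.DataFrame({
    'neighborhood': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],
    'price': [50, 80, 100, 150, 90, 120, 200, 400],
})

# One row per group; the columns are the percentiles used with whis=(5, 90)
summary = (toy.groupby('neighborhood')['price']
              .quantile([.05, .25, .5, .75, .9])
              .unstack())
print(summary)
```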

### Grouping with two numeric variables

It's possible that your grouping variable is numeric, and when this happens seaborn automatically chooses `x` as the grouping variable. Below is an attempt to make a box plot of price for each unique number of bedrooms, but seaborn does the opposite, and makes a box plot for each unique price.

In [None]:
sns.boxplot(x='price', y='bedrooms', data=airbnb.head(100));

To force seaborn to use the variable set to `y` as the grouping column, set `orient` to `'h'` (horizontal). Set it to `'v'` to force a vertical plot. We show only the listings with four or fewer bedrooms by using the `order` parameter.

In [None]:
ax = sns.boxplot(x='price', y='bedrooms', data=airbnb.query('price < 1000'),
                 orient='h', order=[0, 1, 2, 3, 4]);

## Grouping within groups with `hue`

Let's say we were interested in creating box plots for different numbers of bedrooms in each neighborhood. seaborn allows you to create groups within your first group by setting the `hue` parameter to a column name. Below, we initially group by neighborhood. Within each neighborhood, we group by number of bedrooms and create a box plot for each of these groups.

Just as we used `order` to filter the main grouping column, we can use `hue_order` to filter this other grouping column. Both are used to limit the total number of combinations below. Also, the `fliersize` is decreased from its default value of 5 to fit better in this cramped plot.

In [None]:
fig, ax = plt.subplots(figsize=(5, 3))
sns.boxplot(x='price', y='neighborhood', hue='bedrooms', 
            data=airbnb.query('price < 1000'), order=top5,
            hue_order=[0, 1, 2, 3], whis=(5, 90), fliersize=2, ax=ax)
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', title='Bedrooms')
ax.set_xlim(-50, 1_000);

### Grouping with violin plots

Violin plots can group and split data in the same manner. Here, we make vertical plots grouped first by the number of persons that the listing can accommodate before splitting into each of the top 5 most frequent neighborhoods. The data is filtered for listings under &#36;300. The `cut` parameter, when set to 0, does not show any part of the violin where actual data is not present.

In [None]:
fig, ax = plt.subplots(figsize=(6, 2))
airbnb_300 = airbnb.query('price < 300')
sns.violinplot(x='accommodates', y='price', data=airbnb_300, hue='neighborhood',
               order=[1, 2, 3, 4], hue_order=top5, cut=0, linewidth=1.2, ax=ax)
ax.legend(bbox_to_anchor=(.04, 1.1), ncol=5, loc='center left', fontsize='small');

Violin plots have an interesting feature that allows you to make a comparison whenever there are exactly two categories for the `hue` parameter. Setting `split` to `True` creates a violin with the KDEs of each category on either side. Below, we compare prices of two neighborhoods for several different values of accommodates.

In [None]:
fig, ax = plt.subplots(figsize=(6, 2))
airbnb_300 = airbnb.query('price < 300')
neighs = ['Columbia Heights', 'Dupont Circle']
sns.violinplot(x='accommodates', y='price', data=airbnb_300, hue='neighborhood', split=True,
               order=[1, 2, 3, 4, 5, 6], hue_order=neighs, cut=0, ax=ax)
ax.legend(bbox_to_anchor=(.5, 1.1), ncol=5, loc='center', fontsize='small');

## Tidy data

seaborn is built to work with tidy data. It does all the grouping and aggregating for you. This is what makes it powerful. Once your data is tidy, you can create many different plots directly from seaborn without any further manipulation.

### Comparison to pandas

The pandas `plot` method works quite differently from the seaborn plotting functions. In pandas, each column of the DataFrame is plotted as its own line, bar, box, histogram, etc. With pandas, you are responsible for grouping and aggregating the data on your own before plotting. With seaborn, the grouping and aggregating happens internally.
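
The difference in expected layout can be sketched with a tiny made-up table. pandas wants one column per line or bar (wide data), while seaborn wants one row per observation (tidy data). The pandas `pivot` and `melt` methods convert between the two. All names and values below are placeholders.

```python
import pandas as pd

# Tidy: one row per observation, as seaborn expects
tidy = pd.DataFrame({
    'year': [2019, 2019, 2020, 2020],
    'city': ['X', 'Y', 'X', 'Y'],
    'sales': [10, 20, 15, 25],
})

# Wide: one column per city, as the pandas plot method expects
wide = tidy.pivot(index='year', columns='city', values='sales')
print(wide)

# melt goes back from wide to tidy
back = wide.reset_index().melt(id_vars='year', value_name='sales')
print(back)
```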

## Grouping and Aggregating Plots

Our next category of plots is those that group and aggregate. These plots compute a single summary statistic for some numeric column of data within each group. The following are the grouping and aggregating functions. The first three are found in the Categorical section of the API, while `lineplot` is found in the Relational section:

* `barplot`
* `pointplot`
* `countplot`
* `lineplot`

### Univariate grouping and aggregating plots

Let's begin again like we did with the distribution plots by using a single numeric variable. The `barplot` function aggregates by taking the **mean** of the values by default. Below, we call the `barplot` function to create a single bar of the mean of the prices for all the listings.

In [None]:
fig, ax = plt.subplots(figsize=(1, 1.5))
sns.barplot(y='price', data=airbnb, ax=ax)
ax.set_title('Average Price for All Listings');

Notice the small black line appearing at the top of the bar. It represents the 95% confidence interval of the calculated statistic. It is computed by a procedure called bootstrapping which randomly samples the data with replacement to create an entire 'new' dataset. The mean of this new dataset is calculated. By default, 1,000 of these new datasets are produced and means calculated for each one. The 2.5 and 97.5 percentiles are found from these 1,000 values and used to draw the line above. Let's do this procedure ourselves with pandas by first calculating 1,000 different means and storing them in a Series.

In [None]:
price = airbnb['price']
means = pd.Series(price.sample(frac=1, replace=True).mean() for i in range(1000))
means.head()

We can use the `quantile` method to find the range of the 95% confidence interval.

In [None]:
means.quantile([.025, .975])
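
The same percentile bootstrap can be written with NumPy alone. This is a sketch on randomly generated stand-in data with a fixed seed, so it illustrates the procedure rather than reproducing the numbers above.

```python
import numpy as np

rng = np.random.default_rng(42)
prices = rng.lognormal(mean=4.8, sigma=0.7, size=2_000)  # hypothetical prices

n_boot = 1_000
# Each row is one resampled dataset drawn with replacement
resamples = rng.choice(prices, size=(n_boot, prices.size), replace=True)
boot_means = resamples.mean(axis=1)

# Percentile interval: the middle 95% of the bootstrapped means
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f'mean={prices.mean():.1f}, 95% CI=({low:.1f}, {high:.1f})')
```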

These numbers should be very close to the y-values from the little black line in our plot from above, which we retrieve below.

In [None]:
ax.lines[0].get_ydata()

The number of new datasets used during bootstrapping can be controlled with the `n_boot` parameter and the confidence level with `ci`. (From seaborn 0.12 onward, `ci` is deprecated in favor of the `errorbar` parameter, for example `errorbar=('ci', 80)`.) Let's make the same bar plot, but with an 80% confidence interval and fewer bootstrapped datasets.

In [None]:
fig, ax = plt.subplots(figsize=(1, 1.5))
sns.barplot(y='price', data=airbnb, ax=ax, ci=80, n_boot=200)
ax.set_title('Average Price for All Listings (80% CI)');

### The `estimator` parameter

Both `barplot` and `pointplot` have an `estimator` parameter, which is used to choose the **aggregating function** and defaults to the numpy mean function. Unfortunately, seaborn doesn't allow you to use the string names of the function like pandas, so you'll have to use the numpy function directly. Here, we take the median of all listings and choose not to show the confidence interval by setting `ci` to `None`.

In [None]:
fig, ax = plt.subplots(figsize=(1, 1.5))
sns.barplot(y='price', data=airbnb, estimator=np.median, ci=None, ax=ax)
ax.set_title('Median Price for All Listings');

### Grouping by one variable

Just as we did with box and violin plots, we can group data by providing both the `x` and `y` parameter. seaborn uses the categorical column as the grouping column and aggregates the numeric column. Here, we get the median price of each neighborhood.

In [None]:
top10 = airbnb['neighborhood'].value_counts().index[:10]
sns.barplot(x='neighborhood', y='price', data=airbnb, estimator=np.median, order=top10);

The bars show the correct median for each neighborhood, but the overlapping tick labels make the presentation sloppy. seaborn does not provide an option to format tick labels, so we'll do it ourselves. We could rotate the labels, but instead we write a helper function that wraps them at a given character width using `fill` from the standard library's `textwrap` module. The function also sets the font size.

In [None]:
import textwrap
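def wrap_labels(ax, width, fontsize='medium'):
    """Wrap the x-axis tick labels of `ax` at `width` characters."""
    labels = []
    for label in ax.get_xticklabels():
        text = label.get_text()
        # Break only at whitespace so that long words stay intact
        labels.append(textwrap.fill(text, width=width, break_long_words=False))
    # Replace the labels, setting the font size and removing any rotation
    ax.set_xticklabels(labels, fontsize=fontsize, rotation=0)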
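
To see the wrapping on its own, the short sketch below applies `textwrap.fill` to two neighborhood names. The widths are arbitrary choices for illustration.

```python
import textwrap

names = ['Columbia Heights', 'Dupont Circle']

# At width 10, each name breaks at the space between its two words
wrapped = [textwrap.fill(name, width=10, break_long_words=False) for name in names]
for text in wrapped:
    print(repr(text))

# With break_long_words=False, a word longer than the width is kept whole
narrow = textwrap.fill('Columbia Heights', width=6, break_long_words=False)
print(repr(narrow))
```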

We recreate the same plot, but make it wider and use the new labels at a smaller size.

In [None]:
fig, ax = plt.subplots(figsize=(6, 1.5))
sns.barplot(x='neighborhood', y='price', data=airbnb, estimator=np.median, order=top10)
ax.set(title='Median Price by Neighborhood', xlabel='', ylabel='')
wrap_labels(ax, width=10, fontsize='small')

We can further split each of these neighborhoods into more groups with the `hue` parameter. Here, we find the median price by neighborhood for listings that can accommodate up to five persons. The confidence interval is also removed as it can take quite a long time to compute with many different groupings.

In [None]:
fig, ax = plt.subplots(figsize=(6, 1.5))
sns.barplot(x='neighborhood', y='price', data=airbnb, estimator=np.median, hue='accommodates',
            order=top10, hue_order=[1, 2, 3, 4, 5], ci=None, ax=ax)
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', title='Accommodates')
ax.set(title='Median Price by Neighborhood', xlabel='', ylabel='')
wrap_labels(ax, width=10, fontsize='small')
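
The numbers behind a grouped bar plot like this one can be reproduced with a two-level pandas `groupby`. A sketch on made-up data (all values are placeholders):

```python
import pandas as pd

toy = pd.DataFrame({
    'neighborhood': ['A', 'A', 'A', 'B', 'B', 'B'],
    'accommodates': [1, 2, 2, 1, 1, 2],
    'price': [60, 90, 110, 80, 100, 150],
})

# Rows are the x-axis groups, columns are the hue groups
medians = (toy.groupby(['neighborhood', 'accommodates'])['price']
              .median()
              .unstack())
print(medians)
```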

### Point plots

The `pointplot` function behaves similarly to `barplot` but instead of creating bars, it places points at the calculated statistic. When not specifying `hue`, a single line connects the calculated statistic for each group. The `scale` parameter controls the relative size of the line and point and is set to 1 by default. The confidence interval of the statistic is placed as a vertical line through the point. Use `errwidth` and `capsize` to control its appearance. 

In [None]:
fig, ax = plt.subplots(figsize=(6, 1.5))
sns.pointplot(x='neighborhood', y='price', data=airbnb, estimator=np.mean,
              order=top10, scale=.5, errwidth=1, capsize=.2, ax=ax)
ax.set(title='Mean Price by Neighborhood', xlabel='', ylabel='')
wrap_labels(ax, width=10, fontsize='small')

Notice that the y-axis does not begin at zero like it did with bar plots, so it can give the illusion of greater differences between neighborhoods than actually exist.  If you split the data using `hue`, a separate line for each unique category is created. Here, we recreate the last bar plot as a point plot and manually set the y-axis so it begins at 0.

In [None]:
fig, ax = plt.subplots(figsize=(6, 1.5))
sns.pointplot(x='neighborhood', y='price', data=airbnb, estimator=np.median, hue='accommodates',
              order=top10, hue_order=[1, 2, 3, 4, 5], ci=None, scale=.5, ax=ax)
wrap_labels(ax, width=10, fontsize='small')
ax.legend(bbox_to_anchor=(1, 1), loc='upper left', title='Accommodates')
ax.set(title='Median Price by Neighborhood', xlabel='', ylabel='', ylim=(0, 300));

### Count plots

The `countplot` function can be thought of as a specific case of `barplot` that calculates the size of each group as its only aggregating function. Let's count the frequency of each unique response time.

In [None]:
fig, ax = plt.subplots(figsize=(3, 1.8))
sns.countplot(x='response_time', data=airbnb, ax=ax)
wrap_labels(ax, width=10)

This is the exact same calculation produced by the Series `value_counts` method.

In [None]:
airbnb['response_time'].value_counts()

The `countplot` function does not allow you to pass it both `x` and `y`, but you can use `hue` to further split the data. Here, we split by the binary variable `'superhost'`. Airbnb defines a "superhost" as an experienced host who offers exceptional experiences.

In [None]:
fig, ax = plt.subplots()
sns.countplot(x='response_time', data=airbnb, hue='superhost', ax=ax)
wrap_labels(ax, width=10)

This plot of raw counts can be misleading, since the two host types do not have the same total number of listings. You might wrongly infer from the graph that both types of host have the same rate of response for "within an hour". Let's replicate the raw counts using the pandas `crosstab` function.

In [None]:
pd.crosstab(index=airbnb['response_time'], columns=airbnb['superhost'])

Instead, normalizing by the total count of each host type could get us a more accurate comparison between them. Now, we can see that 86.4% of superhosts answer within an hour vs 70.8% of non-superhosts.

In [None]:
df = pd.crosstab(index=airbnb['response_time'], columns=airbnb['superhost'], 
                 normalize='columns')
df.round(3)
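
What `normalize='columns'` does can be checked on a tiny made-up example: each column of counts is divided by its own total, so every column sums to 1. The category labels below are placeholders.

```python
import pandas as pd

toy = pd.DataFrame({
    'response_time': ['fast', 'fast', 'slow', 'fast', 'slow', 'slow', 'fast', 'fast'],
    'superhost':     ['yes',  'yes',  'no',   'no',   'no',   'no',   'yes',  'no'],
})

counts = pd.crosstab(index=toy['response_time'], columns=toy['superhost'])
shares = pd.crosstab(index=toy['response_time'], columns=toy['superhost'],
                     normalize='columns')

print(counts)
print(shares.round(3))
```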

Surprisingly, this plot isn't easily manageable in seaborn as `countplot` does not do any normalization, so we plot the DataFrame above directly in pandas.

In [None]:
ax = df.plot(kind='bar', figsize=(3.5, 1.5), width=.8)
wrap_labels(ax, width=10)

### Line plots

The seaborn `lineplot` function is similar to `pointplot`, but does not draw markers at every point. It only aggregates the `y` variable and expects `x` to be either numeric or datetime but not categorical. Let's read in the full COVID-19 dataset, which contains both new and total case and death data for nearly every country in the world.

In [None]:
full_covid = pd.read_csv('../data/covid/full_covid_data.csv', parse_dates=['date'])
full_covid.head(3)

This dataset is tidy and contains total deaths in a single column. It's not possible to make line plots of total world deaths directly in pandas. You'd have to group and aggregate the data first like this:

In [None]:
full_covid.groupby('date').agg({'total_deaths': 'sum'}).tail(4)

Instead, we'll use seaborn to do the grouping and aggregation for us. Here, we set the `estimator` to the numpy `sum` function.

In [None]:
fig, ax = plt.subplots(figsize=(6, 1.8))
sns.lineplot(x='date', y='total_deaths', data=full_covid, ci=None, estimator=np.sum, ax=ax)
ax.set(xlabel='', ylabel='', title='Total World Deaths from COVID-19');

If we are interested in making line plots by country using pandas, we'd need to pivot the data first.

In [None]:
full_covid.pivot(index='date', columns='country', values='total_deaths').iloc[:3, :10]

Because we are using seaborn this isn't necessary. Since there are nearly 200 countries, we will select just the top 7 countries with most deaths. We first get the last date of data.

In [None]:
last_date = full_covid['date'].max()
last_date

We filter for the countries that have the highest total on this date.

In [None]:
highest_deaths = full_covid.query('date == @last_date').nlargest(7, 'total_deaths')
highest_deaths

We set all the other country values to `'Rest of World'` in the original dataset.

In [None]:
filt = ~full_covid['country'].isin(highest_deaths['country'])
full_covid.loc[filt, 'country'] = 'Rest of World'
full_covid.head(3)

We can now make a line plot by country using `hue`. Note that the sum aggregation is only doing an actual summation for 'Rest of World'. The other countries have exactly one row for each date.

In [None]:
fig, ax = plt.subplots(figsize=(6, 1.8))
sns.lineplot(x='date', y='total_deaths', data=full_covid.query('date > "2020-03-07"'), 
             hue='country', ci=None, estimator=np.sum, ax=ax)
ax.legend(bbox_to_anchor=(1, 1))
ax.set(xlabel='', ylabel='', title='Total Deaths from COVID-19');

## Raw data plots

The last major category of plots is those that do not change the underlying data. They simply plot the raw data as it is. In this section, we'll create scatter plots with regression lines running through them, as well as heat maps, using the following functions.

* `scatterplot`
* `regplot`
* `heatmap`

### Scatter plots

The `scatterplot` function provides many different options to size, color, and style points according to different variables. Let's begin by creating a map of each Airbnb listing using the longitude and latitude points and place a marker at the exact location of the White House. The `s` parameter sets the size of each marker in points squared.

In [None]:
wh_coords = -77.0365, 38.8977
fig, ax = plt.subplots(figsize=(3, 3))
ax.set_aspect('equal')
sns.scatterplot(x='longitude', y='latitude', data=airbnb, s=16, ax=ax)
ax.scatter(*wh_coords, marker='*', c='white', ec='red', lw=1.5, s=150)
ax.annotate('White House', xy=wh_coords, xytext=(-77.09, 38.86), 
            arrowprops={'arrowstyle': '->', 'shrinkB': 7, 'color': 'black'})
ax.set_title('Washington D.C Airbnb Listings');

Because there are over 9,000 listings in this dataset, let's filter for listings that are within one mile of the White House. To do so exactly requires the use of the [haversine formula][2], but since this area of the world is so small and not close to the poles, we'll just use some basic trigonometry to calculate the distance. We first calculate the distance in degrees between the White House and every listing.

[2]: https://en.wikipedia.org/wiki/Haversine_formula

In [None]:
dist_degree = ((airbnb['longitude'] - wh_coords[0]) ** 2 + 
               (airbnb['latitude'] - wh_coords[1]) ** 2) ** .5
dist_degree.head(3)

The earth is approximately 25,000 miles in circumference so we can get the number of miles per degree by dividing by 360.

In [None]:
miles_per_degree = 25000 / 360
miles_per_degree

Multiplying this number by the distance in degrees returns the number of miles from the White House for each listing.

In [None]:
airbnb['miles_from_wh'] = (dist_degree * miles_per_degree).round(2)
airbnb['miles_from_wh'].head(3)
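
As a cross-check on the flat approximation, the haversine distance can be computed directly. The sketch below is a standard implementation with NumPy, assuming a mean earth radius of about 3,959 miles. The function name `haversine_miles` and the two offset points are illustrative and are not listings from the dataset. Away from the equator, one degree of longitude spans fewer miles than one degree of latitude (by a factor of the cosine of the latitude), so the two methods differ most for points directly east or west.

```python
import numpy as np

EARTH_RADIUS_MILES = 3958.8  # assumed mean radius

def haversine_miles(lon1, lat1, lon2, lat2):
    """Great-circle distance in miles between points given in degrees."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * np.arcsin(np.sqrt(a))

wh_lon, wh_lat = -77.0365, 38.8977

# A point 0.01 degrees due north and one 0.01 degrees due east
north = haversine_miles(wh_lon, wh_lat, wh_lon, wh_lat + 0.01)
east = haversine_miles(wh_lon, wh_lat, wh_lon + 0.01, wh_lat)
flat = 0.01 * 25000 / 360   # the flat approximation gives this value for both

print(f'north: {north:.3f}  east: {east:.3f}  flat: {flat:.3f}')
```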

Let's do a sanity check and get the minimum and maximum distances. Washington, D.C. is a fairly small place originally carved into a perfect 10 by 10 mile square with corners rotated 45 degrees to appear diamond-shaped. The lower-left portion was returned to Virginia, which accounts for its current shape and size of 68 square miles, down from its original 100. The White House is very nearly in the center of the original square, so every listing should be no more than 10 miles from it.

In [None]:
airbnb['miles_from_wh'].agg(['min', 'max'])

To provide more verification that our calculations are on the right track, let's recreate the plot above, adding a second scatter plot for just those listings within 1 mile of the White House. Because we will be creating several similar scatter plots, a function is created to set up the figure, axes, aspect, and title as well as to place the White House.

In [None]:
def setup_wh_plot(figsize=(3, 3)):
    wh_coords = -77.0365, 38.8977
    fig, ax = plt.subplots(figsize=figsize)
    ax.set_aspect('equal')
    ax.scatter(*wh_coords, marker='*', c='white', ec='red', lw=1.5, s=150, zorder=3)
    ax.set_title('Airbnb Listings Near White House')
    return ax

We filter for our new data, then use the above function which returns the axes, before finally plotting two separate scatter plots. seaborn automatically uses the next color in the color cycle when plotting on the same axes.

In [None]:
airbnb_wh = airbnb.query('miles_from_wh < 1')
ax = setup_wh_plot()
sns.scatterplot(x='longitude', y='latitude', data=airbnb, s=16, ax=ax)
sns.scatterplot(x='longitude', y='latitude', data=airbnb_wh, s=16, ax=ax);

The color, size, and marker style can correspond to specific columns by setting the `hue`, `size`, and `style` parameters. Here, we color by neighborhood, size by price, and style by superhost. We'll zoom in on only the listings within one mile of the White House.

In [None]:
ax = setup_wh_plot(figsize=(4, 4))
sns.scatterplot(x='longitude', y='latitude', 
                data=airbnb_wh.query('price < 500'), ax=ax,
                hue='neighborhood', size='price', style='superhost')
ax.legend(bbox_to_anchor=(1, 1));

Each of the values for `hue`, `size`, and `style` can be customized. If either `hue` or `size` is numeric, then you may set the `hue_norm` or `size_norm` parameter to a two-item tuple of the minimum and maximum value of the range. Values outside of the given range will be colored or sized as if they were equal to the minimum or maximum.
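
The clipping behavior can be illustrated with matplotlib's `Normalize` class, which maps a range of values onto the interval from 0 to 1 before a color is looked up. This is a small sketch of the idea rather than a trace of seaborn's internals, and the rating values are made up.

```python
import numpy as np
from matplotlib.colors import Normalize

ratings = np.array([80, 90, 95, 100, 105])  # hypothetical ratings

# Map 90 -> 0 and 100 -> 1; clip=True pins values outside the range to the ends
norm = Normalize(vmin=90, vmax=100, clip=True)
scaled = np.asarray(norm(ratings))
print(scaled)
```

Ratings of 80 and 90 land on the same scaled value, as do 100 and 105, so they would receive the same color.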

In the scatter plot below, we choose to color by rating, a numeric variable which has valid values between 0 and 100. However, more than 85% of ratings are greater than or equal to 90, with a full 30% getting the top rating of 100. Using the default scale of 0 to 100 wouldn't show much difference between a rating of 90 and 100, even though 90 is in the 15th percentile. Therefore, we set `hue_norm` to be `(90, 100)`. All ratings below 90 are given the same color as 90. Use `palette` to set the matplotlib colormap.

The same approach is used for size, which is controlled by price. There are some extreme outliers that force nearly every point to be the smallest size. We use `size_norm` to size based on a limited range. Finally, the marker for each unique value in the `style` column can be set by using a dictionary passed to `markers`. Since there are so many points, a random sample of 100 listings that accommodate at most two persons is selected. 

In [None]:
ax = setup_wh_plot(figsize=(4, 4))
df = airbnb_wh.query('accommodates <= 2').dropna(subset=['rating']) \
              .sample(n=100, random_state=1)
sns.scatterplot(x='longitude', y='latitude', data=df, ax=ax,
                hue='rating', hue_norm=(90, 100),
                size='price', size_norm=(100, 300),
                style='superhost', markers={0: 'o', 1: 'P'},
                palette='copper')
ax.legend(bbox_to_anchor=(1, 1.05));

### Heat maps

A heat map is a rectangular grid made of individual rectangles that each get colored based on a reference value. seaborn does not group or aggregate within its `heatmap` function, and the `x` and `y` parameters do not exist. The first argument passed is the DataFrame whose values are mapped directly to colors of the chosen sequential colormap.

You'll probably need to do some data manipulation with pandas before creating a heat map, as tidy data is not well-suited for it. Below, we filter for listings that accommodate five or fewer persons and whose neighborhood has more than 300 total listings. We then create a pivot table of median price.

In [None]:
df = (airbnb.query('accommodates <= 5')
            .groupby('neighborhood').filter(lambda x: len(x) > 300))
median_price = (df.pivot_table(index='neighborhood', columns='accommodates', 
                               values='price', aggfunc='median')
                  .round(-1))
median_price

Notice the outlier of 1,000 for West End listings that accommodate three persons. This one value will dominate the heat map if we use the defaults. Instead, we set `vmax` to the second highest value to cap the range used to determine color. Setting `annot` to `True` annotates each cell with its value, which is formatted according to `fmt`.

In [None]:
ax = sns.heatmap(median_price, cmap='OrRd', annot=True, fmt=',.0f', vmax=270)
ax.set_title('Median Price');
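
Rather than reading the cap off the table by eye, the second-highest value can be computed. Here is a minimal sketch on a made-up pivot table (the frame `toy` and its numbers are hypothetical).

```python
import numpy as np
import pandas as pd

# Hypothetical pivot table with one extreme value and one missing cell
toy = pd.DataFrame({1: [100, 120], 2: [150, 270], 3: [1000, np.nan]},
                   index=['A', 'B'])

values = toy.to_numpy().ravel()
distinct = np.unique(values[~np.isnan(values)])  # sorted, NaN removed
second_highest = distinct[-2]
print(second_highest)
```

The result could then be passed as `vmax` to `sns.heatmap` in place of a hard-coded number.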

## Scatter plots with linear regression lines using `regplot`

The `regplot` function creates scatter plots and places a linear regression line through the cloud of points. It might be interesting to investigate the relationship between distance from the White House and price. Perhaps listings closer to the White House would be in higher demand and cost more. To help test this hypothesis, we'll select listings that accommodate exactly one person. Since there are a large number of points, we'll select 100 random points.

seaborn allows you to control the scatter points and line with the `scatter_kws` and `line_kws` parameters. Set them each to a dictionary mapping the underlying property to its value. We use them to set the size of each point and the width of the line.

In [None]:
df = airbnb.query('accommodates == 1 and price < 500').sample(100, random_state=1)
sns.regplot(x='miles_from_wh', y='price', data=df,
            scatter_kws={'s': 10}, line_kws={'lw': 3});

The line appears to have a slight downward trend suggesting that miles from the White House might have a minor effect on the price of the listing. Notice the light-blue region surrounding the line. This represents the 95% confidence interval for the regression line. Instead of using a random sample of the data, let's use all points for one-person listings.

In [None]:
df = airbnb.query('accommodates == 1')
ax = sns.regplot(x='miles_from_wh', y='price', data=df, ci=None, marker='.',
                 scatter_kws={'s': 10, 'alpha': .5}, line_kws={'lw': 3})
ax.set_ylim(0, 200);

### Choosing different types of linear regression

The regression line looks bizarre and is now predicting a price less than 0 for listings more than five miles from the White House. The reason for this is the presence of extreme outliers. Specifically, there are 10 listings priced at exactly \$7,000. The default algorithm used to find the formula for the line is called ordinary least squares (OLS), which minimizes the sum of squared errors (the squared vertical distances between the line and each point). This metric is greatly influenced by outliers, causing the regression line to be pulled far away from the central cluster of points.
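
The sensitivity of OLS to a single extreme value is easy to reproduce on made-up data. In this sketch, `np.polyfit` with `deg=1` computes an ordinary least squares line; the numbers are hypothetical and only loosely mimic the listings.

```python
import numpy as np

x = np.arange(10, dtype=float)
y = 100 - 2 * x                      # points that lie exactly on a line

slope_clean, intercept_clean = np.polyfit(x, y, deg=1)

y_outlier = y.copy()
y_outlier[0] = 7000                  # one extreme price at the smallest x
slope_out, intercept_out = np.polyfit(x, y_outlier, deg=1)

print(round(slope_clean, 2), round(slope_out, 2))
print(slope_out * x[-1] + intercept_out)  # prediction at the largest x
```

One outlier is enough to turn a gentle slope into a steep one and to push the prediction at the largest x below zero.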

seaborn allows other algorithms besides OLS to compute the line of best fit. Set `robust` to `True` to perform robust regression, which greatly reduces the effect of outliers on the line. Robust regression is much more computationally intense, so we do not calculate a confidence interval.  

In [None]:
df = airbnb.query('accommodates == 1')
ax = sns.regplot(x='miles_from_wh', y='price', data=df, marker='.', ci=None,
                 robust=True, scatter_kws={'s': 10}, line_kws={'color': 'black'})
ax.set(ylim=(0, 200), title='Robust Regression');
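
To give a feel for how a robust fit resists outliers, here is a minimal numpy sketch of one common approach, iteratively reweighted least squares with Huber weights. seaborn relies on statsmodels when `robust=True`, and this is not that routine. The function name `huber_line`, the tuning constant 1.345, the fixed iteration count, and the data are all assumptions made for illustration.

```python
import numpy as np

def huber_line(x, y, k=1.345, n_iter=50):
    """Fit intercept and slope by iteratively reweighted least squares."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.lstsq(X, y, rcond=None)[0]      # start from the OLS fit
    for _ in range(n_iter):
        resid = y - X @ beta
        # Robust estimate of spread: scaled median absolute deviation
        scale = np.median(np.abs(resid - np.median(resid))) / 0.6745
        if scale == 0:
            break
        u = np.abs(resid) / (k * scale)
        w = np.where(u <= 1, 1.0, 1 / u)             # shrink large residuals
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
    return beta

x = np.arange(20, dtype=float)
noise = np.where(np.arange(20) % 2 == 0, 1.0, -1.0)  # deterministic wiggle
y = 100 - 2 * x + noise
y[0] = 7000                                          # hypothetical outlier

ols_slope = np.polyfit(x, y, deg=1)[0]
intercept, robust_slope = huber_line(x, y)
print(round(ols_slope, 2), round(robust_slope, 2))
```

The weight given to the outlier shrinks on every pass, so the robust slope should end up close to the slope of the remaining points, while the OLS slope is dragged far from it.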

Another form of regression is LOWESS, or locally weighted scatterplot smoothing, which fits a separate regression model around each x-value, weighting points closer to that x-value more than those far away. It's able to model highly non-linear relationships as it's not constrained to a particular functional form. We set `lowess` to `True` and observe that the effect on price seems to only be significant for listings within three miles of the White House.

In [None]:
ax = sns.regplot(x='miles_from_wh', y='price', data=df, marker='.', ci=None,
                 lowess=True, scatter_kws={'s': 10}, line_kws={'color': 'black'})
ax.set(ylim=(0, 200), title='LOWESS Regression');
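
The core idea of LOWESS can be sketched in a few lines of numpy. seaborn delegates the real computation to statsmodels; the version below is simplified (no robustness iterations, distinct x-values assumed), and the name `lowess_sketch`, the `frac` default, and the data are hypothetical.

```python
import numpy as np

def lowess_sketch(x, y, frac=0.5):
    """Fit a weighted straight line around every x-value."""
    n = len(x)
    k = max(2, int(np.ceil(frac * n)))       # neighbours used for each fit
    X = np.column_stack([np.ones(n), x])
    fitted = np.empty(n)
    for i in range(n):
        dist = np.abs(x - x[i])
        h = np.sort(dist)[k - 1]             # distance to the k-th nearest point
        w = np.clip(1 - (dist / h) ** 3, 0, 1) ** 3   # tricube weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        fitted[i] = beta[0] + beta[1] * x[i]
    return fitted

# Made-up curve: the price drops quickly at first, then levels off
x = np.linspace(0, 10, 50)
y = np.where(x < 3, 150 - 20 * x, 90.0)
smooth = lowess_sketch(x, y, frac=0.3)
print(smooth[:3], smooth[-3:])
```

Because each fit only sees nearby points, the smoothed curve can follow the early drop and the later plateau without being forced into a single straight line.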

### Grouping `x` and aggregating `y`

Instead of plotting each and every point, you can group the data into bins based on the x-values and aggregate the y-values. Choose evenly-sized bins by setting the `x_bins` parameter to an integer. Alternatively, set it to an array whose values will be the midpoints of the bins. Set `x_estimator` to the numpy aggregation function you want to use and `x_ci` to control the confidence interval drawn around each aggregated point. The regression is calculated from the original data and NOT from these aggregated values. Here, we cut the data into 20 evenly-sized bins, calculating the median price of each. The same robust regression line from above is plotted.

In [None]:
sns.regplot(x='miles_from_wh', y='price', x_bins=20, x_estimator=np.median, x_ci=None,
            robust=True, ci=None, data=df, scatter_kws={'s': 10});
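
The grouping idea can be sketched directly in pandas using quantile bins. seaborn's own binning logic differs in its details, so treat this as an approximation of what is drawn; the data are made up.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'x': np.arange(20, dtype=float)})
toy['y'] = 2 * toy['x']

# Four bins that each hold the same number of points
toy['x_bin'] = pd.qcut(toy['x'], q=4)
binned = toy.groupby('x_bin', observed=True).agg(x_mid=('x', 'mean'),
                                                 y_median=('y', 'median'))
print(binned)
```

Each row of `binned` corresponds to one aggregated point on the plot: a representative x-value for the bin and the median of the y-values that fell into it.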

## Ordered categorical data

The order of the grouping variable as it appears on your plot depends on its type. seaborn uses the natural ordering for numeric variables and the order of appearance for object/string variables. If you use an ordered categorical variable, then it will use that order. Let's read in three columns of the housing dataset to show how this works.

In [None]:
cols = ['OverallQual', 'HeatingQC', 'SalePrice']
housing = pd.read_csv('../data/housing.csv', usecols=cols)
housing.head(3)

If we were to make a plot using `'HeatingQC'` as a grouping variable, then seaborn would use the order of appearance of each unique value. We can find the first instance of each value with `drop_duplicates`.

In [None]:
housing['HeatingQC'].drop_duplicates()

When we plot the average sale price by heating quality, the groups follow the order from above. We also group by overall quality, but since this is a numeric variable, seaborn wisely chooses to use its natural ordering and not its order of appearance.

In [None]:
ax = sns.barplot(x='HeatingQC', y='SalePrice', hue='OverallQual', data=housing, 
                 estimator=np.mean, ci=None)
ax.legend(bbox_to_anchor=(1, 1.1), title='Overall Quality');

For columns like heating quality that have an inherent natural ordering, it's best to use that ordering when plotting. We can do so by setting the `order` parameter (or `hue_order` for the variable passed to `hue`), but it's better to change the column's data type to ordered categorical in pandas first, which seaborn will respect when plotting. We overwrite the `'HeatingQC'` column with its new data type and make the same plot.

In [None]:
hc_cat = pd.CategoricalDtype(['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)
housing['HeatingQC'] = housing['HeatingQC'].astype(hc_cat)
ax = sns.barplot(x='HeatingQC', y='SalePrice', hue='OverallQual', data=housing, 
                 estimator=np.mean, ci=None)
ax.legend(bbox_to_anchor=(1, 1.1), title='Overall Quality');
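
The difference between order of appearance and a declared order can be seen without any plotting. Here is a small sketch on a made-up Series that reuses the same quality labels.

```python
import pandas as pd

s = pd.Series(['TA', 'Ex', 'Po', 'Gd', 'Fa', 'Ex'])  # hypothetical values
print(s.unique())                     # order of appearance

quality = pd.CategoricalDtype(['Po', 'Fa', 'TA', 'Gd', 'Ex'], ordered=True)
s_cat = s.astype(quality)
print(s_cat.sort_values().tolist())   # follows the declared order
print((s_cat > 'TA').tolist())        # comparisons use that order too
```

Once the order lives in the data type, sorting, comparisons, and plots all agree without having to repeat the list of labels in each call.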

## Exercises