Modified by Roger Wang (rq.wang@rutgers.edu)

Origin: https://www.kaggle.com/residentmario/welcome-to-data-visualization

# Univariate plotting with pandas

<table>
<tr>
<td><img src="https://i.imgur.com/skaZPhb.png" width="350px"/></td>
<td><img src="https://i.imgur.com/gaNttYd.png" width="350px"/></td>
<td><img src="https://i.imgur.com/pampioh.png"/></td>
<td><img src="https://i.imgur.com/OSbuszd.png"/></td>

<!--<td><img src="https://i.imgur.com/ydaMhT1.png" width="350px"/></td>
<td><img src="https://i.imgur.com/WLAqDSV.png" width="350px"/></td>
<td><img src="https://i.imgur.com/Tj2y9gH.png"/></td>
<td><img src="https://i.imgur.com/X0qXLCu.png"/></td>-->
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Bar Chart</td>
<td style="font-weight:bold; font-size:16px;">Line Chart</td>
<td style="font-weight:bold; font-size:16px;">Area Chart</td>
<td style="font-weight:bold; font-size:16px;">Histogram</td>
</tr>
<tr>
<td>df.plot.bar()</td>
<td>df.plot.line()</td>
<td>df.plot.area()</td>
<td>df.plot.hist()</td>
</tr>
<tr>
<td>Good for nominal and small ordinal categorical data.</td>
<td>Good for ordinal categorical and interval data.</td>
<td>Good for ordinal categorical and interval data.</td>
<td>Good for interval data.</td>
</tr>
</table>

----

The `pandas` library is the core library for Python data analysis: the "killer feature" that makes the entire ecosystem stick together. However, it can do more than load and transform your data: it can visualize it too! Indeed, the easy-to-use and expressive pandas plotting API is a big part of the popularity of `pandas`.

In this section we will learn the basic `pandas` plotting facilities, starting with the simplest type of visualization: single-variable or "univariate" visualizations. This includes basic tools like bar plots and line charts. Through these we'll get an understanding of `pandas` plotting library structure, and spend some time examining data types.

In [None]:
import pandas as pd
%matplotlib inline 

reviews = pd.read_csv("../2019_EEAT/data_collection/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head(3)

## Bar charts and categorical data

Bar charts are arguably the simplest data visualization. They map categories to numbers: the number of eggs consumed for breakfast (a category) to the number of breakfast-eating Americans, for example; or, in our case, wine-producing provinces of the world (category) to the number of labels of wines they produce (number):

In [None]:
reviews['province'].value_counts().head(10).plot.bar()

What does this plot tell us? It says California produces far more wine than any other province of the world! We might ask what percent of the total is Californian vintage? This bar chart tells us absolute numbers, but it's more useful to know relative proportions. No problem:

In [None]:
(reviews['province'].value_counts().head(10) / len(reviews)).plot.bar()

California produces almost a third of wines reviewed in Wine Magazine!
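The division above can also be written with the `normalize` argument of `value_counts`. The sketch below uses a tiny made-up `province` column (the `toy` frame and its contents are assumptions for illustration, not the wine data) to show that the two approaches agree when no values are missing:

```python
import pandas as pd

# Made-up stand-in for the `province` column; not the real wine data.
toy = pd.DataFrame({"province": ["California"] * 3 + ["Tuscany"] * 2 + ["Oregon"]})

# Approach used above: divide the counts by the total number of rows.
by_division = toy["province"].value_counts() / len(toy)

# Shortcut: ask value_counts for proportions directly.
by_normalize = toy["province"].value_counts(normalize=True)

print(by_division)
print(by_normalize)
```

One caveat: `value_counts` skips missing values by default, so if some provinces are missing, `normalize=True` divides by the number of non-missing rows while `len(...)` counts every row, and the two results differ slightly.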

Bar charts are very flexible: The height can represent anything, as long as it is a number. And each bar can represent anything, as long as it is a category.

In this case the categories are **nominal categories**: "pure" categories that don't make a lot of sense to order. Nominal categorical variables include things like countries, ZIP codes, types of cheese, and lunar landers. The other kind are **ordinal categories**: things that do make sense to compare, like earthquake magnitudes, housing complexes with certain numbers of apartments, and the sizes of bags of chips at your local deli.

Or, in our case, the number of reviews of a certain score allotted by Wine Magazine:

In [None]:
reviews['points'].value_counts().sort_index().plot.bar()

As you can see, every vintage is allotted an overall score between 80 and 100; and, if we are to believe that Wine Magazine is an arbiter of good taste, then a 92 is somehow meaningfully "better" than a 91.
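The `sort_index()` call in the cell above matters for ordinal data: `value_counts()` orders its result by frequency, so without the sort the bars would not appear in score order. A small sketch with made-up scores (the `scores` Series is an assumption, not the wine data):

```python
import pandas as pd

scores = pd.Series([88, 88, 88, 90, 90, 85])

by_frequency = scores.value_counts()               # most common score first
by_score = scores.value_counts().sort_index()      # lowest score first

print(list(by_frequency.index))
print(list(by_score.index))
```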

## Line charts

The wine review scorecard has 21 unique values to fill (80 through 100), for which our bar chart is just barely enough. What would we do if the magazine rated things 0-100? We'd have 101 different categories; simply too many to fit a bar in for each one!

In that case, instead of a bar chart, we could use a line chart:

In [None]:
reviews['points'].value_counts().sort_index().plot.line()

A line chart can pass over any number of individual values, making it the tool of first choice for distributions with many unique values or categories.

However, line charts have an important weakness: unlike bar charts, they're not appropriate for nominal categorical data. While bar charts distinguish between every "type" of point, line charts mush them together. So a line chart asserts an order to the values on the horizontal axis, and the order won't make sense with some data. After all, a "descent" from California to Washington to Tuscany doesn't mean much!

Line charts also make it harder to distinguish between individual values.

In general, if your data can fit into a bar chart, just use a bar chart!

## Quick break: bar or line

Let's do a quick exercise. Suppose that we're interested in counting the following variables:

1. The number of tubs of ice cream purchased by flavor, given that there are 5 different flavors.
2. The average number of cars purchased from American car manufacturers in Michigan.
3. Test scores given to students by teachers at a college, on a 0-100 scale.
4. The number of restaurants located on the street by the name of the street in Lower Manhattan.

For which of these would a bar chart be better? Which ones would be better off with a line?

In [None]:
raw = """
<ol>
<li>This is a simple nominal categorical variable. Five bars will fit easily into a display, so a bar chart will do!</li>
<br/>
<li>This example is similar: nominal categorical variables. There are probably more than five American car manufacturers, so the chart will be a little more crowded, but a bar chart will still do it.</li>
<br/>
<li>This is an ordinal categorical variable. We have a lot of potential values between 0 and 100, so a bar chart won't have enough room. A line chart is better.</li>
<br/>
<li>
<p>Number 4 is a lot harder. City streets are obviously nominal categorical variables, so we *ought* to use a bar chart; but there are a lot of streets out there! We couldn't possibly fit all of them into a display.</p>
<p>Sometimes, your data will have too many points to do something "neatly", and that's OK. If you organize the data by value count and plot a line chart over that, you'll learn valuable information about *percentiles*: that a street in the 90th percentile has 20 restaurants, for example, or one in the 50th just 6. This is basically a form of aggregation: we've turned streets into percentiles!</p> 
<p>The lesson: your *interpretation* of the data is more important than the tool that you use.</p></li>
</ol>
"""

from IPython.display import HTML
HTML(raw)

## Area charts

Area charts are just line charts, but with the bottom shaded in. That's it!

In [None]:
reviews['points'].value_counts().sort_index().plot.area()

When plotting only one variable, the difference between an area chart and a line chart is mostly visual. In this context, they can be used interchangeably.

## Interval data

Let's move on by looking at yet another type of data, an **interval variable**.

Examples of interval variables are the wind speed in a hurricane, shear strength in concrete, and the temperature of the sun. An interval variable goes beyond an ordinal categorical variable: it has a *meaningful* order, in the sense that we can quantify the difference between two entries, and that difference is itself an interval variable.

For example, if I say that this sample of water is -20 degrees Celsius, and this other sample is 120 degrees Celsius, then I can quantify the difference between them: 140 degrees "worth" of heat, or such-and-such many joules of energy.

The difference can be qualitative sometimes. At a minimum, being able to state something so clearly feels a lot more "measured" than, say, saying you'll buy this wine and not that one, because this one scored a 92 on some taste test and that one only got an 85. More definitively, any variable that has infinitely many possible values is definitely an interval variable (why not 120.1 degrees? 120.001? 120.0000000001? Etc).

Line charts work well for interval data. Bar charts don't—unless your ability to measure it is very limited, interval data will naturally vary by quite a lot.

Let's apply a new tool, the histogram, to an interval variable in our dataset, price (we'll cut price off at $200 a bottle; more on why shortly).

## Histograms

Here's a histogram:

In [None]:
reviews[reviews['price'] < 200]['price'].plot.hist()

A histogram looks, trivially, like a bar plot. And it basically is! In fact, a histogram is a special kind of bar plot that splits your data into even intervals and displays how many rows are in each interval with bars. The only analytical difference is that instead of each bar representing a single value, it represents a range of values.
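To make the "range of values" idea concrete, the counts behind a histogram can be computed by hand. This sketch uses a short made-up list of prices (an assumption, not the wine data) and `numpy.histogram` with 10 bins, which is also the default bin count for `plot.hist()`:

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only.
prices = pd.Series([4, 9, 12, 15, 18, 22, 25, 31, 48, 95])

# Split the range from min to max into 10 equal-width intervals and count rows in each.
counts, edges = np.histogram(prices, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.1f} to {right:6.1f}: {n} bottle(s)")
```

Each printed row corresponds to one bar: the interval is the bar's position and width, and the count is its height.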

However, histograms have one major shortcoming (the reason for our $200 caveat earlier). Because they break space up into even intervals, they don't deal very well with skewed data:

In [None]:
reviews['price'].plot.hist()

This is the real reason I excluded the >$200 bottles earlier; some of these vintages are really expensive! And the chart will "grow" to include them, to the detriment of the rest of the data being shown.

In [None]:
reviews[reviews['price'] > 1500]

There are many ways of dealing with the skewed data problem; those are outside the scope of this tutorial. The easiest is to just do what I did: cut things off at a sensible level.

This phenomenon is known (statistically) as **skew**, and it's a fairly common occurrence among interval variables.
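As a rough sketch of the cut-off approach, the threshold does not have to be hard-coded; it can be taken from a quantile instead. The prices below are made up, and the 90th percentile is an arbitrary choice for illustration:

```python
import pandas as pd

# Made-up prices with one extreme outlier.
prices = pd.Series([10, 12, 15, 18, 20, 22, 25, 30, 40, 2000])

# Keep only the rows at or below the chosen quantile.
cutoff = prices.quantile(0.9)
trimmed = prices[prices <= cutoff]

# Series.skew() gives a numeric measure of how lopsided the distribution is.
print("skew before:", prices.skew())
print("skew after: ", trimmed.skew())
```

Calling `trimmed.plot.hist()` would then spread the remaining values across the bins instead of cramming them into the first one.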

Histograms work best for interval variables without skew. They also work really well for ordinal categorical variables like `points`:

In [None]:
reviews['points'].plot.hist()

## Exercise: bar, line/area, or histogram?

Let's do another exercise. What would the best chart type be for:

1. The volume of apples picked at an orchard based on the type of apple (Granny Smith, Fuji, etcetera).
2. The number of points won in all basketball games in a season.
3. The count of apartment buildings in Chicago by the number of individual units.

To see the answer, click the "Output" button on the code block below.

In [None]:
raw = """
<ol>
<li>Example number 1 is a nominal categorical example, and hence, a pretty straightforward bar graph target.</li>
<br/>
<li>Example 2 is a large ordinal categorical variable. A basketball team can score between 50 and 150 points in a game, too many values for a bar chart; a line chart is a good way to go. A histogram could also work.</li>
<br/>
<li>Example 3 is an interval variable: a single building can have anywhere between 1 and 1000 or more apartment units. A line chart could work, but a histogram would probably work better! Note that this distribution is going to have a lot of skew (there is only a handful of very, very large apartment buildings).</li>
</ol>
"""

from IPython.display import HTML
HTML(raw)

## Conclusion and exercise

In this section of the tutorial we learned about a handful of different kinds of data, and looked at some of the built-in tools that `pandas` provides us for plotting them.

Now it's your turn!

For these exercises, we'll be working with the Pokemon dataset (because what goes together better than wine and Pokemon?).

In [None]:
pd.set_option('display.max_columns', None)  # full option name; the bare 'max_columns' is ambiguous in recent pandas
pokemon = pd.read_csv("../2019_EEAT/data_collection/pokemon.csv")
pokemon.head(3)

The frequency of Pokemon by type:

The frequency of Pokemon by HP stat total:

The frequency of Pokemon by Total:

# Bivariate plotting with pandas

<table>
<tr>
<td><img src="https://i.imgur.com/bBj1G1v.png" width="350px"/></td>
<td><img src="https://i.imgur.com/ChK9zR3.png" width="350px"/></td>
<td><img src="https://i.imgur.com/KBloVHe.png" width="350px"/></td>
<td><img src="https://i.imgur.com/C7kEWq7.png" width="350px"/></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Scatter Plot</td>
<td style="font-weight:bold; font-size:16px;">Hex Plot</td>
<td style="font-weight:bold; font-size:16px;">Stacked Bar Chart</td>
<td style="font-weight:bold; font-size:16px;">Bivariate Line Chart</td>
</tr>
<tr>
<td>df.plot.scatter()</td>
<td>df.plot.hexbin()</td>
<td>df.plot.bar(stacked=True)</td>
<td>df.plot.line()</td>
</tr>
<tr>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for nominal and ordinal categorical data.</td>
<td>Good for ordinal categorical and interval data.</td>
</tr>
</table>

----


In the previous notebook, we explored using `pandas` to plot and understand relationships within a single column. In this notebook, we'll expand this view by looking at plots that consider two variables at a time.

Data without relationships between variables is the data science equivalent of a blank canvas. To paint the picture in, we need to understand how variables interact with one another. Does an increase in one variable correlate with an increase in another? Does it relate to a decrease somewhere else? The best way to paint the picture in is by using plots that enable these possibilities.

In [None]:
import pandas as pd
reviews = pd.read_csv("../data_collection/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head()

## Scatter plot

The simplest bivariate plot is the lowly **scatter plot**. A simple scatter plot maps each observation to a point in two-dimensional space, using the two variables of interest as its coordinates. This is the result:

In [None]:
reviews[reviews['price'] < 100].sample(100).plot.scatter(x='price', y='points')

This plot shows us that price and points are weakly correlated: that is, that more expensive wines do generally earn more points when reviewed.

Note that in order to make effective use of this plot, we had to **downsample** our data, taking just 100 points from the full set. This is because naive scatter plots do not effectively treat points which map to the same place. For example, if two wines, both costing 100 dollars, get a rating of 90, then the second one is overplotted onto the first one, and we add just one point to the plot.

This isn't a problem if it happens just a few times. But with enough points the distribution starts to look like a shapeless blob, and you lose the forest for the trees:

In [None]:
reviews[reviews['price'] < 100].plot.scatter(x='price', y='points')

Because of their weakness to overplotting, scatter plots work best with relatively small datasets, and with variables which have a large number of unique values.
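One way to get a feel for how much overplotting a dataset will suffer is to count the rows that share the same coordinates. Here is a sketch on a made-up frame (`toy` is an assumption, not the wine data); it also fixes `random_state` so that the downsample is repeatable:

```python
import pandas as pd

toy = pd.DataFrame({
    "price":  [20, 20, 20, 35, 35, 50],
    "points": [87, 87, 87, 90, 90, 92],
})

# Rows that repeat an earlier (price, points) pair would be drawn on top of it.
overplotted = toy.duplicated(subset=["price", "points"]).sum()
print(overplotted, "of", len(toy), "points are hidden behind another point")

# Downsample reproducibly; the seed value is arbitrary.
sampled = toy.sample(n=3, random_state=0)
print(sampled)
```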

There are a few ways to deal with overplotting. We've already demonstrated one way: sampling the points. Another interesting way to do this that's built right into `pandas` is to use our next plot type, a hexplot.

## Hexplot

A **hex plot** aggregates points in space into hexagons, and then colors those hexagons based on the values within them:

In [None]:
reviews[reviews['price'] < 100].plot.hexbin(x='price', y='points', gridsize=15)

(note: the x-axis is `price`, but its label is missing from the chart due to a bug)
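If the x-axis label goes missing in your environment, one workaround that may help is to create the axes yourself, pass them in through the `ax` parameter, and set the label explicitly. The data below is randomly generated for illustration and is not taken from the wine reviews:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Random stand-in data; the seed and value ranges are arbitrary.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "price": rng.integers(5, 100, size=500),
    "points": rng.integers(80, 101, size=500),
})

# Create the axes first, then let pandas draw the hexbin onto them.
fig, ax = plt.subplots()
toy.plot.hexbin(x="price", y="points", gridsize=15, ax=ax)
ax.set_xlabel("price")  # set the label by hand so it is definitely present
```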

The data in this plot is directly comparable with that in the scatter plot from earlier, but the story it tells us is very different. From this hexplot we can see that the bottles of wine reviewed by Wine Magazine cluster around 87.5 points and around $20.

We did not see this effect by looking at the scatter plot, because too many similarly-priced, similarly-scoring wines were overplotted. By doing away with this problem, this hexplot presents us a much more useful view of the dataset.

Hexplots and scatter plots can be applied to combinations of interval variables and/or ordinal categorical variables.

## Stacked plots

Scatter plots and hex plots are new. But we can also use the simpler plots we saw in the last notebook.

The easiest way to modify them to support another visual variable is by using stacking. A stacked chart is one which plots the variables one on top of the other.

We'll use a supplemental selection of the five most common wines for this next section.

In [None]:
wine_counts = pd.read_csv("../data_collection/most-common-wine-scores/top-five-wine-score-counts.csv",
                          index_col=0)

`wine_counts` counts the number of times each of the possible review scores was received by the five most commonly reviewed types of wines:

In [None]:
wine_counts.head()

Many `pandas` multivariate plots expect input data to be in this format, with one categorical variable in the columns, one categorical variable in the rows, and counts of their intersections in the entries. 
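If your data is still in one-row-per-observation form, a table of this shape can be built with `pd.crosstab`. This is only a sketch on made-up rows; the tutorial does not describe how the supplied `wine_counts` file was produced, so treat this as one possible route rather than the original method:

```python
import pandas as pd

# Made-up reviews: one row per review, with a score and a variety.
toy = pd.DataFrame({
    "points":  [87, 87, 88, 88, 88, 90],
    "variety": ["Chardonnay", "Pinot Noir", "Chardonnay",
                "Chardonnay", "Pinot Noir", "Pinot Noir"],
})

# Rows become scores, columns become varieties, entries are counts.
counts = pd.crosstab(toy["points"], toy["variety"])
print(counts)
```

Calling `counts.plot.bar(stacked=True)` on the result would then draw one bar per score, split by variety.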

Let's now look at some stacked plots. We'll start with the stacked bar chart.

In [None]:
wine_counts.plot.bar(stacked=True)

Stacked bar plots share the strengths and weaknesses of univariate bar charts. They work best for nominal categorical or small ordinal categorical variables.

Another simple example is the area plot, which lends itself very naturally to this form of manipulation:

In [None]:
wine_counts.plot.area()

Like single-variable area charts, multivariate area charts are meant for ordinal categorical or interval variables.

Stacked plots are visually very pretty. However, they have two major limitations.

The first limitation is that the second variable in a stacked plot must be a variable with a very limited number of possible values (probably a nominal categorical, as here). Five different types of wine is a good number because it keeps the result interpretable; eight is sometimes mentioned as a suggested upper bound. Many dataset fields will not fit this criterion naturally, so you have to "make do", as here, by selecting a group of interest.

The second limitation is one of interpretability. As easy as they are to make, and as pretty as they look, stacked plots make it really hard to distinguish concrete values. For example, looking at the plots above, can you tell which wine got a score of 87 more often: Red Blends (in purple), Pinot Noir (in red), or Chardonnay (in green)? It's actually really hard to tell!

## Bivariate line chart

One plot type we've seen already that remains highly effective when made bivariate is the line chart. Because the line in this chart takes up so little visual space, it's really easy and effective to overplot multiple lines on the same chart.

In [None]:
wine_counts.plot.line()

Using a line chart this way makes inroads against the second limitation of stacked plotting: interpretability. Bivariate line charts are much more interpretable because the lines themselves don't take up much space. Their values remain readable when we place multiple lines side-by-side, as here.  

For example, in this chart we can easily answer our question from the previous example: which wine most commonly scores an 87. We can see here that the Chardonnay, in green, narrowly beats out the Pinot Noir, in red.
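A question like this can also be checked directly against the table rather than read off the chart. The numbers below are invented for illustration and are not the real counts:

```python
import pandas as pd

# Invented counts in the same layout as wine_counts: scores in rows, wines in columns.
toy_counts = pd.DataFrame(
    {"Chardonnay": [900, 1100], "Pinot Noir": [850, 1050], "Red Blend": [700, 800]},
    index=pd.Index([86, 87], name="points"),
)

# Name of the column with the largest count in the row for a score of 87.
most_common_at_87 = toy_counts.loc[87].idxmax()
print(most_common_at_87)
```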

----

## Exercises

In this section of the tutorial we introduced and explored some common bivariate plot types:

* Scatter plots
* Hex plots
* Stacked bar charts and area charts
* Bivariate line charts

Let's now put what we've learned to the test!

To start off, try answering the following questions:

1. A scatter plot or hex plot is good for what two types of data?
2. What type of data makes sense to show in a stacked bar chart, but not in a bivariate line chart?
3. What type of data makes sense to show in a bivariate line chart, but not in a stacked bar chart?
4. Suppose we create a scatter plot but find that due to the large number of points it's hard to interpret. What are two things we can do to fix this issue?

To see the answers, click the "Output" button on the cell below.

In [None]:
from IPython.display import HTML
HTML("""
<ol>
<li>Scatter plots and hex plots work best with a mixture of ordinal categorical and interval data.</li>
<br/>
<li>Nominal categorical data makes sense in a stacked bar chart, but not in a bivariate line chart.</li>
<br/>
<li>Interval data makes sense in a bivariate line chart, but not in a stacked bar chart.</li>
<br/>
<li>One way to fix this issue would be to sample the points. Another way to fix it would be to use a hex plot.</li>
</ol>
""")

Next, let's replicate some plots. Recall the Pokemon dataset from earlier:

In [None]:
pokemon = pd.read_csv("../data_collection/Pokemon.csv", index_col=0)
pokemon.head()

In [None]:
#plot scatter

In [None]:
# plot hexagon

For the next plot, use the following data:

In [None]:
# Select the numeric columns before averaging; recent pandas raises on non-numeric columns
pokemon_stats_legendary = pokemon.groupby(['Legendary', 'Generation'])[['Attack', 'Defense']].mean()

In [None]:
# plot bars

For the next plot, use the following data:

In [None]:
# Select the numeric columns before averaging; recent pandas raises on non-numeric columns
pokemon_stats_by_generation = pokemon.groupby('Generation')[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].mean()

In [None]:
# plot lines

## Conclusion

In this section we introduced and explored some common bivariate plot types:

* Scatter plots
* Hex plots
* Stacked bar charts and area charts
* Bivariate line charts

In the next section we will move on to exploring another plotting library, `seaborn`, which complements `pandas` with many more advanced data visualization tools for you to use.

[Click here to move on to the next section, "Plotting with seaborn"](https://www.kaggle.com/residentmario/plotting-with-seaborn/).

# Styling your plots

## Introduction

Whenever exposing your work to an external audience (like, say, the Kaggle userbase), styling your work is a must. The defaults in `pandas` (and other tools) are rarely exactly right for the message you want to communicate. Tweaking your plot can greatly enhance the communicative power of your visualizations, helping to make your work more impactful.

In this section we'll learn how to style the visualizations we've been creating. Because there are *so many* things you can tweak in your plot, it's impossible to cover everything, so we won't try to be comprehensive here. Instead this section will cover some of the most useful basics: changing figure sizes, colors, and font sizes; adding titles; and removing axis borders.

An important skill in plot styling is knowing how to look things up. Comments like "I have been using Matplotlib for a decade now, and I still have to look most things up" are [all too common](https://youtu.be/aRxahWy-ul8?t=2m42s). If you're styling a `seaborn` plot, the library's [gallery](http://seaborn.pydata.org/examples/) and [API documentation](https://seaborn.pydata.org/api.html) are a great place to find styling options. And for both `seaborn` and `pandas` there is a wealth of information that you can find by looking up "how to do X with Y" on [StackOverflow](https://stackoverflow.com/) (replacing X with what you want to do, and Y with `pandas` or `seaborn`). If you want to change your plot in some way not covered in this brief tutorial, and don't already know what function you need to do it, searching like this is the most efficient way of finding it.

In [None]:
import pandas as pd
reviews = pd.read_csv("../data_collection/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head(3)

## Points on style

Recall our bar plot from earlier:

In [None]:
reviews['points'].value_counts().sort_index().plot.bar()

Throughout this section we're going to work on making this plot look nicer.

This plot is kind of hard to see. So make it bigger! We can use the `figsize` parameter to do that.

In [None]:
reviews['points'].value_counts().sort_index().plot.bar(figsize=(12, 6))

`figsize` controls the size of the image, in inches. It expects a tuple of `(width, height)` values.

Next, we can change the color of the bars to be more thematic, using the `color` parameter.

In [None]:
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred'
)

The text labels are very hard to read at this size. They fit the plot when our plot was very small, but now that the plot is much bigger we need much bigger labels. We can use `fontsize` to adjust this.

In [None]:
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)

We also need a `title`.

In [None]:
reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16,
    title='Rankings Given by Wine Magazine',
)

However, this title is too small. Unfortunately, `pandas` doesn't give us an easy way of adjusting the title size.

Under the hood, `pandas` data visualization tools are built on top of another, lower-level graphics library called `matplotlib`. Anything that you build in `pandas` can be built using `matplotlib` directly. `pandas` merely makes it easier to get that work done.

`matplotlib` *does* provide a way of adjusting the title size. Let's go ahead and do it that way, and see what's different:

In [None]:
import matplotlib.pyplot as plt

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)

Every `pandas` plotting call returns the `matplotlib` axes object it drew on. In the cell immediately above, all we've done is grab that object, assign it to the variable `ax`, and then call `set_title` on `ax`. The `ax.set_title` method makes it easy to change the fontsize; the `title=` keyword parameter in the `pandas` library does not.
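Because `ax` is a regular `matplotlib` axes, its other methods are available too. As a sketch (with a made-up Series and label text chosen purely for illustration), axis labels can be set the same way as the title:

```python
import pandas as pd
from matplotlib.axes import Axes

# Made-up score counts; not the wine data.
scores = pd.Series([3, 5, 2], index=[85, 88, 90])
ax = scores.plot.bar(figsize=(12, 6), color="mediumvioletred", fontsize=16)

print(isinstance(ax, Axes))  # the returned object is a matplotlib Axes

# Title and axis labels all accept a fontsize argument.
ax.set_title("Made-up score counts", fontsize=20)
ax.set_xlabel("Points", fontsize=16)
ax.set_ylabel("Number of reviews", fontsize=16)
```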

`seaborn`, covered in a separate section of the tutorial, *also* uses `matplotlib` under the hood. This means that the tricks above work there too. `seaborn` has its own tricks, too&mdash;for example, we can use the very convenient `sns.despine` method to turn off the ugly black border.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

ax = reviews['points'].value_counts().sort_index().plot.bar(
    figsize=(12, 6),
    color='mediumvioletred',
    fontsize=16
)
ax.set_title("Rankings Given by Wine Magazine", fontsize=20)
sns.despine(bottom=True, left=True)

Perfect. This graph is much clearer than what we started with; it will do a much better job communicating the analysis to our readers.

There are many, many more things that you can do than just what we've shown here. Different plots provide different styling options: `color` is almost universal for example, while `s` (size) only makes sense in a scatterplot. For now, the operations we've shown here are enough to get you started.
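For instance, a scatter plot accepts both `color` and `s`. A sketch on made-up numbers (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"price": [10, 20, 35, 60], "points": [84, 87, 90, 93]})

# `s` sets the marker area in points squared; `color` works as in the bar plots above.
ax = toy.plot.scatter(
    x="price",
    y="points",
    s=80,
    color="mediumvioletred",
    figsize=(8, 5),
)
```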

# Exercises


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pokemon = pd.read_csv("../data_collection/Pokemon.csv")
pokemon.head(3)

In [None]:
# scatter Defense vs Attack
# figsize=(12, 6),
# title='Pokemon by Attack and Defense'

In [None]:
# hist of Total,    figsize=(12, 6),    fontsize=14,    bins=50,    color='gray'
# title='Pokemon by Stat Total', fontsize=20

In [None]:
# repeat but remove the top and right spines from plot(s).

# Conclusion

In this section of the tutorial, we learned a few simple tricks for making our plots more visually appealing, and hence, more communicative. We also learned that there is another plotting library, `matplotlib`, which lies "underneath" the `pandas` data visualization tools, and which we can use to more finely manipulate our plots.

In the next section we will learn to compose plots together using a technique called subplotting.

## Subplots

In the previous section, "Styling your plots", we set the title of a plot using a bit of `matplotlib` code. We did this by grabbing the underlying "axis" and then calling `set_title` on that.

In this section we'll explore another `matplotlib`-based stylistic feature: **subplotting**.

In [None]:
import pandas as pd
reviews = pd.read_csv("../data_collection/wine-reviews/winemag-data_first150k.csv", index_col=0)
reviews.head(3)

## Subplotting

Subplotting is a technique for creating multiple plots that live side-by-side in one overall figure. We can use the `subplots` method to create a figure with multiple subplots. `subplots` takes two arguments. The first one controls the number of *rows*, the second one the number of *columns*.

In [None]:
import matplotlib.pyplot as plt
fig, axarr = plt.subplots(2, 1, figsize=(12, 8))

Since we asked for a `subplots(2, 1)`, we got a figure with two rows and one column.

Let's break this down a bit. When `pandas` generates a bar chart, behind the scenes here is what it actually does:

1. Generate a new `matplotlib` `Figure` object.
2. Create a new `matplotlib` `AxesSubplot` object, and assign it to the `Figure`.
3. Use `AxesSubplot` methods to draw the information on the screen.
4. Return the result to the user.

In a similar way, our `subplots` operation above created one overall `Figure` with two `AxesSubplots` vertically nested inside of it.
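The four steps listed above can be imitated by hand with plain `matplotlib`. This is a simplified sketch of the idea on made-up data, not a copy of what `pandas` does internally:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up counts; string labels so matplotlib treats them as categories.
counts = pd.Series([3, 5, 2], index=["85", "88", "90"])

fig = plt.figure()                      # 1. a new Figure
ax = fig.add_subplot(1, 1, 1)           # 2. an axes object attached to the Figure
ax.bar(counts.index, counts.values)     # 3. draw using the axes' own methods
# 4. `ax` plays the role of the object a pandas plotting call hands back
```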

`subplots` returns two things, a figure (which we assigned to `fig`) and an array of the axes contained therein (which we assigned to `axarr`). Here are the `axarr` contents:

In [None]:
axarr

To tell `pandas` which subplot we want a new plot to go in&mdash;the first one or the second one&mdash;we need to grab the proper axis out of the list and pass it into `pandas` via the `ax` parameter:

In [None]:
fig, axarr = plt.subplots(2, 1, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0]
)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1]
)

We are of course not limited to having only a single column. We can create as many subplots as we want, in whatever configuration we need.

For example:

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

If there are multiple columns *and* multiple rows, as above, the axis array becomes a list of lists:

In [None]:
axarr

That means that to plot our data from earlier, we now need a row number, then a column number.

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0][0]
)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1][1]
)

Notice that the bar plot of wines by point counts is in the first row and first column (the `[0][0]` position), while the bar plot of wines by origin is in the second row and second column (`[1][1]`).
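The grid returned by `subplots` is a NumPy array, so it can be indexed and looped over like one. A short sketch (the panel titles are placeholders):

```python
import matplotlib.pyplot as plt

fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

print(axarr.shape)                    # rows, columns

# Two spellings of the same lookup: second row, second column.
same = axarr[1][1] is axarr[1, 1]
print(same)

# .flat visits every axes row by row, handy for applying one setting to all panels.
for i, ax in enumerate(axarr.flat):
    ax.set_title(f"Panel {i}")
```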

By combining subplots with the styles we learned in the last section, we can create appealing-looking panel displays.

In [None]:
fig, axarr = plt.subplots(2, 2, figsize=(12, 8))

reviews['points'].value_counts().sort_index().plot.bar(
    ax=axarr[0][0], fontsize=12, color='mediumvioletred'
)
axarr[0][0].set_title("Wine Scores", fontsize=18)

reviews['variety'].value_counts().head(20).plot.bar(
    ax=axarr[1][0], fontsize=12, color='mediumvioletred'
)
axarr[1][0].set_title("Wine Varieties", fontsize=18)

reviews['province'].value_counts().head(20).plot.bar(
    ax=axarr[1][1], fontsize=12, color='mediumvioletred'
)
axarr[1][1].set_title("Wine Origins", fontsize=18)

reviews['price'].plot.hist(
    ax=axarr[0][1], fontsize=12, color='mediumvioletred'
)
axarr[0][1].set_title("Wine Prices", fontsize=18)

plt.subplots_adjust(hspace=.3)

import seaborn as sns
sns.despine()

# Why subplot?

Why are subplots useful?

Oftentimes as a part of the exploratory data visualization process you will find yourself creating a large number of smaller charts probing one or a few specific aspects of the data. For example, suppose we're interested in comparing the scores for relatively common wines with those for relatively rare ones. In these cases, it makes logical sense to combine the two plots we would produce into one visual "unit" for analysis and discussion.

When we combine subplots with the style attributes we explored in the previous notebook, this technique allows us to create extremely attractive and informative panel displays.

Finally, subplots are critically useful because they enable **faceting**. Faceting is the act of breaking data variables up across multiple subplots, and combining those subplots into a single figure. So instead of one bar chart, we might have, say, four, arranged together in a grid.

The recommended way to perform faceting is to use the `seaborn` `FacetGrid` facility. This feature is explored in a separate section of this tutorial.

# Exercises

Let's test ourselves by answering some questions about the plots we've used in this section. Once you have your answers, click on the "Output" button below to show the correct answers.

1. A `matplotlib` plot consists of a single X composed of one or more Y. What are X and Y?
2. The `subplots` function takes which two parameters as input?
3. The `subplots` function returns what two variables? 

In [None]:
from IPython.display import HTML
HTML("""
<ol>
<li>The plot consists of one overall figure composed of one or more axes.</li>
<li>The subplots function takes the number of rows as the first parameter, and the number of columns as the second.</li>
<li>The subplots function returns a figure and an array of axes.</li>
</ol>
""")

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
pokemon = pd.read_csv("../data_collection/Pokemon.csv")
pokemon.head(3)

(Hint: use `figsize=(8, 8)`)

In [None]:
# two empty subplots with figsize=(8, 8)

In [None]:
# two hists using subplots: 1. Attack, 2. Defense figsize=(8, 8)

# Conclusion

In the previous section we explored some `pandas`/`matplotlib` style parameters. In this section, we dove a little deeper still by exploring subplots.

Together these two sections conclude our primer on style. Hopefully our plots will now be more legible and informative.

[Click here to go to the next section, "Plotting with seaborn"](https://www.kaggle.com/residentmario/plotting-with-seaborn).

# Plotting with seaborn

<table>
<tr>
<td><img src="https://i.imgur.com/3cYy56H.png" width="350px"/></td>
<td><img src="https://i.imgur.com/V9jAreo.png" width="350px"/></td>
<td><img src="https://i.imgur.com/5a6dwtm.png" width="350px"/></td>
<td><img src="https://i.imgur.com/ZSsHzrA.png" width="350px"/></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Count (Bar) Plot</td>
<td style="font-weight:bold; font-size:16px;">KDE Plot</td>
<td style="font-weight:bold; font-size:16px;">Joint (Hex) Plot</td>
<td style="font-weight:bold; font-size:16px;">Violin Plot</td>
</tr>
<tr>
<td>sns.countplot()</td>
<td>sns.kdeplot()</td>
<td>sns.jointplot()</td>
<td>sns.violinplot()</td>
</tr>
<tr>
<td>Good for nominal and small ordinal categorical data.</td>
<td>Good for interval data.</td>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for interval data and some nominal categorical data.</td>
</tr>
</table>

----

In the previous two sections we explored data visualization using the `pandas` built-in plotting tools. In this section, we'll do the same with `seaborn`.

`seaborn` is a standalone data visualization package that provides many extremely valuable data visualizations in a single package. It is generally a much more powerful tool than `pandas`; let's see why.

In [None]:
import pandas as pd
reviews = pd.read_csv("../data_collection/wine-reviews/winemag-data_first150k.csv", index_col=0)
import seaborn as sns

reviews.head()

## Countplot

The `pandas` bar chart becomes a `seaborn` `countplot`.

In [None]:
sns.countplot(x=reviews['points'])

Comparing this chart with the bar chart from two notebooks ago, we find that, unlike `pandas`, `seaborn` doesn't require us to shape the data for it via `value_counts`; the `countplot` (true to its name) aggregates the data for us!
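A small check of that claim, on a made-up series of scores rather than the wine data: the bars that `countplot` draws should account for every row, just as the counts from `value_counts` do.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd
import seaborn as sns

# Hypothetical scores, one entry per review.
points = pd.Series([85, 86, 86, 87, 87, 87, 88], name="points")

# pandas: we aggregate ourselves, then plot the counts.
counts = points.value_counts().sort_index()

# seaborn: hand over the raw values and let countplot do the counting.
ax = sns.countplot(x=points)
heights = [bar.get_height() for bar in ax.patches]

print(counts.tolist())
print(heights)
```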

`seaborn` doesn't have a direct analogue to the line or area chart. Instead, the package provides a `kdeplot`:

## KDE Plot

In [None]:
sns.kdeplot(reviews.query('price < 200').price)

KDE, short for "kernel density estimate", is a statistical technique for smoothing out data noise. It addresses an important fundamental weakness of a line chart: it will buff out outlier or "in-betweener" values which would cause a line chart to suddenly dip.

For example, suppose that there was just one wine priced \$19.93, but several hundred priced \$20.00. If we were to plot the value counts in a line chart, our line would dip very suddenly down to 1 and then back up to around 1000 again, creating a strangely "jagged" line. The line chart with the same data, shown below for the purposes of comparison, has exactly this problem!

Note that the x axis in a `seaborn` `kdeplot` is the variable being plotted (in this case, `price`), while the y axis is the estimated density, that is, how often each value occurs.

In [None]:
reviews[reviews['price'] < 200]['price'].value_counts().sort_index().plot.line()

A KDE plot is better than a line chart for getting the "true shape" of interval data. In fact, I recommend always using it instead of a line chart for such data.

However, it's a worse choice for ordinal categorical data. A KDE plot expects that if there are 200 wines rated 85 and 400 rated 86, then the values in between, like 85.5, should smooth out to somewhere in between (say, 300). However, if the value in between can't occur (wine ratings of 85.5 are not allowed), then the KDE plot is fitting to something that doesn't exist. In these cases, use a line chart instead.
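To see this effect in numbers, here is a hand-rolled Gaussian KDE in NumPy. It is only a sketch: it is not `seaborn`'s implementation, and the bandwidth of 0.5 is an arbitrary choice made for illustration.

```python
import numpy as np

# 200 wines rated 85 and 400 rated 86, as in the example above.
ratings = np.repeat([85.0, 86.0], [200, 400])

def gaussian_kde_at(x, samples, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated at x."""
    z = (x - samples) / bandwidth
    norm = len(samples) * bandwidth * np.sqrt(2 * np.pi)
    return np.exp(-0.5 * z ** 2).sum() / norm

bandwidth = 0.5  # assumption: chosen by hand, not estimated from the data
d85 = gaussian_kde_at(85.0, ratings, bandwidth)
d855 = gaussian_kde_at(85.5, ratings, bandwidth)
d86 = gaussian_kde_at(86.0, ratings, bandwidth)

# The estimate at 85.5 is positive and sits between its neighbours,
# even though no wine in the sample was rated 85.5.
print(d85, d855, d86)
```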

KDE plots can also be used in two dimensions.

In [None]:
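price_points = reviews[reviews['price'] < 200].loc[:, ['price', 'points']].dropna().sample(5000)
sns.kdeplot(data=price_points, x='price', y='points')  # bivariate KDE; x=/y= keywords need seaborn >= 0.11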

Bivariate KDE plots like this one are a great alternative to scatter plots and hex plots. They solve the same overplotting issue that scatter plots suffer from and hex plots address, in a different but similarly visually appealing way. However, note that bivariate KDE plots are very computationally intensive. We took a sample of 5000 points in this example to keep compute time reasonable.

## Distplot

The `seaborn` equivalent to a `pandas` histogram is the `distplot`. Here's an example:

In [None]:
sns.distplot(reviews['points'], bins=10, kde=False)

The `distplot` is a composite plot type. In the example above we've turned off the `kde` that's included by default, and manually set the number of bins to 10 (two possible ratings per bin), to get a clearer picture.

## Scatterplot and hexplot

To plot two variables against one another in `seaborn`, we use `jointplot`.

In [None]:
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 100])

Notice that this plot comes with some bells and whistles: a correlation coefficient is provided, along with histograms on the sides. These kinds of composite plots are a recurring theme in `seaborn`. Other than that, the `jointplot` is just like the `pandas` scatter plot.

As in `pandas`, we can use a hex plot (by simply passing `kind='hex'`) to deal with overplotting:

In [None]:
sns.jointplot(x='price', y='points', data=reviews[reviews['price'] < 100], kind='hex', 
              gridsize=20)

## Boxplot and violin plot

`seaborn` provides a boxplot function. It creates a statistically useful plot that looks like this:

In [None]:
df = reviews[reviews.variety.isin(reviews.variety.value_counts().head(5).index)]

sns.boxplot(
    x='variety',
    y='points',
    data=df)

The center of the distributions shown above is the "box" in boxplot. The top of the box is the 75th percentile, while the bottom is the 25th percentile. In other words, half of the data is distributed within the box! The green line in the middle is the median.

The other part of the plot, the "whiskers", shows the extent of the points beyond the center of the distribution. Individual circles beyond *that* are outliers.

This boxplot shows us that although all five wines receive broadly similar ratings, Bordeaux-style wines tend to be rated a little higher than a Chardonnay.

Boxplots are great for summarizing the shape of many datasets. They also place no limit on the number of categories: you can place as many boxes in the plot as you feel comfortable squeezing onto the page.

However, they only work for interval variables and nominal variables with a large number of possible values; they assume your data is roughly normally distributed (otherwise their design doesn't make much sense); and they don't carry any information about individual values, only treating the distribution as a whole.
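To make the box and whisker definitions concrete, the sketch below computes them by hand for a made-up list of scores. It assumes the common 1.5 × IQR whisker rule (the default `whis=1.5` in `matplotlib`); a plot drawn with other settings would place the whiskers differently.

```python
import pandas as pd

# Hypothetical scores for a single variety.
points = pd.Series([80, 84, 86, 87, 88, 89, 90, 92, 100])

# The box: 25th percentile, median, 75th percentile.
q1, median, q3 = points.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # height of the box

# Assumption: whiskers reach the most extreme points within 1.5 * IQR of the box.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = points[(points >= lower_limit) & (points <= upper_limit)]
lower_whisker, upper_whisker = inside.min(), inside.max()

# Anything beyond the limits is drawn as an individual outlier.
outliers = points[(points < lower_limit) | (points > upper_limit)]

print(q1, median, q3, lower_whisker, upper_whisker, outliers.tolist())
```

Note that only these few summary numbers reach the plot; the individual scores inside the whiskers are not shown.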

I find the slightly more advanced `violinplot` to be more visually enticing, in most cases:

In [None]:
sns.violinplot(
    x='variety',
    y='points',
    data=reviews[reviews.variety.isin(reviews.variety.value_counts()[:5].index)]
)

A `violinplot` cleverly replaces the box in the boxplot with a kernel density estimate for the data. It shows basically the same data, but is harder to misinterpret and much prettier than the utilitarian boxplot.

## Why seaborn?

Having now seen both `pandas` plotting and the `seaborn` library in action, we are now in a position to compare the two and decide when to use which for what.

Recall the format of the data we've been working with in this tutorial:

In [None]:
reviews.head()

This data is in a "record-oriented" format. Each individual row is a single record (a review); in aggregate, the list of all rows is the list of all records (all reviews). This is the format of choice for most kinds of data: data corresponding with individual, unit-identifiable "things" ("records"). The majority of the simple data that gets generated is created in this format, and data that isn't can almost always be converted over. This is known as a "tidy data" format.

`seaborn` is designed to work with this kind of data out-of-the-box, for all of its plot types, with minimal fuss. This makes it an incredibly convenient workbench tool.

`pandas` is not designed this way. In `pandas`, every plot we generate is tied very directly to the input data. In essence, `pandas` expects your data to be in exactly the right *output* shape, regardless of what the input is.
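A rough illustration of the difference, using a tiny made-up table rather than the wine reviews: to get one box per variety, `pandas` wants one column per variety, while `seaborn` takes the record-oriented table as it is.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical tidy data: one row per review.
tidy = pd.DataFrame({
    "variety": ["Chardonnay", "Merlot", "Chardonnay", "Merlot", "Chardonnay"],
    "points": [85, 88, 90, 86, 87],
})

fig, (ax_pandas, ax_seaborn) = plt.subplots(1, 2, figsize=(10, 4))

# pandas: first reshape so that each variety becomes its own column
# (cells with no review for that variety are NaN), then plot.
wide = tidy.pivot(columns="variety", values="points")
wide.plot.box(ax=ax_pandas)

# seaborn: name the columns and pass the tidy table directly.
sns.boxplot(x="variety", y="points", data=tidy, ax=ax_seaborn)
```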

<!--
In the previous section of this tutorial, we purposely evaded this issue by using supplemental datasets in a "just right" shape. Starting from the data that we already have, here's what it would take to generate a simple histogram:

```python
import numpy as np
top_five_wines_scores = (
    reviews
        .loc[np.where(reviews.variety.isin(reviews.variety.value_counts().head(5).index))]
        .loc[:, ['variety', 'points']]
        .groupby('variety')
        .apply(lambda df: pd.Series(df.points.values))
        .unstack()
        .T
)
top_five_wines_scores.plot.hist()
```

As we demonstrated above, to do the same thing in `seaborn`, all we need is:

```python
sns.distplot(reviews.points, bins=10, kde=False)
```

The difference is stark!
-->

Hence, in practice, the simple `pandas` plotting tools are great for the initial stages of exploratory data analysis, but `seaborn` becomes your tool of choice once you start doing more sophisticated explorations.

<!--
My recommendations are:
* Bar plot: 
  * `pd.Series.plot.bar`
  * `sns.countplot`
* Scatter plot:
  * `pd.Series.plot.scatter`
  * `sns.jointplot`
* Hex plot:
  * `pd.Series.plot.hex`
  * `sns.jointplot`
* Line/KDE plot:
  * `pd.Series.plot.line` for nominal categorical variables
  * `sns.kdeplot` for interval variables
* Box/Violin plot:
  * `sns.boxplot`
  * `sns.violinplot`
* Histogram:
   * `sns.distplot`
-->

# Examples

As in previous notebooks, let's now test ourselves by answering some questions about the plots we've used in this section. Once you have your answers, click on the "Output" button below to show the correct answers.

1. A `seaborn` `countplot` is equivalent to what in `pandas`?
2. A `seaborn` `jointplot` which is configured with `kind='hex'` is equivalent to what in `pandas`?
3. Why might a `kdeplot` not work very well for ordinal categorical data?
4. What does the "box" in a `boxplot` represent?

In [None]:
from IPython.display import HTML
HTML("""
<ol>
<li>A seaborn countplot is like a pandas bar plot.</li>
<li>A seaborn jointplot with kind='hex' is like a pandas hex plot.</li>
<li>KDEPlots work by aggregating data into a smooth curve. This is great for interval data but doesn't always work quite as well for ordinal categorical data.</li>
<li>The top of the box is the 75th percentile. The bottom of the box is the 25th percentile. The median, the 50th percentile, is the line in the center of the box. So 50% of the data in the distribution is located within the box!</li>
</ol>
""")

In [None]:
pokemon = pd.read_csv("../data_collection/Pokemon.csv", index_col=0)
pokemon.head()

And now, can you replicate the plots in the pictures?

<img src="images/60.png">

<img src="images/61.png">

<img src="images/62.png">

<img src="images/63.png">

<img src="images/64.png">

<img src="images/65.png">

<img src="images/66.png">

## Conclusion

`seaborn` is one of the most important, if not *the* most important, data visualization tools in the Python data viz ecosystem. In this notebook we looked at what features and capacities `seaborn` brings to the table. There's plenty more that you can do with the library that we won't cover here or elsewhere in the tutorial; I highly recommend browsing the terrific `seaborn` [Gallery page](https://seaborn.pydata.org/examples/index.html) to see more beautiful examples of the library in action.

[Click here to go to the next section, "Faceting with seaborn"](https://www.kaggle.com/residentmario/faceting-with-seaborn).

# Faceting with seaborn

<table>
<tr>
<td><img src="https://i.imgur.com/wU9M9gu.png" width="350px"/></td>
<td><img src="https://i.imgur.com/85d2nIj.png" width="350px"/></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Facet Grid</td>
<td style="font-weight:bold; font-size:16px;">Pair Plot</td>
</tr>
<tr>
<td>sns.FacetGrid()</td>
<td>sns.pairplot()</td>
</tr>
<tr>
<td>Good for data with at least two categorical variables.</td>
<td>Good for exploring most kinds of data.</td>
</tr>
</table>

So far in this tutorial we've been plotting data in one (univariate) or two (bivariate) dimensions, and we've learned how plotting in `seaborn` works. In this section we'll dive deeper into `seaborn` by exploring **faceting**.

Faceting is the act of breaking data variables up across multiple subplots, and combining those subplots into a single figure. So instead of one bar chart, we might have, say, four, arranged together in a grid.

In this notebook we'll put this technique in action, and see why it's so useful.

In [None]:
import pandas as pd
pd.set_option('display.max_columns', None)
df = pd.read_csv("../data_collection/fifa-18-demo-player-dataset/CompleteDataset.csv", index_col=0)
df.head()

In [None]:
import re
import numpy as np

footballers = df.copy()
footballers['Unit'] = df['Value'].str[-1]
footballers['Value (M)'] = np.where(footballers['Unit'] == '0', 0, 
                                    footballers['Value'].str[1:-1].replace(r'[a-zA-Z]',''))
footballers['Value (M)'] = footballers['Value (M)'].astype(float)
footballers['Value (M)'] = np.where(footballers['Unit'] == 'M', 
                                    footballers['Value (M)'], 
                                    footballers['Value (M)']/1000)
footballers = footballers.assign(Value=footballers['Value (M)'],
                                 Position=footballers['Preferred Positions'].str.split().str[0])

(Note: the first code cell above contains some data pre-processing. This is extraneous, and so I've hidden it by default.)

In [None]:
footballers.head()

In [None]:
import seaborn as sns

## The FacetGrid

The core `seaborn` utility for faceting is the `FacetGrid`. A `FacetGrid` is an object which stores some information on how you want to break up your data visualization.

For example, suppose that we're interested in (as in the previous notebook) comparing strikers and goalkeepers in some way. To do this, we can create a `FacetGrid` with our data, telling it that we want to break the `Position` variable down by `col` (column).

Since we're zeroing in on just two positions in particular, this results in a pair of grids ready for us to "do" something with them:

In [None]:
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col="Position")

From there, we use the `map` object method to plot the data into the laid-out grid.

In [None]:
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
g = sns.FacetGrid(df, col="Position")
g.map(sns.kdeplot, "Overall")

Passing a method into another method like this may take some getting used to, if this is your first time seeing this being done. But once you get used to it, `FacetGrid` is very easy to use.

By using an object to gather "design criteria", `seaborn` does an effective job seamlessly marrying the data *representation* to the data *values*, sparing us the need to lay the plot out ourselves.

We're probably interested in more than just goalkeepers and strikers, however. But if we squeezed all of the possible game positions into one row, the resulting plots would be tiny. `FacetGrid` comes equipped with a `col_wrap` parameter for dealing with this case exactly.

In [None]:
df = footballers

g = sns.FacetGrid(df, col="Position", col_wrap=6)
g.map(sns.kdeplot, "Overall")

So far we've been dealing exclusively with one `col` (column) of data. The "grid" in `FacetGrid`, however, refers to the ability to lay data out by row *and* column.

For example, suppose we're interested in comparing the talent distribution (for goalkeepers and strikers specifically, to keep things succinct) across rival clubs Real Madrid, Atlético Madrid, and FC Barcelona.

As the plot below demonstrates, we can achieve this by passing `row="Position"` and `col="Club"` to `FacetGrid`.

In [None]:
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
df = df[df['Club'].isin(['Real Madrid CF', 'FC Barcelona', 'Atlético Madrid'])]

g = sns.FacetGrid(df, row="Position", col="Club")
g.map(sns.violinplot, "Overall")

`FacetGrid` orders the subplots effectively arbitrarily by default. To specify your own ordering explicitly, pass the appropriate argument to the `row_order` and `col_order` parameters.

In [None]:
df = footballers[footballers['Position'].isin(['ST', 'GK'])]
df = df[df['Club'].isin(['Real Madrid CF', 'FC Barcelona', 'Atlético Madrid'])]

g = sns.FacetGrid(df, row="Position", col="Club", 
                  row_order=['GK', 'ST'],
                  col_order=['Atlético Madrid', 'FC Barcelona', 'Real Madrid CF'])
g.map(sns.violinplot, "Overall")

`FacetGrid` comes equipped with various lesser parameters as well, but these are the most important ones.

## Why facet?

In a nutshell, faceting is the easiest way to make your data visualization multivariate.

Faceting is multivariate because after laying out one (categorical) variable in the rows and another (categorical) variable in the columns, we are already at two variables accounted for before regular plotting has even begun.

And faceting is easy because transitioning from plotting a `kdeplot` to gridding them out, as here, is very simple. It doesn't require learning any new visualization techniques. The limitations are the same ones that held for the plots you use inside.

Faceting does have some important limitations, however. It can only be used to break data out across one or two categorical variables with very low cardinality: any more than five or so dimensions in the grid, and the plots become too small (or involve a lot of scrolling). Additionally, it involves choosing (or letting Python choose) an order to plot in, but with nominal categorical variables that choice is distractingly arbitrary.

Nevertheless, faceting is an extremely useful and applicable tool to have in your toolbox.

## Pairplot

Now that we understand faceting, it's worth taking a quick once-over of the `seaborn` `pairplot` function.

`pairplot` is a very useful and widely used `seaborn` method for faceting *variables* (as opposed to *variable values*). You pass it a `pandas` `DataFrame` in the right shape, and it returns you a gridded result of your variable values:

In [None]:
sns.pairplot(footballers[['Overall', 'Potential', 'Value']])

By default `pairplot` will return scatter plots in the main entries and a histogram in the diagonal. `pairplot` is oftentimes the first thing that a data scientist will throw at their data, and it works fantastically well in that capacity, even if sometimes the scatter-and-histogram approach isn't quite appropriate, given the data types.

# Examples

As in previous notebooks, let's now test ourselves by answering some questions about the plots we've used in this section. Once you have your answers, click on the "Output" button below to show the correct answers.

1. Suppose that we create an `n` by `n` `FacetGrid`. How big can `n` get?
2. What are the two things about faceting which make it appealing?
3. When is `pairplot` most useful?

In [None]:
from IPython.display import HTML
HTML("""
<ol>
<li>You should try to keep your grid variables down to five or so. Otherwise the plots get too small.</li>
<li>It's (1) a multivariate technique which (2) is very easy to use.</li>
<li>Pair plots are most useful when just starting out with a dataset, because they help contextualize relationships within it.</li>
</ol>
""")

In [None]:
import pandas as pd
import seaborn as sns

pokemon = pd.read_csv("../data_collection/Pokemon.csv", index_col=0)
pokemon.head(3)

Now, can you reproduce the graphs below?

<img src="images/67.png">

<img src="images/68.png">

<img src="images/69.png">

## Conclusion

In this notebook we explored `FacetGrid` and `pairplot`, two `seaborn` facilities for faceting your data, and discussed why faceting is so useful in a broad range of cases.

This technique is our first dip into multivariate plotting, an idea that we will explore in more depth with two other approaches in the next section.

[Click here to go to the next section, "Multivariate plotting"](https://www.kaggle.com/residentmario/multivariate-plotting).

# Multivariate plotting

<table>
<tr>
<td><img src="https://i.imgur.com/gJ65O47.png" width="350px"/></td>
<td><img src="https://i.imgur.com/3qEqPoD.png" width="350px"/></td>
<td><img src="https://i.imgur.com/1fmV4M2.png" width="350px"/></td>
<td><img src="https://i.imgur.com/H20s88a.png" width="350px"/></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Multivariate Scatter Plot</td>
<td style="font-weight:bold; font-size:16px;">Grouped Box Plot</td>
<td style="font-weight:bold; font-size:16px;">Heatmap</td>
<td style="font-weight:bold; font-size:16px;">Parallel Coordinates</td>
</tr>
<tr>
<td>df.plot.scatter()</td>
<td>df.plot.box()</td>
<td>sns.heatmap</td>
<td>pd.plotting.parallel_coordinates</td>
</tr>
<!--
<tr>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for nominal and ordinal categorical data.</td>
<td>Good for ordinal categorical and interval data.</td>
</tr>
-->
</table>

For most of this tutorial we've been plotting data in one (univariate) or two (bivariate) dimensions. In the previous section we explored faceting: a multivariate plotting method that works by "gridding out" the data.

In this section we'll delve further into multivariate plotting. First we'll explore "truly" multivariate charts. Then we'll examine some plots that use summarization to get at the same thing.

In [None]:
footballers.head()

## Adding more visual variables

The most obvious way to plot lots of variables is to augment the visualizations we've been using thus far with even more [visual variables](http://www.infovis-wiki.net/index.php?title=Visual_Variables). A **visual variable** is any visual dimension or marker that we can use to perceptually distinguish two data elements from one another. Examples include size, color, shape, and one, two, and even three dimensional position.

"Good" multivariate data displays are ones that make efficient, easily-interpretable use of these parameters.

### Multivariate scatter plots

Let's look at some examples. We'll start with the scatter plot. Suppose that we are interested in seeing which type of offensive player tends to get paid the most: the striker, the right-winger, or the left-winger.

In [None]:
import seaborn as sns

sns.lmplot(x='Value', y='Overall', hue='Position', 
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])], 
           fit_reg=False)

This scatterplot uses three visual variables. The horizontal position (x-value) tracks the `Value` of the player (how well they are paid). The vertical position (y-value) tracks the `Overall` score of the player across all attributes. And the color (the `hue` parameter) tracks which of the three positions of interest the player belongs to.

The new variable in this chart is **color**. Color provides an aesthetically pleasing visual, but it's tricky to use. Looking at this scatter plot we see the same overplotting issue we saw in previous sections. But we no longer have an easy solution, like using a hex plot, because color doesn't make sense in that setting.

Another example visual variable is **shape**. Shape controls, well, the shape of the marker:

In [None]:
sns.lmplot(x='Value', y='Overall', markers=['o', 'x', '*'], hue='Position',
           data=footballers.loc[footballers['Position'].isin(['ST', 'RW', 'LW'])],
           fit_reg=False
          )

`seaborn` is opinionated about what kinds of visual variables you should use, and doesn't provide a shape option very often. This is because simple shapes, though nifty, are perceptually inferior to colors in terms of their distinguishability.

### Grouped box plot

Another demonstrative plot is the grouped box plot. This plot takes advantage of **grouping**. Suppose we're interested in the following question: do Strikers score higher on "Aggression" than Goalkeepers do?

In [None]:
f = (footballers
         .loc[footballers['Position'].isin(['ST', 'GK'])]
         .loc[:, ['Value', 'Overall', 'Aggression', 'Position']]
    )
f = f[f["Overall"] >= 80]
f = f[f["Overall"] < 85]
f['Aggression'] = f['Aggression'].astype(float)

sns.boxplot(x="Overall", y="Aggression", hue='Position', data=f)

As you can see, this plot demonstrates conclusively that within our dataset, goalkeepers (at least, those with an overall score between 80 and 85) have *much* lower Aggression scores than strikers do.

In this plot, the horizontal axis encodes the `Overall` score, the vertical axis encodes the `Aggression` score, and the grouping encodes the `Position`.

Grouping is an extremely communicative visual variable: it makes this chart very easy to interpret. However, it has very low cardinality: it's very hard to use groups to fit more than a handful of categorical values. In this plot we've chosen just two player positions and five Overall player scores and the visualization is already rather crowded. Overall, grouping is very similar to faceting in terms of what it can and can't do.

## Summarization

It is difficult to squeeze enough dimensions onto a plot without hurting its interpretability. Very busy plots are naturally very hard to interpret. Hence highly multivariate plots can be difficult to use.

Another way to plot many dataset features while circumnavigating this problem is to use **summarization**. Summarization is the creation and addition of new variables by mixing and matching the information provided in the old ones.

Summarization is a useful technique in data visualization because it allows us to "boil down" potentially very complicated relationships into simpler ones.

### Heatmap

Probably the most heavily used summarization visualization is the **correlation plot**, which measures the correlation between every pair of variables in a dataset and plots the result in color.

In [None]:
f = (
    footballers.loc[:, ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control']]
        .apply(pd.to_numeric, errors='coerce')  # entries that are not plain numbers become NaN
        .dropna()
).corr()

sns.heatmap(f, annot=True)

Each cell in this plot is the intersection of two variables; its color and label together indicate the amount of *correlation* between the two variables (how likely both variables are to increase or decrease at the same time). For example, in this dataset Agility and Acceleration are highly correlated, while Aggression and Balance are very uncorrelated.

A correlation plot is a specific kind of **heatmap**. A heatmap maps one particular fact (in this case, correlation) about every pair of variables you chose from a dataset.
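To get a feel for what the numbers in the cells mean, the sketch below builds a tiny made-up table in which the relationships are known in advance, then runs the same `corr` and `heatmap` steps on it. The column names `a` to `d` are placeholders, not dataset fields.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import pandas as pd
import seaborn as sns

toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # rises exactly in step with a
    "c": [5, 4, 3, 2, 1],   # falls exactly as a rises
    "d": [2, 1, 3, 1, 2],   # no linear relationship with a
})

# Pairwise (Pearson) correlation between every pair of columns.
corr = toy.corr()

# Fix the color scale to the full correlation range so colors are comparable.
ax = sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")

print(corr.round(2))
```

A value near 1 means the two columns move together, a value near -1 means they move in opposite directions, and a value near 0 means there is no linear relationship.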

### Parallel Coordinates

A parallel coordinates plot (`pd.plotting.parallel_coordinates`) of a sample of 200 goalkeepers (in dark green) and strikers (in light green) shows each player as a line across our five variables of interest.

Parallel coordinates plots are great for determining how distinguishable different classes are in the data. They standardize the variables from top to bottom... In this case, we see that strikers are almost uniformly higher rated on all of the variables we've chosen, meaning these two classes of players are very easy to distinguish.
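A plot of this kind can be sketched with `pandas.plotting.parallel_coordinates`. The data below is synthetic: the scores are random numbers, deliberately drawn from a higher range for strikers, so the cell shows the mechanics only and says nothing about the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs outside a notebook
import numpy as np
import pandas as pd
from pandas.plotting import parallel_coordinates

rng = np.random.default_rng(2)
cols = ["Acceleration", "Aggression", "Agility", "Balance", "Ball control"]

# Made-up scores: goalkeepers drawn from a lower range than strikers.
keepers = pd.DataFrame(rng.integers(20, 60, size=(20, 5)), columns=cols)
keepers["Position"] = "GK"
strikers = pd.DataFrame(rng.integers(60, 95, size=(20, 5)), columns=cols)
strikers["Position"] = "ST"
toy = pd.concat([keepers, strikers], ignore_index=True)

# One line per player, one vertical axis per variable, colored by class
# (colors are assigned in order of first appearance: GK, then ST).
ax = parallel_coordinates(toy, "Position", color=["darkgreen", "lightgreen"])

handles, labels = ax.get_legend_handles_labels()
print(labels)
```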

## Exercises

In [None]:
pokemon = pd.read_csv("../input/pokemon/Pokemon.csv", index_col=0)
pokemon.head()

Try answering the following questions. Click the "Output" button on the cell below to see the answers.

1. What are three techniques for creating multivariate data visualizations?
2. Name three examples of visual variables.
3. How does summarization in data visualization work?

In [None]:
import seaborn as sns

# scatter plot Defense against Attack, 
# hues depending on Legendary, markers of 'x' and 'o'


In [None]:
# boxplot Total depending on Generation, hue is 'legendary'


In [None]:
# heatmap of 'HP', 'Attack', 'Sp. Atk', 'Defense', 'Sp. Def', 'Speed'


# Conclusion

In this tutorial we followed up on faceting, covered in the last section, by diving into two other multivariate data visualization techniques.

The first technique, adding more visual variables, results in more complicated but potentially more detailed plots. The second technique, summarization, compresses variable information to a summary statistic, resulting in a simple output&mdash;albeit at the cost of expressiveness.

Faceting, adding visual variables, and summarization are the three multivariate techniques that we will cover in this tutorial.

<table>
<tr>
<td><img src="https://i.imgur.com/BqJgyzB.png" width="350px"/></td>
<td><img src="https://i.imgur.com/ttYzMwD.png" width="350px"/></td>
<td><img src="https://i.imgur.com/WLmzj41.png" width="350px"/></td>
<td><img src="https://i.imgur.com/LjRTbCn.png" width="350px"/></td>
</tr>
<tr>
<td style="font-weight:bold; font-size:16px;">Scatter Plot</td>
<td style="font-weight:bold; font-size:16px;">Choropleth</td>
<td style="font-weight:bold; font-size:16px;">Heatmap</td>
<td style="font-weight:bold; font-size:16px;">Surface Plot</td>
</tr>
<tr>
<td>go.Scatter()</td>
<td>go.Choropleth()</td>
<td>go.Heatmap()</td>
<td>go.Surface()</td>
</tr>
<!--
<tr>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for interval and some nominal categorical data.</td>
<td>Good for nominal and ordinal categorical data.</td>
<td>Good for ordinal categorical and interval data.</td>
</tr>
-->
</table>