# 08 Basic plotting with seaborn
File(s) needed: none

There are two types of plots: exploratory and explanatory. **Exploratory** plots are useful when you are trying to decide which direction to pursue in your analysis. They can be simple plots, but they still need enough detail to be self-documenting. Plotting data is an important part of preparing it for analysis: descriptive statistics are helpful, but they don't show the whole picture. Our first example will show why.

**Explanatory** plots are used to tell an audience about the results of your analysis. These plots are usually more complete and can even be "publication ready" in format.

There are many libraries available for plotting in Python. The Seaborn library is built upon the matplotlib package, and there are even some basic plotting methods built into pandas objects. 

We will focus on using the seaborn library to produce exploratory plots as part of the process of learning about our data. 

## Why bother with plotting our data?
Anscombe's quartet shows why we have to visualize data. Each of four sets of two continuous variables has the same
- mean
- variance
- correlation
- regression line.

When we plot them, however, we can see that they are definitely not the same.
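
The numeric half of that claim can be checked before any plotting. The sketch below assumes only pandas and types in Anscombe's published values by hand so that it runs offline; the column names (`dataset`, `x`, `y`) are chosen to match the layout of seaborn's copy of the data, and the names `summary` and `x_123` are just illustrative.

```python
import pandas as pd

# Anscombe's quartet, typed in by hand (sets I-III share the same x values)
x_123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x_4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y_values = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
    "III": [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73],
    "IV": [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89],
}

# Stack the four sets into one long dataframe
frames = []
for name, ys in y_values.items():
    xs = x_4 if name == "IV" else x_123
    frames.append(pd.DataFrame({"dataset": name, "x": xs, "y": ys}))
anscombe = pd.concat(frames, ignore_index=True)

# Subset a single set with a boolean mask
dataset_1 = anscombe[anscombe["dataset"] == "I"]

# Compare the summary statistics of all four sets side by side
grouped = anscombe.groupby("dataset")
summary = pd.DataFrame({
    "x_mean": grouped["x"].mean(),
    "y_mean": grouped["y"].mean(),
    "x_var": grouped["x"].var(),
    "y_var": grouped["y"].var(),
    "corr": pd.Series({name: g["x"].corr(g["y"]) for name, g in grouped}),
})
print(summary.round(2))
```

The rows of `summary` agree to about two decimal places, which is exactly why the statistics alone cannot tell the four sets apart.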

The Anscombe data is available through the `seaborn` library, so we need to import the library to access the data, and we will use its plotting functionality a little later. The standard way to import seaborn is with the `sns` alias. The `load_dataset()` function is only used to retrieve the example datasets that come with seaborn.

In [1]:
# import seaborn library and get the data


In [2]:
# Create a subset of the Anscombe data using just the rows from data set I


In [3]:
# Create the rest of the subsets of the Anscombe data


Let's look at the descriptive stats for each dataset in the Anscombe set to confirm they have the same values. 

In [4]:
# Comparison of descriptive stats for datasets I - IV
print('dataset_1\n',dataset_1.describe(),'\n')
print('dataset_2\n',dataset_2.describe(),'\n')
print('dataset_3\n',dataset_3.describe(),'\n')
print('dataset_4\n',dataset_4.describe(),'\n')


Plot each of the Anscombe datasets as a scatterplot to see the differences.


In [None]:
# First dataset


In [None]:
# Second dataset


In [None]:
# Third dataset


In [None]:
# Fourth dataset


So, even though these four sets of data produce the same descriptive statistics, they are very different. But that is only apparent if we plot them.
<p></p>
<div style="padding:20px;font-size:150%;color:maroon;background-color:palegoldenrod">
    <p style="text-align:center">Avoid surprises. Always plot your data to see what it looks like.</p>
</div>
<p></p>

## Another important point about plotting
<p></p>
<div style="padding:20px;font-size:250%;color:maroon;background-color:palegoldenrod">
    <p style="text-align:center">Ability to create a plot</p><p style="text-align:center">!=</p><p style="text-align:center">Need to create a plot</p>
</div>
<p></p>
There are many types of plots out there. The types of plots you are expected to use often change depending upon the field in which you are working. Only use a plot if it makes sense as part of your analysis. Just because you can create a fancy plot doesn't mean you should do it, unless it adds value to the analysis.

We are going to skip some of the more exotic plots shown in the text and focus on the ones you will most likely need when exploring a dataset: 
- scatterplot,
- bar plot, 
- histogram, and
- box and whiskers plot.

# Seaborn graphics
The `matplotlib` library can be thought of as the core foundational plotting tool in Python. The `seaborn` library builds on `matplotlib` by providing a higher-level interface for statistical graphics. It allows you to create more professional-looking plots with very little coding.

From the Seaborn documentation:
>Seaborn aims to make visualization a central part of exploring and understanding data. Its dataset-oriented plotting functions operate on dataframes and arrays containing whole datasets and internally perform the necessary semantic mapping and statistical aggregation to produce informative plots.

https://seaborn.pydata.org/index.html

This will make more sense after a few examples.

---
## Global formatting options
There are options that can be set for each of the plots below, but you might want to set some global options to use on all the plots in the notebook. Maybe you like larger text in your plots or you prefer to use a different set of colors. If that is the case, you can use one of the predefined sets of colors called a _palette_.

The predefined seaborn color palettes are:
- deep
- muted
- bright
- pastel
- dark
- colorblind

This is especially a good idea if you will be sharing any of your graphics with others. In that case, you might want to use the `colorblind` palette to potentially avoid any issues.

The overall styles included with seaborn are
- dark
- white
- darkgrid
- whitegrid
- ticks

More information on using these styles is available in the seaborn documentation.

You can also change the font, font size, and other options. See the page https://seaborn.pydata.org/generated/seaborn.set.html for more on the available options.
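
To see what a palette or style contains before changing anything globally, the following sketch inspects both without touching the current settings. It assumes seaborn is installed; `color_palette()` and `axes_style()` return the values instead of applying them, so running it should not affect later plots.

```python
import seaborn as sns

# A named palette is a list of RGB triples; as_hex() shows them as hex codes
palette = sns.color_palette("colorblind")
print(palette.as_hex())

# A named style is a dictionary of matplotlib settings
style = sns.axes_style("whitegrid")
print("grid on:", style["axes.grid"])
print("background:", style["axes.facecolor"])
```

`axes_style()` can also be used in a `with` statement when a style should apply to a single plot only.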

In [None]:
# DON'T RUN THIS YET!
# You can use sns.set() to change global options for your seaborn plots.
# Create the next few plots using the default settings, then come back
# here to run this cell and then recreate the plots to see the differences.
sns.set(palette='colorblind', style='whitegrid', font_scale=1.2)

# Examples with `tips` data

In [None]:
# load a copy of the tips dataset from the seaborn library for our examples


# Univariate data
When we only have one variable of interest, there isn't much we can do. You can't plot one variable in two dimensions. What we do, therefore, is count the number of times each data point occurs and plot the counts. We typically call those counts **_frequencies_**.

How we get those frequencies depends upon whether we have _categorical_ or _continuous_ data.

## Categorical data: Bar charts
For categorical data, seaborn provides a function called `countplot()`. This function counts the occurrences of the values in the specified column and plots a bar chart based on the frequencies. Just specify the dataframe (`data=` argument) and the column of interest (`x=` argument).
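
As a minimal sketch of `countplot()`, the example below uses a tiny made-up dataframe instead of the real `tips` data so that it runs without downloading anything. Only the column name `day` mirrors the real dataset; the values are invented for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips data (illustrative values only)
tips = pd.DataFrame(
    {"day": ["Thur", "Thur", "Fri", "Sat", "Sat", "Sat", "Sun", "Sun"]}
)

fig, (ax_v, ax_h) = plt.subplots(1, 2, figsize=(8, 3))

# Vertical bars: pass the column as x
sns.countplot(x="day", data=tips, ax=ax_v)

# Horizontal bars: pass the same column as y instead
sns.countplot(y="day", data=tips, ax=ax_h)

# The bar lengths are the same frequencies that value_counts() reports
counts = tips["day"].value_counts()
print(counts)
```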


In [None]:
# Use the countplot method to plot tips by the day


In [None]:
# Plot horizontal bars by specifying a y value (instead of x).


---
<p style="font-size:125%;color:darkred;background:palegoldenrod;padding:25px">Now go back and run the cell with the <span style="Color:black">sns.set(palette='colorblind', style='whitegrid', font_scale=1.2)</span> statement in it. Then rerun these last two plots to see the difference.</p>

If you want to return to the default settings, run the command
```
sns.set()
```
in a cell and the settings we changed will be reset to the default values.

---

## Continuous data: Histogram
**Histograms** are probably the most common way to look at a single continuous variable. When the data is continuous, that means we have to convert it into data we can count. This is called **_discretizing_** the data (i.e., converting it to discrete data). We use a process called **binning** to create ranges (i.e., the bins) to group data points together. Then we can count the number of data points in each bin and plot the bin frequencies.

We can specify the number of bins ourselves or use seaborn's built-in functionality to do the binning for us. First, let's have seaborn do a quick histogram for us.

In [None]:
# Quick histogram with defaults


- The default `distplot` plots both a histogram and a density plot (using kernel density estimation). Note that seaborn 0.11 and later deprecate `distplot` in favor of `histplot` and `displot`.
- We can set the `kde` parameter to `False` if we just want the histogram.

Let's set that option and the option for the number of bins and run the plot again.

In [None]:
# Histogram only, add descriptive titles


## Continuous data: Box and whiskers plot
Boxplots show multiple statistics for a single variable.
- minimum
- first quartile
- median
- third quartile
- maximum
- outliers based on the interquartile range (if applicable)
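
A minimal sketch of a single box plot follows, again on a made-up stand-in for `tips` (invented values, real column name). The quartiles printed at the end are the values the box itself is drawn from.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips data (illustrative values only)
tips = pd.DataFrame(
    {"total_bill": [12.5, 18.0, 9.75, 22.4, 30.1, 15.6, 27.3, 20.0]}
)

# Passing the column as y draws a vertical box
ax = sns.boxplot(y="total_bill", data=tips)

# First quartile, median, and third quartile for comparison with the plot
quartiles = tips["total_bill"].quantile([0.25, 0.5, 0.75])
print(quartiles)
```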

In [None]:
# Create a simple box plot - note that y is specified, not x


# Bivariate data
Plots using values of two variables allow us to see a representation of the relationship between the variables.

    
## Scatterplots
A scatterplot is a great way to get an idea of what your data looks like when you have two-dimensional data. A basic scatterplot is just a graphic representation of each data point on an x-y grid. We use them to see if there are any obvious patterns in the data.

If we want a scatterplot that includes a fitted regression line, we use `lmplot` (or `regplot`).
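
The two approaches can be sketched side by side. The dataframe is a made-up stand-in for `tips` (invented values, real column names), and the title and label text are just examples.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips data (illustrative values only)
tips = pd.DataFrame({
    "total_bill": [12.5, 18.0, 9.75, 22.4, 30.1, 15.6, 27.3, 20.0],
    "tip": [2.0, 3.0, 1.5, 3.5, 5.0, 2.5, 4.0, 3.0],
})

# Basic scatterplot: returns a matplotlib Axes, so set() can add labels
ax = sns.scatterplot(x="total_bill", y="tip", data=tips)
ax.set(title="Tip versus total bill", xlabel="Total bill", ylabel="Tip")

# Scatterplot with a fitted regression line: lmplot() draws its own figure
# and returns a FacetGrid rather than an Axes
g = sns.lmplot(x="total_bill", y="tip", data=tips)
```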

In [None]:
# Basic scatterplot


We can add some customization to the scatter plot, like a title and better axis labels.

Not every Seaborn plot type allows this, but since we are focusing on exploratory plots it is not a big deal.

In [None]:
# Can add some customization


In [None]:
# A scatterplot with a regression line using lmplot()


## Pairwise relationships

- You can see all the pairwise relationships for all the numeric data in a dataframe using `pairplot()`.
- It plots a scatterplot between each pair of variables and a histogram of each variable on the diagonal.
- If there are more than a few variables it may take a little while so BE PATIENT!
- The top half of the visualization above the diagonal is a mirror image of the bottom half.

In [None]:
# Display all the pairwise plots


# Multivariate data
Plotting data with more than two variables is tricky, because there isn’t a template to use for every case.
- If we want to add a third variable, like the 'sex' variable in the 'tips' data, one option would be to color the points based on the value of that variable.
- If we wanted to add a fourth variable, we could add size to the dots.
    - A caveat to using size as a variable is that humans are not very good at differentiating relative size or areas.

- There are ways to include more information and distinguish data within the plot based upon additional categorical variables in the dataset.
    - multiple plots of univariate data on a single chart (like multiple boxplots)
    - color - use the `hue` parameter
    - size - use the `size` parameter
    - shape - use the `markers` parameter
    
Remember that size is the **worst** option to use because people have a hard time accurately seeing size differences.
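
These options can be sketched on a made-up stand-in for `tips` (invented values, real column names). Note that `markers` is the parameter name in `lmplot()`; the marker choices `"o"` and `"x"` are only examples, and the list needs one marker per level of the `hue` variable.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips data (illustrative values only)
tips = pd.DataFrame({
    "total_bill": [12.5, 18.0, 9.75, 22.4, 30.1, 15.6, 27.3, 20.0],
    "tip": [2.0, 3.0, 1.5, 3.5, 5.0, 2.5, 4.0, 3.0],
    "sex": ["Female", "Male", "Male", "Female",
            "Female", "Male", "Male", "Female"],
    "size": [2, 3, 3, 2, 4, 4, 2, 4],
})

# Third variable as color (hue), fourth variable as point size
ax = sns.scatterplot(data=tips, x="total_bill", y="tip",
                     hue="sex", size="size")

# Third variable as both color and marker shape, without a regression line
g = sns.lmplot(data=tips, x="total_bill", y="tip",
               hue="sex", markers=["o", "x"], fit_reg=False)
```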

## Multiple boxplots

In [None]:
# Multiple box plots
# x parameter is categorical to split data for multiple plots in one figure


## Colors - use `hue`

In [None]:
# make it a multiple bar chart by adding the hue parameter


In [None]:
# A simple scatterplot with colors for different time values plus titles


In [None]:
# Color used for multiple groups on scatterplot with regression line


In [None]:
# Pass a hue value into pairplot


## Color and size
- Sizes of the point markers can be another means to add more information to a plot.
- This option should be used sparingly, since the human eye is not very good at comparing areas.

In [None]:
# scatter_kws is a dictionary that is passed on to the matplotlib function plt.scatter
# access the s parameter to change the size of the points
# This example uses the tips['size'] column to provide a relative size value. It is multiplied
# by 10 to make the numbers more manageable.


## Color and shape

In [None]:
# Use of color and shape to distinguish different values of the variable sex.
# The hue option defines the third variable, not just the variable for color.
# The markers option changes the shapes of the point markers.


# Facets and FacetGrids
Seaborn uses facets in a facet grid to create a set of graphs from one dataset. We don't specify each individual plot, just the overall parameters, including the variable on which to split the data (the `col` parameter). Seaborn does the rest.

- `col` specifies the categorical variable to split across columns
- `col_wrap` specifies how many column plots to show on a line before wrapping to the next line.

## Multiple scatter plots with `col`
The `scatterplot()` function draws only one plot at a time. To display several plots together, use `lmplot()` with the `col` parameter, which names the variable used to split the data into separate plots.
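
A minimal sketch of splitting by `col` follows, on a made-up stand-in for `tips` (invented values, real column names). `fit_reg=False` is set so that each small facet shows only the points.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips data (illustrative values only)
tips = pd.DataFrame({
    "total_bill": [12.5, 18.0, 9.75, 22.4, 30.1, 15.6, 27.3, 20.0],
    "tip": [2.0, 3.0, 1.5, 3.5, 5.0, 2.5, 4.0, 3.0],
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sat", "Sun", "Sun"],
})

# One scatterplot per day, two plots per row before wrapping
g = sns.lmplot(x="total_bill", y="tip", data=tips,
               col="day", col_wrap=2, fit_reg=False)
print(len(g.axes.flat), "facets")
```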

In [None]:
# Multiple scatter plots with lmplot and col



Here is a re-creation of the four Anscombe plots we did earlier, but now together in one plot. 

In [None]:
# Anscombe plots using seaborn
anscombe = sns.load_dataset("anscombe")      # reload the data, just in case
# lmplot is a figure-level function, so it returns a FacetGrid rather than an Axes
g = sns.lmplot(x='x', y='y', data=anscombe,
               fit_reg=False,
               col='dataset', col_wrap=2)


The `FacetGrid` can be used to further customize your multi-plot output. The `lmplot` function is a figure-level function, so it has `col` and `col_wrap` parameters. Axes-level functions like `regplot` don't have those parameters and need to be placed in a portion of the grid using the `FacetGrid` directly.

We will not spend any more time on these details because they are typically needed for explanatory visualizations.

## Scatter plot with `col` and `hue`

Use `lmplot()` to access both the `hue` and `col` parameters. `hue` specifies the third variable, which determines the colors, and `col` specifies the variable used to split the data into separate plots.
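
Combining the two parameters can be sketched as follows, on a made-up stand-in for `tips` (invented values, real column names). `fit_reg=False` is an assumption made here because the toy groups are too small for meaningful regression lines.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the tips data (illustrative values only)
tips = pd.DataFrame({
    "total_bill": [12.5, 18.0, 9.75, 22.4, 30.1, 15.6, 27.3, 20.0],
    "tip": [2.0, 3.0, 1.5, 3.5, 5.0, 2.5, 4.0, 3.0],
    "sex": ["Female", "Male", "Male", "Female",
            "Female", "Male", "Male", "Female"],
    "time": ["Lunch", "Lunch", "Lunch", "Lunch",
             "Dinner", "Dinner", "Dinner", "Dinner"],
})

# hue colors the points by sex; col draws one plot per value of time
g = sns.lmplot(x="total_bill", y="tip", data=tips,
               hue="sex", col="time", fit_reg=False)
```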

In [None]:
# Scatter plot with hue and col

