# Unit 3: Exploratory Analysis

## Contents

* [Contents](#Contents)
* [Getting Started](#Getting-Started)
* [Selecting Data with Pandas](#Selecting-Data-with-Pandas)
* [Exploring a Dataset](#Exploring-a-Dataset)
    * [Histograms](#Histograms)
    * [Scatter Plots](#Scatter-Plots)
    * [Pair Plots: Histograms and Scatter Plots](#Pair-Plots:-Histograms-and-Scatter-Plots)
    * [Categorical Data](#Categorical-Data)
    * [Plotting Numerical Data with Categorical Data](#Plotting-Numerical-Data-with-Categorical-Data)
    * [Box Plots](#Box-Plots)
    * [Pivot Tables](#Pivot-Tables)
* [Lab Answers](#Lab-Answers)
* [Next Steps](#Next-Steps)
* [Resources and Further Reading](#Resources-and-Further-Reading)
* [Exercises](#Exercises)

### Lab Questions

[1](#Lab-1), [2](#Lab-2), [3](#Lab-3), [4](#Lab-4), [5](#Lab-5), [6](#Lab-6), [7](#Lab-7), [8](#Lab-8), [9](#Lab-9),  [10](#Lab-10)

## Getting Started

With some data loaded and cleaned, we can begin to look at it more closely and see if we can identify any trends or relationships.  To do this, we can rely on both quantitative methods, such as the calculation and analysis of descriptive statistics, and qualitative methods, such as plotting. We'll rely on methods in pandas to calculate descriptive statistics and on [Matplotlib](https://matplotlib.org/) and [Seaborn](https://seaborn.pydata.org/) for plotting. Matplotlib is a popular plotting library capable of producing [many different types of plots](https://matplotlib.org/gallery/index.html). Seaborn provides a simpler way of creating many of the plots commonly associated with data analysis and typically produces [nicer looking plots](https://seaborn.pydata.org/examples/index.html).

To use Seaborn, we'll make sure it is installed with `pip`.

If you are using [Anaconda](https://www.anaconda.com/download) and the pip command in the cell below fails, open the Anaconda prompt on your computer and run the following instead:

```
conda install --yes seaborn
```

In [None]:
import sys
!{sys.executable} -m pip install seaborn

We'll be creating plots in this notebook.  Before importing any modules, we should indicate to the notebook software how we would like to handle plots.  We can use the [`%matplotlib`](https://ipython.readthedocs.io/en/stable/interactive/magics.html#magic-matplotlib) magic command and `inline` to indicate that we would like plots to appear as static images in the notebook.

In [None]:
%matplotlib inline

With Seaborn installed, we can start importing modules for use in the notebook. Just as we followed convention and imported pandas as `pd`, we will import the Seaborn library as `sns`.  

Following the import, we can set the appropriate pandas option to display 100 columns at a time. We can use the Seaborn `set()` function to control figure size and the size of marker edges. Setting marker edges allows us to see outliers when working with box plots that would otherwise be invisible due to a bug in Matplotlib and Seaborn. 

In [None]:
import pandas as pd
import seaborn as sns

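# display up to 100 columns when showing a DataFrame
pd.set_option("display.max_columns", 100)

# set the figure size and make marker edges visible
sns.set(rc={'figure.figsize': (12, 10), "lines.markeredgewidth": 0.5 })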

## Selecting Data with Pandas

In this unit we'll continue to work with pandas' Series and DataFrames to store and manipulate data.  As part of that work, we'll be interested in examining parts of a larger DataFrame.  A common way to select a subset of a DataFrame is through the use of masks and filters - this is something we've already done.  For example, consider the following DataFrame.

In [None]:
employees = pd.DataFrame(
    [
        ('John', 'Marketing', '123 Main St', 'Columbus', 'OH', 3),
        ('Jane', 'HR', '456 High Ave', 'Columbus', 'OH', 7),
        ('Bob',  'HR', '152 Market Rd', 'Cleveland', 'OH', 4),
        ('Sue', 'Marketing', '729 Green Blvd', 'Cleveland', 'OH', 8),
        ('Tom', 'IT', '314 Oak Dr', 'Cincinnati', 'OH', 11),
        ('Kate', 'IT', '841 Elm Ln', 'Cincinnati', 'OH', 2)
    ],
    columns = ("name", "department", "address", 
               "city", "state", "years_with_company")
)

display(employees)

If we want to work with a specific subset of the data, say only those employees that are in Columbus, we could use the mask `employees.city == 'Columbus'` to filter the data.

In [None]:
employees[employees.city == 'Columbus']

Pandas offers an alternative [*query()* method](https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-query) for selecting data that allows us to write statements that are similar to standard Python syntax. For example, *query()* can be used to select only those employees in Columbus.

In [None]:
employees.query('city == "Columbus"')

The argument we provide to the *query()* method is a string that indicates the data we'd like to extract from the original DataFrame.  When comparing string values, we have to be sure to enclose the inner string in different quotes than the query itself.  

Compare the following which both return data for employees that have been with the company for more than 4 years.

In [None]:
employees[employees.years_with_company > 4]

In [None]:
employees.query('years_with_company > 4')

So far there doesn't seem to be much of an advantage to one method over the other.  A benefit to using *query()* becomes apparent when our conditions become more complex.  Compare the two methods when we want to find staff that have been with the company for more than 4 years and live in either Cleveland or Columbus.  White space such as line returns and extra indentation has been added for clarity and is not required.

In [None]:
employees[(employees.years_with_company > 4) &
          ((employees.city == "Cleveland") | 
           (employees.city == "Columbus"))]


In [None]:
employees.query(
    "years_with_company > 4 and "
    "(city == 'Cleveland' or "
    " city == 'Columbus')"
)

The argument supplied to the *query()* method is more concise and probably easier to read; generally, we write code with readability in mind as it is easier to share or understand later.  
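As a quick check that the two styles really select the same rows, the sketch below rebuilds a reduced version of the employee table (the variable name `staff` and the trimmed column set are assumptions made only for this illustration) and compares the results of the mask and of *query()*.

```python
import pandas as pd

# Reduced, illustrative copy of the employee data used above.
staff = pd.DataFrame({
    "name": ["John", "Jane", "Bob", "Sue", "Tom", "Kate"],
    "city": ["Columbus", "Columbus", "Cleveland",
             "Cleveland", "Cincinnati", "Cincinnati"],
    "years_with_company": [3, 7, 4, 8, 11, 2],
})

# Boolean-mask version: every condition needs its own parentheses
# because & and | bind more tightly than the comparison operators.
by_mask = staff[(staff.years_with_company > 4) &
                ((staff.city == "Cleveland") | (staff.city == "Columbus"))]

# query() version: plain `and` / `or` and no repeated DataFrame name.
by_query = staff.query(
    "years_with_company > 4 and (city == 'Cleveland' or city == 'Columbus')"
)

print(by_mask.equals(by_query))
print(by_query["name"].tolist())
```

Both selections keep the original row labels, so *equals()* can be used to confirm they match.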

<hr>
<a name="Lab-1"></a><mark> **Lab 1** In the cell below, use the *query()* method to find all employees that live in Columbus and work for HR.
</mark>

<hr>

One potential advantage to the non-*query()* method is programmability.  As we write scripts, we often use variables to store values that will change and use the variables in our selection criteria.  For example, suppose we have a function that routinely allows different departments to find their staff with more than 4 years with the company.

In [None]:
def senior_staff(dept):
    # return only those employees in the 
    # specified department that have been
    # with the company for more than 4 years
    pass    

Working with the notation we have been using, writing the function body that returns the correct result for a specified department is straightforward.

In [None]:
def senior_staff(dept):
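    # keep rows for the requested department with more than 4 years
    return employees[(employees.department == dept) &
                     (employees.years_with_company > 4)]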

senior_staff("IT")

Working with the *query()* method, there are multiple ways we can achieve the same result.  The method suggested by the pandas' documentation makes use of `@`.  We can reference existing variables in the query string in-line by prefixing their names with `@`.

In [None]:
def senior_staff(dept):
    # @dept refers to the function argument
    return employees.query("department == @dept and years_with_company > 4")

senior_staff("IT")

Both methods of selecting data have benefits and disadvantages.  We'll primarily use the method we've been working with but occasionally use *query()*.
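The `@` prefix works for any local variable, not just strings. The sketch below is one possible variation, using a hypothetical function name `senior_staff_by` and a hypothetical `min_years` parameter that are not part of the original example, on a reduced copy of the employee data.

```python
import pandas as pd

# Reduced, illustrative copy of the employee data.
staff = pd.DataFrame({
    "name": ["John", "Jane", "Bob", "Sue", "Tom", "Kate"],
    "department": ["Marketing", "HR", "HR", "Marketing", "IT", "IT"],
    "years_with_company": [3, 7, 4, 8, 11, 2],
})

def senior_staff_by(dept, min_years=4):
    # @dept and @min_years are resolved from the function's local variables.
    return staff.query(
        "department == @dept and years_with_company > @min_years"
    )

it_seniors = senior_staff_by("IT")["name"].tolist()
hr_seniors = senior_staff_by("HR", min_years=3)["name"].tolist()
print(it_seniors)
print(hr_seniors)
```

Keeping the threshold in a parameter avoids editing the query string each time the definition of "senior" changes.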

## Exploring a Dataset

Let's start with the EPA/Department of Energy fuel economy dataset we looked at last time.  Often data cleaning, merging, and exploration are done together - data is cleaned as we examine it for relationships and trends and then the pertinent/interesting data is merged and stored for further analysis.  Though we cleaned and merged the fuel economy and vehicle sales data previously, let's start with the original datasets for this initial exploration.  

We can load the data from the `./data/02-vehicles.csv` file using pandas' `read_csv()` function.

In [None]:
epa_data = pd.read_csv("./data/02-vehicles.csv", engine="python")
epa_data.head()

We also have a summarized data description document for this data; we can display it with the `HTML` function in the `IPython.display` module.

In [None]:
from IPython.display import HTML
HTML(filename="./data/02-vehicles-description.html")

As part of our initial exploration, we'll attempt to catalog/categorize the values in the following columns and see if there are any relationships between pairs of them.

- `city08`
- `city08U`
- `co2`
- `co2TailpipeGpm`
- `comb08`
- `comb08U`
- `cylinders`
- `displ`
- `fuelType1`
- `highway08`
- `highway08U`
- `year`
- `VClass`

We'll also keep the following fields for each row.  

- `make`
- `model`

<hr>
<a name="Lab-2"></a><mark> **Lab 2** In the cell below, create a copy of the `epa_data` DataFrame containing only the columns listed above.  Store the new DataFrame in a variable named `epa_subset`.  Use the *head()* method to confirm that the new DataFrame contains the correct column data.
</mark>

<hr>

As we've seen before, we can use the DataFrame *describe()* method to quickly calculate some descriptive statistics for each of the numeric columns in the DataFrame.

<hr>
<a name="Lab-3"></a><mark> **Lab 3** In the cell below, use the *describe()* method to calculate descriptive statistics for the numeric columns of `epa_subset`.
</mark>

<hr>

Let's look at one column to get an idea of what these values represent. According to the documentation, the `city08` column represents the fuel economy for city driving with the primary fuel type.  The rows have the following meaning.

- `count`: the number of non-null elements in the column; here there are 39,518 non-null values in the `city08` column
- `mean`: the sum of all values divided by the number of values; the mean value for `city08` is 18.2 mpg.
- `std`: the standard deviation - a measure of the variation of values within a collection, can be thought of as an "average" distance to the mean among all the values; the standard deviation of values in the `city08` column is 7.3 mpg.
- `min` and `max`: the smallest and largest values, respectively; here we have 6.0 mpg and 150.0 mpg. 
- `25%`, `50%`, and `75%`: the quartile values that allow us to divide the data into four parts. The first quartile, 25%, is the middle value between the minimum and the median; 25% of values are less than this value.  The second quartile is the median, the middlemost number among the values; 50% of values are less than the median and 50% of values are greater than the median. The third quartile is the middle value between the median and the maximum; 25% of values are greater than this value. 
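
To make the meaning of each row concrete, the following sketch recomputes the same statistics one at a time for a small Series of made-up mpg values (the numbers are illustrative and are not taken from the EPA data) and places them next to the output of *describe()*.

```python
import pandas as pd

# Made-up fuel economy values, used only for illustration.
mpg = pd.Series([12, 15, 17, 18, 20, 22, 25, 31], name="mpg")

summary = mpg.describe()

# The same statistics, computed one at a time.
by_hand = pd.Series({
    "count": mpg.count(),             # number of non-null values
    "mean": mpg.sum() / mpg.count(),  # sum divided by the number of values
    "std": mpg.std(),                 # sample standard deviation (divides by n - 1)
    "min": mpg.min(),
    "25%": mpg.quantile(0.25),        # first quartile
    "50%": mpg.median(),              # second quartile (the median)
    "75%": mpg.quantile(0.75),        # third quartile
    "max": mpg.max(),
})

print(pd.DataFrame({"describe": summary, "by_hand": by_hand}))
```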

As we noted previously, *describe()* only displays results for numeric columns.  For non-numeric columns, we might be interested in knowing about the data values and how often those values appear.  Below, we iterate through the columns of the DataFrame and, if a column contains string data, we display the column name with the output from the *value_counts()* method.

In [None]:
for column in epa_subset.columns:
    if pd.api.types.is_string_dtype(epa_subset[column]):
        display(column, epa_subset[column].value_counts())


While having access to these results can be useful, we often rely on visualizations to help characterize data or provide insights into potential relationships.  Before generating visualizations, let's address some data quality issues.  First, we can remove duplicates. 

<hr>
<a name="Lab-4"></a><mark> **Lab 4** In the cell below, remove duplicate rows from the `epa_subset` DataFrame.
</mark>

<hr>

We'll also need to account for missing data - while some methods we'll use to explore the data are able to ignore missing values, others will fail and throw exceptions.

From the code below we can see that the `cylinders` and `displ` columns are missing data.

In [None]:
epa_subset.isna().sum()

Let's see if we can identify any common properties for rows missing `cylinders` or `displ` data.  We can filter the DataFrame using a mask that corresponds to a row in which any column value is missing.

In [None]:
epa_subset[epa_subset.isna().any(axis=1)].head()

It looks like the rows with missing cylinder and displacement data correspond to electric vehicles.  This makes sense given the fact that electric vehicles do not have an internal combustion engine.  Let's refine the mask to exclude rows where `fuelType1` is `Electricity`.

In [None]:
epa_subset[(epa_subset.isna().any(axis=1)) & 
           (epa_subset.fuelType1 != 'Electricity')].head()

It appears that these rows are anomalies and are simply missing data.
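
The masks used above can be tried on a small table to see what each piece does. The sketch below uses a made-up DataFrame named `cars` (an assumption for this illustration, not the EPA data): `isna().sum()` counts missing values per column, `isna().any(axis=1)` flags rows with at least one missing value, and combining that flag with a second condition isolates the rows whose missing values are not explained by being electric.

```python
import numpy as np
import pandas as pd

# Made-up vehicle records; NaN marks a missing value.
cars = pd.DataFrame({
    "model": ["A", "B", "C", "D"],
    "fuelType1": ["Regular Gasoline", "Electricity",
                  "Premium Gasoline", "Electricity"],
    "cylinders": [4.0, np.nan, np.nan, np.nan],
    "displ": [2.0, np.nan, 3.5, np.nan],
})

# Number of missing values in each column.
missing_per_column = cars.isna().sum()

# True for every row that has at least one missing value.
has_missing = cars.isna().any(axis=1)

# Rows with missing values that are not electric vehicles.
unexplained = cars[has_missing & (cars.fuelType1 != "Electricity")]

print(missing_per_column)
print(unexplained)
```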

Before continuing on, we'll remove rows with missing data.

In [None]:
epa_subset.dropna(inplace=True)

### Histograms

To begin getting a higher-level picture of our data, we can use visualizations.  While we have some sense of the distribution of data values from the quartile information calculated by the *describe()* method, a [histogram](https://en.wikipedia.org/wiki/Histogram) can be used to visualize the data distribution.  

Both pandas DataFrames and Series have *hist()* methods that can be used to plot histograms. This allows us to create a histogram for a specific column or for each column in a DataFrame with numeric values.

In [None]:
epa_subset.city08.hist()

The *hist()* method returns an `AxesSubplot` object that can be used to manipulate the plot; this is what the text above the plot refers to.  We can ignore it for now.
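
The bar heights in a histogram are simply counts of values falling into equal-width bins. The sketch below reproduces that counting step with NumPy's `histogram()` function on a handful of made-up mpg values (illustrative numbers, not the EPA data), using 10 bins to match the default used by the pandas *hist()* method.

```python
import numpy as np
import pandas as pd

# Made-up city mpg values, used only for illustration.
mpg = pd.Series([11, 14, 15, 15, 16, 18, 18, 19, 21, 22, 24, 27, 33, 48])

# Split the range from the minimum to the maximum into 10 equal-width
# bins and count how many values fall into each one.
counts, edges = np.histogram(mpg, bins=10)

for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Ten bins are described by eleven edges, and the counts always add up to the number of values in the Series.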

From the plot we can see that most of the values are concentrated between 10 and 30 mpg.  We can also see that the distribution has a positive [skew](https://en.wikipedia.org/wiki/Skewness).  We can confirm this using the column's `skew()` method.  Similarly, we can calculate the [kurtosis](https://en.wikipedia.org/wiki/Kurtosis) using the *kurtosis()* method.

<hr>
<a name="Lab-5"></a><mark> **Lab 5** Calculate and display the skew and kurtosis of the data in the `city08` column.
</mark>


<hr>

The pandas *hist()* method relies on Matplotlib to generate the plot. We can generate a histogram directly from Seaborn if we'd like. To do this, we can use the [*distplot()*](https://seaborn.pydata.org/generated/seaborn.distplot.html) function.  By default, the function generates a plot representing the probability distribution of observations rather than the count of values.  To generate a plot based on the count, we have to provide the `kde=False` argument. We can also specify `bins=10` for consistency with the previous plot. Note that *distplot()* is deprecated as of Seaborn 0.11; newer versions provide *histplot()* for the same purpose.

In [None]:
sns.distplot(epa_subset.city08, kde=False, bins=10)

To generate the histograms for each of the numeric columns in the DataFrame, we can use the DataFrame's *hist()* method rather than the Series *hist()* method associated with an individual column.

In [None]:
epa_subset.hist()

Viewing the histograms allows us to see the distribution of data for each column more quickly than looking at the results of *describe()* or similar methods.  From a quick glance, we can see that there are noticeable differences between the values in `city08` and `city08U`. From the data documentation, we know that the `city08U` column contains "unrounded data" but, when comparing the histograms of the two columns, it appears that the "unrounded" data contains more zero values.

### Scatter Plots

While a histogram is useful to see how data is distributed for a single column, we often would like to see if any relationships exist between two columns/variables.  We can compare the values of two columns directly using a scatter plot.

In [None]:
epa_subset.plot.scatter(x="comb08U", y="comb08")

Notice that there are rows in which `comb08U` has a value of zero but the value of `comb08` is non-zero.  Because it doesn't make sense to round zero to a non-zero value, it's reasonable to conclude that there is missing data for the unrounded values and zero was used as a placeholder. It might be the case that before some point in time only rounded values were stored.  To verify this, we can compare the values of both columns against the `year` data, assuming measurements were made around the time the vehicles were manufactured.

<hr>
<a name="Lab-6"></a><mark> **Lab 6** In the cells below, verify that rounded values for combined fuel efficiency are available for earlier years than unrounded values by creating two scatter plots. 
</mark>

<hr>

We can see that the `city08U`, `comb08U`, and `highway08U` columns have the same number of zero-valued entries, which supports the idea that unrounded data from earlier years isn't available.

In [None]:
for column in ["city08U", "comb08U", "highway08U"]:
    zeros = epa_subset[epa_subset[column] == 0]
    display(column, len(zeros))

With this in mind, we might choose to work with the rounded data if we wanted to work with a larger set of data including more historic data.  If there is a desire to work with unrounded values or we wish to examine only recent data, we could work with the unrounded values.  Having more historic data will be useful to us so we'll work with rounded data.
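
The zero counts can also be computed without a loop, and the first year with usable unrounded values can be read directly from the data. The sketch below uses a made-up table named `fuel` (illustrative values only, not the EPA data).

```python
import pandas as pd

# Made-up records: early years have 0 as a placeholder in the
# unrounded columns.
fuel = pd.DataFrame({
    "year": [1985, 1995, 2012, 2018],
    "comb08": [17, 21, 25, 30],
    "comb08U": [0.0, 0.0, 24.6, 29.8],
    "city08U": [0.0, 0.0, 21.3, 26.1],
})

unrounded = ["city08U", "comb08U"]

# Comparing a DataFrame with 0 gives a table of booleans;
# summing counts the True values in each column.
zero_counts = (fuel[unrounded] == 0).sum()

# Earliest model year that has a non-zero unrounded value.
first_year = fuel[fuel["comb08U"] > 0]["year"].min()

print(zero_counts)
print(first_year)
```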

Having a sense of the distribution of a single column's data is important but we're often interested in how one or more columns influence another column. When working with two columns, we often use scatter plots to assess potential relationships.  We can create a scatter plot for two columns as we did above when looking at rounded and unrounded data compared to years.  As another example, let's compare the values of `comb08`, the combined highway and city fuel efficiency in miles/gallon, and `co2`, the tailpipe CO2 emissions in grams/mile.

In [None]:
epa_subset.plot.scatter(x="comb08", y="co2")

Here we can see that there appears to be a relationship between fuel efficiency and carbon dioxide emissions - as fuel efficiency improves, emissions decrease.  We'll examine this relationship further in a later unit.

### Pair Plots: Histograms and Scatter Plots

Just as it was helpful to be able to quickly visualize distribution data for each of the numeric columns in our dataset, we can use a [*pairplot*](https://seaborn.pydata.org/generated/seaborn.pairplot.html) to visualize the pairwise relationships between columns.  In the example below, we first further reduce the columns we'll examine then use the Seaborn *pairplot()* function to generate the pairwise scatter plots for those columns.  Note that when a column is paired with itself, the column's histogram is displayed.

In [None]:
columns = [ "co2", "comb08", "cylinders", "displ"]
sns.pairplot(epa_subset[columns])

Notice that there is a form of symmetry between the plots to the left of and below the histograms and the plots to the right of and above them.  For example, the second plot in the first row and the first plot in the second row both show the relationship between `co2` and `comb08` - the axis to which each variable corresponds differs but the relationship is the same.

Using pair plots can help us quickly identify which columns or variables are dependent on other columns/variables.  

The `co2` column contains quite a few values that appear to be zero but are -1 (as can be seen from the output of *describe()*).  Let's see what those are.

In [None]:
epa_subset.query("co2 == -1").head()

It's not immediately clear why the carbon dioxide emissions would have a value of -1, but we'll remove those rows for the remainder of our work.

In [None]:
epa_subset = epa_subset.query('co2 >= 0')

With that, let's look at the pair plots again.

In [None]:
columns = [ "co2", "comb08", "cylinders", "displ"]
sns.pairplot(epa_subset[columns])

### Categorical Data

While the examples above provide insight into numeric data, they don't tell us about columns that contain categorical data such as `VClass`. 

A typical method of visualizing categorical data is with a bar plot.  Both pandas and Seaborn provide methods of generating bar plots.

For a given column in the DataFrame, the *value_counts()* method returns a Series.  Series, as we've seen before, have a *plot()* method.  We can explicitly create a bar chart using the `kind="bar"` or `kind="barh"` arguments to the *plot()* method for a vertical or horizontal bar chart, respectively.

In [None]:
epa_subset.VClass.value_counts().plot(kind="bar")

The order of the categories is determined by the Series used to create the chart; because the results of *value_counts()* are ordered, the bars in the resulting plot are ordered.  

We can create a similar plot using the Seaborn [*countplot()*](https://seaborn.pydata.org/generated/seaborn.countplot.html) function. To create a vertical bar chart, we can use the `x` keyword argument to specify source data; to create a horizontal bar chart, we can use the `y` keyword argument to specify the source data.  Below, we create a horizontal bar chart for the vehicle class data.

In [None]:
sns.countplot(y=epa_subset.VClass)

Notice that the data isn't ordered in the same way it was before; by default, categories are ordered based on when they appear in the source data.  To impose a different ordering, we can use the `order` keyword argument with the *countplot()* function.


<hr>
<a name="Lab-7"></a><mark> **Lab 7** In the cell below, use the Seaborn *countplot()* function to generate a bar chart visualizing the counts for each of the vehicle classes in the `VClass` column.  Use the `order` keyword argument to order the categories based on the number of entries for each class.  Use the index from the series returned by the `value_counts()` method for ordering information. 
</mark>


<hr>

As you look at these bar charts, you might notice that there are several vehicle class names that are similar to other classes, for example, "Special Purpose Vehicles" and "Special Purpose Vehicle". Listing the distinct vehicle classes alphabetically helps to see this better.

In [None]:
sorted(epa_subset.VClass.unique().tolist())

Let's clean this data a bit by replacing similar values with one value. In addition to combining categories with similar names, we'll combine two- and four-wheel drive vehicles into the category without an indication of drive and the different types of vans into the general van category. 

There are a variety of ways of doing this.  A for-loop would work but, as mentioned previously, for-loops should be avoided.  An alternative way of replacing a column's values is through the use of the *apply()* method that we've used before.  We could use the *map()* method and supply a dictionary where the keys correspond to current values in the column and the associated values represent the replacement data, but we would have to provide a mapping for every value in the column - even those we don't need to alter.

In [None]:
vclass_map = {
    'Minivan - 2WD': 'Minivan',
    'Minivan - 4WD': 'Minivan',
    'Small Pickup Trucks 2WD': 'Small Pickup Trucks',
    'Small Pickup Trucks 4WD': 'Small Pickup Trucks',
    'Small Sport Utility Vehicle 2WD': 'Small Sport Utility Vehicle',
    'Small Sport Utility Vehicle 4WD': 'Small Sport Utility Vehicle',
    'Special Purpose Vehicle': 'Special Purpose Vehicles',
    'Special Purpose Vehicle 2WD': 'Special Purpose Vehicles',
    'Special Purpose Vehicle 4WD': 'Special Purpose Vehicles',
    'Special Purpose Vehicles/2wd': 'Special Purpose Vehicles',
    'Special Purpose Vehicles/4wd': 'Special Purpose Vehicles',
    'Sport Utility Vehicle - 2WD': 'Sport Utility Vehicle',
    'Sport Utility Vehicle - 4WD': 'Sport Utility Vehicle',
    'Standard Pickup Trucks 2WD': 'Standard Pickup Trucks', 
    'Standard Pickup Trucks 4WD': 'Standard Pickup Trucks',
    'Standard Pickup Trucks/2wd': 'Standard Pickup Trucks',
    'Standard Sport Utility Vehicle 2WD': 'Standard Sport Utility Vehicle',
    'Standard Sport Utility Vehicle 4WD': 'Standard Sport Utility Vehicle',
    'Vans Passenger': 'Vans',
    'Vans, Cargo Type': 'Vans',
    'Vans, Passenger Type': 'Vans'
}
    
def replace(value):
    if value in vclass_map:
        return vclass_map[value]
    else:
        return value
    
epa_subset.VClass = epa_subset.VClass.apply(replace)
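
To see why *map()* needs an entry for every value, and to compare it with an alternative, the sketch below applies a two-entry dictionary to a short, made-up Series. The Series *replace()* method shown here is not used elsewhere in this unit; it is included only as one possible alternative that leaves unmapped values unchanged.

```python
import pandas as pd

# Short, made-up list of vehicle classes.
classes = pd.Series(["Minivan - 2WD", "Compact Cars", "Vans, Cargo Type"])
mapping = {"Minivan - 2WD": "Minivan", "Vans, Cargo Type": "Vans"}

# map(): values without a dictionary entry become NaN.
mapped = classes.map(mapping)

# replace(): values without a dictionary entry are left unchanged.
replaced = classes.replace(mapping)

# map() followed by fillna() restores the original value where
# no mapping was found.
mapped_filled = classes.map(mapping).fillna(classes)

print(mapped.tolist())
print(replaced.tolist())
print(mapped_filled.tolist())
```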

With the similar values replaced, let's look at the value count bar chart again.

In [None]:
sns.countplot(y=epa_subset.VClass, 
              order=epa_subset.VClass.value_counts().index)

As an alternative to how often a given vehicle class appears in the data, we might be interested in knowing the mean fuel economy for each class.  In order to do this, we can use pandas' [*group by* functionality](https://pandas.pydata.org/pandas-docs/stable/groupby.html) that allows us to create groups of data within the dataset and calculate an aggregate value for each group; this is similar to the standard SQL [*group by*](https://www.w3schools.com/sql/sql_groupby.asp) statement.  

To create a grouping, we can use the DataFrame's [*groupby()*](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.groupby.html) method specifying at least one column whose values should be used for grouping. The code below groups the data in `epa_subset` by values in the `VClass` column then computes the mean across the other numeric columns for each group.

In [None]:
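# numeric_only=True skips text columns such as make and model,
# for which a mean is not defined
epa_subset.groupby(['VClass']).mean(numeric_only=True)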

If we'd like the aggregated values for a single column, we can specify that column after the call to the *groupby()* method or after the call to the aggregation function. It's generally better to reduce the size of the dataset over which a calculation is applied so we should select the column of interest before doing the aggregation calculation. We can select the `comb08` column after grouping the data and then calculate the mean.

In [None]:
mean_mpg_by_vclass = epa_subset.groupby(['VClass'])['comb08'].mean()
display(mean_mpg_by_vclass)
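
The grouping step is easier to follow on a table small enough to check by hand. The sketch below uses five made-up rows (the `cars` DataFrame and its values are assumptions for this illustration): rows are grouped by `VClass`, the `comb08` column is selected, and the mean is computed within each group.

```python
import pandas as pd

# Five made-up vehicles in three classes.
cars = pd.DataFrame({
    "VClass": ["Vans", "Compact Cars", "Vans", "Compact Cars", "Minivan"],
    "make": ["Ford", "Honda", "Dodge", "Toyota", "Kia"],
    "comb08": [14, 30, 16, 34, 22],
})

# Group rows by class, select one column, then aggregate.
mean_mpg = cars.groupby("VClass")["comb08"].mean()

# Sort from most to least fuel efficient.
mean_mpg = mean_mpg.sort_values(ascending=False)

print(mean_mpg)
```

The result is a Series indexed by the group labels, here with "Compact Cars" first (mean of 30 and 34) and "Vans" last (mean of 14 and 16).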

We can use this aggregated data, stored in a pandas Series, to create a bar chart.  Because we are no longer interested in the number of times a `VClass` value appears, we cannot use the Seaborn *countplot()* function.  Instead, we can use [*barplot()*](https://seaborn.pydata.org/generated/seaborn.barplot.html).  Before plotting, we sort the Series using the *sort_values()* method to indicate that we'd like to sort the Series by its values; we specify `inplace=True` to alter the existing Series object and `ascending=False` to sort the values in descending order.  

In [None]:
mean_mpg_by_vclass.sort_values(inplace=True, ascending=False)
sns.barplot(x=mean_mpg_by_vclass.values, y=mean_mpg_by_vclass.index)

Alternatively, we could have used the *plot()* method on the Series itself.

In [None]:
mean_mpg_by_vclass.plot(kind='barh')

### Plotting Numerical Data with Categorical Data

With the vehicle classes sorted by combined city and highway fuel economy, let's see how some of the relationships we looked at in the pair plot above differ based on classes.  Let's select three vehicle classes: the one with the greatest mean fuel economy, the one with the least mean fuel economy, and the class corresponding to the median of the aggregated values. We can use the *min()*, *max()*, and *median()* methods to find the corresponding values within the Series and use a filter with those values to get the index values (the vehicle class names).  The following identifies the vehicle class with the median value for mean combined fuel economy. Note that the filter returns a collection of rows that match the given criteria; to get the first and only result, we use bracket notation.

In [None]:
mean_mpg_by_vclass[mean_mpg_by_vclass == mean_mpg_by_vclass.median()].index[0]

Let's collect the three classes we're interested in into one list.

In [None]:
vclass_sample = [
    mean_mpg_by_vclass[mean_mpg_by_vclass == mean_mpg_by_vclass.min()].index[0],
    mean_mpg_by_vclass[mean_mpg_by_vclass == mean_mpg_by_vclass.median()].index[0],
    mean_mpg_by_vclass[mean_mpg_by_vclass == mean_mpg_by_vclass.max()].index[0]
]

vclass_sample
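
Filtering on equality with the median assumes that some group's value is exactly the median, which holds only when the number of groups is odd. A minimal sketch of an alternative, using the Series *idxmin()* and *idxmax()* methods on made-up aggregated values (the names and numbers below are illustrative), picks the group closest to the median instead.

```python
import pandas as pd

# Made-up mean fuel economy for four classes (an even number of groups).
mean_mpg = pd.Series(
    {"Vans": 15.0, "Minivan": 22.0, "Midsize Cars": 24.0, "Compact Cars": 32.0},
    name="comb08",
)

# Index labels of the smallest and largest values.
lowest = mean_mpg.idxmin()
highest = mean_mpg.idxmax()

# The median here is 23.0, which matches no group exactly,
# so the equality filter returns nothing.
exact_matches = mean_mpg[mean_mpg == mean_mpg.median()]

# Pick the group whose mean is closest to the median instead;
# when two groups are equally close, the first one is returned.
middle = (mean_mpg - mean_mpg.median()).abs().idxmin()

print(lowest, middle, highest)
```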

We can use this list to reduce the size of our `epa_subset` DataFrame.  The mask we'll use to filter the data will rely on the [*isin()*](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.isin.html) method that tests whether a value is in a specified list or not; this is similar to the Python *in* keyword.

In [None]:
epa_vclass_sample = epa_subset[epa_subset.VClass.isin(vclass_sample)]
epa_vclass_sample.head()

We could achieve the same result using *query()* and a more Python-like syntax.

In [None]:
epa_vclass_sample = epa_subset.query('VClass in @vclass_sample')
epa_vclass_sample.head()

With these three vehicle classes, let's look at the pair plot from earlier.  We can use the values in the `VClass` column to determine the coloring of markers in the various plots.  To do this, we specify the column name with the `hue` argument when we call *pairplot()*.

In [None]:
columns = [ "co2", "comb08", "cylinders", "displ", "VClass"]
sns.pairplot(epa_vclass_sample[columns], hue="VClass")

Looking at the data with how we chose values for `VClass` in mind, we can begin to see some trends.  For example, the more fuel-efficient vehicles, like small station wagons, have engines with a smaller displacement and emit less carbon dioxide. Similarly, vans, which represent the least fuel-efficient vehicles, have engines with a greater displacement and emit more carbon dioxide.

We can look at pairwise relationships one at a time if we would like. Let's look at the scatter plot of `comb08` and `co2`.  While we could use the *scatter()* method we used earlier, it would take some work to add the marker coloring based on `VClass` that we have in the pair plot above.  Instead, we'll use the Seaborn [*lmplot()*](https://seaborn.pydata.org/generated/seaborn.lmplot.html) function.  By default, this plot type will also show the linear regression that fits the data.  We'll disable this feature for now but will use it later.

In [None]:
sns.lmplot(x='comb08', y="co2", hue='VClass', 
           data=epa_vclass_sample, fit_reg=False)

We can see that there appears to be an inverse relationship between the two variables: as fuel economy increases, carbon dioxide emissions decrease.


<hr>
<a name="Lab-8"></a><mark> **Lab 8** In the cell below, create a similar scatter plot for `displ` and `co2` with marker coloring determined by vehicle class.
</mark>

<hr>

### Box Plots

We'll look at modeling these relationships in the next unit, but for now, let's return to the larger subset of data with all the vehicle classes.  So far, we've used the *describe()* method, histograms, and bar charts to get a sense of the distribution and other properties of some of the numeric data in our dataset.  Another common way to explore numeric data is through the use of [box plots](https://en.wikipedia.org/wiki/Box_plot).

To understand the components of a box plot, consider the following example Series.

In [None]:
example = pd.Series([-4, -1, -0.5, 0, 0.5, 1, 4], name="Example")
example.describe()

Let's create the box plot for the same Series and compare it to the output above.

In [None]:
example.plot(kind='box')

The red line in the box corresponds to the median or 50th percentile value.  The top and bottom edges of the box correspond to the 75th and 25th percentile values, respectively.  The shorter horizontal lines at the ends of the whiskers, above and below the box, correspond to the maximum and minimum values excluding outliers.  Outliers are symbolized by the small circular markers and represent values that can be considered "distant" from other values.

There are various ways of defining what constitutes an outlier, but a common method relies on the interquartile range, *IQR*: the difference between the third and first quartiles or, in terms of percentiles, the difference between the 75th and 25th percentiles.  Using the interquartile range, a value can be considered an outlier if it satisfies one of the following:

- it is greater than the 75th percentile plus 1.5 times the interquartile range
- it is less than the 25th percentile minus 1.5 times the interquartile range.
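
Applying this rule to the example Series shows which values the box plot marks as outliers. The sketch below recreates the Series and computes the two thresholds; the variable names are chosen only for this illustration.

```python
import pandas as pd

example = pd.Series([-4, -1, -0.5, 0, 0.5, 1, 4], name="Example")

q1 = example.quantile(0.25)   # 25th percentile
q3 = example.quantile(0.75)   # 75th percentile
iqr = q3 - q1                 # interquartile range

lower = q1 - 1.5 * iqr        # values below this are outliers
upper = q3 + 1.5 * iqr        # values above this are outliers

# Keep only the values outside the two thresholds.
outliers = example[(example < lower) | (example > upper)]

print(lower, upper)
print(outliers.tolist())
```

For this Series the quartiles are -0.75 and 0.75, so the thresholds are -3 and 3 and the values -4 and 4 fall outside them.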



In [None]:
epa_subset.co2.plot(kind='box')

We can see that the median is about 400 g/mile and that there are quite a few outliers.  

Next, let's look at the `comb08` and `cylinders` columns.

In [None]:
epa_subset.comb08.plot(kind='box')

In [None]:
epa_subset.cylinders.plot(kind='box')

The box plot for `cylinders` is a bit unusual: the median and the 75th percentile values are the same.  This happens because the column takes only a few distinct values and many vehicles share the same value, as shown by *describe()* and *value_counts()* below.

In [None]:
display(epa_subset.cylinders.describe())
display(epa_subset.cylinders.value_counts())
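
To see how this can happen, here is a small sketch using made-up cylinder counts. The values are hypothetical and are not taken from the EPA data. Half of the values are 6, so both the median and the 75th percentile land on 6.

```python
import pandas as pd

# Hypothetical cylinder counts, invented for illustration (not EPA data)
toy_cylinders = pd.Series([4, 4, 4, 4, 6, 6, 6, 6, 6, 8])

q1 = toy_cylinders.quantile(0.25)
median = toy_cylinders.quantile(0.50)
q3 = toy_cylinders.quantile(0.75)

print(toy_cylinders.value_counts().sort_index())
print(q1, median, q3)  # 4.0 6.0 6.0
```

A box plot of this Series would draw the median line on top of the upper edge of the box.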

<hr>
<a name="Lab-9"></a><mark> **Lab 9** In the cell below, create a box plot for the `displ` column of the `epa_subset` DataFrame.
</mark>

<hr>

Let's look at the box plot for the `displ` column again.

In [None]:
epa_subset.displ.plot(kind='box')

Let's calculate thresholds for the outliers.

In [None]:
q1 = epa_subset.displ.quantile(0.25)  # first quartile
q3 = epa_subset.displ.quantile(0.75)  # third quartile
IQR = q3 - q1  # IQR
lower_threshold = q1 - 1.5 * IQR  # lower threshold
upper_threshold = q3 + 1.5 * IQR  # upper threshold
display(lower_threshold, upper_threshold)

From these calculations, an outlier for engine displacement is any value greater than 6.5 or less than about -0.7.  Because displacement cannot be negative, this data has no lower outliers.

<hr>
<a name="Lab-10"></a><mark> **Lab 10** In the cell below, calculate the lower and upper thresholds for outliers for the `co2` column in the `epa_subset` DataFrame.
</mark>

<hr>

Seaborn can also create box plots with its [*boxplot()*](https://seaborn.pydata.org/generated/seaborn.boxplot.html) function, but the pandas *plot()* method tends to produce easier-to-read plots.

In [None]:
sns.boxplot(y=epa_subset.co2)

Seaborn's *boxplot()* function does work well when we want to show the box plots from one column separated by categorical values.

In [None]:
sns.boxplot(x="co2", y="VClass", data=epa_subset)

### Pivot Tables

Pandas provides the ability to create [pivot tables](https://en.wikipedia.org/wiki/Pivot_table) to aggregate and summarize data.  To create a pivot table, we can use the DataFrame's [*pivot_table()*](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.pivot_table.html) method and, at a minimum, specify which column should be used as the index of the new table; the index serves as the column by which values are grouped and aggregated.  By default, the mean is used as the aggregation function, but we can specify any appropriate function using the `aggfunc` keyword argument.
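
As a minimal sketch of how the index and `aggfunc` interact, consider a tiny made-up DataFrame. The `toy` DataFrame and its `mpg` column are hypothetical and only serve to illustrate the default mean versus an explicitly chosen median.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration (not EPA data)
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020],
    "mpg": [20, 22, 30, 25, 35],
})

# With no aggfunc, each group is summarized by its mean
mean_table = toy.pivot_table(index="year")

# The aggregation can be named explicitly, here the median
median_table = toy.pivot_table(index="year", aggfunc="median")

# Grouping by the index column and aggregating gives the same numbers
grouped = toy.groupby("year").median()

print(mean_table)
print(median_table)
```

For 2019 the mean is 24.0 while the median is 22.0, which shows how the choice of aggregation function changes the summary.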

Let's create a pivot table that groups the data by `year` and aggregates data by calculating the median of values.

In [None]:
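# pd.np was removed in pandas 2.0; name the aggregation and keep only numeric columns
epa_subset.select_dtypes("number").pivot_table(index="year", aggfunc="median")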

We can choose a subset of columns by listing them using the `values` keyword argument.

In [None]:
epa_subset.pivot_table(index="year", 
                       values=["city08", "comb08", "highway08"], 
                       aggfunc="median")

We can also further divide the data by specifying additional indexes or through the `columns` keyword argument.

In [None]:
epa_subset.pivot_table(index=["year", "fuelType1"],
                       values=["city08", "comb08", "highway08"], 
                       aggfunc="median")

In [None]:
epa_subset.pivot_table(index="year", 
                       columns="fuelType1",
                       values=["city08", "comb08", "highway08"], 
                       aggfunc="median")
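
The difference between adding a second index and using `columns` is easier to see on a small table. The sketch below uses hypothetical data with invented column names (`fuel`, `mpg`): both calls compute the same medians, but the first stacks the groups as rows while the second spreads them across columns and leaves `NaN` where a combination has no data.

```python
import pandas as pd

# Hypothetical toy data; the column names and values are invented
toy = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020],
    "fuel": ["Gas", "Gas", "Diesel", "Gas", "Gas"],
    "mpg": [20, 24, 30, 26, 28],
})

# One row per (year, fuel) combination that appears in the data
long_table = toy.pivot_table(index=["year", "fuel"], values="mpg", aggfunc="median")

# One row per year and one column per fuel type
wide_table = toy.pivot_table(index="year", columns="fuel", values="mpg", aggfunc="median")

print(long_table)
print(wide_table)
```

In this toy data there is no diesel vehicle in 2020, so that cell of the wide table is `NaN`, while the long table simply has no row for it.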

We can also use pivot tables as inputs for visualizations.  Let's compare city and highway fuel economy for the various vehicle classes.

In [None]:
pivot = (
    epa_subset.pivot_table(
        index=["VClass"],
        values=["city08", "highway08"],
        aggfunc="median")
    .sort_values(by="highway08"))
display(pivot)

With this we can create a [heatmap](https://en.wikipedia.org/wiki/Heat_map) using Seaborn's [*heatmap()*](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function.  In addition to specifying the pivot table as the data source, we can show the values for each rectangle using the `annot` keyword argument and specify the colormap using the `cmap` keyword argument with a [Matplotlib colormap name](https://matplotlib.org/users/colormaps.html).

In [None]:
sns.heatmap(pivot, annot=True, cmap="RdYlGn")

## Lab Answers

1. ```python
   employees.query(
       "city == 'Columbus' and "
       "department == 'HR'"
   )
   ```
   
2. ```python
   columns = ["city08", "city08U", "co2", "co2TailpipeGpm", "comb08", "comb08U", 
              "cylinders", "displ", "fuelType1", "highway08", "highway08U", 
              "make", "model", "VClass", "year"]
   epa_subset = epa_data[columns].copy()
   epa_subset.head()
   ```
   
3. ```python
   epa_subset.describe()
   ```
   
4. ```python
   epa_subset.drop_duplicates(inplace=True)
   ```
   
5. ```python
   display(epa_subset.city08.skew())
   display(epa_subset.city08.kurtosis())
   ```
   
6. ```python
   epa_subset.plot.scatter(x="year", y="comb08")
   ```
   
   and
   
   ```python
   epa_subset.plot.scatter(x="year", y="comb08U")
   ```
   
7. ```python
   sns.countplot(y=epa_subset.VClass, order=epa_subset.VClass.value_counts().index)
   ```
   
8. ```python
   sns.lmplot(x="displ", y="co2", hue='VClass', data=epa_vclass_sample, fit_reg=False)
   ```
   
9. ```python
   epa_subset.displ.plot(kind='box')
   ```
   
10. ```python
    q1 = epa_subset.co2.quantile(0.25)
    q3 = epa_subset.co2.quantile(0.75)
    IQR = q3 - q1
    lower_threshold = q1 - 1.5 * IQR
    upper_threshold = q3 + 1.5 * IQR
    display(lower_threshold, upper_threshold)
    ```

## Next Steps

We've identified some potential relationships among columns within the fuel economy dataset.  In the next unit we'll create mathematical models for some of these relationships and see how well each model fits the existing data.

We'll continue to use plots and charts to understand the data but mostly for exploratory purposes.  Visualizations also serve as great tools for conveying information; we'll explore explanatory visualizations later.

## Resources and Further Reading

- [An Introduction to Seaborn](https://seaborn.pydata.org/introduction.html)
- [Seaborn Gallery](https://seaborn.pydata.org/examples/index.html)
- [*Practical Statistics for Data Scientists* by Bruce and Bruce, Chapter 1: Exploratory Data Analysis (Safari Books)](http://proquest.safaribooksonline.com.cscc.ohionet.org/book/databases/9781491952955/1dot-exploratory-data-analysis/eda_html?uicode=ohlink)
- [*Python: Data Analytics and Visualization* by Phuong, et al., Data Exploration (Safari Books)](http://proquest.safaribooksonline.com.cscc.ohionet.org/book/programming/python/9781788290098/implementing-logistic-regression-with-python/ch06lvl2sec00083_html?uicode=ohlink)

## Exercises

1. We created a set of box plots for the values in the `co2` column separated by `VClass` from the `epa_subset` DataFrame using `sns.boxplot(x="co2", y="VClass", data=epa_subset)`.  Prior to that, when working with bar charts, we specified an ordering for the categorical data using the `order` keyword argument.  Modify the box plot so categories are ordered by median carbon dioxide emissions from least to greatest.  See the image below for the desired plot.

2. Load county auditor data, either one county's data from one of the data sources we've used or from the combined data we saved to a database, and create a pair plot that compares sales price, area, number of bedrooms, and bathrooms.  Are there any potential relationships between any of these variables?

<figure>
<img src="./images/03-boxplots.png" alt="box plots for co2 emission by median">
<figcaption style="text-align: center; font-weight: bold">Exercise 1 - Box plots for carbon dioxide emission by vehicle class sorted by median emission</figcaption>
</figure>