# Matplotlib Scatter and Bar Plots

In this chapter, we'll cover how to create scatter and bar plots in matplotlib from numpy arrays and pandas DataFrames.

## Scatter plots

A typical scatter plot consists of a sequence of paired x and y coordinates, each plotted as a single point without being connected to the others. In matplotlib, the `scatter` method is usually the best way to create a scatter plot. Its first two arguments are the same as those of the `plot` method: the x and y values. Below, we create a scatter plot of 20 points using random values between 0 and 1 generated with numpy. We assign the result of the call to the `scatter` method to its own variable.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('mdap.mplstyle')
fig, ax = plt.subplots()
x = np.random.rand(20)
y = np.random.rand(20)
scatter_obj = ax.scatter(x, y)

Each point is represented as a small circle of the same color. Let's find out what type of object gets returned.

In [None]:
type(scatter_obj)

This `PathCollection` object represents all 20 of these points. As usual, it has getter and setter methods to view and change its properties. When calling the `get_facecolor` method, a two-dimensional numpy array is returned and not a four-item tuple or an RGBA string as you might expect. With the `scatter` method, every single point can be a different color, so it's possible for a 20 x 4 array to be returned with RGBA floats for each point.

In [None]:
scatter_obj.get_facecolor()

The default size of a scatter plot point is in points squared. This value is 36, which means it is 6 points or 1 / 12 (6 / 72) of a figure inch in diameter. The `get_sizes` method returns this size as an array. Like the face color, each point can be a different size, so it could have returned an array of 20 values.

In [None]:
scatter_obj.get_sizes()
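
The points-squared convention can be checked with a little arithmetic. The sketch below is not part of matplotlib: `marker_diameter` is a hypothetical helper that converts a scatter size `s` into a diameter in points and in inches, assuming 72 points per inch.

```python
import math

POINTS_PER_INCH = 72  # typographic points in one inch


def marker_diameter(s):
    """Return (diameter in points, diameter in inches) for a scatter size s given in points squared."""
    points = math.sqrt(s)
    return points, points / POINTS_PER_INCH


# The default scatter size of 36 corresponds to a marker 6 points across
default_points, default_inches = marker_diameter(36)
print(default_points, round(default_inches, 4))
```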

### Scatter plot properties

There are several parameters you can set to customize the appearance of your scatter plot. The following are the most common:

* `s` - size of marker in points squared
* `c` - face color of marker
* `marker` - a single character or digit [marker style][1]
* `linewidth`/`lw` - width of marker edge in points
* `edgecolor`/`ec` - color of marker edge

Let's set each of these parameters to a value and plot the same data. The size of each marker is set to 100, but this is in points squared, meaning that the actual size is the square root of 100, or 10 points.

[1]: https://matplotlib.org/api/markers_api.html

In [None]:
fig, ax = plt.subplots()
ax.scatter(x, y, s=100, c='orange', marker='v', lw=2, ec='.3');

### Creating a scatter plot with the `plot` method

The primary purpose of the `plot` method is to create line plots. However, it can be used to create scatter plots by setting the `linestyle` or `ls` parameter to an empty string and setting the `marker` parameter to any valid value. Let's replicate the above scatter plot with the `plot` method. The parameters `ms`, `mew`, and `mec` are used instead of `s`, `lw`, and `ec`. Note that the marker size for the `plot` method is given in actual points (10) and not points squared (100) as it was with `scatter`.

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, ms=10, c='orange', marker='v', mew=2, mec='.3', ls='');

Although the exact same output is produced, I don't recommend using the `plot` method to create scatter plots. With the `scatter` method, it's possible to control the properties of each individual point. With the `plot` method, all points must have the same properties. 
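
As a rough illustration of that difference, the sketch below gives every point its own size and its own RGBA color in a single `scatter` call. The data is random and the variable names (`sizes`, `colors`, `points`) are made up for this example; the random generator is seeded only so the sketch is repeatable.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
x = rng.random(20)
y = rng.random(20)
sizes = rng.choice([25, 64, 100], size=20)  # one size (points squared) per point
colors = rng.random((20, 4))                # one RGBA row per point

fig, ax = plt.subplots()
points = ax.scatter(x, y, s=sizes, c=colors)

# The returned collection stores one entry per point for both properties
per_point_sizes = points.get_sizes()
per_point_colors = points.get_facecolor()
print(per_point_sizes.shape, per_point_colors.shape)
```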

### Setting properties of individual points

Let's show examples of how we can change the properties of each individual point with the `scatter` method. We begin by reading in a few columns of the housing dataset, getting a random sample of 100 houses.

In [None]:
cols = ['OverallQual', 'Exterior1st', 'GrLivArea', 
        'GarageArea', 'GarageCars', 'SalePrice']
housing = pd.read_csv('../data/housing.csv', usecols=cols)
housing = housing.sample(100, random_state=43).reset_index(drop=True)
housing.head(3)

### Setting the color of individual points

The color of each individual point is controlled by the `c` (and NOT the `color`) parameter. You can set it to a sequence of web colors or RGBA values the same length as the data. Below, we create an array of 100 random RGBA values and use them to color the points of the scatter plot between living area and sale price.

In [None]:
c = np.random.rand(100, 4)
fig, ax = plt.subplots(figsize=(5, 2))
ax.scatter('GrLivArea', 'SalePrice', c=c, data=housing, s=20);

It's possible to pass a column name to the `c` parameter, just like we do with the x and y values. The column values must either be numeric or named color strings. Let's use the overall quality column to color the points. The values in this column range from 1 to 10. Also, all points are made slightly transparent, which is important to do for scatter plots that have many overlapping values.

In [None]:
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c='OverallQual', 
                         data=housing, alpha=.6, s=20)
ax.set_title('Price vs Square Foot');

### Colors chosen with default colormap

matplotlib uses a colormap to select the colors for the points. The default colormap is viridis, which contains 256 colors. matplotlib maps each value in the overall quality column to a single color of the colormap using the following steps:

* Calculate the range of the values (difference between maximum and minimum)
* Subtract the minimum from each value
* Divide each value by the range
* Multiply each value by the number of colors
* Truncate the decimal to get an integer

These integers are then passed to the colormap object to retrieve the RGB value. Since there are only 10 unique values of overall quality, we can follow the above steps ourselves to produce the colormap integer for each value and output the result as a DataFrame.

In [None]:
from matplotlib import cm
min_oq = housing['OverallQual'].min()
max_oq = housing['OverallQual'].max()
range_oq = max_oq - min_oq
oq_vals = np.arange(1, 11)
N = cm.viridis.N
cm_vals = ((oq_vals - min_oq) / range_oq * N // 1).astype('int64')
df_colormap = pd.DataFrame({'OverallQual': oq_vals, 'Colormap integer': cm_vals})
df_colormap

The lowest value always gets mapped to 0 and the highest to `N`, the total number of colors in the colormap. Since the colors themselves are indexed from 0 to `N - 1`, matplotlib treats an index of `N` as the last color. Let's apply the same transformation to all the actual values of overall quality to create a Series of colormap integers.

In [None]:
cm_values = (housing['OverallQual'] - min_oq) / range_oq * N // 1
cm_values = cm_values.astype('int64')
cm_values.head()

These are the values matplotlib used to determine the color of each point in our scatter plot. Let's convert each of these integers to its RGBA value using the same alpha for each.

In [None]:
rgba_array = cm.viridis(cm_values, alpha=.6)
rgba_array[:3]

The above values are what we believe matplotlib used as RGBA colors for each point. Let's get the actual RGBA values from the plotting object with the `get_facecolor` method.

In [None]:
scatter_rgba = scatter_obj.get_facecolor()
scatter_rgba[:3]

The values from the arrays appear to be identical. Let's verify this by running the `assert_array_equal` method from numpy's `testing` module. If no exception is raised, then the arrays are equal.

In [None]:
np.testing.assert_array_equal(rgba_array, scatter_rgba)
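
The same comparison can be sketched without the housing data. The example below uses a handful of made-up quality scores and the `Normalize` class from `matplotlib.colors`, which performs the first three steps (rescaling to the interval from 0 to 1); the colormap call then handles the multiplication and truncation. It is only a small check under those assumptions, not a description of everything matplotlib does internally.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import Normalize

values = np.array([1, 3, 5, 8, 10])  # made-up quality scores
cmap = plt.get_cmap("viridis")

# Let matplotlib rescale the values to [0, 1] and look up the colors
norm = Normalize(vmin=values.min(), vmax=values.max())
colors_from_norm = cmap(norm(values))

# Follow the listed steps by hand to get integer colormap indices
scaled = (values - values.min()) / (values.max() - values.min())
indices = (scaled * cmap.N).astype(int)
colors_by_hand = cmap(indices)

print(indices)
print(np.allclose(colors_from_norm, colors_by_hand))
```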

### Adding a legend to the scatter plot

Adding a legend to a scatter plot is simple if all the points represent the same thing. In that case, set the `label` parameter of the `scatter` method to a string and then call the `legend` method as before.

If different groups of points require different labels, more work has to be done. Call the `legend_elements` method from the object returned from `scatter`. This will automatically generate the handles and labels for you as a two-item tuple of lists. Below, we unpack these two returned objects into their own variable names.

In [None]:
handles, labels = scatter_obj.legend_elements()

Let's output the `handles` list to see what it contains.

In [None]:
handles

It contains a list of 10 lines corresponding to the 10 unique values of overall quality. These lines are not on the axes. They are created by `legend_elements` just so you can pass them to the `legend` method. The color of the first line is the same as the color of the lowest overall quality value, which we verify below.

In [None]:
handle_0 = handles[0]
handle_0.get_color() == cm.viridis(0)

The labels are a list of 10 strings corresponding to the 10 unique values of overall quality. The first few are output below.

In [None]:
labels[:3]

These strings look strange, but essentially represent the integers 1 to 10. matplotlib allows you to write mathematical expressions in a TeX-like markup language, which it calls mathtext, by placing the content between dollar signs in a string. The details of this will not be presented here. The legend is created by passing these handles and labels to the `legend` method.

In [None]:
ax.legend(handles=handles, labels=labels, title='OverallQual', 
          facecolor='.9', ncol=3, loc='upper left')
fig

### Coloring by a column with more unique values

Let's now color each point by the size of the house's garage area, which is numeric and therefore valid. Each value of garage area will be mapped to a specific color from the viridis colormap using the rules from above. The `legend_elements` method is called to return the handles and labels again. It does not produce a legend entry for every unique value, but instead smartly chooses a reasonable number of entries given the data. You can control the exact number of entries to produce by setting the `num` parameter.

In [None]:
fig, ax = plt.subplots(figsize=(6, 2.5))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c='GarageArea', data=housing, s=20)
handles, labels = scatter_obj.legend_elements(num=12)
ax.legend(handles=handles, labels=labels, title='GarageArea', ncol=2)
ax.set_title('Price vs Square Foot');

### Using a colorbar

If a substantial number of unique values exist, a colorbar might be a better choice to label the meaning of colors. Call the `colorbar` **figure** method passing it the underlying plotting object. This is a rare instance where a figure (and not an axes) method is used. The entire spectrum of the colormap with its corresponding values will be placed in a slim vertical bar on the right side of the figure.

In [None]:
fig, ax = plt.subplots(figsize=(6, 2.5))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c='GarageArea', data=housing, s=20)
fig.colorbar(scatter_obj, label='Garage Area')
ax.set_title('Price vs Square Foot');

Technically, an entirely new axes was created. You can retrieve all axes from the figure as a list using an attribute of the same name.

In [None]:
fig.axes
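
A small self-contained sketch of the same idea, using random data in place of the housing columns (the names `z` and `cbar` are chosen for this example only), counts the axes before and after the colorbar is added:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
x, y = rng.random(50), rng.random(50)
z = rng.uniform(0, 1000, 50)  # made-up values used only for color

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=z, s=20)
axes_before = len(fig.axes)

# The colorbar is drawn on its own axes, available as cbar.ax
cbar = fig.colorbar(points, label="z value")
axes_after = len(fig.axes)

print(axes_before, axes_after)
print(cbar.ax in fig.axes)
```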

### Using a different colormap

Take a look at all of the [colormaps in the official documentation][0]. You'll notice that they are divided into the categories sequential, diverging, cyclic, and qualitative. Each category of colors is best suited for a particular type of data.

* sequential - data has a meaningful order - population, SAT score, quality of kitchen, salary, diamond clarity, etc...
* diverging - data has a clear point that divides the distribution  - sea level, velocity, z-scores, etc...
* cyclic - data where first and last values have similar meaning - seasonal data, business cycles, etc...
* qualitative - data has no meaningful order - states, animals, brands, type of roof, department, etc...

Garage area is data with meaningful order with higher values meaning larger garages. The default colormap, viridis, is sequential so it is already a valid choice. But, there are several other sequential colormaps that we can choose from. Below, we use `'YlOrRd'`, which begins with shades of yellow, before moving to orange and red. We pass it as a string to the `cmap` parameter.

[0]: https://matplotlib.org/gallery/color/colormap_reference.html

In [None]:
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c='GarageArea', 
                         cmap='YlOrRd', data=housing, s=20)
fig.colorbar(scatter_obj, label='Garage Area')
ax.set_title('Price vs Square Foot');

### Limiting the range of values for the colormap

If there are outliers in the column used to create the colormap, then the diversity of colors shown can greatly diminish. For example, the largest garage area is currently around 1,000. If we change the garage area of one house to 10,000, then the distribution of colors would completely change. All the houses except one would be mapped to the first 10% of the colormap spectrum. Let's see how this would look.

In [None]:
housing.loc[67, 'GarageArea'] = 10_000
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c='GarageArea', 
                         cmap='YlOrRd', data=housing, s=20)
handles, labels = scatter_obj.legend_elements(num=5)
ax.legend(handles=handles, labels=labels, title='GarageArea')
ax.set_title('Price vs Square Foot');

The coloring is now polarized and completely dominated by this one outlying garage area. The `vmin` and `vmax` parameters of the `scatter` method set the minimum and maximum values that the colormap spans. The values 200 and 800 are used below for this range. Garage areas below 200 and above 800 will be mapped to the first and last colors of the colormap, respectively.

Instead of using an integer for the `num` parameter in `legend_elements`, a sequence can be used to provide the exact values you want. If you don't provide a sequence, then matplotlib will choose equally spaced values from 0 to 10,000.

In [None]:
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c='GarageArea', 
                         cmap='YlOrRd', data=housing, s=20, vmin=200, vmax=800)
handles, labels = scatter_obj.legend_elements(num=range(200, 900, 100))
ax.legend(handles=handles, labels=labels, title='GarageArea', ncol=2)
ax.set_title('Price vs Square Foot');
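
The effect of `vmin` and `vmax` on out-of-range values can be checked on a few made-up garage areas. The sketch below relies on the `to_rgba` method of the returned collection, which applies the collection's normalization and colormap to arbitrary values; the numbers are illustrative and not taken from the housing data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

areas = np.array([0, 200, 500, 800, 10_000])  # made-up garage areas
x = np.arange(len(areas))

fig, ax = plt.subplots()
points = ax.scatter(x, x, c=areas, cmap="YlOrRd", vmin=200, vmax=800)

# Convert each area to the RGBA color the scatter plot would use for it
rgba = points.to_rgba(areas)

# Values outside [vmin, vmax] share the color of the nearest limit
below_matches_vmin = np.array_equal(rgba[0], rgba[1])
above_matches_vmax = np.array_equal(rgba[3], rgba[4])
print(below_matches_vmin, above_matches_vmax)
```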

### Using string columns to color

DataFrame columns with string values can also be used to color the points of a scatter plot, but you'll need to convert the values to integers. In the housing dataset, the `Exterior1st` column contains a string description of the primary exterior of each house.

In [None]:
housing['Exterior1st'].value_counts()

Attempting to use this column for the color directly results in an error.

In [None]:
ax.scatter('GrLivArea', 'SalePrice', c='Exterior1st', data=housing)

We need to map each value to an integer. The simplest way to do this is to convert it to the categorical data type. Since it has no inherent ordering, we convert it without providing a list of all the categories.

In [None]:
housing['Exterior1st'] = housing['Exterior1st'].astype('category')

The underlying integer values can be accessed with the `cat` accessor's `codes` attribute.

In [None]:
ext_codes = housing['Exterior1st'].cat.codes
ext_codes.head(3)

The category names need to be retrieved for the legend. We use the `cat` accessor again and convert the values to a list and display a few of the categories.

In [None]:
categories = housing['Exterior1st'].cat.categories.tolist()
categories[:5]

This data is best plotted with a qualitative colormap as it has no inherent ordering. The Set1 colormap is chosen, which has 9 colors, the exact number of unique values of `Exterior1st`. The labels are set to the categories.

In [None]:
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', c=ext_codes, cmap='Set1', 
                         data=housing, s=20)
handles, labels = scatter_obj.legend_elements()
ax.legend(handles=handles, labels=categories, title='Exterior1st', 
          ncol=2, bbox_to_anchor=(1, .8))
ax.set_title('Price vs Square Foot');
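
The whole sequence (convert to categorical, take the codes, reuse the category names as legend labels) is sketched below on a tiny made-up DataFrame. The column names `area`, `price`, and `exterior` are placeholders for this example and are not the housing dataset's columns.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "area": [900, 1200, 1500, 1800, 2100, 2400],
    "price": [100, 140, 150, 210, 230, 260],
    "exterior": ["Wood", "Brick", "Wood", "Vinyl", "Brick", "Vinyl"],
})

exterior = df["exterior"].astype("category")
codes = exterior.cat.codes                     # one integer per row
categories = exterior.cat.categories.tolist()  # names, in the same order as the codes

fig, ax = plt.subplots()
points = ax.scatter("area", "price", c=codes, cmap="Set1", data=df, s=20)

# One handle per unique code; the category names replace the numeric labels
handles, _ = points.legend_elements()
ax.legend(handles=handles, labels=categories, title="exterior")
print(codes.tolist(), categories)
```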

## Change scatter plot point size

The size of each point of a scatter plot is provided as typographical points squared with the `s` parameter. You can set it as a single number, a sequence of numbers, or a DataFrame column name containing numeric values. Let's plot our 20 random values with a random size of either 25, 64, or 100. While these numbers seem large, they are in points squared, so these translate to heights of 5, 8, and 10 points or .07, .11, and .14 figure-inches (dividing by 72). 

When calling `legend_elements`, set `prop` to the string `'sizes'` so that it returns handles and labels referencing the size and not the color. Other legend parameters that control the spacing and padding are also used.

In [None]:
fig, ax = plt.subplots()
s = np.random.choice([25, 64, 100], 20)
scatter_obj = ax.scatter(x, y, s=s, alpha=.6)
handles, labels = scatter_obj.legend_elements(prop='sizes')
ax.legend(handles=handles,  labels=labels, bbox_to_anchor=(1, .8), 
          labelspacing=.9, borderpad=.7);

Let's go back to our housing dataset and use the values of garage area as the size of our points. Before doing so, we need to get an understanding of how large the points will be. Below, we calculate the figure-inches for the points that will represent the smallest and largest garage areas. We also restore the garage area of the house that was set to 10,000 square feet to its original value.

In [None]:
housing.loc[67, 'GarageArea'] = 812
housing['GarageArea'].agg(['min', 'max']) ** .5 / 72

The houses with no garage area won't appear on the graph, while the largest garage will have a point that is about half of a figure-inch, which doesn't seem too large. Let's go ahead and make the plot and assess its appearance.

In [None]:
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', s='GarageArea', alpha=.6, data=housing)
handles, labels = scatter_obj.legend_elements(prop='sizes')
ax.legend(handles=handles, labels=labels, title='GarageArea', bbox_to_anchor=(1, 1), 
          labelspacing=3, borderpad=1.2, ncol=3);

The points are too large and there are far too many entries in the legend. It's not likely that your column of data will happen to align nicely to typographical points squared. Most of the time, you'll need to manually transform your data so that you get the exact point size you desire. One fairly simple way to do this is with the pandas `cut` function, which categorizes numeric data into bins. Pass it the Series you want to bin, the edges of the bins, and the labels for each bin. Use the marker size in points squared as the labels.

In [None]:
ga_bins = pd.cut(housing['GarageArea'], bins=[-1, 300, 600, 900, 1500], 
                 labels=[20, 40, 80, 100])
ga_bins.head()

We created four categories from five bin edges: garage areas up to 300, from 300 to 600, from 600 to 900, and above 900, with corresponding marker sizes of 20, 40, 80, and 100. A list of descriptive strings is manually created for the legend.

In [None]:
fig, ax = plt.subplots(figsize=(5, 2))
scatter_obj = ax.scatter('GrLivArea', 'SalePrice', s=ga_bins, alpha=.6, data=housing)
handles, _ = scatter_obj.legend_elements(prop='sizes')
labels = ['< 300', '300 - 600', '600 - 900', '> 900']
ax.legend(handles=handles, labels=labels, title='GarageArea', bbox_to_anchor=(1, .8),
         labelspacing=.8)
ax.set_title('Price vs Square Foot');
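
Binning is only one way to transform a column into marker sizes. As an alternative sketch (not the approach used above), the values can be rescaled linearly into a chosen range of sizes with numpy's `interp` function. The helper name `scale_sizes` and the size range of 10 to 100 are assumptions made for this example.

```python
import numpy as np


def scale_sizes(values, smallest=10, largest=100):
    """Linearly rescale values so the minimum maps to `smallest` and the maximum to `largest`.

    Assumes the values are not all identical.
    """
    values = np.asarray(values, dtype=float)
    return np.interp(values, (values.min(), values.max()), (smallest, largest))


garage_areas = [0, 250, 500, 750, 1000]  # made-up garage areas
sizes = scale_sizes(garage_areas)
print(sizes)
```

The resulting array could be passed to the `s` parameter of `scatter` in place of the binned values.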

## Bar plots

Standard bar plots are a series of rectangles with a bottom on the x-axis. Each bar is defined by an x-value and a height. In matplotlib, bar plots involving only one group of data are fairly easy to produce, but become substantially more difficult when multiple groups are involved. Let's use the City of Houston employee dataset for the bar plot examples.

In [None]:
emp = pd.read_csv('../data/employee.csv', parse_dates=['hire_date'])
emp.head(3)

Let's create a bar plot of the frequency of each department.

In [None]:
dept_counts = emp['dept'].value_counts()
dept_counts

To create a bar plot, you must supply both the `x` and the `height` parameters. Let's assign the unique departments to the variable name `x` and the values to `height`.

In [None]:
x = dept_counts.index
height = dept_counts.values

We can now call the `bar` method to plot the counts of each department. As usual, we assign the result of the plotting method to a variable.

In [None]:
fig, ax = plt.subplots()
bar_obj = ax.bar(x=x, height=height)

Let's get the type of object returned.

In [None]:
type(bar_obj)

This `BarContainer` object is essentially a list of each of the bars. Nine bars were created, one for each department. We verify this with the `len` function.

In [None]:
len(bar_obj)

Each individual item is a rectangle patch.

In [None]:
bar0 = bar_obj[0]
type(bar0)

We've created these kinds of patches before. Let's use some of the getter methods to retrieve some properties.

In [None]:
bar0.get_width()

In [None]:
bar0.get_xy()

In [None]:
bar0.get_height()

We can set this first bar to a different color and output the figure to verify the change.

In [None]:
bar0.set_color('red')
fig

### Rotating the tick labels

The tick labels overlap each other. Let's use the `tick_params` method to rotate the labels 45 degrees.

In [None]:
ax.tick_params(axis='x', labelrotation=45)
fig

The labels are text objects with center horizontal alignment, which makes it difficult to tell which bar each one references once rotated. They would read better if they were right-aligned. Unfortunately, there is no way to set the horizontal alignment with the `tick_params` method. You'll have to loop through each label individually and call its setter methods. The `get_xticklabels` method returns a list-like object of each label as a matplotlib text object. We rotate the labels a bit less and align them horizontally from the right.

In [None]:
for label in ax.get_xticklabels():
    label.set_rotation(30)
    label.set_ha('right')
fig
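
The explicit loop can also be written with the pyplot `setp` function, which applies the same property values to every object in a sequence. The sketch below uses made-up department names and counts rather than the employee data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

departments = ["Police", "Fire", "Public Works", "Health"]  # made-up labels
counts = [12, 9, 7, 4]

fig, ax = plt.subplots()
ax.bar(x=departments, height=counts)

# Apply rotation and right alignment to every tick label in one call
tick_labels = ax.get_xticklabels()
plt.setp(tick_labels, rotation=30, ha="right")

rotations = [label.get_rotation() for label in tick_labels]
alignments = [label.get_ha() for label in tick_labels]
print(rotations, alignments)
```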

### Set rotation, alignment, and all other text properties with `set_xticklabels`

If the tick labels are already in a list or a sequence like a pandas index, then you can use the `set_xticklabels` method to change all of their text properties without a loop.

In [None]:
fig, ax = plt.subplots()
bar_obj = ax.bar(x=x, height=height)
ax.set_xticklabels(x, rotation=30, ha='right');

We used strings for the x-values, but these are mapped directly to integers beginning at 0. We could have duplicated the bars by using a sequence of integers instead. Notice that not all tick values are present. We'll learn in a future chapter how to place ticks at particular intervals.

In [None]:
fig, ax = plt.subplots()
ax.bar(x=np.arange(9), height=height);

### Customizing every bar

Every property of every bar can be customized. The width of each bar can be set with the `width` parameter (default .8). The starting y-value of each bar is set to 0 by default, but can be changed with the `bottom` parameter. Below, we create four bars of all different sizes and starting positions. The first four colors of the Accent colormap are used for the colors and the labels are set with the `tick_label` parameter.

In [None]:
x = [-12, -5, 8, 16]
height = [5, 7, 3, 9]
width = [3, 2, 1, 4]
bottom = [6, 2, 5, 3]
color = cm.Accent(range(4))
tick_labels = ['Dean', 'Chris', 'Jimmy', 'Elias']
fig, ax = plt.subplots()
ax.set_ylim(0, 14)
ax.bar(x=x, height=height, width=width, bottom=bottom, alpha=.8, color=color,
       ec='black', tick_label=tick_labels);

### Plotting groups of bars

Plotting multiple groups of bars requires us to carefully place each set of bars at a precise location along the x-axis with a width that does not overlap the other bars. Let's first calculate the average salary for every department and race.

In [None]:
pt = emp.pivot_table(index='dept', columns='race',
                     values='salary', aggfunc='mean').round(-3)
pt.head(3)

Each column can be plotted as bars to represent the average salary of each department for that race. A total of five groups of bars, one for each race, will be created. To accommodate each bar, we'll reduce the width to one-fifth of its original value. 

Below, we select the first column (`'Asian'`) as a Series for the height and calculate the new width. The `label` parameter is set to the name of the race so that it appears in the legend. By default, the bars are centered around the given x-value. When the `align` parameter is set to `'edge'`, the left edge of the bar will begin at the x-value.

In [None]:
m, n = pt.shape
x = np.arange(m)
height = pt['Asian']
orig_width = .8
width = orig_width / n
fig, ax = plt.subplots(figsize=(5, 2.5))
ax.bar(x, height, width=width, align='edge', label='Asian')
ax.legend();

The x-values for the next column group need to be shifted to the right by the width of the bars so that they begin exactly where the last bar ends. We add the width to every x-value and then plot the next group. The `legend` method must be called again to update it.

In [None]:
height = pt['Black']
ax.bar(x + width, height, width=width, align='edge', label='Black')
ax.legend(bbox_to_anchor=(1, 1), loc='upper left')
fig

Instead of adding each group of bars one at a time, we can use a loop to iterate through the column names, calculating the new x-values. The tick marks are placed in the center of each group and given the department names as labels. The `set_xticklabels` method is used to rotate and align the labels.

In [None]:
fig, ax = plt.subplots(figsize=(5, 2.5))
for i, col in enumerate(pt.columns):
    x_new = x + width * i
    ax.bar(x_new, pt[col], width=width, label=col, align='edge')
ax.legend(bbox_to_anchor=(1, 1))
ax.set_xticks(x + orig_width / 2)
ax.set_xticklabels(pt.index, rotation=30, ha='right')
ax.set_title('Average salary by department and race');
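
The position arithmetic in the loop can be isolated from the plotting. The function below is a hypothetical helper written for this explanation (it is not a matplotlib function): it returns the bar width, the left edge of every bar for each group, and the tick positions at the center of each cluster.

```python
import numpy as np


def grouped_bar_positions(n_categories, n_groups, total_width=0.8):
    """Return (width, left_edges, centers) for bars drawn with align='edge'.

    left_edges holds one array of x-values per group.
    """
    x = np.arange(n_categories)
    width = total_width / n_groups
    left_edges = [x + width * i for i in range(n_groups)]
    centers = x + total_width / 2
    return width, left_edges, centers


width, left_edges, centers = grouped_bar_positions(n_categories=3, n_groups=4)
print(width)
print(left_edges[1])
print(centers)
```

Each group starts exactly where the previous one ends, and the last group's right edge falls `total_width` to the right of the first group's left edge.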

### Stacked bar plot

Instead of placing the bars directly next to each other, they can be stacked one on top of the other to create a stacked bar plot. In this instance, the `bottom` parameter must be used and updated each iteration. It is initially set to zero, but then accumulated for each group. The default width can be used.

In [None]:
fig, ax = plt.subplots(figsize=(3, 1.5))
bottom = np.zeros(9)
for col in pt.columns:
    ax.bar(x, pt[col], bottom=bottom, label=col)
    bottom += pt[col]
ax.legend(bbox_to_anchor=(1, 1))
ax.set_xticks(x)
ax.set_xticklabels(pt.index, rotation=30, ha='right');
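
The accumulation of `bottom` is sketched below on a small made-up table. One assumption is added here that is not in the example above: a missing value is treated as a zero-height bar with `fillna(0)`, because a single `NaN` added to `bottom` would otherwise carry through to every later group in that position.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up table with one missing cell
table = pd.DataFrame(
    {"A": [3.0, 1.0, 2.0], "B": [2.0, np.nan, 1.0], "C": [1.0, 4.0, 2.0]},
    index=["x", "y", "z"],
)

x = np.arange(len(table))
bottom = np.zeros(len(table))

fig, ax = plt.subplots()
for col in table.columns:
    heights = table[col].fillna(0).to_numpy()  # assumption: missing means zero height
    ax.bar(x, heights, bottom=bottom, label=col)
    bottom = bottom + heights                  # next group starts on top of this one
ax.legend()

print(bottom)
```

After the loop, `bottom` holds the total height of each stack.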

### Horizontal bar plots

Horizontal bar plots are created with the `barh` method in nearly the same manner. Only the names of the parameters are different. The same code from above is copied below. The parameters `x`, `width`, and `height` have been substituted with `y`, `height`, and `width`.

In [None]:
fig, ax = plt.subplots(figsize=(4, 3))
for i, col in enumerate(pt.columns):
    x_new = x + width * i
    ax.barh(y=x_new, width=pt[col], height=width, label=col, align='edge')
ax.legend(bbox_to_anchor=(1, 1))
ax.set_yticks(x + orig_width / 2)
ax.set_yticklabels(pt.index)
ax.set_title('Average salary by department and race');

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Read in the bikes dataset and select 500 rows of data at random. Filter for rides with a trip duration less than the 95th percentile. Remove any rows that have obviously bad data for temperature and wind speed. Make a scatter plot with temperature and trip duration as the x and y variables. Color by gender and size by wind speed using a qualitative color map. Use [this tutorial][2] to create two separate legends, one for gender, and the other for wind speed.</span>

[2]: https://matplotlib.org/gallery/lines_bars_and_markers/scatter_with_legend.html