# Matplotlib Line Plots

In this chapter, we will plot actual data with matplotlib line and scatter plots. Thus far, we have yet to do any data visualization. We've focused on understanding how to create the figure and axes, and how to change their properties using the object-oriented approach. We've also covered how to add straight lines, text, and patches, and how to choose colors for them. None of these plotting objects came from data. We will now use the data that we previously analyzed within a pandas DataFrame to make data visualizations with matplotlib.

## Axes API

We'll continue to call methods from the axes object, but concentrate on those that plot data. It can be helpful to navigate to the [axes API page from matplotlib's official documentation][1] which contains around 300 different methods. The API page categorizes and groups each method by its functionality. In this chapter, we focus on the `plot` method from the [plotting section][2].

### Plotting methods

All of the plotting methods accept data as input, and add matplotlib objects to the axes. Each of the plotting methods returns the plotting object(s) which we can assign to a variable name.

[1]: https://matplotlib.org/api/axes_api.html
[2]: https://matplotlib.org/api/axes_api.html#plotting

## Line plots with the `plot` method

The `plot` method's primary purpose is to create line plots. It does have the ability to create scatter plots as well, but that task is best reserved for the `scatter` method. A more appropriate name for this method would have been `lineplot` as the name `plot` isn't descriptive. You'll have to make the association that `plot` really means line plot. All of the other plotting methods are descriptive of what kind of plot they actually create except this one.

The `plot` method is very flexible and can take a variety of different inputs. Instead of covering all of the possible ways to use the `plot` method, a single, straightforward approach will be given. The first two arguments to the `plot` method are the x and y coordinates of the data.

Below, a numpy array of the integers from -5 to 5 is created with the `arange` function. These values represent the x coordinates of our line. The squared values of this array are used as the y coordinates. We call the `plot` method, assigning the returned value to `line_plot_objects`. Each x-value is paired with its respective y-value. The points are connected with straight lines, forming the image below.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('mdap.mplstyle')
fig, ax = plt.subplots()
x = np.arange(-5, 6)
y = x ** 2
line_plot_objects = ax.plot(x, y)

### What was returned?

The variable `line_plot_objects` was assigned the return value from the call to the `plot` method. Let's inspect this object and output its contents to the screen.

In [None]:
line_plot_objects

matplotlib returns a list of `Line2D` objects. The `plot` method can produce many lines in a single call to it, which is why it returns the results as a list and not as a single object. Let's verify that this returned object is a list.

In [None]:
type(line_plot_objects)

Let's assign its only item to a variable name and output its type.

In [None]:
line = line_plot_objects[0]
type(line)

We now have access to our line, technically a `Line2D` object. Like all matplotlib objects, it has many properties, which can be accessed and changed with its getter and setter methods. We begin by getting the underlying data, which is returned as a tuple of numpy arrays.

In [None]:
line.get_data()

The color is retrieved as an RGB string.

In [None]:
line.get_color()

All lines default to a zorder of 2, which we verify below.

In [None]:
line.get_zorder()

The `Line2D` object is not all that important to understand and you will rarely use it directly to create lines. Instead, you'll be using methods like `plot`, `hlines`, `vlines` and others.
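As a minimal sketch of the ideas above, the cell below passes two x, y pairs to a single `plot` call, which is one way to get more than one `Line2D` object back in the returned list, and then changes the first line with its setter methods. The names `fig_a`, `ax_a`, `x_a`, `two_lines`, and `first_line` are made up for this illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig_a, ax_a = plt.subplots()
x_a = np.arange(-5, 6)

# Two x, y pairs in one call produce a list with two Line2D objects
two_lines = ax_a.plot(x_a, x_a ** 2, x_a, x_a + 10)
print(len(two_lines))

# Use setter methods to change the first line, then read the values back
first_line = two_lines[0]
first_line.set_color('darkred')
first_line.set_linewidth(3)
print(first_line.get_color(), first_line.get_linewidth())
```

Each setter has a matching getter, so any property changed this way can be read back from the same object.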

### Line properties

The line properties that we've covered in previous chapters remain the same and all work when using the `plot` method.


| Property                      | Possible Values                                                           |
|-------------------------------|---------------------------------------------------------------------------|
| `linewidth` or `lw`         | width of line in points       |
| `linestyle` or `ls`          | `'solid'` or `'-'` (default), `'dashed'` or `'--'`, `'dotted'` or `':'`, `'dashdot'` or `'-.'`                                                 |
| `color` or `c` | line color |
| `alpha`         | 0 to 1 - opacity       |

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, color='orange', alpha=.6, ls='--', lw=3);

### Markers

By default, the coordinates that describe the line have no special marker to denote their location. These coordinates can be visualized by providing one of a few dozen [marker styles][1] as a single character or digit. There are several other parameters available to control the marker appearance.

* `marker` - style of marker with some examples below
    * `'.'` - point
    * `'o'` - circle
    * `'s'` - square
    * `'+'` - plus
    * `6` - caret up
* `markersize` or `ms` - size of marker in points
* `markerfacecolor` or `mfc` - face color of marker
* `markeredgecolor` or `mec` - edge color of marker
* `markeredgewidth` or `mew` - edge width of marker in points

We create another line plot using the same data adding square markers (denoted by 's'). The marker size, face color, edge color, and edge width are all set as well.

[1]: https://matplotlib.org/api/markers_api.html

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, color='darkred', lw=2, ls='--', 
        marker='s', ms=6, mfc='gold', mec='navy', mew=2);
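The short aliases and the full parameter names are interchangeable. The sketch below draws two lines, one styled with the aliases and one with the long names, and compares their marker properties through the getter methods. The variable names (`fig_b`, `ax_b`, `x_b`, `alias_line`, `full_line`) are hypothetical and used only here.

```python
import numpy as np
import matplotlib.pyplot as plt

fig_b, ax_b = plt.subplots()
x_b = np.arange(-5, 6)

# Marker properties set with the short aliases
(alias_line,) = ax_b.plot(x_b, x_b ** 2, marker='s', ms=6,
                          mfc='gold', mec='navy', mew=2)

# The same properties set with the full parameter names
(full_line,) = ax_b.plot(x_b, x_b ** 2 + 5, marker='s', markersize=6,
                         markerfacecolor='gold', markeredgecolor='navy',
                         markeredgewidth=2)

print(alias_line.get_marker(), alias_line.get_markersize())
print(alias_line.get_markerfacecolor() == full_line.get_markerfacecolor())
```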

## Integration with pandas

In the official documentation, the vast majority of examples use numpy arrays as inputs for the matplotlib plotting methods. Alternatively, matplotlib allows the use of pandas Series and DataFrames as inputs. There's even an alternative syntax that will be explained below that allows you to reference DataFrame columns by their string names. Let's begin with examples from the flights dataset.

In [None]:
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.head(3)

### Average carrier delay per departure hour

Let's run a calculation before plotting, such as finding the average carrier delay for each departure hour. First, we'll round down each departure time to the nearest hour by creating the column `dep_hour`.

In [None]:
flights['dep_hour'] = flights['dep_time'] // 100
flights['dep_hour'].head(3)

We use this new column to calculate the average carrier delay per departure hour.

In [None]:
avg_cd = flights.groupby('dep_hour').agg(avg_carrier_delay=('carrier_delay', 'mean'))
avg_cd = avg_cd.reset_index()
avg_cd.head(3)

We'll now make a line plot placing the departure hour as the x-values and the average carrier delay as the y-values. We'll use the old syntax, passing the x and y data as pandas Series.

In [None]:
fig, ax = plt.subplots()
x = avg_cd['dep_hour']
y = avg_cd['avg_carrier_delay']
ax.plot(x, y)
ax.set_title('Average carrier delay vs Departure hour');

### Alternative syntax for plotting with DataFrames

Most plotting methods have a `data` parameter that can be set to be a pandas DataFrame. Doing so allows you to use the column names as the x and y values. Creating a line plot from a DataFrame, `df`, with x-values in `col1` and y-values in `col2` has the following generic form:

```python
ax.plot('col1', 'col2', data=df)
```

We recreate the above plot using this alternative syntax.

In [None]:
fig, ax = plt.subplots()
ax.plot('dep_hour', 'avg_carrier_delay', data=avg_cd)
ax.set_title('Average carrier delay vs Departure hour');
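To confirm that the two syntaxes produce the same line, here is a small self-contained sketch. It uses a tiny made-up DataFrame, `demo_delays`, as a stand-in for `avg_cd` (the numbers are invented for illustration) and compares the data stored in each resulting line.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for avg_cd with invented values
demo_delays = pd.DataFrame({'dep_hour': [6, 7, 8, 9],
                            'avg_carrier_delay': [2.5, 3.1, 4.0, 3.6]})

fig_c, ax_c = plt.subplots()

# Passing the columns as Series
(line_series,) = ax_c.plot(demo_delays['dep_hour'],
                           demo_delays['avg_carrier_delay'])

# Passing column names as strings together with the data parameter
(line_named,) = ax_c.plot('dep_hour', 'avg_carrier_delay', data=demo_delays)

same_x = list(line_series.get_xdata()) == list(line_named.get_xdata())
same_y = list(line_series.get_ydata()) == list(line_named.get_ydata())
print(same_x, same_y)
```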

### Plotting with strings

In our above plots, both the x and y values were numeric. It's possible to create plots where either the x or the y values are strings. Let's create a line plot of the total number of flights for each airline. The x-values will be the string abbreviation of the airline with the number of flights as the y-values. Let's calculate this result with the `value_counts` method.

In [None]:
num_flights_series = flights['airline'].value_counts()
num_flights_series.head(3)

Let's convert this Series into a two-column DataFrame so that we can pass it to the `data` parameter.

In [None]:
df = num_flights_series.reset_index()
df.columns = ['airline', 'num_flights']
df.head(3)

The string column airline is used for the x-values in the line plot below.

In [None]:
fig, ax = plt.subplots()
ax.plot('airline', 'num_flights', data=df)
ax.set_title('Number of flights by airline');

matplotlib places the strings on the x-axis in the order that they appear in the DataFrame. It can be helpful to understand what is happening behind the x-axis. Each unique string value is mapped to an integer beginning with 0. Let's get the underlying x-values of the tick marks.

In [None]:
ax.get_xticks()

Normally, the labels for the tick marks are just the x-values themselves. When plotting strings, matplotlib uses those exact strings as the labels. Let's view a few of the underlying tick labels.

In [None]:
list(ax.get_xticklabels())[:3]
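The mapping from strings to integers can be seen in isolation with a three-point example. The airline codes and counts below are invented, and the names `fig_d`, `ax_d`, `tick_positions`, and `tick_texts` are hypothetical. The figure is drawn first so that the tick labels are up to date before they are read.

```python
import matplotlib.pyplot as plt

fig_d, ax_d = plt.subplots()

# Invented string x-values and numeric y-values
ax_d.plot(['AA', 'DL', 'UA'], [30, 20, 10])

# Render the figure so the tick labels are populated
fig_d.canvas.draw()

tick_positions = [int(t) for t in ax_d.get_xticks()]
tick_texts = [label.get_text() for label in ax_d.get_xticklabels()]
print(tick_positions)
print(tick_texts)
```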

It's actually not necessary to convert the Series to a DataFrame. matplotlib uses the index values as the x-values in the plot. We pass the original Series computed from the `value_counts` method to the `plot` method as its only argument to produce the same plot as above.

In [None]:
fig, ax = plt.subplots()
ax.plot(num_flights_series)
ax.set_title('Number of flights by airline');

### Plotting multiple lines on the same axes

You can plot as many distinct lines on an axes as you like by repeatedly calling the `plot` method. Let's return to creating lines with simple algebraic functions like we did in our first example. Below, three different sets of x and y values are created and then plotted.

In [None]:
fig, ax = plt.subplots()
x1 = np.arange(-5, 6)
y1 = x1 ** 2
x2 = np.arange(0, 10)
y2 =  - x2 + 6
x3 = np.arange(-6, 7)
y3 = .2 * (x3 + 5) * (x3 - 2) * (x3 - 4)
ax.plot(x1, y1)
ax.plot(x2, y2)
ax.plot(x3, y3);

Even though we did not assign the result of any of these calls to a variable, we can still access the lines with the `lines` attribute, which is available to all axes objects and returns a list of all the lines.

In [None]:
ax.lines

Notice how each of the lines above was plotted with a different color. This happened without any explicit setting of the `color` parameter. matplotlib is designed such that each new call to one of the plotting methods results in an object with a different color. This automatic selection of new colors is referred to as the **color cycle**.
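A short sketch of both points, using hypothetical names (`fig_e`, `ax_e`, `x_e`, `line_colors`): three unassigned calls to `plot` still leave three lines in `ax_e.lines`, and each one received its own color. This assumes a style whose color cycle has at least three colors, as the default does.

```python
import numpy as np
import matplotlib.pyplot as plt

fig_e, ax_e = plt.subplots()
x_e = np.arange(5)

# None of the return values are assigned to a variable
ax_e.plot(x_e, x_e)
ax_e.plot(x_e, x_e + 1)
ax_e.plot(x_e, x_e + 2)

# The lines are still reachable from the axes
line_colors = [ln.get_color() for ln in ax_e.lines]
print(len(ax_e.lines))
print(line_colors)
```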

## Color cycle

By default, the color cycle is a list of 10 different colors. Each time a new plotting method is called without explicitly setting a color, the next color in the color cycle is chosen. In the plot below, 15 lines are created without specifying a color. Using the `text` method, the cycle number and RGB string are placed to the right of the line. Notice that the colors begin repeating at cycle number 10.

In [None]:
fig, ax = plt.subplots(figsize=(5, 3))
x = [0, .5]
ax.set_xlim(0, 1)
ax.set_title('The default color cycle')
for i in range(15):
    y = [i, i]
    lines = ax.plot(x, y, lw=6)
    color = lines[0].get_color()
    ax.text(.55, i, s = f'cycle num {i} = {color}', fontsize=8, va='center')

The default color cycle is the **tab10** colormap, which comes directly from the data visualization company Tableau. Let's verify this by listing the RGB hexadecimal strings for each of the colors in tab10. We use the `to_hex` function from the `colors` module to convert each RGB float.

In [None]:
from matplotlib import colors, cm
print([colors.to_hex(cm.tab10(i)) for i in range(10)])
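Another way to check this, sketched below, is to read the cycle straight from the `axes.prop_cycle` entry of `plt.rcParams` and compare it to tab10. The block temporarily switches to matplotlib's built-in `'default'` style so that the result does not depend on a custom style sheet, and it imports the `colors` module under the alias `mcolors`. All variable names are hypothetical.

```python
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

# Use the built-in default style only inside this block
with plt.style.context('default'):
    cycle_colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
    fig_f, ax_f = plt.subplots()
    # Eleven lines: one more than the length of the default cycle
    eleven_lines = [ax_f.plot([0, 1], [i, i])[0] for i in range(11)]

cycle_hex = [mcolors.to_hex(c) for c in cycle_colors]
tab10_hex = [mcolors.to_hex(c) for c in plt.get_cmap('tab10').colors]
print(cycle_hex == tab10_hex)

# The eleventh line wraps around to the first color
wrapped = (mcolors.to_hex(eleven_lines[10].get_color())
           == mcolors.to_hex(eleven_lines[0].get_color()))
print(wrapped)
```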

### Changing the color cycle

The color cycle of an axes can be changed with the `set_prop_cycle` method by setting the `color` parameter to a sequence of color values. Here, we retrieve every 10th value of the first 120 colors from the 'gist_earth' colormap for a total of 12 colors. These 12 colors are set as the new color cycle for this axes. The same 15 lines from above are produced below with the new color cycle. The color for line 12 is the same as line 0. The new cycle applies only to this axes; any newly created axes starts with the default color cycle.

In [None]:
fig, ax = plt.subplots(figsize=(5, 3))
x = [0, .5]
new_cycle = cm.gist_earth(range(0, 120, 10))
ax.set_prop_cycle(color=new_cycle)
ax.set_xlim(0, 1)
ax.set_title('A custom color cycle')
for i in range(15):
    y = [i, i]
    lines = ax.plot(x, y, lw=6)
    color = colors.to_hex(lines[0].get_color())
    ax.text(.55, i, s = f'cycle num {i} = {color}', fontsize=8, va='center')
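The wrap-around is easier to see with a very short cycle. In this sketch, a three-color cycle of named colors (chosen arbitrarily for illustration) is set on a new axes and four lines are drawn, so the fourth line reuses the first color. The names `fig_g`, `ax_g`, and `cycled_hex` are hypothetical.

```python
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors

fig_g, ax_g = plt.subplots()

# A three-color cycle for this axes only
ax_g.set_prop_cycle(color=['red', 'green', 'blue'])

# Four lines, one more than the length of the cycle
for i in range(4):
    ax_g.plot([0, 1], [i, i], lw=6)

cycled_hex = [mcolors.to_hex(ln.get_color()) for ln in ax_g.lines]
print(cycled_hex)
```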

## More line plots

Now that we understand the color cycle, let's get back to plotting multiple lines on the same axes. Using the flights dataset, let's find the number of flights every month for each of the top five busiest origin airports. We begin by creating a new column containing the first three letters of each month.

In [None]:
flights['month_name'] = flights['date'].dt.month_name().str[:3]
flights['month_name'].head(3)

To help pandas sort the data appropriately, we can convert this column to an ordered categorical. First, we'll need the unique values of the months in order. Since the data is already ordered by date, the `drop_duplicates` method should return the sequence we want.

In [None]:
months = flights['month_name'].drop_duplicates()
months

We can now make the conversion.

In [None]:
cat_dtype = pd.CategoricalDtype(categories=months, ordered=True)
flights['month_name'] = flights['month_name'].astype(cat_dtype)
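To see why the ordered categorical matters, the sketch below sorts a small made-up Series of month abbreviations twice: once as plain strings, which sorts alphabetically, and once after converting to an ordered categorical, which sorts in calendar order. The names `demo_months`, `month_order`, `demo_dtype`, `plain_sorted`, and `cat_sorted` are hypothetical.

```python
import pandas as pd

# Invented data, already in date order like the flights dataset
demo_months = pd.Series(['Jan', 'Jan', 'Feb', 'Feb', 'Mar'])

# Unique values in order of first appearance
month_order = demo_months.drop_duplicates().tolist()

# Plain strings sort alphabetically
plain_sorted = demo_months.sort_values().tolist()

# Ordered categoricals sort by the order of their categories
demo_dtype = pd.CategoricalDtype(categories=month_order, ordered=True)
cat_sorted = demo_months.astype(demo_dtype).sort_values().tolist()

print(plain_sorted)
print(cat_sorted)
```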

The top five origin airports are now found.

In [None]:
top_5_origin = flights['origin'].value_counts().index[:5]
top_5_origin

We use the `query` method to filter our data to just these five airports and see that over 40,000 rows have been filtered out.

In [None]:
flights_busy = flights.query('origin in @top_5_origin')
len(flights) - len(flights_busy)

We can now use the `crosstab` function to count all the flights for each month for every origin airport. Because the month names are ordered categoricals, they will remain in that order in the index.

In [None]:
origin_flight_counts = pd.crosstab(index=flights_busy['month_name'], 
                                   columns=flights_busy['origin'])
origin_flight_counts.head()
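Here is the same counting step on a five-row made-up DataFrame, `demo_flights`, where the result is small enough to check by hand. The month column is an ordered categorical, so January stays ahead of February in the index. All names and values are invented for illustration.

```python
import pandas as pd

# Invented data: one row per flight
demo_flights = pd.DataFrame({
    'month_name': ['Jan', 'Jan', 'Feb', 'Feb', 'Feb'],
    'origin': ['ATL', 'BOS', 'ATL', 'ATL', 'BOS'],
})

# Ordered categorical keeps calendar order in the crosstab index
month_dtype = pd.CategoricalDtype(categories=['Jan', 'Feb'], ordered=True)
demo_flights['month_name'] = demo_flights['month_name'].astype(month_dtype)

# Count flights for each month and origin combination
demo_counts = pd.crosstab(index=demo_flights['month_name'],
                          columns=demo_flights['origin'])
print(demo_counts)
```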

Let's plot the first two columns as line plots by calling the `plot` method twice, passing it each column as a Series. 

In [None]:
fig, ax = plt.subplots()
atl = origin_flight_counts['ATL']
bos = origin_flight_counts['BOS']
ax.plot(atl)
ax.plot(bos)
ax.set_title('Flights originating from ATL and BOS');

Each unique axes keeps track of where it is in the color cycle. The first plot on each axes is given the first color of the cycle. Let's add the next origin airport to the plot, which uses the next color in the cycle.

In [None]:
lax = origin_flight_counts['LAX']
ax.plot(lax)
ax.set_title('Flights originating from ATL, BOS, and LAX')
fig

### Plotting DataFrames

Instead of plotting one column at a time, you can pass the entire DataFrame to the `plot` method. Each column will be plotted as a separate line and use the index as its x-values.

In [None]:
fig, ax = plt.subplots()
ax.plot(origin_flight_counts)
ax.set_title('Flights from top 5 origin airports for 2018');
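A minimal check of this behavior, using a two-column made-up DataFrame named `demo_wide` with a numeric index: the call returns one line per column, and each line's x-values come from the index. The names and numbers are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented counts with a simple numeric index
demo_wide = pd.DataFrame({'ATL': [10, 12, 11], 'BOS': [5, 6, 7]},
                         index=[1, 2, 3])

fig_j, ax_j = plt.subplots()
wide_lines = ax_j.plot(demo_wide)

# One Line2D per column
print(len(wide_lines))

# The first line holds the index as x and the first column as y
print(list(wide_lines[0].get_xdata()), list(wide_lines[0].get_ydata()))
```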

## Adding a legend

It's impossible to determine the origin airport of each line just by looking at the plot. Legends are an important addition allowing you to label each object on the plot. matplotlib grants its users lots of power to create the exact legends they desire. Unfortunately, this extensive power is often difficult to use for beginners. In this section, we begin by creating legends with a simple, straightforward approach, slowly adding more complexity.

### Simple legend with `label` parameter

Every plotting method has a `label` parameter that can be set to a string to identify the particular objects produced from that method. Calling the `legend` axes method creates a legend with the string used as the `label`. Let's make the same three calls to the `plot` method as we did above, but use the `label` parameter to identify each line before calling the `legend` method with no arguments.

In [None]:
fig, ax = plt.subplots()
ax.plot(atl, label='ATL')
ax.plot(bos, label='BOS')
ax.plot(lax, label='LAX')
ax.legend();

Setting the `label` isn't enough for the legend to appear. You must call the `legend` method if you want to have a legend. Any object whose `label` has not been set will not appear in the legend.
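The sketch below illustrates this with two lines, only one of which is given a label. The legend returned by the `legend` method contains a single entry. The names `fig_k`, `ax_k`, `legend_k`, and `legend_texts` are hypothetical.

```python
import matplotlib.pyplot as plt

fig_k, ax_k = plt.subplots()
ax_k.plot([0, 1], [0, 1], label='labeled line')
ax_k.plot([0, 1], [1, 0])  # no label, so no legend entry

legend_k = ax_k.legend()

# Collect the text of every entry in the legend
legend_texts = [t.get_text() for t in legend_k.get_texts()]
print(legend_texts)
```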

### Setting the labels in the `legend` method with a list

The `legend` method has a huge number of parameters that can be set to control its appearance. It's not strictly necessary to set the `label` parameter within the plotting method. Alternatively, you can set the `labels` parameter of the `legend` method to a list of strings.

In [None]:
fig, ax = plt.subplots()
ax.plot(atl)
ax.plot(bos)
ax.plot(lax)
labels=['Atlanta', 'Boston', 'Los Angeles']
ax.legend(labels=labels);

While this works, I prefer using the `label` parameter. When passing a list to `labels`, you have to make sure that the labels are listed in the same order as the lines they reference, so it's not quite as explicit as labeling each line in its own call to `plot`.

### Labeling patches in the legend

All patch constructors have the same `label` parameter that can be used to identify it in the legend. A circle and triangle are added to the axes with a legend to identify them.

In [None]:
from matplotlib import patches
fig, ax = plt.subplots()
ax.set_aspect('equal')
p1 = patches.Circle((.3, .4), radius=.2, color='green', label='green circle')
p2 = patches.Polygon([[.6, .1], [.7, .4], [.9, .3]], 
                     color='purple', label='purple triangle')
ax.add_patch(p1)
ax.add_patch(p2)
ax.legend();

### Legends when plotting DataFrames

We plotted all five origin airports with a single call to the `plot` method when passing it the DataFrame `origin_flight_counts` above. If we want to add a legend with a label for each line, we can't proceed as before, since there is only a single call to `plot`, which has a single `label` parameter.

You might think that setting the `label` parameter to a list would allow you to label each line individually. Both Series and Index objects have a `tolist` method to make the conversion to a list, which we use on the column names below. Unfortunately, with the matplotlib version used here, this does not work: the label of each line is shown as the entire list.

In [None]:
fig, ax = plt.subplots()
origin_cols = origin_flight_counts.columns.tolist()
ax.plot(origin_flight_counts, label=origin_cols)
ax.legend();

### Legend entries - handle, label pairs

Before showing how to properly label the legend when plotting DataFrames, we'll need to cover more specifics of the legend itself. Every legend is composed of one or more **entries**. Each entry is composed of a **handle**, the colored marker on the left, and the **label**, the text to the right.

For all of our examples thus far, matplotlib has created the legend handles for us, and we've provided the labels as strings. You can actually provide both the handles and the labels yourself. Each handle must be a plotting object, but it doesn't have to be part of the plot or have anything to do with the data.

Here, three lines are added to the axes with the `plot` method and given a label. Three different patches that have nothing to do with the data, and are not added to the axes, are created and placed in a list. This list is passed to the `legend` method as the handles along with a list of strings for the labels. The handles and labels provided to the `legend` method override any handles and labels automatically generated from the plotting objects.

In [None]:
fig, ax = plt.subplots()
x = np.arange(10)
ax.plot(x, x + 2, label='Line 1')
ax.plot(x, x + 5, label='Line 2')
ax.plot(x, x + 8, label='Line 3')

# patches that are NOT added to the plot
p1 = patches.Circle((0, 0), radius=10, color='.2')
p2 = patches.Rectangle((5, -15), width=3, height=2, color='.5')
p3 = patches.Polygon([[100, 50], [30, 40], [500, -300]], color='.8')
handles = [p1, p2, p3]
labels = ['orange', 'banana', 'mango']
ax.legend(handles=handles, labels=labels);

In order to create the legend properly when plotting multiple lines using a DataFrame, you'll need to assign the list of lines returned from the `plot` method to a variable and use them as the handles in the `legend` method. The list of column names can be used as the labels.

In [None]:
fig, ax = plt.subplots()
lines = ax.plot(origin_flight_counts)
ax.legend(handles=lines, labels=origin_cols);

### Moving the legend

By default, matplotlib attempts to put the legend in the best possible location within the axes. You can specify the location of the legend by setting the `loc` parameter to a string combining a vertical (`'upper'`, `'center'`, `'lower'`) and horizontal (`'left'`, `'center'`, `'right'`) position. The string `'best'` is also a choice and is the default. Let's move our legend to the upper left corner of the plot. Instead of creating a new figure, we'll just call the `legend` method again, which replaces the current legend.

In [None]:
ax.legend(handles=lines, labels=origin_cols, loc='upper left')
fig

You can also place the legend in a specific location by setting the `bbox_to_anchor` parameter to a two-item numeric tuple. This tuple defines coordinates where the legend will be 'anchored'. The coordinates to define this point are relative to the axes where 0 is the beginning of the x and y axis and 1 is the end of the x and y axis. The coordinate (0, 0) corresponds to the lower left corner of the plot, with (.5, .5) corresponding to the center. These units have nothing to do with the actual data. 

It's possible to place the legend outside of the plotting surface by specifying an x or y coordinate outside of the range 0 to 1. Below, we anchor the upper left corner of the legend to the point (1.03, .75) which is just to the right of the axes.

In [None]:
ax.legend(handles=lines, labels=origin_cols, loc='upper left', bbox_to_anchor=(1.03, .75))
fig

Several more legend properties exist and are set below. Control the number of columns in the legend with `ncol` and the relative length of each marker with `handlelength` (which has a default value of 2). The other parameters should be intuitive to understand. [Visit the documentation][0] to learn about all of the legend parameters.

[0]: https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html

In [None]:
ax.legend(handles=lines, labels=origin_cols, loc='upper left', 
          bbox_to_anchor=(1.03, .75), title='Origin Airport', ncol=2, 
          facecolor='.8', handlelength=3, edgecolor='black', shadow=True)
fig

### Accessing the legend

The legend object can be accessed by assigning the result of the call to the `legend` method to a variable or from the `legend_` axes attribute.

In [None]:
legend = ax.legend_
type(legend)

We've seen that other groups of plotting objects are accessed with attributes that end in 's' such as `lines`, `collections`, `patches`, and `texts`. matplotlib only allows a single legend per axes, therefore the name `legends` wouldn't quite work, so the developers chose `legend_` instead. As usual, all of the properties can be retrieved and changed with the getter and setter methods.
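As a small sketch of both access routes, the cell below keeps the object returned by `legend`, confirms that it is the same object stored in `legend_`, and then changes the legend title with a setter. The names `fig_m`, `ax_m`, `returned_legend`, and the title text are made up for this example.

```python
import matplotlib.pyplot as plt

fig_m, ax_m = plt.subplots()
ax_m.plot([0, 1], [0, 1], label='Line 1')

# Keep the legend object returned by the method
returned_legend = ax_m.legend(loc='upper left')

# The axes attribute refers to the very same object
same_object = returned_legend is ax_m.legend_
print(same_object)

# Change a property through the attribute and read it back
ax_m.legend_.set_title('Demo title')
title_text = returned_legend.get_title().get_text()
print(title_text)
```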

## Exercises

### Exercise 1

<span style="color:green; font-size:16px">Create line plots of the average sale price per every 100 square feet of living area, garage area, and basement area. Place markers at every point and add a legend.</span>

### Exercise 2 

<span style="color:green; font-size:16px">For every neighborhood that has at least 100 homes, find the average sale price by each overall quality between 3 and 8 (inclusive). Plot each neighborhood as a line with overall quality on the x-axis and sale price on the y-axis. Use one of the [qualitative colormaps][0] other than the default tab10. Add a legend inside the bounds of the axes that has a frame, a title, and all labels on one row.</span>

[0]: https://matplotlib.org/tutorials/colors/colormaps.html