# Matplotlib Data Plotting

In previous notebooks, we created Figure and Axes objects, and proceeded to change their properties without plotting any actual data. In this notebook, we will learn how to make basic line and scatter plots.

## The Axes API
The [matplotlib documentation][1] has a nice layout of the Axes API. There are around 300 different calls you can make with an Axes object. The API page groups the methods by functionality. Roughly the first third of the categories in the API are used to create plots.

The simplest and most common plots are found in the Basics category and include `plot`, `scatter`, `bar`, `pie`, and others.

[1]: https://matplotlib.org/api/axes_api.html

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## The `plot` method - Creates line plots
The `plot` method's primary purpose is to create line plots. It does have the ability to create scatter plots as well, but that task is best reserved for `scatter`.

### Plotting 2D Data
The `plot` method is very flexible and can take a variety of different inputs. The following teaches a straightforward and consistent approach that is explicit and easy to read.

The first two arguments to the `plot` method can be the x and y coordinates of the data. Below, we use numpy arrays to hold our data. We simply plot the square of the x value.

In [None]:
fig, ax = plt.subplots()
x = np.arange(-5, 6)
y = x ** 2
ax.plot(x, y)

### What was returned?
A list of `Line2D` objects was returned from our call to the `plot` method. The `plot` method can produce many lines in a single call, which is why it returns the results as a list.
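
As a minimal sketch of this behavior, the cell below draws two lines in one call and inspects the returned list. The names `lines` and `first_line` are illustrative and not part of the original notebook.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.arange(-5, 6)

# A single call that draws two lines: (x, x ** 2) and (x, x ** 3)
lines = ax.plot(x, x ** 2, x, x ** 3)

print(type(lines))  # a plain Python list
print(len(lines))   # one Line2D object per plotted line

# Keeping a reference to a line lets us restyle it after plotting
first_line = lines[0]
first_line.set_linewidth(3)
```

Holding on to the returned objects is optional, but it is handy when a line needs to be changed after it has been drawn.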

## Formatting the line
The line can be formatted using many different parameters. Please see the documentation for the [Line object][1]. All of the possible parameters are available on that page. The most common parameters are listed below with a short description.

* `alpha` - opacity of the line - float between 0 and 1 where 0 is completely transparent and 1 is completely opaque
* `color` or `c` - color of line - see color section below
* `label` - string label for legend
* `linestyle` or `ls` - style of line - possible options are '-', '--', '-.', ':'
* `linewidth` or `lw` - width of line as a float
* `marker` - style of marker - see marker section below
* `markeredgecolor` or `mec` - edge color of marker - see color section below
* `markeredgewidth` or `mew` - width of marker edge as a float
* `markerfacecolor` or `mfc` - face color of marker - see color section below
* `markersize` or `ms` - size of marker as a float


[1]: https://matplotlib.org/api/_as_gen/matplotlib.lines.Line2D.html#matplotlib.lines.Line2D
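
The sketch below combines several of the parameters from the list, using both the long names and the short aliases. The data and the label text are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = np.arange(-5, 6)

# Short aliases: lw = linewidth, ls = linestyle
line_sq, = ax.plot(x, x ** 2, lw=3, ls=':', alpha=0.5, label='x squared')

# The long parameter names do exactly the same job
line_abs, = ax.plot(x, 5 * np.abs(x), linewidth=1.5, linestyle='-.', label='5 |x|')

# The label strings only become visible once a legend is requested
legend = ax.legend()
```

The trailing comma in `line_sq, = ...` unpacks the single `Line2D` object from the returned list.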

### Changing properties of our line
Use the documentation above for details on how to change properties of a line. Let's begin by changing the line style.

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, linestyle='--')

## Matplotlib Colors

There are many possible ways to identify a color in matplotlib. Read the [color documentation][1] to see all the ways to specify a color.

* an RGB or RGBA tuple of float values in [0, 1] (e.g., (0.1, 0.2, 0.5) or (0.1, 0.2, 0.5, 0.3)). RGBA is short for Red, Green, Blue, Alpha, where Alpha represents the opacity
* a hex RGB or RGBA string (e.g., '#0F0F0F' or '#0F0F0F0F')
* a string representation of a float value in [0, 1] for gray level (e.g., '0.5')
* one of {'b', 'g', 'r', 'c', 'm', 'y', 'k', 'w'} - **I don't use these because they are confusing and not explicit**
* an X11/CSS4 color name - **I do use these**
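
To see that these notations all resolve to the same kind of value, the sketch below converts one example of each with `matplotlib.colors.to_rgba`. The list `specs` is an illustrative name, not something from the original notebook.

```python
import matplotlib.colors as mcolors

# One example of each notation from the list above
specs = [(0.1, 0.2, 0.5), '#0F0F0F', '0.5', 'k', 'saddlebrown']

for spec in specs:
    # to_rgba converts any valid color specification to an RGBA tuple of floats
    print(spec, '->', mcolors.to_rgba(spec))

# to_hex goes the other way and returns a hex string
print(mcolors.to_hex('saddlebrown'))
```

Any of these forms can be passed to the `color` parameter of `plot`.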

### Web Colors
You can use any of the following named colors, which are the same ones available to web developers:
![][2]

[1]: https://matplotlib.org/tutorials/colors/colors.html#sphx-glr-tutorials-colors-colors-py
[2]: images/named_colors.png

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, linestyle='--', color='saddlebrown', linewidth=4)

## Markers
There are a few dozen [styles for markers][1]. These are plotted on every point. Set the `marker` parameter to the string that references the marker you want. Below, we use several more parameters to change the size and color of the marker.

[1]: https://matplotlib.org/api/markers_api.html

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, color='darkred', linestyle='--', marker='s', markersize=12,
        markerfacecolor='gold', markeredgecolor='navy', markeredgewidth=4)

### Grayscale

Use a **string** with a number between 0 and 1 for grayscale.

In [None]:
fig, ax = plt.subplots()
ax.plot(x, y, color='.7')

## Integration with Pandas - plotting real data
Matplotlib makes it simple to create plots when our data is in a DataFrame. Let's begin by reading in the flights data.

In [None]:
pd.options.display.max_columns = 100
flights = pd.read_csv('../data/flights.csv', parse_dates=['date'])
flights.head(3)

### Average carrier delay per departure hour

Let's run a calculation before plotting, such as finding the average carrier delay for each departure hour. First, we'll round down each departure time to the nearest hour by creating the column `dep_hour`.

In [None]:
flights['dep_hour'] = flights['dep_time'] // 100
flights['dep_hour'].head()
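
The integer division works on the assumption that `dep_time` stores times as integers in HHMM form (for example, 1745 for 5:45 pm). A small sketch with made-up values:

```python
import pandas as pd

# Hypothetical departure times in HHMM form: 00:05, 09:59, 17:45, 23:59
dep_time = pd.Series([5, 959, 1745, 2359])

# Floor division by 100 drops the two minute digits and keeps the hour
dep_hour = dep_time // 100
print(dep_hour.tolist())  # [0, 9, 17, 23]
```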

We use this new column to calculate the average carrier delay per departure hour.

In [None]:
avg_cd = flights.groupby('dep_hour').agg(average_carrier_delay=('carrier_delay', 'mean'))
avg_cd = avg_cd.reset_index()
avg_cd.head()
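
The `agg` call above uses named aggregation: the keyword becomes the new column name, and the tuple gives the source column and the aggregating function. Here is the same pattern on a tiny made-up DataFrame (`toy` and its values are illustrative only):

```python
import pandas as pd

toy = pd.DataFrame({'dep_hour': [6, 6, 7, 7, 7],
                    'carrier_delay': [10, 20, 0, 30, 60]})

# new_column_name=(source_column, aggregation_function)
avg = toy.groupby('dep_hour').agg(average_carrier_delay=('carrier_delay', 'mean'))

# reset_index turns the group labels back into an ordinary column
avg = avg.reset_index()
print(avg)
```

Resetting the index matters for plotting with column names, since `dep_hour` has to be a regular column to be looked up by name.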

### Make a line plot with a DataFrame 

Matplotlib simplifies the process by providing a `data` parameter. Set it equal to the DataFrame itself (here, `avg_cd`), and pass the column names as strings as the first two arguments to the `plot` method. Matplotlib then looks up those columns in the DataFrame.

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot('dep_hour', 'average_carrier_delay', data=avg_cd)
ax.set_xlabel('Departure Hour', fontsize=15)
ax.set_title('Average Carrier Delay by Departure Hour', fontsize=20, color='tomato');
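
As a rough illustration of what the `data` parameter does, the sketch below plots the same made-up DataFrame twice: once with column names plus `data`, and once by passing the columns directly. The DataFrame `df` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'hour': [0, 1, 2, 3],
                   'delay': [4.0, 2.5, 3.0, 6.5]})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Column names as strings, looked up in `data`
line_named, = ax1.plot('hour', 'delay', data=df)

# The same plot with the columns passed explicitly
line_direct, = ax2.plot(df['hour'], df['delay'])

# Both lines hold identical coordinates
same_x = np.array_equal(np.asarray(line_named.get_xdata()),
                        np.asarray(line_direct.get_xdata()))
same_y = np.array_equal(np.asarray(line_named.get_ydata()),
                        np.asarray(line_direct.get_ydata()))
print(same_x, same_y)
```

The string form is mostly a readability choice; it keeps the call short when column names are long or reused.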

## Most Common Plots
Visit the [Axes API][1] to see the most common plotting methods.

[1]: https://matplotlib.org/api/axes_api.html

## Univariate Analysis

These are the primary plots that you will make from your Axes. We just plotted a line with the `plot` method in our above example. Let's see a few more plots in action.

### Boxplots

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(x='air_time', data=flights.dropna());

Set the `vert` parameter to `False` to make a horizontal box plot.

In [None]:
fig, ax = plt.subplots(figsize=(10, 5))
ax.boxplot(x='air_time', data=flights.dropna(), vert=False);
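
One thing to keep in mind: calling `dropna` on the whole DataFrame removes every row that has a missing value in any column, not just in `air_time`. The sketch below, on made-up data, compares that with dropping missing values from the single column being plotted. The names `toy`, `all_cols`, and `one_col` are illustrative.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

toy = pd.DataFrame({'air_time': [50, 60, 65, 70, 300, np.nan],
                    'other': [1, np.nan, 3, np.nan, 5, 6]})

# Drops rows with a missing value in ANY column
all_cols = toy.dropna()

# Drops only the rows where air_time itself is missing
one_col = toy['air_time'].dropna()

print(len(all_cols), len(one_col))

fig, ax = plt.subplots(figsize=(10, 5))
# boxplot returns a dictionary of the artists it created
result = ax.boxplot(one_col.to_numpy())
print(sorted(result.keys()))
```

Whether the stricter or the narrower filter is appropriate depends on the analysis; the point is only that the two can leave different numbers of rows.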

### Plotting dates

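The **`plot_date`** method creates a line or scatter plot with dates on the x-axis. Note that `plot_date` has been deprecated since Matplotlib 3.9; the regular `plot` method handles dates directly.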

### Read in the bikes dataset

In [None]:
bikes = pd.read_csv('../data/bikes.csv', parse_dates=['starttime', 'stoptime'])
bikes.head(3)

### Examine relationship between date and temperature

In [None]:
bikes['temperature'].describe()

In [None]:
bikes['temperature'].sort_values().head()

Remove bad temperature data and sample 2% of the bikes dataset, which will help keep the number of plotted points from overwhelming the graph.

In [None]:
bikes = bikes[bikes['temperature'] > -10]
bikes2 = bikes.sample(frac=.02)

Call the `plot_date` method.

In [None]:
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot_date('starttime', 'temperature', data=bikes2)
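
Since `plot_date` is deprecated in recent Matplotlib releases (3.9 and later), here is a sketch of a similar marker-only plot made with the regular `plot` method. The DataFrame `toy` is made up, and the `'o'` format string asks for circle markers with no connecting line.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the sampled bikes data
toy = pd.DataFrame({
    'starttime': pd.date_range('2020-01-01', periods=5, freq='D'),
    'temperature': [30.0, 28.5, 35.0, 41.0, 38.5],
})

fig, ax = plt.subplots(figsize=(12, 6))

# x column, y column, format string, then the data source
lines = ax.plot('starttime', 'temperature', 'o', data=toy)

# Rotate the date tick labels so they do not overlap
fig.autofmt_xdate()
```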

## Plotting the Number of Riders per Day
Let's find the number of riders each day. We need to group by each day.

In [None]:
temperature_count = bikes.resample('D', on='starttime').size()
temperature_count = temperature_count.reset_index()
temperature_count.head()

In [None]:
temperature_count.columns = ['starttime', 'count']
temperature_count.head()
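
To make the daily grouping concrete, here is a sketch on three made-up rides. It also shows `reset_index(name=...)`, which names the count column in the same step instead of assigning to `columns` afterwards. The `rides` DataFrame and the `tripduration` values are illustrative.

```python
import pandas as pd

rides = pd.DataFrame({
    'starttime': pd.to_datetime(['2020-01-01 08:00',
                                 '2020-01-01 17:30',
                                 '2020-01-03 09:15']),
    'tripduration': [300, 420, 610],
})

# One bin per calendar day; size() counts the rows that fall in each bin
daily = rides.resample('D', on='starttime').size()

# Turn the result into a two-column DataFrame and name the counts
daily = daily.reset_index(name='count')
print(daily)
```

Days without any rides (January 2 here) still appear, with a count of 0.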

In [None]:
fig, ax = plt.subplots(figsize=(12, 5))
ax.plot_date('starttime', 'count', data=temperature_count)
ax.set_title('Rider Count per Day', fontsize=30)

## Scatterplots
Although we have already created some scatterplots above, Matplotlib has a dedicated **`scatter`** method that allows you to set both the color and size of each point individually based on the value of a different variable.

In [None]:
housing = pd.read_csv('../data/housing.csv')
housing.head()

In [None]:
housing.shape

### Take a sample of the data
Although matplotlib can handle several thousand plotted points, our scatter plot would be too crowded if we plotted all of them. Let's use the same `sample` method as before to select a random subset of rows.

In [None]:
housing_sample = housing.sample(200)

### Use OverallQual as the size of each point
We will size our points based on the square of `OverallQual`, whose values are integers between 0 and 10.

In [None]:
housing_sample['OverallQual2'] = housing_sample['OverallQual'] ** 2

In [None]:
fig, ax = plt.subplots(figsize=(14, 10))
ax.scatter('GrLivArea', 'SalePrice', s='OverallQual2', c='BedroomAbvGr', data=housing_sample)
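
The `s` parameter is interpreted as the marker area in points squared, and numeric values passed to `c` are mapped to colors through a colormap. The sketch below repeats the pattern on randomly generated data and adds a colorbar, which is one way to show what the colors stand for. All column names and values in `toy` are made up.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
toy = pd.DataFrame({
    'area': rng.uniform(500, 3000, 50),
    'quality': rng.integers(1, 11, 50),
    'bedrooms': rng.integers(1, 5, 50),
})
toy['price'] = 100 * toy['area'] + 5000 * toy['quality']
toy['point_size'] = toy['quality'] ** 2  # squaring exaggerates the differences

fig, ax = plt.subplots(figsize=(14, 10))
points = ax.scatter('area', 'price', s='point_size', c='bedrooms', data=toy)

# The colorbar shows how the numbers in `c` map to colors
cbar = fig.colorbar(points, ax=ax, label='bedrooms')
```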

## Creating a Legend
In our above plot it is impossible to determine what the colors mean. No legend is present to inform us of the number of bedrooms. Unfortunately, this is not a straightforward task in Matplotlib. We must plot each group that we want in the legend with a separate call to the `scatter` method. Then we can use the `label` parameter to name each legend entry.

### Boolean Selection
We must use a loop to select the data for each unique number of bedrooms.

In [None]:
housing['BedroomAbvGr'].unique()

In [None]:
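fig, ax = plt.subplots(figsize=(14, 6))
colors = ['red', 'orange', 'yellow', 'green', 'blue', 'indigo', 'violet', 'black']
# Loop over the bedroom counts actually present in the data, pairing each with a color.
# zip stops at the shorter sequence, so add colors if there are more than eight groups.
bedroom_counts = sorted(housing['BedroomAbvGr'].unique())
for n, color in zip(bedroom_counts, colors):
    filt = housing['BedroomAbvGr'] == n
    housing_temp = housing[filt]
    ax.scatter('GrLivArea', 'SalePrice', s='OverallQual', color=color, data=housing_temp, label=n)
ax.legend();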
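
As an alternative to looping, Matplotlib 3.1 and later provide a `legend_elements` method on the object returned by `scatter`, which builds legend handles from the values passed to `c`. A minimal sketch on made-up data (the arrays and the `group` title are illustrative):

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5, 6])
y = np.array([2, 4, 1, 5, 3, 6])
groups = np.array([1, 2, 3, 1, 2, 3])

fig, ax = plt.subplots()
points = ax.scatter(x, y, c=groups)

# One handle and one label per distinct value in `c`
handles, labels = points.legend_elements()
legend = ax.legend(handles, labels, title='group')
```

This keeps everything in a single `scatter` call, at the cost of less control over the color assigned to each group than the explicit loop gives.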

## Exercises

### Exercise 1
<span  style="color:green; font-size:16px">Create a Figure with three Axes using `plt.subplots`. Use `np.linspace` to create a one-dimensional array of data from -5 to 5 of length 100 and store it in `x`. In each of the three Axes, apply a different mathematical function to `x` to create the `y` values and draw them as a line plot. For instance, you can take the square root of `x`.</span>

### Exercise 2
<span  style="color:green; font-size:16px">Use `np.random.rand` to create two arrays, `x` and `y`, that each have a length of 100. Make a scatter plot of the data. Make the size of the markers proportional to the ratio of y to x. Make the color proportional to y. For scatter plots, use the parameter `c` to control the color with a number. The parameter `s` controls the size. Set the title as well.</span>

### Exercise 3
<span  style="color:green; font-size:16px">Read in the college dataset and set the index to the institution name. Complete the following tasks:</span>

<span  style="color:green; font-size:16px">
    
* Convert the median earnings within 10 years (MD_EARN_WNE_P10) column to numeric
* Create a column for the total SAT score
* Select just the columns for SAT total, UGDS, RELAFFIL and MD_EARN_WNE_P10 into another DataFrame called `college_samp`. Continue with this DataFrame for the rest of the Exercise.
* Drop any rows with missing values
* Randomly sample 10% of the DataFrame and assign it back to itself.
* Call the `map` method on the `RELAFFIL` column. Pass it a dictionary to convert the values to color names. Assign the result to the column `color`
* Take the square root of the UGDS column and assign it to the column `size`. 
* Create a scatterplot of the total SAT scores vs the MD_EARN_WNE_P10 column. Color and size each point with their respective columns.
* Extra Credit: Annotate the school with the largest population as it is done [in this example](https://matplotlib.org/users/annotations.html)</span>

### Exercise 4
<span  style="color:green; font-size:16px">Read in the employee dataset and select the `salary` column as a Series, drop the missing values, and assign it to a variable. Read about the `pd.cut` function and create categories that span 25k from 0 to 300k. Save this result as a Series and find the frequency of each category. Then take that result and create a `pie` chart with labels.</span>