# Week 04: Matplotlib

This week's learning goals are as follows:

1. Understand Matplotlib's graphing paradigm.
1. Apply advanced labeling and customizations.
1. Use ```boxplot```.
1. Analyze data using Matplotlib's ```bar``` and ```hist``` and NumPy's ```histogram``` commands.


In [None]:
# the following code guarantees you'll properly reload any modules that you custom-defined in your environment.
# you don't need to understand it.
# just run this once at the beginning.
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
import os
import sys
import numpy as np
import csv
import matplotlib as mpl
import matplotlib.pyplot as plt

This notebook uses the Kaggle Dataset [Pokemon with stats](https://www.kaggle.com/abcsds/pokemon/data). Download and move the csv into ```04_matplotlib/csvs```.

For this notebook, I have defined a set of util functions for working with our Pokemon data. You can simply load the following two code blocks to see how the data is formatted:

In [None]:
# load the pokemon dataset in
from pokemon_util import *
pokemon_fpath = os.path.join('csvs', 'pokemon.csv')
poke_headers, poke_types, poke_dict = load_pokemon(pokemon_fpath) # get the dictionary
poke_array, poke_np_lookup = poke_array(poke_dict, poke_types) # convert into numpy array

In [None]:
print(poke_headers)
print(poke_types) # pokemon types, sorted
print(poke_dict['Bulbasaur']) # dictionary format
print(poke_np_lookup[0], poke_array[0,:]) # np array format

## 1. Matplotlib's graphing paradigm

Matplotlib's ```pyplot``` interface is modeled on Matlab's plotting commands and brings them to Python, where they work directly with NumPy arrays.

```pyplot``` is the main interface to the matplotlib library. Instead of calling it directly, the convention is to import it as ```plt```:

    import matplotlib.pyplot as plt
    
In Jupyter Notebook, ```plt``` defaults to plot one figure per cell block. You do not have to create the figure; as soon as you start plotting, ```plt``` will handle the figure creation.

This first block shows the most common function, ```plot(x_args, y_args)``` which plots a line graph. Notice how each call to ```plot``` plots a new series.

In [None]:
X = np.linspace(-np.pi, np.pi, 80, endpoint=True)
C,S = np.cos(X), np.sin(X)
plt.plot(X,C)
plt.plot(X,S)
plt.show()

In [None]:
# also notice how a separate block to plot does not plot anything
plt.show()

The second most common function is ```scatter```, which plots a scatter plot.

In [None]:
plt.scatter(X, C)
plt.scatter(X,S)
plt.show()

For any plotting function that takes in two arguments ```xs``` and ```ys``` and plots each element of ```ys``` versus each element of ```xs```, it is essential that ```xs.shape == ys.shape```. That being said, you can plot multiple things at a time provided they are the same shape.

The below code shows how you would plot both the sine and cosine waves with a single scatter plot. Notice that Matplotlib does not auto-change colors, as it still considers it a single series.

In [None]:
#plt.scatter(X, [C,S]) # won't work because X is not the same size
X_tile = np.tile(np.expand_dims(X, axis=0), (2,1))
plt.scatter(X_tile, [C,S])
plt.show()
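
If you do want a separate color per series, one option is to hand ```plot``` a 2-D array of y values. The sketch below is self-contained (it rebuilds ```X```, ```C```, and ```S```) and relies on ```plot``` treating each column of a 2-D ```y``` as its own series; the names ```Y``` and ```lines``` are introduced here only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-np.pi, np.pi, 80, endpoint=True)
C, S = np.cos(X), np.sin(X)

# stack the two series as columns: Y has shape (80, 2)
Y = np.column_stack([C, S])

plt.figure()
# one call, two lines: each column of Y is plotted against X
lines = plt.plot(X, Y)
plt.legend(lines, ['cosine', 'sine'])
```

Unlike the tiled ```scatter``` call, this produces two line objects, so Matplotlib cycles colors between them as usual.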

### Legends and labels

Graphs are often meaningless without labels. The following code adds axis labels and a title to our sine and cosine graph:

In [None]:
def simple_plot():
    X = np.linspace(-np.pi, np.pi, 80, endpoint=True)
    C,S = np.cos(X), np.sin(X)
    plt.plot(X,C)
    plt.plot(X,S)
    plt.xlabel('x values')
    plt.ylabel('y values')
    plt.title('sine and cosine graphs')
    plt.show()
simple_plot()

We can also add a legend in two ways:

In [None]:
plt.plot(X,C)
plt.plot(X,S)
plt.legend(['cosine', 'sine']) # in order of how we plotted them
plt.show()

In [None]:
plt.plot(X,C, label='cosine')
plt.plot(X,S, label='sine')
plt.legend()
plt.show()

### ```fig``` and ```ax```

Behind the scenes, ```plt.plot``` and ```plt.scatter``` do two things if no figure exists yet:
* Create a figure object (usually called ```fig```), which refers to the entire plot window
* Create an axes object (usually called ```ax```), which is used to plot and add new things

So we can split up the ```plot``` command into a few parts:
1. ```plt.figure()``` create a new figure
1. ```ax = plt.gca()``` get the current set of axes for the current figure that ```plt``` is focusing on
1. ```ax.scatter(...)``` use the axis explicitly to plot

When we use ax explicitly like this, our commands for labeling slightly change:
* ```ax.set_xlabel('x values') # instead of plt.xlabel('x values')```
* ```ax.set_ylabel('y values') # instead of plt.ylabel(...)```
* ```ax.set_title('trig functions') # instead of plt.title(...)```


In [None]:
def fig_plot():
    X = np.linspace(-np.pi, np.pi, 80, endpoint=True)
    C,S = np.cos(X), np.sin(X)
    fig = plt.figure()
    ax = plt.gca()
    ax.plot(X,C, label='cosine')
    ax.plot(X,S, label='sine')
    ax.set_xlabel('x values')
    ax.set_ylabel('y values')
    ax.set_title('sine and cosine graphs')
    ax.legend()
    fig.legend() # notice how it puts a legend in a different place, because its graphics area is different
    plt.show()
    return fig
fig = fig_plot()

And now notice that since we have a handle ```fig``` to the plot window, we can carry it across notebook cells.

In [None]:
fig

Having a direct handle to ```fig``` is very useful if we want to save figures.

In [None]:
fig.savefig(os.path.join('images','waves.png'))
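
The figure-then-axes steps above can also be written as a single call. This is a minimal sketch, assuming you are happy with one set of axes per figure; it uses ```plt.subplots()```, and it saves into a temporary directory (an assumption of this sketch, with hypothetical names ```out_dir``` and ```out_path```) so that it does not touch your ```images``` folder.

```python
import os
import tempfile
import numpy as np
import matplotlib.pyplot as plt

X = np.linspace(-np.pi, np.pi, 80, endpoint=True)

# plt.subplots() creates the figure and a single set of axes in one call
fig, ax = plt.subplots()
ax.plot(X, np.cos(X), label='cosine')
ax.plot(X, np.sin(X), label='sine')
ax.legend()

# savefig does not create missing folders, so make sure the target exists first
out_dir = os.path.join(tempfile.gettempdir(), 'images')
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, 'waves_subplots.png')
fig.savefig(out_path, dpi=150, bbox_inches='tight')
```

The ```os.makedirs(..., exist_ok=True)``` line is worth copying if your own ```savefig``` call complains that the ```images``` directory does not exist.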

#### Programming exercises

Plot a Univariate Gaussian probability distribution

* Use ```np.linspace``` to generate x values from -5 to 5 with 100 points.
* Look at the [SciPy normal documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html) and use ```norm.pdf()``` with loc=0, scale=1 to produce the y array for your x values.
* Make sure to label your axes and title your plot.
    * x axis: x, or ```$x$``` for italics
    * y axis: Gaussian(x), or ```Gaussian($x$)``` for italics
    * title: Univariate Gaussian probability distribution (pdf)
* Save your figure as ```images/gaussian.pdf```.

In [None]:
from scipy.stats import norm

def plot_norm():
    fig = plt.figure()
    ax = plt.gca()
    # your code here
    return fig # return the handle so the figure can be saved afterwards
    
fig = plot_norm()

## 2. Advanced labeling and customizations

The rest of this week's lecture focuses on using the Pokemon dataset (from Kaggle) to help us show complex trends in our data. If you haven't already, download and load in the Pokemon dataset. I have written the utility function for you.

In [None]:
# load the pokemon dataset in
from pokemon_util import *
pokemon_fpath = os.path.join('csvs', 'pokemon.csv')
poke_headers, poke_types, poke_dict = load_pokemon(pokemon_fpath) # get the dictionary
poke_array, poke_np_lookup = poke_array(poke_dict, poke_types) # convert into numpy array

In [None]:
print(poke_headers)
print(poke_types) # pokemon types, sorted
print(poke_dict['Bulbasaur']) # dictionary format
print(poke_np_lookup[0], poke_array[0,:]) # np array format

Suppose we want to plot the total stats versus the pokemon numbers. Just blindly plotting them produces a pretty messy graph:

In [None]:
total_stats = poke_array[:,TOTAL_STAT]
poke_nums = poke_array[:,POKENUM]
plt.scatter(poke_nums, total_stats)
plt.xlabel('Pokedex #')
plt.ylabel('Total stats')
plt.title('Pokemon statistics')
plt.show()

There are a few ways we can fix this. The first is to customize our scatter points to add more information about the size, shape, and transparency of our points:
* ```alpha``` indicates opacity. 1.0 means fully opaque (the default), 0.0 means fully transparent.
* ```c``` is color. ```'k'``` would be black, ```'orange'``` would be orange, and so on. Only a few colors have single-letter shortcuts (```'b'```, ```'g'```, ```'r'```, ```'c'```, ```'m'```, ```'y'```, ```'k'```, ```'w'```); the rest are specified by full name, which you can find on the [Matplotlib documentation page for named colors](https://matplotlib.org/2.0.0/examples/color/named_colors.html).
* ```s``` is size, given as the marker area in points squared. To make a marker twice as wide, multiply ```s``` by 4.
* ```marker``` is marker shape (the shape of your point). Default is ```'o'``` or a filled in circle. More info is on the [Matplotlib documentation page for markers](https://matplotlib.org/api/markers_api.html#module-matplotlib.markers).
* ```facecolors``` is the face color of your marker. If you do not want your markers filled in, set this to ```'none'``` (the string). Otherwise it can be set to a color.
* ```edgecolors``` is the edge color of your marker. If you do not want edge outlines, set this to ```'none'``` (the string).

In [None]:
plt.scatter(poke_nums, total_stats, alpha=0.4, marker='^') # you must replot
plt.xlabel('Pokedex #')
plt.ylabel('Total stats')
plt.title('Pokemon statistics')
plt.show()
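
To see the remaining keywords from the list in action, here is a small sketch on made-up data (the random values only stand in for Pokedex numbers and total stats; they are not from the Pokemon csv). The variable names ```filled```, ```hollow```, and ```big``` are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up data standing in for (Pokedex #, total stats)
rng = np.random.default_rng(0)
x = np.arange(50)
y = rng.integers(200, 700, size=50)

fig, ax = plt.subplots()
# filled orange squares, half transparent, with a black outline
filled = ax.scatter(x[:25], y[:25], c='orange', s=40, marker='s',
                    alpha=0.5, edgecolors='k')
# hollow circles: no face color, only a blue edge
hollow = ax.scatter(x[25:], y[25:], s=40, facecolors='none', edgecolors='b')
# s is an area: 4x the value draws a marker about 2x as wide
big = ax.scatter([10, 20], [100, 100], s=[40, 160], c='k')
```

Note that ```s``` can be a single number or one value per point, which is handy if you want marker size to encode another column of your data.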

Alas, we probably would prefer to differentiate these Pokemon by their type. So here I've tried it out, still plotting by Pokedex number.

In [None]:
fig = plt.figure()
ax = plt.gca()
for i, poke_type in enumerate(poke_types):
    total_stats_type = poke_array[poke_array[:,TYPE1] == i, TOTAL_STAT]
    poke_nums_type = poke_array[poke_array[:,TYPE1] == i, POKENUM]
    ax.scatter(poke_nums_type, total_stats_type, label=poke_type, marker='^', alpha=0.4)
ax.set_xlabel('Pokedex #')
ax.set_ylabel('Total stats')
ax.set_title('Pokemon statistics')
plt.show()

### Setting axis limits

Suppose I want to adjust my axis boundaries.
* ```ax.set_xlim((min_value, max_value))```(note the internal tuple). The analogous function for the y axis is ```ax.set_ylim((min_value, max_value))```.
* ```ax.get_xlim()``` returns two values, the minimum and maximum of your axis.

In [None]:
curr_xmin, curr_xmax = ax.get_xlim()
ax.set_xlim((0, curr_xmax))
ax.set_ylim((0, 1000))
fig
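
If you only want to move one boundary, ```set_xlim``` and ```set_ylim``` also accept keyword arguments. The following sketch uses made-up points rather than the Pokemon data, and the ```old_*``` / ```new_*``` names are just for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.scatter(np.arange(10, 60), np.linspace(200, 700, 50))

old_xmin, old_xmax = ax.get_xlim()   # limits chosen automatically from the data
ax.set_xlim(left=0)                  # move only the left boundary
ax.set_ylim(bottom=0)                # same idea for the y axis
new_xmin, new_xmax = ax.get_xlim()
print(old_xmin, '->', new_xmin, '| right edge unchanged:', old_xmax == new_xmax)
```

This saves the ```get_xlim()``` round trip when the other boundary should simply stay where Matplotlib put it.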

### Making custom axis labels

Sometimes you may not always have numeric axis labels. For example, suppose I wanted to plot the spread of stats by Pokemon type, instead of Pokedex number. Then I would need to adjust the x axis labels. The key function we use is ```plt.xticks(<list of intercepts>, <list of labels>)``` or ```ax.set_xticks(<list of intercepts>)``` followed by ```ax.set_xticklabels(<list of labels>)```.

You can also add ```rotation``` to the labels. I've done so for legibility.


In [None]:
fig = plt.figure()
ax = plt.gca()

for i, poke_type in enumerate(poke_types):
    combo_type = np.logical_or(poke_array[:,TYPE1] == i, poke_array[:,TYPE2] == i)
    #combo_type = poke_array[:,TYPE1] == i
    total_stats_type = poke_array[combo_type, TOTAL_STAT]
    # note that you cannot set facecolor=None in this loop. This is because matplotlib is auto-setting
    # the color of your series for you.
    ax.scatter(i*np.ones(total_stats_type.shape), total_stats_type, label=poke_type, alpha=0.2)
ax.set_xlabel('Pokemon type')
ax.set_ylabel('Total stats')
ax.set_xticks(range(len(poke_types)))
ax.set_xticklabels(poke_types,rotation=90)
ax.set_title('Total stats by Pokemon type')
plt.show()
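
For comparison, here is a sketch of the same idea using the ```plt.xticks``` form instead of the ```ax``` methods. The three types and their values are made up for the example.

```python
import matplotlib.pyplot as plt

# hypothetical categories and values, standing in for the Pokemon types
labels = ['Bug', 'Fire', 'Water']
values = [3, 7, 5]
positions = range(len(labels))

plt.figure()
plt.scatter(positions, values)
# tick positions first, then the text to show at each position
plt.xticks(positions, labels, rotation=45)
plt.xlabel('Pokemon type')
# called with no arguments, xticks returns the current positions and labels
tick_locs, tick_texts = plt.xticks()
```

Both forms end up in the same place; the ```ax``` version is usually easier to keep straight once you have more than one set of axes.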

#### Programming exercises

Plot a horizontal version of the previous graph. To make your figure a different size, pass ```figsize=(width, height)``` (in inches) when you create the figure object, as in ```plt.figure(figsize=(width, height))```.
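
As a hint for the sizing part only, this is a minimal sketch of creating a tall, narrow figure with the ```figsize``` keyword of ```plt.figure``` and checking the result. The 4 by 8 inch size is an arbitrary choice.

```python
import matplotlib.pyplot as plt

# figsize is (width, height) in inches; here a tall, narrow figure
fig = plt.figure(figsize=(4, 8))
ax = plt.gca()
width, height = fig.get_size_inches()
print(width, height)
```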

In [None]:
# your code here

## 3. Boxplots

Other than scatter and line graphs, Matplotlib can also plot a variety of other standard graphs, like boxplots and histograms. We'll first discuss the easier of the two, the boxplot.

The Matplotlib documentation website has detailed information about each function (like the [boxplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.boxplot.html) one), but the best resource for graphing is their extensive examples library. This [boxplot demo](https://matplotlib.org/2.0.1/examples/pylab_examples/boxplot_demo.html) page has a lot of examples of how to plot boxplots.

For example, let's graph the total pokemon stats as a boxplot.

In [None]:
total_stats = poke_array[:,TOTAL_STAT]
plt.boxplot(total_stats)
plt.title('Total stats of all Pokemon')
plt.show()

Want a horizontal graph? You can check up on the boxplot documentation, but here you go. I've also adjusted the whiskers from their default 1.5 IQR (interquartile range) to show some outliers, and I've labeled the axis.

In [None]:
plt.boxplot(total_stats, vert=False, whis=1.0)
ax = plt.gca()
ax.set_yticklabels(['total stats'])
ax.set_xlabel('Stat points')
ax.set_title('Spread of total stats of all Pokemon')
plt.show()
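
To make the effect of ```whis``` concrete, the sketch below draws the same made-up data twice and counts the points that fall outside the whiskers. It assumes normally distributed random values as a stand-in for the stats, and it uses the ```positions``` keyword to place the two boxes side by side.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up, roughly bell-shaped data standing in for total stats
rng = np.random.default_rng(0)
data = rng.normal(loc=430, scale=120, size=800)

fig, ax = plt.subplots()
# two boxes from the same data, drawn at different positions
wide = ax.boxplot(data, positions=[1], whis=1.5)
narrow = ax.boxplot(data, positions=[2], whis=1.0)
ax.set_xticks([1, 2])
ax.set_xticklabels(['whis=1.5', 'whis=1.0'])

# boxplot returns a dict of the drawn artists; 'fliers' holds the outlier points
n_wide = len(wide['fliers'][0].get_ydata())
n_narrow = len(narrow['fliers'][0].get_ydata())
print('outliers at whis=1.5:', n_wide, '| at whis=1.0:', n_narrow)
```

Shorter whiskers cover less of the data, so more points are drawn as individual outliers.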

We can also plot multiple boxplots on the same graph. However, note that every call to ```boxplot``` resets the location at which to graph the boxplot, and therefore repeated calls to boxplot will actually graph overlapping boxplots, as below. 

In [None]:
fig = plt.figure()
ax = plt.gca()
stat_inds = [HP, ATK, DEF, SPATK, SPDEF, SPD]
for stat_ind in stat_inds:
    dim_stats = poke_array[:,stat_ind]
    ax.boxplot(dim_stats, vert=False, whis=1.5)
ax.set_title('Spread of each Pokemon stat')
plt.show()

Instead, we should plot all of the boxplots at once in a single call to ```boxplot``` by creating a 2-D array of all of the information that we want plotted.

In [None]:
all_stats = poke_array[:,stat_inds]
stat_labels = [poke_headers[ind] for ind in stat_inds]
print('all stats dim', all_stats.shape)
def multiple_boxplots(arr, labels, xlabel=None, title=None):
    fig = plt.figure()
    ax = plt.gca()
    ax.boxplot(arr, vert=False, whis=1.5, labels=labels) # note `labels` keyword, not `label`
    if xlabel is not None: ax.set_xlabel(xlabel)
    if title is not None: ax.set_title(title)
    plt.show()
multiple_boxplots(all_stats, stat_labels, xlabel='Stat points', title='Spread of stats of all Pokemon')

But perhaps normalized stats would be better, as the range of each Pokemon stat is different. Below is the same plot with a normalized x-axis.

In [None]:
norm_stats = all_stats/np.amax(all_stats, axis=0) # do you see how this normalizes across each dimension?
multiple_boxplots(norm_stats, stat_labels, xlabel='Fraction of max stat', title='Normalized stat spread of all Pokemon')
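
If the normalization line is not obvious, here is the same operation on a tiny made-up array, where you can check the arithmetic by hand. The three rows play the role of Pokemon and the two columns the role of stats.

```python
import numpy as np

# three made-up Pokemon (rows) with two stats each (columns)
stats = np.array([[ 45.,  49.],
                  [ 90.,  98.],
                  [180., 196.]])

col_max = np.amax(stats, axis=0)   # shape (2,): one maximum per column
norm = stats / col_max             # (3, 2) / (2,) broadcasts across the rows

print(col_max)
print(norm)
```

Because ```col_max``` has one entry per column, NumPy divides every row by the same pair of maxima, so each column of ```norm``` tops out at 1.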

#### Programming exercises

Plot the spread of stats of all Pokemon where any given stat is ```stat/total_stat```. In other words, instead of normalizing by the max of that particular stat---say, normalizing Bulbasaur's HP by the max HP seen across all Pokemon---now normalize Bulbasaur's HP by Bulbasaur's total stat statistic.

A suggested implementation will work in two lines:
* 1 line to normalize by total stats
* 1 line to call ```multiple_boxplots()```

In [None]:
# your code here

## 4. Matplotlib ```bar``` and ```hist``` and NumPy ```histogram```

Now let's move on to plotting bar graphs and histograms in Matplotlib. Recall the difference between bar graphs and histograms:
* **bar graph**: counts of items separated by categories. For example, the number of Pokemon of each type over all generations.
* **histogram**: binned distribution count of an observed item population. For example, the distribution of the HP statistic over all Pokemon. Note that histograms provide a more informative view of a population than boxplots, as they also portray the shape of the distribution.

### Matplotlib ```ax.bar()```

Matplotlib has a ```bar``` command that works as you would expect.

```ax.bar(xs, height_ys, width)```
* ```xs```: a list of x coordinates, one per bar. Each is either the bar's center (by default) or its left edge, as determined by the ```align``` keyword.
* ```height_ys```: a list of heights for each of your bars.
* ```width``` (optional): a scalar specifying the width of all of your bars. Default is 0.8.
* ```align``` (optional): 'center' or 'edge'. Determines whether your ```xs``` are the center or left coordinates of each bar. Default is 'center'.
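
The difference between the two ```align``` settings is easiest to see by overlaying them. This sketch uses three made-up bars; the names ```centered``` and ```edged``` are illustrative.

```python
import matplotlib.pyplot as plt

xs = [0, 1, 2]
heights = [3, 5, 2]

fig, ax = plt.subplots()
centered = ax.bar(xs, heights, width=0.4, align='center', label="align='center'")
edged = ax.bar(xs, heights, width=0.4, align='edge', alpha=0.5, label="align='edge'")
ax.legend()

# each bar is a rectangle; get_x() is the coordinate of its left side
print('center-aligned bar starts at', centered[0].get_x())
print('edge-aligned bar starts at', edged[0].get_x())
```

With ```align='center'``` the first bar straddles x = 0, while with ```align='edge'``` it starts exactly at x = 0 and extends to the right.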

Let's plot the bar graph of Pokemon per type. Note that if a Pokemon has dual typing, we count it twice. So the number of Pokemon you see might be more than the number of Pokemon per generation.

In [None]:
fig = plt.figure()
ax = plt.gca()
pokemon_counts = []
for i, poke_type in enumerate(poke_types):
    combo_type = np.logical_or(poke_array[:,TYPE1] == i, poke_array[:,TYPE2] == i)
    pokemon_counts.append(np.count_nonzero(combo_type))
ax.bar(range(len(poke_types)), pokemon_counts)
ax.set_xticks(range(len(poke_types)))
ax.set_xticklabels(poke_types, rotation=90)
ax.set_xlabel('Pokemon types')
ax.set_ylabel('Number of Pokemon')
ax.set_title('Pokemon types')
plt.show()

Since ```ax.bar()``` allows us to specify the x-coordinates of our bars, it is simple to make a clustered bar graph. Below I have plotted the number of Pokemon per type for Generation 1 and Generation 2.

In [None]:
fig = plt.figure()
ax = plt.gca()
pokemon_counts_gen1 = []
pokemon_counts_gen2 = []
for i, poke_type in enumerate(poke_types):
    combo_type = np.logical_or(poke_array[:,TYPE1] == i, poke_array[:,TYPE2] == i)
    combo_type_gen1 = np.logical_and(combo_type, poke_array[:,GEN] == 1)
    combo_type_gen2 = np.logical_and(combo_type, poke_array[:,GEN] == 2)
    pokemon_counts_gen1.append(np.count_nonzero(combo_type_gen1))
    pokemon_counts_gen2.append(np.count_nonzero(combo_type_gen2))
x_gen1 = np.arange(0, len(poke_types)*3, 3)
x_gen2 = np.arange(1, len(poke_types)*3, 3)
bar_width = 1
ax.bar(x_gen1, pokemon_counts_gen1, width=bar_width, label='Gen 1')
ax.bar(x_gen2, pokemon_counts_gen2, width=bar_width, label='Gen 2')
ax.set_xticks(x_gen2)
ax.set_xticklabels(poke_types, rotation=90)
ax.set_xlabel('Pokemon types')
ax.set_ylabel('Number of Pokemon')
ax.legend()
ax.set_title('Pokemon types')
plt.show()

#### Programming exercise

Note that the above bar chart doesn't tell us too much because the number of Pokemon introduced in Gen 1 and in Gen 2 are different. Plot the above bar chart with normalized values; that is, where the y-axis becomes the fraction of Pokemon in that particular generation that are of a particular type.

In [None]:
# your code here



### Matplotlib ```hist``` and NumPy ```histogram```

Next we'll do histograms. 
* Examples: [Matplotlib histogram demo](https://matplotlib.org/1.2.1/examples/pylab_examples/histogram_demo.html)
* [Matplotlib basic tutorial](https://matplotlib.org/gallery/statistics/hist.html)
* Documentation of ```plt.hist(x, bins, range, density, cumulative)``` [Matplotlib documentation](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.hist.html)
    * ```x```: the data to bin, as an array (or a sequence of arrays).
    * ```bins``` (optional): an int or list. If an int, specifies to create that number of bins for the data. If a list, specifies the left-hand edges + the last right-hand edge for your bins. In other words, if bins is a list, len(bins) = (number of bins) + 1. If not specified, defaults to bins=10.
    * ```range``` (optional): if bins is an integer, you can use this to specify the range=(min_val, max_val) that you want.
    * ```density``` (optional): If True, the first element of the return tuple will be the counts normalized to form a probability density, i.e., the area (or integral) under the histogram will be 1. Each count is divided by the number of observations times the bin width, not just by the number of observations. If stacked is also True, the sum of the histograms is normalized to 1.
    * ```cumulative``` (optional): if True, each bin holds the running total of all bins up to and including it. Combined with ```density=True```, this gives an empirical cdf instead of a pdf.
    * ```rwidth``` (optional): the relative width of each bar, as a fraction of its bin width. By default each bar fills its bin.
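
The two ways of specifying ```bins``` can be compared directly. The sketch below uses made-up values between 0 and 100 (not the Pokemon data) and prints the number of bins and edges for each call.

```python
import numpy as np
import matplotlib.pyplot as plt

# made-up stat values between 0 and 100
rng = np.random.default_rng(0)
data = rng.uniform(0, 100, size=200)

fig, ax = plt.subplots()
# an int: 5 equal-width bins over the requested range
counts_a, edges_a, _ = ax.hist(data, bins=5, range=(0, 100), alpha=0.5)
# a list: explicit edges, so 3 bins of different widths
counts_b, edges_b, _ = ax.hist(data, bins=[0, 10, 50, 100], alpha=0.5)

print(len(counts_a), len(edges_a))  # number of bins, number of edges
print(len(counts_b), len(edges_b))
```

In both cases there is one more edge than there are bins, because the last bin needs a right-hand edge too.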
    
Below is a simple histogram. Note that the number of bins is 10, and the bars are flush against each other (as opposed to in ```ax.bar()```). Also notice the range is already decided for you.

In [None]:
total_stats = poke_array[:,TOTAL_STAT]
plt.hist(total_stats)
plt.title('Distribution of total stats of all Pokemon')
plt.show()

```plt.hist``` also returns values for you:
* ```n```: the values of the histogram bins in a list.
* ```bins```: the edges of the bins. Note len(bins) = len(n) + 1, like in the parameter.
* ```patches```: the actual rectangular objects (one per bin) for the plot, in case you want to manipulate these as well

In [None]:
total_stats = poke_array[:,TOTAL_STAT]
counts, bins, _ = plt.hist(total_stats) # ignore patches
plt.title('Total stats of all Pokemon')
plt.show()
print('counts', counts)
print('bins', bins)

#### ```density=True``` and ```np.histogram()```
But what if you just wanted the bin counts and the bins? Then NumPy has the function for you: **```np.histogram()```**. It does the same thing as ```ax.hist()``` in Matplotlib, but just does not graph it.

In [None]:
speed_stats = poke_array[:,SPD]
counts, bins = np.histogram(speed_stats)
print('counts', counts)
print('bins', bins)
print('Total number of Pokemon', np.sum(counts))

It's useful to know how to call ```np.histogram()``` explicitly because the ```density``` option in ```ax.hist()``` (and, by extension, in ```np.histogram()```) does not do exactly what you would expect.

In [None]:
speed_stats = poke_array[:,SPD]
fracs, bins, _ = plt.hist(speed_stats, density=True)
plt.title('Speed stat of all Pokemon')
plt.show()

In the graph above, it is clear that for the tallest bar, 0.0040 != 198/800 (800 total Pokemon). So what is the ```density=True``` option giving us? It scales the bar heights so that the area under the histogram (the sum of height times bin width over all bins) is 1:

In [None]:
bin_width = bins[1] - bins[0]
np.sum(fracs*bin_width)
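
Spelled out, ```density=True``` divides each count by (number of observations × bin width). The following sketch checks that by hand with ```np.histogram()``` on made-up speed values; the random integers are an assumption of the example, not the real stats.

```python
import numpy as np

# made-up speed values
rng = np.random.default_rng(0)
speeds = rng.integers(5, 181, size=800)

counts, bins = np.histogram(speeds)
density, _ = np.histogram(speeds, density=True)
bin_widths = np.diff(bins)

# density = count / (number of observations * bin width)
by_hand = counts / (counts.sum() * bin_widths)
print(np.allclose(density, by_hand))

# multiplying back by the bin width recovers the fraction of observations per bin
fractions = density * bin_widths
print(fractions.sum())
```

So the density values only look like fractions when the bin width happens to be 1; for any other width you have to multiply by the width to get back to "fraction of Pokemon".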

However, suppose we wanted the percentage of Pokemon as a distribution. Then we can explicitly call ```np.histogram()``` and combine it with an ```ax.bar()``` call to create the correctly normalized plot, as below.

**Notes**:
* ```ax.bar()``` takes only one x coordinate per bar, so we pass the left bin edges and set the alignment to 'edge'.
* We must also specify the width, since the default width is just 0.8.
* It's also useful to have your xtick marks be where your actual bin edges are, so let us specify that too. I cast the numpy array of bins to integers to avoid the floating point printout on the graph.

In [None]:
counts, bins = np.histogram(speed_stats)
total_pokemon = np.sum(counts)
percentages = counts/total_pokemon * 100
bin_width = bins[1] - bins[0]
plt.bar(bins[:-1], percentages, width=bin_width, align='edge')
plt.ylabel('% of Pokemon')
plt.xlabel('Speed stat')
print('default xticks', plt.xticks())
plt.xticks(bins.astype(int)) # without this line, you would have default xticks
print('new xticks', bins.astype(int))
plt.title('Distribution of speed stats')

#### Programming Exercises

1. Fill in the below code to plot a bar graph of the number of Pokemon per generation.

In [None]:
fig = plt.figure()
ax = plt.gca()
pokemon_counts = []
max_gen = int(np.amax(poke_array[:,GEN])) # cast to int so that range() accepts it
for i in range(1, max_gen+1):
    # your code here
    # populate the pokemon_counts array
    pass
# your code here: call ax.bar()

# set labels here
plt.show()

2. Plot a histogram of the distribution of the HP stat by % of Pokemon, not count of Pokemon.

In [None]:
# your code here

## 5. Conclusion and Homework

This week covered the basics of plotting what you want with Matplotlib. Next week we'll look at more advanced customizations: how to specify heatmaps, make subplots, use dual axes, and so on.

* 01_problem.ipynb
* 02_problem_power_law.ipynb