# Week 05: Advanced Matplotlib

This week's learning goals are as follows:

1. Learn how to use subplots.
1. Use Matplotlib documentation to customize your graphs.
1. Graph dual axes.
1. Use colorbar to plot heat maps.

This notebook uses the Kaggle dataset "Pokemon with stats". Download the csv and move it into ```05_matplotlib_advanced/csvs```. For this notebook, I have defined a set of util functions for working with our Pokemon data. Please copy over the ```pokemon_util.py``` file from Week 4.

This notebook also uses the MyAnimeList database, and I've defined util functions for that as well. Please download the csv titled "MyAnimeList Anime Dataset up to May 7 2018" from [this GitHub repo](https://github.com/Dibakarroy1997/myanimelist-data-set-creator) and save it in ```05_matplotlib_advanced/csvs```.

In [None]:
# the following code guarantees you'll properly reload any modules that you custom-defined in your environment.
# you don't need to understand it.
# just run this once at the beginning.
# for auto-reloading external modules
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
import os
import sys
import numpy as np
import csv
import matplotlib as mpl
import matplotlib.pyplot as plt

## 1. Subplots

From last week's homework, we learned how to graph multiple series onto a single figure through repeated calls to ```plot``` or ```scatter```. However, it's also possible to put multiple separate plots onto one figure by creating subplots.

The below function creates one figure and two axes objects that we can access for two separate plots.

In [None]:
"""
subplots(num_rows, num_cols, figsize=(width, height))
returns:
    fig: entire figure object
    axs: NumPy array of axes objects
"""
fig, axs = plt.subplots(1, 2, figsize=(10,4))
print(axs)

Notice that the two subplots cycle through colors independently, so the sine curve on the right gets the first color in the cycle rather than the second. We can also label each subplot separately.

In [None]:
x = np.linspace(-np.pi, np.pi, 80, endpoint=True)
cos_y,sin_y = np.cos(x), np.sin(x)

fig, axs = plt.subplots(1, 2, figsize=(10,4))
axs[0].plot(x, cos_y, label='cos')
axs[0].plot(x, sin_y, label='sin')
axs[0].legend()
axs[1].plot(x, sin_y)
axs[1].set_xlabel('x values')
axs[1].set_ylabel('y values')
axs[1].set_xlim((-3, 3))
plt.show()

But the labels aren't well-positioned now, so we call ```fig.tight_layout()``` to automatically resize the subplots so that everything is visible. We generally keep this call as the last line of the plotting code.

In [None]:
fig.tight_layout()
fig # to show the figure in this cell

Finally, we can set an overall title for this subplot figure. This one is also a bit hard to position (and doesn't play well with ```fig.tight_layout()```) so I usually just leave it out.

In [None]:
fig.suptitle('Overall title')
fig.tight_layout() # reformat again
fig # to show the figure in this cell
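
If the overall title overlaps the subplots in your Matplotlib version, one common workaround is to reserve space for it with the ```rect``` argument of ```tight_layout```. The following is a minimal sketch on made-up sine/cosine data; the names ```fig_demo``` and ```axs_demo``` and the 0.95 cutoff are illustrative choices, not something this notebook relies on.

```python
import numpy as np
import matplotlib.pyplot as plt

x_demo = np.linspace(-np.pi, np.pi, 80)
fig_demo, axs_demo = plt.subplots(1, 2, figsize=(10, 4))
axs_demo[0].plot(x_demo, np.cos(x_demo))
axs_demo[1].plot(x_demo, np.sin(x_demo))
title_demo = fig_demo.suptitle('Overall title')

# rect = (left, bottom, right, top) in figure fractions. The subplots are
# fit inside this box, which leaves the top 5% of the figure for the title.
fig_demo.tight_layout(rect=(0, 0, 1, 0.95))
```

You may need to tune the 0.95 value depending on the figure height and the title's font size.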

We can save the entire figure with the same call to ```savefig()``` as last time.

In [None]:
os.makedirs('images', exist_ok=True) # create the output folder if it doesn't exist yet
fig.savefig(os.path.join('images', 'subplot_example.png'))

### 2-D subplots

In the above exercise, since our subplots were 1-D (in a single row), the ```axs``` object returned is a one-dimensional NumPy array. However, if we create a 2-D set of subplots (with multiple rows and multiple columns), the ```axs``` object returned is a 2-D NumPy array indexed as ```axs[row_index,col_index]```.

In [None]:
fig, axs = plt.subplots(3,3, figsize=(9,9))
print('axs object', axs)
print(axs.shape) # it's a numpy array
for row_ind in range(axs.shape[0]):
    for col_ind in range(axs.shape[1]):
        axs[row_ind, col_ind].set_title('Subplot index ({}, {})'.format(row_ind, col_ind))

fig.tight_layout()
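
Because the shape of ```axs``` depends on the grid you ask for, it can help to make it predictable. Here is a small sketch (the variable names are made up for illustration) that uses ```squeeze=False``` to always get a 2-D array, and ```axs.flat``` to visit every subplot with a single loop.

```python
import matplotlib.pyplot as plt

# squeeze=False asks subplots to always return a 2-D array of axes,
# even when there is only one row or one column.
fig_grid, axs_grid = plt.subplots(1, 2, figsize=(8, 3), squeeze=False)
grid_shape = axs_grid.shape  # (1, 2) instead of (2,)

# axs.flat walks the array row by row, so one loop covers every subplot.
for flat_ind, ax in enumerate(axs_grid.flat):
    ax.set_title('Subplot {}'.format(flat_ind))
fig_grid.tight_layout()
```

This is handy when the same code should work for a single row of subplots as well as a full grid.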

## Example
Let's plot histograms of all six Pokemon stats, one stat per subplot.

In [None]:
# load the pokemon dataset in
from pokemon_util import *
pokemon_fpath = os.path.join('csvs', 'pokemon.csv')
poke_headers, poke_types, poke_dict = load_pokemon(pokemon_fpath) # get the dictionary
poke_arr, poke_np_lookup = poke_array(poke_dict, poke_types) # convert into numpy array

In [None]:
print(poke_headers)
print(poke_types) # pokemon types, sorted
print(poke_dict['Bulbasaur']) # dictionary format
print(poke_np_lookup[0], poke_arr[0,:]) # np array format

In [None]:
fig, axs = plt.subplots(3, 2, figsize=(8,6))
num_stats = 6
num_rows, num_cols = axs.shape
bin_range = np.linspace(0, 256, 12)
for stat_ind in range(num_stats):
    poke_ind = TOTAL_STAT+1+stat_ind
    stat_col = poke_arr[:,poke_ind]
    row_ind = stat_ind//num_cols # integer division
    col_ind = stat_ind % num_cols # remainder
    axs[row_ind, col_ind].hist(stat_col, bins=bin_range, edgecolor='w')
    axs[row_ind, col_ind].set_title(poke_headers[poke_ind])

# then label the x/y axes at the end
for row_ind in range(num_rows):
    axs[row_ind,0].set_ylabel('# of Pokemon')
for col_ind in range(num_cols):
    axs[-1, col_ind].set_xlabel('Stat points')
fig.tight_layout()

#### Programming exercises

Plot the same Pokemon stat figure but using a nested loop instead.

Notes:
* First decide whether you want 3 rows x 2 columns or 2 rows x 3 columns. It's up to you.
* Recall that you can make a histogram with ```ax.hist(x_values)``` if ```ax``` is a single object. You will probably have to call ```axs[i,j].hist(...)``` with what you want.
* The ```STAT_START``` constant gives you the index into ```TOTAL_STAT```, followed by ```HP, ATK, DEF, SPATK, SPDEF, SPD```. So to get ```HP```, you can call ```STAT_START+1```.
* It is easiest to accomplish this task in a nested double loop, where the outer loop indexes into the rows of ```axs```, and the inner loop indexes into the columns of ```axs```. You can then decide the stat index by doing ```STAT_START+1+<some value>```, where ```<some value>``` is a function of the row and column you're on, as well as the dimensions of your ```axs``` array.

In [None]:
# Your code here


## 2. Graph customization

We are going to use the MyAnimeList dataset for the remainder of this notebook.

- Different size stuff (densities and sizes)
  - a circle plot?
- Color version of plot
  - a color plot? with colorbar
  - Custom bars and stuff
  - LaTeX/subscript/superscript in titles and text
- Fill between
  - https://matplotlib.org/users/recipes.html
- Legend placement, textbox adding

In [None]:
from mal_util import *
mal_fpath = os.path.join('csvs', 'myAnimeListDataset [07-05-2018].csv')
anime_headers, anime_studios, anime_sources, anime_dict = load_mal(mal_fpath)
anime_arr, anime_np_lookup = anime_array(anime_dict,  anime_studios, anime_sources) # convert into numpy array

print(anime_headers) # column headers
print('studios:', len(anime_studios), anime_studios[:10])
print(anime_sources)
print('example entry for Houseki no Kuni', anime_dict['Houseki no Kuni']) # dictionary format
np.set_printoptions(suppress=True) # remove scientific notation view
print(anime_np_lookup[0], anime_arr[0,:]) # np array format

### Using an array for marker size
If we plot all anime scores from 2017, we get a blob of information:

In [None]:
mal_scores = anime_arr[:,SCORE]
bool_2017 = anime_arr[:,YEAR] == 2017
def set_labels(ax, title):
    ax.set_xticks(np.arange(4))
    ax.set_xticklabels(SEASON_STRS)
    ax.set_xlim((-0.5, 3.5))
    ax.set_title(title)
    ax.set_ylabel('MAL rating')


fig = plt.figure()
ax = plt.gca()
x_vals = anime_arr[bool_2017,SEASON]
y_vals = anime_arr[bool_2017,SCORE]
ax.scatter(x_vals, y_vals,alpha=0.2,c='g')
#plt.xlim((2000,2018))
set_labels(ax, title='Anime from 2017')
plt.show()

One thing we can do is to randomize the x-offset to make each individual point more readable.

In [None]:
offset_x_vals = x_vals + 0.05-np.random.random(x_vals.shape)*0.1
fig = plt.figure()
ax = plt.gca()
ax.scatter(offset_x_vals, y_vals,alpha=0.2,c='g')
set_labels(ax, title='Anime from 2017')
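
To see what the random offset does in isolation, here is a minimal sketch on made-up season indices. It uses NumPy's seeded ```default_rng``` generator (a substitution for the ```np.random.random``` call above) so that the jitter is reproducible; ```jitter_width``` is a hypothetical name for the total width of the band.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded so the jitter is reproducible
season_inds = np.array([0, 0, 1, 2, 3, 3], dtype=float)  # made-up season indices
jitter_width = 0.1  # total width of the jitter band around each season

# Draw one offset per point, uniformly from [-0.05, 0.05).
jitter = rng.uniform(-jitter_width / 2, jitter_width / 2, size=season_inds.shape)
jittered = season_inds + jitter

# Largest distance any point moved from its original season index.
max_shift = np.max(np.abs(jittered - season_inds))
```

Keeping the band narrow compared with the spacing between seasons (1.0 here) means the points spread out without bleeding into a neighboring season.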

Another thing we can do is pass an array of sizes with the same shape as our x and y arrays. However, if we set each size directly to the number of people who watched the anime (approximated here by the number of ratings), we get an unreadable graph, so instead we do the following:
* Figure out automatic bins for the viewcount distribution
* Use the ```np.digitize(values, bins)``` function to get the bin index of each view count, instead of the absolute viewcount
* Use the bin indices as the size index itself

We do some analysis on the distribution of our anime viewcounts to show why this works.

In [None]:
ratings = anime_arr[:,NUM_RATINGS]
counts, bins, _ = plt.hist(ratings, bins=40)
plt.ylabel('Number of anime')
plt.xlabel('Number of ratings')
print('bins', bins.tolist())
print('counts', counts.tolist())
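
As a quick illustration of what ```np.digitize``` returns, here is a small sketch with made-up rating counts and hand-picked bin edges (neither comes from the MAL dataset).

```python
import numpy as np

rating_counts = np.array([5, 120, 999, 1000, 45000])  # made-up values
bin_edges = np.array([0, 100, 1000, 10000, 100000])   # hand-picked edges

# bin_index[i] = k means bin_edges[k-1] <= rating_counts[i] < bin_edges[k]
bin_index = np.digitize(rating_counts, bin_edges)

# Turn the small integer bin indices into marker sizes.
marker_sizes = (bin_index + 1) * 20
```

Here ```bin_index``` comes out as ```[1, 2, 2, 3, 4]```: values spanning four orders of magnitude collapse into a handful of integers, which is what makes them usable as marker sizes.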

The graph below now carries much more information, since we can clearly see where the more popular anime lie in our chart.

In [None]:
fig = plt.figure()
ax = plt.gca()
bin_inds = np.digitize(anime_arr[bool_2017,NUM_RATINGS], bins) # get the index of what viewcount bin each item is in
print('example of bin indices from digitize call', bin_inds[:10])
s_vals = (bin_inds+1)*20
ax.scatter(offset_x_vals, y_vals, s=s_vals, alpha=0.3, c='g')
set_labels(ax,title='Anime from 2017')

### Using an array for colors and alpha values
Recall that we can set the ```alpha``` value, which controls how transparent the markers are. In older Matplotlib releases, ```alpha``` accepts only a scalar value or ```None```, not an array of per-point alphas (newer releases also accept an array). Alpha is closely tied to the RGBA representation of color, which is a tuple of 4 values: R,G,B,A (for alpha).

So what we **can** do is specify an array of RGBA colors, where the RGB values are always the color we care about, but the alpha values vary. Note that all elements of RGBA must be scaled between 0 and 1.0, so a pure green which is (0,255,0) will actually be (0, 1.0, 0, 1) in RGBA land.

In [None]:
rgba_vals = np.ones((y_vals.shape[0],4))
rgba_vals[:,:3] = [0,1.0,0]
rgba_vals[:,3] = (bin_inds)/np.max(bin_inds)
fig = plt.figure()
ax = plt.gca()
ax.scatter(offset_x_vals, y_vals, s=s_vals, c=rgba_vals)
set_labels(ax,title='Anime from 2017')
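
If you would rather start from a named color than type the RGB values by hand, ```matplotlib.colors.to_rgb``` does the conversion. This is a minimal sketch with made-up per-point weights; the names ```weights``` and ```rgba_demo``` are illustrative only.

```python
import numpy as np
import matplotlib.colors as mcolors

# Named color -> (r, g, b), each already scaled to [0, 1].
# Note that the named color 'green' is darker than pure green (0, 1, 0).
base_rgb = mcolors.to_rgb('green')

weights = np.array([1, 2, 4])  # made-up per-point weights
rgba_demo = np.ones((weights.shape[0], 4))
rgba_demo[:, :3] = base_rgb                  # same color for every point
rgba_demo[:, 3] = weights / weights.max()    # alpha grows with the weight
```

The resulting ```rgba_demo``` array has one row per point and can be passed to ```scatter``` through the ```c``` argument, just like ```rgba_vals``` above.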

#### Programming exercises

Choose four studios out of the following list:

In [None]:
print(anime_studios)

#### Exercise 1

Plot the studio ratings by season, using a different color for each studio.

Notes:
* ```l.index(item)``` returns the first index of ```item``` in the list ```l```. You can use this to find the studio index from ```anime_studios``` (your list).
* ```fig.legend()``` tries its best to plot an unobtrusive legend (this should show which studio belongs to which color)
* ```set_labels(ax, title=title_str)``` automatically configures the axis object to display seasons, with the title being ```title_str```

In [None]:
select_studios = [] # fill this in
select_colors = [] # fill this in with colors you decide. should be same length as studio list

fig = plt.figure()
ax = plt.gca()

studio_arr = anime_arr[:,STUDIO]

for studio, c in zip(select_studios, select_colors):
    ## your code here
    pass

# labeling code here

## 3. Dual axes

What if we wanted to track average MAL rating with the number of people rating anime over time? The answer is probably dual axes. You can read this Matplotlib documentation for more information, but the following code is basically me applying the documentation to the MAL dataset.

https://matplotlib.org/gallery/api/two_scales.html

Notice that we are plotting *standard error*, not standard deviation. This is the estimated error of the average that we've plotted for that particular year: ```sigma/sqrt(n)```, where ```sigma``` is the standard deviation of that year's samples and ```n``` is the number of samples we have.
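
As a worked example of the standard error calculation, here is a short sketch on four made-up scores for a single year.

```python
import numpy as np

scores_demo = np.array([6.5, 7.0, 7.5, 8.0])  # made-up scores for one year
n = scores_demo.shape[0]

# np.std defaults to ddof=0 (divide by n); pass ddof=1 for the
# sample standard deviation if you prefer that convention.
sigma = np.std(scores_demo)
std_err = sigma / np.sqrt(n)
```

With these numbers ```sigma``` is about 0.56 and ```std_err``` is about 0.28, so the error bar on the mean is half as wide as the spread of the individual scores.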

In [None]:
def get_statistics_by_year(arr):
    year_col = arr[:,YEAR].astype(int) # group rows by release year
    years = sorted(list(set(year_col.tolist())))
    mean_tups, std_tups, min_tups, max_tups, n_tups = [], [], [], [], []
    for year in years:
        year_arr = arr[year_col == year,:]
        mean_tups.append(np.mean(year_arr,axis=0))
        std_tups.append(np.std(year_arr, axis=0))
        min_tups.append(np.amin(year_arr, axis=0))
        max_tups.append(np.amax(year_arr, axis=0))
        n_tups.append(np.count_nonzero(year_col == year))
    return years, np.array(mean_tups), np.array(std_tups), np.array(min_tups), np.array(max_tups), np.array(n_tups)

years, mean_by_year, std_by_year, min_by_year, max_by_year, n_by_year = get_statistics_by_year(anime_arr)
stderr_by_year = (std_by_year.T/np.sqrt(n_by_year)).T

fig = plt.figure()

# plot the left axis
ax1 = plt.gca()
score_c = 'g'
ax1.errorbar(years, mean_by_year[:,SCORE], yerr=stderr_by_year[:,SCORE], capsize=5, c=score_c)
ax1.set_ylabel(r'Average MAL score ($\pm$ 1 SE)', color=score_c) # raw string so \p is not treated as an escape
ax1.tick_params(axis='y', labelcolor=score_c)
ax1.set_ylim((5, 8))

# plot the right axis
ax2 = ax1.twinx()
rate_c = 'b'
ax2.errorbar(years, mean_by_year[:,NUM_RATINGS], yerr=stderr_by_year[:,NUM_RATINGS], capsize=5, c=rate_c)
ax2.set_ylabel(r'Average number of ratings ($\pm$ 1 SE)', color=rate_c) # raw string so \p is not treated as an escape
ax2.tick_params(axis='y', labelcolor=rate_c)

# set x labels
ax1.set_xlim((1990, 2018))
ax1.set_xlabel('Year')
plt.show()

print('Year with most ratings:', years[np.argmax(mean_by_year[:,NUM_RATINGS])])

For context, the MAL listing feature was created in 2004, the MyAnimeList name was created in 2006, Fullmetal Alchemist: Brotherhood was released in 2010, Attack on Titan was released in 2013, and One Punch Man was released in 2015. Teekyuu was released in 2012 (the peak year for number of ratings; other anime series released in 2012 are listed on [Wikipedia](https://en.wikipedia.org/wiki/List_of_animated_television_series_of_2012)).


#### Programming exercise

Plot the average number of people rating an anime alongside the average number of people favoriting an anime, by year, using dual axes.

## 4. Colorbar to plot heat maps

Plotting colors is sometimes used to introduce a third dimension onto 2-D plots (as an alternative to size/alpha shading). Let's see how this works.

In [None]:
fig = plt.figure()
ax = plt.gca()
num_ratings = anime_arr[:,NUM_RATINGS]
num_faves = anime_arr[:,NUM_FAV]
scores = anime_arr[:,SCORE]


# Choose colormap
cmap = plt.cm.jet
my_cmap = cmap(np.arange(cmap.N)) # Get the colormap colors
print('default RGBA colors', my_cmap)
print('possible colors', my_cmap.shape)
my_cmap[:,-1] = 0.5
my_cmap = mpl.colors.ListedColormap(my_cmap)
print('color object', my_cmap)

norm = mpl.colors.Normalize(vmin=0, vmax=10)
scale_map = plt.cm.ScalarMappable(norm=norm, cmap=my_cmap)
scale_cs = scale_map.to_rgba(scores)
print('scaled color object', scale_map)
print('output colors', scale_cs)

ax.scatter(num_ratings, num_faves, c=scale_cs)
ax.set_xlabel('Number of people who rated an anime')
ax.set_ylabel('Number of people who favorited an anime')
ax.set_yscale('log')
plt.show()
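
The plot above colors the points but does not draw the colorbar itself. Here is a minimal sketch of one way to add it, using random made-up data rather than the MAL arrays, and the ```viridis``` colormap instead of ```jet```. It lets ```scatter``` do the color mapping (via ```cmap``` and ```norm```) instead of precomputing RGBA values, so treat it as an alternative to the approach above rather than a drop-in continuation.

```python
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=50)      # made-up data
y_demo = rng.uniform(1, 1000, size=50)
score_demo = rng.uniform(0, 10, size=50)

fig_cb, ax_cb = plt.subplots()
norm_demo = mpl.colors.Normalize(vmin=0, vmax=10)

# Passing the raw values with cmap/norm lets scatter map them to colors,
# and the returned collection can be handed straight to colorbar.
points = ax_cb.scatter(x_demo, y_demo, c=score_demo, cmap='viridis',
                       norm=norm_demo, alpha=0.5)
cbar = fig_cb.colorbar(points, ax=ax_cb)
cbar.set_label('Score')
ax_cb.set_yscale('log')
```

The colorbar is drawn in its own axes next to the plot, so the figure ends up with two axes objects.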