<h1><center>Introduction to Scientific Packages II (Matplotlib and Pandas)</center></h1>

> 1. Basics of Matplotlib
>> - The basic plot
>> - Subplots and Axis elements
>> - GridSpec
>> - Some examples of decorating your figure
> 2. Basics of Pandas
>> - The Series and DataFrame classes
>> - Filtering by columns and rows
>> - I/O functions
>> - Looking at data
>> - Summarizing data
>> - GroupBy for exploring data



<u>Main modules of interest:</u>
> <b>Matplotlib</b> : High-quality 2D charting library whose functions take data arrays as inputs and return figures. It also lets us control visual effects like colors or spacing at an abstract level. (https://matplotlib.org)
>
> <b>Pandas</b> : Used for data analysis and making tabular data structures with mixed data types (DataFrames). Organizes everything visually into tables which can be useful if you get lost in what's stored where. (This will only be used behind the scenes in this lesson, https://pandas.pydata.org)

<u>List of other included modules</u>:
> <b>OS</b> : Standard input-output functions for accessing and saving files on your operating system. (https://docs.python.org/3/library/os.html)


## Import modules

In [None]:
# data handling module
import pandas as pd
import numpy as np

# data viz module
import matplotlib.pyplot as plt 
# the percent sign below marks a Jupyter "magic" command, i.e., an instruction to the notebook itself
# - (In this case, it renders figures inside the notebook.)
%matplotlib inline 

# I/O + system module
import os

## Basics of Matplotlib

Matplotlib is built around two Python objects:
+ Figure objects
+ Axes objects


These two classes are your bread-and-butter interfaces to the plotting <i>backends</i>. <i>Backends</i> deal with how the plots are rendered into the different image formats (.jpg, .svg, .pdf, .png, etc). You don't need to do anything with the backends for the most part - just know that we structure the figure layout with Python code, and the backend renders it into an image that could be <i>rasterized</i> (made up of pixel dots with different properties), <i>vectorized</i> (expressed as scalable vectors having multiple layers to the image), etc.

They also have very similar functionality and interact with each other. You will be using both constantly throughout the entire data science workflow - from exploring data to designing more elaborate and informative figures for publications/presentations.
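
As a minimal sketch of how the two objects refer to each other (the titles are made up for illustration, and nothing here is specific to this lesson's data):

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()

# the Figure keeps a list of the Axes it contains ...
print(fig.axes)
# ... and each Axes keeps a reference back to its parent Figure
print(ax.figure is fig)

# decorations can be set through either object
ax.set_title("Set through the Axes")
fig.suptitle("Set through the Figure")

plt.close(fig)  # tidy up the example figure
```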

> <b>Note</b>: There are a number of visualization packages. Some examples include: <b>Matplotlib</b>, <b>Seaborn</b>, <b>Bokeh</b>, and <b>Plotly</b>.
> 
> Some of these, such as <b>Seaborn</b>, are built on top of <b>Matplotlib</b>; others, such as <b>Bokeh</b> and <b>Plotly</b>, use their own rendering.

**Pros:**
<ul> 
<li>Huge amount of functionality/options. (There are >70,000 lines of code associated with this library.)
<li>Works with numpy arrays and python lists.
<li>Comes with many prepackaged Python distros (anaconda, WinPython, etc.).
<li>Easily saves plots to raster (.png, .bmp, etc.) and vector (.svg, .pdf, etc.) formats.
<li>Has an excellent set of examples for coming up with figure layouts (with code) at http://matplotlib.org/gallery.
<li>Shares many syntactic conventions with Matlab.
</ul>
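
The raster/vector point above can be sketched as follows. This is only an illustration: the in-memory buffers stand in for real files, and the `dpi` value is an arbitrary choice.

```python
import io

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.random(50))

# in-memory buffers stand in for files; a path such as "figure.png" works the same way
png_buffer = io.BytesIO()
svg_buffer = io.BytesIO()

fig.savefig(png_buffer, format="png", dpi=150)  # raster: pixels, resolution set by dpi
fig.savefig(svg_buffer, format="svg")           # vector: scalable drawing instructions

png_bytes = png_buffer.getvalue()
svg_text = svg_buffer.getvalue().decode("utf-8")

print(png_bytes[:4])       # start of the PNG file signature
print("<svg" in svg_text)  # the SVG output is plain XML text

plt.close(fig)
```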


**Cons:**
<ul>
<li>Slow for rapidly updating plots.
<li>3D plotting support is not great.
<li>Documentation is not always useful (it's actually very extensive, but often out-of-date or lacking in up-to-date examples)
<li>Essentially has two primary interfaces.  One is intended to be close to Matlab, the other is object oriented.  You will find examples that assume one or the other, but rarely the one you are after. 
    - There are actually several interfaces that work with several backends.
<li>Shares many syntactic conventions with Matlab.
</ul>

### The basic plot using the Figure Class

In [None]:
help(plt.figure)

In [None]:
# This is our interface
fig = plt.figure() # the "basic plot" is a class instance with a size and Axes associated with it



In [None]:
type(fig) # this type is handled by the new_figure_manager behind the scenes



In [None]:
# let's look at some "data" 
fake_data = np.random.random(1000)

# we create a Figure instance
fig = plt.figure()

# we then call the plot method
plt.plot(fake_data) 

Notice that the "plot" function creates a 2D line plot. This line plot could be a time series, i.e., a time-ordered sequence of values, but it can show anything that suits a line plot. One example is the curve of reconstruction errors for machine-learning models as we vary learning parameters.

In [None]:
# can also clean our cell output by adding a semi-colon to 
# the end of the last line of the cell. These can get messy 
# the more elements are involved in the last line.
plt.plot(fake_data);


> <b>Note</b>: Generally, we want to tell the interpreter when to render the final plot. This is done with matplotlib.pyplot.show(). However, since we're using a Jupyter Notebook and used Jupyter "magic" to render plots in-line, we don't need to call it here. When you run code in scripts, you should call it (in conjunction with "closing" plots via matplotlib.pyplot.close()).
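
A minimal sketch of that script pattern, assuming a batch-style script that saves the figure instead of displaying it (the in-memory buffer stands in for a file path):

```python
import io

import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots()
ax.plot(np.random.random(100))
fig_number = fig.number  # remember the figure's id so we can check it later

# In an interactive script, plt.show() would go here; it opens a window
# and waits until the window is closed. Here we save the figure instead.
buffer = io.BytesIO()  # stands in for a path such as "figure.png"
fig.savefig(buffer, format="png")

# close the figure so that long-running scripts do not accumulate open figures
plt.close(fig)
print(fig_number in plt.get_fignums())
```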

In [None]:
# ... the bare minimum of the Figure instance
fig # WHAT HAPPENED TO THE AXES??

## The Anatomy of a Figure

!!! INSERT IMAGE HERE

### The Axes Class and Subplots

In [None]:
# subplots creates the entire layout and gives us access to the Axes of Figure
fig, ax = plt.subplots() # <- returns a tuple



In [None]:
ax, type(ax)

In [None]:
# we can essentially do the same thing as before
fig, ax = plt.subplots()

ax.plot(fake_data) # here we work directly with the Axes 

In [None]:
type(ax)

In [None]:
# How are multiple axes stored?
fig, axs = plt.subplots(nrows=3,ncols=3)



In [None]:
type(axs) # as a numpy array



In [None]:
# Axes object is a... of shape...?
axs, axs.shape



In [None]:
# ... and can be indexed like any other array
print(axs[0,1]) # note the address



In [None]:
print(axs[2,1]) # third row, second col



> <b>Note</b>: Generally, I like to collapse the 2D grid to use with non-nested for loops when I need to plot several subplots. This is a preference, and not necessarily universal.

In [None]:
# example of collapsing using Numpy
fig, axs = plt.subplots(3,3,
                       constrained_layout=True) # changes spacing
# numpy.array function
axs = axs.ravel() # unwraps the 3x3 array into a 1D array of length 9

axs[4].plot(fake_data); # == axs[1,1] in the original 3x3 grid (row-major order)

> Alternatively, you can use axs.flatten() for a similar effect.
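
The difference between the two can be sketched with a plain integer array standing in for the 3x3 array of Axes:

```python
import numpy as np

grid = np.arange(9).reshape(3, 3)  # stand-in for a 3x3 array of Axes

flat_view = grid.ravel()    # returns a view when possible (no data copied)
flat_copy = grid.flatten()  # always returns a new copy

print(np.shares_memory(grid, flat_view))
print(np.shares_memory(grid, flat_copy))

# both use row-major order: flat index 4 corresponds to row 1, column 1
row, col = np.unravel_index(4, grid.shape)
print(row, col)
```

For an array of Axes either call works, because the elements are references to the same Axes objects in both cases.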

### GridSpec
> The GridSpec class gives us more control over the placement and span of each of the subplots
>
> If you need even MORE control, look into the <b>mpl_toolkits</b> package. For example, AxesGrid can give you even finer control.

In [None]:
from matplotlib.gridspec import GridSpec

In [None]:
gs = GridSpec(nrows=3,ncols=3)

In [None]:
type(gs), gs

In [None]:
# the main power of grid spec is scaling the subplots
fig = plt.figure()

# Create gridspec object and define each subplot
gs = GridSpec(2, 2) # 2 x 2 is the shape

# We can create each axis here
ax0 = plt.subplot(gs[0, 0]) # Top left corner
ax1 = plt.subplot(gs[0, 1]) # Top right corner
ax2 = plt.subplot(gs[1, :]) # Bottom, span entire width

plt.tight_layout();
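
As a small sketch of the "scaling" mentioned above, GridSpec accepts `width_ratios` (and `height_ratios`) to make some columns or rows larger than others. The 3:1 ratio and the figure size below are arbitrary choices for illustration.

```python
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

fig = plt.figure(figsize=(8, 3))

# width_ratios sets the relative width of each column
gs = GridSpec(nrows=1, ncols=2, width_ratios=[3, 1], figure=fig)

ax_wide = fig.add_subplot(gs[0, 0])    # takes about three quarters of the width
ax_narrow = fig.add_subplot(gs[0, 1])  # takes the rest

# positions are given in figure coordinates (fractions of the figure size)
wide_width = ax_wide.get_position().width
narrow_width = ax_narrow.get_position().width
print(wide_width > narrow_width)

plt.close(fig)
```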

Let's just make a figure and dissect the pieces to it. Matplotlib is definitely one of those packages you'll have to explore. The first Data Exercise should be very helpful for this! 

In [None]:
# Create an empty figure - don't specify any layout details yet (only the figure size)
fig = plt.figure(figsize=(12,4)) # figsize sets the width by the height

# Create gridspec object and define each subplot - a different method
ncols, nrows = 2,2
spec = fig.add_gridspec(ncols=ncols, nrows=nrows) # we use the Figure method to create it,
                                                  # so the grid spec already knows its figure

# let's add some colors, too
colors = [['r','orange'],['y']] # frequently used colors are specified using strings
                                # and some can be specified using shorthand single characters 

# Now we directly add the subplots
ax_i = 0 # row index: both of these subplots sit in the top row
for ax_j in range(ncols):
        ax = fig.add_subplot(spec[ax_i,ax_j])

        
        ax.plot(fake_data,
                color=colors[0][ax_j],
                
                # we're just going to throw in some example changes here
                marker='o', # adding circles to the lines
                alpha=0.2 # changes the opacity (values between 0 and 1)
                )
        
        # we can decorate specific axes
        ax.set_xlabel('X-axis')
        ax.set_xlim([0,1000])
        
        if ax_j==0:
            ax.set_ylabel('Y-axis',size='xx-large')
            
        if ax_i==0:
            ax.set_title('The Same Fake Data')
            
        ax.set_yticks([0,0.5,1.0])
        ax.set_yticklabels(['0','1/2','1'])
        
# bottom row       
ax_bot = fig.add_subplot(spec[1,:])
ax_bot.plot(fake_data,
                color=colors[1][0],
                
                # we're just going to throw in some example changes here
                marker='o', # adding circles to the lines
                alpha=0.2 # changes the opacity (values between 0 and 1)
                )

ax_bot.set_xlabel('X-axis')
ax_bot.set_yticks([0,0.5,1.0,1.5,2.0]) # one tick position per label below
ax_bot.set_yticklabels(['0','1/2','1','3/2','2'])
ax_bot.set_ylim(top=2.05) 

        
        
plt.suptitle('A Somewhat Elaborate Plot')


plt.tight_layout(); # similar to "constrained_layout"

## Basics of Pandas

<b>Pandas</b> is a Python library for high-level data structures and data manipulation.

This includes:
> + loading/saving data
> + filtering, selecting, grouping functions
> + basic data exploration
> + plotting and visualization

Pandas is built around two data structures: 
> + Series objects, and 
> + DataFrame objects

Pandas is just not Zen (https://twitter.com/tymwol/status/1390281948564701184). So we will really just take a very brief glimpse at its functionality. You'll discover your own way of working with Pandas, though.

### The Series and DataFrame

In [None]:
example = {'order' : [0,1,2,23,24,25],
           'alphabet' : ['a','b','c','x','y','z']}

df = pd.DataFrame(example) # notice that D and F are capitalized

df

In [None]:
df.index # [0,6) by 1

In [None]:
df.columns # column is essentially a list so can have mixed built-in types, 
        #  here we just have strings

### Filtering by columns

In [None]:
# we can select columns
print(df['order'])

print('Type =',type(df.order)) # every column in a DataFrame is a Series

In [None]:
example['order']



In [None]:
series = pd.Series(example['order'])

series # this is the same as above, but has 
       # no Name associated with it

### Columns (and rows) can be converted to Numpy arrays or Lists depending on needs

In [None]:
np.array(df['order']), df['order'].values # second is a Pandas attribute (newer versions recommend .to_numpy())


In [None]:
df['alphabet'].tolist(), list(df['alphabet'])

### Filtering rows

In [None]:
# we can also select rows
df.loc[1] # returns row values for row with index LABELED as 1


In [None]:
df.iloc[1] # returns row values for row in index POSITION 1
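
With the default index, labels and positions coincide, so the two calls above return the same row. A small sketch with a made-up index shows where they differ:

```python
import pandas as pd

# hypothetical example: the index labels are 10, 20, 30 rather than 0, 1, 2
letters = pd.DataFrame({"alphabet": ["a", "b", "c"]}, index=[10, 20, 30])

by_label = letters.loc[10]     # row whose index LABEL is 10
by_position = letters.iloc[0]  # row at POSITION 0 (the first row)

print(by_label["alphabet"], by_position["alphabet"])

# letters.loc[0] would raise a KeyError here, because no row is labeled 0
```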


### Selecting values

In [None]:
df['alphabet'][0], df['alphabet'].iloc[0], df['alphabet'].loc[0]


### Loading data

There are methods for reading several different data formats
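
As a minimal sketch, `read_csv` accepts file-like objects as well as paths, so a short made-up string can stand in for a file on disk:

```python
import io

import pandas as pd

# made-up rows standing in for the contents of a CSV file
csv_text = "name,hp\nalpha,45\nbeta,60\n"

small_df = pd.read_csv(io.StringIO(csv_text))

print(small_df.shape)   # (rows, columns)
print(small_df.dtypes)  # column types are inferred while reading

# most readers have a matching writer, e.g. to_csv for read_csv
csv_again = small_df.to_csv(index=False)
```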

In [None]:
print(dir(pd)) # recall dir gives us everything in the namespace 


In [None]:
# Examples of loading
load_list = [x for x in dir(pd) if 'read' in x]

for method in load_list:
    print(method)



### Reading a specific file

In [None]:
path_to_dir = os.getcwd()
path_to_data = 'exercise_data'
filename = 'pokemon_alopez247.csv'

# join paths 
f = os.path.join(path_to_dir,path_to_data,filename)

print(f)



In [None]:
help(pd.read_csv)



In [None]:
pokemon_df = pd.read_csv(f) # data is read

### Let's take a look at the data

In [None]:
pokemon_df.head() # first 5

In [None]:
pokemon_df.tail(15) # last 15

In [None]:
# Do we have missing data in any of our columns?
pokemon_df.isnull().values.any()



In [None]:
# How many and where?
total = 0
columns_with_nans = []

for column in pokemon_df.columns.tolist():
    
    # isnull() flags NaN/None reliably; an identity check against np.nan can miss values
    nan_count = int(pokemon_df[column].isnull().sum())
    
    if nan_count > 0:
        total += nan_count

        columns_with_nans.append(column)
    
print('Total Missing =',total,'\nMissing Cols : ',columns_with_nans)
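
The same bookkeeping can be sketched without an explicit loop. The toy DataFrame below is made up so that the example runs on its own:

```python
import numpy as np
import pandas as pd

# made-up data with missing entries in columns "a" and "b"
toy_df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": ["x", None, None],
                       "c": [1, 2, 3]})

# isnull() gives a True/False table; summing counts the True values per column
missing_per_column = toy_df.isnull().sum()

columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
total_missing = int(missing_per_column.sum())

print("Total Missing =", total_missing, "\nMissing Cols : ", columns_with_missing)
```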

### Quick summary of data

In [None]:
pokemon_df.describe()



### Visualizing data is built into Pandas

In [None]:
x1 = pokemon_df['HP'].values
x2 = pokemon_df['Defense'].values
x3 = pokemon_df['Sp_Def'].values

# Option 1 - Pyplot.plot
plt.plot(x1,x2,marker='.',linestyle='none')

In [None]:
# Option 2 - Pyplot.scatter
plt.scatter(x1,x2)



In [None]:
# Option 3 - Pandas
pokemon_df.plot.scatter(x='Sp_Def',y='Defense')



> <b>Note</b>: Some visualization packages utilize Pandas DataFrames!

In [None]:
# ... Option 4 - Seaborn
import seaborn as sns # this is a Python joke, but has become standard
                      # stands for Samuel Norman "Sam" Seaborn
                      # A character in The West Wing?

In [None]:
# Option 4 - Seaborn
sns.scatterplot(x='Sp_Def',y='Defense',data=pokemon_df)



In [None]:
# Now all together
df = pokemon_df[['HP','Defense','Sp_Def']] # another way to filter

fig, axs = plt.subplots(3,3,figsize=(10,10))
axs = axs.ravel()
columns = df.columns.tolist()
col_x_col = [(col_1,col_2) for col_1 in columns for col_2 in columns]

for i, (ax,col_pair) in enumerate(zip(axs,col_x_col)):
    
    if i in [0,4,8]:
        vals = df[col_pair[0]].values
        ax.hist(vals)
    else:
        sns.scatterplot(x=col_pair[0],y=col_pair[1],data=df,ax=ax)



In [None]:
# wow... this is way easier
sns.pairplot(df);

### GroupBy for quick data exploration

The groupby method lets us partition a DataFrame by desirable characteristics (the column descriptors or categorical metadata stored in the columns).
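
A minimal sketch of the idea on made-up data (the column names `kind` and `hp` are hypothetical): rows are split by the values in one column, and each group is then summarized.

```python
import pandas as pd

# made-up data: two groups of different sizes
toy_df = pd.DataFrame({"kind": ["fire", "fire", "water", "water", "water"],
                       "hp": [40, 60, 50, 70, 90]})

# split by "kind", then compute several summaries of "hp" per group
summary = toy_df.groupby("kind")["hp"].agg(["mean", "count"])
print(summary)
```

Each distinct value of the grouping column becomes one row of the result.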

In [None]:
grouped_means_df = pokemon_df.groupby('Type_1')[['HP','Defense','Sp_Def']].mean()



In [None]:
grouped_means_df



## Let's practice with the Data Exercise I