# Pandas and Matplotlib | Some advanced features

Szymon Talaga | 20.01.2020

<hr>

This is our final notebook. It reviews selected advanced features of Pandas and introduces a better way to
use the Matplotlib visualization library.

Remember that Pandas is quite a vast library, so we could not cover everything. Luckily, the official Pandas documentation
is accessible and, for the most part, well written. To solidify your knowledge, it is recommended to read through:

* [Getting started tutorial](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html)
* [User guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html)

While reading, you will notice that we covered more or less all the core concepts, although in some cases we had to omit details.
We also did not talk at all about working with time series (at which Pandas, in fact, excels).

The problems in the final homework will depend, as far as possible, only on topics that we discussed in class. Thus, for instance,
there will be no questions about time series processing. However, in some cases you may have to read a little
on your own or at least consult the documentation page for a function or method. This is unavoidable, and it is also what happens very often
in any real project.

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sbn

## Pandas | Reading and writing from and to (text) files

We have already discussed this, so this is just a brief reminder.

Reading from a text file can be done with the `pd.read_csv()` function.
In general it is quite complex (look at its docpage with `pd.read_csv?`), but the basic usage is rather simple.

Similarly, you can write to a file with the `.to_csv()` method. Again, it is quite complex, but the basic usage is simple.

We will first dump the _iris_ dataset to a CSV (comma-separated values) file and a TSV (tab-separated values) file with the `.to_csv()`
method, and then we will read them back in with `pd.read_csv()`.

In [None]:
# Load iris dataset from Seaborn package
iris = sbn.load_dataset('iris')
iris.head()

In [None]:
# Write CSV (comma-separated)
iris.to_csv('iris.csv', index=False)

# Write TSV (tab-separated)
# This time we additionally remove header with header=False argument
iris.to_csv('iris.tsv', sep="\t", header=False, index=False)

In [None]:
# Read-in from CSV
iris_csv = pd.read_csv('iris.csv')
iris_csv.head()

In [None]:
# Read-in from TSV
# First attempt
pd.read_csv('iris.tsv', sep="\t").head()

In [None]:
# We forgot that we removed the header, so the first data row was used as column names
pd.read_csv('iris.tsv', sep="\t", header=None).head()

In [None]:
# We may also pass column names by hand
iris_tsv = pd.read_csv('iris.tsv', sep="\t", header=None, names=[
    'sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'
])
iris_tsv.head()
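The round trip above can also be checked without touching the disk. The following is a minimal sketch, assuming a small made-up frame (`toy`, which is not part of the notebook data) and an in-memory `io.StringIO` buffer used in place of a real file.

```python
import io
import pandas as pd

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({'name': ['a', 'b', 'c'], 'value': [1.5, 2.0, 3.25]})

# Write as TSV without header and index, like 'iris.tsv' above
buffer = io.StringIO()
toy.to_csv(buffer, sep="\t", header=False, index=False)

# Rewind the buffer and read it back, supplying column names by hand
buffer.seek(0)
toy_back = pd.read_csv(buffer, sep="\t", header=None, names=['name', 'value'])

# Check whether the frame read back is identical to the original one
roundtrip_ok = toy.equals(toy_back)
print(roundtrip_ok)
```

A check like this is a cheap way to make sure that the separator, header and index options used for writing and reading actually match.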

## Pandas | Sorting

You can sort data frames (even by multiple columns) with the `.sort_values()` method.

In [None]:
np.random.seed(101010)

df = pd.DataFrame(np.random.randint(0, 10, (15, 2)), columns=['x', 'y'])
df

In [None]:
df.sort_values(by=['x', 'y'], axis=0)
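Sorting directions can also be mixed between columns. The sketch below assumes a small made-up frame (`scores`, a hypothetical name) and uses the `ascending` argument of `.sort_values()` with one flag per sorting column.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
scores = pd.DataFrame({'x': [2, 1, 2, 1], 'y': [5, 7, 9, 3]})

# Sort by 'x' ascending and, within equal 'x', by 'y' descending
scores_sorted = scores.sort_values(by=['x', 'y'], ascending=[True, False])

# Rebuild a clean 0..n-1 index after sorting
scores_sorted = scores_sorted.reset_index(drop=True)
print(scores_sorted)
```

Note that sorting keeps the original index labels attached to the rows, which is why `.reset_index(drop=True)` is often called afterwards.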

## Pandas | Working with strings

We can work quite efficiently with string columns thanks to the special `.str` accessor. It gives access to the string methods
we know from standard Python (plus a few regular-expression helpers such as `.str.contains()`) and applies them in a vectorized manner to whole columns of a data frame.

In [None]:
iris.head()

In [None]:
iris['species'].map(lambda x: x.upper())

In [None]:
iris['species'].str.upper()

In [None]:
iris['species'].str.contains(r"^v")

In [None]:
# Filter only to species with names starting with 'v'
iris.loc[iris['species'].str.contains(r"^v"), :]
iris.loc[iris['species'].str.match(r"^v"), :]

In [None]:
s = "a string"

s.startswith('a')
s.startswith('not a')

In [None]:
# Equivalent filter using a plain string method instead of a regular expression
iris.loc[iris['species'].str.startswith('v'), :]
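Calls on the `.str` accessor return new Series objects, so several string operations can be applied one after another. This is a minimal sketch on a made-up column (`names` is a hypothetical example, not part of _iris_).

```python
import pandas as pd

# Hypothetical toy column used only for illustration
names = pd.Series(['  Setosa', 'versicolor ', 'VIRGINICA'])

# Chain several vectorized string methods: trim whitespace, then lowercase
clean = names.str.strip().str.lower()

# Boolean mask of names starting with 'v', and the length of each name
starts_with_v = clean.str.startswith('v')
lengths = clean.str.len()
print(clean.tolist(), starts_with_v.tolist(), lengths.tolist())
```

Cleaning strings this way before filtering is usually safer than matching raw values, which may differ only in case or stray whitespace.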

## Pandas | Create new variables on the fly

You can create a copy of a data frame with new columns (defined even by complicated rules) with the `.assign()` method.

In [None]:
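# Direct assignment modifies `iris` in place
# (rows of the other species get NaN in the new column)
iris.loc[iris['species'] == 'setosa', 'new_column'] = 999
iris.head()
iris.tail()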

In [None]:
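# .assign() returns a modified copy; `iris` itself is not changed here
iris.assign(
    a_new_column = 1,
    large_sepal_w = lambda df: df['sepal_width'] > 6
)

# Direct assignment, in contrast, adds the column to `iris` in place
iris['large_sepal_w'] = iris['sepal_width'] > 6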

iris.head()
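Two properties of `.assign()` are worth seeing side by side: the original frame is left untouched, and later keyword arguments may use columns created by earlier ones. The sketch below assumes a made-up frame (`flowers`) and hypothetical column names.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
flowers = pd.DataFrame({'width': [3.5, 2.8, 3.1], 'length': [5.1, 6.4, 7.0]})

# Later keyword arguments may refer to columns created by earlier ones
flowers_new = flowers.assign(
    ratio=lambda df: df['length'] / df['width'],
    high_ratio=lambda df: df['ratio'] > 2
)

# The original frame is left untouched
print(list(flowers.columns))
print(list(flowers_new.columns))
```

Because nothing is modified in place, `.assign()` fits naturally into longer sequences of method calls.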

## Pandas | Method chaining syntax

At this point we can see a pattern in how Pandas works: most methods return
new data frame objects (usually copies). Thus, we can chain multiple method calls one after another without creating
any intermediate objects.

As an example, we will again compute an indicator of large sepals (this time based on `sepal_length`) and then use it to compute the distribution
of species within the two groups it defines.

In [None]:
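# Step-by-step version with intermediate objects
# (the indicator is based on sepal length, hence the name `large_sepal_l`)
temp_df1 = iris.assign(large_sepal_l = lambda df: df['sepal_length'] > 6)
temp_df2 = temp_df1.groupby(['large_sepal_l'])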
temp_df3 = temp_df2.apply(lambda df: pd.Series(
    df.groupby('species').size() / len(df),
    name='freq'
))
temp_df3.reset_index()

In [None]:
iris = sbn.load_dataset('iris')

In [None]:
iris.head()

In [None]:
df = iris
df.groupby('species').size() / len(df)

In [None]:
# The method chaining syntax is most readable
# when we put every method call on a new line with proper indentation
iris_dist = iris \
    .assign(large_sepal_l = lambda df: df['sepal_length'] > 6) \
    .groupby(['large_sepal_l']) \
    .apply(lambda df: pd.Series(
        df.groupby('species').size() / len(df),
        name='dist'
    )) \
    .reset_index()

In [None]:
iris_dist
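A chain can also be wrapped in parentheses, which removes the need for trailing backslashes. The following is only a sketch on made-up data (`toy` and its columns are hypothetical), and it swaps the `.apply()` step for `.value_counts(normalize=True)` as one possible way to get within-group proportions.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
toy = pd.DataFrame({
    'length': [5.0, 6.5, 7.0, 4.8, 6.1, 5.5],
    'species': ['a', 'b', 'b', 'a', 'a', 'b']
})

# Wrapping the chain in parentheses removes the need for backslashes
toy_dist = (
    toy
    .assign(large=lambda df: df['length'] > 6)
    .groupby('large')['species']
    .value_counts(normalize=True)
    .rename('dist')
    .reset_index()
)
print(toy_dist)
```

Both layouts are equivalent; the parenthesized form additionally allows comments between the individual steps.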

## Pandas | Reshaping and pivoting, long and wide format

One of the most important operations in practical data analysis is converting between the long and wide
formats of data frames.

In the long format, one record (row) in a data frame corresponds not to a single subject, but to a single
measurement for a subject. Thus, measurements of different variables for a single subject are stored in separate rows.

In the wide format, one record (row) in a data frame corresponds to a single subject, and multiple
variables/measurements are stored in different columns. For instance, the _iris_ dataset is by default represented
in the wide format.

In [None]:
iris = sbn.load_dataset('iris')
iris.head()

So how can we convert between wide format (above) and long format? The easiest way is to use `.stack()` method.

In [None]:
iris.head()

In [None]:
iris_long = iris \
    .stack() \
    .reset_index() \
    .rename(columns={
        'level_0': 'idx',
        'level_1': 'variable',
        0: 'value'
    })

iris_long

We can undo the stacking operation with `.unstack()` method.

In [None]:
iris \
    .stack() \
    .unstack() \
    .eq(iris) \
    .all() \
    .all()

However, we may want to convert only a subset of columns to the long format. For instance, it is often useful
(and we will soon see why) to gather only the numeric columns this way and keep `species`
as a separate column.

The `.melt()` method allows us to do exactly that.

In [None]:
iris.head()

In [None]:
iris_long = iris \
    .reset_index() \
    .melt(id_vars=['index', 'species']) \
    .sort_values(by='index') \
    .reset_index(drop=True)

iris_long

But how can we now go back to the wide format? It turns out we can do that quite easily by using a few simple
indexing tricks.

In [None]:
iris_long.head()

In [None]:
iris_long.groupby('index')['species'].apply(lambda x: x.unique()[0])

In [None]:
iris_wide = iris_long \
    .set_index(['index', 'variable']) \
    .loc[:, 'value'] \
    .unstack() \
    .merge(
        iris_long.groupby('index')['species'].agg(lambda x: x.iloc[0]),
        right_index=True,
        left_index=True
    ) \
    .reset_index(drop=True)

iris_wide
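The same conversion back to the wide format can also be sketched with the `.pivot()` method. The example below uses made-up long-format data (`long_df`, a hypothetical name) and assumes a Pandas version in which `.pivot()` accepts a list of index columns (1.1 or newer).

```python
import pandas as pd

# Hypothetical toy data in the long format, used only for illustration
long_df = pd.DataFrame({
    'index': [0, 0, 1, 1],
    'species': ['a', 'a', 'b', 'b'],
    'variable': ['length', 'width', 'length', 'width'],
    'value': [5.1, 3.5, 6.4, 2.8]
})

# Keep 'index' and 'species' as row identifiers and spread 'variable' into columns;
# the last step drops the leftover 'variable' label from the column index
wide_df = long_df \
    .pivot(index=['index', 'species'], columns='variable', values='value') \
    .reset_index() \
    .rename_axis(columns=None)

print(wide_df)
```

Note that `.pivot()` raises an error when a combination of identifiers and variable appears more than once, so it works only if each measurement is uniquely identified.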

## Pandas | Concatenation, merging and relational operations

The last Pandas topic that we will discuss is how to join different data frames together.
Since data frames are 2-dimensional, we can think about this problem in several different ways.

We may want to join multiple frames side by side or stack them one on top of another.
Moreover, we may also think about different kinds of so-called relational operations (joins), in which
we add new columns to a host data frame based on the columns of another data frame while aligning rows
by specified key columns (or indexes).

Let us now review these ideas.

### Vertical stacking

In [None]:
df1 = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [3, 4, 5]
})
df1

In [None]:
df2 = pd.DataFrame({
    'a': [1, 7],
    'b': [10, 11]
})
df2

In [None]:
# Vertical stacking 
pd.concat([df1, df2], axis=0)

In [None]:
# Vertical stacking with new indexes
pd.concat([df1, df2], axis=0, ignore_index=True)

### Horizontal stacking

In [None]:
df1 = pd.DataFrame({
    'a': [1, 2, 3],
    'b': [2, 3, 4]
})
df1

In [None]:
df2 = pd.DataFrame({
    'c': [1, 4, 5],
    'd': [4, 6, 9]
})
df2

In [None]:
pd.concat([df1, df2], axis=1)

In [None]:
df2.columns = ['a', 'b']

In [None]:
pd.concat([df1, df2], axis=1)

In [None]:
pd.concat([df1, df2], axis=1, ignore_index=True)
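One detail of horizontal stacking is easy to miss: `pd.concat()` with `axis=1` matches rows by index label, not by position. The sketch below uses two made-up frames (`left` and `right`, hypothetical names) with deliberately different indexes.

```python
import pandas as pd

# Hypothetical toy data used only for illustration
left = pd.DataFrame({'a': [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({'c': [10, 20, 30]}, index=[2, 1, 5])

# Rows are matched by index label, not by position;
# labels missing on one side are filled with NaN
aligned = pd.concat([left, right], axis=1)

# Resetting the indexes first gives purely positional stacking
positional = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1
)
print(aligned)
print(positional)
```

If the frames come from filtering or sorting, their indexes may no longer be `0..n-1`, so it is worth checking them before stacking side by side.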

### Left join

Left join is one of the most fundamental data processing operations. It allows us to add columns to a data frame based on values
in another data frame in such a way that rows of the two data frames are aligned according to prespecified key column(s).
All rows of the left data frame are kept, and rows without a matching key get missing values (`NaN`) in the added columns.

Let us see this in an example.

In [None]:
df1 = pd.DataFrame({
    'key': ['a', 'b', 'c', 'a', 'a', 'c', 'f'],
    'x': [1, 2, 3, 4, 5, 6, 100]
})
df1

In [None]:
df2 = pd.DataFrame({
    'key': ['a', 'b', 'c', 'd'],
    'y': [1, 2, 3, 11]
})
df2

We can perform a left join between the two data frames with the `.merge()` method.
`.merge()` is quite flexible. You can learn more from the [official docs](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.merge.html).

In [None]:
df1.merge(df2, how='left', on='key')

We can perform an inner join to extract only records that have matching keys in both data frames.

In [None]:
df1.merge(df2, how='inner', on='key')

And we can perform an outer join to extract all records, even the ones without a match.

In [None]:
df1.merge(df2, how='outer', on='key')
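To see which rows found a match in each kind of join, `.merge()` accepts the `indicator=True` argument. This is a minimal sketch on made-up frames (`left` and `right` are hypothetical and smaller than `df1` and `df2` above).

```python
import pandas as pd

# Hypothetical toy data used only for illustration
left = pd.DataFrame({'key': ['a', 'b', 'f'], 'x': [1, 2, 100]})
right = pd.DataFrame({'key': ['a', 'b', 'd'], 'y': [1, 2, 11]})

# indicator=True adds a '_merge' column telling where each row came from
merged = left.merge(right, how='outer', on='key', indicator=True)
print(merged)

# Keys present only in the left frame (these get NaN in 'y' after a left join)
left_only = merged.loc[merged['_merge'] == 'left_only', 'key'].tolist()
print(left_only)
```

Rows marked `both` are exactly those returned by an inner join, while `left_only` and `right_only` rows appear only in left/outer and right/outer joins respectively.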

## Exercise 1.

Use the _iris_ dataset and add to it a new column called `sepal_length_mean` that stores the average value of `sepal_length` computed within each species.
The output dataset should still have 150 rows. In other words, every record of a given species should store the mean value
of `sepal_length` for this species.

HINT. You can use the `groupby` and `merge` methods to do this quite easily.

In [None]:
iris = sbn.load_dataset('iris')
iris.head()

In [None]:
# Your solution

# Version one
iris.merge(
    pd.Series(iris.groupby('species')['sepal_length'].mean(), name='sepal_length_mean'),
    how='left',
    left_on='species',
    right_index=True
)

# Version two
iris \
    .groupby('species') \
    .apply(lambda df: df.assign(
        sepal_length_mean = df['sepal_length'].mean()
    )) \
    .reset_index(drop=True)
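A third possibility, not used in the two versions above, is the `.transform()` method of grouped objects, which returns group-level results aligned with the original rows. The sketch below assumes a tiny made-up stand-in for _iris_ (`toy`, a hypothetical name).

```python
import pandas as pd

# Hypothetical toy stand-in for iris, used only for illustration
toy = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'sepal_length': [5.0, 6.0, 7.0, 8.0]
})

# transform('mean') computes the group mean and repeats it for every row of the group
toy_means = toy.assign(
    sepal_length_mean=lambda df: df.groupby('species')['sepal_length'].transform('mean')
)
print(toy_means)
```

The number and order of rows stay the same as in the input, so no merging or index resetting is needed.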

## Exercise 2. 

Here you will work with the famous `mpg` dataset. It contains some technical information about a few hundred models of cars.
You want to compute the means and medians of all numeric variables within groups defined by `origin`.
However, you want to have individual variables in rows, not in columns.

So first you need to use the `.melt()` method to convert `mpg` to the long format while leaving `origin` as a separate column.
You will also have to drop the `name` column.

Write your solution nicely using the method chaining syntax.

Try to structure your output in such a way as to facilitate comparisons between regions (values of `origin`).

HINT. You may look at the `.drop` method. It allows you to drop the `name` column very easily.

In [None]:
mpg = sbn.load_dataset('mpg')
mpg.head()

In [None]:
# Your solution

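# Version 1.
mpg \
    .drop('name', axis='columns') \
    .reset_index() \
    .melt(id_vars=['index', 'origin']) \
    .drop('index', axis='columns') \
    .groupby(['origin', 'variable']) \
    .agg(['mean', 'median'])

# Version 2.
# The non-numeric `name` column has to be dropped here as well,
# otherwise recent Pandas versions fail when computing the mean
mpg \
    .drop('name', axis='columns') \
    .groupby('origin') \
    .agg(['mean', 'median']) \
    .T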

## The second solution is arguably much nicer.

## Matplotlib | Object-oriented API

So far we have used Matplotlib in a slightly naive way, which is okay for simple tasks, but sooner or later becomes problematic.
What is the problem here?

The problem is that we used the so-called (stateful) pyplot API. It means that to draw every plot we have been using functions
defined at the module level (`plt.hist()`, `plt.scatter()` and so on), which all act on a hidden global "current figure".

In [None]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sbn

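# Set Matplotlib style
# (the seaborn styles were renamed to 'seaborn-v0_8-*' in Matplotlib 3.6)
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)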

In [None]:
X = np.random.normal(0, 1, (100,))
Y = np.random.normal(0, 1, (100,))

_ = plt.hist(X)
_ = plt.hist(Y)

As the example above shows, in this approach every action that we perform is _added_ to the results of all previous actions.
As a result, the pyplot API may be quite unpredictable when used for more complex tasks, as earlier actions may affect
later actions in surprising ways.

It also makes it awkward to create and manage multiple plots in a single chunk of code.

That is why in general it is preferable to work with the object-oriented API. In this approach every figure
we draw is represented by a single `Figure` object and one or more `Axes` objects.

A `Figure` represents an entire figure and `Axes` objects represent particular panels (windows) on a figure.
We will see what that means exactly in just a few moments.

In [None]:
iris = sbn.load_dataset('iris')
iris.head()

In [None]:
# Simple scatter plot with Pyplot API
for group, df in iris.groupby('species'):
    _ = plt.scatter(df['sepal_length'], df['sepal_width'], label=group)
plt.legend()

The central function in the object-oriented approach is `plt.subplots`.
In the simplest case it creates a single `Figure` object and a single `Axes` object that jointly
correspond to a single figure with just one plot panel.

In [None]:
# Simple scatter plot with object-oriented API
fig, ax = plt.subplots()
for group, df in iris.groupby('species'):
    _ = ax.scatter(df['sepal_length'], df['sepal_width'], label=group)

OK, so what did we gain here? The good thing is that we can redraw the figure multiple times in different code chunks.
And if we want, we can modify it later on.

In [None]:
fig

We can also use a `Figure` object to save any figure we created at any point in time.
With the pyplot API, `plt.savefig()` saves only the current (most recently active) figure.

In [None]:
# Save as PDF
fig.savefig('fig1.pdf', bbox_inches='tight')
# Save as PNG
fig.savefig('fig1.png', bbox_inches='tight')

# The option `bbox_inches='tight'` is a good default:
# it leaves less whitespace around a figure

So the main idea is that we perform top-level operations, such as setting the overall size or
saving a figure, using `Figure` objects, and we draw particular (sub)plots using `Axes` objects.

For instance, once a plot is drawn, we can still modify it and add a legend.

In [None]:
_ = ax.legend()
fig
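The contrast with the stateful interface can be sketched as follows: with explicit `Figure` and `Axes` objects, two plots created in the same chunk of code stay independent, and we can go back to the first one after drawing the second. The data and the names `fig_x`, `ax_x`, `fig_y` and `ax_y` are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical random data used only for illustration
rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100)
y = rng.normal(0, 1, 100)

# Two independent figures, each with its own Axes
fig_x, ax_x = plt.subplots()
fig_y, ax_y = plt.subplots()

# Each histogram goes to its own Axes instead of being drawn on top of the other
_ = ax_x.hist(x)
_ = ax_y.hist(y)

# We can return to the first figure even after drawing the second one
_ = ax_x.set_title('Histogram of x')
_ = ax_x.set_xlabel('x')
```

Each figure can then be displayed or saved separately, for example with `fig_x.savefig(...)`.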

### Styles

Matplotlib allows setting very detailed style configurations.
However, configuring styles by hand is usually quite difficult, so it is recommended
to use one of the [built-in styles](https://matplotlib.org/3.1.0/gallery/style_sheets/style_sheets_reference.html).

They can be turned on with the following simple function call:

In [None]:
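# Use the seaborn-white style globally
# (named 'seaborn-v0_8-white' since Matplotlib 3.6)
style = 'seaborn-v0_8-white' if 'seaborn-v0_8-white' in plt.style.available else 'seaborn-white'
plt.style.use(style)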

If you are curious about details you may learn more about styling in Matplotlib [here](https://matplotlib.org/tutorials/introductory/customizing.html).
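A style does not have to be set globally. The sketch below lists the styles available in the installed Matplotlib version and applies one of them only temporarily with the `plt.style.context()` context manager; the choice of `'ggplot'` and the plotted numbers are arbitrary assumptions made for illustration.

```python
import matplotlib.pyplot as plt

# Names of all styles available in the installed Matplotlib version
available = plt.style.available

# Remember one style-dependent setting before switching styles
before = plt.rcParams['axes.facecolor']

# Apply the 'ggplot' style only inside the with-block
with plt.style.context('ggplot'):
    inside = plt.rcParams['axes.facecolor']
    fig, ax = plt.subplots()
    _ = ax.plot([0, 1, 2], [0, 1, 4])

# After the block the previous settings are restored
after = plt.rcParams['axes.facecolor']
print(before, inside, after)
```

This is convenient when a single figure should look different from the rest of the notebook.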

### Multi-panel plots

In data analysis we often want to visualize relationships between multiple variables in one figure.
That is why we have the notion of a plot panel (or a subplot). It is a sort of a single window or canvas
on which, for instance, a scatter plot of a pair of variables may be shown. Thus, we can use multiple
panels to show jointly many pairwise relationships between variables.

Below we draw a matrix of pairwise scatter plots for all numeric variables from the _iris_ dataset.
We have four variables, so we use a 4-by-4 grid of subplots. Along the diagonal we have plots with
the same variable on both the `x` and `y` axes, so there we will show a histogram instead.
This way we will see not only pairwise relationships between all variables, but also the one-dimensional
distributions of all variables.

We will also group everything by species.

In [None]:
iris = sbn.load_dataset('iris')
iris.head()

In [None]:
from itertools import product

variables = iris.loc[:, 'sepal_length':'petal_width'].columns
# Grouped iris
iris_g = iris.groupby('species')

# Initialize figure with 4-by-4 grid of subplot axes objects
# We also set larger figure size
fig, axes = plt.subplots(nrows=4, ncols=4, figsize=(15, 12))

for ax, xy in zip(axes.flatten(), product(variables, variables)):
    # Unpack 2-tuple with variable names
    x, y = xy
    for group, df in iris_g:
        # Set X-axis label
        _ = ax.set_xlabel(x)
        if x == y:
            _ = ax.hist(df[x], label=group)
        else:
            _ = ax.scatter(df[x], df[y], label=group)
            # Set Y-axis labels
            _ = ax.set_ylabel(y)

# We may want to add legend only on one of the subplots
_ = axes.flatten()[2].legend(loc='best')
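The same pattern works for smaller grids. Below is a minimal sketch with a single row of two panels that share the Y axis; the data, the variable names `u` and `v` and the figure size are assumptions made only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical random data used only for illustration
rng = np.random.default_rng(1)
data = {'u': rng.normal(0, 1, 50), 'v': rng.normal(2, 1, 50)}

# One row with two panels; sharey=True makes both panels use the same Y scale
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(8, 3), sharey=True)

# With a single row, `axes` is a 1-dimensional array of Axes objects
for ax, (name, values) in zip(axes, data.items()):
    _ = ax.hist(values)
    _ = ax.set_title(name)

# Adjust spacing so that titles and tick labels do not overlap
fig.tight_layout()
```

Shared axes make panels directly comparable, at the cost of hiding differences in scale between variables.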

## Exercise 3.

Use `mpg` dataset.

Draw scatter plot of `mpg` and `weight` with different colors for different production regions (`origin`).

Use object-oriented matplotlib API.

Save the figure as a PDF file named `mpg.pdf`. Check that it looks correct.

In [None]:
mpg = sbn.load_dataset('mpg')
mpg.head()

In [None]:
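# Your solution
fig, ax = plt.subplots()

x, y = 'mpg', 'weight'
for name, df in mpg.groupby('origin'):
    _ = ax.scatter(df[x], df[y], label=name)
# Axis labels and legend need to be set only once
_ = ax.set_xlabel(x)
_ = ax.set_ylabel(y)
_ = ax.legend()

# Save the figure as a PDF file, as required by the exercise
fig.savefig('mpg.pdf', bbox_inches='tight')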

## Exercise 4.

Use `mpg` dataset.

Draw 3-by-3 matrix of scatter plots (with histograms along the diagonal) showing relationships between
`mpg`, `weight` and `horsepower`. Use different colors for different regions (`origin` variable).

In [None]:
mpg = sbn.load_dataset('mpg')
mpg.head()

In [None]:
# Your solution
from itertools import product

cols = ['mpg', 'weight', 'horsepower']
n = len(cols)
# Grouped dataset (rows with missing values, e.g. in `horsepower`, are dropped first)
mpg_g = mpg.dropna(subset=cols).groupby('origin')
# Initialize figure and axes
fig, axes = plt.subplots(nrows=n, ncols=n, figsize=(15, 12))

for ax, xy in zip(axes.flatten(), product(cols, cols)):
    x, y = xy   # Unpack tuple with variable names
    _ = ax.set_xlabel(x)
    for name, df in mpg_g:
        if x == y:
            _ = ax.hist(df[x], label=name, alpha=.9)
        else:
            _ = ax.scatter(df[x], df[y], label=name, alpha=.9)
            _ = ax.set_ylabel(y)
            
# Add legend in a convenient spot
_ = axes[0, 1].legend()

## Seaborn package

[Seaborn](https://seaborn.pydata.org/) is a data visualization library built on top of Matplotlib.
It makes drawing a few preselected types of plots very easy, but in my experience it is very opinionated
and makes it almost impossible to draw beautiful custom plots, as it puts a lot of constraints on the user.
Because of that (and also because of the lack of time) we will not discuss Seaborn in class.
However, if you are interested, you may give it a try. Also, Seaborn-based solutions to HW3 problems
focused on data visualization will be accepted as long as they are correct. In other words,
you may choose whether you want to use pure Matplotlib or Seaborn for data visualizations in HW3.
