# Pandas | Core concepts, types and methods (part II)

Szymon Talaga 10.01.2020

<hr>

In this notebook we continue our quest to develop a solid, in-depth understanding of Pandas data structures and their relation to NumPy.

Here we focus on the most important type, `DataFrame`. In the first part of the notebook we work with flat indexes
that have only one level (i.e. standard `Index` objects). In the later part we also consider the `MultiIndex` type
and advanced methods for indexing hierarchical data structures.

A data frame in Pandas is a two-dimensional collection of values arranged as a rectangular grid of rows and columns.
It is primarily column-oriented, as the columns are represented as `Series` objects. This means that any single column of a `DataFrame` has
a single fixed `dtype`, but different columns may have different `dtypes`. This is the primary reason why row-wise operations in Pandas
are usually less efficient and in general more difficult to carry out.

Data frames also use a more complex indexing architecture. All `Series` defining columns have to share the same index, which gives the data frame a well-defined
row index. Additionally, a data frame has a horizontal index which defines columns (and possibly also groups of columns etc.).

Summing up, a data frame can be viewed as a collection of columns represented as `Series`. The columns are mapped to column names organized as an `Index`
(or `MultiIndex`) object, and all of them share another `Index` (or `MultiIndex`) object that defines the row labels.

Although this view is not 100% faithful to the true internal data model used by Pandas, it captures the general
design of Pandas data frames. It also correctly points to the strong and weak points of the Pandas data model in terms of which kinds of operations
are easy and which are hard to perform (in terms of computational efficiency).

```text
=====================================================
|       | column 1    column 2   . . .     column m |
=====================================================
| row 1 |    x           x                    x     |
| row 2 |    x           x                    x     |
|   .   |                                           |
|   .   |                                           |
|   .   |                                           |
| row n |    x           x                    x     |
=====================================================
```
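
The column-oriented view described above can be checked directly. The sketch below uses a small hypothetical frame (`toy`, with made-up column names) and only illustrates the idea; it says nothing about how Pandas stores the data internally.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame(
    {"height": [1.70, 1.82], "name": ["Ann", "Bob"]},
    index=["r1", "r2"],
)

col = toy["height"]                  # a single column is a Series
print(type(col).__name__)            # Series
print(col.index.equals(toy.index))   # True: the column carries the shared row index
print(toy.dtypes)                    # one dtype per column
```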

<hr>

Many great resources about Pandas can be found in the official documentation. In particular, it is recommended to read the following articles:

* [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html)
* [Essential basic functionality](https://pandas.pydata.org/pandas-docs/stable/getting_started/basics.html)
* [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/getting_started/dsintro.html)
* [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#indexing-attribute-access)

<hr>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Helper package to load example datasets
import seaborn as sbn

In [None]:
### Configure IPython shell to print all outputs generated in a code cell
### --------------------------------------------------------------------------
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## `DataFrame` | 2D rectangular dataset organized as a collection of named `Series`

In this section we will review the most important ways in which one can initialize a `DataFrame` object.
This will provide not only a practical exercise but also important insights into the ways in which we can think about data frames.

Perhaps the most natural, although not necessarily the best, way to think about a data frame is to think about it
in terms of a list of rows defining a rectangular table.

In this spirit, we can build a data frame from a list of lists defining rows.

In [None]:
pd.DataFrame([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12]
])

This way we do not get any column or row names by default. However, this can be changed by using the `columns` and `index` arguments.

Note that we do not have to pass type-homogeneous data. Below we create a data frame whose third column is of `object` type
(it contains strings) while the first two are integer columns.

In [None]:
df = pd.DataFrame([
    [1, 2, 'a string 1'],
    [4, 5, 'a string 2'],
    [7, 8, 'a string 3'],
    [10, 11, 'a string 4']
], columns=['a', 'b', 'c'], index=['p', 'r', 's', 't'])
df

We can check types using `.dtypes` attribute of a data frame.

In [None]:
df.dtypes

Similarly we can look at row and column indexes.

In [None]:
df.index    # row index
df.columns  # column index

A similar approach is to create a data frame from a sequence of dictionaries. This way we can specify column names in the data.
However, if we want non-generic row labels we still need to provide them separately through the `index` argument.

NOTE. The ordering of columns is defined by the ordering of keys in the first dictionary (which contains all the keys here). Keys missing from a later dictionary yield NaNs.

In [None]:
pd.DataFrame([
    {'b': 'string', 'a': 2, 'c': 11111},
    {'a': 1, 'b': 'another string'},
], index=['s1', 's2'])

We can fully define data and both indexes in a row-wise fashion if we define a data frame based on a list of named `Series`
which are interpreted as rows in this case.

In [None]:
pd.DataFrame([
    pd.Series({'a': 1, 'b': 2}, name='subject 1'),
    pd.Series({'a': 2, 'b': 3}, name='subject 2')
])

From the column perspective we can define a data frame as a named collection (a mapping) of columns.

In [None]:
pd.DataFrame({
    'a': [1, 2, 3],
    'b': [4, 5, 6],
    'c': ['x', 'y', 'z']
}, index=['s1', 's2', 's3'])

In order to see that Pandas really enforces equal length of columns as well as identical indexes, we can check what happens
if we try to create a data frame from a collection of non-conformable series.

In [None]:
# Unequal lengths
pd.DataFrame({
    'a': [1, 2, 3, 999],
    'b': [4, 5, 6],
    'c': ['x', 'y', 'z']
})

When we pass series with different indexes we do not get an error. Instead we obtain a data frame partially filled with NaNs,
which is of course a result of the label alignment rules in Pandas.

In [None]:
# Unequal indexes
pd.DataFrame({
    'a': pd.Series([1, 2, 3]),
    'b': pd.Series([4, 5, 6]),
    'c': pd.Series(['x', 'y', 'z'], index=['uu', 'vv', 'ww'])
})

## `DataFrame` | Indexing and slicing

`DataFrame` objects provide the same three syntaxes and three kinds of indexing as `Series`. The main difference is that data frames
have two separate axes (rows and columns).

In [None]:
df = pd.DataFrame([
    pd.Series({'a': 1, 'b': 2, 'c': 'foo'}, name='s1'),
    pd.Series({'a': 11, 'b': 22, 'c': 'bar'}, name='s2'),
    pd.Series({'a': 20, 'b': 30, 'c': 'xoxo'}, name='s3'),
    pd.Series({'a': 7, 'b': 15, 'c': 'yoyo'}, name='s4'),
    pd.Series({'a': 50, 'b': 1, 'c': 'howdy'}, name='s5')
])
df

### _getitem_ syntax

It can be used to select columns (single or multiple).

In [None]:
# Get single column
# Output is a series
df['b']

In [None]:
# Get multiple columns
# Output is a data frame
df[['a', 'c']]

In [None]:
# Get single column as a data frame
df[['b']]

However, if we provide a slice in _getitem_ indexing, it will be interpreted as a slice over row labels,
or over row positions if it is an integer slice. Thus, in this case we have a similar problem with ambiguity of the _getitem_ syntax
as in the case of `Series`.

In [None]:
# Get slice of rows by label
df['s2':'s4']

In [None]:
# Get slice of rows by integer positions
df[1:4]

In [None]:
# Ambiguous case of data frame with integer labels
df2 = df.set_index(pd.Index([2, 3, 1, 4, 0]))
df2

In [None]:
# Get rows by integer slice
# They are interpreted as positions and not labels
df2[2:4]
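
When row labels are integers, it is usually safer to state the intent explicitly with `.iloc` (positions) or `.loc` (labels), both of which are discussed in the next sections. A minimal sketch, using a hypothetical frame `demo` with the same shuffled integer index as above:

```python
import pandas as pd

# Hypothetical frame with shuffled integer row labels
demo = pd.DataFrame({"v": [10, 20, 30, 40, 50]}, index=[2, 3, 1, 4, 0])

by_pos = demo.iloc[2:4]   # positions 2 and 3 (end excluded)
by_lab = demo.loc[2:4]    # from label 2 up to and including label 4

print(by_pos.index.tolist())  # [1, 4]
print(by_lab.index.tolist())  # [2, 3, 1, 4]
```

Note that label-based slices include the end label, while position-based slices exclude the end position.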

### `DataFrame` | `.loc` indexing

It provides label-based indexing for both rows and columns.

In [None]:
df

In [None]:
# Get row (as a Series)
df.loc['s2', :]

In [None]:
# Get column (as a Series)
df.loc[:, 'b']

In [None]:
# Get column (as a DataFrame)
df.loc[:, ['b']]

In [None]:
# Get multiple columns and rows
df.loc[['s2', 's4'], ['a', 'c']]

In [None]:
# Get slices of rows and columns
df.loc['s2':, :'b']

### `DataFrame` | `.iloc` indexing

It provides position-based indexing for both rows and columns.

In [None]:
df

In [None]:
# Single element
df.iloc[2, 2]

In [None]:
# Single row (as Series)
df.iloc[-1, :]

In [None]:
# Single row (as DataFrame)
df.iloc[[-1], :]

In [None]:
# Multiple rows and columns
df.iloc[[1, -1], [0, -1]]

In [None]:
# Slices
df.iloc[:3, 1:]

### `DataFrame` | Boolean indexing

As in the case of `Series`, we can provide 1D boolean masks to filter rows and/or columns.
What is important is the fact that with data frames we may mix boolean masks with other types of indexing.
For instance, we can have a boolean mask on columns and label-based or positional index on rows.

In [None]:
df

In [None]:
df.columns
df.columns.isin(['a', 'c'])

In [None]:
df.iloc[2:, df.columns.isin(['a', 'c'])]

In [None]:
df['a'] > 10

In [None]:
df.loc[df['a'] > 10, ['c', 'a']]

A new feature is that we can also provide a full boolean mask over the entire data frame to mask particular values
and turn them into NaNs.

In [None]:
np.random.seed(101010)

num = pd.DataFrame(np.random.normal(0, 1, (10, 3)), columns=['x', 'y', 'z'])
num

In [None]:
num >= 0

In [None]:
# Mask negative values
#num >= 0
num[num >= 0].sum(axis=0)
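
Indexing with a full boolean frame behaves like the `.where()` method, which keeps values where the condition holds and puts NaN elsewhere. The sketch below checks this on a tiny hypothetical frame (`vals`), so that the result is easy to verify by hand.

```python
import pandas as pd

# Hypothetical data with one negative value per column
vals = pd.DataFrame({"x": [1.0, -2.0], "y": [-3.0, 4.0]})

masked = vals[vals >= 0]        # negative entries become NaN
same = vals.where(vals >= 0)    # explicit spelling of the same operation

print(masked.equals(same))      # True
print(masked.sum(axis=0))       # NaNs are skipped by default: x -> 1.0, y -> 4.0
```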

## `DataFrame` | Basic attributes and descriptions

In [None]:
## .columns
## Column index
df.columns

In [None]:
## .index
## Row index
df.index

In [None]:
## .dtypes
## Column names and their dtypes
df.dtypes

In [None]:
## len()
## Number of rows
len(df)

In [None]:
## .shape
## Tuple with number of rows and columns
df.shape

In [None]:
## .info()
## Show basic information about a data frame
df.info()

In [None]:
## .describe()
## Basic numeric summary of data
## Categorical columns are omitted by default if there are any numeric columns
df.describe()

In [None]:
df['c'].describe()

In [None]:
## .to_numpy()
## Numpy representation
df.to_numpy()

In [None]:
df.head(2)
df.tail(2)

## `DataFrame` | Broadcasting and labels alignment

Here we review the broadcasting and labels alignment rules for data frames.
In general they are the same as for `Series` objects. However, in this case we have two axes instead
of one, so this induces some additional complications.

The main rule is that we do labels alignment for both row and column indexes.

In [None]:
# Two simple data frames
df1 = pd.DataFrame(np.arange(4).reshape(2, 2), columns=['a', 'b'], index=[3, 7])
df1

In [None]:
df2 = pd.DataFrame(np.arange(4, 8).reshape(2, 2), columns=['b', 'a'], index=[7, 3])
df2

In [None]:
# Add them together
df2 + df1

In [None]:
# What happened step by step
row_union = df1.index.union(df2.index)
row_union

In [None]:
col_union = df1.columns.union(df2.columns)
col_union

In [None]:
df1.reindex(index=row_union, columns=col_union)

In [None]:
#df1 = df1.reindex(index=row_union).reindex(columns=col_union)
df1 = df1.reindex(index=row_union, columns=col_union)
df1

In [None]:
df2 = df2.reindex(index=row_union, columns=col_union)
df2

In [None]:
# Final operation
df1 + df2

Now let us see what happens if axes cannot be perfectly aligned (they are at least partially non-overlapping).

In [None]:
df1 = pd.DataFrame(np.arange(8).reshape(4, 2), index=[2, 4, 6, 8], columns=['a', 'b'])
df1

In [None]:
df2 = pd.DataFrame(np.arange(8).reshape(4, 2), index=[2, 3, 4, 5], columns=['b', 'c'])
df2

In [None]:
df1 + df2

In [None]:
# What happened step by step
row_union = df1.index.union(df2.index)
row_union

In [None]:
col_union = df1.columns.union(df2.columns)
col_union

In [None]:
df1 = df1.reindex(index=row_union, columns=col_union)
df1

In [None]:
df2 = df2.reindex(index=row_union, columns=col_union)
df2

In [None]:
# Final operation
df1 + df2

But how are operations between series and data frames aligned? The rule is that index labels of a series are matched
with column labels of a data frame. This makes it easy to define operations based on columns and their aggregate
values, such as centering.

In [None]:
# Center columns of numeric data frame
df = pd.DataFrame(np.random.normal(100, 15, (10, 3)), columns=['x', 'y', 'z'])
df

In [None]:
# Column means
# axis=0 aggregates over rows, so we get one mean per column
df.mean(axis=0)

In [None]:
df_c = df - df.mean(axis=0)

In [None]:
# Check if it is really centered
df_c.mean(0)

Is it as easy to carry out operations row-wise? Nope. But it can be done, although it requires some additional tricks.
However, first let us see and understand what is going on when we try to broadcast between a data frame and
a series representing row-aggregated values (e.g. row centering).

In [None]:
df

In [None]:
# row means
df.mean(axis=1)

In [None]:
# Remove means from rows
df_c = df - df.mean(axis=1)
df_c

Total disaster! Do you understand what happened?
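
One way to see what went wrong is to repeat the operation on a tiny hypothetical frame (`small`). The series of row means is indexed by row labels, but Pandas matches a series index against the column labels of the data frame, so no label overlaps and every cell becomes NaN.

```python
import pandas as pd

# Hypothetical two-row frame with default integer row labels 0 and 1
small = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 5.0]})
row_means = small.mean(axis=1)       # indexed by the row labels 0 and 1

broken = small - row_means           # series index is aligned with the columns

# Columns of the result are the union of column labels and row labels
print(sorted(map(str, broken.columns)))   # ['0', '1', 'x', 'y']
print(broken.isna().all().all())          # True: nothing could be matched
```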

One way, although a little convoluted and not really the best one, to deal with this is to use transposition.
Data frames are inherently two-dimensional (they have rows and columns) like matrices. So we can easily
define a transpose of a data frame. Below is the transpose of our original data frame.

In [None]:
df

In [None]:
df.T

With this trick in hand we can re-express our row-oriented problem as a column-oriented one
and convert the result back to the original orientation with yet another transpose.

In [None]:
df.T
df.mean(axis=1)

In [None]:
(df.T - df.mean(1)).T

In [None]:
(df.T - df.mean(1)).T.mean(axis=1)

In [None]:
df_c = (df.T - df.mean(1)).T
df_c

In [None]:
# Check that it worked
df_c.mean(1)

In [None]:
df = pd.DataFrame({
    'x': [1, 2, 3],
    'y': [1., 2., 3.]
})
df
df.dtypes

In [None]:
df.dtypes

In [None]:
df.T.dtypes

In [None]:
df.T.T.dtypes

### `DataFrame` | Exercise 1.

You are provided with a simple numeric data frame. Standardize it both column and row-wise.
In other words both column and row means should be $0$ and standard deviations (and variances) should be $1$.

Remember that the formula for standardization is the following:

$$X_{\text{standardized}} = \frac{X - \text{Mean}(X)}{\text{Std}(X)}$$

In [None]:
np.random.seed(101)

df = pd.DataFrame(np.random.normal(100, 15, (10, 3)), columns=['x', 'y', 'z'])
df

In [None]:
# Your solution

## Flexible operations (arithmetic, logical etc.)

We managed to implement a row-wise operation with some smart use of transposition. However, this seems rather hacky
and we would like to have better tools for doing just that. Happily, Pandas provides us with such tools.

`DataFrame` and `Series` objects in Pandas implement special methods called _flexible operations_. They are
just standard arithmetic and logical operations, but ones that can be explicitly applied along a given axis.
Moreover, they can also automatically substitute NaNs which are created during label alignment.

**Flexible arithmetic binary operations**

1. `add`
2. `sub`
3. `div`
4. `mul`
5. `pow`

**Flexible logical binary operations**

1. `eq` (equal)
2. `ne` (not equal)
3. `lt` (less than)
4. `gt` (greater than)
5. `le` (less or equal)
6. `ge` (greater or equal)
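
The logical variants listed above accept the same `axis` argument as the arithmetic ones. As a small sketch (the frame `frame` and its labels are hypothetical), we can mark the cells that hold the maximum of their own row by aligning the series of row maxima on the row index:

```python
import pandas as pd

# Hypothetical frame with labelled rows
frame = pd.DataFrame({"x": [1, 5], "y": [4, 2]}, index=["r1", "r2"])
row_max = frame.max(axis=1)              # one value per row

# axis=0 matches the series index with the row index of the frame
is_row_max = frame.eq(row_max, axis=0)
print(is_row_max)
```

With the plain `==` operator the series would be matched against column labels instead, which is not what we want here.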

In [None]:
df = pd.DataFrame(np.random.normal(100, 15, (10, 2)), columns=['x', 'y'])
df

In [None]:
df - df.mean(0)

In [None]:
# FLEXIBLE OPERATIONS APPROACH
df.sub(df.mean(0), axis=1)

In [None]:
(df.T - df.mean(1)).T

In [None]:
df.sub(df.mean(1), axis=0)

In [None]:
# Adding partially matching series with automatic substitution of NaNs
s1 = pd.Series([1, 2, np.nan], index=['a', 'b', 'c'])
s1

In [None]:
s2 = pd.Series([6, 4, 7, 8], index=['a', 'd', 'f', 'b'])
s2

In [None]:
s1 + s2

In [None]:
index_union = s1.index.union(s2.index)
s1.reindex(index_union).fillna(0) + s2.reindex(index_union).fillna(0)

In [None]:
s1.add(s2, fill_value=0)

In [None]:
# What happened step by step
# Step 1. Index union.
index_union = s1.index.union(s2.index)
index_union

In [None]:
# Step 2. Reindex series
s1 = s1.reindex(index_union)
s1

In [None]:
s2 = s2.reindex(index_union)
s2

In [None]:
# Step 3. Check where both series have NaNs
nan_both = s1.isna() & s2.isna()
nan_both

In [None]:
# Step 4. Fill NaNs where only one series has missing data
s1[~nan_both] = s1[~nan_both].fillna(0)
s1

In [None]:
s2[~nan_both] = s2[~nan_both].fillna(0)
s2

In [None]:
# Step 5. Carry out the operation
s1 + s2

### `DataFrame` | Exercise 2.

You are provided with a simple numeric data frame (again!). Normalize it by rows and by columns with Min-Max scaling.
The lowest value should be $0$ and the highest should be $1$.

$$\frac{X - \text{Min}(X)}{\text{Max}(X) - \text{Min}(X)}$$

Do it two times separately: once normalizing columns and once normalizing rows. Do not normalize both rows and columns
at the same time, as with Min-Max scaling such an operation does not make any sense! If you are curious, you
can do this and check what the data frame looks like in this case.

In [None]:
np.random.seed(101)

df = pd.DataFrame(np.random.normal(100, 15, (10, 3)), columns=['x', 'y', 'z'])
df

In [None]:
# Your solution

### `DataFrame` | Exercise 3.

You are provided with a list of subject ids and two sets of measurements for those subjects from an experiment
with two trials. However, some subjects may have participated in only one or even none of the trials.

Your task is to compute average scores. For subjects with only one score the available score should be presented.
Subjects with no data should be assigned with $-999$.

HINT. You may want to use `.reindex()` method.

In [None]:
np.random.seed(303)

subj = pd.Index(range(30)) # List of subject ids
x1 = pd.Series(np.random.normal(90, 10, (25,)), index=np.random.choice(subj, size=(25,), replace=False))
x2 = pd.Series(np.random.normal(100, 15, (20,)), index=np.random.choice(subj, size=(20,), replace=False))

In [None]:
# Your solution

## `DataFrame` | Column and row-wise operations aka _apply_

We can do a lot with vectorization and labels alignment. However, sometimes we may want to apply arbitrary
functions to columns or rows. For this we can use `.apply()` method.

Apply is a sort of map-like statement in which one specifies a function that will be applied to every item
(in this context items are rows or columns). Below is a simple example.

In [None]:
np.random.seed(101)

df = pd.DataFrame(np.random.normal(100, 15, (10, 3)), columns=['x', 'y', 'z'])
df

In [None]:
np.log(df)

In [None]:
# Apply logarithm to columns
df.apply(np.log)

In [None]:
len(df)

In [None]:
df.apply(len, axis=0)

In [None]:
df.sum(axis=1)

In [None]:
# Apply sum to rows (axis=1)
df.apply(np.sum, axis=1)

In [None]:
df[df > 100].count()
df.apply(lambda x: x[x > 100].count())

In [None]:
# Apply as column filter
df.apply(lambda x: x[x > 100])

In [None]:
df[df > 100]

In [None]:
# Non-trivial apply
# Interquartile range per column
#df.quantile(.75) - df.quantile(.25)

df.apply(lambda x: x.quantile(.75) - x.quantile(.25))
df.apply(lambda x: x.quantile(.75) - x.quantile(.25), axis=1)

We can also apply a function to rows. To do that we use the additional `axis` argument.

In [None]:
# Compute row ranges
df.apply(lambda x: x.max() - x.min(), axis=1)

When we work with `.apply()` on columns the situation is simple as we can expect that our function will be applied to
`Series` objects which are guaranteed to have fixed data types, so operations should be rather efficient.

On the other hand, it is not entirely clear what the representation of rows should be, as they may contain values of
different types. In general, rows are also represented as `Series` objects, but upcast to an
appropriate `dtype` that can store all the values. This means that row-wise apply will very often be
much slower.

We can see representation of a single item during apply by simply passing a function that prints items.

In [None]:
# Representation of columns
_ = df.apply(print, axis=0)

In [None]:
# Representation of rows
_ = df.apply(print, axis=1)

In both cases these are series with a nice fixed dtype `float64`. But this is so only because the data frame we use
is simple and contains only floating point numbers. See what happens when we really have mixed types.

In [None]:
import seaborn as sbn

iris = sbn.load_dataset('iris')

iris.head()

In [None]:
# Print row items
_ = iris.head().apply(print, axis=1)

We are forced to work with `object` series. This will usually negatively impact efficiency of our computations.
That is the reason why row apply is often slower.
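
The upcasting of rows can also be inspected without printing whole rows. The sketch below uses two hypothetical toy frames (not the iris data) and records the `dtype` of the row representation in each case.

```python
import pandas as pd

# Hypothetical toy frames, used only for illustration
numeric = pd.DataFrame({"count": [1, 2], "ratio": [0.5, 1.5]})
mixed = pd.DataFrame({"length": [5.1, 4.9], "species": ["setosa", "virginica"]})

# An integer and a float column: a row is upcast to float64
print(numeric.iloc[0].dtype)   # float64

# A float and a string column: a row falls back to the generic object dtype
seen = []
_ = mixed.apply(lambda row: seen.append(row.dtype), axis=1)
print(seen[0])                 # object
```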

If `Series` are returned by function used in apply they will be combined to form a data frame.

Below we show this with a non-trivial function that computes series with different fields for rows
of the iris dataset based on the value of species.

In [None]:
iris.head()

In [None]:
def row_func(row):
    if row['species'] == 'setosa':
        return pd.Series({'sepal': row['sepal_length'] / row['sepal_width']})
    return pd.Series({'petal': row['petal_length'] / row['petal_width']})

iris.apply(row_func, axis=1)

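Data frames also define the `.applymap()` method, which applies a function element-wise (in pandas 2.1 and later the same method is available as `DataFrame.map`, and `.applymap()` is deprecated).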

Below we use the method to find all the prime numbers in a data frame with non-negative integers.

CAUTION. The implementation of the test for primality we use here is SUPER BAD.

In [None]:
def is_prime(x):
    if x < 2:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True

idf = pd.DataFrame(np.random.randint(0, 100, (10, 20)))
idf

In [None]:
idf.applymap(is_prime)

In [None]:
idf[idf.applymap(is_prime)].fillna('')

### `DataFrame` | Exercise 4.

You are provided with a data frame of exam scores of students. Each student has three exam scores ranging from 0 to 100.
You have to convert scores to grades according to the grading scale below and compute average grades of students.

**Grading scale**

If score is:

* $< 60 \rightarrow 2$
* $< 80 \rightarrow 3$
* $< 90 \rightarrow 4$
* $\geq 90 \rightarrow 5$

In [None]:
df = pd.DataFrame([
    pd.Series([61, 70, 90], name='Alice'),
    pd.Series([50, 80, 91], name='Bob'),
    pd.Series([80, 90, 82], name='Freya'),
    pd.Series([90, 100, 92], name='Merlin')
])
df

In [None]:
# Your solution

### `DataFrame` | Exercise 5.

You are provided with a set of exam scores for $14$ students (in the form of a `dict`).
Your task is to build a data frame in which row index values are student names,
the first column stores exam scores and the second column stores grade according to the following rule:

If score is:

* $< 60 \rightarrow 2$
* $< 80 \rightarrow 3$
* $< 90 \rightarrow 4$
* $\geq 90 \rightarrow 5$

HINT. You may want to convert the `dict` to a `Series` first. Then you can use `.apply()` or `.map()`
to apply some computations to every element of the series.

In [None]:
scores = {
    'Alice': 75,
    'Bob': 80,
    'Kate': 82,
    'Dog': 99,
    'Han Solo': 55,
    'Rick': 100,
    'Morty': 82,
    'Santa': 62,
    'Curie': 92,
    'Isabelle': 88,
    'Stan': 71,
    'Kyle': 81,
    'Kenny': 90,
    'Cartman': 30
}

In [None]:
# Your solution

## `DataFrame` | Aggregation

Data frames in Pandas offer a quite convenient interface for computing multiple aggregate quantities in one go.

In [None]:
import numpy as np
import pandas as pd
import seaborn as sbn

iris = sbn.load_dataset('iris')

In [None]:
iris.head()

In [None]:
iris.loc[:, 'sepal_length':'petal_width'].agg([np.mean, np.std])

In [None]:
iris['species'].nunique()

In [None]:
iris.agg({
    'sepal_length': [np.mean, np.std],
    'species': 'nunique'
})

## Split-apply-combine and `.groupby()`

Split-apply-combine is a powerful strategy used frequently in data science, statistics and scientific computing.
The main idea is that we can decompose even very complicated computations into a sequence of three steps:

* **Split.** In this step we split our dataset into smaller datasets based on some criterion, for instance based
on the values of some categorical variables.
* **Apply.** In this step a function or a set of functions is applied to the smaller datasets. Here we often aggregate
them into single values or perform some other processing (e.g. filtering).
* **Combine.** In the end we combine the processed datasets back into one data frame.

One of the typical use cases of this approach is the computation of descriptive statistics for groups (e.g. group means).

Grouped operations in Pandas are defined via the `.groupby()` method. It returns a special object
that stores the parts of a dataset divided by a given criterion (usually by values of a column or columns).
It can be used to iterate over the dataset parts but also to apply different functions to them.
After a function is applied, the grouped object returns the results combined back into a single data frame.
This is how the _split-apply-combine_ strategy is implemented in Pandas.
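
To make the three steps concrete, the sketch below carries them out by hand on a tiny hypothetical dataset (`data`) and compares the result with `.groupby()`. It is meant only as an illustration of the idea, not as a description of how Pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical data: two groups with two values each
data = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 6.0]})

# Split: one sub-frame per distinct value of the criterion column
parts = {key: data[data["group"] == key] for key in data["group"].unique()}
# Apply: reduce every part to a single number
means = {key: part["value"].mean() for key, part in parts.items()}
# Combine: collect the results into one labelled object
manual = pd.Series(means, name="value").rename_axis("group")

# The same computation expressed with groupby
auto = data.groupby("group")["value"].mean()

print(manual)
print(auto)
```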

In [None]:
iris = sbn.load_dataset('iris')
iris.head()

In [None]:
# Group iris data by species
iris_g = iris.groupby(['species'])
iris_g

In [None]:
# Iterate over groups and their names (criterion values)
for name, group in iris_g:
    print(name, "\n=========\n", group.head())

In [None]:
# Compute group means
iris.groupby('species')[['sepal_length', 'sepal_width']].mean()

In [None]:
# Standard functions like mean can be used with even simpler syntax
iris.groupby('species').agg('mean')
iris.groupby('species').mean()

In [None]:
iris.head()

In [None]:
# We can also use apply and transform methods with groupby
# Below we standardize numeric variables in groups
num_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
z = iris.copy()
z[num_cols] = iris \
    .groupby('species')[num_cols] \
    .transform(lambda x: (x - x.mean()) / x.std())

z

In [None]:
# Check solution
z.groupby('species').agg([np.mean, np.var])

In [None]:
# We may use just a subset of columns
Q = iris.groupby('species')['sepal_length'] \
    .apply(lambda x: x.quantile([0, .25, .5, .75, 1]))
Q

In [None]:
# And apply different functions to different columns
iris.groupby('species').agg({
    'sepal_length': [np.mean, np.std],
    'petal_length': [np.mean, np.var],
    'species': ['nunique']
})

In [None]:
# Of course we can also ask for group sizes
iris.groupby('species').size()

All the methods above are useful but fall short when we need to perform complex computations depending on multiple
columns in groups. Luckily, there is a trick that we can use while using the `.apply()` method that allows us
to define arbitrary group computations using multiple columns as well as any other values we may need.

Below we compute average of ratios of sepal / petal lengths and widths in groups.

In [None]:
iris.groupby('species') \
    .apply(lambda gdf: pd.Series({
        'sepal': (gdf['sepal_length'] / gdf['sepal_width']).mean(),
        'petal': (gdf['petal_length'] / gdf['petal_width']).mean()
    }))

Of course it is also possible to group by multiple columns.

Below we show it using a famous dataset about Titanic survivors and casualties.

In [None]:
titanic = sbn.load_dataset('titanic')
titanic.head()

In [None]:
titanic.groupby(['class', 'embark_town']).size()

### `DataFrame` | Exercise 6.

Check how class and gender were correlated with the chance of survival on the Titanic.

HINT. Use `groupby` (duh!)

HINT2. Note that the variable `survived` is a simple binary vector, so you can compute its mean.

In [None]:
titanic = sbn.load_dataset('titanic')
titanic.head()

In [None]:
# Your solution

## Index hierarchies aka `MultiIndex`

So far we limited our attention to data structures with flat indexes that have only one level and (usually) map one unique
index value to one data value. But sometimes we may need something more elaborate. 

For instance, in one of the exercises we used a simple dataset with multiple exam scores for students.

In [None]:
df = pd.DataFrame([
    pd.Series([61, 70, 90], name='Alice'),
    pd.Series([50, 80, 91], name='Bob'),
    pd.Series([80, 90, 82], name='Freya'),
    pd.Series([90, 100, 92], name='Merlin')
])
df

In Pandas we may choose to represent the above data as a series object with two-level index that maps every data value
to student name and the exam number.

In [None]:
s = df.stack()
s

In [None]:
s.index

In [None]:
type(s)

Now we can use both index levels to address particular parts of our data.

In [None]:
s['Alice']

In [None]:
s['Alice':'Freya']

We can also use values on both levels at once. We pass them as a tuple.

In [None]:
s

In [None]:
s[('Merlin', 2)]

Indexing on one of the nested levels is a special kind of operation called _cross-sectioning_.
We have to use a special method to do this.

In [None]:
# Get scores for second test
s.xs(2, level=1)

We can also easily convert back and forth between multi index representation and simpler data frame representation.

In [None]:
df = s.reset_index().rename(columns={'level_0': 'name', 'level_1': 'exam', 0: 'score'})
df

In [None]:
df2 = df.set_index(['name', 'exam'])
df2

In [None]:
df2.loc[('Bob', 0):('Freya', 1), 'score']

In [None]:
df2.xs(2, level=1)

One of the typical situations in which we encounter multi indexes is group by computations, in particular when we split
by multiple columns. For instance, when we counted passengers from different ports by ticket class in the Titanic dataset, what we got was a series with a two-level row index.

In [None]:
# Data frame with two-level row index
titanic = sbn.load_dataset('titanic')

gdf = titanic.groupby(['class', 'embark_town'])[['survived', 'fare']].mean()
gdf

We can also use indexes in group by. For this we specify `level` of the index that we will want to use to split our data.

In [None]:
gdf.groupby(level=0).mean()

In some sense, multi indexes are just sequences of unique tuples.

In [None]:
gdf.index

We can create `MultiIndex` objects by hand from sequences of tuples or from a cartesian product of multiple sequences of values.

In [None]:
# From list of tuples
pd.MultiIndex.from_tuples([
    ('a', 1),
    ('a', 2),
    ('b', 1),
    ('b', 2),
    ('b', 3)
])

In [None]:
# From cartesian product
pd.MultiIndex.from_product([
    ['a', 'b', 'c'], 
    [1, 2, 3, 4]
])
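
The "sequence of tuples" view can be checked by iterating over a `MultiIndex`. The sketch below also attaches hypothetical level names (`letter`, `number`) and uses one of them to move a level from rows to columns with `.unstack()`; the data values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Small cartesian product with named levels (names are hypothetical)
mi = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["letter", "number"])

# Iterating over a MultiIndex yields one tuple per entry
pairs = list(mi)
print(len(pairs), pairs[0] == ("a", 1))

# A series with this index can be reshaped into a letter x number table
ser = pd.Series(np.arange(4), index=mi)
wide = ser.unstack("number")
print(wide)
```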

### `MultiIndex` | Exercise 1.

You are provided with a set of multiple measurements (5) for ten persons. They are arranged in a 10 by 5 NumPy array.
Some values are missing (NaN). Your task is to arrange the data in a single `Series` object with an index that properly differentiates
between persons and measurements.

Use your data structure to compute the number of measurements per subject and the number of subjects per measurement.

Finally, use your data structure (without changing it) to compute
average values by person and by measurement.

In [None]:
X = np.where(np.random.uniform(0, 1, (10, 5)) < .1, np.nan, np.random.normal(100, 15, (10, 5)))
X

In [None]:
# Your solution