# Pandas | Core concepts, types and methods

Szymon Talaga | 08.01.2020

<hr>

In this notebook we will carefully develop a proper, in-depth understanding of the core concepts, types (classes) and methods provided by Pandas.
They will be discussed in relation to Numpy in order to clarify the typical use cases of both packages as well as
their strong and weak points.

Pandas is focused around two main data structures:

* `Series` : a one-dimensional sequence of values with a fixed data type.
* `DataFrame` : a two-dimensional rectangular table of values with rows and columns. Each column has a fixed data type, as it is represented by a `Series` object.

In fact, Pandas provides more data structures, but they are useful in rather specific circumstances so we will not discuss them here.

<hr>

Pandas is internally based on Numpy and its types such as `Series` and `DataFrame` are to a large extent compatible with many Numpy functions.
However, there are also very important differences between them.

The crucial difference concerns the way axes and their indexes are designed which also means that the notion of broadcasting in Pandas is completely different.

In Numpy indexes of axes are defined **implicitly** by the fact that elements are arranged sequentially, so they can be assigned with integer coordinates
ranging from $0$ to $n-1$ where $n$ is the number of elements along a given axis.

In Pandas the way indexes work is entirely different. Here indexes are **explicit**, they exist as separate Python objects, and are defined as sets of labels (or even multilevel hierarchies of labels)
for elements along a given axis. Therefore, the rules of broadcasting in Pandas are not determined by conformability of shapes of arrays like in Numpy, but by alignment (congruence) of labels.
We will discuss the details of broadcasting and alignment in Pandas while discussing `Series` and `DataFrame` types.
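
The difference can be seen in a minimal sketch. The arrays and labels below are made up purely for illustration: Numpy adds elements by position, while Pandas first matches them by label.

```python
import numpy as np
import pandas as pd

# Numpy: elements are matched by position (implicit index 0, 1, 2)
a = np.array([1, 2, 3])
b = np.array([30, 10, 20])
positional = a + b

# Pandas: elements are matched by label, regardless of their order
sa = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
sb = pd.Series([30, 10, 20], index=['z', 'x', 'y'])
aligned = sa + sb

print(positional)
print(aligned)
```

Here `positional` pairs `1` with `30`, whereas `aligned` pairs the value labeled `'x'` in one series with the value labeled `'x'` in the other.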

<hr>

Many great resources about Pandas can be found in the official documentation. In particular, it is recommended to read the following articles:

* [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
* [Essential basic functionality](https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html)
* [Intro to data structures](https://pandas.pydata.org/pandas-docs/stable/user_guide/dsintro.html)
* [Indexing and selecting data](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)

<hr>

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
### Configure IPython shell to print all outputs generated in a code cell
### --------------------------------------------------------------------------
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

## Series | 1D sequences of values with fixed dtypes

The simpler and more basic of the two core types in Pandas is `Series`. Objects of this type are used to store one-dimensional ordered sequences of values of fixed data type
(in most cases these are standard Numpy `dtypes`) aligned along an index which is a sequence of labels that identify individual elements. So a series has the following structure:

```python
Index  |  Value
===============
  b    |    5
  a    |    3
  f    |    1
  c    |    3
===============
```

Importantly, `Series` objects behave both like dictionaries / mappings and like lists / 1D arrays. In other words, they support both the label / key based indexing characteristic of mappings
and the numerical / positional indexing characteristic of sequences / lists.
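
A small sketch of this dual nature, using the example series from the diagram above (the variable names are only illustrative):

```python
import pandas as pd

s = pd.Series([5, 3, 1, 3], index=['b', 'a', 'f', 'c'])

# Dictionary-like behavior: membership tests and keys refer to labels
has_label = 'a' in s          # 'a' is a label
has_value = 5 in s            # 5 is a value, not a label
labels = list(s.keys())

# Sequence-like behavior: length and positional access
n = len(s)
first = s.iloc[0]

print(has_label, has_value, labels, n, first)
```

Note that the `in` operator checks labels, not values, exactly as it checks keys in a standard `dict`.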

A `Series` object may be created with a type constructor from any object that can be interpreted as 1D array / sequence.

In [None]:
## Sequence-based initialization
s = pd.Series([5, 3, 1, 3])
s

The first column in the printed output shows the index and the second the data values. Note that in this case a default generic index was created (a sequence from `0` to `n-1`).
However, we may also explicitly assign an index composed of any values we want.

In [None]:
## Sequence-based initialization with arbitrary index
s = pd.Series([5, 3, 1, 3], index=['b', 'a', 'f', 'c'])
s

Another very convenient approach is to create a series from a dictionary.

In [None]:
## Dictionary-based initialization
s = pd.Series({ 'b': 5, 'a': 3, 'f': 1, 'c': 3 })
s

Looks good. But what about order? We discussed earlier that dictionaries in Python were inherently unordered. This is no longer true.
Since Python 3.7 `dict` objects keep the insertion order (the order in which their keys were provided).
Pandas uses this fact when creating `Series` objects. In older versions of Python and/or Pandas the index is sorted lexicographically when
a series is built from a dictionary.
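
The sketch below checks this behavior, assuming Python 3.7+ and a reasonably recent Pandas version. If a sorted index is needed, it has to be requested explicitly with `.sort_index()`.

```python
import pandas as pd

d = {'b': 5, 'a': 3, 'f': 1, 'c': 3}
s = pd.Series(d)

# The index follows the insertion order of the dictionary keys
print(list(s.index))

# Sorting by label has to be requested explicitly
print(list(s.sort_index().index))
```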

If a series is initialized with values of different types it will be upcast to the most general data type.

In [None]:
# Upcasting of integers to floats
pd.Series([1, 2, 3.5])

In [None]:
# Upcasting to arbitrary python objects
pd.Series([1, 'string', [1,2,3]])
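
To check which data type was chosen we can inspect the `.dtype` attribute. The sketch below repeats the two examples above and adds an integer-only series for comparison (the variable names are illustrative).

```python
import pandas as pd

ints = pd.Series([1, 2, 3])
mixed_numeric = pd.Series([1, 2, 3.5])
mixed_objects = pd.Series([1, 'string', [1, 2, 3]])

print(ints.dtype)           # an integer dtype (its width may depend on the platform)
print(mixed_numeric.dtype)  # floating point dtype
print(mixed_objects.dtype)  # generic Python object dtype
```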

A series can be also created from a scalar (single) value. If additionally an index is provided then a series as long as the index is created
and it is populated with the constant value.

In [None]:
## Single value series as there is no index
pd.Series(5)

In [None]:
## Series from scalar and index
pd.Series(5, index=['a', 'b', 'c'])

The index of a series can be retrieved via the `.index` attribute and the underlying values via the `.values` attribute.

In [None]:
s.index

In [None]:
s.values

Numpy representation of the underlying data can be retrieved with `.to_numpy()` method.

In [None]:
s

In [None]:
s.to_numpy()

Moreover, series can be assigned with names. We will see how it can be useful later on when we discuss data frames.

In [None]:
s.name = 'a series'
s

In [None]:
pd.Series([1, 2, 3], name='name')

### Series | Indexing & slicing

Similarly to data frames, series support three main types of indexing.

#### Standard dictionary syntax (aka _getitem_)

It is convenient as it allows us to use both label-based and position-based indexing and slicing. However, due to its flexibility it can be
confusing, as it has to guess the intent of the user based on the type of the indexer.

That is why it is usually better to use more explicit `.loc` and `.iloc` indexers.

In [None]:
s

In [None]:
# Label-based index
s['f']

In [None]:
# Label-based slice
s['a':'c']

In [None]:
# List of labels
s[['a', 'f']]

In [None]:
# Position-based (integer) index (deprecated in recent Pandas versions; prefer `s.iloc[2]`)
s[2]

In [None]:
# Position based (integer) slice
s[1:3]

In [None]:
# List of (integer) positions (deprecated in recent Pandas versions; prefer `s.iloc[[0, 2]]`)
s[[0, 2]]

In [None]:
x = pd.Series([2, 3, 4, 5], index=[2, 0, 3, 1])
x

In [None]:
x[0]
x[0:2]

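This is all nice. However, with an integer index the _getitem_ syntax becomes ambiguous: a scalar integer such as `x[0]` is interpreted as a label, while an integer slice such as `x[0:2]` is typically interpreted positionally.
This means that with this method we **can not** reliably tell from the code alone whether a label or a position is meant for series with integer labels.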
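
It is worth checking how a scalar integer key is actually resolved when the index itself holds integers. In the sketch below (the same `x` as above) plain `[]` returns the element labeled `0`, so `.loc` and `.iloc` remain the unambiguous choices.

```python
import pandas as pd

x = pd.Series([2, 3, 4, 5], index=[2, 0, 3, 1])

by_getitem = x[0]        # resolved as the label 0
by_label = x.loc[0]      # always the label 0
by_position = x.iloc[0]  # always the first element

print(by_getitem, by_label, by_position)
```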

#### Label-based indexing (aka `.loc` indexing)

The `.loc` attribute is defined on every `Series` (as well as `DataFrame`) object and it returns a special indexer object that allows
us to query our data based on the explicit index labels, even if they are integers.

NOTE. Slices in `.loc` indexer always return ranges **including** the rightmost element.

In [None]:
s

In [None]:
s.loc

In [None]:
s.loc['a']

In [None]:
s.loc['a':'f']

Label `.loc` indexing can also use lists of labels.

In [None]:
s.loc[['a', 'f']]

In [None]:
# Works with integers labels too
x = pd.Series([1, 2, 3], index=[2, 0, 1])
x

In [None]:
x.loc[0]

In [None]:
x.loc[0:1]

#### Positional indexing (aka `.iloc` indexing)

The `.iloc` attribute returns a specialized indexer object that allows us to query our data with positional (integer) indexes.

NOTE. Slices in `.iloc` follow standard Python semantics and **do not** include the rightmost element.

In [None]:
x

In [None]:
x.iloc[0]

In [None]:
x.iloc[0:2]

Positional `.iloc` indexing can be also used like integer indexing in Numpy.

In [None]:
s.iloc[[0, 2]]
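
The different slice semantics of `.loc` and `.iloc` are easy to mix up, so here is a minimal side-by-side sketch using the string-labeled series from the beginning of this section.

```python
import pandas as pd

s = pd.Series([5, 3, 1, 3], index=['b', 'a', 'f', 'c'])

label_slice = s.loc['a':'f']   # labels 'a' and 'f', right end included
pos_slice = s.iloc[1:2]        # position 1 only, right end excluded
pos_slice_wide = s.iloc[1:3]   # positions 1 and 2, same elements as the label slice

print(label_slice)
print(pos_slice)
```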

#### Boolean indexing

There is also a fourth type of indexing that can be performed equally well with _getitem_, `.loc` and `.iloc` indexers.
It is of course boolean indexing, which works the same as in Numpy.

In [None]:
mask = [True, False, False, True]

# getitem
s[mask]

In [None]:
# .loc
s.loc[mask]

In [None]:
# .iloc
s.iloc[mask]
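
In practice masks are rarely written by hand. They are usually computed from a condition, which yields a boolean series. The sketch below assumes that `.iloc`, being purely positional, should be given a plain boolean array rather than a labeled boolean series, hence the `.to_numpy()` call.

```python
import pandas as pd

s = pd.Series([5, 3, 1, 3], index=['b', 'a', 'f', 'c'])

mask = s > 2                        # boolean series with the same index as `s`

via_getitem = s[mask]
via_loc = s.loc[mask]
via_iloc = s.iloc[mask.to_numpy()]  # pass a plain boolean array to `.iloc`

print(via_getitem)
```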

## Series | setting values

All types of indexing discussed above can be used to set new values to a series, including creating new entries with new labels.

In [None]:
x = pd.Series({ 'Alice': 10, 'Bob': 11 })
print("\n", x)

In [None]:
x['Alice'] = 9
x

In [None]:
x.loc['Bob'] = 0
x

In [None]:
x.iloc[0] = 100
x

In [None]:
x.loc[['Alice', 'Bob']] = 1
x

In [None]:
x.loc[['Alice', 'Bob']] = [1, 2]
x

In [None]:
x.loc['Mark'] = 777
x
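
One caveat, shown as a small sketch: creating new entries works with labels, but positional assignment cannot enlarge a series. The out-of-range position `10` below is chosen arbitrarily for illustration.

```python
import pandas as pd

x = pd.Series({'Alice': 10, 'Bob': 11})

# Label-based assignment with an unseen label appends a new entry
x.loc['Mark'] = 777

# Positional assignment outside the current range raises an error
try:
    x.iloc[10] = 0
    enlarged = True
except IndexError:
    enlarged = False

print(x)
print(enlarged)
```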

### Series | vectorization between series and scalars

Vectorization between series and scalars in Pandas is as simple and trivial as in Numpy.

In [None]:
s

In [None]:
## Addition
s + 2

In [None]:
## Subtraction
s - 2

In [None]:
## Multiplication
s * 2

In [None]:
## Division
s / 2

In [None]:
## Integer division
s // 2

In [None]:
## Modulo
s % 2

In [None]:
## Raising to power
s ** 2

In [None]:
## Mathematical functions from Numpy
np.exp(s)

The same applies, of course, to logical operators such as equality or negation.
Vectorization of logical expressions is very useful for creating masks for boolean indexing.

In [None]:
x = pd.Series(np.random.normal(0, 1, (10,)))
x

#x > 0
x[x > 0]

In [None]:
s

In [None]:
## Equality
s == 3

In [None]:
## Negation of boolean expression
~(s == 3)
## Of course equivalent to
s != 3

Similarly, most of the standard Numpy mathematical functions vectorize properly over Pandas series as long as they are of numeric `dtype`.

In [None]:
## Exponentiation
np.exp(s)

In [None]:
## Natural logarithm
np.log(s)

In [None]:
## Trigonometric functions (e.g. sine)
np.sin(s)

### Series | broadcasting & alignment

As already mentioned, broadcasting in Pandas is very different from broadcasting in Numpy, as it is organized around the idea of _label alignment_.
Luckily, this also means that the rules of broadcasting are somewhat simpler in Pandas.

Alignment of labels is based on the notion of union of sets. If we have two sets $A = \{x,y\}$ and $B = \{y,z\}$ then their union is defined as:

$$A \cup B = \{x, y, z\}$$

In other words a union is a set of all unique elements together.

In Pandas any operation between two or more series (we will extend this discussion to data frames later on) is preceded by the axes alignment
stage in which the axes of both series are transformed to the union of the two axes. We can see this in action in the chunk below.

In [None]:
s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['a', 'c'])

print(s1)
print(s2)

s1 + s2

What happened? To see what is going on step-by-step we will have to replicate the axes alignment stage ourselves. To do so we will have to
use a few methods of the index objects. For now we show them without much commentary as we will discuss index type later on.

In [None]:
# The first step is to compute index union
#index_union = s1.index.union(s2.index)
index_union = s2.index.union(s1.index)
index_union

In [None]:
# The second step is to reindex the first series
s1 = s1.reindex(index_union)
s1

In [None]:
# The third step is to reindex the second series
s2 = s2.reindex(index_union)
s2

In [None]:
# The fourth and the last step is to perform the actual addition
s1 + s2

In [None]:
x1 = pd.Series([1, 2], index=['a', 'b'])
x2 = pd.Series([10, 20], index=['b', 'a'])

x1
x2

In [None]:
x1 + x2

In [None]:
x2 + x1

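Going back to the `s1 + s2` example: we get two NaN values there because after alignment each series has a NaN at a different position and, as we know, NaN values propagate through any computation they appear in. In contrast, `x1 + x2` and `x2 + x1` contain no NaNs: both series share the same labels, only in a different order, and values are matched by label rather than by position.

The `s1 + s2` example shows that NaN values can pop up frequently in Pandas as a byproduct of the axes alignment process.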

Thus, it is very important to understand how they work.

## Series | NaN values

**THIS DISCUSSION GENERALIZES TO DATA FRAMES**

In general Pandas NaNs are just standard Numpy `np.nan` objects, so we already know their mechanics.
However, the way they are handled in Pandas objects such as `Series` is slightly different.

Perhaps the most important difference is that aggregating methods in Pandas such as `.sum()` or `.mean()` (discussed in the next section) skip NaN values by default. Moreover, standard Python `None` values are also treated as NaN in Pandas.

In [None]:
s12 = s1 + s2
s12

In [None]:
# Discard NaNs (default behavior)
s12.sum()

In [None]:
# Do not discard NaNs
s12.sum(skipna=False)
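
For comparison, the sketch below puts the same values (chosen only for illustration) into a Numpy array and a Pandas series and aggregates both.

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 3.0]
arr = np.array(values)
ser = pd.Series(values)

numpy_sum = arr.sum()                  # NaN propagates in Numpy
pandas_sum = ser.sum()                 # NaN is skipped by default in Pandas
pandas_strict = ser.sum(skipna=False)  # NaN propagates again
pandas_mean = ser.mean()               # mean of the two non-NaN values

print(numpy_sum, pandas_sum, pandas_strict, pandas_mean)
```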

The other main difference is the thing that we already observed. NaN values are created in Pandas when new cells are created during axes alignment.

We can easily remove NaN values from a series with the `.dropna()` method.

In [None]:
x = pd.Series([1, 2, np.nan, 4])
print(x)
x.dropna()

In [None]:
x.isna()

In [None]:
x[x.isna()]

In [None]:
# Selecting non-NaN values with a negated mask is equivalent to `.dropna()`
x[~x.isna()]
x.dropna()

In [None]:
# Both `np.nan` and `None` are treated as NaN in Pandas
x = pd.Series([1, np.nan, None])
x

x.isna()

In [None]:
# Alias method
x.isnull()

It is also possible to fill in values in place of NaNs.

In [None]:
x = pd.Series([1, 2, np.nan, np.nan, 5], index=['a', 'b', 'c', 'd', 'e'])
x

In [None]:
x.fillna(99)
x.fillna(x.mean())

In [None]:
x.fillna(x.mean())
x

In [None]:
# In-place equivalent (`.fillna()` above returned a new series and left `x` unchanged)
x[x.isna()] = x.mean()
x

In [None]:
x = pd.Series([1, 2, np.nan, np.nan, 5], index=['a', 'b', 'c', 'd', 'e'])
x
# It is possible to fill different values for different labels
x.fillna({ 'c': 3, 'd': 4})

In [None]:
# In-place equivalent using boolean indexing
x[x.isna()] = [3, 4]
x

## Series | Aggregation

**The same methods are available also for data frames**

Pandas objects such as `Series` and `DataFrame`s support aggregation methods known from numpy. Below we list some of them.

In the case of series there is not much to aggregation, as it always leads to a single number.
However, in the case of data frames we can choose the axes we want to aggregate over, as we can in Numpy. We discuss this issue later on.

In [None]:
# Sum
s.sum()
# Mean
s.mean()
# Variance
s.var()
# Standard deviation
s.std()
# Minimum
s.min()
# Maximum
s.max()
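
One detail worth checking when comparing results with Numpy: `Series.var()` and `Series.std()` use the sample estimator (`ddof=1`) by default, while `np.var()` and `np.std()` default to the population estimator (`ddof=0`). A minimal sketch with illustrative values:

```python
import numpy as np
import pandas as pd

s = pd.Series([5, 3, 1, 3])

pandas_var = s.var()                            # ddof=1 by default
numpy_var = np.var(s.to_numpy())                # ddof=0 by default
numpy_var_ddof1 = np.var(s.to_numpy(), ddof=1)  # matches the Pandas default

print(pandas_var, numpy_var, numpy_var_ddof1)
```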

We can also ask for labels of maximum and minimum elements.

In [None]:
s
s.idxmax()
s.idxmin()

Moreover, we can use standard boolean aggregators.

In [None]:
s
(s == 3).all()
(s == 3).any()

## Series | Testing membership

A very useful feature of Pandas series is that they provide an easy-to-use and efficient method to test whether values
in a series belong to some set of values. Below we extract from a series of positive integers only those that are prime numbers lower than $10$.

In [None]:
np.random.seed(9)

x = pd.Series(np.random.randint(1, 20, size=(30,)))
x

In [None]:
primes = [2, 3, 5, 7]

x[x.isin(primes)]

### Series | Exercise 1.

Change values for lowest and highest elements to -111.

In [None]:
np.random.seed(101)

s = pd.Series(np.random.randint(0, 100, (30,)))
s[:5]

In [None]:
# Your solution

### Series | Exercise 2.

Create two series with three numeric elements and use them to create a series with six elements filled with NaNs.
When creating the 6-elements series you can use only arithmetic operations.

HINT. Remember about the rules of alignment.

In [None]:
# Your solution

### Series | Iteration

The behavior of Pandas series in standard Python for-loop iteration is of course obvious. We iterate over values.

In [None]:
x = pd.Series([1, 2, 3, 4, 5])

for i in x:
    print(i)

However, since a series is also a kind of a `dict` we can also iterate over index values and data values in parallel
using the `.items()` method that we know from standard dictionaries.

In [None]:
x
for idx, val in x.items():
    print(idx, "=>", val)

In general, iterating over series should be avoided as it will be almost always slow. We should use the fact that internally
series store their data in Numpy-like arrays, so we should use vectorized operations.
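
As a small sketch of what this means in practice, the two computations below give the same result, but the second one avoids the Python-level loop. The expression `val ** 2 + 1` is an arbitrary example.

```python
import numpy as np
import pandas as pd

x = pd.Series(np.arange(1, 6))

# Explicit Python loop over (label, value) pairs
looped = pd.Series({idx: val ** 2 + 1 for idx, val in x.items()})

# Vectorized expression doing the same work in one step
vectorized = x ** 2 + 1

print(vectorized)
```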

However, sometimes this is not possible and we may want to apply some function elementwise to a series.
To do that we can use the `.map()` method defined on series objects.

It can be used to apply a function element-wise or to map elements of a series to new values based on a dictionary or other series.

In [None]:
np.random.seed(77)

x = pd.Series(np.random.choice(['a', 'b', 'c', 'd'], size=(20,)))
x

In [None]:
# Now we will recode the values
value_map = {
    'a': 11,
    'b': 22,
    'c': 33,
    'd': 44
}
x.map(value_map)

In [None]:
# We can do the same with a function
def value_map(x):
    if x == 'a':
        return 11
    if x == 'b':
        return 22
    if x == 'c':
        return 33
    if x == 'd':
        return 44
    return np.nan

x.map(value_map)
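
The function above returns `np.nan` for unknown values explicitly. With a dictionary the same happens automatically: values without a matching key are mapped to NaN. A minimal sketch with an illustrative unmapped value `'z'`:

```python
import pandas as pd

x = pd.Series(['a', 'b', 'z'])
value_map = {'a': 11, 'b': 22}

# 'z' has no entry in the dictionary, so it is mapped to NaN
recoded = x.map(value_map)

print(recoded)
```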

### Series | Exercise 3.

You are provided with a set of random codes composed of ASCII letters. Your task is to convert them to shorter codes according
to the following rules:

1. The shorter code should start with 1 if the longer code starts with A, with 2 if the longer starts with B, with 3 if the longer starts with C
and with 0 if the longer starts with any other letter.
2. Then append `#` (hash sign).
3. Then the length of the longer code should be appended.

**Example**

Long code: BACFTYSBFTYSGC

Short code: 2#14

In [None]:
from string import ascii_uppercase
long_codes = pd.Series("".join(np.random.choice(list(ascii_uppercase), size=(i,))) for i in np.random.randint(10, 50, (30,)))
long_codes

In [None]:
## Your solution

## Index | Explicit labeling for axes

As we already discussed, an index object is an integral element of every series (and, as we will see later, of every data frame).
It provides explicit labeling for the axis along which elements of a series are aligned.
In Numpy indexes are implicit and derived just from the ordering of elements. In Pandas they are full-fledged Python objects
with a lot of functionality of their own, so it is important that we understand their mechanics before discussing the most
important and most complex type in Pandas, that is, `DataFrame`.

Indexes may be created with a constructor in a similar way as series. Like series, they can also be given names.

In [None]:
import pandas as pd

In [None]:
idx = pd.Index(['a', 'b', 'c'])
idx

In [None]:
# Named index
idx = pd.Index(['a', 'b', 'c'], name='an index')
idx

Indexes can be passed to series constructors.

In [None]:
pd.Series([1, 11, 22], index=idx)
# Note that the name of the index is printed

In [None]:
pd.Series([1, 11, 22], index=['a', 'b', 'c']).index

In general, indexes behave like multisets and may contain duplicate labels. However, this situation should usually be avoided
as some procedures in Pandas are not implemented for this case and will raise an error. Moreover, this kind of ambiguity
will sooner or later lead to problems and/or errors in computations.

In [None]:
# Duplicate labels are problematic
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'a', 'c'])
s

In [None]:
s['a']

But they may be also useful in some cases. We will discuss this issue later.
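
A quick way to detect duplicate labels is sketched below with the same series as above. Note how label-based selection returns a series for a duplicated label and a scalar for a unique one.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'a', 'c'])

unique = s.index.is_unique        # are all labels distinct?
dup_mask = s.index.duplicated()   # marks repeated occurrences of a label
selected = s.loc['a']             # a series with both 'a' entries
single = s.loc['b']               # a scalar, since 'b' is unique

print(unique)
print(dup_mask)
print(selected)
print(single)
```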

Indexes and labels in Pandas can be very powerful, but sometimes they can be annoying and stand in our way. A typical example of such a situation
is when we have two series of the same length which we want to add together, but they come from different sources and may have
different indexes, so in the alignment stage NaN values will be produced.

Luckily, indexes can be reset and changed to generic integer indexes (similar to implicit Numpy indexes) at any moment.
We illustrate this below.

In [None]:
s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6], index=['a', 'b', 'c'])

s1
s2

In [None]:
s1 + s2

This did not work because the labels are completely different between the series. The first one has generic integer labels, but the second
one has string labels. To solve this problem we can reset the index of the second series.

In [None]:
s2
s2.reset_index(drop=True)

In [None]:
s1.reset_index(drop=True) + s2.reset_index(drop=True)

In [None]:
s1.reset_index(drop=False)

In [None]:
s2.name = 'Some Series'
s2.index.name = 'An Index'
s2
s2.reset_index(drop=False)

A new index can be added to a series by assigning to the `.index` attribute.

In [None]:
s = pd.Series([1, 2, 3])
s
idx = pd.Index(['a', 'b', 'c'])
s.index = idx
s

In [None]:
# The above is equivalent to
s = pd.Series([1, 2, 3])
s.index = ['a', 'b', 'c']
s.index

However, if we try to assign an index that is longer (or shorter) than our data series we will get an error.

In [None]:
s = pd.Series([1, 2, 3])
s.index = ['a', 'b', 'c', 'd']

In order to reshape the index of an existing series we have to use the `.reindex()` method. This method can also be used to reorder the labels
of a series (together with their values). Labels that were not present in the original index are assigned NaN values.

In [None]:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
s
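
A minimal sketch of `.reindex()` on this series: the new list of labels reorders some existing labels, leaves out others, and introduces a label `'z'` that is not in the original index (the label is chosen only for illustration).

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Reorder existing labels, drop 'd' and 'e', and introduce a new label 'z'
r = s.reindex(['c', 'a', 'z', 'b'])

print(r)
print(s)  # the original series is left unchanged
```

Since NaN is a floating point value, the reindexed series is upcast to a float `dtype`.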

A very important kind of operation one can perform on indexes is to combine them in various ways.
In general indexes are treated as sets, so we can define basic set operations between them such as union and intersection.

As we already discussed, the alignment of labels (Pandas broadcasting) involves creation of a union of indexes of two series.
However, in some cases we may want to think of different combinations of indexes. Below we study some possibilities.

In [None]:
s1 = pd.Series(range(10), index=range(10))
s2 = pd.Series(range(10, 20), index=range(5, 15))
s1
s2

In [None]:
s1 + s2

We can use union of the indexes for instance to combine the two series in such a way that they are added together where possible
and existing values are used where one series is missing.

In [None]:
idx_union = s1.index.union(s2.index)
idx_union

In [None]:
s1.reindex(idx_union).fillna(0) + s2.reindex(idx_union).fillna(0)

In [None]:
s1.reindex(idx_union).fillna(1) * s2.reindex(idx_union).fillna(1)
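
For the additive case, the reindex-and-fill pattern can also be written more compactly with the `.add()` method and its `fill_value` argument. This is a sketch of an alternative, not the approach used in the rest of this notebook.

```python
import pandas as pd

s1 = pd.Series(range(10), index=range(10))
s2 = pd.Series(range(10, 20), index=range(5, 15))

# Manual version: align both series on the union of labels and fill the gaps
idx_union = s1.index.union(s2.index)
manual = s1.reindex(idx_union).fillna(0) + s2.reindex(idx_union).fillna(0)

# `fill_value` substitutes 0 for a label that is missing in one of the series
shortcut = s1.add(s2, fill_value=0)

print(shortcut)
```

The analogous `.mul()` method with `fill_value=1` corresponds to the multiplication example.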

Similarly we can use intersection of indexes to limit our results only to those cases for which we have full information.

In [None]:
idx_intersect = s1.index.intersection(s2.index)
idx_intersect

In [None]:
(s1 + s2).reindex(idx_intersect)

In [None]:
s1.reindex(idx_intersect) + s2.reindex(idx_intersect)

In [None]:
idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])

idx1.difference(idx2)

In [None]:
idx1
idx2

idx1.difference(idx2)
idx2.difference(idx1)

In [None]:
idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([3, 4, 5, 6])

idx1.symmetric_difference(idx2)

In [None]:
idx1.difference(idx2).union(idx2.difference(idx1))

In the same fashion we can use the symmetric difference (labels present in one of the indexes but not in both).

In [None]:
idx_symmetric = s1.index.symmetric_difference(s2.index)

In [None]:
s1.reindex(idx_symmetric)

In [None]:
s2.reindex(idx_symmetric)

In [None]:
s1.reindex(idx_symmetric).fillna(0) + s2.reindex(idx_symmetric).fillna(0)

### Index | Exercise 1.

You are provided with two data series with measurements for subjects identified with integer indexes. The first series corresponds to the first
trial and second one to the second trial. However, some subjects participated only in one of the trials.

You have to execute the following tasks:

1. Compute a series with sums of scores for all subjects.
2. Compute a series with average scores for all subjects.
3. Find ids (labels) for subjects with highest and lowest average score.
4. Compute average score among subjects who participated in both trials.
5. Compute average score among subjects who participated in only one of the trials.

In [None]:
import numpy as np
np.random.seed(101)

ids = np.arange(25)
trial1 = pd.Series(np.random.normal(100, 15, (22,)), index=np.random.choice(ids, size=(22,), replace=False))
trial2 = pd.Series(np.random.normal(115, 30, (13,)), index=np.random.choice(ids, size=(13,), replace=False))

In [None]:
# Your solution