# Introduction to data manipulation with pandas

1. What is Pandas?

- a powerful data analysis and manipulation library for Python
- a Python package providing fast, flexible, and expressive data structures designed to make working 
  with "relational" or "labeled" data both easy and intuitive.


Aim:
- to be the fundamental high-level building block for doing practical, **real world** data analysis in Python;
- to become **the most powerful and flexible open source data analysis / manipulation tool available in any language**.


It is already well on its way toward this goal.

2. Main features

 

  - Easy handling of missing data in floating point as well as non-floating
    point data.
  - Size mutability: columns can be inserted and deleted from DataFrame and
    higher dimensional objects
  - Automatic and explicit data alignment: objects can be explicitly aligned
    to a set of labels, or the user can simply ignore the labels and let
    `Series`, `DataFrame`, etc. automatically align the data for you in
    computations.
  - Powerful, flexible group by functionality to perform split-apply-combine
    operations on data sets, for both aggregating and transforming data
  - Make it easy to convert ragged, differently-indexed data in other Python
    and NumPy data structures into DataFrame objects.
  - Intelligent label-based slicing, fancy indexing, and subsetting of large
    data sets.
  - Intuitive merging and joining data sets.
  - Flexible reshaping and pivoting of data sets.
  - Hierarchical labeling of axes (possible to have multiple labels per tick).
  - Robust IO tools for loading data from flat files (CSV and delimited),
    Excel files, databases, and saving/loading data from the ultrafast HDF5
    format.
  - Time series-specific functionality: date range generation and frequency
    conversion, moving window statistics, date shifting and lagging.

3. Installing and Importing Pandas

- Details on Pandas installation can be found in the Pandas documentation (https://pandas.pydata.org/).
- For Anaconda users, Pandas is already installed. Once Pandas is installed, it can be imported and its version checked using these commands:

In [407]:
import pandas
pandas.__version__

'1.1.1'

In [408]:
import pandas as pd

In [409]:
 pd?

4. Introducing Pandas Objects

The three fundamental Pandas data structures are:

- the `Series`,
- the `DataFrame`,
- and the `Index`.

Let us start this section with the standard **NumPy** and **Pandas** imports:

In [410]:
import numpy as np
import pandas as pd

4.1 The Pandas Series Object

A Pandas Series is a one-dimensional array of indexed data. It can be created from a list or array as follows:


In [411]:
ds = pd.Series([0.2, 0.35, 0.85, 1.0])
ds


0    0.20
1    0.35
2    0.85
3    1.00
dtype: float64

In [412]:
ds.values

array([0.2 , 0.35, 0.85, 1.  ])

In [413]:
ds.index



RangeIndex(start=0, stop=4, step=1)

In [414]:
ds[2]

0.85

In [415]:
ds[1:4]

1    0.35
2    0.85
3    1.00
dtype: float64

In [416]:
ds = pd.Series([0.2, 0.35, 0.85, 1.0], index=['a','b','c','d'])
ds


a    0.20
b    0.35
c    0.85
d    1.00
dtype: float64

And the item access works as expected:

In [417]:
ds['c']

0.85

We can even use non-contiguous or non-sequential indices:

In [418]:
ds = pd.Series([0.2, 0.35, 0.85, 1.0], index=[5,45,2,100])
ds

5      0.20
45     0.35
2      0.85
100    1.00
dtype: float64

In [419]:
ds[2]

0.85

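Series as specialized dictionary

A `Series` can also be thought of as a specialization of a Python dictionary: it maps a set of index labels to a set of values. This can be made even clearer by constructing a `Series` object directly from a Python dictionary: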

In [420]:
population_dict = {'Montpellier': 290053,
                   'Paris': 2175601,
                   'Troyes': 61996,
                   'Marseille': 868277,
                   'Lyon': 518635}
population = pd.Series(population_dict)
population

Montpellier     290053
Paris          2175601
Troyes           61996
Marseille       868277
Lyon            518635
dtype: int64

In [421]:
population['Montpellier']

290053

In [422]:
population['Montpellier':'Troyes']

Montpellier     290053
Paris          2175601
Troyes           61996
dtype: int64

Constructing Series objects

We've already seen a few ways of constructing a Pandas Series from scratch; all of them are some version of `pd.Series(data, index=index)`, where **index** is an optional argument, and data can be one of many entities.

For example, data can be a list or NumPy array, in which case index defaults to an integer sequence:

In [423]:
pd.Series([5, 10, 27, 14])

0     5
1    10
2    27
3    14
dtype: int64

data can be a scalar, which is repeated to fill the specified index:

In [424]:
pd.Series(7, index=[111, 222, 333])

111    7
222    7
333    7
dtype: int64

data can be a dictionary, in which case the index defaults to the dictionary keys in insertion order (older Pandas versions sorted the keys instead):

In [425]:
pd.Series({2:'a', 1:'b', 3:'c'})

2    a
1    b
3    c
dtype: object

In each case, the index can be explicitly set if a different result is preferred:

In [426]:
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

3    c
2    a
dtype: object
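As a small illustration of how an explicit index interacts with dictionary data, the sketch below also requests a label that is not among the dictionary keys. The label `7` is a hypothetical addition chosen for this example; Pandas fills the missing entry with `NaN` rather than raising an error.

```python
import pandas as pd

letters = {2: 'a', 1: 'b', 3: 'c'}

# Only the requested labels are kept, in the requested order.
# The label 7 is not a key of the dictionary, so its value is NaN.
selected = pd.Series(letters, index=[3, 2, 7])
print(selected)
```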

The next fundamental structure in Pandas is the DataFrame. Like the Series object discussed in the previous section, the DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary. We'll now take a look at each of these perspectives.

DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

To demonstrate this, let's first construct a new Series listing the area of each of the five cities discussed in the previous section:

In [427]:
area_dict = {'Montpellier': 56.9,
                   'Paris': 105.4,
                   'Troyes': 13.2,
                   'Marseille': 240.6,
                   'Lyon': 47.9}
area = pd.Series(area_dict)
area


Montpellier     56.9
Paris          105.4
Troyes          13.2
Marseille      240.6
Lyon            47.9
dtype: float64

Now that we have this along with the population Series from before, we can use a dictionary to construct a single two-dimensional object containing this information:

In [428]:
cities = pd.DataFrame({'population': population,
                       'area': area})
cities



             population   area
Montpellier      290053   56.9
Paris           2175601  105.4
Troyes            61996   13.2
Marseille        868277  240.6
Lyon             518635   47.9


Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In [429]:
cities.index

Index(['Montpellier', 'Paris', 'Troyes', 'Marseille', 'Lyon'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In [430]:
cities.columns

Index(['population', 'area'], dtype='object')

Thus the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data. For example, asking for the 'area' column returns the Series object containing the areas we saw earlier:

In [431]:
cities['area']

Montpellier     56.9
Paris          105.4
Troyes          13.2
Marseille      240.6
Lyon            47.9
Name: area, dtype: float64

In [432]:
cities['population']

Montpellier     290053
Paris          2175601
Troyes           61996
Marseille       868277
Lyon            518635
Name: population, dtype: int64

Notice the potential point of confusion here: in a two-dimensional NumPy array, **data[0]** will return the first row. For a DataFrame, **data['col0']** will return the first column. Because of this, it is probably better to think about DataFrames as generalized dictionaries rather than generalized arrays, though both ways of looking at the situation can be useful.
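The difference can be seen side by side in a minimal sketch. The array values and the column names `col0` and `col1` are made up for this illustration.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2], [3, 4]])
frame = pd.DataFrame(arr, columns=['col0', 'col1'])

first_row = arr[0]           # NumPy: a single index selects a row
first_col = frame['col0']    # DataFrame: a single key selects a column

print(first_row.tolist())    # the row values 1 and 2
print(first_col.tolist())    # the column values 1 and 3
```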

Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways. Here we'll give several examples.

From a single Series object

A DataFrame is a collection of Series objects, and a single-column `DataFrame` can be constructed from a single Series:

In [433]:
pd.DataFrame(population, columns=['population'])

             population
Montpellier      290053
Paris           2175601
Troyes            61996
Marseille        868277
Lyon             518635


From a list of dicts

Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data:

In [434]:
df = [{'square': i*i, 'double': 2 * i}
        for i in range(6)]
pd.DataFrame(df)

   square  double
0       0       0
1       1       2
2       4       4
3       9       6
4      16       8
5      25      10


Even if some keys in the dictionary are missing, Pandas will fill them in with **NaN** (i.e., "not a number") values:

In [435]:
pd.DataFrame([{'column1': 15, 'column2': 40}, {'column2': 53, 'column3': 77}])

   column1  column2  column3
0     15.0       40      NaN
1      NaN       53     77.0
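A side effect worth checking is the column type: a column that receives a `NaN` is stored as floating point, while a fully populated integer column stays integer. The following sketch reuses the records from the cell above and only inspects the result.

```python
import pandas as pd

records = [{'column1': 15, 'column2': 40},
           {'column2': 53, 'column3': 77}]
frame = pd.DataFrame(records)

# column2 is present in every record and stays integer;
# column1 and column3 each miss one entry and are stored as float.
print(frame.dtypes)

# Count the missing entries that were filled in with NaN.
n_missing = int(frame.isna().sum().sum())
print(n_missing)
```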



From a dictionary of Series objects



As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:


In [436]:
pd.DataFrame({'population': population,
              'area': area})

             population   area
Montpellier      290053   56.9
Paris           2175601  105.4
Troyes            61996   13.2
Marseille        868277  240.6
Lyon             518635   47.9



From a two-dimensional NumPy array

Given a two-dimensional array of data, we can create a DataFrame with any specified column and index names. If omitted, an integer index will be used for each:


In [437]:
pd.DataFrame(np.random.rand(5, 3),
             columns=['col1', 'col2', 'col3'],
             index=['row1', 'row2', 'row3', 'row4', 'row5'])

          col1      col2      col3
row1  0.381830  0.945011  0.195805
row2  0.943701  0.334149  0.759220
row3  0.307404  0.937340  0.173765
row4  0.422756  0.481733  0.903045
row5  0.584231  0.858809  0.388251



From a NumPy structured array


We covered structured arrays in Structured Data: NumPy's Structured Arrays. A Pandas DataFrame operates much like a structured array, and can be created directly from one:


In [438]:
TAB = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
TAB



array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [439]:
pd.DataFrame(TAB)


   A    B
0  0  0.0
1  0  0.0
2  0  0.0


The Pandas Index Object

We have seen here that both the Series and DataFrame objects contain an explicit index that lets you reference and modify data. This Index object is an interesting structure in itself, and it can be thought of either as an immutable array or as an ordered set (technically a multi-set, as Index objects may contain repeated values). Those views have some interesting consequences in the operations available on Index objects. As a simple example, let's construct an Index from a list of integers:

In [440]:
ind = pd.Index([8,14, 15, 18, 20])
ind

Int64Index([8, 14, 15, 18, 20], dtype='int64')


Index as immutable array

The Index in many ways operates like an array. For example, we can use standard Python indexing notation to retrieve values or slices:


In [441]:
ind[3]

18

In [442]:
ind[::2]

Int64Index([8, 15, 20], dtype='int64')

Index objects also have many of the attributes familiar from NumPy arrays:

In [443]:
print(ind.size, ind.shape, ind.ndim, ind.dtype)

5 (5,) 1 int64


One difference between Index objects and NumPy arrays is that indices are immutable; that is, they cannot be modified via the normal means:

In [444]:
ind[1] = 0

TypeError: Index does not support mutable operations

This immutability makes it safer to share indices between multiple DataFrames and arrays, without the potential for side effects from inadvertent index modification.
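The following sketch shows the immutability in a form that runs without stopping on the error, and shows that operations which look like modifications (here `Index.append`) return a new object instead. The value `25` is an arbitrary example.

```python
import pandas as pd

ind = pd.Index([8, 14, 15, 18, 20])

try:
    ind[1] = 0          # item assignment is not supported on an Index
    mutated = True
except TypeError as err:
    mutated = False
    print("TypeError:", err)

# append() leaves the original untouched and returns a new Index.
extended = ind.append(pd.Index([25]))
print(list(ind), list(extended))
```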


Index as ordered set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The Index object follows many of the conventions used by Python's built-in set data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way:


In [None]:
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

Intersection of indices

In [None]:
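indA.intersection(indB)
# indA & indB gave the same result in older pandas; since pandas 2.0 the operator acts element-wise

Union

In [None]:
indA.union(indB)
# indA | indB was the older operator form

Symmetric difference

In [None]:
indA.symmetric_difference(indB)
# indA ^ indB was the older operator form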
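The named methods make the intent of each set operation explicit and behave the same way across Pandas versions. A minimal sketch with the two indices defined above; note that `difference` (labels only in the first index) is not the same as `symmetric_difference` (labels in exactly one of the two).

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

common = indA.intersection(indB)             # labels present in both
either = indA.union(indB)                    # labels present in at least one
only_one = indA.symmetric_difference(indB)   # labels present in exactly one
only_a = indA.difference(indB)               # labels in indA but not in indB

print(list(common), list(either), list(only_one), list(only_a))
```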

Data Indexing and Selection

We'll start with the simple case of the one-dimensional Series object, and then move on to the more complicated two-dimensional DataFrame object.

Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.


Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values:


In [None]:
import pandas as pd
ds = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
ds

In [None]:
ds['b']

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [None]:
'a' in ds

In [None]:
ds.keys()

In [None]:
list(ds.items())

Series objects can even be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning to a new key, you can extend a Series by assigning to a new index value:

In [None]:
ds['e'] = 1.25
ds



This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.



# Series as one-dimensional array


A Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:




# slicing by explicit index

In [None]:
ds['a':'c']



# slicing by implicit integer index



In [None]:
ds[0:2]

# masking

In [None]:
ds[(ds > 0.3) & (ds < 0.8)]

# fancy indexing

In [None]:
ds[['a', 'e']]

Among these, slicing may be the source of the most confusion. Notice that when slicing with an explicit index (i.e., **ds['a':'c']**), the final index is included in the slice, while when slicing with an implicit index (i.e., **ds[0:2]**), the final index is excluded from the slice.
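The endpoint rule is easy to verify by counting elements. This sketch spells the two slices with the `loc` and `iloc` accessors so that it is unambiguous which kind of slice is meant; the data are the same as in the cells above, before the `'e'` entry was added.

```python
import pandas as pd

ds = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

by_label = ds.loc['a':'c']     # label slice: the endpoint 'c' is included
by_position = ds.iloc[0:2]     # positional slice: position 2 is excluded

print(len(by_label), len(by_position))
```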

# Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion. For example, if your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.

In [None]:
ds = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
ds

## explicit index when indexing

In [None]:
ds[1]

## implicit index when slicing

In [None]:
ds[1:3]

Because of this potential confusion in the case of integer indexes, Pandas provides some special indexer attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the Series.

First, the loc attribute allows indexing and slicing that always references the explicit index:


In [445]:
ds.loc[1]

'a'

In [None]:
ds.loc[1:3]

The `iloc` attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
ds.iloc[1]

In [None]:
ds.iloc[1:3]



Older versions of Pandas offered a third indexing attribute, `ix`, a hybrid of the two that for Series objects was equivalent to standard []-based indexing. It was deprecated in version 0.20 and removed in version 1.0, so `loc` and `iloc` should be used instead.

One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of loc and iloc makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.
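To make the contrast concrete, here is a short sketch on the integer-indexed Series used above, showing that `loc` and `iloc` give different answers for the same numbers.

```python
import pandas as pd

ds = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

by_label = ds.loc[1]             # label 1 is the first entry
by_position = ds.iloc[1]         # position 1 is the second entry
label_slice = ds.loc[1:3]        # labels 1 through 3, endpoint included
position_slice = ds.iloc[1:3]    # positions 1 and 2, endpoint excluded

print(by_label, by_position)
print(label_slice.tolist(), position_slice.tolist())
```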


# Data Selection in DataFrame

## DataFrame as a dictionary

The first analogy we will consider is the `DataFrame` as a dictionary of related `Series` objects. Let's return to our example of areas and populations of cities:

In [None]:
area = {'Montpellier': 56.9,
                   'Paris': 105.4,
                   'Troyes': 13.2,
                   'Marseille': 240.6,
                   'Lyon': 47.9}

population = {'Montpellier': 290053,
                   'Paris': 2175601,
                   'Troyes': 61996,
                   'Marseille': 868277,
                   'Lyon': 518635}
cities_data = pd.DataFrame({'population': population,
              'area': area})
cities_data

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In [None]:
cities_data['area']

In [446]:
cities_data.area

Montpellier     56.9
Paris          105.4
Troyes          13.2
Marseille      240.6
Lyon            47.9
Name: area, dtype: float64

This attribute-style column access actually accesses the exact same object as the dictionary-style access:

In [447]:
cities_data.area is cities_data['area']

True

Though this is a useful shorthand, keep in mind that it does not work for all cases! For example, if the column names are not strings, or if the column names conflict with methods of the `DataFrame`, this attribute-style access is not possible. For example, the `DataFrame` has a `pop()` method, so for a column named "pop", `cities_data.pop` would point to the method rather than to the column. The "population" column has no such conflict, so both forms return the same object here:

In [448]:
cities_data.population is cities_data['population']

True



In particular, you should avoid the temptation to try column assignment via attribute (i.e., use `cities_data['population'] = z` rather than `cities_data.population = z`).
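The name clash can be reproduced with a small, hypothetical frame whose column is literally called `pop` (the frame and its values are invented for this sketch):

```python
import pandas as pd

# Hypothetical frame with a column named 'pop', which collides with
# the DataFrame.pop() method.
frame = pd.DataFrame({'pop': [290053, 2175601], 'area': [56.9, 105.4]})

attr = frame.pop          # resolves to the bound method, not the column
column = frame['pop']     # dictionary-style access always returns the column

print(callable(attr))
print(type(column).__name__)
```

Dictionary-style access therefore remains the safe default whenever column names are not under your control.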

Like with the `Series` objects discussed earlier, this dictionary-style syntax can also be used to modify the object, in this case adding a new column:


In [449]:
cities_data['density'] = cities_data['population'] / cities_data['area']
cities_data

             population   area       density
Montpellier      290053   56.9   5097.592267
Paris           2175601  105.4  20641.375712
Troyes            61996   13.2   4696.666667
Marseille        868277  240.6   3608.798836
Lyon             518635   47.9  10827.453027


This shows a preview of the straightforward syntax of element-by-element arithmetic between Series objects.


## DataFrame as two-dimensional array

As mentioned previously, we can also view the `DataFrame` as an enhanced two-dimensional array. We can examine the raw underlying data array using the `values` attribute:


In [450]:
cities_data.values

array([[2.90053000e+05, 5.69000000e+01, 5.09759227e+03],
       [2.17560100e+06, 1.05400000e+02, 2.06413757e+04],
       [6.19960000e+04, 1.32000000e+01, 4.69666667e+03],
       [8.68277000e+05, 2.40600000e+02, 3.60879884e+03],
       [5.18635000e+05, 4.79000000e+01, 1.08274530e+04]])

With this picture in mind, many familiar array-like operations can be done on the `DataFrame` itself. For example, we can transpose the full `DataFrame` to swap rows and columns:

In [451]:
cities_data.T

            Montpellier      Paris       Troyes    Marseille          Lyon
population     290053.0  2175601.0      61996.0     868277.0      518635.0
area               56.9      105.4         13.2        240.6          47.9
density     5097.592267   20641.38  4696.666667  3608.798836  10827.453027


When it comes to indexing of `DataFrame` objects, however, it is clear that the dictionary-style indexing of columns precludes our ability to simply treat it as a NumPy array. In particular, passing a single index to an array accesses a row:

In [452]:
cities_data.values[0]

array([2.90053000e+05, 5.69000000e+01, 5.09759227e+03])

and passing a single "index" to a `DataFrame` accesses a column:

In [453]:
cities_data['area']

Montpellier     56.9
Paris          105.4
Troyes          13.2
Marseille      240.6
Lyon            47.9
Name: area, dtype: float64

Thus for array-style indexing, we need another convention. Here Pandas again uses the `loc` and `iloc` indexers mentioned earlier. Using the `iloc` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the `DataFrame` index and column labels are maintained in the result:

In [454]:
cities_data.iloc[:3, :2]

             population   area
Montpellier      290053   56.9
Paris           2175601  105.4
Troyes            61996   13.2


Similarly, using the `loc` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [455]:
cities_data.loc[:'Troyes', :'population']

             population
Montpellier      290053
Paris           2175601
Troyes            61996


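Older versions of Pandas also provided an `ix` indexer that allowed a hybrid of these two approaches. It was deprecated in version 0.20 and removed in version 1.0, so in the version used here the call fails:

In [456]:
cities_data.ix[:3, :'population']

AttributeError: 'DataFrame' object has no attribute 'ix'

Use `loc` or `iloc` instead. Keep in mind that for integer indices, plain `[]` indexing remains subject to the same potential sources of confusion as discussed for integer-indexed `Series` objects.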
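Since the `ix` call above raises an `AttributeError`, one way to get the same mixed selection (rows by position, columns by label) is to translate the positional part into labels first and then use `loc` throughout. This is a sketch of one possible approach, not the only one; the frame is rebuilt here so the example runs on its own.

```python
import pandas as pd

cities_data = pd.DataFrame(
    {'population': [290053, 2175601, 61996, 868277, 518635],
     'area': [56.9, 105.4, 13.2, 240.6, 47.9]},
    index=['Montpellier', 'Paris', 'Troyes', 'Marseille', 'Lyon'])

# Translate the positional row slice into labels, then select with loc.
first_three = cities_data.index[:3]
subset = cities_data.loc[first_three, :'population']
print(subset)
```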

Any of the familiar NumPy-style data access patterns can be used within these indexers. For example, in the `loc` indexer we can combine masking and fancy indexing as in the following:

In [None]:
cities_data.loc[cities_data.density > 5000, ['population', 'density']]

Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [None]:
cities_data.iloc[0, 2] = 10000
cities_data



To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple `DataFrame` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.


Additional indexing conventions

There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice. First, while `indexing` refers to columns, `slicing` refers to rows:


In [None]:
cities_data['Paris':'Troyes']

Such slices can also refer to rows by number rather than by index:

In [None]:
cities_data[1:3]

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [None]:
cities_data[cities_data.density > 5000]



These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.


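# Ufuncs: Index Preservation

Because Pandas is designed to work with NumPy, any NumPy ufunc will work on Pandas `Series` and `DataFrame` objects. Let's start by defining a simple `Series` and `DataFrame` on which to demonstrate this:

In [None]: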
import pandas as pd
import numpy as np



In [None]:


rng = np.random.RandomState(42)
ser = pd.Series(rng.randint(0, 10, 4))
ser



In [None]:
df = pd.DataFrame(rng.randint(0, 10, (3, 4)),
                  columns=['A', 'B', 'C', 'D'])
df

If we apply a NumPy ufunc on either of these objects, the result will be another Pandas object with the indices preserved:

In [None]:
np.exp(ser)

Or, for a slightly more complex calculation:

In [None]:
np.sin(df * np.pi / 4)
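Index preservation is easiest to see with a non-default index. In this sketch the labels `x`, `y`, `z` and the values are hypothetical; the result of the ufunc is still a `Series` with the same labels.

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])

# The ufunc is applied element-wise; the result keeps the input's index.
result = np.exp(ser)
print(result)
```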


# UFuncs: Index Alignment

For binary operations on two Series or DataFrame objects, Pandas will align indices in the process of performing the operation. This is very convenient when working with incomplete data, as we'll see in some of the examples that follow.


## Index alignment in Series

As an example, suppose we are combining two different data sources, and find only the top three French cities by population and the top three French cities by area:

In [457]:
population = pd.Series({'Paris': 2175601,
                        'Marseille': 868277,
                        'Lyon': 518635}, name='population')
area = pd.Series({'Marseille': 240.6,
                   'Paris': 105.4,
                   'Montpellier': 56.9}, name='area')

Let's see what happens when we divide these to compute the population density:

In [458]:
population / area

Lyon                    NaN
Marseille       3608.798836
Montpellier             NaN
Paris          20641.375712
dtype: float64

The resulting array contains the union of indices of the two input arrays, which could be determined using the `union` method of these indices:

In [459]:
area.index.union(population.index)

Index(['Lyon', 'Marseille', 'Montpellier', 'Paris'], dtype='object')

Any item for which one or the other does not have an entry is marked with `NaN`, or "Not a Number," which is how Pandas marks missing data. This index matching is implemented this way for any of Python's built-in arithmetic expressions; any missing values are filled in with `NaN` by default:

In [460]:
ds_A = pd.Series([2, 4, 6], index=[0, 1, 2])
ds_B = pd.Series([1, 3, 5], index=[1, 2, 3])
ds_A + ds_B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64

If using `NaN` values is not the desired behavior, the fill value can be modified using appropriate object methods in place of the operators. For example, calling `ds_A.add(ds_B)` is equivalent to calling `ds_A + ds_B`, but allows optional explicit specification of the fill value for any elements in `ds_A` or `ds_B` that might be missing:

In [461]:
ds_A.add(ds_B, fill_value=0)

0    2.0
1    5.0
2    9.0
3    5.0
dtype: float64


## Index alignment in DataFrame

A similar type of alignment takes place for both columns and indices when performing operations on `DataFrame`s:


In [462]:
df_A = pd.DataFrame(rng.randint(0, 5, (2, 2)),
                 columns=list('XY'))
df_A



   X  Y
0  1  3
1  4  1


In [463]:
df_B = pd.DataFrame(rng.randint(0, 15, (3, 3)),
                 columns=list('YXZ'))
df_B



    Y   X   Z
0   9  11   1
1   9  13   3
2  13  14  14


In [464]:
df_A+df_B

      X     Y   Z
0  12.0  12.0 NaN
1  17.0  10.0 NaN
2   NaN   NaN NaN


Notice that indices are aligned correctly irrespective of their order in the two objects, and indices in the result are sorted. As was the case with `Series`, we can use the associated object's arithmetic method and pass any desired `fill_value` to be used in place of missing entries. Here we'll fill with the mean of all values in `df_A` (computed by first stacking the rows of `df_A`):

In [465]:
fill = df_A.stack().mean()
df_A.add(df_B, fill_value=fill)

       X      Y      Z
0  12.00  12.00   3.25
1  17.00  10.00   5.25
2  16.25  15.25  16.25


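The following table lists Python operators and their equivalent Pandas object methods:

| Python operator | Pandas method(s) |
|---|---|
| `+` | `add()` |
| `-` | `sub()`, `subtract()` |
| `*` | `mul()`, `multiply()` |
| `/` | `truediv()`, `div()`, `divide()` |
| `//` | `floordiv()` |
| `%` | `mod()` |
| `**` | `pow()` |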
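To illustrate the correspondence between an operator and its method form, this sketch compares `-` with `sub()` on two partially overlapping Series (values chosen for the example), and then uses the `fill_value` argument that only the method form accepts.

```python
import pandas as pd

a = pd.Series([2, 4, 6], index=[0, 1, 2])
b = pd.Series([1, 3, 5], index=[1, 2, 3])

# The operator and the method produce the same aligned result.
same = (a - b).equals(a.sub(b))

# The method form can substitute a value for entries missing on one side.
filled = a.sub(b, fill_value=0)
print(same)
print(filled)
```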

# Ufuncs: Operations Between DataFrame and Series

When performing operations between a `DataFrame` and a `Series`, the index and column alignment is similarly maintained. Operations between a `DataFrame` and a `Series` are similar to operations between a two-dimensional and one-dimensional NumPy array. Consider one common operation, where we find the difference of a two-dimensional array and one of its rows:


In [466]:
A = rng.randint(10, size=(3, 4))
A

array([[7, 6, 8, 7],
       [4, 1, 4, 7],
       [9, 8, 8, 0]])

In [467]:
A - A[0]

array([[ 0,  0,  0,  0],
       [-3, -5, -4,  0],
       [ 2,  2,  0, -7]])

According to NumPy's broadcasting rules (see Computation on Arrays: Broadcasting), subtraction between a two-dimensional array and one of its rows is applied row-wise.

In Pandas, the convention similarly operates row-wise by default:

In [468]:
df = pd.DataFrame(A, columns=list('QRST'))
df - df.iloc[0]

   Q  R  S  T
0  0  0  0  0
1 -3 -5 -4  0
2  2  2  0 -7


If you would instead like to operate column-wise, you can use the object methods mentioned earlier, while specifying the axis keyword:

In [469]:
df.subtract(df['R'], axis=0)

   Q  R  S  T
0  1  0  2  1
1  3  0  3  6
2  1  0  0 -8


Note that these DataFrame/Series operations, like the operations discussed above, will automatically align indices between the two elements:

In [470]:
halfrow = df.iloc[0, ::2]
halfrow

Q    7
S    8
Name: 0, dtype: int32

In [471]:
df - halfrow

     Q   R    S   T
0  0.0 NaN  0.0 NaN
1 -3.0 NaN -4.0 NaN
2  2.0 NaN  0.0 NaN


This preservation and alignment of indices and columns means that operations on data in Pandas will always maintain the data context, which prevents the types of silly errors that might come up when working with heterogeneous and/or misaligned data in raw NumPy arrays.

# Missing Data in Pandas

Pandas chose to use sentinels for missing data, and further chose to use two already-existing Python null values: the special floating-point `NaN` value, and the Python `None` object. This choice has some side effects, as we will see, but in practice ends up being a good compromise in most cases of interest.

## `None`: Pythonic missing data

The first sentinel value used by Pandas is `None`, a Python singleton object that is often used for missing data in Python code. Because it is a Python object, `None` cannot be used in any arbitrary NumPy/Pandas array, but only in arrays with data type `'object'` (i.e., arrays of Python objects):

In [472]:
import numpy as np
import pandas as pd

In [473]:
vals1 = np.array([1, None, 3, 4])
vals1

array([1, None, 3, 4], dtype=object)

This `dtype=object` means that the best common type representation NumPy could infer for the contents of the array is that they are Python objects. While this kind of object array is useful for some purposes, any operations on the data will be done at the Python level, with much more overhead than the typically fast operations seen for arrays with native types:

In [474]:
for dtype in ['object', 'int']:
    print("dtype =", dtype)
    %timeit np.arange(1E6, dtype=dtype).sum()
    print()

dtype = object
57 ms ± 10.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

dtype = int
2.19 ms ± 144 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)



The use of Python objects in an array also means that if you perform aggregations like `sum()` or `min()` across an array with a `None` value, you will generally get an error:

In [476]:
vals1.sum()

TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

This reflects the fact that addition between an integer and `None` is undefined.

## `NaN`: Missing numerical data

The other missing data representation, `NaN` (acronym for Not a Number), is different; it is a special floating-point value recognized by all systems that use the standard IEEE floating-point representation:

In [477]:
vals2 = np.array([1, np.nan, 3, 4]) 
vals2.dtype

dtype('float64')

Notice that NumPy chose a native floating-point type for this array: this means that unlike the object array from before, this array supports fast operations pushed into compiled code. You should be aware that `NaN` is a bit like a data virus: it infects any other object it touches. Regardless of the operation, the result of arithmetic with `NaN` will be another `NaN`:

In [478]:
1 + np.nan

nan

In [479]:
0 *  np.nan

nan

Note that this means that aggregates over the values are well defined (i.e., they don't result in an error) but not always useful:

In [480]:
vals2.sum(), vals2.min(), vals2.max()

(nan, nan, nan)

NumPy does provide some special aggregations that will ignore these missing values:

In [481]:
np.nansum(vals2), np.nanmin(vals2), np.nanmax(vals2), np.nanmean(vals2)

(8.0, 1.0, 4.0, 2.6666666666666665)

Keep in mind that `NaN` is specifically a floating-point value; there is no equivalent `NaN` value for integers, strings, or other types.
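One practical consequence of the IEEE definition is that `NaN` never compares equal to anything, including itself, so an equality test cannot be used to find missing values. The sketch below, reusing the values from the cells above, shows `np.isnan` as the reliable alternative.

```python
import numpy as np

vals = np.array([1, np.nan, 3, 4])

equal_to_nan = vals == np.nan    # all False: NaN never compares equal
mask = np.isnan(vals)            # True exactly where the value is NaN

print(equal_to_nan)
print(mask)
```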

## `NaN` and `None` in Pandas

`NaN` and `None` both have their place, and Pandas is built to handle the two of them nearly interchangeably, converting between them where appropriate:

In [482]:
pd.Series([1, np.nan, 2, None])

0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

For types that don't have an available sentinel value, Pandas automatically type-casts when NA values are present. For example, if we set a value in an integer array to `np.nan`, it will automatically be upcast to a floating-point type to accommodate the NA:

In [483]:
x = pd.Series(range(2), dtype=int)
x

0    0
1    1
dtype: int32

In [484]:
x[0] = None
x

0    NaN
1    1.0
dtype: float64

Notice that in addition to casting the integer array to floating point, Pandas automatically converts the `None` to a `NaN` value.

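The upcasting conventions in Pandas when NA values are introduced are as follows:

 - floating: no change; the NA sentinel is `np.nan`
 - object: no change; the NA sentinel is `None` or `np.nan`
 - integer: cast to `float64`; the NA sentinel is `np.nan`
 - boolean: cast to `object`; the NA sentinel is `None` or `np.nan`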

Keep in mind that in Pandas, string data is stored with an `object` dtype by default.


# Operating on Null Values

As we have seen, Pandas treats `None` and `NaN` as essentially interchangeable for indicating missing or null values. To facilitate this convention, there are several useful methods for detecting, removing, and replacing null values in Pandas data structures. They are:

 - `isnull()`: Generate a boolean mask indicating missing values
 - `notnull()`: Opposite of `isnull()`
 - `dropna()`: Return a filtered version of the data
 - `fillna()`: Return a copy of the data with missing values filled or imputed

We will conclude this section with a brief exploration and demonstration of these routines.

## Detecting null values

Pandas data structures have two useful methods for detecting null data: `isnull()` and `notnull()`. Either one will return a Boolean mask over the data. For example:

In [485]:
data = pd.Series([1, np.nan, 'hello', None])

In [486]:
data.isnull()

0    False
1     True
2    False
3     True
dtype: bool

Boolean masks can be used directly as a `Series` or `DataFrame` index:

In [487]:
data[data.notnull()]

0        1
2    hello
dtype: object

The `isnull()` and `notnull()` methods produce similar Boolean results for `DataFrame`s.
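
As a small illustration of the `DataFrame` case, here is a sketch using a made-up two-column frame (the names `frame`, `mask`, and `missing_per_column` are illustrative). The mask has the same shape as the data, and summing it counts the missing entries in each column.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'a': [1, np.nan, 3],
                      'b': ['x', None, 'z']})

mask = frame.isnull()               # Boolean DataFrame, same shape as frame
print(mask)

# Summing the mask counts the missing entries in each column
missing_per_column = mask.sum()
print(missing_per_column)
```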


## Dropping null values

In addition to the masking used before, there are the convenience methods, `dropna()` (which removes NA values) and `fillna()` (which fills in NA values). For a `Series`, the result is straightforward:


In [488]:
data.dropna()

0        1
2    hello
dtype: object

For a `DataFrame`, there are more options. Consider the following `DataFrame`:

In [489]:
df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6




We cannot drop single values from a `DataFrame`; we can only drop full rows or full columns. Depending on the application, you might want one or the other, so `dropna()` gives a number of options for a `DataFrame`.

By default, `dropna()` will drop all rows in which any null value is present:


In [490]:
df.dropna()

     0    1  2
1  2.0  3.0  5


Alternatively, you can drop NA values along a different axis; `axis=1` (or, equivalently, `axis='columns'`) drops all columns containing a null value:

In [491]:
df.dropna(axis='columns')

   2
0  2
1  5
2  6


But this drops some good data as well; you might rather be interested in dropping rows or columns with *all* NA values, or a majority of NA values. This can be specified through the `how` or `thresh` parameters, which allow fine control of the number of nulls to allow through.

The default is `how='any'`, such that any row or column (depending on the axis keyword) containing a null value will be dropped. You can also specify `how='all'`, which will only drop rows/columns that are *all* null values:


In [492]:
df[3] = np.nan
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [493]:
df.dropna(axis='columns', how='all')

     0    1  2
0  1.0  NaN  2
1  2.0  3.0  5
2  NaN  4.0  6


For finer-grained control, the `thresh` parameter lets you specify a minimum number of non-null values for the row/column to be kept:

In [494]:
df.dropna(axis='rows', thresh=3)

     0    1  2   3
1  2.0  3.0  5 NaN


Here the first and last row have been dropped, because they contain only two non-null values.
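
To see why only the middle row survives, it can help to count the non-null entries in each row. The following is a minimal sketch that rebuilds the same `DataFrame`; the names `non_null_per_row` and `kept` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1,      np.nan, 2],
                   [2,      3,      5],
                   [np.nan, 4,      6]])
df[3] = np.nan

# Number of non-null entries in each row; thresh=3 keeps rows with at least 3
non_null_per_row = df.notnull().sum(axis=1)
print(non_null_per_row)

kept = df.dropna(thresh=3)
print(kept.index.tolist())
```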


## Filling null values

Sometimes rather than dropping NA values, you'd rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values. You could do this in-place using the `isnull()` method as a mask, but because it is such a common operation Pandas provides the `fillna()` method, which returns a copy of the array with the null values replaced.

Consider the following `Series`:


In [495]:
data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
data

a    1.0
b    NaN
c    2.0
d    NaN
e    3.0
dtype: float64

We can fill NA entries with a single value, such as zero:

In [496]:
data.fillna(0)

a    1.0
b    0.0
c    2.0
d    0.0
e    3.0
dtype: float64

We can specify a forward-fill to propagate the previous value forward:

In [497]:
# forward-fill
data.fillna(method='ffill')

a    1.0
b    1.0
c    2.0
d    2.0
e    3.0
dtype: float64

Or we can specify a back-fill to propagate the next values backward:

In [498]:
# back-fill
data.fillna(method='bfill')

a    1.0
b    2.0
c    2.0
d    3.0
e    3.0
dtype: float64

For `DataFrame`s, the options are similar, but we can also specify an axis along which the fills take place:

In [499]:
df

     0    1  2   3
0  1.0  NaN  2 NaN
1  2.0  3.0  5 NaN
2  NaN  4.0  6 NaN


In [500]:
df.fillna(method='ffill', axis=1)

     0    1    2    3
0  1.0  1.0  2.0  2.0
1  2.0  3.0  5.0  5.0
2  NaN  4.0  6.0  6.0


Notice that if a previous value is not available during a forward fill, the NA value remains.
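
Pandas also exposes these fills as the dedicated `ffill()` and `bfill()` methods, and newer releases deprecate the `method` argument of `fillna()` in their favor. A minimal sketch on the same `Series` as above, assuming a reasonably recent Pandas (the names `forward` and `backward` are illustrative):

```python
import numpy as np
import pandas as pd

data = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))

forward = data.ffill()      # propagate the previous valid value forward
backward = data.bfill()     # propagate the next valid value backward
print(forward.tolist())
print(backward.tolist())
```

Both methods also accept an `axis` argument when called on a `DataFrame`.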


# Hierarchical Indexing


Up to this point we've been focused primarily on one-dimensional and two-dimensional data, stored in Pandas Series and DataFrame objects, respectively. Often it is useful to go beyond this and store higher-dimensional data, that is, data indexed by more than one or two keys. While older versions of Pandas provided Panel and Panel4D objects that natively handled three-dimensional and four-dimensional data (see Aside: Panel Data), these have since been removed, and a far more common pattern in practice is to make use of hierarchical indexing (also known as multi-indexing) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In this section, we'll explore the direct creation of MultiIndex objects, considerations when indexing, slicing, and computing statistics across multiply indexed data, and useful routines for converting between simple and hierarchically indexed representations of your data.


We begin with the standard imports:

In [501]:
import pandas as pd
import numpy as np


# A Multiply Indexed Series

Let's start by considering how we might represent two-dimensional data within a one-dimensional Series. For concreteness, we will consider a series of data where each point has a character and numerical key.

## The bad way

Suppose you would like to track data about cities from two different years. Using the Pandas tools we've already covered, you might be tempted to simply use Python tuples as keys:


In [502]:
index = [('Paris', 2013), ('Paris', 2018),
         ('Marseille', 2013), ('Marseille', 2018),
         ('Lyon', 2013), ('Lyon', 2018)]
populations = [2229621, 2175601,
               855393, 868277,
               500715, 518635]
cities_pop = pd.Series(populations, index=index)
cities_pop

(Paris, 2013)        2229621
(Paris, 2018)        2175601
(Marseille, 2013)     855393
(Marseille, 2018)     868277
(Lyon, 2013)          500715
(Lyon, 2018)          518635
dtype: int64

With this indexing scheme, you can straightforwardly index or slice the series based on this multiple index:


In [503]:
cities_pop[('Paris', 2018):('Lyon', 2013)]

(Paris, 2018)        2175601
(Marseille, 2013)     855393
(Marseille, 2018)     868277
(Lyon, 2013)          500715
dtype: int64

But the convenience ends there. For example, if you need to select all values from 2018, you'll need to do some messy (and potentially slow) munging to make it happen:


In [504]:
cities_pop[[i for i in cities_pop.index if i[1] == 2018]]

(Paris, 2018)        2175601
(Marseille, 2018)     868277
(Lyon, 2018)          518635
dtype: int64

This produces the desired result, but is not as clean (or as efficient for large datasets) as the slicing syntax we've grown to love in Pandas.

## The Better Way: Pandas MultiIndex

Fortunately, Pandas provides a better way. Our tuple-based indexing is essentially a rudimentary multi-index, and the Pandas MultiIndex type gives us the type of operations we wish to have. We can create a multi-index from the tuples as follows:


In [505]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([(    'Paris', 2013),
            (    'Paris', 2018),
            ('Marseille', 2013),
            ('Marseille', 2018),
            (     'Lyon', 2013),
            (     'Lyon', 2018)],
           )

In [506]:
# Notice that the MultiIndex contains multiple levels of indexing (in
# this case, the city names and the years), as well as multiple labels
# for each data point which encode these levels.


If we re-index our series with this MultiIndex, we see the hierarchical representation of the data:


In [507]:
cities_pop = cities_pop.reindex(index)
cities_pop

Paris      2013    2229621
           2018    2175601
Marseille  2013     855393
           2018     868277
Lyon       2013     500715
           2018     518635
dtype: int64

Here the first two columns of the Series representation show the multiple index values, while the third column shows the data. Notice that some entries are missing in the first column: in this multi-index representation, any blank entry indicates the same value as the line above it.

Now to access all data for which the second index is 2018, we can simply use the Pandas slicing notation:


In [508]:
cities_pop[:, 2018]

Paris        2175601
Marseille     868277
Lyon          518635
dtype: int64

The result is a singly indexed array with just the keys we're interested in. This syntax is much more convenient (and the operation is much more efficient!) than the home-spun tuple-based multi-indexing solution that we started with. We'll now further discuss this sort of indexing operation on hierarchically indexed data.

## MultiIndex as extra dimension

You might notice something else here: we could easily have stored the same data using a simple DataFrame with index and column labels. In fact, Pandas is built with this equivalence in mind. The unstack() method will quickly convert a multiply indexed Series into a conventionally indexed DataFrame:


In [509]:
cities_pop_df = cities_pop.unstack()
cities_pop_df

              2013     2018
Lyon        500715   518635
Marseille   855393   868277
Paris      2229621  2175601


Naturally, the stack() method provides the opposite operation:

In [510]:
cities_pop_df.stack()

Lyon       2013     500715
           2018     518635
Marseille  2013     855393
           2018     868277
Paris      2013    2229621
           2018    2175601
dtype: int64

Seeing this, you might wonder why we would bother with hierarchical indexing at all. The reason is simple: just as we were able to use multi-indexing to represent two-dimensional data within a one-dimensional Series, we can also use it to represent data of three or more dimensions in a Series or DataFrame. Each extra level in a multi-index represents an extra dimension of data; taking advantage of this property gives us much more flexibility in the types of data we can represent. Concretely, we might want to add another column of demographic data for each city at each year (say, the number of students); with a MultiIndex this is as easy as adding another column to the DataFrame:


In [511]:
cities_pop_df = pd.DataFrame({'total': cities_pop,
                       'students': [625000 , 654455,
                                   90000 , 92148,
                                   140000 , 155440]})
cities_pop_df

                  total  students
Paris     2013  2229621    625000
          2018  2175601    654455
Marseille 2013   855393     90000
          2018   868277     92148
Lyon      2013   500715    140000
          2018   518635    155440


In addition, all the ufuncs and other functionality discussed in Operating on Data in Pandas work with hierarchical indices as well. Here we compute the fraction of students by year, given the above data:


In [512]:
f_students = cities_pop_df['students'] / cities_pop_df['total']
f_students.unstack()

               2013      2018
Lyon       0.279600  0.299710
Marseille  0.105215  0.106127
Paris      0.280317  0.300816


This allows us to easily and quickly manipulate and explore even high-dimensional data.

# Methods of MultiIndex Creation

The most straightforward way to construct a multiply indexed Series or DataFrame is to simply pass a list of two or more index arrays to the constructor. For example:


In [513]:
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns=['data1', 'data2'])
df

        data1     data2
a 1  0.679512  0.501663
  2  0.212512  0.606359
b 1  0.157185  0.793862
  2  0.630769  0.113914


The work of creating the MultiIndex is done in the background.

Similarly, if you pass a dictionary with appropriate tuples as keys, Pandas will automatically recognize this and use a MultiIndex by default:


In [514]:
data = {('Paris', 2013): 2229621, 
        ('Paris', 2018): 2175601,
        ('Marseille', 2013): 855393, 
        ('Marseille', 2018): 868277,
        ('Lyon', 2013): 500715, 
        ('Lyon', 2018): 518635}
pd.Series(data)

Paris      2013    2229621
           2018    2175601
Marseille  2013     855393
           2018     868277
Lyon       2013     500715
           2018     518635
dtype: int64

Nevertheless, it is sometimes useful to explicitly create a MultiIndex; we'll see a couple of these methods here.

##  Explicit MultiIndex constructors

For more flexibility in how the index is constructed, you can instead use the class method constructors available in the pd.MultiIndex. For example, as we did before, you can construct the MultiIndex from a simple list of arrays giving the index values within each level:


In [515]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can construct it from a list of tuples giving the multiple index values of each point:


In [516]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])


MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

You can even construct it from a Cartesian product of single indices:


In [517]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Similarly, you can construct the MultiIndex directly using its internal encoding by passing levels (a list of lists containing available index values for each level) and codes (a list of lists that reference these labels):


In [518]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

Any of these objects can be passed as the index argument when creating a Series or DataFrame, or be passed to the reindex method of an existing Series or DataFrame.
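
Both uses can be sketched as follows, with made-up values (the names `series`, `bigger`, and `expanded` are illustrative). Note that re-indexing against labels the Series does not contain produces missing values for them.

```python
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

# Pass the MultiIndex directly as the index of a new Series
series = pd.Series([10, 20, 30, 40], index=index)
print(series)

# Or re-index an existing Series (labels absent from it become NaN)
bigger = pd.MultiIndex.from_product([['a', 'b', 'c'], [1, 2]])
expanded = series.reindex(bigger)
print(expanded)
```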

## MultiIndex level names

Sometimes it is convenient to name the levels of the MultiIndex. This can be accomplished by passing the names argument to any of the above MultiIndex constructors, or by setting the names attribute of the index after the fact:


In [519]:
cities_pop.index.names = ['city', 'year']
cities_pop

city       year
Paris      2013    2229621
           2018    2175601
Marseille  2013     855393
           2018     868277
Lyon       2013     500715
           2018     518635
dtype: int64

With more involved datasets, this can be a useful way to keep track of the meaning of various index values.
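
The other route mentioned above, passing `names` to a constructor, can be sketched like this. The two-city index is a shortened stand-in for the one used in this section, and the variable names are illustrative. Once the levels are named, they can also be looked up by name rather than by position.

```python
import pandas as pd

named = pd.MultiIndex.from_product([['Paris', 'Lyon'], [2013, 2018]],
                                   names=['city', 'year'])
pop = pd.Series([2229621, 2175601, 500715, 518635], index=named)

print(list(pop.index.names))

# A named level can be retrieved by its name
years = pop.index.get_level_values('year')
print(list(years))
```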

## MultiIndex for columns

In a DataFrame, the rows and columns are completely symmetric, and just as the rows can have multiple levels of indices, the columns can have multiple levels as well. Consider the following, which is a mock-up of some (somewhat realistic) medical data:


In [520]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      52.0  37.4  58.0  37.3  30.0  35.9
     2      36.0  37.6  48.0  36.5  37.0  36.4
2014 1      33.0  35.3  32.0  35.8   6.0  38.4
     2      35.0  36.5  35.0  37.7  47.0  37.8


Here we see where the multi-indexing for both rows and columns can come in very handy. This is fundamentally four-dimensional data, where the dimensions are the subject, the measurement type, the year, and the visit number. With this in place we can, for example, index the top-level column by the person's name and get a full DataFrame containing just that person's information:


In [521]:
health_data['Guido']

type          HR  Temp
year visit
2013 1      58.0  37.3
     2      48.0  36.5
2014 1      32.0  35.8
     2      35.0  37.7


For complicated records containing multiple labeled measurements across multiple times for many subjects (people, countries, cities, etc.) use of hierarchical rows and columns can be extremely convenient!

# Indexing and Slicing a MultiIndex

Indexing and slicing on a MultiIndex is designed to be intuitive, and it helps if you think about the indices as added dimensions. We'll first look at indexing multiply indexed Series, and then multiply-indexed DataFrames.

## Multiply indexed Series

Consider the multiply indexed Series of city populations we saw earlier:

In [522]:
cities_pop

city       year
Paris      2013    2229621
           2018    2175601
Marseille  2013     855393
           2018     868277
Lyon       2013     500715
           2018     518635
dtype: int64

We can access single elements by indexing with multiple terms:


In [523]:
cities_pop['Paris', 2013]

2229621

The MultiIndex also supports partial indexing, or indexing just one of the levels in the index. The result is another Series, with the lower-level indices maintained:


In [524]:
cities_pop['Paris']

year
2013    2229621
2018    2175601
dtype: int64

Partial slicing is available as well, as long as the MultiIndex is sorted (see discussion in Sorted and Unsorted Indices). Our index is not sorted, so the following slice raises an error:


In [525]:
cities_pop.loc['Paris':'Lyon']


UnsortedIndexError: 'Key length (1) was greater than MultiIndex lexsort depth (0)'

With sorted indices, partial indexing can be performed on lower levels by passing an empty slice in the first index:


In [526]:
cities_pop[:, 2013]

city
Paris        2229621
Marseille     855393
Lyon          500715
dtype: int64

Other types of indexing and selection (discussed in Data Indexing and Selection) work as well; for example, selection based on Boolean masks:


In [527]:
cities_pop[cities_pop > 2000000]

city   year
Paris  2013    2229621
       2018    2175601
dtype: int64

Selection based on fancy indexing also works:


In [528]:
cities_pop[['Paris', 'Marseille']]

city       year
Paris      2013    2229621
           2018    2175601
Marseille  2013     855393
           2018     868277
dtype: int64

## Multiply indexed DataFrames

A multiply indexed DataFrame behaves in a similar manner. Consider our toy medical DataFrame from before:



In [529]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      52.0  37.4  58.0  37.3  30.0  35.9
     2      36.0  37.6  48.0  36.5  37.0  36.4
2014 1      33.0  35.3  32.0  35.8   6.0  38.4
     2      35.0  36.5  35.0  37.7  47.0  37.8


Remember that columns are primary in a DataFrame, and the syntax used for multiply indexed Series applies to the columns. For example, we can recover Guido's heart rate data with a simple operation:


In [530]:
health_data['Guido', 'HR']

year  visit
2013  1        58.0
      2        48.0
2014  1        32.0
      2        35.0
Name: (Guido, HR), dtype: float64

Also, as with the single-index case, we can use the loc and iloc indexers introduced in Data Indexing and Selection (the older ix indexer has been removed from current Pandas releases). For example:


In [531]:
health_data.iloc[:2, :2]


subject      Bob
type          HR  Temp
year visit
2013 1      52.0  37.4
     2      36.0  37.6


These indexers provide an array-like view of the underlying two-dimensional data, but each individual index in loc or iloc can be passed a tuple of multiple indices. For example:


In [532]:
health_data.loc[:, ('Bob', 'HR')]

year  visit
2013  1        52.0
      2        36.0
2014  1        33.0
      2        35.0
Name: (Bob, HR), dtype: float64

Working with slices within these index tuples is not especially convenient; trying to create a slice within a tuple will lead to a syntax error:


In [533]:
health_data.loc[(:, 1), (:, 'HR')]

SyntaxError: invalid syntax (Temp/ipykernel_37488/3311942670.py, line 1)

You could get around this by building the desired slice explicitly using Python's built-in slice() function, but a better way in this context is to use an IndexSlice object, which Pandas provides for precisely this situation. For example:


In [None]:
idx = pd.IndexSlice
health_data.loc[idx[:, 1], idx[:, 'HR']]
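
To make the comparison concrete, here is a sketch that applies both approaches to a small frame with the same index structure. The frame `toy` holds deterministic stand-in values rather than the random `health_data`, and all names are illustrative. The two selections return the same result.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Deterministic stand-in values (not the random data used above)
toy = pd.DataFrame(np.arange(16).reshape(4, 4), index=index, columns=columns)

# First visit of each year, heart-rate columns only
idx = pd.IndexSlice
with_index_slice = toy.loc[idx[:, 1], idx[:, 'HR']]

# The same selection written with explicit slice(None) objects
with_builtin_slice = toy.loc[(slice(None), 1), (slice(None), 'HR')]

print(with_index_slice)
```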


There are so many ways to interact with data in multiply indexed Series and DataFrames, and as with many tools in this book the best way to become familiar with them is to try them out!

# Rearranging Multi-Indices

One of the keys to working with multiply indexed data is knowing how to effectively transform the data. There are a number of operations that will preserve all the information in the dataset, but rearrange it for the purposes of various computations. We saw a brief example of this in the stack() and unstack() methods, but there are many more ways to finely control the rearrangement of data between hierarchical indices and columns, and we'll explore them here.

## Sorted and unsorted indices

Earlier, we briefly mentioned a caveat, but we should emphasize it more here. Many of the MultiIndex slicing operations will fail if the index is not sorted. Let's take a look at this here.

We'll start by creating some simple multiply indexed data where the indices are not lexicographically sorted:


In [None]:
index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]])
data = pd.Series(np.random.rand(6), index=index)
data.index.names = ['char', 'int']
data

If we try to take a partial slice of this index, it will result in an error:


In [None]:
try:
    data['a':'b']
except KeyError as e:
    print(type(e))
    print(e)

Although it is not entirely clear from the error message, this is the result of the MultiIndex not being sorted. For various reasons, partial slices and other similar operations require the levels in the MultiIndex to be in sorted (i.e., lexicographical) order. Pandas provides the sort_index() method of Series and DataFrame to perform this type of sorting (the older sortlevel() method has been removed from current releases). We'll use sort_index() here:


In [None]:
data = data.sort_index()
data

With the index sorted in this way, partial slicing will work as expected:


In [None]:
data['a':'b']
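
Since the cells above use random values, here is a deterministic sketch of the same steps. It assumes that the `is_monotonic_increasing` attribute of the index is a reasonable way to check whether sorting is needed; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([['a', 'c', 'b'], [1, 2]],
                                   names=['char', 'int'])
data = pd.Series(np.arange(6), index=index)

# Reports whether the labels are already in lexicographic order
needs_sort = not data.index.is_monotonic_increasing
print(needs_sort)

data_sorted = data.sort_index()
print(data_sorted.index.is_monotonic_increasing)

# Partial slicing is safe once the index is sorted
subset = data_sorted['a':'b']
print(subset)
```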

## Stacking and unstacking indices

As we saw briefly before, it is possible to convert a dataset from a stacked multi-index to a simple two-dimensional representation, optionally specifying the level to use:


In [None]:
cities_pop.unstack(level=0)

In [None]:
cities_pop.unstack(level=1)

The opposite of unstack() is stack(), which here can be used to recover the original series:


In [534]:
cities_pop.unstack().stack()

city       year
Lyon       2013     500715
           2018     518635
Marseille  2013     855393
           2018     868277
Paris      2013    2229621
           2018    2175601
dtype: int64

## Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the reset_index method. Calling this on the population Series will result in a DataFrame with city and year columns holding the information that was formerly in the index. For clarity, we can optionally specify the name of the data for the column representation:


In [535]:
cities_pop_flat = cities_pop.reset_index(name='population')
cities_pop_flat

        city  year  population
0      Paris  2013     2229621
1      Paris  2018     2175601
2  Marseille  2013      855393
3  Marseille  2018      868277
4       Lyon  2013      500715
5       Lyon  2018      518635


Often when working with data in the real world, the raw input data looks like this and it's useful to build a MultiIndex from the column values. This can be done with the set_index method of the DataFrame, which returns a multiply indexed DataFrame:


In [536]:
cities_pop_flat.set_index(['city', 'year'])

                population
city      year
Paris     2013     2229621
          2018     2175601
Marseille 2013      855393
          2018      868277
Lyon      2013      500715
          2018      518635


In practice, I find this type of reindexing to be one of the more useful patterns when encountering real-world datasets.
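
The two methods are inverses of each other, which the following sketch checks on a shortened version of the population data (the names `pop`, `flat`, and `restored` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('Paris', 2013), ('Paris', 2018), ('Lyon', 2013), ('Lyon', 2018)],
    names=['city', 'year'])
pop = pd.Series([2229621, 2175601, 500715, 518635], index=index)

# Index levels become ordinary columns
flat = pop.reset_index(name='population')

# Columns become index levels again
restored = flat.set_index(['city', 'year'])['population']

same_index = restored.index.equals(pop.index)
same_values = restored.tolist() == pop.tolist()
print(same_index, same_values)
```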

# Data Aggregations on Multi-Indices

We've previously seen that Pandas has built-in data aggregation methods, such as mean(), sum(), and max(). For hierarchically indexed data, these can be passed a level parameter that controls which subset of the data the aggregate is computed on.

For example, let's return to our health data:

In [537]:
health_data

subject      Bob       Guido         Sue
type          HR  Temp    HR  Temp    HR  Temp
year visit
2013 1      52.0  37.4  58.0  37.3  30.0  35.9
     2      36.0  37.6  48.0  36.5  37.0  36.4
2014 1      33.0  35.3  32.0  35.8   6.0  38.4
     2      35.0  36.5  35.0  37.7  47.0  37.8


Perhaps we'd like to average-out the measurements in the two visits each year. We can do this by naming the index level we'd like to explore, in this case the year:


In [538]:
data_mean = health_data.mean(level='year')
data_mean

subject   Bob       Guido          Sue
type       HR  Temp    HR   Temp    HR   Temp
year
2013     44.0  37.5  53.0  36.90  33.5  36.15
2014     34.0  35.9  33.5  36.75  26.5  38.10


By further making use of the axis keyword, we can take the mean among levels on the columns as well:


In [539]:
data_mean.mean(axis=1, level='type')

type         HR       Temp
year
2013  43.500000  36.850000
2014  31.333333  36.916667


Thus in two lines, we've been able to find the average heart rate and temperature measured among all subjects in all visits each year. This syntax is actually a short cut to the GroupBy functionality, which we will discuss in Aggregation and Grouping. While this is a toy example, many real-world datasets have similar hierarchical structure.
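
Pandas 2.0 and later no longer accept the `level` argument in aggregation methods such as `mean()`, so on those versions the same computation has to be written with `groupby` directly. The following is a sketch under that assumption, using a small deterministic stand-in for `health_data` with only two subjects; the frame `toy` and the result names are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])
# Small deterministic stand-in for health_data
values = np.array([[52.0, 37.4, 30.0, 35.9],
                   [36.0, 37.6, 37.0, 36.4],
                   [33.0, 35.3,  6.0, 38.4],
                   [35.0, 36.5, 47.0, 37.8]])
toy = pd.DataFrame(values, index=index, columns=columns)

# Average over visits within each year (row level 'year')
by_year = toy.groupby(level='year').mean()

# Average over subjects for each measurement type (column level 'type'):
# transpose so the column levels become row levels, group, transpose back
by_type = by_year.T.groupby(level='type').mean().T
print(by_type)
```

Transposing before grouping is one way to aggregate over a column level; it avoids relying on the `axis` argument of `groupby`.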

# Aside: Panel Data

Older versions of Pandas had a few other fundamental data structures that we have not yet discussed, namely the pd.Panel and pd.Panel4D objects. These can be thought of, respectively, as three-dimensional and four-dimensional generalizations of the (one-dimensional) Series and (two-dimensional) DataFrame structures. Both have since been removed from the library, along with the ix indexer, so code written for current Pandas releases should use hierarchical indexing instead.

We won't cover these panel structures further in this text, as I've found in the majority of cases that multi-indexing is a more useful and conceptually simpler representation for higher-dimensional data. Additionally, panel data is fundamentally a dense data representation, while multi-indexing is fundamentally a sparse data representation. As the number of dimensions increases, the dense representation can become very inefficient for the majority of real-world datasets. If you'd like to read more about the Panel and Panel4D structures, see the references listed in Further Resources.
