
<a id='advanced'></a>

# MultiIndex / advanced indexing

This section covers [indexing with a MultiIndex](#advanced-hierarchical)
and [other advanced indexing features](#indexing-index-types).

See [Indexing and Selecting Data](indexing.ipynb#indexing) for general indexing documentation.

Whether a copy or a reference is returned for a setting operation may
depend on the context.  This is sometimes called `chained assignment` and
should be avoided.  See [Returning a View versus Copy](indexing.ipynb#indexing-view-versus-copy).

See the [cookbook](cookbook.ipynb#cookbook-selection) for some advanced strategies.


<a id='advanced-hierarchical'></a>

## Hierarchical indexing (MultiIndex)

Hierarchical / Multi-level indexing is very exciting as it opens the door to some
quite sophisticated data analysis and manipulation, especially for working with
higher dimensional data. In essence, it enables you to store and manipulate
data with an arbitrary number of dimensions in lower dimensional data
structures like `Series` (1d) and `DataFrame` (2d).

In this section, we will show what exactly we mean by “hierarchical” indexing
and how it integrates with all of the pandas indexing functionality
described above and in prior sections. Later, when discussing [group by](groupby.ipynb#groupby) and [pivoting and reshaping data](reshaping.ipynb#reshaping), we’ll show
non-trivial applications to illustrate how it aids in structuring data for
analysis.

See the [cookbook](cookbook.ipynb#cookbook-multi-index) for some advanced strategies.

Changed in version 0.24.0: `MultiIndex.labels` has been renamed to `MultiIndex.codes`
and `MultiIndex.set_labels` to `MultiIndex.set_codes`.

### Creating a MultiIndex (hierarchical index) object

The `MultiIndex` object is the hierarchical analogue of the standard
`Index` object which typically stores the axis labels in pandas objects. You
can think of `MultiIndex` as an array of tuples where each tuple is unique. A
`MultiIndex` can be created from a list of arrays (using
`MultiIndex.from_arrays()`), an array of tuples (using
`MultiIndex.from_tuples()`), a crossed set of iterables (using
`MultiIndex.from_product()`), or a `DataFrame` (using
`MultiIndex.from_frame()`).  The `Index` constructor will attempt to return
a `MultiIndex` when it is passed a list of tuples.  The following examples
demonstrate different ways to initialize MultiIndexes.
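As a minimal sketch of the constructors named above (the labels `bar`/`baz` and `one`/`two` and the level names are illustrative choices, not taken from the original examples):

```python
import pandas as pd

arrays = [
    ["bar", "bar", "baz", "baz"],
    ["one", "two", "one", "two"],
]

# The same index built two ways: from parallel arrays and from a list of tuples.
from_arrays = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
tuples = list(zip(*arrays))
from_tuples = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])

# The plain Index constructor also yields a MultiIndex for a list of tuples.
from_index = pd.Index(tuples)
```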

When you want every pairing of the elements in two iterables, it can be easier
to use the `MultiIndex.from_product()` method:
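For instance, a sketch with two illustrative iterables (the values and level names are assumptions made for this example):

```python
import pandas as pd

iterables = [["bar", "baz", "foo"], ["one", "two"]]

# Every pairing of the two iterables, in row-major order.
product_index = pd.MultiIndex.from_product(iterables, names=["first", "second"])
```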

You can also construct a `MultiIndex` from a `DataFrame` directly, using
the method `MultiIndex.from_frame()`. This is a complementary method to
`MultiIndex.to_frame()`.

New in version 0.24.0.
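A small sketch of the round trip between a frame and an index, assuming an illustrative two-column frame:

```python
import pandas as pd

frame = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)

# Column names become level names.
frame_index = pd.MultiIndex.from_frame(frame)

# to_frame goes the other way; index=False gives a default integer index.
round_trip = frame_index.to_frame(index=False)
```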

As a convenience, you can pass a list of arrays directly into `Series` or
`DataFrame` to construct a `MultiIndex` automatically:
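A sketch of that convenience, using illustrative labels and data:

```python
import numpy as np
import pandas as pd

arrays = [
    np.array(["bar", "bar", "baz", "baz"]),
    np.array(["one", "two", "one", "two"]),
]

# Passing the list of arrays as the index builds a two-level MultiIndex.
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=arrays)
df = pd.DataFrame(np.zeros((4, 2)), index=arrays)
```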

All of the `MultiIndex` constructors accept a `names` argument which stores
string names for the levels themselves. If no names are provided, `None` will
be assigned:

This index can back any axis of a pandas object, and the number of **levels**
of the index is up to you:

We’ve “sparsified” the higher levels of the indexes to make the console output a
bit easier on the eyes. Note that how the index is displayed can be controlled using the
`display.multi_sparse` option in `pandas.set_option()`:
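A sketch of the display option in action. It uses `pandas.option_context` so the setting is only changed temporarily; the data is illustrative.

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series(range(4), index=index)

# Default display: repeated outer labels are blanked out.
sparse_repr = repr(s)

# With display.multi_sparse disabled, every row shows its full key.
with pd.option_context("display.multi_sparse", False):
    dense_repr = repr(s)
```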

It’s worth keeping in mind that there’s nothing preventing you from using
tuples as atomic labels on an axis:
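A sketch contrasting the two cases. It assumes the `tupleize_cols=False` argument of the `Index` constructor to keep the tuples as single labels.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two")]

# Two-level hierarchical index.
s_multi = pd.Series([1, 2], index=pd.MultiIndex.from_tuples(tuples))

# Flat index whose labels happen to be tuples.
s_flat = pd.Series([1, 2], index=pd.Index(tuples, tupleize_cols=False))
```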

The reason that the `MultiIndex` matters is that it can allow you to do
grouping, selection, and reshaping operations as we will describe below and in
subsequent areas of the documentation. As you will see in later sections, you
can find yourself working with hierarchically-indexed data without creating a
`MultiIndex` explicitly yourself. However, when loading data from a file, you
may wish to generate your own `MultiIndex` when preparing the data set.


<a id='advanced-get-level-values'></a>

### Reconstructing the level labels

The method `get_level_values()` will return a vector of the labels for each
location at a particular level:
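For example, with an illustrative two-level index, a level can be requested by position or by name:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)

first = index.get_level_values(0)          # by position
second = index.get_level_values("second")  # by name
```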

### Basic indexing on axis with MultiIndex

One of the important features of hierarchical indexing is that you can select
data by a “partial” label identifying a subgroup in the data. **Partial**
selection “drops” levels of the hierarchical index in the result in a
completely analogous way to selecting a column in a regular DataFrame:
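A sketch of partial selection on hierarchical columns (the frame and its labels are illustrative):

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one")]
)
df = pd.DataFrame(np.arange(6).reshape(2, 3), columns=columns)

# Selecting by the first-level label returns the "bar" subgroup and
# drops that level, leaving plain column labels "one" and "two".
bar = df["bar"]

# A full key can be given at once or by chaining partial selections.
full_key = df["bar", "one"]
chained = df["bar"]["one"]
```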

See [Cross-section with hierarchical index](#advanced-xs) for how to select
on a deeper level.


<a id='advanced-shown-levels'></a>

### Defined levels

The `MultiIndex` keeps all the defined levels of an index, even
if they are not actually used. When slicing an index, you may notice this.
For example:

This is done to avoid a recomputation of the levels in order to make slicing
highly performant. If you want to see only the used levels, you can use the
`get_level_values()` method.

To reconstruct the `MultiIndex` with only the used levels, the
`remove_unused_levels()` method may be used.

New in version 0.20.0.
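A sketch of the difference between defined and used levels, assuming an illustrative frame with two-level columns:

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product([["A", "B"], ["x", "y"]])
df = pd.DataFrame(np.zeros((2, 4)), columns=columns)

# Keep only the columns under "B"; the level definition still lists "A".
sub = df[["B"]]
defined = sub.columns.levels[0]

# Labels actually present, and an index rebuilt without unused levels.
used = sub.columns.get_level_values(0).unique()
trimmed = sub.columns.remove_unused_levels()
```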

### Data alignment and using `reindex`

Operations between differently-indexed objects having `MultiIndex` on the
axes will work as you expect; data alignment will work the same as an Index of
tuples:

The `reindex()` method of `Series`/`DataFrame` can be
called with another `MultiIndex`, or even a list or array of tuples:
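A sketch of both behaviours on an illustrative `Series`:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Alignment matches full tuples; keys missing from one operand give NaN.
total = s + s.iloc[:2]

# reindex accepts a list of tuples (or another MultiIndex).
picked = s.reindex([("baz", "two"), ("bar", "one")])
```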


<a id='advanced-advanced-hierarchical'></a>

## Advanced indexing with hierarchical index

Syntactically integrating `MultiIndex` in advanced indexing with `.loc` is a
bit challenging, but we’ve made every effort to do so. In general, MultiIndex
keys take the form of tuples. For example, the following works as you would expect:

Note that `df.loc['bar', 'two']` would also work in this example, but this shorthand
notation can lead to ambiguity in general.

If you also want to index a specific column with `.loc`, you must use a tuple
like this:
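A sketch of full-key selection with `.loc`, using an illustrative frame:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
df = pd.DataFrame({"A": [1, 2, 3, 4], "B": [10, 20, 30, 40]}, index=index)

# A tuple addresses one complete row key.
row = df.loc[("bar", "two")]

# Row key as a tuple, then the column label.
value = df.loc[("bar", "two"), "A"]
```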

You don’t have to specify all levels of the `MultiIndex` by passing only the
first elements of the tuple. For example, you can use “partial” indexing to
get all elements with `bar` in the first level as follows:

    df.loc['bar']

This is a shortcut for the slightly more verbose notation `df.loc[('bar',),]` (equivalent
to `df.loc['bar',]` in this example).

“Partial” slicing also works quite nicely.

You can slice with a ‘range’ of values, by providing a slice of tuples.
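A sketch of partial indexing and slicing on an illustrative, already sorted index:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz", "foo"], ["one", "two"]])
df = pd.DataFrame({"A": range(6)}, index=index)

# Partial indexing: all rows under "bar", with the first level dropped.
bar_rows = df.loc["bar"]

# Partial slicing on the first level (both endpoints included).
baz_to_foo = df.loc["baz":"foo"]

# Slicing between two complete keys.
between = df.loc[("bar", "two"):("baz", "one")]
```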

Passing a list of labels or tuples works similarly to reindexing:

>**Note**
>
>It is important to note that tuples and lists are not treated identically
in pandas when it comes to indexing. Whereas a tuple is interpreted as one
multi-level key, a list is used to specify several keys. Or in other words,
tuples go horizontally (traversing levels), lists go vertically (scanning levels).

Importantly, a list of tuples indexes several complete `MultiIndex` keys,
whereas a tuple of lists refers to several values within a level:
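A sketch of the distinction on an illustrative `Series`:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["A", "B", "C"], ["c", "d"]])
s = pd.Series([1, 2, 3, 4, 5, 6], index=index)

# List of tuples: two complete keys.
two_keys = s.loc[[("A", "c"), ("B", "d")]]

# Tuple of lists: every combination of the listed values per level.
grid = s.loc[(["A", "B"], ["c", "d"])]
```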


<a id='advanced-mi-slicers'></a>

### Using slicers

You can slice a `MultiIndex` by providing multiple indexers.

You can provide any of the selectors as if you are indexing by label, see [Selection by Label](indexing.ipynb#indexing-label),
including slices, lists of labels, labels, and boolean indexers.

You can use `slice(None)` to select all the contents of *that* level. You do not need to specify all the
*deeper* levels; they will be implied as `slice(None)`.

As usual, **both sides** of the slicers are included as this is label indexing.

You should specify all axes in the `.loc` specifier, meaning the indexer for the **index** and
for the **columns**. There are some ambiguous cases where the passed indexer could be mis-interpreted
as indexing *both* axes, rather than into say the `MultiIndex` for the rows.

You should do this:

    df.loc[(slice('A1', 'A3'), ...), :]

You should **not** do this:

    df.loc[(slice('A1', 'A3'), ...)]

Basic MultiIndex slicing using slices, lists, and labels.

You can use `pandas.IndexSlice` to facilitate a more natural syntax
using `:`, rather than using `slice(None)`.
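A sketch comparing explicit `slice` objects with `pandas.IndexSlice`; the `A*`/`B*` labels and the column names are illustrative:

```python
import pandas as pd

idx = pd.IndexSlice
index = pd.MultiIndex.from_product([["A0", "A1", "A2", "A3"], ["B0", "B1"]])
df = pd.DataFrame({"x": range(8), "y": range(8, 16)}, index=index)

# Explicit slice objects: both axes are specified.
explicit = df.loc[(slice("A1", "A3"), slice(None)), :]

# The same selection written with IndexSlice.
natural = df.loc[idx["A1":"A3", :], :]

# Everything on the first level, one label on the second, one column.
b1_x = df.loc[idx[:, "B1"], "x"]

# A list of labels on the first level combined with a scalar on the second.
corners = df.loc[idx[["A0", "A3"], "B0"], :]
```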

It is possible to perform quite complicated selections using this method on multiple
axes at the same time.

Using a boolean indexer you can provide selection related to the *values*.

You can also specify the `axis` argument to `.loc` to interpret the passed
slicers on a single axis.

Furthermore, you can *set* the values using the following methods.

You can use a right-hand-side of an alignable object as well.
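A sketch of setting through slicers, first with a scalar and then with an alignable frame on the right-hand side (all data is illustrative):

```python
import pandas as pd

idx = pd.IndexSlice
index = pd.MultiIndex.from_product([["A0", "A1"], ["B0", "B1"]])
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [5, 6, 7, 8]}, index=index)

# Scalar assignment to every "B1" row.
zeroed = df.copy()
zeroed.loc[idx[:, "B1"], :] = 0

# The right-hand side is aligned on index and columns before assignment,
# so each selected row receives its own scaled values.
scaled = df.copy()
scaled.loc[idx[:, "B1"], :] = scaled * 10
```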


<a id='advanced-xs'></a>

### Cross-section

The `xs()` method of `DataFrame` additionally takes a level argument to make
selecting data at a particular level of a `MultiIndex` easier.

You can also select on the columns with `xs`, by
providing the axis argument.

`xs` also allows selection with multiple keys.

You can pass `drop_level=False` to `xs` to retain
the level that was selected.

Compare the above with the result using `drop_level=True` (the default value).
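A sketch of `xs` with the `level`, `axis`, and `drop_level` arguments, on an illustrative frame:

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=index)

# Select on the inner level; that level is dropped by default.
dropped = df.xs("one", level="second")

# Keep the selected level in the result.
kept = df.xs("one", level="second", drop_level=False)

# The same selection on the columns of the transposed frame.
by_columns = df.T.xs("one", level="second", axis=1)
```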


<a id='advanced-advanced-reindex'></a>

### Advanced reindexing and alignment

Using the parameter `level` in the `reindex()` and
`align()` methods of pandas objects is useful to broadcast
values across a level. For instance:
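A sketch that broadcasts per-group means back onto the full index (the labels and values are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
df = pd.DataFrame({"v": [1.0, 3.0, 5.0, 7.0]}, index=index)

# One row per outer label.
means = df.groupby(level=0).mean()

# Repeat each mean for every inner label of the original index.
broadcast = means.reindex(df.index, level=0)

# align performs the same broadcast while returning both objects.
df_aligned, means_aligned = df.align(means, level=0)
```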

### Swapping levels with `swaplevel`

The `swaplevel()` method can switch the order of two levels:


<a id='advanced-reorderlevels'></a>

### Reordering levels with `reorder_levels`

The `reorder_levels()` method generalizes the `swaplevel`
method, allowing you to permute the hierarchical index levels in one step:
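A sketch showing that, for two levels, `swaplevel` and `reorder_levels` give the same result (the frame is illustrative). Neither method re-sorts the rows.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
df = pd.DataFrame({"A": [1, 2, 3, 4]}, index=index)

swapped = df.swaplevel(0, 1)
reordered = df.reorder_levels(["second", "first"])
```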


<a id='advanced-index-names'></a>

### Renaming names of an `Index` or `MultiIndex`

The `rename()` method is used to rename the labels of a
`MultiIndex`, and is typically used to rename the columns of a `DataFrame`.
The `columns` argument of `rename` allows a dictionary to be specified
that includes only the columns you wish to rename.

This method can also be used to rename specific labels of the main index
of the `DataFrame`.

The `rename_axis()` method is used to rename the name of an
`Index` or `MultiIndex`. In particular, the names of the levels of a
`MultiIndex` can be specified, which is useful if `reset_index()` is later
used to move the values from the `MultiIndex` to a column.

Note that the columns of a `DataFrame` are an index, so that using
`rename_axis` with the `columns` argument will change the name of that
index.

Both `rename` and `rename_axis` support specifying a dictionary,
`Series` or a mapping function to map labels/names to new values.
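A sketch contrasting `rename` (labels) with `rename_axis` (names); the column and level names here are illustrative:

```python
import pandas as pd

index = pd.MultiIndex.from_tuples([("x", "a"), ("y", "b")])
df = pd.DataFrame({"col0": [1, 2], "col1": [3, 4]}, index=index)

# rename changes labels: a column label, then a row label.
renamed_cols = df.rename(columns={"col0": "first_col"})
renamed_rows = df.rename(index={"x": "X"})

# A mapping function works as well.
upper_cols = df.rename(columns=str.upper)

# rename_axis changes the names of the index levels or of the columns index.
named = df.rename_axis(index=["outer", "inner"])
named_cols = df.rename_axis(columns="fields")

# Named levels become column names after reset_index.
flat = named.reset_index()
```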

## Sorting a `MultiIndex`

For `MultiIndex`-ed objects to be indexed and sliced effectively,
they need to be sorted. As with any index, you can use `sort_index()`.


<a id='advanced-sortlevel-byname'></a>
You may also pass a level name to `sort_index` if the `MultiIndex` levels
are named.
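A sketch of sorting a shuffled index, first on all levels and then by a named level (the data is illustrative):

```python
import pandas as pd

tuples = [("baz", "two"), ("bar", "one"), ("baz", "one"), ("bar", "two")]
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
s = pd.Series([1, 2, 3, 4], index=index)

# Lexicographic sort over all levels.
by_all = s.sort_index()

# Sort by the named level first; remaining levels break ties.
by_second = s.sort_index(level="second")
```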

On higher dimensional objects, you can sort any of the other axes by level if
they have a `MultiIndex`:

Indexing will work even if the data are not sorted, but will be rather
inefficient (and show a `PerformanceWarning`). It will also
return a copy of the data rather than a view:

```ipython
In [4]: dfm.loc[(1, 'z')]
PerformanceWarning: indexing past lexsort depth may impact performance.

Out[4]:
           jolie
jim joe
1   z    0.64094
```



<a id='advanced-unsorted'></a>
Furthermore, if you try to index something that is not fully lexsorted, this can raise:

```ipython
In [5]: dfm.loc[(0, 'y'):(1, 'z')]
UnsortedIndexError: 'Key length (2) was greater than MultiIndex lexsort depth (1)'
```


The `is_lexsorted()` method on a `MultiIndex` shows if the
index is sorted, and the `lexsort_depth` property returns the sort depth:

And now selection works as expected.
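A sketch of the sort-then-select workflow. The frame is a hypothetical stand-in for `dfm`, reusing the column names from the output shown earlier, and the check uses `Index.is_monotonic_increasing`, which is an assumption chosen for portability across pandas versions rather than the method named in the text.

```python
import pandas as pd

# Hypothetical unsorted frame: (1, 'z') comes before (1, 'y').
dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])

was_sorted = dfm.index.is_monotonic_increasing

dfm_sorted = dfm.sort_index()
now_sorted = dfm_sorted.index.is_monotonic_increasing

# Slicing between two complete keys on the sorted index.
selection = dfm_sorted.loc[(0, "x"):(1, "z")]
```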

## Take methods


<a id='advanced-take'></a>
Similar to NumPy ndarrays, pandas `Index`, `Series`, and `DataFrame` also provide
the `take()` method that retrieves elements along a given axis at the given
indices. The given indices must be either a list or an ndarray of integer
index positions. `take` will also accept negative integers as relative positions to the end of the object.

For DataFrames, the given indices should be a 1d list or ndarray that specifies
row or column positions.
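A sketch of `take` on each object type, with illustrative data and positions:

```python
import numpy as np
import pandas as pd

index = pd.Index([10, 20, 30, 40, 50])
positions = [2, 0, -1]  # -1 counts from the end

taken_index = index.take(positions)

ser = pd.Series(list("abcde"), index=index)
taken_ser = ser.take(positions)

frm = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("ABC"))
rows = frm.take([1, 3])          # row positions
cols = frm.take([0, 2], axis=1)  # column positions
```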

It is important to note that the `take` method on pandas objects is not
intended to work on boolean indices and may return unexpected results.

Finally, as a small note on performance, because the `take` method handles
a narrower range of inputs, it can offer performance that is a good deal
faster than fancy indexing.


<a id='indexing-index-types'></a>

## Index types

We have discussed `MultiIndex` in the previous sections pretty extensively.
Documentation about `DatetimeIndex` and `PeriodIndex` is shown [here](timeseries.ipynb#timeseries-overview),
and documentation about `TimedeltaIndex` is found [here](timedeltas.ipynb#timedeltas-index).

In the following sub-sections we will highlight some other index types.


<a id='indexing-categoricalindex'></a>

### CategoricalIndex

`CategoricalIndex` is a type of index that is useful for supporting
indexing with duplicates. This is a container around a `Categorical`
and allows efficient indexing and storage of an index with a large number of duplicated elements.

Setting the index will create a `CategoricalIndex`.

Indexing with `__getitem__/.iloc/.loc` works similarly to an `Index` with duplicates.
The indexers **must** be in the category or the operation will raise a `KeyError`.

The `CategoricalIndex` is **preserved** after indexing:

Sorting the index will sort by the order of the categories (recall that we
created the index with `CategoricalDtype(list('cab'))`, so the sorted
order is `cab`).
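A sketch of the steps described in this subsection so far. The frame is illustrative, and the category order `cab` follows the text above.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))

# Setting the categorical column as the index creates a CategoricalIndex.
df2 = df.set_index("B")

# Label lookup returns every row for that category and keeps the index type.
a_rows = df2.loc["a"]

# Sorting follows the category order c, a, b rather than alphabetical order.
sorted_labels = df2.sort_index().index.tolist()
```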

Groupby operations on the index will preserve the index nature as well.

Reindexing operations will return a resulting index based on the type of the passed
indexer. Passing a list will return a plain-old `Index`; indexing with
a `Categorical` will return a `CategoricalIndex`, indexed according to the categories
of the **passed** `Categorical` dtype. This allows one to arbitrarily index these even with
values **not** in the categories, similarly to how you can reindex **any** pandas index.

Reshaping and Comparison operations on a `CategoricalIndex` must have the same categories
or a `TypeError` will be raised.

```ipython
In [9]: df3 = pd.DataFrame({'A': np.arange(6), 'B': pd.Series(list('aabbca')).astype('category')})

In [11]: df3 = df3.set_index('B')

In [11]: df3.index
Out[11]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['a', 'b', 'c'], ordered=False, name='B', dtype='category')

In [12]: pd.concat([df2, df3])
TypeError: categories must match existing categories when appending
```



<a id='indexing-rangeindex'></a>

### Int64Index and RangeIndex

Indexing on an integer-based Index with floats has been clarified in 0.18.0. For a summary of the changes, see [here](whatsnew/v0.18.0.ipynb#whatsnew-0180-float-indexers).

`Int64Index` is a fundamental index type in pandas.
This is an immutable array implementing an ordered, sliceable set.
Prior to 0.18.0, the `Int64Index` would provide the default index for all `NDFrame` objects.

`RangeIndex` is a sub-class of `Int64Index` added in version 0.18.0, now providing the default index for all `NDFrame` objects.
`RangeIndex` is an optimized version of `Int64Index` that can represent a monotonic ordered set. These are analogous to Python [range types](https://docs.python.org/3/library/stdtypes.html#typesseq-range).
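A short sketch showing the default index of a `Series` created without one, and its `range`-like attributes:

```python
import pandas as pd

s = pd.Series(["a", "b", "c"])

# The default index stores only start, stop and step, like Python's range.
default_index = s.index
bounds = (default_index.start, default_index.stop, default_index.step)
```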


<a id='indexing-float64index'></a>

### Float64Index

By default a `Float64Index` will be automatically created when passing floating, or mixed-integer-floating values in index creation.
This enables a pure label-based slicing paradigm that makes scalar indexing and slicing with
`[]`, `.ix`, and `.loc` work exactly the same.

Scalar selection for `[],.loc` will always be label based. An integer will match an equal float index (e.g. `3` is equivalent to `3.0`).

The only positional indexing is via `iloc`.

A scalar index that is not found will raise a `KeyError`.
Slicing is primarily on the values of the index when using `[],ix,loc`, and
**always** positional when using `iloc`. The exception is when the slice is
boolean, in which case it will always be positional.

In float indexes, slicing using floats is allowed.
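A sketch of label-based versus positional access on a float index (the values are illustrative):

```python
import pandas as pd

sf = pd.Series(range(5), index=[1.5, 2, 3, 4.5, 5])

# Labels: the integer 3 matches the float label 3.0.
by_int_label = sf.loc[3]
by_float_label = sf.loc[3.0]

# Position: iloc ignores the labels entirely.
by_position = sf.iloc[3]

# Label slices include both endpoints and may use floats not in the index.
int_slice = sf.loc[2:4]
float_slice = sf.loc[2.1:4.6]
```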

In non-float indexes, slicing using floats will raise a `TypeError`.

```ipython
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)

In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
```


Using a scalar float indexer for `.iloc` has been removed in 0.18.0, so the following will raise a `TypeError`:

```ipython
In [3]: pd.Series(range(5)).iloc[3.0]
TypeError: cannot do positional indexing on <class 'pandas.indexes.range.RangeIndex'> with these indexers [3.0] of <type 'float'>
```


Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat
irregular timedelta-like indexing scheme, but the data is recorded as floats. This could, for
example, be millisecond offsets.

Selection operations then will always work on a value basis, for all selection operators.

You could retrieve the first 1 second (1000 ms) of data as such:

If you need integer based selection, you should use `iloc`:
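A sketch of this use case with hypothetical millisecond offsets as the index:

```python
import pandas as pd

# Irregular offsets in milliseconds, stored as floats (illustrative values).
offsets = [0.0, 250.5, 500.0, 750.25, 1000.0, 1000.4, 1250.0, 1500.0]
dfir = pd.DataFrame({"A": range(8)}, index=offsets)

# Value-based: everything recorded within the first 1000 ms.
first_second = dfir.loc[0.0:1000.0]

# Position-based: the first four rows, whatever their offsets.
first_four = dfir.iloc[0:4]
```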


<a id='advanced-intervalindex'></a>

### IntervalIndex

New in version 0.20.0.

`IntervalIndex`, together with its own dtype, `IntervalDtype`,
and the `Interval` scalar type, allows first-class support in pandas
for interval notation.

The `IntervalIndex` allows some unique indexing and is also used as a
return type for the categories in `cut()` and `qcut()`.

#### Indexing with an `IntervalIndex`

An `IntervalIndex` can be used in `Series` and in `DataFrame` as the index.

Label based indexing via `.loc` along the edges of an interval works as you would expect,
selecting that particular interval.

If you select a label *contained* within an interval, this will also select the interval.

Selecting using an `Interval` will only return exact matches (starting from pandas 0.25.0).
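A sketch of these lookups, assuming an illustrative frame indexed by right-closed unit intervals:

```python
import pandas as pd

# Intervals (0, 1], (1, 2], (2, 3], (3, 4].
df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)

# A label on the closed edge selects that interval.
on_edge = df.loc[2, "A"]

# A label strictly inside an interval selects it too.
inside = df.loc[2.5, "A"]

# An Interval key must match an index entry exactly.
exact = df.loc[pd.Interval(1, 2), "A"]
```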

Trying to select an `Interval` that is not exactly contained in the `IntervalIndex` will raise a `KeyError`.

    In [7]: df.loc[pd.Interval(0.5, 2.5)]
    KeyError: Interval(0.5, 2.5, closed='right')

Selecting all `Intervals` that overlap a given `Interval` can be performed using the
`overlaps()` method to create a boolean indexer.
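A sketch of overlap-based selection on an illustrative frame:

```python
import pandas as pd

# Intervals (0, 1], (1, 2], (2, 3], (3, 4].
df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)

# Boolean mask: True where the index interval overlaps (0.5, 2.5].
mask = df.index.overlaps(pd.Interval(0.5, 2.5))
overlapping = df[mask]
```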

#### Binning data with `cut` and `qcut`

`cut()` and `qcut()` both return a `Categorical` object, and the bins they
create are stored as an `IntervalIndex` in its `.categories` attribute.

`cut()` also accepts an `IntervalIndex` for its `bins` argument, which enables
a useful pandas idiom. First, we call `cut()` with some data and `bins` set to a
fixed number, to generate the bins. Then, we pass the values of `.categories` as the
`bins` argument in subsequent calls to `cut()`, supplying new data which will be
binned into the same bins.

Any value which falls outside all bins will be assigned a `NaN` value.
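A sketch of the idiom with illustrative data: the bins from a first call are reused for new values.

```python
import pandas as pd

# First call: let cut derive two equal-width bins from the data.
binned = pd.cut([0, 1, 2, 3], bins=2)
bins = binned.categories  # an IntervalIndex

# Later call: bin new data into the same intervals.
rebinned = pd.cut([0, 3, 5, 1], bins=bins)

# Code -1 marks a value outside every bin (shown as NaN).
codes = rebinned.codes.tolist()
```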

#### Generating ranges of intervals

If we need intervals on a regular frequency, we can use the `interval_range()` function
to create an `IntervalIndex` using various combinations of `start`, `end`, and `periods`.
The default frequency for `interval_range` is 1 for numeric intervals, and calendar day for
datetime-like intervals:

The `freq` parameter can be used to specify non-default frequencies, and can utilize a variety
of [frequency aliases](timeseries.ipynb#timeseries-offset-aliases) with datetime-like intervals:

Additionally, the `closed` parameter can be used to specify which side(s) the intervals
are closed on.  Intervals are closed on the right side by default.
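A sketch of `interval_range` with the default frequency, with an explicit `freq`, and with a non-default `closed` side (the start and end values are illustrative):

```python
import pandas as pd

# Defaults: unit-width numeric intervals, one calendar day for datetimes.
unit = pd.interval_range(start=0, end=5)
daily = pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)

# Explicit frequencies: a number, or a frequency alias for datetimes.
stepped = pd.interval_range(start=0, periods=4, freq=1.5)
weekly = pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")

# Close the intervals on both sides instead of the default right side.
both = pd.interval_range(start=0, end=4, closed="both")
```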

New in version 0.23.0.

Specifying `start`, `end`, and `periods` will generate a range of evenly spaced
intervals from `start` to `end` inclusively, with `periods` number of elements
in the resulting `IntervalIndex`:
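For example, a sketch with illustrative bounds:

```python
import pandas as pd

# Four evenly spaced intervals covering 0 to 6.
even = pd.interval_range(start=0, end=6, periods=4)
```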

## Miscellaneous indexing FAQ

### Integer indexing

Label-based indexing with integer axis labels is a thorny topic. It has been
discussed heavily on mailing lists and among various members of the scientific
Python community. In pandas, our general viewpoint is that labels matter more
than integer locations. Therefore, with an integer axis index *only*
label-based indexing is possible with the standard tools like `.loc`. The
following code will generate exceptions:
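A sketch of the behaviour on an illustrative `Series` with the default integer index. The exception is caught here so the snippet runs to completion.

```python
import pandas as pd

s = pd.Series(range(5))

# -1 is treated as a label, and no such label exists.
try:
    s.loc[-1]
    raised = False
except KeyError:
    raised = True

# Positional access has to be requested explicitly.
last = s.iloc[-1]
```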

This deliberate decision was made to prevent ambiguities and subtle bugs (many
users reported finding bugs when the API change was made to stop “falling back”
on position-based indexing).

### Non-monotonic indexes require exact matches

If the index of a `Series` or `DataFrame` is monotonically increasing or decreasing, then the bounds
of a label-based slice can be outside the range of the index, much like slice indexing a
normal Python `list`. Monotonicity of an index can be tested with the `is_monotonic_increasing` and
`is_monotonic_decreasing` attributes.

On the other hand, if the index is not monotonic, then both slice bounds must be
*unique* members of the index.
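A sketch contrasting the two cases; both frames and their index values are illustrative:

```python
import pandas as pd

# Monotonic index: slice bounds may fall outside the index.
mono = pd.DataFrame({"data": range(5)}, index=[2, 3, 3, 4, 5])
clipped = mono.loc[0:4, :]
empty = mono.loc[13:15, :]

# Non-monotonic index: both bounds must be unique members.
df = pd.DataFrame({"data": range(6)}, index=[2, 3, 1, 4, 3, 5])
middle = df.loc[2:4, :]
```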

```ipython
# 0 is not in the index
In [9]: df.loc[0:4, :]
KeyError: 0

# 3 is not a unique label
In [11]: df.loc[2:3, :]
KeyError: 'Cannot get right slice bound for non-unique label: 3'
```


`Index.is_monotonic_increasing` and `Index.is_monotonic_decreasing` only check that
an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with
the `is_unique` attribute.
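A sketch of the combined check on an illustrative index with a repeated label:

```python
import pandas as pd

weakly = pd.Index([1, 2, 2, 3])

# Weak monotonicity tolerates the repeated 2.
is_weak = weakly.is_monotonic_increasing

# Strict monotonicity additionally requires unique labels.
is_strict = weakly.is_monotonic_increasing and weakly.is_unique
```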


<a id='advanced-endpoints-are-inclusive'></a>

### Endpoints are inclusive

Compared with standard Python sequence slicing in which the slice endpoint is
not inclusive, label-based slicing in pandas **is inclusive**. The primary
reason for this is that it is often not possible to easily determine the
“successor” or next element after a particular label in an index. For example,
consider the following `Series`:

Suppose we wished to slice from `c` to `e`. Using integer positions, this would be
accomplished as such:
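As a sketch, assume an illustrative `Series` labelled `a` through `f`:

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=list("abcdef"))

# Positions 2, 3 and 4 hold the labels c, d and e; the stop position is excluded.
by_position = s.iloc[2:5]
```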

However, if you only had `c` and `e`, determining the next element in the
index can be somewhat complicated. For example, the following does not work:

    s.loc['c':'e' + 1]

A very common use case is to limit a time series to start and end at two
specific dates. To enable this, we made the design choice to make label-based
slicing include both endpoints:
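A sketch of inclusive label slicing, on an illustrative labelled `Series` and on a daily time series (the dates are arbitrary):

```python
import pandas as pd

s = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], index=list("abcdef"))

# Both "c" and "e" are part of the result.
by_label = s.loc["c":"e"]

# The same rule keeps both boundary dates of a time series.
ts = pd.Series(range(5), index=pd.date_range("2024-01-01", periods=5))
window = ts.loc["2024-01-02":"2024-01-04"]
```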

This is most definitely a “practicality beats purity” sort of thing, but it is
something to watch out for if you expect label-based slicing to behave exactly
in the way that standard Python integer slicing works.

### Indexing potentially changes underlying Series dtype

Different indexing operations can potentially change the dtype of a `Series`.
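A sketch of one such case: reindexing an illustrative integer `Series` onto a label it does not contain.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])
before = str(series1.dtype)

# Label 4 is missing, so a NaN is inserted and the integers become floats.
reindexed = series1.reindex([0, 4])
after = str(reindexed.dtype)
```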

This is because the (re)indexing operations above silently insert `NaNs` and the `dtype`
changes accordingly.  This can cause some issues when using `numpy` `ufuncs`
such as `numpy.logical_and`.

See [this old issue](https://github.com/pydata/pandas/issues/2388) for a more
detailed discussion.