# Practice 3
## MultiIndex / advanced indexing
- From https://pandas.pydata.org/docs/user_guide/advanced.html
- This section covers indexing with a MultiIndex and other advanced indexing features.
- See Indexing and Selecting Data for general indexing documentation.
- **Warning**: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
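
- A minimal sketch of the pattern this warning points toward, assuming a small made-up frame (the column names `a` and `b` are illustrative and not from the guide): select rows and column in a single `.loc` call instead of chaining two indexing steps.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Chained form to avoid: df[df["a"] > 1]["b"] = 0
# It may assign into a temporary copy, so `df` is not reliably updated.

# Single .loc call: rows and column are selected in one step,
# so the assignment is applied to `df` itself.
df.loc[df["a"] > 1, "b"] = 0
print(df)
```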

### Hierarchical indexing (MultiIndex)
- Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

- In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

- See the cookbook for some advanced strategies.

#### Creating a MultiIndex (hierarchical index) object
- The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.

In [5]:
import pandas as pd
import numpy as np

In [2]:
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [3]:
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]:
s = pd.Series(np.random.randn(8), index=index)
s

first  second
bar    one       1.401668
       two       0.843072
baz    one       0.054789
       two      -1.624187
foo    one      -1.655984
       two       0.368256
qux    one      -0.786081
       two      -0.220323
dtype: float64

- When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [7]:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

pd.MultiIndex.from_product(iterables, names=["first", "second"])

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

- You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [8]:
df = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)
pd.MultiIndex.from_frame(df)

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
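
- As a quick cross-check of the constructors shown so far, the sketch below builds the same four-entry index in four ways and compares the results with `equals()`. The variable names are illustrative only.

```python
import pandas as pd

arrays = [["bar", "bar", "foo", "foo"], ["one", "two", "one", "two"]]
names = ["first", "second"]

# Same labels, four different constructors.
from_arrays = pd.MultiIndex.from_arrays(arrays, names=names)
from_tuples = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=names)
from_product = pd.MultiIndex.from_product([["bar", "foo"], ["one", "two"]], names=names)
from_frame = pd.MultiIndex.from_frame(
    pd.DataFrame({"first": arrays[0], "second": arrays[1]})
)

# equals() compares the labels position by position.
all_equal = all(
    from_arrays.equals(other) for other in (from_tuples, from_product, from_frame)
)
print(all_equal)
```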

- As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [9]:
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one   -0.571740
     two    0.657200
baz  one    0.638534
     two    0.636702
foo  one    0.105625
     two   -1.108993
qux  one   -1.490093
     two   -0.410552
dtype: float64

In [10]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

Unnamed: 0,Unnamed: 1,0,1,2,3
bar,one,0.773128,0.960034,-0.696723,-0.768177
bar,two,-1.990783,-0.119898,-1.485906,0.787975
baz,one,-0.86624,1.12119,-0.018594,-1.927502
baz,two,-0.307928,0.41576,0.607379,1.673755
foo,one,-0.854522,-0.087861,-0.347848,-0.543385
foo,two,-1.034672,-0.023308,1.578449,0.965557
qux,one,-0.012152,-0.310898,0.403776,-1.804362
qux,two,-0.657695,0.832442,-0.257588,1.14622


- All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [11]:
df.index.names

FrozenList([None, None])

- This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [12]:
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

df

first,bar,bar,baz,baz,foo,foo,qux,qux
second,one,two,one,two,one,two,one,two
A,0.416418,0.011115,0.325098,-0.520251,-1.195804,-0.075759,0.75276,0.563595
B,0.333405,0.087818,0.228822,0.771709,0.605022,-1.181512,-0.81484,1.20718
C,2.033167,-1.144466,-0.022267,0.457045,2.229661,-0.754105,0.274854,0.009222


In [13]:
pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

Unnamed: 0_level_0,first,bar,bar,baz,baz,foo,foo
Unnamed: 0_level_1,second,one,two,one,two,one,two
first,second,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
bar,one,-1.01749,0.556011,-0.879456,-0.424015,0.557869,-2.353952
bar,two,0.9623,-1.314676,-0.710711,0.828191,1.267234,-0.151198
baz,one,0.512276,1.097474,0.755134,0.576787,0.588503,-0.686263
baz,two,-0.793089,0.358887,-0.1601,2.181757,0.065882,0.478478
foo,one,0.596251,0.706598,0.290585,-0.71179,-0.825277,-0.365812
foo,two,0.921318,-2.407341,0.368492,0.566729,0.085477,-0.800106


- We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled with the display.multi_sparse option, set via pandas.set_option() or, temporarily, via pandas.option_context():

In [14]:
with pd.option_context("display.multi_sparse", False):
    print(df)  # a bare `df` inside a with-block is not displayed, so print it explicitly
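
- To make the effect of the option visible, here is a small sketch that renders the same Series with and without sparsified index labels. The Series is made up for illustration.

```python
import numpy as np
import pandas as pd

mi = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series(np.arange(4), index=mi)

# Default rendering: repeated outer labels are blanked out.
sparse_text = s.to_string()

# With multi_sparse disabled, every row repeats its outer label.
with pd.option_context("display.multi_sparse", False):
    dense_text = s.to_string()

print(sparse_text)
print(dense_text)
```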

- It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [15]:
pd.Series(np.random.randn(8), index=tuples)

(bar, one)    0.071709
(bar, two)   -0.127359
(baz, one)    0.786960
(baz, two)    0.025028
(foo, one)    0.666218
(foo, two)    1.402507
(qux, one)   -0.013039
(qux, two)    2.061105
dtype: float64

- The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
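
- The difference between tuples as atomic labels and a real MultiIndex can be sketched as follows, assuming a three-row Series built only for this illustration: partial selection by the first level works on the MultiIndex but not on the flat index of tuples.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one")]

# Flat index: each tuple is a single, indivisible label.
flat = pd.Series([1, 2, 3], index=tuples)

# Hierarchical index: each tuple is split across two levels.
multi = pd.Series([1, 2, 3], index=pd.MultiIndex.from_tuples(tuples))

print(flat.index.nlevels, multi.index.nlevels)

# Partial selection by the first level only works on the MultiIndex.
partial = multi.loc["bar"]
print(partial)

try:
    flat.loc["bar"]
    flat_partial_ok = True
except KeyError:
    flat_partial_ok = False
print(flat_partial_ok)
```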

#### Reconstructing the level labels
- The method get_level_values() will return a vector of the labels for each location at a particular level:

In [16]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [17]:
index.get_level_values("second")

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')

#### Basic indexing on axis with MultiIndex
- One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [18]:
df["bar"]

second,one,two
A,0.416418,0.011115
B,0.333405,0.087818
C,2.033167,-1.144466


In [19]:
df["bar", "one"]

A    0.416418
B    0.333405
C    2.033167
Name: (bar, one), dtype: float64

In [20]:
df["bar"]["one"]

A    0.416418
B    0.333405
C    2.033167
Name: one, dtype: float64

In [21]:
s["qux"]

one   -1.490093
two   -0.410552
dtype: float64

#### Defined levels
- The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [22]:
df.columns.levels  # original MultiIndex

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [23]:
df[["foo","qux"]].columns.levels  # sliced

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

- This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [24]:
df[["foo", "qux"]].columns.to_numpy()

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

In [25]:
# for a specific level
df[["foo", "qux"]].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

- To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [26]:
new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
new_mi.levels

FrozenList([['foo', 'qux'], ['one', 'two']])

#### Data alignment and using reindex
- Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment will work the same as an Index of tuples:

In [27]:
s + s[:-2]

bar  one   -1.143480
     two    1.314399
baz  one    1.277069
     two    1.273404
foo  one    0.211250
     two   -2.217987
qux  one         NaN
     two         NaN
dtype: float64

In [28]:
s + s[::2]

bar  one   -1.143480
     two         NaN
baz  one    1.277069
     two         NaN
foo  one    0.211250
     two         NaN
qux  one   -2.980186
     two         NaN
dtype: float64

- The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [29]:
s.reindex(index[:3])

first  second
bar    one      -0.571740
       two       0.657200
baz    one       0.638534
dtype: float64

In [30]:
s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])

foo  two   -1.108993
bar  one   -0.571740
qux  one   -1.490093
baz  one    0.638534
dtype: float64
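
- A short sketch of what reindex() does when a requested tuple is absent, using a made-up three-row Series: the missing key is kept in the result and filled with NaN.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one")], names=["first", "second"]
)
s = pd.Series([1.0, 2.0, 3.0], index=index)

# ("qux", "one") does not exist in s, so its value becomes NaN.
reindexed = s.reindex([("baz", "one"), ("qux", "one")])
print(reindexed)
```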

### Advanced indexing with hierarchical index
- Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [31]:
df = df.T
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,0.416418,0.333405,2.033167
bar,two,0.011115,0.087818,-1.144466
baz,one,0.325098,0.228822,-0.022267
baz,two,-0.520251,0.771709,0.457045
foo,one,-1.195804,0.605022,2.229661
foo,two,-0.075759,-1.181512,-0.754105
qux,one,0.75276,-0.81484,0.274854
qux,two,0.563595,1.20718,0.009222


In [32]:
df.loc[("bar", "two")]

A    0.011115
B    0.087818
C   -1.144466
Name: (bar, two), dtype: float64

- Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

- If you also want to index a specific column with .loc, you must use a tuple like this:

In [33]:
df.loc[("bar", "two"), "A"]

np.float64(0.01111505574465237)

- You don’t have to specify every level of the MultiIndex: passing only the leading elements of the tuple selects everything beneath them. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

In [34]:
df.loc["bar"]

Unnamed: 0_level_0,A,B,C
second,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0.416418,0.333405,2.033167
two,0.011115,0.087818,-1.144466


- This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

- “Partial” slicing also works quite nicely.

In [35]:
df.loc["baz":"foo"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,one,0.325098,0.228822,-0.022267
baz,two,-0.520251,0.771709,0.457045
foo,one,-1.195804,0.605022,2.229661
foo,two,-0.075759,-1.181512,-0.754105


- You can slice with a ‘range’ of values, by providing a slice of tuples.

In [36]:
df.loc[("baz", "two"):("qux", "one")]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,two,-0.520251,0.771709,0.457045
foo,one,-1.195804,0.605022,2.229661
foo,two,-0.075759,-1.181512,-0.754105
qux,one,0.75276,-0.81484,0.274854


In [37]:
df.loc[("baz", "two"):"foo"]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
baz,two,-0.520251,0.771709,0.457045
foo,one,-1.195804,0.605022,2.229661
foo,two,-0.075759,-1.181512,-0.754105


- Passing a list of labels or tuples works similarly to reindexing:

In [38]:
df.loc[[("bar", "two"), ("qux", "one")]]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,two,0.011115,0.087818,-1.144466
qux,one,0.75276,-0.81484,0.274854


- **Note**:  It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples go horizontally (traversing levels), lists go vertically (scanning levels).

- Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level:

In [39]:
s = pd.Series(
    [1, 2, 3, 4, 5, 6],
    index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
)
s.loc[[("A", "c"), ("B", "d")]]  # list of tuples

A  c    1
B  d    5
dtype: int64

In [40]:
s.loc[(["A", "B"], ["c", "d"])]  # tuple of lists

A  c    1
   d    2
B  c    4
   d    5
dtype: int64

#### Using slicers
- You can slice a MultiIndex by providing multiple indexers.
- You can provide any of the selectors as if you were indexing by label (see Selection by Label), including slices, lists of labels, labels, and boolean indexers.
- You can use slice(None) to select all the contents of that level. You do not need to specify all the deeper levels; they will be implied as slice(None).
- As usual, both sides of the slicers are included as this is label indexing.
- **Warning**: You should specify all axes in the .loc specifier, meaning the indexer for the index and for the columns. There are some ambiguous cases where the passed indexer could be misinterpreted as indexing both axes, rather than into say the MultiIndex for the rows.

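- You should do this:
```python
# Pattern only: `...` stands for the remaining level indexers, so this line is not meant to be run verbatim.
df.loc[(slice("A1", "A3"), ...), :]  # noqa: E999
```
- You should not do this:
```python
# Without the trailing `:` for the columns, the indexer can be misread as applying to both axes.
df.loc[(slice("A1", "A3"), ...)]  # noqa: E999
```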

In [42]:
def mklbl(prefix, n):
    return ["%s%s" % (prefix, i) for i in range(n)]
miindex = pd.MultiIndex.from_product(
    [mklbl("A", 4), mklbl("B", 2), mklbl("C", 4), mklbl("D", 2)]
)
micolumns = pd.MultiIndex.from_tuples(
    [("a", "foo"), ("a", "bar"), ("b", "foo"), ("b", "bah")], names=["lvl0", "lvl1"]
)
dfmi = (
    pd.DataFrame(
        np.arange(len(miindex) * len(micolumns)).reshape(
            (len(miindex), len(micolumns))
        ),
        index=miindex,
        columns=micolumns,
    )
    .sort_index()
    .sort_index(axis=1)
)
dfmi


Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,bar,foo,bah,foo
A0,B0,C0,D0,1,0,3,2
A0,B0,C0,D1,5,4,7,6
A0,B0,C1,D0,9,8,11,10
A0,B0,C1,D1,13,12,15,14
A0,B0,C2,D0,17,16,19,18
...,...,...,...,...,...,...,...
A3,B1,C1,D1,237,236,239,238
A3,B1,C2,D0,241,240,243,242
A3,B1,C2,D1,245,244,247,246
A3,B1,C3,D0,249,248,251,250


- Basic MultiIndex slicing using slices, lists, and labels.

In [43]:
dfmi.loc[(slice("A1", "A3"), slice(None), ["C1", "C3"]), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,bar,foo,bah,foo
A1,B0,C1,D0,73,72,75,74
A1,B0,C1,D1,77,76,79,78
A1,B0,C3,D0,89,88,91,90
A1,B0,C3,D1,93,92,95,94
A1,B1,C1,D0,105,104,107,106
A1,B1,C1,D1,109,108,111,110
A1,B1,C3,D0,121,120,123,122
A1,B1,C3,D1,125,124,127,126
A2,B0,C1,D0,137,136,139,138
A2,B0,C1,D1,141,140,143,142


- You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).

In [45]:
idx = pd.IndexSlice
dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,foo,foo
A0,B0,C1,D0,8,10
A0,B0,C1,D1,12,14
A0,B0,C3,D0,24,26
A0,B0,C3,D1,28,30
A0,B1,C1,D0,40,42
A0,B1,C1,D1,44,46
A0,B1,C3,D0,56,58
A0,B1,C3,D1,60,62
A1,B0,C1,D0,72,74
A1,B0,C1,D1,76,78


- It is possible to perform quite complicated selections using this method on multiple axes at the same time.

In [46]:
dfmi.loc["A1", (slice(None), "foo")]

Unnamed: 0_level_0,Unnamed: 1_level_0,lvl0,a,b
Unnamed: 0_level_1,Unnamed: 1_level_1,lvl1,foo,foo
B0,C0,D0,64,66
B0,C0,D1,68,70
B0,C1,D0,72,74
B0,C1,D1,76,78
B0,C2,D0,80,82
B0,C2,D1,84,86
B0,C3,D0,88,90
B0,C3,D1,92,94
B1,C0,D0,96,98
B1,C0,D1,100,102


In [47]:
dfmi.loc[idx[:, :, ["C1", "C3"]], idx[:, "foo"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,foo,foo
A0,B0,C1,D0,8,10
A0,B0,C1,D1,12,14
A0,B0,C3,D0,24,26
A0,B0,C3,D1,28,30
A0,B1,C1,D0,40,42
A0,B1,C1,D1,44,46
A0,B1,C3,D0,56,58
A0,B1,C3,D1,60,62
A1,B0,C1,D0,72,74
A1,B0,C1,D1,76,78


- Using a boolean indexer you can provide selection related to the values.

In [48]:
mask = dfmi[("a", "foo")] > 200
dfmi.loc[idx[mask, :, ["C1", "C3"]], idx[:, "foo"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,foo,foo
A3,B0,C1,D1,204,206
A3,B0,C3,D0,216,218
A3,B0,C3,D1,220,222
A3,B1,C1,D0,232,234
A3,B1,C1,D1,236,238
A3,B1,C3,D0,248,250
A3,B1,C3,D1,252,254


- You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.

In [49]:
dfmi.loc(axis=0)[:, :, ["C1", "C3"]]

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,bar,foo,bah,foo
A0,B0,C1,D0,9,8,11,10
A0,B0,C1,D1,13,12,15,14
A0,B0,C3,D0,25,24,27,26
A0,B0,C3,D1,29,28,31,30
A0,B1,C1,D0,41,40,43,42
A0,B1,C1,D1,45,44,47,46
A0,B1,C3,D0,57,56,59,58
A0,B1,C3,D1,61,60,63,62
A1,B0,C1,D0,73,72,75,74
A1,B0,C1,D1,77,76,79,78


- Furthermore, you can set the values using the following methods.

In [50]:
df2 = dfmi.copy()
df2.loc(axis=0)[:, :, ["C1", "C3"]] = -10
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,bar,foo,bah,foo
A0,B0,C0,D0,1,0,3,2
A0,B0,C0,D1,5,4,7,6
A0,B0,C1,D0,-10,-10,-10,-10
A0,B0,C1,D1,-10,-10,-10,-10
A0,B0,C2,D0,17,16,19,18
...,...,...,...,...,...,...,...
A3,B1,C1,D1,-10,-10,-10,-10
A3,B1,C2,D0,241,240,243,242
A3,B1,C2,D1,245,244,247,246
A3,B1,C3,D0,-10,-10,-10,-10


- The right-hand side of the assignment can also be an alignable object.

In [51]:
df2 = dfmi.copy()
df2.loc[idx[:, :, ["C1", "C3"]], :] = df2 * 1000
df2

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,lvl0,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,lvl1,bar,foo,bah,foo
A0,B0,C0,D0,1,0,3,2
A0,B0,C0,D1,5,4,7,6
A0,B0,C1,D0,9000,8000,11000,10000
A0,B0,C1,D1,13000,12000,15000,14000
A0,B0,C2,D0,17,16,19,18
...,...,...,...,...,...,...,...
A3,B1,C1,D1,237000,236000,239000,238000
A3,B1,C2,D0,241,240,243,242
A3,B1,C2,D1,245,244,247,246
A3,B1,C3,D0,249000,248000,251000,250000


#### Cross-section
- The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a MultiIndex easier.

In [52]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,0.416418,0.333405,2.033167
bar,two,0.011115,0.087818,-1.144466
baz,one,0.325098,0.228822,-0.022267
baz,two,-0.520251,0.771709,0.457045
foo,one,-1.195804,0.605022,2.229661
foo,two,-0.075759,-1.181512,-0.754105
qux,one,0.75276,-0.81484,0.274854
qux,two,0.563595,1.20718,0.009222


In [53]:
df.xs("one", level="second")

Unnamed: 0_level_0,A,B,C
first,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,0.416418,0.333405,2.033167
baz,0.325098,0.228822,-0.022267
foo,-1.195804,0.605022,2.229661
qux,0.75276,-0.81484,0.274854


In [54]:
# using the slicers
df.loc[(slice(None), "one"), :]

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C
first,second,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
bar,one,0.416418,0.333405,2.033167
baz,one,0.325098,0.228822,-0.022267
foo,one,-1.195804,0.605022,2.229661
qux,one,0.75276,-0.81484,0.274854
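
- The two selections above return the same rows but differ in the index they keep. A minimal sketch on a made-up frame: xs() drops the selected level by default, while the slicer form keeps both levels.

```python
import pandas as pd

mi = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame({"A": [1, 2, 3, 4]}, index=mi)

via_xs = frame.xs("one", level="second")        # "second" level is dropped
via_loc = frame.loc[(slice(None), "one"), :]    # both levels are kept

print(via_xs.index.nlevels, via_loc.index.nlevels)

# Dropping the level by hand makes the slicer result match xs().
same = via_loc.droplevel("second").equals(via_xs)
print(same)
```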


- You can also select on the columns with xs, by providing the axis argument.

In [55]:
df = df.T
df.xs("one", level="second", axis=1)

first,bar,baz,foo,qux
A,0.416418,0.325098,-1.195804,0.75276
B,0.333405,0.228822,0.605022,-0.81484
C,2.033167,-0.022267,2.229661,0.274854


In [56]:
# using the slicers
df.loc[:, (slice(None), "one")]

first,bar,baz,foo,qux
second,one,one,one,one
A,0.416418,0.325098,-1.195804,0.75276
B,0.333405,0.228822,0.605022,-0.81484
C,2.033167,-0.022267,2.229661,0.274854


- xs also allows selection with multiple keys.

In [57]:
df.xs(("one", "bar"), level=("second", "first"), axis=1)

first,bar
second,one
A,0.416418
B,0.333405
C,2.033167


In [58]:
# using the slicers
df.loc[:, ("bar", "one")]

A    0.416418
B    0.333405
C    2.033167
Name: (bar, one), dtype: float64

- You can pass drop_level=False to xs to retain the level that was selected.

In [59]:
df.xs("one", level="second", axis=1, drop_level=False)

first,bar,baz,foo,qux
second,one,one,one,one
A,0.416418,0.325098,-1.195804,0.75276
B,0.333405,0.228822,0.605022,-0.81484
C,2.033167,-0.022267,2.229661,0.274854


- Compare the above with the result using drop_level=True (the default value).

In [60]:
df.xs("one", level="second", axis=1, drop_level=True)

first,bar,baz,foo,qux
A,0.416418,0.325098,-1.195804,0.75276
B,0.333405,0.228822,0.605022,-0.81484
C,2.033167,-0.022267,2.229661,0.274854


#### Advanced reindexing and alignment
- Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast values across a level. For instance:

In [61]:
midx = pd.MultiIndex(
    levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
)
df = pd.DataFrame(np.random.randn(4, 2), index=midx)
df

Unnamed: 0,Unnamed: 1,0,1
one,y,-0.191065,0.665116
one,x,0.266268,-0.462328
zero,y,-1.306809,-0.447326
zero,x,-0.162528,0.52502


In [62]:
df2 = df.groupby(level=0).mean()

df2

Unnamed: 0,0,1
one,0.037601,0.101394
zero,-0.734669,0.038847


In [63]:
df2.reindex(df.index, level=0)

Unnamed: 0,Unnamed: 1,0,1
one,y,0.037601,0.101394
one,x,0.037601,0.101394
zero,y,-0.734669,0.038847
zero,x,-0.734669,0.038847


In [65]:
# aligning
df_aligned, df2_aligned = df.align(df2, level=0)
df_aligned

Unnamed: 0,Unnamed: 1,0,1
one,y,-0.191065,0.665116
one,x,0.266268,-0.462328
zero,y,-1.306809,-0.447326
zero,x,-0.162528,0.52502


In [66]:
df2_aligned

Unnamed: 0,Unnamed: 1,0,1
one,y,0.037601,0.101394
one,x,0.037601,0.101394
zero,y,-0.734669,0.038847
zero,x,-0.734669,0.038847
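
- One use of this broadcasting, sketched with fixed made-up numbers instead of random data: subtract each group's mean from its rows by reindexing the group means back onto the original MultiIndex.

```python
import pandas as pd

midx = pd.MultiIndex.from_product([["one", "zero"], ["x", "y"]])
data = pd.DataFrame({"v": [1.0, 3.0, 10.0, 20.0]}, index=midx)

# One mean per label of the outer level.
means = data.groupby(level=0).mean()

# Repeat each group mean for every row of that group.
broadcast = means.reindex(data.index, level=0)

demeaned = data - broadcast
print(demeaned)
```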


#### Swapping levels with swaplevel
- The swaplevel() method can switch the order of two levels:

In [67]:
df[:5]

Unnamed: 0,Unnamed: 1,0,1
one,y,-0.191065,0.665116
one,x,0.266268,-0.462328
zero,y,-1.306809,-0.447326
zero,x,-0.162528,0.52502


In [68]:
df[:5].swaplevel(0, 1, axis=0)

Unnamed: 0,Unnamed: 1,0,1
y,one,-0.191065,0.665116
x,one,0.266268,-0.462328
y,zero,-1.306809,-0.447326
x,zero,-0.162528,0.52502


#### Reordering levels with reorder_levels
- The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical index levels in one step:

In [69]:
df[:5].reorder_levels([1, 0], axis=0)

Unnamed: 0,Unnamed: 1,0,1
y,one,-0.191065,0.665116
x,one,0.266268,-0.462328
y,zero,-1.306809,-0.447326
x,zero,-0.162528,0.52502
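
- With only two levels, swaplevel and reorder_levels give the same result. The sketch below uses a hypothetical three-level index (level names `l0`, `l1`, `l2` are made up) to show where they differ: swaplevel exchanges two levels, while reorder_levels accepts any permutation.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples(
    [("a", 1, "x"), ("b", 2, "y")], names=["l0", "l1", "l2"]
)
s3 = pd.Series([10, 20], index=mi)

# Exchange the first and last levels; the middle one stays put.
swapped = s3.swaplevel("l0", "l2")

# Arbitrary permutation of all three levels in one step.
reordered = s3.reorder_levels(["l2", "l0", "l1"])

print(list(swapped.index.names))
print(list(reordered.index.names))
```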


#### Renaming names of an Index or MultiIndex
- The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the columns you wish to rename.

In [70]:
df.rename(columns={0: "col0", 1: "col1"})

Unnamed: 0,Unnamed: 1,col0,col1
one,y,-0.191065,0.665116
one,x,0.266268,-0.462328
zero,y,-1.306809,-0.447326
zero,x,-0.162528,0.52502


- This method can also be used to rename specific labels of the main index of the DataFrame.

In [71]:
df.rename(index={"one": "two", "y": "z"})

Unnamed: 0,Unnamed: 1,0,1
two,z,-0.191065,0.665116
two,x,0.266268,-0.462328
zero,z,-1.306809,-0.447326
zero,x,-0.162528,0.52502


- The rename_axis() method is used to rename the name of an Index or MultiIndex. In particular, the names of the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values from the MultiIndex to a column.

In [72]:
df.rename_axis(index=["abc", "def"])

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
abc,def,Unnamed: 2_level_1,Unnamed: 3_level_1
one,y,-0.191065,0.665116
one,x,0.266268,-0.462328
zero,y,-1.306809,-0.447326
zero,x,-0.162528,0.52502


- **Note** that the columns of a DataFrame are an index, so that using rename_axis with the columns argument will change the name of that index.

In [73]:
df.rename_axis(columns="Cols").columns

RangeIndex(start=0, stop=2, step=1, name='Cols')

- Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map labels/names to new values.

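- When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used to change the names. On a MultiIndex, rename() behaves as an alias of set_names(), which is why the example below can pass level=0 to it.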

In [74]:
mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
mi.names

FrozenList(['x', 'y'])

In [75]:
mi2 = mi.rename("new name", level=0)
mi2

MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b')],
           names=['new name', 'y'])

- You cannot set the names of the MultiIndex via a level.

In [76]:
# mi.levels[0].name = "name via level" # Cause RuntimeError

- Use Index.set_names() instead.
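
- A minimal sketch of set_names() on the same kind of index: it returns a new MultiIndex and leaves the original untouched, and it can rename one level or all levels at once. The replacement names are illustrative.

```python
import pandas as pd

mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])

# Rename a single level by position.
one_level = mi.set_names("new name", level=0)

# Rename every level by passing a full list.
all_levels = mi.set_names(["p", "q"])

print(list(mi.names), list(one_level.names), list(all_levels.names))
```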

### Sorting a MultiIndex
- For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can use sort_index().

In [77]:
import random

random.shuffle(tuples)

s = pd.Series(np.random.randn(8), index=pd.MultiIndex.from_tuples(tuples))

s

foo  one    0.554170
     two    1.088018
baz  one    0.216603
     two   -0.650927
bar  one   -0.027816
qux  two   -2.019777
bar  two    1.523460
qux  one    1.628227
dtype: float64

In [78]:
s.sort_index()

bar  one   -0.027816
     two    1.523460
baz  one    0.216603
     two   -0.650927
foo  one    0.554170
     two    1.088018
qux  one    1.628227
     two   -2.019777
dtype: float64

In [79]:
s.sort_index(level=0)

bar  one   -0.027816
     two    1.523460
baz  one    0.216603
     two   -0.650927
foo  one    0.554170
     two    1.088018
qux  one    1.628227
     two   -2.019777
dtype: float64

In [80]:
s.sort_index(level=1)

bar  one   -0.027816
baz  one    0.216603
foo  one    0.554170
qux  one    1.628227
bar  two    1.523460
baz  two   -0.650927
foo  two    1.088018
qux  two   -2.019777
dtype: float64

- You may also pass a level name to sort_index if the MultiIndex levels are named.

In [81]:
s.index = s.index.set_names(["L1", "L2"])
s.sort_index(level="L1")

L1   L2 
bar  one   -0.027816
     two    1.523460
baz  one    0.216603
     two   -0.650927
foo  one    0.554170
     two    1.088018
qux  one    1.628227
     two   -2.019777
dtype: float64

In [82]:
s.sort_index(level="L2")

L1   L2 
bar  one   -0.027816
baz  one    0.216603
foo  one    0.554170
qux  one    1.628227
bar  two    1.523460
baz  two   -0.650927
foo  two    1.088018
qux  two   -2.019777
dtype: float64

- On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:

In [83]:
df.T.sort_index(level=1, axis=1)

Unnamed: 0_level_0,one,zero,one,zero
Unnamed: 0_level_1,x,x,y,y
0,0.266268,-0.162528,-0.191065,-1.306809
1,-0.462328,0.52502,0.665116,-0.447326


- Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning). It will also return a copy of the data rather than a view:

In [85]:
dfm = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": np.random.rand(4)}
)
dfm = dfm.set_index(["jim", "joe"])
dfm

Unnamed: 0_level_0,Unnamed: 1_level_0,jolie
jim,joe,Unnamed: 2_level_1
0,x,0.454656
0,x,0.119976
1,z,0.966698
1,y,0.529803


In [86]:
dfm.loc[(1, 'z')]


Unnamed: 0_level_0,Unnamed: 1_level_0,jolie
jim,joe,Unnamed: 2_level_1
1,z,0.966698


- Furthermore, if you try to index something that is not fully lexsorted, this can raise:

In [87]:
# dfm.loc[(0, 'y'):(1, 'z')] # Cause UnsortedIndexError

- The is_monotonic_increasing property of a MultiIndex shows whether the index is sorted:

In [88]:
dfm.index.is_monotonic_increasing

False

In [89]:
dfm = dfm.sort_index()
dfm

Unnamed: 0_level_0,Unnamed: 1_level_0,jolie
jim,joe,Unnamed: 2_level_1
0,x,0.454656
0,x,0.119976
1,y,0.529803
1,z,0.966698


- And now selection works as expected.

In [90]:
dfm.loc[(0, "y"):(1, "z")]

Unnamed: 0_level_0,Unnamed: 1_level_0,jolie
jim,joe,Unnamed: 2_level_1
1,y,0.529803
1,z,0.966698
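
- The whole sequence can be sketched in one self-contained block, with fixed values in place of the random `jolie` column: the tuple slice fails on the unsorted index and succeeds after sort_index().

```python
import pandas as pd
from pandas.errors import UnsortedIndexError

dfm_demo = pd.DataFrame(
    {"jim": [0, 0, 1, 1], "joe": ["x", "x", "z", "y"], "jolie": [0.1, 0.2, 0.3, 0.4]}
).set_index(["jim", "joe"])

# Slicing with full tuples needs a lexsorted index.
try:
    dfm_demo.loc[(0, "y"):(1, "z")]
    raised = False
except UnsortedIndexError:
    raised = True

sorted_demo = dfm_demo.sort_index()
result = sorted_demo.loc[(0, "y"):(1, "z")]
print(raised, sorted_demo.index.is_monotonic_increasing)
print(result)
```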


### Take methods
- Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provide the take() method, which retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of integer index positions. take will also accept negative integers as relative positions to the end of the object.

In [91]:
index = pd.Index(np.random.randint(0, 1000, 10))
index

Index([996, 461, 605, 88, 374, 830, 160, 932, 693, 223], dtype='int32')

In [92]:
positions = [0, 9, 3]
index[positions]

Index([996, 223, 88], dtype='int32')

In [93]:
index.take(positions)

Index([996, 223, 88], dtype='int32')

In [94]:
ser = pd.Series(np.random.randn(10))
ser.iloc[positions]

0    1.333662
9   -2.311854
3   -0.263698
dtype: float64

In [95]:
ser.take(positions)

0    1.333662
9   -2.311854
3   -0.263698
dtype: float64

- For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.

In [96]:
frm = pd.DataFrame(np.random.randn(5, 3))
frm.take([1, 4, 3])

Unnamed: 0,0,1,2
1,0.708306,0.53882,1.00436
4,0.055899,-0.625386,0.752434
3,0.180357,-1.008078,-1.760977


In [97]:
frm.take([0, 2], axis=1)

Unnamed: 0,0,2
0,-1.602211,-2.56406
1,0.708306,1.00436
2,-1.835859,0.85345
3,0.180357,-1.760977
4,0.055899,0.752434


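- It is important to note that the take method on pandas objects is not intended to work on boolean indices and may return unexpected results: as the outputs below show, False and True are treated as the integer positions 0 and 1 rather than as a mask.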

In [98]:
arr = np.random.randn(10)
arr.take([False, False, True, True])

array([ 0.31462296,  0.31462296, -1.10708173, -1.10708173])

In [99]:
arr[[0, 1]]

array([ 0.31462296, -1.10708173])

In [100]:
ser = pd.Series(np.random.randn(10))
ser.take([False, False, True, True])

0    0.885673
0    0.885673
1   -0.146598
1   -0.146598
dtype: float64

In [101]:
ser.iloc[[0, 1]]

0    0.885673
1   -0.146598
dtype: float64
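
- As a minimal sketch of how to avoid the surprise above, a boolean mask can be converted to integer positions before calling take(). The sample values and the names mask, positions, taken, and masked are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

ser = pd.Series([10.0, 20.0, 30.0, 40.0])
mask = np.array([False, False, True, True])

# take() would read the mask as positions 0, 0, 1, 1, so convert it first.
positions = np.flatnonzero(mask)  # integer positions where mask is True
taken = ser.take(positions)

# Boolean indexing accepts the mask directly and selects the same rows.
masked = ser[mask]
```

- Both taken and masked hold the last two elements of ser, which is what a reader usually expects from a boolean selection.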

- Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer performance that is a good deal faster than fancy indexing.

In [102]:
arr = np.random.randn(10000, 5)
indexer = np.arange(10000)
random.shuffle(indexer)
%timeit arr[indexer]
%timeit arr.take(indexer, axis=0)

119 μs ± 2.8 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
46.6 μs ± 6.92 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


In [103]:
ser = pd.Series(arr[:, 0])
%timeit ser.iloc[indexer]
%timeit ser.take(indexer)

78.4 μs ± 4.52 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
65.3 μs ± 949 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


### Index types
- We have discussed MultiIndex in the previous sections pretty extensively. Documentation about DatetimeIndex and PeriodIndex is in the pandas time series user guide, and documentation about TimedeltaIndex is in the time deltas user guide.
- In the following sub-sections we will highlight some other index types.
#### CategoricalIndex
- CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated elements.

In [104]:
from pandas.api.types import CategoricalDtype
df = pd.DataFrame({"A": np.arange(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df

Unnamed: 0,A,B
0,0,a
1,1,a
2,2,b
3,3,b
4,4,c
5,5,a


In [105]:
df.dtypes

A       int64
B    category
dtype: object

In [106]:
df["B"].cat.categories

Index(['c', 'a', 'b'], dtype='object')

- Setting the index will create a CategoricalIndex.

In [107]:
df2 = df.set_index("B")
df2.index

CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

- Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be in the category or the operation will raise a KeyError.

In [108]:
df2.loc["a"]

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0
a,1
a,5
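
- The KeyError mentioned above can be sketched as follows. The frame is rebuilt so the snippet is self-contained; the label "z" and the flag name raised are illustrative assumptions.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({"A": range(6), "B": list("aabbca")})
df["B"] = df["B"].astype(CategoricalDtype(list("cab")))
df2 = df.set_index("B")

try:
    df2.loc["z"]  # "z" is not one of the categories c, a, b
    raised = False
except KeyError:
    raised = True
```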


- The CategoricalIndex is preserved after indexing:

In [109]:
df2.loc["a"].index

CategoricalIndex(['a', 'a', 'a'], categories=['c', 'a', 'b'], ordered=False, dtype='category', name='B')

- Sorting the index will sort by the order of the categories (recall that we created the index with CategoricalDtype(list('cab')), so the sorted order is cab).

In [110]:
df2.sort_index()

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
c,4
a,0
a,1
a,5
b,2
b,3


- Groupby operations on the index will preserve the index nature as well.

In [111]:
df2.groupby(level=0, observed=True).sum()

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
c,4
a,6
b,5


- Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the categories, similarly to how you can reindex any pandas index.

In [112]:
df3 = pd.DataFrame(
    {"A": np.arange(3), "B": pd.Series(list("abc")).astype("category")}
)
df3 = df3.set_index("B")
df3

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0
b,1
c,2


In [113]:
df3.reindex(["a", "e"])


Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0.0
e,


In [114]:
df3.reindex(["a", "e"]).index

Index(['a', 'e'], dtype='object', name='B')

In [115]:
df3.reindex(pd.Categorical(["a", "e"], categories=list("abe")))

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
a,0.0
e,


In [116]:
df3.reindex(pd.Categorical(["a", "e"], categories=list("abe"))).index

CategoricalIndex(['a', 'e'], categories=['a', 'b', 'e'], ordered=False, dtype='category', name='B')

- **Warning**: Reshaping and Comparison operations on a CategoricalIndex must have the same categories or a TypeError will be raised.

In [117]:
df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
df4["B"] = df4["B"].astype(CategoricalDtype(list("ab")))
df4 = df4.set_index("B")
df4.index

CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, dtype='category', name='B')

In [118]:
df5 = pd.DataFrame({"A": np.arange(2), "B": list("bc")})
df5["B"] = df5["B"].astype(CategoricalDtype(list("bc")))
df5 = df5.set_index("B")
df5.index

CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, dtype='category', name='B')

In [119]:
pd.concat([df4, df5])

Unnamed: 0_level_0,A
B,Unnamed: 1_level_1
b,0
a,1
b,0
c,1


#### RangeIndex
- RangeIndex is a sub-class of Index that provides the default index for all DataFrame and Series objects. RangeIndex is an optimized version of Index that can represent a monotonic ordered set. These are analogous to Python range types. A RangeIndex will always have an int64 dtype.

In [120]:
idx = pd.RangeIndex(5)
idx

RangeIndex(start=0, stop=5, step=1)

- RangeIndex is the default index for all DataFrame and Series objects:

In [121]:
ser = pd.Series([1, 2, 3])
ser.index

RangeIndex(start=0, stop=3, step=1)

In [122]:
df = pd.DataFrame([[1, 2], [3, 4]])
df.index

RangeIndex(start=0, stop=2, step=1)

In [123]:
df.columns

RangeIndex(start=0, stop=2, step=1)

- A RangeIndex behaves similarly to an Index with an int64 dtype. Operations on a RangeIndex whose result cannot be represented by a RangeIndex, but should have an integer dtype, return an Index with int64 dtype. For example:

In [124]:
idx[[0, 2]]

Index([0, 2], dtype='int64')
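
- To make the distinction concrete, here is a small sketch contrasting a slice, which can still be described by start/stop/step, with an irregular selection, which cannot. The positions [0, 1, 3] are chosen as an assumption precisely because they are not evenly spaced.

```python
import pandas as pd

idx = pd.RangeIndex(5)

sliced = idx[1:4]        # still expressible as start/stop/step
picked = idx[[0, 1, 3]]  # irregular spacing, cannot be stored as a range
```

- Here sliced remains a RangeIndex, while picked falls back to a regular Index with int64 dtype.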

#### IntervalIndex
- IntervalIndex, together with its own dtype, IntervalDtype, and the Interval scalar type, allows first-class support in pandas for interval notation.
- The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut() and qcut().

##### Indexing with an IntervalIndex
- An IntervalIndex can be used in Series and in DataFrame as the index.

In [126]:
df = pd.DataFrame(
    {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
)
df

Unnamed: 0,A
"(0, 1]",1
"(1, 2]",2
"(2, 3]",3
"(3, 4]",4


- Label-based indexing via .loc along the edges of an interval works as you would expect, selecting that particular interval.

In [127]:
df.loc[2]


A    2
Name: (1, 2], dtype: int64

In [128]:
df.loc[[2, 3]]

Unnamed: 0,A
"(1, 2]",2
"(2, 3]",3


- If you select a label contained within an interval, this will also select the interval.

In [129]:
df.loc[2.5]

A    3
Name: (2, 3], dtype: int64

In [130]:
df.loc[[2.5, 3.5]]

Unnamed: 0,A
"(2, 3]",3
"(3, 4]",4


- Selecting using an Interval will only return exact matches.

In [131]:
df.loc[pd.Interval(1, 2)]

A    2
Name: (1, 2], dtype: int64

- Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.

In [132]:
# df.loc[pd.Interval(0.5, 2.5)] # KeyError

- Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to create a boolean indexer.

In [133]:
idxr = df.index.overlaps(pd.Interval(0.5, 2.5))
idxr

array([ True,  True,  True, False])

In [134]:
df[idxr]

Unnamed: 0,A
"(0, 1]",1
"(1, 2]",2
"(2, 3]",3


##### Binning data with cut and qcut
- cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex in its .categories attribute.

In [135]:
c = pd.cut(range(4), bins=2)
c

[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

In [136]:
c.categories

IntervalIndex([(-0.003, 1.5], (1.5, 3.0]], dtype='interval[float64, right]')

- cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of .categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into the same bins.

In [138]:
pd.cut([0, 3, 5, 1], bins=c.categories)

[(-0.003, 1.5], (1.5, 3.0], NaN, (-0.003, 1.5]]
Categories (2, interval[float64, right]): [(-0.003, 1.5] < (1.5, 3.0]]

- Any value which falls outside all bins will be assigned a NaN value.
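
- The bin-reuse idiom can be sketched end to end as below. The new data values and the names train, bins, and new are illustrative assumptions; the category codes make the out-of-range case easy to spot, since a missing category is encoded as -1.

```python
import pandas as pd

# Derive the bins once from a first batch of data.
train = pd.cut(range(4), bins=2)
bins = train.categories  # IntervalIndex holding the two bins

# Reuse exactly the same bins for a later batch.
new = pd.cut([0.5, 2.0, 10.0], bins=bins)
codes = new.codes  # position of each value's bin, -1 where no bin matches
```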

##### Generating ranges of intervals
- If we need intervals on a regular frequency, we can use the interval_range() function to create an IntervalIndex using various combinations of start, end, and periods. The default frequency for interval_range is 1 for numeric intervals, and calendar day for datetime-like intervals:

In [139]:
pd.interval_range(start=0, end=5)

IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]], dtype='interval[int64, right]')

In [140]:
pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4)

IntervalIndex([(2017-01-01 00:00:00, 2017-01-02 00:00:00],
               (2017-01-02 00:00:00, 2017-01-03 00:00:00],
               (2017-01-03 00:00:00, 2017-01-04 00:00:00],
               (2017-01-04 00:00:00, 2017-01-05 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [141]:
pd.interval_range(end=pd.Timedelta("3 days"), periods=3)

IntervalIndex([(0 days 00:00:00, 1 days 00:00:00],
               (1 days 00:00:00, 2 days 00:00:00],
               (2 days 00:00:00, 3 days 00:00:00]],
              dtype='interval[timedelta64[ns], right]')

- The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with datetime-like intervals:

In [142]:
pd.interval_range(start=0, periods=5, freq=1.5)

IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0], (6.0, 7.5]], dtype='interval[float64, right]')

In [143]:
pd.interval_range(start=pd.Timestamp("2017-01-01"), periods=4, freq="W")

IntervalIndex([(2017-01-01 00:00:00, 2017-01-08 00:00:00],
               (2017-01-08 00:00:00, 2017-01-15 00:00:00],
               (2017-01-15 00:00:00, 2017-01-22 00:00:00],
               (2017-01-22 00:00:00, 2017-01-29 00:00:00]],
              dtype='interval[datetime64[ns], right]')

In [144]:
pd.interval_range(start=pd.Timedelta("0 days"), periods=3, freq="9h")

IntervalIndex([(0 days 00:00:00, 0 days 09:00:00],
               (0 days 09:00:00, 0 days 18:00:00],
               (0 days 18:00:00, 1 days 03:00:00]],
              dtype='interval[timedelta64[ns], right]')

- Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are closed on the right side by default.

In [145]:
pd.interval_range(start=0, end=4, closed="both")

IntervalIndex([[0, 1], [1, 2], [2, 3], [3, 4]], dtype='interval[int64, both]')

In [146]:
pd.interval_range(start=0, end=4, closed="neither")

IntervalIndex([(0, 1), (1, 2), (2, 3), (3, 4)], dtype='interval[int64, neither]')

- Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclusively, with periods number of elements in the resulting IntervalIndex:

In [147]:
pd.interval_range(start=0, end=6, periods=4)

IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]], dtype='interval[float64, right]')

In [148]:
pd.interval_range(pd.Timestamp("2018-01-01"), pd.Timestamp("2018-02-28"), periods=3)


IntervalIndex([(2018-01-01 00:00:00, 2018-01-20 08:00:00],
               (2018-01-20 08:00:00, 2018-02-08 16:00:00],
               (2018-02-08 16:00:00, 2018-02-28 00:00:00]],
              dtype='interval[datetime64[ns], right]')

### Miscellaneous indexing FAQ
#### Integer indexing
- Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .loc. The following code will generate exceptions:

In [149]:
s = pd.Series(range(5))
# s[-1] # KeyError

- This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
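
- A short sketch of the distinction, assuming the same five-element Series: positional access goes through .iloc, while .loc only ever looks up labels. The variable names are illustrative.

```python
import pandas as pd

s = pd.Series(range(5))

last_by_position = s.iloc[-1]  # position -1 is the final element
last_by_label = s.loc[4]       # label 4 happens to be the final label

try:
    s.loc[-1]                  # there is no label -1 in the index
    label_found = True
except KeyError:
    label_found = False
```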
#### Non-monotonic indexes require exact matches
- If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index can be tested with the is_monotonic_increasing and is_monotonic_decreasing attributes.

In [150]:
df = pd.DataFrame(index=[2, 3, 3, 4, 5], columns=["data"], data=list(range(5)))
df.index.is_monotonic_increasing

True

In [151]:
df.loc[0:4, :]

Unnamed: 0,data
2,0
3,1
3,2
4,3


In [152]:
# slice bounds are outside the index, so an empty DataFrame is returned
df.loc[13:15, :]

Unnamed: 0,data


- On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.

In [153]:
df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))
df.index.is_monotonic_increasing

False

In [154]:
# OK because 2 and 4 are in the index
df.loc[2:4, :]

Unnamed: 0,data
2,0
3,1
1,2
4,3


In [155]:
# 0 is not in the index
# df.loc[0:4, :] # KeyError
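
- One way around the KeyError, sketched here as an assumption rather than something the original notebook does, is to sort the index first so that it becomes monotonic and out-of-range slice bounds are accepted again.

```python
import pandas as pd

df = pd.DataFrame(index=[2, 3, 1, 4, 3, 5], columns=["data"], data=list(range(6)))

# Sorting makes the index monotonically increasing.
sorted_df = df.sort_index()

# 0 is still not a label, but the slice now works on the sorted index.
subset = sorted_df.loc[0:4, :]
```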

- Index.is_monotonic_increasing and Index.is_monotonic_decreasing only check that an index is weakly monotonic. To check for strict monotonicity, you can combine one of those with the is_unique attribute.

In [156]:
weakly_monotonic = pd.Index(["a", "b", "c", "c"])
weakly_monotonic

Index(['a', 'b', 'c', 'c'], dtype='object')

In [157]:
weakly_monotonic.is_monotonic_increasing


True

In [158]:
weakly_monotonic.is_monotonic_increasing & weakly_monotonic.is_unique

False

#### Endpoints are inclusive
- Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or next element after a particular label in an index. For example, consider the following Series:

In [159]:
s = pd.Series(np.random.randn(6), index=list("abcdef"))
s

a   -0.284306
b   -0.243045
c   -1.703894
d    0.975003
e    0.248485
f    0.094694
dtype: float64

- Suppose we wished to slice from c to e. Using integers, this would be accomplished as follows:

In [160]:
s[2:5]

c   -1.703894
d    0.975003
e    0.248485
dtype: float64

- However, if you only had c and e, determining the next element in the index can be somewhat complicated. For example, the following does not work:

In [161]:
# s.loc['c':'e' + 1] # TypeError

- A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the design choice to make label-based slicing include both endpoints:

In [162]:
s.loc["c":"e"]

c   -1.703894
d    0.975003
e    0.248485
dtype: float64

- This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect label-based slicing to behave exactly in the way that standard Python integer slicing works.

#### Indexing potentially changes underlying Series dtype
- Different indexing operations can potentially change the dtype of a Series.

In [163]:
series1 = pd.Series([1, 2, 3])
series1.dtype

dtype('int64')

In [164]:
res = series1.reindex([0, 4])
res.dtype

dtype('float64')

In [165]:
res

0    1.0
4    NaN
dtype: float64

In [166]:
series2 = pd.Series([True])
series2.dtype

dtype('bool')

In [167]:
res = series2.reindex_like(series1)
res.dtype

dtype('O')

In [168]:
res

0    True
1     NaN
2     NaN
dtype: object

- This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can cause some issues when using numpy ufuncs such as numpy.logical_and.
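
- Two ways to keep an integer dtype through a reindex are sketched below. Both are suggestions, not part of the original notebook: passing fill_value so that no NaN is inserted, or converting to the nullable Int64 extension dtype, which stores missing values as pd.NA.

```python
import pandas as pd

series1 = pd.Series([1, 2, 3])

# Supply an explicit fill value so no NaN is needed.
filled = series1.reindex([0, 4], fill_value=0)

# Use the nullable integer dtype, which can hold missing values.
nullable = series1.astype("Int64").reindex([0, 4])
```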

## Copy-on-Write (CoW)
- From https://pandas.pydata.org/docs/user_guide/copy_on_write.html
- **Note**: Copy-on-Write will become the default in pandas 3.0. We recommend turning it on now to benefit from all improvements.
- Copy-on-Write was first introduced in version 1.5.0. Starting from version 2.0 most of the optimizations that become possible through CoW are implemented and supported. All possible optimizations are supported starting from pandas 2.1.
- CoW will be enabled by default in version 3.0.
- CoW will lead to more predictable behavior since it is not possible to update more than one object with one statement, e.g. indexing operations or methods won’t have side-effects. Additionally, through delaying copies as long as possible, the average performance and memory usage will improve.

### Previous behavior
- pandas indexing behavior is tricky to understand. Some operations return views while others return copies. Depending on the result of the operation, mutating one object might accidentally mutate another:

In [169]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
df

Unnamed: 0,foo,bar
0,100,4
1,2,5
2,3,6


- Mutating subset, e.g. updating its values, also updates df. The exact behavior is hard to predict. Copy-on-Write prevents accidentally modifying more than one object by explicitly disallowing it. With CoW enabled, df is unchanged:

In [170]:
pd.options.mode.copy_on_write = True
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
df

Unnamed: 0,foo,bar
0,1,4
1,2,5
2,3,6


- The following sections will explain what this means and how it impacts existing applications.

### Migrating to Copy-on-Write
- Copy-on-Write will be the default and only mode in pandas 3.0. This means that users need to migrate their code to be compliant with CoW rules.

- The default mode in pandas will raise warnings for certain cases that will actively change behavior and thus change user intended behavior.

- We added another mode, `pd.options.mode.copy_on_write = "warn"`, that will warn for every operation that will change behavior with CoW. We expect this mode to be very noisy, since it also emits warnings for many cases that we don’t expect to affect users. We recommend checking this mode and analyzing the warnings, but it is not necessary to address all of these warnings. The first two items of the following list are the only cases that need to be addressed to make existing code work with CoW.
- The following few items describe the user visible changes:
- Chained assignment will never work
- loc should be used as an alternative. Check the chained assignment section for more details.
- Accessing the underlying array of a pandas object will return a read-only view

In [171]:
ser = pd.Series([1, 2, 3])
ser.to_numpy()

array([1, 2, 3])

- This example returns a NumPy array that is a view of the Series object. This view can be modified and thus also modify the pandas object. This is not compliant with CoW rules. The returned array is set to non-writeable to protect against this behavior. Creating a copy of this array allows modification. You can also make the array writeable again if you don’t care about the pandas object anymore.

- See the section about read-only NumPy arrays for more details.

- Only one pandas object is updated at once

- Without CoW, the following code snippet updates both df and subset. With CoW enabled, as it is in this session, only subset changes and df keeps its original values:

In [173]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
subset = df["foo"]
subset.iloc[0] = 100
df

Unnamed: 0,foo,bar
0,1,4
1,2,5
2,3,6


- This won’t be possible anymore with CoW, since the CoW rules explicitly forbid this. This includes updating a single column as a Series and relying on the change propagating back to the parent DataFrame. This statement can be rewritten into a single statement with loc or iloc if this behavior is necessary. DataFrame.where() is another suitable alternative for this case.
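
- The rewrite suggested above might look like the following sketch. The replacement values and the condition on bar are illustrative assumptions, and Series.where() is used here as the column-level counterpart of DataFrame.where().

```python
import pandas as pd

df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

# Single-step assignment on the parent object instead of mutating a subset.
df.loc[0, "foo"] = 100

# where() keeps values where the condition holds and replaces the rest;
# the result is assigned back to the column explicitly.
df["foo"] = df["foo"].where(df["bar"] != 5, 200)
```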

- Updating a column selected from a DataFrame with an inplace method will also not work anymore.

In [174]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df["foo"].replace(1, 5, inplace=True)
df

C:\Users\thotc\AppData\Local\Temp\ipykernel_34888\3837958181.py:2: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
When using the Copy-on-Write mode, such inplace method never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' instead, to perform the operation inplace on the original object.


  df["foo"].replace(1, 5, inplace=True)


Unnamed: 0,foo,bar
0,1,4
1,2,5
2,3,6


- This is another form of chained assignment. This can generally be rewritten in 2 different forms:

In [176]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})

df.replace({"foo": {1: 5}}, inplace=True)
df

Unnamed: 0,foo,bar
0,5,4
1,2,5
2,3,6


- A different alternative would be to not use inplace:

In [177]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df["foo"] = df["foo"].replace(1, 5)
df

Unnamed: 0,foo,bar
0,5,4
1,2,5
2,3,6


- Constructors now copy NumPy arrays by default
- The Series and DataFrame constructors will now copy a NumPy array by default when not otherwise specified. This was changed to avoid mutating a pandas object when the NumPy array is changed inplace outside of pandas. You can set copy=False to avoid this copy.
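
- The difference can be checked with np.shares_memory. In this sketch copy=True is passed explicitly to stand in for the CoW default, which is an assumption made so the snippet behaves the same whether or not CoW is enabled.

```python
import numpy as np
import pandas as pd

arr = np.array([1, 2, 3])

copied = pd.Series(arr, copy=True)   # owns its own buffer
shared = pd.Series(arr, copy=False)  # reuses arr's buffer

copied_shares = np.shares_memory(arr, copied.to_numpy())
shared_shares = np.shares_memory(arr, shared.to_numpy())
```

- With copy=False, in-place changes to arr made outside pandas would show up in shared, which is the situation the copying default is meant to avoid.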

### Description
- CoW means that any DataFrame or Series derived from another in any way always behaves as a copy. As a consequence, we can only change the values of an object through modifying the object itself. CoW disallows updating a DataFrame or a Series that shares data with another DataFrame or Series object inplace.
- This avoids side-effects when modifying values and hence, most methods can avoid actually copying the data and only trigger a copy when necessary.
- The following example will operate inplace with CoW:

In [178]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df.iloc[0, 0] = 100
df

Unnamed: 0,foo,bar
0,100,4
1,2,5
2,3,6


- The object df does not share any data with any other object and hence no copy is triggered when updating the values. In contrast, the following operation triggers a copy of the data under CoW:

In [179]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df2 = df.reset_index(drop=True)
df2.iloc[0, 0] = 100
df

Unnamed: 0,foo,bar
0,1,4
1,2,5
2,3,6


- reset_index returns a lazy copy with CoW while it copies the data without CoW. Since both objects, df and df2 share the same data, a copy is triggered when modifying df2. The object df still has the same values as initially while df2 was modified.

- If the object df isn’t needed anymore after performing the reset_index operation, you can emulate an inplace-like operation through assigning the output of reset_index to the same variable:

In [180]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df = df.reset_index(drop=True)
df.iloc[0, 0] = 100
df

Unnamed: 0,foo,bar
0,100,4
1,2,5
2,3,6


- The initial object goes out of scope as soon as the result of reset_index is reassigned and hence df does not share data with any other object. No copy is necessary when modifying the object. This is generally true for all methods listed in Copy-on-Write optimizations.

- Previously, when operating on views, both the view and the parent object were modified:

In [181]:
with pd.option_context("mode.copy_on_write", False):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    view = df[:]
    df.iloc[0, 0] = 100
df

Unnamed: 0,foo,bar
0,100,4
1,2,5
2,3,6


In [182]:
view

Unnamed: 0,foo,bar
0,100,4
1,2,5
2,3,6


- CoW triggers a copy when df is changed to avoid mutating view as well:

In [183]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
view = df[:]
df.iloc[0, 0] = 100
df

Unnamed: 0,foo,bar
0,100,4
1,2,5
2,3,6


In [184]:
view

Unnamed: 0,foo,bar
0,1,4
1,2,5
2,3,6


### Chained Assignment
- Chained assignment references a technique where an object is updated through two subsequent indexing operations, e.g.


In [185]:
with pd.option_context("mode.copy_on_write", False):
    df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
    df["foo"][df["bar"] > 5] = 100
    df

You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  df["foo"][df["bar"] > 5] = 100


- The column foo is updated where the column bar is greater than 5. This violates the CoW principles though, because it would have to modify the view df["foo"] and df in one step. Hence, chained assignment will consistently never work and raise a ChainedAssignmentError warning with CoW enabled:

In [186]:
df = pd.DataFrame({"foo": [1, 2, 3], "bar": [4, 5, 6]})
df["foo"][df["bar"] > 5] = 100

C:\Users\thotc\AppData\Local\Temp\ipykernel_34888\1340306191.py:2: ChainedAssignmentError: A value is trying to be set on a copy of a DataFrame or Series through chained assignment.
When using the Copy-on-Write mode, such chained assignment never works to update the original DataFrame or Series, because the intermediate object on which we are setting values always behaves as a copy.

Try using '.loc[row_indexer, col_indexer] = value' instead, to perform the assignment in a single step.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["foo"][df["bar"] > 5] = 100


- With Copy-on-Write, this can be done by using loc.

In [187]:
df.loc[df["bar"] > 5, "foo"] = 100

### Read-only NumPy arrays
- Accessing the underlying NumPy array of a DataFrame will return a read-only array if the array shares data with the initial DataFrame:

- The array is a copy if the initial DataFrame consists of more than one array:

In [188]:
df = pd.DataFrame({"a": [1, 2], "b": [1.5, 2.5]})
df.to_numpy()

array([[1. , 1.5],
       [2. , 2.5]])

- The array shares data with the DataFrame if the DataFrame consists of only one NumPy array:

In [189]:
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_numpy()

array([[1, 3],
       [2, 4]])

- This array is read-only, which means that it can’t be modified inplace:

In [190]:
arr = df.to_numpy()
# arr[0, 0] = 100 # ValueError

- The same holds true for a Series, since a Series always consists of a single array.
- There are two potential solutions to this:
    - Trigger a copy manually if you want to avoid updating DataFrames that share memory with your array.
    - Make the array writeable. This is a more performant solution but circumvents Copy-on-Write rules, so it should be used with caution.

In [191]:
arr = df.to_numpy()

arr.flags.writeable = True

arr[0, 0] = 100

arr

array([[100,   3],
       [  2,   4]])
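
- The first option listed above, triggering a copy manually, can be sketched as follows. The Series and the value 100 are illustrative; the explicit .copy() gives an array that is writeable and independent of the pandas object.

```python
import pandas as pd

ser = pd.Series([1, 2, 3])

# An explicit copy is detached from ser, so writing to it is safe.
arr = ser.to_numpy().copy()
arr[0] = 100
```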

### Patterns to avoid
- No defensive copy will be performed if two objects share the same data while you are modifying one object inplace.

In [192]:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df2 = df.reset_index(drop=True)

df2.iloc[0, 0] = 100

- This creates two objects that share data and thus the setitem operation will trigger a copy. This is not necessary if the initial object df isn’t needed anymore. Simply reassigning to the same variable will invalidate the reference that is held by the object.

In [193]:
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

df = df.reset_index(drop=True)

df.iloc[0, 0] = 100

- No copy is necessary in this example. Creating multiple references keeps unnecessary references alive and thus will hurt performance with Copy-on-Write.

### Copy-on-Write optimizations
- CoW adds a new lazy copy mechanism that defers the copy until the object in question is modified, and copies only if this object shares data with another object. This mechanism was added to methods that don’t require a copy of the underlying data. Popular examples are DataFrame.drop() for axis=1 and DataFrame.rename().

- These methods return views when Copy-on-Write is enabled, which provides a significant performance improvement compared to the regular execution.

### How to enable CoW
- Copy-on-Write can be enabled through the configuration option copy_on_write. The option can be turned on __globally__ through either of the following:
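
- The options themselves are missing from these notes. As a sketch based on the pandas 2.x option API (the same option that is set earlier in this notebook), either of these two statements turns CoW on globally; the final get_option call is only an illustrative check.

```python
import pandas as pd

# Either statement enables Copy-on-Write for the whole session.
pd.set_option("mode.copy_on_write", True)
pd.options.mode.copy_on_write = True

# Read the option back to confirm the current setting.
enabled = pd.get_option("mode.copy_on_write")
```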

## Merge, join, concatenate and compare
- From http://pandas.pydata.org/docs/user_guide/merging.html
- pandas provides various methods for combining and comparing Series or DataFrame.
    - concat(): Merge multiple Series or DataFrame objects along a shared index or column
    - DataFrame.join(): Merge multiple DataFrame objects along the columns
    - DataFrame.combine_first(): Update missing values with non-missing values in the same location
    - merge(): Combine two Series or DataFrame objects with SQL-style joining
    - merge_ordered(): Combine two Series or DataFrame objects along an ordered axis
    - merge_asof(): Combine two Series or DataFrame objects by near instead of exact matching keys
    - Series.compare() and DataFrame.compare(): Show differences in values between two Series or DataFrame objects
### concat()
- The concat() function concatenates an arbitrary number of Series or DataFrame objects along an axis while performing optional set logic (union or intersection) of the indexes on the other axes. Like numpy.concatenate, concat() takes a list or dict of homogeneously-typed objects and concatenates them.

In [194]:
df1 = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    },
    index=[0, 1, 2, 3],
)

df2 = pd.DataFrame(
    {
        "A": ["A4", "A5", "A6", "A7"],
        "B": ["B4", "B5", "B6", "B7"],
        "C": ["C4", "C5", "C6", "C7"],
        "D": ["D4", "D5", "D6", "D7"],
    },
    index=[4, 5, 6, 7],
)

df3 = pd.DataFrame(
    {
        "A": ["A8", "A9", "A10", "A11"],
        "B": ["B8", "B9", "B10", "B11"],
        "C": ["C8", "C9", "C10", "C11"],
        "D": ["D8", "D9", "D10", "D11"],
    },
    index=[8, 9, 10, 11],
)

frames = [df1, df2, df3]

result = pd.concat(frames)
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7
8,A8,B8,C8,D8
9,A9,B9,C9,D9
10,A10,B10,C10,D10
11,A11,B11,C11,D11


- **Note**: concat() makes a full copy of the data, and iteratively reusing concat() can create unnecessary copies. Collect all DataFrame or Series objects in a list before using concat().

- `frames = [process_your_file(f) for f in files]`
- `result = pd.concat(frames)`
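
- A runnable version of this pattern is sketched below. load_chunk is a hypothetical stand-in for process_your_file, and the chunk contents are made up for illustration; the point is that concat() is called once on the collected list rather than inside the loop.

```python
import pandas as pd


def load_chunk(i):
    """Hypothetical loader: returns a small frame for chunk number i."""
    return pd.DataFrame({"chunk": [i, i], "value": [i * 10, i * 10 + 1]})


# Collect every piece first, then concatenate a single time.
frames = [load_chunk(i) for i in range(3)]
result = pd.concat(frames, ignore_index=True)
```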

- **Note**: When concatenating DataFrame with named axes, pandas will attempt to preserve these index/column names whenever possible. In the case where all inputs share a common name, this name will be assigned to the result. When the input names do not all agree, the result will be unnamed. The same is true for MultiIndex, but the logic is applied separately on a level-by-level basis.

#### Joining logic of the resulting axis
- The join keyword specifies how to handle axis values that don’t exist in the first DataFrame.

- join='outer' takes the union of all axis values

In [195]:
df4 = pd.DataFrame(
    {
        "B": ["B2", "B3", "B6", "B7"],
        "D": ["D2", "D3", "D6", "D7"],
        "F": ["F2", "F3", "F6", "F7"],
    },
    index=[2, 3, 6, 7],
)
result = pd.concat([df1, df4], axis=1)
result


Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3
6,,,,,B6,D6,F6
7,,,,,B7,D7,F7


- join='inner' takes the intersection of the axis values

In [196]:
result = pd.concat([df1, df4], axis=1, join="inner")
result

Unnamed: 0,A,B,C,D,B.1,D.1,F
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


- To perform an effective “left” join using the exact index from the original DataFrame, result can be reindexed.

In [197]:
result = pd.concat([df1, df4], axis=1).reindex(df1.index)

result

Unnamed: 0,A,B,C,D,B.1,D.1,F
0,A0,B0,C0,D0,,,
1,A1,B1,C1,D1,,,
2,A2,B2,C2,D2,B2,D2,F2
3,A3,B3,C3,D3,B3,D3,F3


#### Ignoring indexes on the concatenation axis
- For DataFrame objects which don't have a meaningful index, the ignore_index argument discards the existing (possibly overlapping) indexes and labels the result 0, 1, ..., n - 1.

In [198]:
result = pd.concat([df1, df4], ignore_index=True, sort=False)

result

Unnamed: 0,A,B,C,D,F
0,A0,B0,C0,D0,
1,A1,B1,C1,D1,
2,A2,B2,C2,D2,
3,A3,B3,C3,D3,
4,,B2,,D2,F2
5,,B3,,D3,F3
6,,B6,,D6,F6
7,,B7,,D7,F7
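
- A short sketch of why ignore_index matters when the inputs carry default indexes. The frames top and bottom are illustrative, not part of the original example.

```python
import pandas as pd

top = pd.DataFrame({"A": ["a0", "a1"]})
bottom = pd.DataFrame({"A": ["a2", "a3"]})

# Without ignore_index the original labels are kept, producing duplicates.
kept = pd.concat([top, bottom])
print(kept.index.tolist(), kept.index.is_unique)

# With ignore_index=True the result gets a fresh 0..n-1 index.
fresh = pd.concat([top, bottom], ignore_index=True)
print(fresh.index.tolist(), fresh.index.is_unique)
```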


#### Concatenating Series and DataFrame together
- You can concatenate a mix of Series and DataFrame objects. Each Series is converted to a DataFrame whose column name is the name of the Series.

In [199]:
s1 = pd.Series(["X0", "X1", "X2", "X3"], name="X")

result = pd.concat([df1, s1], axis=1)

result

Unnamed: 0,A,B,C,D,X
0,A0,B0,C0,D0,X0
1,A1,B1,C1,D1,X1
2,A2,B2,C2,D2,X2
3,A3,B3,C3,D3,X3


In [200]:
s2 = pd.Series(["_0", "_1", "_2", "_3"])

result = pd.concat([df1, s2, s2, s2], axis=1)

result

Unnamed: 0,A,B,C,D,0,1,2
0,A0,B0,C0,D0,_0,_0,_0
1,A1,B1,C1,D1,_1,_1,_1
2,A2,B2,C2,D2,_2,_2,_2
3,A3,B3,C3,D3,_3,_3,_3


- ignore_index=True will drop all name references.

In [201]:
result = pd.concat([df1, s1], axis=1, ignore_index=True)

result

Unnamed: 0,0,1,2,3,4
0,A0,B0,C0,D0,X0
1,A1,B1,C1,D1,X1
2,A2,B2,C2,D2,X2
3,A3,B3,C3,D3,X3


#### Resulting keys
- The keys argument adds another axis level to the resulting index or columns (creating a MultiIndex) to associate specific keys with each original DataFrame.

In [202]:
result = pd.concat(frames, keys=["x", "y", "z"])

result

Unnamed: 0,Unnamed: 1,A,B,C,D
x,0,A0,B0,C0,D0
x,1,A1,B1,C1,D1
x,2,A2,B2,C2,D2
x,3,A3,B3,C3,D3
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9


In [203]:
result.loc["y"]

Unnamed: 0,A,B,C,D
4,A4,B4,C4,D4
5,A5,B5,C5,D5
6,A6,B6,C6,D6
7,A7,B7,C7,D7


- The keys argument can override the column names when creating a new DataFrame based on existing Series.

In [204]:
s3 = pd.Series([0, 1, 2, 3], name="foo")
s4 = pd.Series([0, 1, 2, 3])
s5 = pd.Series([0, 1, 4, 5])
pd.concat([s3, s4, s5], axis=1)

Unnamed: 0,foo,0,1
0,0,0,0
1,1,1,1
2,2,2,4
3,3,3,5


In [205]:
pd.concat([s3, s4, s5], axis=1, keys=["red", "blue", "yellow"])

Unnamed: 0,red,blue,yellow
0,0,0,0
1,1,1,1
2,2,2,4
3,3,3,5


- You can also pass a dict to concat(), in which case the dict keys will be used for the keys argument unless another keys argument is specified:

In [206]:
pieces = {"x": df1, "y": df2, "z": df3}
result = pd.concat(pieces)
result

Unnamed: 0,Unnamed: 1,A,B,C,D
x,0,A0,B0,C0,D0
x,1,A1,B1,C1,D1
x,2,A2,B2,C2,D2
x,3,A3,B3,C3,D3
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9


In [207]:
result = pd.concat(pieces, keys=["z", "y"])

result

Unnamed: 0,Unnamed: 1,A,B,C,D
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9
z,10,A10,B10,C10,D10
z,11,A11,B11,C11,D11
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7


- The MultiIndex created has levels that are constructed from the passed keys and the index of the DataFrame pieces:

In [208]:
result.index.levels

FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])

- The levels argument allows specifying the resulting levels associated with the keys.

In [209]:
result = pd.concat(
    pieces, keys=["x", "y", "z"], levels=[["z", "y", "x", "w"]], names=["group_key"]
)
result

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
group_key,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
x,0,A0,B0,C0,D0
x,1,A1,B1,C1,D1
x,2,A2,B2,C2,D2
x,3,A3,B3,C3,D3
y,4,A4,B4,C4,D4
y,5,A5,B5,C5,D5
y,6,A6,B6,C6,D6
y,7,A7,B7,C7,D7
z,8,A8,B8,C8,D8
z,9,A9,B9,C9,D9


In [210]:
result.index.levels

FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])

#### Appending rows to a DataFrame
- If you have a Series that you want to append as a single row to a DataFrame, you can convert the row into a DataFrame and use concat().

In [211]:
s2 = pd.Series(["X0", "X1", "X2", "X3"], index=["A", "B", "C", "D"])
result = pd.concat([df1, s2.to_frame().T], ignore_index=True)
result

Unnamed: 0,A,B,C,D
0,A0,B0,C0,D0
1,A1,B1,C1,D1
2,A2,B2,C2,D2
3,A3,B3,C3,D3
4,X0,X1,X2,X3
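
- The same idea works when the new row starts out as a plain dict. This is a minimal sketch: the names base and new_row and their values are assumptions made for illustration.

```python
import pandas as pd

base = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]})

# Hypothetical new row, keyed by column name.
new_row = {"A": "A2", "B": "B2"}

# Wrap the dict in a list so it becomes a one-row DataFrame, then concat.
extended = pd.concat([base, pd.DataFrame([new_row])], ignore_index=True)
print(extended)
```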


### merge()
- merge() performs join operations similar to relational databases like SQL. Users who are familiar with SQL but new to pandas can reference a comparison with SQL.

#### Merge types
- merge() implements common SQL style joining operations.
- one-to-one: joining two DataFrame objects on their indexes which must contain unique values.
- many-to-one: joining a unique index to one or more columns in a different DataFrame.
- many-to-many : joining columns on columns.

- **Note**: When joining columns on columns, potentially a many-to-many join, any indexes on the passed DataFrame objects will be discarded.

- For a many-to-many join, if a key combination appears more than once in both tables, the DataFrame will have the Cartesian product of the associated data.

In [212]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

result = pd.merge(left, right, on="key")
result

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


- The how argument to merge() specifies which keys are included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

| Merge method | SQL Join Name | Description |
| ---------- | --------- | ----------- |
| left | LEFT OUTER JOIN | Use keys from left frame only |
| right | RIGHT OUTER JOIN | Use keys from right frame only |
| outer | FULL OUTER JOIN | Use union of keys from both frames |
| inner | INNER JOIN | Use intersection of keys from both frames |
| cross | CROSS JOIN | Create the cartesian product of rows of both frames |


In [214]:
left = pd.DataFrame(
   {
      "key1": ["K0", "K0", "K1", "K2"],
      "key2": ["K0", "K1", "K0", "K1"],
      "A": ["A0", "A1", "A2", "A3"],
      "B": ["B0", "B1", "B2", "B3"],
   }
)


right = pd.DataFrame(
   {
      "key1": ["K0", "K1", "K1", "K2"],
      "key2": ["K0", "K0", "K0", "K0"],
      "C": ["C0", "C1", "C2", "C3"],
      "D": ["D0", "D1", "D2", "D3"],
   }
)
result = pd.merge(left, right, how="left", on=["key1", "key2"])

result


Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


In [215]:
result = pd.merge(left, right, how="right", on=["key1", "key2"])

result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3


In [216]:
result = pd.merge(left, right, how="outer", on=["key1", "key2"])

result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K0,,,C3,D3
5,K2,K1,A3,B3,,


In [217]:
result = pd.merge(left, right, how="inner", on=["key1", "key2"])

result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [218]:
result = pd.merge(left, right, how="cross")

result

Unnamed: 0,key1_x,key2_x,A,B,key1_y,key2_y,C,D
0,K0,K0,A0,B0,K0,K0,C0,D0
1,K0,K0,A0,B0,K1,K0,C1,D1
2,K0,K0,A0,B0,K1,K0,C2,D2
3,K0,K0,A0,B0,K2,K0,C3,D3
4,K0,K1,A1,B1,K0,K0,C0,D0
5,K0,K1,A1,B1,K1,K0,C1,D1
6,K0,K1,A1,B1,K1,K0,C2,D2
7,K0,K1,A1,B1,K2,K0,C3,D3
8,K1,K0,A2,B2,K0,K0,C0,D0
9,K1,K0,A2,B2,K1,K0,C1,D1


- You can merge a DataFrame with a Series that has a MultiIndex if the names of the MultiIndex levels correspond to columns of the DataFrame. Transform the Series to a DataFrame using Series.reset_index() before merging.

In [219]:
df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})

df

Unnamed: 0,Let,Num
0,A,1
1,B,2
2,C,3


In [220]:
ser = pd.Series(
    ["a", "b", "c", "d", "e", "f"],
    index=pd.MultiIndex.from_arrays(
        [["A", "B", "C"] * 2, [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
    ),
)


ser

Let  Num
A    1      a
B    2      b
C    3      c
A    4      d
B    5      e
C    6      f
dtype: object

In [221]:
pd.merge(df, ser.reset_index(), on=["Let", "Num"])

Unnamed: 0,Let,Num,0
0,A,1,a
1,B,2,b
2,C,3,c


- Performing an outer join with duplicate join keys in DataFrame

In [222]:
left = pd.DataFrame({"A": [1, 2], "B": [2, 2]})

right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

result = pd.merge(left, right, on="B", how="outer")
result

Unnamed: 0,A_x,B,A_y
0,1,2,4
1,1,2,5
2,1,2,6
3,2,2,4
4,2,2,5
5,2,2,6


- **Warning**: Merging on duplicate keys can significantly increase the dimensions of the result and can cause a memory overflow.
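
- One way to gauge the blow-up before merging is to multiply the per-key counts from both sides. This is a rough sketch for an inner join on a single key column; the frames left_dup and right_dup are made up for illustration.

```python
import pandas as pd

left_dup = pd.DataFrame({"A": [1, 2, 3], "B": [2, 2, 3]})
right_dup = pd.DataFrame({"A": [4, 5, 6, 7], "B": [2, 2, 2, 4]})

# Count how often each key value occurs on each side.
left_counts = left_dup["B"].value_counts()
right_counts = right_dup["B"].value_counts()

# Keys present on only one side become NaN after alignment and are dropped;
# each shared key contributes left_count * right_count rows.
expected_rows = int((left_counts * right_counts).dropna().sum())

actual_rows = len(pd.merge(left_dup, right_dup, on="B", how="inner"))
print(expected_rows, actual_rows)
```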

#### Merge key uniqueness
- The validate argument checks the uniqueness of merge keys. Key uniqueness is checked before merge operations and can protect against memory overflows and unexpected key duplication.

In [223]:
left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})
# result = pd.merge(left, right, on="B", how="outer", validate="one_to_one") # MergeError
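
- The commented-out call above can be run safely by catching the exception. A minimal sketch, using the same data under the illustrative names left_v and right_v:

```python
import pandas as pd

left_v = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right_v = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

try:
    pd.merge(left_v, right_v, on="B", how="outer", validate="one_to_one")
    raised = False
except pd.errors.MergeError as exc:
    # The right frame repeats B == 2, so the one-to-one check fails.
    raised = True
    print(f"MergeError: {exc}")
```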

- If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no duplicates in the left DataFrame, one can use the validate='one_to_many' argument instead, which will not raise an exception.

In [224]:
pd.merge(left, right, on="B", how="outer", validate="one_to_many")

Unnamed: 0,A_x,B,A_y
0,1,1,
1,2,2,4.0
2,2,2,5.0
3,2,2,6.0


#### Merge result indicator
- merge() accepts the argument indicator. If True, a Categorical-type column called _merge will be added to the output object that takes on values:

| Observation Origin | _merge value |
| ----------------- | ---------- |
| Merge key only in 'left' frame | left_only |
| Merge key only in 'right' frame | right_only |
| Merge key in both frames | both |


In [225]:
df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})
pd.merge(df1, df2, on="col1", how="outer", indicator=True)

Unnamed: 0,col1,col_left,col_right,_merge
0,0,a,,left_only
1,1,b,2.0,both
2,2,,2.0,right_only
3,2,,2.0,right_only


- A string argument to indicator will use the value as the name for the indicator column.

In [226]:
pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")

Unnamed: 0,col1,col_left,col_right,indicator_column
0,0,a,,left_only
1,1,b,2.0,both
2,2,,2.0,right_only
3,2,,2.0,right_only
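
- A common use of the indicator column is to keep only the rows that found no partner (an anti-join). The sketch below reuses the data from the example above under the illustrative names a and b.

```python
import pandas as pd

a = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
b = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

merged = pd.merge(a, b, on="col1", how="outer", indicator=True)

# Keep rows whose key appears only in the left frame, then drop the marker.
only_left = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_left)
```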


#### Overlapping value columns
- The merge suffixes argument takes a tuple or list of two strings to append to overlapping column names in the input DataFrame objects to disambiguate the result columns:

In [227]:
left = pd.DataFrame({"k": ["K0", "K1", "K2"], "v": [1, 2, 3]})
right = pd.DataFrame({"k": ["K0", "K0", "K3"], "v": [4, 5, 6]})
result = pd.merge(left, right, on="k")
result

Unnamed: 0,k,v_x,v_y
0,K0,1,4
1,K0,1,5


In [228]:
result = pd.merge(left, right, on="k", suffixes=("_l", "_r"))
result

Unnamed: 0,k,v_l,v_r
0,K0,1,4
1,K0,1,5


### DataFrame.join()
- DataFrame.join() combines the columns of multiple, potentially differently-indexed DataFrame objects into a single result DataFrame.

In [230]:
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=["K0", "K1", "K2"]
)
right = pd.DataFrame(
    {"C": ["C0", "C2", "C3"], "D": ["D0", "D2", "D3"]}, index=["K0", "K2", "K3"]
)
result = left.join(right)
result

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2


In [231]:
result = left.join(right, how="outer")

result

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K1,A1,B1,,
K2,A2,B2,C2,D2
K3,,,C3,D3


In [232]:
result = left.join(right, how="inner")

result

Unnamed: 0,A,B,C,D
K0,A0,B0,C0,D0
K2,A2,B2,C2,D2


- DataFrame.join() takes an optional on argument, which may be a column name or multiple column names of the calling DataFrame on which the passed DataFrame is to be aligned.

In [233]:
left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "key": ["K0", "K1", "K0", "K1"],
    }
)
right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])

result = left.join(right, on="key")
result

Unnamed: 0,A,B,key,C,D
0,A0,B0,K0,C0,D0
1,A1,B1,K1,C1,D1
2,A2,B2,K0,C0,D0
3,A3,B3,K1,C1,D1


In [234]:
result = pd.merge(
    left, right, left_on="key", right_index=True, how="left", sort=False
)
result

Unnamed: 0,A,B,key,C,D
0,A0,B0,K0,C0,D0
1,A1,B1,K1,C1,D1
2,A2,B2,K0,C0,D0
3,A3,B3,K1,C1,D1
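
- The two calls above produce the same table, which can be confirmed directly. A minimal sketch with smaller, illustrative frames left_k and right_k:

```python
import pandas as pd

left_k = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"], "key": ["K0", "K1", "K0", "K1"]}
)
right_k = pd.DataFrame({"C": ["C0", "C1"]}, index=["K0", "K1"])

# join() on a column of the caller against the index of the other frame ...
via_join = left_k.join(right_k, on="key")

# ... spelled out as the corresponding merge() call.
via_merge = pd.merge(
    left_k, right_k, left_on="key", right_index=True, how="left", sort=False
)

same = via_join.equals(via_merge)
print(same)
```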


- To join on multiple keys, the passed DataFrame must have a MultiIndex:

In [235]:
left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
    }
)
index = pd.MultiIndex.from_tuples(
    [("K0", "K0"), ("K1", "K0"), ("K2", "K0"), ("K2", "K1")]
)

right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=index
)

result = left.join(right, on=["key1", "key2"])
result

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
1,A1,B1,K0,K1,,
2,A2,B2,K1,K0,C1,D1
3,A3,B3,K2,K1,C3,D3


- The default for DataFrame.join is to perform a left join which uses only the keys found in the calling DataFrame. Other join types can be specified with how.

In [236]:
result = left.join(right, on=["key1", "key2"], how="inner")
result

Unnamed: 0,A,B,key1,key2,C,D
0,A0,B0,K0,K0,C0,D0
2,A2,B2,K1,K0,C1,D1
3,A3,B3,K2,K1,C3,D3


#### Joining a single Index to a MultiIndex
- You can join a DataFrame with an Index to a DataFrame with a MultiIndex on a level. The name of the Index must match the level name of the MultiIndex.

In [237]:
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]},
    index=pd.Index(["K0", "K1", "K2"], name="key"),
)
index = pd.MultiIndex.from_tuples(
    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")],
    names=["key", "Y"],
)

right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]},
    index=index,
)

result = left.join(right, how="inner")
result

Unnamed: 0_level_0,Unnamed: 1_level_0,A,B,C,D
key,Y,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
K0,Y0,A0,B0,C0,D0
K1,Y1,A1,B1,C1,D1
K2,Y2,A2,B2,C2,D2
K2,Y3,A2,B2,C3,D3


#### Joining with two MultiIndex
- The MultiIndex of the passed DataFrame must be used completely in the join, and its levels must be a subset of the index levels of the left DataFrame.

In [238]:
leftindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
)

left = pd.DataFrame({"v1": range(12)}, index=leftindex)
left

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,v1
abc,xy,num,Unnamed: 3_level_1
a,x,1,0
a,x,2,1
a,y,1,2
a,y,2,3
b,x,1,4
b,x,2,5
b,y,1,6
b,y,2,7
c,x,1,8
c,x,2,9


In [239]:
rightindex = pd.MultiIndex.from_product(
    [list("abc"), list("xy")], names=["abc", "xy"]
)

right = pd.DataFrame({"v2": [100 * i for i in range(1, 7)]}, index=rightindex)
right

Unnamed: 0_level_0,Unnamed: 1_level_0,v2
abc,xy,Unnamed: 2_level_1
a,x,100
a,y,200
b,x,300
b,y,400
c,x,500
c,y,600


In [240]:
left.join(right, on=["abc", "xy"], how="inner")

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,v1,v2
abc,xy,num,Unnamed: 3_level_1,Unnamed: 4_level_1
a,x,1,0,100
a,x,2,1,100
a,y,1,2,200
a,y,2,3,200
b,x,1,4,300
b,x,2,5,300
b,y,1,6,400
b,y,2,7,400
c,x,1,8,500
c,x,2,9,500


In [243]:
leftindex = pd.MultiIndex.from_tuples(
    [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
)

left = pd.DataFrame(
    {"A": ["A0", "A1", "A2"], "B": ["B0", "B1", "B2"]}, index=leftindex
)
rightindex = pd.MultiIndex.from_tuples(
    [("K0", "Y0"), ("K1", "Y1"), ("K2", "Y2"), ("K2", "Y3")], names=["key", "Y"]
)

right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "D": ["D0", "D1", "D2", "D3"]}, index=rightindex
)
result = pd.merge(
    left.reset_index(), right.reset_index(), on=["key"], how="inner"
).set_index(["key", "X", "Y"])

result

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,A,B,C,D
key,X,Y,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
K0,X0,Y0,A0,B0,C0,D0
K0,X1,Y0,A1,B1,C0,D0
K1,X2,Y1,A2,B2,C1,D1


#### Merging on a combination of columns and index levels
- Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level names. This enables merging DataFrame instances on a combination of index levels and columns without resetting indexes.

In [244]:
left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")
left = pd.DataFrame(
    {
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
        "key2": ["K0", "K1", "K0", "K1"],
    },
    index=left_index,
)
right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")

right = pd.DataFrame(
    {
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
        "key2": ["K0", "K0", "K0", "K1"],
    },
    index=right_index,
)
result = left.merge(right, on=["key1", "key2"])
result


Unnamed: 0_level_0,A,B,key2,C,D
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
K0,A0,B0,K0,C0,D0
K1,A2,B2,K0,C1,D1
K2,A3,B3,K1,C3,D3


- **Note**: When DataFrame objects are joined on a string that matches an index level in both arguments, the index level is preserved as an index level in the resulting DataFrame.
- **Note**: When DataFrame objects are joined using only some of the levels of a MultiIndex, the extra levels will be dropped from the resulting join. To preserve those levels, use DataFrame.reset_index() on those level names to move those levels to columns prior to the join.

#### Joining multiple DataFrame
- A list or tuple of DataFrame objects can also be passed to join() to join them together on their indexes.

In [246]:
right2 = pd.DataFrame({"v": [7, 8, 9]}, index=["K1", "K1", "K2"])
result = left.join([right, right2])
result

Unnamed: 0_level_0,A,B,key2_x,C,D,key2_y,v
key1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
K0,A0,B0,K0,C0,D0,K0,
K0,A1,B1,K1,C0,D0,K0,
K1,A2,B2,K0,C1,D1,K0,7.0
K1,A2,B2,K0,C1,D1,K0,8.0
K2,A3,B3,K1,C2,D2,K0,9.0
K2,A3,B3,K1,C3,D3,K1,9.0


#### DataFrame.combine_first()
- DataFrame.combine_first() updates missing values in one DataFrame with the non-missing values from another DataFrame in the corresponding location.

In [247]:
df1 = pd.DataFrame(
    [[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]]
)

df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

result = df1.combine_first(df2)

result

Unnamed: 0,0,1,2
0,,3.0,5.0
1,-4.6,,-8.2
2,-5.0,7.0,4.0
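
- A smaller sketch makes the rule easier to see: values already present in the caller win, and only gaps are filled. The frames primary and fallback are invented for illustration.

```python
import numpy as np
import pandas as pd

primary = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, 6.0]})
fallback = pd.DataFrame({"x": [10.0, 20.0, 30.0], "y": [40.0, np.nan, 60.0]})

# Gaps in primary are filled from fallback; existing values are untouched.
patched = primary.combine_first(fallback)
print(patched)
```

- A cell stays missing only when both frames lack a value at that position, as in row 1 of column y here.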


### merge_ordered()
- merge_ordered() combines ordered data, such as numeric or time series data, and can optionally fill missing values using fill_method.

In [248]:
left = pd.DataFrame(
    {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
)
right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

pd.merge_ordered(left, right, fill_method="ffill", left_by="s")

Unnamed: 0,k,lv,s,rv
0,K0,1.0,a,
1,K1,1.0,a,1.0
2,K2,1.0,a,2.0
3,K4,1.0,a,3.0
4,K1,2.0,b,1.0
5,K2,2.0,b,2.0
6,K4,2.0,b,3.0
7,K1,3.0,c,1.0
8,K2,3.0,c,2.0
9,K4,3.0,c,3.0
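
- Without left_by, the behaviour is easier to follow. In this minimal sketch the frames left_o and right_o are illustrative: the keys are merged in sorted order and gaps are forward-filled.

```python
import pandas as pd

left_o = pd.DataFrame({"k": ["a", "c", "e"], "lv": [1, 2, 3]})
right_o = pd.DataFrame({"k": ["b", "c", "d"], "rv": [1, 2, 3]})

# Ordered outer merge on "k"; missing values are forward-filled.
ordered = pd.merge_ordered(left_o, right_o, on="k", fill_method="ffill")
print(ordered)
```

- The first rv value stays missing because there is no earlier row to fill from.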


### merge_asof()
- merge_asof() is similar to an ordered left-join except that matches are on the nearest key rather than equal keys. For each row in the left DataFrame, the last row in the right DataFrame is selected where the on key is less than or equal to the left key (exact matches are allowed by default). Both DataFrame objects must be sorted by the key.

- Optionally, merge_asof() can perform a group-wise merge by matching the by key in addition to the nearest match on the on key.
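
- Before the time series example, a small numeric sketch shows the matching rule in isolation. The frames left_n and right_n are illustrative; with the default settings an exact key match is accepted, and allow_exact_matches=False requires a strictly smaller key.

```python
import pandas as pd

left_n = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
right_n = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})

# Default: take the last right row whose key is <= the left key.
backward = pd.merge_asof(left_n, right_n, on="a")
print(backward)

# Exact matches excluded: the right key must be strictly smaller.
strict = pd.merge_asof(left_n, right_n, on="a", allow_exact_matches=False)
print(strict)
```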

In [250]:
trades = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "20160525 13:30:00.023",
                "20160525 13:30:00.038",
                "20160525 13:30:00.048",
                "20160525 13:30:00.048",
                "20160525 13:30:00.048",
            ]
        ),
        "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
        "price": [51.95, 51.95, 720.77, 720.92, 98.00],
        "quantity": [75, 155, 100, 100, 100],
    },
    columns=["time", "ticker", "price", "quantity"],
)

quotes = pd.DataFrame(
    {
        "time": pd.to_datetime(
            [
                "20160525 13:30:00.023",
                "20160525 13:30:00.023",
                "20160525 13:30:00.030",
                "20160525 13:30:00.041",
                "20160525 13:30:00.048",
                "20160525 13:30:00.049",
                "20160525 13:30:00.072",
                "20160525 13:30:00.075",
            ]
        ),
        "ticker": ["GOOG", "MSFT", "MSFT", "MSFT", "GOOG", "AAPL", "GOOG", "MSFT"],
        "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
        "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
    },
    columns=["time", "ticker", "bid", "ask"],
)

trades


Unnamed: 0,time,ticker,price,quantity
0,2016-05-25 13:30:00.023,MSFT,51.95,75
1,2016-05-25 13:30:00.038,MSFT,51.95,155
2,2016-05-25 13:30:00.048,GOOG,720.77,100
3,2016-05-25 13:30:00.048,GOOG,720.92,100
4,2016-05-25 13:30:00.048,AAPL,98.0,100


In [251]:
quotes

Unnamed: 0,time,ticker,bid,ask
0,2016-05-25 13:30:00.023,GOOG,720.5,720.93
1,2016-05-25 13:30:00.023,MSFT,51.95,51.96
2,2016-05-25 13:30:00.030,MSFT,51.97,51.98
3,2016-05-25 13:30:00.041,MSFT,51.99,52.0
4,2016-05-25 13:30:00.048,GOOG,720.5,720.93
5,2016-05-25 13:30:00.049,AAPL,97.99,98.01
6,2016-05-25 13:30:00.072,GOOG,720.5,720.88
7,2016-05-25 13:30:00.075,MSFT,52.01,52.03


In [252]:
pd.merge_asof(trades, quotes, on="time", by="ticker")

Unnamed: 0,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.97,51.98
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,


- Use the tolerance argument to accept only quotes that fall within 2ms of the trade time.

In [253]:
pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))


Unnamed: 0,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,51.95,51.96
1,2016-05-25 13:30:00.038,MSFT,51.95,155,,
2,2016-05-25 13:30:00.048,GOOG,720.77,100,720.5,720.93
3,2016-05-25 13:30:00.048,GOOG,720.92,100,720.5,720.93
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,


- Accept quotes within 10ms of the trade time and exclude exact matches on time with allow_exact_matches=False. Note that although exact matches (of the quotes) are excluded, prior quotes do propagate to that point in time.

In [254]:
pd.merge_asof(
    trades,
    quotes,
    on="time",
    by="ticker",
    tolerance=pd.Timedelta("10ms"),
    allow_exact_matches=False,
)

Unnamed: 0,time,ticker,price,quantity,bid,ask
0,2016-05-25 13:30:00.023,MSFT,51.95,75,,
1,2016-05-25 13:30:00.038,MSFT,51.95,155,51.97,51.98
2,2016-05-25 13:30:00.048,GOOG,720.77,100,,
3,2016-05-25 13:30:00.048,GOOG,720.92,100,,
4,2016-05-25 13:30:00.048,AAPL,98.0,100,,


### compare()
- The Series.compare() and DataFrame.compare() methods allow you to compare two Series or DataFrame objects, respectively, and summarize their differences.

In [255]:
df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
    },
    columns=["col1", "col2", "col3"],
)
df

Unnamed: 0,col1,col2,col3
0,a,1.0,1.0
1,a,2.0,2.0
2,b,3.0,3.0
3,b,,4.0
4,a,5.0,5.0


In [256]:
df2 = df.copy()
df2.loc[0, "col1"] = "c"
df2.loc[2, "col3"] = 4.0
df2

Unnamed: 0,col1,col2,col3
0,c,1.0,1.0
1,a,2.0,2.0
2,b,3.0,4.0
3,b,,4.0
4,a,5.0,5.0


In [257]:
df.compare(df2)

Unnamed: 0_level_0,col1,col1,col3,col3
Unnamed: 0_level_1,self,other,self,other
0,a,c,,
2,,,3.0,4.0


- By default, if two corresponding values are equal, they will be shown as NaN. Furthermore, if all values in an entire row / column are equal, the row / column will be omitted from the result. The remaining differences will be aligned on columns.

- Stack the differences on rows.

In [258]:
df.compare(df2, align_axis=0)

Unnamed: 0,Unnamed: 1,col1,col3
0,self,a,
0,other,c,
2,self,,3.0
2,other,,4.0


- Keep all original rows and columns with keep_shape=True

In [259]:
df.compare(df2, keep_shape=True)

Unnamed: 0_level_0,col1,col1,col2,col2,col3,col3
Unnamed: 0_level_1,self,other,self,other,self,other
0,a,c,,,,
1,,,,,,
2,,,,,3.0,4.0
3,,,,,,
4,,,,,,


- Keep all the original values even if they are equal.

In [260]:
df.compare(df2, keep_shape=True, keep_equal=True)

Unnamed: 0_level_0,col1,col1,col2,col2,col3,col3
Unnamed: 0_level_1,self,other,self,other,self,other
0,a,c,1.0,1.0,1.0,1.0
1,a,a,2.0,2.0,2.0,2.0
2,b,b,3.0,3.0,3.0,4.0
3,b,b,,,4.0,4.0
4,a,a,5.0,5.0,5.0,5.0
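
- Series.compare() works the same way on one-dimensional data. A minimal sketch with two illustrative Series, ser_a and ser_b:

```python
import pandas as pd

ser_a = pd.Series(["a", "b", "c", "d", "e"])
ser_b = pd.Series(["a", "a", "c", "b", "e"])

# Only positions where the values differ are reported,
# with the caller under "self" and the argument under "other".
diff = ser_a.compare(ser_b)
print(diff)
```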
