## Practice 3
#### MultiIndex / advanced indexing
- From https://pandas.pydata.org/docs/user_guide/advanced.html
- This section covers indexing with a MultiIndex and other advanced indexing features.
- See Indexing and Selecting Data for general indexing documentation.
- **Warning**: Whether a copy or a reference is returned for a setting operation may depend on the context. This is sometimes called chained assignment and should be avoided. See Returning a View versus Copy.

#### Hierarchical indexing (MultiIndex)
- Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame (2d).

- In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.

- See the cookbook for some advanced strategies.

#### Creating a MultiIndex (hierarchical index) object
- The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples (using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()), or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize MultiIndexes.
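
- As a minimal sketch of the constructors listed above that the cells below do not exercise, the following builds the same index with MultiIndex.from_arrays() and MultiIndex.from_tuples(), then passes a list of tuples to the plain Index constructor. The variable names are illustrative only.

```python
import pandas as pd

arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
tuples = list(zip(*arrays))

# Build the same index two ways: from parallel arrays and from tuples.
mi_from_arrays = pd.MultiIndex.from_arrays(arrays, names=["first", "second"])
mi_from_tuples = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
same_index = mi_from_arrays.equals(mi_from_tuples)

# The plain Index constructor also returns a MultiIndex for a list of tuples.
mi_from_index = pd.Index(tuples)
index_type = type(mi_from_index).__name__

print(same_index, index_type)
```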

In [5]:
import pandas as pd
import numpy as np

In [2]:
arrays = [
    ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
    ["one", "two", "one", "two", "one", "two", "one", "two"],
]
tuples = list(zip(*arrays))
tuples

[('bar', 'one'),
 ('bar', 'two'),
 ('baz', 'one'),
 ('baz', 'two'),
 ('foo', 'one'),
 ('foo', 'two'),
 ('qux', 'one'),
 ('qux', 'two')]

In [3]:
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
index

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

In [6]:
s = pd.Series(np.random.randn(8), index=index)
s

first  second
bar    one       1.401668
       two       0.843072
baz    one       0.054789
       two      -1.624187
foo    one      -1.655984
       two       0.368256
qux    one      -0.786081
       two      -0.220323
dtype: float64

- When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.from_product() method:

In [7]:
iterables = [["bar", "baz", "foo", "qux"], ["one", "two"]]

pd.MultiIndex.from_product(iterables, names=["first", "second"])

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])

- You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.from_frame(). This is a complementary method to MultiIndex.to_frame().

In [8]:
df = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)
pd.MultiIndex.from_frame(df)

MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('foo', 'one'),
            ('foo', 'two')],
           names=['first', 'second'])
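
- To illustrate why from_frame() and to_frame() are described as complementary, here is a small round-trip sketch. It assumes to_frame(index=False), which returns the levels as ordinary columns with a default row index.

```python
import pandas as pd

frame = pd.DataFrame(
    [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
    columns=["first", "second"],
)

# DataFrame -> MultiIndex: column names become level names.
mi = pd.MultiIndex.from_frame(frame)

# MultiIndex -> DataFrame: each level becomes a column again.
round_trip = mi.to_frame(index=False)

print(round_trip)
```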

- As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex automatically:

In [9]:
arrays = [
    np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
    np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
]
s = pd.Series(np.random.randn(8), index=arrays)
s

bar  one   -0.571740
     two    0.657200
baz  one    0.638534
     two    0.636702
foo  one    0.105625
     two   -1.108993
qux  one   -1.490093
     two   -0.410552
dtype: float64

In [10]:
df = pd.DataFrame(np.random.randn(8, 4), index=arrays)
df

                0         1         2         3
bar one  0.773128  0.960034 -0.696723 -0.768177
    two -1.990783 -0.119898 -1.485906  0.787975
baz one -0.866240  1.121190 -0.018594 -1.927502
    two -0.307928  0.415760  0.607379  1.673755
foo one -0.854522 -0.087861 -0.347848 -0.543385
    two -1.034672 -0.023308  1.578449  0.965557
qux one -0.012152 -0.310898  0.403776 -1.804362
    two -0.657695  0.832442 -0.257588  1.146220


- All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves. If no names are provided, None will be assigned:

In [11]:
df.index.names

FrozenList([None, None])
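
- Level names can also be attached after construction. The sketch below uses set_names(), which returns a new index rather than modifying the existing one; the frame and its contents are made up for illustration.

```python
import numpy as np
import pandas as pd

arrays = [["bar", "bar", "baz", "baz"], ["one", "two", "one", "two"]]
frame = pd.DataFrame(np.zeros((4, 2)), index=arrays)

# No names were supplied, so every level name is None.
names_before = list(frame.index.names)

# set_names returns a new index, so assign it back to the frame.
frame.index = frame.index.set_names(["first", "second"])
names_after = list(frame.index.names)

print(names_before, names_after)
```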

- This index can back any axis of a pandas object, and the number of levels of the index is up to you:

In [12]:
df = pd.DataFrame(np.random.randn(3, 8), index=["A", "B", "C"], columns=index)

df

first        bar                 baz                 foo                 qux
second       one       two       one       two       one       two       one       two
A       0.416418  0.011115  0.325098 -0.520251 -1.195804 -0.075759  0.752760  0.563595
B       0.333405  0.087818  0.228822  0.771709  0.605022 -1.181512 -0.814840  1.207180
C       2.033167 -1.144466 -0.022267  0.457045  2.229661 -0.754105  0.274854  0.009222


In [13]:
pd.DataFrame(np.random.randn(6, 6), index=index[:6], columns=index[:6])

first              bar                 baz                 foo
second             one       two       one       two       one       two
first second
bar   one    -1.017490  0.556011 -0.879456 -0.424015  0.557869 -2.353952
      two     0.962300 -1.314676 -0.710711  0.828191  1.267234 -0.151198
baz   one     0.512276  1.097474  0.755134  0.576787  0.588503 -0.686263
      two    -0.793089  0.358887 -0.160100  2.181757  0.065882  0.478478
foo   one     0.596251  0.706598  0.290585 -0.711790 -0.825277 -0.365812
      two     0.921318 -2.407341  0.368492  0.566729  0.085477 -0.800106


- We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how the index is displayed can be controlled using the display.multi_sparse option, set with pandas.set_option() or temporarily with pandas.option_context():

In [14]:
with pd.option_context("display.multi_sparse", False):
    print(df)  # a bare expression inside a with block is not displayed, so print explicitly
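
- The effect of the option can be checked on the rendered text. The sketch below is an illustration on a small made-up frame: it renders the frame once with the default setting and once with display.multi_sparse disabled, then counts how often the outer label appears.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
frame = pd.DataFrame(np.zeros((4, 2)), index=index, columns=["x", "y"])

# Default rendering: repeated outer labels are left blank.
sparse_text = frame.to_string()

# With multi_sparse disabled, every row shows its full label.
with pd.option_context("display.multi_sparse", False):
    dense_text = frame.to_string()

print(sparse_text)
print(dense_text)
```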

- It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:

In [15]:
pd.Series(np.random.randn(8), index=tuples)

(bar, one)    0.071709
(bar, two)   -0.127359
(baz, one)    0.786960
(baz, two)    0.025028
(foo, one)    0.666218
(foo, two)    1.402507
(qux, one)   -0.013039
(qux, two)    2.061105
dtype: float64
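
- To make the difference concrete, the sketch below compares a Series whose labels are plain tuples with one backed by a real MultiIndex. Only the second has two levels and supports partial selection; the data values are placeholders.

```python
import pandas as pd

tuples = [("bar", "one"), ("bar", "two"), ("baz", "one"), ("baz", "two")]

# Tuples as atomic labels: a single-level index whose labels happen to be tuples.
flat = pd.Series(range(4), index=tuples)

# A real MultiIndex built from the same tuples has two levels.
hier = pd.Series(range(4), index=pd.MultiIndex.from_tuples(tuples))

flat_levels = flat.index.nlevels
hier_levels = hier.index.nlevels

# Selecting by the first element alone relies on the hierarchical structure.
bar_rows = hier["bar"]

print(flat_levels, hier_levels)
print(bar_rows)
```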

- The MultiIndex matters because it lets you perform grouping, selection, and reshaping operations, as described below and in subsequent areas of the documentation. As you will see in later sections, you can find yourself working with hierarchically-indexed data without ever creating a MultiIndex explicitly. However, when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.

#### Reconstructing the level labels
- The method get_level_values() will return a vector of the labels for each location at a particular level:

In [16]:
index.get_level_values(0)

Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

In [17]:
index.get_level_values("second")

Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
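
- One possible use of get_level_values(), sketched here on a small made-up Series, is to build a boolean mask from a single level or to list the distinct labels of a level.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo"], ["one", "two"]], names=["first", "second"]
)
s = pd.Series(np.arange(6), index=index)

# Boolean mask built from the labels of the "second" level.
mask = index.get_level_values("second") == "one"
ones = s[mask]

# unique() collapses the repeated labels of a level.
first_labels = list(index.get_level_values("first").unique())

print(ones)
print(first_labels)
```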

#### Basic indexing on axis with MultiIndex
- One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous way to selecting a column in a regular DataFrame:

In [18]:
df["bar"]

second       one       two
A       0.416418  0.011115
B       0.333405  0.087818
C       2.033167 -1.144466


In [19]:
df["bar", "one"]

A    0.416418
B    0.333405
C    2.033167
Name: (bar, one), dtype: float64

In [20]:
df["bar"]["one"]

A    0.416418
B    0.333405
C    2.033167
Name: one, dtype: float64

In [21]:
s["qux"]

one   -1.490093
two   -0.410552
dtype: float64
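
- The level dropping described above can be verified directly. This sketch uses a small frame with integer data (an assumption made for readability) and checks what remains of the column index after partial and full selection.

```python
import numpy as np
import pandas as pd

columns = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(
    np.arange(12).reshape(3, 4), index=["A", "B", "C"], columns=columns
)

# Partial key: the "first" level is dropped, leaving a single-level column index.
sub = frame["bar"]
remaining_levels = sub.columns.nlevels
remaining_name = sub.columns.name

# Full key in one step versus chained selection: same data, different Series name.
direct = frame["bar", "one"]
chained = frame["bar"]["one"]

print(remaining_levels, remaining_name)
print(direct.name, chained.name)
```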

#### Defined levels
- The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index, you may notice this. For example:

In [22]:
df.columns.levels  # original MultiIndex

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

In [23]:
df[["foo","qux"]].columns.levels  # sliced

FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])

- This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only the used levels, you can use the get_level_values() method.

In [24]:
df[["foo", "qux"]].columns.to_numpy()

array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
      dtype=object)

In [25]:
# for a specific level
df[["foo", "qux"]].columns.get_level_values(0)

Index(['foo', 'foo', 'qux', 'qux'], dtype='object', name='first')

- To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.

In [26]:
new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
new_mi.levels

FrozenList([['foo', 'qux'], ['one', 'two']])

#### Data alignment and using reindex
- Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data alignment works the same way as for an Index of tuples:

In [27]:
s + s[:-2]

bar  one   -1.143480
     two    1.314399
baz  one    1.277069
     two    1.273404
foo  one    0.211250
     two   -2.217987
qux  one         NaN
     two         NaN
dtype: float64

In [28]:
s + s[::2]

bar  one   -1.143480
     two         NaN
baz  one    1.277069
     two         NaN
foo  one    0.211250
     two         NaN
qux  one   -2.980186
     two         NaN
dtype: float64
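
- When the NaN values produced by alignment are not wanted, one option is the add() method with fill_value, which treats a label missing from one operand as the given value. A minimal sketch with made-up numbers:

```python
import pandas as pd

index = pd.MultiIndex.from_product([["bar", "baz"], ["one", "two"]])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# The + operator aligns on the full tuples; labels missing on one side give NaN.
plain = s + s[::2]

# add() with fill_value substitutes 0 for the missing side instead.
filled = s.add(s[::2], fill_value=0)

print(plain)
print(filled)
```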

- The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array of tuples:

In [29]:
s.reindex(index[:3])

first  second
bar    one      -0.571740
       two       0.657200
baz    one       0.638534
dtype: float64

In [30]:
s.reindex([("foo", "two"), ("bar", "one"), ("qux", "one"), ("baz", "one")])

foo  two   -1.108993
bar  one   -0.571740
qux  one   -1.490093
baz  one    0.638534
dtype: float64
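
- Unlike label selection, reindex() accepts keys that are not present in the index. The sketch below, on a small made-up Series, shows the resulting NaN and the fill_value alternative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("bar", "one"), ("bar", "two"), ("baz", "one")]
)
s = pd.Series([1.0, 2.0, 3.0], index=index)

wanted = [("baz", "one"), ("qux", "one")]  # ("qux", "one") is not in the index

# Missing keys are kept in the result and filled with NaN by default.
with_nan = s.reindex(wanted)

# fill_value replaces the default NaN for missing keys.
with_zero = s.reindex(wanted, fill_value=0.0)

print(with_nan)
print(with_zero)
```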

#### Advanced indexing with hierarchical index
- Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would expect:

In [31]:
df = df.T
df

                     A         B         C
first second
bar   one     0.416418  0.333405  2.033167
      two     0.011115  0.087818 -1.144466
baz   one     0.325098  0.228822 -0.022267
      two    -0.520251  0.771709  0.457045
foo   one    -1.195804  0.605022  2.229661
      two    -0.075759 -1.181512 -0.754105
qux   one     0.752760 -0.814840  0.274854
      two     0.563595  1.207180  0.009222


In [32]:
df.loc[("bar", "two")]

A    0.011115
B    0.087818
C   -1.144466
Name: (bar, two), dtype: float64

- Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to ambiguity in general.

- If you also want to index a specific column with .loc, you must use a tuple like this:

In [33]:
df.loc[("bar", "two"), "A"]

np.float64(0.01111505574465237)
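
- The reason for wrapping the row key in a tuple is that .loc otherwise has to guess which axis each element belongs to. The following sketch, on a made-up frame with integer data, contrasts a full row key plus column with a partial row key plus column.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame(
    np.arange(12).reshape(4, 3), index=index, columns=["A", "B", "C"]
)

# Tuple = complete row key, second argument = column: a single cell.
cell = frame.loc[("bar", "two"), "A"]

# Bare "bar" = partial row key, "A" = column: a Series over the remaining level.
col_slice = frame.loc["bar", "A"]

print(cell)
print(col_slice)
```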

- You don’t have to specify all levels of the MultiIndex; you can pass only the first elements of the tuple. For example, you can use “partial” indexing to get all elements with bar in the first level as follows:

In [34]:
df.loc["bar"]

               A         B         C
second
one     0.416418  0.333405  2.033167
two     0.011115  0.087818 -1.144466


- This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',] in this example).

- “Partial” slicing also works quite nicely.

In [35]:
df.loc["baz":"foo"]

                     A         B         C
first second
baz   one     0.325098  0.228822 -0.022267
      two    -0.520251  0.771709  0.457045
foo   one    -1.195804  0.605022  2.229661
      two    -0.075759 -1.181512 -0.754105


- You can slice with a ‘range’ of values, by providing a slice of tuples.

In [36]:
df.loc[("baz", "two"):("qux", "one")]

                     A         B         C
first second
baz   two    -0.520251  0.771709  0.457045
foo   one    -1.195804  0.605022  2.229661
      two    -0.075759 -1.181512 -0.754105
qux   one     0.752760 -0.814840  0.274854


In [37]:
df.loc[("baz", "two"):"foo"]

                     A         B         C
first second
baz   two    -0.520251  0.771709  0.457045
foo   one    -1.195804  0.605022  2.229661
      two    -0.075759 -1.181512 -0.754105
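
- The three slicing forms above can be compared side by side on a small frame with integer data (an assumption made so the selected rows are easy to read). Both endpoints of a label slice are included, and the index here is sorted because from_product() is given sorted inputs.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame({"v": range(8)}, index=index)

# Slice on the first level only: all of baz through all of foo.
by_label = frame.loc["baz":"foo"]

# Slice between two complete keys.
by_tuple = frame.loc[("baz", "two"):("qux", "one")]

# Mixed: start at a complete key, stop at the end of the foo group.
mixed = frame.loc[("baz", "two"):"foo"]

print(by_label["v"].tolist(), by_tuple["v"].tolist(), mixed["v"].tolist())
```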


- Passing a list of labels or tuples works similarly to reindexing:

In [38]:
df.loc[[("bar", "two"), ("qux", "one")]]

                     A         B         C
first second
bar   two     0.011115  0.087818 -1.144466
qux   one     0.752760 -0.814840  0.274854
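
- As with reindexing, the rows come back in the order the keys are listed. The sketch below shows this for a list of complete tuples and for a list of first-level labels, using a made-up frame with integer data.

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [["bar", "baz", "foo", "qux"], ["one", "two"]], names=["first", "second"]
)
frame = pd.DataFrame({"v": range(8)}, index=index)

# List of complete keys: one row per tuple, in the order requested.
by_tuples = frame.loc[[("qux", "one"), ("bar", "two")]]

# List of first-level labels: every row under each listed label.
by_labels = frame.loc[["bar", "qux"]]

print(by_tuples["v"].tolist())
print(by_labels["v"].tolist())
```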


- **Note**: Tuples and lists are not treated identically in pandas when it comes to indexing. Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. In other words, tuples go horizontally (traversing levels), while lists go vertically (scanning levels).

- Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refers to several values within a level: