# Data Wrangling: Join, Combine, and Reshape

## Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have multiple
(two or more) index levels on an axis.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

In [3]:
data

a  1    0.323156
   2   -0.615504
   3    0.450635
b  1    0.986506
   3    1.067334
c  1   -0.301883
   2   -1.621682
d  2   -0.690276
   3   -0.636359
dtype: float64

In [4]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

In [5]:
data['b']

1    0.986506
3    1.067334
dtype: float64

In [6]:
data['b':'c']

b  1    0.986506
   3    1.067334
c  1   -0.301883
   2   -1.621682
dtype: float64

In [7]:
data.loc[ ['b','d'] ]

b  1    0.986506
   3    1.067334
d  2   -0.690276
   3   -0.636359
dtype: float64

**Here a, b, c, d form the outer index level and the numbers form the inner index level, so a single value can be selected by giving one label from each level.**

In [8]:
data

a  1    0.323156
   2   -0.615504
   3    0.450635
b  1    0.986506
   3    1.067334
c  1   -0.301883
   2   -1.621682
d  2   -0.690276
   3   -0.636359
dtype: float64

In [9]:
data.loc['b'][1]

0.9865058294047845

In [10]:
data.loc['b',1]

0.9865058294047845
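
Selection on the inner level alone is also possible. The following is a minimal sketch, assuming deterministic values (`range(9)`) in place of the random ones above so the result is reproducible; the name `inner_two` is only illustrative.

```python
import pandas as pd

# Deterministic stand-in for the random Series used in this section.
data = pd.Series(
    range(9),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# A full slice on the outer level plus a label on the inner level
# keeps every outer key that has an entry for inner label 2.
inner_two = data.loc[:, 2]
print(inner_two)
```

Key `b` has no entry for inner label 2, so it does not appear in the result.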

### data.unstack()

**We could rearrange the data into
a DataFrame using its unstack method.**

In [11]:
df = data.unstack()

**The inverse operation of unstack is stack**

In [12]:
df.stack()

a  1    0.323156
   2   -0.615504
   3    0.450635
b  1    0.986506
   3    1.067334
c  1   -0.301883
   2   -1.621682
d  2   -0.690276
   3   -0.636359
dtype: float64
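
Because some outer/inner combinations are absent from the Series, the unstacked frame contains missing values. The sketch below illustrates this under the assumption of deterministic float data; `wide` and `restored` are illustrative names. The explicit `dropna()` is there because whether `stack` itself discards missing values depends on the pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9, dtype=float),
    index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Inner level becomes the columns; absent pairs such as (b, 2) become NaN.
wide = data.unstack()
n_missing = int(wide.isna().sum().sum())

# Stacking and dropping the NaN cells recovers the original entries.
restored = wide.stack().dropna()
print(wide)
print(n_missing)
```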

**With a DataFrame, either axis can have a hierarchical index**

In [13]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                     columns=[['Ohio', 'Ohio', 'Colorado'],
                              ['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


**The hierarchical levels can have names (as strings or any Python objects). If so, these
will show up in the console output.**

In [14]:
frame.index.names = ['key1', 'key2']

In [15]:
frame.columns.names = ['state', 'color']

In [16]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [17]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


In [18]:
frame['Ohio','Green']

key1  key2
a     1       0
      2       3
b     1       6
      2       9
Name: (Ohio, Green), dtype: int32

In [19]:
frame.loc['a']

state  Ohio     Colorado
color Green Red    Green
key2
1         0   1        2
2         3   4        5


In [20]:
frame.loc['a',:]

state  Ohio     Colorado
color Green Red    Green
key2
1         0   1        2
2         3   4        5


In [21]:
frame.loc['a',1]

state     color
Ohio      Green    0
          Red      1
Colorado  Green    2
Name: (a, 1), dtype: int32

In [22]:
frame.loc['a',1]['Colorado','Green']

2
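
The chained lookup above can also be written as a single `.loc` call that takes one tuple per axis. This is a small sketch that rebuilds the same frame; `cell` and `other` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)

# First tuple addresses the row (key1, key2); second the column (state, color).
cell = frame.loc[('a', 1), ('Colorado', 'Green')]
other = frame.loc[('b', 2), ('Ohio', 'Red')]
print(cell, other)
```

Keeping both axes in one `.loc` call avoids creating an intermediate Series and makes it explicit which labels belong to rows and which to columns.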

**A MultiIndex can be created by itself and then reused; the columns in the preceding
DataFrame with level names could be created like this:**

In [23]:
col = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                 ['Green', 'Red', 'Green']],
                                names=['state', 'color'])

In [24]:
ind = pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b', 'c'],
                                 [1, 2, 3, 4, 5]],
                                names=['state', 'color'])

In [25]:
df = pd.DataFrame(np.arange(15).reshape((5,3)),columns=col,index=ind)
df

state        Ohio     Colorado
color       Green Red    Green
state color
a     1         0   1        2
      2         3   4        5
b     3         6   7        8
      4         9  10       11
c     5        12  13       14
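
`from_arrays` is one of several constructors. As a brief sketch, the same two-level index can also be built from a list of tuples or from the Cartesian product of the level values; the level names `key1` and `key2` are borrowed from the earlier example.

```python
import pandas as pd

names = ['key1', 'key2']

# One array per level, aligned position by position.
from_arrays = pd.MultiIndex.from_arrays(
    [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=names)

# One tuple per row.
from_tuples = pd.MultiIndex.from_tuples(
    [('a', 1), ('a', 2), ('b', 1), ('b', 2)], names=names)

# Every combination of the level values, outer level varying slowest.
from_product = pd.MultiIndex.from_product(
    [['a', 'b'], [1, 2]], names=names)

print(from_arrays.equals(from_tuples), from_arrays.equals(from_product))
```

`from_product` is convenient only when every combination is present; an index with gaps, like the Series at the start of this section, needs `from_arrays` or `from_tuples`.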


### Reordering and Sorting Levels

**The swaplevel method takes two level numbers or names
and returns a new object with the levels interchanged (the data is otherwise
unaltered):**

In [29]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [28]:
frame.swaplevel('key1','key2')

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


sort_index, on the other hand, reorders the rows; given a level argument, it sorts
primarily by the values in that level. When swapping levels, it’s not uncommon to also
use sort_index so that the result is lexicographically sorted by the indicated level:

In [30]:
frame.sort_index(level=1)

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [32]:
frame.swaplevel(0,1).sort_index(level=0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
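
One way to see why the sort matters is to check whether the index is in lexicographic order before and after. This is a sketch that rebuilds the frame from earlier in the section; `swapped` and `sorted_frame` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']

# Swapping only relabels the levels; the row order is unchanged,
# so the index is no longer in lexicographic order.
swapped = frame.swaplevel('key1', 'key2')
print(swapped.index.is_monotonic_increasing)

# Sorting restores lexicographic order on (key2, key1).
sorted_frame = swapped.sort_index()
print(sorted_frame.index.is_monotonic_increasing)
```

Label-based slicing on a hierarchical index generally expects a sorted index, which is one reason the two calls are so often chained.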


### Summary Statistics by Level

In [33]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


**Summing by the key2 level adds the rows that share a key2 value: row (a, 1) is added to row (b, 1), and row (a, 2) to row (b, 2).**

In [34]:
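# sum(level=...) was removed in pandas 2.0; groupby(level=...).sum() gives the same result
frame.groupby(level='key2').sum()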

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16


In [35]:
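# Column-level equivalent for pandas 2.x: transpose, group on the 'color' level, transpose back
frame.T.groupby(level='color').sum().T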

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
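
Other aggregations can be computed by level in the same way. The sketch below uses `groupby(level=...)`, which works across pandas versions, and rebuilds the frame from this section; `mean_by_key1` and `max_by_key2` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Average the two rows that share each key1 value.
mean_by_key1 = frame.groupby(level='key1').mean()

# Largest value per column among rows that share each key2 value.
max_by_key2 = frame.groupby(level='key2').max()

print(mean_by_key1)
print(max_by_key2)
```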


### Indexing with a DataFrame’s columns

**It’s not unusual to want to use one or more columns from a DataFrame as the row
index; alternatively, you may wish to move the row index into the DataFrame’s columns.**

In [36]:
frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                            'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

In [37]:
frame

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  2
6  6  1  two  3


#### frame.set_index

In [41]:
frame2 = frame.set_index(['c','d'])
frame2

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    2  5  2
    3  6  1


By default the columns are removed from the DataFrame, though you can leave them
in:

In [42]:
frame.set_index(['c', 'd'], drop=False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    2  5  2  two  2
    3  6  1  two  3


#### frame.reset_index

**reset_index, on the other hand, does the opposite of set_index; the hierarchical
index levels are moved into the columns:**

In [43]:
frame2.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  2  5  2
6  two  3  6  1
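
The two methods can be checked against each other with a round trip. This is a minimal sketch that rebuilds the example frame; `indexed` and `restored` are illustrative names. The former index levels come back as the leading columns, so the columns are reordered before comparing.

```python
import pandas as pd

frame = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                      'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'd': [0, 1, 2, 0, 1, 2, 3]})

# Move columns c and d into a two-level row index.
indexed = frame.set_index(['c', 'd'])

# Move the index levels back into columns; they are placed first (c, d, a, b).
flat = indexed.reset_index()

# Restore the original column order for comparison.
restored = flat[['a', 'b', 'c', 'd']]
print(list(flat.columns))
```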


## Combining and Merging Datasets