<br>

# Data Indexing, Selection, and Assignment

From the NumPy lecture, we already know about indexing, slicing, masking, and fancy indexing:

In [2]:
import numpy as np
import pandas as pd

In [3]:
a = np.arange(16).reshape(4,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [4]:
# Takes those values of the second and fourth column that are divisible by 3
a[:, [1, 3]][a[:, [1, 3]] % 3 == 0]

array([ 3,  9, 15])

Here we'll look at similar means of accessing and modifying values in Pandas Series and DataFrame objects. The corresponding patterns in Pandas closely mirror those of NumPy, though there are a few quirks to be aware of.

We'll start with the simple case of the one-dimensional Series object, and then move on to the slightly more complicated two-dimensional DataFrame object.

## Data Selection in Series

As we saw in the previous section, a Series object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary. If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

### Series as dictionary

Like a dictionary, the Series object provides a mapping from a collection of keys to a collection of values, so most dictionary operations (membership tests, `keys()`, `items()`, item assignment) work on a Series as well:

In [5]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [6]:
data.__contains__('b')

True

In [7]:
'b' in data

True

In [8]:
np.array_equal(data.keys(), data.index)

True

In [9]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [10]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [11]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
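
A few more dictionary habits carry over as well. The following is a minimal sketch, assuming a small Series named `prices` (an illustrative name, not part of the lecture data): `Series.get` returns a fallback for a missing label instead of raising, and `to_dict` converts back to a plain dictionary.

```python
import pandas as pd

prices = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# .get() returns a fallback for a missing label instead of raising KeyError
present = prices.get('b', default=-1.0)
missing = prices.get('z', default=-1.0)

# Converting back to a plain dictionary keeps the label -> value mapping
as_dict = prices.to_dict()
print(present, missing, as_dict)
```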

### Series as one-dimensional array

Series builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, slices, masking, and fancy indexing. Examples of these are as follows:

In [12]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [13]:
# slicing by implicit integer index
data[0:2] 
# Note that when slicing with an explicit index (i.e., data['a':'c']), the final index is included in the slice, 
# while when slicing with an implicit index (i.e., data[0:2]), the final index is excluded from the slice.

a    0.25
b    0.50
dtype: float64

In [14]:
(data > 0.3) & (data < 0.8)

a    False
b     True
c     True
d    False
e    False
dtype: bool

In [15]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [16]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

In [17]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[1, 2, 3, 4])
data

1    0.25
2    0.50
3    0.75
4    1.00
dtype: float64

In [18]:
data[1:3]

2    0.50
3    0.75
dtype: float64

<div class="alert alert-block alert-warning">
    <b> Important: </b>
    <br>
If your Series has an explicit integer index, an indexing operation such as data[1] will use the explicit indices, while a slicing operation like data[1:3] will use the implicit Python-style index.
</div>

In [19]:
data = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])
data

1    a
5    b
3    c
dtype: object

In [20]:
# explicit index when indexing
data[1]

'a'

In [21]:
# implicit index when slicing
data[1:3]

5    b
3    c
dtype: object

The **loc** attribute allows indexing and slicing that *always* references the explicit index:

In [22]:
data.loc[1]

'a'

In [23]:
data.loc[1:3]

1    a
5    b
3    c
dtype: object

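Note that `loc` may or may not raise a `KeyError` when slicing with labels that are missing from the index. With an unsorted index, a missing slice bound raises; with a sorted index, the bound is simply treated as a limit: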

In [24]:
data = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])
data

1    a
5    b
3    c
dtype: object

In [25]:
try:
    
    data.loc['a':'z']
    
except KeyError:
    print(KeyError)

<class 'KeyError'>


In [26]:
try:
    
    data.loc[3:10]
    
except KeyError:
    print(KeyError)

<class 'KeyError'>


In [27]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data.loc[3:10]

3    b
5    c
dtype: object

In [28]:
data.loc['a':'z']

Series([], dtype: object)
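
The difference between the two cases comes down to whether the index is sorted. As a rough sketch (the names `unsorted`, `raised` and `sorted_data` are made up for this example), you can check `is_monotonic_increasing` before slicing and sort the index if needed:

```python
import pandas as pd

unsorted = pd.Series(['a', 'b', 'c'], index=[1, 5, 3])

# A label slice with a missing end point needs a sorted (monotonic) index
print(unsorted.index.is_monotonic_increasing)

try:
    unsorted.loc[3:10]
    raised = False
except KeyError:
    raised = True

# After sorting, the missing end point is simply treated as an upper bound
sorted_data = unsorted.sort_index()
result = sorted_data.loc[3:10]
print(raised)
print(result)
```

Note that `sort_index` reorders the values together with their labels, so the result contains the entries labelled 3 and 5.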

The **iloc** attribute allows indexing and slicing that always references the implicit Python-style index:

In [29]:
data

1    a
3    b
5    c
dtype: object

In [30]:
data.iloc[1]

'b'

In [31]:
data.iloc[1:3]

3    b
5    c
dtype: object

Save yourself the pain and always be explicit about what you mean: use ``.loc`` for label-based access and ``.iloc`` for position-based access.

In [32]:
%%bash
python -c "import this" | grep "Explicit"
#not saying that explicit indices are better than implicit ones, but that you should be explicit about what you're doing.

Explicit is better than implicit.


<br>

## Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

In [55]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop, 'Country':'USA'})
data

Unnamed: 0,area,pop,Country
California,423967,38332521,USA
Texas,695662,26448193,USA
New York,141297,19651127,USA
Florida,170312,19552860,USA
Illinois,149995,12882135,USA


<div class="alert alert-block alert-warning">
    <b> Important: </b>
    <br>
    Plain <code>[]</code> indexing on a DataFrame selects a <b>column</b>, not a row!
    </div>

In [34]:
# Dictionary-style indexing results in a Series....
print(type(data["area"]))
data["area"]

<class 'pandas.core.series.Series'>


California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [35]:
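# Attribute-style access also works, but only if the column name does not clash with an
# existing DataFrame attribute or method (data.pop, for example, is the pop() method, not the 'pop' column)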
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64
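
To make the pitfall of attribute-style access concrete, here is a small sketch on a two-row frame (the name `frame` and the shortened data are assumptions for illustration). The `area` column is reachable as an attribute, while `pop` resolves to the `DataFrame.pop` method:

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 695662], 'pop': [38332521, 26448193]},
                     index=['California', 'Texas'])

# 'area' is not a DataFrame attribute, so attribute access returns the column
same_column = frame.area.equals(frame['area'])

# 'pop' clashes with the DataFrame.pop() method, so frame.pop is the method
pop_is_method = callable(frame.pop)

# Bracket access always returns the column, whatever its name
pop_column = frame['pop']
print(same_column, pop_is_method, type(pop_column))
```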

In [36]:
data.values

array([[423967, 38332521, 'USA'],
       [695662, 26448193, 'USA'],
       [141297, 19651127, 'USA'],
       [170312, 19552860, 'USA'],
       [149995, 12882135, 'USA']], dtype=object)

With `.T` we will get the transpose of the DataFrame.

In [38]:
data.T

Unnamed: 0,California,Texas,New York,Florida,Illinois
area,423967,695662,141297,170312,149995
pop,38332521,26448193,19651127,19552860,12882135
Country,USA,USA,USA,USA,USA


For array-style indexing, Pandas again uses the loc and iloc indexers mentioned earlier. Using the iloc indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), **but the DataFrame index and column labels are maintained in the result**:

In [39]:
#indexing the underlying numpy-array...
data.values[:3, :2]

array([[423967, 38332521],
       [695662, 26448193],
       [141297, 19651127]], dtype=object)

In [40]:
data.iloc[:3, :2]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127


In [41]:
data

Unnamed: 0,area,pop,Country
California,423967,38332521,USA
Texas,695662,26448193,USA
New York,141297,19651127,USA
Florida,170312,19552860,USA
Illinois,149995,12882135,USA


In [42]:
data.loc[:'Illinois', :'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


In [43]:
data.loc[:,['area','pop']]

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


So, this is how we get a row!

In [44]:
data.loc["California", :]

area         423967
pop        38332521
Country         USA
Name: California, dtype: object

In [45]:
# adding a new column.. (vectorized calculations!)
data['density'] = data['pop'] / data['area']
# we can combine masking with fancy indexing
data.loc[data.density > 100, ['pop', 'density']]

Unnamed: 0,pop,density
New York,19651127,139.076746
Florida,19552860,114.806121


If you want to combine explicit and implicit indexing, you have to chain them:

In [46]:
data

Unnamed: 0,area,pop,Country,density
California,423967,38332521,USA,90.413926
Texas,695662,26448193,USA,38.01874
New York,141297,19651127,USA,139.076746
Florida,170312,19552860,USA,114.806121
Illinois,149995,12882135,USA,85.883763


In [47]:
data.iloc[1:4].loc[:, ['pop', 'density']]

Unnamed: 0,pop,density
Texas,26448193,38.01874
New York,19651127,139.076746
Florida,19552860,114.806121


**While indexing refers to columns, slicing refers to rows:**

In [48]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [49]:
data['Florida':'Illinois']

Unnamed: 0,area,pop,Country,density
Florida,170312,19552860,USA,114.806121
Illinois,149995,12882135,USA,85.883763


Again, be explicit about your indexing to save yourself a lot of confusion.

In [50]:
try:
    
    data['area':'pop']

except KeyError:
    print(KeyError)

<class 'KeyError'>


In [51]:
data.loc[:, 'area':'pop']

Unnamed: 0,area,pop
California,423967,38332521
Texas,695662,26448193
New York,141297,19651127
Florida,170312,19552860
Illinois,149995,12882135


For fast access to a single element, use **at**:

In [52]:
%%timeit
data.loc['Florida', 'pop']

8.51 µs ± 526 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)


In [72]:
%%timeit
data.at['Florida', 'pop']

2.14 µs ± 13.7 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
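
For completeness, `at` has a position-based counterpart, `iat`, and both can also set a single value. A minimal sketch on a reduced two-row frame (the name `frame` and the new population value are made up for illustration):

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 170312], 'pop': [38332521, 19552860]},
                     index=['California', 'Florida'])

# .at takes exactly one row label and one column label
by_label = frame.at['Florida', 'pop']

# .iat is the position-based counterpart (row 1, column 1)
by_position = frame.iat[1, 1]

# Both accessors can also be used to set a single value
frame.at['Florida', 'pop'] = 20000000
print(by_label, by_position, frame.at['Florida', 'pop'])
```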


In [53]:
data.loc["Texas", ["area", "pop"]]

area      695662
pop     26448193
Name: Texas, dtype: object

In [54]:
data["area"]["Texas":"Florida"]

Texas       695662
New York    141297
Florida     170312
Name: area, dtype: int64

| Access                 | Selects                | Example                                  | Slice end                |
|------------------------|------------------------|------------------------------------------|--------------------------|
| **direct []-access**   |                        |                                          |                          |
| One argument, single   | Column                 | data['area']                             |                          |
| One argument, slice    | Rows                   | data['Florida':'Illinois']               | included                 |
| Two arguments          | only with a MultiIndex | -                                        |                          |
| **.loc[]**             |                        |                                          |                          |
| One argument, single   | Row                    | data.loc['Florida']                      |                          |
| One argument, slice    | Rows                   | data.loc['Florida':'Illinois']           | included                 |
| Two arguments          | Rows, column           | data.loc['Florida':'Illinois', 'area']   | included                 |
| **.iloc[]**            |                        |                                          |                          |
| One argument, single   | Row                    | data.iloc[0]                             |                          |
| One argument, slice    | Rows                   | data.iloc[0:2]                           | excluded                 |
| Two arguments          | Rows, columns          | data.iloc[0:2, 0:3]                      | excluded                 |
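
The rows of this table can be checked directly. The following sketch rebuilds a reduced version of the states frame (only the `area` and `pop` columns, under the illustrative name `frame`) and prints what each access pattern returns:

```python
import pandas as pd

frame = pd.DataFrame(
    {'area': [423967, 695662, 141297, 170312, 149995],
     'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
    index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])

column = frame['area']                               # [] with one label: a column
rows = frame['Florida':'Illinois']                   # [] with a slice: rows, end included
row = frame.loc['Florida']                           # .loc with one label: a row
loc_cell = frame.loc['Florida':'Illinois', 'area']   # .loc rows and column, end included
iloc_rows = frame.iloc[0:2]                          # .iloc slice: rows, end excluded
iloc_block = frame.iloc[0:2, 0:2]                    # .iloc rows and columns, end excluded

print(column.shape, rows.shape, row.shape,
      loc_cell.shape, iloc_rows.shape, iloc_block.shape)
```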

[Here you can find a small exercise](optional_exercises.ipynb#exe03a)
<img src="images/optional_exercises1.png" width="50" style="float: right;"/>

### Reindexing

In [56]:
index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
                  'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
                  index=index)
df

Unnamed: 0,http_status,response_time
Firefox,200,0.04
Chrome,200,0.02
Safari,404,0.07
IE10,404,0.08
Konqueror,301,1.0


In [57]:
new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
df.reindex(new_index)

Unnamed: 0,http_status,response_time
Safari,404.0,0.07
Iceweasel,,
Comodo Dragon,,
IE10,404.0,0.08
Chrome,200.0,0.02


In [58]:
df.reindex(columns=['http_status', 'user_agent'])

Unnamed: 0,http_status,user_agent
Firefox,200,
Chrome,200,
Safari,404,
IE10,404,
Konqueror,301,
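
By default, `reindex` fills labels that did not exist before with `NaN`, which is also why `http_status` is shown as a float in the first reindexed output. A small sketch, assuming a reduced frame named `status`, shows the `fill_value` parameter as one way to supply a different placeholder:

```python
import pandas as pd

status = pd.DataFrame({'http_status': [200, 200, 404]},
                      index=['Firefox', 'Chrome', 'Safari'])

# Without fill_value, the missing label gets NaN and the column becomes float
with_nan = status.reindex(['Safari', 'Iceweasel', 'Chrome'])

# With fill_value, missing labels get that value and the integer dtype is kept
with_zero = status.reindex(['Safari', 'Iceweasel', 'Chrome'], fill_value=0)

print(with_nan['http_status'].dtype, with_zero['http_status'].dtype)
```

Whether `0` is a sensible placeholder depends on the data; for a status code it is only used here to show the mechanics.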


**Renaming columns** 

In [60]:
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 5, 7, 8]})
df = df.rename(mapper={'b': 'c'}, axis='columns')
df

Unnamed: 0,a,c
0,1,2
1,2,5
2,3,7
3,4,8
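
`rename` accepts a few equivalent spellings. The sketch below (with the illustrative name `df_example`) uses the `columns=` keyword, which is shorthand for `mapper=..., axis='columns'`, and passes a function instead of a dictionary:

```python
import pandas as pd

df_example = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# columns= is shorthand for mapper=..., axis='columns'
renamed = df_example.rename(columns={'b': 'c'})

# A function is applied to every column label
upper = df_example.rename(columns=str.upper)

# rename returns a new object; the original labels are unchanged
print(list(renamed.columns), list(upper.columns), list(df_example.columns))
```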


[Here you can find a small exercise](optional_exercises.ipynb#exe03b)
<img src="images/optional_exercises1.png" width="50" style="float: right;"/>

### Boolean Indexing

In [61]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))
df['E'] = ["one", "two", "three"] * 2
df

Unnamed: 0,A,B,C,D,E
2013-01-01,0.598683,-0.757449,-0.879191,1.048421,one
2013-01-02,0.906822,0.237712,-0.665137,1.166082,two
2013-01-03,0.23609,0.044482,-0.50257,1.583069,three
2013-01-04,0.971129,-1.311595,0.056273,-0.417808,one
2013-01-05,-1.675361,2.326398,-0.223658,-1.385233,two
2013-01-06,-1.834936,0.234453,-0.146686,0.870337,three


In [62]:
df['A'] > 0.5

2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04     True
2013-01-05    False
2013-01-06    False
Freq: D, Name: A, dtype: bool

In [63]:
df[df['A'] > 0.5]

Unnamed: 0,A,B,C,D,E
2013-01-01,0.598683,-0.757449,-0.879191,1.048421,one
2013-01-02,0.906822,0.237712,-0.665137,1.166082,two
2013-01-04,0.971129,-1.311595,0.056273,-0.417808,one


In [64]:
#alternate syntax: `query` - see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html
df.query('A > 0.5')

Unnamed: 0,A,B,C,D,E
2013-01-01,0.598683,-0.757449,-0.879191,1.048421,one
2013-01-02,0.906822,0.237712,-0.665137,1.166082,two
2013-01-04,0.971129,-1.311595,0.056273,-0.417808,one
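
`query` can also combine conditions and refer to local Python variables with the `@` prefix. A minimal sketch with fixed numbers instead of random ones (the names `frame` and `threshold` are assumptions for this example):

```python
import pandas as pd

frame = pd.DataFrame({'A': [0.6, 0.9, 0.2, -1.7],
                      'B': [-0.8, 0.2, 0.0, 2.3]})

threshold = 0.5  # local variable, referenced as @threshold inside the query string

# Boolean masks need parentheses and the element-wise & operator
masked = frame[(frame['A'] > threshold) & (frame['B'] < 0.5)]

# The same selection written as a query expression
queried = frame.query('A > @threshold and B < 0.5')

print(queried.equals(masked))
```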


In [65]:
df['E'].isin(['one','two'])

2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04     True
2013-01-05     True
2013-01-06    False
Freq: D, Name: E, dtype: bool

In [66]:
df[df['E'].isin(['one','two'])] = np.nan
df

Unnamed: 0,A,B,C,D,E
2013-01-01,,,,,
2013-01-02,,,,,
2013-01-03,0.23609,0.044482,-0.50257,1.583069,three
2013-01-04,,,,,
2013-01-05,,,,,
2013-01-06,-1.834936,0.234453,-0.146686,0.870337,three


In [67]:
pd.isna(df)

Unnamed: 0,A,B,C,D,E
2013-01-01,True,True,True,True,True
2013-01-02,True,True,True,True,True
2013-01-03,False,False,False,False,False
2013-01-04,True,True,True,True,True
2013-01-05,True,True,True,True,True
2013-01-06,False,False,False,False,False


In [68]:
pd.isna(df).any(axis=1)

2013-01-01     True
2013-01-02     True
2013-01-03    False
2013-01-04     True
2013-01-05     True
2013-01-06    False
Freq: D, dtype: bool

In [69]:
df[~df.isna().any(axis=1)]

Unnamed: 0,A,B,C,D,E
2013-01-03,0.23609,0.044482,-0.50257,1.583069,three
2013-01-06,-1.834936,0.234453,-0.146686,0.870337,three


In [71]:
df.dropna(how="any")

Unnamed: 0,A,B,C,D,E
2013-01-03,0.23609,0.044482,-0.50257,1.583069,three
2013-01-06,-1.834936,0.234453,-0.146686,0.870337,three
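
`dropna` has a few options beyond `how="any"`. The sketch below uses a tiny hand-made frame (the name `frame` and its values are illustrative) to contrast `how='any'`, `how='all'` and the `subset` parameter:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'A': [1.0, np.nan, np.nan],
                      'B': [2.0, 5.0, np.nan]})

any_missing = frame.dropna(how='any')   # drop rows with at least one NaN
all_missing = frame.dropna(how='all')   # drop rows where every value is NaN
only_b = frame.dropna(subset=['B'])     # only look at column B when deciding

print(list(any_missing.index), list(all_missing.index), list(only_b.index))
```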


[Here you can find a small exercise](optional_exercises.ipynb#exe03c)
<img src="images/optional_exercises1.png" width="50" style="float: right;"/>

<br>

## Data assignment

In [1]:
df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=['Portland', 'Berkeley'])
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,25.0


### Assigning columns

In [73]:
df.loc[:,'country'] = 'USA'
df

Unnamed: 0,temp_c,country
Portland,17.0,USA
Berkeley,25.0,USA


In [74]:
df['temp_c'] <= 18

Portland     True
Berkeley    False
Name: temp_c, dtype: bool

In [75]:
df['too_cold'] = df['temp_c'] <= 18
df

Unnamed: 0,temp_c,country,too_cold
Portland,17.0,USA,True
Berkeley,25.0,USA,False


These assignments, however, modify `df` in place. To get a new DataFrame with the extra column and leave the original untouched, use `assign`:

In [76]:
df = pd.DataFrame({'temp_c': [17.0, 25.0]},
                  index=['Portland', 'Berkeley'])
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,25.0


In [77]:
df2 = df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
df2

Unnamed: 0,temp_c,temp_f
Portland,17.0,62.6
Berkeley,25.0,77.0


In [80]:
#vectorized version:
df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)

Unnamed: 0,temp_c,temp_f
Portland,17.0,62.6
Berkeley,25.0,77.0


In [81]:
#multiple assignments simultaneously (later entries may use columns created by earlier ones):
df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
          temp_k=lambda x: (x['temp_f'] +  459.67) * 5 / 9)

Unnamed: 0,temp_c,temp_f,temp_k
Portland,17.0,62.6,290.15
Berkeley,25.0,77.0,298.15


In [82]:
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,25.0


### Row assignment

In [83]:
df.loc['Berkeley', 'temp_c'] = 26.0
df

Unnamed: 0,temp_c
Portland,17.0
Berkeley,26.0


In [84]:
type(df.loc['Portland'])

pandas.core.series.Series

In [85]:
df.loc['Portland'] = pd.Series({'temp_c': 99})
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0


In [86]:
df.loc['Osnabruck', 'temp_c'] = 18
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,18.0


In [87]:
df = pd.concat([df, df])
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,18.0
Portland,99.0
Berkeley,26.0
Osnabruck,18.0


In [88]:
df.loc['Osnabruck', 'temp_c'] = 25
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,25.0
Portland,99.0
Berkeley,26.0
Osnabruck,25.0


In [89]:
df.loc['Osnabruck'] = pd.Series({'temp_c': 99})
df

ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (2,1)

In [90]:
df.loc['Osnabruck']

Unnamed: 0,temp_c
Osnabruck,25.0
Osnabruck,25.0


In [91]:
np.where(df.index == 'Osnabruck')

(array([2, 5]),)

In [92]:
df.iloc[np.where(df.index == 'Osnabruck')[0][0]]

temp_c    25.0
Name: Osnabruck, dtype: float64

In [93]:
df.iloc[np.where(df.index == 'Osnabruck')[0][0]] = pd.Series({'temp_c': 100})
df

Unnamed: 0,temp_c
Portland,99.0
Berkeley,26.0
Osnabruck,100.0
Portland,99.0
Berkeley,26.0
Osnabruck,25.0
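
The trouble in the last few cells comes from duplicate index labels. As a hedged sketch (rebuilding the frame under the illustrative name `temps`), you can detect duplicates with `Index.is_unique`, address all rows sharing a label with a boolean mask, or keep only the first occurrence via `Index.duplicated`:

```python
import pandas as pd

temps = pd.DataFrame({'temp_c': [99.0, 26.0, 18.0, 99.0, 26.0, 18.0]},
                     index=['Portland', 'Berkeley', 'Osnabruck'] * 2)

# Check for duplicate labels before relying on label-based row assignment
print(temps.index.is_unique)

# A boolean mask on the index addresses every row with that label at once
temps.loc[temps.index == 'Osnabruck', 'temp_c'] = 25.0

# keep='first' marks all but the first occurrence of a label as duplicates
deduplicated = temps[~temps.index.duplicated(keep='first')]
print(deduplicated)
```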


[Here you can find a small exercise](optional_exercises.ipynb#exe03d)
<img src="images/optional_exercises1.png" width="50" style="float: right;"/>

<br>

## Pandas indexing

### Multi-Indexing

Older versions of Pandas provided objects that natively handled three-dimensional and four-dimensional data, but these have since been removed from the library. A far more common pattern in practice is to make use of `hierarchical indexing` (also known as `multi-indexing`) to incorporate multiple index levels within a single index. In this way, higher-dimensional data can be compactly represented within the familiar one-dimensional Series and two-dimensional DataFrame objects.

In [94]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]
pop = pd.Series(populations, index=index)
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [95]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex([('California', 2000),
            ('California', 2010),
            (  'New York', 2000),
            (  'New York', 2010),
            (     'Texas', 2000),
            (     'Texas', 2010)],
           )

In [96]:
index.names = ['state', 'year']

In [97]:
pop = pop.reindex(index)
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [98]:
pop['California', 2000], pop['California', 2010]

(33871648, 37253956)

In [99]:
pop.iloc[0]

33871648

In [100]:
pop.iloc[1]

37253956
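
Besides full `(state, year)` keys, a MultiIndex also supports partial selection. The following sketch rebuilds the population Series under the illustrative name `populations` and selects on one level at a time, using `.loc` and `xs` (short for cross-section):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [('California', 2000), ('California', 2010),
     ('New York', 2000), ('New York', 2010),
     ('Texas', 2000), ('Texas', 2010)],
    names=['state', 'year'])
populations = pd.Series([33871648, 37253956, 18976457,
                         19378102, 20851820, 25145561], index=index)

# Selecting on the outer level only returns a Series indexed by year
california = populations.loc['California']

# An empty slice on the outer level selects one year across all states
year_2010 = populations.loc[:, 2010]

# xs makes the same selection by naming the level explicitly
year_2010_xs = populations.xs(2010, level='year')

print(california)
print(year_2010)
```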

### MultiIndex as extra dimension: stack() and unstack()

You might notice something else here: we could easily have stored the same data using a simple ``DataFrame`` with index and column labels.
In fact, Pandas is built with this equivalence in mind. The ``unstack()`` method will quickly convert a multiply indexed ``Series`` into a conventionally indexed ``DataFrame``:

In [101]:
pop

state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [102]:
pop.unstack()

year,2000,2010
state,Unnamed: 1_level_1,Unnamed: 2_level_1
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [103]:
index.names = [None, None]
pop = pop.reindex(index)
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [104]:
pop.unstack()

Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [105]:
pop.unstack().T

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [106]:
popdf = pop.unstack(level=0)
popdf

Unnamed: 0,California,New York,Texas
2000,33871648,18976457,20851820
2010,37253956,19378102,25145561


In [107]:
popdf.stack()

2000  California    33871648
      New York      18976457
      Texas         20851820
2010  California    37253956
      New York      19378102
      Texas         25145561
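
A compact way to see that `stack()` and `unstack()` are inverses is a round trip. This is a sketch on a reduced two-state Series (the names `pop_small`, `wide` and `round_trip` are assumptions); it also shows that `level` accepts a level name as well as a number:

```python
import pandas as pd

index = pd.MultiIndex.from_product([['California', 'Texas'], [2000, 2010]],
                                   names=['state', 'year'])
pop_small = pd.Series([33871648, 37253956, 20851820, 25145561], index=index)

# By default the innermost level ('year') becomes the columns
wide = pop_small.unstack()

# Naming a level moves that level into the columns instead
wide_by_state = pop_small.unstack(level='state')

# stack() moves the columns back into the innermost index level
round_trip = wide.stack()

print(wide)
print(round_trip)
```

Depending on the pandas version, `stack()` may emit a `FutureWarning` about its implementation; the result for this example is the same.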
dtype: int64

### Index setting and resetting

Another way to rearrange hierarchical data is to turn the index labels into columns; this can be accomplished with the ``reset_index`` method.
Calling this on the population Series will result in a ``DataFrame`` with a *state* and *year* column holding the information that was formerly in the index.
For clarity, we can optionally specify the name of the data for the column representation:

In [108]:
pop

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [109]:
pop.index.names = ['state', 'year']
print(type(pop))
pop

<class 'pandas.core.series.Series'>


state       year
California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64

In [110]:
pop_flat = pop.reset_index(name='population')
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


Often when working with data in the real world, the raw input data looks like this and it's useful to build a ``MultiIndex`` from the column values.
This can be done with the ``set_index`` method of the ``DataFrame``, which returns a multiply indexed ``DataFrame``:

In [111]:
pop_df = pop_flat.set_index(['state', 'year'])
pop_df

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [112]:
pop_df.rename_axis([None, None])

Unnamed: 0,Unnamed: 1,population
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [113]:
asdf = pop_df.rename_axis([None, None]).unstack()
asdf

Unnamed: 0_level_0,population,population
Unnamed: 0_level_1,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [114]:
asdf.columns

MultiIndex([('population', 2000),
            ('population', 2010)],
           )

In [115]:
asdf["area"] = 999
asdf

Unnamed: 0_level_0,population,population,area
Unnamed: 0_level_1,2000,2010,Unnamed: 3_level_1
California,33871648,37253956,999
New York,18976457,19378102,999
Texas,20851820,25145561,999


In [116]:
asdf.columns

MultiIndex([('population', 2000),
            ('population', 2010),
            (      'area',   '')],
           )

In [131]:
print(type(asdf["area"]))
asdf["area"]

<class 'pandas.core.series.Series'>


California    999
New York      999
Texas         999
Name: area, dtype: int64

In [132]:
print(type(asdf["population"]))
asdf["population"]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,2000,2010
California,33871648,37253956
New York,18976457,19378102
Texas,20851820,25145561


In [133]:
pop_flat

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561


In [134]:
pop_df2 = pop_flat.set_index('state').rename_axis(None)
pop_df2

Unnamed: 0,year,population
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [135]:
pop_df

Unnamed: 0_level_0,Unnamed: 1_level_0,population
state,year,Unnamed: 2_level_1
California,2000,33871648
California,2010,37253956
New York,2000,18976457
New York,2010,19378102
Texas,2000,20851820
Texas,2010,25145561


In [136]:
pop_df.reset_index()

Unnamed: 0,state,year,population
0,California,2000,33871648
1,California,2010,37253956
2,New York,2000,18976457
3,New York,2010,19378102
4,Texas,2000,20851820
5,Texas,2010,25145561
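
To summarise, `set_index` and `reset_index` are inverse operations. The sketch below uses a reduced flat table (the names `flat`, `indexed` and `restored` are illustrative) and also shows the `drop=True` option, which discards the index instead of turning it into columns:

```python
import pandas as pd

flat = pd.DataFrame({'state': ['California', 'California', 'Texas', 'Texas'],
                     'year': [2000, 2010, 2000, 2010],
                     'population': [33871648, 37253956, 20851820, 25145561]})

# Build a MultiIndex from two columns
indexed = flat.set_index(['state', 'year'])

# With the MultiIndex in place, lookups work on (state, year) pairs
texas_2010 = indexed.loc[('Texas', 2010), 'population']

# reset_index() turns the index levels back into columns
restored = indexed.reset_index()

# drop=True discards the index levels instead of keeping them as columns
values_only = indexed.reset_index(drop=True)

print(texas_2010, restored.equals(flat), list(values_only.columns))
```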
