*This notebook contains an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

# Data Indexing and Selection

## Data Selection in Series

As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

In [15]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [20]:
data[-2:]

c    0.75
d    1.00
dtype: float64

In [18]:
data['b':'c']

b    0.50
c    0.75
dtype: float64

In [8]:
'a' in data.index

True

In [4]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

In [5]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [6]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [7]:
data['a'] = -0.25
data

a   -0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

In [21]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [22]:
data['a']  # explicit index

0.25

In [23]:
data[0]  # implicit (positional) index; deprecated in recent pandas, where data.iloc[0] is preferred

0.25

In [24]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [25]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [26]:
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [27]:
mask = pd.Series([False, True, True, False],
                 index=['a', 'b', 'c', 'd'])
mask

a    False
b     True
c     True
d    False
dtype: bool

In [28]:
data[mask]

b    0.50
c    0.75
dtype: float64

In [30]:
data > 0.3

a    False
b     True
c     True
d     True
dtype: bool

In [31]:
mask1 = data > 0.3
data[mask1]

b    0.50
c    0.75
d    1.00
dtype: float64

In [32]:
data[data > 0.3]

b    0.50
c    0.75
d    1.00
dtype: float64

In [33]:
mask2 = data < 0.8
data[mask2]

a    0.25
b    0.50
c    0.75
dtype: float64

In [34]:
mask3 = mask1 & mask2
data[mask3]

b    0.50
c    0.75
dtype: float64

In [35]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [26]:
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [43]:
# fancy indexing
data[['d', 'a', 'a', 'd']]

d    1.00
a    0.25
a    0.25
d    1.00
dtype: float64

Among these, slicing may be the source of the most confusion.
Notice that when slicing with an explicit index (e.g., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (e.g., ``data[0:2]``), the final index is *excluded* from the slice.

In [45]:
# a plain Python list slice also excludes the final index
l = [1, 2, 3, 4]
l[0:2]

[1, 2]

Selection based on the index:

- indexing
- slicing
- fancy indexing

Selection based on values:

- masking

### Indexers: loc, iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [47]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [48]:
data[1]

'a'

In [50]:
data.loc[1]

'a'

In [51]:
data.iloc[1]

'b'

In [49]:
data[1]  # with plain [], pandas chooses the scheme; here it uses the explicit label

'a'

In [53]:
# explicit index when indexing
data[1]

'a'

In [54]:
data

1    a
3    b
5    c
dtype: object

In [55]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

In [56]:
data.iloc[1:3]

3    b
5    c
dtype: object

In [57]:
data.loc[1:3]

1    a
3    b
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [14]:
data.loc[1]

'a'

In [15]:
data.loc[1:3]

1    a
3    b
dtype: object

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [16]:
data.iloc[1]

'b'

In [17]:
data.iloc[1:3]

3    b
5    c
dtype: object
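The two indexers can be compared side by side. The following is a minimal sketch, assuming a reasonably recent pandas release; the name `ser` is introduced here only for illustration and holds the same values as the integer-labelled `data` above.

```python
import pandas as pd

# Same integer-labelled Series as above; `ser` is an illustrative name.
ser = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc uses the explicit labels; a label slice includes its end point.
by_label = ser.loc[1:3]
print(by_label.tolist())        # ['a', 'b']

# iloc uses implicit positions; a positional slice excludes its end point.
by_position = ser.iloc[1:3]
print(by_position.tolist())     # ['b', 'c']

# Fancy indexing with a list of labels also goes through loc.
picked = ser.loc[[5, 1]]
print(picked.tolist())          # ['c', 'a']
```

Spelling out `loc` or `iloc` in this way removes the ambiguity of plain `[]` on an integer index, at the cost of a few extra characters.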

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.
These analogies can be helpful to keep in mind as we explore data selection within this structure.
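Both analogies can be checked directly. This is a small sketch, not part of the original notebook; the frame `df` and its three rows are illustrative and reuse the figures from the example that follows.

```python
import pandas as pd

# Illustrative frame; values match the states example in this section.
df = pd.DataFrame({'area': [423967, 695662, 141297],
                   'pop': [38332521, 26448193, 19651127]},
                  index=['California', 'Texas', 'New York'])

# Dictionary-like: column names act as keys that map to Series.
keys = list(df.keys())
print(keys)                        # ['area', 'pop']
print('area' in df)                # True
print(type(df['area']).__name__)   # Series

# Array-like: the values form a two-dimensional NumPy array.
arr = df.to_numpy()
print(arr.shape)                   # (3, 2)
print(arr[0].tolist())             # [423967, 38332521]
```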

In [58]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [60]:
data['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [61]:
data.area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [63]:
data['pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64
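Attribute-style access such as ``data.area`` is convenient, but it only works when the column name does not collide with an existing ``DataFrame`` attribute. The sketch below, which uses a small illustrative frame named `df`, shows the collision for the ``pop`` column, since ``DataFrame.pop`` is a method.

```python
import pandas as pd

# Illustrative two-row frame with the same column names as above.
df = pd.DataFrame({'area': [423967, 695662],
                   'pop': [38332521, 26448193]},
                  index=['California', 'Texas'])

# 'area' does not clash with any DataFrame attribute, so both forms agree.
same_area = df.area.equals(df['area'])
print(same_area)                          # True

# 'pop' is also the name of the DataFrame.pop method, so attribute access
# returns the bound method rather than the column.
print(callable(df.pop))                   # True
print(isinstance(df['pop'], pd.Series))   # True
```

For this reason, dictionary-style access with ``data['pop']`` is the safer choice whenever a column name might shadow a method.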

In [66]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [67]:
data.T

         California       Texas    New York     Florida    Illinois
area       423967.0    695662.0    141297.0    170312.0    149995.0
pop      38332520.0  26448190.0  19651130.0  19552860.0  12882140.0
density    90.41393    38.01874    139.0767    114.8061    85.88376


In [68]:
data['area']  # select one label on the columns axis

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [69]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [74]:
data.loc[['California', 'Illinois'], 'area']

California    423967
Illinois      149995
Name: area, dtype: int64

In [55]:
data.loc[:, 'area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [75]:
data.loc['Texas']

area       6.956620e+05
pop        2.644819e+07
density    3.801874e+01
Name: Texas, dtype: float64

In [77]:
data.loc['Texas':'Illinois', :]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


In [ ]:
data['area'] < 200000

California    False
Texas         False
New York       True
Florida        True
Illinois       True
Name: area, dtype: bool

In [83]:
data.loc[data['area'] < 200000, 'pop']

New York    19651127
Florida     19552860
Illinois    12882135
Name: pop, dtype: int64

In [58]:
# general form: data.loc[rows, columns]
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [84]:
data.iloc[0:3, :2]

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [85]:
data.iloc[0, [1,2]]

pop        3.833252e+07
density    9.041393e+01
Name: California, dtype: float64

In [87]:
data.iloc[::2, ::2]

              area     density
California  423967   90.413926
New York    141297  139.076746
Illinois    149995   85.883763


Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [60]:
data.loc[:'New York', :'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127


In [66]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [63]:
#data['area']

In [64]:
#data[:'New York']

In [65]:
# data.loc[ rows , cols ]

In [67]:
data.loc['New York', 'pop']

19651127

In [68]:
data.loc['New York', :]

area       1.412970e+05
pop        1.965113e+07
density    1.390767e+02
Name: New York, dtype: float64

In [69]:
data.loc[:, 'pop']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: pop, dtype: int64

In [70]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [74]:
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
New York  19651127  139.076746
Florida   19552860  114.806121


In [32]:
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740


### Additional indexing conventions

There are a couple of extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:

In [88]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763


In [92]:
data.loc[:, 'area':'pop']

              area       pop
California  423967  38332521
Texas       695662  26448193
New York    141297  19651127
Florida     170312  19552860
Illinois    149995  12882135


In [94]:
# data[0] would raise a KeyError here, because plain indexing looks up a column label

Such slices can also refer to rows by number rather than by index:

In [76]:
data[1:3]

            area       pop     density
Texas     695662  26448193   38.018740
New York  141297  19651127  139.076746


Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [77]:
data[data.density > 100]

            area       pop     density
New York  141297  19651127  139.076746
Florida   170312  19552860  114.806121
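These conventions for plain ``[]`` on a ``DataFrame`` can be collected in one place. The following is a sketch only, assuming a pandas version in which slices and Boolean masks passed to ``[]`` select rows as described above; the name `df` is illustrative and rebuilds the same states table.

```python
import pandas as pd

# Illustrative rebuild of the states table used in this section.
df = pd.DataFrame({'area': [423967, 695662, 141297, 170312, 149995],
                   'pop': [38332521, 26448193, 19651127, 19552860, 12882135]},
                  index=['California', 'Texas', 'New York', 'Florida', 'Illinois'])
df['density'] = df['pop'] / df['area']

# A single key selects a column.
col = df['area']
# A slice selects rows: by label (end included) or by position (end excluded).
rows_by_label = df['Texas':'Florida']
rows_by_position = df[1:3]
# A Boolean mask also filters rows.
dense = df[df['density'] > 100]

print(col.name)                       # area
print(list(rows_by_label.index))      # ['Texas', 'New York', 'Florida']
print(list(rows_by_position.index))   # ['Texas', 'New York']
print(list(dense.index))              # ['New York', 'Florida']
```

When the intent matters more than brevity, the same selections can be written with ``loc`` and ``iloc``, which state explicitly whether labels or positions are meant.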


In [78]:
# data.loc[r, c]

In [None]:
# data.iloc

In [98]:
data.loc['Texas':'Florida'].iloc[:, -1]

Texas        38.018740
New York    139.076746
Florida     114.806121
Name: density, dtype: float64

In [99]:
data

              area       pop     density
California  423967  38332521   90.413926
Texas       695662  26448193   38.018740
New York    141297  19651127  139.076746
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
