# Data Indexing and Selection

### Series as dictionary

- A ``Series`` object maps a collection of keys to a collection of values.

In [1]:
import pandas as pd
data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [2]:
data['b']

0.5

- We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [3]:
'a' in data

True

In [4]:
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [5]:
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

- ``Series`` objects can be modified with a dictionary-like syntax. Just as you can extend a dictionary by assigning a new key, you can extend a ``Series`` by assigning a new index value:

In [6]:
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64
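The dictionary analogy can be checked with a minimal sketch. It assumes the same four-element ``Series`` as above; the variable names (``label_found``, ``value_found_as_label``, ``value_found``, ``labels``) are introduced here purely for illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# `in` tests the index labels (the "keys"), not the stored values
label_found = 'a' in data
value_found_as_label = 0.25 in data
# To test the values themselves, look inside the underlying array
value_found = 0.25 in data.values

# Assigning to an existing label overwrites; assigning to a new label appends
data['a'] = 0.3
data['e'] = 1.25
labels = list(data.keys())
```

As with a dictionary, membership tests and assignment both work on the labels, so a value such as ``0.25`` is not "in" the ``Series`` unless it happens to be a label.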

### Series as one-dimensional array

- A ``Series`` provides array-style item selection via *slices*, *masking*, and *fancy indexing*.

In [7]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [8]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [9]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [10]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

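- Slicing is probably the largest source of confusion. When slicing with an explicit index (``data['a':'c']``), the final index is *included* in the slice. When slicing with an implicit index (``data[0:2]``), the final index is *excluded*.
- The conventions are especially easy to mix up when a ``Series`` has an explicit integer index: an indexing operation such as ``data[1]`` uses the explicit index, while a slicing operation such as ``data[1:3]`` uses the implicit Python-style index.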

In [11]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [12]:
# explicit index when indexing
data[1]

'a'

In [13]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object
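The inclusive and exclusive slicing rules can be compared side by side in a short sketch. It assumes plain ``[]`` slicing behaves as described above; the names ``letters``, ``numbers``, ``by_label``, ``by_position``, ``item``, and ``sliced`` are illustrative only.

```python
import pandas as pd

letters = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

# Label-based slice: the end label 'c' is included
by_label = letters['a':'c']
# Position-based slice: the end position 2 is excluded
by_position = letters[0:2]

numbers = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# Single-item indexing uses the explicit label 1
item = numbers[1]
# Slicing uses implicit positions 1 and 2, which carry the labels 3 and 5
sliced = numbers[1:3]
```

Note that ``numbers[1]`` and ``numbers[1:3]`` do not even overlap: the first returns the element labeled 1, while the second starts at the element in position 1.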

- Pandas provides special *indexer* attributes that explicitly expose certain indexing schemes. These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.
- First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [14]:
data.loc[1]

'a'

In [15]:
data.loc[1:3]

1    a
3    b
dtype: object

- The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [16]:
data.iloc[1]

'b'

In [17]:
data.iloc[1:3]

3    b
5    c
dtype: object
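To make the contrast concrete, the sketch below applies both indexers to the same integer-indexed ``Series``. The variable names are assumptions made for this illustration.

```python
import pandas as pd

data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])

# loc always uses the explicit labels; label slices include the end label
label_item = data.loc[1]
label_slice = data.loc[1:3]

# iloc always uses implicit positions; position slices exclude the end
position_item = data.iloc[1]
position_slice = data.iloc[1:3]
```

The same key, ``1`` or ``1:3``, selects different elements depending on the indexer, which is why spelling out ``loc`` or ``iloc`` removes the ambiguity.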

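- A third indexing attribute, ``ix``, was a hybrid of the two, and for ``Series`` objects was equivalent to standard ``[]``-based indexing. It was deprecated in Pandas 0.20 and removed in Pandas 1.0, so current code should use ``loc`` and ``iloc`` instead.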

- One guiding principle of Python code is that "explicit is better than implicit." The explicit nature of ``loc`` and ``iloc`` makes them useful for maintaining clean and readable code, especially in the case of integer indexes.

### Data Selection in DataFrame

- A ``DataFrame`` acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of ``Series`` structures sharing the same index.

### DataFrame as a dictionary

- The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.

In [18]:
area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area':area, 'pop':pop})
data

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
New York    141297  19651127
Texas       695662  26448193


- The ``Series`` (the columns of the ``DataFrame``) can be accessed via dictionary-style indexing of the column name:

In [19]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

- Equivalently, you can use attribute-style access with column names that are strings:

In [20]:
data.area

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

- These two methods are equivalent.

In [21]:
data.area is data['area']

True

- Though this is a useful shorthand, there are limits. For example, if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.

- For example, the ``DataFrame`` has a ``pop()`` method, so ``data.pop`` will point to this rather than the ``"pop"`` column:

In [22]:
data.pop is data['pop']

False

- Avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
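The name clash can be seen in a small sketch, assuming a two-row frame with the same column names as above (the frame ``states`` and the other variable names are illustrative).

```python
import pandas as pd

states = pd.DataFrame({'area': [423967, 695662],
                       'pop': [38332521, 26448193]},
                      index=['California', 'Texas'])

# 'area' does not collide with any DataFrame attribute, so both forms work
area_attr = states.area
area_item = states['area']

# 'pop' collides with the DataFrame.pop() method, so attribute access
# returns the bound method rather than the column
pop_attr = states.pop
pop_item = states['pop']
```

Dictionary-style access is the safer default, since it works for every column name.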

- Dictionary-style syntax can also be used to modify the object, in this case adding a new column.

In [23]:
data['density'] = data['pop'] / data['area']
data

              area       pop     density
California  423967  38332521   90.413926
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740


### DataFrame as two-dimensional array

- You can also view the ``DataFrame`` as an enhanced two-dimensional array. View the raw underlying data array using the ``values`` attribute:

In [24]:
data.values

array([[  4.23967000e+05,   3.83325210e+07,   9.04139261e+01],
       [  1.70312000e+05,   1.95528600e+07,   1.14806121e+02],
       [  1.49995000e+05,   1.28821350e+07,   8.58837628e+01],
       [  1.41297000e+05,   1.96511270e+07,   1.39076746e+02],
       [  6.95662000e+05,   2.64481930e+07,   3.80187404e+01]])

- Many familiar array-like operations can be performed on the ``DataFrame`` itself. For example, we can transpose the full ``DataFrame`` to swap rows and columns:

In [25]:
data.T

         California     Florida    Illinois    New York       Texas
area       423967.0    170312.0    149995.0    141297.0    695662.0
pop      38332520.0  19552860.0  12882140.0  19651130.0  26448190.0
density    90.41393    114.8061    85.88376    139.0767    38.01874
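The floating-point values in both outputs above follow from how a mixed-type frame is converted to a single array. The sketch below assumes a two-row version of the frame; ``frame``, ``raw``, and ``flipped`` are illustrative names.

```python
import pandas as pd

frame = pd.DataFrame({'area': [423967, 170312],
                      'pop': [38332521, 19552860]},
                     index=['California', 'Florida'])
frame['density'] = frame['pop'] / frame['area']

# A single NumPy array has one dtype, so the integer columns are upcast
# to float64 to sit alongside the float 'density' column
raw = frame.values

# Transposing swaps the row labels and the column labels
flipped = frame.T
```

Because each column of the transposed frame now mixes what were integer and float entries, the integers show up as floats there too.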


- When it comes to indexing, however, the dictionary-style indexing of columns means a ``DataFrame`` cannot simply be treated as a NumPy array:
    - Passing a single index to an array accesses a row.
    - Passing a single index to a ``DataFrame`` accesses a column.

In [27]:
data.values[0]

array([  4.23967000e+05,   3.83325210e+07,   9.04139261e+01])

In [28]:
data['area']

California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64

- For array-style indexing, we need another convention. Here Pandas again uses the ``loc`` and ``iloc`` indexers mentioned earlier (older releases also offered the hybrid ``ix`` indexer).

- Using the ``iloc`` indexer, we can index the underlying array as if it is a simple NumPy array (using the implicit Python-style index), but the ``DataFrame`` index and column labels are maintained in the result:

In [29]:
data.iloc[:3, :2]

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


- Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the explicit index and column names:

In [30]:
data.loc[:'Illinois', :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135
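The two selections above return the same block of data, which the following sketch verifies. It rebuilds the frame with the rows in the order shown in this section; the names ``frame``, ``by_position``, and ``by_label`` are illustrative.

```python
import pandas as pd

states = ['California', 'Florida', 'Illinois', 'New York', 'Texas']
frame = pd.DataFrame({'area': [423967, 170312, 149995, 141297, 695662],
                      'pop': [38332521, 19552860, 12882135,
                              19651127, 26448193]},
                     index=states)
frame['density'] = frame['pop'] / frame['area']

# Positional slices exclude the end: rows 0-2 and columns 0-1
by_position = frame.iloc[:3, :2]

# Label slices include the end: rows through 'Illinois', columns through 'pop'
by_label = frame.loc[:'Illinois', :'pop']
```

The end points differ (``3`` and ``2`` versus ``'Illinois'`` and ``'pop'``) precisely because of the exclusive and inclusive slicing conventions.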


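- Older Pandas releases offered ``ix`` as a hybrid of these two approaches (for example, ``data.ix[:3, :'pop']``). It was deprecated in Pandas 0.20 and removed in Pandas 1.0; the same mixed selection can be written by chaining ``iloc`` and ``loc``:

In [34]:
data.iloc[:3].loc[:, :'pop']

              area       pop
California  423967  38332521
Florida     170312  19552860
Illinois    149995  12882135


- The hybrid ``ix`` indexer was subject to the same potential integer index confusion discussed for integer-indexed ``Series`` objects.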
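Because ``ix`` is not available in Pandas 1.0 and later, mixing positions and labels takes one extra step. One possible approach, sketched here under the assumption that column labels are unique, is to translate the label into a position with ``Index.get_loc`` and then use ``iloc`` alone. The names ``frame``, ``stop``, and ``mixed`` are illustrative.

```python
import pandas as pd

states = ['California', 'Florida', 'Illinois', 'New York', 'Texas']
frame = pd.DataFrame({'area': [423967, 170312, 149995, 141297, 695662],
                      'pop': [38332521, 19552860, 12882135,
                              19651127, 26448193]},
                     index=states)
frame['density'] = frame['pop'] / frame['area']

# Convert the column label to its integer position; add 1 so that the
# exclusive positional slice still includes the 'pop' column
stop = frame.columns.get_loc('pop') + 1

# First three rows by position, columns up to and including 'pop'
mixed = frame.iloc[:3, :stop]
```

This keeps the selection in a single indexer call, at the cost of handling the inclusive end point by hand.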

- Any of the NumPy-style data access patterns can be used within these indexers. For example you can combine masking and fancy indexing.

In [35]:
data.loc[data.density > 100, ['pop', 'density']]

               pop     density
Florida   19552860  114.806121
New York  19651127  139.076746
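The row selector in such an expression is an ordinary boolean ``Series``, so conditions can be combined before being passed to ``loc``. The sketch below is illustrative: the population threshold and the names ``frame``, ``mask``, and ``selected`` are assumptions, not part of the original example.

```python
import pandas as pd

states = ['California', 'Florida', 'Illinois', 'New York', 'Texas']
frame = pd.DataFrame({'area': [423967, 170312, 149995, 141297, 695662],
                      'pop': [38332521, 19552860, 12882135,
                              19651127, 26448193]},
                     index=states)
frame['density'] = frame['pop'] / frame['area']

# Combine two row conditions with &; each comparison needs its own parentheses
mask = (frame['density'] > 100) & (frame['pop'] > 19_600_000)

# Mask the rows and pick columns by name in a single loc call
selected = frame.loc[mask, ['pop', 'density']]
```

Here Florida passes the density test but not the population test, so only New York remains.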


- Any of these indexing conventions may also be used to set or modify values, in the same way you would assign into a NumPy array.

In [36]:
data.iloc[0, 2] = 90
data

              area       pop     density
California  423967  38332521   90.000000
Florida     170312  19552860  114.806121
Illinois    149995  12882135   85.883763
New York    141297  19651127  139.076746
Texas       695662  26448193   38.018740
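Assignment works through ``loc`` and boolean masks as well as through ``iloc``. The following sketch shows the three forms together; the replacement values and the capping rule are made up for illustration.

```python
import pandas as pd

states = ['California', 'Florida', 'Illinois', 'New York', 'Texas']
frame = pd.DataFrame({'area': [423967, 170312, 149995, 141297, 695662],
                      'pop': [38332521, 19552860, 12882135,
                              19651127, 26448193]},
                     index=states)
frame['density'] = frame['pop'] / frame['area']

# Set one cell by position (row 0, column 2)
frame.iloc[0, 2] = 90

# Set one cell by label
frame.loc['Texas', 'density'] = 38

# Set every cell that matches a mask: cap densities above 100 at 100
frame.loc[frame['density'] > 100, 'density'] = 100
```

Assigning through a single ``loc`` or ``iloc`` call, rather than chaining two ``[]`` lookups, modifies the frame itself instead of a temporary copy.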


### Additional indexing conventions

- A couple of extra conventions apply to plain ``[]`` on a ``DataFrame``. While *indexing* refers to columns, *slicing* refers to rows:

In [37]:
data['Florida':'Illinois']

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


- Such slices can also refer to rows by position rather than by label:

In [38]:
data[1:3]

            area       pop     density
Florida   170312  19552860  114.806121
Illinois  149995  12882135   85.883763


- Direct masking operations are also interpreted row-wise rather than column-wise:

In [39]:
data[data.density > 100]

            area       pop     density
Florida   170312  19552860  114.806121
New York  141297  19651127  139.076746
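The four behaviors of plain ``[]`` on a ``DataFrame`` can be gathered into one sketch. The frame is rebuilt with the rows in the order used in this section, and all variable names are illustrative.

```python
import pandas as pd

states = ['California', 'Florida', 'Illinois', 'New York', 'Texas']
frame = pd.DataFrame({'area': [423967, 170312, 149995, 141297, 695662],
                      'pop': [38332521, 19552860, 12882135,
                              19651127, 26448193]},
                     index=states)
frame['density'] = frame['pop'] / frame['area']

# A single label selects a column and returns a Series
column = frame['area']

# A label slice selects rows, end label included
label_rows = frame['Florida':'Illinois']

# An integer slice selects rows by position, end position excluded
position_rows = frame[1:3]

# A boolean mask selects the rows where the mask is True
masked_rows = frame[frame['density'] > 100]
```

These shortcuts mirror NumPy syntax but are easy to misread, so ``loc`` and ``iloc`` remain the clearer choice when both rows and columns are involved.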
