*This notebook is based on an excerpt from the [Python Data Science Handbook](http://shop.oreilly.com/product/0636920034919.do) by Jake VanderPlas; the content is available [on GitHub](https://github.com/jakevdp/PythonDataScienceHandbook).*

*The text is released under the [CC-BY-NC-ND license](https://creativecommons.org/licenses/by-nc-nd/3.0/us/legalcode), and code is released under the [MIT license](https://opensource.org/licenses/MIT). If you find this content useful, please consider supporting the work by [buying the book](http://shop.oreilly.com/product/0636920034919.do)!*

# Data Indexing and Selection

Here we'll look at means of accessing and modifying values in Pandas ``Series`` and ``DataFrame`` objects that are similar to the NumPy indexing patterns.
If you have used the NumPy patterns, the corresponding patterns in Pandas will feel very familiar, though there are a few quirks to be aware of.

We'll start with the more salient case of the `DataFrame` and then move on to the simpler case of the `Series`.

In [None]:
import pandas as pd
import numpy as np

## Data Selection in DataFrame

Recall that a ``DataFrame`` acts in many ways like a two-dimensional, structured array, or matrix.  You can think of each value as identified by its row index (explicitly or implicitly named) and its column index (also explicitly or implicitly named).

### Accessing a single column

The first analogy we will consider is the ``DataFrame`` as a dictionary of related ``Series`` objects.
Let's load the Palmer Penguins dataset to use as our example:

In [None]:
data = pd.read_csv('https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/inst/extdata/penguins.csv')


The individual ``Series`` that make up the columns of the ``DataFrame`` can be accessed via dictionary-style indexing of the column name:

In [None]:
# access by dictionary-style indexing


Equivalently, we can use attribute-style access with column names that are strings:

In [None]:
# access with attribute-style access


This attribute-style column access returns the same column as the dictionary-style access:

In [None]:
# Check to see if access is the same


Though this is a useful shorthand, keep in mind that it does not work for all cases!
For example, if the column names are not strings, or if the column names conflict with methods of the ``DataFrame``, this attribute-style access is not possible.
For example, the ``DataFrame`` has a ``pop()`` method, so if this dataframe had a `population` column named ``"pop"`` for short, ``data.pop`` would point to the method rather than to the column:

In [None]:
# Challenges of attribute-style access


In particular, you should avoid the temptation to try column assignment via attribute (i.e., use ``data['pop'] = z`` rather than ``data.pop = z``).
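
The access styles discussed above can be sketched on a small table. This is only an illustration: `df` is a hypothetical stand-in for the penguins data with made-up values, and the `"pop"` column is invented to show the name collision.

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
    "pop": [10, 20, 30],  # hypothetical column whose name collides with DataFrame.pop
})

# Dictionary-style and attribute-style access return the same column
by_key = df["species"]
by_attr = df.species
same_column = by_key.equals(by_attr)

# Attribute access finds the pop() method first, so the column is hidden
pop_is_method = callable(df.pop)

# Dictionary-style access still reaches the column
pop_column = df["pop"]
```

Because dictionary-style access never collides with method names, it is the safer choice whenever a column name is not under your control.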

Vectorized operations also work on a selected column: the operation is applied to every element of the column at once.
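
As a minimal sketch of such a vectorized operation, the example below derives a new column from an existing one. The table is a hypothetical stand-in with made-up values, and `body_mass_kg` is a column name invented for the illustration.

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "body_mass_g": [3750.0, 5000.0, 3500.0],
})

# The division is applied to every element of the column at once,
# and dictionary-style assignment stores the result as a new column
df["body_mass_kg"] = df["body_mass_g"] / 1000
```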

**Memory check:** What was that operation called in the tidyverse that added a column to your data?  Does this look like a similar operation?

### Accessing as row, column: `loc` and `iloc`

As mentioned previously, we can also view the ``DataFrame`` as an enhanced two-dimensional array.
We can examine the raw underlying data array using the ``values`` attribute:

In [None]:
# get underlying values as numpy array


For array-style indexing, we need another convention.
Here Pandas again uses the ``loc`` and ``iloc`` indexers.  **Recall that in our previous exploration of series, explicit indices can be numerical, which makes it unclear whether you're selecting by implicit or explicit index.  These two indexers remove that ambiguity.**

### `iloc`
Think of `iloc` as a reference to the _integer_ location, or _implicit_ index, or by position.  Using the ``iloc`` indexer, we can index the underlying array and the ``DataFrame`` index and column labels are maintained in the result.  The indexer expects the row position followed by the column position.
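
A small sketch of position-based selection with `iloc`, assuming a hypothetical stand-in table with made-up values rather than the real penguins data:

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "bill_depth_mm": [18.7, 14.5, 19.0, 17.2],
})

first_cell = df.iloc[0, 0]    # row position 0, column position 0
second_cell = df.iloc[0, 1]   # row position 0, column position 1

# Position slices exclude the endpoint: rows 0-2 and columns 0-1
corner = df.iloc[:3, :2]
```

Note that `corner` is still a `DataFrame`, so its row and column labels come along with the values.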

In [None]:
# index the Dataframe by position, row 0, column 0


# index the DataFrame by position: row 0, column 1


In [None]:
# slice the dataframe to get the first 3 rows and the first 2 columns


### `loc`

Similarly, using the ``loc`` indexer we can index the underlying data in an array-like style but using the _explicit index_ and column names.  Note that since this ``DataFrame`` was loaded without an explicit index, its row labels are the default integers 0, 1, 2, and so on, so the explicit index here coincides with the implicit one.
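
Label-based selection with `loc` might look like the sketch below. It assumes a hypothetical stand-in table with made-up values and the default integer row labels.

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "bill_depth_mm": [18.7, 14.5, 19.0, 17.2],
})

# A single row label returns that row as a Series
first_row = df.loc[0]

# Label slices include both endpoints: rows 0, 1, 2 and both named columns
subset = df.loc[0:2, "species":"island"]
```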

In [None]:
#Get the first row


**Question**.  We just said that these indexers require the format `[row, column]`.  However, it seems like we omitted that here.  What do you think is happening?

**Gotcha!** Look at how this expression appears.  It looks _very_ similar to `data['species']`.  However, these behaviors are decidedly different; always remember that `.loc` will expect a _row index_ as the first entry in the brackets, whereas dictionary-like key-value lookup will expect a column index.
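
The difference can be seen in a short sketch, again on a hypothetical stand-in table with made-up values. Passing a column name as the only entry to `.loc` is treated as a row label and fails when no such row exists.

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream"],
})

# .loc looks for a ROW labelled "species"; there is none, so this raises KeyError
try:
    df.loc["species"]
    raised = False
except KeyError:
    raised = True

# Supplying an explicit "all rows" slice selects the column instead
col = df.loc[:, "species"]
```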

In [None]:
# Try to use loc to just access the column


In [None]:
# Correctly use loc to access the column


In [None]:
# Can also slice
#explicit indices include the endpoint

### Try it yourself!!
Using the `data` dataframe above, try to do the following selections:
1. Select the `bill_depth_mm` column using `iloc`, `loc`, and dictionary indexing.
2. Select only rows 1 and 5 with all columns.
3. Select the `species` and `island` columns with all rows.
4. Select rows 2, 4, and 6.
5. **Bonus**: Repeat #4, but reverse the order (e.g., dataframe returned should have order 6, 4, 2)

**HINT**: When trying to select multiple indices, you can pass in a list of indices (e.g., `data.loc[['John', 'Raheim'], :]`)

In [None]:
# Answer to 1


In [None]:
#Answer to 2


In [None]:
#Answer to 3


In [None]:
#Answer to 4


In [None]:
#Answer to 5


### Masking and fancy indexing

Many different data access patterns can be used within these indexers.
For example, in the ``loc`` indexer we can combine boolean masking and fancy indexing as in the following.  Boolean masking evaluates a condition on each row of your dataset and returns a ``Series`` of true/false values.  That ``Series`` can then be used as a mask to keep only the rows where the condition is `True`.  Additionally, fancy indexing allows you to select multiple columns by passing a list of column names.
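
A minimal sketch of these patterns, assuming a hypothetical stand-in table with made-up values:

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "bill_depth_mm": [18.7, 14.5, 19.0, 17.2],
    "body_mass_g": [3750.0, 5000.0, 3500.0, 3900.0],
})

# Boolean mask: one True/False value per row
mask = df["bill_depth_mm"] > 17

# Mask the rows and pick two columns with a list (fancy indexing)
deep = df.loc[mask, ["island", "body_mass_g"]]

# Masks can also be built from the row labels or the column labels
not_first = df.loc[df.index != 0]
no_species = df.loc[:, df.columns != "species"]
```

The filtered result keeps the original row labels of the rows that survive, so its index is no longer a contiguous run of integers.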

In [None]:
# Look at the boolean masking result of bill_depth_mm > 17


**Question**: How big is this dataframe?  What's up with the indices?

In [None]:
# Only look at selected columns - island and body_mass_g


In [None]:
# Look at all rows that aren't index 0 (this is valuable when you have indices that are non-unique)


In [None]:
# Find all the columns that aren't species


Any of these indexing conventions may also be used to set or modify values; this is done in the standard way that you might be accustomed to from working with NumPy:

In [None]:
# You can also use these selections to set values too


To build up your fluency in Pandas data manipulation, I suggest spending some time with a simple ``DataFrame`` and exploring the types of indexing, slicing, masking, and fancy indexing that are allowed by these various indexing approaches.
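
Setting values through an indexer, as mentioned above, could look like the following sketch. The table is a hypothetical stand-in with made-up values, and the replacement values are arbitrary.

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream"],
    "bill_depth_mm": [18.7, 14.5, 19.0, 17.2],
})

# Set a single value by position: row 0, column 2
df.iloc[0, 2] = 19.5

# Set values by mask and column label in a single .loc call
df.loc[df["island"] == "Dream", "island"] = "Dream Island"
```

Using one `.loc[row_selector, column]` call for the assignment, rather than chaining two bracket selections, is what ensures the change lands in `df` itself.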

### Try it yourself!!
Using the dataframe `work_data` below, answer the following questions:
1. Which penguins have bill depth > 20mm?  _How many_ penguins have bill depth > 20mm?
2. What are the species of the penguins which have bill depth > 20?  (Find the penguins that have bill depth greater than 20 and then return only the species column)
3. Which penguins have bill depth between 18 and 19mm?  What are their corresponding body masses?  **Hints**:  (1) `&` is the `and` operator compatible with Pandas Series.  (2) Remember operator precedence and how parentheses can help with that!
4. **Bonus**: Which penguins have body masses within 100 grams of the median body mass?  **Hint**: use `np.median`

In [None]:
# Make sure to execute this cell
work_data = data.copy()

In [None]:
#Answer to question 1
#res = fill in your answer here
display(res)
print('The total number of penguins with bill_depth_mm > 20 is: {0}'.format(res.shape[0]))

In [None]:
#Answer to question 2


In [None]:
# Answer to question 3


In [None]:
# Answer to question 4


## On Pandas: Query
As we've already seen in previous sections, the power of the PyData stack is built upon the ability of NumPy and Pandas to push basic operations into C via an intuitive syntax: examples are vectorized/broadcasted operations in NumPy, and grouping-type operations in Pandas.
While these abstractions are efficient and effective for many common use cases, they often rely on the creation of temporary intermediate objects, which can cause undue overhead in computational time and memory use.

As of version 0.13 (released January 2014), Pandas includes some experimental tools that allow you to directly access C-speed operations without costly allocation of intermediate arrays.
These are the ``eval()`` and ``query()`` functions, which can use the [Numexpr](https://github.com/pydata/numexpr) package when it is installed.
In this notebook we will walk through their use and give some rules-of-thumb about when you might think about using them.

You can rewrite many of the above methods using strings in combination with query if they're operating on the columns of the dataframe.  This helps with speed and performance since temporary intermediate variables are not created, and also helps with readability.  Let's see how this would look below.
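
As a rough sketch of how a mask compares with its `query` form, consider the example below. It uses a hypothetical stand-in table with made-up values, and `threshold` is a local variable invented to show the `@` prefix.

```python
import pandas as pd

# Hypothetical stand-in table (made-up values, not real measurements)
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"],
    "bill_depth_mm": [18.7, 14.5, 19.0, 17.2],
})

# Masking and query express the same row filter
deep_mask = df[df["bill_depth_mm"] > 17]
deep_query = df.query("bill_depth_mm > 17")

# Conditions can be combined inside the query string
between = df.query("bill_depth_mm > 17 and bill_depth_mm < 19")

# Prefix a local Python variable with @ to use it inside the string
threshold = 18
above = df.query("bill_depth_mm > @threshold")
```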

In [None]:
# Answer to 1


#with query


In [None]:
# Answer to 2


#with query


In [None]:
# Answer to 3


#with query


In [None]:
# Answer to 4: use @ to reference variables that are not in the dataframe


#with query


### Additional indexing conventions

There are a couple extra indexing conventions that might seem at odds with the preceding discussion, but nevertheless can be very useful in practice.
First, while *indexing* refers to columns, *slicing* refers to rows:

In [None]:
# Dictionary-like access allows you to slice on the rows
data[0:5]

Such slices refer to rows by position rather than by index label:

In [None]:
# Dictionary-like access also allows you to slice by implicit index
data[1:3]

Similarly, direct masking operations are also interpreted row-wise rather than column-wise:

In [None]:
# Masking operations are implemented row wise
data[data.bill_depth_mm > 20]

These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.

## Data Selection in Series

As we saw in the previous section, a ``Series`` object acts in many ways like a one-dimensional NumPy array, and in many ways like a standard Python dictionary.
If we keep these two overlapping analogies in mind, it will help us to understand the patterns of data indexing and selection in these arrays.

### Series as dictionary

Like a dictionary, the ``Series`` object provides a mapping from a collection of keys to a collection of values:

In [None]:
#pandas series for testing
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64

In [None]:
#access a single row
data['b']

0.5

We can also use dictionary-like Python expressions and methods to examine the keys/indices and values:

In [None]:
# Use python expression with Series
'a' in data

True

In [None]:
# Access the keys in a dictionary-like way
data.keys()

Index(['a', 'b', 'c', 'd'], dtype='object')

In [None]:
# Access the items as sets of tuples
list(data.items())

[('a', 0.25), ('b', 0.5), ('c', 0.75), ('d', 1.0)]

``Series`` objects can even be modified with a dictionary-like syntax.
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:

In [None]:
# Set values by dictionary-like syntax
data['e'] = 1.25
data

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64

This easy mutability of the objects is a convenient feature: under the hood, Pandas is making decisions about memory layout and data copying that might need to take place; the user generally does not need to worry about these issues.

### Series as one-dimensional array

A ``Series`` builds on this dictionary-like interface and provides array-style item selection via the same basic mechanisms as NumPy arrays – that is, *slices*, *masking*, and *fancy indexing*.
Examples of these are as follows:

In [None]:
# slicing by explicit index
data['a':'c']

a    0.25
b    0.50
c    0.75
dtype: float64

In [None]:
# slicing by implicit integer index
data[0:2]

a    0.25
b    0.50
dtype: float64

In [None]:
# masking
data[(data > 0.3) & (data < 0.8)]

b    0.50
c    0.75
dtype: float64

In [None]:
# fancy indexing
data[['a', 'e']]

a    0.25
e    1.25
dtype: float64

Among these, slicing may be the source of the most confusion.
Notice that when slicing with an explicit index (i.e., ``data['a':'c']``), the final index is *included* in the slice, while when slicing with an implicit index (i.e., ``data[0:2]``), the final index is *excluded* from the slice.

### Indexers: loc and iloc

These slicing and indexing conventions can be a source of confusion.
For example, if your ``Series`` has an explicit integer index, an indexing operation such as ``data[1]`` will use the explicit indices, while a slicing operation like ``data[1:3]`` will use the implicit Python-style index.

In [None]:
data = pd.Series(['a', 'b', 'c'], index=[1, 3, 5])
data

1    a
3    b
5    c
dtype: object

In [None]:
# explicit index when indexing
data[1]

'a'

In [None]:
# implicit index when slicing
data[1:3]

3    b
5    c
dtype: object

Because of this potential confusion in the case of integer indexes, Pandas provides some special *indexer* attributes that explicitly expose certain indexing schemes.
These are not functional methods, but attributes that expose a particular slicing interface to the data in the ``Series``.

First, the ``loc`` attribute allows indexing and slicing that always references the explicit index:

In [None]:
# access using explicit index
data.loc[1]

'a'

In [None]:
# Slice using explicit index
data.loc[1:3]

1    a
3    b
dtype: object

The ``iloc`` attribute allows indexing and slicing that always references the implicit Python-style index:

In [None]:
# Access using the implicit index
data.iloc[1]

'b'

In [None]:
# Slice using the implicit index
data.iloc[1:3]

3    b
5    c
dtype: object

One guiding principle of Python code is that "explicit is better than implicit."
The explicit nature of ``loc`` and ``iloc`` makes them very useful in maintaining clean and readable code; especially in the case of integer indexes, I recommend using these both to make code easier to read and understand, and to prevent subtle bugs due to the mixed indexing/slicing convention.

## What we've learned in this lesson:

1. Indexing and selection of `DataFrames` using:
    - Bracket indexing
    - `.iloc`
    - `.loc`
    - `.query`
2. Quirks of dataframe indexing vs slicing
3. Indexing and selection of `Series` using:
    - Bracket indexing
    - `.iloc`
    - `.loc`