# 2. Selecting Subsets of Data

## There should be one-- and preferably only one --obvious way to do it.
This quote is from the "Zen of Python" by Tim Peters. Import the `this` module to have it printed to the screen.

In [None]:
import this

### Pandas breaks this guideline a whole lot
Pandas is perhaps an extreme example of a library that gives its users many different ways to complete the same task. This is not a good thing, because different users end up writing different code for the same task. Pandas can do so much that it is difficult to keep all the possibilities in working memory. Restricting yourself to a small number of ways of using the library will make you a more effective analyst.

## Selecting Subsets of Data

### Selecting a single column - brackets vs dot notation
Pandas gives its users two methods to select a single column of data as a Series. You can place the name inside the brackets or you can use dot notation. Let's see this in action with some sample data.

In [None]:
import pandas as pd
df = pd.read_csv('data/sample_data.csv', index_col=0)
df

### Select the state column
Let's select the `state` column with both the brackets and dot notation.

In [None]:
df['state']

In [None]:
df.state

### Select the favorite food column
This is only possible using the brackets as spaces would raise a syntax error with dot notation.

In [None]:
df['favorite food']

### Select the `count` column
This again only works with the brackets as `count` is a DataFrame method to count the non-missing values of each column.

In [None]:
df['count']

In [None]:
df.count

## Choose the method that always works
The brackets and dot notation provide two different ways to select a single column of data. Dot notation does not work for columns with spaces in their names or for columns that share a name with a DataFrame method. It provides no additional functionality over the brackets and does not work in all situations. Therefore, I never use it. Its single advantage is three fewer keystrokes.
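
The two failure cases can be checked without the sample file. The sketch below builds a small made-up DataFrame (`df_demo` and its values are hypothetical, chosen only so that one column name contains a space and another collides with a method name).

```python
import pandas as pd

# Hypothetical data: 'favorite food' contains a space, 'count' shadows a method
df_demo = pd.DataFrame({
    'state': ['TX', 'CA'],
    'favorite food': ['pizza', 'tacos'],
    'count': [3, 7],
})

# Brackets return the column as a Series in every case
state_col = df_demo['state']
food_col = df_demo['favorite food']
count_col = df_demo['count']

# Dot notation finds the DataFrame.count method before it looks at the columns
count_attr = df_demo.count

print(type(count_col).__name__)   # the column
print(callable(count_attr))       # the bound method, not the column
```

Writing `df_demo.favorite food` is not valid Python at all, so that case fails before pandas is even involved.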

### Minimally Sufficient Guiding Principle
If a method does not provide any additional functionality over another method (i.e. its functionality is a subset of another) then it shouldn't be used. Methods should only be considered if they have some additional, unique functionality.

**GUIDANCE** - Only use the brackets when selecting a single column of data.

## Select multiple columns
Selecting multiple columns is done with the brackets. Pass in all the columns you want to select as a list. Here we select state, age, and color.

In [None]:
df[['state', 'age', 'color']]

### For clarity, consider creating a list first
A common mistake when selecting multiple columns is to forget to put the column names in a list and write `df['state', 'age', 'color']`. Python passes this to pandas as a single tuple, pandas looks for one column with that tuple as its name, and the result is a `KeyError`. For clarity, consider creating the list first and then making the selection on a second line.

In [None]:
cols = ['state', 'age', 'color']
df[cols]
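
To see the mistake in isolation, here is a minimal sketch on a made-up DataFrame (`df_demo` and its values are hypothetical). It assumes ordinary, non-MultiIndex columns, which is the case for the sample data above.

```python
import pandas as pd

# Hypothetical stand-in for the sample data
df_demo = pd.DataFrame({
    'state': ['TX', 'CA'],
    'age': [30, 41],
    'color': ['red', 'blue'],
})

# Without the inner list, the key is the tuple ('state', 'age', 'color')
try:
    df_demo['state', 'age', 'color']
    raised_key_error = False
except KeyError:
    raised_key_error = True

# With the list, pandas returns a DataFrame holding the requested columns
cols = ['state', 'age', 'color']
subset = df_demo[cols]

print(raised_key_error)
print(list(subset.columns))
```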

## Selecting Rows and Columns Simultaneously with `loc` and `iloc`

Rows and columns in a Pandas DataFrame can be referenced in two ways - by either **label** or **integer location**. This dual reference is one of the reasons that subset selection is confusing for beginners. Pandas provides the indexers `loc` to handle selection by label and `iloc` for selection by integer location. Both are capable of simultaneously selecting rows and columns.

Let's select by label with `loc` the rows for Niko and Dean along with the columns age, favorite food and score. We create two separate lists for clarity.

In [None]:
rows = ['Niko', 'Dean']
cols = ['age', 'favorite food', 'score']
df.loc[rows, cols]

Now let's select by integer location with `iloc` the rows 1, 2, and 5 and the columns 0 and 4.

In [None]:
rows = [1, 2, 5]
cols = [0, 4]
df.iloc[rows, cols]

## The deprecated `ix` indexer - never use it

The `ix` indexer was created before `loc` and `iloc` and could select data by both label and integer location. Although it was versatile, it was ambiguous, because labels can be integers as well as strings. To remove this ambiguity, the explicit `loc` and `iloc` indexers were created. `ix` was deprecated in pandas 0.20 and removed entirely in pandas 1.0.

**GUIDANCE** - Every trace of `ix` should be removed and replaced with `loc` or `iloc`. 
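
The ambiguity is easiest to see when the index labels are themselves integers. The following sketch uses a made-up DataFrame (`df_demo` is hypothetical) whose integer labels are deliberately out of order, so that "row 0" means something different by label than by position.

```python
import pandas as pd

# Hypothetical data: integer index labels in scrambled order
df_demo = pd.DataFrame({'score': [1.5, 2.5, 3.5]}, index=[2, 0, 1])

# loc treats 0 as a label: the row whose index label is 0
by_label = df_demo.loc[0, 'score']

# iloc treats 0 as a position: the first row, whatever its label
by_position = df_demo.iloc[0, 0]

print(by_label, by_position)
```

With `loc` and `iloc` the intent is stated in the code, so a bare integer never has to be guessed at.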

### What happens if you need to select by both integer location and label simultaneously
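It's very rare that you will need to select by both integer location and label simultaneously. As an example, suppose you want rows 1, 2, and 5 along with the columns age, favorite food, and score. You can call one indexer after another. This is called chained indexing. It works for reading data but should be avoided, and never used for assignment, because the first indexer may return a copy, so an assignment through the second may not reach the original DataFrame.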

In [None]:
rows = [1, 2, 5]
cols = ['age', 'favorite food', 'score']
df.iloc[rows, :].loc[:, cols]
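
One way to avoid the chain is to translate one kind of reference into the other and then use a single indexer. This is a sketch on made-up data, not the only approach: `df_demo`, its values, and the names Jane and Aaron are hypothetical. The first option assumes the index labels are unique.

```python
import pandas as pd

# Hypothetical stand-in for the sample data
df_demo = pd.DataFrame(
    {
        'state': ['TX', 'CA', 'NY', 'FL'],
        'age': [30, 41, 25, 52],
        'favorite food': ['pizza', 'tacos', 'ramen', 'steak'],
        'score': [4.5, 3.0, 2.5, 5.0],
    },
    index=['Jane', 'Niko', 'Aaron', 'Dean'],
)

rows = [1, 3]                      # integer locations
cols = ['age', 'favorite food']    # labels

# Option 1: turn the integer locations into labels, then use loc once
via_loc = df_demo.loc[df_demo.index[rows], cols]

# Option 2: turn the column labels into integer locations, then use iloc once
col_positions = df_demo.columns.get_indexer(cols)
via_iloc = df_demo.iloc[rows, col_positions]

print(via_loc.equals(via_iloc))
```

Because each option goes through a single indexer, the same expression can also appear on the left-hand side of an assignment.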

### Selecting with `at` and `iat` 
Two additional indexers, `at` and `iat`, select a single cell of a DataFrame. They provide a slight performance advantage over their analogous `loc` and `iloc` indexers, but they introduce the additional burden of having to remember what they do. For most data analysis the speedup is negligible unless the selection is repeated at scale. And if performance truly is an issue, place your data in a NumPy array and index it directly.

In [None]:
import numpy as np

In [None]:
a = np.random.rand(10 ** 5, 5)
df1 = pd.DataFrame(a)

In [None]:
row = 50000
col = 3

In [None]:
%timeit -n 1 -r 1 df1.iloc[row, col]

In [None]:
%timeit -n 1 -r 1 df1.iat[row, col]

In [None]:
%timeit a[row, col]

**GUIDANCE** - There really is no need to use `at` and `iat`. If you do need better performance, use the underlying NumPy array.
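
The `%timeit` magic is only available in IPython and Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard-library `timeit` module. The repetition count `n` is an arbitrary choice, and the measured times depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

a = np.random.rand(10 ** 5, 5)
df1 = pd.DataFrame(a)
row, col = 50000, 3

# Pull the array back out of the DataFrame when only the DataFrame is at hand
arr = df1.to_numpy()

n = 1000  # arbitrary number of repetitions per measurement
t_iloc = timeit.timeit(lambda: df1.iloc[row, col], number=n)
t_iat = timeit.timeit(lambda: df1.iat[row, col], number=n)
t_numpy = timeit.timeit(lambda: arr[row, col], number=n)

# Report the average time per lookup in microseconds
for name, total in [('iloc', t_iloc), ('iat', t_iat), ('numpy', t_numpy)]:
    print(f'{name:>5}: {total / n * 1e6:.2f} us per lookup')
```

All three expressions return the same value, so the only thing being compared is the lookup overhead.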