# 2. Selecting Subsets of Data

## There should be one-- and preferably only one --obvious way to do it.
This quote is from the "Zen of Python" by Tim Peters. Import the `this` module to have it printed to the screen.

In [1]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


### Pandas breaks this guideline a whole lot
Pandas is perhaps an extreme example of a library that gives its users many different ways to complete the same task. This is not a good thing: different users end up writing different code for the same task. Pandas can do so much that it is difficult to hold all the possibilities in working memory. Restricting the number of ways you use the library will make you a more effective analyst.

## Selecting Subsets of Data

### Selecting a single column - brackets vs dot notation
Pandas gives its users two methods to select a single column of data as a Series. You can place the name inside the brackets or you can use dot notation. Let's see this in action with some sample data.

In [3]:
import pandas as pd
df = pd.read_csv('data/sample_data.csv', index_col=0)
df

name,state,color,favorite food,age,height,score,count
Jane,NY,blue,Steak,30,165,4.6,10
Niko,TX,green,Lamb,2,70,8.3,4
Aaron,FL,red,Mango,12,120,9.0,3
Penelope,AL,white,Apple,4,80,3.3,12
Dean,AK,gray,Cheese,32,180,1.8,8
Christina,TX,black,Melon,33,172,9.5,99
Cornelia,TX,red,Beans,69,150,2.2,44


### Select the state column
Let's select the `state` column with both the brackets and dot notation.

In [3]:
df['state']

name
Jane         NY
Niko         TX
Aaron        FL
Penelope     AL
Dean         AK
Christina    TX
Cornelia     TX
Name: state, dtype: object

In [None]:
df.state

### Select the favorite food column
This is only possible with the brackets. A column name that contains a space is not a valid Python attribute name, so dot notation raises a syntax error.

In [4]:
df['favorite food']

name
Jane          Steak
Niko           Lamb
Aaron         Mango
Penelope      Apple
Dean         Cheese
Christina     Melon
Cornelia      Beans
Name: favorite food, dtype: object

In [5]:
df.favorite food

SyntaxError: invalid syntax (<ipython-input-5-d3097dad1f65>, line 1)

### Select the `count` column
This again only works with the brackets. `count` is also a DataFrame method (it counts the non-missing values of each column), so `df.count` returns the method rather than the column.

In [6]:
df['count']

name
Jane         10
Niko          4
Aaron         3
Penelope     12
Dean          8
Christina    99
Cornelia     44
Name: count, dtype: int64

In [7]:
df.count

<bound method DataFrame.count of           state  color favorite food  age  height  score  count
name                                                           
Jane         NY   blue         Steak   30     165    4.6     10
Niko         TX  green          Lamb    2      70    8.3      4
Aaron        FL    red         Mango   12     120    9.0      3
Penelope     AL  white         Apple    4      80    3.3     12
Dean         AK   gray        Cheese   32     180    1.8      8
Christina    TX  black         Melon   33     172    9.5     99
Cornelia     TX    red         Beans   69     150    2.2     44>

In [7]:
print(repr(df))

          state  color favorite food  age  height  score  count
name                                                           
Jane         NY   blue         Steak   30     165    4.6     10
Niko         TX  green          Lamb    2      70    8.3      4
Aaron        FL    red         Mango   12     120    9.0      3
Penelope     AL  white         Apple    4      80    3.3     12
Dean         AK   gray        Cheese   32     180    1.8      8
Christina    TX  black         Melon   33     172    9.5     99
Cornelia     TX    red         Beans   69     150    2.2     44


In [8]:
str(df)

'          state  color favorite food  age  height  score  count\nname                                                           \nJane         NY   blue         Steak   30     165    4.6     10\nNiko         TX  green          Lamb    2      70    8.3      4\nAaron        FL    red         Mango   12     120    9.0      3\nPenelope     AL  white         Apple    4      80    3.3     12\nDean         AK   gray        Cheese   32     180    1.8      8\nChristina    TX  black         Melon   33     172    9.5     99\nCornelia     TX    red         Beans   69     150    2.2     44'

## Choose the method that always works
The brackets and dot notation provide two different ways to select a single column of data. Dot notation does not work for columns with spaces or for columns that share a name with a DataFrame method. It provides no additional functionality over the brackets and does not work in all situations. Therefore, I never use it. Its single advantage is three fewer keystrokes.
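As a minimal sketch of the name collision described above, the snippet below rebuilds three rows of the sample data by hand. The inline DataFrame is an assumption made so the example runs without `data/sample_data.csv`. It then compares what each notation returns for `count`.

```python
import pandas as pd

# Hand-built stand-in for a few rows of the sample data (assumption:
# used only so this snippet runs without the CSV file).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL"],
        "favorite food": ["Steak", "Lamb", "Mango"],
        "count": [10, 4, 3],
    },
    index=pd.Index(["Jane", "Niko", "Aaron"], name="name"),
)

by_brackets = df["count"]  # the column, as a Series
by_dot = df.count          # the DataFrame.count method, not the column

print(type(by_brackets).__name__)
print(callable(by_dot))
```

The brackets return the data in every case, while dot notation silently returns something else whenever a column name shadows an existing attribute.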

### Minimally Sufficient Guiding Principle
If a method does not provide any additional functionality over another method (i.e. its functionality is a subset of another) then it shouldn't be used. Methods should only be considered if they have some additional, unique functionality.

**GUIDANCE** - Only use the brackets when selecting a single column of data.

## Select multiple columns
Selecting multiple columns is done with the brackets. Pass in all the columns you want to select as a list. Here we select state, age, and color.

In [None]:
df[['state', 'age', 'color']]

### For clarity, consider creating a list first
A common mistake when selecting multiple columns is to forget to put the column names in a list and write `df['state', 'age', 'color']`. This raises a `KeyError`, because Pandas looks for a single column whose name is that tuple. For clarity, consider creating the list first and then making the selection on a second line.

In [None]:
cols = ['state', 'age', 'color']
df[cols]
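The mistake mentioned above can be reproduced with a small sketch. The inline DataFrame is a hand-built stand-in for the sample data (an assumption, so the snippet runs without the CSV file), and `error_name` is a hypothetical variable used only to record what was raised.

```python
import pandas as pd

# Hand-built stand-in for part of the sample data (assumption).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL"],
        "color": ["blue", "green", "red"],
        "age": [30, 2, 12],
    },
    index=pd.Index(["Jane", "Niko", "Aaron"], name="name"),
)

error_name = None
try:
    # Without the inner list, the three names form one tuple key,
    # and no column has that tuple as its name.
    df["state", "age", "color"]
except KeyError as err:
    error_name = type(err).__name__

print(error_name)

# Building the list first makes the intent explicit.
cols = ["state", "age", "color"]
subset = df[cols]
print(list(subset.columns))
```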

## Selecting Rows and Columns Simultaneously with `loc` and `iloc`

Rows and columns in a Pandas DataFrame can be referenced in two ways - by either **label** or **integer location**. This dual reference is one of the reasons that subset selection is confusing for beginners. Pandas provides the indexers `loc` to handle selection by label and `iloc` for selection by integer location. Both are capable of simultaneously selecting rows and columns.

Let's select by label with `loc` the rows for Niko and Dean along with the columns age, favorite food and score. We create two separate lists for clarity.

In [None]:
rows = ['Niko', 'Dean']
cols = ['age', 'favorite food', 'score']
df.loc[rows, cols]

Now let's select by integer location with `iloc` the rows 1, 2, and 5 and the columns 0 and 4.

In [None]:
rows = [1, 2, 5]
cols = [0, 4]
df.iloc[rows, cols]
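To show how the two kinds of reference relate, here is a minimal sketch that selects the same cells once by label and once by integer location. The DataFrame is rebuilt inline from the table shown earlier (an assumption, so the snippet does not depend on the CSV file), and `by_label` and `by_position` are illustrative names.

```python
import pandas as pd

# Sample data rebuilt by hand from the table above (assumption).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "color": ["blue", "green", "red", "white", "gray", "black", "red"],
        "favorite food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "height": [165, 70, 120, 80, 180, 172, 150],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
        "count": [10, 4, 3, 12, 8, 99, 44],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

# By label: rows Niko and Dean, columns age, favorite food and score.
by_label = df.loc[["Niko", "Dean"], ["age", "favorite food", "score"]]

# By integer location: Niko and Dean are rows 1 and 4;
# age, favorite food and score are columns 3, 2 and 5.
by_position = df.iloc[[1, 4], [3, 2, 5]]

print(by_label)
print(by_position)
```

Both selections pick out the same rows and columns. The only difference is whether each axis is addressed by label or by integer location.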

## The deprecated `ix` indexer - never use it

The `ix` indexer was created before `loc` and `iloc` and could select data by both label and integer location. Although versatile, it was ambiguous, because labels can be integers as well as strings. To remove this ambiguity, the explicit `loc` and `iloc` indexers were created. `ix` was deprecated in Pandas 0.20 and removed in Pandas 1.0.

**GUIDANCE** - Every trace of `ix` should be removed and replaced with `loc` or `iloc`. 
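The ambiguity is easiest to see with integer row labels. The sketch below uses a hypothetical DataFrame, `df_int`, that is not part of the sample data. Its labels are integers in a different order from the row positions, so "row 0" means two different things.

```python
import pandas as pd

# Hypothetical frame: integer labels that do not match the row positions.
df_int = pd.DataFrame({"value": ["a", "b", "c"]}, index=[2, 0, 1])

by_label = df_int.loc[0, "value"]   # the row whose label is 0
by_position = df_int.iloc[0, 0]     # the first row, whatever its label

print(by_label, by_position)  # b a
```

With `loc` and `iloc` the reader can tell from the code which meaning was intended. A single indexer that accepted both had to guess.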

### What happens if you need to select by both integer location and label simultaneously
It's very rare that you will need to select by both integer location and label simultaneously. Let's see an example. If you are selecting rows 1, 2, and 5 along with the columns age, favorite food and score, you can call one indexer after another. This is called chained indexing. It works for reading data, as here, but should be avoided when assigning values, because the assignment may be applied to a temporary copy rather than to the original DataFrame.

In [None]:
rows = [1, 2, 5]
cols = ['age', 'favorite food', 'score']
df.iloc[rows, :].loc[:, cols]
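If you would rather use a single indexer than the chained call above, one option is to translate one kind of reference into the other first. The sketch below is one possible approach, not the only one. It uses `Index.get_indexer` to turn column labels into integer locations, and plain indexing of `df.index` to turn row locations into labels. The inline DataFrame is a reduced stand-in for the sample data (an assumption).

```python
import pandas as pd

# Reduced stand-in for the sample data (assumption).
df = pd.DataFrame(
    {
        "state": ["NY", "TX", "FL", "AL", "AK", "TX", "TX"],
        "age": [30, 2, 12, 4, 32, 33, 69],
        "favorite food": ["Steak", "Lamb", "Mango", "Apple", "Cheese", "Melon", "Beans"],
        "score": [4.6, 8.3, 9.0, 3.3, 1.8, 9.5, 2.2],
    },
    index=pd.Index(
        ["Jane", "Niko", "Aaron", "Penelope", "Dean", "Christina", "Cornelia"],
        name="name",
    ),
)

rows = [1, 2, 5]
cols = ["age", "favorite food", "score"]

# Option 1: convert the column labels to integer locations, then use iloc.
# get_indexer returns -1 for any label it cannot find.
col_positions = df.columns.get_indexer(cols)
via_iloc = df.iloc[rows, col_positions]

# Option 2: convert the row locations to labels, then use loc.
row_labels = df.index[rows]
via_loc = df.loc[row_labels, cols]

print(via_iloc)
```

Either option keeps the selection in one indexer call, which also makes it safe to use on the left-hand side of an assignment.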

### Selecting with `at` and `iat` 
Two additional indexers, `at` and `iat`, select a single cell of a DataFrame. They provide a slight performance advantage over their analogous `loc` and `iloc` indexers, but they introduce the additional burden of having to remember what they do. For most data analysis, the gain in performance is not useful unless the selection is repeated at scale. And if performance truly is an issue, you can extract the data as a NumPy array and index that directly, as the timings below show.

In [9]:
import numpy as np

In [10]:
a = np.random.rand(10 ** 5, 5)
df1 = pd.DataFrame(a)

In [11]:
row = 50000
col = 3

In [13]:
%timeit df1.iloc[row, col]

13.8 µs ± 3.36 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [14]:
%timeit df1.iat[row, col]

7.36 µs ± 927 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [15]:
%timeit a[row, col]

232 ns ± 8.72 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)


In [16]:
13800 / 232

59.48275862068966

**GUIDANCE** - There really is no need to use `at` and `iat`. If you do need better performance, use the underlying NumPy array.
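As a minimal sketch of that last suggestion, the snippet below pulls the underlying array out of a DataFrame with `DataFrame.to_numpy()` and indexes it directly. The smaller array size and the seeded random generator are assumptions made to keep the example quick and reproducible. They are not the setup used for the timings above.

```python
import numpy as np
import pandas as pd

# Smaller, seeded stand-in for the random data timed above (assumption).
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.random((1000, 5)))

row, col = 500, 3

# Extract the array once, then index it as many times as needed.
arr = df1.to_numpy()
from_array = arr[row, col]

# The same cell through the Pandas indexers, for comparison.
from_iat = df1.iat[row, col]
from_iloc = df1.iloc[row, col]

print(from_array == from_iat == from_iloc)
```

The extraction has its own cost, so this only pays off when many single-cell lookups follow. For a DataFrame with mixed column types, `to_numpy()` may also need to copy the data into a common dtype.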