# Selecting Data from a Pandas DataFrame

This notebook uses very simple data in order to focus on the various data selection operations.  Later notebooks use real-world data.

## Overview

Pandas data selection can be summarized as:
* **df\[columns\]** provides Python dictionary-like access to the columns.
* **df.loc\[row_labels, col_labels\]** provides access by row and column labels.
* **df.iloc\[row_positions, col_positions\]** provides access by row and column positions.

**df.iloc\[row_positions, col_positions\]** is almost identical to indexing numpy 2D arrays.

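Additionally, masking (using boolean values to pick out the rows or columns where the value is True) is supported by each of the above indexing operators: **df\[\]** masks rows only, while **df.loc\[\]** and **df.iloc\[\]** mask both rows and columns (**df.iloc\[\]** takes a boolean list rather than a boolean Series).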
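In practice a mask is usually computed from the data rather than typed by hand. The following is a minimal sketch of that idea; the name `demo_df` and its values are assumptions chosen only for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
demo_df = pd.DataFrame(
    data=[[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# A comparison on a column yields a boolean Series indexed like demo_df.index
row_mask = demo_df['Col1'] > 5

# Rows where the mask is True are kept
selected_rows = demo_df[row_mask]
print(selected_rows)

# A comparison on a row yields a boolean Series indexed like demo_df.columns
col_mask = demo_df.loc['Row1'] > 0

# Columns where the mask is True are kept
selected_cols = demo_df.loc[:, col_mask]
print(selected_cols)
```

Here `row_mask` keeps Row2 and Row3, and `col_mask` keeps Col2 and Col3.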

## Numpy

Basic knowledge of numpy is often considered a prerequisite to learning Pandas. Pandas is built on top of numpy and its indexing is closely related to numpy's indexing.

For an excellent free introductory YouTube presentation on numpy, see: [Enthought Scipy 2018: Numpy](https://www.youtube.com/watch?v=V0D2mhVt7NE)

The resources for this presentation are at: [Enthought Scipy 2018: Numpy Github Repo](https://github.com/enthought/Numpy-Tutorial-SciPyConf-2018).  

The resources include "slides.pdf", which offers an excellent illustrated introduction to numpy on pages 17 through 64.

## Recommendations

I highly recommend the excellent blog post: [**Minimally Sufficient Pandas**](https://medium.com/dunder-data/minimally-sufficient-pandas-a8e67f2a2428)

Here I will repeat two of Ted Petrou's recommendations from that post.
* do not use dot notation to select columns
* do not use inplace=True

Following these recommendations makes it easier to write correct code.
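To make the two recommendations concrete, here is a small sketch of the kind of problem each one avoids. The frame `demo_df` and its column names are hypothetical, picked to show a name collision and a name containing a space.

```python
import pandas as pd

# Hypothetical frame: 'count' collides with a DataFrame method name,
# and 'unit price' contains a space
demo_df = pd.DataFrame({'count': [3, 1, 2], 'unit price': [9.5, 4.0, 7.25]})

# Dot notation resolves to the DataFrame.count method, not the column
dot_is_method = callable(demo_df.count)
print(dot_is_method)

# Bracket notation always reaches the column
count_values = demo_df['count'].tolist()
print(count_values)

# A label containing a space can only be written with bracket notation
price_values = demo_df['unit price'].tolist()
print(price_values)

# Instead of inplace=True, assign the returned object to a name
sorted_df = demo_df.sort_values('count')
print(sorted_df)
```

Assigning the result keeps the original `demo_df` unchanged, which makes each step easier to inspect and to chain.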

For exploratory data analysis in a single notebook, neither of the above recommendations is necessary. However, when the code is intended for more complex usage, such as preparing data for Machine Learning, or is intended for production use and will need to be maintained, these recommendations reduce potential problems.

If you are on a team of data scientists, there is likely a written set of guidelines for how your team should use Pandas. If so, those guidelines should be followed.

## Selection using **df\[filter\]**

Selects columns when filter is:
1. a single column label (which is in df.columns)
2. a list of column labels (all of which are in df.columns)

Selects rows when filter is:
3. a slice representing row positions
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Examples of each of the above 5 use cases follow.

### Examples

In [1]:
import pandas as pd
import numpy as np

In [2]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


#### 1. Single Column Label

In [3]:
df['Col1']

Row1     0
Row2    10
Row3    20
Name: Col1, dtype: int64

In [4]:
isinstance(df['Col1'], pd.Series)

True

#### 2. List of Column Labels

In [5]:
df[['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


In [6]:
isinstance(df[['Col1', 'Col3']], pd.DataFrame)

True

#### 3. Slice Referring to Row Positions

In [7]:
df[1:]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12
Row3,20,21,22


#### 4. Boolean Series with Matching Index

In [8]:
row_boolean_series = pd.Series([False, True, False], index=['Row1', 'Row2', 'Row3'])
df[row_boolean_series]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


#### 5. List of Boolean Values

In [9]:
df[[False, True, False]]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


## Selection using **df.loc\[row, col\]**

Where row is:
1. a single row label
2. a list of row labels
3. a slice matching row labels in order, inclusive of the slice end
4. a boolean Series (with index matching df.index)
5. a list of boolean values

Where col is:
6. a single col label
7. a list of col labels
8. a slice matching col labels in order, inclusive of the slice end
9. a boolean Series (with index matching df.columns)
10. a list of boolean values

**df.loc\[rows\]** is the same as: **df.loc\[rows, : \]**  

Examples of each of the above 10 use cases follow.

### Examples

#### 1a. Single Row Label, result is Series

In [10]:
df.loc['Row2']

Col1    10
Col2    11
Col3    12
Name: Row2, dtype: int64

#### 1b. List of a Single Row Label, result is DataFrame

In [11]:
df.loc[['Row2']]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


#### 2. List of Row Labels

In [12]:
df.loc[['Row1', 'Row3']]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


#### 3a. Slice of Row Labels

In [13]:
df.loc['Row1':'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


#### 3b. Slice of Row Labels with Open Start

In [14]:
df.loc[:'Row2']

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


#### 4. Boolean Series with Matching Index

In [15]:
# Pandas will align the index
idx = ['Row2', 'Row1', 'Row3']

# indexes contain the same values
print(f'indexes have same values: {set(idx) == set(df.index)}')
print()

data = [False, True, False]
row_series = pd.Series(data=data, index=idx)
print('Boolean Series:')
print(row_series)

print('\nSelected Data:')
df.loc[row_series]

indexes have same values: True

Boolean Series:
Row2    False
Row1     True
Row3    False
dtype: bool

Selected Data:


Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2


#### 5. List of Boolean Values

In [16]:
# there is no index to align
selection = [False, True, False]
df.loc[selection]

Unnamed: 0,Col1,Col2,Col3
Row2,10,11,12


#### 6a. Single Column Label, result is Series

In [17]:
df.loc[:'Row2', 'Col2']

Row1     1
Row2    11
Name: Col2, dtype: int64

#### 6b. List of a Single Column Label, result is DataFrame

In [18]:
df.loc[:'Row2', ['Col2']]

Unnamed: 0,Col2
Row1,1
Row2,11


#### 7. List of Column Labels

In [19]:
df.loc[:'Row2', ['Col1', 'Col3']]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12


#### 8a. Slice of Row Labels and Slice of Column Labels

In [20]:
df.loc['Row1':'Row2', 'Col2':]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


#### 8b. Same as Previous Example, but Using an Explicit Python Slice Object

In [21]:
# Reminder: with .loc[] and only with .loc[], slice is inclusive of its end point
row_slice = slice('Row1', 'Row2')
col_slice = slice('Col2', None)
df.loc[row_slice, col_slice]

Unnamed: 0,Col2,Col3
Row1,1,2
Row2,11,12


#### 8c. Slice Order Matters, No Rows Match Row2 followed by Row1

In [22]:
df.loc['Row2':'Row1', 'Col2':]

Unnamed: 0,Col2,Col3


#### 9. Boolean Series with Matching Index

In [23]:
# Pandas will align the column index
idx = ['Col2', 'Col1', 'Col3']

# show that the indexes contain the same values
print(f'indexes have same values: {set(idx) == set(df.columns)}')

print('\nBoolean Series:')
data = [False, True, False]
col_series = pd.Series(data=data, index=idx)
print(col_series)

print('\nSelected Data:')
df.loc[:,col_series]

indexes have same values: True

Boolean Series:
Col2    False
Col1     True
Col3    False
dtype: bool

Selected Data:


Unnamed: 0,Col1
Row1,0
Row2,10
Row3,20


#### 10. List of Boolean Values

In [24]:
# there are no indexes to align
df.loc[:, [True, False, True]]

Unnamed: 0,Col1,Col3
Row1,0,2
Row2,10,12
Row3,20,22


#### KeyError if Label is not in Column or Row Index

In [25]:
# it is a KeyError if the label is not in the column index
print('Col99' in df.columns)
try:
    df.loc[:, 'Col99']
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Col99'


In [26]:
# it is a KeyError if the label is not in the row index
print('Row99' in df.index)
try:
    df.loc['Row99', :]
except KeyError as err:
    print(f'KeyError: {err}')

False
KeyError: 'Row99'
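When the labels come from outside the notebook (user input, a config file), one way to avoid the KeyError is to keep only the labels that are actually present before selecting. This is a minimal sketch of that approach, not the only option; `demo_df` and the `wanted` list are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical sample data, used only for this illustration
demo_df = pd.DataFrame(
    data=[[0, 1, 2], [10, 11, 12], [20, 21, 22]],
    columns=['Col1', 'Col2', 'Col3'],
    index=['Row1', 'Row2', 'Row3'],
)

# Requested labels, one of which does not exist
wanted = ['Col1', 'Col99']

# Keep only the labels found in the column index, preserving request order
present = [label for label in wanted if label in demo_df.columns]
missing = [label for label in wanted if label not in demo_df.columns]

subset = demo_df.loc[:, present]
print(f'missing labels: {missing}')
print(subset)
```

Reporting the missing labels, rather than dropping them silently, makes it easier to notice a typo in a label.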


#### Default Index Example: Row Label vs Row Position
The default index is a range of integers that match the row positions.

**Important Note:**   
Even though the default index initially matches the row position, **df.loc\[\]** is selecting "row labels", not "row positions".

In [27]:
# create DataFrame with default index
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]
columns = ['Col1', 'Col2', 'Col3']

df = pd.DataFrame(data=data, columns=columns)
df

Unnamed: 0,Col1,Col2,Col3
0,0,1,2
1,10,11,12
2,20,21,22


In [28]:
# index is concisely represented as a RangeIndex
df.index

RangeIndex(start=0, stop=3, step=1)

In [29]:
# df.index is a subclass of pd.Index
isinstance(df.index, pd.Index)

True

In [30]:
# matches row *labels*, 1 thru 2 inclusive
df.loc[1:2, 'Col1':'Col2']

Unnamed: 0,Col1,Col2
1,10,11
2,20,21


In [31]:
# matches row *positions*, 1 thru 2 exclusive
# Better to use .iloc[] to avoid confusion
df[1:2]

Unnamed: 0,Col1,Col2,Col3
1,10,11,12


In [32]:
df.index = df.index + 3
df

Unnamed: 0,Col1,Col2,Col3
3,0,1,2
4,10,11,12
5,20,21,22


In [33]:
# row *labels* 3 to 4 inclusive
df.loc[3:4, ['Col1', 'Col2']]

Unnamed: 0,Col1,Col2
3,0,1
4,10,11


In [34]:
# matches row *positions*, 1 to 2 exclusive
df[1:2]

Unnamed: 0,Col1,Col2,Col3
4,10,11,12


In [35]:
# clearer code
df.iloc[1:2, :]

Unnamed: 0,Col1,Col2,Col3
4,10,11,12


## Selection using **df.iloc\[rows, cols\]**

Where row is:
1. a single row position
2. a list of row positions
3. a slice of row positions
4. a boolean list

Where col is:
1. a single col position
2. a list of col positions
3. a slice of col positions
4. a boolean list

Slice works as it normally does for Python.  That is, the end point is exclusive.

**.iloc\[rows\]** is the same as: **.iloc\[rows, :\]**  
    
**All of the above works the same in numpy.** 
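As a quick check of the comparison with numpy, the sketch below applies each kind of selector to a DataFrame and to the numpy array it was built from. The names `arr` and `demo_df` are assumptions for illustration. The last lines show one difference I am aware of: when lists are given for rows and columns at the same time, **.iloc\[\]** returns the block of selected rows and columns, while numpy pairs the two lists element by element.

```python
import numpy as np
import pandas as pd

# Hypothetical array and a DataFrame built from it, for illustration only
arr = np.array([[0, 1, 2],
                [10, 11, 12],
                [20, 21, 22]])
demo_df = pd.DataFrame(arr,
                       columns=['Col1', 'Col2', 'Col3'],
                       index=['Row1', 'Row2', 'Row3'])

# Boolean mask as a numpy array, usable on both sides
bool_rows = np.array([True, False, True])

# Single position, list of positions, slice, and boolean mask
same_single = np.array_equal(demo_df.iloc[1, :].to_numpy(), arr[1, :])
same_list = np.array_equal(demo_df.iloc[[0, 2], :].to_numpy(), arr[[0, 2], :])
same_slice = np.array_equal(demo_df.iloc[0:2, 1:].to_numpy(), arr[0:2, 1:])
same_mask = np.array_equal(demo_df.iloc[bool_rows, :].to_numpy(), arr[bool_rows, :])
print(same_single, same_list, same_slice, same_mask)

# Lists for both rows and columns: .iloc[] gives a 2x2 block,
# numpy gives the elements at (0, 0) and (2, 2)
block = demo_df.iloc[[0, 2], [0, 2]].to_numpy()
pairs = arr[[0, 2], [0, 2]]
print(block.shape, pairs.shape)
```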

### Examples

In [36]:
columns = ['Col1', 'Col2', 'Col3']
index = ['Row1', 'Row2', 'Row3']
data = [[ 0, 1, 2],
       [ 10, 11, 12],
       [ 20, 21, 22]]

df = pd.DataFrame(data=data, columns=columns, index=index)
df

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12
Row3,20,21,22


#### 1. Specify Single Row Position

In [37]:
df.iloc[1,:]

Col1    10
Col2    11
Col3    12
Name: Row2, dtype: int64

#### 2. Specify List of Row Positions

In [38]:
# returns a dataframe
df.iloc[[0,2], :]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22


#### 3a. Specify Slice of Row Positions

In [39]:
df.iloc[0:2, :]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


#### 3b. Specify Slice of Row Positions using Slice Object

In [40]:
# same but with explicit slice object
row_slice = slice(0,2)
col_slice = slice(None, None)

In [41]:
# returns a dataframe
df.iloc[row_slice, col_slice]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row2,10,11,12


#### 4. Specify List of Boolean Values (mask)

In [42]:
df.iloc[[True, False, True], :]

Unnamed: 0,Col1,Col2,Col3
Row1,0,1,2
Row3,20,21,22
