## Chapter 4: Boolean indexing of dataframes

This chapter explains how to access rows in a DataFrame through its indexer objects ```.ix```, ```.loc``` and ```.iloc```, and how that differs from selecting rows with a boolean mask.

### Accessing a DataFrame with a boolean index

In [2]:
import pandas as pd

Create a DataFrame:

In [5]:
df = pd.DataFrame({"color": ['red', 'blue', 'red', 'blue']},
    index=[True, False, True, False])
df

,color
True,red
False,blue
True,red
False,blue


Accessing with ```.loc```:

In [6]:
df.loc[True]

,color
True,red
True,red


Accessing with ```.iloc```:

In [7]:
df.iloc[1]

color    blue
dtype: object

Accessing with ```.ix``` (deprecated since pandas 0.20 and removed in pandas 1.0, so these cells only run on older versions):

In [8]:
df.ix[True]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#ix-indexer-is-deprecated


,color
True,red
True,red


In [9]:
df.ix[1]


color    blue
dtype: object
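
Because ```.ix``` is deprecated (and absent from current pandas releases), the lookups above can be reproduced with ```.loc``` and ```.iloc``` alone. The sketch below also contrasts a label lookup with a boolean mask on the same boolean-indexed DataFrame. The mask ```[False, True, False, False]``` and the variable names are made up for this illustration and do not come from the cells above.

```python
import pandas as pd

df = pd.DataFrame({"color": ['red', 'blue', 'red', 'blue']},
                  index=[True, False, True, False])

# Label lookup: the scalar True is matched against the index labels,
# so both rows labelled True come back.
by_label = df.loc[True]

# Positional lookup: the second row, whatever its label is.
by_position = df.iloc[1]

# Boolean mask: a list of booleans is applied position by position
# and is not matched against the index labels.
by_mask = df.loc[[False, True, False, False]]

print(by_label)
print(by_position)
print(by_mask)
```

The mask keeps only the second row, which carries the label False. This shows that a list of booleans is read as "keep or drop this position", not as "find this label".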

### Applying a boolean mask to a dataframe

Create a DataFrame:

In [10]:
df = pd.DataFrame({"color": ['red', 'blue', 'red', 'blue'],
                  "name": ['rose', 'violet', 'tulip', 'harebell'],
                  "size": ['big', 'big', 'small', 'small']
                  })
df

,color,name,size
0,red,rose,big
1,blue,violet,big
2,red,tulip,small
3,blue,harebell,small


The ```[]``` accessor (the ```__getitem__``` magic method) also accepts a list of True and False values with the same length as the DataFrame. It returns only the rows for which the list holds True:

In [11]:
df[[True, False, True, False]]

,color,name,size
0,red,rose,big
2,red,tulip,small
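
A list mask is matched to the rows purely by position, so its length matters. As a small illustration (the variable names are made up for this sketch), a mask that is too short is rejected with a ```ValueError``` rather than silently padded:

```python
import pandas as pd

df = pd.DataFrame({"color": ['red', 'blue', 'red', 'blue'],
                   "name": ['rose', 'violet', 'tulip', 'harebell'],
                   "size": ['big', 'big', 'small', 'small']})

# One boolean per row: keeps the rows at positions 0 and 2.
selected = df[[True, False, True, False]]

# A mask with the wrong number of entries cannot be lined up with the rows.
try:
    df[[True, False]]
    too_short_rejected = False
except ValueError:
    too_short_rejected = True

print(selected)
print(too_short_rejected)
```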


### Masking data based on column value

Let's use the same DataFrame as in the previous section.

When we access a single column of a DataFrame, a simple ```==``` comparison checks every element of the column against the given value and produces a ```pd.Series``` of True and False:

In [12]:
df['size'] == 'small'

0    False
1    False
2     True
3     True
Name: size, dtype: bool

A boolean ```pd.Series``` is list-like, just as an ```np.array``` or a plain ```list``` of booleans is, although it is not a subclass of either. We can therefore hand it to the ```[]``` accessor (```__getitem__```) as in the example above:

In [13]:
size_small_mask = df['size'] == 'small'
df[size_small_mask]

,color,name,size
2,red,tulip,small
3,blue,harebell,small
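
To illustrate that the ```[]``` accessor treats these mask types alike, the following sketch builds the same mask as a ```pd.Series```, as a NumPy array and as a plain list, then checks that all three select the same rows. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"color": ['red', 'blue', 'red', 'blue'],
                   "name": ['rose', 'violet', 'tulip', 'harebell'],
                   "size": ['big', 'big', 'small', 'small']})

mask_series = df['size'] == 'small'   # pd.Series of booleans
mask_array = mask_series.to_numpy()   # NumPy array of booleans
mask_list = mask_array.tolist()       # plain Python list of booleans

by_series = df[mask_series]
by_array = df[mask_array]
by_list = df[mask_list]

# All three selections contain the same rows.
print(by_series.equals(by_array) and by_array.equals(by_list))
```

One difference is worth knowing: a Series mask is aligned on its index labels before it is applied, whereas a list or an array is applied strictly by position.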


### Masking data based on index value

Let's create a DataFrame:

In [21]:
df = pd.DataFrame({"color": ['red', 'blue', 'red', 'blue'],
                  "size": ['big', 'big', 'small', 'small']},
                   index=['rose', 'violet', 'tulip', 'harebell']
                  )

df.index.name='name'
df

name,color,size
rose,red,big
violet,blue,big
tulip,red,small
harebell,blue,small


We can create a mask based on the index values, just like on a column value:

In [22]:
rose_mask = df.index == 'rose'
df[rose_mask]

name,color,size
rose,red,big


This is almost the same as:

In [23]:
df.loc['rose']

color    red
size     big
Name: rose, dtype: object

An important difference: when the label matches exactly one row in the index, ```.loc``` returns a ```pd.Series```; when it matches several rows, it returns a ```pd.DataFrame```. The return type therefore depends on the data, which makes this form unpredictable. Passing ```.loc``` a list with a single entry forces it to return a DataFrame in every case:

In [24]:
df.loc[['rose']]

name,color,size
rose,red,big
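
The change in return type is easiest to see with an index that contains a repeated label. The two DataFrames below are hypothetical variants of the one above, one with a unique index and one in which 'rose' occurs twice; neither is part of the original notebook.

```python
import pandas as pd

unique_df = pd.DataFrame({"color": ['red', 'blue'],
                          "size": ['big', 'big']},
                         index=['rose', 'violet'])
duplicated_df = pd.DataFrame({"color": ['red', 'white', 'blue'],
                              "size": ['big', 'small', 'big']},
                             index=['rose', 'rose', 'violet'])

# Scalar label: the return type depends on how many rows match.
single_match = unique_df.loc['rose']
double_match = duplicated_df.loc['rose']
print(type(single_match).__name__)
print(type(double_match).__name__)

# List of labels: a DataFrame comes back even for a single match.
single_match_as_frame = unique_df.loc[['rose']]
double_match_as_frame = duplicated_df.loc[['rose']]
print(type(single_match_as_frame).__name__)
print(type(double_match_as_frame).__name__)
```

Code that has to work regardless of duplicates in the index is usually safer with the list form or with a boolean mask, since both keep the result a DataFrame.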
