In [1]:
import seaborn as sns
import pandas as pd
import numpy as np

# Pandas Indexing and Selecting

Let's talk about slicing and dicing pandas data. We'll go over four topics today:

* Review the basics
* Multi-index
* Getting Single Values
* Pointing out some stuff you don't need to worry about

As always you can check out the full documentation: [basic indexing](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) and [advanced indexing](http://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html). But be warned that they are very long and tell you way more than you'd need to know :)

## Review the Basics

First let's start with a bit of a recap on traditional indexing and selection. (We went over most of this in the [pandas fundamentals](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Pandas%20Intro%20to%20Data%20Structures.ipynb)). To start off with, here is the data we are going to be working with (good old tips data):

In [2]:
tips = sns.load_dataset('tips')
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


There are basically 5 ways to get data out of a dataframe:

In [3]:
# 1) get columns
tips[['total_bill', 'tip']].head()

Unnamed: 0,total_bill,tip
0,16.99,1.01
1,10.34,1.66
2,21.01,3.5
3,23.68,3.31
4,24.59,3.61


In [4]:
# 2) get some rows
tips[3:5]

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [5]:
# 3) select rows and columns based on their name
tips.loc[2:4, 'sex': 'smoker']

Unnamed: 0,sex,smoker
2,Male,No
3,Male,No
4,Female,No


In [6]:
# 4) select rows and columns by their position
tips.iloc[1:3, 0:2]

Unnamed: 0,total_bill,tip
1,10.34,1.66
2,21.01,3.5
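
One detail worth spelling out about ways 3 and 4: `.loc` slices by label and keeps both endpoints, while `.iloc` slices by position and drops the stop, just like a Python list. Here is a minimal sketch of that difference. It uses a small hand-built frame (`demo` is a hypothetical stand-in for the first rows of the tips data, so nothing needs to be downloaded).

```python
import pandas as pd

# Hypothetical stand-in for the first five rows of the tips data.
demo = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "sex": ["Female", "Male", "Male", "Male", "Female"],
    }
)

# .loc slices by label and includes both endpoints: rows 1, 2 and 3.
by_label = demo.loc[1:3, "total_bill":"tip"]

# .iloc slices by position and excludes the stop: rows 1 and 2 only.
by_position = demo.iloc[1:3, 0:2]

print(len(by_label), len(by_position))  # 3 2
```

With the default integer index the two look alike, which is exactly why the off-by-one is easy to miss.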


In [9]:
# 5) select using a bool series
tips[tips['tip'] > 1].head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4
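
Boolean selection also combines with `.loc`, so you can filter rows and pick columns in one step. The sketch below assumes a small hand-built frame (`demo` is hypothetical, not the real tips data) and shows two conditions joined with `&`.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the tips data.
demo = pd.DataFrame(
    {
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "sex": ["Female", "Male", "Male", "Male", "Female"],
    }
)

# Each condition needs its own parentheses; combine with & (and), | (or), ~ (not).
mask = (demo["tip"] > 2) & (demo["sex"] == "Male")

# .loc takes the boolean Series for the rows and a list of labels for the columns.
selected = demo.loc[mask, ["tip", "sex"]]
print(selected)
```

The parentheses matter because `&` binds tighter than `>` and `==` in Python.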


This is just the tip of the iceberg (well, actually it's 90% of the iceberg).

Still, there are a couple of other important concepts that you will most likely run into when diving into other pandas functionality.

## Multi-index

A subject that you might not think you'd need, but it turns out to be a rather frequent use case.

The initial idea behind the multi-index was to provide a framework for working with higher-dimensional data (and thus a replacement for panels).

But because common operations produce one as their result, it became quite commonplace. In almost all cases a multi-index comes from [groupby's](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Group%20Operations.ipynb) (you will almost never construct it or read it in yourself).

Let's do an example below:

In [10]:
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


In [11]:
mi_tips = tips.groupby(['sex', 'smoker']).agg({'tip': 'mean'})
mi_tips

sex,smoker,tip
Male,Yes,3.051167
Male,No,3.113402
Female,Yes,2.931515
Female,No,2.773519


In [12]:
mi_tips.index

MultiIndex(levels=[['Male', 'Female'], ['Yes', 'No']],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]],
           names=['sex', 'smoker'])

Ultimately there are a ton of operations that you can do on top of this type of data. And there are equivalent multi-index operations you can do, like this:

In [13]:
mi_tips.loc[('Male', 'No')]

tip    3.113402
Name: (Male, No), dtype: float64
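
To give a feel for those equivalent operations, here is a small sketch of two more multi-index selections: picking by the outer level with `.loc`, and picking by the inner level with `xs`. The `demo` frame and its values are made up for illustration and are not the real tips data.

```python
import pandas as pd

# Hypothetical data shaped like the tips columns used above.
demo = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
        "smoker": ["Yes", "No", "Yes", "No", "No", "No"],
        "tip": [3.0, 2.0, 4.0, 1.0, 4.0, 3.0],
    }
)
mi_demo = demo.groupby(["sex", "smoker"]).agg({"tip": "mean"})

# Outer level only: every 'Male' row, indexed by the remaining 'smoker' level.
male = mi_demo.loc["Male"]

# Inner level only: xs takes a cross-section on a named level.
non_smokers = mi_demo.xs("No", level="smoker")

print(male)
print(non_smokers)
```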

But that way you'd have to learn a lot of details, and there are always exceptions.

So the way that I have always dealt with this is simply by resetting the index.

In [14]:
ri_tips = mi_tips.reset_index()
ri_tips

Unnamed: 0,sex,smoker,tip
0,Male,Yes,3.051167
1,Male,No,3.113402
2,Female,Yes,2.931515
3,Female,No,2.773519


Notice how we get values spread out over the full column now. So in this way it is easy to select only the male non-smokers:

In [18]:
ri_tips[(ri_tips['smoker'] == 'No') & (ri_tips['sex'] == 'Male')]

Unnamed: 0,sex,smoker,tip
1,Male,No,3.113402
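
If you know up front that you want flat columns, one option is to skip the multi-index entirely by passing `as_index=False` to `groupby`. This is a sketch on a made-up frame (`demo` and its values are hypothetical), not a claim about how the tips results above were produced.

```python
import pandas as pd

# Hypothetical data shaped like the tips columns used above.
demo = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
        "smoker": ["Yes", "No", "Yes", "No", "No", "No"],
        "tip": [3.0, 2.0, 4.0, 1.0, 4.0, 3.0],
    }
)

# as_index=False keeps the group keys as ordinary columns,
# so no reset_index call is needed afterwards.
flat = demo.groupby(["sex", "smoker"], as_index=False)["tip"].mean()

male_non_smokers = flat[(flat["smoker"] == "No") & (flat["sex"] == "Male")]
print(male_non_smokers)
```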


Another way you can deal with this is to pull only certain index levels out:

In [19]:
ri0_tips = mi_tips.reset_index(level=0)
ri0_tips.loc['Yes']

smoker,sex,tip
Yes,Male,3.051167
Yes,Female,2.931515


And finally you can [pull columns back into the index](https://github.com/knathanieltucker/pandas-tutorial/blob/master/notebooks/Indexing%20and%20Selecting.ipynb) (basically only useful for certain types of merges).

In [20]:
ri_tips.set_index(['sex', 'smoker'])

sex,smoker,tip
Male,Yes,3.051167
Male,No,3.113402
Female,Yes,2.931515
Female,No,2.773519


In [21]:
ri0_tips.set_index('sex', append=True)

smoker,sex,tip
Yes,Male,3.051167
No,Male,3.113402
Yes,Female,2.931515
No,Female,2.773519
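
As one illustration of the merge case mentioned above, here is a sketch where a table indexed by `sex` and `smoker` is joined back onto the row-level data. The `demo` frame and the `group_mean_tip` column name are hypothetical. `DataFrame.join` with `on=[...]` matches the caller's columns against the index levels of the other frame.

```python
import pandas as pd

# Hypothetical row-level data shaped like the tips columns used above.
demo = pd.DataFrame(
    {
        "sex": ["Male", "Male", "Female", "Female", "Male", "Female"],
        "smoker": ["Yes", "No", "Yes", "No", "No", "No"],
        "tip": [3.0, 2.0, 4.0, 1.0, 4.0, 3.0],
    }
)

# Per-group means, indexed by (sex, smoker).
group_means = (
    demo.groupby(["sex", "smoker"])
    .agg({"tip": "mean"})
    .rename(columns={"tip": "group_mean_tip"})
)

# Match demo's 'sex' and 'smoker' columns against group_means' index levels.
with_means = demo.join(group_means, on=["sex", "smoker"])
print(with_means)
```

Each row keeps its own tip and gains the mean tip of its group, which is handy for things like computing deviations from the group average.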


## Getting Single Values

The next little indexing trick is mostly about speed: getting and setting single values. It is pretty simple:

In [37]:
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,6.0,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


When getting/setting single values you should use the `at` accessor (label-based) or its positional counterpart `iat`:

In [23]:
tips.at[0, 'total_bill'] = 9000
tips.head(3)

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,9000.0,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3


In [24]:
tips.iat[0, 0]

9000.0

If you are modifying single values of a dataframe you should always use `at` and `iat`. They are faster, and they are a good way to know that you are not messing up (oftentimes modifying the data can result in odd errors).
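
Here is a compact sketch of how the two accessors line up with each other and with `.loc`. It uses a small hand-built frame (`demo` is a hypothetical stand-in for the tips data).

```python
import pandas as pd

# Hypothetical stand-in for the first rows of the tips data.
demo = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "tip": [1.01, 1.66, 3.50],
    }
)

# .at is label-based: (row label, column label).
demo.at[0, "total_bill"] = 9000.0

# .iat is position-based: (row position, column position).
first_cell = demo.iat[0, 0]

# .loc reaches the same cell, just through a more general code path.
same_cell = demo.loc[0, "total_bill"]

print(first_cell, same_cell)  # 9000.0 9000.0
```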

So just to prove it's faster let's time it!

In [25]:
%%timeit
tips.at[0, 'total_bill'] = 6

5.85 µs ± 96.6 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [26]:
%%timeit
# .loc takes the row label first, then the column label
tips.loc[0, 'total_bill'] = 6

304 µs ± 8.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
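
Outside a notebook, where the `%%timeit` magic is not available, roughly the same comparison can be sketched with the standard-library `timeit` module. The `demo` frame and the helper function names are hypothetical, and the numbers will vary with your machine and pandas version, so no expected timings are shown.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for the first rows of the tips data.
demo = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01],
        "tip": [1.01, 1.66, 3.50],
    }
)


def set_with_at():
    # Scalar assignment through the fast single-value accessor.
    demo.at[0, "total_bill"] = 6.0


def set_with_loc():
    # The same assignment through the general-purpose indexer.
    demo.loc[0, "total_bill"] = 6.0


n = 1000
at_seconds = timeit.timeit(set_with_at, number=n)
loc_seconds = timeit.timeit(set_with_loc, number=n)

print(f"at:  {at_seconds / n * 1e6:.2f} µs per call")
print(f"loc: {loc_seconds / n * 1e6:.2f} µs per call")
```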


## Where, Masks and Queries

These are things built into pandas that I have personally never used, mostly because they are pretty redundant and the need for them doesn't come up too often.

They are a bit faster, yes. But the mental space is probably not worth it. So if you wanna learn it, go for it (docs are [here](http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#the-query-method)). If not, probably won't matter.

Let me show you how you'd duplicate mask functionality below. 

In [27]:
df = pd.DataFrame(np.random.randn(25).reshape((5, 5)))
df.head()

Unnamed: 0,0,1,2,3,4
0,-1.438781,0.584173,-0.694112,0.135304,0.409292
1,-2.203219,1.232487,1.284779,-2.460982,-0.855321
2,-0.827212,-0.293645,-0.679745,0.209145,-0.402497
3,0.471747,1.141361,0.429878,2.29084,-0.655701
4,-1.944334,0.186785,1.031003,-0.633808,0.413554


In [28]:
df.where(df > 0)

Unnamed: 0,0,1,2,3,4
0,,0.584173,,0.135304,0.409292
1,,1.232487,1.284779,,
2,,,,0.209145,
3,0.471747,1.141361,0.429878,2.29084,
4,,0.186785,1.031003,,0.413554


In [29]:
# np.nan is the portable spelling; the np.NaN alias was removed in NumPy 2.0
df[df < 0] = np.nan
df

Unnamed: 0,0,1,2,3,4
0,,0.584173,,0.135304,0.409292
1,,1.232487,1.284779,,
2,,,,0.209145,
3,0.471747,1.141361,0.429878,2.29084,
4,,0.186785,1.031003,,0.413554
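
For completeness, here is a short sketch of the other two methods named in this section's title, `mask` and `query`. It runs on a tiny made-up frame (`demo` and its column names `a` and `b` are hypothetical).

```python
import pandas as pd

# Hypothetical frame with a mix of positive and negative values.
demo = pd.DataFrame({"a": [-1.0, 2.0, -3.0], "b": [4.0, -5.0, 6.0]})

# where keeps values where the condition is True and puts NaN elsewhere.
kept_positive = demo.where(demo > 0)

# mask is the inverse: it puts NaN where the condition is True.
hidden_non_positive = demo.mask(demo <= 0)

# Both routes give the same frame (equals treats matching NaNs as equal).
print(kept_positive.equals(hidden_non_positive))  # True

# query filters rows with a string expression over the column names.
rows = demo.query("a > 0 and b < 0")
print(rows)
```

The `query` call is the same as `demo[(demo["a"] > 0) & (demo["b"] < 0)]`, which is why plain boolean indexing covers most needs.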


## Conclusion

So that's it. This is really all I know about indexing and probably all you'll need to know too. If you've got any questions or comments, please add them!

p.s. there are not really any great tutorials on this in particular, but if you know of one I should link, let me know.