![alt text](http://pandas.pydata.org/_static/pandas_logo.png)

<center><h1> PANDAS TABLE OF CONTENTS </h1></center>

## [Indexing and Selecting Date](http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-and-selecting-data) 



- [Basics](#Basics)
- [Attribute Access](#Attribute-Access)
- [Slicing Ranges](#Slicing-Ranges)
- [Selection By Label](#Selection-by-Label)
- [Selection By Position](#Selection-By-Position)
- [Selection By Callable](#Selection-By-Callable)
- [Selecting Random Samples](#Selecting-Random-Samples)
- [Setting With Enlargement](#Setting-With-Enlargement)
- [Fast scalar value getting and setting](#Fast-scalar-value-getting-and-setting)
- [Boolean indexing](#Boolean-indexing)
- [Indexing with isin](#Indexing-with-isin)
- [The where() Method and Masking](#The-where-Method-and-Masking)
- [The query() Method (Experimental)](#The-query-Method-(Experimental))
- [MultiIndex query() Syntax](#MultiIndex-query-Syntax)
- [query() Use Cases](#query-Use-Cases)
- [query() Python versus pandas Syntax Comparison](#query-Python-versus-pandas-Syntax-Comparison)
- [Special use-of the == operator with list objects](#Special-use-of-the-==-operator-with-list-objects)
- [Boolean Operators](#Boolean-Operators)
- [Duplicate Data](#Duplicate-Data)
- [Dictionary like get() method](#Dictionary-like-get-method)
- [The select() Method](#The-select-Method)
- [The lookup() Method](#The-lookup-Method)
- [Set / Reset Index](#Set-/-Reset-Index)
- [Returning a view versus a copy](#Returning-a-view-versus-a-copy)



In [1]:
import pandas as pd
import numpy as np

## Basics
As mentioned when introducing the data structures in the last section, the primary function of indexing with [] (a.k.a. `__getitem__` for those familiar with implementing class behavior in Python) is selecting out lower-dimensional slices. Thus,

|Object Type|Selection|Return Value Type|
|---|:---:|---|
|Series|series[label]|scalar value|
|DataFrame|frame[colname]|Series corresponding to colname|



[[back to top](#Indexing-and-Selecting-Date)]
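
As a quick check of the table above, here is a minimal sketch using a small made-up Series and DataFrame. The names `prices` and `frame` are illustrative only and do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical data, used only to illustrate what [] returns.
prices = pd.Series([1.5, 2.5, 3.5], index=['a', 'b', 'c'])
frame = pd.DataFrame({'x': [1, 2, 3], 'y': [4.0, 5.0, 6.0]})

scalar = prices['b']   # Series[label] -> scalar value
column = frame['x']    # DataFrame[colname] -> Series named after the column

print(type(scalar).__name__)
print(type(column).__name__)
```

In both cases the result has one dimension fewer than the object being indexed.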

Here we construct a simple time series data set to use for illustrating the indexing functionality:

In [2]:
dates = pd.date_range('1/1/2000', periods = 8)

In [3]:
df = pd.DataFrame(np.random.randn(8, 4), index = dates, columns = ['A','B','C','D'])

In [4]:
df

Unnamed: 0,A,B,C,D
2000-01-01,0.292414,-0.303327,-0.500265,-0.709743
2000-01-02,-1.559403,2.393151,-0.082192,-0.290227
2000-01-03,-0.382275,0.032127,0.396081,0.012956
2000-01-04,1.212415,2.15118,0.499622,-0.611444
2000-01-05,-0.879824,-0.507925,-1.322001,0.299866
2000-01-06,0.162584,-0.751469,-1.501356,-1.974974
2000-01-07,2.233168,-1.189214,-1.028462,-1.974327
2000-01-08,0.760645,-0.723636,-0.837531,-1.10227


Thus, as per above, we have the most basic indexing using []:

In [5]:
s = df['A']

In [6]:
s

2000-01-01    0.292414
2000-01-02   -1.559403
2000-01-03   -0.382275
2000-01-04    1.212415
2000-01-05   -0.879824
2000-01-06    0.162584
2000-01-07    2.233168
2000-01-08    0.760645
Freq: D, Name: A, dtype: float64

In [7]:
s[dates[5]]

0.16258425929425141

In [8]:
s[5]

0.16258425929425141

You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an exception will be raised. Multiple columns can also be set in this manner:

In [9]:
df[['B','A']] = df[['A','B']]

In [10]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227
2000-01-03,0.032127,-0.382275,0.396081,0.012956
2000-01-04,2.15118,1.212415,0.499622,-0.611444
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227
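
The list-of-columns behavior described above can be sketched on a tiny frame. This is an illustration under assumed data: the frame and the column name 'Z' are hypothetical, with 'Z' chosen because it does not exist.

```python
import pandas as pd

# Hypothetical frame; 'Z' is deliberately not one of its columns.
frame = pd.DataFrame({'A': [1, 2], 'B': [3, 4], 'C': [5, 6]})

# Columns come back in the order requested, not in the stored order.
reordered = frame[['C', 'A']]

# Asking for a column that is not present raises KeyError.
try:
    frame[['A', 'Z']]
    missing_raised = False
except KeyError:
    missing_raised = True

# Setting with a list on both sides assigns position by position,
# which swaps the contents of A and B.
frame[['B', 'A']] = frame[['A', 'B']]
```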


# Attribute Access
You may access an index on a Series, column on a DataFrame, and an item on a Panel directly as an attribute:

[[back to top](#Indexing-and-Selecting-Date)]

In [11]:
sa = pd.Series([1,2,3], index = list('abc'))

In [12]:
sa

a    1
b    2
c    3
dtype: int64

In [13]:
dfa = df.copy()

In [14]:
dfa

Unnamed: 0,A,B,C,D
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227
2000-01-03,0.032127,-0.382275,0.396081,0.012956
2000-01-04,2.15118,1.212415,0.499622,-0.611444
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227


In [15]:
sa.b

2

In [16]:
dfa.A

2000-01-01   -0.303327
2000-01-02    2.393151
2000-01-03    0.032127
2000-01-04    2.151180
2000-01-05   -0.507925
2000-01-06   -0.751469
2000-01-07   -1.189214
2000-01-08   -0.723636
Freq: D, Name: A, dtype: float64

You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if you try to use attribute access to create a new column, it fails silently, creating a new attribute rather than a new column.

In [17]:
sa.a = 5

In [18]:
sa

a    5
b    2
c    3
dtype: int64

In [19]:
dfa.A = list(range(len(dfa.index))) # ok if A already exists

In [20]:
dfa

Unnamed: 0,A,B,C,D
2000-01-01,0,0.292414,-0.500265,-0.709743
2000-01-02,1,-1.559403,-0.082192,-0.290227
2000-01-03,2,-0.382275,0.396081,0.012956
2000-01-04,3,1.212415,0.499622,-0.611444
2000-01-05,4,-0.879824,-1.322001,0.299866
2000-01-06,5,0.162584,-1.501356,-1.974974
2000-01-07,6,2.233168,-1.028462,-1.974327
2000-01-08,7,0.760645,-0.837531,-1.10227


In [21]:
dfa['A'] = list(range(len(dfa.index))) # use this form to create a new column

In [22]:
dfa

Unnamed: 0,A,B,C,D
2000-01-01,0,0.292414,-0.500265,-0.709743
2000-01-02,1,-1.559403,-0.082192,-0.290227
2000-01-03,2,-0.382275,0.396081,0.012956
2000-01-04,3,1.212415,0.499622,-0.611444
2000-01-05,4,-0.879824,-1.322001,0.299866
2000-01-06,5,0.162584,-1.501356,-1.974974
2000-01-07,6,2.233168,-1.028462,-1.974327
2000-01-08,7,0.760645,-0.837531,-1.10227


If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.
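
The pitfall with creating columns through attribute access can be checked directly. The sketch below uses a hypothetical frame and the made-up names `total` and `total2`. Recent pandas versions are assumed to emit a UserWarning on the attribute assignment, so the warning is silenced here.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'A': [1, 2, 3]})   # hypothetical example frame

with warnings.catch_warnings():
    # Silence the warning pandas may emit for list-like attribute assignment.
    warnings.simplefilter('ignore')
    frame.total = [10, 20, 30]           # sets a plain Python attribute

created_by_attribute = 'total' in frame.columns

frame['total2'] = [10, 20, 30]           # bracket assignment creates a column
created_by_brackets = 'total2' in frame.columns

print(created_by_attribute, created_by_brackets)
```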

You can also assign a dict to a row of a DataFrame:

In [23]:
x = pd.DataFrame({'x':[1,2,3], 'y': [3,4,5]})

In [24]:
x

Unnamed: 0,x,y
0,1,3
1,2,4
2,3,5


In [25]:
x.iloc[1]

x    2
y    4
Name: 1, dtype: int64

In [26]:
x.iloc[1] = dict(x=9, y=99)

In [27]:
x

Unnamed: 0,x,y
0,1,3
1,9,99
2,3,5


# Slicing Ranges

The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.

With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:

[[back to top](#Indexing-and-Selecting-Date)]

In [28]:
s

2000-01-01   -0.303327
2000-01-02    2.393151
2000-01-03    0.032127
2000-01-04    2.151180
2000-01-05   -0.507925
2000-01-06   -0.751469
2000-01-07   -1.189214
2000-01-08   -0.723636
Freq: D, Name: A, dtype: float64

In [29]:
s[:5]

2000-01-01   -0.303327
2000-01-02    2.393151
2000-01-03    0.032127
2000-01-04    2.151180
2000-01-05   -0.507925
Freq: D, Name: A, dtype: float64

In [30]:
s[::2] #Every Second Value

2000-01-01   -0.303327
2000-01-03    0.032127
2000-01-05   -0.507925
2000-01-07   -1.189214
Freq: 2D, Name: A, dtype: float64

In [31]:
s[::-1] #Reverse Order

2000-01-08   -0.723636
2000-01-07   -1.189214
2000-01-06   -0.751469
2000-01-05   -0.507925
2000-01-04    2.151180
2000-01-03    0.032127
2000-01-02    2.393151
2000-01-01   -0.303327
Freq: -1D, Name: A, dtype: float64

Note that setting works as well:

In [32]:
s2 = s.copy()

In [33]:
s2[:5] = 0

In [34]:
s2

2000-01-01    0.000000
2000-01-02    0.000000
2000-01-03    0.000000
2000-01-04    0.000000
2000-01-05    0.000000
2000-01-06   -0.751469
2000-01-07   -1.189214
2000-01-08   -0.723636
Freq: D, Name: A, dtype: float64

With DataFrame, slicing inside of [] **slices the rows**. This is provided largely as a convenience since it is such a common operation.

In [35]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227
2000-01-03,0.032127,-0.382275,0.396081,0.012956
2000-01-04,2.15118,1.212415,0.499622,-0.611444
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227


In [36]:
df[:3]

Unnamed: 0,A,B,C,D
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227
2000-01-03,0.032127,-0.382275,0.396081,0.012956


In [37]:
df[::-1]

Unnamed: 0,A,B,C,D
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866
2000-01-04,2.15118,1.212415,0.499622,-0.611444
2000-01-03,0.032127,-0.382275,0.396081,0.012956
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743
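
To make the difference between a label and a slice inside [] concrete, here is a small sketch on hypothetical data (the frame and its index labels are made up for illustration).

```python
import pandas as pd

# Hypothetical frame with string row labels.
frame = pd.DataFrame({'A': [10, 20, 30, 40], 'B': [1, 2, 3, 4]},
                     index=list('wxyz'))

first_two = frame[:2]        # a slice inside [] selects rows by position
column_a = frame['A']        # a single label inside [] selects a column
reversed_rows = frame[::-1]  # a negative step reverses the row order
```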


# Selection by Label

pandas provides a suite of methods in order to have **purely label based indexing**. This is a strict inclusion based protocol. At least one of the labels you ask for must be in the index, or a KeyError will be raised! When slicing, the start bound is included, AND the stop bound is included. Integers are valid labels, but they refer to the label and not the position.

The .loc attribute is the primary access method. The following are valid inputs:

* A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, **not** as an integer position along the index)
* A list or array of labels ['a', 'b', 'c']
* A slice object with labels 'a':'f' (note that contrary to usual Python slices, both the start and the stop are included!)
* A boolean array
* A callable, see Selection By Callable

[[back to top](#Indexing-and-Selecting-Date)]

In [38]:
s1 = pd.Series(np.random.randn(6), index = list('abcdef'))

In [39]:
s1

a   -1.697009
b    0.589104
c    0.475200
d   -1.440529
e   -1.439900
f    0.453692
dtype: float64

In [40]:
s1.loc['c':]

c    0.475200
d   -1.440529
e   -1.439900
f    0.453692
dtype: float64

In [41]:
s1.loc['b']

0.58910360782695337

Note that setting works as well:

In [42]:
s1.loc['c'] = 0

In [43]:
s1

a   -1.697009
b    0.589104
c    0.000000
d   -1.440529
e   -1.439900
f    0.453692
dtype: float64

With a DataFrame

In [44]:
df1 = pd.DataFrame(np.random.randn(6,4),
                    index = list('abcdef'),
                    columns = list('ABCD'))

In [45]:
df1

Unnamed: 0,A,B,C,D
a,1.746814,-0.503215,0.77522,-0.993989
b,1.102561,1.454276,-1.616963,-0.095906
c,0.146995,0.211968,0.349564,1.814721
d,0.74952,-0.104875,1.988423,1.751271
e,0.002873,-0.09492,0.132952,-0.763107
f,0.278255,0.521286,0.146565,0.918191


In [46]:

df1.loc[['a', 'b', 'd'], :]

Unnamed: 0,A,B,C,D
a,1.746814,-0.503215,0.77522,-0.993989
b,1.102561,1.454276,-1.616963,-0.095906
d,0.74952,-0.104875,1.988423,1.751271


Accessing via label slices

In [47]:
df1.loc['d':, 'A':'C']

Unnamed: 0,A,B,C
d,0.74952,-0.104875,1.988423
e,0.002873,-0.09492,0.132952
f,0.278255,0.521286,0.146565


For getting a cross section using a label (equiv to df.xs('a'))

In [48]:
df1.loc['a']

A    1.746814
B   -0.503215
C    0.775220
D   -0.993989
Name: a, dtype: float64

For getting values with a boolean array

In [49]:
df1.loc['a'] > 0

A     True
B    False
C     True
D    False
Name: a, dtype: bool

In [50]:
df1.loc[:, df1.loc['a'] > 0]

Unnamed: 0,A,C
a,1.746814,0.77522
b,1.102561,-1.616963
c,0.146995,0.349564
d,0.74952,1.988423
e,0.002873,0.132952
f,0.278255,0.146565


For getting a value explicitly (equiv to deprecated df.get_value('a','A'))

In [51]:
df1.loc['a', 'A']

1.7468136719969056
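
The two rules that most often surprise people here are that label slices include the stop bound and that integer labels are not positions. A minimal sketch, using made-up Series (`letters` and `shuffled` are illustrative names):

```python
import pandas as pd

letters = pd.Series(range(6), index=list('abcdef'))  # hypothetical data

by_label = letters.loc['b':'d']    # label slice: 'd' is included
by_position = letters.iloc[1:3]    # positional slice: position 3 is excluded

# With an integer index, .loc looks up the label, .iloc the position.
shuffled = pd.Series(['x', 'y', 'z'], index=[2, 1, 0])
label_zero = shuffled.loc[0]       # the row whose label is 0
position_zero = shuffled.iloc[0]   # the first row
```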

# Selection By Position

pandas provides a suite of methods in order to get **purely integer based indexing**. The semantics closely follow Python and NumPy slicing. Indexing is 0-based. When slicing, the start bound is included, while the upper bound is excluded. Trying to use a non-integer, even a **valid** label, will raise an IndexError.

The .iloc attribute is the primary access method. The following are valid inputs:

* An integer e.g. 5
* A list or array of integers [4, 3, 0]
* A slice object with ints 1:7
* A boolean array
* A callable, see Selection By Callable

[[back to top](#Indexing-and-Selecting-Date)]

In [52]:
s1 = pd.Series(np.random.randn(5), index = list(range(0,10,2)))

In [53]:
s1

0   -0.853776
2   -2.805136
4   -0.022554
6   -1.517335
8    0.431248
dtype: float64

In [54]:
s1.iloc[:3]

0   -0.853776
2   -2.805136
4   -0.022554
dtype: float64

In [55]:
s1.iloc[3] 

-1.5173346414089108

Note that setting works as well:

In [56]:
s1.iloc[:3] = 0

In [57]:
s1

0    0.000000
2    0.000000
4    0.000000
6   -1.517335
8    0.431248
dtype: float64

With a DataFrame

In [58]:
df1 = pd.DataFrame(np.random.randn(6,4),
                   index = list(range(0,12,2)),
                   columns = list(range(0,8,2)))

In [59]:
df1

Unnamed: 0,0,2,4,6
0,1.002153,-1.641474,2.205141,-0.645291
2,0.488601,0.131899,0.824523,0.12826
4,-1.334162,0.366354,0.011403,-0.118001
6,0.880918,0.208974,-0.526334,-0.868895
8,-2.45601,-0.720001,0.639207,0.267452
10,-1.097568,-0.949101,-0.257344,1.28218


Select via integer slicing

In [60]:
df1.iloc[:3]

Unnamed: 0,0,2,4,6
0,1.002153,-1.641474,2.205141,-0.645291
2,0.488601,0.131899,0.824523,0.12826
4,-1.334162,0.366354,0.011403,-0.118001


In [61]:
df1.iloc[1:5, 2:4]

Unnamed: 0,4,6
2,0.824523,0.12826
4,0.011403,-0.118001
6,-0.526334,-0.868895
8,0.639207,0.267452


Select via integer list

In [62]:
df1.iloc[[1,3,5], [1,3]]

Unnamed: 0,2,6
2,0.131899,0.12826
6,0.208974,-0.868895
10,-0.949101,1.28218


In [63]:
df1.iloc[1:3, :]

Unnamed: 0,0,2,4,6
2,0.488601,0.131899,0.824523,0.12826
4,-1.334162,0.366354,0.011403,-0.118001


In [64]:
df1.iloc[:,1:3]

Unnamed: 0,2,4
0,-1.641474,2.205141
2,0.131899,0.824523
4,0.366354,0.011403
6,0.208974,-0.526334
8,-0.720001,0.639207
10,-0.949101,-0.257344


In [65]:
df.iloc[1,1]

-1.5594026192477159

For getting a cross section using an integer position (equiv to df.xs(1))

In [66]:
df1.iloc[1]

0    0.488601
2    0.131899
4    0.824523
6    0.128260
Name: 2, dtype: float64

Out of range slice indexes are handled gracefully just as in Python/Numpy.

In [67]:
x = list('abcdef')

In [68]:
x

['a', 'b', 'c', 'd', 'e', 'f']

In [69]:
x[4:10]

['e', 'f']

In [70]:
x[8:10]

[]

In [71]:
s = pd.Series(x)

In [72]:
s

0    a
1    b
2    c
3    d
4    e
5    f
dtype: object

In [73]:
s.iloc[4:10]

4    e
5    f
dtype: object

In [74]:
s.iloc[8:10]

Series([], dtype: object)

Note that this could result in an empty axis (e.g. an empty DataFrame being returned)

In [75]:
df1 = pd.DataFrame(np.random.randn(5,2), columns = list('AB'))

In [76]:
df1

Unnamed: 0,A,B
0,-0.169145,-0.354348
1,2.23884,-1.731615
2,-1.333561,-0.171449
3,2.64016,0.547304
4,-0.680767,1.03191


In [77]:

df1.iloc[:, 2:3]

0
1
2
3
4


In [78]:
df1.iloc[:,1:3]

Unnamed: 0,B
0,-0.354348
1,-1.731615
2,-0.171449
3,0.547304
4,1.03191


In [79]:
df1.iloc[4:6]

Unnamed: 0,A,B
4,-0.680767,1.03191


A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of bounds will also raise an IndexError.

In [80]:
df1.iloc[[4,5,6]]

IndexError: positional indexers are out-of-bounds

In [None]:
df1.iloc[:, 4]
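
The contrast between out-of-range slices and out-of-range scalars or lists can be summarized in one sketch. The Series `values` and the helper `raises_index_error` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

values = pd.Series(list('abcdef'))   # hypothetical six-element Series

clipped = values.iloc[4:10]          # a slice past the end is clipped
empty = values.iloc[8:10]            # a slice entirely past the end is empty


def raises_index_error(indexer):
    """Return True if values.iloc[indexer] raises IndexError."""
    try:
        values.iloc[indexer]
    except IndexError:
        return True
    return False


scalar_out_of_bounds = raises_index_error(10)        # single position
list_out_of_bounds = raises_index_error([4, 5, 6])   # one bad element is enough
```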

# Selection By Callable

.loc, .iloc, .ix and also [] indexing can accept a callable as indexer. The callable must be a function with one argument (the calling Series, DataFrame or Panel) that returns valid output for indexing.

[[back to top](#Indexing-and-Selecting-Date)]

In [81]:
df1 = pd.DataFrame(np.random.rand(6, 4),
                   index = list('abcdef'),
                   columns = list('ABCD'))

In [82]:
df1

Unnamed: 0,A,B,C,D
a,0.597594,0.330252,0.984982,0.905348
b,0.3915,0.974684,0.186439,0.423094
c,0.819762,0.090801,0.756509,0.568264
d,0.90404,0.554439,0.330431,0.738448
e,0.312389,0.352692,0.573543,0.778051
f,0.666321,0.912709,0.867834,0.988221


In [83]:
df1.loc[lambda df: df.A > .8, :]

Unnamed: 0,A,B,C,D
c,0.819762,0.090801,0.756509,0.568264
d,0.90404,0.554439,0.330431,0.738448


In [84]:
df1.loc[:, lambda df: ['A','B']] 

Unnamed: 0,A,B
a,0.597594,0.330252
b,0.3915,0.974684
c,0.819762,0.090801
d,0.90404,0.554439
e,0.312389,0.352692
f,0.666321,0.912709


In [85]:
df1.iloc[:, lambda df: [0,1]]

Unnamed: 0,A,B
a,0.597594,0.330252
b,0.3915,0.974684
c,0.819762,0.090801
d,0.90404,0.554439
e,0.312389,0.352692
f,0.666321,0.912709


In [86]:
df.columns[0]

'A'

In [87]:
df1[lambda df: df.columns[0]] #returns a Series

a    0.597594
b    0.391500
c    0.819762
d    0.904040
e    0.312389
f    0.666321
Name: A, dtype: float64

You can use callable indexing in Series.

In [88]:
df1.A

a    0.597594
b    0.391500
c    0.819762
d    0.904040
e    0.312389
f    0.666321
Name: A, dtype: float64

In [89]:
df1.A.loc[lambda s: s > 0]

a    0.597594
b    0.391500
c    0.819762
d    0.904040
e    0.312389
f    0.666321
Name: A, dtype: float64

Using these methods / indexers, you can chain data selection operations without using a temporary variable.

In [90]:
bb = pd.read_csv('data/baseball.csv', index_col = 'id')

In [91]:
bb.head()

id,player,year,stint,team,lg,g,ab,r,h,X2b,...,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
88641,womacto01,2006,2,CHN,NL,19,50,6,14,1,...,2.0,1.0,1.0,4,4.0,0.0,0.0,3.0,0.0,0.0
88643,schilcu01,2006,1,BOS,AL,31,2,0,1,0,...,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
88645,myersmi01,2006,1,NYA,AL,62,0,0,0,0,...,0.0,0.0,0.0,0,0.0,0.0,0.0,0.0,0.0,0.0
88649,helliri01,2006,1,MIL,NL,20,3,0,0,0,...,0.0,0.0,0.0,0,2.0,0.0,0.0,0.0,0.0,0.0
88650,johnsra05,2006,1,NYA,AL,33,6,0,1,0,...,0.0,0.0,0.0,0,4.0,0.0,0.0,0.0,0.0,0.0


In [92]:
bb.groupby(['year','team']).sum().head(10)

year,team,stint,g,ab,r,h,X2b,X3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
2006,ARI,1,153,586,93,159,52,2,15,73.0,0.0,1.0,69,58.0,10.0,7.0,0.0,6.0,14.0
2006,BOS,1,31,2,0,1,0,0,0,0.0,0.0,0.0,0,1.0,0.0,0.0,0.0,0.0,0.0
2006,CHN,2,19,50,6,14,1,0,1,2.0,1.0,1.0,4,4.0,0.0,0.0,3.0,0.0,0.0
2006,LAN,1,28,26,2,5,1,0,0,0.0,0.0,0.0,1,7.0,0.0,0.0,6.0,0.0,1.0
2006,MIL,1,20,3,0,0,0,0,0,0.0,0.0,0.0,0,2.0,0.0,0.0,0.0,0.0,0.0
2006,NYA,2,95,6,0,1,0,0,0,0.0,0.0,0.0,0,4.0,0.0,0.0,0.0,0.0,0.0
2006,SFN,1,139,426,66,105,21,12,6,40.0,7.0,0.0,46,55.0,2.0,2.0,3.0,4.0,6.0
2007,ARI,5,46,55,6,9,4,0,0,6.0,0.0,0.0,5,13.0,0.0,0.0,2.0,0.0,1.0
2007,ATL,4,92,94,2,15,4,0,0,10.0,0.0,0.0,5,29.0,1.0,0.0,13.0,1.0,1.0
2007,BAL,2,76,174,17,51,10,1,1,16.0,1.0,2.0,10,23.0,1.0,0.0,5.0,1.0,5.0


In [93]:
bb.groupby(['year','team']).sum().loc[lambda x: x.r > 100] 

year,team,stint,g,ab,r,h,X2b,X3b,hr,rbi,sb,cs,bb,so,ibb,hbp,sh,sf,gidp
2007,CIN,6,379,745,101,203,35,2,36,125.0,10.0,1.0,105,127.0,14.0,1.0,1.0,15.0,18.0
2007,DET,5,301,1062,162,283,54,4,37,144.0,24.0,7.0,97,176.0,3.0,10.0,4.0,8.0,28.0
2007,HOU,4,311,926,109,218,47,6,14,77.0,10.0,4.0,60,212.0,3.0,9.0,16.0,6.0,17.0
2007,LAN,11,413,1021,153,293,61,3,36,154.0,7.0,5.0,114,141.0,8.0,9.0,3.0,8.0,29.0
2007,NYN,13,622,1854,240,509,101,3,61,243.0,22.0,4.0,174,310.0,24.0,23.0,18.0,15.0,48.0
2007,SFN,5,482,1305,198,337,67,6,40,171.0,26.0,7.0,235,188.0,51.0,8.0,16.0,6.0,41.0
2007,TEX,2,198,729,115,200,40,4,28,115.0,21.0,4.0,73,140.0,4.0,5.0,2.0,8.0,16.0
2007,TOR,4,459,1408,187,378,96,2,58,223.0,4.0,2.0,190,265.0,16.0,12.0,4.0,16.0,38.0
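
The same chaining pattern can be tried without the CSV file. The sketch below uses a small made-up stand-in for the baseball data; the values in `records` are invented and only the column names `year`, `team` and `r` mirror the real file.

```python
import pandas as pd

# Hypothetical stand-in for the baseball data; the numbers are made up.
records = pd.DataFrame({
    'year': [2006, 2006, 2007, 2007],
    'team': ['ARI', 'ARI', 'DET', 'TEX'],
    'r':    [40, 53, 162, 95],
})

high_scoring = (
    records.groupby(['year', 'team'])[['r']]
           .sum()
           # The callable receives the summed frame, so no temporary is needed.
           .loc[lambda totals: totals['r'] > 100]
)
```

Because the callable is evaluated against the intermediate result, the filter can refer to the grouped totals even though they were never bound to a name.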


# Selecting Random Samples

You can draw a random selection of rows or columns from a Series, DataFrame, or Panel with the sample() method. The method will sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.

[[back to top](#Indexing-and-Selecting-Date)]

In [94]:
s = pd.Series([0,1,2,3,4,5])

In [95]:
# When no arguments are passed, returns 1 row.
s.sample()

1    1
dtype: int64

In [96]:
# One may specify either a number of rows:
s.sample(n=3)

0    0
2    2
4    4
dtype: int64

In [97]:
# Or a fraction of the rows:
s.sample(frac = 0.5)

1    1
4    4
3    3
dtype: int64

By default, sample will return each row at most once, but one can also sample with replacement using the replace option:

In [98]:
s = pd.Series([0,1,2,3,4,5])

In [99]:
# Without replacement (default):
s.sample(n=6, replace = False)

3    3
1    1
0    0
5    5
2    2
4    4
dtype: int64

In [100]:
# With replacement:
s.sample(n = 6, replace = True)

5    5
2    2
1    1
4    4
2    2
0    0
dtype: int64

By default, each row has an equal probability of being selected, but if you want rows to have different probabilities, you can pass sampling weights to sample through the weights argument. These weights can be a list, a numpy array, or a Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights by the sum of the weights. For example:

In [101]:
s = pd.Series([0,1,2,3,4,5])

In [102]:
example_weights = [0,0,0.2,0,2,0.4]

In [103]:
s.sample(n = 3, weights = example_weights)

4    4
5    5
2    2
dtype: int64

In [104]:
# Weights will be re-normalized automatically
example_weights2 = [0.5,0,0,0,0,0]

In [105]:
s.sample(n = 1, weights = example_weights2)

0    0
dtype: int64

When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling rows and not columns) by simply passing the name of the column as a string.

In [106]:
df2 = pd.DataFrame({'col1': [9,8,7,6], 'weight_column': [0.5, 0.4, 0.1, 0]})

In [107]:
df2

Unnamed: 0,col1,weight_column
0,9,0.5
1,8,0.4
2,7,0.1
3,6,0.0


In [108]:

df2.sample(n = 3, weights = 'weight_column')

Unnamed: 0,col1,weight_column
0,9,0.5
2,7,0.1
1,8,0.4


sample also allows users to sample columns instead of rows using the axis argument.

In [109]:
df3 = pd.DataFrame({'col1': [1,2,3], 'col2': [2,3,4]})

In [110]:
df3

Unnamed: 0,col1,col2
0,1,2
1,2,3
2,3,4


In [111]:
df3.sample(n=1, axis = 'columns') # axis = 1 also works

Unnamed: 0,col2
0,2
1,3
2,4


Finally, one can also set a seed for sample's random number generator using the random_state argument, which will accept either an integer (as a seed) or a numpy RandomState object.

In [112]:
df4 =  pd.DataFrame({'col1':[1,2,3], 'col2': [2,3,4]})

In [113]:
df4

Unnamed: 0,col1,col2
0,1,2
1,2,3
2,3,4


In [114]:
# With a given seed, the sample will always draw the same rows
df4.sample(n = 2, random_state = 2)

Unnamed: 0,col1,col2
2,3,4
1,2,3


In [115]:
df4.sample(n = 2, random_state = 2)

Unnamed: 0,col1,col2
2,3,4
1,2,3
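
Weights and seeding can be combined. The following sketch uses made-up weights that do not sum to 1, relying on the re-normalization described above; `population` and `weights` are illustrative names.

```python
import pandas as pd

population = pd.Series([0, 1, 2, 3, 4, 5])  # hypothetical data
weights = [0, 0, 1, 1, 0, 2]                # does not sum to 1; re-normalized

# The same seed gives the same draw, in the same order.
first = population.sample(n=3, weights=weights, random_state=0)
second = population.sample(n=3, weights=weights, random_state=0)

# Rows with zero weight can never be selected.
drawn_labels = set(first.index)
```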


# Setting With Enlargement

The .loc/.ix/[] operations can perform enlargement when setting a non-existent key for that axis.

In the Series case this is effectively an appending operation.

[[back to top](#Indexing-and-Selecting-Date)]

In [116]:
se = pd.Series([1,2,3])

In [117]:
se

0    1
1    2
2    3
dtype: int64

In [118]:
se[5] = 5

In [119]:
se

0    1
1    2
2    3
5    5
dtype: int64

A DataFrame can be enlarged on either axis via .loc

In [120]:
dfi = pd.DataFrame(np.arange(6).reshape(3,2),
                  columns = ['A','B'])

In [121]:
dfi

Unnamed: 0,A,B
0,0,1
1,2,3
2,4,5


In [122]:
dfi.loc[:,'C'] = dfi.loc[:,'A']

In [123]:
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4


This is like an append operation on the DataFrame.

In [124]:
dfi.loc[3] = 5

In [125]:
dfi

Unnamed: 0,A,B,C
0,0,1,0
1,2,3,2
2,4,5,4
3,5,5,5
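
Enlargement is a property of label-based setting. As a sketch on hypothetical data, the block below contrasts .loc, which creates missing labels, with .iloc, which refuses to create missing positions.

```python
import pandas as pd

table = pd.DataFrame({'A': [0, 2, 4], 'B': [1, 3, 5]})  # hypothetical data

table.loc[3] = 5        # label 3 does not exist, so a new row is added
table.loc[:, 'C'] = 0   # column 'C' does not exist, so it is created

try:
    table.iloc[10] = 5  # positions cannot be created
    iloc_enlarged = True
except IndexError:
    iloc_enlarged = False
```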


# Fast scalar value getting and setting

Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures.

Similarly to loc, at provides **label** based scalar lookups, while iat provides **integer** based lookups analogously to iloc.

[[back to top](#Indexing-and-Selecting-Date)]

In [126]:
s

0    0
1    1
2    2
3    3
4    4
5    5
dtype: int64

In [127]:
s.iat[5]

5

In [128]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227
2000-01-03,0.032127,-0.382275,0.396081,0.012956
2000-01-04,2.15118,1.212415,0.499622,-0.611444
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227


In [129]:
dates

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [130]:
df.at[dates[5], 'A']

-0.75146882002269344

In [131]:
df.iat[3, 0]

2.1511803561786116

You can also set using these same indexers.

In [132]:
df.at[dates[5], 'E'] = 7

In [133]:
df

Unnamed: 0,A,B,C,D,E
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743,
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227,
2000-01-03,0.032127,-0.382275,0.396081,0.012956,
2000-01-04,2.15118,1.212415,0.499622,-0.611444,
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866,
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974,7.0
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327,
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227,


In [134]:
df.iat[3, 0] = 7

In [135]:
df

Unnamed: 0,A,B,C,D,E
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743,
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227,
2000-01-03,0.032127,-0.382275,0.396081,0.012956,
2000-01-04,7.0,1.212415,0.499622,-0.611444,
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866,
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974,7.0
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327,
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227,


**at** may enlarge the object in-place as above if the indexer is missing.

In [136]:
df.at[dates[-1] + pd.Timedelta(days=1), 0] = 7

In [137]:
df

Unnamed: 0,A,B,C,D,E,0
2000-01-01,-0.303327,0.292414,-0.500265,-0.709743,,
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227,,
2000-01-03,0.032127,-0.382275,0.396081,0.012956,,
2000-01-04,7.0,1.212415,0.499622,-0.611444,,
2000-01-05,-0.507925,-0.879824,-1.322001,0.299866,,
2000-01-06,-0.751469,0.162584,-1.501356,-1.974974,7.0,
2000-01-07,-1.189214,2.233168,-1.028462,-1.974327,,
2000-01-08,-0.723636,0.760645,-0.837531,-1.10227,,
2000-01-09,,,,,,7.0
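
For scalar access, at and iat return the same values as loc and iloc. A minimal sketch on a made-up frame (`grid` and its labels are hypothetical):

```python
import pandas as pd

# Hypothetical 2x2 frame with string row labels.
grid = pd.DataFrame({'A': [1.0, 2.0], 'B': [3.0, 4.0]}, index=['r0', 'r1'])

same_by_label = grid.at['r1', 'B'] == grid.loc['r1', 'B']
same_by_position = grid.iat[1, 1] == grid.iloc[1, 1]

grid.at['r0', 'A'] = 9.0   # scalar assignment by label
grid.iat[1, 0] = 8.0       # scalar assignment by position
```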


# Boolean indexing

Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and ~ for not. These **must** be grouped by using parentheses.

Using a boolean vector to index a Series works exactly as in a numpy ndarray:

[[back to top](#Indexing-and-Selecting-Date)]

In [138]:
s = pd.Series(range(-3, 4))

In [139]:
s

0   -3
1   -2
2   -1
3    0
4    1
5    2
6    3
dtype: int64

In [140]:
s[s > 0]

4    1
5    2
6    3
dtype: int64

In [141]:
s[(s < -1) | (s > 0.5)]

0   -3
1   -2
4    1
5    2
6    3
dtype: int64

In [142]:
s[~(s < 0)]

3    0
4    1
5    2
6    3
dtype: int64

You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example, something derived from one of the columns of the DataFrame):

In [143]:
df[df['A'] > 0] # or df.loc[df['A'] > 0, :]

Unnamed: 0,A,B,C,D,E,0
2000-01-02,2.393151,-1.559403,-0.082192,-0.290227,,
2000-01-03,0.032127,-0.382275,0.396081,0.012956,,
2000-01-04,7.0,1.212415,0.499622,-0.611444,,


List comprehensions and the map method of Series can also be used to produce more complex criteria:

In [144]:
df2 = pd.DataFrame({'a' : ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'b' : ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
                    'c' : np.random.randn(7)})

In [145]:
df2

Unnamed: 0,a,b,c
0,one,x,0.383619
1,one,y,-0.220392
2,two,y,-0.004826
3,three,x,-1.0741
4,two,y,-1.567319
5,one,x,0.334996
6,six,x,-1.902939


In [146]:
# only want 'two' or 'three'
criterion = df2['a'].map(lambda x: x.startswith('t'))

In [147]:
criterion

0    False
1    False
2     True
3     True
4     True
5    False
6    False
Name: a, dtype: bool

In [148]:
df2[criterion]

Unnamed: 0,a,b,c
2,two,y,-0.004826
3,three,x,-1.0741
4,two,y,-1.567319


In [149]:
#equivalent but slower
df2[[x.startswith('t') for x in df2['a']]]

Unnamed: 0,a,b,c
2,two,y,-0.004826
3,three,x,-1.0741
4,two,y,-1.567319


In [150]:
# Multiple criteria
df2[criterion & (df2['b'] == 'x')]

Unnamed: 0,a,b,c
3,three,x,-1.0741


Note that with the choice methods [Selection By Label](#Selection-by-Label), [Selection by Position](#Selection-By-Position), and [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced), you may select along more than one axis using boolean vectors combined with other indexing expressions.

In [151]:
df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']

Unnamed: 0,b,c
3,x,-1.0741
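
The requirement to parenthesize each comparison follows from Python's operator precedence: & binds more tightly than < and >. The sketch below, on a made-up Series named `numbers`, shows the working form and what happens when the parentheses are left out.

```python
import pandas as pd

numbers = pd.Series(range(-3, 4))  # hypothetical data: -3 .. 3

# Parenthesized comparisons combine element-wise.
between = numbers[(numbers > -2) & (numbers < 2)]

# Without parentheses, Python evaluates `0 & numbers` first and then
# needs the truth value of a whole Series, which raises ValueError.
try:
    numbers[numbers > 0 & numbers < 2]
    unparenthesized_failed = False
except ValueError:
    unparenthesized_failed = True
```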


# Indexing with isin

Consider the isin method of Series, which returns a boolean vector that is true wherever the Series elements exist in the passed list. This allows you to select rows where one or more columns have values you want:

[[back to top](#Indexing-and-Selecting-Date)]

In [152]:
s = pd.Series(np.arange(5), index = np.arange(5)[::-1], dtype = 'int64')

In [153]:
np.arange(5)[::-1]

array([4, 3, 2, 1, 0])

In [154]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [155]:
s.isin([2,4,6])

4    False
3    False
2     True
1    False
0     True
dtype: bool

In [156]:
s[s.isin([2,4,6])]

2    2
0    4
dtype: int64

The same method is available for Index objects and is useful for the cases when you don’t know which of the sought labels are in fact present:

In [157]:
s[s.index.isin([2,4,6])]

4    0
2    2
dtype: int64

In [158]:
# compare it to the following
s[[2,4,6]]

2    2.0
4    0.0
6    NaN
dtype: float64

In addition to that, MultiIndex allows selecting a separate level to use in the membership check:

In [159]:
s_mi = pd.Series(np.arange(6),
                 index=pd.MultiIndex.from_product([[0, 1], ['a', 'b', 'c']]))

In [160]:
s_mi

0  a    0
   b    1
   c    2
1  a    3
   b    4
   c    5
dtype: int32

In [161]:
s_mi.iloc[s_mi.index.isin([(1, 'a'), (2, 'b'), (0, 'c')])]

0  c    2
1  a    3
dtype: int32

In [162]:
s_mi.iloc[s_mi.index.isin(['a', 'c', 'e'], level=1)]

0  a    0
   c    2
1  a    3
   c    5
dtype: int32

DataFrame also has an isin method. When calling isin, pass a set of values as either an array or dict. If values is an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever the element is in the sequence of values.

In [163]:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})

In [164]:
df

Unnamed: 0,ids,ids2,vals
0,a,a,1
1,b,n,2
2,f,c,3
3,n,n,4


In [165]:
values = ['a', 'b', 1 ,3]

In [166]:
df.isin(values)

Unnamed: 0,ids,ids2,vals
0,True,True,True
1,True,False,False
2,False,False,True
3,False,False,False


Combine DataFrame’s isin with the any() and all() methods to quickly select subsets of your data that meet given criteria. To select a row where each column meets its own criterion:

In [167]:
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [168]:
values

{'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

In [169]:
row_mask = df.isin(values).all(axis=1)

In [170]:
row_mask

0     True
1    False
2    False
3    False
dtype: bool

In [171]:
df[row_mask]

Unnamed: 0,ids,ids2,vals
0,a,a,1
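
The text above mentions any() as well as all(), but only all() is demonstrated. As a small sketch using the same frame and dict, any(axis=1) keeps rows where at least one column matches its own list of values; the names any_mask and all_mask are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'vals': [1, 2, 3, 4],
                   'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})
values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}

# True for rows where at least one column matches its own list
any_mask = df.isin(values).any(axis=1)
# True for rows where every column matches its own list
all_mask = df.isin(values).all(axis=1)

print(df[any_mask])
```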


# The where Method and Masking

Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection output has the same shape as the original data, you can use the where method in Series and DataFrame.

To return only the selected rows:

[[back to top](#Indexing-and-Selecting-Date)]

In [172]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [173]:
s[s > 0]

3    1
2    2
1    3
0    4
dtype: int64

To return a Series of the same shape as the original:

In [174]:
s.where(s > 0)

4    NaN
3    1.0
2    2.0
1    3.0
0    4.0
dtype: float64

Selecting values from a DataFrame with a boolean criterion now also preserves input data shape; where is used under the hood as the implementation. The indexing expression df[df < 0] is equivalent to df.where(df < 0):

In [175]:
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

In [176]:
df

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,-0.869662,-1.170597,-0.39669
2000-01-02,0.705645,0.603641,-0.5782,-0.962117
2000-01-03,0.167324,0.389131,0.66766,-0.812556
2000-01-04,-1.07812,0.961387,0.247198,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,0.940462
2000-01-06,0.592565,-1.522493,-3.710592,0.119091
2000-01-07,-0.001998,1.493248,-0.154796,-0.757448
2000-01-08,0.519904,-0.336913,-1.213088,0.312619


In [177]:
df.where(df < 0) 

Unnamed: 0,A,B,C,D
2000-01-01,,-0.869662,-1.170597,-0.39669
2000-01-02,,,-0.5782,-0.962117
2000-01-03,,,,-0.812556
2000-01-04,-1.07812,,,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,
2000-01-06,,-1.522493,-3.710592,
2000-01-07,-0.001998,,-0.154796,-0.757448
2000-01-08,,-0.336913,-1.213088,


In addition, where takes an optional other argument for replacement of values where the condition is False, in the returned copy.

In [178]:
df.where(df < 0, -df)

Unnamed: 0,A,B,C,D
2000-01-01,-0.336091,-0.869662,-1.170597,-0.39669
2000-01-02,-0.705645,-0.603641,-0.5782,-0.962117
2000-01-03,-0.167324,-0.389131,-0.66766,-0.812556
2000-01-04,-1.07812,-0.961387,-0.247198,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,-0.940462
2000-01-06,-0.592565,-1.522493,-3.710592,-0.119091
2000-01-07,-0.001998,-1.493248,-0.154796,-0.757448
2000-01-08,-0.519904,-0.336913,-1.213088,-0.312619


You may wish to set values based on some boolean criteria. This can be done intuitively like so:

In [179]:
s2 = s.copy()

In [180]:
s2

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [181]:
s2[s2 < 2] = 0

In [182]:
s2

4    0
3    0
2    2
1    3
0    4
dtype: int64

In [183]:
df2 = df.copy()

In [184]:
df2[df2 < 0] = 999

In [185]:
df2

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,999.0,999.0,999.0
2000-01-02,0.705645,0.603641,999.0,999.0
2000-01-03,0.167324,0.389131,0.66766,999.0
2000-01-04,999.0,0.961387,0.247198,999.0
2000-01-05,999.0,999.0,999.0,0.940462
2000-01-06,0.592565,999.0,999.0,0.119091
2000-01-07,999.0,1.493248,999.0,999.0
2000-01-08,0.519904,999.0,999.0,0.312619


### alignment

Furthermore, where aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).

In [186]:
df2 =  df.copy()

In [187]:
df2[ df2[1:4] > 0 ] = 3

In [188]:
df2[1:4]

Unnamed: 0,A,B,C,D
2000-01-02,3.0,3.0,-0.5782,-0.962117
2000-01-03,3.0,3.0,3.0,-0.812556
2000-01-04,-1.07812,3.0,3.0,-1.833627


In [189]:
df2

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,-0.869662,-1.170597,-0.39669
2000-01-02,3.0,3.0,-0.5782,-0.962117
2000-01-03,3.0,3.0,3.0,-0.812556
2000-01-04,-1.07812,3.0,3.0,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,0.940462
2000-01-06,0.592565,-1.522493,-3.710592,0.119091
2000-01-07,-0.001998,1.493248,-0.154796,-0.757448
2000-01-08,0.519904,-0.336913,-1.213088,0.312619


where can also accept axis and level parameters to align the input when performing the where.

In [190]:
df2 = df.copy()

In [191]:
df2

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,-0.869662,-1.170597,-0.39669
2000-01-02,0.705645,0.603641,-0.5782,-0.962117
2000-01-03,0.167324,0.389131,0.66766,-0.812556
2000-01-04,-1.07812,0.961387,0.247198,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,0.940462
2000-01-06,0.592565,-1.522493,-3.710592,0.119091
2000-01-07,-0.001998,1.493248,-0.154796,-0.757448
2000-01-08,0.519904,-0.336913,-1.213088,0.312619


In [192]:
df2 = df2.where(df2 > 0 , 9999) 

In [193]:
df2

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,9999.0,9999.0,9999.0
2000-01-02,0.705645,0.603641,9999.0,9999.0
2000-01-03,0.167324,0.389131,0.66766,9999.0
2000-01-04,9999.0,0.961387,0.247198,9999.0
2000-01-05,9999.0,9999.0,9999.0,0.940462
2000-01-06,0.592565,9999.0,9999.0,0.119091
2000-01-07,9999.0,1.493248,9999.0,9999.0
2000-01-08,0.519904,9999.0,9999.0,0.312619


In [194]:
df2.where(df2 != 9999, df2['A'], axis='index')

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,0.336091,0.336091,0.336091
2000-01-02,0.705645,0.603641,0.705645,0.705645
2000-01-03,0.167324,0.389131,0.66766,0.167324
2000-01-04,9999.0,0.961387,0.247198,9999.0
2000-01-05,9999.0,9999.0,9999.0,0.940462
2000-01-06,0.592565,0.592565,0.592565,0.119091
2000-01-07,9999.0,1.493248,9999.0,9999.0
2000-01-08,0.519904,0.519904,0.519904,0.312619


This is equivalent to (but faster than) the following.

In [195]:
df2 = df.copy()

In [196]:
df.apply(lambda x, y: x.where(x>0, y), y = df['A'])

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,0.336091,0.336091,0.336091
2000-01-02,0.705645,0.603641,0.705645,0.705645
2000-01-03,0.167324,0.389131,0.66766,0.167324
2000-01-04,-1.07812,0.961387,0.247198,-1.07812
2000-01-05,-0.966156,-0.966156,-0.966156,0.940462
2000-01-06,0.592565,0.592565,0.592565,0.119091
2000-01-07,-0.001998,1.493248,-0.001998,-0.001998
2000-01-08,0.519904,0.519904,0.519904,0.312619


where can accept a callable as the condition and other arguments. The function must take one argument (the calling Series or DataFrame) and return valid output for use as the condition or other argument.

In [197]:
df3 = pd.DataFrame({'A': [1, 2, 3],
                    'B': [4, 5, 6],
                    'C': [7, 8, 9]})

In [198]:
df3

Unnamed: 0,A,B,C
0,1,4,7
1,2,5,8
2,3,6,9


In [199]:
df3.where(lambda x: x > 4, lambda x: x + 10)

Unnamed: 0,A,B,C
0,11,14,7
1,12,5,8
2,13,6,9


### mask

mask is the inverse boolean operation of where.

In [200]:
s

4    0
3    1
2    2
1    3
0    4
dtype: int64

In [201]:
s.mask(s > 0)

4    0.0
3    NaN
2    NaN
1    NaN
0    NaN
dtype: float64

In [202]:
df

Unnamed: 0,A,B,C,D
2000-01-01,0.336091,-0.869662,-1.170597,-0.39669
2000-01-02,0.705645,0.603641,-0.5782,-0.962117
2000-01-03,0.167324,0.389131,0.66766,-0.812556
2000-01-04,-1.07812,0.961387,0.247198,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,0.940462
2000-01-06,0.592565,-1.522493,-3.710592,0.119091
2000-01-07,-0.001998,1.493248,-0.154796,-0.757448
2000-01-08,0.519904,-0.336913,-1.213088,0.312619


In [203]:
df.mask(df >= 0)

Unnamed: 0,A,B,C,D
2000-01-01,,-0.869662,-1.170597,-0.39669
2000-01-02,,,-0.5782,-0.962117
2000-01-03,,,,-0.812556
2000-01-04,-1.07812,,,-1.833627
2000-01-05,-0.966156,-0.580122,-0.141597,
2000-01-06,,-1.522493,-3.710592,
2000-01-07,-0.001998,,-0.154796,-0.757448
2000-01-08,,-0.336913,-1.213088,
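
To make the "inverse of where" relationship concrete, the sketch below checks that masking with a condition gives the same result as where with the negated condition. It rebuilds the Series s from earlier in this notebook; cond, masked, and kept are illustrative names.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
cond = s > 0

masked = s.mask(cond)    # NaN wherever cond is True
kept = s.where(~cond)    # NaN wherever ~cond is False, i.e. wherever cond is True

# both calls blank out exactly the same positions
same = masked.equals(kept)
print(same)
```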


# The query Method (Experimental)

**DataFrame** objects have a **query()** method that allows selection using an expression.

You can get the value of the frame where column b has values between the values of columns a and c. For example:

[[back to top](#Indexing-and-Selecting-Date)]

In [204]:
n = 10

In [205]:
df = pd.DataFrame(np.random.rand(n, 3), columns = list('abc'))

In [206]:
df

Unnamed: 0,a,b,c
0,0.338794,0.909659,0.502763
1,0.546157,0.361746,0.103329
2,0.768378,0.048048,0.550461
3,0.413391,0.061056,0.19094
4,0.907898,0.915634,0.157
5,0.794375,0.358014,0.331985
6,0.279852,0.080822,0.989933
7,0.686174,0.409613,0.629644
8,0.474483,0.528165,0.183505
9,0.426622,0.26125,0.698927


In [207]:
# pure python
df[(df.a < df.b) & (df.b < df.c)]

Unnamed: 0,a,b,c


In [208]:
# query
df.query('(a < b) & (b < c)')

Unnamed: 0,a,b,c


Do the same thing but fall back on a named index if there is no column with the name a.

In [209]:
df = pd.DataFrame(np.random.randint(n / 2, size=(n, 2)), columns=list('bc'))

In [210]:
df

Unnamed: 0,b,c
0,0,0
1,4,1
2,4,3
3,1,2
4,0,2
5,3,0
6,1,2
7,2,1
8,4,2
9,2,3


In [211]:
df.index.name = 'a'

In [212]:
df

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1
0,0,0
1,4,1
2,4,3
3,1,2
4,0,2
5,3,0
6,1,2
7,2,1
8,4,2
9,2,3


In [213]:
df.query('a < b and b > c')

Unnamed: 0_level_0,b,c
a,Unnamed: 1_level_1,Unnamed: 2_level_1
1,4,1
2,4,3


If instead you don’t want to or cannot name your index, you can use the name index in your query expression:

In [214]:
df = pd.DataFrame(np.random.randint(n, size=(n, 2)), columns=list('bc'))

In [215]:
df

Unnamed: 0,b,c
0,7,0
1,2,5
2,7,8
3,5,8
4,5,9
5,2,7
6,1,0
7,8,2
8,2,8
9,2,7


In [216]:
df.query('index < b < c')

Unnamed: 0,b,c
1,2,5
2,7,8
3,5,8
4,5,9


**Note** If the name of your index overlaps with a column name, the column name is given precedence.
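
A minimal sketch of that precedence rule, using a small hand-written frame (not one from this notebook) whose index and only column are both named a. The identifier index remains available for referring to the index explicitly.

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 1, 4, 0, 3]})
df.index.name = 'a'    # the index now shares its name with column 'a'

# 'a' resolves to the column, so rows whose column value exceeds 2 are returned
by_column = df.query('a > 2')

# the special identifier 'index' still refers to the index itself
by_index = df.query('index > 2')

print(by_column)
print(by_index)
```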

# MultiIndex query Syntax

You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:

[[back to top](#Indexing-and-Selecting-Date)]

In [217]:
n = 10

In [218]:
colors = np.random.choice(['red', 'green'], size=n)

In [219]:
colors

array(['red', 'red', 'green', 'green', 'green', 'green', 'red', 'green',
       'green', 'red'], 
      dtype='|S5')

In [220]:
foods = np.random.choice(['eggs', 'ham'], size=n)

In [221]:
foods

array(['eggs', 'ham', 'ham', 'ham', 'ham', 'eggs', 'ham', 'ham', 'ham',
       'eggs'], 
      dtype='|S4')

In [222]:
index = pd.MultiIndex.from_arrays([colors, foods], names=['color', 'food'])

In [223]:
index

MultiIndex(levels=[[u'green', u'red'], [u'eggs', u'ham']],
           labels=[[1, 1, 0, 0, 0, 0, 1, 0, 0, 1], [0, 1, 1, 1, 1, 0, 1, 1, 1, 0]],
           names=[u'color', u'food'])

In [224]:
df = pd.DataFrame(np.random.randn(n, 2), index=index)

In [225]:
df

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
red,eggs,-1.647784,0.214148
red,ham,-1.243073,-0.324917
green,ham,-0.590581,-0.016196
green,ham,0.788661,1.019681
green,ham,-1.021096,-0.505745
green,eggs,-0.045484,2.130698
red,ham,-0.425587,-0.254876
green,ham,-1.075681,2.894384
green,ham,2.042818,-0.432977
red,eggs,0.164762,0.366952


In [226]:
df.query('color == "red"')

Unnamed: 0_level_0,Unnamed: 1_level_0,0,1
color,food,Unnamed: 2_level_1,Unnamed: 3_level_1
red,eggs,-1.647784,0.214148
red,ham,-1.243073,-0.324917
red,ham,-0.425587,-0.254876
red,eggs,0.164762,0.366952


If the levels of the MultiIndex are unnamed, you can refer to them using special names:

In [227]:
df.index.names = [None, None]

In [228]:
df

Unnamed: 0,Unnamed: 1,0,1
red,eggs,-1.647784,0.214148
red,ham,-1.243073,-0.324917
green,ham,-0.590581,-0.016196
green,ham,0.788661,1.019681
green,ham,-1.021096,-0.505745
green,eggs,-0.045484,2.130698
red,ham,-0.425587,-0.254876
green,ham,-1.075681,2.894384
green,ham,2.042818,-0.432977
red,eggs,0.164762,0.366952


In [229]:
df.query('ilevel_0 == "red"')

Unnamed: 0,Unnamed: 1,0,1
red,eggs,-1.647784,0.214148
red,ham,-1.243073,-0.324917
red,ham,-0.425587,-0.254876
red,eggs,0.164762,0.366952


The convention is ilevel_0, which means “index level 0” for the 0th level of the index.
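
The same convention extends to deeper levels. The following sketch, built on a small deterministic frame rather than the random one above, filters on the second unnamed level with ilevel_1 and combines both levels in one expression.

```python
import pandas as pd

index = pd.MultiIndex.from_arrays([['red', 'red', 'green', 'green'],
                                   ['eggs', 'ham', 'eggs', 'ham']])
df = pd.DataFrame({'x': [1, 2, 3, 4]}, index=index)

# levels are unnamed, so refer to them by position: ilevel_0, ilevel_1, ...
ham = df.query('ilevel_1 == "ham"')
red_ham = df.query('ilevel_0 == "red" and ilevel_1 == "ham"')

print(ham)
print(red_ham)
```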

# query Use Cases

A use case for query() is when you have a collection of DataFrame objects that have a subset of column names (or index levels/names) in common. You can pass the same query to both frames without having to specify which frame you’re interested in querying.

[[back to top](#Indexing-and-Selecting-Date)]

In [230]:
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [231]:
df

Unnamed: 0,a,b,c
0,0.262749,0.489325,0.67104
1,0.443349,0.516366,0.288916
2,0.973994,0.588426,0.276498
3,0.045721,0.941823,0.797202
4,0.190815,0.010872,0.354538
5,0.385412,0.759646,0.005021
6,0.113376,0.54352,0.02303
7,0.069916,0.629841,0.501314
8,0.053642,0.981125,0.203925
9,0.994676,0.531717,0.668632


In [232]:
df2 = pd.DataFrame(np.random.rand(n + 2, 3), columns=df.columns)

In [233]:
df2

Unnamed: 0,a,b,c
0,0.100699,0.807485,0.922527
1,0.203752,0.829761,0.675697
2,0.955217,0.769145,0.527935
3,0.817046,0.072252,0.771851
4,0.660143,0.916956,0.367593
5,0.489324,0.305148,0.321524
6,0.650311,0.618304,0.834141
7,0.782217,0.693341,0.825836
8,0.12815,0.877166,0.687436
9,0.581586,0.832748,0.435027
10,0.633447,0.938974,0.363343
11,0.062692,0.755528,0.365476


In [234]:
expr = 'a > 0'

In [235]:
# wrap in list() so the result is materialized on Python 3 as well
list(map(lambda frame: frame.query(expr), [df, df2]))

[          a         b         c
 0  0.262749  0.489325  0.671040
 1  0.443349  0.516366  0.288916
 2  0.973994  0.588426  0.276498
 3  0.045721  0.941823  0.797202
 4  0.190815  0.010872  0.354538
 5  0.385412  0.759646  0.005021
 6  0.113376  0.543520  0.023030
 7  0.069916  0.629841  0.501314
 8  0.053642  0.981125  0.203925
 9  0.994676  0.531717  0.668632,            a         b         c
 0   0.100699  0.807485  0.922527
 1   0.203752  0.829761  0.675697
 2   0.955217  0.769145  0.527935
 3   0.817046  0.072252  0.771851
 4   0.660143  0.916956  0.367593
 5   0.489324  0.305148  0.321524
 6   0.650311  0.618304  0.834141
 7   0.782217  0.693341  0.825836
 8   0.128150  0.877166  0.687436
 9   0.581586  0.832748  0.435027
 10  0.633447  0.938974  0.363343
 11  0.062692  0.755528  0.365476]

# query Python versus pandas Syntax Comparison

Full numpy-like syntax

[[back to top](#Indexing-and-Selecting-Date)]

In [236]:
df = pd.DataFrame(np.random.randint(n, size=(n, 3)), columns=list('abc'))

In [237]:
df

Unnamed: 0,a,b,c
0,9,2,2
1,8,8,9
2,0,2,6
3,1,0,1
4,4,4,1
5,8,8,4
6,1,9,7
7,9,2,2
8,6,2,9
9,1,3,7


In [238]:
df.query('(a < b) & (b < c)')

Unnamed: 0,a,b,c
2,0,2,6
9,1,3,7


In [239]:
df[(df.a < df.b) & (df.b < df.c)]

Unnamed: 0,a,b,c
2,0,2,6
9,1,3,7


Slightly nicer without the parentheses (inside query(), comparison operators bind tighter than & and |), using the word and in place of &:

In [240]:
df.query('a < b and b < c')

Unnamed: 0,a,b,c
2,0,2,6
9,1,3,7


Pretty close to how you might write it on paper

In [241]:
df.query('a < b < c')

Unnamed: 0,a,b,c
2,0,2,6
9,1,3,7


# The in and not in operators

**query()** also supports special use of Python’s in and not in comparison operators, providing a succinct syntax for calling the isin method of a Series or DataFrame.

[[back to top](#Indexing-and-Selecting-Date)]

In [242]:
# get all rows where columns "a" and "b" have overlapping values
df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
                   'c': np.random.randint(5, size=12),
                   'd': np.random.randint(9, size=12)})

In [243]:
df

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8
6,d,b,0,0
7,d,b,2,1
8,e,c,2,7
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


In [244]:
df.query('"a" in b')

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4


In [245]:
df.query('a in b')

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8


In [246]:
# How you'd do it in pure Python
df.loc[df.a.isin(df.b)]  # or df[df.a.isin(df.b)]

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8


In [247]:
df.query('a not in b')

Unnamed: 0,a,b,c,d
6,d,b,0,0
7,d,b,2,1
8,e,c,2,7
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


In [248]:
df.loc[~df.a.isin(df.b)] # or df[~df.a.isin(df.b)]

Unnamed: 0,a,b,c,d
6,d,b,0,0
7,d,b,2,1
8,e,c,2,7
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


You can combine this with other expressions for very succinct queries:

In [249]:
# rows where cols a and b have overlapping values and col c's values are less than col d's
df.query('a in b and c < d')

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8


In [250]:
df.loc[df.a.isin(df.b) & (df.c < df.d)] # or df[df.a.isin(df.b) & (df.c < df.d)]

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8


**Note:** in and not in are evaluated in Python, since numexpr has no equivalent of this operation. However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression

**df.query('a in b + c + d')**

(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any operations that can be evaluated using numexpr will be.

# Special use of the == operator with list objects

Comparing a list of values to a column using ==/!= works similarly to in/not in.

[[back to top](#Indexing-and-Selecting-Date)]

In [251]:
df.query('b == ["a", "b", "c"]')

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8
6,d,b,0,0
7,d,b,2,1
8,e,c,2,7
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


In [252]:
# pure Python
df[df.b.isin(['a','b','c'])]

Unnamed: 0,a,b,c,d
0,a,a,1,7
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
4,c,b,1,2
5,c,b,1,8
6,d,b,0,0
7,d,b,2,1
8,e,c,2,7
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


In [253]:
df.query('c == [1, 2]')

Unnamed: 0,a,b,c,d
0,a,a,1,7
4,c,b,1,2
5,c,b,1,8
7,d,b,2,1
8,e,c,2,7


In [254]:
df.query('c != [1, 2]')

Unnamed: 0,a,b,c,d
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
6,d,b,0,0
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


In [255]:
# using in/not in
df.query('[1, 2] in c')

Unnamed: 0,a,b,c,d
0,a,a,1,7
4,c,b,1,2
5,c,b,1,8
7,d,b,2,1
8,e,c,2,7


In [256]:
df.query('[1, 2] not in c')

Unnamed: 0,a,b,c,d
1,a,a,3,6
2,b,a,3,0
3,b,a,3,4
6,d,b,0,0
9,e,c,4,3
10,f,c,0,4
11,f,c,0,1


In [257]:
# pure Python
df[df.c.isin([1, 2])]

Unnamed: 0,a,b,c,d
0,a,a,1,7
4,c,b,1,2
5,c,b,1,8
7,d,b,2,1
8,e,c,2,7


# Boolean Operators

You can negate boolean expressions with the word not or the ~ operator.

[[back to top](#Indexing-and-Selecting-Date)]

In [258]:
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))

In [259]:
df

Unnamed: 0,a,b,c
0,0.10446,0.771922,0.998242
1,0.825425,0.781084,0.702163
2,0.492764,0.03464,0.003997
3,0.906642,0.680313,0.405463
4,0.000474,0.385759,0.818462
5,0.238348,0.666012,0.526818
6,0.880098,0.177296,0.908081
7,0.868203,0.625847,0.982547
8,0.569514,0.011155,0.946081
9,0.587791,0.739783,0.683634


In [260]:
df['bools'] = np.random.rand(len(df)) > 0.5

In [261]:
df

Unnamed: 0,a,b,c,bools
0,0.10446,0.771922,0.998242,True
1,0.825425,0.781084,0.702163,True
2,0.492764,0.03464,0.003997,False
3,0.906642,0.680313,0.405463,False
4,0.000474,0.385759,0.818462,True
5,0.238348,0.666012,0.526818,True
6,0.880098,0.177296,0.908081,True
7,0.868203,0.625847,0.982547,True
8,0.569514,0.011155,0.946081,True
9,0.587791,0.739783,0.683634,False


In [262]:
df.query('~bools')

Unnamed: 0,a,b,c,bools
2,0.492764,0.03464,0.003997,False
3,0.906642,0.680313,0.405463,False
9,0.587791,0.739783,0.683634,False


In [263]:
df.query('not bools')

Unnamed: 0,a,b,c,bools
2,0.492764,0.03464,0.003997,False
3,0.906642,0.680313,0.405463,False
9,0.587791,0.739783,0.683634,False


In [264]:
df.query('not bools') == df[~df.bools]

Unnamed: 0,a,b,c,bools
2,True,True,True,True
3,True,True,True,True
9,True,True,True,True


Of course, expressions can be arbitrarily complex too

In [265]:
# short query syntax
shorter = df.query('a > b < c and (not bools) or bools > 2')

In [266]:
# equivalent in pure Python
longer = df[(df.a > df.b) & (df.b < df.c) & (~df.bools) | (df.bools > 2)]

In [267]:
shorter

Unnamed: 0,a,b,c,bools


In [268]:
longer

Unnamed: 0,a,b,c,bools


In [269]:
shorter == longer

Unnamed: 0,a,b,c,bools


# Duplicate Data

If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.

* duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row is duplicated.
* drop_duplicates removes duplicate rows.

By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to specify targets to be kept.

* keep='first' (default): mark / drop duplicates except for the first occurrence.
* keep='last': mark / drop duplicates except for the last occurrence.
* keep=False: mark / drop all duplicates.

[[back to top](#Indexing-and-Selecting-Date)]

In [270]:
df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
                    'b': ['x', 'y', 'x', 'y', 'x', 'x', 'x'],
                    'c': np.random.randn(7)})

In [271]:
df2

Unnamed: 0,a,b,c
0,one,x,-1.631891
1,one,y,-0.037956
2,two,x,-1.67051
3,two,y,-0.180181
4,two,x,-0.235641
5,three,x,2.069813
6,four,x,-0.219568


In [272]:
df2.duplicated('a')

0    False
1     True
2    False
3     True
4     True
5    False
6    False
dtype: bool

In [273]:
df2.duplicated('a', keep = 'last')

0     True
1    False
2     True
3     True
4    False
5    False
6    False
dtype: bool

In [274]:
df2.duplicated('a', keep = False)

0     True
1     True
2     True
3     True
4     True
5    False
6    False
dtype: bool

In [275]:
df2.drop_duplicates('a')

Unnamed: 0,a,b,c
0,one,x,-1.631891
2,two,x,-1.67051
5,three,x,2.069813
6,four,x,-0.219568


In [276]:
df2.drop_duplicates('a', keep = 'last')

Unnamed: 0,a,b,c
1,one,y,-0.037956
4,two,x,-0.235641
5,three,x,2.069813
6,four,x,-0.219568


In [277]:
df2.drop_duplicates('a', keep = False)

Unnamed: 0,a,b,c
5,three,x,2.069813
6,four,x,-0.219568


Also, you can pass a list of columns to identify duplications.

In [278]:
df2.duplicated(['a', 'b'])

0    False
1    False
2    False
3    False
4     True
5    False
6    False
dtype: bool

In [279]:
df2.drop_duplicates(['a','b'])

Unnamed: 0,a,b,c
0,one,x,-1.631891
1,one,y,-0.037956
2,two,x,-1.67051
3,two,y,-0.180181
5,three,x,2.069813
6,four,x,-0.219568


To drop duplicates by index value, use Index.duplicated and then perform slicing. The same options are available for the keep parameter.

In [280]:
df3 = pd.DataFrame({'a': np.arange(6),
                    'b': np.random.randn(6)},
                   index=['a', 'a', 'b', 'c', 'b', 'a'])

In [281]:
df3

Unnamed: 0,a,b
a,0,0.372601
a,1,-2.114825
b,2,-0.227672
c,3,-0.060814
b,4,-0.890075
a,5,-0.000168


In [282]:
df3.index.duplicated()

array([False,  True, False, False,  True,  True], dtype=bool)

In [283]:
df3[~df3.index.duplicated()]

Unnamed: 0,a,b
a,0,0.372601
b,2,-0.227672
c,3,-0.060814


In [284]:
df3[~df3.index.duplicated(keep='last')]

Unnamed: 0,a,b
c,3,-0.060814
b,4,-0.890075
a,5,-0.000168


In [285]:
df3[~df3.index.duplicated(keep=False)]

Unnamed: 0,a,b
c,3,-0.060814


# Dictionary-like get method

Each of Series, DataFrame, and Panel has a get method which can return a default value.

[[back to top](#Indexing-and-Selecting-Date)]

In [286]:
s = pd.Series([1,2,3], index=['a','b','c'])

In [287]:
s.get('a')               # equivalent to s['a']

1

In [288]:
s.get('x', default=-1)

-1
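
The examples above use a Series. As a brief sketch on a small illustrative frame, get on a DataFrame looks up a column and falls back to the default (None unless specified) when the column is absent.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

col = df.get('A')                    # same column as df['A']
missing = df.get('Z')                # no KeyError; returns None
fallback = df.get('Z', default=0)    # returns the supplied default instead

print(col.tolist(), missing, fallback)
```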

# The select Method

Another way to extract slices from an object is with the select method of Series, DataFrame, and Panel. This method should be used only when there is no more direct way. select takes a function which operates on labels along axis and returns a boolean. For instance:

[[back to top](#Indexing-and-Selecting-Date)]

In [289]:
df = pd.DataFrame(np.random.randn(8, 4), index = dates, columns = ['A','B','C','D'])

In [290]:
df

Unnamed: 0,A,B,C,D
2000-01-01,-0.999729,-0.321897,-0.177739,-0.429816
2000-01-02,0.224859,-0.137953,0.404406,2.033337
2000-01-03,-0.019398,0.031553,-0.850603,0.562945
2000-01-04,1.020595,-0.019807,0.328945,0.805757
2000-01-05,0.148733,-0.106935,1.666672,0.207811
2000-01-06,-0.187006,0.372111,-0.115863,-0.42953
2000-01-07,-1.377855,0.128495,0.687791,0.034487
2000-01-08,-0.697985,0.326659,0.921325,1.963913


In [291]:
df.select(lambda x: x == 'A', axis = 'columns')

Unnamed: 0,A
2000-01-01,-0.999729
2000-01-02,0.224859
2000-01-03,-0.019398
2000-01-04,1.020595
2000-01-05,0.148733
2000-01-06,-0.187006
2000-01-07,-1.377855
2000-01-08,-0.697985
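
select was deprecated and later removed from pandas, so the call above may not be available in your installation. The sketch below applies the same label predicate through .loc with a boolean list. The dates range is an assumption chosen to match the index shown in the output above.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2000-01-01', periods=8)   # assumed to match the earlier `dates`
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])

# evaluate the label predicate for every column label ...
keep = [label == 'A' for label in df.columns]
# ... and use the resulting boolean list to select columns
only_a = df.loc[:, keep]

print(only_a.columns.tolist())
```

For a simple equality test like this one, df[['A']] is the more direct spelling; the predicate form is useful when the rule on the labels is more involved.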


# The lookup Method

Sometimes you want to extract a set of values given a sequence of row labels and column labels, and the lookup method allows for this and returns a numpy array. For instance,

[[back to top](#Indexing-and-Selecting-Date)]

In [292]:
dflookup = pd.DataFrame(np.random.rand(20,4), columns = ['A','B','C','D'])

In [293]:
dflookup

Unnamed: 0,A,B,C,D
0,0.460219,0.886825,0.417259,0.077658
1,0.2181,0.759535,0.722149,0.098206
2,0.645016,0.947859,0.987255,0.476117
3,0.903317,0.242167,0.601585,0.998908
4,0.218026,0.62993,0.250961,0.644931
5,0.029831,0.624753,0.905707,0.532163
6,0.578942,0.913708,0.404239,0.017679
7,0.486327,0.47289,0.44658,0.357501
8,0.208943,0.889724,0.082928,0.512768
9,0.940604,0.054584,0.558492,0.376232


In [294]:
list(range(0,10,2))

[0, 2, 4, 6, 8]

In [295]:
dflookup.lookup(list(range(0,10,2)), ['B','C','A','B','D'])

array([ 0.886825  ,  0.98725506,  0.21802579,  0.9137077 ,  0.51276777])
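
lookup was deprecated and subsequently removed in later pandas releases. One possible replacement, sketched here on a small deterministic frame, translates the row and column labels to integer positions and pairs them up with NumPy indexing. The names row_pos, col_pos, and picked are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4), columns=['A', 'B', 'C', 'D'])
rows = [0, 2, 4]
cols = ['B', 'C', 'A']

# translate labels to integer positions along each axis
row_pos = df.index.get_indexer(rows)
col_pos = df.columns.get_indexer(cols)

# paired integer arrays pick one value per (row, column) pair
picked = df.to_numpy()[row_pos, col_pos]
print(picked)
```

Note that get_indexer returns -1 for labels it cannot find, so in real code it is worth checking for -1 before indexing.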

# Index objects

The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception will be raised.

Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to create an Index directly is to pass a list or other sequence to Index:

[[back to top](#Indexing-and-Selecting-Date)]

In [296]:
index = pd.Index(['e', 'd', 'a', 'b'])

In [297]:
index

Index([u'e', u'd', u'a', u'b'], dtype='object')

In [298]:
'd' in index

True

You can also pass a name to be stored in the index:

In [299]:
index = pd.Index(['e', 'd', 'a', 'b'], name = 'something')

In [300]:
index

Index([u'e', u'd', u'a', u'b'], dtype='object', name=u'something')

In [301]:
index.name

'something'

The name, if set, will be shown in the console display:

In [302]:
index = pd.Index(list(range(5)), name = 'rows')

In [303]:
index

Int64Index([0, 1, 2, 3, 4], dtype='int64', name=u'rows')

In [304]:
columns = pd.Index(['A', 'B', 'C'], name = 'cols')

In [305]:
columns

Index([u'A', u'B', u'C'], dtype='object', name=u'cols')

In [306]:
df = pd.DataFrame(np.random.randn(5,3), index = index, columns = columns)

In [307]:
df

cols,A,B,C
rows,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,-0.200626,0.38879,-1.152891
1,-0.587904,0.653722,1.387636
2,-2.297391,-1.831712,0.088307
3,-1.511412,-0.911028,-1.04395
4,0.245931,-1.086001,0.143555


In [308]:
df['A']

rows
0   -0.200626
1   -0.587904
2   -2.297391
3   -1.511412
4    0.245931
Name: A, dtype: float64

## Setting metadata

New in version 0.13.0.

Indexes are “mostly immutable”, but it is possible to set and change their metadata, like the index name (or, for MultiIndex, levels and labels).

You can use the rename, set_names, set_levels, and set_labels methods to set these attributes directly. They default to returning a copy; however, you can specify inplace=True to have the data change in place.

See [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced) for usage of MultiIndexes.

In [309]:
ind = pd.Index([1, 2, 3])

In [310]:
ind

Int64Index([1, 2, 3], dtype='int64')

In [311]:
ind.rename('apple')

Int64Index([1, 2, 3], dtype='int64', name=u'apple')

In [312]:
ind

Int64Index([1, 2, 3], dtype='int64')

In [313]:
ind.set_names(['apple'], inplace = True)

In [314]:
ind.name = 'bob'

In [315]:
ind

Int64Index([1, 2, 3], dtype='int64', name=u'bob')

New in version 0.15.0.

set_names, set_levels, and set_labels also take an optional level argument:

In [316]:
index = pd.MultiIndex.from_product([range(3), ['one', 'two']], names=['first', 'second'])

In [317]:
index

MultiIndex(levels=[[0, 1, 2], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [318]:
index.levels[0]

Int64Index([0, 1, 2], dtype='int64', name=u'first')

In [319]:
index.levels[1]

Index([u'one', u'two'], dtype='object', name=u'second')

In [320]:
index.set_levels(["a", "b"], level=1)

MultiIndex(levels=[[0, 1, 2], [u'a', u'b']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])

In [321]:
index  # set_levels only returned a copy

MultiIndex(levels=[[0, 1, 2], [u'one', u'two']],
           labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
           names=[u'first', u'second'])
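
The level argument works the same way for set_names. In this sketch, which rebuilds the MultiIndex from above, only the level called first is renamed and the original index is left untouched. The new name number is chosen purely for illustration.

```python
import pandas as pd

index = pd.MultiIndex.from_product([range(3), ['one', 'two']],
                                   names=['first', 'second'])

# rename only the level currently called 'first'; other levels keep their names
renamed = index.set_names('number', level='first')

print(list(renamed.names))
print(list(index.names))    # the original index is unchanged
```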

## Set operations on Index objects

The two main operations are union (|) and intersection (&). These can be directly called as instance methods or used via overloaded operators. Difference is provided via the .difference() method.

In [322]:
a = pd.Index(['c', 'b', 'a'])

In [323]:
b = pd.Index(['c', 'e', 'd'])

In [324]:
a | b

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [325]:
a & b

Index([u'c'], dtype='object')

In [326]:
a.difference(b)  # elements of a that are not in b

Index([u'a', u'b'], dtype='object')
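
The instance-method spelling is not shown above. A short sketch with the same two indexes follows; since the meaning of | and & on Index objects has changed across pandas versions, the method form is likely the more portable choice.

```python
import pandas as pd

a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])

both = a.union(b)            # labels found in either index
common = a.intersection(b)   # labels found in both indexes
only_a = a.difference(b)     # labels of a that are absent from b

print(both, common, only_a)
```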

Also available is the symmetric_difference (^) operation, which returns elements that appear in either idx1 or idx2 but not both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)), with duplicates dropped.

In [327]:
idx1 = pd.Index([1, 2, 3, 4])

In [328]:
idx2 = pd.Index([2, 3, 4, 5])

In [329]:
idx1.symmetric_difference(idx2)

Int64Index([1, 5], dtype='int64')

In [330]:
idx1 ^ idx2

Int64Index([1, 5], dtype='int64')
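
The stated equivalence can be checked directly. This sketch builds the symmetric difference from two one-sided differences and compares it with the built-in method; manual is an illustrative name.

```python
import pandas as pd

idx1 = pd.Index([1, 2, 3, 4])
idx2 = pd.Index([2, 3, 4, 5])

sym = idx1.symmetric_difference(idx2)
# build the same result from two one-sided differences
manual = idx1.difference(idx2).union(idx2.difference(idx1))

print(sym.equals(manual))
```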

## Missing values

**Important:** Even though Index can hold missing values (NaN), it should be avoided if you do not want any unexpected results. For example, some operations exclude missing values implicitly.

Index.fillna fills missing values with specified scalar value.

In [331]:
idx1 = pd.Index([1, np.nan, 3, 4])

In [332]:
idx1

Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')

In [333]:
idx1.fillna(2)

Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')

In [334]:
idx2 = pd.DatetimeIndex([pd.Timestamp('2011-01-01'), pd.NaT, pd.Timestamp('2011-01-03')])

In [335]:
idx2

DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)

In [336]:
idx2.fillna(pd.Timestamp('2011-01-02'))

DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)

# Set / Reset Index

Occasionally you will load or create a data set into a DataFrame and want to add an index after you’ve already done so. There are a couple of different ways.

[[back to top](#Indexing-and-Selecting-Date)]

## Set an index

DataFrame has a set_index method which takes a column name (for a regular Index) or a list of column names (for a MultiIndex), to create a new, indexed DataFrame:

In [337]:
data = pd.DataFrame({'a':['bar','bar','foo', 'foo'],
                     'b':['one','two','one','two'],
                     'c':['z','y','x','w'],
                     'd':[1,2,3,4]}
                   )

In [338]:
data

Unnamed: 0,a,b,c,d
0,bar,one,z,1
1,bar,two,y,2
2,foo,one,x,3
3,foo,two,w,4


In [339]:
indexed1 = data.set_index('c')

In [340]:
indexed1

Unnamed: 0_level_0,a,b,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
z,bar,one,1
y,bar,two,2
x,foo,one,3
w,foo,two,4


In [341]:
indexed2 = data.set_index(['a', 'b'])

In [342]:
indexed2

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1
bar,two,y,2
foo,one,x,3
foo,two,w,4


The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:

In [343]:
frame = data.set_index('c', drop=False)

In [344]:
frame

Unnamed: 0_level_0,a,b,c,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


In [345]:
frame = frame.set_index(['a', 'b'], append=True)

In [346]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,c,d
c,a,b,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


Other options in set_index allow you to keep the index columns in the DataFrame (drop=False) or to set the index in place, without creating a new object (inplace=True):

In [347]:
data.set_index('c', drop=False)

Unnamed: 0_level_0,a,b,c,d
c,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


In [348]:
data.set_index(['a', 'b'], inplace=True)

In [349]:
data

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1
bar,two,y,2
foo,one,x,3
foo,two,w,4


## Reset the index

As a convenience, DataFrame has a method called reset_index, which transfers the index values into the DataFrame’s columns and sets a simple integer index. This is the inverse operation of set_index.

In [350]:
data

Unnamed: 0_level_0,Unnamed: 1_level_0,c,d
a,b,Unnamed: 2_level_1,Unnamed: 3_level_1
bar,one,z,1
bar,two,y,2
foo,one,x,3
foo,two,w,4


In [351]:
data.reset_index()

Unnamed: 0,a,b,c,d
0,bar,one,z,1
1,bar,two,y,2
2,foo,one,x,3
3,foo,two,w,4


The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the ones stored in the names attribute.
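As a small illustration of the point about names (a sketch only; the level names key1 and key2 are made up for this example), renaming the index levels before calling reset_index changes the names of the resulting columns:

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1, 2, 3, 4]})

indexed = data.set_index(['a', 'b'])

# The level names are taken from the columns used to build the index.
original_names = list(indexed.index.names)

# key1 / key2 are hypothetical names chosen only for this illustration.
indexed.index = indexed.index.set_names(['key1', 'key2'])

# reset_index uses the (renamed) level names as the new column names.
restored = indexed.reset_index()
restored_columns = list(restored.columns)
```

Here original_names holds the names inherited from columns a and b, while restored_columns starts with the renamed levels.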

You can use the level keyword to remove only a portion of the index:

In [352]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,c,d
c,a,b,Unnamed: 3_level_1,Unnamed: 4_level_1
z,bar,one,z,1
y,bar,two,y,2
x,foo,one,x,3
w,foo,two,w,4


In [353]:
frame.reset_index(level=1)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,c,d
c,b,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
z,one,bar,z,1
y,two,bar,y,2
x,one,foo,x,3
w,two,foo,w,4


reset_index takes an optional parameter drop which, if true, simply discards the index instead of putting the index values in the DataFrame’s columns.
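A minimal sketch of the difference, rebuilding the same small frame used in this section (the variable names kept and dropped are only for illustration):

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'b': ['one', 'two', 'one', 'two'],
                     'c': ['z', 'y', 'x', 'w'],
                     'd': [1, 2, 3, 4]})

indexed = data.set_index(['a', 'b'])

# Default: the index levels come back as ordinary columns.
kept = indexed.reset_index()

# drop=True: the index levels are discarded; only c and d remain.
dropped = indexed.reset_index(drop=True)
```

In both cases the result has a plain integer index; the only difference is whether a and b survive as columns.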

## Adding an ad hoc index

If you create an index yourself, you can just assign it to the index field:

In [354]:
data = pd.DataFrame({'a':['bar','bar','foo', 'foo'],
                     'b':['one','two','one','two'],
                     'c':['z','y','x','w'],
                     'd':[1,2,3,4]}
                   )

In [355]:
index = ['A','B','C','D']

In [356]:
data.index = index

In [357]:
data

Unnamed: 0,a,b,c,d
A,bar,one,z,1
B,bar,two,y,2
C,foo,one,x,3
D,foo,two,w,4
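A short sketch of two details that the assignment above relies on: the new index may be a named Index object, and it must have one label per row. The name 'label' and the shortened frame are assumptions made only for this example.

```python
import pandas as pd

data = pd.DataFrame({'a': ['bar', 'bar', 'foo', 'foo'],
                     'd': [1, 2, 3, 4]})

# An Index object can carry a name; 'label' is a made-up name for this sketch.
data.index = pd.Index(['A', 'B', 'C', 'D'], name='label')

# Assigning an index of the wrong length raises instead of silently misaligning.
try:
    data.index = ['A', 'B', 'C']
    length_error = None
except ValueError as exc:
    length_error = exc
```

After the failed assignment the frame still carries the four-label index set earlier.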


# Returning a view versus a copy

When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an example.

[[back to top](#Indexing-and-Selecting-Date)]

In [358]:
dfmi = pd.DataFrame([list('abcd'),
                     list('efgh'),
                     list('ijkl'),
                     list('mnop')],
                    columns=pd.MultiIndex.from_product([['one','two'],
                                                        ['first','second']]))

In [359]:
dfmi

Unnamed: 0_level_0,one,one,two,two
Unnamed: 0_level_1,first,second,first,second
0,a,b,c,d
1,e,f,g,h
2,i,j,k,l
3,m,n,o,p


Compare these two access methods:

In [360]:
dfmi['one']['second']

0    b
1    f
2    j
3    n
Name: second, dtype: object

In [361]:
dfmi.loc[:,('one','second')]

0    b
1    f
2    j
3    n
Name: (one, second), dtype: object

These both yield the same results, so which should you use? It is instructive to understand the order of operations in each, and why method 2 (.loc) is much preferred over method 1 (chained []).

dfmi['one'] selects the first level of the columns and returns a singly-indexed DataFrame. A second Python operation, dfmi_with_one['second'], then selects the series indexed by 'second'. The intermediate result is written here as the variable dfmi_with_one because pandas sees these operations as separate events, i.e. separate calls to __getitem__, so it has to treat them as linear operations that happen one after another.

Contrast this with dfmi.loc[:,('one','second')], which passes a nested tuple of (slice(None),('one','second')) to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore, this order of operations can be significantly faster, and allows one to index both axes if so desired.
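The two access paths can be spelled out step by step. This is only an illustrative sketch; it rebuilds dfmi so that it runs on its own, and the names chained and direct are not part of the original example.

```python
import pandas as pd

dfmi = pd.DataFrame(
    [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
    columns=pd.MultiIndex.from_product([['one', 'two'],
                                        ['first', 'second']]),
)

# Chained indexing: two separate __getitem__ calls.
dfmi_with_one = dfmi['one']          # intermediate DataFrame, columns first/second
chained = dfmi_with_one['second']    # Series named 'second'

# Single .loc call: one __getitem__ on the indexer, given a tuple key.
direct = dfmi.loc[:, ('one', 'second')]   # Series named ('one', 'second')
```

Both Series hold the same values; only the name differs, which reflects that the chained version has lost track of the outer column level.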

### Why does assignment fail when using chained indexing?

The problem in the previous section is just a performance issue. What’s up with the SettingWithCopy warning? We don’t usually throw warnings around when you do something that might cost a few extra milliseconds!

But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this, think about how the Python interpreter executes this code:

    dfmi.loc[:,('one','second')] = value
    # becomes
    dfmi.loc.__setitem__((slice(None), ('one', 'second')), value)

But this code is handled differently:

    dfmi['one']['second'] = value
    # becomes
    dfmi.__getitem__('one').__setitem__('second', value)

See that __getitem__ in there? Outside of simple cases, it’s very hard to predict whether it will return a view or a copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That’s what SettingWithCopy is warning you about!

**Note:** You may be wondering whether we should be concerned about the loc property in the first example. But dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ / dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be a view or a copy of dfmi.
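The translation of the .loc assignment into a single __setitem__ call can be checked directly. The sketch below (with a hypothetical helper make_frame and the placeholder value 'X') applies both spellings to fresh copies of the frame and compares the results:

```python
import pandas as pd

def make_frame():
    # Hypothetical helper: rebuilds the dfmi frame from this section.
    return pd.DataFrame(
        [list('abcd'), list('efgh'), list('ijkl'), list('mnop')],
        columns=pd.MultiIndex.from_product([['one', 'two'],
                                            ['first', 'second']]),
    )

# Ordinary syntax.
via_syntax = make_frame()
via_syntax.loc[:, ('one', 'second')] = 'X'

# The explicit call that the syntax above is translated into.
via_dunder = make_frame()
via_dunder.loc.__setitem__((slice(None), ('one', 'second')), 'X')
```

Calling __setitem__ by hand is shown only to make the mechanics visible; in ordinary code the first form is the one to use.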

A SettingWithCopy warning will sometimes arise when there’s no obvious chained indexing going on. These are the bugs that SettingWithCopy is designed to catch! Pandas is probably trying to warn you that you’ve done this:

In [362]:
def do_something(df):
    foo = df[['bar', 'baz']]  # Is foo a view? A copy? Nobody knows!
    # ... many lines here ...
    foo['quux'] = value       # We don't know whether this will modify df or not!
    return foo

Yikes!
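One common way to remove the ambiguity, offered here as a sketch rather than as the only fix, is to take an explicit copy of the selection before modifying it. The extra value argument and the sample frame below are assumptions added so that the example runs on its own.

```python
import pandas as pd

def do_something(df, value):
    # An explicit copy makes foo unambiguously independent of df.
    foo = df[['bar', 'baz']].copy()
    foo['quux'] = value   # modifies foo only
    return foo

# Sample data invented for this illustration.
df = pd.DataFrame({'bar': [1, 2], 'baz': [3, 4], 'other': [5, 6]})
result = do_something(df, 0)
```

If the intent is instead to modify df itself, assign through df directly, for example with df.loc[...] = value.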

### Evaluation order matters

Furthermore, in chained expressions, the order may determine whether a copy is returned or not. If an expression will set values on a copy of a slice, then a SettingWithCopy warning is shown or a SettingWithCopy exception is raised, depending on configuration (this raise/warn behavior is new starting in 0.13.0).

You can control the action of a chained assignment via the option mode.chained_assignment, which can take the values ['raise','warn',None], where showing a warning is the default.

In [363]:
dfb = pd.DataFrame({'a' : ['one', 'one', 'two',
                           'three', 'two', 'one', 'six'],
                    'c' : np.arange(7)})

In [364]:
dfb

Unnamed: 0,a,c
0,one,0
1,one,1
2,two,2
3,three,3
4,two,4
5,one,5
6,six,6


In [365]:
# This will show the SettingWithCopyWarning
# but the frame values will be set
#dfb['c'][dfb.a.str.startswith('o')] = 42

In [366]:
dfb

Unnamed: 0,a,c
0,one,0
1,one,1
2,two,2
3,three,3
4,two,4
5,one,5
6,six,6


This, however, is operating on a copy and will not work.

In [367]:
pd.set_option('mode.chained_assignment','warn')

In [368]:
dfb[dfb.a.str.startswith('o')]['c'] = 42

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [369]:
dfb

Unnamed: 0,a,c
0,one,0
1,one,1
2,two,2
3,three,3
4,two,4
5,one,5
6,six,6


In [370]:
dfb.loc[dfb.a.str.startswith('o'),'c'] = 12

In [371]:
dfb

Unnamed: 0,a,c
0,one,12
1,one,12
2,two,2
3,three,3
4,two,4
5,one,12
6,six,6


A chained assignment can also crop up when setting values in a mixed-dtype frame.

This is the correct access method:

In [372]:
dfc = pd.DataFrame({'A':['aaa','bbb','ccc'],'B':[1,2,3]})

In [373]:
dfc

Unnamed: 0,A,B
0,aaa,1
1,bbb,2
2,ccc,3


In [374]:
dfc.loc[0,'A'] = 11

In [375]:
dfc

Unnamed: 0,A,B
0,11,1
1,bbb,2
2,ccc,3


This can work at times, but it is not guaranteed to, and so should be avoided:

In [376]:
dfc = dfc.copy()

In [377]:
dfc['A'][0] = 111

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy


In [378]:
dfc

Unnamed: 0,A,B
0,111,1
1,bbb,2
2,ccc,3


This will not work at all, and so should be avoided:

In [379]:
pd.set_option('mode.chained_assignment','raise')

In [380]:
dfc.loc[0]['A'] = 1111

SettingWithCopyError: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy