## Getting Started with pandas

pandas will be the primary library of interest throughout much of the rest of the book. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to use in NumPy-centric applications.

In [1]:
from pandas import Series, DataFrame

In [2]:
import pandas as pd

## Introduction to pandas Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

## Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [36]:
obj = Series([4, 7, -5, 3])

In [5]:
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [6]:
obj.values

array([ 4,  7, -5,  3], dtype=int64)

In [7]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [11]:
obj2 = Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

In [12]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [13]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [14]:
obj2['a']

-5

In [17]:
obj2['d']

4

In [19]:
obj2[['c', 'a', 'd']]

c    3
a   -5
d    4
dtype: int64

In [20]:
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [21]:
obj2[obj2 > 0]

d    4
b    7
c    3
dtype: int64

In [22]:
obj2 * 2

d     8
b    14
a   -10
c     6
dtype: int64

In [24]:
import numpy as np

In [25]:
np.exp(obj2)

d      54.598150
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [30]:
'b' in obj2

True

In [31]:
'e' in obj2

False
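
The cells above show that a Series behaves partly like an ndarray and partly like a fixed-length, ordered dict. As a compact recap, here is a minimal sketch that repeats those operations in one place; the variable names (`positive`, `doubled`, `rooted`, `has_b`, `has_7`) are illustrative and not part of the original session.

```python
import numpy as np
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

positive = s[s > 0]          # boolean filtering keeps the matching labels
doubled = s * 2              # scalar arithmetic is applied element-wise
rooted = np.sqrt(positive)   # NumPy ufuncs return a Series with the same index
has_b = 'b' in s             # membership tests look at the index labels...
has_7 = 7 in s               # ...not at the stored values

print(positive)
print(has_b, has_7)
```

Note that `in` checks the index, so a value that is present in the data but is not a label is reported as absent.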

Should you have data contained in a Python dict, you can create a Series from it by passing the dict:

In [32]:
sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

In [33]:
type(sdata)

dict

In [34]:
obj3 = Series(sdata)

In [35]:
obj3

Ohio      35000
Oregon    16000
Texas     71000
Utah       5000
dtype: int64

In [40]:
states = ['California', 'Ohio', 'Oregon', 'Texas']

In [41]:
obj4 = Series(sdata, index=states)

In [42]:
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In this case, the three values found in sdata were placed in the appropriate locations, but since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values. Since 'Utah' was not included in states, it is excluded from the resulting object. I will use the terms “missing” or “NA” to refer to missing data. The isnull and notnull functions in pandas should be used to detect missing data:

In [43]:
pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [44]:
pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [45]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [46]:
obj4.notnull()

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
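
To tie the missing-data checks together, the following sketch rebuilds a Series from a dict using an index that contains one key the dict lacks, then uses isnull and notnull to split the labels. The names `counts`, `wanted`, `missing`, and `present` are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data mirroring sdata above
counts = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
wanted = ['California', 'Ohio', 'Oregon', 'Texas']

# Labels absent from the dict become NaN; dict keys not in `wanted` are dropped
s = pd.Series(counts, index=wanted)

missing = s[s.isnull()].index.tolist()   # labels with no data
present = s[s.notnull()]                 # only the populated entries

print(missing)
print(present)
```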

In [47]:
obj3 + obj4

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
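
The result above illustrates that arithmetic between two Series aligns on index labels rather than on position. A small sketch of the same behavior, using made-up labels and values:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Values are matched by label; labels found in only one operand yield NaN
total = a + b

print(total)
```

The result's index is the union of the two input indexes, which is why 'California' and 'Utah' both appear in the output above.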

In [48]:
obj4.name = 'population'

In [49]:
obj4.index.name = 'state'

In [50]:
obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [51]:
obj.index = ['bob', 'Steve', 'Jeff', 'Ryan']

In [52]:
obj

bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64

## DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series, all sharing the same index. Compared with other DataFrame-like structures you may have used before (like R’s data.frame), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically. Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays. The exact details of DataFrame’s internals are far outside the scope of this book.

There are numerous ways to construct a DataFrame, though one of the most common is from a dict of equal-length lists or NumPy arrays:

In [53]:
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002], 
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

In [54]:
data

{'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
 'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
 'year': [2000, 2001, 2002, 2001, 2002]}

In [55]:
frame = DataFrame(data)

In [56]:
frame

Unnamed: 0,pop,state,year
0,1.5,Ohio,2000
1,1.7,Ohio,2001
2,3.6,Ohio,2002
3,2.4,Nevada,2001
4,2.9,Nevada,2002


In [58]:
DataFrame(data, columns=['year', 'state', 'pop'])

Unnamed: 0,year,state,pop
0,2000,Ohio,1.5
1,2001,Ohio,1.7
2,2002,Ohio,3.6
3,2001,Nevada,2.4
4,2002,Nevada,2.9


In [59]:
frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                  index=['one', 'two', 'three', 'four', 'five'])

In [60]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,
five,2002,Nevada,2.9,


In [61]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [62]:
frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [63]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [64]:
frame2.ix['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [65]:
frame2['debt'] = 16.5

In [66]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,16.5
two,2001,Ohio,1.7,16.5
three,2002,Ohio,3.6,16.5
four,2001,Nevada,2.4,16.5
five,2002,Nevada,2.9,16.5


In [70]:
frame2['debt'] = np.arange(5.00)

In [71]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,0.0
two,2001,Ohio,1.7,1.0
three,2002,Ohio,3.6,2.0
four,2001,Nevada,2.4,3.0
five,2002,Nevada,2.9,4.0


When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, it will instead be conformed exactly to the DataFrame’s index, inserting missing values in any holes:

In [72]:
val = Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

In [73]:
frame2['debt'] = val

In [74]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7
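
The difference between assigning a list and assigning a Series can be sketched on a smaller frame. This is only an illustration under assumed data; the names `partial` and `length_error` are not from the original session.

```python
import pandas as pd

frame = pd.DataFrame({'pop': [1.5, 1.7, 3.6]}, index=['one', 'two', 'three'])

# A list is assigned by position, so its length must match the frame
frame['debt'] = [0.0, 1.0, 2.0]

# A Series is aligned on index labels; 'one' and 'three' have no match
partial = pd.Series([-1.2], index=['two'])
frame['debt'] = partial

# A list of the wrong length is rejected
try:
    frame['bad'] = [1, 2]
    length_error = None
except ValueError as exc:
    length_error = exc

print(frame)
print(length_error)
```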


In [75]:
frame2['eastern'] = frame2.state == 'Ohio'

In [76]:
frame2

Unnamed: 0,year,state,pop,debt,eastern
one,2000,Ohio,1.5,,True
two,2001,Ohio,1.7,-1.2,True
three,2002,Ohio,3.6,,True
four,2001,Nevada,2.4,-1.5,False
five,2002,Nevada,2.9,-1.7,False


In [77]:
del frame2['eastern']

In [78]:
frame2

Unnamed: 0,year,state,pop,debt
one,2000,Ohio,1.5,
two,2001,Ohio,1.7,-1.2
three,2002,Ohio,3.6,
four,2001,Nevada,2.4,-1.5
five,2002,Nevada,2.9,-1.7


In [79]:
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

The column returned when indexing a DataFrame is a view on the underlying data, not a copy. Thus, any in-place modifications to the Series will be reflected in the DataFrame. The column can be explicitly copied using the Series’s copy method.
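
Whether an uncopied column behaves as a view may depend on the pandas version and its settings, so the sketch below only demonstrates the safe side of the statement: a column taken with copy owns its data, and edits to it never reach the DataFrame. The names are illustrative.

```python
import pandas as pd

frame = pd.DataFrame({'year': [2000, 2001, 2002]})

# An explicit copy is independent of the frame it came from
year_copy = frame['year'].copy()
year_copy.iloc[0] = 1999

print(frame['year'].tolist())
print(year_copy.tolist())
```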

In [83]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}


In [84]:
frame3 = DataFrame(pop)

In [85]:
frame3

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [86]:
frame3.T

Unnamed: 0,2000,2001,2002
Nevada,,2.4,2.9
Ohio,1.5,1.7,3.6


In [89]:
DataFrame(pop, index=[2001, 2002, 2003])

Unnamed: 0,Nevada,Ohio
2001,2.4,1.7
2002,2.9,3.6
2003,,


In [104]:
pdata = {'Ohio': frame3['Ohio'][:-1],
         'Nevada': frame3['Nevada'][:2]}

In [105]:
DataFrame(pdata)

Unnamed: 0,Nevada,Ohio
2000,,1.5
2001,2.4,1.7


In [106]:
frame3.index.name = 'year'; frame3.columns.name = 'state'

In [107]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [108]:
frame3.values

array([[ nan,  1.5],
       [ 2.4,  1.7],
       [ 2.9,  3.6]])

In [109]:
frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

## Index Objects

In [110]:
obj = Series(range(3), index=['a', 'b', 'c'])

In [111]:
index = obj.index

In [112]:
index

Index(['a', 'b', 'c'], dtype='object')

In [113]:
index[1:]

Index(['b', 'c'], dtype='object')

In [115]:
index[1] = 'd'

TypeError: Index does not support mutable operations
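
Because Index objects are immutable, changing a label means building a new Index rather than editing the old one. A minimal sketch of both the failing assignment and one possible alternative (the variable names are assumptions):

```python
import pandas as pd

s = pd.Series(range(3), index=['a', 'b', 'c'])
idx = s.index

try:
    idx[1] = 'd'          # item assignment is not allowed on an Index
    error = None
except TypeError as exc:
    error = exc

# To change labels, assign a whole new Index to the Series
s.index = pd.Index(['a', 'd', 'c'])

print(error)
print(s.index.tolist())
```

Immutability is what makes it safe to share a single Index object among several data structures.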

In [119]:
index = pd.Index(np.arange(3))

In [120]:
index

Int64Index([0, 1, 2], dtype='int64')

In [122]:
obj2 = Series([1.5, -2.5, 0], index=index)

In [123]:
obj2

0    1.5
1   -2.5
2    0.0
dtype: float64

In [124]:
obj2.index is index

True

In [125]:
frame3

state,Nevada,Ohio
year,Unnamed: 1_level_1,Unnamed: 2_level_1
2000,,1.5
2001,2.4,1.7
2002,2.9,3.6


In [126]:
'Ohio' in frame3.columns

True

In [127]:
2003 in frame3.index

False

## Essential Functionality

## Reindexing

In [128]:
obj = Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

In [129]:
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

Calling reindex on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [130]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])

In [131]:
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

In [132]:
obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0)

a   -5.3
b    7.2
c    3.6
d    4.5
e    0.0
dtype: float64

In [133]:
obj3 = Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

In [134]:
obj3

0      blue
2    purple
4    yellow
dtype: object

In [135]:
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
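
The three ways of handling new labels shown above (leave them missing, fill with a constant, or fill forward) can be compared side by side. This is a sketch that reuses the data from obj3; the result names are illustrative.

```python
import pandas as pd

colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = colors.reindex(range(6))                        # new labels become NaN
filled = colors.reindex(range(6), method='ffill')       # carry the last value forward
constant = colors.reindex(range(6), fill_value='none')  # use a constant for new labels

print(plain)
print(filled)
print(constant)
```

Forward filling is mostly useful for ordered data such as time series, where the previous observation is a reasonable stand-in for a gap.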

In [138]:
frame = DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                  columns=['Ohio', 'Texas', 'California'])

In [139]:
frame

Unnamed: 0,Ohio,Texas,California
a,0,1,2
c,3,4,5
d,6,7,8


In [140]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])

In [141]:
frame2

Unnamed: 0,Ohio,Texas,California
a,0.0,1.0,2.0
b,,,
c,3.0,4.0,5.0
d,6.0,7.0,8.0


In [142]:
states = ['Texas', 'Utah', 'California']

In [144]:
frame.reindex(columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
c,4,,5
d,7,,8


In [145]:
frame.reindex(index=['a', 'b', 'c', 'd'], method='ffill',
             columns=states)

Unnamed: 0,Texas,Utah,California
a,1,,2
b,1,,2
c,4,,5
d,7,,8


In [146]:
frame.ix[['a', 'b', 'c', 'd'], states]

Unnamed: 0,Texas,Utah,California
a,1.0,,2.0
b,,,
c,4.0,,5.0
d,7.0,,8.0


## Dropping entries from an axis

In [147]:
obj = Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

In [148]:
obj

a    0.0
b    1.0
c    2.0
d    3.0
e    4.0
dtype: float64

In [149]:
new_obj = obj.drop('c')

In [150]:
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [153]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

In [156]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                index=['Ohio', 'Colorado', 'Utah', 'New York'],
                columns=['one', 'two', 'three', 'four'])

In [157]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [158]:
data.drop(['Colorado', 'Ohio'])

Unnamed: 0,one,two,three,four
Utah,8,9,10,11
New York,12,13,14,15


In [159]:
data.drop('two', axis=1)

Unnamed: 0,one,three,four
Ohio,0,2,3
Colorado,4,6,7
Utah,8,10,11
New York,12,14,15


In [160]:
data.drop(['two', 'four'], axis=1)

Unnamed: 0,one,three
Ohio,0,2
Colorado,4,6
Utah,8,10
New York,12,14
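
As the cells above suggest, drop returns a new object and leaves the original untouched. A short sketch that checks this on the same data (the result names are assumptions):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

rows_dropped = data.drop(['Colorado', 'Ohio'])      # axis=0 (rows) is the default
cols_dropped = data.drop(['two', 'four'], axis=1)   # axis=1 targets columns

# The original frame still has all four rows and columns
print(data.shape, rows_dropped.shape, cols_dropped.shape)
```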


## Indexing, selection, and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [162]:
obj = Series(np.arange(4), index=['a', 'b', 'c', 'd'])

In [163]:
obj

a    0
b    1
c    2
d    3
dtype: int32

In [164]:
obj['b']

1

In [165]:
obj[1]

1

In [166]:
obj[2:4]

c    2
d    3
dtype: int32

In [167]:
obj[['b', 'a', 'd']]

b    1
a    0
d    3
dtype: int32

In [169]:
obj[[1, 3]]

b    1
d    3
dtype: int32

In [170]:
obj[obj < 2]

a    0
b    1
dtype: int32

In [171]:
obj['b':'d']

b    1
c    2
d    3
dtype: int32

In [172]:
obj['b':'c'] = 5

In [173]:
obj

a    0
b    5
c    5
d    3
dtype: int32

As you’ve seen above, indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:

In [177]:
data = DataFrame(np.arange(16).reshape((4, 4)),
                index=['Ohio', 'Colorado', 'Utah', 'New York'],
                columns=['one', 'two', 'three', 'four'])

In [178]:
data

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [179]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int32

In [181]:
data[['three', 'one']]

Unnamed: 0,three,one
Ohio,2,0
Colorado,6,4
Utah,10,8
New York,14,12


In [182]:
data[:2]

Unnamed: 0,one,two,three,four
Ohio,0,1,2,3
Colorado,4,5,6,7


In [183]:
data[data['three'] > 5]

Unnamed: 0,one,two,three,four
Colorado,4,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


In [184]:
data < 5

Unnamed: 0,one,two,three,four
Ohio,True,True,True,True
Colorado,True,False,False,False
Utah,False,False,False,False
New York,False,False,False,False


In [185]:
data[data < 5] = 0

In [186]:
data

Unnamed: 0,one,two,three,four
Ohio,0,0,0,0
Colorado,0,5,6,7
Utah,8,9,10,11
New York,12,13,14,15


This is intended to make DataFrame syntactically more like an ndarray in this case.

For DataFrame label-indexing on the rows, I introduce the special indexing field ix. It enables you to select a subset of the rows and columns from a DataFrame with NumPy-like notation plus axis labels. As I mentioned earlier, this is also a less verbose way to do reindexing:

In [187]:
data.ix['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int32

In [188]:
data.ix[['Colorado', 'Utah'], [3, 0, 1]]

Unnamed: 0,four,one,two
Colorado,7,0,5
Utah,11,8,9


In [190]:
data.ix[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int32

In [191]:
data.ix[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int32

In [192]:
data.ix[data.three > 5, :3]

Unnamed: 0,one,two,three
Colorado,0,5,6
Utah,8,9,10
New York,12,13,14
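
The ix indexer used above was deprecated in pandas 0.20 and removed in pandas 1.0. On versions without it, the same selections can be written with loc (label-based) and iloc (position-based). The sketch below is one possible translation, run on the original, unmodified data; it is not part of the recorded session, and the result names are assumptions.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Colorado', ['two', 'three']]   # row label, column labels
by_position = data.iloc[2]                          # third row, by position
label_slice = data.loc[:'Utah', 'two']              # label slices include the endpoint

# Mixed selection: row labels plus column positions, resolved via data.columns
mixed = data.loc[['Colorado', 'Utah'], data.columns[[3, 0, 1]]]

# Boolean row mask combined with the first three columns
masked = data.loc[data['three'] > 5, data.columns[:3]]

print(mixed)
print(masked)
```

Keeping labels and positions in separate indexers avoids the ambiguity ix had when an index itself contained integers.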


## Arithmetic and data alignment

In [193]:
s1 = Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])

In [205]:
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [206]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

The internal data alignment introduces NA values in the indices that don’t overlap. Missing values propagate in arithmetic computations. With DataFrame, alignment is performed on both the rows and the columns:

In [213]:
df1 = DataFrame(np.arange(9).reshape((3, 3)), columns=list('bcd'),
               index=['Ohio', 'Texas', 'Colorado'])

In [214]:
df2 = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])


In [215]:
df1

Unnamed: 0,b,c,d
Ohio,0,1,2
Texas,3,4,5
Colorado,6,7,8


In [216]:
df2

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [217]:
df1 + df2

Unnamed: 0,b,c,d,e
Colorado,,,,
Ohio,3.0,,6.0,
Oregon,,,,
Texas,9.0,,12.0,
Utah,,,,


### Arithmetic methods with fill values

In [218]:
df1 = DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'))

In [219]:
df2 = DataFrame(np.arange(20).reshape((4, 5)), columns=list('abcde'))

In [220]:
df1

Unnamed: 0,a,b,c,d
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11


In [221]:
df2

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,4
1,5,6,7,8,9
2,10,11,12,13,14
3,15,16,17,18,19


In [222]:
df1 + df2

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,
1,9.0,11.0,13.0,15.0,
2,18.0,20.0,22.0,24.0,
3,,,,,


In [223]:
df1.add(df2, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0.0,2.0,4.0,6.0,4.0
1,9.0,11.0,13.0,15.0,9.0
2,18.0,20.0,22.0,24.0,14.0
3,15.0,16.0,17.0,18.0,19.0
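
The effect of fill_value is easiest to see on small frames. In this sketch (shapes and names chosen only for illustration), the plain `+` leaves NaN wherever one operand has no value, while add with fill_value=0 treats the absent side as zero.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(4).reshape((2, 2)), columns=list('ab'))
df2 = pd.DataFrame(np.arange(9).reshape((3, 3)), columns=list('abc'))

plain = df1 + df2                     # NaN wherever either side is missing
filled = df1.add(df2, fill_value=0)   # a missing side is treated as 0

print(plain)
print(filled)
```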


In [224]:
df1.reindex(columns=df2.columns, fill_value=0)

Unnamed: 0,a,b,c,d,e
0,0,1,2,3,0
1,4,5,6,7,0
2,8,9,10,11,0


### Operations between DataFrame and Series

As with NumPy arrays, arithmetic between DataFrame and Series is well-defined. First, as a motivating example, consider the difference between a 2D array and one of its rows:

In [226]:
arr = np.arange(12).reshape((3, 4))

In [227]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [228]:
arr[0]

array([0, 1, 2, 3])

In [229]:
arr - arr[0]

array([[0, 0, 0, 0],
       [4, 4, 4, 4],
       [8, 8, 8, 8]])

In [230]:
frame = DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [231]:
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [234]:
series = frame.ix[0]

In [235]:
series

b    0
d    1
e    2
Name: Utah, dtype: int32

In [236]:
frame - series

Unnamed: 0,b,d,e
Utah,0,0,0
Ohio,3,3,3
Texas,6,6,6
Oregon,9,9,9


In [239]:
series2 = Series(range(3), index=['b', 'e', 'f'])

In [240]:
frame + series2

Unnamed: 0,b,d,e,f
Utah,0.0,,3.0,
Ohio,3.0,,6.0,
Texas,6.0,,9.0,
Oregon,9.0,,12.0,


In [241]:
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [242]:
series2

b    0
e    1
f    2
dtype: int32

If you want to instead broadcast over the columns, matching on the rows, you have to use one of the arithmetic methods. For example:

In [243]:
series3 = frame['d']

In [244]:
frame

Unnamed: 0,b,d,e
Utah,0,1,2
Ohio,3,4,5
Texas,6,7,8
Oregon,9,10,11


In [245]:
series3

Utah       1
Ohio       4
Texas      7
Oregon    10
Name: d, dtype: int32

In [246]:
frame.sub(series3, axis=0)

Unnamed: 0,b,d,e
Utah,-1,0,1
Ohio,-1,0,1
Texas,-1,0,1
Oregon,-1,0,1
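
The two broadcasting directions can be put side by side. In this sketch, which reuses the frame from above with illustrative result names, the plain operator matches the Series index against the columns, while sub with axis=0 matches it against the rows.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])

first_row = frame.iloc[0]   # a Series indexed by the column labels b, d, e
col_d = frame['d']          # a Series indexed by the row labels

by_rows = frame - first_row          # Series index matched to columns, broadcast down rows
by_cols = frame.sub(col_d, axis=0)   # Series index matched to rows, broadcast across columns

print(by_rows)
print(by_cols)
```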


### Function application and mapping

In [249]:
frame = DataFrame(np.random.randn(4, 3), columns=list('bde'),
                  index=['Utah', 'Ohio', 'Texas', 'Oregon'])

In [250]:
frame

Unnamed: 0,b,d,e
Utah,-0.122974,-1.029331,-0.614316
Ohio,-0.015878,-0.562817,-0.171911
Texas,-0.321394,0.158331,-0.41028
Oregon,-0.537099,-1.599214,-0.331065


In [251]:
np.abs(frame)

Unnamed: 0,b,d,e
Utah,0.122974,1.029331,0.614316
Ohio,0.015878,0.562817,0.171911
Texas,0.321394,0.158331,0.41028
Oregon,0.537099,1.599214,0.331065


Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s apply method does exactly this:

In [252]:
f = lambda x: x.max() - x.min()

In [253]:
frame.apply(f)

b    0.521221
d    1.757545
e    0.442405
dtype: float64

In [254]:
frame.apply(f, axis=1)

Utah      0.906356
Ohio      0.546939
Texas     0.568611
Oregon    1.268149
dtype: float64
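
Since the frame above holds random numbers, its output cannot be checked by hand. The following sketch repeats the same apply calls on fixed, made-up values; `spread` is a hypothetical named version of the lambda.

```python
import pandas as pd

frame = pd.DataFrame({'b': [1.0, 4.0, 2.0], 'd': [10.0, 30.0, 20.0]},
                     index=['Utah', 'Ohio', 'Texas'])

def spread(x):
    """Range of a 1D Series: largest value minus smallest value."""
    return x.max() - x.min()

per_column = frame.apply(spread)          # function receives each column
per_row = frame.apply(spread, axis=1)     # function receives each row

print(per_column)
print(per_row)
```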

### lambda anonymous function

In [255]:
L = [lambda x: x ** 2,
     lambda x: x ** 3,
     lambda x: x ** 4]
for f in L:
    print(f(2))
print(L[0](3))

4
8
16
9


### Multiway branch switches: The finale

In [262]:
key = 'already'
{'already': (lambda: 2 + 2),
 'got':     (lambda: 2 * 4),
 'one':     (lambda: 2 ** 6)}[key]()

4

In [263]:
lower = (lambda x, y: x if x < y else y)

In [264]:
lower('bb', 'aa')

'aa'

In [265]:
lower('aa', 'bb')

'aa'

In [266]:
import sys
showall = lambda x: list(map(sys.stdout.write, x))
t = showall(['spam\n', 'toast\n', 'egg\n'])

spam
toast
egg


In [273]:
((lambda x: (lambda y: x + y)))(99)(4)

103

### https://my.oschina.net/zyzzy/blog/115096

In [274]:
def f(x):
    return Series([x.min(), x.max()], index=['min', 'max'])

In [275]:
frame.apply(f)

Unnamed: 0,b,d,e
min,-0.537099,-1.599214,-0.614316
max,-0.015878,0.158331,-0.171911


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with applymap:

In [276]:
format = lambda x: '%.2f' % x

In [277]:
frame.applymap(format)

Unnamed: 0,b,d,e
Utah,-0.12,-1.03,-0.61
Ohio,-0.02,-0.56,-0.17
Texas,-0.32,0.16,-0.41
Oregon,-0.54,-1.6,-0.33


In [279]:
frame['e'].map(format)

Utah      -0.61
Ohio      -0.17
Texas     -0.41
Oregon    -0.33
Name: e, dtype: object
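
The name of the element-wise DataFrame method has changed across pandas releases (applymap was deprecated in pandas 2.1 in favor of DataFrame.map). As a version-neutral sketch, the same formatting can be done with Series.map applied to each column through apply. The data and the helper name `two_decimals` are assumptions for illustration.

```python
import pandas as pd

frame = pd.DataFrame({'b': [-0.1234, 0.5], 'd': [1.0, -2.256]},
                     index=['Utah', 'Ohio'])

def two_decimals(x):
    """Format a float as a string with two decimal places."""
    return '%.2f' % x

# apply hands each column (a Series) to the lambda; map works element by element
formatted = frame.apply(lambda col: col.map(two_decimals))

print(formatted)
```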

## Sorting and ranking

Sorting a data set by some criterion is another important built-in operation. To sort lexicographically by row or column index, use the sort_index method, which returns a new, sorted object:

In [280]:
obj = Series(range(4), index=['d', 'a', 'b', 'c'])

In [281]:
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int32

With a DataFrame, you can sort by index on either axis:

In [283]:
frame = DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                  columns=['d', 'a', 'b', 'c'])

In [285]:
frame.sort_index()

Unnamed: 0,d,a,b,c
one,4,5,6,7
three,0,1,2,3


In [286]:
frame.sort_index(axis=1)

Unnamed: 0,a,b,c,d
three,1,2,3,0
one,5,6,7,4


The data is sorted in ascending order by default, but can be sorted in descending order, too:

In [287]:
frame.sort_index(axis=1, ascending=False)

Unnamed: 0,d,c,b,a
three,0,3,2,1
one,4,7,6,5


To sort a Series by its values, use its sort_values method:

In [288]:
obj = Series([4, 7, -3, 2])

In [290]:
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

Any missing values are sorted to the end of the Series by default:

In [291]:
obj = Series([4, np.nan, 7, np.nan, -3, 2])

In [292]:
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64
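
Missing values stay at the end regardless of sort direction unless you ask otherwise. A short sketch using the na_position option of sort_values (the result names are illustrative):

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

ascending = obj.sort_values()                      # NaN placed last
descending = obj.sort_values(ascending=False)      # NaN still placed last
nan_first = obj.sort_values(na_position='first')   # NaN moved to the front

print(descending)
print(nan_first)
```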

On DataFrame, you may want to sort by the values in one or more columns. To do so, pass one or more column names to the by option:


In [294]:
frame = DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

In [295]:
frame

Unnamed: 0,a,b
0,0,4
1,1,7
2,0,-3
3,1,2


In [297]:
frame.sort_values(by='b')

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7


To sort by multiple columns, pass a list of names:

In [299]:
frame.sort_values(by=['b', 'a'])

Unnamed: 0,a,b
2,0,-3
3,1,2
0,0,4
1,1,7
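
With this data every value in 'b' is unique, so sorting by ['b', 'a'] gives the same order as sorting by 'b' alone. The order of the names matters: the first column is the primary key and later columns only break ties. A small sketch on the same frame showing the difference (the result names are assumptions):

```python
import pandas as pd

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

by_b_then_a = frame.sort_values(by=['b', 'a'])   # 'b' decides; 'a' never gets used
by_a_then_b = frame.sort_values(by=['a', 'b'])   # group by 'a', then order by 'b'

print(by_b_then_a.index.tolist())
print(by_a_then_b.index.tolist())
```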


Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array. It is similar to the indirect sort indices produced by numpy.argsort, except that ties are broken according to a rule. The rank methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

In [300]:
obj = Series([7, -5, 7, 4, 2, 0, 4])

In [301]:
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [302]:
obj

0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64
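
The default tie-breaking rule (mean rank) is only one of several. The sketch below compares a few values of the method option of rank on the same Series; the result names are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

average = obj.rank()                  # ties share the mean rank (the default)
first = obj.rank(method='first')      # ties ranked in order of appearance
lowest = obj.rank(method='min')       # ties share the lowest rank in the group
descending = obj.rank(ascending=False, method='max')  # rank from largest to smallest

print(first)
print(descending)
```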

In [303]:
?rank()

Object `rank` not found.


In [304]:
?np.rank