# Pandas Essential Functionality
- Reindexing
    - Reindex to get a new Series: SeriesObj2 = SeriesObj1.reindex(['index_value1', 'index_value2', 'index_value3'])
    - Create a new Series from a new index and fill the gaps (method='ffill' / 'bfill')
    - Reindex the rows to get a new DataFrame: DataFrameObj2 = DataFrameObj1.reindex(['index_value1', 'index_value2', 'index_value3'])
    - Reindex the columns with the columns keyword: DataFrameObj.reindex(columns=['col_name1', 'col_name2', 'col_name3'])
    - Reindex with loc: DataFrameObj.loc[['index_name1', 'index_name2', 'index_name3'], ['col_name1', 'col_name2', 'col_name3']] (pandas 1.0 and later raise KeyError if any label is missing)
- Dropping entries from an axis
    - NewSeriesObj = SeriesObj.drop('index_name')
    - SeriesObj.drop(['index_name1', 'index_name2'])
    - DataFrameObj.drop(['index_name1', 'index_name2']) drops rows (axis=0 is the default)
    - DataFrameObj.drop('column_name', axis=1) drops a whole column
    - DataFrameObj.drop(['column_name1', 'column_name2'], axis=1) drops whole columns
    - DataFrameObj.drop(['index_name1', 'index_name2'], axis=0) drops whole rows
    - SeriesObj.drop('index_name', inplace=True) drops in place without creating a new object
- Indexing, selection, and filtering
    - Getting values from a Series
        - By index label
            - SeriesObj['index_name']
            - SeriesObj[['index_name1', 'index_name2', 'index_name4']]
            - SeriesObj['index_name1':'index_name4'] (the endpoint is included)
        - By integer position
            - SeriesObj[position]
            - SeriesObj[position1:position2]
            - SeriesObj[[position1, position2]]
        - By boolean array
            - SeriesObj[SeriesObj < 2]
    - Getting values from a DataFrame
        - By column
            - DataFrameObj['column_name']
            - DataFrameObj[['column_name3', 'column_name1']]
        - DataFrameObj[:2] returns the first two rows
        - By boolean expression
            - DataFrameObj[DataFrameObj['column_name'] > 5] returns the rows that satisfy the condition
            - DataFrameObj < 5 yields a boolean DataFrame of the same shape
            - DataFrameObj[DataFrameObj < 5] = 0 assigns to every element that satisfies the condition
        - By loc/iloc (loc selects rows and columns by label, iloc by integer position)
            - DataFrameObj.loc['index_name', ['column_name2', 'column_name3']]
            - DataFrameObj.iloc[[1, 2], [3, 0, 1]] returns the data at the given row and column positions; the result is two-dimensional
            - DataFrameObj.iloc[2] returns the row at position 2 (the third row)
            - DataFrameObj.loc[:'Utah', 'two']
            - DataFrameObj.iloc[:, :3][DataFrameObj['three'] > 5] takes the first three columns, then filters the rows with a boolean expression
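- Arithmetic and data alignment
    - Arithmetic between two Series/DataFrames operates on matching row and column labels; labels present in only one operand produce NaN
- Arithmetic methods with fill values
    - DataFrameObj1.add(DataFrameObj2, fill_value=0) a value missing from one operand is treated as 0 instead of producing NaN
    - DataFrameObj1.reindex(columns=DataFrameObj2.columns, fill_value=0) unlike add above, this does not bring in the extra rows of DataFrameObj2
- Operations between DataFrame and Series
    - Take one row of a DataFrame as a Series: SeriesObj = DataFrameObj.iloc[0]
    - Arithmetic between a DataFrame and a Series
        - DataFrameObj - SeriesObj matches the Series index against the columns and broadcasts down the rows
        - DataFrameObj.sub(SeriesObj, axis=0) matches on the row index and broadcasts across the columns
- Function application and mapping
    - np.abs(DataFrameObj) NumPy ufuncs (element-wise array methods) work on pandas objects
    - DataFrameObj.apply(func)
        - By default func is called once per column
        - With axis=1, func is called once per row
    - DataFrameObj.applymap(func) applies func element-wise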
- Sorting and ranking
    - sort_index
        - SeriesObj.sort_index()
        - DataFrameObj.sort_index() sorts by row index labels
        - DataFrameObj.sort_index(axis=1) sorts by column names
        - DataFrameObj.sort_index(axis=1, ascending=False) sorts by column names, descending
    - sort_values
        - SeriesObj.sort_values()
        - DataFrameObj.sort_values(by='column_name')
        - DataFrameObj.sort_values(by=['column_name2', 'column_name1'])
    - rank
        - SeriesObj.rank()
        - SeriesObj.rank(method='first')
        - SeriesObj.rank(ascending=False, method='max')
        - DataFrameObj.rank(axis=1) ranks within each row
        - DataFrameObj.rank(method='first') ranks within each column
- Axis indexes with duplicate labels
    - SeriesObj.index.is_unique
    - If index values are duplicated
        - For a Series: SeriesObj['index_name'] returns a Series (multiple entries)
        - For a DataFrame: DataFrameObj.loc['index_name'] returns a DataFrame (multiple entries)

In [1]:
# coding:utf-8
from pandas import Series, DataFrame
import pandas as pd
import numpy as np

## Reindexing

In [2]:
obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])
obj

d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64

In [3]:
obj2 = obj.reindex(['a', 'b', 'c', 'd', 'e'])
obj2

a   -5.3
b    7.2
c    3.6
d    4.5
e    NaN
dtype: float64

For ordered data like time series, it may be desirable to interpolate or fill values when reindexing. The `method` option does this; for example, `ffill` forward-fills the values:

In [4]:
obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])
obj3.reindex(range(6), method='ffill')

0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Table 5-4. reindex method (interpolation) options

Argument | Description
---------|------------
ffill or pad | Fill (or carry) values forward
bfill or backfill | Fill (or carry) values backward
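
As a complement to the forward-fill cell above, here is a minimal sketch of backward filling and of capping the fill with `limit`. The names `colors`, `sparse`, `bfilled`, and `limited` are made up for this illustration.

```python
import pandas as pd

# Same sparse Series as in the ffill example: labels 0, 2, 4 only.
colors = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

# Backward fill: each new label takes the next existing value.
# Label 5 has nothing after it, so it stays NaN.
bfilled = colors.reindex(range(6), method='bfill')

# Forward fill, but carry a value across at most one new label.
sparse = pd.Series(['blue', 'yellow'], index=[0, 4])
limited = sparse.reindex(range(6), method='ffill', limit=1)

print(bfilled)
print(limited)
```

With `limit=1`, labels 2 and 3 remain NaN because they are more than one step past the last known value.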

With DataFrame, reindex can alter either the (row) index, columns, or both. When passed only a sequence, the rows are reindexed in the result:

In [6]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
frame

   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8


In [7]:
frame2 = frame.reindex(['a', 'b', 'c', 'd'])
frame2

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0


The columns can be reindexed using the `columns` keyword:

In [8]:
states = ['Texas', 'Utah', 'California']
frame.reindex(columns=states)

   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8


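Reindexing can be written more succinctly by label-indexing with `loc`. The cell below was run on an older pandas release, which filled missing labels with NaN; pandas 1.0 and later raise a `KeyError` when any label passed to `loc` is missing, so there the shortcut works only if every label already exists: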

In [9]:
frame.loc[['a', 'b', 'c', 'd'], states]

   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
c    4.0   NaN         5.0
d    7.0   NaN         8.0
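
The sketch below contrasts the `loc` shortcut with `reindex` when some labels are missing. It assumes a recent pandas release (1.0 or later), where `loc` is expected to raise; `loc_raised` and `reindexed` are names introduced only for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)), index=['a', 'c', 'd'],
                     columns=['Ohio', 'Texas', 'California'])
states = ['Texas', 'Utah', 'California']

# 'b' and 'Utah' do not exist in frame. Older pandas returned NaN here;
# pandas >= 1.0 is expected to raise KeyError instead.
try:
    frame.loc[['a', 'b', 'c', 'd'], states]
    loc_raised = False
except KeyError:
    loc_raised = True

# reindex accepts missing labels on both axes and fills them with NaN.
reindexed = frame.reindex(index=['a', 'b', 'c', 'd'], columns=states)
print(loc_raised)
print(reindexed)
```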


Table 5-5. reindex function arguments

Argument | Description
---------|------------
index | New sequence to use as index. Can be Index instance or any other sequence-like Python data structure. An Index will be used exactly as is without any copying
method | Interpolation (fill) method, see Table 5-4 for options.
fill_value | Substitute value to use when introducing missing data by reindexing
limit | When forward- or backfilling, maximum size gap (in number of elements) to fill
tolerance | When forward- or backfilling, maximum size gap (in absolute numeric distance) to fill for inexact matches
level | Match simple Index on level of MultiIndex, otherwise select subset of
copy | If True, always copy underlying data even if new index is equivalent to old index. If False, do not copy the data when the indexes are equivalent.
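
To make the `fill_value` row of the table concrete, here is a small sketch that reuses the Series from the start of this section. `with_nan` and `with_zero` are names chosen for this example.

```python
import pandas as pd

obj = pd.Series([4.5, 7.2, -5.3, 3.6], index=['d', 'b', 'a', 'c'])

# Default behaviour: the new label 'e' is introduced as NaN.
with_nan = obj.reindex(['a', 'b', 'c', 'd', 'e'])

# fill_value replaces only the values introduced by reindexing;
# existing values are left untouched.
with_zero = obj.reindex(['a', 'b', 'c', 'd', 'e'], fill_value=0.0)

print(with_nan)
print(with_zero)
```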

## Dropping Entries from an Axis
Dropping one or more entries from an axis is easy if you already have an index array or list without those entries. As that can require a bit of munging and set logic, the `drop` method will return a new object with the indicated value or values deleted from an axis:

In [12]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
new_obj = obj.drop('c')
new_obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64

In [13]:
obj.drop(['d', 'c'])

a    0.0
b    1.0
e    4.0
dtype: float64

With DataFrame, index values can be deleted from either axis:

In [14]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
             index=['Ohio', 'Colorado', 'Utah', 'New York'],
             columns=['one', 'two', 'three', 'four'])
data.drop(['Colorado', 'Ohio'])

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


In [15]:
data.drop('two', axis=1)

          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15


In [16]:
data.drop(['two', 'four'], axis=1)

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14


In [17]:
data.drop(['Ohio', 'Colorado'], axis=0)

          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15


Many functions, like drop, which modify the size or shape of a Series or DataFrame, can manipulate an object in place **without returning a new object**:

In [18]:
obj.drop('c', inplace=True)
obj

a    0.0
b    1.0
d    3.0
e    4.0
dtype: float64
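
A short sketch of the difference between the default (copying) `drop` and `inplace=True`. The names `original_len`, `returned`, and `safe` are made up for illustration.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])

# Default: a new object is returned and obj keeps all five entries.
new_obj = obj.drop('c')
original_len = len(obj)

# inplace=True mutates obj and returns None, so do not assign the result.
returned = obj.drop('c', inplace=True)

# errors='ignore' skips labels that are not present instead of raising KeyError.
safe = obj.drop('zzz', errors='ignore')

print(original_len, returned)
print(obj)
```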

## Indexing, Selection, and Filtering

In [20]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj['b']

1.0

In [21]:
obj[1]

1.0

In [22]:
obj[2:4]

c    2.0
d    3.0
dtype: float64

In [23]:
obj[['b', 'a', 'd']]

b    1.0
a    0.0
d    3.0
dtype: float64

In [24]:
obj[[1, 3]]

b    1.0
d    3.0
dtype: float64

In [25]:
obj[obj < 2]

a    0.0
b    1.0
dtype: float64

Slicing with labels behaves differently than normal Python slicing in that the endpoint is inclusive:

In [26]:
obj['b':'c']

b    1.0
c    2.0
dtype: float64

Setting using these methods modifies the corresponding section of the Series:

In [28]:
obj['b':'c'] = 5
obj

a    0.0
b    5.0
c    5.0
d    3.0
dtype: float64
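
The cells above mix label-based and position-based access through plain `[]`. As a hedged aside, recent pandas releases warn when `[]` treats integers as positions on a label-indexed Series, so the explicit `loc` / `iloc` spelling sketched here may be the safer habit. Variable names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

# Label slice: both endpoints are included.
by_label = obj.loc['b':'c']

# Position slice: usual Python semantics, the stop position is excluded.
by_position = obj.iloc[1:2]

# Explicit positional lookups, equivalent to obj[1] and obj[[1, 3]] above.
second = obj.iloc[1]
picked = obj.iloc[[1, 3]]

print(by_label)
print(by_position)
```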

Indexing into a DataFrame is for retrieving one or more columns either with a single value or sequence:

In [29]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])
data

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


In [30]:
data['two']

Ohio         1
Colorado     5
Utah         9
New York    13
Name: two, dtype: int64

In [31]:
data[['three', 'one']]

          three  one
Ohio          2    0
Colorado      6    4
Utah         10    8
New York     14   12


Indexing like this has a few special cases. First, rows can be selected by slicing or with a boolean array:

In [32]:
data[:2]

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7


In [33]:
data[data['three'] > 5]

          one  two  three  four
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


Another use case is in indexing with a boolean DataFrame, such as one produced by a scalar comparison:

In [34]:
data < 5

            one    two  three   four
Ohio       True   True   True   True
Colorado   True  False  False  False
Utah      False  False  False  False
New York  False  False  False  False


In [35]:
data[data < 5] = 0
data

          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11
New York   12   13     14    15


For DataFrame label-indexing on the rows, I introduce the special indexing operators `loc` and `iloc`. They enable you to select a subset of the rows and columns from a DataFrame with NumPy-like notation using either axis labels (loc) or integers (iloc).

In [36]:
data.loc['Colorado', ['two', 'three']]

two      5
three    6
Name: Colorado, dtype: int64

In [37]:
data.iloc[[1, 2], [3, 0, 1]]

          four  one  two
Colorado     7    0    5
Utah        11    8    9


In [38]:
data.iloc[2]

one       8
two       9
three    10
four     11
Name: Utah, dtype: int64

In [39]:
data.loc[:'Utah', 'two']

Ohio        0
Colorado    5
Utah        9
Name: two, dtype: int64

In [40]:
data.iloc[:, :3][data.three > 5]

          one  two  three
Colorado    0    5      6
Utah        8    9     10
New York   12   13     14


Table 5-6. Indexing options with DataFrame

Type | Notes
-----|------
df[val] | Select single column or sequence of columns from the DataFrame. Special case conveniences: boolean array (filter rows), slice (slice rows), or boolean DataFrame (set values based on some criterion).
df.loc[val] | Selects single row or subset of rows from the DataFrame by label.
df.loc[:, val] | Selects single column or subset of columns by label.
df.loc[val1, val2] | Select both rows and columns by label.
df.iloc[where] | Selects single row or subset of rows from the DataFrame by integer position.
df.iloc[:, where] | Selects single column or subset of columns by integer position.
df.iloc[where_i, where_j] | Select both rows and columns by integer position.
df.at[label_i, label_j] | Select a single scalar value by row and column label.
df.iat[i, j] | Select a single scalar value by row and column position (integers).
reindex method | Select either rows or columns by labels.
xs method | Select single row or column as a Series by label.
get_value, set_value methods | Select single value by row and column label (deprecated, then removed in pandas 1.0; use `at` / `iat` instead).
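
The sketch below exercises several rows of the table on a fresh copy of `data` (before any values were set to 0). The result names are chosen only for this example.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

row_by_label = data.loc['Utah']            # one row as a Series
col_by_label = data.loc[:, 'two']          # one column as a Series
block_by_pos = data.iloc[[1, 2], [3, 0, 1]]  # rows 1-2, columns 3, 0, 1

# at / iat are scalar accessors: label-based and position-based.
scalar_by_label = data.at['Utah', 'two']
scalar_by_pos = data.iat[2, 1]

# xs selects a cross-section by label, here the same row as loc['Utah'].
cross_section = data.xs('Utah')

print(block_by_pos)
print(scalar_by_label, scalar_by_pos)
```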

## Arithmetic and Data Alignment

In [41]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])
s1

a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

In [42]:
s2

a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64

In [43]:
s1 + s2

a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

In the case of DataFrame, alignment is performed on both the rows and the columns:

In [44]:
df1 = pd.DataFrame(np.arange(9.).reshape((3, 3)), columns=list('bcd'),
                   index=['Ohio', 'Texas', 'Colorado'])
df2 = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                   index=['Utah', 'Ohio', 'Texas', 'Oregon'])
df1

            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0


In [45]:
df2

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [46]:
df1 + df2

            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN


## Arithmetic Methods with Fill Values

In [50]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))
df1

     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0


In [51]:
df2

      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0


In [52]:
df1 + df2

      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN


Using the add method on df1, I pass df2 and an argument to `fill_value`:

In [53]:
df1.add(df2, fill_value=0)

      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0


Relatedly, when reindexing a Series or DataFrame, you can also specify a different fill value:

In [54]:
df1.reindex(columns=df2.columns, fill_value=0)

     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0


Table 5-7. Flexible arithmetic methods

Method | Description
-------|------------
add, radd | Methods for addition (+)
sub, rsub | Methods for subtraction (-)
div, rdiv | Methods for division (/)
floordiv, rfloordiv | Methods for floor division (//)
mul, rmul | Methods for multiplication (*)
pow, rpow | Methods for exponentiation (**)
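
Each method in the table has an `r`-prefixed twin with the arguments flipped. The sketch below checks that `1 / df1` and `df1.rdiv(1)` agree, and contrasts `+` with `add(..., fill_value=0)`; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

# Plain operator: labels present in only one frame give NaN.
plain = df1 + df2

# Flexible method: a value missing from one side is treated as 0.
filled = df1.add(df2, fill_value=0)

# rdiv flips the operands, so these two expressions are equivalent.
inv_operator = 1 / df1
inv_method = df1.rdiv(1)

print(filled)
print(inv_method)
```

Note that `fill_value` only helps when at least one operand has a value at a given position; a position missing from both frames would still be NaN.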

## Operations Between DataFrame and Series

In [56]:
arr = np.arange(12.).reshape((3, 4))
arr

array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,   7.],
       [  8.,   9.,  10.,  11.]])

In [57]:
arr[0]

array([ 0.,  1.,  2.,  3.])

In [58]:
arr - arr[0]

array([[ 0.,  0.,  0.,  0.],
       [ 4.,  4.,  4.,  4.],
       [ 8.,  8.,  8.,  8.]])

In [59]:
frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [60]:
series = frame.iloc[0]
series

b    0.0
d    1.0
e    2.0
Name: Utah, dtype: float64

In [61]:
frame - series

          b    d    e
Utah    0.0  0.0  0.0
Ohio    3.0  3.0  3.0
Texas   6.0  6.0  6.0
Oregon  9.0  9.0  9.0


In [62]:
series2 = pd.Series(range(3), index=['b', 'e', 'f'])
series2

b    0
e    1
f    2
dtype: int64

In [63]:
frame + series2

          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN


In [64]:
series3 = frame['d']
series3

Utah       1.0
Ohio       4.0
Texas      7.0
Oregon    10.0
Name: d, dtype: float64

In [65]:
frame.sub(series3, axis=0)

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0


>The axis number that you pass is the axis to match on. In this case we mean to match on the DataFrame’s row index (axis=0) and broadcast across the columns.
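
To show why the `axis` argument matters, this sketch compares three ways of subtracting a Series from the same frame. The names `by_columns`, `by_rows`, and `mismatched` are introduced for this example only.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12.).reshape((4, 3)), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
row = frame.iloc[0]   # indexed by column labels b, d, e
col = frame['d']      # indexed by row labels Utah, Ohio, ...

# Default: match the Series index on the columns, broadcast down the rows.
by_columns = frame - row

# axis=0: match the Series index on the rows, broadcast across the columns.
by_rows = frame.sub(col, axis=0)

# Subtracting a column with the plain operator matches row labels against
# column labels; nothing lines up, so the result is all NaN.
mismatched = frame - col

print(by_rows)
print(mismatched.shape)
```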

## Function Application and Mapping
NumPy ufuncs (element-wise array methods) also work with pandas objects:

In [66]:
frame = pd.DataFrame(np.random.randn(4, 3), columns=list('bde'),
                     index=['Utah', 'Ohio', 'Texas', 'Oregon'])
frame

               b         d         e
Utah   -1.067894  0.788223  1.474891
Ohio   -0.318441  1.677389  1.572767
Texas  -1.938800  0.248013  0.340766
Oregon -1.922661  0.190869  2.654866


In [67]:
np.abs(frame)

               b         d         e
Utah    1.067894  0.788223  1.474891
Ohio    0.318441  1.677389  1.572767
Texas   1.938800  0.248013  0.340766
Oregon  1.922661  0.190869  2.654866


Another frequent operation is applying a function on 1D arrays to each column or row. DataFrame’s `apply` method does exactly this:

In [68]:
f = lambda x: x.max() - x.min()
frame.apply(f)

b    1.62036
d    1.48652
e    2.31410
dtype: float64

In [69]:
frame.apply(f, axis=1)

Utah      2.542785
Ohio      1.995830
Texas     2.279567
Oregon    4.577527
dtype: float64

Many of the most common array statistics (like sum and mean) are DataFrame methods, so using apply is not necessary.
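
As a quick check of that remark, the sketch below computes the same max-minus-min spread with `apply` and with the built-in reductions. It uses a small fixed frame instead of random data so the result is reproducible; the values are arbitrary.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[-1.0, 0.5, 1.5],
                      [-0.5, 2.0, 1.0],
                      [-2.0, 0.25, 0.75]],
                     columns=list('bde'), index=['Utah', 'Ohio', 'Texas'])

# Per-column spread via apply (func receives one column at a time).
spread_apply = frame.apply(lambda x: x.max() - x.min())

# The same result from built-in, vectorized reductions.
spread_builtin = frame.max() - frame.min()

# Per-row versions: axis=1 for apply, axis=1 for the reductions.
row_apply = frame.apply(lambda x: x.max() - x.min(), axis=1)
row_builtin = frame.max(axis=1) - frame.min(axis=1)

print(spread_builtin)
print(row_builtin)
```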

The function passed to `apply` need not return a scalar value; it can also return a Series with multiple values:

In [70]:
def f(x):
    return pd.Series([x.min(), x.max()], index=['min', 'max'])
frame.apply(f)

            b         d         e
min -1.938800  0.190869  0.340766
max -0.318441  1.677389  2.654866


Element-wise Python functions can be used, too. Suppose you wanted to compute a formatted string from each floating point value in frame. You can do this with `applymap`:

In [71]:
# Named fmt rather than format to avoid shadowing the built-in format()
fmt = lambda x: '%.2f' % x
frame.applymap(fmt)

            b     d     e
Utah    -1.07  0.79  1.47
Ohio    -0.32  1.68  1.57
Texas   -1.94  0.25  0.34
Oregon  -1.92  0.19  2.65
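
The element-wise counterpart on a single column is `Series.map`. The sketch below also builds the whole-frame result from `apply` plus `map`, which avoids `applymap` (recent pandas releases deprecate it in favor of `DataFrame.map`). The small frame and the name `fmt` are made up for this example.

```python
import pandas as pd

frame = pd.DataFrame([[-1.0679, 0.7882],
                      [-0.3184, 1.6774]],
                     columns=['b', 'd'], index=['Utah', 'Ohio'])

def fmt(x):
    """Format one float with two decimal places."""
    return '%.2f' % x

# Element-wise on one column.
col_formatted = frame['d'].map(fmt)

# Element-wise on every column: apply hands each column to Series.map.
all_formatted = frame.apply(lambda col: col.map(fmt))

print(col_formatted)
print(all_formatted)
```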


## Sorting and Ranking
To sort lexicographically by row or column index, use the `sort_index` method, which returns a new, sorted object:

In [72]:
obj = pd.Series(range(4), index=['d', 'a', 'b', 'c'])
obj.sort_index()

a    1
b    2
c    3
d    0
dtype: int64

In [73]:
frame = pd.DataFrame(np.arange(8).reshape((2, 4)), index=['three', 'one'],
                     columns=['d', 'a', 'b', 'c'])
frame.sort_index()

       d  a  b  c
one    4  5  6  7
three  0  1  2  3


In [74]:
frame.sort_index(axis=1)

       a  b  c  d
three  1  2  3  0
one    5  6  7  4


In [75]:
frame.sort_index(axis=1, ascending=False)

       d  c  b  a
three  0  3  2  1
one    4  7  6  5


To sort a Series by its values, use its `sort_values` method:

In [76]:
obj = pd.Series([4, 7, -3, 2])
obj.sort_values()

2   -3
3    2
0    4
1    7
dtype: int64

In [77]:
obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])
obj.sort_values()

4   -3.0
5    2.0
0    4.0
2    7.0
1    NaN
3    NaN
dtype: float64

When sorting a DataFrame, you can use the data in one or more columns as the sort keys. To do so, pass one or more column names to the by option of `sort_values`:

In [78]:
frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})
frame

   a  b
0  0  4
1  1  7
2  0 -3
3  1  2


In [79]:
frame.sort_values(by='b')

   a  b
2  0 -3
3  1  2
0  0  4
1  1  7


In [80]:
frame.sort_values(by=['a', 'b'])

   a  b
2  0 -3
0  0  4
3  1  2
1  1  7
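
The cells above sort in ascending order only. This sketch shows descending order, placing NaN first, and a per-key sort direction; the result names are illustrative.

```python
import numpy as np
import pandas as pd

obj = pd.Series([4, np.nan, 7, np.nan, -3, 2])

# Descending sort: missing values still go to the end by default.
descending = obj.sort_values(ascending=False)

# na_position='first' moves the missing values to the front instead.
nan_first = obj.sort_values(na_position='first')

frame = pd.DataFrame({'b': [4, 7, -3, 2], 'a': [0, 1, 0, 1]})

# One direction per key: 'a' ascending, then 'b' descending within each 'a'.
mixed = frame.sort_values(by=['a', 'b'], ascending=[True, False])

print(descending)
print(mixed)
```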


Ranking is closely related to sorting, assigning ranks from one through the number of valid data points in an array. The `rank` methods for Series and DataFrame are the place to look; by default rank breaks ties by assigning each group the mean rank:

In [81]:
obj = pd.Series([7, -5, 7, 4, 2, 0, 4])
obj.rank()

0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

In [82]:
obj.rank(method='first')

0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [86]:
obj.rank(method='max')

0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0
dtype: float64

In [83]:
# Assign tie values the maximum rank in the group
obj.rank(ascending=False, method='max')

0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64

In [84]:
frame = pd.DataFrame({'b': [4.3, 7, -3, 2], 'a': [0, 1, 0, 1],
                       'c': [-2, 5, 8, -2.5]})
frame

   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5


In [85]:
frame.rank(axis=1)

     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0


In [89]:
frame.rank(method='first')

     a    b    c
0  1.0  3.0  2.0
1  3.0  4.0  3.0
2  2.0  1.0  4.0
3  4.0  2.0  1.0


Table 5-8. Tie-breaking methods with rank

Method | Description
-------|------------
'average' | Default: assign the average rank to each entry in the equal group.
'min' | Use the minimum rank for the whole group.
'max' | Use the maximum rank for the whole group.
'first' | Assign ranks in the order the values appear in the data.
'dense' | Like method='min', but ranks always increase by 1 in between groups rather than the number of equal elements in a group.
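
The cells above cover `'average'`, `'first'`, and `'max'`. For the remaining two rows of the table, here is a sketch on the same Series; `rank_min` and `rank_dense` are names chosen for this example.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'min': every member of a tied group gets the lowest rank in that group,
# and the next group skips ahead by the size of the tie.
rank_min = obj.rank(method='min')

# 'dense': like 'min', but the next group's rank is always one higher,
# so there are no gaps in the ranks.
rank_dense = obj.rank(method='dense')

print(pd.DataFrame({'value': obj, 'min': rank_min, 'dense': rank_dense}))
```

The two 7s get rank 6 under `'min'` (ranks 6 and 7 are tied) but rank 5 under `'dense'` (7 is the fifth distinct value).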

## Axis Indexes with Duplicate Labels

In [90]:
obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])
obj

a    0
a    1
b    2
b    3
c    4
dtype: int64

The index’s `is_unique` property can tell you whether its values are unique or not:

In [91]:
obj.index.is_unique

False

Indexing a label with multiple entries returns a Series, while a label with a single entry returns a scalar value:

In [92]:
obj['a']

a    0
a    1
dtype: int64

In [93]:
obj['c']

4

The same logic extends to indexing rows in a DataFrame:

In [94]:
df = pd.DataFrame(np.random.randn(4, 3), index=['a', 'a', 'b', 'b'])
df

          0         1         2
a -0.258194 -2.114386  0.113255
a  0.562478 -0.210807 -1.479668
b -0.295455  0.737173 -0.714988
b  0.586082  0.217061 -1.390135


In [95]:
df.loc['b']

          0         1         2
b -0.295455  0.737173 -0.714988
b  0.586082  0.217061 -1.390135
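
Because the return type depends on whether a label repeats, code that indexes by label can become harder to reason about. The sketch below shows one possible way to detect and work around duplicates; the result names are made up for illustration.

```python
import pandas as pd

obj = pd.Series(range(5), index=['a', 'a', 'b', 'b', 'c'])

is_unique = obj.index.is_unique

# Boolean mask marking every repeat after the first occurrence of a label.
dup_mask = obj.index.duplicated(keep='first')

# Keep only the first entry per label, which yields a unique index.
first_only = obj[~dup_mask]

# Passing a list of labels always returns a Series, even for a single match.
always_series = obj.loc[['c']]

print(is_unique)
print(first_only)
print(always_series)
```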
