## Core data structures

In [1]:
import pandas as pd
import numpy as np

### Series

A Series is a **one-dimensional labeled array** that can hold any kind of data (integers, floats, strings, Python objects). The basic constructor call is:

```python
s = pd.Series(data, index=index)
```

Here `index` is a list of labels for the data. `data` can be one of several types:

* a Python dict
* an ndarray
* a scalar value, such as 5


#### Creating from an ndarray

In [2]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -1.694356
b   -0.124048
c   -1.158450
d    1.035890
e    0.625649
dtype: float64

In [3]:
s.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [4]:
s = pd.Series(np.random.randn(5))
s

0    0.387019
1   -2.004426
2   -1.227255
3   -0.058406
4    1.591774
dtype: float64

In [5]:
s.index

RangeIndex(start=0, stop=5, step=1)

#### Creating from a dict

In [6]:
# Labels in the index that are missing from the dict are filled with NaN
d = {'a' : 0., 'b' : 1., 'd' : 3}
s = pd.Series(d, index=list('abcd'))
s

a    0.0
b    1.0
c    NaN
d    3.0
dtype: float64

#### Creating from a scalar

In [7]:
pd.Series(3, index=list('abcde'))

a    3
b    3
c    3
d    3
e    3
dtype: int64

#### Series is ndarray-like

In [9]:
s = pd.Series(np.random.randn(5))
s

0   -0.470827
1    1.263037
2    0.815732
3    1.070473
4    1.620516
dtype: float64

In [10]:
s[0]

-0.47082724360721745

In [11]:
s[:3]

0   -0.470827
1    1.263037
2    0.815732
dtype: float64

In [12]:
s[[1, 3, 4]]

1    1.263037
3    1.070473
4    1.620516
dtype: float64

In [13]:
np.exp(s)

0    0.624485
1    3.536144
2    2.260831
3    2.916759
4    5.055699
dtype: float64

In [14]:
np.sin(s)

0   -0.453624
1    0.953015
2    0.728228
3    0.877427
4    0.998764
dtype: float64

#### Series is dict-like

In [15]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
s

a   -1.828294
b   -0.688289
c   -0.890648
d   -1.003681
e    0.148651
dtype: float64

In [16]:
s['a']

-1.8282938283407693

In [17]:
s['e'] = 5

In [18]:
s

a   -1.828294
b   -0.688289
c   -0.890648
d   -1.003681
e    5.000000
dtype: float64

In [19]:
s['g'] = 100

In [20]:
s

a     -1.828294
b     -0.688289
c     -0.890648
d     -1.003681
e      5.000000
g    100.000000
dtype: float64

In [21]:
'e' in s

True

In [22]:
'f' in s

False

In [23]:
# s['f'] would raise KeyError because 'f' is not a label in s

In [87]:
s.get('f')

In [25]:
s.get('f', np.nan)

nan
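
The difference between bracket access and `get()` for a missing label can be sketched as follows. The Series values here are illustrative, not taken from the cells above:

```python
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])  # illustrative values

# Bracket access on a missing label raises KeyError, as with a dict.
try:
    s['f']
    raised = False
except KeyError:
    raised = True

missing = s.get('f')          # None when no default is given
fallback = s.get('f', -1.0)   # explicit default value

# Assigning to a new label enlarges the Series.
s['c'] = 3.0
labels = list(s.index)
```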

#### Label alignment

In [88]:
s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
print('{0}\n\n{1}'.format(s1, s2))

a   -0.736370
c    0.773866
e   -0.329994
dtype: float64

a   -0.104675
d   -0.106100
e    0.186483
dtype: float64


In [27]:
s1 + s2

a   -0.841045
c         NaN
d         NaN
e   -0.143511
dtype: float64
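
Arithmetic between two Series aligns on labels first: the result is indexed by the union of the labels, and a label present on only one side yields NaN. A small sketch with fixed, illustrative values shows this, along with `add(..., fill_value=0)` as one possible way to treat a missing side as zero:

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

total = s1 + s2                     # union of labels; unmatched labels -> NaN
filled = s1.add(s2, fill_value=0)   # a label missing on one side counts as 0
common = total.dropna()             # keep only labels present in both
```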

#### The name attribute

In [28]:
s = pd.Series(np.random.randn(5), name='Some Thing')
s

0    1.432539
1   -1.577910
2   -0.419583
3    0.180221
4   -0.117776
Name: Some Thing, dtype: float64

In [29]:
s.name

'Some Thing'

### DataFrame

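A DataFrame is a **two-dimensional array with row labels and column labels**. You can think of a DataFrame as an Excel spreadsheet or a SQL database table, or as a dict of Series objects. It is the most commonly used data structure in Pandas.

The basic form for creating a DataFrame is: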

```python
df = pd.DataFrame(data, index=index, columns=columns)
```

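Here `index` gives the row labels and `columns` gives the column labels. `data` can be any of the following:

* a dict of one-dimensional NumPy arrays, lists, or Series
* a two-dimensional NumPy array
* a Series
* another DataFrame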


#### Creating from a dict

In [30]:
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}

In [31]:
pd.DataFrame(d)

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


In [32]:
pd.DataFrame(d, index=['d', 'b', 'a'])

Unnamed: 0,one,two
d,,4
b,2.0,2
a,1.0,1


In [33]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

Unnamed: 0,two,three
d,4,
b,2,
a,1,


In [34]:
d = {'one' : [1, 2, 3, 4],
     'two' : [21, 22, 23, 24]}

In [35]:
pd.DataFrame(d)

Unnamed: 0,one,two
0,1,21
1,2,22
2,3,23
3,4,24


In [36]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

Unnamed: 0,one,two
a,1,21
b,2,22
c,3,23
d,4,24


#### Creating from structured data

In [37]:
data = [(1, 2.2, 'Hello'), (2, 3., "World")]

In [38]:
pd.DataFrame(data)

Unnamed: 0,0,1,2
0,1,2.2,Hello
1,2,3.0,World


In [39]:
pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])

Unnamed: 0,A,B,C
first,1,2.2,Hello
second,2,3.0,World


#### Creating from a list of dicts

In [40]:
data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [41]:
pd.DataFrame(data)

Unnamed: 0,a,b,c
0,1,2,
1,5,10,20.0


In [42]:
pd.DataFrame(data, index=['first', 'second'])

Unnamed: 0,a,b,c
first,1,2,
second,5,10,20.0


In [43]:
pd.DataFrame(data, columns=['a', 'b'])

Unnamed: 0,a,b
0,1,2
1,5,10


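#### Creating from a dict of tuples

This example is mainly useful for understanding how construction works. In practice, you would first clean the data into a format that is easy for Pandas to import and easy to read, and only then convert it into a more complex structure with methods such as reindex or groupby.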

In [44]:
d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
     ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}

In [45]:
# Multi-level (hierarchical) labels on both rows and columns
pd.DataFrame(d)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,a,a,b,b
Unnamed: 0_level_1,Unnamed: 1_level_1,a,b,c,a,b
A,B,4.0,1.0,5.0,8.0,10.0
A,C,3.0,2.0,6.0,7.0,
A,D,,,,,9.0
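
Nested tuple dicts like the one above are rarely written by hand. One possible route from flat, cleaned records to the same kind of multi-level structure is `set_index` followed by `unstack`. This is a sketch only: the column names `row1`, `row2`, `col1`, `col2`, and `value` are hypothetical, and the four records reuse a few values from the dict above.

```python
import pandas as pd

# Flat records: one row per cell of the target table (hypothetical column names).
records = pd.DataFrame({
    'row1': ['A', 'A', 'A', 'A'],
    'row2': ['B', 'C', 'B', 'C'],
    'col1': ['a', 'a', 'b', 'b'],
    'col2': ['a', 'a', 'a', 'a'],
    'value': [4, 3, 8, 7],
})

# Move all four keys into the index, then pivot the column keys
# out of the row index to obtain multi-level column labels.
wide = (records
        .set_index(['row1', 'row2', 'col1', 'col2'])['value']
        .unstack(['col1', 'col2']))
```

The result has a two-level row index (`row1`, `row2`) and two-level column labels (`col1`, `col2`), so `wide.loc[('A', 'B'), ('a', 'a')]` looks up a single cell.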


#### Creating from a Series

In [46]:
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
pd.DataFrame(s)

Unnamed: 0,0
a,0.415834
b,-0.126642
c,-2.561715
d,-1.262478
e,0.422887


In [47]:
pd.DataFrame(s, index=['a', 'c', 'd'])

Unnamed: 0,0
a,0.415834
c,-2.561715
d,-1.262478


In [48]:
pd.DataFrame(s, index=['a', 'c', 'd'], columns=['A'])

Unnamed: 0,A
a,0.415834
c,-2.561715
d,-1.262478


#### Selecting, adding, and deleting columns

In [49]:
df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
df

Unnamed: 0,one,two,three,four
0,-1.577147,0.935877,1.192445,-1.075989
1,0.678227,-0.422884,1.001849,2.331139
2,0.495693,-1.439533,0.653468,1.230008
3,1.202806,1.014182,-0.792714,-0.628002
4,0.934939,-1.066394,0.805637,-0.067281
5,-0.647791,1.286207,-0.695103,-0.843449


In [50]:
df['one']

0   -1.577147
1    0.678227
2    0.495693
3    1.202806
4    0.934939
5   -0.647791
Name: one, dtype: float64

In [51]:
df['three'] = df['one'] + df['two']
df

Unnamed: 0,one,two,three,four
0,-1.577147,0.935877,-0.64127,-1.075989
1,0.678227,-0.422884,0.255344,2.331139
2,0.495693,-1.439533,-0.94384,1.230008
3,1.202806,1.014182,2.216988,-0.628002
4,0.934939,-1.066394,-0.131456,-0.067281
5,-0.647791,1.286207,0.638416,-0.843449


In [52]:
df['flag'] = df['one'] > 0
df

Unnamed: 0,one,two,three,four,flag
0,-1.577147,0.935877,-0.64127,-1.075989,False
1,0.678227,-0.422884,0.255344,2.331139,True
2,0.495693,-1.439533,-0.94384,1.230008,True
3,1.202806,1.014182,2.216988,-0.628002,True
4,0.934939,-1.066394,-0.131456,-0.067281,True
5,-0.647791,1.286207,0.638416,-0.843449,False


In [53]:
del df['three']
df

Unnamed: 0,one,two,four,flag
0,-1.577147,0.935877,-1.075989,False
1,0.678227,-0.422884,2.331139,True
2,0.495693,-1.439533,1.230008,True
3,1.202806,1.014182,-0.628002,True
4,0.934939,-1.066394,-0.067281,True
5,-0.647791,1.286207,-0.843449,False


In [54]:
four = df.pop('four')
four

0   -1.075989
1    2.331139
2    1.230008
3   -0.628002
4   -0.067281
5   -0.843449
Name: four, dtype: float64

In [55]:
df

Unnamed: 0,one,two,flag
0,-1.577147,0.935877,False
1,0.678227,-0.422884,True
2,0.495693,-1.439533,True
3,1.202806,1.014182,True
4,0.934939,-1.066394,True
5,-0.647791,1.286207,False


In [56]:
df['five'] = 5
df

Unnamed: 0,one,two,flag,five
0,-1.577147,0.935877,False,5
1,0.678227,-0.422884,True,5
2,0.495693,-1.439533,True,5
3,1.202806,1.014182,True,5
4,0.934939,-1.066394,True,5
5,-0.647791,1.286207,False,5


In [57]:
df['one_trunc'] = df['one'][:2]
df

Unnamed: 0,one,two,flag,five,one_trunc
0,-1.577147,0.935877,False,5,-1.577147
1,0.678227,-0.422884,True,5,0.678227
2,0.495693,-1.439533,True,5,
3,1.202806,1.014182,True,5,
4,0.934939,-1.066394,True,5,
5,-0.647791,1.286207,False,5,


In [58]:
# Insert the new column at a specific position
df.insert(1, 'bar', df['one'])
df

Unnamed: 0,one,bar,two,flag,five,one_trunc
0,-1.577147,-1.577147,0.935877,False,5,-1.577147
1,0.678227,0.678227,-0.422884,True,5,0.678227
2,0.495693,0.495693,-1.439533,True,5,
3,1.202806,1.202806,1.014182,True,5,
4,0.934939,0.934939,-1.066394,True,5,
5,-0.647791,-0.647791,1.286207,False,5,
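
The operations above (`del`, `pop`, `insert`, and column assignment) all modify the DataFrame in place. A short sketch with illustrative values contrasts them with `drop(columns=...)`, which returns a new object, and shows that assigning a shorter Series aligns on the index:

```python
import pandas as pd

df = pd.DataFrame({'one': [1.0, 2.0, 3.0], 'two': [4.0, 5.0, 6.0]})

# drop() returns a new DataFrame and leaves df untouched, unlike del / pop.
without_two = df.drop(columns=['two'])

# Assigning a shorter Series aligns on the index; unmatched rows become NaN.
df['one_trunc'] = df['one'].iloc[:2]
n_missing = int(df['one_trunc'].isna().sum())

# insert() places a column at a given position and modifies df in place.
df.insert(0, 'zero', 0)
columns = list(df.columns)
```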


#### Inserting new columns with assign()

`assign()` returns a new DataFrame, which makes it convenient to use in method chains.

In [59]:
df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,2,3,4,3
1,3,4,2,2
2,3,2,4,1
3,1,3,4,2
4,4,2,1,4
5,4,1,1,2


In [60]:
df.assign(Ratio = df['A'] / df['B'])

Unnamed: 0,A,B,C,D,Ratio
0,2,3,4,3,0.666667
1,3,4,2,2,0.75
2,3,2,4,1,1.5
3,1,3,4,2,0.333333
4,4,2,1,4,2.0
5,4,1,1,2,4.0


In [61]:
df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Diff = lambda x: x.C - x.D)

Unnamed: 0,A,B,C,D,AB_Ratio,CD_Diff
0,2,3,4,3,0.666667,1
1,3,4,2,2,0.75,0
2,3,2,4,1,1.5,3
3,1,3,4,2,0.333333,2
4,4,2,1,4,2.0,-3
5,4,1,1,2,4.0,-1


In [62]:
df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)

Unnamed: 0,A,B,C,D,AB_Ratio,ABD_Ratio
0,2,3,4,3,0.666667,2.0
1,3,4,2,2,0.75,1.5
2,3,2,4,1,1.5,1.5
3,1,3,4,2,0.333333,0.666667
4,4,2,1,4,2.0,8.0
5,4,1,1,2,4.0,8.0
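
Because `assign()` returns a new DataFrame instead of modifying the original, several steps can be chained and the source data stays unchanged. A minimal sketch with illustrative values (the column name `ABD` is made up for this example):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3, 4], 'B': [4, 2, 1], 'D': [3, 2, 2]})

result = (df
          .assign(AB_Ratio=lambda x: x.A / x.B)       # new column from A and B
          .assign(ABD=lambda x: x.AB_Ratio * x.D)     # can use the column just added
          .loc[lambda x: x.AB_Ratio > 1])             # filter within the same chain
```

The lambdas receive the intermediate DataFrame, which is why the second `assign` can refer to `AB_Ratio`.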


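#### Indexing and selection

The operations, their syntax, and the type of the result:

* Select a column -> df[col] -> Series
* Select a row by label -> df.loc[label] -> Series
* Select a row by position -> df.iloc[index] -> Series
* Select multiple rows by slice -> df[5:10] -> DataFrame
* Select multiple rows with a boolean vector -> df[bool_vector] -> DataFrame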
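
The return types listed above can be checked directly. The following sketch uses a small DataFrame with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({'A': [7, 3, 5], 'B': [2, 4, 6]}, index=list('abc'))

col = df['A']                 # one column -> Series
row_by_label = df.loc['a']    # one row by label -> Series
row_by_pos = df.iloc[0]       # one row by position -> Series
rows = df[1:3]                # row slice -> DataFrame
masked = df[df['A'] > 4]      # boolean vector -> DataFrame
```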

In [63]:
df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
a,7,2,7,8
b,3,4,5,5
c,5,6,2,8
d,1,1,7,3
e,5,3,6,5
f,9,4,9,7


In [64]:
df['A']

a    7
b    3
c    5
d    1
e    5
f    9
Name: A, dtype: int64

In [65]:
df.loc['a']

A    7
B    2
C    7
D    8
Name: a, dtype: int64

In [66]:
df.iloc[0]

A    7
B    2
C    7
D    8
Name: a, dtype: int64

In [67]:
df[1:4]

Unnamed: 0,A,B,C,D
b,3,4,5,5
c,5,6,2,8
d,1,1,7,3


In [68]:
df[[False, True, True, False, True, False]]

Unnamed: 0,A,B,C,D
b,3,4,5,5
c,5,6,2,8
e,5,3,6,5


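#### Data alignment

When computing with DataFrames, Pandas automatically aligns the data on both rows and columns. The result is indexed by the union of the two DataFrames' labels, and any position present in only one operand is NaN.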

In [69]:
df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
df1

Unnamed: 0,A,B,C,D
a,1.58888,-0.114489,-0.40185,-0.50703
b,-0.167183,1.594501,-0.120516,-0.206668
c,0.571104,0.941069,0.011007,0.597178
d,-0.057762,0.639945,-0.772874,0.730351
e,-0.458399,0.843232,-0.587982,-0.728883
f,-0.199473,0.094406,1.140005,0.066065
g,-0.584204,-0.34036,1.739054,-0.614724
h,-0.040029,-1.708328,0.269108,-0.5422
i,-0.289861,-0.434472,-0.194116,0.060082
j,-0.816742,-0.635237,0.345873,-0.952751


In [70]:
df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
df2

Unnamed: 0,A,B,C
c,-0.303747,0.250185,1.435337
d,-0.068829,-1.269498,-0.265227
e,-0.150956,-0.406825,-1.911517
f,-2.876589,1.162691,-0.243991
g,0.979592,0.549737,-0.119446
h,0.864909,-0.237887,0.194255
i,2.943195,0.048317,-0.138756


In [71]:
df1 + df2

Unnamed: 0,A,B,C,D
a,,,,
b,,,,
c,0.267357,1.191254,1.446345,
d,-0.126591,-0.629553,-1.038102,
e,-0.609355,0.436407,-2.4995,
f,-3.076061,1.257097,0.896015,
g,0.395388,0.209376,1.619609,
h,0.82488,-1.946214,0.463363,
i,2.653334,-0.386155,-0.332872,
j,,,,


In [72]:
df1 - df1.iloc[0]

Unnamed: 0,A,B,C,D
a,0.0,0.0,0.0,0.0
b,-1.756063,1.70899,0.281334,0.300362
c,-1.017776,1.055559,0.412858,1.104208
d,-1.646642,0.754434,-0.371024,1.237381
e,-2.047279,0.957721,-0.186132,-0.221854
f,-1.788353,0.208895,1.541856,0.573095
g,-2.173084,-0.225871,2.140905,-0.107694
h,-1.62891,-1.593839,0.670958,-0.03517
i,-1.878741,-0.319983,0.207735,0.567112
j,-2.405622,-0.520748,0.747723,-0.445722
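
The alignment rules can be seen more easily with small fixed values. In this sketch (illustrative data only), `add(..., fill_value=0)` is one way to avoid NaN for labels found in only one operand, and `sub(..., axis=0)` aligns a Series on the row index instead of on the columns:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]}, index=list('abc'))
df2 = pd.DataFrame({'A': [10.0, 20.0]}, index=list('bc'))

total = df1 + df2                          # union of rows and columns; unmatched -> NaN
filled = df1.add(df2, fill_value=0)        # a value missing on one side counts as 0

row_centered = df1 - df1.iloc[0]           # Series aligned on columns, applied to every row
col_centered = df1.sub(df1['A'], axis=0)   # Series aligned on the row index instead
```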


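#### Using numpy functions

Pandas data structures interoperate with numpy: element-wise numpy functions can be applied directly to a DataFrame, and a DataFrame can be converted to an ndarray.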

In [73]:
df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
df

Unnamed: 0,one,two,three,four
0,1.234345,-0.433012,-0.189727,-0.347024
1,-0.319648,-0.043347,0.735352,-0.503228
2,-0.837519,-0.691962,0.459743,-0.360081
3,1.913879,0.576434,0.268782,0.890312
4,0.443715,0.190856,0.273737,-0.833024
5,0.739436,2.38009,1.743626,-0.797093
6,2.223067,-0.373445,1.012843,-1.009797
7,0.816871,1.728315,1.170194,1.174062
8,-0.325651,-0.61908,0.289101,1.136349
9,0.899967,-0.834685,0.086455,1.237113


In [74]:
np.exp(df)

Unnamed: 0,one,two,three,four
0,3.436128,0.648552,0.827185,0.706788
1,0.726404,0.957579,2.086217,0.604576
2,0.432783,0.500593,1.583666,0.69762
3,6.779336,1.779681,1.30837,2.43589
4,1.558486,1.210285,1.314869,0.434733
5,2.094754,10.805872,5.71804,0.450637
6,9.235617,0.688359,2.753417,0.364293
7,2.263406,5.63116,3.222617,3.235108
8,0.722057,0.53844,1.335227,3.115373
9,2.459521,0.434011,1.090302,3.445652


In [75]:
np.asarray(df) == df.values

array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]], dtype=bool)

In [76]:
type(np.asarray(df))

numpy.ndarray

In [77]:
np.asarray(df) == df

Unnamed: 0,one,two,three,four
0,True,True,True,True
1,True,True,True,True
2,True,True,True,True
3,True,True,True,True
4,True,True,True,True
5,True,True,True,True
6,True,True,True,True
7,True,True,True,True
8,True,True,True,True
9,True,True,True,True
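
A compact way to see both directions of the interoperability, using illustrative values: a NumPy ufunc applied to a DataFrame returns a DataFrame with the same labels, while `to_numpy()` (or `np.asarray`) yields a plain ndarray without labels.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'one': [0.0, 1.0], 'two': [2.0, 3.0]}, index=['x', 'y'])

exp_df = np.exp(df)      # still a DataFrame; row and column labels preserved
arr = df.to_numpy()      # plain ndarray; labels are dropped
same = np.array_equal(np.asarray(df), arr)
```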


In [78]:
df.one

0    1.234345
1   -0.319648
2   -0.837519
3    1.913879
4    0.443715
5    0.739436
6    2.223067
7    0.816871
8   -0.325651
9    0.899967
Name: one, dtype: float64

### Panel 

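A Panel is a three-dimensional labeled array. In fact, the name Pandas derives from Panel, that is, pan(el)-da(ta)-s. Panel is rarely used, although it was historically one of the basic data structures. Note that `pd.Panel` was deprecated in pandas 0.20 and removed in 0.25, so the cells in this section only run on older versions.

* items: axis 0; each item corresponds to a DataFrame
* major_axis: axis 1; the row labels of each DataFrame
* minor_axis: axis 2; the column labels of each DataFrame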

In [79]:
data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
        'Item2' : pd.DataFrame(np.random.randn(4, 2))}
pn = pd.Panel(data)
pn

<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

In [80]:
pn['Item1']

Unnamed: 0,0,1,2
0,0.572677,0.613401,0.355627
1,1.424893,-0.631455,-0.490202
2,0.127337,0.452397,-2.100219
3,-1.428068,-0.359308,-0.179074


In [81]:
pn.items

Index(['Item1', 'Item2'], dtype='object')

In [82]:
pn.major_axis

RangeIndex(start=0, stop=4, step=1)

In [83]:
pn.minor_axis

RangeIndex(start=0, stop=3, step=1)

In [84]:
# Cross-section along the major axis
pn.major_xs(pn.major_axis[0])

Unnamed: 0,Item1,Item2
0,0.572677,-2.531987
1,0.613401,2.069053
2,0.355627,


In [85]:
# Cross-section along the minor axis
pn.minor_xs(pn.minor_axis[1])

Unnamed: 0,Item1,Item2
0,0.613401,2.069053
1,-0.631455,0.351082
2,0.452397,-2.396276
3,-0.359308,0.886077


In [86]:
pn.to_frame()

Unnamed: 0_level_0,Unnamed: 1_level_0,Item1,Item2
major,minor,Unnamed: 2_level_1,Unnamed: 3_level_1
0,0,0.572677,-2.531987
0,1,0.613401,2.069053
1,0,1.424893,-0.63134
1,1,-0.631455,0.351082
2,0,0.127337,-0.7506
2,1,0.452397,-2.396276
3,0,-1.428068,-0.816392
3,1,-0.359308,0.886077
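
Since `pd.Panel` is not available in recent pandas releases, a similar three-dimensional layout can be approximated with a DataFrame that has a two-level row index. The sketch below is one possible substitute, not a drop-in replacement; the data values and the level names `item` and `major` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with the same shapes as the Panel example above.
data = {'Item1': pd.DataFrame(np.arange(12).reshape(4, 3)),
        'Item2': pd.DataFrame(np.arange(8).reshape(4, 2))}

# Stack the items along a new outer index level.
# Item2 has no column 2, so that column is NaN for its rows.
stacked = pd.concat(data, names=['item', 'major'])

item1 = stacked.loc['Item1']             # comparable to pn['Item1']
major0 = stacked.xs(0, level='major')    # comparable to pn.major_xs(0), transposed
```

Here `stacked` plays the role of `pn.to_frame()`, with the item as an index level instead of a column.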
