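# Intro to Data Structures

We'll start with a quick, non-comprehensive overview of the fundamental data structures in pandas. The fundamental behavior of data types, indexing, and axis labeling/alignment applies across all of the objects. To get started, import numpy and pandas into your namespace: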

In [5]:
import numpy as np

import pandas as pd

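Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless you do so explicitly.

We'll give a brief introduction to the data structures, then consider all of the broad categories of functionality and methods in separate sections.

## Series

**Series** is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the **index**. The basic method to create a Series is to call: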

s = pd.Series(data,index=index)

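Here, data can be many different things:

* a Python dict
* an ndarray
* a scalar value (like 5)

The passed index is a list of axis labels. Thus, this separates into a few cases depending on what data is:

**From ndarray**

If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values [0, ..., len(data) - 1].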

In [81]:
s = pd.Series(np.random.randn(5),index=['a', 'b', 'c', 'd', 'e'])
s

a    0.547583
b    0.529911
c   -0.739141
d    0.061205
e   -0.501934
dtype: float64

In [82]:
s.index

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [83]:
pd.Series(np.random.randn(5))

0    0.756829
1   -1.361404
2    1.266434
3    0.323848
4    0.140737
dtype: float64
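The length rule above can be sketched as follows. This is a minimal illustration with made-up variable names (`values`, `ok`, `mismatch_error` are not from the original text): an index of the wrong length is rejected with a `ValueError`, and omitting the index yields the default integer labels.

```python
import numpy as np
import pandas as pd

values = np.arange(3.0)  # three data points (illustrative data)

# An index of matching length is accepted.
ok = pd.Series(values, index=['x', 'y', 'z'])

# An index of a different length is rejected.
try:
    pd.Series(values, index=['x', 'y'])
    mismatch_error = None
except ValueError as exc:
    mismatch_error = exc

# With no index, the labels default to 0 .. len(data) - 1.
default_labels = list(pd.Series(values).index)
```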

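**From dict**

If data is a dict and an index is passed, the values in data corresponding to the labels in the index are pulled out, and labels with no matching key get NaN. Otherwise, an index is constructed from the sorted keys of the dict (newer pandas versions keep the dict's insertion order instead).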

In [84]:
d = {'a' : 0., 'b' : 1., 'c' : 2.}
pd.Series(d)

a    0
b    1
c    2
dtype: float64

In [85]:
pd.Series(d,index=['b', 'c', 'd', 'a'])

b     1
c     2
d   NaN
a     0
dtype: float64

**From scalar value**

If data is a scalar value, an index must be provided. The value will be repeated to match the length of the index.

In [86]:
pd.Series(5., index=['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: float64

### Series is ndarray-like

Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations such as slicing also slice the index.

In [88]:
s[0]

0.54758263141805763

In [89]:
s[:3]

a    0.547583
b    0.529911
c   -0.739141
dtype: float64

In [90]:
s[s > s.median()]

a    0.547583
b    0.529911
dtype: float64

In [91]:
s[[4,3,1]]

e   -0.501934
d    0.061205
b    0.529911
dtype: float64

In [92]:
np.exp(s)

a    1.729068
b    1.698781
c    0.477524
d    1.063117
e    0.605358
dtype: float64

We will address array-based indexing in a separate section.
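The examples above pass plain integers such as 0 or [4, 3, 1] to a Series whose index holds string labels. Newer pandas releases discourage that implicit positional lookup, so the sketch below spells out the intent with `.iloc`. The fixed values are illustrative stand-ins for the random data used above.

```python
import pandas as pd

s = pd.Series([0.5, 0.4, -0.7, 0.1, -0.5], index=['a', 'b', 'c', 'd', 'e'])

first = s.iloc[0]                 # scalar, selected by position
picked = s.iloc[[4, 3, 1]]        # several positions; their labels come along
head = s.iloc[:3]                 # a positional slice also slices the index
above_median = s[s > s.median()]  # boolean mask, as with an ndarray
```

Each result except `first` is itself a Series, so the labels stay attached to the selected values.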

### Series is dict-like

A Series is like a fixed-size dict in that you can get and set values by index label:

In [93]:
s['a']

0.54758263141805763

In [94]:
s['e'] = 12.

In [95]:
s

a     0.547583
b     0.529911
c    -0.739141
d     0.061205
e    12.000000
dtype: float64

In [96]:
'e' in s

True

In [97]:
'f' in s

False

If a label is not contained in the index, an exception is raised:

In [98]:
s['f']

KeyError: 'f'

Using the get method, a missing label returns None or the specified default:

In [99]:
s.get('f')

In [100]:
s.get('f', np.nan)

nan

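### Vectorized operations and label alignment with Series

When doing data analysis, as with raw NumPy arrays, looping through a Series value by value is usually not necessary. Series can also be passed into most NumPy methods expecting an ndarray.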

In [27]:
s + s

a    -1.978659
b     3.543817
c    -1.674672
d    -0.564938
e    24.000000
dtype: float64

In [28]:
s * 2

a    -1.978659
b     3.543817
c    -1.674672
d    -0.564938
e    24.000000
dtype: float64

In [29]:
np.exp(s)

a         0.371826
b         5.882069
c         0.432862
d         0.753920
e    162754.791419
dtype: float64

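A key difference between Series and ndarray is that operations between Series automatically align the data based on label. Thus, you can write computations without considering whether the Series involved have the same labels. The result carries the union of the labels involved, and a label found in only one operand is marked as missing (NaN).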

In [30]:
s[1:] + s[:-1]

a         NaN
b    3.543817
c   -1.674672
d   -0.564938
e         NaN
dtype: float64
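The NaN entries produced by alignment can be handled in several ways. The sketch below uses small illustrative values (not the random data above) and shows two common choices: dropping unmatched labels with `dropna()`, or treating a missing operand as zero through the `fill_value` argument of `Series.add`.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'b', 'c', 'd'])

left = s.iloc[1:]    # labels b, c, d
right = s.iloc[:-1]  # labels a, b, c

total = left + right                    # union of labels; a and d become NaN
only_matched = total.dropna()           # keep labels present on both sides
filled = left.add(right, fill_value=0)  # a missing side counts as 0
```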

### Name attribute

Series also has a name attribute:

In [31]:
s = pd.Series(np.random.randn(5),name='something')

In [32]:
s

0   -0.198564
1   -0.024818
2    0.220382
3    1.227965
4    0.690763
Name: something, dtype: float64

In [33]:
s.name

'something'

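You can rename a Series with the **pandas.Series.rename()** method.

Note that rename() returns a new object: after s2 = s.rename('different'), s and s2 refer to different objects.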
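A minimal sketch of the renaming step, assuming the new name 'different' (an illustrative choice) and fixed values in place of random data:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], name='something')

# rename() returns a new Series; the original keeps its name.
s2 = s.rename('different')
```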

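## DataFrame

**DataFrame** is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

### From dict of Series or dicts

The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first be converted to Series. If no columns are passed, the columns will be the sorted list of dict keys.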

In [101]:
 d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
      'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

In [102]:
df = pd.DataFrame(d)

In [104]:
df

| | one | two |
|---|---|---|
| a | 1.0 | 1 |
| b | 2.0 | 2 |
| c | 3.0 | 3 |
| d | NaN | 4 |


In [105]:
pd.DataFrame(d,index=['d','b','a'])

| | one | two |
|---|---|---|
| d | NaN | 4 |
| b | 2.0 | 2 |
| a | 1.0 | 1 |


In [106]:
pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])

| | two | three |
|---|---|---|
| d | 4 | NaN |
| b | 2 | NaN |
| a | 1 | NaN |


The row and column labels can be accessed through the index and columns attributes respectively:

In [107]:
df.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

In [108]:
df.columns

Index([u'one', u'two'], dtype='object')

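### From dict of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the result will be range(n), where n is the array length.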

In [109]:
 d = {'one' : [1., 2., 3., 4.],
      'two' : [4., 3., 2., 1.]}

In [110]:
pd.DataFrame(d)

| | one | two |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 3 |
| 2 | 3 | 2 |
| 3 | 4 | 1 |


In [111]:
pd.DataFrame(d, index=['a', 'b', 'c', 'd'])

| | one | two |
|---|---|---|
| a | 1 | 4 |
| b | 2 | 3 |
| c | 3 | 2 |
| d | 4 | 1 |
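The length requirements can be checked with a short sketch. The data here is illustrative; the point is only that both mismatched column lengths and a mismatched index are rejected with a `ValueError`.

```python
import pandas as pd

# Columns of different lengths are rejected.
try:
    pd.DataFrame({'one': [1., 2., 3.], 'two': [4., 3.]})
    length_error = None
except ValueError as exc:
    length_error = exc

# An index whose length differs from the arrays is rejected as well.
try:
    pd.DataFrame({'one': [1., 2.], 'two': [4., 3.]}, index=['a', 'b', 'c'])
    index_error = None
except ValueError as exc:
    index_error = exc
```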


### From structured or record array

In [112]:
data = np.zeros((2,), dtype=[('A', 'i4'),('B', 'f4'),('C', 'S10')])

data[:] = [(1,2.,'Hello'), (2,3.,"World")]

pd.DataFrame(data)

| | A | B | C |
|---|---|---|---|
| 0 | 1 | 2 | Hello |
| 1 | 2 | 3 | World |


In [113]:
pd.DataFrame(data, index=['first','second'])

| | A | B | C |
|---|---|---|---|
| first | 1 | 2 | Hello |
| second | 2 | 3 | World |


In [114]:
pd.DataFrame(data, columns=['C','A','B'])

| | C | A | B |
|---|---|---|---|
| 0 | Hello | 1 | 2 |
| 1 | World | 2 | 3 |


### From a list of dicts

In [115]:
data2 = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]

In [116]:
pd.DataFrame(data2)

| | a | b | c |
|---|---|---|---|
| 0 | 1 | 2 | NaN |
| 1 | 5 | 10 | 20.0 |


In [117]:
pd.DataFrame(data2, index=['first','second'])

| | a | b | c |
|---|---|---|---|
| first | 1 | 2 | NaN |
| second | 5 | 10 | 20.0 |


In [118]:
pd.DataFrame(data2,columns=['a','b'])

| | a | b |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 5 | 10 |


### From a dict of tuples

In [119]:
 pd.DataFrame({('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
               ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
              ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
                ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
               ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}})

| | | ('a', 'a') | ('a', 'b') | ('a', 'c') | ('b', 'a') | ('b', 'b') |
|---|---|---|---|---|---|---|
| A | B | 4.0 | 1.0 | 5.0 | 8.0 | 10.0 |
| A | C | 3.0 | 2.0 | 6.0 | 7.0 | NaN |
| A | D | NaN | NaN | NaN | NaN | 9.0 |


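### From a Series

The result will be a DataFrame with the same index as the input Series, and with one column whose name is the original name of the Series (only if no other column name is provided).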
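A short sketch of this case, assuming an illustrative Series named 'ser'. The `to_frame(name=...)` call is shown as one way to supply a different column name; the original text does not prescribe a particular method.

```python
import pandas as pd

ser = pd.Series([1, 2, 3], index=['a', 'b', 'c'], name='ser')

frame = pd.DataFrame(ser)             # one column, named after ser.name
renamed = ser.to_frame(name='other')  # same data under another column name
```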

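### Alternate constructors

**DataFrame.from_dict**

DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates like the DataFrame constructor except for the orient parameter, which is 'columns' by default but can be set to 'index' in order to use the dict keys as row labels.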
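The effect of orient can be sketched with a small dict. The data and variable names are illustrative only.

```python
import pandas as pd

data = {'A': [1, 2, 3], 'B': [4, 5, 6]}

by_columns = pd.DataFrame.from_dict(data)               # keys become columns
by_rows = pd.DataFrame.from_dict(data, orient='index')  # keys become row labels
```

With orient='index' and no column names given, the columns fall back to the default integer labels.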

**DataFrame.from_records**

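DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously to the normal DataFrame constructor, except that the index may be a specific field of the structured dtype. For example: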

In [120]:
data

array([(1, 2.0, 'Hello'), (2, 3.0, 'World')], 
      dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])

In [121]:
pd.DataFrame.from_records(data,index='C')

| C | A | B |
|---|---|---|
| Hello | 1 | 2 |
| World | 2 | 3 |


**DataFrame.from_items**

In [122]:
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])])

| | A | B |
|---|---|---|
| 0 | 1 | 4 |
| 1 | 2 | 5 |
| 2 | 3 | 6 |


In [123]:
pd.DataFrame.from_items([('A', [1, 2, 3]), ('B', [4, 5, 6])],
                            orient='index', columns=['one', 'two', 'three'])

| | one | two | three |
|---|---|---|---|
| A | 1 | 2 | 3 |
| B | 4 | 5 | 6 |
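DataFrame.from_items was deprecated and later removed from pandas, so the two calls above fail on current releases. A rough equivalent, sketched here under the assumption that the key/value pairs can be turned into a dict, goes through DataFrame.from_dict:

```python
import pandas as pd

items = [('A', [1, 2, 3]), ('B', [4, 5, 6])]

# Keys become columns (the default orient).
as_columns = pd.DataFrame.from_dict(dict(items))

# Keys become row labels; column names are supplied explicitly.
as_rows = pd.DataFrame.from_dict(dict(items), orient='index',
                                 columns=['one', 'two', 'three'])
```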


### Column selection, addition, deletion

In [124]:
df['one']

a     1
b     2
c     3
d   NaN
Name: one, dtype: float64

In [125]:
df['three'] = df['one'] * df['two']

In [126]:
df['flag'] = df['one'] > 2

In [127]:
df

| | one | two | three | flag |
|---|---|---|---|---|
| a | 1.0 | 1 | 1.0 | False |
| b | 2.0 | 2 | 4.0 | False |
| c | 3.0 | 3 | 9.0 | True |
| d | NaN | 4 | NaN | False |


Columns can be deleted or popped like with a dict:

In [128]:
del df['two']

In [129]:
three = df.pop('three')

In [130]:
df

| | one | flag |
|---|---|---|
| a | 1.0 | False |
| b | 2.0 | False |
| c | 3.0 | True |
| d | NaN | False |


When inserting a scalar value, it will naturally be propagated to fill the column:

In [131]:
df['foo'] = 'bar'

In [132]:
df

| | one | flag | foo |
|---|---|---|---|
| a | 1.0 | False | bar |
| b | 2.0 | False | bar |
| c | 3.0 | True | bar |
| d | NaN | False | bar |


When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame's index:

In [133]:
df['one_trunc'] = df['one'][:2]

In [134]:
df

| | one | flag | foo | one_trunc |
|---|---|---|---|---|
| a | 1.0 | False | bar | 1.0 |
| b | 2.0 | False | bar | 2.0 |
| c | 3.0 | True | bar | NaN |
| d | NaN | False | bar | NaN |


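You can insert raw ndarrays, but their length must match the length of the DataFrame's index.

By default, columns get inserted at the end. The insert function is available to insert at a particular location in the columns: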

In [135]:
df.insert(1,'bar',df['one'])

In [136]:
df

| | one | bar | flag | foo | one_trunc |
|---|---|---|---|---|---|
| a | 1.0 | 1.0 | False | bar | 1.0 |
| b | 2.0 | 2.0 | False | bar | 2.0 |
| c | 3.0 | 3.0 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |


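### Assigning new columns in method chains

DataFrame has an **assign()** method that allows you to easily create new columns that are potentially derived from existing columns. assign() returns a copy of the data and leaves the original DataFrame untouched.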

iris = pd.read_csv('data/iris.data')

iris.head()

(iris.assign(sepal_ratio = iris['SepalWidth'] / iris['SepalLength'])
         .head())

iris.assign(sepal_ratio = lambda x: (x['SepalWidth'] /
                                        x['SepalLength'])).head()

(iris.query('SepalLength > 5')
         .assign(SepalRatio = lambda x: x.SepalWidth / x.SepalLength,
                 PetalRatio = lambda x: x.PetalWidth / x.PetalLength)
         .plot(kind='scatter', x='SepalRatio', y='PetalRatio'))
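The snippets above depend on a local file, data/iris.data, that may not be available. The following self-contained sketch uses a tiny hand-written frame with the same column names (the values are illustrative, not the real dataset) to show both the direct and the callable form of assign():

```python
import pandas as pd

flowers = pd.DataFrame({
    'SepalLength': [5.1, 4.9, 6.3],
    'SepalWidth': [3.5, 3.0, 3.3],
})

# The callable receives the frame being assigned to.
with_ratio = flowers.assign(
    sepal_ratio=lambda x: x['SepalWidth'] / x['SepalLength'])

# In a chain, the callable sees the filtered frame, not the original one.
filtered = (flowers.query('SepalLength > 5')
                   .assign(sepal_ratio=lambda x: x.SepalWidth / x.SepalLength))
```

The callable form matters in a chain: there is no variable name for the intermediate, filtered frame, so a lambda is the only way to refer to it.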
    

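### Indexing / Selection

The basics of indexing are as follows:

**Operation**|**Syntax** |**Result**
-----|------|----
Select column    | df[col]    | Series
Select row by label   | df.loc[label]   | Series
Select row by integer location    | df.iloc[loc]    | Series
Slice rows    | df[5:10]    | DataFrame
Select rows by boolean vector   | df[bool_vec]    | DataFrame

Row selection, for example, returns a Series whose index is the columns of the DataFrame: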

In [140]:
df

| | one | bar | flag | foo | one_trunc |
|---|---|---|---|---|---|
| a | 1.0 | 1.0 | False | bar | 1.0 |
| b | 2.0 | 2.0 | False | bar | 2.0 |
| c | 3.0 | 3.0 | True | bar | NaN |
| d | NaN | NaN | False | bar | NaN |


In [141]:
df.loc['b']

one              2
bar              2
flag         False
foo            bar
one_trunc        2
Name: b, dtype: object

In [142]:
df.iloc[2]

one             3
bar             3
flag         True
foo           bar
one_trunc     NaN
Name: c, dtype: object
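The result types listed in the table can be verified with a small sketch. The frame below is an illustrative stand-in for the one built earlier.

```python
import pandas as pd

df = pd.DataFrame({'one': [1., 2., 3.], 'flag': [False, False, True]},
                  index=['a', 'b', 'c'])

col = df['one']              # column selection -> Series
row_by_label = df.loc['b']   # row by label -> Series indexed by column names
row_by_pos = df.iloc[2]      # row by integer location -> Series
row_slice = df[0:2]          # row slice -> DataFrame
masked = df[df['one'] > 1]   # boolean vector -> DataFrame
```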

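### Data alignment and arithmetic

Operations between DataFrame objects automatically align on both the columns and the index (row labels). Again, the resulting object will have the union of the column and row labels.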

In [143]:
df = pd.DataFrame(np.random.randn(10,4),columns=['A', 'B', 'C', 'D'])

In [144]:
df2 = pd.DataFrame(np.random.randn(7,3),columns=['A', 'B', 'C'])

In [145]:
df + df2

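| | A | B | C | D |
|---|---|---|---|---|
| 0 | -0.792044 | 0.196848 | -2.197867 | NaN |
| 1 | -2.199564 | 0.298989 | -0.669907 | NaN |
| 2 | -0.087791 | 2.226107 | 1.915892 | NaN |
| 3 | 1.19963 | 0.185055 | 0.221739 | NaN |
| 4 | 0.19591 | 0.519975 | -2.007668 | NaN |
| 5 | -2.154454 | 1.17946 | -1.76698 | NaN |
| 6 | -2.450981 | -0.5214 | -0.517213 | NaN |
| 7 | NaN | NaN | NaN | NaN |
| 8 | NaN | NaN | NaN | NaN |
| 9 | NaN | NaN | NaN | NaN |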


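When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the DataFrame columns, thus broadcasting row-wise. For example: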

In [146]:
df - df.iloc[0]

| | A | B | C | D |
|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | -0.334937 | 1.068745 | 1.768437 | -0.385068 |
| 2 | -1.082112 | 1.816277 | 2.06785 | -0.054826 |
| 3 | -0.510419 | -0.202545 | 1.0983 | 0.962498 |
| 4 | -0.803443 | 2.348294 | 0.507732 | 0.16732 |
| 5 | 0.555808 | 1.321957 | 1.02973 | -2.103689 |
| 6 | -0.426993 | 0.88529 | 0.740374 | 0.436118 |
| 7 | 0.770606 | -0.206424 | 0.80504 | 0.105756 |
| 8 | 0.730288 | 1.890398 | 2.874625 | -0.665956 |
| 9 | -0.698176 | 1.931307 | 0.84164 | -0.386771 |


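In the special case of working with time series data, where the DataFrame index contains dates, the broadcasting is column-wise (in the pandas version shown here):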

In [147]:
index = pd.date_range('1/1/2000',periods=8)

In [148]:
df = pd.DataFrame(np.random.randn(8,3),index=index,columns=list('ABC'))

In [149]:
df

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 2.941187 | 1.14749 | 0.075349 |
| 2000-01-02 | 0.436512 | -1.828436 | -1.149418 |
| 2000-01-03 | 1.130869 | -0.889218 | -0.615453 |
| 2000-01-04 | -0.408743 | 0.444983 | -1.881506 |
| 2000-01-05 | 0.064386 | 0.528198 | -1.675042 |
| 2000-01-06 | 0.281504 | -0.61037 | 0.411973 |
| 2000-01-07 | -2.465851 | -1.370818 | 2.089272 |
| 2000-01-08 | -1.722175 | -0.932429 | 0.054045 |


In [150]:
type(df['A'])

pandas.core.series.Series

In [153]:
df - df['A']

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 0 | -1.793697 | -2.865837 |
| 2000-01-02 | 0 | -2.264948 | -1.58593 |
| 2000-01-03 | 0 | -2.020087 | -1.746323 |
| 2000-01-04 | 0 | 0.853726 | -1.472762 |
| 2000-01-05 | 0 | 0.463813 | -1.739428 |
| 2000-01-06 | 0 | -0.891873 | 0.130469 |
| 2000-01-07 | 0 | 1.095032 | 4.555123 |
| 2000-01-08 | 0 | 0.789746 | 1.77622 |
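The implicit column-wise broadcasting in df - df['A'] is specific to older pandas and was later deprecated. To state the intent explicitly, the axis can be passed to the arithmetic method. The sketch below uses deterministic, illustrative data in place of the random frame above.

```python
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=4)
df = pd.DataFrame(np.arange(12.0).reshape(4, 3),
                  index=index, columns=list('ABC'))

by_column = df.sub(df['A'], axis=0)  # match the Series on row labels
by_row = df - df.iloc[0]             # default: match on column labels
```

Here `by_column` subtracts column A from every column, while `by_row` subtracts the first row from every row.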


In [154]:
df * 5 + 2

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 16.705934 | 7.737449 | 2.376747 |
| 2000-01-02 | 4.182561 | -7.14218 | -3.74709 |
| 2000-01-03 | 7.654347 | -2.446091 | -1.077266 |
| 2000-01-04 | -0.043716 | 4.224914 | -7.407528 |
| 2000-01-05 | 2.321928 | 4.640992 | -6.375211 |
| 2000-01-06 | 3.407518 | -1.051848 | 4.059865 |
| 2000-01-07 | -10.329254 | -4.854092 | 12.446361 |
| 2000-01-08 | -6.610876 | -2.662144 | 2.270226 |


In [155]:
1 / df

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 0.339999 | 0.871467 | 13.2715 |
| 2000-01-02 | 2.290887 | -0.546915 | -0.870005 |
| 2000-01-03 | 0.884275 | -1.124583 | -1.624819 |
| 2000-01-04 | -2.446523 | 2.247278 | -0.531489 |
| 2000-01-05 | 15.531405 | 1.893228 | -0.597 |
| 2000-01-06 | 3.552353 | -1.638352 | 2.427344 |
| 2000-01-07 | -0.40554 | -0.729491 | 0.478636 |
| 2000-01-08 | -0.580661 | -1.072468 | 18.503052 |


In [156]:
df ** 4

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 74.832529 | 1.733785 | 3.2e-05 |
| 2000-01-02 | 0.036307 | 11.176843 | 1.745469 |
| 2000-01-03 | 1.635497 | 0.62522 | 0.143476 |
| 2000-01-04 | 0.027913 | 0.039208 | 12.53205 |
| 2000-01-05 | 1.7e-05 | 0.077837 | 7.872325 |
| 2000-01-06 | 0.00628 | 0.138794 | 0.028805 |
| 2000-01-07 | 36.971511 | 3.531179 | 19.053732 |
| 2000-01-08 | 8.796489 | 0.755897 | 9e-06 |


Boolean operators work as well:

In [157]:
df1 = pd.DataFrame({'a' : [1, 0, 1], 'b' : [0, 1, 1] }, dtype=bool)

In [159]:
df2 = pd.DataFrame({'a' : [0, 1, 1], 'b' : [1, 1, 0] }, dtype=bool)

In [160]:
df1 & df2

| | a | b |
|---|---|---|
| 0 | False | False |
| 1 | False | True |
| 2 | True | False |


In [162]:
df1 | df2

| | a | b |
|---|---|---|
| 0 | True | True |
| 1 | True | True |
| 2 | True | True |


In [163]:
df1 ^ df2

| | a | b |
|---|---|---|
| 0 | True | True |
| 1 | True | False |
| 2 | False | True |


In [164]:
~df1

| | a | b |
|---|---|---|
| 0 | False | True |
| 1 | True | False |
| 2 | False | False |


### Transposing

To transpose, access the T attribute (also the transpose function), similar to an ndarray:

In [165]:
df[:5].T

| | 2000-01-01 00:00:00 | 2000-01-02 00:00:00 | 2000-01-03 00:00:00 | 2000-01-04 00:00:00 | 2000-01-05 00:00:00 |
|---|---|---|---|---|---|
| A | 2.941187 | 0.436512 | 1.130869 | -0.408743 | 0.064386 |
| B | 1.14749 | -1.828436 | -0.889218 | 0.444983 | 0.528198 |
| C | 0.075349 | -1.149418 | -0.615453 | -1.881506 | -1.675042 |


### DataFrame interoperability with NumPy functions

In [167]:
df

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 2.941187 | 1.14749 | 0.075349 |
| 2000-01-02 | 0.436512 | -1.828436 | -1.149418 |
| 2000-01-03 | 1.130869 | -0.889218 | -0.615453 |
| 2000-01-04 | -0.408743 | 0.444983 | -1.881506 |
| 2000-01-05 | 0.064386 | 0.528198 | -1.675042 |
| 2000-01-06 | 0.281504 | -0.61037 | 0.411973 |
| 2000-01-07 | -2.465851 | -1.370818 | 2.089272 |
| 2000-01-08 | -1.722175 | -0.932429 | 0.054045 |


In [168]:
np.exp(df)

| | A | B | C |
|---|---|---|---|
| 2000-01-01 | 18.938309 | 3.150275 | 1.078261 |
| 2000-01-02 | 1.547301 | 0.160665 | 0.316821 |
| 2000-01-03 | 3.098349 | 0.410977 | 0.540396 |
| 2000-01-04 | 0.664485 | 1.560463 | 0.152361 |
| 2000-01-05 | 1.066504 | 1.695874 | 0.1873 |
| 2000-01-06 | 1.325121 | 0.54315 | 1.509794 |
| 2000-01-07 | 0.084937 | 0.253899 | 8.079033 |
| 2000-01-08 | 0.178677 | 0.393597 | 1.055532 |


In [169]:
np.asarray(df)

array([[ 2.94118678,  1.14748974,  0.07534943],
       [ 0.43651212, -1.82843608, -1.14941804],
       [ 1.13086936, -0.88921811, -0.61545328],
       [-0.40874327,  0.44498287, -1.88150567],
       [ 0.06438568,  0.52819837, -1.67504218],
       [ 0.28150356, -0.61036951,  0.41197293],
       [-2.46585083, -1.37081841,  2.08927213],
       [-1.72217524, -0.93242882,  0.05404514]])

The dot method on DataFrame implements matrix multiplication:

In [170]:
df.T.dot(df)

| | A | B | C |
|---|---|---|---|
| A | 19.416757 | 6.237601 | -5.443848 |
| B | 6.237601 | 9.048741 | -2.152482 |
| C | -5.443848 | -2.152482 | 12.589153 |


Similarly, the dot method on Series implements the dot product:

In [173]:
s1 = pd.Series(np.arange(5,10))

In [174]:
s1.dot(s1)

255
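As a cross-check on the two dot examples, the sketch below compares the labeled result with plain NumPy matrix multiplication on a small illustrative frame, and also uses the `@` operator as an alternative spelling.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]], columns=['A', 'B'])

gram = df.T.dot(df)   # 2x2 result, labeled by column names on both axes
same = df.T @ df      # the @ operator gives the same result
expected = df.to_numpy().T @ df.to_numpy()  # unlabeled NumPy equivalent

s1 = pd.Series(np.arange(5, 10))
inner = s1.dot(s1)    # 25 + 36 + 49 + 64 + 81
```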

### Console display

Very large DataFrames will be truncated when displayed in the console. You can also get a summary using info().

 baseball = pd.read_csv('data/baseball.csv')
 
 print(baseball)
 
 baseball.info()

However, using to_string will return a string representation of the DataFrame in tabular form, though it won't always fit the console width:

print(baseball.iloc[-20:, :12].to_string())

 pd.DataFrame(np.random.randn(3, 12))

You can change how much to print on a single row by setting the display.width option:

pd.set_option('display.width', 40)

pd.DataFrame(np.random.randn(3, 12))
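pd.set_option changes the width for the rest of the session. When the narrower width is needed only temporarily, one option is pd.option_context, which restores the previous value on exit. This is a sketch of that alternative, not something the original text uses.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame(np.zeros((3, 12)))
before = pd.get_option('display.width')

# The option applies only inside the with block.
with pd.option_context('display.width', 40):
    inside = pd.get_option('display.width')
    narrow_text = repr(wide)  # rendered against the 40-character width

after = pd.get_option('display.width')  # back to the earlier setting
```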

### DataFrame column attribute access and IPython completion

If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:

In [175]:
df = pd.DataFrame({'foo1':np.random.randn(5),'foo2':np.random.randn(5)})

In [176]:
df

| | foo1 | foo2 |
|---|---|---|
| 0 | -0.347671 | -1.009187 |
| 1 | 0.320512 | 1.200344 |
| 2 | 0.679307 | -0.038199 |
| 3 | 0.769751 | 1.635441 |
| 4 | -0.122912 | -0.97982 |


In [177]:
df.foo1

0   -0.347671
1    0.320512
2    0.679307
3    0.769751
4   -0.122912
Name: foo1, dtype: float64
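Attribute access is a convenience with limits. The sketch below uses illustrative column names to show two cases where bracket indexing is needed: a label that is not a valid identifier, and a label that collides with an existing DataFrame method.

```python
import pandas as pd

df = pd.DataFrame({'foo1': [1.0, 2.0],
                   'not valid': [3.0, 4.0],
                   'count': [5, 6]})

same = df.foo1.equals(df['foo1'])  # both spellings return the same column

spaced = df['not valid']           # no dot syntax exists for this label
is_method = callable(df.count)     # df.count is the method, not the column
count_col = df['count']            # brackets reach the column
```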

The columns are also connected to the IPython completion mechanism so they can be tab-completed: