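# Getting Started with Pandas
Pandas is built on NumPy and adds easier-to-use data structures and data analysis tools on top of it, which helps you work more efficiently.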

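First, pandas centers on two data structures:
- **Series**: one-dimensional data. It wraps a NumPy one-dimensional array and supports a custom index, which makes it similar to an associative array in a Linux shell script.
- **DataFrame**: two-dimensional data. It wraps a NumPy two-dimensional array and supports a custom index and custom column names (columns).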
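
The two structures described above can be sketched as follows. This is a minimal illustration, not part of the original notebook; the names `prices` and `scores` and their values are made up for the example.

```python
import numpy as np
import pandas as pd

# A Series wraps a 1-D array and attaches a label to each element.
prices = pd.Series([3.5, 1.2, 4.8], index=["apple", "banana", "cherry"])
banana_price = prices["banana"]  # label-based lookup, like an associative array

# A DataFrame wraps a 2-D array and labels both rows and columns.
scores = pd.DataFrame(
    np.arange(6).reshape(3, 2),   # [[0, 1], [2, 3], [4, 5]]
    index=["r1", "r2", "r3"],     # custom row labels
    columns=["x", "y"],           # custom column names
)
cell = scores.loc["r2", "y"]      # row label "r2", column label "y"
```

Here `.loc` selects by label rather than by position, which is where the custom index becomes useful.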

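Beyond these, pandas also provides:

- describe: quickly computes a range of descriptive statistics for the data
- unique: lists the distinct values in the data
- value_counts: counts the occurrences of each value
- hist: draws a histogram directly
- plot: a thin wrapper around matplotlib for simple plotting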
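
As a rough sketch of the non-plotting methods in this list (`hist` and `plot` are left out because they need a display backend), assuming two small made-up Series named `colors` and `numbers`:

```python
import pandas as pd

colors = pd.Series(["red", "blue", "red", "green", "red"])
distinct = colors.unique()       # distinct values, in order of first appearance
counts = colors.value_counts()   # occurrences per value, most frequent first

numbers = pd.Series([1, 2, 3, 4])
summary = numbers.describe()     # count, mean, std, min, quartiles, max
```

`summary` is itself a Series, so individual statistics can be read by label, for example `summary["mean"]`.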

Utilities:
- Convenient I/O
- Functionality comparable to SQL
- Functionality comparable to Excel: pivot tables
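
To make these three utilities concrete, here is a small sketch. The CSV text, the column names (`city`, `year`, `sales`), and the variable names are all invented for illustration; `io.StringIO` stands in for a real file path.

```python
import io
import pandas as pd

# I/O: read CSV text (passing a file path works the same way).
csv_text = (
    "city,year,sales\n"
    "Paris,2016,10\n"
    "Paris,2017,12\n"
    "Rome,2016,7\n"
    "Rome,2017,9\n"
)
sales = pd.read_csv(io.StringIO(csv_text))

# SQL-like: roughly SELECT city, SUM(sales) FROM sales GROUP BY city
total_by_city = sales.groupby("city")["sales"].sum()

# Excel-like pivot table: one row per city, one column per year.
pivot = sales.pivot_table(index="city", columns="year", values="sales", aggfunc="sum")
```

The `year` column is parsed as integers, so the pivot table's columns are the integers 2016 and 2017 rather than strings.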

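Finally, the pandas documentation is extensive, releases are frequent, and the community is very active.

To begin, import the required modules.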

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Object creation


In [2]:
s = pd.Series([1,3,4,np.nan,6,8])
s

0    1.0
1    3.0
2    4.0
3    NaN
4    6.0
5    8.0
dtype: float64

In [3]:
dates = pd.date_range('20171026', periods=6)
dates

DatetimeIndex(['2017-10-26', '2017-10-27', '2017-10-28', '2017-10-29',
               '2017-10-30', '2017-10-31'],
              dtype='datetime64[ns]', freq='D')

In [4]:
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=tuple('ABCD'))  # tuple('ABCD') is short for ('A','B','C','D')
df

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2017-10-26 | -1.33177 | 1.602367 | -0.561792 | 1.825721 |
| 2017-10-27 | 0.251834 | -0.765978 | 0.432401 | 0.671658 |
| 2017-10-28 | -1.150419 | -0.909648 | 0.012585 | -1.406542 |
| 2017-10-29 | 1.607818 | 1.517355 | -0.756374 | 0.872013 |
| 2017-10-30 | 1.34057 | 0.133495 | 1.136394 | 1.379843 |
| 2017-10-31 | -0.205709 | 0.84486 | -0.768821 | 0.263155 |


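You can also create a DataFrame from a dict. Scalar values are broadcast automatically, NumPy-style: each one is repeated to match the length of the other columns.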

In [8]:
df2 = pd.DataFrame({'A' : 1.,
                    'B' : pd.Timestamp('20171026'),
                    'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                    'D' : np.array([3]*4, dtype='int32'),
                    'E' : pd.Categorical(["test","train","test","train"]),  # categorical data
                    'F' : 'foo'})
df2

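| | A | B | C | D | E | F |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1.0 | 2017-10-26 | 1.0 | 3 | test | foo |
| 1 | 1.0 | 2017-10-26 | 1.0 | 3 | train | foo |
| 2 | 1.0 | 2017-10-26 | 1.0 | 3 | test | foo |
| 3 | 1.0 | 2017-10-26 | 1.0 | 3 | train | foo |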
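
The broadcasting behavior can be isolated in a smaller sketch. The frame below is a made-up example, not the notebook's `df2`: scalars are stretched to the length set by the list-like column, while two list-like columns of different lengths are rejected.

```python
import pandas as pd

# Scalars ("A" and "F") are repeated to match the 4-element column "D".
frame = pd.DataFrame({"A": 1.0, "D": [3, 3, 3, 3], "F": "foo"})

# Broadcasting applies to scalars only: list-like columns must agree in length.
mismatch_error = None
try:
    pd.DataFrame({"x": [1, 2, 3], "y": [1, 2]})
except ValueError as exc:
    mismatch_error = str(exc)
```

After this runs, `frame` has four rows and `mismatch_error` holds the message from the `ValueError` that pandas raised.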


In [9]:
df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

In [10]:
df2.C  # equivalent to df2['C']

0    1.0
1    1.0
2    1.0
3    1.0
Name: C, dtype: float32
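
Attribute access and bracket access are interchangeable only when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute. The sketch below uses a hypothetical frame with columns named `C`, `count`, and `my col` to show where the two forms diverge.

```python
import pandas as pd

frame = pd.DataFrame({"C": [1.0, 2.0], "count": [10, 20], "my col": [5, 6]})

same = frame.C.equals(frame["C"])   # attribute and bracket access agree here
is_method = callable(frame.count)   # "count" resolves to the DataFrame.count method, not the column
count_col = frame["count"]          # brackets always return the column
spaced_col = frame["my col"]        # names that are not valid identifiers need brackets
```

For that reason bracket access is the safer choice in scripts, while attribute access is a convenience for interactive work.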

## Viewing data

In [12]:
df.head()  # returns the first five rows by default

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2017-10-26 | -1.33177 | 1.602367 | -0.561792 | 1.825721 |
| 2017-10-27 | 0.251834 | -0.765978 | 0.432401 | 0.671658 |
| 2017-10-28 | -1.150419 | -0.909648 | 0.012585 | -1.406542 |
| 2017-10-29 | 1.607818 | 1.517355 | -0.756374 | 0.872013 |
| 2017-10-30 | 1.34057 | 0.133495 | 1.136394 | 1.379843 |


In [13]:
df.tail()  # returns the last five rows by default

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2017-10-27 | 0.251834 | -0.765978 | 0.432401 | 0.671658 |
| 2017-10-28 | -1.150419 | -0.909648 | 0.012585 | -1.406542 |
| 2017-10-29 | 1.607818 | 1.517355 | -0.756374 | 0.872013 |
| 2017-10-30 | 1.34057 | 0.133495 | 1.136394 | 1.379843 |
| 2017-10-31 | -0.205709 | 0.84486 | -0.768821 | 0.263155 |


In [14]:
df.index

DatetimeIndex(['2017-10-26', '2017-10-27', '2017-10-28', '2017-10-29',
               '2017-10-30', '2017-10-31'],
              dtype='datetime64[ns]', freq='D')

In [15]:
df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

In [16]:
df.values

array([[-1.33176957,  1.60236681, -0.56179169,  1.82572055],
       [ 0.25183395, -0.76597795,  0.43240144,  0.67165814],
       [-1.15041898, -0.90964815,  0.0125847 , -1.4065419 ],
       [ 1.60781774,  1.5173552 , -0.7563744 ,  0.87201334],
       [ 1.34057038,  0.13349523,  1.13639407,  1.37984348],
       [-0.20570949,  0.8448597 , -0.76882115,  0.2631547 ]])

In [17]:
df.describe()  # summary statistics for each column

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| count | 6.0 | 6.0 | 6.0 | 6.0 |
| mean | 0.085387 | 0.403742 | -0.084268 | 0.600975 |
| std | 1.22811 | 1.098881 | 0.763507 | 1.125108 |
| min | -1.33177 | -0.909648 | -0.768821 | -1.406542 |
| 25% | -0.914242 | -0.54111 | -0.707729 | 0.365281 |
| 50% | 0.023062 | 0.489177 | -0.274603 | 0.771836 |
| 75% | 1.068386 | 1.349231 | 0.327447 | 1.252886 |
| max | 1.607818 | 1.602367 | 1.136394 | 1.825721 |


In [18]:
df.T  # transpose

| | 2017-10-26 00:00:00 | 2017-10-27 00:00:00 | 2017-10-28 00:00:00 | 2017-10-29 00:00:00 | 2017-10-30 00:00:00 | 2017-10-31 00:00:00 |
| --- | --- | --- | --- | --- | --- | --- |
| A | -1.33177 | 0.251834 | -1.150419 | 1.607818 | 1.34057 | -0.205709 |
| B | 1.602367 | -0.765978 | -0.909648 | 1.517355 | 0.133495 | 0.84486 |
| C | -0.561792 | 0.432401 | 0.012585 | -0.756374 | 1.136394 | -0.768821 |
| D | 1.825721 | 0.671658 | -1.406542 | 0.872013 | 1.379843 | 0.263155 |


In [19]:
df.sort_index(axis=1, ascending=False)  # sort by column labels (axis=1) or by row labels (axis=0)

| | D | C | B | A |
| --- | --- | --- | --- | --- |
| 2017-10-26 | 1.825721 | -0.561792 | 1.602367 | -1.33177 |
| 2017-10-27 | 0.671658 | 0.432401 | -0.765978 | 0.251834 |
| 2017-10-28 | -1.406542 | 0.012585 | -0.909648 | -1.150419 |
| 2017-10-29 | 0.872013 | -0.756374 | 1.517355 | 1.607818 |
| 2017-10-30 | 1.379843 | 1.136394 | 0.133495 | 1.34057 |
| 2017-10-31 | 0.263155 | -0.768821 | 0.84486 | -0.205709 |


In [20]:
df.sort_values(by='B')  # sort rows by the values in column B

| | A | B | C | D |
| --- | --- | --- | --- | --- |
| 2017-10-28 | -1.150419 | -0.909648 | 0.012585 | -1.406542 |
| 2017-10-27 | 0.251834 | -0.765978 | 0.432401 | 0.671658 |
| 2017-10-30 | 1.34057 | 0.133495 | 1.136394 | 1.379843 |
| 2017-10-31 | -0.205709 | 0.84486 | -0.768821 | 0.263155 |
| 2017-10-29 | 1.607818 | 1.517355 | -0.756374 | 0.872013 |
| 2017-10-26 | -1.33177 | 1.602367 | -0.561792 | 1.825721 |
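
Because `df` holds random numbers, the effect of the two sort calls is easier to check on fixed data. The following sketch uses a small made-up frame; it also shows that both methods return a new object and leave the original untouched.

```python
import pandas as pd

frame = pd.DataFrame(
    {"B": [2, 3, 1], "A": [9, 7, 8]},
    index=["y", "z", "x"],
)

by_row_label = frame.sort_index()                           # axis=0: rows ordered x, y, z
by_col_label = frame.sort_index(axis=1)                     # axis=1: columns ordered A, B
by_value_desc = frame.sort_values(by="B", ascending=False)  # rows ordered by column B, largest first

original_order = list(frame.index)  # still ["y", "z", "x"]: sorting did not modify frame
```

To keep a sorted result, assign it to a name as done here.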


## Selecting data