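# Getting Started with pandas
- Both the index and the object itself have a `name` attribute
- A Series aligns automatically on index labels in arithmetic operations
- Building a DataFrame: pass a dict of equal-length lists or NumPy arrays
- When a nested dict is passed to DataFrame, the outer keys become the columns and the inner keys become the row index
- Index objects are immutable, behave like sets, and may contain duplicate labels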
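The points about alignment, `name` attributes, and Index behaviour can be sketched as follows. The Series contents and the variable names (`s1`, `s2`, `idx`) are made up for illustration and do not come from the notebook.

```python
import pandas as pd

# Two Series whose labels only partly overlap (illustrative data)
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; a label present in only one operand gives NaN
total = s1 + s2

# Both the Series and its index carry a name attribute
total.name = "total"
total.index.name = "label"

# An Index may hold duplicate labels, but it cannot be modified in place
idx = pd.Index(["a", "b", "b"])
try:
    idx[0] = "z"
    mutated = True
except TypeError:
    mutated = False

# Set-like behaviour: membership tests work directly on an Index
has_b = "b" in idx
```

Only `b` and `c` appear in both inputs, so only those two labels get a numeric result.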

In [2]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np

In [3]:
obj = pd.Series([1., 2, 3, 8])
obj

0    1.0
1    2.0
2    3.0
3    8.0
dtype: float64

In [4]:
obj_ = pd.Series([1., 2, 3, 8], index=['a', 'b', 'c', 'd'])  # a dict can also be passed instead of a list
obj_

a    1.0
b    2.0
c    3.0
d    8.0
dtype: float64
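As a minimal sketch of the dict form mentioned in the comment above: the dict keys become the index labels. The name `sdata` and the extra label `'z'` are assumptions for illustration.

```python
import pandas as pd

# Same values as obj_, supplied as a dict (keys become index labels)
sdata = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 8.0}
obj_from_dict = pd.Series(sdata)

# Passing index as well selects and reorders keys;
# a label with no matching key ('z' here) is filled with NaN
obj_subset = pd.Series(sdata, index=["d", "a", "z"])
```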

In [7]:
pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

In [9]:
frame2 = DataFrame(pop)
frame2.columns.name = 1  # name of the column axis (any hashable value works)
frame2.index.name = 2    # name of the row index
frame2

1     Nevada  Ohio
2
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5
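For comparison with the nested-dict form, here is a sketch of building a DataFrame from a dict of equal-length lists or NumPy arrays. The column names `state`, `year`, and `pop` are assumptions chosen for this example; the values reuse the population figures above.

```python
import numpy as np
import pandas as pd

# Each key becomes a column; all values must have the same length
data = {
    "state": ["Ohio", "Ohio", "Nevada"],
    "year": [2000, 2001, 2001],
    "pop": np.array([1.5, 1.7, 2.4]),
}
frame_lists = pd.DataFrame(data)

# The columns argument fixes the column order explicitly
frame_ordered = pd.DataFrame(data, columns=["year", "state", "pop"])
```

In this form the rows get a default integer index, whereas the nested-dict form takes the row index from the inner keys.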


## Basic functionality
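- `reindex` inserts NaN for labels that are missing from the original index; because NaN is a float, integer data is upcast to float64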
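- `ffill` forward-fills values, which is useful for ordered data such as time series
- `reindex` can also reindex the columns: `.reindex(columns=states)`, where `states` is a list of column labels
- Slicing with index labels includes both endpoints
- `loc` and `iloc`: indexing by axis label and by integer position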
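- Integer indexing can be problematic in pandas: on a Series with an integer index, `ser[-1]` raises a `KeyError`, because it is ambiguous whether `-1` is a label or a position. With a non-integer index there is no such ambiguity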
- Data alignment: where a label exists in only one operand, the result is NaN; the missing value is not treated as 0
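- Fill values: to treat missing entries as 0 instead of producing NaN, use `df1.add(df2, fill_value=0)`; `radd` is the same operation with the operands reversed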
- `.sum()` on a DataFrame skips NaN by default; pass `skipna=False` to turn this off
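- Arithmetic between a DataFrame and a Series is broadcast
- To match on the rows and broadcast across the columns, use the arithmetic methods (for example `frame.sub(series, axis='index')`)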
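The notes on alignment, fill values, and `skipna` can be sketched with two small frames. The frames `df1` and `df2` below are made up for illustration.

```python
import pandas as pd

# Two frames whose row labels only partly overlap (illustrative data)
df1 = pd.DataFrame({"x": [1.0, 2.0]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10.0, 20.0]}, index=["b", "c"])

# Plain addition: labels found in only one frame give NaN
plain = df1 + df2

# fill_value=0 substitutes 0 for the entry missing on one side
filled = df1.add(df2, fill_value=0)

# radd flips the operands: this computes df2 + df1
flipped = df1.radd(df2, fill_value=0)

# sum() skips NaN by default; skipna=False lets NaN propagate
col_sum = plain.sum()
strict_sum = plain.sum(skipna=False)
```

Because addition is commutative, `filled` and `flipped` hold the same values here; the reversed methods matter for operations such as `rsub` or `rdiv`.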

In [11]:
obj = pd.Series([1,2,3,4],index = ['a','b','c','d'])
obj
obj2 = obj.reindex(['c','a','b','d','e'])
obj2

c    3.0
a    1.0
b    2.0
d    4.0
e    NaN
dtype: float64
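Reindexing columns works the same way as reindexing rows. In the sketch below, the frame `frame_states` and the contents of `states` are hypothetical, since the notes do not give them.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with three state columns
frame_states = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["Ohio", "Texas", "California"],
)

# Hypothetical list of column labels; 'Utah' is not in the frame
states = ["Texas", "Utah", "California"]

# Existing columns are kept in the requested order; the new one is all NaN
by_columns = frame_states.reindex(columns=states)
```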

In [10]:
obj3 = pd.Series(['blue','yellow','green'], index = [0, 2, 4])
obj3.reindex(range(6), method = 'ffill')

0      blue
1      blue
2    yellow
3    yellow
4     green
5     green
dtype: object
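The same forward fill applies to a date index, which is the time-series case mentioned in the notes. The dates and values below are invented for illustration.

```python
import pandas as pd

# Two observations three days apart (illustrative data)
ts = pd.Series([1.0, 2.0],
               index=pd.to_datetime(["2000-01-01", "2000-01-04"]))

# Reindex to a daily range; each gap takes the last value seen before it
daily = pd.date_range("2000-01-01", "2000-01-05", freq="D")
filled_ts = ts.reindex(daily, method="ffill")
```

`method='ffill'` requires the original index to be sorted, which holds here.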

In [15]:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index = ['ohio', 'colorado','utah', 'new york'],
                    columns = ['one', 'two', 'three', 'four'])
data

          one  two  three  four
ohio        0    1      2     3
colorado    4    5      6     7
utah        8    9     10    11
new york   12   13     14    15


In [16]:
data.loc['colorado',['two','three']]

two      5
three    6
Name: colorado, dtype: int32

In [18]:
data.iloc[1, [3,0,1]]

four    7
one     4
two     5
Name: colorado, dtype: int32

In [19]:
data.iloc[:,:3][data.three >5 ]

          one  two  three
colorado    4    5      6
utah        8    9     10
new york   12   13     14
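The difference between label slices and position slices can be sketched on the same `data` frame, rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["ohio", "colorado", "utah", "new york"],
                    columns=["one", "two", "three", "four"])

# Label slices with loc include BOTH endpoints
by_label = data.loc["colorado":"utah", "two":"three"]

# Position slices with iloc exclude the stop position, as in plain Python
by_position = data.iloc[1:3, 1:3]

# With the "same" endpoints expressed as positions, one row is dropped
one_row = data.iloc[1:2]
```

`by_label` and `by_position` select the same 2x2 block, even though the label slice names its last row and column explicitly.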


In [22]:
ser = pd.Series(np.arange(3.))
ser[2]

2.0
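The ambiguity with integer indexes can be sketched as follows. With an integer index, plain `[]` treats the key as a label, so `-1` is not found; `loc` and `iloc` make the intent explicit.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))  # default integer index 0, 1, 2

# ser[-1] is looked up as a label, and no label -1 exists
try:
    ser[-1]
    raised = False
except KeyError:
    raised = True

# Explicit accessors avoid the ambiguity
last_by_position = ser.iloc[-1]  # last element by position
by_label = ser.loc[2]            # element whose label is 2
```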

In [7]:
frame = pd.DataFrame(np.arange(0,12.).reshape((4,3)),
                     columns = ('b','d','e'),
                     index = ('Utah','Ohio','Texas','Oregon'))
frame

          b     d     e
Utah    0.0   1.0   2.0
Ohio    3.0   4.0   5.0
Texas   6.0   7.0   8.0
Oregon  9.0  10.0  11.0


In [10]:
series = frame['d']
frame.sub(series, axis = 'index')

          b    d    e
Utah   -1.0  0.0  1.0
Ohio   -1.0  0.0  1.0
Texas  -1.0  0.0  1.0
Oregon -1.0  0.0  1.0
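For contrast, here is a sketch of the default broadcasting direction next to the `axis='index'` form used above. The names `row`, `by_row`, and `by_col` are introduced only for this example.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(0, 12.).reshape((4, 3)),
                     columns=["b", "d", "e"],
                     index=["Utah", "Ohio", "Texas", "Oregon"])

# Default: the Series index is matched against the frame's COLUMNS,
# and the operation is broadcast down the rows
row = frame.iloc[0]
by_row = frame - row

# axis='index': the Series index is matched against the frame's ROWS,
# and the operation is broadcast across the columns
by_col = frame.sub(frame["d"], axis="index")
```

`by_row` subtracts the Utah row from every row, while `by_col` subtracts column `d` from every column.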


In [12]:
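f = lambda x: x.max() - x.min()  # range (max minus min) of a row or column
frame.apply(f)          # per column; not displayed, only the last expression is shown
frame.apply(f, axis=1)  # per row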

Utah      2.0
Ohio      2.0
Texas     2.0
Oregon    2.0
dtype: float64

In [16]:
frame.describe()

              b         d         e
count       4.0       4.0       4.0
mean        4.5       5.5       6.5
std    3.872983  3.872983  3.872983
min         0.0       1.0       2.0
25%        2.25      3.25      4.25
50%         4.5       5.5       6.5
75%        6.75      7.75      8.75
max         9.0      10.0      11.0
