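### The biggest difference between pandas and NumPy is that pandas is designed for tabular or heterogeneous data, whereas NumPy is best suited to homogeneous numerical array data.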
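
To make the contrast concrete, here is a minimal sketch. The array and column names (`arr`, `df`, `count`, `ratio`, `label`) are made up for illustration and are not part of the notebook.

```python
import numpy as np
import pandas as pd

# A NumPy array holds a single dtype; mixing numbers and a string
# coerces every element to a common (string) type.
arr = np.array([1, 2.5, 'three'])
print(arr.dtype)

# A DataFrame keeps a separate dtype for each column.
df = pd.DataFrame({'count': [1, 2, 3],
                   'ratio': [0.5, 1.5, 2.5],
                   'label': ['a', 'b', 'c']})
print(df.dtypes)
```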

In [1]:
import pandas as pd
from pandas import Series,DataFrame
import numpy as np

#### A Series holds data together with its index

In [4]:
obj=pd.Series([4,7,-5,3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

In [5]:
# The values and index attributes return the Series' values and index, respectively
print(obj.values)
print(obj.index)

[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)


In [6]:
# The index labels can be set explicitly
obj2=pd.Series([4,7,-5,3],index=['d','b','c','a'])
print(obj2)

d    4
b    7
c   -5
a    3
dtype: int64


In [9]:
# Unlike a NumPy array, a Series can be indexed by label
print(obj2['a'])
obj2['d']=6
print(obj2[['c','a','d']])  # a list of labels selects several entries

3
c   -5
a    3
d    6
dtype: int64


#### Arithmetic and NumPy functions are applied element by element, and each result stays attached to its index label

In [14]:
print(np.exp(obj2))

d     403.428793
b    1096.633158
c       0.006738
a      20.085537
dtype: float64
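
The same holds for plain arithmetic and boolean filtering. The sketch below rebuilds `obj2` with the values it has at this point; `doubled` and `positive` are illustrative names.

```python
import pandas as pd

obj2 = pd.Series([6, 7, -5, 3], index=['d', 'b', 'c', 'a'])

doubled = obj2 * 2          # scalar multiplication, element by element
positive = obj2[obj2 > 0]   # a boolean mask keeps only the matching labels

print(doubled)
print(positive)
print('b' in obj2, 'e' in obj2)   # membership tests look at the index labels
```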


#### Data held in a Python dict can be turned into a Series by passing the dict

In [16]:
# The dict keys become the index; pass index= to control the order of the result (labels absent from the dict yield NaN)
sdata={'Ohio':35000,'Texas':71000,'Oregon':16000,'Utah':5000}
obj3=pd.Series(sdata)
print(obj3)
states=['California','Ohio','Oregon','Texas']
obj4=pd.Series(sdata,index=states)
print(obj4)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64


In [17]:
# isnull and notnull detect missing data
print(pd.isnull(obj4))
print(pd.notnull(obj4))


California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool
California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool
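
Missing data can also be handled through instance methods. This is a small sketch that rebuilds `obj4` by hand; `dropped` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

obj4 = pd.Series([np.nan, 35000.0, 16000.0, 71000.0],
                 index=['California', 'Ohio', 'Oregon', 'Texas'])

print(obj4.isna())        # method form of the missing-data check
dropped = obj4.dropna()   # new Series without the missing entry
filled = obj4.fillna(0)   # new Series with NaN replaced by 0

print(dropped)
print(filled)
```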


In [18]:
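# + aligns the two Series by index label: the result index is the union of both, and a label missing from either side gives NaN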
print(obj4+obj3)

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64
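
If NaN for one-sided labels is not wanted, the `add` method accepts a `fill_value`. The sketch below reuses the notebook's data; `total` is an illustrative name. A label stays NaN when it is missing on both sides.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

# A label present in only one Series is treated as 0 in the other.
# 'California' is NaN in obj4 and absent from obj3, so it stays NaN.
total = obj3.add(obj4, fill_value=0)
print(total)
```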


#### Both the Series object and its index have a name attribute

In [19]:
obj4.name='population'
obj4.index.name='state'
print(obj4)

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64


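#### A DataFrame represents a rectangular table of data. It contains an ordered collection of columns, and each column can have a different value type.
#### A DataFrame has both a row index and a column index; it can be thought of as a dict of Series that all share the same index.
#### Internally, the data is stored as one or more two-dimensional blocks rather than as a list, dict, or other collection of one-dimensional arrays.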
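
The "dict of Series sharing one index" view can be sketched directly. The numbers and the `area` column below are made-up illustrative values, not data from the notebook.

```python
import pandas as pd

# Two Series with the same labels in a different order (illustrative values).
pop = pd.Series({'Ohio': 1.5, 'Nevada': 2.4})
area = pd.Series({'Nevada': 110.6, 'Ohio': 44.8})

# The constructor aligns both Series on a single shared row index.
frame = pd.DataFrame({'pop': pop, 'area': area})
print(frame)

# Each column comes back as a Series carrying that shared index.
print(type(frame['pop']))
print(frame['pop'].index.equals(frame['area'].index))
```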

In [7]:
# Building a DataFrame from a dict of equal-length lists
data={'state':['Ohio','Ohio','Ohio','Nevada','Nevada','Nevada'],
     'year':[2000,2001,2002,2001,2002,2003],
     'pop':[1.5,1.7,3.6,2.4,2.9,3.2]}
frame=pd.DataFrame(data)
print(frame)

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9
5  Nevada  2003  3.2


In [4]:
# head returns the first five rows by default
frame.head()

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [9]:
# The columns argument sets the column order
print(pd.DataFrame(data,columns=['year','state','pop']))

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9
5  2003  Nevada  3.2


In [10]:
frame2=pd.DataFrame(data,columns=['year','state','pop','debt'],
                   index=['one','two','three','four','five','six'])
print(frame2.year)

one      2000
two      2001
three    2002
four     2001
five     2002
six      2003
Name: year, dtype: int64


#### Rows can be retrieved by label with loc

In [11]:
frame2.loc['three']

year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object

In [12]:
# A column can be modified by assignment
frame2['debt']=np.arange(6.)
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0
six    2003  Nevada  3.2   5.0


#### Assigning a Series to a DataFrame column aligns it on the index, which is often more convenient

In [13]:
val=pd.Series([-1.2,-1.5,-1.7],index=['two','four','five'])
frame2['debt']=val   # values are matched by index label; rows absent from val become NaN
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


In [15]:
# Assigning to a column that does not exist creates it; del removes a column
frame2['eastern']=frame2.state=='Ohio' # attribute syntax (frame2.eastern=...) cannot create a new column
print(frame2)

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
six    2003  Nevada  3.2   NaN    False
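
The difference between the two syntaxes can be checked on a small, made-up frame. This is a sketch only; the `warnings` filter is there because pandas warns about the attribute pattern.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})

with warnings.catch_warnings():
    warnings.simplefilter('ignore')          # silence the pandas UserWarning
    frame.eastern = frame.state == 'Ohio'    # only sets a Python attribute

created_by_attribute = 'eastern' in frame.columns
print(created_by_attribute)                  # no column was added

frame['eastern'] = frame.state == 'Ohio'     # bracket syntax creates the column
print('eastern' in frame.columns)
```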


In [16]:
del frame2['eastern']
print(frame2)

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7
six    2003  Nevada  3.2   NaN


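#### A column selected from a DataFrame has traditionally been a view on the underlying data rather than a copy, so in-place changes to the Series could show up in the DataFrame (with Copy-on-Write enabled, such changes no longer propagate). When an independent copy is needed, call the Series' copy method explicitly.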
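
A minimal sketch of the explicit copy, using a made-up two-row frame. Changes to the copy leave the DataFrame untouched whichever way the pandas version in use handles views.

```python
import pandas as pd

frame = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'debt': [1.0, 2.0]})

debt_copy = frame['debt'].copy()   # explicit, independent copy
debt_copy.iloc[0] = -99.0          # modify only the copy

print(debt_copy.tolist())
print(frame['debt'].tolist())      # the original column is unchanged
```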

Another common form of input is a nested dict of dicts.

In [18]:
pop={'Nevada':{2001:2.4,2002:2.9},
    'Ohio':{2000:1.5,2001:1.7,2002:3.6}}
# When passed to DataFrame, the outer dict keys become the columns and the inner keys become the row index.
frame3=pd.DataFrame(pop)
print(frame3)

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [19]:
# Transpose (swap rows and columns)
print(frame3.T)

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [20]:
# As with Series, the values attribute returns the data as a two-dimensional ndarray
print(frame3.values)

[[2.4 1.7]
 [2.9 3.6]
 [nan 1.5]]
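
When the columns have different dtypes, the resulting array uses a dtype that can hold all of them. The sketch below uses `to_numpy()`, the method counterpart of `values`, on two made-up frames.

```python
import pandas as pd

mixed = pd.DataFrame({'state': ['Ohio', 'Nevada'], 'pop': [1.5, 2.4]})
numeric = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5]})

print(mixed.to_numpy().dtype)     # strings and floats fall back to object
print(numeric.to_numpy().dtype)   # ints and floats are upcast to float64
```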


#### Reindexing

In [3]:
obj=pd.Series([4.4,7.2,-5.3,3.6],index=['d','b','a','c'])
print(obj)
# reindex rearranges the data to match the new index; labels not present get missing values (NaN)
obj2=obj.reindex(['a','b','c','d','e'])
print(obj2)

d    4.4
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
c    3.6
d    4.4
e    NaN
dtype: float64


In [5]:
# method='ffill' forward-fills values while reindexing
obj3=pd.Series(['blue','purple','yellow'],index=[0,2,4])
print(obj3)
obj3.reindex(range(6),method='ffill')

0      blue
2    purple
4    yellow
dtype: object


0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object
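
Forward fill is one of several options. As a sketch on the same `obj3`, the gaps can also be left as NaN, back-filled, or filled with a constant. The names `plain`, `backfilled`, `constant` and the placeholder `'none'` are illustrative.

```python
import pandas as pd

obj3 = pd.Series(['blue', 'purple', 'yellow'], index=[0, 2, 4])

plain = obj3.reindex(range(6))                         # gaps become NaN
backfilled = obj3.reindex(range(6), method='bfill')    # take the next valid value
constant = obj3.reindex(range(6), fill_value='none')   # use a fixed placeholder

print(plain)
print(backfilled)
print(constant)
```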

#### With a DataFrame, reindex can change the row index, the columns, or both

In [14]:
frame=pd.DataFrame(np.arange(9).reshape((3,3)),
                  index=['a','d','c'],
                  columns=['Ohio','Texas','California'])
print(frame)

   Ohio  Texas  California
a     0      1           2
d     3      4           5
c     6      7           8


In [12]:
frame2=frame.reindex(['a','b','c','d'])
print(frame2)
states=['Texas','Utah','California']
frame=frame.reindex(columns=states)
print(frame)

   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   6.0    7.0         8.0
d   3.0    4.0         5.0
   Texas  Utah  California
a      1   NaN           2
d      4   NaN           5
c      7   NaN           8


In [20]:
states=['Texas','California']
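# loc offers a more concise way to select by label
frame=frame.loc[['a','c','d'],states] # loc only selects labels that already exist; a missing label raises KeyError instead of adding NaN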
print(frame)

   Texas  California
a      1           2
c      7           8
d      4           5
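
The contrast between `loc` and `reindex` for a missing label can be sketched as follows, using the original 3x3 frame. `raised` and `padded` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=['a', 'd', 'c'],
                     columns=['Ohio', 'Texas', 'California'])

try:
    frame.loc[['a', 'b'], ['Texas']]   # 'b' is not in the row index
    raised = False
except KeyError:
    raised = True
print(raised)

# reindex accepts the missing label and fills its row with NaN.
padded = frame.reindex(index=['a', 'b'], columns=['Texas'])
print(padded)
```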


### 5.2.2 Dropping entries from an axis

#### axis=1 refers to columns, axis=0 to rows

In [22]:
data=pd.DataFrame(np.arange(16).reshape((4,4)),
                  index=['Ohio','Colorado','Utah','New Year'],
                  columns=['one','two','three','four'])
print(data)
data1=data.drop(['Colorado','Ohio'])  # axis defaults to 0, so rows are dropped
data2=data.drop(['two','one'],axis=1)
data3=data.drop(['two','four'],axis='columns')
print(data1,data2,data3,sep='\n')

          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New Year   12   13     14    15
          one  two  three  four
Utah        8    9     10    11
New Year   12   13     14    15
          three  four
Ohio          2     3
Colorado      6     7
Utah         10    11
New Year     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New Year   12     14


In [24]:
# drop returns a new object and leaves the original unchanged (unless inplace=True is passed)
data.drop(['two','four'],axis='columns')

          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New Year   12     14
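
A short sketch to confirm this behaviour on the same `data` frame. The names `trimmed`, `columns_after_drop`, and `result` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New Year'],
                    columns=['one', 'two', 'three', 'four'])

trimmed = data.drop(['two', 'four'], axis='columns')   # a new DataFrame
columns_after_drop = list(data.columns)                # original is untouched
print(columns_after_drop)

# With inplace=True the original is modified and nothing is returned.
result = data.drop(['two', 'four'], axis='columns', inplace=True)
print(result, list(data.columns))
```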


### 5.2.3 Indexing, selection, and filtering

#### loc selects by axis label, iloc by integer position

In [26]:
data.loc[:'Utah','two']

Ohio        1
Colorado    5
Utah        9
Name: two, dtype: int32

In [27]:
data.iloc[:,:3][data.three>5]

          one  two  three
Colorado    4    5      6
Utah        8    9     10
New Year   12   13     14
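
One detail worth seeing side by side: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position. A minimal sketch on the same `data` frame (`by_label` and `by_position` are illustrative names):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New Year'],
                    columns=['one', 'two', 'three', 'four'])

by_label = data.loc['Ohio':'Utah', 'one':'two']   # both end labels included
by_position = data.iloc[0:2, 0:2]                 # end positions excluded

print(by_label)
print(by_position)
```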


### 5.2.4 Integer indexes

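#### Indexing with integers can be ambiguous when the index labels are themselves integers, so for precision use loc (labels) or iloc (positions) to state explicitly which kind of indexing is meant.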

In [30]:
ser=pd.Series(np.arange(3.))
print(ser)
print(ser[:1])
print(ser.loc[:1])
print(ser.iloc[:1])

0    0.0
1    1.0
2    2.0
dtype: float64
0    0.0
dtype: float64
0    0.0
1    1.0
dtype: float64
0    0.0
dtype: float64
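
The ambiguity is easier to see when the integer labels are not in positional order. The Series below is a made-up example, not taken from the notebook.

```python
import pandas as pd

# Integer labels deliberately out of positional order.
ser = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])

by_label = ser.loc[0]       # the entry whose label is 0
by_position = ser.iloc[0]   # the first entry, whatever its label

print(by_label, by_position)
```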
