# Pandas

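[What pandas offers](http://pandas.pydata.org)  
Pandas provides high-level data structures and data manipulation tools, and it is one of the key reasons Python is a powerful and efficient environment for data processing.

NumPy handles numerical data. Pandas builds on NumPy and also handles other kinds of data, with a large collection of functions and methods for processing data quickly and conveniently.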



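### Pandas data structures
 - Series: a one-dimensional labeled array. Unlike a Python list, whose elements can have different types, a Series has a single dtype (mixed values are stored under the generic `object` dtype).
 - DataFrame: two-dimensional, a container of Series
 - Panel: three-dimensional (an extension; deprecated in pandas 0.20 and removed in pandas 0.25)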
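
As a minimal sketch of the first two structures (the variable names `heights`, `mixed`, and `df` are illustrative and not part of the original notebook): a Series built from mixed Python values falls back to the generic `object` dtype, and each column of a DataFrame is itself a Series.

```python
import pandas as pd

# A homogeneous Series gets a specific dtype.
heights = pd.Series([1.70, 1.60, 1.50])
print(heights.dtype)

# Mixed values are still stored under one dtype: the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)

# Selecting one column of a DataFrame returns a Series.
df = pd.DataFrame({"height": heights, "label": ["x", "y", "z"]})
print(type(df["height"]))
```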

In [26]:
import numpy as np
import pandas as pd

####  Series
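A Series is a one-dimensional array-like object that can hold data of any type (integers, strings, floats, Python objects, and so on).   
It consists of a sequence of values and an associated index; when printed, the index appears on the left and the values on the right.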

In [35]:
name_data = {"lilei": 1.70, "hanmeimei": 1.60, "xiaoming": 1.50}
ser_obj_obj = pd.Series(name_data)
print(type(ser_obj_obj))
print(ser_obj_obj.head())
print(ser_obj_obj.index)

<class 'pandas.core.series.Series'>
lilei        1.7
hanmeimei    1.6
xiaoming     1.5
dtype: float64
Index(['lilei', 'hanmeimei', 'xiaoming'], dtype='object')
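
Because every value carries a label, arithmetic between two Series is aligned on the index rather than on position. The sketch below assumes a second, hypothetical Series named `growth` to show this; labels missing from either side produce `NaN`.

```python
import pandas as pd

heights = pd.Series({"lilei": 1.70, "hanmeimei": 1.60, "xiaoming": 1.50})
print(heights.index.tolist())  # the labels
print(heights.values)          # the underlying NumPy array

# growth lists the labels in a different order and omits "hanmeimei".
growth = pd.Series({"xiaoming": 0.05, "lilei": 0.01})

# Values are matched by label; "hanmeimei" has no partner and becomes NaN.
result = heights + growth
print(result)
```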


#### DataFrame
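A DataFrame is a two-dimensional labeled data structure, similar to a table or spreadsheet. It holds an ordered collection of columns, and each column can have a different type.     
A DataFrame has both a row index and a column index. It can be viewed as a dict of Series that share one index, with the data stored in a two-dimensional layout.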
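
The "dict of Series sharing one index" view can be sketched directly. The names `height`, `city`, and `people` are made up for illustration; the two Series list their labels in different orders, and the constructor aligns them by label.

```python
import pandas as pd

height = pd.Series({"lilei": 1.70, "hanmeimei": 1.60})
city = pd.Series({"hanmeimei": "Beijing", "lilei": "Shanghai"})

# Both Series are aligned on their labels to form one shared row index.
people = pd.DataFrame({"height": height, "city": city})
print(people)
print(people.columns.tolist())

# Each cell is addressed by a row label and a column label.
print(people.loc["lilei", "city"])
```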


In [36]:
import numpy as np
dict_data = {'A': 1,
             'B': pd.Timestamp('2019'),
             'C': pd.Series(1, index=list(range(4)), dtype='float32'),
             'D': ["Python", "Java", "C++", "C"],
             }
df_obj2 = pd.DataFrame(dict_data)
print(df_obj2)

# Select a column by its label
print(df_obj2['D'])
print(type(df_obj2['D']))
print('\n' * 1)

# Add a column
df_obj2['I'] = df_obj2['C'] + 5
print(df_obj2.head())
# Delete a column
del df_obj2['I']
print(df_obj2.head())

   A          B    C       D
0  1 2019-01-01  1.0  Python
1  1 2019-01-01  1.0    Java
2  1 2019-01-01  1.0     C++
3  1 2019-01-01  1.0       C
0    Python
1      Java
2       C++
3         C
Name: D, dtype: object
<class 'pandas.core.series.Series'>


   A          B    C       D    I
0  1 2019-01-01  1.0  Python  6.0
1  1 2019-01-01  1.0    Java  6.0
2  1 2019-01-01  1.0     C++  6.0
3  1 2019-01-01  1.0       C  6.0
   A          B    C       D
0  1 2019-01-01  1.0  Python
1  1 2019-01-01  1.0    Java
2  1 2019-01-01  1.0     C++
3  1 2019-01-01  1.0       C


### Pandas indexing

In [37]:
# Index objects are immutable
df_obj2.index[0] = 2

TypeError: Index does not support mutable operations
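
Although a single entry of an Index cannot be assigned, the labels can still be changed by building a new Index. The following is a small sketch with an illustrative frame `df`, using `rename()` and assignment to the `.index` attribute.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1]})

# Item assignment on an Index raises TypeError because Index objects are immutable.
raised = False
try:
    df.index[0] = 2
except TypeError as exc:
    raised = True
    print("TypeError:", exc)

# rename() returns a new object in which the selected labels are changed.
renamed = df.rename(index={0: 10})
print(renamed.index.tolist())

# Assigning to .index swaps in a completely new Index.
df.index = ["x", "y", "z"]
print(df.index.tolist())
```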

In [50]:
# Series indexing
ser_obj = pd.Series(name_data)  # rebuild the Series from the name_data dict defined above
print(ser_obj.head())
# Select by label, then by position
print(ser_obj["lilei"])
print(ser_obj.iloc[0])
print('\n' * 1)
# Positional slice
print(ser_obj.iloc[0:2])
print('\n' * 1)
# Non-contiguous positions (repeats are allowed)
print(ser_obj.iloc[[0, 0, 2]])

lilei        1.7
hanmeimei    1.6
xiaoming     1.5
dtype: float64
1.7
1.7


lilei        1.7
hanmeimei    1.6
dtype: float64


lilei       1.7
lilei       1.7
xiaoming    1.5
dtype: float64
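
Label-based and position-based selection can be kept explicit with `.loc` and `.iloc`. A minimal sketch, with `heights` as an illustrative name; note that a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

heights = pd.Series({"lilei": 1.70, "hanmeimei": 1.60, "xiaoming": 1.50})

by_label = heights.loc["lilei"]   # select by index label
by_position = heights.iloc[0]     # select by integer position

# Label slices include the end label; positional slices exclude the end position.
label_slice = heights.loc["lilei":"hanmeimei"]
position_slice = heights.iloc[0:1]

# A boolean mask keeps the entries where the condition is True.
tall = heights[heights > 1.55]

print(by_label, by_position)
print(label_slice)
print(position_slice)
print(tall)
```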


In [52]:
# DataFrame indexing
import numpy as np
print(df_obj2.head())
# Column selection
print(df_obj2['A'])  # a single label returns a Series
print(type(df_obj2['A']))
print(df_obj2[['A']])  # a list of labels returns a DataFrame
print(type(df_obj2[['A']]))
# Select several non-adjacent columns
print(df_obj2[['A', 'D']])

   A          B    C       D
0  1 2019-01-01  1.0  Python
1  1 2019-01-01  1.0    Java
2  1 2019-01-01  1.0     C++
3  1 2019-01-01  1.0       C
0    1
1    1
2    1
3    1
Name: A, dtype: int64
<class 'pandas.core.series.Series'>
   A
0  1
1  1
2  1
3  1
<class 'pandas.core.frame.DataFrame'>
   A       D
0  1  Python
1  1    Java
2  1     C++
3  1       C
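
Square brackets on a DataFrame select columns. Rows are usually selected with `.loc` (labels), `.iloc` (positions), or a boolean mask. The sketch below uses a small stand-in frame `df`, not the notebook's `df_obj2`.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 1], "D": ["Python", "Java", "C++", "C"]})

row = df.loc[1]              # one row by label, returned as a Series
cell = df.loc[2, "D"]        # row label and column label
block = df.iloc[0:2, 0:1]    # first two rows, first column, by position

# Boolean mask: keep rows whose language name is longer than three characters.
filtered = df[df["D"].str.len() > 3]

print(row)
print(cell)
print(block)
print(filtered)
```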


### Computing with pandas

In [102]:
import numpy as np
import pandas as pd
df_obj = pd.DataFrame(np.random.randn(5,4), columns = ['a', 'b', 'c', 'd'])
print(df_obj)

          a         b         c         d
0 -0.128458 -0.239049  1.268771 -0.483132
1  0.610187  0.011347 -0.220402 -0.725317
2  0.383537 -1.509982 -0.945710  1.057704
3 -0.240553 -0.774224  0.778547 -1.486289
4 -0.372635  0.550255 -0.226011 -1.246501


In [106]:
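# Common computations: axis=0 (the default) aggregates down each column, axis=1 across each row
print(df_obj.sum())
print(df_obj.count())
print('\n' * 1)
# describe() produces several summary statistics at once
print(df_obj.describe())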

a    0.252078
b   -1.961653
c    0.655194
d   -2.883534
dtype: float64
a    5
b    5
c    5
d    5
dtype: int64


              a         b         c         d
count  5.000000  5.000000  5.000000  5.000000
mean   0.050416 -0.392331  0.131039 -0.576707
std    0.424248  0.785797  0.883754  0.997259
min   -0.372635 -1.509982 -0.945710 -1.486289
25%   -0.240553 -0.774224 -0.226011 -1.246501
50%   -0.128458 -0.239049 -0.220402 -0.725317
75%    0.383537  0.011347  0.778547 -0.483132
max    0.610187  0.550255  1.268771  1.057704
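
The effect of the `axis` argument is easier to see with small fixed numbers than with random data. A minimal sketch on an illustrative frame `df`:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# axis=0 (default): one result per column.
col_totals = df.sum()
# axis=1: one result per row.
row_totals = df.sum(axis=1)
row_means = df.mean(axis=1)

print(col_totals)
print(row_totals)
print(row_means)
```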


In [141]:
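# Handling missing data
df_data = pd.DataFrame([np.random.randn(3), [1., 2., np.nan],
                       [np.nan, 4., np.nan], [1., 2., 3.]])
print(df_data.head())
# isnull() marks missing values with True
print(df_data.isnull())
# dropna() drops the rows (axis=0, the default) or columns (axis=1) that contain NaN
print(df_data.dropna())
print(df_data.dropna(axis=1))  # drop columns that contain NaN
# fillna() replaces missing values
print(df_data.fillna(-99.))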

          0         1        2
0  1.650403 -0.204041  0.99448
1  1.000000  2.000000      NaN
2       NaN  4.000000      NaN
3  1.000000  2.000000  3.00000
       0      1      2
0  False  False  False
1  False  False   True
2   True  False   True
3  False  False  False
          0         1        2
0  1.650403 -0.204041  0.99448
3  1.000000  2.000000  3.00000
          1
0 -0.204041
1  2.000000
2  4.000000
3  2.000000
           0         1         2
0   1.650403 -0.204041   0.99448
1   1.000000  2.000000 -99.00000
2 -99.000000  4.000000 -99.00000
3   1.000000  2.000000   3.00000
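
`dropna` and `fillna` accept options that make them less drastic than dropping every incomplete row or filling with one constant. The sketch below, on an illustrative frame `df`, shows `how="all"`, `thresh`, and filling each column with its own mean.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, 2.0, np.nan],
                   [np.nan, 4.0, np.nan],
                   [1.0, 2.0, 3.0]])

# how="all" drops a row only when every value in it is NaN.
kept_all = df.dropna(how="all")
# thresh=2 keeps rows that have at least two non-NaN values.
kept_thresh = df.dropna(thresh=2)
# Passing a Series to fillna fills each column with its matching value,
# here the column mean (NaN is skipped when computing the mean).
filled = df.fillna(df.mean())

print(kept_all)
print(kept_thresh)
print(filled)
```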


### Grouping and aggregation with pandas

In [131]:
import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

  key1   key2     data1     data2
0    a    one  0.800854  1.254107
1    b    one -0.359615  1.339783
2    a    two  0.663312 -0.952899
3    b  three  2.199351 -1.179059
4    a    two -0.182035 -0.007730
5    b    two -0.003307 -0.397958
6    a    one  0.508948 -1.067804
7    a  three  0.067674  1.520381


In [132]:
# Grouping
# Group the whole DataFrame by key1
grouped1 = df_obj.groupby('key1')
for group_name, group_data in grouped1:
    print(group_name)
    print(group_data)
print('\n' * 1)
# Group a single column (data1) by key1
grouped2 = df_obj['data1'].groupby(df_obj['key1'])
for group_name, group_data in grouped2:
    print(group_name)
    print(group_data)

a
  key1   key2     data1     data2
0    a    one  0.800854  1.254107
2    a    two  0.663312 -0.952899
4    a    two -0.182035 -0.007730
6    a    one  0.508948 -1.067804
7    a  three  0.067674  1.520381
b
  key1   key2     data1     data2
1    b    one -0.359615  1.339783
3    b  three  2.199351 -1.179059
5    b    two -0.003307 -0.397958


a
0    0.800854
2    0.663312
4   -0.182035
6    0.508948
7    0.067674
Name: data1, dtype: float64
b
1   -0.359615
3    2.199351
5   -0.003307
Name: data1, dtype: float64
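
Grouping by more than one key is done by passing a list of column names. The result carries a two-level index, which `unstack()` can pivot into a table. A minimal sketch with fixed, made-up numbers in an illustrative frame `df`:

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "b", "a", "b"],
    "key2": ["one", "one", "two", "one"],
    "data1": [1.0, 2.0, 3.0, 4.0],
})

# One group per distinct (key1, key2) pair.
sums = df.groupby(["key1", "key2"])["data1"].sum()
print(sums)

# Move the inner level (key2) into columns; missing pairs become NaN.
wide = sums.unstack()
print(wide)
```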


In [135]:
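# Aggregation computes a summary value for each group
# Select the numeric columns first; key2 holds strings and cannot be averaged
print(df_obj.groupby('key1')[['data1', 'data2']].sum())
print(df_obj.groupby('key1')[['data1', 'data2']].mean())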

         data1     data2
key1                    
a     1.858752  0.746056
b     1.836428 -0.237235
         data1     data2
key1                    
a     0.371750  0.149211
b     0.612143 -0.079078


In [134]:
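# Custom aggregation function: grouped.agg(func)
def peak_range(col):
    """
        Return the range (max - min) of the values.
    """
    # print(type(col))  # the function is called on the data of each group
    return col.max() - col.min()

grouped_num = df_obj.groupby('key1')[['data1', 'data2']]  # numeric columns only
print(grouped_num.agg(peak_range))
print(grouped_num.agg(lambda col: col.max() - col.min()))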

         data1     data2
key1                    
a     0.982890  2.588186
b     2.558966  2.518842
         data1     data2
key1                    
a     0.982890  2.588186
b     2.558966  2.518842
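
`agg` also accepts a list of function names, or keyword arguments that name each output column (named aggregation, available in pandas 0.25 and later). The following sketch uses made-up numbers; the output column names `data1_range` and `data2_sum` are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "b", "a", "b"],
    "data1": [1.0, 5.0, 4.0, 7.0],
    "data2": [10, 20, 30, 40],
})
grouped = df.groupby("key1")

# Several built-in aggregations on one column at once.
stats = grouped["data1"].agg(["min", "max", "mean"])
print(stats)

# Named aggregation: output_name=(input_column, function).
named = grouped.agg(
    data1_range=("data1", lambda s: s.max() - s.min()),
    data2_sum=("data2", "sum"),
)
print(named)
```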


### Restructuring data with pandas

In [136]:
import pandas as pd
import numpy as np

df_obj1 = pd.DataFrame({'key': ['b', 'a', 'a', 'c', 'a', 'a', 'a'],
                        'data1' : np.random.randint(0,10,7)})
df_obj2 = pd.DataFrame({'key': ['a', 'b', 'c'],
                        'data2' : np.random.randint(0,10,3)})

print(df_obj1)
print(df_obj2)
# `on` explicitly names the join key (the "foreign key")
print(pd.merge(df_obj1, df_obj2, on='key'))

  key  data1
0   b      9
1   a      9
2   a      4
3   c      3
4   a      7
5   a      6
6   a      4
  key  data2
0   a      2
1   b      4
2   c      0
  key  data1  data2
0   b      9      4
1   a      9      2
2   a      4      2
3   a      7      2
4   a      6      2
5   a      4      2
6   c      3      0
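
By default `pd.merge` performs an inner join, so keys present in only one frame are dropped. The `how` parameter changes that. A minimal sketch with two illustrative frames, `left` and `right`, whose keys only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "d"], "data1": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "c"], "data2": [10, 20, 30]})

# Inner join (the default): only keys found in both frames.
inner = pd.merge(left, right, on="key")
# Left join: every row of `left`; unmatched data2 values become NaN.
left_join = pd.merge(left, right, on="key", how="left")
# Outer join: the union of keys from both frames.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(left_join)
print(outer)
```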


In [137]:
# duplicated() returns a boolean Series flagging each row that repeats an earlier row
print(df_obj1.duplicated())
# drop_duplicates() removes the repeated rows
print(df_obj1.drop_duplicates())

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool
  key  data1
0   b      9
1   a      9
2   a      4
3   c      3
4   a      7
5   a      6
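
Both methods compare whole rows by default. The `subset` parameter restricts the comparison to chosen columns, and `keep` controls which occurrence survives. A small sketch on an illustrative frame `df`:

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "a", "c", "a"],
                   "data1": [9, 9, 4, 3, 4]})

# Whole-row comparison: only the last row repeats an earlier one.
dup_rows = df.duplicated()
# Compare on "key" only and keep the first row seen for each key.
first_per_key = df.drop_duplicates(subset="key")
# Same comparison, but keep the last row seen for each key.
last_per_key = df.drop_duplicates(subset="key", keep="last")

print(dup_rows)
print(first_per_key)
print(last_per_key)
```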


In [138]:
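# Replace a single value with a single value
print(df_obj1.replace(3, -100))

# Replace several values with a single value
print(df_obj1.replace([3, 5], -100))

# Replace several values with several values, matched pairwise
print(df_obj1.replace([3, 5], [-100, -200]))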

  key  data1
0   b      9
1   a      9
2   a      4
3   c   -100
4   a      7
5   a      6
6   a      4
  key  data1
0   b      9
1   a      9
2   a      4
3   c   -100
4   a      7
5   a      6
6   a      4
  key  data1
0   b      9
1   a      9
2   a      4
3   c   -100
4   a      7
5   a      6
6   a      4
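
`replace` also accepts a dict, which keeps each old value next to its new value, and a nested dict limits the replacement to one column. A minimal sketch on an illustrative frame `df`; `replace` returns a new object and leaves the original unchanged.

```python
import pandas as pd

df = pd.DataFrame({"key": ["b", "a", "c"], "data1": [9, 3, 5]})

# {old: new} mapping applied to the whole frame.
by_mapping = df.replace({3: -100, 5: -200})
# {column: {old: new}} applies the mapping to that column only.
only_data1 = df.replace({"data1": {9: 0}})

print(by_mapping)
print(only_data1)
print(df)  # the original frame is untouched
```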
