# 7.1 Handling Missing Data

In [1]:
import pandas as pd
import numpy as np
from numpy import nan as NA
string_data=pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

## 1. Filtering Out Missing Data
For a Series, dropna returns a new Series containing only the non-null values and their index labels.


In [2]:
string_data.dropna()

0     aardvark
1    artichoke
3      avocado
dtype: object
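
On a Series, dropna selects the same elements as boolean indexing with notnull. The following is a minimal sketch of that equivalence; the names `via_dropna` and `via_mask` are illustrative and do not come from the original notes.

```python
import numpy as np
import pandas as pd

string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])

# dropna() and boolean indexing with notnull() select the same elements
via_dropna = string_data.dropna()
via_mask = string_data[string_data.notnull()]

print(via_dropna.equals(via_mask))  # True
print(list(via_dropna.index))       # [0, 1, 3]
```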

For a DataFrame, dropna by default drops any row that contains a missing value.

In [3]:
data=pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],[NA, NA, NA], [NA, 6.5, 3.]])
data.dropna()

,0,1,2
0,1.0,6.5,3.0


Passing how='all' drops only the rows in which every value is NA.


In [4]:
data.dropna(how='all')

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
3,,6.5,3.0


Passing axis=1 drops columns instead of rows.

In [5]:
data[4]=NA
data

,0,1,2,4
0,1.0,6.5,3.0,
1,1.0,,,
2,,,,
3,,6.5,3.0,


In [6]:
data.dropna(axis=1,how='all')

,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


The thresh=n argument keeps only the rows that have at least n non-NA values; rows with fewer are dropped.

In [7]:
data.dropna(axis=0,thresh=2)

,0,1,2,4
0,1.0,6.5,3.0,
3,,6.5,3.0,
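
To see why only rows 0 and 3 survive, it helps to count the non-NA values per row. The sketch below rebuilds the same frame; `non_na_per_row` and `kept` are illustrative names, not part of the original notes.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA], [NA, NA, NA], [NA, 6.5, 3.]])
data[4] = NA

# Number of non-NA values in each row
non_na_per_row = data.notna().sum(axis=1)
print(non_na_per_row.tolist())  # [3, 1, 0, 2]

# thresh=2 keeps exactly the rows whose non-NA count is at least 2
kept = data.dropna(thresh=2)
print(kept.index.tolist())  # [0, 3]
```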


## 2. Filling In Missing Data

In [8]:
data.fillna(0)

,0,1,2,4
0,1.0,6.5,3.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,6.5,3.0,0.0


Passing a dict lets you fill each column with a different value.

In [9]:
data.fillna({2:1,4:2})

,0,1,2,4
0,1.0,6.5,3.0,2.0
1,1.0,,1.0,2.0
2,,,1.0,2.0
3,,6.5,3.0,2.0


fillna returns a new object by default, but passing inplace=True modifies the original object instead.

In [10]:
data.fillna(0,inplace=True)
data

,0,1,2,4
0,1.0,6.5,3.0,0.0
1,1.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0
3,0.0,6.5,3.0,0.0


The method argument selects forward or backward filling.

In [11]:
data=pd.DataFrame(np.random.randn(4,3))
data.iloc[0,1:]=NA
data.iloc[2:,:2]=NA
data

,0,1,2
0,-1.405865,,
1,-1.086872,-0.087551,0.666916
2,,,-0.037634
3,,,-0.362835


ffill fills each NA with the preceding value; limit caps how many consecutive NA values are filled.

In [12]:
data.fillna(method='ffill',limit=1)

,0,1,2
0,-1.405865,,
1,-1.086872,-0.087551,0.666916
2,-1.086872,-0.087551,-0.037634
3,,,-0.362835


bfill fills each NA with the following value.

In [13]:
data.fillna(method='bfill')

,0,1,2
0,-1.405865,-0.087551,0.666916
1,-1.086872,-0.087551,0.666916
2,,,-0.037634
3,,,-0.362835
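
pandas also provides dedicated ffill and bfill methods that do the same filling without the method argument. To my knowledge, recent releases (2.1 and later) deprecate fillna(method=...) in their favor, so treat that version detail as an assumption to check against your installed pandas. A small sketch with fixed values, using the illustrative names `frame`, `forward` and `backward`:

```python
import numpy as np
import pandas as pd

# Fixed values instead of random ones so the result is reproducible
frame = pd.DataFrame({'a': [1.0, np.nan, np.nan, 4.0],
                      'b': [np.nan, 2.0, np.nan, np.nan]})

# Each NA takes the previous value, but at most 1 consecutive NA is filled
forward = frame.ffill(limit=1)

# Each NA takes the next value; trailing NAs have nothing to copy from
backward = frame.bfill()

print(forward)
print(backward)
```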


You can also fill missing values with the mean.

In [14]:
data[0].fillna(value=data[0].mean())

0   -1.405865
1   -1.086872
2   -1.246369
3   -1.246369
Name: 0, dtype: float64
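
The same idea extends to a whole DataFrame: passing the Series returned by mean() fills every column with its own mean. This is a sketch on made-up values; `frame` and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1.0, np.nan, 3.0],
                      'y': [np.nan, 10.0, 20.0]})

# frame.mean() is a Series indexed by column name, so each column
# is filled with its own mean (NA values are skipped when averaging)
filled = frame.fillna(frame.mean())
print(filled)
```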

# 7.2 Data Transformation

## 1. Removing Duplicates

In [15]:
data=pd.DataFrame({'k1':['one','two']*3+['two'],'k2':[1,1,2,3,3,4,4]})
data

,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4
6,two,4


The duplicated method returns a boolean Series indicating whether each row duplicates an earlier row.

In [16]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5    False
6     True
dtype: bool

The drop_duplicates method removes the duplicate rows.

In [17]:
data.drop_duplicates()

,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,3
5,two,4


drop_duplicates considers all columns by default; you can also filter duplicates based on specific columns.

In [18]:
data['v1']=range(7)
data

,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
5,two,4,5
6,two,4,6


In [19]:
data.drop_duplicates(['k1'])

,k1,k2,v1
0,one,1,0
1,two,1,1


By default the first occurrence is kept; pass keep='last' to keep the last occurrence instead.

In [20]:
data.drop_duplicates(['k1','k2'],keep='last')

,k1,k2,v1
0,one,1,0
1,two,1,1
2,one,2,2
3,two,3,3
4,one,3,4
6,two,4,6
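
Two further options are worth a quick look: keep=False flags every member of a duplicate group, and subset restricts the comparison to chosen columns. A minimal sketch on the same k1/k2 data (`all_dupes` is an illustrative name):

```python
import pandas as pd

data = pd.DataFrame({'k1': ['one', 'two'] * 3 + ['two'],
                     'k2': [1, 1, 2, 3, 3, 4, 4]})

# keep=False marks every member of a duplicate group, not just the later ones
all_dupes = data.duplicated(keep=False)
print(data[all_dupes].index.tolist())  # [5, 6]

# subset restricts the comparison to the listed columns
print(data.duplicated(subset=['k1']).tolist())
# [False, False, True, True, True, True, True]
```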


## 2. Transforming Data Using a Function or Mapping
Consider the following data about various kinds of meat.

In [21]:
data=pd.DataFrame({'food':['bacon', 'pulled pork', 'bacon','Pastrami', 'corned beef', 'Bacon','pastrami', 'honey ham', 'nova lox'],
                  'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

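Suppose you want to add a column indicating the type of animal each food came from. First write a mapping from each distinct meat to its animal: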

In [22]:
meat_to_animal = {
  'bacon': 'pig',
  'pulled pork': 'pig',
  'pastrami': 'cow',
  'corned beef': 'cow',
  'honey ham': 'pig',
  'nova lox': 'salmon'
}

First use the Series str.lower method to convert each value to lowercase.

In [23]:
lower_food=data['food'].str.lower()
lower_food

0          bacon
1    pulled pork
2          bacon
3       pastrami
4    corned beef
5          bacon
6       pastrami
7      honey ham
8       nova lox
Name: food, dtype: object

The Series map method accepts a function or a dict-like object containing a mapping.

In [24]:
data['animal']=lower_food.map(meat_to_animal)
data

,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


Alternatively, pass a function that does all the work in one step:

In [25]:
data['food'].map(lambda x: meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
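
The two forms differ when a value has no entry in the mapping: the dict form yields a missing value, while a function that indexes the dict raises KeyError. The sketch below uses a shortened mapping and a hypothetical 'tofu' entry purely to trigger that case; `by_dict` and `by_get` are illustrative names.

```python
import pandas as pd

meat_to_animal = {'bacon': 'pig', 'pastrami': 'cow'}
food = pd.Series(['Bacon', 'pastrami', 'tofu'])  # 'tofu' is not in the mapping

# Mapping with a dict: unmatched values become missing
by_dict = food.str.lower().map(meat_to_animal)
print(by_dict)

# Mapping with a function that indexes the dict raises KeyError instead
try:
    food.map(lambda x: meat_to_animal[x.lower()])
except KeyError as exc:
    print('KeyError:', exc)

# dict.get returns None for unknown keys, so the function form no longer raises
by_get = food.map(lambda x: meat_to_animal.get(x.lower()))
print(by_get)
```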

## 3. Replacing Values

In [26]:
data=pd.Series([1., -999., 2., -999., -1000., 3.])
data

0       1.0
1    -999.0
2       2.0
3    -999.0
4   -1000.0
5       3.0
dtype: float64

Use replace to produce a new Series with the value substituted:

In [27]:
data.replace(-999,np.nan)

0       1.0
1       NaN
2       2.0
3       NaN
4   -1000.0
5       3.0
dtype: float64

To replace several values at once, pass a list of the values to replace followed by the substitute value:

In [28]:
data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

To use a different replacement for each value, pass a list of substitutes:

In [29]:
data.replace([-999,-1000],[np.nan,0])

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

The replacements can also be passed as a dict:

In [30]:
data.replace({-999:np.nan,-1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

## 4. Renaming Axis Indexes

In [31]:
data=pd.DataFrame(np.arange(12).reshape((3,4)),
                 index=['Ohio', 'Colorado', 'New York'],
                 columns=['one', 'two', 'three', 'four'])
# Like a Series, an axis index has a map method:
data.index=data.index.map(str.upper)
data

,one,two,three,four
OHIO,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


Another approach is rename, which returns a transformed copy and leaves the original unchanged:

In [32]:
data.rename(index=str.title,columns=str.upper)

,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colorado,4,5,6,7
New York,8,9,10,11


rename can also be combined with a dict to change only a subset of the labels:

In [33]:
data.rename(index={'OHIO':'CHINA'},
           columns={'four':'f'})

,one,two,three,f
CHINA,0,1,2,3
COLORADO,4,5,6,7
NEW YORK,8,9,10,11


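## 5. Discretization and Binning
Suppose you have a set of ages and want to group them into age bins.
pandas.cut returns a special Categorical object. Its output shows the bins that pandas.cut computed; you can treat it as an array of strings naming each bin.
Internally it holds a categories array listing the distinct bin names, plus a codes attribute that labels each age with the position of its bin.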

In [34]:
ages=[20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins=[18, 25, 35, 60, 100]
cats=pd.cut(ages,bins)
cats

[(18, 25], (18, 25], (18, 25], (25, 35], (18, 25], ..., (25, 35], (60, 100], (35, 60], (35, 60], (25, 35]]
Length: 12
Categories (4, interval[int64]): [(18, 25] < (25, 35] < (35, 60] < (60, 100]]
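
The codes and categories attributes described above can be inspected directly. A short sketch using the same ages and bins:

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]
cats = pd.cut(ages, bins)

# codes: integer position of each age's bin within categories
print(cats.codes.tolist())  # [0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1]

# categories: the distinct bins, stored as an IntervalIndex
print(cats.categories)
```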

Bin counts for the result of pandas.cut:

In [35]:
pd.value_counts(cats)

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64

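As in mathematical interval notation, a parenthesis means that side is open and a square bracket means it is closed (inclusive). Pass right=False to change which side is closed.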

In [36]:
pd.cut(ages,bins,right=False)

[[18, 25), [18, 25), [25, 35), [25, 35), [18, 25), ..., [25, 35), [60, 100), [35, 60), [35, 60), [25, 35)]
Length: 12
Categories (4, interval[int64]): [[18, 25) < [25, 35) < [35, 60) < [60, 100)]

Pass a list or array to labels to set your own bin names.

In [37]:
group_names=['Youth', 'YoungAdult', 'MiddleAged', 'Senior']
pd.cut(ages,bins,labels=group_names)

[Youth, Youth, Youth, YoungAdult, Youth, ..., YoungAdult, Senior, MiddleAged, MiddleAged, YoungAdult]
Length: 12
Categories (4, object): [Youth < YoungAdult < MiddleAged < Senior]

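If you pass cut an integer number of bins instead of explicit bin edges, it computes equal-length bins from the minimum and maximum of the data.

The following example splits some uniformly distributed data into four groups.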

In [38]:
data=np.random.rand(100)
cats=pd.cut(data,4,precision=2)  # precision=2 limits the bin labels to two decimal digits
pd.value_counts(cats)

(0.73, 0.98]      26
(0.0042, 0.25]    26
(0.25, 0.49]      25
(0.49, 0.73]      23
dtype: int64
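
To check that the bins really are equal in length, cut can also return the edges it computed via retbins=True. The sketch below uses a small fixed array rather than random data; `values` and `edges` are illustrative names.

```python
import numpy as np
import pandas as pd

values = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])

# retbins=True also returns the bin edges that cut computed
cats, edges = pd.cut(values, 4, retbins=True)
print(edges)

# The bins share one width, except that the lowest edge is pushed
# slightly below the minimum so the minimum itself falls inside a bin
print(np.diff(edges))
```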

qcut is closely related to cut; it bins the data based on sample quantiles, so the bins end up with roughly equal numbers of points.

In [39]:
cats2=pd.qcut(data,4)
pd.value_counts(cats2)

(0.738, 0.976]      25
(0.477, 0.738]      25
(0.241, 0.477]      25
(0.00414, 0.241]    25
dtype: int64

You can also pass your own quantiles (numbers between 0 and 1, inclusive):

In [40]:
pd.qcut(data,[0,0.1,0.5,0.8,1])

[(0.00414, 0.0861], (0.00414, 0.0861], (0.0861, 0.477], (0.0861, 0.477], (0.477, 0.789], ..., (0.477, 0.789], (0.477, 0.789], (0.00414, 0.0861], (0.477, 0.789], (0.477, 0.789]]
Length: 100
Categories (4, interval[float64]): [(0.00414, 0.0861] < (0.0861, 0.477] < (0.477, 0.789] < (0.789, 0.976]]

## 6. Detecting and Filtering Outliers

In [41]:
df=pd.DataFrame(np.random.randn(1000,4))
df.describe()

,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.020987,-0.049143,-0.000673,0.013051
std,0.995218,0.994054,0.996039,1.023015
min,-2.835985,-2.938999,-2.770754,-3.094552
25%,-0.685006,-0.738229,-0.686814,-0.657805
50%,0.013626,0.039478,0.051845,-0.019568
75%,0.73924,0.631325,0.702548,0.708522
max,3.607983,2.998813,2.891972,3.555632


Suppose you want to find the values in one column whose absolute value exceeds 3:

In [42]:
col=df[2]
col[np.abs(col)>3]

Series([], Name: 2, dtype: float64)

To select every row that contains a value greater than 3 or less than -3, use the any method on a boolean DataFrame:

In [43]:
df[(np.abs(df)>3).any(axis=1)]

,0,1,2,3
37,3.44777,0.124696,1.836341,1.812549
601,-0.932759,0.929034,1.10509,3.555632
639,3.607983,-0.491852,-0.851251,-2.104583
642,-0.476422,0.06971,-0.477892,-3.094552
876,-0.921552,-0.217697,-0.289866,3.038778


Values can be set based on these criteria. The following code caps values to the interval from -3 to 3:

In [44]:
df[np.abs(df)>3]=np.sign(df)*3
df.head(10)

,0,1,2,3
0,1.633631,1.710714,0.498685,-2.196764
1,1.543048,1.19862,-1.793326,1.419253
2,0.643443,-1.836535,0.731472,-1.070918
3,-1.333262,-1.021572,1.229724,-1.378533
4,-1.54569,0.391399,1.619534,-0.860167
5,-0.132075,0.135417,1.490089,-0.761409
6,0.715258,-1.04115,-0.50593,0.480584
7,-0.210875,0.784528,-0.047338,-0.053374
8,-2.082402,-0.580796,0.650047,0.059281
9,0.405322,0.457466,1.722031,0.456502
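
The same capping can be expressed with the DataFrame clip method. The sketch below compares the two approaches on freshly generated data; the seed 0 and the names `capped` and `clipped` are arbitrary choices for illustration, not part of the original notes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
df = pd.DataFrame(rng.standard_normal((1000, 4)))

# Capping via boolean assignment, as in the cell above
capped = df.copy()
capped[np.abs(capped) > 3] = np.sign(capped) * 3

# clip bounds every value to the interval [-3, 3] in one call
clipped = df.clip(lower=-3, upper=3)

print(np.allclose(capped.values, clipped.values))
print(clipped.abs().max().max() <= 3)
```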


np.sign(df) produces 1 and -1 according to whether each value is positive or negative:

In [45]:
np.sign(df).head()

,0,1,2,3
0,1.0,1.0,1.0,-1.0
1,1.0,1.0,-1.0,1.0
2,1.0,-1.0,1.0,-1.0
3,-1.0,-1.0,1.0,-1.0
4,-1.0,1.0,1.0,-1.0


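## 7. Permutation and Random Sampling
The numpy.random.permutation function makes it easy to permute (randomly reorder) a Series or the rows of a DataFrame. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering: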

In [46]:
data=pd.DataFrame(np.arange(20).reshape((5,4)))
data

,0,1,2,3
0,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
4,16,17,18,19


In [47]:
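# Random ordering of range(4). The frame has 5 rows, so this reorders only rows 0-3; pass len(data) to permute every row.
sampler=np.random.permutation(4)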
sampler

array([1, 2, 3, 0])

In [48]:
data.loc[sampler,:]

,0,1,2,3
1,4,5,6,7
2,8,9,10,11
3,12,13,14,15
0,0,1,2,3
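
To permute every row regardless of the frame's size, the length can be taken from the data itself. A minimal sketch, with `shuffled` as an illustrative name:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(20).reshape((5, 4)))

# One random position for every row, so no row is left out
sampler = np.random.permutation(len(data))

# iloc selects by integer position; loc works in the cell above only
# because the index labels are the default 0..n-1
shuffled = data.iloc[sampler]
print(shuffled)
```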


To select a random subset without replacement, use the sample method on a Series or DataFrame:

In [49]:
data.sample(4)

,0,1,2,3
0,0,1,2,3
2,8,9,10,11
4,16,17,18,19
1,4,5,6,7


To generate a sample with replacement (allowing repeated choices), pass replace=True to sample:

In [50]:
data.sample(5,replace=True)

,0,1,2,3
2,8,9,10,11
3,12,13,14,15
3,12,13,14,15
1,4,5,6,7
1,4,5,6,7
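
Because sample is random, its output changes from run to run. Passing random_state makes a draw repeatable, and frac selects a fraction of the rows instead of a count. A brief sketch; the seed 42 is arbitrary and the names `first`, `second` and `shuffled` are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(20).reshape((5, 4)))

# The same seed gives the same draw on every call
first = data.sample(n=3, random_state=42)
second = data.sample(n=3, random_state=42)
print(first.equals(second))  # True

# frac draws a fraction of the rows; frac=1 shuffles the whole frame
shuffled = data.sample(frac=1, random_state=42)
print(len(shuffled))  # 5
```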


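## 8. Computing Indicator/Dummy Variables
Another transformation often used in statistical modeling and machine learning converts a categorical variable into a "dummy" or "indicator" matrix.

If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns containing only 1s and 0s. pandas has a get_dummies function for this: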

In [51]:
df=pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],'data':[3,4,2,5,6,7]})
df

,data,key
0,3,b
1,4,b
2,2,a
3,5,c
4,6,a
5,7,b


In [52]:
pd.get_dummies(df['key'])

,a,b,c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


Sometimes you may want to add a prefix to the columns of the indicator DataFrame so that it can be merged with other data. The prefix argument of get_dummies does this:

In [53]:
dum=pd.get_dummies(df['key'],prefix='key')
dum

,key_a,key_b,key_c
0,0,1,0
1,0,1,0
2,1,0,0
3,0,0,1
4,1,0,0
5,0,1,0


Combine the two DataFrames with merge, joining on the index:

In [54]:
pd.merge(df[['data']],dum,left_index=True,right_index=True)

,data,key_a,key_b,key_c
0,3,0,1,0
1,4,0,1,0
2,2,1,0,0
3,5,0,0,1
4,6,1,0,0
5,7,0,1,0
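
An index-aligned merge like this can also be written with join. The sketch below additionally passes dtype=int, because some pandas versions return booleans rather than 0/1 from get_dummies by default; `joined` is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                   'data': [3, 4, 2, 5, 6, 7]})

# dtype=int requests 0/1 integers explicitly
dum = pd.get_dummies(df['key'], prefix='key', dtype=int)

# join aligns on the index, like merge(left_index=True, right_index=True)
joined = df[['data']].join(dum)
print(joined)
```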


Note that df['data'] returns a Series, whereas df[['data']] returns a DataFrame.

In [55]:
df['data']

0    3
1    4
2    2
3    5
4    6
5    7
Name: data, dtype: int64

In [56]:
df[['data']]

,data
0,3
1,4
2,2
3,5
4,6
5,7


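# 7.3 String Manipulation
## String Object Methods
For many string-processing and scripting tasks, the built-in string methods are sufficient. For example, a comma-separated string can be broken into pieces with split: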

In [57]:
val='a,b,  guido'
val.split(',')

['a', 'b', '  guido']

split is often combined with strip to trim whitespace (including line breaks):

In [63]:
pieces=[x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

These substrings can be concatenated with a two-colon delimiter using addition:

In [64]:
first,second,third=pieces
first+'::'+second+'::'+third

'a::b::guido'

But this approach is not very practical. A faster and more Pythonic way is to pass a list or tuple to the join method of the string "::":

In [65]:
'::'.join(pieces)

'a::b::guido'

Other methods are concerned with locating substrings. The in keyword is the best way to detect a substring, though index and find can also be used:

In [69]:
'a' in val

True

In [70]:
val.index('a')

0

In [71]:
val.find('a')

0

Note the difference between find and index: if the string is not found, index raises an exception, whereas find returns -1:

In [75]:
val.index(':')

ValueError: substring not found

In [76]:
val.find(':')

-1
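
In practice this difference decides how the "not found" case is handled. A short sketch of the three idioms on the same string (`pos` is an illustrative name):

```python
val = 'a,b,  guido'

# find signals "not found" with -1, so test the result before using it
# (-1 is also a valid index, so using it unchecked silently picks the last character)
pos = val.find(':')
if pos == -1:
    print('no colon (find)')

# index signals "not found" with ValueError, so catch it
try:
    val.index(':')
except ValueError:
    print('no colon (index)')

# in is enough when only a yes/no answer is needed
print(':' in val)  # False
```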

count returns the number of occurrences of a given substring:

In [77]:
val.count(',')

2

replace substitutes occurrences of one pattern with another. It is also commonly used to delete a pattern by passing an empty string:

In [78]:
val.replace(',','::')

'a::b::  guido'

In [81]:
val.replace(',','')

'ab  guido'