# Handling Missing Data

There are two kinds of missing values:
- None
- np.nan (NaN)

## 1. None

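None is Python's built-in null object. Its type is NoneType, and a NumPy array that contains it falls back to the generic object dtype. As a result, None cannot take part in arithmetic: an expression such as None + 1 raises a TypeError.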
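The behaviour described above can be checked with a minimal sketch. The variable names `none_add_failed` and `arr` are only illustrative and do not come from the original notebook.

```python
import numpy as np

# None is the single instance of NoneType
print(type(None))  # <class 'NoneType'>

# Arithmetic on None is not defined, so this raises TypeError
try:
    None + 1
    none_add_failed = False
except TypeError:
    none_add_failed = True
print(none_add_failed)  # True

# A NumPy array that contains None falls back to the object dtype
arr = np.array([1, 2, None])
print(arr.dtype)  # object
```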

In [1]:
# Check the data type of None


## 2. np.nan (NaN)

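np.nan is a floating-point value, so it can take part in arithmetic. However, the result of arithmetic such as addition or multiplication with NaN is itself NaN.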
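A short sketch of how NaN propagates through calculations. The nan-aware aggregation `np.nansum` is not mentioned in the original text; it is shown here only as one possible way to skip NaN values.

```python
import numpy as np

# np.nan is an ordinary Python float
print(type(np.nan))  # <class 'float'>

# Arithmetic with NaN propagates NaN
total = np.nan + 1
product = np.nan * 0
print(total, product)  # nan nan

# NaN is not equal to anything, including itself, so test with np.isnan
print(np.nan == np.nan)  # False
print(np.isnan(total))   # True

# Plain aggregations propagate NaN; the nan-aware variant skips it
values = np.array([1.0, np.nan, 3.0])
print(values.sum())       # nan
print(np.nansum(values))  # 4.0
```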

In [2]:
# Check the data type of np.nan


In [2]:
from pandas import Series,DataFrame
import numpy as np

## 3. None and NaN in pandas

### 1) pandas treats both None and np.nan as missing values (NaN in numeric columns)

Create a DataFrame

In [18]:
# Set some of the elements to NaN
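As a minimal sketch of this conversion, the example below builds a Series and a small DataFrame from data containing both None and np.nan. The names `s` and `scores` and the column labels are hypothetical.

```python
import numpy as np
from pandas import Series, DataFrame

# In a numeric Series, None is converted to NaN and the dtype becomes float64
s = Series([1, None, np.nan, 4])
print(s.dtype)              # float64
print(s.isnull().tolist())  # [False, True, True, False]

# The same conversion happens column by column in a DataFrame
scores = DataFrame({'math': [90, None], 'english': [np.nan, 85]})
print(scores.isnull())
```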

### 2) pandas operations for handling null values

- ``isnull()``: flags missing values
- ``notnull()``: flags non-missing values
- ``dropna()``: filters out missing data
- ``fillna()``: fills in missing data

In [63]:
# Create a DataFrame, then set some of its elements to NaN
df = DataFrame(data=np.random.randint(0,100,size=(5,8)),index=['a','b','c','d','e'],columns=['A','B','C','D','E','F','G','H'])
# Cast the target columns to float first, because integer columns cannot hold NaN
df = df.astype({'B': float, 'D': float, 'F': float})
# Assign with .loc; chained indexing such as df['B']['c'] = None triggers SettingWithCopyWarning
df.loc['c', 'B'] = None
df.loc['d', 'F'] = np.nan
df.loc['c', 'D'] = None
df


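|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | 10 | 22.0 | 49 | 13.0 | 20 | 18.0 | 99 | 85 |
| b | 97 | 44.0 | 15 | 25.0 | 78 | 15.0 | 56 | 49 |
| c | 13 | NaN | 87 | NaN | 75 | 43.0 | 18 | 39 |
| d | 0 | 92.0 | 76 | 51.0 | 12 | NaN | 23 | 23 |
| e | 49 | 35.0 | 87 | 7.0 | 81 | 29.0 | 88 | 60 |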


In [20]:
df

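|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | 47 | 69.0 | 70 | 62.0 | 89 | 96.0 | 32 | 25 |
| b | 38 | 74.0 | 11 | 43.0 | 24 | 16.0 | 28 | 65 |
| c | 60 | NaN | 52 | NaN | 77 | 21.0 | 60 | 52 |
| d | 62 | 40.0 | 87 | 36.0 | 15 | NaN | 27 | 6 |
| e | 23 | 89.0 | 41 | 49.0 | 56 | 90.0 | 9 | 26 |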


(1) Detection functions
- ``isnull()``
- ``notnull()``

In [21]:
df.notnull()

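|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | True | True | True | True | True | True | True | True |
| b | True | True | True | True | True | True | True | True |
| c | True | False | True | False | True | True | True | True |
| d | True | True | True | True | True | False | True | True |
| e | True | True | True | True | True | True | True | True |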
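Because the result of ``isnull()`` is a boolean table, it can be summed to count missing values. The following is a small sketch on a hypothetical frame named `demo`, not the `df` used in the cells above.

```python
import numpy as np
from pandas import DataFrame

demo = DataFrame({
    'A': [1, 2, 3],
    'B': [1.0, np.nan, 3.0],
    'C': [np.nan, np.nan, 6.0],
})

# isnull() marks missing cells with True; notnull() is its exact negation
mask = demo.isnull()
print(mask.equals(~demo.notnull()))  # True

# True counts as 1, so summing the mask counts missing values per column
missing_per_column = mask.sum()
print(missing_per_column.to_dict())  # {'A': 0, 'B': 1, 'C': 2}
```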


- Combine ``df.notnull()`` / ``df.isnull()`` with ``.any()`` / ``.all()`` to summarize the result per row or per column

In [51]:
condition = df.notnull().all(axis=0)
condition

A     True
B    False
C     True
D    False
E     True
F    False
G     True
H     True
dtype: bool

In [61]:
c = condition[condition].index
c

Index(['A', 'C', 'E', 'G', 'H'], dtype='object')

In [62]:
# Filter out the nulls in df (keep only the columns that contain no null values)
df[c]

|   | A | C | E | G | H |
|---|---|---|---|---|---|
| a | 47 | 70 | 89 | 32 | 25 |
| b | 38 | 11 | 24 | 28 | 65 |
| c | 60 | 52 | 77 | 60 | 52 |
| d | 62 | 87 | 15 | 27 | 6 |
| e | 23 | 41 | 56 | 9 | 26 |
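The same boolean-mask idea works for rows as well as columns. The sketch below assumes a small hypothetical frame `demo` and shows how the ``axis`` argument of ``all()`` and ``any()`` decides which direction is collapsed.

```python
import numpy as np
from pandas import DataFrame

demo = DataFrame(
    {'A': [1, 2, 3], 'B': [1.0, np.nan, 3.0], 'C': [4, 5, 6]},
    index=['a', 'b', 'c'],
)

# all(axis=0) collapses the rows: one flag per column, True if the column has no nulls
full_columns = demo.notnull().all(axis=0)
print(demo.loc[:, full_columns].columns.tolist())  # ['A', 'C']

# all(axis=1) collapses the columns: one flag per row, True if the row has no nulls
full_rows = demo.notnull().all(axis=1)
print(demo[full_rows].index.tolist())  # ['a', 'c']

# any(axis=1) on isnull() flags the rows that contain at least one null
rows_with_null = demo.isnull().any(axis=1)
print(demo[rows_with_null].index.tolist())  # ['b']
```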


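(2) Filtering function
- ``dropna()``: drops the rows or columns that contain missing values. ``axis=0`` (the default) drops rows; ``axis=1`` drops columns.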

In [67]:
df.dropna(axis=1)  # in the drop family of functions, axis=0 refers to rows and axis=1 to columns

|   | A | C | E | G | H |
|---|---|---|---|---|---|
| a | 10 | 49 | 20 | 99 | 85 |
| b | 97 | 15 | 78 | 56 | 49 |
| c | 13 | 87 | 75 | 18 | 39 |
| d | 0 | 76 | 12 | 23 | 23 |
| e | 49 | 87 | 81 | 88 | 60 |
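Besides ``axis``, ``dropna()`` accepts a few other parameters that the text above does not cover. The sketch below, on a hypothetical frame `demo`, illustrates ``subset`` and ``thresh`` next to the default behaviour.

```python
import numpy as np
from pandas import DataFrame

demo = DataFrame(
    {'A': [1.0, np.nan, 3.0], 'B': [np.nan, np.nan, 6.0], 'C': [7.0, 8.0, 9.0]},
    index=['a', 'b', 'c'],
)

# Default (axis=0): drop every row that contains at least one null
print(demo.dropna().index.tolist())  # ['c']

# axis=1: drop every column that contains at least one null
print(demo.dropna(axis=1).columns.tolist())  # ['C']

# subset: only look at the listed columns when deciding which rows to drop
print(demo.dropna(subset=['A']).index.tolist())  # ['a', 'c']

# thresh: keep rows that have at least this many non-null values
print(demo.dropna(thresh=2).index.tolist())  # ['a', 'c']
```

None of these calls modifies `demo`; each one returns a new DataFrame.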


(3) Filling function (Series/DataFrame)
- ``fillna()``: the ``value`` and ``method`` parameters

In [68]:
df

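|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | 10 | 22.0 | 49 | 13.0 | 20 | 18.0 | 99 | 85 |
| b | 97 | 44.0 | 15 | 25.0 | 78 | 15.0 | 56 | 49 |
| c | 13 | NaN | 87 | NaN | 75 | 43.0 | 18 | 39 |
| d | 0 | 92.0 | 76 | 51.0 | 12 | NaN | 23 | 23 |
| e | 49 | 35.0 | 87 | 7.0 | 81 | 29.0 | 88 | 60 |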


In [69]:
df.fillna(value=100)

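|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | 10 | 22.0 | 49 | 13.0 | 20 | 18.0 | 99 | 85 |
| b | 97 | 44.0 | 15 | 25.0 | 78 | 15.0 | 56 | 49 |
| c | 13 | 100.0 | 87 | 100.0 | 75 | 43.0 | 18 | 39 |
| d | 0 | 92.0 | 76 | 51.0 | 12 | 100.0 | 23 | 23 |
| e | 49 | 35.0 | 87 | 7.0 | 81 | 29.0 | 88 | 60 |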


In [70]:
# fillna() returned a new DataFrame; df itself is unchanged
df

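|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | 10 | 22.0 | 49 | 13.0 | 20 | 18.0 | 99 | 85 |
| b | 97 | 44.0 | 15 | 25.0 | 78 | 15.0 | 56 | 49 |
| c | 13 | NaN | 87 | NaN | 75 | 43.0 | 18 | 39 |
| d | 0 | 92.0 | 76 | 51.0 | 12 | NaN | 23 | 23 |
| e | 49 | 35.0 | 87 | 7.0 | 81 | 29.0 | 88 | 60 |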


In [73]:
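# Forward fill down each column; in pandas 2.1 and later fillna(method=...) is deprecated and df.ffill(axis=0) is the equivalent
df.fillna(method="ffill",axis=0)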

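|   | A | B | C | D | E | F | G | H |
|---|---|---|---|---|---|---|---|---|
| a | 10 | 22.0 | 49 | 13.0 | 20 | 18.0 | 99 | 85 |
| b | 97 | 44.0 | 15 | 25.0 | 78 | 15.0 | 56 | 49 |
| c | 13 | 44.0 | 87 | 25.0 | 75 | 43.0 | 18 | 39 |
| d | 0 | 92.0 | 76 | 51.0 | 12 | 43.0 | 23 | 23 |
| e | 49 | 35.0 | 87 | 7.0 | 81 | 29.0 | 88 | 60 |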


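You can choose between forward fill and backward fill.

The ``method`` parameter controls the fill direction: ``ffill`` propagates the previous valid value forward, and ``bfill`` propagates the next valid value backward.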
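The fill strategies can be compared side by side on a small hypothetical frame `demo`. This sketch uses the ``ffill()`` and ``bfill()`` methods, which behave like ``fillna(method="ffill")`` and ``fillna(method="bfill")`` and also work in recent pandas versions where the ``method`` argument of ``fillna()`` is deprecated. The per-column dictionary passed to ``value`` is an additional illustration that the original text does not cover.

```python
import numpy as np
from pandas import DataFrame

demo = DataFrame(
    {'A': [1.0, np.nan, 3.0], 'B': [np.nan, 5.0, np.nan]},
    index=['a', 'b', 'c'],
)

# Constant fill: every NaN becomes 0
constant = demo.fillna(value=0)
print(constant['A'].tolist())  # [1.0, 0.0, 3.0]

# Forward fill: copy the last valid value downwards; a leading NaN stays NaN
forward = demo.ffill()
print(forward['A'].tolist())  # [1.0, 1.0, 3.0]

# Backward fill: copy the next valid value upwards; a trailing NaN stays NaN
backward = demo.bfill()
print(backward['A'].tolist())  # [1.0, 3.0, 3.0]

# A dict fills each column with its own value, here the column mean for A
per_column = demo.fillna(value={'A': demo['A'].mean(), 'B': -1})
print(per_column['A'].tolist())  # [1.0, 2.0, 3.0]
print(per_column['B'].tolist())  # [-1.0, 5.0, -1.0]
```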

============================================

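Exercise 7:

1. Briefly describe the differences between None and NaN.

2. Suppose Zhang San and Li Si take a mock exam, but Zhang San, having suddenly figured out the meaning of life, skips the English exam, so his English score is recorded as None. Create a DataFrame accordingly and name it ddd3.

3. The teacher decides to fill in Zhang San's English score with his math score. How would you do this?
    How would you fill it with Li Si's English score instead?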

============================================