### 5.4 Handling Missing Data

pandas uses NaN (not a number) as its marker for missing data, and `isnull` also reports Python's None as missing.

In [2]:
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

In [3]:
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])

In [4]:
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [5]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [9]:
string_data[0] = None
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool
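
As a small sketch of the point above (the variable names are made up for illustration), the snippet below checks which scalar values pandas reports as missing. It also shows that an empty string or a zero counts as ordinary data, not as a missing value.

```python
import numpy as np
import pandas as pd

# Values that pandas reports as missing.
missing_markers = {"np.nan": np.nan, "None": None, "pd.NaT": pd.NaT}
flags = {name: bool(pd.isnull(value)) for name, value in missing_markers.items()}

# An empty string or zero is ordinary data, not a missing value.
not_missing = pd.Series(["", 0, "avocado"], dtype=object).isnull().tolist()

print(flags)
print(not_missing)
```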

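NA handling methods

- dropna: drops the axis labels (rows or columns) that contain missing data; a threshold controls how much missing data is tolerated
- fillna: fills missing data with a given value, or with a fill method such as 'ffill' or 'bfill'
- isnull: returns an object of the same shape holding boolean values that mark which entries are missing (NA)
- notnull: the negation of isnull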
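
A minimal sketch of the four methods side by side, on a made-up three-element Series (the data and variable names are illustrative only):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

dropped = s.dropna()         # removes the NaN entry; labels 0 and 2 remain
filled = s.fillna(0.0)       # replaces the NaN with 0.0
mask_null = s.isnull()       # True where the value is missing
mask_notnull = s.notnull()   # elementwise negation of isnull
```

Note that dropna keeps the original index labels, so the result is not renumbered.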

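#### 5.4.1 Filtering Out Missing Data

One good way to filter out missing data is to use dropna. On a Series it returns only the non-null values together with their index labels.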

In [10]:
from numpy import nan as NA

In [11]:
data = Series([1, NA, 3.5, NA, 7])
data.dropna()

0    1.0
2    3.5
4    7.0
dtype: float64

In [12]:
data[data.notnull()]

0    1.0
2    3.5
4    7.0
dtype: float64

In [13]:
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                 [NA, NA, NA], [NA, 6.5, 3.]])
cleaned = data.dropna()

In [14]:
data

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0


In [15]:
cleaned

     0    1    2
0  1.0  6.5  3.0


In [16]:
data.dropna(how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0


In [17]:
data[4] = NA
data

     0    1    2   4
0  1.0  6.5  3.0 NaN
1  1.0  NaN  NaN NaN
2  NaN  NaN  NaN NaN
3  NaN  6.5  3.0 NaN


In [18]:
data.dropna(axis=1, how='all')

     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
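
The difference between the default `how='any'` and `how='all'`, and between dropping rows and dropping columns, can be sketched on a small made-up frame (column names and values are illustrative only):

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, np.nan],
                      "b": [2.0, 5.0, np.nan],
                      "c": [np.nan, np.nan, np.nan]})

# Default how='any': every row has a NaN in column 'c', so no row survives.
rows_any = frame.dropna()

# how='all': only row 2 is entirely NaN, so rows 0 and 1 are kept.
rows_all = frame.dropna(how="all")

# axis=1 applies the same rule to columns: only 'c' is entirely NaN.
cols_all = frame.dropna(axis=1, how="all")
```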


In [32]:
df = DataFrame(np.random.randn(7, 3))
df.iloc[:4, 1] = NA
df.iloc[:2, 2] = NA
df

          0         1         2
0 -0.538202       NaN       NaN
1 -1.475394       NaN       NaN
2  0.469243       NaN  0.763256
3  0.458260       NaN  2.122872
4  0.920640  0.266543 -0.353508
5 -1.217185 -1.719836 -0.729735
6 -1.191546  0.419345 -0.800384


In [22]:
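# The outputs of In [22] to In [25] come from an earlier random draw,
# so the numbers differ from the df printed above.
df.dropna(thresh=2)  # keep only rows that hold at least 2 non-NA values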

          0         1         2
2 -0.626578       NaN  1.220781
3  0.098489       NaN -1.058167
4 -0.486342 -0.493832  0.028736
5  0.178425 -1.897049  0.541164
6  0.004872 -0.134424  0.526618
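
Because the frame above is random, a deterministic sketch may make `thresh` easier to follow. The frame below is made up for illustration. It compares `dropna(thresh=2)` with an explicit count of non-NA values per row.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1.0, np.nan, np.nan],   # 1 non-NA value
                      [2.0, np.nan, 3.0],      # 2 non-NA values
                      [4.0, 5.0, 6.0]])        # 3 non-NA values

# Count the non-NA entries in each row.
non_na_counts = frame.notnull().sum(axis=1)

kept = frame.dropna(thresh=2)        # rows with at least 2 non-NA values
manual = frame[non_na_counts >= 2]   # the same selection written by hand
```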


#### 5.4.2 Filling In Missing Values

In [23]:
df.fillna(0)

          0         1         2
0  0.292576  0.000000  0.000000
1  1.363445  0.000000  0.000000
2 -0.626578  0.000000  1.220781
3  0.098489  0.000000 -1.058167
4 -0.486342 -0.493832  0.028736
5  0.178425 -1.897049  0.541164
6  0.004872 -0.134424  0.526618


In [25]:
df.fillna({1: 0.5, 2: -1})

          0         1         2
0  0.292576  0.500000 -1.000000
1  1.363445  0.500000 -1.000000
2 -0.626578  0.500000  1.220781
3  0.098489  0.500000 -1.058167
4 -0.486342 -0.493832  0.028736
5  0.178425 -1.897049  0.541164
6  0.004872 -0.134424  0.526618
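
When fillna is given a dict, the keys are column labels and each column gets its own fill value. The sketch below uses a made-up frame to show that columns missing from the dict are left untouched.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [np.nan, 1.0],
                      "b": [np.nan, 2.0],
                      "c": [np.nan, 3.0]})

# Fill column 'a' with 0.5 and column 'b' with -1; 'c' is not listed.
filled = frame.fillna({"a": 0.5, "b": -1})

# Column 'c' still contains its NaN.
c_still_missing = bool(filled["c"].isnull().iloc[0])
```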


In [34]:
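# With inplace=True, fillna modifies df itself instead of building a new object.
# The return value depends on the pandas version (None in pandas 1.x and 2.x),
# so it is discarded here.
_ = df.fillna(0, inplace=True)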

In [35]:
df

          0         1         2
0 -0.538202  0.000000  0.000000
1 -1.475394  0.000000  0.000000
2  0.469243  0.000000  0.763256
3  0.458260  0.000000  2.122872
4  0.920640  0.266543 -0.353508
5 -1.217185 -1.719836 -0.729735
6 -1.191546  0.419345 -0.800384
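
To make the difference between the default behavior and `inplace=True` concrete, here is a small sketch on a made-up frame. It checks only the data and deliberately does not rely on what the in-place call returns.

```python
import numpy as np
import pandas as pd

original = pd.DataFrame({"x": [1.0, np.nan], "y": [np.nan, 4.0]})

# Default: a new object is returned and the original keeps its NaNs.
filled_copy = original.fillna(0)
still_missing = int(original.isnull().sum().sum())

# inplace=True: the object the method is called on is changed directly.
working = original.copy()
working.fillna(0, inplace=True)
```

Assigning the result of the default call (`df = df.fillna(0)`) is usually the less surprising style, since it does not depend on the return value of an in-place call.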


In [36]:
df = DataFrame(np.random.randn(6, 3))
df.iloc[2:, 1] = NA
df.iloc[4:, 2] = NA
df

          0         1         2
0  0.954143  0.413022  0.193129
1  1.070314  1.606925 -0.005161
2  0.206493       NaN  0.264381
3  0.929589       NaN -0.586963
4 -0.600758       NaN       NaN
5  0.445158       NaN       NaN


In [37]:
# Forward fill: same result as the older df.fillna(method='ffill'),
# which is deprecated in pandas 2.x.
df.ffill()

          0         1         2
0  0.954143  0.413022  0.193129
1  1.070314  1.606925 -0.005161
2  0.206493  1.606925  0.264381
3  0.929589  1.606925 -0.586963
4 -0.600758  1.606925 -0.586963
5  0.445158  1.606925 -0.586963


In [38]:
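# Forward fill at most 2 consecutive missing values per column
# (older spelling: df.fillna(method='ffill', limit=2)).
df.ffill(limit=2)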

          0         1         2
0  0.954143  0.413022  0.193129
1  1.070314  1.606925 -0.005161
2  0.206493  1.606925  0.264381
3  0.929589  1.606925 -0.586963
4 -0.600758       NaN -0.586963
5  0.445158       NaN -0.586963
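
The effect of `limit` is easier to see on a deterministic Series. The sketch below uses made-up data and the `ffill` and `bfill` methods; a run of three missing values is only partly filled when a limit is set.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, np.nan, 5.0])

full = s.ffill()            # carries 1.0 forward through the whole gap
capped = s.ffill(limit=2)   # fills only the first 2 NaNs of the gap
back = s.bfill(limit=1)     # fills backward from 5.0, one step only
```

With `limit=2` the entry at position 3 stays missing, and with `bfill(limit=1)` only position 3 is filled.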


In [39]:
data = Series([1., NA, 3.5, NA, 7])
data

0    1.0
1    NaN
2    3.5
3    NaN
4    7.0
dtype: float64

In [42]:
data.fillna(data.mean())

0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
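
The fill value above is the mean of the non-missing entries, (1 + 3.5 + 7) / 3 ≈ 3.833333, because `mean` skips NaN by default. The same idea extends to a DataFrame: passing the Series of column means fills each column with its own mean. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                      "b": [np.nan, 4.0, 8.0]})

# mean() skips NaN by default: a -> 2.0, b -> 6.0
column_means = frame.mean()

# The Series is matched to columns by label, one fill value per column.
filled = frame.fillna(column_means)
```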

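fillna function arguments

- value: scalar value or dict-like object used to fill the missing entries
- method: fill method ('ffill' or 'bfill'); default None. Deprecated in pandas 2.x in favor of the ffill and bfill methods
- axis: axis along which to fill; default 0 (rows)
- inplace: modify the calling object instead of producing a copy; default False
- limit: for forward and backward filling, the maximum number of consecutive missing values to fill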