# Handling missing data
- Leave the original untouched by working on a copy: experiment on the copy, check whether the result is meaningful, and only then apply it to the original

In [3]:
import pandas as pd
import seaborn as sns

In [4]:
df = sns.load_dataset('titanic')

## copy

In [5]:
df_copy = df.copy()

In [6]:
df_copy.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [7]:
df_copy.loc[0, 'age'] = 99999

In [8]:
df_copy.head(1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,99999.0,1,0,7.25,S,Third,man,True,,Southampton,no,False


In [9]:
df.head(1)

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
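For contrast, here is a minimal sketch of what happens without `.copy()`. It uses a small made-up frame (`df_small` is a hypothetical stand-in, not the Titanic data): plain assignment only creates a second name for the same object, so an edit through either name shows up in both, while a copy stays independent.

```python
import pandas as pd

# Small stand-in frame (hypothetical data, not the Titanic dataset)
df_small = pd.DataFrame({"age": [22.0, 38.0, 26.0]})

alias = df_small             # second name for the same object, nothing is copied
snapshot = df_small.copy()   # independent copy (deep=True is the default)

alias.loc[0, "age"] = 99999  # edit through the alias

print(alias is df_small)        # True
print(df_small.loc[0, "age"])   # 99999.0, the "original" changed too
print(snapshot.loc[0, "age"])   # 22.0, the copy is unaffected
```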


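## Missing values
- What to do first when handling missing values
    1. **Check** which data is missing
    2. Check which data is not missing
---
- What to do next
    1. Fill in the missing values
    2. Remove the missing data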

In [35]:
df_copy = df.copy()

In [14]:
df_copy.isnull().sum() # number of missing values per column

survived         0
pclass           0
sex              0
age            177
sibsp            0
parch            0
fare             0
embarked         2
class            0
who              0
adult_male       0
deck           688
embark_town      2
alive            0
alone            0
dtype: int64

In [16]:
df_copy.notnull().sum() # number of non-missing values per column

survived       891
pclass         891
sex            891
age            714
sibsp          891
parch          891
fare           891
embarked       889
class          891
who            891
adult_male     891
deck           203
embark_town    889
alive          891
alone          891
dtype: int64
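Raw counts are easier to judge next to the share of missing values. The sketch below is one possible way to do that, on a small made-up frame (`toy` and its values are hypothetical): the mean of a boolean mask is the fraction of `True` values, so `isnull().mean()` gives the missing ratio per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same kind of gaps as the Titanic columns
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isnull().sum()    # number of missing values per column
missing_ratio = toy.isnull().mean()   # fraction of missing values per column

# Combine both views and list the most incomplete columns first
summary = pd.DataFrame({"missing": missing_count, "ratio": missing_ratio})
summary = summary.sort_values("ratio", ascending=False)
print(summary)
```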

### Filtering missing data

In [22]:
# pick out the rows where age is missing
cond = df_copy['age'].isnull()
df_copy.loc[cond, 'age'] = 30
df_copy

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,30.0,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
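The same boolean mask can also be used just to look at the rows before changing anything. A minimal sketch on made-up data (`toy` is hypothetical): `toy[cond]` selects the rows with a missing age, `toy[~cond]` selects the rest, and `.loc[cond, 'age'] = ...` then overwrites only the missing cells.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male"],
    "age": [22.0, np.nan, 26.0, np.nan],
})

cond = toy["age"].isnull()   # True where age is missing

missing_rows = toy[cond]     # rows with a missing age
present_rows = toy[~cond]    # rows with a known age
print(missing_rows)

# Overwrite only the missing cells; this modifies toy in place
toy.loc[cond, "age"] = 30
print(toy)
```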


## fillna()
- Fill in missing values

In [33]:
df_copy = df.copy()

In [25]:
df_copy['age'].fillna(100) # returns a new Series; df_copy itself is not modified

0       22.0
1       38.0
2       26.0
3       35.0
4       35.0
       ...  
886     27.0
887     19.0
888    100.0
889     26.0
890     32.0
Name: age, Length: 891, dtype: float64

In [28]:
# to actually modify the data, assign the result back
df_copy['age'] = df_copy['age'].fillna(100)
df_copy.tail()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
886,0,2,male,27.0,0,0,13.0,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,100.0,1,2,23.45,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0,C,First,man,True,C,Cherbourg,yes,True
890,0,3,male,32.0,0,0,7.75,Q,Third,man,True,,Queenstown,no,True


In [29]:
df_copy['deck'].count()

np.int64(203)

In [34]:
# 'Z' is not one of the existing categories, so fillna('Z') alone raises an error;
# add the category first
df_copy['deck'] = df_copy['deck'].cat.add_categories('Z')
df_copy['deck'].fillna('Z')

0      Z
1      C
2      Z
3      C
4      Z
      ..
886    Z
887    B
888    Z
889    C
890    Z
Name: deck, Length: 891, dtype: category
Categories (8, object): ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'Z']
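To see the failure and the fix side by side, here is a small sketch on a made-up categorical Series (`deck` below is hypothetical, not the Titanic column). The exception type has differed between pandas versions, so the sketch catches both `TypeError` and `ValueError` rather than assuming one.

```python
import numpy as np
import pandas as pd

# Hypothetical categorical column with categories ['B', 'C']
deck = pd.Series(["C", np.nan, "B", np.nan], dtype="category")

# Filling with a value that is not a known category is rejected
try:
    deck.fillna("Z")
    failed = False
except (TypeError, ValueError) as exc:
    failed = True
    print(type(exc).__name__, exc)

# Register the new category first, then fill
deck_filled = deck.cat.add_categories("Z").fillna("Z")
print(deck_filled.tolist())
```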

In [37]:
age_mean = df_copy['age'].mean()

In [39]:
df_copy['age'] = df_copy['age'].fillna(age_mean)

In [40]:
df_copy = df.copy()

In [41]:
df_copy['age'] = df_copy['age'].fillna(df_copy['age'].median())

In [43]:
df_copy['age']

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888    28.0
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64
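Mean and median fills can differ noticeably when a column has outliers. The sketch below uses made-up ages (`ages` is hypothetical) with one extreme value: the mean is pulled toward it, while the median is not.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one outlier (90) and one missing value
ages = pd.Series([20.0, 22.0, 24.0, 90.0, np.nan])

mean_value = ages.mean()       # NaN is skipped: (20 + 22 + 24 + 90) / 4
median_value = ages.median()   # NaN is skipped: middle of 20, 22, 24, 90

filled_with_mean = ages.fillna(mean_value)
filled_with_median = ages.fillna(median_value)

print(mean_value, median_value)   # 39.0 23.0
print(filled_with_mean.tolist())
print(filled_with_median.tolist())
```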

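### Exercise

- Fill missing ages of male passengers with the mean age of male passengers
- Fill missing ages of female passengers with the mean age of female passengers
- After filling all missing values, compute the mean age of all passengers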

In [77]:
df_copy = df.copy()
df_copy.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [78]:
cond1 = df_copy['sex'] == 'male'
cond2 = df_copy['sex'] == 'female'

In [79]:
male_mean = df_copy.loc[cond1, 'age'].mean()
female_mean = df_copy.loc[cond2, 'age'].mean()
print(male_mean)
print(female_mean)

30.72664459161148
27.915708812260537


In [80]:
df_copy.loc[cond1, 'age'] = df_copy.loc[cond1, 'age'].fillna(male_mean)
df_copy.loc[cond2, 'age'] = df_copy.loc[cond2, 'age'].fillna(female_mean)

In [81]:
df_copy['age'].mean()

np.float64(29.736034227171306)
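The exercise can also be written without a separate mask per group. This is an alternative sketch, not the approach used above, on made-up data (`toy` is hypothetical): `groupby(...).transform('mean')` returns a Series aligned with the original rows that holds each row's group mean, and `fillna` takes values from it only where age is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing age per group
toy = pd.DataFrame({
    "sex": ["male", "male", "male", "female", "female", "female"],
    "age": [20.0, 40.0, np.nan, 30.0, np.nan, 50.0],
})

# Per-row group mean: 30.0 for male rows, 40.0 for female rows
group_mean = toy.groupby("sex")["age"].transform("mean")

# Fill each missing age with the mean of that row's group
toy["age_filled"] = toy["age"].fillna(group_mean)

print(toy)
print(toy["age_filled"].mean())   # 35.0
```

This scales to any number of groups, since no condition has to be written per category.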

## dropna()
- Remove missing values

In [85]:
df_copy = df.copy()

- Drop every row that contains at least one N/A:
`df_copy.dropna()`
- In titanic, most `deck` values are N/A (688 of 891), so this drops many rows unnecessarily
- So `how='all'` is often used instead, as below

In [89]:
df_copy.dropna(how='all') # drops only the rows where every value is missing

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True
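`how='all'` rarely removes anything, because fully empty rows are uncommon. Two other `dropna` parameters sit between the extremes: `subset` restricts the check to chosen columns, and `thresh` keeps rows with at least that many non-missing values. A minimal sketch on made-up data (`toy` is hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 3 is completely empty, no row is complete
toy = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, np.nan],
    "deck": [np.nan, "C", np.nan, np.nan],
    "fare": [7.25, 71.28, np.nan, np.nan],
})

any_dropped = toy.dropna()                    # drop rows with any missing value
all_dropped = toy.dropna(how="all")           # drop only fully empty rows
age_required = toy.dropna(subset=["age"])     # drop rows where age is missing
at_least_two = toy.dropna(thresh=2)           # keep rows with >= 2 non-missing values

print(len(any_dropped), len(all_dropped), len(age_required), len(at_least_two))
# 0 3 2 2
```

Like `fillna`, `dropna` returns a new DataFrame, so the result has to be assigned back to take effect.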
