### [Reference] <a href="https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf">Pandas Cheat Sheet</a>

https://pandas.pydata.org/docs/user_guide/missing_data.html

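#### NaN (Not a Number): data that cannot be represented (an empty value)

- NaN: the default way pandas represents a missing value
- NaN is itself a float, so by default a numeric column that contains it is stored as float64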

#### NA (Not Available): a missing value
#### None: the absence of a value (it does not exist or is undefined)
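The three markers above behave differently in plain Python but are all treated as missing by pandas. The following is a minimal sketch of that behaviour; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# NaN is a float value, which is why numeric columns holding it become float64.
nan_is_float = isinstance(np.nan, float)

# NaN never compares equal to anything, including itself,
# so `==` cannot be used to find missing values.
nan_equals_itself = np.nan == np.nan

# pd.isna() recognises all three missing-value markers.
markers = [np.nan, pd.NA, None]
detected = [bool(pd.isna(m)) for m in markers]

print(nan_is_float)       # True
print(nan_equals_itself)  # False
print(detected)           # [True, True, True]
```

Because equality checks fail on NaN, the rest of this notebook relies on `isna()` / `isnull()` rather than `==`.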

In [1]:
import pandas as pd
import numpy as np

### [Exercise 1]

#### 1) Create a DataFrame that contains missing data

In [2]:
df = pd.DataFrame({
    "name":["Alfred","Batman","Catwoman"],
    "toy":[np.nan, "Batmobile","Bullwhip"],
    "born":[None,pd.Timestamp("19400425"),pd.NA]
})
df

       name        toy                 born
0    Alfred        NaN                 None
1    Batman  Batmobile  1940-04-25 00:00:00
2  Catwoman   Bullwhip                 <NA>


#### 2) Check the data types

In [3]:
df.dtypes

name    object
toy     object
born    object
dtype: object
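The `born` column is reported as `object` because it mixes `None`, a `Timestamp`, and `pd.NA`. As a hedged sketch (the name `df_demo` is illustrative and simply rebuilds the frame from this exercise), `pd.to_datetime` can convert such a column to a datetime dtype, with every missing marker becoming `NaT`:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
    "born": [None, pd.Timestamp("19400425"), pd.NA],
})

# Mixed None / Timestamp / pd.NA values leave "born" as a generic object column.
print(df_demo["born"].dtype)

# pd.to_datetime converts the column to a datetime64 dtype and turns
# every missing marker into NaT (Not a Time).
born_dt = pd.to_datetime(df_demo["born"])
print(born_dt.dtype)
print(born_dt.isna().tolist())   # [True, False, True]
```

The exact datetime resolution shown by `born_dt.dtype` may differ between pandas versions, so it is not asserted here.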

#### 3) Handle missing data

**dropna: remove missing values**

In [7]:
df.dropna?

df.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False) **(default)**
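The signature above also lists `thresh` and `subset`, which the cells in this exercise do not use. A minimal sketch of both, assuming the same three-row frame (rebuilt here under the illustrative name `df_demo`):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["Alfred", "Batman", "Catwoman"],
    "toy": [np.nan, "Batmobile", "Bullwhip"],
    "born": [None, pd.Timestamp("19400425"), pd.NA],
})

# subset: only look at the listed columns when deciding which rows to drop.
# Row 0 is the only row with a missing "toy", so rows 1 and 2 survive.
by_toy = df_demo.dropna(subset=["toy"])
print(by_toy.index.tolist())        # [1, 2]

# thresh: keep rows that have at least this many non-missing values.
# Row 0 has one (name), row 1 has three, row 2 has two.
at_least_two = df_demo.dropna(thresh=2)
print(at_least_two.index.tolist())  # [1, 2]
```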

In [9]:
df.dropna()

     name        toy                 born
1  Batman  Batmobile  1940-04-25 00:00:00


In [10]:
df.dropna(axis=1)

       name
0    Alfred
1    Batman
2  Catwoman


In [12]:
df.dropna(how="all") # drop a row only when every value in it is missing

       name        toy                 born
0    Alfred        NaN                 None
1    Batman  Batmobile  1940-04-25 00:00:00
2  Catwoman   Bullwhip                 <NA>


**fillna: fill missing values with a chosen value**

In [13]:
df.fillna?

df.fillna(
    value=None,
    method=None,
    axis=None,
    inplace=False,
    limit=None,
    downcast=None,
)
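The signature above includes `method` and `limit`. Recent pandas 2.x releases deprecate the `method` argument of `fillna` in favour of the dedicated `ffill()` and `bfill()` methods, so the sketch below uses those instead. The Series `s` is made-up illustration data, not part of the exercise.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# Forward fill: propagate the last valid value downwards.
forward = s.ffill()
print(forward.tolist())          # [1.0, 1.0, 1.0, 4.0, 4.0]

# limit caps how many consecutive missing values are filled.
forward_limited = s.ffill(limit=1)
print(forward_limited.tolist())  # [1.0, 1.0, nan, 4.0, 4.0]

# Backward fill: use the next valid value instead; the trailing NaN stays
# because nothing follows it.
backward = s.bfill()
print(backward.tolist())         # [1.0, 4.0, 4.0, 4.0, nan]
```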

In [14]:
df.fillna(0)

       name        toy                 born
0    Alfred          0                    0
1    Batman  Batmobile  1940-04-25 00:00:00
2  Catwoman   Bullwhip                    0


In [15]:
# fill each column with its own replacement value
values = {"name":"noname", "toy":"Bat", "born":pd.Timestamp("1900-01-01")}

df.fillna(value=values)

       name        toy       born
0    Alfred        Bat 1900-01-01
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip 1900-01-01


### [Exercise 2]

In [16]:
student_list = {
    "name" : ["John", "Nate", "Yuna", "Abraham", "Brian", "Jenny", "Nate", "John"],
    "job" : ["teacher","teacher","teacher","student","student","student","teacher","student"],
    "age" : [40, 35, 37, 10, 12, 11 ,None, None]
}
df = pd.DataFrame(student_list)
df

      name      job   age
0     John  teacher  40.0
1     Nate  teacher  35.0
2     Yuna  teacher  37.0
3  Abraham  student  10.0
4    Brian  student  12.0
5    Jenny  student  11.0
6     Nate  teacher   NaN
7     John  student   NaN


In [17]:
df.shape

(8, 3)

> **Check the overall DataFrame information ( info() :: number of rows and columns, dtypes, memory usage, non-null counts )**

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   name    8 non-null      object 
 1   job     8 non-null      object 
 2   age     6 non-null      float64
dtypes: float64(1), object(2)
memory usage: 320.0+ bytes


In [21]:
df.isnull().sum()

name    0
job     0
age     2
dtype: int64
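Beyond the raw counts above, it is often useful to see what share of each column is missing and which rows are affected. A minimal sketch, rebuilding the exercise data under the illustrative name `df_demo`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Nate", "Yuna", "Abraham", "Brian", "Jenny", "Nate", "John"],
    "job": ["teacher", "teacher", "teacher", "student",
            "student", "student", "teacher", "student"],
    "age": [40, 35, 37, 10, 12, 11, None, None],
})

# isna() yields booleans; their mean is the fraction of missing values per column.
missing_ratio = df_demo.isna().mean()
print(missing_ratio["age"])              # 0.25

# Rows that contain at least one missing value.
rows_with_missing = df_demo[df_demo.isna().any(axis=1)]
print(rows_with_missing.index.tolist())  # [6, 7]
```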

In [22]:
df['age'].fillna(0)

0    40.0
1    35.0
2    37.0
3    10.0
4    12.0
5    11.0
6     0.0
7     0.0
Name: age, dtype: float64

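<b>* An age of 0 makes no sense, so replace the missing values with something plausible</b><br>
<b>* Fill the missing teacher age with a typical value taken from the other teachers, and do the same for the students (the cells that follow use the group median rather than the mean)</b>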
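The note above speaks of an average, while the cells that follow use the median. For comparison, here is a hedged sketch of the mean-based variant; `df_demo`, `group_mean`, and `filled` are illustrative names and the original `df` is left untouched.

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Nate", "Yuna", "Abraham", "Brian", "Jenny", "Nate", "John"],
    "job": ["teacher", "teacher", "teacher", "student",
            "student", "student", "teacher", "student"],
    "age": [40, 35, 37, 10, 12, 11, None, None],
})

# Mean age per job, broadcast back to every row of that job.
# NaN ages are skipped when the mean is computed.
group_mean = df_demo.groupby("job")["age"].transform("mean")

# Fill each missing age with the mean of its own job group.
filled = df_demo["age"].fillna(group_mean)

print(round(filled[6], 2))  # 37.33  (mean of 40, 35, 37)
print(filled[7])            # 11.0   (mean of 10, 12, 11)
```

With so few rows the mean and median are close, but the median is less sensitive to a single unusually old or young person in a group.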

In [23]:
df

      name      job   age
0     John  teacher  40.0
1     Nate  teacher  35.0
2     Yuna  teacher  37.0
3  Abraham  student  10.0
4    Brian  student  12.0
5    Jenny  student  11.0
6     Nate  teacher   NaN
7     John  student   NaN


In [24]:
df.groupby?

In [29]:
# median: the middle value of each job group, repeated for every row in that group
df.groupby('job')['age'].transform('median')

0    37.0
1    37.0
2    37.0
3    11.0
4    11.0
5    11.0
6    37.0
7    11.0
Name: age, dtype: float64
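The output above has eight rows because `transform` returns one value per original row, whereas a plain aggregation returns one value per group. A small sketch of the difference, using the illustrative names `df_demo`, `per_group`, and `per_row`:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "name": ["John", "Nate", "Yuna", "Abraham", "Brian", "Jenny", "Nate", "John"],
    "job": ["teacher", "teacher", "teacher", "student",
            "student", "student", "teacher", "student"],
    "age": [40, 35, 37, 10, 12, 11, None, None],
})

# Aggregation: one median per job, indexed by the group key.
per_group = df_demo.groupby("job")["age"].median()

# Transform: the same medians, but aligned with the original row index,
# which is what makes the result usable as a fill value for df["age"].
per_row = df_demo.groupby("job")["age"].transform("median")

print(per_group.to_dict())           # {'student': 11.0, 'teacher': 37.0}
print(len(per_group), len(per_row))  # 2 8
```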

In [31]:
# assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
df['age'] = df['age'].fillna(df.groupby('job')['age'].transform('median'))
df

      name      job   age
0     John  teacher  40.0
1     Nate  teacher  35.0
2     Yuna  teacher  37.0
3  Abraham  student  10.0
4    Brian  student  12.0
5    Jenny  student  11.0
6     Nate  teacher  37.0
7     John  student  11.0


### [Exercise 3]

In [33]:
df = pd.DataFrame([[np.nan,2,np.nan,0], [3,4,np.nan,1],[np.nan,np.nan,np.nan,5]], columns=list('ABCD'))
df

     A    B   C  D
0  NaN  2.0 NaN  0
1  3.0  4.0 NaN  1
2  NaN  NaN NaN  5


In [34]:
# check for missing values with isna()
df.isna()

       A      B     C      D
0   True  False  True  False
1  False  False  True  False
2   True   True  True  False


In [35]:
# check for missing values with isnull()
df.isnull()

       A      B     C      D
0   True  False  True  False
1  False  False  True  False
2   True   True  True  False
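The two cells above print identical masks because `isnull()` is an alias of `isna()`. A short sketch that checks this and also shows the inverse, `notna()`; `df_demo` is an illustrative copy of the exercise frame.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    [[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5]],
    columns=list("ABCD"),
)

# isnull() is an alias of isna(): both return the same boolean mask.
same_mask = df_demo.isna().equals(df_demo.isnull())
print(same_mask)                 # True

# notna() is the inverse mask; summing it counts valid values per column.
valid_counts = df_demo.notna().sum().tolist()
print(valid_counts)              # [1, 2, 0, 3]
```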


In [36]:
# fill missing values with 0
df.fillna(0)

     A    B    C  D
0  0.0  2.0  0.0  0
1  3.0  4.0  0.0  1
2  0.0  0.0  0.0  5


In [64]:
# fill missing values with a different value per column
# A: 0, B: 1, C: 2, D: 3
values = {"A" : 0, "B" : 1, "C" : 2, "D" : 3}
df.fillna(value=values)

     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
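The per-column mapping above does not have to be typed by hand: `fillna` also accepts a Series indexed by column name. As a hedged sketch, the following fills each column with its own mean; `df_demo`, `col_means`, and `filled` are illustrative names.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    [[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5]],
    columns=list("ABCD"),
)

# Column means, skipping NaN. Column C has no valid values, so its mean is NaN.
col_means = df_demo.mean()

# Each column is filled with its own mean.
filled = df_demo.fillna(col_means)

print(filled["A"].tolist())      # [3.0, 3.0, 3.0]
print(filled["B"].tolist())      # [2.0, 4.0, 3.0]
print(filled["C"].isna().all())  # True: nothing to compute a mean from
```

A column with no valid values at all, like `C`, stays missing under this approach and needs an explicit fallback value.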


In [57]:
# fill every missing value with the median of column D
df.fillna(df['D'].median())

     A    B    C  D
0  1.0  2.0  1.0  0
1  3.0  4.0  1.0  1
2  1.0  1.0  1.0  5


In [60]:
# fill every missing value with the maximum of column D
df.fillna(df['D'].max())

     A    B    C  D
0  5.0  2.0  5.0  0
1  3.0  4.0  5.0  1
2  5.0  5.0  5.0  5
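Every `fillna` call in this exercise displays a filled copy and leaves `df` itself unchanged, which is why each cell starts again from the same missing values. A minimal sketch of keeping the result, with `df_demo` and `filled` as illustrative names:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    [[np.nan, 2, np.nan, 0], [3, 4, np.nan, 1], [np.nan, np.nan, np.nan, 5]],
    columns=list("ABCD"),
)

# fillna returns a new DataFrame; assign it to a name to keep it.
filled = df_demo.fillna(df_demo["D"].max())

# The original still has all six of its missing values.
missing_before = int(df_demo.isna().sum().sum())
missing_after = int(filled.isna().sum().sum())
print(missing_before)  # 6
print(missing_after)   # 0
```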
