# 1. Handling Missing Values

## Explanation
- Thinking carefully about missing values is well worth the effort.
    > It's usually worth it to take the time to go through your data and really look at all the columns with missing values one-by-one to really get to know your dataset.
- A useful question to ask when handling a missing value:
    > Is this value missing because it wasn't recorded or because it doesn't exist?
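    - If the value does not exist in the first place, leave it as missing.
    - If the value exists but was not recorded, try to estimate what it should have been from the other values in the same row and column.
- If the dataset comes with **documentation**, be sure to read it in order to understand what the missing values mean.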
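
One way to start the column-by-column review described above is to rank columns by their share of missing values. The following is a minimal sketch on a toy DataFrame; the data and column names are made up for illustration and are not taken from the NFL dataset.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, for illustration only).
df = pd.DataFrame({
    "down": [1.0, np.nan, 3.0, np.nan],
    "time": ["15:00", "14:53", None, "14:16"],
    "qtr": [1, 1, 1, 1],
})

# isnull() yields booleans, so the column mean is the fraction of missing cells.
missing_fraction = df.isnull().mean().sort_values(ascending=False)
print(missing_fraction)
```

Columns at the top of this ranking are the ones to look up in the documentation first.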

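## Ways to Handle Missing Values
**1. drop**
    - Not recommended. Use it only when you are short on time or have no way of understanding why the values are missing.
    
**2. fill in**
    - Replace the missing value with another appropriate value.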
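
What counts as an "appropriate value" depends on the column. As a rough sketch, assuming a numeric column whose values were simply not recorded and a second column whose values may not exist at all, `fillna` accepts a dict so that only the chosen columns are filled. The column names and the choice of the median are assumptions made for this example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns, for illustration only).
df = pd.DataFrame({
    "yards": [5.0, np.nan, 12.0, 3.0],
    "penalty_team": ["NE", None, None, "TEN"],
})

# Fill only the column where a value should exist but was not recorded.
# The median is just one possible estimate.
fill_values = {"yards": df["yards"].median()}
filled = df.fillna(fill_values)

# "penalty_team" is not in the dict, so its missing entries stay missing.
print(filled)
```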

## Code

In [1]:
import pandas as pd
import numpy as np

In [2]:
nfl_data = pd.read_csv('data/NFL Play by Play 2009-2017 (v4).csv')


In [5]:
# Number of missing values per column
missing_values_count = nfl_data.isnull().sum()

missing_values_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

In [7]:
# What percentage of the cells in the dataset are missing?
# nfl_data.shape = (407688, 102)
total_cells = np.prod(nfl_data.shape)  # np.product was removed in NumPy 2.0
total_missing = missing_values_count.sum()

percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

24.87214126835169


In [9]:
# drop
# Drop every column that contains at least one missing value
columns_with_na_dropped = nfl_data.dropna(axis=1)

print("Columns in original dataset: %d \n" % nfl_data.shape[1])
print("Columns after dropping: %d" % columns_with_na_dropped.shape[1])

Columns in original dataset: 102 

Columns after dropping: 41
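
Dropping columns is only one variant of `dropna`. The sketch below, on a made-up DataFrame, compares dropping rows, dropping columns, and the `thresh` parameter, which keeps a column only if it has at least the given number of non-missing values. The threshold of 3 is an arbitrary choice for the example.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, for illustration only).
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [1.0, np.nan, 3.0, 4.0],
    "c": [np.nan, np.nan, np.nan, 4.0],
})

rows_dropped = df.dropna()                     # keep rows with no missing values
cols_dropped = df.dropna(axis=1)               # keep columns with no missing values
mostly_complete = df.dropna(axis=1, thresh=3)  # keep columns with >= 3 non-missing values

print(rows_dropped.shape, cols_dropped.shape, mostly_complete.shape)
```

Using `thresh` is a middle ground: it discards columns that are mostly empty while keeping those with only a few gaps, which can then be filled.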


In [10]:
# fill in

subset_nfl_data = nfl_data.loc[:, 'EPA':'Season'].head()

# NA -> 0
subset_nfl_data.fillna(0)

| | EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.014474 | 0.0 | 0.0 | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | 0.0 | 0.0 | 2009 |
| 1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | -1.40276 | 0.0 | 0.0 | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | 0.0 | 0.0 | 2009 |
| 3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2.097796 | 0.0 | 0.0 | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | 0.0 | 0.0 | 2009 |


In [11]:
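# First fill each NaN with the next valid value below it in the same column; replace any NaN that remains with 0
# (fillna(method='bfill') is deprecated in recent pandas; DataFrame.bfill is the equivalent call)
subset_nfl_data.bfill(axis=0).fillna(0)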

| | EPA | airEPA | yacEPA | Home_WP_pre | Away_WP_pre | Home_WP_post | Away_WP_post | Win_Prob | WPA | airWPA | yacWPA | Season |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2.014474 | -1.068169 | 1.146076 | 0.485675 | 0.514325 | 0.546433 | 0.453567 | 0.485675 | 0.060758 | -0.032244 | 0.036899 | 2009 |
| 1 | 0.077907 | -1.068169 | 1.146076 | 0.546433 | 0.453567 | 0.551088 | 0.448912 | 0.546433 | 0.004655 | -0.032244 | 0.036899 | 2009 |
| 2 | -1.40276 | 3.318841 | -5.031425 | 0.551088 | 0.448912 | 0.510793 | 0.489207 | 0.551088 | -0.040295 | 0.106663 | -0.156239 | 2009 |
| 3 | -1.712583 | 3.318841 | -5.031425 | 0.510793 | 0.489207 | 0.461217 | 0.538783 | 0.510793 | -0.049576 | 0.106663 | -0.156239 | 2009 |
| 4 | 2.097796 | 0.0 | 0.0 | 0.461217 | 0.538783 | 0.558929 | 0.441071 | 0.461217 | 0.097712 | 0.0 | 0.0 | 2009 |
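
In the output above, row 4 still shows 0.0 in the columns that were missing, because the five-row subset has no later row to copy from and the trailing `fillna(0)` takes over. The sketch below reproduces that behavior on a small made-up Series.

```python
import numpy as np
import pandas as pd

# Toy Series (hypothetical values): one gap before a valid value, two gaps after it.
s = pd.Series([np.nan, 1.5, np.nan, np.nan])

# bfill copies the next valid value upward; trailing gaps have nothing to copy from.
backfilled = s.bfill()

# The fallback replaces whatever bfill could not reach.
result = backfilled.fillna(0)
print(result.tolist())
```

Backfilling is order-dependent: sorting or subsetting the rows differently changes which values get copied, so the result should be read with the row order in mind.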
