## Data Handling and Preprocessing

## 04. Handling Missing Values

<img src = "https://images.unsplash.com/photo-1611329857570-f02f340e7378?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1470&q=80" width=80% align="center"/>

<div align="right">Photo by <a href="https://unsplash.com/@sigmund?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Sigmund</a> on <a href="https://unsplash.com/ko/%EC%82%AC%EC%A7%84/B-x4VaIriRc?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></div>
  
  

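Missing values contribute no usable information to an analysis, so they are usually either removed or replaced with another value. <br>
The main ways to handle them are listed below.

#### Ways to handle missing values
- If a column has many missing values, drop the column.
- If a column has only a modest number of missing values, drop just the rows that contain them.
- If rows with missing values should not be deleted (for example, only a few values are missing, or the dataset is small overall), replace the missing values with other values.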
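
Choosing among these three options usually starts from how much of each column is missing. The sketch below is only an illustration: the DataFrame `toy`, its column names, and the 0.5 cutoff are assumptions made for the example and do not come from the exercise data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the columns and the 0.5 cutoff are illustrative assumptions.
toy = pd.DataFrame({
    "a": [1.0, np.nan, np.nan, np.nan],   # 75% missing
    "b": [1.0, 2.0, np.nan, 4.0],         # 25% missing
    "c": [1.0, 2.0, 3.0, 4.0],            # complete
})

# Fraction of missing values per column (mean of the True/False mask).
missing_ratio = toy.isnull().mean()

# 1) Drop columns whose missing ratio exceeds the cutoff.
cols_to_drop = missing_ratio[missing_ratio > 0.5].index
reduced = toy.drop(columns=cols_to_drop)

# 2) Drop the remaining rows that still contain a missing value ...
rows_dropped = reduced.dropna()

# 3) ... or keep every row and fill the gaps instead (here with each column's median).
filled = reduced.fillna(reduced.median())
```

Which cutoff is appropriate depends on the dataset and on how important the column is, so the value used here should not be read as a recommendation.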

## 0. Loading the Data
- ### Data description
    1. preprocessing_04.csv : data from the previous exercise, with unnecessary columns removed and column names changed 
> - MovieId : (int) movie ID <br>
> - ImdbId : (int) ID used by the IMDb database<br>
> - TmdbId : (float) ID used by the TMDB database<br>
> - Title : (object) movie title <br> 
> - Year : (int) production year <br> 
> - Genres : (object) movie genres, multiple genres separated by '|' <br>
> - UserId : (int) user ID <br>
> - Rating : (float) movie rating <br>
> - Timestamp : (object) time at which the rating was written <br>
> - Gender : (object) gender, M/F <br>
> - Age : (int) age<br>
> - Occupation : (object) occupation<br>

In [1]:
# Import libraries
import pandas as pd

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
# Load the data
df = pd.read_csv("./data/preprocessing_04.csv")

In [3]:
# Inspect a sample of the data
df.head()

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
0,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03,F,2.0,K-12 student
1,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02,M,30.0,writer
2,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25 06:52:26,M,39.0,academic/educator
3,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13 12:59:30,M,29.0,executive/managerial
4,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18 05:28:03,M,52.0,academic/educator


---

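### 1. Checking for missing values : isnull()
You can see which rows of the DataFrame contain missing values, and also count the missing values in each column.

#### 1-1. Locating rows with missing values
isnull() returns True at every position where the value is missing.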

In [4]:
df.isnull()

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
0,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...
106279,True,True,True,True,True,True,False,True,True,False,False,False
106280,True,True,True,True,True,True,False,True,True,False,False,False
106281,True,True,True,True,True,True,False,True,True,False,False,False
106282,True,True,True,True,True,True,False,True,True,False,False,False


#### 1-2. Counting missing values per column
Counting is straightforward because sum treats True as 1 and False as 0.

In [5]:
df.isnull().sum()

MovieId       5430
ImdbId        5430
TmdbId        5443
Title         5430
Year          5430
Genres        5430
UserId          18
Rating        5448
Timestamp     5448
Gender          18
Age             18
Occupation      18
dtype: int64
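
The same Boolean mask can be summed in other directions as well. The following is a small sketch on a hypothetical DataFrame `toy` (not the exercise data) that shows column-wise, row-wise, and overall counts.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data used only to illustrate the counting rule.
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [np.nan, np.nan, "k"]})

mask = toy.isnull()            # Boolean DataFrame: True where a value is missing
per_column = mask.sum()        # column-wise count: x -> 1, y -> 2
per_row = mask.sum(axis=1)     # row-wise count: 1, 2, 0
total = int(mask.sum().sum())  # overall count: 3
```

The row-wise count is handy when deciding on a `thresh` value for dropna(), since it shows how incomplete each row is.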

#### 🔖 The pandas 'info()' method also lets you check, in an exploratory way, whether missing values are present.

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106284 entries, 0 to 106283
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   MovieId     100854 non-null  float64
 1   ImdbId      100854 non-null  float64
 2   TmdbId      100841 non-null  float64
 3   Title       100854 non-null  object 
 4   Year        100854 non-null  float64
 5   Genres      100854 non-null  object 
 6   UserId      106266 non-null  float64
 7   Rating      100836 non-null  float64
 8   Timestamp   100836 non-null  object 
 9   Gender      106266 non-null  object 
 10  Age         106266 non-null  float64
 11  Occupation  106266 non-null  object 
dtypes: float64(7), object(5)
memory usage: 9.7+ MB


---

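### 2. Removing missing values : dropna()
When values are missing completely at random, deleting them is often the simplest safe choice; when the missingness follows a pattern, deletion can bias the result.<br>
dropna() is the pandas function that removes missing data such as NA/NaN.<br>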
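- Options
> - axis : 0 (default) drops rows, 1 drops columns <br>
> - how : 'any' (default) drops a row if any value is missing, 'all' drops it only if every value is missing <br>
> - subset : checks only the specified columns for missing values; pass a list to name several columns <br>
> - thresh : minimum number of non-missing values a row must have to be kept; rows with fewer are dropped <br>
> - inplace : if True, modifies the DataFrame directly instead of returning a new one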
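
The options above can be compared side by side on a tiny example. This is a minimal sketch on a hypothetical DataFrame `toy`; the column names and values are assumptions chosen so that each option keeps a different set of rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 is complete, row 3 is entirely missing.
toy = pd.DataFrame({
    "id":    [1.0, 2.0, np.nan, np.nan],
    "score": [9.0, np.nan, np.nan, np.nan],
    "name":  ["a", "b", "c", None],
})

any_missing = toy.dropna()              # keeps only row 0 (no missing values at all)
all_missing = toy.dropna(how="all")     # drops only row 3 (every value missing)
by_subset = toy.dropna(subset=["id"])   # drops rows 2 and 3 (id is missing)
by_thresh = toy.dropna(thresh=2)        # keeps rows with at least 2 non-missing values
no_na_cols = toy.dropna(axis=1)         # drops every column that has a missing value
```

Because every column of `toy` has at least one missing value, `no_na_cols` ends up with four rows and no columns at all.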


#### 2-1. Listwise deletion
Removes every row or column that contains a missing value.
- Drop all rows that contain a missing value

In [7]:
df_temp = df.dropna()

In [8]:
df_temp.isnull().sum()

MovieId       0
ImdbId        0
TmdbId        0
Title         0
Year          0
Genres        0
UserId        0
Rating        0
Timestamp     0
Gender        0
Age           0
Occupation    0
dtype: int64

- Drop all columns that contain a missing value

In [9]:
df.dropna(axis=1)

0
1
2
3
4
...
106279
106280
106281
106282
106283


#### 2-2. Conditional deletion (Pairwise)
Rows are dropped only when a given condition is met; rows that still hold values in other variables are kept.

- Drop only the rows in which every value is missing

In [10]:
df.dropna(how ='all')

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
0,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03,F,2.0,K-12 student
1,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02,M,30.0,writer
2,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25 06:52:26,M,39.0,academic/educator
3,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13 12:59:30,M,29.0,executive/managerial
4,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18 05:28:03,M,52.0,academic/educator
...,...,...,...,...,...,...,...,...,...,...,...,...
106279,,,,,,,6036.0,,,F,30.0,scientist
106280,,,,,,,6037.0,,,F,46.0,academic/educator
106281,,,,,,,6038.0,,,F,79.0,academic/educator
106282,,,,,,,6039.0,,,F,48.0,other


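- Drop rows that have fewer than n non-missing values (with 12 columns, thresh=6 drops rows that have more than 6 missing values)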

In [15]:
df.dropna(thresh=6)

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
0,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03,F,2.0,K-12 student
1,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02,M,30.0,writer
2,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25 06:52:26,M,39.0,academic/educator
3,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13 12:59:30,M,29.0,executive/managerial
4,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18 05:28:03,M,52.0,academic/educator
...,...,...,...,...,...,...,...,...,...,...,...,...
100849,193581.0,5476944.0,432131.0,Black Butler: Book of the Atlantic,2017.0,Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42,F,33.0,other
100850,193583.0,5914996.0,445030.0,No Game No Life: Zero,2017.0,Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25,F,33.0,other
100851,193585.0,6397426.0,479308.0,Flint,2017.0,Drama,184.0,3.5,2018-09-16 14:56:45,F,33.0,other
100852,193587.0,8391976.0,483455.0,Bungo Stray Dogs: Dead Apple,2018.0,Action|Animation,184.0,3.5,2018-09-16 15:00:21,F,33.0,other


- Drop a row only when one of the specified columns is missing

In [16]:
# Drop rows where MovieId is missing
df.dropna(subset=['MovieId'])

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
0,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03,F,2.0,K-12 student
1,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02,M,30.0,writer
2,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25 06:52:26,M,39.0,academic/educator
3,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13 12:59:30,M,29.0,executive/managerial
4,1.0,114709.0,862.0,Toy Story,1995.0,Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18 05:28:03,M,52.0,academic/educator
...,...,...,...,...,...,...,...,...,...,...,...,...
100849,193581.0,5476944.0,432131.0,Black Butler: Book of the Atlantic,2017.0,Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42,F,33.0,other
100850,193583.0,5914996.0,445030.0,No Game No Life: Zero,2017.0,Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25,F,33.0,other
100851,193585.0,6397426.0,479308.0,Flint,2017.0,Drama,184.0,3.5,2018-09-16 14:56:45,F,33.0,other
100852,193587.0,8391976.0,483455.0,Bungo Stray Dogs: Dead Apple,2018.0,Action|Animation,184.0,3.5,2018-09-16 15:00:21,F,33.0,other


In [17]:
df_temp = df.dropna(subset=['MovieId'])

In [18]:
df_temp.isnull().sum()

MovieId        0
ImdbId         0
TmdbId        13
Title          0
Year           0
Genres         0
UserId        18
Rating        18
Timestamp     18
Gender        18
Age           18
Occupation    18
dtype: int64

#### 🔖 Drop the rows where 'MovieId' is missing, and set the inplace option to True so that the change is stored in DataFrame df.

In [23]:
df.dropna(subset=['MovieId'], inplace=True)
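
One detail worth keeping in mind is that a call with inplace=True modifies the DataFrame and returns None, so its result should not be assigned back. A minimal sketch on a hypothetical two-row DataFrame `toy`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing id.
toy = pd.DataFrame({"id": [1.0, np.nan], "v": [1, 2]})

# inplace=True changes toy itself; the call returns None.
result = toy.dropna(subset=["id"], inplace=True)

# Equivalent form without inplace: assign the returned DataFrame to a name.
other = pd.DataFrame({"id": [1.0, np.nan], "v": [1, 2]})
other = other.dropna(subset=["id"])
```

Writing `df = df.dropna(..., inplace=True)` would therefore replace `df` with None.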

---

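### 3. Filling missing values : fillna(), replace()
Missing values are replaced with a chosen value and the result is stored.<br>
Commonly used replacement values are summary statistics such as the mode, median, or mean, <br>
and fillna() also has a method option that fills with neighboring values such as the previous value (ffill) or the next value (bfill). 

- Frequently used options
> - value : a specific value, or a summary statistic, to use as the replacement <br>
> - method : fills with a neighboring value, either the previous one (ffill) or the next one (bfill); deprecated since pandas 2.1 in favor of the ffill() and bfill() methods<br>
> - inplace : if True, stores the filled result directly in the DataFrame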
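
The value option also accepts a dict, which makes it possible to give each column its own replacement in a single call. The sketch below uses a hypothetical DataFrame `toy`; the column names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one numeric and one categorical column.
toy = pd.DataFrame({
    "age":    [20.0, np.nan, 40.0],
    "gender": ["F", None, "F"],
})

# A dict maps each column name to its own replacement value:
# the mean for the numeric column, the mode for the categorical one.
fill_values = {"age": toy["age"].mean(), "gender": toy["gender"].mode()[0]}
filled = toy.fillna(value=fill_values)
```

Columns that are not listed in the dict are left untouched.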

In [24]:
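# Temporary DataFrames for observing the effect of fillna()
# .copy() makes them independent of df, so later changes do not touch the original
df_na_sample1 = df[df.index > 23777].head().copy()
df_na_sample2 = df[df.index > 22817].head().copy()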

In [25]:
# Sample with missing values in 'TmdbId'
df_na_sample1

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
23778,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,492.0,4.0,1997-05-18 17:29:27,M,27.0,programmer
23779,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,544.0,5.0,1996-12-15 22:26:16,M,36.0,programmer
23780,1107.0,102336.0,,Loser,1991.0,Comedy,160.0,3.5,2004-02-25 22:50:09,M,41.0,executive/managerial
23781,1107.0,102336.0,,Loser,1991.0,Comedy,298.0,0.5,2016-11-13 19:31:35,F,19.0,college/grad student
23782,1111.0,117040.0,9305.0,Microcosmos (Microcosmos: Le peuple de l'herbe),1996.0,Documentary,105.0,4.0,2015-11-22 12:39:05,M,45.0,programmer


In [27]:
# Sample with missing values in user and rating fields such as 'UserId', 'Age', and 'Rating'
df_na_sample2

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
22818,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,605.0,4.0,2010-06-22 03:08:04,F,20.0,college/grad student
22819,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,608.0,3.5,2005-05-31 01:27:46,M,20.0,college/grad student
22820,1076.0,55018.0,16372.0,"Innocents, The",1961.0,Drama|Horror|Thriller,,,,,,
22821,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,4.0,5.0,1999-12-14 12:14:52,M,45.0,executive/managerial
22822,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,19.0,3.0,2000-08-08 03:00:05,M,3.0,K-12 student


#### 3-1. Replacing missing values with a single value

- Replace every missing value with one fixed value

In [28]:
df_na_sample1.fillna(0)

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
23778,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,492.0,4.0,1997-05-18 17:29:27,M,27.0,programmer
23779,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,544.0,5.0,1996-12-15 22:26:16,M,36.0,programmer
23780,1107.0,102336.0,0.0,Loser,1991.0,Comedy,160.0,3.5,2004-02-25 22:50:09,M,41.0,executive/managerial
23781,1107.0,102336.0,0.0,Loser,1991.0,Comedy,298.0,0.5,2016-11-13 19:31:35,F,19.0,college/grad student
23782,1111.0,117040.0,9305.0,Microcosmos (Microcosmos: Le peuple de l'herbe),1996.0,Documentary,105.0,4.0,2015-11-22 12:39:05,M,45.0,programmer


- Replace the missing values of a specific column with another value

In [31]:
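# Replace missing values with 0 in one column only
# (assign the result back; inplace=True on a column selection is unreliable in recent pandas)
df_na_sample2['UserId'] = df_na_sample2['UserId'].fillna(0)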

In [32]:
df_na_sample2

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
22818,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,605.0,4.0,2010-06-22 03:08:04,F,20.0,college/grad student
22819,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,608.0,3.5,2005-05-31 01:27:46,M,20.0,college/grad student
22820,1076.0,55018.0,16372.0,"Innocents, The",1961.0,Drama|Horror|Thriller,0.0,,,,,
22821,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,4.0,5.0,1999-12-14 12:14:52,M,45.0,executive/managerial
22822,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,19.0,3.0,2000-08-08 03:00:05,M,3.0,K-12 student


#### 3-2. Filling missing values with summary statistics

- Replace the missing values of a specific column with the mean

In [33]:
# Fill with the column mean and assign the result back to the column
df_na_sample2['Age'] = df_na_sample2['Age'].fillna(df_na_sample2['Age'].mean())

In [34]:
df_na_sample2

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
22818,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,605.0,4.0,2010-06-22 03:08:04,F,20.0,college/grad student
22819,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,608.0,3.5,2005-05-31 01:27:46,M,20.0,college/grad student
22820,1076.0,55018.0,16372.0,"Innocents, The",1961.0,Drama|Horror|Thriller,0.0,,,,22.0,
22821,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,4.0,5.0,1999-12-14 12:14:52,M,45.0,executive/managerial
22822,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,19.0,3.0,2000-08-08 03:00:05,M,3.0,K-12 student


- Replace the missing values of a specific column with the mode

In [35]:
df['Gender'].mode()

0    M
Name: Gender, dtype: object

In [36]:
df_na_sample2['Gender'].mode()[0]

'M'

In [37]:
# mode() returns a Series, so take its first element as the fill value
df_na_sample2['Gender'] = df_na_sample2['Gender'].fillna(df_na_sample2['Gender'].mode()[0])

In [38]:
df_na_sample2

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
22818,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,605.0,4.0,2010-06-22 03:08:04,F,20.0,college/grad student
22819,1073.0,67992.0,252.0,Willy Wonka & the Chocolate Factory,1971.0,Children|Comedy|Fantasy|Musical,608.0,3.5,2005-05-31 01:27:46,M,20.0,college/grad student
22820,1076.0,55018.0,16372.0,"Innocents, The",1961.0,Drama|Horror|Thriller,0.0,,,M,22.0,
22821,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,4.0,5.0,1999-12-14 12:14:52,M,45.0,executive/managerial
22822,1077.0,70707.0,11561.0,Sleeper,1973.0,Comedy|Sci-Fi,19.0,3.0,2000-08-08 03:00:05,M,3.0,K-12 student


#### 3-3. Filling missing values with neighboring values

- Fill missing values with the previous value

In [39]:
df_na_sample1

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
23778,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,492.0,4.0,1997-05-18 17:29:27,M,27.0,programmer
23779,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,544.0,5.0,1996-12-15 22:26:16,M,36.0,programmer
23780,1107.0,102336.0,,Loser,1991.0,Comedy,160.0,3.5,2004-02-25 22:50:09,M,41.0,executive/managerial
23781,1107.0,102336.0,,Loser,1991.0,Comedy,298.0,0.5,2016-11-13 19:31:35,F,19.0,college/grad student
23782,1111.0,117040.0,9305.0,Microcosmos (Microcosmos: Le peuple de l'herbe),1996.0,Documentary,105.0,4.0,2015-11-22 12:39:05,M,45.0,programmer


In [40]:
df_na_sample1.isna().sum()

MovieId       0
ImdbId        0
TmdbId        2
Title         0
Year          0
Genres        0
UserId        0
Rating        0
Timestamp     0
Gender        0
Age           0
Occupation    0
dtype: int64

In [41]:
# fillna(method='ffill') is deprecated since pandas 2.1; ffill() gives the same result
df_na_sample1.ffill()

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
23778,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,492.0,4.0,1997-05-18 17:29:27,M,27.0,programmer
23779,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,544.0,5.0,1996-12-15 22:26:16,M,36.0,programmer
23780,1107.0,102336.0,25750.0,Loser,1991.0,Comedy,160.0,3.5,2004-02-25 22:50:09,M,41.0,executive/managerial
23781,1107.0,102336.0,25750.0,Loser,1991.0,Comedy,298.0,0.5,2016-11-13 19:31:35,F,19.0,college/grad student
23782,1111.0,117040.0,9305.0,Microcosmos (Microcosmos: Le peuple de l'herbe),1996.0,Documentary,105.0,4.0,2015-11-22 12:39:05,M,45.0,programmer


- Fill missing values with the next value

In [42]:
# fillna(method='bfill') is deprecated since pandas 2.1; bfill() gives the same result
df_na_sample1.bfill()

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
23778,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,492.0,4.0,1997-05-18 17:29:27,M,27.0,programmer
23779,1105.0,115885.0,25750.0,Children of the Corn IV: The Gathering,1996.0,Horror,544.0,5.0,1996-12-15 22:26:16,M,36.0,programmer
23780,1107.0,102336.0,9305.0,Loser,1991.0,Comedy,160.0,3.5,2004-02-25 22:50:09,M,41.0,executive/managerial
23781,1107.0,102336.0,9305.0,Loser,1991.0,Comedy,298.0,0.5,2016-11-13 19:31:35,F,19.0,college/grad student
23782,1111.0,117040.0,9305.0,Microcosmos (Microcosmos: Le peuple de l'herbe),1996.0,Documentary,105.0,4.0,2015-11-22 12:39:05,M,45.0,programmer
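
In the two outputs above, the missing TmdbId of 'Loser' is filled with the ID of a different movie (25750 from the row before, 9305 from the row after), because neighbor filling only looks at row order. One way to limit this, sketched here on a hypothetical DataFrame `toy` and under the assumption that a value should only be carried between rows of the same movie, is to fill within groups.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data shaped like the sample: three movies, some TmdbId values missing.
toy = pd.DataFrame({
    "MovieId": [1105, 1105, 1107, 1107, 1111],
    "TmdbId":  [25750.0, np.nan, np.nan, np.nan, 9305.0],
})

# Plain forward fill carries 25750 into the rows of movie 1107.
plain = toy["TmdbId"].ffill()

# Group-wise forward fill only copies values between rows of the same MovieId,
# so the rows of movie 1107 stay missing.
grouped = toy.groupby("MovieId")["TmdbId"].ffill()
```

Values that remain missing after the group-wise fill have no valid neighbor inside their group and need another strategy, such as a lookup from an external source.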


---

## Exercises

#### Q1. Check the number of missing values in DataFrame 'df'.

In [44]:
# Write your code here.
df.isna().sum()

MovieId        0
ImdbId         0
TmdbId        13
Title          0
Year           0
Genres         0
UserId        18
Rating        18
Timestamp     18
Gender        18
Age           18
Occupation    18
dtype: int64

#### Q2. In DataFrame 'df', drop the rows where 'TmdbId' is missing and store the resulting DataFrame in the variable 'df_2'.
    📌 Use the 'subset' option of dropna().

In [47]:
# Write your code here.
df_2 = df.dropna(subset=['TmdbId'])

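#### Q3. In DataFrame 'df_2', fill the missing values of 'Rating' and 'Age' with the median and save the result to 'df_2'. 
    📌 Compute the median of each column separately and fill with that value. Use the median() function. 

In [51]:
# Write your code here.
df_2['Rating'] = df_2['Rating'].fillna(df_2['Rating'].median())
df_2['Age'] = df_2['Age'].fillna(df_2['Age'].median())
df_2.isna().sum()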


MovieId        0
ImdbId         0
TmdbId         0
Title          0
Year           0
Genres         0
UserId        18
Rating         0
Timestamp     18
Gender        18
Age            0
Occupation    18
dtype: int64

#### Q4. In DataFrame 'df_2', fill the missing values of 'Gender' and 'Occupation' with the mode and save the result to 'df_2'. 
    📌 Compute the mode of each column separately and fill with that value. Use the mode() function. 

In [55]:
# Write your code here.
df_2['Gender'] = df_2['Gender'].fillna(df_2['Gender'].mode()[0])
df_2['Occupation'] = df_2['Occupation'].fillna(df_2['Occupation'].mode()[0])


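#### Q5. In DataFrame 'df_2', 'UserId' is a member ID assigned as sequential numbers. <br> &emsp; For the missing 'UserId' values, find the last member number, fill the missing values with that number + 1, and save the result to 'df_2'.
    📌 Use the max() function to find the largest value in a column: df['column_name'].max()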

In [62]:
df_2[df_2.UserId.isna()]

Unnamed: 0,MovieId,ImdbId,TmdbId,Title,Year,Genres,UserId,Rating,Timestamp,Gender,Age,Occupation
22820,1076.0,55018.0,16372.0,"Innocents, The",1961.0,Drama|Horror|Thriller,,3.5,,M,32.0,college/grad student
49539,2939.0,46126.0,19997.0,Niagara,1953.0,Drama|Thriller,,3.5,,M,32.0,college/grad student
53555,3338.0,97372.0,20423.0,For All Mankind,1989.0,Documentary,,3.5,,M,32.0,college/grad student
54467,3456.0,191043.0,17078.0,"Color of Paradise, The (Rang-e khoda)",1999.0,Drama,,3.5,,M,32.0,college/grad student
60535,4194.0,37800.0,56137.0,I Know Where I'm Going!,1945.0,Drama|Romance|War,,3.5,,M,32.0,college/grad student
68396,5721.0,82175.0,25773.0,"Chosen, The",1981.0,Drama,,3.5,,M,32.0,college/grad student
71896,6668.0,235060.0,12276.0,"Road Home, The (Wo de fu qin mu qin)",1999.0,Drama|Romance,,3.5,,M,32.0,college/grad student
72428,6849.0,66344.0,13765.0,Scrooge,1970.0,Drama|Fantasy|Musical,,3.5,,M,32.0,college/grad student
73354,7020.0,102721.0,14904.0,Proof,1991.0,Comedy|Drama|Romance,,3.5,,M,32.0,college/grad student
75450,7792.0,71970.0,17365.0,"Parallax View, The",1974.0,Thriller,,3.5,,M,32.0,college/grad student


In [71]:
# Write your code here.
df_2['UserId'] = df_2['UserId'].fillna(df_2['UserId'].max() + 1)

#### Q6. If any column of DataFrame 'df_2' still has missing values, drop that column and save the result to 'df_2'.
    📌 To drop columns that contain missing values, change the axis option. 

In [72]:
df_2.isna().sum()

MovieId        0
ImdbId         0
TmdbId         0
Title          0
Year           0
Genres         0
UserId         0
Rating         0
Timestamp     18
Gender         0
Age            0
Occupation     0
dtype: int64

In [76]:
# Write your code here.
df_2.dropna(axis=1, inplace=True)



---