## Data Handling and Preprocessing

## 02. Combining and Sorting Data

<img src = "https://images.unsplash.com/photo-1544383835-bda2bc66a55d?ixlib=rb-4.0.3&ixid=MnwxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8&auto=format&fit=crop&w=1736&q=80" width=80% align="center"/>

<div align="right">Photo by <a href="https://unsplash.com/@jankolar?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Jan Antonin Kolar</a> on <a href="https://unsplash.com/ko/%EC%82%AC%EC%A7%84/lRoX0shwjUQ?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a></div>
  
  

- Data description
1. movies.csv : movie information stored in MovieLens
> - movieId : (int) movie ID <br>
> - title : (object) movie title and release year <br>
> - genres : (object) movie genres; multiple genres are separated by '|'

2. links.csv : mapping from MovieLens movie IDs to the IDs used by external movie databases such as IMDb and TMDB
> - movieId : (int) movie ID <br>
> - imdbId : (int) ID used by the IMDb database<br>
> - tmdbId : (float) ID used by the TMDB database<br>

3. ratings.csv : movie ratings submitted by users
> - userId : (int) user ID <br>
> - movieId : (int) movie ID <br>
> - rating : (float) rating <br>
> - timestamp : (object) date and time the rating was submitted

4. users.csv : information about registered MovieLens users
> - userId : (int) user ID <br>
> - gender : (object) gender, M/F <br>
> - age : (int) age<br>
> - occupation : (object) occupation<br>
> - zipcode : (object) ZIP code

### 0. Loading the Data

In [1]:
# Import libraries
import pandas as pd

import warnings
warnings.filterwarnings(action='ignore')

In [2]:
df_movies = pd.read_csv("./data/movies.csv")
df_links = pd.read_csv("./data/links.csv")
df_ratings = pd.read_csv("./data/ratings.csv")
df_users = pd.read_csv("./data/users.csv")

In [3]:
df_movies.head(3)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance


In [4]:
df_links.head(3)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0
2,3,113228,15602.0


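### 1. Concatenating DataFrames : concat()
The concat function simply joins DataFrames together without using a key column.


- Basic usage
> - pd.concat([df1, df2]): concatenate the DataFrames df1 and df2

- Options
> - axis: axis to concatenate along ( 0: stack vertically, 1: place side by side )<br>
> - join: how to combine the labels on the other axis ( 'outer': union, 'inner': intersection )


#### 1-1 Basic usage
#### Concatenate the DataFrames df_movies and df_ratings using concat.
By default, concat stacks the rows vertically. Because each DataFrame keeps its original index, index values can be duplicated in the result.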


In [5]:
df_concat = pd.concat([df_movies, df_ratings])

In [6]:
df_concat

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,,,
1,2,Jumanji (1995),Adventure|Children|Fantasy,,,
2,3,Grumpier Old Men (1995),Comedy|Romance,,,
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,,,
4,5,Father of the Bride Part II (1995),Comedy,,,
...,...,...,...,...,...,...
100831,166534,,,610.0,4.0,2017-05-03 21:53:22
100832,168248,,,610.0,5.0,2017-05-03 22:21:31
100833,168250,,,610.0,5.0,2017-05-08 19:50:47
100834,168252,,,610.0,5.0,2017-05-03 21:19:12


In [7]:
df_concat[df_concat.index == 0]

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,,,
0,1,,,1.0,4.0,2000-07-30 18:45:03
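The duplicated index label 0 above can be avoided at concatenation time. The sketch below uses two tiny made-up frames (not the MovieLens files) and the `ignore_index` option of `pd.concat` to renumber the rows.

```python
import pandas as pd

# Two tiny stand-in frames (hypothetical data, not the MovieLens files).
movies = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
ratings = pd.DataFrame({"movieId": [1, 1], "rating": [4.0, 3.5]})

# Default: each frame keeps its own labels, so 0 and 1 appear twice.
stacked = pd.concat([movies, ratings])
print(stacked.index.tolist())      # [0, 1, 0, 1]

# ignore_index=True discards the old labels and numbers the rows 0..n-1.
renumbered = pd.concat([movies, ratings], ignore_index=True)
print(renumbered.index.tolist())   # [0, 1, 2, 3]
```

If the original labels carry no meaning, `ignore_index=True` is usually the simpler choice; otherwise `reset_index()` (section 4) can be applied afterwards.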


#### 1-2 Concatenate side by side (column-wise).
Setting axis=1 places the DataFrames next to each other, column-wise.

In [8]:
pd.concat([df_movies, df_ratings], axis=1)

Unnamed: 0,movieId,title,genres,userId,movieId.1,rating,timestamp
0,1.0,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,1,4.0,2000-07-30 18:45:03
1,2.0,Jumanji (1995),Adventure|Children|Fantasy,1,3,4.0,2000-07-30 18:20:47
2,3.0,Grumpier Old Men (1995),Comedy|Romance,1,6,4.0,2000-07-30 18:37:04
3,4.0,Waiting to Exhale (1995),Comedy|Drama|Romance,1,47,5.0,2000-07-30 19:03:35
4,5.0,Father of the Bride Part II (1995),Comedy,1,50,5.0,2000-07-30 18:48:51
...,...,...,...,...,...,...,...
100831,,,,610,166534,4.0,2017-05-03 21:53:22
100832,,,,610,168248,5.0,2017-05-03 22:21:31
100833,,,,610,168250,5.0,2017-05-08 19:50:47
100834,,,,610,168252,5.0,2017-05-03 21:19:12
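With `axis=1`, rows are paired by index label rather than by meaning, which is why the movie columns above are empty once the shorter frame runs out of labels. A minimal sketch with made-up frames:

```python
import pandas as pd

# Hypothetical frames whose index labels only partly overlap.
left = pd.DataFrame({"a": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"b": [10.0, 20.0]}, index=[2, 3])

# axis=1 places the frames side by side and aligns rows on index labels.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)

# Label 2 exists in both frames, so its row is fully populated;
# labels 0, 1 and 3 exist in only one frame and get NaN on the other side.
```

Note that column `a` becomes float in the result because it now has to hold NaN, the same effect that turns `movieId` into `1.0`, `2.0`, ... in the output above.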


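#### 1-3 Use join='inner' to keep only the labels that exist in both DataFrames. When stacking vertically (axis=0), this means only the columns the two DataFrames share.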

In [9]:
pd.concat([df_movies, df_ratings], join='inner')

Unnamed: 0,movieId
0,1
1,2
2,3
3,4
4,5
...,...
100831,166534
100832,168248
100833,168250
100834,168252
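The result above contains only `movieId` because, when stacking rows, `join` is applied to the column labels. A small sketch with made-up frames shows the difference between the two settings:

```python
import pandas as pd

# Hypothetical frames that share only the 'movieId' column.
movies = pd.DataFrame({"movieId": [1, 2], "title": ["A", "B"]})
ratings = pd.DataFrame({"userId": [7, 8], "movieId": [1, 2], "rating": [4.0, 5.0]})

# join='inner' keeps only the columns present in every frame.
inner_rows = pd.concat([movies, ratings], join="inner")
print(inner_rows.columns.tolist())   # ['movieId']

# join='outer' (the default) keeps all columns and fills the gaps with NaN.
outer_rows = pd.concat([movies, ratings], join="outer")
print(sorted(outer_rows.columns))    # ['movieId', 'rating', 'title', 'userId']
```

In both cases all four rows are kept; only the set of columns changes.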


In [11]:
df_movies.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [12]:
df_ratings.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,2000-07-30 18:45:03
1,1,3,4.0,2000-07-30 18:20:47


In [10]:
pd.concat([df_movies, df_ratings], join='outer')

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,,,
1,2,Jumanji (1995),Adventure|Children|Fantasy,,,
2,3,Grumpier Old Men (1995),Comedy|Romance,,,
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance,,,
4,5,Father of the Bride Part II (1995),Comedy,,,
...,...,...,...,...,...,...
100831,166534,,,610.0,4.0,2017-05-03 21:53:22
100832,168248,,,610.0,5.0,2017-05-03 22:21:31
100833,168250,,,610.0,5.0,2017-05-08 19:50:47
100834,168252,,,610.0,5.0,2017-05-03 21:19:12


---

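### 2. Merging DataFrames : merge()
The merge function combines two DataFrames by matching the values of a key column that exists in both.

- Basic usage
> - pd.merge(df_left, df_right, on=key): merge the left DataFrame df_left and the right DataFrame df_right on the key column given by on

- Options
> - how: which keys to keep in the result ( left: keys from the left DataFrame, right: keys from the right DataFrame, inner: intersection (default), outer: union )<br>
> - on: name of the key column to merge on

#### 2-1 Basic usage
#### Merge the DataFrames df_movies and df_ratings using merge.
If on is not specified, the columns whose names appear in both DataFrames are used as the key automatically.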

In [13]:
pd.merge(df_movies, df_ratings)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,1996-11-08 06:36:02
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,2005-01-25 06:52:26
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,2017-11-13 12:59:30
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,2018-09-16 14:44:42
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,2018-09-16 14:52:25
100833,193585,Flint (2017),Drama,184,3.5,2018-09-16 14:56:45
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,2018-09-16 15:00:21


In [14]:
pd.merge(df_movies, df_ratings, on='movieId')

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5,4.0,1996-11-08 06:36:02
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7,4.5,2005-01-25 06:52:26
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15,2.5,2017-11-13 12:59:30
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...
100831,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184,4.0,2018-09-16 14:44:42
100832,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184,3.5,2018-09-16 14:52:25
100833,193585,Flint (2017),Drama,184,3.5,2018-09-16 14:56:45
100834,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184,3.5,2018-09-16 15:00:21
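The two calls above give the same result because `movieId` is the only column name the frames share. A minimal sketch with made-up frames confirms this behaviour:

```python
import pandas as pd

# Hypothetical stand-ins for df_movies and df_ratings.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame(
    {"userId": [7, 8, 9], "movieId": [1, 1, 2], "rating": [4.0, 3.0, 5.0]}
)

# Without 'on', merge uses every column name common to both frames.
implicit = pd.merge(movies, ratings)
# Naming the key explicitly produces the same inner join here.
explicit = pd.merge(movies, ratings, on="movieId")

print(implicit.equals(explicit))   # True
print(len(implicit))               # 3 (movie 3 has no rating and is dropped)
```

Passing `on` explicitly is generally the safer habit: if the frames later gain another shared column name, the implicit version would silently start matching on that column too.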


#### 2-2 Use the how option to merge with outer, left, and right joins.

In [15]:
pd.merge(df_movies, df_ratings, how='outer', on='movieId')

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25 06:52:26
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13 12:59:30
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25
100851,193585,Flint (2017),Drama,184.0,3.5,2018-09-16 14:56:45
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,2018-09-16 15:00:21


In [16]:
pd.merge(df_movies, df_ratings, how='left', on='movieId')

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,7.0,4.5,2005-01-25 06:52:26
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,15.0,2.5,2017-11-13 12:59:30
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,17.0,4.5,2011-05-18 05:28:03
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25
100851,193585,Flint (2017),Drama,184.0,3.5,2018-09-16 14:56:45
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,2018-09-16 15:00:21


In [17]:
pd.merge(df_movies, df_ratings, how='right', on='movieId')

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1,4.0,2000-07-30 18:45:03
1,3,Grumpier Old Men (1995),Comedy|Romance,1,4.0,2000-07-30 18:20:47
2,6,Heat (1995),Action|Crime|Thriller,1,4.0,2000-07-30 18:37:04
3,47,Seven (a.k.a. Se7en) (1995),Mystery|Thriller,1,5.0,2000-07-30 19:03:35
4,50,"Usual Suspects, The (1995)",Crime|Mystery|Thriller,1,5.0,2000-07-30 18:48:51
...,...,...,...,...,...,...
100831,166534,Split (2017),Drama|Horror|Thriller,610,4.0,2017-05-03 21:53:22
100832,168248,John Wick: Chapter Two (2017),Action|Crime|Thriller,610,5.0,2017-05-03 22:21:31
100833,168250,Get Out (2017),Horror,610,5.0,2017-05-08 19:50:47
100834,168252,Logan (2017),Action|Sci-Fi,610,5.0,2017-05-03 21:19:12
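The outer and left results above run past index 100834 while the right result stops there, which suggests that some movies have no ratings. The sketch below reproduces that pattern on made-up frames and uses the `indicator` option of `pd.merge` to show where each row came from.

```python
import pandas as pd

# Hypothetical frames: movies 2 and 3 have no rating, rating for movie 4 has no movie.
movies = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})
ratings = pd.DataFrame({"movieId": [1, 1, 4], "rating": [4.0, 3.0, 5.0]})

# Row count for each join type.
row_counts = {
    how: len(pd.merge(movies, ratings, how=how, on="movieId"))
    for how in ["inner", "left", "right", "outer"]
}
print(row_counts)   # {'inner': 2, 'left': 4, 'right': 3, 'outer': 5}

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'.
tagged = pd.merge(movies, ratings, how="outer", on="movieId", indicator=True)
print(tagged[["movieId", "_merge"]])
```

Filtering on `_merge == 'left_only'` is a quick way to list the unmatched rows of the left frame.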


- For the next exercises, store the result of merging 'df_movies' and 'df_ratings' in the variable df_2.

In [18]:
df_2 = pd.merge(df_movies, df_ratings, how='outer', on='movieId')

In [19]:
df_2.head(2)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,5.0,4.0,1996-11-08 06:36:02


---

### 3. Sorting : sort_values()
When processing data, you often need to sort it by some criterion. <br>
With the pandas library you can sort data however you need. <br>
A DataFrame can be sorted in two ways: by index (sort_index()) or by value (sort_values()).

- Basic usage
> df.sort_values(by='column to sort by'): sort by value in ascending order

- Options
> by: column to sort by<br>
> ascending: sort order (True=ascending, False=descending)<br>
> inplace: whether to modify the DataFrame itself instead of returning a sorted copy (True, False)

#### 3-1. Basic usage
#### Sort the DataFrame 'df_2' by 'movieId'.

In [20]:
df_2.sort_values(by='movieId')

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
137,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,382.0,4.5,2018-01-05 14:30:28
138,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,385.0,4.0,1996-06-13 18:47:22
139,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,389.0,5.0,1997-03-09 19:02:54
140,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,391.0,3.0,2002-09-18 22:27:57
...,...,...,...,...,...,...
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25
100851,193585,Flint (2017),Drama,184.0,3.5,2018-09-16 14:56:45
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,2018-09-16 15:00:21


#### 3-2. Sort 'df_2' by 'movieId' in descending order.

In [21]:
df_2.sort_values(by='movieId', ascending=False)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
100853,193609,Andrew Dice Clay: Dice Rules (1991),Comedy,331.0,4.0,2018-09-17 04:13:26
100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,2018-09-16 15:00:21
100851,193585,Flint (2017),Drama,184.0,3.5,2018-09-16 14:56:45
100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25
100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42
...,...,...,...,...,...,...
141,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,396.0,5.0,2005-03-24 18:23:46
140,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,391.0,3.0,2002-09-18 22:27:57
139,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,389.0,5.0,1997-03-09 19:02:54
138,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,385.0,4.0,1996-06-13 18:47:22


#### 3-3. Save the DataFrame in its sorted form.

In [22]:
df_2.sort_values(by='movieId', inplace=True)

In [23]:
df_2.head(20)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
137,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,382.0,4.5,2018-01-05 14:30:28
138,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,385.0,4.0,1996-06-13 18:47:22
139,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,389.0,5.0,1997-03-09 19:02:54
140,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,391.0,3.0,2002-09-18 22:27:57
141,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,396.0,5.0,2005-03-24 18:23:46
142,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,399.0,4.0,2006-12-27 11:53:48
143,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,401.0,3.5,2017-11-12 01:35:50
144,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,411.0,5.0,1996-06-23 12:15:55
145,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,412.0,2.0,1999-10-05 09:05:53
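The difference between the plain call and `inplace=True` is easy to miss, so here is a minimal sketch on a made-up frame. It also shows that sorting moves the index labels along with the rows, which is why the index above reads 0, 137, 138, ...

```python
import pandas as pd

# Hypothetical frame, deliberately out of order.
df = pd.DataFrame({"movieId": [3, 1, 2], "rating": [4.0, 5.0, 3.5]})

# Without inplace, sort_values returns a sorted copy and leaves df untouched.
sorted_copy = df.sort_values(by="movieId")
before = df["movieId"].tolist()
print(before)                  # [3, 1, 2]

# With inplace=True, df itself is reordered and the call returns None.
result = df.sort_values(by="movieId", inplace=True)
print(result)                  # None
print(df["movieId"].tolist())  # [1, 2, 3]
print(df.index.tolist())       # [1, 2, 0]  (labels travel with their rows)
```

Because the in-place call returns `None`, writing `df = df.sort_values(..., inplace=True)` would overwrite the frame with `None`; use either reassignment or `inplace=True`, not both.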


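### 4. Resetting the Row Index: reset_index()
After several preprocessing steps, the index of a DataFrame often ends up out of order. <br>
In that case, reset_index() renumbers the index from the beginning.

- Basic usage
> - df.reset_index(): reset the index so that it starts from 0 again; the old index is kept as a new column.

- Options
> - drop: if True, the old index is discarded instead of being kept as a column. <br>
> - inplace: whether to modify the DataFrame itself instead of returning a new one (True, False)

#### 4-1. Basic usage
#### Reset the index of the DataFrame 'df_2' with the reset_index() function.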

In [24]:
df_2.reset_index()

Unnamed: 0,index,movieId,title,genres,userId,rating,timestamp
0,0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
1,137,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,382.0,4.5,2018-01-05 14:30:28
2,138,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,385.0,4.0,1996-06-13 18:47:22
3,139,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,389.0,5.0,1997-03-09 19:02:54
4,140,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,391.0,3.0,2002-09-18 22:27:57
...,...,...,...,...,...,...,...
100849,100849,193581,Black Butler: Book of the Atlantic (2017),Action|Animation|Comedy|Fantasy,184.0,4.0,2018-09-16 14:44:42
100850,100850,193583,No Game No Life: Zero (2017),Animation|Comedy|Fantasy,184.0,3.5,2018-09-16 14:52:25
100851,100851,193585,Flint (2017),Drama,184.0,3.5,2018-09-16 14:56:45
100852,100852,193587,Bungo Stray Dogs: Dead Apple (2018),Action|Animation,184.0,3.5,2018-09-16 15:00:21


#### 4-2. Set the drop option to True to discard the old index, and save the DataFrame with its index reset.

In [25]:
df_2.reset_index(drop=True, inplace=True)

In [26]:
df_2.head(3)

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,382.0,4.5,2018-01-05 14:30:28
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,385.0,4.0,1996-06-13 18:47:22
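To make the effect of `drop` concrete, the following sketch applies both variants to a small made-up frame whose index is out of order, as it would be after sorting.

```python
import pandas as pd

# Hypothetical frame with a shuffled index, as left behind by a sort.
df = pd.DataFrame({"movieId": [1, 2, 3]}, index=[1, 2, 0])

# Default: the old labels are preserved in a new column named 'index'.
kept = df.reset_index()
print(kept.columns.tolist())    # ['index', 'movieId']
print(kept["index"].tolist())   # [1, 2, 0]

# drop=True: the old labels are discarded entirely.
dropped = df.reset_index(drop=True)
print(dropped.columns.tolist()) # ['movieId']
print(dropped.index.tolist())   # [0, 1, 2]
```

Keeping the old index can be useful when it still identifies the original row; for a purely positional index, `drop=True` avoids carrying a redundant column.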


---

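## Exercises

#### Q1. Merge the DataFrame 'df_links' with the DataFrame 'df_2' created in the lesson <br> &emsp; under the conditions below, and store the result in the variable 'df_3'.

> - Put the DataFrame 'df_links' first.<br>
> - Use 'movieId' as the key.
>
    📌 The DataFrames are merged in the order in which they are passed to merge.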

In [30]:
import pandas as pd
import ydata_profiling

In [38]:
df_movies = pd.read_csv("./data/movies.csv")
df_links = pd.read_csv("./data/links.csv")
df_ratings = pd.read_csv("./data/ratings.csv")
df_users = pd.read_csv("./data/users.csv")

In [32]:
df_2 = pd.merge(df_movies, df_ratings, how='outer', on='movieId')
df_2.sort_values(by='movieId', inplace=True)
df_2.reset_index(drop=True, inplace=True)

In [33]:
# Write your code here.
df_2.head()

Unnamed: 0,movieId,title,genres,userId,rating,timestamp
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,1.0,4.0,2000-07-30 18:45:03
1,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,382.0,4.5,2018-01-05 14:30:28
2,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,385.0,4.0,1996-06-13 18:47:22
3,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,389.0,5.0,1997-03-09 19:02:54
4,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy,391.0,3.0,2002-09-18 22:27:57


In [34]:
df_links.head(2)

Unnamed: 0,movieId,imdbId,tmdbId
0,1,114709,862.0
1,2,113497,8844.0


In [36]:
df_3 = pd.merge(df_links, df_2, on='movieId')
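The hint about argument order mainly affects the column layout of the result. The sketch below uses small stand-in frames (the two `imdbId` values come from the `df_links` preview; the ratings are made up) and also passes `validate`, an optional `pd.merge` argument, as one possible safeguard.

```python
import pandas as pd

# Stand-in for df_links (first two rows of the preview above).
links = pd.DataFrame({"movieId": [1, 2], "imdbId": [114709, 113497]})
# Stand-in for df_2 with hypothetical ratings.
rated = pd.DataFrame(
    {
        "movieId": [1, 1, 2],
        "title": ["Toy Story (1995)", "Toy Story (1995)", "Jumanji (1995)"],
        "rating": [4.0, 4.5, 3.0],
    }
)

# The key comes first, then the left frame's columns, then the right frame's.
links_first = pd.merge(links, rated, on="movieId")
rated_first = pd.merge(rated, links, on="movieId")
print(links_first.columns.tolist())  # ['movieId', 'imdbId', 'title', 'rating']
print(rated_first.columns.tolist())  # ['movieId', 'title', 'rating', 'imdbId']

# validate='one_to_many' raises an error if 'movieId' is not unique in the left frame.
checked = pd.merge(links, rated, on="movieId", validate="one_to_many")
```

Both orders contain the same rows here; only the column order differs.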

#### Q2. Merge the DataFrame 'df_users' into the DataFrame 'df_3' created above <br> &emsp; under the conditions below, and store the result in 'df_3'.

> - Put the DataFrame 'df_3' first.<br>
> - Use 'userId' as the key.<br>
> - Use a union (outer) join.
>
    📌 The join type is set with the how option.

In [37]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100854 entries, 0 to 100853
Data columns (total 8 columns):
 #   Column     Non-Null Count   Dtype  
---  ------     --------------   -----  
 0   movieId    100854 non-null  int64  
 1   imdbId     100854 non-null  int64  
 2   tmdbId     100841 non-null  float64
 3   title      100854 non-null  object 
 4   genres     100854 non-null  object 
 5   userId     100836 non-null  float64
 6   rating     100836 non-null  float64
 7   timestamp  100836 non-null  object 
dtypes: float64(3), int64(2), object(3)
memory usage: 6.2+ MB


In [39]:
df_users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6040 entries, 0 to 6039
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   userId      6040 non-null   int64 
 1   gender      6040 non-null   object
 2   age         6040 non-null   int64 
 3   occupation  6040 non-null   object
 4   zipcode     6040 non-null   object
dtypes: int64(2), object(3)
memory usage: 236.1+ KB


In [44]:
# Write your code here.
df_3 = pd.merge(df_3, df_users, on='userId', how='outer')

#### Q3. Sort the DataFrame 'df_3' in ascending order by 'movieId' first and 'userId' second, and update 'df_3' with the sorted result.
    📌 Passing a list to the by option, by=['first column', 'second column'], sorts by the columns in that order.

In [45]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 106284 entries, 0 to 106283
Data columns (total 12 columns):
 #   Column      Non-Null Count   Dtype  
---  ------      --------------   -----  
 0   movieId     100854 non-null  float64
 1   imdbId      100854 non-null  float64
 2   tmdbId      100841 non-null  float64
 3   title       100854 non-null  object 
 4   genres      100854 non-null  object 
 5   userId      106266 non-null  float64
 6   rating      100836 non-null  float64
 7   timestamp   100836 non-null  object 
 8   gender      106266 non-null  object 
 9   age         106266 non-null  float64
 10  occupation  106266 non-null  object 
 11  zipcode     106266 non-null  object 
dtypes: float64(6), object(6)
memory usage: 9.7+ MB
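In the `info()` output above, `movieId`, `imdbId` and `age` have changed from int64 to float64. This is a side effect of the outer merge: unmatched rows are filled with NaN, which an int64 column cannot hold. A minimal sketch on made-up frames, with pandas' nullable `Int64` dtype as one possible way to get integers back:

```python
import pandas as pd

# Hypothetical frames: user 1 has no profile, user 3 has no rating.
ratings = pd.DataFrame({"userId": [1, 2], "movieId": [10, 20]})
users = pd.DataFrame({"userId": [2, 3], "age": [25, 40]})

merged = pd.merge(ratings, users, on="userId", how="outer")
print(merged.dtypes)
# 'movieId' and 'age' now contain NaN and are therefore float64;
# the key 'userId' has no gaps and stays int64.

# Optional: the nullable integer dtype keeps whole numbers and marks gaps as <NA>.
restored = merged["movieId"].astype("Int64")
print(restored.tolist())
```

Whether to convert back depends on later steps; some of them may expect plain float columns with NaN.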


In [51]:
# Write your code here.
df_3.sort_values(['movieId', 'userId'], ascending=True, inplace=True)
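Since `df_3` now contains rows whose `movieId` is missing (users without ratings), it helps to know where those rows end up. The sketch below, on a made-up frame, shows a two-key sort, a mixed-direction variant, and the default placement of NaN.

```python
import pandas as pd

# Hypothetical frame; one row has no movieId.
df = pd.DataFrame(
    {
        "movieId": [2.0, 1.0, float("nan"), 1.0],
        "userId": [5.0, 9.0, 3.0, 2.0],
    }
)

# Sort by movieId first, then by userId within each movieId.
ordered = df.sort_values(by=["movieId", "userId"])
print(ordered.index.tolist())   # [3, 1, 0, 2]

# A list for 'ascending' sets the direction per column.
mixed = df.sort_values(by=["movieId", "userId"], ascending=[True, False])
print(mixed.index.tolist())     # [1, 3, 0, 2]

# In both cases the NaN row comes last (na_position='last' is the default).
```

Passing `na_position='first'` would move the rows with a missing key to the top instead.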

#### Q4. Reset the index of the sorted DataFrame 'df_3'.
> - Discard the old index.
>
    📌 Use the drop option to discard the old index.

In [53]:
# Write your code here.
df_3.reset_index(drop=True, inplace=True)

---