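# Collaborative Filtering

## Characteristics of collaborative filtering
- A general term for **recommendation algorithms that rely only on user behavior**.
- Its main job is to **predict the rating a user would give to an item they have not rated yet**.
  - Items with a high predicted rating (for movies, the star rating) can then be recommended to that user.
- It is built on the **user-item matrix**.
  - Because no user buys or rates every item, most cells are empty, so the matrix is **sparse**.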

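## Types of collaborative filtering

### 1. Nearest-neighbor based
1. **User-based** (user-user CF): find **other users whose tastes are similar to the target user's**.
2. **Item-based** (item-item CF): find **items similar to the ones the user likes**.

### 2. Latent-factor based
- Matrix factorization
  - **Factorize the user-item matrix into low-rank factors, multiply the factors back together, and read the predicted ratings from the reconstructed matrix.**
  - The usual reference point is SVD (Singular Value Decomposition). Plain SVD is not defined for a matrix with missing entries, so in practice the matrix is approximated as a product of two factor matrices ($U V^T$, with no separate singular-value matrix $Σ$) that are learned by gradient descent on the observed ratings.
  - The goal is again to **predict the rating a user would give to items they have not rated yet**.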
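
The factorization idea can be illustrated with a minimal sketch. It assumes a tiny hand-made rating matrix with NaN for missing ratings, and the function name `factorize` and its hyperparameters (`k`, `steps`, `lr`, `reg`) are hypothetical choices for illustration, not values taken from these notes.

```python
import numpy as np

def factorize(R, k=2, steps=2000, lr=0.01, reg=0.01, seed=0):
    """Approximate R ~= P @ Q.T using only the observed (non-NaN) entries."""
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    P = rng.normal(scale=0.1, size=(n_users, k))  # user latent factors
    Q = rng.normal(scale=0.1, size=(n_items, k))  # item latent factors
    mask = ~np.isnan(R)                # True where a rating was observed
    R0 = np.nan_to_num(R)              # NaN -> 0 so the arithmetic is defined
    for _ in range(steps):
        err = mask * (R0 - P @ Q.T)    # error on observed cells only
        P_step = err @ Q - reg * P     # descent direction for P
        Q_step = err.T @ P - reg * Q   # descent direction for Q
        P += lr * P_step
        Q += lr * Q_step
    return P, Q

# Toy matrix (assumption): 3 users x 3 items, NaN = not rated.
R = np.array([[4.0, np.nan, 2.0],
              [np.nan, 5.0, 1.0],
              [4.0, 5.0, np.nan]])
P, Q = factorize(R)
R_hat = P @ Q.T                        # dense matrix of predicted ratings
observed = ~np.isnan(R)
rmse = np.sqrt(np.mean((R[observed] - R_hat[observed]) ** 2))
print(np.round(R_hat, 2))
print(f"RMSE on observed cells: {rmse:.3f}")
```

The cells that were NaN in `R` now hold values in `R_hat`; those are the predicted ratings for unrated items.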

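## User-based (user-user) collaborative filtering
- **Find customers similar to the target user and recommend other products that those customers like.**
- Customers who have bought products similar to the target user's purchases are treated as similar customers.

> "Customers with tastes similar to yours bought the following products."

## Item-based (item-item) collaborative filtering
- **Recommend other products that received ratings similar to those of a given product.**
- Products that users rate in a similar way to a given product are treated as similar products.

> "Other customers who bought this product also bought the following products."

- In general, **item-based collaborative filtering is preferred** over the user-based variant.
- People's preferences depend on comparatively many different factors.
  - Two people having bought the same product is often not enough to conclude that they are similar.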

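## Preparation for collaborative filtering

### Building the user-item matrix
- To use collaborative filtering, the raw data set has to be converted into a user-item matrix with users as rows, items as columns, and each user's rating of each item as the values.
- The result is a sparse matrix.

1. Use pandas `pivot_table`
   - One of the ways to reshape a DataFrame by choosing which data become the rows and which become the columns.
   - Columns that are not named in the call do not appear in the resulting DataFrame.

```python
<DataFrame>.pivot_table('rating_column', index='user_column', columns='item_column')
```
- The arguments of `pivot_table()` are, in order, the column that supplies the cell values, the column that supplies the row labels, and the column that supplies the column labels.

2. Replace NaN values with `fillna()`
   - Similarity cannot be computed while NaN values are present, so every NaN is replaced with 0.
```python
<user-item matrix DataFrame>.fillna(0)
```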
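
A small runnable illustration of these two steps, assuming a hypothetical long-format table `toy` with made-up users, titles, and ratings:

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (user, item, rating).
toy = pd.DataFrame({
    "userId": [1, 1, 2, 3, 3],
    "title":  ["A", "B", "B", "A", "C"],
    "rating": [4.0, 3.0, 5.0, 2.0, 4.5],
})

# Rows = users, columns = items, values = ratings; unseen pairs become NaN.
toy_matrix = toy.pivot_table("rating", index="userId", columns="title")
sparsity = toy_matrix.isna().to_numpy().mean()  # share of empty cells

# Replace NaN with 0 so that similarity functions accept the matrix.
toy_filled = toy_matrix.fillna(0)
print(toy_filled)
print(f"share of missing cells: {sparsity:.2f}")
```

Note that `pivot_table` aggregates with the mean by default, so duplicate (user, item) pairs would be averaged rather than raising an error.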

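### Building the cosine similarity matrix

#### User-based (user-user) cosine similarity matrix
For user-based collaborative filtering, the cosine similarity matrix is computed directly from the user-item matrix.

1. Compute the user-based cosine similarity
```python
# A is the user-item matrix (rows = users, columns = items)
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(A, A)
```

2. Wrap the result in a user-user similarity DataFrame
```python
pd.DataFrame(data=cosine_similarity(A, A), index=A.index, columns=A.index)
```
- `data` receives the user-based cosine similarities.
- Both the row labels and the column labels are `A.index`, which identifies the users.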

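#### Item-based (item-item) cosine similarity matrix
For item-based collaborative filtering, multiplying the transpose of the user-item matrix by the user-item matrix itself yields an item-by-item matrix of dot products, which is the basis of the item similarity.
```python
# A is the user-item matrix (rows = users, columns = items)
A.T @ A  # shape: (n_items, n_items)
```

1. Compute the item-based cosine similarity
```python
cosine_similarity(A.T, A.T)
```

2. Wrap the result in an item-item similarity DataFrame
```python
pd.DataFrame(data=cosine_similarity(A.T, A.T), index=A.columns, columns=A.columns)
```
- `data` receives the item-based cosine similarities.
- Both the row labels and the column labels are `A.columns`, which identifies the items.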
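
To make the link between the dot-product matrix and cosine similarity concrete, here is a NumPy-only sketch on a made-up 3x3 matrix `A`. Dividing each dot product by the two column lengths gives the cosine; this should be the same quantity that `cosine_similarity(A.T, A.T)` returns, although that equivalence is stated here as an expectation rather than checked against scikit-learn.

```python
import numpy as np
import pandas as pd

# Toy user-item matrix (assumption): rows = users, columns = items.
A = pd.DataFrame(
    [[4.0, 0.0, 2.0],
     [0.0, 5.0, 1.0],
     [4.0, 5.0, 0.0]],
    index=[1, 2, 3], columns=["A", "B", "C"],
)

gram = A.T @ A                                # item-item dot products
norms = np.sqrt(np.diag(gram.to_numpy()))     # length of each item column
item_sim = gram / np.outer(norms, norms)      # cosine = dot / (|x| * |y|)
print(item_sim.round(3))
```

Every diagonal entry is 1 because each item is perfectly similar to itself, and the matrix is symmetric.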

## Predicting ratings with item-based collaborative filtering
- In item-based collaborative filtering, personalized rating prediction uses the weighted rating sum.
- Weighted rating sum:

$$
\hat R_{u,i}
=
\frac{\sum(S_{i,N} \times R_{u,N})}{\sum(|S_{i,N}|)}
$$
> - $\hat R_{u,i}$: the personalized predicted rating of user $u$ for item $i$
>
> - $S_{i,N}$: the vector of similarities between item $i$ and its Top-N most similar items (excluding item $i$ itself)
>
> - $R_{u,N}$: the vector of user $u$'s actual ratings for those Top-N items
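
A worked instance of the formula for one user and one target item may help. The numbers and the helper name `predict_one` are hypothetical; the sketch follows the definition above and excludes the target item from its own neighbor list.

```python
import numpy as np

def predict_one(user_ratings, sim_to_item, item, n=2):
    """Weighted rating sum for one user and one target item."""
    sims = sim_to_item.astype(float).copy()
    sims[item] = -np.inf                      # exclude the target item itself
    top_n = np.argsort(sims)[::-1][:n]        # indices of the n most similar items
    weights = sim_to_item[top_n]
    return float(weights @ user_ratings[top_n] / np.abs(weights).sum())

user_ratings = np.array([0.0, 4.0, 2.0, 5.0])   # ratings of user u (item 0 unrated)
sim_to_item0 = np.array([1.0, 0.8, 0.5, 0.1])   # similarity of every item to item 0
pred = predict_one(user_ratings, sim_to_item0, item=0, n=2)
print(round(pred, 4))  # (0.8*4 + 0.5*2) / (0.8 + 0.5)
```

With N = 2 the neighbors are items 1 and 2, so the prediction is 4.2 / 1.3, roughly 3.23.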

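## Item-based collaborative filtering: hands-on
- Uses a movie ratings data set; the `ml-latest-small` directory in the paths below is the MovieLens "latest small" release.
  - Source: Kaggle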

In [1]:
import warnings
warnings.filterwarnings('ignore')

In [2]:
from google.colab import drive
drive.mount("/content/drive")

Mounted at /content/drive


### 1. Raw Data Loading
- Load the item (movie) data and the user rating data.

In [3]:
import pandas as pd
import numpy as np

MOVIES_DATA_PATH = "/content/drive/MyDrive/MLP-33-ML-DL/ML/실습_data/ml-latest-small/movies.csv"
RATINGS_DATA_PATH = "/content/drive/MyDrive/MLP-33-ML-DL/ML/실습_data/ml-latest-small/ratings.csv"

movies = pd.read_csv(MOVIES_DATA_PATH)
ratings = pd.read_csv(RATINGS_DATA_PATH)

In [4]:
movies.head()

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy
2,3,Grumpier Old Men (1995),Comedy|Romance
3,4,Waiting to Exhale (1995),Comedy|Drama|Romance
4,5,Father of the Bride Part II (1995),Comedy


In [5]:
ratings.head()

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247
2,1,6,4.0,964982224
3,1,47,5.0,964983815
4,1,50,5.0,964982931


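### 2. Merging DataFrames with `merge()`
- `merge()` joins two DataFrames on key columns that exist in both.
  - `how='inner'` keeps only the keys present in both DataFrames (the intersection); with `on=None`, the columns that share a name in both DataFrames are used as the join keys.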

```python
pd.merge(df_left, df_right, how='inner', on=None)
```

- To see how each user rated each title (movie title), the movie DataFrame and the ratings DataFrame are merged.
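
The effect of an inner join can be checked on two tiny made-up frames (`left` and `right` are hypothetical stand-ins for the ratings and movies tables):

```python
import pandas as pd

# movieId 9 has a rating but no title; movieId 3 has a title but no rating.
left = pd.DataFrame({"movieId": [1, 1, 2, 9], "rating": [4.0, 3.5, 5.0, 2.0]})
right = pd.DataFrame({"movieId": [1, 2, 3], "title": ["A", "B", "C"]})

# Inner join: only movieIds that appear in both frames survive.
merged = pd.merge(left, right, how="inner", on="movieId")
print(merged)
```

Rows for `movieId` 9 and 3 are dropped, and each surviving rating row gains the matching `title`.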

In [9]:
## Merge the movie DataFrame and the ratings DataFrame with merge()
rating_movies = pd.merge(ratings, movies, on='movieId') # inner join on the 'movieId' column
rating_movies.head()

Unnamed: 0,userId,movieId,rating,timestamp,title,genres
0,1,1,4.0,964982703,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,5,1,4.0,847434962,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,7,1,4.5,1106635946,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
3,15,1,2.5,1510577970,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
4,17,1,4.5,1305696483,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy


### 3. Building the user-item (movie) rating matrix
#### Creating the DataFrame with `pivot_table`
- The result is a sparse matrix.

In [14]:
ratings_matrix = rating_movies.pivot_table('rating', index='userId', columns='title') # values (ratings), rows (users), columns (items, i.e. movies)
ratings_matrix.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,,,,,,,,,,,...,,,,,,,,,4.0,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,,,,,,,,,,,...,,,,,,,,,,


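- Most cells are NaN.
- The NaN values have to be replaced with numbers before the item-based cosine similarity can be computed.

#### Filling missing values with `fillna()`
- `fillna()` is the DataFrame method that replaces missing values with a chosen value.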

```python
df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
```

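- `value`: the value that replaces missing values. A dict (one value per column) is also accepted.
- `method`: how to fill missing values (deprecated in recent pandas 2.x releases in favor of `DataFrame.ffill()` / `DataFrame.bfill()`).
  - `bfill`: use the next valid value below the missing one.
  - `ffill`: use the last valid value above the missing one.
- `axis`: the axis along which to fill
  - `axis=0`: index
  - `axis=1`: columns
- `inplace`: whether to modify the original object. If `True`, the original is changed.
- `limit`: the maximum number of missing values to fill; counting from the top, only that many are replaced.
- `downcast`: whether to downcast the result
  - `downcast='infer'`: try to downcast to a smaller compatible dtype, for example `float64` to `int64`.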
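
A short sketch of the most common options on a made-up two-column frame. The forward fill uses `DataFrame.ffill()` rather than the `method` argument, as an assumption about which spelling works across pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [np.nan, np.nan, 6.0]})

zero_filled = df.fillna(0)                       # one value for every NaN
per_column = df.fillna({"a": -1.0, "b": 99.0})   # dict: a different value per column
capped = df.fillna(0, limit=1)                   # at most one NaN filled per column
forward = df.ffill()                             # copy the previous valid value downward

print(capped)
print(forward)
```

In `capped`, column `b` keeps one NaN because only the first missing value is filled. In `forward`, the two leading NaN values of `b` stay missing because there is no earlier value to copy.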

In [15]:
# Similarity cannot be computed with NaN values present, so replace every NaN with 0
ratings_matrix = ratings_matrix.fillna(0)
ratings_matrix

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,4.5,3.5,0.0,0.0,0.0
609,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


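### 3. Building the item-based (movie-movie) similarity matrix
- Use the transpose of the user-item matrix to compute the item-based cosine similarity.

In [17]:
# Transpose of the user-item (movie) matrix => item (movie)-user matrix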
ratings_matrix_T = ratings_matrix.T
ratings_matrix_T.head()

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- The rows and columns have been swapped, giving a movie-user matrix.

In [19]:
# Compute the item-based cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)


# Wrap the similarity matrix in a DataFrame
item_sim_df = pd.DataFrame(data=item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns) # data=item-based cosine similarity, index=items (movies), columns=items (movies)
item_sim_df.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Round Midnight (1986),0.0,0.707107,1.0,0.0,0.0,0.0,0.176777,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Salem's Lot (2004),0.0,0.0,0.0,1.0,0.857493,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
'Til There Was You (1997),0.0,0.0,0.0,0.857493,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 4. Looking up movies similar to a given movie
- This is only a sanity check, so it can be skipped.

In [20]:
# Find the Top 10 movies most similar to 'Godfather, The (1972)'
item_sim_df['Godfather, The (1972)'].sort_values(ascending=False)[:10]

title
Godfather, The (1972)                                    1.000000
Godfather: Part II, The (1974)                           0.821773
Goodfellas (1990)                                        0.664841
One Flew Over the Cuckoo's Nest (1975)                   0.620536
Star Wars: Episode IV - A New Hope (1977)                0.595317
Fargo (1996)                                             0.588614
Star Wars: Episode V - The Empire Strikes Back (1980)    0.586030
Fight Club (1999)                                        0.581279
Reservoir Dogs (1992)                                    0.579059
Pulp Fiction (1994)                                      0.575270
Name: Godfather, The (1972), dtype: float64

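- These are the movies whose rating patterns are closest to those of 'Godfather, The (1972)', that is, the candidates to recommend to users who watched it.
- The ten movies are listed in descending order of rating similarity; the first entry is the movie itself (similarity 1.0).

### 5. Rating prediction
- Item-based collaborative filtering is now used to predict personalized ratings, which in turn drive the movie recommendations.

$$
\hat R_{u,i} = \frac{\sum(S_{i,N} \times R_{u,N})}{\sum(|S_{i,N}|)}
$$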

In [21]:
## Function that computes the predicted ratings for all users
def predict_rating(ratings_arr, item_sim_arr):
  # ratings_arr : user-item rating matrix
  # item_sim_arr : item-based (item-item) similarity matrix

  # Sum of the absolute similarities of each item to all items (the denominator)
  item_sim_arr_abs_sum = np.array([np.abs(item_sim_arr).sum(axis=1)])

  # Multiply the user-item rating matrix by the item similarity matrix and normalize to get every user's predicted ratings.
  ratings_pred = ratings_arr @ item_sim_arr / item_sim_arr_abs_sum

  return ratings_pred

In [22]:
# Compute the predicted ratings
ratings_pred = predict_rating(ratings_matrix.values, item_sim_df.values) # user-item rating matrix, item-based similarity matrix

# Build the user-item predicted rating matrix
ratings_pred_matrix = pd.DataFrame(data=ratings_pred, index=ratings_matrix.index, columns=ratings_matrix.columns) # data=predicted ratings, index=users, columns=items (movies)
ratings_pred_matrix.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.070345,0.577855,0.321696,0.227055,0.206958,0.194615,0.249883,0.102542,0.157084,0.178197,...,0.113608,0.181738,0.133962,0.128574,0.006179,0.21207,0.192921,0.136024,0.292955,0.720347
2,0.01826,0.042744,0.018861,0.0,0.0,0.035995,0.013413,0.002314,0.032213,0.014863,...,0.01564,0.020855,0.020119,0.015745,0.049983,0.014876,0.021616,0.024528,0.017563,0.0
3,0.011884,0.030279,0.064437,0.003762,0.003749,0.002722,0.014625,0.002085,0.005666,0.006272,...,0.006923,0.011665,0.0118,0.012225,0.0,0.008194,0.007017,0.009229,0.01042,0.084501
4,0.049145,0.277628,0.160448,0.206892,0.309632,0.042337,0.130048,0.116442,0.099785,0.097432,...,0.051269,0.076051,0.055563,0.054137,0.008343,0.159242,0.100941,0.062253,0.146054,0.231187
5,0.007278,0.066951,0.041879,0.01388,0.024842,0.01824,0.026405,0.018673,0.021591,0.018841,...,0.009689,0.022246,0.01336,0.012378,0.0,0.025839,0.023712,0.018012,0.028133,0.052315


In [23]:
# Check the shapes of the user-item rating matrix, the item-based similarity matrix, and the user-item predicted rating matrix
ratings_matrix.shape, item_sim.shape, ratings_pred.shape

((610, 9719), (9719, 9719), (610, 9719))

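### 6. Evaluating the predictions
After computing the weighted predicted ratings, check prediction quality with MSE and RMSE.

1. Find the positions (indices) in the actual rating data that originally held a value
   - The ground truth is the user-item rating matrix (a sparse matrix).
   - Find the positions where the user-item rating matrix holds a rating rather than NaN.

2. Read the predicted ratings at the positions found in step 1
   - The predictions are the user-item predicted rating matrix (a dense matrix).
   - Take the predicted values only at the positions where the actual rating matrix holds a rating.

3. Compute the MSE from the values at the same positions in the matrices of steps 1 and 2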

In [25]:
# Coordinates of the actual ratings: positions (indices) in the user-item rating matrix whose value is not 0
ratings_nonzero = ratings_matrix.values.nonzero()
ratings_nonzero

(array([  0,   0,   0, ..., 609, 609, 609]),
 array([  48,   66,  202, ..., 9712, 9715, 9716]))

- `[0]` holds the row indices of the nonzero values.
- `[1]` holds the column indices of the nonzero values.
- Used together as row and column indices, they select those cells by fancy indexing.

In [39]:
ratings_matrix.iloc[0, 48], ratings_matrix.iloc[0, 66], ratings_matrix.iloc[609, 9716]

(4.0, 4.0, 1.5)

In [42]:
ratings_matrix.values[ratings_nonzero]

array([4. , 4. , 4. , ..., 3. , 2. , 1.5])

- The row and column coordinates retrieve the actual rating values, as expected.

#### Fancy indexing
- To select several columns, pass a list/tuple of the column names.
- A list of row labels and a list of column labels can be combined to index the part of a DataFrame you want.

```python
# Index a DataFrame with lists of specific row and column labels.
select_rows = [1, 2, 4]
select_cols = ["Yejun", "Noah", "Eunho"]

df.loc[select_rows, select_cols]
```
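
The evaluation step relies on the NumPy form of the same idea: a pair of coordinate arrays picks out individual cells. A minimal sketch with made-up `actual` and `pred` arrays:

```python
import numpy as np

# Toy matrices (assumption): 0 in `actual` means "not rated".
actual = np.array([[4.0, 0.0, 2.0],
                   [0.0, 5.0, 0.0]])
pred = np.array([[3.5, 1.0, 2.5],
                 [0.2, 4.0, 0.3]])

rows, cols = actual.nonzero()      # coordinates of the rated cells
actual_rated = actual[rows, cols]  # 1-D array of the real ratings
pred_rated = pred[rows, cols]      # predictions at exactly the same cells
mse = np.mean((actual_rated - pred_rated) ** 2)
print(rows, cols, mse)
```

Only the three rated cells enter the error, so predictions for unrated cells (such as 1.0 or 0.2 here) do not affect the MSE.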

In [55]:
## Function that computes the MSE only over the movies each user actually rated
from sklearn.metrics import mean_squared_error

def get_mse(pred, actual):
  # Actual ratings at the cells the user rated (nonzero cells); the indexing already returns a 1-D array, so flatten() is only a safeguard
  actual_y = actual[actual.nonzero()].flatten()

  # Predicted ratings at the same cells
  pred_y = pred[actual.nonzero()].flatten()

  print(actual_y)
  print(pred_y)
  return mean_squared_error(actual_y, pred_y)

# MSE of the rating predictions
mse = get_mse(ratings_pred, ratings_matrix.values)
print(f'MSE of the rating predictions : {mse}')
print(f'RMSE of the rating predictions : {np.sqrt(mse)}')

[4.  4.  4.  ... 3.  2.  1.5]
[0.2855597  1.08359021 0.35404974 ... 2.57350896 1.08329872 1.81609065]
MSE of the rating predictions : 9.895354759094706
RMSE of the rating predictions : 3.1456882806620725


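- When the weighted average is taken over the full user-item matrix and the full item-item similarity matrix, every item takes part in the prediction, including dissimilar ones, so the predicted ratings are bound to come out low.
  - Terms whose similarity is 0 add nothing to the prediction.
  - Items the user has not rated also add 0 to the numerator, because their NaN values were replaced with 0, while their similarities still count toward the denominator.
- As a result the gap between actual and predicted ratings is large, and the MSE is very high.
- The next step therefore computes the predicted rating from only the Top-N most similar items.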

In [64]:
## Function that computes predicted ratings from only the top_n most similar items
def predict_rating_topsim(ratings_arr, item_sim_arr, n=20):

  ## Predicted rating matrix with the same shape as the actual rating matrix, initialized to 0
  # It is filled in one cell at a time instead of with a single full matrix product
  pred = np.zeros(ratings_arr.shape)

  ## Loop over the columns of the user-item rating matrix
  # Each column holds all users' ratings of one movie
  for col in range(ratings_arr.shape[1]):
    ## Indices of the n items most similar to item col, in descending order of similarity
    # Note: this normally includes item col itself, whose self-similarity is 1.0
    top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]

    ## Personalized prediction: needs each user's actual ratings of those top-n items
    for row in range(ratings_arr.shape[0]):
      # Weighted sum of the user's ratings of the top-n items, divided by the sum of absolute similarities
      pred[row, col] = item_sim_arr[col, top_n_items] @ ratings_arr[row, top_n_items]
      pred[row, col] /= np.sum(np.abs(item_sim_arr[col, top_n_items]))

  return pred

```python
np.argsort(item_sim_arr[:, col])[:-n-1:-1]
```
- `item_sim_arr[:, col]`: the similarities between movie `col` and every movie.
- `np.argsort(item_sim_arr[:, col])`: `argsort()` returns the indices ordered from the smallest similarity to the largest.
- `[:-n-1:-1]`: slices backwards from the end with a negative stop index. The stop has to be `-(n+1)` to take the last n entries, which are the n largest.
- The result is a 1-D integer array that is used directly for fancy indexing.

```python
pred[row, col] = item_sim_arr[col, top_n_items] @ ratings_arr[row, top_n_items]
```
- `item_sim_arr[col, top_n_items]`: from row `col` of the item similarity matrix, the similarities to the n items listed in `top_n_items`.
- `ratings_arr[row, top_n_items]`: from row `row` of the user-item rating matrix, that user's actual ratings of the same n items.
- The dot product of these two length-n vectors is the numerator of the predicted rating.
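
The inner loop over users can be replaced by one matrix-vector product per item, which should give the same numbers with far fewer Python-level iterations. The sketch below is an assumption-laden variant, not the notebook's code: the name `predict_rating_topsim_fast` is hypothetical, and the check runs on small random data rather than the real ratings.

```python
import numpy as np

def predict_rating_topsim_fast(ratings_arr, item_sim_arr, n=20):
    """Top-N weighted rating sum, filling one item column per iteration."""
    pred = np.zeros(ratings_arr.shape)
    for col in range(ratings_arr.shape[1]):
        top_n_items = np.argsort(item_sim_arr[:, col])[:-n-1:-1]
        weights = item_sim_arr[col, top_n_items]
        # (n_users, n) @ (n,) -> one prediction per user for this item
        pred[:, col] = ratings_arr[:, top_n_items] @ weights / np.abs(weights).sum()
    return pred

# Small random check against the explicit double loop.
rng = np.random.default_rng(0)
ratings = rng.integers(0, 6, size=(8, 5)).astype(float)
ratings[0, :] = 3.0                      # make sure no item column is all zeros
norms = np.linalg.norm(ratings, axis=0)
sim = (ratings.T @ ratings) / np.outer(norms, norms)   # item-item cosine similarity

fast = predict_rating_topsim_fast(ratings, sim, n=3)
slow = np.zeros_like(ratings)
for col in range(ratings.shape[1]):
    idx = np.argsort(sim[:, col])[:-3-1:-1]
    for row in range(ratings.shape[0]):
        slow[row, col] = sim[col, idx] @ ratings[row, idx] / np.abs(sim[col, idx]).sum()
print(np.allclose(fast, slow))
```

Both versions select the same neighbors and weights for each item, so the only difference is how many users are processed per step.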

In [66]:
## Compute predicted ratings from only the top_n most similar items
ratings_pred = predict_rating_topsim(ratings_matrix.values, item_sim_df.values, n=20)

# MSE of the rating predictions
mse = get_mse(ratings_pred, ratings_matrix.values)
print(f'MSE of the rating predictions : {mse}')
print(f'RMSE of the rating predictions : {np.sqrt(mse)}')

[4.  4.  4.  ... 3.  2.  1.5]
[1.98892435 1.23173695 3.70897796 ... 3.30311669 2.3330633  0.74587428]
MSE of the rating predictions : 3.6949827608772314
RMSE of the rating predictions : 1.9222337945414527


- The MSE is lower than before.

In [67]:
## Predicted rating matrix as a DataFrame
ratings_pred_matrix = pd.DataFrame(data=ratings_pred, index=ratings_matrix.index, columns=ratings_matrix.columns)
ratings_pred_matrix.head()

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.220798,0.0,0.0,1.677291,0.284372
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.220798,0.0,0.0,0.194828,0.0
5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 7. Running the recommender

In [68]:
# User to generate recommendations for
target_user_id = 78

# All movie ratings of target_user_id
user_rating_id = ratings_matrix.loc[target_user_id, :]

# Among the movies target_user_id rated (score above 0), sort by rating in descending order and take the top 10
user_rating_id[ user_rating_id > 0 ].sort_values(ascending=False)[:10]

title
Die Hard (1988)                                                                   5.0
Airplane! (1980)                                                                  5.0
Terminator, The (1984)                                                            5.0
Terminator 2: Judgment Day (1991)                                                 4.5
Raiders of the Lost Ark (Indiana Jones and the Raiders of the Lost Ark) (1981)    4.5
Shawshank Redemption, The (1994)                                                  4.5
Ghostbusters (a.k.a. Ghost Busters) (1984)                                        4.5
Matrix, The (1999)                                                                4.5
Dodgeball: A True Underdog Story (2004)                                           4.5
Naked Gun: From the Files of Police Squad!, The (1988)                            4.5
Name: 78, dtype: float64

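- The recommendation steps are now wrapped in functions to list the items (movies) to recommend to the user.

#### Item-based collaborative filtering recommendations among unrated items
- Recommend, by item (movie) based nearest-neighbor collaborative filtering, from the movies the user has not watched, that is, has not rated.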

In [69]:
## Function that returns the titles of the movies userId has not seen
def get_unseen_movies(ratings_matrix, userId):
  # All movie ratings of the given user, as a Series
  user_rating = ratings_matrix.loc[userId, :]

  # Movies the user has already watched: a rating above 0 means the movie was watched.
  # Collect their index labels into a list.
  already_seen = user_rating[ user_rating > 0 ].index.tolist()

  # List of all movie titles
  movies_list = ratings_matrix.columns.tolist()

  # Titles of the movies the user has not watched
  unseen_list = [ movie for movie in movies_list if movie not in already_seen ]

  return unseen_list

In [70]:
# Function that returns the list of movies to recommend to userid
def recomm_movie_by_userid(pred_df, userid, unseen_list, top_n=10):
  recomm_movies = pred_df.loc[userid, unseen_list].sort_values(ascending=False)[:top_n]
  return recomm_movies
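
The two helpers above can also be condensed into one function that uses a boolean mask instead of a list comprehension. This is only a sketch: `recommend_unseen`, `toy_ratings`, and `toy_pred` are hypothetical names and values, and 0 is assumed to mean "not rated", as in the filled rating matrix.

```python
import pandas as pd

def recommend_unseen(ratings_matrix, pred_df, user_id, top_n=10):
    """Return the top_n unrated items for user_id, ranked by predicted rating."""
    # Boolean mask over the columns: True where the user has no rating (stored as 0)
    unseen_mask = ratings_matrix.loc[user_id].eq(0).to_numpy()
    unseen = ratings_matrix.columns[unseen_mask]
    return pred_df.loc[user_id, unseen].sort_values(ascending=False).head(top_n)

titles = ["A", "B", "C", "D"]
toy_ratings = pd.DataFrame([[5.0, 0.0, 0.0, 3.0]], index=[1], columns=titles)
toy_pred = pd.DataFrame([[4.8, 2.1, 3.3, 2.9]], index=[1], columns=titles)

recs = recommend_unseen(toy_ratings, toy_pred, user_id=1, top_n=2)
print(recs)
```

Items A and D are skipped because the user already rated them, so the result ranks C ahead of B.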

In [71]:
# Titles of the movies the user has not watched
unseen_list = get_unseen_movies(ratings_matrix, target_user_id)

# Build the list of movies to recommend to the user
recomm_movies = recomm_movie_by_userid(ratings_pred_matrix, target_user_id, unseen_list, top_n=10)

# Convert the predicted ratings into a DataFrame
recomm_movies_df = pd.DataFrame(data=recomm_movies.values, index=recomm_movies.index, columns=['pred_score'])
recomm_movies_df

title,pred_score
Star Wars: Episode VI - Return of the Jedi (1983),2.226222
Braveheart (1995),1.985336
Star Wars: Episode V - The Empire Strikes Back (1980),1.929772
Seven (a.k.a. Se7en) (1995),1.780817
Schindler's List (1993),1.693
Alien (1979),1.664949
Batman (1989),1.615901
Léon: The Professional (a.k.a. The Professional) (Léon) (1994),1.60749
Indiana Jones and the Temple of Doom (1984),1.552995
"Fugitive, The (1993)",1.547346
