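### Recommender system concepts implemented here
- This notebook implements latent factor collaborative filtering,
- using matrix factorization to do so.
- It is a personalized approach that recommends movies tailored to each individual user.
- Recommendations are based on each user's own movie rating history.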
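
The latent-factor idea can be sketched on a tiny made-up matrix: a rating matrix is approximated by the product of a user-factor matrix and an item-factor matrix. This is only an illustration under assumed data; the names `R`, `user_factors`, and `item_factors` are hypothetical and do not appear in the notebook.

```python
import numpy as np

# Hypothetical toy rating matrix: 4 users x 3 movies (values invented for illustration)
R = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

k = 2  # number of latent factors to keep

# Full SVD, then keep only the k largest singular values
U, s, Vt = np.linalg.svd(R, full_matrices=False)
user_factors = U[:, :k] * s[:k]   # shape (4, k): each user's latent-factor weights
item_factors = Vt[:k, :]          # shape (k, 3): each movie's latent-factor weights

# The product of the two factor matrices approximates R
R_approx = user_factors @ item_factors
print(np.round(R_approx, 2))
```

With `k` smaller than the rank of `R`, the product is an approximation rather than an exact copy, which is what lets the model fill in plausible scores for unrated entries.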

In [1]:
import pandas as pd
import numpy as np
from scipy.sparse.linalg import svds

In [2]:
rating_data = pd.read_csv('./output/user_list.csv')
movie_data = pd.read_csv('./output/movie_info.csv')

In [3]:
rating_data.columns = ['index', 'movieRank', 'movieId', 'score', 'userId']
movie_data.columns = ['index', 'movieRank', 'movieId', 'title', 'genres']

In [4]:
rating_data = rating_data.drop(['index', 'movieRank'], axis = 1)
movie_data = movie_data.drop(['index', 'movieRank'], axis = 1)

In [5]:
rating_data.head()

Unnamed: 0,movieId,score,userId
0,192150,8,잡는다(jsm6****)
1,192150,8,추억(king****)
2,192150,8,바라는대로(jimi****)
3,192150,10,iwan****
4,192150,10,하이잉(ssor****)


### Data cleansing: tidying up userId
- Replace each userId with a simple integer (0, 1, 2, 3, ...) so that users are easier to look up later
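
As a compact sketch of the same idea, the integer IDs can also be built with a dictionary and `Series.map` instead of a merge. The toy data and the names `toy`, `ordered_users`, and `id_map` are invented for illustration.

```python
import pandas as pd

# Hypothetical toy ratings (invented values)
toy = pd.DataFrame({
    "userId": ["kim", "lee", "kim", "park", "kim", "lee"],
    "score": [8, 10, 6, 7, 9, 3],
})

# Users ordered by number of ratings, most active first
ordered_users = toy["userId"].value_counts().index

# Map each original userId to its rank (0 = most active user)
id_map = {name: i for i, name in enumerate(ordered_users)}
toy["user_id"] = toy["userId"].map(id_map)
print(toy)
```

Because the IDs start at 0, they can be used directly as row positions in a user-by-movie matrix.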

In [6]:
sort_userId = rating_data['userId']
sort_userId = sort_userId.value_counts().reset_index()

In [7]:
sort_userId['user_id'] = sort_userId.index
sort_userId

Unnamed: 0,index,userId,user_id
0,Abc123(fowe****),10596,0
1,HOMEOSTASIS(xogu****),7872,1
2,범쇠범팔뺌수(ttas****),6902,2
3,보호주의(wooz****),6840,3
4,짱(eorm****),6840,4
...,...,...,...
52322,넌누구니(ange****),1,52322
52323,galaxy(nabi****),1,52323
52324,샤이닝이치(fusi****),1,52324
52325,이민우(leem****),1,52325


In [8]:
sort_userId.columns = ['userId', 'value_counts', 'user_id']
sort_userId

Unnamed: 0,userId,value_counts,user_id
0,Abc123(fowe****),10596,0
1,HOMEOSTASIS(xogu****),7872,1
2,범쇠범팔뺌수(ttas****),6902,2
3,보호주의(wooz****),6840,3
4,짱(eorm****),6840,4
...,...,...,...
52322,넌누구니(ange****),1,52322
52323,galaxy(nabi****),1,52323
52324,샤이닝이치(fusi****),1,52324
52325,이민우(leem****),1,52325


In [9]:
sort_userId = sort_userId.drop('value_counts', axis = 1)
sort_userId

Unnamed: 0,userId,user_id
0,Abc123(fowe****),0
1,HOMEOSTASIS(xogu****),1
2,범쇠범팔뺌수(ttas****),2
3,보호주의(wooz****),3
4,짱(eorm****),4
...,...,...
52322,넌누구니(ange****),52322
52323,galaxy(nabi****),52323
52324,샤이닝이치(fusi****),52324
52325,이민우(leem****),52325


In [10]:
rating = rating_data.merge(sort_userId, on = 'userId')

In [11]:
rating.head(3)

Unnamed: 0,movieId,score,userId,user_id
0,192150,8,잡는다(jsm6****),804
1,184318,8,잡는다(jsm6****),804
2,184311,8,잡는다(jsm6****),804


In [12]:
rating = rating.sort_values(['user_id'], ascending=True)
rating.head(10)

Unnamed: 0,movieId,score,userId,user_id
29686,187549,1,Abc123(fowe****),0
32899,207360,1,Abc123(fowe****),0
32898,207360,1,Abc123(fowe****),0
32897,207360,1,Abc123(fowe****),0
32896,207360,1,Abc123(fowe****),0
32895,207360,1,Abc123(fowe****),0
32894,207360,1,Abc123(fowe****),0
32900,207360,1,Abc123(fowe****),0
32893,207360,1,Abc123(fowe****),0
32891,207360,1,Abc123(fowe****),0


In [13]:
rating_data = rating.copy()

In [14]:
rating_data

Unnamed: 0,movieId,score,userId,user_id
29686,187549,1,Abc123(fowe****),0
32899,207360,1,Abc123(fowe****),0
32898,207360,1,Abc123(fowe****),0
32897,207360,1,Abc123(fowe****),0
32896,207360,1,Abc123(fowe****),0
...,...,...,...,...
4441171,200900,3,넌누구니(ange****),52322
4441149,200900,1,galaxy(nabi****),52323
4441167,200900,1,샤이닝이치(fusi****),52324
4441168,200900,10,이민우(leem****),52325


### Fixing the code

In [116]:
# Original code from the blog
df_user_movie_ratings = rating_data.pivot(index = 'user_id',
                                          columns = 'movieId',
                                          values = 'score').fillna(0)

ValueError: Index contains duplicate entries, cannot reshape

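## Error encountered
### Index contains duplicate entries, cannot reshape
- Likely cause: the same user rated the same movie more than once, so some (user_id, movieId) pairs appear in several rows
- Fix: use pd.pivot_table with aggfunc=np.mean so that duplicate ratings are averaged
 * mean is already the default aggfunc, so passing it is optional
- Reference blog post (https://rfriend.tistory.com/275)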
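
The error and the fix can be reproduced on a small made-up table. This is a sketch with invented data; `toy`, `dupes`, and `table` are hypothetical names.

```python
import pandas as pd

# Hypothetical toy ratings: user 0 rated movie 101 twice
toy = pd.DataFrame({
    "user_id": [0, 0, 0, 1],
    "movieId": [101, 101, 102, 101],
    "score": [8, 10, 6, 7],
})

# Rows that share a (user_id, movieId) pair with another row
dupes = toy[toy.duplicated(subset=["user_id", "movieId"], keep=False)]
print(dupes)

# pivot() refuses to reshape when such duplicates exist
try:
    toy.pivot(index="user_id", columns="movieId", values="score")
    pivot_failed = False
except ValueError as err:
    pivot_failed = True
    print("pivot failed:", err)

# pivot_table() aggregates the duplicates (mean by default)
table = pd.pivot_table(toy, index="user_id", columns="movieId",
                       values="score", aggfunc="mean").fillna(0)
print(table)
```

Checking `duplicated(...)` first is a quick way to confirm the suspected cause before switching to `pivot_table`.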
   

In [15]:
# Revised code
df_user_movie_ratings = pd.pivot_table(data = rating_data,
                                       index = 'user_id',
                                       columns = 'movieId',
                                       values = 'score',
                                       aggfunc=np.mean).fillna(0)                                       

In [16]:
df_user_movie_ratings.head()

user_id,99702,107992,159074,163811,167403,167569,179406,182019,184311,184318,...,206657,207182,207284,207360,207364,207370,207523,208530,209496,210283
0,9.5,9.3125,9.214286,2.598039,2.573248,9.307692,9.214286,4.040404,9.333333,10.0,...,4.108374,4.089109,2.563291,3.943299,4.040404,4.040404,9.214286,9.0,2.598039,2.525641
1,6.0,6.166667,6.0,6.166667,6.0,6.0,6.0,5.696,0.0,0.0,...,5.700787,5.700787,6.0,6.0,5.982955,5.982955,6.0,6.166667,6.181818,6.0
2,0.0,3.0,0.0,3.0,3.0,0.0,3.0,3.082474,0.0,0.0,...,3.132653,3.132653,3.0,3.0,3.082474,3.082474,3.0,3.0,3.0,3.0
3,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
4,0.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


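- Convert the user-movie rating data into a pivot table
- => that is, one row per user with one column per movie
- With N users and M movies, the result is an N x M matrix

matrix = df_user_movie_ratings.as_matrix()

To study: the difference between as_matrix() and to_numpy(). In short, as_matrix() was deprecated in pandas 0.23 and removed in pandas 1.0, and to_numpy() is the recommended replacement.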

In [17]:
# matrix holds the pivot_table values as a NumPy array
matrix = df_user_movie_ratings.to_numpy()

# user_ratings_mean is each user's mean over all movie columns
# (unrated movies count as 0 because of fillna(0))
user_ratings_mean = np.mean(matrix, axis = 1)

# matrix_user_mean: the user-movie matrix with each user's mean subtracted
matrix_user_mean = matrix - user_ratings_mean.reshape(-1, 1)
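
The mean-centering step can be sketched on a small made-up matrix. The example also contrasts the row mean over all columns (what `np.mean(matrix, axis=1)` computes) with a mean over rated entries only, which is shown purely as an alternative and is not what this notebook does. The names `ratings`, `mean_all`, `mean_rated`, and `centered` are hypothetical.

```python
import numpy as np

# Hypothetical user-by-movie matrix; 0 stands for "not rated" after fillna(0)
ratings = np.array([
    [10.0, 0.0, 8.0, 0.0],
    [ 2.0, 4.0, 0.0, 0.0],
])

# Row mean over every column, as np.mean(matrix, axis=1) does
mean_all = ratings.mean(axis=1)

# Row mean over rated entries only (an alternative, shown for comparison)
rated = ratings > 0
mean_rated = ratings.sum(axis=1) / rated.sum(axis=1)

# reshape(-1, 1) turns the means into a column so they broadcast across each row
centered = ratings - mean_all.reshape(-1, 1)

print(mean_all)    # means including the zeros
print(mean_rated)  # means over rated movies only
```

After centering with `mean_all`, every row sums to zero, so adding the same means back later restores the original scale.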

In [18]:
matrix

array([[9.5       , 9.3125    , 9.21428571, ..., 9.        , 2.59803922,
        2.52564103],
       [6.        , 6.16666667, 6.        , ..., 6.16666667, 6.18181818,
        6.        ],
       [0.        , 3.        , 0.        , ..., 3.        , 3.        ,
        3.        ],
       ...,
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ],
       [0.        , 0.        , 0.        , ..., 0.        , 0.        ,
        0.        ]])

In [19]:
user_ratings_mean

array([5.80187549, 5.29533888, 2.23059557, ..., 0.01020408, 0.10204082,
       0.10204082])

In [20]:
matrix_user_mean

array([[ 3.69812451,  3.51062451,  3.41241023, ...,  3.19812451,
        -3.20383627, -3.27623446],
       [ 0.70466112,  0.87132779,  0.70466112, ...,  0.87132779,
         0.88647931,  0.70466112],
       [-2.23059557,  0.76940443, -2.23059557, ...,  0.76940443,
         0.76940443,  0.76940443],
       ...,
       [-0.01020408, -0.01020408, -0.01020408, ..., -0.01020408,
        -0.01020408, -0.01020408],
       [-0.10204082, -0.10204082, -0.10204082, ..., -0.10204082,
        -0.10204082, -0.10204082],
       [-0.10204082, -0.10204082, -0.10204082, ..., -0.10204082,
        -0.10204082, -0.10204082]])

In [21]:
# For each of the N users, that user's mean rating has been subtracted from their row
pd.DataFrame(matrix_user_mean, columns = df_user_movie_ratings.columns).head()

movieId,99702,107992,159074,163811,167403,167569,179406,182019,184311,184318,...,206657,207182,207284,207360,207364,207370,207523,208530,209496,210283
0,3.698125,3.510625,3.41241,-3.203836,-3.228627,3.505817,3.41241,-1.761471,3.531458,4.198125,...,-1.693501,-1.712767,-3.238584,-1.858577,-1.761471,-1.761471,3.41241,3.198125,-3.203836,-3.276234
1,0.704661,0.871328,0.704661,0.871328,0.704661,0.704661,0.704661,0.400661,-5.295339,-5.295339,...,0.405449,0.405449,0.704661,0.704661,0.687616,0.687616,0.704661,0.871328,0.886479,0.704661
2,-2.230596,0.769404,-2.230596,0.769404,0.769404,-2.230596,0.769404,0.851879,-2.230596,-2.230596,...,0.902057,0.902057,0.769404,0.769404,0.851879,0.851879,0.769404,0.769404,0.769404,0.769404
3,-0.734694,0.265306,-0.734694,0.265306,0.265306,-0.734694,0.265306,0.265306,-0.734694,-0.734694,...,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306
4,-0.734694,0.265306,-0.734694,0.265306,0.265306,-0.734694,0.265306,0.265306,-0.734694,-0.734694,...,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306,0.265306


## Latent factor matrix factorization using SVD (singular value decomposition)


In [22]:
# Truncated SVD provided by scipy
# Returns the U matrix, the singular values (sigma), and the transposed V matrix

U, sigma, VT = svds(matrix_user_mean, k = 12)

In [23]:
print(U.shape)
print(sigma.shape)
print(VT.shape)

(52327, 12)
(12,)
(12, 98)
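
A small sketch of what `svds` returns, using a random matrix as a stand-in for `matrix_user_mean` (the matrix `A` and the value `k = 3` are assumptions for illustration). It compares the result with a full SVD and sorts the factors explicitly, since the notebook's own output (`sigma[0]` is smaller than `sigma[1]`) suggests the singular values do not come back in descending order.

```python
import numpy as np
from scipy.sparse.linalg import svds

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 6))  # hypothetical dense matrix standing in for matrix_user_mean

k = 3
U, s, Vt = svds(A, k=k)
print(U.shape, s.shape, Vt.shape)  # (8, 3) (3,) (3, 6)

# Compare against the k largest singular values from a full SVD
s_full = np.linalg.svd(A, compute_uv=False)  # sorted in descending order
top_k = s_full[:k]

# Sort explicitly rather than relying on the order svds happens to return
order = np.argsort(s)[::-1]
U, s, Vt = U[:, order], s[order], Vt[order, :]
```

The ordering does not change the product of the three factors, so the reconstruction is unaffected; sorting only matters when individual factors are inspected.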


In [24]:
sigma = np.diag(sigma)

In [25]:
sigma.shape

(12, 12)

In [26]:
sigma[0]

array([463.7042435,   0.       ,   0.       ,   0.       ,   0.       ,
         0.       ,   0.       ,   0.       ,   0.       ,   0.       ,
         0.       ,   0.       ])

In [27]:
sigma[1]

array([  0.       , 513.0050749,   0.       ,   0.       ,   0.       ,
         0.       ,   0.       ,   0.       ,   0.       ,   0.       ,
         0.       ,   0.       ])

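### Where things stand
- We started with the original user-movie rating matrix
- Subtracting each user's mean score gave the matrix matrix_user_mean
- Applying SVD to matrix_user_mean gave the U, sigma, and VT matrices
- svds returns sigma as a 1-D array holding only the singular values (no zeros), so it was converted to a diagonal matrix with np.diag

### Restoring the original matrix
- Multiply U, sigma, and VT together (matrix product)
- That is, compute np.dot(np.dot(U, sigma), VT)
- Since the user means were subtracted earlier, add them back here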
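
The reconstruction steps above can be sketched end to end on a small made-up matrix. The data and the names `ratings`, `user_mean`, `centered`, and `predicted` are hypothetical; `k = 2` is chosen only so that it fits the toy matrix.

```python
import numpy as np
from scipy.sparse.linalg import svds

# Hypothetical toy ratings, 5 users x 4 movies (0 = not rated)
ratings = np.array([
    [10.0, 8.0, 0.0, 1.0],
    [ 9.0, 0.0, 2.0, 1.0],
    [ 0.0, 9.0, 1.0, 2.0],
    [ 1.0, 2.0, 9.0, 0.0],
    [ 2.0, 0.0, 8.0, 9.0],
])

user_mean = ratings.mean(axis=1)
centered = ratings - user_mean.reshape(-1, 1)

# Truncated SVD with k latent factors (k must be smaller than min(ratings.shape))
U, s, Vt = svds(centered, k=2)

# U @ diag(s) @ Vt is the rank-k approximation; add the user means back
predicted = U @ np.diag(s) @ Vt + user_mean.reshape(-1, 1)
print(np.round(predicted, 2))
```

Because only `k` factors are kept, `predicted` is an approximation of `ratings`, and the entries that were 0 (unrated) receive nonzero estimated scores.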

In [28]:
# The matrix product of U, sigma, and VT approximately restores the original matrix (only k = 12 factors are kept)
# Then add each user's mean rating back
svd_user_predicted_ratings = np.dot(np.dot(U, sigma), VT) + user_ratings_mean.reshape(-1, 1)

In [29]:
df_svd_preds = pd.DataFrame(svd_user_predicted_ratings, columns = df_user_movie_ratings.columns)
df_svd_preds.head()

movieId,99702,107992,159074,163811,167403,167569,179406,182019,184311,184318,...,206657,207182,207284,207360,207364,207370,207523,208530,209496,210283
0,9.321388,7.614523,8.896411,2.295429,2.686973,9.080426,9.547322,4.173115,9.565636,9.527478,...,3.959553,3.965131,2.902342,3.332885,4.208042,4.227303,8.898141,6.920511,2.490777,2.639455
1,5.469114,6.026282,6.027808,6.248886,6.091815,5.715027,5.868653,6.011569,0.128671,0.241074,...,5.628695,5.654393,5.947444,5.82994,6.048287,6.058235,6.034419,6.104652,6.286596,6.138374
2,0.177641,2.963799,1.397102,3.049401,3.009587,0.574572,3.22487,3.060557,0.053296,0.074403,...,3.134341,3.130185,3.019296,3.017109,3.046585,3.032744,1.416562,2.913557,3.03259,3.003163
3,0.058945,0.987839,0.465434,1.016603,1.003256,0.191502,1.07524,0.997881,0.017665,0.024668,...,1.003792,1.003547,1.007551,1.00741,0.997251,0.996279,0.471919,0.970983,1.010912,1.00073
4,0.058945,0.987839,0.465434,1.016603,1.003256,0.191502,1.07524,0.997881,0.017665,0.024668,...,1.003792,1.003547,1.007551,1.00741,0.997251,0.996279,0.471919,0.970983,1.010912,1.00073


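### Data preparation done -> write the function
- Takes the user ID, the movie info table, the ratings table, and so on as arguments
- Sorts the SVD-predicted ratings for that user from highest to lowest
- Excludes movies the user has already rated
- Recommends the highest-scoring movies among those the user has not seen
- => recommends the most relevant movies based on the user's history (only from unseen ones!)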
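
The core of these steps (drop seen movies, sort by predicted score, keep the top N) can be sketched on invented data. The helper `top_n_unseen` and the variables `predicted`, `seen`, and `movies` are hypothetical and simpler than the notebook's own function.

```python
import pandas as pd

# Hypothetical predicted scores for one user, indexed by movieId
predicted = pd.Series({101: 7.2, 102: 9.1, 103: 3.4, 104: 8.0}, name="Predictions")

# Hypothetical set of movies this user has already rated
seen = {102, 103}

# Hypothetical movie metadata
movies = pd.DataFrame({
    "movieId": [101, 102, 103, 104],
    "title": ["A", "B", "C", "D"],
})

def top_n_unseen(predicted, seen, movies, n=5):
    """Return the n highest-scoring movies the user has not rated yet."""
    unseen = predicted[~predicted.index.isin(seen)]
    top = unseen.sort_values(ascending=False).head(n)
    # Name the index so that reset_index yields a movieId column
    top = top.rename_axis("movieId").reset_index()
    return top.merge(movies, on="movieId")

result = top_n_unseen(predicted, seen, movies, n=1)
print(result)
```

Filtering before sorting keeps the work proportional to the number of unseen movies, and `head(n)` makes the requested number of recommendations explicit.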

In [30]:
def recommend_movies(df_svd_preds, user_id, ori_movies_df, ori_ratings_df, num_recommendations=5):
    
    # df_svd_preds is indexed by row position. No "user_id - 1" offset is needed,
    # because our user_id values already start at 0 and match the row positions.
    user_row_number = user_id
    
    # Take this user's row from the prediction table and sort the movies by predicted score, highest first (ascending=False)
    sorted_user_predictions = df_svd_preds.iloc[user_row_number].sort_values(ascending=False)
    
    # Extract this user's rows from the original ratings data.
    user_data = ori_ratings_df[ori_ratings_df.user_id == user_id]
    
    # Merge the user's ratings with the original movie data.
    user_history = user_data.merge(ori_movies_df, on = 'movieId').sort_values(['score'], ascending=False)
    
    # From the original movie data, keep only the movies the user has not rated
    recommendations = ori_movies_df[~ori_movies_df['movieId'].isin(user_history['movieId'])]
    # Attach the user's predicted scores to those unseen movies.
    recommendations = recommendations.merge(pd.DataFrame(sorted_user_predictions).reset_index(), on = 'movieId')
    # Rename the score column and sort by predicted score, highest first
    recommendations = recommendations.rename(columns = {user_row_number: 'Predictions'}).sort_values(['Predictions'], ascending=False)
    
    # Return at most num_recommendations rows (the parameter was previously unused)
    return user_history, recommendations.head(num_recommendations)

In [63]:
already_rated, predictions = recommend_movies(df_svd_preds, 3001, movie_data, rating_data, 10)

In [64]:
already_rated.head(10)

Unnamed: 0,movieId,score,userId,user_id,title,genres
58,204768,10,gpal****,3001,그린 나이트,"모험, 드라마, 판타지"
68,193973,10,gpal****,3001,톰과 제리,애니메이션
135,203643,10,gpal****,3001,내가 죽기를 바라는 자들,"범죄, 스릴러"
80,190382,10,gpal****,3001,새해전야,멜로/로맨스
78,191570,10,gpal****,3001,스파이럴,스릴러
137,210283,10,gpal****,3001,아담스 패밀리 2,"애니메이션, 코미디, 가족"
76,200065,10,gpal****,3001,올드,"스릴러, 공포"
74,202901,10,gpal****,3001,더 박스,드라마
139,200052,10,gpal****,3001,도라에몽: 스탠바이미 2,"애니메이션, 드라마, 가족, SF"
72,197520,10,gpal****,3001,노바디,"액션, 코미디"


In [65]:
predictions

Unnamed: 0,movieId,title,genres,Predictions
4,188472,싱크홀,드라마,2.506104
3,189150,분노의 질주: 더 얼티메이트,액션,1.39221
2,184311,이터널스,"액션, 드라마, 판타지",0.450232
1,184318,블랙 위도우,"액션, 모험, SF",-0.084398
0,192150,모가디슈,"액션, 드라마",-0.696223
