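# Collaborative Filtering
Collaborative filtering is a recommender system technique that recommends items to a particular user based on the behavior data of many users. It rests on the idea that "people with similar tastes will prefer similar items." For example, if users A and B both gave high ratings to movie 1 and movie 2, the two can be considered to have similar tastes. If A also liked movie 3 and B has not seen it yet, a collaborative filtering system can recommend movie 3 to B.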

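## Advantages
* Recommendations are possible even without any information about the items themselves.
* It can uncover a user's hidden interests.
* It can recommend a wide variety of items.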

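## Disadvantages
* Cold start problem: when there is too little information about a new user or a new item, making recommendations is difficult.
* Data sparsity problem: when there is not enough user-item interaction data, performance can degrade.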
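
Both disadvantages can be seen directly in a rating matrix. The following is a minimal sketch on a made-up matrix (the name `toy_ratings` and its values are illustrative assumptions, not data from this notebook), where `NaN` marks a missing rating:

```python
import numpy as np

# Rows are users, columns are items; NaN means "no rating yet" (made-up values)
toy_ratings = np.array([
    [4.0,    np.nan, 5.0,    np.nan],
    [np.nan, np.nan, 2.0,    np.nan],
    [np.nan, np.nan, np.nan, np.nan],   # a brand-new user
])

missing = np.isnan(toy_ratings)

# Sparsity: fraction of user-item pairs without a rating
sparsity = float(missing.mean())

# Cold start: users (rows) or items (columns) with no ratings at all
cold_start_users = np.where(missing.all(axis=1))[0]
cold_start_items = np.where(missing.all(axis=0))[0]

print(sparsity, cold_start_users, cold_start_items)
```

A user or item that appears in the cold-start lists has nothing to compare against, so a purely rating-based method cannot place it near any neighbor.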

In [1]:
import numpy as np

## Euclidean Distance
Euclidean distance measures the straight-line distance between two points in a 2-D plane or 3-D space. The definition generalizes to the distance between two points in an arbitrary n-dimensional space.
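
In n dimensions the distance is the square root of the summed squared coordinate differences. Here is a minimal sketch, assuming NumPy; the helper name `euclidean_distance` is illustrative, and the comparison with `np.linalg.norm` is only a sanity check:

```python
import numpy as np

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# 2-D check: the classic 3-4-5 right triangle
dist_2d = euclidean_distance([0.0, 0.0], [3.0, 4.0])

# The same function works unchanged in 5 dimensions
u = np.array([4, 3, 5, 2, 4])
v = np.array([4, 4, 4, 3, 4])
dist_5d = euclidean_distance(u, v)

# It agrees with the L2 norm of the difference vector
same_as_norm = bool(np.isclose(dist_5d, np.linalg.norm(u - v)))
```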

In [3]:
d1 = np.array([4,3,5,2,4]) # 현승
d2 = np.array([4,4,4,3,4]) # 동욱
d3 = np.array([1,5,1,5,3]) # 영훈

sum((d1 - d2) ** 2) ** (1/2)

1.7320508075688772

In [5]:
def euclidean(a, b):
    return sum((a - b) ** 2) ** (1/2)

euclidean(d1, d2)

1.7320508075688772

## Cosine Similarity
Cosine similarity measures how similar two vectors are using the cosine of the angle between them. It is widely used in text mining and recommender systems to measure the similarity between documents, users, or items.
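
Because only the angle matters, cosine similarity ignores vector length. The sketch below illustrates this; `cosine_sim` is an illustrative helper name chosen so that it does not clash with anything else in this notebook:

```python
import numpy as np

def cosine_sim(a, b):
    """Dot product divided by the product of the two vector lengths."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

x = np.array([1, 2, 3])

same_direction = cosine_sim(x, 2 * x)      # scaling a vector does not change the angle
orthogonal = cosine_sim([1, 0], [0, 1])    # perpendicular vectors
opposite = cosine_sim(x, -x)               # vectors pointing in opposite directions
```

Values close to 1 mean the two vectors point the same way, values near 0 mean they are unrelated, and values close to -1 mean they point in opposite directions.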

In [6]:
d1 = np.array([4,5,3,2,1])
d2 = np.array([3,2,4,2,2])

A = sum(d1 * d2)

B = sum(d1 ** 2) ** (1/2)
C = sum(d2 ** 2) ** (1/2)

cos = A / (B*C)
cos

0.8867021970429453

In [10]:
def cosine_similarity(a , b):
    A = sum(a * b)
    B = sum(a ** 2) ** (1/2)
    C = sum(b ** 2) ** (1/2)
    return A / (B*C)

cosine_similarity(d1 , d2)

0.8867021970429453

* Which of B and C is more similar to A?

In [14]:
A = np.array([1,5,1,5])
B = np.array([2,5,2,5])
C = np.array([5,1,5,1])

cosine_similarity(A, B), cosine_similarity(A, C)

# A is more similar to B

(0.9832820049844601, 0.3846153846153847)

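## Preprocessing the Rating Data
### Filling with 0
Filling the missing ratings with 0 is a fairly simple approach, but it has a problem: the recommender system treats 0 as lower than 1, that is, as the worst possible rating. The user simply did not rate the movie, yet it is counted as a movie the user dislikes, which hurts the accuracy of the recommendations. For that reason, this is not a good approach.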

In [21]:
import pandas as pd

matrix = np.array([
    [4, 1, 5, np.nan, 1],
    [2, 3, np.nan, 2, 3],
    [1, np.nan, 4, 1, 3],
    [np.nan, 2, 4, np.nan, 2],
    [1, np.nan, 4, 1, 3]
])

df = pd.DataFrame(matrix)
df.index = ["현승", "영훈", "동욱", "종훈", "우재"]
df.columns = [f"영화{i}" for i in range(1, 6)]
df

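|      | 영화1 | 영화2 | 영화3 | 영화4 | 영화5 |
|------|-------|-------|-------|-------|-------|
| 현승 | 4.0 | 1.0 | 5.0 | NaN | 1.0 |
| 영훈 | 2.0 | 3.0 | NaN | 2.0 | 3.0 |
| 동욱 | 1.0 | NaN | 4.0 | 1.0 | 3.0 |
| 종훈 | NaN | 2.0 | 4.0 | NaN | 2.0 |
| 우재 | 1.0 | NaN | 4.0 | 1.0 | 3.0 |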


In [22]:
df.fillna(0)      # problematic: a missing rating is treated as the worst possible rating

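|      | 영화1 | 영화2 | 영화3 | 영화4 | 영화5 |
|------|-------|-------|-------|-------|-------|
| 현승 | 4.0 | 1.0 | 5.0 | 0.0 | 1.0 |
| 영훈 | 2.0 | 3.0 | 0.0 | 2.0 | 3.0 |
| 동욱 | 1.0 | 0.0 | 4.0 | 1.0 | 3.0 |
| 종훈 | 0.0 | 2.0 | 4.0 | 0.0 | 2.0 |
| 우재 | 1.0 | 0.0 | 4.0 | 1.0 | 3.0 |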
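
To see why zero-filling distorts similarity, consider two users who agree on every movie both of them rated. This is a small sketch with made-up ratings (the arrays `a` and `b` are illustrative assumptions):

```python
import numpy as np

a = np.array([5.0, 4.0, np.nan, np.nan])   # has not rated movies 3 and 4
b = np.array([5.0, 4.0, 5.0, 5.0])

# Distance measured only on the movies both users actually rated
both = ~np.isnan(a) & ~np.isnan(b)
dist_co_rated = float(np.linalg.norm(a[both] - b[both]))

# Distance after replacing the missing ratings with 0
a_zero = np.nan_to_num(a, nan=0.0)
dist_zero_filled = float(np.linalg.norm(a_zero - b))

print(dist_co_rated, dist_zero_filled)
```

On the co-rated movies the two users are identical, but after zero-filling they look far apart, purely because one of them has not rated two movies yet.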


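### Filling with Each User's Mean Rating
The first user's mean rating is 2.75, so we fill the first user's blanks with 2.75; the second user gives 2.5 on average, so we fill that user's blanks with 2.5, and so on. A user's mean rating can be read as a movie the user neither likes nor dislikes. For the ratings we do not know, this gives a far more reasonable similarity calculation than using 0.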

In [25]:
# Missing values: what if we fill them with each user's mean?
mean = df.mean(axis = 1)
mean

현승    2.750000
영훈    2.500000
동욱    2.250000
종훈    2.666667
우재    2.250000
dtype: float64

In [26]:
df2 = df.apply(lambda row : row.fillna(row.mean()), axis = 1)
df2

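|      | 영화1 | 영화2 | 영화3 | 영화4 | 영화5 |
|------|-------|-------|-------|-------|-------|
| 현승 | 4.0 | 1.0 | 5.0 | 2.75 | 1.0 |
| 영훈 | 2.0 | 3.0 | 2.5 | 2.0 | 3.0 |
| 동욱 | 1.0 | 2.25 | 4.0 | 1.0 | 3.0 |
| 종훈 | 2.666667 | 2.0 | 4.0 | 2.666667 | 2.0 |
| 우재 | 1.0 | 2.25 | 4.0 | 1.0 | 3.0 |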
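
A row-wise `apply` with a Python lambda can become slow on large matrices. As a sketch of a vectorized alternative, assuming the ratings are held in a plain NumPy array (the name `ratings_matrix` is illustrative), each row's mean can be broadcast into that row's gaps:

```python
import numpy as np

ratings_matrix = np.array([
    [4, 1, 5, np.nan, 1],
    [2, 3, np.nan, 2, 3],
    [1, np.nan, 4, 1, 3],
    [np.nan, 2, 4, np.nan, 2],
    [1, np.nan, 4, 1, 3],
])

# Per-user mean that ignores NaN; keepdims gives shape (5, 1) for broadcasting
row_means = np.nanmean(ratings_matrix, axis=1, keepdims=True)

# Keep existing ratings, put the user's own mean wherever a rating is missing
filled = np.where(np.isnan(ratings_matrix), row_means, ratings_matrix)

print(filled)
```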


### Mean Normalization
Here we subtract each user's mean rating from that user's ratings. The first user's mean rating is 2.75, so we subtract 2.75 from all of that user's ratings; the second user's mean is 2.5, so we subtract 2.5, and so on. After this step every user's mean rating is 0. Subtracting the mean from each data point so that the data has mean 0 is called mean normalization. Besides filling unknown values in a reasonable way, mean normalization also accounts for strict and lenient raters. For example, some users give 2 to a so-so movie, 0 to one they dislike, and 3 to one they like, while others give 4 to a so-so movie, 3 to one they dislike, and 5 to one they like. Once every user's mean rating is aligned to 0, disliked, so-so, and liked movies take similar values across users, which makes it more intuitive to find similar users.

In [28]:
df2.sub(mean, axis = 0)

# Each person rates on a different scale, so every user's mean is shifted to 0

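|      | 영화1 | 영화2 | 영화3 | 영화4 | 영화5 |
|------|-------|-------|-------|-------|-------|
| 현승 | 1.25 | -1.75 | 2.25 | 0.0 | -1.75 |
| 영훈 | -0.5 | 0.5 | 0.0 | -0.5 | 0.5 |
| 동욱 | -1.25 | 0.0 | 1.75 | -1.25 | 0.75 |
| 종훈 | 0.0 | -0.666667 | 1.333333 | 0.0 | -0.666667 |
| 우재 | -1.25 | 0.0 | 1.75 | -1.25 | 0.75 |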
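
The effect on strict and lenient raters can be checked with a small sketch. The two users below are made up for illustration: they rank the same three movies identically but use different parts of the rating scale.

```python
import numpy as np

# Ratings for the same three movies: disliked, so-so, liked
strict = np.array([1.0, 2.0, 3.0])
lenient = np.array([3.0, 4.0, 5.0])

# Raw ratings look far apart even though the preferences are identical
dist_raw = float(np.linalg.norm(strict - lenient))

# Subtract each user's own mean so both are centered at 0
strict_centered = strict - strict.mean()
lenient_centered = lenient - lenient.mean()
dist_centered = float(np.linalg.norm(strict_centered - lenient_centered))

print(dist_raw, dist_centered)
```

After centering, both users map to the same vector, so a distance-based neighbor search treats them as having the same taste.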


## Implementing a Collaborative Filtering Recommender System

In [83]:
ratings = pd.read_csv("Data/ratings.csv")
ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit = "s") 

df = ratings.pivot_table(index = "userId", columns = "movieId", 
                         values = "rating", aggfunc = "sum")
df

userId / movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,,,,,,,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
5,4.0,,,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,
609,3.0,,,,,,,,,4.0,...,,,,,,,,,,


In [84]:
# Same steps as above: fill missing values with each user's mean, then mean-normalize
mean = df.mean(axis = 1)

df2 = df.apply(lambda row : row.fillna(row.mean()), axis = 1)
df3 = df2.sub(mean, axis = 0)

df3

userId / movieId,1,2,3,4,5,6,7,8,9,10,...,193565,193567,193571,193573,193579,193581,193583,193585,193587,193609
1,-0.366379,0.000000,-0.366379,0.0,0.0,-0.366379,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.000000,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
5,0.363636,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,-1.157399,0.000000,0.000000,0.0,0.0,0.000000,-1.157399,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
607,0.213904,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.000000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
608,-0.634176,-1.134176,-1.134176,0.0,0.0,0.000000,0.000000,0.0,0.0,0.865824,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
609,-0.270270,0.000000,0.000000,0.0,0.0,0.000000,0.000000,0.0,0.0,0.729730,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [85]:
# Compute pairwise distances between users
from sklearn.metrics.pairwise import euclidean_distances, cosine_similarity

distance_matrix = euclidean_distances(df3)
distance_matrix

array([[ 0.        , 12.88018162, 17.71367592, ..., 32.52330461,
        12.52374725, 33.10732342],
       [12.88018162,  0.        , 13.57431978, ..., 31.40943515,
         5.18205377, 31.11334478],
       [17.71367592, 13.57431978,  0.        , ..., 33.81269431,
        13.1676513 , 33.27108605],
       ...,
       [32.52330461, 31.40943515, 33.81269431, ...,  0.        ,
        31.07357893, 42.64413263],
       [12.52374725,  5.18205377, 13.1676513 , ..., 31.07357893,
         0.        , 31.07797619],
       [33.10732342, 31.11334478, 33.27108605, ..., 42.64413263,
        31.07797619,  0.        ]])

In [86]:
# Find the 5 most similar users based on distance
k = 5
nearest_neighbors = np.argsort(distance_matrix, axis = 1)[:, 1:k+1]
nearest_neighbors
# the 5 users closest to user 0, the 5 users closest to user 1, ...

array([[ 52,  48, 188, 213, 514],
       [ 52, 188,  48, 514,  24],
       [ 52,  48, 514, 188,  24],
       ...,
       [315, 430, 539,  71,  53],
       [ 52,  48,  53, 514, 188],
       [462, 361, 347, 292, 121]], dtype=int64)
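
The slice `[:, 1:k+1]` is easier to follow on a small distance matrix. The sketch below uses made-up distances (the name `toy_dist` is illustrative): `np.argsort` orders each row from nearest to farthest, and column 0 is skipped because it is normally the user itself at distance 0.

```python
import numpy as np

# Symmetric toy distance matrix for 4 users (made-up values)
toy_dist = np.array([
    [0.0, 1.0, 4.0, 2.0],
    [1.0, 0.0, 3.0, 5.0],
    [4.0, 3.0, 0.0, 6.0],
    [2.0, 5.0, 6.0, 0.0],
])

k = 2
order = np.argsort(toy_dist, axis=1)   # per row: user indices from nearest to farthest
neighbors = order[:, 1:k + 1]          # drop column 0 (the user itself), keep the next k

print(neighbors)
```

One caveat: if another user sits at distance exactly 0 (an identical rating vector), the tie means column 0 is not guaranteed to be the user itself, so an explicit self-exclusion would be safer in that case.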

### Top 5 Recommendations for User 0

In [87]:
df = df.values

In [88]:
user = 0

user_ratings = df[user]
user_ratings

# Ratings given by user 0 (NaN means the movie is unrated)

array([ 4., nan,  4., ..., nan, nan, nan])

In [89]:
nan_mask = np.isnan(df)
nan_mask

# True wherever a user's rating is missing

array([[False,  True, False, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       [ True,  True,  True, ...,  True,  True,  True],
       ...,
       [False, False, False, ...,  True,  True,  True],
       [False,  True,  True, ...,  True,  True,  True],
       [False,  True,  True, ...,  True,  True,  True]])

In [90]:
user_nan_mask = nan_mask[user]
user_nan_mask

array([False,  True, False, ...,  True,  True,  True])

In [91]:
unrated_idx = np.where(user_nan_mask)[0]    # column indices of unrated movies
unrated_idx

# Column indices of the movies user 0 has not rated

array([   1,    3,    4, ..., 9721, 9722, 9723], dtype=int64)

In [92]:
neighbors_idx = nearest_neighbors[user]     # row indices of the most similar neighbors
neighbors_idx

# Row indices of the neighbors closest to user 0

array([ 52,  48, 188, 213, 514], dtype=int64)

In [93]:
pred = {}

for movie_idx in unrated_idx:
    box = []       # ratings of this movie from the 5 users closest to user 0
    for neighbor in neighbors_idx:
        if pd.notnull(df[neighbor, movie_idx]):
            box.append(df[neighbor, movie_idx])
    if box:        # at least one of the 5 neighbors rated this movie
        pred_rating = np.mean(box)
        pred[movie_idx] = pred_rating

recommend = sorted(pred.items(), key = lambda i : i[1], reverse = True)[:5]       # sort by predicted rating (i[1]), highest first
recommend
# (column index, predicted rating) pairs for movies user 0 has not rated

[(171, 5.0), (213, 5.0), (338, 5.0), (357, 5.0), (419, 5.0)]
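
The nested loop above averages, for each unrated movie, the ratings of the neighbors who did rate it. The same computation can be sketched in vectorized form on a made-up matrix; `toy_ratings` and the pre-selected `toy_neighbors` are illustrative assumptions, not values from this notebook:

```python
import numpy as np

toy_ratings = np.array([
    [5.0, np.nan, np.nan, 1.0],   # target user (row 0)
    [5.0, 4.0,    np.nan, 1.0],
    [4.0, 5.0,    np.nan, 2.0],
    [1.0, 1.0,    5.0,    5.0],
])
target = 0
toy_neighbors = np.array([1, 2])  # assumed to be chosen already, e.g. by distance

neighbor_ratings = toy_ratings[toy_neighbors]            # shape (n_neighbors, n_movies)
counts = (~np.isnan(neighbor_ratings)).sum(axis=0)       # neighbors who rated each movie
sums = np.nansum(neighbor_ratings, axis=0)               # NaN counts as 0 in the sum

# Movies the target has not rated AND at least one neighbor has rated
candidate = np.isnan(toy_ratings[target]) & (counts > 0)
toy_pred = {int(j): float(sums[j] / counts[j]) for j in np.where(candidate)[0]}
toy_recommend = sorted(toy_pred.items(), key=lambda item: item[1], reverse=True)

print(toy_recommend)
```

Movie 2 is unrated by the target but also by both neighbors, so no prediction is made for it, which mirrors the `if box:` check in the loop version.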

In [95]:
movies = pd.read_csv("Data/movies.csv")

# Map movieId -> title
movies_idx = dict(zip(movies["movieId"], movies["title"]))

# `recommend` holds column positions of the rating matrix, not movieIds.
# pivot_table sorts its columns, so position i is the i-th smallest rated movieId.
movie_ids = np.sort(ratings["movieId"].unique())

for i in recommend:
    print(movies_idx[movie_ids[i[0]]])


## Exercises
1. Review the material above and, as practiced with the content-based recommender system, recommend 5 movies to every user.

In [97]:
ratings = pd.read_csv("Data/ratings.csv")
ratings['timestamp'] = pd.to_datetime(ratings['timestamp'], unit = 's')

df = ratings.pivot_table(index = "userId", columns = 'movieId',
                    values = 'rating', aggfunc="sum")

mean = df.mean(axis = 1)

df2 = df.apply(lambda row : row.fillna(row.mean()), axis = 1)

df3 = df2.sub(mean, axis = 0)

df4 = pd.DataFrame(euclidean_distances(df3))

df4.index = df.index
df4.columns = df.index
df4

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
1,0.000000,12.880182,17.713676,22.282291,13.659757,19.773635,20.426990,13.557556,14.719956,18.518760,...,12.905251,16.756453,37.611895,14.474085,17.433418,26.924368,17.424192,32.523305,12.523747,33.107323
2,12.880182,0.000000,13.574320,19.807213,7.690730,15.727706,16.931075,8.022803,9.541592,14.489728,...,6.389527,12.223483,35.670329,8.551749,12.870061,24.515627,13.925462,31.409435,5.182054,31.113345
3,17.713676,13.574320,0.000000,23.302697,14.613394,19.766514,20.809386,14.666236,15.458309,18.926845,...,13.687872,17.291207,37.464410,14.867742,17.720147,27.810887,18.498044,33.812694,13.167651,33.271086
4,22.282291,19.807213,23.302697,0.000000,20.516561,24.284270,24.529476,20.354308,21.121914,23.149703,...,19.965812,21.725144,39.841009,20.742013,22.362051,30.598706,23.180862,37.190631,19.506217,36.455161
5,13.659757,7.690730,14.613394,20.516561,0.000000,16.336706,17.520716,9.312540,10.726201,15.501257,...,7.869663,12.982558,35.824839,9.794989,13.977460,24.860522,14.497213,31.775438,6.796082,31.603054
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
606,26.924368,24.515627,27.810887,30.598706,24.860522,28.587767,28.787094,24.910129,25.376043,28.337724,...,24.351908,26.520786,40.866504,24.843491,26.833591,0.000000,27.315889,38.284910,24.215837,37.785899
607,17.424192,13.925462,18.498044,23.180862,14.497213,19.542125,20.665242,14.437896,15.781244,19.287968,...,13.720744,17.036584,37.163690,15.097813,18.174368,27.315889,0.000000,33.223134,13.393497,33.348104
608,32.523305,31.409435,33.812694,37.190631,31.775438,34.248248,34.711105,31.318536,31.972743,34.695073,...,31.111212,32.539212,46.821926,31.919217,33.470584,38.284910,33.223134,0.000000,31.073579,42.644133
609,12.523747,5.182054,13.167651,19.506217,6.796082,15.148394,16.536787,6.930859,8.953595,14.227977,...,5.188654,11.474880,35.469060,7.647929,12.583861,24.215837,13.393497,31.073579,0.000000,31.077976


In [104]:
k = 5
# .values makes argsort return a plain NumPy array of row positions
nearest_neighbors = np.argsort(df4.values, axis = 1)[:, 1:k+1]   # the 5 nearest neighbors for each row

nan_mask = np.isnan(df.values)       # True where a user has not rated a movie (computed once)

for user in range(len(df4)):         # iterate over row positions, not userId labels
    user_nan_mask = nan_mask[user]
    unrated_idx = np.where(user_nan_mask)[0]
    neighbors_idx = nearest_neighbors[user]
    
    pred = {}
    for movie_idx in unrated_idx:
        box = []
        for neighbor in neighbors_idx:
            if pd.notnull(df.iloc[neighbor, movie_idx]):
                box.append(df.iloc[neighbor, movie_idx])
        if box:
            pred_rating = np.mean(box)
            pred[df.columns[movie_idx]] = pred_rating    # key by movieId, not column position

pred     # predictions for the last user processed

{17: 3.0,
 25: 2.0,
 62: 3.0,
 66: 3.0,
 83: 3.0,
 104: 3.0,
 141: 2.0,
 203: 5.0,
 249: 5.0,
 318: 4.333333333333333,
 381: 5.0,
 413: 5.0,
 481: 5.0,
 628: 3.0,
 719: 3.0,
 743: 3.0,
 748: 5.0,
 788: 2.0,
 802: 3.0,
 880: 5.0,
 916: 5.0,
 922: 5.0,
 1059: 4.0,
 1100: 5.0,
 1125: 5.0,
 1200: 4.5,
 1356: 3.0,
 1363: 3.0,
 1409: 3.0,
 1441: 5.0,
 1982: 5.0,
 2686: 5.0,
 2762: 4.0,
 3100: 5.0,
 4019: 5.0,
 4022: 4.5,
 4993: 4.0,
 5218: 4.0,
 5952: 4.0,
 6377: 4.0,
 7153: 4.0,
 33794: 4.25,
 47099: 4.5,
 48516: 4.5,
 48780: 5.0,
 54286: 4.5,
 58559: 5.0,
 59315: 5.0,
 60069: 5.0,
 68954: 4.5,
 70286: 4.0,
 76093: 4.0,
 79091: 4.0,
 79132: 4.666666666666667,
 91529: 4.25,
 103335: 4.0,
 109487: 4.5,
 112552: 5.0,
 116797: 5.0,
 122916: 5.0,
 122918: 5.0,
 122920: 4.0,
 139385: 4.5,
 168250: 4.5,
 168252: 4.75,
 174055: 4.0,
 175569: 4.0,
 176371: 5.0,
 179819: 4.0}

In [105]:
recommend = sorted(pred.items(), key = lambda i : i[1], reverse = True)[:5]
recommend

[(203, 5.0), (249, 5.0), (381, 5.0), (413, 5.0), (481, 5.0)]

In [106]:
import pandas as pd
movies = pd.read_csv("Data/movies.csv")

movies_idx = {}
for i in range(len(movies)):
    row = movies.iloc[i]
    movies_idx[row["movieId"]] = row["title"]

for i in recommend:
    print(movies_idx[i[0]])

To Wong Foo, Thanks for Everything! Julie Newmar (1995)
Immortal Beloved (1994)
When a Man Loves a Woman (1994)
Airheads (1994)
Kalifornia (1993)


In [107]:
k = 5
nearest_neighbors = np.argsort(df4.values, axis = 1)[:, 1:k+1]

values = df.values                 # plain NumPy array: much faster to index than df.iloc
nan_mask = np.isnan(values)        # computed once, outside the loop

total = []

for user in range(len(df4)):       # row positions, not userId labels
    neighbor_ratings = values[nearest_neighbors[user]]       # shape (k, number of movies)
    counts = (~np.isnan(neighbor_ratings)).sum(axis = 0)     # neighbors who rated each movie
    sums = np.nansum(neighbor_ratings, axis = 0)

    # Movies this user has not rated that at least one neighbor has rated
    candidates = np.where(nan_mask[user] & (counts > 0))[0]
    pred = {df.columns[j]: sums[j] / counts[j] for j in candidates}

    recommend = sorted(pred.items(), key = lambda i : i[1], reverse = True)[:5]
    total.append([movies_idx[movie_id] for movie_id, _ in recommend])

2. Build the collaborative filtering recommender system using cosine similarity and recommend movies to a specific user.
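
As one possible starting point for exercise 2, here is a sketch of neighbor selection with cosine similarity, written with NumPy only. The helper `cosine_similarity_matrix` and the toy `centered` matrix are illustrative assumptions. The main difference from the distance-based version is that a larger value means a closer neighbor, so the sort must be descending.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0            # all-zero rows stay zero instead of dividing by 0
    unit = X / norms
    return unit @ unit.T

# Toy mean-normalized ratings for 4 users (made-up values)
centered = np.array([
    [ 1.0, -1.0,  0.0],
    [ 2.0, -2.0,  0.0],
    [-1.0,  1.0,  0.0],
    [ 0.0,  1.0, -1.0],
])

sim = cosine_similarity_matrix(centered)
np.fill_diagonal(sim, -np.inf)         # a user must never be picked as its own neighbor

k = 1
nearest = np.argsort(-sim, axis=1)[:, :k]   # negate so the largest similarity comes first

print(nearest)
```

The resulting neighbor indices can then be used for rating prediction in the same way as the distance-based neighbors.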