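# Movie Recommendation System

- Recommendation system: a system that uses various information about a particular user to serve the content that user is likely to want

- e.g., YouTube, Netflix, online stores, advertising

## Content-Based Filtering (TMDB 5000 Movie Dataset)

- When a user likes a particular item, recommend other items that are similar to it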


### 0. Data Preparation

In [None]:
%pip install pandas scikit-learn

In [7]:
# Load the data
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

movies = pd.read_csv('./data/tmdb_5000_movies.csv')
print(movies.shape)

(4803, 20)


In [8]:
movies.head(2)

Unnamed: 0,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count
0,237000000,"[{""id"": 28, ""name"": ""Action""}, {""id"": 12, ""nam...",http://www.avatarmovie.com/,19995,"[{""id"": 1463, ""name"": ""culture clash""}, {""id"":...",en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800
1,300000000,"[{""id"": 12, ""name"": ""Adventure""}, {""id"": 14, ""...",http://disney.go.com/disneypictures/pirates/,285,"[{""id"": 270, ""name"": ""ocean""}, {""id"": 726, ""na...",en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500


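### 1. EDA and Feature Selection
- Column descriptions
    - id : movie ID
    - title : movie title
    - genres : movie genres
    - vote_average : average rating of the movie
    - vote_count : number of votes the movie received
    - popularity : movie popularity score
    - keywords : movie keywords
    - overview : movie synopsis

※ Let's check the data types of genres and keywords!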

In [11]:
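# Select the features to use
# Inspect the data with head(n)
use_columns = ['id', 'title', 'genres', 'vote_average', 'vote_count', 
               'popularity', 'keywords', 'overview']
# .copy() gives an independent DataFrame, so later column assignments
# do not trigger pandas' SettingWithCopyWarning
movies_df = movies[use_columns].copy()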

In [21]:
# Data type of a genres value
type(movies_df['genres'][0])

str

In [19]:
# Data type of a keywords value
type(movies_df['keywords'][3])

str

In [None]:
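import json
# The raw string uses double quotes, so it is valid JSON and json.loads can parse it directly
json.loads(movies_df['genres'][0])
# json.loads cannot be used when keys and values are delimited by single quotes:
# json.loads(movies_df['genres'][0].replace('\"', '\''))  # raises json.JSONDecodeError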

In [43]:
"temp = " + movies_df['genres'][0]

'temp = [{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}, {"id": 14, "name": "Fantasy"}, {"id": 878, "name": "Science Fiction"}]'

In [48]:
# exec interprets the string as Python code (Python literals accept both single and double quotes)
exec("temp = " + movies_df['genres'][0].replace('\"', '\''))

In [49]:
temp

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]

In [50]:
movies_df['genres'] = movies_df['genres'].apply(lambda x : eval(x))

In [51]:
# movies_df['genres'].apply(eval)

movies_df['keywords'] = movies_df['keywords'].apply(eval)

In [55]:
movies_df['genres'][0]

[{'id': 28, 'name': 'Action'},
 {'id': 12, 'name': 'Adventure'},
 {'id': 14, 'name': 'Fantasy'},
 {'id': 878, 'name': 'Science Fiction'}]
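`eval` and `exec` run whatever code the string happens to contain, so they are risky on data you do not control. Below is a minimal sketch of two safer standard-library alternatives, assuming the column holds JSON-style text like the sample shown earlier. The `raw` string is a shortened copy of that sample, and the variable names are illustrative only.

```python
import ast
import json

# Shortened copy of the genres string shown earlier (illustrative sample)
raw = '[{"id": 28, "name": "Action"}, {"id": 12, "name": "Adventure"}]'

# json.loads works because the raw text uses double quotes (valid JSON)
from_json = json.loads(raw)

# ast.literal_eval accepts only Python literals, so it cannot execute arbitrary code;
# it also handles the single-quoted form that json.loads rejects
from_literal = ast.literal_eval(raw.replace('"', "'"))

# Keep only the names, dropping the ids
genre_names = [genre_dict["name"] for genre_dict in from_json]
print(genre_names)
```

With a DataFrame, either function could be passed to `apply` in place of `eval`, for example `movies_df['genres'].apply(json.loads)`.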

In [None]:
# Goal: movies_df['genres'][0] should become ['Action', 'Adventure', 'Fantasy', 'Science Fiction']

Drop the id from genres and keywords (keep only the name)

In [None]:
# Not the idiomatic pandas approach (an explicit Python loop over the column)
# for genre in movies_df['genres']:
#     print(genre[0])

In [56]:
# genres = [{'id': 28, 'name': 'Action'}, {'id': 12, 'name': 'Adventure'}, {'id': 14, 'name': 'Fantasy'}, {'id': 878, 'name': 'Science Fiction'}]
# genre_dict = {'id': 28, 'name': 'Action'}
movies_df['genres'] = movies_df['genres'].apply(lambda genres: [genre_dict['name'] for genre_dict in genres])
movies_df['keywords'] = movies_df['keywords'].apply(lambda keywords: [keyword_dict['name'] for keyword_dict in keywords])

In [57]:
movies_df['genres'][0]

['Action', 'Adventure', 'Fantasy', 'Science Fiction']

In [None]:
movies_df['keywords'][0]

Join each list into a single string (added as a new feature)

In [60]:
movies_df['genres_literal'] = movies_df['genres'].apply(lambda x: ' '.join(x))

In [62]:
movies_df.head(3).T

Unnamed: 0,0,1,2
id,19995,285,206647
title,Avatar,Pirates of the Caribbean: At World's End,Spectre
genres,"[Action, Adventure, Fantasy, Science Fiction]","[Adventure, Fantasy, Action]","[Action, Adventure, Crime]"
vote_average,7.2,6.9,6.3
vote_count,11800,4500,4466
popularity,150.437577,139.082615,107.376788
keywords,"[culture clash, future, space war, space colon...","[ocean, drug abuse, exotic island, east india ...","[spy, based on novel, secret agent, sequel, mi..."
overview,"In the 22nd century, a paraplegic Marine is di...","Captain Barbossa, long believed to be dead, ha...",A cryptic message from Bond’s past sends him o...
genres_literal,Action Adventure Fantasy Science Fiction,Adventure Fantasy Action,Action Adventure Crime


Vectorize these strings into a document-term matrix (DTM)

In [64]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(min_df=0.0, ngram_range=(1, 2))
genre_mat = count_vect.fit_transform(movies_df['genres_literal'])
print(genre_mat.shape)

(4803, 276)
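To make the 276 columns less mysterious, here is a rough pure-Python sketch of what a count vectorizer with `ngram_range=(1, 2)` does. It assumes the default behaviour of lowercasing and keeping word tokens of two or more characters. The helper `unigrams_and_bigrams` is a hypothetical name, and this is an imitation for illustration, not scikit-learn's implementation.

```python
import re
from collections import Counter

def unigrams_and_bigrams(text):
    """Lowercase, split into word tokens of 2+ characters, then add adjacent-word bigrams."""
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams

# genres_literal values of the first two movies
docs = [
    "Action Adventure Fantasy Science Fiction",
    "Adventure Fantasy Action",
]

# Count tokens per document, then build a shared, sorted vocabulary
counts = [Counter(unigrams_and_bigrams(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))

# Document-term matrix: one row per document, one column per vocabulary token
dtm = [[c[token] for token in vocabulary] for c in counts]

print(vocabulary)
print(dtm)
```

Note that "Science Fiction" is split into `science`, `fiction`, and the bigram `science fiction`, which is why the first movie ends up with 9 nonzero entries rather than 4.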


In [None]:
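# Illustration only (not real output): each row of the DTM is one movie,
# and each column is one of the 276 tokens (unigrams and bigrams),
# e.g. Action, Action Adventure, Science, Science Fiction
[
    # Row 1
    # columns: Action, Adventure, Fantasy, Science, ...
    (1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),

    # Row 2
    # columns: Action, Adventure, Fantasy, Science, ...
    (1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0),

    # ... up to 4,803 rows
]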

※ What is the data type of genre_mat?

In [65]:
genre_mat

<4803x276 sparse matrix of type '<class 'numpy.int64'>'
	with 20631 stored elements in Compressed Sparse Row format>

In [66]:
type(genre_mat)

scipy.sparse._csr.csr_matrix

In [67]:
genre_mat[0]

<1x276 sparse matrix of type '<class 'numpy.int64'>'
	with 9 stored elements in Compressed Sparse Row format>

In [76]:
genre_mat.toarray().shape

(4803, 276)

In [77]:
4803*276

1325628

In [None]:
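# The data is stored in this sparse form to avoid wasting space
# and to make computation more efficient.
# Storing every dimension for every row, (4803, 276), wastes a lot of space,
# and every value, zeros included, then has to take part in each calculation.
# Storing only the entries that carry information is a big win in both space and computation.
# "Space" here means memory.
# To see the computational win, think through the cosine similarity formula:
# zero entries contribute nothing to the dot product or to the norms.
# Row 0 is stored as (column index, value) pairs:
# [0, 1]
# [1, 1]
# [16, 1]
# [24, 1]
# ...
# 9 pairs in total

# instead of 276 values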

In [78]:
genre_mat.toarray()[0]

array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int64)
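The benefit of sparse storage for cosine similarity can be sketched in plain Python. Each row is kept as a `{column: value}` dictionary, so the dot product only touches columns that are nonzero in both rows. `row_a` uses the nonzero positions read from the dense array printed above. `row_b` and the helper names are hypothetical. This illustrates the idea and is not how SciPy implements CSR matrices.

```python
import math

# Nonzero columns of row 0, read from the dense array above
row_a = {0: 1, 1: 1, 16: 1, 24: 1, 124: 1, 135: 1, 138: 1, 232: 1, 233: 1}
# Hypothetical second row, for illustration only
row_b = {0: 1, 1: 1, 16: 1, 40: 1, 41: 1}

def sparse_dot(a, b):
    """Dot product that only visits entries stored in the smaller row."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b[col] for col, value in a.items() if col in b)

def sparse_cosine(a, b):
    """Cosine similarity computed from stored (nonzero) entries only."""
    return sparse_dot(a, b) / (math.sqrt(sparse_dot(a, a)) * math.sqrt(sparse_dot(b, b)))

# Fraction of the 4803 x 276 matrix that is actually stored
density = 20631 / (4803 * 276)

print(sparse_cosine(row_a, row_b))
print(f"{density:.2%} of the entries are nonzero")
```

Only about 1.6% of the cells are nonzero, so both memory use and the work per dot product shrink roughly in proportion.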

Computing cosine similarity

In [79]:
import numpy as np

def cos_similarity(a:np.ndarray, b:np.ndarray):

    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

cos_similarity(np.array((7.0, 12.4, 256.0, 322.17)), np.array((7.0, 12.4, 256.0, 322.17)))

1.0

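### 2. Building the Recommender from Similarity Scores
Compute the cosine similarity of the matrix
- Cosine similarity: a similarity measure between two vectors, obtained from the cosine of the angle between them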

In [80]:
genre_array = genre_mat.toarray()

In [82]:
genre_array.shape

(4803, 276)

In [84]:
cos_similarity(genre_array[0, :], genre_array[1, :])

0.5962847939999439
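The value 0.596 can be checked by hand. Assuming the vectorizer lowercases the text and emits each unigram and bigram once per movie, both rows are 0/1 vectors, so the dot product is the number of shared tokens and each norm is the square root of the token count. The token sets below are derived from the two `genres_literal` strings under that assumption.

```python
import math

# Tokens of "Action Adventure Fantasy Science Fiction" (unigrams + bigrams)
avatar = {
    "action", "adventure", "fantasy", "science", "fiction",
    "action adventure", "adventure fantasy", "fantasy science", "science fiction",
}
# Tokens of "Adventure Fantasy Action"
pirates = {"adventure", "fantasy", "action", "adventure fantasy", "fantasy action"}

shared = avatar & pirates

# For 0/1 vectors: dot product = |shared|, norm = sqrt(number of tokens)
similarity = len(shared) / (math.sqrt(len(avatar)) * math.sqrt(len(pirates)))

print(sorted(shared))
print(similarity)
```

Four shared tokens over norms of 3 and √5 give 4 / (3·√5), which agrees with the result of `cos_similarity` above.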

In [None]:
4803*4803

In [None]:
## Direct computation (4803 x 4803 pairs in a Python loop, so this is slow)
cos_sim_result = []
for i in range(genre_array.shape[0]): # 4803
    temp = []
    for j in range(genre_array.shape[0]): # 4803
        sim = cos_similarity(genre_array[i, :], genre_array[j, :])
        temp.append(sim)

    cos_sim_result.append(temp)


In [87]:
from sklearn.metrics.pairwise import cosine_similarity

genre_sim = cosine_similarity(genre_mat, genre_mat)
print(genre_sim.shape)
print(genre_sim[:2])

(4803, 4803)
[[1.         0.59628479 0.4472136  ... 0.         0.         0.        ]
 [0.59628479 1.         0.4        ... 0.         0.         0.        ]]
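The printed rows show that the similarity matrix is symmetric (entry [0, 1] equals entry [1, 0]). If you ever do compute it by hand, a small sketch like the following fills both halves from one calculation, roughly halving the work. The `vectors` list is hypothetical toy data, and `cosine` is a plain-Python stand-in for the NumPy version defined earlier.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical toy vectors standing in for rows of genre_array
vectors = [[1, 1, 0], [1, 0, 1], [0, 1, 1]]
n = len(vectors)

sim = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i, n):  # upper triangle only, including the diagonal
        sim[i][j] = sim[j][i] = cosine(vectors[i], vectors[j])

# Number of similarity evaluations for the 4803 movies
pairs_full = 4803 * 4803
pairs_half = 4803 * 4804 // 2
print(pairs_full, pairs_half)
```

In practice the vectorized `cosine_similarity` call is still the better choice; the sketch only shows where the redundancy in the nested loop comes from.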


In [None]:
genre_sim

Sort genre_sim so that, for a given movie, the most similar movies come first
- argsort : returns the **indices** that would sort the array, not the sorted values

In [92]:
temp = [1, 2, 3, 4, 5]

In [95]:
temp

[1, 2, 3, 4, 5]

In [93]:
temp[::-1]#.reverse()

[5, 4, 3, 2, 1]

In [91]:
# argsort sorts in ascending order, so reverse each row with [::-1] to get descending order
genre_sim_sorted_ind = genre_sim.argsort()[:, ::-1]
print(genre_sim_sorted_ind[:1])

[[   0 3494  813 ... 3038 3037 2401]]
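What `argsort` followed by a reversal returns can be sketched without NumPy. The `scores` list is hypothetical. The point is that the result holds positions, not values.

```python
# Hypothetical similarity scores of one movie against four movies
scores = [0.2, 1.0, 0.6, 0.0]

# Indices that would sort the scores in ascending order (what argsort returns)
ascending_idx = sorted(range(len(scores)), key=lambda i: scores[i])

# Reversing gives the indices in descending order of score
descending_idx = ascending_idx[::-1]

print(ascending_idx)   # [3, 0, 2, 1]
print(descending_idx)  # [1, 2, 0, 3]
```

When several scores are tied, their relative order in the result depends on the sort algorithm, so the queried movie itself is not guaranteed to come first among movies with similarity 1.0.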


Function that returns a DataFrame of recommended movies

In [98]:
# 1. Look up the index by movie title
movie_idx = movies_df[movies_df['title']=="The Godfather"].index.values
movies_df[movies_df['title']=="The Godfather"].index.values

array([3337], dtype=int64)

In [104]:
recommend_idxs = genre_sim_sorted_ind[movie_idx, :5].reshape(-1)
genre_sim_sorted_ind[movie_idx, :5].reshape(-1)

array([2731, 1243, 3636, 1946, 2640], dtype=int64)

In [105]:
movies_df.iloc[recommend_idxs]

Unnamed: 0,id,title,genres,vote_average,vote_count,popularity,keywords,overview,genres_literal
2731,240,The Godfather: Part II,"[Drama, Crime]",8.3,3338,105.792936,"[italo-american, cuba, vororte, melancholy, pr...",In the continuing saga of the Corleone crime f...,Drama Crime
1243,203,Mean Streets,"[Drama, Crime]",7.2,345,17.002096,"[epilepsy, protection money, secret love, mone...","A small-time hood must choose from among love,...",Drama Crime
3636,36351,Light Sleeper,"[Drama, Crime]",5.7,15,6.063868,"[suicide, drug dealer, redemption, addict, exi...",A drug dealer with upscale clientele is having...,Drama Crime
1946,11699,The Bad Lieutenant: Port of Call - New Orleans,"[Drama, Crime]",6.0,326,17.339852,"[police brutality, organized crime, policeman,...","Terrence McDonagh, a New Orleans Police sergea...",Drama Crime
2640,400,Things to Do in Denver When You're Dead,"[Drama, Crime]",6.7,85,6.932221,"[father son relationship, bounty hunter, boat,...",A mafia film in Tarantino style with a star-st...,Drama Crime


In [106]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    # Select the row of the movie we want recommendations for
    title_movie = df[df['title']==title_name]

    # Its index
    title_index = title_movie.index.values

    # From the similarity-sorted index array, take the top n entries of that row
    similar_indexes = sorted_ind[title_index, :(top_n)]

    print(similar_indexes)
    # Reshape to a 1-D array
    similar_indexes = similar_indexes.reshape(-1)

    # Use iloc to return only the movies at those positions
    return df.iloc[similar_indexes]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, "The Godfather", 10)
similar_movies[['title', 'vote_average']]

[[2731 1243 3636 1946 2640 4065 1847 4217  883 3866]]


Unnamed: 0,title,vote_average
2731,The Godfather: Part II,8.3
1243,Mean Streets,7.2
3636,Light Sleeper,5.7
1946,The Bad Lieutenant: Port of Call - New Orleans,6.0
2640,Things to Do in Denver When You're Dead,6.7
4065,Mi America,0.0
1847,GoodFellas,8.2
4217,Kids,6.8
883,Catch Me If You Can,7.7
3866,City of God,8.1


- Using the vote_average feature

When only a few people have voted, it is hard to judge whether the average rating is really reliable.

In [107]:
movies_df[['title', 'vote_average', 'vote_count']].sort_values('vote_average', ascending=False)[:10]

Unnamed: 0,title,vote_average,vote_count
3519,Stiff Upper Lips,10.0,1
4247,Me You and Five Bucks,10.0,2
4045,"Dancer, Texas Pop. 81",10.0,1
4662,Little Big Top,10.0,1
3992,Sardaarji,9.5,2
2386,One Man's Hero,9.3,2
2970,There Goes My Baby,8.5,2
1881,The Shawshank Redemption,8.5,8205
2796,The Prisoner of Zenda,8.4,11
3337,The Godfather,8.4,5893


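Weighted rating formula used to rank movies

$$\text{weighted rating} = \frac{v}{v+m} \cdot R + \frac{m}{v+m} \cdot C$$

- v: number of votes cast for the individual movie  
- m: minimum number of votes required for a rating to count  
- R: average rating of the individual movie  
- C: average rating across all movies  

Set the minimum vote count m to the 60th percentile of vote_count.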
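A quick worked example of the weighting defined by v, m, R, and C. The values of `C` and `m` below are hypothetical round numbers, not the ones computed from the dataset. The two (rating, votes) pairs are taken from the top-rated table above.

```python
# Hypothetical global values, for illustration only
C = 6.0    # average rating across all movies
m = 300    # minimum vote count

def weighted_rating(R, v, C=C, m=m):
    """Blend a movie's own average R with the global average C, weighted by vote count v."""
    return (v / (v + m)) * R + (m / (v + m)) * C

# A 10.0 average from a single vote is pulled almost all the way to C
few_votes = weighted_rating(R=10.0, v=1)

# An 8.5 average from 8205 votes barely moves
many_votes = weighted_rating(R=8.5, v=8205)

print(round(few_votes, 3), round(many_votes, 3))
```

With few votes the score stays close to the global average, while a heavily voted movie keeps nearly its own average. This is what pushes one-vote 10.0 titles down the ranking.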

In [109]:
C = movies_df['vote_average'].mean()
m = movies_df['vote_count'].quantile(0.6)

def weighted_vote_average(record):
    v = record['vote_count']
    R = record['vote_average']

    return ((v/(v+m)) * R) + ((m / (v + m)) * C)

movies_df['weighted_vote'] = movies_df.apply(weighted_vote_average, axis=1)
movies_df.head(2).T

Unnamed: 0,0,1
id,19995,285
title,Avatar,Pirates of the Caribbean: At World's End
genres,"[Action, Adventure, Fantasy, Science Fiction]","[Adventure, Fantasy, Action]"
vote_average,7.2,6.9
vote_count,11800,4500
popularity,150.437577,139.082615
keywords,"[culture clash, future, space war, space colon...","[ocean, drug abuse, exotic island, east india ..."
overview,"In the 22nd century, a paraplegic Marine is di...","Captain Barbossa, long believed to be dead, ha..."
genres_literal,Action Adventure Fantasy Science Fiction,Adventure Fantasy Action
weighted_vote,7.166301,6.838594


Function that returns a DataFrame of recommended movies, ranked by weighted rating

In [110]:
def find_sim_movie(df, sorted_ind, title_name, top_n=10):
    title_movie = df[df['title'] == title_name]
    title_index = title_movie.index.values

    similar_indexes = sorted_ind[title_index, : (top_n * 2)]
    similar_indexes = similar_indexes.reshape(-1)
    
    similar_indexes = similar_indexes[similar_indexes != title_index]

    return df.iloc[similar_indexes].sort_values('weighted_vote', ascending=False)[:top_n]

similar_movies = find_sim_movie(movies_df, genre_sim_sorted_ind, 'The Godfather', 10)
similar_movies[['title', 'vote_average', 'weighted_vote']]

Unnamed: 0,title,vote_average,weighted_vote
2731,The Godfather: Part II,8.3,8.079586
1847,GoodFellas,8.2,7.976937
3866,City of God,8.1,7.759693
1663,Once Upon a Time in America,8.2,7.657811
883,Catch Me If You Can,7.7,7.557097
281,American Gangster,7.4,7.141396
4041,This Is England,7.4,6.739664
1149,American Hustle,6.8,6.717525
1243,Mean Streets,7.2,6.626569
2839,Rounders,6.9,6.530427
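The re-ranking logic in this version of `find_sim_movie` can be sketched with plain lists: take twice as many candidates as needed by similarity, drop the queried movie, then sort what remains by weighted rating. The `candidates` data and the `recommend` helper are hypothetical.

```python
# Hypothetical (title, similarity, weighted_vote) tuples, already sorted by similarity
candidates = [
    ("Query", 1.0, 8.0),
    ("A", 1.0, 5.5),
    ("B", 1.0, 7.9),
    ("C", 0.9, 6.2),
    ("D", 0.8, 7.0),
]

def recommend(candidates, query_title, top_n):
    """Take 2 * top_n similar items, exclude the query, re-rank by weighted vote."""
    pool = candidates[: top_n * 2]
    pool = [c for c in pool if c[0] != query_title]
    return sorted(pool, key=lambda c: c[2], reverse=True)[:top_n]

result = recommend(candidates, "Query", top_n=2)
print([title for title, _, _ in result])  # ['B', 'C']
```

"D" has a higher weighted vote than "C" but never enters the pool, because it falls outside the first `2 * top_n` candidates by similarity. The pool size is therefore a trade-off between similarity and rating quality.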


## Item-Based Nearest-Neighbor Collaborative Filtering (MovieLens Rating Data)

### 0. Data Preparation

In [111]:
import pandas as pd
import numpy as np
import warnings; warnings.filterwarnings('ignore')

movies = pd.read_csv('./data/movielens/movies.csv')
ratings = pd.read_csv('./data/movielens/ratings.csv')
movies.shape, ratings.shape

((9742, 3), (100836, 4))

In [112]:
movies.head(2)

Unnamed: 0,movieId,title,genres
0,1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
1,2,Jumanji (1995),Adventure|Children|Fantasy


In [113]:
ratings.head(2)

Unnamed: 0,userId,movieId,rating,timestamp
0,1,1,4.0,964982703
1,1,3,4.0,964981247


### 1. EDA and Data Cleaning

- Merge ratings and movies on movieId

In [114]:
rating_movies = pd.merge(ratings, movies, on='movieId')

In [117]:
rating_movies[:2].T

Unnamed: 0,0,1
userId,1,5
movieId,1,1
rating,4.0,4.0
timestamp,964982703,847434962
title,Toy Story (1995),Toy Story (1995)
genres,Adventure|Animation|Children|Comedy|Fantasy,Adventure|Animation|Children|Comedy|Fantasy


Use a pivot table to reshape the ratings into a DataFrame with userId as rows and title as columns

In [None]:
# userid, movieid, rating
# 1, 1, 4.0
# 1, 5, 5.0

In [118]:
ratings_matrix = rating_movies.pivot_table('rating', index='userId', columns='title')
ratings_matrix.fillna(0, inplace=True)
ratings_matrix.head(2)

title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
userId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Transpose so that each movie becomes a row whose vector holds the ratings given by every user

In [119]:
ratings_matrix_T = ratings_matrix.T

In [120]:
ratings_matrix_T.head(2)

userId,1,2,3,4,5,6,7,8,9,10,...,601,602,603,604,605,606,607,608,609,610
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.0
'Hellboy': The Seeds of Creation (2004),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Compute cosine similarity

In [121]:
from sklearn.metrics.pairwise import cosine_similarity

item_sim = cosine_similarity(ratings_matrix_T, ratings_matrix_T)
item_sim_df = pd.DataFrame(data=item_sim, index=ratings_matrix.columns, columns=ratings_matrix.columns)

print(item_sim_df.shape)
item_sim_df.head(2)

(9719, 9719)


title,'71 (2014),'Hellboy': The Seeds of Creation (2004),'Round Midnight (1986),'Salem's Lot (2004),'Til There Was You (1997),'Tis the Season for Love (2015),"'burbs, The (1989)",'night Mother (1986),(500) Days of Summer (2009),*batteries not included (1987),...,Zulu (2013),[REC] (2007),[REC]² (2009),[REC]³ 3 Génesis (2012),anohana: The Flower We Saw That Day - The Movie (2013),eXistenZ (1999),xXx (2002),xXx: State of the Union (2005),¡Three Amigos! (1986),À nous la liberté (Freedom for Us) (1931)
title,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
'71 (2014),1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.141653,0.0,...,0.0,0.342055,0.543305,0.707107,0.0,0.0,0.139431,0.327327,0.0,0.0
'Hellboy': The Seeds of Creation (2004),0.0,1.0,0.707107,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
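The whole pipeline (pivot, fill missing ratings with 0, treat each movie as a vector of user ratings, compare with cosine similarity) can be sketched on a toy dataset in plain Python. The `ratings` tuples and titles are hypothetical, and the dictionaries stand in for the pandas DataFrames.

```python
import math

# Hypothetical (userId, title, rating) records
ratings = [
    (1, "A", 4.0), (1, "B", 5.0),
    (2, "A", 4.0), (2, "B", 5.0), (2, "C", 1.0),
    (3, "C", 5.0),
]

users = sorted({u for u, _, _ in ratings})
titles = sorted({t for _, t, _ in ratings})
lookup = {(u, t): r for u, t, r in ratings}

# One vector per movie: its rating from each user, 0.0 where the user gave none
item_vectors = {t: [lookup.get((u, t), 0.0) for u in users] for t in titles}

def cosine(a, b):
    """Cosine similarity of two equal-length numeric sequences."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Item-item similarity table, analogous to item_sim_df
item_sim = {t: {o: cosine(item_vectors[t], item_vectors[o]) for o in titles} for t in titles}

print(item_vectors)
print(item_sim["A"])
```

Movies "A" and "B" were rated by the same users in the same proportion, so their similarity is 1.0, while "C" shares only one rater with "A" and scores much lower. Filling missing ratings with 0 treats "not rated" the same as a very low rating, which is a simplification worth keeping in mind.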


Function that returns a DataFrame of recommended movies

In [122]:
def find_sim_movie_item(df, title_name, top_n=10):
    title_movie_sim = df[[title_name]].drop(title_name, axis=0)

    return title_movie_sim.sort_values(title_name, ascending=False)[:top_n]

In [123]:
find_sim_movie_item(item_sim_df, 'Godfather, The (1972)')

title,"Godfather, The (1972)"
title,Unnamed: 1_level_1
"Godfather: Part II, The (1974)",0.821773
Goodfellas (1990),0.664841
One Flew Over the Cuckoo's Nest (1975),0.620536
Star Wars: Episode IV - A New Hope (1977),0.595317
Fargo (1996),0.588614
Star Wars: Episode V - The Empire Strikes Back (1980),0.58603
Fight Club (1999),0.581279
Reservoir Dogs (1992),0.579059
Pulp Fiction (1994),0.57527
American Beauty (1999),0.575012


In [124]:
find_sim_movie_item(item_sim_df, 'Inception (2010)')

title,Inception (2010)
title,Unnamed: 1_level_1
"Dark Knight, The (2008)",0.727263
Inglourious Basterds (2009),0.646103
Shutter Island (2010),0.617736
"Dark Knight Rises, The (2012)",0.617504
Fight Club (1999),0.615417
Interstellar (2014),0.60815
Up (2009),0.606173
"Avengers, The (2012)",0.586504
Django Unchained (2012),0.581342
"Departed, The (2006)",0.580849
