## Recommendation System
- Recommends content such as movies, music, and YouTube videos to users
- The most basic mechanism is to convert categorical data into numbers and compute distances between them
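
The idea in the last bullet can be sketched on a tiny example. The item names and the single `genre` column below are hypothetical, and one-hot encoding with Euclidean distance is only one of several possible choices for the encoding and the distance.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: each item has one categorical attribute (genre).
items = pd.DataFrame(
    {"genre": ["action", "comedy", "action"]},
    index=["movie_a", "movie_b", "movie_c"],
)

# One-hot encode the category so each item becomes a numeric vector.
encoded = pd.get_dummies(items["genre"]).astype(float)

def euclidean(u, v):
    """Euclidean distance between two numeric vectors."""
    return float(np.linalg.norm(np.asarray(u) - np.asarray(v)))

# Items sharing a genre end up at distance 0; different genres are farther apart.
same_genre = euclidean(encoded.loc["movie_a"], encoded.loc["movie_c"])
diff_genre = euclidean(encoded.loc["movie_a"], encoded.loc["movie_b"])
print(same_genre, diff_genre)
```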

### 01 Introduction to Recommendation Systems with scikit-learn

#### 01-01 Cosine Similarity
- Cosine similarity measures how closely the directions of two vectors align

$$
\text{Cosine Similarity} = \cos(\theta) = \frac{A \cdot B}{\|A\|\|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \cdot \sqrt{\sum_{i=1}^{n} B_i^2}}
$$

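- $A \cdot B$: dot product of the two vectors  
- $\|A\|, \|B\|$: L2 norm (length) of each vector  
- Values range from $-1$ to $1$. With non-negative features such as counts, the result stays within $0$ to $1$. A negative cosine means the angle exceeds 90°, and $-1$ means the vectors point in exactly opposite directions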

##### Using NumPy

In [None]:
import numpy as np

t1 = np.array([1, 1, 1])
t2 = np.array([2, 0, 1])

In [None]:
from numpy import dot
from numpy.linalg import norm

def cos_sim(A,B):
  return dot(A,B)/(norm(A)*norm(B))

In [None]:
cos_sim(t1, t2)

np.float64(0.7745966692414834)
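
The value above can be checked by hand by plugging $A = (1, 1, 1)$ and $B = (2, 0, 1)$ into the formula:

$$
A \cdot B = 1 \cdot 2 + 1 \cdot 0 + 1 \cdot 1 = 3, \qquad \|A\| = \sqrt{1^2 + 1^2 + 1^2} = \sqrt{3}, \qquad \|B\| = \sqrt{2^2 + 0^2 + 1^2} = \sqrt{5}
$$

$$
\cos(\theta) = \frac{3}{\sqrt{3} \cdot \sqrt{5}} = \frac{3}{\sqrt{15}} \approx 0.7746
$$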

##### Using scikit-learn

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

t1 = np.array([[1, 1, 1]])
t2 = np.array([[2, 0, 1]])
cosine_similarity(t1,t2)

array([[0.77459667]])

In [None]:
# Q. Adjust the elements of t1 and t2 below so that the cosine similarity becomes 0.
# Bonus: also make the cosine similarity -1.
# Make it 0
t1 = np.array([[1, 1, 1]])
t2 = np.array([[-1, 0, 1]])
cosine_similarity(t1,t2)

array([[0.]])

In [None]:
# Make it -1
t1 = np.array([[1, 1, 1]])
t2 = np.array([[-1, -1, -1]])
cosine_similarity(t1,t2)

array([[-1.]])

#### 01-02 Types of Recommendation Systems
- Content-based filtering
- Collaborative filtering
- Deep learning based approaches

#### 01-03 Content-Based Filtering
- These lecture notes are based on the [CodeHeroku](https://www.codeheroku.com/post.html?name=Building%20a%20Movie%20Recommendation%20Engine%20in%20Python%20using%20Scikit-Learn) tutorial

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


##### ① Import the required modules

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

##### ② Load the data
- The dataset was downloaded locally and then uploaded so that Colab can read it (here, from the mounted Drive folder)

In [None]:
import os

folder_path = '/content/drive/MyDrive/ColabNotebooks/03_Modulabs/Recommendation_System'
csv_path = os.path.join(folder_path, 'movie_dataset.csv')
df = pd.read_csv(csv_path)
df.head()

Unnamed: 0,index,budget,genres,homepage,id,keywords,original_language,original_title,overview,popularity,production_companies,production_countries,release_date,revenue,runtime,spoken_languages,status,tagline,title,vote_average,vote_count,cast,crew,director
0,0,237000000,Action Adventure Fantasy Science Fiction,http://www.avatarmovie.com/,19995,culture clash future space war space colony so...,en,Avatar,"In the 22nd century, a paraplegic Marine is di...",150.437577,"[{""name"": ""Ingenious Film Partners"", ""id"": 289...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2009-12-10,2787965087,162.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}, {""iso...",Released,Enter the World of Pandora.,Avatar,7.2,11800,Sam Worthington Zoe Saldana Sigourney Weaver S...,"[{'name': 'Stephen E. Rivkin', 'gender': 0, 'd...",James Cameron
1,1,300000000,Adventure Fantasy Action,http://disney.go.com/disneypictures/pirates/,285,ocean drug abuse exotic island east india trad...,en,Pirates of the Caribbean: At World's End,"Captain Barbossa, long believed to be dead, ha...",139.082615,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2007-05-19,961000000,169.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"At the end of the world, the adventure begins.",Pirates of the Caribbean: At World's End,6.9,4500,Johnny Depp Orlando Bloom Keira Knightley Stel...,"[{'name': 'Dariusz Wolski', 'gender': 2, 'depa...",Gore Verbinski
2,2,245000000,Action Adventure Crime,http://www.sonypictures.com/movies/spectre/,206647,spy based on novel secret agent sequel mi6,en,Spectre,A cryptic message from Bond’s past sends him o...,107.376788,"[{""name"": ""Columbia Pictures"", ""id"": 5}, {""nam...","[{""iso_3166_1"": ""GB"", ""name"": ""United Kingdom""...",2015-10-26,880674609,148.0,"[{""iso_639_1"": ""fr"", ""name"": ""Fran\u00e7ais""},...",Released,A Plan No One Escapes,Spectre,6.3,4466,Daniel Craig Christoph Waltz L\u00e9a Seydoux ...,"[{'name': 'Thomas Newman', 'gender': 2, 'depar...",Sam Mendes
3,3,250000000,Action Crime Drama Thriller,http://www.thedarkknightrises.com/,49026,dc comics crime fighter terrorist secret ident...,en,The Dark Knight Rises,Following the death of District Attorney Harve...,112.31295,"[{""name"": ""Legendary Pictures"", ""id"": 923}, {""...","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-07-16,1084939099,165.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,The Legend Ends,The Dark Knight Rises,7.6,9106,Christian Bale Michael Caine Gary Oldman Anne ...,"[{'name': 'Hans Zimmer', 'gender': 2, 'departm...",Christopher Nolan
4,4,260000000,Action Adventure Science Fiction,http://movies.disney.com/john-carter,49529,based on novel mars medallion space travel pri...,en,John Carter,"John Carter is a war-weary, former military ca...",43.926995,"[{""name"": ""Walt Disney Pictures"", ""id"": 2}]","[{""iso_3166_1"": ""US"", ""name"": ""United States o...",2012-03-07,284139100,132.0,"[{""iso_639_1"": ""en"", ""name"": ""English""}]",Released,"Lost in our world, found in another.",John Carter,6.1,2124,Taylor Kitsch Lynn Collins Samantha Morton Wil...,"[{'name': 'Andrew Stanton', 'gender': 2, 'depa...",Andrew Stanton


##### ③ Select features
- For simplicity, only features = ['keywords','cast','genres','director'] are used

In [None]:
features = ['keywords','cast','genres','director']
features

['keywords', 'cast', 'genres', 'director']

In [None]:
def combine_features(row):
    return row['keywords']+" "+row['cast']+" "+row['genres']+" "+row['director']

combine_features(df[:5])

Unnamed: 0,0
0,culture clash future space war space colony so...
1,ocean drug abuse exotic island east india trad...
2,spy based on novel secret agent sequel mi6 Dan...
3,dc comics crime fighter terrorist secret ident...
4,based on novel mars medallion space travel pri...


In [None]:
for feature in features:
    df[feature] = df[feature].fillna('')

df["combined_features"] = df.apply(combine_features,axis=1)
df["combined_features"]

Unnamed: 0,combined_features
0,culture clash future space war space colony so...
1,ocean drug abuse exotic island east india trad...
2,spy based on novel secret agent sequel mi6 Dan...
3,dc comics crime fighter terrorist secret ident...
4,based on novel mars medallion space travel pri...
...,...
4798,united states\u2013mexico barrier legs arms pa...
4799,Edward Burns Kerry Bish\u00e9 Marsha Dietlein...
4800,date love at first sight narration investigati...
4801,Daniel Henney Eliza Coupe Bill Paxton Alan Ru...


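##### ④ Vectorize and compute cosine similarity
- The genre, cast, and director text (along with the keywords) is treated as categorical data, so each token is simply counted to build a numeric vector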

In [None]:
cv = CountVectorizer()
count_matrix = cv.fit_transform(df["combined_features"])
print(type(count_matrix))
print(count_matrix.shape)
print(count_matrix)

<class 'scipy.sparse._csr.csr_matrix'>
(4803, 14845)
<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 97547 stored elements and shape (4803, 14845)>
  Coords	Values
  (0, 3115)	1
  (0, 2616)	1
  (0, 4886)	1
  (0, 12386)	2
  (0, 14235)	1
  (0, 2755)	1
  (0, 12299)	1
  (0, 11517)	1
  (0, 14561)	1
  (0, 14820)	1
  (0, 11490)	1
  (0, 12134)	1
  (0, 14291)	1
  (0, 12567)	1
  (0, 7496)	1
  (0, 8831)	1
  (0, 11217)	1
  (0, 86)	1
  (0, 144)	1
  (0, 4435)	1
  (0, 11745)	1
  (0, 4566)	1
  (0, 6542)	1
  (0, 2061)	1
  (1, 86)	1
  :	:
  (4801, 10069)	1
  (4801, 5844)	1
  (4801, 252)	1
  (4801, 4098)	1
  (4801, 14796)	1
  (4801, 11361)	1
  (4801, 2978)	1
  (4801, 12036)	1
  (4801, 6138)	1
  (4802, 9659)	1
  (4802, 3812)	1
  (4802, 1788)	2
  (4802, 4210)	1
  (4802, 5181)	1
  (4802, 2912)	1
  (4802, 3821)	1
  (4802, 1069)	1
  (4802, 11185)	1
  (4802, 3681)	1
  (4802, 5399)	1
  (4802, 3894)	1
  (4802, 2056)	1
  (4802, 3093)	1
  (4802, 4502)	1
  (4802, 5900)	2


In [None]:
cosine_sim = cosine_similarity(count_matrix)
print(cosine_sim)
print(cosine_sim.shape)

[[1.         0.10540926 0.12038585 ... 0.         0.         0.        ]
 [0.10540926 1.         0.0761387  ... 0.03651484 0.         0.        ]
 [0.12038585 0.0761387  1.         ... 0.         0.11145564 0.        ]
 ...
 [0.         0.03651484 0.         ... 1.         0.         0.04264014]
 [0.         0.         0.11145564 ... 0.         1.         0.        ]
 [0.         0.         0.         ... 0.04264014 0.         1.        ]]
(4803, 4803)


##### ⑤ Make recommendations
- Recommend five movies similar to Titanic

In [None]:
def get_title_from_index(index):
    return df[df.index == index]["title"].values[0]
def get_index_from_title(title):
    return df[df.title == title]["index"].values[0]

movie_user_likes = 'Titanic'
movie_index = get_index_from_title(movie_user_likes)
similar_movies = list(enumerate(cosine_sim[movie_index]))

sorted_similar_movies = sorted(similar_movies,key=lambda x:x[1],reverse=True)[1:]

print(f"Five movies similar to {movie_user_likes}:\n")
# Print the titles of the five highest-scoring movies
for item in sorted_similar_movies[:5]:
    print(get_title_from_index(item[0]))

Five movies similar to Titanic:

Revolutionary Road
Me You and Five Bucks
All the King's Men
The Day the Earth Stood Still
Almost Famous
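
The lookup above can also be written as a small reusable helper. The sketch below is one possible variant, not the tutorial's code: `top_n_similar` and `toy_sim` are hypothetical names, and the query item is excluded by its index rather than by assuming it sorts first.

```python
import numpy as np

def top_n_similar(sim_matrix, item_index, n=5):
    """Return indices of the n items most similar to item_index, excluding itself."""
    scores = np.asarray(sim_matrix[item_index], dtype=float).copy()
    scores[item_index] = -np.inf                # never recommend the query item itself
    order = np.argsort(-scores, kind="stable")  # descending by similarity
    return order[:n].tolist()

# Hypothetical 4x4 similarity matrix (symmetric, ones on the diagonal).
toy_sim = np.array([
    [1.0, 0.2, 0.9, 0.0],
    [0.2, 1.0, 0.1, 0.5],
    [0.9, 0.1, 1.0, 0.3],
    [0.0, 0.5, 0.3, 1.0],
])
result = top_n_similar(toy_sim, 0, n=2)
print(result)
```

With the notebook's data, the same helper could be called as `top_n_similar(cosine_sim, movie_index)` and the returned indices passed to `get_title_from_index`.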


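#### 01-04 Types of Collaborative Filtering
- Collaborative filtering makes recommendations based on past user behavior data
- It is broadly divided into user-based, item-based, and latent factor approaches
- User-based and item-based approaches compute similarities between users or between items
- The latent factor approach uses matrix factorization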
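
The difference between the user-based and item-based approaches can be sketched with one small rating matrix. The ratings are hypothetical, and treating 0 as "not rated" inside the cosine computation is a simplifying assumption made only for this illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: rows are users, columns are items, 0 means "not rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
])

user_sim = cosine_similarity(ratings)    # user-based: compare rows
item_sim = cosine_similarity(ratings.T)  # item-based: compare columns

print(user_sim.shape, item_sim.shape)
print(user_sim.round(2))
```

Users 0 and 1 rate the same items highly, so their similarity is much larger than the similarity between users 0 and 2.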

#### 01-05 Collaborative Filtering: Matrix Factorization
- SVD (Singular Value Decomposition)
  - [공돌이의수학노트](https://angeloyeo.github.io/2019/08/01/SVD.html)
- ALS (Alternating Least Squares)
- NMF (Non-negative Matrix Factorization)

##### ① SVD example

In [None]:
import numpy as np
from numpy.linalg import svd

In [None]:
np.random.seed(30)
A = np.random.randint(0, 100, size=(4, 4))
A

array([[37, 37, 45, 45],
       [12, 23,  2, 53],
       [17, 46,  3, 41],
       [ 7, 65, 49, 45]])

In [None]:
svd(A)

SVDResult(U=array([[-0.54937068, -0.2803037 , -0.76767503, -0.1740596 ],
       [-0.3581157 ,  0.69569442, -0.13554741,  0.60777407],
       [-0.41727183,  0.47142296,  0.28991733, -0.72082768],
       [-0.6291496 , -0.46389601,  0.55520257,  0.28411509]]), S=array([142.88131188,  39.87683209,  28.97701433,  14.97002405]), Vh=array([[-0.25280963, -0.62046326, -0.4025583 , -0.6237463 ],
       [ 0.06881225, -0.07117038, -0.8159854 ,  0.56953268],
       [-0.73215039,  0.61782756, -0.23266002, -0.16767299],
       [-0.62873522, -0.47775436,  0.34348792,  0.50838848]]))

In [None]:
U, Sigma, VT = svd(A)

print('U matrix: {}\n'.format(U.shape),U)
print('Sigma: {}\n'.format(Sigma.shape),Sigma)
print('V Transpose matrix: {}\n'.format(VT.shape),VT)

U matrix: (4, 4)
 [[-0.54937068 -0.2803037  -0.76767503 -0.1740596 ]
 [-0.3581157   0.69569442 -0.13554741  0.60777407]
 [-0.41727183  0.47142296  0.28991733 -0.72082768]
 [-0.6291496  -0.46389601  0.55520257  0.28411509]]
Sigma: (4,)
 [142.88131188  39.87683209  28.97701433  14.97002405]
V Transpose matrix: (4, 4)
 [[-0.25280963 -0.62046326 -0.4025583  -0.6237463 ]
 [ 0.06881225 -0.07117038 -0.8159854   0.56953268]
 [-0.73215039  0.61782756 -0.23266002 -0.16767299]
 [-0.62873522 -0.47775436  0.34348792  0.50838848]]


In [None]:
Sigma_mat = np.diag(Sigma)

A_ = np.dot(np.dot(U, Sigma_mat), VT)
A_

array([[37., 37., 45., 45.],
       [12., 23.,  2., 53.],
       [17., 46.,  3., 41.],
       [ 7., 65., 49., 45.]])

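##### ② Truncated SVD
- For matrix factorization in recommendation systems, the SVD variant typically used is Truncated SVD, which keeps only the largest singular values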
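
A minimal sketch of the truncation step, reusing the same seeded 4x4 matrix as the SVD example. The choice `k = 2` is arbitrary and only for illustration.

```python
import numpy as np
from numpy.linalg import svd

np.random.seed(30)
A = np.random.randint(0, 100, size=(4, 4))

# Singular values are returned in descending order.
U, Sigma, VT = svd(A, full_matrices=False)

k = 2  # number of singular values to keep (arbitrary choice for illustration)
A_k = U[:, :k] @ np.diag(Sigma[:k]) @ VT[:k, :]  # rank-k approximation of A

# The Frobenius error of the approximation equals the norm of the dropped singular values.
error = np.linalg.norm(A - A_k)
dropped = np.sqrt(np.sum(Sigma[k:] ** 2))
print(A_k.round(1))
print(error, dropped)
```

Unlike the full reconstruction, `A_k` only approximates `A`. The approximation error shrinks as `k` grows.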

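#### 01-06 Latent Factor Collaborative Filtering
- The factors that drive a user's ratings are treated as "latent factors"
- The rating matrix is decomposed with SVD and then recombined, so that the reasons behind each movie rating are represented as vectors
- [Modulabs (모두의연구소), "Click-Through Rate"](https://modulabs.co.kr/blog/click-through-rate-recommendation)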
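
The decompose-and-recombine idea can be sketched as follows. The rating matrix `R` is hypothetical, and filling missing ratings with each user's mean before the SVD is an assumption of this sketch, not a step taken from the lecture notes.

```python
import numpy as np

# Hypothetical ratings matrix: rows are users, columns are movies, np.nan = unrated.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])

# Assumption: fill missing ratings with each user's mean, then center per user.
user_mean = np.nanmean(R, axis=1, keepdims=True)
R_filled = np.where(np.isnan(R), user_mean, R)
R_centered = R_filled - user_mean

# Decompose and keep k latent factors.
U, Sigma, VT = np.linalg.svd(R_centered, full_matrices=False)
k = 2
user_factors = U[:, :k] * Sigma[:k]  # one k-dimensional vector per user
item_factors = VT[:k, :]             # one k-dimensional vector per movie

# Recombine: predicted rating = user vector . movie vector + user mean.
R_pred = user_factors @ item_factors + user_mean
print(R_pred.round(2))
```

The entries of `R_pred` at the positions that were `np.nan` in `R` serve as the predicted ratings for unseen movies.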