# **Recommender Systems - Collaborative Filtering**

# **Collaborative Filtering**


## Memory-based CF

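*  **User-based Collaborative Filtering** - Recommends items that users similar to you have liked. Similarity is computed across the rows of the user-item rating matrix (one row per user).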

![](https://cdn-images-1.medium.com/max/1000/1*9TC6BrfxYttJwiATFAIFBg.png)

However, user-based filtering runs into trouble when users' preferences change over time, so we can also consider an item-based approach:

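*  **Item-based Collaborative Filtering** - Recommends items that are liked in the same way as the items you already like. Similarity is computed across the columns of the rating matrix (one column per item).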

![](https://cdn-images-1.medium.com/max/1000/1*LqFnWb-cm92HoMYBL840Ew.png)
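
To make the row-versus-column distinction concrete, here is a minimal sketch in plain Python. The toy matrix, the `cosine` helper, and the choice of cosine similarity are assumptions for illustration only; unrated cells are treated as 0 purely to keep the example short, which a real system would handle more carefully.

```python
from math import sqrt

# Toy rating matrix (hypothetical data): rows are users, columns are movies.
# 0 stands for "not rated" here, only to keep the sketch short.
ratings = [
    [5, 4, 0, 1],  # user 0
    [4, 5, 1, 0],  # user 1
    [1, 0, 5, 4],  # user 2
]

def cosine(a, b):
    """Cosine similarity between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# User-based: compare rows of the matrix.
user_sim_01 = cosine(ratings[0], ratings[1])
user_sim_02 = cosine(ratings[0], ratings[2])

# Item-based: compare columns of the matrix (transpose first).
columns = list(zip(*ratings))
item_sim_01 = cosine(columns[0], columns[1])
item_sim_02 = cosine(columns[0], columns[2])

print(f"user 0 vs user 1: {user_sim_01:.3f}, user 0 vs user 2: {user_sim_02:.3f}")
print(f"movie 0 vs movie 1: {item_sim_01:.3f}, movie 0 vs movie 2: {item_sim_02:.3f}")
```

In this toy data, users 0 and 1 rate alike, so a user-based recommender would draw on user 1's ratings for user 0; movies 0 and 1 are rated alike, so an item-based recommender would suggest one to fans of the other.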

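However, memory-based methods still have a few problems:
* ***scalability***: similarities must be computed for every user and every movie, so the cost grows quickly as either number increases.
* ***sparsity***: as the number of movies grows, each movie is watched or rated by only a small fraction of users, which leaves the rating matrix sparse.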
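
Both problems can be quantified on a small example. The sketch below uses hypothetical `(user, movie, rating)` triples; the variable names are illustrative and do not come from the notebook.

```python
# Hypothetical observed ratings as (user, movie, rating) triples.
observed = [
    (1, "A", 4.0), (1, "B", 3.5),
    (2, "A", 5.0),
    (3, "C", 2.0), (3, "D", 4.5),
]

users = {u for u, _, _ in observed}
movies = {m for _, m, _ in observed}

# Sparsity: share of cells in the full user x movie matrix with no rating.
total_cells = len(users) * len(movies)
density = len(observed) / total_cells
sparsity = 1.0 - density

# Scalability: memory-based CF compares pairs of users (or pairs of movies),
# so the number of similarity computations grows quadratically.
user_pairs = len(users) * (len(users) - 1) // 2
movie_pairs = len(movies) * (len(movies) - 1) // 2

print(f"density={density:.3f}, sparsity={sparsity:.3f}")
print(f"user pairs={user_pairs}, movie pairs={movie_pairs}")
```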



## Data

In [2]:
import pandas as pd

movie = pd.read_csv("movielens-20m-dataset/movie.csv")
movie.columns

Index(['movieId', 'title', 'genres'], dtype='object')

In [3]:
movie = movie.loc[:,["movieId","title"]]
movie.head(10)

| | movieId | title |
| --- | --- | --- |
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | Jumanji (1995) |
| 2 | 3 | Grumpier Old Men (1995) |
| 3 | 4 | Waiting to Exhale (1995) |
| 4 | 5 | Father of the Bride Part II (1995) |
| 5 | 6 | Heat (1995) |
| 6 | 7 | Sabrina (1995) |
| 7 | 8 | Tom and Huck (1995) |
| 8 | 9 | Sudden Death (1995) |
| 9 | 10 | GoldenEye (1995) |


In [4]:
# Load the ratings data
rating = pd.read_csv("movielens-20m-dataset/rating.csv")
rating.columns

Index(['userId', 'movieId', 'rating', 'timestamp'], dtype='object')

In [5]:
rating = rating.loc[:,["userId","movieId","rating"]]
rating.head(10)

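| | userId | movieId | rating |
| --- | --- | --- | --- |
| 0 | 1 | 2 | 3.5 |
| 1 | 1 | 29 | 3.5 |
| 2 | 1 | 32 | 3.5 |
| 3 | 1 | 47 | 3.5 |
| 4 | 1 | 50 | 3.5 |
| 5 | 1 | 112 | 3.5 |
| 6 | 1 | 151 | 4.0 |
| 7 | 1 | 223 | 4.0 |
| 8 | 1 | 253 | 4.0 |
| 9 | 1 | 260 | 4.0 |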


In [6]:
# Merge movie and rating on their shared movieId column (inner join by default)
data = pd.merge(movie, rating, on="movieId")

In [7]:
data.head(10)

| | movieId | title | userId | rating |
| --- | --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | 3 | 4.0 |
| 1 | 1 | Toy Story (1995) | 6 | 5.0 |
| 2 | 1 | Toy Story (1995) | 8 | 4.0 |
| 3 | 1 | Toy Story (1995) | 10 | 4.0 |
| 4 | 1 | Toy Story (1995) | 11 | 4.5 |
| 5 | 1 | Toy Story (1995) | 12 | 4.0 |
| 6 | 1 | Toy Story (1995) | 13 | 4.0 |
| 7 | 1 | Toy Story (1995) | 14 | 4.5 |
| 8 | 1 | Toy Story (1995) | 16 | 3.0 |
| 9 | 1 | Toy Story (1995) | 19 | 5.0 |


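In [9]:
# Keep only the first 1,000,000 rows. The merged rows are grouped by movie,
# so this subset covers only the movies that appear first in movie.csv.
data = data.iloc[:1000000, :]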

In [11]:
# pandas pivot_table() gives a user-by-movie table of ratings
pivot_table = data.pivot_table(index = ["userId"],columns = ["title"],values = "rating")
pivot_table.head(10)

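| userId | Ace Ventura: When Nature Calls (1995) | Across the Sea of Time (1995) | Amazing Panda Adventure, The (1995) | American President, The (1995) | Angela (1995) | Angels and Insects (1995) | Anne Frank Remembered (1995) | Antonia's Line (Antonia) (1995) | Assassins (1995) | Babe (1995) | ... | Unforgettable (1996) | Up Close and Personal (1996) | Usual Suspects, The (1995) | Vampire in Brooklyn (1995) | Waiting to Exhale (1995) | When Night Is Falling (1995) | White Balloon, The (Badkonake sefid) (1995) | White Squall (1996) | Wings of Courage (1995) | Young Poisoner's Handbook, The (1995) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 3.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 3.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 5 | NaN | NaN | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 2.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 6 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 7 | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 8 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 10 | NaN | NaN | NaN | 4.0 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 11 | 3.5 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |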
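
For readers unfamiliar with pivoting, the reshaping from long format (one row per rating) to wide format (one row per user, one column per title) can be sketched in plain Python. The rows below are hypothetical, and averaging duplicate `(user, title)` pairs mirrors the default `mean` aggregation of `pivot_table`; `None` plays the role of a missing rating.

```python
from collections import defaultdict

# Hypothetical long-format rows: (userId, title, rating).
rows = [
    (1, "Movie A", 3.5),
    (3, "Movie A", 5.0),
    (4, "Movie B", 3.0),
    (4, "Movie B", 4.0),  # duplicate (user, title) pair, averaged below
]

# Collect every rating that falls into the same (user, title) cell.
cells = defaultdict(list)
for user_id, title, rating in rows:
    cells[(user_id, title)].append(rating)

users = sorted({u for u, _ in cells})
titles = sorted({t for _, t in cells})

# Wide format: one dict per user, one key per title, None where unrated.
wide = {
    u: {
        t: (sum(cells[(u, t)]) / len(cells[(u, t)]) if (u, t) in cells else None)
        for t in titles
    }
    for u in users
}

for u in users:
    print(u, wide[u])
```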


In [12]:
# Find the movies whose ratings are most similar to those of Bad Boys
movie_watched = pivot_table["Bad Boys (1995)"]
similarity_with_other_movies = pivot_table.corrwith(movie_watched)
similarity_with_other_movies = similarity_with_other_movies.sort_values(ascending=False)
similarity_with_other_movies.head()

title
Bad Boys (1995)                        1.000000
Headless Body in Topless Bar (1995)    0.723747
Last Summer in the Hamptons (1995)     0.607554
Two Bits (1995)                        0.507008
Shadows (Cienie) (1988)                0.494186
dtype: float64
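
By default `corrwith` computes a Pearson correlation, and for each pair of columns only the users who rated both movies contribute. A rough plain-Python sketch of that calculation follows. The `pearson_co_rated` helper, its `min_overlap` guard, and the sample ratings are hypothetical and not part of the original notebook; the guard illustrates why a high score may be unreliable when two movies share very few raters.

```python
from math import sqrt

def pearson_co_rated(col_a, col_b, min_overlap=2):
    """Pearson correlation over the users who rated both movies.

    col_a, col_b: dicts mapping userId -> rating (a missing key means unrated).
    Returns None if too few users overlap or the correlation is undefined.
    """
    common = sorted(set(col_a) & set(col_b))
    if len(common) < min_overlap:
        return None
    xs = [col_a[u] for u in common]
    ys = [col_b[u] for u in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)

# Hypothetical rating columns keyed by userId.
target = {1: 4.0, 2: 2.0, 3: 5.0, 4: 3.0}
popular = {1: 4.5, 2: 2.5, 3: 4.5, 4: 3.0, 5: 4.0}
obscure = {2: 1.0, 3: 5.0}  # shares only two raters with target

print(pearson_co_rated(target, popular))
print(pearson_co_rated(target, obscure))                 # 1.0, from only two users
print(pearson_co_rated(target, obscure, min_overlap=3))  # None
```

A perfect score from two shared raters says little, so requiring a minimum number of co-ratings before ranking is one possible refinement.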

## Model-based CF
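### **Singular Value Decomposition**
We can use a **latent factor model** to address the sparsity and scalability problems while still capturing the similarity between users and movies.
In essence, the recommendation problem is recast as an optimization problem, with RMSE as the loss function.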

![](https://cdn-images-1.medium.com/max/800/1*GUw90kG2ltTd2k_iv3Vo0Q.png)
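
As a rough illustration of the optimization view, the sketch below fits a small biased matrix-factorization model with stochastic gradient descent on squared error. It is a simplified stand-in under stated assumptions, not the library implementation used later: `train_mf`, `predict`, `rmse`, the hyperparameters, and the toy ratings are all hypothetical.

```python
import random

def train_mf(ratings, n_users, n_items, n_factors=2, lr=0.01, reg=0.02,
             n_epochs=200, seed=0):
    """Fit r_ui ~ mu + b_u + b_i + p_u . q_i by SGD on regularized squared error."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = [0.0] * n_users                               # user biases
    bi = [0.0] * n_items                               # item biases
    p = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
    q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            pred = mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - pred
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]  # use the old values for both updates
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q

def predict(model, u, i):
    mu, bu, bi, p, q = model
    return mu + bu[u] + bi[i] + sum(pf * qf for pf, qf in zip(p[u], q[i]))

def rmse(model, ratings):
    squared = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return (squared / len(ratings)) ** 0.5

# Hypothetical (user index, item index, rating) triples.
train = [(0, 0, 5.0), (0, 1, 4.0), (1, 0, 4.0), (1, 2, 1.0), (2, 1, 2.0), (2, 2, 5.0)]

untrained = train_mf(train, n_users=3, n_items=3, n_epochs=0)
trained = train_mf(train, n_users=3, n_items=3, n_epochs=200)
print(f"RMSE before training: {rmse(untrained, train):.3f}")
print(f"RMSE after training:  {rmse(trained, train):.3f}")
```

Only the observed ratings enter the loss, so the model never needs the full, mostly empty matrix, and each user and movie is summarized by a short factor vector rather than by pairwise similarities.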

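In [13]:
# Note: `evaluate` (and `Dataset.split`, used in the next cell) exist only in older
# Surprise releases; later releases provide surprise.model_selection.cross_validate instead.
from surprise import Reader, Dataset, SVD, evaluate
# Reader() defaults to rating_scale=(1, 5); pass rating_scale=(0.5, 5)
# if the data contains half-star ratings below 1.
reader = Reader()
ratings = pd.read_csv('the-movies-dataset/ratings_small.csv')
ratings.head()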

| | userId | movieId | rating | timestamp |
| --- | --- | --- | --- | --- |
| 0 | 1 | 31 | 2.5 | 1260759144 |
| 1 | 1 | 1029 | 3.0 | 1260759179 |
| 2 | 1 | 1061 | 3.0 | 1260759182 |
| 3 | 1 | 1129 | 2.0 | 1260759185 |
| 4 | 1 | 1172 | 4.0 | 1260759205 |


In [14]:
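# Surprise expects the columns in the order: user, item, rating
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)
# Split into 5 folds for cross-validation (older Surprise API)
data.split(n_folds=5)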

In [15]:
# Use the SVD model
svd = SVD()
evaluate(svd, data, measures=['RMSE', 'MAE'])



Evaluating RMSE, MAE of algorithm SVD.

------------
Fold 1
RMSE: 0.9010
MAE:  0.6955
------------
Fold 2
RMSE: 0.8968
MAE:  0.6893
------------
Fold 3
RMSE: 0.8915
MAE:  0.6872
------------
Fold 4
RMSE: 0.8971
MAE:  0.6909
------------
Fold 5
RMSE: 0.8947
MAE:  0.6901
------------
------------
Mean RMSE: 0.8962
Mean MAE : 0.6906
------------
------------


CaseInsensitiveDefaultDict(list,
                           {'rmse': [0.9010161426473884,
                             0.896817847899472,
                             0.8914768494663217,
                             0.897119082160679,
                             0.8947207132287867],
                            'mae': [0.6955214350013869,
                             0.6893084485764994,
                             0.6871928616435473,
                             0.690947594294902,
                             0.6900667687353885]})
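
For reference, the two metrics and the fold averages can be reproduced in plain Python. The `(true, predicted)` pairs below are hypothetical; the per-fold values are the rounded figures printed above.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true, predicted) rating pairs."""
    return sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true, predicted) rating pairs."""
    return sum(abs(t - p) for t, p in pairs) / len(pairs)

# Hypothetical (true rating, predicted rating) pairs.
pairs = [(4.0, 3.5), (2.0, 3.0), (5.0, 4.5), (3.0, 3.0)]
print(f"RMSE={rmse(pairs):.4f}, MAE={mae(pairs):.4f}")

# Per-fold values as printed in the evaluation output above.
fold_rmse = [0.9010, 0.8968, 0.8915, 0.8971, 0.8947]
fold_mae = [0.6955, 0.6893, 0.6872, 0.6909, 0.6901]
mean_rmse = sum(fold_rmse) / len(fold_rmse)
mean_mae = sum(fold_mae) / len(fold_mae)
print(f"Mean RMSE={mean_rmse:.4f}, Mean MAE={mean_mae:.4f}")
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE, which is why it is the larger of the two here.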

In [16]:
trainset = data.build_full_trainset()
svd.fit(trainset)

<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1116c9dd8>

In [17]:
ratings[ratings['userId'] == 1]

Unnamed: 0,userId,movieId,rating,timestamp
0,1,31,2.5,1260759144
1,1,1029,3.0,1260759179
2,1,1061,3.0,1260759182
3,1,1129,2.0,1260759185
4,1,1172,4.0,1260759205
5,1,1263,2.0,1260759151
6,1,1287,2.0,1260759187
7,1,1293,2.0,1260759148
8,1,1339,3.5,1260759125
9,1,1343,2.0,1260759131


In [18]:
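# Predict user 1's rating for movie 302; the third argument (r_ui=3) is the
# true rating, passed along for reference only
svd.predict(1, 302, 3)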

Prediction(uid=1, iid=302, r_ui=3, est=2.8117911167109404, details={'was_impossible': False})
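
In the output, `est` is the model's estimated rating and `r_ui` echoes the true rating that was passed in. A minimal sketch of how such an estimate is typically formed in a biased matrix-factorization model follows: global mean plus user and item biases plus the dot product of the latent factors, clipped to the rating scale. The `predict_rating` helper and all parameter values are hypothetical and are not taken from the fitted model.

```python
def predict_rating(mu, b_u, b_i, p_u, q_i, rating_scale=(1.0, 5.0)):
    """Biased matrix-factorization estimate, clipped to the rating scale."""
    raw = mu + b_u + b_i + sum(pf * qf for pf, qf in zip(p_u, q_i))
    low, high = rating_scale
    return min(high, max(low, raw))

# Hypothetical parameters for one (user, movie) pair.
est = predict_rating(mu=3.5, b_u=0.2, b_i=-0.4, p_u=[0.5, -0.2], q_i=[0.6, 0.5])

# A raw estimate above the scale is clipped to its upper bound.
clipped = predict_rating(mu=4.5, b_u=0.8, b_i=0.6, p_u=[0.0, 0.0], q_i=[0.0, 0.0])

# Absolute error of the prediction shown above (r_ui=3, est=2.8117911167109404).
abs_error = abs(3 - 2.8117911167109404)

print(round(est, 4), clipped, round(abs_error, 4))
```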