<a href="https://colab.research.google.com/github/sunnyskydream/ML-practice/blob/main/Recommendation_System_Collaborative_Filtering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Collaborative Filtering**

source: 
https://ithelp.ithome.com.tw/articles/
https://ithelp.ithome.com.tw/articles/10220129

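Collaborative filtering pools the opinions of many people to filter or recommend items. Like market basket analysis, it works from sales records, but instead of analyzing item combinations it converts the records into a User-Item Matrix that records which users bought which items. It then computes the similarity between customers or between items, and recommends either items that similar customers have bought or the items most similar to the current one, which supports cross-selling.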

**User-Item Matrix**

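A. USER-USER CF: find the group of customers most similar to the target customer (USER-USER Similarity Matrix), look at the items they buy most often, and recommend those items to the target customer.

B. ITEM-ITEM CF: find the group of items most similar to the item currently being viewed (ITEM-ITEM Similarity Matrix) and recommend them to the customer.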

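**Computing Similarity**

Three statistics are commonly used to compute similarity:

A. Jaccard Similarity: (number of transactions that contain both A and B) / (number of transactions that contain A or B).

B. Pearson Similarity: the Pearson coefficient, computed for each pair.

C. Cosine Similarity: the most commonly used measure, and the one the source article adopts. Note that the code below calls pandas `corrwith`, which computes the Pearson correlation by default.

Whichever kind of collaborative filtering is used, each row or column of the matrix is treated as a vector, and the statistic is computed for every pair of vectors to obtain a similarity value. For Cosine Similarity, cos θ = 1 means the angle is 0° and the two vectors are most similar, whereas cos θ = -1 means the angle is 180° and the two vectors are least similar.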
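The three statistics above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the source article's code: the function names and the toy vectors are assumptions made for this example, and treating two empty purchase vectors as having Jaccard similarity 0 is a convention chosen here.

```python
import numpy as np

def jaccard_similarity(a, b):
    """Jaccard similarity of two binary purchase vectors (1 = bought)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0  # convention chosen here: two empty vectors score 0
    return float(np.logical_and(a, b).sum() / union)

def pearson_similarity(x, y):
    """Pearson correlation coefficient of two rating vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def cosine_similarity(x, y):
    """Cosine of the angle between two vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

# Hypothetical purchase flags for items A and B across five transactions
bought_a = [1, 1, 0, 1, 0]
bought_b = [1, 0, 0, 1, 1]
print(jaccard_similarity(bought_a, bought_b))  # 2 shared / 4 in the union = 0.5

# Hypothetical rating vectors: ratings_y is ratings_x shifted down by 1
ratings_x = [5, 3, 4, 1]
ratings_y = [4, 2, 3, 0]
print(pearson_similarity(ratings_x, ratings_y))
print(cosine_similarity(ratings_x, ratings_y))
```

Pearson similarity ignores a constant shift between the two vectors, while cosine similarity does not. With non-negative ratings the cosine stays between 0 and 1, so the -1 case only appears after the vectors are centered or when they contain negative values.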


[Dataset](https://grouplens.org/datasets/movielens/100k/)

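GroupLens provides movie-rating datasets of various sizes for testing. Two files are used here:
1. u.data: 100,000 ratings given by 943 users to 1,682 movies. The fields include:

*   user id
*   item id (the movie code)
*   rating

2. u.item: basic information on the 1,682 movies. The file has many fields; only the first two are used here:
*   movie id
*   movie title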



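The processing steps are as follows:
1. Read the two files, u.item and u.data, and merge them.
2. Use the pivot function (pivot_table) to convert the data into a USER-ITEM Matrix.
3. Compute the USER-USER and ITEM-ITEM Similarity Matrices.
4. Pick a user or an item at random as a test and list the recommended items.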
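Steps 2 and 3 can be sketched on a tiny hand-made dataset using cosine similarity. This is a hedged illustration only: the `toy` ratings and the `cosine_matrix` helper are hypothetical names introduced here, and filling unrated cells with 0 is a simplifying assumption that treats "not rated" like a rating of 0.

```python
import numpy as np
import pandas as pd

# Hypothetical toy ratings in the same long format as u.data
toy = pd.DataFrame({
    "user_id":  [1, 1, 2, 2, 3, 3, 3],
    "movie_id": [10, 20, 10, 30, 10, 20, 30],
    "rating":   [5, 3, 4, 2, 5, 3, 1],
})

# Step 2: USER-ITEM matrix (rows = users, columns = movies); unrated cells become 0
user_item = toy.pivot_table(index="user_id", columns="movie_id", values="rating").fillna(0)

def cosine_matrix(frame):
    """Pairwise cosine similarity between the rows of `frame`."""
    values = frame.to_numpy(dtype=float)
    norms = np.linalg.norm(values, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    unit = values / norms
    return pd.DataFrame(unit @ unit.T, index=frame.index, columns=frame.index)

# Step 3: USER-USER and ITEM-ITEM similarity matrices
user_similarity = cosine_matrix(user_item)
item_similarity = cosine_matrix(user_item.T)
print(user_similarity.round(2))
print(item_similarity.round(2))
```

Transposing the matrix is all it takes to switch from user-user to item-item similarity, because the helper always compares rows.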

In [1]:
import pandas as pd

In [6]:
# Read the input training data
input_data_file_movie = "./u.item"
input_data_file_rating = "./u.data"

movie = pd.read_csv(input_data_file_movie, sep='|', encoding='ISO-8859-1', names=['movie_id', 'movie_title'], usecols = [0,1])
rating = pd.read_csv(input_data_file_rating, sep='\t', encoding='ISO-8859-1', names=["user_id","movie_id","rating"], usecols = [0,1,2])
print(movie.head())
print(rating.head())

   movie_id        movie_title
0         1   Toy Story (1995)
1         2   GoldenEye (1995)
2         3  Four Rooms (1995)
3         4  Get Shorty (1995)
4         5     Copycat (1995)
   user_id  movie_id  rating
0      196       242       3
1      186       302       3
2       22       377       1
3      244        51       2
4      166       346       1


In [3]:
# Merge the movie and rating data on their shared movie_id column
data = pd.merge(movie, rating, on="movie_id")
data.head()

   movie_id       movie_title  user_id  rating
0         1  Toy Story (1995)      308       4
1         1  Toy Story (1995)      287       5
2         1  Toy Story (1995)      148       4
3         1  Toy Story (1995)      280       4
4         1  Toy Story (1995)       66       3


In [7]:
# USER-ITEM Matrix
# Build a pivot table whose rows are users, columns are movies, and values are ratings
pivot_table = data.pivot_table(index=["user_id"], columns=["movie_title"], values="rating")
pivot_table.head(10)

user_id,'Til There Was You (1997),1-900 (1994),101 Dalmatians (1996),12 Angry Men (1957),187 (1997),2 Days in the Valley (1996),"20,000 Leagues Under the Sea (1954)",2001: A Space Odyssey (1968),3 Ninjas: High Noon At Mega Mountain (1998),"39 Steps, The (1935)",8 1/2 (1963),8 Heads in a Duffel Bag (1997),8 Seconds (1994),A Chef in Love (1996),Above the Rim (1994),Absolute Power (1997),"Abyss, The (1989)",Ace Ventura: Pet Detective (1994),Ace Ventura: When Nature Calls (1995),Across the Sea of Time (1995),Addams Family Values (1993),Addicted to Love (1997),"Addiction, The (1995)","Adventures of Pinocchio, The (1996)","Adventures of Priscilla, Queen of the Desert, The (1994)","Adventures of Robin Hood, The (1938)","Affair to Remember, An (1957)","African Queen, The (1951)",Afterglow (1997),"Age of Innocence, The (1993)",Aiqing wansui (1994),Air Bud (1997),Air Force One (1997),"Air Up There, The (1994)",Airheads (1994),Akira (1988),Aladdin (1992),Aladdin and the King of Thieves (1996),Alaska (1996),Albino Alligator (1996),...,"Whole Wide World, The (1996)",Widows' Peak (1994),"Wife, The (1995)",Wild America (1997),Wild Bill (1995),"Wild Bunch, The (1969)",Wild Reeds (1994),Wild Things (1998),William Shakespeare's Romeo and Juliet (1996),Willy Wonka and the Chocolate Factory (1971),Window to Paris (1994),Wings of Courage (1995),Wings of Desire (1987),"Wings of the Dove, The (1997)",Winnie the Pooh and the Blustery Day (1968),"Winter Guest, The (1997)",Wishmaster (1997),With Honors (1994),Withnail and I (1987),Witness (1985),"Wizard of Oz, The (1939)",Wolf (1994),"Woman in Question, The (1950)","Women, The (1939)","Wonderful, Horrible Life of Leni Riefenstahl, The (1993)",Wonderland (1997),"Wooden Man's Bride, The (Wu Kui) (1994)","World of Apu, The (Apur Sansar) (1959)","Wrong Trousers, The (1993)",Wyatt Earp (1994),Yankee Zulu (1994),Year of the Horse (1997),You So Crazy (1994),Young Frankenstein (1974),Young Guns (1988),Young Guns II (1990),"Young Poisoner's Handbook, The (1995)",Zeus and Roxanne (1997),unknown,Á köldum klaka (Cold Fever) (1994)
1,,,2.0,5.0,,,3.0,4.0,,,,,,,,,3.0,3.0,,,,,,,,,,,,,,1.0,,,,4.0,4.0,,,,...,,,,,,,,,,4.0,,,,,,,,,,,4.0,,,,,,,,5.0,,,,,5.0,3.0,,,,4.0,
2,,,,,,,,,1.0,,,,,,,3.0,,,,,,,,,,,,,,,,,4.0,,,,,,,,...,,,,,,,,,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,,
3,,,,,2.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
4,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,,,,,,
5,,,2.0,,,,,4.0,,,,,,,,,,,1.0,,2.0,,,,5.0,,,,,3.0,,,,,,,4.0,4.0,,,...,,,,,,,,,1.0,3.0,,,,,,,,,,,,,,,,,,,5.0,,,,,4.0,,,,,4.0,
6,,,,4.0,,,,5.0,,,,,,,,,,,,,,2.0,,,,4.0,,4.0,,,,3.0,,,,,2.0,,,,...,,,,,,,,,,3.0,,,4.0,,,,,,,,5.0,,,,,,,,4.0,,,,,4.0,,,,,,
7,,,,4.0,,,5.0,5.0,,4.0,,,,,,,5.0,,,,4.0,,,,4.0,5.0,,5.0,,3.0,,,4.0,,,,,,,,...,,,,,3.0,5.0,,,3.0,4.0,,,,,,,1.0,,,,5.0,4.0,,,,,,,,3.0,,,,5.0,3.0,,3.0,,,
8,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
9,,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
10,,,,5.0,,,,5.0,,4.0,,,,,,,4.0,,,,,,,,,,,5.0,,,,,,,,,,,,,...,,5.0,,,,5.0,,,,,,,,,,,,,,,5.0,,,,4.0,,,,,,,,,,,,,,,


In [8]:
# ITEM-ITEM collaborative filtering: similarity computation
movie_watched = pivot_table["Bad Boys (1995)"]
similarity_with_other_movies = pivot_table.corrwith(movie_watched)  # find correlation between "Bad Boys (1995)" and other movies
similarity_with_other_movies = similarity_with_other_movies.sort_values(ascending=False)
similarity_with_other_movies.head()


movie_title
Enchanted April (1991)                             1.0
Homeward Bound II: Lost in San Francisco (1996)    1.0
Race the Sun (1996)                                1.0
Ready to Wear (Pret-A-Porter) (1994)               1.0
Great Dictator, The (1940)                         1.0
dtype: float64
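Correlations of exactly 1.0, as in the output above, often arise when only a couple of users rated both movies, so the result rests on very few data points. One possible remedy, sketched here on hypothetical toy data, is to keep only movies that share a minimum number of raters with the target. The helper name `correlate_with_min_overlap` and the threshold `min_common` are assumptions made for this example, not part of the source article.

```python
import numpy as np
import pandas as pd

def correlate_with_min_overlap(user_item, target, min_common=3):
    """Pearson correlation of every column with `target`, keeping only
    columns that share at least `min_common` raters with it."""
    rated = user_item.notna().astype(int)
    # Number of users who rated both the target and each other column
    common = rated.mul(rated[target], axis=0).sum()
    corr = user_item.corrwith(user_item[target])
    keep = (common >= min_common) & (common.index != target)
    return corr[keep].dropna().sort_values(ascending=False)

# Hypothetical USER-ITEM matrix: rows are users, columns are movies
toy_user_item = pd.DataFrame(
    {
        "A": [5, 4, 3, 2, 1],
        "B": [4, 5, 3, 1, 2],
        "C": [np.nan, np.nan, np.nan, 2, 1],  # only two raters in common with A
        "D": [1, 2, 3, 4, 5],
    },
    index=[1, 2, 3, 4, 5],
)

unfiltered = toy_user_item.corrwith(toy_user_item["A"])
filtered = correlate_with_min_overlap(toy_user_item, "A", min_common=3)
print(unfiltered)
print(filtered)
```

In the unfiltered result, movie C correlates perfectly with A on the strength of two shared ratings; the filtered result drops it and ranks B first.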

In [9]:
# USER-USER collaborative filtering: similarity computation
# Pivot the other way round: rows are movies, columns are users, and values are ratings
pivot_table = data.pivot_table(index=["movie_title"], columns=["user_id"], values="rating")
pivot_table.head(10)

movie_title,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,904,905,906,907,908,909,910,911,912,913,914,915,916,917,918,919,920,921,922,923,924,925,926,927,928,929,930,931,932,933,934,935,936,937,938,939,940,941,942,943
'Til There Was You (1997),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
1-900 (1994),,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
101 Dalmatians (1996),2.0,,,,2.0,,,,,,,,2.0,,3.0,,,,,,,,,,,,,,,,,,,,,,,5.0,,,...,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,2.0,,2.0,,,2.0,4.0,,,,,
12 Angry Men (1957),5.0,,,,,4.0,4.0,,,5.0,,,4.0,,,5.0,,3.0,,,,,,5.0,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,5.0,,,,,,,,,,,
187 (1997),,,2.0,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
2 Days in the Valley (1996),,,,,,,,,,,,,,,,,,,,,,,,,,3.0,,,,,,,,,,,,,,,...,,3.0,4.0,,,,,,,,,,4.0,,,,,,,4.0,3.0,,,,,,,,,,,,4.0,,,,,,,2.0
"20,000 Leagues Under the Sea (1954)",3.0,,,,,,5.0,,,,,,2.0,,,,,,,,,,,,4.0,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,4.0,,,,,,,,,,,
2001: A Space Odyssey (1968),4.0,,,,4.0,5.0,5.0,,,5.0,4.0,,5.0,,,4.0,,3.0,,,,,,,3.0,,,,,5.0,4.0,,,,,,,,,,...,,,,,,,,,,,,,4.0,,1.0,,,,2.0,,,,,,4.0,5.0,,,5.0,4.0,4.0,,,,,,,,3.0,
3 Ninjas: High Noon At Mega Mountain (1998),,1.0,,,,,,,,,,,1.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,
"39 Steps, The (1935)",,,,,,,4.0,,4.0,4.0,,,4.0,,,,,,,,,,,,5.0,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,5.0,,,,,,,,,,3.0,


In [10]:
target_user = pivot_table[10]
similarity_with_other_users = pivot_table.corrwith(target_user)  # find correlation between user 10 and every other user
similarity_with_other_users = similarity_with_other_users.sort_values(ascending=False)
similarity_with_other_users.head()


user_id
400    1.0
636    1.0
772    1.0
477    1.0
10     1.0
dtype: float64
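The final step, turning the list of similar users into actual recommendations, could look roughly like the sketch below. It uses hypothetical toy data laid out like the movie-by-user pivot table above. The helper `recommend_from_similar_users`, its parameters `top_k` and `n_items`, and the similarity-weighted average used for scoring are all assumptions made for illustration; the source does not specify a scoring rule.

```python
import numpy as np
import pandas as pd

def recommend_from_similar_users(item_user, target_user, top_k=2, n_items=3):
    """Recommend items the target user has not rated, scored by the
    similarity-weighted mean rating of the top_k most similar users."""
    similarity = item_user.corrwith(item_user[target_user])
    similarity = similarity.drop(target_user).dropna()  # exclude the target itself
    neighbours = similarity[similarity > 0].sort_values(ascending=False).head(top_k)
    unseen = item_user.index[item_user[target_user].isna()]
    ratings = item_user.loc[unseen, neighbours.index]
    weighted_sum = ratings.mul(neighbours, axis=1).sum(axis=1)
    # Only count the weight of neighbours who actually rated the item
    weight_total = ratings.notna().astype(float).mul(neighbours, axis=1).sum(axis=1)
    scores = (weighted_sum / weight_total).dropna()
    return scores.sort_values(ascending=False).head(n_items)

# Hypothetical matrix: rows are movies, columns are user ids
toy_item_user = pd.DataFrame(
    {
        1: [5, 4, 1, np.nan, np.nan],
        2: [5, 5, 2, 5, np.nan],
        3: [1, 2, 5, 1, 5],
        4: [4, 3, 1, 4, 2],
    },
    index=["M1", "M2", "M3", "M4", "M5"],
)

recommendations = recommend_from_similar_users(toy_item_user, target_user=1)
print(recommendations)
```

Dropping the target user before ranking avoids the situation seen in the output above, where user 10 appears in their own list of most similar users.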