# Term_Project - Recommendation System
## Initialize and Set Config
### Shared link to results: https://drive.google.com/drive/folders/1d_FzC6vwwCdwkUU9hfPHk3blut6E6xq2?usp=sharing
#### Both questions take about 20 min in total

In [7]:
#initial
import math
import os
from pyspark import StorageLevel
import time
from pyspark import SparkContext, SparkConf
sc.stop()
conf = SparkConf().setMaster("local").setAppName("Recommendation System")
conf = conf.set("spark.default.parallelism",8)\
    .set('spark.driver.memory', '12G') \
    .set('spark.driver.maxResultSize', '100G')\
    .set('spark.memory.fraction',0.9)\
    .set("spark.hadoop.validateOutputSpecs", "False")\
    .set("spark.serializer","org.apache.spark.serializer.KryoSerializer")\
    .set("spark.kryoserializer.buffer.max",'2000m')
    #.set("spark.executor.instances", 4)\
    #.set("spark.executor.cores", 4)
    #.set('spark.executor.cores',4) \
sc = SparkContext(conf = conf)

## Get input data
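1. Use sc.textFile() to read ratings.csv from the data folder.
2. Use filter to drop the header row.
3. Apply the map function getNecesssary() to split each line on commas and reorder the fields, giving an RDD with movie id as key and (user, rating) as value.
4. Use groupByKey() to merge the lines that share a movie id, giving an RDD with movie id as key and all (user, rating) pairs for that movie as values.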
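
The four steps above can be sketched without Spark on a few in-memory lines. This is only an illustration: it assumes the column order userId,movieId,rating,timestamp, the sample rows and timestamps are made up, and `parse_line` and `group_by_movie` are hypothetical helper names that do not appear in the notebook.

```python
from collections import defaultdict

def parse_line(line):
    # Split on commas and reorder into (movie, (user, rating)).
    fields = line.split(',')
    return int(fields[1]), (fields[0], float(fields[2]))

def group_by_movie(lines):
    # Plain-Python stand-in for filter + map + groupByKey.
    grouped = defaultdict(list)
    for line in lines:
        if line.startswith('u'):  # the header row starts with "userId"
            continue
        movie, pair = parse_line(line)
        grouped[movie].append(pair)
    return dict(grouped)

# Made-up sample rows in the assumed ratings.csv layout.
sample_lines = [
    "userId,movieId,rating,timestamp",
    "1,10,4.0,964982703",
    "2,10,3.0,964981247",
    "1,20,5.0,964982224",
]
movies = group_by_movie(sample_lines)
```

Each key of `movies` is a movie id and each value is the list of (user, rating) pairs for that movie, which is the same shape as simple_Items_RDD.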

## Input Data
1. Taken from https://grouplens.org/datasets/movielens/
2. Format: (userID, movieID, rating)
3. Already unzipped and placed at ml-latest-small/ratings.csv

In [8]:
datas = sc.textFile("ml-latest-small/ratings.csv").filter(lambda x:x[0]!='u')
#all_movies = sc.textFile("ml-latest-small/movies.csv").filter(lambda x:x[0]!='m')

In [9]:
#parsing all_movie 47448411
def get_movie_id(x):
    lines = x.split(',')
    id_ = int(lines[0])
    return id_,[]
#parsing => rdd -> (movie,[(user,rating),(user,rating)..])
def getNecesssary(x):
    line = x.split(',')
    user = str(line[0])
    movie = int(line[1])
    rating = float(line[2])
    
    pair = (user, rating)
    return movie,pair
#All_movies = all_movies.map(get_movie_id)
Datas = datas.map(getNecesssary)
# persist after sortBy so that the sorted RDD itself is the one stored on disk
simple_Items_RDD = Datas.groupByKey().map(lambda x:(x[0],list(x[1]))).sortBy(lambda x:x[0]).persist(StorageLevel.DISK_ONLY)

In [10]:
tmp_store = simple_Items_RDD.collect()

## Normalization and Calculate length of Rating_vector
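1. The map function construct_row_with_nornalization() performs the normalization. It first computes the mean (sum of ratings / number of users who rated that movie), then mean-centers the movie's ratings by subtracting the mean from each of them, and at the same time computes the length of the new rating vector. It returns a new RDD that is still keyed by movie id, with values (length of the rating vector, [(user, centered rating), ...]).
2. Note: after mean-centering, all ratings of a movie may become 0 (for example when every rating of that movie is the same). My solution is to remove those movies afterwards with filter.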
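
As a small worked example of the step above, the sketch below mean-centers one movie's ratings and computes the vector length in plain Python. `center_and_norm` is a hypothetical name used only for this illustration, and the ratings are made up.

```python
import math

def center_and_norm(ratings):
    # ratings: list of (user, rating) pairs for a single movie
    mean = sum(r for _, r in ratings) / len(ratings)
    centered = [(user, r - mean) for user, r in ratings]
    # Euclidean length of the centered rating vector
    norm = math.sqrt(sum(c * c for _, c in centered))
    return norm, centered

# Ratings 4, 2, 3 have mean 3, so they center to 1, -1, 0.
norm, centered = center_and_norm([(1, 4.0), (2, 2.0), (3, 3.0)])

# Identical ratings center to all zeros, so the length is 0
# and the movie would be dropped by the filter.
flat_norm, flat_centered = center_and_norm([(1, 3.5), (2, 3.5)])
```

The second call shows the case from the note: a zero-length vector cannot be used as a divisor in the cosine formula, which is why such movies are filtered out.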

In [5]:
def construct_row_with_nornalization(x):
    part_list = list(x[1])
    sum_ = 0
    users = len(part_list)
    for i in part_list:
        sum_ += i[1]
    mean = sum_/users
    new_result = []
    len_ = 0
    cnt = 0
    for j in part_list:
        j = list(j)
        user = int(j[0])
        j[1] = j[1]-mean
        len_+= j[1]**2
        cnt+=1
        new_result.append((user,j[1]))
    #cal len
    #if len_== 0:
        #return x[0],(0,cnt,part_list) #(id,(len, mean, list))
    return x[0],(math.sqrt(len_),new_result)
Simple_Items_RDD = simple_Items_RDD.map(construct_row_with_nornalization).filter(lambda x:x[1][0]!=0).persist(StorageLevel.DISK_ONLY)

In [6]:
#t2 = Simple_Items_RDD.count()

In [7]:
#t2

## Construct Combination
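1. To get the similarity between every pair of movies, I use cartesian() to pair each movie with every other movie, then keep only the pairs whose first movie id is smaller than the second. This leaves an RDD with C(n,2) lines.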
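
The counting can be checked on a tiny example without Spark. This sketch uses itertools.product as a stand-in for cartesian() and assumes four made-up movie ids.

```python
from itertools import product
from math import comb

movie_ids = [1, 2, 3, 4]

# Like rdd.cartesian(rdd): every ordered pair, including (a, a) and both (a, b) and (b, a).
all_pairs = list(product(movie_ids, repeat=2))

# Keeping only a < b removes self-pairs and mirrored duplicates.
unique_pairs = [(a, b) for a, b in all_pairs if a < b]

expected = comb(len(movie_ids), 2)  # C(4, 2)
```

The full product has n*n entries while only C(n,2) survive the filter, so more than half of the cartesian output is discarded.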

In [8]:
#join
cartesian_iten_item = Simple_Items_RDD.cartesian(Simple_Items_RDD).persist(StorageLevel.DISK_ONLY)

In [9]:
#cartesian_iten_item.collect()

In [10]:
combination_items = cartesian_iten_item.filter(lambda x:((x[0][0]<x[1][0])))\
                                        .persist(StorageLevel.DISK_ONLY)

In [11]:
#c = combination_items.collect()
#c

## Calculate Cosine Similarity
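1. With the result of the previous step, the similarity can now be computed.
2. Each paired line carries movie_id, movie_vector and |movie| for both movies, so the formula can be applied directly: cosine = (movie1 dot movie2) / (|movie1| * |movie2|)
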
In [12]:
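import math
def cal_similarity(x):
    # x = ((item1, (norm1, ratings1)), (item2, (norm2, ratings2)))
    item1, (s1, ratings1) = x[0]
    item2, (s2, ratings2) = x[1]
    divisor = s1*s2
    # dot product over the users who rated both movies
    ratings2_by_user = dict(ratings2)
    dot = 0
    for user, r1 in ratings1:
        if user in ratings2_by_user:
            dot += r1*ratings2_by_user[user]
    try:
        similarity = dot/divisor
    except ZeroDivisionError:
        # zero-length vectors were filtered out earlier, so this is only a safeguard
        result = ((item1,item2),0)
        return result
    result = (item1,item2),similarity
    return result
cos_sims = combination_items.map(cal_similarity).persist(StorageLevel.DISK_ONLY)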

In [13]:
#t3 = cos_sims.take(100)

In [14]:
#cos_sims.count()

In [15]:
#t8 = cos_sims.collect()
#t8

In [16]:
#t8

## Output To File
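1. Use saveAsTextFile() to write the result to the out1 folder.
2. coalesce turns the RDD into an RDD with a single partition, so the output is written as one file instead of being split into parts.
3. The result is in the part-00000 file inside that folder.
4. I recommend opening it with VS Code, which can allocate enough memory to open larger files.
5. Output pattern:
### (item, item), similarity
## The output is the similarity between all pairs of different movies that can be mean-centered
1. Note: for the demo, a precomputed output is stored as q1_similarity_part-00000.txt in the sample_out folder on Google Drive (the file extension was changed to .txt on the drive).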

In [17]:
cos_sims.coalesce(1,True).saveAsTextFile('out1')

In [18]:
#simple_Items_RDD.persist(StorageLevel.DISK_ONLY).collect()

## Sorting The movie similarity
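1. The next part is the prediction, which I approach from the item side. To make the computation easier, the map function duplicate takes each ((movie1,movie2),similarity) and produces the pairs (movie1,(movie2,similarity)) and (movie2,(movie1,similarity)), which flatMap then splits into separate elements.
2. Next, groupByKey gathers all elements that share the same first movie_id, giving an RDD with movie_id as key and (other movie_id, similarity) as values.
3. To simplify the later computation, the map function sorting_list then sorts the values inside each line by similarity, from largest to smallest.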
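
The three steps above can be sketched in plain Python on a handful of made-up similarities. `neighbours_by_movie` is a hypothetical name for this illustration and stands in for duplicate + flatMap + groupByKey + sorting_list.

```python
from collections import defaultdict

def neighbours_by_movie(pair_sims):
    table = defaultdict(list)
    for (m1, m2), sim in pair_sims:
        # Store the similarity under both movies so each one can look up the other.
        table[m1].append((m2, sim))
        table[m2].append((m1, sim))
    # Sort each movie's neighbours by similarity, largest first.
    return {m: sorted(v, key=lambda e: e[1], reverse=True) for m, v in table.items()}

pair_sims = [((1, 2), 0.9), ((1, 3), -0.2), ((2, 3), 0.4)]
neighbours = neighbours_by_movie(pair_sims)
```

After this step every movie has its most similar movies at the front of its list, so the prediction only needs to walk the list from the start.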

In [19]:
#sorting
def duplicate(x):
    m1 = x[0][0]
    m2 = x[0][1]
    sim = x[1]
    ele1 = (m1,(m2,sim))
    ele2 = (m2,(m1,sim))
    return ele1,ele2
def takeSecond(ele):
    return ele[1]
def sorting_list(x):
    l = list(x[1])
    l.sort(key=takeSecond, reverse=True)
    return x[0],l
item_sim = cos_sims.map(duplicate).flatMap(lambda x:x).groupByKey().map(sorting_list).persist(StorageLevel.DISK_ONLY)

In [20]:
#item_sim.count()

In [21]:
#ttmp = item_sim.collect()
#ttmp

## Get All Users
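1. This is similar to the first step, except that getNecesssary2() is used to build an RDD with user as key and (movie_id, rating) as values.
2. The result is converted with dict() so that a user's ratings can be looked up with user_id as the index. This way the users RDD does not have to be sorted first.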
In [22]:
#item_sim_table = item_sim.collect()
#item_sim_table

In [23]:
def getNecesssary2(x):
    line = x.split(',')
    user = str(line[0])
    movie = str(line[1])
    rating = float(line[2])
    
    pair = (movie, rating)
    return user,pair
users = sc.textFile("ml-latest-small/ratings.csv").filter(lambda x:x[0]!='u').map(getNecesssary2).groupByKey()\
        .map(lambda x:(x[0],list(x[1])))\
        .persist(StorageLevel.DISK_ONLY)
user_table = users.collect()

In [24]:
#user_table

In [25]:
new_user_table = []
for i in user_table:
    tar = dict(i[1])
    new_user_table.append((i[0],tar))
new_user_table_dict = dict(new_user_table)
#new_user_table_dict

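## Calculate all Predictions
1. With the movies each user has rated and the similarity between movies, we can now predict the rating a user would give to movies they have not rated yet.
2. I apply the map function cal_all_prediction() to the movie similarity RDD. Each line runs a loop of 610 iterations (the number of users) and first checks whether that user has already rated the movie. If so, it skips to the next iteration; otherwise it makes a prediction.
3. For a prediction, it first finds the movies the user has rated that have the 10 highest similarities to the current movie (excluding the movie itself).
4. Once 10 have been found (or fewer, if there are not 10), the new rating is computed as a weighted average, with similarity as the weight and rating as the value.
5. Inside the function I use try and except to handle the case where the denominator is 0.
6. Each result is appended to a list (all_rating) that collects all predicted ratings for that movie. The list is returned after the loop, giving a new RDD whose elements are ((user, current movie_id), rating).
7. When running on sample.csv, N=10 can be changed to N=2 so that the result matches the example in the textbook. That result is also on Google Drive.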
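
The weighted average for a single (user, movie) pair can be sketched as follows. `predict` is a hypothetical helper for this illustration, the numbers are made up, and the neighbour list is assumed to be sorted by similarity in descending order already.

```python
def predict(neighbours, user_ratings, n=10):
    # neighbours: [(movie, similarity), ...] sorted by similarity, descending
    # user_ratings: {movie: rating} for one user
    weighted_sum = 0.0
    sim_sum = 0.0
    used = 0
    for movie, sim in neighbours:
        if used == n or sim <= 0:
            break  # enough neighbours, or only non-positive similarities remain
        if movie in user_ratings:
            weighted_sum += user_ratings[movie] * sim
            sim_sum += sim
            used += 1
    # No usable neighbour means no prediction; 0 marks that case.
    return weighted_sum / sim_sum if sim_sum else 0.0

neighbours = [('2', 0.8), ('3', 0.5), ('4', 0.2), ('5', -0.1)]
user_ratings = {'2': 4.0, '4': 2.0, '5': 1.0}

top2 = predict(neighbours, user_ratings, n=2)   # uses movies '2' and '4'
top1 = predict(neighbours, user_ratings, n=1)   # uses movie '2' only
none = predict(neighbours, {}, n=2)             # user rated nothing similar
```

With n=2 the prediction is (4.0*0.8 + 2.0*0.2) / (0.8 + 0.2), which is about 3.6. Movie '3' is skipped because the user has not rated it, and movie '5' is never used because its similarity is negative.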

In [26]:
N=10
users_num = users.count()
def cal_all_prediction(x):
    # x = (movie_id, [(other_movie_id, similarity), ...]) sorted by similarity, descending
    main_item = str(x[0])
    all_rating = []
    # user ids are assumed to be the consecutive integers 1..users_num
    for u in range(users_num):
        user_str = str(u+1)
        user_ratings = new_user_table_dict.get(user_str, {})
        if main_item in user_ratings:
            continue # already rated, nothing to predict
        # weighted average over the top N positively similar movies this user has rated
        divider = 0
        dividend = 0
        cnt = 0
        for other_item, sim in x[1]:
            if cnt==N: break
            if sim>0:
                rating = user_ratings.get(str(other_item)) # user's rating of the similar movie
                if rating is None:
                    continue
                dividend += rating*sim
                divider += sim
                cnt+=1
        try:
            predict_rate = dividend/divider
        except ZeroDivisionError: # the user rated no positively similar movie
            predict_rate = 0
        all_rating.append(((int(user_str),int(x[0])), predict_rate))
    return all_rating
ans3 = item_sim.map(cal_all_prediction).persist(StorageLevel.DISK_ONLY)

In [27]:
#tmp3 = ans3.flatMap(lambda x:x).filter(lambda x:x[2]>0).collect()
#tmp3

## Output To File
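1. Use coalesce to repartition the RDD into a single partition, use flatMap to extract all elements, and use saveAsTextFile to write the data to the out2 folder.
2. The result is stored in the part-00000 file inside the out2 folder. Again, I recommend opening it with VS Code.
3. Output pattern:
### ((user, item), rating)
## The output contains every user rating that can be predicted and whose predicted value is greater than 0
4. Note: for the demo, a precomputed output is stored as q2_prediction_part-00000.txt in the sample_out folder on Google Drive (the file extension was changed to .txt on the drive).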

In [28]:
ans3.coalesce(1,True).flatMap(lambda x:x).filter(lambda x:x[1]>0).saveAsTextFile('out2')

In [29]:
#ans3.coalesce(1,True).flatMap(lambda x:x).sortBy(lambda x: (x[0][0], x[0][1])).collect()