# 1 Similarity Computation

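Collaborative filtering in the narrow sense predicts a user's preferences automatically from the preferences of other users. For example, if users A and B hold the same opinion on one item, then A's opinion on other items is more likely to be close to B's than to that of a randomly chosen person. Collaborative filtering in the broad sense draws on a wider range of data sources. It is called collaborative filtering because the filtering and recommendation rely on other people's behavior to make predictions.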

The first step is to decide how to measure the similarity between two users.

**Note**: The movie rating data can be downloaded from [movielens](https://grouplens.org/datasets/movielens/20m/).

In [1]:
import pandas as pd
from IPython.display import Latex

In [2]:
movies = pd.read_csv('./ml-20m/movies.csv')
ratings = pd.read_csv('./ml-20m/ratings.csv')

In [3]:
# Join the two tables on movieId (pd.merge performs an inner join by default).
data = pd.merge(movies, ratings, on = "movieId")

In [4]:
# Save to a text file for later processing (this step is slow)
# data[['userId', 'rating', 'movieId', 'title']].sort_values('userId').to_csv('./data.csv', index=False)

In [5]:
data[['userId', 'rating', 'movieId', 'title']][:10]

| | userId | rating | movieId | title |
|---|---|---|---|---|
| 0 | 3 | 4.0 | 1 | Toy Story (1995) |
| 1 | 6 | 5.0 | 1 | Toy Story (1995) |
| 2 | 8 | 4.0 | 1 | Toy Story (1995) |
| 3 | 10 | 4.0 | 1 | Toy Story (1995) |
| 4 | 11 | 4.5 | 1 | Toy Story (1995) |
| 5 | 12 | 4.0 | 1 | Toy Story (1995) |
| 6 | 13 | 4.0 | 1 | Toy Story (1995) |
| 7 | 14 | 4.5 | 1 | Toy Story (1995) |
| 8 | 16 | 3.0 | 1 | Toy Story (1995) |
| 9 | 19 | 5.0 | 1 | Toy Story (1995) |


In [6]:
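# Open the file and build {userId: {title: rating}} from rows 1 to 999.
# Note: a plain split(",") cuts quoted titles that contain commas
# (e.g. the key '"Ring' in the output); csv.reader would avoid this.
with open('./data.csv') as file:
    data = {}
    for line in file.readlines()[1:1000]:
        line = line.strip().split(",")
        if line[0] not in data:
            data[line[0]] = {line[3]: line[1]}
        else:
            data[line[0]][line[3]] = line[1]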

In [7]:
data['1']

{"Monty Python's The Meaning of Life (1983)": '3.5',
 'Kill Bill: Vol. 2 (2004)': '4.0',
 '"Ring': '3.5',
 'Shrek (2001)': '4.0',
 'Contact (1997)': '3.5',
 'Die Hard (1988)': '4.0',
 'Rumble in the Bronx (Hont faan kui) (1995)': '3.5',
 'One Million Years B.C. (1966)': '4.0',
 '"Mask': '3.5',
 'Dawn of the Dead (1978)': '3.5',
 'Freaks (1932)': '5.0',
 'Twelve Monkeys (a.k.a. 12 Monkeys) (1995)': '3.5',
 'Ringu (Ring) (1998)': '3.5',
 'Escape to Witch Mountain (1975)': '3.5',
 'Videodrome (1983)': '4.0',
 'Léon: The Professional (a.k.a. The Professional) (Léon) (1994)': '4.0',
 'Interview with the Vampire: The Vampire Chronicles (1994)': '4.0',
 'Star Trek II: The Wrath of Khan (1982)': '4.0',
 "Harry Potter and the Sorcerer's Stone (a.k.a. Harry Potter and the Philosopher's Stone) (2001)": '4.0',
 'Highlander: Endgame (Highlander IV) (2000)': '4.0',
 'Seven (a.k.a. Se7en) (1995)': '3.5',
 '"Lock': '4.0',
 '2001: A Space Odyssey (1968)': '3.5',
 'Memento (2000)': '3.5',
 '"Wizard of O
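
Keys such as `'"Ring'`, `'"Mask'` and `'"Lock'` in the output above are cut short because titles that contain commas are quoted in the CSV, and a plain `split(",")` breaks them at the first comma. A minimal sketch of a more robust loader, assuming the same column order (`userId, rating, movieId, title`) and using the standard-library `csv` module. The helper name `load_ratings` and the sample rows are hypothetical, and unlike the cell above this sketch stores ratings as floats:

```python
import csv
import io
from itertools import islice

def load_ratings(file, limit=None):
    """Build {userId: {title: rating}} from CSV rows (hypothetical helper)."""
    reader = csv.reader(file)
    next(reader, None)  # skip the header row
    ratings = {}
    # islice(reader, None) reads every remaining row; pass limit to read fewer
    for user_id, rating, movie_id, title in islice(reader, limit):
        ratings.setdefault(user_id, {})[title] = float(rating)
    return ratings

# Illustrative rows standing in for ./data.csv (not taken from the dataset)
sample = io.StringIO(
    'userId,rating,movieId,title\n'
    '1,3.5,101,"Example, The (2000)"\n'
    '1,4.0,102,Shrek (2001)\n'
)
sample_ratings = load_ratings(sample)
print(sample_ratings)
```

With a real file, the same helper could be called as `load_ratings(open('./data.csv', newline=''), limit=999)`. Switching loaders would change the truncated keys, so similarity values computed later may differ slightly from the ones shown in this notebook.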

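Next we compute the similarity between two users. First find the movies that both users have rated, then compute the Euclidean distance over those ratings, and finally convert the distance into a similarity score.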
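
Written out, the computation in the next cell can be sketched as follows. The notation is introduced here for illustration: $I_{uv}$ is the set of movies rated by both users $u$ and $v$, and $r_{u,i}$ is user $u$'s rating of movie $i$.

$$
d(u, v) = \sqrt{\sum_{i \in I_{uv}} \left(r_{u,i} - r_{v,i}\right)^2}, \qquad
\mathrm{sim}(u, v) = \frac{1}{1 + d(u, v)}
$$

Since $d(u, v) \ge 0$, the similarity lies in $(0, 1]$, and identical ratings on all shared movies give exactly 1.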

In [8]:
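from math import sqrt

def Euclidean(user1, user2):
    # Ratings of the two users
    user1_data = data[user1]
    user2_data = data[user2]
    # Movies rated by both users
    common = [key for key in user1_data if key in user2_data]
    if not common:
        # No shared movies: report no similarity instead of a perfect score
        return 0
    distance = 0
    for key in common:
        distance += (float(user1_data[key]) - float(user2_data[key])) ** 2
    # The smaller the distance, the closer the returned similarity is to 1
    return 1 / (1 + sqrt(distance))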

In [9]:
Euclidean('1', '2')

0.20658711810431302

In [10]:
Euclidean('1', '4')

0.27792629762666365
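
To make the numbers easier to check by hand, here is a small self-contained sketch of the same idea on made-up ratings. The function name `euclidean_similarity` and the users `alice`, `bob` and `carol` are hypothetical, and returning 0.0 when there is no overlap is an assumption of this sketch:

```python
from math import sqrt

def euclidean_similarity(ratings_a, ratings_b):
    """Return 1 / (1 + distance) over the movies both users rated."""
    shared = ratings_a.keys() & ratings_b.keys()
    if not shared:
        return 0.0  # assumption: no overlap means no evidence of similarity
    distance = sqrt(sum((ratings_a[m] - ratings_b[m]) ** 2 for m in shared))
    return 1 / (1 + distance)

# Hypothetical ratings, not taken from MovieLens
alice = {"Movie A": 4.0, "Movie B": 1.0, "Movie C": 5.0}
bob = {"Movie A": 1.0, "Movie B": 5.0, "Movie D": 2.0}
carol = {"Movie E": 3.0}

print(euclidean_similarity(alice, bob))    # distance = sqrt(9 + 16) = 5, so 1/6
print(euclidean_similarity(alice, alice))  # identical ratings give 1.0
print(euclidean_similarity(alice, carol))  # no shared movies gives 0.0
```

Only "Movie A" and "Movie B" enter the distance between `alice` and `bob`. Movies rated by just one of them are ignored.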

# 2 Implementing the Collaborative Filtering Recommendation Algorithm

Formula to be determined.

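The Pearson correlation coefficient measures the strength of the linear relationship between two variables. It takes values in [-1, 1]: 1 means a perfect positive linear correlation, 0 means no linear correlation, and -1 means a perfect negative linear correlation.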
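
For reference, the computational form that `pearson_sim` uses below can be written as follows. The symbols are introduced here for illustration: $x_i$ and $y_i$ are the two users' ratings of the $i$-th shared movie, and $n$ is the number of shared movies.

$$
r = \frac{\sum_i x_i y_i - \dfrac{\sum_i x_i \, \sum_i y_i}{n}}
{\sqrt{\left(\sum_i x_i^2 - \dfrac{\left(\sum_i x_i\right)^2}{n}\right)\left(\sum_i y_i^2 - \dfrac{\left(\sum_i y_i\right)^2}{n}\right)}}
$$

The denominator is zero when either user gave the same rating to every shared movie, in which case $r$ is undefined and the code returns 0.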

In [11]:
# Compute the Pearson correlation coefficient between two users
def pearson_sim(user1, user2):
    user1_data = data[user1]
    user2_data = data[user2]
    # Movies rated by both users
    common = [key for key in user1_data if key in user2_data]
    # No shared movies: return 0
    if len(common) == 0:
        return 0
    # Number of shared movies
    n = len(common)

    # Sums of the ratings
    sum1 = sum([float(user1_data[movie]) for movie in common])
    sum2 = sum([float(user2_data[movie]) for movie in common])
    # Sums of the squared ratings
    sum1Sq = sum([pow(float(user1_data[movie]), 2) for movie in common])
    sum2Sq = sum([pow(float(user2_data[movie]), 2) for movie in common])
    # Sum of the products of the ratings
    pSum = sum([float(user1_data[it]) * float(user2_data[it]) for it in common])

    # Pearson coefficient
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    # A zero denominator means one user rated every shared movie the same
    if den == 0:
        return 0
    r = num / den
    return r

In [12]:
pearson_sim('1', '2')

-0.06189844605901989

In [13]:
pearson_sim('1', '4')

0
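
The integer `0` above deserves a note. `Euclidean('1', '4')` returned a value strictly between 0 and 1, which requires at least one shared movie with differing ratings, so the result most likely comes from the `den == 0` branch, where one of the two users gave the same rating to every shared movie. The sketch below reproduces that case, along with the two extremes, on made-up ratings. The helper `pearson` is hypothetical, and clamping negative rounding errors before the square root is an addition of this sketch:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length rating lists (hypothetical helper)."""
    n = len(xs)
    if n == 0:
        return 0.0
    sum_x, sum_y = sum(xs), sum(ys)
    num = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    ss_x = sum(x * x for x in xs) - sum_x ** 2 / n
    ss_y = sum(y * y for y in ys) - sum_y ** 2 / n
    # Clamp tiny negative rounding errors so sqrt never sees a negative number
    den = sqrt(max(ss_x, 0.0) * max(ss_y, 0.0))
    return num / den if den else 0.0

print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # same ordering: close to 1
print(pearson([1.0, 2.0, 3.0], [5.0, 3.0, 1.0]))  # reversed ordering: close to -1
print(pearson([3.5, 4.0, 5.0], [4.0, 4.0, 4.0]))  # constant ratings: den == 0, returns 0.0
```

A result of 0 from the zero-denominator branch therefore means the correlation is undefined for that pair, which is different from two users whose ratings are genuinely uncorrelated.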