## User-based collaborative filtering
The basic idea of UserCF:
* If user1 is similar to user2, and user2 likes some item,
* then user1 is also likely to like that item.


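Advantages
* Every person's interests are broad, and UserCF can quickly branch out to different new interests for each user by borrowing from similar users. Recommending trending news is a typical example.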

In [1]:
%cd /playground/sgd_deep_learning/sgd_rec_sys/
import sys 
sys.path.append('./python')

/playground/sgd_deep_learning/sgd_rec_sys


In [2]:
import numpy as np
import random
from sgd_rec_sys.retrieval import UserCF, RateInfo

## rate_info

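* Reads the user and item meta info from files (for example, the id-to-name mappings).
* Reads the file of historical user ratings and organizes the data for each algorithm:
  * itemcf: needs, for each item, the list of users who rated it
  * usercf: needs, for each user, the list of all items that user rated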

In [3]:
rate_info = RateInfo(user_file='./data/retrieval/user2id.txt',
                     item_file='./data/retrieval/item2id.txt',
                     rate_file='./data/retrieval/userid_itemid_rate.txt')

In [4]:
# User-side info
rate_info.user_meta_info()

{'col_name': ['user_id', 'user_name'],
 'id2name': {1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'},
 'name2id': {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}}

In [5]:
# Item-side info
rate_info.item_meta_info()

{'col_name': ['item_id', 'item_name'],
 'id2name': {1: 'story_book', 2: 'magazine', 3: 'tv', 4: 'ps4'},
 'name2id': {'story_book': 1, 'magazine': 2, 'tv': 3, 'ps4': 4}}

In [6]:
# Rating info
rate_info.rate_meta_info()

{'col_name': ['userid', 'itemid', 'rate'],
 'rate_pairs': [[1, 1, 1],
  [1, 2, -1],
  [1, 3, 1],
  [1, 4, 1],
  [2, 2, 1],
  [2, 3, -1],
  [2, 4, -1],
  [3, 1, 1],
  [3, 2, 1],
  [3, 3, -1],
  [4, 1, -1],
  [4, 3, 1],
  [5, 1, 1],
  [5, 2, 1],
  [5, 4, -1]]}
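
The two groupings described in the rate_info section (per-user item lists for usercf, per-item user lists for itemcf) can be sketched directly from the rating triples printed above. This is only an illustration and not the actual `RateInfo` implementation; `build_indices`, `user_to_items`, and `item_to_users` are hypothetical names.

```python
# Rating triples [user_id, item_id, rate], copied from the rate_meta_info() output.
rate_pairs = [
    [1, 1, 1], [1, 2, -1], [1, 3, 1], [1, 4, 1],
    [2, 2, 1], [2, 3, -1], [2, 4, -1],
    [3, 1, 1], [3, 2, 1], [3, 3, -1],
    [4, 1, -1], [4, 3, 1],
    [5, 1, 1], [5, 2, 1], [5, 4, -1],
]

def build_indices(pairs):
    """Group the triples by user (for usercf) and by item (for itemcf)."""
    user_to_items = {}  # user_id -> {item_id: rate}
    item_to_users = {}  # item_id -> {user_id: rate}
    for uid, iid, rate in pairs:
        user_to_items.setdefault(uid, {})[iid] = rate
        item_to_users.setdefault(iid, {})[uid] = rate
    return user_to_items, item_to_users

user_to_items, item_to_users = build_indices(rate_pairs)
print(user_to_items[1])  # every item user 1 rated
print(item_to_users[2])  # every user who rated item 2
```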

## UserCF

In [15]:
usercf = UserCF(meta_info=rate_info)

user_info = rate_info.user_meta_info()
uids = list(user_info['id2name'].keys())
print("all user ids:", uids)
print(user_info['id2name'])
print()

# Compute the pairwise cosine similarity between users (expensive; can be computed offline)
for i in range(len(uids)-1):
    for j in range(i, len(uids)):
        id1, id2 = uids[i], uids[j]
        print("sim score of {}-{} : {:.2f}".format(id1, id2, usercf.sim(id1, id2)))

all user ids: [1, 2, 3, 4, 5]
{1: 'A', 2: 'B', 3: 'C', 4: 'D', 5: 'E'}

sim score of 1-1 : 1.00
sim score of 1-2 : -0.87
sim score of 1-3 : -0.29
sim score of 1-4 : 0.00
sim score of 1-5 : -0.29
sim score of 2-2 : 1.00
sim score of 2-3 : 0.67
sim score of 2-4 : -0.41
sim score of 2-5 : 0.67
sim score of 3-3 : 1.00
sim score of 3-4 : -0.82
sim score of 3-5 : 0.67
sim score of 4-4 : 1.00
sim score of 4-5 : -0.41
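
The scores printed above are consistent with plain cosine similarity over each user's rating vector, with unrated items counted as 0. The sketch below recomputes them under that assumption; it is not the source of `UserCF.sim`, and `user_ratings` and `cos_sim` are hypothetical helper names.

```python
import math

# Rating triples [user_id, item_id, rate], copied from the rate_meta_info() output.
rate_pairs = [
    [1, 1, 1], [1, 2, -1], [1, 3, 1], [1, 4, 1],
    [2, 2, 1], [2, 3, -1], [2, 4, -1],
    [3, 1, 1], [3, 2, 1], [3, 3, -1],
    [4, 1, -1], [4, 3, 1],
    [5, 1, 1], [5, 2, 1], [5, 4, -1],
]

def user_ratings(pairs):
    """Return user_id -> {item_id: rate}; unrated items are simply absent."""
    ratings = {}
    for uid, iid, rate in pairs:
        ratings.setdefault(uid, {})[iid] = rate
    return ratings

def cos_sim(ratings, id1, id2):
    """Cosine similarity of two sparse rating vectors (missing entries act as 0)."""
    r1, r2 = ratings[id1], ratings[id2]
    dot = sum(r1[i] * r2[i] for i in r1.keys() & r2.keys())
    norm1 = math.sqrt(sum(v * v for v in r1.values()))
    norm2 = math.sqrt(sum(v * v for v in r2.values()))
    return dot / (norm1 * norm2) if norm1 > 0 and norm2 > 0 else 0.0

ratings = user_ratings(rate_pairs)
for id1, id2 in [(1, 2), (2, 3), (3, 4)]:
    print("sim score of {}-{} : {:.2f}".format(id1, id2, cos_sim(ratings, id1, id2)))
```

Because dislikes are stored as -1, users with opposite opinions on the same items get a negative score, which is why pairs such as 1-2 come out strongly negative.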


In [19]:
# Compute the pairwise Jaccard similarity between users (expensive; can be computed offline)
for i in range(len(uids)-1):
    for j in range(i, len(uids)):
        id1, id2 = uids[i], uids[j]
        print("sim score of {}-{} :\t {:.3f}\n".format(id1, id2, usercf.jarcard_sim(id1, id2)))

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 3: 1, 4: 1}
common [1, 3, 4]
sim score of 1-1 :	 1.000

J1, J2 {1: 1, 3: 1, 4: 1} {2: 1}
common []
sim score of 1-2 :	 0.000

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 2: 1}
common [1]
sim score of 1-3 :	 0.408

J1, J2 {1: 1, 3: 1, 4: 1} {3: 1}
common [3]
sim score of 1-4 :	 0.577

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 2: 1}
common [1]
sim score of 1-5 :	 0.408

J1, J2 {2: 1} {2: 1}
common [2]
sim score of 2-2 :	 1.000

J1, J2 {2: 1} {1: 1, 2: 1}
common [2]
sim score of 2-3 :	 0.707

J1, J2 {2: 1} {3: 1}
common []
sim score of 2-4 :	 0.000

J1, J2 {2: 1} {1: 1, 2: 1}
common [2]
sim score of 2-5 :	 0.707

J1, J2 {1: 1, 2: 1} {1: 1, 2: 1}
common [1, 2]
sim score of 3-3 :	 1.000

J1, J2 {1: 1, 2: 1} {3: 1}
common []
sim score of 3-4 :	 0.000

J1, J2 {1: 1, 2: 1} {1: 1, 2: 1}
common [1, 2]
sim score of 3-5 :	 1.000

J1, J2 {3: 1} {3: 1}
common [3]
sim score of 4-4 :	 1.000

J1, J2 {3: 1} {1: 1, 2: 1}
common []
sim score of 4-5 :	 0.000
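
Judging by the printed values, `jarcard_sim` keeps only the items each user liked (rate > 0) and scores the overlap as |J1 ∩ J2| / sqrt(|J1| · |J2|), rather than the textbook Jaccard ratio |J1 ∩ J2| / |J1 ∪ J2|. The sketch below reproduces the numbers under that reading; it is an inference from the output, not the library's code, and `liked_items` and `overlap_sim` are hypothetical names.

```python
import math

# Rating triples [user_id, item_id, rate], copied from the rate_meta_info() output.
rate_pairs = [
    [1, 1, 1], [1, 2, -1], [1, 3, 1], [1, 4, 1],
    [2, 2, 1], [2, 3, -1], [2, 4, -1],
    [3, 1, 1], [3, 2, 1], [3, 3, -1],
    [4, 1, -1], [4, 3, 1],
    [5, 1, 1], [5, 2, 1], [5, 4, -1],
]

def liked_items(pairs):
    """Return user_id -> set of item_ids the user rated positively."""
    liked = {}
    for uid, iid, rate in pairs:
        liked.setdefault(uid, set())
        if rate > 0:
            liked[uid].add(iid)
    return liked

def overlap_sim(liked, id1, id2):
    """Shared liked items, normalized by the geometric mean of the two set sizes."""
    j1, j2 = liked[id1], liked[id2]
    if not j1 or not j2:
        return 0.0
    return len(j1 & j2) / math.sqrt(len(j1) * len(j2))

liked = liked_items(rate_pairs)
for id1, id2 in [(1, 3), (1, 4), (2, 3)]:
    print("sim score of {}-{} :\t {:.3f}".format(id1, id2, overlap_sim(liked, id1, id2)))
```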



In [18]:
# Jaccard similarity that suppresses popular items (expensive; can be computed offline)
for i in range(len(uids)-1):
    for j in range(i, len(uids)):
        id1, id2 = uids[i], uids[j]
        print("sim score of {}-{} :\t {:.3f}\n".format(id1, id2, usercf.jarcard_sim_with_suppressing_hot(id1, id2)))

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 3: 1, 4: 1}
common [1, 3, 4]
sim score of 1-1 :	 1.025

J1, J2 {1: 1, 3: 1, 4: 1} {2: 1}
common []
sim score of 1-2 :	 0.000

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 2: 1}
common [1]
sim score of 1-3 :	 0.294

J1, J2 {1: 1, 3: 1, 4: 1} {3: 1}
common [3]
sim score of 1-4 :	 0.526

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 2: 1}
common [1]
sim score of 1-5 :	 0.294

J1, J2 {2: 1} {2: 1}
common [2]
sim score of 2-2 :	 0.721

J1, J2 {2: 1} {1: 1, 2: 1}
common [2]
sim score of 2-3 :	 0.510

J1, J2 {2: 1} {3: 1}
common []
sim score of 2-4 :	 0.000

J1, J2 {2: 1} {1: 1, 2: 1}
common [2]
sim score of 2-5 :	 0.510

J1, J2 {1: 1, 2: 1} {1: 1, 2: 1}
common [1, 2]
sim score of 3-3 :	 0.721

J1, J2 {1: 1, 2: 1} {3: 1}
common []
sim score of 3-4 :	 0.000

J1, J2 {1: 1, 2: 1} {1: 1, 2: 1}
common [1, 2]
sim score of 3-5 :	 0.721

J1, J2 {3: 1} {3: 1}
common [3]
sim score of 4-4 :	 0.910

J1, J2 {3: 1} {1: 1, 2: 1}
common []
sim score of 4-5 :	 0.000



## Effect of suppressing popular items

In [21]:
# Item 1 is popular and liked by many users, so it should not carry much weight when estimating user similarity.

id1, id2 = 1, 3
print("not suppressing_hot: sim score of {}-{} :\t {:.3f}\n".format(id1, id2, usercf.jarcard_sim(id1, id2)))
print("suppressing_hot: sim score of {}-{} :\t {:.3f}\n".format(id1, id2, usercf.jarcard_sim_with_suppressing_hot(id1, id2)))

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 2: 1}
common [1]
not suppressing_hot: sim score of 1-3 :	 0.408

J1, J2 {1: 1, 3: 1, 4: 1} {1: 1, 2: 1}
common [1]
suppressing_hot: sim score of 1-3 :	 0.294
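
The drop from 0.408 to 0.294 is consistent with weighting each shared item l by 1 / ln(1 + n_l), where n_l is the number of users who like item l, before dividing by sqrt(|J1| · |J2|). The sketch below recomputes the score under that assumption; the weighting is inferred from the printed numbers, and `item_popularity` and `overlap_sim_suppress_hot` are hypothetical names.

```python
import math

# Liked-item sets per user, as shown in the "J1, J2" lines of the output.
liked = {1: {1, 3, 4}, 2: {2}, 3: {1, 2}, 4: {3}, 5: {1, 2}}

def item_popularity(liked):
    """Return item_id -> number of users who like that item."""
    counts = {}
    for items in liked.values():
        for iid in items:
            counts[iid] = counts.get(iid, 0) + 1
    return counts

def overlap_sim_suppress_hot(liked, counts, id1, id2):
    """Overlap similarity where each shared item counts 1 / ln(1 + popularity)."""
    j1, j2 = liked[id1], liked[id2]
    if not j1 or not j2:
        return 0.0
    weighted = sum(1.0 / math.log(1 + counts[iid]) for iid in j1 & j2)
    return weighted / math.sqrt(len(j1) * len(j2))

counts = item_popularity(liked)
print(counts[1])  # item 1 is liked by users 1, 3 and 5
print("{:.3f}".format(overlap_sim_suppress_hot(liked, counts, 1, 3)))
```

Under this weighting the score is no longer bounded by 1: an item liked by a single user contributes 1 / ln 2 ≈ 1.44, which is why the 1-1 pair prints 1.025 while 2-2 prints 0.721.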



## The complete UserCF retrieval pipeline
Offline computation, done in advance
  
Build a user -> user index
* For each user, index the k users most similar to them.
* Given any user ID, the k most similar users can then be looked up quickly.

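Build a user -> item index
* Record the IDs of the items each user recently clicked or interacted with.
* Given any user ID, the list of items that user was recently interested in can then be looked up.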
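
A minimal sketch of the two offline indexes, built from the liked-item data in this notebook. Several choices here are assumptions: the data has no timestamps, so row order stands in for recency; the overlap similarity is used as the ranking score; and `build_user_to_user` and `build_user_to_item` are hypothetical names.

```python
import math

# Liked items per user in rating-file order (order is used as a proxy for recency).
liked = {1: [1, 3, 4], 2: [2], 3: [1, 2], 4: [3], 5: [1, 2]}

def overlap_sim(items1, items2):
    """|J1 & J2| / sqrt(|J1| * |J2|) over two lists of liked items."""
    j1, j2 = set(items1), set(items2)
    if not j1 or not j2:
        return 0.0
    return len(j1 & j2) / math.sqrt(len(j1) * len(j2))

def build_user_to_user(liked, k):
    """For every user, keep the k most similar other users as (user_id, sim)."""
    index = {}
    for uid in liked:
        scored = [(other, overlap_sim(liked[uid], liked[other]))
                  for other in liked if other != uid]
        # Highest similarity first; ties broken by smaller user id.
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        index[uid] = scored[:k]
    return index

def build_user_to_item(liked, n):
    """For every user, keep the last n liked items (n >= 1)."""
    return {uid: items[-n:] for uid, items in liked.items()}

user_to_user = build_user_to_user(liked, k=2)
user_to_item = build_user_to_item(liked, n=2)
print(user_to_user[3])
print(user_to_item[1])
```

The all-pairs loop is quadratic in the number of users, which is acceptable for five users but is the reason this step is done offline.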


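Online retrieval
1) Given a user ID, use the user -> user index to find the top-k similar users.
2) For each of the top-k similar users, use the user -> item index to find the items that user was recently interested in (last-n).
3) For the nk items retrieved this way, use a formula to estimate the user's interest score for each item.
4) Return the 100 highest-scoring items as the retrieval result.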
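
The four online steps can be sketched as follows. The notebook does not state the scoring formula, so this sketch assumes a common choice: an item's score is the sum of the similarities of the neighbors who like it. Skipping items the user has already rated is also an assumption, and `recall` is a hypothetical function name.

```python
# Top-2 neighbors of user 2, using the rounded cosine scores printed earlier.
user_to_user = {2: [(3, 0.67), (5, 0.67)]}
# Items each neighbor likes, as shown in the "J1, J2" output lines.
user_to_item = {3: [1, 2], 5: [1, 2]}
# Items user 2 has already rated (from the rating triples).
seen = {2: {2, 3, 4}}

def recall(uid, user_to_user, user_to_item, seen, top_n=100):
    """Score candidate items for uid and return the top_n as (item_id, score)."""
    already_seen = seen.get(uid, set())
    scores = {}
    # Step 1: look up the top-k similar users.
    for neighbor, sim in user_to_user.get(uid, []):
        # Step 2: fetch each neighbor's recent items.
        for item in user_to_item.get(neighbor, []):
            if item in already_seen:
                continue  # assumption: do not recommend items already rated
            # Step 3: assumed formula, score(item) += sim(user, neighbor).
            scores[item] = scores.get(item, 0.0) + sim
    # Step 4: keep the highest-scoring items.
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:top_n]

print(recall(2, user_to_user, user_to_item, seen))
```

For user 2, both neighbors like items 1 and 2. Item 2 is filtered out as already rated, so only item 1 is returned, with the two neighbor similarities added together.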



