# <center>A Recommender System Case Study Based on Collaborative Filtering</center>

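## Case background

As e-commerce matures, competition among online sellers keeps intensifying.  
After its sales passed 60 million, the toy company in this case hit a growth bottleneck.
Current situation: it is a mid-sized, growing company with healthy cash flow. Its toys come from factories that do contract manufacturing for brands, and product quality, design, and value for money are all strengths. The operations team has 20 people.  
The company hopes our analysis can produce a plan to break through the bottleneck and raise sales.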


## Data source

An unnamed toy manufacturer

## Data overview

The data consists of three files:  
- Items_attribute.csv   
- Items_orders.csv  
- orders.csv

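### The orders table

The **Orders** file holds the transaction data generated after a user makes a purchase. Records are keyed by order number, and everything belonging to one order sits under a single order number. The data covers the whole month of May 2017.  
Notes on some of the less obvious fields:

| Field | Description |
| ------ | ------ |
|**订单编号** (order number)|All items bought in one checkout form a single order with one order number, regardless of quantity.|
|买家应付金额/邮费 (amount/postage payable)|The amount the buyer is nominally due to pay for the purchase.|
|买家实付金额/邮费 (amount/postage actually paid)|What the buyer actually paid, after discounts and free shipping. On Taobao an item usually has a list price and a discounted price: the list price is what the seller considers the item to be worth, and the discounted price is the actual selling price.|
|订单状态 (order status)|Driven by the buyer's actions. When the buyer adds items to an order and submits it, the order is placed, and the data generated at that point is order-placement data. Only after the buyer actually pays does the status change to 交易成功 (transaction successful).|
|宝贝种类 (item types)|The number of distinct item types in the order.|
|宝贝数量 (item quantity)|The total quantity of items in the order.|
|etc.||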


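### The Items_orders table

The **Items_orders** file holds the transaction data for each individual item.

| Field | Description |
| ------ | ------ |
|**订单编号** (order number)|All items bought in one checkout form a single order with one order number, regardless of quantity.|
|**标题** (title)|Name of the purchased item.|
|价格 (price)|Price of the item.|
|etc.||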

### The Items_attribute table

**Items_attribute** holds the item attribute data: item ID, title, price, toy type, suitable age, and brand.


| Field | Description |
| ------ | ------ |
|**宝贝ID** (item ID)|ID number of the item.|
|**标题** (title)|Name of the purchased item.|
|价格 (price)|Price of the item.|
|etc.||

## Importing libraries

In [1]:
import numpy as np
import pandas as pd

## Loading the data

In [2]:
orders = pd.read_csv("data/orders.csv")
items = pd.read_csv("data/Items_orders.csv")
items_attribute = pd.read_csv("data/Items_attribute.csv",encoding='gbk')

### A first look at orders

In [3]:
orders.head()

Unnamed: 0,订单编号,买家会员名,买家支付宝账号,买家应付货款,买家应付邮费,买家支付积分,总金额,返点积分,买家实际支付金额,买家实际支付积分,...,是否代付,定金排名,修改后的sku,修改后的收货地址,异常信息,天猫卡券抵扣,集分宝抵扣,是否是O2O交易,退款金额,预约门店
0,21407300627014900,1425,yorzikyA6C,58.51,0.0,0,58.51,0,58.51,0,...,否,,,,,,,,0.0,
1,24270488269081200,2163,AC870BA5860,15.7,5.0,0,20.7,0,20.7,0,...,否,,,,,,,,0.0,
2,21402600386365500,375,AC7574B65A0,7.9,5.0,0,12.9,0,12.9,0,...,否,,,,,,,,0.0,
3,21398820349555700,2618,A807C90766A,4.81,5.0,0,9.81,0,9.81,0,...,否,,,,,,,,0.0,
4,21446781606162100,2012,A505588565B,23.92,5.0,0,28.92,0,28.92,0,...,否,,,,,,,,0.0,


In [4]:
orders.shape

(3989, 46)

In [5]:
orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3989 entries, 0 to 3988
Data columns (total 46 columns):
订单编号        3989 non-null int64
买家会员名       3989 non-null int64
买家支付宝账号     3988 non-null object
买家应付货款      3989 non-null float64
买家应付邮费      3989 non-null float64
买家支付积分      3989 non-null int64
总金额         3989 non-null float64
返点积分        3989 non-null int64
买家实际支付金额    3989 non-null float64
买家实际支付积分    3989 non-null int64
订单状态        3989 non-null object
买家留言        384 non-null object
收货人姓名       3989 non-null int64
收货地址        3989 non-null object
运送方式        3989 non-null object
联系电话        142 non-null object
联系手机        3986 non-null object
订单创建时间      3989 non-null object
订单付款时间      3989 non-null object
宝贝标题        3989 non-null object
宝贝种类        3989 non-null int64
物流单号        3988 non-null object
物流公司        3988 non-null object
订单备注        460 non-null object
宝贝总数量       3989 non-null int64
店铺Id        3989 non-null int64
店铺名称        3989 non-null int64
订单关闭原因     

In [6]:
len(np.unique(orders.订单编号.values))  # still 3989 after deduplication, so order numbers have no duplicates

3989

In [7]:
len(np.unique(orders.买家会员名.values)) # fewer than 3989, so buyer member names repeat

3411

A single buyer can place more than one order.
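
A possible shorthand for the two uniqueness checks above, sketched on a made-up four-row frame (the values in `toy_orders` are hypothetical; only the column names come from the data). pandas exposes `Series.is_unique` and `Series.nunique()`, and a `groupby(...).size()` shows which buyers placed more than one order.

```python
import pandas as pd

# Hypothetical miniature of the orders table
toy_orders = pd.DataFrame({
    "订单编号": [101, 102, 103, 104],
    "买家会员名": [7, 8, 7, 9],
})

order_ids_unique = toy_orders["订单编号"].is_unique   # True when no order number repeats
n_buyers = toy_orders["买家会员名"].nunique()          # number of distinct buyers

# Orders per buyer, then the buyers with more than one order
orders_per_buyer = toy_orders.groupby("买家会员名").size()
repeat_buyers = orders_per_buyer[orders_per_buyer > 1].index.tolist()

print(order_ids_unique, n_buyers, repeat_buyers)
```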

### A first look at items

In [8]:
items.head(2)

Unnamed: 0,订单编号,标题,价格,购买数量,外部系统编号,商品属性,套餐信息,备注,订单状态,商家编码
0,21407300627014900,发光玩具批发光纤手指灯闪光夜市热卖货源儿童玩具地摊义乌厂家,0.58,12,WY013-2SZD0426,颜色分类：小号,,,交易成功,WY013-2SZD0426
1,21407300627014900,特价5号AA普通干电池 电动玩具配件 厂家直销批,1.0,20,HT-5H0094,,,,交易成功,HT-5H0094


In [9]:
items.shape

(21897, 10)

In [10]:
items.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21897 entries, 0 to 21896
Data columns (total 10 columns):
订单编号      21897 non-null int64
标题        21897 non-null object
价格        21897 non-null float64
购买数量      21897 non-null int64
外部系统编号    21897 non-null object
商品属性      12636 non-null object
套餐信息      0 non-null float64
备注        130 non-null object
订单状态      21897 non-null object
商家编码      21897 non-null object
dtypes: float64(2), int64(2), object(6)
memory usage: 1.7+ MB


In [11]:
len(items.订单编号.values)

21897

In [12]:
len(np.unique(items.订单编号.values))  # order numbers repeat here, so this is the "many" side of a one-to-many relation; there are still 3989 distinct values

3989

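So when joining orders and items, which table should be the primary one, and should the join be left, right, or inner?

- After deduplication the join key 订单编号 has 3989 distinct values in both tables. Provided both tables contain the same 3989 order numbers (the merge further down returns all 21897 item rows, which is consistent with that), left, right, and inner joins give the same result, so the choice does not matter.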
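
Equal distinct counts alone do not guarantee that the key sets are identical, so a quick check can help. The sketch below uses two hypothetical toy frames (`orders_toy`, `items_toy`) with the real column names: it compares the key sets directly and confirms that every join type returns the same number of rows when they match.

```python
import pandas as pd

# Hypothetical miniatures: one row per order, several rows per order
orders_toy = pd.DataFrame({"订单编号": [101, 102, 103], "买家会员名": [7, 8, 7]})
items_toy = pd.DataFrame({"订单编号": [101, 101, 102, 103], "标题": ["a", "b", "a", "c"]})

# The join type is irrelevant only if both tables hold the same set of keys
same_keys = set(orders_toy["订单编号"]) == set(items_toy["订单编号"])

# Row count of the merged table for each join type
row_counts = {
    how: len(pd.merge(orders_toy, items_toy, on="订单编号", how=how))
    for how in ("inner", "left", "right", "outer")
}

print(same_keys, row_counts)
```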

In [13]:
len(items.标题.values)

21897

In [14]:
len(np.unique(items.标题.values))  # only 327 distinct titles, heavily repeated

327

### A first look at items_attribute

In [15]:
items_attribute.head(2)

Unnamed: 0,宝贝ID,标题,价格,玩具类型,适用年龄,品牌
0,537396783238,创意新款回力小车惯性坦克 军事儿童玩具模型地摊货源玩具车批发,8.9,塑胶玩具,"3岁,4岁,5岁,6岁",3
1,36286235128,2017热卖大号仿真惯性挖土机儿童益智礼品创意义乌地摊货玩具批发,3.9,其它玩具,"3岁,4岁,5岁,6岁",3


In [16]:
items_attribute.shape

(288, 6)

In [17]:
items_attribute.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 6 columns):
宝贝ID    288 non-null int64
标题      288 non-null object
价格      288 non-null float64
玩具类型    252 non-null object
适用年龄    284 non-null object
品牌      288 non-null int64
dtypes: float64(1), int64(2), object(3)
memory usage: 13.6+ KB


In [18]:
len(np.unique(items_attribute.宝贝ID.values))  # 宝贝ID has no missing values and no duplicates

288

In [19]:
len(np.unique(items_attribute.标题.values))  # 标题 likewise has no missing values and no duplicates

288

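Question: items has 327 distinct titles while items_attribute has 288. Which should be the primary table, and should the join be left, right, or inner?

Start from items and use an inner join, because the recommender needs rows in which both the item ID (宝贝ID) and the user are present.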
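
One way to see what the inner join discards, sketched on toy frames with hypothetical titles and IDs: a left merge with `indicator=True` marks item rows whose 标题 has no match in the attribute table, and those are exactly the rows an inner join drops.

```python
import pandas as pd

# Hypothetical miniatures of items and items_attribute
items_toy = pd.DataFrame({"标题": ["a", "a", "b", "c"], "购买数量": [1, 2, 1, 5]})
attr_toy = pd.DataFrame({"标题": ["a", "b", "d"], "宝贝ID": [11, 12, 14]})

# Left merge keeps every item row; the _merge column records whether a match was found
checked = items_toy.merge(attr_toy, on="标题", how="left", indicator=True)
unmatched_titles = checked.loc[checked["_merge"] == "left_only", "标题"].unique().tolist()

# Rows that survive the inner join used in the notebook
inner_rows = len(items_toy.merge(attr_toy, on="标题", how="inner"))

print(unmatched_titles, inner_rows)
```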

## Merging the data into one table

In [20]:
orders_items = pd.merge(orders,items,on="订单编号")

In [21]:
orders_items.shape

(21897, 55)

In [22]:
orders_items.head(2)

Unnamed: 0,订单编号,买家会员名,买家支付宝账号,买家应付货款,买家应付邮费,买家支付积分,总金额,返点积分,买家实际支付金额,买家实际支付积分,...,预约门店,标题,价格,购买数量,外部系统编号,商品属性,套餐信息,备注,订单状态_y,商家编码
0,21407300627014900,1425,yorzikyA6C,58.51,0.0,0,58.51,0,58.51,0,...,,发光玩具批发光纤手指灯闪光夜市热卖货源儿童玩具地摊义乌厂家,0.58,12,WY013-2SZD0426,颜色分类：小号,,,交易成功,WY013-2SZD0426
1,21407300627014900,1425,yorzikyA6C,58.51,0.0,0,58.51,0,58.51,0,...,,特价5号AA普通干电池 电动玩具配件 厂家直销批,1.0,20,HT-5H0094,,,,交易成功,HT-5H0094


In [23]:
orders_items_props = pd.merge(orders_items,items_attribute,on="标题", how='inner')

In [24]:
orders_items_props.shape

(19943, 60)

In [26]:
len(np.unique(orders_items_props.宝贝ID.values))

269

## Building the interaction matrix

In [29]:
result = orders_items_props.loc[:,["买家会员名","宝贝ID"]]
result["购买次数"] = 0

In [30]:
result.shape

(19943, 3)

In [31]:
result.head()

Unnamed: 0,买家会员名,宝贝ID,购买次数
0,1425,530449665002,0
1,882,530449665002,0
2,882,530449665002,0
3,279,530449665002,0
4,279,530449665002,0


In [32]:
result.groupby(["买家会员名","宝贝ID"]).count()  # the purchase count serves as the score

Unnamed: 0_level_0,Unnamed: 1_level_0,购买次数
买家会员名,宝贝ID,Unnamed: 2_level_1
0,42577833473,1
1,536728628605,1
1,545516801138,1
1,547644315780,1
1,550735773284,1
2,537318544352,1
2,545516801138,1
3,545516801138,1
3,549744152016,1
4,35722333869,1


In [38]:
freq = result.groupby(["买家会员名","宝贝ID"]).count().reset_index()  # reset the index
freq.head()

Unnamed: 0,买家会员名,宝贝ID,购买次数
0,0,42577833473,1
1,1,536728628605,1
2,1,545516801138,1
3,1,547644315780,1
4,1,550735773284,1


In [39]:
freq= freq.pivot_table(index="买家会员名",columns="宝贝ID",values="购买次数", fill_value=0)
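# Fill with 0: the later computation mean-centres the scores, after which the mean is 0, so filling missing values with 0 amounts to filling with the mean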
freq.head()

宝贝ID,35721027449,35721723963,35722000205,35722333869,35722423659,35750823403,35753244214,35754637865,35797606083,35798309577,...,551081926272,551091071810,551091439907,551135995408,551248890640,551625132527,551675713112,551715398940,552007245556,552028581381
买家会员名,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
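
As a rough cross-check of the groupby and pivot steps above, the same kind of matrix can be built on a toy frame in two calls. This is an alternative sketch, not the notebook's code: `groupby(...).size()` counts rows per (buyer, item) pair and `unstack(fill_value=0)` spreads the items into columns. The buyer and item values are made up.

```python
import pandas as pd

# Hypothetical purchase lines: buyer 1 bought item 11 twice
pairs = pd.DataFrame({
    "买家会员名": [1, 1, 1, 2, 3],
    "宝贝ID": [11, 11, 12, 12, 13],
})

# Count rows per (buyer, item), then pivot items into columns, filling gaps with 0
counts = pairs.groupby(["买家会员名", "宝贝ID"]).size()
matrix = counts.unstack(fill_value=0)

print(matrix)
```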


In [40]:
freqMatrix = freq.values
freqMatrix

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [41]:
freqMatrix.shape

(3318, 269)

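## About the freqMatrix interaction matrix

freqMatrix is a two-dimensional array:  
- each row holds one user's scores for every item
- each column holds the scores one item received from every user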

## Computing item similarity and user similarity

In [42]:
from sklearn.metrics.pairwise import cosine_similarity

### User similarity  
The rows must be users.

In [43]:
user_similar = cosine_similarity(freqMatrix)

### Item similarity  
The rows must be items, so freqMatrix has to be transposed.

In [44]:
item_similar = cosine_similarity(freqMatrix.T)

### About user_similar and item_similar

The user similarity is a two-dimensional array. Each entry is the similarity between the two users given by its row and column indices.

In [45]:
user_similar.shape

(3318, 3318)

In [47]:
pd.DataFrame(user_similar).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,3308,3309,3310,3311,3312,3313,3314,3315,3316,3317
0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.316228,0.0,0.0,0.0,0.0
1,0.0,1.0,0.353553,0.353553,0.204124,0.0,0.0,0.0,0.188982,0.0,...,0.0,0.353553,0.25,0.0,0.0,0.158114,0.0,0.0,0.0,0.0
2,0.0,0.353553,1.0,0.5,0.288675,0.0,0.0,0.0,0.267261,0.0,...,0.0,0.5,0.353553,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.353553,0.5,1.0,0.288675,0.0,0.0,0.0,0.267261,0.0,...,0.0,0.5,0.353553,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.204124,0.288675,0.288675,1.0,0.0,0.0,0.235702,0.154303,0.0,...,0.0,0.288675,0.204124,0.0,0.0,0.0,0.0,0.288675,0.0,0.182574
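
The first rows of this output can be reproduced by hand. According to the grouped counts shown earlier, buyer 1 bought four items once each, buyers 2 and 3 bought two items each, and all three share item 545516801138. The sketch below restricts the vectors to the six items involved (an assumption that is safe here because every other column is 0 for these buyers) and applies the cosine formula directly with NumPy.

```python
import numpy as np

# Columns: 536728628605, 545516801138, 547644315780, 550735773284,
#          537318544352, 549744152016
buyers = np.array([
    [1, 1, 1, 1, 0, 0],  # buyer 1
    [0, 1, 0, 0, 1, 0],  # buyer 2
    [0, 1, 0, 0, 0, 1],  # buyer 3
], dtype=float)

# cosine(a, b) = a.b / (|a| * |b|), computed for every pair of rows
norms = np.linalg.norm(buyers, axis=1)
cosine = (buyers @ buyers.T) / np.outer(norms, norms)

print(np.round(cosine, 6))
```

Buyers 1 and 2 share one item out of four and two, giving 1 / (2 * sqrt(2)), about 0.353553. Buyers 2 and 3 share one item out of two each, giving 0.5. Both values match rows 1 to 3 of the output above.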


The item similarity is a two-dimensional array. Each entry is the similarity between the two items given by its row and column indices.

In [48]:
item_similar.shape

(269, 269)

In [50]:
pd.DataFrame(item_similar).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,259,260,261,262,263,264,265,266,267,268
0,1.0,0.10163,0.146662,0.127454,0.0,0.0,0.0,0.095019,0.044076,0.0,...,0.068519,0.0,0.083918,0.0,0.0,0.059339,0.0,0.0,0.0,0.0
1,0.10163,1.0,0.130841,0.088007,0.076338,0.023473,0.053231,0.074796,0.130107,0.128446,...,0.0,0.064931,0.036699,0.031782,0.028427,0.038925,0.055048,0.0,0.0,0.0
2,0.146662,0.130841,1.0,0.093601,0.03393,0.037261,0.014083,0.059366,0.050486,0.069907,...,0.0,0.025768,0.023302,0.0,0.045125,0.012358,0.0,0.0,0.0,0.0
3,0.127454,0.088007,0.093601,1.0,0.033255,0.051127,0.083739,0.054306,0.062977,0.013323,...,0.032634,0.035358,0.039968,0.0,0.030959,0.0,0.0,0.0,0.0,0.0
4,0.0,0.076338,0.03393,0.033255,1.0,0.0,0.02235,0.0,0.036418,0.046225,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


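## Building the recommendation functions

### User-based recommendation

What is the goal here? To estimate a score for each item a user has not bought yet, so that the highest-scoring items can be recommended.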

In [112]:
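freqMatrix   # the interaction matrix
# freqMatrix is a two-dimensional array:
# - each row holds one user's scores for every item
# - each column holds the scores one item received from every user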

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [115]:
freqMatrix.shape

(3318, 269)

In [118]:
# User-based recommendation
def Recommendation_user(uid,iid,similar,k=10):
    """Mean-centred prediction: each neighbour's mean is subtracted from its score."""
    score = 0
    weight = 0
    user_id_action = freqMatrix[uid,:]      # scores of user uid for all items
    item_id_action = freqMatrix[:,iid]      # scores item iid received from all users

    user_id_similar = similar[uid,:]      # similarity of user uid to every user
    similar_index = np.argsort(user_id_similar)[-(k+1):-1]  # indices of the k most similar users (assumes uid itself sorts last and is dropped)
    user_id_i_mean = np.sum(user_id_action)/user_id_action[user_id_action!=0].size
                               # mean of the non-zero scores given by user uid
    for j in similar_index :  # j indexes one of the k nearest users in the similarity matrix similar
        if item_id_action[j]!=0: # score of user j for item iid; skip it if 0,
                                 # otherwise use it
            user_id_j_action = freqMatrix[j,:]  # scores of user j for every item
            user_id_j_mean = np.sum(user_id_j_action)/user_id_j_action[user_id_j_action!=0].size
                                 # mean of user j's scores (unscored items excluded)
            score += user_id_similar[j]*(item_id_action[j]-user_id_j_mean)
                #    similarity(uid, j) * (score of j for iid - mean score of j)
            weight += abs(user_id_similar[j])
                #       absolute similarity between uid and j

    if weight==0:  
        return 0
    else:
        return user_id_i_mean + score/weight
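
Written out, the function above appears to compute the following mean-centred estimate. The notation is introduced here for illustration and is not the author's: $r_{u,i}$ is the entry of freqMatrix for user $u$ and item $i$, $\bar r_u$ is the mean of user $u$'s non-zero entries, $s_{u,j}$ is the cosine similarity between users $u$ and $j$, and $N_k(u)$ is the set of $k$ neighbours selected by the argsort slice.

$$
\hat r_{u,i} = \bar r_u + \frac{\sum_{j \in N_k(u),\; r_{j,i} \neq 0} s_{u,j}\,\bigl(r_{j,i} - \bar r_j\bigr)}{\sum_{j \in N_k(u),\; r_{j,i} \neq 0} \lvert s_{u,j} \rvert}
$$

When no selected neighbour has a non-zero entry for item $i$, the denominator is 0 and the function returns 0 instead.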

### Item-based recommendation (with a small change)

In [119]:
# Item-based recommendation
def Recommendation_item(uid,iid,similar,k=10):
    """Mean-centred prediction: each neighbour's mean is subtracted from its score."""
    score = 0
    weight = 0
    user_id_action = freqMatrix[uid,:]      # scores of user uid for all items
    item_id_action = freqMatrix[:,iid]      # scores item iid received from all users

    item_id_similar = similar[iid,:]      # similarity of item iid to every item
    similar_index = np.argsort(item_id_similar)[-(k+1):-1]  # indices of the k items most similar to iid (assumes iid itself sorts last and is dropped)
    item_id_i_mean = np.sum(item_id_action)/item_id_action[item_id_action!=0].size 
                                # mean of the non-zero scores item iid received
    for j in similar_index :    # j is the index of one of the k items most similar to iid
        if user_id_action[j]!=0: # score of user uid for item j; nothing to do if it is 0,
                                 # otherwise it is used below
            # original code (sliced a row, i.e. a user, instead of an item)
            # item_id_j_action = freqMatrix[j,:]  
            # corrected code
            item_id_j_action = freqMatrix[:,j] # scores item j received from all users
            
            item_id_j_mean = np.sum(item_id_j_action)/item_id_j_action[item_id_j_action!=0].size
                        # mean of the non-zero scores item j received
            score += item_id_similar[j]*(user_id_action[j]-item_id_j_mean)
                    #  similarity(iid, j) * (score of uid for j - mean score of item j)
            weight += abs(item_id_similar[j])
                    #  absolute similarity between items iid and j

    if weight==0:  
        return 0
    else:
        return item_id_i_mean + score/weight
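
A note on the neighbour selection used in both functions: `np.argsort(...)[-(k+1):-1]` drops the last sorted index on the assumption that it is the target itself. When another row ties at similarity 1.0 (the user-similarity output above shows users 0 and 3311 at 1.0), the target may not sort last, so it can occupy a neighbour slot while a genuine neighbour is dropped. The sketch below illustrates this on a hypothetical similarity row and shows one possible alternative that masks the target explicitly. `kind="stable"` is used only to make the toy output deterministic and is not in the original code.

```python
import numpy as np

uid, k = 0, 2
sim_row = np.array([1.0, 0.2, 1.0, 0.5])  # hypothetical: index 2 ties with uid itself

# Slice used in the notebook: assumes uid is the last index after sorting
naive = np.argsort(sim_row, kind="stable")[-(k + 1):-1]

# Alternative: push uid to the bottom first, then keep the k largest
masked = sim_row.copy()
masked[uid] = -np.inf
robust = np.argsort(masked, kind="stable")[-k:]

print(naive.tolist(), robust.tolist())
```

In this toy case the slice keeps index 0 (the target) and loses index 2, while the masked version keeps indices 3 and 2. Because `predict` only scores pairs whose entry is 0, a target that slips into its own neighbour list contributes nothing, so the practical effect is one neighbour fewer rather than a wrong term in the sum.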

### Merging both into one function

In [51]:
# Recommendation that can be user-based or item-based
def Recommendation_s(uid,iid,similar,base,k=10):
    """Mean-centred prediction: each neighbour's mean is subtracted from its score."""
    score = 0
    weight = 0
    user_id_action = freqMatrix[uid,:]      # scores of user uid for all items
    item_id_action = freqMatrix[:,iid]      # scores item iid received from all users
    
    if base =='item':  # item-based recommendation
        item_id_similar = similar[iid,:]      # similarity of item iid to every item
        similar_index = np.argsort(item_id_similar)[-(k+1):-1]  # indices of the k items most similar to iid (assumes iid itself sorts last and is dropped)
        item_id_i_mean = np.sum(item_id_action)/item_id_action[item_id_action!=0].size 
                                                              # mean of the non-zero scores item iid received
        for j in similar_index :            # j is the index of one of the k items most similar to iid
            if user_id_action[j]!=0:# score of user uid for item j; nothing to do if it is 0,
                                    # otherwise it is used below
                item_id_j_action = freqMatrix[:,j] # scores item j received from all users
                item_id_j_mean = np.sum(item_id_j_action)/item_id_j_action[item_id_j_action!=0].size
                                     # mean of the non-zero scores item j received
                score += item_id_similar[j]*(user_id_action[j]-item_id_j_mean)
                  #  similarity(iid, j) * (score of uid for j - mean score of item j)
                weight += abs(item_id_similar[j])
                  #  absolute similarity between items iid and j

        if weight==0:  
            return 0
        else:
            return item_id_i_mean + score/weight
        
    else:
        user_id_similar = similar[uid,:]      # similarity of user uid to every user
        similar_index = np.argsort(user_id_similar)[-(k+1):-1]  # indices of the k users most similar to uid (assumes uid itself sorts last and is dropped)
        user_id_i_mean = np.sum(user_id_action)/user_id_action[user_id_action!=0].size
                        # mean of the non-zero scores given by user uid
        for j in similar_index :  # j indexes one of the k nearest users in the similarity matrix similar
            if item_id_action[j]!=0:  # score of user j for item iid; skip it if 0
                
                user_id_j_action = freqMatrix[j,:]  # scores of user j for every item
                user_id_j_mean = np.sum(user_id_j_action)/user_id_j_action[user_id_j_action!=0].size
                               # mean of the non-zero scores given by user j
                score += user_id_similar[j]*(item_id_action[j]-user_id_j_mean)
                #    similarity(uid, j) * (score of j for iid - mean score of j)
                weight += abs(user_id_similar[j])
                 #       absolute similarity between uid and j

        if weight==0:  
            return 0
        else:
            return user_id_i_mean + score/weight

## Building the prediction function

In [53]:
# Build the prediction function
def predict(similar,base='item'):
    user_cnt = freqMatrix.shape[0]  # number of users (3318)
    item_cnt = freqMatrix.shape[1]  # number of items (269)
    pred = np.zeros((user_cnt,item_cnt))
    for uid in range(user_cnt):
        for iid in range(item_cnt):
            if freqMatrix[uid,iid] == 0:  # only score items the user has not bought
                pred[uid,iid] = Recommendation_s(uid,iid,similar,base)
    return pred
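
To make the arithmetic behind `predict` easier to check, here is a minimal self-contained sketch of the user-based branch on a hypothetical 3x3 matrix. It is not the notebook's function: `cosine_rows`, `predict_user_based`, and the toy values are illustrative names and data, and the neighbour list excludes the target user explicitly instead of relying on the argsort slice.

```python
import numpy as np

def cosine_rows(m):
    """Cosine similarity between every pair of rows (rows must be non-zero)."""
    unit = m / np.linalg.norm(m, axis=1, keepdims=True)
    return unit @ unit.T

def mean_nonzero(row):
    """Mean of the non-zero entries, as in the notebook's functions."""
    return row[row != 0].mean()

def predict_user_based(ratings, sim, uid, iid, k=2):
    """Mean-centred estimate of ratings[uid, iid] from the k nearest users."""
    order = np.argsort(sim[uid])[::-1]                  # most similar first
    neighbours = [j for j in order if j != uid][:k]     # drop the target explicitly
    score = weight = 0.0
    for j in neighbours:
        if ratings[j, iid] != 0:                        # neighbour must have scored the item
            score += sim[uid, j] * (ratings[j, iid] - mean_nonzero(ratings[j]))
            weight += abs(sim[uid, j])
    if weight == 0:
        return 0.0
    return mean_nonzero(ratings[uid]) + score / weight

# Hypothetical scores: 3 users x 3 items, 0 means "not bought"
toy = np.array([[1, 0, 2],
                [1, 2, 2],
                [0, 1, 0]], dtype=float)
toy_sim = cosine_rows(toy)
toy_pred = predict_user_based(toy, toy_sim, uid=0, iid=1)
print(round(toy_pred, 4))
```

Here user 1 is the only neighbour with both a non-zero score for item 1 and a non-zero similarity to user 0, so the estimate is user 0's mean (1.5) plus user 1's deviation from its own mean (2 - 5/3), about 1.83.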

### Item-based collaborative filtering predictions

In [54]:
item_prediction = predict(item_similar,base='item')

In [55]:
pd.DataFrame(item_prediction).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,259,260,261,262,263,264,265,266,267,268
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.891026,0.0,1.458673,1.195238,0.884504,0.0,1.6875,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.980952,0.0,0.0,0.0,0.0,1.0
2,0.891026,0.0,1.458673,0.0,0.884504,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.891026,0.0,1.458673,0.0,0.884504,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.891026,0.0,1.145696,0.0,0.884504,0.0,1.473214,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [56]:
item_prediction.shape

(3318, 269)

### User-based collaborative filtering predictions

In [57]:
user_prediction = predict(user_similar,base='user')

In [58]:
pd.DataFrame(user_prediction).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,259,260,261,262,263,264,265,266,267,268
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.666667,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.666667,0.666667,0.73812,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [59]:
user_prediction.shape

(3318, 269)

## Building the final recommendation function

In [60]:
def get_recom(prediction,k=5):
    recom_df = pd.DataFrame(prediction,columns=freq.columns,index=freq.index)  # label rows with buyers and columns with item IDs
    recom_df = recom_df.stack().reset_index()   # long format: one row per (buyer, item)
    recom_df.rename(columns={0:"推荐指数"},inplace=True)    # name the score column
    grouped = recom_df.groupby("买家会员名")   
    topk = grouped.apply(get_topk,k=k)  # keep the k highest-scoring items per buyer
    topk = topk.drop(["买家会员名"],axis=1)
    topk.index = topk.index.droplevel(1)   
    topk.reset_index(inplace=True)
    return topk

In [61]:
def get_topk(group,k):
    return group.sort_values("推荐指数",ascending=False)[:k]
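
The per-buyer top-k step can also be sketched without `apply`, by sorting once and taking the head of each group. This is an alternative under the assumption that only the k best rows per buyer are needed. The scores in `scores_toy` are made up, and only the column names come from the notebook.

```python
import pandas as pd

# Hypothetical long-format predictions: one row per (buyer, item)
scores_toy = pd.DataFrame({
    "买家会员名": [0, 0, 0, 1, 1, 1],
    "宝贝ID": [11, 12, 13, 11, 12, 13],
    "推荐指数": [0.2, 0.9, 0.5, 0.7, 0.1, 0.3],
})

# Sort by score, keep the first 2 rows of each buyer, then order for display
top2 = (
    scores_toy.sort_values("推荐指数", ascending=False)
    .groupby("买家会员名")
    .head(2)
    .sort_values(["买家会员名", "推荐指数"], ascending=[True, False])
    .reset_index(drop=True)
)

print(top2)
```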

## Final results

### User-based collaborative filtering recommendations

In [62]:
# compute the user similarity matrix
user_similar = cosine_similarity(freqMatrix)
# compute the user-based predictions
user_prediction = predict(user_similar,base='user')
# top-k recommendations
user_recom = get_recom(user_prediction,5)
user_recom

Unnamed: 0,买家会员名,宝贝ID,推荐指数
0,0,527419046969,0.500000
1,0,538658965256,0.428571
2,0,542939108885,0.428571
3,0,547380519834,0.428571
4,0,547306204530,0.428571
5,1,544066720474,1.000000
6,1,35721027449,0.666667
7,1,36074765406,0.666667
8,1,521926312352,0.666667
9,1,536009750573,0.666667


### Item-based collaborative filtering recommendations

In [63]:
# compute the item similarity matrix
item_similar = cosine_similarity(freqMatrix.T)
# compute the item-based predictions
item_prediction = predict(item_similar,base='item')
# top-k recommendations
item_recom = get_recom(item_prediction,5)
item_recom

Unnamed: 0,买家会员名,宝贝ID,推荐指数
0,0,544016559367,2.158102
1,0,537396783238,1.945519
2,0,544115359956,1.830060
3,0,546275765548,0.866250
4,0,550715341924,0.783523
5,1,35753244214,1.687500
6,1,35722000205,1.458673
7,1,520310825412,1.292639
8,1,35722333869,1.195238
9,1,527475911875,1.144947
