演示协同过滤方法，先来构造协同矩阵。
## 协同矩阵
## 用户协同

In [1]:
import numpy as np
import pandas as pd
import sklearn.externals.joblib as jl

In [2]:
data_root = "./pre"
# load data
dfTrain = jl.load("%s/trainAll.pkl"%data_root)
dfVal = jl.load("%s/valAll.pkl"%data_root)
dfAd = pd.read_csv("%s/ad.csv"%data_root)

In [3]:
dfTrain = dfTrain.drop(['index'],axis=1)
dfVal = dfVal.drop(['index'],axis=1)

In [4]:
dfTrain.head()

Unnamed: 0,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator,day,adID,camgaignID,advertiserID,appID,appPlatform
0,0,170000,,3089,2798058,293,1,1,17,1321,83,10,434,1
1,0,170001,,3089,195578,3659,0,2,17,1321,83,10,434,1
2,0,170014,,3089,1462213,3659,0,3,17,1321,83,10,434,1
3,0,170030,,3089,1985880,5581,1,1,17,1321,83,10,434,1
4,0,170047,,3089,2152167,5581,1,1,17,1321,83,10,434,1


In [5]:
dfAll = pd.concat([dfTrain,dfVal],axis=0,ignore_index=True)

In [6]:
dfAll.head()

Unnamed: 0,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator,day,adID,camgaignID,advertiserID,appID,appPlatform
0,0,170000,,3089,2798058,293,1,1,17,1321,83,10,434,1
1,0,170001,,3089,195578,3659,0,2,17,1321,83,10,434,1
2,0,170014,,3089,1462213,3659,0,3,17,1321,83,10,434,1
3,0,170030,,3089,1985880,5581,1,1,17,1321,83,10,434,1
4,0,170047,,3089,2152167,5581,1,1,17,1321,83,10,434,1


构造协同矩阵

In [17]:
n_users = dfTrain.userID.unique().shape[0]
n_items = dfTrain.creativeID.unique().shape[0]

In [18]:
(n_users,n_items)

(2388864, 5924)

In [19]:
# 稀疏率
dfTrain.shape[0] / (n_users * n_items) 

0.00024132420921936664

从上面我们可以看到，我们用户数远大于物品，下面我们来构建协同矩阵，先来对 user 和 item 进行编码

In [9]:
from sklearn.preprocessing import LabelEncoder

In [10]:
userEncoder = LabelEncoder()
itemEncoder = LabelEncoder()

In [11]:
userEncoder.fit(dfAll.userID)
itemEncoder.fit(dfAll.creativeID)

LabelEncoder()

In [12]:
dfTrain['encode_user_id'] = userEncoder.transform(dfTrain['userID'])
dfTrain['encode_item_id'] = itemEncoder.transform(dfTrain['creativeID'])

In [13]:
dfVal['encode_user_id'] = userEncoder.transform(dfVal['userID'])
dfVal['encode_item_id'] = itemEncoder.transform(dfVal['creativeID'])

In [25]:
# 矩阵太大，我们必须要采用其他方式存储
# train_data_matrix = np.zeros((n_users, n_items))
from scipy import sparse

对于稀疏矩阵，我们有两种存储方案

1. CSR：这个存储稍微复杂点，是一个整体编码方式。它有三个组成：数值、列号和行偏移共同编码。
2. COO：这个存储方式很简单，每个元素用一个三元组表示（行号，列号，数值），只存储有值的元素，缺失值不存储。

coo 方式如下：
例子
```
>>> row  = np.array([0, 3, 1, 0])
>>> col  = np.array([0, 3, 1, 2])
>>> data = np.array([4, 5, 7, 9])
>>> coo_matrix((data, (row, col)), shape=(4, 4)).toarray()
```

In [26]:
label1 = dfTrain[dfTrain['label']==1].reset_index()

In [27]:
# 只有 86954 有评分。其他都是0.
(label1.shape[0],dfTrain.shape[0])

(86954, 3415131)

In [28]:
n_users = dfAll.userID.unique().shape[0]
n_items = dfAll.creativeID.unique().shape[0]

In [29]:
cooMatrix = sparse.coo_matrix((label1['label'].values, 
                         (label1['encode_user_id'].values, label1['encode_item_id'].values)), 
                        shape=(n_users,n_items))

In [30]:
(label1['encode_user_id'].unique().shape[0],label1['encode_item_id'].unique().shape[0])
# 我们会发现有转换行为的item只有 1783 个

(85867, 1783)

In [31]:
cooMatrix.row

array([  63038,  531622, 2231763, ..., 1968127, 1279661, 1962348],
      dtype=int32)

In [32]:
cooMatrix.col

array([1217, 1217, 1217, ..., 1901, 1363,   94], dtype=int32)

In [34]:
(cooMatrix.getrow(63038),cooMatrix.getrow(531622)) 
# 都只有对1147发生了行为，我们计算相似度的时候就认为相似性为1，显然这个是不合理的。

(<1x6315 sparse matrix of type '<class 'numpy.int64'>'
 	with 1 stored elements in Compressed Sparse Row format>,
 <1x6315 sparse matrix of type '<class 'numpy.int64'>'
 	with 1 stored elements in Compressed Sparse Row format>)

下面我们要来计算用户之间的相似性了。

In [35]:
from sklearn.metrics.pairwise import cosine_similarity

In [36]:
user_similarity = cosine_similarity(cooMatrix,dense_output=False)

In [37]:
# user_similarity
user_similarity

<2595627x2595627 sparse matrix of type '<class 'numpy.float64'>'
	with 219588803 stored elements in Compressed Sparse Row format>

csr_matrix形式，按row行来压缩

对于第i行，非0数据列是indices[indptr[i]:indptr[i+1]] 数据是data[indptr[i]:indptr[i+1]]

数据是data[indptr[i]:indptr[i+1]]

In [38]:
user_similarity.indices

array([2563548, 2530032, 2512036, ...,  975346,  446509,  330920],
      dtype=int32)

In [39]:
user_similarity.indptr

array([        0,         0,         0, ..., 219588803, 219588803,
       219588803], dtype=int32)

In [41]:
first_row = 0
for i in user_similarity.indptr:
    if user_similarity.indptr[i] != user_similarity.indptr[i+1]:
        print(i,user_similarity.indptr[i],user_similarity.indptr[i+1])
        first_row = i
        break

21977 1717795 1720998


In [42]:
# 此时第 21977 行有非0的元素
user_similarity.indices[user_similarity.indptr[first_row]:user_similarity.indptr[first_row+1]].shape

(3203,)

In [43]:
user_similarity.data[user_similarity.indptr[first_row]:user_similarity.indptr[first_row+1]]

array([1., 1., 1., ..., 1., 1., 1.])

In [44]:
dfTrain[dfTrain['encode_user_id'] == first_row]

Unnamed: 0,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator,day,adID,camgaignID,advertiserID,appID,appPlatform,encode_user_id,encode_item_id
204267,1,231444,231445.0,1456,23761,7440,0,1,23,3379,411,3,465,1,21977,1403


In [46]:
dfTrain[dfTrain['encode_item_id']==1403][dfTrain['label']==1]['encode_user_id'].unique().shape[0]

  """Entry point for launching an IPython kernel.


3203

通过对上面一行数据的观察。我们发现真是。其中一个item好多用户都发生了行为，然后这些用户也基本上只对这个物品产生了行为，所以产生的相似度好多都是1.

关于上面的稀疏矩阵形式可以参考:[ scipy csr_matrix和csc_matrix函数详解](https://blog.csdn.net/u013010889/article/details/53305595)

现在我们已经有了用户相似度矩阵了，但是不幸的是，我们会发现好多用户没有任何相似的用户，对于这些用户我们能做的只是取平均值来预估其点击率，现在我们来看如何能计算用户评分

1. 我们先取用户u有评分的物品列表（不多）[i1,i2,i3...]
2. 我们再取用户u的相似用户列表[u1,u2,u3...]
3. 接着我们计算用户u对于用户列表[u1,u2,u3...]的贡献 $ <u1,i1,0> = simu_{1,u} * 1, <u1,i1,1> = simu_{1,u}  $ 上面 $ <u1,i1,0> $ 表示分子，即用户u对于计算 $P_{u1,i1}$ 的分子贡献，$<u1,i1,1>$表示用户u对于计算$P_{u1,i1}$ 的分母贡献，此时我们应该注意，分子中是只取用户u有评分的物品列表，但是分母我们是取所有物品。

In [47]:
from collections import defaultdict

In [48]:
dfUser = dfTrain.loc[dfTrain['label']==1].groupby('encode_user_id').apply(lambda df: df['encode_item_id'].unique()).reset_index()

In [51]:
dfUser.columns = ['user_id', 'items']

In [52]:
dfUser.head()

Unnamed: 0,user_id,items
0,18,[4323]
1,28,[4393]
2,58,[364]
3,60,[5370]
4,102,[1435]


In [53]:
dfUser = dfUser.set_index('user_id')

In [54]:
dfUser.head()

Unnamed: 0_level_0,items
user_id,Unnamed: 1_level_1
18,[4323]
28,[4393]
58,[364]
60,[5370]
102,[1435]


In [55]:
dfUser.loc[58].values[0]

array([364])

In [56]:
uniqueUsers = dfTrain.loc[dfTrain['label']==1]['encode_user_id'].unique()

In [57]:
len(uniqueUsers)

85867

In [58]:
%%time
predicts = defaultdict(lambda: 0)
for ui in uniqueUsers:
    # 取出有过点击行为的物品
    items = dfUser.loc[ui].values[0]
    # 接着我应该获取用户ui的相似用户
    simusers = user_similarity.indices[user_similarity.indptr[ui]:user_similarity.indptr[ui+1]]
    sims = user_similarity.data[user_similarity.indptr[ui]:user_similarity.indptr[ui+1]]
    for index, simuser in enumerate(simusers):  
        for item in items:
            key = str(simuser) + "_" + str(item) #+ "_0"
            predicts[key] += sims[index]
        # 此处我们得到了每个用户ui有过评分的物品

CPU times: user 13min 43s, sys: 140 ms, total: 13min 44s
Wall time: 13min 44s


可以说上一步非常非常耗时。。。。。

下面我计算每个用户的分母和，即用户相似度的和

In [59]:
sumSims = defaultdict(lambda: 0)
for ui in uniqueUsers:
    sims = (user_similarity.data[user_similarity.indptr[ui]:user_similarity.indptr[ui+1]])
    if len(sims)>0:
        sumSims[ui] = sum(sims)

In [104]:
final_predicts = defaultdict(lambda: 0)
for key in predicts.keys():
#     print(key,predicts[key])
    uid = int(key.split("_")[0])
    if sumSims[uid] == 0:
        print(key, uid,predicts[key],sumSims[uid])
    final_predicts[key] = predicts[key] / sumSims[uid]

In [61]:
dfVal.head()

Unnamed: 0,label,clickTime,conversionTime,creativeID,userID,positionID,connectionType,telecomsOperator,day,adID,camgaignID,advertiserID,appID,appPlatform,encode_user_id,encode_item_id
0,0,300000,,6198,1972793,5622,1,3,30,831,562,3,465,1,1825696,5950
1,0,300000,,6198,1452891,1465,2,1,30,831,562,3,465,1,1344914,5950
2,0,300000,,6198,1360982,2395,1,1,30,831,562,3,465,1,1259803,5950
3,0,300000,,6198,2343726,3678,2,3,30,831,562,3,465,1,2168714,5950
4,0,300001,,6198,2017555,4867,2,1,30,831,562,3,465,1,1867014,5950


In [105]:
meanP = np.mean(dfTrain["label"])
def lookUpSim(r):
#     print(r)
    key  = str(int(r['encode_user_id'])) + "_" + str(int(r['encode_item_id']))# + "_0"
    if final_predicts.get(key) != 0:
#         print(key)
        return final_predicts[key]
    else:
        return meanP

In [106]:
%%time
pred = dfVal.apply(lookUpSim,axis=1)
# dfVal['key'] = str(dfVal['encode_user_id']) + "_" + str(dfVal['encode_item_id'])

CPU times: user 10 s, sys: 24 ms, total: 10.1 s
Wall time: 10.1 s


In [107]:
pred[pred>100]

Series([], dtype: float64)

In [108]:
from sklearn.metrics import log_loss

In [109]:
log_loss(dfVal['label'].values, pred.values)

0.6570573679602255

我们的基线是0.09 现在用用户协同后效果竟然如此之差。不可思议

## surprise 协同

In [7]:
from surprise import Dataset,SVD,KNNBasic
from surprise.model_selection import cross_validate
# Load the movielens-100k dataset (download it if needed),
data = Dataset.load_builtin('ml-100k')

In [129]:
# We'll use the famous SVD algorithm.
algo = SVD()
# Run 5-fold cross-validation and print results
cross_validate(algo, data,verbose=True)

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9413  0.9389  0.9415  0.9380  0.9334  0.9386  0.0029  
MAE (testset)     0.7418  0.7401  0.7422  0.7402  0.7362  0.7401  0.0021  
Fit time          6.70    5.83    6.37    6.42    5.95    6.25    0.32    
Test time         0.20    0.37    0.18    0.19    0.22    0.23    0.07    


{'fit_time': (6.698886394500732,
  5.828982591629028,
  6.365151643753052,
  6.42438006401062,
  5.9542272090911865),
 'test_mae': array([0.74175544, 0.7401227 , 0.74216876, 0.74015516, 0.73617134]),
 'test_rmse': array([0.94130687, 0.93889919, 0.94152958, 0.9379804 , 0.9334151 ]),
 'test_time': (0.20086240768432617,
  0.37468743324279785,
  0.175858736038208,
  0.18991637229919434,
  0.21747589111328125)}

In [60]:
df = dfTrain[dfTrain['label']>=1][['label','encode_user_id','encode_item_id']]
df.columns = ['rating','userID', 'itemID']
df.head()
df1 = dfTrain[dfTrain['label']==0][['label','encode_user_id','encode_item_id']]
df1.columns = ['rating','userID', 'itemID']
dfa = pd.concat([df.head(),df1.head()],axis=0,ignore_index=True)
dfa

Unnamed: 0,rating,userID,itemID
0,1,63038,1217
1,1,531622,1217
2,1,2231763,1217
3,1,2457770,1217
4,1,1928886,1217
5,0,2589097,2966
6,0,181217,2966
7,0,1353582,2966
8,0,1837785,2966
9,0,1991509,2966


In [53]:
from surprise import Reader

In [94]:
# A reader is still needed but only the rating_scale param is requiered.
reader = Reader(rating_scale=(0, 1))
reader.offset = 0
# The columns must correspond to user id, item id and ratings (in that order).
data = Dataset.load_from_df(df1[['userID', 'itemID', 'rating']], reader)

In [95]:
trainSet.offset = 0
trainSet = data.build_full_trainset()

In [98]:
len(list(trainSet.all_items())), len(list(trainSet.all_users()))

(5910, 2328040)

In [96]:
%%time
sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between users
               }
algo = KNNBasic(sim_options=sim_options)
algo.fit(trainSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.
CPU times: user 3.06 s, sys: 332 ms, total: 3.4 s
Wall time: 3.34 s


  sim = construction_func[name](*args)


下一步我们开始预测

In [100]:
trainSet.global_mean

0.0

In [119]:
def predict_default(r):
    uid = r['encode_user_id']
    iid = r['encode_item_id']
    return algo.predict(uid, iid).est
#     if trainSet.knows_user(uid) and trainSet.knows_item(iid):
#         print(uid, iid)
#         return algo.predict(uid, iid).est
#     else:
#         return trainSet.global_mean

In [124]:

preds = dfVal.apply(predict_default, axis=1)

In [125]:
from sklearn.metrics import log_loss
log_loss(dfVal['label'].values, preds.values)

0.6515327634491246

In [99]:
# for r in trainSet.all_ratings():
#     print(r)

In [90]:
# %%time
# sim_options = {'name': 'cosine',
#                'user_based': False  # compute  similarities between users
#                }
# algo = KNNBasic(sim_options=sim_options)
# cross_validate(algo,data,cv=3, n_jobs=1,verbose=True)

Computing the cosine similarity matrix...


  sim = construction_func[name](*args)


Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNBasic on 3 split(s).

                  Fold 1  Fold 2  Fold 3  Mean    Std     
RMSE (testset)    0.0000  0.0000  0.0000  0.0000  0.0000  
MAE (testset)     0.0000  0.0000  0.0000  0.0000  0.0000  
Fit time          2.86    4.68    4.29    3.94    0.78    
Test time         19.97   19.35   20.15   19.82   0.34    
CPU times: user 1min 53s, sys: 3.41 s, total: 1min 56s
Wall time: 1min 56s


In [21]:
# algo.get_neighbors(1,10)

[23, 979, 0, 2, 3, 4, 5, 6, 7, 8]