# Building a Collaborative-Filtering Recommendation Engine in Python

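This article looks at the different types of recommender systems, how they are implemented commercially, and how they work internally. At the end we build our own recommendation engine to find GitHub repositories that suit us.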

## 1. Collaborative filtering

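Collaborative filtering is one of the earliest and best-known recommendation algorithms. Its main job is to predict and recommend. By mining users' historical behavior, the algorithm discovers their preferences, groups users by those preferences, and recommends items liked by users with similar tastes. Collaborative filtering comes in two flavors: user-based collaborative filtering and item-based collaborative filtering. Put simply, birds of a feather flock together, and the same holds for items. The sections below explain the principle and implementation of each.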

### 1.1 User-based collaborative filtering

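User-based collaborative filtering uses users' historical behavior (purchases, favorites, comments, shares) to discover which products or content they like, and quantifies those preferences as scores. The relationship between users is then computed from their attitudes toward the same products or content, and products are recommended among users with similar preferences. For example, if users A and B both bought books x, y and z and gave each a 5-star review, A and B belong to the same group of users, so book w, which A has read, can be recommended to B.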

Let's see how this works in practice with an example.

Suppose we have customers A through D and a set of products they have rated on a scale from 0 to 5.

| |西游记|封神榜|白蛇传|天龙八部|射雕英雄传|神雕侠侣|铁齿铜牙纪晓岚|
|-|-|-|-|-|-|-|-|
|A|4| |5|3|5| | |
|B| |4| |4| |5| |
|C|2| |2| |1| | |
|D| |5| |3| |5|4|

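When we want to find things that are similar, we can use cosine similarity. Here we look for the customers most similar to user A. Because the vectors are sparse and contain many unrated items, we need a default value for the missing entries; here we fill in 0. We start by comparing user A with user B.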

In [1]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),\
                  np.array([0,4,0,4,0,5,0]).reshape(1,-1))

array([[ 0.18353259]])

The two are not very similar, since they have few ratings in common. Now let's compare user C with user A.

In [3]:
cosine_similarity(np.array([4,0,5,3,5,0,0]).reshape(1,-1),\
                  np.array([2,0,2,0,1,0,0]).reshape(1,-1))

array([[ 0.88527041]])

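Here the similarity is very high (1 would be a perfect match), even though the two users rated the same products quite differently. Why is it so high? The problem is our choice of 0 for unrated products: it registers as strong (negative) agreement on those products. In this setting, 0 is not neutral.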
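To make the numbers above less opaque, here is a minimal sketch that computes cosine similarity by hand with NumPy instead of calling `cosine_similarity`. The helper name `cosine` is my own; the vectors are the zero-filled ratings of A, B and C from the table.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity of two 1-D vectors: dot(u, v) / (|u| * |v|)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Ratings of customers A, B and C, with missing ratings filled with 0
a = [4, 0, 5, 3, 5, 0, 0]
b = [0, 4, 0, 4, 0, 5, 0]
c = [2, 0, 2, 0, 1, 0, 0]

sim_ab = cosine(a, b)  # A and B share only one rated product
sim_ac = cosine(a, c)  # every product C rated was also rated by A
print(round(sim_ab, 4), round(sim_ac, 4))
```

Only positions where both vectors are non-zero contribute to the dot product, which is why the zero fill quietly shapes the result.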

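So how do we fix this?

Instead of simply filling missing values with 0, we can re-center each user's ratings so that their mean becomes 0, that is, neutral: subtract the user's mean rating from each of that user's ratings.

This gives the table below. Note that each user's row sums to 0 (up to rounding).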

| |西游记|封神榜|白蛇传|天龙八部|射雕英雄传|神雕侠侣|铁齿铜牙纪晓岚|
|-|-|-|-|-|-|-|-|
|A|-0.25| |0.75|-1.25|0.75| | |
|B| |-0.33| |-0.33| |0.66| |
|C|0.33| |0.33| |-0.66| | |
|D| |0.75| |-1.25| |0.75|-0.25|
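The centering step can be sketched with NumPy as follows. This is an illustration under my own assumptions (missing ratings encoded as `np.nan`, variable names invented here), not the code the article used to build the table. It rounds rather than truncates, so 2/3 prints as 0.67 where the table shows 0.66.

```python
import numpy as np

nan = np.nan
# Rows: customers A-D; columns: the seven products. NaN marks a missing rating.
ratings = np.array([
    [4, nan, 5, 3, 5, nan, nan],      # A
    [nan, 4, nan, 4, nan, 5, nan],    # B
    [2, nan, 2, nan, 1, nan, nan],    # C
    [nan, 5, nan, 3, nan, 5, 4],      # D
])

# Mean of each row over the rated products only
row_means = np.nanmean(ratings, axis=1, keepdims=True)
centered = ratings - row_means               # missing entries stay NaN
centered_filled = np.nan_to_num(centered)    # NaN -> 0, which is now a neutral value

print(np.round(centered_filled, 2))
```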

Now let's try cosine similarity on the new data set.

First, compare user A with user B:

In [4]:
cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0])\
                  .reshape(1,-1),\
                  np.array([0,-.33,0,-.33,0,.66,0])\
                  .reshape(1,-1))

array([[ 0.30772873]])

Next, compare user A with user C:

In [5]:
cosine_similarity(np.array([-.25,0,.75,-1.25,.75,0,0])\
                  .reshape(1,-1),\
                  np.array([.33,0,.33,0,-.66,0,0])\
                  .reshape(1,-1))

array([[-0.24618298]])

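The similarity between A and B has increased slightly, while the similarity between A and C has dropped sharply. This is exactly what we wanted.

Besides helping with missing values, centering has other benefits: for example, it handles raters of differing strictness, since every rater's mean is now 0. Note that this formula is equivalent to the Pearson correlation coefficient, whose values lie between -1 and 1.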
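The link to the Pearson coefficient can be checked numerically. The sketch below uses two hypothetical raters who scored the same five items, so the equivalence is exact; with zero-filled missing ratings, as in the tables above, the centered cosine is only an approximation of the Pearson coefficient computed over co-rated items.

```python
import numpy as np

def centered_cosine(u, v):
    """Cosine similarity after subtracting each vector's own mean."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    uc = u - u.mean()
    vc = v - v.mean()
    return float(uc @ vc / (np.linalg.norm(uc) * np.linalg.norm(vc)))

# Hypothetical ratings: a strict rater and a lenient rater with similar tastes
strict = [1, 2, 2, 3, 4]
lenient = [3, 4, 4, 5, 5]

centered = centered_cosine(strict, lenient)
pearson = float(np.corrcoef(strict, lenient)[0, 1])
print(centered, pearson)
```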

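We now use this framework to predict product ratings. We restrict the example to three users, X, Y and Z, and predict the rating X would give to a product that X has not yet rated but that Y and Z, who are both very similar to X, have rated.

The raw ratings of each user are: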

| |西游记|封神榜|白蛇传|天龙八部|射雕英雄传|神雕侠侣|铁齿铜牙纪晓岚|
|-|-|-|-|-|-|-|-|
|X| |4| |3| |4| |
|Y| |3.5| |2.5| |4|4|
|Z| |4| |3.5| |4.5|4.5|

Next we center these ratings, as shown below:

| |西游记|封神榜|白蛇传|天龙八部|射雕英雄传|神雕侠侣|铁齿铜牙纪晓岚|
|-|-|-|-|-|-|-|-|
|X| |0.33| |-0.66| |0.33|?|
|Y| |0| |-1| |0.5|0.5|
|Z| |-0.125| |-0.625| |0.375|0.375|

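We now want to know how user X would rate 铁齿铜牙纪晓岚. We can use the cosine similarity of the centered ratings as weights and take a weighted average of the ratings given by users Y and Z.

First, compute the similarity between user Y and user X: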

In [6]:
user_x = [0,.33,0,-.66,0,.33,0]
user_y = [0,0,0,-1,0,.5,.5]

cosine_similarity(np.array(user_x).reshape(1,-1),\
                  np.array(user_y).reshape(1,-1))

array([[ 0.83333333]])

Then compute the similarity between user Z and user X:

In [7]:
user_x = [0,.33,0,-.66,0,.33,0]
user_z = [0,-.125,0,-.625,0,.375,.375]

cosine_similarity(np.array(user_x).reshape(1,-1),\
                  np.array(user_z).reshape(1,-1))

array([[ 0.73854895]])

Now we weight each user's rating by that user's similarity to X and divide by the total similarity.

(0.83333333 x 4 + 0.73854895 x 4.5) / (0.83333333 + 0.73854895) = 4.23

So the predicted rating of user X for 铁齿铜牙纪晓岚 is 4.23.
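The whole prediction can also be sketched end to end from the raw ratings. The helper names (`center`, `cosine`) and the `neighbours` dictionary are my own. Because the sketch centers with full-precision means rather than the rounded values typed in by hand, its similarities need not match the cells above digit for digit.

```python
import numpy as np

def center(row):
    """Subtract the mean of the rated items; unrated items (NaN) become 0."""
    row = np.asarray(row, dtype=float)
    return np.nan_to_num(row - np.nanmean(row))

def cosine(u, v):
    """Cosine similarity of two 1-D NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

nan = np.nan
x = [nan, 4, nan, 3, nan, 4, nan]          # target user; the last item is unrated
neighbours = {
    "Y": [nan, 3.5, nan, 2.5, nan, 4, 4],
    "Z": [nan, 4, nan, 3.5, nan, 4.5, 4.5],
}
target_item = 6  # column index of the item to predict

# Weight of each neighbour = centered cosine similarity to X
weights = {name: cosine(center(x), center(r)) for name, r in neighbours.items()}

# Similarity-weighted average of the neighbours' raw ratings of the target item
prediction = (sum(weights[n] * neighbours[n][target_item] for n in neighbours)
              / sum(weights.values()))
print(round(prediction, 2))
```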

### 1.2 Item-based collaborative filtering

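Item-based collaborative filtering is much like the user-based variant, with the roles of items and users swapped. It derives relationships between items from the ratings that different users give to different items, and recommends similar items to users based on those relationships. The ratings again represent users' attitudes and preferences. Put simply, if user A bought both item 1 and item 2, the two items are likely related; when user B also buys item 1, we can infer that B may want item 2 as well.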

Let's see how this works with an example.

This time we look at users' ratings of songs. Each column is a user and each row is a song.

| |U1|U2|U3|U4|U5|
|-|-|-|-|-|-|
|S1|2| |4| |5|
|S2| |3| |3| |
|S3|1| |5| |4|
|S4| |4|4|4| |
|S5|3| | | |5|

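Now suppose we want to know how user U3 would rate song S5. Instead of looking for similar users, we look for similar songs based on how users rated them.

First, center the ratings of each song and compute the cosine similarity between every other song and the target song (S5).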

| |U1|U2|U3|U4|U5|
|-|-|-|-|-|-|
|S1|-1.66| |0.33| |1.33|
|S2| |0| |0| |
|S3|-2.33| |1.66| |0.66|
|S4| |0|0|0| |
|S5|-1| |?| |1|

In [8]:
s1 = [-1.66,0.0,.33,0.0,1.33]
s5 = [-1.0,0.0,0.0,0.0,1.0]

cosine_similarity(np.array(s1).reshape(1,-1),\
                  np.array(s5).reshape(1,-1))

array([[ 0.98221439]])

In [9]:
s2 = [0.0,0.0,0.0,0.0,0.0]
s5 = [-1.0,0.0,0.0,0.0,1.0]

cosine_similarity(np.array(s2).reshape(1,-1),\
                  np.array(s5).reshape(1,-1))

array([[ 0.]])

In [10]:
s3 = [-2.33,0.0,1.66,0.0,0.66]
s5 = [-1.0,0.0,0.0,0.0,1.0]

cosine_similarity(np.array(s3).reshape(1,-1),\
                  np.array(s5).reshape(1,-1))

array([[ 0.72011198]])

In [11]:
s4 = [0.0,0.0,0.0,0.0,0.0]
s5 = [-1.0,0.0,0.0,0.0,1.0]

cosine_similarity(np.array(s4).reshape(1,-1),\
                  np.array(s5).reshape(1,-1))

array([[ 0.]])

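Next we choose a number k, the number of nearest neighbors used to predict U3's rating of the song. Here we take k=2.

From the cosine similarities computed above, S1 and S3 are the songs most similar to S5, so we use U3's ratings of these two songs.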

(0.98 x 4 + 0.72 x 5) / (0.98 + 0.72) = 4.42

So item-based collaborative filtering suggests that U3 would probably give song S5 a high rating of 4.42.

Now let's implement the procedure above in code.

In [12]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [13]:
df = pd.DataFrame({'U1':[2,None,1,None,3], 'U2':[None,3,None,4,None], 'U3':[4,None,5,4,None], 'U4':[None,3,None,4,None], 'U5':[5,None,4,None,5]})

In [14]:
df.index = ['S1', 'S2', 'S3', 'S4', 'S5']
df

| |U1|U2|U3|U4|U5|
|-|-|-|-|-|-|
|S1|2.0|NaN|4.0|NaN|5.0|
|S2|NaN|3.0|NaN|3.0|NaN|
|S3|1.0|NaN|5.0|NaN|4.0|
|S4|NaN|4.0|4.0|4.0|NaN|
|S5|3.0|NaN|NaN|NaN|5.0|


In [15]:
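def get_sim(ratings, target_user, target_item, k=2):
    """Predict target_user's rating of target_item from the k most similar items."""
    # Center each item's ratings (row-wise) so that the item mean becomes 0
    centered_ratings = ratings.apply(lambda x: x - x.mean(), axis=1)
    # After centering, missing ratings are filled with 0 (neutral)
    target_vec = np.nan_to_num(centered_ratings.loc[target_item,:].values).reshape(1,-1)
    csim_list = []
    for i in centered_ratings.index:
        item_vec = np.nan_to_num(centered_ratings.loc[i,:].values).reshape(1,-1)
        csim_list.append(cosine_similarity(item_vec, target_vec).item())
    new_ratings = pd.DataFrame({'similarity': csim_list, 'rating': ratings[target_user]}, index=ratings.index)
    # Keep only items the target user has rated, then take the k most similar ones
    top = new_ratings.dropna().sort_values('similarity', ascending=False)[:k].copy()
    top['multiple'] = top['rating'] * top['similarity']
    # Similarity-weighted average of the user's ratings
    result = top['multiple'].sum() / top['similarity'].sum()
    return result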

In [16]:
get_sim(df, 'U3', 'S5', 2)

4.4232320023615763
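As an alternative sketch, the same prediction can be made with NumPy alone by computing all item-to-item similarities in one matrix product. The names `item_sim` and `predict` are mine, and rows and columns are addressed by position (songs S1 to S5 are rows 0 to 4, users U1 to U5 are columns 0 to 4). It assumes the selected neighbours have a non-zero total similarity.

```python
import numpy as np

nan = np.nan
# Rows: songs S1-S5; columns: users U1-U5 (same data as the table above)
ratings = np.array([
    [2, nan, 4, nan, 5],
    [nan, 3, nan, 3, nan],
    [1, nan, 5, nan, 4],
    [nan, 4, 4, 4, nan],
    [3, nan, nan, nan, 5],
])

# Center each song's ratings and fill missing values with 0
centered = np.nan_to_num(ratings - np.nanmean(ratings, axis=1, keepdims=True))
norms = np.linalg.norm(centered, axis=1, keepdims=True)
# A song whose centered vector is all zeros gets similarity 0 instead of a division by zero
unit = centered / np.where(norms == 0, 1.0, norms)
item_sim = unit @ unit.T   # item_sim[i, j] = centered cosine of songs i and j

def predict(user, item, k=2):
    """Similarity-weighted average of the user's ratings of the k most similar songs."""
    rated = np.flatnonzero(~np.isnan(ratings[:, user]))
    rated = rated[rated != item]
    top = rated[np.argsort(item_sim[item, rated])[::-1][:k]]
    w = item_sim[item, top]
    return float(w @ ratings[top, user] / w.sum())

pred = predict(user=2, item=4)   # U3's predicted rating of S5
print(round(pred, 2))
```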

## 2. Content-based recommendation

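A content-based recommender assumes that users like items whose content resembles items they have already shown interest in. For example, if you watched Harry Potter I, a content-based algorithm finds that Harry Potter II-VI are closely related in content (they share many keywords) to what you watched, and recommends them. This approach avoids the item cold-start problem: an item nobody has interacted with is rarely recommended by other algorithms, whereas a content-based algorithm can still recommend it by analyzing how items relate to one another.

One drawback is that recommendations may be repetitive. News is the typical case: after you read one story about MH370, the recommended stories may say much the same as what you have already read. Another drawback is that for multimedia (music, films, images and so on) content features are hard to extract, which makes recommendation difficult; one workaround is to tag these items by hand.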
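As a rough illustration of the keyword idea, the sketch below scores items by the overlap of hand-assigned keyword tags. The Jaccard index is my own choice of measure, and the items and tags are hypothetical; the article does not prescribe a particular metric.

```python
def jaccard(a, b):
    """Share of keywords two items have in common (0 = none, 1 = identical sets)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical items with hand-assigned keyword tags
items = {
    "Harry Potter I": {"wizard", "magic", "school", "fantasy"},
    "Harry Potter II": {"wizard", "magic", "school", "fantasy", "basilisk"},
    "MH370 news report": {"aviation", "news", "search"},
}

def most_similar(title, catalogue):
    """Return the other item whose keywords overlap most with `title`."""
    others = [t for t in catalogue if t != title]
    return max(others, key=lambda t: jaccard(catalogue[title], catalogue[t]))

recommended = most_similar("Harry Potter I", items)
print(recommended)
```

No user ratings are involved, which is why an item that nobody has interacted with can still be recommended.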

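## 3. Building the recommendation engine

Now we use the GitHub API to build a recommendation engine based on collaborative filtering. The plan is to fetch all the repositories we have starred and the owners of those repositories, then fetch every repository those owners have starred. By comparing starred repositories we can find the users most similar to us, and finally use the repositories they have starred (and we have not) to generate a set of recommendations.

First we need to create a token for the API at [https://github.com/settings/tokens](https://github.com/settings/tokens).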

In [40]:
myun = 'jingsupo'  # GitHub user name
mypw = '<personal-access-token>'  # GitHub personal access token (redacted; never publish a real token)
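Rather than typing the token into the notebook, one option is to read the credentials from environment variables. This is a minimal sketch: the variable names `GITHUB_USER` and `GITHUB_TOKEN` and the helper `load_github_credentials` are assumptions of mine, not something GitHub or the article prescribes.

```python
import os

def load_github_credentials(env=os.environ):
    """Return (username, token) read from GITHUB_USER / GITHUB_TOKEN, or raise if unset."""
    try:
        return env["GITHUB_USER"], env["GITHUB_TOKEN"]
    except KeyError as missing:
        raise RuntimeError(f"environment variable {missing} is not set") from None

# In the notebook this could replace the hard-coded assignment:
# myun, mypw = load_github_credentials()
```

The `env` parameter exists only so the function can be exercised with a plain dictionary instead of the real environment.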

Create a function that fetches the repositories we have starred.

In [18]:
import pandas as pd
import numpy as np
import requests
import json

In [19]:
my_starred_repos = []
def get_starred_by_me():
    """Collect the html_url of every repository I have starred into my_starred_repos."""
    resp_list = []
    # First page of results
    last_resp = requests.get('https://api.github.com/user/starred', auth=(myun,mypw))
    resp_list.append(last_resp.json())

    # Follow the rel="next" link of each response until no pages remain
    while last_resp.links.get('next'):
        last_resp = requests.get(last_resp.links['next']['url'], auth=(myun,mypw))
        resp_list.append(last_resp.json())

    for lis in resp_list:
        for dic in lis:
            my_starred_repos.append(dic['html_url'])

In [20]:
get_starred_by_me()

In [21]:
my_starred_repos

['https://github.com/facebook/prophet',
 'https://github.com/WillKoehrsen/Data-Analysis',
 'https://github.com/wepe/MachineLearning',
 'https://github.com/pydata/pandas-datareader',
 'https://github.com/toddmotto/public-apis',
 'https://github.com/chinese-poetry/chinese-poetry',
 'https://github.com/SegmentFault/deploy-robot',
 'https://github.com/ethan-funny/explore-python',
 'https://github.com/RJT1990/pyflux',
 'https://github.com/plotly/dash',
 'https://github.com/pypa/pipenv',
 'https://github.com/Theano/Theano',
 'https://github.com/h5bp/Front-end-Developer-Interview-Questions',
 'https://github.com/keras-team/keras',
 'https://github.com/madhug-nadig/Machine-Learning-Algorithms-from-Scratch',
 'https://github.com/kennethreitz/records',
 'https://github.com/jakevdp/sklearn_tutorial',
 'https://github.com/jakevdp/PythonDataScienceHandbook',
 'https://github.com/rhiever/Data-Analysis-and-Machine-Learning-Projects',
 'https://github.com/justmarkham/scikit-learn-videos',
 'https://gi

In [22]:
len(my_starred_repos)

155
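The pagination loop in `get_starred_by_me` can be factored into a reusable helper. The sketch below is one way to do it: `fetch_all_pages` and `FakeResponse` are names I made up, and the helper takes the HTTP call as a parameter so it can be tried offline with stand-in responses.

```python
def fetch_all_pages(first_url, get):
    """Follow rel="next" pagination links and return all items as one list.

    `get` is any callable that takes a URL and returns a response object with
    `.json()` and `.links`, as `requests` responses provide.
    """
    items = []
    url = first_url
    while url:
        resp = get(url)
        items.extend(resp.json())
        url = resp.links.get("next", {}).get("url")
    return items


# Offline stand-in for two pages of API results (hypothetical data)
class FakeResponse:
    def __init__(self, payload, links):
        self._payload = payload
        self.links = links

    def json(self):
        return self._payload

pages = {
    "page1": FakeResponse([{"html_url": "https://github.com/a/x"}], {"next": {"url": "page2"}}),
    "page2": FakeResponse([{"html_url": "https://github.com/b/y"}], {}),
}
urls = [repo["html_url"] for repo in fetch_all_pages("page1", pages.__getitem__)]
print(urls)
```

Against the real API, `get` could be something like `lambda u: requests.get(u, auth=(myun, mypw))`.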

Parse the owner's user name out of each starred repository, so that we can retrieve the repositories those users have starred.

In [23]:
my_starred_users = []
for ln in my_starred_repos:
    right_split = ln.split('.com/')[1]
    starred_usr = right_split.split('/')[0]
    my_starred_users.append(starred_usr)

In [24]:
my_starred_users

['facebook',
 'WillKoehrsen',
 'wepe',
 'pydata',
 'toddmotto',
 'chinese-poetry',
 'SegmentFault',
 'ethan-funny',
 'RJT1990',
 'plotly',
 'pypa',
 'Theano',
 'h5bp',
 'keras-team',
 'madhug-nadig',
 'kennethreitz',
 'jakevdp',
 'jakevdp',
 'rhiever',
 'justmarkham',
 'PacktPublishing',
 'kailashahirwar',
 'mwaskom',
 'ParhamP',
 'hangsz',
 'statsmodels',
 'ecomfe',
 'yunjey',
 'jvns',
 'BrambleXu',
 'wesm',
 'quandl',
 'waditu',
 'ogrisel',
 'rasbt',
 'rasbt',
 'rasbt',
 'rasbt',
 'jupyter',
 'donnemartin',
 'josephmisiti',
 'xianhu',
 'mame',
 'numpy',
 'scipy',
 'matplotlib',
 'scikit-learn',
 'nltk',
 'pandas-dev',
 'younghz',
 'younghz',
 'rmax',
 'Studio3T',
 'uglide',
 'aosabook',
 'scrapy',
 'norvig',
 'gabrielecirulli',
 'tesseract-ocr',
 'tesseract-ocr',
 'binux',
 'scrapy',
 'apache',
 'apache',
 'apache',
 'cloudera',
 'casperjs',
 'SeleniumHQ',
 'ariya',
 'python',
 'rainyear',
 'loverajoel',
 'chengcxy',
 'alibaba',
 'lealife',
 'pytorch',
 'pytorch',
 'ryanjay0',
 'Roch

In [25]:
len(my_starred_users)

155

In [26]:
len(set(my_starred_users))  # some users repeat because I starred several of their repositories

122
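The string splitting above works for these URLs. A slightly more defensive sketch uses the standard library's URL parser and counts how often each owner appears; `repo_owner` and the three sample URLs (taken from the list above) are for illustration only.

```python
from collections import Counter
from urllib.parse import urlparse

def repo_owner(html_url):
    """Return the owner segment of a https://github.com/<owner>/<repo> URL."""
    path = urlparse(html_url).path          # e.g. '/facebook/prophet'
    return path.strip("/").split("/")[0]

sample = [
    "https://github.com/facebook/prophet",
    "https://github.com/jakevdp/sklearn_tutorial",
    "https://github.com/jakevdp/PythonDataScienceHandbook",
]
owner_counts = Counter(repo_owner(u) for u in sample)
print(owner_counts)
```

The number of distinct owners is then `len(owner_counts)`, which plays the same role as `len(set(my_starred_users))`.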

Retrieve the repositories they have starred.

In [27]:
starred_repos = {k:[] for k in set(my_starred_users)}
def get_starred_by_user(user_name):
    """Append the html_url of every repository starred by user_name to starred_repos[user_name]."""
    starred_resp_list = []
    # First page of results
    last_resp = requests.get('https://api.github.com/users/'+ user_name +'/starred', auth=(myun,mypw))
    starred_resp_list.append(last_resp.json())

    # Follow the rel="next" link of each response until no pages remain
    while last_resp.links.get('next'):
        last_resp = requests.get(last_resp.links['next']['url'], auth=(myun,mypw))
        starred_resp_list.append(last_resp.json())

    for lis in starred_resp_list:
        for dic in lis:
            starred_repos[user_name].append(dic['html_url'])

Call the function above for each user. This can take several minutes (this run took 22 minutes).

In [28]:
for usr in list(set(my_starred_users)):
    try:
        get_starred_by_user(usr)
    except Exception as e:  # report the failure and continue with the next user
        print('failed for user', usr, e)

Now we need to build a feature set covering all the starred repositories.

In [29]:
repo_vocab = [item for sl in list(starred_repos.values()) for item in sl]

Since several users may have starred the same repository, we convert the list to a set to remove duplicates.

In [30]:
repo_set = list(set(repo_vocab))

In [31]:
len(repo_set)

13576

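We now have the complete feature set, or repository vocabulary. For every user we create a binary vector over the repositories: 1 if the user starred the repository, 0 otherwise.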

In [32]:
all_usr_vector = []
for k,v in starred_repos.items():
    starred = set(v)  # set lookup is O(1); testing membership in the list is much slower
    usr_vector = [1 if url in starred else 0 for url in repo_set]
    all_usr_vector.append(usr_vector)

In [33]:
len(all_usr_vector)

122

We now have 13,576 items (repositories), 122 users, and a binary vector for each user.
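The same user-by-repository matrix can be built by filling a NumPy array through a column index. This is a sketch on hypothetical miniature data; `build_star_matrix` is my own name, and the repositories are sorted so that the column order is reproducible, which a plain `set` does not guarantee.

```python
import numpy as np

def build_star_matrix(starred):
    """starred: dict mapping user -> list of repo URLs. Returns (users, repos, matrix)."""
    users = list(starred)
    repos = sorted({url for urls in starred.values() for url in urls})
    col = {url: j for j, url in enumerate(repos)}      # repo URL -> column index
    matrix = np.zeros((len(users), len(repos)), dtype=np.int8)
    for i, user in enumerate(users):
        for url in starred[user]:
            matrix[i, col[url]] = 1
    return users, repos, matrix

# Hypothetical miniature data set; carol has starred nothing
demo = {"alice": ["r/a", "r/b"], "bob": ["r/b", "r/c"], "carol": []}
users, repos, m = build_star_matrix(demo)
print(repos)
print(m)
```

The work grows with the number of stars rather than with users times repositories, which matters at 122 users by 13,576 repositories.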

In [35]:
df = pd.DataFrame(all_usr_vector, columns=repo_set, index=starred_repos.keys())
df

| |https://github.com/0rpc/zerorpc-python|https://github.com/toddmotto/angular-tesla-range-calculator|https://github.com/kamyu104/LeetCode|https://github.com/CamDavidsonPilon/demographica|https://github.com/tpeng/mrjobcc|https://github.com/twitter/fatcache|https://github.com/mcroydon/opencv_playground|https://github.com/lukego/blog|https://github.com/nachocab/clickme|https://github.com/sickill/stderred|...|https://github.com/va1en0k/django_settings_template|https://github.com/mame/flagir|https://github.com/timgrossmann/InstaPy|https://github.com/dgilland/pydash|https://github.com/aurapm/aura|https://github.com/nigma/django-easy-pjax|https://github.com/dask/zict|https://github.com/linkedin/hopscotch|https://github.com/johnmyleswhite/SimpleAintEasy|https://github.com/saulpw/visidata|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|loverajoel|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|FutunnOpen|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|keras-team|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|numpy|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|alibaba|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|gabrielecirulli|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|scikit-learn|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|robbyrussell|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|ogrisel|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|justmarkham|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


In [36]:
df.to_csv('github_starred_repos.csv', encoding='utf-8')

To compare ourselves with the other users, we need to add our own row.

In [38]:
my_repo_comp = []
for i in df.columns:
    if i in my_starred_repos:
        my_repo_comp.append(1)
    else:
        my_repo_comp.append(0)

In [41]:
mrc = pd.Series(my_repo_comp).to_frame(myun).T
mrc

| |0|1|2|3|4|5|6|7|8|9|...|13566|13567|13568|13569|13570|13571|13572|13573|13574|13575|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|jingsupo|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


In [42]:
mrc.columns = df.columns

In [43]:
fdf = pd.concat([df, mrc])
fdf

| |https://github.com/0rpc/zerorpc-python|https://github.com/toddmotto/angular-tesla-range-calculator|https://github.com/kamyu104/LeetCode|https://github.com/CamDavidsonPilon/demographica|https://github.com/tpeng/mrjobcc|https://github.com/twitter/fatcache|https://github.com/mcroydon/opencv_playground|https://github.com/lukego/blog|https://github.com/nachocab/clickme|https://github.com/sickill/stderred|...|https://github.com/va1en0k/django_settings_template|https://github.com/mame/flagir|https://github.com/timgrossmann/InstaPy|https://github.com/dgilland/pydash|https://github.com/aurapm/aura|https://github.com/nigma/django-easy-pjax|https://github.com/dask/zict|https://github.com/linkedin/hopscotch|https://github.com/johnmyleswhite/SimpleAintEasy|https://github.com/saulpw/visidata|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|loverajoel|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|FutunnOpen|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|keras-team|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|numpy|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|alibaba|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|gabrielecirulli|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|scikit-learn|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|robbyrussell|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|ogrisel|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|
|justmarkham|0|0|0|0|0|0|0|0|0|0|...|0|0|0|0|0|0|0|0|0|0|


Now we compute the similarity between ourselves and the other users. This time we use the pearsonr function, imported from scipy.

In [44]:
from scipy.stats import pearsonr

In [45]:
sim_score = {}
for i in range(len(fdf)):
    ss = pearsonr(fdf.iloc[-1,:], fdf.iloc[i,:])
    sim_score.update({i: ss[0]})


In [46]:
sf = pd.Series(sim_score).to_frame('similarity')
sf

| |similarity|
|-|-|
|0|0.002301|
|1|0.081274|
|2|NaN|
|3|NaN|
|4|NaN|
|5|0.006759|
|6|NaN|
|7|0.001918|
|8|0.017995|
|9|0.056781|


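What we did above is compare the last vector in the DataFrame with every other vector and produce the centered cosine similarity (Pearson correlation coefficient). Where the similarity is missing, the other user's vector is most likely constant (for example all zeros, because the account has starred nothing), in which case the correlation is undefined.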
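The row-by-row loop can also be written as a single matrix operation. The sketch below is my own NumPy formulation, not the scipy call used above: it centers every row, takes dot products with the target row, and reports NaN for rows with zero variance, where the correlation is undefined. The small matrix is hypothetical.

```python
import numpy as np

def pearson_to_row(matrix, target=-1):
    """Pearson correlation of every row of `matrix` with the row at index `target`."""
    m = np.asarray(matrix, dtype=float)
    centered = m - m.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(centered, axis=1)
    dots = centered @ centered[target]
    denom = norms * norms[target]
    # A constant row has zero norm; its correlation is undefined, so it stays NaN
    out = np.full(len(m), np.nan)
    np.divide(dots, denom, out=out, where=denom > 0)
    return out

demo = np.array([
    [1, 1, 0, 0],   # same stars as the last row
    [0, 0, 0, 0],   # no stars at all: a constant row
    [0, 0, 1, 1],   # the opposite of the last row
    [1, 1, 0, 0],   # "me"
])
scores = pearson_to_row(demo)
print(scores)
```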

Next we sort the scores to get the index numbers of the most similar users.

In [47]:
sf.sort_values('similarity', ascending=False)

| |similarity|
|-|-|
|122|1.000000|
|114|0.169804|
|83|0.105841|
|99|0.092054|
|93|0.092024|
|48|0.091798|
|111|0.088406|
|94|0.086590|
|10|0.083429|
|1|0.081274|


Apart from the first row with a score of 1 (that is ourselves), the three closest matches are users 114, 83 and 99.

In [48]:
fdf.index[114]

'lining0806'

In [50]:
fdf.index[83]

'9miao'

In [51]:
fdf.index[99]

'mbakker7'

Let's check in code which repositories these users have starred.

In [52]:
fdf.iloc[114,:][fdf.iloc[114,:]==1]

https://github.com/dmlc/xgboost                                           1
https://github.com/luongvo209/Begin-Latex-in-minutes                      1
https://github.com/PaddlePaddle/Paddle                                    1
https://github.com/keras-team/keras                                       1
https://github.com/jasondavies/d3-cloud                                   1
https://github.com/Blankj/AndroidUtilCode                                 1
https://github.com/AnthonyCalandra/modern-cpp-features                    1
https://github.com/ujjwalkarn/Machine-Learning-Tutorials                  1
https://github.com/xitu/tensorflow-docs                                   1
https://github.com/jrjohansson/scientific-python-lectures                 1
https://github.com/whtsky/WeRoBot                                         1
https://github.com/zeeshanu/learn-regex                                   1
https://github.com/YadiraF/GAN                                            1
https://gith

In [53]:
fdf.iloc[83,:][fdf.iloc[83,:]==1]

https://github.com/9miao/G-Firefly    1
https://github.com/9miao/CrossApp     1
https://github.com/9miao/Firefly      1
Name: 9miao, dtype: int64

In [54]:
fdf.iloc[99,:][fdf.iloc[99,:]==1]

https://github.com/mbakker7/exploratory_computing_with_python    1
Name: mbakker7, dtype: int64

Create a DataFrame holding the starred repositories of me and the three similar users.

In [55]:
all_recs = fdf.iloc[[114,83,99,122],:][fdf.iloc[[114,83,99,122],:]==1].fillna(0).T

In [56]:
all_recs

| |lining0806|9miao|mbakker7|jingsupo|
|-|-|-|-|-|
|https://github.com/0rpc/zerorpc-python|0.0|0.0|0.0|0.0|
|https://github.com/toddmotto/angular-tesla-range-calculator|0.0|0.0|0.0|0.0|
|https://github.com/kamyu104/LeetCode|0.0|0.0|0.0|0.0|
|https://github.com/CamDavidsonPilon/demographica|0.0|0.0|0.0|0.0|
|https://github.com/tpeng/mrjobcc|0.0|0.0|0.0|0.0|
|https://github.com/twitter/fatcache|0.0|0.0|0.0|0.0|
|https://github.com/mcroydon/opencv_playground|0.0|0.0|0.0|0.0|
|https://github.com/lukego/blog|0.0|0.0|0.0|0.0|
|https://github.com/nachocab/clickme|0.0|0.0|0.0|0.0|
|https://github.com/sickill/stderred|0.0|0.0|0.0|0.0|


Check whether there is any repository that all of us have starred.

In [57]:
all_recs[(all_recs==1).all(axis=1)]

| |lining0806|9miao|mbakker7|jingsupo|
|-|-|-|-|-|


See which repositories the others have starred that I have not.

In [58]:
str_recs_tmp = all_recs[all_recs[myun]==0].copy()
str_recs = str_recs_tmp.iloc[:,:-1].copy()
str_recs

| |lining0806|9miao|mbakker7|
|-|-|-|-|
|https://github.com/0rpc/zerorpc-python|0.0|0.0|0.0|
|https://github.com/toddmotto/angular-tesla-range-calculator|0.0|0.0|0.0|
|https://github.com/kamyu104/LeetCode|0.0|0.0|0.0|
|https://github.com/CamDavidsonPilon/demographica|0.0|0.0|0.0|
|https://github.com/tpeng/mrjobcc|0.0|0.0|0.0|
|https://github.com/twitter/fatcache|0.0|0.0|0.0|
|https://github.com/mcroydon/opencv_playground|0.0|0.0|0.0|
|https://github.com/lukego/blog|0.0|0.0|0.0|
|https://github.com/nachocab/clickme|0.0|0.0|0.0|
|https://github.com/sickill/stderred|0.0|0.0|0.0|


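Then check whether any repository was starred by at least two of them.

In [60]:
str_recs[str_recs.sum(axis=1)>1]

| |lining0806|9miao|mbakker7|
|-|-|-|-|

So far, it seems no repository has been starred by two of these users.

We used collaborative filtering to generate recommendations, and then applied some extra filtering through aggregation. (No actual recommendations came out, because most of the repositories I starred belong to organizations rather than individual users, and those accounts have few starred repositories of their own.) A further way to improve the results would be to add content-based filtering.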
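The aggregation step could also be turned into a ranked list. In this sketch each repository I have not starred is scored by the summed similarity of the neighbours who starred it. The function `recommend`, the repository names and the similarity values are all hypothetical, and this scoring rule is one reasonable choice rather than the method used above.

```python
def recommend(my_repos, neighbours, top_n=5):
    """Rank repos I have not starred by the summed similarity of the users who starred them.

    neighbours: list of (similarity, starred_repo_list) pairs for the most similar users.
    """
    mine = set(my_repos)
    scores = {}
    for similarity, repos in neighbours:
        for repo in set(repos) - mine:
            scores[repo] = scores.get(repo, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]

# Hypothetical data: my stars, and two similar users with their stars
mine = ["org/keras"]
similar_users = [
    (0.17, ["org/keras", "org/xgboost", "org/paddle"]),
    (0.11, ["org/xgboost", "org/firefly"]),
]
recs = recommend(mine, similar_users)
print([repo for repo, _ in recs])
```

A repository starred by several similar users rises to the top, which is the same idea as checking for repositories with more than one star among the neighbours.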