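## Recommender Systems

A recommender system suggests relevant items to a user based on the user's preferences, which are inferred from the interaction history.
The main techniques are:
* Collaborative filtering: automatically predicts (filters) a user's interests by collecting preference information from many users (collaborating).
* Content-based filtering: uses only the descriptions and attributes of the items the user has previously interacted with to recommend similar items. Specifically, candidate items are compared with the items the user has previously rated, and the best matches are recommended.
* Hybrid methods: combine the two approaches above and generally perform better than either one alone. They are especially suited to common problems such as cold start and data sparsity.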
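
As a rough illustration of the collaborative filtering idea, the sketch below scores items for one user as a similarity-weighted average of the other users' interaction rows. The `ratings` matrix and the helpers `cosine_sim` and `predict_scores` are hypothetical toy constructs, not part of the dataset or of the models built in this notebook.

```python
import numpy as np

# Hypothetical toy data: rows are users, columns are items,
# values are interaction strengths (0 means no interaction).
ratings = np.array([
    [5.0, 4.0, 0.0, 0.0],   # user 0
    [4.0, 5.0, 3.0, 0.0],   # user 1
    [0.0, 0.0, 4.0, 5.0],   # user 2
])

def cosine_sim(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_scores(user_idx, ratings):
    """Score every item as a similarity-weighted average of the other users' rows."""
    target = ratings[user_idx]
    sims = np.array([cosine_sim(target, other) for other in ratings])
    sims[user_idx] = 0.0  # a user should not vote for themselves
    if sims.sum() == 0:
        return np.zeros(ratings.shape[1])
    return sims @ ratings / sims.sum()

scores = predict_scores(0, ratings)
# Rank only the items user 0 has not interacted with yet
unseen = np.where(ratings[0] == 0)[0]
best_item = int(unseen[np.argmax(scores[unseen])])
print(best_item)
```

User 0 overlaps only with user 1, so the unseen item that user 1 interacted with (item 2) is ranked ahead of item 3.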

## Implementing the three approaches above

In [175]:
import numpy as np
import scipy
import pandas as pd
import math
import random
import sklearn
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
import matplotlib.pyplot as plt

### Loading the data

In [176]:
articles_df = pd.read_csv('./datasets/shared_articles.csv')
articles_df = articles_df[articles_df['eventType'] == 'CONTENT SHARED']

In [177]:
articles_df.head()

Unnamed: 0,timestamp,eventType,contentId,authorPersonId,authorSessionId,authorUserAgent,authorRegion,authorCountry,contentType,url,title,text,lang
1,1459193988,CONTENT SHARED,-4110354420726924665,4340306774493623681,8940341205206233829,,,,HTML,http://www.nytimes.com/2016/03/28/business/dea...,"Ethereum, a Virtual Currency, Enables Transact...",All of this work is still very early. The firs...,en
2,1459194146,CONTENT SHARED,-7292285110016212249,4340306774493623681,8940341205206233829,,,,HTML,http://cointelegraph.com/news/bitcoin-future-w...,Bitcoin Future: When GBPcoin of Branson Wins O...,The alarm clock wakes me at 8:00 with stream o...,en
3,1459194474,CONTENT SHARED,-6151852268067518688,3891637997717104548,-1457532940883382585,,,,HTML,https://cloudplatform.googleblog.com/2016/03/G...,Google Data Center 360° Tour,We're excited to share the Google Data Center ...,en
4,1459194497,CONTENT SHARED,2448026894306402386,4340306774493623681,8940341205206233829,,,,HTML,https://bitcoinmagazine.com/articles/ibm-wants...,"IBM Wants to ""Evolve the Internet"" With Blockc...",The Aite Group projects the blockchain market ...,en
5,1459194522,CONTENT SHARED,-2826566343807132236,4340306774493623681,8940341205206233829,,,,HTML,http://www.coindesk.com/ieee-blockchain-oxford...,IEEE to Talk Blockchain at Cloud Computing Oxf...,One of the largest and oldest organizations fo...,en


In [178]:
interactions_df = pd.read_csv('./datasets/users_interactions.csv')
interactions_df.head(10)

Unnamed: 0,timestamp,eventType,contentId,personId,sessionId,userAgent,userRegion,userCountry
0,1465413032,VIEW,-3499919498720038879,-8845298781299428018,1264196770339959068,,,
1,1465412560,VIEW,8890720798209849691,-1032019229384696495,3621737643587579081,Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2...,NY,US
2,1465416190,VIEW,310515487419366995,-1130272294246983140,2631864456530402479,,,
3,1465413895,FOLLOW,310515487419366995,344280948527967603,-3167637573980064150,,,
4,1465412290,VIEW,-7820640624231356730,-445337111692715325,5611481178424124714,,,
5,1465413742,VIEW,310515487419366995,-8763398617720485024,1395789369402380392,Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebK...,MG,BR
6,1465415950,VIEW,-8864073373672512525,3609194402293569455,1143207167886864524,,,
7,1465415066,VIEW,-1492913151930215984,4254153380739593270,8743229464706506141,Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/53...,SP,BR
8,1465413762,VIEW,310515487419366995,344280948527967603,-3167637573980064150,,,
9,1465413771,VIEW,3064370296170038610,3609194402293569455,1143207167886864524,,,


### Data cleaning

In [179]:
# Map each event type to an interest strength; a comment, for example, signals the strongest interest
event_type_strength = {
    'VIEW':1.0,
    'LIKE':2.0,
    'BOOKMARK':2.5,
    'FOLLOW':3.0,
    'COMMENT CREATED':4.0
}

In [180]:
interactions_df['eventStrength'] = interactions_df['eventType'].apply(lambda x: 
                                                                      event_type_strength[x])

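Cold start in recommender systems: it is hard to make personalized recommendations for new users about whom too little information is available.  
For this reason, only users with at least 5 interactions are kept.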

In [181]:
user_interactions_count_df = interactions_df.groupby(['personId', 'contentId']).size().groupby('personId').size()

In [182]:
# Number of users, after counting the distinct articles each user interacted with
print("# user: %d" % len(user_interactions_count_df))

# user: 1895


In [183]:
# Keep users who interacted with at least 5 distinct articles
users_with_enough_interactions_df = user_interactions_count_df[
    user_interactions_count_df >= 5].reset_index()[['personId']]

In [184]:
print('# users with at least 5 interactions: %d' % len(users_with_enough_interactions_df))

# users with at least 5 interactions: 1140
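
The chained `groupby` used above counts distinct articles per user rather than raw events, which is easy to miss. The sketch below shows the difference on a hypothetical toy log (`toy_df` and its values are made up for illustration); `nunique` is shown as an equivalent formulation.

```python
import pandas as pd

# Hypothetical toy interaction log
toy_df = pd.DataFrame({
    'personId':  [1, 1, 1, 2, 2],
    'contentId': [10, 10, 11, 10, 12],
    'eventType': ['VIEW', 'LIKE', 'VIEW', 'VIEW', 'VIEW'],
})

# Same chained pattern as above: collapse to (user, article) pairs,
# then count the pairs per user
distinct_items = toy_df.groupby(['personId', 'contentId']).size().groupby('personId').size()

# Equivalent formulation with nunique
distinct_items_alt = toy_df.groupby('personId')['contentId'].nunique()

# Raw event counts per user, for comparison
raw_events = toy_df.groupby('personId').size()

print(distinct_items.tolist(), raw_events.tolist())
```

User 1 has three events but only two distinct articles, so a threshold on distinct articles is stricter than one on raw events.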


In [185]:
print("# of interactions: %d" % len(interactions_df))

# Merge the full interaction log with the selected personIds,
# keeping only the records of users with at least 5 interactions
interactions_from_selected_users_df = interactions_df.merge(users_with_enough_interactions_df, 
                                                           how='right',
                                                           left_on = 'personId', 
                                                           right_on='personId')

# of interactions: 72312


In [186]:
interactions_from_selected_users_df.shape

(69868, 9)

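A user may interact with the same article in different ways, for example viewing it several times, liking it, and commenting on it.  
The user's interest in an article is therefore modeled by summing the strengths of all their interactions with it, then applying a log transformation to smooth the distribution.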
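
To get a feel for the effect of the log transform, the sketch below applies `log2(1 + x)` to a few hypothetical interaction histories. The `strengths` values repeat the `event_type_strength` mapping above; the helper name `smoothed_preference` and the example histories are assumptions made for illustration.

```python
import math

# Same strengths as the event_type_strength mapping above
strengths = {'VIEW': 1.0, 'LIKE': 2.0, 'BOOKMARK': 2.5,
             'FOLLOW': 3.0, 'COMMENT CREATED': 4.0}

def smoothed_preference(events):
    """Sum the strengths of a user's events on one article, then apply log2(1 + x)."""
    total = sum(strengths[e] for e in events)
    return math.log(1 + total, 2)

# Hypothetical interaction histories for a single (user, article) pair
single_view = smoothed_preference(['VIEW'])                          # total 1
engaged = smoothed_preference(['VIEW', 'LIKE', 'COMMENT CREATED'])   # total 7
heavy_viewer = smoothed_preference(['VIEW'] * 30)                    # total 30

print(round(single_view, 3), round(engaged, 3), round(heavy_viewer, 3))
```

Thirty views give a raw strength 30 times that of a single view, but a smoothed score of only about 5, so a few very active pairs do not dominate the preference matrix.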

In [187]:
def smooth_user_preference(x):
    return math.log(1 + x, 2)

In [188]:
interactions_full_df = interactions_from_selected_users_df.groupby(['personId', 'contentId']) \
                                ['eventStrength'].sum().apply(smooth_user_preference).reset_index()  
interactions_full_df.shape

(39106, 3)

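## Evaluation

Evaluation is a very important part of machine learning, because it is how different algorithms and their hyperparameters are compared.  
A key aspect of evaluation is making sure, through cross-validation techniques, that the trained model generalizes to data it was not trained on.  
Here a simple approach, holdout, is used: a random 20% of the data is set aside for testing.    
An alternative is to split by time: train on the data before a chosen point in time and use the model to predict the interactions after it.  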
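
The time-based alternative is not used in this notebook, but it could look roughly like the sketch below. The helper `split_by_time` and the toy frame are hypothetical; on the real data the split would have to be applied to the raw interactions, since the aggregated frame no longer carries a `timestamp` column.

```python
import pandas as pd

def split_by_time(interactions, cutoff_timestamp):
    """Hypothetical helper: rows before the cutoff go to train, the rest to test."""
    train = interactions[interactions['timestamp'] < cutoff_timestamp]
    test = interactions[interactions['timestamp'] >= cutoff_timestamp]
    return train, test

# Hypothetical toy interaction log
toy = pd.DataFrame({
    'timestamp': [100, 200, 300, 400, 500],
    'personId':  [1, 2, 1, 2, 1],
    'contentId': [10, 11, 12, 10, 13],
})

# Use the 80th percentile of the timestamps as cutoff,
# so that roughly the latest 20% of events end up in the test set
cutoff = toy['timestamp'].quantile(0.8)
train_df, test_df = split_by_time(toy, cutoff)

print(len(train_df), len(test_df))
```

Unlike the random holdout, this split never lets the model train on events that happened after the ones it is asked to predict, at the cost of possibly leaving some users without any test interactions.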

In [189]:
interactions_train_df, interactions_test_df = train_test_split(
    interactions_full_df,
    stratify=interactions_full_df['personId'],
    test_size=0.2)

In [190]:
interactions_train_df.shape, interactions_test_df.shape

((31284, 3), (7822, 3))

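Recommender systems can be evaluated in several ways. Here top-N accuracy metrics are used: they assess the top recommendations given to a user by comparing them with the items the user actually interacted with in the test set.
The procedure is:
* For each user:
   * For each item the user interacted with in the test set:
       * Sample 100 items the user has never interacted with. This assumes that non-interacted items are irrelevant to the user, which may not be true.
       * Have the recommender model produce a ranked list from the set made of the one interacted item and the 100 non-interacted items.
       * From that ranked list, compute the top-N accuracy metrics for this user and interacted item.
   * Aggregate all the top-N scores (recall).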
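
The procedure above can be sketched on toy data as follows. This is a simplified illustration under stated assumptions, not the evaluator used in this notebook: the ranking, the item ids, the helper `hit_at_n`, and the sample size of 4 (instead of 100) are all hypothetical.

```python
import random

def hit_at_n(ranked_items, target_item, candidate_items, n):
    """Return 1 if target_item is among the first n ranked items restricted to candidate_items."""
    filtered = [item for item in ranked_items if item in candidate_items]
    return int(target_item in filtered[:n])

# Hypothetical model output: a ranking over 10 items, best first
ranked_items = ['i3', 'i7', 'i1', 'i9', 'i4', 'i0', 'i2', 'i8', 'i5', 'i6']
test_items = ['i1', 'i5']   # items the user interacted with in the test set
non_interacted = sorted(set(ranked_items) - set(test_items))

rng = random.Random(0)
hits = 0
for target in test_items:
    # Sample non-interacted items (4 here instead of 100 to keep the toy small)
    negatives = set(rng.sample(non_interacted, 4))
    candidates = negatives | {target}
    hits += hit_at_n(ranked_items, target, candidates, n=3)

# Recall@3: share of test items ranked in the top 3 among their sampled candidates
recall_at_3 = hits / len(test_items)
print(recall_at_3)
```

In this toy ranking `i1` has only two items ahead of it, so it always lands in the top 3, while `i5` always has at least three sampled items ranked above it, which gives a recall@3 of 0.5 whatever the seed.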

In [199]:
#Indexing by personId to speed up the searches during evaluation
interactions_full_indexed_df = interactions_full_df.set_index('personId')
interactions_train_indexed_df = interactions_train_df.set_index('personId')
interactions_test_indexed_df = interactions_test_df.set_index('personId')

In [200]:
interactions_test_indexed_df.shape, interactions_train_indexed_df.shape

((7822, 2), (31284, 2))

In [208]:
def get_items_interacted(person_id, interactions_df):
    # Return the set of contentIds this user has interacted with.
    interacted_items = interactions_df.loc[person_id]['contentId']
    # .loc yields a Series for several rows but a scalar for a single row
    return set(interacted_items if isinstance(interacted_items, pd.Series) else [interacted_items])

In [209]:
EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS = 100

class ModelEvaluator:


    def get_not_interacted_items_sample(self, person_id, sample_size, seed=42):
        interacted_items = get_items_interacted(person_id, interactions_full_indexed_df)
        all_items = set(articles_df['contentId'])
        non_interacted_items = all_items - interacted_items

        random.seed(seed)
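        # random.sample needs a sequence (sets are rejected since Python 3.11);
        # sorting also makes the sample reproducible for a given seed
        non_interacted_items_sample = random.sample(sorted(non_interacted_items), sample_size)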
        return set(non_interacted_items_sample)

    def _verify_hit_top_n(self, item_id, recommended_items, topn):
        # Position of item_id in the recommendation list, or -1 if it is absent
        try:
            index = next(i for i, c in enumerate(recommended_items) if c == item_id)
        except StopIteration:
            index = -1
        hit = int(index in range(0, topn))
        return hit, index

    def evaluate_model_for_user(self, model, person_id):
        #Getting the items in test set
        interacted_values_testset = interactions_test_indexed_df.loc[person_id]
        if type(interacted_values_testset['contentId']) == pd.Series:
            person_interacted_items_testset = set(interacted_values_testset['contentId'])
        else:
            person_interacted_items_testset = set([int(interacted_values_testset['contentId'])])  
        interacted_items_count_testset = len(person_interacted_items_testset) 

        #Getting a ranked recommendation list from a model for a given user
        person_recs_df = model.recommend_items(person_id,
                                               items_to_ignore=get_items_interacted(
                                                   person_id,interactions_train_indexed_df), 
                                               topn=10000000000)

        hits_at_5_count = 0
        hits_at_10_count = 0
        #For each item the user has interacted in test set
        for item_id in person_interacted_items_testset:
            #Getting a random sample (100) items the user has not interacted 
            #(to represent items that are assumed to be no relevant to the user)
            non_interacted_items_sample = self.get_not_interacted_items_sample(person_id, 
                                                                          sample_size=EVAL_RANDOM_SAMPLE_NON_INTERACTED_ITEMS, 
                                                                          seed=item_id%(2**32))

            #Combining the current interacted item with the 100 random items
            items_to_filter_recs = non_interacted_items_sample.union(set([item_id]))

            #Filtering only recommendations that are either the interacted item or from a random sample of 100 non-interacted items
            valid_recs_df = person_recs_df[person_recs_df['contentId'].isin(items_to_filter_recs)]                    
            valid_recs = valid_recs_df['contentId'].values
            #Verifying if the current interacted item is among the Top-N recommended items
            hit_at_5, index_at_5 = self._verify_hit_top_n(item_id, valid_recs, 5)
            hits_at_5_count += hit_at_5
            hit_at_10, index_at_10 = self._verify_hit_top_n(item_id, valid_recs, 10)
            hits_at_10_count += hit_at_10

        #Recall is the rate of the interacted items that are ranked among the Top-N recommended items, 
        #when mixed with a set of non-relevant items
        recall_at_5 = hits_at_5_count / float(interacted_items_count_testset)
        recall_at_10 = hits_at_10_count / float(interacted_items_count_testset)

        person_metrics = {'hits@5_count':hits_at_5_count, 
                          'hits@10_count':hits_at_10_count, 
                          'interacted_count': interacted_items_count_testset,
                          'recall@5': recall_at_5,
                          'recall@10': recall_at_10}
        return person_metrics

    def evaluate_model(self, model):
        #print('Running evaluation for users')
        people_metrics = []
        for idx, person_id in enumerate(list(interactions_test_indexed_df.index.unique().values)):
            #if idx % 100 == 0 and idx > 0:
            #    print('%d users processed' % idx)
            person_metrics = self.evaluate_model_for_user(model, person_id)  
            person_metrics['_person_id'] = person_id
            people_metrics.append(person_metrics)
        print('%d users processed' % idx)

        detailed_results_df = pd.DataFrame(people_metrics) \
                            .sort_values('interacted_count', ascending=False)
        
        global_recall_at_5 = detailed_results_df['hits@5_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        global_recall_at_10 = detailed_results_df['hits@10_count'].sum() / float(detailed_results_df['interacted_count'].sum())
        
        global_metrics = {'modelName': model.get_model_name(),
                          'recall@5': global_recall_at_5,
                          'recall@10': global_recall_at_10}    
        return global_metrics, detailed_results_df
    
model_evaluator = ModelEvaluator()

In [210]:
# Compute the most popular items
item_popularity_df = interactions_full_df.groupby('contentId')['eventStrength'].sum().sort_values(
    ascending=False).reset_index()
item_popularity_df.head(10)

Unnamed: 0,contentId,eventStrength
0,-4029704725707465084,307.733799
1,-6783772548752091658,233.762157
2,-133139342397538859,228.024567
3,-8208801367848627943,197.107608
4,-6843047699859121724,193.825208
5,8224860111193157980,189.04468
6,-2358756719610361882,183.110951
7,2581138407738454418,180.282876
8,7507067965574797372,179.094002
9,1469580151036142903,170.548969


In [211]:
class PopularityRecommender:
    
    MODEL_NAME = 'Popularity'
    
    def __init__(self, popularity_df, items_df=None):
        self.popularity_df = popularity_df
        self.items_df = items_df
        
    def get_model_name(self):
        return self.MODEL_NAME
        
    def recommend_items(self, user_id, items_to_ignore=[], topn=10, verbose=False):
        # Recommend the more popular items that the user hasn't seen yet.
        recommendations_df = self.popularity_df[~self.popularity_df['contentId'].isin(items_to_ignore)] \
                               .sort_values('eventStrength', ascending = False) \
                               .head(topn)

        if verbose:
            if self.items_df is None:
                raise Exception('"items_df" is required in verbose mode')

            recommendations_df = recommendations_df.merge(self.items_df, how = 'left', 
                                                          left_on = 'contentId', 
                                                          right_on = 'contentId')[['eventStrength', 'contentId', 'title', 'url', 'lang']]


        return recommendations_df
    
popularity_model = PopularityRecommender(item_popularity_df, articles_df)

In [212]:
print('Evaluating Popularity recommendation model...')
pop_global_metrics, pop_detailed_results_df = model_evaluator.evaluate_model(popularity_model)

Evaluating Popularity recommendation model...
1139 users processed


In [214]:
print('\nGlobal metrics:\n%s' % pop_global_metrics)
pop_detailed_results_df.head(10)


Global metrics:
{'modelName': 'Popularity', 'recall@5': 0.24009204806954743, 'recall@10': 0.37074916901048327}


Unnamed: 0,_person_id,hits@10_count,hits@5_count,interacted_count,recall@10,recall@5
61,3609194402293569455,56,28,192,0.291667,0.145833
49,-2626634673110551643,24,14,134,0.179104,0.104478
106,-1032019229384696495,28,12,130,0.215385,0.092308
165,-1443636648652872475,16,7,117,0.136752,0.059829
5,-2979881261169775358,37,21,88,0.420455,0.238636
6,-3596626804281480007,10,7,80,0.125,0.0875
41,1116121227607581999,34,20,73,0.465753,0.273973
52,-9016528795238256703,20,12,69,0.289855,0.173913
59,692689608292948411,22,16,69,0.318841,0.231884
37,3636910968448833585,34,26,68,0.5,0.382353
