基于已打分的电影，算出平均分后，对P90的电影随机推荐高分电影（作对比用途）

导入movielen数据

In [1]:
import pandas as pd
import numpy as np
#import matplotlib.pyplot as plt

import random
from sklearn.model_selection import train_test_split

# Reading ratings file
# Ignore the timestamp column
ratings = pd.read_csv('ratings.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'movie_id', 'rating'])

# Reading users file
users = pd.read_csv('users.csv', sep='\t', encoding='latin-1', usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

# Reading movies file
movies = pd.read_csv('movies.csv', sep='\t', encoding='latin-1', usecols=['movie_id', 'title', 'genres'])

根据ratings的数据数据，groupby movie id，算出每个movie id的rating的平均值，作为一个dataframe备用

In [2]:
a = ratings.groupby('movie_id').rating.sum()/ratings.groupby('movie_id').rating.count()
movie_avg = pd.DataFrame(a.values, index=a.index, columns=['avg'])

两个函数分别为，
一、单纯返回最高评分的若干movie id作为推荐列表。主要问题是所有用户看到的结果是一致的；
二、返回最高评分P90的movie id，经过随机处理再推荐。

In [3]:
def gen_high_rated_list(movie_avg_pd, n=10):
    """
    Return movie id list of most high average rated movies
    """
    return movie_avg_pd.sort_values(by=['avg'], inplace=False, ascending=False).head(n).index.values


def gen_high_rated_p90_list(movie_avg_pd, n=10):
    """
    Return movie id list of most high average rated movies, >= p90
    randomnize to avoid always recommending the same list
    """
    movie_id_list = movie_avg[movie_avg.avg >= movie_avg.avg.quantile(0.9)].index.values
    random.shuffle(movie_id_list)
    
    return movie_id_list[:n]

将数据集以默认1：3的比例以随机的方式分为训练和验证的部分，后续用验证集来验证推荐效果。 具体见， https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [4]:
# split into train & test data
ratings_train, ratings_test = train_test_split(ratings)

通过推荐列表的命中率验证效果，使用验证集来验证推荐结果，训练集用来生成推荐列表。 由于movielen里除了有评价过的电影还有相应的评分，为了贴合实际，认为， 只有命中且评分高于该用户的p80的评分（有些人习惯打高分，其他人反之），才算命中

会执行很久，但可以像下列例子那样仅验证头100个user的推荐的命中率

In [5]:
def hit_ratio_benchmark(ratings_train, ratings_test, rated_movie_limit=10):
    """
    for each user_id
    1. get recommend list using ratings_train rated movies
    2. use ratings_test rated movies to validate hit ratio
    it is considered hit when,
    1. user rated this movie
    2. the rate is >= this user's p80 rate in ratings_train
    """
    hit = 0
    
    #for user_id in np.sort(ratings_test['user_id'].unique()):
    for user_id in np.arange(1, 100):
        recommend_list = gen_high_rated_p90_list(movie_avg)
        #print("%s, %s" % (user_id, recommend_list))
        
        for item in recommend_list:
            x = ratings_test[ratings_test.user_id == user_id][['movie_id', 'rating']]
            if x[x.movie_id == item].empty:
                continue
            elif x[x.movie_id == item].rating.values < np.percentile(x.rating.values, 80):
                continue
            else:
                hit += 1
    
    print(hit)
    hit_ratio = hit / ratings_test.movie_id.count() * 1.0
    return hit_ratio
        
        
hit_ratio = hit_ratio_benchmark(ratings_train, ratings_test)
hit_ratio *= 100
print('hit ratio percentage: %.10f%%' % hit_ratio)

9
hit ratio percentage: 0.0000000000%


NDCG原理见， http://sofasofa.io/forum_main_post.php?postid=1002561

In [6]:
def dcg_at_k(r, k, method=0):
    """Score is discounted cumulative gain (dcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> dcg_at_k(r, 1)
    3.0
    >>> dcg_at_k(r, 1, method=1)
    3.0
    >>> dcg_at_k(r, 2)
    5.0
    >>> dcg_at_k(r, 2, method=1)
    4.2618595071429155
    >>> dcg_at_k(r, 10)
    9.6051177391888114
    >>> dcg_at_k(r, 11)
    9.6051177391888114
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Discounted cumulative gain
    """
    r = np.asfarray(r)[:k]
    if r.size:
        if method == 0:
            return r[0] + np.sum(r[1:] / np.log2(np.arange(2, r.size + 1)))
        elif method == 1:
            return np.sum(r / np.log2(np.arange(2, r.size + 2)))
        else:
            raise ValueError('method must be 0 or 1.')
    return 0.


def ndcg_at_k(r, k, method=0):
    """Score is normalized discounted cumulative gain (ndcg)
    Relevance is positive real values.  Can use binary
    as the previous methods.
    Example from
    http://www.stanford.edu/class/cs276/handouts/EvaluationNew-handout-6-per.pdf
    >>> r = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
    >>> ndcg_at_k(r, 1)
    1.0
    >>> r = [2, 1, 2, 0]
    >>> ndcg_at_k(r, 4)
    0.9203032077642922
    >>> ndcg_at_k(r, 4, method=1)
    0.96519546960144276
    >>> ndcg_at_k([0], 1)
    0.0
    >>> ndcg_at_k([1], 2)
    1.0
    Args:
        r: Relevance scores (list or numpy) in rank order
            (first element is the first item)
        k: Number of results to consider
        method: If 0 then weights are [1.0, 1.0, 0.6309, 0.5, 0.4307, ...]
                If 1 then weights are [1.0, 0.6309, 0.5, 0.4307, ...]
    Returns:
        Normalized discounted cumulative gain
    """
    dcg_max = dcg_at_k(sorted(r, reverse=True), k, method)
    if not dcg_max:
        return 0.
    return dcg_at_k(r, k, method) / dcg_max

对所有验证集的用户的推荐列表，计算ndcg并计算他们的平均值，作为该推荐算法的ndcg分数 由于ndcg计算的是推荐列表的顺序的精确度，因此如果验证集中该用户没有给推荐的电影打分，就认为是打了0分。

In [7]:
def ndcg_benchmark(ratings_train, ratings_test, rated_movie_limit=10):
    """
    for each user_id
    1. get recommend list using ratings_train rated movies
    2. use ratings_test rated movies to validate ndcg value
    if it is not rated, make it zero
    return average ndcg_score for all ratings_test users
    """
    
    ndcg_score, count = (0, 0)
    #for user_id in np.sort(ratings_test['user_id'].unique()):
    for user_id in np.arange(1, 100):
        r = []
        #print('user_id %s' % user_id)
        
        recommend_list = gen_high_rated_p90_list(movie_avg)
        #print(user_id, recommend_list)
        
        for item in recommend_list:
            x = ratings_test[ratings_test.user_id == user_id][['movie_id', 'rating']]
            if x[x.movie_id == item].empty:
                r.append(0)
            else:
                r.append(\
                    np.asscalar(ratings_test[(ratings_test.user_id == user_id) & \
                        (ratings_test.movie_id == item)].rating.values))
    
        ndcg_score += ndcg_at_k(r, len(r))
        count += 1.0

    ndcg_score /= count
    return ndcg_score

print(ndcg_benchmark(ratings_train, ratings_test))


0.09268070152565544


可以看到hit ratio和NDCG的值都偏低。原因包括该验证集并不是由新的推荐算法的产生的，实际生产中更多会通过A/B方式做验证。
同时，推荐最高分p90的结果明显比只推荐最高评分的若干movie id效果要好。