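Tags: a tag is a piece of categorical information that identifies the class of an item or a user. It describes an attribute of the item or user and stands for an abstract set of things that share that attribute.
Tags come from two sources: subjective annotation by users, and annotation tools (data labeling, keyword extraction).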

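Applications
- Feature representation: tags serve as item features, and the recommender system can compute item-to-item similarity from them
- Tags can be used to classify items and users
- The tags a user assigns to items reflect some of that user's preferences
- Tags are therefore the link between users and items
- Item recall (item filtering): for example, some items can be filtered out by tag
- User filtering (for example, removing malicious users during training)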
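
As a rough illustration of the first and fifth points, the sketch below compares items by the overlap of their tag sets (Jaccard similarity) and filters items by tag. Jaccard is only one possible similarity measure, chosen here for simplicity; the function names and the toy `item_tags` data are made up for this example.

```python
def jaccard_similarity(tags_a, tags_b):
    """Similarity of two items as the overlap ratio of their tag sets."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0  # two untagged items: treat as not similar
    return len(a & b) / len(a | b)


def filter_by_tag(item_tags, blocked_tag):
    """Return the ids of items that do NOT carry the blocked tag."""
    return [item for item, tags in item_tags.items() if blocked_tag not in tags]


# Hypothetical toy data: item id -> set of tags
item_tags = {
    'phone_a': {'smartphone', 'large battery', 'dual camera'},
    'phone_b': {'smartphone', 'large battery', 'notch screen'},
    'novel_c': {'fiction', 'paperback'},
}

print(jaccard_similarity(item_tags['phone_a'], item_tags['phone_b']))
print(jaccard_similarity(item_tags['phone_a'], item_tags['novel_c']))
print(filter_by_tag(item_tags, 'fiction'))
```

The two phones share two of four distinct tags, so they score higher than a phone paired with the novel, which shares none.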

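How tags are obtained:
- Manual annotation (based on the user's subjective judgment)
- Tools: data labeling and keyword extraction
- Keywords: a condensed summary of what a short text conveys; they directly reflect the attributes or features the short text expresses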




For an item's short text description, first extract the keywords from the text and then use them as the item's tags.

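jieba
jieba.cut(sentence, cut_all=False, HMM=True): cut_all selects full mode (True) or accurate mode (False, the default); HMM toggles the HMM model used to discover out-of-vocabulary words
jieba.cut_for_search(sentence): search-engine-style segmentation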

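A custom dictionary can be loaded (jieba.load_userdict), and the dictionary can be adjusted at runtime (jieba.add_word, jieba.del_word, jieba.suggest_freq).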

Keywords can be extracted with the TF-IDF algorithm or with the TextRank method.
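
For reference, the TF-IDF weighting used by the TF_IDF class in this section can be written as follows. The symbols are notation introduced here for illustration: c(w, d) is the number of times word w occurs in document d, |d| is the number of tokens in d after stop-word removal, N is the number of documents, and df(w) is the number of documents that contain w.

```latex
\mathrm{tf}(w, d) = \frac{c(w, d)}{|d|}, \qquad
\mathrm{idf}(w) = \log \frac{N}{1 + \mathrm{df}(w)}, \qquad
\mathrm{tfidf}(w, d) = \mathrm{tf}(w, d) \cdot \mathrm{idf}(w)
```

The 1 added in the denominator avoids division by zero. A side effect is that a word occurring in every document gets a slightly negative IDF, since \log(N / (1 + N)) < 0.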

In [10]:
import jieba
import os
from collections import defaultdict
import math

In [26]:
class TF_IDF(object):

    def __init__(self, file_path='../../data/ch6/id_title.txt',
                 stop_words_path='../../data/ch6/stop_words.txt'):
        """Initialize.
        @params:
            file_path: path to a tab-separated file of id and short text
            stop_words_path: path to the stop-word list, one word per line
        """
        self.stop_words = self.get_stop_words(stop_words_path)
        self.data = self.load_process_data(file_path)
        self.IDF = self.get_IDF(self.data)

    def get_stop_words(self, stop_words_path):
        """Load the stop words.
        @params:
            stop_words_path: str
        @return:
            stops: set of stop words
        """
        print('Loading stop words')
        with open(stop_words_path, 'r', encoding='utf-8') as f:
            # A set makes the membership test during filtering O(1)
            return {line.strip() for line in f}

    def load_process_data(self, file_path):
        """Load and preprocess the data: segment the text and drop stop words.
        @params:
            file_path: str
        @return:
            data: dict, id -> list of tokens
        """
        print('Loading and preprocessing data')
        dic = {}
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                id_, text = line.strip().split('\t')
                dic[id_] = []
                # Accurate mode segmentation on the text with spaces removed
                for word in jieba.cut(text.replace(' ', ''), cut_all=False):
                    if word not in self.stop_words:
                        dic[id_].append(word)
        return dic

    def get_single_TF(self, words):
        """Compute the TF of every word in one short text.
        @params:
            words: list of tokens
        @return:
            TF: dict, word -> term frequency
        """
        print('Computing TF')
        TF = defaultdict(int)
        for word in words:
            TF[word] += 1 / len(words)
        return TF

    def get_IDF(self, all_words):
        """Compute the IDF of every word.
        @params:
            all_words: dict, id -> list of tokens of that text
        @return:
            IDF: dict, word -> IDF
        """
        print('Computing IDF')
        count = defaultdict(int)  # word -> number of texts containing it
        total = 0                 # number of texts
        for id_, words in all_words.items():
            total += 1
            for word in set(words):
                count[word] += 1
        IDF = {}
        for word, n in count.items():
            # The +1 keeps the denominator away from zero
            IDF[word] = math.log(total / (1 + n))
        return IDF

    def get_TFIDF(self, id_):
        """Compute the TF-IDF of every word in the text with the given id.
        """
        print('Computing TF-IDF')
        words = self.data[id_]
        TF = self.get_single_TF(words)
        TFIDF = {}
        for word in set(words):
            TFIDF[word] = TF[word] * self.IDF[word]
        return TFIDF


In [27]:
ifidf = TF_IDF()
# for id_ in ifidf.data.keys():
#     print(ifidf.get_TFIDF(id_))
print(ifidf.get_TFIDF('5594'))

Loading stop words
Loading and preprocessing data
Computing IDF
Computing TF-IDF
Computing TF
{'红米': 0.15809333207382342, '1200': 0.30740662117616135, '万双': 0.30740662117616135, '异形': 0.30740662117616135, '4000mAh': 0.30740662117616135, '小米': 0.10758201510963047, '电池': 0.2736178621671477, '超大': 0.24964435612949923, '6Pro': 0.30740662117616135, '摄': 0.30740662117616135, '屏': 0.10758201510963047, '后置': 0.2736178621671477}


In [31]:
import jieba.analyse
print(jieba.analyse.extract_tags(''.join(ifidf.data['5594']),topK=20,withWeight=True))
print(jieba.analyse.textrank(''.join(ifidf.data['5594']),topK=20,withWeight=True))

[('万双', 1.22912397395), ('红米', 1.19547675029), ('6Pro', 1.19547675029), ('1200', 1.19547675029), ('4000mAh', 1.19547675029), ('后置', 0.987532596124), ('异形', 0.982314020807), ('超大', 0.941204128224), ('小米', 0.9164479203579999), ('电池', 0.751886163457)]
[('异形', 1.0), ('超大', 0.6695435672241968), ('电池', 0.6654493036167798), ('小米', 0.4993681156439996), ('后置', 0.4993681156439996)]
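
The two rankings above differ because TextRank scores a word by its position in a co-occurrence graph of the text itself rather than by corpus statistics. The sketch below shows that idea in a simplified form. It is not jieba's implementation (which, among other things, also filters by part of speech); the function name `textrank_keywords`, the window size, the damping factor and the iteration count are all assumptions made for this illustration.

```python
from collections import defaultdict


def textrank_keywords(words, window=2, damping=0.85, iterations=50):
    """Rank tokens with a simplified TextRank over a co-occurrence graph."""
    # Undirected weighted graph: the weight of an edge is the number of times
    # the two words co-occur within `window` consecutive tokens.
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            other = words[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0

    # PageRank-style iteration: each word passes its score to its neighbours
    # in proportion to the edge weights.
    scores = {word: 1.0 for word in graph}
    for _ in range(iterations):
        new_scores = {}
        for word in graph:
            rank = 0.0
            for neighbour, weight in graph[word].items():
                out_weight = sum(graph[neighbour].values())
                rank += weight / out_weight * scores[neighbour]
            new_scores[word] = (1 - damping) + damping * rank
        scores = new_scores
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)


# Hypothetical token list; in practice this would be the output of jieba.cut
tokens = ['battery', 'screen', 'battery', 'camera', 'battery', 'price']
ranked = textrank_keywords(tokens)
print(ranked)
```

In this toy input every other word is adjacent to 'battery', so it becomes the hub of the graph and ranks first.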


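Compute the relationship between users and tags, that is, how much a user prefers each tag.

Compute how much the user depends on each tag.

Multiplying preference by dependence gives the user's interest in the tag.

Both factors need to be large. Preference computed alone is subject to chance (the items the user tagged may simply happen to have high ratings), and dependence alone is not enough either (a user who applies a tag many times but rates those items low is not actually interested in it).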
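
A small numeric sketch of this argument, using made-up numbers: one tag was used once on a highly rated item, another was used often on items with slightly lower ratings. The function name `tag_interest` and all values are hypothetical; dependence is measured as TF * IDF, as in the TagBasedRec class in this section.

```python
import math


def tag_interest(ratings, user_total_tags, tag_global_count, global_total):
    """Interest = mean rating of the tagged items * TF * IDF of the tag."""
    preference = sum(ratings) / len(ratings)
    tf = len(ratings) / user_total_tags                  # share of the user's taggings
    idf = math.log(global_total / (tag_global_count + 1))
    return preference * tf * idf


# The user made 10 taggings in total; both tags are equally rare globally.
rock = tag_interest(ratings=[5.0], user_total_tags=10,
                    tag_global_count=99, global_total=1000)
jazz = tag_interest(ratings=[4.0] * 6, user_total_tags=10,
                    tag_global_count=99, global_total=1000)
print(rock, jazz)
```

Although the single 'rock' item has the higher rating, 'jazz' ends up with the larger interest value because the user relies on it far more often.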

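Modeling steps
- The user's preference for each tag
- The user's dependence on each tag
- The user's interest in each tag
- The tag genes of each item
- The user's preference for each item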
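
The five steps can be summarized with the formulas below, written to match the TagBasedRec class in this section. The notation is introduced here for illustration: r_{u,i} is the rating of user u for item i, \bar{r}_u is the mean rating of u, rel(i, t) is 1 if tag t appears on item i and 0 otherwise, n_{u,t} is the number of times u used tag t, I_{u,t} is the set of items u tagged with t, and k is a smoothing weight.

```latex
\begin{aligned}
\mathrm{pref}(u, t) &= \frac{\sum_{i \in I_{u,t}} r_{u,i}\,\mathrm{rel}(i, t) + k\,\bar{r}_u}
                            {\sum_{i \in I_{u,t}} \mathrm{rel}(i, t) + k} \\
\mathrm{dep}(u, t)  &= \frac{n_{u,t}}{\sum_{t'} n_{u,t'}} \cdot
                       \log \frac{\sum_{u'} \sum_{t'} n_{u',t'}}{\sum_{u'} n_{u',t} + 1} \\
\mathrm{interest}(u, t) &= \mathrm{pref}(u, t) \cdot \mathrm{dep}(u, t) \\
p(u, i) &= \sum_{t} \mathrm{interest}(u, t)\,\mathrm{rel}(i, t)
\end{aligned}
```

The first factor of dep(u, t) is the TF term and the second is the IDF term. With k = 0 the preference is the plain mean rating of the items the user tagged with t; a larger k pulls it toward the user's overall mean.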

In [51]:
class TagBasedRec(object):

    def __init__(self,
                 user_artists='../../data/ch6/lastfm-2k/user_artists.dat',
                 user_tags='../../data/ch6/lastfm-2k/user_taggedartists.dat',
                 artists='../../data/ch6/lastfm-2k/artists.dat',
                 tags='../../data/ch6/lastfm-2k/tags.dat'):
        """Initialize.
        @params:
            user_artists: path to the user-artist file (third column is used as the rating)
            user_tags: path to the user-artist-tag assignments
            artists: path to the artist list
            tags: path to the tag list
        """
        self.user_artists = user_artists
        self.user_tags = user_tags
        self.artists = self.get_artists(artists)
        self.tags = tags
        self.init_model()
        
    def get_artists(self,artists_path):
        """Load all artist ids.
        @params:
            artists_path: path to the artists file (first line is a header)
        @return:
            artists: list of artist ids
        """
        artists = []
        with open(artists_path,'r') as f:
            next(f, None)  # skip the header line
            for line in f:
                artists.append(line.strip().split()[0])
        return artists
    
    def init_model(self):
        """Initialize the model by precomputing the statistics that recommend() needs.
        """
        self.user_artist_rating,self.user_rating_avg = self.get_user_artist_rating()
        self.artist_tag_relation = self.get_artist_tag_relation()
        self.TF,self.IDF = self.get_tag_TF_IDF()
        self.user_tag_rating = self.get_user_tag_rating()
        
    def get_user_artist_rating(self):
        """Load each user's rating for each artist.
        @return:
            user_artist_rating: dict, user -> {artist: rating}
            user_rating_avg: dict, user -> mean rating
        """
        user_artist_rating = defaultdict(dict)
        with open(self.user_artists,'r') as f:
            next(f, None)  # skip the header line
            for line in f:
                user,artist,rating = line.strip().split()
                # The raw value is scaled down by 100 before being used as the rating
                user_artist_rating[user][artist] = float(rating)/100
        user_rating_avg = {}
        for user,artists in user_artist_rating.items():
            user_rating_avg[user] = sum(artists.values())/len(artists)
        return user_artist_rating,user_rating_avg

    def get_artist_tag_relation(self):
        """Load the relevance between artists and tags. A tag that appears on
        an artist is considered relevant to it, which is encoded as 1.
        @return:
            artist_tag_relation: dict, artist -> {tag: 1}
        """
        artist_tag_relation = defaultdict(dict)
        with open(self.user_tags,'r') as f:
            next(f, None)  # skip the header line
            for line in f:
                _,artist,tag = line.strip().split()[:3]
                artist_tag_relation[artist][tag] = 1
        return artist_tag_relation
    
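    def get_tag_TF_IDF(self):
        """Compute how much each user depends on each tag: TF * IDF.
        @return:
            TF: dict, user -> {tag: share of the user's taggings that use this tag}
            IDF: dict, tag -> log(total taggings / (taggings with this tag + 1))
        """
        TF = {}
        IDF = defaultdict(int)
        with open(self.user_tags,'r') as f:
            next(f, None)  # skip the header line
            for line in f:
                user,_,tag = line.strip().split()[:3]
                TF.setdefault(user,{}).setdefault(tag,0)
                TF[user][tag] += 1
                IDF[tag] += 1 # counts every tagging event, not distinct users
        # Normalize the per-user counts into frequencies
        for user,tags in TF.items():
            sum_ = sum(tags.values())
            for tag in tags:
                TF[user][tag] /= sum_
        # Turn the per-tag counts into IDF values
        sum_ = sum(IDF.values())
        for tag in IDF:
            IDF[tag] = math.log(sum_/(IDF[tag]+1))
        return TF,IDF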
    
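    def get_user_tag_rating(self,k=0):
        """Compute each user's interest in each tag: the smoothed preference
        multiplied by the TF * IDF dependence.
        @params:
            k: smoothing weight that pulls the preference toward the user's mean rating
        @return:
            user_tag_rating: dict, user -> {tag: interest}
        """
        user_tag_rating_ = {}   # user -> {tag: sum of ratings of artists tagged with it}
        user_tag_rating = defaultdict(dict)
        count_user_tag = {}     # user -> {tag: number of times the user used it}
        with open(self.user_tags,'r') as f:
            next(f, None)  # skip the header line
            for line in f:
                user,artist,tag = line.strip().split()[:3]
                user_tag_rating_.setdefault(user,{}).setdefault(tag,0)
                count_user_tag.setdefault(user,{}).setdefault(tag,0)
                user_tag_rating_[user][tag] += self.user_artist_rating[user].get(artist,0)*self.artist_tag_relation[artist][tag]
                count_user_tag[user][tag] += self.artist_tag_relation[artist][tag]
        for user,tags in user_tag_rating_.items():
            # Users that tagged artists but have no ratings fall back to a mean of 0
            avg = self.user_rating_avg.get(user,0.0)
            for tag in tags:
                # Smoothed mean rating of the artists the user tagged with this tag
                preference = (user_tag_rating_[user][tag] + avg*k)/(count_user_tag[user][tag]+k)
                # Dependence on the tag, measured as TF * IDF
                dependence = self.TF[user][tag]*self.IDF[tag]
                user_tag_rating[user][tag] = preference*dependence
        return user_tag_rating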
    
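    def recommend(self,user,k=10,filter_=True):
        """Recommend artists to a user based on tag information.
        @params:
            user: user id
            k: number of artists to return
            filter_: whether to skip artists the user has already rated
        @return:
            list of (artist, preference), sorted by preference in descending order
        """
        recommendation = {}
        rated = self.user_artist_rating.get(user,{})
        user_tags = self.user_tag_rating.get(user,{})
        for artist in self.artists:
            # Assumed intent of filter_: do not recommend already-rated artists
            if filter_ and artist in rated:
                continue
            prefer = 0.0
            for tag,relation in self.artist_tag_relation.get(artist,{}).items():
                prefer += relation * user_tags.get(tag,0)
            recommendation[artist] = prefer
        # Assumed intent of k: keep only the top-k artists
        return sorted(recommendation.items(),key=lambda x:x[1],reverse=True)[:k]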
                    

In [52]:
rec = TagBasedRec()
print(rec.recommend('2'))

[('2', {'13': 21.106803189894855, '15': 19.489890277191424, '18': 11.693633092848605, '21': 18.63580968416462, '41': 29.726180773569816, '14': 6.596601226324219, '23': 5.524011039913762, '40': 8.457835922571705, '20': 5.792166755255121, '22': 6.734582820375147, '26': 3.851152418065871, '36': 3.4913640891455393, '37': 2.7597395981339474, '39': 1.287678946069352, '19': 2.35225123866369, '24': 1.0552316987571373, '16': 0.0, '17': 0.0, '25': 0.0, '42': 0.0, '43': 0.0, '32': 0.0, '33': 0.0, '34': 0.0, '38': 0.0, '35': 0.0}), ('3', {'14': 12.11584681694388, '15': 14.195159587872613, '33': 12.2657034483966, '44': 20.976550522199425, '45': 27.27435605936723, '46': 18.674401690213102, '47': 13.5869414679748, '48': 26.923571893353614, '49': 13.826444709320391, '50': 24.767670699929706, '51': 26.923571893353614, '52': 26.667295670128528, '53': 26.144987218348476, '54': 26.923571893353614, '55': 26.43472058593968, '56': 16.73397501985135, '57': 26.923571893353614, '58': 26.923571893353614, '63': 2