### Media and Cognition Tutorial: Recommender Systems
#### Code by: Chuhan Wu (武楚涵), Yingzhuo Huang (黄颖卓)

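This tutorial demonstrates a content-based recommender system in a heavily simplified setting: each news article is represented only by a single fine-grained subcategory, and the original text is not used. The data is sampled and constructed from the MIND news recommendation dataset; the original data can be downloaded from https://msnews.github.io/. Students interested in how the original dataset is used and how its task is defined can read about it on the dataset's CodaLab page.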

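#### Problem definition

Given the categories of the news articles a user has clicked in the past and the category of a candidate news article, predict whether the user will click the candidate.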

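Dependencies: numpy, scikit-learn (imported as sklearn), torch, tqdm, and ipywidgets (used by the interactive demo at the end).
tqdm provides progress bars that make progress visible; using it is recommended.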

In [None]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import ipywidgets as widgets
from IPython.display import display,clear_output
from sklearn.metrics import roc_auc_score
from tqdm import tqdm

Load the mapping between news categories and ids for later use.

In [None]:

with open('category_id.txt','r',encoding='utf-8')as f:
    category = f.readlines()
    category_id={line.split('\t')[0]:int(line.strip().split('\t')[1]) for line in category}
    id_category={int(line.strip().split('\t')[1]):line.split('\t')[0] for line in category}
    

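Convert each sample into the user's click history, the candidate news, and a clicked-or-not label. Only the 50 most recent history entries are kept; shorter histories are zero-padded on the left.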


In [None]:

def get_data(data_file):
    with open(data_file,'r',encoding='utf-8')as f:
        data=[line.strip().split('\t') for line in f.readlines()]
    all_history = []
    all_candidate = []
    all_label = []
    for sample in data:
        # Keep only the 50 most recent clicks
        history = [category_id[i] for i in sample[0].split(',')][-50:]
        candidate = category_id[sample[1]]
        # The label is stored as text in the file, so convert it to an integer
        label = int(sample[2])
        # Left-pad with zeros so that every history has length 50
        history = [0]*(50-len(history)) + history
        all_history.append(history)
        all_candidate.append(candidate)
        all_label.append(label)
    return (torch.tensor(all_history, dtype=torch.long),
            torch.tensor(all_candidate, dtype=torch.long),
            torch.tensor(all_label, dtype=torch.long))

In [None]:
train_history, train_candidate, train_label = get_data('training.txt')
test_history, test_candidate, test_label = get_data('test.txt')
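
To make the preprocessing concrete, here is a minimal sketch of what happens to a single line. It assumes the layout that get_data reads (three tab-separated fields, with the history as comma-separated category names). The mapping `toy_category_id`, the helper `parse_sample`, and the sample line are hypothetical; the real mapping comes from category_id.txt. It also assumes that id 0 is reserved for padding.

```python
# Hypothetical mapping and sample line, for illustration only.
toy_category_id = {"sports": 1, "finance": 2, "travel": 3}
MAX_HISTORY = 50  # same history length as in get_data


def parse_sample(line, mapping, max_len=MAX_HISTORY):
    """Turn one tab-separated line into (history, candidate, label)."""
    history_field, candidate_field, label_field = line.rstrip("\n").split("\t")
    # Keep only the most recent max_len clicks.
    history = [mapping[name] for name in history_field.split(",")][-max_len:]
    # Left-pad with zeros so the most recent clicks sit at the end.
    history = [0] * (max_len - len(history)) + history
    return history, mapping[candidate_field], int(label_field)


history, candidate, label = parse_sample(
    "sports,travel,sports\tfinance\t1", toy_category_id
)
print(len(history), history[-5:], candidate, label)
# 50 [0, 0, 1, 3, 1] 2 1
```

Because the padding sits on the left, the last positions of the sequence always hold the most recent clicks, which is what a GRU that reads left to right sees last.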

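Definition of the recommendation model. The one-hot encoding of each news category is the input, and a two-layer fully connected network learns a 64-dimensional hidden news representation. The user model then processes the sequence of clicked-news representations: a GRU scans the sequence, and its final hidden state serves as the user representation. The cosine similarity between the user representation and the candidate news representation, rescaled by a learnable linear layer, is used as the click prediction score.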

In [None]:

class RecModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.news_fc1 = nn.Linear(len(category_id), 256)
        self.news_fc2 = nn.Linear(256, 64)
        self.cosine_dense = nn.Linear(1, 1)
        self.user_encoder = nn.GRU(64, 64, 1, batch_first = True)
        self.bceloss = nn.BCEWithLogitsLoss()
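    def forward(self, history, candidate, label=None):
        # history: (batch, 50, num_categories) one-hot; candidate: (batch, num_categories) one-hot
        history_embedding = self.news_fc2(torch.tanh(self.news_fc1(history)))
        candidate_embedding = self.news_fc2(torch.tanh(self.news_fc1(candidate)))
        # The final GRU hidden state, shape (1, batch, 64), is the user representation
        output_states, user_embedding = self.user_encoder(history_embedding)
        # Cosine similarity lies in [-1, 1]; the 1-to-1 linear layer rescales it into a logit.
        # squeeze(-1) keeps the batch dimension even when the batch holds a single sample
        y = self.cosine_dense(F.cosine_similarity(user_embedding.squeeze(0), candidate_embedding,dim=-1).unsqueeze(dim=1)).squeeze(-1)
        if label is not None:
            loss = self.bceloss(y , label)
            return y,loss
        else:
            return y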
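
The tensor shapes in the forward pass are easy to lose track of, so the following sketch traces them with random stand-in embeddings. The sizes (batch of 4, history length 50, 64 dimensions) mirror the tutorial, but the tensors and the standalone `gru` layer are illustrative and are not the trained model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, history_len, dim = 4, 50, 64  # toy sizes mirroring the tutorial

gru = nn.GRU(dim, dim, 1, batch_first=True)
# Random stand-ins for the encoded clicked news and the encoded candidate news.
history_embedding = torch.randn(batch_size, history_len, dim)
candidate_embedding = torch.randn(batch_size, dim)

output_states, last_hidden = gru(history_embedding)
print(output_states.shape)  # torch.Size([4, 50, 64]), one state per time step
print(last_hidden.shape)    # torch.Size([1, 4, 64]), (num_layers, batch, dim)

# Drop the num_layers axis to get one user vector per sample.
user_embedding = last_hidden.squeeze(0)
score = F.cosine_similarity(user_embedding, candidate_embedding, dim=-1)
print(score.shape)          # torch.Size([4]), one score per sample
```

For a single-layer, unidirectional GRU the final hidden state equals the output at the last time step, so the user representation is driven by whatever the GRU retains after reading the most recent click.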

In [None]:
# Model training
def train(train_history, train_candidate, train_label, batch_size=50, epochs=10):
    model = RecModel()
    optimizer = optim.Adam(model.parameters(), lr = 1e-3)
    for epoch in range(epochs):
        train_losses = []
        model.train()
        for i in tqdm(range(len(train_history)//batch_size)):
            
            optimizer.zero_grad()
            output,loss = model(F.one_hot(train_history[i*batch_size:(i+1)*batch_size],num_classes=len(category_id)).float(),
                                F.one_hot(train_candidate[i*batch_size:(i+1)*batch_size],num_classes=len(category_id)).float(),
                                train_label[i*batch_size:(i+1)*batch_size].float())
            train_losses.append(loss.item())
            loss.backward()
            optimizer.step()
        print('[epoch {:d}], train_loss: {:.4f}'.format(epoch + 1, np.average(train_losses)))
    return model

Model evaluation uses AUC as the metric. Note that, for simplicity, AUC is computed here differently from the original MIND evaluation.

In [None]:
def test(model, test_history, test_candidate, test_label, batch_size=200):
    model.eval()
    prediction = []
    # Step through every sample, including the final partial batch,
    # so that the predictions line up with test_label
    with torch.no_grad():
        for start in tqdm(range(0, len(test_history), batch_size)):
            output = model(F.one_hot(test_history[start:start+batch_size],num_classes=len(category_id)).float(),
                           F.one_hot(test_candidate[start:start+batch_size],num_classes=len(category_id)).float())
            prediction += output.reshape(-1).numpy().tolist()
    print(roc_auc_score(test_label.numpy(), prediction))
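
The test function pools all test samples into a single AUC. To the best of our understanding, the official MIND evaluation instead computes AUC within each impression and averages the results; treat that as an assumption here, since the tutorial's data files carry no impression ids. The sketch below uses hypothetical toy data and a hand-written `auc` helper to show that the two conventions can disagree.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data: two impressions with three candidates each.
impressions = [0, 0, 0, 1, 1, 1]
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1]

# Pooled AUC: all samples ranked together, as in test().
pooled_auc = auc(labels, scores)

# Grouped AUC: rank within each impression, then average.
per_group = []
for group in sorted(set(impressions)):
    idx = [i for i, imp in enumerate(impressions) if imp == group]
    per_group.append(auc([labels[i] for i in idx], [scores[i] for i in idx]))
grouped_auc = sum(per_group) / len(per_group)

print(pooled_auc, grouped_auc)  # 0.75 1.0
```

Each impression is ranked perfectly here, but the second impression's scores are lower overall, so its positive falls below the first impression's negatives when everything is pooled.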

In [None]:
model = train(train_history, train_candidate, train_label)

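Because the model's input is very simple and the model itself is rather crude, the AUC is usually below 60%. News recommendation is a very hard task, however: even the current state-of-the-art methods reach only about 73% AUC.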

In [None]:
test(model, test_history, test_candidate, test_label)
model.eval() 

Below is a simple demo that, given the click history, recommends the top 20 of the 200-plus categories.

In [None]:
def generate_recommendation(model,history):
    history = history[-50:]
    # Repeat the padded history once per candidate category (ids 1 .. len(category_id)-1)
    history = [[0]*(50-len(history)) + history]*(len(category_id)-1)
    candidates = np.arange(1,len(category_id))
    history_batch = F.one_hot(torch.tensor(history,dtype=torch.long),num_classes=len(category_id)).float()
    candidate_batch = F.one_hot(torch.tensor(candidates,dtype=torch.long),num_classes=len(category_id)).float()
    output = model(history_batch, candidate_batch).detach().numpy()
    # Sort scores in descending order and keep the 20 best; +1 maps positions back to category ids
    top20_rec = 1+np.argsort(output)[::-1][:20]
    return top20_rec
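
The ranking step at the end of generate_recommendation can be checked on a tiny example. The scores below are hypothetical; as in the function above, position 0 of the score array corresponds to category id 1, which is why 1 is added after sorting.

```python
import numpy as np

# Hypothetical scores for candidate ids 1..5 (position i holds the score of id i+1).
scores = np.array([0.1, 0.7, 0.3, 0.9, 0.5])
k = 3

# argsort sorts ascending, so reverse it and keep the first k positions.
top_k_positions = np.argsort(scores)[::-1][:k]
# Shift positions back to category ids, which start at 1.
top_k_ids = 1 + top_k_positions

print(top_k_ids.tolist())  # [4, 2, 5]
```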

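Clicking one of the buttons below adds it to the click history. The model is weak, yet it shows some ability to personalize. The recommendations are also very sensitive to the most recent clicks. Many recommender systems that students are familiar with behave this way, and it is a common weakness of GRU and other sequence models, which tend to over-amplify short-term interest patterns.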

In [None]:
import random
history_now = []
    
def display_news():   
    clear_output()
    top_list = generate_recommendation(model,history_now)
    top_news = [id_category[i] for i in top_list]
    btns = []
    for i in top_news:
        btn = widgets.Button(description = i, tooltip = i)
        btn.on_click(btn_click)  
        btns.append(btn)
    box = widgets.VBox(children=[widgets.HBox(btns[:5]),widgets.HBox(btns[5:10]),widgets.HBox(btns[10:15]),widgets.HBox(btns[15:])])
    display(box) 
    
def btn_click(sender):
    history_now.append(category_id[sender.description])
    display_news()
    
display_news()

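### Questions for thought
How can the model be improved so that its recommendations are more accurate and less sensitive to short-term patterns?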