In [1]:
import pandas as pd
import numpy as np
import nltk
import random
import sys
import os
import tqdm
import sklearn
import seaborn as sns
from scipy import sparse
import matplotlib.pyplot as plt
import torch

from config import hyperParams
from test import MINDTest, OnlineNRMSModel, OnlineDeepCrossNRMSModel, OnlineDeepCrossNRMSCategoryModel

In [2]:
import warnings

# Pandas Style
pd.set_option("display.max_columns", 9999)
pd.set_option("display.max_rows", 9999)
pd.set_option("display.max_colwidth", 250)

# Seaborn Style
sns.set(style='ticks')
sns.set_style({'font.family': 'Hiragino Maru Gothic Pro'})
sns.set_palette("cool")

warnings.filterwarnings("ignore")

## ItemCF  
To begin with, we want to explore a baseline method from traditional recommendation systems.  
  
Item-based Collaborative Filtering (ItemCF) is widely used to recommend commodities or news to a target user according to his/her historical preferences, impressions and behavior. The similarity between different news items is calculated from the users' interactions with a metric such as cosine similarity or Pearson correlation. The input of the system is the user click matrix, which describes these interactions. Each row corresponds to one user, and each entry counts that user's clicks on a particular piece of news. Note that the interaction matrix is sparse because only positive impressions on news items are counted!   

First, load the data from disk. If we previously saved the processed similarity matrix, we can load the pkl file directly, which saves a lot of time! We use cosine similarity to calculate the item-to-item similarity matrix for ItemCF.
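
The internals of `cf.itemcf_sim` are not shown in this notebook, so the following is only a minimal sketch of item-to-item cosine similarity under simplifying assumptions: clicks are treated as binary (a user either clicked an item or not), and the function name `item_cosine_similarity` and the toy IDs are hypothetical.

```python
import math
from collections import defaultdict


def item_cosine_similarity(user_clicks):
    """Compute item-to-item cosine similarity from binary click histories.

    user_clicks: dict mapping a user ID to the list of news IDs the user clicked.
    Returns a nested dict sim[i][j] for every pair of items sharing a user.
    """
    item_users = defaultdict(int)  # number of distinct users who clicked each item
    co_clicks = defaultdict(lambda: defaultdict(int))  # co-occurrence counts

    for items in user_clicks.values():
        unique_items = set(items)
        for i in unique_items:
            item_users[i] += 1
            for j in unique_items:
                if i != j:
                    co_clicks[i][j] += 1

    # For binary vectors the dot product is the co-click count and each
    # norm is the square root of the item's user count.
    sim = {}
    for i, related in co_clicks.items():
        sim[i] = {
            j: count / math.sqrt(item_users[i] * item_users[j])
            for j, count in related.items()
        }
    return sim


# Toy click log (hypothetical IDs, not taken from MIND).
toy_clicks = {
    "U1": ["N1", "N2"],
    "U2": ["N1", "N2", "N3"],
    "U3": ["N2", "N3"],
}
toy_item_sim = item_cosine_similarity(toy_clicks)
print(round(toy_item_sim["N1"]["N2"], 4))
```

Only item pairs that share at least one user get an entry, which keeps the result sparse in the same way the click matrix is sparse.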

In [None]:
from utils import cf
save_path = "./data/cf/"
df_train = cf.load_data(hyperParams)
df_test = cf.load_data(hyperParams, stage="test")
if not os.path.exists(save_path + 'itemcf_item2item_sim.pkl'):
    _ = cf.itemcf_sim(df_train, save_path)

After generating the item-to-item similarity matrix, we need to produce recommendations for a given user. When a user requests news from our system, we use the user ID to look up his/her personal interaction vector, which contains the positive impressions on historical news. By simply multiplying it with the similarity matrix, we obtain a prediction score for each news item. Rank them and select the top k news items for the user! Work done!  

Now we load the test data and try to recommend 20 potential news items for the user. Note that the system may produce fewer than 20 news items for a given user, so we also draw on several "hot" news items based on all users. Recommending these "hot" news items is a conservative choice, and there is a high probability that the user will like them.
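
The scoring and "hot" fallback described above can be sketched as follows. This is an illustration under assumptions, not the actual `cf.itemcf` implementation: the function `recommend`, its arguments and the toy data are hypothetical, and the interaction vector is treated as binary, so the matrix product reduces to summing similarities over the clicked items.

```python
def recommend(history, sim, item_click_count, k=20):
    """Rank unseen items by summed similarity to the user's clicked items,
    then pad with globally popular ("hot") items if fewer than k are found."""
    seen = set(history)
    scores = {}
    for clicked in seen:
        for candidate, s in sim.get(clicked, {}).items():
            if candidate not in seen:
                scores[candidate] = scores.get(candidate, 0.0) + s

    # Sort by descending score; ties are broken by item ID for determinism.
    ranked = sorted(scores, key=lambda item: (-scores[item], item))[:k]

    if len(ranked) < k:
        hot = sorted(item_click_count, key=lambda item: (-item_click_count[item], item))
        for item in hot:
            if len(ranked) == k:
                break
            if item not in seen and item not in ranked:
                ranked.append(item)
    return ranked


# Toy similarity table and click counts (hypothetical values).
toy_sim_table = {
    "N1": {"N2": 0.8, "N3": 0.5},
    "N2": {"N1": 0.8, "N3": 0.8},
    "N3": {"N1": 0.5, "N2": 0.8},
}
toy_counts = {"N1": 2, "N2": 3, "N3": 2, "N4": 10, "N5": 7}
toy_recs = recommend(["N1"], toy_sim_table, toy_counts, k=3)
print(toy_recs)
```

In the toy run the user's history yields only two similar items, so the third slot is filled by the most-clicked unseen item.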

In [None]:
item_count_map = cf.get_item_click_num(df_train)
user_label_dict = cf.get_user_label(df_test)
user_recall_items_dict = cf.itemcf(df_test, item_count_map, save_path)
category_map_train = cf.get_news_category(hyperParams)
category_map_test = cf.get_news_category(hyperParams, stage="test")
category_map = cf.merge_category_map(category_map_train, category_map_test)

Start our recommendation now!

In [None]:
recall_df = cf.get_recall_df(user_recall_items_dict, category_map)
label_df = cf.get_label_df(user_label_dict, category_map)

Now we can check our recommendation result by printing out the generated dataframe. Concretely, the news category is an important indicator of recommendation quality, because users tend to browse more news from their preferred categories.

In [None]:
# Here you can select an arbitrary user
recall_df.head(20)

In [None]:
label_df.head(20)

The result is exciting! We can see that the news we recommend falls into categories similar to the user's preferences.   
  
But it still has some obvious potential problems.

1. We can see that most news recommended to the user is not actually user-tailored. "Hot" news is frequently selected because of the sparsity of the user-news interaction matrix. This is the most severe problem of Collaborative Filtering based models. Simply put, the recommendation is too general!  
2. We are only able to use the impression feature given by the MIND dataset, which is a real waste of this high-quality data source! Maybe we can utilize text data such as the title and abstract.  
3. It is difficult to evaluate how good our recommendation is. Simply checking the matches between the user's preferred categories and the recommended news' categories/sub-categories is not enough! 

A deep learning based model may meet these requirements!  
But wait a second. It seems that few of the models proposed so far target news recommendation. Unlike commodity recommendation, news recommendation may depend more on text data. What's more, combining user-related features and time-related information in the model is tricky.  
  
This gave us a lot of trouble. But we found some powerful baseline models and managed to integrate them into a brand-new and powerful model focused on the news recommendation problem. 

## Neural News Recommendation with Deep & Cross Multi-Head Attentions

Our handcrafted model is based on the paper Neural News Recommendation with Multi-Head Self-Attention (NRMS). 

### Description

News recommendation can help users find news of interest and alleviate information overload. Precisely modeling news and users is critical for news recommendation, and capturing the contexts of words and news is important for learning news and user representations. The paper proposes a neural news recommendation approach with multi-head self-attention (NRMS). The core of the approach is a news encoder and a user encoder. The news encoder uses multi-head self-attention to learn news representations from news titles by modeling the interactions between words. The user encoder learns user representations from the news they have browsed and uses multi-head self-attention to capture the relatedness between the news. In addition, additive attention is applied to learn more informative news and user representations by selecting important words and news.  
  
This approach is motivated by several observations. First, the interactions between words in a news title are important for understanding the news. For example, the word “Rockets” has strong relatedness with “Bulls”. Besides, a word may interact with multiple words, e.g., “Rockets” also has semantic interactions with “trade”. Second, different news articles browsed by the same user may also be related. For example, in Fig. 1 the second news is related to the first and the third news. Third, different words may have different importance in representing news. In Fig. 1, the word “NBA” is more informative than “2018”. Besides, different news articles browsed by the same user may also have different importance in representing this user. For example, the first three news articles are more informative than the last one.

In [1]:
import ipyplot
ipyplot.plot_images(['figure/description.png'], ['Fig. 1'], img_width=550)

In the paper, the authors propose a neural news recommendation approach with multi-head self-attention (NRMS). The core of their approach is a news encoder and a user encoder. In the news encoder, they learn news representations from news titles by using multi-head self-attention to model the interactions between words. In the user encoder, they learn representations of users from their browsed news by using multi-head self-attention to capture the relatedness between the news. Besides, they apply additive attention to both the news and user encoders to select important words and news, which yields more informative news and user representations. Extensive experiments on a real-world dataset show that their approach can effectively and efficiently improve the performance of news recommendation.
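
To make the additive attention step concrete, here is a minimal pure-Python sketch of attention pooling followed by a dot-product click score. It assumes the common formulation in which each vector is scored by a query applied to a tanh projection and the scores are normalized with a softmax. The multi-head self-attention layers are omitted, and all parameter values are toy numbers rather than trained weights.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def additive_attention(vectors, proj, bias, query):
    """Pool a list of vectors into a single vector.

    score_i = query . tanh(proj @ h_i + bias), weights = softmax(scores),
    pooled  = sum_i weights_i * h_i
    """
    scores = []
    for h in vectors:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, h)) + b)
            for row, b in zip(proj, bias)
        ]
        scores.append(sum(q * z for q, z in zip(query, hidden)))
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(a * h[d] for a, h in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def click_score(user_vec, news_vec):
    """Click predictor sketch: inner product of user and news representations."""
    return sum(u * r for u, r in zip(user_vec, news_vec))


# Toy word vectors for one title (hypothetical values).
word_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_proj = [[1.0, 0.0], [0.0, 1.0]]
toy_bias = [0.0, 0.0]
toy_query = [1.0, 1.0]

news_vec, attn = additive_attention(word_vectors, toy_proj, toy_bias, toy_query)
print(attn, click_score([0.5, 0.5], news_vec))
```

The same pooling function can be applied twice: once over word vectors to build a news representation, and once over news representations to build a user representation.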

The NRMS approach for news recommendation is shown in Fig. 2. It contains three modules: news encoder, user encoder, click predictor.  

In [2]:
ipyplot.plot_images(['figure/nrms.png'], ['Fig. 2'], img_width=550)

We enhance the model with more of the given features that we explored before, i.e. the news abstract text feature and the user personality feature. Features should be properly integrated, so we plug in a deep & cross structure after the news encoder. Concretely, the title embeddings and abstract embeddings are concatenated into a single feature and passed to two separate feature extraction modules: a deep network and a cross network. The deep network tends to extract high-dimensional latent features, while the cross network performs feature combination.  
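
The exact layers of our cross network are not reproduced in this notebook. As a rough illustration, the sketch below assumes the cross layer follows the usual Deep & Cross Network form, x_{l+1} = x_0 * (x_l . w_l) + b_l + x_l, applied to the concatenated title and abstract embeddings. The weights and embeddings are toy values, and the deep (MLP) branch is left out.

```python
def cross_layer(x0, xl, w, b):
    """One cross layer (assumed DCN-style form):
    x_{l+1} = x0 * (xl . w) + b + xl
    """
    dot = sum(xi * wi for xi, wi in zip(xl, w))
    return [x0_i * dot + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


def cross_network(x0, layers):
    """Stack several cross layers; `layers` is a list of (w, b) pairs."""
    x = list(x0)
    for w, b in layers:
        x = cross_layer(x0, x, w, b)
    return x


# Toy title and abstract embeddings (hypothetical values), concatenated.
title_emb = [1.0, 2.0]
abstract_emb = [0.5, -1.0]
x0 = title_emb + abstract_emb

toy_layers = [([0.1] * 4, [0.0] * 4), ([0.1] * 4, [0.0] * 4)]
crossed = cross_network(x0, toy_layers)
print(crossed)
```

Each layer multiplies the original input by a scalar interaction term and adds a residual connection, so the output keeps the input dimension while the degree of feature interaction grows with depth.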
  
To merge user-related features such as the user's category preference into the model, we implement another attention-based idea. We find that the data in MIND is quite biased. The categories "news" and "sports" dominate the overall distribution, while some categories appear only a few times. If we used a purely frequency-based method to calculate the weight of each title and abstract embedding feature, we would very likely ignore the unpopular news entirely. That would certainly degrade performance for a user who happens to love such news!  
  
As a result, we implement a trick based on the idea of the tf-idf algorithm. Similar to the gist of tf-idf, the weight of the corresponding news title and abstract is positively related to how frequently it appears in the user's impression history and negatively related to how frequently it appears in the whole dataset. After some experiments, we also find that simply multiplying the features by the weights is not a good idea. Instead, an additive attention (adding the attentive features to the original features) improves the performance of the model.
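
A minimal sketch of this idea, under stated assumptions: the weight is computed per category as the user's click share (tf) times the log inverse global share (idf), and the "additive attention" is modeled as the original feature plus its weighted copy. The exact formula used in the model is not given here, so `category_weights`, `additive_reweight` and the toy counts are hypothetical.

```python
import math
from collections import Counter


def category_weights(user_history_categories, global_category_counts):
    """tf-idf-like weight per category for one user (assumed formula).

    tf  = share of the user's clicked news that fall in the category
    idf = log(total news / news in the category)
    The weights are normalized to sum to 1.
    """
    total_news = sum(global_category_counts.values())
    tf = Counter(user_history_categories)
    n = len(user_history_categories)
    raw = {
        c: (count / n) * math.log(total_news / global_category_counts[c])
        for c, count in tf.items()
    }
    norm = sum(raw.values())
    return {c: (w / norm if norm > 0 else 0.0) for c, w in raw.items()}


def additive_reweight(feature, weight):
    """Residual-style combination: original feature plus its weighted copy."""
    return [x + weight * x for x in feature]


# Toy data (hypothetical counts): "news" dominates globally, "travel" is rare.
toy_history = ["news", "news", "news", "travel"]
toy_global = {"news": 800, "sports": 150, "travel": 50}

toy_weights = category_weights(toy_history, toy_global)
print(toy_weights)
print(additive_reweight([1.0, 2.0], 0.5))
```

In the toy data the single "travel" click receives a larger weight than the three "news" clicks, because "travel" is rare globally. The residual form keeps the original signal even when a weight is close to zero, which plain multiplication would not.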

## Analysis at Inference Time for our Model

We have trained an NRMS model using the following hyperparameters:  
  
Max title length: 10  
Max abstract length: 50  
Number of multi-head attention heads: 10  
Dimension of pretrained GloVe word vectors: 300  
Negative sampling K: 4  
Maximum number of historical news seen by the user: 50  
Vocabulary size: 40000  

### How to evaluate the performance of our model at inference time?

During training, our model outperforms the original NRMS. Our AUC is 0.72 and our NDCG@10 is over 0.4, which is a pretty good result. But a fun way to see the model working is to imagine the type of user from the news he/she has previously seen and form possible hypotheses, such as "the user seems to be interested in Hollywood news and sports but not in political news". Then, given a pool of candidate news, our human intuition leads us to expect that political news will be ranked lower than sports news, for example.
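
For reference, the two metrics quoted above can be computed for a single impression as sketched below. This is a generic illustration with binary click labels, not our evaluation script, and the labels and scores are toy values. In practice the metrics are usually computed per impression and then averaged.

```python
import math


def auc(labels, scores):
    """Probability that a random clicked item outscores a random non-clicked one."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dcg_at_k(labels_in_rank_order, k):
    """Discounted cumulative gain over the top-k positions."""
    return sum(l / math.log2(rank + 2) for rank, l in enumerate(labels_in_rank_order[:k]))


def ndcg_at_k(labels, scores, k=10):
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ranked = [l for _, l in sorted(zip(scores, labels), key=lambda p: -p[0])]
    ideal_dcg = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(ranked, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# One toy impression: 1 = clicked, 0 = not clicked.
toy_labels = [1, 0, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.3, 0.6, 0.1]

toy_auc = auc(toy_labels, toy_scores)
toy_ndcg = ndcg_at_k(toy_labels, toy_scores, k=10)
print(round(toy_auc, 4), round(toy_ndcg, 4))
```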

During the online inference phase, we randomly sample 200 news items from the news dataset, using the "tf-idf"-like probability as the prior distribution, just as we do in the training phase. This step can be regarded as a coarse candidate filter.
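
The sampling code itself lives in the test module and is not shown here. One way to draw 200 distinct candidates in proportion to prior weights is weighted sampling without replacement, sketched below with the Efraimidis-Spirakis key trick. Whether the actual implementation samples with or without replacement is an assumption, and the item IDs and weights are toy values.

```python
import heapq
import math
import random


def weighted_sample_without_replacement(items, weights, k, rng):
    """Draw up to k distinct items, favoring items with larger weights.

    Each item gets the key log(u) / w with u uniform in (0, 1]; the k items
    with the largest keys form the sample (Efraimidis-Spirakis).
    """
    keyed = []
    for item, w in zip(items, weights):
        if w <= 0:
            continue  # zero-weight items can never be drawn
        u = 1.0 - rng.random()  # shift [0, 1) to (0, 1] to avoid log(0)
        keyed.append((math.log(u) / w, item))
    return [item for _, item in heapq.nlargest(k, keyed)]


# Toy candidate pool with hypothetical prior weights.
pool = [f"N{i}" for i in range(1000)]
prior = [1 + (i % 5) for i in range(1000)]

candidates = weighted_sample_without_replacement(pool, prior, 200, random.Random(0))
print(len(candidates), len(set(candidates)))
```

Passing an explicit `random.Random` instance keeps the draw reproducible, which is convenient when comparing rankings across models on the same candidate pool.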

In [3]:
combine_category = True
if combine_category:
    model = OnlineDeepCrossNRMSCategoryModel.load_from_checkpoint(hyperParams["checkpoint_path_category"], hyperParams=hyperParams, file_path=hyperParams["test_data_path"])
else:
    model = OnlineDeepCrossNRMSModel.load_from_checkpoint(hyperParams["checkpoint_path"], hyperParams=hyperParams, file_path=hyperParams["test_data_path"])
model.load_test_data()
title_test = MINDTest(hyperParams, model, combine_category)

In [4]:
userIndex = 90
result, prob, news_ranking, news_history_title_token, rank_category, news_category = title_test.online_test(userIndex)

In [5]:
df_user_history_clicks = pd.DataFrame(data={"Category":news_category, "History clicked News":[" ".join(text for text in token) for token in news_history_title_token]})
df_user_history_clicks

| | Category | History clicked News |
|---|---|---|
| 0 | sports | here are 7 incredible stats from game 1 of the alcs |
| 1 | entertainment | knightley tricky to stay in the loop with new baby |
| 2 | music | rihanna addresses pregnancy rumors in new video |
| 3 | tv | tamron hall says she never dealt cocaine after reports say she confessed to doing so |
| 4 | lifestyle | duchess meghan describes really challenging life as new royal i m not ok |
| 5 | lifestyle | yes , we analyzed the royal family s birth charts |
| 6 | lifestyle | the internet is banding together to support meghan markle after she opened up about her mental health |
| 7 | lifestyle | how kate middleton and prince william s royal tour of pakistan compared to harry and meghan s trip to africa |
| 8 | lifestyle | here s why harry and meghan s latest interview could be the final straw for the royal family |
| 9 | tv | wendy williams has no sympathy for duchess meghan |


In [6]:
df_ranked_candidates = pd.DataFrame(data={"Confidence":prob, "Category":rank_category, "Candidate News":[" ".join(text for text in token) for token in news_ranking]})
df_ranked_candidates

| | Confidence | Category | Candidate News |
|---|---|---|---|
| 0 | 0.869458 | lifestyle | meghan markle and prince harry won t spend christmas with queen elizabeth at sandringham this year |
| 1 | 0.702128 | tv | kaley cuoco says having separate lives has helped her marriage to karl cook |
| 2 | 0.69748 | entertainment | camila cabello meets with kate middleton prince william |
| 3 | 0.691352 | lifestyle | meghan markle chose a chic black ensemble for remembrance sunday services |
| 4 | 0.685914 | tv | whoopi goldberg addresses the view tension if we were fighting , you d actually know it |
| 5 | 0.667083 | lifestyle | kate middleton just wore a tiara for a very special reason |
| 6 | 0.624499 | music | fans boo , chant refund as madonna starts concert after midnight |
| 7 | 0.58677 | tv | memorial service for former wesh 2 news anchor wendy chioji |
| 8 | 0.58459 | entertainment | these celebrity fathers and sons look almost identical at the same age |
| 9 | 0.582182 | music | will kentucky native chris stapleton win big at tonight s cma awards ? here s how to watch |


In [7]:
df_ranked_candidates.head(5)

| | Confidence | Category | Candidate News |
|---|---|---|---|
| 0 | 0.869458 | lifestyle | meghan markle and prince harry won t spend christmas with queen elizabeth at sandringham this year |
| 1 | 0.702128 | tv | kaley cuoco says having separate lives has helped her marriage to karl cook |
| 2 | 0.69748 | entertainment | camila cabello meets with kate middleton prince william |
| 3 | 0.691352 | lifestyle | meghan markle chose a chic black ensemble for remembrance sunday services |
| 4 | 0.685914 | tv | whoopi goldberg addresses the view tension if we were fighting , you d actually know it |


In [3]:
# from config import hyperParams

# # model = OnlineNRMSModel(hyperParams, hyperParams["test_data_path"])
# model = OnlineNRMSModel.load_from_checkpoint(hyperParams["checkpoint_path_title"], hyperParams=hyperParams, file_path=hyperParams["test_data_path"])
# model.load_test_data()
# title_test = MINDTest(hyperParams, model)

In [4]:
# userIndex = 20
# result, prob, news_ranking, news_history_title_token = title_test.online_test_title(userIndex)

In [7]:
# df_user_history_clicks = pd.DataFrame([" ".join(text for text in token) for token in news_history_title_token], columns=["History clicked News"])
# df_user_history_clicks

| | History clicked News |
|---|---|
| 0 | america s most and least educated states |
| 1 | economists who study poverty win nobel prize |
| 2 | america s factories are in trouble , and the trade war is only part of the problem |
| 3 | you can get a free wendy s double cheeseburger next week so uh , lunch is solved |
| 4 | ghost kitchens are taking over fast food chains from chick fil a to wendy s |
| 5 | national dessert day where to get free dessert at wendy s , tgi friday and more |
| 6 | the 2020 ram 1500 ecodiesel s official mpg numbers are impressive |
| 7 | eddy merckx hospitalized for serious head injury following bike crash |
| 8 | stars turning 40 in 2019 |
| 9 | kate middleton fired her longtime personal assistant amid split from meghan markle and prince harry |


In [8]:
# df_ranked_candidates = pd.DataFrame(data={"Confidence":prob, "Candidate News":[" ".join(text for text in token) for token in news_ranking]})

| | Confidence | Candidate News |
|---|---|---|
| 0 | 0.760877 | twitter thoroughly enjoys evansville s victory over no . 1 kentucky |
| 1 | 0.753487 | closed on thanksgiving 2019 costco , sam s club , nordstrom , home depot keep with tradition |
| 2 | 0.752661 | jill biden to trump stop it . my husband is going to beat you |
| 3 | 0.733765 | kc business owner furious after truck driver pours concrete mix into car wash bay |
| 4 | 0.730061 | 12 attractions that are so creepy , they are off limits to tourists |
| 5 | 0.729444 | opinions \| it s clear . trump doesn t want to be president anymore . |
| 6 | 0.728828 | gov . ron desantis pulls out the stops in fight to remove scott israel as sheriff |
| 7 | 0.725464 | mila kunis is totally cool with her mom being ashton kutcher s next wife if the actress were to die unexpectedly |
| 8 | 0.723876 | dementia and alcohol scientists find link |
| 9 | 0.722249 | cowboys lesson learned play well , and the road to the playoffs may not be so rough |


In [9]:
df_ranked_candidates.head(5)

| | Confidence | Candidate News |
|---|---|---|
| 0 | 0.760877 | twitter thoroughly enjoys evansville s victory over no . 1 kentucky |
| 1 | 0.753487 | closed on thanksgiving 2019 costco , sam s club , nordstrom , home depot keep with tradition |
| 2 | 0.752661 | jill biden to trump stop it . my husband is going to beat you |
| 3 | 0.733765 | kc business owner furious after truck driver pours concrete mix into car wash bay |
| 4 | 0.730061 | 12 attractions that are so creepy , they are off limits to tourists |


In [10]:
df_ranked_candidates.tail(5)

| | Confidence | Candidate News |
|---|---|---|
| 195 | 0.375544 | maryland food bank , wbal tv partner for feed a friend phone a thon |
| 196 | 0.314433 | my husband doesn t like what i wear and it s making me feel terrible |
| 197 | 0.224162 | viagra could help combat blood cancer soon |
| 198 | 0.117745 | pedestrians narrowly escape phoenix car crash |
| 199 | 0.022461 | chicago based potbelly sandwiches unveils restaurant design revamp |
