# Building a Collaborative Filtering news recommender
In this notebook we pre-process the MIND dataset and train a news recommender system on it.

In [3]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import torch.nn as nn
import pytorch_lightning as pl
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import torch
from collections import Counter


## Manual pre-processing of data
A library exists for preprocessing this data, but it is quite hard to use. So instead, after a quick look at the data, we have decided to create our own version of the pre-processing and data creation for this lab. If you are interested in the full data, take a look at it here: https://msnews.github.io/. On a first read it is not necessary to understand everything that is done here; we will re-introduce the dataset in the modeling section.


### behaviors.tsv

#### From documentation:  
The behaviors.tsv file contains the impression logs and users' news click histories. It has 5 columns divided by the tab symbol:
- Impression ID. The ID of an impression.  
- User ID. The anonymous ID of a user.  
- Time. The impression time with format "MM/DD/YYYY HH:MM:SS AM/PM".  
- History. The news click history (ID list of clicked news) of this user before this impression. The clicked news articles are ordered by time.  
- Impressions. List of news displayed in this impression and user's click behaviors on them (1 for click and 0 for non-click). The orders of news in a impressions have been shuffled.  
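
To make the column layout concrete, here is a minimal sketch that splits one tab-separated log line into the five fields listed above. The sample line is made up for illustration (it is not a row from the dataset), and `parse_behavior_line` is a hypothetical helper, not part of any library.

```python
# Hypothetical helper: split one behaviors.tsv line into its five named fields.
COLUMNS = ["impressionId", "userId", "timestamp", "click_history", "impressions"]

def parse_behavior_line(line):
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} tab-separated fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    # History and impressions are space-separated lists stored inside a single field
    row["click_history"] = row["click_history"].split()
    row["impressions"] = row["impressions"].split()
    return row

# Made-up example line (not taken from the dataset)
sample_line = "1\tU1\t11/11/2019 9:05:58 AM\tN1 N2 N3\tN4-1 N5-0\n"
row = parse_behavior_line(sample_line)
print(row["userId"], row["click_history"], row["impressions"])
```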

In [4]:
raw_behaviour = pd.read_csv(
    "behaviors.tsv", 
    sep="\t",
    names=["impressionId","userId","timestamp","click_history","impressions"])

print(f"The dataset originally consist of {len(raw_behaviour)} number of interactions.")
raw_behaviour.head()

The dataset originally consist of 156965 number of interactions.


| | impressionId | userId | timestamp | click_history | impressions |
|---|---|---|---|---|---|
| 0 | 1 | U13740 | 11/11/2019 9:05:58 AM | N55189 N42782 N34694 N45794 N18445 N63302 N104... | N55689-1 N35729-0 |
| 1 | 2 | U91836 | 11/12/2019 6:11:30 PM | N31739 N6072 N63045 N23979 N35656 N43353 N8129... | N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N... |
| 2 | 3 | U73700 | 11/14/2019 7:01:48 AM | N10732 N25792 N7563 N21087 N41087 N5445 N60384... | N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N... |
| 3 | 4 | U34670 | 11/11/2019 5:28:05 AM | N45729 N2203 N871 N53880 N41375 N43142 N33013 ... | N35729-0 N33632-0 N49685-1 N27581-0 |
| 4 | 5 | U8125 | 11/12/2019 4:11:21 PM | N10078 N56514 N14904 N33740 | N39985-0 N36050-0 N16096-0 N8400-1 N22407-0 N6... |


## Load article data
We also need to get the content information of each article. We will use the news.tsv file to index the items.

In [5]:
news = pd.read_csv(
    "news.tsv", 
    sep="\t",
    names=["itemId","category","subcategory","title","abstract","url","title_entities","abstract_entities"])
print(f"The article data consist in total of {len(news)} number of articles.")
news.head()

The article data consist in total of 51282 number of articles.


| | itemId | category | subcategory | title | abstract | url | title_entities | abstract_entities |
|---|---|---|---|---|---|---|---|---|
| 0 | N55528 | lifestyle | lifestyleroyals | The Brands Queen Elizabeth, Prince Charles, an... | Shop the notebooks, jackets, and more that the... | https://assets.msn.com/labs/mind/AAGH0ET.html | [{"Label": "Prince Philip, Duke of Edinburgh",... | [] |
| 1 | N19639 | health | weightloss | 50 Worst Habits For Belly Fat | These seemingly harmless habits are holding yo... | https://assets.msn.com/labs/mind/AAB19MK.html | [{"Label": "Adipose tissue", "Type": "C", "Wik... | [{"Label": "Adipose tissue", "Type": "C", "Wik... |
| 2 | N61837 | news | newsworld | The Cost of Trump's Aid Freeze in the Trenches... | Lt. Ivan Molchanets peeked over a parapet of s... | https://assets.msn.com/labs/mind/AAJgNsz.html | [] | [{"Label": "Ukraine", "Type": "G", "WikidataId... |
| 3 | N53526 | health | voices | I Was An NBA Wife. Here's How It Affected My M... | I felt like I was a fraud, and being an NBA wi... | https://assets.msn.com/labs/mind/AACk2N6.html | [] | [{"Label": "National Basketball Association", ... |
| 4 | N38324 | health | medical | How to Get Rid of Skin Tags, According to a De... | They seem harmless, but there's a very good re... | https://assets.msn.com/labs/mind/AAAKEkt.html | [{"Label": "Skin tag", "Type": "C", "WikidataI... | [{"Label": "Skin tag", "Type": "C", "WikidataI... |


Now we need to process the click history and impressions. We first need to decode impressions into clicks and non-clicks.

In [6]:
# Function to split the impressions into two separate lists: clicked and non-clicked items
def process_impression(impression_list):
    list_of_strings = impression_list.split()
    click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '1']
    non_click = [x.split('-')[0] for x in list_of_strings if x.split('-')[1] == '0']
    return click,non_click

# Apply the function to every row and store the results as two new columns:
raw_behaviour['click'], raw_behaviour['noclicks'] = zip(*raw_behaviour['impressions'].map(process_impression))
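
As a quick sanity check of the decoding step, the sketch below runs a slightly condensed variant of `process_impression` on the impressions string of the first row shown in the table above. It assumes every token has the form `<itemId>-<label>` with exactly one hyphen.

```python
def process_impression(impression_list):
    # Each token looks like "<itemId>-<label>", where label 1 = click and 0 = non-click
    pairs = [token.split("-") for token in impression_list.split()]
    click = [item for item, label in pairs if label == "1"]
    non_click = [item for item, label in pairs if label == "0"]
    return click, non_click

# Impressions string of the first row of the behaviour data
click, non_click = process_impression("N55689-1 N35729-0")
print(click, non_click)
```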

In [7]:
# Convert timestamp value to hours since epoch
raw_behaviour['epochhrs'] = pd.to_datetime(raw_behaviour['timestamp']).values.astype(np.int64)/(1e6)/1000/3600
raw_behaviour['epochhrs'] = raw_behaviour['epochhrs'].round()
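
To see what this conversion does for a single value, here is a small standard-library sketch of the same idea. `to_epoch_hours` is a hypothetical helper, and interpreting the timestamp as UTC is an assumption (the pandas code above likewise ignores time zones).

```python
from datetime import datetime, timezone

def to_epoch_hours(timestamp):
    # Parse the "MM/DD/YYYY HH:MM:SS AM/PM" format; treating it as UTC is an assumption
    parsed = datetime.strptime(timestamp, "%m/%d/%Y %I:%M:%S %p").replace(tzinfo=timezone.utc)
    # Seconds since the epoch -> hours since the epoch, rounded to the nearest hour
    return round(parsed.timestamp() / 3600)

hours = to_epoch_hours("11/11/2019 9:05:58 AM")
print(hours)  # 437073
```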

## Click History

In the dataset we can see that a large number of items and users do not have a sufficient number of clicks. This is because we are working with the small version of the MIND dataset, which contains 50k users instead of the 1 million users in the full version.
Therefore it will be hard to learn the user and item embeddings by relying only on the interactions, i.e. the `click` and `noclicks` columns.

To resolve this issue in the lab, we will expand the `click_history` column, which adds about 7 times more interactions than the original data. Note, however, that these events carry no information about which articles were shown to the user, i.e. the impressions or noclicks.
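
The expansion can be pictured with a small plain-Python sketch. The user IDs, item IDs and histories below are made up; the logic (keep each distinct user/history pair once, then emit one row per historical click with an empty `noclicks` list) is meant to mirror the idea, not to replace the pandas code.

```python
# Made-up impressions: (userId, click_history string); the same user can appear several times
impressions = [
    ("U1", "N1 N2 N3"),
    ("U1", "N1 N2 N3"),   # same history repeated in a later impression
    ("U2", "N2"),
    ("U3", None),          # user without any history
]

# Keep each distinct (user, history) pair once, dropping users without history
unique_histories = dict.fromkeys(pair for pair in impressions if pair[1] is not None)

# Emit one row per historical click; nothing is known about what else was shown
history_clicks = [
    {"userId": user, "click": item, "noclicks": []}
    for user, history in unique_histories
    for item in history.split()
]
print(len(history_clicks))
```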


In [8]:
# If there are several clicks in one impression, expand each click into its own observation
raw_behaviour = raw_behaviour.explode("click").reset_index(drop=True)

# Extract the clicks from the click history
click_history = raw_behaviour[["userId","click_history"]].drop_duplicates().dropna()
click_history["click_history"] = click_history.click_history.map(lambda x: x.split())
click_history = click_history.explode("click_history").rename(columns={"click_history":"click"})
# Dummy time set to the earliest epochhrs in raw_behaviour, as we don't know when these events took place.
click_history["epochhrs"] = raw_behaviour.epochhrs.min() 
# Historical clicks have no impression data: give every row an empty list (index passed explicitly so rows align)
click_history["noclicks"] = pd.Series([[] for _ in range(len(click_history))], index=click_history.index)

# concatenate historical clicks with the raw_behaviour
raw_behaviour = pd.concat([raw_behaviour,click_history],axis=0).reset_index(drop=True)
print(f"The dataset after pre-processing consist of {len(raw_behaviour)} number of interactions.")

The dataset after pre-processing consist of 1162402 number of interactions.


## Cold start problem

Even after this pre-processing and after adding the `click_history` to the click column, a large number of items still do not have a sufficient number of clicks. This can be thought of as a [cold start problem](https://en.wikipedia.org/wiki/Cold_start_(recommender_systems)).
To adjust for this, we will remove the items in `raw_behaviour` that have fewer clicks than `min_click_cutoff`.
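
The filtering idea can be sketched on a toy example. The interactions below are made up and the cutoff is lowered to 2 so that the effect is visible; the notebook itself uses a cutoff of 100.

```python
from collections import Counter

# Made-up (userId, clicked itemId) pairs; the cutoff is lowered to 2 for this toy example
interactions = [("U1", "N1"), ("U2", "N1"), ("U3", "N1"), ("U1", "N2"), ("U2", "N3"), ("U3", "N3")]
min_click_cutoff = 2

# Count clicks per item and keep only items that reach the cutoff
clicks_per_item = Counter(item for _, item in interactions)
kept_items = {item for item, n in clicks_per_item.items() if n >= min_click_cutoff}
filtered = [(user, item) for user, item in interactions if item in kept_items]

# Share of distinct items that are removed by the cutoff
share_removed = 1 - len(kept_items) / len(clicks_per_item)
print(sorted(kept_items), len(filtered), round(share_removed, 3))
```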
 

In [9]:
min_click_cutoff = 100
print(f'Number of items that have less than {min_click_cutoff} clicks make up',np.round(np.mean(raw_behaviour.groupby("click").size() < min_click_cutoff)*100,3),'% of the total, and these will be removed.') 

Number of items that have less than 100 clicks make up 93.852 % of the total, and these will be removed.


In [10]:
# Remove items with fewer clicks than min_click_cutoff
raw_behaviour = raw_behaviour[raw_behaviour.groupby("click")["userId"].transform('size') >= min_click_cutoff].reset_index(drop=True)
# Get a set with all the unique items
click_set = set(raw_behaviour['click'].unique())

# Remove non-clicked items that are not available in the click set (the items that we will be training on)
raw_behaviour['noclicks'] = raw_behaviour['noclicks'].apply(lambda impressions: [impression for impression in impressions if impression in click_set])

## Output of data preprocessing
In this preprocessing we have processed behaviour data, article data and user data. The main component is `behaviour`, and for collaborative filtering purposes this is all we need. However, if we want to utilize content data on the news items some additional preprocessing on the `news` dataframe must be applied.

In [11]:
## Select the columns that we now want to use for further analysis
behaviour = raw_behaviour[['epochhrs','userId','click','noclicks']].copy()

print('Number of interactions in the behaviour dataset:', behaviour.shape[0])
print('Number of users in the behaviour dataset:', behaviour.userId.nunique())
print('Number of articles in the behaviour dataset:', behaviour.click.nunique())

behaviour.head()

Number of interactions in the behaviour dataset: 781871
Number of users in the behaviour dataset: 49832
Number of articles in the behaviour dataset: 2451


| | epochhrs | userId | click | noclicks |
|---|---|---|---|---|
| 0 | 437073.0 | U13740 | N55689 | [N35729] |
| 1 | 437106.0 | U91836 | N17059 | [N20678, N39317, N58114, N20495, N42977, N1459... |
| 2 | 437143.0 | U73700 | N23814 | [N23877, N35389, N49712, N16844, N59685, N2344... |
| 3 | 437069.0 | U34670 | N49685 | [N35729, N33632, N27581] |
| 4 | 437083.0 | U19739 | N33619 | [] |


## Train / Test Split + indexing

Before we define our first model, we need to index the users and items in the behaviour dataframe,
as PyTorch requires integer indices instead of strings for user and item IDs.

We do this with two dictionaries:

- `ind2item`: maps the item indices used in behaviour to the real item ID given in the dataset.
- `ind2user`: maps the user indices used in behaviour to the real user ID given in the dataset.

Note that we also create `item2ind` and `user2ind` to do the reverse.

The indexing is built from the training data, so new, unseen articles and users in the validation set get the index 0.
We will use 90% of the data for training and 10% for validation. When splitting, it is important to divide the data by time (`epochhrs`): a regular random split would let the model train on interactions that happened after the ones it is validated on, which does not make sense for a recommender system.
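
A toy version of the split and indexing, in plain Python, may help before reading the pandas cell. The events and the threshold value are made up (the notebook takes the threshold from the 0.9 quantile of `epochhrs`); the point is how index 0 acts as a fallback for anything not seen during training.

```python
# Made-up events: (epochhrs, userId, clicked itemId)
events = [(1, "U1", "N1"), (2, "U2", "N2"), (3, "U1", "N2"), (4, "U3", "N1"), (5, "U2", "N9")]
test_time_th = 5  # hypothetical threshold; the notebook uses the 0.9 quantile of epochhrs

train_events = [e for e in events if e[0] < test_time_th]
valid_events = [e for e in events if e[0] >= test_time_th]

# Index 0 is reserved for unknown (UNK) items and users, so real indices start at 1
ind2item = {idx + 1: item for idx, item in enumerate(dict.fromkeys(e[2] for e in train_events))}
item2ind = {item: idx for idx, item in ind2item.items()}
ind2user = {idx + 1: user for idx, user in enumerate(dict.fromkeys(e[1] for e in train_events))}
user2ind = {user: idx for idx, user in ind2user.items()}

# Items or users first seen in the validation period fall back to index 0
valid_indexed = [(user2ind.get(u, 0), item2ind.get(i, 0)) for _, u, i in valid_events]
print(item2ind, valid_indexed)
```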


In [41]:
# Let us use the last 10pct of the data as our validation data:
test_time_th = behaviour['epochhrs'].quantile(0.9)
train = behaviour[behaviour['epochhrs']< test_time_th].copy()

## Indexize items
# Allocate a unique index for each item, but let the zeroth index be a UNK index:
ind2item = {idx +1: itemid for idx, itemid in enumerate(train.click.unique())}
item2ind = {itemid : idx for idx, itemid in ind2item.items()}
print(ind2item)
print(item2ind)

train['noclicks'] = train['noclicks'].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
train['click'] = train['click'].map(lambda item: item2ind.get(item, 0))

## Indexize users
# Allocate a unique index for each user, but let the zeroth index be a UNK index:
ind2user = {idx +1: userid for idx, userid in enumerate(train['userId'].unique())}
user2ind = {userid : idx for idx, userid in ind2user.items()}

# Create a new column with userIdx:
train['userIdx'] = train['userId'].map(lambda x: user2ind.get(x,0))

# Repeat for validation
valid =  behaviour[behaviour['epochhrs']>= test_time_th].copy()
valid["click"] = valid["click"].map(lambda item: item2ind.get(item, 0))
valid["noclicks"] = valid["noclicks"].map(lambda list_of_items: [item2ind.get(l, 0) for l in list_of_items])
valid["userIdx"] = valid["userId"].map(lambda x: user2ind.get(x,0))

{1: 'N55689', 2: 'N49685', 3: 'N33619', 4: 'N55204', 5: 'N53585', 6: 'N43073', 7: 'N19542', 8: 'N61022', 9: 'N63550', 10: 'N35729', 11: 'N47020', 12: 'N32182', 13: 'N31978', 14: 'N2823', 15: 'N6056', 16: 'N41881', 17: 'N49279', 18: 'N3344', 19: 'N60374', 20: 'N18870', 21: 'N20079', 22: 'N53017', 23: 'N47981', 24: 'N16804', 25: 'N13259', 26: 'N14184', 27: 'N65185', 28: 'N41224', 29: 'N59981', 30: 'N55582', 31: 'N24180', 32: 'N24272', 33: 'N47035', 34: 'N27961', 35: 'N36545', 36: 'N57614', 37: 'N30475', 38: 'N11830', 39: 'N15194', 40: 'N63970', 41: 'N28983', 42: 'N3123', 43: 'N62729', 44: 'N56193', 45: 'N26179', 46: 'N54125', 47: 'N23414', 48: 'N55645', 49: 'N36752', 50: 'N8957', 51: 'N22457', 52: 'N62360', 53: 'N13801', 54: 'N52000', 55: 'N3894', 56: 'N26262', 57: 'N64916', 58: 'N57097', 59: 'N54489', 60: 'N61768', 61: 'N6926', 62: 'N36186', 63: 'N5442', 64: 'N51398', 65: 'N31273', 66: 'N32544', 67: 'N41020', 68: 'N21707', 69: 'N40725', 70: 'N36789', 71: 'N27581', 72: 'N29128', 73: 'N32

# Modeling & Negative sampling
We want to make a matrix factorization model where each user $u$ has a d-dimensional parameter vector $z_u$ and each item $i$ has a parameter vector $v_i$.
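
One way to write the resulting click probability, consistent with the sigmoid of the dot product used in the model code of this notebook, is the following. The symbol $\hat{p}_{ui}$ is notation introduced here for illustration, and $\sigma$ denotes the logistic sigmoid.

$$
\hat{p}_{ui} = \sigma\left(z_u^\top v_i\right) = \frac{1}{1 + \exp\left(-\sum_{k=1}^{d} z_{uk} v_{ik}\right)}
$$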

To simplify the computation, and because we do not have `noclicks` for every interaction, we will only use two **known** things in the training phase: the `userIdx` and the `click`. However, as we want to model the binary behavior in terms of clicks and non-clicks, we will make use of something called negative sampling: for each known user-click combination we draw a random negative item to express the user's lack of preference for the sampled item.
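
A minimal sketch of uniform negative sampling, using only the standard library. The batch and the catalogue size are made up, and `sample_negatives` is a hypothetical helper. Note that with uniform sampling the negative can occasionally coincide with the clicked item; this sketch accepts that for simplicity.

```python
import random

def sample_negatives(clicks, num_items, seed=0):
    # Draw one uniformly random item index in [1, num_items - 1] per positive click.
    # Index 0 is skipped because it is reserved for unknown items.
    rng = random.Random(seed)
    return [rng.randint(1, num_items - 1) for _ in clicks]

# Made-up batch of (userIdx, click) pairs and a hypothetical catalogue of 10 indices (0..9)
batch = [(1, 3), (2, 7), (1, 5)]
negatives = sample_negatives([click for _, click in batch], num_items=10)

# Each training example becomes a (user, clicked item, sampled negative item) triple
triples = [(user, click, neg) for (user, click), neg in zip(batch, negatives)]
print(triples)
```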

In [13]:
class MindDataset(Dataset):
    # A fairly simple torch dataset module that can take a pandas dataframe (as above), 
    # and convert the relevant fields into a dictionary of arrays that can be used in a dataloader
    def __init__(self, df):
        # Create a dictionary of tensors out of the dataframe
        self.data = {
            'userIdx' : torch.tensor(df.userIdx.values.astype(np.int64)),
            'click' : torch.tensor(df.click.values.astype(np.int64))
        }
    def __len__(self):
        return len(self.data['userIdx'])
    def __getitem__(self, idx):
        return {key: val[idx] for key, val in self.data.items()}

In [14]:
# Build datasets and dataloaders of train and validation dataframes:
bs = 1024
ds_train = MindDataset(train)
train_loader = DataLoader(ds_train, batch_size=bs, shuffle=True)
ds_valid = MindDataset(valid)
valid_loader = DataLoader(ds_valid, batch_size=bs, shuffle=False)

batch = next(iter(train_loader))

## Model

#### Framework
We will use pytorch-lightning to define and train our model. It is a high-level framework (similar to fastAI) but with a slightly different way of defining things. It is my personal go-to framework and is very flexible. For more information, see https://pytorch-lightning.readthedocs.io/.

#### The model
We assume that each interaction goes as follows: the user is presented with two items, the click item and a no-click item, where the no-click item is chosen at random by negative sampling. After reviewing both items, she will choose the most relevant one. This can be modeled as a categorical distribution with two options (yes, you could do binomial). There is already a loss function in pytorch for this, `F.binary_cross_entropy`, that we will use: the clicked item gets target 1 and the sampled item gets target 0.
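
To get a feeling for the loss values, here is a small sketch that computes the binary cross entropy for a single click/no-click pair by hand, averaging over the two entries as the default mean reduction does. `pair_loss` is a hypothetical helper and the input scores are made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_loss(logit_click, logit_noclick):
    # Binary cross entropy averaged over the two entries:
    # target 1 for the clicked item, target 0 for the sampled negative
    p_click = sigmoid(logit_click)
    p_noclick = sigmoid(logit_noclick)
    return -(math.log(p_click) + math.log(1.0 - p_noclick)) / 2

# Uninformative scores give log(2); a well-separated pair gives a loss close to 0
print(pair_loss(0.0, 0.0), pair_loss(4.0, -4.0))
```

A loss that stays near $\log 2 \approx 0.693$ during training would therefore suggest that the model has not yet learned to separate clicked items from sampled ones.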

In [15]:
# Build a matrix factorization model
class NewsMF(pl.LightningModule):
    def __init__(self, num_users, num_items, dim = 10):
        super().__init__()
        self.dim=dim
        self.num_users = num_users
        self.num_items = num_items
        
        self.useremb = nn.Embedding(num_embeddings=num_users, embedding_dim=dim)
        self.itememb = nn.Embedding(num_embeddings=num_items, embedding_dim=dim)

        
    def step(self, batch, batch_idx, phase="train"):
        batch_size = batch['userIdx'].size(0)
        uservec = self.useremb(batch['userIdx'])       
        itemvec_click = self.itememb(batch['click'])
        
        # For each positive interaction,sample a random negative
        neg_sample = torch.randint_like(batch["click"],1,self.num_items)
        itemvec_noclick = self.itememb(neg_sample)
        
        score_click = torch.sigmoid((uservec*itemvec_click).sum(-1).unsqueeze(-1))
        score_noclick =  torch.sigmoid((uservec*itemvec_noclick).sum(-1).unsqueeze(-1))

        # Compute loss as binary cross entropy (categorical distribution between the clicked and the no clicked item)
        scores_all = torch.concat((score_click, score_noclick), dim=1)
        target_all = torch.concat((torch.ones_like(score_click), torch.zeros_like(score_noclick)),dim=1)
        loss = F.binary_cross_entropy(scores_all, target_all)
        return loss
    
    
    def training_step(self, batch, batch_idx):
        return self.step(batch, batch_idx, "train")
    
    def validation_step(self, batch, batch_idx):
        # for now, just do the same computation as during training
        return self.step(batch, batch_idx, "val")

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        return optimizer
    

In [16]:
# Define and train model
mf_model = NewsMF(num_users=len(ind2user) + 1, num_items = len(ind2item) + 1, dim = 50)
trainer = pl.Trainer(max_epochs=50, accelerator="gpu")
trainer.fit(model=mf_model, train_dataloaders=train_loader)

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/home/yvesito/.local/lib/python3.10/site-packages/pytorch_lightning/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
2023-11-29 07:29:46.202109: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-29 07:29:46.202214: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-29 07:29:46.209704: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
LOCAL_RANK: 0 - CUDA_V

Epoch 49: 100%|██████████| 686/686 [00:15<00:00, 44.30it/s, v_num=0]

`Trainer.fit` stopped: `max_epochs=50` reached.


Epoch 49: 100%|██████████| 686/686 [00:15<00:00, 44.08it/s, v_num=0]


In [17]:
## Add more information to the article data 
# The item index
news["ind"] = news["itemId"].map(item2ind)
news = news.sort_values("ind").reset_index(drop=True)
# Number of clicks in training data per article, investigate the cold start issue
news["n_click_training"] = news["ind"].map(dict(Counter(train.click)))
# 5 most clicked articles
news.sort_values("n_click_training",ascending=False).head()


| | itemId | category | subcategory | title | abstract | url | title_entities | abstract_entities | ind | n_click_training |
|---|---|---|---|---|---|---|---|---|---|---|
| 597 | N306 | movies | movies-celebrity | Kevin Spacey Won't Be Charged in Sexual Assaul... | The Los Angeles County District Attorney's Off... | https://assets.msn.com/labs/mind/AAJy6rv.html | [{"Label": "Kevin Spacey", "Type": "P", "Wikid... | [{"Label": "Kevin Spacey", "Type": "P", "Wikid... | 598.0 | 4802.0 |
| 0 | N55689 | sports | football_nfl | Charles Rogers, former Michigan State football... | Charles Rogers, the former Michigan State foot... | https://assets.msn.com/labs/mind/BBWAPO6.html | [{"Label": "Charles Rogers (American football)... | [{"Label": "2003 NFL Draft", "Type": "U", "Wik... | 1.0 | 4316.0 |
| 656 | N42620 | lifestyle | lifestylebuzz | Heidi Klum's 2019 Halloween Costume Transforma... | You might say she's scary good at playing dres... | https://assets.msn.com/labs/mind/AAJFlhi.html | [{"Label": "Heidi Klum", "Type": "P", "Wikidat... | [{"Label": "Heidi Klum", "Type": "P", "Wikidat... | 657.0 | 4047.0 |
| 10 | N47020 | news | newsopinion | The News In Cartoons | News as seen through the eyes of the nation's ... | https://assets.msn.com/labs/mind/AAJ7oYd.html | [] | [] | 11.0 | 3545.0 |
| 9 | N35729 | news | newsus | Porsche launches into second story of New Jers... | The Porsche went airborne off a median in Toms... | https://assets.msn.com/labs/mind/BBWyjM9.html | [{"Label": "Porsche", "Type": "O", "WikidataId... | [{"Label": "Porsche", "Type": "O", "WikidataId... | 10.0 | 3346.0 |


In [29]:
# Store the learned item embeddings in a separate tensor (moved to the CPU so it can be converted to numpy later)
itememb = mf_model.itememb.weight.detach().cpu()
print(itememb.shape)

torch.Size([2279, 50])
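
Each row of this matrix is the embedding of one article, so articles can be compared through the cosine similarity of their rows. The sketch below illustrates the ranking idea with made-up two-dimensional vectors (not the learned embeddings), using only the standard library; `toy_itememb` and `query_ind` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up 2-dimensional item embeddings; row 0 plays the role of the UNK index
toy_itememb = [[0.1, 0.1], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
query_ind = 1

# One similarity value per item (compare the query row with every row)
similarity = [cosine_similarity(toy_itememb[query_ind], row) for row in toy_itememb]
# Rank all real items (index >= 1) from most to least similar; the query itself comes first
ranking = sorted(range(1, len(toy_itememb)), key=lambda i: similarity[i], reverse=True)
print(ranking)
```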


In [52]:
## Investigate different rows of the item embedding (article embeddings) to see if the model works
## some examples N13259, N16636, N10272

ind = item2ind.get("N55528") 
if ind is None:
    # The article is not among the training items, so there is no embedding to inspect
    news["ind"] = news["itemId"].map(item2ind)
    news = news.sort_values("ind").reset_index(drop=True)
    # Number of clicks in training data per article, investigate the cold start issue
    news["n_click_training"] = news["ind"].map(dict(Counter(train.click)))
    # 5 least clicked articles (ascending sort)
    print(news.sort_values("n_click_training",ascending=True).head())
else:
    # Cosine similarity between the chosen article and every item embedding (one value per item, hence dim=1)
    similarity = F.cosine_similarity(itememb[ind].unsqueeze(0), itememb, dim=1)
    # Skip the UNK row 0: position p in similarity[1:] belongs to item index p + 1, i.e. row p of the indexed news
    order = similarity[1:].argsort(descending=True).cpu().numpy()
    most_sim = news[~news.ind.isna()].iloc[order]
    print(most_sim.head(5))


     itemId       category              subcategory  \
455  N20678         sports              more_sports   
418  N64094           news                   newsus   
488  N18887         sports     football_ncaa_videos   
476  N14592  entertainment  entertainment-celebrity   
445   N1940      lifestyle      lifestyledidyouknow   

                                                 title  \
455  Bode Miller delivered his twin boys after midw...   
418  Iowa paid a security firm to break into a cour...   
488            Is the Alabama Football Dynasty Ending?   
476  Chrissy Teigen's weekend was basically a clapb...   
445  1990s Mall Rats Will Definitely Remember These...   

                                              abstract  \
455  Bode Miller added yet another impressive title...   
418  A pair of security workers at a prominent cybe...   
488  Is the Alabama football dynasty ending? Max Br...   
476  Chrissy Teigen gives us two more lessons in th...   
445  How could we forget those