# Getting started

In this notebook, we illustrate how to use the Neural News Recommendation with Multi-Head Self-Attention ([NRMS](https://aclanthology.org/D19-1671/)) model. The implementation is taken from the [recommenders](https://github.com/recommenders-team/recommenders) repository. We have stripped the model down to keep the code cleaner.

We use a small dataset, which is downloaded from [recsys.eb.dk](https://recsys.eb.dk/). All the datasets are stored in the folder path ```~/ebnerd_data/*```.

## Load functionality

In [1]:
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl
import datetime

from ebrec.utils._constants import *

from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_prediction_scores,
    truncate_history,
    ebnerd_from_path,
)
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore
from ebrec.utils._articles import convert_text2encoding_with_transformers
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings
from ebrec.utils._python import write_submission_file, rank_predictions_by_score

from ebrec.models.newsrec.dataloader import NRMSDataLoader
from ebrec.models.newsrec.model_config import hparams_nrms
from ebrec.models.newsrec import NRMSModel



In [2]:
# List all physical devices
gpus = tf.config.experimental.list_physical_devices("GPU")
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

physical_devices = tf.config.list_physical_devices()
print("Available devices:", physical_devices)

Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


## Load dataset

### Generate labels
We sample a few impressions just to get started. For the test set, we make up a dummy label column of 0s and 1s; these are not the true labels.

In [None]:
# Root folder that holds the downloaded datasets
PATH = Path("~/ebnerd_data").expanduser()
DATASPLIT = "ebnerd_small"
DUMP_DIR = Path("ebnerd_predictions")
DUMP_DIR.mkdir(exist_ok=True, parents=True)

History size can often be a memory bottleneck. If you change it, the NRMS hyperparameter ```history_size``` must be updated to match, so that the model stays compatible with the data and memory usage stays manageable.

In [4]:
HISTORY_SIZE = 20
hparams_nrms.history_size = HISTORY_SIZE
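
As a rough illustration of what a fixed history size implies, the sketch below truncates a click history to its most recent entries and left-pads shorter histories with `0`. The helper `truncate_and_pad` is hypothetical, and keeping the tail with left-padding is an assumption made for illustration; it is not necessarily how `truncate_history` behaves.

```python
from typing import List


def truncate_and_pad(history: List[int], size: int, pad: int = 0) -> List[int]:
    """Keep the last `size` article IDs; left-pad shorter histories with `pad`."""
    if size <= 0:
        raise ValueError("size must be positive")
    tail = history[-size:]
    return [pad] * (size - len(tail)) + tail


# Toy article IDs (made up for illustration)
print(truncate_and_pad([11, 12, 13, 14, 15], size=3))  # [13, 14, 15]
print(truncate_and_pad([11, 12], size=3))              # [0, 11, 12]
```

Every user then contributes a vector of the same length, which is why the model's `history_size` has to agree with the value used when loading the data.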

In [5]:
# We just want to load the necessary columns
COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_IMPRESSION_ID_COL,
    DEFAULT_IMPRESSION_TIMESTAMP_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
]
# This notebook is a simple 'get-started' example; set FRACTION below 1 to downsample and run through it quickly.
FRACTION = 1

In our project, we used the EB-NeRD small dataset to test the model's performance, which allowed us to iterate quickly. We split the data by time: the threshold is the maximum impression timestamp minus one day. All impressions before the threshold form the training set, and those at or after it form the validation set. We also experimented with random sampling, but for a news recommendation system a time-based split is likely more appropriate, since it avoids training on impressions that occur later than those used for validation.
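
The split rule can be sketched on toy data in plain Python. The helper `split_by_time` and the sample rows are hypothetical; the notebook itself applies the same rule with Polars expressions.

```python
import datetime as dt
from typing import Dict, List, Tuple

Row = Dict[str, object]


def split_by_time(
    rows: List[Row], days: int = 1, time_key: str = "impression_time"
) -> Tuple[List[Row], List[Row]]:
    """Rows before (max timestamp - days) go to train; the rest go to validation."""
    cutoff = max(r[time_key] for r in rows) - dt.timedelta(days=days)
    train = [r for r in rows if r[time_key] < cutoff]
    valid = [r for r in rows if r[time_key] >= cutoff]
    return train, valid


# Toy impressions (made-up values, for illustration only)
rows = [
    {"impression_id": 1, "impression_time": dt.datetime(2023, 5, 20, 8, 0)},
    {"impression_id": 2, "impression_time": dt.datetime(2023, 5, 21, 9, 0)},
    {"impression_id": 3, "impression_time": dt.datetime(2023, 5, 22, 7, 0)},
    {"impression_id": 4, "impression_time": dt.datetime(2023, 5, 22, 10, 0)},
]
train, valid = split_by_time(rows)
print([r["impression_id"] for r in train])  # [1, 2]
print([r["impression_id"] for r in valid])  # [3, 4]
```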

In [6]:
df = (
    ebnerd_from_path(
        PATH.joinpath(DATASPLIT, "train"),
        history_size=HISTORY_SIZE,
        padding=0,
    )
    .select(COLUMNS)
    .pipe(
        sampling_strategy_wu2019,
        npratio=4,
        shuffle=True,
        with_replacement=True,
    )
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)

dt_split = pl.col(DEFAULT_IMPRESSION_TIMESTAMP_COL).max() - datetime.timedelta(days=1)
df_train = df.filter(pl.col(DEFAULT_IMPRESSION_TIMESTAMP_COL) < dt_split)
df_validation = df.filter(pl.col(DEFAULT_IMPRESSION_TIMESTAMP_COL) >= dt_split)

print(f"Train samples: {df_train.height}\nValidation samples: {df_validation.height}")
df_train.head(2)

Train samples: 21462
Validation samples: 3426


| user_id | impression_id | impression_time | article_id_fixed | article_ids_clicked | article_ids_inview | labels |
| --- | --- | --- | --- | --- | --- | --- |
| u32 | u32 | datetime[μs] | list[i32] | list[i64] | list[i64] | list[i8] |
| 22779 | 48401 | 2023-05-21 21:06:50 | [9769380, 9769781, … 9770541] | [9759966] | [9774461, 9774516, … 9774461] | [0, 0, … 0] |
| 1001055 | 214679 | 2023-05-23 05:25:40 | [9769135, 9767604, … 9769981] | [9776566] | [9776855, 9776071, … 9776855] | [0, 0, … 0] |
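
To make the sampling step more concrete, here is a minimal sketch of negative sampling with `npratio` negatives per clicked article, followed by binary labels. The function `sample_impression` is hypothetical and only approximates the idea; the actual behavior of `sampling_strategy_wu2019` and `create_binary_labels_column` may differ in detail (for example, in when sampling with replacement is used).

```python
import random
from typing import List, Tuple


def sample_impression(
    inview: List[int], clicked: List[int], npratio: int, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Return (article_ids, labels): one clicked article plus `npratio` negatives."""
    positives = [a for a in inview if a in clicked]
    negatives = [a for a in inview if a not in clicked]
    if not positives or not negatives:
        raise ValueError("need at least one clicked and one non-clicked article")
    pos = rng.choice(positives)
    if len(negatives) >= npratio:
        negs = rng.sample(negatives, npratio)
    else:
        # Too few negatives in view: draw with replacement (assumption)
        negs = rng.choices(negatives, k=npratio)
    candidates = negs + [pos]
    rng.shuffle(candidates)
    labels = [1 if a == pos else 0 for a in candidates]
    return candidates, labels


rng = random.Random(42)
ids, labels = sample_impression(
    inview=[101, 102, 103, 104, 105, 106], clicked=[104], npratio=4, rng=rng
)
print(ids, labels)
```

With `npratio=4`, every training row ends up with five candidates and exactly one positive label.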


### Test set
We'll use the validation set as the test set.

In [7]:
df_test = (
    ebnerd_from_path(
        PATH.joinpath(DATASPLIT, "validation"),
        history_size=HISTORY_SIZE,
        padding=0,
    )
    .select(COLUMNS)
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)

## Load articles

In [8]:
df_articles = pl.read_parquet(PATH.joinpath("articles.parquet"))
df_articles.head(2)

| article_id | title | subtitle | last_modified_time | premium | body | published_time | image_ids | article_type | url | ner_clusters | entity_groups | topics | category | subcategory | category_str | total_inviews | total_pageviews | total_read_time | sentiment_score | sentiment_label |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| i32 | str | str | datetime[μs] | bool | str | datetime[μs] | list[i64] | str | str | list[str] | list[str] | list[str] | i16 | list[i16] | str | i32 | i32 | f32 | f32 | str |
| 3037230 | "Ishockey-spill… | "ISHOCKEY: Isho… | 2023-06-29 06:20:57 | False | "Ambitionerne o… | 2003-08-28 08:55:00 |  | "article_defaul… | "https://ekstra… | [] | [] | ["Kriminalitet", "Kendt", … "Mindre ulykke"] | 142 | [327, 334] | "sport" |  |  |  | 0.9752 | "Negative" |
| 3044020 | "Prins Harry tv… | "Hoffet tvang P… | 2023-06-29 06:21:16 | False | "Den britiske t… | 2005-06-29 08:47:00 | [3097307, 3097197, 3104927] | "article_defaul… | "https://ekstra… | ["Harry", "James Hewitt"] | ["PER", "PER"] | ["Kriminalitet", "Kendt", … "Personfarlig kriminalitet"] | 414 | [432] | "underholdning" |  |  |  | 0.7084 | "Negative" |


## Init model using Hugging Face's tokenizer and word embeddings
The original implementation uses GloVe embeddings and a matching tokenizer. To get going fast, we use a multilingual transformer model from Hugging Face instead: its tokenizer tokenizes the articles, and its word embeddings initialize NRMS.


In [None]:
import torch
import torch.nn.functional as F
import datetime

# Format the published time column by converting it to a string format
df_articles = df_articles.with_columns(
    df_articles["published_time"].dt.strftime('%Y-%m-%d %H:%M:%S').alias('formatted_time')
)

# Function to generate time embeddings
def generate_time_embeddings(timestamps, embedding_dim=16):
    # Convert timestamp strings to datetime objects
    time_features = [datetime.datetime.strptime(ts, "%Y-%m-%d %H:%M:%S") for ts in timestamps]
    # seconds_since_start = [(t - min(time_features)).total_seconds() for t in time_features]

    # Get the current time
    now = datetime.datetime.now()

    # Calculate the time difference in seconds between each timestamp and the current time
    seconds_since_now = [(now - t).total_seconds() for t in time_features]

    # Convert the time differences to a tensor
    time_tensor = torch.tensor(seconds_since_now, dtype=torch.float32).unsqueeze(1)

    # Normalize the time tensor to the range [0, 1]
    time_tensor = (time_tensor - time_tensor.min()) / (time_tensor.max() - time_tensor.min())

    # Use a linear layer to map the time values to a higher dimensional space.
    # Note: the layer is randomly initialized on every call and is not trained here.
    time_emb_layer = torch.nn.Linear(1, embedding_dim)
    time_embeddings = time_emb_layer(time_tensor)

    # Clamp negative outputs to zero (equivalent to applying a ReLU)
    time_embeddings = torch.clamp(time_embeddings, min=0)

    return time_embeddings

# Get timestamps and generate time embeddings
timestamps = df_articles['formatted_time'].to_list()
time_embeddings = generate_time_embeddings(timestamps, embedding_dim=30)
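
The normalization step can be looked at in isolation. The sketch below computes each article's age relative to a reference time and min-max scales it to [0, 1], with a guard for the case where all timestamps are identical. The helper `normalized_age` and the fixed reference time are assumptions for illustration; the cell above uses `datetime.datetime.now()` and has no such guard.

```python
import datetime as dt
from typing import List


def normalized_age(published: List[dt.datetime], reference: dt.datetime) -> List[float]:
    """Min-max scale each article's age (seconds before `reference`) to [0, 1]."""
    ages = [(reference - t).total_seconds() for t in published]
    lo, hi = min(ages), max(ages)
    if hi == lo:
        # All timestamps identical: avoid division by zero
        return [0.0 for _ in ages]
    return [(a - lo) / (hi - lo) for a in ages]


# Toy publication times (made up for illustration)
published = [
    dt.datetime(2023, 5, 1),
    dt.datetime(2023, 5, 11),
    dt.datetime(2023, 5, 21),
]
print(normalized_age(published, reference=dt.datetime(2023, 6, 1)))  # [1.0, 0.5, 0.0]
```

Because min-max scaling cancels any constant shift, the scaled ages do not depend on which reference time is chosen; the oldest article always maps to 1 and the newest to 0.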

In [None]:
TRANSFORMER_MODEL_NAME = "FacebookAI/xlm-roberta-base"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]
MAX_TITLE_LENGTH = 30

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

# We'll init the word embeddings using the transformer model's word embeddings
word2vec_embedding = get_transformers_word_embeddings(transformer_model)

print(word2vec_embedding)

# Concatenate the text columns and tokenize the result
df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE)
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)

import numpy as np

# Convert the Polars Series of tokenized text (token IDs) to a nested list
text_embeddings_list = df_articles[token_col_title].to_list()

# Convert nested list to 2D NumPy array
text_embeddings_np = np.vstack(text_embeddings_list)

# Convert the token-ID array to a float PyTorch tensor
text_embeddings = torch.tensor(text_embeddings_np, dtype=torch.float32)

# Print both shapes; they must match for the element-wise addition below
print(f"Text Embeddings Shape: {text_embeddings.shape}")
print(f"Time Embeddings Shape: {time_embeddings.shape}")



# Combine text and time embeddings
# You can choose to concatenate or add the embeddings depending on your model's requirement
# article_embeddings = torch.cat([text_embeddings, time_embeddings], dim=1)

# Add the transformed time embeddings to the text embeddings
article_embeddings = text_embeddings + time_embeddings

# Convert the combined embeddings to a NumPy array
article_embeddings_np = article_embeddings.detach().cpu().numpy()

# Add the combined embeddings as a new column in df_articles
df_articles = df_articles.with_columns(
    pl.Series(name="combined_embeddings", values=article_embeddings_np.tolist())
)

# Create a new article mapping using the newly added column
article_mapping = create_article_id_to_value_mapping(
    df=df_articles,
    value_col="combined_embeddings"  # Pass the column name instead of the tensor
)

print(article_mapping)



[[ 0.16296387  0.15075684  0.16040039 ...  0.13513184  0.18603516
   0.06726074]
 [-0.00733566  0.0048027  -0.0078125  ...  0.0078125   0.00408936
  -0.0078125 ]
 [ 0.20458984  0.2548828   0.13293457 ...  0.17382812  0.02458191
   0.25097656]
 ...
 [ 0.3828125  -0.4440918   0.13977051 ...  0.22680664  0.05603027
   0.1038208 ]
 [ 0.02050781 -0.12432861  0.01914978 ... -0.01432037  0.03759766
  -0.11444092]
 [ 0.10089111  0.05877686  0.05508423 ...  0.12316895 -0.00642776
   0.12164307]]
Text Embeddings Shape: torch.Size([11777, 30])
Time Embeddings Shape: torch.Size([11777, 30])
{3037230: shape: (30,)
Series: '' [f64]
[
	88.034286
	87218.085938
	441.0
	186104.0
	12.746441
	2071.0
	91621.0
	9.0
	47405.335938
	34.358368
	58685.863281
	49307.0
	…
	104777.148438
	19.0
	203.836441
	183674.1875
	17.0
	4602.693359
	4.053567
	1953.0
	778.537842
	13595.384766
	291.0
	1354.109741
	11380.431641
], 3044020: shape: (30,)
Series: '' [f64]
[
	145519.015625
	126.159904
	5044.0
	1463.0
	24795.691406
	7

## Initiate the dataloaders
In this implementation, we have decoupled the models from the data. Hence, you should build a dataloader that fits your needs.

Note that with this ```NRMSDataLoader```, ```eval_mode=False``` is meant for ```model.model.fit()``` whereas ```eval_mode=True``` is meant for ```model.scorer.predict()```.

In [11]:
BATCH_SIZE = 32

train_dataloader = NRMSDataLoader(
    behaviors=df_train,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=BATCH_SIZE,
)
val_dataloader = NRMSDataLoader(
    behaviors=df_validation, 
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=BATCH_SIZE,
)
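
To give a feel for what `article_dict` and `unknown_representation="zeros"` imply, here is a minimal sketch of an ID-to-vector lookup with an all-zero fallback. The function `lookup_tokens` and the toy values are hypothetical; this is not the `NRMSDataLoader` implementation, only an illustration of the idea that padded or unseen article IDs map to a zero vector.

```python
from typing import Dict, List


def lookup_tokens(
    article_ids: List[int], mapping: Dict[int, List[int]], title_size: int
) -> List[List[int]]:
    """Map article IDs to token vectors; unknown IDs become all-zero vectors."""
    return [
        list(mapping[a]) if a in mapping else [0] * title_size
        for a in article_ids
    ]


# Toy mapping with made-up token IDs; ID 0 plays the role of history padding
mapping = {9774461: [5, 17, 3], 9774516: [8, 2, 0]}
print(lookup_tokens([9774461, 0, 9774516], mapping, title_size=3))
# [[5, 17, 3], [0, 0, 0], [8, 2, 0]]
```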

## Train the model


In [12]:
# List all physical devices
physical_devices = tf.config.list_physical_devices()
print("Available devices:", physical_devices)

Available devices: [PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU')]


Initialize the NRMS model:

In [13]:
model = NRMSModel(
    hparams=hparams_nrms,
    word2vec_embedding=word2vec_embedding,
    seed=42,
)
model.model.compile(
    optimizer=model.model.optimizer,
    loss=model.model.loss,
    metrics=["AUC"],
)

MODEL_NAME = model.__class__.__name__
MODEL_WEIGHTS = DUMP_DIR.joinpath(f"state_dict/{MODEL_NAME}/weights")
LOG_DIR = DUMP_DIR.joinpath(f"runs/{MODEL_NAME}")



### Callbacks
We will add some callbacks to model training.

In [14]:
# Tensorboard:
tensorboard_callback = tf.keras.callbacks.TensorBoard(
    log_dir=LOG_DIR,
    histogram_freq=1,
)

# Earlystopping:
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_auc",
    mode="max",
    patience=3,
    restore_best_weights=True,
)

# ModelCheckpoint:
modelcheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=MODEL_WEIGHTS,
    monitor="val_auc",
    mode="max",
    save_best_only=True,
    save_weights_only=True,
    verbose=1,
)

# Learning rate scheduler:
lr_scheduler = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_auc",
    mode="max",
    factor=0.2,
    patience=2,
    min_lr=1e-6,
)

callbacks = [tensorboard_callback, early_stopping, modelcheckpoint, lr_scheduler]
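
The stopping rule behind `patience=3` with `mode="max"` can be sketched in a few lines of plain Python. The function `early_stopping_epoch` is a hypothetical simplification (no `min_delta`, no weight restoring), and the scores are made-up values, not results from this notebook.

```python
from typing import List, Optional


def early_stopping_epoch(scores: List[float], patience: int) -> Optional[int]:
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("-inf")
    wait = 0
    for epoch, score in enumerate(scores, start=1):
        if score > best:
            # Strict improvement resets the counter
            best, wait = score, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


val_auc = [0.56, 0.59, 0.58, 0.60, 0.59, 0.60, 0.59]  # hypothetical values
print(early_stopping_epoch(val_auc, patience=3))  # 7
```

In this toy run the best score is reached at epoch 4, and training stops after three further epochs without a strict improvement.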

In [15]:
USE_CALLBACKS = True
EPOCHS = 50

hist = model.model.fit(
    train_dataloader,
    validation_data=val_dataloader,
    epochs=EPOCHS,
    callbacks=callbacks if USE_CALLBACKS else [],
)

Epoch 1/50
Epoch 1: val_auc improved from -inf to 0.56141, saving model to ebnerd_predictions/state_dict/NRMSModel/weights
Epoch 2/50
Epoch 2: val_auc improved from 0.56141 to 0.59162, saving model to ebnerd_predictions/state_dict/NRMSModel/weights
Epoch 3/50
Epoch 3: val_auc did not improve from 0.59162
Epoch 4/50
Epoch 4: val_auc improved from 0.59162 to 0.60193, saving model to ebnerd_predictions/state_dict/NRMSModel/weights
Epoch 5/50
Epoch 5: val_auc did not improve from 0.60193
Epoch 6/50
Epoch 6: val_auc did not improve from 0.60193
Epoch 7/50
Epoch 7: val_auc did not improve from 0.60193


In [16]:
if USE_CALLBACKS:
    _ = model.model.load_weights(filepath=MODEL_WEIGHTS)

## Example: how to compute some metrics

In [17]:
BATCH_SIZE_TEST = 16

test_dataloader = NRMSDataLoader(
    behaviors=df_test,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=True,
    batch_size=BATCH_SIZE_TEST,
)

In [18]:
pred_test = model.scorer.predict(test_dataloader)



## Add the predictions to the dataframe

In [19]:
df_test = add_prediction_scores(df_test, pred_test.tolist())
df_test.head(2)

| user_id | impression_id | impression_time | article_id_fixed | article_ids_clicked | article_ids_inview | labels | scores |
| --- | --- | --- | --- | --- | --- | --- | --- |
| u32 | u32 | datetime[μs] | list[i32] | list[i32] | list[i32] | list[i8] | list[f64] |
| 76658 | 144772 | 2023-05-30 14:21:34 | [9759544, 9775331, … 9779045] | [9783042] | [9788239, 9780702, … 9783042] | [0, 0, … 1] | [0.938482, 0.887123, … 0.875234] |
| 76658 | 144777 | 2023-05-30 14:22:11 | [9759544, 9775331, … 9779045] | [9788125] | [9788521, 9786217, … 9788125] | [0, 0, … 1] | [0.060727, 0.40817, … 0.935227] |


### Compute metrics

In [20]:
metrics = MetricEvaluator(
    labels=df_test["labels"].to_list(),
    predictions=df_test["scores"].to_list(),
    metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
)
metrics.evaluate()



<MetricEvaluator class>: 
 {
    "auc": 0.5440755358626681,
    "mrr": 0.3365923404301754,
    "ndcg@5": 0.3746772140764408,
    "ndcg@10": 0.45531741494273764
}
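
For intuition, the three metric families can be computed by hand for a single impression. The functions below use common textbook definitions and are a sketch only; `ebrec.evaluation` may differ in detail (for example, in tie handling or in how results are averaged over impressions). The labels and scores are toy values.

```python
import math
from typing import List


def auc(labels: List[int], scores: List[float]) -> float:
    """Fraction of (clicked, non-clicked) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels: List[int], scores: List[float]) -> float:
    """Mean reciprocal rank of the clicked articles."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    rr = [labels[i] / (rank + 1) for rank, i in enumerate(order)]
    return sum(rr) / sum(labels)


def ndcg_at_k(labels: List[int], scores: List[float], k: int) -> float:
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    def dcg(ordered: List[int]) -> float:
        return sum((2**rel - 1) / math.log2(pos + 2) for pos, rel in enumerate(ordered[:k]))

    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return dcg([labels[i] for i in order]) / dcg(sorted(labels, reverse=True))


labels = [0, 0, 1, 0]          # toy impression: third article was clicked
scores = [0.2, 0.9, 0.7, 0.1]  # toy model scores
print(round(auc(labels, scores), 4))           # 0.6667
print(mrr(labels, scores))                     # 0.5
print(round(ndcg_at_k(labels, scores, 5), 4))  # 0.6309
```

Here the clicked article is ranked second, so it beats two of the three non-clicked articles and has a reciprocal rank of 1/2.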

## Make submission file

In [21]:
df_test = df_test.with_columns(
    pl.col("scores")
    .map_elements(lambda x: list(rank_predictions_by_score(x)))
    .alias("ranked_scores")
)
df_test.head(2)

| user_id | impression_id | impression_time | article_id_fixed | article_ids_clicked | article_ids_inview | labels | scores | ranked_scores |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| u32 | u32 | datetime[μs] | list[i32] | list[i32] | list[i32] | list[i8] | list[f64] | list[i64] |
| 76658 | 144772 | 2023-05-30 14:21:34 | [9759544, 9775331, … 9779045] | [9783042] | [9788239, 9780702, … 9783042] | [0, 0, … 1] | [0.938482, 0.887123, … 0.875234] | [1, 2, … 3] |
| 76658 | 144777 | 2023-05-30 14:22:11 | [9759544, 9775331, … 9779045] | [9788125] | [9788521, 9786217, … 9788125] | [0, 0, … 1] | [0.060727, 0.40817, … 0.935227] | [7, 5, … 1] |
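
As the table suggests, the highest score receives rank 1 and the ranks stay aligned with the order of the in-view articles. A minimal sketch of such a conversion is shown below; `scores_to_ranks` is a hypothetical stand-in, and breaking ties by position is an assumption rather than a documented property of `rank_predictions_by_score`.

```python
from typing import List


def scores_to_ranks(scores: List[float]) -> List[int]:
    """Rank 1 goes to the highest score; output is aligned with the input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


print(scores_to_ranks([0.06, 0.41, 0.94]))  # [3, 2, 1]
```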


This example uses the validation set; to create a real submission, run the same flow on the test set.

In [22]:
write_submission_file(
    impression_ids=df_test[DEFAULT_IMPRESSION_ID_COL],
    prediction_scores=df_test["ranked_scores"],
    path=DUMP_DIR.joinpath("predictions.txt"),
    filename_zip=f"{DATASPLIT}_predictions-{MODEL_NAME}.zip",
)

Zipping ebnerd_predictions/predictions.txt to ebnerd_predictions/ebnerd_small_predictions-NRMSModel.zip


# DONE 🚀