# Getting started

In this notebook, we illustrate how to use the Neural News Recommendation with Multi-Head Self-Attention model ([NRMS](https://aclanthology.org/D19-1671/)). The implementation is taken from the [recommenders](https://github.com/recommenders-team/recommenders) repository; we have only stripped the model down to keep the code cleaner.

We use a small dataset, which is downloaded from [recsys.eb.dk](https://recsys.eb.dk/). All the datasets are stored in the folder path ```~/ebnerd_data/*```.

## Load functionality

In [1]:
%cd ../../src/ 
%ls

e:\TUWmaster\1. semester\recsys\Group_9\src
 Volume in drive E is New Volume
 Volume Serial Number is 4871-6226

 Directory of e:\TUWmaster\1. semester\recsys\Group_9\src

07/02/2024  07:24 PM    <DIR>          .
07/02/2024  07:24 PM    <DIR>          ..
07/02/2024  05:58 PM                 0 .gitkeep
07/02/2024  07:19 PM                 0 __init__.py
07/02/2024  07:24 PM    <DIR>          ebrec
               2 File(s)              0 bytes
               3 Dir(s)  105,881,354,240 bytes free


In [2]:
import sys
sys.path.append('/src') 

In [2]:
import tensorflow as tf

if tf.test.gpu_device_name():
    print('Default GPU Device: {}'.format(tf.test.gpu_device_name()))
else:
    print("Please install GPU version of TF")

Default GPU Device: /device:GPU:0


In [5]:
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl
import os

from ebrec.utils._constants import (
    DEFAULT_HISTORY_ARTICLE_ID_COL, 
    DEFAULT_CLICKED_ARTICLES_COL, 
    DEFAULT_INVIEW_ARTICLES_COL, 
    DEFAULT_IMPRESSION_ID_COL, 
    DEFAULT_SUBTITLE_COL, 
    DEFAULT_LABELS_COL,
    DEFAULT_TITLE_COL, 
    DEFAULT_USER_COL, 
)

from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019, 
    add_known_user_column,
    add_prediction_scores, 
    truncate_history, 
)
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore 
from ebrec.utils._articles import convert_text2encoding_with_transformers 
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes 
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings 
from ebrec.utils._python import write_submission_file, rank_predictions_by_score

from ebrec.models.newsrec.dataloader import NRMSDataLoader 
from ebrec.models.newsrec.model_config import hparams_nrms 
from ebrec.models.newsrec import NRMSModel

## Load dataset

In [6]:
def ebnerd_from_path(path: Path, history_size: int = 30) -> pl.DataFrame:
    """
    Load the behaviors of an EBNeRD split and left-join each user's truncated click history.
    """
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
            enable_warning=False,
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
        .collect()
        .pipe(
            slice_join_dataframes,
            df2=df_history.collect(),
            on=DEFAULT_USER_COL,
            how="left",
        )
    )
    return df_behaviors
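
The `history_size` and `padding_value` arguments above give every user a click history of the same length. The following is a minimal pure-Python sketch of that idea, not the library's implementation: `truncate_and_pad` is a hypothetical helper, and both keeping the most recent ids and padding on the left are assumptions made for illustration.

```python
from typing import List


def truncate_and_pad(
    history: List[int], history_size: int, padding_value: int = 0
) -> List[int]:
    """Hypothetical helper: return a history of exactly `history_size` ids."""
    # Assumption: the most recent clicks are at the end of the list, so keep the tail.
    recent = history[-history_size:] if history_size > 0 else []
    # Assumption: short histories are padded on the left.
    padding = [padding_value] * (history_size - len(recent))
    return padding + recent


print(truncate_and_pad([11, 12, 13, 14, 15], history_size=3))  # [13, 14, 15]
print(truncate_and_pad([11, 12], history_size=4))  # [0, 0, 11, 12]
```

A fixed length lets the histories be stacked into a single array per batch, which is why the padding value must not collide with a real article id.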

### Generate labels
We sample only a subset of the data to get started. For the test set we create a dummy label column of 0s and 1s; these are not the true labels.

In [8]:
notebook_dir = Path(os.getcwd()).parent
PATH = notebook_dir / "data"
DATASPLIT = "ebnerd_small"

print(PATH)

/content/drive/MyDrive/RECSYS/downloads


In this example we sample the dataset to keep it small. The test set can be added in the same way as the validation set.

In [9]:
COLUMNS = [
    DEFAULT_USER_COL, 
    DEFAULT_HISTORY_ARTICLE_ID_COL, 
    DEFAULT_INVIEW_ARTICLES_COL, 
    DEFAULT_CLICKED_ARTICLES_COL, 
    DEFAULT_IMPRESSION_ID_COL, 
]
HISTORY_SIZE = 10 
FRACTION = 1.0 

df_train = (
    ebnerd_from_path(PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(
        sampling_strategy_wu2019,
        npratio=4, 
        shuffle=True,
        with_replacement=True,
        seed=123,
    )
    .pipe(create_binary_labels_column) 
    .sample(fraction=FRACTION)
)
# =>
df_validation = (
    ebnerd_from_path(PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(create_binary_labels_column) 
    .sample(fraction=FRACTION)
)
df_train.head(2)

user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels
u32,list[i32],list[i64],list[i64],u32,list[i8]
139836,"[9748041, 4800276, … 9765156]","[9778669, 9778728, … 9778657]",[9778657],149474,"[0, 0, … 1]"
143471,"[9770798, 9769306, … 9770989]","[9778669, 9778623, … 9778682]",[9778623],150528,"[0, 1, … 0]"
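
To make the `npratio=4` setting and the `labels` column more concrete, here is a rough pure-Python sketch of negative sampling for one impression. It is an illustration under assumptions, not the code behind `sampling_strategy_wu2019` or `create_binary_labels_column`: `sample_impression` is a hypothetical function, and it assumes a single clicked article per impression.

```python
import random
from typing import List, Tuple


def sample_impression(
    inview: List[int], clicked: List[int], npratio: int, rng: random.Random
) -> Tuple[List[int], List[int]]:
    """Hypothetical sketch: keep one clicked article plus `npratio` non-clicked ones."""
    positive = clicked[0]  # assumption: one click per impression
    negatives_pool = [a for a in inview if a not in clicked]
    # Sample with replacement so that short impressions still yield `npratio` negatives.
    # Note: this raises IndexError if the impression has no non-clicked articles.
    negatives = rng.choices(negatives_pool, k=npratio)
    candidates = [positive] + negatives
    rng.shuffle(candidates)
    # Binary labels: 1 for the clicked article, 0 for every sampled negative.
    labels = [1 if a in clicked else 0 for a in candidates]
    return candidates, labels


rng = random.Random(123)
candidates, labels = sample_impression(
    [101, 102, 103, 104, 105, 106], [104], npratio=4, rng=rng
)
print(candidates, labels)
```

Each training row then has `npratio + 1` candidates with exactly one positive, which matches the shape of the `labels` lists shown in the output above.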


## Load articles

In [10]:
df_articles = pl.read_parquet(PATH.joinpath(DATASPLIT, "articles.parquet"))
df_articles.head(2)

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3001353,"""Natascha var i…","""Politiet frygt…",2023-06-29 06:20:33,False,"""Sagen om den ø…",2006-08-31 08:06:45,[3150850],"""article_defaul…","""https://ekstra…",[],[],"[""Kriminalitet"", ""Personfarlig kriminalitet""]",140,[],"""krimi""",,,,0.9955,"""Negative"""
3003065,"""Kun Star Wars …","""Biografgængern…",2023-06-29 06:20:35,False,"""Vatikanet har …",2006-05-21 16:57:00,[3006712],"""article_defaul…","""https://ekstra…",[],[],"[""Underholdning"", ""Film og tv"", ""Økonomi""]",414,"[433, 434]","""underholdning""",,,,0.846,"""Positive"""


## Init model using Hugging Face's tokenizer and word embeddings
The original implementation uses GloVe embeddings and the matching tokenizer. To get going fast, we'll use a pretrained Danish BERT model from Hugging Face instead.
Its tokenizer is used to tokenize the articles, and its word embeddings are used to initialize NRMS.


In [11]:
TRANSFORMER_MODEL_NAME = "MCFred/bert-base-danish-uncased-certainly-v2"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL] 
MAX_TITLE_LENGTH = 30

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME) 
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME) 

# Initialize the word embeddings from the transformer model:
word2vec_embedding = get_transformers_word_embeddings(transformer_model) 
#
df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE) 
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# =>
article_mapping = create_article_id_to_value_mapping(
    df=df_articles, value_col=token_col_title 
)
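
The `max_length=MAX_TITLE_LENGTH` argument means every article ends up with a token sequence of the same length. The toy sketch below shows the general idea of truncating and padding token ids. It is deliberately simplified: the real tokenizer uses subword units rather than whitespace splitting, and `encode_fixed_length`, `toy_vocab`, the padding side, and the `pad_id`/`unk_id` values are all hypothetical.

```python
from typing import Dict, List


def encode_fixed_length(
    text: str, vocab: Dict[str, int], max_length: int, pad_id: int = 0, unk_id: int = 1
) -> List[int]:
    """Hypothetical helper: map words to ids, then truncate or right-pad to `max_length`."""
    tokens = text.lower().split()
    # Unknown words fall back to `unk_id`; anything past `max_length` is dropped.
    ids = [vocab.get(tok, unk_id) for tok in tokens][:max_length]
    return ids + [pad_id] * (max_length - len(ids))


toy_vocab = {"politiet": 2, "frygter": 3, "star": 4, "wars": 5}
print(encode_fixed_length("Politiet frygter nyt", toy_vocab, max_length=5))  # [2, 3, 1, 0, 0]
print(encode_fixed_length("Kun Star Wars", toy_vocab, max_length=2))  # [1, 4]
```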


## Initialize the dataloaders
In this implementation the models and the data handling are decoupled. Hence, you should build a dataloader that fits your needs.

In [12]:
train_dataloader = NRMSDataLoader( 
    behaviors=df_train, 
    article_dict=article_mapping,
    unknown_representation="zeros", 
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL, 
    eval_mode=False,
    batch_size=128, 
)
val_dataloader = NRMSDataLoader(
    behaviors=df_validation,
    article_dict=article_mapping,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=True,
    batch_size=64, 
)
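
The `article_dict` and `unknown_representation="zeros"` arguments suggest that article ids are looked up in the token mapping and that ids missing from it are represented by zeros. The sketch below illustrates that lookup in plain Python; `lookup_tokens` and `toy_article_mapping` are hypothetical names, and the actual `NRMSDataLoader` may handle this differently.

```python
from typing import Dict, List

# Toy mapping from article id to token ids (two tokens per title for brevity).
toy_article_mapping: Dict[int, List[int]] = {101: [5, 9], 102: [7, 3]}


def lookup_tokens(
    article_ids: List[int], mapping: Dict[int, List[int]], title_size: int
) -> List[List[int]]:
    """Hypothetical helper: one token row per article id, all zeros for unknown ids."""
    unknown = [0] * title_size
    # Copy each row so callers cannot mutate the mapping by accident.
    return [list(mapping.get(article_id, unknown)) for article_id in article_ids]


batch = lookup_tokens([101, 999, 102], toy_article_mapping, title_size=2)
print(batch)  # [[5, 9], [0, 0], [7, 3]]
```

The same lookup covers padded history positions: a padding id of 0 is not in the mapping, so it also resolves to a row of zeros.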

## Train the model


In [13]:
MODEL_NAME = "NRMS"
LOG_DIR = f"weights/runs/{MODEL_NAME}"
MODEL_WEIGHTS = f"weights/data/state_dict/{MODEL_NAME}/weights"


early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2) 


hparams_nrms.history_size = HISTORY_SIZE
model = NRMSModel(
    hparams=hparams_nrms,
    word2vec_embedding=word2vec_embedding, 
    seed=42,
)
hist = model.model.fit(
    train_dataloader,
    validation_data=val_dataloader,
    epochs=1,
    callbacks=[early_stopping],
)



## Example: computing metrics

In [14]:
pred_validation = model.scorer.predict(val_dataloader)



## Add the predictions to the dataframe

In [15]:
df_validation = add_prediction_scores(df_validation, pred_validation.tolist()).pipe(
    add_known_user_column, known_users=df_train[DEFAULT_USER_COL]
)
df_validation.head(2)

user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels,scores,is_known_user
u32,list[i32],list[i32],list[i32],u32,list[i8],list[f64],bool
22548,"[9774840, 9757574, … 9776929]","[9784591, 9784679, … 9784710]",[9784696],96791,"[0, 0, … 0]","[0.762879, 0.226571, … 0.272942]",True
22548,"[9774840, 9757574, … 9776929]","[9782806, 9784702, … 9784489]",[9784281],96798,"[0, 0, … 0]","[0.37107, 0.757783, … 0.144775]",True


### Compute metrics

In [16]:
metrics = MetricEvaluator(
    labels=df_validation["labels"].to_list(),
    predictions=df_validation["scores"].to_list(),
    metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
)
metrics.evaluate()

<MetricEvaluator class>: 
 {
    "auc": 0.5632915504906345,
    "mrr": 0.3517480098526124,
    "ndcg@5": 0.3929105157050098,
    "ndcg@10": 0.46949975698918694
}
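
For intuition about the numbers above, the following sketch computes AUC, MRR, and nDCG@k for a single impression in plain Python. These are the standard definitions as commonly used in news recommendation; the function names are hypothetical, and that `MetricEvaluator` computes each metric per impression and averages the results is an assumption.

```python
import math
from typing import List


def _labels_by_score(labels: List[int], scores: List[float]) -> List[int]:
    """Reorder the labels by descending predicted score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(labels: List[int], scores: List[float]) -> float:
    """Share of (clicked, non-clicked) pairs ranked correctly; ties count as half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels: List[int], scores: List[float]) -> float:
    """Mean reciprocal rank of the clicked articles."""
    ranked = _labels_by_score(labels, scores)
    return sum(y / rank for rank, y in enumerate(ranked, start=1)) / sum(labels)


def dcg_at_k(ranked_labels: List[int], k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        (2**y - 1) / math.log2(rank + 1)
        for rank, y in enumerate(ranked_labels[:k], start=1)
    )


def ndcg_at_k(labels: List[int], scores: List[float], k: int) -> float:
    """DCG of the predicted ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(_labels_by_score(labels, scores), k) / ideal


toy_labels = [0, 0, 1, 0]
toy_scores = [0.1, 0.7, 0.4, 0.2]
print(round(auc(toy_labels, toy_scores), 4))  # 0.6667
print(mrr(toy_labels, toy_scores))  # 0.5
print(round(ndcg_at_k(toy_labels, toy_scores, k=5), 4))  # 0.6309
```

In the toy impression the clicked article is ranked second, so it beats two of the three non-clicked articles (AUC 2/3), has reciprocal rank 1/2, and earns a discounted gain of 1/log2(3).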

## Make submission file

In [17]:
df_validation = df_validation.with_columns(
    pl.col("scores")
    .map_elements(lambda x: list(rank_predictions_by_score(x)))
    .alias("ranked_scores")
)
df_validation.head(2)

user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels,scores,is_known_user,ranked_scores
u32,list[i32],list[i32],list[i32],u32,list[i8],list[f64],bool,list[i64]
22548,"[9774840, 9757574, … 9776929]","[9784591, 9784679, … 9784710]",[9784696],96791,"[0, 0, … 0]","[0.762879, 0.226571, … 0.272942]",True,"[1, 5, … 4]"
22548,"[9774840, 9757574, … 9776929]","[9782806, 9784702, … 9784489]",[9784281],96798,"[0, 0, … 0]","[0.37107, 0.757783, … 0.144775]",True,"[13, 3, … 23]"
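
In the output above, the highest score in each list receives rank 1 (for example, 0.762879 maps to 1). A minimal stand-in that reproduces this behaviour is sketched below; `rank_by_score` is a hypothetical name and not the implementation of `rank_predictions_by_score`, and its tie-breaking (earlier position wins) is an assumption.

```python
from typing import List


def rank_by_score(scores: List[float]) -> List[int]:
    """Hypothetical stand-in: rank 1 goes to the highest score, rank n to the lowest."""
    # Indices of the scores, sorted from highest to lowest score (stable for ties).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


print(rank_by_score([0.76, 0.23, 0.51, 0.27]))  # [1, 4, 2, 3]
```

The result keeps the order of `article_ids_inview`: position i holds the rank of the i-th in-view article, not the article id.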


This example uses the validation set; to create a real submission, add the test set to your flow in the same way.

In [18]:
write_submission_file(
    impression_ids=df_validation[DEFAULT_IMPRESSION_ID_COL],
    prediction_scores=df_validation["ranked_scores"],
    path="downloads/predictions_nrms_danish.txt",
)

244647it [01:05, 3753.48it/s]


Zipping downloads/predictions_nrms_danish.txt to downloads/predictions_nrms_danish.zip
