# Getting started

In this notebook, we illustrate how to use the Neural News Recommendation with Multi-Head Self-Attention ([NRMS](https://aclanthology.org/D19-1671/)) model. The implementation is taken from the [recommenders](https://github.com/recommenders-team/recommenders) repository; we have simply stripped the model down to keep it cleaner. Note that the code cells below instantiate `PPRecModel` and `PPRecDataLoader` from `ebrec.models.newsrec` rather than the NRMS classes.

We use a small dataset downloaded from [recsys.eb.dk](https://recsys.eb.dk/). All datasets are stored under the folder path `~/ebnerd_data/*`.

## Load functionality

In [18]:
from transformers import AutoTokenizer, AutoModel
from pathlib import Path
import tensorflow as tf
import polars as pl

from ebrec.utils._constants import (
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
    DEFAULT_SUBTITLE_COL,
    DEFAULT_LABELS_COL,
    DEFAULT_TITLE_COL,
    DEFAULT_USER_COL,
    DEFAULT_NER_COL
)

from ebrec.utils._behaviors import (
    create_binary_labels_column,
    sampling_strategy_wu2019,
    add_known_user_column,
    add_prediction_scores,
    truncate_history,
)
from ebrec.evaluation import MetricEvaluator, AucScore, NdcgScore, MrrScore
from ebrec.utils._articles import convert_text2encoding_with_transformers, concat_list_to_text
from ebrec.utils._polars import concat_str_columns, slice_join_dataframes
from ebrec.utils._articles import create_article_id_to_value_mapping
from ebrec.utils._nlp import get_transformers_word_embeddings
from ebrec.utils._python import write_submission_file, rank_predictions_by_score

from ebrec.models.newsrec.dataloader import PPRecDataLoader
from ebrec.models.newsrec.model_config import hparams_pprec
from ebrec.models.newsrec import PPRecModel
from huggingface_hub import from_pretrained_keras

# from .autonotebook import tqdm as notebook_tqdm
%matplotlib inline
%load_ext autoreload
%autoreload 2


The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## Load dataset

In [21]:
def ebnerd_from_path(path: Path, history_size: int = 30) -> pl.DataFrame:
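    """
    Load the behaviors of one EB-NeRD data split and left-join each user's
    click history, truncated and padded to `history_size` entries.
    """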
    df_history = (
        pl.scan_parquet(path.joinpath("history.parquet"))
        .select(DEFAULT_USER_COL, DEFAULT_HISTORY_ARTICLE_ID_COL)
        .pipe(
            truncate_history,
            column=DEFAULT_HISTORY_ARTICLE_ID_COL,
            history_size=history_size,
            padding_value=0,
            enable_warning=False,
        )
    )
    df_behaviors = (
        pl.scan_parquet(path.joinpath("behaviors.parquet"))
        .collect()
        .pipe(
            slice_join_dataframes,
            df2=df_history.collect(),
            on=DEFAULT_USER_COL,
            how="left",
        )
    )
    return df_behaviors
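
The `truncate_history` step above fixes every user's history to `history_size` entries with `padding_value=0`. A minimal sketch of that idea in plain Python follows. The function name `truncate_and_pad` is hypothetical, and keeping the most recent ids while padding on the left is an assumption about the behavior, not a copy of the library code.

```python
from typing import List


def truncate_and_pad(
    history: List[int], history_size: int, padding_value: int = 0
) -> List[int]:
    """Keep the last `history_size` ids; left-pad shorter histories.

    Assumption: the most recent clicks sit at the end of the list.
    """
    recent = history[-history_size:] if history_size > 0 else []
    padding = [padding_value] * (history_size - len(recent))
    return padding + recent


# A long history is cut down to its most recent entries.
print(truncate_and_pad([11, 12, 13, 14, 15], history_size=3))  # [13, 14, 15]
# A short history is padded up to the requested size.
print(truncate_and_pad([11, 12], history_size=4))  # [0, 0, 11, 12]
```

A fixed history length is what lets the histories be stacked into a rectangular batch later on.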

### Generate labels
We sample only a small fraction of the data to get started. For the test set, we create a dummy label column of 0s and 1s; these are not the true labels.
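
As a rough illustration of what a binary label column contains, the sketch below marks each in-view article with 1 if it was clicked and 0 otherwise. The function `binary_labels` and the article ids are made up for this example; it is not the implementation of `create_binary_labels_column`.

```python
from typing import List


def binary_labels(inview: List[int], clicked: List[int]) -> List[int]:
    """Return one label per in-view article: 1 if clicked, else 0."""
    clicked_set = set(clicked)
    return [1 if article_id in clicked_set else 0 for article_id in inview]


# Toy impression: three articles shown, the first one clicked.
print(binary_labels([101, 102, 103], [101]))  # [1, 0, 0]
```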

In [22]:
# Local path to the downloaded data; adjust it to your own setup (e.g. ~/ebnerd_data).
PATH = Path("/Users/sohamchatterjee/Documents/UvA/RecSYS/Project/ebnerd_data")
DATASPLIT = "ebnerd_demo"

In this example we sample the dataset to keep it small. The test set can be added in the same way as the validation set.

In [4]:
COLUMNS = [
    DEFAULT_USER_COL,
    DEFAULT_HISTORY_ARTICLE_ID_COL,
    DEFAULT_INVIEW_ARTICLES_COL,
    DEFAULT_CLICKED_ARTICLES_COL,
    DEFAULT_IMPRESSION_ID_COL,
]
HISTORY_SIZE = 10
FRACTION = 0.01

df_train = (
    ebnerd_from_path(PATH.joinpath(DATASPLIT, "train"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(
        sampling_strategy_wu2019,
        npratio=4,
        shuffle=True,
        with_replacement=True,
        seed=123,
    )
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)
# Validation set: keep all in-view articles (no negative sampling)
df_validation = (
    ebnerd_from_path(PATH.joinpath(DATASPLIT, "validation"), history_size=HISTORY_SIZE)
    .select(COLUMNS)
    .pipe(create_binary_labels_column)
    .sample(fraction=FRACTION)
)
df_train.head(2)



user_id,article_id_fixed,article_ids_inview,article_ids_clicked,impression_id,labels
u32,list[i32],list[i64],list[i64],u32,list[i8]
1619579,"[9770538, 9769306, … 9769622]","[9773084, 9771686, … 9771686]",[9773084],310380206,"[1, 0, … 0]"
667805,"[9769917, 9769433, … 9769433]","[9734283, 9773873, … 9734283]",[9773877],215228693,"[0, 0, … 0]"
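
The `npratio=4` argument in the cell above controls negative sampling for the training set. Assuming it follows the scheme its name points to (Wu et al., 2019), each clicked article is paired with `npratio` non-clicked in-view articles. The sketch below illustrates that idea only; `sample_negatives` is a hypothetical helper, not the library function.

```python
import random
from typing import List, Optional


def sample_negatives(
    inview: List[int],
    clicked: List[int],
    npratio: int,
    with_replacement: bool = True,
    seed: Optional[int] = None,
) -> List[int]:
    """Return the clicked ids plus `npratio` sampled non-clicked ids, shuffled."""
    rng = random.Random(seed)
    clicked_set = set(clicked)
    negatives = [a for a in inview if a not in clicked_set]
    if not negatives:
        return list(clicked)
    if with_replacement:
        # The same negative may be drawn more than once.
        sampled = rng.choices(negatives, k=npratio)
    else:
        sampled = rng.sample(negatives, k=min(npratio, len(negatives)))
    candidates = list(clicked) + sampled
    rng.shuffle(candidates)
    return candidates


candidates = sample_negatives([101, 102, 103], [101], npratio=4, seed=123)
print(len(candidates))  # 5: one positive and four sampled negatives
```

Sampling with replacement would explain why the same article id can appear more than once in `article_ids_inview` in the output above.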


## Load articles

In [5]:
df_articles = pl.read_parquet(PATH.joinpath("articles.parquet"))
df_articles.head(2)

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str
3000022,"""Hanks beskyldt…","""Tom Hanks har …",2023-06-29 06:20:32,False,"""Tom Hanks skul…",2006-09-20 09:24:18,[3518381],"""article_defaul…","""https://ekstra…","[""David Gardner""]","[""PER""]","[""Kriminalitet"", ""Kendt"", … ""Litteratur""]",414,[432],"""underholdning""",,,,0.9911,"""Negative"""
3000063,"""Bostrups aske …","""Studieværten b…",2023-06-29 06:20:32,False,"""Strålende sens…",2006-09-24 07:45:30,"[3170935, 3170939]","""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Underholdning"", … ""Personlig begivenhed""]",118,[133],"""nyheder""",,,,0.5155,"""Neutral"""


## Initialize the model using Hugging Face's tokenizer and word embeddings
The original implementation uses GloVe embeddings and the corresponding tokenizer. To get going fast, we use a multilingual language model from Hugging Face instead: its tokenizer tokenizes the articles, and its word embeddings initialize the model.
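
To make the two roles concrete, here is a small framework-free sketch: token ids are brought to a fixed length, and each id then selects one row of an embedding matrix. The helper `to_fixed_length`, the toy matrix, and the use of right-padding are assumptions for illustration; the pad id of 1 matches the XLM-RoBERTa tokenizer.

```python
from typing import List

PAD_ID = 1  # pad token id of the XLM-RoBERTa tokenizer


def to_fixed_length(
    token_ids: List[int], max_length: int, pad_id: int = PAD_ID
) -> List[int]:
    """Truncate to `max_length` and right-pad shorter sequences with `pad_id`."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))


# Toy embedding matrix with one row per vocabulary id (4 ids, 2 dimensions).
embedding_matrix = [[0.0, 0.0], [0.1, 0.1], [0.5, -0.5], [0.9, 0.3]]

encoded = to_fixed_length([3, 2], max_length=4)
vectors = [embedding_matrix[token_id] for token_id in encoded]

print(encoded)     # [3, 2, 1, 1]
print(vectors[0])  # [0.9, 0.3]
```

In the notebook, `MAX_TITLE_LENGTH` plays the role of `max_length`, and the matrix returned by `get_transformers_word_embeddings` plays the role of `embedding_matrix`.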


In [6]:
TRANSFORMER_MODEL_NAME = "FacebookAI/xlm-roberta-base"
TEXT_COLUMNS_TO_USE = [DEFAULT_SUBTITLE_COL, DEFAULT_TITLE_COL]
MAX_TITLE_LENGTH = 30

# LOAD HUGGINGFACE:
transformer_model = AutoModel.from_pretrained(TRANSFORMER_MODEL_NAME)
transformer_tokenizer = AutoTokenizer.from_pretrained(TRANSFORMER_MODEL_NAME)

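# Initialize the word embeddings from the transformer model
word2vec_embedding = get_transformers_word_embeddings(transformer_model)
# Concatenate the text columns into one column; its name is returned as `cat_cal`
df_articles, cat_cal = concat_str_columns(df_articles, columns=TEXT_COLUMNS_TO_USE)
# Tokenize the concatenated text; the new column's name is returned as `token_col_title`
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, cat_cal, max_length=MAX_TITLE_LENGTH
)
# Map each article id to its token ids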
article_mapping_title = create_article_id_to_value_mapping(
    df=df_articles, value_col=token_col_title
)



In [7]:
df_articles = concat_list_to_text(df_articles, "ner_clusters", "ner_clusters_text")



In [8]:
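# Encode the NER text. Note that `token_col_title` is reused here and now
# holds the name of the NER-encoding column, not the title-encoding column.
df_articles, token_col_title = convert_text2encoding_with_transformers(
    df_articles, transformer_tokenizer, "ner_clusters_text", max_length=MAX_TITLE_LENGTH
)
df_articles.head()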

article_id,title,subtitle,last_modified_time,premium,body,published_time,image_ids,article_type,url,ner_clusters,entity_groups,topics,category,subcategory,category_str,total_inviews,total_pageviews,total_read_time,sentiment_score,sentiment_label,subtitle-title,subtitle-title_encode_FacebookAI/xlm-roberta-base,ner_clusters_text,ner_clusters_text_encode_FacebookAI/xlm-roberta-base
i32,str,str,datetime[μs],bool,str,datetime[μs],list[i64],str,str,list[str],list[str],list[str],i16,list[i16],str,i32,i32,f32,f32,str,str,list[i64],str,list[i64]
3000022,"""Hanks beskyldt…","""Tom Hanks har …",2023-06-29 06:20:32,False,"""Tom Hanks skul…",2006-09-20 09:24:18,[3518381],"""article_defaul…","""https://ekstra…","[""David Gardner""]","[""PER""]","[""Kriminalitet"", ""Kendt"", … ""Litteratur""]",414,[432],"""underholdning""",,,,0.9911,"""Negative""","""Tom Hanks har …","[8352, 2548, … 56]","""David Gardner""","[6765, 90968, … 1]"
3000063,"""Bostrups aske …","""Studieværten b…",2023-06-29 06:20:32,False,"""Strålende sens…",2006-09-24 07:45:30,"[3170935, 3170939]","""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Underholdning"", … ""Personlig begivenhed""]",118,[133],"""nyheder""",,,,0.5155,"""Neutral""","""Studieværten b…","[60716, 17052, … 1]","""""","[1, 1, … 1]"
3000613,"""Jesper Olsen r…","""Den tidligere …",2023-06-29 06:20:33,False,"""Jesper Olsen, …",2006-05-09 11:29:00,[3164998],"""article_defaul…","""https://ekstra…","[""Frankrig"", ""Jesper Olsen"", … ""Jesper Olsen""]","[""LOC"", ""PER"", … ""PER""]","[""Kendt"", ""Sport"", … ""Sygdom og behandling""]",142,"[196, 271]","""sport""",,,,0.9876,"""Negative""","""Den tidligere …","[1575, 12532, … 111326]","""Frankrig-Jespe…","[192380, 9, … 1]"
3000700,"""Madonna topløs…","""47-årige Madon…",2023-06-29 06:20:33,False,"""Skal du have s…",2006-05-04 11:03:12,[3172046],"""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Livsstil"", ""Underholdning""]",414,[432],"""underholdning""",,,,0.8786,"""Neutral""","""47-årige Madon…","[7657, 9, … 22907]","""""","[1, 1, … 1]"
3000840,"""Otto Brandenbu…","""Sangeren og sk…",2023-06-29 06:20:33,False,"""'Og lidt for S…",2007-03-01 18:34:00,[3914446],"""article_defaul…","""https://ekstra…",[],[],"[""Kendt"", ""Underholdning"", … ""Musik og lyd""]",118,[133],"""nyheder""",,,,0.9468,"""Negative""","""Sangeren og sk…","[22986, 3683, … 1]","""""","[1, 1, … 1]"


In [9]:
article_mapping_entity = create_article_id_to_value_mapping(
    df=df_articles, value_col=token_col_title
)

## Initialize the dataloaders
In this implementation, the models and the data are decoupled. Hence, you should build a dataloader that fits your needs.
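
As a sketch of what such a dataloader needs to do, the framework-free class below turns article ids into token-id vectors batch by batch and maps unknown ids to zeros. `ToyNewsDataLoader`, its arguments, and the row layout are hypothetical and much simpler than `PPRecDataLoader`.

```python
import math
from typing import Dict, List, Tuple

Row = Dict[str, List[int]]
Encoded = List[List[List[int]]]


class ToyNewsDataLoader:
    """Hypothetical loader: looks up token ids for history and candidates."""

    def __init__(
        self,
        behaviors: List[Row],
        article_dict: Dict[int, List[int]],
        token_length: int,
        batch_size: int = 2,
    ) -> None:
        self.behaviors = behaviors
        self.article_dict = article_dict
        self.unknown = [0] * token_length  # unknown article ids map to zeros
        self.batch_size = batch_size

    def __len__(self) -> int:
        # Number of batches, counting a final partial batch.
        return math.ceil(len(self.behaviors) / self.batch_size)

    def _encode(self, article_ids: List[int]) -> List[List[int]]:
        return [self.article_dict.get(a, self.unknown) for a in article_ids]

    def __getitem__(self, idx: int) -> Tuple[Tuple[Encoded, Encoded], List[List[int]]]:
        start = idx * self.batch_size
        rows = self.behaviors[start : start + self.batch_size]
        history = [self._encode(row["history"]) for row in rows]
        candidates = [self._encode(row["inview"]) for row in rows]
        labels = [row["labels"] for row in rows]
        return (history, candidates), labels


article_dict = {101: [5, 6, 1], 102: [7, 1, 1]}
behaviors = [
    {"history": [0, 101], "inview": [101, 102], "labels": [0, 1]},
    {"history": [102, 101], "inview": [102, 999], "labels": [1, 0]},
    {"history": [0, 0], "inview": [101, 102], "labels": [1, 0]},
]
loader = ToyNewsDataLoader(behaviors, article_dict, token_length=3, batch_size=2)
(history, candidates), labels = loader[0]
print(len(loader))    # 2
print(history[0])     # [[0, 0, 0], [5, 6, 1]]
print(candidates[1])  # [[7, 1, 1], [0, 0, 0]]
```

Keeping the lookup inside the loader means the same model code can be fed titles, NER text, or any other encoded field by swapping the mapping.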

In [10]:
train_dataloader = PPRecDataLoader(
    behaviors=df_train,
    article_dict=article_mapping_title,
    body_mapping=article_mapping_entity,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=False,
    batch_size=64,
)


In [11]:
val_dataloader = PPRecDataLoader(
    behaviors=df_validation,
    article_dict=article_mapping_title,
    body_mapping=article_mapping_entity,
    unknown_representation="zeros",
    history_column=DEFAULT_HISTORY_ARTICLE_ID_COL,
    eval_mode=True,
    batch_size=32,
)

In [12]:
# model = PPRecModel(
#     hparams=hparams_pprec,
#     word2vec_embedding=None,
#     seed=42,
# )
# print("Model:",model.model)
# print("Scorer:",model.scorer)

## Train the model


In [23]:
MODEL_NAME = "PPRecModel"
LOG_DIR = f"downloads/runs/{MODEL_NAME}"
MODEL_WEIGHTS = f"downloads/data/state_dict/{MODEL_NAME}/weights"

# CALLBACKS
tensorboard_callback = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, histogram_freq=1)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=2)
modelcheckpoint = tf.keras.callbacks.ModelCheckpoint(
    filepath=MODEL_WEIGHTS, save_best_only=True, save_weights_only=True, verbose=1
)

hparams_pprec.history_size = HISTORY_SIZE
model = PPRecModel(
    hparams=hparams_pprec,
    word2vec_embedding=word2vec_embedding,
    seed=42,
)



In [24]:
hist = model.model.fit(
    train_dataloader,
    validation_data=val_dataloader,
    epochs=1,
    callbacks=[tensorboard_callback, early_stopping, modelcheckpoint],
)

['title_user_id', 'title_article_ids_clicked', 'title_impression_id', 'title_n_samples', 'title_article_id_fixed', 'title_article_ids_inview', 'ner_clusters_text_user_id', 'ner_clusters_text_article_ids_clicked', 'ner_clusters_text_impression_id', 'ner_clusters_text_n_samples', 'ner_clusters_text_article_id_fixed', 'ner_clusters_text_article_ids_inview']



In [None]:

_ = model.model.load_weights(filepath=MODEL_WEIGHTS)

## Example of how to compute some metrics

In [17]:
# pred_validation = model.scorer.predict(val_dataloader)

## Add the predictions to the dataframe

In [14]:
# df_validation = add_prediction_scores(df_validation, pred_validation.tolist()).pipe(
#     add_known_user_column, known_users=df_train[DEFAULT_USER_COL]
# )
# df_validation.head(2)

### Compute metrics

In [15]:
# metrics = MetricEvaluator(
#     labels=df_validation["labels"].to_list(),
#     predictions=df_validation["scores"].to_list(),
#     metric_functions=[AucScore(), MrrScore(), NdcgScore(k=5), NdcgScore(k=10)],
# )
# metrics.evaluate()
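
As a rough illustration of what the metrics named in the cell above measure, the sketch below computes AUC, MRR, and nDCG@k for a single impression in plain Python. These are common textbook definitions written as hypothetical helpers; they are not the `ebrec.evaluation` classes, and each function assumes at least one clicked and one non-clicked article.

```python
import math
from typing import List


def ranked_labels(labels: List[int], scores: List[float]) -> List[int]:
    """Reorder labels by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [labels[i] for i in order]


def auc(labels: List[int], scores: List[float]) -> float:
    """Share of (clicked, non-clicked) pairs ranked correctly; ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mrr(labels: List[int], scores: List[float]) -> float:
    """Mean reciprocal rank of the clicked articles."""
    ranked = ranked_labels(labels, scores)
    return sum(l / rank for rank, l in enumerate(ranked, start=1)) / sum(labels)


def dcg(ordered_labels: List[int], k: int) -> float:
    """Discounted cumulative gain over the top-k positions."""
    return sum(
        (2**l - 1) / math.log2(pos + 2) for pos, l in enumerate(ordered_labels[:k])
    )


def ndcg(labels: List[int], scores: List[float], k: int) -> float:
    """DCG of the predicted order divided by the DCG of the ideal order."""
    return dcg(ranked_labels(labels, scores), k) / dcg(sorted(labels, reverse=True), k)


labels = [0, 1, 0, 0]
scores = [0.2, 0.6, 0.9, 0.1]
print(round(auc(labels, scores), 4))        # 0.6667
print(round(mrr(labels, scores), 4))        # 0.5
print(round(ndcg(labels, scores, k=5), 4))  # 0.6309
```

In this toy impression the clicked article is ranked second, which is why MRR is 0.5 and nDCG@5 equals 1/log2(3).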

## Make submission file

In [16]:
# df_validation = df_validation.with_columns(
#     pl.col("scores")
#     .map_elements(lambda x: list(rank_predictions_by_score(x)))
#     .alias("ranked_scores")
# )
# df_validation.head(2)
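
The ranking step in the cell above turns each list of scores into a list of ranks. A minimal sketch of that conversion, assuming rank 1 goes to the highest score, is shown below; `scores_to_ranks` is a hypothetical stand-in, not the code of `rank_predictions_by_score`.

```python
from typing import List


def scores_to_ranks(scores: List[float]) -> List[int]:
    """Return each item's rank, where rank 1 is the highest score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


# The third article has the highest score, so it gets rank 1.
print(scores_to_ranks([0.2, 0.6, 0.9, 0.1]))  # [3, 2, 1, 4]
```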

This example uses the validation set; simply add the test set to your flow in the same way.

In [None]:
# write_submission_file(
#     impression_ids=df_validation[DEFAULT_IMPRESSION_ID_COL],
#     prediction_scores=df_validation["ranked_scores"],
#     path="downloads/predictions.txt",
# )

# DONE 🚀