# Chameleon news recommendation system: Globo dataset

## Project setup
Create virtual environment:
```sh
conda env create -f env.yml
conda activate chameleon
```

Download the data from [kaggle]() and unzip it into the `data/gcom` subdirectory, then unzip `clicks.zip` there. The expected layout is:
```
data/gcom
  articles_embeddings.pickle  # Article content embedding results
  articles_metadata.csv       # Article metadata
  clicks/clicks_hour_*.csv    # Click information for recommendation
```
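
Before running the later steps, it can help to confirm that the layout above is in place. The snippet below is a minimal sketch using only the Python standard library; the helper name `missing_gcom_files` is hypothetical and not part of this repository.

```python
from pathlib import Path


def missing_gcom_files(data_dir: str) -> list[str]:
    """Return the expected entries that are absent from `data_dir`."""
    root = Path(data_dir)
    missing = []
    for name in ("articles_embeddings.pickle", "articles_metadata.csv"):
        if not (root / name).is_file():
            missing.append(name)
    # At least one hourly click file must exist after unzipping clicks.zip.
    if not any(root.glob("clicks/clicks_hour_*.csv")):
        missing.append("clicks/clicks_hour_*.csv")
    return missing


if __name__ == "__main__":
    print(missing_gcom_files("data/gcom") or "layout looks complete")
```

An empty list means all three entries were found.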

## Article Content Representation (ACR) module
Function:
1. extract features from news article text and metadata
2. learn a distributed representation (embedding) for each news article context.

The inputs for the *ACR* module are:
1. article metadata attributes (e.g., publisher)
2. article textual content, represented as a sequence of word embeddings.

### 0. ACR Description
In this instantiation of the *Textual Features Representation (TFR)* sub-module of the ACR module, 1D CNNs over pre-trained Word2Vec embeddings were used to extract features from textual items. The article's textual features and metadata inputs were combined by a sequence of Fully Connected (FC) layers to produce the *Article Content Embeddings*.
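
To make the idea of a 1D CNN over word embeddings concrete, here is a toy, pure-Python sketch. It is not the code in `acr_module`: the ReLU activation and the max-over-time pooling are assumptions borrowed from common text-CNN designs, and the function names are hypothetical.

```python
def conv1d_max_pool(tokens, kernel):
    """Slide `kernel` over a sequence of word vectors, apply ReLU, keep the max.

    tokens: list of word-embedding vectors, shape (seq_len, emb_dim)
    kernel: list of weight vectors, shape (filter_size, emb_dim)
    """
    size = len(kernel)
    activations = []
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        # Dot product between the window of word vectors and the filter weights.
        score = sum(w * x
                    for k_row, t_row in zip(kernel, window)
                    for w, x in zip(k_row, t_row))
        activations.append(max(0.0, score))  # ReLU (assumed activation)
    # Max-over-time pooling (assumed): one feature per filter, whatever the text length.
    return max(activations) if activations else 0.0


def text_features(tokens, kernels):
    """Concatenate the pooled output of every filter into one feature vector."""
    return [conv1d_max_pool(tokens, k) for k in kernels]


if __name__ == "__main__":
    words = [[1, 0], [0, 1], [1, 1]]            # three 2-dimensional word vectors
    filters = [[[1, 0], [0, 1]], [[-1, -1], [-1, -1]]]  # two filters of size 2
    print(text_features(words, filters))
```

In a real model the filter weights are learned, and several filter sizes are used side by side (the training command in this README passes `--cnn_filter_sizes "3,4,5"`).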

For scalability reasons, the *Article Content Embeddings* are not trained directly for the recommendation task, but for a side task of news metadata classification. For this architecture instantiation of CHAMELEON, they were trained to classify the category (editorial section) of news articles.

After training, the *Article Content Embeddings* for the news articles (a NumPy matrix) are persisted into a Pickle dump file, for later use by the *NAR* module.

### 1. ACR Preprocessing: Extract features from news articles text and metadata 

Input
* input_articles_csv_path: path of a CSV containing the articles' text and metadata
* input_word_embeddings_path: path of the pre-trained word embeddings, which must be in gensim format (binary or plain text)

Output
* output_tf_records_path: path pattern of the articles data exported in TFRecords format
* output_word_vocab_embeddings_path: the dictionaries that map tokenized words to sequences of integers
* output_label_encoders: the encoders for the categorical metadata features
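
The word-to-integer mapping mentioned above can be pictured with the following sketch. It is an illustration only, assuming a frequency-capped vocabulary (compare the `--vocab_most_freq_words` flag used in the command for this step) with hypothetical reserved ids for padding and unknown words; the actual preprocessing script may differ.

```python
from collections import Counter

PAD, UNK = 0, 1  # hypothetical reserved ids for padding and out-of-vocabulary words


def build_vocab(tokenized_docs, most_freq_words):
    """Map the `most_freq_words` most frequent tokens to integer ids."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    vocab = {"<PAD>": PAD, "<UNK>": UNK}
    for word, _ in counts.most_common(most_freq_words):
        vocab[word] = len(vocab)
    return vocab


def encode(doc, vocab):
    """Turn a tokenized document into a sequence of ints."""
    return [vocab.get(tok, UNK) for tok in doc]


if __name__ == "__main__":
    docs = [["a", "b", "a"], ["a", "c", "b"]]
    vocab = build_vocab(docs, most_freq_words=2)
    print(vocab, encode(docs[1], vocab))
```

Words outside the most frequent set fall back to the `UNK` id, which keeps the embedding matrix bounded in size.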

For the Globo.com dataset, we used pre-trained Portuguese word embeddings (skip-gram model, 300 dimensions), available [here](http://nilc.icmc.usp.br/embeddings).

```sh
cd acr_module && \
DATA_DIR="../data/gcom" && \
python3 -m acr.preprocessing.acr_preprocess_gcom \
    --input_articles_csv_path ${DATA_DIR}/document_g1/documents_g1.csv \
    --input_word_embeddings_path ${DATA_DIR}/word2vec/skip_s300.txt \
    --vocab_most_freq_words 50000 \
    --output_word_vocab_embeddings_path ${DATA_DIR}/pickles/acr_word_vocab_embeddings.pickle \
    --output_label_encoders ${DATA_DIR}/pickles/acr_label_encoders.pickle \
    --output_tf_records_path "${DATA_DIR}/articles_tfrecords/gcom_articles_tokenized_*.tfrecord.gz" \
    --articles_by_tfrecord 5000
```

### 2. ACR training
Learn a distributed representation (embedding) for each news article context.

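Input from the previous step:
* train_set_path_regex: path pattern of the pre-processed TFRecords
* input_word_vocab_embeddings_path: path of the word vocabulary and embeddings pickle written by preprocessing
* input_label_encoders_path: path of the label encoders pickle written by preprocessing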

Output: 
* output_acr_metadata_embeddings_path: the trained *Article Content Embeddings* (NumPy matrix), with the dimensions specified by *acr_embeddings_size*, exported as a Pickle dump file.


```sh
cd acr_module && \
DATA_DIR="../data/gcom" && \
JOB_PREFIX=gcom && \
JOB_ID="$(whoami)_${JOB_PREFIX}_$(date '+%Y_%m_%d_%H%M%S')" && \
MODEL_DIR='/tmp/chameleon/gcom/jobs/'${JOB_ID} && \
echo 'Running training job and outputting to '${MODEL_DIR} && \
python3 -m acr.acr_trainer_gcom \
    --model_dir ${MODEL_DIR} \
    --train_set_path_regex "${DATA_DIR}/articles_tfrecords/gcom_articles_tokenized_*.tfrecord.gz" \
    --input_word_vocab_embeddings_path ${DATA_DIR}/pickles/acr_word_vocab_embeddings.pickle \
    --input_label_encoders_path ${DATA_DIR}/pickles/acr_label_encoders.pickle \
    --output_acr_metadata_embeddings_path ${DATA_DIR}/pickles/acr_articles_metadata_embeddings.pickle \
    --batch_size 64 \
    --truncate_tokens_length 300 \
    --training_epochs 5 \
    --learning_rate 3e-4 \
    --dropout_keep_prob 1.0 \
    --l2_reg_lambda 7e-4 \
    --text_feature_extractor "CNN" \
    --cnn_filter_sizes "3,4,5" \
    --cnn_num_filters 128 \
    --acr_embeddings_size 250
```


## Next article recommendation (NAR)
The *Next-Article Recommendation (NAR)* module is responsible for providing news article recommendations for active sessions.

The inputs for the *NAR* module are: 
1. the pre-trained *Article Content Embedding* of the last viewed article; 
2. the contextual properties of the articles (popularity and recency); 
3. the user context (e.g. time, location, and device). 

### 0. NAR Description

Due to the high sparsity of users and the constant shifts in their interests, the CHAMELEON instantiation leverages only session-based contextual information, ignoring users' possible past sessions.

These inputs are combined by Fully Connected layers to produce a *User-Personalized Contextual Article Embedding*, whose representations might differ for the same article, depending on the user context and on the current article context (popularity and recency).

The *NAR* module uses a type of Recurrent Neural Network (RNN) – the Long Short-Term Memory (LSTM) – to model the sequence of articles read by users in their sessions, represented by their *User-Personalized Contextual Article Embeddings*. For each article of the sequence, the RNN outputs a *Predicted Next-Article Embedding* – the expected representation of a news content the user would like to read next in the active session.

In most deep learning architectures proposed for recommender systems (RS), the neural network outputs a vector whose dimension is the number of available items. Such an approach may work for domains where the number of items is more stable, like movies and books. However, in the dynamic scenario of news recommendation, where thousands of news stories are added and removed daily, it could require a full retraining of the network as often as new articles are published.

For this reason, instead of using a softmax cross-entropy loss, the NAR module is trained to maximize the similarity between the *Predicted Next-Article Embedding* and the *User-Personalized Contextual Article Embedding* corresponding to the next article actually read by the user in the session (positive sample), while minimizing its similarity with negative samples (articles not read by the user during the session). With this strategy for dealing with item cold-start, a newly published article can be recommended as soon as its *Article Content Embedding* is trained and added to a repository.
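
The objective described above can be sketched as follows. This is one possible formulation, assumed here rather than taken from the repository: cosine similarity as the similarity measure, and a temperature-scaled softmax over the positive sample plus a few sampled negatives (not over the whole catalog). The function names are hypothetical; the default temperature mirrors the `--softmax_temperature 0.1` flag in the training command.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranking_loss(predicted, positive, negatives, temperature=0.1):
    """Negative log-probability of the positive item among positive + negatives."""
    sims = [cosine(predicted, positive)] + [cosine(predicted, n) for n in negatives]
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the max before exponentiating, for numerical stability
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]


if __name__ == "__main__":
    pred = [1.0, 0.0]
    print(ranking_loss(pred, [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]]))  # small loss
    print(ranking_loss(pred, [-1.0, 0.0], [[1.0, 0.0]]))              # large loss
```

Because the loss depends only on embeddings of sampled articles, its size does not grow with the catalog, which is what lets new articles be scored without retraining an output layer.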

### 1. NAR preprocessing: convert click sessions into TFRecords
INPUT:
* input_clicks_csv_path_regex: path pattern of the click CSV files, with user sessions split by hour

OUTPUT:
* output_sessions_tfrecords_path: path pattern of the exported session TFRecords

```sh
cd nar_module && \
DATA_DIR="../data/gcom" && \
python3 -m nar.preprocessing.nar_preprocess_gcom \
    --input_clicks_csv_path_regex "${DATA_DIR}/clicks/clicks_hour_*" \
    --number_hours_to_preprocess 5 \
    --output_sessions_tfrecords_path "${DATA_DIR}/sessions_tfrecords/sessions_hour_*.tfrecord.gz"
```

Output:
```
Loading sessions by hour
Exporting sessions by hour to TFRecords: ../data/gcom/sessions_tfrecords/sessions_hour_*.tfrecord.gz
Exported 0 TFRecord files
Preprocessing finalized
```
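
The log above reports 0 exported files. One possible cause, not confirmed by the source, is that the input pattern matched no click files from the working directory. The sketch below shows a quick way to list which hourly files a pattern would select; the helper `first_hours` is hypothetical, and ordering files by the numeric suffix in `clicks_hour_<n>` is an assumption.

```python
import glob
import re
from pathlib import Path


def first_hours(paths, number_hours):
    """Sort click files by their numeric hour suffix and keep the first N."""
    def hour_index(path):
        match = re.search(r"clicks_hour_(\d+)", Path(path).name)
        return int(match.group(1)) if match else -1

    hourly = [p for p in paths if hour_index(p) >= 0]
    return sorted(hourly, key=hour_index)[:number_hours]


if __name__ == "__main__":
    # Run from inside nar_module, like the command above.
    matched = glob.glob("../data/gcom/clicks/clicks_hour_*")
    print(len(matched), "files matched;", first_hours(matched, 5))
```

If this prints `0 files matched`, the click files are probably not where the pattern expects them.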


### 2. NAR Training and evaluation
The *NAR* module is trained and evaluated according to the following *Temporal Offline Evaluation Method*:
1. Train the NAR module with sessions within the active hour.
2. Evaluate the NAR module with sessions within the next hour, for the task of the next-click prediction.

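INPUT:
- acr_module_articles_metadata_csv_path: path of the article metadata CSV
- acr_module_articles_content_embeddings_pickle_path: path of the *Article Content Embeddings* pickle (from the ACR module)
- train_set_path_regex: path pattern of the session TFRecords with click information

OUTPUT:
- model_dir: directory where the training job writes its output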

### 3. NAR Evaluation

The following baseline methods (described in more detail in [2]) are also trained and evaluated in parallel, as benchmarks for CHAMELEON accuracy:
- **Co-occurrent (CO)**
- **Sequential Rules (SR)**
- **Item-kNN**
- **Vector Multiplication Session-Based kNN (V-SkNN)**
- **Recently Popular (RP)**
- **Content-Based (CB)**

The chosen evaluation metrics were **Hit-Rate@N** and **MRR@N** for accuracy, **COV** for catalog coverage, **ESI-R** and **ESI-RR** for novelty, and **EILD-R** and **EILD-RR** for diversity, all described in [2].
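
For readers unfamiliar with the two accuracy metrics, here is a minimal sketch using their standard textbook definitions. It is not the evaluation code in `nar_module`; the function names and the input layout (one ranked list and one target per prediction) are assumptions.

```python
def hit_rate_at_n(ranked_lists, targets, n):
    """Fraction of predictions whose target appears in the top-n recommendations."""
    hits = sum(1 for ranked, t in zip(ranked_lists, targets) if t in ranked[:n])
    return hits / len(targets)


def mrr_at_n(ranked_lists, targets, n):
    """Mean reciprocal rank of the target, counting 0 when it is outside the top n."""
    total = 0.0
    for ranked, t in zip(ranked_lists, targets):
        top = ranked[:n]
        if t in top:
            total += 1.0 / (top.index(t) + 1)
    return total / len(targets)


if __name__ == "__main__":
    ranked = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
    clicked = ["a", "a", "x"]
    print(hit_rate_at_n(ranked, clicked, 3), mrr_at_n(ranked, clicked, 3))
```

Hit-Rate@N only asks whether the clicked article was in the list, while MRR@N also rewards placing it near the top.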

**Parameters**

You must specify the subset of files used for training and evaluation, from *train_files_from* to *train_files_up_to* (each file holds the sessions started in the same hour). The evaluation frequency is set by *training_hours_for_each_eval*: for example, *training_hours_for_each_eval=5* means that after training on the sessions of 5 hours (files), the next hour (file) is used for evaluation.
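
The interplay of these three parameters can be sketched as a schedule of (training files, evaluation file) pairs. This is an illustration under two assumptions not stated in the source: the evaluated hour is then used as the first training hour of the next round, and *train_files_up_to* is treated as an inclusive upper bound. The function name is hypothetical.

```python
def temporal_schedule(train_files_from, train_files_up_to, training_hours_for_each_eval):
    """Yield (training_file_indices, eval_file_index) pairs in chronological order."""
    start = train_files_from
    while start + training_hours_for_each_eval <= train_files_up_to:
        train = list(range(start, start + training_hours_for_each_eval))
        eval_idx = start + training_hours_for_each_eval
        yield train, eval_idx
        # Assumption: the hour just evaluated becomes training data in the next round.
        start = eval_idx


if __name__ == "__main__":
    for train, eval_idx in temporal_schedule(0, 10, 5):
        print("train on hours", train, "then evaluate on hour", eval_idx)
```

Evaluating only on an hour that follows the training hours keeps the protocol free of look-ahead: the model never sees clicks from the period it is tested on.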

To reproduce the experiments of [2], where additional features are used as inputs to the NAR module, you must change the following parameters according to the Input Configurations (IC) reported in the paper: *enabled_articles_input_features_groups*, *enabled_clicks_input_features_groups*, *enabled_internal_features*.

To reproduce the experiments reported in [2] with novelty regularization in the loss function, change the parameter *novelty_reg_factor*.

```sh
cd nar_module && \
DATA_DIR="../data/gcom" && \
JOB_PREFIX=gcom && \
JOB_ID="$(whoami)_${JOB_PREFIX}_$(date '+%Y_%m_%d_%H%M%S')" && \
MODEL_DIR='/tmp/chameleon/jobs/'${JOB_ID} && \
echo 'Running training job and outputting to '${MODEL_DIR} && \
python3 -m nar.nar_trainer_gcom \
    --model_dir ${MODEL_DIR} \
    --acr_module_articles_metadata_csv_path ${DATA_DIR}/articles_metadata.csv \
    --acr_module_articles_content_embeddings_pickle_path ${DATA_DIR}/articles_embeddings.pickle \
    --train_set_path_regex "${DATA_DIR}/sessions_tfrecords/sessions_hour_*.tfrecord.gz" \
    --train_files_from 0 \
    --train_files_up_to 72 \
    --training_hours_for_each_eval 5 \
    --save_results_each_n_evals 1 \
    --batch_size 64 \
    --truncate_session_length 20 \
    --learning_rate 3e-5 \
    --dropout_keep_prob 1.0 \
    --reg_l2 1e-5 \
    --softmax_temperature 0.1 \
    --recent_clicks_buffer_hours 1.0 \
    --recent_clicks_buffer_max_size 20000 \
    --recent_clicks_for_normalization 2000 \
    --eval_metrics_top_n 6 \
    --CAR_embedding_size 1024 \
    --rnn_units 255 \
    --rnn_num_layers 1 \
    --train_total_negative_samples 30 \
    --train_negative_samples_from_buffer 3000 \
    --eval_total_negative_samples 30 \
    --eval_negative_samples_from_buffer 3000 \
    --eval_negative_sample_relevance 0.02 \
    --enabled_articles_input_features_groups "category" \
    --enabled_clicks_input_features_groups "time,device,location,referrer" \
    --enabled_internal_features "item_clicked_embeddings,recency,novelty,article_content_embeddings" \
    --novelty_reg_factor 0.0 \
    --disable_eval_benchmarks
```