In [1]:
# Use the network accelerator for academic resources (exports the proxy variables)
import subprocess
import os

result = subprocess.run('bash -c "source /etc/network_turbo && env | grep proxy"', shell=True, capture_output=True, text=True)
output = result.stdout

for line in output.splitlines():
    if '=' in line:
        var, value = line.split('=', 1)
        os.environ[var] = value

# Total Pipeline
## 1. Data Preprocess
### 1.1 Set Configs
To manage the configurations effectively, I've implemented the `Config` class in utils/load_configs.py. This class is capable of loading configurations from JSON files and can be utilized in various scenarios. In essence, it simplifies the input parameters for functions and classes.
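
To illustrate the idea, here is a minimal sketch of a JSON-backed configuration object. It is not the actual `Config` class from utils/load_configs.py: the name `SimpleConfig`, the method `read_from_json`, and the keys `batch_size` and `emb_path` in the demo file are assumptions made for this example.

```python
import json
import os
import tempfile


class SimpleConfig:
    """Hypothetical stand-in for the project's Config class."""

    def read_from_json(self, path):
        # Load the JSON file and expose each top-level key as an attribute.
        with open(path, "r", encoding="utf-8") as f:
            params = json.load(f)
        for key, value in params.items():
            setattr(self, key, value)
        return self


def demo_config():
    # Write a small config file with made-up keys, then load it back.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo_configs.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"batch_size": 256, "emb_path": "embeddings/demo.pkl"}, f)
        return SimpleConfig().read_from_json(path)


demo = demo_config()
print(demo.batch_size, demo.emb_path)
```

With such an object, a function only needs one `configs` argument instead of a long parameter list.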

In [2]:
from utils.load_configs import Config

Then, we have some JSON files located in the config_files directory, each containing pre-set parameters for different models and stages. Next, we'll load all of these JSON files. The specifics will be explained later when we utilize these configurations.

In [3]:
# corresponding to stage1 method 1 mentioned below.
stage1_train_config_using_lavis = Config()
stage1_train_config_using_lavis.stage1_read_from_json("config_files/stage1_train_configs_lavis.json")
# corresponding to stage1 method 2 mentioned below.
stage1_train_config_seperatly_extract = Config()
stage1_train_config_seperatly_extract.stage1_read_from_json("config_files/stage1_train_configs.json")

# corresponding to stage1 method 1 mentioned below.
stage1_dev_config_using_lavis = Config()
stage1_dev_config_using_lavis.stage1_read_from_json("config_files/stage1_dev_configs_lavis.json")
# corresponding to stage1 method 2 mentioned below.
stage1_dev_config_seperatly_extract = Config()
stage1_dev_config_seperatly_extract.stage1_read_from_json("config_files/stage1_dev_configs.json")

# corresponding to stage1 method 1 mentioned below.
stage1_test_config_using_lavis = Config()
stage1_test_config_using_lavis.stage1_read_from_json("config_files/stage1_test_configs_lavis.json")
# corresponding to stage1 method 2 mentioned below.
stage1_test_config_seperatly_extract = Config()
stage1_test_config_seperatly_extract.stage1_read_from_json("config_files/stage1_test_configs.json")

# corresponding to stage2 method 1 mentioned below.
stage2_train_config = Config()
stage2_train_config.stage2_read_from_json("config_files/stage2_train_configs.json")

stage2_dev_config = Config()
stage2_dev_config.stage2_read_from_json("config_files/stage2_dev_configs.json")

stage2_test_config = Config()
stage2_test_config.stage2_read_from_json("config_files/stage2_test_configs.json")

### 1.2 Load News Data
In this project, the data loading is wrapped so that a single function call does the work.<br>

You can call `Get_news_data(configs)` from utils/load_from_files.py to retrieve the news data. Loading from the files takes some time, about 2 minutes in my environment (**72GB of memory and a 32GB V100**). It relies on `extract_path_from_config(configs)`, defined in the same file, to parse the image folder and the path to the news data.

After loading all the news data, we should organize it into a dataset inherited from the PyTorch dataset. You can refer to the code in the data_processor folder for details about the dataset structure, and the README.md file explains how it functions.<br>

The method `get_NewsDataset(configs)` is a wrapper function to obtain a news dataset using the provided configuration, thus avoiding the need to explicitly parse the parameters.<br>

Since the news data is already loaded, we initialize `NewsDataset` directly from it with the standard constructor, which keeps memory usage down.

Each item in the dataset is a tuple: (newsid, title, abstract, img)<br>

Next, we wrap the dataset in a DataLoader so that it works with the other PyTorch components. For future extensions, I subclassed the PyTorch DataLoader as `BaseDataloader`; `NewsDataLoader` and `UserDataLoader` in turn inherit from `BaseDataloader`, each without additional methods.

After successfully constructing the `news_dataloader`, we can employ it for training or any other tasks within the PyTorch framework.<br>

Although it won't be utilized in training due to efficiency considerations (matching each news and clicked user pair each epoch would consume a significant amount of time and memory), it can aid us in comprehending the `UserDataLoader`. <br>

However, there is one noteworthy detail: the batch structure in `NewsDataLoader` is transposed, with the batch size appearing in the second position.
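
The transposition can be illustrated without loading any data. PyTorch's default collation groups a list of tuple items field by field, so a batch of (newsid, title, abstract, img) tuples becomes one list per field. The sketch below mimics that with plain Python; whether `NewsDataLoader` relies on the default collation is an assumption, and the toy items are made up.

```python
# Three toy items shaped like the dataset entries: (newsid, title, abstract, img).
items = [
    ("N1", "title 1", "abstract 1", "img1"),
    ("N2", "title 2", "abstract 2", "img2"),
    ("N3", "title 3", "abstract 3", "img3"),
]


def transpose_batch(batch):
    # Group the i-th field of every item together: one list per field.
    return [list(field) for field in zip(*batch)]


batch = transpose_batch(items)
print(len(batch))     # 4 fields come first ...
print(len(batch[0]))  # ... and the batch size (3) comes second
print(batch[0])       # ['N1', 'N2', 'N3']
```

So `batch[1]` holds all titles of the batch, which is convenient for passing them to a text encoder in one call.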

### 1.3 Embedding
Now that we have obtained the news data with the code above, the next step is to embed each news item into a 768-dimensional vector. I've devised two methods to investigate the effectiveness of multimodal pretraining:

1. Utilizing the Lavis package, which offers the BLIP pretrained model to extract features.
2. Employing the transformers package, which provides BCEmbedding and Dinov2-base for textual and visual extraction respectively.<br>

In theory, method one should outperform method two, and this will be validated in the subsequent experiments.<br>

We start with method 2. `News_Embedding_Pipeline.save_encode(dataloader)`, defined in News_Embedding_Pipeline.py, computes and saves the embeddings. Running it raises a FutureWarning about `use_auth_token`, which cannot be resolved in this environment. The computation takes a while, so a tqdm progress bar is shown to monitor progress.
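
The code comments in this notebook mention a "hadamard" variant, which suggests that the text vector and the image vector may be fused by element-wise multiplication. That is an assumption, not something the notebook shows; the sketch below only illustrates what such a fusion would look like for two vectors of equal size, using the hypothetical helper `hadamard_fuse`.

```python
def hadamard_fuse(text_vec, image_vec):
    """Element-wise (Hadamard) product of two equally sized vectors."""
    if len(text_vec) != len(image_vec):
        raise ValueError("text and image vectors must have the same length")
    return [t * v for t, v in zip(text_vec, image_vec)]


# Toy 3-dimensional vectors; the real embeddings would be 768-dimensional.
fused = hadamard_fuse([1.0, 2.0, 3.0], [2.0, 0.5, -1.0])
print(fused)
```

The fused vector keeps the dimension of its inputs, which would be consistent with the 768-dimensional news embedding mentioned above.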

In [9]:
# method 2, using the pretrained textual model and visual model separately.
from News_Embedding_Pipeline import News_Embedding_Pipeline
from tqdm import tqdm
train_news_embedder = News_Embedding_Pipeline(stage1_train_config_seperatly_extract)

train_news_embedder.save_encode(train_news_dataloader)

  return self.fget.__get__(instance, owner)()
04/14/2024 23:06:04 - [INFO] - Text Feature Extracter ->>> Successfully load maidalun1020/bce-embedding-base_v1 from hugging face
04/14/2024 23:06:07 - [INFO] - Text Feature Extracter ->>> Execute device: cuda;	 gpu num: 1;	
04/14/2024 23:06:07 - [INFO] - Visual Feature Extractor ->>> Successfully load facebook/dinov2-base from hugging face
04/14/2024 23:06:07 - [INFO] - Visual Feature Extractor ->>> Execute device: cuda;	 gpu num: 1;	
Saving Encode: 100%|██████████| 201/201 [06:35<00:00,  1.97s/it]


In [10]:
# to avoid memory overflow, the following code is run separately to get
# the news embeddings (title) for dev/test, hadamard/lavis.
from News_Embedding_Pipeline import News_Embedding_Pipeline
from data_processor.NewsDataset import get_NewsDataset,NewsDataset
from data_processor.NewsDataloader import NewsDataLoader
from utils.load_from_files import Get_news_data

dev_news_data = Get_news_data(stage1_dev_config_seperatly_extract)
dev_news_dataset = NewsDataset(dev_news_data)
dev_news_dataloader = NewsDataLoader(dev_news_dataset, stage1_dev_config_seperatly_extract)
dev_news_embedder = News_Embedding_Pipeline(stage1_dev_config_seperatly_extract)
dev_news_embedder.save_encode(dev_news_dataloader)

get images of news now.: 100%|██████████| 42416/42416 [00:56<00:00, 751.89it/s]
  return self.fget.__get__(instance, owner)()
04/14/2024 23:15:53 - [INFO] - Text Feature Extracter ->>> Successfully load maidalun1020/bce-embedding-base_v1 from hugging face
04/14/2024 23:15:53 - [INFO] - Text Feature Extracter ->>> Execute device: cuda;	 gpu num: 1;	
04/14/2024 23:15:54 - [INFO] - Visual Feature Extractor ->>> Successfully load facebook/dinov2-base from hugging face
04/14/2024 23:15:55 - [INFO] - Visual Feature Extractor ->>> Execute device: cuda;	 gpu num: 1;	
Saving Encode: 100%|██████████| 166/166 [05:20<00:00,  1.93s/it]


Because MIND-Small has no test set, the code in the next cell won't run on it. I only keep the output from loading the test set of the full MIND dataset.

In [4]:
# to avoid memory overflow, the following code is run separately to get
# the news embeddings (title) for dev/test, hadamard/lavis.
from News_Embedding_Pipeline import News_Embedding_Pipeline
from data_processor.NewsDataset import get_NewsDataset,NewsDataset
from data_processor.NewsDataloader import NewsDataLoader
from utils.load_from_files import Get_news_data

test_news_data = Get_news_data(stage1_test_config_seperatly_extract)
test_news_dataset = NewsDataset(test_news_data)
test_news_dataloader = NewsDataLoader(test_news_dataset, stage1_test_config_seperatly_extract)
test_news_embedder = News_Embedding_Pipeline(stage1_test_config_seperatly_extract)
test_news_embedder.save_encode(test_news_dataloader)

get images of news now.: 100%|██████████| 120959/120959 [02:46<00:00, 725.16it/s]
  return self.fget.__get__(instance, owner)()
04/14/2024 15:46:32 - [INFO] - Text Feature Extracter ->>> Successfully load maidalun1020/bce-embedding-base_v1 from hugging face
04/14/2024 15:46:34 - [INFO] - Text Feature Extracter ->>> Execute device: cuda;	 gpu num: 1;	
04/14/2024 15:46:36 - [INFO] - Visual Feature Extractor ->>> Successfully load facebook/dinov2-base from hugging face
04/14/2024 15:46:36 - [INFO] - Visual Feature Extractor ->>> Execute device: cuda;	 gpu num: 1;	
Saving Encode: 100%|██████████| 473/473 [15:09<00:00,  1.92s/it]


After embedding, the result is saved at the path given by `stage1_config_seperatly_extract.emb_path`.<br>

Then, we can load it from file to get the news embeddings.

### 1.4 load user data
With the embedded news, we can match each user's clicked history with the corresponding news embeddings. This costs a lot of time and memory, so the result should be saved as a pkl file.<br>
To achieve this, we first need to load the user data into a `UserDataset`.

In [11]:
from data_processor.UserDataset import get_UserDataset,UserDataset

raw_train_user_dataset = get_UserDataset(stage1_train_config_seperatly_extract)
print(raw_train_user_dataset[0])

04/14/2024 23:41:46 - [INFO] - UserDataset ->>> successfully built raw user dataset.


['U13740', ['N55189', 'N42782', 'N34694', 'N45794', 'N18445', 'N63302', 'N10414', 'N19347', 'N31801'], ['N55689-1', 'N35729-0']]


Now, we can process the `raw_user_dataset`. The goal is a list with shape [batchsize, clicked_histories, 768].<br>

At the moment it is [batchsize, clicked_histories], so we need to match the news embeddings to the clicked histories.<br>

We can use the function `combine_user_embedding(user_data, configs)` in load_from_files.py. It parses the path of `news_embeddings` from the configs and then attaches the corresponding news embeddings to each user's tuple.
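
The matching step can be sketched as a dictionary lookup. This is not the project's `combine_user_embedding`; the helper name `combine_user_with_news`, the toy 2-dimensional embeddings, and the choice to skip unknown news IDs are assumptions. The raw entry follows the `[user_id, history, impressions]` layout printed above, where an impression such as `N55689-1` carries the click label after the dash.

```python
def combine_user_with_news(raw_user, news_embeddings):
    """Hypothetical sketch: expand one raw user entry into per-impression records."""
    user_id, history, impressions = raw_user
    # Look up the embedding of every clicked news item, skipping unknown IDs.
    seq = [news_embeddings[nid] for nid in history if nid in news_embeddings]
    records = []
    for impression in impressions:
        # An impression looks like "N55689-1": news ID, then the click label.
        news_id, label = impression.rsplit("-", 1)
        if news_id not in news_embeddings:
            continue
        records.append({
            "seq": seq,
            "impression": news_embeddings[news_id],
            "target": int(label),
        })
    return records


toy_embeddings = {"N1": [0.1, 0.2], "N2": [0.3, 0.4], "N3": [0.5, 0.6], "N4": [0.7, 0.8]}
raw_user = ["U1", ["N1", "N2"], ["N3-1", "N4-0"]]
records = combine_user_with_news(raw_user, toy_embeddings)
print(len(records), records[0]["target"], records[1]["target"])  # 2 1 0
```

Each impression becomes its own record, so one user with many impressions yields many training examples that share the same history.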

In [12]:
from utils.load_from_files import combine_user_embedding

train_user_data = combine_user_embedding(raw_train_user_dataset, stage1_train_config_seperatly_extract)

combining user with news embeddings.: 100%|██████████| 156965/156965 [00:13<00:00, 11836.54it/s]


In [5]:
# to avoid memory overflow, the following code is run separately to get
# the user batches for dev/test, hadamard/lavis.
from utils.load_from_files import combine_user_embeddings_test
from data_processor.UserDataset import get_UserDataset,UserDataset

raw_dev_user_dataset = get_UserDataset(stage1_dev_config_seperatly_extract)
dev_user_data = combine_user_embeddings_test(raw_dev_user_dataset, stage1_dev_config_seperatly_extract)

04/16/2024 13:14:46 - [INFO] - UserDataset ->>> successfully built raw user dataset.
combining user with news embeddings.: 100%|██████████| 73152/73152 [00:03<00:00, 21681.87it/s]


In [4]:
# to avoid memory overflow, the following code is run separately to get
# the user batches for dev/test, hadamard/lavis.
from utils.load_from_files import combine_user_embeddings_test
from data_processor.UserDataset import get_UserDataset,UserDataset

raw_test_user_dataset = get_UserDataset(stage1_test_config_seperatly_extract)
test_user_data = combine_user_embeddings_test(raw_test_user_dataset, stage1_test_config_seperatly_extract)

04/14/2024 16:22:51 - [INFO] - UserDataset ->>> successfully built raw user dataset.
combining user with news embeddings.: 100%|██████████| 2370727/2370727 [02:25<00:00, 16271.43it/s]


After that, we can use `load_embedded_user(configs)` from utils/load_from_files.py to load the embedded user data from files. A matching `load_embedded_news(configs)` also exists, but it is not used frequently.

In [4]:
from utils.load_from_files import load_embedded_user, load_embedded_news

train_user_data = load_embedded_user(stage1_train_config_seperatly_extract)
print("loaded train data." )
dev_user_data = load_embedded_user(stage1_dev_config_seperatly_extract)
print("loaded dev data.")
# test_user_data = load_embedded_user(stage1_test_config_seperatly_extract)
# print("loaded test data.")
# news_data = load_embedded_news(stage1_config_seperatly_extract)
# news_data won't be used in the following code, so it is just a demonstration

loaded train data.
loaded dev data.


The user data is saved at the path given by `stage1_config_seperatly_extract.user_path`. When loaded, the file yields a list whose elements have the form `{"seq":……,"impression":……,"target":0/1}`. Next, we pad it and build it into a dataset and dataloader.<br>

For padding, I use zero vectors, and the padded target is 0 as well. The padding method could live in either `UserDataset` or `UserDataLoader`. In my opinion it belongs in `UserDataLoader`, because there it can adapt flexibly to the maximum length within each batch.<br>
First, turn the user_data into a user_dataset:

In [5]:
from data_processor.UserDataset import get_UserDataset,UserDataset

train_user_dataset = UserDataset(train_user_data)
print(len(train_user_dataset[126]["seq"])) # how long the sequence of histories is.
print(len(train_user_dataset[0]["impression"])) # the embedding size of impression news.
print(train_user_dataset[0]['target']) # the target.

36
768
1


Now, let's wrap it in a `UserDataLoader`. In the class `UserDataLoader`, I added the function `build_batch_fn(self, batch)`, which pads the sequences, builds the mask, and so on.
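
As a rough illustration of per-batch padding, the sketch below zero-pads every sequence to the longest one in the batch and returns a mask alongside it. It is not the project's `build_batch_fn`: the helper `pad_batch`, the plain Python lists in place of tensors, and the mask convention (1 for real positions, 0 for padding) are assumptions.

```python
def pad_batch(sequences, emb_dim):
    """Zero-pad variable-length sequences to the longest one in the batch."""
    max_len = max(len(seq) for seq in sequences)
    padded, mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Append one fresh zero vector per missing position.
        padded.append(list(seq) + [[0.0] * emb_dim for _ in range(n_pad)])
        # Assumed convention: 1 marks a real item, 0 marks padding.
        mask.append([1] * len(seq) + [0] * n_pad)
    return padded, mask


toy_batch = [
    [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]],  # history of length 3
    [[4.0, 4.0]],                          # history of length 1
]
padded, mask = pad_batch(toy_batch, emb_dim=2)
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

Padding inside the batch function means a batch of short histories stays short, instead of every sequence being padded to a global maximum.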

In [6]:
from data_processor.UserDataloader import UserDataLoader

train_user_dataloader = UserDataLoader(train_user_dataset,stage1_train_config_seperatly_extract)
for input_batch in train_user_dataloader:
    print(len(train_user_dataloader))
    print(type(input_batch["batch"]),input_batch["batch"].shape)
    print(type(input_batch["mask"]),input_batch["mask"].shape)
    print(type(input_batch["target"]),len(input_batch["target"]))
    break

22826
<class 'torch.Tensor'> torch.Size([256, 50, 768])
<class 'torch.Tensor'> torch.Size([50])
<class 'torch.Tensor'> 256


In [7]:
from data_processor.UserDataset import get_UserDataset,UserDataset
from data_processor.TestDataloader import TestDataLoader

dev_user_dataset = UserDataset(dev_user_data)
dev_user_dataloader = TestDataLoader(dev_user_dataset, stage1_dev_config_seperatly_extract)
for input_batch in dev_user_dataloader:
    print(type(input_batch["batch"]),input_batch["batch"].shape)
    print(type(input_batch["mask"]),input_batch["mask"].shape)
    print(type(input_batch["target"]),len(input_batch["target"]))
    break

<class 'torch.Tensor'> torch.Size([22, 16, 768])
<class 'torch.Tensor'> torch.Size([22, 16])
<class 'torch.Tensor'> 22


**CODE IN THE NEXT SECTION IGNORED.**

The user_dataloader yields batches in the format `input_batch = tensor[batchsize, sequence_length, 768]`, `src_mask = tensor[batchsize,sequence_length]`, `target = list[batchsize]`<br>

The targets are a list indicating whether the user clicked each impression; this is what the model must predict in the following work.<br>

We have now loaded all of our data and built the batches successfully. Next, we turn to the recommendation system itself.

## 2. Model Structures
### 2.1 Recaller

Given that we have already obtained the historical sequence embeddings for each user, a straightforward approach is to compute the average of these representations to obtain the user interest embedding. The rationale behind employing such a simple method will be elaborated in section 2.2.3.<br>

Subsequently, we can embed the newly arrived news and assess their similarity with the user interest. Based on this comparison, we retrieve the top 50 or more relevant news items for further processing in the reranker stage.<br>

Since the Recaller component is straightforward and not utilized in this experiment (given that the MIND dataset targets click prediction), we can consider the candidate news as the output of the Recaller. Therefore, the code for this segment will not be provided.
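
Although the Recaller is not part of this experiment, the idea described above fits in a few lines. The sketch below is illustrative only: cosine similarity is an assumed choice (the text only says "similarity"), and the helper names and toy 2-dimensional vectors are made up.

```python
import math


def mean_pool(vectors):
    # Average the history embeddings dimension by dimension.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_top_k(history, candidates, k):
    """Return the IDs of the k candidates most similar to the user interest."""
    user = mean_pool(history)
    ranked = sorted(candidates.items(), key=lambda kv: cosine(user, kv[1]), reverse=True)
    return [news_id for news_id, _ in ranked[:k]]


history = [[1.0, 0.0], [0.8, 0.2]]
candidates = {"N1": [1.0, 0.1], "N2": [0.0, 1.0], "N3": [0.5, 0.5]}
top = recall_top_k(history, candidates, k=2)
print(top)  # ['N1', 'N3']
```

In a real system `k` would be 50 or more, as stated above, and the selected candidates would be handed to the reranker.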

### 2.2 Reranker
#### 2.2.1 Model structure
In this section, I utilize a Transformer-based **Encoder Only** model to refine the ranking of the candidates recalled by the recaller.<br>

From a task perspective, we can treat it as a binary classification task, or equivalently as a translation task whose target vocabulary comprises only two words, '0' and '1'. The predicted probability assigned to '1' is the rerank score for the respective newly arrived news item.<br>
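
To make the scoring concrete, the sketch below turns a pair of class logits into the probability of class '1' and sorts candidates by it. It assumes the model emits two logits per candidate and that a softmax is applied; the notebook does not show the final layer in full, so both points are assumptions, as are the helper names and toy logits.

```python
import math


def click_probability(logits):
    """Softmax over the two class logits; return the probability of class '1'."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return exps[1] / sum(exps)


def rerank(candidate_logits):
    # Score every candidate independently, then sort by descending score.
    scores = {nid: click_probability(lg) for nid, lg in candidate_logits.items()}
    order = sorted(scores, key=scores.get, reverse=True)
    return order, scores


toy_logits = {"N1": [2.0, 0.0], "N2": [0.0, 0.0], "N3": [-1.0, 1.0]}
order, scores = rerank(toy_logits)
print(order)  # ['N3', 'N2', 'N1']
```

Each candidate is scored on its own, so the ranking comes purely from comparing the per-candidate probabilities afterwards.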

#### 2.2.2 Analysis
It is imperative to elucidate the rationale behind selecting this particular architecture. Specifically, I will address the following inquiries:<br>

1. Why Transformer?
2. Why Encoder Only?
3. Why employ a special build batch function?<br>

Firstly, why Transformer? Traditional RNN-based models lack the capability for parallel computation, thereby compromising computational efficiency. In contrast, Transformer models have demonstrated superior efficiency.<br>

Secondly, why Encoder Only? An encoder is generally better at comprehension than a decoder, so it is the natural choice for a classification task. One concern is rank: the paper [*Low-Rank Bottleneck in Multi-head Attention Models*](https://arxiv.org/abs/2002.07028) observes that an encoder employing multi-head attention might encounter a low-rank bottleneck. However, the sequences in our experiments are much shorter than typical documents, and d_model is large enough to avoid this problem.<br>

Thirdly, why a special build batch function? The rationale lies in preventing the re-ranker from accessing other candidate passages. Analogous to students being discouraged from viewing each other's answers during examinations, such access could potentially lead to inferior performance (like peeking at a neighbor's paper and changing your own correct answer).<br>

#### 2.2.3 Why do we need Reranker?
First, a recaller that uses such a simple approach cannot meet the task requirements on its own. Experiments in Retrieval-Augmented Generation **[(RAG)](https://zhuanlan.zhihu.com/p/681370855)** indicate that combining a basic recaller with a reranker yields better scores.<br>

One might question the rationale behind not employing a more sophisticated recaller to attain a more precise outcome before reranking. However, the consideration of computational costs negates this proposition. Opting for a more precise recall, such as using the reranker as the primary recaller, presents a significant computational challenge when confronted with vast volumes of incoming news data, potentially leading to system collapse.<br>

In summary, the best solution for recommendation combines a basic recaller with an enhanced reranker. This combination leverages the strengths of both components and leads to better recommendation outcomes. This also answers the question deferred from section 2.1 about why the recaller is kept so simple.

With the analyses complete, we can proceed to initialize the model for our experiment.<br>

The specifics of initializing the model can be found in `model/RecTransformer.py`. The code has been encapsulated, allowing us to invoke the `build_model` function to obtain a model object initialized using the Xavier method.<br>
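
For reference, Xavier (Glorot) uniform initialization draws each weight from the range [-b, b] with b = sqrt(6 / (fan_in + fan_out)). The sketch below computes that bound for a 768-to-1024 linear layer, the size of `w_1` in the printed model, and fills a small toy matrix. Whether `build_model` uses the uniform or the normal variant is not shown, so the uniform variant here is an assumption, and the helper names are hypothetical.

```python
import math
import random


def xavier_uniform_bound(fan_in, fan_out, gain=1.0):
    # Glorot uniform: weights are sampled from U(-bound, bound).
    return gain * math.sqrt(6.0 / (fan_in + fan_out))


def xavier_uniform_matrix(fan_in, fan_out, seed=0):
    """Return a fan_out x fan_in weight matrix as nested Python lists."""
    rng = random.Random(seed)
    bound = xavier_uniform_bound(fan_in, fan_out)
    return [[rng.uniform(-bound, bound) for _ in range(fan_in)] for _ in range(fan_out)]


bound = xavier_uniform_bound(768, 1024)
toy_weights = xavier_uniform_matrix(fan_in=4, fan_out=3)
print(round(bound, 4))
```

The bound shrinks as the layer gets wider, which keeps the variance of activations roughly constant from layer to layer.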

In [8]:
from model.RecTransformer import build_model

model = build_model(stage2_train_config)
print(model)

RecTransformer(
  (encoder): Encoder(
    (layers): ModuleList(
      (0-3): 4 x Encoder_Layer(
        (self_attention): Multi_Head_Attention(
          (linears): ModuleList(
            (0-3): 4 x Linear(in_features=768, out_features=768, bias=True)
          )
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (feed_forward): Positionwise_Feed_Forward(
          (w_1): Linear(in_features=768, out_features=1024, bias=True)
          (w_2): Linear(in_features=1024, out_features=768, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
        )
        (sublayer): ModuleList(
          (0-1): 2 x Sub_Layer_Connection(
            (norm): Layer_Norm()
            (dropout): Dropout(p=0.1, inplace=False)
          )
        )
      )
    )
    (norm): Layer_Norm()
  )
  (source_embedder): Positional_Embedding(
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (output_layer): Sequential(
    (0): Linear(in_features=768, out_features=256, bias=True)
    (1): L

Subsequently, equipped with the dataloader and prepared model, we may proceed to utilize the trainer for model training and subsequent evaluation. Detailed information regarding the trainer and evaluator functionalities is available in the files `components/trainer.py` and `components/evaluator.py`, where the classes `Trainer` and `Evaluator` are defined, respectively.<br>

However, for convenience, one may simply instantiate the trainer object as follows: `trainer=Trainer(config,model,train_dataloader,dev_dataloader,evaluator)`, followed by invoking the `trainer.fit()` method to commence model training.

In [9]:
from components.trainer import Trainer
from components.evaluator import Evaluator

evaluator = Evaluator()

trainer = Trainer(stage2_train_config, model, train_user_dataloader,
                  dev_user_dataloader, evaluator)

After initializing the `trainer`, which wraps the methods `fit`, `evaluate` and `test`, we can simply call `trainer.fit()` and `trainer.test()` to train and test the model:<br>

Training for one epoch takes about 2 hours.

In [None]:
trainer.fit()

04/18/2024 17:21:13 - [INFO] - Trainer ->>> start training...
train process :   0%|          | 0/20 [00:00<?, ?it/s]

global_step:500, loss:1.082
global_step:1000, loss:1.589
global_step:1500, loss:1.053
global_step:2000, loss:1.231
global_step:2500, loss:1.364
global_step:3000, loss:1.594


So far, we have completed one experiment, the one that uses the separately pretrained single-modal models. The remaining configuration can be trained with the same procedure.

## 3. Results
