---
title: "AXEL Dataset"
author: "Suchith Prabhu"
date: "2023-04-29"
jupyter: python3
description: How public datasets are used in experiments with the AXEL model, which datasets are involved, and how they are split.
image: "images/axel_model.png"
sidebar: false

---

AXEL, which stands for Augmented eXtreme Classifiers for Efficient Language generation, is a family of models that combine extreme classifiers (XC) with generative models to address specific tasks. One way to integrate the two components is to feed the output of the extreme classifier to the language model as additional input. The following diagram illustrates this approach.

![AXEL Model](images/axel_model.png)

This post describes the datasets used for public experimentation with the model and explains how the train-test split is created for evaluation. The primary objective of these models is to improve the quality of the generative model by combining it with an extreme classifier. The evaluation task is therefore to compare the generative model on its own against the generative model combined with the extreme classifier.

Three public datasets were selected for this purpose: ORCAS, Wikipedia articles, and Amazon products. Because the goal is to improve the decoder's generation on a given task using information from the extreme classifier, each experiment needs two datasets. The primary dataset trains the decoder, which is the main focus. The auxiliary dataset trains the extreme classifier, whose output augments the decoder's input.

Care must be taken to ensure that the extreme classifier task complements the generative task and does not impede the decoder's learning. One concern is substantial overlap between the datasets used to train the two models. In that case, the extreme classifier may hand the correct output directly to the decoder, so the decoder learns nothing new and merely copies its augmented input, acting as an identity function.
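The overlap concern can be checked before training. The following is a minimal sketch, assuming both datasets are available as lists of `(input, target)` string pairs; the function name `pair_overlap` and the toy pairs are hypothetical and only illustrate the idea.

```python
def pair_overlap(decoder_pairs, xc_pairs):
    """Fraction of distinct decoder (input, target) pairs also present in the XC data."""
    decoder_set = set(decoder_pairs)
    if not decoder_set:
        return 0.0
    return len(decoder_set & set(xc_pairs)) / len(decoder_set)


# Toy data (hypothetical): one of the two decoder pairs also appears in the XC set.
decoder_pairs = [
    ("cheap flights", "flight deals"),
    ("cheap flights", "airline tickets"),
]
xc_pairs = [
    ("cheap flights", "flight deals"),
    ("hotel booking", "cheap hotels"),
]

overlap = pair_overlap(decoder_pairs, xc_pairs)
print(f"overlap: {overlap:.2f}")
```

A value close to 1 would suggest that the extreme classifier can simply supply the answer, which is the situation the splits described in this post are meant to avoid.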

# ORCAS dataset

ORCAS is a query-to-document dataset constructed from click logs. Details about its construction are available [here](https://suchith720.github.io/posts/research/orcas-related-queries/orcas_dataset_blog.html). We build our datasets from the dump using two distinct approaches. Since our ultimate aim is to deploy the model at Bing-Ads, we focus on the query-to-query dataset, which closely matches the actual task of query-to-keyword matching.

## Approach 1

First, we build a query-query dataset by linking each query to all of its two-hop neighbors in the query-document click graph, that is, to the other queries that share a clicked document with it.

![ORCASRelatedQuery dataset generation process](images/ORCASRelatedQuery_generation.png)
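The two-hop construction can be sketched as follows. This is an illustrative sketch, not the actual pipeline: it assumes the click log has been reduced to `(query, doc_id)` tuples, and the function name `two_hop_query_neighbors` and the toy data are hypothetical.

```python
from collections import defaultdict


def two_hop_query_neighbors(clicks):
    """Map each query to the other queries that clicked at least one common document."""
    # First hop: group queries by the document they clicked.
    doc_to_queries = defaultdict(set)
    for query, doc in clicks:
        doc_to_queries[doc].add(query)

    # Second hop: every query on a document is related to the others on it.
    related = defaultdict(set)
    for queries in doc_to_queries.values():
        for query in queries:
            related[query].update(queries - {query})
    return dict(related)


# Toy click log (hypothetical): "a" and "b" share d1, "b" and "c" share d2.
clicks = [("a", "d1"), ("b", "d1"), ("b", "d2"), ("c", "d2"), ("d", "d3")]
related = two_hop_query_neighbors(clicks)
print(related)
```

Note that the second loop is quadratic in the number of queries per document, so very popular documents may need to be capped or sampled on a dump of realistic size.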

We then divide the dataset into three partitions and remove any edges that cross between partitions.

![ORCASRelatedQuery train-test split](images/ORCASRelatedQuery_split.png)

The first partition, $D_1$, is used to train the decoder, which is our primary objective. The second partition, $D_2$, is used to train the XC model. The third partition, $D_3$, is reserved for evaluating the decoder both before and after the XC augmentation is added.

![ORCASRelatedQuery dataset](images/ORCASRelatedQuery.png)
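One way to realize such a split is sketched below. It assumes, since the post does not spell this out, that queries are assigned to the three partitions uniformly at random and that an edge is kept only when both of its endpoints land in the same partition. The function name `split_three_ways` and the toy edges are hypothetical.

```python
import random


def split_three_ways(edges, seed=0):
    """Assign each query to one of three partitions and drop cross-partition edges."""
    rng = random.Random(seed)
    nodes = sorted({query for edge in edges for query in edge})
    # Assumption: uniform random assignment of queries to partitions 0, 1, 2.
    part_of = {query: rng.randrange(3) for query in nodes}

    parts = ([], [], [])
    for src, dst in edges:
        if part_of[src] == part_of[dst]:  # keep only edges inside one partition
            parts[part_of[src]].append((src, dst))
    return parts, part_of


# Toy query-query edges (hypothetical).
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f"), ("a", "f")]
(d1, d2, d3), part_of = split_three_ways(edges, seed=0)
print(len(d1), len(d2), len(d3))
```

Because every kept edge lies inside a single partition, no query appears in more than one of $D_1$, $D_2$, and $D_3$. The cost is that cross-partition edges are discarded, so the three partitions together contain fewer edges than the original graph.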

## Approach 2

In this approach, the query-document dataset is split into two parts on the document side, and each query is assigned to one of the parts according to its query-document links. The resulting two sections have no crossover links. The $D_1$ partition, in query-query form, is used to train the decoder, while the $D_2$ partition, in query-document form, is used to train the XC model.

![Mix of query-document and query-query dataset.](images/ORCAS_approach2.png)
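A document-side split of this kind might look like the sketch below. The assignment rule is an assumption made for illustration: documents are split at random, each query goes to the side that holds the majority of its clicked documents (ties go to side 0), and links that would cross sides are dropped. The function name `split_by_document` and the toy click log are hypothetical.

```python
import random
from collections import Counter, defaultdict


def split_by_document(clicks, seed=0):
    """Split (query, doc) links into two sides that share no query and no document."""
    rng = random.Random(seed)
    docs = sorted({doc for _, doc in clicks})
    # Assumption: each document is placed on side 0 or 1 uniformly at random.
    doc_side = {doc: rng.randrange(2) for doc in docs}

    # Count, for every query, how many of its clicks fall on each side.
    votes = defaultdict(Counter)
    for query, doc in clicks:
        votes[query][doc_side[doc]] += 1
    # Assumption: majority vote decides the query's side; ties go to side 0.
    query_side = {q: (1 if c[1] > c[0] else 0) for q, c in votes.items()}

    sides = ([], [])
    for query, doc in clicks:
        if query_side[query] == doc_side[doc]:  # drop crossover links
            sides[doc_side[doc]].append((query, doc))
    return sides


# Toy click log (hypothetical).
clicks = [("q1", "d1"), ("q1", "d2"), ("q2", "d2"), ("q3", "d3"), ("q3", "d1")]
d1_links, d2_links = split_by_document(clicks, seed=0)
print(d1_links, d2_links)
```

Under these assumptions, the links on the first side would still need to be turned into query-query pairs before training the decoder, while the links on the second side can be used for the XC model as they are.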

# Wikipedia articles dataset

Here, the decoder is trained to predict the `category` of a given Wikipedia article, and the auxiliary XC task is to predict the `SeeAlso` title associated with the same article.

![Wikipedia dataset](images/wikipedia_dataset.png)

# Amazon Products dataset

Here, the decoder generates the `also_buy` product for a given product title. In parallel, XC models can be trained to predict the `also_view` and `similar items` associated with the same product, again using its title as input.

![Amazon dataset](images/amazon_dataset.png)
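The decoder and XC training pairs can be derived from the same product records, as in the sketch below. The record layout is an assumption: each product is a dictionary with an `asin` identifier, a `title`, and lists of related identifiers under `also_buy` and `also_view`. The function name `build_title_pairs` and the toy catalogue are hypothetical.

```python
def build_title_pairs(products, field):
    """Return (input title, related title) pairs for one relation field."""
    title_of = {p["asin"]: p["title"] for p in products}
    pairs = []
    for product in products:
        for related_id in product.get(field, []):
            # Skip related products whose title is not in the catalogue.
            if related_id in title_of:
                pairs.append((product["title"], title_of[related_id]))
    return pairs


# Toy catalogue (hypothetical identifiers and titles).
products = [
    {"asin": "A1", "title": "USB-C cable", "also_buy": ["A2"], "also_view": ["A3"]},
    {"asin": "A2", "title": "USB-C charger", "also_buy": [], "also_view": ["A1", "A9"]},
    {"asin": "A3", "title": "HDMI cable"},
]

decoder_pairs = build_title_pairs(products, "also_buy")  # primary task
xc_pairs = build_title_pairs(products, "also_view")      # auxiliary task
print(decoder_pairs)
print(xc_pairs)
```

Building both sets with the same function keeps the inputs identical across the two tasks, so only the relation being predicted differs.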