# Multilabel Classification of Journal and Conference Abstracts

## Overview

Text classification is a common example of an application for supervised machine learning on texts.

Example use-cases include:

* Detecting topics of abstracts submitted to conferences/for publication in scientific journals. 
* Detecting types of toxic comments in internet forums as is done in the [Toxic Comment Classification Challenge](https://paperswithcode.com/task/toxic-comment-classification).
* Automating routing of customer queries in order to scale customer support efforts.  

The goal of this project is to build ML models that can predict the topics of an abstract that a user is submitting to a conference. The task at hand is framed as a multilabel classification task, meaning that each abstract can have multiple tags associated with it. Multilabel classification tasks are typically solved in one of the three following ways:

* Reframe the task into a multiclass classification task, where each class represents a combination of original labels. This scales very poorly with the number of labels used. 
* Reframe the task of predicting N multi-labels into N single-label classification tasks.
* Reframe the task of predicting N multi-labels into N single-label classification tasks, but chain them together, so that each classifier receives as additional inputs the outputs of the previous classifiers.  
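
To make the difference between the three framings concrete, here is a minimal sketch of how the training targets would look in each case. The four toy abstracts and their tags are made up for illustration and are not taken from the project's data.

```python
# Toy multilabel targets: one set of arXiv-style tags per abstract (made-up data).
tags_per_abstract = [{"cs.LG"}, {"cs.LG", "cs.CV"}, {"cs.RO"}, {"cs.CV"}]
labels = sorted(set().union(*tags_per_abstract))  # ['cs.CV', 'cs.LG', 'cs.RO']

# Approach 1: every distinct tag combination becomes one class of a single
# multiclass problem. With N labels there can be up to 2**N such classes.
powerset_targets = ["+".join(sorted(tags)) for tags in tags_per_abstract]

# Approach 2: one independent 0/1 target column per label.
binary_targets = {
    label: [int(label in tags) for tags in tags_per_abstract] for label in labels
}

# Approach 3: same 0/1 targets as approach 2, but the classifier for each label
# additionally receives the outputs for all labels that come earlier in the chain.
chain_inputs = {label: labels[:position] for position, label in enumerate(labels)}

print(powerset_targets)
print(binary_targets)
print(chain_inputs)
```

The order of the chain in approach 3 is a free design choice; the alphabetical order used here is only a placeholder.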

In this project, I will be using the second approach - reframing the task into independent single-label classification tasks.

Since we are dealing with imbalanced data (each label occurs in fewer than 10% of articles) and I am interested both in capturing as many abstract labels as possible and in not producing too many false positives, I assess model performance using the f1 score. I then average the f1 score across labels using the weighted average, since I consider it more important to correctly predict as many tags as possible than to achieve equally good performance across all labels.
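
As a small illustration of the metric, the sketch below computes the per-label f1 score and its support-weighted average in plain Python. The two toy labels and their predictions are made up; in practice a library function such as sklearn's `f1_score` with `average="weighted"` computes the same quantity.

```python
def f1(y_true, y_pred):
    """Binary f1 score for one label, given 0/1 lists."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def weighted_f1(true_by_label, pred_by_label):
    """Average the per-label f1 scores, weighted by the number of true positives per label."""
    supports = {label: sum(y_true) for label, y_true in true_by_label.items()}
    total_support = sum(supports.values())
    if total_support == 0:
        return 0.0
    return sum(
        supports[label] * f1(true_by_label[label], pred_by_label[label])
        for label in true_by_label
    ) / total_support


# Made-up example: a frequent label predicted fairly well, a rare one less so.
true_by_label = {"cs.LG": [1, 1, 1, 0], "cs.RO": [0, 0, 0, 1]}
pred_by_label = {"cs.LG": [1, 1, 0, 0], "cs.RO": [0, 1, 0, 1]}

score = weighted_f1(true_by_label, pred_by_label)
print(round(score, 3))
```

Because each label's f1 score is weighted by its support, frequent labels dominate the average, which matches the stated preference for getting as many tags right as possible.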

The project is developed in a combination of source code as well as notebooks for exploration. All notebooks can be found as html exports in the reports folder. Figures can be found in reports/figures.

This project (so far) contains six steps (for more details, see the detailed description below and the actual notebooks and scripts):

### Data Sourcing

The first step of any machine learning project is sourcing and cleaning the data. In order to train a classifier that can sort abstracts into topics useful for a data science and machine learning conference, I scraped abstracts from arXiv. At the time of writing, arXiv uses a [set of 155 labels](https://arxiv.org/category_taxonomy). From these categories, I selected the ones I found most applicable to a machine learning conference, downloaded 3000 abstracts for each, and finally deduplicated the dataset. The script used to generate the dataset is distributed as part of this project and can be run using `make arxiv-data`.

The following categories are included:

* Computation and Language
* Computer Vision and Pattern Recognition
* Computers and Society
* Computer Science and Game Theory
* Machine Learning
* Multiagent Systems
* Robotics
* Social and Information Networks
* Audio and Speech Processing
* Signal Processing
* Systems and Control
* Numerical Analysis
* Optimization and Control
* Statistics Theory


### Notebook 01 - Label Exploration and Problem Framing

In Notebook-01, I explore the label distribution and the interplay between the different labels.

![Raw label distribution](../reports/figures/distribution-of-tags.png)

I then clean the labels by discarding all labels outside the categories listed above and keeping only the labels of interest. The following figure shows the resulting distribution of the number of tags per abstract after cleaning:

![](../reports/figures/distribution-of-number-of-tags.png)

I then proceed to split the dataset into train, validation and test data using stratified sampling. Since we are dealing with a multilabel classification task, we need to adapt the stratified sampling process to perform stratified sampling across all labels. Due to the resulting added complexity, I decided to not perform cross-validation, but create a single train-validation-test split. 
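
To illustrate what stratifying across all labels involves, here is a simplified greedy sketch in the spirit of iterative stratification. It is not the implementation used in the notebook: the function name `greedy_multilabel_split`, the toy data, and the tie-breaking rules are all assumptions made for this example.

```python
from collections import defaultdict


def greedy_multilabel_split(label_sets, fractions):
    """Assign each sample to one split, handling the rarest labels first.

    label_sets: one set of labels per sample.
    fractions: desired share of samples per split, e.g. [0.8, 0.1, 0.1].
    Returns a list with the split index of every sample.
    """
    n_samples, n_splits = len(label_sets), len(fractions)
    label_counts = defaultdict(int)
    for labels in label_sets:
        for label in labels:
            label_counts[label] += 1

    # Remaining demand per split, overall and per label.
    want_total = [fraction * n_samples for fraction in fractions]
    want_label = {
        label: [fraction * count for fraction in fractions]
        for label, count in label_counts.items()
    }

    assignment = {}
    for label in sorted(label_counts, key=lambda lab: (label_counts[lab], lab)):
        for index, labels in enumerate(label_sets):
            if index in assignment or label not in labels:
                continue
            # Pick the split that still needs this label most (ties: most samples needed).
            split = max(
                range(n_splits), key=lambda s: (want_label[label][s], want_total[s])
            )
            assignment[index] = split
            want_total[split] -= 1
            for sample_label in labels:
                want_label[sample_label][split] -= 1

    # Samples without any label only need to balance the overall split sizes.
    for index in range(n_samples):
        if index not in assignment:
            split = max(range(n_splits), key=lambda s: want_total[s])
            assignment[index] = split
            want_total[split] -= 1
    return [assignment[index] for index in range(n_samples)]


# Made-up data: 10 abstracts tagged cs.LG and 10 tagged cs.RO.
label_sets = [{"cs.LG"}] * 10 + [{"cs.RO"}] * 10
splits = greedy_multilabel_split(label_sets, fractions=[0.8, 0.1, 0.1])
print([splits.count(s) for s in range(3)])
```

A plain random split could easily leave a rare label out of the validation or test set entirely; tracking the remaining demand per label and per split is what prevents that here.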

The effects of this notebook on the data can be reproduced by executing `make interim-data`.

### Notebook 02 - Feature Exploration

In this step, I look at the actual abstract data. I found that newlines are coded as '\n', something that requires cleaning for all models. Since these are scientific abstracts, we also find LaTeX notation as part of the abstracts, another text feature that requires cleaning. 
For the dataset I looked at, I found the following distribution of word counts after cleaning and stopword removal. 

![](../reports/figures/word-count-distribution-after-stopword-removal.png)

### Notebook 03 - Baseline Model

To create a baseline model, I chose to construct a simple keyword-based model. For each label, I found the 15 most common words and then kept only those words that did not appear in any other label's top-15 words. This yielded keywords for all labels except Machine Learning, for which I then chose 'learning'.
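
The selection procedure described above can be sketched as follows. The helper name `exclusive_top_words` and the toy token lists are hypothetical, and the example uses `top_n=2` instead of 15 to stay small.

```python
from collections import Counter


def exclusive_top_words(tokens_by_label, top_n=15):
    """Keep, per label, the top-n words that are in no other label's top-n."""
    top_words = {
        label: {word for word, _ in Counter(tokens).most_common(top_n)}
        for label, tokens in tokens_by_label.items()
    }
    exclusive = {}
    for label, own_words in top_words.items():
        other_words = set().union(
            *(words for other, words in top_words.items() if other != label)
        )
        exclusive[label] = own_words - other_words
    return exclusive


# Made-up tokens from cleaned abstracts, grouped by label.
tokens_by_label = {
    "cs.CV": ["image", "image", "model", "model", "pixel"],
    "cs.RO": ["robot", "robot", "model", "model", "arm"],
}
keywords = exclusive_top_words(tokens_by_label, top_n=2)
print(keywords)
```

A label whose top words are all shared with other labels ends up with an empty set, which is the situation described above for Machine Learning.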

These are the keywords I chose:

```python
label_keywords = {
    "cs.CL": {"large", "text", "llm"},
    "cs.CV": {"feature", "image"},
    "cs.CY": {"fairness", "research", "ai", "provide"},
    "cs.GT": {"mechanism", "strategy", "equilibrium", "player", "game"},
    "cs.LG": {"learning"},
    "cs.MA": {"policy", "multi"},
    "cs.RO": {"real", "robot"},
    "cs.SI": {"community", "node", "information", "graph"},
    "eess.AS": {"audio", "speaker"},
    "eess.SP": {"communication", "signal", "channel"},
    "eess.SY": {"state"},
    "math.NA": {"scheme", "solution", "numerical", "order", "equation"},
    "math.OC": {"optimal", "optimization"},
    "math.ST": {"sample", "estimator", "distribution"},
}
```

The constructed classifier predicts a label to be True if any of its keywords appear in the abstract.
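
A minimal sketch of this prediction rule is shown below. It assumes that the abstract has already been cleaned and that keywords are matched against whole whitespace-separated tokens; the function name `predict_labels` and the example abstract are made up for illustration.

```python
def predict_labels(cleaned_abstract, label_keywords):
    """Predict True for a label if any of its keywords occurs as a token."""
    tokens = set(cleaned_abstract.split())
    return {
        label: bool(tokens & keywords) for label, keywords in label_keywords.items()
    }


# Subset of the keyword table above.
label_keywords = {
    "cs.CV": {"feature", "image"},
    "cs.LG": {"learning"},
    "cs.RO": {"real", "robot"},
}

predictions = predict_labels("robot grasp object camera image", label_keywords)
print(predictions)
```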

This model achieved a weighted f1-score of 0.41 on the validation set, which is not great, but also not too bad as a baseline given the imbalance of the dataset.   

This could be an interesting benchmark to use and an interesting fallback as a model. Alternatively, we could do some more hand-curation of the words to use. 

### Notebook 04 - TFIDF with Logistic Regression

As a first ML model, I chose to encode the data using tfidf after basic text cleaning. I then trained a logistic regression model on the tfidf data, since logistic regression models are suitable for sparse data and large datasets and allow for easy class-weight balancing. To account for the multi-label classification, I used sklearn's `MultiOutputClassifier` wrapper, which trains one model per label.
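
The setup described above could look roughly like the following sketch. The toy texts, the two labels, and the hyperparameters (default `TfidfVectorizer` settings, `max_iter=1000`) are assumptions for illustration and are not taken from the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import make_pipeline

# Made-up, already cleaned abstracts with one 0/1 column per label.
texts = [
    "image segmentation convolutional network",
    "robot arm grasp planning",
    "robot camera image navigation",
    "image classification benchmark",
    "robot locomotion control",
    "speech recognition transformer",
]
label_names = ["cs.CV", "cs.RO"]
targets = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [0, 0]])

# MultiOutputClassifier fits one independent logistic regression per label column.
model = make_pipeline(
    TfidfVectorizer(),
    MultiOutputClassifier(LogisticRegression(class_weight="balanced", max_iter=1000)),
)
model.fit(texts, targets)

# One row per abstract, one 0/1 prediction per label.
predictions = model.predict(["robot image"])
print(dict(zip(label_names, predictions[0])))
```

Setting `class_weight=None` instead gives the unbalanced variant of the same model.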

To understand the effect of the imbalance, I trained the model once with and once without balancing the data. Without rebalancing, the model achieved an f1 score of 0.66; with rebalancing, it achieved an f1 score of 0.72 on the validation set.
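
To give an idea of what balancing does, the sketch below reproduces the heuristic behind sklearn's `class_weight="balanced"` option, which weights each class by `n_samples / (n_classes * class_count)`. That the notebook balances via this option is an assumption on my part, and the 5%/95% split is a made-up example.

```python
from collections import Counter


def balanced_class_weights(y):
    """Weight each class inversely to its frequency."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Made-up label column: the tag is present in 5 of 100 abstracts.
weights = balanced_class_weights([1] * 5 + [0] * 95)
print(weights)
```

With these weights, an error on the rare positive class costs roughly 19 times as much as an error on the negative class, which pushes each per-label classifier towards higher recall.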

A comparison of train and validation metrics shows that our model is still overfitting to the training set though, something to be fixed in the future. 

The following table shows the 5 most important words for each label. We can see that further cleaning and regularization are still needed, since some words are present that are very probably not predictive (e.g. 'propose').

![Top-5 text features for tfidf+logregression](../reports/figures/top-5-per-label-tfidf.png)

For a more detailed insight into word importance, the following figure shows the importance of all top-5 words for each label.

![](../reports/figures/feature-importance-tfidf.png)

### Notebook 05 - DistilBert 

Two issues with bag-of-words approaches like tfidf are that (i) they treat each word as its own dimension, i.e. 'man' is no closer to 'woman' than it is to 'apple', and (ii) they lose the context of words in a sentence. One way to counteract (i), and therefore an improvement on bag-of-words+tfidf, is to use word embeddings. Commonly used embeddings include Word2Vec and GloVe. However, because of the specialized vocabulary in our task, I would not necessarily expect embeddings to bring much benefit in isolation. For this reason, and because of a lack of time, I do not explore this option but jump directly to exploring the use of a [BERT model](https://en.wikipedia.org/wiki/BERT_(language_model)). BERT models are known for achieving excellent results on natural language tasks and have been applied to multi-label classification tasks similar to the one presented here (an example being the [Toxic Comment Classification Dataset](https://paperswithcode.com/task/toxic-comment-classification)).

For this project, I choose to explore the use of [DistilBert](https://arxiv.org/abs/1910.01108), a smaller version of BERT. Being smaller and faster than the original BERT, DistilBert is particularly well suited for tasks where inference speed is important (for example in real-time customer message routing or for running models on the edge). To account for the multilabel task, I use the binary-crossentropy-with-logits loss function.   
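
As an illustration of why this loss fits the multilabel setting, here is a plain-Python sketch of binary cross-entropy on logits, the quantity that, for example, PyTorch exposes as `torch.nn.BCEWithLogitsLoss`. Each label gets its own sigmoid, so several tags can be active for the same abstract. The logits, targets, and the 0.5 decision threshold are made-up values for this example.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed in a numerically stable form."""
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


def sigmoid(x):
    """Logistic function, written to avoid overflow for large negative inputs."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


# Made-up model outputs for one abstract and three labels.
logits = [2.0, -1.5, 0.3]
targets = [1.0, 0.0, 0.0]

loss = bce_with_logits(logits, targets)
predicted = [sigmoid(x) > 0.5 for x in logits]  # independent decision per label
print(round(loss, 4), predicted)
```

A softmax-based cross-entropy would instead force the label probabilities to sum to one, which only suits single-label problems.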

I fine-tuned DistilBert for 10 epochs on our dataset and plotted train and validation loss to check for overfitting. Keeping the best-performing model across these epochs, I achieved a weighted f1-score of 0.75 on the validation set, the best performance so far. Note that this model does not yet take class imbalances into account, an improvement to include in the future.

Finally, I evaluated the DistilBert model on the test data, reaching a weighted f1 score of 0.75.

## Detailed Description of Process

This section gives some more details on the process summarized above.

### Notebook 02 - Feature Exploration

The following text cleaning steps are used:

* line breaks are deleted
* LaTeX expressions are removed
* words are lemmatized taking into account POS tags
* all words are lower-cased
* punctuation is deleted
* white space is deleted
* stopwords are removed. Stopwords are taken from the nltk library. 
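
The steps above can be sketched roughly as follows. This is a simplified illustration rather than the project's cleaning code: the regular expressions only catch inline `$...$` math and bare LaTeX commands, lemmatization is left out, and the short `STOPWORDS` set stands in for the nltk stopword list.

```python
import re
import string

# Tiny stand-in list; the project takes its stopwords from nltk.
STOPWORDS = {"a", "an", "the", "of", "we", "is", "in", "for", "and", "that"}

# Maps every punctuation character to a space.
PUNCTUATION_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_abstract(text):
    """Apply the cleaning steps in order (without lemmatization)."""
    text = text.replace("\\n", " ").replace("\n", " ")  # literal "\n" and real newlines
    text = re.sub(r"\$[^$]*\$", " ", text)              # inline math such as $k$
    text = re.sub(r"\\[a-zA-Z]+", " ", text)            # commands such as \emph
    text = text.lower()
    text = text.translate(PUNCTUATION_TO_SPACE)
    tokens = [token for token in text.split() if token not in STOPWORDS]
    return " ".join(tokens)                             # also collapses white space


example = "We propose a $k$-sparse estimator.\\nThe \\emph{robust} scheme is optimal."
cleaned = clean_abstract(example)
print(cleaned)
```

In the full pipeline, lemmatization with POS tags would run before lower-casing, as listed above.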

### Notebook 03 - Baseline Model

I used the text cleaning steps described in Notebook 02. 

The following figure shows the metrics report of the baseline model.

![](../reports/figures/classification-report-baseline-on-validation.png)

### Notebook 04 - TFIDF plus Logistic Regression

The same text cleaning steps as for the baseline model were used. 

To implement the multi-label classification I used sklearn's `MultiOutputClassifier` wrapper, which trains one model per label. Logistic regression was implemented once without and once with weight balancing. The following figure shows the validation report for balanced logistic regression:

![](../reports/figures/classification-report-tfidf-balanced-logistic-regression-on-validation.png)


### Notebook 05 - DistilBert

I fine-tuned DistilBert for 10 epochs on our dataset and plotted train and validation loss as well as the weighted f1 score to check for overfitting:

![](../reports/figures/distilbert-cross-entropy-against-epochs.png)

![](../reports/figures/distilbert-f1-against-epochs.png)

After a single epoch, both validation loss and f1 score largely plateaued, although the f1 score continued to increase slightly afterwards.

Finally, I evaluated the best DistilBert model on the test data, reaching a weighted f1 score of 0.75, which is in a similar ballpark to the score on the validation set.

The performance reports on the validation set and the test set also look similar enough to be confident that we have not overfitted.

Validation report:

![](../reports/figures/classification-report-distilbert-on-validation.png)

Test report:

![](../reports/figures/classification-report-distilbert-on-test.png)


## Summary and Potential Improvements

The goal of this project was to predict to which academic categories an abstract submitted to a conference or a journal belongs. This model could be used by conference organizers to help them sort submitted talks into tracks and by academics submitting abstracts to a journal to help them find appropriate tags. 

This task was framed as a multi-label classification task on NLP data. After creating a dataset by scraping relevant data from arXiv, followed by data exploration and cleaning, I experimented with three modelling approaches: a manual baseline model, a tfidf model in combination with a logistic regression classifier and a DistilBert model.

I used the weighted f1-score to measure each model's performance.

The model performances on the validation set are:

* heuristic model: 0.41
* tfidf + logistic regression:  0.72
* DistilBert: 0.75

The (so far) final performance on the test set was 0.75.

Note that these scores were achieved on a test set that was also obtained from arXiv. While it is probable, it is not certain that the model will perform well on conference abstracts, since their style might differ (for example, be less formal) from the writing style used for journal abstracts.

### Future Improvements

Possible ways to improve the model are:

* tfidf-model:
    * The tfidf model is still overfitting to the training data. Sourcing more data as well as regularizing the model should help to improve this. 
* DistilBert-model:
    * The DistilBert model does not yet account for class imbalances. As seen for the tfidf+logistic regression model, accounting for the class imbalances within each classifier should improve the performance. 
    * As part of this project, I have experimented with different training set sizes (not shown in this code-base). The conclusion was that there is still room for improving the model performance by increasing the size of the training data set. This is straightforward with the scripts provided.
* As mentioned above, it is not certain that this model generalizes well to conference abstracts. This requires testing and potentially another round of fine-tuning.

### Deployment

This project so far focuses on the data sourcing and model training part of a machine-learning project. In order for it to be run in a production environment, software and data testing still need to be addressed. 