# Notebook 01 - Label Exploration

In this notebook, we explore the distribution of labels for the abstracts and split the dataset into train, validation and test sets.

## Setup

In [1]:
# --- Configure Notebook ------
# show all outputs of cell
from IPython.core.interactiveshell import InteractiveShell

InteractiveShell.ast_node_interactivity = "all"

import black
import jupyter_black

jupyter_black.load(
    lab=True,
    line_length=100,
    verbosity="DEBUG",
    target_version=black.TargetVersion.PY310,
)

# enable automatic reloading
%load_ext autoreload
%autoreload 2

from pathlib import Path

from arxiv_article_classifier.utils import display_fully, save_image
from arxiv_article_classifier.data.scrape_arxiv import CATEGORIES_OF_INTEREST

from arxiv_article_classifier.data.make_interim_data import make_interim_data
from arxiv_article_classifier.data.load import load_processed_data, load_taxonomy

from pandas.core.base import PandasObject

PandasObject.display_fully = display_fully

DATAFOLDER = Path().cwd().parent / "data"
FIGUREFOLDER = Path().cwd().parent / "reports" / "figures"

import pandas as pd
import ast
import pickle

from sklearn.preprocessing import MultiLabelBinarizer
from skmultilearn.model_selection.measures import get_combination_wise_output_matrix
import plotly.express as px

DEBUG:jupyter_black:Using config from /home/walkenho/repositories/arxiv_article_classifier/pyproject.toml
DEBUG:jupyter_black:config: {'line_length': 100, 'target_versions': {<TargetVersion.PY310: 10>}}


In [2]:
# load taxonomy
taxonomy = load_taxonomy(DATAFOLDER / "raw" / "taxonomy.pkl")

In [3]:
df = pd.read_csv(DATAFOLDER / "raw" / "articles.csv").assign(
    tags=lambda df: df["tags"].apply(lambda x: ast.literal_eval(x))
)
df.head()

Unnamed: 0,ids,titles,abstracts,tags
0,http://arxiv.org/abs/2009.10061v1,Faster Algorithms for Optimal Ex-Ante Coordina...,We focus on the problem of finding an optimal ...,"[cs.GT, cs.AI, cs.LG, cs.MA]"
1,http://arxiv.org/abs/2009.10249v1,Dynamic Multi-Agent Path Finding based on Conf...,We study a dynamic version of multi-agent path...,"[cs.AI, cs.LO, cs.MA, cs.RO]"
2,http://arxiv.org/abs/2009.10638v1,Sense-Deliberate-Act Cognitive Agents for Sens...,"In this paper, we advocate Agent-Oriented Soft...","[cs.MA, cs.SE]"
3,http://arxiv.org/abs/2009.10689v1,Simulation model of spacetime with the Minkows...,"In this paper, we propose a simulation model o...","[cs.CE, cs.MA]"
4,http://arxiv.org/abs/2009.10890v1,Demand Responsive Dynamic Pricing Framework fo...,Demand Response (DR) has a widely recognized p...,"[eess.SY, cs.LG, cs.MA, cs.SY]"


In [4]:
unique_tags = {cat for catlist in df["tags"] for cat in catlist}
len(unique_tags)

list(unique_tags)[:10]

3660

['I.2.1; I.2.11; I.2.6',
 '65F10, 65K10',
 '60H35, 60J75, 65C05, 93E20',
 'Multiagent Systems',
 '15B52',
 '65M60, 65N30, 65M22',
 '62R20, 62P10, 47N30',
 'Primary 60G57, Secondary 60G55',
 'primary 62R07, 62G20, 62G30, 49Q22, secondary 62E20, 62F35, 60B10',
 '91A99 (Primary) 68W27, 90C05 (Secondary)']

In [8]:
with open(DATAFOLDER / "raw" / "taxonomy.pkl", "rb") as f:
    taxonomy = pickle.load(f)

categories = taxonomy.keys()

df["tags"] = df["tags"].map(lambda tags: [tag for tag in tags if tag in categories])

print(
    f"{len({cat for catlist in df['tags'] for cat in catlist})} found out of {len(categories)} existing categories."
)

mlb = MultiLabelBinarizer()

message_tags_matrix = pd.DataFrame(mlb.fit_transform(df["tags"]), columns=mlb.classes_)

fig = px.bar(
    pd.DataFrame(
        pd.DataFrame(message_tags_matrix, columns=mlb.classes_).sum()
        / message_tags_matrix.shape[0]
        * 100,
        columns=["perc_articles"],
    )
    .sort_values(by="perc_articles", ascending=False)
    .assign(is_category_of_interest=lambda df: df.index.map(lambda x: x in CATEGORIES_OF_INTEREST)),
    title="Which percentage of abstracts has which tag?",
    color="is_category_of_interest",
)
_ = fig.update_xaxes(tickangle=45)
save_image(fig, path=FIGUREFOLDER, filename="distribution-of-tags")
fig.show()


fig = px.bar(
    pd.DataFrame(message_tags_matrix.sum(axis=1)).groupby(0).size()
    / message_tags_matrix.shape[0]
    * 100,
    title="Percentage of articles with n tags",
)
save_image(fig, path=FIGUREFOLDER, filename="distribution-of-number-of-tags-raw")
fig.show()

152 found out of 155 existing categories.


In the first figure, we notice a few tags with many associated articles even though these tags were not used when creating the dataset. In particular, cs.SY (systems and control), cs.NA (numerical analysis) and stat.TH (statistics theory) are aliases for eess.SY, math.NA and math.ST respectively. To a lesser extent, the same holds for cs.SD (sound), which should overlap with audio and speech processing, and for stat.ML (machine learning). We also ignore cs.AI (artificial intelligence), which does not seem to be a useful distinction for our purposes.
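
The notebook itself does not merge alias tags, but if one wanted to count an alias together with its primary category, a minimal sketch could look like the following. `TAG_ALIASES` and `canonicalize_tags` are hypothetical names introduced for illustration; the mapping only contains the three alias pairs named above.

```python
# Hypothetical alias map, built from the three alias pairs named in the text.
TAG_ALIASES = {"cs.SY": "eess.SY", "cs.NA": "math.NA", "stat.TH": "math.ST"}


def canonicalize_tags(tags: list[str]) -> list[str]:
    """Replace alias tags by their primary tag, dropping duplicates but keeping order."""
    canonical = (TAG_ALIASES.get(tag, tag) for tag in tags)
    # dict.fromkeys removes duplicates while preserving first-seen order
    return list(dict.fromkeys(canonical))


print(canonicalize_tags(["eess.SY", "cs.LG", "cs.MA", "cs.SY"]))
# → ['eess.SY', 'cs.LG', 'cs.MA']
```

Applied to the `tags` column (for example via `df["tags"].map(canonicalize_tags)`), this would avoid counting the same article twice under two names for the same category.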

In [9]:
df_sum = (
    pd.DataFrame(
        pd.DataFrame(message_tags_matrix, columns=mlb.classes_).sum(),
        columns=["n_articles"],
    )
    .sort_values(by="n_articles", ascending=False)
    .cumsum()
    .assign(perc_tags=lambda df: df["n_articles"] / df["n_articles"].max())
)
df_sum.head(15)

Unnamed: 0,n_articles,perc_tags
cs.LG,9978,0.109469
cs.AI,16042,0.175998
cs.CL,20236,0.22201
cs.CV,24354,0.267189
cs.RO,27953,0.306674
math.OC,31530,0.345917
cs.SY,35039,0.384415
eess.SY,38544,0.422868
cs.CY,41754,0.458085
eess.SP,44921,0.49283


In [10]:
fig = px.area(df_sum, y="perc_tags", title="Percentage of tags captured")
# rotate the tick labels before rendering, so that the change shows up in the output
_ = fig.update_xaxes(tickangle=45)
fig.show()

Let's clean the dataset by keeping only the tags of interest.

In [12]:
df["tags_filtered"] = df["tags"].map(
    lambda tags: [tag for tag in tags if tag in CATEGORIES_OF_INTEREST]
)

print(
    f"{len({cat for catlist in df['tags_filtered'] for cat in catlist})} found out of {len(categories)} existing categories."
)

mlb = MultiLabelBinarizer()

message_tags_matrix = pd.DataFrame(mlb.fit_transform(df["tags_filtered"]), columns=mlb.classes_)

fig = px.bar(
    pd.DataFrame(
        pd.DataFrame(message_tags_matrix, columns=mlb.classes_).sum()
        / message_tags_matrix.shape[0]
        * 100,
        columns=["perc_articles"],
    ).sort_values(by="perc_articles", ascending=False),
    title="Which percentage of abstracts has which tag?",
)
_ = fig.update_xaxes(tickangle=45)
fig.show()

fig = px.bar(
    pd.DataFrame(message_tags_matrix.sum(axis=1)).groupby(0).size()
    / message_tags_matrix.shape[0]
    * 100,
    title="Percentage of articles with n tags",
)
fig.show()

save_image(fig, path=FIGUREFOLDER, filename="distribution-of-number-of-tags-cleaned")

14 found out of 155 existing categories.


## Split into Train, Dev and Test

This is a multilabel dataset. There are several ways to pose a multilabel classification problem:

* Reframe the task as a multiclass classification task, where each class represents one combination of the original labels (label powerset). This scales very poorly with the number of labels, since N labels allow up to 2^N combinations.
* Reframe the task of predicting N labels as N independent single-label (binary) classification tasks (binary relevance).
* Reframe the task as N single-label classification tasks as above, but chain them together, so that each classifier receives the outputs of the previous classifiers as additional inputs (classifier chain).
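
To make the three framings concrete, here is a minimal sketch using scikit-learn. It runs on synthetic stand-in data from `make_multilabel_classification`, not on the arXiv abstracts, and the choice of `LogisticRegression` as base model is an assumption made purely for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

# Synthetic stand-in data: 200 samples, 3 labels (not the arXiv dataset).
X, Y = make_multilabel_classification(n_samples=200, n_features=20, n_classes=3, random_state=0)

# 1) Multiclass over label combinations: every distinct row of Y becomes one class.
combos = sorted({tuple(row) for row in Y})
combo_to_class = {combo: i for i, combo in enumerate(combos)}
y_combined = [combo_to_class[tuple(row)] for row in Y]
combination_clf = LogisticRegression(max_iter=1000).fit(X, y_combined)

# 2) N independent single-label tasks: one binary classifier per label column.
independent_clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# 3) Chained single-label tasks: each classifier also sees the previous labels' outputs.
chain_clf = ClassifierChain(LogisticRegression(max_iter=1000)).fit(X, Y)

print(f"{len(combos)} label combinations observed, at most {2 ** Y.shape[1]} possible")
print(independent_clf.predict(X[:2]).shape, chain_clf.predict(X[:2]).shape)
```

The second and third variants both return one prediction per label and sample; the first returns a single class id per sample that has to be mapped back to a label combination.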

Here, I will convert the multilabel task into multiple single-label classification tasks. For each of these tasks, the data is quite imbalanced. Therefore, we need a stratified sampling scheme that stratifies across combinations of labels, so that every split keeps similar label proportions.
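
One way to read "combinations of labels" is to look at pairs of labels that occur together in an article and to keep the frequency of each pair similar in every split. The sketch below only illustrates how such pairs can be enumerated; `y_toy` and `label_pairs` are hypothetical names, and this is not necessarily how `make_interim_data` works internally.

```python
from collections import Counter
from itertools import combinations_with_replacement

# Toy indicator rows (hypothetical): columns are labels 0..3, 1 means the label is set.
y_toy = [
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 1, 1],
]


def label_pairs(row: list[int]) -> list[tuple[int, int]]:
    """Return all pairs (i, j) with i <= j of labels that are active in one row."""
    active = [i for i, flag in enumerate(row) if flag]
    return list(combinations_with_replacement(active, 2))


# A pair (i, i) counts how often label i occurs at all,
# a pair (i, j) with i < j counts how often labels i and j occur together.
pair_counts = Counter(pair for row in y_toy for pair in label_pairs(row))
print(pair_counts[(0, 0)], pair_counts[(0, 2)], pair_counts[(1, 1)])
# → 2 2 1
```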

In [9]:
make_interim_data(
    input_file=DATAFOLDER / "raw" / "articles.csv",
    output_folder=DATAFOLDER / "interim",
    categories_to_keep=CATEGORIES_OF_INTEREST,
)

Quick check that the stratified split produced reasonable results:

In [10]:
from collections import Counter


(_, _, _, y_train, y_val, y_test), _ = load_processed_data(DATAFOLDER / "interim")

pd.DataFrame(
    {
        "train": Counter(
            str(combination)
            for row in get_combination_wise_output_matrix(y_train, order=2)
            for combination in row
        ),
        "validation": Counter(
            str(combination)
            for row in get_combination_wise_output_matrix(y_val, order=2)
            for combination in row
        ),
        "test": Counter(
            str(combination)
            for row in get_combination_wise_output_matrix(y_test, order=2)
            for combination in row
        ),
    }
).fillna(0).display_fully()

Unnamed: 0,train,validation,test
"(6, 6)",726.0,242.0,242.0
"(5, 5)",600.0,200.0,200.0
"(5, 6)",124.0,41.0,41.0
"(4, 4)",1990.0,664.0,663.0
"(4, 5)",208.0,70.0,69.0
"(7, 7)",612.0,204.0,204.0
"(2, 7)",94.0,31.0,31.0
"(5, 7)",21.0,7.0,7.0
"(2, 2)",661.0,220.0,220.0
"(2, 5)",19.0,6.0,6.0


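The raw counts are easier to judge as shares per split. The sketch below recomputes them for a few rows copied from the table above; `split_shares` is a hypothetical helper introduced for illustration.

```python
# Counts copied from the table above as (train, validation, test).
pair_counts = {
    "(6, 6)": (726, 242, 242),
    "(4, 4)": (1990, 664, 663),
    "(5, 6)": (124, 41, 41),
    "(2, 5)": (19, 6, 6),
}


def split_shares(counts: tuple[int, int, int]) -> list[float]:
    """Share of one label pair that ended up in each split, rounded to 3 digits."""
    total = sum(counts)
    return [round(count / total, 3) for count in counts]


for pair, counts in pair_counts.items():
    print(pair, split_shares(counts))
# → (6, 6) [0.6, 0.2, 0.2]
# → (4, 4) [0.6, 0.2, 0.2]
# → (5, 6) [0.602, 0.199, 0.199]
# → (2, 5) [0.613, 0.194, 0.194]
```

For these rows, every label pair is distributed close to 0.6 / 0.2 / 0.2 across train, validation and test, including the rare pairs.
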
Looks good. Let's go to the next notebook, where we explore the text data.