#### Dataset

You can download the dataset from [Kaggle](https://www.kaggle.com/datasets/spsayakpaul/arxiv-paper-abstracts).

### Import Libraries

In [1]:
from tensorflow.keras import layers
from tensorflow import keras
import tensorflow as tf

from sklearn.model_selection import train_test_split
from ast import literal_eval

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# configure pandas display for full text
pd.set_option('display.max_columns', None)  
pd.set_option('display.max_rows', None)  
pd.set_option('display.max_colwidth', None) 


### Load the data

In [2]:
arxiv_data = pd.read_csv("../dataset/arxiv_data.csv")

In [3]:
arxiv_data.head(2)

Unnamed: 0,titles,summaries,terms
0,Survey on Semantic Stereo Matching / Semantic Depth Estimation,"Stereo matching is one of the widely used techniques for inferring depth from\nstereo images owing to its robustness and speed. It has become one of the major\ntopics of research since it finds its applications in autonomous driving,\nrobotic navigation, 3D reconstruction, and many other fields. Finding pixel\ncorrespondences in non-textured, occluded and reflective areas is the major\nchallenge in stereo matching. Recent developments have shown that semantic cues\nfrom image segmentation can be used to improve the results of stereo matching.\nMany deep neural network architectures have been proposed to leverage the\nadvantages of semantic segmentation in stereo matching. This paper aims to give\na comparison among the state of art networks both in terms of accuracy and in\nterms of speed which are of higher importance in real-time applications.","['cs.CV', 'cs.LG']"
1,FUTURE-AI: Guiding Principles and Consensus Recommendations for Trustworthy Artificial Intelligence in Future Medical Imaging,"The recent advancements in artificial intelligence (AI) combined with the\nextensive amount of data generated by today's clinical systems, has led to the\ndevelopment of imaging AI solutions across the whole value chain of medical\nimaging, including image reconstruction, medical image segmentation,\nimage-based diagnosis and treatment planning. Notwithstanding the successes and\nfuture potential of AI in medical imaging, many stakeholders are concerned of\nthe potential risks and ethical implications of imaging AI solutions, which are\nperceived as complex, opaque, and difficult to comprehend, utilise, and trust\nin critical clinical applications. Despite these concerns and risks, there are\ncurrently no concrete guidelines and best practices for guiding future AI\ndevelopments in medical imaging towards increased trust, safety and adoption.\nTo bridge this gap, this paper introduces a careful selection of guiding\nprinciples drawn from the accumulated experiences, consensus, and best\npractices from five large European projects on AI in Health Imaging. These\nguiding principles are named FUTURE-AI and its building blocks consist of (i)\nFairness, (ii) Universality, (iii) Traceability, (iv) Usability, (v) Robustness\nand (vi) Explainability. In a step-by-step approach, these guidelines are\nfurther translated into a framework of concrete recommendations for specifying,\ndeveloping, evaluating, and deploying technically, clinically and ethically\ntrustworthy AI solutions into clinical practice.","['cs.CV', 'cs.AI', 'cs.LG']"


### Explore data

In [4]:
print(f"There are {len(arxiv_data)} rows in the dataset.")

There are 51774 rows in the dataset.


In [5]:
total_duplicate_titles = sum(arxiv_data["titles"].duplicated())

print(f"There are {total_duplicate_titles} duplicate titles.")

There are 12802 duplicate titles.


In [6]:
# drop duplicates

arxiv_data = arxiv_data[~arxiv_data["titles"].duplicated()]
print(f"There are {len(arxiv_data)} rows in the deduplicated dataset.")

There are 38972 rows in the deduplicated dataset.


In [7]:
# There are some terms with occurrence as low as 1.
print(sum(arxiv_data["terms"].value_counts() == 1))

2321


In [8]:
# How many unique terms?
print(arxiv_data["terms"].nunique())

3157


As shown above, of the 3,157 unique label combinations, 2,321 occur only once. A stratified split needs at least two samples per class, so we drop these singleton combinations before preparing the train, validation, and test sets.

In [9]:
# Filtering out the rare terms.
arxiv_data_filtered = arxiv_data.groupby("terms").filter(lambda x: len(x) > 1)
arxiv_data_filtered.shape

(36651, 3)
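The filter above can be pictured without pandas. The following is a minimal, library-free sketch of the same idea; the toy `rows` list and the helper name `drop_singletons` are made up for illustration and are not part of the notebook.

```python
from collections import Counter

def drop_singletons(rows, key="terms"):
    """Keep only rows whose `key` value occurs more than once."""
    counts = Counter(row[key] for row in rows)
    return [row for row in rows if counts[row[key]] > 1]

# Toy data: the label combination is still a raw string at this stage.
rows = [
    {"titles": "A", "terms": "['cs.CV', 'cs.LG']"},
    {"titles": "B", "terms": "['cs.CV', 'cs.LG']"},
    {"titles": "C", "terms": "['stat.ML']"},  # occurs once, so it is dropped
]

kept = drop_singletons(rows)
print([row["titles"] for row in kept])  # ['A', 'B']
```

Note that the grouping key is the whole label combination, not an individual label, which matches grouping by the `terms` column.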

The terms are stored as raw strings, so we convert each one to a List[str] using `literal_eval` from the `ast` module.

In [10]:
arxiv_data_filtered['terms'][0]

"['cs.CV', 'cs.LG']"

In [11]:
# literal_eval can be passed directly; no lambda wrapper is needed
arxiv_data_filtered["terms"] = arxiv_data_filtered["terms"].apply(literal_eval)

In [12]:
arxiv_data_filtered['terms'][0]

['cs.CV', 'cs.LG']
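To see what the conversion does on a single value, here is a small standalone sketch. The variable names are illustrative only. It also shows why `literal_eval` is generally preferred over `eval` for this job: it only accepts plain Python literals.

```python
from ast import literal_eval

raw = "['cs.CV', 'cs.LG']"   # what the CSV stores: one string
parsed = literal_eval(raw)   # parsed into a real Python list

print(type(raw).__name__, type(parsed).__name__)  # str list
print(parsed[0])                                   # cs.CV

# Anything that is not a plain literal (such as a function call) is rejected.
rejected = False
try:
    literal_eval("__import__('os').getcwd()")
except (ValueError, SyntaxError):
    rejected = True
print("rejected:", rejected)  # rejected: True
```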

### Split dataset with stratification
The dataset is class-imbalanced, so we use a stratified split: it keeps the label distribution roughly the same in every split, which gives a fairer evaluation.

In [13]:
# keep 10% of the data for testing
test_split = 0.1

train_df, test_df = train_test_split(
    arxiv_data_filtered,
    test_size=test_split,
    stratify=arxiv_data_filtered["terms"].values,
)
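As a rough illustration of what stratification means, the sketch below samples the same fraction from every class. It is a simplified stand-in written for this explanation, not scikit-learn's actual algorithm; the function name `stratified_split` and the toy labels are assumptions, and the labels must be hashable.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), sampling `test_size` of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Imbalanced toy labels: 90 of class "a", 10 of class "b".
labels = ["a"] * 90 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, test_size=0.1)

test_labels = [labels[i] for i in test_idx]
print(len(train_idx), len(test_idx))                   # 90 10
print(test_labels.count("a"), test_labels.count("b"))  # 9 1
```

A purely random 10% sample could easily contain no "b" rows at all; sampling per class keeps the 9:1 ratio in the test set.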

In [15]:
# Splitting the test set further into validation and new test sets.

val_df = test_df.sample(frac=0.5)
test_df.drop(val_df.index, inplace=True)

In [16]:
print(f"Number of rows in training set: {len(train_df)}")
print(f"Number of rows in validation set: {len(val_df)}")
print(f"Number of rows in test set: {len(test_df)}")

Number of rows in training set: 32985
Number of rows in validation set: 916
Number of rows in test set: 917


### Multi-label binarization
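To binarize our labels we will use TensorFlow's StringLookup layer, a preprocessing layer that maps string features to integer indices. In `multi_hot` output mode it produces one 0/1 vector per sample, with a 1 at the index of every label present.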

[StringLookup](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup)
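Conceptually, multi-hot encoding builds a vocabulary from the training labels and then marks which vocabulary entries appear in each sample. The pure-Python sketch below illustrates that idea under a few assumptions: the helper names `build_vocab` and `multi_hot` are hypothetical, index 0 is reserved for unknown labels (as StringLookup does by default with its `[UNK]` token), and the alphabetical tie-breaking is a choice made here for determinism and may differ from the layer's own ordering.

```python
from collections import Counter

def build_vocab(label_lists, oov_token="[UNK]"):
    """Order labels by frequency and reserve index 0 for unknown labels."""
    counts = Counter(label for labels in label_lists for label in labels)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return [oov_token] + ordered

def multi_hot(labels, vocab):
    """Return a 0/1 vector with a 1 at the index of each label present."""
    index = {label: i for i, label in enumerate(vocab)}
    encoded = [0] * len(vocab)
    for label in labels:
        encoded[index.get(label, 0)] = 1  # unknown labels fall into slot 0
    return encoded

train_terms = [["cs.CV", "cs.LG"], ["cs.CV"], ["cs.LG", "stat.ML"], ["cs.CV"]]
vocab = build_vocab(train_terms)
print(vocab)                                   # ['[UNK]', 'cs.CV', 'cs.LG', 'stat.ML']
print(multi_hot(["cs.CV", "stat.ML"], vocab))  # [0, 1, 0, 1]
print(multi_hot(["math.OC"], vocab))           # [1, 0, 0, 0]
```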

In [17]:
# convert terms to RaggedTensor
terms = tf.ragged.constant(train_df["terms"].values)
lookup = tf.keras.layers.StringLookup(output_mode="multi_hot") # used for multi label classification
lookup.adapt(terms)


In [18]:
vocab = lookup.get_vocabulary()

In [19]:

def invert_multi_hot(encoded_labels):
    """Map a single multi-hot encoded label back to an array of vocab terms."""
    # indices of the positions set to 1 in the encoded vector
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)
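A quick way to check the helper is to decode a hand-written vector. In this sketch the `vocab` list is a hypothetical stand-in for the result of `lookup.get_vocabulary()`, and the function is repeated so the snippet runs on its own.

```python
import numpy as np

# Hypothetical stand-in for lookup.get_vocabulary(); index 0 is the OOV slot.
vocab = ["[UNK]", "cs.CV", "cs.LG", "stat.ML"]

def invert_multi_hot(encoded_labels):
    """Map a single multi-hot vector back to the vocabulary terms it encodes."""
    hot_indices = np.argwhere(encoded_labels == 1.0)[..., 0]
    return np.take(vocab, hot_indices)

encoded = np.array([0.0, 1.0, 1.0, 0.0])
decoded = invert_multi_hot(encoded)
print(decoded.tolist())  # ['cs.CV', 'cs.LG']
```

The function expects a single 1-D vector; a batch of encoded labels would need to be decoded row by row.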