<h1>Analysis of Preprint Papers from the ArXiv</h1>

The website https://arxiv.org is a popular database for scientific papers in STEM fields. ArXiv has its own classification system consisting of roughly 150 different categories, which are manually added by the authors whenever a new paper is uploaded. A paper can be assigned multiple categories.

The goal for this project is to develop a machine learning model which can predict the ArXiv category from a given title and abstract. The data set used here has been scraped from the ArXiv API, see https://arxiv.org/help/api, using a Python scraper which can be downloaded from

<p><center><a href=https://github.com/saattrupdan/scholarly/blob/master/arxiv_scraper.py>https://github.com/saattrupdan/scholarly/blob/master/arxiv_scraper.py.</a>.</center></p>

I have downloaded two data sets using this scraper, which can be freely downloaded below:
<ul>
    <li> The <tt>arxiv.csv</tt> file includes at most 10,000 papers from every ArXiv category. This data set is very large (~1.2gb) and takes a day or two to scrape.</li>
    <li> The <tt>arxiv_small.csv</tt> file is more manageable (~200mb) and includes at most 1,000 papers from every ArXiv category.</li>
</ul>

We will also need a list of all the ArXiv categories, which can be downloaded below:

<h2> Fetching data </h2>

We start by importing all the packages we will need.

In [112]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re # regular expressions
import pickle # enables saving data and models locally

We first set the file name we're interested in and load up the categories.

In [113]:
# the file name of the data set, e.g. 'arxiv' or 'arxiv_small'
file_name = 'arxiv_small'

# load array of categories
cats = np.asarray(pd.read_csv("cats.csv")['category'].values)

Next up we load the data into a dataframe. This takes a while, which is why we are also saving it locally.

In [114]:
# load dataframe    
with open(f"{file_name}_df.pickle", "rb") as pickle_in:
    df = pickle.load(pickle_in)

print(f"Loaded metadata from {df.shape[0]} papers. Here are some of them:")
pd.set_option('display.max_colwidth', 300)
df.sample(3)

Loaded metadata from 150479 papers. Here are some of them:


Unnamed: 0,title,abstract,category
92570,Sub-leading asymptotics of ECH capacities,"In previous work, the first author and collaborators showed that the leading\nasymptotics of the embedded contact homology (ECH) spectrum recovers the\ncontact volume. Our main theorem here is a new bound on the sub-leading\nasymptotics.","[math.DG, math.SG]"
40694,Enabling Quality-Driven Scalable Video Transmission over Multi-User NOMA\n System,"Recently, non-orthogonal multiple access (NOMA) has been proposed to achieve\nhigher spectral efficiency over conventional orthogonal multiple access.\nAlthough it has the potential to meet increasing demands of video services, it\nis still challenging to provide high performance video streami...","[cs.IT, cs.MM, cs.NI, eess.SP, math.IT]"
82813,Lorentzian length spaces,"We introduce an analogue of the theory of length spaces into the setting of\nLorentzian geometry and causality theory. The r\^ole of the metric is taken\nover by the time separation function, in terms of which all basic notions are\nformulated. In this way we recover many fundamental results i...","[gr-qc, math-ph, math.DG, math.MG, math.MP]"


<h2> Cleaning the data </h2>

We next do some basic cleaning of the data.

In [None]:
# drop rows with NaNs
df.dropna(inplace=True)

# merge title and abstract
df['clean_text'] = df['title'] + ' ' + df['abstract']

# remove punctuation marks
punctuation ='\!\"\#\$\%\&\(\)\*\+\-\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~'
df['clean_text'] = df['clean_text'].apply(lambda x: re.sub(punctuation, '', x))

# convert text to lowercase
df['clean_text'] = df['clean_text'].str.lower()

# remove numbers
df['clean_text'] = df['clean_text'].str.replace("[0-9]", " ")

# remove whitespaces
df['clean_text'] = df['clean_text'].apply(lambda x:' '.join(x.split()))

pd.set_option('display.max_colwidth', 300)
df[['title', 'abstract', 'clean_text']].sample(3)

Unnamed: 0,title,abstract,clean_text
141086,Visualizing Treasury Issuance Strategy,"We introduce simple cost and risk proxy metrics that can be attached to\nTreasury issuance strategy to complement analysis of the resulting portfolio\nweighted-average maturity (WAM). These metrics are based on mapping issuance\nfractions to their long-term, asymptotic portfolio implications f...","visualizing treasury issuance strategy we introduce simple cost and risk proxy metrics that can be attached to treasury issuance strategy to complement analysis of the resulting portfolio weighted-average maturity (wam). these metrics are based on mapping issuance fractions to their long-term, a..."
144991,Inferring the temporal structure of directed functional connectivity in\n neural systems: some extensions to Granger causality,"Neural processes in the brain operate at a range of temporal scales. Granger\ncausality, the most widely-used neuroscientific tool for inference of directed\nfunctional connectivity from neurophsyiological data, is traditionally deployed\nin the form of one-step-ahead prediction regardless of ...","inferring the temporal structure of directed functional connectivity in neural systems: some extensions to granger causality neural processes in the brain operate at a range of temporal scales. granger causality, the most widely-used neuroscientific tool for inference of directed functional conn..."
124939,Modeling observations of solar coronal mass ejections with heliospheric\n imagers verified with the Heliophysics System Observatory,"We present an advance towards accurately predicting the arrivals of coronal\nmass ejections (CMEs) at the terrestrial planets, including Earth. For the\nfirst time, we are able to assess a CME prediction model using data over 2/3 of\na solar cycle of observations with the Heliophysics System O...","modeling observations of solar coronal mass ejections with heliospheric imagers verified with the heliophysics system observatory we present an advance towards accurately predicting the arrivals of coronal mass ejections (cmes) at the terrestrial planets, including earth. for the first time, we ..."


Our last text cleaning step is to lemmatise the text, which reduces all words to its base form. For instance, 'eating' is converted into 'eat' and 'better' is converted into 'good'. This usually takes a little while.

In [None]:
# download spacy if needed
#!pip install -U spacy && python -m spacy download en

from spacy import load as sp

def lemmatization(texts):

    # import spaCy's language model
    nlp = sp('en', disable=['parser', 'ner'])
    
    output = []
    for text in texts:
        s = [token.lemma_ for token in nlp(text)]
        output.append(' '.join(s))
    
    return output

# lemmatise text
df['clean_text'] = lemmatization(df['clean_text'])

# save dataframe to {file_name}_clean_text.pickle
with open(f"{file_name}_clean_text.pickle","wb") as pickle_out:
    pickle.dump(df[['clean_text']], pickle_out)

print(f"Lemmatisation complete. Dataframe with the clean_text column saved to {file_name}_clean_text.pickle.")

pd.set_option('display.max_colwidth', 300)
df.sample(3)

<h2> One hot encoding of categories </h2>

We then perform a one hot encoding for the category variable, as this will make training our model easier. We do this by first creating a dataframe with columns the categories and binary values for every paper, and then concatenate our original dataframe with the binary values.

In [None]:
def cat_to_binary(x):
    cat_index = np.nonzero(cats == x)[0][0]
    return [0] * cat_index + [1] + [0] * (len(cats) - cat_index - 1)

def cats_to_binary(x):
    binary_cat = np.sum([cat_to_binary(y) for y in x], 0, dtype=np.int32)
    return binary_cat.tolist()

# populate cats_df with the information from df
bincat_list = np.array([cats_to_binary(x) for x in df['category']]).transpose().tolist()
bincat_dict = {key:value for (key,value) in list(zip(cats, cat_list))}
bincat_df = pd.DataFrame.from_dict(cat_dict)

# concatenate df with the columns in cats_df
df = pd.concat([df, bincat_df], axis=1, sort=False)

# drop the category column
df.drop(['category'], axis=1, inplace=True)

print("One hot encoding complete.")

# show the new columns of the data frame
pd.set_option('display.max_colwidth', 10)
df.sample(3)

<h2> Analysis of the data </h2>

Here is how the categories in our data set are distributed.

In [None]:
# find the amount of papers in each category
sum_cats = bincat_df.apply(lambda x: x.sum())
sum_cats.describe()

In [None]:
# plot the distribution of the amount of papers in each category
plt.figure(figsize=(20,10))
plt.bar(x=sum_cats.keys(), height=sum_cats.values)
plt.xlabel('Categories', fontsize=13)
plt.ylabel('Number of papers', fontsize=13)
plt.title('Distribution of categories in data set', fontsize=18)
plt.xticks([])
plt.show()

In [None]:
cat_df = pd.read_csv("cats.csv")
cat_df['count'] = sum_cats.values
cat_df = cat_df.sort_values(by=['count'], ascending=False)

cat_df.head(5)

<h2> ELMo feature extraction </h2>

To build our model we have to extract features from the titles and abstracts. We will be using ELMo, a state-of-the-art NLP framework developed by AllenNLP, which converts text input into vectors, with similar words being closer to each other. We will need the following extra packages.

Next, we download the ELMo model. It is over 350mb in size, so it might take a little while.

We now need to extract ELMo features from our cleaned text data. This takes a LONG time (potentially many days) and will usually crash if run in a notebook. Instead this is done using the Python extractor which can be downloaded from

<p><center><a href="


to the <tt>extract_elmo.py</tt>, which is run with the data file as input without its file extension. For instance, the following extracts the features from the <tt>arxiv.csv</tt> file and saves the features as <tt>arxiv_elmo.pickle</tt>:
<p><center><tt>$ python extract_elmo.py arxiv</tt></center></p>

Assuming the ELMo features have now been extracted, we now load them in.

In [13]:
# load ELMo data
with open(f"{file_name}_elmo.pickle", "rb") as pickle_in:
    elmo_data = pickle.load(pickle_in)

print(f"ELMo data loaded from {file_name}_elmo.pickle.")

ELMo data loaded!


<h2> Building the model </h2>

We are now done manipulating our data, and the time has come to build a model. We will be building a logistic regression model for every category.

In [220]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

In [15]:
Y_test = []
Y_hat = []
lregs = np.asarray([])

for cat in cats:
        
    X_train, X_test, y_train, y_test = train_test_split(elmo_data, df[cat], random_state=4, test_size=0.2)
    Y_test.append(y_test)
    
    try:
        lreg = LogisticRegression(C=0.0001, solver='liblinear')
        lreg.fit(X_train, y_train)
        Y_hat.append(lreg.predict(X_test))
        lregs.append(lreg)
    except:
        Y_hat.append(np.array([0] * len(y_test)))

with open(f"{file_name}_model.pickle","wb") as pickle_out:
    pickle.dump(lregs, pickle_out)
        
print(f"Model built and saved to {file_name}_model.pickle")

[array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]),
 array([0, 0, 0, 0, 

<h2> Testing the model </h2>

We now check whether our model is capable of predicting the categories for the papers in our test set.

In [16]:
from sklearn.metrics import f1_score
from statistics import mean

In [17]:
f1s = [f1_score(x, y) for (x,y) in zip(Y_test, Y_hat)]

print(f"The average f1 score for the model is {mean(f1s)}.")

The highest f1 score for the model is 0.8571428571428571.
The average f1 score for the model is 0.0056022408963585435.


  'precision', 'predicted', average, warn_for)
