# Analysis of Preprint Papers from the ArXiv

The website [arxiv.org](https://arxiv.org) is a popular database for scientific papers in STEM fields. ArXiv has its own classification system consisting of roughly 150 different categories, which are manually added by the authors whenever a new paper is uploaded. A paper can be assigned multiple categories.

The goal of this project is to develop a machine learning model that can predict the ArXiv category from a given title and abstract.

We start by importing all the packages we will need and setting up a data directory.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import os
from sklearn.decomposition import PCA # dimension reduction of data

# local files
import arxiv_scraper
import cleaner
import elmo
import NN

print("Packages loaded.")

The data set used here has been scraped from the [ArXiv API](https://arxiv.org/help/api) over several days, using the Python scraper `arxiv_scraper.py`. To get a sense for how long the scraping takes, you can uncomment and run the script below.

In [None]:
#arxiv_scraper.cat_scrape(
#    max_results_per_cat = 100, # maximum number of papers to download per category (there are ~150 categories)
#    file_path = "arxiv_data", # name of output file
#    batch_size = 100 # size of every batch - lower batch size requires less memory - must be less than 30,000
#)
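For a rough idea of what a batched scrape involves, here is a minimal sketch that only builds the query URLs for the public arXiv API, one per batch. The helper `build_query_urls` is hypothetical and is not taken from `arxiv_scraper.py`, whose internals may differ; no network request is made.

```python
from urllib.parse import urlencode

# public endpoint of the arXiv API
ARXIV_API = "http://export.arxiv.org/api/query"

def build_query_urls(category, max_results, batch_size):
    '''Hypothetical helper: one query URL per batch of papers in a category.'''
    urls = []
    for start in range(0, max_results, batch_size):
        params = {
            "search_query": f"cat:{category}",  # restrict to one category
            "start": start,                      # offset of the batch
            "max_results": min(batch_size, max_results - start),
        }
        urls.append(f"{ARXIV_API}?{urlencode(params)}")
    return urls

urls = build_query_urls("math.AG", max_results = 250, batch_size = 100)
for url in urls:
    print(url)
```

With roughly 150 categories, the number of requests grows quickly with `max_results_per_cat`, which is why a full scrape takes days.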

Alternatively, I have downloaded metadata for about a million papers using this scraper (with `max_results_per_cat = 10000`), which can be freely downloaded below. This data set takes up ~1 GB of space, however, so I've also included several random samples of it:

* `arxiv` contains the main data set
* `arxiv_sample_1000` contains 1,000 papers
* `arxiv_sample_5000` contains 5,000 papers
* `arxiv_sample_10000` contains 10,000 papers
* `arxiv_sample_25000` contains 25,000 papers
* `arxiv_sample_50000` contains 50,000 papers
* `arxiv_sample_100000` contains 100,000 papers
* `arxiv_sample_200000` contains 200,000 papers
* `arxiv_sample_500000` contains 500,000 papers
* `arxiv_sample_750000` contains 750,000 papers

Choose your favorite below. Alternatively, of course, you can set it to be the file name of your own scraped data.

In [None]:
file_name = "arxiv_sample_100000"

Next up, we specify the folder in which we will store all our data. Change to whatever folder you would like.

In [None]:
data_path = os.path.join("/home", "dn16382", "pCloudDrive", "Public Folder", "scholarly_data")

## Fetching data

We then do some basic setting up.

In [None]:
# create path directory and download a list of all arXiv categories
cleaner.setup(data_path)

# download the raw titles and abstracts
cleaner.download_papers(file_name, data_path)

Next, we store the list of arXiv categories.

In [None]:
# construct category dataframe and array
full_path = os.path.join(data_path, "cats.csv")
cats_df = pd.read_csv(full_path)
cats = np.asarray(cats_df['category'].values)

pd.set_option('display.max_colwidth', 50)
cats_df.head()

## Cleaning the data

We now do some basic cleaning operations on our raw data: we convert strings '\[cat_1, cat_2\]' into actual lists \[cat_1, cat_2\], make everything lower case, remove punctuation, numbers and redundant whitespace, and drop NaN rows.
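As an illustration of these cleaning steps, here is a minimal sketch using only the standard library. The functions `parse_cats` and `preclean` are hypothetical stand-ins and not the actual code in `cleaner.py`. The sketch also assumes plain ASCII text, so accented letters would be dropped.

```python
import re

def parse_cats(raw):
    '''Hypothetical helper: turn the string '[cat_1, cat_2]' into a list.'''
    inner = raw.strip().strip('[]')
    return [c.strip().strip("'\"") for c in inner.split(',') if c.strip()]

def preclean(text):
    '''Hypothetical helper: lower-case, then drop punctuation, digits and extra spaces.'''
    text = text.lower()
    text = re.sub(r'[^a-z\s]', ' ', text)     # keep letters and whitespace only
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace

print(parse_cats("[math.AG, cs.LG]"))
print(preclean("Deep Learning, 2nd ed.:  an overview!"))
```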

Our last text cleaning step is to lemmatise the text, which reduces each word to its base form. For instance, 'eating' is converted into 'eat' and 'better' is converted into 'good'. This usually takes a while to finish, so instead we're simply going to download a lemmatised version of your chosen data set. Alternatively, if you're dealing with your own scraped data set, the cell below precleans and lemmatises it whenever no lemmatised file is found.

In [None]:
full_path = os.path.join(data_path, f"{file_name}_clean.csv")
if not os.path.isfile(full_path):
    # preclean raw data and save the precleaned texts and
    # categories to {file_name}_preclean.csv
    cleaner.get_preclean_text(file_name, data_path)

    # lemmatise precleaned data and save lemmatised texts to 
    # {file_name}_clean.csv and delete the precleaned file
    cleaner.lemmatise_file(file_name, batch_size = 1000, path = data_path, confirmation = False)

# load in cleaned text
print("Loading cleaned text...")
full_path = os.path.join(data_path, f"{file_name}_clean.csv")
clean_text = pd.read_csv(full_path, delimiter = '\n', header = None)
clean_df = pd.DataFrame(clean_text)
clean_df.columns = ['clean_text']

# load in cats and add them to df
full_path = os.path.join(data_path, f"{file_name}.csv")
clean_cats_with_path = lambda x: cleaner.clean_cats(x, path = data_path)
cleaned_cats = pd.read_csv(full_path, header = None, converters = {0 : clean_cats_with_path})

# join the two dataframes
clean_df['category'] = cleaned_cats.iloc[:, 0]

print(f"Shape of clean_df: {clean_df.shape}. Here are some of the lemmatised texts:")
pd.set_option('display.max_colwidth', 1000)
clean_df[['clean_text', 'category']].head()

## ELMo feature extraction

To build our model we have to extract features from the titles and abstracts. We will be using ELMo, a contextual word embedding model developed by the Allen Institute for AI (the group behind AllenNLP), which converts text input into vectors, with similar words being closer to each other. We will first download the ELMo model. It is over 350 MB in size, so it might take a little while.

In [None]:
elmo.download_elmo_model()
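To make "similar words being closer to each other" concrete: closeness between embedding vectors is commonly measured with cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration. Real ELMo vectors are much larger, and these numbers are not actual ELMo output.

```python
import math

def cosine_similarity(u, v):
    '''Cosine of the angle between two vectors: 1 means same direction.'''
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# toy vectors, invented for this example
toy_vectors = {
    'galaxy':  [0.9, 0.1, 0.0],
    'star':    [0.8, 0.2, 0.1],
    'theorem': [0.0, 0.1, 0.9],
}

sim_close = cosine_similarity(toy_vectors['galaxy'], toy_vectors['star'])
sim_far = cosine_similarity(toy_vectors['galaxy'], toy_vectors['theorem'])
print(f"galaxy vs star: {sim_close:.3f}, galaxy vs theorem: {sim_far:.3f}")
```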

We now need to extract ELMo features from our cleaned text data. This is done using the `extract` function from `elmo.py`. This usually takes a LONG time.

In [None]:
full_path = os.path.join(data_path, f"{file_name}_elmo.csv")
if not os.path.isfile(full_path):
    # extract ELMo data
    elmo.extract(
        file_name = file_name,
        path = data_path,
        batch_size = 20, # lower batch size gives less accurate vectors but requires less memory
        doomsday_clock = 50,
        confirmation = False
    )

# load ELMo data
print("Loading ELMo'd text...")
full_path = os.path.join(data_path, f"{file_name}_elmo.csv")
elmo_data = pd.read_csv(full_path, header = None)
print(f"ELMo data loaded from {full_path}.")

# attach the categories to the ELMo vectors and drop incomplete rows
elmo_df = elmo_data.join(clean_df['category']).dropna()

print(f"Shape of elmo_df: {elmo_df.shape}")
elmo_df.head()

In [None]:
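# all columns except the last one (category) are ELMo features
X = np.asarray(elmo_df.iloc[:, :-1])

# project the ELMo vectors onto their first two principal components
X_2d = PCA(n_components = 2).fit_transform(X)
elmo_2d = pd.DataFrame(X_2d, columns = ['x', 'y'])

# scatter plot of the projected papers, with axis ticks hidden
fig, ax = plt.subplots(1, figsize = (15, 10))
ax.set_xticks([])
ax.set_yticks([])
sns.scatterplot(data = elmo_2d, x = 'x', y = 'y', ax = ax)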
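For readers curious about what the PCA projection does, here is a minimal sketch of the same idea in plain NumPy: centre the data, take a singular value decomposition, and keep the two directions of largest variance. The data is synthetic and `pca_2d` is a hypothetical helper, so results will only agree with scikit-learn up to the sign of each axis.

```python
import numpy as np

def pca_2d(X):
    '''Hypothetical helper: project the rows of X onto the top two principal axes.'''
    X_centred = X - X.mean(axis = 0)
    # rows of Vt are the principal axes, sorted by decreasing singular value
    _, _, Vt = np.linalg.svd(X_centred, full_matrices = False)
    return X_centred @ Vt[:2].T

# synthetic stand-in for the ELMo matrix: 200 "papers" with 10 features
rng = np.random.default_rng(0)
X_toy = rng.normal(size = (200, 10))
X_toy[:, 0] *= 5.0  # give one feature a much larger spread

X_proj = pca_2d(X_toy)
print(X_proj.shape)
```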

## One-hot encoding of categories

We then perform a one-hot encoding of the category variable, as this will make training our model easier. We do this by first creating a dataframe whose columns are the categories, with a binary value for every paper, and then concatenating it with our original dataframe.
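As a small illustration of this encoding, the sketch below turns each paper's list of categories into a 0-1 vector with one entry per category. Since a paper can have several categories, a row may contain more than one 1. The helper `multi_hot` and the toy data are hypothetical.

```python
def multi_hot(paper_cats, all_cats):
    '''Hypothetical helper: 1 for every category the paper has, 0 elsewhere.'''
    present = set(paper_cats)
    return [1 if cat in present else 0 for cat in all_cats]

# toy data, invented for this example
all_cats = ['cs.LG', 'math.AG', 'stat.ML']
papers = [['cs.LG', 'stat.ML'], ['math.AG']]

encoded = [multi_hot(paper_cats, all_cats) for paper_cats in papers]
print(encoded)
```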

## Analysis of the data

Here is how the categories in our data set are distributed.

We see that our data is far from uniformly distributed. These are the categories with the most papers in the data set.

In [None]:
def aggregate_cat(cat):
    '''Map an arXiv category such as 'hep-th' to its aggregated category.'''
    # prefixes of all categories that are grouped under physics
    physics_prefixes = ('astro-ph', 'gr-qc', 'hep', 'nlin', 'nucl', 'physics', 'quant-ph')
    if cat.startswith(physics_prefixes):
        return 'physics'
    elif cat.startswith('cs'):
        return 'cs'
    elif cat.startswith('math'):
        return 'math'
    elif cat.startswith('stat'):
        return 'stats'
    return 'other'

def aggregate_cats(cats):
    return np.asarray([aggregate_cat(cat) for cat in cats])

In [None]:
agg_cats_df = cats_df.copy()
agg_cats_df['category'] = agg_cats_df['category'].apply(aggregate_cat)
agg_cats = np.asarray(agg_cats_df['category'].unique())
print("Aggregated categories:")
print(agg_cats)

In [None]:
def agg_cats_to_binary(categories):
    '''
    Turns aggregated categories into a 0-1 sequence with 1's at every category index.
    
    INPUT
        categories, an iterable of strings
    
    OUTPUT
        numpy array with 1 at the category indexes and zeros everywhere else
    '''
    agg_categories = aggregate_cats(categories)
    return np.isin(agg_cats, agg_categories).astype('int8')

print("One-hot encoding...", end = " ")

df_1hot_agg = elmo_df.copy()

# build a dataframe of binary category indicators, one column per aggregated category
bincat_arr = np.array([agg_cats_to_binary(cat_list) for cat_list in df_1hot_agg['category']]).transpose()
bincat_dict = {key:value for (key,value) in zip(agg_cats, bincat_arr)}
# reuse the index of df_1hot_agg, since dropna() may have left gaps in it
bincat_df = pd.DataFrame(bincat_dict, index = df_1hot_agg.index)

# concatenate df_1hot_agg with the binary category columns
df_1hot_agg = pd.concat([df_1hot_agg, bincat_df], axis=1, sort=False)

# drop the category column
df_1hot_agg.drop(['category'], axis=1, inplace=True)

# save the one hot encoded dataframe (without the index, so that the
# feature columns still come first when the file is reloaded)
full_path = os.path.join(data_path, f"{file_name}_1hot_agg.csv")
df_1hot_agg.to_csv(full_path, index = False)

print(f"Done! Also saved the dataframe to {full_path}.")

# load data, using the first row as column names
print("Loading aggregated category data...")
full_path = os.path.join(data_path, f"{file_name}_1hot_agg.csv")
df_1hot_agg = pd.read_csv(full_path)
print(f"Aggregated category data loaded from {full_path}.")

# show the new columns of the data frame
pd.set_option('display.max_colwidth', 10)
print(f"Dimensions of df_1hot_agg: {df_1hot_agg.shape}.")
df_1hot_agg.head()

In [None]:
# count the number of papers in each aggregated category
sum_cats = bincat_df.sum()

# plot the distribution of the amount of papers in each category
plt.figure(figsize=(20,10))
plt.bar(x=sum_cats.keys(), height=sum_cats.values)
plt.xlabel('Categories', fontsize=13)
plt.ylabel('Number of papers', fontsize=13)
plt.title('Distribution of categories in data set', fontsize=18)
plt.show()

## Building a model

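We are now done manipulating our data, and the time has come to build a model. As a first step we train a binary classifier that predicts whether a paper belongs to the aggregated `physics` category, using the neural network implementation in the local `NN` module.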

In [None]:
nn_model = NN.NeuralNetwork(
    layer_dims = [1024, 512, 1],
    activations = ['tanh', 'tanh', 'sigmoid'],
    learning_rate = 0.0075,
    num_iterations = 25000,
    plot_cost = True
)

In [None]:
X = np.asarray(df_1hot_agg.iloc[:, :1024].T)
y = np.asarray(df_1hot_agg.loc[:, 'physics'])
y = y.reshape(1, y.size)

nn_model.fit(X, y)
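The network above ends in a single sigmoid unit, so its output can be read as the probability that a paper is a physics paper. The usual cost for such an output is binary cross-entropy. Whether `NN.py` uses exactly this cost is an assumption, so the sketch below is only an illustration, with made-up numbers and hypothetical helper names.

```python
import math

def sigmoid(z):
    '''Squash a real number into the interval (0, 1).'''
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps = 1e-12):
    '''Average binary cross-entropy between 0-1 labels and predicted probabilities.'''
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# made-up pre-activation values of the output unit for three papers
logits = [2.0, -1.0, 0.5]
labels = [1, 0, 0]  # 1 means the paper is in 'physics'

probs = [sigmoid(z) for z in logits]
cost = binary_cross_entropy(labels, probs)
predictions = [int(p >= 0.5) for p in probs]  # threshold at 0.5
print(f"cost: {cost:.3f}, predictions: {predictions}")
```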

In [None]:
nn_model.params_