[arXiv](https://arxiv.org) is, I believe, one of the greatest things ever to happen to mankind. We live in a time when data is available in abundance; we just need to turn the right knobs to make it speak for itself. As a machine learning engineer, I spend most of my time with data and models, and I am quite happy with that. 

arXiv is an inseparable part of our lives (be it as a data scientist, a machine learning engineer, a researcher; the list goes on), so I always wanted to do something interesting with the data arXiv holds besides the life-changing papers. Earlier, the situation was that we had a problem statement but not the data. Now it has reversed. Thinking of a problem statement suitable for machine learning is not trivial (offenders, please take a look [here](https://developers.google.com/machine-learning/problem-framing/)). 

I had this dataset - [ARXIV data from 24,000+ papers](https://www.kaggle.com/neelshah18/arxivdataset). It contains all papers related to the ML, CL, NER, AI and CV fields published between 1992 and February 2018. The dataset is in `json` format, but when it is loaded as a `pandas` DataFrame, it looks something like so - 

![](https://i.ibb.co/rfb1pxH/Screen-Shot-2019-09-07-at-5-51-20-PM.png)

The column names are self-explanatory. The `day`, `month` and `year` columns collectively form the date of publication of the paper. The `tag` column denotes the category(s) of the paper. I decided to form a problem statement like - 

> Can a machine (code, essentially) automatically generate the categories just by taking a look at the title of a paper? 

The problem statement hit me hard and I decided to dive all in. First, I had to decide which columns I would need for solving the problem. Naturally, these would be `title` (the input) and `tag` (the target). Given the **title** of a paper, the machine will have to predict its tag. Now, if we take a closer look at a value of `tag`, it looks like so - 

```
"'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'cs.CV', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None"
```

The `term` is exactly what I was looking for. It denotes the category (or categories) more specifically. So, the next challenge was to extract the term(s) of each paper and assign them accordingly. The problem now becomes - 

> Upon seeing the title of a paper, can a machine predict the `term`(s)? 

In machine learning literature, this can be modeled *as a **multi-label** text classification problem*. Multi-label because a paper can be associated with more than one category. 
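To make the multi-label framing concrete, here is a minimal sketch of how a set of terms could be turned into a multi-hot target vector. The `label_space` list and the `to_multi_hot` helper are hypothetical names used only for illustration, and the five terms are just those from the sample above, not the full label set of the dataset.

```python
# Hypothetical label space: only the five terms seen in the sample above.
label_space = ['cs.AI', 'cs.CL', 'cs.CV', 'cs.NE', 'stat.ML']

def to_multi_hot(terms, label_space):
    """Return a 0/1 vector with a 1 at the position of every term present."""
    present = set(terms)
    return [1 if label in present else 0 for label in label_space]

# A paper tagged with two categories gets two active positions.
target = to_multi_hot(['cs.CL', 'stat.ML'], label_space)
print(target)  # [0, 1, 0, 0, 1]
```

Unlike multi-class classification, more than one position can be active at once.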

In this notebook, the following tasks are performed - 
- Data loading
- Data preprocessing and deriving the desired form
- Data serialization

In [0]:
!unzip arxivdataset.zip

## Imports

In [0]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import json 
import re

## Data loading

In [2]:
json_df = pd.read_json('arxivData.json')
json_df.head()

| | author | day | id | link | month | summary | tag | title | year |
|---|---|---|---|---|---|---|---|---|---|
| 0 | [{'name': 'Ahmed Osman'}, {'name': 'Wojciech S... | 1 | 1802.00209v1 | [{'rel': 'alternate', 'href': 'http://arxiv.or... | 2 | We propose an architecture for VQA which utili... | [{'term': 'cs.AI', 'scheme': 'http://arxiv.org... | Dual Recurrent Attention Units for Visual Ques... | 2018 |
| 1 | [{'name': 'Ji Young Lee'}, {'name': 'Franck De... | 12 | 1603.03827v1 | [{'rel': 'alternate', 'href': 'http://arxiv.or... | 3 | Recent approaches based on artificial neural n... | [{'term': 'cs.CL', 'scheme': 'http://arxiv.org... | Sequential Short-Text Classification with Recu... | 2016 |
| 2 | [{'name': 'Iulian Vlad Serban'}, {'name': 'Tim... | 2 | 1606.00776v2 | [{'rel': 'alternate', 'href': 'http://arxiv.or... | 6 | We introduce the multiresolution recurrent neu... | [{'term': 'cs.CL', 'scheme': 'http://arxiv.org... | Multiresolution Recurrent Neural Networks: An ... | 2016 |
| 3 | [{'name': 'Sebastian Ruder'}, {'name': 'Joachi... | 23 | 1705.08142v2 | [{'rel': 'alternate', 'href': 'http://arxiv.or... | 5 | Multi-task learning is motivated by the observ... | [{'term': 'stat.ML', 'scheme': 'http://arxiv.o... | Learning what to share between loosely related... | 2017 |
| 4 | [{'name': 'Iulian V. Serban'}, {'name': 'Chinn... | 7 | 1709.02349v2 | [{'rel': 'alternate', 'href': 'http://arxiv.or... | 9 | We present MILABOT: a deep reinforcement learn... | [{'term': 'cs.CL', 'scheme': 'http://arxiv.org... | A Deep Reinforcement Learning Chatbot | 2017 |


## Columns of interest

In [3]:
json_df_short = json_df[['title', 'tag']]
json_df_short.head()

| | title | tag |
|---|---|---|
| 0 | Dual Recurrent Attention Units for Visual Ques... | [{'term': 'cs.AI', 'scheme': 'http://arxiv.org... |
| 1 | Sequential Short-Text Classification with Recu... | [{'term': 'cs.CL', 'scheme': 'http://arxiv.org... |
| 2 | Multiresolution Recurrent Neural Networks: An ... | [{'term': 'cs.CL', 'scheme': 'http://arxiv.org... |
| 3 | Learning what to share between loosely related... | [{'term': 'stat.ML', 'scheme': 'http://arxiv.o... |
| 4 | A Deep Reinforcement Learning Chatbot | [{'term': 'cs.CL', 'scheme': 'http://arxiv.org... |


In [4]:
# My first attempt was to apply the `split()` method :(
json_df_short['tag'][0].split('term')

["[{'",
 "': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'",
 "': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'",
 "': 'cs.CV', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'",
 "': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, {'",
 "': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]"]

## Start of data preprocessing

In [0]:
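# regex=False treats each character literally rather than as a regular expression
json_df_short['tag'] = json_df_short['tag'].str.replace('[', '', regex=False)
json_df_short['tag'] = json_df_short['tag'].str.replace(']', '', regex=False)
json_df_short['tag'] = json_df_short['tag'].str.replace('{', '', regex=False)
json_df_short['tag'] = json_df_short['tag'].str.replace('}', '', regex=False)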
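
Stripping brackets and braces by hand works, but since each raw `tag` value looks like a Python list of dicts, an alternative would be to parse it directly with `ast.literal_eval` from the standard library. This is only a sketch under that assumption, not the route this notebook takes; `parse_terms` is a hypothetical helper and `raw_tag` is a shortened copy of the first sample. Note that it has to run on the original strings, before any characters are removed.

```python
import ast

# Shortened copy of the first raw `tag` value (assumed to be a valid Python literal).
raw_tag = (
    "[{'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}, "
    "{'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None}]"
)

def parse_terms(tag_string):
    """Parse the string into a list of dicts and collect each `term` value."""
    entries = ast.literal_eval(tag_string)
    return [entry['term'] for entry in entries]

print(parse_terms(raw_tag))  # ['cs.AI', 'cs.CL']
```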

In [6]:
# A sample
json_df_short.tag[0]

"'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'cs.CL', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'cs.CV', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'cs.NE', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, 'term': 'stat.ML', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None"

Regular expressions are a good fit here for extracting the `term`s from these samples. So, here it goes - 

In [7]:
re.findall(r"'[A-Za-z]*\.[A-Z]*'", json_df_short.tag[0])

["'cs.AI'", "'cs.CL'", "'cs.CV'", "'cs.NE'", "'stat.ML'"]
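
Note that the matches above still include the surrounding single quotes, and the pattern only accepts letters before the dot and uppercase letters after it. A possible variation, sketched here as an assumption rather than the pattern used in the rest of this notebook, is to anchor on the `'term'` key and capture whatever sits between the quotes. That returns bare terms and would also match hyphenated arXiv categories such as `q-bio.NC`, if the dataset contains any. The `TERM_PATTERN` name and the `sample` string are made up for illustration.

```python
import re

# Capture the value that follows the 'term' key, without its quotes.
TERM_PATTERN = re.compile(r"'term': '([^']+)'")

# Made-up sample in the same format as the stripped `tag` strings.
sample = (
    "'term': 'cs.AI', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None, "
    "'term': 'q-bio.NC', 'scheme': 'http://arxiv.org/schemas/atom', 'label': None"
)

# With a single capture group, findall returns only the captured text.
terms = TERM_PATTERN.findall(sample)
print(terms)  # ['cs.AI', 'q-bio.NC']
```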

In [0]:
# A helper function which I will apply on the `tag` column
def get_terms(string):
    return re.findall(r"'[A-Za-z]*\.[A-Z]*'", string)

In [9]:
# Voila!
json_df_short['labels'] = json_df_short['tag'].apply(get_terms)
json_df_short.head()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


| | title | tag | labels |
|---|---|---|---|
| 0 | Dual Recurrent Attention Units for Visual Ques... | 'term': 'cs.AI', 'scheme': 'http://arxiv.org/s... | ['cs.AI', 'cs.CL', 'cs.CV', 'cs.NE', 'stat.ML'] |
| 1 | Sequential Short-Text Classification with Recu... | 'term': 'cs.CL', 'scheme': 'http://arxiv.org/s... | ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'] |
| 2 | Multiresolution Recurrent Neural Networks: An ... | 'term': 'cs.CL', 'scheme': 'http://arxiv.org/s... | ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'] |
| 3 | Learning what to share between loosely related... | 'term': 'stat.ML', 'scheme': 'http://arxiv.org... | ['stat.ML', 'cs.AI', 'cs.CL', 'cs.LG', 'cs.NE'] |
| 4 | A Deep Reinforcement Learning Chatbot | 'term': 'cs.CL', 'scheme': 'http://arxiv.org/s... | ['cs.CL', 'cs.AI', 'cs.LG', 'cs.NE', 'stat.ML'] |
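
The `SettingWithCopyWarning` above appears because `json_df_short` was created by selecting columns from `json_df`, and pandas cannot tell whether the assignment is meant to reach the original frame. One common way to avoid it is to take an explicit copy when selecting the columns. A minimal sketch on a made-up two-row frame (the `toy_df` and `toy_short` names are hypothetical):

```python
import re
import pandas as pd

# Made-up stand-in for the real data; only the column names match the notebook.
toy_df = pd.DataFrame({
    'title': ['Paper A', 'Paper B'],
    'tag': ["'term': 'cs.AI', 'label': None", "'term': 'stat.ML', 'label': None"],
    'year': [2017, 2018],
})

# .copy() makes the selection an independent DataFrame, so adding a column
# to it is unambiguous and does not trigger the warning.
toy_short = toy_df[['title', 'tag']].copy()
toy_short['labels'] = toy_short['tag'].apply(
    lambda s: re.findall(r"'[A-Za-z]*\.[A-Z]*'", s)
)

print(toy_short['labels'].tolist())  # [["'cs.AI'"], ["'stat.ML'"]]
```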


## Data serialization

In [10]:
json_df_short.drop('tag', axis=1, inplace=True)
json_df_short.to_csv('arXivdata.csv', index=False)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  errors=errors)


In [11]:
pd.read_csv('arXivdata.csv').head()

| | title | labels |
|---|---|---|
| 0 | Dual Recurrent Attention Units for Visual Ques... | ["'cs.AI'", "'cs.CL'", "'cs.CV'", "'cs.NE'", "... |
| 1 | Sequential Short-Text Classification with Recu... | ["'cs.CL'", "'cs.AI'", "'cs.LG'", "'cs.NE'", "... |
| 2 | Multiresolution Recurrent Neural Networks: An ... | ["'cs.CL'", "'cs.AI'", "'cs.LG'", "'cs.NE'", "... |
| 3 | Learning what to share between loosely related... | ["'stat.ML'", "'cs.AI'", "'cs.CL'", "'cs.LG'",... |
| 4 | A Deep Reinforcement Learning Chatbot | ["'cs.CL'", "'cs.AI'", "'cs.LG'", "'cs.NE'", "... |


The labels still carry an extra layer of quotes, because each term was captured together with its surrounding single quotes. I will need to strip those, but I will handle this in the model-building notebook.
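
One more thing to keep in mind when reading the file back: CSV stores each list as plain text, so the `labels` column comes back as strings rather than lists. A possible way to restore them, sketched here on an in-memory CSV with made-up rows, is to pass a converter to `pd.read_csv` that parses the text with `ast.literal_eval` and strips the leftover single quotes. The `restore_labels` helper is hypothetical and not part of this notebook.

```python
import ast
import io
import pandas as pd

# In-memory stand-in for arXivdata.csv, written the same way (made-up rows).
buffer = io.StringIO()
pd.DataFrame({
    'title': ['Paper A', 'Paper B'],
    'labels': [["'cs.AI'", "'cs.CL'"], ["'stat.ML'"]],
}).to_csv(buffer, index=False)
buffer.seek(0)

def restore_labels(cell):
    """Turn the stored text back into a list and drop the extra quotes."""
    return [label.strip("'") for label in ast.literal_eval(cell)]

# `converters` applies the function to every cell of the named column.
restored = pd.read_csv(buffer, converters={'labels': restore_labels})
print(restored['labels'].tolist())  # [['cs.AI', 'cs.CL'], ['stat.ML']]
```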