# arXiv data processing

The objectives of this notebook are as follows:
- Get the arXiv metadata via Kaggle as well as via the Python `arxiv` module
- Perform exploratory data analysis
- Understand how word embedding mechanisms work, such as word2vec and GloVe
- Create a vector database of arXiv data

This forms the basis for the next stage of work: training a RAG + transformer

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import json
import pandas as pd 
import re   # regular expression 

arxiv_dataset = './arxiv-metadata-oai-snapshot.json'

In [2]:
df = pd.read_json(arxiv_dataset, lines=True)

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2515829 entries, 0 to 2515828
Data columns (total 14 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   id              object
 1   submitter       object
 2   authors         object
 3   title           object
 4   comments        object
 5   journal-ref     object
 6   doi             object
 7   report-no       object
 8   categories      object
 9   license         object
 10  abstract        object
 11  versions        object
 12  update_date     object
 13  authors_parsed  object
dtypes: object(14)
memory usage: 268.7+ MB


In [4]:
df.drop(columns=['submitter','comments',
                 'journal-ref','doi','report-no','license','versions','update_date'],inplace=True)
df.dropna(inplace=True)

In [5]:
df.sample(4)

| | id | authors | title | categories | abstract | authors_parsed |
|---|---|---|---|---|---|---|
| 602161 | 1502.07501 | S.D. Connolly, I.M. McHardy and T. Dwelly | Long-Term X-ray Spectral Variability of Seyfer... | astro-ph.HE | We present analysis of the long-term X-ray s... | [[Connolly, S. D., ], [McHardy, I. M., ], [Dwe... |
| 1607484 | 2202.08660 | Guilherme Eduardo Freire Oliveira, Christian M... | Photon frequency diffusion process | cond-mat.stat-mech astro-ph.CO | We introduce a stochastic multi-photon dynam... | [[Oliveira, Guilherme Eduardo Freire, ], [Maes... |
| 2129668 | astro-ph/0310521 | A. S. Cohen, H. J. A. Rottgering, M. J. Jarvis... | A Deep, High-Resolution Survey at 74 MHz | astro-ph | We present a 74 MHz survey of a 165 square d... | [[Cohen, A. S., ], [Rottgering, H. J. A., ], [... |
| 2155459 | astro-ph/0610551 | Alan B. Whiting, George K. T. Hau, Mike Irwin,... | An Observational Limit on the Dwarf Galaxy Pop... | astro-ph | We present the results of an all-sky, deep o... | [[Whiting, Alan B., ], [Hau, George K. T., ], ... |


In [63]:
list(df[['authors_parsed']].iloc[2141155])

[[['Prokopec', 'Tomislav', '', 'Utrecht University'],
  ['Valkenburg', 'Wessel', '', 'Utrecht University']]]

In [64]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2515829 entries, 0 to 2515828
Data columns (total 6 columns):
 #   Column          Dtype 
---  ------          ----- 
 0   id              object
 1   authors         object
 2   title           object
 3   categories      object
 4   abstract        object
 5   authors_parsed  object
dtypes: object(6)
memory usage: 115.2+ MB


### Create author database

A traditional rule-based approach is difficult. For instance, the same author list can appear as:
- "Wang, Xiaoyu; Vafek, Oskar"
- "Xiaoyu Wang and Oskar Vafek"
- "Xiaoyu Wang (NHMFL) and Oskar Vafek (FSU)"
- "Xiaoyu Wang (1) NHMFL and Oskar Vafek (2) FSU"
- "Xiaoyu Wang &  Oskar Vafek"

All of these variants occur in the data because there is no standard way of writing names, yet they all correspond to the same author list.
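
To make the difficulty concrete, here is a minimal sketch of a naive rule-based splitter applied to the five variants listed. The function `naive_split` and its regular expression are hypothetical illustrations, not part of the original notebook; the point is only that a simple rule yields several different results for what is really one author list.

```python
import re

# The five ways the same two-author list is written in the examples
variants = [
    "Wang, Xiaoyu; Vafek, Oskar",
    "Xiaoyu Wang and Oskar Vafek",
    "Xiaoyu Wang (NHMFL) and Oskar Vafek (FSU)",
    "Xiaoyu Wang (1) NHMFL and Oskar Vafek (2) FSU",
    "Xiaoyu Wang &  Oskar Vafek",
]

def naive_split(author_string):
    """Hypothetical rule: split on ';', '&', or the word 'and', then trim."""
    parts = re.split(r";|&|\band\b", author_string)
    return [p.strip() for p in parts if p.strip()]

for v in variants:
    print(naive_split(v))

# Count how many distinct results the single rule produces
distinct = {tuple(naive_split(v)) for v in variants}
print(len(distinct), "distinct results for one author list")
```

Only the second and fifth variants agree. The others keep the "Last, First" ordering or carry affiliation text along with the name, and each would need its own extra rule.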

Here we use the Stanford NER tagger, through NLTK's `StanfordNERTagger` wrapper, to perform a named entity recognition (NER) task that identifies names.

In [20]:
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# Paths to the Stanford NER 3-class model and jar (a Java runtime is required)
st = StanfordNERTagger('stanford-ner-2020-11-17/classifiers/english.all.3class.distsim.crf.ser.gz',
                       'stanford-ner-2020-11-17/stanford-ner.jar',
                       encoding='utf-8')

text = 'Karine Jean-Piere (1) NHMFL and Oskar Vafek (2) FSU'

# Tokenize, then tag each token as PERSON, ORGANIZATION, LOCATION, or O (other)
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)


[('Karine', 'PERSON'), ('Jean-Piere', 'PERSON'), ('(', 'O'), ('1', 'O'), (')', 'O'), ('NHMFL', 'O'), ('and', 'O'), ('Oskar', 'PERSON'), ('Vafek', 'PERSON'), ('(', 'O'), ('2', 'O'), (')', 'O'), ('FSU', 'O')]


In [21]:
print([cf for cf in classified_text if cf[1]=='PERSON'])

# author_list = []
# for index, row in enumerate(df['authors_parsed']):
#     for j in range(len(row)):
#         author = ''
#         for k in range(len(row[j])):
#             author = row[j][k] + ' ' + author
#             author_list.append(author.strip())
# author_list = set(author_list)

[('Karine', 'PERSON'), ('Jean-Piere', 'PERSON'), ('Oskar', 'PERSON'), ('Vafek', 'PERSON')]
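
The tagger labels individual tokens, so first and last names come back as separate `PERSON` entries. One possible way to recover full names is to merge consecutive `PERSON` tokens. This is a sketch only: the helper `group_person_tokens` is hypothetical, and the tagged list is copied from the output above so the snippet runs without the Java tagger.

```python
from itertools import groupby

# Tagged tokens copied from the StanfordNERTagger output shown above
classified_text = [
    ('Karine', 'PERSON'), ('Jean-Piere', 'PERSON'), ('(', 'O'), ('1', 'O'),
    (')', 'O'), ('NHMFL', 'O'), ('and', 'O'), ('Oskar', 'PERSON'),
    ('Vafek', 'PERSON'), ('(', 'O'), ('2', 'O'), (')', 'O'), ('FSU', 'O'),
]

def group_person_tokens(tagged_tokens):
    """Join each run of consecutive PERSON tokens into one name string."""
    names = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == 'PERSON':
            names.append(' '.join(token for token, _ in group))
    return names

print(group_person_tokens(classified_text))
# ['Karine Jean-Piere', 'Oskar Vafek']
```

This grouping relies on some non-`PERSON` token (such as "and" or a parenthesis) separating the authors. Two names written back to back with no separator would be merged into one.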


In [19]:
author_db = pd.DataFrame({'author_list':list(author_list)})

In [20]:
author_db.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2449529 entries, 0 to 2449528
Data columns (total 1 columns):
 #   Column       Dtype 
---  ------       ----- 
 0   author_list  object
dtypes: object(1)
memory usage: 18.7+ MB
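
The `author_list` passed to `pd.DataFrame` above is only defined in the commented-out loop of the earlier cell. As a rough sketch of how such a set could be built from `authors_parsed`, the snippet below joins first and last name once per author. It assumes each parsed entry starts with `[last, first, ...]`, as in the sample output earlier. The toy `rows` list and the helper `full_name` are hypothetical stand-ins, not part of the original notebook.

```python
# Toy rows mimicking the structure of df['authors_parsed'] (assumed layout:
# [last, first, <other fields such as suffix or affiliation>...])
rows = [
    [['Prokopec', 'Tomislav', '', 'Utrecht University'],
     ['Valkenburg', 'Wessel', '', 'Utrecht University']],
    [['Connolly', 'S. D.', ''], ['McHardy', 'I. M.', '']],
    [['Prokopec', 'Tomislav', '']],   # same author appearing on another paper
]

def full_name(parsed):
    """Build 'First Last' from one parsed author entry, ignoring later fields."""
    last = parsed[0] if len(parsed) > 0 else ''
    first = parsed[1] if len(parsed) > 1 else ''
    return f"{first} {last}".strip()

# A set removes authors that appear on more than one paper
author_set = {full_name(p) for row in rows for p in row}
author_list = sorted(author_set)
print(author_list)
```

On the real data, `rows` would be `df['authors_parsed']`. Unlike the commented-out loop, this sketch adds one entry per author rather than one per field, and it drops affiliations, which is a simplification.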


In [21]:
tmp = author_db.sort_values(by='author_list')

In [22]:
tmp

| | author_list |
|---|---|
| 1144597 | "Aek" Thotsaporn Thanatipanonda |
| 1616714 | "Coherentia"-INFM and Dipartimento di Fisica, ... |
| 1540529 | "Coherentia"-INFM and Dipartimento di Fisica, ... |
| 240012 | "Coherentia"-INFM and Dipartimento di Fisica, ... |
| 130830 | "Coherentia"-INFM and Dipartimento di Fisica, ... |
| ... | ... |
| 94402 | Žąsinas |
| 2398600 | Žďánský |
| 1139684 | ž. Crljen |
| 497416 | žÍidek |


In [23]:
author_db

| | author_list |
|---|---|
| 0 | LPL, LNE-SYRTE Fabio Stefani |
| 1 | Carsten Hensel |
| 2 | Rahul Sheth |
| 3 | Luis Miaja |
| 4 | Yirui |
| ... | ... |
| 2449524 | Srihari Nanniyur |
| 2449525 | P. Eckert |
| 2449526 | the NOvA Collaboration B. Bhuyan |
| 2449527 | on behalf of the ATLAS Liquid Argon Calorimete... |
