# Fix ArXiv data
This Python package includes an option to download ArXiv metadata and use it to train a simple model for tracking rejected articles with CrossRef. The ArXiv data used here is incomplete: not every ArXiv preprint is linked to the correct DOI.

This notebook runs a simple procedure that uses the model to fill in the missing DOIs in the ArXiv data.

In [1]:
# load ArXiv data
import os
import pandas as pd
from rejected_article_tracker.src.ML.ArXivOAIPMH import ArXivOAIPMH

arxiv_data_generator = ArXivOAIPMH().yield_json()
arxiv = pd.DataFrame(arxiv_data_generator)
arxiv.shape

2021-03-14 10:17:50,761 - rejected_article_tracker.src.ML.ArXivOAIPMH.ArXivOAIPMH - DEBUG - Sufficient OAI-PMH data found. Loading from files.
2021-03-14 10:17:51,688 - rejected_article_tracker.src.ML.ArXivOAIPMH.ArXivOAIPMH - DEBUG - Sufficient OAI-PMH data found. Loading from files.
100%|██████████| 90677/90677 [01:01<00:00, 1485.22it/s]


(90677, 16)

In [2]:
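# limit to entries with no DOI; take a copy so that the column
# assignments in later cells do not raise pandas' SettingWithCopyWarning
arxiv = arxiv[arxiv['doi'] == ''].copy()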
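
The comparison with `''` only catches DOIs stored as empty strings. If missing values could also show up as `NaN` or as whitespace (an assumption; the data loaded here appears to use empty strings), a more defensive filter might look like this sketch on a toy frame with hypothetical values:

```python
import pandas as pd

# Toy stand-in for the ArXiv metadata (hypothetical values).
toy = pd.DataFrame({
    "id": ["a", "b", "c", "d"],
    "doi": ["10.1000/xyz", "", None, "  "],
})

# Treat None/NaN and whitespace-only strings as missing, not just ''.
missing_mask = toy["doi"].fillna("").str.strip() == ""
toy_missing = toy[missing_mask].copy()

print(toy_missing["id"].tolist())  # ['b', 'c', 'd']
```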

In [3]:
# convert columns to match the expected input.
# The tracker's input is based on ScholarOne column headings,
# so we need to rename the ArXiv columns to match.
from rejected_article_tracker.src.ML.ArXivAuthorNames import ArXivAuthorNames
arxiv['Journal Name']  = 'ArXiv'

arxiv = arxiv.rename(columns={'created':'Submission Date',
                            'id':'Manuscript ID',
                            'title':'Manuscript Title',
                            'authors':'Author Names',
                            })
# add a column for decision date (not important)
arxiv['Decision Date'] = arxiv['Submission Date']
# pretend they all got rejected
arxiv['Accept or Reject Final Decision'] = 'Reject'

In [4]:
# preprocess author list
def pre_authors(authors):
    """Convert 'First+Last, First+Last' into 'Last, First; Last, First'."""
    author_ls = [x.strip() for x in authors.split(',')]
    author_ls_out = []
    for author in author_ls:
        first_last = author.split('+')
        first = first_last[0]
        try:
            last = first_last[1]
            author_name = last + ', ' + first
        except IndexError:
            # no '+' separator: keep the name as given
            author_name = first

        author_ls_out.append(author_name)
    return '; '.join(author_ls_out)

arxiv['Author Names'] = arxiv['Author Names'].map(pre_authors)
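
To make the conversion concrete, here is a condensed restatement of `pre_authors` applied to author strings in the raw `First+Last, First+Last` form. It is a sketch intended to behave like the cell above, not a replacement for it. One caveat that follows from splitting on commas: a name that itself contains a comma would be split into two entries.

```python
def pre_authors(authors):
    """Convert 'First+Last, First+Last' into 'Last, First; Last, First'."""
    out = []
    for author in (a.strip() for a in authors.split(',')):
        parts = author.split('+')
        # Entries without a '+' separator are kept unchanged.
        out.append(f"{parts[1]}, {parts[0]}" if len(parts) > 1 else parts[0])
    return '; '.join(out)


print(pre_authors("Volodymyr M.+Gorkavenko, Igor+Rudenok"))
# Gorkavenko, Volodymyr M.; Rudenok, Igor
print(pre_authors("Alexey L.+Krugly"))
# Krugly, Alexey L.
```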

In [5]:
arxiv.head(2)

Unnamed: 0,Manuscript ID,Submission Date,Author Names,Manuscript Title,categories,comments,journal-ref,doi,license,abstract,query_id,updated,report-no,msc-class,acm-class,proxy,Journal Name,Decision Date,Accept or Reject Final Decision
2,1201.0003,2011-12-25,"Gorkavenko, Volodymyr M.; Rudenok, Igor; Vilch...",Leptonic asymmetry of the sterile neutrino had...,hep-ph,"12 figures, 22 pages, the case of inverted hie...","Ukr. J. Phys., Vol.58, No.9 (2013) 811-826",,http://arxiv.org/licenses/nonexclusive-distrib...,We consider the leptonic asymmetry generation ...,1201.0003,2012-06-14,,,,,ArXiv,2011-12-25,Reject
4,1201.0005,2011-12-27,"Krugly, Alexey L.",The dynamics of binary alternatives for a disc...,gr-qc,"13 pages, 9 figures, work presented at the ""In...","in Theoretical physics, Proceeding of the inte...",,http://arxiv.org/licenses/nonexclusive-distrib...,A particular case of a causal set is considere...,1201.0005,2011-12-27,,,,,ArXiv,2011-12-27,Reject


In [6]:
# use package to search for missing entries
# code from readme.md
from rejected_article_tracker import ScholarOneRejectedArticlesMatch
import pandas as pd

allowed_cols = [
    'Journal Name',
    'Manuscript ID',
    'Manuscript Title',
    'Author Names',
    'Submission Date',
    'Decision Date',
    'Accept or Reject Final Decision'
]

# for testing
articles = arxiv.sample(20)[allowed_cols].to_dict('records')

# Which might look like:
"""  
articles = [
{
      "Journal Name": "The International Journal of Robotics Research",
      "Manuscript Title": "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection",
      "Author Names": "Levine, Sergey; Pastor, Peter; Krizhevsky, Alex; Ibarz, Julian; Quillen, Deirdre",
      "Accept or Reject Final Decision": "",
      "Decision Date": "2019-01-01T13:29:58.999Z", 
      "Submission Date": "2018-10-01T13:29:58.999Z",
      "Manuscript ID": "ABC-18-070",
    }
]
"""

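# See the package README for configuration details.
search_config = {
    "threshold": 70, # Filters out matches that score below this number
}

# The CrossRef API requires an email address for lookups.
# .get() returns None (rather than raising KeyError) when MY_EMAIL is unset,
# so the placeholder address is used as a fallback.
email = os.environ.get('MY_EMAIL') or "someone@example.com"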


## Test
Check to see how long it takes to retrieve results from a batch of articles.

In [7]:
# %%time
# # Define a 'results' list.
# results = []

# # Run match
# ScholarOneRejectedArticlesMatch(
#     articles=articles,
#     config=search_config,
#     email=email,
#     results=results
# ).match()

# print(len(results))

The test took 2 minutes to retrieve data for 20 articles. In the past we have seen average times of ~3 s per article, but this run was closer to 6 s per article.

At 6 s per article, the entire dataset of ~40k articles would take ~65 hours to retrieve.

In practice, you could run the process continuously on a server or VM. Another option is to run it in stages, writing out results periodically and resuming from where the previous run stopped.
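
The runtime estimate is simple arithmetic and can be reproduced as follows. The 40,000 figure is the rounded dataset size quoted above, so the result (about 67 hours) is only approximate and is in line with the ~65 hours stated.

```python
n_articles = 40_000        # rounded number of entries without a DOI
seconds_per_article = 6    # observed in the test: 120 s for 20 articles

total_hours = n_articles * seconds_per_article / 3600
print(f"{total_hours:.1f} hours")  # 66.7 hours
```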

In [8]:
from rejected_article_tracker.src.ML.CrossRefUtils import CrossRefUtils
from rejected_article_tracker.src.ML.config import Config as config
from tqdm import tqdm

data_path = config.ml_data_dir
data_file_path = os.path.join(data_path,'arxiv_results.csv')

if os.path.exists(data_file_path):
    results_df = pd.read_csv(data_file_path)
else:
    results_df = pd.DataFrame()

print('Starting with dataframe, shape:', results_df.shape)

# run the whole dataset
articles = arxiv[allowed_cols].to_dict('records')

# drop any that we've already retrieved
# (this assumes one result row per article, in the same order as `articles`)
n_results_retrieved = results_df.shape[0]
articles = articles[n_results_retrieved:]

# split into batches so that we can write out periodically
batches = list(CrossRefUtils().chunks(articles, 30))

for batch in tqdm(batches):
    # Define a 'results' list.
    results = []
    # Run match
    ScholarOneRejectedArticlesMatch(
        articles=batch,
        config=search_config,
        email=email,
        results=results
    ).match()

    results_df = pd.concat([results_df,pd.DataFrame(results)])
    results_df.to_csv(data_file_path, index=False)

Starting with dataframe, shape: (37470, 24)
2021-03-14 10:18:54,995 - numexpr.utils - INFO - NumExpr defaulting to 8 threads.
100%|██████████| 58/58 [51:36<00:00, 53.39s/it]
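
The resume-and-batch pattern used in the cell above can be illustrated in isolation. The implementation of `CrossRefUtils().chunks` is not shown in this notebook, so the `chunks` helper below is an assumption about its behaviour (fixed-size slices with a shorter final slice), and the article list and counts are hypothetical.

```python
def chunks(items, size):
    """Yield successive slices of `items`, each at most `size` long."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical example: 100 articles, 37 already processed in an earlier run.
articles = list(range(100))
n_done = 37

# Skip what was already retrieved, then batch the remainder.
remaining = articles[n_done:]
batches = list(chunks(remaining, 30))

print([len(b) for b in batches])  # [30, 30, 3]
```

Writing the results file after every batch means an interrupted run loses at most one batch of work, at the cost of rewriting the whole CSV each time.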


## What did we find?

In [9]:
results_df.head(2)

Unnamed: 0,manuscript_id,raw_manuscript_id,journal_name,manuscript_title,submission_date,decision_date,authors,text_sub_date,final_decision,match_doi,...,match_journal,match_pub_date,match_earliest_date,match_similarity,match_one,match_all,match_crossref_score,match_crossref_cites,match_rank,match_total_decision_days
0,1201,1201,ArXiv,Leptonic asymmetry of the sterile neutrino had...,2011-12-25,2011-12-25,"Volodymyr M.+Gorkavenko, Igor+Rudenok, Stanisl...",2011-12-25,Reject,No Match,...,No Match,No Match,No Match,No Match,No Match,No Match,No Match,No Match,No Match,No Match
1,1201,1201,ArXiv,The dynamics of binary alternatives for a disc...,2011-12-27,2011-12-27,Alexey L.+Krugly,2011-12-27,Reject,No Match,...,No Match,No Match,No Match,No Match,No Match,No Match,No Match,No Match,No Match,No Match
