# arXiv parser
This is an example of how to query the arXiv API and parse the returned articles.

Original taken from:
https://arxiv.org/help/api/examples/python_arXiv_parsing_example.txt

The API is documented here:
http://arxiv.org/help/api/user-manual

# Search parameters

In [1]:
# to get latest publications
#search_query = 'astro-ph'

# or search for a specific term
search_query = 'all:docker'

sortBy = 'lastUpdatedDate' # can be relevance, lastUpdatedDate, submittedDate
sortOrder = 'descending'    # can be either ascending or descending

# retrieve the first 10 results
start = 0
max_results = 10
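
`start` and `max_results` together define a paging window: a query that matches more articles than `max_results` has to be fetched in several requests, each with a larger `start`. The following is a minimal sketch of that arithmetic, written for Python 3. The helper name `page_starts` is hypothetical and not part of the arXiv API or of this notebook.

```python
def page_starts(total_results, page_size):
    """Yield the `start` offset of every page needed to cover total_results."""
    if page_size <= 0:
        raise ValueError('page_size must be positive')
    for start in range(0, total_results, page_size):
        yield start


# 12 matching articles fetched 10 at a time need two requests.
offsets = list(page_starts(total_results=12, page_size=10))
print(offsets)  # [0, 10]
```

Each offset would be passed as `start` in a separate request, with `max_results` kept equal to the page size.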

# The code

In [2]:
import urllib
import feedparser
import pandas as pd

In [3]:
# Base API query URL. Probably no need to change it.
base_url = 'http://export.arxiv.org/api/query?'

fields = ('search_query','start','max_results', 'sortBy', 'sortOrder')
query = '&'.join([f + '=' + str(globals()[f]) for f in fields])
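
The cell above builds the query string by looking the parameter names up in `globals()`. As an alternative, here is a minimal Python 3 sketch that passes the parameters explicitly and lets `urllib.parse.urlencode` handle the escaping. The function name `build_query_url` and its keyword arguments are assumptions made for illustration; only the parameter names sent to the API come from the notebook.

```python
from urllib.parse import urlencode

BASE_URL = 'http://export.arxiv.org/api/query?'


def build_query_url(search_query, start=0, max_results=10,
                    sort_by='lastUpdatedDate', sort_order='descending'):
    """Return the full API URL for one page of results."""
    params = {
        'search_query': search_query,
        'start': start,
        'max_results': max_results,
        'sortBy': sort_by,
        'sortOrder': sort_order,
    }
    # urlencode percent-escapes reserved characters such as ':' and spaces.
    return BASE_URL + urlencode(params)


url = build_query_url('all:docker')
print(url)
```

Escaping matters as soon as the search term contains spaces or other reserved characters, which the plain string concatenation would pass through unchanged.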


In [4]:
# Opensearch metadata such as totalResults, startIndex,
# and itemsPerPage live in the opensearch namespace.
# Some entry metadata lives in the arXiv namespace.
# This is a hack to expose both of these namespaces in
# feedparser v4.1
feedparser._FeedParserMixin.namespaces['http://a9.com/-/spec/opensearch/1.1/'] = 'opensearch'
feedparser._FeedParserMixin.namespaces['http://arxiv.org/schemas/atom'] = 'arxiv'

# perform a GET request using the base_url and query
# (urllib.urlopen is Python 2 only; on Python 3 use urllib.request.urlopen)
response = urllib.urlopen(base_url+query).read()

# parse the response using feedparser
feed = feedparser.parse(response)

In [5]:
# print out feed information
print('Feed title: %s' % feed.feed.title)
print('Feed last updated: %s' % feed.feed.updated)

Feed title: ArXiv Query: search_query=all:docker&amp;id_list=&amp;start=0&amp;max_results=10
Feed last updated: 2016-06-03T00:00:00-04:00


In [6]:
# print opensearch metadata
print('totalResults for this query: %s' % feed.feed.opensearch_totalresults)
print('itemsPerPage for this query: %s' % feed.feed.opensearch_itemsperpage)
print('startIndex for this query: %s' % feed.feed.opensearch_startindex)

totalResults for this query: 12
itemsPerPage for this query: 10
startIndex for this query: 0
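
The opensearch values printed above come from elements in a separate XML namespace, which is why the feedparser namespace registration is needed. To make the mechanism visible, here is a minimal Python 3 sketch that reads the same three elements with the standard-library `xml.etree.ElementTree` instead of feedparser. `SAMPLE_FEED` is a hand-written fragment that mirrors the numbers shown above, not a captured API response, and `read_opensearch` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI mapping used only for lookups in this sketch.
NAMESPACES = {
    'atom': 'http://www.w3.org/2005/Atom',
    'opensearch': 'http://a9.com/-/spec/opensearch/1.1/',
}

# Hand-written sample, not real API output.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <title>hand-written sample feed</title>
  <opensearch:totalResults>12</opensearch:totalResults>
  <opensearch:startIndex>0</opensearch:startIndex>
  <opensearch:itemsPerPage>10</opensearch:itemsPerPage>
</feed>"""


def read_opensearch(xml_text):
    """Return the opensearch paging metadata of an Atom feed as integers."""
    root = ET.fromstring(xml_text)
    result = {}
    for name in ('totalResults', 'startIndex', 'itemsPerPage'):
        text = root.findtext('opensearch:' + name, namespaces=NAMESPACES)
        result[name] = int(text)
    return result


metadata = read_opensearch(SAMPLE_FEED)
print(metadata)
```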


In [7]:
rows = []

In [8]:
# Run through each entry and collect its metadata into a row
for entry in feed.entries:
    row = {}
    row['arxiv-id'] = entry.id.split('/abs/')[-1]
    row['Published'] = entry.published
    row['Title'] = entry.title
    
    row['Last Author'] = entry.author
                                             
    if 'arxiv_affiliation' in entry:
        row['Last Author'] += ' (%s)' % entry.arxiv_affiliation
    
    row['Authors'] = ', '.join(author.get('name', '?') for author in entry.get('authors', []))

    # get the links to the abs page and pdf for this e-print
    for link in entry.links:
        if link.rel == 'alternate':
            row['page'] = link.href
        elif link.get('title') == 'pdf':
            row['pdf'] = link.href
    
    # The journal reference, comments and primary_category sections live under 
    # the arxiv namespace
    row['journal_ref'] = entry.get('arxiv_journal_ref',  '-')
    
    row['Comments'] = entry.get('arxiv_comment',  '-')
    
    # Since the <arxiv:primary_category> element has no data, only
    # attributes, feedparser does not store anything inside
    # entry.arxiv_primary_category
    # This is a dirty hack to get the primary_category, just take the
    # first element in entry.tags.  If anyone knows a better way to do
    # this, please email the list!
    row['Primary Category'] = entry.tags[0]['term']
    
    # Let's get all the categories
    all_categories = [t['term'] for t in entry.tags]
    row['All Categories'] = ', '.join(all_categories)
    
    # The abstract is in the <summary> element
    row['Abstract'] = entry.summary
    rows.append(row)
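
The comment in the loop notes that `<arxiv:primary_category>` carries only attributes, so the loop falls back to the first tag. One possible way around this, sketched here for Python 3, is to read the attribute directly with the standard-library `xml.etree.ElementTree` rather than feedparser. `SAMPLE_ENTRY` is a hand-written fragment modeled on the "Analysis of Docker Security" article from this query, not a captured API response, and `parse_entry` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

NS = {
    'atom': 'http://www.w3.org/2005/Atom',
    'arxiv': 'http://arxiv.org/schemas/atom',
}

# Hand-written sample, not real API output.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom"
       xmlns:arxiv="http://arxiv.org/schemas/atom">
  <id>http://arxiv.org/abs/1501.02967v1</id>
  <title>Analysis of Docker Security</title>
  <link href="http://arxiv.org/abs/1501.02967v1" rel="alternate" type="text/html"/>
  <link title="pdf" href="http://arxiv.org/pdf/1501.02967v1" rel="related" type="application/pdf"/>
  <arxiv:primary_category term="cs.CR"/>
  <category term="cs.CR"/>
</entry>"""


def parse_entry(xml_text):
    """Extract id, categories and links from a single Atom <entry>."""
    entry = ET.fromstring(xml_text)
    row = {'arxiv-id': entry.findtext('atom:id', namespaces=NS).split('/abs/')[-1]}

    # The primary category is stored in the 'term' attribute, not in the text.
    primary = entry.find('arxiv:primary_category', NS)
    row['Primary Category'] = primary.get('term') if primary is not None else '-'

    categories = entry.findall('atom:category', NS)
    row['All Categories'] = ', '.join(c.get('term') for c in categories)

    for link in entry.findall('atom:link', NS):
        if link.get('rel') == 'alternate':
            row['page'] = link.get('href')
        elif link.get('title') == 'pdf':
            row['pdf'] = link.get('href')
    return row


row = parse_entry(SAMPLE_ENTRY)
print(row)
```

The trade-off is that ElementTree leaves all Atom handling to the caller, whereas feedparser normalizes dates, authors and links automatically.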

In [9]:
pd.DataFrame(rows)

Unnamed: 0,Abstract,All Categories,Authors,Comments,Last Author,Primary Category,Published,Title,arxiv-id,journal_ref,page,pdf
0,We show NP-completeness for several planar var...,cs.CC,"Andreas Darmann, Janosch Döcker, Britta Dorn","8 pages, 4 figures",Britta Dorn,cs.CC,2016-04-19T14:24:51Z,On planar variants of the monotone satisfiabil...,1604.05588v1,-,http://arxiv.org/abs/1604.05588v1,http://arxiv.org/pdf/1604.05588v1
1,Monotone 3-Sat-4 is a variant of the satisfiab...,cs.CC,"Andreas Darmann, Janosch Döcker",-,Janosch Döcker,cs.CC,2016-03-25T11:13:56Z,Monotone 3-Sat-4 is NP-complete,1603.07881v1,-,http://arxiv.org/abs/1603.07881v1,http://arxiv.org/pdf/1603.07881v1
2,Existing benchmarking methods are time consumi...,cs.DC,"Blesson Varghese, Lawan Thamsuhang Subba, Long...",16th IEEE/ACM International Symposium on Clust...,Adam Barker,cs.DC,2016-03-23T20:55:44Z,DocLite: A Docker-Based Lightweight Cloud Benc...,1603.07357v1,-,http://arxiv.org/abs/1603.07357v1,http://arxiv.org/pdf/1603.07357v1
3,"Application containers, such as Docker contain...",cs.CR,"Vaibhav Rastogi, Drew Davidson, Lorenzo De Car...",-,Patrick McDaniel,cs.CR,2016-02-26T17:34:38Z,Towards Least Privilege Containers with Cimpli...,1602.08410v1,-,http://arxiv.org/abs/1602.08410v1,http://arxiv.org/pdf/1602.08410v1
4,With the availability of a wide range of cloud...,cs.DC,"Blesson Varghese, Lawan Thamsuhang Subba, Long...",Accepted to the IEEE International Conference ...,Adam Barker,cs.DC,2016-01-15T10:57:02Z,Container-Based Cloud Virtual Machine Benchmar...,1601.03872v1,-,http://arxiv.org/abs/1601.03872v1,http://arxiv.org/pdf/1601.03872v1
5,"Finding inclusion-minimal ""hitting sets"" for a...","cs.DS, cs.AI, cs.CC, 68W05, 68R05, 05C85","Andrew Gainer-Dewar, Paola Vera-Licona",-,Paola Vera-Licona,cs.DS,2016-01-05T19:24:25Z,The minimal hitting set generation problem: al...,1601.02939v1,-,http://arxiv.org/abs/1601.02939v1,http://arxiv.org/pdf/1601.02939v1
6,Virtualization is growing rapidly as a result ...,"cs.DC, cs.PF",Roberto Morabito,Accepted to the IEEE/ACM UCC 2015 (SD3C Worksh...,Roberto Morabito,cs.DC,2015-11-04T07:49:47Z,Power Consumption of Virtualization Technologi...,1511.01232v1,-,http://arxiv.org/abs/1511.01232v1,http://arxiv.org/pdf/1511.01232v1
7,Recent developments in the commercial open sou...,"cs.SE, D.2.m; D.2.12; K.6.3; K.6.1","Robert Nagler, David Bruhwiler, Paul Moeller, ...",2 pages,Stephen Webb,cs.SE,2015-09-28T17:49:16Z,Sustainability and Reproducibility via Contain...,1509.08789v1,-,http://arxiv.org/abs/1509.08789v1,http://arxiv.org/pdf/1509.08789v1
8,Solving the software dependency issue under th...,cs.DC,"Hsi-En Yu, Weicheng Huang",PRAGMA-ICDS 15,Weicheng Huang,cs.DC,2015-09-28T08:21:30Z,Building a Virtual HPC Cluster with Auto Scali...,1509.08231v1,-,http://arxiv.org/abs/1509.08231v1,http://arxiv.org/pdf/1509.08231v1
9,"Over the last few years, the use of virtualiza...",cs.CR,Thanh Bui,-,Thanh Bui,cs.CR,2015-01-13T11:44:02Z,Analysis of Docker Security,1501.02967v1,-,http://arxiv.org/abs/1501.02967v1,http://arxiv.org/pdf/1501.02967v1
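
If pandas is not available, the list of row dictionaries can also be written out as CSV with the standard library. This is a minimal Python 3 sketch using `csv.DictWriter` in place of `pd.DataFrame`. The two sample rows reuse values from the table above with most columns left out, and `rows_to_csv` is a hypothetical helper.

```python
import csv
import io

# Two rows with a subset of the columns, taken from the table above.
rows = [
    {'arxiv-id': '1501.02967v1', 'Title': 'Analysis of Docker Security',
     'Last Author': 'Thanh Bui', 'Primary Category': 'cs.CR'},
    {'arxiv-id': '1603.07881v1', 'Title': 'Monotone 3-Sat-4 is NP-complete',
     'Last Author': 'Janosch Döcker', 'Primary Category': 'cs.CC'},
]


def rows_to_csv(rows, out):
    """Write a list of dicts to the file-like object `out` as CSV."""
    # Collect every key that occurs in any row and sort for a stable column order.
    fieldnames = sorted({key for row in rows for key in row})
    # restval fills in '-' for keys that a row does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval='-')
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
rows_to_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file instead of a `StringIO` buffer, open it with `newline=''` so the `csv` module controls the line endings.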
