textmining-infopros/Appendix-A

Appendix-A: Online Repositories Available for Text Mining

Appendix A accompanies Chapter 2, "Text data and where to find them", of the book: Manika Lamba and Margam Madhusudhan (2021), Text Mining for Information Professionals: An Uncharted Territory, Springer Nature.

How to Cite

Lamba, Manika, & Madhusudhan, Margam. (2021). Appendix-A: Online Repositories Available for Text Mining (Version v1.0). http://doi.org/10.5281/zenodo.5104488

A.1 Selected Online Repositories Available for Text Mining

| Repository | Description | Data Types |
| --- | --- | --- |
| Registry of Research Data Repositories | Searchable registry of over 2,000 repositories that host research data. Individual datasets may be subject to use restrictions | Archived, audiovisual, configuration, databases, images, network-based, raw, scientific and statistical data, among others |
| Harvard Dataverse | Searchable repository of research data in a variety of formats. Individual datasets may be subject to use restrictions | Applications, audio, documents, FITS, images, tabular data, text, compressed files (e.g. ZIP) |
| Full-text corpus data | Contains full-text, downloadable corpus data from six large English corpora. Individual datasets may be subject to use restrictions or require purchase | Databases, plain text |
| English-Corpora | Contains downloadable corpora developed by Mark Davies, Brigham Young University. Individual datasets may be subject to use restrictions or require purchase | Databases, plain text |
| Project Gutenberg | Offers over 58,000 free eBooks in a variety of languages | ePub, HTML, Kindle, plain text |
| Spatial Data Repository | Provides geographically linked health and demographic data from the DHS Program and the U.S. Census Bureau for mapping in geographic information systems (GIS) | Various geospatial formats, CSV |
| Natural Earth | Free vector and raster map data | ESRI shapefile, TIFF, TFW |
| New York University (NYU) Spatial Data Repository | Provides a catalog of geospatial data and maps available from New York University | Image, Polygon, Raster, Line, Point, Mixed |
| HathiTrust | Non-profit, large-scale digital preservation repository that includes digital content from research libraries via the Google Books and Internet Archive initiatives | PDF |
| Global NDLTD | Open-access electronic theses and dissertations database provided by the Networked Digital Library of Theses and Dissertations | PDF |
| Open Access Theses and Dissertations | Open-access electronic theses and dissertations database | PDF |
| PQDT Open | Full-text open-access theses and dissertations database | PDF |
| arXiv | Provides open-access full-text preprints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics | PDF |
| bioRxiv | Provides open-access full-text preprints in the life sciences | PDF |
| Wikipedia | Collects and develops content for the public in an open-access environment | PDF |

A.2 Selected Online Repositories with API Available for Text Mining

Adapted from
©2020 MIT Libraries - reprinted with permission. https://libraries.mit.edu/scholarly/publishing/apis-for-scholarly-resources/. Accessed 26th Feb 2020,
©2020 Purdue University - reprinted with permission. https://guides.lib.purdue.edu/c.php?g=412592. Accessed 26th Feb 2020,
©2020 USC LibGuides - reprinted with permission. https://libguides.usc.edu/contentmining/databases. Accessed 26th Feb 2020.

| Resource | Description | Fee | Result Format | Limitations | Registration |
| --- | --- | --- | --- | --- | --- |
| arXiv | It provides access to both metadata and article abstracts | Free | Atom | None | None |
| SAO/NASA Astrophysics Data System (ADS) | It provides access to bibliographic data on astronomy and physics publications | Free | JSON | Rate limits apply | Key required |
| BioMed Central | It provides access to both metadata and full-text content | Free | XML, JSON | None | Key required |
| Chronicling America | It provides access to historic newspapers and select digitized newspaper pages | Free | HTML (default), JSON, Atom | None | None |
| CrossRef | It provides access to metadata records with CrossRef DOIs | Free | JSON | None | None |
| Digital Public Library of America | It provides access to metadata of its collection | Free | JSON-LD | None | Key required |
| HathiTrust (Bibliographic API) | It provides access to bibliographic and rights information for its collection. It does not provide an API for bulk retrieval of records | Free | MARC-XML, JSON | No specific limits; however, it is intended only for small numbers of items. Permission must be sought for bulk retrieval | None |
| HathiTrust (Data API) | It provides access to HathiTrust and Google digitized texts of public domain works | Free | XML, JSON | No specific limits. However, consult their policies on data use | Key required |
| IEEE Xplore | It provides metadata for the articles submitted to the database | Free | XML | Max 200 results per query | Must subscribe to or be a member of an institution that subscribes to IEEE Xplore |
| JSTOR Data for Research | It provides access to content on JSTOR for research and teaching | Free | Zip files, XML | Max 25,000 documents per dataset; access to additional datasets is available by special request | Requires MyJSTOR account registration |
| Library of Congress | It provides multiple APIs to download bibliographic data and search Library of Congress digital collections | Free | Varies | Varies | Most APIs do not require a key |
| Nature | It provides access to the metadata of its collection | Free | XML, JSON, and more | No specific limits; however, downloads should be limited to “reasonable rates”. See Springer Nature TDM Policy | Varies |
| National Library of Medicine | It provides 29 separate APIs for accessing a wide variety of content from various NLM databases | Varies | Varies | Varies | Varies |
| National Center for Biotechnology Information | It offers several public APIs to access many databases and tools, including PubMed, PMC, Gene, Nuccore, and Protein | Free | Varies | Varies | Key required for some |
| Organisation for Economic Co-Operation and Development (OECD) | It provides access to the most-used OECD datasets | Free | JSON, XML | Max 1,000,000 results per query; max URL length of 1,000 characters | None |
| Open Academic Graph | It provides datasets for citations drawn from two large academic graphs: Microsoft Academic Graph and AMiner | Free | Zip, JSON | None | None |
| ORCID | It provides researcher profile data | Free, with subscription options | HTML, XML, or JSON | Two options: 1) users can access the free Public API, which only returns data marked as “public”; 2) become an ORCID member to receive API credentials | ORCID ID account required |
| Oxford English Dictionary (OED) | It provides access to its datasets | Free, with subscription options | JSON | 3,000 requests per month and 60 calls per minute with the free option; other options available | Key required. Academic researchers can request free access |
| PLOS Article-Level Metrics | It provides article-level metrics (including usage statistics, citation counts, and social networking activity) for articles published in PLOS journals and articles added to PLOS Hubs: Biodiversity | Free | XML, JSON, CSV | Results limited to batches of 50 at a time | Key required |
| PLOS Search | It allows PLOS content to be queried for integration with web, desktop, or mobile applications | Free | XML, JSON | Max 7,200 requests a day, 300 per hour, 10 per minute; users should wait 5 seconds for each query to return results; requests should not return more than 100 rows. API users are limited to no more than five concurrent connections from a single IP address | Key required |
| SpringerLink | It provides access to the metadata of its collection | Free | XML, JSON, and more | No specific limits; however, downloads should be limited to “reasonable rates”. See Springer Nature TDM Policy | Varies |
| World Bank | It provides access to World Bank statistical databases, indicators, projects, and loans, credits, financial statements and other data related to financial operations | Free | Varies | Request volume limits are unspecified, but should be “reasonable” | None |
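
As an illustration of how the columns in Table A.2 translate into practice, the following is a minimal Python sketch for the arXiv row (free, Atom results, no key). It is not part of the original appendix. The endpoint URL and the `search_query`, `start`, and `max_results` parameters follow the public arXiv API as commonly documented; the function names (`page_urls`, `parse_atom`, `harvest`), the page size, and the three-second pause are assumptions chosen for illustration, so check the provider's current documentation and terms before harvesting.

```python
import time
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Endpoint as commonly documented for the arXiv API; verify before use.
ARXIV_API = "http://export.arxiv.org/api/query"
# Namespace used by Atom feeds, needed for ElementTree lookups.
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def page_urls(search_query, total, page_size=100):
    """Yield one query URL per page of results (hypothetical helper)."""
    for start in range(0, total, page_size):
        params = {
            "search_query": search_query,
            "start": start,
            "max_results": min(page_size, total - start),
        }
        yield ARXIV_API + "?" + urllib.parse.urlencode(params)


def _clean(entry, tag):
    """Return the text of a child element with whitespace collapsed."""
    text = entry.findtext(tag, default="", namespaces=ATOM)
    return " ".join(text.split())


def parse_atom(feed_xml):
    """Turn an Atom feed (str or bytes) into a list of metadata dicts."""
    root = ET.fromstring(feed_xml)
    return [
        {
            "id": _clean(entry, "atom:id"),
            "title": _clean(entry, "atom:title"),
            "summary": _clean(entry, "atom:summary"),
        }
        for entry in root.findall("atom:entry", ATOM)
    ]


def harvest(search_query, total, page_size=100, delay_seconds=3.0):
    """Fetch and parse every page, pausing between requests (needs network)."""
    records = []
    for url in page_urls(search_query, total, page_size):
        with urllib.request.urlopen(url, timeout=30) as response:
            records.extend(parse_atom(response.read()))
        time.sleep(delay_seconds)  # assumed polite pause, not a limit from the table
    return records
```

The paging helper is kept separate from the network call so the same pattern could be adapted to resources in the table that cap results per request (for example, 200 per query for IEEE Xplore or 100 rows for PLOS Search), although each of those APIs uses its own parameter names and, in most cases, an API key.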