Appendix A accompanies Chapter 2, "Text data and where to find them," of the book: Manika Lamba and Margam Madhusudhan (2021), Text Mining for Information Professionals: An Uncharted Territory, Springer Nature.
Lamba, Manika, & Madhusudhan, Margam. (2021). Appendix-A: Online Repositories Available for Text Mining (Version v1.0). http://doi.org/10.5281/zenodo.5104488
Repository | Description | Data Types |
---|---|---|
Registry of Research Data Repositories | Searchable registry of over 2,000 repositories that host research data. Individual datasets may be subject to use restrictions | Archived, audiovisual, configuration, databases, images, network-based, raw, scientific and statistical data among others |
Harvard Dataverse | Searchable repository of research data in a variety of formats. Individual datasets may be subject to use restrictions | Applications, audio, documents, FITS, images, tabular data, text, compressed files (e.g. ZIP) |
Full-text corpus data | Contains full-text, downloadable corpus data from six large English corpora. Individual datasets may be subject to use restrictions or require purchase | Databases, plain text |
English-Corpora | Contains downloadable corpora developed by Mark Davies, Brigham Young University. Individual datasets may be subject to use restrictions or require purchase | Databases, plain text |
Project Gutenberg | Offers over 58,000 free eBooks in a variety of languages | ePub, HTML, Kindle, plain text |
Spatial Data Repository | Provides geographically linked health and demographic data from the DHS Program and the U.S. Census Bureau for mapping in geographic information systems (GIS) | Various geospatial formats, CSV |
Natural Earth | Free vector and raster map data | ESRI shapefile, TIFF, TFW |
New York University (NYU) Spatial Data Repository | Provides a catalog of geospatial data and maps available from New York University | Image, Polygon, Raster, Line, Point, Mixed |
HathiTrust | Non-profit, large-scale digital preservation repository that includes digital content from research libraries via the Google Books and Internet Archive initiatives | |
Global NDLTD | Open-access electronic theses and dissertations database provided by the Networked Digital Library of Theses and Dissertations | |
Open Access Theses and Dissertations | Open-access electronic theses and dissertations database | |
PQDT Open | Full-text open access theses and dissertations database | |
arXiv | Provides open-access full-text preprints in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics | |
bioRxiv | Provides open-access full-text preprints in the life sciences | |
Wikipedia | Collects and develops content for the public in an open-access environment | |
Adapted from
©2020 MIT Libraries - reprinted with permission. https://libraries.mit.edu/scholarly/publishing/apis-for-scholarly-resources/. Accessed 26th Feb 2020,
©2020 Purdue University - reprinted with permission. https://guides.lib.purdue.edu/c.php?g=412592. Accessed 26th Feb 2020,
©2020 USC LibGuides - reprinted with permission. https://libguides.usc.edu/contentmining/databases. Accessed 26th Feb 2020.
Resource | Description | Fee | Result Format | Limitations | Registration |
---|---|---|---|---|---|
arXiv | It provides access to both metadata and article abstracts | Free | Atom | None | None |
SAO/NASA Astrophysics Data System (ADS) | It provides access to bibliographic data on astronomy and physics publications | Free | JSON | Rate limits apply | Key required |
BioMed Central | It provides access to both metadata and full-text content | Free | XML,JSON | None | Key required |
Chronicling America | It provides access to historic newspapers and select digitized newspaper pages | Free | HTML(default),JSON,Atom | None | None |
CrossRef | It provides access to metadata records with CrossRef DOIs | Free | JSON | None | None |
Digital Public Library of America | It provides access to metadata of its collection | Free | JSON-LD | None | Key required |
HathiTrust (Bibliographic API) | It provides access to bibliographic and rights information for its collection. It does not provide an API for bulk retrieval of records | Free | MARC-XML, JSON | No specific limits; however, it is intended only for small numbers of items. Permission must be sought for bulk retrieval | None |
HathiTrust (Data API) | It provides access to HathiTrust and Google digitized texts of public domain works | Free | XML, JSON | No specific limits. However, consult their policies on data use | Key required |
IEEE Xplore | It provides metadata for the articles submitted to the database | Free | XML | Max 200 results per query | Must subscribe to or be a member of an institution that subscribes to IEEE Xplore |
JSTOR Data for Research | It provides access to content on JSTOR for research and teaching | Free | Zip files, XML | Max 25,000 documents per dataset; users can request access to more datasets | Requires MyJSTOR account registration |
Library of Congress | It provides multiple APIs available to download bibliographic data and search Library of Congress digital collections | Free | Varies | Varies | Most APIs do not require key |
Nature | It provides access to the metadata of its collection | Free | XML, JSON, and more | No specific limits; however, downloads should be limited to “reasonable rates” (see the Springer Nature TDM Policy) | Varies |
National Library of Medicine | It provides 29 separate APIs for accessing a wide variety of content from various NLM databases | Varies | Varies | Varies | Varies |
National Center for Biotechnology Information | It offers several public APIs to access many databases and tools, including PubMed, PMC, Gene, Nuccore, and Protein | Free | Varies | Varies | Key required for some |
Organisation for Economic Co-Operation and Development (OECD) | It provides access to the top used OECD datasets | Free | JSON, XML | Max 1,000,000 results per query, max URL length of 1,000 characters | None |
Open Academic Graph | It provides datasets for citations drawn from two large academic graphs: Microsoft Academic Graph and AMiner | Free | Zip, JSON | None | None |
ORCID | It provides researcher profile data | Free, with subscription options | HTML, XML, or JSON | Two options: 1) Users can access the free Public API, which only returns data marked as “public”; 2) Become an ORCID member to receive API credentials | ORCID ID Account required |
Oxford English Dictionary (OED) | It provides access to its datasets | Free, with subscription options | JSON | 3,000 requests per month and 60 calls per minute with the free option; other options available | Key required. Academic researchers can request free access |
PLOS Article-Level Metrics | It provides article-level metrics (including usage statistics, citation counts, and social networking activity) for articles published in PLOS journals and articles added to PLOS Hubs: Biodiversity | Free | XML, JSON, CSV | Results limited to batches of 50 at a time | Key required |
PLOS Search | It allows PLOS content to be queried for integration with web, desktop, or mobile applications | Free | XML, JSON | Max 7,200 requests per day, 300 per hour, 10 per minute; users should wait 5 seconds for each query to return results; requests should not return more than 100 rows. API users are limited to no more than five concurrent connections from a single IP address | Key required |
SpringerLink | It provides access to the metadata of its collection | Free | XML, JSON, and more | No specific limits; however, downloads should be limited to “reasonable rates” (see the Springer Nature TDM Policy) | Varies |
World Bank | It provides access to World Bank statistical databases, indicators, projects, and loans, credits, financial statements and other data related to financial operations | Free | Varies | Request volume limits are unspecified, but should be “reasonable” | None |
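
Several rows above state request limits, for example PLOS Search (at most 10 requests per minute and no more than 100 rows per request). The following is a minimal sketch of how a client script might stay within such limits. It is not taken from any provider's documentation: the class `MinIntervalThrottle` and the helper `clamp_rows` are hypothetical names, and the 6-second spacing is simply an assumption derived from "10 per minute". No network calls are made; the throttle would be called before each request in whatever HTTP client is used.

```python
import time
from typing import Callable, Optional


class MinIntervalThrottle:
    """Enforce a minimum spacing between consecutive requests (hypothetical helper)."""

    def __init__(
        self,
        min_interval: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last: Optional[float] = None  # time at which the last request was released

    def wait(self) -> float:
        """Sleep if the previous request was too recent; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
        if delay > 0:
            self._sleep(delay)
        self._last = now + delay
        return delay


def clamp_rows(requested: int, max_rows: int = 100) -> int:
    """Keep a requested page size between 1 and the stated per-request maximum."""
    return max(1, min(requested, max_rows))


# Assumed spacing: 10 requests per minute -> one request every 6 seconds.
plos_throttle = MinIntervalThrottle(min_interval=60 / 10)
```

In use, a script would call `plos_throttle.wait()` immediately before each query and pass `clamp_rows(n)` as the page size. Daily and hourly caps, as well as the limit on concurrent connections, would need separate bookkeeping that this sketch does not attempt.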