Alternative Tabular to RDF converters

Tim L edited this page · 247 revisions
csv2rdf4lod-automation is licensed under the Apache License, Version 2.0

csv2rdf4lod is a tool by some folks in the Tetherless World Constellation at RPI. It is currently being used as part of the infrastructure for their Linking Open Government Data and Linking Open Biomedical Data projects. Tim Lebo wrote it with some invaluable design guidance from Greg Williams.

The number of utilities available to convert tabular data to RDF suggests a large and diverse set of requirements. To help you find the right match for your needs, this page collects pointers to other utilities that can convert tabular data to RDF.

If you know of yet another, feel free to email Tim or jot (and save) suggestions on this piratepad.

Please note that csv2rdf4lod was NOT used to produce the RDF available at http://www.data.gov, such as http://www.data.gov/semantic/data/alpha/92/dataset-92.rdf.gz. That was from some code somewhere in Google.

Special thanks to Jim McCusker, Paola, Li, Greg, Christoph, and Alvaro for their help in developing this list.

Other listings

Tabula

https://github.com/tabulapdf/tabula/releases/tag/v1.0.0

KARMA

  • homepage: http://www.isi.edu/integration/karma/
  • Available under Apache 2 License on GitHub
  • GUI based
  • Uses a Conditional Random Field (CRF) model to propose mappings to classes and properties.
  • Uses relational database and views.
  • Does not perform data preparation itself; it requires data preparation as a separate preprocessing step.
  • Provides entity matching based on Song and Heflin's entity coreference approach (Silk did not work for them)
  • Permits manual curation of sameAs links. Uses PROV-O to distinguish different sets of links.

Publications:

Any23

Anything To Triples (any23) is a library, a web service and a command line tool that extracts structured data in RDF format from a variety of Web documents.

  • http://any23.apache.org/download.html

  • July 2013: Apache Any23 0.8.0 was released. It includes a major refactoring of the codebase that improves modularity and makes Any23 much easier to use within your applications. It supports the following input formats:

    • RDF/XML, Turtle, Notation 3
    • RDFa with RDFa1.1 prefix mechanism
    • Microformats: Adr, Geo, hCalendar, hCard, hListing, hResume, hReview, License, XFN and Species
    • HTML5 Microdata: (such as Schema.org)
    • CSV: Comma Separated Values with separator autodetection.
  • May 2014: The Apache Any23 PMC announced the release of Any23 1.0, a major release for the project. A release report can be accessed at http://s.apache.org/Ull. Although the project suggests consuming the Any23 Maven artifacts, the downloads page (http://any23.apache.org/download.html) offers a number of other download options as well as documentation on including Any23 in your projects. Currently it supports the following input formats:
    • RDF/XML, Turtle, Notation 3
    • RDFa with RDFa1.1 prefix mechanism
    • Microformats: Adr, Geo, hCalendar, hCard, hListing, hRecipe, hReview, License, XFN and Species
    • HTML5 Microdata: (such as Schema.org)
    • JSON-LD: JSON for Linking Data, a lightweight Linked Data format that builds on the already successful JSON format and helps JSON data interoperate at Web scale.
    • CSV: Comma Separated Values with separator autodetection.
    • Vocabularies: Extraction support for CSV, Dublin Core Terms, Description of a Career, Description Of A Project, Friend Of A Friend, GEO Names, ICAL, lkif-core, Open Graph Protocol, BBC Programmes Ontology, RDF Review Vocabulary, schema.org, VCard, BBC Wildlife Ontology and XHTML.
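
As a rough illustration of what "separator autodetection" followed by a naive CSV-to-RDF conversion can look like, here is a minimal Python sketch. It is not Any23's implementation: it relies on the standard library's `csv.Sniffer`, and the `http://example.org/` namespace, the row/column URI scheme, and both function names are assumptions made up for this example.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from any tool listed here


def detect_delimiter(sample, candidates=",;\t|"):
    """Guess the column separator of a text sample with csv.Sniffer."""
    return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter


def table_to_ntriples(text):
    """Yield one N-Triples line per non-empty cell (row URI, column URI, literal)."""
    reader = csv.DictReader(io.StringIO(text), delimiter=detect_delimiter(text))
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{BASE}row/{row_number}>"
        for column, value in row.items():
            if column is None or not value:
                continue  # skip surplus fields and empty or missing cells
            predicate = f"<{BASE}column/{quote(column)}>"
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            yield f'{subject} {predicate} "{literal}" .'


if __name__ == "__main__":
    sample = "name;age\nalice;30\nbob;25\n"
    for line in table_to_ntriples(sample):
        print(line)
```

Every cell becomes a plain string literal here; the tools on this page differ mainly in how much further they go (datatypes, vocabulary mapping, linking).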

UMBC's T2LD

"Tabular to Linked Data"

AKSW's CSVImport

Representing multi-dimensional statistical data as RDF using the RDF Data Cube Vocabulary

(csv2rdf4lod handles n-ary relations in spreadsheets, including multi-dimensional statistics; see Converting with cell based subjects)
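
To make the idea of cell-based subjects concrete, the following Python sketch turns each non-empty cell of a small cross-tab into one `qb:Observation`. It is neither CSVImport's nor csv2rdf4lod's implementation. The `qb:` namespace and `qb:Observation` come from the RDF Data Cube Vocabulary; the `http://example.org/` namespace, the `region`, `year`, and `value` property names, and the function names are hypothetical.

```python
import csv
import io

QB = "http://purl.org/linked-data/cube#"  # RDF Data Cube namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/"  # hypothetical namespace for this sketch


def literal(value):
    """Escape a string as a plain N-Triples literal."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def crosstab_to_observations(text, row_dim="region", col_dim="year"):
    """Turn each non-empty cell of a cross-tab into one qb:Observation."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)  # first column names the rows, the rest are column keys
    triples = []
    for r, row in enumerate(reader, start=1):
        for c, (col_key, cell) in enumerate(zip(header[1:], row[1:]), start=1):
            if not cell.strip():
                continue  # an empty cell asserts nothing
            obs = f"<{EX}obs/r{r}c{c}>"  # the cell, not the row, is the subject
            triples += [
                f"{obs} <{RDF_TYPE}> <{QB}Observation> .",
                f"{obs} <{EX}{row_dim}> {literal(row[0])} .",
                f"{obs} <{EX}{col_dim}> {literal(col_key)} .",
                f"{obs} <{EX}value> {literal(cell)} .",
            ]
    return triples


if __name__ == "__main__":
    table = "region,2009,2010\nNY,10,12\nCA,,7\n"
    print("\n".join(crosstab_to_observations(table)))
```

A complete Data Cube would also attach each observation to a dataset with `qb:dataSet` and describe the dimensions and measures in a data structure definition; both are omitted here to keep the sketch short.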

Trifacta

A nicer, commercial version of OpenRefine: http://www.trifacta.com. Partnered with Tableau.

Google's Refine

OpenRefine

As of October 2nd, 2012, Google no longer actively supports Refine, which has been rebranded as OpenRefine.

http://openrefine.org/

TAO's RDBToOnto

RDBToOnto is a tool that automatically generates fine-tuned, populated ontologies (in RDFS/OWL) from relational databases.

A major feature of this tool is the ability to produce highly structured ontologies by exploiting both the database schema and structuring patterns hidden in the data (see publications for details on the RTAXON learning method, including its formal description).

Though automated to a large extent, the process can be constrained in many ways through a friendly user interface. It also provides a framework that eases the development and integration of new learning methods and database readers. A database optimization module makes it possible to enhance the input database before ontology generation.

Ermilov's wiki.publicdata.eu CSV2RDF Application

More notes and comments at Ermilov's wiki.publicdata.eu CSV2RDF Application

DataLift

Datalift brings raw structured data coming from various formats (relational databases, CSV, XML, ...) to semantic data interlinked on the Web of Data.

Anzo

Anzo (in particular Anzo for Excel) is designed for enterprises to curate large numbers of spreadsheets, map them to ontologies & to existing RDF instance data, and maintain them as changes are made to the spreadsheets or to the data in the spreadsheets. It can be used for CSV-style "tabular" spreadsheets and also for arbitrarily "human-oriented" spreadsheets. It can be used both in interactive modes (where people are opening up and interacting with spreadsheets) and also in automated batch modes.

Anzo stores the RDF data from spreadsheets in an RDF database. Anzo includes both authenticated and unauthenticated SPARQL endpoints for this data; Anzo can also directly publish the data as Linked Data. Finally, Anzo gives you several ways to export RDF data from the database.

Anzo is available in several editions:

  • Anzo Express Starter -- includes Anzo for Excel as above for limited #s of users; freely available
  • Anzo Express -- includes Anzo for Excel and Anzo on the Web, a user-friendly browser-based dashboard tool for visualization, searching, and analyzing RDF data
  • Anzo Enterprise -- includes the above in addition to tools to connect to data in relational databases, to integrate unstructured data from documents, web pages, etc., to run rules and reasoning and work flow processes, various server-side and client-side APIs, etc.

We also make Anzo available for free for academic use. (Lee)

Michel Dumontier's php-lib

Michel Dumontier's php-lib library is what Bio2RDF has been using for converting TSV, CSV files (and other file formats) to RDF [1]. It contains some aspects that are Bio2RDF specific, namely its support for prefixed URIs, but any Pull Requests on GitHub would be appreciated to generalise that. OSX has PHP installed by default as far as I know so you can use it on the command line without any other dependencies.

You can find examples of scripts using php-lib in the bio2rdf-scripts repository on GitHub [2]. A fairly simple example would be the HGNC converter, which is Tab separated, but quite similar [3].

Cheers,

Peter

[1] https://github.com/micheldumontier/php-lib
[2] https://github.com/bio2rdf/bio2rdf-scripts
[3] https://github.com/bio2rdf/bio2rdf-scripts/blob/master/hgnc/hgnc.php#L129

Christopher Gutteridge's Grinder

raw2ld

Set of tools and scripts for converting raw data (csv, tsv, $sv, and other custom formats), creating links, and managing a triple store http://www.data2semantics.org

https://github.com/Data2Semantics/raw2ld

TabLinker

https://github.com/Data2Semantics/TabLinker

RightField

RightField allows the creation of spreadsheets that have ontology terms embedded within them for data validation. -Simon Jupp

RightField (http://www.rightfield.org.uk) allows you to embed ontology term selection into spreadsheets, and to extract these selections as RDF. It is designed more for assisting in the data collection process (i.e. when users fill in a spreadsheet that has been marked up using RightField, they are automatically collecting semantically enriched data). Their paper describes the RDF extraction in more detail:

Wolstencroft, Katherine; Owen, Stuart; Goble, Carole; Nguyen, Quyen; Krebs, Olga; Muller, Wolfgang, "RightField: Semantic enrichment of Systems Biology data using spreadsheets," 2012 IEEE 8th International Conference on E-Science (e-Science), pp. 1-8, 8-12 Oct. 2012

doi: 10.1109/eScience.2012.6404412 (Katy)

Populous

  • Populous (http://populous.org.uk) is a spawn of RightField (http://rightfield.org.uk). It uses the Ontology Pre-Processing Language (OPPL) to convert spreadsheet data to OWL/RDF. It also supports validating spreadsheet content against existing ontologies.

IO Informatics’ Knowledge Explorer

  • IO Informatics’ Knowledge Explorer is a good tool. I used Google Refine with the RDF plugin and ran into problems with large datasets, but KE worked perfectly well. -Abdul Mateen Rajput
  • IO Informatics’ Knowledge Explorer, Professional Edition, also provides an automated way to import into and update a triplestore backend of your choice via monitored folders, which map and import incoming spreadsheets to RDF. You can set up multiple monitored folders with different data mappings; these run as background processes to continuously update one or multiple connected triplestores (or different graphs in a single triplestore).

The Knowledge Explorer also provides scripting within the import mapping, application of thesauri, and other mechanisms for data transformation to clean, consolidate, and harmonize data during the import.

You can find out more about this tool here: http://www.io-informatics.com/products/sentient-KE.html -Erich Gombocz

Spain I-SEM 2010 submission

eBiquity's RDF123

OWL spreadsheets

The idea behind the "spreadsheet" work in .bib is to enrich spreadsheets with an ontology that makes the semantics of the spreadsheet cells, particularly of derived/computed values, more explicit, and to use that information to provide user assistance. -Christoph

Talis' csvmapper

Tetherless World's 2009 data-gov converter

Simple Sloppy Semantic Database

S3DB stands for Simple Sloppy Semantic Database. It is a way to represent information on the Semantic Web without the rigidity of relational/XML schemas while avoiding the "spaghetti" of unconstrained RDF stores. The critical feature of S3DB is a core data model that makes an explicit distinction between the domain of discourse and its instantiation. The motivation and basic design are introduced in our publications [Nature Biotechnology - 24, 1070 - 1071 (2006)], [PLoS ONE 3(8) 2008] and [BMC Bioinformatics 11:387 (2010)]. For a shortcut to the syntax of the REST protocol used to expose S3DB's API click here. For the sprawling list of documents and media describing installation and usage see the documentation page.

https://sites.google.com/a/s3db.org/s3db/
http://www.biomedcentral.com/content/pdf/1471-2105-11-387.pdf
http://ibl.mdanderson.org/~jsalmeida/

Li Ding's lod-apps

Talend Open Studio

  • "tons of connectors to get your data from any sources"
  • "nice data cleaning and transformation components to massage your data"
  • "fuzzymatch option (using Levenshtein and metaphone) for reconciliation"
  • "job can be exported in a shell script and included in a cron job."
  • "Talend is more complex than Refine and the learning curve a bit longer"
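
For readers unfamiliar with Levenshtein-based reconciliation, the following Python sketch shows the general technique of matching a messy cell value against a list of known labels by edit distance. It is not Talend's fuzzy-match component: the function names, the case folding, and the `max_distance` threshold are assumptions for illustration, and the metaphone (phonetic) half of the quoted feature is not covered.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def reconcile(value, candidates, max_distance=2):
    """Return the closest candidate within max_distance edits, else None."""
    if not candidates:
        return None

    def distance(candidate):
        return levenshtein(value.lower(), candidate.lower())

    best = min(candidates, key=distance)
    return best if distance(best) <= max_distance else None


if __name__ == "__main__":
    known = ["Rensselaer", "Albany", "Troy"]
    print(reconcile("Rensselear", known))  # misspelled label
    print(reconcile("Boston", known))      # no close candidate
```

The threshold trades recall for precision: a larger `max_distance` reconciles more misspellings but also produces more false matches, which is why tools of this kind usually let you review the proposed matches.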

Michael Grove's ConvertToRDF

Command line version of Mindswap Convert To RDF Tool

Michael Grove's Mindswap Convert To RDF Tool

GUI version of Michael Grove's ConvertToRDF

Mindswap's Excel2RDF

R2RML

http://www.w3.org/TR/r2rml/

D2RQ

http://d2rq.org/

http://www.w3.org/2001/sw/wiki/D2RQ

Others

Other (non-converter) related work

Triple Store Evaluation

Conversion to RDF is reported in the triple store evaluation literature, which proposes queries as well. The Hexastore evaluation didn't mention how its data was converted. LibraryThing is one dataset used (LUBM?). The Rdf4x authors have a non-public dataset. These works did not describe their considerations during the conversion process. (Was some of this work from MIT?)

Bibtex to RDF tools

meta

reuse comparison table from http://www.toodledo.com/info/compare.php?

Conferences

PCI 2013 - Special Session on the Web of Data (DATAWEB) Production and deployment of Open, Linked and Big Data