There is a range of technologies that address different problems with creating, publishing and using open data. The aim of this project is to pull together information about these technologies, with their advantages and disadvantages, so that people performing these tasks can choose an appropriate technology, avoid reinventing tools that already exist, and identify tools to collaborate on.
This page is a list of open data tools. The current goal is for breadth rather than depth. Each tool should be linked to its home page, have a short description, a link to the organisation that has developed it, and a list of tags which could be extracted later.
As with Wikipedia, if something is missing add it so we all can know ;-)
Cleaning / Transformation Tools
Web Data Extraction Tools
- Apache Any23: 'Anything To Triples' (any23) is a library, a web service and a command line tool that extracts structured data from Web documents. — From: Apache - Tags: RDF, RDFa, Microdata, Microformats, CSV
- Tableau Public: Imports from spreadsheets, databases etc; stores data in the cloud — From: Tableau Software - Tags: spreadsheets, visualisation, Windows, cloud storage
- URIBurner: service delivering RDF-based structured descriptions of Web-addressable resources (documents or real-world objects) in a variety of formats through Generic HTTP URIs. The underlying technology is OpenLink Virtuoso Sponger, which takes an existing Web-accessible resource (webpage, media) and generates an RDF graph of its metadata using existing well-known ontologies as well as site-specific knowledge. URIBurner then re-presents this data either as a new HTML webpage, or directly as RDF in a variety of serializations (RDF/XML, text/n3, turtle, JSON).
- Linked SDMX: SDMX-ML to RDF transformation. See also documentation: Linked SDMX Data
- The DataTank: see under Publication Tools below.
- Import.io: freemium app that offers a user-friendly UI for extracting tabular data from Web pages.
- Silk: an open source framework for integrating heterogeneous data sources.
Enterprise API Data Extraction Tools
- Apache ManifoldCF: A framework for connecting source content repositories (like Microsoft Sharepoint and EMC Documentum) to target repositories (such as Apache Solr, OpenSearchServer or ElasticSearch). — From: Apache - Tags: API, indexing, cataloging, Enterprise
- OpenLink Virtuoso and its Sponger: Virtuoso has built-in features supporting RDB2RDF transformation (R2RML and Direct Mapping), including replication as well as fully dynamic RDF/Graph/Linked Data views of RDB data. Custom Sponger cartridges may be constructed for any relevant API.
Excel/CSV/Tabular Data Extraction Tools
- Apache Any23 — see above, Web Data Extraction Tools.
- [cDataSet](http://ramblings.mcpher.com/Home/excelquirks/json): Anything related to JSON and Excel, plus a library of REST API/Excel integrations - Tags: JSON, REST, Excel
- csv2rdf4lod automation: (aka "csv2rdf4lod") csv2rdf4lod provides a quick and easy way to produce an RDF encoding of data available in CSV format. csv2rdf4lod also functions as a custom reasoner tailored for heavy-duty data integration. Although csv2rdf4lod can handle tabular data from well-structured RDBMS dumps, its forte is in handling "messier" tabular data created manually or using less rigorous information modeling strategies, which makes it well suited to real data that evolved "in the wild". In either case, csv2rdf4lod is designed to aggregate and integrate multiple versions of multiple datasets of multiple source organizations in an incremental and backward-compatible way. Strong emphasis on provenance. - From: Tim Lebo @ TWC RPI - Tags: csv, RDF, linked data, data quality, reconciliation, transformation, enhancement, provenance, linking, workflow
- csv2xml: An XSLT for converting CSV to XML - From: The National Archives - Tags: XML, CSV, TSV
- q: Runs SQL-like statements on tabular text data, including joins and subqueries - Tags: CSV, TSV
- Google Refine (note that this will soon become OpenRefine): Cleans up, transforms, and links data in tabular form - From: Google - Tags: cleaning, transformation, tabular data, linking, reconciliation, desktop tool
- MessyTables: Python library for coping with the many variants of CSV and Excel files when opening them. It is used by OpenSpending, among other OKF projects.
- OpenLink Virtuoso Sponger: Existing Cartridges support transformation from CSV and other tabular formats, among many other targets, to RDF. More cartridges are always under development.
- RDF Refine: Google Refine extension for exporting RDF — From: DERI - Tags: RDF, linking, reconciliation, plug-in
- ScraperWiki: Collaborative routine scraping of websites and Excel files to create an API — From: ScraperWiki - Tags: HTML, CSV, Excel, API, scraping
- Tabels: Cleans up, transforms, and links data from CSV and similar formats, as well as PC-Axis, ESRI shapefile, etc. - From: CTIC - Tags: cleaning, transformation, tabular data, linking, reconciliation, online tool
- XLWrap: A spreadsheet-to-RDF wrapper, capable of transforming spreadsheets to arbitrary RDF graphs based on a mapping specification. It supports Microsoft Excel and OpenDocument spreadsheets as well as CSV/TSV files, and it can load local files or download remote files via HTTP. - From: Andreas Langegger - Tags: RDF, Excel, CSV, TSV
- Mr. Data Converter: Converts your Excel data into one of several web-friendly formats, including HTML, JSON and XML. - Tags: HTML, JSON, XML, Excel, MySQL, Ruby
- Tarql: Small command-line tool for converting CSV to RDF, with a user-defined mapping expressed in standard SPARQL. From: Richard Cyganiak - Tags: RDF, CSV, SPARQL
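
Several of the tools above (csv2rdf4lod, Tarql, XLWrap) turn tabular data into RDF. As a rough illustration of that general idea, and not of how any of those tools works internally, the following sketch converts CSV rows into N-Triples using only the Python standard library. The base IRI `http://example.org/`, the `row/` and `column/` IRI scheme, and the function names are all assumptions made for this example.

```python
import csv
import io
from urllib.parse import quote


def escape_literal(value):
    """Escape a string so it can sit inside a double-quoted N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, base_iri, key_column):
    """Return one N-Triples line per non-empty cell (hypothetical IRI scheme).

    Each row becomes a subject named after its key column, each other
    column becomes a predicate, and each cell becomes a plain literal.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row in reader:
        subject = f"<{base_iri}row/{quote(row[key_column], safe='')}>"
        for column, value in row.items():
            # Skip the key itself, overflow cells (column is None) and empty cells.
            if column is None or column == key_column or not value:
                continue
            predicate = f"<{base_iri}column/{quote(column, safe='')}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


# Made-up sample data, for illustration only.
SAMPLE = "id,name,city\n1,Ada,London\n2,Grace,\n"
triples = csv_to_ntriples(SAMPLE, "http://example.org/", "id")

if __name__ == "__main__":
    print("\n".join(triples))
```

Real converters add what this sketch leaves out: datatype detection, mapping columns to existing vocabularies, and provenance.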
- OpenStack Swift/S3: Open source cloud (IaaS) software whose APIs are similar to, and interoperable with, Amazon's. Available from multiple country-based providers that can assure provenance of data in the originating country (e.g. not subject to the Patriot Act).
- Amazon S3: Stores files for publication on the web; pricing based on amount stored and number of requests — From: Amazon - Tags: cloud, storage
- Dropbox: Stores any files, which can then be made public on the web; paid-for option for organisations — From: Dropbox - Tags: file storage, cloud
- Google Drive: Stores spreadsheets, which can be populated through forms you create, and can make them available on the web; used by Guardian Data Blog — From: Google - Tags: spreadsheet, form, cloud
- ScriptDB: Stores JSON data for use with Google Apps Script — From: Google - Tags: JSON, cloud
Note: what about key-value stores? Alternatively, see the Wikipedia article on NoSQL.
- Apache CouchDB: JSON database with replication, map/reduce — From: Apache CouchDB - Tags: json database, map/reduce, open source
- Elasticsearch: JSON database with HTTP-based searching — From: Elasticsearch - Tags: json database, search, open source
- MongoDB: JSON database with replication, sharding, map/reduce and querying — From: MongoDB - Tags: json database, map/reduce, open source
- OpenLink Virtuoso: Imports/digests JSON, and outputs same; native storage may be as RDB/SQL, RDF/Graph, JSON or other document/file, etc.
- eXist: Native XML database and App platform; supports XQuery, XSLT, JSON, HTML. As well as XML, can also store and index binary documents - Tags: xml database, xquery, xslt, json, open source
- MarkLogic: XML database, can be used for storing binaries as well; supports XQuery, XSLT; can be configured to create web applications; 'Express' license for single developers — From: MarkLogic - Tags: xml database, xquery, xslt, closed source
- OpenLink Virtuoso: XML feature tutorials
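
Two of the JSON databases listed above (CouchDB and MongoDB) mention map/reduce. For readers unfamiliar with the term, the sketch below shows the pattern in plain Python over a few JSON documents. It does not use either database's API; the function names and the sample documents (built from three tool entries on this page) are assumptions chosen for illustration.

```python
import json
from collections import defaultdict


def map_doc(doc):
    """Map step: emit one (tag, 1) pair for every tag on a document."""
    for tag in doc.get("tags", []):
        yield tag, 1


def reduce_values(key, values):
    """Reduce step: combine all values emitted for one key."""
    return sum(values)


def map_reduce(docs, map_fn, reduce_fn):
    """Group the mapped pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


# Sample documents, as they might be stored in a JSON database.
RAW = """
[
  {"name": "Tarql",   "tags": ["RDF", "CSV", "SPARQL"]},
  {"name": "csv2xml", "tags": ["XML", "CSV", "TSV"]},
  {"name": "q",       "tags": ["CSV", "TSV"]}
]
"""
tag_counts = map_reduce(json.loads(RAW), map_doc, reduce_values)

if __name__ == "__main__":
    print(tag_counts)
```

In a real database the map and reduce functions are typically run by the server, often across several nodes, rather than in a single local loop as here.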
Note: what about graph stores like neo4j?
Note that Lars Marius Garshol has put together a very nice comparison of RDF stores.
- Apache Jena
- OpenLink Virtuoso: Guide to RDF/SPARQL features
- OpenRDF Sesame
Retrieval and Loading
- ckan: For publishing, storing, and managing datasets.
- The DataTank: Convert & publish data with APIs in CSV, XML, JSON, RDF.
- EVE (Github): RESTful web API framework written in Python. Auto-generates API documentation. As of July 2013, supports XML and JSON data representations. Uses Flask and MongoDB. See also this presentation.
- iQvoc: For creating SKOS(-XL) vocabularies (thesauri, taxonomies, classification schemes etc.)
- Linked Data Pages
- LODSPeaKr: A framework for quickly and easily creating Linked Data applications and for publishing RDF data. LODSPeaKr's SPARQL-based modelling approach lets you move quickly from SPARQL queries to published applications. Check out the gallery of demo applications! - From: Alvaro Graves @ TWC RPI - Tags: SPARQL, linked data, publishing
- OpenLink Data Spaces (ODS)
- URIBurner: service delivering RDF-based structured descriptions of Web-addressable resources (documents or real-world objects) in a variety of formats through Generic HTTP URIs. The underlying technology is Virtuoso's Sponger, which takes an existing Web-accessible resource (webpage, media) and generates an RDF graph of its metadata using existing well-known ontologies as well as site-specific knowledge. URIBurner then re-presents this data either as a new HTML webpage, or directly as RDF in a variety of serializations (RDF/XML, text/n3, turtle, JSON).
- MyTardis: primarily used in the academic sector for connecting to scientific instruments and storing their data output. For example, the Australian Synchrotron stores the data produced by its nine beam lines via MyTardis, which then provides a GUI for scientists to share, license and publish their data for reference as part of their research. Can be used with almost any scientific instrument that produces digital data, including microscopy, electron microscopes, magnetic resonance imaging (fMRI, MRI), environmental sensors, marine sensors, etc.
- GeoTriples: Publishing geospatial data as Linked Open Geospatial Data
- OASIS OData
- Google Apps
Linked Data API
Analysis / Data Mining Tools
- [R](http://www.r-project.org/): focused on statistical analysis, but many packages available.
- [Weka](http://www.cs.waikato.ac.nz/ml/weka/)
See list in Data Wrangling Handbook
- ckan: Has some visualisation, particularly around geolocation data.
- TileMill/Mapbox: Interactive map builder.
- Google Visualisation Tools
- OpenLink Data Explorer (ODE)
- Pivot Viewer
- [R](http://www.r-project.org/)
- Odyssey: combining maps and storytelling
- CartoDB: maps from data
- Datawrapper: outstanding open source tool for simple charts on the web.
- Sextant: a web-based and mobile ready platform for visualizing, exploring and interacting with time-evolving linked geospatial data
- CubicWeb: a semantic web framework written in Python to load, store, query, visualise and republish/share (open) data.