
Conversion process phase: name

csv2rdf4lod-automation is licensed under the [Apache License, Version 2.0](https://github.com/timrdf/csv2rdf4lod-automation/wiki/License)

What is first

What we will cover

This page describes the "SDV" naming convention that csv2rdf4lod-automation uses to organize datasets.

Let's get to it

Consistent naming conventions make working with others' data easier. The identifiers you establish for source, dataset, and version affect the naming of the directories used to organize all of the aggregated data (see Directory Conventions), the Enhancement parameters given to the converter, the URI naming of the resulting RDF datasets, and the URI naming of instances within those datasets. So choose these identifiers with thought, care, and consideration. Keep in mind that any single URI you create could end up in someone else's hands in isolation; it is incredibly useful to humans if they can make a good guess at what it names before dereferencing it and crawling it as linked data.

[Diagram: naming with source, dataset, and version identifiers]

For all identifiers, we highly recommend that you:

  • Use lower case
  • Replace spaces and underscores with dashes
  • Avoid acronyms; try to expand them

Step 0 of 3: Establish your Base URI

The Base URI is the web domain where you plan to deploy your Linked Data. By default, every URI created by csv2rdf4lod (e.g., for datasets, entities, classes, and predicates) is formed by appending to the Base URI; in effect, the Base URI is your namespace for the data you create. At some point, "something tangible" should be created to respond to HTTP requests for the Base URI, but until you're ready to deploy you can convert data without having a server ready. The converter uses the Base URI specified by the conversion:base_uri property in the enhancement parameters, which are [created automatically](Generating enhancement parameters) if they do not already exist. The shell variable CSV2RDF4LOD_BASE_URI is used to determine the value of conversion:base_uri when the enhancement parameters are generated, so make sure that it is set in your source-me.

conversion:base_uri           "http://sparql.tw.rpi.edu/ontowiki"^^xsd:anyURI;

NOTE: do not include a slash at the end of this; we'll add it for you.
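As a minimal sketch (assuming the source-me.sh convention used elsewhere in this wiki, and reusing the example URI above):

    # In source-me.sh: set the base URI *without* a trailing slash.
    # It determines conversion:base_uri when enhancement parameters are generated.
    export CSV2RDF4LOD_BASE_URI="http://sparql.tw.rpi.edu/ontowiki"

    # Sanity check before converting:
    echo "$CSV2RDF4LOD_BASE_URI"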

Step 1 of 3: Establish an identifier for the source

Here, source indicates a person, organization, or other agent providing you the data that you want to convert. The intent here is a living or social entity and not something rote like a web service or external hard drive. If you grabbed the White House visitors list, identify your source as whitehouse-gov. If you got the data from a new acquaintance, identify them as the source using something like hotmail-com-joey. If you've got an inside scoop and someone from the White House handed you next week's visitors list on a thumb drive, identify them as the source using something like whitehouse-gov-potus. Make like an investigative reporter and mind your source. For several examples, see the list of source identifiers that LOGD has used. Keep in mind that these identifiers are scoped by your base URI, so you control their meaning.

  • Reuse the DNS name of the organization, ignoring all non-organization-identifying fragments such as "www", "www2", "ftp", "data", etc. (so www.whitehouse.gov becomes whitehouse-gov).
  • The directory holding all datasets from this source will be source/<source-identifier>/ (see Directory Conventions).
  • The URI of the source will become <base-uri>/source/<source-identifier>.
  • The source identifier will be encoded in the conversion parameter: `conversion:source_identifier`
  • The web page describing the source will be the one served when that source URI is dereferenced.

(Note that this perspective of source does not align with dcterms:source, because our dataset is not derived from the source that we are citing. Dublin Core's dcterms:publisher is closer to what we are referencing, though our source may be an intermediary that was not the original publisher -- as happens with hand-me-down data sharing: scraperwiki.com scrapes government sites and rehosts the results as CSV; impacteen.org aggregates statistics from many federal surveys that are not readily accessible; and Xian's company financial earnings state that the reports are from the government -- though one cannot be sure, and in reality the reports were submitted to the government by the individual companies.)

(see also Considerations for choosing an identifier for conversion:source_identifier)
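The recommendations above can be captured in a few lines of shell. This is only a sketch -- the sed pattern and the derived identifier are illustrative, not part of csv2rdf4lod itself:

    # Derive a source identifier from a DNS name: lowercase, dashes for
    # dots/spaces/underscores, drop fragments like "www", "www2", "ftp", "data".
    domain="www.whitehouse.gov"
    source_id=$(echo "$domain" \
      | tr '[:upper:]' '[:lower:]' \
      | sed -E -e 's/^(www[0-9]*|ftp|data)\.//' -e 's/[._ ]/-/g')
    echo "$source_id"    # -> whitehouse-gov

    # All data from this source will live under source/whitehouse-gov/,
    # and the source URI becomes:
    echo "$CSV2RDF4LOD_BASE_URI/source/$source_id"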

Step 2 of 3: Establish an identifier for the dataset

The dataset identifier will be encoded in the conversion parameter: `conversion:dataset_identifier`
  • Reuse the source organization's identifier for the dataset whenever possible.
  • If the dataset has an acronym, use the acronym expansion and follow it with the acronym (e.g. enforcement-and-compliance-history-online-echo from http://www.epa-echo.gov/echo/).
  • If none is given, construct a clear, descriptive name based on the web pages' descriptions of the dataset.

For several examples, see the list of dataset identifiers that LOGD has used. Note that most of the identifiers at the bottom are reused from data.gov's numeric convention. Also, keep in mind that these identifiers are scoped by your base URI and your source identifier, so different source organizations can name their datasets similarly without clashing with other organizations.
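Continuing the ECHO example above, a sketch of where the dataset identifier lands (the source identifier here is illustrative, derived from www.epa-echo.gov by the rules in Step 1):

    source_id="epa-echo-gov"
    dataset_id="enforcement-and-compliance-history-online-echo"

    # Directory holding all versions of this dataset:
    echo "source/$source_id/$dataset_id/"
    # Dataset URI, scoped by the base URI and the source identifier:
    echo "$CSV2RDF4LOD_BASE_URI/source/$source_id/dataset/$dataset_id"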

Step 3 of 3: Establish an identifier for the version

Very often, a dataset that you retrieved from another organization has been updated since the last time you grabbed it. For example, http://www.uniprot.org/downloads releases a new version every four weeks. So, when you tell a colleague that your analysis results showed X, they might want to know which data you analyzed. The version identifier handles this situation.

The version identifier will be encoded in the conversion parameter: `conversion:version_identifier`
  • It is highly recommended to REUSE the source organization's name for the version.
  • If the source does not name versions, consider using the Last-Modified date reported by HTTP HEAD.
  • If neither is available, use the current day's date (i.e., the date of retrieval). This is a very good default in the absence of other version information.
  • Optionally, a curator's tag could be used (e.g., we used "mashathon" during a mashathon)

When using a date, we suggest the form 2010-Dec-31, i.e. "year-mon-day" (date +%Y-%b-%d or date +%Y-%b-%d_%H_%M_%S can be used on unix). This follows the "larger to smaller" convention of the URI decomposition. Also, Dec instead of 12 increases readability and avoids confusion for less technical folks and for those used to international conventions -- THESE DATE-LOOKING STRINGS ARE NOT INTENDED FOR PARSING, ONLY AS HUMAN AIDS. (Note that the 2010-Dec-31 convention will not provide chronological order when sorted lexicographically; this is a tradeoff that the modeling described next compensates for.) If date modeling is desired, augment the conversion:VersionedDataset URI with additional RDF descriptions, using appropriate RDF vocabularies and properly formatted values such as xsd:date or xsd:dateTime.
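A sketch of the date-based defaults (the curl check is one assumed way to apply the HTTP HEAD suggestion; DATA_URL is a placeholder for whatever file you are retrieving):

    # Default: today's date (the date of retrieval), in year-mon-day form:
    version_id=$(date +%Y-%b-%d)       # e.g. 2013-Aug-30
    echo "$version_id"

    # Or inspect the Last-Modified header the server reports (if any),
    # then transcribe it into the same form. DATA_URL is a placeholder.
    curl -sI "$DATA_URL" | grep -i '^Last-Modified:'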

For several examples, see the list of version identifiers that LOGD has used. Keep in mind that these identifiers are scoped by your base URI, your source identifier, and your dataset identifier.

(Many conversion:version_identifiers look like 2010-Dec-09, and many dataset URIs contain version/2010-Dec-09. What does that date mean? Does it have to be a date? The methodology above answers both: by default the date is the date of retrieval, and it does not have to be a date at all -- prefer the source organization's own version name whenever it provides one.)

After establishing identifiers for source, dataset, and version, they can be used to construct the conversion cockpit -- the place to be when converting a dataset.
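Putting the three identifiers together, a minimal sketch of creating and entering the cockpit (the identifiers are illustrative; the layout follows Directory Conventions):

    source_id="whitehouse-gov"
    dataset_id="visitor-records"
    version_id=$(date +%Y-%b-%d)

    # The conversion cockpit is the versioned directory:
    mkdir -p "source/$source_id/$dataset_id/version/$version_id"
    cd "source/$source_id/$dataset_id/version/$version_id"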

What's next?

See also

Historical note

This page aggregates and replaces:
