This page describes the "SDV" naming convention that csv2rdf4lod-automation uses to organize datasets.
Consistent naming conventions make working with others' data easier. Establishing identifiers for source, dataset, and version affects the naming of directories used to organize all of the aggregated data (see Directory Conventions), the Enhancement parameters given to the converter, the URI naming of the resulting RDF datasets, and the URI naming of instances within the RDF datasets. Take care, therefore, when establishing these identifiers. Keep in mind that any single URI you create could end up in someone else's hands in isolation, and it is incredibly useful to humans if they can make a good guess as to what it is before dereferencing it and starting to crawl it as linked data.
For all identifiers, we highly recommend that you:
The Base URI is the web domain that you plan to deploy your Linked Data to. By default, every URI created by csv2rdf4lod (e.g., for datasets, entities, classes, and predicates) is formed by appending to the Base URI. In effect, the Base URI is your namespace for the data you create. At some point, "something tangible" should be created to respond to HTTP requests for the Base URI, but until you're ready to deploy you can convert data without having a server ready. The converter uses the Base URI specified by the `conversion:base_uri` property in the enhancement parameters, which are created automatically if they do not already exist. The shell variable `CSV2RDF4LOD_BASE_URI` determines the value of `conversion:base_uri` when the enhancement parameters are generated, so make sure that it is set in your source-me.
NOTE: do not include a slash at the end of this; we'll add it for you.
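As a minimal sketch of the setting described above, a source-me file could export the variable like this. The domain `http://example.org` and the helper `strip_trailing_slash` are illustrative assumptions, not part of csv2rdf4lod; the helper only guards against accidentally leaving a trailing slash on the value.

```shell
# Hypothetical helper (not part of csv2rdf4lod): drop one trailing slash, if any.
strip_trailing_slash() {
    printf '%s\n' "${1%/}"
}

# Assumed example domain; replace it with the domain you plan to deploy to.
CSV2RDF4LOD_BASE_URI=$(strip_trailing_slash "http://example.org/")
export CSV2RDF4LOD_BASE_URI

echo "$CSV2RDF4LOD_BASE_URI"   # prints http://example.org
```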
source indicates a person, organization, or other agent providing you the data that you want to convert. The intent here is a living or social entity and not something rote like a web service or external hard drive. If you grabbed the White House visitors list, identify your source as `whitehouse-gov`. If you got the data from a new acquaintance, identify them as the source using something like `hotmail-com-joey`. If you've got an inside scoop and someone from the White House handed you next week's visitors list on a thumb drive, identify them as the source using something like `whitehouse-gov-potus`. Make like an investigative reporter and mind your source. For several examples, see the list of source identifiers that LOGD has used. Keep in mind that these identifiers are scoped by your base URI, so you control their meaning.
Directory holding all datasets from this source will be:

URI of source will become:

The source identifier will be encoded in the conversion parameter: `conversion:source_identifier`

The web page describing the source identified will be:
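The example source identifiers on this page look like domain names with the dots replaced by hyphens. Assuming that reading, a rough sketch of deriving one might look like the following; `source_id_from_domain` is a hypothetical helper, not something csv2rdf4lod provides, and choosing a good identifier remains a judgment call.

```shell
# Hypothetical helper (not part of csv2rdf4lod): lowercase a domain name and
# replace dots with hyphens, e.g. whitehouse.gov -> whitehouse-gov.
source_id_from_domain() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | tr '.' '-'
}

source_id_from_domain "WhiteHouse.gov"               # prints whitehouse-gov

# A person within an organization is appended by hand.
echo "$(source_id_from_domain "hotmail.com")-joey"   # prints hotmail-com-joey
```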
(Note that this perspective of `source` does not align with `dcterms:source`, because our dataset is not derived from the source that we are citing. Dublin Core's `dcterms:publisher` is closer to what we are referencing, though our `source` may be an intermediary that was not the original publisher, as is the case in hand-me-down data sharing such as scraperwiki.com (which scrapes gov sites and rehosts as CSV), impacteen.org (which aggregates statistics from many federal surveys that are not readily accessible), and Xian's company financial earnings (which states that the reports are from the gov, but one cannot be sure, and in reality the reports were submitted to the government by the individual companies).)
The dataset identifier will be encoded in the conversion parameter: `conversion:dataset_identifier`
For several examples, see the list of dataset identifiers that LOGD has used. Note that most of the identifiers at the bottom are reused from data.gov's numeric convention. Also, keep in mind that these identifiers are scoped by your base URI and your source identifier, so different source organizations can name their datasets similarly without clashing with other organizations.
Very often, datasets that you retrieved from another organization have been updated since the last time you grabbed them. For example, http://www.uniprot.org/downloads releases every four weeks. So, when you tell a colleague that your analysis results showed X, they might want to know which data you analyzed. The version identifier handles this situation.
The version identifier will be encoded in the conversion parameter: `conversion:version_identifier`
When using a date, we suggest the "year-mon-day" form, e.g. `2010-Dec-31` (`date +%Y-%b-%d` or `date +%Y-%b-%d_%H_%M_%S` can be used on unix). This follows the "larger to smaller" convention of the URI decomposition. Also, `Dec` instead of `12` increases readability and avoids confusion for less technical folks and for those used to international conventions. THESE DATE-LOOKING STRINGS ARE NOT INTENDED FOR PARSING, ONLY AS HUMAN AIDS. (Note that the `2010-Dec-31` convention will not provide chronological order when sorted lexicographically. This tradeoff should be overcome by the following point.) If date modeling is desired, augment the `conversion:VersionedDataset` URI with additional RDF descriptions, using appropriate RDF vocabularies and properly formatted values such as
For several examples, see the list of version identifiers that LOGD has used. Keep in mind that these identifiers are scoped by your base URI, your source identifier, and your dataset identifier.
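The date-based form suggested above can be sketched as follows. `version_id_today` is a hypothetical helper name, and setting `LC_ALL=C` is an added assumption: `%b` prints the locale's abbreviated month name, so pinning the locale keeps the identifiers in English (`Jan`, `Feb`, ...) on any machine.

```shell
# Hypothetical helper: build a date-based version identifier for today.
# LC_ALL=C (an assumption, not from the original page) forces English
# month abbreviations regardless of the user's locale.
version_id_today() {
    LC_ALL=C date +%Y-%b-%d
}

version_id_today

# Month names do not sort chronologically: Dec sorts before Feb and Jan.
printf '%s\n' 2010-Jan-05 2010-Dec-31 2010-Feb-01 | LC_ALL=C sort
```

The `sort` call lists `2010-Dec-31` first, then `2010-Feb-01`, then `2010-Jan-05`, which is why these strings should be treated as human aids and not as sortable or parseable dates.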
There are a lot of `conversion:version_identifier`s that look like `2010-Dec-09` and dataset URIs that have `version/2010-Dec-09`. What does that date mean? Does it have to be a date? What methodology should a curator use to name the Version?
After you establish identifiers for source, dataset, and version, you can use them to construct the conversion cockpit, the place to be when converting a dataset.
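To illustrate how the three identifiers nest under the Base URI, here is a rough sketch. The `source/.../dataset/.../version/...` URI layout and the `source/.../.../version/...` directory layout are assumptions made for illustration (only the `version/2010-Dec-09` segment is mentioned on this page); consult Directory Conventions for the authoritative patterns. The function names, the base URI, and the dataset identifier `visitor-records` are hypothetical.

```shell
# Hypothetical helpers; the path layouts below are assumptions, not confirmed
# by this page.
sdv_dataset_uri() {   # args: base_uri source dataset version
    printf '%s/source/%s/dataset/%s/version/%s\n' "${1%/}" "$2" "$3" "$4"
}

sdv_cockpit_dir() {   # args: source dataset version
    printf 'source/%s/%s/version/%s\n' "$1" "$2" "$3"
}

sdv_dataset_uri "http://example.org" whitehouse-gov visitor-records 2010-Dec-09
sdv_cockpit_dir whitehouse-gov visitor-records 2010-Dec-09
```

Whatever the exact layout, each identifier is scoped by the ones to its left, so two sources can reuse the same dataset identifier without clashing.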
This page aggregates and replaces: