A sane format for raw data files
Raw data is incredibly easy to format but can be dauntingly difficult to interpret everyone's formatting whims. A few simple standards can go a long way. The structure of the raw data isn't enough. Besides, the CSV spec, though relatively simple, is still quite often misinterpreted or even entirely ignored.
When processing a raw file, the schema of the file is almost always needed. Having to read documentation, all too often saved as a PDF, is nonsensical when virtually all the information about the contents of the file can easily be encoded into the file itself by the creator of the content.
- No raw file should ever be created that doesn't contain a header row.
- All raw files should be TSV.
- All rows should be single line.
\is encoded as
- All non-printable characters are hex-encoded as
- The entire contents of a field is the value of the field. That is, if there are quotes in the field, the quotes are part of the value. There is no such thing as a quoted value.
Column headings are specified like URL query parameters.
where parameters can be (not always requiring values)
pkprimary key (is requied by default)
uk[=name[,order]]optionally named unique key (required/not-null by default)
[type][!][=length|precision|format]optionally lengthed and required data type
!after the type (before
=) means required (not null)
=length[!]text is the default type
!after length means fixed length
null[=re]the regular expression that represents nulls. defaults to empty string
Consider this space-delimited header example:
a b c d e f g h i j k
It's great to at least provide column names but it doesn't help at all in expectations for processing the data. With a little bit of context, files can be completely automated and verified:
a?pk&2 b?pk&i c?i&uk d?t&uk e?t=3&uk=A f?i&uk=A,1 g?i&uk=A h?12?null=. i?f&n=[-.]|^$ j?i! k?d
- The primary key fields are ordered left to right so as not to require an order specifier.
- The uniqueness flag's identifier is the trailing flag and can be any single character for grouping unique keys.
- All uniqueness flags without an identifier are grouped into the same unique key.
- Uniqueness columns are ordered left to right.
How to interpret each header hint:
a?pk&2- The first primary key field. Is text, by default, and must always be two characters.
b?pk&i- The second primary key field. An integer.
c?i- The first field of the "default" unique key. An integer.
d?t- The second field of the "default" unqique key. Text.
e?t=3&uk=A- Both the third field of the "default" unique key and the first field of unique key "A". Text with three character maximum length.
f?i&uk=A- The second field of unique key "A". An integer.
g?i&uk=A- The third field of unique key "A". An integer.
h?12?null=.- A text column with a maximum of twelve characters.
.represents that the value is null. This implies that an empty field is not null.
i?f&n=[-.]|^$- A float column where
., and the empty field represent null values. Note the significance of this being a floating point field possibly containing values that are not floats which would derail automated data type attempts.
j?i!- A required integer field.
k?d- A date.
The above is equivalent to (assuming a somewhat generic SQL database):
create table ... ( a char(2) not null, b integer not null, c integer not null, d text not null, e varchar(3) not null, f integer not null, g integer not null, h varchar(12), i float, j integer not null, k date, primary key (a, b) ); create unique index ... (c,d); create unique index A (f,e,g);