Package dsv is a Go library for working with delimiter-separated values (DSV).
NOTE: This package has been deprecated. See https://github.com/shuLhan/share/lib/dsv for latest implementation.
DSV is a free-style form of the CSV text format, where each row is separated by a newline and each column can be separated by any string, not just a comma.
Let's process this input file, input.dat,
Mon Dt HH MM SS Process
Nov 29 23:14:36 process-1
Nov 29 23:14:37 process-2
Nov 29 23:14:38 process-3
and generate an output file, output.dat, formatted like this,
"process_1","29-Nov"
"process_2","29-Nov"
"process_3","29-Nov"
How do we do it?
First, create a metadata file for the input and output, and name it config.dsv,
{
    "Input"         :"input.dat"
,   "Skip"          :1
,   "InputMetadata" :
    [{
        "Name"      :"month"
    ,   "Separator" :" "
    },{
        "Name"      :"date"
    ,   "Separator" :" "
    ,   "Type"      :"integer"
    },{
        "Name"      :"hour"
    ,   "Separator" :":"
    ,   "Type"      :"integer"
    },{
        "Name"      :"minute"
    ,   "Separator" :":"
    ,   "Type"      :"integer"
    },{
        "Name"      :"second"
    ,   "Separator" :" "
    ,   "Type"      :"integer"
    },{
        "Name"      :"process_name"
    ,   "Separator" :"-"
    },{
        "Name"      :"process_id"
    }]
,   "Output"        :"output.dat"
,   "OutputMetadata":
    [{
        "Name"      :"process_name"
    ,   "LeftQuote" :"\""
    ,   "Separator" :"_"
    },{
        "Name"      :"process_id"
    ,   "RightQuote":"\""
    ,   "Separator" :","
    },{
        "Name"      :"date"
    ,   "LeftQuote" :"\""
    ,   "Separator" :"-"
    },{
        "Name"      :"month"
    ,   "RightQuote":"\""
    }]
}
The metadata uses the JSON format. For more information see metadata.go and reader.go.
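To show how a file shaped like config.dsv maps onto Go values, here is a minimal sketch that decodes a shortened configuration with the standard encoding/json package. The ColumnMeta and Config types and the parseConfig helper are hypothetical stand-ins written for this example; they are not the package's own types, which live in metadata.go and reader.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// ColumnMeta is a hypothetical stand-in for one entry of InputMetadata or
// OutputMetadata. It is not the package's own type.
type ColumnMeta struct {
	Name       string
	Type       string
	Separator  string
	LeftQuote  string
	RightQuote string
}

// Config is a hypothetical stand-in mirroring the top-level keys of config.dsv.
type Config struct {
	Input          string
	Skip           int
	InputMetadata  []ColumnMeta
	Output         string
	OutputMetadata []ColumnMeta
}

// parseConfig decodes a JSON configuration text into a Config value.
func parseConfig(text string) (Config, error) {
	var cfg Config
	err := json.Unmarshal([]byte(text), &cfg)
	return cfg, err
}

// sample is a shortened version of the configuration shown above.
const sample = `{
	"Input": "input.dat",
	"Skip": 1,
	"InputMetadata": [
		{"Name": "month", "Separator": " "},
		{"Name": "date", "Separator": " ", "Type": "integer"}
	],
	"Output": "output.dat",
	"OutputMetadata": [
		{"Name": "date", "LeftQuote": "\"", "Separator": "-"},
		{"Name": "month", "RightQuote": "\""}
	]
}`

func main() {
	cfg, err := parseConfig(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	for _, md := range cfg.InputMetadata {
		fmt.Printf("input column %q, separator %q, type %q\n",
			md.Name, md.Separator, md.Type)
	}
}
```

Keys that are absent from the JSON, such as Type on the month column, simply keep their Go zero value, which is one way optional settings with defaults can be modelled.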
Second, we create a reader to read the input file.
dsvReader, e := dsv.NewReader("config.dsv", nil)
if nil != e {
	t.Fatal(e)
}
Third, we create a writer to write our output data,
dsvWriter, e := dsv.NewWriter("config.dsv")
if nil != e {
	t.Error(e)
}
Finally, we process them: read the input records and pass them to the writer.
for {
	n, e := dsv.Read(dsvReader)
	if n > 0 {
		dsvWriter.Write(dsvReader)
	} else if e == io.EOF {
		// EOF, no more records.
		break
	}
}
// Make sure all open descriptors are closed.
_ = dsvReader.Close()
Easy enough? We can combine the reader and writer using dsv.New(), which will create both the reader and the writer,
rw, e := dsv.New("config.dsv", nil)
if nil != e {
	t.Error(e)
}
// Do the usual processing as in the last step.
That's it!
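To make the transformation itself concrete without depending on the package, here is a stdlib-only sketch of what the configuration above describes: cut each input line with one separator per column, then reassemble selected columns with new separators and quotes. The splitLine and convert helpers are hypothetical names used only for this illustration and say nothing about how the library is implemented.

```go
package main

import (
	"fmt"
	"strings"
)

// splitLine cuts line into records, consuming one separator per column in
// order. The last column takes whatever remains of the line.
func splitLine(line string, seps []string) ([]string, error) {
	records := make([]string, 0, len(seps)+1)
	for _, sep := range seps {
		idx := strings.Index(line, sep)
		if idx < 0 {
			return nil, fmt.Errorf("separator %q not found in %q", sep, line)
		}
		records = append(records, line[:idx])
		line = line[idx+len(sep):]
	}
	return append(records, line), nil
}

// convert turns one line of input.dat into one line of output.dat.
func convert(line string) (string, error) {
	// Separators that follow month, date, hour, minute, second, and
	// process_name, as listed in InputMetadata.
	recs, err := splitLine(line, []string{" ", " ", ":", ":", " ", "-"})
	if err != nil {
		return "", err
	}
	month, date, name, id := recs[0], recs[1], recs[5], recs[6]
	// Quotes and separators as listed in OutputMetadata.
	return `"` + name + "_" + id + `","` + date + "-" + month + `"`, nil
}

func main() {
	for _, line := range []string{
		"Nov 29 23:14:36 process-1",
		"Nov 29 23:14:37 process-2",
	} {
		out, err := convert(line)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println(out)
	}
}
```

The sketch hard-codes the column order for brevity; driving both steps from metadata instead is what lets the same code handle other formats.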
Here is some terminology that we used in developing this library, which may help the reader understand the configuration and API.
- Dataset: the content of a file
- Record: a single cell in a row or column, or the smallest building block of a dataset
- Row: a horizontal representation of records in a dataset
- Column: a vertical representation of records in a dataset
       COL-0  COL-1  ... COL-x
ROW-0: record record ... record
ROW-1: record record ... record
...
ROW-y: record record ... record
We chose JSON for configuration because,
- No additional source to test.
- Easy to extend. Users can embed the current metadata, add additional configuration, and create another reader to work with it.
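The "easy to extend" point can be illustrated with Go struct embedding: encoding/json promotes the exported fields of an embedded struct, so a single JSON document can carry both the base keys and extra ones. BaseConfig, MyConfig, Threshold, and loadMyConfig below are hypothetical names invented for this sketch; they are not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BaseConfig is a hypothetical stand-in for an existing input configuration.
type BaseConfig struct {
	Input string
	Skip  int
}

// MyConfig embeds BaseConfig and adds one application-specific key.
type MyConfig struct {
	BaseConfig
	Threshold float64 // hypothetical extra option
}

// loadMyConfig decodes the base keys and the extra key from one JSON text.
func loadMyConfig(text string) (MyConfig, error) {
	var cfg MyConfig
	err := json.Unmarshal([]byte(text), &cfg)
	return cfg, err
}

func main() {
	cfg, err := loadMyConfig(`{"Input": "input.dat", "Skip": 1, "Threshold": 0.5}`)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	// Fields of the embedded struct are reachable directly on cfg.
	fmt.Println(cfg.Input, cfg.Skip, cfg.Threshold)
}
```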
Metadata contains information about each column when reading the input file and writing to the output file,
- Name: mandatory, the name of the column.
- Type: optional, the type of record when reading the input file. Valid values are "integer", "real", or "string" (default).
- Separator: optional, defaults to "\n". Separator is a string that separates the current record from the next record.
- LeftQuote: optional, default is empty "". LeftQuote is a string that starts at the beginning of a record.
- RightQuote: optional, default is empty "". RightQuote is a string at the end of a record.
- Skip: optional, boolean, default is false. If true, the column will be ignored when reading the input file; otherwise it will be saved in the dataset.
- ValueSpace: optional, slice of strings, default is empty. This contains the string representation of all possible values in the column.
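As an illustration of what the Type field implies, the following sketch converts raw record text according to the three documented type names using the standard strconv package. The parseRecord function and its choice of int64 and float64 as target types are assumptions made for this example, not the package's actual conversion code.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseRecord converts raw text according to a metadata Type value.
// An empty type is treated as "string", matching the documented default.
func parseRecord(raw, typ string) (interface{}, error) {
	switch typ {
	case "integer":
		v, err := strconv.ParseInt(raw, 10, 64)
		if err != nil {
			return nil, err
		}
		return v, nil
	case "real":
		v, err := strconv.ParseFloat(raw, 64)
		if err != nil {
			return nil, err
		}
		return v, nil
	case "", "string":
		return raw, nil
	}
	return nil, fmt.Errorf("unknown type %q", typ)
}

func main() {
	inputs := []struct{ raw, typ string }{
		{"29", "integer"},
		{"1.5", "real"},
		{"Nov", ""},
		{"Nov", "integer"}, // does not match its type
	}
	for _, in := range inputs {
		v, err := parseRecord(in.raw, in.typ)
		if err != nil {
			fmt.Printf("%q as %q: rejected (%v)\n", in.raw, in.typ, err)
			continue
		}
		fmt.Printf("%q as %q: %v (%T)\n", in.raw, in.typ, v, v)
	}
}
```

A record that fails conversion is the kind of data that does not match its metadata and would need to be set aside rather than stored.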
Input configuration contains information about the input file.
- Input: mandatory, the name of the input file; it can use a relative or absolute path. If no path is given, the input file is assumed to be in the same directory as the configuration file.
- InputMetadata: mandatory, list of metadata.
- Skip: optional, number, default 0. Skip defines the number of lines that will be skipped when the input file is first opened.
- TrimSpace: optional, boolean, default is true. If true, the white space at the beginning and end of each input line is removed before parsing; otherwise the line is left unmodified.
- Rejected: optional, defaults to rejected.dat. Rejected is the file where data that does not match the metadata will be saved. One can inspect the rejected file, then fix it for re-processing or ignore it.
- MaxRows: optional, defaults to 256. Maximum number of rows for one read operation that will be saved in memory. If it is negative, e.g. -1, all data in the input file will be processed.
- DatasetMode: optional, defaults to "rows". Mode of the dataset in memory. Valid values are "rows", "columns", or "matrix". Matrix mode is a combination of rows and columns; it gives more flexibility when processing the dataset but requires additional memory.
For example, given input data file,
col1,col2,col3
a,b,c
1,2,3
"rows" mode is where each line is saved in its own slice, resulting in Rows:
Rows[0]: [a b c]
Rows[1]: [1 2 3]
"columns" mode is where each line is saved by columns, resulting in Columns:
Columns[0]: {col1 0 0 [] [a 1]}
Columns[1]: {col2 0 0 [] [b 2]}
Columns[2]: {col3 0 0 [] [c 3]}
Unlike rows mode, each column contains metadata including the column name, type, flag, and value space (all possible values that the column may contain).
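The relationship between the two layouts can be sketched as a simple transpose over plain string slices. The toColumns helper is hypothetical and leaves out the per-column metadata mentioned above; it only shows how the same records are regrouped.

```go
package main

import "fmt"

// toColumns regroups row-major records into column-major records.
// It assumes every row has the same number of records as the first row.
func toColumns(rows [][]string) [][]string {
	if len(rows) == 0 {
		return nil
	}
	cols := make([][]string, len(rows[0]))
	for _, row := range rows {
		for i, rec := range row {
			cols[i] = append(cols[i], rec)
		}
	}
	return cols
}

func main() {
	rows := [][]string{
		{"a", "b", "c"},
		{"1", "2", "3"},
	}
	for i, col := range toColumns(rows) {
		fmt.Printf("Columns[%d]: %v\n", i, col)
	}
}
```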
"matrix" mode is where each record saved both in row and column.
Output configuration contains information about the output file when writing the dataset.
- Output: mandatory, the name of the output file; it can use a relative or absolute path. If no path is given, the output file is assumed to be in the same directory as the configuration file.
- OutputMetadata: mandatory, list of metadata.
After opening the input file, we can process the dataset based on rows/columns mode using a simple for loop. Example,
// Save the dataset object for later use.
dataset := dsvReader.GetDataset().(tabula.DatasetInterface)
for {
	n, e := dsv.Read(dsvReader)
	if n > 0 {
		// Process each row ...
		for x, row := range dataset.GetDataAsRows() {
			for y, record := range row.Records {
				// process each record in row
			}
		}
		// Or, process each column
		for x, column := range dataset.GetDataAsColumns() {
			for y, record := range column.Records {
				// process each record in column
			}
		}
		// Write the dataset to file after processing
		dsvWriter.Write(dsvReader)
	}
	if e == io.EOF {
		break
	}
	if e != nil {
		// handle error
	}
}
The default dataset used by Reader is tabula.Dataset.
You can extend and implement DatasetInterface and use it in the reader object, either by
- passing it as the second parameter of NewReader, for example,
    myset := MySet{ ... }
    reader, e := dsv.NewReader("config.dsv", &myset)
- or by calling reader.Init after creating a new Reader,
    myset := MySet{ ... }
    reader := dsv.Reader{ ... }
    reader.Init("config.dsv", &myset)
Since we use the tabula package to manage data, any feature of that package can be used in our dataset. For more information see the tabula package.
- New line is \n for each row.
- Reader and Writer operate in ASCII (8 bit or char type). UTF-8 is not supported yet, since we cannot test it. Patches for supporting UTF-8 (or rune type) are welcome.
- About escaped characters in the content of data.
Since we said that we handle a free-style form of CSV, what we meant was that the left-quote, right-quote, and separator can be strings. They are not limited to a single character such as a single quote or double quote, but can literally be one or more characters without spaces. Any escaped character will be read as is (along with '\') unless it is followed by a right-quote or separator. For example, "test\'" will be read as test\', but "test\"" will be read as test", since the right-quote is matched with the escaped token.
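The escape rule can be sketched as a small byte-oriented scanner. The readQuoted function is a hypothetical illustration of the rule as described here, covering only the right-quote case and assuming non-empty quote strings; it is not the package's parser.

```go
package main

import (
	"fmt"
	"strings"
)

// readQuoted extracts the content between leftQuote and rightQuote at the
// start of s. A backslash is dropped only when it directly precedes
// rightQuote; any other backslash is kept as is. ok is false when the
// opening or closing quote is missing.
func readQuoted(s, leftQuote, rightQuote string) (content string, ok bool) {
	if !strings.HasPrefix(s, leftQuote) {
		return "", false
	}
	s = s[len(leftQuote):]
	var b strings.Builder
	for len(s) > 0 {
		switch {
		case s[0] == '\\' && strings.HasPrefix(s[1:], rightQuote):
			// Escaped right-quote: keep the quote, drop the backslash.
			b.WriteString(rightQuote)
			s = s[1+len(rightQuote):]
		case strings.HasPrefix(s, rightQuote):
			// Unescaped right-quote: the record ends here.
			return b.String(), true
		default:
			b.WriteByte(s[0])
			s = s[1:]
		}
	}
	return "", false // no closing right-quote found
}

func main() {
	for _, in := range []string{`"test\'"`, `"test\""`} {
		out, ok := readQuoted(in, `"`, `"`)
		fmt.Printf("%s -> %s (ok=%v)\n", in, out, ok)
	}
}
```

Because the quotes are compared with strings.HasPrefix, the same function works for multi-character quotes such as << and >>.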