DelimitedLineTokenizer always trims the input data [BATCH-1696] #1890
The DelimitedLineTokenizer seems to trim all tokens by default. There is no way to get untrimmed raw data without implementing a custom tokenizer.
I guess this bug arose while fixing JIRA issue BATCH-285.
Referenced from: commits 57b0cb7
Dave Syer commented
The way to get non-trimmed raw data is to quote it. That's pretty normal if you get the file from a spreadsheet for instance.
Moved from Bug to New Feature (since we'd have to keep the old behaviour anyway if we added an option not to trim unquoted strings).
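The quoting workaround described above can be illustrated with a small plain-Java sketch. This is not Spring Batch's implementation: the class `QuotedTrimSketch` and its methods are hypothetical, and the sketch only assumes the behaviour stated in this thread (unquoted tokens are trimmed, quoted tokens keep their whitespace). Escaped quote characters are not handled.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotedTrimSketch {

    // Hypothetical stand-in for the behaviour discussed in this issue:
    // split on the delimiter, ignoring delimiters inside quotes.
    static List<String> tokenize(String line, char delimiter, char quote) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i <= line.length(); i++) {
            boolean atEnd = i == line.length();
            // Treat end of line as a final delimiter so the last token is flushed.
            char c = atEnd ? delimiter : line.charAt(i);
            if (!atEnd && c == quote) {
                inQuotes = !inQuotes;
                current.append(c);
            } else if (c == delimiter && (!inQuotes || atEnd)) {
                tokens.add(clean(current.toString(), quote));
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        return tokens;
    }

    // Trim the token; if it is wrapped in quotes, strip the quotes and
    // keep whatever whitespace was inside them.
    static String clean(String token, char quote) {
        String trimmed = token.trim();
        int last = trimmed.length() - 1;
        if (trimmed.length() >= 2 && trimmed.charAt(0) == quote && trimmed.charAt(last) == quote) {
            return trimmed.substring(1, last);
        }
        return trimmed;
    }

    public static void main(String[] args) {
        // The first field is unquoted, the second is quoted, the third has no padding.
        for (String token : tokenize(" a ,\" b \",c", ',', '"')) {
            System.out.println("[" + token + "]");
        }
    }
}
```

In this sketch, the padding around the unquoted first field is lost, while the quoted second field keeps its leading and trailing spaces. That is the only way to preserve whitespace unless an option to disable trimming is added.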
Arun Yogesh commented
I understand why you would want to put it as a feature request, but please do let me explain my line of reasoning.
First, we don't always have control over the input files, especially in enterprise applications where a file may come from independent source systems. I therefore expected the tokenizer to separate out the tokens as is, without modifying the data in any way (which is what one expects from a tokenizer, is it not?).
Second, the mapper already has the logic to read either trimmed or raw data: FieldSet.readString(0), for instance, returns trimmed data, while readRawString() returns the actual value. When the tokenizer itself trims the input, this functionality breaks, because the input to the FieldSet is already trimmed.
These were my thoughts when I noticed this behavior.
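The second point can be sketched in plain Java. Everything here is hypothetical: `SketchFieldSet` is a minimal stand-in that only mimics the trimmed and raw reads described above, not the real FieldSet, and `split` is a naive comma splitter with a flag that simulates a tokenizer that trims.

```java
public class RawVersusTrimmedSketch {

    // Hypothetical, minimal stand-in for a field set with a trimmed read
    // and a raw read, as described in the comment above.
    static final class SketchFieldSet {
        private final String[] tokens;

        SketchFieldSet(String... tokens) {
            this.tokens = tokens.clone();
        }

        // Returns the value with surrounding whitespace removed.
        String readString(int index) {
            return tokens[index].trim();
        }

        // Returns the value exactly as the tokenizer produced it.
        String readRawString(int index) {
            return tokens[index];
        }
    }

    // Naive comma splitter; trimTokens simulates a tokenizer that trims.
    static String[] split(String line, boolean trimTokens) {
        String[] parts = line.split(",", -1);
        if (trimTokens) {
            for (int i = 0; i < parts.length; i++) {
                parts[i] = parts[i].trim();
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        String line = "  x  ,y";
        SketchFieldSet untouched = new SketchFieldSet(split(line, false));
        SketchFieldSet preTrimmed = new SketchFieldSet(split(line, true));

        // With an untouched token, the two reads differ.
        System.out.println("[" + untouched.readString(0) + "] vs [" + untouched.readRawString(0) + "]");
        // With a pre-trimmed token, the raw read has nothing left to preserve.
        System.out.println("[" + preTrimmed.readString(0) + "] vs [" + preTrimmed.readRawString(0) + "]");
    }
}
```

Once the tokenizer has trimmed a token, the raw read and the trimmed read return the same value, so the distinction between them is lost for unquoted fields.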