Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LittleEndianInputStream EOF test is incorrect #1614

Open
cmnbroad opened this issue Jun 14, 2022 · 9 comments
Open

LittleEndianInputStream EOF test is incorrect #1614

cmnbroad opened this issue Jun 14, 2022 · 9 comments
Labels

Comments

@cmnbroad
Copy link
Collaborator

As reported by @jrobinso: The test for EOF in LittleEndianInputStream.readString is wrong because in.read() is cast to a byte. The contract for read() returns an "int", and the signal for EOF is the return value is -1. However this is no longer a valid test after casting to a byte. The correct thing to do here, I think, is type "b" as an int, the cast to byte only when written to "bis".

    byte b;
    while ((b = (byte) in.read()) != 0) {
        if(b < 0) {
            throw new EOFException();
        }
        bis.write(b);
@lbergelson
Copy link
Member

This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values.

Currently EOFException seems to be triggered correctly when the end of file is reached prematurely, but it's also triggered incorrectly on some non-ASCII characters which truncate to negative byte values.

It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly. How to fix it depends on what file encoding we're intending to support which means digging into what this thing actually is used for.

@jrobinso
Copy link
Contributor

jrobinso commented Oct 7, 2022

@lbergelson File encoding should be irrelevant, this is a raw byte stream. Regardless of other issues casting in.read() to a byte before that test is incorrect, I think. A positive int can become a negative byte.

@jrobinso
Copy link
Contributor

jrobinso commented Oct 7, 2022

InputStream.read() is supposed to return an integer between 0 and 255, or -1 if EOF, there is no encoding involved. So 1/2 of the values it can legally return will be negative when cast to a byte.

@jrobinso
Copy link
Contributor

jrobinso commented Oct 7, 2022

It's a weird convention, but its been like that since the beginning. I suppose the contract is for an "int" specifically so they can have a return value (-1) which indicates EOF. If the contract was for a byte there would be no available values to indicate EOF.

@lbergelson
Copy link
Member

@jrobinso I agree that it's currently wrong and the fix you proposed is an improvement. What I meant to say is that we have to decide what the file encoding should be when we do the conversion to String because we will get incorrect/inconsistent answers if there's a mismatch and the file contains non-ascii values even once your fix goes in.

@jrobinso
Copy link
Contributor

jrobinso commented Oct 7, 2022

@lbergelson I still don't see what encoding has to do with this class, which is a stream reader for raw bytes. You can decode these bytes anyway you please, but that is downstream. In many cases the bytes will not be decoded to strings at all, but again what is done with the bytes is no concern here.

@lbergelson
Copy link
Member

@jrobinso It's relevant because the method in question returns a string as it's result, so it has to make a decision about what the bytes mean.

@jrobinso
Copy link
Contributor

jrobinso commented Oct 8, 2022

@lbergelson OK, yes, sorry, I was focused on the test in question, the part below. It is most definitely a bug to cast the result from in.read() to a byte before testing for b<0. An EOF exception should not be thrown unless -1 is returned as an int. If there is a bug decoding the string later that would be a different problem. This method probably doesn't belong on LittleEndianInputStream, but since it's there maybe it should take the encoding as an argument, defaulting to utf-8. In practice this method is used to read genomic file formats, many which specify ascii text. If some other character encoding is used we might get a garbled string, or some other error, but we shouldn't see an EOFException.

    byte b;
    while ((b = (byte) in.read()) != 0) {
        if(b < 0) {
            throw new EOFException();
        }
        bis.write(b);

@jrobinso
Copy link
Contributor

jrobinso commented Oct 8, 2022

Here's the relevant text for GFF format, which I think is pretty typical among genomic file formats. The term "plain text" might be ambiguous, but I think is understand to mean ascii

" GFF3 files are 9-column, tab-delimited plain text files. Literal tabs, newline, carriage return, percent sign and control characters must be encoded using RFC 3986 Percent-Encoding "

The SAM specification is more specific

"Unless explicitly specified elsewhere, all fields are encoded using 7-bit US-ASCII 1".

WRT "specified elsewhere", a number of fields allow UTF-8.

@lbergelson lbergelson added the bug label Nov 18, 2022
lbergelson added a commit that referenced this issue Dec 2, 2022
This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values.
EOF is triggered correctly, but it's also triggered on various potential characters which become negative values when they are truncated to byte.
It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly.
lbergelson added a commit that referenced this issue Dec 2, 2022
This is more complicated than originally described because the file encoding isn't really described and java bytes include negative values.
EOF is triggered correctly, but it's also triggered on various potential characters which become negative values when they are truncated to byte.
It's unclear if this is meant to only read ASCII, ISO 8859-1, or UTF-8 but nothing outside of the ascii space works correctly.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants