Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Client line endings are overwritten when starting a session #707

Open
tobous opened this issue Sep 25, 2019 · 27 comments
Open

Client line endings are overwritten when starting a session #707

tobous opened this issue Sep 25, 2019 · 27 comments
Labels
Area: IntelliJ Issue affecting Saros for IntelliJ (Saros/I) Aspect: User Experience Issue that describes a problem with the usability/user experience of Saros Prio: Low State: Open Confirmed issue not currently being worked on Type: Bug Issue that describes an unintended behavior of Saros

Comments

@tobous
Copy link
Member

tobous commented Sep 25, 2019

Saros/I overwrites the line endings of all client files with the ones used by the host. I.e. if the host uses Unix (LF) and the client Windows (CRLF), all files will be created as (LF) on the client side when creating the new module. Furthermore, when using an existing module, CRLF files will be detected as changed and will be overwritten by their LF counterpart from the host as part of the negotiation.

This is due to IntelliJFileImpl#getContents() using the VirtualFile content, meaning the content on disk.

Solution approach

An alternative to using the VirtualFile content could be using the document representing the file. This would allow us to access the content independently of the line endings.

@tobous tobous added Type: Bug Issue that describes an unintended behavior of Saros Aspect: User Experience Issue that describes a problem with the usability/user experience of Saros Area: IntelliJ Issue affecting Saros for IntelliJ (Saros/I) Prio: Medium State: Open Confirmed issue not currently being worked on labels Sep 25, 2019
@tobous tobous added this to the Saros/I Release 0.3.0 milestone Sep 25, 2019
@tobous tobous changed the title Line endings are overwritten when starting a session Client line endings are overwritten when starting a session Sep 25, 2019
@srossbach
Copy link
Contributor

I do not think this is a bug but a configuration issue. On Eclipse we transmit the meta data files which include the line separator for the project and specific files. I do not see a reason to use Windows Line endings on Windows systems when it is not requested.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

I do not see a reason to use Windows Line endings on Windows systems when it is not requested.

The issue is that the local configuration is completely ignored during the negotiation.

At any other point during the session, this is not an issue as the Document content is used for transmitting content and creating checksums for the consistency watchdog. The document content is what is internally managed by IntelliJ. Its consistent across all platforms as it always uses '\n' line endings.

But, during the negotiation (to create the FileList, checksum hashes and archive), IFile.getContent() is used. This uses the VirtualFile to just output a stream of the file contents on disk, which does contain operating system specific information (like the line endings).

As a result, the client line endings will always be overwritten with the ones on the host side during the negotiation. Afterwards, this is no longer an issue and both client and host can change line endings freely without influencing each other (since the document content is used).

@srossbach
Copy link
Contributor

So why do you want to change the IntelliJ implementation of getContent (which will result in transmitting pure binary files as UTF-8 encoded strings) instead of negotiating the line delimiter for the module ?

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

In IntelliJ, projects (and not modules) have a standard encoding and line ending. However, this can be changed on a file-to-file bases. I.e. each file can have a different encoding/line ending, independently of the project configuration. Furthermore, the line endings (and I think also the encodings) are determined dynamically. If I change the line ending, the new line ending will be used the next time the document is written to disk. The same goes the other way around. If the file is overwritten with a different line ending on disk, the line ending for the document will automatically be adjusted accordingly, meaning the IDE now uses the new line endings for any future writes to disk.

The only way around this I see is to save the line ending type of all files and then, after the negotiation, re-apply it to the document and subsequently re-write it to disk to recreate the old state. But this is not a very nice solution in my opinion.

Also, this still does not solve the checksum issue, i.e. Saros detecting file with a different line ending to the host one as "changed", leading to a load of false-positives when starting a session.

Could you explain your approach to avoid this issue in more detail? I don't know what you mean by "negotiating the line delimiter for the module", or rather what to do with this information to avoid the above mentioned issues.
Do you mean to just enforce a singular style of line endings for all participants?

@srossbach
Copy link
Contributor

Eclipse behaves almost the same way. Encodings or stored for files / projects and line endings are stored for the editor. So after we transmit everything the line endings are correctly converted. Converted in this context means not converted at all because there is nothing to convert. If you do transmit the "line endings meta file" and Eclipse decides to convert the content of a file (more specific the editor) then it is normal that you get an inconsistency because the editor content do not match. They look the same but they are not the same.

@srossbach
Copy link
Contributor

srossbach commented Sep 26, 2019

Do you mean to just enforce a singular style of line endings for all participants?

Do you find CRLF endings in our project ? There are a bunch of developers (including me) using Windows.

https://github.com/saros-project/saros/blob/master/core/.settings/org.eclipse.core.runtime.prefs

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

Do you mean to just enforce a singular style of line endings for all participants?

Do you find CRLF endings in our project ? There are a bunch of developers (including me) using Windows.

https://github.com/saros-project/saros/blob/master/core/.settings/org.eclipse.core.runtime.prefs

Well no, not any more. Because we took care of it a while ago. (But before that, this was an issue. There are multiple commits fixing rogue files with wrong line endings through the years.)

But, the way I see it, that only strengthens my point. With collaboration through git, this only works because, with the correct configuration, git only stores a unified version of the content (e.g. always using '\n') that is then converted to the local format on the user side.

This is exactly what I am suggesting for Saros. Just transfer it in a unified format and let the client figure it out. Doing it through the editor content was just the easiest way to access a unified version of the content. There probably also are other possibilities.

If you do transmit the "line endings meta file" and Eclipse decides to convert the content of a file (more specific the editor) then it is normal that you get an inconsistency because the editor content do not match. They look the same but they are not the same.

Yeah but why should this count as an inconsistency? Even though the contents on disk are different, the contents in the IDE (which is what the user most likely cares about) are the same.

So after we transmit everything the line endings are correctly converted.

Since the configuration is project and not module specific, this is not that easy. We would have to change the default for the entire project, even though we are only sharing a single module. As far as I know, the default is not enforced in any way. It is just what is used by default when creating new files, so it would only help us for creating new files.

Also, since line endings are not transferred during the session, this means that I could create a session, change the line ending type of a file (and write it back to disk) without causing inconsistencies. I could then stop and restart the session and Saros would display that there are differences in the shared module, even though the session state beforehand was completely valid/consistent. Isn't that a very unexpected behavior/outcome?

I also don't know whether the line ending type for a particular file is even saved anywhere. I couldn't find a matching settings file or a mention in the documentation. Furthermore, from how the type is automatically adjusted when the file is overwritten, it seems to me as though the type is just determined dynamically based on the file content.

@srossbach
Copy link
Contributor

The reasons why we had issues with Line Endings in Saros was (my big guess) a lot of students just editing the files with their favorite editor rather (there was also a lot of issues with formatting) than using Eclipse.

I also do not get: "the contents in the IDE" are the same. They are not the same, you simply do not see the line endings and almost any editor can handle all different line endings, even if they are mixed up.

You are trying to fix stuff for hobby programmers. In every bigger company you should normally have rules regarding line endings. Of course the issue that line endings only apply to a project and not to a specific module is indeed not so easy to resolve. I can understand your reason not to change the project settings.

Again, we do not have the problem in Eclipse as we are syncing the metadata which contain all relevant information for the editors.

@srossbach
Copy link
Contributor

If you want to to change the Watchdog stuff you have to get the editor content, remove all line endings from it and transmit it to the remote side which has to do the same and the compare the results.

@srossbach
Copy link
Contributor

You could also create 2 hashes, 1 over the editor content, the other as I mentioned above. Then send it to the remote side. At least you can now tell there are issues because of different line endings.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

I also do not get: "the contents in the IDE" are the same. They are not the same, you simply do not see the line endings and almost any editor can handle all different line endings, even if they are mixed up.

This is not true for IntelliJ. Line endings are always normalized to '\n' in the document representation (see documentation). (Sorry, I though I mentioned that already.) This is why we don't have any issue in the running session. The watchdog uses the document/editor contents (as it uses EditorManager.getContent(...)).

If you want to to change the Watchdog stuff you have to get the editor content, remove all line endings from it and transmit it to the remote side which has to do the same and the compare the results.

The watchdog is not the issue. The issue is the checksum generation in the core Filesystem class used to determine whether files match for the negotiation.

As I already mentioned, this is only an issue for the parts of the logic that use IFile.getContents(), which is, as far as I can tell, the whole FileList logic and the project negotiation (for transferring the content).

@srossbach
Copy link
Contributor

This is not true for IntelliJ. Line endings are always normalized to '\n' in the document representation (see documentation), which is why we don't have any issue in the running session. The watchdog uses the document/editor contents (as it uses EditorManager.getContent(...)).

Thanks. Now I understand the issue. However is it that problematic ? The negotiation does what it is supposed to do. It synchronizes the files on their actual byte contents. Your approach is more semantic, i.e it performs different checksum calculation based on either the file suffix or the actual content of the file. However what will happen if you load a binary file with an editor? If the editor replaces any CRLF (Windows) or CR (Mac) and afterwards you take this content and transmit it you are damaging files.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

However what will happen if you load a binary file with an editor? If the editor replaces any CRLF (Windows) or CR (Mac) and afterwards you take this content and transmit it you are damaging files.

Good point. I hadn't thought of that.

One could argue that sharing binary files is not very sensible anyways, but that gets us nowhere. (Which is also not true. I only thought about build artifacts, but of course there are also other binary files like images and stuff.)

And introducing a distinction between binary files and text files for the negotiation seems like pain as well.

Hmmm.... I am still not very satisfied with the current state of the logic. I will think about it some more, but there probably are more pressing issues if this proves too much of a pain. 😁

@srossbach
Copy link
Contributor

I am currently googeling on how GIT performs binary file detection / the heuristic that is used 😄

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

I also thought about that. We had a similar issue with the .gitattributes, but there we had to figure it out ourselves and just tell it which file extensions represent binary files.

@srossbach
Copy link
Contributor

srossbach commented Sep 26, 2019

Well the algorithm is somewhat simple.

https://github.com/git/git/blob/master/xdiff-interface.c#L188

Load some bytes from a file and check if at least on byte has a value of 0.

I guess this will not work every time. However it is pretty unlikely not to encounter \0 in a binary file.

However you first have also to check if the file has a BOM, if the file has no BOM (and even it has one it could be a binary) you can compare against \0\0 for UTF-16 encoded files and so on.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

Checking for a BOM should be easy enough. Even though it is dropped when obtaining the inputstream for a VirtualFile (see javadoc), it can be obtained through VirtualFile.getBOM() (returns byte[] or null).

Which brings me to another point. Even if a file has a BOM, it won't be transferred with the current logic (as we use VirtualFile.getInputStream()). Could this be an issue? I don't know much about the topic.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

The java nio method FileTypeDetector.probeContentType(Path) could be helpful. But it is very vague with how it determines the file type (as this is "implementation specific"). And I'm not sure how much I can trust the result.
The question is which way the false positives go. Binary files being detected as text files would be bad as the line ending conversion could damage the data. Text files being detected as binary files is just "a bit annoying" as it would transfer the line endings to other participants.

@srossbach
Copy link
Contributor

From an editor perspective it depends. If the editor is not that smart it will fail to render the content without the BOM given as hint. In general you should transfer the BOM (when regarding binary transmission).

I cannot really help you at the point because I do not know how BOMs are written, read and discarded in IntelliJ. I also do not know why IntelliJ discards the BOM when you are trying to access the file in binary mode. I would personally first try to detect a BOM. If the file has a BOM then I would write it back. If it has no BOM then I would only add a BOM if I can ensure that is causes no harm => do not add it at all is the easiest solution.

Here are the guidelines from the referenced link in the Javadoc

A: Here are some guidelines to follow:

A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

Some protocols allow optional BOMs in the case of untagged text. In those cases,

    Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

    Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used. (See also Q: What is the difference between UCS-2 and UTF-16?.) [AF]

@srossbach
Copy link
Contributor

Binary files being detected as text files would be bad as the line ending conversion could damage the data.

You do not want the user to experience this scenario, e.g broken images, broken .dll, .so, or jar files etc.

The watchdog will also not detect any corruption in such files.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

The watchdog will also not detect any corruption in such files.

On the basis that such files will never be opened in an editor, therefore the watchdog will never check them?

Because in IntelliJ (and probably also in Eclipse) you can open images in the editor. I have never tried what happens if you do during a session..... Maybe it breaks everything as we try to get a document for an editor that is not text based. Maybe it's simply ignored. Who knows? Will do some smoke testing tomorrow. 😁

@srossbach
Copy link
Contributor

On the basis that such files will never be opened in an editor, therefore the watchdog will never check them?

Yep, this is how the logic works. It only checks files that are represented by an editor. You can easily check it by yourself. Sync a project. Close the session. Edit some file, save them. Manipulate the IPN in such a way that it says all file in sync and wait forever for an inconsitency warning after the negotiation for files that are not opened in an editor.

@srossbach
Copy link
Contributor

I just looked at the VirtualFile again. I think you have a problem. The BOM is discarded when you read contents, however it is added again when you write contents back to THIS file.

The problem is this will never happen during project negotiation. And maybe there are even some corner cases for FileActivities, e.g change the BOM. Now read the file, transmit the contents and write to an existing file on the remote side which still contains the old BOM.

@tobous
Copy link
Member Author

tobous commented Sep 26, 2019

I just looked at the VirtualFile again. I think you have a problem. The BOM is discarded when you read contents, however it is added again when you write contents back to THIS file.

The problem is this will never happen during project negotiation. And maybe there are even some corner cases for FileActivities, e.g change the BOM. Now read the file, transmit the contents and write to an existing file on the remote side which still contains the old BOM.

Could you be more specific? I don't get the issue.

@tobous
Copy link
Member Author

tobous commented Sep 30, 2019

Ok, finding out whether a file is a binary should be relatively easy as IntelliJ provides a method for this. For a VirtualFile, you can request the FileType, which provides the method isBinary(). The implementation seems to act on the safe side as unknown file types are seen as binary, so I am relatively confident that we can rely on the information.

The next issue is now how to pass the information whether the file content or document content was read to the receiving side so that it can be processed correctly. I am currently looking into this.

@srossbach
Copy link
Contributor

I guess you have to change alot of stuff in the core just to serve IntelliJ. I.e modify the ZIP format (InstantSession too) and apply modifications to the FileActivity and subclasses (and do not forgot isBOMIncluded in those activities because you would receive them if Saros/E <-> Saros/I should be possible)

@tobous
Copy link
Member Author

tobous commented Oct 2, 2019

I guess you have to change alot of stuff in the core just to serve IntelliJ.

Yeah, that's my issue with it as well. I am currently encountering a lot of issues that could simply be solved by adding additional, Saros/I specific information to stuff. But that is not really an option as we are trying to keep the core IDE-agnostic. And just adding free-form (i.e. String) fields instead doesn't seem like a good alternative either.

But I still really like this approach as it allows us to not have to deal with line endings and file encodings.

Is something similar possible with Eclipse (i.e. obtaining a view of the file that is normalized by the IDE)? Because if this were the case, it would make more sense to move the distinction between byte and "text" transfer for files into the core.

Also, could you please give me some more information on the problem you mentioned earlier (quoted below)? I am still not sure I understand what you mean.

I just looked at the VirtualFile again. I think you have a problem. The BOM is discarded when you read contents, however it is added again when you write contents back to THIS file.

The problem is this will never happen during project negotiation. And maybe there are even some corner cases for FileActivities, e.g change the BOM. Now read the file, transmit the contents and write to an existing file on the remote side which still contains the old BOM.

@tobous tobous removed this from the Saros/I Release 0.3.0 milestone Mar 20, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: IntelliJ Issue affecting Saros for IntelliJ (Saros/I) Aspect: User Experience Issue that describes a problem with the usability/user experience of Saros Prio: Low State: Open Confirmed issue not currently being worked on Type: Bug Issue that describes an unintended behavior of Saros
Projects
None yet
Development

No branches or pull requests

2 participants