Can't handle files >=4GB #1713

Closed
namazso opened this issue Oct 26, 2019 · 25 comments

Comments

@namazso

namazso commented Oct 26, 2019

I tried opening a file (5.76 GB) with Notepad3 5.19.815.2595 (x64), however it stopped displaying the file at around 1.76 GB into the file. Search or anything else also doesn't work past that point.
The problem appears to be here:

DWORD dwFileSize = GetFileSize(hFile, NULL);

because the high-order DWORD of the file size is ignored.
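
For illustration, the truncation can be reproduced with plain integer arithmetic. This is a minimal, portable sketch and not Notepad3 code: the helper names and the example size (roughly 5.76 GiB) are assumptions. On Win32, the low half is what GetFileSize() returns and the high half is what it writes through its second parameter when that parameter is not NULL; GetFileSizeEx() returns the full 64-bit size directly.

```c
#include <stdint.h>

/* Hypothetical helper: what survives when only the low DWORD is kept,
   as happens when GetFileSize() is called with NULL as second argument. */
static uint32_t low_dword_only(uint64_t file_size)
{
    return (uint32_t)(file_size & 0xFFFFFFFFu);
}

/* Hypothetical helper: rebuild the full 64-bit size from both halves. */
static uint64_t combine_file_size(uint32_t low, uint32_t high)
{
    return ((uint64_t)high << 32) | (uint64_t)low;
}
```

With an assumed size of 6184752906 bytes (about 5.76 GiB), low_dword_only() yields 1889785610 bytes, about 1.76 GiB, which is consistent with where the display stopped.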

@RaiKoHoff
Collaborator

This will only be the first of the many problems I expect for files larger than 2 GB. Scintilla's support for such documents is also still in a beta state (see also: https://www.scintilla.org/ScintillaDoc.html#MultipleViews : SC_DOCUMENTOPTION_TEXT_LARGE).

Besides the expected "sluggish" behavior of the lexers (better to switch to Lexer_NULL for these large files), I expect other operations (search and replace, etc.) to be inconveniently slow.
On the other hand, sensible text documents of this size are a "pathological case", which reduces the priority of supporting this "very large file handling" by an order of magnitude ... 🤔

@namazso
Author

namazso commented Nov 4, 2019

I think at least a warning and a clean failure (or something similar) would be nice, just in case someone tries to open a large file like this. Additionally, you might want to check smaller sizes too, like 4GB-16 (which ends up at an allocation size of 0 bytes), to make sure they don't crash the program. Appropriately sized files also currently bypass the file size warning.
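
To make the 4GB-16 case concrete, here is a minimal sketch. It assumes the loader adds 16 bytes of padding to a 32-bit file size before allocating; the constant and both function names are hypothetical and not taken from the Notepad3 source.

```c
#include <stddef.h>
#include <stdint.h>

#define LOAD_PADDING 16u /* assumed padding, hypothetical */

/* 32-bit arithmetic: unsigned overflow wraps, so 4GB-16 plus 16 becomes 0. */
static uint32_t wrapped_alloc_size(uint32_t file_size)
{
    return (uint32_t)(file_size + LOAD_PADDING);
}

/* Checked variant on a 64-bit size: returns 0 when the padded size does not
   fit in size_t. A valid result is never 0 because the padding is non-zero. */
static size_t checked_alloc_size(uint64_t file_size)
{
    if (file_size > (uint64_t)SIZE_MAX - LOAD_PADDING) {
        return 0;
    }
    return (size_t)(file_size + LOAD_PADDING);
}
```

A caller could treat a 0 result as "too large" and show a warning instead of attempting the allocation.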

(also just a sidenote: my file was in fact a sensible text document, namely a log file)

@zufuliu

zufuliu commented Nov 5, 2019

It's not slow when you use idle styling, see zufuliu/notepad4#125.

@zufuliu

zufuliu commented Nov 6, 2019

Hi @RaiKoHoff, there are bugs in WideCharToMultiByteEx() and MultiByteToWideCharEx(): they don't handle UTF-16 surrogate pairs (CharNextW, CharPrevW) or DBCS multibyte characters (IsDBCSLeadByteEx, CharNextExA, CharPrevExA).
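
A minimal sketch of the surrogate-pair part, assuming those functions convert the text in fixed-size chunks (an assumption, the thread does not show their code). The helper names are hypothetical. The idea is to move a chunk boundary back by one code unit when it would otherwise separate a high surrogate from its low surrogate; the DBCS side would need an analogous lead-byte check and is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* True for the first code unit of a UTF-16 surrogate pair. */
static int is_high_surrogate(uint16_t unit)
{
    return unit >= 0xD800u && unit <= 0xDBFFu;
}

/* Returns the exclusive end index of the chunk that starts at `start`
   and holds at most `max_units` code units, without splitting a pair.
   Preconditions: start <= length and max_units >= 1. */
static size_t safe_chunk_end(const uint16_t *text, size_t length,
                             size_t start, size_t max_units)
{
    size_t end = (length - start > max_units) ? start + max_units : length;
    /* Back off only if more text follows and the chunk stays non-empty. */
    if (end < length && end - start > 1 && is_high_surrogate(text[end - 1])) {
        end--;
    }
    return end;
}
```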

@RaiKoHoff
Collaborator

RaiKoHoff commented Nov 6, 2019

Hi @zufuliu , thanks for the hint. It was just an idea for working around the 2GB (MAX_INT) limitation; I haven't tested or used it yet ... 🤔

@hpwamr
Collaborator

hpwamr commented Nov 11, 2019

Hello @namazso , for very large files, I recommend for example: PilotEdit 13.3.0

PilotEdit is a handy and reliable file (text- and hex-) editor designed to help users to execute scripts, extract strings and edit large files.

Features:

  • PilotEdit is four times faster than PilotEdit Lite when opening huge files in ASCII mode.
  • Edit huge files of 400GB (40 billion lines) in quick mode.
  • Compare and merge two huge files of 100GB (10 billion lines).
  • Encrypt/decrypt files larger than 10GB.
  • Edit an encrypted file transparently.
  • Sort a huge file of 1GB.
  • Find/remove duplicate lines in a file larger than 1GB.
  • Extract strings matching a regular expression.
  • Execute PilotEdit scripts to replace strings automatically.
  • Automatically detect start tag and end tag.
  • Format source code.
  • Edit, download/upload large files through SFTP.
  • Highlight all occurrences of selected word.
  • Replace millions occurrences of strings in a huge file in quick mode.
  • Change the encoding of big files.
  • Code Collapse. ...

@hpwamr
Collaborator

hpwamr commented Nov 13, 2019

Hello @RaiKoHoff ,
With this link, you can download 3 log files to test the 2GB limit: Test_files_size_2GB_limit.rar

The file: Size_2.01 GB (2065 MB - 12.231.271 lines).log produces this dialog. 👍

[Screenshot: dialog shown for Size_2.01 GB (2065 MB - 12.231.271 lines).log]

The file: Size_1.99 GB (2042 MB - 12.095.368 lines).log opens (on a fast i7 system) after:

  • 1 min. 10 sec. with Notepad3 (64-bit) v5.19.1114.2674 BETA 🐌
  • 1 sec. with EditPadLite 7.6.5 😮
  • 12 sec. with EditPlus 5.2.2386
  • 8 sec. with Notepad++ v7.8.1
  • 21 sec. with Notepad2 (original) 4.2.25
  • 21 sec. with Notepad2-mod 4.2.25.998
  • crash with Notepad2e R92
  • 5 sec. with Notepad2-zufuliu 4.19.11r2524
  • 5 sec. with SciTE 4.2.1
  • 15 sec. with VSCode 1.40.1

@zufuliu

zufuliu commented Nov 13, 2019

@hpwamr How about Notepad2? You can test the AVX2 build.
Using idle styling will make loading bigger files fast.

@hpwamr
Collaborator

hpwamr commented Nov 14, 2019

"How about Notepad2? You can test the AVX2 build.
Using idle styling will make loading bigger files fast."

Hello @zufuliu , indeed your AVX2 build opens a big file very fast.
I will update my previous list with the results for your Notepad2 and some other Notepad variants.
PS: What do you mean by "idle styling"?

@zufuliu

zufuliu commented Nov 14, 2019

Use SciCall_SetIdleStyling(SC_IDLESTYLING_ALL), or any value other than SC_IDLESTYLING_NONE.

Also, don't call SciCall_ColouriseAll() in Style_SetLexer(); call it only where colourising the whole document is needed, e.g. when toggling all folds.

SciTE uses SC_DOCUMENTOPTION_STYLES_NONE for bigger files (default threshold 10MB).
Notepad++ can't open files larger than 2GB because it explicitly casts Scintilla API return values to int.
EditPadLite might use file mapping.
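
As a rough sketch of the size-based policy described above: the function below picks Scintilla document options from the file size, and its result would be passed as the options argument of SCI_CREATEDOCUMENT. The function name and both thresholds are assumptions (10MB mirrors the SciTE default mentioned above, 2GB is the int limit). The constants are only defined here so the snippet stands alone; a real build would take them from Scintilla.h.

```c
#include <stdint.h>

/* Fallback definitions mirroring Scintilla.h, used only if it is not included. */
#ifndef SC_DOCUMENTOPTION_DEFAULT
#define SC_DOCUMENTOPTION_DEFAULT     0
#define SC_DOCUMENTOPTION_STYLES_NONE 0x1
#define SC_DOCUMENTOPTION_TEXT_LARGE  0x100
#endif

/* Assumed thresholds, hypothetical names. */
#define STYLES_NONE_THRESHOLD (10ull * 1024 * 1024) /* 10 MiB */
#define TEXT_LARGE_THRESHOLD  0x7FFFFFFFull         /* largest 32-bit int */

/* Hypothetical policy: drop style bytes for big files and request
   large-document support once the size no longer fits in an int. */
static int choose_document_options(uint64_t file_size)
{
    int options = SC_DOCUMENTOPTION_DEFAULT;
    if (file_size > STYLES_NONE_THRESHOLD) {
        options |= SC_DOCUMENTOPTION_STYLES_NONE;
    }
    if (file_size > TEXT_LARGE_THRESHOLD) {
        options |= SC_DOCUMENTOPTION_TEXT_LARGE;
    }
    return options;
}
```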

@RaiKoHoff
Collaborator

@zufuliu : Sorry, we haven't found the time to look at your solution yet. I also think that is the right way.
Using "file mapping" would be a lot more effort, especially since we want to keep supporting the AES encryption feature.

@zufuliu

zufuliu commented Nov 15, 2019

@RaiKoHoff Basic changes:

  1. Avoid Sci_ApplyLexerStyle(0, -1); in Style_SetLexer; it can be changed to SCI_STARTSTYLING(0).
  2. Use https://www.scintilla.org/ScintillaDoc.html#SCI_SETIDLESTYLING in _InitializeSciEditCtrl().
  3. Profile the exe. "1 min. 10 sec." looks suspicious; it is even slower than the old GDI-based Notepad2-mod.
  4. IsUTF8() (takes 1~2 seconds for a 2GB file on my i5 system) and EditDetectEOLMode() (about 200ms for a 2GB file) have SSE2/AVX2 accelerations.

About "file mapping": I think it's more complicated. I haven't worked out how to avoid loading the entire file into memory while still keeping correct syntax coloring and folding (every file in my Notepad2 has code folding). My only experience (viewing huge SQL dumps) is with Vim, which can load files larger than 4GB with coloring (it seems Vim loads file chunks on demand, but I haven't checked its source code).

@zufuliu

zufuliu commented Nov 21, 2019

Just tested file mapping (based on the example from https://docs.microsoft.com/en-us/windows/win32/memory/creating-a-view-within-a-file): mapping the entire file is indeed faster than reading the entire file. The other benefit of file mapping is that the API supports files larger than 4GiB.
https://docs.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-mapviewoffile
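
One detail of that API, sketched in portable C: MapViewOfFile takes the view offset as two DWORDs (high and low), and the offset must be a multiple of the system allocation granularity (dwAllocationGranularity from GetSystemInfo). The struct and function names below are hypothetical, and the 65536 used in the note afterwards is only a typical granularity, not a guaranteed one.

```c
#include <stdint.h>

/* Hypothetical result type: aligned view start split into two DWORDs,
   plus the distance from the view start to the requested byte. */
typedef struct {
    uint32_t offset_high;
    uint32_t offset_low;
    uint32_t delta;
} ViewOffset;

/* Round a 64-bit file offset down to the allocation granularity. */
static ViewOffset aligned_view_offset(uint64_t file_offset, uint32_t granularity)
{
    ViewOffset v;
    uint64_t start = (file_offset / granularity) * granularity;
    v.offset_high = (uint32_t)(start >> 32);
    v.offset_low = (uint32_t)(start & 0xFFFFFFFFu);
    v.delta = (uint32_t)(file_offset - start);
    return v;
}
```

For example, aligned_view_offset(5000000000ULL, 65536u) gives a high DWORD of 1, so offsets past 4GiB stay addressable; the requested byte then sits `delta` bytes into the mapped view.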

@RaiKoHoff
Collaborator

@zufuliu : Exactly what I expected: creating only a mapped view on huge files should be faster 😃.
What about changing and saving that huge file?
What about searching and replacing strings spread over the huge file?

On the other hand, I have to dig deeper into that issue; maybe Notepad3 cannot support a huge encrypted file (if it uses a stream cipher ...).

@zufuliu

zufuliu commented Nov 21, 2019

@RaiKoHoff Searching is OK, including marking all occurrences. Editing (a rare operation for huge files) is very slow, with huge memory/CPU usage (resizing and moving). Saving is not slow (using WriteFile). Saving via file mapping (and a range pointer) is not tested; I think it should be faster than WriteFile.

Anyway, loading the entire file into memory is not the proper way to view huge files, but it's the only way supported by current Scintilla.

@hpwamr
Collaborator

hpwamr commented Feb 26, 2020

Hello @hpwamr ,

Feel free to test the RC2 version "Notepad3Portable_5.20.225.3_RC2.paf.exe.7z" or higher.
See "Notepad3 BETA-channel access #1129" or here Notepad3Portable_5.20.225.3_RC2.paf.exe.7z or from site_2.

Note: "Notepad3Portable RC2" can be used in "2 flavors", see with/without ext.: ".7z" or from site_2.

Your comments and suggestions are welcome... 😃

@hpwamr
Collaborator

hpwamr commented Mar 5, 2020

Hello @namazso ,

Today, I've tested Notepad3 (64-bit) v5.20.304.1 RC2 with the file :

  • Size_2.01 GB (2065 MB - 12.231.271 lines).log

The file is opened after +/- 8 sec on a fast i7 system.

Note: The maximum size is changed from 2 GB to 4 GB. 👍

As far as I'm concerned, I think you (requester) can close this issue...

@namazso
Author

namazso commented Mar 5, 2020

@hpwamr Well, the original issue was about a 5.76 GB file, so I think it could stay open. However, I appreciate the increased limit.

@RaiKoHoff
Collaborator

We can leave this issue open until the Scintilla component can handle files bigger than 4GB ...

@RaiKoHoff changed the title from "Broken on files >=4GB" to "Can't handle files >=4GB" on Mar 5, 2020
@hpwamr added this to the "Large Document (> 2GB) Handling" milestone on Jun 6, 2020

@zufuliu

zufuliu commented Sep 13, 2020

@hpwamr What are the times on your fast computer for SciTE 4.4.5, Notepad2 4.20.09 (AVX2, x64 and Win32 builds), and Notepad3 with Scintilla 4.4.5?

@hpwamr
Collaborator

hpwamr commented Sep 14, 2020

Hello @zufuliu ,

Download test files: "Test_big_files_till_4GB.rar"

The test file: "Size_3.98 GB (4086 MB - 24.190.736 lines).log" opens (on a fast i7 system) after:

  • 7 sec. with Notepad2_en_x64_v4.20.08r3252 (Scintilla 4.4.4)
  • 6 sec. with Notepad2_en_AVX2_v4.20.08r3252 (Scintilla 4.4.4)
  • 5 sec. with Notepad2_en_x64_v4.20.09r3288 (Scintilla 4.4.5)
  • 4,7 sec. with Notepad2_en_AVX2_v4.20.09r3288 (Scintilla 4.4.5)
  • 14 sec. with Notepad3Portable (x64) v5.20.913.2 rc (Scintilla 4.4.4)
  • 12,7 sec. with Notepad3Portable (x64) v5.20.917.1 beta (Scintilla 4.4.5)

@zufuliu

zufuliu commented Sep 15, 2020

Hi @hpwamr, thank you 👍. What about the times for the AVX2 builds (v4.20.08r3252, v4.20.09r3288)?

For Notepad3, there will be improvements after Scintilla 5 (maybe in 5.1 or 5.2).
For now, I suggest updating EditDetectEOLMode and removing the eol_table, which turns out to be slower. The latest code for EditDetectEOLMode is available at https://github.com/zufuliu/notepad2/blob/master/src/Edit.c#L742
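
For readers unfamiliar with what EOL detection involves, here is a plain scalar sketch that counts each line-ending style and picks the most frequent one. It is not the Notepad2 code at the link above: it uses no lookup table and no SSE2/AVX2, and the enum and function names are hypothetical.

```c
#include <stddef.h>

typedef enum { EOL_CRLF, EOL_LF, EOL_CR } EolMode;

/* Counts CRLF, lone LF and lone CR; returns the most frequent style,
   or `fallback` when the text has no line ending at all.
   Ties are resolved in the order CRLF, LF, CR. */
static EolMode detect_eol_mode(const char *text, size_t length, EolMode fallback)
{
    size_t crlf = 0, lf = 0, cr = 0;
    for (size_t i = 0; i < length; i++) {
        if (text[i] == '\r') {
            if (i + 1 < length && text[i + 1] == '\n') {
                crlf++;
                i++; /* skip the LF that belongs to this CRLF */
            } else {
                cr++;
            }
        } else if (text[i] == '\n') {
            lf++;
        }
    }
    if (crlf == 0 && lf == 0 && cr == 0) {
        return fallback;
    }
    if (crlf >= lf && crlf >= cr) {
        return EOL_CRLF;
    }
    return (lf >= cr) ? EOL_LF : EOL_CR;
}
```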

@RaiKoHoff
Collaborator

@zufuliu : Thank you for the enhancement suggestions - I will check your latest code for EOL detection soon... - best regards 👍

@hpwamr
Collaborator

hpwamr commented Sep 17, 2020

Hello @zufuliu ,
I've just updated my above list: #1713 (comment)

@zufuliu

zufuliu commented Sep 17, 2020

@hpwamr Thanks.
