-
Notifications
You must be signed in to change notification settings - Fork 619
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Porting memory I/O from Geany #863
Comments
I'm very interested in MIO. I have wanted to share more code with geany. However, as far as I am maintaining ctags, I will not provide universal-ctags as a shared library because designing API is too difficult for me. The most of my use case command line interface is enough. However, I have been worked for providing sophisticated command line interface. This is my basic position. I wrote a document about input facility of u-ctags: http://docs.ctags.io/en/latest/internal.html The biggest question is that MIO works on Windows? |
Great to hear!
This wouldn't be that hard. The tm_ctags_wrappers.h/c here geany/geany#957 could serve as a base for such an API. Anyway, first it would be good to sync Geany with universal-ctags as much as possible to know exactly what we need from such an API.
Good, will have a look at that. But I think I know quite well what's going on in ctags and can do the necessary changes.
There's no problem on Windows - Geany works on Windows just fine. There's nothing special about MIO - the file back-end basically just calls the normal IO functions like getc() etc. which are already called anyway and the memory back-end maintains a char array on top of which it simulates file operations and there are no OS-related functions at all. Good, I'll try to prepare a patch and you can decide if you like it or not and what needs to be changed. |
Fine. IF POSSIBLE, I would like to get a pull request for introducing glib first. universal-ctags uses libxml directly. The code should be replaced with memory-stream based. |
Hmm, do you want to introduce glib dependency? I think it would be better to remove the glib stuff from MIO (which I actually did yesterday already) and not to introduce more dependencies.
This is actually something MIO would help you a lot. As I said, it's more efficient to load every file into a buffer in memory first and then use the MIO memory back-end. And in this case you can get the underlying buffer directly from MIO so this should be trivial once MIO is in. |
@techee, oh, I see. Please, forget what I wrote about glib. |
I think that using something like MIO would be a good idea. Recently I have looked at the performance of ctags and found out that the filesystem access is a major bottleneck. The function that is called more often and takes most processing time is getc(). That's why I have started looking into mmap()'d file access that would allow to load the files faster and access content via pointers instead of a function call. I have even tried to implement it. However I've found out that a naive #ifdef based implementation is difficult because there are really many places where the code relies on the stream-like file system API. Something like MIO is needed for this: replace all the plain posix calls with mio_* prefixed ones, implement memory mapping hidden inside mio_* and then try to improve performance by turning certain functions into macros or accessing the buffer directly only in certain places. I also think that having ctags as a library would be cool. Though quite a lot of refactoring is needed. |
I profiled it too (this way http://wiki.geany.org/howtos/profiling/gperftools) and the result after using MIO and loading files smaller than 1MB into memory first is this for the linux kernel sources: https://www.dropbox.com/s/wf6rj7o430mrzzh/pprof14382.0.svg?dl=0 mio_getc() now takes less than 3% but you have to add the about 7% taken by openInputFile() which now loads the file into memory. You might save something from the 7% using mmap but I guess it won't be that much. By the way, the new parser is about twice as slow as the old C parser. The reason why the old parser was so fast was that it wasn't generating any tokens when skipping the irrelevant code like function bodies - it was just counting braces, skipping comments and waiting for the right end brace without any token generation on the way. I haven't checked the sources but couldn't something like that be done for the new parser too? The cxxParserParseNextToken() is the most expensive function from the profile.
I don't think that much refactoring is needed actually - there should just be one more header that exposes the public stuff. The Geany pull request geany/geany#957 tries to separate ctags from Geany as much as possible and introduces ctags_wrappers.h which tries to be such an interface. |
By the way the profile svg doesn't display nicely on dropbox - better to download it locally and open it with web browser. |
More gains can probably be achieved by adapting certain functions to work directly on the memory buffer by using pointers and avoiding mio_getc(), mio_fgetpos() etc. The goal is to change getc() with *ptr++, ungetc() with ptr--, getpos with (ptr - buffer)...
The new parser doesn't treat code inside functions as irrelevant. There are variable declarations, nested prototypes, lambda calls... this is the whole point of a new parser: extract the information that the old parser was skipping. |
Possibly, but it will be hard - if you look at the path getcFromInputFile()->iFileGetLine()->iFileGetc()->mio_getc() then the losses in each of the functions are minimal (2-3%). And anyway, you'll get 13% in the best case. And you need something like iFileGetLine() to get line counting automatically so you get 9% at most (realistically I'd say 6%).
I see, then there's not much to do in this respect. By the way is it possible somehow to switch to the old parser? Might be useful sometimes. |
@techee, thank you again. Many queued issues will be solved with MIO.
OldC and OldC++ are kept for testing purpose.
You cannot run |
Great to hear. Thanks for the info about how to use the old parsers, could be useful for testing. |
@pragmaware writes:
I brought this up a couple of times before, for example in issue #423. There, I wrote:
|
@techee, @pragmaware: You can't get around disk access penalty using mmap(). Once a file is in the buffer cache, however, a sane line-by-line implementation based by a large buffer offers the same performance as that of mmap. |
It does for me, because if I am to use such a library I probably want to be able to parse i.e. a buffer the user is currently editing, and then writing to a file just so the library can read it seems highly suboptimal. |
The patch loads the complete file into memory if it's smaller than 1MB (which most sources are). This is faster than reading it by 4KB blocks - in the original version of the patch I was reading the file into memory in 4KB blocks instead all at once and the difference was 4s for parsing all the linux kernel sources.
Sure, but with SSD disks there's almost none (with classical hard drives you are mostly limited by the random access time which is around 10ms so you will be able to parse only about 100 files per second). Anyway, mmap() shouldn't be needed now as the file is loaded into the memory by simply reading its contents.
For good performance it's also important to reduce system calls as much as possible as the context switch is expensive. This is probably the main cause for the total 8s improvement after using GIO because all the IO and kernel interaction is done once at the beginning and during parsing no system calls happen any more. |
In addition to #862 the second big difference in Geany is we need to parse a file loaded into memory instead of parsing a file stored on disk. For this reason there's a small library called MIO:
https://github.com/geany/geany/tree/master/tagmanager/mio
which tries to hide where the file is stored - whether on disk or in memory. One creates a MIO file using either
to get a file-based backend or
to get a memory-based backend. Afterwards, one can use POSIX-like functions like mio_read(), mio_write(), mio_getc(), mio_ungetc(), mio_setpos(), etc. without having to care of what nature the file really is.
The question now is - would it be acceptable for universal-ctags to use this library? There are two main benefits I can think of:
We can do some things to make the MIO library play more nicely with universal-ctags:
What do you think?
cc @b4n
The text was updated successfully, but these errors were encountered: