Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

parallel ctags #761

Closed
fommil opened this issue Jan 13, 2016 · 19 comments
Closed

parallel ctags #761

fommil opened this issue Jan 13, 2016 · 19 comments

Comments

@fommil
Copy link

fommil commented Jan 13, 2016

as far as I understand it, ctags is single threaded. Are there any plans to support parallelisation? May speed things up on huge codebases.

@fommil fommil changed the title parallel exuberant ctags parallel ctags Jan 13, 2016
@mawww
Copy link
Contributor

mawww commented Jan 13, 2016

Hello,

while built in parallel implementation can be interesting, its already possible to parallelize updating a big codebase by launching different ctags on different directory and then merging the generated files (which can be done simply by dropping the lines beginning with ! from all files but one and using sort --merge on all files afterwards).

However, I am not convinced you'll get any speedup from parallelized ctags, as I expect modern machines to be I/O bound. That would need to be profiled to make sure though.

@fommil
Copy link
Author

fommil commented Jan 13, 2016

@mawww I'm sure https://github.com/ggreer/the_silver_searcher would disagree 😉

Running multiple ctags would be extremely difficult to coordinate from the standard emacs https://github.com/bbatsov/projectile/blob/master/projectile.el#L180-L183

@mawww
Copy link
Contributor

mawww commented Jan 13, 2016

@mawww I'm sure https://github.com/ggreer/the_silver_searcher would disagree 😉

Good point.

Running multiple ctags would be extremely difficult to coordinate from the standard emacs https://github.com/bbatsov/projectile/blob/master/projectile.el#L180-L183

A shell script wrapper could already go a long way, but yeah it might be more efficient to integrate that directly into ctags.

@b4n
Copy link
Member

b4n commented Jan 13, 2016

@fommil That guy's blogpost on the matter is not very clear from where it started to where it went (well, you can read it between the lines, but well), and anyway it's really not that much. And I don't mean to disregard any of his work, but I'm not gonna fully trust the results of someone that seemingly just learned about multithreading (esp. because of i.e. how much a misused mutex destroys any performance that MT can give). Not saying he's not totally right, but I'll need to be convinced :)
And note how his own tests showed that on his machines too many worker threads quickly became worse than a no parallelization at all. It's cute, but likely to heavily depend on the hardware, OS and data to process, so it should probably be more sensible than "well, using N threads seemed to perform better in my tests".

Also, another reason why it doesn't appeal me so much is that not only I don't believe it'll give us so much, but it'll be a large amount of error-prone work. Currently the CTags code base is in absolutely no shape to support parallel tags parsing threads. All you might be able to split relatively easily is the init/directory traversal and one single parser thread.
And finally, I'm confident we have more sensible optimizations to perform about everywhere in the codebase (and esp. in the parsers).

So sure, multithreading probably can have some benefits if used very well, but it's likely not to be the most interesting improvement.

@b4n
Copy link
Member

b4n commented Jan 13, 2016

Also, another reason why it doesn't appeal me so much is that […] it'll be a large amount of error-prone work. Currently the CTags code base is in absolutely no shape to support parallel tags parsing threads. All you might be able to split relatively easily is the init/directory traversal and one single parser thread.

BTW, I don't mean that improving this area in the code is not a good idea, I do think it is (esp. for a possible future libctags). I just mean that if performance is the goal, it's probably not (currently) worth the effort, and that there are more important area to focus on.

BTW, firing up a profiler and profiling a buckload of data in a gazillion ways would probably be interesting.

@masatake
Copy link
Member

GNU parallel may help you.

@dtikhonov
Copy link
Contributor

As mentioned before, optimizing reading can speed up ctags quite a bit.

@pragmaware
Copy link
Contributor

Parallel execution of parsers could speed up things quite a bit if I/O comes from cache (and this is often the case the Nth time you run ctags on a directory from an editor).

@dtikhonov
Copy link
Contributor

@pragmaware IMO, a library should not fork.

@masatake
Copy link
Member

masatake commented Mar 16, 2018

If you read Japanese text, look at the article, https://qiita.com/dalance/items/c76141a097e25fabefe8 .
(After writing this comment, I found git repository for ptags (https://github.com/dalance/ptags). The page is written in English.)

It reports a tool named ptags the author developed. The tool is written in Rust and wraps ctags.
It runs ctags parallel for the set of input.
I don't loot at its internal. However, it is obviously runs multiple ctags processes.

The result is quite impressive. 5 times faster than single processing. The number of cpus is not written. The size of memory may be enough(=128GB). The author runs 10 times ptags for the same input set to make page cache hot.

Though these things should be done in wrappers like ptags, it is difficult to ignore this great result.
I hacked quickly. https://github.com/masatake/ctags/tree/parallel
Newly introduced option --_parallel runs multiple ctags processes in _parallel.

The number of worker processes, 8, is hard coded. My note PC has 8 cores.
MEMORY is 32GB. The target input is the latest Linux kernel source tree.
My .ctags is enough hairy.

The result is the mostly the same: 2 ~ 3 times faster.

[yamato@master]~/var/ctags-github% cat run.sh
cat run.sh
for i in $(seq 1 5); do
	echo "#"
	echo "# TRAIL #$i"
	echo "#"
	echo "# parallel 8"
	time  ./ctags    --_parallel -R  ~/var/linux > /dev/null
	echo "# single"
	time  ./ctags -o - --sort=no -R  ~/var/linux > /dev/null
done
[yamato@master]~/var/ctags-github% bash run.sh 
bash run.sh 
#
# TRAIL #1
#
# parallel 8

real	0m29.073s
user	3m5.791s
sys	0m32.347s
# single

real	1m21.397s
user	1m14.601s
sys	0m6.521s
#
# TRAIL #2
#
# parallel 8

real	0m29.746s
user	3m4.601s
sys	0m32.175s
# single

real	1m26.660s
user	1m19.176s
sys	0m7.191s
#
# TRAIL #3
#
# parallel 8

real	0m28.290s
user	3m2.524s
sys	0m31.081s
# single

real	1m21.927s
user	1m14.775s
sys	0m6.896s
#
# TRAIL #4
#
# parallel 8

real	0m28.644s
user	3m3.839s
sys	0m31.756s
# single

real	1m13.319s
user	1m7.294s
sys	0m5.843s
#
# TRAIL #5
#
# parallel 8

real	0m29.274s
user	3m9.387s
sys	0m32.363s
# single

real	1m13.621s
user	1m7.487s
sys	0m5.941s
[yamato@master]~/var/ctags-github% 

(I comapred both tags file. There is no diffference.)

Far from satisfying, but it is good place to start.

@masatake
Copy link
Member

I wonder whether output of workers must be gathered or not.

@fommil fommil closed this as completed Sep 1, 2018
@masatake masatake reopened this Sep 4, 2018
@fommil
Copy link
Author

fommil commented Sep 4, 2018

hi @masatake I'm trying to close all my open tickets that I don't plan to work on. If you're interested in working on this ticket, could you please copy the text into a new ticket?

@masatake
Copy link
Member

masatake commented Sep 4, 2018

I will work on this item in the future. I would like to keep this item open because the record of the discussion here will be valuable for me.

@fommil
Copy link
Author

fommil commented Sep 11, 2018

@masatake you can still link to this ticket from a new one and retain the full history. This would really help me out, as I am trying to clean up my "Issues" tab for a new job and I don't want clutter like this ticket getting in the way.

@fommil fommil closed this as completed Sep 13, 2018
@dtikhonov
Copy link
Contributor

@fommil, I don't see how you can override @masatake, who is the driving force behind Universal Ctags, with 2,700 commits versus your commit count of zero. Once you open a bug (or, in GitHub parlance, "issue"), this bugs becomes the property of the project. I believe you can unwatch it and not receive any emails about it.

Reopening.

@dtikhonov dtikhonov reopened this Sep 13, 2018
@fommil
Copy link
Author

fommil commented Sep 13, 2018

@dtikhonov @masatake please close this ticket. It is the only ticket in my https://github.com/issues view that is not relevant to my work.

It is not possible to remove a ticket from this view unless the ticket is closed. Even if I unsubscribe.

Indeed, I was not aware that repo owners would have this control when I created the ticket, otherwise I would not have done so.

if you want to work on this, please create a new ticket and reference this ticket, all discussion is preserved. Or just copy paste the contents of #761 (comment) into a fresh ticket.

I do not think this is much for me to ask.

@fommil fommil closed this as completed Sep 13, 2018
@masatake
Copy link
Member

Could you make a temporal GitHub account just for copy-paste?
So you can do copy-paste by yourself.
Then, you can remove the account.

@fommil
Copy link
Author

fommil commented Sep 13, 2018

sure, if that's the only way to fix this I can do that.

@ghost ghost mentioned this issue Sep 13, 2018
@fommil
Copy link
Author

fommil commented Sep 13, 2018

done! Thanks for letting me close this ticket. It cleans up my TODO task significantly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants