Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize Vector concatenation (Vector ++ Vector) #4442

Closed
scabug opened this issue Apr 4, 2011 · 17 comments · Fixed by scala/scala#10159
Closed

optimize Vector concatenation (Vector ++ Vector) #4442

scabug opened this issue Apr 4, 2011 · 17 comments · Fixed by scala/scala#10159

Comments

@scabug
Copy link

scabug commented Apr 4, 2011

currently scala.collection.immutable.Vector.++ simply inherits its implementation from a supertrait (through the stubs added in r24227), so if you concatenate two vectors, it doesn't take advantage that that both are the same structure, but just treats the second vector like any traversable.

in theory this could be O(log n), not O(n), yes? Tiark seems to confirm it in this comment on #3724: "Unfortunately, for implementing a fast concat operation (we hope to do that for 2.8.1 or 2.9) heterogenous Arrays seem to be necessary (we'll be storing Int arrays next to the sub nodes). We might rethink this however, and try to stick to homogeneous Arrays."

I just thought there should be enhancement ticket on this as it would still be nice to have someday.

@scabug
Copy link
Author

scabug commented Apr 4, 2011

Imported From: https://issues.scala-lang.org/browse/SI-4442?orig=1
Reporter: @SethTisue
See #3724

@scabug
Copy link
Author

scabug commented Apr 4, 2011

@TiarkRompf said:
I've been having this on my todo list for quite a while now. The problem is that it requires pervasive changes to the current internal structure and those are hard to get right without slowing down other uses of Vector. Performance will most likely be amortized log n, not worst case. It's good to have a ticket for this, though.

@scabug
Copy link
Author

scabug commented Sep 20, 2011

@SethTisue said:
Is Phil working on this? (the latest Scala meeting notes seem to say so)

@scabug
Copy link
Author

scabug commented Mar 2, 2012

@TiarkRompf said (edited on Mar 2, 2012 9:10:30 PM UTC):
Here's a link to the paper:
http://infoscience.epfl.ch/record/169879/files/RMTrees.pdf?version=1

Still need to resolve integration with the current Vector implementation. Maybe it's better to have a separate CatenableVector as a first step.

@scabug
Copy link
Author

scabug commented Jul 29, 2013

@pchiusano said:
Related to this, (1 to N).foldLeft(Vector[Int]())(_ :+ _) runs in linear time, but (1 to N).foldLeft(Vector[Int]())((acc,a) => acc ++ List(a)) is quadratic. This is extremely surprising (I spent several hours tracking down a performance bug in scalaz-stream that ended up being caused by this). Since Vector has a constant time snoc operation, it seems like the default implementation of ++ should be repeated calls to :+, which doesn't require any internal changes to Vector.

@scabug
Copy link
Author

scabug commented Aug 7, 2013

@SethTisue said:
Paul: I'm extremely glad you noticed this. I don't think anyone realized the situation was that bad! I opened a separate ticket on it at #7725.

@scabug
Copy link
Author

scabug commented Feb 10, 2014

@adriaanm said:
was this fixed along with SI-7725?

@scabug
Copy link
Author

scabug commented Feb 10, 2014

@Ichoran said:
This was not fixed--this is about preserving structural sharing across the entire tree, while the other was to at least start from the bigger of two collections to get at least some sharing.

@scabug
Copy link
Author

scabug commented Feb 13, 2014

@SethTisue said:
reassigning to "Backlog" on the grounds that the research on how it would even work hasn't been done yet

@scabug
Copy link
Author

scabug commented Oct 1, 2016

@Blaisorblade said:
Shouldn't RRB-Vector be a potential fix for this, according to the ICFP'15 paper describing it?

https://github.com/nicolasstucki/scala-rrb-vector
https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf
http://dl.acm.org/citation.cfm?id=2784739

@joshlemer
Copy link

These are very relevant!
paper describing a potentially new data structure for Vectors: https://infoscience.epfl.ch/record/213452/files/rrbvector.pdf
implementation: https://github.com/nicolasstucki/scala-rrb-vector

@V-Lamp
Copy link

V-Lamp commented Feb 27, 2019

Is this addressed by any chance in the new collections in 2.13? 🙂

@joshlemer
Copy link

@V-Lamp the Vectors in 2.13 will not be the new data structure (RRB Vectors), but their concatenation may be improved enormously by scala/scala#7588 which ensures that during concatenation of vectors, the structure of the left Vector is used rather than copying it into a new Vector. This essentially means that Vector concatenation will be constant-time on the size of the left vector (but still O(n) where n = size of right vector).

@szeiger
Copy link

szeiger commented Nov 4, 2019

scala/scala#7588 was reverted due to #11636 but the new Vector implementation that I am working on for 2.14 also has this optimization (appending n elements to a Vector of size m is O(n + log(m+n))). I don't know if I'll find the time to implement it but the same is possible for prepending. The combination of the two would allow concatenation depending on the size of the shorter slice.

@scala scala deleted a comment from scabug Apr 26, 2021
@SethTisue
Copy link
Member

Note that Vector was completely redone for 2.13.2: scala/scala#8534

Has anyone studied what the current situation is w/r/t to this ticket?

@SethTisue
Copy link
Member

SethTisue commented Apr 26, 2021

I see that on scala/scala#8534 @szeiger wrote,

I did not get around to implementing an optimized concatenation where the right-hand side vector determines the alignment. Like most other algorithms for this data structure it is quite easy to visualize how it works and assure yourself that it is possible but actually implementing it correctly is much harder. This would allow concatenation of two vectors of sizes m and n in O(min(n, m) + log(n+m)) time.

So that sounds like we're still where we before: doing concatenation as efficiently as possible remains future work.

@ansvonwa
Copy link
Member

For the record: I've implemented concatenation in O(min(n, m) + log(n+m)) time.
(If you're really lucky and the Vector structures are perfectly aligned, it's even only O(log(n+m)).)
It's not yet PR-ready, but I hope I'll be able to polish it in a month or so when I'm back from vacation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants