List serialization/deserialization can overflow the stack #6961

Closed
scabug opened this Issue Jan 11, 2013 · 33 comments

scabug commented Jan 11, 2013

Compiling and running this program overflows the stack at runtime on 2.10 but not on 2.9. Please fix this - it's a critical bug.

import java.io._

object ListSer extends App {

  val largeList = (1 to 1000000).toList

  val oos = new ObjectOutputStream(new FileOutputStream("C:/tmp/largeList.ser"))
  oos.writeObject(largeList)
  oos.flush()
  oos.close()

  val ois = new ObjectInputStream(new FileInputStream("C:/tmp/largeList.ser"))
  val read = ois.readObject().asInstanceOf[List[Int]]
  println(read.size)
  ois.close()
}

Here is the stack trace:

        at scala.collection.immutable.$colon$colon.readObject(List.scala:362)
        at sun.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1004)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1866)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1347)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
        ... (these nine frames repeat, once per list cell, until the stack overflows)
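The repeating cycle shows the cause: the list's readObject re-enters ObjectInputStream.readObject for the tail, so the stack depth grows linearly with list length. The same thing happens on write with default serialization of any linked structure. A minimal Java analogue reproduces the pattern - note this is a sketch: the Node class and the overflows helper below are illustrative, not Scala's actual cons cell.

```java
import java.io.*;

// Minimal analogue of the failure mode: default serialization of a linked
// structure walks the `tail` chain recursively, burning several frames per
// cell, like the $colon$colon.readObject cycle in the trace above.
public class RecursiveConsDemo {
    static class Node implements Serializable {
        final int head;
        final Node tail;
        Node(int head, Node tail) { this.head = head; this.tail = tail; }
    }

    // True if serializing a list of the given length overflows the stack of
    // a thread with a deliberately small (512 KiB) stack.
    static boolean overflows(int length) throws InterruptedException {
        Node list = null;
        for (int i = 0; i < length; i++) list = new Node(i, list); // built iteratively
        final Node l = list;
        final boolean[] overflowed = { false };
        Thread t = new Thread(null, () -> {
            try (ObjectOutputStream oos = new ObjectOutputStream(new ByteArrayOutputStream())) {
                oos.writeObject(l);       // recurses once per cell
            } catch (StackOverflowError e) {
                overflowed[0] = true;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }, "small-stack", 512 * 1024);
        t.start();
        t.join();
        return overflowed[0];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(overflows(1_000_000));
    }
}
```

The small explicit stack size just makes the overflow deterministic; with a default stack it simply takes more elements, which is why the issue only shows up on large lists.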
scabug commented Jan 11, 2013

Imported From: https://issues.scala-lang.org/browse/SI-6961?orig=1
Reporter: @oxbowlakes
Affected Versions: 2.10.0
See #5374

scabug commented Jan 11, 2013

@retronym said:
Forwarding from Jan Vanek (j3vanek@gmail.com) on scala-internals:
Here is a gist, please have a look. It requires two Booleans in thread-local storage and supports long lists, including sub-structure sharing. The oldReadObject() is not incorporated.

https://gist.github.com/4512368

Regards,
Jan

scabug commented Jan 14, 2013

Jan Vanek (j3vanek) said:
Hi, I've forked/updated the gist:

https://gist.github.com/4517945

There were a couple of bugs in the first version, as you might have noticed.

This version removes the thread-local data (sets it to null) after the top-level call, so no trace is left. I think that's better. But if you think it is better to keep the data there to avoid a per-top-list allocation, then it should not be an instance of a Scala class (ListSerializationCtrl) but an Array[Boolean] instance, so the hypothetical class-loader leak is prevented.

This version sets inLoop = false directly after write/read in a finally block, so that when an exception is thrown the data is in a correct state. I first thought it was enough to just remove (set to null) the TLS data in the top-level try/finally, but this way is safer; it is possible to imagine serialization data that could leave the TLS data in a wrong state (if one does, e.g., serialization within serialization). This is paranoid, but I measured the cost of the try/finally (against setting inLoop = false outside finally) and it was not recognizable, so there is no reason not to have it.

It also supports the 2.9 serialization format, at least according to my testing: I wrote an object using oldWriteObject() and read it back correctly using the new readObject().

I've moved the ListSerializationCtrl class and the thread-local variable out of the List object. This is mainly to keep it private rather than private[scala], because List.tlSerCtrl would otherwise be visible from Java in IDEs. So now ListSerializationStart/End/Ctrl are all private classes close to each other.

Finally I did some performance testing. A list of size S, serialized/deserialized N times into/from a byte array:

List size | Repeated | 2.9.x (ms) | this code (ms) | current 2.10.x (ms)
       10 |    2000x |         33 |             46 |                  50
       20 |    2000x |         40 |             62 |                  70
      500 |     200x |         44 |             86 |                 105

So this code is actually faster than the current 2.10.x code. That was a pleasant surprise; I don't have a definitive explanation for it. I think the code is ready, but I haven't yet had time to learn how to open a pull request, or what the next step is.

scabug commented Jan 15, 2013

@retronym said:
Alex: could you please review Jan's suggested fix?

scabug commented Jan 15, 2013

@axel22 said:
Yes, I'll review this today.

scabug commented Jan 15, 2013

@axel22 said (edited on Jan 15, 2013 4:27:09 PM UTC):
Jan, thank you for your patch!
It seems to me that you have correctly addressed the issues from #5374 - in particular, structural sharing is maintained when serializing objects which point to the same lists multiple times.
Also, it seems to me that the loop logic works correctly now.

A couple of comments:

  1. I did not completely understand why the fields seen0 and inLoop0 are private - why not just have public vars?

  2. It seems to me that the idea is that when the seen field becomes false, serializing a :: has to bring it
    back to true. Otherwise, the tail must have been Nil, meaning that the loop in writeObjectImpl can end.

  3. It seems to me that serializing nested lists should work as well.
    Calling writeAsSeen in line 34 may serialize a nested list, in which case the nested list should leave the ctrl
    object in the seen == false state, but in that case the writeAsSeen of the outer list brings it back to true.

Am I right?

Nevertheless, could you add a few more tests with nested lists, and with lists of objects having nested lists and shared lists, to ensure that this works?

  4. I suppose a similar inverse logic applies to readObject, from what I can see.

Overall, looks good.

On a side note, I wonder if there is some existing serialization infrastructure, i.e. a method call, that we could use instead
of the thread-local-with-state patch to output the objects into the object output stream while maintaining structural sharing.
Something like out.writeObjectId(...) could greatly simplify the code we have in the patch now.

scabug commented Jan 15, 2013

Jan Vanek (j3vanek) said:
Hi Aleksandar,

ad 1) It was so in the first (or maybe also the second) version in the fork. I refactored those code pieces from write/readObjectImpl into the ListSerializationCtrl.write/read/InLoop/AsSeen methods to make the code in write/readObjectImpl shorter and more readable, and I hoped that giving those pieces names would also make the code more understandable than mutating the vars in write/readObjectImpl directly. Once that happened, having the vars private makes it clear that only the methods of LSCtrl can mutate them. I also hoped the analogy of read vs. write would be apparent, because both write/readObjectImpl and the write/readInLoop, write/readAsSeen methods are simple enough to see the mirroring. The goal is readability. If you find the "inlined" version more readable, I have absolutely nothing against it.

ad 2, 3) Yes. The idea is that the list is serialized in the loop, and each sub-list contributes just its head and itself, so to speak. Technically the serialization stream consists of multiple single-head, unconnected lists (well, they have no tail in the stream, so they are not really lists). The loop makes sure all of them are in, and during deserialization the loop connects them again.

This is controlled by the inLoop variable. When writeObjectImpl sees inLoop = true, it knows someone else manages the loop, that it must serialize its head only, and that it has to set seen to true to tell that someone it did so. So seen is like a response saying that the list (not the head) was there and contributed itself and its head. I don't like the name "seen"; I don't know a better name right now, but we should find one. The inLoop flag is more critical - it is like a request, and thus inLoop = true must not escape; that is why I set it to false in a finally block. For the same reason inLoop is set to false before the head is serialized: the head can be, or can contain, a list, and that one must not enter writeObject with inLoop = true, obviously.

So if after calling writeInLoop the ctrl.seen is false, then the tail was either Nil or an already-serialized shared sub-list. There is no need to distinguish the two; in either case the loop must stop.

ad 4) Yes, I tried to make read a clear inverse of write.

I've added 1 :-) test into the gist. It is a "shared" thing, and it deserializes properly.

ad side note) You mean like oos.beginWriteObject(obj): Boolean (false if already there and ID written, true if not and ID generated and definition starts) and oos.endWriteObject()?

Regards...
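The inLoop request / seen response discipline, with the flag restored in a finally block and the thread-local removed after the top-level call, can be sketched generically as follows. This is a hedged illustration of the control-flow pattern only - the names (TlsCtrlDemo, withLoop, CTRL) and the boolean[] layout are mine, not the patch's actual API.

```java
import java.util.function.Supplier;

// Sketch of the thread-local discipline from the discussion above: a flag is
// flipped for the duration of a nested phase and restored in finally, so an
// exception can never leave it in the wrong state; the top-level call removes
// the thread-local entirely, so no per-thread trace (and no hypothetical
// class-loader leak) remains.
public class TlsCtrlDemo {
    // { inLoop, seen } kept as a plain boolean[] rather than a custom class,
    // echoing the class-loader-leak concern raised in the thread.
    private static final ThreadLocal<boolean[]> CTRL = new ThreadLocal<>();

    static <T> T withLoop(Supplier<T> body) {
        boolean topLevel = (CTRL.get() == null);
        if (topLevel) CTRL.set(new boolean[] { false, false });
        boolean[] ctrl = CTRL.get();
        boolean savedInLoop = ctrl[0];
        ctrl[0] = true;                  // enter loop mode for the duration of body
        try {
            return body.get();
        } finally {
            ctrl[0] = savedInLoop;       // restored even if body throws
            if (topLevel) CTRL.remove(); // leave no trace after the top-level call
        }
    }

    static boolean inLoop() {
        boolean[] ctrl = CTRL.get();
        return ctrl != null && ctrl[0];
    }
}
```

The try/finally pair is the whole point: even "serialization within serialization" or an exception mid-stream cannot leave inLoop = true visible to an unrelated later call on the same thread.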

scabug commented Jan 16, 2013

@axel22 said:
Yes, something like beginWriteObject and endWriteObject as you described would be ideal.
I have not seen such a facility on ObjectOutputStream.

Otherwise, looks good to me!
You can fork the Scala repo and open a new branch issue/6961, apply the patch, build the repo and run all the tests (ant test), and then submit a pull request to the Scala repo on GitHub.
Or I can do this for you, as you like.

scabug commented Jan 16, 2013

Jan Vanek (j3vanek) said:
When it's there, we can rewrite it - well, I actually doubt it will come. Also, once closures don't cost anything, we can simplify the code that tests/creates/resets the thread-local; I believe this will happen much sooner than the first thing.

Thanks! I'll try to do that during the weekend. Of course if there is pressure, please do that now.

scabug commented Jan 16, 2013

Jan Vanek (j3vanek) said:
Alex, I tried the closure version, please have a look. The code is simpler, the performance is about the same (roughly 124 ms vs. 125 ms against the non-closure version), and there is some memory cost: one closure instance per writeObject call (each non-shared list). I'd go with it, but I'm not sure.

scabug commented Jan 16, 2013

@axel22 said:
I'd vote for the version without the extra object allocations; the difference in complexity does not seem to be huge.
I didn't understand - what was the performance difference you observed?

scabug commented Jan 16, 2013

Jan Vanek (j3vanek) said:
The non-closure version was minimally faster - like 752.5 ms vs. 760.7 ms (list size 500, serialized 2000x). OK, let's wait on the closure version until closures are free.

scabug commented Jan 16, 2013

@axel22 said:
Ok, thanks for doing this!
I think there is no rush currently; Jason, please correct me if I'm wrong.
If there is, I can submit a pull request; otherwise feel free to submit one and set me as a reviewer (review by @axel22).

scabug commented Jan 16, 2013

@retronym said:
The current plan would be for 2.10.1. We'll have to see if the fix meets our binary compatibility constraints. For that reason, it might be best for you to do this, Alex.

scabug commented Jan 16, 2013

Jan Vanek (j3vanek) said:
Alex, I noticed today that the class is called ListSerializationCtrl while the other two are ListSerializeStart and ListSerializeEnd. If I had noticed before, I'd have called it ListSerializeCtrl. Perhaps you can rename it when you do the patch.

scabug commented Jan 18, 2013

@axel22 said:
Ok, will do.

scabug commented Jan 18, 2013

@axel22 said:
Btw, Jason, about the binary compatibility constraints - what do we say about objects serialized with a different version of Scala?
Should they be deserializable in a newer version? Do we maintain binary compatibility there? If so, we may have to modify the patch to include legacy deserialization for 2.10.0.

scabug commented Jan 18, 2013

@axel22 said:
Related to that, I just noticed that in this commit:

scala/scala@b9e01a0

a part of the test just got commented out. That part of the test was supposed to check if legacy list deserialization works.
The commit message says:

Hmmm, the giant blob of binary data embedded in a test
suddenly stopped working. What does one do in this spot.

As I was unsure what the binary compatibility requirements say exactly about serialization, I had implemented legacy deserialization for lists serialized with 2.9.x.
It seems that just prior to that change somebody changed something in List and legacy deserialization stopped working (could this also have been due to a different JVM version?).

I've now also implemented a fallback for legacy deserialization of lists serialized with 2.10.0.
I can send it in; however, I see no way of writing a test for it other than the one I had in test/files/run/t5374.scala.

scabug commented Jan 18, 2013

Jan Vanek (j3vanek) said:
2.10 is not released yet - is it really necessary to support "legacy" 2.10 serialization? I would not hesitate to support it if 2.10 were already released.

scabug commented Jan 18, 2013

@axel22 said:
Anyway, the changeset is here: axel22/scala-github@3b23083
Will submit a PR as soon as tests pass.

In any case, I would be grateful for any pointers to what our binary compatibility policy has to say.

scabug commented Jan 18, 2013

@axel22 said:
Looks released to me:

http://www.scala-lang.org/node/27499

scabug commented Jan 18, 2013

Jan Vanek (j3vanek) said:
Uh, thanks for the wake-up call. And for the patch. The try/finally without {} looks good.

scabug commented Jan 18, 2013

@axel22 said:
No problem! :)

scala/scala#1923

scabug commented Jan 29, 2013

@axel22 said:
I think we might have a problem: for certain lists the running time of serialization becomes quadratic.
Think of a list holding objects, each of which points to the tail of that list.
If you take a look at writeObject, you'll see that recursive calls ensue, each of which does a while loop to traverse all the elements.

Increasing the list size above 32 results in really long running times. Not to mention that it can still result in a stack overflow.

import java.io._

object Test {

  def deepCopy[T](obj : T, reportSize: Boolean = false): T = {
    val baos = new ByteArrayOutputStream()
    val oos = new ObjectOutputStream(baos)
    oos.writeObject(obj)
    oos.flush()
    val data = baos.toByteArray
    if (reportSize) println(data.length)
    val bais = new ByteArrayInputStream(data)
    val ois = new ObjectInputStream(bais)
    ois.readObject().asInstanceOf[T]
  }
 
  def test() {
    case class Foo(tail: List[Foo])

    def create(len: Int): List[Foo] = if (len == 0) Nil else {
      val tail = create(len - 1)
      Foo(tail) :: tail
    }

    val xs = List.fill(32)(7)
    val dxs = deepCopy(xs, true)
    assert(xs == dxs)

    val ys = create(32)
    val dys = deepCopy(ys, true)
    assert(ys == dys)
  }

  def main(args: Array[String]) {
    test()
  }

}
scabug commented Jan 29, 2013

Jan Vanek (j3vanek) said (edited on Jan 29, 2013 2:49:29 PM UTC):
Interesting. A possible remedy would be to do two loops. In the first loop only the "lists" would be serialized, but empty - without the head. In the second loop we serialize the "heads". Currently, when we serialize a top-level list, its serialized stream contains a sequence of "in-loop" serialized lists, each of which represents just itself and its head, no tail. We can split it so that the top-level list contains two sequences: the sequence of sub-lists (just itself, no head, no tail) and the sequence of heads. Deserialization first creates the lists (without heads), then the second loop reads the heads (::.hd is a var, so it would be possible). WDYT? I'll give it a try later in the evening/night; I think it should work.

There is one thing: during deserialization, e.g. in your example, a Foo (which is a head) will be deserialized pointing to a list (Foo.tail) which exists, but whose heads were not yet deserialized. So if there is custom serialization code in Foo (readObject) which scans over the list, then we have a problem :-(. We could store and read the heads of those in-line tails in reverse order, but that is not really possible unless we allocate the whole damn list. And even if we did it in reverse order, it is not a guarantee: one could create a list whose sub-sub-...-list's head points to that list, and that head's readObject would see the list in a not-fully-deserialized state. As soon as we split into two loops, deserialization (readObject) can observe "weird" state. So I think we need to choose how to die. I'd vote for this new death: we'd have to tell users not to scan/iterate over lists in their readObject methods, because those lists might not be fully deserialized yet.

scabug commented Jan 29, 2013

@axel22 said:
We've talked about this at the meeting for a while now. The verdict preferred by most people was to revert list serialization back to what it was in 2.9 - no structural sharing, just output all the elements, followed by a special marker for the end.

The list buffers which suffered from the loss of structural sharing were fixed anyway. People who use lists and rely on structural sharing will have to implement custom serialization - the consensus was that the common use case is just having a very big list which should be serialized quickly, using as little space as possible. Additionally, this will simplify the list serialization code greatly.
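The reverted scheme - write all elements iteratively, then an end marker; read in a loop until the marker appears - can be sketched in Java as follows. The class names (MarkerListDemo, IntList, End) are illustrative stand-ins for the actual List code and its ListSerializeEnd marker; the point is that stack depth stays constant regardless of list length.

```java
import java.io.*;
import java.util.ArrayList;
import java.util.List;

// Sketch of 2.9-style list serialization: elements are written in a loop,
// terminated by a sentinel object, and read back in a loop until the
// sentinel appears. No recursion, so no stack overflow on long lists.
public class MarkerListDemo {
    static final class End implements Serializable {} // plays the role of ListSerializeEnd

    static class IntList implements Serializable {
        transient int[] elems;                        // not handled by default serialization
        IntList(int[] elems) { this.elems = elems; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            for (int e : elems) out.writeObject(e);   // iterative: constant stack depth
            out.writeObject(new End());               // end-of-list marker
        }

        private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            List<Integer> buf = new ArrayList<>();
            Object o;
            while (!((o = in.readObject()) instanceof End)) buf.add((Integer) o);
            elems = buf.stream().mapToInt(Integer::intValue).toArray();
        }
    }

    static IntList roundTrip(IntList l) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bytes)) { oos.writeObject(l); }
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (IntList) ois.readObject();
        }
    }
}
```

Note that shared tails are not preserved by this format, which is exactly the trade-off the meeting accepted.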

scabug commented Jan 29, 2013

Jan Vanek (j3vanek) said:
Makes sense. The fast, simple, non-shared serialization should be available. If there is a big need in the future, one could hypothetically provide something like List.sharedSerialization[T](body: => T): T, which would trigger the shared mode for the duration of body.

scabug commented Jan 31, 2013

Jan Vanek (j3vanek) said:
I've updated the gist to handle your last issue, as described above. For your reference:
https://gist.github.com/4517945

scabug commented Feb 7, 2013

@retronym said:
Moving back to 2.10.1, didn't we agree to simply revert the structural sharing change that triggered this bug?

scabug commented Feb 8, 2013

@JamesIry said:
Superseded by scala/scala#2093
