Performance issues #295

Closed
RuiAAPeres opened this Issue Nov 15, 2015 · 31 comments

@RuiAAPeres

RuiAAPeres commented Nov 15, 2015

I found that on a 4S device, my data model was taking almost 12 seconds to be parsed. So I did a small experiment with the following:

import Foundation
import Argo
import Curry
import Runes

struct Person {

    let firstName: String
    let secondName: String?
    let age: Int
    let hobbies: [String]
    let description: String
}

extension Person: Decodable {

    static func decode(j: JSON) -> Decoded<Person> {
        return curry(Person.init)
            <^> j <| "firstName"
            <*> j <|? "secondName"
            <*> j <| "age"
            <*> j <|| "hobbies"
            <*> j <| "description"
    }
}

This is not my original model, but a far simpler one. I used a test file with 100k lines of JSON containing a total of 9130 objects. The original one has more than 330k lines and a far deeper hierarchy and greater complexity (number of members).

I wasn't sure which would be the best approach, so I used the ones I saw in the wild (the measurements are in release mode on a MacBook Pro 2015):

[Screenshot: Argo decoding benchmark timings]

Then I did the parsing "by hand":

[Screenshot: hand-written parsing benchmark timings]

Obviously, the second version has a lot more boilerplate and isn't as easy to maintain or extend, but with such timings those niceties aren't worth it in this case. A demo project can be found here.

@RuiAAPeres

Author

RuiAAPeres commented Nov 15, 2015

Btw, this is the memory footprint while measuring test_measure_Argo_1 and test_measure_By_Hand. The red circle is from the latter.
[Screenshot: memory footprint while running both tests]

@jshier

Contributor

jshier commented Nov 16, 2015

First, 330k lines is a ridiculous amount of data for a mobile app. Can you not page in just bits of the data at a time?

In any case, I ran an equivalent test case through Instruments. The time profiler for:

let people: Decoded<[Person]> = [Person].decode(objectJSON)

breaks down like this:

10.1%  JSON.parse(AnyObject), mostly in swift_dynamicCast
87.3%  CollectionType<A where ...>.decode<A where ...>(JSON) -> Decoded<[A.Generator.Element]>
    78.4%  sequence<A> ([Decoded<A>]) -> Decoded<[A]>
    8.7%   CollectionType<A where ...>.map<A><B where ...>((A.Generator.Element) throws -> B) throws -> [B]

At the bottom, most of that is spent in the memory allocation and releasing for the subarrays that are built into the returned array. This seems to be caused by the use of reduce, as there's a lot of copying and releasing going on.

FWIW, replacing the contents of sequence<T>(xs: [Decoded<T>]) -> Decoded<[T]> with this:

let result = xs.map { $0.value! }
return pure(result)

reduced the time of this test case:

let person: Decoded<[Person]> = [Person].decode(ViewController.parsedJSON)

from 4.66204387699963s to 0.395208572999763s. Of course, I'm not entirely sure my replacement is functionally equivalent to what's already there, so take that with a grain of salt.

In any event it seems like reduce is a major performance bottleneck right now. Perhaps the functional purity of Argo could be sacrificed in this case?

Here are my Instruments results and test project.

@gfontenot

Collaborator

gfontenot commented Nov 16, 2015

That version of sequence would cause a crash if there were decoding failures, which isn't what we want.

That being said, I do think this is a problem and we should improve it. Interesting that this is where we run into this problem. I'd be interested in seeing if ditching our internal use of sequence entirely and using a single reduce for the decode will speed things up.

I just pushed a wip commit to gf-decode-performance. If either of y'all have a chance to re-run your tests against that branch I'd appreciate it. Otherwise, I'll run some tests whenever I have a chance.

@RuiAAPeres

Author

RuiAAPeres commented Nov 16, 2015

@gfontenot I did a quick test inside my project and saw a slight improvement (running on a 4S, it went from 11s to 9.7s).

@gfontenot

Collaborator

gfontenot commented Nov 16, 2015

Hmm. I was hoping that removing the double iteration would have more of an impact than that.

@jshier

Contributor

jshier commented Nov 17, 2015

@gfontenot You may have already seen this article on reduce by AirspeedVelocity, but it's a good examination of the performance of reduce, especially when it comes to building arrays. Fundamentally, using + to build arrays produces a copy on every iteration, so simple loop building is much faster.
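For illustration (this is a standalone sketch in modern Swift syntax, not Argo's actual code), the two array-building strategies the article contrasts look like this. The reduce-with-`+` version copies the accumulator on every iteration, making the total work quadratic, while plain append amortizes to linear:

```swift
// Illustrative sketch, not Argo's implementation: two ways to rebuild an
// array from its elements.

// Using reduce with `+` allocates and copies a fresh array on every
// iteration, so the total work is O(n^2) in the element count.
func buildWithReduce(_ xs: [Int]) -> [Int] {
    return xs.reduce([]) { acc, x in acc + [x] }
}

// A plain loop with append reuses the accumulator's storage, which grows
// geometrically, so the total work is O(n) amortized.
func buildWithAppend(_ xs: [Int]) -> [Int] {
    var acc: [Int] = []
    acc.reserveCapacity(xs.count)
    for x in xs {
        acc.append(x)
    }
    return acc
}
```

Both produce identical results; only the allocation behavior differs, which is why the difference only shows up on large collections.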

@gfontenot

Collaborator

gfontenot commented Nov 17, 2015

@jshier I hadn't seen that article, thanks for posting it.

I've removed reduce on that branch in favor of far worse code: a for loop and append. Mind re-running the tests?

@jshier

Contributor

jshier commented Nov 17, 2015

Still slow, taking 4.64611180400243s for my test case with the JSON already parsed. Most likely because of all the intermediate arrays that are being created when you do this:

 case (.Success(var a), .Success(let x)):
    a.append(x)
    accum = pure(a)

If this code is functionally the same, it's much faster (0.404742711998551s):

var accum: [T] = []

for elem in xs {
  switch elem {
  case let .Success(value):
    accum.append(value)
  case let .Failure(error):
    return .Failure(error)
  }
}

return pure(accum)

@gfontenot

Collaborator

gfontenot commented Nov 17, 2015

Unfortunately, that's not the same. Your version always returns an array, and so is always successful. My version only returns an array of elements if all are successful.

I'll need to think on this a bit. I'm not sure what the answer here is.

@jshier

Contributor

jshier commented Nov 17, 2015

How does my version always return an array? As soon as it encounters an error in xs, it returns that error. Only if it actually completes the loop successfully does my version return the array. Not that there couldn't be a more elegant solution.
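To make the short-circuiting concrete, here's a self-contained sketch of that loop-based sequence over a minimal stand-in for Argo's Decoded type (the enum below is illustrative, not Argo's real definition, which carries a richer error type):

```swift
// Minimal stand-in for Argo's Decoded<T>, just enough to show the
// early-return behavior of the loop-based sequence.
enum Decoded<T> {
    case Success(T)
    case Failure(String)
}

func sequence<T>(_ xs: [Decoded<T>]) -> Decoded<[T]> {
    var accum: [T] = []
    accum.reserveCapacity(xs.count)
    for elem in xs {
        switch elem {
        case let .Success(value):
            accum.append(value)
        case let .Failure(error):
            // The first failure aborts the loop and becomes the result,
            // so the array is only returned when every element succeeded.
            return .Failure(error)
        }
    }
    return .Success(accum)
}
```

The early return is what preserves the all-or-nothing semantics while avoiding the intermediate array copies that reduce incurred.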

@gfontenot

Collaborator

gfontenot commented Nov 17, 2015

Oh, wow, totally missed that. Sorry for skimming.

Yeah, that should be the same (not sure why I didn't think about the early return). I've pushed that change as well. @RuiAAPeres do you think you can run against this version?

@RuiAAPeres

Author

RuiAAPeres commented Nov 18, 2015

@gfontenot @jshier I used e130dfe0190fbd67274595b064fb38fa8028dee1 for the following tests:

P.S.: I am not using the demo project for these results, but a real one with a more complex hierarchy/number of members.


  • iPhone 4S (OS 8.4.1)
  • Release Mode
  • Parsing 89 objects with 33k lines JSON

Argo: 10.339s (STDEV 1%)
Manually: 0.884s (STDEV 1%)

Difference: 1069.5701357466064%

  • iPhone 6 Simulator (OS 9.1)
  • Debug Mode
  • Parsing 89 objects with 33k lines JSON

Argo: 0.881s (STDEV 5%)
Manually: 0.066s (STDEV 5%)

Difference: 1234.8484848484848%

  • iPhone 6 Simulator (OS 9.1)
  • Release Mode
  • Parsing 89 objects with 33k lines JSON

Argo: 0.882s (STDEV 5%)
Manually: 0.053s (STDEV 6%)

Difference: 1564.1509433962265%

@NachoSoto


NachoSoto commented Nov 18, 2015

Quick question: is the benchmark compiling both the code and the framework with -O and whole-module optimization?

@RuiAAPeres

Author

RuiAAPeres commented Nov 18, 2015

@NachoSoto the code is compiled with -O in release and -Onone in debug.

@RuiAAPeres

Author

RuiAAPeres commented Nov 18, 2015

btw @gfontenot I will start using the demo project from now on, since I don't have Argo anymore in the real project.

@gfontenot

Collaborator

gfontenot commented Nov 18, 2015

@RuiAAPeres Just out of curiosity, what were your numbers on this project before removing reduce?

@RuiAAPeres

Author

RuiAAPeres commented Nov 19, 2015

@gfontenot On the demo project, the results are at the top. On my own it was ~11.3s.

@jshier

Contributor

jshier commented Nov 23, 2015

Rerunning my test project on an actual iPhone 4S I had lying around, and after making sure whole module optimization was on, here are the results:
Before:

7.7683779166673s for parsing elapsed
79.7539868333333s for decoding elapsed

Newest from branch:

7.72551120833305s for parsing elapsed
3.00022145833464s for decoding elapsed

Here's the important code:

let parseStart = CACurrentMediaTime()
let parsedJSON = JSON.parse(ViewController.foundationJSON)
let parseEnd = CACurrentMediaTime()
print("\(parseEnd - parseStart)s for parsing elapsed")

guard case let .Array(array) = parsedJSON else { return }
print(array.count)

let startTime = CACurrentMediaTime()
let person: Decoded<[Person]> = [Person].decode(parsedJSON)
let endTime = CACurrentMediaTime()
print("\(endTime - startTime)s for decoding elapsed")

I'll take another look with Instruments and see what's hot now. I'm focusing on decode, but clearly parse could use some help as well, though from what I saw it's limited by the speed of swift_dynamicCast.

@jshier

Contributor

jshier commented Nov 23, 2015

Running Instruments on the newest code, here are the results for decode:

92.3%    CollectionType<A where ...>.map<A><B where ...>((A.Generator.Element) throws -> B) throws -> [B]
    90.1%    static Person.decode(JSON) -> Decoded<Person>
        42.6%    static <|| infix<A where ...> (JSON, String) -> Decoded<[A]>
        22.1%    static <| infix<A where ...> (JSON, String) -> Decoded<A>
        12.2%    static <*> infix<A, B> (Decoded<(A) -> B>, Decoded<A>) -> Decoded<B>
        8.2%     static <|? infix<A where ...> (JSON, String) -> Decoded<A?>
        3.0%     static <^> infix<A, B> ((A) -> B, Decoded<A>) -> Decoded<B>

As you can see, nearly all of the time is spent in the actual Person.decode function and the various operators. Notably, despite only one property being an array, 42.6% of the time is spent decoding that array, which seems disproportionate to me. There may be more to be gained there, but I think the sequence conversion has covered the easy wins. I'll post my results for parse next.

@jshier

Contributor

jshier commented Nov 23, 2015

Here are the results for parse:

98.8%    specialized static JSON.parse(AnyObject) -> JSON
    69.2%    swift_dynamicCast
    7.8%     specialized _VariantDictionaryStorage.updateValue<A, B where ...>(B, forKey : A) -> B?
    5.2%     specialized Dictionary.init<A, B where ...>(dictionaryLiteral : [(A, B)]...) -> [A : B]
    5.2%     _NativeDictionaryStorageOwner.__deallocating_deinit
    4.0%     specialized specialized CollectionType<A where ...>.map<A><B where ...>((A.Generator.Element) throws -> B) throws -> [B]
    0.7%     specialized Array.init<A>(_uninitializedCount : Int) -> [A]

As you can see, switching with `as` casts incurs a rather large swift_dynamicCast cost. Switching to the following code decreased parse time on the test file from 7.72551120833305s to 4.76936141666738s, a savings of almost 39%.

switch json {
case (is NSArray):
    let v = json as! [AnyObject]
    return .Array(v.map(parse))
case (is NSDictionary):
    let v = json as! [Swift.String: AnyObject]
    return .Object(v.map(parse))
case (is NSString):
    let v = json as! Swift.String
    return .String(v)
case (is NSNumber):
    let v = json as! NSNumber
    return .Number(v)
default: return .Null
}

This still incurs a rather large swift_dynamicCast cost, but it's only 56%. I'm not sure how to reduce casting further.

Hopefully when Apple open sources Swift (perhaps in time for Christmas?) we'll get a native JSON parsing solution that can be used directly, adapted more efficiently, or allows Argo to write its own parser and get back some of this cost.

@tonyd256

Contributor

tonyd256 commented Jan 4, 2016

@RuiAAPeres @jshier Hey guys, would one of you be able to try out that PR branch on your code and see if there are any improvements?

#313

@jshier

Contributor

jshier commented Jan 4, 2016

I'll give it a shot as soon as I can. It'll be interesting to see what the performance characteristics will be with large JSON files. I was planning on putting an overall performance update PR together as well, to change the reduce usage and some other stuff.

@tonyd256

Contributor

tonyd256 commented Jan 9, 2016

@jshier There are 3 different commits in that PR now. The first one has moderate speed up, the middle one has a large slow down, and the final one has the biggest speed up. If you are able to run your test on all three that would be amazing!

@jshier

Contributor

jshier commented Jan 9, 2016

I'll give it a look this weekend.

@jshier

Contributor

jshier commented Jan 11, 2016

@tonyd256 Here are my initial results on an iPhone 4s, iOS 9.2, Xcode 7.2, commits installed through CocoaPods but with whole module optimization enabled:

Baseline (2.2.0 release):
5.67829616666666s for parsing elapsed
76.4035759166667s for decoding elapsed

Commit 1:
0.480252583333254s for parsing elapsed
79.5167338333333s for decoding elapsed

Commit 2:
0.460698416666673s for parsing elapsed
90.4947031666666s for decoding elapsed

Commit 3:
0.48393733333296s for parsing elapsed
82.170605166667s for decoding elapsed

As you can see, they all had similar improvements to parse() performance, with different impacts elsewhere. The first commit is slightly faster than the original code, and the fastest overall. I'm going to try combining it with my other performance enhancements and see what the results are.

Edit:
I just realized that the parse times have the NSJSONSerialization time in them as well, which is around 0.45s. Future stats will have it broken out.

@jshier

Contributor

jshier commented Jan 11, 2016

Here are my results with the other improvement mentioned in this thread:

Commit 1 with reduce removal:
0.489127833333441s for Foundation parsing elapsed.
0.0113119166662727s for parsing elapsed
6.44222995833388s for decoding elapsed

Commit 1 with reduce removal and (is NSType) parsing:
0.445239208333987s for Foundation parsing elapsed.
0.00463695833332167s for parsing elapsed
5.51033829166681s for decoding elapsed

Commit 1 with reduce removal and isKindOfClass parsing:
0.456737041666202s for Foundation parsing elapsed.
0.00453112499963026s for parsing elapsed
5.50347774999955s for decoding elapsed

So it looks like there's still something to be gained by my casting reduction.

I'll put together a PR that removes reduce usage for building Arrays. I'll leave the Dictionary versions in place, as I have no data about their performance.

@jshier

Contributor

jshier commented Jan 11, 2016

Some more performance measurements:

Commit 3 with reduce removal:
0.461397250000118s for Foundation parsing elapsed.
0.0317401666670776s for parsing elapsed
8.26080233333323s for decoding elapsed

Commit 3 with reduce removal and (is NSType) parsing:
0.447976791666406s for Foundation parsing elapsed.
0.0250313333335725s for parsing elapsed
7.31569804166611s for decoding elapsed

Release with reduce removal:
0.433766999999534s for Foundation parsing elapsed.
5.34555225000076s for parsing elapsed
2.91640845833263s for decoding elapsed

Release with reduce removal and (is NSType) parsing:
0.446621999999479s for Foundation parsing elapsed.
4.43270670833408s for parsing elapsed
2.94564566666668s for decoding elapsed

Overall it seems like commit 1 of #313 leads to the best performance, in combination with the reduce removal and a reduction in the dynamic casting required. @tonyd256 You could add the casting reduction to your PR, or it can be added later once everything else has been integrated.

@jshier

Contributor

jshier commented Jan 11, 2016

I've created #317 to remove the usage of reduce to build collections. That's the big win for this issue.

@tonyd256

Contributor

tonyd256 commented Jan 11, 2016

Wow, I thought that third commit would be faster than it was. Good to see you were able to make the huge improvement with that PR.

gfontenot added a commit that referenced this issue Mar 11, 2016

Remove most uses of reduce.
Argo’s performance is quadratic over the size of the collection being
decoded, due to the use of reduce[1]. This commit removes most usages of
reduce in exchange for a for loop of the same functionality. This
improves the performance of decoding an array of over 9000 simple models
on an iPhone 4S by ~2600% (76.4s to 2.9s). See the conversation in the
performance GitHub issue[2] for my analysis and evolution of this code.

[1]: http://airspeedvelocity.net/2015/08/03/arrays-linked-lists-and-performance/
[2]: #295

@gfontenot

Collaborator

gfontenot commented Mar 11, 2016

I've released Argo 2.3 and would love to know if that solves most/all of the issues laid out here.

@gfontenot

Collaborator

gfontenot commented Apr 6, 2016

I'm going to close this due to inactivity, and because I'm fairly sure that Argo 2.3 greatly improves the situation. We'll continue to look for more performance improvements in future versions.

@gfontenot gfontenot closed this Apr 6, 2016
