session slowness #415

Closed
gregwebs opened this Issue Sep 1, 2012 · 18 comments

Comments

Projects
None yet
5 participants
@gregwebs
Member

gregwebs commented Sep 1, 2012

Lorenzo Bolla ✆ lbolla@gmail.com
1:37 AM (5 hours ago)

Warp
10250 req/s

Yesod
hamlet: 848 req/s
no-hamlet: 940 req/s (getHomeR = return ∘ RepPlain ∘ toContent $ "Hello World!")
no-session: 8000 req/s (makeSessionBackend = const $ return Nothing)

@lbolla

This comment has been minimized.

Show comment
Hide comment
@lbolla

lbolla Sep 1, 2012

Contributor

Code to reproduce the issue is here:
https://gist.github.com/3567006

See different revisions for the different configurations.

Contributor

lbolla commented Sep 1, 2012

Code to reproduce the issue is here:
https://gist.github.com/3567006

See different revisions for the different configurations.

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

Links to relevant code on master:

Member

meteficha commented Sep 1, 2012

Links to relevant code on master:

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

It seems you forgot to compile with -threaded.

On my machine, here's what I get compiling with -threaded, running with +RTS -N4 (I have six cores) and --num-calls=1000:

vanilla no-session:   41,485 req/s (24 us/req)
vanilla with-session:  3,251 req/s (307 us/req)
noencrypt:             3,678 req/s (264 us/req)

vanilla is yesod-core from Hackage while noencrypt is yesod-core changing CS.encrypt to Base64.encode and CS.decrypt to Base64.decode.

If I didn't mess anything up, then this means that time spent on vanilla with-session is composed of

Time to answer the request:       24 us
Time to encode/decode session:   240 us +
Time to encrypt/decrypt session:  43 us
                                 --------
                                 307 us

So the performance hog may not be clientsession (which was already benchmarked and is plenty fast) but the serialization (perhaps serialization of UTCTime?). More tests are needed.

Member

meteficha commented Sep 1, 2012

It seems you forgot to compile with -threaded.

On my machine, here's what I get compiling with -threaded, running with +RTS -N4 (I have six cores) and --num-calls=1000:

vanilla no-session:   41,485 req/s (24 us/req)
vanilla with-session:  3,251 req/s (307 us/req)
noencrypt:             3,678 req/s (264 us/req)

vanilla is yesod-core from Hackage while noencrypt is yesod-core changing CS.encrypt to Base64.encode and CS.decrypt to Base64.decode.

If I didn't mess anything up, then this means that time spent on vanilla with-session is composed of

Time to answer the request:       24 us
Time to encode/decode session:   240 us +
Time to encrypt/decrypt session:  43 us
                                 --------
                                 307 us

So the performance hog may not be clientsession (which was already benchmarked and is plenty fast) but the serialization (perhaps serialization of UTCTime?). More tests are needed.

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

Here's the profile for vanilla with-session: http://hpaste.org/74032. I've added SCCs to yesod-core as below:

encodeClientSession key iv expire rhost session' =
    ({-# SCC "encodeClientSession/encrypt" #-} CS.encrypt key iv $
    ({-# SCC "encodeClientSession/encode" #-} encode $ SessionCookie expire rhost session'))

First of all, decodeClientSession is never called AFAICS, so httperf probably isn't passing the cookies along.

Now, taking a look at the inherited columns, encodeClientSession/encrypt is taking 18.3% of all program time. By itself (individual columns), it's taking 4.7%. Its interesting children are Data.ByteString.Base64.encode, which is taking 5.4% (inherited), and encodeClientSession/encode, which is taking 7.6% (inherited) and is composed of just Data.Serialize.encode. Those three sum to 17.7%.

In other words, encodeClientSession seems to be the problem (we already knew that) but clientsession itself is just the smallest of the problems: we need to tackle Base64.encode and Data.Serialize.encode first. Besides, it's going to be very hard to extract more performance from clientsession (most of its time is spent on either AES or Skein).

Please, someone correct me if I'm wrong, I'm not a great profiler =).

Member

meteficha commented Sep 1, 2012

Here's the profile for vanilla with-session: http://hpaste.org/74032. I've added SCCs to yesod-core as below:

encodeClientSession key iv expire rhost session' =
    ({-# SCC "encodeClientSession/encrypt" #-} CS.encrypt key iv $
    ({-# SCC "encodeClientSession/encode" #-} encode $ SessionCookie expire rhost session'))

First of all, decodeClientSession is never called AFAICS, so httperf probably isn't passing the cookies along.

Now, taking a look at the inherited columns, encodeClientSession/encrypt is taking 18.3% of all program time. By itself (individual columns), it's taking 4.7%. Its interesting children are Data.ByteString.Base64.encode, which is taking 5.4% (inherited), and encodeClientSession/encode, which is taking 7.6% (inherited) and is composed of just Data.Serialize.encode. Those three sum to 17.7%.

In other words, encodeClientSession seems to be the problem (we already knew that) but clientsession itself is just the smallest of the problems: we need to tackle Base64.encode and Data.Serialize.encode first. Besides, it's going to be very hard to extract more performance from clientsession (most of its time is spent on either AES or Skein).

Please, someone correct me if I'm wrong, I'm not a great profiler =).

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

Here's another profile: http://hpaste.org/74041. This time, besides the SCCs from before, I've also added:

instance Serialize SessionCookie where
    put (SessionCookie a b c) =
      ({-# SCC "SessionCookie/put/1" #-} putTime a) >>
      ({-# SCC "SessionCookie/put/2" #-} put b) >>
      ({-# SCC "SessionCookie/put/3" #-} put (map (first unpack) c))

The inherited times I get for those three are:

SessionCookie/put/1: 2.8%
SessionCookie/put/2: 0.3%
SessionCookie/put/3: 1.3%

Note that putTime, which just serializes a simple UTCTime, is taking 2.8% of all program time, more than half of the time clientsession takes to encrypt and hash the data =). Looks like a nice optimization target.

The third put takes some time but it's going to be a lot harder to improve its performance, since it's basically a list of pairs ByteStrings.

Member

meteficha commented Sep 1, 2012

Here's another profile: http://hpaste.org/74041. This time, besides the SCCs from before, I've also added:

instance Serialize SessionCookie where
    put (SessionCookie a b c) =
      ({-# SCC "SessionCookie/put/1" #-} putTime a) >>
      ({-# SCC "SessionCookie/put/2" #-} put b) >>
      ({-# SCC "SessionCookie/put/3" #-} put (map (first unpack) c))

The inherited times I get for those three are:

SessionCookie/put/1: 2.8%
SessionCookie/put/2: 0.3%
SessionCookie/put/3: 1.3%

Note that putTime, which just serializes a simple UTCTime, is taking 2.8% of all program time, more than half of the time clientsession takes to encrypt and hash the data =). Looks like a nice optimization target.

The third put takes some time but it's going to be a lot harder to improve its performance, since it's basically a list of pairs ByteStrings.

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

I'm going to take a break for now. If anyone is feeling adventurous, try optimizing the putTime/getTime pair.

Member

meteficha commented Sep 1, 2012

I'm going to take a break for now. If anyone is feeling adventurous, try optimizing the putTime/getTime pair.

@lbolla

This comment has been minimized.

Show comment
Hide comment
@lbolla

lbolla Sep 1, 2012

Contributor

Nice job. putTime and getTime seem to convert UTC to JulianDates back and forth. They both rely on Data.Time.Clock.POSIX, so maybe we should store in the session cookie the seconds from Epoch rather than Julian dates or UTCtimes, and avoid expensive calls to diffUTCTime?

Contributor

lbolla commented Sep 1, 2012

Nice job. putTime and getTime seem to convert UTC to JulianDates back and forth. They both rely on Data.Time.Clock.POSIX, so maybe we should store in the session cookie the seconds from Epoch rather than Julian dates or UTCtimes, and avoid expensive calls to diffUTCTime?

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

@lbolla I'm going to try doing that right now. Setting up some criterion benchmarks...

Member

meteficha commented Sep 1, 2012

@lbolla I'm going to try doing that right now. Setting up some criterion benchmarks...

@gregwebs

This comment has been minimized.

Show comment
Hide comment
@gregwebs

gregwebs Sep 1, 2012

Member

@kazu-yamamoto has worked on date optimization for the logger, but that was mostly around formatting

Member

gregwebs commented Sep 1, 2012

@kazu-yamamoto has worked on date optimization for the logger, but that was mostly around formatting

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

Here's the criterion benchmark and the results from running it on my machine: http://hpaste.org/74052.

Using POSIXTime, we can get a 47% faster putTime and a 60% faster getTime (posix_int64).

Using hand written functions, we can get a 74% faster putTime and a 98% (!!) faster getTime (byhand_b). More specifically, putTime is going from 5.22 us to 1.35 us. I don't have any more ideas of optimizing putTime.

Also, on my test the original putTime_orig used 25 bytes, while putTime_byhand_b used only 8 bytes. That means that we are going to be able to save a colossal amount of 23 bytes on the base64 output! (putTime_posix_integer was even better, just 5 bytes, but putTime_byhand_b is faster and those are just 3 bytes).

Member

meteficha commented Sep 1, 2012

Here's the criterion benchmark and the results from running it on my machine: http://hpaste.org/74052.

Using POSIXTime, we can get a 47% faster putTime and a 60% faster getTime (posix_int64).

Using hand written functions, we can get a 74% faster putTime and a 98% (!!) faster getTime (byhand_b). More specifically, putTime is going from 5.22 us to 1.35 us. I don't have any more ideas of optimizing putTime.

Also, on my test the original putTime_orig used 25 bytes, while putTime_byhand_b used only 8 bytes. That means that we are going to be able to save a colossal amount of 23 bytes on the base64 output! (putTime_posix_integer was even better, just 5 bytes, but putTime_byhand_b is faster and those are just 3 bytes).

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 1, 2012

Member

Using toModifiedJulianDay/ModifiedJulianDay instead of addDays/diffDays does shave some nanoseconds off those timings (http://hpaste.org/74055), but I guess that most of the time is spent truncating the time of the day (and that's why putTime is so slower than getTime. It's stored as a Data.Fixed.Fixed, and its truncate implementation goes through Rational.

I won't be able to commit these new implementations onto yesod-core right now, though.

Member

meteficha commented Sep 1, 2012

Using toModifiedJulianDay/ModifiedJulianDay instead of addDays/diffDays does shave some nanoseconds off those timings (http://hpaste.org/74055), but I guess that most of the time is spent truncating the time of the day (and that's why putTime is so slower than getTime. It's stored as a Data.Fixed.Fixed, and its truncate implementation goes through Rational.

I won't be able to commit these new implementations onto yesod-core right now, though.

@kazu-yamamoto

This comment has been minimized.

Show comment
Hide comment
@kazu-yamamoto

kazu-yamamoto Sep 3, 2012

I don't understand the problem well.
But if you want fast formatter of time for HTTP Date (GMT only), use http-date.
If you want fast formatter of time for zoned date (e.g. JST), use unix-time.
If you run Yesod on Linux 2, gettimeofday() is a system call which is very slow.
In this case, you should consider to use date-cache.

I don't understand the problem well.
But if you want fast formatter of time for HTTP Date (GMT only), use http-date.
If you want fast formatter of time for zoned date (e.g. JST), use unix-time.
If you run Yesod on Linux 2, gettimeofday() is a system call which is very slow.
In this case, you should consider to use date-cache.

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 5, 2012

Member

Implemented newer functions on #418. My benchmarks suggest only a 5% to 17% performance increase, though, see b2a9beb for details.

Member

meteficha commented Sep 5, 2012

Implemented newer functions on #418. My benchmarks suggest only a 5% to 17% performance increase, though, see b2a9beb for details.

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Sep 5, 2012

Member

We still need more profiling =).

Member

meteficha commented Sep 5, 2012

We still need more profiling =).

@gregwebs

This comment has been minimized.

Show comment
Hide comment
@gregwebs

gregwebs Dec 30, 2012

Member

I saw some commits recently to improve session performance. What is the status of this ticket?

Member

gregwebs commented Dec 30, 2012

I saw some commits recently to improve session performance. What is the status of this ticket?

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg Dec 30, 2012

Member

I merged in @meteficha's code, though for the moment I had to comment out some of the bigger performance savings for backwards compatibility. Once we have a major version bump, I'll switch to the high-speed version.

I have no more work planned for this issue, I'm OK closing it if no one has anything else to add.

Member

snoyberg commented Dec 30, 2012

I merged in @meteficha's code, though for the moment I had to comment out some of the bigger performance savings for backwards compatibility. Once we have a major version bump, I'll switch to the high-speed version.

I have no more work planned for this issue, I'm OK closing it if no one has anything else to add.

@meteficha

This comment has been minimized.

Show comment
Hide comment
@meteficha

meteficha Jan 17, 2013

Member

The benchmarks should be redone now that clientsession uses cipher-aes directly, it should be better at least in theory =).

Member

meteficha commented Jan 17, 2013

The benchmarks should be redone now that clientsession uses cipher-aes directly, it should be better at least in theory =).

@snoyberg

This comment has been minimized.

Show comment
Hide comment
@snoyberg

snoyberg Mar 10, 2013

Member

@meteficha's code is now used exclusively on the yesod1.2 branch.

Member

snoyberg commented Mar 10, 2013

@meteficha's code is now used exclusively on the yesod1.2 branch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment