Using fatcache #12
Hi, is fatcache currently in production use? I'm interested in contributing to optimizing memcache for SSDs, and would like to start with performance analysis to expose optimization opportunities. Should I start with fatcache, or is there another deployment you would recommend? Thanks.

Comments
We are currently not running fatcache in production. Given the work at hand (a data insight project), my guess is that we will start looking at multiple cache solutions at the beginning of next year, and make a case for using fatcache for the long-tail part as we gain more insight into the data.
Hi Yue, thank you. I am an SSD architect at Intel. I'd like to determine the performance of memcache on our future SSDs. Annie
@annief the current perf numbers with a commodity Intel 320 SSD (https://github.com/twitter/fatcache/blob/master/notes/spec.md) can be found here: https://github.com/twitter/fatcache/blob/master/notes/performance.md One design decision in fatcache that limits how much it gets out of the SSD is that IO to disk is synchronous. Since fatcache is single threaded and uses asynchronous IO for the network, the synchronous IO to the SSD limits its performance (throughput, to be more precise). I believe we can make fatcache IO async by using epoll and the Linux AIO framework -- https://code.google.com/p/kernel/wiki/AIOUserGuide. If we can fix this, fatcache would become a viable alternative to in-memory caches.
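A minimal sketch of that epoll + Linux AIO combination, assuming libaio and an eventfd to deliver disk completions into the same event loop that watches sockets (the device path, sizes, and filename are placeholders, not fatcache's actual code):

```c
/* Minimal sketch: one event loop for sockets and SSD reads.
 * Linux kernel AIO (libaio) completions are surfaced through an
 * eventfd registered with epoll. Build with: gcc -o aio aio.c -laio
 * The device path is a placeholder. */
#define _GNU_SOURCE                             /* for O_DIRECT */
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void) {
    io_context_t ctx = 0;
    io_setup(128, &ctx);                        /* kernel AIO context */

    int efd = eventfd(0, EFD_NONBLOCK);         /* completion "doorbell" */
    int ep  = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(ep, EPOLL_CTL_ADD, efd, &ev);     /* client sockets join here too */

    int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);   /* placeholder SSD */
    void *buf;
    posix_memalign(&buf, 512, 4096);            /* O_DIRECT wants aligned buffers */

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);       /* read one page at offset 0 */
    io_set_eventfd(&cb, efd);                   /* bump efd when the read lands */
    io_submit(ctx, 1, cbs);                     /* returns without blocking */

    struct epoll_event out[16];
    int n = epoll_wait(ep, out, 16, -1);        /* the single-threaded loop */
    for (int i = 0; i < n; i++) {
        if (out[i].data.fd == efd) {
            uint64_t ndone;
            read(efd, &ndone, sizeof(ndone));   /* number of finished iocbs */
            struct io_event evs[16];
            int r = io_getevents(ctx, 1, 16, evs, NULL);
            for (int j = 0; j < r; j++)
                printf("read completed: %ld bytes\n", (long)evs[j].res);
        } /* else: a client socket is readable; parse memcache requests */
    }
    io_destroy(ctx);
    return 0;
}
```

The key property is that io_submit returns immediately, so the single thread keeps servicing the network while reads are in flight; the eventfd makes disk completions look like just another readable fd.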
We have a pull request adding async support to fatcache and I should look at it real soon. A collaboration would be fantastic.
I see: being both single threaded and synchronous can be a limit. Just so you know, I am very new to fatcache. I appreciate you helping me understand it better.
Would it be useful if I did a performance analysis to determine which performance bottlenecks in fatcache are worth fixing?
Happy to help with any questions you have about fatcache and/or its architecture. Please feel free to ask anything.
> Would making fatcache multithreaded (vs async) be a better longer-term scaling strategy?

I'm betting on async disk IO. If you look at the perf numbers in performance.md, you will notice that the trick is to increase the average queue size to get 100% utilization (see the %util iostat numbers). To do that, I had to run 8 instances of fatcache on a single SSD. It would be nice to get 100% util with just a single fatcache instance. Furthermore, as we evolve the fatcache architecture it would be nice to let a single fatcache instance talk to multiple SSDs at the same time (think of lvm), which would essentially allow us to use commodity SSDs that have limited parallelism (queue length).
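A minimal sketch of that lvm-style idea, assuming round-robin striping of slabs across devices (the device count, slab size, and function names are illustrative assumptions, not fatcache's real layout):

```c
/* Minimal sketch: stripe logical slabs over several SSDs so a single
 * process keeps a deep aggregate queue. Device count, slab size, and
 * function names are illustrative assumptions, not fatcache's layout. */
#define _GNU_SOURCE                           /* for O_DIRECT */
#include <fcntl.h>
#include <stdint.h>
#include <sys/types.h>

#define NDEV      4
#define SLAB_SIZE (1024 * 1024)               /* assume 1 MB slabs */

static int dev_fd[NDEV];                      /* one fd per SSD */

static void stripe_open(const char *paths[]) {
    for (int i = 0; i < NDEV; i++)
        dev_fd[i] = open(paths[i], O_RDWR | O_DIRECT);
}

/* Logical slab id -> (device fd, byte offset). Round-robin placement
 * sends consecutive slabs to different SSDs, so outstanding reads
 * spread across all device queues instead of piling up on one. */
static void stripe_locate(uint32_t slab_id, int *fd, off_t *off) {
    *fd  = dev_fd[slab_id % NDEV];
    *off = (off_t)(slab_id / NDEV) * SLAB_SIZE;
}
```

Striping consecutive slabs across devices spreads outstanding IOs over several hardware queues, which is exactly the aggregate queue depth a single commodity SSD cannot provide on its own.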
> Is fatcache an extension of twemcache or memcache with SSD consideration? Does it not inherit memcache's existing multithreaded framework?

No, it doesn't. It implements the memcache ASCII protocol (https://github.com/twitter/fatcache/blob/master/notes/memcache.txt). But instead of being multithreaded, its architecture is single threaded, which was popularized by key-value stores like redis. Single threaded makes sense because you can always run multiple instances of fatcache on one machine, or on multiple machines, to scale horizontally, with some kind of sharding layer on the client to route traffic to one of the fatcache instances.
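A minimal sketch of such a client-side sharding layer, assuming plain modulo hashing over a static instance list (production clients typically use consistent hashing or a proxy instead; the endpoints are placeholders):

```c
/* Minimal sketch: client-side sharding across fatcache instances.
 * FNV-1a picks the instance; the endpoints are placeholders. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char *instances[] = {            /* placeholder endpoints */
    "10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211",
};
#define NINSTANCES (sizeof(instances) / sizeof(instances[0]))

/* FNV-1a: a simple, well-known string hash. */
static uint64_t fnv1a(const char *key, size_t len) {
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (uint8_t)key[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

/* Every client must agree on this mapping for the shards to line up. */
static const char *route(const char *key) {
    return instances[fnv1a(key, strlen(key)) % NINSTANCES];
}

int main(void) {
    printf("GET user:42 -> %s\n", route("user:42"));
    return 0;
}
```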
> What in fatcache inherently requires serialization? I assume that's why you chose a single-threaded implementation?

Single threaded makes reasoning simple :) There is nothing in fatcache that is CPU intensive enough to warrant multithreading. Most of the work in fatcache is handling network IO and disk IO. Currently network IO is async, but disk IO is not. Sync disk IO is limiting fatcache performance.
Absolutely!
I understand what you mean now. You are counting on multiple instances (processes) to scale, not necessarily multiple threads. Makes good sense. Are the default settings optimal for a fatcache/memcache deployment? I'm looking for help from folks who run or oversee a memcache deployment to help get the right settings. Thanks.
We do use jumbo frames when possible. A single get/set can be split across multiple packets, but for most workloads we observe, simple key/value requests are usually smaller than one MTU, even without the jumbo frame config.
This is a great discussion. I have a couple of questions:

- Are there any other performance numbers available? I'm curious about how things work out when storing larger values.
- Would it make sense to use several SSDs in a box instead of one?
- annief, have you been testing this on your newer hardware?

Thanks!
Hi Archie, I am behind on what I said I would do. No performance numbers yet; my apologies. I will get back into that mode. Per your question on size, the way I think about it is: … Per your question on the number of SSDs: if your goal is to achieve the maximum possible throughput, more SSDs is better.
Can someone who has experience with production memcache help guide us on these questions? Thanks!
Hi Annie, looks like you closed the issue? We run memcache in production; get in touch.
Hi Archie, I clicked the wrong button! Let me reopen the issue. What is a good next step to incorporate your workload insights?
Hi Annie, just drop me a message -- my email's on my GitHub page.
Page 2 of this deck has a few numbers -- https://speakerdeck.com/manj/caching-at-twitter-with-twemcache-twitter-open-source-summit; it is a year old though. @thinkingfish can possibly answer your questions.
Regarding latencies, here is a profile of Twemcache by Rao Fu: …

When it comes to throughput, because of the existence of multiget and the …
This depends on the bottleneck. The overly simplified answer is "yes". If a …

-Yao (@thinkingfish)
Hi,
a. The challenge is knowing the size. Is the size obtained via trial and error to meet a given cache hit ratio?
In your setup, what would the requests/s-to-size ratio need to be to motivate moving to higher-capacity, lower-cost SSDs?
Hmmm, there are many variables that it depends on. On the service side, …

Let's fix the key/value size first. Assuming we have a 100-byte key+value, …

Now we can vary key/value size. If we go down in key/value size, the …

At a high level the incentive comes from datasets with a long tail (large …

All this calculation can be better presented with a chart. I haven't got …

-Yao (@thinkingfish)
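A rough sketch of this kind of back-of-envelope calculation, with every number below assumed purely for illustration (the item size, SSD capacity, and IOPS figures are not measurements from this thread):

```latex
% All numbers are illustrative assumptions, not measurements.
% Item size s = 100 B, SSD capacity C = 300 GB, SSD read rate R = 40k IOPS.
\[
  \text{items per SSD} = \frac{C}{s} = \frac{300\times10^{9}}{100} = 3\times10^{9}
\]
\[
  \text{sustainable request density} = \frac{R}{C}
    = \frac{4\times10^{4}\ \text{req/s}}{300\ \text{GB}} \approx 133\ \text{req/s per GB}
\]
% A workload whose requests/s-to-size ratio sits below R/C is
% capacity-bound, not IOPS-bound: the long tail can live on flash
% without demanding throughput the SSD could never deliver anyway.
```

Under those assumed numbers, a dataset demanding fewer than roughly 133 req/s per GB would keep the SSD below its IOPS ceiling, which is the regime where trading DRAM for flash pays off.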