do-not-design-unreliability

Some people say "you should always design systems that are
reliable". Others look at that statement and say "no, you don't have
to care about reliability when you're a young startup, it just slows
you down and costs way too much".

Then I look at these and go "I definitely fall in the first camp, but
the statement from the second camp makes no sense."

I think this is because if you're in the "please make N+m possible"
design mindset (N being the number of computers, or data centres, you
need to serve your load; m being the number of computers, or data
centres, you want as spares), what you (or, at least, I) mean when you
say "you should always design systems that are reliable" is not
"always deploy in N+m". It is actually "please don't unintentionally
restrict your design so that, when you become large and successful,
your deployment model is constrained by design decisions you made at
the start and can no longer engineer out, and your reliability
suffers as a result".

So, when you hear someone say "always design reliable systems", please
hear "never intentionally engineer unreliability and non-scalability
into your system".

Certainly, speaking only for myself, I don't mean "you should not
deploy unless you can deploy in at least two, ideally five, AWS
availability zones and aim for at least five, maybe six or seven, nines
of uptime".

But I do mean "please look at all bits of your design, figure out what
your 'cannot be replicated' bits are and at least consider how you can
get around the 'cannot be replicated'", on the basis that anything
that cannot be replicated is likely to become both a reliability and a
scalability problem down the line. And sometimes, the answer isn't
"replicate", it's "shard".

But if you ever end up deciding to shard your data storage, consider
using the "micro-sharding" pattern, rather than going the classic
"meh, we have three boxes, let's shard three ways". Large shards mean
sadness when a box fails: the shard that lived on that box has to
move, whole, to another box, and you end up with a severe load
balancing issue (with three boxes, one survivor now takes ~2 requests
for every request the other survivor takes). With micro-sharding, you
make lots of very small data shards, which are then distributed among
a small number of hosts. When one host falls over, you can distribute
its shards over all the remaining hosts, causing a much smaller
increase in load on each of them. It can also allow you to deal with
"hot shard" problems.
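To put rough numbers on the failover argument, here is a small sketch in Python. Everything in it is made up for illustration: the host names, the helper functions `assign`, `fail_host` and `imbalance`, and the assumption that every shard receives the same amount of traffic (which real shards rarely do). It compares what happens when one of three hosts dies with 3 large shards versus 300 micro-shards.

```python
from collections import Counter

def assign(num_shards, hosts):
    """Place shards on hosts round-robin; returns {shard_id: host}."""
    return {shard: hosts[shard % len(hosts)] for shard in range(num_shards)}

def fail_host(placement, dead, survivors):
    """Move each shard of the dead host to the least-loaded survivor."""
    # Count how many shards each survivor already holds.
    load = Counter({host: 0 for host in survivors})
    for host in placement.values():
        if host != dead:
            load[host] += 1
    new_placement = dict(placement)
    for shard, host in sorted(placement.items()):
        if host == dead:
            # Ties are broken by host name to keep the result deterministic.
            target = min(survivors, key=lambda h: (load[h], h))
            new_placement[shard] = target
            load[target] += 1
    return new_placement

def imbalance(placement):
    """Ratio between the busiest and the least busy host (1.0 is perfect)."""
    load = Counter(placement.values())
    return max(load.values()) / min(load.values())

hosts = ["box-a", "box-b", "box-c"]  # hypothetical host names

# Classic "three boxes, three shards": one survivor inherits a whole shard.
coarse = fail_host(assign(3, hosts), "box-a", ["box-b", "box-c"])

# Micro-sharding: 300 small shards, the orphans are spread over both survivors.
micro = fail_host(assign(300, hosts), "box-a", ["box-b", "box-c"])

print(imbalance(coarse))
print(imbalance(micro))
```

Under the equal-traffic assumption, the three-shard layout leaves one survivor carrying twice what the other carries, while the micro-sharded layout splits the orphaned shards evenly between the survivors.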

And, yes, it's perfectly OK to shard your data, say, 10 000 ways, then
store all those shards on a single backend. It's not even that
difficult to do, and by doing (or at least considering) it up front,
you end up with a system that inherently embraces scalability.
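One way to picture "10 000 shards on a single backend" is as two separate lookups: key to shard (fixed forever), and shard to backend (free to change). The sketch below is only an illustration under my own assumptions: the backend names `db-1` and `db-2`, the choice of SHA-256 as the hash, and the helpers `shard_for`, `backend_for` and `move_shards` are all hypothetical, and a real system would also have to copy the data when a shard moves.

```python
import hashlib

NUM_SHARDS = 10_000  # chosen up front and never changed

def shard_for(key: str) -> int:
    """Map a key to a shard id; depends only on the key, never on hosts."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS

# Day one: every shard lives on the same (hypothetical) backend.
shard_to_backend = {shard: "db-1" for shard in range(NUM_SHARDS)}

def backend_for(key: str) -> str:
    """Find the backend currently holding the shard this key belongs to."""
    return shard_to_backend[shard_for(key)]

def move_shards(shards, new_backend: str) -> None:
    """Repoint shards at another backend (data copying is not shown)."""
    for shard in shards:
        shard_to_backend[shard] = new_backend

key = "user:42"  # hypothetical key
shard_before = shard_for(key)
assert backend_for(key) == "db-1"

# Later: a second backend appears and takes over half of the shards.
move_shards(range(NUM_SHARDS // 2, NUM_SHARDS), "db-2")

# The key still hashes to the same shard; only the shard's location changed.
assert shard_for(key) == shard_before
print(backend_for(key))
```

The point of the indirection is that growing from one backend to many only edits the shard-to-backend table; no key ever has to be re-hashed.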