traceroute
- Wavelength-Division Multiplexing (WDM)
- optical fibers
- three-way handshake
- makes creating a new connection expensive
- connection reuse is critical
- TCP Fast Open (TFO)
- allows data transfer within the `SYN` packet
- congestion collapse
- prevents the sender from overwhelming the receiver with data it may not be able to process
- Approach:
- each side advertises its own receive window (rwnd) within the `ACK` packets
- size of the available buffer space to hold the incoming data
- originally 16 bits allocated for the rwnd size
- TCP window scaling - raises the max window size up to 1GB
$ sysctl net.ipv4.tcp_window_scaling
$ sysctl -w net.ipv4.tcp_window_scaling=1
- estimates the available capacity between the client/server by exchanging data
- Steps:
- server initializes a new congestion window (cwnd) variable per TCP connection, initialized to a conservative, system-specified value (`initcwnd` on Linux)
- new rule introduced: max amount of data in flight (not ACKed) is `min(rwnd, cwnd)`
- start slow and grow the window size as the packets are ACKed
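The slow-start growth described above can be sketched numerically. This is a simplified model (it ignores loss and ssthresh) assuming cwnd roughly doubles each roundtrip, capped by the receiver's window:

```python
# Simplified slow-start model: cwnd (in segments) roughly doubles each
# roundtrip, and the effective window is min(rwnd, cwnd).
def slow_start(initcwnd_segments, rwnd_segments, roundtrips):
    cwnd = initcwnd_segments
    sizes = []
    for _ in range(roundtrips):
        sizes.append(min(cwnd, rwnd_segments))
        cwnd *= 2  # each ACKed segment grows cwnd, doubling it per roundtrip
    return sizes

# With IW10 (RFC 6928) and an illustrative 64-segment receive window:
print(slow_start(10, 64, 5))  # [10, 20, 40, 64, 64]
```

Note how even with IW10 it takes several roundtrips before the connection can use the full receive window, which is why short, bursty connections rarely reach full throughput.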
- cwnd
- sender-side limit on the amount of data the sender can have in flight before receiving an `ACK` from the client
- NOT advertised/exchanged
- private variable maintained by server
- by default 4 segments
- 10 segments (IW10) in latest RFC 6928
- For every received `ACK` packet (roundtrip), two new packets can be sent
- NOT ideal for short and bursty connections
- request terminated before max window size reached
- Slow-Start Restart
- resets the congestion window of a connection after idling for a while
- big impact on performance for long-lived TCP connections
- for HTTP keepalive connections, it is recommended to disable Slow-Start Restart on the server
$ sysctl net.ipv4.tcp_slow_start_after_idle
$ sysctl -w net.ipv4.tcp_slow_start_after_idle=0
- the congestion avoidance window takes over when the amount of data in flight exceeds:
- the receiver's flow-control window, or
- system-configured congestion threshold (ssthresh) window, or
- until a packet is lost
- Algorithms:
- Additive Increase/Multiplicative Decrease (AIMD)
- too conservative
- Proportional Rate Reduction (PRR)
- improves the speed of recovery
- default in Linux 3.2+ kernel
- product of data link's capacity and its end-to-end delay
- max amount of unacknowledged data that can be in flight at any point in time
- If the connection is transmitting at a fraction of the available bandwidth,
- likely due to a small window size:
- a saturated peer advertising low receive window
- bad network weather and high packet loss resetting the congestion window
- explicit traffic shaping
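The bandwidth-delay product above is a straightforward multiplication; for illustration (the link numbers here are made up):

```python
# Bandwidth-delay product: how much data must be in flight (unacknowledged)
# to fully utilize a link of a given capacity and roundtrip time.
def bdp_bytes(bandwidth_bps, rtt_seconds):
    return bandwidth_bps / 8 * rtt_seconds

# A hypothetical 10 Mbps link with a 100 ms roundtrip time:
print(bdp_bytes(10_000_000, 0.1))  # 125000.0 bytes (~122 KB) must be in flight
```

If the rwnd/cwnd minimum is smaller than this product, the connection cannot saturate the link no matter how much bandwidth is available.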
- makes sure packets arrive in order
- happens within TCP layer
- the application has no visibility
- it only sees a delivery delay
Packet loss is OK.
No bit is faster than one that is not sent; send fewer bits
We can't make bits travel faster; but we can move the bits closer
TCP connection reuse is critical to improve performance
- Datagram
- vs. packet
- used by DNS
- WebRTC
- It's not IP's responsibility to guarantee delivery
- UDP datagrams have definitive boundaries
- each datagram is carried in a single IP packet
- each application read yields the full message
- a datagram is never split by UDP itself; datagrams larger than the MTU are fragmented at the IP layer, which is best avoided
- NAT
- NAT devices expire translation records after intermediate timeouts for UDP, and sometimes even for TCP
- No handshake
- No connection termination
- Has NAT Traversal problems
- designed to work on top of a reliable transport protocol
- TCP
- but over UDP is possible - DTLS
- TLS provides three essential services:
- encryption
- authentication
- data integrity
- public (asymmetric) key cryptography
- Message Authentication Code (MAC)
- one-way cryptographic hash function (checksum)
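The MAC idea above can be illustrated with Python's standard library; the key and message here are placeholders:

```python
import hmac
import hashlib

# A MAC binds a message to a shared secret key via a one-way hash function,
# letting the receiver detect any tampering (data integrity).
key = b"shared-secret"    # placeholder key
msg = b"record payload"   # placeholder message
tag = hmac.new(key, msg, hashlib.sha256).hexdigest()
print(len(tag))  # 64 hex characters, i.e. a 256-bit digest
```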
- Things to do:
- agree on the version of TLS
- choose the ciphersuite
- verify certificates
- `ClientHello` and `ServerHello` are in plain text
- By default the handshake requires two roundtrips to complete
- to optimize, session resumption and false start
- use Diffie-Hellman key exchange
- client and server negotiate a shared secret without explicitly communicating it in the handshake
- Forward Secrecy: Diffie-Hellman + ephemeral session keys
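A toy Diffie-Hellman exchange shows how both sides arrive at the same secret without ever transmitting it. The parameters below are tiny demo values and deliberately NOT secure:

```python
# Toy Diffie-Hellman: p and g are tiny demo parameters, NOT secure.
p, g = 23, 5
a, b = 6, 15            # private keys, chosen here only for illustration
A = pow(g, a, p)        # client sends A over the wire
B = pow(g, b, p)        # server sends B over the wire
# Each side combines its own private key with the other's public value:
assert pow(B, a, p) == pow(A, b, p)
print(pow(B, a, p))     # the shared secret itself never crosses the wire
```

With ephemeral keys (a fresh `a` and `b` per session), compromising a long-term key later does not reveal past session secrets, which is the forward secrecy property noted above.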
- ALPN (Application Layer Protocol Negotiation)
- TLS extension
- introduces support for application protocol negotiation into TLS handshake
- client puts `ProtocolNameList` in `ClientHello`
- server puts `ProtocolName` in `ServerHello`
- SNI (Server Name Indication)
- part of the handshake
- TLS + SNI == `Host` header in HTTP
- resume/share the same negotiated secret key between multiple connections
- server creates/sends a 32-byte session identifier as part of `ServerHello`
- keeps it in cache
- maintains a session cache for every client
- afterwards, client could include the session ID in `ClientHello` for a subsequent session
- removes a round trip
- Most modern browsers intentionally wait for the first TLS connection to complete before opening new connections to the same server
- subsequent TLS connections can reuse the SSL session parameters to avoid the costly handshake
- Requires careful thinking for multi-server deployment
- removes the requirement for server to keep per-client session state
- server creates a New Session Ticket record, encrypted by secret key on server's side
- session ticket stored on client only
- included in the `SessionTicket` extension within `ClientHello` of a subsequent session
- all session data stored on client
- Usecase: deploying session tickets across a set of load-balanced servers
- rotating the shared key across all servers periodically
- Root CA
- Certificate Revocation List (CRL)
- OCSP (Online Certificate Status Protocol): check the status of the certificate in real-time
- For identifying different types of messages via the Content Type field
- handshake
- alert
- data
- Max size 16KB
- Small records incur a larger overhead due to record framing
- Large records will have to be delivered and reassembled by TCP layer before they can be processed/delivered
- The operational pieces for TLS deployment:
- how/where the servers are deployed
- size of TLS record
- size of memory buffers
- size of certificate
- support for abbreviated handshakes
- up to three roundtrips to set up TCP+TLS session
- place servers closer to the user!
- nearby server establishes a pool of long-lived, secure connections to the origin servers
- proxy all incoming requests/responses to/from origin servers
- CDN (or proxy server) can maintain a "warm connection pool" to relay data to origin servers
- servers with multiple processes/workers should use a shared session cache
- In a multi-server setup, routing the same client IP or the same TLS session ID to the same server is one way to provide good session cache utilization
- A shared cache should be used between different servers (where "sticky" load balancing is not an option)
- secure mechanism needed to share/update secret keys to decrypt the provided session tickets
- One roundtrip handshake for new and repeat visitors
- optional protocol extension
- allows sender to send application data when the handshake is only partially complete
- application data sent alongside the `ClientKeyExchange` record
- only affects the protocol timing of when the application data can be sent
- max size of each record is 16KB
- 20 - 40 bytes of overhead for MAC, padding, etc
- IP and TCP overhead (if record can fit into a single TCP packet)
- 20-byte header for IP
- 20-byte header for TCP
- The smaller the record, the higher the framing overhead
- NOT necessarily a good idea to increase record size to max 16KB
- if record spans multiple TCP packets, TLS must wait for all TCP packets to arrive,
- before decrypting the data
- additional latency!
- Small records incur overhead; large records incur latency
- For web applications (consumed by the browser): dynamically adjust record size based on TCP connection state
- when:
- connection is new and TCP congestion window is low, or
- connection has been idle for some time (Slow-Start Restart)
- each TCP packet should carry exactly one TLS record, with max segment size (MSS) allocated by TCP
- when:
- connection congestion window is large and a large stream is being transferred
- size of the TLS record can be increased to span multiple TCP packets (up to 16KB) to reduce framing and CPU overhead on the client and server
- Goal - to minimize buffering at the application layer due to lost packets, reordering, and retransmissions
- if the TCP connection has been idle
- if slow-start restart is disabled on server
- best strategy: decrease record size when sending a new burst of data
- Small record eliminates unnecessary buffering latency and improves time-to-first-{HTML byte, .., video frame}
- Larger record optimizes throughput by minimizing overhead of TLS for long-lived streams
- Typical strategy:
- increase record size to up to 16KB after `X` KB of data transferred
- reset record size after `Y` milliseconds of idle time
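The typical strategy above can be sketched as a small state machine. The `SMALL`/`LARGE`/`X`/`Y` values below are illustrative placeholders, not recommendations:

```python
import time

# Sketch of dynamic TLS record sizing: start with small records (one per
# TCP packet), grow once enough bytes have flowed, reset after idle time.
SMALL, LARGE = 1400, 16384        # ~one MSS vs. max TLS record (placeholders)
X_BYTES, Y_IDLE = 64 * 1024, 1.0  # illustrative thresholds

class RecordSizer:
    def __init__(self):
        self.sent = 0
        self.last_write = time.monotonic()

    def next_record_size(self):
        now = time.monotonic()
        if now - self.last_write > Y_IDLE:
            self.sent = 0             # idle: assume cwnd has reset, start small
        self.last_write = now
        return SMALL if self.sent < X_BYTES else LARGE

    def on_write(self, nbytes):
        self.sent += nbytes

sizer = RecordSizer()
print(sizer.next_record_size())  # 1400: a new burst starts with small records
sizer.on_write(128 * 1024)
print(sizer.next_record_size())  # 16384: warm connection uses large records
```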
- support for lossless compression of data transferred within record protocol
- compression algo negotiated during TLS handshake
- compression applied prior to encryption of each record
- HOWEVER, should be DISABLED on server!
- Server should be configured to Gzip all text-based assets
- Verify the server does not forget to include all intermediate certificates during the handshake
- otherwise browser has to verify itself
- a new DNS lookup, TCP connection, HTTP GET request
- Minimize the size of certificate chain
- Ideally the sent certificate chain should contain exactly two certificates:
- the site's certificate
- the CA's intermediary certificate
- NO NEED for the site to include root certificate of their CA
- browser needs to check the certificate is not revoked
- an OCSP request during the verification process for "real-time" check
- OCSP Stapling
- server includes (staples) the OCSP response from the CA to its certificate chain
- so that browser can skip the online check
- server can cache the signed OCSP response
- a security policy mechanism that allows server to declare rules to (compliant) browsers
- via HTTP header
Strict-Transport-Security: max-age=31536000
- Enable/configure session caching and stateless resumption
- Monitor session caching hit rates
- Configure forward secrecy ciphers to enable TLS False Start
- Terminate TLS session closer to the user to minimize roundtrip latencies
- Use dynamic TLS record sizing
- Ensure your certificate chain does not overflow the initial congestion window
- Remove unnecessary certificates from chain; minimize the depth
- Configure OCSP on server
- Disable TLS compression on server
- Configure SNI support on server
- Append HSTS header
- client sends the `Connection: close` header to terminate a persistent connection
- Technically, either side can terminate the connection
- For HTTP/1.1, `Connection: Keep-Alive` is not needed
- Page Load Time (PLT)
- time to onload event in the browser
- fired by browser once the document and all its dependent resources (JavaScript, images, etc) have finished loading
- HTML document parsed -> DOM
- CSS parsed -> CSSOM
- DOM + CSSOM -> RenderTree
- JavaScript can block both DOM and CSSOM
- Construction of DOM and CSSOM is intertwined
- DOM construction cannot proceed until JS is executed
- JS execution cannot proceed until CSSOM is available
- that's why styles are put at top and scripts at the bottom!
- Less than 100ms: instant
- Less than 250ms to keep users engaged
- HTML parsing is performed incrementally
- For many requests, response times are often dominated by:
- roundtrip latency
- server processing time
- For most web applications,
- bandwidth is not the limiting performance factor
- the real bottleneck is the network roundtrip latency between client and server
- Three tasks of a web program:
- fetching resources
- page layout
- JavaScript execution
- Rendering & scripting
- single-threaded
- interleaved model of execution
- bandwidth limited vs. latency limited
- Number of roundtrips is largely due to handshakes to start communicating between client & server:
- DNS, TCP, HTTP
- TCP slow start
- Navigation Timing
- User Timing
- Resource Timing
- via `performance.timing`
- When analyzing performance data, look at the underlying distribution
- NOT the averages
- look at histograms, medians, and quantiles
- Two broad classes:
- Document-aware optimization
- resource priority assignments
- lookahead parsing
- Speculative optimization
- pre-resolving DNS names
- pre-connecting to likely hostnames
- Resource pre-fetching and prioritization
- document/CSS/JS parsers
- blocking resources required for first rendering are given high priority
- DNS pre-resolve
- likely hostnames pre-resolved ahead of time to avoid DNS latency on a future HTTP request
- triggered through:
- navigation history
- user action
- hovering over a link
- TCP pre-connect
- following a DNS resolution,
- browser may speculatively open the TCP connection in an anticipation of an HTTP request
- Page pre-rendering
- allows user to hint the likely next destination
- pre-renders the entire page in a hidden tab
- How can we (developers) help?
- Critical resources (CSS/JS) should be discovered as early as possible in the document
- CSS should be delivered as early as possible to unblock rendering and JS execution
- Non-critical JS should be deferred to avoid blocking DOM/CSSOM construction
- HTML document parsed incrementally by the parser - document should be periodically flushed for best performance
- Hints for browser for additional optimizations:
<link rel="dns-prefetch" href="//hostname_to_resolve.com">
<link rel="subresource" href="//javascript/myapp.js">
<link rel="prefetch" href="//images/big.jpeg">
<link rel="prerender" href="//example.org/next_page.html">
- Improvements over HTTP/1.0:
- persistent connections to allow connection reuse
- chunked transfer encoding to allow response streaming
- request pipelining to allow parallel request processing
- byte serving to allow range-based resource requests
- improved (much better-specified) caching mechanisms
- Networking optimizations:
- reduce DNS lookups
- Add an Expires header and configure ETags
- Gzip assets
- all text-based assets should be compressed with gzip
- Avoid HTTP redirects
- especially redirecting to a different hostname - additional DNS lookup, TCP connection latency, etc
- a.k.a. persistent connection
- enabled by default on HTTP/1.1
Connection: Keep-Alive
- a strict FIFO queuing order on the client:
- dispatch request
- wait for full response
- dispatch next request from the client queue
- Allows for relocating the FIFO queue from the client (request queuing) to the server (response queuing)
- server could process the requests in parallel
- However, data from multiple responses CANNOT be interleaved (multiplexed) on the same connection,
- forcing each response to be returned in full before the bytes for the next response can be transferred
- Head-of-line blocking
- When processing requests in parallel, server must buffer pipelined responses, which may exhaust server resources
- A failed response may terminate the TCP connection, forcing client to re-request all subsequent resources
- Not widely enabled
- Up to 6 connections per host
- NOTICE: per host!
- Reasons to use:
- as workaround for limitations of HTTP
- as workaround for low starting congestion window size in TCP
- as workaround for clients that cannot use TCP window scaling
- increase browser's connection limit
- more shards, higher parallelism
- HOWEVER, every new hostname:
- requires additional DNS lookup
- consumes additional resources on both sides for each additional socket
- In practice, different shards could resolve to the same IP
- they are CNAME DNS records
- the browser connection limits are enforced on hostnames, NOT IP
- What affects the optimal number of shards?
- number/size/response-time of each resource
- client latency & bandwidth
- Each browser-initiated HTTP request carries at least an additional 500-800 bytes of HTTP metadata
- worse - cookies
- HTTP/1.1 does not define size limit of HTTP headers,
- but 8KB or 16KB limit widely adopted
- Bundle multiple resources into a single network request
- Concatenation: multiple JS/CSS files combined into a single resource
- Spriting: multiple images combined
- Application-level pipelining
- NOT friendly to caching!
- a single update to any one individual file invalidates the cache!
- Increased memory usage!
- all decoded images are stored as memory-backed RGBA bitmaps within browser
- one byte for each of the RGBA
- 4 bytes for each pixel
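The memory cost above is easy to quantify; the sprite dimensions here are just an example:

```python
# Each decoded pixel occupies 4 bytes (one byte per RGBA channel) in the
# browser's in-memory bitmap cache.
def decoded_size_bytes(width, height):
    return width * height * 4

# A hypothetical 1024x1024 sprite:
print(decoded_size_bytes(1024, 1024))  # 4194304 bytes = 4 MB of memory,
                                       # regardless of the compressed file size
```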
- Affecting execution
- JS and CSS parsing & execution is held back until entire file is downloaded
- no incremental execution!
- What is the ideal size for a CSS/JS file?
- probably 30 - 50 KB (compressed)
- Some places considered worth optimizing:
- separating and delivering first-paint critical CSS from the rest of CSS
- separating and delivering smaller JS chunks for incremental execution
- Embed the resource within document itself
- using the data URI scheme
<img src="data:image/gif;base64,xxxxxxxx" alt="sample image" />
- Rule of thumb:
- inline resources under 1 - 2 KB
- probably NOT for frequently changed resources
- Considerations:
- if files are small and limited to specific pages, consider inlining
- if the small files are frequently reused across pages, consider bundling
- if the small files have high update frequency, keep them separate
- Minimize the protocol overhead by reducing the size of HTTP cookies
- Binary Framing Layer
- frame - smallest unit of communication in HTTP/2
- contains frame header
- may be interleaved across different streams
- reassembled via the embedded stream identifier in the header of each frame
- stream - bidirectional flow of bytes within an established connection
- carrying one or more messages
- message - a complete sequence of frames that map to a logical request/response
- ALL communication performed over a single TCP connection that can carry any number of bidirectional streams
- Each stream has:
- a unique identifier
- optional priority information
- HTTP/2 enables full request/response multiplexing
- Each stream may be assigned an integer weight between `1` and `256`
- Each stream may be given an explicit dependency on another stream
- prioritization tree
- dependency: referencing the other stream's ID as parent
- streams that do not depend on each other share an implicit root stream as parent
- if possible, a parent stream should be allocated resources ahead of the streams that depend on it
- e.g. deliver response `<parent>` before response `<child>`
- streams with the same parent (siblings) should be allocated resources in proportion to their weight
- client is allowed to update preferences (dependencies & weights) at any point
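The proportional sibling allocation can be computed directly; the stream names and weights below are arbitrary examples:

```python
# Sibling streams share their parent's resources in proportion to weight.
def allocate(weights, resource=1.0):
    total = sum(weights.values())
    return {name: resource * w / total for name, w in weights.items()}

# Two example sibling streams with weights 12 and 4:
print(allocate({"style.css": 12, "photo.jpg": 4}))
# {'style.css': 0.75, 'photo.jpg': 0.25}
```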
- Browser request prioritization
- can learn priority from previous visits
- if rendering was blocked on a certain asset in a previous visit, then
- the same asset may be prioritized higher in the future
- HTTP/2 connections are persistent
- only one connection per origin
- using one TCP connection per origin:
- still head-of-line blocking at TCP level
- when packet loss occurs, TCP congestion window size is reduced
- reduces max throughput of the entire connection
- effects of bandwidth-delay product may limit connection throughput if TCP window scaling is disabled
- What if multiple connections per origin?
- less effective header compression due to distinct compression contexts
- less effective request prioritization due to distinct TCP streams
- less effective utilization of each TCP stream
- higher likelihood of congestion due to more competing flows
- increased resource overhead due to more TCP flows
- flow control is directional
- each receiver may choose to set any window size for each stream and the entire connection
- flow control is credit-based
- each receiver advertises initial connection and stream flow control window (bytes)
- window reduced whenever the sender emits a `DATA` frame
- incremented via a `WINDOW_UPDATE` frame sent by the receiver
- flow control CANNOT be disabled
- when the HTTP/2 connection is established, `SETTINGS` frames are exchanged between client and server
- set flow control window sizes in both directions
- default value for the flow control window is `65535` bytes
- maintained by `WINDOW_UPDATE` frames whenever any data is received
- flow control is hop-by-hop
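The credit-based accounting can be sketched from the sender's point of view, using the default window size noted above:

```python
# Credit-based flow control: the sender's view of one stream's window.
class StreamWindow:
    def __init__(self, initial=65535):   # default per-stream window
        self.window = initial

    def can_send(self, nbytes):
        return nbytes <= self.window

    def on_data_sent(self, nbytes):      # DATA frame emitted: spend credit
        assert self.can_send(nbytes)
        self.window -= nbytes

    def on_window_update(self, nbytes):  # WINDOW_UPDATE received: refill credit
        self.window += nbytes

w = StreamWindow()
w.on_data_sent(65535)
print(w.can_send(1))   # False: credit exhausted, sender must wait
w.on_window_update(32768)
print(w.can_send(1))   # True: receiver granted more credit
```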
- server can send multiple responses for a single client request
- one-to-many
- server-initiated push
- server initiates new streams (promises) for push resources
- NOTE these are NEW streams!
- Inlining resources is actually server push
- Pushed resources can be:
- cached by client
- reused across different pages
- multiplexed alongside other resources
- prioritized by server
- declined by client
- Security policy:
- pushed resources must obey same-origin: server must be authoritative for the provided content
PUSH_PROMISE
- all server push streams are initiated with `PUSH_PROMISE` frames
- delivery order is critical!
- client needs to know which resources the server intends to push
- to solve this, server sends all `PUSH_PROMISE` frames, which contain just the HTTP headers of the promised resource, ahead of the parent's response (i.e. `DATA` frames)
- clients can reject a push using `RST_STREAM`
- request/response headers compressed using HPACK compression format
- Allows individual values to be compressed when transferred,
- the indexed list of previously transferred values allows for encoding duplicate values by transferring index values that can be used to efficiently look up and reconstruct the full header keys and values
- static table & dynamic table
- static table
- defined in specification
- provides list of common HTTP header fields
- dynamic table
- initially empty
- updated based on exchanged values within a particular connection
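A much-simplified model of the indexed-table idea follows; this is not the actual HPACK encoding or table layout, just the core intuition:

```python
# Toy indexed header table: the first occurrence of a header sends the
# literal and indexes it; repeats send only the index. Real HPACK also
# Huffman-codes literals and mixes a static and a dynamic table.
class HeaderTable:
    def __init__(self):
        self.table = []

    def encode(self, name, value):
        entry = (name, value)
        if entry in self.table:
            return ("indexed", self.table.index(entry))
        self.table.append(entry)
        return ("literal", name, value)

enc = HeaderTable()
print(enc.encode(":method", "GET"))  # ('literal', ':method', 'GET')
print(enc.encode(":method", "GET"))  # ('indexed', 0): a repeat costs one index
```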
- To negotiate HTTP/2 protocol
- TLS & ALPN is recommended
- client & server negotiate the desired protocol as part of TLS handshake
- without adding extra latency/roundtrips
- Establishing HTTP/2 connection over a regular, non-encrypted channel is still possible
- use the HTTP `Upgrade` mechanism
Connection: Upgrade, HTTP2-Settings
Upgrade: h2c
- initial HTTP/1.1 request with HTTP/2 upgrade headers
- server could:
- reject by returning `HTTP/1.1 200 OK`
- accept by returning `HTTP/1.1 101 Switching Protocols`
- then immediately switch to HTTP/2
- return response using the new binary framing protocol
- frames exchanged between client & server
- all frames have a 9-byte header
- 24-bit length field
- theoretical 2^24 bytes (16MB) data per frame
- but default is 2^14 bytes (16KB)
- bigger is NOT always better!
- 8-bit type field
- format
- semantics
- 8-bit flag field
- 1-bit reserved field - always set to 0
- 31-bit stream identifier, uniquely identifying the HTTP/2 stream
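The 9-byte header layout above can be parsed in a few lines; the sample bytes here are fabricated for illustration:

```python
# Parse an HTTP/2 frame header: 24-bit length, 8-bit type, 8-bit flags,
# then 1 reserved bit plus a 31-bit stream identifier.
def parse_frame_header(buf):
    length = int.from_bytes(buf[0:3], "big")
    frame_type = buf[3]
    flags = buf[4]
    stream_id = int.from_bytes(buf[5:9], "big") & 0x7FFFFFFF  # drop the R bit
    return length, frame_type, flags, stream_id

# Fabricated header: a 16-byte DATA frame (type 0x0) with the END_STREAM
# flag (0x1) on stream 1:
hdr = b"\x00\x00\x10" + b"\x00" + b"\x01" + b"\x00\x00\x00\x01"
print(parse_frame_header(hdr))  # (16, 0, 1, 1)
```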
- client initiates by sending a `HEADERS` frame
- payload in `DATA` frames
- flow control is applied only to `DATA` frames
- non-`DATA` frames always processed with high priority!
- To eliminate stream ID collisions between client- and server-initiated streams:
- client-initiated streams have odd IDs
- server-initiated streams have even IDs
- payload can be split into multiple `DATA` frames
- the last frame carries the `END_STREAM` flag indicating the end of the message
- Reduce DNS lookups
- Reuse TCP connections
- keepalive wherever possible
- Minimize number of HTTP redirects
- redirect to a different origin can result in DNS, TCP, TLS roundtrips
- Reduce roundtrip times
- Eliminate unnecessary resources
- Cache resources on the client
- Compress assets during transfer
- Eliminate unnecessary request bytes
- HTTP cookies
- Parallelize request/response processing
- Apply protocol-specific optimizations
- You should specify both:
- `Cache-Control` - cache lifetime
- `Last-Modified` & `ETag` - validation
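For example, a response carrying both a cache lifetime and validators might look like this (values are illustrative):

```
Cache-Control: max-age=31536000
Last-Modified: Tue, 01 Jan 2019 00:00:00 GMT
ETag: "5e1a2b3c"
```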
- gzip
- HTTP State Management Mechanism
- extension to HTTP
- allows for cookies
- saved by browser
- auto appended onto every request to the origin within the `Cookie` header
- allowed to associate many cookies per origin
- Best practices:
- Transfer the min amount of required data (e.g. secure session token)
- leverage shared session cache on server to lookup other metadata
- without `Keep-Alive`, a new TCP connection is required for each HTTP request
- browser uses a preload scanner
- only when the document parser is blocked
- HOWEVER, NOT applicable for resources scheduled via JS
- it cannot speculatively execute scripts
- Leverage HTTP pipelining
- if you control both client and server
- Domain sharding
- Bundle resources to reduce HTTP requests
- Inline small resources
- requests are cheap
- both requests & responses can be multiplexed efficiently
- Domain sharding is an anti-pattern!
- HTTP/2 has a connection-coalescing mechanism
- allows client to
- coalesce requests from different origins
- then dispatch them over the same connection when the following conditions are satisfied:
- origins are covered by same TLS certificate
- wildcard certificate, or
- certificate with matching Subject Alternative Names
- origins resolve to the same server IP address
- resource inlining is a form of application-layer server push
- with HTTP/2, no longer a reason to inline resources just because they are small
- more of latency optimization
- prime candidates:
- critical resources that block page construction
- Client can control how/where server push is used, by indicating to server:
- the max number of pushed streams initiated by server
- the amount of data that can be sent on each stream before being acknowledged by the client
- server push subject to same-origin restrictions
- server can learn from `Referer` headers
- and auto initiate server push for related resources
- A well-implemented server should give precedence to high priority streams, but
- should also interleave lower priority streams if all higher priority streams are blocked (head-of-line blocking)
- Browser intentionally separates request management lifecycle from socket management
- sockets are organized in pools
- grouped by origin
- same sockets can be automatically reused across multiple requests
- Origin - the combination of:
- application protocol
- domain name
- port number
- Socket pool
- group of sockets belonging to the same origin
- in practice, all browsers limit max pool size to 6 sockets
- With automatic socket pooling, browser can:
- service queued requests in priority order
- reuse sockets to minimize latency and improve throughput
- be proactive in opening sockets in anticipation of request
- optimize when idle sockets are closed
- optimize bandwidth allocation across all sockets
- ^^^ these are managed by the browser!!!
- Defer management of individual sockets:
- allows browser to sandbox and enforce a consistent set of security and policy constraints on untrusted code
- Connection limits
- browser manages all open socket pools
- browser enforces connection limits to protect both client and server
- Request formatting & response processing
- TLS negotiation
- Same-origin policy
- Browser provides authentication, session, and cookie management
- browser maintains separate "cookie jars" for each origin