
A few minor improvements.

1 parent 407313e commit f57220b4a32a4ee4b10721f40f664bfcb624462f Colin Phipps committed Mar 6, 2009
Showing with 3 additions and 3 deletions.
  1. +3 −3 paper/paper.xml
@@ -46,13 +46,13 @@
<para>The main content of the control file will be the checksums for blocks of data in the file to be downloaded. What checksums should be used, and how should they be transmitted? The choice of the rsync algorithm gives us part of the answer: a weak checksum, which can be calculated in a rolling manner over the data already held by the client, must be transmitted, to allow the client to easily identify blocks. Then a stronger checksum, strong enough to prevent blocks being identified as in common with the target file incorrectly, must be used for verification.</para>
<sect2><title>Weak Checksum</title>
<para>rsync transmits a 4-byte weak checksum, and I have used the same formula, but we could shorten the amount of data, to reduce the size of the control file. For small files, 4 bytes might be more than is needed to efficiently reject false matches (see, for example, <citation><xref linkend="CIS2004"/></citation>).</para>
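The "rolling manner" referred to above works like rsync's two-part sum. The following Python sketch shows the general scheme only; the exact formula, modulus, and byte packing in zsync may differ:

```python
def weak_checksum(block):
    """rsync-style weak checksum: two 16-bit running sums packed
    into 4 bytes. A sketch of the scheme, not zsync's exact code."""
    a = b = 0
    n = len(block)
    for i, x in enumerate(block):
        a = (a + x) & 0xFFFF            # plain sum of bytes
        b = (b + (n - i) * x) & 0xFFFF  # position-weighted sum
    return (b << 16) | a

def roll(checksum, old, new, n):
    """Slide an n-byte window one byte to the right in O(1):
    remove the departing byte `old`, append the arriving byte `new`."""
    a = checksum & 0xFFFF
    b = (checksum >> 16) & 0xFFFF
    a = (a - old + new) & 0xFFFF
    b = (b - n * old + a) & 0xFFFF
    return (b << 16) | a
```

Because each slide costs only a few arithmetic operations, the client can cheaply test every byte offset of its local data against the transmitted checksums.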
- <para>There is a tradeoff here between the download time and the processing time on the client. The download time is proportional to the amount of checksum data we transmit; part of the processing time on the client is proportional to the number of strong checksums that need to be calculated. The number of false matches on weak checksums (and hence unnecessary strong checksum calculations) are proportinal to the number of hashes calculated by the client (which is roughly the file size on the client) times the likelyhood of a false match. Adding this to the download time for the weak checksums, we have:</para>
+ <para>There is a tradeoff here between the download time and the processing time on the client. The download time (for the checksums) is proportional to the amount of checksum data we transmit; part of the processing time on the client is proportional to the number of strong checksums that need to be calculated. The number of false matches on weak checksums (and hence unnecessary strong checksum calculations) is proportional to the number of hashes calculated by the client (which is roughly the file size on the client) times the likelihood of a false match. Adding this to the download time for the weak checksums, we have:</para>
<informalequation>
<alt>N N over b 2^-d t_2 + d over 8 N over b t_3</alt>
<graphic fileref="math/ws1.png" />
</informalequation>
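As a sanity check, this cost expression can be evaluated numerically and its integer minimum compared against the closed-form optimum derived below. The figures here (block size and both timing constants) are invented for illustration and are not taken from the paper:

```python
from math import log, log2

def total_time(d, N, b, t2, t3):
    """The paper's cost: false-match strong-checksum work plus
    weak-checksum download, N*(N/b)*2**-d*t2 + (d/8)*(N/b)*t3."""
    return N * (N / b) * 2 ** -d * t2 + (d / 8) * (N / b) * t3

# Illustrative parameters (assumptions, not from the paper):
N = 600 * 2 ** 20   # 600 MB file
b = 2048            # block size in bytes
t2 = 1e-6           # seconds per strong-checksum calculation
t3 = 1e-4           # seconds per byte of weak-checksum data downloaded

# Brute-force integer minimum, and the closed-form optimum
best_d = min(range(8, 64), key=lambda d: total_time(d, N, b, t2, t3))
d_star = log2(N) + log2(8 * t2 * log(2) / t3)
```

With these invented timings both approaches agree on roughly d = 25 bits, i.e. a little over 3 bytes of weak checksum per block.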
<para>
- (where d is the number of bits of weak checksum transmitted per block, t<subscript>2</subscript> is the time to calculate a strong checksum, and t<subscript>3</subscript> is the download time for each weak checksum.) This is a simple minimisation problem, with minimum time given by:</para>
+ (where d is the number of bits of weak checksum transmitted per block, for a file of N bytes and b bytes per block, t<subscript>2</subscript> is the time to calculate a strong checksum, and t<subscript>3</subscript> is the download time for each weak checksum.) This is a simple minimisation problem, with minimum time given by:</para>
<informalequation>
<alt>-log 2 N 2^-d t_2 + t_3 over 8 = 0 newline
drarrow d = log_2 N + log_2 left ( { 8 t_2 log 2 } over t_3 right ) </alt>
@@ -66,7 +66,7 @@
<para>For the moment, I have chosen to send a whole number of bytes; and it is better to err on the side of sending too many weak checksum bits than too few, as the client performance degrades rapidly if the weak checksum is too weak. For example, on a 600MB ISO file, a 3-byte weak checksum causes the client to take an extra minute of CPU time (4 minutes, against 3 minutes when 4 bytes of weak checksum were provided) in a test conducted with zsync-0.2.0 (pre-release). In practice, this means that 4 bytes of weak checksum is optimal in most cases, but for some smaller files 3 bytes is better. It may be worth tweaking the parameters of this calculation in specific circumstances, such as when most clients are fast computers on very slow network connections (desktop computers on modems), although in these circumstances the figures here will still be small relative to the total transfer size.</para>
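The extra CPU time reported above for a 3-byte weak checksum can be made plausible with a rough count of false matches from the formula earlier; the block size here is an assumption, since this excerpt does not state the one used in the test:

```python
def expected_false_matches(N, b, d):
    """Expected weak-checksum collisions: ~N rolling checksums on the
    client, each hitting one of the N/b transmitted d-bit values with
    probability 2**-d (uniform, independent checksums assumed)."""
    return N * (N / b) * 2 ** -d

N = 600 * 2 ** 20   # 600 MB ISO, as in the test above
b = 2048            # assumed block size
three_byte = expected_false_matches(N, b, 24)
four_byte = expected_false_matches(N, b, 32)
# Each false match costs one unnecessary strong-checksum calculation,
# so dropping from 4 to 3 bytes multiplies that work by 2**8 = 256.
```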
</sect2>
<sect2><title>Strong Checksum</title>
- <para>The strong checksum is a different problem. It must be sufficiently string such that, in combination with the weak checksum, there is no significant risk of a block being identified in common between the available local data and the target file when in practice the blocks differ. rsync uses an MD4 checksum of each block for this purpose.</para>
+ <para>The strong checksum is a different problem. It must be sufficiently strong such that, in combination with the weak checksum, there is no significant risk of a block being identified in common between the available local data and the target file when in practice the blocks differ. rsync uses an MD4 checksum of each block for this purpose.</para>
 <para>I have continued to use MD4 for the moment. There are probably alternatives which would be more efficient with CPU time, but this is not a scarce quantity for zsync. What is of interest is the amount of data that must be transmitted to the client: a full MD4 checksum requires 16 bytes. Given a file of length N and blocksize b, there are <literal>N/b</literal> blocks in a file; assume for simplicity that the client also has N bytes of potential local data (and so ~N possible blocks of local data). If there is no data in common, k bits of checksum are transmitted, and the checksums are uniformly and independently distributed, then the chance of no collisions (data incorrectly believed to be in common, ignoring the weak checksum for now) is:</para>
<informalequation>
<alt>p &gt;= {left (2^k-N over b right )^N} over {(2^k)^N} = left ( 1 - N over b 1 over 2^k right ) ^N newline</alt>
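Evaluating this bound numerically shows how sharply the transmitted strong-checksum length matters; the values below are illustrative (the block size is an assumption):

```python
def p_no_collision(N, b, k):
    """Lower bound from the formula above on the probability that no
    block is falsely identified as common: (1 - (N/b) * 2**-k) ** N."""
    return (1.0 - (N / b) * 2.0 ** -k) ** N

N = 600 * 2 ** 20   # 600 MB file
b = 2048            # assumed block size
p64 = p_no_collision(N, b, 64)   # 8 transmitted bytes of the MD4
p48 = p_no_collision(N, b, 48)   # 6 transmitted bytes
```

With these numbers, 8 bytes leaves a failure probability of around 10<superscript>-5</superscript> over the whole file, while 6 bytes is close to a coin flip; far less than the full 16-byte MD4 is needed, but the transmitted portion cannot be cut much further.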
