# Programming tasks

## Data

For this (and the next) problem set you are provided with yet another `nums.py` script that generates data as unsigned 64-bit integers in binary form (note: not as text). `nums.py` can be found on Moodle.

In any file generated by the script, the first integer output is the number $n$, indicating the number of *input elements* present. 

Beyond the leading $n$, the data contains $n$ unsorted “random” integers. The intent is that these $n$ integers are added (`ds.append()`) to a data structure in order.

If $q$ is specified (with `-q <q>`), the initial $n$ integers are followed by the value of $q$, followed by $q$ unsorted “random” integers in the $[0, n)$ range. If the data structure supports random access, the next $q$ integers are targets for index queries (`ds[]` or `ds.at()`). The $q$ index queries can be ignored for this set.

If the `-s` flag is given, the $n$ integers to insert are sorted. This supports implementations that encode the differences between consecutive numbers.

These files generated by the script are similar to what CSES uses. 

Note that `nums.py` generates random data, and you may treat the input data on CSES as random, but not all of that data is necessarily random.

Think of each file as containing a set of integers. The integers in the files are unsigned and 64 bits (i.e. 8 bytes) in length.

The $17$ integers generated by `python nums.py -n 10 -q 5 -r 1337` should be: 

```
10,  
975037, 795880, 216231, 480729, 379559, 153946, 857970, 159826, 353364, 309320,
5, 
6, 6, 7, 8, 1
```

## Assignment Overview

In this assignment you are expected to implement various versions of VByte encoding, and to benchmark these using the data you create (with `nums.py` or otherwise).

**VByte** encoding is a famous compression scheme for integers that is used for efficiency in many software systems, including search engines. A brief description of VByte encoding is given in the "Description of VByte encoding" document.

The intent is to create the tooling necessary to benchmark VByte coding. The executable should initially take as a parameter only a file, and if no file is given, should read from the standard input. In addition, arguments `-s` and `-k <num>` may be given. The program should work as follows:
* Read $n$ from file or standard input
* Read $n$ 64-bit integers from file or standard input and store the integers using VByte coding.
    * If `-s` was given, input is sorted and differences between integers should be stored instead of the integers themselves.
    * If `-k <num>` was given, a generalized VByte code with $k$-bit blocks should be used.
* Output the number of blocks used for encoding the input to `std::cerr`.
* Output the contents of the data structure (as text). I.e. repeat back the input.
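
The difference encoding used with `-s` can be sketched as follows. This is an illustration under the assumption that the input is sorted in non-decreasing order and that the first value is stored as its difference from 0. The names `to_gaps` and `from_gaps` are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Replaces each value by its difference from the previous value.
// Assumption: `sorted` is in non-decreasing order, so no difference wraps.
inline std::vector<std::uint64_t> to_gaps(const std::vector<std::uint64_t>& sorted) {
    std::vector<std::uint64_t> gaps;
    gaps.reserve(sorted.size());
    std::uint64_t previous = 0;
    for (std::uint64_t value : sorted) {
        gaps.push_back(value - previous);
        previous = value;
    }
    return gaps;
}

// Inverse operation: a running sum restores the original values.
inline std::vector<std::uint64_t> from_gaps(const std::vector<std::uint64_t>& gaps) {
    std::vector<std::uint64_t> values;
    values.reserve(gaps.size());
    std::uint64_t running = 0;
    for (std::uint64_t gap : gaps) {
        running += gap;
        values.push_back(running);
    }
    return values;
}
```

In a streaming setting only `previous` (or `running`) needs to be kept, so the same idea works without holding the whole input in a vector.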

**Note:** For the B-series tasks, you will not be able to simply read the entire input into a vector or array before encoding, as this will take too much space.

For example, with input: $[2, 7, 500]$<br/>
The program should output: $[3, 7, 500]$<br/>
I.e. $n = 2$, the numbers were $7$ and $500$, and 3 VByte blocks were used (one for $7$, two for $500$).
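
The block count in this example can be reproduced with a short sketch of byte-wise VByte coding. It assumes 7 data bits per byte, least-significant block first, and a stop bit (the top bit) set only on the final byte of each number. The function names `vbyte_append` and `vbyte_next` are hypothetical, and the "Description of VByte encoding" document remains the reference for the intended convention.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends x to `out` as VByte and returns the number of bytes written.
// Assumed layout: 7 data bits per byte, low-order block first,
// top bit set on the last byte only.
inline std::size_t vbyte_append(std::vector<std::uint8_t>& out, std::uint64_t x) {
    std::size_t blocks = 1;
    while (x >= 128) {
        out.push_back(static_cast<std::uint8_t>(x & 127));  // continuation byte
        x >>= 7;
        ++blocks;
    }
    out.push_back(static_cast<std::uint8_t>(x | 128));      // final byte, stop bit set
    return blocks;
}

// Decodes the value starting at `pos` and advances `pos` past it.
// Assumption: `in` holds a complete, well-formed code at `pos`.
inline std::uint64_t vbyte_next(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    std::uint64_t x = 0;
    unsigned shift = 0;
    while (true) {
        const std::uint8_t byte = in[pos++];
        x |= static_cast<std::uint64_t>(byte & 127) << shift;
        if (byte & 128) return x;  // stop bit reached
        shift += 7;
    }
}
```

Appending $7$ and then $500$ writes one byte and two bytes respectively, which gives the 3 blocks reported above.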

And later when random access is supported:<br/>
For input: $[2, 7, 500, 2, 1, 0]$<br/>
The program should output: $[3, 500, 7]$<br/>
I.e. again 3 blocks used and `VB.get(1)` $= 500$, `VB.get(0)` $= 7$. 

## Task A30/B30: VByte Encoding/Decoding (4 + 2 out of 10 marks)

Implement a data structure based on VByte encoding (`VByte.hpp` for example). Implement code to use the data structure as described above. Random access is not required yet for this task. The `-k` flag will also not be present yet, and can safely be ignored here. You can simply assume that $k = 7$, and use full bytes with an included stop bit. Any index queries at the end of test data files can also be ignored for now.

Measure the total time taken to process files, recording both wall-clock and user+system time. What do these numbers mean and why are they different? Also measure the time taken for reading and encoding separately from the time used for outputting. Test different approaches to reading the input data. What do you find? While you were developing your encoding and decoding routines, were you able to improve (or worsen) runtimes? If so, what changes made a difference? How can you divide (and compute modulus) by powers of 2 quickly? How can this be exploited in VByte encoding? Does this matter for this task?
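
For the timing measurements, one option is to time phases from inside the program. The sketch below is an illustration, not a required design: it uses `std::chrono::steady_clock` for wall-clock time and `std::clock()` for processor time, which on POSIX systems covers user plus system time of the process (other platforms may differ). The names `Timing` and `measure` are hypothetical.

```cpp
#include <chrono>
#include <ctime>

// Elapsed times for one measured phase, in seconds.
struct Timing {
    double wall_seconds;  // real elapsed time
    double cpu_seconds;   // processor time consumed by this process
};

// Runs `work` once and reports both wall-clock and processor time.
template <typename Work>
Timing measure(Work&& work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    return {
        std::chrono::duration<double>(wall_end - wall_start).count(),
        static_cast<double>(cpu_end - cpu_start) / CLOCKS_PER_SEC
    };
}
```

Wrapping the read-and-encode phase and the output phase in separate `measure` calls gives the two figures asked for. The shell's `time` command offers an external cross-check by reporting real, user, and system time for the whole run.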

Compute the sizes of your encoded data. How well does compression perform? How do the compressed sizes of the unsorted and sorted files compare to the original files?
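
As a starting point for the size comparison, the encoded size can be related to the raw size with simple arithmetic. The sketch below assumes that every raw integer takes 64 bits, that every block takes $k$ data bits plus 1 stop bit, and that the leading $n$ is not counted. The name `size_ratio` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Ratio of encoded size to raw size; values below 1.0 mean the encoding
// is smaller than the original.
// Assumptions: 64 bits per raw integer, (k + 1) bits per block.
inline double size_ratio(std::uint64_t blocks, unsigned k, std::uint64_t n) {
    assert(n > 0);
    const double encoded_bits = static_cast<double>(blocks) * (k + 1);
    const double raw_bits = static_cast<double>(n) * 64.0;
    return encoded_bits / raw_bits;
}
```

For the small example with the numbers $7$ and $500$, 3 blocks at $k = 7$ take 24 bits against 128 raw bits, a ratio of 0.1875.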

Once you have done all of this, submit to CSES task A30.

## Task A31/B31: Generalized VB codes (4 out of 10 marks)

The VByte codes that we looked at in Assignments 1 and 2 used 7 bits for data (and 1 bit for the stop bit) in each part of the code (each part being 8 bits or 1 byte). Making use of your solution for Packed Integer Arrays from Assignment 2, implement a generalized version of VByte codes where the number of data bits can be specified as a parameter, $k$, given with the argument `-k <num>`.

For example, if we set $k=4$, the number $x = 500$, which is `111110100` in binary, would be encoded as:

```
00100
01111
10001
```

When encoded this way, $x$ has 3 parts of 5 bits each (1 stop bit and 4 data bits per part), for a total of 15 bits. Compare this to the encoding obtained when $k=7$ (as was the case in Task A30/B30):

```
01110100
10000011
```

This time $x$ consists of 2 parts of 8 bits each (1 stop bit and 7 data bits per part), for a total of 16 bits.
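
The two encodings of $x = 500$ shown above can be reproduced with the sketch below. It is a readability-first illustration: each block is held in its own `std::uint64_t` instead of being packed into a bit array, so it is not a space-efficient implementation. The names `encode_blocks` and `to_bits` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Splits x into blocks of k data bits, least-significant block first.
// Bit k of each block is the stop bit, set only on the last block.
inline std::vector<std::uint64_t> encode_blocks(std::uint64_t x, unsigned k) {
    assert(k >= 1 && k <= 63);
    const std::uint64_t stop = std::uint64_t{1} << k;
    const std::uint64_t mask = stop - 1;
    std::vector<std::uint64_t> blocks;
    while (x > mask) {
        blocks.push_back(x & mask);  // stop bit clear: more blocks follow
        x >>= k;
    }
    blocks.push_back(x | stop);      // stop bit set: last block
    return blocks;
}

// Renders one block as a string of k + 1 bits, stop bit first.
inline std::string to_bits(std::uint64_t block, unsigned k) {
    std::string bits;
    for (int bit = static_cast<int>(k); bit >= 0; --bit) {
        bits.push_back(((block >> bit) & 1) ? '1' : '0');
    }
    return bits;
}
```

Calling `encode_blocks(500, 4)` and printing each block with `to_bits(block, 4)` yields `00100`, `01111`, `10001`, and the same calls with $k = 7$ yield `01110100`, `10000011`.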

Test the speed and space performance of this implementation. What do you find? Can you generate (non-trivial) test cases that work very well / badly for different combinations of $k$ values and the `-s` flag? Think about strengths and weaknesses of the generalized version.

How can you divide (and compute modulus) by powers of 2 quickly? How can this be exploited in VByte encoding? Does this matter for this task?
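
As a hint for the first question: for unsigned integers, dividing by $2^k$ is a right shift by $k$, and the remainder modulo $2^k$ is a bitwise AND with $2^k - 1$. The sketch below states this for $0 \le k < 64$. The names `div_pow2` and `mod_pow2` are hypothetical.

```cpp
#include <cstdint>

// x / 2^k for unsigned x, valid for 0 <= k < 64.
constexpr std::uint64_t div_pow2(std::uint64_t x, unsigned k) {
    return x >> k;
}

// x % 2^k for unsigned x, valid for 0 <= k < 64.
constexpr std::uint64_t mod_pow2(std::uint64_t x, unsigned k) {
    return x & ((std::uint64_t{1} << k) - 1);
}

// Compile-time checks against ordinary division and modulus.
static_assert(div_pow2(500, 7) == 500 / 128, "shift matches division");
static_assert(mod_pow2(500, 7) == 500 % 128, "mask matches modulus");
```

Optimizing compilers typically perform this rewrite themselves when the divisor is a compile-time constant power of two, so whether writing it by hand changes runtimes is something to measure rather than assume, particularly when $k$ is only known at run time.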

Once you are satisfied with your implementation, submit it to CSES.
