# Programming tasks 4

## Data

We use the same data sets as last week.

We will focus on the `Fx` files, and stop ignoring the $n$ index queries in the files.

## Assignment Overview

You will now implement the random access to compressed data that was hinted at in the previous set.

The program specification is the same as last week.

For input: $[2,7,500,1,0]$<br/>
The program should output: $[3,500,7]$<br/>
I.e. again 3 blocks used and `VB.get(1)` $=500$, `VB.get(0)` $=7$. 

## Task 1: Naive random access to VB codes (2 out of 10 marks)

Stop ignoring the index queries.

Implement `VB.scan()` for your generalised VB data structure. 

The intention is to just scan from the start of the data structure and output the target integer when you get to it.
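
As a rough illustration of the idea, here is a minimal Python sketch of such a scan. It assumes the codes are held in a plain list of $(k+1)$-bit parts, with the stop bit above the $k$ data bits and the least significant part first, as in the worked example in Task 2. The names `scan` and `parts` are hypothetical, and your own `VB` class will likely read from a packed array instead.

```python
def scan(parts, index, k=7):
    """Return the integer at position `index` by decoding from the start."""
    stop_mask = 1 << k            # assumed layout: stop bit sits above the k data bits
    data_mask = stop_mask - 1
    current = 0                   # index of the code currently being read
    value = 0
    shift = 0
    for part in parts:
        if current == index:
            # Only accumulate data bits once we have reached the target code.
            value |= (part & data_mask) << shift
            shift += k
        if part & stop_mask:      # last part of the current code
            if current == index:
                return value
            current += 1
    raise IndexError(index)


# Parts for S = 4, 500, 200, 18 with k = 7 (same bits as the Task 2 example).
parts = [0b10000100, 0b01110100, 0b10000011,
         0b01001000, 0b10000001, 0b10010010]
```

Every query walks over all earlier codes, so the cost grows with the queried index. That is what makes it a useful baseline.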

Make note of the query performance for this random access implementation.

This is intended as a baseline for benchmarking tasks 2 and 3.

Once you are satisfied with your implementation, submit it to CSES.

NB: if you have trouble with the packed arrays, the A task should pass (possibly faster) if you ignore $k$ and use entire bytes.

## Task 2: Sum-Query-Based Random Access to VB codes (4 out of 10 marks)

In this task you will make use of your `BitArray` class from Set 1 as well as the `sum` function to support access to integer sequences that have been compressed with VByte codes. This requires us to first slightly rearrange the bits of the VByte codes.

For a given integer sequence $S[0..n-1]$, of $n$ integers, let $m$ be the length (in parts) of the longest VByte code when $S$ is VByte encoded (ignoring what $k$ is for now). Also, let $N_p$ be the number of VByte codes consisting of $p$ or more parts.

The data structure you will implement will consist of $m$ layers numbered $0,\ldots, m-1$.
Layer $L_i$ will consist of a `BitArray`, $B_i$, and a Packed Integer Array, $A_i$, both of length $N_{i+1}$.

$B_0$ contains the stop bit of the 1<sup>st</sup> part of each code and $A_0$ contains the data bits of the 1<sup>st</sup> part of each code. Note that every code has a 1<sup>st</sup> part, so $\lvert B_0\rvert = N_1 = n$.

In general $B_i[j]$ contains the stop bit of the $(i+1)$<sup>th</sup> part of the $j$<sup>th</sup> code having $i+1$ or more parts, and $A_i[j]$ contains the corresponding data bits.



To (hopefully) make things clearer, here is a small example for a sequence of integers: 

$S = 4, 500, 200, 18$ 

and $k = 7$.

After VB encoding the integers in $S$ we have the following bits (stop bits in bold):

<pre>
<b>1</b>0000100 <b>0</b>1110100 <b>1</b>0000011 <b>0</b>1001000 <b>1</b>0000001 <b>1</b>0010010
</pre>
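
The bit strings above can be reproduced with a short Python sketch. It assumes what the example shows: the low $k$ bits are emitted first and the stop bit is set only on the last part. The function name `vb_encode` is hypothetical.

```python
def vb_encode(value, k=7):
    """Split `value` into (k+1)-bit parts, low bits first; stop bit on the last part."""
    data_mask = (1 << k) - 1
    parts = []
    while True:
        data = value & data_mask
        value >>= k
        if value == 0:
            parts.append((1 << k) | data)   # last part: stop bit = 1
            return parts
        parts.append(data)                  # more parts follow: stop bit = 0


# Encode S = 4, 500, 200, 18 and render each part as an 8-bit string.
encoded = [format(p, "08b") for v in [4, 500, 200, 18] for p in vb_encode(v)]
```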

Rearranging these bits into layers as described above would give us:

$B_0$:	`1001`<br/>
$A_0$:	`0000100 1110100 1001000 0010010`<br/>
$B_1$:	`11`<br/>
$A_1$: 	`0000011 0000001`<br/>

Note that up to this point the number of bits is the same in the two arrangements.
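
One possible way to produce this layered arrangement is sketched below in Python. It is only an illustration: plain lists stand in for the `BitArray` and the Packed Integer Array, and the name `build_layers` is hypothetical.

```python
def build_layers(values, k=7):
    """Return (B, A): B[i] holds the stop bits and A[i] the data bits of layer i."""
    data_mask = (1 << k) - 1
    B, A = [], []
    pending = list(values)            # remaining high bits of codes that still have parts
    while pending:
        stops, datas, rest = [], [], []
        for v in pending:
            datas.append(v & data_mask)
            v >>= k
            stops.append(1 if v == 0 else 0)
            if v:
                rest.append(v)        # this code continues in the next layer
        B.append(stops)               # plain lists stand in for BitArray / packed array
        A.append(datas)
        pending = rest
    return B, A


B, A = build_layers([4, 500, 200, 18])
```

Because codes are appended in their original order, the $j$<sup>th</sup> entry of a layer belongs to the $j$<sup>th</sup> code that reaches that layer, which is what the access method relies on.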

In order to support fast access to the $i$<sup>th</sup> integer, we increase space slightly by building a sum support data structure on each of the $B$ bit arrays. 

Then, to find the data bits that enable us to reconstruct the $i$<sup>th</sup> integer, we first look at the $i$<sup>th</sup> bit of $B_0$. If this (stop) bit is 1, then we extract the data bits from $A_0[i]$ and are finished. Otherwise (stop bit = 0) the next part of the code exists at Layer 1. We can locate the relevant bits in $A_1$ by counting the 0s before position $i$ in $B_0$: this count is $y = i - \operatorname{sum}(i)$, and because $B_0[i] = 0$ it is the same whether $\operatorname{sum}(i)$ counts the 1s in $B_0[0..i]$ or in $B_0[0..i-1]$. This means $A_1[y]$ is the next part of the code (spend a minute verifying this to yourself with an example). If $B_1[y] = 1$ we are finished. Otherwise ($B_1[y] = 0$) there are yet more parts of the code, and we can locate the next of those parts at the next layer (Layer 2) using a similar (i.e. sum-based) approach.
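
The lookup just described can be sketched in Python as follows. This is a sketch under assumptions, not a reference solution: the layers are hard-coded from the example above, plain lists replace the `BitArray` and packed arrays, and a precomputed prefix-count list named `ones` (hypothetical) stands in for the sum support structure.

```python
from itertools import accumulate

# Layers for S = 4, 500, 200, 18 with k = 7, copied from the example above.
B = [[1, 0, 0, 1], [1, 1]]
A = [[0b0000100, 0b1110100, 0b1001000, 0b0010010],
     [0b0000011, 0b0000001]]

# Stand-in for sum support: ones[i][j] = number of 1s in B[i][0..j] (inclusive).
ones = [list(accumulate(layer)) for layer in B]


def get(index, k=7):
    """Reconstruct the integer at `index` by following its parts through the layers."""
    value, shift, layer, j = 0, 0, 0, index
    while True:
        value |= A[layer][j] << shift     # parts are stored least significant first
        if B[layer][j] == 1:              # stop bit set: this was the last part
            return value
        j = j - ones[layer][j]            # number of 0s before j = index in next layer
        shift += k
        layer += 1
```

For index 1 the walk reads $A_0[1]$, computes $y = 1 - 1 = 0$, and then reads $A_1[0]$, giving $116 + 3 \cdot 128 = 500$.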

Your task is to implement `VB.get()` (or `VB[]` if you want to get fancy) using the data structure sketched above and compare the performance of the sum-based access method to the scan function you implemented in Task 1.

Benchmark this implementation. Compare the query times to the ones for Task 1, and compare the space usage to the other implementations. What is the space overhead of this implementation in practice?

NB: if you have trouble with the packed arrays, the A task should also pass if you ignore $k$ and use entire bytes.

## Task 3: Location-Query-Based Random Access to VB codes (4 out of 10 marks)

The data structure for task 2 was fairly complicated and had loads of moving parts.

Wouldn’t it be simpler to just extract the stop bits into a single `BitArray`, and simply concatenate the payloads? Then simple location queries on the `BitArray` could be used to retrieve the positions of the blocks to read.

The potential problem with this approach is with the performance of `location` queries compared to `sum` queries.

For the same $S$ we get the same encoding (stop bits in bold):

<pre>
<b>1</b>0000100 <b>0</b>1110100 <b>1</b>0000011 <b>0</b>1001000 <b>1</b>0000001 <b>1</b>0010010
</pre>

Instead of the multi-tiered approach, we simply do:

$B$:	`101011`<br/>
$A$:	`0000100 1110100 0000011 1001000 0000001 0010010`

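Now, to access index $i = 1$ for example, the blocks containing its bits are $[\operatorname{location}(i) + 1 \ldots \operatorname{location}(i + 1)] = [1 \ldots 2]$. Here $\operatorname{location}(j)$ is the position of the $j$<sup>th</sup> 1 bit in $B$, counting from $j = 1$, so $\operatorname{location}(1) = 0$ and $\operatorname{location}(2) = 2$. The blocks can then be read and reordered to retrieve the value.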
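
A minimal Python sketch of this access pattern is given below. Several things in it are assumptions for illustration: the arrays are hard-coded from the example, `location(j)` is modelled with a precomputed list of 1-bit positions rather than a real location support structure, and `location(0)` is taken to be $-1$ so that index 0 starts at block 0. Your own `location` from Set 1 may use a different convention.

```python
# Single-layer layout for S = 4, 500, 200, 18 with k = 7, copied from the example.
B = [1, 0, 1, 0, 1, 1]
A = [0b0000100, 0b1110100, 0b0000011, 0b1001000, 0b0000001, 0b0010010]

# Stand-in for location support: one_positions[j - 1] is the position of the j-th 1 bit.
one_positions = [pos for pos, bit in enumerate(B) if bit == 1]


def location(j):
    """Position of the j-th 1 bit in B (j counts from 1); assumed to be -1 for j = 0."""
    return one_positions[j - 1] if j > 0 else -1


def get(index, k=7):
    """Read blocks location(index) + 1 .. location(index + 1) and combine them."""
    first = location(index) + 1
    last = location(index + 1)
    value = 0
    for offset, block in enumerate(range(first, last + 1)):
        value |= A[block] << (offset * k)   # parts are stored least significant first
    return value
```

Each query needs two location calls and then reads one contiguous run of blocks, in contrast to the layer-by-layer walk of Task 2.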

Benchmark your implementation for random access using location queries in comparison to your solution to task 2. What is the difference in performance and space overhead? What are your general thoughts on this approach in comparison to task 2?
