<a href="https://colab.research.google.com/github/spencerleewilliams/cse380-notebooks/blob/master/09_2_Ponder_and_Prove_Data_Compression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Ponder and Prove Data Compression
#### Due: Saturday, 6 March 2021, 11:59 pm.

# TODO Explore Huffman Trees and Huffman Codes


Your task is to examine how to compress a *special piece of information* as compactly as possible, and **calculate various compression ratios**.

Recall that the **compression ratio** of a variable-length encoding like Huffman encoding is the percentage $100(f - v)/f$, where $f$ is the number of bits per symbol of the smallest **fixed**-length encoding, and $v$ is the average number of bits per symbol with the variable-length encoding.

For example, if there were 9 different symbols in a message, $f=4$ is the number of bits of the smallest fixed-length encoding, because $2^3 = 8$ (not enough for $9$) and $2^4 = 16$ (enough and to spare). If the variable-length encoding of the message had $v=3.12$, the compression ratio would be $100(4 - 3.12)/4 \approx 22\%$.

Note that calculating the average number of bits per symbol is not strictly necessary. That's because an alternate and equivalent way is to calculate $100(ft - vt)/ft$, where $ft$ is the **total** number of bits encoded with the fixed encoding, and $vt$ is the **total** number of bits encoded with the variable-length encoding.
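As a minimal sketch of the two equivalent forms described above, the following computes the ratio once per symbol and once from totals. The function names and the message length of 1000 symbols are illustrative assumptions, not part of the assignment.

```python
def fixed_bits_per_symbol(num_symbols: int) -> int:
    """Smallest whole number of bits that can distinguish num_symbols symbols."""
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 2, without float rounding
    return max(1, (num_symbols - 1).bit_length())


def compression_ratio(f: float, v: float) -> float:
    """Percentage 100(f - v)/f; works for per-symbol sizes or for totals."""
    return 100 * (f - v) / f


# Worked example from the text: 9 symbols, v = 3.12 bits per symbol
f = fixed_bits_per_symbol(9)
per_symbol = compression_ratio(f, 3.12)

# Same ratio from totals, for a hypothetical message of 1000 symbols
n = 1000
from_totals = compression_ratio(f * n, 3.12 * n)

print(f, round(per_symbol), round(from_totals))
```

Multiplying both $f$ and $v$ by the same symbol count leaves the ratio unchanged, which is why either form can be used.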

The *special piece of information* to be compressed is a list of the first ten million primes. This is a list that starts

|    |
|----|
|  2 |
|  3 |
|  5 |
|  7 |
| 11 |
| 13 |
| 17 |
| 19 |
| 23 |
| 29 |

  and ends

|           |
|-----------|
| 179424551 |
| 179424571 |
| 179424577 |
| 179424601 |
| 179424611 |
| 179424617 |
| 179424629 |
| 179424667 |
| 179424671 |
| 179424673 |

As ASCII text stored in a file with one prime per line, the size of this data file is slightly over 89 megabytes. The goal is to compress this down to just over 5 megabytes (5589056 bytes, to be exact). That's a 94% compression ratio!

Standard compression tools can only get about a 73% compression ratio for this ASCII data. A more clever approach is needed. Instead of compressing the list of prime numbers, compress a list of the *gaps* between them!

Dropping the one odd gap (the gap of 1 between 2 and 3, which occurs only once) doesn't save much, but in the spirit of de Polignac's conjecture that every *even* number appears infinitely often as a gap between consecutive primes, consider only the even-sized gaps. The result will be a list that starts with 2 (the difference between 5 and 3), 2 (the difference between 7 and 5), 4 (the difference between 11 and 7), 2 (the difference between 13 and 11), 4 (the difference between 17 and 13), 2 (the difference between 19 and 17), 4 (the difference between 23 and 19), and 6 (the difference between 29 and 23).

Generating this data is the first task. The algorithm for doing so is very straightforward:

1. Find the gaps between consecutive odd primes.
2. Store these gaps as a list of even numbers.

Tabulating the results, the first ten gaps and the last ten gaps are as follows, where the numbers after the equals signs are the gaps to list:

|                 |
|-----------------|
|  5  -   3  =  2 |
|  7  -   5  =  2 |
| 11  -   7  =  4 |
| 13  -  11  =  2 |
| 17  -  13  =  4 |
| 19  -  17  =  2 |
| 23  -  19  =  4 |
| 29  -  23  =  6 |
| 31  -  29  =  2 |
| 37  -  31  =  6 |

|                                |
|--------------------------------|
| 179424551  -  179424533  =  18 |
| 179424571  -  179424551  =  20 |
| 179424577  -  179424571  =   6 |
| 179424601  -  179424577  =  24 |
| 179424611  -  179424601  =  10 |
| 179424617  -  179424611  =   6 |
| 179424629  -  179424617  =  12 |
| 179424667  -  179424629  =  38 |
| 179424671  -  179424667  =   4 |
| 179424673  -  179424671  =   2 |

As a correctness check, see if your generated list of gaps has length 9999998.
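The two steps can be tried on a small scale before running them on ten million primes. The sketch below assumes a plain sieve of Eratosthenes (the helper `primes_below` is hypothetical, written here only for illustration) and reproduces the first ten gaps tabulated above.

```python
def primes_below(limit):
    """Sieve of Eratosthenes: all primes strictly below limit."""
    if limit < 2:
        return []
    is_prime = [True] * limit
    is_prime[0] = is_prime[1] = False
    for n in range(2, int(limit ** 0.5) + 1):
        if is_prime[n]:
            # Cross out every multiple of n, starting at n * n
            for m in range(n * n, limit, n):
                is_prime[m] = False
    return [n for n, flag in enumerate(is_prime) if flag]


# Step 1: gaps between consecutive odd primes (drop the leading 2)
odd_primes = primes_below(40)[1:]
# Step 2: store the gaps as a list of even numbers
small_gaps = [q - p for p, q in zip(odd_primes, odd_primes[1:])]

print(small_gaps)
```

For $n$ primes there are $n - 1$ gaps in total and $n - 2$ even gaps, which is where the expected length of 9999998 comes from.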

The next step is to count how many times each gap size occurs, so that for the Huffman encoding scheme, the larger the frequency of occurrence, the smaller the number of bits encoding that gap size.

As a correctness check, here are the first ten and the last ten gap counts:

|  Gap | Count   |
|------|---------|
|    2 |  738597 |
|    4 |  738717 |
|    6 | 1297540 |
|    8 |  566151 |
|   10 |  729808 |
|   12 |  920661 |
|   14 |  503524 |
|   16 |  371677 |
|   18 |  667734 |
|   20 |  354267 |
|      |         |
|  190 |       1 |
|  192 |       3 |
|  194 |       1 |
|  196 |       1 |
|  198 |       6 |
|  202 |       2 |
|  204 |       3 |
|  210 |       4 |
|  220 |       1 |
|  222 |       1 |

Note two things from these partial gap counts:

1. Small even numbers (< 100) are well represented, larger ones (< 1000) less so.
2. Ten million primes aren't enough to have *every* even number represented; for example, 200, 206, 208, 212, 214, 216, and 218 do not appear even once.
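To illustrate how gap counts turn into code lengths, here is a minimal sketch that builds Huffman code lengths with a priority queue. It is one possible approach, not the required one; the function name `huffman_code_lengths` and the ten-gap sample are assumptions chosen to keep the example small.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_code_lengths(freqs):
    """Map each symbol to the length in bits of its Huffman code."""
    if len(freqs) == 1:
        return {sym: 1 for sym in freqs}
    tiebreak = count()  # keeps heap entries comparable when weights are equal
    heap = [(w, next(tiebreak), {sym: 0}) for sym, w in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees; every symbol in them gets one bit deeper
        w1, _, d1 = heapq.heappop(heap)
        w2, _, d2 = heapq.heappop(heap)
        merged = {sym: depth + 1 for sym, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


# The first ten even gaps from the table above
sample_gaps = [2, 2, 4, 2, 4, 2, 4, 6, 2, 6]
freqs = Counter(sample_gaps)
lengths = huffman_code_lengths(freqs)
total_bits = sum(freqs[g] * lengths[g] for g in freqs)

print(lengths, total_bits)
```

In this sample the most frequent gap (2) gets a 1-bit code and the other two gaps get 2-bit codes, so ten gaps take 15 bits instead of the 20 bits a 2-bit fixed encoding would need.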


In [5]:
# Code from 9_5
!pip install pyprimesieve
from collections import Counter
from itertools import accumulate
import pyprimesieve

# The 10,000,000th prime sets the upper bound for the list of primes
tmp = pyprimesieve.primes_nth(10000000)
primes = pyprimesieve.primes(tmp+1)
# All gaps between consecutive primes, including the odd gap 3 - 2 = 1
gaps = [q - p for p, q in zip(primes, primes[1:])]
# Sanity check: rebuilding the primes from the gaps recovers the original list
pl = list(accumulate(gaps, initial=2))
print(pl==primes)
# Print the total number of gaps
print(len(gaps))
# Count the occurrences of each even gap (gaps[1:] skips the odd gap of 1)
gap_counts = Counter(gaps[1:])
for gap in sorted(gap_counts):
  print(gap, gap_counts[gap])

# Number of different even gap sizes
uniqueGaps = len(gap_counts)
print(uniqueGaps)



True
9999999
2 738597
4 738717
6 1297540
8 566151
10 729808
12 920661
14 503524
16 371677
18 667734
20 354267
22 307230
24 453215
26 211203
28 229177
30 398713
32 123123
34 129043
36 206722
38 94682
40 111546
42 159956
44 64866
46 54931
48 93693
50 52183
52 38800
54 64157
56 32224
58 27985
60 55305
62 16763
64 17374
66 30960
68 12368
70 17475
72 17255
74 8540
76 7253
78 13758
80 6760
82 4791
84 9818
86 3411
88 3454
90 7056
92 2259
94 2058
96 3544
98 1831
100 1923
102 2374
104 1168
106 933
108 1634
110 941
112 711
114 1125
116 439
118 433
120 948
122 287
124 318
126 533
128 183
130 211
132 301
134 128
136 100
138 210
140 140
142 90
144 123
146 46
148 67
150 94
152 52
154 43
156 57
158 19
160 27
162 27
164 20
166 9
168 25
170 18
172 4
174 10
176 11
178 12
180 10
182 5
184 4
186 3
188 1
190 1
192 3
194 1
196 1
198 6
202 2
204 3
210 4
220 1
222 1
104


# TODO Determine Exact Size of Data to be Compressed


Without actually doing it, imagine creating an ASCII file containing the first ten million primes, represented in decimal, one prime per line. Calculate the size of this file, so you can show an exceptional compression ratio from it (see below).

Using a binary encoding instead of ASCII, each prime requires 32 bits (4 bytes), so the size of a binary file is easily determined.

Using a fixed-width encoding of the gap counts, however, requires knowing how many different gap sizes there are, after which the calculation is straightforward.
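The three sizes can also be worked out without generating any primes. This sketch takes a different route from sieving: it assumes the standard counts of primes below each power of ten, plus the 104 distinct gap sizes and 9999998 gaps reported above. All constant names are illustrative.

```python
# Number of primes below 10**1, 10**2, ..., 10**8 (standard prime-counting values)
PRIMES_BELOW_POWER_OF_TEN = [4, 25, 168, 1229, 9592, 78498, 664579, 5761455]
NUM_PRIMES = 10_000_000   # the last prime, 179424673, has 9 digits
NUM_GAP_SIZES = 104       # distinct even gap sizes, from the tally above
NUM_GAPS = 9_999_998      # even gaps between consecutive odd primes

# cumulative[d] = how many of the listed primes have at most d digits
cumulative = [0] + PRIMES_BELOW_POWER_OF_TEN + [NUM_PRIMES]
total_digits = sum(
    d * (cumulative[d] - cumulative[d - 1]) for d in range(1, len(cumulative))
)

# ASCII: one byte per digit plus one newline byte per prime
ascii_bytes = total_digits + NUM_PRIMES
# Binary: 32 bits (4 bytes) per prime
binary_bytes = 4 * NUM_PRIMES
# Fixed width: smallest whole number of bits that distinguishes every gap size
bits_per_gap = (NUM_GAP_SIZES - 1).bit_length()
fixed_bits = bits_per_gap * NUM_GAPS

print(ascii_bytes, binary_bytes, bits_per_gap, fixed_bits)
```

The ASCII total of 93484450 bytes is about 89.15 MiB, consistent with the "slightly over 89 megabytes" stated earlier.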

# TODO Use Functional Python


You are encouraged to use the [anytree](https://pypi.org/project/anytree) Python library, which has a nice exporter by way of which you can graphically view trees. (You may recall using this in DM1, and thus know that **anytree** depends on [graphviz](https://graphviz.org), which you also used.)

This library uses the object-oriented features of Python to create and visualize trees. You are encouraged to use the functional features of Python as much as possible, achieving your results not by using some existing third-party libraries for building Huffman Trees and Codes, but writing your own code as cleanly and elegantly as you can.

In [18]:
# Functions to determine number of digits, considering length and lines.
# See 9_5
from math import ceil, floor, log2, log10
def get_num_digits(n):
  return len(str(n))

def get_num_digits_no_str(n):
  return floor(log10(n)) + 1

def get_line_size(n):
  return get_num_digits_no_str(n) + 1

total_size_in_digits = sum(map(get_num_digits_no_str, primes))
# Exact size of data for ASCII, in MiB:
# one byte per digit plus one newline byte for each of the 10 ** 7 primes
ascii_size_in_bytes = total_size_in_digits + 10 ** 7
print(ascii_size_in_bytes / 2 ** 20)
# Exact size of data for binary, in MiB:
# equals the number of primes multiplied by 4 bytes or 32 bits
binary_size_in_bytes = len(primes) * 4
print(binary_size_in_bytes / 2 ** 20)
# Exact size of data for fixed, in bits:
# each even gap takes ceil(log2(number of different gap sizes)) bits
bits_per_gap = ceil(log2(uniqueGaps))
# len(gaps) - 1 leaves out the single odd gap between 2 and 3
fixed_size_in_bits = (len(gaps) - 1) * bits_per_gap
print(bits_per_gap, fixed_size_in_bits)

89.15371894836426
38.14697265625
7 69999986


# TODO Achieve Target Compression Ratios


Your solution should correctly compute the following three compression ratios:

| Ratio       | Value              |
|-------------|--------------------|
| From fixed  | 36.125168653605158 |
| From binary |              86.03 |
| From ASCII  |              94.02 | 
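As a rough cross-check of the binary and ASCII rows, the totals form of the ratio can be applied to file sizes in bytes. This sketch assumes the 5589056-byte target quoted earlier and an ASCII size of 93484450 bytes (digits plus one newline per prime); the names are illustrative.

```python
def ratio_from_totals(original_size, compressed_size):
    """Percentage saved going from original_size to compressed_size (same units)."""
    return 100 * (original_size - compressed_size) / original_size


COMPRESSED_BYTES = 5_589_056    # target compressed size quoted earlier
BINARY_BYTES = 4 * 10_000_000   # 32 bits per prime
ASCII_BYTES = 93_484_450        # assumption: total digits plus newlines

from_binary = round(ratio_from_totals(BINARY_BYTES, COMPRESSED_BYTES), 2)
from_ascii = round(ratio_from_totals(ASCII_BYTES, COMPRESSED_BYTES), 2)

print(from_binary, from_ascii)
```

The fixed-encoding row is left out here because it needs the exact Huffman bit total, which comes from your own encoding.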


In [27]:
from math import ceil, log2
# The compression ratio of an encoding is the percentage (f − v) / f · 100,
# where f is the number of bits per symbol of the smallest fixed-length encoding, and v is the average
# number of bits per prime with the variable-length (e.g. Huffman) encoding.
# f equals the number of bits per prime before compression:
# fixed: 7
# binary: 32
# ASCII: total ASCII bits divided by the 10 ** 7 primes (about 74.79)
# v = 4.471238194 average number of bits per prime
v = 4.471238194
f_ascii = (total_size_in_digits + 10 ** 7) * 8 / 10 ** 7
# Fixed compression ratio w/ 104 different gap sizes
print(round(((7 - v) / 7) * 100, 2))
# Binary compression ratio
print(round(((32 - v) / 32) * 100, 2))
# ASCII compression ratio
print(round(((f_ascii - v) / f_ascii) * 100, 2))

36.13
86.03
94.02


# TODO My Report on What I Did and What I Learned

## Fun


Normally I imagine trees spreading outward, expanding rather than compressing, so using a tree to encode, compress, and decompress data was a new way for me to think about trees. It was fun to keep building my knowledge by compressing data with Huffman trees. I see it as a valuable form of computation because it lets us send large files quickly while preserving the content of the original file. Comparing the values from fixed, binary, and ASCII was the most interesting part, because it showed how the representation of data affects how well it compresses.

## New

In DM1, Huffman tree data compression was an interesting concept, but as a class we didn't explore its applications very deeply. I already knew the general concepts and purposes of data compression with Huffman trees, but seeing how the form of data representation affects compression ratios was an important thing to learn.

## Meaningful


Related to the two previous prompts, I would like to further research and test different forms of data and how they compress. The activity here uses the primes, but what are the similarities and differences in compression when using data beyond integers and strings? In other programming languages, like C++, types such as doubles take up more storage than integers, but what about video or images? How does data integrity under compression differ when the stored value isn't a readable message?

## Connections


One connection I made to this activity was to my linear algebra course, which had a section on data compression with images. However, the focus there was more on going from a larger state or file size to a smaller one. Similar to earlier topics in discrete mathematics, we talked about data integrity in terms of maintaining one-to-one and/or onto properties.

## Collaborators
There were no collaborators.

# TODO What is True?
Click on each warranted checkbox to toggle it to True (or back to False). 

NOTE: *This only works in Colab. If you run it in some other Jupyter notebook client/server environment you may have to change False to True (or vice versa) manually.*

This self-assessment is subject to revision by a grader.

In [None]:
#@markdown ## What is True about what I did?
#@markdown ### I had fun.
cb00 = True #@param {type:'boolean'}
#@markdown ### I learned something new.
cb01 = True #@param {type:'boolean'}
#@markdown ### I achieved something meaningful, or something I can build upon at a later time.
cb02 = True #@param {type:'boolean'}
#@markdown ## What is True about my report?
#@markdown ### I wrote a sufficient number of well-written sentences.
cb03 = True #@param {type:'boolean'}
#@markdown ### My report is free of mechanical infelicities.
cb04 = True #@param {type:'boolean'}
#@markdown ### I used Grammarly (or something better described in my report) to check for MIs.
cb05 = True #@param {type:'boolean'}
#@markdown ### I reported on any connections I found between these problems and something I already know. 
cb06 = True #@param {type:'boolean'}
#@markdown ### I reported who were and what contribution each of my collaborators made.
cb07 = True #@param {type:'boolean'}
#@markdown ## What is True about my calculations?
#@markdown ### I correctly calculated the number of times each gap size occurs. 
cb08 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the number of bits per gap size with a fixed encoding.
cb09 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the total number of bits encoded with the Huffman encoding.
cb10 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the total number of bits encoded with the fixed encoding.
cb11 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the compression ratio from this fixed encoding.
cb12 = False #@param {type:'boolean'}
#@markdown ### I correctly calculated the size of the first ten million primes encoded as 32-bit integer binary data.
cb13 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the compression ratio from the binary size.
cb14 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the size of the first ten million primes encoded as ASCII data.
cb15 = True #@param {type:'boolean'}
#@markdown ### I correctly calculated the compression ratio from the ASCII size (just the primes, nothing else).
cb16 = False #@param {type:'boolean'}