# Exercise 05.2 (data compression)

For devices with limited memory, data compression can be important. Data compression is
a field of its own, but with libraries we can compress (and decompress) data easily.

In [1]:
# Import the compression module
import zlib

# Import modules for generating random strings
import random
import string

## Define text to compress: Here we will use a passage from Shakespeare

In [2]:
# Create a string that we wish to compress
text = """
Welcome, dear Rosencrantz and Guildenstern!
Moreover that we much did long to see you,
The need we have to use you did provoke
Our hasty sending. Something have you heard
Of Hamlet's transformation; so call it,
Sith nor the exterior nor the inward man
Resembles that it was. What it should be,
More than his father's death, that thus hath put him
So much from the understanding of himself,
I cannot dream of: I entreat you both,
That, being of so young days brought up with him,
And sith so neighbour'd to his youth and havior,
That you vouchsafe your rest here in our court
Some little time: so by your companies
To draw him on to pleasures, and to gather,
So much as from occasion you may glean,
Whether aught, to us unknown, afflicts him thus,
That, open'd, lies within our remedy."""

## Define a function to compress a text string, report the compression efficiency, and check that the original and decompressed versions are the same (adapted from assignment)

In [3]:
def compress_text(text):
    """ Compress text string and print quality control checks """
    # Convert Python string to bytes, and check type
    text_bytes = text.encode("utf-8")
    print(type(text_bytes))

    # Get number of bytes (memory) used to store string
    print("Number of bytes for uncompressed string:", len(text_bytes))

    # Compress string and get number of bytes used for compressed string
    text_comp = zlib.compress(text_bytes)
    print("Number of bytes for compressed string:", len(text_comp))

    # Display the compression efficiency (compressed size / original size; lower is better)
    print("Compression efficiency: ", len(text_comp)/len(text_bytes))

    # Decompress the string
    text_decomp = zlib.decompress(text_comp)

    # Check that original and decompressed strings are the same
    if text == text_decomp.decode("utf-8"):
        print("All good: original and decompressed strings are the same.")
    else:
        print("Problem: original and decompressed strings differ.")
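
The function measures the size of the encoded bytes rather than the number of characters in the string. For pure ASCII text the two are equal, but in UTF-8 any non-ASCII character occupies more than one byte. A minimal sketch of the difference, using a hypothetical sample word that is not part of the exercise:

```python
# Hypothetical sample word containing one non-ASCII character ("ï")
word = "naïve"

# Encode to UTF-8, as compress_text does before compressing
word_bytes = word.encode("utf-8")

# The character count and the byte count differ because "ï" needs two bytes
n_chars = len(word)
n_bytes = len(word_bytes)
print("Characters:", n_chars)
print("Bytes:", n_bytes)
```

This is why the byte length, not the string length, is the right denominator when reporting how much memory compression saves.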

In [4]:
# Use function to compress Shakespeare passage text
compress_text(text)

<class 'bytes'>
Number of bytes for uncompressed string: 785
Number of bytes for compressed string: 466
Compression efficiency:  0.5936305732484076
All good: original and decompressed strings are the same.
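
`zlib.compress` also accepts an optional compression level from 0 (no compression) to 9 (smallest output, slowest), which trades speed against size. The sketch below compares a few levels; the sample data is a hypothetical stand-in (one line of the passage repeated), not the full text used above.

```python
import zlib

# Hypothetical sample data: one line repeated so zlib has some structure to exploit
sample = ("The need we have to use you did provoke our hasty sending. " * 20).encode("utf-8")

# Compress the same bytes at several levels (0 = store only, 9 = best compression)
sizes = {level: len(zlib.compress(sample, level)) for level in (0, 1, 6, 9)}

for level, size in sizes.items():
    print(f"level {level}: {size} bytes ({size / len(sample):.3f} of original)")
```

At level 0 the output is slightly larger than the input, because the zlib container adds a small header and checksum without compressing anything.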


## 1. Examine the compression efficiency of compressing one large string made up of the passage by Shakespeare repeated 100 times.

In [5]:
# Create string consisting of Shakespeare passage repeated 100 times
text_x100 = text * 100

# Compress the repeated text
compress_text(text_x100)

<class 'bytes'>
Number of bytes for uncompressed string: 78500
Number of bytes for compressed string: 925
Compression efficiency:  0.01178343949044586
All good: original and decompressed strings are the same.
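
The compressed size barely grows with the number of repeats because the DEFLATE format used by zlib can replace a run of bytes with a short back-reference to an identical run seen earlier, within a 32 KiB sliding window. The passage is much shorter than that window, so every repeat after the first can be encoded almost entirely as references to the previous copy. A rough sketch of this effect, using a shortened, hypothetical stand-in for the passage:

```python
import zlib

# Hypothetical stand-in for the passage: a few of its lines
passage = (
    "Welcome, dear Rosencrantz and Guildenstern!\n"
    "Moreover that we much did long to see you,\n"
    "The need we have to use you did provoke\n"
    "Our hasty sending.\n"
).encode("utf-8")

# Compressed size as a function of the number of repeats
sizes = {n: len(zlib.compress(passage * n)) for n in (1, 2, 10, 100)}

for n, size in sizes.items():
    raw = len(passage) * n
    print(f"{n:>3} repeats: {raw:>6} bytes raw, {size:>4} bytes compressed")
```

Doubling the input from one copy to two adds only a few bytes to the compressed output, since the second copy carries almost no new information.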


## 2. Examine the compression efficiency of compressing a random string of the same length as the repeated Shakespeare passage.

To help you, the function below generates a random string of length `N`:

In [6]:
def random_string(N):
    """ Return a random string of N ASCII letters and digits """
    alphabet = string.ascii_letters + string.digits
    return ''.join(random.choice(alphabet) for _ in range(N))

print(random_string(8))

ExSFN6ut


In [7]:
# Create random string the same length as the repeated Shakespeare passage
text_random = random_string(len(text_x100))

# Compress random text
compress_text(text_random)

<class 'bytes'>
Number of bytes for uncompressed string: 78500
Number of bytes for compressed string: 59031
Compression efficiency:  0.7519872611464968
All good: original and decompressed strings are the same.


The repeated text compresses far better than the single passage (a ratio of about 0.012 versus 0.59), while the random string compresses worst (about 0.75). Note that "compression efficiency" here is the compressed size divided by the original size, so smaller values mean better compression. The less underlying pattern there is in the data, the less structure the compression algorithm can exploit to save space while retaining all of the information.
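
The random string still shrinks to roughly three quarters of its size even though it has no repeating pattern. One explanation is that it draws on only 62 distinct characters, each stored in an 8-bit byte, so each character carries at most log2(62) ≈ 5.95 bits of information. That puts a floor of about 0.744 on the achievable ratio for uniformly random characters from this alphabet. The sketch below computes that floor and compares it with zlib; the seed and the use of `random.choices` are choices made here for reproducibility, not part of the exercise.

```python
import math
import random
import string
import zlib

# Same alphabet as random_string: 52 letters + 10 digits = 62 symbols
alphabet = string.ascii_letters + string.digits

# Information-theoretic floor: bits of information per character / bits stored per character
bound = math.log2(len(alphabet)) / 8

# Seeded random string of the same length as the repeated passage (seed is arbitrary)
random.seed(0)
data = "".join(random.choices(alphabet, k=78500)).encode("ascii")

# Ratio actually achieved by zlib at its default level
ratio = len(zlib.compress(data)) / len(data)

print(f"Entropy bound: {bound:.4f}")
print(f"zlib ratio:    {ratio:.4f}")
```

The zlib ratio lands just above the bound: the saving comes from encoding each character in fewer bits, not from finding repeated patterns.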