# Compression and Entropy
## Introduction
In this notebook, we explore how compression and entropy are related. We use the `zlib` library to compress a set of sample files, and a small helper built on `collections.Counter` and `math.log2` to calculate the Shannon entropy of each file's bytes.

In [26]:
import math
from pathlib import Path
import time
import zlib
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
import seaborn as sns

sns.set_theme(style="whitegrid")


In [27]:
sample_dir = Path("samples/entropy")
files = [f for f in sample_dir.iterdir() if f.is_file()]

def read_file_content(file_path: Path) -> bytes:
    return file_path.read_bytes()

In [28]:
# LZ77-based compression: zlib implements DEFLATE (LZ77 matching plus Huffman coding)

def compress_lz77(data: bytes) -> bytes:
    return zlib.compress(data)

def decompress_lz77(data: bytes) -> bytes:
    return zlib.decompress(data)
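As a quick sanity check of the two wrappers above, the sketch below round-trips some made-up inputs through `zlib` directly. The inputs and the level values 1 and 9 are illustrative choices, not part of the original notebook. It also shows that every zlib stream carries fixed overhead (a 2-byte header and a 4-byte Adler-32 checksum), so a tiny input comes out larger than it went in, which is consistent with the 1-byte `a.txt` growing to 9 bytes in the results.

```python
import zlib

# Illustrative inputs (not taken from the sample directory)
repetitive = b"abc" * 10_000   # 30,000 highly redundant bytes
tiny = b"a"                    # a single byte

# The second argument is the compression level: 1 is fastest, 9 is slowest/smallest
fast = zlib.compress(repetitive, 1)
best = zlib.compress(repetitive, 9)
tiny_compressed = zlib.compress(tiny)

# Decompression must reproduce the input exactly, whatever the level
assert zlib.decompress(fast) == repetitive
assert zlib.decompress(best) == repetitive
assert zlib.decompress(tiny_compressed) == tiny

print("repetitive:", len(repetitive), "->", len(fast), "(level 1),", len(best), "(level 9)")
print("tiny:", len(tiny), "->", len(tiny_compressed))
```

The exact compressed sizes depend on the zlib build, so none are quoted here.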

### Shannon Entropy Formula

The Shannon entropy formula is:

$$
H(X) = - \sum_{i=1}^{n} p(x_i) \cdot \log_2(p(x_i))
$$

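where $p(x_i)$ is the relative frequency of the $i$-th distinct byte value in the data and $n$ is the number of distinct byte values. Since a byte can take at most 256 values, $H(X)$ ranges from 0 (a single repeated byte) to 8 bits per byte (all 256 values equally likely).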
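To make the formula concrete, here is a minimal sketch that evaluates it on a few hand-made byte strings. The helper name `shannon_entropy` and the inputs are hypothetical, chosen only for illustration. It uses the equivalent form $\sum_i p(x_i) \log_2(1 / p(x_i))$, which gives the same value as the formula above.

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Byte-frequency Shannon entropy in bits per byte (hypothetical helper)."""
    total = len(data)
    # p * log2(1/p) is the same quantity as -p * log2(p)
    return sum((c / total) * math.log2(total / c) for c in Counter(data).values())

single = shannon_entropy(b"aaaa")                # one symbol, p = 1
two = shannon_entropy(b"aabb")                   # two symbols, p = 1/2 each
letters = shannon_entropy(bytes(range(ord("a"), ord("z") + 1)))  # 26 symbols, p = 1/26
all_bytes = shannon_entropy(bytes(range(256)))   # 256 symbols, p = 1/256

print(single)             # 0.0
print(two)                # 1.0
print(round(letters, 5))  # 4.70044, i.e. log2(26)
print(all_bytes)          # 8.0
```

The 26-letter case gives $\log_2 26 \approx 4.70044$, the same value reported for `alphabet.txt` in the results, which is what one would expect if that file contains the 26 letters in equal proportions.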


In [29]:
def calculate_entropy(data: bytes) -> float:
    """Return the Shannon entropy of `data` in bits per byte (0.0 for empty input)."""
    # Count how often each byte value occurs
    byte_counts = Counter(data)
    total_bytes = len(data)

    entropy = 0.0
    for count in byte_counts.values():
        # Relative frequency of this byte value
        probability = count / total_bytes
        entropy -= probability * math.log2(probability)

    return entropy

In [30]:
results = []

for file_path in files:
    data = read_file_content(file_path)
    original_size = len(data)
    file_name = file_path.name
    entropy = calculate_entropy(data)
    
    algorithms = [
        ("LZ77", compress_lz77, decompress_lz77),
    ]

    for algorithm_name, compress_func, decompress_func in algorithms:
        start_time = time.perf_counter()
        compressed_data = compress_func(data)
        compression_time = time.perf_counter() - start_time

        compressed_size = len(compressed_data)
        compression_ratio = original_size / compressed_size

        start_time = time.perf_counter()
        decompressed_data = decompress_func(compressed_data)
        decompression_time = time.perf_counter() - start_time

        # Sanity check: the round trip must reproduce the original bytes
        assert decompressed_data == data

        results.append({
            "file": file_name,
            "algorithm": algorithm_name,
            "original size (bytes)": original_size,
            "compressed size (bytes)": compressed_size,
            "compression ratio": compression_ratio,
            "compression time (seconds)": compression_time,
            "decompression time (seconds)": decompression_time,
            "entropy (bits/symbol)": entropy
        })

In [31]:
df = pd.DataFrame(results)
df

| | file | algorithm | original size (bytes) | compressed size (bytes) | compression ratio | compression time (seconds) | decompression time (seconds) | entropy (bits/symbol) |
|---|---|---|---|---|---|---|---|---|
| 0 | a.txt | LZ77 | 1 | 9 | 0.111111 | 0.000271 | 0.000104 | 0.0 |
| 1 | aaa.txt | LZ77 | 100000 | 121 | 826.446281 | 0.000448 | 6.9e-05 | 0.0 |
| 2 | alphabet.txt | LZ77 | 100000 | 290 | 344.827586 | 0.000445 | 9.7e-05 | 4.70044 |
| 3 | encrypted_random.txt | LZ77 | 100032 | 100073 | 0.99959 | 0.002287 | 3.3e-05 | 7.998249 |
| 4 | mobydick.txt | LZ77 | 646431 | 263147 | 2.45654 | 0.035118 | 0.002234 | 4.590077 |
| 5 | random.txt | LZ77 | 100000 | 75200 | 1.329787 | 0.003162 | 0.000328 | 5.953941 |
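One way to read these numbers: the entropy column only looks at single-byte frequencies, so `entropy * original size / 8` is roughly the size an ideal coder would reach if it encoded each byte independently. The sketch below compares that figure with the measured zlib sizes. The values are copied from the results above, and the helper name `order0_bound_bytes` is hypothetical.

```python
# (file, original size in bytes, zlib size in bytes, entropy in bits/byte),
# copied from the results table
rows = [
    ("alphabet.txt", 100000, 290, 4.70044),
    ("encrypted_random.txt", 100032, 100073, 7.998249),
    ("mobydick.txt", 646431, 263147, 4.590077),
    ("random.txt", 100000, 75200, 5.953941),
]

def order0_bound_bytes(size: int, entropy_bits: float) -> float:
    """Size in bytes if each byte cost exactly `entropy_bits` bits (hypothetical helper)."""
    return size * entropy_bits / 8

bounds = {}
for name, size, compressed, entropy_bits in rows:
    bound = order0_bound_bytes(size, entropy_bits)
    bounds[name] = bound
    print(f"{name:22} bound={bound:>9.0f}  zlib={compressed:>7}  zlib/bound={compressed / bound:.3f}")
```

For `random.txt` and `encrypted_random.txt`, zlib lands slightly above this figure, as expected when bytes carry no structure beyond their frequencies. For `alphabet.txt` and `mobydick.txt`, zlib does better than the figure, because LZ77 matching exploits repeated sequences that byte-level entropy does not capture.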


In [32]:
# TODO: add plots here