# Benford's Law Analysis
In this notebook we analyze the **distribution of first significant digits** (FSD) of different aspects of an image.
These could be, for example:
- The raw pixel values
- The discrete cosine transformation (DCT) values

Benford's law is the observation that in many collections of numbers, be they mathematical tables, real-life data, or combinations thereof, the leading significant digits are not uniformly distributed, as might be expected, but are heavily skewed toward the smaller digits. [[1](https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=1074&context=rgp_rsr)]

It is mathematically defined as (simplified) [[2](https://arxiv.org/pdf/1506.03046.pdf)]:

$$bf(d)=\beta \log_b\left(1+\frac{1}{d}\right)$$

with $b$ being the base ($10$ for "normal" numbers), $d$ being the possible digits (for $b=10$: $\{1,\dots,9\}$) and $\beta$ being a multiplicative factor (the code in this notebook uses $\beta=1$). The corresponding plot for $b=10$ looks as follows:

<img src="./benfords_law_ground_truth.png" alt="Benford's Law">
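
To make the formula concrete, the following minimal sketch evaluates it for an arbitrary base, assuming $\beta = 1$. The helper name `benford_probabilities` is introduced here for illustration only and is not used elsewhere in this notebook.

```python
from math import log


def benford_probabilities(base: int = 10) -> list[float]:
    """Return P(d) = log_base(1 + 1/d) for d = 1 .. base - 1 (assumes beta = 1)."""
    return [log(1 + 1 / d, base) for d in range(1, base)]


probs = benford_probabilities(10)
for digit, p in enumerate(probs, start=1):
    print(f"{digit}: {p:.3f}")
```

Because $\log_b(1+\frac{1}{d}) = \log_b(d+1) - \log_b(d)$, the sum over all digits telescopes to $\log_b(b) = 1$, so the values form a probability distribution. For $b=10$ the digit 1 gets about 30.1 % and the digit 9 about 4.6 %.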

It has been shown that **natural** image data (e.g. photographs) also follows this distribution, whereas GAN-generated images do not. Bonettini and colleagues used this fact successfully in [[3](https://arxiv.org/pdf/2004.07682.pdf)] to distinguish between real and fake images.

As an example dataset we will use the famous grayscale MNIST dataset, which is included in TensorFlow Keras.

In [None]:
# Import packages and settings
import tensorflow as tf
import numpy as np
import pandas as pd
from tqdm import tqdm
import cv2
import glob
import numpy.typing as npt
from math import log10, floor

pd.options.plotting.backend = "plotly"

In [None]:
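# Modes for to_fsd: pure Python loop (slow) or vectorized NumPy (fast)
FSD_SLOW = 0
FSD_FAST = 1
# Number base used for counting first significant digits
BASE_10 = 10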

In [None]:
# Import and prepare data
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
images = np.append(train_images, test_images, axis=0)
images = images.reshape(images.shape[0], 28, 28, 1).astype('float32')

In [None]:
def get_dct_array(image_list: npt.ArrayLike) -> npt.ArrayLike:
    """Calculates the DCT for each element in the list, flattens the result and returns a one-dimensional array.

    Args:
        image_list (npt.ArrayLike): A list of images

    Returns:
        npt.ArrayLike: A one-dimensional array of DCT values
    """
    dcts = np.array([cv2.dct(image) for image in image_list])
    dcts = dcts.flatten()
    return dcts
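
For intuition about what the DCT values represent, here is a pure-Python sketch of the one-dimensional orthonormal DCT-II. It is an illustration only: `dct_1d` is a hypothetical helper, and the assumption that its scaling matches what `cv2.dct` applies to a 1-D input is taken from the OpenCV documentation and is not verified in this notebook.

```python
from math import cos, pi, sqrt


def dct_1d(signal: list[float]) -> list[float]:
    """Naive O(N^2) orthonormal DCT-II of a 1-D signal (illustrative helper)."""
    n = len(signal)
    coefficients = []
    for j in range(n):
        # Scale factor: sqrt(1/N) for the constant term, sqrt(2/N) otherwise.
        scale = sqrt((1 if j == 0 else 2) / n)
        total = sum(x * cos(pi * (2 * k + 1) * j / (2 * n)) for k, x in enumerate(signal))
        coefficients.append(scale * total)
    return coefficients


flat = dct_1d([1.0, 1.0, 1.0, 1.0])  # constant signal
ramp = dct_1d([0.0, 1.0, 2.0, 3.0])  # linearly increasing signal
print([round(c, 3) for c in flat])
print([round(c, 3) for c in ramp])
```

For the constant signal only the first coefficient is non-zero (up to rounding), while the ramp spreads its energy over several coefficients of different magnitude. The transform is orthonormal, so the sum of squares of the coefficients equals the sum of squares of the input.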


In [None]:
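def to_fsd(values: npt.ArrayLike, mode: int = FSD_FAST) -> npt.ArrayLike:
    """Replaces each non-zero value in values with its first significant digit.

    Zeros have no first significant digit and are dropped.

    Args:
        values (npt.ArrayLike): An array of float values
        mode (int): FSD_SLOW (pure Python loop) or FSD_FAST (vectorized NumPy)

    Returns:
        npt.ArrayLike: The first significant digits of the non-zero values
    """
    # Accept any array-like input and work on absolute, non-zero values only
    values = np.asarray(values, dtype="float64")
    values = np.abs(values[values != 0])
    if mode == FSD_SLOW:
        # Shift each value into [1, 10) by its decimal exponent, then truncate
        fsd = [int(value * 10 ** -floor(log10(value))) for value in tqdm(values)]
        return np.array(fsd, dtype="int")
    if mode == FSD_FAST:
        exponents = np.floor(np.log10(values))
        return (values * 10.0 ** -exponents).astype("int")
    raise ValueError(f"Unknown mode: {mode}")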
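
The expression used in `to_fsd` is easier to follow on single numbers: `floor(log10(abs(x)))` gives the decimal exponent, multiplying by ten to the negative exponent shifts the value into the range from 1 up to (but excluding) 10, and truncating to an integer leaves the first significant digit. A minimal sketch with a hypothetical helper `first_significant_digit` that mirrors this expression:

```python
from math import floor, log10


def first_significant_digit(value: float) -> int:
    """Return the first significant decimal digit of a non-zero number."""
    if value == 0:
        raise ValueError("zero has no first significant digit")
    magnitude = abs(value)
    exponent = floor(log10(magnitude))       # e.g. 2 for 347.2, -3 for 0.0052
    shifted = magnitude * 10.0 ** -exponent  # nominally in [1, 10)
    return int(shifted)                      # truncate to the leading digit


for x in (347.2, -0.0052, 8.0, 1200.0):
    print(x, "->", first_significant_digit(x))
```

The sign is discarded and the position of the decimal point does not matter, so `347.2` maps to 3 and `-0.0052` maps to 5.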
        

In [None]:
def count_fsds(fsds: npt.ArrayLike, base: int = BASE_10) -> npt.ArrayLike:
    """Counts the occurrences of each possible first digit (1-9 for base 10).

    Args:
        fsds (npt.ArrayLike): An array of digits
        base (int): The number base; the digits 1 to base - 1 are counted

    Returns:
        npt.ArrayLike: The number of occurrences per digit, of length base - 1
    """
    count = []
    for i in range(1,base):
        count.append(np.count_nonzero(fsds == i))
    return np.array(count)

In [None]:
def benfords_law() -> npt.ArrayLike:
    """Create the ground truth distribution according to Benford's law for base 10 digits.

    Returns:
        npt.ArrayLike: The Benford's law distribution for the base 10 digits 1-9
    """
    bf_law = []
    for i in range(1,10):
        bf_law.append(log10(1 + (1 / i)))
    return np.array(bf_law)
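
As a quick sanity check of the overall idea (extract first digits, count them, normalize, compare with the ground truth), the sketch below applies these steps to the powers of two, a sequence whose leading digits are known to follow Benford's law. It uses exact integer arithmetic and string conversion instead of the notebook's `to_fsd`, purely to stay dependency-free; all variable names are illustrative.

```python
from collections import Counter
from math import log10

# Leading decimal digit of 2**n for n = 0 .. 999, read off the exact integer.
leading_digits = [int(str(2 ** n)[0]) for n in range(1000)]

# Relative frequency of each digit 1 .. 9.
counts = Counter(leading_digits)
observed = [counts[d] / len(leading_digits) for d in range(1, 10)]

# Benford's law for base 10 with beta = 1.
expected = [log10(1 + 1 / d) for d in range(1, 10)]

for d, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(f"{d}: observed {obs:.3f}  Benford {exp:.3f}")
```

The observed frequencies should sit close to the Benford values for every digit, which is the behaviour the notebook looks for in image data.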

In [None]:
# Slow version (pure python): Run dct on images and gather first significant digits
dcts = get_dct_array(images)
fsd = to_fsd(dcts, mode=FSD_SLOW)

print(f"Shape: {fsd.shape}")
print(f"Sum: {np.sum(fsd)}")

In [None]:
# Fast version (numpy): Run dct on images and gather first significant digits
dcts = get_dct_array(images)
fsd = to_fsd(dcts, mode=FSD_FAST)

print(f"Shape: {fsd.shape}")
print(f"Sum: {np.sum(fsd)}")

In [None]:
# Gather first significant digits on the raw pixel values
# (this overwrites the DCT-based fsd from the cells above)
pixels = images.flatten()
fsd = to_fsd(pixels)

In [None]:
# Count fsds
fsd_count = count_fsds(fsd, base=BASE_10)
fsd_count

In [None]:
# Calculate distribution of each digit
fsd_count_dist = fsd_count / np.sum(fsd_count)
fsd_count_dist

In [None]:
# Generate the ground truth Benford's law distribution
bf_law = benfords_law()
bf_law

In [None]:
# Plot data to compare fsd vs Benford's law
df = pd.DataFrame()
df["digit"] = [1,2,3,4,5,6,7,8,9]
df["MNIST FSD distribution"] = fsd_count_dist
df["Benford's Law (ground truth)"] = bf_law

df.plot(x="digit", y=["MNIST FSD distribution", "Benford's Law (ground truth)"],
        labels={
            "digit" : "First Significant Digit (FSD)",
            "value" : "Probability"
        })
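
The plot gives a visual comparison only. To condense the deviation into a single number, one option (an assumption of this sketch, not something computed elsewhere in this notebook) is the Jensen-Shannon divergence between the observed distribution and Benford's law. The function name and the toy `uniform` input are hypothetical.

```python
from math import log2, log10


def jensen_shannon_divergence(p: list[float], q: list[float]) -> float:
    """Jensen-Shannon divergence in bits: 0 for identical inputs, at most 1."""
    def kl(a: list[float], b: list[float]) -> float:
        # Terms with a_i == 0 contribute nothing to the Kullback-Leibler sum.
        return sum(a_i * log2(a_i / b_i) for a_i, b_i in zip(a, b) if a_i > 0)

    # The mixture is positive wherever p or q is, so kl() never divides by zero.
    mixture = [(p_i + q_i) / 2 for p_i, q_i in zip(p, q)]
    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


benford = [log10(1 + 1 / d) for d in range(1, 10)]
uniform = [1 / 9] * 9  # toy stand-in for an observed digit distribution

print(jensen_shannon_divergence(benford, benford))
print(jensen_shannon_divergence(uniform, benford))
```

In this notebook, `fsd_count_dist` and `bf_law` could be passed in place of the toy inputs; a value near 0 would indicate close agreement with Benford's law.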

In [None]:
# Repeat the analysis on a second dataset: horse images read from disk as grayscale
horses = np.array([cv2.imread(file, cv2.IMREAD_GRAYSCALE) for file in glob.glob("horses/000000/*.png")]).astype("float32")
print(horses.shape)

# Run dct on images and gather first significant digits
fsd = to_fsd(get_dct_array(horses))

# Count fsds and normalize the counts to a distribution
count = count_fsds(fsd, base=BASE_10)
count = count / np.sum(count)
print(count)

# Generate the ground truth Benford's law distribution
bf_law = benfords_law()

# Plot data to compare fsd vs Benford's law
df = pd.DataFrame()
df["digit"] = [1,2,3,4,5,6,7,8,9]
df["Horses FSD distribution"] = count
df["Benford's Law (ground truth)"] = bf_law

df.plot(x="digit", y=["Horses FSD distribution", "Benford's Law (ground truth)"],
        labels={
            "digit" : "First Significant Digit (FSD)",
            "value" : "Probability"
        })
