# Benford's Law Analysis
In this notebook we analyze the **distribution of first significant digits** (FSD) of different aspects of an image.
These could be, for example:
- The raw pixel values
- The discrete cosine transformation (DCT) values
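
To make the notion of a first significant digit concrete, here is a minimal sketch of how it could be extracted from arbitrary real values. The helper name `first_significant_digits` and the sample values are illustrative assumptions, not part of the analysis that follows.

```python
import numpy as np

def first_significant_digits(values):
    """Return the first significant digit (1-9) of every non-zero value."""
    x = np.abs(np.asarray(values, dtype=np.float64))
    x = x[x > 0]  # zero has no significant digit, so it is dropped
    exponent = np.floor(np.log10(x))  # power of ten that scales x into [1, 10)
    digits = np.floor(x / 10.0 ** exponent).astype(int)
    # Floating-point rounding near powers of ten can yield 10 or 0; map them back
    digits[digits > 9] = 1
    digits[digits < 1] = 9
    return digits

sample = [0.0042, 123.0, -57.3, 0.0, 9.99]  # illustrative values only
print(first_significant_digits(sample))  # [4 1 5 9]
```

The sign and the magnitude of a value do not matter, only its leading non-zero digit does.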

Benford's law is an observation that in many collections of numbers, be they mathematical tables, real-life data, or combinations thereof, the leading significant digits are not uniformly distributed, as might be expected, but are heavily skewed toward the smaller digits. [[1](https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=1074&context=rgp_rsr)]

It is mathematically defined as (simplified) [[2](https://arxiv.org/pdf/1506.03046.pdf)]:

$$bf(d)=\beta \log_b\left(1+\frac{1}{d}\right)$$

with $b$ being the base ($10$ for decimal numbers), $d$ ranging over the possible leading digits (for $b=10$: $\{1,\dots,9\}$), and $\beta$ a scaling factor that equals $1$ in the standard form of the law. The corresponding plot for $b=10$ looks as follows:

<img src="./benfords_law_ground_truth.png" alt="Benfords Law">
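
As a quick sanity check of the formula, the following sketch evaluates it for an arbitrary base, assuming $\beta = 1$. The function name `benford_probabilities` is a hypothetical helper. Because $\sum_{d=1}^{b-1}\log_b\frac{d+1}{d}=\log_b b=1$, the values form a proper probability distribution.

```python
import numpy as np

def benford_probabilities(base=10):
    """Benford probabilities for the digits 1..base-1, assuming beta = 1."""
    d = np.arange(1, base)
    # log_b(1 + 1/d), written with natural logarithms
    return np.log(1.0 + 1.0 / d) / np.log(base)

probs = benford_probabilities(10)
print(probs.round(4))  # digit 1 is the most likely, digit 9 the least
print(probs.sum())     # the terms telescope to log_b(b) = 1
```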

It has been shown that **natural** image data (e.g. photographs) also follows this distribution, whereas GAN-generated images do not. Bonettini and colleagues used this fact successfully in [[3](https://arxiv.org/pdf/2004.07682.pdf)] to distinguish between real and fake images.
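
One way to turn "follows this distribution" into a number is a divergence between the empirical FSD histogram and Benford's law. The sketch below uses the Jensen-Shannon divergence as an example measure. It is an illustration under that assumption, not a reproduction of the pipeline in [3], and the `uniform` histogram is a made-up stand-in for data that ignores Benford's law.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logarithms (result lies in [0, 1])."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()  # normalize both inputs to probability distributions
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

benford = np.log10(1.0 + 1.0 / np.arange(1, 10))
uniform = np.full(9, 1.0 / 9.0)  # hypothetical histogram with no digit preference

print(jensen_shannon_divergence(benford, benford))  # identical distributions give 0
print(jensen_shannon_divergence(uniform, benford))  # larger value means a worse fit
```

A small divergence indicates that the digits are close to Benford's law; what counts as "small" has to be calibrated on real data.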

As an example dataset we will use the famous grayscale MNIST dataset, which is included in TensorFlow Keras.

In [None]:
# Import packages and settings
import tensorflow as tf
import numpy as np
import pandas as pd
from tqdm import tqdm
import cv2
import glob
import numpy.typing as npt
from math import log10, floor

pd.options.plotting.backend = "plotly"

In [None]:
# Import and prepare data
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
images = np.append(train_images, test_images, axis=0)
images = images.reshape(images.shape[0], 28, 28, 1).astype('float32')

In [None]:
# Run the DCT on every image and gather the first significant digits (slow reference loop)
fsd = []
dcts = np.array([cv2.dct(image) for image in images])
dcts = dcts.flatten()
dcts = dcts[dcts != 0]  # zero coefficients have no significant digit
for n in tqdm(dcts):
    # Scale |n| into [1, 10) and truncate to obtain the leading digit
    num = int(abs(n) * 10 ** -floor(log10(abs(n))))
    fsd.append(num)
fsd = np.array(fsd)
print(f"Shape: {fsd.shape}")
print(f"Sum: {np.sum(fsd)}")
#print(fsd)

In [None]:
# Vectorized NumPy version: run the DCT on every image and gather the first significant digits
dcts = np.array([cv2.dct(image) for image in images])
n = dcts.flatten()
n = n[n != 0]  # zero coefficients have no significant digit
# Scale |n| into [1, 10) and truncate to obtain the leading digit
fsd_fast = np.abs(n * np.power(np.full(n.shape, 10.), -np.floor(np.log10(np.abs(n).astype("float64"))))).astype("int")
print(f"Shape: {fsd_fast.shape}")
print(f"Sum: {np.sum(fsd_fast)}")

In [None]:
# Verify that the loop and the NumPy implementation produce identical digits
for i in range(len(fsd)):
    if fsd[i] != fsd_fast[i]:
        print(f"different at index {i}, {fsd[i]} != {fsd_fast[i]}")
print(np.sum(fsd) == np.sum(fsd_fast))
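
The element-wise check above iterates over every coefficient in Python, which can be slow for tens of millions of values. A vectorized alternative might look like the following sketch; the two small arrays are hypothetical stand-ins for the digit arrays from the previous cells.

```python
import numpy as np

# Hypothetical stand-ins for the two digit arrays compared above
digits_loop = np.array([1, 2, 3, 9, 4])
digits_fast = np.array([1, 2, 4, 9, 4])

# Indices where the two implementations disagree
mismatches = np.flatnonzero(digits_loop != digits_fast)
for i in mismatches:
    print(f"different at index {i}, {digits_loop[i]} != {digits_fast[i]}")

# True only if shapes and all elements match
print(np.array_equal(digits_loop, digits_fast))
```

Unlike comparing sums, `np.array_equal` cannot be fooled by differences that cancel each other out.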

In [None]:
# Count how often each digit 1-9 occurs and normalize to relative frequencies
count = []
for i in range(1,10):
    count.append(np.count_nonzero(fsd == i))
count = count / np.sum(count)

count_fast = []
for i in range(1,10):
    count_fast.append(np.count_nonzero(fsd_fast == i))
count_fast = count_fast / np.sum(count_fast)

print(count)
print(count_fast)
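
The counting step above scans the digit array nine times. As a sketch of a single-pass alternative, `np.bincount` can build the whole histogram at once. The helper name `fsd_histogram` and the example array are assumptions for illustration.

```python
import numpy as np

def fsd_histogram(digits):
    """Relative frequency of the digits 1..9 in an integer array."""
    digits = np.asarray(digits, dtype=np.int64)
    digits = digits[(digits >= 1) & (digits <= 9)]  # ignore anything outside 1..9
    counts = np.bincount(digits, minlength=10)[1:10]  # drop the unused bin for 0
    return counts / counts.sum()

example = np.array([1, 1, 2, 3, 1, 9, 0])  # illustrative digits only
print(fsd_histogram(example))
```

The function assumes that at least one digit in the range 1 to 9 is present; an empty input would lead to a division by zero.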

In [43]:
# Generate the ground-truth Benford's law probabilities for the digits 1-9
bf_law = []
for i in range(1,10):
    bf_law.append(log10(1 + (1 / i)))
bf_law

[0.3010299956639812,
 0.17609125905568124,
 0.12493873660829992,
 0.09691001300805642,
 0.07918124604762482,
 0.06694678963061322,
 0.05799194697768673,
 0.05115252244738129,
 0.04575749056067514]

In [50]:
# Plot data to compare the FSD distribution against Benford's law
df = pd.DataFrame()
df["digit"] = [1,2,3,4,5,6,7,8,9]
df["MNIST FSD count"] = count
df["Benfords Law (ground truth)"] = bf_law

# df.plot(x="digit", y=["MNIST FSD count", "Benfords Law (ground truth)"],
#         labels={
#             "digit" : "First Significant Digit (FSD)",
#             "value" : "Probability"
#         })
df.plot(x="digit", y=["Benfords Law (ground truth)"],
        labels={
            "digit" : "First Significant Digit (FSD)",
            "value" : "Probability"
        })

In [None]:
# Repeat the analysis on a folder of horse images, loaded as grayscale
horses = np.array([cv2.imread(file, cv2.IMREAD_GRAYSCALE) for file in glob.glob("horses/000000/*.png")]).astype("float32")
print(horses.shape)

fsd = []
for horse in tqdm(horses):
    for dct in cv2.dct(horse):
        for n in dct:
            num = int(abs(n * (10 ** -int(floor(log10(abs(n))))))) if n != 0 else 0
            fsd.append(num)
fsd = np.array(fsd)

# Count how often each digit 1-9 occurs and normalize to relative frequencies
count = []
for i in range(1,10):
    count.append(np.count_nonzero(fsd == i))
count = count / np.sum(count)
print(count)

bf_law = []
for i in range(1,10):
    bf_law.append(log10(1 + (1 / i)))

df = pd.DataFrame()
df["digit"] = [1,2,3,4,5,6,7,8,9]
df["Horses FSD count"] = count
df["Benfords Law (ground truth)"] = bf_law

df.plot(x="digit", y=["Horses FSD count", "Benfords Law (ground truth)"],
        labels={
            "digit" : "First Significant Digit (FSD)",
            "value" : "Probability"
        })
