# Benford's Law Analysis
In this notebook we analyze the **distribution of first significant digits** (FSD) of different aspects of an image.
These could be, for example:
- The raw pixel values
- The discrete cosine transformation (DCT) values
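
To make the notion of a first significant digit concrete, here is a minimal NumPy sketch of how it could be extracted from an array of values. The function name `first_significant_digits` is hypothetical and only serves as an illustration; the notebook itself relies on `bfl.to_fsd`, whose implementation is not shown here and may differ (for instance in how it treats zeros).

```python
import numpy as np

def first_significant_digits(values, base=10):
    """Return the first significant digit of every non-zero entry of `values`."""
    v = np.abs(np.asarray(values, dtype=np.float64)).ravel()
    v = v[v > 0]  # zero has no significant digit, so it is dropped
    # Order of magnitude of each value in the given base
    exponent = np.floor(np.log(v) / np.log(base))
    mantissa = v / base ** exponent
    # Correct floating-point rounding at exact powers of the base
    mantissa = np.where(mantissa >= base, mantissa / base, mantissa)
    mantissa = np.where(mantissa < 1, mantissa * base, mantissa)
    return np.floor(mantissa).astype(int)

print(first_significant_digits([123, 0.045, -7.8, 0, 1000, 9]))
```

The sign is ignored and zeros are discarded, so the result can be shorter than the input.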

Benford's law is the observation that in many collections of numbers, be they mathematical tables, real-life data, or combinations thereof, the leading significant digits are not uniformly distributed, as might be expected, but are heavily skewed toward the smaller digits. [[1](https://digitalcommons.calpoly.edu/cgi/viewcontent.cgi?article=1074&context=rgp_rsr)]

It is mathematically defined as (simplified) [[2](https://arxiv.org/pdf/1506.03046.pdf)]:

$$bf(d)=\beta \log_b\left(1+\frac{1}{d}\right)$$

with $b$ being the base ($10$ for decimal numbers), $d$ ranging over the possible leading digits (for $b=10$: $\{1,\dots,9\}$), and $\beta$ a scaling factor ($\beta=1$ gives the classical law). The corresponding plot for $b=10$ looks as follows:

<img src="./benfords_law_ground_truth.png" alt="Benford's Law">
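
The formula can also be evaluated directly. The following sketch computes the expected probability of each leading digit; `benford_pmf` is a hypothetical helper written for this illustration and is not necessarily how `bfl.benfords_law()` is implemented.

```python
import numpy as np

def benford_pmf(base=10, beta=1.0):
    """Evaluate beta * log_b(1 + 1/d) for d = 1, ..., base - 1."""
    digits = np.arange(1, base)
    return beta * np.log(1 + 1 / digits) / np.log(base)

probs = benford_pmf()
for d, p in zip(range(1, 10), probs):
    print(f"{d}: {p:.3f}")
```

For $b=10$ and $\beta=1$ the probabilities sum to 1, with roughly 30.1% of the mass on the digit 1 and about 4.6% on the digit 9.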

It has been shown that **natural** image data (e.g. photographs) also follows this distribution, whereas GAN-generated images do not. Bonettini and colleagues used this fact successfully in [[3](https://arxiv.org/pdf/2004.07682.pdf)] to distinguish between real and fake images.
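
One way to turn "follows this distribution" into a number is to measure a divergence between the observed FSD distribution and Benford's law. The sketch below uses the Jensen-Shannon divergence as an illustrative choice; it is not claimed to reproduce the exact features used in [3], and the function name is an assumption made for this example.

```python
import numpy as np

def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    p = p / p.sum()  # normalize, so raw counts are accepted as well
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

benford = np.log10(1 + 1 / np.arange(1, 10))
uniform = np.full(9, 1 / 9)
print(jensen_shannon_divergence(benford, benford))  # identical distributions
print(jensen_shannon_divergence(uniform, benford))  # uniform digits deviate
```

With the base-2 logarithm the value lies between 0 (identical distributions) and 1, so a larger value indicates a stronger deviation from Benford's law.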

As an example dataset we will use the famous grayscale MNIST dataset, which is included in TensorFlow Keras.

In [None]:
# Import packages and settings
import glob
import cv2 as cv
import numpy as np
import pandas as pd
import bf_lib as bfl
import tensorflow as tf
import matplotlib.pyplot as plt

from tqdm import tqdm
from statics import FSD_FAST, FSD_SLOW, BASE_10

pd.options.plotting.backend = "plotly"

## MNIST Dataset

In [None]:
# Import and prepare data
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()
images = np.append(train_images, test_images, axis=0)
images = images.reshape(images.shape[0], 28, 28, 1).astype('float32')

### **DCT** whole image

In [None]:
# Run dct on images and gather first significant digits (Slow version - pure python)
dcts = bfl.get_dct_array(images)
fsd = bfl.to_fsd(dcts, mode=FSD_SLOW)

# Count fsds
fsd_count = bfl.count_fsds(fsd, base=BASE_10)

# Calculate distribution of each digit
fsd_count_dist = fsd_count / np.sum(fsd_count)

# Plot distribution against the ground truth Benford's law
bfl.plot_df_comparison(fsd_count_dist=fsd_count_dist, title="DCT FSDs vs. Benford's Law")

In [None]:
# Run dct on images and gather first significant digits (Fast version - numpy)
dcts = bfl.get_dct_array(images)
fsd = bfl.to_fsd(dcts, mode=FSD_FAST)

# Count fsds
fsd_count = bfl.count_fsds(fsd, base=BASE_10)

# Calculate distribution of each digit
fsd_count_dist = fsd_count / np.sum(fsd_count)

# Plot distribution against the ground truth Benford's law
bfl.plot_df_comparison(fsd_count_dist, title="DCT FSDs vs. Benford's Law")

### **Raw** images

Without normalization...

In [None]:
# Gather first significant digits on raw images
i = images.flatten()
fsd = bfl.to_fsd(i)

# Count fsds
fsd_count = bfl.count_fsds(fsd, base=BASE_10)

# Calculate distribution of each digit
fsd_count_dist_raw_mnist = fsd_count / np.sum(fsd_count)

# Plot distribution against the ground truth Benford's law
bfl.plot_df_comparison(fsd_count_dist_raw_mnist, title="Raw MNIST Images FSDs vs. Benford's Law")

...and with normalization.

In [None]:
# Gather first significant digits on raw normalized images
i = images.flatten()
# i = np.array([(p - np.min(i)) / (np.max(i) - np.min(i)) for p in images])
i = (i - np.min(i)) / (np.max(i) - np.min(i))

fsd = bfl.to_fsd(i)

# Count fsds
fsd_count = bfl.count_fsds(fsd, base=BASE_10)

# Calculate distribution of each digit
fsd_count_dist_normalized_mnist = fsd_count / np.sum(fsd_count)

# Plot distribution against the ground truth Benford's law
bfl.plot_df_comparison(fsd_count_dist_normalized_mnist, title="Raw Normalized MNIST Images FSDs vs. Benford's Law")

In [None]:
#########################
# DOES NOT WORK FOR NOW #
#########################

# FSDs on quantized DC transformed MNIST
# image_blocks = np.array([img_to_blocks(image) for image in images])
# print(images.shape)
# print(image_blocks.shape) # 7000 images, 9 blocks per image, 8x8 blocks
# image_blocks = image_blocks - 128
# print(f"Image block: \n{image_blocks[0][4]}")

# block_dcts = np.array([[cv2.dct(block) for block in image] for image in image_blocks])
# print(block_dcts.shape)
# print(f"DCT block: \n{block_dcts[0][4]}")

# quantized_blocks = np.array([[quantize_block(block) for block in block_dct] for block_dct in block_dcts])
# print(quantized_blocks.shape)
# print(f"Quantization block: \n{quantized_blocks[0][4]}")

# f_quantized_blocks = quantized_blocks.flatten()
# fsd = to_fsd(f_quantized_blocks)
# fsd_count = count_fsds(fsd, base=BASE_10)
# fsd_count_dist = fsd_count / np.sum(fsd_count)

# plot_df_comparison(fsd_count_dist, title="Quantized DC transformed MNIST vs. Benfords Law")

### **Quantized** DCT blocks
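
The pipeline in this section follows the JPEG recipe: split the image into 8x8 blocks, shift the pixel values by 128, apply a 2-D DCT to each block, and divide by a quantization table. The NumPy-only sketch below illustrates these steps. All names in it are hypothetical, and it assumes an orthonormal DCT-II and the example luminance table from the JPEG standard; `bfl.img_to_blocks` and `bfl.quantize_block` are not shown in this notebook and may use different conventions.

```python
import numpy as np

# Example luminance quantization table from the JPEG standard (Annex K)
JPEG_LUMA_Q = np.array([
    [16, 11, 10, 16,  24,  40,  51,  61],
    [12, 12, 14, 19,  26,  58,  60,  55],
    [14, 13, 16, 24,  40,  57,  69,  56],
    [14, 17, 22, 29,  51,  87,  80,  62],
    [18, 22, 37, 56,  68, 109, 103,  77],
    [24, 35, 55, 64,  81, 104, 113,  92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103,  99],
], dtype=np.float64)

def dct_matrix(n=8):
    """Orthonormal DCT-II matrix C, so that C @ x is the 1-D DCT of x."""
    k = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    c = np.sqrt(2 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    c[0, :] /= np.sqrt(2)
    return c

def split_into_blocks(image, size=8):
    """Cut a 2-D image into non-overlapping size x size blocks."""
    h, w = image.shape
    h, w = h - h % size, w - w % size  # drop incomplete border blocks
    blocks = image[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.swapaxes(1, 2).reshape(-1, size, size)

def quantized_block_dct(image, q_table=JPEG_LUMA_Q):
    """Level-shift, block-wise 2-D DCT, then quantize with `q_table`."""
    c = dct_matrix(8)
    blocks = split_into_blocks(np.asarray(image, dtype=np.float64)) - 128.0
    coeffs = c @ blocks @ c.T  # 2-D DCT-II applied to every block
    return np.round(coeffs / q_table).astype(int)

demo = quantized_block_dct(np.full((28, 28), 144.0))
print(demo.shape)
```

For a 28x28 MNIST image only the top-left 24x24 region is covered, which yields the 9 blocks per image mentioned in the cell below.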

In [None]:
# FSDs on quantized, DCT-transformed MNIST
fd_list = np.array([0] * 9)
i_s = np.array([image.reshape(28,28) for image in images])
for img in tqdm(i_s):
    image_blocks = bfl.img_to_blocks(img)
    # print(image_blocks.shape) # 7000 images, 9 blocks per image, 8x8 blocks
    image_blocks = image_blocks - 128
    # print(image_blocks[4])

    block_dcts = np.array([cv.dct(block) for block in image_blocks])
    # print(block_dcts.shape)
    # print(f"DCT block: \n{block_dcts[4]}")

    quantized_blocks = np.array([bfl.quantize_block(block) for block in block_dcts])
    # print(quantized_blocks.shape)
    # print(f"Quantization block: \n{quantized_blocks[0]}")

    fsds = np.array([bfl.to_fsd(q.flatten()) for q in quantized_blocks])
    # print(f"FSD: \n{fsd}")
    # if 2 in fsds[0]:
    #     print("Number included")
    # else:
    #     print("Number not inlcuded")
    # fsd = to_fsd(f_quantized_blocks)
    # fsd_count = count_fsds(fsd[0], base=BASE_10)
    # fsd_count_dist = fsd_count / np.sum(fsd_count)
    for fsd in fsds:
        for i in range(1,10):
            if i in fsd:
                fd_list[i-1] += 1
print(fd_list)
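# Normalize so the values sum to 1 and can be compared with Benford's law
fd_list = fd_list / np.sum(fd_list)
# fd_list = fd_list / (len(images) * 9)  # alternative: fraction of blocks containing each digit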


bfl.plot_df_comparison(fd_list, title="Quantized DCT-transformed MNIST vs. Benford's Law")

## GAN generated images

In [None]:
horses = np.array([cv.imread(file, cv.IMREAD_GRAYSCALE) for file in glob.glob("horses/000000/*.png")]).astype("float32")
print(horses.shape)
print(horses[0].shape)
plt.imshow(horses[0], cmap=plt.cm.gray)
plt.show()

### DCT whole image

In [None]:
dcts = bfl.get_dct_array(horses)
fsd = bfl.to_fsd(dcts)
fsd_count = bfl.count_fsds(fsd)
fsd_count_dist = fsd_count / np.sum(fsd_count)

bfl.plot_df_comparison(fsd_count_dist=fsd_count_dist, title="GAN DCT FSDs vs. Benford's Law")

### Raw images

Without normalization...

In [None]:
h = horses.flatten()
fsd = bfl.to_fsd(h)
fsd_count = bfl.count_fsds(fsd)
fsd_count_dist_raw_gan = fsd_count / np.sum(fsd_count)

bfl.plot_df_comparison(fsd_count_dist=fsd_count_dist_raw_gan, title="Raw GAN FSDs vs. Benford's Law")

...and with normalization.

In [None]:
# Gather first significant digits on raw normalized images
flattened_horses = horses.flatten()
i = (flattened_horses - np.min(flattened_horses)) / (np.max(flattened_horses) - np.min(flattened_horses))

fsd = bfl.to_fsd(i)

# Count fsds
fsd_count = bfl.count_fsds(fsd, base=BASE_10)

# Calculate distribution of each digit
fsd_count_dist_normalized_gan = fsd_count / np.sum(fsd_count)

# Plot distribution against the ground truth Benford's law
bfl.plot_df_comparison(fsd_count_dist_normalized_gan, title="Raw Normalized GAN Images FSDs vs. Benford's Law")

### Quantized DCT blocks

In [None]:
# FSDs on quantized, DCT-transformed GAN images
fd_list = [0] * 9
# i_s = np.array([image.reshape(28,28) for image in images])
horses = np.array([cv.imread(file, cv.IMREAD_GRAYSCALE) for file in glob.glob("horses/000000/*.png")]).astype("float32")
for img in tqdm(horses):
    image_blocks = bfl.img_to_blocks(img)
    # print(image_blocks.shape) # 7000 images, 9 blocks per image, 8x8 blocks
    image_blocks = image_blocks - 128
    # print(image_blocks[4])

    block_dcts = np.array([cv.dct(block) for block in image_blocks])
    # print(block_dcts.shape)
    # print(f"DCT block: \n{block_dcts[4]}")

    quantized_blocks = np.array([bfl.quantize_block(block) for block in block_dcts])
    # print(quantized_blocks.shape)
    # print(f"Quantization block: \n{quantized_blocks[0]}")

    fsds = np.array([bfl.to_fsd(q.flatten()) for q in quantized_blocks])
    # print(f"FSD: \n{fsd}")
    # if 2 in fsds[0]:
    #     print("Number included")
    # else:
    #     print("Number not inlcuded")
    # fsd = to_fsd(f_quantized_blocks)
    # fsd_count = count_fsds(fsd[0], base=BASE_10)
    # fsd_count_dist = fsd_count / np.sum(fsd_count)

    
    # print(fd_list)
    for fsd in fsds:
        for i in range(1,10):
            if i in fsd:
                fd_list[i-1] += 1
print(fd_list)
# fd_list = np.array(fd_list) / (len(images) * 9)
fd_list = np.array(fd_list) / np.sum(fd_list)  # alternative: / (len(horses) * 1024)

bfl.plot_df_comparison(fd_list, title="Quantized DCT-transformed GAN vs. Benford's Law")

In [None]:
# from plotly.subplots import make_subplots
# df = pd.DataFrame()
# df["digit"] = [i for i in range(1, 10, 1)]
# df["MNIST FSD count"] = fsd_count_dist_normalized_mnist
# df["GAN FSD COUNT"] = fsd_count_dist_normalized_gan
# df["Benfords Law (ground truth)"] = bfl.benfords_law()


# fig = make_subplots(rows=2, cols=1)
# fig.add_bar(x=df["digit"], y=df["MNIST FSD count"], name="Measurements normalized MNIST", hoverinfo="y", row=1, col=1)
# fig.add_scatter(x=df["digit"], y=df["Benfords Law (ground truth)"], name="Ground Truth", hoverinfo="y", row=1, col=1)
# fig.add_bar(x=df["digit"], y=df["GAN FSD COUNT"], name="Measurements normalized  GAN", hoverinfo="y", row=2, col=1)
# fig.add_scatter(x=df["digit"], y=df["Benfords Law (ground truth)"], name="Ground Truth", hoverinfo="y", row=2, col=1)
# fig.update_layout(title="Raw normalized FSDs, MNIST (top) and GAN (bottom)", xaxis_title="Digits", yaxis_title="Distribution")
# fig.show(renderer="browser")