### DETAILED EXPLANATION

This script generates three blurred versions of each image in the dataset, one per blur type: Gaussian, motion, and box.

**Why Blurring Images?**

Blurring is a common type of image distortion that can occur due to various factors:

- Camera motion during capture (Motion Blur)
- Out-of-focus regions due to depth of field (Gaussian Blur)
- Uniform blurring due to poor image processing or resizing (Box Blur)

These blurs simulate real-world distortions, making our analysis more realistic.

**Types of Blur Used**

1. Gaussian Blur:
   - Convolves the image with a Gaussian kernel, producing a smooth, out-of-focus effect.
   - Kernel weights are largest at the center and fall off with distance, following a bell curve.
   - Sigma is the standard deviation of the Gaussian distribution (higher = stronger blur).
   - We chose a sigma range of 0.5 to 4.0 because it gives visible but controlled blur.

2. Motion Blur:
   - Simulates the effect of movement, either from a moving object or a moving camera.
   - The blur has a clear direction (angle) and intensity (length).
   - We chose a length of 3 to 30 so that the motion is visible without being extreme.
   - An angle of 0 to 360 degrees covers every direction of movement.

3. Box Blur:
   - Replaces each pixel with the plain average of the pixel values in a square region around it.
   - This blur is fast and commonly used for quick blur effects.
   - The kernel size is the side length of the square region used for averaging (higher = stronger blur).
   - We chose a kernel size of 3 to 15 because it gives visible blur without being extreme.
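
As a rough illustration of the kernel properties described above, here is a minimal NumPy-only sketch. The helper names `box_kernel` and `gaussian_kernel` are hypothetical stand-ins, not the functions used later in this notebook (which rely on OpenCV), and the sizes and sigma are arbitrary example values. It checks two things: each kernel sums to 1, so overall brightness is preserved, and the Gaussian kernel peaks at its center.

```python
import numpy as np

def box_kernel(size):
    """Uniform averaging kernel: every weight equals 1 / size**2."""
    return np.ones((size, size), dtype=np.float64) / (size * size)

def gaussian_kernel(size, sigma):
    """2D Gaussian kernel built as the outer product of a normalized 1D Gaussian."""
    # Offsets of each tap from the kernel center, e.g. [-3, ..., 3] for size 7
    offsets = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(offsets ** 2) / (2.0 * sigma ** 2))
    g /= g.sum()  # normalize the 1D kernel so the 2D kernel also sums to 1
    return np.outer(g, g)

box = box_kernel(5)            # example size, chosen arbitrarily
gauss = gaussian_kernel(7, 1.0)  # example size and sigma, chosen arbitrarily

print(np.isclose(box.sum(), 1.0), np.isclose(gauss.sum(), 1.0))
print(gauss[3, 3] == gauss.max())  # the center weight is the largest
```

Because the weights sum to 1, convolving with either kernel redistributes intensity among neighboring pixels without brightening or darkening the image as a whole.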

**Why Save Parameters in a Parquet File?**

- Parquet is a compressed, columnar format, so files stay small and individual columns can be read quickly.
- It scales well to large datasets, keeping read and write operations fast.
- Parquet lets us store the blur parameters alongside the image identifiers, making each blurred image easy to track and analyze.

In [7]:
import numpy as np
import cv2

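def gen_gaussian_kernel(size, sigma):
    """Generate a 2D Gaussian blur kernel of shape (size, size)."""
    # getGaussianKernel returns a (size, 1) column vector of 1D Gaussian weights
    k1d = cv2.getGaussianKernel(size, sigma)
    # The outer product of the 1D kernel with itself gives the separable 2D kernel
    kernel = k1d @ k1d.T
    return kernel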

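def gen_motion_kernel(length, angle):
    """Generate a motion blur kernel from a length (pixels) and an angle (degrees)."""
    # Force an odd kernel size so the kernel has a well-defined center pixel
    if length % 2 == 0:
        length += 1

    kernel = np.zeros((length, length), dtype=np.float32)
    center = (length - 1) / 2.0

    # Half-extent of the motion line along x and y
    angle_rad = np.deg2rad(angle)
    dx = length / 2.0 * np.cos(angle_rad)
    dy = length / 2.0 * np.sin(angle_rad)

    # int() truncates toward zero, which keeps both endpoints inside the kernel
    start_point = (int(center - dx), int(center - dy))
    end_point = (int(center + dx), int(center + dy))

    # Draw the line with value 1 and thickness 1, then normalize the weights to sum to 1
    cv2.line(kernel, start_point, end_point, 1, 1, lineType=cv2.LINE_AA)
    kernel /= np.sum(kernel)
    return kernel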

def gen_box_kernel(size):
    """Generate a box blur kernel."""
    return np.ones((size, size), dtype=np.float32) / (size * size)
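
To make the geometry of `gen_motion_kernel` easier to follow, the sketch below repeats only its endpoint arithmetic in plain Python, without OpenCV. The helper name `motion_line_endpoints` is hypothetical and exists only for this illustration. It shows that an even requested length is padded to the next odd size before the line is drawn.

```python
import math

def motion_line_endpoints(length, angle):
    """Return (kernel_size, start_point, end_point) using the same arithmetic as gen_motion_kernel."""
    # Even lengths are bumped to the next odd number
    size = length + 1 if length % 2 == 0 else length
    center = (size - 1) / 2.0

    angle_rad = math.radians(angle)
    dx = size / 2.0 * math.cos(angle_rad)
    dy = size / 2.0 * math.sin(angle_rad)

    # int() truncates toward zero, so -0.5 becomes 0 and 12.5 becomes 12
    start = (int(center - dx), int(center - dy))
    end = (int(center + dx), int(center + dy))
    return size, start, end

print(motion_line_endpoints(12, 0.0))  # (13, (0, 6), (12, 6))
print(motion_line_endpoints(7, 0.0))   # (7, (0, 3), (6, 3))
```

For a requested length of 12 at 0 degrees, the kernel is 13×13 and the line spans the full middle row. This is worth keeping in mind when reading the stored `motion_length` values, since the loop below records the requested length rather than the padded kernel size.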

In [8]:
from PIL import Image
import matplotlib.pyplot as plt
import cv2
import numpy as np
import pandas as pd
import random
from tqdm import tqdm

from common.config import TEST_ORIGINAL_DIR, TEST_BLURRED_BOX_DIR, TEST_BLURRED_GAUSSIAN_DIR, TEST_BLURRED_MOTION_DIR, BLUR_PARAM_RANGES

# Sample size: None to process all images, or an integer to process a subset
SAMPLE_SIZE = None

image_files = sorted(list((TEST_ORIGINAL_DIR / "00000").glob("*.png")))

if SAMPLE_SIZE is not None:
    image_files = image_files[:SAMPLE_SIZE]

for blur_type in [TEST_BLURRED_BOX_DIR, TEST_BLURRED_GAUSSIAN_DIR, TEST_BLURRED_MOTION_DIR]:
    blur_type.mkdir(parents=True, exist_ok=True)

blurred_images = []

for img_path in tqdm(image_files):
    image = cv2.imread(str(img_path))
    if image is None:
        continue
    original_name = img_path.name

    # Use the file name without its extension as the join key
    blur_params = {
        "key": img_path.stem
    }

    # box
    kernel_size = random.randint(*BLUR_PARAM_RANGES['box']['kernel_size'])
    
    kernel = gen_box_kernel(kernel_size)
    blurred = cv2.filter2D(image, -1, kernel)
    
    blur_params['box_size'] = kernel_size

    cv2.imwrite(str(TEST_BLURRED_BOX_DIR / original_name), blurred)

    # gaussian
    sigma = random.uniform(*BLUR_PARAM_RANGES['gaussian']['sigma'])

    min_size, max_size = BLUR_PARAM_RANGES['gaussian']['size']
    ksize = int(round(6 * sigma)) | 1
    if ksize < min_size:
        ksize = min_size | 1
    elif ksize > max_size:
        ksize = max_size if max_size % 2 == 1 else max_size - 1

    kernel = gen_gaussian_kernel(ksize, sigma)
    blurred = cv2.filter2D(image, -1, kernel)

    blur_params['gaussian_sigma'] = sigma
    blur_params['gaussian_size'] = ksize
    
    cv2.imwrite(str(TEST_BLURRED_GAUSSIAN_DIR / original_name), blurred)

    # motion
    length = random.randint(*BLUR_PARAM_RANGES['motion']['length'])
    angle = random.uniform(*BLUR_PARAM_RANGES['motion']['angle'])

    kernel = gen_motion_kernel(length, angle)
    blurred = cv2.filter2D(image, -1, kernel)
    
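    # Note: this records the requested length; gen_motion_kernel pads even lengths to the next odd size
    blur_params['motion_length'] = length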
    blur_params['motion_angle'] = angle
    
    cv2.imwrite(str(TEST_BLURRED_MOTION_DIR / original_name), blurred)
    
    blurred_images.append(blur_params)

df_blur = pd.DataFrame(blurred_images)
# df_blur.to_parquet(str(const.DIR_DATASET_PATH / "blur_metadata.parquet"), index=False)

print("Blur generation completed.")

100%|██████████| 950/950 [05:44<00:00,  2.76it/s]

Blur generation completed.
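
The loop above draws its parameters from the global `random` module, so each run produces different values. Below is a minimal sketch of how the sampling could be separated from the image I/O and made reproducible with a seeded generator. Everything named here is an assumption for illustration: `EXAMPLE_RANGES` only mimics the shape of `BLUR_PARAM_RANGES` (its real values live in `common.config` and are not shown in this notebook), the Gaussian size bounds of 3 and 21 are guesses consistent with the values visible in `df_blur`, and `gaussian_ksize` and `sample_blur_params` are hypothetical helpers.

```python
import random

# Hypothetical ranges with the same structure the loop expects from BLUR_PARAM_RANGES
EXAMPLE_RANGES = {
    "box": {"kernel_size": (3, 15)},
    "gaussian": {"sigma": (0.5, 4.0), "size": (3, 21)},  # size bounds are an assumption
    "motion": {"length": (3, 30), "angle": (0, 360)},
}

def gaussian_ksize(sigma, min_size, max_size):
    """Odd kernel size of about 6 * sigma, clamped to [min_size, max_size]."""
    ksize = int(round(6 * sigma)) | 1  # setting the lowest bit forces an odd size
    if ksize < min_size:
        ksize = min_size | 1
    elif ksize > max_size:
        ksize = max_size if max_size % 2 == 1 else max_size - 1
    return ksize

def sample_blur_params(rng, ranges):
    """Draw one set of blur parameters from the given random.Random instance."""
    box_size = rng.randint(*ranges["box"]["kernel_size"])
    sigma = rng.uniform(*ranges["gaussian"]["sigma"])
    length = rng.randint(*ranges["motion"]["length"])
    angle = rng.uniform(*ranges["motion"]["angle"])
    return {
        "box_size": box_size,
        "gaussian_sigma": sigma,
        "gaussian_size": gaussian_ksize(sigma, *ranges["gaussian"]["size"]),
        "motion_length": length,
        "motion_angle": angle,
    }

# Two generators with the same seed yield identical parameters
first = sample_blur_params(random.Random(42), EXAMPLE_RANGES)
second = sample_blur_params(random.Random(42), EXAMPLE_RANGES)
print(first == second)  # True
```

Sizing the kernel at roughly 6 × sigma covers about three standard deviations on each side of the center, which captures nearly all of the Gaussian's weight. The clamp explains why several rows with sigma above about 3.5 share the same `gaussian_size`.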

In [9]:
df_blur

Unnamed: 0,key,box_size,gaussian_sigma,gaussian_size,motion_length,motion_angle
0,000000000,13,3.580787,21,7,48.338477
1,000000002,6,1.822229,11,19,139.092941
2,000000003,5,3.916722,21,27,347.372721
3,000000004,15,2.961514,19,25,45.563080
4,000000005,14,1.030438,7,19,290.691205
...,...,...,...,...,...,...
945,000001245,8,3.586043,21,8,331.461313
946,000001246,14,3.551686,21,12,80.181946
947,000001247,10,2.108768,13,18,288.551090
948,000001248,14,1.753334,11,9,277.406920


In [10]:
from common.config import IDPA_DATASET

df_original = pd.read_parquet(IDPA_DATASET)
df_original

Unnamed: 0,url,category,key,width,height,exif,aspect_ratio,size,rms_contrast,sobel_edge_strength,canny_edge_density
0,http://100500foto.com/wp-content/uploads/2016/...,people,000000291,,,,,,,,
1,http://2gfsl7am0og1m91u0pwpiehl.wpengine.netdn...,indoor_scene,000000987,,,,,,,,
2,http://411posters.com/wp-content/uploads/2011/...,poster,000000382,1300.0,1728.0,{},0.752315,3169036.0,0.373365,54.414061,0.139681
3,http://RealEstateAdminImages.gabriels.net/170/...,architecture,000000058,,,,,,,,
4,http://RealEstateAdminImages.gabriels.net/170/...,architecture,000000001,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
1245,https://www.yamaha.com/en/musical_instrument_g...,complex,000000564,1800.0,1042.0,{},1.727447,3199285.0,0.331964,60.298111,0.093565
1246,https://www.yellowmaps.com/usgs/topomaps/drg24...,map,000001225,1509.0,2026.0,"{""Image Tag 0x5100"": ""0""}",0.744817,5742001.0,0.221090,99.311477,0.177919
1247,https://www.zappos.com/images/z/2/5/1/8/8/7/25...,furniture,000000481,1920.0,1440.0,{},1.333333,4276346.0,0.253243,53.379880,0.086103
1248,https://ycdn.space/h/2015/02/Capitol-Hill-Loft...,indoor_scene,000000954,1050.0,1575.0,{},0.666667,2000861.0,0.241754,64.152358,0.085650


In [11]:
# LEFT merge to keep all rows of df_original
df_merged = pd.merge(
    df_original,
    df_blur,
    on='key',
    how='left'
)
df_merged

Unnamed: 0,url,category,key,width,height,exif,aspect_ratio,size,rms_contrast,sobel_edge_strength,canny_edge_density,box_size,gaussian_sigma,gaussian_size,motion_length,motion_angle
0,http://100500foto.com/wp-content/uploads/2016/...,people,000000291,,,,,,,,,,,,,
1,http://2gfsl7am0og1m91u0pwpiehl.wpengine.netdn...,indoor_scene,000000987,,,,,,,,,,,,,
2,http://411posters.com/wp-content/uploads/2011/...,poster,000000382,1300.0,1728.0,{},0.752315,3169036.0,0.373365,54.414061,0.139681,14.0,2.079159,13.0,13.0,232.710817
3,http://RealEstateAdminImages.gabriels.net/170/...,architecture,000000058,,,,,,,,,,,,,
4,http://RealEstateAdminImages.gabriels.net/170/...,architecture,000000001,,,,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1245,https://www.yamaha.com/en/musical_instrument_g...,complex,000000564,1800.0,1042.0,{},1.727447,3199285.0,0.331964,60.298111,0.093565,11.0,0.951264,7.0,7.0,12.545025
1246,https://www.yellowmaps.com/usgs/topomaps/drg24...,map,000001225,1509.0,2026.0,"{""Image Tag 0x5100"": ""0""}",0.744817,5742001.0,0.221090,99.311477,0.177919,14.0,0.910492,5.0,12.0,75.341476
1247,https://www.zappos.com/images/z/2/5/1/8/8/7/25...,furniture,000000481,1920.0,1440.0,{},1.333333,4276346.0,0.253243,53.379880,0.086103,9.0,2.791001,17.0,13.0,16.794518
1248,https://ycdn.space/h/2015/02/Capitol-Hill-Loft...,indoor_scene,000000954,1050.0,1575.0,{},0.666667,2000861.0,0.241754,64.152358,0.085650,13.0,3.734136,21.0,10.0,286.679094
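
Since the merged frame is written back to the same Parquet file it was read from, re-running the notebook would merge blur columns into a frame that already contains them, and pandas would then produce suffixed `_x`/`_y` duplicates. The sketch below shows one possible way to guard against that and to count the rows that received no blur parameters. The helper `merge_blur_params` is hypothetical, and the two small frames are made-up toy data, not rows from the dataset.

```python
import pandas as pd

BLUR_COLUMNS = ["box_size", "gaussian_sigma", "gaussian_size",
                "motion_length", "motion_angle"]

def merge_blur_params(df_original, df_blur):
    """Left-merge blur parameters on 'key'; safe to call again on its own output."""
    # Drop blur columns left over from a previous run to avoid _x/_y duplicates
    base = df_original.drop(columns=BLUR_COLUMNS, errors="ignore")
    merged = base.merge(df_blur, on="key", how="left",
                        validate="one_to_one", indicator=True)
    # Rows marked 'left_only' had no matching blurred image
    n_missing = int((merged["_merge"] == "left_only").sum())
    return merged.drop(columns="_merge"), n_missing

# Toy data for illustration only
toy_original = pd.DataFrame({"key": ["k1", "k2", "k3"],
                             "category": ["map", "poster", "people"]})
toy_blur = pd.DataFrame({"key": ["k2", "k3"],
                         "box_size": [6, 5],
                         "gaussian_sigma": [1.5, 3.0],
                         "gaussian_size": [9, 19],
                         "motion_length": [19, 27],
                         "motion_angle": [90.0, 180.0]})

merged, n_missing = merge_blur_params(toy_original, toy_blur)
again, _ = merge_blur_params(merged, toy_blur)  # simulate a second run

print(n_missing)                                    # 1
print(list(again.columns) == list(merged.columns))  # True
```

Rows without a match get `NaN` in the blur columns, which is also why integer parameters such as `box_size` appear as floats (for example `14.0`) in the merged output above.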


In [12]:
from common.config import IDPA_DATASET

df_merged.to_parquet(IDPA_DATASET)