# Exploratory Data Analysis of Scanned Document Samples

This notebook explores the scanned document samples collected from several scanner devices. We inspect basic image properties (resolution, file format, and color mode) and visualize how they are distributed across the dataset.

In [1]:
import os
import pandas as pd
import matplotlib.pyplot as plt
from PIL import Image

# Define the path to the raw data directory
raw_data_dir = '../data/raw'
# Load the labels dataset
labels_path = '../data/annotations/labels.csv'
labels_df = pd.read_csv(labels_path)

labels_df.head()
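
Before opening any images, it can help to confirm that every file listed in the labels actually exists on disk, since a single missing file would otherwise stop the analysis loop partway through. The following is a minimal sketch of such a check. The helper name `find_missing_files` and the demo file names are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def find_missing_files(file_names, data_dir):
    """Return the file names that do not exist as regular files under data_dir."""
    data_dir = Path(data_dir)
    return [name for name in file_names if not (data_dir / name).is_file()]


# Self-contained demo using a temporary directory and made-up file names
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "scan_001.png").touch()
    demo_missing = find_missing_files(["scan_001.png", "scan_002.png"], tmp)

print(demo_missing)
```

In this notebook the call would presumably look like `find_missing_files(labels_df['file_name'], raw_data_dir)`, assuming the `file_name` column holds paths relative to the raw data directory.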

## Image Properties Analysis

For each file listed in the labels, we open the image in the raw data directory and record its width, height, file format, and color mode.

In [2]:
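def analyze_image_properties(image_path):
    """Return (width, height, format, mode) for the image at image_path."""
    # Image.open is lazy: pixel data is not decoded just to read these attributes
    with Image.open(image_path) as img:
        width, height = img.size
        img_format = img.format  # renamed to avoid shadowing the built-in format()
        mode = img.mode
    return width, height, img_format, mode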

image_properties = []
for _, row in labels_df.iterrows():
    file_name = row['file_name']
    image_path = os.path.join(raw_data_dir, file_name)
    properties = analyze_image_properties(image_path)
    image_properties.append((row['scanner_model'], file_name, *properties))

properties_df = pd.DataFrame(image_properties, columns=['scanner_model', 'file_name', 'width', 'height', 'format', 'mode'])
properties_df.head()
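
Since the samples come from different scanner devices, one natural next step is to summarize the properties per scanner model. The sketch below assumes a DataFrame with the same columns as `properties_df`; the function name `summarize_by_scanner`, the derived columns, and the demo rows are illustrative assumptions, not results from the actual dataset.

```python
import pandas as pd


def summarize_by_scanner(properties_df):
    """Aggregate image counts and resolution statistics per scanner model."""
    df = properties_df.assign(
        megapixels=properties_df["width"] * properties_df["height"] / 1e6,
        aspect_ratio=properties_df["width"] / properties_df["height"],
    )
    return df.groupby("scanner_model").agg(
        n_images=("file_name", "count"),
        median_megapixels=("megapixels", "median"),
        median_aspect_ratio=("aspect_ratio", "median"),
        n_formats=("format", "nunique"),
        n_modes=("mode", "nunique"),
    )


# Hypothetical rows, used only to show the shape of the output
demo_df = pd.DataFrame(
    [
        ("ScannerA", "a1.png", 2480, 3508, "PNG", "RGB"),
        ("ScannerA", "a2.png", 2480, 3508, "PNG", "L"),
        ("ScannerB", "b1.tif", 1240, 1754, "TIFF", "L"),
    ],
    columns=["scanner_model", "file_name", "width", "height", "format", "mode"],
)
demo_summary = summarize_by_scanner(demo_df)
print(demo_summary)
```

A scanner model with more than one format or mode in this summary may need extra normalization during preprocessing.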

## Visualizing Image Properties

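Let's visualize the distribution of image resolutions and formats: a scatter plot of width against height for every image, and a bar chart of the number of images per file format.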

In [3]:
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.scatter(properties_df['width'], properties_df['height'], alpha=0.5)
plt.title('Image Resolution Distribution')
plt.xlabel('Width (pixels)')
plt.ylabel('Height (pixels)')

plt.subplot(1, 2, 2)
properties_df['format'].value_counts().plot(kind='bar')
plt.title('Image Format Distribution')
plt.xlabel('Format')
plt.ylabel('Count')

plt.tight_layout()
plt.show()
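
The plots above cover resolution and format, while the `mode` column, which encodes the color channels, is not visualized. As a rough sketch, the mode strings can be mapped to a channel count and tallied. The mapping below covers only a few common Pillow modes, and the names `MODE_CHANNELS` and `count_channels` are hypothetical; any mode outside the mapping is grouped under `None`.

```python
from collections import Counter

# Channel counts for a few common Pillow modes (not an exhaustive list)
MODE_CHANNELS = {"1": 1, "L": 1, "P": 1, "RGB": 3, "RGBA": 4, "CMYK": 4}


def count_channels(modes):
    """Count images per channel count; modes not in MODE_CHANNELS map to None."""
    return Counter(MODE_CHANNELS.get(mode) for mode in modes)


# Made-up mode values for illustration
demo_counts = count_channels(["RGB", "L", "RGB", "1", "I;16"])
print(demo_counts)
```

In this notebook the input would presumably be `properties_df['mode']`, and the resulting counts could be drawn as another bar chart.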

## Conclusion

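In this notebook, we loaded the labels for the scanned document samples, recorded the resolution, format, and color mode of each image, and plotted the resolution and format distributions. These properties will inform the preprocessing steps applied to the images, for example whether they need to be resized to a common resolution or converted to a single color mode.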