# Part I: Data Exploration

Any good data science project starts with understanding the data.

## Setup

Setup is a common first task in any notebook, from data analysis to model experimentation. The following cells will:
1. create an isolated virtual environment, so we can work on different projects without conflicting library versions
1. set up a scratch folder to load the training data into
1. install the software and libraries listed in requirements.txt

In [None]:
! cd .. && ./scripts/bootstrap.sh

In [None]:
! pip install -U pip -q
! pip install -r ../requirements.txt -q

In [None]:
import numpy as np
import pandas as pd
import os
import glob
import cv2

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

In [None]:
# the scratch directory is listed in .gitignore to ensure it is not committed to git
%env SCRATCH=../scratch
! [ -e "${SCRATCH}" ] || mkdir -p "${SCRATCH}"

scratch_path = os.environ.get('SCRATCH', './scratch')

# Examine and understand data

We will label the data using the `tf.keras.utils.image_dataset_from_directory` utility, which infers the label of each image from the name of its parent folder.
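
To make the folder-to-label idea concrete, here is a minimal standard-library sketch of that kind of inference. It is not the Keras implementation; it only assumes that class names are the subdirectory names in sorted order and that each image takes the index of its parent folder. The `infer_labels` helper, the `right` folder, and the file names are hypothetical and exist only for illustration.

```python
import tempfile
from pathlib import Path


def infer_labels(root):
    """Map each file under root/<class>/ to an integer label.

    Class names are the sorted subdirectory names, so an image's label
    is the position of its parent folder in that sorted list.
    """
    root = Path(root)
    class_names = sorted(p.name for p in root.iterdir() if p.is_dir())
    class_index = {name: i for i, name in enumerate(class_names)}
    labels = {
        f.relative_to(root).as_posix(): class_index[f.parent.name]
        for f in sorted(root.glob("*/*"))
        if f.is_file()
    }
    return class_names, labels


# Build a tiny throwaway tree with two hypothetical classes.
with tempfile.TemporaryDirectory() as tmp:
    for folder, name in [("left", "a.png"), ("right", "b.png"), ("left", "c.png")]:
        (Path(tmp) / folder).mkdir(exist_ok=True)
        (Path(tmp) / folder / name).touch()
    class_names, labels = infer_labels(tmp)

print(class_names)
print(labels)
```

Because the labels come entirely from the directory layout, a misplaced or misnamed folder silently becomes a new class, which is why the next cells inspect the subdirectories first.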

## Count the subdirectories

Let's analyze how many subfolders of images we have.

In [None]:
path_to_folder = scratch_path + '/train'
subdirectories = [f.path for f in os.scandir(path_to_folder) if f.is_dir()]

# print the number of subdirectories
print(len(subdirectories))

## Count the total images

Counting the total number of images across all subdirectories indicates whether we have enough data to train a model.

In [None]:
image_count = len(list(glob.glob(scratch_path +'/train/*/*')))
print(image_count)

## List the subdirectories

Let's list the subdirectories to understand what labels we will be predicting.

In [None]:
# Get a list of all files and directories in the path
path_to_images = scratch_path + '/train'

contents = os.listdir(path_to_images)

# Loop through each item in the list
for item in contents:
    # Check if the item is a directory
    if os.path.isdir(os.path.join(path_to_images, item)):
        print("Found directory:", item)

## Count images in subdirectories

Let's see whether the images are evenly distributed across the subdirectories or whether the data is skewed toward some classes.

In [None]:
root_path = scratch_path + '/train'
num_images = 0

# Iterate over each subdirectory
for dirpath, dirnames, filenames in os.walk(root_path):
    # Count the number of image files in the current subdirectory
    for filename in filenames:
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
            num_images += 1

    # Print the number of image files in the current subdirectory
    print(f"Found {num_images} images in directory: {dirpath}")
    num_images = 0

## Visualize the distribution of images

We can visually represent the images in each class with a bar chart to easily identify imbalances.

In [None]:
subdirectories = [os.path.join(root_path, d) for d in os.listdir(root_path) if os.path.isdir(os.path.join(root_path, d))]

# Count the number of images in each subdirectory
counts = [0] * len(subdirectories)
for i, directory in enumerate(subdirectories):
    counts[i] = len(os.listdir(directory))

# Create a bar chart to visualize the distribution
plt.bar(subdirectories, counts)
plt.xticks(rotation=90)
plt.ylabel("Number of Images")
plt.show()

You are looking for imbalances between classes, because a model trained on imbalanced data tends to be biased toward the over-represented classes when it predicts on unseen data.
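
A bar chart shows imbalance visually; a couple of numbers can make it easier to compare runs. The sketch below is one possible summary, assuming a dictionary that maps class names to image counts. The `imbalance_summary` helper and the example counts are hypothetical, not values from this dataset.

```python
def imbalance_summary(class_counts):
    """Return each class's share of the data and the largest/smallest count ratio."""
    if not class_counts:
        raise ValueError("class_counts must not be empty")
    total = sum(class_counts.values())
    shares = {name: count / total for name, count in class_counts.items()}
    smallest = min(class_counts.values())
    # A ratio of 1.0 means perfectly balanced classes.
    ratio = max(class_counts.values()) / smallest if smallest else float("inf")
    return shares, ratio


# Hypothetical counts, for illustration only.
example_counts = {"left": 600, "right": 400}
shares, ratio = imbalance_summary(example_counts)
print(shares, ratio)
```

What ratio counts as a problem depends on the task, so treat the number as a prompt to look closer rather than a hard threshold.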

## Analyzing Image Properties

Now we can examine the image properties to identify what transformations we might need to account for before passing to a model.

In [None]:
# Initialize empty lists to store the information
sizes = []
resolutions = []
color_distributions = []

# Iterate over each image file in each subdirectory
for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        if filename.lower().endswith(('.png', '.jpg', '.jpeg')):
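            # Load the image file using OpenCV
            img_path = os.path.join(dirpath, filename)
            img = cv2.imread(img_path)

            # cv2.imread returns None for unreadable files; skip them
            if img is None:
                continue

            # Extract the file size of the image in bytes
            size = os.path.getsize(img_path)
            sizes.append(size)

            # Extract the resolution as (width, height); img.shape is (height, width, channels)
            resolution = (img.shape[1], img.shape[0])
            resolutions.append(resolution)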

            # Extract the color distribution of the image
            color_distribution = np.bincount(img.flatten(), minlength=256)
            color_distributions.append(color_distribution)

# Convert the lists to numpy arrays for easier manipulation
sizes = np.array(sizes)
resolutions = np.array(resolutions)
color_distributions = np.array(color_distributions)

### Histogram of image sizes

We can visualize the range of image file sizes in our dataset to understand how best to resize the images.

We will use Plotly to enrich the visualization experience by providing an interactive chart to zoom and uncover additional insights.

In [None]:
# Root directory path
root_path = scratch_path + '/train'

# List to store file sizes
sizes = []

# Iterate over each file in the root directory and its subdirectories
for dirpath, dirnames, filenames in os.walk(root_path):
    for filename in filenames:
        # Get the full path of the file
        file_path = os.path.join(dirpath, filename)
        # Get the file size in bytes
        file_size = os.path.getsize(file_path)
        # Convert file size to MB and add to the list
        sizes.append(file_size / 1_000_000)

# Create a histogram figure with plotly
fig = px.histogram(x=sizes, nbins=50, title="Distribution of Image Sizes")

# Customize the plot
fig.update_layout(
    xaxis_title="File Size (MB)",
    yaxis_title="Number of Images",
    showlegend=False,
    bargap=0.1,
    bargroupgap=0.1
)

# Show the plot
fig.show()

### Scatterplot of image resolutions

Plotting the width and height (the resolution) of each image may reveal irregularities we need to account for.

In [None]:
# Create a scatter plot figure with plotly
fig = px.scatter(x=resolutions[:, 0], y=resolutions[:, 1], title="Distribution of Image Resolutions")

# Customize the plot
fig.update_layout(
    xaxis_title="Width (pixels)",
    yaxis_title="Height (pixels)",
    showlegend=False,
    hovermode="closest",
    width=800,
    height=600,
    margin=dict(l=50, r=50, b=50, t=50, pad=4)
)

# Show the plot
fig.show()

### 3D scatterplot of image resolutions

This is overkill for our data, but it demonstrates how powerful visualization can be when the data has more variability.

In [None]:
# Create a dataframe with the resolutions
df = pd.DataFrame(resolutions, columns=['width', 'height'])

# Create a 3D scatter plot with plotly
fig = px.scatter_3d(df, x='width', y='height', z=df.index,
                    title='Distribution of Image Resolutions',
                    labels={'width': 'Width (pixels)',
                            'height': 'Height (pixels)',
                            'index': 'Image Index'},
                    color=df.index)

# Customize the plot
fig.update_traces(marker=dict(size=2, line=dict(width=0.5)))

# Show the plot
fig.show()

You will observe that we have two image sizes, roughly 300x240 and 103x96. This means we will need some preprocessing transformations to bring the images to a common size.
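
The same observation can be checked numerically by counting how many images share each pair of dimensions. This is a small sketch using only the standard library; it assumes an iterable of two-element dimension pairs such as the `resolutions` collected earlier. The `count_resolutions` helper and the example values are hypothetical.

```python
from collections import Counter


def count_resolutions(resolutions):
    """Count how many images share each pair of dimensions."""
    return Counter(tuple(int(v) for v in pair) for pair in resolutions)


# Hypothetical dimension pairs, for illustration only.
example = [(96, 103), (96, 103), (240, 300)]
resolution_counts = count_resolutions(example)
for dims, n in resolution_counts.most_common():
    print(dims, n)
```

A short list of distinct pairs confirms that a single resize step is enough, while a long tail would suggest handling outliers separately.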

## Plot the mean color distribution

Sometimes the distribution of color reveals trends and patterns that are useful for normalization and prediction.

In [None]:
# Calculate the mean color distribution across all images
mean_color_distribution = np.mean(color_distributions, axis=0)

# Create a bar chart of the mean color distribution
fig = go.Figure(
    go.Bar(x=np.arange(256), y=mean_color_distribution, name="Mean Color Distribution")
)

# Set the title and axis labels
fig.update_layout(
    title="Mean Color Distribution",
    xaxis_title="Color Value",
    yaxis_title="Number of Pixels"
)

# Show the plot
fig.show()

## Review some samples

Let's look at a sample to get an idea of what a record looks like, while being mindful that we should view only a minimal amount of data to avoid introducing bias in later steps.

### Function to display images in Jupyter Notebooks

In [None]:
# OpenCV loads images in BGR order, while matplotlib expects RGB
def plt_imshow(title, image):
    # convert the image from the BGR to the RGB color space and display it
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    plt.imshow(image)
    plt.title(title)
    plt.grid(False)
    plt.show()

### Display a sample

In [None]:
image = cv2.imread(scratch_path + '/train/left/1__M_Left_index_finger_CR.png')

# image.shape is ordered (height, width, channels)
(h, w, c) = image.shape[:3]

# display the image dimensions
print("width: {} pixels".format(w))
print("height: {} pixels".format(h))
print("channels: {}".format(c))

plt_imshow("Original", image)

You will notice there is a border around the image. If we leave it in, the model may learn that every fingerprint has this border. We should remove it because it is not a feature of the fingerprint itself.

### Crop the border

Let's experiment with different values to crop the image.

In [None]:
# cropping an image with OpenCV is accomplished via simple NumPy
# array slices in startY:endY, startX:endX order
# cropping the border from the image - expected optimal values [5:95, 5:90]

CROP_TOP = 10
CROP_BOT = 96-1
CROP_L = 5
CROP_R = 96-6

no_border = image[CROP_TOP:CROP_BOT, CROP_L:CROP_R] # EXPERIMENT WITH CHANGES TO THESE VALUES AND DISPLAY THE OUTPUT

# no_border.shape is ordered (height, width, channels)
(h, w, c) = no_border.shape[:3]

# display the image dimensions
print("width: {} pixels".format(w))
print("height: {} pixels".format(h))
print("channels: {}".format(c))

plt_imshow("Borderless", no_border)

# Overall

We know we need to codify some changes:
1. rescale the images to be the same size
1. then crop the border from around each image
1. create pipeline tasks that can be reused for data ingestion and for offline batch data
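
As a starting point, the first two steps above could be codified roughly as follows. This is a sketch under stated assumptions, not the final pipeline: it uses a NumPy nearest-neighbour resize as a stand-in for `cv2.resize` so the example has no OpenCV dependency, the target size of 96 pixels wide by 103 pixels high is an assumption, and the crop window reuses the `[5:95, 5:90]` values noted in the cropping experiment. The function names are hypothetical.

```python
import numpy as np


def resize_nearest(image, width, height):
    """Resize with nearest-neighbour sampling (a stand-in for cv2.resize)."""
    src_h, src_w = image.shape[:2]
    # For every target row/column, pick the source row/column it maps to.
    rows = np.arange(height) * src_h // height
    cols = np.arange(width) * src_w // width
    return image[rows[:, None], cols]


def crop_border(image, top, bottom, left, right):
    """Crop with NumPy slicing in startY:endY, startX:endX order."""
    return image[top:bottom, left:right]


def preprocess(image, size=(96, 103), crop=(5, 95, 5, 90)):
    """Rescale to a common (width, height), then crop the border."""
    width, height = size  # assumed target size
    resized = resize_nearest(image, width, height)
    return crop_border(resized, *crop)


# Hypothetical inputs standing in for the two image sizes in the dataset.
large = np.zeros((240, 300, 3), dtype=np.uint8)
small = np.zeros((103, 96, 3), dtype=np.uint8)
print(preprocess(large).shape, preprocess(small).shape)
```

Keeping each step as a small function makes it easier to reuse the same transformations for both data ingestion and offline batch data.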