Hello Kagglers,

This notebook gives a brief Exploratory Data Analysis, followed by object detection using [YOLOV5](https://github.com/ultralytics/yolov5) and lastly creating [TFRecords](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset).

Object detection is used to classify each image as cat or dog and to locate the pets inside the image with bounding boxes. These boxes can be used for cropping or for feature extraction. The TFRecords allow fast and easy data loading, especially on TPUs, where thousands of images per second can be read.

**V2 Updates**

* Added Test Time Augmentation during YOLOV5 inference for better performance. This reduced the number of images in which no pet could be detected from 50 to 26.

* Added pet ratio feature. This denotes the fraction of an image's pixels that contain a pet.

**V3 Updates**

* Added duplicate removal based on the amazing [post](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/278497) and [notebook](https://www.kaggle.com/schulta/petfinder-identify-duplicates-and-share-findings) from [schulta](https://www.kaggle.com/schulta)

In [None]:
import numpy as np
import pandas as pd
import tensorflow as tf
import matplotlib.pyplot as plt
import matplotlib.patches as patches

from multiprocessing import cpu_count
from tqdm.notebook import tqdm
from sklearn.model_selection import StratifiedKFold
from scipy.stats import pearsonr
from PIL import Image

import glob
import sys
import cv2
import imageio
import joblib
import math
import warnings
import os
import torch
import imagehash

# Ignore Warnings
warnings.filterwarnings("ignore")

# Activate pandas progress apply bar
tqdm.pandas()

print(f'tensorflow version: {tf.__version__}')
print(f'tensorflow keras version: {tf.keras.__version__}')
print(f'python version: {sys.version}')

# Read Train Test

In [None]:
# Efficient Data Types
dtype = {
    'Id': 'string',
    'Subject Focus': np.uint8, 'Eyes': np.uint8, 'Face': np.uint8, 'Near': np.uint8,
    'Action': np.uint8, 'Accessory': np.uint8, 'Group': np.uint8, 'Collage': np.uint8,
    'Human': np.uint8, 'Occlusion': np.uint8, 'Info': np.uint8, 'Blur': np.uint8,
    'Pawpularity': np.uint8,
}

train = pd.read_csv('/kaggle/input/petfinder-pawpularity-score/train.csv', dtype=dtype)
test = pd.read_csv('/kaggle/input/petfinder-pawpularity-score/test.csv', dtype=dtype)

In [None]:
# Add File path to Train
def get_image_file_path(image_id):
    return f'/kaggle/input/petfinder-pawpularity-score/train/{image_id}.jpg'

train['file_path'] = train['Id'].apply(get_image_file_path)

In [None]:
display(train.head())

In [None]:
display(train.info())

In [None]:
display(test.head())

In [None]:
display(test.info())

# Remove Duplicates

Based on the amazing [post](https://www.kaggle.com/c/petfinder-pawpularity-score/discussion/278497) and [notebook](https://www.kaggle.com/schulta/petfinder-identify-duplicates-and-share-findings) from [schulta](https://www.kaggle.com/schulta). More info on image hashes can be found in the [Looks Like It](http://www.hackerfactor.com/blog/index.php?/archives/432-Looks-Like-It.html) blog post. A cryptographic hash maps a file of arbitrary length to a fixed-length digest, where changing a single bit in the input file results in a completely different digest, as shown below. Image hashes, on the other hand, aim to generate similar vectors for similar-looking images. With its default settings, `imagehash.phash` generates a hash of 64 bits (an 8x8 grid), and similarity can be computed simply as the fraction of identical bits in the hashes of two images.

The perceptual hash is used for generating the fingerprint of images. The [pHash website](http://www.phash.org/) explains the difference between cryptographic and image hashes once more:

```
"A perceptual hash is a fingerprint of a multimedia file derived from various features from its content. Unlike cryptographic hash functions which rely on the avalanche effect of small changes in input leading to drastic changes in the output, perceptual hashes are "close" to one another if the features are similar."
```
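
The bit-level similarity described above can be sketched with NumPy alone. This is a minimal illustration on synthetic 64-bit vectors rather than real image hashes, and `bit_similarity` is a hypothetical helper name that does not appear elsewhere in this notebook.

```python
import numpy as np

def bit_similarity(hash_a, hash_b):
    """Fraction of positions at which two equally long bit vectors agree."""
    hash_a = np.asarray(hash_a, dtype=np.uint8)
    hash_b = np.asarray(hash_b, dtype=np.uint8)
    if hash_a.shape != hash_b.shape:
        raise ValueError('hashes must have the same length')
    return float((hash_a == hash_b).mean())

# Synthetic example: a random 64-bit vector and a copy with 3 bits flipped
rng = np.random.default_rng(0)
original = rng.integers(0, 2, size=64, dtype=np.uint8)
near_duplicate = original.copy()
near_duplicate[:3] ^= 1

print(bit_similarity(original, near_duplicate))  # 61 of 64 bits agree
```

A pair like this would pass a 0.90 threshold, while two unrelated random vectors would be expected to agree on roughly half of their bits.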

In [None]:
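# Example of cryptographic hashes: a one-character change gives an unrelated digest
import hashlib
print(f'sha256 of "aaaaa": {hashlib.sha256(b"aaaaa").hexdigest()}')
print(f'sha256 of "aaaab": {hashlib.sha256(b"aaaab").hexdigest()}')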

In [None]:
# Return the perceptual hash
def get_hash(file_path):
    img = Image.open(file_path)
    img_hash = imagehash.phash(img)
    
    return img_hash.hash.reshape(-1).astype(np.uint8)
    
train['phash'] = train['file_path'].progress_apply(get_hash)

In [None]:
def find_similar_images(threshold=0.90):
    # Number of Duplicate Images Found
    duplicate_counter = 1
    # Indices of Duplicate Images
    duplicate_idxs = set()
    # For each image in the train dataset
    for idx, phash in enumerate(tqdm(train['phash'])):
        # Compute the similarity to all other images
        for idx_other, phash_other in enumerate(train['phash']):
            # Similarity score is simply the fraction of equal bits
            similarity = (phash == phash_other).mean()
            # Prevent self comparison, threshold similarity and ignore repetitive duplicate detection
            if idx != idx_other and similarity > threshold and not duplicate_idxs.intersection([idx, idx_other]):
                # Update Duplicate Indices
                duplicate_idxs.update([idx, idx_other])
                # Get DataFrame rows
                row = train.loc[idx]
                row_other = train.loc[idx_other]
                # Plot Duplicate Images
                fig, ax = plt.subplots(nrows=1, ncols=2, figsize=(8,5))
                ax[0].imshow(imageio.imread(row['file_path']))
                ax[0].set_title(f'Idx: {idx}, Pawpularity: {row["Pawpularity"]}')
                ax[1].imshow(imageio.imread(row_other['file_path']))
                ax[1].set_title(f'Idx: {idx_other}, Pawpularity: {row_other["Pawpularity"]}')
                plt.suptitle(f'{duplicate_counter} | PHASH Similarity: {similarity:.3f}')
                plt.show()
                # Increase Duplicate Counter
                duplicate_counter += 1
                
    # Return Indices of Duplicates
    return duplicate_idxs
    
duplicate_idxs = find_similar_images()
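
The nested loop above compares every pair of hashes in Python, which is slow for thousands of images. As a rough sketch of an alternative, the same similarity can be computed for all pairs at once with two matrix products. `similar_pairs` is a hypothetical helper, it skips the plotting and the "already flagged" bookkeeping of `find_similar_images`, and it is demonstrated on toy bit vectors only.

```python
import numpy as np

def similar_pairs(hashes, threshold=0.90):
    """Return (i, j, similarity) for every pair i < j above the threshold."""
    H = np.asarray(list(hashes), dtype=np.int32)  # shape: (n_images, n_bits)
    n_bits = H.shape[1]
    # Agreeing bits = positions where both are 1 plus positions where both are 0
    matches = H @ H.T + (1 - H) @ (1 - H).T
    similarity = matches / n_bits
    # Upper triangle only: skips self comparison and mirrored pairs
    rows, cols = np.nonzero(np.triu(similarity > threshold, k=1))
    return [(int(i), int(j), float(similarity[i, j])) for i, j in zip(rows, cols)]

# Toy data: image 1 differs from image 0 in 2 of 64 bits, image 2 is the complement
rng = np.random.default_rng(0)
a = rng.integers(0, 2, size=64)
b = a.copy()
b[:2] ^= 1
c = 1 - a

print(similar_pairs([a, b, c]))
```

On the real data this could be called as `similar_pairs(train['phash'])`. The full similarity matrix grows quadratically with the number of images, so for roughly ten thousand images it may need several hundred megabytes of memory or a chunked variant.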

In [None]:
print(f'Found {len(duplicate_idxs)} Duplicate Images')
# Removing Duplicate Images
train = train.drop(duplicate_idxs).reset_index(drop=True)

In [None]:
# Check if images are correctly removed
# DataFrame size reduced by 27*2=54 from 9912 -> 9858
display(train.info())

# Image EDA

In [None]:
widths = []
heights = []
ratios = []
for file_path in tqdm(train['file_path']):
    image = imageio.imread(file_path)
    h, w, _ = image.shape
    heights.append(h)
    widths.append(w)
    ratios.append(w / h)

Images tend to be large for an image classification task; the most common image size is 960*720.

In [None]:
# Images Height and Width Distribution
print('Width Statistics')
display(pd.Series(widths).describe())
print()
print('Height Statistics')
display(pd.Series(heights).describe())

plt.figure(figsize=(15,8))
plt.title(f'Images Height and Width Distribution', size=24)
plt.hist(heights, bins=32, label='Image Heights')
plt.hist(widths, bins=32, label='Image Widths')
plt.legend(prop={'size': 16})
plt.show()
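
The histograms show widths and heights separately, so they do not directly reveal which width and height combination is most common. A small sketch of how that could be counted is given below; `most_common_sizes` is a hypothetical helper and the lists are toy values, not the real `widths` and `heights`.

```python
from collections import Counter

def most_common_sizes(widths, heights, n=3):
    """Return the n most frequent (width, height) pairs with their counts."""
    return Counter(zip(widths, heights)).most_common(n)

# Toy values standing in for the widths and heights collected above
toy_widths = [720, 720, 960, 405, 720]
toy_heights = [960, 960, 720, 720, 960]

print(most_common_sizes(toy_widths, toy_heights, n=1))  # [((720, 960), 3)]
```

With the lists from the loop above, `most_common_sizes(widths, heights)` would list the dominant image sizes in the train set.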

The image width-to-height ratio has a mean below one and a peak at 0.75; pictures thus tend to be taken vertically (portrait), not horizontally.

In [None]:
# Images Ratio Distribution
print('Ratio Statistics')
display(pd.Series(ratios).describe())
plt.figure(figsize=(15,8))
plt.title(f'Images Ratio Distribution', size=24)
plt.hist(ratios, bins=16, label='Image Width / Height Ratio')
plt.legend(prop={'size': 16})
plt.show()

The Pawpularity score is centered around 40, with additional peaks near the minimum score and at 100.

In [None]:
# Pawpularity Score Distribution
print('Pawpularity Statistics')
display(train['Pawpularity'].describe())
plt.figure(figsize=(15,8))
plt.title('Train Data Pawpularity Score Distribution', size=24)
plt.hist(train['Pawpularity'], bins=32)
plt.show()

# Feature Correlation

The relation between the given binary features and the Pawpularity score is visualised using box plots. As can be observed, the features show no visible correlation with the Pawpularity score.

In [None]:
def box_plot(feature):
    value_counts = train[feature].value_counts().sort_index()
    value_counts_str = ""
    if len(value_counts) == 2: # Binary Feature
        for idx, (k, v) in enumerate(value_counts.to_dict().items()):
            if idx > 0:
                value_counts_str += ', '
            value_counts_str += f'{k} count: {v}'
    else: # Non-Binary Feature
        display(value_counts.to_frame())
        
    fig, ax = plt.subplots(figsize=(12,8))
    # Use the function argument, not the global loop variable, for the title
    ax.set_title(f'{feature.upper()}, {value_counts_str}', size=18)
    ax.boxplot(train.groupby(feature)['Pawpularity'].apply(list))
    plt.xticks(np.arange(1, len(value_counts) + 1), value_counts.keys(), size=16)
    plt.yticks(size=16)
    plt.grid()
    plt.show()

In [None]:
def scatter_plot_with_correlation_line(df, feature):
    x = df['Pawpularity']
    y = df[feature]
    
    # Create scatter plot
    plt.figure(figsize=(12,5))
    plt.scatter(x, y, s=10)

    # Add correlation line
    axes = plt.gca()
    m, b = np.polyfit(x, y, 1)
    X_plot = np.linspace(axes.get_xlim()[0],axes.get_xlim()[1],100)
    
    # Pearson Correlation
    cor, _ = pearsonr(x, y)
    plt.title(f'{feature}, Pearson Correlation: {cor:.3f}', size=18)
    plt.plot(X_plot, m*X_plot + b, '-', color='r', linewidth=3)

In [None]:
features = [
    'Subject Focus',
    'Eyes',
    'Face',
    'Near',
    'Action',
    'Accessory',
    'Group',
    'Collage',
    'Human',
    'Occlusion',
    'Info',
    'Blur',
]

for f in features:
    box_plot(f)
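
The box plots give a visual impression only. As a complementary sketch, the Pearson correlation of each binary feature with the score can be put in a single table using pandas. `feature_correlations` is a hypothetical helper, and the DataFrame below is toy data with an artificially strong relationship, not the competition data.

```python
import pandas as pd

def feature_correlations(df, features, target='Pawpularity'):
    """Pearson correlation of each feature column with the target, sorted ascending."""
    target_values = df[target].astype(float)
    return df[features].astype(float).corrwith(target_values).sort_values()

# Toy data: 'Eyes' rises with the score, 'Blur' falls with it
toy = pd.DataFrame({
    'Eyes': [0, 0, 0, 0, 1, 1, 1, 1],
    'Blur': [1, 1, 1, 1, 0, 0, 0, 0],
    'Pawpularity': [10, 20, 30, 40, 50, 60, 70, 80],
})

print(feature_correlations(toy, ['Eyes', 'Blur']))
```

On the real data, `feature_correlations(train, features)` would be expected to return values close to zero if the box plots are a fair summary.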

# Show Extreme Cases

The following two plots show 32 examples of the lowest and highest scoring pet images. Personally, I see no clear difference between the low and high scoring images. For example, low scoring images 10, 13 and 19 seem quite cute, whereas high scoring images 2 and 13 do not seem like good pictures.

In [None]:
# Shows a batch of images
def show_batch_df(df, rows=8, cols=4):
    df = df.copy().reset_index()
    fig, axes = plt.subplots(nrows=rows, ncols=cols, figsize=(cols*4, rows*4))
    for r in range(rows):
        for c in range(cols):
            idx = r * cols + c
            img = imageio.imread(df.loc[idx, 'file_path'])
            axes[r, c].imshow(img)
            axes[r, c].set_title(f'{idx}, label: {df.loc[idx, "Pawpularity"]}')

# Pets with Lowest Pawpularity Scores

In [None]:
show_batch_df(train.sort_values('Pawpularity'))

# Pets with Highest Pawpularity Scores

In [None]:
show_batch_df(train.sort_values('Pawpularity', ascending=False))

# YOLOV5 Object Detection

[YOLOV5](https://github.com/ultralytics/yolov5) is the fifth iteration of the You Only Look Once object detection family, which is quite controversial as no official paper has been published. It is, however, freely available, easy to use and scores fairly high in benchmarks.

Using object detection, the images can be classified as either cat or dog, the bounding boxes of the pets can be determined and the number of pets in the images can be counted.

This object detection can be a source of features and a fundamental tool for preprocessing.

In [None]:
# Download YOLOV5 GitHub Repo
!git clone https://github.com/ultralytics/yolov5

In [None]:
# Load Best Performing YOLOV5X Model
yolov5x6_model = torch.hub.load('ultralytics/yolov5', 'yolov5x6')

In [None]:
# Get Image Info
def get_image_info(file_path, plot=False):
    # Read Image
    image = imageio.imread(file_path)
    h, w, _ = image.shape
    
    if plot: # Debug Plots
        fig, ax = plt.subplots(1, 2, figsize=(8,8))
        ax[0].set_title('Pets detected in Image', size=16)
        ax[0].imshow(image)
        
    # Get YOLOV5 results using Test Time Augmentation for better result
    results = yolov5x6_model(image, augment=True)
    
    # Mask for pixels containing pets, initially all set to zero
    pet_pixels = np.zeros(shape=[h, w], dtype=np.uint8)
    
    # Dictionary to Save Image Info
    # The box defaults cover the full image, used when no pet is found
    image_info = { 
        'n_pets': 0, # Number of pets in the image
        'labels': [], # Label assigned to found objects
        'thresholds': [], # confidence score
        'coords': [], # coordinates of bounding boxes
        'x_min': 0, # minimum x coordinate of pet bounding box
        'x_max': w - 1, # maximum x coordinate of pet bounding box
        'y_min': 0, # minimum y coordinate of pet bounding box
        'y_max': h - 1, # maximum y coordinate of pet bounding box
    }
    
    # Save found pets to draw bounding boxes
    pets_found = []
    
    # Save info for each detected cat or dog
    for x1, y1, x2, y2, confidence, label in results.xyxy[0].cpu().detach().numpy():
        label = results.names[int(label)]
        if label in ['cat', 'dog']:
            if image_info['n_pets'] == 0:
                # The first pet replaces the full-image default box
                image_info['x_min'], image_info['x_max'] = x1, x2
                image_info['y_min'], image_info['y_max'] = y1, y2
            else:
                # Later pets extend the box so it encloses all pets
                image_info['x_min'] = min(x1, image_info['x_min'])
                image_info['x_max'] = max(x2, image_info['x_max'])
                image_info['y_min'] = min(y1, image_info['y_min'])
                image_info['y_max'] = max(y2, image_info['y_max'])

            image_info['n_pets'] += 1
            image_info['labels'].append(label)
            image_info['thresholds'].append(confidence)
            image_info['coords'].append(tuple([x1, y1, x2, y2]))
            
            # Set pixels containing pets to 1
            pet_pixels[int(y1):int(y2), int(x1):int(x2)] = 1
            
            # Add found pet
            pets_found.append([x1, x2, y1, y2, label])

    if plot:
        for x1, x2, y1, y2, label in pets_found:
            # Dogs are drawn in red, cats in blue
            color = 'red' if label == 'dog' else 'blue'
            rect = patches.Rectangle((x1, y1), x2-x1, y2-y1, linewidth=2, edgecolor=color, facecolor='none')
            # Add the patch to the Axes
            ax[0].add_patch(rect)
            ax[0].text(max(25, (x2+x1)/2), max(25, y1-h*0.02), label, color=color, ha='center', size=14)
                
    # Add Pet Ratio in Image
    image_info['pet_ratio'] = pet_pixels.sum() / (h*w)

    if plot:
        # Show pet pixels
        ax[1].set_title('Pixels Containing Pets', size=16)
        ax[1].imshow(pet_pixels)
        plt.show()
        
    return image_info
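
The pet ratio computed in `get_image_info` counts overlapping boxes only once, because every box writes into the same pixel mask. The idea can be sketched in isolation as follows; `pet_ratio_from_boxes` is a hypothetical helper and the boxes are made-up values in `(x1, y1, x2, y2)` order.

```python
import numpy as np

def pet_ratio_from_boxes(height, width, boxes):
    """Fraction of image pixels covered by at least one (x1, y1, x2, y2) box."""
    mask = np.zeros((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        # Overlapping boxes write 1 to the same pixels, so they are counted once
        mask[int(y1):int(y2), int(x1):int(x2)] = 1
    return mask.sum() / (height * width)

# Two 5x5 boxes on a 10x10 image that overlap in a 2x2 region: 25 + 25 - 4 = 46 pixels
toy_boxes = [(0, 0, 5, 5), (3, 3, 8, 8)]
print(pet_ratio_from_boxes(10, 10, toy_boxes))  # 0.46
```

Summing the individual box areas instead would give 0.50 for this example and could exceed 1.0 for images with many overlapping pets.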

# Some YOLOV5 Examples

In [None]:
for file_path in train['file_path'].head(5):
    get_image_info(file_path, plot=True)

In [None]:
# Image Info
IMAGES_INFO = {
    'n_pets': [],
    'label': [],
    'coords': [],
    'x_min': [],
    'x_max': [],
    'y_min': [],
    'y_max': [],
    'pet_ratio': [],
}


for idx, file_path in enumerate(tqdm(train['file_path'])):
    image_info = get_image_info(file_path, plot=False)
    
    IMAGES_INFO['n_pets'].append(image_info['n_pets'])
    IMAGES_INFO['coords'].append(image_info['coords'])
    IMAGES_INFO['x_min'].append(image_info['x_min'])
    IMAGES_INFO['x_max'].append(image_info['x_max'])
    IMAGES_INFO['y_min'].append(image_info['y_min'])
    IMAGES_INFO['y_max'].append(image_info['y_max'])
    IMAGES_INFO['pet_ratio'].append(image_info['pet_ratio'])
    
    # Not Every Image can be Correctly Classified
    labels = image_info['labels']
    if len(set(labels)) == 1: # unanimous label
        IMAGES_INFO['label'].append(labels[0])
    elif len(set(labels)) > 1: # Get label with highest confidence
        IMAGES_INFO['label'].append(labels[0])
    else: # unknown label, yolo could not find pet
        IMAGES_INFO['label'].append('unknown')
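
When the detected labels disagree, the loop above takes `labels[0]`, which assumes that the detections arrive ordered by confidence. A sketch that selects the most confident label explicitly, without relying on that ordering, is shown below. `pick_label` is a hypothetical helper and the inputs are toy values.

```python
def pick_label(labels, confidences, default='unknown'):
    """Return the label with the highest confidence, or default if nothing was detected."""
    if not labels:
        return default
    best = max(range(len(labels)), key=lambda i: confidences[i])
    return labels[best]

print(pick_label(['cat', 'dog'], [0.41, 0.87]))  # dog
print(pick_label([], []))                        # unknown
```

Inside the loop this could be used as `pick_label(image_info['labels'], image_info['thresholds'])`.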

In [None]:
# Add Image Info to Train
for k, v in IMAGES_INFO.items():
    train[k] = v
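
With the box columns attached to `train`, cropping an image to its pet region becomes a simple array slice. The following is a minimal sketch, assuming the image is a NumPy array in height, width, channel order; `crop_to_box` is a hypothetical helper and is shown on a toy array rather than a real photo.

```python
import numpy as np

def crop_to_box(image, x_min, y_min, x_max, y_max):
    """Crop an (H, W, C) array to a box, rounding outwards and clipping to the image."""
    h, w = image.shape[:2]
    x0 = int(np.clip(np.floor(x_min), 0, w - 1))
    y0 = int(np.clip(np.floor(y_min), 0, h - 1))
    # Keep at least one pixel in each direction
    x1 = int(np.clip(np.ceil(x_max), x0 + 1, w))
    y1 = int(np.clip(np.ceil(y_max), y0 + 1, h))
    return image[y0:y1, x0:x1]

# Toy 6x8 RGB array and a made-up floating point box
toy = np.arange(6 * 8 * 3).reshape(6, 8, 3)
crop = crop_to_box(toy, 1.7, 0.2, 5.1, 3.9)
print(crop.shape)  # (4, 5, 3)
```

For a row of `train`, the call would look like `crop_to_box(imageio.imread(row['file_path']), row['x_min'], row['y_min'], row['x_max'], row['y_max'])`.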

# YOLOV5 Feature Analysis

The number of pets identified in an image ranges from 0, when the pet can not be detected, to 14! For 50 out of 9912 images the pet could not be detected, about 0.5%.

In [None]:
# Image with 14 pets in it
get_image_info(train.loc[train['n_pets'] == 14, 'file_path'].squeeze(), plot=True)
pass

In [None]:
box_plot('n_pets')

In [None]:
box_plot('label')

In [None]:
# Show images where no pet could be detected
show_batch_df(train.loc[train['label'] == 'unknown'], rows=6)

In [None]:
# Pearson Correlation between pet_ratio and Pawpularity
scatter_plot_with_correlation_line(train, 'pet_ratio')

# Save Train

In [None]:
pd.options.display.max_columns = 99
display(train.head())

In [None]:
# Save Train
train.to_pickle('train.pkl')

# TFRecords

[TFRecords](https://www.tensorflow.org/api_docs/python/tf/data/TFRecordDataset) allow for packing multiple samples with corresponding label and features inside one file. This speeds up data loading, as only a single file needs to be read from disk which contains multiple images, features and labels.
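
To make the record structure concrete, the sketch below serializes a single sample into a `tf.train.Example` and parses it back in memory, without writing a file. The byte string is a placeholder rather than a real JPEG, and `make_example` and `schema` are hypothetical names used only for this illustration.

```python
import tensorflow as tf

def make_example(image_bytes, label):
    """Serialize one sample (raw image bytes plus integer label) to a byte string."""
    return tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
        'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
    })).SerializeToString()

# The schema used for parsing must match the keys and types used for writing
schema = {
    'image': tf.io.FixedLenFeature([], tf.string),
    'label': tf.io.FixedLenFeature([], tf.int64),
}

serialized = make_example(b'not-a-real-jpeg', 42)
parsed = tf.io.parse_single_example(serialized, schema)
print(parsed['image'].numpy(), int(parsed['label']))
```

A TFRecord file is a sequence of such serialized byte strings, which is why one file can hold many images together with their labels and features.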

In [None]:
N_CHANNELS = 3
VERSION = '1A'

In [None]:
def process_image(file_path):
    # Read Image
    img = imageio.imread(file_path)
    h, w, _ = img.shape

    with open(file_path, 'rb') as f:
        img_jpeg = f.read()
    return img_jpeg, h, w

# Make TFRecords

In [None]:
# Makes the actual TFRecords
def to_tf_records(data_split, subset, fold):
    # Create the output directory if it does not exist yet
    os.makedirs(f'fold_{fold}/{subset}', exist_ok=True)
    
    for idx, df_t in enumerate(tqdm(data_split)):
        df = df_t.T
        # Create image processing jobs and execute them in parallel
        jobs = [joblib.delayed(process_image)(fp) for fp in df['file_path']]
        imgs_resized = joblib.Parallel(
            n_jobs=cpu_count(),
            verbose=0,
            batch_size=64,
            pre_dispatch=64*cpu_count(),
            require='sharedmem'
        )(jobs)
        tfrecord_name = f'{VERSION}_{subset}_fold_{fold}_batch_{idx}.tfrecords'
        
        # Create the actual TFRecords
        with tf.io.TFRecordWriter(f'fold_{fold}/{subset}/{tfrecord_name}') as file_writer:
            for (idx, row), (img, h, w) in zip(df.iterrows(), imgs_resized):
                record_bytes = tf.train.Example(features=tf.train.Features(feature={
                    # Image as JPEG bytes
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[img])),
                    # Label of image
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(row['Pawpularity'])])),
                    # Height of image
                    'height': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(h)])),
                    # Width of image
                    'width': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(w)])),
                    # Minimum x value of pets
                    'x_min': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(np.floor(row['x_min']))])),
                    # Maximum x value of pets
                    'x_max': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(np.floor(row['x_max']))])),
                    # Minimum y value of pets
                    'y_min': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(np.floor(row['y_min']))])),
                    # Maximum y value of pets
                    'y_max': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(np.floor(row['y_max']))])),
                })).SerializeToString()
                file_writer.write(record_bytes)

# Stratified KFold

The training data is split into 4 folds, stratified on the Pawpularity score. In each fold, the train and validation data have approximately the same Pawpularity distribution, which helps to avoid biased models.

In [None]:
N_KFOLDS = 4
skf = StratifiedKFold(n_splits=N_KFOLDS, shuffle=True, random_state=42)

# Make Fold Indices
fold_idxs = skf.split(train['file_path'], train['Pawpularity'])

# Create TFRecords for each Fold
for fold, (train_idxs, val_idxs) in enumerate(tqdm(fold_idxs, total=N_KFOLDS)):
    print(f'Making TFRecords for Fold {fold}')
    
    # Train Data
    train_fold_data = train.loc[train_idxs].T
    train_fold_chunks = np.array_split(train_fold_data, 8 * 10, axis=1)
    
    # Val Data
    val_fold_data = train.loc[val_idxs].T
    val_fold_chunks = np.array_split(val_fold_data, 8 * 10, axis=1)

    print(f'train_fold_chunks: {len(train_fold_chunks)}, val_fold_chunks: {len(val_fold_chunks)}')
    
    # Create TFRecords
    to_tf_records(train_fold_chunks, 'train', fold)
    to_tf_records(val_fold_chunks, 'val', fold)
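
The split above stratifies directly on the raw score, which treats every distinct Pawpularity value as its own class. A common variant, shown here only as a sketch and not as what this notebook does, is to stratify on binned scores so that every class has plenty of members. `stratified_folds_on_bins` is a hypothetical helper, the bin count is an arbitrary assumption, and the scores are random toy values.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def stratified_folds_on_bins(scores, n_splits=4, n_bins=10, seed=42):
    """Return (train_idxs, val_idxs) pairs, stratified on quantile bins of the scores."""
    scores = np.asarray(scores)
    # Inner quantile edges; np.digitize maps each score to the index of its bin
    edges = np.unique(np.quantile(scores, np.linspace(0, 1, n_bins + 1)[1:-1]))
    bins = np.digitize(scores, edges)
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return list(skf.split(np.zeros(len(scores)), bins))

# Toy scores in the 1 to 100 range
rng = np.random.default_rng(0)
toy_scores = rng.integers(1, 101, size=400)
folds = stratified_folds_on_bins(toy_scores)

for fold, (train_idxs, val_idxs) in enumerate(folds):
    print(fold, len(train_idxs), len(val_idxs), toy_scores[val_idxs].mean())
```

Each sample ends up in exactly one validation fold, and the printed validation means give a quick check that the folds have similar score distributions.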

# Check TFRecords

In [None]:
IMG_SIZE = 512

# Imagenet mean and standard deviation per channel
IMAGENET_MEAN = tf.constant([0.485, 0.456, 0.406], dtype=tf.float32)
IMAGENET_STD = tf.constant([0.229, 0.224, 0.225], dtype=tf.float32)

# Number of channels, 3 for RGB images
N_CHANNELS = tf.constant(3, dtype=tf.int64)

In [None]:
# Function to decode the TFRecords
def decode_tfrecord(record_bytes):
    features = tf.io.parse_single_example(record_bytes, {
        'image': tf.io.FixedLenFeature([], tf.string),
        'label': tf.io.FixedLenFeature([], tf.int64),
        'width': tf.io.FixedLenFeature([], tf.int64),
        'height': tf.io.FixedLenFeature([], tf.int64),
    })

    image = tf.io.decode_jpeg(features['image'])
    label = features['label']
    height = features['height']
    width = features['width']
    
    # Center-crop to a square if the image is not square
    if height != width:
        if height > width:
            offset = (height - width) // 2
            image = tf.slice(image, [offset, 0, 0], [width, width, N_CHANNELS])
        else:
            offset = (width - height) // 2
            image = tf.slice(image, [0, offset, 0], [height, height, N_CHANNELS])
    
    # Reshape and Normalize
    size = tf.math.reduce_min([height, width])
    # Explicit reshape needed for TPU, tells the compiler the dimensions of the image
    image = tf.reshape(image, [size, size, N_CHANNELS])
    # Some images are smaller than IMG_SIZE x IMG_SIZE and need to be upscaled
    image = tf.image.resize(image, [IMG_SIZE, IMG_SIZE])
    # Convert to float32 and normalize to range 0-1
    image = tf.cast(image, tf.float32)  / 255.0
    # Normalize according to ImageNet mean and standard deviation
    image = (image - IMAGENET_MEAN) / IMAGENET_STD
    
    return image, label
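
The crop and normalization arithmetic in `decode_tfrecord` can be checked outside of TensorFlow. The sketch below reproduces the center crop and the ImageNet normalization with NumPy on a toy array; it leaves out the resize step, and the function names are hypothetical.

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def center_crop_square(image):
    """Cut the largest centered square out of an (H, W, C) array."""
    h, w = image.shape[:2]
    size = min(h, w)
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

def normalize(image_uint8):
    """Scale to the 0-1 range, then standardize per channel with ImageNet statistics."""
    scaled = image_uint8.astype(np.float32) / 255.0
    return (scaled - IMAGENET_MEAN) / IMAGENET_STD

# Toy portrait image (height 6, width 4) filled with zeros
toy = np.zeros((6, 4, 3), dtype=np.uint8)
square = center_crop_square(toy)
print(square.shape)              # (4, 4, 3)
print(normalize(square)[0, 0])   # (0 - mean) / std for each channel
```

For a portrait image the offset is applied to the rows, for a landscape image to the columns, matching the two `tf.slice` branches above.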

In [None]:
# Shows a batch of images
def show_batch(dataset, rows=8, cols=4):
    imgs, lbls = next(iter(dataset))
    fig, axes = plt.subplots(nrows=rows, ncols=cols, figsize=(cols*4, rows*4))
    for r in range(rows):
        for c in range(cols):
            img = imgs[r*cols+c].numpy().astype(np.float32)
            img += abs(img.min())
            img /= img.max()
            axes[r, c].imshow(img)
            axes[r, c].set_title(f'Label: {lbls[r*cols+c]}')

In [None]:
# Makes a TFRecordDataset iterator
def get_train_dataset():
    FNAMES_TRAIN_TFRECORDS = tf.io.gfile.glob('./*/*/*.tfrecords')
    train_dataset = tf.data.TFRecordDataset(FNAMES_TRAIN_TFRECORDS, num_parallel_reads=1)
    train_dataset = train_dataset.map(decode_tfrecord, num_parallel_calls=1)
    train_dataset = train_dataset.batch(32)
    
    return train_dataset

In [None]:
# Sanity Check, plot some images from the freshly created TFRecords
train_dataset = get_train_dataset()
show_batch(train_dataset)