Hello Kagglers! So many interesting competitions, so little time. Following the amazing results from last year, the National Football League (NFL) has asked Kagglers to help with one more task. This time the NFL wants to assign specific players to each helmet, which would help accurately identify each player's “exposures” throughout a football play.

To aid with helmet detection, the NFL has also provided an ancillary dataset of images showing helmets with labeled bounding boxes. These files are located in the `images` directory, and the bounding box information is in `image_labels.csv`.

Our goal is to analyze this dataset and see how we can make the best use of it for additional training. Let's jump in.

In [None]:
import os
import cv2
import numpy as np
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
%config IPCompleter.use_jedi = False

In [None]:
# Path to the parent directory containing all the data 
data_path = Path("../input/nfl-health-and-safety-helmet-assignment/")

## Supplementary Image Data

1. Additional images are provided for building a good helmet detector. They are comparable to the train/test distribution.
2. Load these images first, then check what the images look like and how the labels are provided in `image_labels.csv`.

In [None]:
additional_images = sorted((data_path / "images").glob("*.jpg"))
additional_images_df = pd.read_csv(data_path / "image_labels.csv")

print("Number of images: ", len(additional_images))
print("Number of annotations: ", len(additional_images_df))
print(additional_images_df.head())
print("="*50)
# Use true division so the fractional part of the average is not dropped
print(f"\nAverage number of annotations per frame: {len(additional_images_df) / len(additional_images):.2f}")
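
A single average hides how unevenly boxes may be spread across images. The sketch below shows one way to look at the per-image distribution with `groupby`. It runs on a tiny made-up table (`toy_labels`, with hypothetical file names and coordinates) that only mimics the columns of `image_labels.csv`; the numbers say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for image_labels.csv (all values are made up for illustration)
toy_labels = pd.DataFrame({
    "image": ["a.jpg", "a.jpg", "a.jpg", "b.jpg", "b.jpg"],
    "label": ["Helmet"] * 5,
    "left": [10, 50, 90, 20, 60],
    "width": [8, 8, 8, 10, 10],
    "top": [5, 15, 25, 30, 40],
    "height": [8, 8, 8, 10, 10],
})

# Number of annotated boxes for every image
boxes_per_image = toy_labels.groupby("image").size()

# Exact mean versus the floor-divided mean
mean_boxes = boxes_per_image.mean()
floor_mean = len(toy_labels) // boxes_per_image.shape[0]

print(boxes_per_image.describe())
print("Mean boxes per image:", mean_boxes)
print("Floor-divided mean:", floor_mean)
```

On the real data, the same `groupby("image").size()` call on `additional_images_df` should give the full distribution, including the minimum and maximum number of helmets per image.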

We will make two changes in this dataframe:
1. We will modify the image path to the actual image path
2. We will add `right` and `bottom` coordinates of the box

In [None]:
additional_images_df["image"] = str(data_path / "images") + "/" + additional_images_df["image"]
additional_images_df["right"] = additional_images_df["left"] + additional_images_df["width"]
additional_images_df["bottom"] = additional_images_df["top"] + additional_images_df["height"]
additional_images_df.head()

# Visualizing the data

In [None]:
def plot_bboxes_on_frame(images, bbox, color=(0, 255, 0)):
    """Plots all the bounding boxes on the given frames.
    
    Note: boxes are drawn in place, so the input images are modified.
    
    Args:
        images: List of images
        bbox: Corresponding list of bboxes. All the bboxes corresponding
            to an image should be provided in a single list of list of
            dictionaries. For example:
            [
                [{"left": 100, "top": 50, "right": 130, "bottom": 90}, {...}],
                [{..}..],
                ...
            ]
    Returns:
        The list of images with the boxes drawn on them
    """
    
    if len(images) != len(bbox):
        raise ValueError("Number of images and corresponding bboxes should be same")
        
    for i in range(len(bbox)):
        boxes = bbox[i]
        for box in boxes:
            # Cast to plain ints: OpenCV may reject NumPy integer types
            top_left = (int(box["left"]), int(box["top"]))
            bottom_right = (int(box["right"]), int(box["bottom"]))
            cv2.rectangle(images[i], top_left, bottom_right, color, 2)
    return images


def get_bbox(df, idx, cols=["left", "right", "top", "bottom"]):
    """Given an index, return a dictionary of box coordinates.
    
    Args:
        df: Dataframe containing the information
        idx: The index to pick the data from
        cols: The columns to extract the data
    Returns:
        A dictionary containing coordinates of the bbox
    """
    
    box = {}
    for col in cols:
        box[col] = df[col][idx]
    return box


def get_all_boxes(df, indices, cols=["left", "right", "top", "bottom"]):
    """Gathers all bboxes corresponding to an image.
    
    Args:
        df: Dataframe containing the information
        indices: The indices to pick the data from
        cols: The columns to extract the data
    Returns:
        A list of bboxes
    """
    # Pass `cols` through so a custom column list is actually honored
    bbox = [get_bbox(df, idx, cols) for idx in indices]
    return bbox

def get_all_indices_of_image(df, image_path):
    """Find all the indices corresponding to an image.
    
    Args:
        df: Dataframe containing the information
        image_path: The image_path for which indices are required
    Returns:
        A list of indices
    """
    
    indices = df.index[df["image"]==image_path]
    return indices.tolist()



def read_images_and_bbox(df, images_path):
    """Read images and bbox information"""
    images = []
    bbox = []
    
    for i, img_path in enumerate(images_path):
        indices = get_all_indices_of_image(df, img_path)
        boxes = get_all_boxes(df, indices)
        img = cv2.imread(img_path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        images.append(img)
        bbox.append(boxes)
        
    return images, bbox
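
The helpers above scan the whole dataframe once per image to find its rows. As an alternative, the boxes can be grouped in a single pass with `groupby`. This is only a sketch on a made-up table (`toy_boxes`, hypothetical file names and coordinates), not a replacement for the functions above, and `boxes_by_image` is a name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for the annotated dataframe (values are made up)
toy_boxes = pd.DataFrame({
    "image": ["a.jpg", "a.jpg", "b.jpg"],
    "left": [10, 50, 20],
    "right": [18, 58, 30],
    "top": [5, 15, 30],
    "bottom": [13, 23, 40],
})

cols = ["left", "right", "top", "bottom"]

# One pass over the dataframe: image path -> list of box dictionaries
boxes_by_image = {
    image: group[cols].to_dict("records")
    for image, group in toy_boxes.groupby("image")
}

print(boxes_by_image["a.jpg"])
```

The resulting dictionary has the same per-image structure (a list of box dictionaries) that `plot_bboxes_on_frame` expects for each image.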

In [None]:
# Get the list of all unique images
all_images = list(set(additional_images_df["image"]))

# Select how many rows and columns you want in the subplots
num_rows = 5
num_cols = 2

# Select the desired number of images randomly (without repeats)
selected_images = np.random.choice(all_images, size=num_rows*num_cols, replace=False)

# Read the images and get box coordinates
images, boxes = read_images_and_bbox(df=additional_images_df, images_path=selected_images)

# Draw the boxes on the images
images_with_boxes = plot_bboxes_on_frame(images, boxes)


# Visualize the images
_, ax = plt.subplots(num_rows, num_cols, figsize=(20, 22))

for i in range(num_rows*num_cols):
    title = selected_images[i].split("/")[-1]
    ax[i // num_cols][i % num_cols].imshow(images_with_boxes[i])
    ax[i // num_cols][i % num_cols].axis("off")
    ax[i // num_cols][i % num_cols].set_title(title)

plt.tight_layout()
plt.show()

# Views in frames

Given that the images come from `endzone` and `sideline` views, let's try to group and align them and see whether this data can be used in a smarter way.

In [None]:
def get_image_number(img_path):
    img_num = img_path.split("/")[-1]
    img_num = img_num.split("_")[0]
    return img_num

def get_frame_view(img_path):
    img_view = img_path.split("/")[-1]
    if "Endzone" in img_view:
        return "Endzone"
    elif "Sideline" in img_view:
        return "Sideline"
    else:
        return "Unknown"

In [None]:
additional_images_df["image_no"] = additional_images_df["image"].apply(get_image_number)
additional_images_df["frame_view"] = additional_images_df["image"].apply(get_frame_view)
additional_images_df.head()
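
Once `image_no` and `frame_view` exist, one way to check which identifiers are covered by both views is a cross-tabulation. The sketch below uses a made-up table (`toy_views`) with hypothetical `image_no` values; whether the real identifiers pair up this way is an assumption that has to be verified on the actual data.

```python
import pandas as pd

# Toy stand-in with the two derived columns (values are made up)
toy_views = pd.DataFrame({
    "image_no": ["001", "001", "001", "002", "002", "003"],
    "frame_view": ["Endzone", "Endzone", "Sideline",
                   "Sideline", "Sideline", "Endzone"],
})

# Rows: image_no, columns: frame_view, values: number of annotations
view_counts = pd.crosstab(toy_views["image_no"], toy_views["frame_view"])

# Keep only the identifiers that have at least one box in every view
has_both_views = (view_counts > 0).all(axis=1)
paired_ids = has_both_views[has_both_views].index.tolist()

print(view_counts)
print("Identifiers seen from both views:", paired_ids)
```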

# Distribution of `views` 

In [None]:
plt.figure(figsize=(8, 5))
sns.countplot(data=additional_images_df, x="frame_view")
plt.show()

In [None]:
# So, we have more sideline views than endzone views, but by how much?
additional_images_df["frame_view"].value_counts()

# How many labels are there?

It looks like we have more than one class in this data. Let's take a look.

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(data=additional_images_df, x="label")
plt.show()
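
A count plot shows the imbalance visually, but the exact shares are easier to read as numbers. The sketch below computes them with `value_counts(normalize=True)` on a made-up label column (`toy_labels_only`); the proportions here are invented and do not reflect the real dataset.

```python
import pandas as pd

# Toy label column (the class mix is made up for illustration)
toy_labels_only = pd.DataFrame({
    "label": ["Helmet"] * 6
    + ["Helmet-Blurred"] * 2
    + ["Helmet-Partial", "Helmet-Difficult"],
})

# Fraction of annotations per class, sorted from most to least frequent
label_share = toy_labels_only["label"].value_counts(normalize=True)

print(label_share.round(2))
```

Running the same call on `additional_images_df["label"]` should give the share of each helmet class in the supplementary data.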

# Different kind of helmets

As we saw above, the helmets have been annotated in five different ways. We will plot a few samples of each type to get an idea of why they are annotated this way.

In [None]:
def filter_df(df, filter_col, value):
    df = df[df[filter_col]==value]
    df = df.reset_index(drop=True)
    return df

In [None]:
def plot_samples(df, num_samples_to_plot, figsize=(20, 20)):
    # Get the list of all unique images
    all_images = list(set(df["image"]))

    # Select how many rows and columns you want in the subplots
    num_rows = num_samples_to_plot // 2
    num_cols = 2

    # Select the desired number of images randomly
    selected_images = np.random.choice(all_images, size=num_rows*num_cols)

    # Read the images and get box coordinates
    images, boxes = read_images_and_bbox(df=df, images_path=selected_images)

    # Draw the boxes on the images
    images_with_boxes = plot_bboxes_on_frame(images, boxes)


    # Visualize the images
    _, ax = plt.subplots(num_rows, num_cols, figsize=figsize)

    for i in range(num_rows*num_cols):
        title = selected_images[i].split("/")[-1]
        ax[i // num_cols][i % num_cols].imshow(images_with_boxes[i])
        ax[i // num_cols][i % num_cols].axis("off")
        ax[i // num_cols][i % num_cols].set_title(title)

    plt.tight_layout()
    plt.show()

In [None]:
# We can use groupby here but I will do it manually because
# I would be using these subsets for further analysis

helmet_df = filter_df(additional_images_df, "label", "Helmet")
helmet_blurred_df = filter_df(additional_images_df, "label","Helmet-Blurred")
helmet_difficult_df = filter_df(additional_images_df, "label", "Helmet-Difficult")
helmet_sideline_df = filter_df(additional_images_df, "label", "Helmet-Sideline")
helmet_partial_df =  filter_df(additional_images_df, "label", "Helmet-Partial")

## Blurred Helmets

In [None]:
plot_samples(helmet_blurred_df, num_samples_to_plot=10)

## Difficult Helmets

In [None]:
plot_samples(helmet_difficult_df, num_samples_to_plot=10)

The `difficult` helmets are genuinely hard to spot. Detecting them will require a very good object detector that can handle objects at such a small scale.
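
One way to put a number on "small scale" is to compare box areas across labels. The sketch below does this on a made-up table (`toy_sizes`); the widths and heights are invented, so the output only demonstrates the computation and is not a finding about the real data.

```python
import pandas as pd

# Toy boxes for two labels (sizes are made up for illustration)
toy_sizes = pd.DataFrame({
    "label": ["Helmet", "Helmet", "Helmet-Difficult", "Helmet-Difficult"],
    "width": [20, 24, 6, 8],
    "height": [22, 26, 7, 9],
})

# Box area in pixels, then the median area for each label
toy_sizes["area"] = toy_sizes["width"] * toy_sizes["height"]
median_area = toy_sizes.groupby("label")["area"].median()

print(median_area)
```

Applied to `additional_images_df`, which already has `width` and `height` columns, the same two lines would show whether the `Helmet-Difficult` boxes are in fact smaller than the others.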

## Sideline Helmets

In [None]:
plot_samples(helmet_sideline_df, num_samples_to_plot=10)

## Partial Helmets

In [None]:
plot_samples(helmet_partial_df, num_samples_to_plot=10)

I would treat `partial-helmet` and `difficult-helmet` as the same category because both are really hard to spot and to tell apart.
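
If the two categories are merged for training, a label mapping is one simple way to do it. The sketch below is an illustration on a made-up label column (`toy_merge`); the merged class name `Helmet-Hard` and the `merged_label` column are hypothetical names chosen here, not something defined by the dataset.

```python
import pandas as pd

# Hypothetical mapping: both hard-to-spot classes collapse into one name
LABEL_MERGE = {
    "Helmet-Partial": "Helmet-Hard",
    "Helmet-Difficult": "Helmet-Hard",
}

# Toy label column (values are made up for illustration)
toy_merge = pd.DataFrame({
    "label": ["Helmet", "Helmet-Partial", "Helmet-Difficult", "Helmet-Blurred"],
})

# Labels not present in the mapping are left unchanged
toy_merge["merged_label"] = toy_merge["label"].replace(LABEL_MERGE)

print(toy_merge)
```

Keeping the original `label` column next to the merged one makes it easy to compare a detector trained with five classes against one trained with four.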

*To be continued..*