## Exploratory Data Analysis over NFL 1st and Future - Impact Detection
Overview of EDA for [NFL 1st and Future - Impact Detection](https://www.kaggle.com/c/nfl-impact-detection)


This notebook is heavily inspired by the [Getting Started Notebook](https://www.kaggle.com/samhuddleston/nfl-1st-and-future-getting-started).

![](https://storage.googleapis.com/kaggle-competitions/kaggle/12125/logos/header.png)

<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h3 class="list-group-item list-group-item-action active" data-toggle="list" style='color:white; background:#266da8; border:0' role="tab" aria-controls="home"><center>Quick Navigation</center></h3>

* [1. Overview](#1)    
    
* [2. General Visualization](#2)

* [3. Impact Analysis](#3)

* [4. Adding bounding box to video](#4)

<a id="1"></a>
<h2 style='background:#266da8; border:0; color:white'><center>Overview</center></h2>

### Data:
- `image_labels.csv` - contains the bounding boxes corresponding to the images.
- `train_labels.csv` - Helmet tracking and collision labels for the training set.
- `sample_submission.csv` - A valid sample submission file.
- `[train/test]_player_tracking.csv` - Each player wears a sensor that allows us to precisely locate them on the field; that information is reported in these two files.

Folders:

- `/train/` contains the mp4 video files for the training plays. Each play has both an endzone and sideline view.
- `/test/` contains the videos for the test set. The public dataset shows only 2 videos, but these are just examples that already appear in the training set. When your model is submitted, it will run on 15 unseen videos. We are told that 20% of the test videos will produce the public LB score and 80% will produce the private score (3 plays public LB, 12 private), so there may be some shakeup on the private leaderboard!
- `/images/` contains the additional annotated images of player helmets

In [None]:
# Importing libraries for data analysis and wrangling
import imageio
from PIL import Image
import cv2
import numpy as np
import pandas as pd 
import os
import subprocess
from tqdm import tqdm

# Importing libraries for data visualization
import matplotlib.pyplot as plt
import matplotlib.patches as patches
%matplotlib inline
plt.rcParams['figure.dpi'] = 150
import seaborn as sns

from IPython.display import Video, display

# Suppress all warnings, including the pandas warnings about setting values on a slice
import warnings
warnings.filterwarnings('ignore')

### - Image Data:
The labeled image dataset consists of 9947 labeled images and a .csv file named image_labels.csv that contains the labeled bounding boxes for all images. This dataset is provided to support the development of helmet detection algorithms.

In [None]:
# Read in the image labels file

image_df = pd.read_csv("../input/nfl-impact-detection/image_labels.csv")
image_df.head()

In [None]:
image_df.tail()

In [None]:
# Get a summary on the data type

image_df.info()

### - Train Data:

In [None]:
train_df = pd.read_csv("../input/nfl-impact-detection/train_labels.csv")
train_df.head()

In [None]:
train_df.tail()

In [None]:
train_df.info()

### - Train/Test Tracking data:

In [None]:
train_tr_df = pd.read_csv("../input/nfl-impact-detection/train_player_tracking.csv")
test_tr_df = pd.read_csv("../input/nfl-impact-detection/test_player_tracking.csv")

In [None]:
train_tr_df.head()

In [None]:
train_tr_df.info()

In [None]:
test_tr_df.head()

In [None]:
test_tr_df.info()

### - Submission Data:

In [None]:
ss_df= pd.read_csv("../input/nfl-impact-detection/sample_submission.csv")

In [None]:
ss_df.head()

In [None]:
ss_df.info()

<a id="2"></a>
<h2 style='background:#266da8; border:0; color:white'><center>General Visualization</center></h2>

In [None]:
# Set the name of our working image
img_name = image_df['image'][0]
img_name

In [None]:
# Define the path to our selected image
img_path = f"/kaggle/input/nfl-impact-detection/images/{img_name}"

In [None]:
# Read in and plot the image
img = imageio.imread(img_path) 
plt.imshow(img)
plt.show()

The function below adds the bounding boxes from the labels to the image. To draw a bounding box, we need to specify the pixel locations of its top-left and bottom-right corners.
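As a minimal sketch of that corner arithmetic, assuming each label stores a box as `left`, `top`, `width`, and `height` in pixels (the columns used elsewhere in this notebook), the two corners can be computed as follows. The helper name `box_corners` is hypothetical and not part of the competition code.

```python
def box_corners(left, top, width, height):
    """Return the (top-left, bottom-right) pixel corners of a box."""
    # Cast to plain ints so the tuples can be passed straight to drawing calls
    top_left = (int(left), int(top))
    bottom_right = (int(left + width), int(top + height))
    return top_left, bottom_right


# Example values are made up for illustration
corners = box_corners(left=100, top=50, width=20, height=30)
print(corners)
```

The image origin is the top-left pixel, so adding `height` to `top` moves down the image rather than up.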

In [None]:
# Function to add labels to an image

def add_img_boxes(image_name, image_labels):
    # Set label colors for bounding boxes
    HELMET_COLOR = (0, 0, 0)    # Black

    # Load the image and select only the boxes that belong to it
    img = np.array(imageio.imread(f"/kaggle/input/nfl-impact-detection/images/{image_name}"))
    boxes = image_labels.loc[image_labels['image'] == image_name]
    for _, box in boxes.iterrows():
        # cv2.rectangle takes the top-left and bottom-right pixels of the box
        top_left = (int(box.left), int(box.top))
        bottom_right = (int(box.left + box.width), int(box.top + box.height))
        cv2.rectangle(img, top_left, bottom_right, HELMET_COLOR, thickness=1)

    # Display the image with bounding boxes added
    plt.imshow(img)
    plt.show()

In [None]:
add_img_boxes(img_name, image_df)

Now, we can see in the image above that `bounding boxes` have been added to every helmet.

In [None]:
# Number of unique videos

train_df['video'].nunique()

In [None]:
frame_count = train_df[['gameKey','playID','frame']] \
    .drop_duplicates()[['gameKey','playID']] \
    .value_counts()

fig, ax = plt.subplots(figsize=(12, 5))
sns.set_style("whitegrid")
sns.distplot(frame_count, bins=15)
ax.set_title('Distribution of labeled frames per play')
plt.show()

As the plot above shows, most plays are between 300 and 500 frames long, while the longest play has over 600 frames.
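To back the plot with numbers, the frames per play can also be summarised directly. The sketch below uses a small made-up table in place of `train_df`; only the `gameKey`, `playID`, and `frame` column names come from the notebook, and the values are hypothetical.

```python
import pandas as pd

# Hypothetical toy labels: two plays, with several boxes sharing a frame
toy_labels = pd.DataFrame({
    "gameKey": [1, 1, 1, 1, 2, 2],
    "playID":  [10, 10, 10, 10, 20, 20],
    "frame":   [1, 1, 2, 3, 1, 2],
})

# Count the distinct labeled frames in each play
frames_per_play = toy_labels.groupby(["gameKey", "playID"])["frame"].nunique()

# Summary statistics (min, quartiles, max) of play length in frames
summary = frames_per_play.describe()
print(frames_per_play)
print(summary)
```

Running the same two lines on `train_df` would give the minimum, median, and maximum play length without reading them off the histogram.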


In [None]:
train_df['area'] = train_df['width'] * train_df['height']
fig, ax = plt.subplots(figsize=(12, 5))
colorpal = sns.color_palette("husl", 9)
# Plot the areas themselves rather than the counts of each distinct area
sns.distplot(train_df['area'],
             bins=10,
             color=colorpal[1])
sns.set_style("whitegrid")
ax.set_title('Distribution of bounding box sizes')
plt.show()

In [None]:
train_df['label'].value_counts() \
    .sort_values() \
    .tail(25).plot(kind='barh',
                   figsize=(15, 5),
                   title='Top 25 Box Labels',
                   color=colorpal[3])
plt.show()

<a id="3"></a>
<h1 style='background:#266da8; border:0; color:white'><center>Impact Analysis</center></h1>

### For the purposes of evaluation, definitive helmet impacts are defined as meeting three criteria:

- **impact = 1**
- **confidence > 1**
- **visibility > 0**
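The three criteria above can be expressed as a single boolean mask. This is a minimal sketch on made-up rows, assuming the label columns are named `impact`, `confidence`, and `visibility` as in the rest of this notebook; the function name `definitive_impacts` is hypothetical.

```python
import numpy as np
import pandas as pd


def definitive_impacts(labels: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with impact == 1, confidence > 1 and visibility > 0."""
    mask = (
        (labels["impact"] == 1)
        & (labels["confidence"] > 1)
        & (labels["visibility"] > 0)
    )
    return labels[mask]


# Hypothetical toy rows; NaN stands for boxes without an impact label
toy = pd.DataFrame({
    "impact":     [1, 1, 1, np.nan],
    "confidence": [3, 1, 2, np.nan],
    "visibility": [2, 3, 0, np.nan],
})
result = definitive_impacts(toy)
print(result)
```

Comparisons against NaN evaluate to False, so unlabeled boxes drop out without needing a separate `fillna` step.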

<h1><center>Impact</center></h1>

In [None]:
train_df['impactType'].value_counts() \
    .plot(kind='bar',
          title='Impact Type Count',
          figsize=(12, 4),
          color=colorpal[4])

plt.show()

The impacts are labeled by type: helmet, shoulder, body, etc. We can see that the majority of impacts are with other helmets, but shoulder and body impacts do occur. Our submission does not need to identify the impact type, but it may be helpful information when training models.

In [None]:
for i, d in train_df.groupby('impactType'):
    if len(d) < 10:
        continue
    d['frame'].plot(kind='kde', alpha=1, figsize=(12, 4), label=i,
                    title='Impact Type by Frame')
    plt.legend()

In [None]:
pct_impact_occurance = train_df[['video','impact']] \
    .fillna(0)['impact'].mean() * 100
print(f'Of all bounding boxes, {pct_impact_occurance:0.4f}% of them involve an impact event')

In [None]:
# Keep only numeric columns so the per-frame mean is well defined
train_df[['impact','frame']] \
    .fillna(0) \
    .groupby('frame').mean() \
    .plot(figsize=(12, 5), title='Occurrence of impacts by frame in video',
          color=colorpal[6])
plt.show()

### Pairplot of Bounding Box, Impact vs Non-Impact
These plots attempt to quickly identify if there is any commonality between the location of the bounding box and where impacts occur. It appears that the locations tend to be

In [None]:
sns.pairplot(train_df[['frame','area',
                        'left','width',
                        'top','height',
                        'impact']] \
                .sample(5000).fillna(0),
             hue='impact')
plt.show()

Similarly we can look at the impact type by bounding box location and area.



In [None]:
sns.pairplot(train_df[['frame','area',
                        'left', 'top',
                        'impactType']].dropna() \
             .sample(1000), hue='impactType',
            plot_kws={'alpha': 0.5})
plt.show()

<h1><center>Confidence</center></h1>

## Confidence Label
- 1 = Possible
- 2 = Definitive 
- 3 = Definitive and Obvious
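For more readable plots, the numeric codes can be mapped to the names listed above. This is a small sketch on made-up values standing in for `train_df['confidence']`; the dictionary name `CONFIDENCE_NAMES` is hypothetical.

```python
import pandas as pd

# Code-to-name mapping taken from the list above
CONFIDENCE_NAMES = {1: "Possible", 2: "Definitive", 3: "Definitive and Obvious"}

# Hypothetical toy values; missing entries are boxes without an impact
confidence = pd.Series([1.0, 3.0, None, 2.0, 3.0])

# Drop missing values, convert to int, then replace codes with names
named = confidence.dropna().astype(int).map(CONFIDENCE_NAMES)
counts = named.value_counts()
print(counts)
```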

In [None]:
train_df['confidence'].dropna() \
    .astype('int').value_counts() \
    .plot(kind='bar',
          title='Confidence Type Label Count',
          figsize=(12, 4),rot=0)
plt.show()

<h1><center>Visibility</center></h1>

### Visibility Label
Visibility labels are:
- 0 = Not Visible from View
- 1 = Minimum
- 2 = Visible
- 3 = Clearly Visible
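Since the evaluation criteria combine confidence and visibility, it can help to look at the two labels together. The sketch below cross-tabulates them on made-up impact rows; the values are hypothetical, and only the column names are taken from this notebook.

```python
import pandas as pd

# Hypothetical toy impact rows using the label codes described in this notebook
toy = pd.DataFrame({
    "confidence": [1, 2, 3, 3, 2],
    "visibility": [0, 2, 3, 1, 0],
})

# Rows are confidence levels, columns are visibility levels
table = pd.crosstab(toy["confidence"], toy["visibility"])

# Number of rows passing both thresholds (confidence > 1 and visibility > 0)
n_pass = int(((toy["confidence"] > 1) & (toy["visibility"] > 0)).sum())
print(table)
print(n_pass)
```

On the real labels, a table like this would show how many impacts are lost to each threshold.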

In [None]:
train_df['visibility'].dropna() \
    .astype('int').value_counts() \
    .plot(kind='bar',
          title='Visibility Label Count',
          figsize=(12, 4),rot=0)
plt.show()

<a id="4"></a>
<h2 style='background:#266da8; border:0; color:white'><center>Adding bounding box to video</center></h2>

**This part is inspired by this [notebook](https://www.kaggle.com/samhuddleston/nfl-1st-and-future-getting-started).**

In [None]:
# Define the video we'll process
video_name = train_df['video'][0]
video_name

In [None]:
# Define the path and then display the video using IPython.display.Video
video_path = f"/kaggle/input/nfl-impact-detection/train/{video_name}"
display(Video(data=video_path, embed=True))

In [None]:
# Create a function to annotate the video at the provided path using labels from the provided dataframe, return the path of the video
def annotate_video(video_path: str, video_labels: pd.DataFrame) -> str:
    VIDEO_CODEC = "MP4V"
    HELMET_COLOR = (0, 0, 0)    # Black
    IMPACT_COLOR = (0, 0, 255)  # Red
    video_name = os.path.basename(video_path)
    
    vidcap = cv2.VideoCapture(video_path)
    fps = vidcap.get(cv2.CAP_PROP_FPS)
    width = int(vidcap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(vidcap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    output_path = "labeled_" + video_name
    tmp_output_path = "tmp_" + output_path
    output_video = cv2.VideoWriter(tmp_output_path, cv2.VideoWriter_fourcc(*VIDEO_CODEC), fps, (width, height))
    frame = 0
    while True:
        it_worked, img = vidcap.read()
        if not it_worked:
            break
        
        # We need to add 1 to the frame count to match the label frame index that starts at 1
        frame += 1
        # Let's add a frame index to the video so we can track where we are
        img_name = f"{video_name}_frame{frame}"
        cv2.putText(img, img_name, (0, 50), cv2.FONT_HERSHEY_SIMPLEX, 1.0, HELMET_COLOR, thickness=2)
    
        # Now, add the boxes
        boxes = video_labels.query("video == @video_name and frame == @frame")
        for box in boxes.itertuples(index=False):
            if box.impact == 1 and box.confidence > 1 and box.visibility > 0:    # Filter for definitive head impacts and turn labels red
                color, thickness = IMPACT_COLOR, 2
            else:
                color, thickness = HELMET_COLOR, 1
            # Add a box around the helmet
            cv2.rectangle(img, (box.left, box.top), (box.left + box.width, box.top + box.height), color, thickness=thickness)
            cv2.putText(img, box.label, (box.left, max(0, box.top - 5)), cv2.FONT_HERSHEY_SIMPLEX, 0.7, color, thickness=1)
        output_video.write(img)
    output_video.release()
    vidcap.release()  # Release the input capture as well
    
    # Not all browsers support the codec, we will re-load the file at tmp_output_path and convert to a codec that is more broadly readable using ffmpeg
    if os.path.exists(output_path):
        os.remove(output_path)
    subprocess.run(["ffmpeg", "-i", tmp_output_path, "-crf", "18", "-preset", "veryfast", "-vcodec", "libx264", output_path])
    os.remove(tmp_output_path)
    
    return output_path

In [None]:
# Label the video and display it - this will take a bit
labeled_video = annotate_video(f"/kaggle/input/nfl-impact-detection/train/{video_name}", train_df)
display(Video(data=labeled_video, embed=True))