This notebook extracts a small sample from the larger dataset I used, so that the project can be demonstrated. If you are using different data with the same structure, replace the data references with your own. The data is laid out as follows:

- `data/`
  - `dataset.json`
  - `[off_formation]/`
    - `[video_path]/`
      - `sideline_[video_path].png`
      - `endzone_[video_path].png`
      - `tight_[video_path].png`

- `[off_formation]` is the offensive formation (e.g., `ACES`, `JOKERS`, `KINGS`).
- `[video_path]` is the unique identifier for a specific play.
- Each `video_path` folder contains one or more images captured from different camera angles (`sideline`, `endzone`, `tight`).
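
As a rough illustration of the layout above, the sketch below builds the image paths a play folder may contain. The helper names (`expected_image_paths`, `existing_images`) and the play ID `play_0001` are hypothetical and are not part of the project code; only the folder and file naming pattern comes from the structure described here.

```python
import os

# Camera angles named in the layout above.
CAMERA_ANGLES = ("sideline", "endzone", "tight")


def expected_image_paths(data_root, off_formation, video_path):
    """Return {angle: path} for the images a play folder may contain."""
    play_dir = os.path.join(data_root, off_formation, video_path)
    return {
        angle: os.path.join(play_dir, f"{angle}_{video_path}.png")
        for angle in CAMERA_ANGLES
    }


def existing_images(data_root, off_formation, video_path):
    """Keep only the angles whose image file is actually present on disk."""
    paths = expected_image_paths(data_root, off_formation, video_path)
    return {angle: p for angle, p in paths.items() if os.path.isfile(p)}


# "play_0001" is a made-up identifier used only for this example.
paths = expected_image_paths("data", "ACES", "play_0001")
```

Because a folder may hold only some of the three angles, `existing_images` filters by what is actually on disk rather than assuming all three files exist.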

Within `dataset.json`, each line is a separate JSON object (one play per line) with fields describing the play, including:
- `off_formation`: Name of the offensive formation (matches a folder name).
- `video_path`: Unique identifier for the play (matches a subfolder name).
- `off_play`, `play_type`, `result`, and other fields capturing metadata about the play (e.g., play type, direction, outcome, gain/loss, quarter).
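
The loading code in this notebook parses `dataset.json` one line at a time, so the sketch below assumes the same one-object-per-line layout. The records are invented for illustration (only the field names come from the description above), and `read_json_lines` is a hypothetical helper, not a function from the project.

```python
import io
import json
from collections import Counter


def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records; an in-memory buffer stands in for dataset.json.
example = io.StringIO(
    '{"off_formation": "ACES", "video_path": "play_0001", "play_type": "Run"}\n'
    '{"off_formation": "ACES", "video_path": "play_0002", "play_type": "Pass"}\n'
    '\n'
    '{"off_formation": "QUEENS", "video_path": "play_0003", "play_type": "Run"}\n'
)

entries = read_json_lines(example)

# Count how many plays each formation has before deciding on a sample size.
plays_per_formation = Counter(entry["off_formation"] for entry in entries)
```

Checking `plays_per_formation` first can help pick a per-formation sample size that every formation can actually satisfy.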

The images within each `[video_path]` folder correspond to the same play and are prefixed with their camera angle.


In [1]:
import os
import sys
import json
import random
import shutil
# change for your own data
path_to_data = "/Users/thomasmcconnell/Library/CloudStorage/OneDrive-PomonaCollege/School/CS153/Project/Hudl_datacollector/hudl_dataset"
path_for_data = "/Users/thomasmcconnell/Library/CloudStorage/OneDrive-PomonaCollege/School/CS153/Project/cs153-football-formation-id/sample_data"
formations = ["ACES", "KINGSSPLIT", "QUEENS"]
number_of_samples_per_formation = 5
dataset_json = "dataset.json"
output_dataset_json = "sample_dataset.json"
os.makedirs(path_for_data, exist_ok=True)

Now we extract the sample we want.

In [None]:
# load dataset
dataset_path = os.path.join(path_to_data, dataset_json)
with open(dataset_path, 'r') as f:
    full_data = [json.loads(line) for line in f]

# randomly sample n_samples plays for a formation (case-insensitive match)
def sample_plays(data, formation, n_samples):
    matching_entries = [entry for entry in data if entry['off_formation'].lower() == formation.lower()]
    if len(matching_entries) < n_samples:
        raise ValueError(f"Not enough samples for formation {formation}.")
    return random.sample(matching_entries, n_samples)

# sample plays
sampled_entries = []

for formation in formations:
    print(f"Sampling {number_of_samples_per_formation} plays from formation: {formation}")
    sampled = sample_plays(full_data, formation, number_of_samples_per_formation)
    sampled_entries.extend(sampled)
    
    # copy image folders
    for entry in sampled:
        video_path = entry['video_path']
        src_folder = os.path.join(path_to_data, formation, video_path)
        dst_folder = os.path.join(path_for_data, formation, video_path)
        
        if os.path.exists(src_folder):
            os.makedirs(dst_folder, exist_ok=True)
            for filename in os.listdir(src_folder):
                src_file = os.path.join(src_folder, filename)
                dst_file = os.path.join(dst_folder, filename)
                shutil.copy2(src_file, dst_file)
        else:
            print(f"Missing source folder for {video_path} under {formation}")

# save dataset
output_path = os.path.join(path_for_data, output_dataset_json)
with open(output_path, 'w') as f:
    for entry in sampled_entries:
        json.dump(entry, f)
        f.write('\n')

print(f"Sample dataset saved to {output_path} with {len(sampled_entries)} plays.")


Sampling 5 plays from formation: ACES
Sampling 5 plays from formation: KINGSSPLIT
Sampling 5 plays from formation: QUEENS
Sample dataset saved to /Users/thomasmcconnell/Library/CloudStorage/OneDrive-PomonaCollege/School/CS153/Project/cs153-football-formation-id/sample_data/sample_dataset.json with 15 plays.
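
As an optional follow-up, the sketch below checks that every play listed in the sampled JSON file has a non-empty image folder. It is not part of the original notebook: `find_missing_play_folders` is a hypothetical helper, the demo runs on a throwaway directory with made-up play IDs, and it assumes each folder name equals the entry's `off_formation` exactly (the sampling code above matches formations case-insensitively, so the names could differ in case).

```python
import json
import os
import tempfile


def find_missing_play_folders(sample_root, json_name="sample_dataset.json"):
    """Return (off_formation, video_path) pairs whose folder is absent or empty."""
    missing = []
    with open(os.path.join(sample_root, json_name), "r") as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines
            entry = json.loads(line)
            folder = os.path.join(
                sample_root, entry["off_formation"], entry["video_path"]
            )
            if not os.path.isdir(folder) or not os.listdir(folder):
                missing.append((entry["off_formation"], entry["video_path"]))
    return missing


# Self-contained demo: one play with an image, one play with no folder at all.
with tempfile.TemporaryDirectory() as root:
    play_dir = os.path.join(root, "ACES", "play_0001")
    os.makedirs(play_dir)
    open(os.path.join(play_dir, "sideline_play_0001.png"), "wb").close()
    with open(os.path.join(root, "sample_dataset.json"), "w") as f:
        f.write(json.dumps({"off_formation": "ACES", "video_path": "play_0001"}) + "\n")
        f.write(json.dumps({"off_formation": "QUEENS", "video_path": "play_0002"}) + "\n")
    demo_missing = find_missing_play_folders(root)
```

On real data, one would call `find_missing_play_folders(path_for_data)` and expect an empty list when every sampled play was copied.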
