
public-police-footage

This repository accompanies the paper Constructing Datasets From Public Police Body Camera Footage by Jamie Rosas-Smith, Martijn Bartelds, Ruizhe Huang, Leibny Paola García-Perera, Karen Livescu, Dan Jurafsky, and Anjalie Field. It includes code for downloading and processing public police body-worn camera footage from YouTube. The resulting data is ready for training and fine-tuning off-the-shelf ASR models, and we provide code for fine-tuning Whisper on our dataset.

Requirements and Installation

Create a conda environment and install the packages in config/requirements.txt:

conda create -n police python=3.9 -y
conda activate police
pip install -r ./config/requirements.txt

Downloading videos - base dataset

  1. First, download the videos. You will need to pass a cookie file to the download script (see "How to pass YouTube cookies to yt-dlp" below). Then run the script yt-dlp.sh with a file list from the config directory:
# Download the hand-cleaned videos
./yt-dlp.sh config/batch_hand.txt

# Download the automatically cleaned videos
./yt-dlp.sh config/batch.txt

Outputs can then be found in the output folder.

  2. Run OCR to extract captions with the Python script ocr.py. The script takes as arguments the top-level directory containing video files (--video_dir), the directory containing JSON annotation files (--anno_dir, default: annotations), and a directory to store outputs (--output_dir); outputs preserve the directory structure of the video files. The optional --hand flag indicates that you are processing hand-corrected files, which include frame_start and frame_end times.
# Extract captions from the hand-cleaned videos
python ocr.py --video_dir output/batch_hand --hand --anno_dir annotations --output_dir output/batch_hand

# Extract captions from the automatically cleaned videos
python ocr.py --video_dir output/batch --anno_dir annotations --output_dir output/batch
  3. [Optional] Extract audio segments. After steps 1 and 2, you will have fully downloaded and constructed the data. The script conversion.py can then be used to convert the original video files into audio segments; this will produce many small files. The script takes as arguments the directory containing the video files (--base_dir), the directory containing the JSON caption files (--anno_dir), and a directory to write outputs (--output_dir). It assumes that --base_dir and --anno_dir share the same directory structure, which is replicated in --output_dir.
# Clip hand-cleaned videos into audio segments, one corresponding to each caption
python conversion.py --base_dir output/batch_hand --anno_dir output/batch_hand --output_dir segment_test
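A parallel invocation for the automatically cleaned set should follow the same pattern, assuming output/batch mirrors the layout of output/batch_hand (the output directory name below is illustrative):

```shell
# Clip the automatically cleaned videos into audio segments,
# one file per caption, mirroring the hand-cleaned example above
python conversion.py --base_dir output/batch --anno_dir output/batch --output_dir segments_batch
```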
  4. [Optional] The notebook test_data.ipynb offers some functions to aid in exploring and validating the downloaded data.

Downloading videos - custom data

  1. New videos can be downloaded using the same yt-dlp.sh script. Add URLs to videos or public playlists to a config file in the same format as config/batch.txt and pass the new file to yt-dlp.sh. For example:
./yt-dlp.sh config/new_batch.txt

  2. Extract on-screen captions with OCR: TODO
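Although this step is still marked TODO, the ocr.py invocation from the base-dataset instructions should in principle carry over, assuming annotation files exist for the new videos (the paths below are illustrative, not part of the repository):

```shell
# Sketch: run OCR on a custom batch, mirroring the base-dataset usage
python ocr.py --video_dir output/new_batch --anno_dir annotations --output_dir output/new_batch
```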

How to pass YouTube cookies to yt-dlp:

  1. Download the Get cookies.txt Clean Chrome extension.
  2. Log in to the YouTube account you want to use to download.
  3. Navigate to any page on YouTube.
  4. Click the Get cookies.txt extension icon in your toolbar; a dialog will appear showing the cookies for the current page.
  5. Click "Export As".
  6. Navigate to your police-data/config directory.
  7. Save cookies as cookies.txt.
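With the cookie file in place, yt-dlp can read it via its standard --cookies flag. If yt-dlp.sh does not already point at the file, a direct invocation would look like this (the URL is a placeholder):

```shell
# Pass the exported cookie file directly to yt-dlp
yt-dlp --cookies config/cookies.txt "https://www.youtube.com/watch?v=VIDEO_ID"
```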

Additional resources for using cookies with yt-dlp:

About

Code for Constructing Datasets From Public Police Body Camera Footage (ICASSP 2025)
