VidHarm: A dataset for detection of harmful content in video

VidHarm is the first expert-labeled, clip-based dataset for classification of harmful content in video. We provide 3589 clips with expert annotations, drawn from over 300 different trailers. The dataset covers over 37 languages and is the most diverse to date. We provide the data and trained models openly for research purposes.

Below we provide some representative examples of clips in our dataset.

f27ed4acb80c.mp4

2031e20f78e7.mp4

7c07587540f8.mp4

Info

Splits for the dataset are available in this repo. Each split is a list of dictionary entries saved in JSON format. Each entry has the following keys:

filename: The unique id of the clip
label: A single label or a list of labels from the label set ("BT","7","11","15")
total_frames: The total number of frames in the clip (this number is adapted to the preprocessed dataset, see info below)
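
As a minimal sketch of reading a split, the snippet below loads one JSON file and normalizes the `label` field to a list, since an entry may hold a single label or several. The helper names `load_split` and `labels_of` are hypothetical and not part of this repo, and converting labels to strings is an assumption, since the stored type is not documented here.

```python
import json

# Label set as listed in the key description above.
LABEL_SET = ("BT", "7", "11", "15")


def load_split(path):
    """Load one split file: a JSON list of dictionary entries."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def labels_of(entry):
    """Return the entry's labels as a list of strings.

    Handles both a single label and a list of labels. Casting to str is
    an assumption, in case labels are stored as numbers in the file.
    """
    label = entry["label"]
    labels = label if isinstance(label, list) else [label]
    return [str(item) for item in labels]
```

For example, `load_split("vidharm_train.json")` would return the training entries, assuming the file sits in the working directory.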

Download links

Raw Clips

The raw clips are available at this url. The password is "vidharm". The clips are mostly encoded in H.264. The folder structure is:

├── clips
│   ├── ID_FIRST
│   │   ├── ID_FIRST.mp4
│   ...
│   ├── ID_LAST
│   │   ├── ID_LAST.mp4
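
Given the layout above, the path to a raw clip can be derived from its id. This is only a sketch: `raw_clip_path` is a hypothetical helper, `root` is whichever directory the archive was extracted into, and using `Path(...).stem` is a precaution in case the `filename` field carries an `.mp4` extension, which is not confirmed here.

```python
from pathlib import Path


def raw_clip_path(root, filename):
    """Build <root>/clips/<ID>/<ID>.mp4 for a clip id.

    The stem is taken so that both "abc" and "abc.mp4" map to the same id
    (an assumption about what the split files may contain).
    """
    clip_id = Path(filename).stem
    return Path(root) / "clips" / clip_id / f"{clip_id}.mp4"
```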

Preprocessed Images and Audio

Since online video learning can be very CPU intensive, we will also provide a preprocessed dataset containing each clip split into individual frames (JPEG encoded), plus a log mel spectrogram of the audio. For access to these, please send an email to johan.edstedt@liu.se

New: Unfortunately, the preprocessed dataset might be lost. We have uploaded scripts to preprocess the data:

python video_to_im_audio.py # by default just takes the first 5 clips (this can take a lot of space if used for all, so make sure you have space)
python audio_to_mel_log_spectrogram.py # uses librosa

├── preprocced_clips
│   ├── ID_FIRST
│   │   ├── ID_FIRST_1.jpg
│   │   ...
│   │   ├── ID_FIRST_N.jpg
│   │   ├── ID_FIRST.npy
│   │   ├── ID_FIRST.wav
│   ...
│   ├── ID_LAST
│   │   ├── ID_LAST_1.jpg
│   │   ...
│   │   ├── ID_LAST_N.jpg
│   │   ├── ID_LAST.npy
│   │   ├── ID_LAST.wav
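
The preprocessed layout above can be turned into per-clip file paths as in the sketch below. `preprocessed_paths` is a hypothetical helper, not one of the repo's scripts. It assumes frames are numbered from 1 to `total_frames` (as the `_1` to `_N` naming suggests) and that the `.npy` file holds the log mel spectrogram; neither is confirmed here.

```python
from pathlib import Path


def preprocessed_paths(root, clip_id, total_frames):
    """Return the expected frame, spectrogram and audio paths for one clip.

    Assumes 1-based frame numbering up to total_frames, and that the
    .npy file stores the log mel spectrogram of the .wav audio.
    """
    clip_dir = Path(root) / "preprocced_clips" / clip_id
    frames = [clip_dir / f"{clip_id}_{i}.jpg" for i in range(1, total_frames + 1)]
    return {
        "frames": frames,
        "spectrogram": clip_dir / f"{clip_id}.npy",
        "audio": clip_dir / f"{clip_id}.wav",
    }
```

The `total_frames` value from a split entry can be passed in directly, since that number is adapted to the preprocessed dataset.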

Reproducing Results

Go to https://github.com/Parskatt/is-this-harmful for code and evaluation.
