This repository contains code and metadata of How2 dataset


The How-2 Dataset

How-2 is a multimodal dataset of around 80,000 instructional videos (about 2,000 hours) with associated English subtitles and summaries. About 300 hours have also been translated into Portuguese through crowdsourcing and were used during the JSALT 2018 Workshop. The How-2 training data is split into a 300h set and a 2000h set, and only the 300h set supports Portuguese machine translation. The 2000h set can be used for other tasks such as speech recognition, speech summarization, text summarization, and their multimodal extensions.

To make our results replicable and to encourage further research, we have released the following packages of How-2 data:

  • ASR (300h): This release contains (audio) fbank+pitch features in Kaldi scp/ark format for 300 hours
  • E2E Summarization + ASR (2000h): This release contains (audio_2000) fbank+pitch features in Kaldi scp/ark format, along with transcript and abstractive summaries for 2000 hours
  • Visual features: This release contains (video) Action features in numpy arrays for MT and ASR
  • English Transcript: This release contains (en) English text for How2
  • Portuguese Machine Translations: This release contains (pt) crowdsourced Portuguese text
  • English Abstractive Summaries: This release contains (en_sum) Summarization text
  • Visual features for Summarization: This release contains (video_sum) Summarization action features in numpy arrays
  • Object Grounding Features: This release contains (ground) Object grounding test and development sets
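
The audio packages above are described as Kaldi scp/ark files. As a rough illustration of that layout, the sketch below parses the text index (scp) into utterance IDs, archive paths, and byte offsets. It assumes the common `utterance_id path.ark:offset` line form; the utterance IDs and paths in the example are hypothetical and are not taken from the How-2 release. Loading the feature matrices themselves requires a Kaldi-compatible reader, which this sketch does not cover.

```python
from typing import Dict, Optional, Tuple


def parse_scp_line(line: str) -> Tuple[str, str, Optional[int]]:
    """Split one scp line into (utterance_id, ark_path, byte_offset)."""
    utt_id, rxspecifier = line.strip().split(maxsplit=1)
    # The byte offset, when present, follows the last colon and is numeric.
    path, sep, offset = rxspecifier.rpartition(":")
    if sep and offset.isdigit():
        return utt_id, path, int(offset)
    # No numeric offset: treat the whole specifier as the path.
    return utt_id, rxspecifier, None


def read_scp(text: str) -> Dict[str, Tuple[str, Optional[int]]]:
    """Build an index from utterance ID to (ark_path, byte_offset)."""
    index = {}
    for line in text.splitlines():
        if line.strip():  # skip blank lines
            utt_id, path, offset = parse_scp_line(line)
            index[utt_id] = (path, offset)
    return index


# Hypothetical scp content, for illustration only.
example = """\
utt_0001 /data/feats.1.ark:13
utt_0002 /data/feats.1.ark:50412
"""
index = read_scp(example)
```

An index like this is enough to check which utterances a package covers or to shard a feature set before handing the entries to a real ark reader.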

Please fill out the Data Request form.

Please cite the following paper in all academic work that uses this dataset:

  title = {{How2:} A Large-scale Dataset For Multimodal Language Understanding},
  author = {Sanabria, Ramon and Caglayan, Ozan and Palaskar, Shruti and Elliott, Desmond and Barrault, Lo\"ic and Specia, Lucia and Metze, Florian},
  booktitle = {Proceedings of the Workshop on Visually Grounded Interaction and Language (ViGIL)},
  year = {2018},
  url = {}

More papers can be found in the bibliography.

To subscribe to the How2 mailing list click here.

Speech Summarization

How2 has been used for end-to-end speech summarization, and we are releasing 43-dimensional fbank+pitch features to support this. See our ESPnet recipe and paper. Please consider citing our paper on speech summarization if you use this data release.

author={Sharma, Roshan and Palaskar, Shruti and Black, Alan W and Metze, Florian},
booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
title={End-to-End Speech Summarization Using Restricted Self-Attention},

How2 Get Help

Please use the issue tracker to ask questions and get clarifications.

How2 License

License information for every video can be found in the .info.json file that is downloaded alongside each video. At the time of release, all videos included in this dataset were made available by the original content providers under the standard YouTube License.
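
As a minimal sketch of checking that metadata, the snippet below reads a license value from a single .info.json file. It assumes the file is a JSON object with a top-level `license` key, which is how youtube-dl style metadata is usually laid out; that key name is an assumption and is not confirmed by this repository. The helper name `read_license` is hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def read_license(info_path: Path) -> Optional[str]:
    """Return the license string from an .info.json file, or None if absent."""
    with info_path.open(encoding="utf-8") as handle:
        info = json.load(handle)
    # "license" is an assumed key name; adjust it if the files differ.
    return info.get("license")
```

Calling `read_license(Path("some_video.info.json"))` on each downloaded file would give a quick per-video license check before redistributing any derived data.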

Unless noted otherwise, we provide the contents of this repository under the Creative Commons BY-SA 4.0 (Attribution-ShareAlike) License (for data-like content) and/or the BSD-2-Clause License (for software-type content).