<div align="center">
<h1>HiERO: understanding the hierarchy of human behavior
enhances reasoning on egocentric videos</h1>

<h4><b>ICCV 2025</b></h4>

[Simone Alberto Peirone](https://scholar.google.com/citations?user=K0efPssAAAAJ) • [Francesca Pistilli](https://scholar.google.com/citations?user=7MJdvzYAAAAJ) • [Giuseppe Averta](https://scholar.google.com/citations?user=i4rm0tYAAAAJ)

<a href='https://arxiv.org/abs/2505.12911' style="margin: 10px"><img src='https://img.shields.io/badge/Paper-Arxiv:2505.12911-red'></a>&nbsp;&nbsp;&nbsp;
<a href='https://sapeirone.github.io/HiERO/' style="margin: 10px"><img src='https://img.shields.io/badge/Project-Page-Green'></a>&nbsp;&nbsp;&nbsp;
<a target="_blank" href="https://colab.research.google.com/github/sapeirone/HiERO/blob/main/quickstart.ipynb" style="margin: 10px">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<b>Abstract:</b>

Human activities are particularly complex and variable, which makes it challenging for deep learning models to reason about them. However, we note that such variability does have an underlying structure, composed of a hierarchy of patterns of related actions. We argue that such structure can emerge naturally from unscripted videos of human activities, and can be leveraged to better reason about their content. We present HiERO, a weakly-supervised method to enrich video segment features with the corresponding hierarchical activity threads. By aligning video clips with their narrated descriptions, HiERO infers contextual, semantic and temporal reasoning with a hierarchical architecture. We prove the potential of our enriched features with multiple video-text alignment benchmarks (EgoMCQ, EgoNLQ) with minimal additional training, and in zero-shot for procedure learning tasks (EgoProceL and Ego4D Goal-Step). Notably, HiERO achieves state-of-the-art performance in all the benchmarks, and for procedure learning tasks it outperforms fully-supervised methods by a large margin (+12.5% F1 on EgoProceL) in zero-shot. Our results prove the relevance of using knowledge of the hierarchy of human activities for multiple reasoning tasks in egocentric vision.
</div>

This notebook is designed for quick experimentation with HiERO in a Google Colab environment (using free GPU resources).

⚠️ To keep things fast, this notebook trains a HiERO variant on EgoVLP FP16 features for just one epoch, so expect results that differ from those reported in the paper. Please refer to the README.md file in the repository for full reproducibility.
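
Before running the cells below, it may help to confirm that the runtime actually has a GPU attached. The following is a minimal sketch and not part of the original notebook; it assumes the NVIDIA driver exposes the `nvidia-smi` tool on the `PATH`, as Colab GPU runtimes normally do. The helper name `gpu_available` is hypothetical.

```python
import shutil
import subprocess


def gpu_available() -> bool:
    """Return True if `nvidia-smi` is on PATH and exits successfully.

    Assumption: a working `nvidia-smi` implies a usable NVIDIA GPU.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        result = subprocess.run([exe], capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if gpu_available():
    print("GPU detected.")
else:
    print("No GPU detected: in Colab, select a GPU runtime before training.")
```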

In [None]:
!git clone https://github.com/sapeirone/HiERO.git

## Environment setup

First, we need to download all the required annotations, pre-extracted features, and pre-trained models.

In [None]:
%%bash

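# Clone with submodules, unless the repository was already cloned by an earlier cell.
[ -d HiERO ] || git clone --recursive https://github.com/sapeirone/HiERO.git
# Make sure submodules are present even if the earlier clone was not recursive.
git -C HiERO submodule update --init --recursive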

cd HiERO

mkdir -p data/ego4d/raw/annotations data/ego4d/raw/features/egovlp pretrained checkpoints

echo "Downloading egoclip and egomcq annotations..."
gdown --fuzzy "https://drive.google.com/file/d/1-aaDu_Gi-Y2sQI_2rsI2D1zvQBJnHpXl/view?usp=sharing"
gdown --fuzzy "https://drive.google.com/file/d/1-5iRYf4BCHmj4MYQYFRMY4bhsWJUN3rW/view?usp=sharing"
mv egoclip.csv data/ego4d/raw/annotations/
mv egomcq.json data/ego4d/raw/annotations/

echo "Downloading egovlp features..."
gdown --fuzzy "https://drive.google.com/file/d/1pf_WXpZfxr4czpvWFJPzux9aWUSGe9Ct/view?usp=sharing"
echo "Unzipping egovlp features..."
tar -xf egovlp.tar.gz -C data/ego4d/raw/features/egovlp/ && rm egovlp.tar.gz

echo "Downloading pre-trained models..."
gdown --fuzzy "https://drive.google.com/file/d/1Cv_DXIvdEfN27E0KZ95mh-cYAUm4pcWV/view?usp=sharing"
mv egovlp_text.pth pretrained/
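
Since Google Drive downloads can fail silently (for example, when a quota is exceeded), a quick sanity check on the downloaded files may save a failed training run later. The sketch below is not part of the original notebook: it only checks the paths that the download cell above creates, and the helper name `missing_paths` is hypothetical. A directory counts as present only if it is non-empty, because `mkdir -p` creates the features folder before anything is extracted into it.

```python
from pathlib import Path

# Paths produced by the download cell, relative to the repository root.
EXPECTED = [
    "data/ego4d/raw/annotations/egoclip.csv",
    "data/ego4d/raw/annotations/egomcq.json",
    "data/ego4d/raw/features/egovlp",
    "pretrained/egovlp_text.pth",
]


def _present(path: Path) -> bool:
    """A file must exist; a directory must exist and contain something."""
    if path.is_dir():
        return any(path.iterdir())
    return path.is_file()


def missing_paths(root="HiERO", expected=EXPECTED):
    """Return the expected paths that are absent (or empty) under `root`."""
    root = Path(root)
    return [p for p in expected if not _present(root / p)]


missing = missing_paths()
if missing:
    print("Missing or empty:", missing)
else:
    print("All downloads are in place.")
```

If anything is reported missing, re-running the corresponding `gdown` command is usually the first thing to try.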

Finally, let's install all the dependencies.

In [None]:
%%bash

cd HiERO

echo "Installing the dependencies..."
pip install -r requirements.txt -f https://data.pyg.org/whl/torch-2.4.0+cu124.html --extra-index-url https://download.pytorch.org/whl/
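
After installation, one way to check that the key packages resolved is to look them up without importing them. This is a rough sketch, not part of the original notebook. The module names `torch` and `torch_geometric` are assumptions inferred from the wheel indexes in the `pip` command above (the contents of `requirements.txt` are not shown here), and `missing_modules` is a hypothetical helper.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed module names; adjust to match requirements.txt.
ASSUMED_MODULES = ["torch", "torch_geometric"]

not_found = missing_modules(ASSUMED_MODULES)
if not_found:
    print("Not found:", not_found)
else:
    print("All assumed modules were found.")
```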

## Training on EgoCLIP

In [None]:
!cd HiERO/ && HYDRA_FULL_ERROR=1 python train.py --config-name=egovlp save_to=checkpoints num_epochs=1 lr_warmup=False
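
The training command passes Hydra-style `key=value` overrides after the config name. To vary these settings programmatically, the command can be assembled from a dictionary, as in the sketch below. This is illustrative only: `build_train_command` is a hypothetical helper, and the only overrides known to exist are the three used above (`save_to`, `num_epochs`, `lr_warmup`).

```python
import shlex


def build_train_command(config_name, overrides):
    """Build the argument list for train.py from a dict of overrides."""
    cmd = ["python", "train.py", f"--config-name={config_name}"]
    cmd += [f"{key}={value}" for key, value in overrides.items()]
    return cmd


cmd = build_train_command(
    "egovlp",
    {"save_to": "checkpoints", "num_epochs": 1, "lr_warmup": False},
)
print(shlex.join(cmd))
# python train.py --config-name=egovlp save_to=checkpoints num_epochs=1 lr_warmup=False
```

The resulting list can be passed to `subprocess.run(cmd, cwd="HiERO", env={**os.environ, "HYDRA_FULL_ERROR": "1"})` to reproduce the cell above from Python.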