Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?

TLDR

  • Authors: Yuwei Bao, Keunwoo Peter Yu, Yichi Zhang, Shane Storks, Itamar Bar-Yossef, Alexander De La Iglesia, Megan Su, Xiao Lin Zheng, Joyce Chai
  • Organization: University of Michigan, Computer Science and Engineering
  • Published in: Findings of EMNLP 2023, Singapore
  • Links: arXiv, GitHub, Dataset

WTaG: Watch, Talk and Guide Dataset

WTaG is a human-human task guidance dataset with natural language communication, mistakes and corrections, and synchronized egocentric videos with transcriptions. It is richly annotated with recipe steps, user and instructor dialog intentions, and mistakes.

Dataset Stats

  • Total video length: 10 hours
  • #Videos: 56
  • #Recipes: 3
  • #Users: 17
  • #Instructors: 3
  • #Total utterances: 4233
  • Median video length: 10 mins

Dataset Features and Annotations

  • Synchronized user and instructor egocentric video + audio transcriptions
  • Annotation: Step detection
  • Annotation: Human-filtered audio transcriptions
  • Annotation: User utterance dialog intention
  • Annotation: User mistakes
  • Annotation: Instructor utterance dialog intention (+ detailed)
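
The sketch below shows one way these per-video annotations could be iterated once the dataset is downloaded. The directory layout and field names (utterances, speaker, mistake) are illustrative assumptions, not the dataset's documented schema; adapt them to the files you actually receive.

import json
from pathlib import Path

def load_annotations(root):
    # Assumed layout: one annotations.json per recording directory (hypothetical).
    for path in sorted(Path(root).glob("*/annotations.json")):
        with open(path) as f:
            yield path.parent.name, json.load(f)

for video_id, record in load_annotations("WTaG"):
    utterances = record.get("utterances", [])  # field names are assumptions
    n_instructor = sum(1 for u in utterances if u.get("speaker") == "instructor")
    n_mistakes = sum(1 for u in utterances if u.get("mistake"))
    print(video_id, len(utterances), n_instructor, n_mistakes)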


Abstract

Despite tremendous advances in AI, it remains a significant challenge to develop interactive task guidance systems that can offer situated, personalized guidance and assist humans in various tasks. These systems need to have a sophisticated understanding of the user as well as the environment, and make timely accurate decisions on when and what to say. To address this issue, we created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG) based on natural interaction between a human user and a human instructor. We further proposed two tasks: User and Environment Understanding, and Instructor Decision Making. We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance. Our quantitative, qualitative, and human evaluation results show that these models can demonstrate fair performances in some cases with no task-specific training, but a fast and reliable adaptation remains a significant challenge. Our benchmark and baselines will provide a stepping stone for future work on situated task guidance.


Tasks

User and Environment Understanding

  1. User Intent Prediction: Dialog intent of the user's last utterance, if any (options).
  2. Step Detection: The current recipe step (options).
  3. Mistake Existence and Mistake Type: Did the user make a mistake at time t (yes/no)? If so, what type of mistake (options)? A prompting sketch for these questions follows below.
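
Each of these subtasks is answered over a fixed set of options. Below is a minimal sketch of how the user-intent question could be posed to a text-only language-model baseline; the label list, prompt wording, and the build_intent_prompt helper are illustrative assumptions, not the repository's actual prompts.

# Hypothetical label set for user dialog intent (for illustration only).
USER_INTENT_OPTIONS = ["Question", "Answer", "Confirmation", "Self Description", "Other"]

def build_intent_prompt(scene_description, dialog_history):
    # Assemble a multiple-choice query from the current scene and dialog context.
    options = "\n".join(f"({i}) {label}" for i, label in enumerate(USER_INTENT_OPTIONS))
    return (
        "You are watching a user cook with help from an instructor.\n"
        f"Scene: {scene_description}\n"
        f"Dialog so far:\n{dialog_history}\n"
        "What is the dialog intent of the user's last utterance? "
        f"Answer with one option number.\n{options}"
    )

print(build_intent_prompt("The user is holding a mixing bowl at the counter.",
                          "User: Do I add the flour now?"))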

Instructor Decision Making

  1. When to Talk: Should the instructor talk at time t (yes/no)?
  2. Instructor Intent: If yes to 1, the instructor's dialog intention (options).
  3. Instruction Type: If yes to 1 and the intent in 2 is "Instruction", what type (options)?
  4. Guidance Generation: If yes to 1, what to say in natural language (a sketch of this cascade follows below).
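
The four subtasks form a cascade: intent, type, and guidance are only predicted once the model decides the instructor should speak at time t. Below is a minimal sketch of that control flow, where ask stands in for any foundation-model call returning a short string answer; it is a placeholder, not the repository's API.

def instructor_decision(ask, context, t):
    # 1. When to Talk: stop here if the answer is "no".
    if ask(f"Should the instructor talk at time {t}? Answer yes or no.", context) != "yes":
        return None
    # 2. Instructor Intent, chosen from a fixed set of options.
    intent = ask("What is the instructor's dialog intention?", context)
    # 3. Instruction Type: only asked when the intent is "Instruction".
    instruction_type = None
    if intent == "Instruction":
        instruction_type = ask("What type of instruction is it?", context)
    # 4. Guidance generation in natural language.
    guidance = ask("What should the instructor say next?", context)
    return {"intent": intent, "type": instruction_type, "guidance": guidance}

# Trivial stand-in for a model call, just to show the call shape:
print(instructor_decision(lambda q, c: "yes" if "yes or no" in q else "Instruction", {}, t=120))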

Dataset

  • Regular download: WTaG
  • Audio-video download: WTaG; please sign the additional licensing agreement HERE

Setup

Running

To evaluate the three baseline methods (lanOnly, blip2, objDet) on WTaG:

python src/pipeline.py [-h] --MTYPE MTYPE --in_path IN_PATH --video_list VIDEO_LIST --out_path
                   OUT_PATH

optional arguments:
  -h, --help            show this help message and exit
  --MTYPE MTYPE, -t MTYPE
                        Vision to Language Method Type: ["lanOnly", "blip2", "objDet"]
  --in_path IN_PATH, -i IN_PATH
                        Video input path
  --video_list VIDEO_LIST, -l VIDEO_LIST
                        Path to the file with a list of videos to be processed
  --out_path OUT_PATH, -o OUT_PATH
                        Guidance output path
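
For example, a run of the blip2 baseline might look like the following; the input, list, and output paths are placeholders for your local setup:

python src/pipeline.py -t blip2 -i /path/to/WTaG/videos -l video_list.txt -o outputs/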

Citation

@inproceedings{bao-etal-2023-foundation,
    title = "Can Foundation Models Watch, Talk and Guide You Step by Step to Make a Cake?",
    author = "Bao, Yuwei  and
      Yu, Keunwoo  and
      Zhang, Yichi  and
      Storks, Shane  and
      Bar-Yossef, Itamar  and
      de la Iglesia, Alex  and
      Su, Megan  and
      Zheng, Xiao  and
      Chai, Joyce",
    editor = "Bouamor, Houda  and
      Pino, Juan  and
      Bali, Kalika",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2023",
    month = dec,
    year = "2023",
    address = "Singapore",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-emnlp.824",
    doi = "10.18653/v1/2023.findings-emnlp.824",
    pages = "12325--12341",
}

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change. For further questions, please contact yuweibao@umich.edu.
