How Video Meetings Change Your Expression
Sumit Sarin, Utkarsh Mall, Purva Tendulkar, Carl Vondrick
Columbia University
Create a new environment, and install the required packages.
conda create --name facet python=3.10.13
conda activate facet
pip install -r requirements.txt
The datasets are available at the following links:
The data contains the following column fields:
- vid_id: unique identifier for the video.
- fps: frame rate.
- vid_url: URL for the video.
- view: (only for ZoomIn dataset) whether the video is F2F (off) or VC (on).
- president: (only for Presidents dataset) whether the video is of Trump (trump) or Obama (obama).
- participants: number of participants in the conversation.
- file_name: the name of the video file to be used for saving.
Place the videos in a <directory_with_mp4_videos>, with each file named according to the file_name column of the CSV file.
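As a sanity check before preprocessing, the metadata CSV can be matched against the downloaded videos. Below is a minimal sketch; the column name file_name follows the list above, but the CSV path and video directory are placeholders you would substitute:

```python
import csv
import os

def find_missing_videos(csv_path, video_dir):
    """Return the file_name entries from the metadata CSV that have no
    corresponding video file in video_dir."""
    missing = []
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            if not os.path.exists(os.path.join(video_dir, row["file_name"])):
                missing.append(row["file_name"])
    return missing
```

Running this before vid_to_mesh.py avoids silent gaps where a listed video was never downloaded.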
In the following sections, we only give instructions for the ZoomIn dataset, but it is straightforward to change configurations accordingly and use the Presidents dataset.
- Download the landmarker model made available by MediaPipe and save it as facet/scripts/model/face_landmarker_v2_with_blendshapes.task.
- cd scripts
- python vid_to_mesh.py <directory_with_mp4_videos> (this will take some time to run; you can adjust num_threads accordingly).
- python mesh_to_numpy.py <directory_with_mp4_videos>
- cd ..
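Conceptually, the mesh-to-NumPy step stacks the per-frame landmark lists into a single array per video. A hypothetical sketch (the function name is illustrative, not taken from mesh_to_numpy.py; MediaPipe's face landmarker reports each landmark as normalized x, y, z coordinates):

```python
import numpy as np

def landmarks_to_array(frames):
    """Stack per-frame landmarks into a (num_frames, num_landmarks, 3) array.

    `frames` is a list of frames; each frame is a list of (x, y, z) tuples,
    as produced by a face-landmark detector.
    """
    return np.stack([np.asarray(f, dtype=np.float32) for f in frames])
```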
python train-facet.py --wandb --variable_alpha --variable_changepoints --chunks=2
(Wandb can be disabled by omitting the --wandb flag.)
Note: The above script uses the beta-VAE we trained on the ZoomIn dataset. To train a new model from scratch:
python train-vae.py
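For context, the beta-VAE objective is the standard VAE loss with the KL term weighted by beta. A minimal NumPy sketch of the loss for a diagonal-Gaussian posterior (function and argument names are illustrative, not taken from train-vae.py):

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Mean-squared reconstruction error plus a beta-weighted
    KL(q(z|x) || N(0, I)) term for a diagonal-Gaussian posterior."""
    recon = np.mean((x - x_recon) ** 2)
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar))
    return recon + beta * kl
```

With beta = 1 this reduces to the ordinary VAE objective; larger beta values push the posterior toward the prior and encourage disentangled latents.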
We use imaginaire's vid2vid off the shelf. Please follow the setup instructions in that repository and use it directly.
For reference, we provide the script used for building the training/validation data for vid2vid. This essentially involves putting the RGB frames of an F2F video in the folder mesh_images and the corresponding MediaPipe keypoints in the folder mesh. To run the script, set the variable vid_directory and then run:
cd scripts
python vid2vid_dataset_create.py
Once that is done, update the config file we provide for reference with the correct paths for the training/validation data (just use any F2F video from the output of the previous command in dataset/zoomin/dezoom), then run vid2vid as explained in its repository, passing the updated config file path as input:
python -m torch.distributed.launch --nproc_per_node=8 train.py --config <facet-base-path>/facet/vid2vid_config/ampO1.yaml --logdir=./logs
The above process trains the vid2vid model. To actually de-zoom a video, we need to test the model on keypoints that have been de-zoomed by FacET (obtained via train-facet.py). For a VC video, this involves:
- Getting its keypoints through MediaPipe (refer to vid2vid_dataset_create.py)
- Obtaining its latents using the trained beta-VAE encoder (refer to train-facet.py)
- Passing them through FacET (refer to train-facet.py)
- Obtaining keypoints back through the beta-VAE decoder (refer to train-facet.py)
- Saving the keypoints (refer to vid2vid_dataset_create.py)
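The steps above can be sketched as one pipeline. The callables here are placeholders standing in for the corresponding pieces of vid2vid_dataset_create.py and train-facet.py, so this only illustrates the order of operations:

```python
def dezoom_keypoints(vc_frames, extract_keypoints, vae_encode,
                     facet_translate, vae_decode, save_keypoints):
    """Run a VC video's frames through the FacET de-zooming chain.

    Each callable stands in for one step of the pipeline: keypoint
    extraction, beta-VAE encoding, FacET translation, beta-VAE decoding,
    and saving the result for vid2vid testing.
    """
    keypoints = extract_keypoints(vc_frames)           # MediaPipe keypoints
    latents = vae_encode(keypoints)                    # beta-VAE encoder
    dezoomed_latents = facet_translate(latents)        # FacET: VC -> F2F
    dezoomed_keypoints = vae_decode(dezoomed_latents)  # beta-VAE decoder
    return save_keypoints(dezoomed_keypoints)          # input for inference
```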
The saved keypoints path can then be updated in the config file, and vid2vid can be run in test mode by executing the following in the imaginaire repository:
python -m torch.distributed.launch --nproc_per_node=8 inference.py --config <facet-base-path>/facet/vid2vid_config/ampO1.yaml --logdir=./logs
Note that we train vid2vid on a single pair of an RGB video and its corresponding keypoints, and then test it on keypoints from other videos. This may require calibration, because the distance of the keypoints from the camera can change. Ideally, the keypoints should be at roughly the same distance from the camera in both the train and test videos; this can easily be adjusted by cropping the frame accordingly.
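One simple way to do that cropping is a center crop sized so the face occupies a similar fraction of the frame in the train and test videos. A hypothetical NumPy sketch (the crop sizes would be chosen by eye or from the keypoint bounding boxes):

```python
import numpy as np

def center_crop(frame, crop_h, crop_w):
    """Center-crop an (H, W, ...) frame; resizing the crop back to the
    original resolution makes the face appear closer to the camera."""
    h, w = frame.shape[:2]
    top = (h - crop_h) // 2
    left = (w - crop_w) // 2
    return frame[top:top + crop_h, left:left + crop_w]
```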
This research is based on work partially supported by the DARPA CCU program under contract HR001122C0034 and the National Science Foundation AI Institute for Artificial and Natural Intelligence (ARNI). PT is supported by the Apple PhD fellowship.
@misc{sarin2024videomeetingschangeexpression,
title={How Video Meetings Change Your Expression},
author={Sumit Sarin and Utkarsh Mall and Purva Tendulkar and Carl Vondrick},
year={2024},
eprint={2406.00955},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2406.00955},
}