Workshop, tutorial, oral, and poster notes from CVPR 2022
Wenhao(Reself) Chai
Undergraduate, UIUC
- Workshops
- Machine Learning with Synthetic Data (SyntML) link
- International Challenge on Activity Recognition (ActivityNet) link
- 2nd Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings link
- Workshop on Attention and Transformers in Vision link
- 5th MUltimodal Learning and Applications Workshop (MULA) link
- 7th BMTT Workshop on Benchmarking Multi-Target Tracking: How Far Can Synthetic Data Take us? link
- L3D-IVU: Workshop on Learning with Limited Labelled Data for Image and Video Understanding link
- Tutorials
- Orals
- Segmentation, Grouping and Shape Analysis
- 1. Semantic-Aware Domain Generalized Segmentation link
- 2. Pointly-Supervised Instance Segmentation link
- 3. Adaptive Early-Learning Correction for Segmentation From Noisy Annotations link
- 4. Unsupervised Hierarchical Semantic Segmentation With Multiview Cosegmentation and Clustering Transformers link
- Video Analysis & Understanding
- 3D From Single Images
- Transfer / Low-Shot / Long-Tail Learning
- 8. OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization link
- 9. Robust Fine-Tuning of Zero-Shot Models link
- 10. Learning Distinctive Margin Toward Active Domain Adaptation link
- 11. DINE: Domain Adaptation From Single and Multiple Black-Box Predictors link
- 12. Source-Free Object Detection by Learning To Overlook Domain Style link
- 13. Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization link
- 14. Causality Inspired Representation Learning for Domain Generalization link
- 15. Learning What Not To Segment: A New Perspective on Few-Shot Segmentation link
- 16. Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation link
- 17. ADeLA: Automatic Dense Labeling With Attention for Viewpoint Shift in Semantic Segmentation link
- Image & Video Synthesis and Generation
- Deep Learning Architectures & Techniques
- Human Pose Estimation & Tracking, Localization, and Object Pose Estimation
- Segmentation, Grouping and Shape Analysis
- Posters
- Segmentation, Grouping and Shape Analysis
- 1. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels link
- 2. Deep Hierarchical Semantic Segmentation link
- 3. Amodal Segmentation Through Out-of-Task and Out-of-Distribution Generalization With a Bayesian Model link
- 4. SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization link
- 5. Accelerating Video Object Segmentation With Compressed Video link
- 6. High Quality Segmentation for Ultra High-Resolution Images link
- 7. Pin the Memory: Learning To Generalize Semantic Segmentation link
- 8. Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity link
- 9. Weakly Supervised Semantic Segmentation Using Out-of-Distribution Data link
- 10. Multimodal Material Segmentation link
- 11. Semi-Supervised Learning of Semantic Correspondence With Pseudo-Labels link
- Machine Learning
- 12. A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty link
- 13. How Much More Data Do I Need? Estimating Requirements for Downstream Tasks link
- 14. Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase link
- 15. Out-of-distribution Generalization with Causal Invariant Transformations link
- Deep Learning Architectures & Techniques
- 16. Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation link
- 17. Revisiting Weakly Supervised Pre-Training of Visual Perception Models link
- 18. Failure Modes of Domain Generalization Algorithms link
- 19. Learning Part Segmentation Through Unsupervised Domain Adaptation From Synthetic Vehicles link
- Vision Applications & Systems
- Recognition: Detection, Categorization, Retrieval
- 3D From Single Images
- Low-Level Vision
- Behavior Analysis
- Vision & Language
- 27. Video-Text Representation Learning via Differentiable Weak Temporal Alignment link
- 28. End-to-End Referring Video Object Segmentation With Multimodal Transformers link
- 29. Are Multimodal Transformers Robust to Missing Modality? link
- 30. Robust Cross-Modal Representation Learning With Progressive Self-Distillation link
- 31. Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification link
- Video Analysis & Understanding
- 32. MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing link
- 33. Coarse-To-Fine Feature Mining for Video Semantic Segmentation link
- 34. The DEVIL Is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting link
- 35. YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset link
- 36. Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark link
- Transfer / Low-Shot / Long-Tail Learning
- Pose Estimation & Tracking
- 39. MetaPose: Fast 3D Pose From Multiple Views Without 3D Supervision link
- 40. Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation link
- 41. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking link
- 42. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion link
- 43. DiffPoseNet: Direct Differentiable Camera Pose Estimation link
- Recognition: Detection, Categorization, Retrieval
- Self-, Semi-, Meta-, & Unsupervised Learning
- 46. DASO: Distribution-Aware Semantics-Oriented Pseudo-Label for Imbalanced Semi-Supervised Learning link
- 47. Unbiased Teacher v2: Semi-Supervised Object Detection for Anchor-Free and Anchor-Based Detectors link
- 48. Semi-Supervised Semantic Segmentation With Error Localization Network link
- 49. Debiased Learning From Naturally Imbalanced Pseudo-Labels link
- Image & Video Synthesis and Generation
- Datasets and Evaluation
- Segmentation, Grouping and Shape Analysis
CVPR workshop June 19-20th.
Machine Learning with Synthetic Data (SyntML) link
Synthetic data are labeled data made using computer graphics. They are cheap, clean, and richly labeled. Keywords: Domain mismatch, Diversity
- human synthesis, Google
- Procedural face generation
template + features paradigm
features can be: identity, expression, pose
- Hair and clothing
- Environment
- Render (Blender)
- synthetic data & simulation, Nvidia
- graphics geometry + texture by a distribution
- mixed reality light estimation + AR
- generative models GAN / diffusion model
- crossing the domain gap with synthetic data, Datagen
- why synthetic data?
- pixel-accurate labels
- rich annotations
- full control
- types of gap
- photorealism gap
- pose gap
- augmentation gap
- annotation gap
- styleGAN
- cascade "parameter w class" (also like template + features)
- inversion / editing (good sensitivity)
- mixing synthetic data with real data (when real data is limited) can achieve better performance
- address domain gap
- photorealism
- label adaptation
- add noise
- global scene parameter distribution (lights, camera, pose)
International Challenge on Activity Recognition (ActivityNet) link
task: real-time, online action detection in untrimmed security video
object: single / multi / interaction
pipeline:
- detection
- background removal
- tracking (IOU-based)
- classification
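The IOU-based tracking step in the pipeline above can be sketched as follows. This is a minimal illustration, not the challenge entry's actual tracker: the `(x1, y1, x2, y2)` box format, the greedy matching order, and the `threshold=0.3` default are all assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, threshold=0.3):
    """Greedily match track boxes to detection boxes by descending IoU.

    Returns {track_index: detection_index}; unmatched items are omitted.
    The threshold is a hypothetical default, not a value from the talk.
    """
    pairs = sorted(
        ((iou(t, d), ti, di)
         for ti, t in enumerate(tracks)
         for di, d in enumerate(detections)),
        reverse=True,
    )
    matches, used_dets = {}, set()
    for score, ti, di in pairs:
        if score < threshold:
            break  # remaining pairs overlap even less
        if ti in matches or di in used_dets:
            continue  # each track and detection is used at most once
        matches[ti] = di
        used_dets.add(di)
    return matches


tracks = [(0, 0, 10, 10), (20, 20, 30, 30)]
detections = [(21, 21, 31, 31), (1, 0, 11, 10)]
print(associate(tracks, detections))
```

Tracks left unmatched would typically be ended or kept alive for a few frames, and unmatched detections would start new tracks; that bookkeeping is omitted here.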
related concept:
- domain adaptation
- overlapping spatio-temporal
- class imbalance
- multi-label
- generalization performance
2nd Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings link
- task: build a building model, from point clouds to a room map
- key tech: semantic segmentation of point clouds
Workshop on Attention and Transformers in Vision link
- Visual Attention with Recurrency and Sparsity
- BoxeR: Box-Attention for 2D and 3D Transformers
- 2D / 3D object detection or segmentation
- query: reference window
- key: learnable relative region
- multi-scale feature map
- Depth Estimation with Simplified Transformers
- FC -> 1x1 Conv.
- M2F3D: Mask2Former for 3D Instance Segmentation
- top-down / bottom-up
- sparse Conv.
5th MUltimodal Learning and Applications Workshop (MULA) link
Learning to Navigate from Vision and Language
- humans use semantic priors to understand and navigate unseen environments
- RL bottlenecks to progress on semantic navigation: scalability, diversity
- no need to learn a policy -> greedy
7th BMTT Workshop on Benchmarking Multi-Target Tracking: How Far Can Synthetic Data Take us? link
L3D-IVU: Workshop on Learning with Limited Labelled Data for Image and Video Understanding link
- Low-Shot Scene Decomposition via Reconstruction
- featurize 3D scene behind the image
- fuse information from range sensors
- RGB rendering is useful pre-training for detections
- continuous 3D feature maps with implicit functions
- unsupervised detection: where and what, decouple these
- unsupervised 3D segmentation via reconstruction loss
CVPR tutorial June 19-20th.
Denoising Diffusion-based Generative Modeling: Foundations and Applications link
- kinds of diffusion model
- momentum-based
- energy-based
- latent-space (with pretrained VAE): faster and simpler
- distillation (merge steps)
- discrete state diffusion model
- high-resolution
- condition form: scalar / image / text
- quality-diversity trade-off
- cascade generation with super-resolution method
- application
- semantic segmentation
- image editing
- adversarial robustness (purified image)
- video generation
- types
- all frames
- past frames
- future frames
- interpolation
- tips: training with different types of mask / use time position encodings to encode times
- backbone: 3D Conv. / 2D Conv. + Att. (ignore initially when train)
- long-term: generate a frame far away and then interpolate
- medical imaging
reconstruct the original image from sparse measurements
high-level idea: learn a model pretrained on a pure dataset as a "prior", then guide synthesis conditioned on sparse observations
- 3D shape generation
through point clouds
- future trend
- why diffusion models perform better?
- how can we improve VAE / flow from diffusion model?
- sampling from diffusion model is still slow
- a diffusion model can be considered a latent variable model without semantics; what if it had them?
- can diffusion models help discriminative applications?
- what are the best network architectures for diffusion model instead of UNet?
- data modalities beyond 2D images
- controllable generation
- in some application replace GAN with diffusion model
Recent Advances in Vision-and-Language Pre-training link
- unifying text and image
- avoiding explicit detection module
- high resolution computing cost
- coarse to fine two-stage VLP
- fusion in the backbone
Beyond Convolutional Neural Networks link
- DETR: DEtection TransfoRmer
- idea: pose the task directly as set prediction, using a transformer encoder-decoder
- bipartite match
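The bipartite match in DETR pairs each prediction with at most one ground-truth object so that the total matching cost is minimal. A minimal sketch of that idea follows, assuming a small square cost matrix and using brute force over permutations rather than the Hungarian algorithm; the cost values are made up for illustration. For real workloads, `scipy.optimize.linear_sum_assignment` solves the same assignment problem efficiently.

```python
from itertools import permutations


def bipartite_match(cost):
    """Return the minimum-cost one-to-one assignment for a square cost matrix.

    cost[i][j] is the cost of assigning prediction i to ground-truth j.
    Brute force over all permutations: only suitable for tiny sets.
    """
    n = len(cost)
    best_perm, best_cost = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(best_perm), best_cost


# Hypothetical costs (e.g. a mix of class and box terms) for 3 predictions.
cost = [
    [0.9, 0.1, 0.8],
    [0.2, 0.7, 0.9],
    [0.8, 0.9, 0.3],
]
assignment, total = bipartite_match(cost)
print(assignment, round(total, 2))
```

Here `assignment[i]` is the ground-truth index matched to prediction `i`; the training loss is then computed only on the matched pairs.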
Evaluating Models Beyond the Textbook: Out-of-distribution and Without Labels link
- robustness encompasses a broad range of phenomena (adv. examples, corruptions, nat. dist shift, etc.)
- some forms of robustness are currently orthogonal
- consistent trends across natural distribution shifts -> need more fine-grained understanding of different robustness notions.
- training data plays a key role in creating broadly robust models (e.g., CLIP). -> How do we construct training sets that enable broadly reliable models?
- very large improvements in OOD robustness
1. Semantic-Aware Domain Generalized Segmentation link
- semantic-aware normalization adopts a multi-branch normalization strategy, aiming to transform the input feature map into category-level normalized features that are semantic-aware center aligned.
2. Pointly-Supervised Instance Segmentation link
@Bowen Cheng
- training with point-based annotation
- Implicit PointRend
3. Adaptive Early-Learning Correction for Segmentation From Noisy Annotations link
- how to define early-training stage without ground truth?
- how to utilize noisy pseudo-labels?
4. Unsupervised Hierarchical Semantic Segmentation With Multiview Cosegmentation and Clustering Transformers link
5. Self-supervised Video Transformer link
6. Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition link
see the notes https://reself-c.github.io/DualAI
7. Tracking People by Predicting 3D Appearance, Location and Pose link
8. OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization link
- two dimensions of distribution shift
- diversity shift -> shift in feature support (novel features at test time)
- correlation shift -> shift in the feature-label mapping (spurious correlations)
9. Robust Fine-Tuning of Zero-Shot Models link
- weight-space ensemble (linear interpolation) of the fine-tuned model and the zero-shot model
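A weight-space ensemble interpolates the parameters of the two models directly instead of averaging their outputs. The sketch below assumes both models expose the same parameter names and stores weights as plain lists of floats; the `interpolate_weights` helper, the parameter names, and the mixing coefficient `alpha` are illustrative, not taken from the paper's code.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Linearly interpolate two weight dicts that share the same keys.

    alpha = 0 returns the zero-shot weights, alpha = 1 the fine-tuned ones.
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }


# Toy weights standing in for real model state dicts.
zero_shot = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
fine_tuned = {"layer.weight": [3.0, 0.0], "layer.bias": [1.0]}
mixed = interpolate_weights(zero_shot, fine_tuned, alpha=0.5)
print(mixed)
```

Sweeping `alpha` between 0 and 1 trades off accuracy on the fine-tuning distribution against robustness inherited from the zero-shot model.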
10. Learning Distinctive Margin Toward Active Domain Adaptation link
- data sample strategy
- classic uncertainty sample
- diversity sample
- multi-index evaluation
- adversarial learning
- ...
- margin sample (this work)
11. DINE: Domain Adaptation From Single and Multiple Black-Box Predictors link
- BB-SFDA: only logits
12. Source-Free Object Detection by Learning To Overlook Domain Style link
- augmentation + alignment
13. Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization link
14. Causality Inspired Representation Learning for Domain Generalization link
15. Learning What Not To Segment: A New Perspective on Few-Shot Segmentation link
16. Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation link
- pretrain + active-learning
17. ADeLA: Automatic Dense Labeling With Attention for Viewpoint Shift in Semantic Segmentation link
- viewpoint change causes a prior shift for scene parsing
18. Dataset Distillation by Matching Training Trajectories link
- compress the dataset from 50k to 10 by matching the parameters of the model along its training trajectory
19. Controllable Dynamic Multi-Task Architectures link
- select the path and weight for a completed multi-task network architecture
20. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation link
21. PoseTriplet: Co-Evolving 3D Human Pose Estimation, Imitation, and Hallucination Under Self-Supervision link
22. Generalizable Human Pose Triangulation link
1. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels link
2. Deep Hierarchical Semantic Segmentation link
3. Amodal Segmentation Through Out-of-Task and Out-of-Distribution Generalization With a Bayesian Model link
- trained with bounding box and output is visible mask
- out-of-task and out-of-distribution generalization with a Bayesian generative model
4. SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization link
- use point feature memory
5. Accelerating Video Object Segmentation With Compressed Video link
- use residual between frames
- run inference only on key frames and propagate to the other frames using residuals
6. High Quality Segmentation for Ultra High-Resolution Images link
- calculate the relationship between the coordinate of low-resolution feature and ultra high-resolution target to get position information.
7. Pin the Memory: Learning To Generalize Semantic Segmentation link
- store features as memory for inference on other domains
- closed-set assumption, no label mismatch
8. Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity link
- learn pairwise affinities between pixels
- a data augmentation strategy
- learn a binary decision first (is that an object?), then classification
9. Weakly Supervised Semantic Segmentation Using Out-of-Distribution Data link
10. Multimodal Material Segmentation link
- material segmentation (may close to texture but not so semantic)
11. Semi-Supervised Learning of Semantic Correspondence With Pseudo-Labels link
12. A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty link
- resampling and reweighting for long-tailed datasets
- class balance and hardness balance
- define a difficulty for classification
13. How Much More Data Do I Need? Estimating Requirements for Downstream Tasks link
- estimate the amount of data needed
- most regression functions significantly over- or underestimate how much data is needed
14. Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase link
15. Out-of-distribution Generalization with Causal Invariant Transformations link
16. Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation link
17. Revisiting Weakly Supervised Pre-Training of Visual Perception Models link
- multi-label (hashtags) classification
- target is a uniform probability distribution over all hashtags for an image
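The uniform hashtag target can be sketched as a soft label vector with mass 1/K on each of the K hashtags attached to an image. The example below pairs it with a softmax cross-entropy loss, which is an assumption for illustration; the vocabulary, hashtags, and helper names are hypothetical.

```python
import math


def hashtag_target(hashtags, vocab):
    """Uniform distribution over the image's hashtags that are in the vocabulary."""
    present = set(hashtags) & set(vocab)
    k = len(present)
    return [1.0 / k if tag in present else 0.0 for tag in vocab]


def soft_cross_entropy(logits, target):
    """Cross-entropy between a softmax over logits and a soft target."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for x, t in zip(logits, target))


vocab = ["#cat", "#dog", "#beach", "#sunset"]
target = hashtag_target(["#beach", "#sunset"], vocab)
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], target)
print(target, loss)
```

With all-zero logits the prediction is uniform over the four-tag vocabulary, so the loss equals log 4 regardless of which hashtags are present.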
18. Failure Modes of Domain Generalization Algorithms link
19. Learning Part Segmentation Through Unsupervised Domain Adaptation From Synthetic Vehicles link
20. Large-Scale Pre-Training for Person Re-Identification With Noisy Labels link
21. Efficient Video Instance Segmentation via Tracklet Query and Proposal link
- both tracklet and appearance query
- both bounding box and mask output
(https://arxiv.org/abs/2111.08644)
23. Learning To Estimate Robust 3D Human Mesh From In-the-Wild Crowded Scenes link
- use 2d pose to reduce domain gap
- self-updated 2d pose from off-the-shelf model
24. Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation link
- augmentation + consistency
25. Multi-Scale Memory-Based Video Deblurring link
- multi-scale
- memory-based: remember sharp features and run inference on blurry frames
25. Self-Supervised Keypoint Discovery in Behavioral Videos link
- self-supervised pretraining + downstream tasks
26. GLASS: Geometric Latent Augmentation for Shape Spaces link
27. Video-Text Representation Learning via Differentiable Weak Temporal Alignment link
- pretraining through multimodal alignment, like a video version of CLIP
28. End-to-End Referring Video Object Segmentation With Multimodal Transformers link
- multimodal transformer
- parallel over all frames instead of sequential processing based on a memory bank
29. Are Multimodal Transformers Robust to Missing Modality? link
30. Robust Cross-Modal Representation Learning With Progressive Self-Distillation link
31. Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification link
32. MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing link
33. Coarse-To-Fine Feature Mining for Video Semantic Segmentation link
34. The DEVIL Is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting link
35. YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset link
36. Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark link
- changeable receptive field for attention
37. Which Model To Transfer? Finding the Needle in the Growing Haystack link
- pretrained model selection for downstream tasks
38. Task2Sim: Towards Effective Pre-Training and Transfer From Synthetic Data link
- use RL to control the parameters of the synthetic data generator
39. MetaPose: Fast 3D Pose From Multiple Views Without 3D Supervision link
40. Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation link
41. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking link
42. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion link
- train a refine net with only 2D gt
43. DiffPoseNet: Direct Differentiable Camera Pose Estimation link
44. Multi-Granularity Alignment Domain Adaptation for Object Detection link
45. Cross-Domain Adaptive Teacher for Object Detection link
- pixel-/instance-/category-level discrimination
46. DASO: Distribution-Aware Semantics-Oriented Pseudo-Label for Imbalanced Semi-Supervised Learning link
47. Unbiased Teacher v2: Semi-Supervised Object Detection for Anchor-Free and Anchor-Based Detectors link
48. Semi-Supervised Semantic Segmentation With Error Localization Network link
49. Debiased Learning From Naturally Imbalanced Pseudo-Labels link
- similar to entropy filter