awesome-cvpr2022

Notes on workshops, tutorials, orals, and posters at CVPR 2022.

Wenhao (Reself) Chai

Undergraduate, UIUC

Workshops
Tutorials
Orals
Posters

Workshops

CVPR workshops, June 19-20.

full schedule

Machine Learning with Synthetic Data (SyntML) link

Synthetic data are labeled data generated with computer graphics. They are cheap, clean, and richly labeled. Keywords: domain mismatch, diversity

human synthesis Google
- Procedural face generation
  template + features paradigm
  features can be: identity, expression, pose
- Hair and clothing
- Environment
- Render (Blender)
synthetic data & simulation Nvidia
- graphics geometry + texture by a distribution
- mixed reality light estimation + AR
- generative models GAN / diffusion model
crossing the domain gap with synthetic data Datagen
- why synthetic data?
  - pixel-accurate labels
  - rich annotations
  - full control
- types of gap
  - photorealism gap
  - pose gap
  - augmentation gap
  - annotation gap
- styleGAN
  - cascade "parameter w class" (also like template + features)
  - inversion / editing (good sensitivity)
- mixing synthetic data with real data (when real data is limited) can achieve better performance
- address domain gap
  - photorealism
  - label adaptation
  - add noise
  - global scene parameter distribution (lights, camera, pose)
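
The point about mixing synthetic with limited real data can be sketched as a batch sampler. This is a minimal illustration, not something shown in the talk: the function name `mixed_batches` and the `real_fraction` parameter are hypothetical, and the right ratio would have to be tuned per task.

```python
import random


def mixed_batches(real, synthetic, batch_size, real_fraction, seed=0):
    """Yield batches with a fixed share of real samples (hypothetical helper).

    real, synthetic: non-empty lists of samples.
    real_fraction:   share of each batch drawn from the real pool (assumed knob).
    """
    rng = random.Random(seed)
    n_real = round(batch_size * real_fraction)
    n_syn = batch_size - n_real
    while True:
        # Sample with replacement so a small real pool can still fill its share.
        batch = rng.choices(real, k=n_real) + rng.choices(synthetic, k=n_syn)
        rng.shuffle(batch)
        yield batch
```

Sampling the real pool with replacement is a deliberate choice here: when real data is scarce it is oversampled rather than diluted by the much larger synthetic pool.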

International Challenge on Activity Recognition (ActivityNet) link

task: real-time online untrimmed security video action detection
object: single / multi / interaction
pipeline:

detection
background removal
tracking (IOU-based)
classification
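
The IoU-based tracking step of the pipeline above can be sketched as greedy matching between existing tracks and new detections. This is a generic illustration under assumptions: boxes are `(x1, y1, x2, y2)` tuples, and the `threshold` value and function names are hypothetical, not taken from the challenge entries.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_tracks(tracks, detections, threshold=0.3):
    """Greedily map track ids to detection indices by descending IoU.

    tracks:     dict of track id -> last known box.
    detections: list of boxes from the current frame.
    Returns a dict of track id -> detection index; unmatched items are omitted.
    """
    pairs = sorted(
        ((iou(box, det), tid, j)
         for tid, box in tracks.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    assigned, used_dets = {}, set()
    for score, tid, j in pairs:
        if score < threshold:
            break  # remaining pairs overlap too little to be the same object
        if tid in assigned or j in used_dets:
            continue
        assigned[tid] = j
        used_dets.add(j)
    return assigned
```

Detections left unmatched would typically start new tracks, and tracks left unmatched for several frames would be dropped.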

related concept:

domain adaptation
overlapping spatio-temporal
class imbalance
multi-label
generalization performance

2nd Workshop and Challenge on Computer Vision in the Built Environment for the Design, Construction, and Operation of Buildings link

task: building model through point clouds to room map
key tech: semantic segmentation of point clouds

Workshop on Attention and Transformers in Vision link

Visual Attention with Recurrency and Sparsity
BoxeR: Box-Attention for 2D and 3D Transformers
- 2D / 3D object detection or segmentation
- query: reference window
- key: learnable relative region
- multi-scale feature map
Depth Estimation with Simplified Transformers
- FC -> 1x1 Conv.
M2F3D: MaskFormer for 3D Instance Segmentation
- top-down / bottom-up
- sparse Conv.
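
The "FC -> 1x1 Conv." note relies on the fact that a fully connected layer applied to every token equals a 1x1 convolution over the feature map. The sketch below checks this on nested lists; the function names and the `[C][H][W]` layout are assumptions for illustration, not code from the paper.

```python
def linear(weight, bias, x):
    """Fully connected layer: weight is [C_out][C_in], x is a [C_in] vector."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def conv1x1(weight, bias, fmap):
    """1x1 convolution over a feature map stored as [C_in][H][W]."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [
        [[bias[o] + sum(weight[o][c] * fmap[c][i][j] for c in range(c_in))
          for j in range(w)]
         for i in range(h)]
        for o in range(len(weight))
    ]


def fc_per_token(weight, bias, fmap):
    """Flatten the map to H*W tokens, apply the FC layer, reshape back."""
    c_in, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0] * w for _ in range(h)] for _ in range(len(weight))]
    for i in range(h):
        for j in range(w):
            token = [fmap[c][i][j] for c in range(c_in)]
            for o, val in enumerate(linear(weight, bias, token)):
                out[o][i][j] = val
    return out
```

Both functions produce the same output for the same weights, so the swap changes the memory layout the layer operates on rather than what it computes.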

5th MUltimodal Learning and Applications Workshop (MULA) link

Learning to Navigate from Vision and Language

humans use semantic priors to understand and navigate unseen environments
RL bottlenecks to progress on semantic navigation: scalability, diversity
no need to learn a policy -> greedy

7th BMTT Workshop on Benchmarking Multi-Target Tracking: How Far Can Synthetic Data Take us? link

L3D-IVU: Workshop on Learning with Limited Labelled Data for Image and Video Understanding link

Low-Shot Scene Decomposition via Reconstruction
- featurize 3D scene behind the image
- fuse information from range sensors
- RGB rendering is useful pre-training for detection
- continuous 3D feature maps with implicit functions
- unsupervised detection: where and what, decouple these
- unsupervised 3D segmentation via reconstruction loss

Tutorials

CVPR tutorials, June 19-20.

Denoising Diffusion-based Generative Modeling: Foundations and Applications link

kinds of diffusion model
- momentum-based
- energy-based
- latent-space (with pretrained VAE): faster and simpler
- distillation (merge steps)
- discrete state diffusion model
high-resolution
- condition form: scalar / image / text
- quality-diversity trade-off
- cascade generation with super-resolution method
application
- semantic segmentation
- image editing
- adversarial robustness (purified image)
- video generation
  - types
    - all frames
    - past frames
    - future frames
    - interpolation
  - tips: training with different types of mask / use time position encodings to encode times
  - backbone: 3D Conv. / 2D Conv. + Att. (ignore initially when train)
  - long-term: generate a frame far away and then interpolation
- medical imaging
  reconstruct the original image from sparse measurements
  high-level idea: pretrain a model on a clean dataset as a "prior", then guide synthesis conditioned on the sparse observations
- 3D shape generation
  through point clouds
future trend
- why diffusion models perform better?
- how can we improve VAE / flow from diffusion model?
- sampling from diffusion model is still slow
- a diffusion model can be viewed as a latent variable model without semantic latents; what if it had them?
- can diffusion models help discriminative applications?
- what are the best network architectures for diffusion models besides UNet?
- data modalities beyond 2D images
- controllable generation
- in some application replace GAN with diffusion model
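
As background for the notes above, the forward (noising) process that these models share can be sketched in a few lines. This follows the standard DDPM closed form x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps; the linear schedule and its endpoints (1e-4 to 0.02) are commonly used defaults assumed here, not values taken from the tutorial, and the function names are illustrative.

```python
import math
import random


def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out


def q_sample(x0, t, abar, rng):
    """Draw x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) * I) in one step."""
    scale = math.sqrt(abar[t])
    sigma = math.sqrt(1.0 - abar[t])
    return [scale * v + sigma * rng.gauss(0.0, 1.0) for v in x0]


abar = alpha_bars(linear_betas(1000))
noisy = q_sample([0.5, -0.5], t=500, abar=abar, rng=random.Random(0))
```

Because abar_t shrinks toward zero, x_t at the last step is close to pure Gaussian noise, which is why sampling can start from noise. The slow-sampling concern listed above comes from reversing this chain one step at a time.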

Recent Advances in Vision-and-Language Pre-training link

unifying text and image
avoiding explicit detection module
high resolution computing cost
coarse to fine two-stage VLP
fusion in the backbone

Beyond Convolutional Neural Networks link

DETR: DEtection TransfoRmer
- idea: pose the task directly as set prediction, using a transformer encoder-decoder
- bipartite match
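
The bipartite match assigns each ground-truth object to exactly one prediction so that the total matching cost is minimal. The sketch below finds that assignment by brute force over permutations, which is only practical for tiny examples; real implementations use the Hungarian algorithm (for instance `scipy.optimize.linear_sum_assignment`). The cost values and the function name are made up for illustration.

```python
from itertools import permutations


def bipartite_match(cost):
    """Minimum-cost one-to-one matching by exhaustive search.

    cost[i][j] is the cost of matching prediction i to ground-truth j,
    with at least as many predictions as ground-truth objects.
    Returns (dict of ground-truth index -> prediction index, total cost).
    """
    n_pred, n_gt = len(cost), len(cost[0])
    best, best_total = None, float("inf")
    for preds in permutations(range(n_pred), n_gt):
        total = sum(cost[p][g] for g, p in enumerate(preds))
        if total < best_total:
            best, best_total = preds, total
    return dict(enumerate(best)), best_total


# Three predictions, two ground-truth objects (hypothetical costs).
matching, total = bipartite_match([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
```

Predictions left out of the matching are the ones trained to predict the "no object" class.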

Evaluating Models Beyond the Textbook: Out-of-distribution and Without Labels link

robustness encompasses a broad range of phenomena (adv. examples, corruptions, nat. dist shift, etc.)
some forms of robustness are currently orthogonal
consistent trends across natural distribution shifts -> need more fine-grained understanding of different robustness notions.
training data plays a key role in creating broadly robust models (e.g., CLIP). -> How do we construct training sets that enable broadly reliable models?
very large improvements in OOD robustness

Orals

Segmentation, Grouping and Shape Analysis

1. Semantic-Aware Domain Generalized Segmentation link

Semantic-aware normalization adopts a multi-branch normalization strategy, aiming to transform the input feature map into category-level normalized features that are semantic-aware and center-aligned.

2. Pointly-Supervised Instance Segmentation link

@Bowen Cheng

training with point-based annotation
Implicit PointRend

3. Adaptive Early-Learning Correction for Segmentation From Noisy Annotations link

how to define early-training stage without ground truth?
how to utilize noisy pseudo labels?

4. Unsupervised Hierarchical Semantic Segmentation With Multiview Cosegmentation and Clustering Transformers link

Video Analysis & Understanding

5. Self-supervised Video Transformer link

6. Dual-AI: Dual-Path Actor Interaction Learning for Group Activity Recognition link

see the notes https://reself-c.github.io/DualAI

3D From Single Images

7. Tracking People by Predicting 3D Appearance, Location and Pose link

Transfer / Low-Shot / Long-Tail Learning

8. OoD-Bench: Quantifying and Understanding Two Dimensions of Out-of-Distribution Generalization link

two dimensions of distribution shift
- diversity shift -> shift in label
- correlation shift -> shift in mapping

9. Robust Fine-Tuning of Zero-Shot Models link

weight-space ensemble of Fine-tune model and Zero-shot model (linear)
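
A linear weight-space ensemble interpolates the two models parameter by parameter. The sketch below uses plain dicts of float lists as stand-ins for model state; the function name and the mixing coefficient `alpha` are illustrative assumptions.

```python
def interpolate_weights(zero_shot, fine_tuned, alpha):
    """Return (1 - alpha) * zero_shot + alpha * fine_tuned per parameter.

    Both arguments map parameter names to equally sized lists of floats.
    alpha = 0 recovers the zero-shot model, alpha = 1 the fine-tuned one.
    """
    if zero_shot.keys() != fine_tuned.keys():
        raise ValueError("models must share the same parameter names")
    return {
        name: [(1 - alpha) * z + alpha * f
               for z, f in zip(zero_shot[name], fine_tuned[name])]
        for name in zero_shot
    }


merged = interpolate_weights({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
```

This only makes sense when both models share one architecture and the fine-tuned model was initialized from the zero-shot one, so that their weights live in the same space.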

10. Learning Distinctive Margin Toward Active Domain Adaptation link

data sample strategy
- classic uncertainty sample
- diversity sample
- multi-index evaluation
- adversarial learning
- ...
- margin sample (this work)
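
For orientation, the classic margin-based selection rule picks the unlabeled samples whose top two class probabilities are closest. The sketch below shows only that textbook rule, not the margin formulation proposed in the paper; the function names and `budget` parameter are hypothetical.

```python
def top2_margin(probs):
    """Gap between the largest and second-largest class probability."""
    first, second = sorted(probs, reverse=True)[:2]
    return first - second


def select_for_labeling(prob_rows, budget):
    """Indices of the `budget` samples with the smallest top-2 margin."""
    ranked = sorted(range(len(prob_rows)), key=lambda i: top2_margin(prob_rows[i]))
    return ranked[:budget]


# Hypothetical softmax outputs for three unlabeled target samples.
picked = select_for_labeling(
    [[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1]], budget=2
)
```

A small margin means the classifier is torn between two classes, so labeling that sample is expected to be more informative than labeling a confidently classified one.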

11. DINE: Domain Adaptation From Single and Multiple Black-Box Predictors link

BB-SFDA: only logits

12. Source-Free Object Detection by Learning To Overlook Domain Style link

augmentation + alignment

13. Exact Feature Distribution Matching for Arbitrary Style Transfer and Domain Generalization link

14. Causality Inspired Representation Learning for Domain Generalization link

15. Learning What Not To Segment: A New Perspective on Few-Shot Segmentation link

16. Towards Fewer Annotations: Active Learning via Region Impurity and Prediction Uncertainty for Domain Adaptive Semantic Segmentation link

pretrain + active-learning

17. ADeLA: Automatic Dense Labeling With Attention for Viewpoint Shift in Semantic Segmentation link

viewpoint change causes a prior shift for scene parsing

Image & Video Synthesis and Generation

18. Dataset Distillation by Matching Training Trajectories link

compress the dataset from 50k to 10 by matching the parameter in the model
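
The parameter matching can be written as a normalized distance: train a student on the synthetic set starting from an expert checkpoint, then compare where it ends up with a later expert checkpoint. The sketch below shows that loss on flat parameter lists; it is a reading of the idea under assumptions (flattened parameters, illustrative names), not the authors' code.

```python
def trajectory_matching_loss(student_end, expert_start, expert_end):
    """Squared distance to the expert target, normalized by expert movement.

    student_end:  parameters after training on synthetic data from expert_start.
    expert_start: expert checkpoint used to initialize the student.
    expert_end:   later expert checkpoint that the student should reach.
    """
    num = sum((s - e) ** 2 for s, e in zip(student_end, expert_end))
    den = sum((a - b) ** 2 for a, b in zip(expert_start, expert_end))
    return num / den


loss = trajectory_matching_loss([1.0, 1.0], [0.0, 0.0], [2.0, 2.0])
```

Normalizing by how far the expert itself moved keeps the loss scale comparable across early and late parts of the trajectory. In the full method this loss is backpropagated through the student's training steps to update the synthetic images.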

Deep Learning Architectures & Techniques

19. Controllable Dynamic Multi-Task Architectures link

select the path and weight for a completed multi-task network architecture

Human Pose Estimation & Tracking, Localization, and Object Pose Estimation

20. Temporal Feature Alignment and Mutual Information Maximization for Video-Based Human Pose Estimation link

21. PoseTriplet: Co-Evolving 3D Human Pose Estimation, Imitation, and Hallucination Under Self-Supervision link

22. Generalizable Human Pose Triangulation link

Posters

Segmentation, Grouping and Shape Analysis

1. Semi-Supervised Semantic Segmentation Using Unreliable Pseudo-Labels link

2. Deep Hierarchical Semantic Segmentation link

3. Amodal Segmentation Through Out-of-Task and Out-of-Distribution Generalization With a Bayesian Model link

trained with bounding box and output is visible mask
out-of-task and out-of-distribution generalization with a Bayesian generative model

4. SWEM: Towards Real-Time Video Object Segmentation With Sequential Weighted Expectation-Maximization link

use point feature memory

5. Accelerating Video Object Segmentation With Compressed Video link

use residual between frames
run inference only on key frames and propagate to the other frames via residuals

6. High Quality Segmentation for Ultra High-Resolution Images link

calculate the relationship between the coordinate of low-resolution feature and ultra high-resolution target to get position information.

7. Pin the Memory: Learning To Generalize Semantic Segmentation link

store features as memory for inference on other domains
close-set assumption, no label mismatch

8. Open-World Instance Segmentation: Exploiting Pseudo Ground Truth From Learned Pairwise Affinity link

learn a pairwise affinity between pixels
a data augmentation strategy
learn a binary classifier first, then classification (is it an object at all?)

9. Weakly Supervised Semantic Segmentation Using Out-of-Distribution Data link

10. Multimodal Material Segmentation link

material segmentation (may be closer to texture than to semantics)

11. Semi-Supervised Learning of Semantic Correspondence With Pseudo-Labels link

Machine Learning

12. A Re-Balancing Strategy for Class-Imbalanced Classification Based on Instance Difficulty link

resampling and reweighting for long-tail datasets
class balance and hardness balance
define a difficulty for classification

13. How Much More Data Do I Need? Estimating Requirements for Downstream Tasks link

estimate the amount of data needed
most regression functions significantly over- or underestimate how much data is needed

14. Deep Safe Multi-view Clustering: Reducing the Risk of Clustering Performance Degradation Caused by View Increase link

15. Out-of-distribution Generalization with Causal Invariant Transformations link

Deep Learning Architectures & Techniques

16. Single-Domain Generalized Object Detection in Urban Scene via Cyclic-Disentangled Self-Distillation link

17. Revisiting Weakly Supervised Pre-Training of Visual Perception Models link

multi-label (hashtags) classification
target is a uniform probability distribution over all hashtags of an image
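
A uniform target over an image's hashtags can be paired with a soft-label cross-entropy, sketched below. The softmax cross-entropy loss is an assumption about how such a target would be trained against, and the function names are illustrative.

```python
import math


def hashtag_target(num_classes, hashtag_ids):
    """Uniform distribution over an image's hashtags, zero elsewhere."""
    ids = set(hashtag_ids)
    return [1.0 / len(ids) if c in ids else 0.0 for c in range(num_classes)]


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))  # stable log-sum-exp
    return -sum(t * (v - log_z) for v, t in zip(logits, target))


# Image tagged with hashtags 1 and 3 out of a 4-hashtag vocabulary.
target = hashtag_target(4, [1, 3])
loss = soft_cross_entropy([0.0, 0.0, 0.0, 0.0], target)
```

With K hashtags each one receives probability 1/K, so every image contributes the same total target mass regardless of how many hashtags it carries.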

18. Failure Modes of Domain Generalization Algorithms link

19. Learning Part Segmentation Through Unsupervised Domain Adaptation From Synthetic Vehicles link

Vision Applications & Systems

20. Large-Scale Pre-Training for Person Re-Identification With Noisy Labels link

Recognition: Detection, Categorization, Retrieval

21. Efficient Video Instance Segmentation via Tracklet Query and Proposal link

both tracklet and appearance query
both bounding box and mask output

22. UBnormal: New Benchmark for Supervised Open-Set Video Anomaly Detection link (https://arxiv.org/abs/2111.08644)

3D From Single Images

23. Learning To Estimate Robust 3D Human Mesh From In-the-Wild Crowded Scenes link

use 2d pose to reduce domain gap
self-updated 2d pose from off-the-shelf model

24. Exploiting Pseudo Labels in a Self-Supervised Learning Framework for Improved Monocular Depth Estimation link

augmentation + consistency

Low-Level Vision

25. Multi-Scale Memory-Based Video Deblurring link

multi-scale
memory-based: remember sharp features and use them for inference on blurry frames

Behavior Analysis

25. Self-Supervised Keypoint Discovery in Behavioral Videos link

self-supervised pretraining + downstream tasks

26. GLASS: Geometric Latent Augmentation for Shape Spaces link

Vision & Language

27. Video-Text Representation Learning via Differentiable Weak Temporal Alignment link

pretraining through multimodal alignment, like a video version of CLIP

28. End-to-End Referring Video Object Segmentation With Multimodal Transformers link

multimodal transformer
processes all frames in parallel instead of sequentially with a memory bank

29. Are Multimodal Transformers Robust to Missing Modality? link

30. Robust Cross-Modal Representation Learning With Progressive Self-Distillation link

31. Multimodal Dynamics: Dynamical Fusion for Trustworthy Multimodal Classification link

Video Analysis & Understanding

32. MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing link

33. Coarse-To-Fine Feature Mining for Video Semantic Segmentation link

34. The DEVIL Is in the Details: A Diagnostic Evaluation Benchmark for Video Inpainting link

35. YouMVOS: An Actor-Centric Multi-Shot Video Object Segmentation Dataset link

36. Large-Scale Video Panoptic Segmentation in the Wild: A Benchmark link

changeable receptive field for attention

Transfer / Low-Shot / Long-Tail Learning

37. Which Model To Transfer? Finding the Needle in the Growing Haystack link

pretrain model selecting for downstream tasks

38. Task2Sim: Towards Effective Pre-Training and Transfer From Synthetic Data link

use RL to control the parameter of synthetic data generator

Pose Estimation & Tracking

39. MetaPose: Fast 3D Pose From Multiple Views Without 3D Supervision link

40. Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation link

41. PoseTrack21: A Dataset for Person Search, Multi-Object Tracking and Multi-Person Pose Tracking link

42. DanceTrack: Multi-Object Tracking in Uniform Appearance and Diverse Motion link

train a refine net with only 2D gt

43. DiffPoseNet: Direct Differentiable Camera Pose Estimation link

Recognition: Detection, Categorization, Retrieval

44. Multi-Granularity Alignment Domain Adaptation for Object Detection link

45. Cross-Domain Adaptive Teacher for Object Detection link

pixel-/instance-/category-level discrimination

Self-, Semi-, Meta-, & Unsupervised Learning

46. DASO: Distribution-Aware Semantics-Oriented Pseudo-Label for Imbalanced Semi-Supervised Learning link

47. Unbiased Teacher v2: Semi-Supervised Object Detection for Anchor-Free and Anchor-Based Detectors link

48. Semi-Supervised Semantic Segmentation With Error Localization Network link

49. Debiased Learning From Naturally Imbalanced Pseudo-Labels link

similar to entropy filter
