singhmanmeetsingh/ComputerVisionPapers

Semantic Segmentation

Conditional Random Fields as Recurrent Neural Networks

ReSeg: A Recurrent Neural Network-based Model for Semantic Segmentation

Co-segmentation

CoSegNet: Deep Co-Segmentation of 3D Shapes with Group Consistency Loss

  1. Propose a novel group consistency loss for unsupervised part segmentation. Use an inconsistently segmented dataset to train a shape refiner.

Instance Segmentation / Object Detection

Conventional methods

Normalized cut

  1. Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), where cut(A, B) sums the edge weights crossing the partition and assoc(A, V) sums all edge weights touching A (see the sketch below).
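
A minimal numpy sketch of the quantity above, assuming a symmetric adjacency matrix `W` and a boolean partition `mask` (both hypothetical inputs):

```python
import numpy as np

def ncut(W: np.ndarray, mask: np.ndarray) -> float:
    """Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V)."""
    A, B = mask, ~mask
    cut = W[np.ix_(A, B)].sum()    # weights of edges crossing the partition
    assoc_A = W[A].sum()           # weights of all edges touching A
    assoc_B = W[B].sum()
    return cut / assoc_A + cut / assoc_B
```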

Proposal-based methods

RON: Reverse Connection with Objectness Prior Networks for Object Detection

  1. Multi-scale + reverse connection

Frustum PointNets for 3D Object Detection from RGB-D Data

  1. 2D proposals + PointNet

Reinforcement learning to choose proposals

Cascade Object Detection with Deformable Part Models

SSD: Single Shot MultiBox Detector

Proposal-free methods

Graph-based methods

Iterative Visual Reasoning Beyond Convolutions

Semantic Object Parsing with Graph LSTM

Semi-convolutional operators

  1. Instance coloring.
  2. y = phi(x) + (u, v)
  3. Minimize ||y - c_k||, the distance to the instance centroid: attractive force only (sketch below).
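
A minimal PyTorch sketch of the idea, assuming a 2-channel CNN output `phi` of shape (2, H, W) and an integer instance mask `inst`; all names are hypothetical:

```python
import torch

def semiconv_attractive_loss(phi: torch.Tensor, inst: torch.Tensor) -> torch.Tensor:
    H, W = inst.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=phi.dtype),
                          torch.arange(W, dtype=phi.dtype), indexing="ij")
    y = phi + torch.stack([u, v])          # y = phi(x) + (u, v)
    losses = []
    for k in inst.unique():
        pix = y[:, inst == k]              # embeddings of instance k: (2, n_k)
        center = pix.mean(dim=1, keepdim=True)
        losses.append((pix - center).norm(dim=0).mean())  # pull toward centroid
    return torch.stack(losses).mean()
```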

Bottom-up Instance Segmentation using Deep Higher-Order CRFs

SLAM/SfM

Basics

Direct vs indirect

  1. Direct: minimize the photometric (color) error of the projection.
  2. Indirect: minimize the geometric (reprojection) error of the projection.

Feature extraction

  1. A feature point is often represented by an oriented planar texture patch.
  2. SURF
  3. FAST

Three paradigms

  1. (Extended) Kalman filter
  2. Particle filter
  3. Graph-based

Bayes filter

  1. Prediction step (motion model): bel'(x_t) = ∫ p(x_t | u_t, x_{t-1}) bel(x_{t-1}) dx_{t-1}
  2. Correction step (sensor/observation model): bel(x_t) = η p(z_t | x_t) bel'(x_t), where η is a normalizer (sketch below).
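
A minimal sketch of one discrete (histogram) Bayes filter step, assuming tabulated motion and sensor models; names are hypothetical:

```python
import numpy as np

def bayes_filter_step(bel: np.ndarray, motion: np.ndarray,
                      likelihood: np.ndarray) -> np.ndarray:
    """bel[j] = bel(x_{t-1} = j); motion[i, j] = p(x_t = i | u_t, x_{t-1} = j);
    likelihood[i] = p(z_t | x_t = i)."""
    bel_bar = motion @ bel          # prediction: sum over x_{t-1}
    bel_new = likelihood * bel_bar  # correction: weight by the observation
    return bel_new / bel_new.sum()  # eta: normalize to a distribution
```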

(Extended) Kalman filter

  1. A Bayes filter.
  2. Optimal for linear models with Gaussian distributions.
  3. Prediction step: x_t (state) = A_t x_{t-1} + B_t u_t (control) + ε_t (see the sketch after this list).
  4. Correction step: z_t (predicted observation) = C_t x_t + δ_t.
  5. Noise smoothing (improves noisy measurements) + state estimation (for state feedback) + recursive (computes the next estimate using only the most recent measurement).
  6. Marginals and conditionals of Gaussians are still Gaussian.
  7. Extended: local linearization (at the current best estimate) of non-linear functions. (The matrix inversion is the bottleneck.)
  8. Unscented: sampling techniques to find an approximated Gaussian distribution.
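
A minimal numpy sketch of one linear Kalman filter step under the model above; R and Q are the assumed motion and observation noise covariances:

```python
import numpy as np

def kalman_step(mu, Sigma, u, z, A, B, C, R, Q):
    # Prediction: x_t = A_t x_{t-1} + B_t u_t + epsilon
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + R
    # Correction: predicted observation z_t = C_t x_t + delta
    K = Sigma_bar @ C.T @ np.linalg.inv(C @ Sigma_bar @ C.T + Q)  # Kalman gain
    mu_new = mu_bar + K @ (z - C @ mu_bar)
    Sigma_new = (np.eye(len(mu_bar)) - K @ C) @ Sigma_bar
    return mu_new, Sigma_new
```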

Grid map

  1. Discretize the map into cells (occupied or free space).
  2. Non-parametric model.
  3. Assumptions: cells are binary, static, and independent; poses are known.
  4. Binary Bayes filter (for static state); correction step only (sketch below).
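
A minimal sketch of the correction-only update in log-odds form, assuming an inverse sensor model p(occupied | z) per cell and a uniform prior; names are hypothetical:

```python
import numpy as np

def update_logodds(l: np.ndarray, p_occ_given_z: np.ndarray) -> np.ndarray:
    """Binary Bayes filter for static cells: add the measurement's log-odds."""
    return l + np.log(p_occ_given_z / (1.0 - p_occ_given_z))

def occupancy_prob(l: np.ndarray) -> np.ndarray:
    """Recover p(occupied) from log-odds."""
    return 1.0 - 1.0 / (1.0 + np.exp(l))
```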

Bundle adjustment

  1. Levenberg-Marquardt (LM) algorithm.

Indirect methods

MonoSLAM

  1. Designed for a small volume (a room) over a long session.
  2. First real-time monocular SLAM system.
  3. Probabilistic filtering of a joint state consisting of camera and scene feature position estimates.

PTAM

  1. Separate tracking and mapping into two parallel threads.
  2. Mapping is based on keyframes processed using batch techniques (bundle adjustment).
  3. The map is densely initialized from a stereo pair (5-point algorithm).
  4. New points are initialized by epipolar search.
  5. A large number of points are mapped.
  6. Bundle adjustment + robust n-point pose estimation.

KinectFusion

  1. PTAM + dense reconstruction
  2. Projective TSDF (easy to parallelize, but exact only at the surface).
  3. Moving average for the surface (TSDF) update (sketch below).
  4. Surface measurement (V, N) -> Projective TSDF -> V, N -> pose (frame to model).
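
A minimal sketch of the weighted moving-average TSDF update per voxel, with a hypothetical weight cap `w_max`:

```python
import numpy as np

def fuse_tsdf(D, W, d_new, w_new, w_max=100.0):
    """Running weighted average of TSDF values (KinectFusion-style fusion)."""
    D_fused = (W * D + w_new * d_new) / (W + w_new)
    W_fused = np.minimum(W + w_new, w_max)   # cap so the map stays adaptive
    return D_fused, W_fused
```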

DynamicFusion

  1. Coarse 6D warp-field to model the dynamic motion.
  2. Estimation of the volumetric model-to-frame warp field -> fusion of the live frame depth map into the canonical space -> adaptation of the warp field to capture new geometry.

EKF-SLAM

  1. Estimate pose and landmark locations (represented in the state space).
  2. Assumption: known correspondences.

Fusion++: Volumetric Object-Level SLAM

  1. Object-based map representation. Use Mask R-CNN to predict object-level TSDF for initialization.
  2. Predict foreground probability for rendering.

Direct methods

LSD-SLAM

  1. Pose-graph of keyframes with semi-dense depth maps.
  2. Filtering over a large number of pixelwise small-baseline stereo comparisons.
  3. Tracking with sim(3) (detecting scale-drift explicitly).
  4. Initialized with a random depth map and large variance.
  5. Re-weighted Gauss-Newton optimization.

DSO (Direct Sparse Odometry)

  1. Direct + Sparse
  2. Points are well-distributed. Divide the image into 32x32 blocks and select one pixel inside each block with large gradient.

DTAM (Dense Tracking and Mapping)

  1. Incrementally construct cost volume and minimize energy for dense mapping.
  2. Dense tracking.

CodeSLAM

  1. Use a sparse code to represent depth.
  2. Linear depth decoder (no ReLU). Jacobian of the decoder w.r.t the code can be computed.

Stereo

Global optimization

  1. SGM (semi-global matching) / loopy BP: generalize the chain case, for which an exact solution exists via dynamic programming (see the sketch after this list).
  2. Graph cuts: generalize the binary case, for which an exact solution exists when the disparity d takes only two values.
  3. TRW-S
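
A minimal sketch of the exact chain case that SGM and loopy BP generalize: MAP inference on a chain MRF via dynamic programming (Viterbi); the cost tables are hypothetical inputs:

```python
import numpy as np

def chain_map(unary: np.ndarray, pairwise: np.ndarray) -> np.ndarray:
    """unary: (N, L) per-pixel label costs; pairwise: (L, L) transition costs."""
    N, L = unary.shape
    cost = unary[0].copy()
    back = np.zeros((N, L), dtype=int)
    for i in range(1, N):                     # forward pass: accumulate costs
        total = cost[:, None] + pairwise      # (L_prev, L_cur)
        back[i] = total.argmin(axis=0)
        cost = total.min(axis=0) + unary[i]
    labels = np.empty(N, dtype=int)
    labels[-1] = cost.argmin()
    for i in range(N - 1, 0, -1):             # backtrack the optimal labeling
        labels[i - 1] = back[i, labels[i]]
    return labels
```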

PMVS

  1. Uniform coverage by finding top-k local maxima from each image block (32 x 32 blocks).
  2. Matching -> expansion -> filtering -> polygonal surface reconstruction.

Manhattan-world Stereo

  1. Dominant axes + hypothesis planes + optimization

Correspondence

LIFT: Learned Invariant Feature Transform

  1. Detector + orientation + descriptor

GeoDesc

  1. Integrate geometry constraints (based on GT surface normals).

Learning Good Correspondences

  1. Input a list of all possible matching pairs (coordinates only) and predict if each pair is valid or not.

NetVLAD: CNN Architecture for Weakly Supervised Place Recognition

  1. V(k) = Σ_i w_k(x_i)(x_i − c_k), where k = 1, …, K indexes the K cluster centers and w_k(x_i) is a soft-assignment weight (sketch below).
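
A minimal numpy sketch of the aggregation, assuming local descriptors `X` (N × D), cluster centers `C` (K × D), and a softmax-style soft assignment; NetVLAD makes this assignment learnable:

```python
import numpy as np

def vlad(X: np.ndarray, C: np.ndarray, alpha: float = 10.0) -> np.ndarray:
    resid = X[:, None, :] - C[None, :, :]         # (N, K, D) residuals x_i - c_k
    w = np.exp(-alpha * (resid ** 2).sum(-1))     # soft assignment (N, K)
    w /= w.sum(axis=1, keepdims=True)
    V = (w[:, :, None] * resid).sum(axis=0)       # (K, D): V(k) = sum_i w_k(x_i)(x_i - c_k)
    return (V / np.linalg.norm(V)).ravel()        # L2-normalized global descriptor
```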

Detect-to-Retrieve: Efficient Regional Aggregation for Image Search

  1. Regional VLAD/ASMK

Efficient Deep Learning for Stereo Matching

  1. Inner product layer at the end.
  2. Classify possible disparities.

Learning Correspondence from the Cycle-consistency of Time

  1. Find the most similar patch forward and backward.

End-to-End Learning of Geometry and Context for Deep Stereo Regression

DeepMVS: Learning Multi-view Stereopsis

  1. Cost volume based on plane-sweep and photometric difference. Volume aggregation for multiple views.

MVSNet: Depth Inference for Unstructured Multi-view Stereo

  1. Use the variance across views to construct the cost volume (sketch below). Use a 3D CNN for regularization.
  2. Depth refinement based on deep image matting.
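
A minimal PyTorch sketch of the variance-based aggregation, assuming `feats` stacks the V warped feature volumes as (V, C, D, H, W):

```python
import torch

def variance_cost_volume(feats: torch.Tensor) -> torch.Tensor:
    """Cost = element-wise variance across views; identical views give zero cost."""
    mean = feats.mean(dim=0)
    return (feats ** 2).mean(dim=0) - mean ** 2
```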

Recurrent MVSNet for High-resolution Multi-view Stereo Depth Inference

  1. Use GRU to process the cost volume sequentially. Concatenate all the outputs to get the regularized cost volume.

3D correspondences

PPFNet: Global Context Aware Local Features for Robust 3D Point Matching

  1. Point pair features
  2. PPF-FoldNet: rotation invariant

Scene Understanding

Neural Scene De-rendering

  1. REINFORCE algorithm based on rendered images

Learning to Parse Wireframes in Images of Man-Made Environments

  1. Junctions + post-processing

Im2Pano3D

  1. Single image -> 360 degree panorama.

Single-View 3D Scene Parsing by Attributed Grammar

Indoor Segmentation and Support Inference from RGBD Images

Scene completion

Semantic Scene Completion from a Single Depth Image

  1. SUNCG
  2. TSDF input.

ScanComplete: Large-Scale Scene Completion and Semantic Segmentation for 3D Scans

Layered scene decomposition

  1. Layered Scene Decomposition via the Occlusion-CRF
  2. Layer-structured 3D Scene Inference via View Synthesis

Single-image Reconstruction

Multi-view Consistency as Supervisory Signal for Learning Shape and Pose Prediction

Factoring Shape, Pose, and Layout from the 2D Image of a 3D Scene

Automatic Photo Pop-up

Unfolding an Indoor Origami World

Perspective Transformer Nets: Learning Single-View 3D Object Reconstruction without 3D Supervision

  1. Image -> voxel -> projection (supervision)

Pixels, Voxels, and Views: A Study of Shape Representations for Single View 3D Object Shape Prediction

  1. Multi-surface representations generalize better than voxel-based representations. They also look better (higher resolution) and can capture some thin structures, though the post-processing step (surface reconstruction) might discard them.
  2. Viewer-centered generalizes better than object-centered. It gives good shape prediction but poor pose prediction. Object-centered tends to memorize the observed meshes, and its learned features can be used for object recognition.
  3. The model trained to predict shape and pose can be fine-tuned for object recognition and may generalize better.

FrameNet: Learning Local Canonical Frames of 3D Surfaces from a Single RGB Image

  1. Predict per-pixel dominant directions (frames), which could be used for other applications.

Learning View Priors for Single-view 3D Reconstruction

  1. Deformable mesh model (from Neural 3D Mesh Renderer).
  2. Use rendered views (for both observed and unseen views) to add discriminative loss.
  3. Internal pressure loss to encourage larger volume.

MeshDepth: Disconnected Mesh-based Deep Depth Prediction

  1. Generate a 2D mesh based on Canny edges.
  2. Estimate plane parameters for each face.

Depth estimation

SURGE: Surface Regularized Geometry Estimation from a Single Image

  1. CNN + DenseCRF

Monocular Depth Estimation using Neural Regression Forest

DeMoN: Depth and Motion Network for Learning Monocular Stereo

  1. Bootstrap net + iterative net + refinement net
  2. Flow-based image warping for iterative refinement.

Machine Learning

Architectures

Neural Module Networks

Image Synthesis / GAN

Tricks

  1. https://github.com/linxi159/GAN-training-tricks

Adversarial Generator-Encoder Networks

View Synthesis by Appearance Flow

DRAW: A Recurrent Neural Network for Image Generation

Transfer learning

Cross Model Distillation for Supervision Transfer

  1. Similarity loss between internal features.
  2. Paired images of the same scene with different modalities.

Generic 3D Representation via Pose Estimation and Matching

  1. Proxy 3D tasks: object-centric camera pose estimation and wide-baseline feature matching.

Unsupervised Domain Adaptation by Backpropagation

  1. Domain confusion.

A Survey on Deep Transfer Learning

  1. Instance-based: find similar instances in the source domain.
  2. Mapping-based: map instances from the two domains into a new data space with better similarity.
  3. Network-based: reuse part of a network pre-trained on the source domain.
  4. Adversarial-based: use domain confusion to penalize the network if it predicts the domain correctly.

Unsupervised learning

Unsupervised Visual Representation Learning by Context Prediction.

  1. Predict relative location between patches.

One-shot learning

Learning Feed-Forward One-Shot Learners

  1. Train a learnet to predict the model parameters.
  2. Assess the model on another exemplar to predict whether the new exemplar is of the same class as the one used by the learnet.

Graph

GraphGAN: Generating Graphs via Random Walk

Deformable Graph Matching

Attention

Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition

Recurrent Models of Visual Attention

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Visualization

Understanding Deep Image Representations by Inverting Them

  1. Similar to DeepDream.

Object Detectors Emerge in Deep Scene CNNs

  1. Simplifying the input images.
  2. Visualizing the receptive fields of units and their activation patterns.
  3. Identifying the semantics of internal units.

Deep Convolutional Inverse Graphics Network

  1. AutoEncoder. The latent code is divided into segments.
  2. Only one attribute changes in each mini-batch.

3D Learning

Point cloud

Tangent Convolutions for Dense Prediction in 3D

  1. Project nearby points onto the tangent plane.

VoxelNet

  1. Divide point cloud into voxels and process points inside each voxel using a PointNet.

Point Convolutional Neural Networks by Extension Operators

  1. Apply extension operators to convert point cloud to volumetric representations (using basis functions).
  2. Process the volumetric representation and sample back to point cloud.

PointCNN

  1. K-nearest neighbors. Lift each neighbor into a new feature, and concatenate the lifted features with the current one.
  2. Learn a K×K transformation matrix to permute, and a standard convolution to process.

Recurrent Slice Networks for 3D Segmentation on Point Clouds

  1. Divide the point cloud into slices and use recurrent network to process slices sequentially.

Attentional ShapeContextNet for Point Cloud Recognition

  1. Non-local modules.

Escape from Cells: Deep Kd-Networks for the Recognition of 3D Point Cloud Models

Voxels

CSGNet

  1. Rendering + RL

Surfaces

DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation

  1. Auto-decoder that takes a 3D point and a shape code as input and predicts the SDF value at that point (sketch below).
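
A minimal PyTorch sketch of the decoder interface; the real network is deeper, and the shape codes are free variables optimized jointly with the weights. Sizes are hypothetical:

```python
import torch
import torch.nn as nn

class SDFDecoder(nn.Module):
    """(shape code, 3D point) -> signed distance."""
    def __init__(self, code_dim: int = 256, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(code_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Tanh())   # SDF squashed to (-1, 1)

    def forward(self, code: torch.Tensor, xyz: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([code, xyz], dim=-1))
```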

Learning Shape Templates with Structured Implicit Functions

  1. Use structured implicit shape representation (100x7 parameters) to represent the 3D surface.
  2. Classify sampled points as inside or outside based on the predicted parameters, which defines the loss.

2D-3D

Deep Continuous Fusion for Multi-Sensor 3D Object Detection

Graph

FeaStNet

  1. Learn a weight for each neighbor point (similarity).
  2. Compute the weighted sum of neighbor features (a non-local-style aggregation; sketch below).
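
A minimal PyTorch sketch of this aggregation step, assuming per-point features `feat` (N × C), neighbor indices `nbr` (N × K), and learned per-neighbor weights `w` (N × K); all names are hypothetical:

```python
import torch

def weighted_neighbor_sum(feat: torch.Tensor, nbr: torch.Tensor,
                          w: torch.Tensor) -> torch.Tensor:
    neighbors = feat[nbr]                             # (N, K, C) gathered features
    w = torch.softmax(w, dim=1)                       # normalize neighbor weights
    return (w.unsqueeze(-1) * neighbors).sum(dim=1)   # weighted summation
```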

Scan2Mesh: From Unstructured Range Scans to 3D Meshes

  1. Predict 100 vertices (set generation), read features from voxel grids, and use graph neural network to predict edges.
  2. Find face candidates in the dual graph, and use graph network to predict face existence.
  3. Generate training data using mesh simplification (https://github.com/kmammou/v-hacd).

Boxes

GRASS

  1. Recursive network (merging two parts into one node).
  2. Train an autoencoder and map the context feature to the latent representation for decoding.

Learning Shape Abstractions by Assembling Volumetric Primitives

  1. Voxel -> boxes
  2. Consistency loss + coverage loss. REINFORCE algorithm to allow an arbitrary number of primitives.

Tracking / Localization

Detect to Track and Track to Detect

Lost Shopping! Monocular Localization in Large Indoor Spaces

Learning Transformation Synchronization

  1. Predict weights between every pair of views, which are used for estimating the absolute pose.
  2. Repeat the process iteratively.

Human Pose Estimation

Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields

Lifting from the Deep: Convolutional 3D Pose Estimation from a Single Image

Structured Feature Learning for Pose Estimation

End-to-End Learning of Deformable Mixture of Parts and Deep Convolutional Neural Networks for Human Pose Estimation

  1. Message passing
