# The Unreasonable Effectiveness of Deep Learning

## Yann LeCun (http://yann.lecun.com) (http://videolectures.net/yann_lecun/) - Facebook AI Research & Center for Data Science, NYU

### VideoLectures.net location (http://videolectures.net/sahd2014_lecun_deep_learning/)

#### 55 Years of hand-crafted features

- The Traditional model of pattern recognition (since the late 50's)
    - Fixed/engineered features (or fixed kernel) + trainable classifier
        - Image -> hand-crafted Feature Extractor -> "Simple" Trainable Classifier
        - Build feature extractor -> Classifier -> train classifier
- Perceptron
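
A minimal sketch of this pipeline, assuming a toy 4-pixel image format and a made-up feature extractor (`handcrafted_features`, `train_perceptron`, and the data are illustrative, not from the lecture). The features are fixed by hand; only the perceptron weights are learned.

```python
def handcrafted_features(image):
    """Fixed, hand-designed extractor (assumed): mean of left and right pixel columns.
    The image is a flat list [top-left, top-right, bottom-left, bottom-right]."""
    left = (image[0] + image[2]) / 2.0
    right = (image[1] + image[3]) / 2.0
    return (left, right)

def train_perceptron(features, labels, epochs=20):
    """Trainable classifier: classic perceptron updates on misclassified samples."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            if y * (w[0] * f[0] + w[1] * f[1] + b) <= 0:
                w = [w[0] + y * f[0], w[1] + y * f[1]]
                b += y
    return w, b

# Toy data (hypothetical): label +1 when the left column is brighter
images = [[1, 0, 1, 0], [0, 1, 0, 1], [0.8, 0.2, 0.9, 0.1], [0.3, 0.7, 0.1, 0.9]]
labels = [1, -1, 1, -1]

features = [handcrafted_features(img) for img in images]   # never trained
w, b = train_perceptron(features, labels)                   # the only learned part
predictions = [1 if w[0] * f[0] + w[1] * f[1] + b > 0 else -1 for f in features]
```

If the hand-crafted features were poorly chosen, no amount of classifier training would fix them, which is the limitation the rest of these notes address.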
                                                    

#### Architecture of "Classical" Recognition Systems

- "Classic" architecture for pattern recognition
    - Speech, and Object recognition (until recently)
    - Handwriting recognition (long ago)
    - Graphical model has latent variables (location of parts)
    - Fixed front-end feature extractor -> separate levels of features -> Classifier -> train classifier
    - The way of doing machine learning until about 2 years ago

#### Architecture of Deep Learning-Based Recognition Systems

- "Deep" architecture for pattern recognition
    - Speech, and Object recognition (recently)
    - Handwriting recognition (since the 1990s)
    - Convolutional Net with optional Graphical model on top
    - Trained purely supervised
    - Graphical model has latent variables (location of parts)
    - Stacking a bunch of parameterized functions in a supervised way
    - post-processing reasoning that would require some sort of optimization
    - conditional random field (http://en.wikipedia.org/wiki/Conditional_random_field)

#### Future Systems

- Globally-trained *deep architecture*
    - Handwriting recognition (since the mid 1990s)
    - All the modules are trained with a combination of unsupervised and supervised learning
    - End-to-end training == deep structured prediction
    - We would eventually integrate supervised and unsupervised learning (useful where we have a ton of data but not a lot of labeled data)

#### Deep Learning = Learning Hierarchical Representations

- we need a learning rule that works and scales (Boltzmann machines don't work well and require sampling)
- hierarchical architecture - the world is compositional
- notable local features are extracted (with high correlations between adjacent patches of image) 
- compositions of lower level features -> eliminate redundant variability in a signal by increasing the number of representations or decreasing the resolution of features in the next level
- It's *deep* if it has *more than one stage* of non-linear feature transformation
- Feature visualization of convolutional net trained on ImageNet:
    - Zeiler & Fergus 2013, *Visualizing and Understanding Convolutional Networks*, (http://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf)
    - Matthew Zeiler (https://scholar.google.com/citations?user=a2KklUoAAAAJ&hl=en&oi=sra)
    - Rob Fergus (https://scholar.google.com/citations?user=GgQ9GEkAAAAJ&hl=en)
    - convolutional neural networks (http://deeplearning.net/tutorial/lenet.html)

#### Trainable Feature Hierarchy

- Hierarchy of representations with increasing level of abstraction in all kinds of signals
- Each stage is a kind of trainable feature transform
- Image recognition:
    - Pixel -> edge -> texton -> motif -> part -> object
- Text:
    - character -> word -> word group -> clause -> sentence -> story
- Speech:
    - sample -> spectral band -> sound -> ... -> phone -> phoneme -> word

#### Learning Representations: a Challenge for ML, CV, AI, Neuroscience, Cognitive Science ...

- Data is made of high-dimensional hierarchical relationships - how do we learn to represent these relationships just by observing the data?
- How do we learn representations of the perceptual world?
    - How can a perceptual system build itself by looking at the world?
    - How much prior is necessary?
- ML/AI: how do we learn features or feature hierarchies?
    - What is the fundamental principle? What is the learning algorithm? What is the architecture?
- Neuroscience: how does the cortex learn perception?
    - Does the cortex "run" a single, general learning algorithm? (or a small number of them)
- CogSci: how does the mind learn abstract concepts on top of less abstract ones?
- Deep Learning addresses the problem of learning hierarchical representations with a single algorithm
    - or perhaps with a few algorithms

#### The Mammalian Visual Cortex is Hierarchical

- The ventral (recognition) pathway in the visual cortex has multiple stages 
- Retina - LGN - V1 - V2 - V4 - PIT - AIT ...
- Lots of intermediate representations
- (FIGURE) Van Essen & Gallant, 1994, *Neural Mechanisms of Form and Motion Processing in the Primate Visual System*, (http://cognitrn.psych.indiana.edu/busey/temp/illusoryconju%20refs/vanEssen_gallant94.pdf)
    - DC Van Essen (https://scholar.google.com/scholar?q=DC+Van+Essen&btnG=&hl=en&as_sdt=0%2C10)
    - JL Gallant (https://scholar.google.com/scholar?q=JL+Gallant&btnG=&hl=en&as_sdt=0%2C10)
- (FIGURE) S Thorpe, D Fize, C Marlot, 1996, *Speed of processing in the human visual system*, (http://vpl.uchicago.edu/pages/courses/sp2005/Thorpe96.pdf)
    - Simon Thorpe (https://scholar.google.com/citations?user=uR-7ex4AAAAJ&hl=en&oi=ao)
- "there is this sense that for very fast perception the brain uses mostly feedforward pathways with relatively little feedback, since it is just too fast."

#### Which Models are Deep?

- 2-layer models are not deep (even if you train the first layer)
    - BECAUSE THERE IS NO FEATURE HIERARCHY
- Neural nets with 1 hidden layer are not deep
- SVMs and Kernel methods are not deep
    - Layer1: kernels; Layer2: linear
    - The first layer is "trained" with the simplest unsupervised method ever devised: using the samples as templates for the kernel functions
    - "*glorified template matching*"
- Classification trees are not deep
    - No hierarchy of features. All decisions are made in input space
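
The "glorified template matching" remark can be made concrete with a minimal sketch, assuming a Gaussian (RBF) kernel and hand-picked second-layer weights. The names `rbf_layer` and `kernel_machine`, the templates, and the `alphas` are all illustrative; a real SVM would learn the weights and keep only the support vectors.

```python
import math

def rbf_layer(x, templates, gamma=1.0):
    """Layer 1: compare the input with every stored training sample (template)."""
    return [math.exp(-gamma * sum((xi - ti) ** 2 for xi, ti in zip(x, t)))
            for t in templates]

def kernel_machine(x, templates, alphas, bias=0.0, gamma=1.0):
    """Layer 2: a linear combination of the template-matching scores."""
    scores = rbf_layer(x, templates, gamma)
    return sum(a * s for a, s in zip(alphas, scores)) + bias

# Hypothetical training samples used directly as templates
templates = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
alphas = [1.0, 1.0, -1.0, -1.0]  # assumed weights: the +1 / -1 class labels

near_first = kernel_machine((0.1, 0.0), templates, alphas, gamma=4.0)
near_third = kernel_machine((0.0, 0.9), templates, alphas, gamma=4.0)
```

An input close to a positive template gets a positive score and one close to a negative template gets a negative score; there is no intermediate representation between the template layer and the linear layer.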

### What are Good Features?

#### Discovering the Hidden Structure in High-Dimensional Data: The Manifold Hypothesis

- Learning Representations of Data:
    - Discovering and disentangling the independent explanatory factors
- The Manifold Hypothesis:
    - Natural data lives in a low-dimensional (non-linear) manifold
    - Because variables in natural data are mutually dependent
- Example: all face images of a person
    - 1000x1000 pixels = 1,000,000 dimensions
    - But the face has 3 cartesian coordinates and 3 Euler angles
    - And humans have less than about 50 muscles in the face
    - Hence the manifold of face images for a person has <56 dimensions
- The perfect representations of a face image:
    - The coordinates on the face manifold
    - The coordinates away from the manifold
- We do not have good, general methods to learn functions that turn an image into this kind of representation

#### Basic Idea for Invariant Feature Learning

- Embed the input *non-linearly* into a high(er) dimensional space
    - In the new space, things that were non separable may become separable
- Pool regions of the new space together
    - Bringing together things that are semantically similar. Like pooling.
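
A small illustration of the first step, assuming XOR-style toy data and a hand-chosen embedding (`embed` and the weight vector `w` are illustrative). No single line separates these four points in 2-D, but after appending the product of the coordinates one hyperplane does.

```python
def embed(x1, x2):
    """Non-linear embedding from 2-D to 3-D: append the product x1 * x2."""
    return (x1, x2, x1 * x2)

# XOR-style data with coordinates in {-1, +1}: label is +1 when the signs differ
points = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
labels = [-1, 1, 1, -1]

def linear_score(z, w, b=0.0):
    """Score of a plain linear classifier in the embedded space."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

# In the embedded space a single hyperplane separates the two classes
w = (0.0, 0.0, -1.0)
predictions = [1 if linear_score(embed(*p), w) > 0 else -1 for p in points]
```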

#### Sparse Non-Linear Expansion -> Pooling

#### Overall Architecture: multiple stages of Normalization -> Filter Bank -> Nonlinearity -> Pooling

- Normalization: variation on whitening (optional)
    - Subtractive: average removal, high pass filtering
    - Divisive: local contrast normalization, variance normalization
- Filter Bank: dimension expansion, projection on overcomplete basis
- Non-Linearity: sparsification, saturation, lateral inhibition
    - Rectification (ReLU), Component-wise shrinking, tanh
    $$ReLU(x)=\max(x,0)$$
- Pooling: aggregation over space or feature type
    $$MAX: \max_i(X_i); \quad L_p: \sqrt[p]{\sum_i X_{i}^{p}}; \quad PROB: \frac{1}{b} \log \Big(\sum_i \mathrm{e}^{bX_i}\Big)$$
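
The non-linearity and the three pooling operators can be sketched in a few lines of plain Python. This is a toy version over a flat list of values, assuming non-negative inputs for the $L_p$ case and a scale parameter `b` for the log-sum-exp ("PROB") case; the function names are illustrative.

```python
import math

def relu(x):
    """Half-wave rectification."""
    return max(x, 0.0)

def max_pool(xs):
    return max(xs)

def lp_pool(xs, p=2):
    """L_p pooling: p-th root of the sum of p-th powers (inputs assumed non-negative)."""
    return sum(x ** p for x in xs) ** (1.0 / p)

def log_sum_exp_pool(xs, b=1.0):
    """Soft maximum: (1/b) * log(sum(exp(b * x))); approaches max as b grows."""
    m = max(xs)  # subtract the max for numerical stability
    return m + math.log(sum(math.exp(b * (x - m)) for x in xs)) / b

# One pooling region after rectification
region = [relu(v) for v in (-1.0, 3.0, 4.0, 0.5)]
```

Log-sum-exp pooling is never smaller than max pooling and tightens toward it as `b` increases, so `b` acts as a knob between averaging-like and max-like behavior.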

#### Deep Nets with ReLUs and Max Pooling

- Stack of Linear transforms interspersed with Max operators
- Point-wise ReLUs: $$ReLU(x)=\max(x,0)$$
- Max Pooling
    - "switches" from one layer to the next

#### Supervised Training: Stochastic (Sub)Gradient Optimization

To compute all the derivatives, we use a backward sweep called the back-propagation algorithm, which uses the recurrence for $\frac{\partial E}{\partial X_i}$:


$$\frac{\partial E}{\partial X_n} = \frac{\partial C(X_n,Y)}{\partial X_n}$$

$$\frac{\partial E}{\partial X_{n-1}} = \frac{\partial E}{\partial X_n}\frac{\partial F_n(X_{n-1},W_n)}{\partial X_{n-1}}$$

$$\frac{\partial E}{\partial W_{n}} = \frac{\partial E}{\partial X_n}\frac{\partial F_n(X_{n-1},W_n)}{\partial W_{n}}$$

$$\frac{\partial E}{\partial X_{n-2}} = \frac{\partial E}{\partial X_{n-1}}\frac{\partial F_{n-1}(X_{n-2},W_{n-1})}{\partial X_{n-2}}$$

$$\frac{\partial E}{\partial W_{n-1}} = \frac{\partial E}{\partial X_{n-1}}\frac{\partial F_{n-1}(X_{n-2},W_{n-1})}{\partial W_{n-1}}$$

- .... etc, until we reach the first module
- we now have all the $\frac{\partial E}{\partial W_i}$ for $ i \in [1,n]$
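
The backward sweep can be sketched on a stack of scalar modules. This is a minimal illustration, assuming each module is $F_i(X_{i-1}, W_i) = \tanh(W_i X_{i-1})$ and the cost is $C = \frac{1}{2}(X_n - Y)^2$; these choices, the function names, and the numbers are assumptions made for the example. A finite-difference check is included to confirm the sweep.

```python
import math

def forward(x0, weights):
    """Run the stack F_i(X_{i-1}, W_i) = tanh(W_i * X_{i-1}); return states X_0..X_n."""
    states = [x0]
    for w in weights:
        states.append(math.tanh(w * states[-1]))
    return states

def cost(xn, y):
    return 0.5 * (xn - y) ** 2

def backward(states, weights, y):
    """Backward sweep from the last module to the first; returns every dE/dW_i."""
    grad_x = states[-1] - y            # dE/dX_n = dC/dX_n
    grad_w = [0.0] * len(weights)
    for i in reversed(range(len(weights))):   # weights[i] is W_{i+1} in the notes
        x_prev, x_out = states[i], states[i + 1]
        local = 1.0 - x_out ** 2       # derivative of tanh at the module output
        grad_w[i] = grad_x * local * x_prev        # dE/dW = dE/dX_out * dF/dW
        grad_x = grad_x * local * weights[i]       # dE/dX_prev = dE/dX_out * dF/dX_prev
    return grad_w

weights = [0.5, -1.2, 0.8]
x0, y = 0.7, 0.3
grads = backward(forward(x0, weights), weights, y)

def numeric_grad(i, eps=1e-6):
    """Central finite difference on one weight, used only to check the sweep."""
    up = list(weights); up[i] += eps
    down = list(weights); down[i] -= eps
    return (cost(forward(x0, up)[-1], y) - cost(forward(x0, down)[-1], y)) / (2 * eps)
```

Each module only needs its own two local derivatives; the sweep reuses the running `grad_x` so the total cost is about the same as one forward pass.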

#### Loss Function for a Simple Network

- 1-1-1 network
    - $ Y = W_1*W_2*X $
- Trained to compute the identity function with quadratic loss
    - Single sample $X=1$, $Y=1$
    - $L(W) = (1-W_1*W_2)^2$
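
This loss is simple enough to explore directly. The sketch below is an illustration with an assumed starting point and learning rate; it evaluates $L(W)$ and its gradient and runs plain gradient descent. The loss has a saddle point at the origin and a continuum of minima wherever $W_1 W_2 = 1$.

```python
def loss(w1, w2):
    """Quadratic loss of the 1-1-1 linear network on the single sample X = 1, Y = 1."""
    return (1.0 - w1 * w2) ** 2

def grad(w1, w2):
    """Analytic gradient (dL/dW1, dL/dW2)."""
    r = 1.0 - w1 * w2
    return (-2.0 * r * w2, -2.0 * r * w1)

# Plain gradient descent from an arbitrary (assumed) starting point
w1, w2, lr = 0.5, 1.5, 0.1
for _ in range(500):
    g1, g2 = grad(w1, w2)
    w1, w2 = w1 - lr * g1, w2 - lr * g2
```

At the origin the gradient vanishes, yet the loss decreases along the direction $W_1 = W_2$ and increases along $W_1 = -W_2$, so even this tiny network has a non-convex objective.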

#### Deep Nets with ReLUs

- Single Output:
     $$\hat{Y} = \sum\limits_{P} \delta_{P} (W,X) (\prod\limits_{(ij) \in P}W_{ij}) X_{P_{start}}$$
- $W_{ij}$ weight from *j* to *i*
- P: path in network from input to output
    - *P = (3,(14,3),(22,14),(31,22))*
- $\delta_i$: 1 if *ReLU i* is linear, 0 if saturated.
- $X_{P_{start}}$: input unit for path *P*
- $\delta_P(W,X)$: 1 if path *P* is "active", 0 if inactive
- Input-output function is piece-wise linear
- Polynomial in *W* with random coefficients
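
The path-sum formula can be checked numerically on a tiny network. The sketch below assumes hidden layers with ReLUs and a linear output unit, and treats a path as active when every ReLU on it is in its linear regime; the layer sizes and weights are made up for the example.

```python
from itertools import product

def relu_forward(x, layers):
    """Forward pass; layers[l][i][j] is the weight from unit j to unit i.
    Hidden layers use ReLU, the final layer is linear. Returns output and ReLU gates."""
    gates, a = [], list(x)
    for l, W in enumerate(layers):
        z = [sum(wij * aj for wij, aj in zip(row, a)) for row in W]
        if l < len(layers) - 1:
            gates.append([1.0 if zi > 0 else 0.0 for zi in z])
            a = [max(zi, 0.0) for zi in z]
        else:
            a = z
    return a[0], gates

def path_sum(x, layers, gates):
    """Sum over all input-to-output paths of delta_P * (product of weights) * X_start."""
    sizes = [len(x)] + [len(W) for W in layers]
    total = 0.0
    for path in product(*[range(n) for n in sizes]):
        term = x[path[0]]
        for l, W in enumerate(layers):
            term *= W[path[l + 1]][path[l]]
            if l < len(layers) - 1:
                term *= gates[l][path[l + 1]]   # delta_i of the ReLU on this path
        total += term
    return total

layers = [
    [[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]],    # 2 inputs -> 3 ReLU units
    [[1.0, -0.5, 0.25], [-1.25, 0.75, 1.0]],     # 3 -> 2 ReLU units
    [[0.6, -0.4]],                               # 2 -> 1 linear output
]
x = [1.0, 2.0]
y_hat, gates = relu_forward(x, layers)
y_paths = path_sum(x, layers, gates)
```

For a fixed pattern of gates the output is linear in `x` and a product of one weight per layer along each path, which is why the objective becomes a piecewise polynomial in the weights.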

#### Deep Convolutional Nets (and other deep neural nets)

- Training sample: $(X^k,Y^k)$, *k = 1 to K*
- Objective function (with margin-type loss = ReLU)

$$L(W) = \sum\limits_{k} ReLU \Big(1 - Y^{k} \sum\limits_{P} \delta_{P}(W,X^{k})\big(\prod\limits_{(ij) \in P} W_{ij}\big)X_{P_{start}}^{k}\Big)$$

$$L(W) = \sum\limits_{k} \sum\limits_{P} (X_{P_{start}}^{k} Y^{k}) \delta_{P}(W,X^{k})\big(\prod\limits_{(ij) \in P} W_{ij}\big)$$

$$L(W) = \sum\limits_{P} \Big[\sum\limits_{k} (X_{P_{start}}^{k} Y^{k}) \delta_{P}(W,X^{k})\Big]\big(\prod\limits_{(ij) \in P} W_{ij}\big)$$

$$L(W) = \sum\limits_{P} C_{P}(X, Y,W) \big(\prod\limits_{(ij) \in P} W_{ij}\big)$$

- Polynomial in *W* of degree *l* (number of adaptive layers)
- Continuous, piece-wise polynomial with "switched" and partially random coefficients
    - Coefficients are switched in and out depending on *W*

#### Deep Nets with ReLUs: Objective Function is Piecewise Polynomial

- If we use a hinge loss, delta now depends on label $Y^k$
$$L(W) = \sum\limits_{P} C_{P}(X, Y,W) (\prod\limits_{(ij) \in P} W_{ij})$$
- Piecewise polynomial in *W* with random coefficients
- A lot is known about the distribution of critical points of polynomials on the sphere with random (Gaussian) coefficients
    - High-order spherical spin glasses
    - Random matrix theory

### Convolutional Networks

Le Cun, et al. 1989, *Handwritten Zip Code Recognition with Multilayer Networks*, (http://yann.lecun.com/exdb/publis/pdf/lecun-90e.pdf)

#### Early Hierarchical Feature Models for Vision

- Hubel & Wiesel, 1962, *Receptive Fields, Binocular Interaction and Functional Architecture in the Cat's Visual Cortex* (http://corevision.cns.nyu.edu/~tony/vns/readings/hubel-wiesel-1962.pdf)
    - *simple cells* detect local features
    - *complex cells* "pool" the outputs of simple cells within a retinotopic neighborhood.
- Cognitron & Neocognitron [Fukushima 1974-1982], https://scholar.google.com/scholar?q=K+Fukushima&btnG=&hl=en&as_sdt=0%2C10

#### The Convolutional Net Model (Multistage Hubel-Wiesel system)

- Training is supervised
- With stochastic gradient descent
- LeCun et al, 1989, *Optimal Brain Damage*, http://www.cnbc.cmu.edu/~plaut/IntroPDP/papers/LeCunDenkerSolla90NIPS.pdf
- LeCun et al, 1989, *Generalization and Network Design Strategies*, http://masters.donntu.org/2012/fknt/umiarov/library/lecun.pdf
- LeCun et al, 1998, *Gradient-based learning applied to document recognition*, https://itb.biologie.hu-berlin.de/~zito/teaching/CNSIII-2006/proj6/proj6_2.pdf

- Non-linearity: half-wave rectification (ReLU), shrinkage function, sigmoid
- Pooling: max, average, L1, L2, log-sum-exp
- *Training*: Supervised (1988-2006), Unsupervised + Supervised (2006-Now)

## Brute Force Approach to Multiple Object Recognition

### Idea #1: Sliding Window ConvNet + Weighted FSM

- "Space Displacement Neural Net".
- Convolutions are applied to a large image
- Output and feature maps are extended/replicated accordingly

## Convolutional Networks in Visual Object Recognition

### We knew ConvNet worked well with characters and small images

- Traffic Sign Recognition (GTSRB)
    - German Traffic Sign Recognition Benchmark
    - 99.2% accuracy (IDSIA)
- House Number Recognition (Google)
    - Street View House Numbers
    - 94.3% accuracy (NYU)

### NORB Dataset (2004): 5 categories, multiple views and illuminations

- Less than 6% error on test set with cluttered backgrounds
- 291,600 training samples
- 58,320 test samples

### mid 2000s: state of the art results on face detection

- (Review) Tolba et al. 2006, *Face Recognition: A Literature Review*, http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.179.2182&rep=rep1&type=pdf
- Vaillant et al. IEE 1994, *An original approach for the localization of objects in images*, http://digital-library.theiet.org/content/journals/10.1049/ip-vis_19941301
- Osadchy et al. 2004, *Efficient detection under varying illumination conditions and image plane rotations*, http://rita.osadchy.net/papers/cviu.pdf
- Simultaneous face detection and pose estimation
    - Osadchy et al. JMLR 2007, *Synergistic Face Detection and Pose Estimation with Energy-Based Models*, http://www.jmlr.org/papers/volume8/osadchy07a/osadchy07a.pdf

### Visual Object Recognition with Convolutional Nets

- In the mid 2000s, ConvNets were getting decent results on object recognition
- Dataset: "Caltech101"
    - 101 categories
    - 30 training samples per category
- But the results were slightly worse than more "traditional" computer vision methods, so we couldn't beat the state of the art, because
    1. the datasets were too small
    2. the computers were too slow
- *But we learned that rectification and max pooling are useful!*
    - Jarrett et al. ICCV 2009, *What is the Best Multi-Stage Architecture for Object Recognition?*, http://yann.lecun.com/exdb/publis/pdf/jarrett-iccv-09.pdf
        - "show that using non-linearities that include rectification and local contrast normalization is the single most important ingredient for good accuracy on object recognition benchmarks.
        - Most surprisingly, we show that a two-stage system with *random filters* can yield almost 63% recognition rate on Caltech-101, provided that the proper non-linearities and pooling layers are used.
        - Finally, we show that with supervised refinement, the system achieves state-of-the-art performance on NORB dataset and unsupervised pre-training followed by supervised refinement produced good accuracy on Caltech-101."
 
- Object Recognition: Krizhevsky, Sutskever, Hinton 2012, *ImageNet Classification with Deep Convolutional Neural Networks*, (http://www.cs.toronto.edu/~fritz/absps/imagenet.pdf)
    - Won the 2012 ImageNet LSVRC. 60 million parameters, 832M MAC ops
    - Employed dropout
    - Method: large convolutional net
        - 650K neurons, 832M synapses, 60M parameters
        - Trained with backprop on GPU
        - Trained "with all the tricks Yann came up with in the last 20 years, plus dropout"
        - Rectification, contrast normalization
    - Error rate: 15% (top-5: an error is counted whenever the correct class isn't among the 5 highest-scoring predictions)
    - Previous state of the art: 25%
    - A REVOLUTION IN COMPUTER VISION
    - acquired by Google in Jan 2013
    - Deployed in Google+ Photo Tagging in May 2013
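
The top-5 error quoted above is straightforward to compute. A minimal sketch follows, with made-up scores over 8 classes rather than real ImageNet outputs; `top_k_error` is an illustrative name.

```python
def top_k_error(score_rows, true_labels, k=5):
    """Fraction of samples whose true class is not among the k highest-scoring classes."""
    misses = 0
    for scores, label in zip(score_rows, true_labels):
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            misses += 1
    return misses / len(true_labels)

# Hypothetical scores for 2 samples over 8 classes
scores = [
    [0.1, 0.9, 0.3, 0.2, 0.8, 0.7, 0.6, 0.05],   # true class 4 is ranked 2nd: a hit
    [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2],    # true class 7 is ranked last: a miss
]
labels = [4, 7]
error = top_k_error(scores, labels)
```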

#### Then, two things happened ...

- The ImageNet dataset 
    - Fei-Fei et al 2012, http://vision.stanford.edu/publications.html
    - 1.5 million training samples
    - 1000 categories
- Fast Graphical Processing Units (GPU)
    - Capable of 1 trillion operations/second
- ImageNet Large-Scale Visual Recognition Challenge http://www.image-net.org/challenges/LSVRC/2014/


#### NYU ConvNet Trained on ImageNet: OverFeat

- Sermanet et al. 2014, *OverFeat: Integrated Recognition, Localization and Detection Using Convolutional Neural Networks*, http://arxiv.org/pdf/1312.6229.pdf
- Trained on GPU using Torch7 (http://torch.ch)
- Uses a number of new tricks
- Classification 1000 categories:
    - 13.8% error (top 5) with an ensemble of 7 networks
    - 15.4% error (top 5) with a single network
- Classification+Localization
    - 30% error
- Detection (200 categories)
    - 19% correct
- Downloadable code (running, no training)
    - Search for "overfeatNYU" on Google (http://cilvr.nyu.edu/doku.php?id=code:start)

##### Classification + Localization: multiscale sliding window

- Apply convnet with a sliding window over the image at multiple scales
- Important note: it's very cheap to slide a convnet over an image
    - Traditional detectors/classifiers must be applied to every location on a large input image, at multiple scales.
    - Convolutional nets can be replicated over large images very cheaply
    - Simply apply the convolutions to the entire image and spatially replicate the fully-connected layers
- For each window, predict a class and bounding box parameters
    - Even if the object is not completely contained in the viewing window, the Convnet can predict where it thinks the object is
- Compute an "average" bounding box, weighted by scores

### ConvNets as Generic Feature Extractors

- Bo, Ren, Fox, 2013, *Multipath Sparse Coding Using Hierarchical Matching Pursuit*, http://research.cs.washington.edu/istc/lfb/paper/cvpr13.pdf
    - Liefeng Bo, http://research.cs.washington.edu/istc/lfb/, https://scholar.google.com/citations?user=FJwtMf0AAAAJ&hl=en
    - Xiaofeng Ren, https://scholar.google.com/citations?user=1KFFbEIAAAAJ&hl=en
    - Dieter Fox, https://scholar.google.com/citations?user=DqXsbPAAAAAJ&hl=en&oi=ao
- Sohn, Jung, Lee, Hero ICCV 2011, *Efficient Learning of Sparse, Distributed, Convolutional Feature Representations for Object Recognition*, http://web.eecs.umich.edu/~honglak/iccv2011-sparseConvLearning.pdf
    - Kihyuk Sohn https://scholar.google.com/citations?user=IIHEmDUAAAAJ&hl=en&oi=ao
    - Dae Yon Jung https://scholar.google.com/citations?user=xgQd1qgAAAAJ&hl=en
    - Honglak Lee https://scholar.google.com/citations?user=fmSHtE8AAAAJ&hl=en
    - Alfred Hero III https://scholar.google.com/citations?user=DSiNzkIAAAAJ&hl=en
- Razavian, Azizpour, Sullivan, Carlsson, 2014, *CNN features off-the-shelf: An astounding baseline for recognition* http://www.datascienceassn.org/sites/default/files/CNN%20Features%20off-the-shelf%20-%20an%20Astounding%20Baseline%20for%20Recognition.pdf
    - Ali Razavian https://scholar.google.com/citations?user=E3fqfDIAAAAJ&hl=en&oi=sra
    - Hossein Azizpour https://scholar.google.com/citations?user=t6CRgJsAAAAJ&hl=en&oi=sra
    - Josephine Sullivan https://scholar.google.com/citations?user=REbc02cAAAAJ&hl=en&oi=sra
    - http://www.csc.kth.se/cvap/cvg/DL/ots/

#### Other ConvNet Results

- Zeiler and Fergus, 2013, *Visualizing and Understanding Convolutional Networks*, http://www.cs.nyu.edu/~fergus/papers/zeilerECCV2014.pdf
- Donahue, Jia, Vinyals, et al. 2014, *Decaf: A deep convolutional activation feature for generic visual recognition*, http://www.eecs.berkeley.edu/~nzhang/papers/icml14_decaf.pdf
    - Jeff Donahue https://scholar.google.com/citations?user=UfbuDH8AAAAJ&hl=en&oi=sra
    - Yangqing Jia https://scholar.google.com/citations?user=mu5Y2rYAAAAJ&hl=en&oi=sra
    - Oriol Vinyals https://scholar.google.com/citations?user=NkzyCvUAAAAJ&hl=en
    - Judy Hoffman https://scholar.google.com/citations?user=3dlBGiQAAAAJ&hl=en
    - Ning Zhang https://scholar.google.com/citations?user=DplAah0AAAAJ&hl=en
    - Eric Tzeng https://scholar.google.com/citations?user=nABXo3sAAAAJ&hl=en
    - Trevor Darrell https://scholar.google.com/citations?user=bh-uRFMAAAAJ&hl=en
- Girshick et al, 2013, *Rich feature hierarchies for accurate object detection and semantic segmentation*, http://www.cs.berkeley.edu/~rbg/papers/r-cnn-arxiv.pdf
    - Ross Girshick https://scholar.google.com/citations?user=W8VIEZgAAAAJ&hl=en&oi=sra
    - Jitendra Malik https://scholar.google.com/citations?user=oY9R5YQAAAAJ&hl=en
- Oquab, et al 2013, *Learning and transferring mid-level image representations using convolutional neural networks*, http://www.di.ens.fr/willow/pdfscurrent/oquab14cvpr.pdf
    - Maxime Oquab
    - Leon Bottou https://scholar.google.com/citations?user=kbN88gsAAAAJ&hl=en&oi=sra
    - Ivan Laptev https://scholar.google.com/citations?user=-9ifK0cAAAAJ&hl=en&oi=sra
    - Josef Sivic
- Khan, et al 2014, *Automatic Feature Learning for Robust Shadow Detection*, http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6909646&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D6909646
    - Salman Khan https://scholar.google.com/citations?user=M59O9lkAAAAJ&hl=en&oi=sra
    - Mohammed Bennamoun https://scholar.google.com/citations?user=ylX5MEAAAAAJ&hl=en&oi=sra
    - Ferdous Sohel https://scholar.google.com/citations?user=Xj1MBQcAAAAJ&hl=en&oi=sra
- Sander Dieleman, Kaggle Galaxy Zoo Challenge 2014, http://benanne.github.io/2014/04/05/galaxy-zoo.html
    - https://scholar.google.com/citations?user=2ZU62T4AAAAJ&hl=en&oi=ao
    
- RESULTS COMPILATION: http://cs.nyu.edu/~sermanet/papers/Deep_ConvNets_for_Vision-Results.pdf

## Image Similarity Matching With Siamese Networks Embedding, DrLIM

### DrLIM: Metric Learning

#### Contrastive Loss function

### Siamese Architecture

#### Siamese architecture and loss function

#### Loss function 

### Face Recognition: DeepFace (Facebook AI Research)

### Depth Estimation from Stereo Pairs

## Body Pose Estimation

### Pose Estimation and Attribute Recovery with ConvNets

### Other Tasks for Which Deep Convolutional Nets are the Best

## Deep Learning and Convolutional Networks in Speech, Audio, and Signals

### Acoustic Modeling in Speech Recognition (Google)

### Speech Recognition with Convolutional Nets (NYU/IBM)

## Convolutional Networks in Image Segmentation, & Scene Labeling

### ConvNets for Image Segmentation

### ConvNet in Connectomics

### Semantic Labeling/Scene Parsing: Labeling every pixel with the object it belongs to

### Scene Parsing/Labeling: ConvNet Architecture

#### Method 1: majority over super-pixel regions

### Scene Parsing/Labeling: Performance

### Scene Parsing/Labeling: SIFT Flow dataset (33 categories)

### Temporal Consistency

### NYU RGB-D Dataset

### Labeling Videos

### Semantic Segmentation on RGB+D Images and Videos

### Commercial applications of Convolutional Nets

### Software Platform for Deep Learning: Torch7

## Unsupervised Learning

### Energy-Based Unsupervised Learning

#### Learning the Energy Function

#### Seven Strategies to Shape the Energy Function

##### 1. constant volume of low energy: Energy surface for PCA and K-means

##### 2. push down of the energy of data points, push up everywhere else

##### 3. push down of the energy of data points, push up on chosen locations

## Dictionary Learning with Fast Approximate Inference: Sparse Auto-Encoders

### Sparse Modeling: Sparse Coding + Dictionary Learning

## Learning to Perform Approximate Inference: Predictive Sparse Decomposition Sparse Auto-encoders

### Sparse auto-encoder: Predictive Sparse Decomposition (PSD)

### Regularized Encoder-Decoder Model (auto-Encoder) for Unsupervised Feature Learning

### PSD: Basis Functions on MNIST

### Predictive Sparse Decomposition (PSD): Training

### Learned Features on natural patches: V1-like receptive fields

## Learning to Perform the Approximate Inference: LISTA

### Better Idea: Give the "right" structure to the encoder

### LISTA: Train We and S matrices to give a good approximation quickly

### Learning ISTA (LISTA) vs ISTA/FISTA

### LISTA with partial mutual inhibition matrix

### Learning Coordinate Descent (LCoD): faster than LISTA

### Convolutional Sparse Coding ("deconvolutional networks")

Zeiler, Taylor, Fergus, CVPR 2010, *Deconvolutional Networks*, http://www.uoguelph.ca/~gwtaylor/publications/mattcvpr2010/deconvolutionalnets.pdf

### Convolutional PSD: Encoder with a soft sh() Function

### Convolutional Sparse Auto-Encoder on Natural Images

### Using PSD to Train a Hierarchy of Features

## Unsupervised + Supervised for Pedestrian Detection