# Review
* Tools
 * Ch13. Linear Factor Models
 * Ch14. Autoencoders
* Overviews(?)
 * Ch15. Representation Learning
* Specific Issues
 * Ch16. Structured Probabilistic Models for Deep Learning
 * Ch17. Monte Carlo Methods
 * Ch18. Confronting the Partition Function
 * Ch19. Approximate Inference
 * Ch20. Deep Generative Models

# Contents
* Introduction
 * Definition of Representation
* Greedy Layer-Wise Unsupervised Pretraining
 * When and Why Does Unsupervised Pretraining Work?
* Transfer Learning and Domain Adaptation
 * Use shared representation
* Semi-Supervised Disentangling of Causal Factors
 * Use information from unsupervised tasks to perform supervised task
* Distributed Representation
* Exponential Gains from Depth
 * Deep representation
* Providing Clues to Discover Underlying Causes

# Introduction
## Representation
* Arabic numeral representation VS Roman numeral representation
 * 210 / 6  VS  CCX / VI
* Better representation in Machine Learning
 * Good one makes a subsequent task easier
* Almost all learning algorithms learn "Representations" in a deep architecture
 * Supervised/unsupervised learning learns them "implicitly" as a side effect
 * Some algorithms are designed explicitly for Representation Learning
   * e.g. Distribution Learning (Density Estimation)
* Tradeoff Issue
 * Preserving much information VS Nice properties (e.g. Independence)
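
To make the numeral example concrete, here is a small sketch. The helper `roman_to_int` is a hypothetical name introduced for illustration only. In the Arabic (positional) representation, 210 / 6 is a single built-in operation; in the Roman representation the same task first needs a conversion step.

```python
# Symbol values for Roman numerals.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(numeral: str) -> int:
    """Hypothetical helper: convert a Roman numeral string to an int."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller symbol before a larger one is subtracted (IV = 4).
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total

# Arabic representation: the division is immediate.
arabic_result = 210 // 6

# Roman representation: the task only becomes easy after changing representation.
roman_result = roman_to_int("CCX") // roman_to_int("VI")
```

Both paths give the same quotient; the only difference is how much work the representation forces on the subsequent task.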

## Use Unlabeled Data for a good representation
* Unsupervised Learning
* Semi-supervised learning

# Greedy Layer-Wise Unsupervised Pretraining
* Greedy Layer-Wise?
 * optimizes one layer at a time rather than jointly optimizing all pieces
 
* Use single-layer representation learning algorithm
 * RBM, single-layer autoencoder, sparse coding model (Ch13/14)
 * Take the output of the previous layer
 * Produce a new simpler representation

<img src='Algorithm_15_1.png' width=800>
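
The unsupervised stage of the procedure can be sketched roughly as follows. This is a minimal NumPy illustration, not the book's exact algorithm: the single-layer learner is assumed to be a tied-weight autoencoder with a sigmoid encoder and a linear decoder, and the names `train_autoencoder_layer` and `greedy_layerwise_pretrain`, the learning rate, and the epoch count are all assumptions made for this example. Supervised fine-tuning is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder_layer(X, n_hidden, rng, lr=0.1, epochs=200):
    """Train one tied-weight autoencoder on X and return its encoder (W, b)."""
    n_in = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_in, n_hidden))
    b = np.zeros(n_hidden)  # encoder bias
    c = np.zeros(n_in)      # decoder bias
    for _ in range(epochs):
        H = sigmoid(X @ W + b)       # encode
        X_hat = H @ W.T + c          # linear decode with tied weights
        err = X_hat - X              # gradient of 0.5 * ||X_hat - X||^2 w.r.t. X_hat
        dZ = (err @ W) * H * (1.0 - H)  # backprop through decoder and sigmoid
        # W receives gradient from both its encoder and its decoder role.
        grad_W = (X.T @ dZ + err.T @ H) / len(X)
        W -= lr * grad_W
        b -= lr * dZ.mean(axis=0)
        c -= lr * err.mean(axis=0)
    return W, b

def greedy_layerwise_pretrain(X, layer_sizes, seed=0):
    """Train each layer on the output of the previous one, one layer at a time."""
    rng = np.random.default_rng(seed)
    params, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_autoencoder_layer(H, n_hidden, rng)
        params.append((W, b))
        H = sigmoid(H @ W + b)  # this layer's output is the next layer's input
    return params, H

# Toy usage on random data (8 inputs -> 6 -> 4 hidden units).
X = np.random.default_rng(42).random((64, 8))
params, top = greedy_layerwise_pretrain(X, layer_sizes=[6, 4])
```

The returned `params` could then serve as the initialization for joint supervised training of all layers.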

* Good Initialization for a joint learning procedure over all the layers of a deep neural net for supervised task
* Used to successfully train "even" fully connected architectures


* Fine tuning after pretraining
 * Optimizes all layers together
 * Can be done in the pretraining phase (pretraining & fine-tuning simultaneously)
* Can be viewed as a regularizer in supervised learning task
* Overall training scheme is nearly the same
 * learning algorithms, model types can differ


* Initialization for unsupervised learning algorithms for...
 * Deep autoencoders
 * Probabilistic models with many layers of latent variables
 * Deep Generative Models (Ch20)
   * Deep belief networks
   * Deep Boltzmann machines
   
## When and Why Does Unsupervised Pretraining Work?

* History
 * Substantial improvements in test error for "Classification Tasks"
   * Revival of deep neural networks (2006, Hinton)
 * Harmful on many other tasks
 * Ma, J. (2015, Deep neural nets as a method for quantitative structure) found...
   * Significantly helpful for many tasks
   * Slightly harmful on average
 * So we should know "When and Why pretraining works" for a particular task


* 2 Intuitions
 * Act as regularizer
  * e.g. Optimize only higher layers(classifier) freezing lower layers (feature extractor) 
  * Prevent overfitting
  * Improve test set error
  * Speed up optimization
 * Some features that are useful for the unsupervised task may also be useful for the supervised learning task
  * After extracting wheels, we can classify cars and motorcycles by counting wheels

* Expected Values
 * More effective when the initial input is poor
   * dimension reduction + manifold learning (Ch14)
   * e.g. good similarity metrics between two words for word embeddings
 * Use unlabeled data when labeled data is very scarce (Semi-supervised learning)
 * Regularization for complicated functions
 
 
* Why it works
 * reduce the variance of the estimation process
   * Figure 15.1 explanation
     * Input-output projection for visualization
     * various starting points (initialization)
     * blue -> red: time line, from origin to outside
     * points based on pretraining move to a small region

<img src='Figure_15_1.png' width=800>

* Comparison to other ways
 * Two "separate" phases
 * Increase hyperparameters => time consuming
 * => one phase pretraining
   * Unsupervised learning and supervised learning simultaneously
   * Attach unsupervised learning term to objective function
* Two phase VS one phase
 * many hyperparams vs single hyperparam
 * several trial-error iteration vs one-shot
 * no way to control regularization term vs control it by the coefficient of unsupervised cost term 
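
As a hedged illustration of the one-phase scheme, the combined objective can be written as below. The symbols $J_{\text{sup}}$, $J_{\text{unsup}}$, and $\lambda$ are notation assumed here for illustration, not taken from the source.

$$J(\theta) = J_{\text{sup}}(\theta; x, y) + \lambda \, J_{\text{unsup}}(\theta; x)$$

Here $\lambda \ge 0$ is the single coefficient that controls how strongly the unsupervised term regularizes the supervised task; $\lambda = 0$ recovers purely supervised training.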

* The popularity of unsupervised pretraining has declined
 * Still popular in NLP(natural language processing)
 * Regularized with dropout or batch normalization for classification
   * outperforms pretrained versions even on medium-size datasets
 * Bayesian methods outperform on small datasets
 
 
 
* Nevertheless unsupervised pretraining...
 * is an important milestone in the history of deep learning research
 * continues to influence contemporary approaches

# Transfer Learning and Domain Adaptation

* One example problem of Transfer learning
 * How to use feature extractor from Zebra vs Horse for classification of Dalmatian vs Dog
 
* In transfer learning, the learner must perform two or more different tasks
 * e.g. Learn on significantly more data (P1), apply the learned transformation on P2(Small data)
 

* Sharing layers
 * Share lower layers (Underlying factor in low level feature) => Multi-task learning
   * e.g. Visual categorizing
     * low-level notions of "Edges" and "Visual shapes" (corner? circle?)
 <img src='Figure_7_2.png' width=800>
 * Share higher layers (Speech recognition) => Domain Adaptation
 <img src='Figure_15_2.png' width=800>
 
* Domain Adaptation (Sharing Higher Layer)
 * Same task -> Different distribution P
  * e.g. Learning positive/Negative sentiment
    * Task1: about Music, Task2: about Movies
    * Why?: vocabulary and style vary from one domain to another


* Concept Drift
 * Gradual changes in the data distribution over time
 
---
```
While the phrase "multi-task learning" typically refers to supervised learning tasks, the more general notion of transfer learning is applicable to unsupervised learning and reinforcement learning as well.
```
----

* Same representation may be useful in both settings
 * e.g. Transfer learning competition
   * Mesnil, G. 2011, Unsupervised and transfer learning challenge: a deep learning approach
   * 1st: Learn on $P_1$
   * 2nd: Apply the learned transformation to $P_2$
   * Result
     * deeper representations => faster learning $P_2$
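
The two-stage setup can be sketched on synthetic data. Everything here is an assumption made for illustration: $P_1$ and $P_2$ are generated from the same hypothetical mixing matrix `A`, the learned transformation is plain PCA (not the deep model used in the competition), and the second-stage learner is a nearest-centroid classifier trained on three labeled examples per class.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 20))  # hypothetical shared factors: 2 latent dims -> 20 inputs

def sample(latent):
    """Render latent factors into the 20-dimensional input space with small noise."""
    return latent @ A + 0.1 * rng.normal(size=(len(latent), 20))

# Stage 1: learn a representation on the large dataset P1 (PCA via SVD).
X1 = sample(3.0 * rng.normal(size=(2000, 2)))
mean1 = X1.mean(axis=0)
_, _, Vt = np.linalg.svd(X1 - mean1, full_matrices=False)
components = Vt[:2]  # top-2 principal directions

def encode(X):
    return (X - mean1) @ components.T

# Stage 2: apply the learned transformation to the small dataset P2.
def sample_p2(n_per_class):
    centers = np.array([[-3.0, 0.0], [3.0, 0.0]])
    y = np.repeat([0, 1], n_per_class)
    latent = centers[y] + 0.3 * rng.normal(size=(len(y), 2))
    return sample(latent), y

X2_train, y2_train = sample_p2(3)    # very little labeled data
X2_test, y2_test = sample_p2(100)

H_train = encode(X2_train)
centroids = np.stack([H_train[y2_train == k].mean(axis=0) for k in (0, 1)])
dists = np.linalg.norm(encode(X2_test)[:, None, :] - centroids[None, :, :], axis=2)
accuracy = float((dists.argmin(axis=1) == y2_test).mean())
```

Because both datasets share the same underlying factors in this toy setup, the representation learned on $P_1$ makes the small $P_2$ task easy.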

* Two examples: One-shot learning and zero-shot(zero-data) learning
 * Extreme forms of transfer learning
 * One-shot: One example in the 2nd stage
   * e.g. 
     * learn "wheels" from images of bikes n cars
     * learn the one image of a 3-wheel bike
     * test on images of 3-wheel bikes
 * Zero-shot
   * Testing without data in the 2nd stage???
   * Learn 2 representations and their relation
   * e.g. Text-Image learning
     * Link text space("4 Legs") - Image space(visual shape of legs and their count)
     * Learn Birds("2 Legs", "No Ear"), Dogs("4 Legs", "Round Ears")
     * Input: Text about Cats (4 Legs, Pointy ears)
     * Apply to the images of Cats
   * e.g. Machine translation
     * We can translate sentences even though some words have no labeled translation
     * X in language A - Y in language B have similar behavior => Same meaning    
 
 
 <img src='Figure_15_3.png' width=600>


* Zero-shot Model
 * $P(y| x, T)$
   * Traditional input $x$
   * Traditional output $y$
   * Additional random variables, Task $T$
   * e.g. $x$ is an image, $y$ is "yes" or "no", $T$ is the task description "Is there a cat in this image?"
 
    ---
    ```
If we have a training set containing unsupervised examples of objects that live in the same space as T, we may be able to infer the meaning of unseen instances of T.
    ```
    ---

    * $T$ should be represented in a way that allows some sort of generalization.
      * "Is there a sort of "animal" in this image?"
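
A toy sketch of the text-image idea, under loud assumptions: class descriptions are hand-written binary attribute vectors, "images" are synthetic features produced by a hypothetical linear map `M`, and a "bat" class is added to the seen classes so that every attribute varies during training. None of these specifics come from the source. The link from image space to text space is learned by least squares on seen classes only, then used to recognize cats, which were never seen as images.

```python
import numpy as np

rng = np.random.default_rng(0)

# Class descriptions in "text space": [four_legs, pointy_ears, has_wings].
descriptions = {
    "bird": np.array([0.0, 0.0, 1.0]),
    "dog":  np.array([1.0, 0.0, 0.0]),
    "bat":  np.array([0.0, 1.0, 1.0]),  # extra seen class, assumed for this sketch
    "cat":  np.array([1.0, 1.0, 0.0]),  # never seen as an image during training
}
M = rng.normal(size=(3, 16))  # hypothetical attribute -> image-feature map

def fake_images(name, n):
    """Synthetic image features: the class attributes rendered through M plus noise."""
    return descriptions[name] @ M + 0.05 * rng.normal(size=(n, 16))

seen = ["bird", "dog", "bat"]
X_train = np.vstack([fake_images(c, 50) for c in seen])
A_train = np.vstack([np.tile(descriptions[c], (50, 1)) for c in seen])

# Learn the link image space -> text space with ordinary least squares.
W, *_ = np.linalg.lstsq(X_train, A_train, rcond=None)

def classify(X):
    """Predict attributes from image features, then pick the nearest description."""
    names = list(descriptions)
    D = np.stack([descriptions[n] for n in names])  # (n_classes, 3)
    pred = X @ W
    nearest = np.linalg.norm(pred[:, None] - D[None], axis=2).argmin(axis=1)
    return [names[i] for i in nearest]

cat_predictions = classify(fake_images("cat", 20))
```

The cat class is recognizable only because its description is a new combination of attributes that each appeared in some seen class; an attribute that never varied during training could not be learned this way.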
      


# Semi-Supervised Disentangling of Causal Factors
* Large amount of unlabeled data and relatively little labeled data

<img src=https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png width=300>

* $P(x)$ is helpful for $P(y|x)$
* Causal Factor -(Representation)-> Feature

<img src="Figure_15_4.png" width=600>


* Better Representations?
  1. Representation disentangles the causes from one another
  2. Easy to model
    * e.g. Simple model: sparsity, independence
  
* Hypothesis motivation of Semi-supervised learning
  * If (1), (2) coincide =>
  * If a representation $h$ represents many of the underlying causes of the observed $x$
    * the outputs $y$ are among the most "salient" causes, then it is easy to predict $y$ from $h$.
    * $P(y|x)$, $P(x|h)$, $P(h)$
  * c.f. If $P(x)$ is uniformly distributed => Semi-supervised learning fails
  * Simple example
    * 
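
One way to illustrate the hypothesis is a toy mixture model, sketched here under assumptions: $x$ is one-dimensional, $P(x)$ is a mixture of three well-separated Gaussians with one component per value of $y$, k-means stands in for modeling $P(x)$, and exactly one labeled example per class is available. The specific means, labels, and sample sizes are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
means = np.array([-5.0, 0.0, 5.0])   # one mixture component per value of y
comp_to_label = np.array([2, 0, 1])  # arbitrary labels, hidden from the clustering step

def sample(n):
    comp = rng.integers(0, 3, size=n)
    return means[comp] + 0.5 * rng.normal(size=n), comp_to_label[comp]

# Unsupervised step: capture the structure of P(x) with k-means on unlabeled data.
x_unlabeled, _ = sample(300)
centers = np.percentile(x_unlabeled, [10, 50, 90])  # simple deterministic initialization
for _ in range(20):
    assign = np.abs(x_unlabeled[:, None] - centers[None, :]).argmin(axis=1)
    centers = np.array([x_unlabeled[assign == k].mean() for k in range(3)])

# Supervised step: a single labeled example per class names each cluster.
x_labeled = np.array([-5.2, 0.3, 4.7])
y_labeled = np.array([2, 0, 1])
cluster_of_labeled = np.abs(x_labeled[:, None] - centers[None, :]).argmin(axis=1)
cluster_to_label = np.empty(3, dtype=int)
cluster_to_label[cluster_of_labeled] = y_labeled

def predict(x):
    nearest = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
    return cluster_to_label[nearest]

x_test, y_test = sample(200)
accuracy = float((predict(x_test) == y_test).mean())
```

If $P(x)$ were uniform instead, the clustering step would find no structure tied to $y$, and the unlabeled data would not help.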
  

* Issue: Hard to capture salient factors
  * Two strategies
    1. Use a supervised learning signal (labeled data)
    2. Use much larger representation
    
    
* Adversarial Framework (CH 20)
  * Modify the definition of which underlying causes are most salient.
  
  
<img src="Figure_15_6.png" width=600>

# Distributed Representation



# Exponential Gains from Depth

## Deep representation


# Providing Clues to Discover Underlying Causes
