## Meta Learning
Finn's lecture at ICML 2019: [video](https://www.facebook.com/icml.imls/videos/400619163874853/)

Problems:
- no large dataset
- want a general-purpose AI System
- long tail

Solution:
- SIFT features, HOG features
- Fine-tuning from ImageNet features
- Domain adaptation from other painters

fewer human priors, more data-driven priors

# problem statement
two ways to view meta-learning
- Mechanistic view
    - A deep neural network model can read in an entire dataset and make predictions for new data points.
    - Training this network uses a meta-dataset, which itself consists of many datasets, each for a different task.
- Probabilistic view
    - Extract prior information from a set of (meta-training) tasks that allows efficient learning of new tasks.
    - Learning a new task uses this prior and a (small) training set to infer the most likely posterior parameters.
    - this view makes it easier to understand meta-learning algorithms


## problem definitions
- supervised learning

can we incorporate additional data?
- data from different task: $D_{meta-train}$
- data from same task without label: semi-supervised learning

## The meta-learning problem
$$
\arg\max_{\phi} \log p(\phi | D, D_{meta-train})
$$
We do not want to keep $D_{meta-train}$ around, so we distill it into meta-parameters $\theta$ by learning $p(\theta | D_{meta-train})$:
$$
\begin{aligned}
\log p(\phi | D, D_{meta-train}) &= \log \int_{\Theta} p(\phi | D, \theta)\, p(\theta | D_{meta-train})\, d\theta \\
&\approx \log p(\phi | D, \theta^*) + \log p(\theta^* | D_{meta-train})
\end{aligned}
$$

The second term is exactly the problem that meta-learning has to solve:
$$
\theta^* = \arg \max_{\theta} \log p(\theta|D_{meta-train})
$$

The first term is the problem that the specific task has to solve (adaptation):
$$
\arg\max_{\phi} \log p(\phi | D, D_{meta-train}) \approx \arg\max_{\phi} \log p(\phi | D, \theta^*)
$$

key idea:
"our training procedure is based on a simple machine learning principle: test and train conditions must match" (Vinyals et al., Matching Networks for One Shot Learning)

### the complete meta-learning optimization
$$
\theta^* = \arg\max_{\theta} \sum_{i=1}^n \log p(\phi_i | D_i^{ts})
$$
where $\phi_i=f_{\theta}(D_i^{tr})$

## meta-learning terminology
- meta-training: 
    - meta-training tasks:
        - support set
        - query set
- meta-test:
    - meta-test tasks:

## Closely related problem settings
hyperparameter optimization & auto-ML: can be cast as meta-learning
- hyperparameter optimization: $\theta$ = hyperparameters, $\phi$ = network weights
- architecture search: $\theta$ = architecture, $\phi$ = network weights

# Meta-learning Algorithms
- Black-box adaptation
- Optimization-based inference
- Non-parametric methods
- Bayesian meta-learning

## How to evaluate 
- 5-way, 1-shot image classification
- regression, language generation, skill learning
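
To make "5-way, 1-shot" concrete, here is a minimal sketch of how one evaluation episode could be drawn. The function name `sample_episode`, the `data_by_class` layout, and the `n_query` parameter are assumptions made for illustration; they do not come from the lecture.

```python
import random

def sample_episode(data_by_class, n_way=5, k_shot=1, n_query=1, rng=None):
    """Sample one N-way, K-shot episode.

    data_by_class: dict mapping class label -> list of examples
                   (hypothetical layout, chosen for this sketch).
    Returns (support, query): lists of (example, episode_label) pairs,
    where episode_label is an index in 0..n_way-1.
    """
    if rng is None:
        rng = random.Random()
    # Pick N distinct classes for this episode.
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        # Draw K support and n_query query examples without overlap.
        examples = rng.sample(data_by_class[cls], k_shot + n_query)
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```

The support set plays the role of $D_i^{tr}$ and the query set the role of $D_i^{ts}$; accuracy on the query set, averaged over many episodes, is the usual few-shot classification metric.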

## how to design a meta-learning algorithm
1. Choose a form of $p(\phi_i|D_i^{tr},\theta)$
2. Choose how to optimize $\theta$ w.r.t. the maximum-likelihood objective using $D_{meta-train}$

### Black-box adaptation
key idea: train a neural network to represent $p(\phi_i|D_i^{tr},\theta)$
1. Sample task $T_i$
2. Sample disjoint datasets $D_i^{tr}, D_i^{ts}$
3. Compute $\phi_i \leftarrow f_\theta(D_i^{tr})$
4. Update $\theta$ using $\nabla_\theta \mathcal{L}(\phi_i, D_i^{ts})$

### Form of $f_{\theta}$
- LSTM
- NTM
- self-attention
- 1D convolutions
- feedforward + average

Idea: there is no need to output all parameters of the neural net, only sufficient statistics, i.e. only the key parameters. Examples: MANN, SNAIL

general form:
$$
y^{ts} = f_{\theta}(D_i^{tr},x^{ts})
$$
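
As a rough illustration of the general form, the sketch below implements the "feedforward + average" choice of $f_\theta$ for scalar regression: each support pair is embedded, the embeddings are averaged into a context vector (the sufficient statistics), and the context is decoded together with $x^{ts}$. The names `make_params`, `encode_pair`, and `predict`, the hidden size, and the random untrained weights are all assumptions for illustration; no meta-training is shown here.

```python
import math
import random

def make_params(hidden=8, seed=0):
    """Random, untrained weights standing in for theta (illustrative only)."""
    rng = random.Random(seed)

    def weights(n):
        return [rng.gauss(0.0, 0.5) for _ in range(n)]

    return {
        "enc": [weights(2) for _ in range(hidden)],  # embeds one (x, y) pair
        "dec": weights(hidden + 1),                  # maps [context, x_ts] to y_ts
    }

def encode_pair(params, x, y):
    """Feedforward embedding of a single support pair."""
    return [math.tanh(wx * x + wy * y) for wx, wy in params["enc"]]

def predict(params, d_tr, x_ts):
    """y_ts = f_theta(D_tr, x_ts): embed each pair, average, then decode."""
    codes = [encode_pair(params, x, y) for x, y in d_tr]
    # Averaging makes the context independent of the order of D_tr.
    context = [sum(col) / len(codes) for col in zip(*codes)]
    features = context + [x_ts]
    return sum(w * f for w, f in zip(params["dec"], features))
```

Because the support set enters only through an average, the prediction does not depend on the order of the examples in $D_i^{tr}$, which is one reason this form is a common alternative to sequence models such as LSTMs.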

## Optimization-based method
key idea: $\theta$ serves as a prior
- Fine-tuning: $\phi \leftarrow \theta-\alpha \nabla_\theta \mathcal{L}(\theta, D^{tr})$

1. Sample task $T_i$
2. Sample disjoint datasets $D_i^{tr}, D_i^{ts}$
3. Compute $\phi_i \leftarrow \theta-\alpha \nabla_\theta \mathcal{L}(\theta, D_i^{tr})$
4. Update $\theta$ using $\nabla_\theta \mathcal{L}(\phi_i, D_i^{ts})$
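
The four steps above can be sketched on a toy problem where every derivative is available in closed form. The sketch assumes a one-parameter model $y = w x$ with squared-error loss and a task family $y = a x$ with $a \sim \mathrm{Uniform}(1, 3)$; the task family, the step sizes `alpha` and `beta`, and all function names are assumptions chosen for illustration, not part of the lecture.

```python
import random

def loss(w, data):
    """Mean squared error of the model y = w * x on (x, y) pairs."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def grad(w, data):
    """Derivative of the loss with respect to w."""
    return sum(2 * x * (w * x - y) for x, y in data) / len(data)

def hessian(data):
    """Second derivative of the loss w.r.t. w (constant for this model)."""
    return sum(2 * x * x for x, _ in data) / len(data)

def sample_task(rng, k=5):
    """Toy task family (an assumption): y = a * x, a ~ Uniform(1, 3)."""
    a = rng.uniform(1.0, 3.0)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(2 * k)]
    data = [(x, a * x) for x in xs]
    return data[:k], data[k:]  # disjoint D_tr and D_ts

def meta_train(steps=2000, alpha=0.1, beta=0.05, seed=0):
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        d_tr, d_ts = sample_task(rng)            # steps 1 and 2
        phi = theta - alpha * grad(theta, d_tr)  # step 3: inner update
        # Step 4: chain rule through the inner update, using
        # d(phi)/d(theta) = 1 - alpha * hessian(d_tr).
        meta_grad = grad(phi, d_ts) * (1.0 - alpha * hessian(d_tr))
        theta -= beta * meta_grad
    return theta
```

In this toy setting the factor `1 - alpha * hessian(d_tr)` is the second-order term that comes from differentiating through the inner gradient step. With the assumed task family, `meta_train()` should settle near the mean slope of 2, an initialization from which one gradient step adapts well to any sampled task.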

### MAML: Model-agnostic meta-learning
- MAML can be viewed as a computation graph
- Ravi & Larochelle, ICLR'17: replace the gradient update with a learned network
$$
\phi_i = \theta - f(\theta,D_i^{tr},\nabla_\theta \mathcal{L})
$$


other form:
- implicit MAML: Gradient-descent with explicit Gaussian prior 
- Auto-Meta: Progressive neural architecture search + MAML
- Second-order meta-optimization
- Automatically learn an inner vector learning rate, tune the outer learning rate: Meta-SGD, AlphaMAML
- Optimize only a subset of the parameters in the inner loop: DEML, CAVIA
- Decouple inner learning rate, BN statistics per-step
- Introduce context variables for increased expressive power

## Non-parametric Model
can we use parametric meta-learners that produce effective non-parametric learners?
- learn the metric space: Siamese network


can we make meta-train & meta-test match?
- Matching Networks

prototypical Networks
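
A minimal sketch of the Prototypical Networks classification rule: each class prototype is the mean embedding of its support examples, and a query is classified by a softmax over negative squared Euclidean distances to the prototypes. The `embed` function here is an identity placeholder and is purely an assumption; in practice it would be a learned network, and the function names are illustrative.

```python
import math
from collections import defaultdict

def embed(x):
    """Placeholder embedding (assumption): identity instead of a learned net."""
    return list(x)

def prototypes(support):
    """Mean embedding per class from a list of (x, label) pairs."""
    by_class = defaultdict(list)
    for x, label in support:
        by_class[label].append(embed(x))
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in by_class.items()
    }

def class_probs(protos, x):
    """Softmax over negative squared distances to each prototype."""
    z = embed(x)
    logits = {
        label: -sum((a - b) ** 2 for a, b in zip(z, c))
        for label, c in protos.items()
    }
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(v - m) for label, v in logits.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}
```

For example, with support points `(0, 0)` and `(0, 2)` for class `"a"` and `(4, 4)` for class `"b"`, the prototype of `"a"` is `[0.0, 1.0]`, and a query at `(0.5, 1.0)` is assigned to `"a"`.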

Challenge:

learn more complex relationships between datapoints:
- Relation Net
- IMP
- GNN

## mix & match components of computation graph
- both condition on data & run gradient descent: CAML 19
- gradient descent on relation net embedding: LEO 19
- MAML, but initialize last layer as ProtoNet during meta-training

## Bayesian method
model $p(\phi_i|\theta)$ as Gaussian
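
One consequence of the Gaussian choice, written out as a sketch: assume an isotropic Gaussian $p(\phi_i|\theta) = \mathcal{N}(\phi_i; \theta, \sigma^2 I)$, where the variance $\sigma^2$ is a symbol introduced here for illustration. MAP inference of the task parameters then reduces to fitting $D_i^{tr}$ with an L2 penalty that pulls $\phi_i$ toward $\theta$:

$$
\begin{aligned}
\phi_i^* &= \arg\max_{\phi} \left[ \log p(D_i^{tr} | \phi) + \log \mathcal{N}(\phi; \theta, \sigma^2 I) \right] \\
&= \arg\max_{\phi} \left[ \log p(D_i^{tr} | \phi) - \frac{1}{2\sigma^2} \lVert \phi - \theta \rVert^2 \right]
\end{aligned}
$$

The second line drops the Gaussian normalizing constant, which does not depend on $\phi$.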

# Application
- few-shot image recognition
- human motion and pose prediction
- domain adaptation
- few-shot segmentation
- few-shot image generation
- few-shot image-to-image translation
- generation of novel viewpoints
- generating talking heads from images

# One-shot Imitation Learning
- meta imitation learning

# Learning to learn from weak supervision

# Other application
- adapting to new programs
- adapting to new language
- adapting to new personas