Skip to content
Using SGD-based meta-learning as an auto-regressive model
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.


This is an experiment in using SGD + a simple feedforward network as an autoregressive generative model.

The Idea

MAML and Reptile effectively turn feedforward networks (e.g. CNNs and MLPs) into sequence models. Just like in an RNN, MAML and Reptile have a state (the network parameters) which is updated over time (every inner-loop training step). Thus, we can pose a sequence of N tuples (x_i, y_i) as a dataset of inputs (x_i) and outputs (y_i), and use MAML/Reptile to train the network to predict the output y_i given input x_i after being tuned (with SGD) on all of the previous input/output pairs in the dataset. In text generation, I set x_i to the index of the current character. For image generation, I set x_i to the 2-D pixel coordinate.

Generation using such a model is simple:

  • Set i = 0.
  • Set theta = parameters learned with Reptile/MAML.
  • Set x_i = some fixed indices/coordinates.
  • For i from 1 to N:
    • Compute P = f_theta(x_i), i.e. the network output for input i.
    • Compute y_i by sampling from P (which is typically logits).
    • Take an SGD step to increase P(y_i).
  • The result of all y_i is the sampled sequence.

A more concrete example can be found in, and is simplified here for illustration:

model = load_text_model()
opt = optim.SGD(model.parameters())
sequence = []
for i in range(128):
    # Create input Tensor containing x_i (the index).
    inputs = torch.from_numpy(np.array([[i]])).to(device).long()
    # Compute output distribution, P.
    logits = model(inputs)
    probs = F.softmax(logits[0], dim=0).detach().cpu().numpy()
    # Sample from the output distribution.
    sample = np.random.choice(np.arange(256), p=probs)
    # NUL character ends the sequence early.
    if sample == 0:
    # Generate a target vector to increase P(sample).
    targets = torch.from_numpy(np.array([sample])).to(device).long()
    # Compute -P(sample) and take a step to minimize it.
    loss = F.cross_entropy(logits, targets)

# Print the final sample.
print(str(bytes([min(0x79, x) for x in sequence]), 'ascii'))


I tried reptile-gen with both MAML and Reptile on two tasks: text generation and binarized sequential MNIST generation.

I also experimented with different model architectures. First I tried vanilla MLPs, but quickly saw some improvement by switching to gated activation functions. Presumably, this allowed the model to prevent certain parameters from being heavily affected by every SGD step. I found the most improvement from using an LSTM block directly, but with no recurrent connections. Essentially, it was just an LSTM block where the state vector was just another trainable parameter.


Here are the best MNIST samples my models were able to generate. Not very nice looking!

MNIST samples

All of my experiments resulted in significant underfitting. However, MAML tended to produce better models than Reptile, indicating that Reptile's gradient-free nature prevents it from modeling long-term dependencies.

The best architecture for reptile-gen was, by a small margin, the LSTM block with no recurrent connections. Since LSTMs were not taylored for reptile-gen, it seems likely that a gating mechanism more taylored for reptile-gen could produce better results. However, my interested in the project started to run thin before I got to that.

You can’t perform that action at this time.