# Recipe Generation with Seq2Seq GAN

In this notebook, I train a Seq2Seq autoencoder to encode and decode recipe names. I also train a Generative Adversarial Network (GAN) on the encodings, which yields a generator of fake recipe names and a discriminator that separates real recipe encodings from fake ones.

## Load and preprocess data

First, we must acquire the data. For my experiment, I used data from [Eight Portions](https://eightportions.com/datasets/Recipes/), which provides a very useful dataset of recipes including names, ingredients, and directions. I only plan to use the names of the recipes, so I will extract just the titles.

In [1]:
# Download the recipes
# Source: https://eightportions.com/datasets/Recipes/
!curl -o recipes.zip 'https://storage.googleapis.com/recipe-box/recipes_raw.zip'

# Unzip without remorse
!unzip -o recipes.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.8M  100 50.8M    0     0  9773k      0  0:00:05  0:00:05 --:--:--  9.7M
Archive:  recipes.zip
  inflating: recipes_raw_nosource_ar.json  
  inflating: recipes_raw_nosource_epi.json  
  inflating: recipes_raw_nosource_fn.json  
  inflating: LICENSE                 


In [2]:
!head recipes_raw_nosource_ar.json

{
  "rmK12Uau.ntP510KeImX506H6Mr6jTu": {
    "title": "Slow Cooker Chicken and Dumplings",
    "ingredients": [
      "4 skinless, boneless chicken breast halves ADVERTISEMENT",
      "2 tablespoons butter ADVERTISEMENT",
      "2 (10.75 ounce) cans condensed cream of chicken soup ADVERTISEMENT",
      "1 onion, finely diced ADVERTISEMENT",
      "2 (10 ounce) packages refrigerated biscuit dough, torn into pieces ADVERTISEMENT",
      "ADVERTISEMENT"


Some recipes have no title, so I filter those out when extracting the names (280 of the 39,802 entries are dropped):

In [3]:
import json

# Load the JSON data
with open('recipes_raw_nosource_ar.json', 'r') as f:
    data = json.load(f)

In [4]:
# Pull out the names
names = [data[k]['title'] for k in data if 'title' in data[k]]

for r in names[:10]:
    print(r)

print(len(names), 'of', len(data))

Slow Cooker Chicken and Dumplings
Awesome Slow Cooker Pot Roast
Brown Sugar Meatloaf
Best Chocolate Chip Cookies
Homemade Mac and Cheese Casserole
Banana Banana Bread
Chef John's Fisherman's Pie
Mom's Zucchini Bread
The Best Rolled Sugar Cookies
Singapore Chili Crabs
39522 of 39802


In [5]:
def preprocess_string(txt):
    # Drop non-ASCII characters (code points above 127)
    txt = ''.join(c for c in txt if ord(c) <= 127)
            
    return (txt
            .replace('(', ' ( ') # Left parentheses
            .replace(')', ' ) ') # Right parentheses
    )

In [6]:
names = list(map(preprocess_string, names))
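
As a quick check of what this preprocessing does, the sketch below reimplements the same two steps in a standalone function. The name `preprocess_sketch` and the sample title are mine, not from the dataset. Note that accented characters are dropped outright rather than transliterated, and that the padded parentheses become separate tokens once the string is split on whitespace.

```python
def preprocess_sketch(txt):
    # Keep only ASCII characters (code points up to 127)
    txt = ''.join(c for c in txt if ord(c) <= 127)
    # Pad parentheses with spaces so they split into their own tokens
    return txt.replace('(', ' ( ').replace(')', ' ) ')

sample = 'Crème Brûlée (Easy)'  # hypothetical title, not taken from the dataset
cleaned = preprocess_sketch(sample)
print(cleaned.split())  # ['Crme', 'Brle', '(', 'Easy', ')']
```

If keeping readable words matters more than simplicity, transliterating accented characters before filtering would be an alternative, but that is not what this notebook does.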

## Tokenize data

To feed the strings into the network, we need a numerical representation of them. I define functions for encoding a given string of text as word tokens and for decoding predicted token indices back into a string.

In [7]:
import numpy as np
import math, random

import tensorflow_datasets as tfds

In [8]:
# Create a tokenizer
chars = set(c for n in names for c in n)
words = set(w for n in names for w in n.split())

chars.add('\t')
words.add('<start>')

chars.add('\n')
words.add('<end>')

chars = list(chars)
words = list(words)

print(chars)

['k', 'b', 'J', '+', 'd', 'f', 'r', 'Y', 'l', 'z', '(', 'V', 'X', '!', '7', 'n', 'e', '$', 'v', 'w', '.', '@', ':', '?', '0', 'C', 'h', 'R', 'H', 'I', '5', 'q', 'o', 'A', '%', 'K', ')', 'p', "'", '8', ';', '&', 'W', 'u', 't', 'D', 'M', 'T', 'E', 'N', 'B', 'Q', 'U', '#', 'G', 'S', 'P', 'c', 'g', '4', '3', 'L', 'x', '-', ',', '6', '\n', '\t', 'j', '*', 'a', '9', 'O', '2', '/', '"', 'F', 'Z', 'y', 'i', '1', ' ', 'm', '=', 's']


In [9]:
word_inv_idx = {i+1 : w for i, w in enumerate(words)}
word_idx = {w : i+1 for i, w in enumerate(words)}

In [10]:
vocab_size = 1 + len(words)

print('Vocab size:', vocab_size)

Vocab size: 11922


Now we can define the tokenization and detokenization behavior. We define two functions:

- `str2tok()`: Converts text to tokens
- `tok2str()`: Converts tokens to text

In [11]:
def str2tok(txt, length=None):
    # Split the string and break into tokens
    enc = [word_idx[w] for w in txt.split()]
    
    if length:
        enc += (length - len(enc)) * [0]
        
    return enc

def tok2str(idxs):
    # Rejoin
    return ' '.join([word_inv_idx[i] for i in idxs if i])
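
To see the round trip in isolation, here is a minimal sketch of the same scheme on a four-word toy vocabulary. The names `toy_words`, `encode`, and `decode` are illustrative stand-ins for `words`, `str2tok()`, and `tok2str()`.

```python
# Toy vocabulary; index 0 is left free so it can serve as padding
toy_words = ['<start>', '<end>', 'Banana', 'Bread']
toy_idx = {w: i + 1 for i, w in enumerate(toy_words)}
toy_inv_idx = {i + 1: w for i, w in enumerate(toy_words)}

def encode(txt, length=None):
    enc = [toy_idx[w] for w in txt.split()]
    if length:
        enc += (length - len(enc)) * [0]  # right-pad with zeros
    return enc

def decode(idxs):
    # Index 0 is falsy, so padding is skipped when rejoining
    return ' '.join(toy_inv_idx[i] for i in idxs if i)

tokens = encode('Banana Banana Bread', length=6)
print(tokens)          # [3, 3, 4, 0, 0, 0]
print(decode(tokens))  # Banana Banana Bread
```

One limitation worth noting: a word that is not in the vocabulary raises a `KeyError`, so this scheme only handles words seen in the dataset.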

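Using these, we can encode the dataset in the format the networks expect: zero-padded arrays of token indices, with `<start>` prepended to the decoder input and `<end>` appended to the decoder target.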

In [12]:
# Encode the dataset
max_len = max(len(str2tok(n)) for n in names)

enc_input = np.array([str2tok(n, max_len) for n in names])
dec_input = np.array([str2tok(f"<start> {n}", 2+max_len) for n in names])
dec_output = np.array([str2tok(f"{n} <end>", 2+max_len) for n in names])

print('Max length:', max_len)
print(enc_input[0])
print(dec_input[0])
print(dec_output[0])

print(enc_input.shape)
print(dec_input.shape)
print(dec_output.shape)

Max length: 20
[ 6096 11194  6637  7454  7740     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
[ 1776  6096 11194  6637  7454  7740     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
[ 6096 11194  6637  7454  7740  1437     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
(39522, 20)
(39522, 22)
(39522, 22)
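
The decoder input and the decoder target printed above are the same sequence offset by one position: at each step the decoder sees the previous word and is asked to predict the next one. This setup is commonly called teacher forcing. A minimal sketch of the alignment, using plain words instead of token ids (the variable names are illustrative):

```python
name = 'Brown Sugar Meatloaf'

dec_in_words = f'<start> {name}'.split()
dec_out_words = f'{name} <end>'.split()

# Pair each input word with the word the decoder should predict next
pairs = list(zip(dec_in_words, dec_out_words))
for seen, target in pairs:
    print(f'{seen:>10} -> {target}')
```

The first pair maps `<start>` to `Brown` and the last maps `Meatloaf` to `<end>`, which is how the decoder learns both where to begin and when to stop.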


The next step is to define a training/validation/test split. This allows us to check for overfitting during hyperparameter tuning. I do this by shuffling a list of indices and splitting it twice: first holding out 10% for testing, then holding out 10% of the remainder for validation. Only the remaining indices are used to train.

In [13]:
# Train-test-valid split
idxs = list(range(len(enc_input)))
random.shuffle(idxs)

a = int(0.9 * len(idxs))
train_idxs = idxs[:a]
test_idxs = idxs[a:]
idxs = train_idxs

a = int(0.9 * len(idxs))
train_idxs = idxs[:a]
valid_idxs = idxs[a:]
del idxs
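
Applying the 90/10 cut twice gives roughly an 81% / 9% / 10% split. The sketch below works through the arithmetic for the 39,522 names in this dataset. The batch size of 64 is the one used later in this notebook; the variable names are mine.

```python
n = 39522          # number of named recipes reported earlier
batch_size = 64    # batch size used later in the notebook

n_rest = int(0.9 * n)        # first cut: 90% kept, 10% held out for testing
n_test = n - n_rest
n_train = int(0.9 * n_rest)  # second cut: 90% of the remainder for training
n_valid = n_rest - n_train

print(n_train, n_valid, n_test)  # 32012 3557 3953
print(n_train // batch_size)     # 500 full batches per epoch
```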

## Model

I used a Seq2Seq model to convert to and from text. It consists of an encoder that converts text to an encoding and a decoder that converts the encoding back to text. I also construct a discriminator that trains to distinguish between real and fake encodings, as well as a generator that creates encodings the discriminator thinks are real.

In [14]:
import tensorflow as tf

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *

# Silence warnings
import logging
tf.get_logger().setLevel(logging.ERROR)

In [15]:
latent_dim = 64
dropout = 0.4

### Encoder

The Seq2Seq encoder includes an embedding layer that converts a token into a representative vector and a recurrent neural network (RNN) component. The RNN takes in a sequence/list of these vectors and processes them one by one to compute a representative encoding of the input.

In [16]:
enc_in = Input(shape=(None,), name='enc_in')

# Apply an embedding to the input
emb = Embedding(vocab_size, latent_dim)
y = emb(enc_in)

# Pass through an RNN
rnn = Bidirectional(GRU(latent_dim // 2, return_state=True, dropout=dropout))
_, h1, h2 = rnn(y)

# Concatenate the output states
h = Concatenate()([h1, h2])

encoder = Model(enc_in, h, name='encoder')

### Decoder

The decoder goes the other way: it takes the encoding and converts it into a sequence of words. Rather than emitting words directly, the network outputs a probability distribution over the vocabulary at each step. This allows us to either sample from the distribution to randomize our results or deterministically select the most likely word.

In [17]:
dec_in = Input(shape=(None,), name='dec_in')
h = Input(shape=(latent_dim,), name='state_in')

# Embed the decoder input
emb = Embedding(vocab_size, latent_dim)
y = emb(dec_in)

# Pass through a GRU, initialized with the encoding
rnn = GRU(latent_dim, return_state=True, return_sequences=True, dropout=dropout)
y, c = rnn(y, initial_state=h)

# Choose the next word by computing a probability distribution
dense = Dense(vocab_size, activation='softmax')
y = dense(y)

decoder = Model([dec_in, h], [y, c], name='decoder')
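
To illustrate the two ways of using the decoder's probabilities, the sketch below compares a greedy (deterministic) choice with temperature-scaled sampling on a made-up three-word distribution. The helper name `apply_temperature` and the probabilities are assumptions for illustration, and the code uses only the standard library rather than NumPy.

```python
import math
import random

def apply_temperature(probs, temperature):
    # Rescale log-probabilities, then renormalize so they sum to 1
    scaled = [math.exp(math.log(p) / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

probs = [0.5, 0.3, 0.2]  # hypothetical decoder output over three words

# Deterministic: always pick the most likely word
greedy = max(range(len(probs)), key=lambda i: probs[i])

sharp = apply_temperature(probs, 0.2)  # low temperature: close to greedy
same = apply_temperature(probs, 1.0)   # temperature 1 leaves probs unchanged

# Randomized: draw one index according to the rescaled distribution
random.seed(0)
sampled = random.choices(range(len(probs)), weights=sharp, k=1)[0]

print(greedy, [round(p, 3) for p in sharp], sampled)
```

At a temperature of 0.2 the most likely word receives over 90% of the probability mass, so sampling behaves almost like the greedy choice while still allowing occasional variation.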

### GAN

A Generative Adversarial Network (GAN) consists of a generator and a discriminator. The goal is to train two networks:

- A *discriminator* that can classify encodings as real (made from real recipe names) or fake (made by some other process)
- A *generator* that can create encodings that the discriminator believes are real.

In [18]:
def wrap(mdl):
    """ Given a Model, create a new Model that computes the same values as
    the input model, but cannot be trained. Used for GAN procedures.
    """
    # Build the input(s)
    if isinstance(mdl.input_shape, list):
        x = [Input(v[1:]) for v in mdl.input_shape]
    else:
        x = Input(mdl.input_shape[1:])
    
    # Build the new model
    fn = Model(x, mdl(x))
    
    # Save the trainability of the model
    trainable = mdl.trainable
    # The new model is untrainable
    fn.trainable = False
    # The old model stays the way it was
    mdl.trainable = trainable
    
    return fn

In [19]:
# Discriminator
h = Input(shape=(latent_dim,), name='encoding')
y = Dense(64, activation='relu')(h)
y = Dense(1, activation='sigmoid')(y)
discriminator = Model(h, y, name='discriminator')

In [20]:
# Generator
x = Input(shape=(latent_dim,), name='noise')
y = Dense(latent_dim, activation='relu')(x)
y = Dense(latent_dim, activation='tanh')(y)
generator = Model(x, y, name='generator')

### Trainers

Now, we can define training procedures. To train, I built four training models:

- `ae_train`: Trains the autoencoder component (encoder and decoder)
- `gen_train`: Trains the generator to maximize the discriminator's score
- `dsc_real_train`: Trains the discriminator to recognize real inputs from the encoder
- `dsc_fake_train`: Trains the discriminator to recognize fake inputs from the generator

In [21]:
# Wrappers for all of the previously defined layers are used
# to build the trainers. This ensures weights are trained at
# the right time.
enc_wrap = wrap(encoder)
dec_wrap = wrap(decoder)
dsc_wrap = wrap(discriminator)
gen_wrap = wrap(generator)

In [22]:
# Model to train an autoencoder
enc_in = Input(shape=(None,), name='enc_in')
dec_in = Input(shape=(None,), name='dec_in')

z = encoder(enc_in)
y, _ = decoder([dec_in, z])

ae_train = Model([enc_in, dec_in], y, name='autoencoder_trainer')

In [23]:
# Generator trainer
noise = Input(shape=(latent_dim,), name='noise')
h = generator(noise)
y = dsc_wrap(h)
gen_train = Model(noise, y, name='gen_trainer')

# Discriminator fake trainer
noise = Input(shape=(latent_dim,), name='noise')
h = gen_wrap(noise)
y = discriminator(h)
dsc_fake_train = Model(noise, y, name='dsc_fake_trainer')

# Discriminator real trainer
enc_in = Input(shape=(None,))
h = enc_wrap(enc_in)
y = discriminator(h)
dsc_real_train = Model(enc_in, y, name='dsc_real_trainer')
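
All three GAN trainers end in the discriminator's sigmoid output, so what distinguishes them is which weights are trainable and which label they are fit against. The sketch below writes out binary cross-entropy for a single prediction to show how the label changes what counts as a good score. The label convention (1 for real, 0 for fake, and 1 again when training the generator) matches the targets used in this notebook's training loop; the score of 0.9 is made up.

```python
import math

def bce(label, pred):
    # Binary cross-entropy for one prediction in (0, 1)
    return -(label * math.log(pred) + (1 - label) * math.log(1 - pred))

d_out = 0.9  # hypothetical discriminator score for a generated encoding

gen_loss = bce(1, d_out)       # generator wants the score near 1: small loss
dsc_fake_loss = bce(0, d_out)  # discriminator wants it near 0: large loss

print(round(gen_loss, 4), round(dsc_fake_loss, 4))  # 0.1054 2.3026
```

The same discriminator output is a success for the generator and a failure for the discriminator, which is the source of the adversarial pressure.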

## Model training

Now, we can train our model. I define a data generator that provides data one batch at a time. This is done because the one-hot decoder targets for the whole dataset (one `vocab_size`-wide vector per token position) would require a massive amount of memory to store at once.

In [24]:
from keras.utils import to_categorical

def data_gen(idxs, batch_size=64, repeat=True):
    not_done = True
    
    while not_done:
        random.shuffle(idxs)

        for end in range(batch_size, len(idxs), batch_size):
            # Indices of the items chosen for this batch
            batch = idxs[end-batch_size:end]

            # Input
            xe = enc_input[batch]
            xd = dec_input[batch]
            
            # Find the first column that is padding for every item
            j = 0
            while j < max_len:
                if all(xe[:,j] == 0):
                    break
                j += 1

            # Output: one-hot encode the decoder targets
            y = dec_output[batch]
            data = np.zeros((batch_size, y.shape[1], vocab_size))
            for k in range(len(data)):
                data[k] = to_categorical(y[k], num_classes=vocab_size)
              
            # Trim the trailing padding shared by the whole batch
            xe = xe[:,:j]
            xd = xd[:,:j+1]
            data = data[:,:j+1]
            
            yield [xe, xd], data
        
        not_done = repeat

Using TensorFlow backend.
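
To put a number on the memory concern, the sketch below estimates the size of the one-hot decoder targets, using the dataset shape and vocabulary size printed earlier. It assumes 8 bytes per value, since `np.zeros` defaults to `float64`.

```python
n_names = 39522      # encoded names
seq_len = 22         # decoder sequence length (max_len + 2)
vocab_size = 11922   # vocabulary size
bytes_per_value = 8  # float64, the default dtype of np.zeros

full_bytes = n_names * seq_len * vocab_size * bytes_per_value
batch_bytes = 64 * seq_len * vocab_size * bytes_per_value

print(f'{full_bytes / 1e9:.1f} GB for the full dataset')  # 82.9 GB
print(f'{batch_bytes / 1e6:.1f} MB for one batch of 64')  # 134.3 MB
```

Building the targets per batch keeps the peak allocation near 134 MB, and trimming the shared padding makes the array that is actually yielded smaller still.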


In [25]:
encoder.summary()
decoder.summary()
discriminator.summary()
generator.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
enc_in (InputLayer)             [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 64)     763008      enc_in[0][0]                     
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, 64), (None,  18816       embedding[0][0]                  
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 64)           0           bidirectional[0][1]              
                                                                 bidirectional[0][2]        

In [26]:
ae_train.summary()
gen_train.summary()
dsc_fake_train.summary()
dsc_real_train.summary()

Model: "autoencoder_trainer"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
enc_in (InputLayer)             [(None, None)]       0                                            
__________________________________________________________________________________________________
dec_in (InputLayer)             [(None, None)]       0                                            
__________________________________________________________________________________________________
encoder (Model)                 (None, 64)           781824      enc_in[0][0]                     
__________________________________________________________________________________________________
decoder (Model)                 [(None, None, 11922) 1562898     dec_in[0][0]                     
                                                                 encoder[2][0]  

In [27]:
ae_train.compile(optimizer=Adam(lr=1e-3), loss='categorical_crossentropy')
gen_train.compile(optimizer=Adam(lr=1e-3), loss='binary_crossentropy')
dsc_fake_train.compile(optimizer=Adam(lr=2e-3), loss='binary_crossentropy')
dsc_real_train.compile(optimizer=Adam(lr=2e-3), loss='binary_crossentropy')

In [28]:
epochs = 32
batch_size = 64
steps_per_epoch = len(train_idxs) // batch_size
validation_steps = len(valid_idxs) // batch_size

To train the GAN architecture, we train all three of the GAN models (`dsc_real_train`, `gen_train`, and `dsc_fake_train`) on every batch, one after another. This creates the competition between the discriminator and the generator.

In [29]:
"""def train_iter():
    # Training round
    loss = [0 for _ in range(4)]
    for step, (x, y) in enumerate(data_gen(train_idxs, batch_size=batch_size, repeat=False)):
        # Train on real data
        loss[0] += ae_train.train_on_batch(x, y)
        loss[1] += dsc_real_train.train_on_batch(x, np.ones((batch_size,)))
        
        # Train on fake data
        noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
        loss[2] += gen_train.train_on_batch(noise, np.ones((batch_size,)))
        loss[3] += dsc_fake_train.train_on_batch(noise, np.zeros((batch_size,)))
        
        # Display the loss
        print(f'\r{step+1}/{steps_per_epoch} loss:', end='')
        for x in loss:
            print(f' {x / (1+step):.4f}', end='')
    print()
"""
    
def train_iter():
    # Training round
    loss = [0 for _ in range(3)]
    for step, (x, y) in enumerate(data_gen(train_idxs, batch_size=batch_size, repeat=False)):
        # Train on real data
        loss[0] += dsc_real_train.train_on_batch(x, np.ones((batch_size,)))
        
        # Train on fake data
        noise = np.random.normal(0, 1, size=(batch_size, latent_dim))
        loss[1] += gen_train.train_on_batch(noise, np.ones((batch_size,)))
        loss[2] += dsc_fake_train.train_on_batch(noise, np.zeros((batch_size,)))
        
        # Display the loss
        print(f'\r{step+1}/{steps_per_epoch} loss:', end='')
        for x in loss:
            print(f' {x / (1+step):.4f}', end='')
    print()

In [30]:
# Train the autoencoder
print('Training autoencoder')
ae_train.fit_generator(data_gen(train_idxs, batch_size=batch_size),
                   steps_per_epoch=steps_per_epoch,
                   validation_data=data_gen(valid_idxs, batch_size=batch_size),
                   validation_steps=validation_steps,
                   epochs=epochs)

# Train the GAN
print('Training GAN')
for ep in range(epochs):
    print(f'Epoch {ep+1}/{epochs}')
    train_iter()

Training autoencoder
Epoch 1/32
Epoch 2/32
Epoch 3/32
Epoch 4/32
Epoch 5/32
Epoch 6/32
Epoch 7/32
Epoch 8/32
Epoch 9/32
Epoch 10/32
Epoch 11/32
Epoch 12/32
Epoch 13/32
Epoch 14/32
Epoch 15/32
Epoch 16/32
Epoch 17/32
Epoch 18/32
Epoch 19/32
Epoch 20/32
Epoch 21/32
Epoch 22/32
Epoch 23/32
Epoch 24/32
Epoch 25/32
Epoch 26/32
Epoch 27/32
Epoch 28/32
Epoch 29/32
Epoch 30/32
Epoch 31/32
Epoch 32/32
Training GAN
Epoch 1/32
500/500 loss: 0.1939 0.7032 0.8967 0.1434 0.8054 0.7996
Epoch 2/32
500/500 loss: 0.2016 0.5131 1.2110
Epoch 3/32
500/500 loss: 0.2346 0.3468 1.6379
Epoch 4/32
500/500 loss: 0.1933 0.6256 1.3615
Epoch 5/32
500/500 loss: 0.2232 0.4255 1.5950
Epoch 6/32
500/500 loss: 0.2533 0.2574 1.9724
Epoch 7/32
500/500 loss: 0.2620 0.3256 1.8985
Epoch 8/32
500/500 loss: 0.2451 0.5850 1.6240
Epoch 9/32
500/500 loss: 0.2274 0.4738 1.5607
Epoch 10/32
500/500 loss: 0.2527 0.3467 1.8419
Epoch 11/32
500/500 loss: 0.2626 0.3157 1.9967
Epoch 12/32
500/500 loss: 0.2509 0.3994 1.8541
Epoch 13/32
500

## Generate

Now that we have a model, let's make some food! First, I define some utility functions to do the generation.

In [31]:
def choose_char(p, temperature=0.2):
    # Apply temperature
    p = np.log(p)
    p /= temperature

    # Rescale
    p = np.exp(p.astype('float64'))
    p = p / np.sum(p)

    # Randomly draw one token from the distribution (returns a one-hot vector)
    p = np.random.multinomial(1, p, 1)

    # Recover the index of the sampled token
    sampled_token_index = np.argmax(p)
    
    if sampled_token_index:
        token = word_inv_idx[sampled_token_index]
    else:
        token = False
    
    return sampled_token_index, token

def decode_state(h, temperature=0.2):
    x = np.array([[str2tok('<start>')[0]]])
    h = np.array([h])

    res = []
    for _ in range(max_len):
        p, h = decoder.predict([x, h])
        
        i, c = choose_char(p[0][0], temperature=temperature)

        if not i or c is False or c == '<end>' or c == '\n':
            # We reached the end of the text
            break
        else:
            # Extend the result with the new token
            res.append(i)
            # The token to feed in is the one last generated
            x[0,0] = i
    
    # Attach all of the words and return
    return tok2str(res)

def regenerate(txt, temperature=0.2):
    """ Given text, pass it through the encoder and then the decoder.
    """
    # Tokenize the text
    x = np.array([str2tok(txt)])
    
    # Encode the text
    h = encoder.predict(x)[0]
    
    # Decode and return
    return decode_state(h, temperature=temperature)

def gan_generate(temperature=0.2):
    """ Uses the GAN to generate a recipe name.
    """
    x = np.random.normal(0, 1, size=(1, latent_dim))
    h = generator.predict(x)[0]
    return decode_state(h, temperature=temperature)

def generate(temperature=0.2):
    """ Generate a random recipe from the space of possible encodings.
    """
    res = None
    while not res:
        # Choose a completely random state
        h = np.random.uniform(-1, 1, size=(latent_dim,))
        # Use the state to generate some text
        res = decode_state(h, temperature=temperature)
    
    return res

In [32]:
def classify_food(txt):
    """ Given a recipe name, determine whether or not the name is 'real'
    """
    x = np.array([str2tok(txt)])
    h = encoder.predict(x)
    y = discriminator.predict(h)[0][0]
    return [False, True][int(round(y))]    

Before we use the network to generate new food, let's see how it works on existing food. Remember that it should produce the same thing we put in. Of course, it won't be perfect, but that's not really a problem here.

In [33]:
# Show usage on existing samples
for n in names[:20]:
    print('Source:', n)
    print('Target:', regenerate(n, temperature=0.2))
    print()

Source: Slow Cooker Chicken and Dumplings
Target: Slow Cooker Chicken and Dumplings

Source: Awesome Slow Cooker Pot Roast
Target: Awesome Slow Cooker Pot Roast

Source: Brown Sugar Meatloaf
Target: Brown Sugar Meatloaf

Source: Best Chocolate Chip Cookies
Target: Best Chocolate Chip Cookies

Source: Homemade Mac and Cheese Casserole
Target: Homemade Macaroni and Cheese Casserole

Source: Banana Banana Bread
Target: Banana Banana Bread

Source: Chef John's Fisherman's Pie
Target: Chef John's Margarita Shake

Source: Mom's Zucchini Bread
Target: Mom's Zucchini Bread

Source: The Best Rolled Sugar Cookies
Target: The Best Lemon Tea Cookies

Source: Singapore Chili Crabs
Target: Noodle Pot Stickers

Source: Downeast Maine Pumpkin Bread
Target: Aunt Wheat Chocolate Bread

Source: Best Big, Fat, Chewy Chocolate Chip Cookie
Target: Best Big, Fat, Chewy Chocolate Chip Cookie Bars

Source: Aimee's Mashed Cauliflower 'Potatoes'
Target: Aimee's Mashed Cauliflower 'Potatoes'

Source: Irish Lamb S
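
Eyeballing the pairs works for a handful of samples, but a simple score makes the comparison easier to track. The sketch below computes an exact-match rate and a per-name word overlap on four pairs copied from the output above. Both helpers (`exact_match_rate` and `word_overlap`) are hypothetical and not part of the original notebook.

```python
# (source, reconstruction) pairs copied from the output above
pairs = [
    ('Slow Cooker Chicken and Dumplings', 'Slow Cooker Chicken and Dumplings'),
    ('Homemade Mac and Cheese Casserole', 'Homemade Macaroni and Cheese Casserole'),
    ('Banana Banana Bread', 'Banana Banana Bread'),
    ('Singapore Chili Crabs', 'Noodle Pot Stickers'),
]

def exact_match_rate(pairs):
    # Fraction of pairs whose reconstruction equals the source exactly
    return sum(src == out for src, out in pairs) / len(pairs)

def word_overlap(src, out):
    # Fraction of source words that also appear in the reconstruction
    src_words, out_words = src.split(), set(out.split())
    return sum(w in out_words for w in src_words) / len(src_words)

print(exact_match_rate(pairs))  # 0.5
for src, out in pairs:
    print(f'{word_overlap(src, out):.2f}', src)
```

Word overlap separates near misses such as "Mac" versus "Macaroni" (0.80) from reconstructions that share nothing with the source (0.00).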

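Assuming the previous results look good, we can now try generating new names. One method is to randomly select an encoding from the space of possible encodings. A GRU's hidden state is bounded between -1 and 1, so I draw each component uniformly from that range. In the output below, `*` marks a name that does not appear in the dataset and `+` marks a name the discriminator classifies as real.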

In [34]:
# Generate a text sample
for _ in range(20):
    n = generate()
    
    lbl = '*' if n not in names else ' '
    lbl += '+' if classify_food(n) else ' '
    
    print(lbl, n)

*  Creme Fraiche
*  Meat Nachos
   Pumpkin Pie
*+ Wild Kale
*  Buckwheat Brownies
*+ Cobbler For Hot Dish
*  Sriracha
*  Fudge I
*+ Onions
*+ Greens Pasta
   Zucchini Pie
*+ Authentic la Neige )
*  Shepherds Pie
*  Shepherd's Sticker Rice
*+ Casserole I
*  Old Fashioned Sauce
*  Crinkle Cookies Brownies
*+ Fried Shallots
*  Cupcakes
 + Cocktail Meatballs III
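
The bound on GRU states follows from the update rule: the new state is a convex combination of the previous state and a `tanh` candidate, so a state that starts at zero can never leave the interval (-1, 1). The scalar sketch below uses one common form of that update with made-up gate inputs to illustrate the bound. It is a simplification, not the Keras implementation, and `gru_step` is a hypothetical helper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, gate_input, cand_input):
    z = sigmoid(gate_input)            # update gate, in (0, 1)
    candidate = math.tanh(cand_input)  # candidate state, in (-1, 1)
    # Convex combination of the old state and the candidate
    return z * h + (1.0 - z) * candidate

random.seed(0)
h = 0.0  # zero initial state
states = []
for _ in range(1000):
    # Random pre-activations stand in for the learned projections
    h = gru_step(h, random.gauss(0, 2), random.gauss(0, 2))
    states.append(h)

print(min(states), max(states))
```

This range is also why the generator's final `tanh` activation is a natural fit: its outputs fall in the same interval as real encodings.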


Now, let's try the GAN generator, which was trained to produce encodings that the discriminator labels as real.

In [35]:
# Generate a text sample
for _ in range(20):
    n = gan_generate()
    
    lbl = '*' if n not in names else ' '
    lbl += '+' if classify_food(n) else ' '
    
    print(lbl, n)

*+ The Best Classic Tiramisu
*+ Potato and Brie in a Stick
*+ Buttery Sopapillas
*+ Apple Cookies
*  Scallops with Okra ( Kholdnyk )
*+ Snickerdoodle I
*+ Ultra Easy Frosting
*  Barb's Ceviche
*  Dragon Tomato Wraps
*+ Southwestern Burger with Garlicky
*+ Very Best Cinnamon Vanilla Brownies Ever
*+ Vegan Pecans III
*+ Rocky Cake Cake Brownies
*+ Bing Cherry French Toast
*  French Onion Triangles
*  New Year's Lemonade Cake
*+ Pasta and Tomato with Couscous
 + Sweet Potato Rolls
*+ Oven Sweet Turkey Burgers
*+ Lime Mango Margarita
