# Recipe Generation with Seq2Seq Autoencoders

In this notebook, I train a Seq2Seq autoencoder to encode and decode recipe names. I then feed random states into the decoder to generate fictitious recipe names.

## Load and preprocess data

First, we must acquire the data. For my experiment, I used data from [Eight Portions](https://eightportions.com/datasets/Recipes/), which provides a very useful dataset of recipes including names, ingredients, and directions. I only plan to use the names of the recipes, so I will trim the data down to that field.

In [1]:
# Download the recipes
!curl -o recipes.zip 'https://storage.googleapis.com/recipe-box/recipes_raw.zip'

# Unzip without remorse
!unzip -o recipes.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 50.8M  100 50.8M    0     0  9730k      0  0:00:05  0:00:05 --:--:--  9.8M
Archive:  recipes.zip
  inflating: recipes_raw_nosource_ar.json  
  inflating: recipes_raw_nosource_epi.json  
  inflating: recipes_raw_nosource_fn.json  
  inflating: LICENSE                 


In [2]:
!head recipes_raw_nosource_ar.json

{
  "rmK12Uau.ntP510KeImX506H6Mr6jTu": {
    "title": "Slow Cooker Chicken and Dumplings",
    "ingredients": [
      "4 skinless, boneless chicken breast halves ADVERTISEMENT",
      "2 tablespoons butter ADVERTISEMENT",
      "2 (10.75 ounce) cans condensed cream of chicken soup ADVERTISEMENT",
      "1 onion, finely diced ADVERTISEMENT",
      "2 (10 ounce) packages refrigerated biscuit dough, torn into pieces ADVERTISEMENT",
      "ADVERTISEMENT"


In [3]:
import json

# Load the JSON data
with open('recipes_raw_nosource_ar.json', 'r') as f:
    data = json.load(f)

When extracting the names, I found that some recipes did not have one, so I filter those out:

In [4]:
# Pull out the names
names = [data[k]['title'] for k in data if 'title' in data[k]]

for r in names[:10]:
    print(r)

print(len(names), 'of', len(data))

Slow Cooker Chicken and Dumplings
Awesome Slow Cooker Pot Roast
Brown Sugar Meatloaf
Best Chocolate Chip Cookies
Homemade Mac and Cheese Casserole
Banana Banana Bread
Chef John's Fisherman's Pie
Mom's Zucchini Bread
The Best Rolled Sugar Cookies
Singapore Chili Crabs
39522 of 39802


In [5]:
def preprocess_string(txt):
    # Trim special characters to simplify everyone's lives
    for i in range(len(txt)-1, -1, -1):
        if ord(txt[i]) > 127:
            txt = txt[:i] + txt[i+1:]
            
    return (txt
            .replace('(', ' ( ') # Left parentheses
            .replace(')', ' ) ') # Right parentheses
    )

In [6]:
names = list(map(preprocess_string, names))
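
As a quick sanity check of the preprocessing, here is a minimal sketch that restates `preprocess_string` in a compact form that should behave the same way, and applies it to a made-up title. The sample title is an assumption for illustration and is not taken from the dataset.

```python
def preprocess_string(txt):
    # Drop every non-ASCII character (same effect as the index loop above)
    txt = ''.join(c for c in txt if ord(c) <= 127)
    # Pad parentheses with spaces so that they become separate words
    return txt.replace('(', ' ( ').replace(')', ' ) ')

sample = 'Crème Brûlée (Easy)'   # hypothetical title
cleaned = preprocess_string(sample)

print(repr(cleaned))
print(cleaned.split())
```

Note that accented letters are removed rather than transliterated, so `Crème` becomes `Crme`. The parentheses end up as standalone words, which explains why a lone `)` can show up in generated names.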

## Tokenize data

In order to pass the strings in, we need to create a numerical representation that can be used by the network. I define functions for encoding a given string of text into tokens and for decoding label predictions back into a string.

In [7]:
import numpy as np
import math, random

import tensorflow_datasets as tfds

In [8]:
# Extract all of the characters and words
chars = set(c for n in names for c in n)
words = set(w for n in names for w in n.split())

# We also select special start and end characters. These are used
# on the decoder side.
chars.add('\t')
words.add('<start>')
chars.add('\n')
words.add('<end>')

chars = list(chars)
words = list(words)

print(chars)

['n', '9', '5', '?', 'T', '8', 'W', 'Q', 'X', 'R', 'a', 'D', 'O', 'x', '7', 'o', '-', 'M', ':', 'E', 'b', 's', '0', 'd', '.', 'i', 'A', 'p', 'S', 'g', 'c', 'f', 'q', '/', 'h', 'K', '@', '&', 'Z', 'N', 'l', '4', 'k', 'r', '\t', 'G', 'w', '+', 'J', 'L', 'j', '"', "'", '$', '*', '=', '2', '1', 'U', '(', 'B', ';', 'z', 'y', 't', 'v', 'e', '#', 'C', '6', '\n', 'H', ' ', 'u', 'F', ',', 'V', 'I', 'm', '3', ')', '%', 'Y', 'P', '!']


In [9]:
# Given a word, provide the equivalent token
word_idx = {w : i+1 for i, w in enumerate(words)}

# Given a token, give back the word
word_inv_idx = {i+1 : w for i, w in enumerate(words)}

In [10]:
# Number of possible word tokens, including the 'zero' padding token used by embeddings
vocab_size = 1 + len(words)

print('Vocab size:', vocab_size)

Vocab size: 11922


Now, we can define the tokenization and detokenization behavior. We define two functions:

- `str2tok()`: Converts text to tokens
- `tok2str()`: Converts tokens to text

In [11]:
def str2tok(txt, length=None):
    # Split the string and break into tokens
    enc = [word_idx[w] for w in txt.split()]
    
    if length:
        enc += (length - len(enc)) * [0]
        
    return enc

def tok2str(idxs):
    # Rejoin
    return ' '.join([word_inv_idx[i] for i in idxs if i])
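
To make the round trip concrete, here is a small self-contained sketch that rebuilds the two functions on a toy vocabulary. The five-word vocabulary is hypothetical; the real one is built from the recipe names.

```python
# Toy vocabulary (hypothetical); index 0 is reserved for padding
toy_words = ['<start>', '<end>', 'Banana', 'Bread', 'Pudding']
word_idx = {w: i + 1 for i, w in enumerate(toy_words)}
word_inv_idx = {i: w for w, i in word_idx.items()}

def str2tok(txt, length=None):
    # One token per whitespace-separated word, padded with zeros
    enc = [word_idx[w] for w in txt.split()]
    if length:
        enc += (length - len(enc)) * [0]
    return enc

def tok2str(idxs):
    # Padding tokens (0) are skipped when rejoining
    return ' '.join(word_inv_idx[i] for i in idxs if i)

tokens = str2tok('Banana Bread', length=5)
print(tokens)            # [3, 4, 0, 0, 0]
print(tok2str(tokens))   # Banana Bread
```

A word that is not in the vocabulary raises a `KeyError`, so any text passed to `str2tok()` must only contain words seen when the vocabulary was built.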

Using this, we can tokenize the data into the format usable by the neural networks.

In [12]:
# Encode the dataset
max_len = max(len(str2tok(n)) for n in names)

enc_input = np.array([str2tok(n, max_len) for n in names])
dec_input = np.array([str2tok(f"<start> {n}", 2+max_len) for n in names])
dec_output = np.array([str2tok(f"{n} <end>", 2+max_len) for n in names])

print('Max length:', max_len)
print(enc_input[0])
print(dec_input[0])
print(dec_output[0])

print(enc_input.shape)
print(dec_input.shape)
print(dec_output.shape)

Max length: 20
[ 3633  1689 11516  7502 11569     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0]
[ 7622  3633  1689 11516  7502 11569     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
[ 3633  1689 11516  7502 11569  2680     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0]
(39522, 20)
(39522, 22)
(39522, 22)
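
The decoder input and output printed above are the same sequence shifted by one position: at every step the decoder sees the previous word and is trained to predict the next one (a setup commonly called teacher forcing). The sketch below shows the shift on made-up token ids; the ids and the padded length are assumptions for illustration.

```python
import numpy as np

# Hypothetical token ids: 1 = <start>, 2 = <end>, 0 = padding
START, END = 1, 2
name = [5, 6, 7]          # a three-word recipe name
length = 6

def pad(seq, length):
    # Right-pad a token list with zeros up to the given length
    return seq + [0] * (length - len(seq))

dec_in = np.array(pad([START] + name, length))
dec_out = np.array(pad(name + [END], length))

for step, (seen, target) in enumerate(zip(dec_in, dec_out)):
    print(f'step {step}: input={seen} -> target={target}')
```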


The next step is to define a training/validation split. This allows us to check for overfitting during hyperparameter tuning. I do this by generating a list of indices and splitting them into two buckets: one that's used to train and one used solely for evaluation.

In [13]:
# Train/validation split
idxs = list(range(len(enc_input)))
random.shuffle(idxs)

a = int(0.9 * len(idxs))
train_idxs = idxs[:a]
valid_idxs = idxs[a:]
idxs = train_idxs
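
The split above changes on every run because the shuffle is unseeded. If a reproducible split is wanted, one possible variant is sketched below. The helper name `split_indices` and the seed value are assumptions, not part of the original notebook.

```python
import numpy as np

def split_indices(n, train_frac=0.9, seed=0):
    # Seeded permutation so the same split is produced on every run
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    cut = int(train_frac * n)
    return perm[:cut].tolist(), perm[cut:].tolist()

train, valid = split_indices(39522)
print(len(train), len(valid))   # 35569 3953
```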

## Model

We will use a Seq2Seq model to generate our results. This consists of an encoder to convert text to an encoding, and a decoder to convert it back to text.

In [14]:
import tensorflow as tf

from tensorflow.keras.layers import *
from tensorflow.keras.models import *
from tensorflow.keras.optimizers import *

# Silence warnings
import logging
tf.get_logger().setLevel(logging.ERROR)

In [15]:
latent_dim = 64
dropout = 0.4

### Encoder

The Seq2Seq encoder includes an embedding layer that converts a token into a representative vector and a recurrent neural network (RNN) component. The RNN takes in a sequence/list of these vectors and processes them one by one to compute a representative encoding of the input.

In [16]:
enc_in = Input(shape=(None,), name='enc_in')

# Apply an embedding to the input
emb = Embedding(vocab_size, latent_dim)
y = emb(enc_in)

# Pass through an RNN
rnn = Bidirectional(GRU(latent_dim // 2, return_state=True, dropout=dropout))
_, h1, h2 = rnn(y)

# Concatenate the output states
h = Concatenate()([h1, h2])

encoder = Model(enc_in, h, name='encoder')

### Decoder

The decoder goes the other way: it takes the encoding and converts it into a sequence of words. Rather than emitting words directly, the network outputs a probability distribution over the vocabulary at each step. This allows us to either sample from the distribution to randomize our results or deterministically select the most likely word.

In [17]:
dec_in = Input(shape=(None,), name='dec_in')
h = Input(shape=(latent_dim,), name='state_in')

# Embed the decoder input
emb = Embedding(vocab_size, latent_dim)
y = emb(dec_in)

# Pass through a generator GRU
rnn = GRU(latent_dim, return_state=True, return_sequences=True, dropout=dropout)
y, c = rnn(y, initial_state=h)

# Choose the word by computing a probability distribution over the vocabulary
dense = Dense(vocab_size, activation='softmax')
y = dense(y)

decoder = Model([dec_in, h], [y, c], name='decoder')

### Trainer

The trainer is a concatenation of the encoder and decoder. We take a sequence of tokens, pass it into the encoder to get an encoding, then pass that encoding into the decoder. The goal here is to get the original string back, but in general we may want to decode into something else (e.g., neural machine translation).

In [18]:
# Model to train an autoencoder
enc_in = Input(shape=(None,), name='enc_in')
dec_in = Input(shape=(None,), name='dec_in')

z = encoder(enc_in)
y, _ = decoder([dec_in, z])

model = Model([enc_in, dec_in], y, name='autoencoder_trainer')

## Model training

Now, we can train our model. I define a data generator that provides data one batch at a time. This is done because the one-hot encoded targets for the entire dataset would require a massive amount of memory to store at once.
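
A back-of-the-envelope calculation shows the scale of the problem. It uses the shapes and vocabulary size printed earlier and assumes 4-byte floats, which is an assumption about the dtype rather than something the notebook states.

```python
n_samples, seq_len, vocab = 39522, 22, 11922   # shapes printed earlier
bytes_per_float = 4                            # assuming float32

# One-hot targets for the whole dataset
total_bytes = n_samples * seq_len * vocab * bytes_per_float
print(f'{total_bytes / 1e9:.1f} GB')           # 41.5 GB

# One-hot targets for a single untrimmed batch of 64
batch_bytes = 64 * seq_len * vocab * bytes_per_float
print(f'{batch_bytes / 1e6:.1f} MB')           # 67.1 MB
```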

In [19]:
from keras.utils import to_categorical

def data_gen(idxs, batch_size=64, repeat=True):
    # Work on a copy so that shuffling does not reorder the caller's list
    idxs = list(idxs)
    not_done = True
    
    while not_done:
        random.shuffle(idxs)

        # The +1 keeps the last full batch when len(idxs) is a multiple
        # of batch_size
        for end in range(batch_size, len(idxs) + 1, batch_size):
            # Indices of the items chosen for this batch
            batch = idxs[end-batch_size:end]

            # Input
            xe = enc_input[batch]
            xd = dec_input[batch]
            
            # Find the length of the longest item in the batch, i.e. the
            # first column that contains only padding
            j = 0
            while j < max_len:
                if all(xe[:,j] == 0):
                    break
                j += 1

            # Use the length to trim the batch. This is a heuristic to
            # reduce the amount of training time spent on batches of
            # shorter inputs.
            xe = xe[:,:j]
            xd = xd[:,:j+1]

            # Output: one-hot encode only the trimmed targets
            y = dec_output[batch][:,:j+1]
            data = np.zeros((batch_size, y.shape[1], vocab_size))
            for k in range(batch_size):
                data[k] = to_categorical(y[k], num_classes=vocab_size)
            
            yield [xe, xd], data
        
        not_done = repeat

Using TensorFlow backend.
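
The trimming heuristic in `data_gen` relies on finding the first column of the batch that contains only padding. The sketch below runs that search on a made-up batch and compares it with a vectorized alternative. The batch contents are hypothetical, and the vectorized form assumes padding only ever appears at the end of a row.

```python
import numpy as np

# Hypothetical padded batch of token ids; 0 is padding
xe = np.array([
    [4, 9, 2, 0, 0, 0],
    [7, 1, 0, 0, 0, 0],
    [3, 8, 5, 6, 0, 0],
])

# Loop version, as in data_gen
j = 0
while j < xe.shape[1]:
    if all(xe[:, j] == 0):
        break
    j += 1

# Vectorized alternative: the longest row is the largest count of
# non-padding tokens in any row
j_vec = int((xe != 0).sum(axis=1).max())

print(j, j_vec)      # 4 4
print(xe[:, :j])     # batch trimmed to its longest item
```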


In [20]:
encoder.summary()
decoder.summary()

Model: "encoder"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
enc_in (InputLayer)             [(None, None)]       0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, None, 64)     763008      enc_in[0][0]                     
__________________________________________________________________________________________________
bidirectional (Bidirectional)   [(None, 64), (None,  18816       embedding[0][0]                  
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 64)           0           bidirectional[0][1]              
                                                                 bidirectional[0][2]        

In [21]:
model.summary()

Model: "autoencoder_trainer"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
enc_in (InputLayer)             [(None, None)]       0                                            
__________________________________________________________________________________________________
dec_in (InputLayer)             [(None, None)]       0                                            
__________________________________________________________________________________________________
encoder (Model)                 (None, 64)           781824      enc_in[0][0]                     
__________________________________________________________________________________________________
decoder (Model)                 [(None, None, 11922) 1562898     dec_in[0][0]                     
                                                                 encoder[1][0]  

In [22]:
# Build the model
model.compile(optimizer=Adam(), loss='categorical_crossentropy')
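
As an aside, categorical cross-entropy against one-hot targets only ever looks at the probability assigned to the correct word. The NumPy sketch below checks this on made-up values. It suggests that Keras's `sparse_categorical_crossentropy` loss, which takes integer labels, could in principle stand in for the one-hot targets. That is an untested alternative and not what this notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = 6                                   # tiny hypothetical vocabulary
labels = np.array([2, 5, 0])                # integer targets
probs = rng.dirichlet(np.ones(vocab), size=len(labels))  # fake softmax output

# Categorical cross-entropy against one-hot targets
one_hot = np.eye(vocab)[labels]
cce = -(one_hot * np.log(probs)).sum(axis=1)

# Same quantity computed directly from the integer labels
sparse = -np.log(probs[np.arange(len(labels)), labels])

print(np.allclose(cce, sparse))             # True
```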

In [23]:
epochs = 32
batch_size = 64
steps_per_epoch = len(train_idxs) // batch_size
validation_steps = len(valid_idxs) // batch_size

In [24]:
model.fit_generator(data_gen(train_idxs, batch_size=batch_size),
                   steps_per_epoch=steps_per_epoch,
                   validation_data=data_gen(valid_idxs, batch_size=batch_size),
                   validation_steps=validation_steps,
                   epochs=epochs)

Epoch 1/32
Epoch 2/32
Epoch 3/32
Epoch 4/32
Epoch 5/32
Epoch 6/32
Epoch 7/32
Epoch 8/32
Epoch 9/32
Epoch 10/32
Epoch 11/32
Epoch 12/32
Epoch 13/32
Epoch 14/32
Epoch 15/32
Epoch 16/32
Epoch 17/32
Epoch 18/32
Epoch 19/32
Epoch 20/32
Epoch 21/32
Epoch 22/32
Epoch 23/32
Epoch 24/32
Epoch 25/32
Epoch 26/32
Epoch 27/32
Epoch 28/32
Epoch 29/32
Epoch 30/32
Epoch 31/32
Epoch 32/32


<tensorflow.python.keras.callbacks.History at 0x7f92b3f33208>

## Generate

Now that we have a model, let's make some food! First, I define some utility functions to do the generation.

In [25]:
def choose_char(p, temperature=0.2):
    # Apply temperature
    p = np.log(p)
    p /= temperature

    # Rescale
    p = np.exp(p.astype('float64'))
    p = p / np.sum(p)

    # Randomly draw one token from the distribution (one-hot result)
    p = np.random.multinomial(1, p, 1)

    # Recover the index of the sampled token
    sampled_token_index = np.argmax(p)
    
    if sampled_token_index:
        token = word_inv_idx[sampled_token_index]
    else:
        token = False
    
    return sampled_token_index, token

def decode_state(h, temperature=0.2):
    x = np.array([[str2tok('<start>')[0]]])
    h = np.array([h])

    res = []
    for _ in range(max_len):
        p, h = decoder.predict([x, h])
        
        i, c = choose_char(p[0][0], temperature=temperature)

        if not i or c is False or c == '<end>' or c == '\n':
            # We reached the end of the text
            break
        else:
            # Extend the result with the new token
            res.append(i)
            # The token to feed in is the one last generated
            x[0,0] = i
    
    # Attach all of the words and return
    return tok2str(res)

def regenerate(txt, temperature=0.2):
    x = np.array([str2tok(txt)])
    h = encoder.predict(x)[0]
    # Pass the temperature through instead of silently using the default
    return decode_state(h, temperature=temperature)

def generate(temperature=0.2):
    res = None
    while not res:
        # Choose a completely random state
        h = np.random.uniform(-1, 1, size=(latent_dim,))
        # Use the state to generate some text
        res = decode_state(h, temperature=temperature)
    
    return res
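
To see what the temperature parameter does, here is a minimal sketch of the rescaling step from `choose_char` applied to a made-up distribution. The helper name `apply_temperature` and the example probabilities are assumptions, and subtracting the maximum before exponentiating is an extra numerical-stability step that the original function does not include.

```python
import numpy as np

def apply_temperature(p, temperature):
    # Equivalent to raising each probability to 1/temperature and renormalizing
    logits = np.log(p.astype('float64')) / temperature
    q = np.exp(logits - logits.max())   # subtract max for numerical stability
    return q / q.sum()

p = np.array([0.5, 0.3, 0.2])           # hypothetical softmax output
for t in (1.0, 0.5, 0.2):
    print(t, np.round(apply_temperature(p, t), 3))
```

A temperature of 1 leaves the distribution unchanged, while lower values concentrate the probability mass on the most likely word, so the default of 0.2 makes sampling behave close to a greedy choice.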

Before we use the network to generate new food, let's see how it works on existing food. Remember that it should produce the same thing we put in. Of course, it won't be perfect, but that's not really a problem here.

In [26]:
# Show usage on existing samples
for n in names[:20]:
    print('Source:', n)
    print('Target:', regenerate(n, temperature=0.2))
    print()

Source: Slow Cooker Chicken and Dumplings
Target: Slow Cooker Chicken and Dumplings

Source: Awesome Slow Cooker Pot Roast
Target: Awesome Slow Cooker Pizza Roast

Source: Brown Sugar Meatloaf
Target: Brown Sugar Biscuits

Source: Best Chocolate Chip Cookies
Target: Best Chocolate Chip Cookies

Source: Homemade Mac and Cheese Casserole
Target: Homemade Mac and Cheese Casserole

Source: Banana Banana Bread
Target: Banana Bread Pudding

Source: Chef John's Fisherman's Pie
Target: Chef John's Fisherman's Pie

Source: Mom's Zucchini Bread
Target: Mom's Zucchini Bake

Source: The Best Rolled Sugar Cookies
Target: The Best Dog Sugar Cookies

Source: Singapore Chili Crabs
Target: Deer and One

Source: Downeast Maine Pumpkin Bread
Target: Downeast Layer Pumpkin Recipe

Source: Best Big, Fat, Chewy Chocolate Chip Cookie
Target: Best Big, Fat, Chewy Chocolate Chip Cookie

Source: Aimee's Mashed Cauliflower 'Potatoes'
Target: Sun Eggplant Tomato Oranges

Source: Irish Lamb Stew
Target: Irish Fish

Assuming the previous results look good, we can now try generating new names. The code puts a star next to every generated name that doesn't exist in the recipe dataset.

In [27]:
# Generate a text sample
for _ in range(20):
    n = generate()
    if n not in names:
        print('*', n)
    else:
        print(' ', n)

* Eggless Slow Cooker Shots
* Ever Mango Pudding
* Bars from Reynolds Wrap
* Italian Swedish King House
* Orange Sprouts
* Chile Chicken Livers Bundles
* for One
* Marshmallows
* Tomato Banana Muffins with Raisins and Walnuts
* Casserole Ever
* Sauce
* Quinoa
* Punch II
* Melt Cotta )
* Pickles Pecans
  Amaretto
* Free )
* Ricotta Pops
* Souffle Fudge
* Meatballs )
