# LSTM on Recipe Data

**The notebook has been adapted from the notebook provided in David Foster's Generative Deep Learning, 2nd Edition.**

- Book: [Amazon](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=sr_1_1?keywords=generative+deep+learning%2C+2nd+edition&qid=1684708209&sprefix=generative+de%2Caps%2C93&sr=8-1)
- Original notebook (TensorFlow and Keras): [GitHub](https://github.com/davidADSP/Generative_Deep_Learning_2nd_Edition/blob/main/notebooks/05_autoregressive/01_lstm/lstm.ipynb)
- Dataset: [Kaggle](https://www.kaggle.com/datasets/hugodarwood/epirecipes)

In [1]:
import numpy as np
import json
import re
import string
import time

import torch
from torch import nn
from torch.nn.functional import pad
from torch.utils.data import Dataset, DataLoader, random_split

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

import torchinfo

## 0. Train parameters

In [2]:
DATA_DIR = '../../data/epirecipes/full_format_recipes.json'

EMBEDDING_DIM = 100
HIDDEN_DIM = 128
VALIDATION_SPLIT = 0.2
SEED = 1024
BATCH_SIZE = 32
EPOCHS = 30

MAX_PAD_LEN = 200
MAX_VAL_TOKENS = 100 # Max number of tokens when generating texts

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

## 1. Load dataset

In [3]:
def pad_punctuation(sentence):
    sentence = re.sub(f'([{string.punctuation}])', r' \1 ', sentence)
    sentence = re.sub(' +', ' ', sentence)
    return sentence
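
To make the effect of `pad_punctuation` concrete, here is a small standalone check. The function is repeated so the snippet runs on its own, and the example sentence is made up for illustration. Note that only ASCII punctuation from `string.punctuation` is padded, so a character such as `°` stays attached to its neighbours.

```python
import re
import string


def pad_punctuation(sentence):
    # Put a space on both sides of every ASCII punctuation character
    sentence = re.sub(f'([{string.punctuation}])', r' \1 ', sentence)
    # Collapse the runs of spaces created by the previous step
    sentence = re.sub(' +', ' ', sentence)
    return sentence


# Hypothetical example sentence, not taken from the dataset
example = 'Preheat oven to 350°F. Add yolk, vanilla, and lemon juice.'
padded = pad_punctuation(example)
print(padded)
print(padded.split())
```

After padding, punctuation marks become separate whitespace-delimited tokens, which is what the tokenizer and vocabulary in the next section rely on.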

In [4]:
# Load dataset (read-only mode is sufficient; the file is never written to)
with open(DATA_DIR, 'r') as f:
    recipe_data = json.load(f)

In [5]:
# preprocess dataset
filtered_data = [
    'Recipe for ' + x['title'] + ' | ' + ' '.join(x['directions'])
    for x in recipe_data
    if 'title' in x and x['title']
    and 'directions' in x and x['directions']
]

text_ds = [pad_punctuation(sentence) for sentence in filtered_data]

print(f'Total recipes loaded: {len(text_ds)}')

Total recipes loaded: 20098


In [6]:
print('Sample data:')
sample_data = np.random.choice(text_ds)
print(sample_data)

Sample data:
Recipe for Ricotta Cheesecake | Preheat oven to 350°F . Pulse flour , sugar , salt , and butter in a food processor until mixture resembles coarse meal . Add yolk , vanilla , and lemon juice and pulse just until mixture begins to form a dough . Spread dough with a small offset spatula or back of a spoon over buttered bottom of a 24 - centimeter springform pan and prick all over with a fork . Chill 30 minutes . Bake crust in a shallow baking pan ( to catch drips ) in middle of oven until golden brown , about 25 minutes , and cool on a rack . Increase temperature to 375°F . Discard liquid and cheesecloth and force drained ricotta through sieve into bowl . Beat yolks and sugar with an electric mixer until thick and pale , then beat in ricotta , flour , and zests . Beat whites with salt in another bowl until they hold soft peaks , and fold into ricotta mixture . Butter side of springform pan and pour filling over crust ( pan will be completely full ) . Bake in baking pan in mi

## 2. Build vocabularies

In [7]:
# The iterator that yields tokenized data
def yield_tokens(data_iter, tokenizer):
    for sample in data_iter:
        yield tokenizer(sample)

# Building vocabulary
def build_vocab(dataset, tokenizer):
    vocab = build_vocab_from_iterator(
        yield_tokens(dataset, tokenizer),
        min_freq=2,
        specials=['<pad>', '<unk>']
    )
    return vocab

In [8]:
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab(text_ds, tokenizer)
vocab.set_default_index(vocab['<unk>'])

# Create index-to-word mapping
index_to_word = {index : word for word, index in vocab.get_stoi().items()}
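
The idea behind the vocabulary cells above (count tokens, drop rare ones, reserve special tokens, fall back to `<unk>`) can be sketched with the standard library alone. This is only an illustration: `build_simple_vocab` and `encode` are hypothetical helpers, and the ordering rule (most frequent first, ties broken alphabetically) is an assumption of this sketch rather than a statement about torchtext's internals.

```python
from collections import Counter


def build_simple_vocab(tokenized_texts, min_freq=2, specials=('<pad>', '<unk>')):
    """Hypothetical stand-in for a frequency-based vocabulary builder."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    # Keep tokens seen at least `min_freq` times, most frequent first;
    # ties are broken alphabetically (an assumption of this sketch).
    kept = sorted(
        (tok for tok, c in counts.items() if c >= min_freq and tok not in specials),
        key=lambda tok: (-counts[tok], tok),
    )
    itos = list(specials) + kept                       # index -> token
    stoi = {tok: idx for idx, tok in enumerate(itos)}  # token -> index
    return stoi, itos


def encode(tokens, stoi, unk_token='<unk>'):
    # Rare or unseen tokens map to the <unk> index
    return [stoi.get(tok, stoi[unk_token]) for tok in tokens]


# Toy corpus, made up for illustration
corpus = [
    'recipe for roast chicken'.split(),
    'recipe for roast turkey'.split(),
    'recipe for lemon cake'.split(),
]
stoi, itos = build_simple_vocab(corpus)
print(itos)
print(encode('recipe for roast duck'.split(), stoi))
```

In this toy corpus only `recipe`, `for`, and `roast` appear at least twice, so every other word is encoded as `<unk>`.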

In [9]:
# display some token-word mappings
for i in range(10):
    word = vocab.get_itos()[i]
    print(f'{i}: {word}')

0: <pad>
1: <unk>
2: .
3: ,
4: and
5: to
6: in
7: the
8: with
9: a


In [10]:
# Check mappings
mapped_sample = vocab(tokenizer(sample_data))
print('Source text:')
print(sample_data)
print('\n')
print('Mapped sample:')
print(mapped_sample)

Source text:
Recipe for Ricotta Cheesecake | Preheat oven to 350°F . Pulse flour , sugar , salt , and butter in a food processor until mixture resembles coarse meal . Add yolk , vanilla , and lemon juice and pulse just until mixture begins to form a dough . Spread dough with a small offset spatula or back of a spoon over buttered bottom of a 24 - centimeter springform pan and prick all over with a fork . Chill 30 minutes . Bake crust in a shallow baking pan ( to catch drips ) in middle of oven until golden brown , about 25 minutes , and cool on a rack . Increase temperature to 375°F . Discard liquid and cheesecloth and force drained ricotta through sieve into bowl . Beat yolks and sugar with an electric mixer until thick and pale , then beat in ricotta , flour , and zests . Beat whites with salt in another bowl until they hold soft peaks , and fold into ricotta mixture . Butter side of springform pan and pour filling over crust ( pan will be completely full ) . Bake in baking pan in mi

## 3. Create DataLoader

In [11]:
class Collate():
    def __init__(self, tokenizer, vocab, max_padding, pad_idx):
        self.tokenizer = tokenizer
        self.vocab = vocab

        self.max_padding = max_padding
        self.pad_idx = pad_idx

    
    def collate_fn(self, batch):
        src_list = []
        tgt_list = []

        # Prepare source and target batch
        for sentence in batch:
            # convert text to vocab tensor
            tokens = self.tokenizer(sentence)
            src_mapping = torch.tensor(self.vocab(tokens[:-1]), dtype=torch.int64)
            tgt_mapping = torch.tensor(self.vocab(tokens[1:]), dtype=torch.int64)
            # pad sequence
            src_padded = pad(src_mapping, [0, self.max_padding - len(src_mapping)], value=self.pad_idx)
            tgt_padded = pad(tgt_mapping, [0, self.max_padding - len(tgt_mapping)], value=self.pad_idx)
            # append padded sequence to corresponding lists
            src_list.append(src_padded)
            tgt_list.append(tgt_padded)

        # stack batch
        src = torch.stack(src_list)
        tgt = torch.stack(tgt_list)

        return (src, tgt)
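
The collate function builds next-token prediction pairs: the target is the source shifted left by one position, and both are brought to a fixed length. A minimal sketch on plain Python lists, assuming padding index 0; `make_src_tgt` is a hypothetical helper that truncates explicitly when a sequence is longer than `max_len`.

```python
def make_src_tgt(token_ids, max_len, pad_idx=0):
    """Build a fixed-length (source, target) pair for next-token prediction."""
    src = token_ids[:-1][:max_len]   # every token except the last
    tgt = token_ids[1:][:max_len]    # every token except the first
    # Right-pad both sequences up to max_len
    src = src + [pad_idx] * (max_len - len(src))
    tgt = tgt + [pad_idx] * (max_len - len(tgt))
    return src, tgt


# Short sequence: padded on the right
src, tgt = make_src_tgt([3, 2, 4, 7, 9], max_len=6)
print(src)
print(tgt)

# Long sequence: truncated to max_len
long_src, long_tgt = make_src_tgt(list(range(1, 11)), max_len=4)
print(long_src)
print(long_tgt)
```

At every position `i`, `tgt[i]` is the token that follows `src[i]`, so the model is trained to predict the next token at each step.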

In [12]:
# Split dataset into training and validation splits (seeded so the split is reproducible)
train_ds, valid_ds = random_split(text_ds, [1-VALIDATION_SPLIT, VALIDATION_SPLIT],
                                  generator=torch.Generator().manual_seed(SEED))
print("Num. training data: \t", len(train_ds))
print("Num. validation data: \t", len(valid_ds))

Num. training data: 	 16079
Num. validation data: 	 4019


In [13]:
pad_idx = vocab.get_stoi()['<pad>']
print('index of <pad> token: ', pad_idx)

collate = Collate(tokenizer, vocab, MAX_PAD_LEN+1, pad_idx)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, 
                          shuffle=True, num_workers=8, pin_memory=True,
                          collate_fn=collate.collate_fn)

valid_loader = DataLoader(valid_ds, batch_size=BATCH_SIZE, 
                          shuffle=False, num_workers=8, pin_memory=True,
                          collate_fn=collate.collate_fn)

index of <pad> token:  0


## 4. Build LSTM model

In [14]:
class LSTM_Net(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                      embedding_dim=EMBEDDING_DIM,
                                      padding_idx=pad_idx)
        
        self.lstm = nn.LSTM(input_size=EMBEDDING_DIM,
                            hidden_size=HIDDEN_DIM,
                            num_layers=2,
                            batch_first=True)
        
        self.output = nn.Linear(in_features=HIDDEN_DIM,
                                out_features=vocab_size)
        
    def forward(self, x):
        x = self.embedding(x)
        x, hidden_state = self.lstm(x)
        return self.output(x)


model = LSTM_Net(len(vocab))
torchinfo.summary(model=model, input_size=(BATCH_SIZE, MAX_PAD_LEN+1), 
                  dtypes=[torch.int64], depth=3)

Layer (type:depth-idx)                   Output Shape              Param #
LSTM_Net                                 [32, 201, 8628]           --
├─Embedding: 1-1                         [32, 201, 100]            862,800
├─LSTM: 1-2                              [32, 201, 128]            249,856
├─Linear: 1-3                            [32, 201, 8628]           1,113,012
Total params: 2,225,668
Trainable params: 2,225,668
Non-trainable params: 0
Total mult-adds (G): 1.67
Input size (MB): 0.05
Forward/backward pass size (MB): 455.69
Params size (MB): 8.90
Estimated Total Size (MB): 464.65
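
The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual PyTorch LSTM layout: four gates per layer, each with input weights, recurrent weights, and two bias vectors. `lstm_param_count` is a hypothetical helper written for this check and is not part of torchinfo.

```python
def lstm_param_count(input_size, hidden_size, num_layers):
    """Count parameters of a stacked, unidirectional LSTM."""
    total = 0
    for layer in range(num_layers):
        # The first layer reads the embeddings, later layers read the previous hidden state
        in_features = input_size if layer == 0 else hidden_size
        # 4 gates x (input weights + recurrent weights + two bias vectors)
        total += 4 * (in_features * hidden_size + hidden_size * hidden_size + 2 * hidden_size)
    return total


# Values taken from the notebook: vocabulary size 8628, EMBEDDING_DIM 100, HIDDEN_DIM 128
vocab_size, embedding_dim, hidden_dim = 8628, 100, 128

embedding_params = vocab_size * embedding_dim
lstm_params = lstm_param_count(embedding_dim, hidden_dim, num_layers=2)
linear_params = hidden_dim * vocab_size + vocab_size   # weights + bias
total_params = embedding_params + lstm_params + linear_params

print(embedding_params, lstm_params, linear_params, total_params)
```

The three per-layer figures and their sum match the `Param #` column and the `Total params` line reported by `torchinfo.summary`.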

## 5. Train step functions

In [15]:
class TextGenerator():
    def __init__(self, index_to_word):
        self.index_to_word = index_to_word

    # Scaling the model's output probability with temperature
    def sample_from(self, probs, temperature):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    # Generate text
    def generate(self, model, start_prompt, max_tokens, temperature, output_info=False):
        model.eval()
        
        start_tokens = vocab(tokenizer(start_prompt))
        sample_token = None
        info = []
        
        while len(start_tokens) < max_tokens and sample_token != 0: # stop once <pad> (index 0) is sampled
            input_prompts = torch.tensor(start_tokens, device=DEVICE).unsqueeze(0)
            with torch.no_grad():
                # logits at the last position of the single sequence in the batch
                logits = model(input_prompts)[0][-1]
            probs = nn.functional.softmax(logits, dim=-1)
            sample_token, probs = self.sample_from(probs.cpu().numpy(), temperature)
            
            start_tokens.append(int(sample_token))
            if output_info:
                # NOTE: the stored token list already includes the token just sampled
                info.append({'token': np.copy(start_tokens), 'word_probs': probs})

        output_text = [self.index_to_word[token] for token in start_tokens if token != 0]
        print(' '.join(output_text))
        return info
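
The temperature scaling in `sample_from` raises each probability to the power `1 / temperature` and renormalizes, which is equivalent to dividing the logits by the temperature before the softmax. A standalone sketch using only the standard library; `apply_temperature` and `sample_index` are hypothetical names, and the toy distribution is made up for illustration.

```python
import random


def apply_temperature(probs, temperature):
    # Raise each probability to 1/T, then renormalize so the values sum to 1
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


def sample_index(probs, temperature, rng=random):
    # Draw one index according to the temperature-scaled distribution
    scaled = apply_temperature(probs, temperature)
    index = rng.choices(range(len(scaled)), weights=scaled, k=1)[0]
    return index, scaled


toy_probs = [0.5, 0.3, 0.2]
for t in (1.0, 0.5, 0.2):
    print(t, [round(p, 3) for p in apply_temperature(toy_probs, t)])

index, _ = sample_index(toy_probs, 0.2, random.Random(0))
print(index)
```

A temperature of 1.0 leaves the distribution unchanged, while lower temperatures concentrate probability mass on the most likely token and make generation more deterministic.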

In [16]:
# Training function
def train_step(model, dataloader, loss_fn, optimizer):
    
    model.train()
    total_loss = 0
    
    for sources, targets in dataloader:    
        optimizer.zero_grad()
    
        sources, targets = sources.to(DEVICE), targets.to(DEVICE)
        preds = model(sources)
        # flatten (batch, seq_len, vocab) logits and (batch, seq_len) targets for cross-entropy
        loss = loss_fn(preds.reshape(-1, preds.shape[-1]), targets.reshape(-1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)


# Evaluation function (note: the name shadows Python's built-in `eval`)
def eval(model, dataloader, loss_fn):

    model.eval()
    valid_loss = 0
    
    # gradients are not needed for validation
    with torch.no_grad():
        for sources, targets in dataloader:
            sources, targets = sources.to(DEVICE), targets.to(DEVICE)
            preds = model(sources)
            loss = loss_fn(preds.reshape(-1, preds.shape[-1]), targets.reshape(-1))
            valid_loss += loss.item()

    return valid_loss / len(dataloader)

## 6. Training

In [17]:
model = LSTM_Net(len(vocab)).to(DEVICE)

# if torch.__version__.split('.')[0] == '2':
#     torch.set_float32_matmul_precision('high')
#     model = torch.compile(model, mode="max-autotune")
#     print('model compiled')

loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters())

text_generator = TextGenerator(index_to_word)

In [18]:
loss_hist = {'train':[], 'valid':[]}

for i in range(EPOCHS):
    prev_time = time.time()
    train_loss = train_step(model, train_loader, loss_fn, optim)
    valid_loss = eval(model, valid_loader, loss_fn)

    loss_hist['train'].append(train_loss)
    loss_hist['valid'].append(valid_loss)
    
    curr_time = time.time()
    print(f'Epoch: {i+1}\tepoch time {(curr_time - prev_time) / 60:.2f} min')
    print(f'\ttrain loss: {train_loss:.4f}, valid loss: {valid_loss:.4f}')

    if (i + 1) % 10 == 0:
        print('\nGenerated text:')
        text_generator.generate(model, 'recipe for', MAX_VAL_TOKENS, 1.0)
        print('\n')

Epoch: 1	epoch time 0.11 min
	train loss: 4.2283, valid loss: 3.5230
Epoch: 2	epoch time 0.11 min
	train loss: 3.0208, valid loss: 2.7528
Epoch: 3	epoch time 0.11 min
	train loss: 2.5472, valid loss: 2.4550
Epoch: 4	epoch time 0.11 min
	train loss: 2.3142, valid loss: 2.2764
Epoch: 5	epoch time 0.11 min
	train loss: 2.1636, valid loss: 2.1592
Epoch: 6	epoch time 0.11 min
	train loss: 2.0621, valid loss: 2.0800
Epoch: 7	epoch time 0.11 min
	train loss: 1.9867, valid loss: 2.0197
Epoch: 8	epoch time 0.11 min
	train loss: 1.9275, valid loss: 1.9755
Epoch: 9	epoch time 0.11 min
	train loss: 1.8799, valid loss: 1.9377
Epoch: 10	epoch time 0.11 min
	train loss: 1.8382, valid loss: 1.9050

Generated text:
recipe for potato root vegetables | blend first 5 ingredients in processor . using electric mixer , beat shells in medium bowl until stiff peaks form . divide batter among parchment nonstick pans . fold 1 piece into pastry overhang until edges are hold cucumbers are golden brown bits . rub o

## 7. Generate texts

In [19]:
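# Print each generated sequence (including the token just sampled) followed by
# the top-k candidate probabilities at the position where that token was sampled
def print_probs(info, index_to_word, top_k=5):
    assert len(info) > 0, 'Please set `output_info=True`'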
    for i in range(len(info)):
        start_tokens, word_probs = info[i].values()
        start_prompts = [index_to_word[token] for token in start_tokens if token != 0]
        start_prompts = ' '.join(start_prompts)
        print(f'\nPrompt: {start_prompts}')
        # word_probs
        probs_sorted = np.argsort(word_probs)[::-1][:top_k]
        for idx in probs_sorted:
            print(f'{index_to_word[idx]}\t{word_probs[idx] * 100:.2f}%')

In [20]:
# Candidate words probability with temperature = 1.0
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=6, 
                               temperature=1.0, 
                               output_info=True)

print_probs(info, index_to_word, 5)

recipe for roast duck lo mein

Prompt: recipe for roast duck
chicken	21.73%
turkey	19.65%
pork	11.71%
beef	11.62%
rack	4.90%

Prompt: recipe for roast duck lo
with	46.12%
breasts	15.77%
legs	9.92%
breast	9.39%
|	2.87%

Prompt: recipe for roast duck lo mein
mein	93.91%
with	1.38%
|	0.26%
<unk>	0.22%
lamb	0.11%


In [21]:
# Candidate words probability distribution with temperature = 0.2
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=6, 
                               temperature=0.2, 
                               output_info=True)

print_probs(info, index_to_word, 5)

recipe for roast chicken with fresh

Prompt: recipe for roast chicken
chicken	59.02%
turkey	35.69%
pork	2.68%
beef	2.58%
rack	0.03%

Prompt: recipe for roast chicken with
with	100.00%
breasts	0.00%
|	0.00%
thighs	0.00%
legs	0.00%

Prompt: recipe for roast chicken with fresh
fresh	56.15%
rosemary	19.81%
lemon	13.31%
roasted	3.59%
garlic	1.65%


In [22]:
# generate text with temperature = 1.0
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=100, 
                               temperature=1.0, 
                               output_info=True)

recipe for roast chicken breasts | preheat oven to 425°f . halve onions lengthwise , reserving soaking seeds . toss scallions with garlic , salt , and 1 / 2 teaspoon pepper . in a small saucepan combine lemongrass , scallions , and scallions and toss with lime juice , parmesan , salt , and pepper to taste until combined . season with salt , pepper , and remaining herb mixture . garnish with lemon wedges and chives and serve immediately , passing olive oils . ( older chorizo peas can be made 2 hours ahead and refrigerated , punched


In [23]:
# generate text with temperature = 0.2
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=100, 
                               temperature=0.2, 
                               output_info=True)

recipe for roast turkey with creamy mushroom - wine glaze | preheat oven to 350°f . butter 13x9x2 - inch glass baking dish . mix first 4 ingredients in small bowl . season with salt and pepper . place 1 / 4 cup cheese in center of each . sprinkle with salt and pepper . bake until golden brown , about 15 minutes . cool slightly . cut into wedges . ( can be made 1 day ahead . cover and refrigerate . ) preheat oven to 400°f . place 1 / 4 of cheese in center of each of
