# LSTM on Recipe Data

**The notebook has been adapted from the notebook provided in David Foster's Generative Deep Learning, 2nd Edition.**

- Book: [Amazon](https://www.amazon.com/Generative-Deep-Learning-Teaching-Machines/dp/1098134184/ref=sr_1_1?keywords=generative+deep+learning%2C+2nd+edition&qid=1684708209&sprefix=generative+de%2Caps%2C93&sr=8-1)
- Original notebook (TensorFlow and Keras): [GitHub](https://github.com/davidADSP/Generative_Deep_Learning_2nd_Edition/blob/main/notebooks/05_autoregressive/01_lstm/lstm.ipynb)
- Dataset: [Kaggle](https://www.kaggle.com/datasets/hugodarwood/epirecipes)

In [1]:
import numpy as np
import json
import re
import string
import time

import torch
from torch import nn
from torch.nn.functional import pad
from torch.utils.data import Dataset, DataLoader, random_split

from torchtext.vocab import build_vocab_from_iterator
from torchtext.data.utils import get_tokenizer

import torchinfo

## 0. Train parameters

In [2]:
DATA_DIR = '../../data/epirecipes/full_format_recipes.json'

EMBEDDING_DIM = 100
HIDDEN_DIM = 128
VALIDATION_SPLIT = 0.2
SEED = 1024
BATCH_SIZE = 32
EPOCHS = 30

MAX_PAD_LEN = 200
MAX_VAL_TOKENS = 100 # Max number of tokens when generating text

DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'

## 1. Load dataset

In [3]:
def pad_punctuation(sentence):
    sentence = re.sub(f'([{string.punctuation}])', r' \1 ', sentence)
    sentence = re.sub(' +', ' ', sentence)
    return sentence
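To see what `pad_punctuation` does to a line of recipe text, here is a small self-contained check. The input sentence is made up for illustration; the function body is copied from the cell above.

```python
import re
import string

def pad_punctuation(sentence):
    # Surround every ASCII punctuation character with spaces
    sentence = re.sub(f'([{string.punctuation}])', r' \1 ', sentence)
    # Collapse runs of spaces into a single space
    sentence = re.sub(' +', ' ', sentence)
    return sentence

# Hypothetical input, not taken from the dataset
example = pad_punctuation('Bake 1-inch pieces, covered, 15 minutes.')
print(repr(example))
```

Each punctuation mark becomes its own whitespace-separated token, and a trailing space is left after the final period. Characters outside `string.punctuation`, such as `°`, are left attached to their neighbours, which is why `350°F` stays in one piece in the sample recipe.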

In [4]:
# Load dataset
with open(DATA_DIR, 'r', encoding='utf-8') as f:  # read-only; the recipes contain non-ASCII characters
    recipe_data = json.load(f)

In [5]:
# preprocess dataset
filtered_data = [
    'Recipe for ' + x['title'] + ' | ' + ' '.join(x['directions'])
    for x in recipe_data
    if 'title' in x and x['title']
    and 'directions' in x and x['directions']
]

text_ds = [pad_punctuation(sentence) for sentence in filtered_data]

print(f'Total recipes loaded: {len(text_ds)}')

Total recipes loaded: 20098


In [6]:
print('Sample data:')
sample_data = np.random.choice(text_ds)
print(sample_data)

Sample data:
Recipe for Tomato Barbecue Baby Back Ribs | Put oven rack in lower third of oven and preheat oven to 350°F . Line a 17 - by 12 - by 1 - inch shallow baking pan with foil . Stir together all sauce ingredients in a 2 - quart heavy saucepan and bring to a boil over moderate heat . Reduce heat and simmer , covered , 15 minutes . Transfer sauce to a food processor and purée until smooth . Coat both sides of ribs with sauce . Arrange ribs , meaty side up , in 1 layer in baking pan , overlapping if necessary . Cover pan tightly with foil and bake ribs 1 hour . Remove foil ( from top ) and bake ribs 1 hour more . 


## 2. Build vocabularies

In [7]:
# The iterator that yields tokenized data
def yield_tokens(data_iter, tokenizer):
    for sample in data_iter:
        yield tokenizer(sample)

# Building vocabulary
def build_vocab(dataset, tokenizer):
    vocab = build_vocab_from_iterator(
        yield_tokens(dataset, tokenizer),
        min_freq=2,
        specials=['<pad>', '<unk>']
    )
    return vocab
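The effect of `min_freq=2` and the two special tokens can be illustrated with a plain-Python stand-in. This is a simplified sketch, not torchtext's implementation: `build_simple_vocab` and `encode` are hypothetical helpers, and ordering by descending frequency with alphabetical tie-breaking is an assumption of the sketch.

```python
from collections import Counter

def build_simple_vocab(token_lists, min_freq=2, specials=('<pad>', '<unk>')):
    # Count every token across all tokenized samples
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Keep tokens seen at least min_freq times, most frequent first
    # (alphabetical tie-breaking is an assumption of this sketch)
    kept = sorted((t for t, c in counts.items() if c >= min_freq),
                  key=lambda t: (-counts[t], t))
    itos = list(specials) + kept              # special tokens take the lowest indices
    stoi = {tok: i for i, tok in enumerate(itos)}
    return stoi, itos

def encode(tokens, stoi):
    # Tokens missing from the vocabulary fall back to <unk>
    return [stoi.get(tok, stoi['<unk>']) for tok in tokens]

# Toy corpus, made up for illustration
corpus = [['add', 'salt', '.'], ['add', 'pepper', '.'], ['stir', '.']]
stoi, itos = build_simple_vocab(corpus)
ids = encode(['add', 'saffron', '.'], stoi)
print(itos, ids)
```

Words that occur only once (`salt`, `pepper`, `stir`) are dropped from the vocabulary, so they, like the unseen `saffron`, map to the `<unk>` index at encoding time.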

In [8]:
tokenizer = get_tokenizer('basic_english')
vocab = build_vocab(text_ds, tokenizer)
vocab.set_default_index(vocab['<unk>'])

# Create index-to-word mapping
index_to_word = {index : word for word, index in vocab.get_stoi().items()}

In [9]:
# display some token-word mappings
for i in range(10):
    word = vocab.get_itos()[i]
    print(f'{i}: {word}')

0: <pad>
1: <unk>
2: .
3: ,
4: and
5: to
6: in
7: the
8: with
9: a


In [10]:
# Check mappings
mapped_sample = vocab(tokenizer(sample_data))
print('Source text:')
print(sample_data)
print('\n')
print('Mapped sample:')
print(mapped_sample)

Source text:
Recipe for Tomato Barbecue Baby Back Ribs | Put oven rack in lower third of oven and preheat oven to 350°F . Line a 17 - by 12 - by 1 - inch shallow baking pan with foil . Stir together all sauce ingredients in a 2 - quart heavy saucepan and bring to a boil over moderate heat . Reduce heat and simmer , covered , 15 minutes . Transfer sauce to a food processor and purée until smooth . Coat both sides of ribs with sauce . Arrange ribs , meaty side up , in 1 layer in baking pan , overlapping if necessary . Cover pan tightly with foil and bake ribs 1 hour . Remove foil ( from top ) and bake ribs 1 hour more . 


Mapped sample:
[25, 16, 263, 531, 1321, 401, 453, 26, 239, 46, 118, 6, 585, 435, 14, 46, 4, 85, 46, 5, 215, 2, 327, 9, 1460, 13, 179, 193, 13, 179, 11, 13, 52, 341, 57, 43, 8, 166, 2, 41, 109, 121, 53, 130, 6, 9, 15, 13, 248, 77, 79, 4, 83, 5, 9, 68, 20, 267, 17, 2, 152, 17, 4, 69, 3, 120, 3, 126, 12, 2, 39, 53, 5, 9, 289, 187, 4, 322, 10, 140, 2, 162, 405, 128, 14, 45

## 3. Create DataLoader

In [11]:
class Collate():
    def __init__(self, tokenizer, vocab, max_padding, pad_idx):
        self.tokenizer = tokenizer
        self.vocab = vocab

        self.max_padding = max_padding
        self.pad_idx = pad_idx

    
    def collate_fn(self, batch):
        src_list = []
        tgt_list = []

        # Prepare source and target batch
        for sentence in batch:
            # convert text to vocab tensor
            tokens = self.tokenizer(sentence)
            src_mapping = torch.tensor(self.vocab(tokens[:-1]), dtype=torch.int64)
            tgt_mapping = torch.tensor(self.vocab(tokens[1:]), dtype=torch.int64)
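            # pad sequence to max_padding; for longer sequences the pad amount is
            # negative, which makes `pad` crop the tail instead
            src_padded = pad(src_mapping, [0, self.max_padding - len(src_mapping)], value=self.pad_idx)
            tgt_padded = pad(tgt_mapping, [0, self.max_padding - len(tgt_mapping)], value=self.pad_idx)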
            # append padded sequence to corresponding lists
            src_list.append(src_padded)
            tgt_list.append(tgt_padded)

        # stack batch
        src = torch.stack(src_list)
        tgt = torch.stack(tgt_list)

        return (src, tgt)
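The source/target construction in `collate_fn` is a shift by one token: the model reads `tokens[:-1]` and is trained to predict `tokens[1:]`. Below is a plain-list sketch of that idea. `make_pair` is a hypothetical helper, and truncating overlong sequences to `max_len` is this sketch's assumption about the intended behaviour.

```python
def make_pair(token_ids, max_len, pad_idx=0):
    src = token_ids[:-1]   # everything except the last token
    tgt = token_ids[1:]    # the same sequence shifted left by one

    def fit(seq):
        # Truncate to max_len, then right-pad with pad_idx
        seq = seq[:max_len]
        return seq + [pad_idx] * (max_len - len(seq))

    return fit(src), fit(tgt)

# Short sequence: padded on the right
src, tgt = make_pair([25, 16, 263, 531, 2], max_len=6)
print(src)
print(tgt)

# Long sequence: truncated to max_len
long_src, long_tgt = make_pair(list(range(1, 10)), max_len=4)
print(long_src, long_tgt)
```

At every position `i`, `tgt[i]` is the token that follows `src[i]`, so one forward pass yields a next-token prediction for each position in the sequence.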

In [12]:
# Split dataset into training and validation splits (seeded with SEED for reproducibility)
train_ds, valid_ds = random_split(text_ds, [1-VALIDATION_SPLIT, VALIDATION_SPLIT],
                                  generator=torch.Generator().manual_seed(SEED))
print("Num. training data: \t", len(train_ds))
print("Num. validation data: \t", len(valid_ds))

Num. training data: 	 16079
Num. validation data: 	 4019
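The split sizes above can be reproduced by hand. The sketch below assumes that `random_split`, when given fractions, floors each fractional length and then hands out any leftover items one at a time starting from the first split; `split_lengths` is a hypothetical helper, not a PyTorch function.

```python
import math

def split_lengths(n, fractions):
    # Floor each fractional share of the dataset
    lengths = [math.floor(n * f) for f in fractions]
    # Distribute the leftover items round-robin (assumed behaviour)
    remainder = n - sum(lengths)
    for i in range(remainder):
        lengths[i % len(lengths)] += 1
    return lengths

train_len, valid_len = split_lengths(20098, [1 - 0.2, 0.2])
print(train_len, valid_len)
```

With 20098 recipes the floored shares are 16078 and 4019, and the single leftover item goes to the training split, which matches the counts printed by the cell.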


In [13]:
pad_idx = vocab.get_stoi()['<pad>']
print('index of <pad> token: ', pad_idx)

collate = Collate(tokenizer, vocab, MAX_PAD_LEN+1, pad_idx)

train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, 
                          shuffle=True, num_workers=8, pin_memory=True,
                          collate_fn=collate.collate_fn)

valid_loader = DataLoader(valid_ds, batch_size=BATCH_SIZE, 
                          shuffle=False, num_workers=8, pin_memory=True,
                          collate_fn=collate.collate_fn)

index of <pad> token:  0


## 4. Build LSTM model

In [14]:
class LSTM_Net(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings=vocab_size,
                                      embedding_dim=EMBEDDING_DIM,
                                      padding_idx=pad_idx)
        
        self.lstm = nn.LSTM(input_size=EMBEDDING_DIM,
                            hidden_size=HIDDEN_DIM,
                            num_layers=2,
                            batch_first=True)
        
        self.output = nn.Linear(in_features=HIDDEN_DIM,
                                out_features=vocab_size)
        
    def forward(self, x):
        x = self.embedding(x)
        x, hidden_state = self.lstm(x)
        return self.output(x)


model = LSTM_Net(len(vocab))
torchinfo.summary(model=model, input_size=(BATCH_SIZE, MAX_PAD_LEN+1), dtypes=[torch.int64])

Layer (type:depth-idx)                   Output Shape              Param #
LSTM_Net                                 [32, 201, 8628]           --
├─Embedding: 1-1                         [32, 201, 100]            862,800
├─LSTM: 1-2                              [32, 201, 128]            249,856
├─Linear: 1-3                            [32, 201, 8628]           1,113,012
Total params: 2,225,668
Trainable params: 2,225,668
Non-trainable params: 0
Total mult-adds (G): 1.67
Input size (MB): 0.05
Forward/backward pass size (MB): 455.69
Params size (MB): 8.90
Estimated Total Size (MB): 464.65
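The parameter counts in the summary can be checked with a few lines of arithmetic. The sketch assumes the standard `nn.LSTM` layout of four gates, each with input weights, recurrent weights, and two bias vectors; the vocabulary size 8628 is read off the output shape above.

```python
VOCAB_SIZE = 8628      # from the summary's output shape
EMBEDDING_DIM = 100
HIDDEN_DIM = 128

# Embedding: one EMBEDDING_DIM-sized vector per vocabulary entry
embedding = VOCAB_SIZE * EMBEDDING_DIM

def lstm_layer_params(input_size, hidden_size):
    # 4 gates x (input weights + recurrent weights + two bias vectors)
    return 4 * (input_size * hidden_size + hidden_size * hidden_size + 2 * hidden_size)

# Two stacked layers: the second layer takes the first layer's hidden state as input
lstm = lstm_layer_params(EMBEDDING_DIM, HIDDEN_DIM) + lstm_layer_params(HIDDEN_DIM, HIDDEN_DIM)

# Linear: weight matrix plus bias
linear = HIDDEN_DIM * VOCAB_SIZE + VOCAB_SIZE

total = embedding + lstm + linear
print(embedding, lstm, linear, total)
```

The output projection is the largest component, since it maps every hidden state to a score for each of the 8628 vocabulary entries.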

## 5. Text generation, training, and evaluation functions

In [15]:
class TextGenerator():
    def __init__(self, index_to_word):
        self.index_to_word = index_to_word

    # Rescale the model's output probabilities with temperature, then sample a token
    def sample_from(self, probs, temperature):
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    # Generate text (relies on the global `vocab`, `tokenizer`, and `DEVICE`)
    def generate(self, model, start_prompt, max_tokens, temperature, output_info=False):
        start_tokens = vocab(tokenizer(start_prompt))
        sample_token = None
        info = []

        model.eval()
        # Stop at max_tokens or as soon as the padding index (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            input_prompts = torch.tensor(start_tokens, device=DEVICE).unsqueeze(0)
            with torch.no_grad():
                # Logits for the next token, taken at the last position
                logits = model(input_prompts)[0][-1]
            probs = nn.functional.softmax(logits, dim=-1)
            sample_token, probs = self.sample_from(probs.cpu().numpy(), temperature)

            start_tokens.append(int(sample_token))
            if output_info:
                # Note: this snapshot already includes the token that was just sampled
                info.append({'token': np.copy(start_tokens), 'word_probs': probs})

        output_text = [self.index_to_word[token] for token in start_tokens if token != 0]
        print(' '.join(output_text))
        return info
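The temperature scaling in `sample_from` raises each probability to the power `1 / temperature` and renormalizes. A plain-Python restatement on a toy distribution shows the effect; `apply_temperature` and the three-element distribution are illustrative and not part of the notebook.

```python
def apply_temperature(probs, temperature):
    # Raise each probability to 1/temperature, then renormalize to sum to 1
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

base = [0.5, 0.3, 0.2]                 # toy distribution
sharp = apply_temperature(base, 0.2)   # low temperature: mass concentrates on the top choice
flat = apply_temperature(base, 2.0)    # high temperature: distribution flattens
same = apply_temperature(base, 1.0)    # temperature 1.0 leaves it unchanged

print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

Temperatures below 1 make generation more deterministic, and temperatures above 1 make it more varied, at the cost of more unlikely tokens being sampled.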

In [16]:
# Training function
def train_step(model, dataloader, loss_fn, optimizer):
    
    model.train()
    total_loss = 0
    
    for sources, targets in dataloader:    
        optimizer.zero_grad()
    
        sources, targets = sources.to(DEVICE), targets.to(DEVICE)
        preds = model(sources)
        # flatten (batch, seq, vocab) logits and (batch, seq) targets for the loss
        loss = loss_fn(preds.reshape(-1, preds.shape[-1]), targets.reshape(-1))
        loss.backward()
        optimizer.step()

        total_loss += loss.item()

    return total_loss / len(dataloader)


# Evaluation function
def eval(model, dataloader, loss_fn):

    model.eval()
    valid_loss = 0
    
    # gradients are not needed for evaluation
    with torch.no_grad():
        for sources, targets in dataloader:
            sources, targets = sources.to(DEVICE), targets.to(DEVICE)
            preds = model(sources)
            loss = loss_fn(preds.reshape(-1, preds.shape[-1]), targets.reshape(-1))
            valid_loss += loss.item()

    return valid_loss / len(dataloader)

## 6. Training

In [17]:
model = LSTM_Net(len(vocab)).to(DEVICE)
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(model.parameters())

text_generator = TextGenerator(index_to_word)

In [18]:
loss_hist = {'train':[], 'valid':[]}

for i in range(EPOCHS):
    prev_time = time.time()
    train_loss = train_step(model, train_loader, loss_fn, optim)
    valid_loss = eval(model, valid_loader, loss_fn)

    loss_hist['train'].append(train_loss)
    loss_hist['valid'].append(valid_loss)
    
    curr_time = time.time()
    print(f'Epoch: {i+1}\tepoch time {(curr_time - prev_time) / 60:.2f} min')
    print(f'\ttrain loss: {train_loss:.4f}, valid loss: {valid_loss:.4f}')

    if (i + 1) % 10 == 0:
        print('\nGenerated text:')
        text_generator.generate(model, 'recipe for', MAX_VAL_TOKENS, 1.0)
        print('\n')

Epoch: 1	epoch time 0.10 min
	train loss: 4.3229, valid loss: 3.6531
Epoch: 2	epoch time 0.10 min
	train loss: 3.1213, valid loss: 2.8040
Epoch: 3	epoch time 0.10 min
	train loss: 2.6016, valid loss: 2.4894
Epoch: 4	epoch time 0.10 min
	train loss: 2.3598, valid loss: 2.3052
Epoch: 5	epoch time 0.10 min
	train loss: 2.2058, valid loss: 2.1854
Epoch: 6	epoch time 0.10 min
	train loss: 2.0972, valid loss: 2.0987
Epoch: 7	epoch time 0.10 min
	train loss: 2.0166, valid loss: 2.0337
Epoch: 8	epoch time 0.10 min
	train loss: 1.9530, valid loss: 1.9825
Epoch: 9	epoch time 0.10 min
	train loss: 1.8996, valid loss: 1.9395
Epoch: 10	epoch time 0.10 min
	train loss: 1.8554, valid loss: 1.9064

Generated text:
recipe for earl shrimp | preheat oven to 350°f . brush grill lightly . melt oil in heavy large wide pot over high heat . add sear until browned on it . add eggs , flour , gochujang , and remaining 3 / 4 teaspoon salt and simmer just until tender but still evaporate and grains . 4 to 12 1 / 2

## 7. Generate texts

In [19]:
# Print each prompt and its top-k candidate word probabilities
def print_probs(info, index_to_word, top_k=5):
    assert len(info) > 0, 'No info to print: call `generate` with `output_info=True`'
    for i in range(len(info)):
        start_tokens, word_probs = info[i].values()
        start_prompts = [index_to_word[token] for token in start_tokens if token != 0]
        start_prompts = ' '.join(start_prompts)
        print(f'\nPrompt: {start_prompts}')
        # word_probs
        probs_sorted = np.argsort(word_probs)[::-1][:top_k]
        for idx in probs_sorted:
            print(f'{index_to_word[idx]}\t{word_probs[idx] * 100:.2f}%')

In [20]:
# Candidate word probabilities with temperature = 1.0
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=6, 
                               temperature=1.0, 
                               output_info=True)

print_probs(info, index_to_word, 5)

recipe for roast - and -

Prompt: recipe for roast -
turkey	25.32%
chicken	14.07%
pork	13.27%
beef	10.08%
leg	3.44%

Prompt: recipe for roast - and
roast	9.22%
roasted	5.77%
spiced	3.94%
vegetable	3.40%
turkey	2.65%

Prompt: recipe for roast - and -
-	84.59%
vegetable	0.79%
red	0.38%
smoky	0.37%
spicy	0.31%


In [21]:
# Candidate word probabilities with temperature = 0.2
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=6, 
                               temperature=0.2, 
                               output_info=True)

print_probs(info, index_to_word, 5)

recipe for roast turkey with port

Prompt: recipe for roast turkey
turkey	90.69%
chicken	4.81%
pork	3.59%
beef	0.90%
leg	0.00%

Prompt: recipe for roast turkey with
with	100.00%
and	0.00%
breast	0.00%
,	0.00%
stock	0.00%

Prompt: recipe for roast turkey with port
port	74.78%
gravy	12.70%
rosemary	5.40%
sage	3.00%
garlic	1.03%
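The two first-step distributions printed above, at temperature 1.0 and at temperature 0.2, can be cross-checked against each other. Because `sample_from` raises probabilities to the power `1 / temperature` before renormalizing, the ratio of any two probabilities at temperature 0.2 should equal the temperature-1.0 ratio raised to the fifth power. The sketch below uses the rounded percentages from the outputs, so only approximate agreement is expected.

```python
T = 0.2

# First-step percentages for the prompt 'recipe for roast', copied from the outputs above
p_t1 = {'turkey': 25.32, 'chicken': 14.07, 'pork': 13.27, 'beef': 10.08}
p_t02 = {'turkey': 90.69, 'chicken': 4.81, 'pork': 3.59, 'beef': 0.90}

diffs = {}
for word in ('chicken', 'pork', 'beef'):
    # Ratio to 'turkey' predicted from the temperature-1.0 numbers
    predicted = (p_t1[word] / p_t1['turkey']) ** (1 / T)
    # Ratio to 'turkey' actually printed at temperature 0.2
    observed = p_t02[word] / p_t02['turkey']
    diffs[word] = abs(predicted - observed)
    print(f'{word}: predicted {predicted:.4f}, observed {observed:.4f}')
```

The predicted and observed ratios agree to within the rounding of the printed percentages, which shows how a low temperature turns a modest lead for `turkey` into a near-certain choice.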


In [22]:
# generate text with temperature = 1.0
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=100, 
                               temperature=1.0, 
                               output_info=True)

recipe for roast duck with black bean purée with chiles and shiitake reduction | trim excess fat from potatoes and line a finely chop potatoes , then chop and add to large bowl . pour beans into flour and mix with rubber fork . sprinkle with crabmeat . mix cheese and peanut butter in a medium bowl . mash coarse with salt in cream . add flour in another medium bowl , breaking side comes with a wooden thin to a paste . leaving oil by 1 - inch shoot - size pieces of your hands to avoid lifting shredding


In [23]:
# generate text with temperature = 0.2
info = text_generator.generate(model, 
                               'recipe for roast', 
                               max_tokens=100, 
                               temperature=0.2, 
                               output_info=True)

recipe for roast turkey with port gravy | preheat oven to 350°f . butter and flour 13x9x2 - inch glass baking dish . whisk egg yolks , salt , and pepper in large bowl to blend . add butter and rub in with fingertips until mixture resembles coarse meal . add 1 / 2 cup oil and 1 / 4 cup parmesan cheese blend until moist clumps form . transfer to large bowl . add 1 / 4 cup oil and 1 / 2 cup oil . season with salt and pepper . place 1 bread slices on work surface
