# LSTM on Epicurious Recipe Data

In this notebook, we'll walk through the steps required to train your own LSTM on the recipes dataset.

# LSTM Recipe Generation Notebook Overview

## Overview

This Jupyter Notebook demonstrates the process of training a Long Short-Term Memory (LSTM) model on a recipe dataset for generating new recipes. The model is built using TensorFlow and leverages the power of recurrent neural networks to learn patterns and structures in the provided recipes.

## Sections

1. **Parameters Setup**
   - Define essential parameters such as vocabulary size, maximum sequence length, embedding dimensions, LSTM units, etc.

2. **Load the Data**
   - Load the recipe dataset from a JSON file, filter and preprocess the data.

3. **Tokenize the Data**
   - Tokenize the recipes by padding punctuation and convert them into a format suitable for training.

4. **Create the Training Set**
   - Prepare the input and output sequences for training the LSTM model.

5. **Build the LSTM Model**
   - Define the architecture of the LSTM model using TensorFlow's Keras API.

6. **Train the LSTM Model**
   - Compile and train the LSTM model on the prepared dataset. Checkpoints and TensorBoard are utilized for monitoring.

7. **Generate Text using the LSTM**
   - Implement a TextGenerator callback to generate text during training. Use the trained model to generate recipes based on specified prompts.

8. **Print Probability Analysis**
   - Evaluate the probabilities of predicted words for different prompts and temperatures.

9. **Save the Model**
   - Save the final trained model for future use.

10. **Results and Usage**
    - Discuss the generated results, provide instructions for model usage, and showcase generated recipes.

## Usage

1. Clone the repository and navigate to the project directory.

2. Install the required dependencies using `pip install -r requirements.txt`.

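3. Download the recipe dataset and place it in the `/data` directory. The loading cell in this notebook reads `/kaggle/input/recipe/full_format_recipes.json`, so update that path if the file is stored elsewhere.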

4. Run the Jupyter Notebook: `jupyter notebook LSTM_Recipe_Generation.ipynb`.

5. Experiment with different parameters, prompts, and temperatures to generate diverse recipes.




In [1]:


import numpy as np
import json
import re
import string

import tensorflow as tf
from tensorflow.keras import layers, models, callbacks, losses

## 0. Parameters <a name="parameters"></a>

1. **VOCAB_SIZE (10000):**
   - The maximum size of the vocabulary. The `TextVectorization` layer keeps only the most frequent tokens up to this limit and maps everything else to `[UNK]`, which keeps the embedding and output layers a manageable size.

2. **MAX_LEN (200):**
   - Maximum sequence length, measured in tokens (here, words and punctuation marks). Sequences longer than this are truncated, and shorter ones are padded with zeros. The vectorisation layer uses `MAX_LEN + 1` so that the input and the one-token-shifted target each contain `MAX_LEN` tokens.

3. **EMBEDDING_DIM (100):**
   - The dimensionality of the word embeddings. Each word in the vocabulary will be represented as a dense vector of this size. Embeddings capture semantic relationships between words.

4. **N_UNITS (128):**
   - The number of LSTM units or cells in the LSTM layer. LSTM units are responsible for learning and capturing sequential patterns in the input data.

5. **VALIDATION_SPLIT (0.2):**
   - The fraction of the dataset intended for validation. A value of 0.2 would reserve 20% of the data for monitoring performance on recipes not seen during training. The parameter is defined here, but the `fit` call in this notebook does not use it, so training runs without a validation set.

6. **SEED (42):**
   - A random seed for reproducibility. Applying a seed makes weight initialization and dataset shuffling consistent across runs. The notebook defines `SEED` but does not pass it to TensorFlow or NumPy in any cell, so results will vary between runs as written.

7. **LOAD_MODEL (False):**
   - A boolean flag indicating whether to load a pre-trained model. If set to `True`, the notebook will attempt to load a saved model instead of training a new one.

8. **BATCH_SIZE (32):**
   - The number of samples used in each iteration during training. It defines how many training examples are processed together before updating the model's weights. A smaller batch size gives noisier gradient estimates and uses less memory; a larger one gives smoother updates at a higher memory cost.

9. **EPOCHS (3):**
   - The number of epochs for training the model. An epoch is one complete pass through the entire training dataset. Training for multiple epochs allows the model to learn from the data iteratively.
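
The effect of `MAX_LEN` can be sketched in plain Python. This is only an illustration: the helper name `pad_or_truncate` is hypothetical, and in the notebook the same job is done by the `TextVectorization` layer through `output_sequence_length`. Right-padding with id 0 is assumed here because that matches the tokenised example shown later in the notebook.

```python
MAX_LEN = 200  # same value as the notebook parameter
PAD_ID = 0     # id 0 is the padding token in the notebook's vocabulary


def pad_or_truncate(token_ids, length):
    """Hypothetical helper: return exactly `length` ids, cut or zero-padded on the right."""
    clipped = token_ids[:length]
    return clipped + [PAD_ID] * (length - len(clipped))


short = pad_or_truncate([26, 16, 557], 6)
long = pad_or_truncate(list(range(1, 11)), 6)
full = pad_or_truncate([26, 16, 557], MAX_LEN + 1)

print(short)      # [26, 16, 557, 0, 0, 0]
print(long)       # [1, 2, 3, 4, 5, 6]
print(len(full))  # 201
```

The extra token in `MAX_LEN + 1` is what allows the training set to be split into an input and a shifted target of `MAX_LEN` tokens each.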


In [2]:
VOCAB_SIZE = 10000
MAX_LEN = 200
EMBEDDING_DIM = 100
N_UNITS = 128
VALIDATION_SPLIT = 0.2
SEED = 42
LOAD_MODEL = False
BATCH_SIZE = 32
EPOCHS = 3

## 1. Load the data <a name="load"></a>

In [3]:
# Load the full dataset
with open("/kaggle/input/recipe/full_format_recipes.json") as json_data:
    recipe_data = json.load(json_data)

In [4]:
# Filter the dataset
filtered_data = [
    "Recipe for " + x["title"] + " | " + " ".join(x["directions"])
    for x in recipe_data
    if "title" in x
    and x["title"] is not None
    and "directions" in x
    and x["directions"] is not None
]

In [5]:
# Count the recipes
n_recipes = len(filtered_data)
print(f"{n_recipes} recipes loaded")

20111 recipes loaded


In [6]:
example = filtered_data[9]
print(example)

Recipe for Ham Persillade with Mustard Potato Salad and Mashed Peas  | Chop enough parsley leaves to measure 1 tablespoon; reserve. Chop remaining leaves and stems and simmer with broth and garlic in a small saucepan, covered, 5 minutes. Meanwhile, sprinkle gelatin over water in a medium bowl and let soften 1 minute. Strain broth through a fine-mesh sieve into bowl with gelatin and stir to dissolve. Season with salt and pepper. Set bowl in an ice bath and cool to room temperature, stirring. Toss ham with reserved parsley and divide among jars. Pour gelatin on top and chill until set, at least 1 hour. Whisk together mayonnaise, mustard, vinegar, 1/4 teaspoon salt, and 1/4 teaspoon pepper in a large bowl. Stir in celery, cornichons, and potatoes. Pulse peas with marjoram, oil, 1/2 teaspoon pepper, and 1/4 teaspoon salt in a food processor to a coarse mash. Layer peas, then potato salad, over ham.


## 2. Tokenise the data

In [7]:
# Pad the punctuation, to treat them as separate 'words'
def pad_punctuation(s):
    s = re.sub(f"([{string.punctuation}])", r" \1 ", s)
    s = re.sub(" +", " ", s)
    return s


text_data = [pad_punctuation(x) for x in filtered_data]
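
A quick check of what the padding step does to a fraction and a hyphenated measurement may help when reading the model's predictions. The sketch below re-implements the same idea in a self-contained form; the use of `re.escape` and the sample sentence are assumptions added for illustration, not part of the notebook.

```python
import re
import string


def pad_punctuation(s):
    # Same idea as the notebook's function, with the punctuation
    # characters escaped before they are placed in the character class.
    pattern = f"([{re.escape(string.punctuation)}])"
    s = re.sub(pattern, r" \1 ", s)
    return re.sub(" +", " ", s)


padded = pad_punctuation("Cut into 1/4-inch slices.")
tokens = padded.split()
print(tokens)
# ['Cut', 'into', '1', '/', '4', '-', 'inch', 'slices', '.']
```

Each punctuation mark becomes its own token, so "1/4-inch" is seen by the model as the five tokens `1`, `/`, `4`, `-`, `inch`.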

In [8]:
# Display an example of a recipe
example_data = text_data[9]
example_data

'Recipe for Ham Persillade with Mustard Potato Salad and Mashed Peas | Chop enough parsley leaves to measure 1 tablespoon ; reserve . Chop remaining leaves and stems and simmer with broth and garlic in a small saucepan , covered , 5 minutes . Meanwhile , sprinkle gelatin over water in a medium bowl and let soften 1 minute . Strain broth through a fine - mesh sieve into bowl with gelatin and stir to dissolve . Season with salt and pepper . Set bowl in an ice bath and cool to room temperature , stirring . Toss ham with reserved parsley and divide among jars . Pour gelatin on top and chill until set , at least 1 hour . Whisk together mayonnaise , mustard , vinegar , 1 / 4 teaspoon salt , and 1 / 4 teaspoon pepper in a large bowl . Stir in celery , cornichons , and potatoes . Pulse peas with marjoram , oil , 1 / 2 teaspoon pepper , and 1 / 4 teaspoon salt in a food processor to a coarse mash . Layer peas , then potato salad , over ham . '

In [9]:
# Convert to a TensorFlow Dataset
text_ds = (
    tf.data.Dataset.from_tensor_slices(text_data)
    .batch(BATCH_SIZE)
    .shuffle(1000)
)
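
Because `tf.data` applies transformations in the order they are chained, the cell above batches first and then shuffles, which reorders whole batches rather than individual recipes. The plain-Python analogy below illustrates the difference; the `batch` helper and the toy data are hypothetical and only mimic the behaviour.

```python
import random


def batch(items, batch_size):
    """Group consecutive items into lists of at most batch_size."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


recipes = [f"recipe {i}" for i in range(8)]
rng = random.Random(42)

# Order used in the notebook: batch, then shuffle (whole batches move).
batches = batch(recipes, 4)
rng.shuffle(batches)
print(batches)

# Alternative order: shuffle individual recipes, then batch.
shuffled = recipes[:]
rng.shuffle(shuffled)
print(batch(shuffled, 4))
```

With the notebook's ordering, every batch contains the same neighbouring recipes in every epoch. Whether that matters for this dataset is not something the notebook tests.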

In [10]:
# Create a vectorisation layer
vectorize_layer = layers.TextVectorization(
    standardize="lower",
    max_tokens=VOCAB_SIZE,
    output_mode="int",
    output_sequence_length=MAX_LEN + 1,
)

In [11]:
# Adapt the layer to the training set
vectorize_layer.adapt(text_ds)
vocab = vectorize_layer.get_vocabulary()

In [12]:
# Display some token:word mappings
for i, word in enumerate(vocab[:10]):
    print(f"{i}: {word}")

0: 
1: [UNK]
2: .
3: ,
4: and
5: to
6: in
7: the
8: with
9: a


In [13]:
# Display the same example converted to ints
example_tokenised = vectorize_layer(example_data)
print(example_tokenised.numpy())

[  26   16  557    1    8  298  335  189    4 1054  494   27  332  228
  235  262    5  594   11  133   22  311    2  332   45  262    4  671
    4   70    8  171    4   81    6    9   65   80    3  121    3   59
   12    2  299    3   88  650   20   39    6    9   29   21    4   67
  529   11  164    2  320  171  102    9  374   13  643  306   25   21
    8  650    4   42    5  931    2   63    8   24    4   33    2  114
   21    6  178  181 1245    4   60    5  140  112    3   48    2  117
  557    8  285  235    4  200  292  980    2  107  650   28   72    4
  108   10  114    3   57  204   11  172    2   73  110  482    3  298
    3  190    3   11   23   32  142   24    3    4   11   23   32  142
   33    6    9   30   21    2   42    6  353    3 3224    3    4  150
    2  437  494    8 1281    3   37    3   11   23   15  142   33    3
    4   11   23   32  142   24    6    9  291  188    5    9  412  572
    2  230  494    3   46  335  189    3   20  557    2    0    0    0
    0 

## 3. Create the Training Set

In [14]:
# Create the training set of recipes and the same text shifted by one word
def prepare_inputs(text):
    text = tf.expand_dims(text, -1)
    tokenized_sentences = vectorize_layer(text)
    x = tokenized_sentences[:, :-1]
    y = tokenized_sentences[:, 1:]
    return x, y


train_ds = text_ds.map(prepare_inputs)
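
The shift performed by `prepare_inputs` can be seen on a short list of token ids. The sketch below uses plain lists instead of tensors, and the helper name `shift_by_one` is hypothetical.

```python
def shift_by_one(token_ids):
    """Return (inputs, targets) where targets[i] is the token that follows inputs[i]."""
    return token_ids[:-1], token_ids[1:]


tokens = [26, 16, 557, 1, 8, 298]  # first ids of the tokenised example
x, y = shift_by_one(tokens)

for inp, tgt in zip(x, y):
    print(f"{inp:>4} -> {tgt}")
```

At every position the model is trained to predict the next token, which is why the vectorisation layer emits `MAX_LEN + 1` tokens: dropping one from each end leaves `MAX_LEN` for both `x` and `y`.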

## 4. Build the LSTM <a name="build"></a>

Each line of the next cell maps to one step:

- **Input layer:** `inputs = layers.Input(shape=(None,), dtype="int32")` defines an input for variable-length sequences of integer token ids.
- **Embedding layer:** `x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)` converts each token id into a dense vector of size `EMBEDDING_DIM`.
- **LSTM layer:** `x = layers.LSTM(N_UNITS, return_sequences=True)(x)` applies an LSTM with `N_UNITS` units. `return_sequences=True` returns the hidden state at every time step rather than only the last one, so the model can predict a next token at each position.
- **Dense layer (output):** `outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)` produces a probability distribution over the vocabulary at each position in the sequence.
- **Model construction:** `lstm = models.Model(inputs, outputs)` builds the model from the input and output tensors. Compilation happens later, in the training section.
- **Model summary:** `lstm.summary()` prints layer names, output shapes, and parameter counts, which is a quick check that the architecture matches the intended design.

In [15]:
inputs = layers.Input(shape=(None,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM)(inputs)
x = layers.LSTM(N_UNITS, return_sequences=True)(x)
outputs = layers.Dense(VOCAB_SIZE, activation="softmax")(x)
lstm = models.Model(inputs, outputs)
lstm.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, None)]            0         
                                                                 
 embedding (Embedding)       (None, None, 100)         1000000   
                                                                 
 lstm (LSTM)                 (None, None, 128)         117248    
                                                                 
 dense (Dense)               (None, None, 10000)       1290000   
                                                                 
Total params: 2407248 (9.18 MB)
Trainable params: 2407248 (9.18 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
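
The parameter counts in the summary above can be checked by hand. The sketch below assumes the standard LSTM layout of four weight sets (input, forget, and output gates plus the cell candidate), each with input weights, recurrent weights, and a bias.

```python
VOCAB_SIZE, EMBEDDING_DIM, N_UNITS = 10000, 100, 128

# One EMBEDDING_DIM-sized vector per vocabulary entry.
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# Four weight sets, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (N_UNITS * (EMBEDDING_DIM + N_UNITS) + N_UNITS)

# Weight matrix plus one bias per output token.
dense_params = N_UNITS * VOCAB_SIZE + VOCAB_SIZE

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 1000000 117248 1290000 2407248

# At 4 bytes per float32 parameter:
print(round(total * 4 / 1024**2, 2))  # 9.18
```

Most of the parameters sit in the embedding and output layers, both of which scale with `VOCAB_SIZE`.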


In [16]:
if LOAD_MODEL:
    # model.load_weights('./models/model')
    lstm = models.load_model("./models/lstm", compile=False)

## 5. Train the LSTM <a name="train"></a>

In [17]:
loss_fn = losses.SparseCategoricalCrossentropy()
lstm.compile("adam", loss_fn)
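
For a single position, sparse categorical cross-entropy is the negative log of the probability the model assigned to the correct token. The following sketch shows that calculation in plain Python on a toy three-word distribution. The function name is reused for readability only and is not the Keras implementation.

```python
import math


def sparse_categorical_crossentropy(probs, target_index):
    """Negative log-probability assigned to the correct token."""
    return -math.log(probs[target_index])


probs = [0.1, 0.2, 0.7]  # toy distribution over a 3-word vocabulary

confident = sparse_categorical_crossentropy(probs, 2)
wrong = sparse_categorical_crossentropy(probs, 0)
print(round(confident, 3))  # 0.357
print(round(wrong, 3))      # 2.303

# Loss of a uniform guess over a 10,000-token vocabulary.
uniform_baseline = math.log(10000)
print(round(uniform_baseline, 2))  # 9.21
```

The uniform baseline gives a rough reference point: a training loss well below about 9.21 means the model is doing better than guessing uniformly over the vocabulary.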

In [18]:
# Create a TextGenerator callback
class TextGenerator(callbacks.Callback):
    def __init__(self, index_to_word, top_k=10):
        # top_k is accepted but not used by this class
        super().__init__()
        self.index_to_word = index_to_word
        # Inverse vocabulary mapping: word -> token id
        self.word_to_index = {
            word: index for index, word in enumerate(index_to_word)
        }

    def sample_from(self, probs, temperature):
        # Rescale the distribution by the temperature, then renormalise
        probs = probs ** (1 / temperature)
        probs = probs / np.sum(probs)
        return np.random.choice(len(probs), p=probs), probs

    def generate(self, start_prompt, max_tokens, temperature):
        # Map prompt words to token ids; unknown words become 1 ([UNK])
        start_tokens = [
            self.word_to_index.get(x, 1) for x in start_prompt.split()
        ]
        sample_token = None
        info = []
        # Stop at max_tokens or as soon as the padding token (0) is sampled
        while len(start_tokens) < max_tokens and sample_token != 0:
            x = np.array([start_tokens])
            y = self.model.predict(x, verbose=0)
            # Sample the next token from the distribution at the last position
            sample_token, probs = self.sample_from(y[0][-1], temperature)
            info.append({"prompt": start_prompt, "word_probs": probs})
            start_tokens.append(sample_token)
            start_prompt = start_prompt + " " + self.index_to_word[sample_token]
        print(f"\ngenerated text:\n{start_prompt}\n")
        return info

    def on_epoch_end(self, epoch, logs=None):
        self.generate("recipe for", max_tokens=100, temperature=1.0)

In [19]:
# Create a model save checkpoint
model_checkpoint_callback = callbacks.ModelCheckpoint(
    filepath="./checkpoint/checkpoint.ckpt",
    save_weights_only=True,
    save_freq="epoch",
    verbose=0,
)

tensorboard_callback = callbacks.TensorBoard(log_dir="./logs")

# Create the text-generation callback from the vocabulary
text_generator = TextGenerator(vocab)

In [20]:
lstm.fit(
    train_ds,
    epochs=EPOCHS,
    callbacks=[model_checkpoint_callback, tensorboard_callback, text_generator],
)

Epoch 1/3
generated text:
recipe for , 25 of stir preheat vinegar and potato with put feed heavy segments crostini of very a wrap , tent , place measure thyme and sauté minutes . simmer , pat onions . until coarsely nuts the a large meat until beef sides with cups sauce . pepper . bake freshly it . watching pod . simmering 3 . tuna as have if egg tablespoons shoulder . 50 to crusts on scorching 

Epoch 2/3
generated text:
recipe for honeydew of microwave apple teaspoon arugula zest | beat in salt lemon . place sauce in large layer in skillet over heat heat until vanilla , about 30 minutes . add tomatoes and pepper until stick pie only if sheet bubbles . if desired after chicken , oysters occasionally , until pale brown , cover and preferably onto fragrant , about 10 minutes . stir the sugar , bay , onion , and taste and yolks ; tongs gently gently gently purée to coat . cut center of in paper or adhere . if smooth . jalapeño the

Epoch 3/3
generated text:
recipe for eggs chunks with to

<keras.src.callbacks.History at 0x7a412e9c3eb0>

**The auto-generated text** above is produced by the `TextGenerator` callback, which samples from the LSTM at the end of each epoch, starting from the prompt "recipe for" at temperature 1.0.

**The results of the auto-generator** use recipe vocabulary and punctuation, and by the second epoch the sample follows the "title | directions" layout, but the sentences are not yet coherent after three epochs. The third sample stops after a few words because generation ends as soon as the padding token (id 0) is sampled. Temperature controls the trade-off between diversity and coherence: higher temperatures (e.g., 1.0) result in more diverse but potentially less coherent text, while lower temperatures (e.g., 0.2) make the generated text more deterministic and focused.

**The training log** in a live run also reports the loss for each epoch (not reproduced in the output above), which indicates how well the model is fitting the dataset. Comparing the generated examples across epochs gives a qualitative view of the same progress.

**Overall, the auto-generator** shows that after only three epochs the model has picked up surface patterns of recipes, such as ingredient words, quantities, cooking verbs, and punctuation, rather than producing usable instructions.


In [21]:
# Save the final model
lstm.save("./models/lstm")

## 6. Generate text using the LSTM

In [22]:
def print_probs(info, vocab, top_k=5):
    for step in info:
        print(f"\nPROMPT: {step['prompt']}")
        word_probs = step["word_probs"]
        # Indices of the top_k most probable tokens, highest first
        i_sorted = np.argsort(word_probs)[::-1][:top_k]
        for idx in i_sorted:
            print(f"{vocab[idx]}:   \t{np.round(100*word_probs[idx],2)}%")
        print("--------\n")

In [23]:
info = text_generator.generate(
    "recipe for roasted vegetables | chop 1 /", max_tokens=10, temperature=1.0
)


generated text:
recipe for roasted vegetables | chop 1 / 4 cups



In [24]:
print_probs(info, vocab)


PROMPT: recipe for roasted vegetables | chop 1 /
2:   	42.78%
4:   	33.32%
3:   	7.24%
8:   	3.59%
1:   	3.33%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 4
-:   	53.76%
cup:   	19.65%
inch:   	4.73%
cups:   	4.3%
teaspoon:   	2.38%
--------



In [25]:
info = text_generator.generate(
    "recipe for roasted vegetables | chop 1 /", max_tokens=10, temperature=0.2
)


generated text:
recipe for roasted vegetables | chop 1 / 2 -



In [26]:
print_probs(info, vocab)


PROMPT: recipe for roasted vegetables | chop 1 /
2:   	77.71%
4:   	22.27%
3:   	0.01%
8:   	0.0%
1:   	0.0%
--------


PROMPT: recipe for roasted vegetables | chop 1 / 2
-:   	99.78%
cup:   	0.22%
cups:   	0.0%
/:   	0.0%
inch:   	0.0%
--------
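
The shift between the temperature 1.0 and temperature 0.2 runs above can be reproduced from the printed numbers. The sketch below applies the same rescaling as `sample_from` to the five probabilities reported for the first prompt. It assumes the remaining probability mass, spread over the rest of the vocabulary, is negligible after rescaling, so the figures are approximate.

```python
def apply_temperature(probs, temperature):
    """Raise each probability to 1/temperature and renormalise."""
    scaled = [p ** (1 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]


# Top-5 probabilities printed at temperature 1.0 for "... chop 1 /"
top5 = {"2": 0.4278, "4": 0.3332, "3": 0.0724, "8": 0.0359, "1": 0.0333}

cooled = apply_temperature(list(top5.values()), 0.2)
for word, p in zip(top5, cooled):
    print(f"{word}: {100 * p:.2f}%")
```

The result puts "2" at roughly 77.7% and "4" at roughly 22.3%, in line with the temperature 0.2 output: raising probabilities to the power 5 widens the gap between the leading candidates and everything else.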



In [27]:
info = text_generator.generate(
    "recipe for chocolate ice cream |", max_tokens=7, temperature=1.0
)
print_probs(info, vocab)


generated text:
recipe for chocolate ice cream | cook


PROMPT: recipe for chocolate ice cream |
preheat:   	14.89%
combine:   	10.14%
in:   	9.32%
whisk:   	4.49%
heat:   	4.31%
--------



In [28]:
info = text_generator.generate(
    "recipe for chocolate ice cream |", max_tokens=7, temperature=0.2
)
print_probs(info, vocab)


generated text:
recipe for chocolate ice cream | preheat


PROMPT: recipe for chocolate ice cream |
preheat:   	79.9%
combine:   	11.69%
in:   	7.69%
whisk:   	0.2%
heat:   	0.16%
--------



**Analysis for the prompt "recipe for roasted vegetables | chop 1 /" (temperature 1.0)**

**Top predictions:**
- Word "2" with a probability of 42.78%
- Word "4" with a probability of 33.32%
- Word "3" with a probability of 7.24%
- Word "8" with a probability of 3.59%
- Word "1" with a probability of 3.33%

**Analysis for the prompt "recipe for roasted vegetables | chop 1 / 4" (temperature 1.0)**

**Top predictions:**
- Word "-" with a probability of 53.76%
- Word "cup" with a probability of 19.65%
- Word "inch" with a probability of 4.73%
- Word "cups" with a probability of 4.3%
- Word "teaspoon" with a probability of 2.38%

**Interpretation:**
- The probabilities reflect patterns in the training data; each percentage is the model's estimate that the word comes next.
- For the first prompt, all five top predictions are digits ("2", "4", "3", "8", "1"), so the model has learned that a digit follows "1 /" to complete a fraction.
- For the second prompt, a hyphen ("-") is the most likely continuation. Because punctuation was padded into separate tokens during tokenisation, this likely corresponds to hyphenated measurements such as "1/4-inch". The other predictions are units of measurement: "cup", "inch", "cups", and "teaspoon".
- Lowering the temperature to 0.2 sharpens the same distribution: for the first prompt, "2" rises from 42.78% to 77.71%, while "3", "8", and "1" fall to roughly zero.
