# Ingredient Parsing using Tensorflow


In this notebook I train and save a model for extracting relevant information from ingredient strings e.g. "3 very ripe tomatoes". The model is trained in tensorflow and using data from the NYTimes (see readme for link).

The model has a single input but is trained to predict the position of the name of the ingredient in the string, the position of the unit, the quantity and also the number of decimals in the quantity.

These 4 outputs are optimized simultaneously, with each seperate loss function combined before being fed into the Adam optimizer. I have had to adjust the relative weights of these loss functions in order to ensure the model does not overfit to a specific task.

NOTE: Tensorflow version 2.5 has been used as this is compatible with TFLite 2.5. You will get errors saving the trained model if you do not use these versions.

### Model Input: 

"2 tablespoons and 1 teaspoon of white sugar" -> 
Remove punctuation ->
Word Vectorisation -> 

[3, 12, 456, 34, 2304, 304, 78, 6529, 489]

Padding to longest sequence -> 

[3, 12, 456, 34, 2304, 304, 78, 6529, 489, 0, 0, 0, ..., 0]

### Label:

[

    [
        [0 0 0 0 0 0 1 1],      Ingredient Binary Mask
        [0 0 0 0 1 0 0 0]       Unit Binary Mask
    ],
    [
        7.0,                    Quantity Scalar
    ],
    
    [
        [1, 0, 0]               One hot array to indicate the number of decimal places
    ]
]


### Model:

Embedding Layer (pretrained Glove Embeddings)

LSTM layers

Dense Layers


#### Model Tail 1:
Loss: Binary Crossentropy

Metrics: F1 score for unit and name seperatly

Used as a binary classifier on the input ingredient sentence, where each word is classified for it being an ingredient or not an ingredient. The same is done for the unit and the outputs are stacked:

Given "2 tablespoons and 1 teaspoon of white sugar" it will return [0 0 0 0 0 0 1 1] in the first row to indicate the ingredient and [0 0 0 0 1 0 0 0] in the second dimension to indicate unit.

#### Model Tail 2:
Loss: Huber Loss

Metrics: MSE for the quantity 

Used for a regression on the embedding of the last dense layer, to predict the quantity of the ingredient. 

#### Model Tail 3:
Loss: Categorical Cross-entropy

Metrics: MSE for the quantity and range end seperatly

Used to predict if the quantity is an integerm has one decimal place or 2 more decimal places - this is used to round the output appropriatly.

## Data Loading



In [26]:
import pandas as pd

# Load in the labelled ingredient data provided by NYT

ing_df = pd.read_csv("nyt-ingredients-snapshot-2015.csv", index_col="index", dtype={'input': object, "name": object, "qty": object, "range_end": object, "unit": object})

# Drop the columns not needed
ing_df.drop(["comment"], axis=1, inplace=True)
ing_df.info()



<class 'pandas.core.frame.DataFrame'>
Int64Index: 179207 entries, 0 to 179206
Data columns (total 5 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   input      179063 non-null  object
 1   name       178759 non-null  object
 2   qty        179207 non-null  object
 3   range_end  173986 non-null  object
 4   unit       123082 non-null  object
dtypes: object(5)
memory usage: 8.2+ MB


## Preprocessing

A few bits of preprocessing need to happen to the data first

1. Data Cleaning
- Removing punctuation except for . and , as these are common and have contextual meaning
- making . and , seperate words
- Checking that for each row in the dataset the name of the ingredient and the unit are in the input (these are given seperately)
- Removing null inputs (where there is no entry for the input and``  ingredient)
- Dropping rows that dont have a quantity
- replacing nan values in the unit column with 0's

2. Creating labels
- Converting the data to the relevant output (regexp)
- Dropping rows that havent matched to the regexp or dont contain a label

In [27]:
import numpy as np
import re
# Check if the name column has any punctuation - It does have a few rows that have way too much information, I will remove these
def check_if_punct(string: str):
    b = True
    punc_regexp = r"[.,]"
    match = re.search(punc_regexp, string)
    if match:
        b = np.nan
    return b

# Remove rows that dont have the required fields
ing_df.dropna(subset=['input', "name"], inplace=True)
print(f"After removing na values in the name an input, the length of the df is now {len(ing_df)}")

# Convert all the string columns to strings
for col in ['input', 'name', 'unit', "qty"]:
    ing_df[col] = ing_df[col].apply(str)
    
# Replace all na values in the qty column with 0
ing_df["qty"] = ing_df["qty"].fillna("0")
ing_df["range_end"] = ing_df["range_end"].fillna("0")

ing_df['name_no_punct'] = ing_df['name'].apply(check_if_punct)

# Remove rows that have punctuation in the
ing_df.dropna(subset=["name_no_punct"], inplace=True)
print(f"After removing rows where the name or the input contain commas or full stops, the length of the df is now {len(ing_df)}")

After removing na values in the name an input, the length of the df is now 178668
After removing rows where the name or the input contain commas or full stops, the length of the df is now 174983


The above shows that the filter is working and that rows are being removed. It is also good to see that this filter does not drastically affect the number of rows.

After all the filters have been compelted, then I will manually review subsets of the data to see if there are any further issues or things that have slipped through the net.

In [28]:
import re

# Preprocess the input

def process_punctuation(string):
    string = re.sub(r"([,./])", r" \g<1> ", string)
    string = re.sub(r'[^\w\s.,/]', r" ", string)
    string = re.sub(r"\s+", r" ", string)
    string = re.sub("(\s$)|(^\s)", "", string)

    string = string.lower()
    # Remove trailing white space
    if string[-1] == " ":
        string = string[:-1]
    if string[0] == " ":
        string = string[1:]
    return string


for col in ["input", "name", "unit"]:
    ing_df[col + '_parsed'] = ing_df[col].apply(process_punctuation)


ing_df.head()

Unnamed: 0_level_0,input,name,qty,range_end,unit,name_no_punct,input_parsed,name_parsed,unit_parsed
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,True,1 1 / 4 cups cooked and pureed fresh butternut...,butternut squash,cup
1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.0,0.0,cup,True,1 cup peeled and cooked fresh chestnuts about ...,chestnuts,cup
2,"1 medium-size onion, peeled and chopped",onion,1.0,0.0,,True,"1 medium size onion , peeled and chopped",onion,
3,"2 stalks celery, chopped coarse",celery,2.0,0.0,stalk,True,"2 stalks celery , chopped coarse",celery,stalk
4,1 1/2 tablespoons vegetable oil,vegetable oil,1.5,0.0,tablespoon,True,1 1 / 2 tablespoons vegetable oil,vegetable oil,tablespoon


In the above the string columns are processed to ensure they are all lower case and also to remove any punctuation. The only punctuation I am including is . and , as these are used regularly and provide useful contextual information. finally I am removing multiple whitespace characters.

In [29]:
import re
# Check that the input column contains the ingredient name and also the unit if it has one
def check_if_name_and_unit(input, name, unit):
    name_regexp = f"{name}\w*"
    unit_regexp = f"{unit}\w*"

    match = re.search(name_regexp, input)

    if not match:
        return False
    # Check that the unit isnt nan
    if not unit==unit:
        print(unit)
        match = re.search(unit_regexp, input)
        if not match:
            return False
    return True


ing_df["check_name"] = ing_df.apply(lambda x: check_if_name_and_unit(x['input_parsed'], x['name_parsed'], x['unit_parsed']), axis=1)
ing_df.head()

# Check a few of the entries where it is false
ing_df[~ing_df["check_name"]].head()

Unnamed: 0_level_0,input,name,qty,range_end,unit,name_no_punct,input_parsed,name_parsed,unit_parsed,check_name
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1
253,2 to 3 teaspoons minced jalapeño,jalapeños,2.0,3.0,teaspoon,True,2 to 3 teaspoons minced jalapeño,jalapeños,teaspoon,False
274,Salt and freshly ground black pepper to taste,Salt and black pepper,0.0,0.0,,True,salt and freshly ground black pepper to taste,salt and black pepper,,False
332,Salt and freshly ground black pepper,Salt and black pepper,0.0,0.0,,True,salt and freshly ground black pepper,salt and black pepper,,False
347,Salt and freshly ground black pepper,Salt and black pepper,0.0,0.0,,True,salt and freshly ground black pepper,salt and black pepper,,False
362,Salt and freshly ground pepper,Salt and pepper,0.0,0.0,,True,salt and freshly ground pepper,salt and pepper,,False


In the above I confirm that the input column has the name and unit in it. This helps to sift out data containing errors.
It also means sifting out data such as input = "2 teaspoons of jalapeños" where the name has been given as name = "jalapeño".
It will remove this data, but I do not believe it will have a significant impact as whether the ingredient is plural or
not it will still be in the same position in the word.

In [30]:
ing_df = ing_df[ing_df["check_name"]]
print(f"After removing all the rows where the name and unit are not included in the input, it leaves {len(ing_df)} rows")



After removing all the rows where the name and unit are not included in the input, it leaves 170302 rows


## Creating labels

For the unit and ingredient labels, I wil need to convert "2 tablespoons and 1 teaspoon of white sugar" into [0 0 0 0 0 0 1 1] for instance.

Now that the data is processed, this should not be too difficult.

In [31]:
# Convert a string of multiple words into a binary array with 1 for each word in 
# Match is and 0 where any other words are.
def create_label(inp, match):
    match_regexp = f"{match}\w*"
    match_num_words = len(match.split(" "))                  
    match_replaced = re.sub(match_regexp, " MATCH ", inp)
    match_replaced = re.sub("\s+", " " ,match_replaced)
    match_replaced = re.sub("(\s$)|(^\s)", "", match_replaced)
    word_array = match_replaced.split(" ")
    label = []
    for word in word_array:
        if word == "MATCH":
            label.extend([1]*match_num_words)
        else:
            label.append(0)
    return label
        
ing_df["name_label"] = ing_df.apply(lambda x: create_label(x['input_parsed'], x['name_parsed']), axis=1)
ing_df["unit_label"] = ing_df.apply(lambda x: create_label(x['input_parsed'], x['unit_parsed']), axis=1)

# column indicating if the quantity is an integer
ing_df["qty_integer"] = ing_df.apply(lambda x: True if float(x.qty).is_integer() else False, axis=1)

# Column to hold the number of decimal places of the qty column
ing_df["qty_decimal"] = ing_df.apply(lambda x: len(str(x["qty"]).split(".")[1]), axis=1)

ing_df.loc[ing_df["qty_integer"], "qty_decimal"] = 0

def one_hot(i, l):
    output = np.zeros(l)
    output[i] = 1
    return output

ing_df["qty_decimal_oh"] = ing_df.apply(lambda x: one_hot(x.qty_decimal, 3), axis=1)
ing_df["qty"] = ing_df["qty"].astype(float)
ing_df.head()


Unnamed: 0_level_0,input,name,qty,range_end,unit,name_no_punct,input_parsed,name_parsed,unit_parsed,check_name,name_label,unit_label,qty_integer,qty_decimal,qty_decimal_oh
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1
0,1 1/4 cups cooked and pureed fresh butternut s...,butternut squash,1.25,0.0,cup,True,1 1 / 4 cups cooked and pureed fresh butternut...,butternut squash,cup,True,"[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, ...","[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",False,2,"[0.0, 0.0, 1.0]"
1,1 cup peeled and cooked fresh chestnuts (about...,chestnuts,1.0,0.0,cup,True,1 cup peeled and cooked fresh chestnuts about ...,chestnuts,cup,True,"[0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, ...",True,0,"[1.0, 0.0, 0.0]"
2,"1 medium-size onion, peeled and chopped",onion,1.0,0.0,,True,"1 medium size onion , peeled and chopped",onion,,True,"[0, 0, 0, 1, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0]",True,0,"[1.0, 0.0, 0.0]"
3,"2 stalks celery, chopped coarse",celery,2.0,0.0,stalk,True,"2 stalks celery , chopped coarse",celery,stalk,True,"[0, 0, 1, 0, 0, 0]","[0, 1, 0, 0, 0, 0]",True,0,"[1.0, 0.0, 0.0]"
4,1 1/2 tablespoons vegetable oil,vegetable oil,1.5,0.0,tablespoon,True,1 1 / 2 tablespoons vegetable oil,vegetable oil,tablespoon,True,"[0, 0, 0, 0, 0, 1, 1]","[0, 0, 0, 0, 1, 0, 0]",False,1,"[0.0, 1.0, 0.0]"


In [32]:
ing_df_labelled = ing_df[["input_parsed", 'name_label', "unit_label", "qty", "qty_decimal_oh"]]
ing_df_labelled.sample(100)

Unnamed: 0_level_0,input_parsed,name_label,unit_label,qty,qty_decimal_oh
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
20374,2 1 / 2 teaspoons finely minced garlic,"[0, 0, 0, 0, 0, 1, 1, 1]","[0, 0, 0, 0, 1, 0, 0, 0]",2.50,"[0.0, 1.0, 0.0]"
136549,salt,[1],[0],0.00,"[1.0, 0.0, 0.0]"
132837,"2 onions , minced","[0, 1, 0, 0]","[0, 0, 0, 0]",2.00,"[1.0, 0.0, 0.0]"
88598,1 / 4 cup cumin seeds,"[0, 0, 0, 0, 1, 1]","[0, 0, 0, 1, 0, 0]",0.25,"[0.0, 0.0, 1.0]"
41455,"20 scallions or green onions , about 1 / 3 pound","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",20.00,"[1.0, 0.0, 0.0]"
...,...,...,...,...,...
144704,"2 teaspoons canola oil , rice bran oil or extr...","[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]",2.00,"[1.0, 0.0, 0.0]"
124170,"1 bulb green garlic , root and dark green part...","[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...","[0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...",1.00,"[1.0, 0.0, 0.0]"
153628,5 ounces baby spinach,"[0, 0, 1, 1]","[0, 1, 0, 0]",5.00,"[1.0, 0.0, 0.0]"
44560,1 tablespoon balsamic vinegar,"[0, 0, 1, 1]","[0, 1, 0, 0]",1.00,"[1.0, 0.0, 0.0]"


These look pretty good. The next stage would be to encode the text so that it is also numeric.

To do this I need to:
- Identify the vocabulary from the corpus
- Convert the words into their numeric representation
- Pad the sequences so that all the inputs are the same length
- Download the pretrained word embeddings (using Glove 6)


In [33]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from more_itertools import take
import json



In [34]:
# Download and extract the word embeddings
#!wget http://nlp.stanford.edu/data/glove.6B.zip
#!unzip -q glove.6B.zip

In [35]:
# Open the word embeddings 
import os
import numpy as np


path_to_glove_file = "glove.6B.50d.txt"

embeddings_index = {}
with open(path_to_glove_file) as f:
    for line in f:
        word, coefs = line.split(maxsplit=1)
        coefs = np.fromstring(coefs, "f", sep=" ")
        embeddings_index[word] = coefs

print("Found %s word vectors." % len(embeddings_index))

Found 400000 word vectors.


In [43]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from more_itertools import take
import pickle

# Initialize the Tokenizer class, making sure to keep , . and /
tokenizer = Tokenizer(oov_token="<OOV>", filters='!"#$%&()*+-:;<=>?@[\\]^_`{|}~\t\n')

# Generate the word index dictionary from the embeddings
tokenizer.fit_on_texts(embeddings_index.keys())

# Print the length of the word index
word_index_example = take(20, tokenizer.word_index.items())
word_index = tokenizer.word_index

# Save the word vocabulary as a pickle file as it has a funny encoding that I hvent figured out how to load with json
word_index_filename = "word_index.pckl"
pickle.dump(word_index, open(word_index_filename, "wb" ))
    
print(f'number of words in word_index: {len(word_index)}')

# Print the word index
print(f'word_index: {word_index_example}')
print()



number of words in word_index: 364809
word_index: [('<OOV>', 1), ('1', 2), ('2', 3), ('non', 4), ('3', 5), ('4', 6), ('5', 7), ('10', 8), ('6', 9), ('based', 10), ('year', 11), ('0', 12), ('a', 13), ('8', 14), ('12', 15), ('7', 16), ('re', 17), ('http', 18), ('al', 19), ('15', 20)]



In [12]:
# Generate and pad the sequences
sequences = tokenizer.texts_to_sequences(ing_df_labelled["input_parsed"])

X = pad_sequences(sequences, padding='post')
y_name = pad_sequences(ing_df_labelled["name_label"], padding="post")
y_unit = pad_sequences(ing_df_labelled["unit_label"], padding="post")

y_qty_decimal = np.stack(ing_df_labelled["qty_decimal_oh"].to_numpy())
y_qty = np.expand_dims(np.asarray(ing_df_labelled["qty"].values), axis=-1)

# Combine these for jointly training the model
y_name_ex = np.expand_dims(y_name, axis=1)
y_unit_ex = np.expand_dims(y_unit, axis=1)


y_combined = np.concatenate((y_name_ex, y_unit_ex), axis=1)

In [14]:
import random 
# Check a random index to make sure everything has worked
idx = random.randint(0,20000)
rand_ing = ing_df.loc[ing_df.index.values[idx], "input_parsed"]
print(f"Ingredient sentence is {rand_ing}")
print(f"X is {X[idx]}")
print(f"y_name is {y_name[idx]}")
print(f"y_unit is {y_unit[idx]}")

Ingredient sentence is 4 tablespoons extra virgin olive oil
X is [    6 16895   514 14282  3994   771     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0]
y_name is [0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
y_unit is [0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]


Great so we have a training database! 

## Word Embeddings

To save time training my own embeddings, I will use the ones already trained in the Glove version 6 embeddings. These were created by stanford on a much larger corpus. Naturally some words might not exsist in this corpus, so these will be replaced with the OOV token.

In [15]:
num_tokens = len(word_index) + 1
embedding_dim = 50
hits = 0
misses = 0
oov_words = []

# Prepare embedding matrix
embedding_matrix = np.zeros((num_tokens, embedding_dim))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        # Words not found in embedding index will be all-zeros.
        # This includes the representation for "padding" and "OOV"
        embedding_matrix[i] = embedding_vector

        hits += 1
    else:
        oov_words.append(word)
        misses += 1
print("Converted %d words (%d misses)" % (hits, misses))
print(f"Some examples of the words not included are: {oov_words[:20]}")

Converted 363496 words (1313 misses)
Some examples of the words not included are: ['<OOV>', 'eur2004', 'cvw', 'sonderburg', 'orthoplex', '0267', '4697', '4480', '9867', '4425', '3382', 'utf', '3622', 'ob.', 'hiberno', '5456', '4204', '7282', '0075', '0089']


In [17]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.initializers import Constant
# Create the embedding layer, make it not trainable and fix the embedding values
embedding_lay = Embedding(
    num_tokens,
    embedding_dim,
    embeddings_initializer=Constant(embedding_matrix),
    trainable=False,
    input_length=X.shape[1],
    name="embedding_layer"
)

## Baseline Calculation

If my model was to output all 1's or all 0's for the output, what would the success rate be?

In [18]:
from sklearn.metrics import accuracy_score, f1_score

baseline_y_unit = f1_score(y_unit, np.ones(y_unit.shape), average="micro")
baseline_y_name = f1_score(y_name, np.ones(y_name.shape), average="micro")

print(f"The baseline f1 scores for y_unit are {baseline_y_unit}")
print(f"The baseline f1 scores for y_name are {baseline_y_name}")

The baseline f1 scores for y_unit are 0.02337087910892176
The baseline f1 scores for y_name are 0.055756400440218165


## Metrics 

I have chosen to use f1 score which is a combined metric or precision and recall. It is better than accuracy in this instance because accuracy would give high scores for predicting all 0's or 1's.

In addition I have created a custom metric "percent par correct" which is for evaluating how many answers are pretty much correct (or could be down to a difference in interpretation). This gives the percentage of ingredients that have one word precicted accurately

In [19]:
import tensorflow_addons as tfa
import tensorflow as tf

f1_score_name = tfa.metrics.F1Score(
    num_classes = X.shape[1],
    average = "micro",
    threshold=0.5
)

f1_score_unit = tfa.metrics.F1Score(
    num_classes = X.shape[1],
    average = "micro",
    threshold=0.5
)



f1_score_qty_decimal = tfa.metrics.F1Score(
    num_classes = 2,
    average = "micro",
    threshold=0.5
)


def name_f1_score(y_true_comb, y_pred_comb):
    y_true_name, y_true_unit = tf.split(y_true_comb, num_or_size_splits=[1, 1], axis=1)
    y_pred_name, y_pred_unit = tf.split(y_pred_comb, num_or_size_splits=[1, 1], axis=1)
    return f1_score_name(tf.squeeze(y_true_name), tf.squeeze(y_pred_name))

def unit_f1_score(y_true_comb, y_pred_comb):
    y_true_name, y_true_unit = tf.split(y_true_comb, num_or_size_splits=[1, 1], axis=1)
    y_pred_name, y_pred_unit = tf.split(y_pred_comb, num_or_size_splits=[1, 1], axis=1)
    return f1_score_unit(tf.squeeze(y_true_unit), tf.squeeze(y_pred_unit))



def name_percent_part_correct(y_true_comb, y_pred_comb):
    y_true_name, y_true_unit = tf.split(y_true_comb, num_or_size_splits=[1, 1], axis=1)
    y_pred_name, y_pred_unit = tf.split(y_pred_comb, num_or_size_splits=[1, 1], axis=1)
    
    metric = percent_part_correct(tf.squeeze(y_true_name), tf.squeeze(y_pred_name))
    return metric

def percent_part_correct(y_true, y_pred):
    y_round = tf.cast(tf.math.round(y_pred), tf.bool)
    y_true_bool = tf.cast(y_true, tf.bool)
    total_values = tf.gather(tf.shape(y_true), 0)
    
    # Returns a boolean tensor where the elements match
    equal = tf.math.logical_and(y_true_bool, y_round)
    only_ones = tf.logical_and(equal, y_true_bool)
    reduced_eq = tf.reduce_any(only_ones, 1)
    total_part_correct = tf.reduce_sum(tf.cast(reduced_eq, tf.int64))
    percent_part_correct = tf.math.divide_no_nan(tf.cast(total_part_correct, tf.float32), tf.cast(total_values, tf.float32))
    return percent_part_correct

def qty_mse(y_true, y_pred):
    y_true_qty = tf.split(y_true, num_or_size_splits=[1], axis=-1)
    y_pred_qty = tf.split(y_pred, num_or_size_splits=[1], axis=-1)
    mse = tf.keras.losses.MeanSquaredError()
    mse_error = mse(y_true_qty, y_pred_qty)
    return mse_error



# Test execution of custom metric
y_p = tf.constant([[[0, 0.51, 0], [0.51, 0, 0]], [[1, 0, 0], [0, 0, 0]]], dtype=tf.float32)
y_t = tf.constant([[[0, 1, 0], [1, 0, 0]], [[0, 0, 0], [0, 0, 0]]], dtype=tf.float32)

print(name_percent_part_correct(y_combined, y_combined).numpy())




1.0


 The versions of TensorFlow you are currently using is 2.5.0 and is not supported. 
Some things might work, some things might not.
If you were to encounter a bug, do not file an issue.
If you want to make sure you're using a tested and supported configuration, either change the TensorFlow version or the TensorFlow Addons's version. 
You can find the compatibility matrix in TensorFlow Addon's readme:
https://github.com/tensorflow/addons
2022-03-27 18:00:36.836853: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.


## Split Data

In [20]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y_combined, test_size=0.2, random_state=42)
_, _, y_qty_train, y_qty_test = train_test_split(X, y_qty, test_size=0.2, random_state=42)
_, _, y_qty_decimal_train, y_qty_decimal_test = train_test_split(X, y_qty_decimal, test_size=0.2, random_state=42)


Unfortunately although I wanted to train a model to include the range as well, I do not believe that there are enough samples to give an accurate result.

## Build and Train Model

In [21]:
import tensorflow as tf

lstm_dim = 32
dense_dim = 256
output_dim = X.shape[1]*2
NUM_EPOCHS = 50
BATCH_SIZE = 64

# Model Definition with LSTM
model_head = tf.keras.Sequential([
    embedding_lay,
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_dim, return_sequences=True)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_dim, return_sequences=True)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(dense_dim, activation='relu'),
    tf.keras.layers.Dropout(.4, input_shape=(256,))
], name="model_head")
# Print the model summary
model_head.summary()

save_freq = int(1*(len(X_train)/BATCH_SIZE))
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint('./models/model{epoch:08d}.ckpt', save_freq=save_freq, save_weights_only="true")
                                                                        



Model: "model_head"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_layer (Embedding)  (None, 60, 50)            18240500  
_________________________________________________________________
bidirectional (Bidirectional (None, 60, 64)            21248     
_________________________________________________________________
bidirectional_1 (Bidirection (None, 60, 64)            24832     
_________________________________________________________________
flatten (Flatten)            (None, 3840)              0         
_________________________________________________________________
dense (Dense)                (None, 256)               983296    
_________________________________________________________________
dropout (Dropout)            (None, 256)               0         
Total params: 19,269,876
Trainable params: 1,029,376
Non-trainable params: 18,240,500
____________________________________

In [23]:
model_tail_1 = tf.keras.Sequential([
    tf.keras.layers.Dense(output_dim, activation='sigmoid'), 
    tf.keras.layers.Reshape((2, X.shape[1]))
], name="name_unit")

model_tail_2 = tf.keras.Sequential([
    tf.keras.layers.Dense(1, activation='relu')
], name="qty")

model_tail_3 = tf.keras.Sequential([
    tf.keras.layers.Dense(3, activation='softmax')
], name="qty_decimal")



def build_model(inference=False):
    model_input = tf.keras.layers.Input(shape=[X.shape[1]])
    features = model_head(model_input)
    model_output1 = model_tail_1(features)
    model_output2 = model_tail_2(features)
    model_output3 = model_tail_3(features)

    model = tf.keras.Model(inputs=model_input, outputs=[model_output1, model_output2, model_output3])

    TAIL1_WEIGHT = 8
    TAIL2_WEIGHT = 2
    TAIL3_WEIGHT = 0.2


    if not inference:
        model.compile(
            optimizer="Adam",
            loss={"name_unit": "binary_crossentropy", "qty": "huber", "qty_decimal": "categorical_crossentropy"},  
            loss_weights={"name_unit": TAIL1_WEIGHT, "qty": TAIL2_WEIGHT, "qty_decimal": TAIL3_WEIGHT},
            metrics={"name_unit": [name_f1_score, unit_f1_score, name_percent_part_correct], "qty": "mse", "qty_decimal": [f1_score_qty_decimal]} 
        )

    model.summary()
    return model



In [24]:
model = build_model()

Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 60)]         0                                            
__________________________________________________________________________________________________
model_head (Sequential)         (None, 256)          19269876    input_1[0][0]                    
__________________________________________________________________________________________________
name_unit (Sequential)          (None, 2, 60)        30840       model_head[0][0]                 
__________________________________________________________________________________________________
qty (Sequential)                (None, 1)            257         model_head[0][0]                 
______________________________________________________________________________________________

In [25]:
model.fit(X_train, [y_train, y_qty_train, y_qty_decimal_train], batch_size=BATCH_SIZE, 
                              epochs=NUM_EPOCHS, validation_data=(X_test, [y_test, y_qty_test, y_qty_decimal_test]), callbacks= [checkpoint_callback])

Epoch 1/50
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50


Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
 182/2129 [=>............................] - ETA: 1:36 - loss: 4.4461 - name_unit_loss: 0.0065 - qty_loss: 2.1953 - qty_decimal_loss: 0.0197 - name_unit_name_f1_score: 0.8967 - name_unit_unit_f1_score: 0.9323 - name_unit_name_percent_part_correct: 0.9973 - qty_mse: 229.7426 - qty_decimal_f1_score: 0.9955

KeyboardInterrupt: 

## Visualize Training

In [29]:
import plotly.graph_objects as go

def plot_graphs(history, strings):
    fig = go.Figure()
    x = list(range(NUM_EPOCHS))
    for string in strings:
        fig.add_scatter(x=x, y=history.history[string],
                        name=string)
        fig.add_scatter(x=x, y=history.history['val_'+string],
                        name='val_'+string)
    fig.update_traces(mode='lines+markers')# hoverinfo='text+name+y', 
    fig.update_layout(legend=dict(y=0.5, traceorder='reversed', font_size=16), 
                    title="Training Results")
    fig.show()

plot_graphs(model.history, ["name_unit_name_f1_score", "name_unit_unit_f1_score", "name_unit_name_percent_part_correct"])
plot_graphs(model.history, ["loss", "name_unit_loss", "qty_loss", "qty_decimal_loss"])
plot_graphs(model.history, ['name_unit_name_percent_part_correct'])
plot_graphs(model.history, ['qty_mse'])


The above loss graph can be used to adjust the weights on each of the contributing losses based on when each tail starts to overfit, ideally they will all start to overfit at the same time.

The below code can be used to select the epoch number to use as the final model. It will then load in those weights before saving the tensorflow lite model.

# Save Model to TF Lite

I am using TFlite to convert the model to work locally within the package. THis saves the model from having to be run as an API call to the cloud.

In [30]:
epoch_num = 45
model = build_model(inference=True)

weights_file = f'./models/model{str(epoch_num).zfill(8)}.ckpt'

model.load_weights(weights_file)


Model: "model_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 60)]         0                                            
__________________________________________________________________________________________________
model_head (Sequential)         (None, 256)          19269876    input_2[0][0]                    
__________________________________________________________________________________________________
name_unit (Sequential)          (None, 2, 60)        30840       model_head[1][0]                 
__________________________________________________________________________________________________
qty (Sequential)                (None, 1)            257         model_head[1][0]                 
____________________________________________________________________________________________

<tensorflow.python.training.tracking.util.CheckpointLoadStatus at 0x1a7ee8690>

In [31]:
# Convert the model.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS, tf.lite.OpsSet.SELECT_TF_OPS]
converter._experimental_lower_tensor_list_ops = False
tflite_model = converter.convert()

# Save the model.
with open('output_model.tflite', 'wb') as f:
    f.write(tflite_model)





INFO:tensorflow:Assets written to: /var/folders/8v/d50_nc2n7kj916nj0mct09f40000gn/T/tmpidzf737m/assets


INFO:tensorflow:Assets written to: /var/folders/8v/d50_nc2n7kj916nj0mct09f40000gn/T/tmpidzf737m/assets


## Analyse Success and Failure Cases

In [42]:
def test_combined_model(n_samples, qty_failure=False):
    start_index = 100000
    subset_size = 10000
    X_sentences = ing_df_labelled.loc[ing_df_labelled.index[start_index:start_index+subset_size], "input_parsed"].values
    X_input = X[start_index:start_index+subset_size]

    y_name_subset = y_name[start_index:start_index+subset_size]
    y_unit_subset = y_unit[start_index:start_index+subset_size]
    y_qty_subset  = y_qty[start_index:start_index+subset_size]
    #y_range_subset = y_range[start_index:start_index+subset_size]
    pred_y_subset = model.predict(X[start_index:start_index+subset_size])
    

    pred_y_name = pred_y_subset[0][:, 0, :]
    pred_y_unit = pred_y_subset[0][:, 1, :]
    pred_y_qty  = pred_y_subset[1]
    #pred_y_range = pred_y_subset[2]
        
    rounded_yp_name = np.rint(np.asarray(pred_y_name))
    rounded_yp_unit = np.rint(np.asarray(pred_y_unit))

    for i in range(n_samples):
        if qty_failure:
            if np.abs(pred_y_qty[i] - y_qty_subset[i]) > 3:
                pass
            else:
                continue
            
        print(f"Sentence: {X_sentences[i]}")
        print()
        print(f"y_name: {binary_mask_to_words(y_name_subset[i], X_sentences[i])}")
        print(f"pred_name: {binary_mask_to_words(rounded_yp_name[i], X_sentences[i])}")
        print()
        print(f"y_unit: {binary_mask_to_words(y_unit_subset[i], X_sentences[i])}")
        print(f"pred_unit: {binary_mask_to_words(rounded_yp_unit[i], X_sentences[i])}")
        print()
        print(f"y_qty: {y_qty_subset[i]}")
        print(f"pred_qty: {pred_y_qty[i]}")
        print()

# Applies a binary mask to a sentence and returns the strings. Used for calculating the 
def binary_mask_to_words(binary_mask, sentence):
    split_sentence = sentence.split(" ")
    binary_mask.astype(np.bool)
    idxs = np.nonzero(binary_mask)
    if idxs:
        y_words = " ".join([split_sentence[idx] for idx in idxs[0]])
        return y_words
    else:
        return ""
    
test_combined_model(1000, qty_failure=True)



Sentence: 1 5 1 / 2 pound rolled turkey breast , tied at intervals to shape it like a sausage ask your butcher to do this

y_name: turkey breast
pred_name: breast

y_unit: pound
pred_unit: 

y_qty: [5.5]
pred_qty: [1.9153899]

Sentence: 1 cup , plus 1 tablespoon unsalted butter , at room temperature , cut into small pieces

y_name: unsalted butter
pred_name: butter

y_unit: tablespoon
pred_unit: tablespoon

y_qty: [17.]
pred_qty: [12.965963]



One interesting failure (or maybe a success?):

3 / 4 cup plus 2 tablespoons sugar -> Model: 9.98 Tablespoons sugar Label: 3/4 cup sugar

3/4 cups is actually 10 tablespoons, so the model answer is just as good as the true answer. If only it were able to add the additional 2 teaspoons as well!



## Identify fail cases

Because the data I have used has had the labels converted and because it is text data (I find it complicated and havent used it before), I want to see where the model has failed. This will helo to refine the preprocesing and dta filtering. It might be errors from the differences between what different labelers consider to be the ingredient.

In [34]:
def print_examples(fail_pass: str, part_correct: bool, name_unit: str, limit: int):
    start_index = 120000
    subset_size = 1000
    X_sentences = ing_df_labelled.loc[ing_df_labelled.index[start_index:start_index+subset_size], "input_parsed"].values
    X_input = X[start_index:start_index+subset_size]
    if name_unit == "name":
        y_subset = y_name[start_index:start_index+subset_size]
    else:
        y_subset = y_unit[start_index:start_index+subset_size]
        
    pred_y_subset = model.predict(X[start_index:start_index+subset_size])
    
    if name_unit == "name":
        pred_y_subset = pred_y_subset[0][:, 0, :]
    else:
        pred_y_subset = pred_y_subset[0][:, 1, :]
        
    rounded_y_pred = np.rint(np.asarray(pred_y_subset))
    
    printed = 0
    # Print any errors
    for i in range(subset_size):
        if printed > limit - 1:
            return
        if fail_pass == "fail":
            if part_correct:
                # Checks that there is at least one matching positive and then checks that there is not a perfect match.
                if (np.asarray(y_subset[i]) & rounded_y_pred[i].astype(int)).any() and not (np.asarray(y_subset[i]) == rounded_y_pred[i]).all():
                    print(f"Sentence: {X_sentences[i]}")
                    print(f"y: {binary_mask_to_words(y_subset[i], X_sentences[i])}")
                    print(f"pred_y: {binary_mask_to_words(rounded_y_pred[i], X_sentences[i])}")
                    print()
                    printed += 1
            else:
                # Checks that there is at least one word that is a name or unit
                # Then finds out where the overlap is between the y_true and the y_pred, bthen combines it with the y_pred array using and 
                # to remove the matches where both are 0
                # Then checks that there are no 1's in this array, meaning a compelte error
                if np.any(np.asarray(y_subset[i])):
                    mask = (np.asarray(y_subset[i]) == rounded_y_pred[i]) & np.asarray(y_subset[i])
                    if not mask.any():
                        print(f"Sentence: {X_sentences[i]}")
                        print(f"y: {binary_mask_to_words(y_subset[i], X_sentences[i])}")
                        print(f"pred_y: {binary_mask_to_words(rounded_y_pred[i], X_sentences[i])}")
                        print()
                        printed += 1

        if fail_pass == "pass":
            if (np.asarray(y_subset[i]) == rounded_y_pred[i]).all():
                print(f"Sentence: {X_sentences[i]}")
                print()
                printed += 1




## Failed Name cases

In [35]:
print_examples("fail", True, "name", 20)

Sentence: 1 / 2 cup plain yogurt , preferably whole milk
y: plain yogurt
pred_y: yogurt

Sentence: 1 / 4 cup peanut oil or neutral oil , like grapeseed or corn
y: peanut oil or neutral oil
pred_y: peanut oil

Sentence: 3 tablespoons mild or sweet miso , like yellow or white
y: mild or sweet miso
pred_y: sweet miso

Sentence: 1 tablespoon dark sesame oil
y: dark sesame oil
pred_y: sesame oil

Sentence: 1 egg yolk
y: egg yolk
pred_y: egg

Sentence: 1 tablespoon lemon juice or sherry or white wine vinegar
y: lemon juice or sherry or white wine vinegar
pred_y: lemon juice

Sentence: 1 tablespoon packed , finely chopped flat leaf parsley
y: flat leaf parsley
pred_y: parsley

Sentence: 1 tablespoon meyer lemon or lemon juice
y: meyer lemon or lemon juice
pred_y: meyer lemon lemon juice

Sentence: chilled boiled potatoes , for serving , optional
y: potatoes
pred_y: chilled boiled potatoes

Sentence: sour cream or yogurt , for serving
y: sour cream or yogurt
pred_y: sour cream yogurt

Sentence

It appears that there are two kind of errors. Errors where the model has clearly gotten it wrong:

1 pinch sea salt or fleur de sel -> Model Output: "Sea Salt Or"

And others which are open to interpretation or where the labeller has gotten it wrong:

1 red grapefruit , peeled and segmented , optional -> Model Output: "red grapefruit" Label: "grapefruit"

1 / 4 cup vadouvan exotique spice mix -> Model Output: "vadouvan exotique spice mix" Label: "spice mix"

2 teaspoons aged balsamic vinegar -> Model Output: "balsamic vinegar" Label: "vinegar"

I cannot think of any straightforward ways to seperate these mistakes, however, I can create a metric that includes the above examples as successful predictions and then have a look at the ones that are not these type of errors. This would be useful as a better predictor of model performance I believe.

This metric would include any prediction with an overlap with the Label a successful prediction and then give a percentage of the data set that fall into this category.

In [37]:
print_examples("fail", False, "name", 10)

Sentence: 10 large or 20 small precooked canned or vacuum packed beets
y: beets
pred_y: precooked canned

Sentence: 2 tablespoons nam pla fish sauce
y: nam pla
pred_y: fish sauce

Sentence: 1 large bunch or 2 smaller bunches greens , such as swiss chard , beet greens , turnip greens or kale about 1 1 / 2 pounds , stemmed and washed well in several changes of water
y: greens greens greens
pred_y: smaller

Sentence: 1 to 2 tablespoons gochujang korean chili paste
y: gochujang
pred_y: chili paste

Sentence: 3 / 4 cup panko japanese bread crumbs
y: panko
pred_y: japanese bread crumbs



From the aboe example of errors, it looks like the algorithm is actually better than the human labeller in this instance.

crackers or sliced cucumber , for serving -> Model: crackers, Label: cucumber
1 / 2 cup soaking water from the apricots , as needed - > Model: water Label: apricots

## Failed Unit Cases

In [38]:
print("Successful Predictions")
print_examples("pass", False, "unit", 10)

Successful Predictions
Sentence: 1 / 2 cup plain yogurt , preferably whole milk

Sentence: 1 / 2 cup olive oil , or less

Sentence: 1 / 2 cup chopped fresh mint

Sentence: zest and juice of 1 lemon

Sentence: salt

Sentence: freshly ground black pepper

Sentence: 1 / 4 cup peanut oil or neutral oil , like grapeseed or corn

Sentence: 1 / 4 cup rice vinegar

Sentence: 1 tablespoon dark sesame oil

Sentence: 2 medium carrots , roughly chopped



In [39]:
print("Unsuccessful Predictions")
print_examples("fail", False, "unit", 5)

Unsuccessful Predictions
Sentence: 3 / 4 cup plus 2 tablespoons sugar
y: cup
pred_y: tablespoons

Sentence: 10 large or 20 small precooked canned or vacuum packed beets
y: large
pred_y: 

Sentence: 1 large onion , chopped
y: large
pred_y: 

Sentence: 1 large cucumber , peeled , seeded and coarsely chopped
y: large
pred_y: 

Sentence: 1 tablespoon plus 1 teaspoon freshly squeezed lime juice
y: tablespoon
pred_y: 



In [40]:
print("Unsuccessful predictions where part of the prediction is correct")
print_examples("fail", True, "unit", 100)

Unsuccessful predictions where part of the prediction is correct
Sentence: 1 large bunch chamomile or 2 chamomile tea bags
y: large bunch
pred_y: bunch

Sentence: 2 / 3 cup pitted oil cured black olives , halved , or 1 / 2 cup pitted green olives , chopped , or 3 tablespoons capers optional
y: cup cup
pred_y: cup

Sentence: 1 large bunch swiss chard , ribs removed , leaves torn into pieces
y: large bunch
pred_y: bunch

Sentence: 2 8 ounce packages cream cheese , at room temperature
y: 8 ounce packages
pred_y: ounce

Sentence: 4 tablespoons minced fresh dill , or 2 tablespoons dried dill
y: tablespoons tablespoons
pred_y: tablespoons

