# Ingredient Vectors from Decomposing an Ingredient-Ingredient Pointwise Mutual Information Matrix
Let's create some simple ingredient embeddings based on the principle of [word vectors](https://en.wikipedia.org/wiki/Word_embedding) by applying a [singular value decomposition](https://en.wikipedia.org/wiki/Singular-value_decomposition) to a [pointwise mutual information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) ingredient-ingredient matrix. There are many other ways to create embeddings, but matrix decomposition is one of the most straightforward. A well-cited description of the technique used in this notebook can be found in Chris Moody's blog post [Stop Using word2vec](https://multithreaded.stitchfix.com/blog/2017/10/18/stop-using-word2vec/). If you are interested in reading further about the history of word embeddings and a discussion of modern approaches, check out the following blog post by Sebastian Ruder, [An overview of word embeddings and their connection to distributional semantic models](http://blog.aylien.com/overview-word-embeddings-history-word2vec-cbow-glove/).

The end goal here is to find good substitutes for ingredients that are not readily available when cooking a given recipe. We will compare these embeddings to pre-trained word vectors.

We will be using a list of recipes that I scraped from the web. The recipes contain no instructions or other details here; each one is simply a set of ingredients (ingredient names are in French).
In this notebook tutorial we will implement as much as we can without using libraries that obfuscate the algorithm.  We're not going to write our own linear algebra or sparse matrix routines, but we will calculate unigram frequency, skipgram frequency, and the pointwise mutual information matrix "by hand".  Hopefully this will make the method easier to understand!

In [None]:
%matplotlib inline

from collections import Counter
import itertools

import numpy as np
from scipy import sparse
from scipy.sparse import linalg 
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt
import json
import sys
import os

# Read Data and Preview

In [None]:
recipes = []

with open(os.path.join('/kaggle/input/ingredient-sets', 'recipes.json'), 'r') as f:
    recipes = json.load(f)

In [None]:
recipes[0]

# Minimal Preprocessing
We're going to create an ingredient-ingredient co-occurrence matrix from the ingredients in the recipes. We will define two ingredients as "co-occurring" if they appear in the same recipe. Under this definition, single-ingredient recipes are not interesting for us.

In [None]:
ingredients = [[ingredient['ingredientName'] for ingredient in recipe['ingredients']] for recipe in recipes]

# remove single-ingredient recipes (check the length of each recipe, not of the whole list)
ingredients = [ing for ing in ingredients if len(ing) > 1]
# show results
ingredients[0:5]

# Unigrams
Now let's calculate a unigram vocabulary. The following code assigns a unique ID to each token, stores that mapping in two dictionaries (`tok2indx` and `indx2tok`), and counts how often each token appears in the corpus.

In [None]:
unigram_counts = Counter()
for i, ingredient in enumerate(ingredients):
    for token in ingredient:
        unigram_counts[token] += 1

tok2indx = {tok: indx for indx,tok in enumerate(unigram_counts.keys())}
indx2tok = {indx: tok for tok,indx in tok2indx.items()}

print('Done')
print('Vocabulary size: {}'.format(len(unigram_counts)))
print('Most common: {}'.format(unigram_counts.most_common(10)))

# Skipgrams
Now let's calculate a skipgram vocabulary. We will loop through each ingredient in a recipe (the focus ingredient) and form skipgrams by examining the other ingredients within the same recipe. As an example, the first recipe,
```
['tortilla', 'abricot', 'beurre', 'sucre vanillé']
```
would produce the following skipgrams, 
```
('tortilla', 'abricot')
('tortilla', 'beurre')
('tortilla', 'sucre vanillé')
('abricot', 'tortilla')
('abricot', 'beurre')
('abricot', 'sucre vanillé')
('beurre', 'tortilla')
('beurre', 'abricot')
('beurre', 'sucre vanillé')
('sucre vanillé', 'tortilla')
('sucre vanillé', 'abricot')
('sucre vanillé', 'beurre')
```
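The same enumeration can be sketched with `itertools.permutations`, which yields every ordered pair of distinct positions in a list. This is only an illustration on the example recipe above, not the counting code used in the notebook.

```python
import itertools

# Toy recipe taken from the example above
recipe = ['tortilla', 'abricot', 'beurre', 'sucre vanillé']

# Every ordered pair of distinct positions is one (focus, context) skipgram
skipgrams = list(itertools.permutations(recipe, 2))

for skipgram in skipgrams:
    print(skipgram)
```

A recipe with n ingredients produces n * (n - 1) skipgrams, so 12 for this one. Pairs are formed by position, so an ingredient listed twice in a recipe would be paired with itself, just as in a nested loop over positions.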

In [None]:
# Note we store the token vocab indices in the skipgram counter

skipgram_counts = Counter()
for recipe_ingredients in ingredients:
    tokens = [tok2indx[tok] for tok in recipe_ingredients]
    for i_ingredient in range(len(tokens)):
        for i_context in range(len(tokens)):
            # skip pairing a position with itself
            if i_ingredient == i_context:
                continue
            skipgram = (tokens[i_ingredient], tokens[i_context])
            skipgram_counts[skipgram] += 1
        
print('Done')
print('Number of skipgrams: {}'.format(len(skipgram_counts)))
most_common = [
    (indx2tok[sg[0][0]], indx2tok[sg[0][1]], sg[1]) 
    for sg in skipgram_counts.most_common(10)]
print('Most common: {}'.format(most_common))

# Sparse Matrices

We will calculate several matrices that store ingredient-ingredient information. These matrices will be $N \times N$ where $N = 350$ is the size of our vocabulary. A dense matrix of this size would still fit into memory, but a sparse format stores only the non-zero entries and keeps the code usable for much larger vocabularies. A nice implementation is available in [scipy.sparse.csr_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html). To create these sparse matrices we build three iterables that store row indices, column indices, and data values.
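As a minimal sketch of that construction, here is a 3 x 3 matrix built from three parallel lists. The counts are made up for illustration and `toy_mat` is a hypothetical name.

```python
import numpy as np
from scipy import sparse

# Toy co-occurrence counts for a 3-token vocabulary (illustrative values only)
rows = [0, 1, 0, 2]   # row index of each non-zero entry
cols = [1, 0, 2, 0]   # column index of each non-zero entry
vals = [3, 3, 1, 1]   # value stored at (row, col)

# Passing shape explicitly keeps the matrix N x N even if the last
# token index happens to have no entries
toy_mat = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 3))

print(toy_mat.toarray())
print('stored entries:', toy_mat.nnz)
```

Only the four listed entries are stored; every other cell is an implicit zero.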

# Ingredient-Ingredient Count Matrix
Our very first ingredient vectors will come from an ingredient-ingredient count matrix. This matrix is symmetric, so we can (equivalently) take the ingredient vectors to be the rows or the columns. However, we will try to code as if the rows are ingredient vectors and the columns are context vectors.

In [None]:
row_indxs = []
col_indxs = []
dat_values = []
for (tok1, tok2), sg_count in skipgram_counts.items():
    row_indxs.append(tok1)
    col_indxs.append(tok2)
    dat_values.append(sg_count)

iicnt_mat = sparse.csr_matrix((dat_values, (row_indxs, col_indxs)))
print('Done')

# Ingredient Similarity with Sparse Count Matrices

In [None]:
def ii_sim(ingredient, mat, topn=10):
    """Calculate topn most similar ingredients to ingredient"""
    indx = tok2indx[ingredient]
    if isinstance(mat, sparse.csr_matrix):
        v1 = mat.getrow(indx)
    else:
        v1 = mat[indx:indx+1, :]
    sims = cosine_similarity(mat, v1).flatten()
    sindxs = np.argsort(-sims)
    sim_ingredient_scores = [(indx2tok[sindx], sims[sindx]) for sindx in sindxs[0:topn]]
    return sim_ingredient_scores

In [None]:
ii_sim('beurre', iicnt_mat)

In [None]:
# Normalize each row using L2 norm
iicnt_norm_mat = normalize(iicnt_mat, norm='l2', axis=1)

In [None]:
# Demonstrate normalization
row = iicnt_mat.getrow(10).toarray().flatten()
print(np.sqrt((row*row).sum()))

row = iicnt_norm_mat.getrow(10).toarray().flatten()
print(np.sqrt((row*row).sum()))

In [None]:
ii_sim('poulet', iicnt_norm_mat)

# Pointwise Mutual Information Matrices
The pointwise mutual information (PMI) for an (ingredient, context) pair in our corpus is defined as the logarithm of the probability of their co-occurrence divided by the product of the probabilities of them appearing individually,
$$
{\rm pmi}(w, c) = \log \frac{p(w, c)}{p(w) p(c)}
$$

$$
p(w, c) = \frac{
f_{i,j}
}{
\sum_{i=1}^N \sum_{j=1}^N f_{i,j}
}, \quad 
p(w) = \frac{
\sum_{j=1}^N f_{i,j}
}{
\sum_{i=1}^N \sum_{j=1}^N f_{i,j}
}, \quad
p(c) = \frac{
\sum_{i=1}^N f_{i,j}
}{
\sum_{i=1}^N \sum_{j=1}^N f_{i,j}
}
$$
where $f_{i,j}$ is the entry of the ingredient-ingredient count matrix we defined above, with the ingredient $w$ indexing row $i$ and the context $c$ indexing column $j$.
In addition we can define the positive pointwise mutual information as,
$$
{\rm ppmi}(w, c) = {\rm max}\left[{\rm pmi(w,c)}, 0 \right]
$$

Note that the definition of PMI above implies that ${\rm pmi}(w, c) = {\rm pmi}(c, w)$ and so this matrix will be symmetric.  However this is not true for the variant in which we smooth over the contexts.
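To make these definitions concrete, here is a small dense sketch on a made-up symmetric 3 x 3 count matrix. It also includes the context-smoothed variant, assuming the smoothed context probability is $P_\alpha(c) = \#(c)^\alpha / \sum_{c'} \#(c')^\alpha$ with $\alpha = 0.75$. The counts and variable names are illustrative only.

```python
import numpy as np

# Toy symmetric count matrix f[i, j] (illustrative values only)
counts = np.array([[0, 3, 1],
                   [3, 0, 2],
                   [1, 2, 0]], dtype=float)

total = counts.sum()
p_wc = counts / total                                # p(w, c)
p_w = counts.sum(axis=1, keepdims=True) / total      # p(w), one value per row
p_c = counts.sum(axis=0, keepdims=True) / total      # p(c), one value per column

# log2(0) gives -inf for pairs that never co-occur; PPMI clips those to 0
with np.errstate(divide='ignore'):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)

# Context smoothing: raise the context counts to the power alpha
alpha = 0.75
c_alpha = counts.sum(axis=0, keepdims=True) ** alpha
p_c_alpha = c_alpha / c_alpha.sum()
with np.errstate(divide='ignore'):
    spmi = np.log2(p_wc / (p_w * p_c_alpha))
sppmi = np.maximum(spmi, 0)

print(np.round(ppmi, 3))
print('PPMI symmetric: ', np.allclose(ppmi, ppmi.T))
print('SPPMI symmetric:', np.allclose(sppmi, sppmi.T))
```

In this toy matrix the pair (0, 2) co-occurs exactly as often as chance predicts, so its PMI is zero. The smoothed matrix is no longer symmetric because only the context probability is smoothed, not the ingredient probability.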

In [None]:
num_skipgrams = iicnt_mat.sum()
assert(sum(skipgram_counts.values()) == num_skipgrams)

# for creating sparse matrices
row_indxs = []
col_indxs = []

pmi_dat_values = []    # pointwise mutual information
ppmi_dat_values = []   # positive pointwise mutual information
spmi_dat_values = []   # smoothed pointwise mutual information
sppmi_dat_values = []  # smoothed positive pointwise mutual information

# reusable quantities

# sum_over_rows[i] = sum_over_ingredients[i] = iicnt_mat.getcol(i).sum()
sum_over_ingredients = np.array(iicnt_mat.sum(axis=0)).flatten()
# sum_over_cols[i] = sum_over_contexts[i] = iicnt_mat.getrow(i).sum()
sum_over_contexts = np.array(iicnt_mat.sum(axis=1)).flatten()

# smoothing
alpha = 0.75
sum_over_ingredients_alpha = sum_over_ingredients**alpha
nca_denom = np.sum(sum_over_ingredients_alpha)

for (tok_ingredient, tok_context), sg_count in skipgram_counts.items():
    # here we have the following correspondence with Levy, Goldberg, Dagan
    #========================================================================
    #   num_skipgrams = |D|
    #   nwc = sg_count = #(w,c)
    #   Pwc = nwc / num_skipgrams = #(w,c) / |D|
    #   nw = sum_over_cols[tok_ingredient]    = sum_over_contexts[tok_ingredient] = #(w)
    #   Pw = nw / num_skipgrams = #(w) / |D|
    #   nc = sum_over_rows[tok_context] = sum_over_ingredients[tok_context] = #(c)
    #   Pc = nc / num_skipgrams = #(c) / |D|
    #
    #   nca = sum_over_rows[tok_context]^alpha = sum_over_ingredients[tok_context]^alpha = #(c)^alpha
    #   nca_denom = sum_{tok_context}( sum_over_ingredients[tok_context]^alpha )
    
    nwc = sg_count
    Pwc = nwc / num_skipgrams
    nw = sum_over_contexts[tok_ingredient]
    Pw = nw / num_skipgrams
    nc = sum_over_ingredients[tok_context]
    Pc = nc / num_skipgrams
    
    nca = sum_over_ingredients_alpha[tok_context]
    Pca = nca / nca_denom
    
    # note 
    # pmi = log {#(w,c) |D| / [#(w) #(c)]} 
    #     = log {nwc * num_skipgrams / [nw nc]}
    #     = log {P(w,c) / [P(w) P(c)]} 
    #     = log {Pwc / [Pw Pc]}
    pmi = np.log2(Pwc/(Pw*Pc))   
    ppmi = max(pmi, 0)
    spmi = np.log2(Pwc/(Pw*Pca))
    sppmi = max(spmi, 0)
    
    row_indxs.append(tok_ingredient)
    col_indxs.append(tok_context)
    pmi_dat_values.append(pmi)
    ppmi_dat_values.append(ppmi)
    spmi_dat_values.append(spmi)
    sppmi_dat_values.append(sppmi)
        
pmi_mat = sparse.csr_matrix((pmi_dat_values, (row_indxs, col_indxs)))
ppmi_mat = sparse.csr_matrix((ppmi_dat_values, (row_indxs, col_indxs)))
spmi_mat = sparse.csr_matrix((spmi_dat_values, (row_indxs, col_indxs)))
sppmi_mat = sparse.csr_matrix((sppmi_dat_values, (row_indxs, col_indxs)))

print('Done')

# Ingredient Similarity with Sparse PMI Matrices

In [None]:
ii_sim('carotte', pmi_mat)

In [None]:
ii_sim('carotte', ppmi_mat)

In [None]:
ii_sim('carotte', spmi_mat)

In [None]:
ii_sim('carotte', sppmi_mat)

# Singular Value Decomposition
With the PMI and PPMI matrices in hand, we can apply a singular value decomposition to create dense ingredient vectors from the sparse ones we've been using.
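As a rough sketch of what a truncated SVD returns, the snippet below runs `scipy.sparse.linalg.svds` on a random 20 x 20 sparse matrix (made-up data, not the PMI matrix) and compares the result with a full dense SVD. The names `toy_mat`, `u`, `s` and `vt` are illustrative.

```python
import numpy as np
from scipy import sparse
from scipy.sparse import linalg

# Random sparse matrix standing in for a PMI matrix (illustrative data only)
rng = np.random.default_rng(0)
dense = rng.random((20, 20))
dense[dense < 0.7] = 0.0
toy_mat = sparse.csr_matrix(dense)

k = 5  # number of singular triplets to keep; must be smaller than the matrix size
u, s, vt = linalg.svds(toy_mat, k=k)

# u has one k-dimensional row per token, vt has one k-dimensional column per context
print(u.shape, s.shape, vt.shape)

# The k values found should match the k largest singular values of the full SVD.
# Sort before comparing rather than relying on the order svds returns them in.
full_s = np.linalg.svd(dense, compute_uv=False)
print(np.allclose(np.sort(s)[::-1], full_s[:k]))
```

Each row of `u` is a dense k-dimensional vector for one token, which is the role the embedding vectors play in the cells that follow.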

In [None]:
# Let's define a function to plot the vectors in a 2D space
def tsne_plot(word_to_vec_map):  
    labels, tokens = [], []

    for word, vector in word_to_vec_map.items():
        tokens.append(vector)
        labels.append(word)
    
    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(tokens)

    x, y = [], []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])
        
    plt.figure(figsize=(12, 12)) 
    for i in range(len(x)):
        plt.scatter(x[i], y[i])
        plt.annotate(labels[i],
                     xy=(x[i], y[i]),
                     xytext=(5, 2),
                     textcoords='offset points',
                     ha='right',
                     va='bottom')
    plt.show()

In [None]:
pmi_use = ppmi_mat
embedding_size = 30
uu, ss, vv = linalg.svds(pmi_use, embedding_size) 

In [None]:
print('Vocab size: {}'.format(len(unigram_counts)))
print('Embedding size: {}'.format(embedding_size))
print('uu.shape: {}'.format(uu.shape))
print('ss.shape: {}'.format(ss.shape))
print('vv.shape: {}'.format(vv.shape))

In [None]:
unorm = uu / np.sqrt(np.sum(uu*uu, axis=1, keepdims=True))
vnorm = vv / np.sqrt(np.sum(vv*vv, axis=0, keepdims=True))

ingredient_vecs = uu + vv.T
ingredient_vecs_norm = ingredient_vecs / np.sqrt(np.sum(ingredient_vecs*ingredient_vecs, axis=1, keepdims=True))

In [None]:
# Let's create a mapping from ingredient to vector
ingredient_to_vec_map = {}
for idx, vector in enumerate(ingredient_vecs_norm):
    ingredient_to_vec_map[indx2tok[idx]] = vector

In [None]:
tsne_plot(ingredient_to_vec_map)

In [None]:
def ingredient_sim(ingredient, sim_mat):
    sim_ingredient_scores = ii_sim(ingredient, sim_mat)
    for sim_ingredient, sim_score in sim_ingredient_scores:
        print(sim_ingredient, sim_score)

In [None]:
ingredient = 'carotte'
ingredient_sim(ingredient, ingredient_vecs)

# Ingredient substitution

Let's use the matrices built above to find good substitutes for ingredients. The idea is to suggest other ingredients when those needed for a recipe are not readily available.
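The selection rule used in the next cell can be illustrated on a toy example: a candidate is kept when its vector is similar to the target's but the two never appear together in a recipe. The four ingredients and their counts below are invented for illustration ('margarine' in particular is a hypothetical entry).

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy vocabulary and co-occurrence counts (illustrative values only)
names = ['beurre', 'margarine', 'farine', 'sucre']
counts = np.array([[0, 0, 3, 2],
                   [0, 0, 3, 2],
                   [3, 3, 0, 4],
                   [2, 2, 4, 0]], dtype=float)

# Pairwise cosine similarity between the rows (ingredient vectors)
sims = cosine_similarity(counts)

i, j = names.index('beurre'), names.index('margarine')
print('similarity:', sims[i, j])
print('co-occurrence count:', counts[i, j])

# Same contexts but never used together: a plausible substitute pair
is_candidate = sims[i, j] >= 0.95 and counts[i, j] == 0
print('substitute candidate:', is_candidate)
```

Here both ingredients appear with flour and sugar in the same proportions but never with each other, which is the pattern the function below looks for.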

In [None]:
def find_substitutes(ingredient, sim_mat, pmi_mat, threshold=None):
    substitutes = []
    
    candidates = ii_sim(ingredient, sim_mat, 20)
    indx1 = tok2indx[ingredient]
    for candidate, score in candidates:
        if candidate == ingredient:
            continue
        if threshold:
            if score < threshold:
                continue
                
        indx2 = tok2indx[candidate]
        pmi = pmi_mat[indx1, indx2]
        
        # A zero entry means the pair never appears together in a recipe
        # (or, with a PPMI matrix, co-occurs no more often than chance)
        if pmi == 0:
            substitutes.append((candidate, score))
    return substitutes

In [None]:
find_substitutes('tomate', iicnt_mat, ppmi_mat)

In [None]:
find_substitutes('tomate', ingredient_vecs_norm, ppmi_mat)

To be conservative, i.e. to suggest only substitutes that are very likely to be valid, let's use a high threshold.

In [None]:
substitutes = {}
for ingredient in unigram_counts.keys():
    sub = find_substitutes(ingredient, iicnt_mat, ppmi_mat, 0.95)
    substitutes[ingredient] = sub

In [None]:
# Count the number of ingredients with possible substitutes
count = sum(1 for val in substitutes.values() if len(val) > 0)

# Ratio of ingredients covered
count / len(unigram_counts.keys())

# Pre-trained word vectors

Let's use pre-trained word vectors to see if we can find better ingredient substitutes. To do so, we will use the [Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/) (GloVe) from Stanford. The idea here is to replace each ingredient vector with the word vector of the ingredient's name.

By doing so, we risk not finding matching vectors for some of our ingredients. Some ingredient names are composed of several words, and even with the pre-trained vectors that have the largest vocabulary (2.2M) we cannot be sure to find a match for every name.

Note also that these pre-trained word vectors are for English words while our ingredients are in French, so we first need to translate our ingredients into English. Here we will use a translation file, but it is also possible to use the [Google Cloud Translation API](https://cloud.google.com/translate/docs/).

In [None]:
# Retrieve ingredients and create translation map
ingredients_fr_en = {}
ingredients_en_fr = {}

with open(os.path.join('/kaggle/input/ingredient-sets', 'ingredients_fr_en.csv'), 'r') as f:
    for line in f:
        ing_fr, ing_en = line.split(',')
        ing_fr, ing_en = ing_fr.strip().lower(), ing_en.strip().lower()
            
        ingredients_fr_en[ing_fr] = ing_en
        ingredients_en_fr[ing_en] = ing_fr

In [None]:
# Does every French ingredient map to a single English translation?
len(ingredients_fr_en), len(ingredients_en_fr)

In [None]:
def get_word_to_vec_map(glove_file, words_to_keep): 
    word_to_vec_map = {}
    with open(glove_file, 'r', encoding='utf-8') as f: 
        for row in f:
            row = row.strip().split()
            word = row[0]
            if word in words_to_keep:
                word_to_vec_map[word] = np.array(row[1:], dtype=np.float64)
    return word_to_vec_map

In [None]:
def find_best_substitute(word, word_to_vec_map, ingredients_fr_en, ingredients_en_fr, threshold=None):
    word = word.lower()
    
    if word not in ingredients_fr_en:
        return []
    word_en = ingredients_fr_en[word]
    
    if word_en not in word_to_vec_map:
        return []
    e_a = word_to_vec_map[word_en]
    
    max_cosine_sim = -sys.maxsize        # Initialize max_cosine_sim to a large negative number
    best_word = None                     # Initialize best_word with None, it will help keep track of the word to output

    for w in word_to_vec_map.keys():         
        # Compute cosine similarity between the vector e_a and the w's vector representation
        if w == word_en:
            continue
        cosine_sim = cosine_similarity([e_a], [word_to_vec_map[w]])[0][0]
        if threshold:
            if cosine_sim < threshold:
                continue
        
        if cosine_sim > max_cosine_sim:
            max_cosine_sim = cosine_sim
            best_word = w
     
    if best_word:
        return [(ingredients_en_fr[best_word], max_cosine_sim)]
    else:
        return []
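The function above calls `cosine_similarity` once per candidate word. As an alternative sketch, the similarities can be computed in a single call by stacking the candidate vectors into one matrix. The helper `best_match` and the three toy vectors are hypothetical, and the French/English translation step is left out to keep the example short.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy word vectors (illustrative values only, not real GloVe vectors)
toy_vecs = {
    'butter': np.array([1.0, 0.1, 0.0]),
    'margarine': np.array([0.9, 0.2, 0.0]),
    'carrot': np.array([0.0, 1.0, 0.3]),
}

def best_match(word, vec_map, threshold=None):
    """Return (closest_word, similarity), or None if nothing qualifies.

    Hypothetical helper: a vectorised variant of the search loop.
    """
    words = [w for w in vec_map if w != word]
    if word not in vec_map or not words:
        return None
    # One row per candidate, so a single call scores all of them
    matrix = np.vstack([vec_map[w] for w in words])
    sims = cosine_similarity(vec_map[word].reshape(1, -1), matrix)[0]
    best = int(np.argmax(sims))
    if threshold is not None and sims[best] < threshold:
        return None
    return words[best], float(sims[best])

print(best_match('butter', toy_vecs))
print(best_match('carrot', toy_vecs, threshold=0.7))
```

With a few hundred ingredients the difference in speed is small, but the single matrix call avoids the per-pair overhead when the search is repeated for every ingredient.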

In [None]:
word_to_vec_map = get_word_to_vec_map(os.path.join('/kaggle/input/glove840b300dtxt', 'glove.840B.300d.txt'), ingredients_en_fr.keys())

In [None]:
# Does every ingredient have a corresponding vector in GloVe?
len(word_to_vec_map)

In [None]:
tsne_plot(word_to_vec_map)

In [None]:
find_best_substitute('beurre', word_to_vec_map, ingredients_fr_en, ingredients_en_fr)

In [None]:
wv_substitutes = {}
for ingredient in unigram_counts.keys():
    sub = find_best_substitute(ingredient, word_to_vec_map, ingredients_fr_en, ingredients_en_fr, 0.7)
    wv_substitutes[ingredient] = sub

In [None]:
# Count the number of ingredients with possible substitutes
count = sum(1 for val in wv_substitutes.values() if len(val) > 0)

# Ratio of ingredients covered
count / len(unigram_counts.keys())

# What's next

A lot of research has been done on automatically finding ingredient substitutes, and there are more advanced and more efficient approaches than the ones presented in this notebook. If you are interested in this subject, you can check the Master's thesis (and its references) of Swaan Dekkers: [Automatic ingredient replacement in digital recipes: combining machine learning with expert knowledge](http://flavourspace.com/wp-content/uploads/2018/06/Master_Thesis_Swaan-Dekkers_final.pdf).

It is also possible to look at other fields of research and borrow their approaches. One approach that would be interesting to try, and that continues the word vector approach, is [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)). You can check the original paper [here](https://arxiv.org/abs/1810.04805) and its implementation on [GitHub](https://github.com/google-research/bert). The idea behind such a technique is that static word vectors are not good enough and we need embeddings that adapt to the context. Take for example the word 'bank', which can signify, depending on the context, either a financial institution or the land along the edge of a river (see [Bank (disambiguation)](https://en.wikipedia.org/wiki/Bank_(disambiguation))). With word2vec or GloVe we have only one embedding for the token 'bank', but with BERT the embedding depends on the context.

One use of BERT that can be particularly interesting for our case is lexical simplification. Such an approach is presented by Qiang et al. in their paper [A Simple BERT-Based Approach for Lexical Simplification](https://arxiv.org/abs/1907.06226). The idea, as presented on the [GitHub of the project](https://github.com/qiang2100/BERT-LS), is:
> Suppose that there is a sentence "the cat perched on the mat" and the complex word "perched". We concatenate the original sequence S and S' as a sentence pair, and feed the sentence pair {S,S'} into the BERT to obtain the probability distribution of the vocabulary corresponding to the mask word. Finally, we select as simplification candidates the top words from the probability distribution, excluding the morphological derivations of the complex word. For this example, we can get the top three simplification candidate words "sat, seated, hopped".

![BERT-LS illustration](https://raw.githubusercontent.com/qiang2100/BERT-LS/master/BERT_LS.png)

We could adapt this technique to perform ingredient substitution. Instead of feeding BERT sentences, we could feed it lists of ingredients (one list per recipe). Of course, this would require training a BERT model specifically for this purpose. Then, by masking one of the ingredients, we could retrieve its best substitute according to the context, i.e. for a particular recipe. Rather than a static map of ingredient substitutes, this would let us adapt the substitution to the context.