# CSCA 5632 Unsupervised Algorithms in Machine Learning Final Project

Discover hidden similarities in recipe data.

1. group cuisines by similarity
2. assign cuisines (influences) to arbitrary recipes
3. suggest ingredient pairings

We will use the following datasets:
1. small dataset of `cuisine -> [ingredients]` (24 cuisines, ~30 ingredients)
2. large dataset of `recipe -> [ingredients]` (200k)

In [1]:
from pathlib import Path
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.decomposition import PCA
from sklearn.decomposition import NMF

drive = Path("/content/drive/MyDrive")
drive_data = drive / "UCB_USL_final_data"

In [2]:
cuisines_df = pd.read_csv(drive_data / "cuisines.csv", index_col=0)
cuisines_df.head()

Unnamed: 0,ackee,aji amarillo,aji amarillo paste,aji amarillo peppers,aji limo pepper,allspice,almond paste,almonds,anchovy broth,andouille sausage,...,white pepper,white wine,white wine vinegar,wide rice noodles,worcestershire sauce,yams,yeast,yellow split peas,za atar spice blend,zucchini
Korean,False,False,False,False,False,False,False,False,True,False,...,False,False,False,False,False,False,False,False,False,False
Italian,False,False,False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False
Indian,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Thai,False,False,False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,False,False,False,False
French,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,True


In [3]:
cuisines = cuisines_df.to_numpy()
cuisines.shape

(24, 401)

## Cuisines similarity

We visualize how similar the cuisines are, based on the ingredients used in each cuisine's recipes.

PCA gives us a 2D representation (2 components).

Relative distance on the plot indicates relative similarity between cuisines.
The size of each dot shows the number of ingredients recorded for the cuisine (the bigger the dot, the more confident we are in the cuisine's placement).

_The explained variance is relatively low (the two components capture only about 18% of the variance, see the output below), so the 2D projection loses a lot of information and this visualization should not be taken too seriously._
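To put a number on how much the projection loses, the per-component explained variance can be computed directly. The following is a minimal NumPy sketch, not the notebook's code: `explained_variance_ratio` is a hypothetical helper, and `toy` is random data standing in for the 24 x 401 cuisine matrix.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # PCA centers each column first
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, descending
    var = s ** 2                               # proportional to per-component variance
    return var / var.sum()

# Hypothetical stand-in for the boolean cuisine x ingredient matrix
rng = np.random.default_rng(0)
toy = rng.random((24, 401)) < 0.08

ratio = explained_variance_ratio(toy)
cumulative = np.cumsum(ratio)
# Smallest number of components whose cumulative ratio reaches 80%
n_for_80 = int(np.searchsorted(cumulative, 0.80) + 1)
print(ratio[:2].sum(), n_for_80)
```

Comparing `ratio[:2].sum()` with the number of components needed for, say, 80% gives a rough sense of how far a 2D plot is from the full picture.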

In [4]:
pca = PCA(n_components=2)
components = pca.fit_transform(cuisines)
print(components.shape)
print(pca.explained_variance_ratio_)

(24, 2)
[0.10277276 0.07974787]


In [5]:
fig = px.scatter(components, x=0, y=1, color=cuisines_df.index, size=cuisines.sum(axis=1))
fig.update_layout(title = "PCA of Cuisines")
fig.show()

### Conclusion



## Predict recipes cuisine

In [6]:
recipes_df = pd.read_csv(drive_data / "ingredients.csv", index_col='title')
print(recipes_df.shape)
recipes_df.head()

(200000, 325)


title,aji amarillo paste,allspice,almond paste,almonds,andouille sausage,aonori,apples,arborio rice,assorted vegetables,avocado,...,white pepper,white wine,white wine vinegar,wide rice noodles,worcestershire sauce,yams,yeast,yellow split peas,za atar spice blend,zucchini
Moist Hash Brown Casserole,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Chicken Tortilla Casserole,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Lettuce Wedge with Poppy Seed Dressing,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Slammin Salmon,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
Orange Ginger Tisane or Rum Toddy,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [7]:
recipes = recipes_df.to_numpy()
recipes.shape

(200000, 325)

In [8]:
# not all cuisine ingredients are used in the recipes,
# take the intersection of columns

overlap_cuisines = cuisines_df[recipes_df.columns].to_numpy()
overlap_cuisines.shape

(24, 325)
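Indexing `cuisines_df` with the recipe columns works here because every recipe column also exists in the cuisine data; pandas raises a `KeyError` otherwise. A more defensive variant, sketched below on hypothetical miniature frames (`cuisines_toy`, `recipes_toy`), computes the shared columns explicitly and applies them to both frames so the two matrices stay aligned.

```python
import pandas as pd

# Hypothetical miniature frames standing in for cuisines_df and recipes_df
cuisines_toy = pd.DataFrame(
    [[True, False, True], [False, True, True]],
    index=["Italian", "Thai"],
    columns=["basil", "fish sauce", "garlic"],
)
recipes_toy = pd.DataFrame(
    [[True, True, False]],
    index=["Garlic Bread"],
    columns=["garlic", "butter", "basil"],
)

# Index.intersection keeps only the labels present in both frames
shared = cuisines_toy.columns.intersection(recipes_toy.columns)

# Selecting with the same index on both sides keeps the column order identical
cuisines_shared = cuisines_toy[shared].to_numpy()
recipes_shared = recipes_toy[shared].to_numpy()
print(list(shared), cuisines_shared.shape, recipes_shared.shape)
```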

### Naive approach

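A naive baseline that uses a plain matrix product rather than a factorization: multiply each recipe's ingredient vector by the cuisine-ingredient matrix to count shared ingredients, then normalize each recipe's counts to sum to 1.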
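A tiny worked example may make the scoring concrete. The arrays below are hypothetical toy data, and the use of `np.divide` with `where=` is an assumption about how one might avoid NaN for recipes that share no ingredient with any cuisine; it is a sketch, not the notebook's implementation.

```python
import numpy as np

# Hypothetical toy data: 3 recipes x 4 ingredients, 2 cuisines x 4 ingredients
recipes_toy = np.array([
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],   # recipe with no known ingredients
])
cuisines_toy = np.array([
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Entry (r, c) counts the ingredients recipe r shares with cuisine c
overlap = recipes_toy @ cuisines_toy.T

# Normalize rows to sum to 1, leaving all-zero rows at 0 instead of NaN
row_sums = overlap.sum(axis=1, keepdims=True)
scores = np.divide(overlap, row_sums, out=np.zeros(overlap.shape), where=row_sums > 0)
print(scores)
```

The first recipe scores entirely on the first cuisine, the second splits 1/3 to 2/3, and the empty recipe stays at zero for every cuisine.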

In [9]:
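# cast booleans to integers and count, for each recipe,
# the ingredients it shares with each cuisine
recipes_cuisines = (recipes * 1) @ overlap_cuisines.T
print(recipes_cuisines.shape)

# normalize each row to sum to 1; recipes sharing no ingredient with any cuisine
# have a zero row sum, so the division produces NaN (hence the warning below)
recipes_cuisines = recipes_cuisines / recipes_cuisines.sum(axis=1, keepdims=True)

recipes_cuisines_df = pd.DataFrame(recipes_cuisines, index=recipes_df.index, columns=cuisines_df.index)
recipes_cuisines_df.head()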

(200000, 24)



invalid value encountered in divide



title,Korean,Italian,Indian,Thai,French,Mexican,Greek,Swedish,Ethiopian,Nigerian,...,Japanese,Vietnamese,Spanish,Moroccan,Brazilian,American,Jamaican,Lebanese,Irish,Chinese
Moist Hash Brown Casserole,0.020408,0.040816,0.020408,0.020408,0.061224,0.081633,0.040816,0.061224,0.040816,0.020408,...,0.0,0.020408,0.061224,0.020408,0.061224,0.102041,0.040816,0.040816,0.061224,0.0
Chicken Tortilla Casserole,0.030303,0.030303,0.030303,0.030303,0.060606,0.090909,0.030303,0.060606,0.030303,0.030303,...,0.0,0.030303,0.060606,0.030303,0.060606,0.090909,0.030303,0.030303,0.060606,0.0
Lettuce Wedge with Poppy Seed Dressing,0.046512,0.046512,0.023256,0.046512,0.046512,0.046512,0.046512,0.046512,0.046512,0.046512,...,0.046512,0.023256,0.069767,0.023256,0.046512,0.069767,0.046512,0.046512,0.023256,0.0
Slammin Salmon,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.4,0.0,0.0,...,0.0,0.0,0.0,0.0,0.2,0.0,0.0,0.0,0.0,0.0
Orange Ginger Tisane or Rum Toddy,0.0,0.0,0.0,0.0,0.5,0.0,0.5,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
predicted_cuisine = recipes_cuisines_df.idxmax(axis=1)
predicted_cuisine.head(10)


The behavior of DataFrame.idxmax with all-NA values, or any-NA and skipna=False, is deprecated. In a future version this will raise ValueError



title,0
Moist Hash Brown Casserole,American
Chicken Tortilla Casserole,Mexican
Lettuce Wedge with Poppy Seed Dressing,Spanish
Slammin Salmon,Swedish
Orange Ginger Tisane or Rum Toddy,French
Ham 'N Tater Loaf,American
Oat Bran Muffins,Argentinian
Apple Pie Quesadillas,Mexican
Zippy Cauliflower,American
Bite-Size Cinnamon-Pecan Twirls,Italian


### NMF

Use non-negative matrix factorization (NMF) to extract more precise cuisine predictions.
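For intuition about what the factorization does, here is a minimal NumPy sketch of NMF with multiplicative updates. This is an illustrative substitute, not scikit-learn's solver (which defaults to coordinate descent); `nmf_multiplicative` and the toy matrix `V` are hypothetical.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative V (n x m) into W (n x k) @ H (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates for the squared-error objective;
        # multiplying by non-negative ratios keeps both factors non-negative
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: 6 "cuisines" x 10 "ingredients"
rng = np.random.default_rng(1)
V = (rng.random((6, 10)) < 0.4).astype(float)

W, H = nmf_multiplicative(V, k=3)
print(W.shape, H.shape, np.linalg.norm(V - W @ H))
```

Because the objective is non-convex, running the sketch with different `seed` values can produce different factors.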

In [11]:
model = NMF(n_components=24, init='random', random_state=20250622, max_iter=1000)
W = model.fit_transform(overlap_cuisines)
H = model.components_
print(W.shape, H.shape)

(24, 24) (24, 325)


In [12]:
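# project each recipe onto the 24 latent components learned from the cuisines
recipes_cuisines = model.transform(recipes)
print(recipes_cuisines.shape)
# score each recipe against each cuisine using the cuisines' latent weights W
recipes_cuisines = recipes_cuisines @ W.T
print(recipes_cuisines.shape)

recipes_cuisines_df = pd.DataFrame(recipes_cuisines, index=recipes_df.index, columns=cuisines_df.index)
recipes_cuisines_df.head()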

(200000, 24)
(200000, 24)


title,Korean,Italian,Indian,Thai,French,Mexican,Greek,Swedish,Ethiopian,Nigerian,...,Japanese,Vietnamese,Spanish,Moroccan,Brazilian,American,Jamaican,Lebanese,Irish,Chinese
Moist Hash Brown Casserole,7.387627e-10,0.0137343,0.006683462,8.076943e-07,0.00508828,0.013657,1.154806e-09,0.105624,0.004096,1.248455e-14,...,1.477718e-10,7.893773e-08,0.002886,1e-06,0.01704643,0.281411,4.129704e-09,4.66495e-09,0.04294321,0.0
Chicken Tortilla Casserole,8.780536e-10,2.592958e-10,0.06815703,5.05301e-06,0.006549979,0.012847,1.184632e-09,0.087217,0.000358,1.548973e-14,...,1.515884e-10,0.001504139,0.002956,1e-05,0.01244217,0.09514153,3.50279e-08,3.99634e-09,0.0337223,0.0
Lettuce Wedge with Poppy Seed Dressing,0.03702249,0.0216768,0.01662518,0.00606419,2.643454e-08,0.002255,5.275268e-09,0.056772,0.005371,0.01164037,...,0.01330384,1.858661e-07,0.013154,3e-06,2.595791e-10,0.09614727,1.13148e-08,0.02326707,5.10262e-10,0.0
Slammin Salmon,0.0,0.0,1.216263e-09,0.009787198,0.0,0.0,0.0,0.159042,0.0,0.0,...,0.0,5.251004e-08,0.0,0.0,0.01822227,0.0,0.0,0.0,0.0,0.0
Orange Ginger Tisane or Rum Toddy,1.347973e-11,0.0,1.176643e-10,0.0,0.01426926,0.0,0.04665096,0.0,0.0,0.0,...,0.0,1.915875e-07,0.0,0.0,0.0,6.056447e-09,0.0,0.0,0.0,0.0


In [13]:
predicted_cuisine = recipes_cuisines_df.idxmax(axis=1)
predicted_cuisine.head(10)

title,0
Moist Hash Brown Casserole,American
Chicken Tortilla Casserole,American
Lettuce Wedge with Poppy Seed Dressing,American
Slammin Salmon,Swedish
Orange Ginger Tisane or Rum Toddy,Greek
Ham 'N Tater Loaf,American
Oat Bran Muffins,Swedish
Apple Pie Quesadillas,Swedish
Zippy Cauliflower,American
Bite-Size Cinnamon-Pecan Twirls,Swedish


In [14]:
# get the top 3 cuisines for the first 10 recipes

recipes_cuisines_df.head(10).apply(lambda row: row.nlargest(3).index.tolist(), axis=1)

title,0
Moist Hash Brown Casserole,"[American, Swedish, Irish]"
Chicken Tortilla Casserole,"[American, Swedish, Indian]"
Lettuce Wedge with Poppy Seed Dressing,"[American, Swedish, Korean]"
Slammin Salmon,"[Swedish, Brazilian, Thai]"
Orange Ginger Tisane or Rum Toddy,"[Greek, French, Vietnamese]"
Ham 'N Tater Loaf,"[American, Swedish, Irish]"
Oat Bran Muffins,"[Swedish, American, Australian]"
Apple Pie Quesadillas,"[Swedish, Irish, Greek]"
Zippy Cauliflower,"[American, Irish, Greek]"
Bite-Size Cinnamon-Pecan Twirls,"[Swedish, Australian, Italian]"


### Conclusion

NMF is quite sensitive to initialization.

To go further we would need more data and a way to evaluate the classifications, for example user ratings or flagged misclassifications.

This is sufficient as a starting point and as a demonstration that the approach is viable, but any ML method can only really be evaluated in the context of the final application.

The approach could be tweaked as needed.
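If a labeled sample of recipes became available, one simple way to evaluate the predictions would be top-k accuracy. The sketch below assumes such labels exist; `top_k_accuracy`, the `scores` array, and `labels` are hypothetical and not part of the notebook's data.

```python
import numpy as np

def top_k_accuracy(scores, true_idx, k=3):
    """Share of rows whose true column index is among the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Hypothetical scores for 4 recipes over 5 cuisines, with hand-assigned labels
scores = np.array([
    [0.1, 0.5, 0.2, 0.1, 0.1],
    [0.3, 0.1, 0.4, 0.1, 0.1],
    [0.2, 0.2, 0.1, 0.4, 0.1],
    [0.6, 0.1, 0.1, 0.1, 0.1],
])
labels = [1, 0, 4, 0]

acc_top1 = top_k_accuracy(scores, labels, k=1)
acc_top2 = top_k_accuracy(scores, labels, k=2)
print(acc_top1, acc_top2)
```

Computing the same metric for both the naive scores and the NMF scores would give a like-for-like comparison of the two approaches.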

## Ingredient recommendation

We use a collaborative filtering approach.

Construct a matrix of ingredient co-occurrences across recipes.

Use this matrix to recommend ingredients that go with a given set of ingredients.
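The co-occurrence matrix can also be written as a single matrix product, which avoids an explicit Python loop over recipes. The sketch below uses hypothetical toy data (`recipes_toy`) to show the idea.

```python
import numpy as np

# Hypothetical toy data: 4 recipes x 3 ingredients (True = ingredient used)
recipes_toy = np.array([
    [True, True, False],
    [True, False, True],
    [True, True, True],
    [False, False, True],
])

X = recipes_toy.astype(float)
# (X.T @ X)[i, j] counts the recipes that use both ingredient i and ingredient j;
# the diagonal holds each ingredient's own recipe count.
# Dividing by the number of recipes turns counts into frequencies.
cooccurrence = (X.T @ X) / X.shape[0]
print(cooccurrence)
```

The result is symmetric, and entry `(0, 1)` is 0.5 because two of the four toy recipes use both of the first two ingredients.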

In [138]:
print(recipes.shape)

ingredients_matrix = np.zeros((len(recipes_df.columns), len(recipes_df.columns)))
print(ingredients_matrix.shape)

# construct the ingredient co-occurrence matrix:
# entry (j, k) is the fraction of recipes that use both ingredient j and ingredient k

for i in range(recipes.shape[0]):
  ingredients_row = recipes[i, :]
  ingredients_used = ingredients_row.nonzero()[0]

  for j in ingredients_used:
    ingredients_matrix[j, ingredients_used] += (1.0 / recipes.shape[0])

# optional normalization by ingredient total count (disabled)

# ingredients_matrix /= recipes.sum(axis=0)

print(ingredients_matrix[:5, :5])

(200000, 325)
(325, 325)
[[5.0000e-06 0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00]
 [0.0000e+00 6.3000e-03 0.0000e+00 6.0000e-05 0.0000e+00]
 [0.0000e+00 0.0000e+00 4.7000e-04 1.3000e-04 0.0000e+00]
 [0.0000e+00 6.0000e-05 1.3000e-04 1.4635e-02 0.0000e+00]
 [0.0000e+00 0.0000e+00 0.0000e+00 0.0000e+00 2.0500e-04]]
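One possible reading of the disabled normalization in the cell above is to turn each row into conditional frequencies by dividing by the diagonal, so that very common ingredients do not dominate when several rows are summed. This is an assumption about intent, sketched on a hypothetical 3 x 3 count matrix.

```python
import numpy as np

# Hypothetical co-occurrence counts; the diagonal is each ingredient's recipe count
counts = np.array([
    [3.0, 2.0, 2.0],
    [2.0, 2.0, 1.0],
    [2.0, 1.0, 3.0],
])

totals = np.diag(counts)[:, None]
# Row i becomes an estimate of P(ingredient j in recipe | ingredient i in recipe);
# rows for ingredients that never occur are left at 0
conditional = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)
print(conditional)
```

Dividing a row by a constant does not change the ranking for a single input ingredient, but it does change how much each input contributes when several ingredient rows are added together.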


In [139]:
ingredients_names = recipes_df.columns
print(ingredients_names[:10])

def ingredients_vector(names):
  # build a 0/1 vector marking the given ingredient names
  vector = np.zeros(len(ingredients_names))
  for name in names:
    if name not in ingredients_names:
      print(f"Warning: {name} not in ingredients")
      continue
    vector[ingredients_names == name] = 1
  return vector

ingredients_vector(['almonds'])[:10]

Index(['aji amarillo paste', 'allspice', 'almond paste', 'almonds',
       'andouille sausage', 'aonori', 'apples', 'arborio rice',
       'assorted vegetables', 'avocado'],
      dtype='object')


array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])

In [140]:
recommended = ingredients_vector(['pasta']) @ ingredients_matrix
print(recommended.shape)
print(recommended[:10])
ingredients_names[recommended.argsort()[-5:]].tolist()[::-1]

(325,)
[0.0e+00 5.0e-06 0.0e+00 9.0e-05 0.0e+00 5.0e-06 5.0e-06 0.0e+00 0.0e+00
 5.5e-05]


['pasta', 'garlic', 'salt', 'olive oil', 'tomatoes']

In [145]:
def recommend_ingredients(names, count=5):
  recommended = (ingredients_vector(names) @ ingredients_matrix)
  ingredients = ingredients_names[recommended.argsort()].tolist()
  ingredients.reverse()
  return [x for x in ingredients if x not in names][:count]

recommend_ingredients(['pasta'], count=10)

['garlic',
 'salt',
 'olive oil',
 'tomatoes',
 'onion',
 'parmesan cheese',
 'butter',
 'parsley',
 'basil',
 'pepper']

In [146]:
print(recommend_ingredients(['beef']))
print(recommend_ingredients(['beef', 'pepper']))
print(recommend_ingredients(['beef', 'pepper', 'butter']))
print(recommend_ingredients(['beef', 'pepper', 'butter', 'salt']))

['salt', 'onion', 'garlic', 'water', 'pepper']
['salt', 'onion', 'garlic', 'butter', 'water']
['salt', 'flour', 'sugar', 'eggs', 'milk']
['flour', 'sugar', 'eggs', 'onion', 'milk']


### Conclusion

The suggestions are reasonable: the proposed ingredients make sense in the context of the provided list of ingredients.

For better suggestions, we would need to consider not just the presence of individual ingredients but also their combinations.
We could start by adding pairs of ingredients (but this would make our CF matrix much larger).
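An alternative to enlarging the matrix with ingredient pairs is to filter the recipes at query time: keep only recipes that contain every given ingredient and rank what else they use. This is a different technique from the pair matrix suggested above, sketched here with a hypothetical helper `recommend_given_all` and toy data.

```python
import numpy as np

def recommend_given_all(X, names, given, count=3):
    """Rank ingredients by frequency among recipes containing every given ingredient."""
    idx = [names.index(g) for g in given]
    mask = X[:, idx].all(axis=1)            # recipes that use all given ingredients
    freq = X[mask].sum(axis=0).astype(float)
    freq[idx] = -1.0                        # never recommend the inputs themselves
    order = np.argsort(freq)[::-1]
    return [names[i] for i in order[:count] if freq[i] > 0]

# Hypothetical toy data: 5 recipes x 5 ingredients
names = ["beef", "butter", "flour", "onion", "sugar"]
X = np.array([
    [1, 0, 0, 1, 0],
    [1, 1, 0, 1, 0],
    [0, 1, 1, 0, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 1, 0],
], dtype=bool)

result = recommend_given_all(X, names, ["beef", "butter"])
print(result)
```

In the toy data, butter on its own co-occurs with flour and sugar, but the recipes that contain both beef and butter only add onion, so only onion is suggested. The trade-off is a pass over the recipe matrix per query, and rare combinations may match very few recipes.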