# Hello
I would like to present my very first data science project. Please do not be intimidated by its magnificence.
This is a simple food recommendation project. It suggests dishes based on your favourite food.
This is the first version of the project, which works on a static ("dead") dataset. In a later version the project will rely on an API and connect to a website to search for recipes.

## Libraries and knowledge

In [None]:
import os
import numpy as np # linear algebra and arrays
import pandas as pd # dataframes and stuff
import matplotlib.pyplot as plt # plots
%matplotlib inline
import seaborn as sns # more plots
import statistics as st # distributions
import scipy as sp # pivot engineering

#ML model
from sklearn.metrics.pairwise import cosine_similarity

## Data import and first glimpse into it

### Users
u - UserID
techniques - Cooking techniques encountered by user
items - Recipes interacted with, in order
n_items - Number of recipes reviewed
ratings - Ratings given to each recipe encountered by this user
n_ratings - Number of ratings in total

In [None]:
users_pp = pd.read_csv('archive/PP_users.csv')
users_pp.info()
users_pp.head()

### Recipes - preprocessed data
id - Recipe ID
i - Recipe ID mapped to contiguous integers from 0
name_tokens - BPE-tokenized recipe name
ingredient_tokens - BPE-tokenized ingredients list (list of lists)
steps_tokens - BPE-tokenized steps
techniques - List of techniques used in recipe
calorie_level - Calorie level in ascending order
ingredients_ids - IDs of ingredients in recipe

In [None]:
recipes_pp = pd.read_csv('archive/PP_recipes.csv')
recipes_pp.info()
recipes_pp.head()

### Recipes - raw data
name - Recipe name
id - Recipe ID
minutes - Minutes to prepare recipe
contributor_id - User ID who submitted this recipe
submitted - Date recipe was submitted
tags - Food.com tags for recipe
nutrition - Nutrition information (calories (#), total fat (PDV), sugar (PDV) , sodium (PDV) , protein (PDV) , saturated fat
steps - Text for recipe steps, in order
description - User-provided description

In [None]:
recipes_raw = pd.read_csv('archive/RAW_recipes.csv')
recipes_raw.info()
recipes_raw.head()

### Interactions
user_id - User ID
recipe_id - Recipe ID
date - Date of interaction
rating - Rating given
review - Review text

In [21]:
interactions_raw = pd.read_csv('archive/RAW_interactions.csv')
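# dropna() returns a new DataFrame, so calling it without assignment has no effect.
# It is left commented out: only 'review' has missing values (see info() below)
# and that column is not used later.
# interactions_raw = interactions_raw.dropna()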
interactions_raw.info()
interactions_raw.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1132367 entries, 0 to 1132366
Data columns (total 5 columns):
 #   Column     Non-Null Count    Dtype 
---  ------     --------------    ----- 
 0   user_id    1132367 non-null  int64 
 1   recipe_id  1132367 non-null  int64 
 2   date       1132367 non-null  object
 3   rating     1132367 non-null  int64 
 4   review     1132198 non-null  object
dtypes: int64(3), object(2)
memory usage: 43.2+ MB


| | user_id | recipe_id | date | rating | review |
|---|---|---|---|---|---|
| 0 | 38094 | 40893 | 2003-02-17 | 4 | Great with a salad. Cooked on top of stove for... |
| 1 | 1293707 | 40893 | 2011-12-21 | 5 | So simple, so delicious! Great for chilly fall... |
| 2 | 8937 | 44394 | 2002-12-01 | 4 | This worked very well and is EASY. I used not... |
| 3 | 126440 | 85009 | 2010-02-27 | 5 | I made the Mexican topping and took it to bunk... |
| 4 | 57222 | 85009 | 2011-10-01 | 5 | Made the cheddar bacon topping, adding a sprin... |


## Data analysis

### Most often rated recipes

I am going to do it using lists whose indexes are recipe IDs. This requires lists one element longer than the maximum recipe_id, so that each recipe_id directly matches a list index.
Once that is done, I will create a structured array of the form:
[(recipe_id, interactions_count, average_rating), ...]
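
As a side note, the index-equals-recipe-id idea can also be sketched with `np.bincount`, which builds the same two "lists" without a Python loop. This is only an illustration on made-up toy data; the names `counts`, `sums` and `recipe_stats_demo` are hypothetical and not part of the notebook.

```python
import numpy as np

# Toy data (made up): each position is one interaction
recipe_ids = np.array([3, 1, 3, 5, 3, 1])
ratings = np.array([5, 4, 3, 0, 4, 2])

size = recipe_ids.max() + 1
# counts[k] = number of interactions for recipe id k
counts = np.bincount(recipe_ids, minlength=size)
# sums[k] = sum of ratings for recipe id k
sums = np.bincount(recipe_ids, weights=ratings, minlength=size)

# Keep only ids that actually occur, so we never divide by zero
rated_ids = np.flatnonzero(counts)
recipe_stats_demo = np.array(
    [(rid, counts[rid], sums[rid] / counts[rid]) for rid in rated_ids],
    dtype=[('recipe_id', int), ('interactions_count', int), ('avg_rating', float)],
)
print(recipe_stats_demo)
```

Like the list approach, this allocates one slot per possible id, so it is only practical while the maximum recipe id stays reasonably small.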

In [None]:
# Create lists for interaction counts and rating sums, indexed by recipe id
max_recipe_id = int(recipes_raw['id'].max())
interactions_count = [0] * (max_recipe_id + 1)
ratings_sum = [0] * (max_recipe_id + 1)

# Fill the lists (iterating over the two columns is much faster than iterrows + loc)
for recipe_id, rating in zip(interactions_raw['recipe_id'], interactions_raw['rating']):
    interactions_count[recipe_id] += 1
    ratings_sum[recipe_id] += rating

In [None]:
# Create a structured array to store the recipes stats
array_structure = [('recipe_id', int), ('interactions_count', int), ('avg_rating', float)]
array_values = []

for recipe_id in recipes_raw['id'].values:
    interactions_number = interactions_count[recipe_id]
    # Guard against recipes without any interaction (would divide by zero)
    avg_rating = ratings_sum[recipe_id] / interactions_number if interactions_number else 0.0
    array_values.append((recipe_id, interactions_number, avg_rating))
recipe_stats = np.array(array_values, dtype = array_structure)

In [31]:
# Find 10 most often rated recipes
most_often_recipes = recipe_stats.copy()
most_often_recipes.sort(order = 'interactions_count')
most_often_recipes = most_often_recipes[::-1]

for i in range(10, 0, -1):
    print('Position ', i, ':', sep = '')
    recipe_id = most_often_recipes[i - 1]['recipe_id']
    name = recipes_raw.loc[recipes_raw['id'] == recipe_id]['name'].to_string(index= False)
    print(name.upper())
    print('Interactions:  ', most_often_recipes[i - 1]['interactions_count'])
    avg_rating = most_often_recipes[i - 1]['avg_rating']
    print(f'Average rating: {avg_rating:1.2f} / 5\n')

Position 10:
JAPANESE MUM S CHICKEN
Interactions:   904
Average rating: 4.40 / 5

Position 9:
KITTENCAL S ITALIAN MELT IN YOUR MOUTH MEATBALLS
Interactions:   997
Average rating: 4.71 / 5

Position 8:
WHATEVER FLOATS YOUR BOAT  BROWNIES
Interactions:   1220
Average rating: 4.53 / 5

Position 7:
JO MAMA S WORLD FAMOUS SPAGHETTI
Interactions:   1234
Average rating: 4.42 / 5

Position 6:
YES  VIRGINIA THERE IS A GREAT MEATLOAF
Interactions:   1305
Average rating: 4.21 / 5

Position 5:
BEST EVER BANANA CAKE WITH CREAM CHEESE FROSTING
Interactions:   1322
Average rating: 4.33 / 5

Position 4:
CREAMY CAJUN CHICKEN PASTA
Interactions:   1448
Average rating: 4.54 / 5

Position 3:
CROCK POT CHICKEN WITH BLACK BEANS   CREAM CHEESE
Interactions:   1579
Average rating: 4.22 / 5

Position 2:
TO DIE FOR CROCK POT ROAST
Interactions:   1601
Average rating: 4.29 / 5

Position 1:
BEST BANANA BREAD
Interactions:   1613
Average rating: 4.19 / 5



### Highest rated recipes
At this point I would like to find the recipes with the highest average rating, but this would probably return a lot of recipes with just a single 5-star rating.

In [None]:
# Find 10 top rated recipes

top_rated_recipes = recipe_stats.copy()
top_rated_recipes.sort(order = 'avg_rating')
top_rated_recipes = top_rated_recipes[::-1]

for i in range(10, 0, -1):
    print('Position ', i, ':', sep = '')
    recipe_id = top_rated_recipes[i - 1]['recipe_id']
    name = recipes_raw.loc[recipes_raw['id'] == recipe_id]['name'].to_string(index= False)
    print(name.upper())
    print('Interactions:  ', top_rated_recipes[i - 1]['interactions_count'])
    avg_rating = top_rated_recipes[i - 1]['avg_rating']
    print(f'Average rating: {avg_rating:1.2f} / 5\n')

We got exactly the expected result. So let's make the results more objective by setting a threshold below which we will not consider results valid. This could be some constant value (e.g. 20 or 50 ratings), but we want it to reflect the overall distribution of the number of ratings. So let's check what this distribution looks like.
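
One way such a distribution-aware threshold could be derived is from a quantile of the interaction counts. The sketch below uses made-up counts, and `keep_fraction` is a hypothetical parameter chosen only for illustration; it is not the threshold used in this notebook.

```python
import numpy as np

# Toy interaction counts per recipe (made-up values)
counts = np.array([1, 1, 1, 2, 2, 3, 4, 8, 15, 120])

# Hypothetical choice: keep roughly the 20% most-reviewed recipes
keep_fraction = 0.2
threshold = np.quantile(counts, 1 - keep_fraction)

kept = counts[counts >= threshold]
print('Threshold:', threshold)
print('Recipes kept:', len(kept), 'of', len(counts))
```

A quantile-based cutoff moves with the data, whereas a constant such as 20 or 50 has to be revisited whenever the dataset changes.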

In [None]:
# Create distribution plot

sns.kdeplot(data = recipe_stats['interactions_count'], log_scale = True, fill = True)  # cbar only applies to bivariate plots
plt.title('Interactions number distribution')
plt.xlabel('Number of interactions (logarithmic scale)')

In [None]:
# Print some statistical values

data = recipe_stats['interactions_count']
print('Number of recipes:', len(data))
print('Median:    ', st.median(data))
print('Quantiles: ', st.quantiles(data, n = 10))
print('Mean value:', st.mean(data))

After looking at the chart above and the statistics, I decided to set the first threshold at >= 10 interactions, which will cut off the majority of the least popular recipes.

In [None]:
# Cut off recipes with less than 10 interactions and check how it worked

top_rated_recipes = recipe_stats.copy()
top_rated_recipes.sort(order = 'avg_rating')
top_rated_recipes = top_rated_recipes[::-1]

print('Number of recipes included:', len(top_rated_recipes))
print('5 first elements, second column is number of interactions:')
print(top_rated_recipes[0:5])

print('\n', '*'*50, '\n')

# Boolean mask: True for recipes with at least 10 interactions
filter_array = top_rated_recipes['interactions_count'] >= 10
top_rated_recipes = top_rated_recipes[filter_array]
print('Number of recipes included:', len(top_rated_recipes))
print('5 first elements, second column is number of interactions:')
print(top_rated_recipes[0:5])

As we can see, setting the minimum interaction count to just 10 filtered out 90% of the recipes. Let's check the distribution and the statistics now:

In [None]:
# Create - again - distribution plot

sns.kdeplot(data = top_rated_recipes['interactions_count'], log_scale = True, fill = True)  # cbar only applies to bivariate plots
plt.title('Interactions number distribution')
plt.xlabel('Number of interactions (logarithmic scale)')

In [None]:
# Print - again - some statistical values

data = top_rated_recipes['interactions_count']
print('Number of recipes:', len(data))
print('Median:    ', st.median(data))
print('Quantiles: ', st.quantiles(data, n=10))
print('Mean value:', st.mean(data))

For now we can consider this a decent outcome. Let's get our desired list of top recipes!

In [None]:
# Print top rated recipes

for i in range(10, 0, -1):
    print('Position ', i, ':', sep = '')
    recipe_id = top_rated_recipes[i - 1]['recipe_id']
    name = recipes_raw.loc[recipes_raw['id'] == recipe_id]['name'].to_string(index= False)
    print(name.upper())
    print('Interactions:  ', top_rated_recipes[i - 1]['interactions_count'])
    avg_rating = top_rated_recipes[i - 1]['avg_rating']
    print(f'Average rating: {avg_rating:1.2f} / 5\n')

OK, sorry for that: it is again a list of recipes with nothing but 5-star ratings. So let's filter once more and check how it looks with at least 100 interactions:
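
Since the same filter-and-sort step is repeated for each threshold, it could be wrapped in a small helper. The function `top_rated` and the toy array `demo` below are hypothetical names with made-up values; the tie-break on `interactions_count` is an extra assumption that the cells in this notebook do not make.

```python
import numpy as np

def top_rated(stats, min_interactions, n=10):
    """Return up to n rows with the highest avg_rating among recipes
    that have at least min_interactions interactions."""
    mask = stats['interactions_count'] >= min_interactions
    filtered = stats[mask]
    # Sort ascending by rating, then by count, and reverse for descending order
    ordered = np.sort(filtered, order=['avg_rating', 'interactions_count'])[::-1]
    return ordered[:n]

# Toy structured array with the same fields as recipe_stats (made-up values)
demo = np.array(
    [(1, 1, 5.0), (2, 150, 4.8), (3, 12, 5.0), (4, 300, 4.9)],
    dtype=[('recipe_id', int), ('interactions_count', int), ('avg_rating', float)],
)

for threshold in (10, 100):
    print(threshold, top_rated(demo, threshold)['recipe_id'])
```

With a helper like this, trying another threshold is a one-line change instead of a copied cell.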


In [None]:
# Cut off recipes with less than 100 interactions and check how it worked

print('Number of recipes included:', len(top_rated_recipes))
print('5 first elements, second column is number of interactions:')
print(top_rated_recipes[0:5])

print('\n', '*'*50, '\n')

# Boolean mask: True for recipes with at least 100 interactions
filter_array = top_rated_recipes['interactions_count'] >= 100
top_rated_recipes = top_rated_recipes[filter_array]
print('Number of recipes included:', len(top_rated_recipes))
print('5 first elements, second column is number of interactions:')
print(top_rated_recipes[0:5])

In [None]:
# Create - yes, again - distribution plot

sns.kdeplot(data = top_rated_recipes['interactions_count'], log_scale = True, fill = True)  # cbar only applies to bivariate plots
plt.title('Interactions number distribution')
plt.xlabel('Number of interactions (logarithmic scale)')

In [None]:
# Print - yes, again - some statistical values

data = top_rated_recipes['interactions_count']
print('Number of recipes:', len(data))
print('Median:    ', st.median(data))
print('Quantiles: ', st.quantiles(data, n=10))
print('Mean value:', st.mean(data))

In [None]:
# Print top rated recipes

for i in range(10, 0, -1):
    print('Position ', i, ':', sep = '')
    recipe_id = top_rated_recipes[i - 1]['recipe_id']
    name = recipes_raw.loc[recipes_raw['id'] == recipe_id]['name'].to_string(index= False)
    print(name.upper())
    print('Interactions:  ', top_rated_recipes[i - 1]['interactions_count'])
    avg_rating = top_rated_recipes[i - 1]['avg_rating']
    print(f'Average rating: {avg_rating:1.2f} / 5\n')

Now we have a really nice list of top recipes, which means we can head to the supermarket for some Tex-Mex ingredients :)

## Recommendation tools

### Creating DataFrame for pivot table

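We have to limit the size of the table, because otherwise it would get too big for pandas to process. To do so we will drop recipes with fewer than 5 interactions and users with fewer than 3 interactions. These will be our steps:
 1. Count interactions per recipe and keep only recipes with at least 5 interactions.
 2. Count interactions per user and keep only users with at least 3 interactions.
 3. Merge the remaining interactions with the recipe names.
 4. Build a pivot table of users and their ratings.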
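
The size concern could also be approached with a sparse matrix, which stores only the ratings that actually exist. The sketch below is a hedged illustration on made-up toy data using `scipy.sparse.csr_matrix`; it is not the approach taken in this notebook, and all variable names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy interactions (made-up values): one entry per (user, recipe, rating)
user_ids = np.array([10, 10, 20, 30, 30])
recipe_ids = np.array([7, 9, 7, 9, 8])
ratings = np.array([5, 3, 4, 2, 5])

# Map raw ids to contiguous row/column positions
users, row_idx = np.unique(user_ids, return_inverse=True)
recipes, col_idx = np.unique(recipe_ids, return_inverse=True)

# Only observed ratings are stored; missing pairs are implicit zeros
matrix = csr_matrix((ratings, (row_idx, col_idx)), shape=(len(users), len(recipes)))

print(matrix.shape, matrix.nnz)
print(matrix.toarray())
```

Two caveats apply: a rating of 0 becomes indistinguishable from "not rated", and duplicate (user, recipe) pairs are summed rather than averaged. `cosine_similarity` from scikit-learn accepts sparse input, so a matrix like this could be passed to it directly.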

In [71]:
# Step 1: keep recipes with at least min_recipe_interactions interactions
min_recipe_interactions = 5
recipes_valid = interactions_raw['recipe_id'].value_counts()
recipes_valid = recipes_valid[recipes_valid.ge(min_recipe_interactions)]
interactions_valid = interactions_raw[interactions_raw['recipe_id'].isin(recipes_valid.index)]

# Step 2: keep users with at least min_user_interactions interactions
# (counted on the unfiltered interactions, so a user may keep fewer rows after Step 1)
min_user_interactions = 3
users_valid = interactions_raw['user_id'].value_counts()
users_valid = users_valid[users_valid.ge(min_user_interactions)]
interactions_valid = interactions_valid[interactions_valid['user_id'].isin(users_valid.index)]


In [72]:
# Create a new DataFrame combining user id, recipe name and its rating

rated_recipes = interactions_valid.merge(recipes_raw, left_on = 'recipe_id', right_on = 'id')
rated_recipes = rated_recipes[['user_id', 'name', 'rating']]
print(len(rated_recipes))
rated_recipes.head()

648613


| | user_id | name | rating |
|---|---|---|---|
| 0 | 76535 | kfc honey bbq strips | 4 |
| 1 | 353911 | kfc honey bbq strips | 5 |
| 2 | 190375 | kfc honey bbq strips | 5 |
| 3 | 468945 | kfc honey bbq strips | 0 |
| 4 | 255338 | kfc honey bbq strips | 5 |


In [73]:
# Create pivot table to gather users and their ratings in one table

pivot = pd.pivot_table(rated_recipes, values = 'rating', index = 'user_id', columns = 'name')
pivot.head()

| user_id | 0 carb 0 cal gummy worms | 0 point soup ww | 0 point soup crock pot | 1 00 tangy chicken recipe | 1 000 artichoke hearts | 1 1 1 tempura batter | 1 2 3 4 tater tot casserole | 1 2 3 4 cake | 1 2 3 4 cake with caramel icing | 1 2 3 apple crisp | ... | zuke soup | zulu cabbage | zuppa di broccoli broccoli soup | zuppa di pesce cioppino or fish stew | zuppa toscana from olive garden | zuppa toscana soup olive garden clone | zurie s overnight no knead bread | zwiebelkuchen southwest german onion cake | zydeco soup | zydeco ya ya deviled eggs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1533 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1535 | NaN | 5.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1634 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1676 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1773 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
