In [1]:
import pandas as pd
import numpy as np

In [26]:
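# on_bad_lines='warn' (pandas 1.3+) replaces error_bad_lines=False, which was removed in pandas 2.0:
# malformed rows are dropped and a warning is emitted for each one
recipes = pd.read_csv('myRec/app/recommender_comp/datasets/new_allrecipes.csv', sep=",", on_bad_lines='warn', encoding="latin-1")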
recipes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5422 entries, 0 to 5421
Data columns (total 13 columns):
id                    5422 non-null int64
title                 5422 non-null object
category              5422 non-null object
cook_time_minutes     5422 non-null int64
ingredients           5422 non-null object
instructions          5422 non-null object
photo_url             5422 non-null object
prep_time_minutes     5422 non-null int64
total_time_minutes    5422 non-null int64
rating_stars          5422 non-null float64
review_count          5422 non-null int64
calories              5422 non-null int64
url                   5422 non-null object
dtypes: float64(1), int64(6), object(6)
memory usage: 550.8+ KB


In [5]:
recipes['ingredients'].head()

0    ['5 cups cubed potatoes', '2 cups carrots, sli...
1    ['1/2 cup butter', '1 cup white sugar', '1 cup...
2    ['1 (10 pound) whole goose', '2 tablespoons ko...
3    ['1 cup packed brown sugar', '1 cup white suga...
4    ['10 pounds white potatoes, peeled and quarter...
Name: ingredients, dtype: object

In [6]:
recipes['instructions'].head()

0    ['In a 4 quart casserole dish combine cubed po...
1    ['Cream 1/2 cup butter or margarine and 1 cup ...
2    ['Rinse goose and pat dry. Remove excess fat. ...
3    ['In a saucepan, combine the brown sugar, whit...
4    ['Preheat oven to 350 degrees F (175 degrees C...
Name: instructions, dtype: object

###  Compute Term Frequency-Inverse Document Frequency (TF-IDF) vectors for each document
This produces a matrix in which each column represents a word in the ingredients vocabulary (every word that appears in at least one recipe) and each row represents a recipe.

The TF-IDF score is the frequency of a word in a document, down-weighted by the number of documents in which that word occurs. This reduces the importance of words that appear in many ingredient lists and, therefore, their influence on the final similarity score.
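
To make the down-weighting concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It assumes the smoothed IDF, ln((1 + n) / (1 + df)) + 1, and the unit-length (L2) row scaling that scikit-learn's TfidfVectorizer applies by default. The corpus and the helper names idf and tfidf_vector are hypothetical and are not part of this notebook.

```python
import math
from collections import Counter

# Toy corpus standing in for the ingredients column (made-up data).
docs = [
    "butter sugar flour banana",
    "butter sugar flour cinnamon",
    "butter mushroom garlic",
]

tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_vector(tokens):
    # Raw term count times IDF, then scaled to unit Euclidean length.
    counts = Counter(tokens)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf_vector(tokenized[0])

# "butter" occurs in every document and "banana" in only one,
# so "banana" receives the larger weight.
print(first["butter"] < first["banana"])  # True
```

Both words occur once in the first document, so the difference in weight comes entirely from how many documents contain them.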

In [7]:
# Import TfidfVectorizer from scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer

In [8]:
# Define a TF-IDF vectorizer object. Remove all English stop words such as 'the' and 'a'
tfidf = TfidfVectorizer(stop_words='english')

In [18]:
# Replace NaN with an empty string
recipes['ingredients'] = recipes['ingredients'].fillna('')
recipes['ingredients'].head()

0    ['5 cups cubed potatoes', '2 cups carrots, sli...
1    ['1/2 cup butter', '1 cup white sugar', '1 cup...
2    ['1 (10 pound) whole goose', '2 tablespoons ko...
3    ['1 cup packed brown sugar', '1 cup white suga...
4    ['10 pounds white potatoes, peeled and quarter...
Name: ingredients, dtype: object

In [22]:
#Construct the required TF-IDF matrix by fitting and transforming the data
tfidf_matrix = tfidf.fit_transform(recipes['ingredients'])

## 1621 words were used to describe ingredients, for 5422 recipes

In [24]:
#Output the shape of tfidf_matrix
tfidf_matrix.shape

(5422, 1621)

You will use cosine similarity to compute a numeric score for how similar two recipes are. Cosine similarity is a good fit because it is independent of vector magnitude and is fast and simple to calculate, especially in conjunction with TF-IDF scores, as explained below. Mathematically, it is defined as follows:

$$\text{cosine}(x, y) = \frac{x \cdot y^{\intercal}}{\lVert x \rVert \cdot \lVert y \rVert}$$
Because TfidfVectorizer L2-normalizes each row by default, every recipe vector has unit length, so the dot product of two rows is already their cosine similarity. You can therefore use sklearn's linear_kernel() instead of cosine_similarity(); it skips the redundant normalization step and is faster.
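
The equivalence between the dot product and cosine similarity for unit-length vectors can be checked with a short pure-Python sketch. The vectors and the helper functions dot, norm, cosine, and normalize are made up for illustration and are not part of this notebook.

```python
import math

def dot(x, y):
    # Plain dot product of two equal-length vectors.
    return sum(a * b for a, b in zip(x, y))

def norm(x):
    # Euclidean (L2) length of a vector.
    return math.sqrt(dot(x, x))

def cosine(x, y):
    # Cosine similarity as defined above.
    return dot(x, y) / (norm(x) * norm(y))

def normalize(x):
    # Scale a vector to unit length.
    length = norm(x)
    return [a / length for a in x]

x = [3.0, 0.0, 4.0]
y = [1.0, 2.0, 2.0]

x_unit, y_unit = normalize(x), normalize(y)

# For unit-length vectors the denominator is 1, so the plain dot product
# equals the cosine similarity of the original vectors.
print(math.isclose(dot(x_unit, y_unit), cosine(x, y)))  # True
```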

In [27]:
# Import linear_kernel
from sklearn.metrics.pairwise import linear_kernel

In [28]:
# Compute the cosine similarity matrix
cosine_sim = linear_kernel(tfidf_matrix, tfidf_matrix)

Next, define a function that takes a recipe title as input and returns the 10 most similar recipes. For this you first need a reverse mapping from recipe titles to DataFrame indices, that is, a way to look up the index of a recipe in the recipes DataFrame given its title.

In [30]:
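# Construct a reverse map from recipe titles to DataFrame indices
indices = pd.Series(recipes.index, index=recipes['title'])
# Keep only the first row for any repeated title so that indices[title] returns a single integer
# (drop_duplicates() compares the values, which are already unique, so it would remove nothing)
indices = indices[~indices.index.duplicated(keep='first')]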

- Get the index of the recipe given its title.
- Get the list of cosine similarity scores between that recipe and all recipes. Convert it into a list of tuples where the first element is the position and the second is the similarity score.
- Sort this list of tuples by similarity score, that is, by the second element.
- Get the top 10 elements of this list. Ignore the first element, as it refers to the recipe itself (the recipe most similar to a given recipe is that same recipe).
- Return the titles corresponding to the indices of the top elements.
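
The steps above can be traced on a toy example before applying them to the full similarity matrix. This is only a sketch: the titles, the similarity scores, and the names toy_sim, title_to_index, and toy_recommendations are hypothetical and chosen for illustration.

```python
# Hypothetical titles and similarity scores, for illustration only.
titles = ["Banana Muffins", "Banana Bread", "Mushroom Soup", "Banana Cookies"]
toy_sim = [
    [1.0, 0.8, 0.1, 0.6],
    [0.8, 1.0, 0.0, 0.5],
    [0.1, 0.0, 1.0, 0.2],
    [0.6, 0.5, 0.2, 1.0],
]

# Reverse mapping from title to row position.
title_to_index = {title: i for i, title in enumerate(titles)}

def toy_recommendations(title, top_n=2):
    # Step 1: look up the row for the given title.
    idx = title_to_index[title]
    # Step 2: pair every position with its similarity score.
    scores = list(enumerate(toy_sim[idx]))
    # Step 3: sort by score, highest first.
    scores = sorted(scores, key=lambda pair: pair[1], reverse=True)
    # Step 4: skip the first entry (the item itself) and keep the next top_n.
    scores = scores[1:top_n + 1]
    # Step 5: map positions back to titles.
    return [titles[i] for i, _ in scores]

print(toy_recommendations("Banana Muffins"))  # ['Banana Bread', 'Banana Cookies']
```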

In [35]:
# Function that takes a recipe title as input and returns the most similar recipes
def get_recommendations(title, cosine_sim=cosine_sim):
    # Get the index of the recipe that matches the title
    idx = indices[title]

    # Get the pairwise similarity scores of all recipes with that recipe
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the recipes by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the scores of the 10 most similar recipes, skipping the recipe itself
    sim_scores = sim_scores[1:11]

    # Get the recipe indices
    recipe_indices = [i[0] for i in sim_scores]

    # Return the titles of the 10 most similar recipes
    return recipes['title'].iloc[recipe_indices]

In [36]:
get_recommendations('Banana Crumb Muffins')

1009               Banana Split Cookies
2586             Cinnamon Sugar Cookies
4344         Janine's Best Banana Bread
2065                    Pineapple Puffs
717                  Butterscotch Bread
5007    Grandma's Homemade Banana Bread
2411             Lighter Banana Muffins
4214                    Streusel Kuchen
2959                Finnish Pannu Kakku
5054                Banana Bran Muffins
Name: title, dtype: object

In [40]:
get_recommendations('Stuffed Mushrooms IV')

2094    Best Ever Meatloaf with Brown Gravy
3856                      Mushroom Meatloaf
3970                       Mushroom Risotto
1924                            Peanut Soup
2985                   Tomato-Mushroom Soup
4712            Oven Fried Parmesan Chicken
4534       Mouth-Watering Stuffed Mushrooms
3105       Mushroom Stuffed Chicken Rollups
5160                         Mushroom Sauce
3021                       Creamy Corn Soup
Name: title, dtype: object

In [42]:
get_recommendations('Maple Roast Turkey')

4517        Maple Roast Turkey and Gravy
2340     Awesome Tangerine-Glazed Turkey
4813                   Vegetable Chowder
4533            Beef and Barley Soup III
3667                   Oyster Dressing I
4862    Ibby's Pumpkin Mushroom Stuffing
1874                 Veggie Cheddar Soup
4024                 Chicken Jambalaya I
1433                   Mushroom Stuffing
5387               Mulligatawny Soup III
Name: title, dtype: object