### recipeeze
* Contributors
    * Todd McCullough [Git](https://github.com/tamccullough)
    * Marko Topitch [Git](https://github.com/TopMarko)

In [1]:
import pandas as pd
import numpy as np
import heapq
import random
from math import floor

##### Import Surprise
[Surprise](http://surpriselib.com/) is a Python scikit for building and analyzing recommender systems that deal with explicit rating data.

In [2]:
from surprise import Reader, Dataset
from surprise import KNNWithMeans
from surprise.model_selection import cross_validate

import pickle
filename = 'model/recipes_recommender_model.sav'
# Load a previously saved model; the context manager closes the file afterwards
with open(filename, 'rb') as f:
    rr_model = pickle.load(f)

##### Import Data

In [3]:
recipes = pd.read_csv('datasets/rr-recipes.csv')
ratings = pd.read_csv('datasets/rr-ratings.csv')

In [4]:
ratings.head(2)

Unnamed: 0,userid,itemid,rating
0,675719,7000,5
1,1478626,7000,5


In [5]:
ratings.shape

(1555581, 3)

In [6]:
recipes.head(2)

Unnamed: 0,itemid,title,prep_time,cook_time,ready_time,ingredients,directions,url,photo_url
0,7000,Golden Crescent Rolls Recipe,25,15,190,"yeast,water,white sugar,salt,egg,butter,flour,...","Dissolve yeast in warm water.**Stir in sugar, ...",https://www.allrecipes.com/recipe/7000,https://images.media-allrecipes.com/userphotos...
1,7001,Poppy Seed Bread with Glaze Recipe,15,60,80,"flour,salt,baking powder,poppy,butter,vegetabl...",'Preheat oven to 350 degrees F (175 degrees C)...,https://www.allrecipes.com/recipe/7001,https://images.media-allrecipes.com/userphotos...


In [7]:
ingredients = pd.DataFrame(recipes.ingredients,columns=['ingredients'])
ingredients['count'] = 1

In [8]:
ingredients_g = ingredients.groupby('ingredients').count().reset_index()
ingredients_g = ingredients_g.sort_values('count', ascending=False)
ingredients_g.head(5)

Unnamed: 0,ingredients,count
3765,"chocolate,chocolate",4
5922,"flour,salt,shortening,water",4
4441,"cream cheese,marshmallow",3
4277,"cranberry,marshmallow,white sugar,pineapple,ba...",3
5642,"flour,brown sugar,white sugar,pumpkin,vegetabl...",3


In [9]:
min_recipe_ratings = 50
filter_recipes = ratings['itemid'].value_counts() > min_recipe_ratings
filter_recipes = filter_recipes[filter_recipes].index.tolist()

min_user_ratings = 50
filter_users = ratings['userid'].value_counts() > min_user_ratings
filter_users = filter_users[filter_users].index.tolist()

ratings = ratings[(ratings['itemid'].isin(filter_recipes)) & (ratings['userid'].isin(filter_users))]
ratings.shape

(132353, 3)
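
The thresholding step can be illustrated on a toy list using only the standard library. This is a sketch under stated assumptions: the `toy_ratings` rows and the threshold of 1 are invented for illustration and are not taken from the dataset. It shows that the comparison is strict, so an id must appear more than `min_count` times to be kept.

```python
from collections import Counter

# Hypothetical (userid, itemid, rating) rows, invented for this sketch
toy_ratings = [
    (1, 'a', 5), (1, 'b', 4), (1, 'c', 3),
    (2, 'a', 4), (2, 'b', 2),
    (3, 'a', 5),
]
min_count = 1  # keep ids that appear strictly more than once

# Count how often each item and each user appears in the unfiltered rows
item_counts = Counter(item for _, item, _ in toy_ratings)
user_counts = Counter(user for user, _, _ in toy_ratings)

keep_items = {i for i, n in item_counts.items() if n > min_count}
keep_users = {u for u, n in user_counts.items() if n > min_count}

# A row survives only if both its user and its item passed the threshold
filtered = [r for r in toy_ratings if r[0] in keep_users and r[1] in keep_items]
```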

In [10]:
users = ratings.userid.unique()

In [11]:
users.shape

(1426,)

##### Define a Ratings scale
This scale is determined by the lowest and highest rating possible. 
In this case the lowest rating is 1, while the highest is 5.

In [12]:
reader = Reader(rating_scale=(1,5)) # This just defines the rating scale
data = Dataset.load_from_df(ratings[['userid', 'itemid', 'rating']], reader=reader)

### Build the model

##### KNN with Means - Surprise

[KNN with Means](https://surprise.readthedocs.io/en/stable/knn_inspired.html#surprise.prediction_algorithms.knns.KNNWithMeans) is the algorithm chosen for the recommender. It is a basic collaborative filtering algorithm that takes into account the mean ratings of each user.
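
To make the idea concrete, here is a minimal pure-Python sketch of the item-based form of the estimate, as described in the Surprise documentation: the target item's mean plus a similarity-weighted average of the user's deviations on neighboring items. The function name `knn_with_means_estimate` and the toy numbers are assumptions made for illustration, and the sketch leaves out details such as neighbor selection, `min_k`, and clipping to the rating scale.

```python
def knn_with_means_estimate(item_mean, neighbors):
    """Item-based estimate for one (user, item) pair.

    neighbors: list of (similarity, user_rating, neighbor_item_mean) tuples,
    one per neighboring item the user has rated.
    """
    # Weighted sum of how far each rating sits from its own item's mean
    numerator = sum(sim * (rating - mean) for sim, rating, mean in neighbors)
    denominator = sum(sim for sim, _, _ in neighbors)
    if denominator == 0:
        return item_mean  # no usable neighbors: fall back to the item mean
    return item_mean + numerator / denominator

# Toy numbers, invented for illustration
estimate = knn_with_means_estimate(
    item_mean=4.0,
    neighbors=[(0.5, 5.0, 4.0), (0.5, 4.0, 4.0)],
)
```

With these numbers the user rated one neighbor a full point above its mean and the other exactly at its mean, so the estimate lands half a point above the target item's mean.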

In [13]:
def build_recommender(user_based=False, sim_type='cosine'):
    sim_options = {
        "name": sim_type,
        "user_based": user_based
    }

    return KNNWithMeans(sim_options=sim_options)

##### Calculate the Similarity Matrix

[build_full_trainset()](https://surprise.readthedocs.io/en/stable/dataset.html#surprise.dataset.DatasetAutoFolds.build_full_trainset) builds the *Trainset* from the whole dataset, ignoring folds.

The Trainset is built from the data but holds more information about it, such as the mapping between raw and inner ids and the global mean rating.

In [14]:
trainset = data.build_full_trainset()
# user_based_recommender = build_recommender(user_based=True)
item_based_recommender = build_recommender()
# User based seems to give a memory error when fit, due to the much larger number of users than recipes.
# user_based_recommender.fit(trainset)
item_based_recommender.fit(trainset)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x7f6ff6457290>
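
The similarity matrix computed above holds one cosine similarity per pair of items. A minimal sketch of that measure for a single pair, assuming (as in the Surprise documentation) that only users who rated both items contribute, could look like this. The function name `cosine_sim` and the toy ratings are hypothetical.

```python
from math import sqrt

def cosine_sim(ratings_i, ratings_j):
    """Cosine similarity between two items, using only users who rated both.

    ratings_i, ratings_j: dicts mapping user id -> rating.
    """
    common = ratings_i.keys() & ratings_j.keys()
    if not common:
        return 0.0  # no shared users, so no evidence of similarity
    dot = sum(ratings_i[u] * ratings_j[u] for u in common)
    norm_i = sqrt(sum(ratings_i[u] ** 2 for u in common))
    norm_j = sqrt(sum(ratings_j[u] ** 2 for u in common))
    return dot / (norm_i * norm_j)

# Toy ratings, invented for illustration; users 1 and 2 rated both items
item_a = {1: 4, 2: 3, 3: 5}
item_b = {1: 3, 2: 4, 4: 1}
similarity = cosine_sim(item_a, item_b)
```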

### Evaluate the Model

Using [cross_validate()](https://surprise.readthedocs.io/en/stable/model_selection.html#cross-validation) from Surprise, we can quickly evaluate the model on a few metrics.

In [15]:
cross_validate(item_based_recommender, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8033  0.8080  0.8050  0.7960  0.7937  0.8012  0.0055  
MAE (testset)     0.5957  0.5984  0.5925  0.5914  0.5891  0.5934  0.0033  
Fit time          11.31   10.81   10.72   10.68   10.70   10.84   0.24    
Test time         6.09    6.05    6.03    5.92    6.15    6.05    0.07    


{'test_rmse': array([0.80331337, 0.80799963, 0.80499603, 0.79596723, 0.79365242]),
 'test_mae': array([0.59565276, 0.59838486, 0.5925189 , 0.59135301, 0.58908241]),
 'fit_time': (11.305488109588623,
  10.813036680221558,
  10.716081142425537,
  10.682083368301392,
  10.703468084335327),
 'test_time': (6.086737394332886,
  6.0480334758758545,
  6.025444984436035,
  5.923325538635254,
  6.150327682495117)}
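
For reference, the two metrics reported above can be computed by hand from pairs of true and estimated ratings. The sketch below uses toy pairs invented for illustration, not values from this dataset.

```python
from math import sqrt

def rmse(pairs):
    """Root mean squared error over (true_rating, estimated_rating) pairs."""
    return sqrt(sum((true - est) ** 2 for true, est in pairs) / len(pairs))

def mae(pairs):
    """Mean absolute error over (true_rating, estimated_rating) pairs."""
    return sum(abs(true - est) for true, est in pairs) / len(pairs)

# Toy (true, estimated) pairs, invented for illustration
toy_pairs = [(5, 4), (3, 3), (4, 2), (1, 2)]
toy_rmse = rmse(toy_pairs)
toy_mae = mae(toy_pairs)
```

RMSE squares each error before averaging, so it penalizes large misses more heavily than MAE, which is why it is the larger of the two figures in the table.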

### Inference

Here is the meat and potatoes (har) of the whole thing: predicting ratings for the recipes a user has not rated yet.

In [16]:
def get_r(user_id):
    # Select which system to use. Due to memory constraints, item based is the only viable option
    recommender_system = item_based_recommender
    # N is the number of items to recommend
    N = 1000

    # Items the user has already rated, kept as a set for fast membership tests
    rated_items = set(ratings.loc[ratings['userid'] == user_id, 'itemid'])

    # All rated items that also appear in the recipes table
    all_item_ids = list(set(ratings.loc[ratings['itemid'].isin(recipes['itemid']), 'itemid']))

    # new_items holds every item not yet rated by the user
    new_items = [x for x in all_item_ids if x not in rated_items]

    # Estimate ratings for all unrated items
    predicted_ratings = {
        item_id: recommender_system.predict(user_id, item_id).est
        for item_id in new_items
    }

    # Get the item ids with the N highest predicted ratings
    recommended_ids = heapq.nlargest(N, predicted_ratings, key=predicted_ratings.get)

    # Look up each prediction by itemid rather than by row position,
    # so the result does not depend on the row order of recipes
    recommended_df = recipes.loc[recipes['itemid'].isin(recommended_ids)].copy()
    recommended_df.insert(1, 'pred_rating', recommended_df['itemid'].map(predicted_ratings))
    return recommended_df.sort_values('pred_rating', ascending=False)
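
The top-N selection inside `get_r` relies on `heapq.nlargest` with the dictionary's own `get` method as the ranking key. A small standalone sketch, with item ids and predicted ratings invented for illustration:

```python
import heapq

# Hypothetical predicted ratings keyed by item id
toy_predictions = {7000: 4.2, 7001: 4.9, 7002: 3.1, 7003: 4.5}

# Iterating a dict yields its keys; key= ranks those keys by predicted rating
top_ids = heapq.nlargest(2, toy_predictions, key=toy_predictions.get)
```

The result is a list of item ids ordered from highest to lowest predicted rating, which avoids sorting the full dictionary when only the best few entries are needed.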

In [17]:
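import re

def cap_str(item):
    # Capitalize the first character of a string
    return item.capitalize()

def reg_frame(r_list, items):
    # Build one lookahead per ingredient so a recipe matches only if it contains all of them.
    # re.escape keeps characters such as '(' or '+' from being read as regex syntax.
    pattern = ''.join(f'(?=.*{re.escape(i.strip())})' for i in items)
    # na=False treats recipes with missing ingredients as non-matches
    mask = r_list['ingredients'].str.contains(f'^{pattern}', regex=True, na=False)
    return r_list[mask]

def set_up_ml(user_id, ingredient_list):
    # Recommend recipes for the user, then keep only those with every requested ingredient
    recipe_list = get_r(user_id)
    items = ingredient_list.split(',')
    return reg_frame(recipe_list, items)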
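
The ingredient filter works by chaining one regex lookahead per ingredient, so a recipe matches only when every requested ingredient appears somewhere in its ingredient string, in any order. A minimal sketch with the standard `re` module follows; the helper name `contains_all` is hypothetical and the recipe string is taken from the sample output in this notebook.

```python
import re

def contains_all(ingredients, wanted):
    """Return True if every string in wanted occurs somewhere in ingredients."""
    # Each (?=.*x) lookahead checks for x without consuming any characters,
    # so the checks are independent of the order of the ingredients
    pattern = '^' + ''.join(f'(?=.*{re.escape(w.strip())})' for w in wanted)
    return re.search(pattern, ingredients) is not None

recipe = 'tofu,vegetable oil,onion,barbecue sauce,hamburger'
both = contains_all(recipe, ['onion', 'tofu'])      # both present, order ignored
missing = contains_all(recipe, ['tofu', 'basil'])   # basil is absent
```

Note that this is substring matching: asking for `oil` would also match `vegetable oil`.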

### Get a Recommendation Based on Ingredients

This is the final code, which will be implemented in a cleaner fashion through the browser interface.

In [23]:
ingredient_list = 'tofu'

In [24]:
# Pick three users at random; the ids in the comments are fixed alternatives
user_1 = random.choice(users)  # 123711: user with a medium number of ratings
user_2 = random.choice(users)  # 15078: user with the lowest number of ratings
user_3 = random.choice(users)  # 72315: user with a lot of ratings

In [25]:
recommended_list_1 = set_up_ml(user_1,ingredient_list) # generate a list of recommendations for each user
recommended_list_2 = set_up_ml(user_2,ingredient_list)
recommended_list_3 = set_up_ml(user_3,ingredient_list)

In [26]:
def get_final_recommendation(list_1, list_2, list_3):  # combine all recommendations
    recipe_recommendation = pd.concat([list_1, list_2, list_3])  # concat lists
    # sort by predicted rating so the best-rated copy of each recipe comes first
    recipe_recommendation = recipe_recommendation.sort_values('pred_rating', ascending=False)
    # the same recipe can appear once per user with a different prediction, so dedupe on itemid
    recipe_recommendation = recipe_recommendation.drop_duplicates(subset='itemid')
    # drop the helper columns and reset the index
    recipe_recommendation = recipe_recommendation.drop(columns=['pred_rating', 'itemid'])
    #recipe_recommendation['total time'] = recipe_recommendation['prep_time']+recipe_recommendation['cook_time']
    return recipe_recommendation.reset_index(drop=True)

In [27]:
final_recommendation = get_final_recommendation(recommended_list_1,recommended_list_2,recommended_list_3)
final_recommendation

Unnamed: 0,title,prep_time,cook_time,ready_time,ingredients,directions,url,photo_url
0,Barbecue Tofu Sandwiches Recipe,0,10,15,"tofu,vegetable oil,onion,barbecue sauce,hamburger",Drain the tofu between paper towels until most...,https://www.allrecipes.com/recipe/14047,https://images.media-allrecipes.com/userphotos...
1,Cucumber and Tomato Salad Recipe,0,0,15,"tomato,cucumber,onion,bean,tofu,basil,salad dr...","In a large bowl, combine the tomato, cucumber,...",https://www.allrecipes.com/recipe/14176,https://images.media-allrecipes.com/userphotos...


### Save the Model

In [28]:
import pickle
filename = 'model/recipEeze_recommender_model.sav'
# The context manager makes sure the file is flushed and closed after writing
with open(filename, 'wb') as f:
    pickle.dump(item_based_recommender, f)

In [17]:
with open(filename, 'rb') as f:
    rr_model = pickle.load(f)

In [18]:
cross_validate(rr_model, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Computing the cosine similarity matrix...
Done computing similarity matrix.
Evaluating RMSE, MAE of algorithm KNNWithMeans on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9071  0.9070  0.9061  0.9060  0.9066  0.9066  0.0004  
MAE (testset)     0.6266  0.6249  0.6253  0.6253  0.6251  0.6254  0.0006  
Fit time          22.56   24.01   21.76   22.46   25.57   23.27   1.36    
Test time         16.98   17.87   18.30   15.50   18.73   17.48   1.15    


{'test_rmse': array([0.9071256 , 0.90697745, 0.90610662, 0.90600469, 0.90660522]),
 'test_mae': array([0.62662526, 0.62488307, 0.62527877, 0.62525857, 0.62507525]),
 'fit_time': (22.558342695236206,
  24.009589433670044,
  21.762155532836914,
  22.45959758758545,
  25.570380926132202),
 'test_time': (16.97546100616455,
  17.86919665336609,
  18.300251483917236,
  15.498399496078491,
  18.733608961105347)}