# Text Mining Group Project


Alternative idea: make the model input a list of ingredients (based on what you have at home), and let the model generate a recipe from the ingredients you have available.

If we take that route, we can also write a function that generates random lists of ingredients and feed those to the ingredient-based recipe generator.
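
The random-ingredient idea could be sketched roughly as follows. The pantry list, the function names `random_ingredients` and `ingredients_to_prompt`, and the prompt format are all hypothetical placeholders and not part of the project yet:

```python
import random

# Hypothetical pantry; in the project this could be built from the recipe data instead.
PANTRY = ["tomatoes", "garlic", "spaghetti", "cheese", "basil",
          "onion", "chicken", "rice", "eggs", "butter"]

def random_ingredients(pantry, n, seed=None):
    """Return n distinct ingredients drawn at random from pantry."""
    rng = random.Random(seed)  # passing a seed makes the draw reproducible
    return rng.sample(pantry, n)

def ingredients_to_prompt(ingredients):
    """Format an ingredient list as a text prompt for the recipe model."""
    return "Ingredients: " + ", ".join(ingredients) + "."

picked = random_ingredients(PANTRY, 4, seed=0)
print(ingredients_to_prompt(picked))
```

The resulting prompt string could then be tokenized and passed to the model in the same way as the hand-written context sequence further down.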

### Imports

In [22]:
import torch
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Data Acquisition

In [6]:
import json
#Data source: https://eightportions.com/datasets/Recipes/#fn:1
#Scraped from Foodnetwork.com, Epicurious.com and Allrecipes.com by Ryan Lee

#json.load parses each file into a dictionary;
#the with-blocks close the files again after loading
with open('recipes_raw_nosource_ar.json') as allr:
    data1 = json.load(allr) #data1 contains the word ADVERTISEMENT a lot
with open('recipes_raw_nosource_epi.json') as epi:
    data2 = json.load(epi) #data2 looks good
with open('recipes_raw_nosource_fn.json') as food:
    data3 = json.load(food) #this one too


In [12]:
#Example of the nested dict structure of the .json files:
print(data3["p3pKOD6jIHEcjf20CCXohP8uqkG5dGi"]['title'])

Grammie Hamblet's Deviled Crab


# Data processing

With some help from [coursera](https://www.coursera.org/projects/generating-new-recipes-python)

Our recipes are stored as nested dicts: the outer dict maps each recipe code to a recipe, and each recipe is itself a dict with the keys `title`, `ingredients` and `instructions`. This needs to be processed into a more workable form.
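
To make the structure concrete, here is a small sketch using a made-up stand-in for one of the loaded files. The recipe codes and values are invented for illustration, and the empty entry reflects the assumption that some recipes in the real files are missing fields:

```python
# Toy stand-in for one loaded JSON file (made-up codes and values).
toy_data = {
    "abc123": {
        "title": "Toy Tomato Pasta",
        "ingredients": ["3 tomatoes ADVERTISEMENT", "1 clove garlic"],
        "instructions": "Boil the pasta.\nAdd the sauce.",
    },
    "def456": {},  # assumed example of an empty / corrupted entry
}

REQUIRED = ("title", "ingredients", "instructions")

for code, recipe in toy_data.items():
    # .get returns None instead of raising a KeyError for a missing field
    print(code, "->", recipe.get("title"))

# Keep only the recipes that have all three fields
complete = {code: recipe for code, recipe in toy_data.items()
            if all(key in recipe for key in REQUIRED)}
print(len(complete), "complete recipe(s)")
```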


We start by defining a list of all the recipe codes

In [20]:
#creating lists of keys
codes1 = list(data1.keys())
codes2 = list(data2.keys())
codes3 = list(data3.keys())

#and doing some data exploration
NumRecipes = len(codes1) + len(codes2) + len(codes3)
print('The first dataset contains', len(codes1), 'recipes')
print('in total, we have', NumRecipes, 'recipes')

The first dataset contains 39802 recipes
in total, we have 125164 recipes


#### Creating Pandas dataframe
The nested dict format is rather awkward to work with for the modeling tasks we want to perform. Therefore, we convert everything into a pandas dataframe:

In [97]:
#Create dataframe
Data = pd.DataFrame()

#initializing empty lists which we'll add to the dataframe
Title = []
Ingredients = []
Instructions = []

#a for-loop to put all the required data in the lists
datasets = [data1, data2, data3]

for data in datasets:
    for val in data.values():
        #Some entries are empty or corrupted, so we skip
        #any recipe that is missing one of the fields
        try:
            title = val['title']
            #And we remove the random ADVERTISEMENT clutter
            ingredients = [ingredient.replace('ADVERTISEMENT', '')
                           for ingredient in val['ingredients']]
            instructions = [str(val['instructions'])
                            .replace('ADVERTISEMENT', '').replace('\n', ' ')]
        except (KeyError, TypeError, AttributeError):
            continue
        #Append only after all three fields were read,
        #so the lists always stay aligned
        Title.append(title)
        Ingredients.append(ingredients)
        Instructions.append(instructions)

#Quick check to see if it worked
if len(Title) == len(Ingredients) == len(Instructions):
    print("All data has been added to the lists successfully!")

All data has been added to the lists successfully!


In [68]:
print("During this transformation,", NumRecipes - len(Title), "empty values have been removed")
print("We now have", len(Title), "recipes")

During this transformation, 517 empty values have been removed
We now have 124647 recipes


Earlier, we noticed that the first dataset contained many stray ADVERTISEMENT strings. This clutter was also removed in the loop above.

#### Adding data to dataframe
We can now add all of the data we just created and finish up the dataframe

In [103]:
Data['Title'] = Title
Data['Ingredients'] = Ingredients
Data['Instructions'] = Instructions
Data[0:4]

| | Title | Ingredients | Instructions |
|---|---|---|---|
| 0 | Slow Cooker Chicken and Dumplings | [4 skinless, boneless chicken breast halves , ... | [Place the chicken, butter, soup, and onion in... |
| 1 | Awesome Slow Cooker Pot Roast | [2 (10.75 ounce) cans condensed cream of mushr... | [In a slow cooker, mix cream of mushroom soup,... |
| 2 | Brown Sugar Meatloaf | [1/2 cup packed brown sugar , 1/2 cup ketchup ... | [Preheat oven to 350 degrees F (175 degrees C)... |
| 3 | Best Chocolate Chip Cookies | [1 cup butter, softened , 1 cup white sugar , ... | [Preheat oven to 350 degrees F (175 degrees C)... |


# Language Recognition

With some help from: [TowardsDataScience](https://towardsdatascience.com/text-generation-with-python-and-gpt-2-1fecbff1635b)

### Sequence
Here we define an arbitrary recipe string to initially try out our model with. This text will be replaced once the recipe data is hooked up, but it will do for initial development:

In [5]:
sequence = """Ingredients: 3 tomatoes, garlic, spaghetti, Spanish chili's, cheese. Boil the pasta, then rinse. Cut chili's and mince garlic, add to pan
Add tomatoes and the pasta, and grate a generous amount of cheese
Serve with a touch of basil and a glass of good red wine
           """




#All by Sergio

"""
seqdata = []
print(len(data3.keys()))
count = 0
for key, value in data3.items():
    #stop once we have the number of recipes we want
    if count >= 800:
        break

    #skip recipes without ingredients or without string instructions
    if "ingredients" not in value or not isinstance(value.get("instructions"), str):
        continue

    count += 1
    listingr = "ingredients: "
    for i in value["ingredients"]:
        listingr = listingr + i + " "
    seqdata.append(listingr + value["instructions"])

print(type(seqdata))
#print the third recipe
print(seqdata[2])

"""

"placeholder"

'placeholder'

In [6]:
#We use the pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

GPT-2 is described in: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners. 24.

In [7]:
#Tokenizing the inputs
#print(seqdata[1])
inputs = tokenizer.encode(sequence, return_tensors='pt')
#print(inputs)



#All by Sergio

"""
tokenizer.pad_token = tokenizer.eos_token
#truncation caps each sequence at the model's maximum length;
#padding has no effect here because each recipe is encoded on its own
#encoded_choices = [tokenizer.encode(s) for s in seqdata]
input_ids = []

for i in seqdata:
    input_ids.append(tokenizer.encode(i, return_tensors='pt', padding=True, truncation=True))

print(input_ids) #input_ids is now a list with one tensor per recipe

"""
'placeholder'

'placeholder'
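
The padding step in the commented-out cell is meant to give every recipe the same length so they can be stacked into one batch. A minimal sketch of that idea in plain Python, using made-up token ids: `pad_batch` is a hypothetical helper written for illustration, not a transformers function, and 50256 is the end-of-text id that the generation warning in this notebook reports.

```python
def pad_batch(sequences, pad_id, max_length=None):
    """Right-pad (and optionally truncate) lists of token ids to a common length.

    Returns the padded ids and an attention mask (1 = real token, 0 = padding).
    """
    if max_length is not None:
        sequences = [seq[:max_length] for seq in sequences]
    target = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask

EOS_ID = 50256  # GPT-2 end-of-text id, reused here as the padding id
toy_ids = [[11, 12, 13], [21], [31, 32, 33, 34, 35]]  # made-up token ids
ids, mask = pad_batch(toy_ids, pad_id=EOS_ID, max_length=4)
for row, m in zip(ids, mask):
    print(row, m)
```

With the real tokenizer, the same effect should be obtainable by passing the whole list of recipe strings to the tokenizer in a single call with `padding=True` and `truncation=True`, rather than encoding the recipes one at a time.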

# Language Generation

Here we generate a recipe from the encoded cooking text that we prepared above, using the `model.generate()` function.
`model.generate()` has many options worth exploring to improve the output. For instance, `temperature` must be positive, with values above 1 increasing randomness and values below 1 decreasing it; `no_repeat_ngram_size=2` prevents any 2-gram from appearing twice, which reduces repetition; and `top_k` limits sampling to the k most likely tokens. We'll have to experiment a bit to see what works best.
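
To get a feel for what the temperature and top-k settings do before tuning them on the real model, here is a small standalone sketch. It uses made-up scores for four candidate tokens; the helper functions are illustrative and are not the internals of `model.generate()`:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; temperature rescales the scores first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise; all others get probability 0."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for four candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print("temperature", t, [round(p, 3) for p in probs])

print("top-2:", top_k_filter(softmax_with_temperature(toy_logits), k=2))
```

A low temperature concentrates the probability on the top token (safer but more repetitive text), while a high temperature flattens the distribution (more varied but less coherent text).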

In [None]:
import pandas as pd
import numpy as np
# We cap the total length (context plus generated text) at 500 tokens
EncodedRecipe = model.generate(inputs, max_length=500, do_sample=True)



#All by Sergio
"""
#this only works when you pick one specific tensor from input_ids, not for all of them at once
EncodedRecipe2 = model.generate(input_ids[2], max_length=1024, do_sample=True)
print(EncodedRecipe2)
"""

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [None]:
#Decode the tokenized outputs
recipe = tokenizer.decode(EncodedRecipe[0], skip_special_tokens=True)
#The decoded text still starts with the context sequence we gave the model;
#the context is left out when the recipe is saved
print(len(recipe))
print(recipe) #the output is partly new text and partly repeated text

The repetition in this recipe is a good example of why we may need ways to prevent repetition during generation.
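
To compare generation settings on more than a gut feeling, repetition can be measured on the decoded text. The following is a rough sketch that works on whitespace-separated words (the model itself works on tokens, so this is only an approximation); both helper functions and the example sentence are made up for illustration:

```python
from collections import Counter

def repeated_ngrams(text, n=2):
    """Return the word n-grams that occur more than once in text, with their counts."""
    words = text.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    counts = Counter(ngrams)
    return {" ".join(gram): c for gram, c in counts.items() if c > 1}

def repetition_rate(text, n=2):
    """Fraction of n-grams that repeat an earlier n-gram (0.0 means no repetition)."""
    words = text.lower().split()
    total = len(words) - n + 1
    if total <= 0:
        return 0.0
    unique = len({tuple(words[i:i + n]) for i in range(total)})
    return 1 - unique / total

toy_text = "add the pasta and add the cheese then add the pasta"
print(repeated_ngrams(toy_text))
print(round(repetition_rate(toy_text), 2))
```

Running this on recipes generated with and without `no_repeat_ngram_size` would give a simple number to compare the two settings by.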

## Saving Recipes

In [None]:
#Defining a function to save generated recipes as .txt files:
def save(recipe, filename):
    #the with-block closes the file even if writing fails
    with open(filename, "w", encoding='utf8') as text_file:
        text_file.write(recipe)

In [None]:
#Saving the previously generated recipe, leaving out the context sequence
#(this assumes the decoded text starts with the context exactly as we wrote it)
save(recipe[len(sequence):], 'FirstRecipe.txt')

# Reinforcement learning
I had some inspiration for a potentially novel direction for this model: reinforcement learning. While the current model already does relatively well at producing sensible recipe texts, it still has no notion of 'taste', or of which ingredients work well together and which don't. It may therefore be desirable to give the model a reward score for its recipes, so that it learns which recipes are good and which are bad and can improve.
We could do this either by reading the recipes and rating how good we think they might be, or even by cooking some of the things the model suggests and seeing how well they turn out. While this would take time and make the coding harder, it could be really cool to have a recipe generation model trained and optimized on actual real-world cooking, and I can hardly imagine anyone has done something like that before. Who knows, maybe the model will end up becoming a really good cook.
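
As a first step toward this idea, the ratings would need to be collected and turned into a score per recipe. A minimal sketch, assuming a hypothetical 1-5 rating scale and made-up recipe ids; this only covers the bookkeeping and is not a reinforcement learning algorithm in itself:

```python
from statistics import mean

def average_rewards(ratings):
    """Average the ratings per recipe; ratings maps a recipe id to a list of 1-5 scores."""
    return {rid: mean(scores) for rid, scores in ratings.items() if scores}

def rank_recipes(ratings):
    """Return the rated recipe ids, best first."""
    rewards = average_rewards(ratings)
    return sorted(rewards, key=rewards.get, reverse=True)

# Made-up ratings from the three of us
toy_ratings = {
    "recipe_a": [4, 5, 4],
    "recipe_b": [1, 2],
    "recipe_c": [],  # not rated yet, so it gets no reward
}
rewards = average_rewards(toy_ratings)
print(rewards)
print(rank_recipes(toy_ratings))
```

One simple way to use such scores, short of full reinforcement learning, would be to fine-tune the model further on only the highest-rated generated recipes.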

Contributors: Emilia, Sergio, Tim