# Text Mining Group Project


### Imports

In [1]:
#Don't forget to run 'pip install transformers torch' in the conda prompt
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Data Acquisition

In [15]:
import json
#Data source: https://eightportions.com/datasets/Recipes/#fn:1
#The recipes were scraped from Foodnetwork.com, Epicurious.com and Allrecipes.com by the dataset's author

#Open each file in a with-block so it is closed again after loading
with open('recipes_raw_nosource_ar.json') as allr:
    data1 = json.load(allr) #data 1 has a lot of the word ADVERTISEMENT
with open('recipes_raw_nosource_epi.json') as epi:
    data2 = json.load(epi) #data 2 looks good
with open('recipes_raw_nosource_fn.json') as food:
    data3 = json.load(food) #this too

#json.load reads each file into a dictionary

#for key, value in data3.items():
#   print(key,value)
#example of one recipe looked up by its key; the value is itself a dictionary
print(data3["p3pKOD6jIHEcjf20CCXohP8uqkG5dGi"])

#example of only instructions
print(data3["p3pKOD6jIHEcjf20CCXohP8uqkG5dGi"]['instructions'])

{'instructions': 'Toss ingredients lightly and spoon into a buttered baking dish. Top with additional crushed cracker crumbs, and brush with melted butter. Bake in a preheated at 350 degrees oven for 25 to 30 minutes or until delicately browned.', 'ingredients': ['1/2 cup celery, finely chopped', '1 small green pepper finely chopped', '1/2 cup finely sliced green onions', '1/4 cup chopped parsley', '1 pound crabmeat', '1 1/4 cups coarsely crushed cracker crumbs', '1/2 teaspoon salt', '3/4 teaspoons dry mustard', 'Dash hot sauce', '1/4 cup heavy cream', '1/2 cup melted butter'], 'title': "Grammie Hamblet's Deviled Crab", 'picture_link': None}
Toss ingredients lightly and spoon into a buttered baking dish. Top with additional crushed cracker crumbs, and brush with melted butter. Bake in a preheated at 350 degrees oven for 25 to 30 minutes or until delicately browned.


# Data Organization

For this part, we need to get all of the recipes into the same format, so that the model learns to recognize that format and starts to use it in its own output.

In [3]:
pass
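One possible shape for this step, as a minimal sketch: the hypothetical helper `format_recipe` below (not part of the notebook yet) flattens one recipe dictionary into a fixed title / ingredients / instructions layout and strips the stray word ADVERTISEMENT that we noticed in the Allrecipes data. The field labels and the layout are assumptions that we can still change, and the example recipe is made up.

```python
def format_recipe(record):
    """Flatten one recipe dictionary into a single, uniformly formatted string."""
    title = (record.get('title') or '').strip()
    # Remove the stray ADVERTISEMENT marker and drop entries that end up empty
    ingredients = [
        item.replace('ADVERTISEMENT', '').strip()
        for item in (record.get('ingredients') or [])
    ]
    ingredients = [item for item in ingredients if item]
    instructions = (record.get('instructions') or '').replace('ADVERTISEMENT', '').strip()
    return (
        f"Title: {title}\n"
        f"Ingredients: {', '.join(ingredients)}\n"
        f"Instructions: {instructions}"
    )

# Made-up record with the same keys as the recipes in the dataset
example = {
    'title': 'Simple Toast',
    'ingredients': ['2 slices bread ADVERTISEMENT', '1 tablespoon butter ADVERTISEMENT', 'ADVERTISEMENT'],
    'instructions': 'Toast the bread. Spread with butter.',
    'picture_link': None,
}
formatted = format_recipe(example)
print(formatted)
```

Applying the same function to every value in `data1`, `data2` and `data3` would give one consistent text format across all three sources.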

# Language Recognition

With some help from: [TowardsDataScience](https://towardsdatascience.com/text-generation-with-python-and-gpt-2-1fecbff1635b)

### Sequence
Here we define an arbitrary recipe string to use as the initial prompt (context) for the model. We will replace this text once we have organized the data, but it will do for initial development:

In [21]:
sequence = """Ingredients: 3 tomatoes, garlic, spaghetti, Spanish chili's, cheese. Boil the pasta, then rinse. Cut chili's and mince garlic, add to pan
Add tomatoes and the pasta, and grate a generous amount of cheese
Serve with a touch of basil and a glass of good red wine
           """

In [12]:
#We use the pretrained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

GPT-2 is described in: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised Multitask Learners.

In [13]:
#Tokenizing the inputs
inputs = tokenizer.encode(sequence, return_tensors='pt')

# Language Generation

Here we generate a recipe that continues the encoded prompt we prepared above. We use the model.generate() function to do this.
model.generate() has many parameters worth looking at, which we can use to improve the output. For instance, temperature (a positive number) can be raised above 1 for more randomness or lowered toward 0 for less, no_repeat_ngram_size=2 can be set to prevent repetition, and top_k can be tweaked to limit sampling to the most likely tokens. We'll just have to play around a bit and see what works best.
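To get a feel for what temperature and top_k do before tuning them, here is a toy sketch that is independent of the transformers library. The function `next_token_probs` and the example scores are made up for illustration; it only mimics the general idea of scaling the scores by the temperature, keeping the k highest ones, and normalizing with a softmax.

```python
import math

def next_token_probs(logits, temperature=1.0, top_k=None):
    """Turn raw token scores into sampling probabilities (toy version)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Dividing by a temperature below 1 sharpens the distribution, above 1 flattens it
    scaled = [score / temperature for score in logits]
    if top_k is not None:
        k = max(1, min(top_k, len(scaled)))
        # Scores below the k-th highest are masked out (ties at the cutoff are kept)
        cutoff = sorted(scaled, reverse=True)[k - 1]
        scaled = [s if s >= cutoff else float('-inf') for s in scaled]
    # Numerically stable softmax; masked entries end up with probability 0
    biggest = max(scaled)
    exps = [math.exp(s - biggest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
print(next_token_probs(scores, temperature=0.5))
print(next_token_probs(scores, temperature=2.0))
print(next_token_probs(scores, top_k=2))
```

With the low temperature most of the probability goes to the highest-scoring token, while the high temperature spreads it out more evenly; with top_k=2 the lowest-scoring token can never be sampled.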

In [49]:
# We cap the total length (prompt plus generated text) at 500 tokens
EncodedRecipe = model.generate(inputs, max_length=500, do_sample=True)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [50]:
#Decode the tokenized outputs
recipe = tokenizer.decode(EncodedRecipe[0], skip_special_tokens=True)
#We filter out the initial context sequence given to the model (the first 259 characters, matching the slice below)
print(recipe[259:])


           
This is not my first attempt at creating a tomato-based dip. To get it right, I went with a little piece of pasta that had a mixture of tomato's and garlic, so I went on with a very simple and easy version. The other idea is to add salt to the pasta to help keep the salt down.
You could try the pasta without the salt, but you'll likely just try it on its own. It's pretty easy and inexpensive, especially since it's made from scratch.
Coriander on the side: A couple of minutes of fresh coriander leaves to set on top of the base.
A slightly salty side of this sauce would be better.
Toppings [ edit ]
2 cups tomatoes, rinsed and cut into pieces.
Sea salt to taste.
1 cup spaghetti.
Spicy dressing: pickled onions, chopped scallions, crushed red peppers, crushed black olives (or other meat you prefer) to taste.
Method and Preparation [ edit ]
Drain out and put in the freezer for about 2 to 4 hours. Once it's done, add a dollop of sauce and mix it down. The pasta will start to go m

The repetition in this recipe is a good example of why we may need to discourage repetition during generation, for instance with no_repeat_ngram_size.

## Saving Recipes

In [51]:
#Defining a function to save generated recipes as .txt files:
def save(recipe, filename):
    #The with-block closes the file even if writing fails
    with open(filename, "w", encoding='utf8') as text_file:
        text_file.write(recipe)

In [52]:
#Saving the previously generated recipe:
save(recipe[259:], 'FirstRecipe.txt')
#Note: I am leaving out the 259 context characters
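The hard-coded offset of 259 characters only fits this exact prompt, so it will break as soon as we change the sequence. A minimal sketch of a more flexible alternative, assuming the decoded text starts with the prompt verbatim: the hypothetical helper `strip_prompt` below (not used in the notebook so far) cuts off whatever prompt it is given. The example strings are made up.

```python
def strip_prompt(generated_text, prompt):
    """Return only the newly generated part of the decoded text."""
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):]
    # If decoding changed the prompt slightly, keep the full text rather than cut blindly
    return generated_text

# Made-up strings standing in for the prompt and the decoded model output
prompt = "Ingredients: 3 tomatoes, garlic. "
decoded = prompt + "Boil the pasta and serve."
new_text = strip_prompt(decoded, prompt)
print(new_text)
```

In the notebook this would be called as `strip_prompt(recipe, sequence)` in place of `recipe[259:]`.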

# Reinforcement learning
I had some inspiration for a potentially novel direction to take this model in: reinforcement learning. While the current model already performs relatively well at creating sensible recipe texts, it still has absolutely no idea of 'taste', or of which ingredients work well together and which don't. Therefore, it may be desirable to give the model some sort of reward score for its recipes, so that it knows which recipes are good and which are bad and can improve.
We could do this either by reading the recipes and rating them on how good we think they might be, or maybe even by cooking some of the things our model suggests and seeing how well they work. While this may take some time and make the coding harder, it could be really cool to have a recipe generation model trained and optimized on actual real-world cooking, and I can hardly imagine that anyone has done something like that before. Who knows, maybe the model will actually end up becoming a really good cook.
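As a first, very rough sketch of what such a score could look like, the hypothetical helper `rating_to_reward` below averages our taste ratings for one recipe and rescales them to a reward between -1 and 1. The 1 to 5 rating scale, the function name and the example ratings are all assumptions; how the reward would actually be fed back into GPT-2 is still an open question.

```python
def rating_to_reward(ratings, low=1, high=5):
    """Map a list of taste ratings (from low to high) to a reward in [-1, 1]."""
    if not ratings:
        raise ValueError("need at least one rating")
    mean = sum(ratings) / len(ratings)
    midpoint = (low + high) / 2
    half_range = (high - low) / 2
    # A mean rating at the midpoint gives 0, the best rating 1, the worst -1
    return (mean - midpoint) / half_range

# Made-up ratings from three tasters for one generated recipe
ratings_by_recipe = {'FirstRecipe.txt': [2, 3, 1]}
rewards = {name: rating_to_reward(r) for name, r in ratings_by_recipe.items()}
print(rewards)
```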

Contributors: Emilia, Sergio, Tim