# Text Mining Group Project


Alternative idea: maybe make model input a list of ingredients (based on what you have at home), and then let the model generate a reicpe for you based on the ingredients you have available

if that is the route we take, we can alternatively make another function that generates random lists of ingredients, and then feed the random ingredients to the recipe maker based on ingredients

### Imports

In [1]:
import torch
import pandas as pd
from transformers import GPT2LMHeadModel, GPT2Tokenizer
from transformers import AutoTokenizer, AutoModelForCausalLM
import random

# Initial model:
When we started this project, we initially had a very different modeling approach from our final version. In order to showcase our thought process, we start this analysis off by creating our initial recipe generation model from the pretrained GPT-2 model.

with some help from [TowardsDataScience](https://towardsdatascience.com/text-generation-with-python-and-gpt-2-1fecbff1635b)

In [48]:
#Defining the model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')
#Defining a context sequence
sequence = 'Bake 5 eggs. Add them to the tomatoes. Boil the pasta.'
#encoding input
encoded = tokenizer.encode(sequence, return_tensors='pt')
#Letting the model generate text based on the context
output = model.generate(encoded, max_length=100, 
                        do_sample=True,
                        temperature = 7.0,
                        top_k=20)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [51]:
recipe = tokenizer.decode(output[0])
print(recipe)

Bake 5 eggs. Add them to the tomatoes. Boil the pasta. Put into a plastic plastic baking tric-ten (one-six). Cook till very paleo/dairy-sweet with milk fro thyst. Set aside for another 5-3 hours or until you add 2 tsp butter and a tablespoon salt and allow until the mixture begins a smooth dough, or add a tablespoon, 1 minute or more and you can just leave 1 to 3 tablespoons left on plate to set at


GPT2's development can be found at: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2018). Language Models are Unsupervised Multitask Learners. 24.

# Data Acquisition

In [2]:
import json
#https://eightportions.com/datasets/Recipes/#fn:1 where the data is from
#scraped from 
#Foodnetwork.com, Epicurious.com, Allrecipes.com by Ryan Lee

allr = open('recipes_raw_nosource_ar.json')
epi = open('recipes_raw_nosource_epi.json')
food = open('recipes_raw_nosource_fn.json')

data1 = json.load(allr) #data 1 has a lot of the word ADVERTISEMENT
data2 = json.load(epi) #data 2 looks good
data3 = json.load(food) #this too

#this load module loads the data into a dictionary 


In [3]:
#Example of the double dict structue of the .json files:
print(data3["p3pKOD6jIHEcjf20CCXohP8uqkG5dGi"]['title'])

Grammie Hamblet's Deviled Crab


# Data processing

With some help from [coursera](https://www.coursera.org/projects/generating-new-recipes-python)

Our current recipes are in a dict of list of dict form, where the first dict contains the recipe code as key and the recipe as value, and the recipe value itself is a list of dicts in the shape [{title: value}, {ingredients: value}, {instructions:value}]. This needs to be processed to be more workable


We start by defining a list of all the recipe codes

In [4]:
#creating lists of keys
codes1 = list(data1.keys())
codes2 = list(data2.keys())
codes3 = list(data3.keys())

#and doing some data exploration
NumRecipes = len(codes1) + len(codes2) + len(codes3)
print('The first dataset contains', len(codes1), 'recipes')
print('in total, we have', NumRecipes, 'recipes')

The first dataset contains 39802 recipes
in total, we have 125164 recipes


#### Creating Pandas dataframe
The dict of lists of dicts format is rather annoying to work with for the modeling tasks we want to perform on it. Therefore, we convert everything nicely into our own pandas dataframe:

In [5]:
#Create dataframe
Data = pd.DataFrame()

#initializing empty lists which we'll add to the dataframe
Title = []
Ingredients = []
Instructions = []

#a for-loop to put all the required data in the lists
datasets = [data1, data2, data3]

for data in datasets:
    for _, val in data.items():
        #We occasionally get keyerrors due to corrupted data
        #so a try-except is added
        try:
            Title.append(val['title'])
            Ingredients.append([
                #And we remove the random ADVERTISEMENT clutter
                ingredient.replace(
                'ADVERTISEMENT', '') for ingredient in val['ingredients']])
            Instructions.append([str(
                val['instructions']).replace('ADVERTISEMENT','').replace('\n', ' ')])                      
        except:
            continue

#Quick check to see if it worked
if len(Title) == len(Ingredients) and len(Title) == len(Instructions):
    print("All data has been added to the lists succesfully!")

All data has been added to the lists succesfully!


In [6]:
print("During this transformation,", NumRecipes - len(Title), "empty values have been removed")
print("We now have", len(Title), "recipes")

During this transformation, 517 empty values have been removed
We now have 124647 recipes


Earlier, we noticed that the first dataset contained a lot of random ADVERTISEMENT strings scattered around. This clutter has also been removed during the list comprehensions used above
#### Adding data to dataframe
We can now add all of the data we just created and finish up the dataframe

In [15]:
Data['Title'] = Title
Data['Ingredients'] = Ingredients
Data['Instructions'] = Instructions
Data[0:5]

Unnamed: 0,Title,Ingredients,Instructions
0,Slow Cooker Chicken and Dumplings,"[4 skinless, boneless chicken breast halves , ...","[Place the chicken, butter, soup, and onion in..."
1,Awesome Slow Cooker Pot Roast,[2 (10.75 ounce) cans condensed cream of mushr...,"[In a slow cooker, mix cream of mushroom soup,..."
2,Brown Sugar Meatloaf,"[1/2 cup packed brown sugar , 1/2 cup ketchup ...",[Preheat oven to 350 degrees F (175 degrees C)...
3,Best Chocolate Chip Cookies,"[1 cup butter, softened , 1 cup white sugar , ...",[Preheat oven to 350 degrees F (175 degrees C)...
4,Homemade Mac and Cheese Casserole,"[8 ounces whole wheat rotini pasta , 3 cups fr...",[Preheat oven to 350 degrees F. Line a 2-quart...


# Recipe Recognition
With some help from: [TowardsDataScience](https://towardsdatascience.com/text-generation-with-python-and-gpt-2-1fecbff1635b), [StackExchange](https://datascience.stackexchange.com/questions/66394/how-does-bert-and-gpt-2-encoding-deal-with-token-such-as-startoftext-s), and [Coursera](https://www.coursera.org/projects/generating-new-recipes-python) and [TowardsDataScience2](https://towardsdatascience.com/train-gpt-2-in-your-own-language-fc6ad4d60171)

Now that we have the data processed, we want to use the database's structure to train our GPT-2 model. Specifically, we want it to learn to recognize the following structure:

Ingredients: <br/> ingredient1 <br/> ingredient2 <br/> ingredient3 <br/><br/>
Instructions: <br/> some text explaining instructions

### Train/Test split
In order to train our model towards this structure, we first create a training and validation set. We take 10% of the data as a traing set, and 50% as test set. We use such a small percentage for training since otherwise the dataset is simply too large to for the model training, and this model already takes hours to train.

In [16]:
#Define GPT-2 special end of document token
special_token = ' <|endoftext|> '

#Create a dataframe column that combines ingredients and instructions
Data['combined'] = ' \n Ingredients: \n ' + Data.Ingredients.str.join(' \n ') + \
' \n Instructions: \n ' +Data.Instructions.str.join(' \n ') + special_token

In [17]:
length = len(Data['Title'])
train = Data[:int(0.1*length)].combined.values
test = Data[int(0.9*length):].combined.values

In [18]:
len(train)

12464

And save them:

In [19]:
#Write training and test data to a file
with open('TrainingData.txt','w', encoding='utf-8') as f:
    f.write('\n'.join(train))
with open('TestData.txt','w', encoding='utf-8') as f:
    f.write('\n'.join(test))

## Training the model
Now that we have the test and training datasets, we can train our model on this specific data. This task, however, is rather complicated: and cannot be performed solely in this notebook. Rather, we upload the TrainingData.txt to Google Colab. Then, we use the [run_lm_finetuning.py](https://github.com/alontalmor/pytorch-transformers/blob/master/examples/run_lm_finetuning.py) script from huggingface and define a bashscript run_experiments.sh to run it and to start training our model on the defined training data. run_lm_finetuning.py contains a number of errors, specifically with importing WarmupLinearSchedule from transformers. Therefore, we define our own WarmupLinearSchedule in the run_lm_finetuning.py, and use run_experiments.sh to train the custom model on our recently created training data. 

The trained model itself is unfortunately too large to upload to github, but the .sh and .py scripts used to train and create this custom model are attached with this assignment repository. 6 out of the 7 files of which our model consists do fit on github, and can be found in the /CustomModel directory. For the sake of inclusivity, our entire trained model is also available [here](https://1drv.ms/u/s!AlUeI82AcSLCo41KNfOUS5dTqT0tEQ?e=4Wyq21)

In [21]:
#using the models trained on our dataset:
tokenizer = AutoTokenizer.from_pretrained('CustomModel/')
model = AutoModelForCausalLM.from_pretrained('CustomModel/')

We now want to test whether our trained model recognizes the Ingredients: etc. Instructions: etc. format.
Therefore, we define a testsequence consisting of a list of ingredients from the test set

In [38]:
testsequence = test[100].split('Instructions')[0] + " \n Instructions:"

In [43]:
print(testsequence)

 
 Ingredients: 
 1/4 cup oil 
 6 medium onions, chopped 
 4 bell peppers, chopped 
 3 carrots, chopped 
 1 cup string beans, broken into pieces 
 3/4 cup peas 
 6 tomatoes, chopped 
 1/2 teaspoon black pepper 
 1 teaspoon dried thyme 
 4 cups medium grain brown rice, cooked (cold) 
 1/2 cup tomato paste 
  
 Instructions:


This has the perfect format for our model to recognize

In [40]:
encoded = tokenizer.encode(testsequence, add_special_tokens = False,
                           return_tensors='pt')

modelgenerated = model.generate(input_ids = encoded, max_length = 700,
                                temperature = 0.9,
                                top_k = 20,
                                do_sample = True,
                               repetition_penalty = 1.0)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [46]:
#For more information on hyperparameters:
model.generate?

## And Success!
The model recognizes the ingredient list format, and generates a recipe based on the available ingredients!

In [44]:
for sequence in modelgenerated:
    recipe = tokenizer.decode(sequence)
    print(recipe)

 
 Ingredients: 
 1/4 cup oil 
 6 medium onions, chopped 
 4 bell peppers, chopped 
 3 carrots, chopped 
 1 cup string beans, broken into pieces 
 3/4 cup peas 
 6 tomatoes, chopped 
 1/2 teaspoon black pepper 
 1 teaspoon dried thyme 
 4 cups medium grain brown rice, cooked (cold) 
 1/2 cup tomato paste 
  
 Instructions: 
 Mix together the oil, oil, salt, and pepper in a large skillet over medium heat. Cook potatoes, then stir in the carrots and tomatoes, then bring to boil. Add beans and cook in the preheated oven until golden brown, about 5 minutes. Add beans, peas and peas. Cook until rice is tender and tender, about 3 minutes. Add tomatoes and stir in thyme, then add beans, peas, peas, and peas. Serve cold. Heat oil in a Dutch oven and cook potatoes through until golden brown. Place in a medium saucepan. Cook potatoes and carrots in the saucepan until softened, about 4 minutes. Add beans and bring to a boil. Remove from the heat, remove from the heat, and simmer until beans are t

### Now, we want to create more recipes from our unused, test ingredient lists:

In [67]:
#Generate recipes based on ingredients list and add them to a string
AllRecipes = ''
for i in range(10):
    sequence = test[i].split('Instructions')[0] + " \n Instructions:"
    encoded = tokenizer.encode(testsequence, add_special_tokens = False,
                           return_tensors='pt')

    modelgenerated = model.generate(input_ids = encoded, max_length = 700,
                                temperature = 0.9,
                                top_k = 20,
                                do_sample = True,
                               repetition_penalty = 1.0)
    for recipe in modelgenerated:
        text = tokenizer.decode(recipe)
        AllRecipes = AllRecipes + text
    

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


## Saving Recipes
And finally, we save all of the created recipes in a .txt file

In [68]:
#Defining a function to save generated recipes as .txt files:
def save(recipe, filename):
    text_file = open(filename, "w", encoding = 'utf8')
    n = text_file.write(recipe)
    text_file.close()

In [69]:
save(AllRecipes, 'Recipes/AllRecipes.txt')

In [59]:
print(AllRecipes)

 
 Ingredients: 
 1/4 cup oil 
 6 medium onions, chopped 
 4 bell peppers, chopped 
 3 carrots, chopped 
 1 cup string beans, broken into pieces 
 3/4 cup peas 
 6 tomatoes, chopped 
 1/2 teaspoon black pepper 
 1 teaspoon dried thyme 
 4 cups medium grain brown rice, cooked (cold) 
 1/2 cup tomato paste 
  
 Instructions:   Cut 2 large onions, in half lengthwise and place them on a baking sheet.   Cover with a plastic baking sheet.   Let sit 30 minutes.   Set aside.   In a bowl, toss in the onion mixture.   Add the rice and cook gently.   Remove from heat, and then place rice on a lightly oiled baking sheet.   Bake at 425 degrees for 10 minutes, or until the top is golden brown.  Remove from heat, and then place on serving platter.
To serve, garnish the cooked rice with chopped parsley and chopped coriander.
You can serve this appetizer on it's own or with other rice.
Enjoy!
Print Author: Baked Rice Salad Print Recipe Prep Time 20 mins Cook Time 20 mins Total Time 24 mins Author: Bake

Contributors: Emilia, Sergio, Tim