# Analysis + Evaluation (Discriminator Network)

Here we train a neural network to distinguish ground-truth recipes from RNN-produced recipes, and ground-truth recipes from recipes produced by a fine-tuned GPT-2. The discriminator is an LSTM.

Accuracy of distinguishing truth from GPT-2? 0.765

Accuracy of distinguishing truth from RNN? 0.695

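Conclusion: I think something went wrong along the way, since the visibly garbled RNN recipes were harder to detect (0.695) than the GPT-2 recipes (0.765). I will try a different neural network instead.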

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
import os
import pathlib
import pandas as pd

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [None]:
CACHE_DIR = "./drive/Shared drives/CS 269: Recipe/tmp"
pathlib.Path(CACHE_DIR).mkdir(exist_ok=True)
dataset_path = os.path.join(CACHE_DIR, 'recipes.pkl')

In [None]:
!head -n 50 "./drive/Shared drives/CS 269: Recipe/tmp/text_recipes.txt"

<TITLE>
Slow Cooker Chicken and Dumplings
<INGREDIENTS>
• 4 skinless, boneless chicken breast halves
• 2 tablespoons butter
• 2 (10.75 ounce) cans condensed cream of chicken soup
• 1 onion, finely diced
• 2 (10 ounce) packages refrigerated biscuit dough, torn into pieces
<INSTRUCTIONS>
‣ Place the chicken, butter, soup, and onion in a slow cooker, and fill with enough water to cover.
‣ Cover, and cook for 5 to 6 hours on High. About 30 minutes before serving, place the torn biscuit dough in the slow cooker. Cook until the dough is no longer raw in the center.
<DONE>
<TITLE>
Awesome Slow Cooker Pot Roast
<INGREDIENTS>
• 2 (10.75 ounce) cans condensed cream of mushroom soup
• 1 (1 ounce) package dry onion soup mix
• 1 1/4 cups water
• 5 1/2 pounds pot roast
<INSTRUCTIONS>
‣ In a slow cooker, mix cream of mushroom soup, dry onion soup mix and water. Place pot roast in slow cooker and coat with soup mixture.
‣ Cook on High setting for 3 to 4 hours, or on Low setting for 8 to 9 hours.
<DONE

In [None]:
if not os.path.exists(dataset_path):
    raise SystemExit("Run preprocess_pickle.ipynb to generate data file before continuing")
else:
    recipes = pd.read_pickle(dataset_path)

# TODO: Remove subsetting for final training
recipes = recipes[:20000]

In [None]:
recipes

Unnamed: 0,title,ingredients,instructions
0,Slow Cooker Chicken and Dumplings,"• 4 skinless, boneless chicken breast halves\n...","‣ Place the chicken, butter, soup, and onion i..."
1,Awesome Slow Cooker Pot Roast,• 2 (10.75 ounce) cans condensed cream of mush...,"‣ In a slow cooker, mix cream of mushroom soup..."
2,Brown Sugar Meatloaf,• 1/2 cup packed brown sugar\n• 1/2 cup ketchu...,‣ Preheat oven to 350 degrees F (175 degrees C...
3,Best Chocolate Chip Cookies,"• 1 cup butter, softened\n• 1 cup white sugar\...",‣ Preheat oven to 350 degrees F (175 degrees C...
4,Homemade Mac and Cheese Casserole,• 8 ounces whole wheat rotini pasta\n• 3 cups ...,‣ Preheat oven to 350 degrees F. Line a 2-quar...
...,...,...,...
20161,Georgia's Tennessee Jam Cake,"• 1 cup butter, softened\n• 2 cups white sugar...",‣ Preheat the oven to 350 degrees F (175 degre...
20162,Poached Eggs and Asparagus,• 4 eggs\n• 1 cube chicken bouillon (optional)...,‣ Fill a saucepan half way full of water. Brin...
20163,Bistecca alla Fiorentina (Tuscan Porterhouse),"• 4 sprigs fresh rosemary, chopped\n• 1 (2 1/2...",‣ Press chopped rosemary onto both sides of po...
20164,Courtney's Three Tomato Pasta Sauce,• 1/2 pound bulk mild Italian sausage\n• 1/2 p...,‣ Cook mild and hot Italian sausage in a large...


In [None]:
def recipe_to_str(recipe):
    # Combine components of recipe into a string
    return f"{recipe.title}<ING>{recipe.ingredients}<INS>{recipe.instructions}"

recipe_strings = recipes.apply(recipe_to_str, axis=1)

In [None]:
recipe_strings[:10]

0     Slow Cooker Chicken and Dumplings<ING>• 4 skin...
1     Awesome Slow Cooker Pot Roast<ING>• 2 (10.75 o...
2     Brown Sugar Meatloaf<ING>• 1/2 cup packed brow...
3     Best Chocolate Chip Cookies<ING>• 1 cup butter...
4     Homemade Mac and Cheese Casserole<ING>• 8 ounc...
5     Banana Banana Bread<ING>• 2 cups all-purpose f...
7     Mom's Zucchini Bread<ING>• 3 cups all-purpose ...
8     The Best Rolled Sugar Cookies<ING>• 1 1/2 cups...
9     Singapore Chili Crabs<ING>• Sauce:\n• 1/2 cup ...
10    Downeast Maine Pumpkin Bread<ING>• 1 (15 ounce...
dtype: object

# Import GPT-2 recipes and RNN recipes

GPT-2 recipes

In [None]:
CACHE_DIR = "./drive/Shared drives/CS 269: Recipe/tmp"
#pathlib.Path(CACHE_DIR).mkdir(exist_ok=True)
gpt2_recipes_path = os.path.join(CACHE_DIR, 'gpt2_finetuned_output_recipes')

In [None]:
gpt2_recipe_strings = []

for i in range(500):
  print(f"> {i} out of 500")
  file_path = os.path.join(gpt2_recipes_path, f"gpt2_recipe_{i}.txt")
  # Read the whole file at once; the context manager closes the handle
  with open(file_path, 'r') as f:
    gpt2_recipe_strings.append(f.read())

> 0 out of 500
> 1 out of 500
> 2 out of 500
...

In [None]:
len(gpt2_recipe_strings)

500
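
The loading loop above is repeated almost verbatim for the RNN recipes, so the two could share one helper. A minimal sketch, assuming the files keep the `<prefix>_<i>.txt` naming seen above; `load_generated_recipes` is a hypothetical name and not part of the original notebook.

```python
from pathlib import Path


def load_generated_recipes(directory, prefix, count):
    """Read `<prefix>_<i>.txt` for i in range(count) and return the file contents."""
    directory = Path(directory)
    recipes = []
    for i in range(count):
        # read_text opens, reads, and closes the file in one call
        recipes.append((directory / f"{prefix}_{i}.txt").read_text(encoding="utf-8"))
    return recipes
```

With this helper, the GPT-2 cell would reduce to something like `load_generated_recipes(gpt2_recipes_path, "gpt2_recipe", 500)`.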

In [None]:
gpt2_recipe_strings_df = pd.DataFrame(gpt2_recipe_strings)

In [None]:
gpt2_recipe_strings_df

Unnamed: 0,0
0,Lo-Mein Cake III<ING>• 1 (18.25 ounce) package...
1,Gobble Bars II<ING>• 1 1/2 cups all-purpose fl...
2,Hasenpfeffer<ING>• 2 cups milk• 1 1/4 cups sug...
3,Bundles for the Rich and Famous<ING>• 1 cup ra...
4,Sweetbreads for the Rich and the Famous<ING>• ...
...,...
495,Creole Pork Chops<ING>• 1 1/2 cups dry brown r...
496,Beefy Pork Chops<ING>• 1 1/2 cups water• 4 tea...
497,Ponchartrain Cake III<ING>• 1 (18.25 ounce) pa...
498,Oatie Bars II<ING>• 1 1/2 cups all-purpose flo...


In [None]:
print(gpt2_recipe_strings[0])

Lo-Mein Cake III<ING>• 1 (18.25 ounce) package white cake mix• 3 cups milk• 1/2 cup butter, softened• 1/2 cup white sugar• 2 eggs• 1 1/2 cups all-purpose flour• 1 teaspoon baking powder• 1/2 teaspoon salt• 1/4 teaspoon baking soda• 1 cup chopped pecans• 2 cups confectioners' sugar• 2 tablespoons butter• 3 tablespoons milk• 1 teaspoon vanilla extract<INS>‣ Preheat oven to 350 degrees F (175 degrees C). Grease and flour a 9x13 inch pan.‣ In a large bowl, mix together cake mix, 3 cups milk and 1/2 cup butter. Add sugar, eggs, flour, baking powder, salt and baking soda and mix until smooth.‣ Divide batter evenly between prepared pan and bake for 30 minutes.‣ To Make Filling: In a small bowl combine confectioners sugar, 2 tablespoons butter and 2 tablespoons milk. Beat until smooth. In a small bowl combine confectioners' sugar, 2 tablespoons butter and 3 tablespoons milk. Beat until smooth. Spread over warm cake.



Char-level RNN recipes

In [None]:
#CACHE_DIR = "./drive/Shared drives/CS 269: Recipe/tmp"
#pathlib.Path(CACHE_DIR).mkdir(exist_ok=True)
rnn_recipes_path = os.path.join(CACHE_DIR, 'rnn_output_recipes')

rnn_recipe_strings = []

for i in range(500):
  print(f"> {i} out of 500")
  file_path = os.path.join(rnn_recipes_path, f"rnn_recipe_{i}.txt")
  # Read the whole file at once; the context manager closes the handle
  with open(file_path, 'r') as f:
    rnn_recipe_strings.append(f.read())

> 0 out of 500
> 1 out of 500
> 2 out of 500
...

In [None]:
rnn_recipe_strings

['Smoked-Bluefishing or Asparagus topping, or so other spaghetti sauce from the consistency butter, then slice remaining bread slices on top of the juice concentrate the diced processed in cold water and put them on the bias and toasted brined with a large pond cake from stems and garlic to form a full of your cooked through for steaky directions for dipping sauce.\n\n\U0001f963\n‣ Preheat oven to 350 degrees F (175 degrees C).\n‣ Spread 1 inch of olive oil over the top of the green chops.\n‣ Bake in the preheated oven 10 to 15 minutes in the preheated oven, or until crust is golden. Cool completely. Cut into 8 wedges. Serve warm leaves.',
 'Shortbread)\n\n🥑\n• 1 (10 ounce) package line down the center mix\n• 2 (16 ounce) cans cream of chocolate sugar and creamy peanut butter cups\n• 1 cup semisweet chocolate chips\n\n\U0001f963\n‣ In a large bowl, mix together flour, baking powder, baking soda, salt, oregano, sugar, baking powder, salt and baking powder. Add lemon juice and Cheddar ch

In [None]:
rnn_recipe_strings[7][20:30]

'\n\U0001f963\n‣ In a '

In [None]:
len(rnn_recipe_strings)

500

In [None]:
# Clean up RNN recipe strings: drop newlines and turn the 🥣 marker
# (U+1F963) into <INS>. The 🥑 marker is left unchanged.

cleaned_rnn_recipe_strings = [
    s.replace("\n", "").replace("\U0001f963", "<INS>")
    for s in rnn_recipe_strings
]
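
The cleaning step above converts 🥣 into `<INS>` but leaves 🥑 in place. Judging by the samples, 🥑 seems to precede the ingredient list the same way `<ING>` does in the ground-truth strings. Assuming that reading is right, a variant could map both markers so the RNN strings share the ground-truth layout. `normalize_rnn_recipe` and `MARKERS` are hypothetical names used only for this sketch.

```python
# Assumption: 🥑 (U+1F951) marks the ingredient list and 🥣 (U+1F963) marks
# the instructions, mirroring <ING> and <INS> in the ground-truth strings.
MARKERS = {"\U0001f951": "<ING>", "\U0001f963": "<INS>"}


def normalize_rnn_recipe(text):
    """Drop newlines and replace emoji section markers with text tags."""
    text = text.replace("\n", "")
    for emoji, tag in MARKERS.items():
        text = text.replace(emoji, tag)
    return text
```

Without a step like this, the discriminator may be able to separate the classes on markers alone rather than on recipe quality.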

In [None]:
len(cleaned_rnn_recipe_strings)
cleaned_rnn_recipe_strings

['Smoked-Bluefishing or Asparagus topping, or so other spaghetti sauce from the consistency butter, then slice remaining bread slices on top of the juice concentrate the diced processed in cold water and put them on the bias and toasted brined with a large pond cake from stems and garlic to form a full of your cooked through for steaky directions for dipping sauce.<INS>‣ Preheat oven to 350 degrees F (175 degrees C).‣ Spread 1 inch of olive oil over the top of the green chops.‣ Bake in the preheated oven 10 to 15 minutes in the preheated oven, or until crust is golden. Cool completely. Cut into 8 wedges. Serve warm leaves.',
 'Shortbread)🥑• 1 (10 ounce) package line down the center mix• 2 (16 ounce) cans cream of chocolate sugar and creamy peanut butter cups• 1 cup semisweet chocolate chips<INS>‣ In a large bowl, mix together flour, baking powder, baking soda, salt, oregano, sugar, baking powder, salt and baking powder. Add lemon juice and Cheddar cheese and mix well.‣ Fold the butter 

In [None]:
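# NOTE: this wraps the raw rnn_recipe_strings, not cleaned_rnn_recipe_strings,
# so the newlines and emoji markers are still present in the DataFrame below.
rnn_recipe_strings_df = pd.DataFrame(rnn_recipe_strings)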

In [None]:
rnn_recipe_strings_df

Unnamed: 0,0
0,"Smoked-Bluefishing or Asparagus topping, or so..."
1,Shortbread)\n\n🥑\n• 1 (10 ounce) package line ...
2,Cinnamon-Spiked BReastarthy Italian Cream Chee...
3,(Eggs) and dried cherries\n• 1 (18.25 ounce) p...
4,"Asparagus, Taper to Making Surfoin or mixed wi..."
...,...
495,"Sombrero's Ansiffer Grumes), divided\n• 5 tabl..."
496,"Snowpeas, and finely chopped peaches and juice..."
497,Carned Pecan Pie Irandar Jack cheese and fruit...
498,Toklas's Ice Cream Concentrity Companimutes® R...


Save these datasets

In [None]:
# dataset_path = os.path.join(CACHE_DIR, 'gpt2_finetuned_recipes.pkl')
# gpt2_recipe_strings_df.to_pickle(dataset_path) 

In [None]:
# dataset_path = os.path.join(CACHE_DIR, 'rnn_recipes.pkl')
# rnn_recipe_strings_df.to_pickle(dataset_path) 

# Training

Load in datasets

In [None]:
gpt2_recipe_strings_df = pd.read_pickle(os.path.join(CACHE_DIR, "gpt2_finetuned_recipes.pkl"))
rnn_recipe_strings_df = pd.read_pickle(os.path.join(CACHE_DIR, "rnn_recipes.pkl"))

Training-Test split

In [None]:
# recipe_strings['label'] = 0 # ground truth

recipe_strings_df = pd.DataFrame(recipe_strings)
# recipe_strings_df.rename(columns={"0" : "text"})
recipe_strings_df['text'] = recipe_strings_df[0]
recipe_strings_df['label'] = 0 # ground truth
gpt2_recipe_strings_df['text'] = gpt2_recipe_strings_df[0]
gpt2_recipe_strings_df['label'] = 1 # GPT-2 finetuned
rnn_recipe_strings_df['text'] = rnn_recipe_strings_df[0]
rnn_recipe_strings_df['label'] = 2 # RNN

In [None]:
recipe_strings_df['label']

0        0
1        0
2        0
3        0
4        0
        ..
20161    0
20162    0
20163    0
20164    0
20165    0
Name: label, Length: 20000, dtype: int64

In [None]:
gpt2_recipe_strings_df

Unnamed: 0,0,text,label
0,Lo-Mein Cake III<ING>• 1 (18.25 ounce) package...,Lo-Mein Cake III<ING>• 1 (18.25 ounce) package...,1
1,Gobble Bars II<ING>• 1 1/2 cups all-purpose fl...,Gobble Bars II<ING>• 1 1/2 cups all-purpose fl...,1
2,Hasenpfeffer<ING>• 2 cups milk• 1 1/4 cups sug...,Hasenpfeffer<ING>• 2 cups milk• 1 1/4 cups sug...,1
3,Bundles for the Rich and Famous<ING>• 1 cup ra...,Bundles for the Rich and Famous<ING>• 1 cup ra...,1
4,Sweetbreads for the Rich and the Famous<ING>• ...,Sweetbreads for the Rich and the Famous<ING>• ...,1
...,...,...,...
495,Creole Pork Chops<ING>• 1 1/2 cups dry brown r...,Creole Pork Chops<ING>• 1 1/2 cups dry brown r...,1
496,Beefy Pork Chops<ING>• 1 1/2 cups water• 4 tea...,Beefy Pork Chops<ING>• 1 1/2 cups water• 4 tea...,1
497,Ponchartrain Cake III<ING>• 1 (18.25 ounce) pa...,Ponchartrain Cake III<ING>• 1 (18.25 ounce) pa...,1
498,Oatie Bars II<ING>• 1 1/2 cups all-purpose flo...,Oatie Bars II<ING>• 1 1/2 cups all-purpose flo...,1


## Distinguishing GPT-2 from the ground truth

In [None]:
truth_vs_gpt2_df = pd.concat([recipe_strings_df.sample(n=500), gpt2_recipe_strings_df.sample(n=500)])

In [None]:
truth_vs_gpt2_df

Unnamed: 0,0,text,label
16710,"Fattoush<ING>• 3 pita rounds, torn into pieces...","Fattoush<ING>• 3 pita rounds, torn into pieces...",0
12527,Magic Pickle Dip<ING>• 1 (8 ounce) package sof...,Magic Pickle Dip<ING>• 1 (8 ounce) package sof...,0
18179,Mint Ice Cubes<ING>• 36 fresh mint leaves\n• 2...,Mint Ice Cubes<ING>• 36 fresh mint leaves\n• 2...,0
18076,Easy Creamy Chicken in a Sun-Dried Tomato Wine...,Easy Creamy Chicken in a Sun-Dried Tomato Wine...,0
5979,Personal Portobello Pizza<ING>• 1 large portob...,Personal Portobello Pizza<ING>• 1 large portob...,0
...,...,...,...
448,Andy's Pork Chops<ING>• 1 1/2 cups water• 4 te...,Andy's Pork Chops<ING>• 1 1/2 cups water• 4 te...,1
493,Rangoon Cake III<ING>• 1 (18.25 ounce) package...,Rangoon Cake III<ING>• 1 (18.25 ounce) package...,1
359,Island-Style Pork Chops<ING>• 1 1/2 cups water...,Island-Style Pork Chops<ING>• 1 1/2 cups water...,1
146,Italian-Style) Cake III<ING>• 1 (18.25 ounce) ...,Italian-Style) Cake III<ING>• 1 (18.25 ounce) ...,1


In [None]:
truth_vs_gpt2_df['nwords'] = truth_vs_gpt2_df['text'].apply(lambda x: len(x.split()))

truth_vs_gpt2_df = truth_vs_gpt2_df[truth_vs_gpt2_df['nwords']<350]

truth_vs_gpt2_df

Unnamed: 0,0,text,label,nwords
16710,"Fattoush<ING>• 3 pita rounds, torn into pieces...","Fattoush<ING>• 3 pita rounds, torn into pieces...",0,189
12527,Magic Pickle Dip<ING>• 1 (8 ounce) package sof...,Magic Pickle Dip<ING>• 1 (8 ounce) package sof...,0,46
18179,Mint Ice Cubes<ING>• 36 fresh mint leaves\n• 2...,Mint Ice Cubes<ING>• 36 fresh mint leaves\n• 2...,0,56
18076,Easy Creamy Chicken in a Sun-Dried Tomato Wine...,Easy Creamy Chicken in a Sun-Dried Tomato Wine...,0,273
5979,Personal Portobello Pizza<ING>• 1 large portob...,Personal Portobello Pizza<ING>• 1 large portob...,0,98
...,...,...,...,...
448,Andy's Pork Chops<ING>• 1 1/2 cups water• 4 te...,Andy's Pork Chops<ING>• 1 1/2 cups water• 4 te...,1,200
493,Rangoon Cake III<ING>• 1 (18.25 ounce) package...,Rangoon Cake III<ING>• 1 (18.25 ounce) package...,1,129
359,Island-Style Pork Chops<ING>• 1 1/2 cups water...,Island-Style Pork Chops<ING>• 1 1/2 cups water...,1,53
146,Italian-Style) Cake III<ING>• 1 (18.25 ounce) ...,Italian-Style) Cake III<ING>• 1 (18.25 ounce) ...,1,121


In [None]:
truth_vs_gpt2_df['nwords'].max()

340

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from keras.models import Model
from keras.layers import LSTM, Activation, Dense, Dropout, Input, Embedding
from keras.optimizers import RMSprop
from keras.preprocessing.text import Tokenizer
from keras.preprocessing import sequence
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping

In [None]:
# Reassign instead of dropping in place: the frame is a filtered slice,
# and an in-place drop on it triggers pandas' SettingWithCopyWarning
truth_vs_gpt2_df = truth_vs_gpt2_df.drop(columns=[0, 'nwords'])
truth_vs_gpt2_df


Unnamed: 0,text,label
16710,"Fattoush<ING>• 3 pita rounds, torn into pieces...",0
12527,Magic Pickle Dip<ING>• 1 (8 ounce) package sof...,0
18179,Mint Ice Cubes<ING>• 36 fresh mint leaves\n• 2...,0
18076,Easy Creamy Chicken in a Sun-Dried Tomato Wine...,0
5979,Personal Portobello Pizza<ING>• 1 large portob...,0
...,...,...
448,Andy's Pork Chops<ING>• 1 1/2 cups water• 4 te...,1
493,Rangoon Cake III<ING>• 1 (18.25 ounce) package...,1
359,Island-Style Pork Chops<ING>• 1 1/2 cups water...,1
146,Italian-Style) Cake III<ING>• 1 (18.25 ounce) ...,1


In [None]:
X = truth_vs_gpt2_df.text
Y = truth_vs_gpt2_df.label

In [None]:
le = LabelEncoder()
Y = le.fit_transform(Y)
Y = Y.reshape(-1, 1)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [None]:
max_words = 1000
max_len = 150
tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences, maxlen = max_len)
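
Since the `Tokenizer` above is built without a `filters` argument, it presumably uses the Keras defaults: lowercase the text, replace a fixed set of punctuation (plus tab and newline) with spaces, then split on spaces. The sketch below is a plain-Python stand-in for that behaviour, not Keras code; `word_sequence` is a hypothetical name and the filter string is the documented default as far as I know.

```python
# Assumed Keras default for Tokenizer(filters=...); note that "•" and "‣"
# are not in this set, while "<", ">" and "\n" are.
KERAS_DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def word_sequence(text):
    """Lowercase, turn filtered characters into spaces, split on spaces."""
    table = str.maketrans({c: " " for c in KERAS_DEFAULT_FILTERS})
    return [tok for tok in text.lower().translate(table).split(" ") if tok]


# Ground-truth style (newline before each bullet) vs. GPT-2 style (no newline)
print(word_sequence("cake mix\n• 3 cups milk"))
print(word_sequence("cake mix• 3 cups milk"))
```

If the defaults behave this way, the same words produce different tokens depending on whether a newline precedes the bullet (`mix`, `•` versus `mix•`), which would be a formatting cue rather than a content cue.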

In [None]:
sequences_matrix

array([[  0,   0,   0, ..., 199, 182, 280],
       [456, 163,  52, ..., 373, 366, 621],
       [ 26, 313,   1, ..., 209,  61, 566],
       ...,
       [  1,   5,  17, ...,  10,  66, 159],
       [  0,   0,   0, ..., 426,  13, 145],
       [ 67, 107,  31, ...,   9, 180, 184]], dtype=int32)
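
The leading zeros in the matrix above suggest `pad_sequences` is padding on the left, which matches its default `padding='pre'`; the default `truncating='pre'` likewise cuts from the left. With `max_len = 150` and recipes of up to roughly 350 words, that would mean long recipes lose their title and ingredients and only the tail of the instructions reaches the LSTM. A plain-Python sketch of that default behaviour (`pad_pre` is a hypothetical helper, not the Keras function):

```python
def pad_pre(tokens, max_len, value=0):
    """Mimic the pad_sequences defaults: pad on the left, truncate from the left."""
    tokens = list(tokens)[-max_len:]  # keep only the last max_len tokens
    return [value] * (max_len - len(tokens)) + tokens


print(pad_pre([1, 2, 3], 5))              # short input: zeros are added in front
print(pad_pre([1, 2, 3, 4, 5, 6, 7], 5))  # long input: the first tokens are dropped
```

Raising `max_len` or passing `truncating='post'` would be two ways to check whether the dropped prefix matters.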

In [None]:
def discriminator_RNN():
  inputs = Input(name='inputs', shape=[max_len])
  layer = Embedding(max_words, 100, input_length=max_len)(inputs)
  layer = LSTM(128)(layer)
  layer = Dense(512, name='FC1')(layer)
  layer = Activation('relu')(layer)
  layer = Dropout(0.5)(layer)
  layer = Dense(1, name='out_layer')(layer)
  layer = Activation('sigmoid')(layer)
  model = Model(inputs=inputs, outputs=layer)
  return model

In [None]:
model = discriminator_RNN()
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])

Model: "model_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 150)]             0         
_________________________________________________________________
embedding_7 (Embedding)      (None, 150, 100)          100000    
_________________________________________________________________
lstm_7 (LSTM)                (None, 128)               117248    
_________________________________________________________________
FC1 (Dense)                  (None, 512)               66048     
_________________________________________________________________
activation_14 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_7 (Dropout)          (None, 512)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 513 

In [None]:
model.fit(sequences_matrix, Y_train, batch_size=128, epochs=10,
          validation_split=0.2, callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])

Epoch 1/10
Epoch 2/10
Epoch 3/10


<tensorflow.python.keras.callbacks.History at 0x7f55de6df470>

In [None]:
test_sequences = tokenizer.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences, maxlen = max_len)
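
The `fit` call earlier stopped after 3 of the 10 requested epochs. One possible reason is that `EarlyStopping` is created without a `patience` argument, so it defaults to 0 and training halts at the first epoch whose validation loss fails to improve. The sketch below is a simplified model of patience-based stopping, not the Keras implementation; `stopping_epoch` is a hypothetical name and the loss values are made up for illustration.

```python
def stopping_epoch(val_losses, patience=0, min_delta=0.0):
    """Return the 1-based epoch at which patience-based early stopping halts."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss  # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return len(val_losses)  # never triggered: all epochs run


# Made-up validation losses with one noisy epoch (the third)
losses = [0.69, 0.66, 0.67, 0.60, 0.55]
print(stopping_epoch(losses))              # patience=0 stops at the noisy epoch
print(stopping_epoch(losses, patience=2))  # patience=2 rides through it
```

On a validation split of only about 160 examples, a single noisy epoch is quite plausible, so a small positive `patience` may be worth trying before changing the architecture.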

In [None]:
accuracy = model.evaluate(test_sequences_matrix, Y_test)
print('Test set\nLoss: {:0.3f}\n Accuracy {:0.3f}'.format(accuracy[0], accuracy[1]))

Test set
Loss: 0.585
 Accuracy 0.765
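
To put the test accuracy in context: the two classes are sampled 500 each, so chance is roughly 0.5, and the test split holds about 200 examples, so the estimate itself is noisy. A rough sketch of the binomial standard error, assuming `n_test = 200` (the exact count after the `nwords` filter is not shown) and using the two accuracies reported in this notebook; `accuracy_standard_error` is a hypothetical helper.

```python
import math


def accuracy_standard_error(accuracy, n_test):
    """Binomial standard error of an accuracy measured on n_test examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_test)


n_test = 200  # assumption: about 20% of the ~1000 sampled recipes
for name, acc in [("GPT-2", 0.765), ("RNN", 0.695)]:
    half_width = 1.96 * accuracy_standard_error(acc, n_test)
    print(f"{name}: {acc:.3f} +/- {half_width:.3f} (chance is about 0.5)")
```

Under these assumptions both results sit clearly above chance, but the gap between the two discriminators is close to the noise level of a single random split.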


## Distinguishing the RNN from the ground truth

In [None]:
truth_vs_rnn_df = pd.concat([recipe_strings_df.sample(n=500), rnn_recipe_strings_df.sample(n=500)])

truth_vs_rnn_df

Unnamed: 0,0,text,label
6895,Easy Peanut Butter Bars<ING>• cooking spray\n•...,Easy Peanut Butter Bars<ING>• cooking spray\n•...,0
15460,French Silk Chocolate Pie III<ING>• 2 cups but...,French Silk Chocolate Pie III<ING>• 2 cups but...,0
9783,Cookie Press Butter Cookies<ING>• 1 1/2 cups u...,Cookie Press Butter Cookies<ING>• 1 1/2 cups u...,0
7033,BBQ Feta and Hot Banana Pepper Turkey Burgers<...,BBQ Feta and Hot Banana Pepper Turkey Burgers<...,0
5782,Cranberry Nut Granola Bars<ING>• 2 cups quick-...,Cranberry Nut Granola Bars<ING>• 2 cups quick-...,0
...,...,...,...
114,Klastch Chicken Broiled for medium-high soup I...,Klastch Chicken Broiled for medium-high soup I...,2
451,Lemon-Infused Mushroom Rings\n\n🥑\n• 2 pounds ...,Lemon-Infused Mushroom Rings\n\n🥑\n• 2 pounds ...,2
345,"Jim's mashed potatoes, under covered with a po...","Jim's mashed potatoes, under covered with a po...",2
460,Pastelillos) and romaine leaf\n• 1 (10.75 ounc...,Pastelillos) and romaine leaf\n• 1 (10.75 ounc...,2


In [None]:
truth_vs_rnn_df['nwords'] = truth_vs_rnn_df['text'].apply(lambda x: len(x.split()))

truth_vs_rnn_df = truth_vs_rnn_df[truth_vs_rnn_df['nwords']<350]

truth_vs_rnn_df

Unnamed: 0,0,text,label,nwords
6895,Easy Peanut Butter Bars<ING>• cooking spray\n•...,Easy Peanut Butter Bars<ING>• cooking spray\n•...,0,136
15460,French Silk Chocolate Pie III<ING>• 2 cups but...,French Silk Chocolate Pie III<ING>• 2 cups but...,0,118
9783,Cookie Press Butter Cookies<ING>• 1 1/2 cups u...,Cookie Press Butter Cookies<ING>• 1 1/2 cups u...,0,137
7033,BBQ Feta and Hot Banana Pepper Turkey Burgers<...,BBQ Feta and Hot Banana Pepper Turkey Burgers<...,0,84
5782,Cranberry Nut Granola Bars<ING>• 2 cups quick-...,Cranberry Nut Granola Bars<ING>• 2 cups quick-...,0,184
...,...,...,...,...
114,Klastch Chicken Broiled for medium-high soup I...,Klastch Chicken Broiled for medium-high soup I...,2,154
451,Lemon-Infused Mushroom Rings\n\n🥑\n• 2 pounds ...,Lemon-Infused Mushroom Rings\n\n🥑\n• 2 pounds ...,2,164
345,"Jim's mashed potatoes, under covered with a po...","Jim's mashed potatoes, under covered with a po...",2,185
460,Pastelillos) and romaine leaf\n• 1 (10.75 ounc...,Pastelillos) and romaine leaf\n• 1 (10.75 ounc...,2,185


In [None]:
truth_vs_rnn_df['nwords'].max()

349

In [None]:
# Reassign instead of dropping in place: the frame is a filtered slice,
# and an in-place drop on it triggers pandas' SettingWithCopyWarning
truth_vs_rnn_df = truth_vs_rnn_df.drop(columns=[0, 'nwords'])
truth_vs_rnn_df


Unnamed: 0,text,label
6895,Easy Peanut Butter Bars<ING>• cooking spray\n•...,0
15460,French Silk Chocolate Pie III<ING>• 2 cups but...,0
9783,Cookie Press Butter Cookies<ING>• 1 1/2 cups u...,0
7033,BBQ Feta and Hot Banana Pepper Turkey Burgers<...,0
5782,Cranberry Nut Granola Bars<ING>• 2 cups quick-...,0
...,...,...
114,Klastch Chicken Broiled for medium-high soup I...,2
451,Lemon-Infused Mushroom Rings\n\n🥑\n• 2 pounds ...,2
345,"Jim's mashed potatoes, under covered with a po...",2
460,Pastelillos) and romaine leaf\n• 1 (10.75 ounc...,2


In [None]:
X = truth_vs_rnn_df.text
Y = truth_vs_rnn_df.label

In [None]:
le = LabelEncoder()
Y = le.fit_transform(Y)
Y = Y.reshape(-1, 1)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)

In [None]:
max_words = 1000
max_len = 150
tokenizer = Tokenizer(num_words = max_words)
tokenizer.fit_on_texts(X_train)
sequences = tokenizer.texts_to_sequences(X_train)
sequences_matrix = sequence.pad_sequences(sequences, maxlen = max_len)

In [None]:
sequences_matrix

array([[  3,   4, 705, ..., 164,   3, 164],
       [ 24,  98, 189, ...,  30, 138, 146],
       [  3,  87,   9, ..., 476, 512, 684],
       ...,
       [  0,   0,   0, ..., 658, 165,  55],
       [  0,   0,   0, ..., 377, 631, 176],
       [100,   1,   2, ..., 214,  11, 112]], dtype=int32)

In [None]:
model = discriminator_RNN()
model.summary()
model.compile(loss='binary_crossentropy', optimizer=RMSprop(), metrics=['accuracy'])

Model: "model_9"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
inputs (InputLayer)          [(None, 150)]             0         
_________________________________________________________________
embedding_9 (Embedding)      (None, 150, 100)          100000    
_________________________________________________________________
lstm_9 (LSTM)                (None, 128)               117248    
_________________________________________________________________
FC1 (Dense)                  (None, 512)               66048     
_________________________________________________________________
activation_18 (Activation)   (None, 512)               0         
_________________________________________________________________
dropout_9 (Dropout)          (None, 512)               0         
_________________________________________________________________
out_layer (Dense)            (None, 1)                 513 

In [None]:
model.fit(sequences_matrix, Y_train, batch_size=128, epochs=10,
          validation_split=0.2, callbacks=[EarlyStopping(monitor='val_loss', min_delta=0.0001)])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10


<tensorflow.python.keras.callbacks.History at 0x7f55aed87400>

In [None]:
test_sequences = tokenizer.texts_to_sequences(X_test)
test_sequences_matrix = sequence.pad_sequences(test_sequences, maxlen = max_len)

In [None]:
accuracy = model.evaluate(test_sequences_matrix, Y_test)
print('Test set\nLoss: {:0.3f}\n Accuracy {:0.3f}'.format(accuracy[0], accuracy[1]))

Test set
Loss: 0.613
 Accuracy 0.695


# Leftover DistilBERT stuff (ignore)

In [None]:
import transformers as ppb  # assumed alias: ppb is never imported in this notebook

model_class, tokenizer_class, pretrained_weights = ppb.DistilBertModel, ppb.DistilBertTokenizer, 'distilbert-base-uncased'

# Load pretrained model/tokenizer
tokenizer = tokenizer_class.from_pretrained(pretrained_weights)
model = model_class.from_pretrained(pretrained_weights)

In [None]:
tokenized = truth_vs_rnn_df['text'].apply((lambda x: tokenizer.encode(x, add_special_tokens=True)))

tokenized

15677    [101, 4086, 1011, 1050, 1011, 14768, 20963, 22...
15670    [101, 2665, 14068, 1998, 2004, 28689, 12349, 1...
858      [101, 4086, 1998, 14768, 15960, 3523, 1026, 13...
655      [101, 7975, 1998, 20377, 28168, 16220, 10624, ...
18224    [101, 12183, 3527, 1026, 13749, 1028, 1528, 10...
                               ...                        
51       [101, 2175, 18581, 25650, 2721, 1024, 6366, 24...
30       [101, 1040, 1005, 6253, 5044, 8808, 2452, 1795...
238      [101, 25935, 1011, 21229, 11345, 100, 1528, 10...
432      [101, 16510, 8091, 1005, 1055, 6904, 19570, 20...
32       [101, 2531, 1003, 7427, 4487, 16643, 11001, 23...
Name: text, Length: 197, dtype: object

In [None]:
max_len = max([len(i) for i in tokenized.values])

padded = np.array([i + [0]*(max_len - len(i)) for i in tokenized.values])

attention_mask = np.where(padded != 0, 1, 0)
attention_mask.shape

(197, 469)

In [None]:
input_ids = torch.tensor(padded)
attention_mask = torch.tensor(attention_mask)

with torch.no_grad():
  last_hidden_states = model(input_ids, attention_mask=attention_mask)

In [None]:
features = last_hidden_states[0][:,0,:].numpy()

In [None]:
features.shape

(197, 768)

In [None]:
labels = truth_vs_rnn_df['label']
labels

15677    0
15670    0
858      0
655      0
18224    0
        ..
51       2
30       2
238      2
432      2
32       2
Name: label, Length: 197, dtype: int64

In [None]:
train_features, test_features, train_labels, test_labels = train_test_split(features, labels)

In [None]:
class FFNet(nn.Module):
  def __init__(self, input_dim=768, hidden_dim=1024, output_dim=1, dropout=0.8):
    super(FFNet, self).__init__()
    self.fc1 = nn.Sequential(
        nn.Linear(input_dim, hidden_dim),
        nn.Dropout(dropout),
        nn.LeakyReLU(),
        nn.BatchNorm1d(hidden_dim),
    )
    self.fc3 = nn.Sequential(
        nn.Linear(hidden_dim, output_dim),
        nn.Sigmoid(),
    )

  def forward(self, x):
    x = self.fc1(x)
    x = self.fc3(x)
    return x

ffnet = FFNet()

In [None]:
criterion = nn.BCELoss()
optimizer = optim.SGD(ffnet.parameters(), lr=0.001, momentum=0.9)

In [None]:
class Dataset(torch.utils.data.Dataset):
  def __init__(self, features, labels):
    self.features = features
    self.labels = labels
  def __len__(self):
    return self.features.shape[0]
  def __getitem__(self, index):
    X = self.features[index,:]
    y = self.labels[index]
    return X, y

train_dataset = Dataset(train_features, train_labels.to_numpy())

In [None]:
trainloader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True
)

In [None]:
NUM_EPOCHS = 1000

for epoch in range(NUM_EPOCHS):
  running_loss = 0.0
  for data, labels in trainloader:
    optimizer.zero_grad()
    outputs = ffnet(data)
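    # NOTE: labels here are 0 and 2, but binary cross-entropy expects targets
    # in [0, 1]; a target of 2 can push the loss below zero.
    loss = criterion(outputs, labels.float().unsqueeze(1))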
    loss.backward()
    optimizer.step()
    #print(outputs.view(1, -1))
    #print(labels.view(1, -1))

    running_loss += loss.item()
  if epoch % 50 == 0:
    print('Epoch {}, loss: {}'.format(epoch, running_loss))

print('Finished training')

Epoch 0, loss: 11.31788244843483
Epoch 50, loss: -194.97640949487686
Epoch 100, loss: -242.69247835874557
Epoch 150, loss: -200.0297458767891
Epoch 200, loss: -229.0006217956543
Epoch 250, loss: -311.5216683149338
Epoch 300, loss: -461.4820215702057
Epoch 350, loss: -306.5561623573303
Epoch 400, loss: -310.3633278235793
Epoch 450, loss: -332.66350173950195
Epoch 500, loss: -300.150194644928
Epoch 550, loss: -188.09028300642967
Epoch 600, loss: -300.6163331270218
Epoch 650, loss: -436.1982421875
Epoch 700, loss: -252.1306470632553
Epoch 750, loss: -274.73557567596436
Epoch 800, loss: -217.45036166906357
Epoch 850, loss: -244.4103483557701
Epoch 900, loss: -227.3741238117218
Epoch 950, loss: -151.6368461647071
Finished training
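
The negative losses above should not occur with binary cross-entropy on targets in {0, 1}. A likely cause is that the labels passed to `BCELoss` are 0 and 2 (ground truth and RNN), not 0 and 1. The plain-Python sketch below evaluates the BCE formula by hand to show the effect and a possible remapping; `bce` and `to_binary` are hypothetical helpers, not PyTorch functions.

```python
import math


def bce(p, y):
    """Binary cross-entropy for one prediction p in (0, 1) and target y."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def to_binary(label, positive=2):
    """Map the notebook's class ids (0 = truth, 2 = RNN) onto {0, 1}."""
    return 1 if label == positive else 0


p = 0.9
print(bce(p, 2))             # target 2: the loss is negative
print(bce(p, to_binary(2)))  # target remapped to 1: small positive loss
```

With a target of 2 the loss keeps falling as the prediction approaches 1, so the network is rewarded for predicting the positive class everywhere, which fits the almost-all-2 predictions printed at the end of this section.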


In [None]:
test_dataset = Dataset(test_features, test_labels.to_numpy())

testloader = torch.utils.data.DataLoader(
    test_dataset,
    batch_size=64,
    shuffle=True
)

In [None]:
from sklearn.metrics import accuracy_score, f1_score  # used in the prints below

y_pred = []
y = []
#total = 0
#correct = 0
with torch.no_grad():
  for data, labels in testloader:
    outputs = ffnet(data)
    predicted = torch.LongTensor(np.where(outputs > 0.5, 2, 0)).view(-1)
    y_pred.extend(predicted.tolist())
    y.extend(labels.tolist())
    #total += labels.size(0)
    #correct += (predicted == labels).sum().item()

print(y_pred)
print(y)

print('Accuracy of the FFNet trained on BERT sentence embeddings\non the test sentences: %0.3f %%' % accuracy_score(np.array(y), np.array(y_pred)))
print('F1-score of the FFNet trained on BERT sentence embeddings\non the test sentences: %0.3f %%' % f1_score(np.array(y), np.array(y_pred), average='micro'))

[2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 0, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2]
[2, 2, 2, 0, 0, 2, 0, 0, 2, 2, 0, 0, 2, 2, 0, 2, 0, 0, 0, 2, 2, 2, 0, 2, 0, 2, 0, 0, 2, 2, 2, 0, 0, 2, 0, 0, 2, 0, 2, 0, 2, 2, 0, 2, 0, 0, 0, 0, 2, 0]
Accuracy of the FFNet trained on BERT sentence embeddings
on the test sentences: 0.600 %
F1-score of the FFNet trained on BERT sentence embeddings
on the test sentences: 0.600 %
