# MSDS 534: Statistical Learning - Homework 2

### NAME:  Phan Nguyen Huong Le

### NET ID:  hpl14

In this homework, we predict whether a recipe is vegetarian or not vegetarian, given the description and ingredients. The data is from https://www.kaggle.com/datasets/shuyangli94/foodcom-recipes-with-search-terms-and-tags. The data has been processed to retain mostly dinner recipes, and sampled to ensure 50/50 vegetarian/non-vegetarian recipes.

In particular, we compare:
1. a bag-of-words version, where the input data is a matrix of `recipes x n_ingredients`. Row `i` has a `1` in column `j` if recipe `i` uses ingredient `j`, and a `0` otherwise.
2. a DistilBERT version, where the input for each recipe is `bert_input`, a string containing the recipe title, description and ingredient list.
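
To make the bag-of-words encoding concrete, here is a minimal sketch on three made-up recipes. The names `toy_recipes`, `vocab` and `toy_bow` are hypothetical and are not part of the homework files; the real matrix is loaded from disk further down.

```python
# Toy recipes (hypothetical data, not taken from the homework files).
toy_recipes = [
    ["salt", "olive oil", "onion"],
    ["salt", "chicken"],
    ["onion", "garlic cloves", "olive oil"],
]

# Vocabulary: each distinct ingredient gets a column index, in order of first appearance.
vocab = {}
for ingredients in toy_recipes:
    for ingredient in ingredients:
        vocab.setdefault(ingredient, len(vocab))

# Binary matrix: entry [i][j] is 1 if recipe i uses ingredient j, else 0.
toy_bow = [[0] * len(vocab) for _ in toy_recipes]
for i, ingredients in enumerate(toy_recipes):
    for ingredient in ingredients:
        toy_bow[i][vocab[ingredient]] = 1
```

Each row records only which ingredients are present, so word order and the free-text description are discarded. That is the main contrast with the DistilBERT input.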

In [None]:
from transformers import AutoTokenizer, AutoModel
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch import optim
from torch.utils.data import TensorDataset, DataLoader
import warnings
import itertools

warnings.filterwarnings('ignore')

### BERT data

- `recipes_processed.csv` is a table where each row is a recipe. The column `bert_input` is the input to the DistilBERT model.

In [None]:
bert_df = pd.read_csv('recipes_processed.csv')

bert_df = bert_df[["bert_input", "vegetarian"]]

print(bert_df.head())
print(bert_df.shape)

                                          bert_input  vegetarian
0  Creamy Chicken &amp; Pasta. Prepare this ahead...       False
1  Vidalia Onion Dip. This recipe was my grandmot...        True
2  Veggie Lasagna. From Simple and Delicious.  I ...        True
3  Restaurant Plaisir/Sante Ratatouille. This sim...        True
4  Fresh Corn Chowder. Naturally sweet corn and c...        True
(22000, 2)


We import the DistilBERT tokenizer and model.

In [None]:
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
mybert = AutoModel.from_pretrained("distilbert-base-uncased")

### Bag-of-words data

- `recipes_bow.pkl` is the `recipes x vocab_size` matrix, where `vocab_size` is the total number of ingredients; it also carries the `vegetarian` label as an extra column
- `ingredient_dict.csv` is the lookup table for the vocabulary of ingredients

In [None]:
bow_df = pd.read_pickle('recipes_bow.pkl')
ingredient_dict = pd.read_csv('ingredient_dict.csv')

In [None]:
print(bow_df.iloc[:, range(4)].head())
print(ingredient_dict.head())
print(bow_df.shape)

   ingred_0  ingred_1  ingred_2  ingred_3
0         0         0         0         0
1         0         0         0         0
2         0         0         0         0
3         0         1         0         1
4         1         0         0         0
     ingred          vocab
0  ingred_0           salt
1  ingred_1      olive oil
2  ingred_2          onion
3  ingred_3  garlic cloves
4  ingred_4          water
(22000, 7080)


In [None]:
n_vocab = ingredient_dict.shape[0]
bow_feats = ['ingred_'+str(j) for j in range(n_vocab)]

In [None]:
xbow = np.array(bow_df[bow_feats])
ybow = np.array(bow_df['vegetarian'])

### Question 1:

(a) Turn the bag-of-words data into torch tensors.

(b) Turn the strings `bert_input` into tokens for BERT.

In [None]:
# turn bow data into torch tensors (hint: make sure they are of torch.float type)

xbow = torch.tensor(xbow, dtype = torch.float32)
ybow = torch.tensor(ybow, dtype = torch.float32)
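
The float conversion matters because the binary cross-entropy loss used later expects floating-point targets. A minimal sketch with made-up labels (the names `toy_labels` and `toy_y` are hypothetical):

```python
import numpy as np
import torch

# Hypothetical boolean labels, standing in for the `vegetarian` column.
toy_labels = np.array([True, False, True])

# Casting to float32 maps True -> 1.0 and False -> 0.0.
toy_y = torch.tensor(toy_labels, dtype=torch.float32)
```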

In [None]:
# Turn bert_df['bert_input'] into a list (hint: pandas has an easy function to turn a column into a list)

bert_list = bert_df['bert_input'].tolist()

In [None]:
# turn bert_list into tokens using tokenizer with max_length=180, padding='max_length', truncation=True
bert_tokens = tokenizer(bert_list, max_length=180, padding='max_length', truncation=True)

In [None]:
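# get both the input_ids and the attention mask from bert_tokens as torch tensors
# (the tokenizer returns Python lists unless return_tensors='pt' is passed)
bert_ids = torch.as_tensor(bert_tokens['input_ids'])
bert_mask = torch.as_tensor(bert_tokens['attention_mask'])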
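
To see what `padding='max_length'` and `truncation=True` do to a sequence, here is a toy sketch that does not call the real tokenizer. The helper `pad_and_mask` and the token ids are illustrative assumptions, and the sketch ignores how the real tokenizer treats special tokens when truncating.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Truncate or right-pad token_ids to max_length; return (ids, attention_mask)."""
    kept = token_ids[:max_length]               # truncation: drop anything past max_length
    n_pad = max_length - len(kept)
    ids = kept + [pad_id] * n_pad               # padding: fill up to max_length
    mask = [1] * len(kept) + [0] * n_pad        # 1 = real token, 0 = padding
    return ids, mask

# A short sequence is padded; a long one is cut off.
short_ids, short_mask = pad_and_mask([101, 7592, 102], max_length=5)
long_ids, long_mask = pad_and_mask([101, 1, 2, 3, 4, 5, 102], max_length=5)
```

The attention mask is what lets the model ignore the padded positions, which is why it is passed alongside the input ids.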

In [None]:
# turn bert_df['vegetarian'] into a torch tensor.
ybert = torch.tensor(bert_df['vegetarian'].to_numpy(), dtype = torch.float32)

### Question 2:

Set up a torch dataset called `BERTDataset`. The purpose of this dataset is to have an easy way to access:
- input_ids
- attention_mask
- ybert

In [None]:
class BERTDataset(torch.utils.data.Dataset):
    def __init__(self, input_ids, attention_mask, label):

        # store everything as tensors; as_tensor accepts lists or existing tensors
        self.input_ids = torch.as_tensor(input_ids, dtype = torch.long)
        self.attention_mask = torch.as_tensor(attention_mask, dtype = torch.long)
        self.label = torch.as_tensor(label, dtype = torch.float32)

    def __len__(self):
        return len(self.label)

    def __getitem__(self, idx):

        # return the idx-th example as (input_ids, attention_mask, label)
        input_ids = self.input_ids[idx]
        attention_mask = self.attention_mask[idx]
        label = self.label[idx]

        return input_ids, attention_mask, label
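
A dataset of this shape can be checked quickly by wrapping it in a `DataLoader` and pulling one batch. The sketch below uses a stand-in class `ToyTripleDataset` and random tensors (all hypothetical) so that it runs without the recipe data.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyTripleDataset(Dataset):
    """Stand-in dataset that returns (input_ids, attention_mask, label) triples."""
    def __init__(self, input_ids, attention_mask, label):
        self.input_ids = input_ids
        self.attention_mask = attention_mask
        self.label = label

    def __len__(self):
        return len(self.label)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.attention_mask[idx], self.label[idx]

# Six fake examples with sequence length 8.
toy_ids = torch.randint(0, 1000, (6, 8))
toy_mask = torch.ones(6, 8, dtype=torch.long)
toy_y = torch.tensor([0., 1., 1., 0., 1., 0.])

toy_loader = DataLoader(ToyTripleDataset(toy_ids, toy_mask, toy_y), batch_size=4)

# The default collate function stacks each element of the triple into a batch tensor.
ids_batch, mask_batch, y_batch = next(iter(toy_loader))
```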

### Question 3:

Split both BOW data and BERT data into training and test datasets. You should use the same indices for both datasets (i.e. training examples in BOW are the same training examples in BERT).

In [None]:
n = len(ybert)
split = np.random.binomial(1, p=0.9, size=n) # 1 if training, 0 if test
# integer positions of the training and test rows, shared by the BOW and BERT data
train_ind = torch.as_tensor(np.flatnonzero(split == 1))
test_ind = torch.as_tensor(np.flatnonzero(split == 0))

In [None]:
# index the tensors directly so that each split is itself a tensor
xbow_train = xbow[train_ind]
xbow_test = xbow[test_ind]

ybow_train = ybow[train_ind]
ybow_test = ybow[test_ind]

In [None]:
# bert input ids
bert_ids_train = torch.as_tensor(bert_ids)[train_ind]
bert_ids_test = torch.as_tensor(bert_ids)[test_ind]

# bert attention mask
bert_mask_train = torch.as_tensor(bert_mask)[train_ind]
bert_mask_test = torch.as_tensor(bert_mask)[test_ind]

# bert label
ybert_train = ybert[train_ind]
ybert_test = ybert[test_ind]
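
The point of sharing indices is that row `i` of one representation stays paired with row `i` of the other after the split. A small sketch with plain Python lists (the names `features_a`, `features_b` and the seed are hypothetical):

```python
import random

rng = random.Random(0)  # fixed seed so the split is reproducible
n_items = 10

# Draw one train/test flag per item, roughly 90% training.
is_train = [rng.random() < 0.9 for _ in range(n_items)]
train_idx = [i for i, flag in enumerate(is_train) if flag]
test_idx = [i for i, flag in enumerate(is_train) if not flag]

# Two aligned "views" of the same items, standing in for the BOW and BERT inputs.
features_a = [f"bow_{i}" for i in range(n_items)]
features_b = [f"bert_{i}" for i in range(n_items)]

# Applying the same index list to both keeps the rows aligned.
a_train = [features_a[i] for i in train_idx]
b_train = [features_b[i] for i in train_idx]
```

Drawing the indices once and reusing them is what makes the later accuracy comparison fair, since both models are then evaluated on the same held-out recipes.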

### Question 4:

(a) Create a neural network with 2 hidden layers for the BOW data to predict vegetarian/non-vegetarian.

(b) Create a neural network which uses a single layer on top of the distilBERT model to predict vegetarian/non-vegetarian.

In [None]:
## neural network model
class BOWNet(nn.Module):
    def __init__(self, input_dim, hidden_dims):
        super(BOWNet, self).__init__()

        # hidden_dims is a 2-dim list: hidden_dims[0], hidden_dims[1]
        # initialize a neural network with linear, ReLU, linear, dropout(p=0.5), linear, ReLU, linear
        # what should the final function be after the last linear layer? (hint: y is {0, 1})

        self.layers = nn.Sequential(
            nn.Linear(input_dim, hidden_dims[0]),
            nn.ReLU(),
            nn.Linear(hidden_dims[0], hidden_dims[1]),
            nn.Dropout(p=0.5),
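            # keep the hidden width here: a 1-unit layer before the ReLU would bottleneck the network
            nn.Linear(hidden_dims[1], hidden_dims[1]),
            nn.ReLU(),
            nn.Linear(hidden_dims[1], 1),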
            nn.Sigmoid()
        )

    def forward(self, x):

        y_pred = self.layers(x).flatten()
        return y_pred

    def loss(self, y_pred, y):

        # y is binary {0, 1}. what should the loss function be?
        loss_fn = nn.BCELoss()
        loss = loss_fn(y_pred, y)

        return loss
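
For reference, binary cross-entropy averages `-[y*log(p) + (1-y)*log(1-p)]` over the examples. The sketch below computes it by hand on made-up probabilities; the function `binary_cross_entropy` is a hypothetical helper and assumes every probability lies strictly between 0 and 1.

```python
import math

def binary_cross_entropy(probs, targets):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over the examples."""
    losses = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(probs, targets)
    ]
    return sum(losses) / len(losses)

# Predicting 0.5 everywhere gives log(2); confident mistakes are penalized heavily.
uninformed = binary_cross_entropy([0.5, 0.5], [1, 0])
confident_right = binary_cross_entropy([0.9, 0.1], [1, 0])
confident_wrong = binary_cross_entropy([0.1, 0.9], [1, 0])
```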


In [None]:
class BERTNet(nn.Module):
    def __init__(self):
        super(BERTNet, self).__init__()

        self.bert = mybert # this is the distilbert model from the beginning of the notebook
        self.layer1 = nn.Sequential(nn.Linear(self.bert.config.hidden_size, 1),
                                    nn.Sigmoid())

    def forward(self, input_ids, attention_mask):

        bert_outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = bert_outputs[0][:,0,:] # this should get the <CLS> token from distilBERT

        x = self.layer1(pooled_output)

        x = x.squeeze(1)

        return x

    def loss(self, y_pred, y):

        loss_fn = nn.BCELoss()
        loss = loss_fn(y_pred, y)

        return loss
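
The shape handling in `forward` can be checked without loading DistilBERT. In this sketch a random tensor stands in for the model's last hidden states; the sizes (sequence length 180, hidden size 768) are assumed to match the setup above, and all names are hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Fake hidden states with shape (batch_size, seq_len, hidden_size).
batch_size, seq_len, hidden_size = 4, 180, 768
fake_hidden_states = torch.randn(batch_size, seq_len, hidden_size)

# Position 0 of every sequence holds the classification token's vector.
cls_vectors = fake_hidden_states[:, 0, :]          # shape (batch_size, hidden_size)

# A single linear layer plus sigmoid turns each vector into one probability.
head = nn.Sequential(nn.Linear(hidden_size, 1), nn.Sigmoid())
probs = head(cls_vectors).squeeze(1)               # shape (batch_size,)
```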

### Question 5:

(a) Create torch datasets and data loaders for the BOW and BERT data.

(b) Instantiate a BERT model and a BOW model.

In [None]:
bert_train = BERTDataset(bert_ids_train, bert_mask_train, ybert_train)
bert_loader = DataLoader(bert_train, batch_size=10, shuffle=True)

# pair features with labels so the loader yields (x_batch, y_batch)
bow_train = TensorDataset(xbow_train, ybow_train)
bow_loader = DataLoader(bow_train, batch_size=100, shuffle=True)

In [None]:
bert_model = BERTNet() # BERTNet reads the global `mybert`, so it takes no arguments
bow_model = BOWNet(input_dim=n_vocab, hidden_dims=[500,100])

### Question 6:

(a) Train the BERT neural network to predict vegetarian/not vegetarian.

Note: each epoch may take a while (on Prof Moran's laptop, each epoch took around 20 mins). You can obtain a batch of data from your data loader to test your code using `next(iter(bert_loader))`.

In [None]:
# train BERT neural network

epochs = 2
lr = 1e-4
optimizer = optim.Adam(bert_model.parameters(), lr=lr)

for epoch in range(epochs):

    epoch_loss = 0
    batch_iter = 0

    for batch in bert_loader:

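        # unpack the batch in the order returned by BERTDataset.__getitem__
        input_ids, attention_mask, y_batch = batch

        # get predictions
        y_pred = bert_model(input_ids, attention_mask)

        # calculate loss
        loss = bert_model.loss(y_pred, y_batch)

        # compute gradients in a backward pass
        loss.backward()

        # take a gradient step to update parameters
        optimizer.step()

        # zero gradients in preparation for next iteration
        optimizer.zero_grad()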

        epoch_loss += loss.item()

        batch_iter += 1
        if batch_iter % 20 == 0:
            print('epoch: ', epoch, 'batch: ', batch_iter)

    print('epoch: ', epoch, 'loss:', f"{epoch_loss:.3}")


In [None]:
## save model parameters
torch.save(bert_model.state_dict(), "recipe-distilbert.pt")

(b) Train the BOW neural network to predict vegetarian or not vegetarian.

In [None]:
## train neural network to predict vegetarian

epochs = 100
lr = 1e-4
optimizer = optim.Adam(bow_model.parameters(), lr=lr)

for epoch in range(epochs):

    epoch_loss = 0

    for x_batch, y_batch in bow_loader:

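        # forward pass, loss, backward pass, parameter update, then reset gradients
        y_pred = bow_model(x_batch)
        loss = bow_model.loss(y_pred, y_batch)
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()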

        epoch_loss += loss.item()

    if epoch % 20 == 0:
        print('epoch: ', epoch, 'loss:', f"{epoch_loss:.3}")
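
Both training loops follow the same five steps: forward pass, loss, backward pass, optimizer step, gradient reset. The sketch below runs that pattern end to end on synthetic data so it can be tested in seconds; the data, the tiny model and the learning rate are all assumptions chosen for illustration, not the homework settings.

```python
import torch
import torch.nn as nn
from torch import optim
from torch.utils.data import TensorDataset, DataLoader

torch.manual_seed(0)

# Synthetic, linearly separable data (hypothetical; stands in for the recipe features).
x_toy = torch.randn(200, 5)
w_true = torch.tensor([1.5, -2.0, 0.5, 0.0, 1.0])
y_toy = (x_toy @ w_true > 0).float()

toy_loader = DataLoader(TensorDataset(x_toy, y_toy), batch_size=50, shuffle=True)
toy_model = nn.Sequential(nn.Linear(5, 1), nn.Sigmoid())
toy_loss_fn = nn.BCELoss()
toy_optimizer = optim.Adam(toy_model.parameters(), lr=0.05)

epoch_losses = []
for epoch in range(20):
    epoch_loss = 0.0
    for x_batch, y_batch in toy_loader:
        y_pred = toy_model(x_batch).flatten()   # forward pass
        loss = toy_loss_fn(y_pred, y_batch)     # binary cross-entropy
        loss.backward()                         # compute gradients
        toy_optimizer.step()                    # update parameters
        toy_optimizer.zero_grad()               # reset gradients for the next batch
        epoch_loss += loss.item()
    epoch_losses.append(epoch_loss)
```

Running a loop like this on a small synthetic problem first is a cheap way to catch shape or dtype errors before committing to a 20-minute BERT epoch.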


In [None]:
## save model parameters
torch.save(bow_model.state_dict(), "recipe-bow.pt")

### Question 7

(a) Calculate test accuracy for the BERT model. Note: this is done as a for loop as it requires too much memory to input all test data at once into the BERT model.

In [None]:
## calculate testing error
## takes about 1.5 minutes to run

bert_model.eval()
bert_acc = 0

n_test = len(ybert_test)

for i in range(n_test):

    # get predictions from bert_ids_test, bert_mask_test (note: bert_model expects dimensions batch_size x max_length)
    # that is, make sure inputs are 2-dim torch tensors (1 x max_length)
    with torch.no_grad(): # no gradients are needed at test time
        pred = bert_model(bert_ids_test[i].unsqueeze(0), bert_mask_test[i].unsqueeze(0))

    # turn probability into {0, 1}
    pred = (pred > 0.5).float()

    bert_acc = bert_acc + ((pred == ybert_test[i]).sum() / len(pred)).item()

bert_acc = bert_acc / n_test

(b) Calculate test accuracy for the BOW model.

In [None]:
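bow_model.eval() # switch off dropout for evaluation
with torch.no_grad():
    ybow_pred = bow_model(xbow_test)
# threshold the probabilities at 0.5 and compare with the true labels
bow_acc = ((ybow_pred > 0.5).float() == ybow_test).float().mean().item()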

In [None]:
print("BERT accuracy:", bert_acc, 'BOW accuracy:', bow_acc)
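
Accuracy here means thresholding each predicted probability at 0.5 and counting the fraction of matches with the true labels. A worked example on four made-up predictions (all values hypothetical):

```python
import torch

toy_probs = torch.tensor([0.9, 0.2, 0.6, 0.4])   # hypothetical predicted probabilities
toy_truth = torch.tensor([1.0, 0.0, 0.0, 0.0])   # hypothetical true labels

# Probabilities above 0.5 become class 1, the rest class 0.
toy_preds = (toy_probs > 0.5).float()

# Three of the four predictions match, so the accuracy is 0.75.
toy_acc = (toy_preds == toy_truth).float().mean().item()
```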

### Question 8

How do the BERT and BOW accuracies compare? Comment on the differences between the BERT and BOW models.

[Answer]:

Just for fun: input your own recipe ingredients and see what the BERT model predicts.

In [None]:
my_recipe_collection = [
"The ingredients are: tofu, bread, arugula, nutritional yeast",
"Recipe uses: beef, milk, eggs, ketchup, worcestershire sauce"
]

for recipe in my_recipe_collection:

    recipe_tokens = tokenizer(recipe, return_tensors='pt',
                                max_length=180, padding='max_length', truncation=True)


    my_pred = bert_model(recipe_tokens['input_ids'], recipe_tokens['attention_mask'])
    my_pred = my_pred.detach().numpy()

    print(my_pred)
