## Final Project Day 3: Use LSTM or fine-tune BERT for the Product Safety Dataset

We continue to work with the final project dataset. This time you can either train an [LSTM](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html) (it is used much like an RNN) or fine-tune a BERT model. Keep in mind that fine-tuning BERT takes much longer to train. As before, you will predict the __human_tag__ field of the dataset.

Use the notebooks from class to implement the model, then train and evaluate it with the corresponding datasets.
You can follow these steps:
1. Read training-test data (Given)
2. Train a classifier (Implement)
3. Make predictions on your test dataset (Implement)
4. Write your test predictions to a CSV file (Given)

In [None]:
# Upgrade dependencies
!pip install -r ../../requirements.txt

In [1]:
import boto3
import os
from os import path
import pandas as pd

import numpy as np
import re, time

from collections import Counter

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score

import torch, torchtext
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab
from torch.utils.data import TensorDataset, DataLoader
from torchtext.vocab import GloVe
from torch import nn, optim

## 1. Reading the dataset

We will use the __pandas__ library to read our dataset. Let's first load the files.

#### __Training data:__

In [2]:
train_df = pd.read_csv('../../data/final_project/training.csv', encoding='utf-8', header=0)
train_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title,human_tag
0,47490,15808037321,"I ordered a sample of the Dietspotlight Burn, ...",6/25/2018 17:51,1,DO NOT BUY!,0
1,16127,16042300811,This coffee tasts terrible as if it got burnt ...,2/8/2018 15:59,2,Coffee not good,0
2,51499,16246716471,I've been buying lightly salted Planters cashe...,3/22/2018 17:53,2,"Poor Quality - Burnt, Shriveled Nuts With Blac...",0
3,36725,14460351031,This product is great in so many ways. It goes...,12/7/2017 8:49,4,"Very lovey product, good sunscreen, but strong...",0
4,49041,15509997211,"My skin did not agree with this product, it wo...",3/21/2018 13:51,1,Not for everyone. Reactions can be harsh.,1


#### __Test data:__

In [3]:
test_df = pd.read_csv('../../data/final_project/test.csv', encoding='utf-8', header=0)
test_df.head()

Unnamed: 0,ID,doc_id,text,date,star_rating,title
0,62199,15449606311,"Quality of material is great, however, the bac...",3/7/2018 19:47,3,great backpack with strange fit
1,76123,15307152511,The product was okay but wasn't refined campho...,43135.875,2,Not refined
2,78742,12762748321,I normally read the reviews before buying some...,42997.37708,1,"Doesnt work, wouldnt recommend"
3,64010,15936405041,These pads are completely worthless. The light...,43313.25417,1,The lighter colored side of the pads smells li...
4,17058,13596875291,The saw works great but the blade oiler does n...,12/5/2017 20:17,2,The saw works great but the blade oiler does n...


In [4]:
train_df["human_tag"].value_counts()

0    53375
1     9759
Name: human_tag, dtype: int64
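
The counts above show that the two classes are far from balanced, which matters when we later judge the model by accuracy. Here is a minimal sketch that turns those counts into a few reference numbers. The variable names are ours, and the negative-to-positive ratio is only one common heuristic for weighting the positive class, not something this notebook prescribes.

```python
# Class counts copied from the value_counts() output above.
counts = {0: 53375, 1: 9759}

total = sum(counts.values())

# Share of reviews tagged as 1.
positive_rate = counts[1] / total

# Accuracy of a trivial model that always predicts the majority class (0).
majority_baseline = counts[0] / total

# Negatives per positive; sometimes used to up-weight the positive class.
neg_to_pos = counts[0] / counts[1]

print(f"positive rate:     {positive_rate:.3f}")
print(f"majority baseline: {majority_baseline:.3f}")
print(f"neg/pos ratio:     {neg_to_pos:.2f}")
```

Any accuracy we report should be read against the majority baseline, since a model that never predicts class 1 already gets roughly that score.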

In [5]:
print(train_df.isna().sum())

ID             0
doc_id         0
text           6
date           0
star_rating    0
title          1
human_tag      0
dtype: int64


In [6]:
train_df['text'] = train_df['text'].fillna("missing")

## 2. Train a classifier

In [7]:
# Implement this
# Hold out 10% of the training data as a validation set
train_text, val_text, train_label, val_label = train_test_split(
    train_df["text"].tolist(),
    train_df["human_tag"].tolist(),
    test_size=0.10,
    shuffle=True,
    random_state=324,
)

In [8]:
tokenizer = get_tokenizer("basic_english")
counter = Counter()
for line in train_text:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=2)  # min_freq=2 drops tokens seen only once (often misspellings)

print(vocab.itos[0:25])

['<unk>', '<pad>', '.', 'the', 'i', ',', 'it', 'and', 'to', 'a', "'", 'of', 'this', 'is', 'my', 'for', 'in', 'that', 'not', 'on', 'but', 'was', 'you', 't', 'with']


In [9]:
# Let's create a mapper to transform our text data
text_transform_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]

In [10]:
print(f"Before transform:\t{train_text[37]}")
print(f"After transform:\t{text_transform_pipeline(train_text[37])}")

Before transform:	Horrible. Other reviewers have said it worked on their sensitive skin - this burned intensely immediately when I put a tiny drop on just to test it out. I washed it out thoroughly immediately but still got a rash. I wish I could return this!
After transform:	[408, 2, 90, 1356, 25, 278, 6, 197, 19, 185, 359, 86, 100, 12, 54, 6745, 411, 40, 4, 123, 9, 715, 1050, 19, 46, 8, 554, 6, 38, 2, 4, 762, 6, 38, 2229, 411, 20, 106, 99, 9, 1034, 2, 4, 338, 4, 114, 205, 12, 26]
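
To make the text-to-index mapping above less of a black box, here is a simplified, standard-library stand-in for what the vocabulary and the transform pipeline do. It is a sketch, not torchtext's implementation: `build_vocab` and `encode` are hypothetical helpers, and a plain lowercase whitespace split replaces the `basic_english` tokenizer.

```python
from collections import Counter


def build_vocab(texts, min_freq=2, specials=("<unk>", "<pad>")):
    """Map tokens seen at least min_freq times to ids, most frequent first."""
    counter = Counter(tok for text in texts for tok in text.lower().split())
    itos = list(specials) + [tok for tok, n in counter.most_common() if n >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def encode(text, stoi):
    """Replace each token by its id; unseen or rare tokens map to '<unk>'."""
    unk = stoi["<unk>"]
    return [stoi.get(tok, unk) for tok in text.lower().split()]


toy_texts = ["it burned my skin", "it smells burned", "my skin is fine"]
toy_itos, toy_stoi = build_vocab(toy_texts)

print(toy_itos)
print(encode("it burned my hand", toy_stoi))
```

Tokens that appear only once ("smells", "is", "fine") are left out of the vocabulary, and the unseen word "hand" is encoded as index 0, the `<unk>` entry.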


In [11]:
def pad_features(reviews_split, seq_length):
    # Convert each review into a list of vocabulary indices
    reviews_ints = [text_transform_pipeline(review) for review in reviews_split]

    # Start from a matrix filled with 1, the index of '<pad>' in the vocabulary
    features = np.ones((len(reviews_ints), seq_length), dtype=int)

    # Keep the first seq_length tokens of each review and left-pad shorter ones
    for i, row in enumerate(reviews_ints):
        row = row[:seq_length]
        if row:  # a review with no tokens stays all padding
            features[i, -len(row):] = row

    return torch.tensor(features, dtype=torch.int64)
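
The padding rule is easier to see on a single sequence. Below is a small pure-Python sketch of the same idea: keep the first `seq_length` ids and fill the left side with the `<pad>` index. `left_pad` is a hypothetical helper for illustration only; it does not replace `pad_features`.

```python
PAD_ID = 1  # index of '<pad>' in the vocabulary printed earlier


def left_pad(ids, seq_length, pad_id=PAD_ID):
    """Keep the first seq_length ids and left-pad shorter sequences."""
    kept = ids[:seq_length]
    return [pad_id] * (seq_length - len(kept)) + kept


print(left_pad([408, 2, 90], 5))                     # shorter than seq_length
print(left_pad([4, 123, 9, 715, 1050, 19, 46], 5))   # longer than seq_length
print(left_pad([], 5))                               # review with no tokens
```

Left-padding places the real tokens at the end of the sequence, which suits a model that reads its prediction from the last time step.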

In [12]:
for text in train_text[9:11]:
    print(f"Text: {text}\n")
    print(f"Original length of the text (characters): {len(text)}\n")
    tt = pad_features([text], seq_length=50)
    print(f"Transformed text: \n{tt}\n")
    print(f"Shape of transformed text: {tt.shape}\n")

Text: I was looking for an electric smoker in hopes that I could smoke during the winter, when temperatures were in the single digits. Mainly for maintaining the temperatures during smoking and cooking. Well, it does, it worked great in single digit temperatures. I was also glad that I chose this model because of the insulated door and walls. The units with glass doors were questionable to me regarding maintenance of temperatures. The only down side I have found with this unit and the reason for not giving it a five star is this. I found that with ambient temperatures in the 80s or higher the units heater doesn't turn on enough to light the wood chips and create smoke. What I have had to do is half latch the door, so the door is slightly open and allows the heat out and the heating elements will remain on to try and raise the internal temp to what you have set, causing the heating elements to heat up enough to get the wood chips burning, once they are lit I close the door. I am very ha

In [13]:
max_len = 50
batch_size = 64

# Pass transformed and padded data to dataset
# Create data loaders
train_dataset = TensorDataset(
    pad_features(train_text, max_len), torch.tensor(train_label, dtype=torch.float32)
)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

val_dataset = TensorDataset(pad_features(val_text, max_len), torch.tensor(val_label, dtype=torch.float32))
val_loader = DataLoader(val_dataset, batch_size=batch_size)

In [14]:
glove = GloVe(name="6B", dim=300)
embedding_matrix = glove.get_vecs_by_tokens(vocab.itos)
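
The call above builds one embedding row per vocabulary entry, in vocabulary order. As far as we know, torchtext falls back to an all-zero vector for tokens that are missing from the pretrained set unless `unk_init` is changed; treat that as an assumption and check it against your torchtext version. The sketch below mimics the lookup with a plain dictionary; `build_embedding_matrix` and the toy vectors are made up for illustration.

```python
def build_embedding_matrix(itos, pretrained, dim):
    """Return one row per vocabulary entry; unknown tokens get a zero row."""
    zero = [0.0] * dim
    return [list(pretrained.get(tok, zero)) for tok in itos]


# Toy 3-dimensional "pretrained" vectors (not real GloVe values).
toy_vectors = {"rash": [0.1, 0.2, 0.3], "skin": [0.4, 0.5, 0.6]}
toy_itos = ["<unk>", "<pad>", "skin", "rash", "tasts"]

matrix = build_embedding_matrix(toy_itos, toy_vectors, dim=3)
covered = sum(tok in toy_vectors for tok in toy_itos)

print(f"{covered} of {len(toy_itos)} tokens have a pretrained vector")
for tok, row in zip(toy_itos, matrix):
    print(tok, row)
```

Counting covered tokens this way is a quick check of how much of the vocabulary actually benefits from the pretrained embeddings.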

In [15]:
# Size of the state vectors
hidden_size = 128

# General NN training parameters
learning_rate = 0.0001
epochs = 20

# Embedding vector and vocabulary sizes
embed_size = 300  # glove.6B.300d.txt
vocab_size = len(vocab.itos)

In [16]:
class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size, padding_idx=1)
        self.lstm = nn.LSTM(
            embed_size, hidden_size, num_layers=num_layers, batch_first=True
        )

        self.linear = nn.Linear(hidden_size, 1)  # <==============
        #self.linear = nn.Linear(hidden_size, 5)  # <==============
        self.act = nn.Sigmoid()
        #self.act = nn.Softmax(dim=1)

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # Call the LSTM layer
        outputs, _ = self.lstm(embeddings)
        
        # Output shape after the LSTM: (batch_size, max_len, hidden_size)
        # outputs[:, -1, :] below selects the output of the last time step,
        # so its shape is (batch_size, hidden_size)
        # Send it to the linear layer
        outs = self.linear(outputs[:, -1, :])
        return self.act(outs)
    
# Initialize the weights
def init_weights(m):
    if isinstance(m, nn.Linear):
        nn.init.xavier_uniform_(m.weight)
    if isinstance(m, nn.LSTM):
        # Xavier-initialize the weight matrices; biases keep their defaults
        for name, param in m.named_parameters():
            if "weight" in name:
                nn.init.xavier_uniform_(param)

In [17]:
# Our architecture with 2 LSTM layers
model = Net(vocab_size, embed_size, hidden_size, num_layers=2)

# We set the embedding layer's parameters from GloVe
model.embedding.weight.data.copy_(embedding_matrix)
# We won't change/train the embedding layer
model.embedding.weight.requires_grad = False

In [18]:
model

Net(
  (embedding): Embedding(31128, 300, padding_idx=1)
  (lstm): LSTM(300, 128, num_layers=2, batch_first=True)
  (linear): Linear(in_features=128, out_features=1, bias=True)
  (act): Sigmoid()
)

In [19]:
# Setting our trainer
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# We will use Binary Cross-entropy loss
# reduction="sum" adds up the per-example losses of a batch;
# the training loop divides the running total by the number of examples
cross_ent_loss = nn.BCELoss(reduction="sum")
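
To see why summing per batch and dividing by the dataset size at the end gives a true per-example average, here is a pure-Python sketch of binary cross-entropy on two toy batches of different sizes. `bce_sum` is a hypothetical helper that follows the standard formula; it is not PyTorch's implementation, and the probabilities are made up.

```python
import math


def bce_sum(probs, targets, eps=1e-12):
    """Sum of binary cross-entropy losses over one batch."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total


# Two toy batches: (predicted probabilities, true labels). The last one is smaller.
batches = [([0.9, 0.2, 0.4], [1, 0, 0]), ([0.7, 0.1], [1, 0])]

running_sum = sum(bce_sum(p, y) for p, y in batches)
n_examples = sum(len(y) for _, y in batches)
mean_loss = running_sum / n_examples

# Averaging per-batch means instead would over-weight the smaller last batch.
mean_of_batch_means = sum(bce_sum(p, y) / len(y) for p, y in batches) / len(batches)

print(f"sum then divide:     {mean_loss:.4f}")
print(f"mean of batch means: {mean_of_batch_means:.4f}")
```

The two numbers differ whenever the batches are not all the same size, which is usually the case for the final batch of an epoch.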


In [20]:
# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("device = ", device)

model.apply(init_weights)
model.to(device)

for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop, train the network
    for data, target in train_loader:
        trainer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        L = cross_ent_loss(output, target.unsqueeze(1))
        training_loss += L.item()
        L.backward()
        trainer.step()

    # Validate the network, no training (no weight update or gradient tracking)
    with torch.no_grad():
        for data, target in val_loader:
            val_predictions = model(data.to(device))
            L = cross_ent_loss(val_predictions, target.to(device).unsqueeze(1))
            val_loss += L.item()

    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)

    end = time.time()
    print(
        f"Epoch {epoch}. Train_loss {training_loss}. Val_loss {val_loss}. Seconds {end-start}"
    )

device =  cuda
Epoch 0. Train_loss 0.4572097093024249. Val_loss 0.4321973819992457. Seconds 11.199868440628052
Epoch 1. Train_loss 0.430823131582763. Val_loss 0.4315806000074858. Seconds 11.2220299243927
Epoch 2. Train_loss 0.43015274879001725. Val_loss 0.43093157071345284. Seconds 11.332505226135254
Epoch 3. Train_loss 0.4294619638959273. Val_loss 0.43025316254111606. Seconds 11.291494607925415
Epoch 4. Train_loss 0.4287265259126088. Val_loss 0.4295192491339852. Seconds 11.277247190475464
Epoch 5. Train_loss 0.42791628057123365. Val_loss 0.4286992879921324. Seconds 11.038425207138062
Epoch 6. Train_loss 0.4269944144486961. Val_loss 0.42775529572055204. Seconds 11.061888933181763
Epoch 7. Train_loss 0.42591312674307896. Val_loss 0.4266372938340398. Seconds 10.995811462402344
Epoch 8. Train_loss 0.42460711478179963. Val_loss 0.4252764335115159. Seconds 10.991759538650513
Epoch 9. Train_loss 0.4229831214355608. Val_loss 0.42357344804098607. Seconds 11.07912015914917
Epoch 10. Train_loss 

In [21]:
val_predictions = []
for data, target in val_loader:
    val_preds = model(data.to(device))
    val_predictions.extend(
        [np.rint(val_pred)[0] for val_pred in val_preds.detach().cpu().numpy()]
    )
print(val_predictions[:10])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
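
Rounding the sigmoid output with `np.rint` acts like a cutoff at about 0.5. With an imbalanced dataset it can be worth trying other cutoffs on the validation set. The sketch below shows the idea with made-up probabilities; `apply_threshold` is a hypothetical helper, and the 0.3 cutoff is an arbitrary example rather than a recommended value.

```python
def apply_threshold(probs, threshold=0.5):
    """Turn predicted probabilities into 0/1 labels using a cutoff."""
    return [1 if p >= threshold else 0 for p in probs]


toy_probs = [0.05, 0.31, 0.48, 0.62, 0.91]

print(apply_threshold(toy_probs))        # default cutoff of 0.5
print(apply_threshold(toy_probs, 0.3))   # lower cutoff flags more reviews
```

Lowering the cutoff usually raises recall for class 1 at the cost of precision, so pick it by looking at the validation metrics, not the test set.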


In [22]:
# Evaluate the trained model's predictions on the validation dataset
print(confusion_matrix(val_label, val_predictions))
print(classification_report(val_label, val_predictions))
print("Accuracy (validation):", accuracy_score(val_label, val_predictions))

[[5223  110]
 [ 795  186]]
              precision    recall  f1-score   support

           0       0.87      0.98      0.92      5333
           1       0.63      0.19      0.29       981

    accuracy                           0.86      6314
   macro avg       0.75      0.58      0.61      6314
weighted avg       0.83      0.86      0.82      6314

Accuracy (validation): 0.856667722521381
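
The class-1 numbers in the report can be recomputed by hand from the confusion matrix, which helps when reading it. In scikit-learn's layout, rows are true labels and columns are predicted labels. This sketch only uses the four counts printed above.

```python
# Confusion matrix copied from the output above: rows = true, columns = predicted.
cm = [[5223, 110],
      [795, 186]]

tn, fp = cm[0]
fn, tp = cm[1]
n_total = tn + fp + fn + tp

precision = tp / (tp + fp)   # of the reviews flagged as 1, how many are truly 1
recall = tp / (tp + fn)      # of the true 1 reviews, how many were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / n_total

print(f"precision {precision:.2f}  recall {recall:.2f}  f1 {f1:.2f}  accuracy {accuracy:.4f}")
```

The recall for class 1 shows that most of the positive reviews are missed, even though the overall accuracy looks high.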


## 3. Make predictions on your test dataset

In [23]:
# Implement this
test_text = test_df["text"].fillna(value="missing").tolist()

test_dataset = TensorDataset(pad_features(test_text, max_len))
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [24]:
test_predictions = []
for data, in test_loader:
    test_preds = model(data.to(device))
    test_predictions.extend(
        [np.rint(test_pred)[0] for test_pred in test_preds.detach().cpu().numpy()]
    )
print(test_predictions[:10])

[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]


## 4. Write your predictions to a CSV file
You can use the following code to write your test predictions to a CSV file. Then upload your file to https://mlu.corp.amazon.com/contests/redirect/53

In [25]:
result_df = pd.DataFrame()
result_df["ID"] = test_df["ID"]
result_df["human_tag"] = test_predictions
 
#result_df.to_csv("../../data/final_project/project_day3_result.csv", encoding='utf-8', index=False)
result_df.to_csv("./project_day3_result.csv", encoding='utf-8', index=False)
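
Before uploading, it can help to sanity-check the submission: one prediction per test ID and only 0/1 labels. The sketch below does this with the standard library and an in-memory buffer. `write_submission` is a hypothetical helper, the predictions are made up, and writing integer labels assumes the grader accepts them, which the instructions above do not state.

```python
import csv
import io


def write_submission(ids, predictions, handle):
    """Write ID,human_tag rows after checking lengths and label values."""
    if len(ids) != len(predictions):
        raise ValueError("need exactly one prediction per ID")
    writer = csv.writer(handle)
    writer.writerow(["ID", "human_tag"])
    for row_id, pred in zip(ids, predictions):
        label = int(pred)  # 0.0 -> 0, 1.0 -> 1
        if label not in (0, 1):
            raise ValueError(f"unexpected label {pred!r} for ID {row_id}")
        writer.writerow([row_id, label])


# First three test IDs from the table above, with made-up predictions.
buffer = io.StringIO()
write_submission([62199, 76123, 78742], [0.0, 1.0, 0.0], buffer)
print(buffer.getvalue())
```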